
Category: What makes a good benchmark?

Planning the next version of HDXPRT

A few weeks ago, we wrote about the capabilities and benefits of HDXPRT. This week, we want to share some initial ideas for the next version of HDXPRT, and invite you to send us any comments or suggestions you may have.

The first step towards a new HDXPRT will be updating the benchmark’s workloads to increase their value in the years to come. Primarily, this will involve updating application content, such as photos and videos, to more contemporary file resolutions and sizes. We think 4K-related workloads will increase the benchmark’s relevance, but aren’t sure whether 4K playback tests are necessary. What do you think?

The next step will be to update the versions of the real-world trial applications included in the benchmark, including Adobe Photoshop Elements, Apple iTunes, Audacity, CyberLink MediaEspresso, and HandBrake. Are there any other applications you feel would be a good addition to HDXPRT’s photo editing, music editing, or video conversion test scenarios?

We’re also planning to update the UI to improve the look and feel of the benchmark and simplify navigation and functionality.

Last but not least, we’ll work to fix known problems, such as the hardware acceleration settings issue in MediaEspresso, and eliminate the need for workarounds when running HDXPRT on the Windows 10 Creators Update.

Do you have feedback on these ideas or suggestions for applications or test scenarios that we should consider for HDXPRT? Are there existing features we should remove? Are there elements of the UI that you find especially useful or would like to see improved? Please let us know. We want to hear from you and make sure that HDXPRT continues to meet your needs.

Justin

Apples and pears vs. oranges and bananas

When people talk about comparing disparate things, they often say that you’re comparing apples and oranges. However, sometimes that expression doesn’t begin to describe the situation.

Recently, Justin wrote about using CrXPRT on systems running Neverware CloudReady OS. In that post, he noted that we couldn’t guarantee that using CrXPRT on CloudReady and Chrome OS systems would be a fair comparison. Not surprisingly, that prompted the question “Why not?”

Here’s the thing: It’s a fair comparison of those software stacks running on those hardware configurations. If everyone accepted that and stopped there, all would be good. However, almost inevitably, people will read more into the scores than is appropriate.

In such a comparison, we’re changing multiple variables at once. We’ve written before about the effect of the software stack on performance. CloudReady and Chrome OS are two different implementations of the Chromium OS, and it’s possible that one is more efficient than the other. If so, that would affect CrXPRT scores. At the same time, the raw performance of the two hardware configurations under test could also differ to a certain degree, which would also affect CrXPRT scores.

Here’s a metaphor: If you measure the effective force at the end of two levers and find a difference, to what do you attribute that difference? If you know the levers are the same length, you can attribute the difference to the amount of applied force. If you know the applied force is identical, you can attribute the difference to the length of the levers. If you lack both of those data points, you can’t know whether the difference is due to the length, the force, or a combination of the two.
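To make the lever metaphor concrete, here is a toy sketch in Python. The numbers are made up purely for illustration and aren’t from any XPRT test; the point is that two different combinations of applied force and lever ratio can produce the exact same measurement:

```python
# Toy illustration of the lever metaphor: the measured output force
# depends on two variables (applied force and lever-arm ratio), so a
# difference, or lack of one, in the output alone cannot be attributed
# to either variable.

def output_force(applied_force, effort_arm, load_arm):
    """Ideal lever: output force = applied force * (effort arm / load arm)."""
    return applied_force * (effort_arm / load_arm)

# Two different combinations of force and lever length...
lever_a = output_force(applied_force=10.0, effort_arm=2.0, load_arm=1.0)
lever_b = output_force(applied_force=5.0, effort_arm=4.0, load_arm=1.0)

# ...produce identical measurements (both 20.0), so the measurement
# alone cannot tell us which variable (or both) differed.
print(lever_a, lever_b)
```

Just as with a benchmark score from two systems that differ in both hardware and software, the single measured number is real and accurate, but it can’t by itself tell you which underlying variable caused (or masked) a difference.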

With a benchmark, you can run multiple experiments designed to isolate variables and use the results from those experiments to look for trends. For example, we could install both CloudReady OS and Chrome OS on the same Intel-based Chromebook and compare the CrXPRT results. Because that removes hardware differences as a variable, such an experiment would offer some insight into how the two implementations compare. However, because differences in hardware can affect the performance of a given piece of software, this single data point would be of limited value. We could repeat the experiment on a variety of other Intel-based Chromebooks, and other patterns might emerge. If one of the implementations consistently scored higher, that would suggest that it was more efficient than the other, but would still not be conclusive.

I hope this gives you some idea about why we are cautious about drawing conclusions when comparing results from different sets of hardware running different software stacks.

Eric

BatteryXPRT: A quick and reliable way to estimate Android battery life

In the last few weeks, we reintroduced readers to the capabilities and benefits of TouchXPRT and CrXPRT. This week, we’d like to reintroduce BatteryXPRT 2014 for Android, an app that evaluates the battery life and performance of Android devices.

When purchasing a phone or tablet, it’s good to know how long the battery will last on a typical day and how often you’ll need to charge it. Before BatteryXPRT, you had to rely on a manufacturer’s estimate or full rundown tests that perform tasks that don’t resemble the types of things we do with our phones and tablets every day.

We developed BatteryXPRT to estimate battery life reliably in just over five hours, so testers can complete a full evaluation in one work day or while sleeping. You can configure it to run while the device is connected to a network or in Airplane mode. The test also produces a performance score by running workloads that represent common everyday tasks.

BatteryXPRT is easy to install and run, and is a great resource for anyone who wants to evaluate how well an Android device will meet their needs. If you’d like to see test results from a variety of Android devices, go to BatteryXPRT.com and click View Results.

If you’d like to run BatteryXPRT:

Simply download BatteryXPRT from the Google Play store or BatteryXPRT.com. The BatteryXPRT installation instructions and user manual provide step-by-step instructions for how to configure your device and kick off a test. We designed BatteryXPRT 2014 for Android to be compatible with a wide variety of Android devices, but because there are so many devices on the market, it is inevitable that users occasionally run into problems. In the Tips, tricks, and known issues document, we provide troubleshooting suggestions for issues we encountered during development testing.

If you’d like to learn more:

We offer a full online BatteryXPRT training course that covers almost every aspect of the benchmark. You can view the sections in order or jump to the parts that interest you. We guarantee that you’ll learn something new!

BatteryXPRT 2014 for Android Training Course

If you’d like to dig into the details:

Check out the Exploring BatteryXPRT 2014 for Android white paper. In it, we discuss the app’s development and structure. We also describe the component tests; explain the differences between the test’s Airplane, Wi-Fi, and Cellular modes; and detail the statistical processes we use to calculate expected battery life.

If you’d like to dig even deeper, the BatteryXPRT source code is available to members of the BenchmarkXPRT Development Community, so consider joining today. Membership is free for members of any company or organization with an interest in benchmarks, and there are no obligations after joining.

If you haven’t used BatteryXPRT before, try it out and let us know what you think!

Justin

Learning about machine learning

Everywhere we look, machine learning is in the news. It’s driving cars and beating the world’s best Go players. Whether we are aware of it or not, it’s in our lives, understanding our voices and identifying our pictures.

Our goal of being able to measure the performance of hardware and software that does machine learning seems more relevant than ever. Our challenge is to scan the vast landscape that is machine learning, and identify which elements to measure first.

There is a natural temptation to see machine learning as being all about neural networks such as AlexNet and GoogLeNet. However, innovations appear all the time, and lots of important work with more classic machine learning techniques is also underway. (Classic machine learning being anything more than a few years old!) Recurrent neural networks used for language translation, reinforcement learning used in robotics, and support vector machine (SVM) learning used in text recognition are just a few examples among the wide array of algorithms to consider.
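As a reminder of how lightweight a “classic” technique can be, here is a minimal, self-contained Python sketch of a perceptron, a linear classifier in the same family as SVMs. The two-feature data set is invented for illustration and has nothing to do with any planned XPRT workload:

```python
# Minimal sketch of a classic machine-learning technique: a perceptron,
# a simple linear classifier related to support vector machines.
# The training data below is made up purely for illustration.

def train_perceptron(samples, labels, epochs=10, lr=1.0):
    """Learn weights w and bias b such that sign(w.x + b) matches each label."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):  # y is +1 or -1
            # Update only when the current model misclassifies the sample.
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def predict(w, b, x):
    """Classify a sample as +1 or -1 using the learned linear boundary."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Tiny, linearly separable toy data set: two features, two classes.
samples = [(2.0, 1.0), (3.0, 2.0), (-1.0, -1.0), (-2.0, -3.0)]
labels = [1, 1, -1, -1]
w, b = train_perceptron(samples, labels)
print([predict(w, b, x) for x in samples])  # recovers the training labels
```

Techniques like this run comfortably on modest local hardware, which is part of why classic machine learning remains relevant alongside deep neural networks.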

Creating a benchmark or set of benchmarks to cover all those areas, however, is unlikely to be possible. Certainly, creating such an ambitious tool would take so long that it would be of limited usefulness.

Our current thinking is to begin with a small set of representative algorithms. The challenge, of course, is identifying them. That’s where you come in. What would you like to start with?

We anticipate that the benchmark will focus on the types of inference learning and light training that are likely to occur on edge devices. Extensive training with large datasets takes place in data centers or on systems with extraordinary computing capabilities. We’re interested in use cases that will stress the local processing power of everyday devices.

We are, of course, reaching out to folks in the machine learning field—including those in academia, those who create the underlying hardware and software, and those who make the products that rely on that hardware and software.

What do you think?

Bill

Evolve or die

Last week, Google announced that it would retire its Octane benchmark. Their announcement explains that they designed Octane to spur improvement in JavaScript performance, and while it did just that when it was first released, those improvements have plateaued in recent years. They also note that there are some operations in Octane that optimize Octane scores but do not reflect real-world scenarios. That’s unfortunate, because they, like most of us, want improvements in benchmark scores to mean improvements in end-user experience.

WebXPRT comes at the web performance issue differently. While Octane’s goal was to improve JavaScript performance, the purpose of WebXPRT is to measure performance from the end user’s perspective. By doing the types of work real people do, WebXPRT doesn’t measure only improvements in JavaScript performance; it also measures the quality of the real-world user experience. WebXPRT’s results also reflect the performance of the entire device and software stack, not just the performance of the JavaScript interpreter.

Google’s announcement reminds us that benchmarks have finite life spans and must constantly evolve to keep pace with changes in technology, or they will become useless. To make sure the XPRT benchmarks do just that, we are always looking at how people use their devices and developing workloads that reflect their actions. This is a core element of the XPRT philosophy.

As we mentioned last week, we’ve been working on the next version of WebXPRT. If you have any thoughts about how it should evolve, let us know!

Eric

Thinking ahead to WebXPRT 2017

A few months ago, Bill discussed our intention to update WebXPRT this year. Today, we want to share some initial ideas for WebXPRT 2017 and ask for your input.

Updates to the workloads provide an opportunity to increase the relevance and value of WebXPRT in the years to come. Here are a few of the ideas we’re considering:

  • For the Photo Enhancement workload, we can increase the data sizes of pictures. We can also experiment with additional types of photo enhancement such as background/foreground subtraction, collage creation, or panoramic/360-degree image viewing.
  • For the Organize Album workload, we can explore machine learning workloads by incorporating open source JavaScript libraries into web-based inferencing tests.
  • For the Local Notes workload, we’re investigating the possibility of leveraging natural-brain libraries for language processing functions.
  • For a new workload, we’re investigating the possibility of using online 3D modeling applications such as Tinkercad.

For the UI, we’re considering improvements to features like the in-test progress bars and individual subtest selection. We’re also planning to update the UI to make it visually distinct from older versions.

Throughout this process, we want to be careful to maintain the features that have made WebXPRT our most popular tool, with more than 141,000 runs to date. We’re committed to making sure that it runs quickly and simply in most browsers and produces results that are useful for comparing web browsing performance across a wide variety of devices.

Do you have feedback on these ideas or suggestions for browser technologies or test scenarios that we should consider for WebXPRT 2017? Are there existing features we should ditch? Are there elements of the UI that you find especially useful or would like to see improved? Please let us know. We want to hear from you and make sure that we’re crafting a performance tool that continues to meet your needs.

Justin
