
Category: Benchmark metrics

Apples and pears vs. oranges and bananas

When people talk about comparing disparate things, they often say that you’re comparing apples and oranges. However, sometimes that expression doesn’t begin to describe the situation.

Recently, Justin wrote about using CrXPRT on systems running Neverware CloudReady OS. In that post, he noted that we couldn’t guarantee that using CrXPRT on CloudReady and Chrome OS systems would be a fair comparison. Not surprisingly, that prompted the question “Why not?”

Here’s the thing: It’s a fair comparison of those software stacks running on those hardware configurations. If everyone accepted that and stopped there, all would be good. However, almost inevitably, people will read more into the scores than is appropriate.

In such a comparison, we’re changing multiple variables at once. We’ve written before about the effect of the software stack on performance. CloudReady and Chrome OS are two different implementations of the Chromium OS, and it’s possible that one is more efficient than the other. If so, that would affect CrXPRT scores. At the same time, the raw performance of the two hardware configurations under test could also differ, and that difference would show up in the scores as well.

Here’s a metaphor: If you measure the effective force at the end of two levers and find a difference, to what do you attribute that difference? If you know the levers are the same length, you can attribute the difference to the amount of applied force. If you know the applied force is identical, you can attribute the difference to the length of the levers. If you lack both of those data points, you can’t know whether the difference is due to the length, the force, or a combination of the two.
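The lever metaphor can be made concrete with a little arithmetic. Here’s a minimal sketch in Python; the forces and arm lengths are invented purely for illustration:

```python
def output_force(applied_force_n, effort_arm_m, load_arm_m):
    """Ideal lever: the output force scales with the ratio of the arm lengths."""
    return applied_force_n * (effort_arm_m / load_arm_m)

# Two levers produce different forces at their ends...
lever_a = output_force(applied_force_n=10, effort_arm_m=0.9, load_arm_m=0.3)  # 30.0 N
lever_b = output_force(applied_force_n=12, effort_arm_m=1.0, load_arm_m=0.5)  # 24.0 N

# ...but if you know neither the applied forces nor the arm lengths,
# you can't say whether the 6 N gap comes from the force, the geometry, or both.
print(lever_a, lever_b)
```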

With a benchmark, you can run multiple experiments designed to isolate variables and use the results from those experiments to look for trends. For example, we could install both CloudReady OS and Chrome OS on the same Intel-based Chromebook and compare the CrXPRT results. Because that removes hardware differences as a variable, such an experiment would offer some insight into how the two implementations compare. However, because differences in hardware can affect the performance of a given piece of software, this single data point would be of limited value. We could repeat the experiment on a variety of other Intel-based Chromebooks and look for patterns, as in the sketch below. If one of the implementations consistently scored higher, that would suggest it was more efficient than the other, but it still would not be conclusive.
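As a rough illustration of what looking for trends across several devices might involve, here is a small sketch with hypothetical device names and scores; none of these numbers are real CrXPRT results:

```python
# Hypothetical CrXPRT-style scores for the same devices running two OS builds.
# All names and numbers are invented for illustration only.
scores = {
    "device_1": {"os_a": 105, "os_b": 112},
    "device_2": {"os_a": 98,  "os_b": 103},
    "device_3": {"os_a": 120, "os_b": 119},
}

wins_b = sum(1 for s in scores.values() if s["os_b"] > s["os_a"])
ratios = [s["os_b"] / s["os_a"] for s in scores.values()]
avg_ratio = sum(ratios) / len(ratios)

print(f"OS B scored higher on {wins_b} of {len(scores)} devices "
      f"(average score ratio {avg_ratio:.2f}).")
# A consistent pattern across many devices would suggest, but not prove,
# that one implementation is more efficient than the other.
```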

I hope this gives you some idea about why we are cautious about drawing conclusions when comparing results from different sets of hardware running different software stacks.

Eric

Learning something new every day

We’re constantly learning and thinking about how the XPRTs can help people evaluate the tech that will soon be a part of daily life. It’s why we started work on a tool to evaluate machine learning capabilities, and it’s why we developed CrXPRT in response to Chromebooks’ growing share of the education sector.

The learning process often involves a lot of tinkering in the lab, and we recently began experimenting with Neverware’s CloudReady OS. CloudReady is an operating system based on the open-source Chromium OS. Unlike Chrome OS, which runs only on Chromebooks, CloudReady can run on many types of systems, including older Windows and OS X machines. The idea is that individuals and organizations can breathe new life into aging hardware by incorporating it into a larger pool of devices managed through a Google Admin Console.

We were curious to see whether it worked as advertised and whether it would run CrXPRT 2015. Installing CloudReady on an old Dell Latitude E6430 was easy enough, and we then installed CrXPRT from the Chrome Web Store. Performance tests ran without a hitch. Battery life tests would kick off but not complete, which was not a big surprise, because the battery-related calls the test relies on were developed specifically for Chrome OS.

So, what role can CrXPRT play with CloudReady, and what are the limitations? CloudReady has a lot in common with Chrome OS, but there are some key differences. One way we see the CrXPRT performance test being useful is for comparing CloudReady devices. Say an organization is considering adopting CloudReady on some of its legacy systems but not others; CrXPRT performance scores would provide insight into which devices perform best with CloudReady. While you could use CrXPRT to compare those devices to Chromebooks, the differences between the operating systems are significant enough that we cannot guarantee the comparison would be a fair one.

Have you spent any time working with CloudReady, or are there other interesting new technologies you’d like us to investigate? Let us know!

Justin

BatteryXPRT: A quick and reliable way to estimate Android battery life

In the last few weeks, we reintroduced readers to the capabilities and benefits of TouchXPRT and CrXPRT. This week, we’d like to reintroduce BatteryXPRT 2014 for Android, an app that evaluates the battery life and performance of Android devices.

When purchasing a phone or tablet, it’s good to know how long the battery will last on a typical day and how often you’ll need to charge it. Before BatteryXPRT, you had to rely on a manufacturer’s estimate or on full rundown tests whose tasks don’t resemble the things we actually do with our phones and tablets every day.

We developed BatteryXPRT to estimate battery life reliably in just over five hours, so testers can complete a full evaluation in a single work day or overnight. You can configure it to run while the device is connected to a network or in Airplane mode. The test also produces a performance score by running workloads that represent common everyday tasks.
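For context, one common way to estimate battery life without a full rundown is to extrapolate from the drain observed during a fixed-length test. The sketch below shows that general idea with made-up numbers; it is a naive linear model, not BatteryXPRT’s actual method, which the Exploring BatteryXPRT white paper describes in detail:

```python
def estimate_battery_life_hours(start_pct, end_pct, test_hours):
    """Naive linear extrapolation: assume the drain rate observed during the
    test continues until the battery is empty. Real benchmarks apply more
    careful statistical treatment than this simple model."""
    drain = start_pct - end_pct            # percentage points consumed
    if drain <= 0:
        raise ValueError("No measurable drain during the test period.")
    drain_rate_per_hour = drain / test_hours
    return 100.0 / drain_rate_per_hour

# Example: the battery falls from 100% to 28% over a 5.25-hour test run.
print(f"{estimate_battery_life_hours(100, 28, 5.25):.1f} hours")  # ~7.3 hours
```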

BatteryXPRT is easy to install and run, and it’s a great resource for anyone who wants to evaluate how well an Android device will meet their needs. If you’d like to see test results, go to BatteryXPRT.com and click View Results, where you’ll find scores from many different Android devices.

If you’d like to run BatteryXPRT:

Simply download BatteryXPRT from the Google Play store or BatteryXPRT.com. The BatteryXPRT installation instructions and user manual provide step-by-step instructions for how to configure your device and kick off a test. We designed BatteryXPRT 2014 for Android to be compatible with a wide variety of Android devices, but because there are so many devices on the market, it is inevitable that users occasionally run into problems. In the Tips, tricks, and known issues document, we provide troubleshooting suggestions for issues we encountered during development testing.

If you’d like to learn more:

We offer a full online BatteryXPRT training course that covers almost every aspect of the benchmark. You can view the sections in order or jump to the parts that interest you. We guarantee that you’ll learn something new!

BatteryXPRT 2014 for Android Training Course

If you’d like to dig into the details:

Check out the Exploring BatteryXPRT 2014 for Android white paper. In it, we discuss the app’s development and structure. We also describe the component tests; explain the differences between the test’s Airplane, Wi-Fi, and Cellular modes; and detail the statistical processes we use to calculate expected battery life.

If you’d like to dig even deeper, the BatteryXPRT source code is available to members of the BenchmarkXPRT Development Community, so consider joining today. Membership is free for members of any company or organization with an interest in benchmarks, and there are no obligations after joining.

If you haven’t used BatteryXPRT before, try it out and let us know what you think!

Justin

Evolve or die

Last week, Google announced that it would retire its Octane benchmark. The announcement explains that Google designed Octane to spur improvement in JavaScript performance, and while it did just that when it was first released, those improvements have plateaued in recent years. Google also notes that some operations in Octane optimize Octane scores but do not reflect real-world scenarios. That’s unfortunate, because Google, like most of us, wants improvements in benchmark scores to mean improvements in end-user experience.

WebXPRT comes at the web performance issue differently. While Octane’s goal was to improve JavaScript performance, the purpose of WebXPRT is to measure performance from the end user’s perspective. By doing the types of work real people do, WebXPRT doesn’t measure only improvements in JavaScript performance; it also measures the quality of the real-world user experience. WebXPRT’s results also reflect the performance of the entire device and software stack, not just the performance of the JavaScript interpreter.

Google’s announcement reminds us that benchmarks have finite life spans: they must constantly evolve to keep pace with changes in technology, or they become useless. To make sure the XPRT benchmarks do just that, we are always looking at how people use their devices and developing workloads that reflect their actions. This is a core element of the XPRT philosophy.

As we mentioned last week, we’re working on the next version of WebXPRT. If you have any thoughts about how it should evolve, let us know!

Eric

Digging deeper

From time to time, we like to revisit the fundamentals of the XPRT approach to benchmark development. Today, we’re discussing the need for testers and benchmark developers to consider the multiple factors that influence benchmark results. For every device we test, all of its hardware and software components have the potential to affect performance, and changing the configuration of those components can significantly change results.

For example, we frequently see significant performance differences between different browsers on the same system. In our recent recap of the XPRT Weekly Tech Spotlight’s first year, we highlighted an example of how testing the same device with the same benchmark can produce different results, depending on the software stack under test. In that instance, the Alienware Steam Machine entry included a WebXPRT 2015 score for each of the two browsers that consumers were likely to use. The first score (356) represented the SteamOS browser app in the SteamOS environment, and the second (441) represented the Iceweasel browser (a Firefox variant) in the Linux-based desktop environment. Including only the first score would have given readers an incomplete picture of the Steam Machine’s web-browsing capabilities, so we thought it was important to include both.
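Using the two published scores, a quick back-of-the-envelope calculation shows how much the software stack alone moved the result on identical hardware:

```python
# Published WebXPRT 2015 scores for the same Alienware Steam Machine hardware.
steamos_browser_score = 356   # SteamOS browser app in the SteamOS environment
iceweasel_score = 441         # Iceweasel browser in the Linux-based desktop

relative_gain = (iceweasel_score - steamos_browser_score) / steamos_browser_score
print(f"Iceweasel scored {relative_gain:.0%} higher on the same hardware.")
# ~24% -- a gap that comes entirely from the software stack, not the hardware.
```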

We also see performance differences between different versions of the same browser, a fact especially relevant to those who use frequently updated browsers such as Chrome. Even benchmarks that measure the same general area of performance (web browsing, for example) usually test very different things.

OS updates can also have an impact on performance. Consumers might base a purchase on performance or battery life scores and end up with a device that behaves much differently when updated to a new version of Android or iOS, for example.

Other important factors in the software stack include pre-installed software, commonly referred to as bloatware, and the proliferation of apps that sap performance and battery life.

This is a much larger topic than we can cover in the blog. Let the examples we’ve mentioned remind you to think critically about, and dig deeper into, benchmark results. If we see published XPRT scores that differ significantly from our own results, our first question is always “What’s different between the two devices?” Most of the time, the answer becomes clear as we compare hardware and software from top to bottom.

Justin

Experience is the best teacher

One of the core principles that guides the design of the XPRT tools is that they should reflect the way real-world users use their devices. The XPRTs try to use applications and workloads that reflect what users do and the way real applications function. How did we learn how important this is? The hard way: by making mistakes! Here’s one example.

In the 1990s, I was Director of Testing for the Ziff-Davis Benchmark Operation (ZDBOp). The benchmarks ZDBOp created for its technical magazines became the industry standards, because of both their quality and Ziff-Davis’ leadership in the technical trade press.

WebBench, one of the benchmarks ZDBOp developed, measured the performance of early web servers. We worked hard to create a tool that used physical clients and tested web server performance over an actual network. However, we didn’t pay enough attention to how clients actually interacted with the servers. In the first version of WebBench, the clients opened connections to the server, did a small amount of work, closed the connections, and then opened new ones.

When we met with vendors after the release of WebBench, they begged us to change the model. At that time, browsers opened relatively long-lived connections and did lots of work before closing them. Our model was almost the opposite of that. It put vendors in the position of having to choose between coding to give their users good performance and coding to get good WebBench results.
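To make the contrast concrete, here is a minimal sketch, not WebBench code, that shows per-request connections versus a single longer-lived connection. The host is a placeholder, and the example assumes the server supports HTTP keep-alive:

```python
import http.client

HOST = "example.com"      # placeholder host; any reachable HTTP server works
PATHS = ["/", "/", "/"]

# Model 1: what early WebBench clients did -- open a connection, do a small
# amount of work, close it, and then open a new one.
for path in PATHS:
    conn = http.client.HTTPConnection(HOST, timeout=10)
    conn.request("GET", path)
    conn.getresponse().read()
    conn.close()

# Model 2: closer to how browsers of that era behaved -- keep one connection
# open and issue many requests over it before closing (requires keep-alive).
conn = http.client.HTTPConnection(HOST, timeout=10)
for path in PATHS:
    conn.request("GET", path)
    conn.getresponse().read()
conn.close()
```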

Of course, we were horrified by this, and we worked hard to make the next version of the benchmark more closely reflect the way real browsers interacted with web servers. Subsequent versions of WebBench were much better received.

This is one of the roots from which the XPRT philosophy grew. We have tried to learn and grow from the mistakes we’ve made. We’d love to hear about any of your experiences with performance tools so we can all learn together.

Eric
