

Question we get a lot

“How come your benchmark ranks devices differently than [insert other benchmark here]?” It’s a fair question, and the answer is that each benchmark has its own emphasis and tests different things. When you think about it, it would be more surprising if all benchmarks agreed.

To illustrate the phenomenon, consider this excerpt from a recent browser shootout in VentureBeat:

[Excerpt: VentureBeat’s table of browser benchmark results, in which each benchmark ranks the browsers differently.]

While this looks very confusing, the simple explanation is that the different benchmarks are testing different things. To begin with, SunSpider, Octane, JetStream, Peacekeeper, and Kraken all measure JavaScript performance. Oort Online measures WebGL performance. WebXPRT measures both JavaScript and HTML5 performance. HTML5Test measures HTML5 compliance.

Even with benchmarks that test the same aspect of browser performance, the tests differ. Kraken and SunSpider both test the speed of JavaScript math, string, and graphics operations in isolation, but run different sets of tests to do so. Peacekeeper profiles JavaScript from sites such as YouTube and Facebook.

WebXPRT, like the other XPRTs, uses scenarios that model the types of work people do with their devices.

It’s no surprise that the order changes depending on which aspect of the Web experience you emphasize, in much the same way that the most fuel-efficient cars might not be the ones with the best acceleration.

This is a bigger topic than we can deal with in a single blog post, and we’ll examine it more in the future.

Eric

What’s in a name?

A couple of weeks ago, the German site Notebookcheck published a review of the Huawei P8lite. We were pleased to see they used WebXPRT 2015, and the P8lite got an overall score of 47. This week, AnandTech published their review of the Huawei P8lite, and in it, the P8lite got an overall score of 59!

Those scores are very different, but it was not difficult to figure out why. The P8lite comes in two versions, depending on your market. The version Notebookcheck used is based on HiSilicon’s Kirin 620 SoC, while the version AnandTech used is based on Qualcomm’s Snapdragon 615. It’s also worth noting that the phone Notebookcheck tested was running Android 5.0, while the phone AnandTech tested was running Android 4.4. With different hardware and different operating systems, it’s no surprise that the results were different.

One consequence of the XPRTs being used across the world is that it is not uncommon to see results from devices in different markets. As we’ve said before, many things can influence benchmark results, so don’t assume that two devices with the same name are identical.

Kudos to both AnandTech and Notebookcheck for their care in presenting the system information for the devices in their reviews. The AnandTech review even included a brief description of the two models of the P8lite. This type of information is essential for helping people make informed decisions.

In other news, Windows 10 launched yesterday. We’re looking forward to seeing the TouchXPRT and WebXPRT results!

Eric

Seeing the whole picture

In past posts, we’ve discussed how people tend to focus on hardware differences when comparing performance or battery life scores between systems, but software factors such as OS version, choice of browser, and background activity often influence benchmark results on multiple levels.

For example, AnandTech recently published an article explaining how a decision by Google Chrome developers to decrease Web page rendering times may have introduced a tradeoff between performance and battery life. To increase performance, Chrome asks Windows to use a 1 ms timer interrupt interval instead of the default 15.6 ms. Because Chrome wakes up far more often than applications that wait for the default interval, it gets its work done sooner.

The tradeoff for that increased performance is that waking up the OS more frequently can diminish the effectiveness of a system’s innate power-saving attributes, such as a tick-less kernel and timer coalescing in Windows 8, or efficiency innovations in a new chip architecture. In this case, because of the OS-level interactions between Chrome and Windows, a faster browser could end up having a greater impact on battery life than might initially be suspected.
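
For the curious, here’s a minimal sketch of the OS mechanism involved. It isn’t Chrome’s code; it’s just a Python illustration (using the standard ctypes module) of the documented winmm.dll timeBeginPeriod/timeEndPeriod calls that any Windows application can use to request a higher timer resolution:

```python
# A Python illustration of the documented winmm.dll timer-resolution calls.
# Windows-only: timeBeginPeriod/timeEndPeriod raise the system timer rate.
import ctypes
import time

winmm = ctypes.WinDLL("winmm")

winmm.timeBeginPeriod(1)    # request 1 ms timer interrupts instead of ~15.6 ms
try:
    time.sleep(0.002)       # short sleeps now wake close to the requested time
finally:
    winmm.timeEndPeriod(1)  # restore the default resolution when finished
```

While a request like this is in effect, the whole system services timer interrupts at the higher rate, which is why a single application’s choice can affect battery life well beyond that application.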

The article discusses the limitations of the test in detail, specifically Chrome 36’s inability to natively support the same HiDPI resolution as the other browsers, but the point we’re drawing out here is that accurate testing involves taking all relevant factors into consideration. People are used to the idea that changing browsers may affect Web performance, but much less is said about a browser’s impact on battery life.

Justin


It’s all in the presentation

The comment period for BatteryXPRT CP2 ended on Monday. Now we are in the final sprint to release the benchmark.

The extensive testing we’ve been doing has meant that we’ve been staring at a lot of numbers. This has led us to make a change in how we present the results. As you would expect, the battery life when you run the test over Wi-Fi differs from the battery life when you run it over a cellular network. Although individual devices vary, the difference is in the vicinity of 10 percent, about the same as the difference between Airplane mode and Wi-Fi.

BatteryXPRT has always captured a device’s network setting in its disclosure information, but had not included that setting alongside the reported results. Because we found it so helpful to see the two together, we have changed the presentation of the results to recognize three modes: Airplane, Wi-Fi, and Cellular. We hope this will avoid confusion as people use BatteryXPRT.

Note that we have not changed the way the results are calculated. Results you generated during the preview are still valid. However, results from one mode should not be compared to results from another mode.

We’ve been talking a lot about BatteryXPRT, but TouchXPRT is also looking great! We’re looking forward to releasing both of them soon!

Eric


Staying out in the open

Back in July, AnandTech publicized some research about possible benchmark optimizations in the Galaxy S4. Yesterday, AnandTech published a much more comprehensive article, “The State of Cheating in Android Benchmarks.” It’s well worth the read.

AnandTech doesn’t accuse any of the benchmarks of being biased; it’s the OEMs who are supposedly doing the optimizations. I will note that none of the XPRT benchmarks are among the whitelisted CPU tests. That being said, I imagine that everyone in the benchmark game is concerned about any implication that their benchmark could be biased.

When I was a kid, my parents taught me that it’s a lot harder to cheat in the open. This is one of the reasons we believe so strongly in the community model for software development. The source code is available to anyone who joins the community, so it’s impossible to hide any biases. At the same time, the model allows us to control derivative works, which is necessary to keep biased versions of the benchmarks from being published. We think the community model strikes the right balance.

However, any time there is a system, someone will try to game it. We’ll always be on the lookout for optimizations that happen outside the benchmarks.

Eric


Lies, damned lies, and statistics

No one knows who first said “lies, damned lies, and statistics,” but it’s easy to understand why they said it. It’s no surprise that the bestselling statistics book in history is titled How to Lie with Statistics. While the title is facetious, it is certainly true that statistics can be confusing. Consider the word “average,” which can refer to the mean, median, or mode; the mean, in turn, can be the arithmetic mean, the geometric mean, or the harmonic mean. It’s enough to make a non-statistician’s head spin.
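
To make that concrete, here’s a quick illustration, with made-up numbers, of how the three kinds of mean disagree about the same data:

```python
# Illustration with made-up numbers: the arithmetic, geometric, and harmonic
# means of the same data give three different "averages."
from statistics import mean, geometric_mean, harmonic_mean

scores = [100, 200, 400]
print(mean(scores))            # arithmetic mean: ~233.3
print(geometric_mean(scores))  # geometric mean:   200.0
print(harmonic_mean(scores))   # harmonic mean:   ~171.4
```

Three “averages” of the same three numbers, and three different answers, which is why it always pays to say exactly which one you’re using.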

In fact, a number of people have been confused by the confidence interval WebXPRT reports. We believe that the best way to stand behind your results is to be completely open about how you crunch the numbers. To this end, we released the white paper WebXPRT 2013 results calculation and confidence interval this past Monday.

This white paper, which you don’t need a background in mathematics to follow, explains what the WebXPRT confidence interval is and how it differs from the benchmark variability we sometimes talk about. The paper also gives an overview of the statistical and mathematical techniques WebXPRT uses to translate the raw timing numbers into results.

Because sometimes the devil is in the details, we wanted to augment our overview by showing exactly how WebXPRT calculates results. The white paper is accompanied by a spreadsheet that reproduces the calculations WebXPRT uses. If you are mathematically inclined and would like to suggest improvements to the process, by all means let us know!
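
As a taste of the general technique, here’s a short sketch of a textbook 95 percent confidence interval for the mean of a few hypothetical runs. It isn’t WebXPRT’s exact procedure (the white paper and spreadsheet document that); it just shows the kind of calculation involved:

```python
# Generic sketch of a 95% confidence interval for the mean of repeated runs,
# using the Student's t distribution. The run scores are hypothetical; see
# the white paper and spreadsheet for WebXPRT's exact calculations.
from statistics import mean, stdev
from scipy.stats import t

runs = [251.2, 248.7, 253.1, 249.9, 250.4]  # hypothetical scores from five runs
n = len(runs)
m = mean(runs)
half_width = t.ppf(0.975, n - 1) * stdev(runs) / n ** 0.5
print(f"{m:.1f} +/- {half_width:.1f}")      # prints roughly 250.7 +/- 2.0
```

The narrower the interval, the more confident you can be that the reported score is close to the device’s true result.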

Eric

