BenchmarkXPRT Blog banner

Category: What makes a good benchmark?

The AIXPRT Request for Comments preview build

In the next few days, we’ll be publishing the first AIXPRT tool as a Request for Comments (RFC) preview build, an early version of one of the AIXPRT tools we’re developing to help evaluate machine learning performance.

We’re inviting folks to run the workload and send in their thoughts and suggestions. Only BenchmarkXPRT Development Community members have access to our RFCs and the opportunity to provide feedback. However, because we’re seeking broad input from experts in this field, we’ll gladly make anyone interested in participating a member.

This AIXPRT RFC preview build includes support for the Intel OpenVINO computer vision toolkit to run image classification workloads with ResNet-50 and SSD-MobileNet v1 networks. The test reports FP32 and FP16 levels of precision. The system requirements are:

  • Operating system = Ubuntu 16.04
  • CPU = 6th to 8th generation Intel Core or Xeon processors, or Intel Pentium processors N4200/5, N3350/5, N3450/5 with Intel HD Graphics


We welcome input on all aspects of the benchmark, including scope, workloads, metrics and scores, user experience, and reporting. We will add support for TensorFlow and TensorRT to the AIXPRT RFC preview build during the preview period. We are accepting feedback through January 25th, 2019, after which we’ll collect and evaluate responses before publishing the next build. Because this is an RFC release, we ask that testers do not publish scores or use the results for comparison purposes.

We’ll send out a community announcement when the RFC preview build is officially available, and we’ll also post an announcement and RFC preview build user guide on AIXPRT.com. We’re hosting the AIXPRT RFC preview build in a dedicated GitHub repository, so please contact us at BenchmarkXPRTsupport@principledtechnologies.com to gain access.

This is just the next step for AIXPRT. With your help, we hope to add more workloads and other frameworks in the coming months. We look forward to receiving your feedback!

Bill

Notes from the lab: choosing a calibration system for MobileXPRT 3

Last week, we shared some details about what to expect in MobileXPRT 3. This week, we want to provide some insight into one part of the MobileXPRT development process, choosing a calibration system.

First, some background. For each of the benchmarks in the XPRT family, we select a calibration system using criteria we’ll explain below. This system serves as a reference point, and we use it to calculate scores that will help users understand a single benchmark result. The calibration system for MobileXPRT 2015 is the Motorola DROID RAZR M. We structured our calculation process so that the mean performance score from repeated MobileXPRT 2015 runs on that device is 100. A device that completes the same workloads 20 percent faster than the DROID RAZR M would have a performance score of 120, and one that performs the test 20 percent more slowly would have a score of 80. (You can find a more in-depth explanation of MobileXPRT score calculations in the Exploring MobileXPRT 2015 white paper.)

When selecting a calibration device, we are looking for a relevant reference point in today’s market. The device should be neither too slow to handle modern workloads nor so fast that it outscores most devices on the market. It should represent a level of performance that is close to what the majority of consumers experience, and one that will continue to be relevant for some time. This approach helps to build context for the meaning of the benchmark’s overall score. Without that context, testers can’t tell whether a score is fast or slow just by looking at the raw number. When compared to a well-known standard such as the calibration device, however, the score has more informative value.

To determine a suitable calibration device for MobileXPRT 3, we started by researching the most popular Android phones by market share around the world. It soon became clear that in many major markets, the Samsung Galaxy S8 ranked first or second, or at least appeared in the top five. As last year’s first Samsung flagship, the S8 is no longer on the cutting edge, but it has specs that many current mid-range phones are deploying, and the hardware should remain relevant for a couple of years.

For all of these reasons, we made the Samsung Galaxy S8 the calibration device for MobileXPRT 3. The model in our lab has a Qualcomm Snapdragon 835 SoC, 4 GB of RAM, and runs Android 7.0 (Nougat). We think it has the balance we’re looking for.

If you have any questions or concerns about MobileXPRT 3, calibration devices, or score calculations, please let us know. We look forward to sharing more information about MobileXPRT 3 as we get closer to the community preview.

Justin

Which browser is the fastest? It’s complicated.

PCWorld recently published the results of a head-to-head browser performance comparison between Google Chrome, Microsoft Edge, Mozilla Firefox, and Opera. As we’ve noted about similar comparisons, no single browser was the fastest in every test. Browser speed sounds like a straightforward metric, but the reality is complex.

For the comparison, PCWorld used three JavaScript-centric test suites (JetStream, SunSpider, and Octane), one benchmark that simulates user actions (Speedometer), a few in-house tests of their own design, and one benchmark that simulates real-world web applications (WebXPRT). Edge came out on top in JetStream and SunSpider, Opera won in Octane and WebXPRT, and Chrome had the best results in Speedometer and PCWorld’s custom workloads.

The reason that the benchmarks rank the browsers so differently is that each one has a unique emphasis and tests a specific set of workloads and technologies. Some focus on very low-level JavaScript tasks, some test additional technologies such as HTML5, and some are designed to identify strengths or weakness by stressing devices in unusual ways. These approaches are all valid, and it’s important to understand exactly what a given score represents. Some scores reflect a very broad set of metrics, while others assess a very narrow set of tasks. Some scores help you to understand the performance you can expect from a device in your everyday life, and others measure performance in scenarios that you’re unlikely to encounter. For example, when Eric discussed a similar topic in the past, he said the tests in JetStream 1.1 provided information that “can be very useful for engineers and developers, but may not be as meaningful to the typical user.”

As we do with all the XPRTs, we designed WebXPRT to test how devices handle the types of real-world tasks consumers perform every day. While lab techs, manufacturers, and tech journalists can all glean detailed data from WebXPRT, the test’s real-world focus means that the overall score is relevant to the average consumer. Simply put, a device with a higher WebXPRT score is probably going to feel faster to you during daily use than one with a lower score. In today’s crowded tech marketplace, that piece of information provides a great deal of value to many people.

What are your thoughts on browser testing? We’d love to hear from you.

Justin

The WebXPRT 3 results calculation white paper is now available

As we’ve discussed in prior blog posts, transparency is a core value of our open development community. A key part of being transparent is explaining how we design our benchmarks, why we make certain development decisions, and how the benchmarks actually work. This week, to help WebXPRT 3 testers understand how the benchmark calculates results, we published the WebXPRT 3 results calculation and confidence interval white paper.

The white paper explains what the WebXPRT 3 confidence interval is, how it differs from typical benchmark variability, and how the benchmark calculates the individual workload scenario and overall scores. The paper also provides an overview of the statistical techniques WebXPRT uses to translate raw times into scores.

To supplement the white paper’s overview of the results calculation process, we’ve also published a spreadsheet that shows the raw data from a sample test run and reproduces the calculations WebXPRT uses.

The paper and spreadsheet are both available on WebXPRT.com and on our XPRT white papers page. If you have any questions about the WebXPRT results calculation process, please let us know, and be sure to check out our other XPRT white papers.

Justin

The value of speed

I was reading an interesting article on how high-end smartphones like the iPhone X, Pixel 2 XL, and Galaxy S8 generate more money from in-game revenue than cheaper phones do.

One line stood out to me: “With smartphones becoming faster, larger and more capable of delivering an engaging gaming experience, these monetization key performance indicators (KPIs) have begun to increase significantly.”

It turns out the game companies totally agree with the rest of us that faster devices are better!

Regardless of who is seeking better performance—consumers or game companies—the obvious question is how you determine which models are fastest. Many folks rely on device vendors’ claims about how much faster the new model is. Unfortunately, the vendors’ claims don’t always specify on what they base the claims. Even when they do, it’s hard to know whether the numbers are accurate and applicable to how you use your device.

The key part of any answer is performance tools that are representative, dependable, and open.

  • Representative – Performance tools need to have realistic workloads that do things that you care about.
  • Dependable – Good performance tools run reliably and produce repeatable results, both of which require that significant work go into their development and testing.
  • Open – Performance tools that allow people to access the source code, and even contribute to it, keep things above the table and reassure you that you can rely on the results.

Our goal with the XPRTs is to provide performance tools that meet all these criteria. WebXPRT 3 and all our other XPRTs exist to help accurately reveal how devices perform. You can run them yourself or rely on the wealth of results that we and others have collected on a wide array of devices.

The best thing about good performance tools is that everyone, even vendors, can use them. I sincerely hope that you find the XPRTs helpful when you make your next technology purchase.

Bill

AIXPRT: We want your feedback!

Today, we’re publishing the AIXPRT Request for Comments (RFC) document. The RFC explains the need for a new artificial intelligence (AI)/machine learning benchmark, shows how the BenchmarkXPRT Development Community plans to address that need, and provides preliminary design specifications for the benchmark.

We’re seeking feedback and suggestions from anyone interested in shaping the future of machine learning benchmarking, including those not currently part of the Development Community. Usually, only members of the BenchmarkXPRT Development Community have access to our RFCs and the opportunity to provide feedback. However, because we’re seeking input from non-members who have expertise in this field, we will be posting this RFC in the New events & happenings section of the main BenchmarkXPRT.com page and making it available at AIXPRT.com.

We welcome input on all aspects of the benchmark, including scope, workloads, metrics and scores, UI design, and reporting requirements. We will accept feedback through May 13, 2018, after which BenchmarkXPRT Development Community administrators will collect and evaluate the feedback and publish the final design specification.

Please share the RFC with anyone interested in machine learning benchmarking and please send us your feedback before May 13.

Justin

Check out the other XPRTs:

Forgot your password?