Recently, a tester wrote in and asked for help determining why they were seeing different WebXPRT scores on two tablets with the same hardware configuration. The scores differed by approximately 7.5 percent. This can happen for many reasons, including different software stacks, but score variability can also result from different testing behavior and environments. While some degree of variability is natural, the question provides us with a great opportunity to talk about the basic benchmarking practices we follow in the XPRT lab, practices that contribute to the most consistent and reliable scores.
Below, we list a few basic best practices you might find useful in your testing. While they’re largely in the context of the WebXPRT focus on evaluating browser performance, several of these practices apply to other benchmarks as well.
- Test with clean images: We use an out-of-box (OOB) method for testing XPRT Spotlight devices. OOB testing means that other than initial OS and browser version updates that users are likely to run after first turning on the device, we change as little as possible before testing. We want to assess the performance that buyers are likely to see when they first purchase the device, before installing additional apps and utilities. This is the best way to provide an accurate assessment of the performance retail buyers will experience. While OOB is not appropriate for certain types of testing, the key is to not test a device that’s bogged down with programs that influence results unnecessarily.
- Turn off updates: We do our best to eliminate or minimize app and system updates after initial setup. Some vendors are making it more difficult to turn off updates completely, but you should always account for update settings.
- Get a feel for system processes: Depending on the system and the OS, quite a lot of system-level activity can be going on in the background after you turn it on. As much as possible, we like to wait for a stable baseline (idle) of system activity before kicking off a test. If we start testing immediately after booting the system, we often see higher variability in the first run before the scores start to tighten up.
- Disclosure is not just about hardware: Most people know that different browsers will produce different performance scores on the same system. However, testers aren’t always aware of shifts in performance between different versions of the same browser. While most updates don’t have a large impact on performance, a few updates have increased (or even decreased) browser performance by a significant amount. For this reason, it’s always worthwhile to record and disclose the extended browser version number for each test run. The same principle applies to any other relevant software.
- Use more than one data point: Because of natural variability, our standard practice in the XPRT lab is to publish a score that represents the median from at least three to five runs. If you run a benchmark only once, and the score differs significantly from other published scores, your result could be an outlier that you would not see again under stable testing conditions.
We hope those tips will make testing a little easier for you. If you have any questions about the XPRTs, or about benchmarking in general, feel free to ask!
Justin