In my last blog entry, I noted the challenge of balancing real-world and real-science considerations when benchmarking Web page loads. That issue, however, is inherent in all benchmarking. Real world argues for benchmarks that emphasize what users and computers actually do. For servers, that might mean something like executing real database transactions against a real database from real client computers. For tablets, that might mean real fingers selecting and displaying real photos. There are obvious issues with both—setting up such a real database environment is difficult and who wants to be the owner of the real fingers driving the tablet? It is also difficult to understand what causes performance differences—is it the network, the processors, or the disks in the server? There are also more subtle challenges, such as how to make the tests work on servers or tablets other than the original ones. Worse, such real-world environments are subject to all sorts of repeatability and reproducibility issues.
Real science, on the other hand, argues for benchmarks that emphasize repeatable and reproducible results. Further, real science wants benchmarks that isolate the causes of performance differences. For servers, that might mean a suite of tests targeting processor speed, network bandwidth, and disk transfer rate. For tablets, that might mean tests targeting processor speed, touch responsiveness, and graphics-rendering rate. The problem is that it is not always obvious what combination of such factors actually delivers better database server performance or tablet experience. Worse, it is possible that testing different databases and transactions would result in very different characteristics that these tests don’t at all measure.
The good news is that real world and real science are not always in opposition. The bad news is that a third factor exacerbates the situation—benchmarks take real time (and of course real money) to develop. That means benchmark developers need to make compromises if they want to bring tests to market before the real world they are attempting to measure has changed. And, they need to avoid some of the most difficult technical hurdles. Like most things, that means trying to find the right balance between real world and real science.
Unfortunately, there is no formula for determining that balance. Instead, it really is somewhat of an art. I’d love to hear from you some examples of benchmarks (current or from the past) that you think do a good job implementing this balance and showing the real art of benchmarking.
Bill