HDXPRT, like most other application-based benchmarks, works by timing lots of individual operations. Some other benchmarks just time the entire script. The downside of that approach is that the time includes things that are constant regardless of the speed of the underlying hardware. Some things, like how fast a menu drops down or text scrolls, are tied to the user experience and should not go faster on a faster system. Including those items in the overall time dilutes the importance of the operations that we wait on and are frustrated by, the operations we need to time.
In the case of HDXPRT 2011, we time between 20 and 30 operations. We then roll these up into the times we report as well as the overall score. We do not, however, report the individual times. We expect to include even more timed operations in HDXPRT 2012. As we have been thinking about what the right metrics are, we have started to wonder what to do with all of those times. We could total up the times of similar operations and create additional results. For example, we could total up all the application load times and produce an application-load result. Or, we could total up all the times for an individual application and produce an application result. I can definitely see value in results like those.
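To make the roll-up idea concrete, here is a minimal Python sketch of totaling times by operation type and by application. The application names, operation labels, and times are all invented for illustration, and the `total_by` helper is hypothetical. None of this reflects how HDXPRT actually stores or processes its timings.

```python
from collections import defaultdict

# Hypothetical timed operations: (application, operation type, seconds).
# These values are made up for illustration; they are not HDXPRT data.
timings = [
    ("PhotoEditor", "load", 4.2),
    ("PhotoEditor", "apply_filter", 7.5),
    ("VideoConverter", "load", 3.1),
    ("VideoConverter", "transcode", 42.0),
]

def total_by(records, key_index):
    """Sum the seconds of all records that share the field at key_index."""
    totals = defaultdict(float)
    for record in records:
        totals[record[key_index]] += record[2]
    return dict(totals)

per_application = total_by(timings, 0)  # one total per application
per_operation = total_by(timings, 1)    # e.g. an application-load result

for name, seconds in sorted(per_operation.items()):
    print(f"{name}: {seconds:.1f} s")
```

The same list of times feeds both views, so adding a result like this would not require timing anything new.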
Another possibility is to look at the general pattern of the results to understand responsiveness. One way would be to collect the times in a histogram, where each bucket corresponds to a range of response times for the operations. Such a histogram might give a sense of how responsive a target system feels to an end user. There are certainly other possibilities as well.
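As a rough sketch of the histogram idea, the Python below sorts a list of response times into buckets. The bucket edges (1, 5, and 15 seconds) and the sample times are assumptions chosen only for illustration. They are not thresholds or measurements from HDXPRT.

```python
from bisect import bisect_right

# Hypothetical bucket edges in seconds. They define four buckets:
# under 1 s, 1 s up to 5 s, 5 s up to 15 s, and 15 s or more.
EDGES = [1.0, 5.0, 15.0]

def histogram(times, edges=EDGES):
    """Count how many response times fall into each bucket."""
    counts = [0] * (len(edges) + 1)
    for t in times:
        # bisect_right returns the index of the bucket that contains t;
        # a time equal to an edge lands in the bucket that starts there.
        counts[bisect_right(edges, t)] += 1
    return counts

# Made-up response times in seconds, for illustration only.
sample_times = [0.4, 0.9, 2.5, 4.2, 7.5, 42.0]
counts = histogram(sample_times)

labels = ["< 1 s", "1-5 s", "5-15 s", ">= 15 s"]
for label, count in zip(labels, counts):
    print(f"{label:>8}: {'#' * count}")
```

A system where most operations land in the first bucket or two would presumably feel snappier than one with the same total time concentrated in a few long waits, which is the kind of difference a single total hides.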
If nothing else, I think it makes sense to expose these times in some way. If we make them available, I’m confident that people will find ways to use them. My concern is the danger of burdening a benchmark with too many results. The engineer in me loves all the data possible. The product designer in me knows that focus is critical. Successful benchmarks have one or maybe two results. How do we balance the two?
One wonder of this benchmark development community is the ability to ask you what you think. What would you prefer, simple and clean or lots of numbers? Maybe a combination where we just have the high-level results we have now, but also make other results or times available in an “expert” or an “advanced” mode? What do you think?
Bill