

Sharing results

A few weeks back, I wrote about the different types of results benchmarks provide. HDXPRT 2011’s primary metric is an overall score. Unlike a metric such as minutes of battery life, a score is hard to interpret without context. Is 157 a good score? The use of a calibration, or base, system helps a bit: if that system has a score of 100, then a 157 is definitely better. Still, two scores do not give you a lot of context.

To help make comparisons easier, we are releasing a set of results from our testing at http://hdxprt.com/hdxprt2011results. With the results viewer we’ve provided, you can sort the results on a variety of fields and filter them for matching text. We’ve included results from our beta testing and our results white papers.

We’ll continue to add results, but we want to invite members of the HDXPRT Development Community to do the same. We would especially like to get any results you have published on your Web sites. Please submit your results using this link: http://www.hdxprt.com/forum/2011resultsubmit. We’ll give them a sanity check and then include them in the results viewer. Thanks!

Bill


Getting to the source

Many of the earliest benchmarks came in source code form. Dhrystone and many others relied on the compiler for optimization. In fact, some compilers even recognized the code and basically optimized it to a few lines of code that did nothing but return the result! Even some modern benchmarks, such as SPEC CPU and LINPACK, come in source code form.

The source code to application benchmarks, however, has not typically been available. Two of the leading benchmarks of the last twenty years, Winstone and SYSmark, were never available in source code form. The makers of those tools had good reasons for keeping the code private; we know, because I led the creation of Winstone. Keeping code private protects your intellectual investment, can make it easier to hit development schedules, and provides many other advantages.

It can, however, lead some people to suspect that you are not showing the source code because the benchmark is in some way biased. In benchmarks, as in so many areas, transparency is the best way to allay such concerns.

Which leads us to today’s big announcement

We want HDXPRT to be as open as possible, so we’re bucking the normal practice for application-based benchmarks and planning to make the HDXPRT 2011 source code available to the HDXPRT Development Community.

The code will include both the benchmark harness and the scripts that drive the applications. You’ll be able to study everything about the benchmark, and you’ll be able to contribute new code more easily, which is exactly what we hope you’ll do. We want you not only to be completely comfortable with the benchmark but also to contribute to future versions of it.

There will, of course, be some ground rules. We are making the code available only to the HDXPRT Development Community. (If you’re not already a member, joining is cheap and easy: just go here.) Because we want to limit the code to the community, members will have to agree to a license that prevents them from releasing the code to the public.

We don’t have an exact schedule in place yet, but over the next week or two, we should have all the necessary things in place to make the source code available.

When you’ve had a chance to look at it, please let us know what improvements you would like to see in HDXPRT 2012. We’ll discuss that version, and how you can help, in the coming weeks.

Bill


Looking deeper into results

A few weeks ago, I mentioned some questions we had about graphics performance using HDXPRT 2011 after releasing our results white paper. The issue was that HDXPRT 2011 gave results I had not expected—the integrated graphics outperformed discrete graphics cards. I suspected that this was both because HDXPRT 2011’s lack of 3D work lessens the advantage of discrete graphics cards and because the integrated graphics on the second-generation Intel Core processors we used performed well.

We ran some tests with discrete graphics cards on an older processor (an Intel Core 2 Quad processor Q6600) and report our findings in a second results white paper. My suspicions were correct: On the older processor, the discrete graphics cards performed 21 to 36 percent better than the integrated graphics.

As an aside, we are looking into putting our test results on the Web site in some easy-to-access fashion so you can look at them in more detail. My hope is that doing so will facilitate sharing of results among all of us in the HDXPRT Development Community.

Based on this second results white paper, I would love to hear your responses to two questions. First, do you think that future versions of HDXPRT should include 3D graphics? Second, what other areas of HDXPRT 2011 would you like to see us look into?

Bill


Scoring with HDXPRT

Two weeks ago, I began explaining how benchmarks keep score (http://www.hdxprt.com/blog/2011/08/17/keeping-score/). HDXPRT 2011 fundamentally measures the time a PC requires to complete a series of tasks, such as editing photos and converting videos from one format to another. It uses the times of three sets of tasks to come up with three use-case times (Edit videos from your camcorder, Create memories from your digital camera, and Prepare media for on-the-go). Because an early version of the benchmark took too long to run, we trimmed the size of the workloads (such as the number of photos) to make it complete more quickly. Because we believed the size of the original workloads was realistic, we extrapolated what the times would have been by multiplying each measured time by the ratio of the original workload size to the trimmed size. That process results in times in minutes.

We could have simply combined the three times into one total time, but doing so would have created a score where smaller is better, which can be confusing. To avoid this, HDXPRT 2011 normalizes the three times to the times a calibration, or base, system required to complete the same work. The benchmark then calculates a geometric mean of those three normalized scores and multiplies that number by 100 to create the overall Create HD Score. This scoring method sets the calibration system’s score to 100 and makes it easy for you to compare multiple systems. For example, if PC A gets a score of 200, and PC B gets a 400, PC B is twice the speed of PC A (and four times the speed of the calibration system) at creating HD content.
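To make that arithmetic concrete, here is a minimal sketch in Python of how such a score could be computed. The use-case times and calibration times below are made-up numbers for illustration, not actual HDXPRT 2011 values.

```python
import math

# Hypothetical calibration-system times, in minutes, for the three use cases.
calibration_times = {"edit_videos": 20.0, "create_memories": 15.0, "prepare_media": 10.0}

# Hypothetical times measured on the system under test (SUT).
sut_times = {"edit_videos": 12.5, "create_memories": 10.0, "prepare_media": 4.0}

def create_hd_score(sut, calibration):
    # Normalize each use-case time to the calibration system. Dividing the
    # calibration time by the SUT time makes bigger numbers better.
    ratios = [calibration[use_case] / sut[use_case] for use_case in sut]
    # Geometric mean of the normalized ratios, scaled so that the
    # calibration system scores exactly 100.
    return 100.0 * math.prod(ratios) ** (1.0 / len(ratios))

print(round(create_hd_score(sut_times, calibration_times)))          # 182
print(round(create_hd_score(calibration_times, calibration_times)))  # always 100
```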

The term “geometric mean” might be unfamiliar. One way to get benchmark geeks arguing is to ask about the correct mean for combining results. (Yes, there really are enough of us for an argument.) At the risk of inflaming my fellow benchmark geeks, I will give a quick summary of the main ways people combine results.

An arithmetic mean is a simple average, where you add all the numbers and divide by the number of numbers. It is good for combining amounts, such as gigabytes of RAM, across multiple computers.

A geometric mean is more mathematically complex. You compute it by multiplying all the numbers and then taking the nth root, where n is the number of numbers. This kind of mean is appropriate for combining normalized numbers. Its advantage over the arithmetic mean is that it keeps one really good number from drowning out all the others.

The final mean is the harmonic. You calculate it by dividing the number of numbers by the sum of the reciprocals (1 divided by each element). (If that makes little sense to you, don’t worry about it!) The harmonic mean is appropriate for combining rates, such as megabytes per second.
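For the curious, here is a quick Python sketch of all three means on a made-up set of numbers. Note that for positive numbers that are not all equal, the harmonic mean is always less than the geometric mean, which in turn is less than the arithmetic mean.

```python
import math

values = [2.0, 4.0, 8.0]  # made-up sample data
n = len(values)

# Arithmetic mean: sum divided by count; good for amounts (e.g., GB of RAM).
arithmetic = sum(values) / n

# Geometric mean: nth root of the product; good for normalized scores.
geometric = math.prod(values) ** (1.0 / n)

# Harmonic mean: count divided by the sum of reciprocals; good for rates.
harmonic = n / sum(1.0 / v for v in values)

print(arithmetic, geometric, harmonic)  # 4.666..., 4.0, 3.428...
```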

I should also mention one other result from HDXPRT 2011, the Overall Play HD Experience score. This is a very different kind of score that uses one to five stars to indicate the quality of three HD video playbacks. HDXPRT uses mean opinion scores (MOS) based on smoothness of playback to compute these results. (I’ll discuss MOS in more detail in a future blog.) With this kind of score, a four-star rating is better than a two-star rating, but it is hard to say how much better. The MOS research indicates that people would rate the four-star playback as good and the two-star playback as poor, but you can’t say that one is twice as good as the other because the relationship is not linear.

What do you think of the metrics that HDXPRT 2011 provides? Are there others you would find more useful or meaningful? Your input is vital to improving the benchmark and making sure it does what you want it to do.

Bill


Helping hands

We ran into a problem last week with HDXPRT 2011: basically, the benchmark would fail to install on some systems. One of the biggest problems for application-based benchmarks like HDXPRT 2011 is dealing with the applications already installed on the system. Even more difficult to account for are the many DLLs, drivers, and Registry settings that can collide between applications, and between different versions of the same application.

After a lot of effort, we found the problem was indeed a conflict between some of the pre-installed software on the system and the HDXPRT 2011 installer. We were able to narrow down which applications caused the problem and posted on the site some instructions for how to work around the issues. (For more details, log into the forum and then see http://www.hdxprt.com/forum/showthread.php?18-Troubleshooting-Installation-problems-on-Dell-Latitude-notebooks. You won’t be able to read that message if you’re not logged in.)

My hope is that if you run into issues with HDXPRT 2011, you’ll share them. And, share the workarounds you find as well! So, please let us know any tips, tricks, or issues you find with the benchmark by sending email to hdxprtsupport@hdxprt.com. The more we work together, the better we can make both HDXPRT 2011 and the future versions. Thanks!

Next week, we’ll return to looking at the results HDXPRT 2011 provides.

Bill


Keeping score

One question I received as a result of the last two blog entries on benchmark anatomy was whether I was going to talk about the results or scores. That topic seemed like a natural follow-up.

All benchmarks need to provide some sort of metric to let you know how well the system under test (SUT) did.  I think the best metrics are the easily understood ones.  These metrics have units like time or watts.  The problem with some of these units is that sometimes smaller can be better.  For example, less time to complete a task is better.  (Of course, more time before the battery runs down is better!)  People generally see bigger bars in a chart as better.

Some tests, however, give units that are not so understandable.  Units like instructions per second, requests per second, or frames per second are tougher to relate to.  Sure, more bytes per second would be better, but it is not as easy to understand what that means in the real world.

There is a solution to both the problem of smaller is better and non-intuitive units—normalization.  With normalization, you take the result of the SUT and divide it by that of a defined base or calibration system.  The result is a unit-less number.  So, if the base system can do 100 blips a second and the SUT can do 143 blips a second, the SUT would get 143 / 100 or a score of 1.43.  The units cancel out in the math and what is left is a score.  For appearance or convenience, the score may be multiplied by some number like 10 or 100 to make the SUT’s score 14.3 or 143.
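Here is a small Python sketch of that normalization. The blips-per-second numbers are the ones from the example above; the execution-time numbers are made up, and they show how you invert the ratio when smaller is better.

```python
def normalize(sut_value, base_value, smaller_is_better=False, scale=100):
    # Divide the SUT's result by the base system's result; the units cancel.
    # For metrics where smaller is better (such as seconds to finish a task),
    # invert the ratio so that bigger scores are always better.
    ratio = base_value / sut_value if smaller_is_better else sut_value / base_value
    return ratio * scale

# The example above: the SUT does 143 blips/second against a 100 blips/second base.
print(normalize(143, 100))                        # 143.0
# A made-up smaller-is-better case: the SUT finishes in 35 s, the base in 70 s.
print(normalize(35, 70, smaller_is_better=True))  # 200.0
```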

The nice thing about such scores is that it is easy to see how much faster one system is than another.  If you are measuring normalized execution time, a score of 286 means a system is twice as fast as one of 143.  As a bonus, bigger numbers are better.  An added benefit is that it is much easier to combine multiple normalized results into a single score.  These benefits are the reason that many modern benchmarks use normalized scores.

There is another kind of score, which is more of a rating.  These scores, such as a number of stars or thumbs up, are good for relative ratings.  However, they are not necessarily linear.  Four thumbs up is better than two, but is not necessarily twice as good.

Next week, we’ll look closer at the results HDXPRT 2011 provides and maybe even venture into the difference between arithmetic, geometric, and harmonic means!  (I know I can’t wait.)

Bill

