
Category: What makes a good benchmark?

It’s always worth asking

Last week, one of our community members asked for a couple of enhancements to WebXPRT. They wanted WebXPRT to be easier to automate, and they made two specific requests:

  • Add debug/result logs
  • Add the ability to start the test without UI interactions, by using a specific URL or a command line (a sketch of what that might look like appears below)
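For anyone scripting test runs today, here is a minimal sketch of what that kind of URL-driven automation could look like, written in Python. The autostart query parameter and the result-log filename are invented placeholders, not actual WebXPRT features; the point is only to show the shape of a hands-off harness.

# Hypothetical harness: start the benchmark from a URL and wait for a
# result log. The "autostart" parameter and log filename are invented
# placeholders, not real WebXPRT hooks.
import os
import time
import webbrowser

BENCHMARK_URL = "http://www.principledtechnologies.com/benchmarkxprt/webxprt/?autostart=true"
RESULT_LOG = "webxprt_results.json"  # hypothetical log the test would write

webbrowser.open(BENCHMARK_URL)  # launches the default browser; no UI clicks

while not os.path.exists(RESULT_LOG):  # a real harness would add a timeout
    time.sleep(30)

print("Run complete; results are in", RESULT_LOG)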

This is a great example of why we put so much emphasis on the community. We have tried to make the BenchmarkXPRT benchmarks easy to use, but we don’t always face the same testing demands you do. If there’s anything we can do to make these tools more valuable, please let us know by posting on the forums or e-mailing us at benchmarkxprtsupport@principledtechnologies.com.

We are adding both capabilities to the upcoming WebXPRT 2014 community preview. We have been working hard on that preview, and in the next few weeks, we’ll talk about what will be in it.

Keep those requests coming!

Eric

Comment on this post in the forums

Sounds easy, but…

In Endurance, Bill said that we were going to be investigating battery life testing. He also discussed some of the issues that make battery testing difficult to do well. Finally, he explained why we were looking at MobileXPRT as the basis for the first version of the battery life test.

Over the last couple of months, we have been experimenting with a number of different approaches to battery testing. We now have enough empirical data to make a proposal, which we are drafting; it should be available to community members in the next couple of weeks.

We hope you’ll look at the proposal and let us know what you think. Your input is an essential part of developing a really great test. If you’re not a member of the community, it’s easy to join.

In other news, we’re going to CES and would love to talk with you. If you’d like to chat, send an e-mail to benchmarkxprtsupport@principledtechnologies.com.

Eric

Comment on this post in the forums

There is such a thing as too much

There’s been a lot of excitement about TouchXPRT recently. However, we haven’t been ignoring HDXPRT. On November 9, we released a patch that lets HDXPRT support Windows 8. We’ve now integrated the patch into HDXPRT 2012, so all copies of HDXPRT 2012 going forward will install on Windows 8 without the need for a separate step.

As promised, we will be releasing the source code for HDXPRT 2012. We anticipate having it available for community members by December 14.

During the comment period for HDXPRT, this message came through loud and clear: HDXPRT 2012 is too big and takes too long to run. So we are working hard to find the best way to reduce the number of applications and scenarios. While we want to make the benchmark smaller and faster, we want to make sure that HDXPRT 2013 is comprehensive enough to provide useful performance metrics for the greatest number of people.

We’re working toward having an RFC in late January that will define a leaner, meaner HDXPRT 2013 and will reflect the other comments we have received as well. If you have thoughts about which applications and scenarios are most important to you, please let us know.

In other news, CES is coming in January, and Principled Technologies will be there! Once again, Bill is hoping to meet with as many of you in the Development Community as possible. We’ll have a suite at the Hilton and would love for you to come, kick back, and talk about HDXPRT, TouchXPRT, the future of benchmarks, or the cool things you’ve seen at the show. (Bill loves talking about gadgets. Last year, he went into gadget overload!)

If you plan to be at CES but are stuck working a booth or suite, let us know, and Bill will try to stop by to say hi. Drop us an e-mail at hdxrpt_CES@principledtechnologies.com, and we will set up an appointment.

Finally, we’re really excited about the big changes at the Principled Technologies Web site. The new Web site gives us a lot of opportunities. Over the next few weeks, we’ll be looking at ways the Development Community can take advantage of them.

Eric

Comment on this post in the forums

The real art of benchmarking

In my last blog entry, I noted the challenge of balancing real-world and real-science considerations when benchmarking Web page loads. That issue, however, is inherent in all benchmarking. Real world argues for benchmarks that emphasize what users and computers actually do. For servers, that might mean executing real database transactions against a real database from real client computers. For tablets, that might mean real fingers selecting and displaying real photos.

There are obvious issues with both approaches. Setting up such a real database environment is difficult, and who wants to be the owner of the real fingers driving the tablet? It is also difficult to understand what causes performance differences: is it the network, the processors, or the disks in the server? There are more subtle challenges as well, such as how to make the tests work on servers or tablets other than the original ones. Worse, such real-world environments are subject to all sorts of repeatability and reproducibility issues.

Real science, on the other hand, argues for benchmarks that emphasize repeatable and reproducible results. Further, real science wants benchmarks that isolate the causes of performance differences. For servers, that might mean a suite of tests targeting processor speed, network bandwidth, and disk transfer rate. For tablets, that might mean tests targeting processor speed, touch responsiveness, and graphics-rendering rate. The problem is that it is not always obvious what combination of such factors actually delivers better database performance or a better tablet experience. Worse, testing different databases and transactions could reveal performance characteristics that these targeted tests don’t measure at all.
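To make the real-science approach concrete, here is a minimal sketch of a component-isolating microbenchmark using nothing but Python’s standard library. It times sequential disk writes to estimate transfer rate and repeats the run to check repeatability; the sizes and run counts are illustrative, and this is not how any XPRT measures storage.

# Minimal "real science" sketch: isolate one component (disk write
# throughput) and measure it repeatably. Sizes and run counts are
# illustrative, not taken from any XPRT benchmark.
import os
import time

CHUNK = b"\0" * (1024 * 1024)  # 1 MiB of zeros
TOTAL_MB = 256                 # small enough to run almost anywhere

def disk_write_mb_per_sec(path="bench.tmp"):
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(TOTAL_MB):
            f.write(CHUNK)
        f.flush()
        os.fsync(f.fileno())   # force the data to disk, not just the OS cache
    elapsed = time.perf_counter() - start
    os.remove(path)
    return TOTAL_MB / elapsed

# Repeat the measurement: repeatability is the whole point.
rates = [disk_write_mb_per_sec() for _ in range(3)]
print("MB/s per run:", [round(r, 1) for r in rates])

Notice what the sketch buys and what it gives up: the runs are repeatable and the component under test is unambiguous, but the number says nothing about how a real application would feel.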

The good news is that real world and real science are not always in opposition. The bad news is that a third factor exacerbates the situation: benchmarks take real time (and, of course, real money) to develop. That means benchmark developers need to make compromises if they want to bring tests to market before the real world they are attempting to measure has changed. And they need to avoid some of the most difficult technical hurdles. Like most things, that means trying to find the right balance between real world and real science.

Unfortunately, there is no formula for determining that balance. Instead, it really is something of an art. I’d love to hear about benchmarks, current or past, that you think strike this balance well and show the real art of benchmarking.

Bill

Comment on this post in the forums

Benchmarking a benchmark

One of the challenges of any benchmark is understanding its characteristics. The goal of a benchmark is to measure performance under a defined set of circumstances. For system-level, application-oriented benchmarks, it isn’t always obvious how individual components in the system influence the overall score. For instance, how does doubling the amount of memory affect the benchmark score?

The best way to understand the characteristics of a benchmark is to run a series of carefully controlled experiments that change one variable at a time. To test the benchmark’s behavior with increased memory, you would take a system and run the benchmark with different amounts of RAM. Changing the processor, graphics subsystem, or hard disk lets you see the influence of those components. Some components, like memory, can vary in both amount and speed.

The full matrix of system components to test can quickly grow very large. While the goal is to change only one component at a time, this is not always possible. For example, you can’t change the processor from an Intel to an AMD without also changing the motherboard.
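To see how quickly the matrix grows, and how much the one-variable-at-a-time approach prunes it, consider this short Python sketch. The component options are invented examples for illustration, not an actual HDXPRT test plan.

# How fast does a component test matrix grow? The options below are
# invented examples, not an actual HDXPRT test plan.
from itertools import product

components = {
    "cpu":      ["2.4 GHz dual-core", "3.0 GHz quad-core"],
    "ram":      ["4 GB", "8 GB", "16 GB"],
    "disk":     ["5400 RPM HDD", "SSD"],
    "graphics": ["integrated", "discrete"],
}

full_matrix = list(product(*components.values()))
print("Full matrix:", len(full_matrix), "configurations")  # 2*3*2*2 = 24

# One variable at a time: start from a baseline configuration and
# change a single component per run instead of testing every combination.
baseline = {k: v[0] for k, v in components.items()}
runs = [baseline]
for key, options in components.items():
    for option in options[1:]:
        runs.append({**baseline, key: option})
print("One at a time:", len(runs), "configurations")  # 1 + 1 + 2 + 1 + 1 = 6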

We are in the process of putting HDXPRT 2011 through a series of such tests. HDXPRT 2011 is a system-level, application-oriented benchmark for measuring the performance of PCs on consumer-oriented HD media scenarios. We want to understand, and share with you, how different components influence HDXPRT scores. We expect to release a report on our findings next week. It will include results detailing the effect of processor speed, amount of RAM, hard disk type, and graphics subsystem.

There is a tradeoff between the size of the matrix and how long it takes to produce the results. We’ve tried to choose the areas we felt were most important, but we’d like to hear what you consider important. So, what characteristics of HDXPRT 2011 would you like to see us test?

Bill

Comment on this post in the forums

Knowing when to wait

Mark mentioned in his blog entry a few weeks ago that waiting sucks. I think we can all agree with that sentiment. However, an experience I had while in Taipei for Computex made me reevaluate that thinking a bit.

I went jogging one morning in a park near my hotel. The park was relatively small, just a quarter mile around the pond that took up most of it. I was one of only a couple of people jogging, but the park was full. Some people were walking around the pond. There were also groups doing some form of Tai Chi in various clearings. The path I was on was narrow. At times, there was no way to get around the walkers without running into the people doing Tai Chi. That in turn meant running in place at times. Or, put another way, waiting.

Everyone was polite in these encounters, but the contrast between me jogging and the folks doing Tai Chi was stark. I wanted to run my miles as quickly as possible. Those doing Tai Chi were decidedly not in a rush. They were doing their exercises together with others, and the goal was to do them at the proper pace in the proper way.

That got me to thinking about waiting on my computer. (Hey, time to think is one of the main reasons I exercise!) There are times when waiting for a computer infuriates me. Other times, however, the computer is fast enough. Or even too fast, like when I’m trying to scroll down to the right cell in Excel and it jumps down to a whole screen full of empty cells. This phenomenon, of course, relates to benchmarks. Benchmarks should measure those operations that are slow enough to hurt productivity or are downright annoying. There is less value in measuring operations that users don’t have to wait on.
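One way a test harness could act on that idea is to score only the operations that cross a human-noticeable latency threshold. The Python sketch below is purely illustrative: the 100 ms cutoff is a common rule of thumb for perceived responsiveness, not an XPRT scoring rule, and the workloads are stand-ins.

# Illustrative sketch: record only operations slow enough for a user
# to notice. The 100 ms threshold is a rule of thumb for perceived
# responsiveness, not an XPRT scoring rule.
import time

NOTICEABLE_SEC = 0.100

def timed(name, fn, results):
    start = time.perf_counter()
    fn()
    elapsed = time.perf_counter() - start
    if elapsed >= NOTICEABLE_SEC:
        results[name] = elapsed  # worth measuring: the user actually waited
    # sub-threshold operations are ignored; nobody waited on them

results = {}
timed("load_photo", lambda: time.sleep(0.25), results)  # stand-in workloads
timed("scroll_row", lambda: time.sleep(0.01), results)
print(results)  # only load_photo appears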

Have you had any thoughts about what makes a good benchmark? Even if you weren’t exercising when you had the thought, please share it with the community.

Bill

Comment on this post in the forums
