Examining Benchmarking
VL writes "Benchmarks exist to determine how a particular piece of hardware performs in relation to itself, and to others. Question is, are readers getting the information they really need?"
Studies and benchmarks are so often biased. It's hard to get a study that isn't. Follow the money trail: look at the sponsor of the study.
no big sig
Benchmarks are inherently flawed for the reasons stated in the posts. Comparing hardware to itself and similar hardware means there's no external reference point. Comparing one thing to another is okay, but you can't get absolute numbers in a closed Platonic system.
Goedel's Incompleteness Theorem states that you can't define a system entirely in its own terms, and that any system needs to be defined by terms outside of it.
So, how can you accurately rate hardware based on similar hardware? To meet the GIT (Goedel's Incompleteness Theorem), you would need to compare the hardware with something outside of the system, so you have an external reference point. For example, if you're benchmarking graphics cards, you need to also compare them to something outside of that area of hardware.. so.. say, a graphics tablet, or an iPod.
So, say that the first graphics card is 0.7% compared to the iPod, we now have an external reference to use with the other graphics cards.. so a better card might be 10% compared with the iPod, or a few percent compared to the graphics tablet, which proves that the second card is better than the first, due to the respective ratings compared to the external objects.
This is just regular math. I have to say, it's pretty amazing what you can apply regular math to.. yes, even benchmarks!
It all depends on the range of exercisable aspects of the hardware that a particular benchmarking suite exercises. That's why you prefer a suite rather than a stand-alone benchmark. For instance, Top500.org ranks HPC machines according to LINPACK, on which the ES (Earth Simulator) of course does well due to its vectorization capabilities.
So, if you want to know about your hardware, you'd better run more than one benchmark and, more importantly, your 'problem code'. Yes, you want hardware that performs well for your problem. Something that is good in general is rather rare.
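A minimal sketch of that idea in Python, assuming three made-up workloads: the function names and the workloads themselves are hypothetical stand-ins, and `my_problem_code` is where your own code would go. It times each one several times and keeps the best run.

```python
import time

def time_workload(func, repeats=5):
    """Return the best wall-clock time in seconds over several runs of func."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best

# Hypothetical stand-in workloads; replace them with real ones.
def float_heavy():
    total = 0.0
    for i in range(1, 20000):
        total += (i * 1.0001) / (i + 1.0)
    return total

def memory_heavy():
    return sorted(range(20000), reverse=True)

def my_problem_code():
    # Placeholder for the code you actually care about.
    words = ("benchmark " * 2000).split()
    return {w: len(w) for w in words}

def run_suite(workloads, repeats=5):
    """Time every workload and return a name -> seconds mapping."""
    return {name: time_workload(fn, repeats) for name, fn in workloads.items()}

if __name__ == "__main__":
    suite = {
        "float_heavy": float_heavy,
        "memory_heavy": memory_heavy,
        "my_problem_code": my_problem_code,
    }
    for name, seconds in run_suite(suite).items():
        print(f"{name}: {seconds:.6f} s")
```

The number that matters is the one next to your own code; the others only hint at why it comes out the way it does.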
My favorite computers haven't been the fastest. In fact, I've been the most productive on systems that were objectively less impressive.
My favorite Operating Systems haven't been the ones with the best selection of software.
My favorite games haven't been the ones with the best graphics.
The reviews I find most valuable don't have the most complete set of numbers of why something's the best or worst.
It's interesting that the goal of benchmarks is to be as objective as possible, when it's the subjective that makes me want to buy or not buy something. But meanwhile the more the objectivity of the benchmark tests is in doubt, the less important the tests become. So I guess that means benchmarks don't mean anything to me one way or the other, huh?
Alex.
When I performed photographic quality control, I had a reference platform and true statistically valid performance data to base any decisions on. Unfortunately hardware sites don't exactly do the same thing. They use different hardware (usually provided by the vendor or a reseller looking for a plug), and everything becomes a variable. What I was taught about analyzing anything was to eliminate variables. Most benchmarks will work as long as you create your own reference platform, specify everything used in excruciating detail (driver version, etc.), and also place a disclaimer that the test is only good based on your hardware and setup. When I read a benchmark, I use it only as a guide. I do not take the numbers literally since I cannot reproduce the test. And that is where the problem lies in hardware site benchmarking. Anyone should be able to get the specific hardware mentioned, assemble it, install the OS, run the benchmarks, and get similar results. My money says they won't, because of "tweaked" drivers, benchmark program versions, or hardware, software, or OS settings that do not make it into the documentation or the column for the site. The only benchmarks I pay any serious attention to are SPECint and SPECfp, because there has to be full disclosure of all options used before SPEC will approve the result.
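To make "specify everything" concrete, here is a rough Python sketch that attaches a description of the test platform to every result. The report layout and the `driver_version` field are assumptions made up for illustration; only the standard `platform` and `json` modules are used, and anything they cannot see (drivers, BIOS settings, tweaks) has to be filled in by hand.

```python
import json
import platform

def describe_platform(extra=None):
    """Collect what the standard library can tell us about the test machine."""
    info = {
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "processor": platform.processor(),
        "python": platform.python_version(),
    }
    # Details Python cannot discover (driver versions, BIOS settings, ...)
    # are supplied by the tester.
    if extra:
        info.update(extra)
    return info

def make_report(benchmark_name, score, extra=None):
    """Bundle a score with the platform it was measured on, as JSON text."""
    report = {
        "benchmark": benchmark_name,
        "score": score,
        "platform": describe_platform(extra),
    }
    return json.dumps(report, indent=2, sort_keys=True)

if __name__ == "__main__":
    # "hypothetical-1.0" is a placeholder, not a real driver version.
    print(make_report("demo", 1.5, {"driver_version": "hypothetical-1.0"}))
```

A result published without that block attached is exactly the kind of number nobody else can reproduce.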
The problem is as a benchmark becomes widespread and respected, the incentive to cheat the mark increases at a much greater rate.
For less widely used benchmarks, it's possible to do one-offs in the lab and include the false results in the marketing material. The primary examples of this are SPEC, Dhrystone, and Whetstone. For a while Intel's compilers had recognition routines just for these benchmarks. Apple has always done tuned versions of the benchmarks.
Once a benchmark gets into the wild and is in a form that anyone with a website can just load without too much trouble on a machine, you get manufacturers actively moving to cheat the benchmark. The best examples are Nvidia's and ATI's optimizations that are specific to 3DMark and Quake III.
I don't know of anyone who would buy a piece of hardware solely on a benchmark. However, salesmen who can't sell are without peer in inventing excuses and shifting blame. So as long as you have sales goals that are unrealistic and salespeople who are good at inventing excuses, you will have engineering departments forced to cheat the benchmarks.
It's true, money changes everything.
And that's the only way to look at it. I use photoshop more often than anything else, and as long as a machine can run it well then it's passed my benchmarking tests just fine.
Where it becomes pointless is when a single simple benchmark is taken alone. If that were all you looked at, then a machine like a 1GHz G4 would own everything else going by RC5-72 benchmarks alone. 10 million keys/sec? No problem, quicker than any other machine like it on the market.
Look at that as just one benchmark among dozens and you form a better picture, that the G4 has a vector unit that performs exceptionally well, and you can get an idea how the rest of it performs.
Add up enough of those simple numeric benchmarks and all you get is one huge mess in your mind with no REAL idea of how a machine will perform other than theoretically. Best combine them all together and go back to running the app(s) you're likely to use most.
The bottom line is that you really can't put much trust in benchmarks. Well... That's not exactly true, but think: of those games and apps that you always see the same people run over and over again, how many of those do you use on a daily basis? Personally, I've read so many reviews that I don't even have to think about what a pixel shader is anymore, so it probably will come as no surprise that I skip through the mumbo jumbo they tell you about the card and go straight to the benchies. And it's always the same ones.
That's all well and good, and I guess it gives you a VERY generic view of how those particular things work, but how about real life performance? How about a screenshot in the HL mod Natural Selection when there are 15 turrets firing at bile bombing aliens with show_fps set to 1? Can we get something like that? I guess that would fall in there with fill rate. And before you tell me that's an arcane game, let me direct you to the little X on the top right of your browser. I don't care.
You can get a very good idea about the speed of a card, but you have no idea what the card will have trouble with until you load up your copy of Star Wars : Pod Racer just to be greeted by a big white screen when the race starts. That's one thing I really miss about 3dfx. Their cards worked. Always. Well, at least they did at the time.
... on mainframes in the old days. The idea of a benchmark is to determine how your workload will perform on a given platform. The key here is "your workload". Using synthetic benchmarks is a great way to determine relative performance, if your workload is running synthetic benchmarks. For most people this isn't the case.
The problem is that every workload will have a different I/O and instruction mix. Each instruction has a different execution time, and the performance of I/O devices is frequently a function of the access patterns to data.
As a result, a synthetic benchmark may be a poor indicator of the result from the actual execution of your individual workload. These benchmarks are intended to provide guidance, and potentially identify platform performance bottlenecks. That's all. Reading any more into them is the fault of those who use the results improperly.
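A toy worked example of the point about instruction and I/O mix, in Python. Every number here is invented purely for illustration: two hypothetical platforms with different per-operation costs, one synthetic benchmark mix, and one "real" workload mix. The estimate is just the sum of count times cost over the operation classes.

```python
def estimated_time(mix, cost_per_op):
    """Sum of (operation count * cost of that operation) over the mix."""
    return sum(count * cost_per_op[op] for op, count in mix.items())

# Invented per-operation costs in arbitrary time units.
platform_a = {"int": 1.0, "float": 4.0, "io": 50.0}
platform_b = {"int": 2.0, "float": 1.0, "io": 80.0}

# Invented operation counts: the synthetic test is float-heavy with no I/O,
# while the real workload is mostly integer work plus I/O.
synthetic = {"int": 100, "float": 900, "io": 0}
my_workload = {"int": 500, "float": 50, "io": 100}

if __name__ == "__main__":
    for label, mix in (("synthetic", synthetic), ("my workload", my_workload)):
        a = estimated_time(mix, platform_a)
        b = estimated_time(mix, platform_b)
        winner = "A" if a < b else "B"
        print(f"{label}: A={a:.0f} B={b:.0f} -> platform {winner} is faster")
```

With these made-up figures, platform B wins the synthetic test (1100 vs 3700) while platform A wins the real workload (5700 vs 9050), which is the whole problem in miniature.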
Can You Say Linux? I Knew That You Could.
What matters is how much stuff you can draw per frame time, not how many times you can redraw it during a single frame time. 3D benchmarks should gradually increase the scene complexity until the frame rate drops. Often, there's a huge performance drop when the onboard memory of the graphics board fills up. Running old games at huge frame rates won't show that.
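The "increase scene complexity until the frame rate drops" test could be sketched like this in Python. The search loop is the actual idea; `simulated_frame_time` is a made-up cost model (linear cost per polygon, with a penalty once a hypothetical onboard memory limit is exceeded) standing in for a real renderer, and all of its constants are assumptions.

```python
def max_sustained_complexity(frame_time_fn, target_fps=60.0,
                             start=1000, step=1000, limit=1_000_000):
    """Largest tested complexity whose frame time still fits the frame budget."""
    budget = 1.0 / target_fps
    best = 0
    complexity = start
    while complexity <= limit:
        if frame_time_fn(complexity) > budget:
            break  # frame rate dropped below the target
        best = complexity
        complexity += step
    return best

def simulated_frame_time(polygons, per_poly=1e-7,
                         memory_limit=100_000, spill_penalty=10.0):
    """Invented model: polygons past the memory limit cost ten times more."""
    in_memory = min(polygons, memory_limit)
    spilled = max(polygons - memory_limit, 0)
    return per_poly * (in_memory + spill_penalty * spilled)

if __name__ == "__main__":
    with_cliff = max_sustained_complexity(simulated_frame_time)
    no_cliff = max_sustained_complexity(lambda p: 1e-7 * p)
    print("sustained polygons with memory cliff:", with_cliff)
    print("sustained polygons without cliff:", no_cliff)
```

In a real benchmark `frame_time_fn` would render and time an actual scene; the reported score is then a scene size rather than a frame rate, which is what exposes the memory cliff.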
Scene complexity is the limiting factor for game developers. Artists are always saying "I need a bigger poly budget". If benchmarks focused on scene complexity, we'd have gigabyte graphics boards, and "wow, you can see every eyelash" scene complexity.
We also need more intensity depth in graphics boards, to clean up that murky look so typical of games. Rendering really should be done into at least 16 bits of intensity, then sent to the screen through a film-like gamma conversion. That's how it's done in offline renderers for film.
I am not as interested in benchmarks as I am in whether my hardware will work well with my software. If I buy a video card, will it run my apps? Perhaps some of these review sites should take the 10 most popular applications, like games, a compiler, a database, etc., and tell you if your hardware will run them without hangups or hiccups.
The other bad thing about benchmarks is you will probably not have the same motherboard/ram/cpu as the test system.
Rosco: "If brains were gunpowder, Enos couldn't blow his nose."
1. Acquire your piece of test equipment (video card, motherboard, tower case)
2. Hold the equipment 3 to 5 feet above the bench surface
3. Release. Gravity will take care of the test
4. Measure the mark left in the bench by the equipment. Bigger mark = better equipment.
This might sound like I am stating the bloody obvious, but it's true. I think there are several facets to good benchmarking (based on my own experiences and reading other reports)....
1, Choose a test/workload that is representative of what *you* will be doing. There is no point in looking at SPECint2000 if you are going to be running an I/O intensive application like an RDBMS. Try to run or study tests that are relevant to the intended use of the system/component you are benchmarking.
2, Take note of things like compiler flags etc. These are important in tests like SPEC, as your results can vary wildly according to things like optimisation level. Some compilers produce faster code on certain CPU families and not on others. This is a reason why a lot of vendors will build their own compilers and test with them (e.g. SGI, SUN, DECPAQ).
3, Look at the full disclosure notice in the benchmarks. Take a look at the system configuration used. This is particularly important, IMHO, on tests like TPC-C. The score you see might be based on a really wacky config, like most of the figures at the top of the list. For example, look at the Proliant figure (709k) and look at the config: 32 x 8 way servers to run a single database. Then compare it to a 64-way SuperDome or 32-way p690. Which config makes more sense? For a database, I would likely go with the single system for simplicity's sake. On another application, maybe the cluster would make sense.
4, Compare apples to apples. This is the hardest part, as CPUs, OSes, I/O, apps, compilers etc etc all vary across platforms. I like to try to compare one variable if possible. To take TPC-C again, I try to compare DB against DB, Cluster against Cluster, SMP against SMP etc etc. There is nothing to be gained, IMHO, from comparing MS-SQL server in a cluster on Xeon with Win2k3 to Sybase on a SF15k running SPARC Solaris. How do you properly compare these two results? Maybe the solution would be to look at SQLServer on one system against another, or Sybase vs Oracle on a similar Unix system.
5, YMMV. Benchmarks are only ever an indicator of performance, not a guarantee. I tell my customers this all the time. They represent a result with a particular system, data set, O/S, tuning settings etc etc at a point in time. Other people's results with a similar config might differ considerably.
I could go on forever, but the above are my 2c
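As a rough illustration of the "compare one variable" point in the list above, here is a small Python sketch that takes a set of published results and keeps only the pairs whose configurations differ in exactly one field. The run names and configuration values are invented examples, not real benchmark submissions.

```python
from itertools import combinations

def differing_keys(config_a, config_b):
    """Names of the configuration fields on which two runs disagree."""
    keys = set(config_a) | set(config_b)
    return sorted(k for k in keys if config_a.get(k) != config_b.get(k))

def comparable_pairs(results):
    """Pairs of runs that differ in exactly one configuration field."""
    pairs = []
    for (name_a, cfg_a), (name_b, cfg_b) in combinations(results.items(), 2):
        diff = differing_keys(cfg_a, cfg_b)
        if len(diff) == 1:
            pairs.append((name_a, name_b, diff[0]))
    return pairs

# Invented example configurations.
results = {
    "run1": {"db": "Sybase", "os": "Solaris", "topology": "SMP"},
    "run2": {"db": "Oracle", "os": "Solaris", "topology": "SMP"},
    "run3": {"db": "MS-SQL", "os": "Windows", "topology": "cluster"},
}

if __name__ == "__main__":
    for a, b, field in comparable_pairs(results):
        print(f"{a} vs {b}: only '{field}' differs")
```

Here only run1 and run2 survive the filter, since they differ in the database alone; run3 changes three things at once, so any gap in its score cannot be pinned on one of them.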
I don't know.. The higher the 2001SE score, the higher the FPS in a game, typically. That is, unless someone's drivers are cheating.
He didn't even mention 3DMark2003, which does a more comprehensive job testing modern GPU features and is included on any benchmarks of 'modern' (aka DX9 supported) cards.
Think about it, in 2000 when they were working on the 3dmark 2001, directx8.1 wasn't even done; to my knowledge, most of DX8 wasn't even used in 3dmark 2001se.. Since then, cards came out with tons of new feature sets (directx9, AGP 8X, etc) and there was simply a lag time between good benchmarking software.
Now, I do agree with charting performance over time. This would be much more handy when doing comparisons of AMD and Intel processors. I get the same over-all frame rate with my AMD 2400 as an Intel @ ~2.6gig. But the Intel w/ a faster bus will likely not be getting those split second ticks where the AMD is 100% occupied or the FSB is flooded.
I'm not knocking AMD at all. I can just tell a difference in the overall smoothness of a CPU intensive game. When I bought mine, I spent about 3/4 of what I would have spent on an Intel rig and got around 3/4 of the performance.
It all works out once you stop paying attention to a marketing department. People always say you can't trust advertising, but act so surprised when a company is exposed for making a false claim of some minor sort.
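A small Python sketch of why charting over time says more than one overall frame rate. The two frame-time traces below are invented: both average out to the same FPS, but one has occasional long frames, which is the kind of stutter an average hides.

```python
import math

def average_fps(frame_times):
    """Frames rendered divided by total time, i.e. the usual headline number."""
    return len(frame_times) / math.fsum(frame_times)

def worst_frame_time(frame_times):
    """Longest single frame in seconds, a crude measure of stutter."""
    return max(frame_times)

# Invented traces, in seconds per frame.
smooth = [0.020] * 100                 # every frame takes 20 ms
spiky = [0.010] * 90 + [0.110] * 10    # mostly 10 ms, with ten 110 ms hitches

if __name__ == "__main__":
    for name, trace in (("smooth", smooth), ("spiky", spiky)):
        print(f"{name}: {average_fps(trace):.1f} fps average, "
              f"worst frame {worst_frame_time(trace) * 1000:.0f} ms")
```

Both traces report about 50 fps on average, yet the second one has frames more than five times longer than the first, and that is the difference you feel in play.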
you get what you pay for.
Also, instead of complaining about poor benchmarks in real-world situations, you should write the various game developers and request they add, or consider adding, a benchmark to their game engines. Having to 'devise' a way to test game performance probably isn't going to result in wide-spread adoption of that particular benchmark. id Software's engines have always come with built-in benchmarks (timedemo), thus making them very easy to test. That's why you always see the games that use id's engines in benchmarks.
That brings me to my final point, he mentions that StarWars game should be tested instead of Q3, yet it uses the same engine. Sorry, more copies of Q3 exist, and since any game using that engine doesn't bring anything new to the table, might as well stick with it. eh?
There's Lies, there's damn Lies and finally there are benchmarks.
Robert
It seems rather obvious that we need a paradigm shift in the way we benchmark our hardware. I like benchmarks of things I actually do with my computer. For example, the time it takes a setup to encode an mp3 or svcd file. Some people are using benchmarks like these, but there is no readily available program suite that benchmarks your system using these real life scenarios. Sure, I could do them myself, but I wouldn't know how my system performs compared to other systems if there isn't a standard benchmark.
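Doing it yourself is at least easy to script. Below is a minimal Python sketch that times an external command from start to finish. The command used here is only a stand-in (it launches the Python interpreter on a trivial job); in practice you would pass the command line of whatever encoder you actually use, which is deliberately not guessed at here.

```python
import subprocess
import sys
import time

def time_command(argv):
    """Run an external command to completion and return wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(argv, check=True)  # raises if the command fails
    return time.perf_counter() - start

if __name__ == "__main__":
    # Stand-in job: replace this list with your real encoding command.
    stand_in = [sys.executable, "-c", "sum(range(1000000))"]
    print(f"elapsed: {time_command(stand_in):.3f} s")
```

The missing piece is still the one described above: without an agreed input file and an agreed command, the numbers from two people's machines are not comparable.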