Making a Fair Gfx Benchmarking Utility?
Moggie68 asks: "Whenever the big two release new GPUs and graphics cards that reach astounding heights with their benchmark scores, the same heated debate about unfair benchmarking utilities rises again. But what about the flip side of the coin? Would it really be that easy to construct a fair benchmarking utility for GPUs and graphics cards? What facts need to be considered? What problems solved?"
Just stick to using popular games. Seriously.
Here's the problem: ATI and NVidia have diverged a bit. They get performance upgrades from different optimizations/workflows. For this reason, performance is more a question of which card the game developer favors than it is about which card is better. Granted, what I'm saying isn't quite as black and white as that, but it's worth considering that if the benchmark uses an optimization that the game doesn't, then the benchmark is misleading.
I don't find video card benchmarks interesting, but I do enjoy CPU benchmarks. I'm a 3D artist, so render speed is very important to me. I recently had to go through the "Do I want a P4 or Athlon?" debate. Lightwave comes with benchmark scenes. You're supposed to load the scene, hit the render button, and write the number down. Some decent sites actually do the benchmark that way. That is a selling point for me, not the rest of those idiotic benchmarks that they throw in there. Yeah, like I care about how fast Office is.
I hope my point got across. Real world numbers are gold, theoretical numbers are pyrite.
"Derp de derp."
This is the problem: the graphics drivers check the process/executable to see what program is making the graphics calls. If it matches a known target profile (benchmarking, quake3, etc.), the graphics are tuned.
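A rough sketch of the kind of name matching described above, in Python for readability. Real drivers are native code and their detection logic is not public, so the profile table, the setting names, and `select_profile` are all made up for illustration.

```python
from pathlib import PureWindowsPath

# Hypothetical profile table: executable names and settings are invented.
KNOWN_PROFILES = {
    "quake3.exe": {"texture_filtering": "reduced"},
    "benchmark.exe": {"texture_filtering": "reduced"},
}
DEFAULT_PROFILE = {"texture_filtering": "full"}


def select_profile(executable_path):
    """Return the settings a name-matching driver might pick for this caller."""
    # Only the file name is compared, case-insensitively; the directory is ignored.
    name = PureWindowsPath(executable_path).name.lower()
    return KNOWN_PROFILES.get(name, DEFAULT_PROFILE)
```

With a table like this, `select_profile("C:\\Games\\Quake3.exe")` gets the tuned settings, while the same binary under any other file name falls back to the default.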
The problem here is that the Windows driver model allows the driver to check what program is making calls into it. This is not a bad thing by itself, so I wouldn't advocate getting rid of it.
So, let's say you make a new benchmarking program and you don't leak any copies out to the graphics people. What happens when you release it? It might work and be fair on the current batch of drivers, but as soon as the graphics people get their hands on it, there's nothing you can do to prevent them from "optimizing" (tuning down rendering) for your benchmark.
So maybe you can make a fair benchmark today. But as soon as you give it to anyone, don't bet on it being fair on the next driver revision.
-molo
Using your sig line to advertise for friends is lame.
There is no such thing as a fair benchmark. Each person's needs differ, and therefore a different product suits those needs best. The best thing to do is grab demos of the things you like to do with your video card, then head down to your local computer store and see how they run.
Benchmarkers can always just rename their benchmark programs to something else when testing. Isn't this how a lot of recent driver optimizations were discovered in the first place? How about a benchmark installer that installs a differently-named executable every time?
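A minimal sketch of that installer idea, assuming the benchmark ships as a single executable file. The function name `install_with_random_name` and the 16-hex-digit naming scheme are illustrative choices, not anything an existing benchmark does.

```python
import secrets
import shutil
from pathlib import Path


def install_with_random_name(source, install_dir):
    """Copy the benchmark binary into install_dir under a fresh random name."""
    source = Path(source)
    install_dir = Path(install_dir)
    install_dir.mkdir(parents=True, exist_ok=True)
    # Keep the original extension so the OS still treats the copy as an executable.
    target = install_dir / f"{secrets.token_hex(8)}{source.suffix}"
    shutil.copy2(source, target)
    return target
```

This only defeats detection by file name. The bytes of the copy are unchanged, so anything that inspects the file contents or the process memory would still recognize it.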
But what if the graphics drivers then use the memory space of the process to see if it is a benchmarking program running? You get into an "arms race" of sorts, like that between malicious code writers and antivirus companies, or crackers (as in people who crack programs) and shareware programmers.
...repeat it
Does anyone still care about MIPS, MFLOPS, Dhrystone, Whetstone, or SPEC? Why do we want to rehash history with GPUs?
If you want a synthetic benchmark, the companies will make their product work well with the benchmark and little else. When the inevitable happens (as it has with both major players), you should neither get upset nor demand a better benchmark; instead, laugh when someone fronts a synthetic benchmark score.
So you want to know if a card you are going to buy will work well for a game that is going to come out in 6 months to a year. We'd all like to know the future as well, I'd prefer a crystal ball.
"I don't know that atheists should be considered citizens, nor should they be considered patriots." George HW Bush
One possibility is to have each vendor create two test suites -- a suite that the vendor thinks highlights the best performance features of their own system and a suite that highlights the worst performance features of the competitor's system. For two vendors, this results in a total of four test suites (vendor 1's favorites, vendor 1's killer for vendor 2, vendor 2's favorites, vendor 2's killer for vendor 1).
Then run all four suites on both systems and take normalized averages. The best system can win only by being robust and of overall high performance. With four tests in all, the vendor's own "best foot forward" suite can't overweight the result. And with the other vendor looking for any weaknesses, the downsides of each vendor's system become quite evident.
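One way the "normalized averages" step could be computed, as a sketch. It assumes higher raw scores are better, that every card ran every suite, and that each suite is normalized to the best score recorded on it; the original proposal does not specify any of these, and the numbers below are invented.

```python
def normalized_averages(results):
    """results maps suite name -> {card name: raw score, higher is better}."""
    totals = {}
    for scores in results.values():
        best = max(scores.values())
        for card, score in scores.items():
            # Each suite contributes at most 1.0, so no single suite dominates.
            totals[card] = totals.get(card, 0.0) + score / best
    return {card: total / len(results) for card, total in totals.items()}


# Made-up scores for two vendors and the four suites described above.
example = {
    "vendor 1 favorites": {"card 1": 100, "card 2": 80},
    "vendor 1 killer for vendor 2": {"card 1": 60, "card 2": 30},
    "vendor 2 favorites": {"card 1": 90, "card 2": 120},
    "vendor 2 killer for vendor 1": {"card 1": 20, "card 2": 40},
}
averages = normalized_averages(example)
```

Because every suite is scaled to its own best result, a card that wins its own showcase by a mile still loses overall if it collapses on the competitor's killer suite.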
Such testing may not produce over-optimized one-application super-stars, but it should lead to well-rounded graphics boards for high performance on a range of graphical display tasks.
I bet that ATI and NVidia will never go for this approach because it would lead to real head-to-head fair competition as opposed to carefully staged, optimized, marketing-controlled demos.
Two wrongs don't make a right, but three lefts do.
- How do you benchmark image quality?
- How do you compare different performance advantages in different areas?
- How do you stop the card manufacturers from cheating on the tests?
The only way to test the first is with the human eye. You need to look at two images and make a subjective decision on which is better. And the programs that generally have the right amount of graphical frills are popular games.
The performance question is harder. But again, popular games level the playing field. When you benchmark using a game you know that programmers are actually using the features you are testing.
And finally, there is the matter of cheating. If a manufacturer is noticeably decreasing image quality for frame rate, he is usually "cheating." When image quality is maintained, it is an optimization. So again, it becomes a matter of subjective judgments of the human eye.
Subjective judgments are not so bad of course. A five star restaurant is only subjectively better than a two star restaurant. But usually that will mean a lot to the customer. So we can tolerate the errors that come from benchmarking cards from games pretty well. When manufacturers pull their tricks, you can bet that the review sites will be there to catch them.
OK, then how about benchmarking in Linux or FreeBSD. They both support Direct Rendering Manager. I'm sure that a vendor arms race would be a welcome sight in the free operating system arena.
Then the drivers will check an md5sum of the executable, or they'll search for certain signatures within the file. Plenty of options. It would be an arms race of sorts. There's no way to guarantee fairness.
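To show why renaming alone doesn't help against that, here is a sketch of a content-hash check in Python. Nothing here is taken from a real driver; `is_known_benchmark` and the `known_hashes` set are hypothetical.

```python
import hashlib


def file_md5(path):
    """Return the hex MD5 digest of a file's contents, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_known_benchmark(path, known_hashes):
    """True if the file's bytes match a known digest, whatever it is named."""
    return file_md5(path) in known_hashes
```

The digest depends only on the bytes, so a renamed copy matches. Changing even one byte of the file defeats this particular check, which is exactly where the arms race starts.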
-molo
Using your sig line to advertise for friends is lame.
The problem here is that the Windows driver model allows the driver to check what program is making calls into it. This is not a bad thing by itself, so I wouldn't advocate getting rid of it.
Hey, this ain't MSDN. Get your priorities straight!
meh.
My proposition: randomize the program name (as reported to the OS/scheduler).
-Billco, Fnarg.com
In graphics, everything is redundant because you really can't see that lone pixel among the other 1920x1440. So the solution is to render one out of every four polygons... tada, 4x performance.
-Billco, Fnarg.com
Those typical office/desktop benchmarks aren't real world.
;).
Why? Coz they don't have antivirus software running in the background. AV software running in the background could change results significantly.
In most offices, the desktop PCs have AV software installed. If they don't have AV software installed, they usually have worms and viruses and those tend to take up more CPU.
That's real world.
Which AV software to use in the benchmark is one question that they may not want to deal with.
But, hey, doesn't anyone want to know whether AV+apps works better with or without Hyperthreading enabled etc? Whether it works better with Athlons or P4s?
Oh well..
We take a bunch of gamers and group them by what video card they own. We give each of them the test board. After one month we take away the test board and give them their old one back. The benchmark is: How many out of 10 owners of board X would buy the test board? Because that's what you really want to know, right? And who better to tell you this than people who own the same board you do?
If all this should have a reason, we would be the last to know.
So...what exactly is wrong with this?
I can't see why you'd care whether a vendor is "cheating" or not. Let's say that you're a Tribes 2 fan. You run out and look at Tribes 2 benchmarks in reviews. The reviewer says something about image quality, and includes bits of screenshots (I vaguely remember this happening with the Riva128 and G200 the last time I purchased a 3D card for gaming). End of story.
Now, there are a couple of possibilities. First, both you and the reviewer can't see the image quality degradation that's taking place, and you do notice the speed increase. That's not cheating! The card vendor has just figured out a way to provide you with more resources that you care about at the cost of something that you don't even notice. We do this all the time with lossy compression in JPEG and MP3 -- you don't care about 90% of the data, but you do care about the size savings. People didn't care when lossy texture compression became the standard on video cards because the only thing that lossless compression gives them is a psychological "this is a flawless image".
Another possibility is that the reviewer or you notice image quality degradation. If this is the case, the card gets a lower image quality score. Big deal!
Finally, you may be worried about game-specific tweaking in that the game won't provide a representative sample of how the card will do on other games. This is *always* the case! Cards could perform quite differently on any set of games just due to the fact that designs differ, and different things form a bottleneck on different cards in different games.
Just let some reviewer sit down and try the stupid card out, and if they're enjoying the card...hey, who cares what hacks are included in the driver?
May we never see th
This would drive both vendors to improve the robustness of their chips and drivers. Knowing that the competitor is going to try to crash your system would put pressure on the development team to avoid or fix bugs.
These would be true test suites as opposed to nice speed demo suites. As a graphic board customer, I do want speed. But I would probably say that robustness has a higher implicit priority. A graphics chip that crashes is the last thing I want, regardless of how fast it is on some more limited set of code.
Two wrongs don't make a right, but three lefts do.
Perhaps I did not explain the idea well enough. Since manufacturer A also has to run the anti-manufacturer B test suite, any sleep calls will affect both of them. Because every card has to run ALL of the tests (both the "best-case" tests and "worst-case" tests of all cards), each manufacturer must make sure that their own card can handle whatever they are trying to throw at the competitor's card.
Sleep calls cannot bias the results unless the two cards have different definitions of "sleep." Bypassing sleep would not improve performance. I would assume that if one card ignored a sleep call, that would be scored as a failure by the card to execute a valid command.
Two wrongs don't make a right, but three lefts do.
then how about benchmarking in Linux or FreeBSD. They both support Direct Rendering Manager
I thought Microsoft was using Linux's and FreeBSD's non-support of DRM as a selling point for Windows.
Oh, that DRM.
Will I retire or break 10K?
Writing drivers that will survive running malicious code takes time away from addressing other programming issues, and the thing is that no one except your competitor is writing that kind of code into their app.
What if somebody finds a way to break Windows through a video driver bug? What if somebody puts that exploit into the next Windows worm?
The more fundamental problem is that all any kind of test can ever measure is your ability to do well at that test.
And if that test measures a video card's ability to process OpenGL instructions without bringing down the computer, I'm all for it.
Will I retire or break 10K?
Benchmarks are, or should be taken as, just guidelines.
In the real world there are a huge number of variables: old DLL files from previous drivers, IM clients running in the background, stale entries in boot config files that still affect performance, stuff hanging around since the last clean reboot, the physical environment, etc.
The surprise isn't how often we make bad choices; the surprise is how seldom they defeat us.
You need to find the video card that results in the most number of winners. Scrounge up cash, run a lan party, and get down info on the video cards that people are using.
The card that correlates to the most wins is obviously the superior video card.
id10t.
gutless coward
meh.
Once that's out of the way, the next step is to crank up scene complexity until the rendering rate drops. Crank up the polygon count, the texture count, the shader count, etc. until the card misses a frame refresh time. That's what matters when you're running 3D applications. It's also what matters to game developers - this tells you your resource budget.
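A sketch of that ramp in Python. It assumes you have some `frame_time(complexity)` callable that renders one frame at the given complexity and returns the time in seconds; that callable, the growth factor, and the starting point are all placeholders, not part of any real benchmark.

```python
def find_frame_budget(frame_time, refresh_hz=60.0, start=1000, growth=1.5,
                      limit=10**9):
    """Return the largest tested complexity that still fits in one refresh.

    Returns None if even the starting complexity misses the refresh interval.
    """
    budget = 1.0 / refresh_hz  # seconds available per frame
    complexity = start
    last_ok = None
    while complexity <= limit and frame_time(complexity) <= budget:
        last_ok = complexity
        # Grow geometrically; the +1 guarantees progress for small values.
        complexity = int(complexity * growth) + 1
    return last_ok
```

A real harness would time many frames at each step and use the worst one, since a single missed refresh is what you actually notice. The number that comes back is the resource budget, not a frame rate.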
Being able to re-render the same scene at higher than the refresh rate is meaningless. The main reason people have so much trouble with this is that the people who write game reviews don't program.
Randomize your benchmark. It'll take a few more runs to get an average performance figure, but then the benchmark is immune to cheating drivers.
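Roughly what that could look like, as a sketch. `run_scene` stands in for whatever actually renders a scene and returns a score; the scene parameters and their ranges are invented for illustration.

```python
import random
import statistics


def randomized_benchmark(run_scene, runs=10, seed=None):
    """Average a score over randomly generated scenes.

    Returns (mean, sample standard deviation) of the per-run scores.
    """
    if runs < 2:
        raise ValueError("need at least two runs to estimate the spread")
    rng = random.Random(seed)  # seed=None draws a fresh seed from the OS
    samples = []
    for _ in range(runs):
        # Hypothetical scene parameters; a real benchmark would vary far more.
        scene = {
            "camera_angle": rng.uniform(0.0, 360.0),
            "polygon_count": rng.randint(50_000, 500_000),
            "texture_set": rng.randrange(16),
        }
        samples.append(run_scene(scene))
    return statistics.mean(samples), statistics.stdev(samples)
```

Randomizing the workload makes it harder for a driver to special-case one fixed camera path or scene. It does nothing against a driver that simply recognizes the benchmark process and lowers quality across the board.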