AGP Texture Download Problem Revealed
EconolineCrush writes "The latest high-end graphics cards are capable of rendering games at 1600x1200 in 32-bit color at jaw-dropping frame rates, but that might be all they're good for. For all their gaming prowess, all of these cards have horrific AGP download speeds that realize only 1/100th of their theoretical peak. This article lays it all out, testing video cards from ATI, Matrox, and NVIDIA, and clearly illustrates just how bad the problem is. While these cards have no problems rendering images to your screen, you're out of luck if you want to capture those images with any kind of reasonable frame rate via the AGP bus."
Correct me if I am wrong, but does it really make a difference if a card has 128 MB of onboard RAM? AGP's main benefit is texture streaming from the system RAM to the video card, but actually, with 128MB of RAM on a card, I don't believe it is even an issue.
I'd certainly expect the AGP bus to be used asymmetrically. How often do you want to do high-speed data capture with a card that's primarily for output?
The only situation I can see where you'd want more than PCI bandwidth coming back would be uncompressed HDTV capture, and there are better ways to do that (grab the raw broadcast stream, for example).
-Yarn - Rio Karma: Excellent
Notice that they're quick to point out the problem isn't likely a hardware issue. There should be plenty of bandwidth on the AGP bus, but graphics chip makers don't seem to have written their drivers to handle transfers from AGP cards to main memory properly.
Basically, if enough people want to use their cards in this manner, NVIDIA, with their super-duper driver support, will add it. 'Nuff said. Or whoever knows how (I sure don't!) can write a driver for Linux that takes this into account.
In any event, there's another issue he doesn't really touch upon; while he mentions that a single frame at 1600x1200@32bit colour is 7.5MB, he ignores the fact that a 30fps movie would require (30*7.5)=225MB per second uncompressed; you either have to have that much disk bandwidth or have enough CPU grunt to compress that on the fly. I guess a dedicated MPEG encoder card could help, but your average box is going to have trouble keeping up with on-screen gibs, rocket trails and blood splatters and encoding video.
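The arithmetic above is easy to check. A minimal sketch, assuming uncompressed frames and decimal megabytes (the function names are made up for illustration); the exact figure comes out slightly above the article's rounded 7.5MB per frame:

```python
def frame_bytes(width, height, bits_per_pixel):
    """Size of one uncompressed frame in bytes."""
    return width * height * bits_per_pixel // 8


def stream_mb_per_s(width, height, bits_per_pixel, fps):
    """Uncompressed video data rate in megabytes (10**6 bytes) per second."""
    return frame_bytes(width, height, bits_per_pixel) * fps / 1e6


one_frame = frame_bytes(1600, 1200, 32)
rate = stream_mb_per_s(1600, 1200, 32, 30)
print(f"{one_frame / 1e6:.2f} MB per frame, {rate:.1f} MB/s at 30 fps")
```

Either way you land in the 225-230MB/s range, which is the number your disks or your encoder have to keep up with.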
Should speed things up a bit, though last time I checked the Linux kernel didn't support it, even on Alphas. It's been part of the AGP spec from the beginning.
Someone please correct me if I'm wrong.
Peter
www.alphalinux.org
"However, no manufacturer has presently made this aspect of driver performance a priority." .. Will it justify the cost
Why should they? Was anybody complaining until now? The well won't come to the horse; the horse has to go to the well to drink.
So unless a large number of people want it, nobody wants to mess around with a perfectly working driver.
And it is not a piece of cake. Recording its own renderings the software way would be a bitch; the best way would be to provide an access point on the bus itself, though that would play havoc with the board timings and noise issues.
In the end it will all come down to
In summary, who the fuck cares?
"To put these results into perspective, a single frame rendered at 1600x1200 with 32 bits of color weighs in at about 7.5MB. Double that to 64-bit color, and it's 15MB per frame. And a single image at 1600x1200 in 128-bit color is over 30MB."
Huh? Why on earth would they want 128-bit colour. AFAIK the human eye can't tell the difference beyond 24-bit, and 32-bit is just there to make the processing a bit easier. Maybe someone can correct me on this, but it seems an extremely poorly thought out complaint.
"all of these cards have horrific AGP download speeds that realize only 1/100th of their theoretical peak...you're out of luck if you want to capture those images with any kind of reasonable frame rate via the AGP bus."
As the quoted article clearly indicates, the problem lies with the drivers rather than with the cards, which is what the original poster intimates.
And the underlying reason is immediately understandable: after years of AGP cards and years of no one really raising this issue (except, now, developers of video-editing software who could benefit), it seems clear that there isn't much demand for this kind of performance. In the (near?) future there might be, but why should these companies spend money working on driver performance in areas like this when customers really only care about how well Quake will run?
When people are willing to pay for these features is when companies will pay to build the requisite drivers. And that is how it should be.
The article explains that once the images are rendered out to the display, they are simply discarded. Sure, for any sort of video capture or whatnot, that sucks. However, the article does not attempt to answer why video card manufacturers do this, or whether there are any cards that do take advantage of the AGP 4x bandwidth. My guess is cost. If all AGP video cards provided video feedback into the bus, you'd probably be looking at a non-consumer-level product. And you know what? All I do IS use my GeForce to play video games. If dumping the frames after they are rendered keeps the cost of my card down, I'm probably happier for it. Quite simply: does this matter for the average consumer?
Lacking <sarcasm> tags,
In Quake 3. I've been working on a music video using some Q3A demos that some friends and I have made. I've been using the cl_avidemo 30 setting before playing the demos to have it output 30 frames per second to the hard drive. Quake 3 will only dump the first 10,000 frames (or about 5 minutes of footage), but it takes about 30 minutes to dump all of that at 800x600x32 to TGA files. How much of that is compression delay and how much is downloading over the AGP bus, I'm not sure (anyone care to help me on this?), but the difference in speed when capturing frames vs. not capturing frames is quite noticeable.
The AGP bus concept was created to move textures and I guess hardware driver programmers optimized for this.
A quote from the readme file: In the second mode it renders, displays and downloads the same image to the PC.
This is probably not what driver programmers were expecting. Wrong direction of data.
If I'm reading this article right, they're claiming that it also hinders normal screen captures.
That would mean that software like VNC would have much higher performance, if the drivers were updated, the way these guys are demanding. (Wouldn't it?)
That'd be fantastic!
Education is the silver bullet.
The bottleneck is network bandwidth, not the AGP bus, unless you are running over FireWire or gigabit Ethernet.
I know nothing about anything, obviously, but I can see that game designers might think it nice to be able to send stuff to your screen but for you to be unable to send it to storage somewhere.
This *is meant to be* a dumb question. Mod me down if I'm wrong; it's only Karma.
Way back when I was working on libfbx, we (the two main libfbx developers) learned of a 48-bit framebuffer developed by SGI. It's used mainly to render special FX for Hollywood. After several composited layers with various effects on an 8-bit per channel system, you can really start to notice the quantization artifacts. Moving to 12- or 16-bits per color channel (depending on whether there's an alpha channel) makes a huge improvement. I've never heard of any 16 byte per pixel (128bit) image format. It'd probably be something like 16-bits per channel RGBA (64), plus 32-bit depth buffer (96), plus 16-bit stencil and select(pick) buffers (128).
A solution to the problem with music today
If I'm reading the article correctly, they're claiming that you can barely get 30 frames per second, full-screen. If you want to do a diff, and send the delta, you potentially need to be able to capture the full screen to do it. If you can only capture at 30 frames per second, you are LOCKED at 30 frames per second, even if you try to compress the output, and send only deltas.
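To make the point about deltas concrete, here is a rough sketch. The 10MB/s readback figure is borrowed from other comments in this thread and the 1024x768 desktop is just an example; the function names are hypothetical. The diff can only run as often as full frames arrive, so the readback rate caps the update rate no matter how small the deltas are:

```python
def max_capture_fps(readback_mb_per_s, width, height, bytes_per_pixel=4):
    """Upper bound on full-frame captures per second for a given readback rate."""
    frame_mb = width * height * bytes_per_pixel / 1e6
    return readback_mb_per_s / frame_mb


def changed_rows(prev, curr):
    """Indices of scanlines that differ between two full frames.

    Both frames must already be in system memory, which is why the
    diff cannot run faster than the capture rate.
    """
    return [y for y, (a, b) in enumerate(zip(prev, curr)) if a != b]


# Assumed numbers: ~10 MB/s readback, 1024x768 desktop at 32 bits per pixel.
print(f"capture ceiling: {max_capture_fps(10, 1024, 768):.2f} fps")

prev = [bytes(8), bytes(8), bytes(8)]
curr = [bytes(8), b"\x01" + bytes(7), bytes(8)]
print("rows to send:", changed_rows(prev, curr))
```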
A couple of salient points come to mind when reading this article:
1) Recording games/presentations/etc. The reason why we don't do it is because if the system was capable of generating it real time in the first place, it's far less space intensive to record the parameters of the animation than the output. i.e. It's cheaper to say "Daemia fires rocket at these coordinates" than record an MPEG of said rocket shot. AND, as hardware gets better, your recording does too.
Which leads me to point 2:
2) Since it's cheaper to capture realtime animation by capturing parameters, the only use of the capture function would be NON-realtime applications - i.e. getting your Geforce5TiUltraPro to render an extremely complex scene with incredible realism at 1 fps. That's not a typo. If we have 10MB/s back-into-the-PC bandwidth and each super high resolution shot takes 10MB on average, we have a wonderful solution working at 1 fps. Spend the fill rates on 600 passes for each pixel or something like that. Imagine the quality of the scenes! Capture the damn things and be glad you're not rendering at 1 frame per hour like they were 5 years ago.
Repeat after me - if you're rendering for posterity you don't need real time... That'll come eventually.
-JackAsh
You think the **AA would ever allow this: the ability to make a perfect digital copy of whatever is displayed on your screen? Now your monitor will have to be disabled every time a copyrighted work is displayed on your screen.
A stunning example of stating the obvious.
The hardcore 3D gamer market is small enough; I can't see manufacturers busting their humps to serve an even smaller one.
...that I have ever read. Either that, or I am missing something here... The idea that graphics subsystems have 'bandwidth to burn' is kind of ironic, given that every graphics chip is ultimately held back in performance by the amount of bandwidth available to it - especially when using high quality options like anti-aliasing. The main focus of the article is actually a very niche segment... the idea of transferring rendered images back over the AGP bus for TV / film / etc. is a joke... Rendering at high quality takes a huge amount of bandwidth (i.e. textures and geometry)... as someone else pointed out, transferring back high-res images would take up over 200MB/s - that's a quarter of your AGP bandwidth! And that is without taking into account contention and timing issues in uploading/downloading, which would mean that you simply couldn't realise the full potential of the bandwidth without a lot of other (expensive?) hardware... The simple fact is that for production uses, you would be *far* better off taking a stream of data from the DVI connector, and storing that for later use... Screen capture for business use is a reasonable point - however, when does that require 3D rendering to be taking place? There should be no contention and no reason why the AGP bus couldn't be utilised fully - although would the graphics companies make enough out of this to justify the effort? As for internet streaming - how many people have access to bandwidth fast enough for high quality, full screen video streaming? Enough said...
Since when did textures become video frames and vice versa? I can't think of a scenario in which anybody would want to download *textures* from the GPU. If it had been video frames, that would be another matter. If the benchmark actually measures *texture download* from the GPU then something is really wrong with this picture.
"Fighting terrorists with millitary might is like killing a mosquitor on your Dad's forehead with a rifle."
When you record video it is normally compressed by hardware or a DSP. They are compressed for a damn good reason.
Uncompressed, say just 1600x1200x24-bit, is about 6MB per frame. At, say, 70 frames/sec, that is about 420MB a second to store to disk.
So what exactly are you going to do with that much data? If you had 512MB of RAM you could hold 1 second's worth.
Forget a hard disk, even a 3 disk raid doesn't have that sustained IO rate.
Have you ever seen the price of high end 3D Cards?
They start in the $1500 range and go WAY up.
I had a guy at the local base show me a card (forgot the name) that they spent over $3000 on (for doing 3ds max rendering) and they couldn't figure out why a geforce 2 mx was better in quake 3. The performance on the 3d design was amazing compared to a retail card.
They don't want to kill their high-end market, so it's unlikely that you'll see drivers that take full advantage. $$$ is king.
Obligatory comment on how every day, /. looks more and more like the random story generator.
Have I read this right? They want to be able to take the output of the card (say, playing a game of Quake) and save that output into RAM, then maybe onto disk as a video?
If so, then why not just use a system that has been designed to do such a thing? My SGI O2 has been doing it since 1996.
Mark
Not 30 frames a second. 8 frames a second assuming you don't use a larger resolution.
Users could actually record game output in real-time...a compressed movie of your game play saved on your HD.
Or you could always record a game demo (available from Quake 1 onwards, I believe): much less data to handle. If you then really do want to convert the demo to a video of some sort, do it after the game, when you also have the time to MPEG-or-whatever encode it.
Despite the popularity of Internet streaming, it is not currently possible to stream live output from graphics cards over the Internet. The connections, processors and codecs are all fast enough today. Sadly, all of this horsepower is being held back by one remaining weak link: the texture download speed of today's graphics card drivers.
Excuse me? The bandwidth off the graphics cards they test is in the 10 megabyte/s range! Not many users have that sort of bandwidth on their internet connection.
How about the obvious for video production... since going out isn't a problem... why not just hook up a recording device (could be digital media) to the video out port of the video card.
Does this really have to be over-engineered?
Skiers and Riders -- http://www.snowjournal.com
from the article
After playing there could be a compressed movie of your game play saved on your HD. On a reasonably fast machine you could actually record your game play digitally to your DV camcorder as you play or even compress and burn it to a Video CD or DVD at the same time you are actually playing.
Maybe I'm missing something, but why should it be so vital to write the output of your video card to the HD?
GPUs are widely available, so it would be enough to save the meta information needed to recreate the images, send that, and let my friends' own video cards recreate the images of me fragging the enemies.
Besides that, to save 50 fps at 1600x1200x32 (about 7MB per frame uncompressed), 350MB/s are needed; even with a compression ratio of 1/350 (possible) that would mean no more than 10 minutes on a CD.
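A quick check of the CD estimate, as a sketch only. The 700MB disc capacity is an assumption (the comment doesn't give one), and the helper name is made up:

```python
def minutes_on_disc(disc_mb, raw_mb_per_s, compression_ratio):
    """Minutes of footage that fit on a disc at a given compression ratio."""
    compressed_mb_per_s = raw_mb_per_s / compression_ratio
    return disc_mb / compressed_mb_per_s / 60


# 50 fps * ~7 MB per frame = 350 MB/s raw, compressed 350:1 down to 1 MB/s.
# Assumed capacity: a 700 MB CD-R.
print(f"{minutes_on_disc(700, 350, 350):.1f} minutes")
```

That lands a little over 11 minutes, so the "roughly 10 minutes" ballpark holds.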
Many video cards hold the frame buffer and textures in a private format optimized for the video processor. This means that if you want to read them, there may be an uncompress or untiling operation being done without your knowledge. This is especially true for textures. These operations may be computationally intense, and any memory that can be written in AGP space is uncached, and so slow for the processor to access. In general, cards are optimized for drawing and display, not for readback of the buffers. Another problem is that access to host memory for writing (via the AGP bus) is not immediate. The card is competing with the processor.
I noticed this problem when I was trying to take screenshots in FireArms [ a half-life mod ] I slaughtered some dude thru the throat and had the reflexes to hit F5, but I guess not cuz not only did the game lag for about 3-5 seconds, but I failed to capture the image I wanted! doh!...
* Their graphics cards would become invaluable for rendering production output for TV, film and video
:-)
:)
Oh yeah, the current market for this is huuuge, NOT! When and if the need arises I'm sure the card manufacturers will address this. Right now it's FPS that count and I don't think GFX companies are going to waste engineering time on this useless feature without any significant return on investment.
* Users could actually record game output in real-time with little impact on game performance. After playing there could be a compressed movie of your game play saved on your HD. On a reasonably fast machine you could actually record your game play digitally to your DV camcorder as you play or even compress and burn it to a Video CD or DVD at the same time you are actually playing
This one is just silly. Why not record the game engine commands instead of the video card output? Oh wait, that has been possible for years in most FPS games, no? What these people are proposing is to capture high-resolution images, compress them (eeks), and then stream them out to a TV screen that probably has only 1/3 the resolution of the original capture. What a great way to waste time!
* Screen capture software that grabs motion images of user interfaces for the purposes of tutorials and training is a vital business application
Haha, this one takes the cake. Most of these tutorials do not need high FPS numbers to be useful at all. And even more importantly, a lot of these applications simply script the real application to demonstrate the needed behaviour. You can't beat that.
* Despite the popularity of Internet streaming, it is not currently possible to stream live output from graphics cards over the Internet. The connections, processors and codecs are all fast enough today. Sadly, all of this horsepower is being held back by one remaining weak link: the texture download speed of today's graphics card drivers.
Woehahhahaa
I'd rather fetch a UDP stream of game engine commands render the game action on my side of the Internet, thank you very much.
What a joke
-adnans
"In short: just say NO TO DRUGS, and maybe you won't end up like the Hurd people." --Linus Torvalds
Our ray intersection algorithm implemented on the GPU (an "old" Radeon 8500) was able to intersect 114M rays per second. This was loads faster than the best CPU implementations, which could handle between 20M and 40M intersections per second.
But when we tried to implement a ray tracer based on this, and an efficient one that didn't intersect every ray with every triangle, the readback rate killed us. Our execution times slowed down to the low end of the fastest CPU implementations.
And the readback delay seems to be completely due to the drivers, which apparently still use the old PCI-bus code. If the drivers could use the full potential of the AGP bus, our ray tracer could approach twice the speed of the best CPU ray tracers.
If the drivers are truly the only issue and not the hardware, wouldn't this be a great opportunity for the XF86 guys and whoever writes the particular tdfx modules to optimize Linux first?
"No, Mr. Valenti, sir, you don't understand, we have to use Linux. It's the only game out there for our CG budget. Windows can't do RAM write-back with decent FPS, and commodity GPUs are 20 times cheaper..."
Wouldn't that suck for them... at least it would be amusing.
Novel theory: Modern Man evolved from psychopath
Somehow this doesn't surprise me.
I noticed long ago that my simple Pinnacle TV tuner card (PCI) has a *much* better card-to-memory transfer rate than my Matrox eTV AGP display/tuner card.
With the Pinnacle I can capture video, let the CPU handle on-the-fly compression in Indeo (at maximum resolution, or even straight to DivX at lower resolutions), while the image is tee-ed to the display adapter at the same time.
With the Matrox, capture is only possible if the compression is done on-card, otherwise you get dropped frames because the bus can't keep up.
The Matrox is the better card for practical use, even though the tuner is slightly less in quality, because it does have a very good compressor on board (real-time MPEG2 at DVD level bitrate and resolution, or lower if you want to -- I usually capture at 8 Mbps max resolution, and later recompress it through DivX).
That's what render-to-texture is for, you don't need to read data back to the CPU.
b) split world/image-space occlusion culling.
This wouldn't be too useful for realtime graphics anyways, because of the way the 3D graphics pipeline works. The CPU can already be processing data a few frames ahead of what the GPU is currently working on. If you read back data from the card every frame, you have to wait for the GPU to finish rendering the current frame before you can start work on the next one.
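Here's a toy model of that stall, nothing more. The per-frame timings are hypothetical and the function names are invented; the only point is that overlapping CPU and GPU work costs max(cpu, gpu) per frame, while a synchronous readback every frame costs the sum of all three stages:

```python
def pipelined_fps(cpu_ms, gpu_ms):
    """CPU and GPU overlap, so the slower stage sets the frame time."""
    return 1000.0 / max(cpu_ms, gpu_ms)


def synchronous_readback_fps(cpu_ms, gpu_ms, readback_ms):
    """CPU waits for the GPU to finish, then for the readback, every frame."""
    return 1000.0 / (cpu_ms + gpu_ms + readback_ms)


# Hypothetical timings: 10 ms of CPU work, 12 ms of GPU work, 8 ms readback.
print(f"pipelined:       {pipelined_fps(10, 12):.1f} fps")
print(f"sync readback:   {synchronous_readback_fps(10, 12, 8):.1f} fps")
```

Even a readback that is fast in isolation hurts, because it also throws away the overlap.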
Very few people use their typical desktop video cards for actual video production or anything related to it because the hardware up until now was simply unable to handle that sort of load. Now we have these cards that are the beginning of a new era of computer-generated visuals. The article is saying that they can do quite a bit more than they can do now if someone would just write some better drivers for them.
Now, streaming real-time rendering images over the internet? Maybe not fullscreen stuff right now because of a multitude of hampering factors on affordable internet bandwidth which I won't name for clarity's sake, but for the limiting factor to be the internet itself and not the graphics card is still a significant step.
This would definitely be very beneficial to low-budget game developers and movie directors. We could very well see the return of the shareware boom (remember the early-to-mid '90s?) because of this.
Sure, only a small portion of the people who'd buy the cards would use the features that the article talks about, but they'd be people who didn't have that capability before. Whenever this happens in any medium/artform/what-have-you, there is a tendency for a lot of experimental stuff to appear. I think we have some very interesting times ahead of us if someone gets these drivers written.
Just another freak in the freak kingdom.
Let's say Pixar starts using 3D chips to accelerate their rendering. They will be doing one of two things:
1) High quality rendering - It takes one hour to render a frame, so the download time is negligible.
2) Realtime previewing - Why would you want to download each frame to the CPU if all you want is a preview?
You hit the nail on the head.
Just take a look at who's behind this article. Serious Magic writes low-end video editing software, which is hardly the target market for 3D cards. What's more is that their "CTO" Stephan Schaem already has a bad reputation of pestering DirectX developers with constant demands of odd API features.
If you read some of the posts he makes on DirectX mailing lists, you'll realize that he has very little idea of how to use a 3D graphics API properly. I would definitely call into question the results of his benchmark.
"What kind of idiot puts their most powerful processor at the end of a one way street?"
Maybe they're the kind of idiots who know most people just want the best possible OUTPUT for gaming possible, and so don't want to add any overhead in card performance - or even additional design time - that isn't related to gaming performance. You know, the idiots who make cards that get award after award from gaming companies, then write near-perfect drivers, port those drivers to linux, and let you overclock the card to your heart's content. Those sort of idiots. My, they're idiotic.
Nobody says, "buy a GeForce 4 Ti, make the next Toy Story." No, it's advertised as a gaming card, and that's what it's designed to do. If you want to do high-end video rendering things, perhaps a gaming card isn't the best choice.
I'm the stranger...posting to
TV cards are designed to take incoming data and put it on the bus. GPUs are designed to do the opposite. So yeah, your tuner card, even on PCI, should beat the video card for readback.
-Looking for a job as a materials chemist or multivariat
And I tend to agree that it's a software issue.
NVIDIA says that if you ask for the contents of the framebuffer in a call to glReadPixels and you ask for it in the same pixel format it's stored in, you won't be really disappointed. If, however, you ask for that same region of the framebuffer in another format, you're screwed. (So, if your framebuffer is 8-8-8-8 RGBA, and you ask for luminance or 10-10-10-2 or something else odd, you aren't going to be pleased with the performance.)
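A CPU-side illustration of why the requested format matters. This is not driver code and not what glReadPixels does internally; it just assumes a framebuffer stored as BGRA bytes (the function names are made up) and contrasts a straight copy with a readback that has to reorder every pixel:

```python
def read_native(framebuffer: bytes) -> bytes:
    """Requested layout matches the stored layout: one bulk copy."""
    return bytes(framebuffer)


def read_swizzled_rgba(framebuffer: bytes) -> bytes:
    """Stored as BGRA, requested as RGBA: every pixel must be reordered."""
    out = bytearray(len(framebuffer))
    out[0::4] = framebuffer[2::4]  # R comes from the third stored byte
    out[1::4] = framebuffer[1::4]  # G stays in place
    out[2::4] = framebuffer[0::4]  # B comes from the first stored byte
    out[3::4] = framebuffer[3::4]  # A stays in place
    return bytes(out)


# Two BGRA pixels: (B=1, G=2, R=3, A=4) and (B=5, G=6, R=7, A=8).
fb = bytes([1, 2, 3, 4, 5, 6, 7, 8])
print(list(read_native(fb)))
print(list(read_swizzled_rgba(fb)))
```

A byte swizzle is about the cheapest conversion there is; luminance or 10-10-10-2 packing needs arithmetic per pixel on top of that, which is presumably where the slow paths come from.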
This isn't by the way, just a render-movies-on-your-PC issue. Lots of scientific computing, visualization, etc., applications render with OpenGL and then grab the framebuffer to store a result. This throughput issue is significant considering that for many applications, what was an enormous data set 10 years ago is now not such a big data set. Like another poster said, this issue is one of the ones that still ties people to SGI.
While 99% of your other concerns might be dealt with, there are still lingering problems like this one that keep some people from moving to commodity hardware.
Outside of a dog, a book is a man's best friend. Inside a dog, its too dark to read.
nt
I can't see why they come to the conclusion it's the drivers. Current graphics cards are pipelines which are heavily optimised for taking lots of data from the AGP bus, processing it (T&L, rasterisation, etc.) and outputting it to the monitor. That's what they're designed to do. Sure, the possibility exists of reading back from the pipeline across the AGP bus (due to calls implemented in graphics APIs designed before the cards), but anyone who has coded using a modern graphics API should know it's not advisable if you want to keep pipeline speed, as it means reversing some parts of the pipeline (meaning flushing it, flushing the cache, refilling the cache, etc., then the same in reverse to get back to drawing again...).
As far as I can tell, this article is about a company realising something that everyone else already knew, then whining cos they don't get the cheap rendering platform they were hoping for.
Seriously though, how much extra effort is it to capture the output and store it (via another PC with video capture if needed)? This whole thing smells of "OMG! This thing doesn't do wondrous new things that normally cost 15 times as much, straight out of the box!"
OK, the newer cards seem to be even more heavily optimised towards rendering to monitors. Can you blame them? If a feature is already almost never used because it's so inefficient, and you can improve performance of regularly used features at the cost of the rarely used one, then that's the way to go.
Even with adapted drivers, the performance would still suck compared to letting the pipeline run how it likes and capturing the frame at the other end.
Rant over.
Dan
I've done my graduate research for the past two years on topics requiring fast frame buffer readback, and here's what I've found: For nVidia, reading back in native format (GL_BGRA_EXT) with OpenGL is very fast. I get performance in the range of 40-50 Mpix/sec, which comes out to be 160-200 MB/sec. I know people at some of the other research labs in California have been able to reproduce these numbers. Reading back in any other format is slow, and reading back in DirectX is slow. Reading back anything but color information (e.g. Z-buffer) is really, really slow. I've talked to the driver people at nVidia, and my understanding is that they just haven't optimized these paths yet. The driver code reads data from the card one word at a time and doesn't use any of AGP's block transfer modes simply because that would take development time away from providing features that most people are going to use. As much as we all like to bash Microsoft, I don't think it has anything to do with them. Since nVidia writes both the Microsoft and Linux drivers, we'd probably see any improvement in readback performance in both drivers at approximately the same time, but I could be wrong... The market drives companies like ATI and nVidia, so as soon as people start demanding fast frame buffer feedback, they'll put it in. Until then, there's still the fast OpenGL path that nVidia has put in for research purposes.
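For anyone who wants to sanity-check those readback numbers, a small sketch assuming 4 bytes per pixel for the native BGRA format (the helper names are made up, and the 1600x1200 frame size is the article's example rather than something measured here):

```python
def readback_mb_per_s(mpix_per_s, bytes_per_pixel=4):
    """Convert a readback rate in megapixels/s to megabytes/s."""
    return mpix_per_s * bytes_per_pixel


def frames_per_s(mpix_per_s, width, height):
    """Full frames per second that a given megapixel/s rate can sustain."""
    return mpix_per_s * 1e6 / (width * height)


for mpix in (40, 50):
    print(f"{mpix} Mpix/s = {readback_mb_per_s(mpix)} MB/s, "
          f"{frames_per_s(mpix, 1600, 1200):.1f} fps at 1600x1200")
```

So the fast native path works out to roughly 20 to 26 full 1600x1200 frames per second.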
Isn't the whole point of the hyper-fast GPUs to be able to render 3D, texture-mapped objects with a minimum of communication with the CPU? If a graphics card is going to make full use of the AGP bandwidth, isn't this going to put one hell of a strain on the motherboard processing power itself?
If you look at the pre-hardware accel cards, the CPU was responsible for the calcs needed to render the display and then dump the raw data out to the video card. I would think that the design focus for the GF4 in my system was to render as many 3D objects at as high of a framerate possible WITHOUT the need to send gigs of data streaming into the video card.
While it might not be that big of an engineering modification to up the bandwidth capabilities of a video card's interface with the mobo, I don't think this was part of the initial design goals... in fact, I think the goal was the exact opposite.
This has been an issue for quite some time. Raster once put reading from the card at being 1/10th the speed of writing to it. This is the reason we have very little "fake transparency" going on right now. Those methods read the frame buffer and then composite upon the necessary region. With this method transparency can neither be fast nor update in real-time.
The solution is to take this into account when designing the compositing model, which Apple has done and which Keith Packard and co. are doing with Xrender and its offshoots.
macros
I've been doing real-time 3D graphics for 10 years, and readback speeds have been the biggest problem for many advanced algorithms. We have asked the companies to improve this many times. The problem as I see it: Quake and other benchmark apps don't rely on readback. /. may even have run a link to one of these techniques a while back.
Here are a few other important but non-Quake techniques that are driven by readback speeds. I'll go into more detail on the first for illustration purposes.
High-quality real-time occlusion culling -- many techniques render the scene quickly by using a unique color tag per object or polygon and then read back the framebuffer to figure out everything that was visible (and how many pixels for each) for a final high-quality pass. If HW drivers would even just implement the standard glHistogram functions (which essentially compress the framebuffer before readback), this would become practical. NVidia adds their NVOcclusion extension, but it's limited in how many objects at a time you can test, it's very asynchronous, and it requires depth sorting on the CPU to make it most useful. The render-color technique does not. Yet HW makers are spending lots of money adding custom HW to do z-occlusion when a simple driver-based software technique may be easier.
Dynamic Reflection Maps -- for simple, reflective surfaces -- Requires background rendering from multiple POVs (generally six 90 degree views) and caching these. Even if you can cache a small set of maps in AGP memory, you want fast async readback if you have a large fairly static scene and you're roaming around.
Real-time radiosity -- similar to above, but needs more CPU processing of the returned images and possibly depth maps (reading back the depth buffer is often even more expensive than the color).
Real-time ray tracing -- the better quality approaches need fast readback to store intermediate results (due to recursion, etc..). With floating point framebuffers and good vertex/pixel shaders, ray-tracing becomes possible, but not yet practical. I believe
So there's a lot more to this issue than just making movies of your games. Faster, better graphics would be possible. So why isn't this a priority?
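To make the color-tag occlusion idea above concrete, here is a minimal sketch of the CPU side only. It assumes the scene has already been rendered with one flat, unique color per object and that the framebuffer has been read back as rows of (r, g, b) tuples. The helper names (encode_id, decode_id, visible_pixel_counts) and the tiny 4x4 buffer are made up for illustration; they are not part of any graphics API.

```python
from collections import Counter

def encode_id(object_id):
    """Pack an object index into a unique 24-bit (r, g, b) tag."""
    return ((object_id >> 16) & 0xFF, (object_id >> 8) & 0xFF, object_id & 0xFF)

def decode_id(rgb):
    """Recover the object index from its (r, g, b) tag."""
    r, g, b = rgb
    return (r << 16) | (g << 8) | b

def visible_pixel_counts(id_buffer, background=(255, 255, 255)):
    """Count covered pixels per object in a read-back ID buffer."""
    counts = Counter()
    for row in id_buffer:
        for pixel in row:
            if pixel != background:
                counts[decode_id(pixel)] += 1
    return counts

# Stand-in for a framebuffer rendered with flat ID colors (hypothetical data).
BG = (255, 255, 255)
A, B = encode_id(1), encode_id(2)
id_buffer = [
    [A,  A,  BG, BG],
    [A,  A,  B,  BG],
    [BG, B,  B,  BG],
    [BG, BG, BG, BG],
]

counts = visible_pixel_counts(id_buffer)
for object_id, pixels in sorted(counts.items()):
    print(f"object {object_id}: {pixels} visible pixels")
```

Any object that never shows up in the counts was fully occluded and can be skipped in the final high-quality pass. The per-pixel loop is essentially what a histogram computed on the card would replace, which is why the readback cost dominates this technique.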
------------ cyranose@realityprime.com
The article claims that the drivers, not the HW, are causing the performance problem. Based on my conversations with a premier graphics programmer and some x86 experts, I don't believe that it is this simple. In particular, note that XFree86 2D, which uses its own drivers, also has pathetic readback rates.
I barely understand the technical details, but it seems like there are some serious misfeatures in the way that the AGP bus interacts with CPUs and caches on both Intel and AMD during readback; it is going to be hard for card vendors to fix this problem (even if they decide to care). It may be that a new bus and/or new CPU glue will be needed for high-readback-rate applications.
I just don't get it! My car still doesn't fly... they keep getting faster, but my car just doesn't fly. I believe this is a driver issue; after all, it's 2002 now... 15 years ago they said cars should be flying by now, so it must be the driver's fault! How come I'm the only one complaining about this? Hello? It's me, the .00001% of the population that is DISGUSTED that we can't fly our cars around yet, but they have fancy GPS, maps, airbags and big-ass engines... but they can't fly! ...
*voice in the background* - "go buy a plane you jerk!"
My card will output the same image to its VGA and TV outputs at the same time.
Surely simply connecting the S-video output to a VCR while playing Quake through the monitor should do the trick.
no, that is a typical slashdot user/reader.
It's quite ironic that this story was posted today, because I'm having the same problem and I was beating my head into the ground as to why it was so slow.
Basically I wanted the GPU to map some textures for me (because it's been designed to do that) and then I wanted to get those back and do some other operations on them.
What I found with my really crude benchmarking is that a call to glReadPixels() of 128x128 8-bit RGB data from a GeForce 2 Ti took around .25ms, which was totally unacceptable because I needed to do this as much as possible within a frame time (so ~30ms).
It boggled my mind as to why this was so slow, and now I know.
I think that the bandwidth figures generated by the author of this article are seriously suspect.
I've written an OpenGL application that supports a capture movie feature . . . It's possible to record a 720x480 movie at 32bpp at 30 frames per second (GF4 Ti 4600, 1.33GHz Athlon), and this includes the compression and rendering time. That's about 40 megabytes per second. 30 fps is the highest supported capture rate, so I haven't tried to find an upper limit. But even 40 megs per second is three times higher than the best figure reported in the benchmark.
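The "about 40 megabytes per second" figure can be checked with simple arithmetic. The sketch below assumes uncompressed frames at exactly 4 bytes per pixel and decimal megabytes; the function name is made up for illustration.

```python
def capture_rate_bytes_per_sec(width, height, bits_per_pixel, fps):
    """Raw readback rate needed to capture every frame uncompressed."""
    bytes_per_frame = width * height * (bits_per_pixel // 8)
    return bytes_per_frame * fps

# The 720x480, 32bpp, 30 fps capture described above.
dv_rate = capture_rate_bytes_per_sec(720, 480, 32, 30)

# The 1600x1200, 32bpp, 30 fps case raised earlier in the thread.
hires_rate = capture_rate_bytes_per_sec(1600, 1200, 32, 30)

print(f"720x480:   {dv_rate / 1e6:.1f} MB/s")
print(f"1600x1200: {hires_rate / 1e6:.1f} MB/s")
```

This only gives the rate the bus has to sustain. It says nothing about whether a particular driver path (OpenGL or DirectX) actually delivers it.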
I'm not sure whether the benchmark uses OpenGL or DirectX--that could have a significant performance impact. But I think it's more likely that there's a big problem with the way the benchmark is written--without having the source, it's difficult to tell what's really happening.
The author of the article also seems to have confused "texture download" with "frame buffer read." Such a deep confusion about the very subject of criticism casts further doubt upon the author's results.
--Chris
Half Life is an old game, and its maximum visuals can already be seen by any decent card today. And I doubt it has support for clustering of any kind.
OpenGL supports reading back the screen buffer mostly so that the OpenGL validation suite can check the rendering accuracy. For that, it doesn't have to be efficient. And if you read back in some format other than the actual structure of the framebuffer, every pixel gets converted in software and performance will be awful.
This article reads like it was written by an overclocker, not a graphics developer.
The nascent art of machinima, which involves using 3D game engines to make desktop movies, could benefit from a practical way to record game output faster. (It would also be nice to export directly to .AVI format for editing in Premiere or Avid, but that's another wishlist.)
Also, having the ability to render faster means that you can do it faster than real-time. If you are working to a deadline in a TV news studio, that might be a real advantage (think late-breaking news where a story has to be put together during a commercial break).
science is a religion
That's the design, but it doesn't really work that way in practice AFAICT. If I have some geometry in AGP memory, the fastest way seems to be to render it to part of the main framebuffer before the final main rendering. Keeps the context switches low. I haven't yet found a way to preserve VAR settings across context switches, which gets in the way of asynchronous rendering.
Pbuffers are better suited for when you want to render data that isn't in the same config as the main framebuffer, or want to render and buffer up at other than the main framerate. Besides, there's still a readback required.
The card I'm referring to is of a different architecture. You'll have to take my word for it since it was a while ago.
science is a religion
If you read the AGP spec, which was written by Intel, you will note that it is based on the PCI 2.0 spec. The PCI 2.0 spec is for a 32 bit, 33 MHz symmetric bus which gives you a max transfer rate of 132 MB per second. The AGP spec is for an asymmetric bus, 33 MHz read and 66+ MHz write. But writes were optimized at the expense of reads, since Intel was pushing video with NO onboard texture memory, and who would want to read back the image in real-time anyway, right?!?
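For reference, the 132 MB per second number is just clock times bus width. The sketch below reproduces that arithmetic with nominal 33 and 66 MHz clocks (the commonly quoted 133 and 1066 MB/s figures come from using 33.33 and 66.66 MHz instead). The function name is made up, and the result is a theoretical peak only: it takes no position on which direction a given chipset or driver can actually sustain it.

```python
def peak_mb_per_sec(clock_mhz, bus_bits, transfers_per_clock=1):
    """Theoretical peak in MB/s: clock (MHz) x bytes per transfer x transfers per clock."""
    return clock_mhz * (bus_bits // 8) * transfers_per_clock

pci = peak_mb_per_sec(33, 32)        # 32-bit, 33 MHz PCI
agp_1x = peak_mb_per_sec(66, 32)     # 32-bit, 66 MHz, one transfer per clock
agp_4x = peak_mb_per_sec(66, 32, 4)  # 32-bit, 66 MHz, four transfers per clock

print(f"PCI:    {pci} MB/s")
print(f"AGP 1x: {agp_1x} MB/s")
print(f"AGP 4x: {agp_4x} MB/s")
```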
Yes, I am sure that drivers do have some effect, but the AGP spec is the first bottleneck. On an OpenGL newsgroup it was reported last year that a person tested two identical video cards, the only difference being one was AGP and the other was PCI. The read performance for the PCI version was several times faster than the AGP version.
Of course, some video cards are also to blame because of the frame buffer format they use, but that is another story...
This is one of the more interesting and compelling reasons for the 128-bit requirement.
Black holes are where the Matrix raised SIGFPE
Follow my reasoning here. I've heard from other articles at /. that Alan Cox (or one of the big name advocates) couldn't think of a reason that would justify to NVidia open-sourcing their drivers. There would be no profit for them to do so.
But if they had, the drivers would have been updated to scratch whoever's itch needed to be scratched. In this case, the bandwidth from card to memory.
One of the benefits of open source is that even seldom-used features are enhanced, so that when suddenly there is a demand for them, the features are in place.
It seems the lesson here is that proper captures from video RAM are slow. Yeah, it'd be nice to change that. But how many people really care? Given how long it took anyone to notice, I can't help but think that very very few people really care - and with good reason. Unless you're into making rendered movies, it's irrelevant.
Build stuff. Stuff that walks, stuff that rolls, whatever.
I think it's "Donny Don't"
I asked nVidia at SIGGRAPH why image readback is so slow. They said, no motherboard they know of (not even their own) supports AGP Writes back to the system memory. Without that, you're limited to PCI bandwidth at best, far less than what the AGP spec allows.
However, we're not even seeing that. Results are showing 1% of what is possible. It's certainly a hardware issue, but there may be a lot of room to improve from the software side, too.
Why would anyone engrave "Elbereth"?
Why is it that a much more expensive Quadro card gives equally slow results? I've run a very similar test on an SGI 320 (shared-memory design) and it only gives 18.9 MB/s.
Anyone reading this with a Wildcat 6000-series? What does that bench at?
Why would anyone engrave "Elbereth"?
What, you think companies are going to let you use a cheap video card for frame grabbing? These suckers were designed for video games and home entertainment purposes, not studio work.
I noticed that there were no reviews of cards by 3d Labs? I wonder why? Could it be that 3d Labs builds its cards for Professional Graphics users and could care less how things like Quake benchmarks?
If someone is passing you on the right, you are an asshole for driving in the wrong lane.
the kind of idiots who know most people just want the best possible OUTPUT for gaming, and so don't want to add any overhead in card performance - or even additional design time - that isn't related to gaming performance. You know, the idiots who make cards that get award after award from gaming companies, then write near-perfect drivers,
here it comes...
port those drivers to linux
Bingo!
The only problem is in the driver. Hardware's up to the job.
The driver has been ported to Linux.
So fix it!
Closed source? Reverse engineer it.
Bantam Dominique roosters crow a four-note song. Once you've heard it as "Happy BIRTHday" you can't NOT hear it that way
but it is rare that image read back is the best solution. However, when it is, proper use of AGP makes a significant difference. A prototype was done in open source for the Matrox G400 driver, but was never maintained. There were more recent discussions on DRI-devel to bring this functionality back; here is a pointer to the thread: http://www.geocrawler.com/mail/thread.php3?subject=%5BDri-devel%5D+proper+ioctls+%28%3F%29++to+export+agp+to+a+user&list=680
That IS the expected bandwidth you get copying from video RAM to system memory with the CPU. Even having the graphics engine push the data across for you, you'll only get PCI bandwidth. This is the way AGP is SUPPOSED to work. If they're claiming you should be able to get 1Gig/sec copying with the CPU or even with the graphics engine into main memory then they don't understand the way the technology works. That bandwidth only goes in one direction. It has nothing to do with drivers. This is the expected behavior for AGP. If they're copying with the CPU in their benchmark (and it looks like they are from those numbers), they got exactly the expected results. Methinks those guys just don't know what they're doing.
Or if you're running tightvnc.
I'm pretty sure this COULD help vnc performance, if not at the client side, then at the server side, which often spends quite a bit of CPU fetching those pixels.
A witty
The benchmark uses DirectX.
Seems to me that many posters here don't have
an idea why Serious Magic needs fast readback
rates from the graphic card to the PC's memory.
SeriousMagic have developed a really cool video compositing engine based on DirectX that allows them to generate realtime video effects, like mapping stuff onto 3D objects, alpha blending, bluebox effects, wipes, fades, etc...
I've seen a demo of it and I was really impressed.
Now if you want to encode that stuff into a video stream (like MPEG or Windows Media) you need to read the generated output back into the computer's main memory for CPU based encoding.
And that's where the bottleneck is. They can't get the data back fast enough.
So their release of a benchmark application shows how bad the cards actually perform. If they want to put some pressure on card makers to improve performance, that's the way to go.
--- Eat my sig.
Hmm. Way back in the early 80s we had a nifty device known as a "genlock" that converted PC video card output back to NTSC-compliant signals for viewing on a standard TV. These have gotten much better, and I've seen projectors that can handle 1200x1600 or better in true color. I'm just surprised that enthusiasts haven't devised some sort of "loopback" device utilizing one of these. It could theoretically get the data back to the CPU, but it wouldn't help in the way of increasing performance if in fact it is the problem of bad drivers, as the article suggests...
Of course, I suspect it's not entirely the fault of the drivers; more than likely, there would have to be some near-redundant circuitry to help prevent lagging on the video card.
Can anyone confirm that the AGP spec is not symmetric? The fact remains: Nvidia cards under OpenGL deliver 200 MB/s, yet most cards (including Nvidia's) deliver 15 MB/s with DirectX on XP. Note: we did see the Nvidia driver deliver 80-130 MB/s under Win98. There is a big gap between 10 MB/s and 200 MB/s. Another factor is that the benchmark uses 100% of the processor (the driver uses a CPU loop rather than DMA), and it's very probable that OpenGL uses DMA and frees up the processor. So 'only' 200 MB/s, but with little processor usage. Double bonus! Stephan
This does not explain why the benchmark runs 8 to 13 times faster under Win9x, or why people are posting results of 130-200 MB/s using OpenGL. It's a software issue. No need to invent elaborate HW hacks. Stephan
I spent most of the summer working on AGP driver bugs, so let me clarify a few things.
AGP was designed by Intel as an ad hoc solution to combat the problem of transferring large textures to a graphics card over the PCI bus. It's an extension to PCI, essentially, allowing fast, pipelined, ONE-WAY transfers. That should be repeated. AGP is PCI, with a different connector, and a bunch of extra pins and logic for pipelined transfers from system memory to the card. In fact, without "fast writes" enabled, CPU -> graphics card writes are plain PCI; only transfers requested BY THE CARD are accelerated.
There is nothing new about this. It's in the spec.
It is NOT meant to be a two-way bus. It was never designed for offloading cinematic rendering to the card, for later recovery. AGP came out around 1997, before NVIDIA or ATI had shaders in hardware. PC rendering was nowhere near photorealistic at the time; that was the domain of software raytracers. Without AGP, video cards seriously hog the PCI bus with their texture streaming. That is ALL that AGP fixes.
The real solution is to come up with a new bus. I tend to like unified memory architecture designs, but they have disadvantages as well. The real trouble is getting the PC industry to agree on anything; if ATI came up with a new bus standard, for instance, I doubt NVIDIA or Matrox would adopt it, not wishing to appear to submit to their competitor.
-John
Ok, so the problem is there's too much frame data and no way to get it back over the bus to the system for storage. Proposed solution: Separate capture device. Method of connection: DVI digital output. All modern graphics cards have digital DVI out these days. Most of them can run it simultaneously with a VGA monitor or second DVI, depending on your card. Some can even fullscreen an app on one interface while having it windowed on the other. Perfect for this. So, you tap the DVI into a custom piece of hardware designed to do the capture. Say a box with a couple hundred megs of ram (what's a gig of PC133 these days, $80?) with the bandwidth to do the capture (PC133 = 1066MB/sec, well above the 225MB/sec estimate from another post). Then you add in a hardware compression chip a la mpeg2 hardware encoder or mpeg4 hardware encoder, whichever you please. Then dump the compressed result to a hard drive. Hell, I bet you could put this entire thing except the hard drive on a PCI card, put it in a slot, and run your video card's DVI out to the PCI card's input port, then capture back to the disk. All you need is a way to decode DVI. Since it's already a digital signal designed to display an image, I don't think decoding it to say, a TGA format would be that impossible to do. After all, LCDs have to have some kind of decoder for it right? Is this really any less feasible than those old mpeg2 PCI decoders that used pass through connections to the video card? I mean it'll need the ram for temp storage and probably a bit more processing power to encode instead of decode, but I don't see it being unfeasible. I bet you could mass produce one for $299 that worked with any 3D card on the market. Need it right now? Two PCs, one renders, one captures. Optimize each box for its task. One with a fast CPU, fast GPU, the other with a vid capture/hardware encoding card, and RAID array.
Of course then you could only capture output dependent on the source machine, so doing individual frames might be slightly tricky, but I'm betting the timing issues for syncing could be worked out in software.
Introducing the new Occam Fusion! Now with sqrt(-1) fewer blades!
Someone build a bloody box with a DVI input and a gigabit ethernet port on it. Connect DVI out of video card to DVI input on our magic box, gigabit Ethernet on the box to gigabit ethernet on the PC. As each frame is generated, capture it and spew it back to the PC over the ethernet, then ask the custom software on the PC (via a packet from the magic box) to put the next frame over the DVI.
Lather, rinse, repeat.
Won't be cheap, but someone could almost certainly whip one up with a Xilinx FPGA. I know they make one with a built-in TMDS receiver, which is what you'd need to decode the DVI signal.
Why not plug a capture device into the DMI port?
"AGP Texture Download Problem Revealed"
"AGP Texture Download Problem" implies that there's a problem downloading textures via AGP from main memory. But it's not about texture transfers at all, it's about transfers of rendered frames back to the system (in the opposite direction).
Hey, 'Taco... You're the high point of the /. editing staff; your readership is depending on you to drag the other editors up the bell curve kicking and screaming by your example. Don't give up now. =)
I'm not surprised at this - when you spend your effort optimising for
output, dragging that final image back up to the input is kinda like
running up a downward moving escalator...you *can* do it - but you
probably shouldn't.
It seems to me that if you are rendering movies with this technology,
you are either a small operation who can probably afford to wait (say)
10x longer than realtime to do it - or you are some big production house
who can afford to do better.
In those cases, why not simply stick a frame-grabber onto the digital output?
Heck you can even get around the 8 bits-per-component problem by using a
fragment shader to render the high order bits to red and the middle bits
to green and the low order bits to blue - then do three passes to render
the Red component of your image at 24 bits per pixel, then the green, then
the blue.
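The bit-splitting trick above can be sketched on the CPU side as follows. On the card the split would be done by a fragment shader; here it is plain integer arithmetic, one call per color component per pass, and the function names are made up for illustration.

```python
def split_24bit(value):
    """Split a 24-bit component into (high, middle, low) bytes for the R, G, B channels."""
    if not 0 <= value < (1 << 24):
        raise ValueError("value must fit in 24 bits")
    return ((value >> 16) & 0xFF, (value >> 8) & 0xFF, value & 0xFF)

def join_24bit(r, g, b):
    """Reassemble the 24-bit component from the captured 8-bit R, G, B channels."""
    return (r << 16) | (g << 8) | b

# One high-precision component of one pixel, as it would pass through one capture pass.
component = 0x123456
captured = split_24bit(component)   # what the frame grabber would see as (R, G, B)
restored = join_24bit(*captured)

print(captured, hex(restored))
```

Three such passes (one each for the red, green and blue components) give 24 bits per component at the cost of tripling the number of frames to capture.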
Using the downstream performance to your advantage is the way to go.
The title of this article (which talks about "Texture Download") is most
confusing because that's a term usually used to describe the process of
taking a texture map out of the CPU and stuffing it into the graphics
card's texture memory.
This is more like "Screen Dump Upload".
www.sjbaker.org
wtf are you all talking about? computers are crap...
My experience with P-buffers on Nvidia cards is that you get even lower performance for the readback phase! If you copy-to-texture, it's OK. Since then, I have switched back to regular frame-buffer rendering. Luc.
Have you ever compared software-rendered GL (read, Mesa) to even an old Voodoo1? The difference in time required is staggering. I always knew when my GL drivers weren't working right because it took 15 seconds to render one frame.
Now scoot ahead to five years from now, when 3D accelerators take their data in the form of some .pov/.diff hybrid, or a .pov derivative with motion thrown in. These new, cutting edge video cards are *capable* of onboard raytracing at 16000x12000. (sic) Suddenly, rendering Toy Story III in native IMAX sounds real good right now...
(On a side note... throw in dedicated mass-spring simulation hardware for fluid and materials emulation. [drool])
What's this Submit thingy do?
Capturing FPS play would be absolutely perfect for all the mod developers out there for whom still-picture screenshots don't do their mod justice.
What's this Submit thingy do?
Juiz de Fora IRC
I wrote a benchmark last night that did DirectDraw and OpenGL pixelblock transfers, both ways across the AGP bus. Now, I wouldn't call my results totally rigorous (there are various versions of drivers, no Win9x machines, a couple WinXP & the rest are Win2k), but I ran many of them multiple times, on a selection of machines/cards, & got pretty consistent numbers each time. Also, the DirectDraw readback numbers agreed fairly closely with the Serious Magic Direct3D results.
(Write denotes system to gfx card, Read denotes gfx card to system)
A few things struck me:
- OpenGL does WAY faster readbacks, especially on nVidia hardware.
- OpenGL is faster for writes too, on nVidia, but a lot slower on ATI
- ATI seem to optimise more for DirectX
- The SGI's unified memory architecture does help, though not as much as I would have expected.
- Matrox's OpenGL drivers sucked big time.
- These numbers would look better in one of Damage's graphs.
Anyway, I'm convinced that there's no particular hardware problems involved, other than perhaps readback being limited to PCI66 speeds. I have no idea why DirectX readbacks are so much slower - can it really be that every single company just hasn't bothered to optimise this path, even though they have for OpenGL? Or is there something within DirectX itself that's holding them all back?
Why would anyone engrave "Elbereth"?
Also, as I already stated, the MAX read is ~132 MB per second (33 MHz bus, 32 bits wide). There is no way you could get a transfer rate of 200 MB per second.
I already hinted that the frame buffer format could be a bottleneck. For example, the frame buffer might not be in a 'normal' RGB or RGBA format. It could be in a format like ARBG, BRG, or even something exotic like RGBAZZZS (Z-buffer, Stencil buffer). And it is very likely the scan lines are not contiguous, so the padding will need to be skipped.
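As an illustration of the extra CPU work this implies, here is a minimal sketch that repacks a surface whose rows are padded out to a pitch and whose pixels are stored as BGRA into tightly packed RGBA. The BGRA layout, the pitch value and the function name are assumptions chosen for the example, not a description of any particular card's framebuffer.

```python
def repack_bgra_to_rgba(raw, width, height, pitch):
    """Copy a pitched BGRA surface into tightly packed RGBA bytes."""
    bytes_per_pixel = 4
    row_bytes = width * bytes_per_pixel
    if pitch < row_bytes:
        raise ValueError("pitch must be at least width * 4")
    out = bytearray(row_bytes * height)
    for y in range(height):
        src_row = y * pitch        # source rows start every `pitch` bytes
        dst_row = y * row_bytes    # destination rows are contiguous
        for x in range(width):
            s = src_row + x * bytes_per_pixel
            d = dst_row + x * bytes_per_pixel
            b, g, r, a = raw[s:s + 4]
            out[d:d + 4] = bytes((r, g, b, a))  # swizzle BGRA -> RGBA
    return bytes(out)

# 2x2 surface with a 12-byte pitch: 8 bytes of pixels plus 4 bytes of padding per row.
PAD = bytes([0xEE] * 4)
raw = (bytes([1, 2, 3, 4, 5, 6, 7, 8]) + PAD +
       bytes([9, 10, 11, 12, 13, 14, 15, 16]) + PAD)

packed = repack_bgra_to_rgba(raw, width=2, height=2, pitch=12)
print(list(packed))
```

Every pixel is touched once by the CPU here, which is the kind of per-pixel software conversion that makes a mismatched readback format so much slower than a straight copy.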
AGP is 66 MHz... Have you read the spec yourself? AGP 4x is 66 MHz, 32-bit, quad-pumped. And are you calling the people reporting 130-200 MB/s reads under OpenGL liars? Example: nVidia Quadro DCC / P4 Xeon 1.5 GHz x 2, OpenGL Write: 482.03 MB/s, Read: 157.60 MB/s. Email: d a n i e l @ e y e o n l i n e . c o m if you want to prove yourself wrong. Stephan