Hardware Based XRender Slower than Software Rendering?
Neon Spiral Injector writes "Rasterman of Enlightenment fame has finally updated the news page of his personal site. It seems that the behind the scenes work for E is coming along. He is investigating rendering backends for Evas. The default backend is a software renderer written by Raster. Trying to gain a little more speed he ported it to the XRender extension, only to find that it became 20-50 times slower on his NVidia card. He has placed some sample code on this same news page for people to try, and see if this is also experienced on other setups."
He didn't really get too far into that, but it would be interesting to see how feasible it is to do all the 2D rendering using OpenGL, encapsulated by some layer, like his Evas.
Has anyone done that? Any interesting results? One would think that there's a lot of potential here...
I have used both ATI and NVIDIA (and 3dfx, and Matrox, but staying relevant). Generally the NVIDIA cards I have owned have been vastly outperformed by the ATI cards right off the bat, without tweakage. (This is under Linux, mind you.) Even with tweakage, in my experience, you rarely get the full potential from your card.
I hate sigs.
Irix.
IrisGL or OpenGL (I think OpenGL is based on IrisGL, so Irix probably now uses OpenGL) is used extensively in Irix, for both 2D and 3D.
A solution to the problem with music today
Here is the entry from the driver README:
Following that option, this one is noted:
It may be big and bloated, but at least it's slow.
I'm an American. I love this country and the freedoms that we used to have.
There has been some work on using graphics cards for computation. The tough part is figuring out how to rephrase your algorithm in terms of what the GPU can handle. You'd expect matrix math to work out but people have tried to implement more interesting algorithms too. :-)
- Amit
Normally, he would answer some questions or comments posted about something he has written, but he will be out of town for at least a few days.
I highly doubt he meant for this to get widespread exposure beyond developers of Enlightenment or X. Since it has, this is a good opportunity. I'll make this clear for anyone who didn't catch it: raster WANTS XRENDER TO BE FASTER! If there is a way to alter configuration or to recode the benchmark to do so, he wants to know about it.
Rather than posting questions about his configuration (which he can't answer right now), grab the benchmarks that he put up and get better results.
Now back to your regularly scheduled trolling...
There's an example from back in the 80's that still probably serves as a good engineering reference for people working on hardware/software driver issues.
In those days of yore (only in the computer industry can one refer to something 20 years ago as "yore"...) there was the Commodore 64. It retains its place as a pioneering home computer in that it offered very good (for the time) graphics and sound capability, and an amazing 64K of RAM, in an inexpensive unit. But then came its bastard son...
The 1541 floppy disk drive. It became the storage option for a home user once they became infuriated enough with the capabilities of cassette-tape backup to pony up for storage on a real medium. Unfortunately, the 1541 was slow. Unbelievably slow. Slow enough to think, just maybe, there were little dwarven people in your serial interface cable running your bits back and forth by hand.
Now, a very unique attribute of the 1541 drive was that it had its own 6502 processor and firmware. Plausibly, having in effect a "disk-drive-coprocessor" would accelerate your data transfer. It did not. Not remotely. Running through a disassembly of the 6502 firmware revealed endless, meandering code to provide what would appear, on the surface, to be a pretty straightforward piece of functionality: send data bits over the data pin and handshake it over the handshake signal pin.
As the market forces of installed base and demand for faster speed imposed themselves, solutions to the 1541 speed problem were found by third party companies. Software was released which performed such functions as loading from disk and backing up floppies at speeds that were many, many times faster than the 1541's base hardware and firmware could offer.
The top of this particular speed-enhancement heap was a nice strategy involving utilizing both the Commodore 64's and the 1541's processors, and the serial connection, optimally. Literally optimally. Assembly routines were written to run on both the 64 and the 1541 side to exactly synchronize the sending and receiving of bits on a clock-cycle by clock-cycle basis. Taking advantage of the fact that both 6502s were running at 1 MHz, the 1541's code would start blasting the data across the serial line to the corresponding 64 code, which would pull it off the serial bus within a 3-clock-cycle window (you could not write the two routines to be any more in sync than a couple of 6502 instructions). This method used no handshaking whatsoever for large blocks of data being sent from the drive to the computer, and so, in an added speed coup, the handshaking line was also used for data, doubling the effective speed.
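To make the "handshake line carries data too" trick concrete, here is a minimal sketch in C rather than 6502 assembly. It only models how a byte could be split across two lines, two bits per transfer slot; the `LinePair` type, the function names and the low-bits-first ordering are assumptions made for illustration, not the actual fast-loader protocol, and the cycle-exact timing that made the real thing work is not modelled at all.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* State of the two serial lines during one transfer slot (hypothetical model). */
typedef struct {
    unsigned data_line : 1;   /* the normal data pin */
    unsigned clock_line : 1;  /* the former handshake pin, reused for data */
} LinePair;

/* Sender side: split one byte into four slots, two bits per slot.
 * Low bits first is an assumption; a real loader may order them differently. */
static void send_byte(uint8_t byte, LinePair slots[4])
{
    assert(slots != NULL);
    for (int i = 0; i < 4; i++) {
        slots[i].data_line  = (byte >> (2 * i)) & 1u;
        slots[i].clock_line = (byte >> (2 * i + 1)) & 1u;
    }
}

/* Receiver side: sample both lines in each slot and reassemble the byte. */
static uint8_t receive_byte(const LinePair slots[4])
{
    assert(slots != NULL);
    uint8_t byte = 0;
    for (int i = 0; i < 4; i++) {
        byte |= (uint8_t)(slots[i].data_line << (2 * i));
        byte |= (uint8_t)(slots[i].clock_line << (2 * i + 1));
    }
    return byte;
}
```

With one data line a byte needs eight slots; with both lines carrying data it needs four, which is where the doubling comes from.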
The 1541 still seems pertinent as an example of a computer function that one would probably think would best be done primarily on a software level (running on the Commodore 64), but was engineered instead to utilize a more-hardware approach (on the 1541), only to be rescued by better software to utilize the hardware (on both).
There are probably still a few design lessons from the "ancient" 1541, for both the hardware and the software guys.
~ Whence do you come, slayer of men, or where are you going, conqueror of space?
A lot of people are questioning the results claimed by Rasterman; however, try downloading the thing and running it for yourself. I see the same trend that Rasterman claims when I do it.
My system: Athlon 800, nVidia 2-GTS.
Drivers: nVidia driver, 1.0.4363 (Gentoo)
Kernel: 2.4.20-r6 (Gentoo)
X11: XFree86 4.3.0
I've checked and:
The benchmark consists of rendering an alphablended bitmap to the screen repeatedly using Render extension (on- and off-screen) and imlib2. Various scaling modes are also tried.
When there's no scaling involved, the hardware Render extension wins; it's over twice as fast. That's only the first round of tests though. The rest of the rounds all involve scaling (half- and double-size, various antialiasing modes). For these, imlib2 walks all over the Render extension; we're talking three and a half minutes versus 6 seconds in one of the rounds; the rest are similar.
I'm not posting the exact figures since the benchmark isn't scientific and worrying about exact numbers isn't the point; the trend is undeniable. Things like agpgart versus nVidia's internal AGP driver should not account for the wide gap.
Given that at least one of the rounds in the benchmark shows the Render extension winning, I'm going to take a stab at explaining the results by suggesting that the hardware is probably performing the scaling operations each and every time, while imlib2 caches the results (or something). The results seem to suggest that scaling the thing once and then reverting to non-scaling blitting would improve at least some of the rounds; this is too easy, however, since while it helps the application that knows it's going to repeatedly blit the same scaled bitmap, not all applications know this a priori.
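To illustrate the "scale once, then blit unscaled" idea, here is a minimal sketch of a one-entry scaled-image cache. Everything in it (the `Image` and `ScaleCache` types, `scale_nearest`, `cache_get`) is hypothetical and written for this example; it is not how imlib2 or the Render extension actually work internally, and whether imlib2 caches at all is only the guess made above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical image type for illustration: 32-bit pixels, row-major. */
typedef struct {
    int w, h;
    uint32_t *px;
} Image;

/* One-entry cache: remembers the last scaled result and what produced it. */
typedef struct {
    const Image *src;   /* source the cached copy was made from */
    Image scaled;       /* cached scaled copy (px == NULL when empty) */
    int misses;         /* how many times we actually had to rescale */
} ScaleCache;

/* Nearest-neighbour scale of src into a freshly allocated w x h image. */
static Image scale_nearest(const Image *src, int w, int h)
{
    Image out = { w, h, malloc((size_t)w * (size_t)h * sizeof(uint32_t)) };
    assert(out.px != NULL);
    for (int y = 0; y < h; y++) {
        int sy = (int)((long)y * src->h / h);
        for (int x = 0; x < w; x++) {
            int sx = (int)((long)x * src->w / w);
            out.px[y * w + x] = src->px[sy * src->w + sx];
        }
    }
    return out;
}

/* Return a scaled copy, rescaling only when the source or target size changed. */
static const Image *cache_get(ScaleCache *c, const Image *src, int w, int h)
{
    assert(w > 0 && h > 0);
    if (c->scaled.px == NULL || c->src != src ||
        c->scaled.w != w || c->scaled.h != h) {
        free(c->scaled.px);
        c->scaled = scale_nearest(src, w, h);
        c->src = src;
        c->misses++;
    }
    return &c->scaled;
}
```

Blitting the same bitmap at the same size a hundred times through `cache_get` scales it once and reuses the result for the other ninety-nine. The cache is keyed on the source pointer, so it would not notice the pixels changing underneath it, which is the same a-priori-knowledge problem in another form.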
- Andrew
If Tyranny and Oppression come to this land,
it will be in the guise of fighting a foreign enemy. -James Madison
The problem is in *sending* the graphics commands to the hardware. If you're manually sending quads one at a time, I found that for 16x16 squares on screen, it's faster to do it in software than on a GeForce 2 (that was what I had at the time - this was a few years back). Think about it:
:)
== Hardware ==
Vertex coordinates, texture coordinates and primitive types are DMA'd to the video card. The video card finds the texture and loads all the information into its registers. It then executes triangle setup, then the triangle fill operation - twice (because it's drawing a quad).
== Software ==
Source texture is copied by the CPU to hardware memory, line by line.
Actual peak fill rate in software will be lower than hardware - but if your code is structured correctly (textures in the right format, etc) - there's no setup. The hardware latency loses out to the speed of your CPU's cache - the software copy has the same complexity as making the calls to the graphics card.
The trick is to *batch* your commands. Sending several hundred primitives to the hardware at the same time will blow software away - especially as the area to be filled increases. Well.. most of the time, but it really depends on what you're doing.
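The trade-off can be put into a back-of-the-envelope cost model: every submission pays a fixed setup cost, and the pixels themselves are filled at some sustained rate. The sketch below assumes exactly that and nothing more; `RenderPath`, `draw_time_us` and any figures plugged into them are made up for illustration and are not measurements of any card or driver.

```c
#include <assert.h>

/* A rendering path described by two illustrative parameters (not measurements). */
typedef struct {
    double setup_us;      /* fixed cost per submission: driver call, DMA, triangle setup */
    double pixels_per_us; /* sustained fill rate once the work is running */
} RenderPath;

/* Estimated time to draw `quads` quads of `area` pixels each,
 * submitted `batch` quads per call. */
static double draw_time_us(RenderPath p, long quads, long area, long batch)
{
    assert(quads >= 0 && area >= 0 && batch > 0);
    long submissions = (quads + batch - 1) / batch;  /* ceiling division */
    return submissions * p.setup_us + (double)(quads * area) / p.pixels_per_us;
}
```

With made-up figures of 10 us setup and 1000 pixels/us for hardware, against zero setup and 100 pixels/us for software, 1000 quads of 16x16 pixels sent one at a time come out at 10256 us in hardware versus 2560 us in software. Send the same quads 500 per call and hardware drops to 276 us. The exact numbers mean nothing; the point is that the setup term dominates for small quads until it is amortised over a batch.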
XRender is a new extension with only a reference implementation in XFree86. The point is to experiment with an API prior to freezing it. I know this may come as news to people who have grown up on Microsoft software, but real software developers first try out various ideas and then later start hacking it for speed. It would be quite surprising, actually, if it were faster than a hand-tuned client-side software implementation.
It will be a while until XRender beats client-side software implementations. Furthermore, you can't just take a client-side renderer and hack in XRender calls and expect it to run fast--code that works efficiently with a client-server window system like X11 needs to be written differently than something that moves around pixels locally.
Obviously XRender is getting crushed here by Imlib2. There are a million reasons this might be happening; it's definitely worth looking into. In the best Slashdot tradition, here's some wild speculation about what might be causing the slowdown:
- Rasterman's code might be giving an unfair advantage to Imlib2. The Imlib2 results are never shown on the screen. However, XRender is tested both with display and without, so it seems like it should be fair.
- Rasterman's code might be using XRender in an inefficient way. I'm no X programming expert so I have no idea if what he's doing is the best way to do it, but Rasterman is supposed to be some sort of expert in producing nice fast graphics on X so I'd say this is unlikely.
- XRender might not be hardware accelerated for some reason, probably having to do with driver configuration or something. But geez, does the software rendering have to be that slow? Maybe they could learn something from Imlib2.
- The hotly debated "X protocol overhead" might be causing this slowdown. But given the magnitude of the slowdown, I think this is unlikely.
Hopefully someone knowledgeable like Keith Packard himself will come and enlighten us with the true cause.
main(c,r){for(r=32;r;) printf(++c>31?c=!r--,"\n":c<r?" ":~c&r?" `":" #");}
After installing imlib2, and running render_bench's 'make', it gives me the following:
cc -g -I/usr/X11R6/include `imlib2-config --cflags` -c main.c -o main.o
main.c: In function `xrender_surf_new':
main.c:67: `PictStandardARGB32' undeclared (first use in this function)
main.c:67: (Each undeclared identifier is reported only once
main.c:67: for each function it appears in.)
main.c:67: warning: assignment makes pointer from integer without a cast
main.c:69: `PictStandardRGB24' undeclared (first use in this function)
main.c:69: warning: assignment makes pointer from integer without a cast
main.c: In function `xrender_surf_blend':
main.c:153: `XFilters' undeclared (first use in this function)
main.c:153: `flt' undeclared (first use in this function)
main.c:154: `XTransform' undeclared (first use in this function)
main.c:154: parse error before `xf'
main.c:156: `xf' undeclared (first use in this function)
main.c: In function `main_loop':
main.c:439: `XFilters' undeclared (first use in this function)
main.c:439: `flt' undeclared (first use in this function)
make: *** [main.o] Error 1
It seems to do this at the same speed, whether or not I have render acceleration enabled.
"Nine times out of ten, starting a fire is not the best way to solve the problem." - my wife
And the results were pretty much the same. Using Render was far slower in rounds 2 - 7: roughly 27 to 40 times slower than Imlib2 in the rounds listed below. I have a GeForce1 with the 1.0.4349 nvidia driver and haven't had the trouble others have had with this option on, so I run with this extension on all the time.
Here are the results for the interested:
Available XRENDER filters:
nearest
bilinear
fast
good
best
Set up...
*** ROUND 1 ***
Test: Test Xrender doing non-scaled Over blends Time: 0.190 sec.
Test: Test Xrender (offscreen) doing non-scaled Over blends Time: 0.303 sec.
Test: Test Imlib2 doing non-scaled Over blends Time: 0.697 sec.
*** ROUND 2 ***
Test: Test Xrender doing 1/2 scaled Over blends Time: 10.347 sec.
Test: Test Xrender (offscreen) doing 1/2 scaled Over blends Time: 10.231 sec.
Test: Test Imlib2 doing 1/2 scaled Over blends Time: 0.315 sec.
*** ROUND 3 ***
Test: Test Xrender doing 2* smooth scaled Over blends Time: 207.028 sec.
Test: Test Xrender (offscreen) doing 2* smooth scaled Over blends Time: 205.275 sec.
Test: Test Imlib2 doing 2* smooth scaled Over blends Time: 5.695 sec.
*** ROUND 4 ***
Test: Test Xrender doing 2* nearest scaled Over blends Time: 164.460 sec.
Test: Test Xrender (offscreen) doing 2* nearest scaled Over blends Time: 166.281 sec.
Test: Test Imlib2 doing 2* nearest scaled Over blends Time: 4.119 sec.
*** ROUND 6 ***
Test: Test Xrender doing general nearest scaled Over blends Time: 313.187 sec.
Test: Test Xrender (offscreen) doing general nearest scaled Over blends Time: 310.261 sec.
Test: Test Imlib2 doing general nearest scaled Over blends Time: 11.444 sec.
*** ROUND 7 ***
Test: Test Xrender doing general smooth scaled Over blends Time: 477.511 sec.
Test: Test Xrender (offscreen) doing general smooth scaled Over blends Time: 474.695 sec.
Test: Test Imlib2 doing general smooth scaled Over blends Time: 17.290 sec.
(reformatted to get past the lameness filter)
This ends up being even more true if you do any sort of complex compositing (eg: alpha blending, hardware accelerated mpeg / video, OpenGL windows, etc, etc). Enlightenment uses alpha channels, so it would be faster to composite in hardware than in software. These sorts of operations are not accelerated at all on the 2D path, and have to be done in software.
Go check out Quartz Extreme at http://www.apple.com/macosx/jaguar/quartzextreme.html.
Having used XFree86 and Quartz Extreme on the same graphics hardware, I can tell you there's no comparison. Quartz is much faster and much more capable.
Apple's OS X does all rendering through Quartz (as PDFs), which is accelerated by OpenGL and called Quartz Extreme.
:-)
That's not accurate. Quartz is really made of two parts: Quartz 2D and the Quartz Compositor.
The Quartz Compositor is responsible for compositing all the layers (desktop, windows, layers inside windows) on-screen. It offers Porter-Duff compositing, which was developed at Pixar more than 15 years ago. See this post from Mike Paquette for details. Mr Paquette is one of the main developers of Quartz. Quartz Extreme is "simply" an OpenGL implementation of Porter-Duff compositing, and modern graphics cards offer the primitives needed to do that very efficiently.
The Quartz 2D layer is what offers drawing primitives following the PostScript drawing model. The same drawing model is used with PDF (no surprise), Java2D and SVG (and Microsoft's GDI+ ?). This part is not HW accelerated. I am sure Apple is working on it, but it wouldn't surprise me if new HW is required to make this possible. There is a strong incentive for card manufacturers to offer acceleration, since Longhorn is supposed to use GDI+ extensively. I doubt that such acceleration will fit in the traditional OpenGL/Direct3D rendering pipeline.
The Apple JVM team implemented HW accelerated Java2D drawing in their 1.3.1 JVM. Their 1.4 JVM doesn't offer it (1.4.1 was a massive rewrite for them, 1.3.1 was more of a quick port to OS X using some of their "old" Carbon code). There were quite a few problems when HW acceleration was used. I hope they can and will wait for system-wide Quartz 2D HW acceleration; it seems ludicrous to have the JVM team spend resources on an effort that will be wasted once Quartz 2D is accelerated.
See Apple Marketing page, another post from Mike Paquette, and the presentation from Apple at SIGGRAPH about Quartz Extreme and OpenGL.
If that post doesn't end up rated +5 informative, I don't know what will!
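For anyone who hasn't met it, the most common Porter-Duff operator, "over", is small enough to write out. The sketch below is a plain CPU version for one premultiplied ARGB32 pixel, where result = src + dst * (1 - src_alpha) per channel. It is only meant to show the arithmetic; the function names are invented here, and it says nothing about how Quartz, OpenGL or XRender implement it.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply two 8-bit values as fractions of 255, with rounding. */
static uint8_t mul_255(uint8_t a, uint8_t b)
{
    unsigned t = (unsigned)a * b + 128;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* Porter-Duff "over" for premultiplied ARGB32 pixels:
 * result = src + dst * (1 - src_alpha), applied to all four channels. */
static uint32_t over_premul(uint32_t src, uint32_t dst)
{
    uint8_t inv_a = (uint8_t)(255 - (src >> 24));
    uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        unsigned s = (src >> shift) & 0xff;
        unsigned d = (dst >> shift) & 0xff;
        unsigned v = s + mul_255((uint8_t)d, inv_a);
        if (v > 255)
            v = 255;  /* only reachable if src was not really premultiplied */
        out |= (uint32_t)v << shift;
    }
    return out;
}
```

An opaque source replaces the destination, a fully transparent one leaves it untouched, and anything in between mixes the two. On a graphics card the same formula is a fixed-function blend mode, which is why a compositor built on it maps so well onto the hardware.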