Boost UltraSPARC T1 Floating Point w/ a Graphics Card?
alxtoth asks: "All over the web, Sun's UltraSPARC T1 is described as 'not fit for floating point calculations'. Somebody has benchmarked it for HPC applications, and got results that weren't that bad. What if one of the threads could do the floating point in the GPU, as suggested here? Even if the factory setup does not expect a video card, could you insert a low profile PCI-E video card, boot Ubuntu and expect decent performance?"
Man, aside from the occasional desktop, who actually uses video cards on a Sun machine??? ;-)
Sun SPARC kit doesn't use a BIOS. Unfortunately, nearly all modern graphics cards that haven't been specifically designed to work on non-x86* kit rely upon the BIOS to initialise the card. This massively limits the hardware availability. PCI, sadly, is only a hardware standard.
There's been some work by David S Miller on getting BIOS emulation into the Linux kernel so that regular cards can be fooled into working, but it's not there yet and will probably fall foul of Debian's firmware loading policy (does that apply to Ubuntu too?).
Especially since current GPUs don't implement double-precision floating point math. Heh, in that vein you could add a dual Opteron single-board computer into one of the expansion slots...
We produce an Open Firmware solution which includes an x86 emulator to bootstrap x86 hardware, specifically graphics cards and the like.
PowerPC boards, PC graphics chips with x86 BIOS, no driver edits required on the OS side. It is there like it would be on a PC.
http://metadistribution.org/blog/Blog/78A3C88E-1CE7-45B8-9C79-420134DD9B8E.html
http://www.genesippc.com/
But Sun realized that the more things change, the more they stay the same; the reason why vendors got away with making floating point an expensive option was that there are lots of workloads where floating point performance is unimportant. So they applied the RISC principle and chose to not waste a lot of silicon on the T1 implementing instructions that are not needed in their target workload, but instead figure out how to get lots of concurrent threads.
Trying to improve floating point perf on a T1 by adding another card is like trying to figure out how to put wheels on a fish. It might be a cool hack and it might solve some particular problem but it doesn't generalize.
If you want floating point perf and tons of threads, wait for the Rock chip from Sun (and hope that Sun stays afloat long enough to ship it). It's like a T1 only more so, with floating point for each thread.
Am I part of the core demographic for Swedish Fish?
The T2 is supposed to have an FPU for each core, so it would be a simpler solution than trying to use a graphics card. The T2 is also supposed to have double the number of threads per core and even more memory bandwidth.
At that point, you're bound by the bandwidth between the graphics card and the CPU. Why not just purchase hardware that works for what you want to use it for in the first place?
Now even if you custom code an application to do all floating point work in a specific thread, you would need to completely modify the kernel thread management sub-systems. The threads themselves would need meta flag data to signify what "kind" of thread they are so that the "floating point thread(s)" are queued for running on the GPU and not on the T1 (unless there are idle T1 cores and the GPU is already busy).
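To make the dispatch rule in that paragraph concrete, here is a minimal sketch of the decision only. Everything in it (`Kind`, `Target`, `Thread`, `pick_target`) is a hypothetical name made up for illustration; it is not how any real kernel scheduler is structured, and it ignores the hard parts (moving state to the GPU, bus latency, preemption).

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    """Hypothetical per-thread 'meta flag' describing the workload."""
    INTEGER = "integer"
    FLOAT = "float"


class Target(Enum):
    """Hypothetical places a thread could be queued to run."""
    T1_CORE = "t1_core"
    GPU = "gpu"


@dataclass
class Thread:
    name: str
    kind: Kind


def pick_target(thread: Thread, gpu_busy: bool, idle_t1_cores: int) -> Target:
    """Decide where to queue a thread, following the rule in the post."""
    # Ordinary threads always stay on the T1.
    if thread.kind is Kind.INTEGER:
        return Target.T1_CORE
    # Floating point threads go to the GPU, unless the GPU is already
    # busy and a T1 core is sitting idle anyway.
    if gpu_busy and idle_t1_cores > 0:
        return Target.T1_CORE
    return Target.GPU


fp = Thread("matrix-multiply", Kind.FLOAT)
web = Thread("http-worker", Kind.INTEGER)
print(pick_target(fp, gpu_busy=False, idle_t1_cores=3))
print(pick_target(fp, gpu_busy=True, idle_t1_cores=3))
print(pick_target(web, gpu_busy=False, idle_t1_cores=0))
```

Even in this toy form, something has to set `kind` on every thread, which is exactly the part existing applications know nothing about.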
Now even if you have the above changed, the only thing this will work on is custom made applications, in other words, you will need to completely re-write anything and everything to take advantage of this setup. This really isn't viable when you may possibly be dealing with non-open-source products like Matlab or Oracle. Even with open source products, it will take MAJOR rework to implement a change like this.
The T1 is designed as it is, a multi-core processor that would make a very good NFS Data Server, ftp server, or web host server with highly efficient power usage. It is NOT a database, application, or HPC server core. Too many of the latter operations require too many floating point operations to be run efficiently on the T1. In a pinch you can use it for them, but it will not shine in that application.
We were all warned a long time ago that MS products sucked, remember the Magic 8 Ball said, "Outlook not so good"
Most real life CAD software (as in, what is used to build chips inside your little computer box or your cellphone) used to be (~8 years ago) on Solaris, occasional HP/AIX, Linux. Now it is Linux, Solaris, the rest are somewhat supported, but not exactly healthy... You can get some FPGA/PCB/Solid 3D CAD on Windows, but it is nowhere near the true industrial-strength quality. Think about it this way, if you pay $100,000 for a seat, it does not really matter how much the hardware is, and Sun was winning due to general stability/availability. IBM (the big Cadence shop) pushed Cadence to release the Linux version of their software simultaneously with the Solaris version about 5 years ago, and since then Linux has been gaining popularity...
;-)
There are no good technical reasons not to recompile something like this for OS-X, but if you can imagine porting a package which comes as a bookshelf of CDs from UN*X to Win API, I'd like some of the stuff you are smoking!
Paul
nVidia & IBM/Sony/Cell/Playstation can perform only 32-bit single-precision floating point calculations in hardware. [IBM/Sony can, at least in theory, perform 64-bit double-precision floating point calculations, but the implementation involves some weird software emulation thingamabob which incurs a massive performance penalty.]
ATi is even worse - last I checked, they could perform only 24-bit "three-quarters"-precision floating point calculations in hardware.
And just in case you aren't aware, 32-bit single-precision floats are essentially worthless for anyone doing even the simplest mathematical calculations; for instance, with 32-bit single-precision floats, integer granularity is lost at 2 ^ 24 = 16M, i.e.
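The 2 ^ 24 limit is easy to check for yourself. The sketch below uses only Python's standard `struct` module to round a value to a 32-bit single; the helper name `to_float32` is my own, not part of any library.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (a 64-bit double) to the nearest 32-bit single."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


LIMIT = 2 ** 24  # 16,777,216

below = to_float32(LIMIT - 1)   # still exactly representable
at = to_float32(LIMIT)          # exactly representable
above = to_float32(LIMIT + 1)   # not representable as a single

print(below, at, above)
print("LIMIT + 1 collapses onto LIMIT:", above == at)
```

Past that point a single can only step in units of 2, then 4, and so on, so consecutive integers stop being distinguishable.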
Now while 64-bit double-precision floats [or "doubles"] are probably accurate enough for most financial calculations, where, generally speaking, accuracy is only needed to the nearest 1/100th [i.e. to the nearest cent], 64-bit doubles are still more or less worthless to the mathematician, physicist, and engineer. For instance, consider the work of Professor Kahan at UC-Berkeley:
In particular, read a few of these papers from the late nineties: At the time, Kahan was arguing in favor of using the full power of the Intel/AMD 80-bit extended precision doubles [i.e. embedding 64-bit doubles in an 80-bit space, performing calculations with the greater accuracy afforded therein, and then rounding the result back down to 64 bits and returning that as your answer], but, truth be told, the sine qua non of hardware-based calculations is true 128-bit "quad-precision" floating point calculations as performed in hardware. Sun has a "quad-precision" floating point number for Solaris/SPARC, but, sadly, it's a software hack, and, like IBM/Sony/Cell/Playstation, far too slow to be used in practice.
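The "compute wide, round once at the end" idea can be illustrated one size down. Plain Python does not expose the x87 80-bit format, so this sketch substitutes 32-bit singles for the narrow format and 64-bit doubles for the wide one; that substitution is my assumption for the sake of a runnable example, not the 64-in-80 case itself. The helper `to_float32` and the loop size `N` are likewise just illustrative choices.

```python
import struct
from fractions import Fraction


def to_float32(x: float) -> float:
    """Round a 64-bit double to the nearest 32-bit single."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


N = 100_000
term = to_float32(0.1)  # the single nearest to 0.1

# Narrow arithmetic: round to single precision after every addition.
narrow_sum = 0.0
for _ in range(N):
    narrow_sum = to_float32(narrow_sum + term)

# Wide arithmetic: accumulate in double, round to single only once.
wide_sum = 0.0
for _ in range(N):
    wide_sum += term
wide_result = to_float32(wide_sum)

# Exact reference computed with rational arithmetic.
exact = Fraction(term) * N
narrow_err = abs(Fraction(narrow_sum) - exact)
wide_err = abs(Fraction(wide_result) - exact)

print("error, rounding every step:", float(narrow_err))
print("error, rounding once      :", float(wide_err))
```

Both answers are returned in the narrow format; the only difference is where the intermediate rounding happens, which is the point of carrying the extra bits.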
I believe that IBM makes a chip for the Z-Series mainframe, which can perform 128-bits in hardware, but I imagine that it's prohibitively expensive [if you could even convince IBM to sell it to you in the first place].
The best configuration here would probably look like a fancy-schmancy Digital Signal Processor [DSP] chipset, from someone like Texas Instruments, capable of 128-bit hardware calculations, mounted onto a card that would plug into something very fast, like a 16x PCIe bus, which in turn would be connected to a HyperTransport bus [but boy, wouldn't it be really cool if the DSP lay directly on the HyperTransport bus itself?].
By the way, if anyone knows of a company that's making such a card, with stable drivers [or, God forbid, a motherboard with a socket for a 128-bit DSP on the HyperTransport bus], then please tell me about it, 'cause I'd be very interested in purchasing such a thing.
I have yet to see a low profile version; however, I have seen v210s and v240s with this card in them. It could only be a matter of time.
Who is general failure, and why is he reading my hard drive?
I worked on a couple of similar projects using TI C51 and AT&T DSP32 processors. I recall that the 286 could not keep up with the data rate using the Borland C compiler. I had to delve into x86 asm to optimize some loops in order to get it to keep up. The C51 board was a telecom voice processor including PCM modems and such. The DSP32 was a multi-channel (as in T1) DTMF decoder. It ended up running at 98% utilization (50MHz) with a *lot* of hand optimized code...
Fun, bleeding edge, stuff back then.
Good judgement comes from experience, and experience comes from bad judgement.
- W. Wriston, former Citibank CEO
And if I mounted an SRB (Shuttle solid rocket) on top of a Prius it would be the fastest car on earth. Why even validate Sun's hype any further by claiming that there's a simple solution to a fundamental deficiency?
In theory, if you run the mobo outside its normal case, you could throw a supported-on-sparc sun framebuffer in it and have things work .... not that I've got one handy nor would be willing to try and splice it into an atx chassis or whatnot ....