Boost UltraSPARC T1 Floating Point w/ a Graphics Card?

Video card for Sparc? by Corbets · 2006-04-22 08:57 · Score: 0, Flamebait

Man, aside from the occasional desktop, who actually uses video cards on a Sun machine??? ;-)

Re:Video card for Sparc? by Eideewt · 2006-04-22 09:04 · Score: 1

Well, there might be a reason to if you could use one to boost performance.
Re:Video card for Sparc? by Anonymous Coward · 2006-04-22 09:07 · Score: 1, Interesting

Ever heard of CAD?
Re:Video card for Sparc? by nukem996 · 2006-04-22 09:48 · Score: 1

Why would you buy a Sun machine for CAD? Most CAD software is for Linux, Mac, or Win. It'd be cheaper to just build a box or buy a Dell.
Re:Video card for Sparc? by MichaelSmith · 2006-04-22 10:13 · Score: 1

Most CAD software is for Linux, Mac, or Win.
Just curious. Do you know of any good CAD software for Linux other than qcad?

--
http://michaelsmith.id.au
Re:Video card for Sparc? by pizpot · 2006-04-22 13:24 · Score: 1

Just curious. Do you know of any good CAD software for Linux other than qcad?
Well, IBM makes, what is it called, CATIA for Linux, and EDS is making Unigraphics NX4 for Linux (which is why it was late). Both were Unix products before going to Windows too.
Re:Video card for Sparc? by Boone^ · 2006-04-22 14:18 · Score: 1

All chip design/verification/physical EDA software that Synopsys and Cadence create has Linux binaries. Maybe not "CAD", but it's engineering software nonetheless.
Re:Video card for Sparc? by csirac · 2006-04-22 14:45 · Score: 2, Informative

Most high-end CAD products that matter run on Solaris. It hasn't been until the last few years that they mostly have a Linux option, which is nice.
Re:Video card for Sparc? by nukem996 · 2006-04-22 14:56 · Score: 1

Most of the classes I've ever taken use AutoCAD, which is Windows-only. Would you know of a good Linux replacement for AutoCAD (preferably compatible with it)?
Re:Video card for Sparc? by csirac · 2006-04-23 21:21 · Score: 1

You could move to Solaris and run AutoCAD for Solaris; otherwise This article about running the Windows version under Linux looks useful.

The CAD applications I'm familiar with are all related to electronic engineering... Cadence, Mentor, OrCAD, EAGLE, etc. Some of these have Solaris versions without a Linux option, some have both. I'm sure there are good generic CAD programs out there for Linux, but I haven't used any.

This link looks useful.

No, you cannot by keesh · 2006-04-22 09:09 · Score: 4, Insightful

Sun SPARC kit doesn't use a BIOS. Unfortunately, nearly all modern graphics cards that haven't been specifically designed to work on non-x86* kit rely upon the BIOS to initialise the card. This massively limits the hardware availability. PCI, sadly, is only a hardware standard.

There's been some work by David S Miller on getting BIOS emulation into the Linux kernel so that regular cards can be fooled into working, but it's not there yet and will probably fall foul of Debian's firmware loading policy (does that apply to Ubuntu too?).

Re:No, you cannot by Jeff+DeMaagd · 2006-04-22 09:37 · Score: 2, Informative

That problem had been solved for Alpha computers around 1992. I was able to choose from any standard PCI video card, though driver support in the OS was a different issue. There may be some patent issues though, so the approach might need to be different.
Re:No, you cannot by Anonymous Coward · 2006-04-22 10:47 · Score: 1, Interesting

[...]will probably fall foul of Debian's firmware loading policy

No, it won't. The firmware won't be shipped with Debian; it would be run directly from the ROM on the very card that is to be initialized. Debian has shipped XFree86 for a long time, and it supports a similar method to initialize secondary graphics cards that require their BIOS to set them up to function properly (probably only works on x86 CPUs).
Re:No, you cannot by Anonymous Coward · 2006-04-22 13:05 · Score: 0

I don't know about Debian, but Ubuntu came with the firmware to boot my Intel IPW2200 wireless minipci card.
Re:No, you cannot by antime · 2006-04-22 20:43 · Score: 1

Lack of a BIOS can be worked around (e.g. the Pegasos boards have some sort of emulation built into their firmware that allows you to use normal PC graphics cards despite being PPC and OpenFirmware-based), but without drivers you ain't doing jack shit. And that's a very big problem if you're not using an x86 CPU. The open-source r300 driver is making progress but is not near production quality, and AFAIK nothing similar exists for nVidia chips yet, so unless you can convince ATI and nVidia to port their drivers to Sun hardware, this idea is dead in the water.

Probably, but it's not an optimal solution by the_humeister · 2006-04-22 09:12 · Score: 4, Informative

Especially since current GPUs don't implement double-precision floating point math. Heh, in that vein you could add a dual Opteron single-board computer into one of the expansion slots...

Re:Probably, but it's not an optimal solution by Jeff+DeMaagd · 2006-04-22 10:24 · Score: 1

Heh, in that vein you could add a dual Opteron single-board computer into one of the expansion slots...

I'm not certain of the cost of the T1 systems, but I would think that if FPU is important, you'd rather just go for a dual-dual-core server. The T1 systems are compatible with more memory though, 32GB for the T1000 vs 16GB for what I've seen in the AMD dual processor workstations.
Re:Probably, but it's not an optimal solution by Wesley+Felter · 2006-04-22 12:53 · Score: 1

64GB two-socket Opteron systems (e.g. IWill DK88) are rare but available for people who need them.
Re:Probably, but it's not an optimal solution by Glonoinha · 2006-04-23 02:14 · Score: 1

64GB ought to be enough for anybody.
~Glonoinha, April 23, 2006

--
Glonoinha the MebiByte Slayer
Re:Probably, but it's not an optimal solution by syukton · 2006-04-23 14:26 · Score: 1

...anybody not running Vista.

--
Reinvent the wheel only at either a lower cost, greater effectiveness, or your own personal enrichment and satisfaction.
Re:Probably, but it's not an optimal solution by gormanly · 2006-04-24 01:32 · Score: 1

Actually, the Sun Fire T1000 only supports 16GB - but the T2000 does support 32GB. If you're looking at these, the alternative Opteron solution would be another rack-mounted system rather than a workstation, and the obvious candidate is the Sun Fire V40z with 32GB and 4 dual-core Opterons. Which, btw, are very nice systems.

Yes you can.. maybe not on SPARC though.. by NekoXP · 2006-04-22 09:17 · Score: 5, Informative

We produce an Open Firmware solution which includes an x86 emulator to bootstrap x86 hardware, specifically graphics cards and the like.

PowerPC boards, PC graphics chips with x86 BIOS, no driver edits required on the OS side.. it is there like it would be on a PC.

http://metadistribution.org/blog/Blog/78A3C88E-1CE7-45B8-9C79-420134DD9B8E.html
http://www.genesippc.com/

Thanks for making me feel old... by pedantic+bore · 2006-04-22 09:29 · Score: 5, Insightful

I remember when it was common practice to buy extra hardware to add to your system to implement fast floating point ops. First it was a box (FPS), then a few cards (Sky), then a card (Mercury), then a daughterboard (everyone), then a chip (Weitek)... and then it was on the CPU and everyone expected it to be there.

But Sun realized that the more things change, the more they stay the same; the reason why vendors got away with making floating point an expensive option was that there are lots of workloads where floating point performance is unimportant. So they applied the RISC principle and chose to not waste a lot of silicon on the T1 implementing instructions that are not needed in their target workload, but instead figure out how to get lots of concurrent threads.

Trying to improve floating point perf on a T1 by adding another card is like trying to figure out how to put wheels on a fish. It might be a cool hack and it might solve some particular problem but it doesn't generalize.

If you want floating point perf and tons of threads, wait for the Rock chip from Sun (and hope that Sun stays afloat long enough to ship it). It's like a T1 only more so, with floating point for each thread.

--
Am I part of the core demographic for Swedish Fish?

Re:Thanks for making me feel old... by fm6 · 2006-04-22 10:10 · Score: 1

It might be a cool hack and it might solve some particular problem but ...
There's no "but" here. Cool hacks don't happen because they're useful, they happen because they're cool.
Re:Thanks for making me feel old... by Doc+Ruby · 2006-04-22 11:41 · Score: 1

Meanwhile, GPU developers have created a component that processes floating point math very quickly, sold for much less $:FLOPS than Sparcs (or any other CPU). Combining a T1 and GPGPU offers "best of breed" economies of scale appropriate to each component, like installing 3rd party memory and HD rather than the expensive Sun brands.

That's why GPGPU is an interesting strategy. GPU APIs offer parallelism, too. When those APIs can be harnessed with bus signalling that's high-enough level symbolically to exploit the processing speed without bottlenecking on the bus data bandwidth, it's quite a compelling architecture. It definitely needs a lot of work to make truly general purpose (or enough to get critical mass from a lot of niches). But that's why encouraging people like the poster to try it is worth doing.

--
--
make install -not war
Re:Thanks for making me feel old... by Anonymous Coward · 2006-04-22 12:05 · Score: 1, Informative

>Combining a T1 and GPGPU offers "best of breed" economies of scale appropriate to each component, like installing 3rd party memory and HD rather than the expensive Sun brands.

Combining a T1 and a GPU offers you jack, since GPUs use single-precision arithmetic.
Re:Thanks for making me feel old... by pedantic+bore · 2006-04-22 14:25 · Score: 1

Well, no, if you want flops/$ then the signal processing chips used in cell phones and MP3 players are the clear winners. There are some real screamers here. But they're a bit complicated to program and don't function well as general purpose processors, which is why they're primarily used in systems where they can be programmed once and then shipped by the million.
As I wrote before, I'm sure there's some workload where it makes sense to mate a T1 and a GPU (besides the obvious one, i.e., rendering graphics). But the relative latency and bandwidth gulf between the CPU and GPU make it impractical in many cases. Want to multiply two numbers? Do it on the chip; it might take more CPU time but it doesn't take any setup, transfer, and teardown time. Want to multiply a million numbers? In that case the setup and init time is amortized and the memory transfers can be pipelined, so it could make sense. Where's the break-even point? I don't know, but I'll wager it's closer to a million than to one.
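A back-of-the-envelope cost model makes the break-even question concrete. This is a minimal sketch, assuming a fixed setup/teardown cost per offload plus per-element transfer and compute costs; every number below is hypothetical and chosen only to show the shape of the trade-off, not measured on a T1 or any GPU.

```python
import math

# Hypothetical timings in nanoseconds; placeholders, not measurements.
CPU_NS_PER_OP = 40.0        # hypothetical: one multiply on the host CPU
GPU_NS_PER_OP = 0.5         # hypothetical: amortized per-element GPU cost
TRANSFER_NS_PER_OP = 8.0    # hypothetical: moving operands/result over the bus
SETUP_NS = 2_000_000.0      # hypothetical: fixed setup + teardown per batch


def cpu_time(n):
    """Time to do n multiplies on the host CPU."""
    return n * CPU_NS_PER_OP


def gpu_time(n):
    """Time to offload n multiplies: fixed setup plus per-element costs."""
    return SETUP_NS + n * (GPU_NS_PER_OP + TRANSFER_NS_PER_OP)


def break_even():
    """Smallest n for which offloading is strictly faster, or None if never."""
    saving = CPU_NS_PER_OP - (GPU_NS_PER_OP + TRANSFER_NS_PER_OP)
    if saving <= 0:
        return None  # the bus costs more per element than the GPU saves
    # cpu_time(n) > gpu_time(n)  <=>  n * saving > SETUP_NS
    return math.floor(SETUP_NS / saving) + 1


if __name__ == "__main__":
    print("offload pays off from n =", break_even())
```

With these made-up numbers the crossover is in the tens of thousands of elements; the point is that the fixed setup cost and the per-element bus cost, not the GPU's raw speed, decide where it lands.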

Re:Thanks for making me feel old... by Doc+Ruby · 2006-04-22 18:05 · Score: 3, Interesting

Those DSPs you mention aren't CPUs, and they're not available on PCI cards - plus the programmability you mention.

The way to think about the use of GPGPU in a host with its own (GP) CPU is client/server computing. I put together such a system in 1990, a 12MHz 80286, with 4 12.5MFLOPS DSPs (AT&T DSP32c) and an FPGA "scheduler" on the ISA card. The 286 ran a loop sending data and commands to a memory mapped page on the card's SRAM, and copying the page when a status register was set. I had realtime 24bit VGA renderings of megapolygons at 30FPS, all processed on the DSPs. The systems have all scaled up, but the price improvement per FLOPS of the GPUs over the CPU is even better now than then.
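The host side of that arrangement is essentially a mailbox protocol: write operands and a command into a shared page, wait for the card to raise a status flag, then copy the results out. The sketch below simulates it in Python, with a thread standing in for the DSP card and `threading.Event` objects standing in for the status register (blocking instead of busy-polling). The names `SharedPage`, `dsp_card`, `host_submit` and the `"SCALE"` command are hypothetical, not taken from the original system.

```python
import threading


class SharedPage:
    """Hypothetical stand-in for the card's memory-mapped page and status register."""

    def __init__(self):
        self.command = None
        self.data = []
        self.request = threading.Event()  # host -> card: "page is filled"
        self.done = threading.Event()     # card -> host: status register set


def dsp_card(page, stop):
    """Hypothetical coprocessor loop: wait for work, process the page, flag completion."""
    while True:
        page.request.wait()
        page.request.clear()
        if stop.is_set():
            return
        if page.command == "SCALE":
            # Hypothetical command: first word is the factor, rest are operands.
            factor = page.data[0]
            page.data = [factor * x for x in page.data[1:]]
        page.done.set()


def host_submit(page, command, payload):
    """Host side: fill the page, signal the card, wait for status, copy results out."""
    page.command = command
    page.data = list(payload)
    page.done.clear()
    page.request.set()
    page.done.wait()
    return list(page.data)  # copy before the page is reused


if __name__ == "__main__":
    page, stop = SharedPage(), threading.Event()
    card = threading.Thread(target=dsp_card, args=(page, stop), daemon=True)
    card.start()
    print(host_submit(page, "SCALE", [2.0, 1.0, 2.5, 3.0]))
    stop.set()
    page.request.set()  # wake the card so it can see the stop flag
    card.join()
```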

As you say, the key is keeping the compute servers full, which amortizes the signalling overhead best, and keeping the signaling across the bus high-level enough that the bandwidth doesn't bottleneck. There are lots of demanding apps now which could use that architecture. Audio compression is my favorite - I'm waiting to stuff a $1000 P4 with 6 $400 dual GPUs, and beat the performance of any <$10K server, scalable down to $1500. That's the kind of host that could really transform telephony.

Re:Thanks for making me feel old... by bhima · 2006-04-22 20:14 · Score: 1

Not that this is directly related... but it is interesting and related in the sense that my first effort in DSP work was more or less bottlenecked at the ISA bus... and lately I have been tinkering with a design that would certainly be bottlenecked by a PCI-X or PCI-e bus.

http://www.drccomputer.com/pages/products.html

--
Nothing in the world is more dangerous than sincere ignorance and conscientious stupidity.
Re:Thanks for making me feel old... by pedantic+bore · 2006-04-22 23:18 · Score: 1

the price improvement per FLOPS of the GPUs over the CPU is even better now
Yes, but the price improvement per bandwidth and especially latency of the interconnect between the two is much worse. Going off-chip for anything has a huge cost; for it to make sense, you have to be able to amortize that cost.
And those DSP chips are CPUs in the conventional sense, although they don't have all the niceties that modern CPUs have (which, ironically, also often used to be implemented as co-processors: complicated interrupt/exception handling, virtual address translation). What I'm thinking of is the TMS320* ISA, which is much more like a CPU than a GPU is. Maybe you're thinking of something else.

Re:Thanks for making me feel old... by htd2 · 2006-04-25 03:23 · Score: 1

Sun's original Motorola 68K-based workstations had optional FPUs, as did the first "desktop" SPARC workstation, the 4/110. Sun workstations or servers equipped with a VME bus also had access to an optional Weitek FPU.

Even more exotic was the TAAC-1, a wide-instruction-word processor that could be used for FFTs, imaging, etc.

One correction: the T2 (Niagara II) will be the first heavily multi-threaded SPARC CPU with one FPU per core. It is due out next year, with Rock due out in 2008.
Re:Thanks for making me feel old... by networkBoy · 2006-04-25 14:01 · Score: 1

"Those DSPs you mention aren't CPUs, and they're not available on PCI cards"

Since when (on both counts)?
DSP == Digital Signal _Processor_ which is the Central Processor Unit on several platforms I know of.

http://www.signalogic.com/index.pl?page=m44
http://www.bittware.com/products/type/dsp-pci.cfm
http://www.innovative-dsp.com/products/delfin.htm
http://www.innovative-dsp.com/products/toro.htm
http://www.globalspec.com/FeaturedProducts/Detail/InnovativeIntegration/CONEJO_64_bit_PCI_DSP_Card/11265/0?fromSpotlight=1
and my fav:
http://www.signatec.com/products/dsp_PMP1000_parallel_digital_signal_processing_PCI_board.asp
For the record I'm waiting for the signatec to be available as a PCIe x16 card. As it is I have to sneak time on it for transcoding...
-nB

--
whois gawk date unzip strip find touch finger mount join nice man top fsck grep eject more yes exit umount sleep dump
Re:Thanks for making me feel old... by LWATCDR · 2006-04-26 09:56 · Score: 1

There was another system that had an optional FPU. I think it was called the IBM PC. You could get an FPU called the 8087. It was expensive, and your software had to be compiled to support it, which very few programs were.
Was the Weitek an FPU or a vector processor?

--
See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.
Re:Thanks for making me feel old... by htd2 · 2006-04-27 01:58 · Score: 1

The Weitek unit was an FPU. Like the 8087 in the IBM PC, you had to compile applications specifically to use the Weitek unit.
Re:Thanks for making me feel old... by LWATCDR · 2006-04-27 03:30 · Score: 1

Do you know if modern compilers for the x86 compile for an FPU and then emulate it if there isn't one? Or do they just expect an FPU these days? Does it depend on your target CPU?
I almost never need to use floats in my code so I haven't really looked in a long time.


Wait for the T2 by IvyKing · 2006-04-22 09:32 · Score: 3, Interesting

The T2 is supposed to have an FPU for each core, so it would be a simpler solution than trying to use a graphics card. The T2 is also supposed to have double the number of threads per core and even more memory bandwidth.

Re:Wait for the T2 by drix · 2006-04-22 10:00 · Score: 0

Are you sure of that? I thought the whole point of the one FPU per chip was to dramatically cut down on power consumption, which is one of Niagara's main selling points.

--

I think there is a world market for maybe five personal web logs.
Re:Wait for the T2 by bbrack · 2006-04-24 09:21 · Score: 1

There is an FPU per core, and the power for Niagara 2 is still supposed to be remarkably low.
Re:Wait for the T2 by htd2 · 2006-04-27 02:03 · Score: 1

Correct, T2 is expected to be lower power than or equivalent to the T1; part of this is because T2 will be built in a 65nm process as opposed to the 90nm process used to fabricate T1s.

The changes in T2 are: 2 pipelines per core, up from 1; 8 threads per core, up from 4; an FPU per core, up from 1 per module; a faster memory subsystem; and additional hardware support for encryption and network offload. On-chip cache is expected to remain the same.

Feh by NitsujTPU · 2006-04-22 09:45 · Score: 2, Insightful

At that point, you're bound by the bandwidth between the graphics card and the CPU. Why not just purchase hardware that works for what you want to use it for in the first place?

Re:Feh by Mr+Z · 2006-04-22 10:01 · Score: 1

Why not just purchase hardware that works for what you want to use it for in the first place?

What if you want a better solution than the ones that are normally available?

--
Program Intellivision!
Re:Feh by NitsujTPU · 2006-04-22 10:12 · Score: 1

This is a workaround, and usually not a very good one. I've seen people do very specialized things by moving the floating point stuff off to video cards, but for general computation, I think it's a rather poor solution.

I.e., this is not a better solution than the ones that are normally available.

Will never work properly.... by Fallen+Kell · 2006-04-22 10:38 · Score: 3, Informative

All kinds of problems will arise with a setup like this. Performance may improve for certain things, but only if that code is written specifically for it, and code is not written for a unique setup like this. Multi-threaded code assumes that all CPUs have approximately the same abilities (in other words, it does not split floating point ops into one thread and I/O and integer operations into other threads). Any thread in the application will potentially have floating point operations mixed with other operations.

Now even if you custom code an application to do all floating point work in a specific thread, you would need to completely modify the kernel thread management sub-systems. The threads themselves would need meta flag data to signify what "kind" of thread they are so that the "floating point thread(s)" are queued for running on the GPU and not on the T1 (unless there are idle T1 cores and the GPU is already busy).
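The routing policy described above (tag each thread with its kind, send floating-point work to the GPU, and fall back to an idle T1 strand only when the GPU is busy) might be sketched as follows. This is a toy user-space model, not kernel scheduler code; `Task`, `dispatch`, and the `"fp"`/`"int"` tags are hypothetical names standing in for the per-thread metadata such a modified scheduler would need.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    kind: str  # hypothetical metadata flag: "fp" or "int"


def dispatch(tasks, idle_strands, gpu_slots):
    """Toy placement pass: map each task name to 'gpu', 'cpu', or 'queued'.

    fp tasks prefer the GPU; if the GPU is full they may take an idle CPU strand.
    int tasks only ever run on CPU strands.
    """
    placement = {}
    for task in tasks:
        if task.kind == "fp" and gpu_slots > 0:
            gpu_slots -= 1
            placement[task.name] = "gpu"
        elif idle_strands > 0:
            idle_strands -= 1
            placement[task.name] = "cpu"
        else:
            placement[task.name] = "queued"
    return placement


if __name__ == "__main__":
    work = [Task("a", "fp"), Task("b", "fp"), Task("c", "int"), Task("d", "int")]
    print(dispatch(work, idle_strands=2, gpu_slots=1))
```

Even this toy shows the catch: the policy only works if every thread truthfully carries a tag, which is exactly the rewrite burden on applications.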

Now even if you have the above changed, the only thing this will work on is custom made applications, in other words, you will need to completely re-write anything and everything to take advantage of this setup. This really isn't viable when you may possibly be dealing with non-open-source products like Matlab or Oracle. Even with open source products, it will take MAJOR rework to implement a change like this.

The T1 is designed as it is, a multi-core processor that would make a very good NFS Data Server, ftp server, or web host server with highly efficient power usage. It is NOT a database, application, or HPC server core. Too many of the latter operations require too much floating point operations to be run efficiently on the T1. In a pinch you can use it for them, but it will not shine in that application.

--
We were all warned a long time ago that MS products sucked, remember the Magic 8 Ball said, "Outlook not so good"

Re:Will never work properly.... by Anonymous Coward · 2006-04-22 20:30 · Score: 0

Right, because we know how all database and application servers spend most of their time computing an FPU operation after FPU operation...
Re:Will never work properly.... by Anonymous Coward · 2006-04-24 19:09 · Score: 0

It is NOT a database, application, or HPC server core. Too many of the latter operations require too much floating point operations to be run efficiently on the T1
Oracle is a DB application and uses no floating point to implement its numeric data types. FP is too imprecise and varies too much between different types of systems. This might be also true for lots of other business apps.
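The imprecision point is easy to demonstrate. Python's `decimal` module is used here only as an illustration of decimal (non-binary-floating-point) arithmetic of the general kind a database numeric type relies on; it is not how Oracle implements its NUMBER type.

```python
from decimal import Decimal

# Binary floating point cannot represent 0.10 exactly, so cents drift.
float_total = sum([0.10] * 3)
# Decimal arithmetic keeps base-10 fractions exact.
decimal_total = sum([Decimal("0.10")] * 3)

if __name__ == "__main__":
    print(float_total)                       # 0.30000000000000004
    print(decimal_total)                     # 0.30
    print(float_total == 0.30)               # False
    print(decimal_total == Decimal("0.30"))  # True
```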
Re:Will never work properly.... by LWATCDR · 2006-04-26 10:01 · Score: 1

Why would a database server need floating point?
I have never written one, but I have written B-trees and hash algorithms and they never used floating point.
For a database server I would guess you would tend to be IO bound.
You do have a point in that the T1 is a good platform for a web server or file server but not ideal for many other tasks. I wonder what its SSL performance is like.

Re:Will never work properly.... by htd2 · 2006-04-27 02:11 · Score: 1

Not quite sure how that got modded informative.

DBMS don't require FPU performance since they don't issue floating point instructions. The app server market is also dominated by integer workloads, think Java and J2EE app servers as an example.

The T1 looks like an exceptionally effective Java/J2EE platform from the slew of great benchmark results Sun has published for the platform. It is also no slouch as a DBMS platform, as its SAP results show. It does lack single-threaded performance, so it's going to be better as an OLTP platform for DBMSs than for high-end reporting.

It also makes a good mailserver/messaging platform as its Notes performance demonstrates.

Huh? CAD on Macs/Windows??? by PaulBu · 2006-04-22 12:43 · Score: 4, Insightful

Most real-life CAD software (as in, what is used to build the chips inside your little computer box or your cellphone) used to be (~8 years ago) on Solaris, occasionally HP/AIX, and Linux. Now it is Linux and Solaris; the rest are somewhat supported, but not exactly healthy... You can get some FPGA/PCB/solid 3D CAD on Windows, but it is nowhere near true industrial-strength quality. Think about it this way: if you pay $100,000 for a seat, it does not really matter how much the hardware costs, and Sun was winning due to general stability/availability. IBM (the big Cadence shop) pushed Cadence to release the Linux version of their software simultaneously with the Solaris version about 5 years ago, and Linux has been gaining popularity since then...

There are no good technical reasons not to recompile something like this for OS X, but if you can imagine porting a package which comes as a bookshelf of CDs from UN*X to the Win API, I'd like some of the stuff you are smoking! ;-)

Paul

Re:Huh? CAD on Macs/Windows??? by nukem996 · 2006-04-22 14:16 · Score: 2, Informative

I've done some simple CAD stuff in school and all they use is AutoCAD and PTC. I guess I don't know too much about this stuff :\
Re:Huh? CAD on Macs/Windows??? by Kadin2048 · 2006-04-22 19:46 · Score: 1

The kind of CAD they're talking about in the *NIX workstation products is like an order of magnitude or more in complexity up from what most people do with AutoCAD. In short, some of those programs (the old "workstation" standbys) make AutoCAD look like something you'd use at Home Depot to lay out your new kitchen, while they themselves could be used to design an oil rig on the North Sea. They're not even close.

The gap may have narrowed from what it once was, but there are still things (particularly in some niche fields where Cost Is Not An Object, like petrochemicals) where *NIX workstations smoke the hell out of almost anything Windows based. And this is why you still see Sun and IBM selling what to the average person seem to be outrageously priced PCs, except that they're RISC and run Linux, or sometimes AIX or Solaris. (A while back I found what I thought was the most expensive PC I'd ever seen, it was a dual-proc Opteron from IBM -- not even a RISC box! -- that was close to nine grand. I think it was the "Intellistation A Pro" you can Google it.)

--
"Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."
Re:Huh? CAD on Macs/Windows??? by nukem996 · 2006-04-23 07:32 · Score: 1

Heh, OK. I'm just wondering since I'm forced to use shitty Win machines at school with AutoCAD (Intel graphics, 2.4GHz, etc.) while my machine at home runs a 7800 GT and an AMD X2 4400 on Linux.
Re:Huh? CAD on Macs/Windows??? by drinkypoo · 2006-04-24 08:37 · Score: 1

(A while back I found what I thought was the most expensive PC I'd ever seen, it was a dual-proc Opteron from IBM -- not even a RISC box! -- that was close to nine grand. I think it was the "Intellistation A Pro" you can Google it.)

ITYM "not even a RISC-instruction-set box!", since every Intel chip since the Pentium and every AMD chip since the Am586 is internally RISC.

Aside from that nit, you're totally right. I remember in the 90s seeing a video for IBM CAEDS, a CAD program that ran only on RS/6k. With sufficiently sexy hardware you could do things like realtime systems and stress analysis in it, as well as a bunch of stuff that PC apps do now, like injection molding modeling.

--
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
Re:Huh? CAD on Macs/Windows??? by Anonymous Coward · 2006-04-24 08:44 · Score: 0

Get yourself EDS Solid Edge 11 through your school. It should run fine on your machine at home, and the academic license is like $10 or something.
Re:Huh? CAD on Macs/Windows??? by drgonzo59 · 2006-04-25 23:34 · Score: 1

Actually the industrial grade CAD/CAE/CAM programs used to be written for SGI, mainly just because they had better hardware for visualization (and pretty good servers too, we had a couple of quad R12000 machines). I used to work on a popular CAD/CAM/CAE application, the full package was millions and millions of lines of C,C++,Fortran, and custom scripting language. The primary customer and development platform was SGI, everyone had a nice SGI workstation on their desk, but we had ports to HP/UX, Solaris and AIX. When SGI started to suck, we decided to jump ship and slowly migrate to NT (Linux was still a toy OS then). It took years to do that and I think they are still working out the issues resulting from that (I left the company).
As another side project I worked with a developer on porting a Pre/Post Processing Visualization program from Windows to Unix using MainWin. The major stuff worked but there were 1000 little annoyances and bugs that never got solved. I don't think we ever sold that specific port.
Bottom line, just like you said: porting sucks, especially if the original code was not written with portability in mind...

GPUs == Worthless Floating Point Precision by mosel-saar-ruwer · 2006-04-22 14:18 · Score: 3, Insightful

nVidia & IBM/Sony/Cell/Playstation can perform only 32-bit single-precision floating point calculations in hardware. [IBM/Sony can, at least in theory, perform 64-bit double-precision floating point calculations, but the implementation involves some weird software emulation thingamabob which invokes a massive performance penalty.]

ATi is even worse - last I checked, they could perform only 24-bit "three-quarters"-precision floating point calculations in hardware.

And just in case you aren't aware, 32-bit single-precision floats are essentially worthless for anyone doing even the simplest mathematical calculations; for instance, with 32-bit single-precision floats, integer granularity is lost at 2 ^ 24 = 16M, i.e.

16777216 + 0 = 16777216
16777216 + 1 = 16777216
16777216 + 2 = 16777218
16777216 + 3 = 16777220
16777216 + 4 = 16777220
16777216 + 5 = 16777220
16777216 + 6 = 16777222
16777216 + 7 = 16777224
16777216 + 8 = 16777224
16777216 + 9 = 16777224
16777216 + 10 = 16777226
16777216 + 11 = 16777228
16777216 + 12 = 16777228
16777216 + 13 = 16777228
16777216 + 14 = 16777230
16777216 + 15 = 16777232
16777216 + 16 = 16777232
etc
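The table can be reproduced by rounding each exact sum to IEEE 754 single precision. Python has no native float32 type, so this minimal sketch round-trips values through the standard `struct` module's `'f'` format, which rounds to the nearest single-precision value (ties to even). Because the exact sum is rounded only once, this matches what a float32 addition produces.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (IEEE double) to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


BASE = 2 ** 24  # 16777216: above this, adjacent float32 values are 2 apart

if __name__ == "__main__":
    for i in range(17):
        # BASE + i is exact as a double; rounding it once to float32
        # gives the same result a float32 addition would.
        print(f"{BASE} + {i} = {int(to_float32(float(BASE + i)))}")
```

The odd cases alternate between rounding down and up because each one is an exact tie, and ties go to the neighbor with an even significand.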

Now while 64-bit double-precision floats [or "doubles"] are probably accurate enough for most financial calculations, where, generally speaking, accuracy is only needed to the nearest 1/100th [i.e. to the nearest cent], 64-bit doubles are still more or less worthless to the mathematician, physicist, and engineer.
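Doubles hit the same wall, just later: with a 53-bit significand, integer granularity is lost at 2^53. A quick check in Python, whose `float` is an IEEE double on mainstream platforms (`math.ulp` requires Python 3.9 or later):

```python
import math

BIG = 2.0 ** 53  # 9007199254740992.0

if __name__ == "__main__":
    print(BIG + 1 == BIG)  # True: 2**53 + 1 is a tie and rounds to even (2**53)
    print(BIG + 2 == BIG)  # False: 2**53 + 2 is representable
    print(math.ulp(BIG))   # 2.0: spacing between adjacent doubles at 2**53
    print(math.ulp(1.0))   # 2.220446049250313e-16, i.e. 2**-52
```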

For instance, consider the work of Professor Kahan at UC-Berkeley:

William Kahan

In particular, read a few of these papers from the late nineties:

PDF File: Roundoff Degrades an Idealized Cantilever
PDF File: How JAVA's Floating-Point Hurts Everyone Everywhere
PDF File: Matlab's Loss is Nobody's Gain

At the time, Kahan was arguing in favor of using the full power of the Intel/AMD 80-bit extended precision doubles [i.e. embedding 64-bit doubles in an 80-bit space, performing calculations with the greater accuracy afforded therein, and then rounding the result back down to 64-bits and returning that as your answer], but, truth be told, the Sine Qua Non of hardware-based calculations is true 128-bit "quad-precision" floating point calculations as performed in hardware.
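The extended-precision idea (carry intermediates in a wider format, round once at the end) can be illustrated one level down, with double as the "wide" format and float32 as the "narrow" one. This is an analogy sketch, not x87 80-bit arithmetic; it emulates float32 by rounding through the standard `struct` module, since Python has no native single-precision type.

```python
import struct


def to_float32(x: float) -> float:
    """Round an IEEE double to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sum_narrow(values):
    """Accumulate in (emulated) float32: round after every addition."""
    acc = 0.0
    for v in values:
        acc = to_float32(acc + v)
    return acc


def sum_wide(values):
    """Accumulate in double, and round to float32 only once at the end."""
    acc = 0.0
    for v in values:
        acc += v
    return to_float32(acc)


values = [2.0 ** 24] + [1.0] * 1000

if __name__ == "__main__":
    print(sum_narrow(values))  # 16777216.0: every +1 is rounded away
    print(sum_wide(values))    # 16778216.0: exact, and representable in float32
```

Both results are float32 values; only the width of the intermediate accumulator differs.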

Sun has a "quad-precision" floating point number for Solaris/SPARC, but, sadly, it's a software hack, and, like IBM/Sony/Cell/Playstation, far too slow to be used in practice.

I believe that IBM makes a chip for the Z-Series mainframe, which can perform 128-bits in hardware, but I imagine that it's prohibitively expensive [if you could even convince IBM to sell it to you in the first place].

The best configuration here would probably look like a fancy-schmantzy Digital Signal Processor [DSP] chipset, from someone like Texas Instruments, capable of 128-bit hardware calculations, mounted onto a card that would plug into something very fast, like a 16x PCIe bus, which in turn would be connected to a HyperTransport bus [but boy, wouldn't it be really cool if the DSP lay directly on the HyperTransport bus itself?].

By the way, if anyone knows of a company that's making such a card, with stable drivers [or, God forbid, a motherboard with a socket for a 128-bit DSP on the HyperTransport bus], then please tell me about it, 'cause I'd be very interested in purchasing such a thing.

Re:GPUs == Worthless Floating Point Precision by Anonymous Coward · 2006-04-22 18:10 · Score: 0

http://www.ati.com/products/RadeonX1900/specs.html

"Full speed 128-bit floating point processing for all shader operations"
"64-bit floating point HDR rendering supported throughout the pipeline"

http://www.nvidia.com/object/7_series_techspecs.html

"Full 128-bit studio-quality floating point precision through the entire rendering pipeline"
Re:GPUs == Worthless Floating Point Precision by pla · 2006-04-23 01:15 · Score: 1

truth be told, the Sine Qua Non of hardware-based calculations is true 128-bit "quad-precision" floating point calculations as performed in hardware.

For in-hardware calculation, yes. For a quick approximation or when the result has no serious consequences, yes. For anyone serious about getting the correct answer, no no no.

We (by which I mean CS, math, and hard-science folks) have known since the earliest days of floating point that it has inherent, unavoidable flaws that no arbitrary fixed number of bits can solve. Virtually all CPUs before 1989 (the year the 486DX came out; or for Apple folks, 1990 with the 68040) didn't even include an FPU, and for a good reason: the only people doing "serious" number crunching understood the limitations of floating point, and wouldn't use it even if the CPU supported it, for exactly the reasons you mention. Not until 1992, with the popularity of Wolf3d, did Joe Sixpack start using his machine for serious (if you can call it that) FP number-crunching, and in the domain of gaming, perfect accuracy of the result doesn't matter nearly so much as speed.

If you care about your answer, no matter how many bits the FPU supports, you do it in software. Period. You use GMP, and don't round until the final result... and while that might not always prove possible due to having finite memory, I highly doubt we'll ever see even a 1024-bit FPU, much less one using 1048576 bits.

Unfortunately, and as one of your links mentions, I seriously wonder if many of the current generation of programmers even know about this issue, never mind care (Huh, I sound like a cranky old man now). FAR too often I encounter code that uses something like "if(fpvar==0.0)"... That works fine for integers, but after a long series of FP calculations, 3E-15 just doesn't equal zero no matter how much it "should".
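A minimal Python sketch of the comparison problem, assuming IEEE 754 doubles (which CPython floats are on mainstream platforms): the residue of 0.1 + 0.2 - 0.3 is not exactly zero, so an exact == test fails where a tolerance-based test passes. The abs_tol value is an arbitrary illustration, not a recommended constant; a real tolerance should come from an error analysis of the actual computation.

```python
import math

# Mathematically this is 0, but 0.1, 0.2 and 0.3 are not exactly representable
# in binary floating point, so a small residue survives.
residue = 0.1 + 0.2 - 0.3

exact_test = (residue == 0.0)                              # the fragile "if(fpvar==0.0)" style
tolerant_test = math.isclose(residue, 0.0, abs_tol=1e-12)  # compare against a tolerance instead

print(residue, exact_test, tolerant_test)
```

Note that math.isclose(residue, 0.0) with only its default relative tolerance would still return False: a relative tolerance is meaningless when comparing against zero, which is why abs_tol is set explicitly.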
Re:GPUs == Worthless Floating Point Precision by Anonymous Coward · 2006-04-23 02:27 · Score: 1, Informative

Unfortunately, and as one of your links mentions, I seriously wonder if many of the current generation of programmers even know about this issue, never mind care (Huh, I sound like a cranky old man now).

Not cranky and old enough.

If you care about your answer, no matter how many bits the FPU supports, you do it in software. Period. You use GMP, and don't round until the final result... and while that might not always prove possible due to having finite memory, I highly doubt we'll ever see even a 1024-bit FPU, much less one using 1048576 bits.

No, you don't. You do an error analysis, quantify the imprecision, and move on. Your point about all floating point ultimately being limited in precision anyway is a good one that the OP seemed to overlook in his advocacy for 80/128/whatever bit floating point as a "gold standard," but the idea that you'd do a black hole simulation completely in software is laughable.
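As a rough illustration of "do an error analysis and quantify the imprecision", here is a minimal Python sketch using the standard a priori bound for left-to-right summation of n doubles, |computed - exact| <= gamma_(n-1) * sum|x_i|, with gamma_k = k*u/(1 - k*u) and unit roundoff u = 2^-53 (a textbook result from the numerical-analysis literature; the helper names are hypothetical). Python's fractions.Fraction is used only to check the bound against the exact sum of the stored inputs.

```python
import random
from fractions import Fraction

U = 2.0 ** -53  # unit roundoff of IEEE 754 binary64 with round-to-nearest

def naive_sum(xs):
    # Plain left-to-right accumulation, written out by hand because the
    # built-in sum() in Python 3.12+ compensates float rounding.
    total = 0.0
    for x in xs:
        total += x
    return total

def summation_error_bound(xs):
    # A priori bound gamma_{n-1} * sum|x_i| for recursive summation.
    k = len(xs) - 1
    gamma = k * U / (1 - k * U)
    return gamma * sum(abs(x) for x in xs)

def actual_error(xs):
    # Fraction(x) converts each double exactly, so this is the true error of
    # naive_sum relative to the exact sum of the stored inputs.
    exact = sum(Fraction(x) for x in xs)
    return abs(Fraction(naive_sum(xs)) - exact)

random.seed(0)
data = [random.uniform(-1.0, 1.0) for _ in range(1000)]
print(float(actual_error(data)), summation_error_bound(data))
```

The bound is pessimistic (the observed error is usually far smaller), but it is guaranteed, which is the point: you know how wrong the answer can be and can decide whether that matters.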

GMP doesn't solve the problem (incidentally, GMP isn't exactly a high-end scientific math library) because, guess what? You still can't express things like 1/(2 * pi), because pi is irrational. It can't be expressed exactly with any number of digits or amount of memory. So you're right back to doing error analysis, and what's more, your calculations are sucking up more cycles and memory to boot. No thanks.
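To make the 1/(2 * pi) point concrete, here is a minimal sketch that uses Python's standard decimal module as a stand-in for an arbitrary-precision library such as GMP (it is not GMP, and the helper names are hypothetical). Pi comes from Machin's formula, pi = 16*arctan(1/5) - 4*arctan(1/239). However many digits you ask for, you get a rounded approximation, and results at two precisions agree only up to the shorter one, so the error still has to be tracked.

```python
from decimal import Decimal, localcontext

def arctan_inv(x: int, prec: int) -> Decimal:
    """arctan(1/x) via its Taylor series, to roughly `prec` significant digits."""
    with localcontext() as ctx:
        ctx.prec = prec + 10                  # guard digits
        x2 = Decimal(x) * x
        term = Decimal(1) / x                 # (1/x)^(2n+1), starting at n = 0
        total = term
        eps = Decimal(10) ** -(prec + 5)      # stop once terms are this small
        n, sign = 1, -1
        while True:
            term /= x2
            delta = term / (2 * n + 1)
            if delta < eps:
                break
            total += sign * delta
            n, sign = n + 1, -sign
    return total

def pi_machin(prec: int) -> Decimal:
    # Machin's formula, evaluated with 10 guard digits beyond `prec`.
    with localcontext() as ctx:
        ctx.prec = prec + 10
        return 16 * arctan_inv(5, prec + 10) - 4 * arctan_inv(239, prec + 10)

def inv_two_pi(prec: int) -> Decimal:
    """1/(2*pi) rounded to `prec` significant digits: still only an approximation."""
    pi = pi_machin(prec)
    with localcontext() as ctx:
        ctx.prec = prec + 10
        x = 1 / (2 * pi)
    with localcontext() as ctx:
        ctx.prec = prec
        return +x                             # unary plus rounds to ctx.prec

print(inv_two_pi(30))
print(inv_two_pi(60))
```

Going from 30 to 60 digits costs more cycles and memory and only moves the rounding error further out; it never removes it.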

There's a reason why supercomputers are rated in FLOPS, and not IOPS. All that expensive floating point hardware on those scads and scads of processors is there for a reason.

If you absolutely need an exact answer, you either use a computer algebra system, which can do symbolic manipulation, or you stick to problems that can be solved using integer or rational arithmetic.
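A small Python sketch of the "stick to rational arithmetic" option, using the standard fractions module: adding one tenth ten times is exact with rationals, while the same loop in binary doubles drifts. The loop is written out by hand because Python 3.12+ sum() compensates float rounding and would hide the drift.

```python
from fractions import Fraction

float_total = 0.0
exact_total = Fraction(0)
for _ in range(10):
    float_total += 0.1              # 0.1 is first rounded to the nearest double
    exact_total += Fraction(1, 10)  # exactly 1/10, no rounding at all

print(float_total, exact_total)
```

The catch, as the post says, is that this only works while every quantity stays rational; a single sqrt or pi puts you back into approximation.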

If you need an answer that has more precision than the built-in floating point types, then arbitrary precision libraries become relevant. But they aren't a magical fix that can suddenly make the limitations of limited precision disappear.

Otherwise, the best approach to take is to make sure your algorithms and design are sensitive to the issues involved. (For example, avoid addition and subtraction wherever possible, especially between values of significantly different magnitudes; unlike multiplication and division, these cause losses in precision.)
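One standard way to act on that advice when you cannot avoid adding many small terms to a large one is compensated (Kahan) summation; it is a swapped-in technique for illustration, not something the post proposes. In this minimal Python sketch, plain accumulation drops every 1e-16 (each is below half an ulp of 1.0), while the compensated version carries the lost low-order bits forward. math.fsum, which returns a correctly rounded sum, serves as the reference.

```python
import math

def naive_sum(xs):
    total = 0.0
    for x in xs:
        total += x
    return total

def kahan_sum(xs):
    total = 0.0
    comp = 0.0                    # running compensation for lost low-order bits
    for x in xs:
        y = x - comp              # re-inject what was lost on the previous step
        t = total + y             # low-order bits of y may be lost here...
        comp = (t - total) - y    # ...and are recovered here (algebraically zero)
        total = t
    return total

values = [1.0] + [1e-16] * 100_000   # one large term, many tiny ones
print(naive_sum(values), kahan_sum(values), math.fsum(values))
```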

Honestly, if you work it right, the roughly 15 significant decimal digits of precision that doubles offer are more than good enough for most scientific computations, as long as you keep track of the error tolerances. More is always better, of course, but only a lazy scientist would rely on quads to suddenly get the right answer where doubles weren't good enough before, because there'll always be problems where you want more precision, and the naive approach won't work.
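A quick Python check of where the "15" comes from, assuming CPython on a platform with IEEE 754 doubles: sys.float_info.dig is the number of significant decimal digits guaranteed to survive a round trip through a double, and sys.float_info.epsilon is the gap between 1.0 and the next larger double.

```python
import sys

print(sys.float_info.dig)      # significant decimal digits guaranteed to round-trip
print(sys.float_info.epsilon)  # spacing of doubles just above 1.0

# Any decimal string with 15 significant digits survives a trip through a double.
s = "0.123456789012345"
round_tripped = format(float(s), ".15g")
print(round_tripped == s)
```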
Re:GPUs == Worthless Floating Point Precision by Anonymous Coward · 2006-04-23 04:29 · Score: 1, Insightful

IBM/Sony can, at least in theory, perform 64-bit double-precision floating point calculations, but the implementation involves some weird software emulation thingamabob which invokes a massive performance penalty.
Just for the record, Cell uses no "software emulation" for its double-precision calculations. It has a 7-cycle latency to do two DP multiply-adds, which is certainly not slow. The "slow" part is that the throughput is also 7 cycles, meaning that consecutive DP MADDs don't pipeline. So, while this cuts the theoretical maximum GFLOPS down significantly (SP MADDs can issue one every cycle, in addition to a non-FP instruction), the "in practice" performance is much closer...
and we're still talking (4 flops / 7 cycles) * (8 SPEs) * x GHz => ~18.3 DP GFLOPS @ 4.0 GHz (pretty freaking fast!)
Oh, and GPUs aren't viable as FPUs because the latency sucks so hard.
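The post's back-of-envelope estimate, written out as a tiny Python helper. The function and parameter names are hypothetical; the inputs are the ones the post assumes (4 flops per DP issue, i.e. two multiply-adds of 2 flops each, one issue every 7 cycles, 8 SPEs, 4.0 GHz). The result is exactly 128/7, about 18.29.

```python
def peak_gflops(flops_per_issue, cycles_per_issue, units, clock_ghz):
    # (flops per issue / cycles between issues) * number of units * clock in GHz
    return flops_per_issue / cycles_per_issue * units * clock_ghz

cell_dp = peak_gflops(flops_per_issue=4, cycles_per_issue=7, units=8, clock_ghz=4.0)
print(f"{cell_dp:.1f} DP GFLOPS")
```

This is a theoretical peak; sustained throughput on real code depends on memory traffic and scheduling.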
Re:GPUs == Worthless Floating Point Precision by IvyKing · 2006-04-23 07:40 · Score: 1

At the time, Kahan was arguing in favor of using the full power of the Intel/AMD 80-bit extended precision doubles [i.e. embedding 64-bit doubles in an 80-bit space, performing calculations with the greater accuracy afforded therein, and then rounding the result back down to 64-bits and returning that as your answer], but, truth be told, the Sine Qua Non of hardware-based calculations is true 128-bit "quad-precision" floating point calculations as performed in hardware.

The CDC 6600's single-precision arithmetic used 60 bits and the architecture had hardware support for doing double precision, and Kahan was familiar with the CDC ISA (the main computer at UCB at the time he started there was still the 6400 in the basement of Evans Hall).
The whole point of the 80-bit extended precision in the 8087 and its successors was to be able to get reliable results with "pencil and paper" algorithms. IOW, to get reasonable numbers without doing numerical analysis. The reality is that a programmer who is aware of roundoff issues can get better results with 64 bits than a clueless programmer with 80 bits.
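A classic illustration of the last point, sketched in Python with ordinary 64-bit doubles (the example and helper names are chosen here, not taken from the post; it shows only the 64-bit side, not an 80-bit comparison). For x^2 + 1e8*x + 1 = 0, the textbook formula computes the small root by subtracting two nearly equal numbers and loses most of its digits, while the rearrangement q = -(b + sign(b)*sqrt(b^2 - 4ac)) / 2, x_small = c / q avoids the cancellation in the same precision.

```python
import math

def small_root_textbook(a, b, c):
    # (-b + sqrt(b^2 - 4ac)) / 2a: catastrophic cancellation when b > 0 and b^2 >> 4ac
    return (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)

def small_root_stable(a, b, c):
    # Form q by adding quantities of the same sign, then use x1 * x2 = c / a.
    q = -(b + math.copysign(math.sqrt(b * b - 4 * a * c), b)) / 2
    return c / q

a, b, c = 1.0, 1e8, 1.0
reference = -1e-8  # the true small root differs from this by a relative ~1e-16
for f in (small_root_textbook, small_root_stable):
    x = f(a, b, c)
    print(f.__name__, x, abs(x - reference) / abs(reference))
```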
Re:GPUs == Worthless Floating Point Precision by ponos · 2006-04-23 22:44 · Score: 1

And just in case you aren't aware, 32-bit single-precision floats are essentially worthless for anyone doing even the simplest mathematical calculations; for instance, with 32-bit single-precision floats, integer granularity is lost at 2 ^ 24 = 16M, i.e.
The relative error of a single floating-point operation is supposed to be roughly 2^-N, where N is the number of bits in the significand. Although some ALGORITHMS can be unstable, because they use a series of operations that greatly increases the error, many useful algorithms can be accurately implemented with floating-point operations. I think that single-precision floats are OK for many purposes, and double-precision floats have to be abused in order to produce bad results. Higher-precision floats are primarily useful for programmers who don't know how to maintain precision.
Anyway, even arbitrary-precision algorithms (like those in libgmp) have to be built on hardware operations. In that sense, if you actually require absolute precision (like predicting the weather or working at NASA or something), you can still implement arbitrary precision with 32-bit floats, 32-bit ints, or even byte operations by carefully avoiding overflow.
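A minimal Python sketch of both single-precision figures, assuming IEEE 754 binary32 (24 significand bits counting the implicit leading bit): rounding through struct's 'f' format shows integer granularity breaking at 2^24, and the relative rounding error staying within 2^-24. The to_float32 helper is a name made up for this example.

```python
import struct

def to_float32(x: float) -> float:
    # Round a double to the nearest binary32 value and widen it back.
    return struct.unpack("f", struct.pack("f", x))[0]

# Below 2**24 every integer is representable; above it the spacing is 2.
for n in (2**24 - 1, 2**24, 2**24 + 1):
    print(n, to_float32(float(n)))

# The relative error of rounding to single precision is at most 2**-24.
x = 0.1
rel_err = abs(to_float32(x) - x) / x
print(rel_err, rel_err <= 2.0 ** -24)
```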
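A toy Python sketch of that idea, with hypothetical helper names: non-negative integers stored as little-endian lists of byte-sized limbs (base 256), added and multiplied using only steps whose intermediate results fit in 16 bits. Real libraries such as GMP use machine-word limbs and much faster algorithms; this only shows the carry handling that keeps every step in range.

```python
BASE = 256  # one byte per limb

def to_limbs(n: int) -> list[int]:
    """Split a non-negative int into little-endian base-256 limbs."""
    limbs = []
    while n:
        limbs.append(n % BASE)
        n //= BASE
    return limbs or [0]

def from_limbs(limbs: list[int]) -> int:
    n = 0
    for limb in reversed(limbs):
        n = n * BASE + limb
    return n

def add_limbs(a: list[int], b: list[int]) -> list[int]:
    out, carry = [], 0
    for i in range(max(len(a), len(b))):
        s = (a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0) + carry  # at most 511
        out.append(s % BASE)
        carry = s // BASE
    if carry:
        out.append(carry)
    return out

def mul_limbs(a: list[int], b: list[int]) -> list[int]:
    out = [0] * (len(a) + len(b))
    for i, x in enumerate(a):
        carry = 0
        for j, y in enumerate(b):
            t = out[i + j] + x * y + carry  # at most 255 + 255*255 + 255 = 65535 (16 bits)
            out[i + j] = t % BASE
            carry = t // BASE
        out[i + len(b)] += carry  # this slot is still 0 here, so it stays a byte
    while len(out) > 1 and out[-1] == 0:
        out.pop()  # drop leading zero limbs
    return out

a, b = 2**200 + 12345, 3**90
print(from_limbs(mul_limbs(to_limbs(a), to_limbs(b))) == a * b)
```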
To sum this up: for those that don't need much precision (games/video/audio) 32-bits can be enough with careful programming. Those that DO need absolute precision wouldn't care about 64 or 128 bits but can use the most appropriate hardware operations, including 32-bit floats, to implement arbitrary precision algorithms on TOP of the hardware.
P.
Re:GPUs == Worthless Floating Point Precision by merlin_jim · 2006-04-25 06:37 · Score: 1

wouldn't it be really cool if the DSP lay directly on the HyperTransport bus

You may not be aware, but AMD just released a new version of the HyperTransport spec, and along with the usual speed and signaling improvements, it includes support for externally connected devices.

--
I am disrespectful to dirt! Can you see that I am serious?!
Re:GPUs == Worthless Floating Point Precision by networkBoy · 2006-04-25 16:31 · Score: 1

Why not the Virtex FPGA setup? http://www.theregister.co.uk/2006/04/21/drc_fpga_module/
I'm sure quad (or possibly even octuple) precision floats could be implemented in that bad boy.
As I said in an earlier thread, this has my intel fanboi status at risk...
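For a sense of what quad and octuple precision would buy, here is a small Python sketch computing the usual figures for a binary format from its significand precision p (in bits, including the implicit leading bit): unit roundoff 2^-p and guaranteed decimal digits floor((p - 1) * log10(2)). The p values are the IEEE 754 ones for binary32/64/128/256, plus 64 for the x87 80-bit extended format mentioned upthread; the sketch says nothing about how an FPGA implementation would actually be built.

```python
import math

# Significand precision p in bits, including the implicit leading bit.
FORMATS = {
    "binary32 (single)": 24,
    "binary64 (double)": 53,
    "x87 80-bit extended": 64,
    "binary128 (quad)": 113,
    "binary256 (octuple)": 237,
}

def decimal_digits(p: int) -> int:
    # Decimal digits guaranteed to survive a decimal -> binary -> decimal round trip.
    return math.floor((p - 1) * math.log10(2))

def unit_roundoff(p: int) -> float:
    # Maximum relative error of a single round-to-nearest operation.
    return 2.0 ** -p

for name, p in FORMATS.items():
    print(f"{name:22} p={p:3}  digits={decimal_digits(p):2}  u={unit_roundoff(p):.3e}")
```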
-nB

--
whois gawk date unzip strip find touch finger mount join nice man top fsck grep eject more yes exit umount sleep dump

Sun Currently OEMs ATI Radeon video cards by GuyverDH · 2006-04-23 02:46 · Score: 1

I have yet to see a low-profile version; however, I have seen V210s and V240s with this card in them. It can only be a matter of time.

--
Who is general failure, and why is he reading my hard drive?

Damn, brings back fond memories by marcus · 2006-04-24 02:07 · Score: 1

I worked on a couple of similar projects using TI C51 and AT&T DSP32 processors. I recall that the 286 could not keep up with the data rate using the Borland C compiler; I had to delve into x86 asm to optimize some loops in order to get it to keep up. The C51 board was a telecom voice processor, including PCM modems and such. The DSP32 was a multi-channel (as in T1) DTMF decoder. It ended up running at 98% utilization (50 MHz) with a *lot* of hand-optimized code...

Fun, bleeding edge, stuff back then.

--
Good judgement comes from experience, and experience comes from bad judgement.
- W. Wriston, former Citibank CEO

Re:Damn, brings back fond memories by Doc+Ruby · 2006-04-24 02:19 · Score: 1

Yeah, the good old days when AT&T made CPUs like the DSP32C, with its C-like assembly language.

As we can see from the current discussion, those same issues and techniques (or at least architectural patterns) are still relevant. In proportion - about a thousand times faster, but equally across the whole uneven platform.

--
--
make install -not war

Ridiculous Supposition by Anonymous Coward · 2006-04-24 05:22 · Score: 0

And if I mounted an SRB (Space Shuttle solid rocket booster) on top of a Prius, it would be the fastest car on earth. Why validate Sun's hype any further by claiming that there's a simple solution to a fundamental deficiency?

Framebuffer in UST1 by gentimjs · 2006-04-25 03:59 · Score: 1

In theory, if you run the mobo outside its normal case, you could throw a supported-on-SPARC Sun framebuffer in it and have things work... not that I've got one handy, nor would I be willing to try to splice it into an ATX chassis or whatnot.

Slashdot Mirror

71 comments