Oracle Demos New SPARC T4 Processor
MojoKid writes "Oracle is publicly demonstrating its new T4 processor today and is shipping beta test systems to selected partners. The new T4 chip is a major departure from previous designs. The T4 offers a maximum of eight cores per physical chip and keeps the T3's eight-threads-per-core limitation. The T4 compensates for its lower maximum theoretical throughput in several ways. First, the T4 is an out-of-order processor with an enhanced branch predictor. Its maximum speed is said to be at least 3GHz, nearly double that of the 1.67GHz T3. Oracle claims the chip's single-threaded performance has been significantly boosted, and expects T4 to deliver a 2x-7x speed increase in single-threaded workloads compared to T3."
Is it me, or did Oracle completely miss the point of SPARC? We used to use SPARCs where I work for huge, multi-thread or child-spawning applications. If you want a number cruncher, go somewhere else. Go buy a POWER CPU. SPARC's shining glory is the massively threaded model where you spawn tons of little instances of the same thing that serve a quick, non-intensive purpose and die. Once again, Oracle is taking something they bought and trying to ram the square object into the round hole they call their business model.
Interestingly enough, the captcha for this was "idiots"
Well, in fairness, they did add an evil bit(TM) to the flags register. Unfortunately, in Oracle's case, "jump on evil" is an unconditional branch.
I mean, to (re)introduce a new CPU in the market?
Either the T4 can run Oracle SQL in silicon or it won't fit in between the Intel/AMD mature technology on one side and the rising (and power saving) ARM on the other one.
Yes, you can build an "Oracle appliance" with whatever CPU you want, even your very own design. But then will the market share justify the efforts in CPU design?
No, I don't think they will ever succeed.
Sent as ripples into the electromagnetic field. No single photon has been harmed in the process.
This comment may have been meant as a bite at Oracle, but it is really a good point. The T4 may be a departure, but that doesn't mean it isn't warranted. The chip is still massively parallel, but it has obviously been refocused. The question is, what does the application need? Perhaps the engineers saw the biggest gains for DB applications in boosting single thread performance. MySQL probably will benefit from the same things that benefit Oracle DB. What are the customer demands for power consumption? Are the tradeoffs balanced? Perhaps lower-power chips require too many servers to store and cool. The T4 still looks like a mighty processor.
Still, if they venture too far into Intel's Xeon space, they will have a hard fight indeed.
-d
"Here Lies Philip J. Fry, named for his uncle, to carry on his spirit"
It's all very nice that they've decided to try and up the single-thread performance. However, it's worth noting that the only thing worthwhile to run on a SPARC nowadays (thanks to Oracle's PMITA licensing structure) is Oracle DB. You buy an Oracle box to run Oracle. Any other workload is nonsensical, as you'll get better single-thread performance from x86, and you'll get way more cycles per dollar from... well, just about any other hardware/OS combination out there.
So as you consider purchasing this higher-clocked box, I've been told that the Oracle licensing for this machine will be 0.5 per core, while the T3 is 0.25 per core. Basically Oracle will cost approximately twice as much per core on this machine. I'm not a DBA... does that make any sense, when databases are traditionally I/O-bound?
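For a rough sense of scale, here is the arithmetic on those two factors as a minimal sketch. The per-core factors are only what I was told; the 16-core count for the T3 is an assumption on my part (the summary only gives the T4's eight cores), and the function name is purely illustrative.

```python
# Back-of-the-envelope license count per chip.
# ASSUMPTIONS: the core factors (0.25 for T3, 0.5 for T4) are hearsay from
# this comment; the T3's 16 cores per chip is not stated in the summary.

def licenses_per_chip(cores: int, core_factor: float) -> float:
    """Licenses needed for one chip: core count times the per-core factor."""
    return cores * core_factor

t3_licenses = licenses_per_chip(cores=16, core_factor=0.25)
t4_licenses = licenses_per_chip(cores=8, core_factor=0.5)

print(f"T3: {t3_licenses} licenses per chip")
print(f"T4: {t4_licenses} licenses per chip")
print(f"per-core ratio T4/T3: {0.5 / 0.25}")
```

If those numbers hold, the per-core price doubles but the per-chip price stays flat, since the T4 has half as many cores.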
Incidentally, my first paragraph caused me pain to type... I'm my organization's SPARC and Solaris expert, and I was a big pusher of the platform. Oracle's takeover and subsequent psychotic support costs and absolute blindness to any workload not DB-oriented was a fair kick in the pants to me. I'll fully admit that I'm not impartial.
Brandon Hume
hume -> BOFH.Halifax.NS.Ca, http://WWW.BOFH.Halifax.NS.Ca/
but given that 4-8 cores is the most that will typically get used
You clearly don't understand the target market of these chips. In fact, your statement is hilariously ignorant.
A heavy read database, like a news site, will have nearly everything cached in memory.
I've done ad-hoc aggregate functions against non-indexed tables with over 100mil rows, and they return in sub 10ms times. Cached table data can be fast.
The problem is that hyperthreading CPUs, x86-64 and EPIC are predicated on floating-point performance. The idea is that if you're FPU-bound, you want to minimize RAM latency by flipping between threads while you have FPU-load stalls. You add speculative execution, predicate registers, pipeline execution stacks to minimize branch misses, etc. But it's all about FPU ops with 200-clock execution times (e.g. divides and transcendental ops, as with FFT).
But I'm sorry, no matter how fast you make their FPUs, they're not going to beat FPGAs, ASICs or raw-silicon GPUs. These bastards optimize memory paths and reduce critical-path latencies. The only advantage CPUs have over GPUs is that they context-switch between unrelated tasks better.
A vast majority of apps in the world are NOT FPU-based. They are pure integer. And moreover, these days, they are RAM-constrained. If you're writing a NoSQL DB procedure to perform zlib or merge-sorting or state-machine syntax parsing, FPU-oriented architectures are of ZERO benefit. This is all RAM -> branch-prediction related. That is: read data, make a decision, jump to new code (which triggers new RAM loads), run two or three instructions, then repeat. While SOME of the app state tables and code paths can get cached efficiently, the input stream is generally far larger than your L3 cache (on the order of gigabytes).
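To make that access pattern concrete, here is a minimal sketch of a table-driven state machine that scans a byte stream and counts number and word tokens. The states, the transition table and the function names are all made up for illustration. The point is only that every step is a load, a comparison and a jump, with no floating point anywhere.

```python
# Hypothetical tokenizer: integer-only work, a lookup and a branch per byte.
START, IN_NUMBER, IN_WORD = 0, 1, 2

def classify(byte: int) -> str:
    """Map a byte to a coarse character class."""
    if 48 <= byte <= 57:                                      # '0'..'9'
        return "digit"
    if 65 <= byte <= 90 or 97 <= byte <= 122 or byte == 95:   # letters, '_'
        return "alpha"
    return "other"

# TRANSITIONS[state][char_class] -> next state.
# A malformed token such as "12ab" simply stays one number token.
TRANSITIONS = {
    START:     {"digit": IN_NUMBER, "alpha": IN_WORD,   "other": START},
    IN_NUMBER: {"digit": IN_NUMBER, "alpha": IN_NUMBER, "other": START},
    IN_WORD:   {"digit": IN_WORD,   "alpha": IN_WORD,   "other": START},
}

def count_tokens(data: bytes) -> dict:
    """Count tokens by watching transitions out of the START state."""
    counts = {"numbers": 0, "words": 0}
    state = START
    for byte in data:                    # iterating over bytes yields ints
        next_state = TRANSITIONS[state][classify(byte)]
        if state == START and next_state == IN_NUMBER:
            counts["numbers"] += 1
        elif state == START and next_state == IN_WORD:
            counts["words"] += 1
        state = next_state
    return counts

print(count_tokens(b"select 42 from t1 where id = 7"))
```

The table and the code fit in cache easily. The input does not once it runs to gigabytes, which is where the RAM stalls come from.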
So, SOME of the memory pre-loading, branch prediction, and context switching on stalled threads could be leveraged. But MT apps suffer from barriers in critical regions. Namely, if you stall on memory while holding a lock, you cripple the parallel performance.
Co-processes are very efficient (e.g. Apache pre-fork, Postgres co-threads with specific shared-memory segments, Erlang, Ruby's Unicorn, etc.) in that they pass very small messages between processes and keep all remaining cache lines isolated to their single thread and thus to semi-dedicated CPUs. This can very nicely leverage co-processors without necessarily saturating RAM, though if the apps themselves are RAM-bound you still have problems; BUT if you have NUMA, the CPU can segment memory spaces better with co-processes than with MT. That being said, Sun's lightweight threads are (I believe) designed around shared memory spaces with minimal context-switching time vs. POSIX threads or normal co-processes, so they can't really take advantage of co-processes as well as MT. So Sun's lightweight threads are forced to endure potentially bad programming by DB, file-I/O, OS, and signal-processing applications. Namely, if you can't create isolated memory regions (because of malloc/free locks, I/O and pipe locks, concurrent-data-structure 'critical-region' locks, etc.), you'll find yourself dirtying shared cache lines so often that you actually run slower than if you were just single-threaded.
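The share-nothing, small-message pattern can be sketched roughly like this. It is only an illustration: Python threads and `queue.Queue` stand in for pre-forked processes and pipes, and the names `worker` and `sum_with_workers` are hypothetical. Each worker keeps its running state private, and only small messages cross the boundary.

```python
import queue
import threading

def worker(inbox: queue.Queue, outbox: queue.Queue) -> None:
    """Own a private running total; never touch another worker's state."""
    local_total = 0                      # private to this worker
    while True:
        msg = inbox.get()
        if msg is None:                  # sentinel: report result and exit
            outbox.put(local_total)
            return
        local_total += msg

def sum_with_workers(values, n_workers: int = 4) -> int:
    """Dispatch values round-robin as small messages, then combine results."""
    inboxes = [queue.Queue() for _ in range(n_workers)]
    outbox = queue.Queue()
    threads = [threading.Thread(target=worker, args=(q, outbox)) for q in inboxes]
    for t in threads:
        t.start()
    for i, value in enumerate(values):
        inboxes[i % n_workers].put(value)
    for q in inboxes:
        q.put(None)                      # tell every worker to finish
    for t in threads:
        t.join()
    return sum(outbox.get() for _ in range(n_workers))

print(sum_with_workers(range(1, 101)))
```

The queues still lock internally, but the lock covers a tiny hand-off rather than the worker's whole working set, which is the property that matters here.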
I know, for example, that a simple merge sort can run significantly slower (3x) in parallel vs. single-threaded, predominantly because of Intel's MESI implementation. Well, not necessarily 3x slower in human time, but in consumed CPU time, with little or no visible decrease in human response time.
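The difference between consumed CPU time and human response time can be measured with two clocks. The sketch below does not reproduce the cache-coherency slowdown (Python is the wrong tool for that); it only shows one way to capture both numbers around a merge sort. The `measure` helper is a hypothetical name.

```python
import time

def merge_sort(items):
    """Plain single-threaded top-down merge sort; returns a new list."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left, right = merge_sort(items[:mid]), merge_sort(items[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged

def measure(fn, *args):
    """Return (result, wall_seconds, cpu_seconds) for one call of fn."""
    wall_start = time.perf_counter()     # elapsed "human" time
    cpu_start = time.process_time()      # CPU time used by the whole process
    result = fn(*args)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return result, wall, cpu

data = list(range(2000, 0, -1))
result, wall, cpu = measure(merge_sort, data)
print(f"wall={wall:.4f}s cpu={cpu:.4f}s")
```

Wrap a parallel version in the same helper: if the CPU figure climbs while the wall figure barely moves, that is the pattern described above.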
As another example, MySQL's InnoDB had an inverse performance curve for the longest time. Meaning, the more physical CPUs you added, the SLOWER its total throughput would be, predominantly due to excessive critical-region locks. Many of those locks have been replaced with less-accurate atomic spin-locks (as with sequence counters). Namely, you can now 'lose' a primary key's sequential value under the right circumstances, but with the benefit of removing a major classic stall point. But InnoDB is still full of complex algorithms that require critical regions. Lock-free code is really hard and very limiting. But that isn't to say people haven't figured out how to architect good designs. The Redis NoSQL store and Erlang-based apps (like RabbitMQ) are good examples, namely copy-on-write small data structures.
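The sequence-counter trade-off can be sketched like this. It is not InnoDB's code, just a hypothetical allocator (`Sequence` and `run_transaction` are invented names) that hands out the next value in a very short critical section and never takes it back, so a rolled-back transaction leaves a gap.

```python
import threading

class Sequence:
    """Hand out increasing IDs. The lock covers only the increment,
    standing in for an atomic fetch-and-add."""

    def __init__(self) -> None:
        self._next = 1
        self._lock = threading.Lock()

    def next_value(self) -> int:
        with self._lock:
            value = self._next
            self._next += 1
        return value

def run_transaction(seq: Sequence, table: list, commit: bool) -> int:
    """Allocate an ID first, then either commit the row or discard it."""
    row_id = seq.next_value()    # allocated before the outcome is known
    if commit:
        table.append(row_id)
    # On rollback the ID is simply dropped; it is never handed out again.
    return row_id

seq = Sequence()
table = []
for outcome in (True, False, True):
    run_transaction(seq, table, commit=outcome)
print(table)
```

The second transaction rolls back, so its ID never appears in the table. In exchange, nobody waits on a lock for the lifetime of someone else's transaction.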
But there are two types of apps that have lots of parallel threads. Those with MASSIVE memory requirements and those that
-Michael