Oracle Demos New SPARC T4 Processor

Posted by timothy on Tuesday September 27, 2011 @12:50AM from the store-away-from-volatile-gases dept.

MojoKid writes "Oracle is publicly demonstrating its new T4 processor today and is shipping beta test systems to selected partners. The new T4 chip is a major departure from previous designs. The T4 offers a maximum of eight cores per physical chip and keeps the T3's eight-threads-per-core limitation. The T4 compensates for its lower maximum theoretical throughput in several ways. First, the T4 is an out-of-order processor with an enhanced branch predictor. Its maximum speed is said to be at least 3GHz, nearly double that of the 1.67GHz T3. Oracle claims the chip's single-threaded performance has been significantly boosted, and expects T4 to deliver a 2x-7x speed increase in single-threaded workloads compared to T3."

Not the point of SPARC by Anonymous Coward · 2011-09-27 01:12 · Score: 3, Informative

Is it me, or did Oracle completely miss the point of SPARC? We used to use SPARCs where I work for huge, multi-thread or child-spawning applications. If you want a number cruncher, go somewhere else. Go buy a POWER CPU. SPARC's shining glory is the massively threaded model where you spawn tons of little instances of the same thing that serve a quick, non-intensive purpose and die. Once again, Oracle is taking something they bought and trying to ram the square object into the round hole they call their business model.
Interestingly enough, the captcha for this was "idiots"
1. Re:Not the point of SPARC by eclectus · 2011-09-27 02:03 · Score: 4, Insightful
  
  Is it me, or did Oracle completely miss the point of SPARC? We used to use SPARCs where I work for huge, multi-thread or child-spawning applications. If you want a number cruncher, go somewhere else. Go buy a POWER CPU. SPARC's shining glory is the massively threaded model where you spawn tons of little instances of the same thing that serve a quick, non-intensive purpose and die. Once again, Oracle is taking something they bought and trying to ram the square object into the round hole they call their business model.
  Interestingly enough, the captcha for this was "idiots"
  Do you really think Oracle could turn the ENTIRE chip engineering boat around in a year and a half? This best-of-both-worlds fast single threaded and massively multithreaded design was probably in the works for YEARS before Sun was bought.
  
  --
  This signature is a waste of 42 characters
2. Re:Not the point of SPARC by tlhIngan · 2011-09-27 04:08 · Score: 2
  
  Is there a way for a CPU to make mutex handling easier and more efficient?
  No. Because mutexes are slow (though highly efficient already).
  Most architectures of today provide all the necessary tools to implement lockless algorithms which are more complex, but faster. They're less efficient (because they often do a "try and get lucky" style of processing), but if you have a ton of threads and pretty idle CPUs, it's less of an issue.
  And support for this comes in the form of memory barriers (read and write - they're different) and conditional exchange instructions (if value of memory is same as register X, then swap X and memory) up to native word size (or more). The latter is important as you need to be able to swap pointer-sized things atomically.
  An example of "try and get lucky" is in singleton object initialization. Say your code instantiates an object, and the object needs to initialize.
  If the object has been instantiated before, you return the existing object. If not, you instantiate and initialize. So far, exactly as how it's done normally, except no lock was taken during the check (!).
  Now you initialize the object, and the magic happens near the end. You do a memory barrier to ensure all I/O and memory accesses are done. You then do an atomic compare-and-swap with the memory to hold your instance. If no one else has initialized it yet, it should still be NULL, so you compare with NULL and exchange with your instantiation.
  If you succeed, you get the old value of the memory (i.e., NULL). If you fail, you get your instance pointer back (because the exchange failed). If you succeeded, object is initialized. If you failed, someone else initialized it under you, and you destroy the object you initialized and use the one already there.
  You can extend this for processing as well - often by using a sequence counter to "flag" what has been done. Or linked lists (being able to verify no one stomped over your next pointer). Or many other structures.
  It's less efficient because you're doing more work than necessary (e.g., if you have 100 threads, all 100 may initialize the singleton simultaneously, but only one will succeed and you would throw away the results of the other 99).
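
A minimal sketch of the compare-and-swap singleton described above, assuming Java's AtomicReference stands in for the raw CPU instruction. Its compareAndSet has volatile read/write memory semantics, so it also covers the barrier step. The class name LazyConfig is hypothetical, and compareAndSet returns a boolean rather than the old memory value, so the win/lose test looks slightly different from the description.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical class for illustration; not from the original comment.
final class LazyConfig {
    // Shared slot for the instance; stays null until some thread wins the race.
    private static final AtomicReference<LazyConfig> INSTANCE = new AtomicReference<>();

    // Stand-in for whatever expensive state the object initializes.
    final long createdAtNanos = System.nanoTime();

    private LazyConfig() { }

    static LazyConfig get() {
        // Unlocked check: if somebody already published an instance, use it.
        LazyConfig existing = INSTANCE.get();
        if (existing != null) {
            return existing;
        }
        // "Try and get lucky": build our own copy without holding any lock.
        LazyConfig mine = new LazyConfig();
        // Publish only if the slot is still null.
        if (INSTANCE.compareAndSet(null, mine)) {
            return mine; // we won the race
        }
        // Somebody else initialized it under us; discard ours and use theirs.
        return INSTANCE.get();
    }
}
```

With 100 racing threads, up to 100 LazyConfig objects may be constructed, but every caller of LazyConfig.get() ends up with the same one.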
3. Re:Not the point of SPARC by maraist · 2011-09-27 13:08 · Score: 2
  
  "Is there a way for a CPU to make mutex handling easier and more efficient?"
  Mutexes are VERY efficient with cache-coherence protocols like MESI and MOESI; the problem is what you do while another thread owns the mutex. You can either spin-loop or context-switch to another thread. Specifically, if thread A takes the lock and then has several cache misses, thread B would have to spin for thousands of clock ticks. When you have 80 CPUs that might not be too bad (though it burns power), but if you have 2 CPUs, then that's likely highly wasteful.
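
The spin-or-switch trade-off can be sketched as a bounded spin followed by a blocking acquire. This is an illustration only, assuming Java 9+ for Thread.onSpinWait; SpinThenBlock and SPIN_LIMIT are made-up names, and the limit is an arbitrary value, not a tuned one.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical helper: spin briefly, then give up the CPU and block.
final class SpinThenBlock {
    // Assumed value for illustration; a real limit depends on core count and hold times.
    static final int SPIN_LIMIT = 1_000;

    // Returns true if the lock was taken while spinning, false if we had to block.
    static boolean acquire(ReentrantLock lock) {
        for (int i = 0; i < SPIN_LIMIT; i++) {
            if (lock.tryLock()) {
                return true;      // owner released quickly; no context switch needed
            }
            Thread.onSpinWait();  // hint to the CPU that this is a busy-wait loop
        }
        lock.lock();              // owner is slow; let the scheduler run something else
        return false;
    }
}
```

On a box with many idle hardware threads a larger spin budget costs little; with 2 CPUs the blocking path is usually the better default.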
  
  "trigger on event or register/memory=certain value"
  I believe Intel provides a block-until-cache-line-updated instruction. IIRC, that's how Linux futexes and OS-level schedulers work.
  
  "I bet there's lots of code which regularly checks "is it time to do X yet?" or "wait till X happens" (e.g. wait for connection or data)."
  
  Well, wait-for-connection is an OS thing. If you use epoll, IO-Completion, kpoll, or even ancient unix 'select' you transfer the overhead of IO to the OS, which is very event-driven (and thus doesn't necessarily have a lot of blocking structures). Namely, an Ethernet frame driver running on CPU-3 can in theory transfer directly to thread-16, which is blocked waiting on a TCP packet, once the OS determines it has received enough data to be woken.
  
  As for 'is it time to do X yet', it isn't as bad as you might think (well, I don't have that much empirical evidence, but I've worked in this space).
  Instead of 'polling', you can leverage a priority queue, such that if there is literally nothing critical to run, you quickly test the head of the queue for its execution time-stamp. Then do a time-of-day operation (all while in the OS, so no additional context switching). If the time-stamp is earlier than the current time, you flag the blocking thread for execution (possibly transferring directly to it). Here the slowdown comes when you add/remove an item from the temporal priority queue (namely O(log(n))). So this is a function of how many temporal waits are scheduled/completed, but is independent of how long each is supposed to wait. When there is literally no work to perform at all (all CPUs are about to go idle), you look at your priority queue and ask how long before the next scheduled event. Then you can make a CPU go to sleep for that long (using interrupt controllers if need be).
  Generally you're going to wake up 32 times a second anyway, and the marginal overhead of re-testing a time against the priority-heap head 32 times a second is nominal (I can run 800,000 Java-based time-of-day calls per second, plenty of room for those 32).
  
  To boot, the OS doesn't need to know about all the scheduled tasks. It only needs to know about one per thread at most (generally only about 0 to 3 at any given time, with a few socket-timeout type apps bringing it into the hundreds). Apps can, for example, implement their own timer logic that mimics this priority-queue model (Java does). Thus one thread can be OS-bound on a timer that is the nearest temporal event in a pool of potentially thousands of scheduled events (e.g. a Java Timer or memory-based Quartz).
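
The priority-queue timer model above can be sketched in a few lines. This is a single-threaded illustration using java.util.PriorityQueue, not how any particular OS or java.util.Timer is actually implemented; TimerQueue, Entry, and the method names are hypothetical, and times are passed in explicitly so the behaviour is easy to follow.

```java
import java.util.PriorityQueue;

// Hypothetical app-level timer pool; not thread-safe, for illustration only.
final class TimerQueue {
    // One scheduled event: when it is due and what to run.
    private static final class Entry implements Comparable<Entry> {
        final long dueMillis;
        final Runnable task;

        Entry(long dueMillis, Runnable task) {
            this.dueMillis = dueMillis;
            this.task = task;
        }

        @Override
        public int compareTo(Entry other) {
            return Long.compare(dueMillis, other.dueMillis);
        }
    }

    // Binary heap ordered by due time; the head is always the nearest event.
    private final PriorityQueue<Entry> heap = new PriorityQueue<>();

    // O(log n) insert; the cost depends on how many timers exist, not on how long they wait.
    void schedule(long dueMillis, Runnable task) {
        heap.add(new Entry(dueMillis, task));
    }

    // Cheap periodic check: compare the head against "now" and run whatever is due.
    int runDue(long nowMillis) {
        int ran = 0;
        while (!heap.isEmpty() && heap.peek().dueMillis <= nowMillis) {
            heap.poll().task.run();
            ran++;
        }
        return ran;
    }

    // How long the caller may sleep before the next event; -1 if nothing is scheduled.
    long millisUntilNext(long nowMillis) {
        Entry head = heap.peek();
        return head == null ? -1 : Math.max(0, head.dueMillis - nowMillis);
    }
}
```

Only the value from millisUntilNext needs to be handed to the OS as a single wait, however many events sit in the heap.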
  
  What I see as the greatest problem is peer CPUs modifying common cache lines.
  
  Namely, if you have a job-queue data structure that N threads are pushing onto and pulling out of, then you have a single spot in memory that EVERY thread must modify in order to transfer work. This is a massive bottleneck, one that ironically you don't have in single-CPU configurations. This is something that I think CPUs can work to address, especially as co-process and message-passing systems become more prevalent (Erlang-type languages or message-queue NoSQL models).
  
  One reason Intel CPUs suffer from this is that when two CPUs concurrently modify a cache line, everybody is forced to 'flush' their cache line and re-read from RAM. This not only makes those accesses no faster than main RAM, it's competing with everything else you want to do with RAM - increasing latency massiv
  
  --
  -Michael
Re:high end CPUs from the database company by NoNonAlphaCharsHere · 2011-09-27 01:13 · Score: 4, Funny

Well, in fairness, they did add an evil bit(TM) to the flags register. Unfortunately, in Oracle's case, "jump on evil" is an unconditional branch.
Does that make any sense? by aglider · 2011-09-27 01:23 · Score: 3, Insightful

I mean, to (re)introduce a new CPU in the market?
Either the T4 can run Oracle SQL in silicon or it won't fit in between the Intel/AMD mature technology on one side and the rising (and power saving) ARM on the other one.
Yes, you can build an "Oracle appliance" with whatever CPU you want, even your very own design. But then will the market share justify the efforts in CPU design?
No, I don't think they won't ever succeed.

--
Sent as ripples into the electromagnetic field. No single photon has been harmed in the process.
1. Re:Does that make any sense? by EthanV2 · 2011-09-27 01:30 · Score: 2
  
  No, I don't think they won't ever succeed.
  Sooo... You think that they will succeed?
2. Re:Does that make any sense? by angel'o'sphere · 2011-09-27 02:01 · Score: 3, Informative
  
  # prstat
  Total: 341 processes, 9909 lwps, load averages: 20.86, 19.36, 20.41
  # prtdiag -v
  System Configuration: Sun Microsystems sun4u SPARC Enterprise M9000 Server
  System clock frequency: 960 MHz
  Memory size: 262144 Megabytes
  Only roughly 3 processes per core ... and yes that is 262 GIGA BYTES of RAM. How much again can an ARM address?
  Well, keep in mind: when we talk about SPARC or the Power PC architecture we also talk about memory bandwidth, failover safety, attached I/O devices etc.
  I don't see anyone trying to build "big iron" with ARMs right now.
  
  --
  Cost free eBook I read (by iBook/Kobo/Amazon/ObookO/Gutenberg etc.): "The Green Odyssey" by Philip Jose Farmer.
Re:Single thread performance by MetalliQaZ · 2011-09-27 01:31 · Score: 4, Interesting

This comment may have been meant as a bite at Oracle, but it is really a good point. The T4 may be a departure, but that doesn't mean it isn't warranted. The chip is still massively parallel, but it has obviously been refocused. The question is, what does the application need? Perhaps the engineers saw the biggest gains for DB applications in boosting single thread performance. MySQL probably will benefit from the same things that benefit Oracle DB. What are the customer demands for power consumption? Are the tradeoffs balanced? Perhaps lower-power chips require too many servers to store and cool. The T4 still looks like a mighty processor.
Still, if they venture too far into Intel's Xeon space, they will have a hard fight indeed.
-d

--
"Here Lies Philip J. Fry, named for his uncle, to carry on his spirit"
Pricing? No, *licensing* by Brandon+Hume · 2011-09-27 01:48 · Score: 2

It's all very nice that they've decided to try and up the single-thread performance. However, it's worth noting that the only thing worthwhile to run on a SPARC nowadays (thanks to Oracle's PMITA licensing structure) is Oracle DB. You buy an Oracle box to run Oracle. Any other workload is nonsensical, as you'll get better single-thread performance from x86, and you'll get way more cycles per dollar from... well, just about any other hardware/OS combination out there.
So as you consider purchasing this higher-clocked box, I've been told that the Oracle licensing for this machine will be 0.5 per core, while the T3 is 0.25 per core. Basically Oracle will cost approximately twice as much per core on this machine. I'm not a DBA... does that make any sense, when databases are traditionally I/O-bound?
Incidentally, my first paragraph caused me pain to type... I'm my organization's SPARC and Solaris expert, and I was a big pusher of the platform. Oracle's takeover and subsequent psychotic support costs and absolute blindness to any workload not DB-oriented was a fair kick in the pants to me. I'll fully admit that I'm not impartial.

--
Brandon Hume
hume -> BOFH.Halifax.NS.Ca, http://WWW.BOFH.Halifax.NS.Ca/
Re:Single thread performance by FreakyGreenLeaky · 2011-09-27 02:13 · Score: 2

but given that 4-8 cores is the most that will typically get used
You clearly don't understand the target market of these chips. In fact, your statement is hilariously ignorant.
Re:Single thread performance by Bengie · 2011-09-27 05:10 · Score: 2

A heavy read database, like a news site, will have nearly everything cached in memory.
I've done ad-hoc aggregate functions against non-indexed tables with over 100mil rows, and they return in sub 10ms times. Cached table data can be fast.
Re:Single thread performance by maraist · 2011-09-27 12:36 · Score: 2

The problem is that hyperthreading CPUs and IA-64 EPIC are predicated around floating point performance. The idea is that if you're FPU bound, then you want to minimize RAM latency by flipping between threads while you have FPU-load stalls. You add speculative execution, predicate registers, pipeline execution stacks to minimize branch-misses, etc. But it's all about FPU with 200 clock execution times (e.g. divides and transcendental ops - as with FFT).

But I'm sorry, no matter how fast you make their FPUs, they're not going to beat FPGA or ASIC or raw-silicon GPUs. These bastards optimize memory paths and reduce critical path latencies. The only advantage CPUs have over GPUs is that you can context switch unrelated tasks better than with GPUs.

A vast majority of apps in the world are NOT FPU based. They are pure integer. And moreover, these days, they are RAM constrained. If you're writing a NoSQL DB procedure to perform zlib or merge-sorting or state-machine syntax parsing, FPU oriented architectures are of ZERO benefit. This is all RAM -> branch-prediction related. That is, read-data, make a decision, jump to new code (which triggers new RAM loads), run two or three instructions, then repeat. While SOME of the app state-tables and code-paths can get cached efficiently, the input stream is generally far larger than your L3 cache (on the order of gigabytes).

So, SOME of the memory pre-loading, branch-prediction, and on-stalled-thread-ctx-switch could be leveraged.. But MT apps suffer from barriers in critical regions.. Namely if you memory stall while holding a lock, you cripple the parallel performance.

Co-processes are very efficient (e.g. apache pre-fork, postgres co-threads with specific shared-mem-segments, erlang, ruby-unicorn, etc) in that they organize very small messages to pass between processes and keep all remaining cache-lines isolated to their single thread and thus semi-dedicated CPUs. This can very nicely leverage co-processors without necessarily saturating RAM, though if the apps themselves are RAM-bound you still have problems; BUT if you have NUMA, the CPU can segment memory spaces better with co-processes than with MT. That being said, the Sun light-weight-threads are (I believe) designed around shared memory-spaces having minimal context-switching time vs. posix-threads or normal co-processes, so they can't really take advantage of co-processes as well as MT. So Sun light-threads are forced to endure potentially bad programming by DB, file-IO, OS, signal-processing applications. Namely, if you can't create isolated memory regions (malloc/free-locks, IO/pipe-locks, concurrent-data-structure 'critical-region' locks, etc), you'll find yourself dirtying shared cache-lines so often, you actually find yourself running slower than if you were just single-threaded.

I know, for example, a simple merge-sort can run significantly slower (3x) when run in parallel vs. single-threaded, predominantly because of Intel's MESI implementation. Well, not necessarily 3x slower human-time wise, but in consumed CPU time with little or no visible decrease in human-response-time.

As another example, MySQL InnoDB had an inverse performance curve for the longest time. Meaning, the more physical CPUs you added, the SLOWER its total throughput would be, predominantly due to excessive critical-region locks. Many of those locks have been replaced with less-accurate atomic spin-locks (as with sequence-counters). Namely you can now 'lose' a primary key's sequential value under the right circumstances - but at the benefit of removing a major classic stall-point. But InnoDB is still full of complex algorithms that require critical-regions. Lock-free code is really hard and is very limiting. But that isn't to say people haven't figured out how to architect good designs. 'redis' NoSQL and Erlang-based apps (like RabbitMQ) are good examples. Namely copy-on-write small data-structures.
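
To illustrate how a lock-free sequence counter can 'lose' a value, here is a hedged sketch using Java's AtomicLong. It is not InnoDB's actual code; SequenceAllocator is a hypothetical name, and the "rollback" is simply a caller that never uses the value it was given.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical key allocator: no lock, so no stall point, but values are never handed back.
final class SequenceAllocator {
    private final AtomicLong next = new AtomicLong(1);

    // Atomic fetch-and-increment: every caller gets a unique value without blocking.
    long allocate() {
        return next.getAndIncrement();
    }
}
```

If one transaction is given key 2 and then rolls back, the next insert still receives 3. The committed keys are 1 and 3, and the gap at 2 is the price of never taking a lock around the counter.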

But there are two types of apps that have lots of parallel threads. Those with MASSIVE memory requirements and those that

--
-Michael