SW Weenies: Ready for CMT?
tbray writes "The hardware guys are getting ready to toss this big hairy package over the wall: CMT (Chip Multi Threading) and TLP (Thread Level Parallelism). Think about a chip that isn't that fast but runs 32 threads in hardware. This year, more threads next year. How do you make your code run fast? Anyhow, I was just at a high-level Sun meeting about this stuff, and we don't know the answers, but I pulled together some of the questions."
Now my hardware will force me to support CMT on my computer? This is as bad as DRM.
I see a deep schism growing in the processor industry. There are two main camps: the parallel processors and the screaming single processors.
The parallel ones are used for intense processing: research, servers, clusters, databases; anything that can be divided into many little jobs and run in parallel.
The other camp is the average user who just wants fast response time and to play Doom 3 at 100+ fps.
FreeBSD: The Power to Serve!
Whatever the clock rate, multiply it by eight and it's pretty obvious that this puppy is going to be able to pump through a whole lot of instructions in aggregate.
Ho hum.
On a good day, with a following wind, Niagara might be able to do 8 integer instructions per second, provided it has 8 independent threads not blocking on I/O to execute.
It only has one floating-point execution unit attached to one of those 8 cores, so if you have a thread that needs to do some FP, it has to make its way over to that core and then has to be scheduled to be executed, and then it can only do one floating-point instruction.
Superb.
The thing is, all of the other CPU vendors with have super-scalar, out-of-order 2- and 4-core 64-bit processors running at two to three times the clock frequency.
You do the mathematics.
Stick Men
Some people have predicted this move for quite some time. I remember hearing about it back in the late '80s or early '90s, and I'm sure it goes way back before then. The analogy was to steam engines and why they lost out to diesels. You can only make a steam engine so big, and you cannot connect them together to get more power. With diesels you can hook many of them together for more power. Chips are finally getting to the same point: it is more cost-efficient to chain them together than to create a monstrous one. I'm surprised it has taken this long to get to this point.
So does this mean that Intel's gamble with the Itanium was a good one? Or does this mean that we are going to try to teach students a totally new development style for more threads and parallelism?
"...and we don't know the answers, but I pulled together some of the questions."
What is this now, Questions for Nerds? Stuff we don't know?
Do not try to read the dupe; that's impossible. Instead, only try to realize the truth.
What truth?
There is no dupe
from TFA:
"Problem: Legacy Apps You'd be surprised how many cycles the world's Sun boxes spend running decades-old FORTRAN, COBOL, C, and C++ code in monster legacy apps that work just fine and aren't getting thrown away any time soon. There aren't enough people and time in the world to re-write these suckers, plus it took person-centuries in the first place to make them correct.
Obviously it's not just Sun, I bet every kind of computer you can think of carries its share of this kind of good old code. I guarantee that whoever wrote that code wasn't thinking about threads or concurrency or lock-free algorithms or any of that stuff. So if we're going to get some real CMT juice out of these things, it's going to have to be done automatically down in the infrastructure. I'd think the legacy-language compiler teams have lots of opportunities for innovation in an area where you might not have expected it."
...and isn't this the challenge being addressed by DragonFly BSD?
Software people use threads already, as long as the VM and OS are up to the task. I don't see why it should matter if some of the threads are implemented in hardware.
http://michaelsmith.id.au
I agree with you on using the best tool for the job, but what does this have to do with the actual article?
I almost thought this was going to be about Star Wars nerds being forced to watch something on Country Music Television.
On a good day, with a following wind, Niagara might be able to do 8 integer instructions per second
I meant per clock cycle, of course, not per second.
The thing is, all of the other CPU vendors with have
I meant "will have" not "with have".
Stick Men
Well I am a Star Wars weenie, and I am definitely NOT ready for Country Music Television.
Given the fact that I haven't programmed a single-threaded program in years.
As a scientific programmer, all I know is that this will eventually be a huge benefit to all my MPI and OpenMP codes.
I really only know the "scientific" programming languages, but almost all math-specific routines are already written for parallel machines. I'm a bit curious: what else really needs multiple threads? Isn't the benefit of dual-core procs the ability to avoid a slowdown when you run two or three apps at a time? Don't games like DOOM III and Half-Life II depend mostly on the GPU (and I'm guessing they can handle multi-core GPUs, since the programming should be fairly similar to SLI)? What is the benefit in games? Just faster level loading times?
I don't want to sound like I'm whining or anything here... I'm not saying that multiple cores suck. On the contrary they're fantastic for what I do, but I just was hoping you guys could help me understand how common apps and non-mathematical operations can use them.
CMT is manufactured pop-country music at its worst. Yuck!
the good ground has been paved over by suicidal maniacs
I suspect we're all going to have to look to languages that really do support very high levels of parallelism from the get-go. We're going to need a high-performance language and a scripting language. From my early days as a computer scientist, I'd say anything functional will serve us really well, especially languages like CAML and Scheme.
To the desktop user, this really means nothing special. But when we're talking about producing a 1024-node system, or even some high-end 1U racks for SMB markets, the more parallelism on chip, the better.
video game, ecchi, bbs and classic computing fans unite to eat sushi
"The hardware guys are getting ready to toss this big hairy package over the wall:"
Vivid imagery...
Look, if you have 32 threads operating at 1/32 of a GHz, or you have 1 thread operating at 2 GHz, then it is a basic wash (not really, but close enough).
I would be far more interested in taking advantage of all the CPU cycles scattered around businesses. Think of how many cycles are wasted running a screen saver or a Word document. By distributing the load amongst those systems, a large number of things could be done.
I prefer the "u" in honour as it seems to be missing these days.
32 threads in hardware on one chip is the same as 32 slow CPUs.
Current programming languages are insufficiently descriptive to permit compilers to generate usefully multi-threaded code.
Accordingly, multi-threading is currently handled by the programmer; which by and large doesn't happen, because programmers are not used to it.
A lot of applications these days are weakly multi-threaded - Windows apps for example often have one thread for the GUI, another for their main processing work.
This is *weak* multi-threading, because the main work done occurs within a single thread. Strong multi-threading is when the main work is somehow partitioned so that it is processed by several threads. This is difficult, because a lot of tasks are inherently essentially serial; stage A must complete before stage B, which must complete before stage C.
The main technique I'm aware of for making good use of multi-threading support is that of worker-thread farms. A main thread receives requests for work and farms them out to worker threads. This approach is useful only for a certain subset of problem types, however, and within the processing of *each* worker thread, the work done itself remains essentially serial.
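For what it's worth, here is a minimal sketch of that worker-farm pattern in Python. The names `handle_request` and `farm_out` are made up for illustration, and the per-request work is a placeholder. One caveat: in standard CPython builds the global interpreter lock keeps threads from running Python bytecode in parallel, so this shows the structure rather than a speedup for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def handle_request(request):
    """Hypothetical unit of work; it runs serially inside one worker thread."""
    return sum(ord(ch) for ch in request)


def farm_out(requests, workers=4):
    """The main thread receives requests and farms them out to a worker pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() hands each request to a free worker and keeps input order.
        return list(pool.map(handle_request, requests))
```

Note that each call to `handle_request` is still plain serial code, which is the limitation described above: the farm only helps when there are many independent requests.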
In other words, clock speeds have hit the wall, transistor counts are still rising, the only way to improve performance is to have more CPUs/threads, but programming models don't yet know how to actually *use* multiple CPU/threads.
El problemo!
--
Toby
If price were no object, someone could design a chip with more than two cores in it, with each core still running as fast as any single-core chip out there.
Just the existence of one such device would heal the rift immediately. Everyone would say... aha! It is only a matter of time before blazing speeds and hardware threading come to the desktop.
You can have your parallel processors and still play DOOM III at insane fps. At worst, it will just take a bit for folks to start writing programs to take advantage of the additional processors/cores.
BTW, your "average" user hasn't even played DOOM I, let alone DOOM III. Surfing the web and using e-mail doesn't usually put a lot of strain on a PC.
All of these recent articles about multi-cores and multiple pipelines of execution seem to miss the real value of this technology: the provisioning of multiple virtual machines in real time on the same system. While most software will never use the multi-thread, multi-CPU capabilities of even the quad-core AMD, products like VMware are now allowing you to dynamically provision systems on demand to deal with load. Another great use is server consolidation: instead of 10 1U racks to handle web farming, try a 16-way box that can provide a single point of reliability, management and execution for those services. This is about horizontal scaling in a vertical fashion.
Now of course, the room was full of Sun infrastructure weenies, so if there's something terribly obvious in records management or airline reservations or payroll processing that doesn't parallelize, we might not know about it.
Well, since I work in airline reservations systems, I'll add my $0.02 worth...
Most OLTP systems will benefit from CMT and multi-core processors. We had a test server from AMD about a month before the dual-core Opteron was announced; we did some initial testing, then put it in the production cluster and fired it up. No code changes, no recompile, no drama.
IMHO, the single-user applications, such as games and word processors, will be harder to parallelize.
Alan.
First off, performance + java != good idea. Not trying to camp fanbois here but if you really need "down to the metal" performance you're writing in C with assembler hotspots.
So the observation that there is too much locking in Java's standard API is informative but not on-topic. The fact that the standard solution is to use a completely new class [e.g. StringBuilder] is why I laughed at my college profs when they were trying to sell their Java courses by saying "and Java is well supported with over 9000 classes!".
In the C and C++ world things get extended but also fixed at the same time. We can still use the strncat function which has been around for a while EVEN IN threaded environments...
Also, he totally fails to point out that extra threads [e.g. register sets] only pay off when the pipeline is empty. So it's a catch-22. You either have a very efficient pipeline that you can cram full of a single thread's instructions or you have a shoddy one where your only hope is to mix in other threads.
Think about it. If you only have one ALU and 32 threads that means each individual thread works at 1/32 the normal speed. Even if they're a lower/higher priority!
That then gets into two camps. Are you threading because the performance of the pipeline sucks [e.g. dependencies in the P4] or because you want to interleave instructions [e.g. twice the clock rate but half the performance]? If it's the latter, then even if you turn off 31 of 32 threads you still end up with one weak ALU.
Consider the AMD64 for instance. It usually gets an IPC that is pretty high [usually in the 1.5-2.5 range] which means that it's retiring instructions from a single thread at pretty much the entire capacity of the chip. Adding extra threads doesn't help.
Consider then the P4. It usually gets an IPC of 0.5 to 1 [for ALU code, which is observable by the fact it's about as fast as a half-clockrate Pentium-M]. This means its two ALUs are not always busy and an additional thread could bump the IPC up to the 1-1.5 range.
I know [for instance] that with HT turned on my 3.2GHz Prescott compiles LibTomCrypt in close to the same time as my 2.2GHz AMD64 [the P4 takes 5 seconds longer, without HT it takes about 15 seconds longer].
So the only saving grace is an efficient ALU so that you can run single tasks at least somewhat efficiently. Then tacking on the extra threads doesn't help as an efficient ALU won't have many bubbles where other threads could live.
So you end up with essentially a hardware register file but still 1/2 the performance. Remember that the goal of multi-processing is closer to 'n' times faster with n processors.
The BEST a single core multi-thread design can hope for is the performance of a single core single thread design...
Whoopy...
Multi-threading is NOT the future. Multi-cell is. Where you have dedicated special purpose [re: space optimized] side-cores that do things like "I can do MULACC/load/store REALLY REALLY QUICK!!!".
In other words, "yet another press release on /.".
Tom
Someday, I'll have a real sig.
Grow up, it's you who constantly take cheap stabs at Linux.
We all are.
If one of your favorite applications happens to be multithreaded then that's gravy.
But you'll benefit anyway. If you bring up your process list you'll see that you probably have at least 10 processes. These will now be able to run independently.
Also, the Windows kernel itself can benefit from hardware threads.
The Internet is full. Go Away!!!
The latter can scale on multi-processors, and mostly do. Much of our performance work centered on finding out how many processes to run, and whether to group them all on one processor board to get short memory access times. Plus fixing obvious things, like O(n^2) algorithms.
In my personal opinion, the considerations for older programs are as follows:
So: adding CMT makes it a good idea to parallelize older programs, O(n^2) algorithms in CMT or multi-CPU programs are every bit as bad as in uniprocessor programs, and introducing locking is bad, but locking on CMT needs to be measured against regular multiprocessors to see if it's going to be better (my speculation) or worse.
--dave
davecb@spamcop.net
Well, the Fortran programs have an easy solution---just recompile with a modern compiler designed for these CPUs. Any loop that can be automatically unrolled can be parallelized instead. Loop parallelization has been a standard Fortran optimization on parallel architectures for decades. Yes, this can be done with other languages as well, but historically it hasn't been (I expect either due to a lack of demand, or because it's harder to accommodate language features [things like strict aliasing], or both).
-JS
Vanity of vanities, all is vanity...
Wow, I've been noticing some out of place posts on Slashdot for a couple days now but this one just proves Slashdot has a serious problem.
I'm sure you didn't mean to, but your post ended up showing up as the first post in an article about CMT. What's really weird is that it showed up after a bunch of other posts...
But those decade old apps can easily be done by one core in its spare time. I'm not sure why this is an issue.
That's really a shame about the FP performance. My hobby project is ray tracing, and my code is just waiting to be run on parallel hardware. The preferred system would have multiple cores sharing cache, but separate cache would be fine too. Memory is not the bottleneck, so higher GHz and more cores/threads will be very welcome so long as they each have good performance. The code scales well with multiple CPUs as pixels can be rendered in parallel with zero effort - the code was designed for that. As it sits, I'm hoping my Shuttle (SN95G5v2) will support an AMD64x2 shortly. We're still not up for RT Quake, but interactive (read: very jerky, 1-2 fps) high-poly scenes are possible today.
Am I the only one who thought a bunch of SoftWare Weenies were going to be ready for Country Music Television?
(Man I'm having a bad case of the Mondays)
Every time someone exposes concurrency at some layer as a way of improving performance, rather than because you're implementing a process that's inherently concurrent, it's a huge clusterfuck. Doesn't matter whether it's asynchronous I/O, out-of-order execution, multithreaded code, or whatever. Even when you're dealing with a concurrent environment like a graphical user interface the most successful approaches involve breaking the problem down into chunks small enough you can ignore concurrency.
One of UNIX's most important features is the pipe-and-filter model, and one of the really great things about it is that it lets you build scripts that can automatically take advantage of coarse-grained concurrency. Even on a single-CPU system, a pipeline lets you stream computation and I/O where otherwise you'd be running in lockstep alternating I/O and code.
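To make the pipe-and-filter idea concrete, here is a rough sketch that mimics a shell pipeline inside one Python program: one thread per filter, connected by bounded queues that play the role of pipes. The helper names (`stage`, `run_pipeline`) and the queue size are my own assumptions, not anything from UNIX itself.

```python
import queue
import threading

_DONE = object()  # sentinel that marks the end of the stream


def stage(func, inbox, outbox):
    """Run one filter: read items, transform them, pass them downstream."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)
            return
        outbox.put(func(item))


def run_pipeline(items, funcs):
    """Connect one thread per filter with queues, like `a | b | c` in a shell."""
    pipes = [queue.Queue(maxsize=8) for _ in range(len(funcs) + 1)]
    workers = [
        threading.Thread(target=stage, args=(func, pipes[i], pipes[i + 1]))
        for i, func in enumerate(funcs)
    ]

    def feed():
        # Feeding from its own thread lets the caller drain output meanwhile.
        for item in items:
            pipes[0].put(item)
        pipes[0].put(_DONE)

    feeder = threading.Thread(target=feed)
    for thread in workers + [feeder]:
        thread.start()

    results = []
    while True:
        item = pipes[-1].get()
        if item is _DONE:
            break
        results.append(item)
    for thread in workers + [feeder]:
        thread.join()
    return results
```

The filters themselves stay simple serial functions; the concurrency lives entirely in the plumbing, which is the property being praised here.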
That's where the big breakthroughs are needed: mechanisms to let you hide concurrency in a lower layer. Pipelines are great for coarse-grained parallelism, for example, but the kind of fine grain you need for Niagara demands a better design, or the parallelism needs to be shoved down to a deeper level. Intel's IA64 is kind of a lower level approach to the same thing where the compiler and CPU are supposed to find parallelism that the programmer doesn't explicitly specify, but it suffers from the typical Intel kitchen-sink approach to instruction set design.
Isn't the big issue cache? On a multi-CPU system running one thread per CPU, each thread has its own cache. On HMT, the cache is shared. Threads running in different sections of code on different data will tend to reduce cache hits, offsetting the performance gain of the multiple threads. The limit on increasing the number of threads is that most of the threads will be waiting on cache misses.
Intron: the portion of DNA which expresses nothing useful.
Of course the compiler has severe limits as to what it can really guess (the "independent" part can be very hard in this aspect), but at least once you write it, you can run it on all your apps for free.
I'll misquote Fred Weigel and suggest that the next problem is branching: Samba code seems to generate 5 instructions between branches, so suspending the process and running something else until the branch target is in I-cache seems like A Good Thing (;-)).
Methinks Samba would really enjoy a CMT processor.
--dave
davecb@spamcop.net
Threads are actually one of the simplest forms of parallelism to deal with and we have had decades of experience with them. That's why Sun loves them: it fits in well with their big-iron philosophy and hardware and makes it easy for their customers to migrate to the next generation.
But the future of high-end computing, both in business and in science, will not look like that. Networks of cheap computing nodes scale better and more cost-effectively. Many manufacturers have already gone over to that for their high-end designs. That's where the real software challenges are, but they are being addressed.
Processors with lots of thread parallelism will probably be useful in some niche applications, but they will not become a staple of high-end computing.
Seriously! And why foist this garbage on the Star Wars (SW) weenies? Has John Williams gone country?
Easy. In present days there are some assembly instructions that can be executed simultaneously. With a chip like this however, all bets would be off. Instead of just a meager few instructions that could be executed simultaneously you would be able to execute any number of instructions simultaneously.
So if you have a function that, say, does 10 additions and 10 moves, you would first figure out if any of them needed to be done before or after each other. Then see which ones don't matter. Then write the function to do as many at once as possible.
It really doesn't matter for anyone other than the compiler writers. Those guys will write the compiler to do this kind of assembly level optimization for you. The trick is writing a high level language, or modifying an existing one, so the compiler can tell which things must be executed in order and which can be executed side by side.
The GeekNights podcast is going strong. Listen!
This year, more threads next year.
Hmm, I can't seem to find one. For arguably one of the tech-savviest sites on all of the Internet, Slashdot contributors have surprisingly awful grammar.
We hear a lot about the lack of technical education and preparation for engineering and science careers these days, but sometimes it looks like English instruction is just as bad.
I'm not looking for perfection, and sometimes, like in comments, speed matters more than grammatical accuracy, but when you're submitting a story, it really can't hurt to read it over to make sure it fits elementary school standards.
P.S., there is no such word as "virii." There, now this is officially off-topic.
Slashdot: 24 hours behind every other site or your money back!
"So, given that CMT chips use less watts per unit of computing, why aren't..."
I think the "requires a 500W power supply" part should answer this question.
What will this Cell Based system look like? - Our Speculation
- MotherBoard supports up to 4 Cell Chips
- Each Cell Chip will have its own Rambus main memory. The memory will be on plug in strips much like DDR etc
- The Cell Chips on the motherboard will cooperate by means of FlexIO which is a multilane/serial technology.
- There will be two slots meant for video cards. Similar to AGP but designed for Rambus, not AGP compatible, 10x faster than AGP.
- All other I/O will be done by means of FlexIO, similar to what is now possible with USB - except the system will boot from FlexIO
- There will be no legacy hardware support - NO PCI, AGP, USB, serial, parallel, PS/2, Ethernet - nothing
- The power supply will need to be about 500 watts.
- Power management will allow cell chips and parts of cell chips to be powered down when not in use.
- There will be 16 FlexIO ports coming out the back. 2 in and 2 out for each Cell Chip.
- Clusters can be created by stacking Cell Boxes and connecting them with the FlexIO cables.
http://cellsupercomputer.com/power_pc.php
Having to work for a living is the root of all evil.
Am I the only person who was wondering why slashdot was talking about Country Music Television for a moment there?
* crickets *
Time to hand in my nerd badge I guess, and slink off into the sunset.
Seriously, though - thanks for clarifying the meaning of CMT in the blurb. A big step forward from the usual Slashdot blurb.
- sarcasm is just one more service we offer -
Maybe the need for smaller transistors and wires on chips has been fueling the growing nanotech industry, so maybe we should continue working on smaller and faster chips, though they might not be practical.
It was designed using his Forth CAD software, probably running on one of his earlier CPUs.
(I realize that this solution could actually be computed at compile-time for any known value of N, and I realize that there is a formula to compute this answer in constant time). My point is that just because a loop can be unrolled automatically (this loop can) does not mean that it can be executed in parallel. Executing this code in parallel would result in a *massive* performance hit or a tremendous memory size explosion.
I currently have no clever signature witticism to add here.
If you want us to accommodate your inability to improve single threaded performance and rearchitect 20 years of software for parallel computing, then how about this:
.
DON'T CALL US WEENIES! Ya bunch of Verilog writin', pocket protector wearing misfits, who take six months to implement what we can do in five lines of code, and cannot maintain app. integrity even in a single core non-hyperthreaded CPU! (See here: http://www.comp.nus.edu.sg/~abhik/pdf/pact04.pdf)
Yours Sincerely,
A Software Engineer.
${YEAR+1} is going to be the year of Linux on the desktop!
It's funny Sun claimed 15x performance increase with Niagara about 16 months ago, but they never bothered to put that claim into any context. 15x the then 900 MHz SPARC III, I doubt it seriously. I doubt even 15x their low-end SPARC IIe in their now discontinued blades.
It appears that Sun engineers have hit a MHz wall sooner than the likes of Intel/AMD/IBM and are going extreme parallelism.
Based on what I've read the Niagara CPU will only be deployed in a single slot server...the only thing it might be useful for is front-end web servers and light-duty app servers. It doesn't sound like FP performance will be too exciting so I doubt it will find its way into renderfarms.
I would like to see a showdown between the IBM/Toshiba Cell and Niagara.
It's my opinion that the Sun engineering team are in serious trouble.
If single-threaded performance improvements slow down, and the available computing power is spread out among multiple cores, anyone persisting in writing single-threaded code will fall behind in performance.
Remember the old days when people used fancy tricks to implement naturally concurrent solutions as single-threaded programs? The future is going to be just the opposite. Any day now we'll see a rush toward languages with special support for quick, clear, safe parallelism, just like we've seen scripting languages catch on for web programming.
and other such languages will become more popular as this new multithreaded world takes hold because they embed the multithreaded concepts into the language without explicit programmer interaction. C, C++, Java style threading and mutex constructs are error-prone and awkward to use.
It isn't. And it isn't just scientific data chugging which would benefit from increased availability of actual concurrent processing in typical desktop computers; there are currently many of these PCs that already do things that can be parallelized.
Image processing, for instance. For many kinds of image processing it isn't even hard to partition the problem so that you can make use of more than one processor. I use my PC for processing pictures taken with a digital SLR. A lot of people I know do video editing on a PC, and some even have small home studios for music production centered around their PCs or Macs.
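As a rough illustration of how easy the partitioning can be, here is a sketch that splits an image into horizontal strips and processes each strip independently. The image is assumed to be a plain list of rows of 8-bit grayscale values, and `invert_rows` and `invert_image` are hypothetical names; a real program would use an imaging library and, in CPython, processes or native code rather than threads to get an actual speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def invert_rows(rows):
    """Invert the 8-bit grayscale pixels in one horizontal strip."""
    return [[255 - px for px in row] for row in rows]


def invert_image(image, workers=4):
    """Split the image into strips of rows and process each strip separately."""
    strip = max(1, -(-len(image) // workers))  # ceiling division
    strips = [image[i:i + strip] for i in range(0, len(image), strip)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        done = list(pool.map(invert_rows, strips))
    # Strips come back in order, so stitching them together is trivial.
    return [row for part in done for row in part]
```

No strip ever needs data from another strip, so there is nothing to lock; that is what makes this kind of problem easy to spread over several processors.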
Even if you are not running multithreaded applications that are heavily CPU-bound, multiple CPUs or CPU cores are useful. Currently my desktop computer runs 108 processes. Between 3 and 6 of these processes were, on cursory inspection, marked as "runnable", yet I have only one CPU. I'd probably benefit from another CPU or three, even though right now I'm not really doing anything that requires a lot of CPU grunt.
There is no problem. It isn't as hard as people say to make use of more processors, more cores or more low-level support for multithreading. If anyone is trying to make you believe there's a big problem, you can safely ignore them.
Because sometimes the sheer amount of data those applications have to process has increased. Or because a calculation that used to be done once a week, over the weekend, on several machines in parallel with separate data groups is now done as an instant report at the fingertips of a clueless manager who just wants the 'numbers to be up-to-date' (of course THIS calculation can be parallelized, not in an algorithmic way, but by separating independent data).
The main problem with parallelism for the general application is the current model. The "Event Model" that is used nowadays as the basic processing model for applications specifies that the program will stay idle until the user presses a key or moves the mouse (or pushes buttons).
With this model it is kind of hard to use multithreading processors. Of course, after the user has triggered an action, the program could make use of the threading capability to improve its performance.
Next comes the problem of deciding "how many threads" one should allow in a program... if one allows too many threads and the processor has just 2 it will be bad, and also the other way around.
I think the compilers must be made "thread aware", so they can take the program code and effectively use the processing power. Of course if the program is compiled natively (C, C++, Pascal, etc.) we would have the same problem of thread numbers, but if there is a virtual machine (Java and .NET technology) or even an interpreter, the middle layer needs to be thread aware so it can distribute the processes in different threads.
Of course, the first applications that can take advantage of multithreading are games, as their model is active; but then again, the compiler MUST be aware of the multithreading capacities, and it should be able to fit the different threads the developer wants onto the processor.
For the general application I think multithreading could be used by changing (or extending) the event-model window paradigm, so that one thread waits for events while another thread is used to act proactively; this could be achieved by some kind of artificial intelligence development.
Just today I was daydreaming about how to replace the totally old and awkward menu bar standard interface, specifically for OpenOffice, which has 10 menus with 30 or more submenus... this is a thing that could be improved by some kind of proactive behaviour from the computer (imagine something like an agent that could predict the options you were looking for while using the program... [no, I am not thinking about the !£%!"£@ "feature" of hiding the menu options from MS Office, Windows et al] ).
Another way to use multithreading could be from the operating system, so the programs [that do not require] multithreading won't have to deal with it BUT the operating system would use the multithreading capacities to schedule process execution... in this way we may get [AT LAST] a [REAL] multiprocess OS (and not the illusion we have now from quick process switching).
Ubuntu is an African word meaning 'I can't configure Debian'
An UltraSparc that runs 32 threads of CMT but delivers a combined total of merely a few hundred MIPS is worse than an IBM Power or AMD Opteron that requires software context switches but crunches out thousands of MIPS. Sun needs a clearer server/CPU strategy than throwing a whole new paradigm on the table PER UPGRADE CYCLE.
Kudos, sir troll!
IBM started SHIPPING Power5 with SMT capability August 31 of last year - IBM has SMT running on 1.9 GHz processors today. Sun is getting farther and farther behind.
-The Mad Duke
"Sure the two cars let you do independent things but when you're working on one task [getting to work] you're not ahead."
But you're not, you never are working on only 1 task.
Look at the threads running on a PC and it's hundreds: you have file cache threads, communications threads, all kinds of stuff running.
A whole convoy of cars all sitting in one lane waiting for the car in front.
You keep the speed limit the same, make the highway 8 lane and 8 times the cars can pass through.
Also you would save the thread state store/recall overhead, as the processor needs to be prepped for each thread switch. You have only 1/8th of those happening if the chip can run 8 threads at a time.
That was worse than goatse - my eyes and my ears have been assaulted!
Traditional languages that have had threads bolted on like C/C++ make threading more challenging than it needs to be. Java, as long as you understand the principles of concurrency, makes it a breeze. I would be interested to see whether a well-coded JVM / JIT could outperform traditional languages on these new CPUs - especially if you could dedicate a couple of the hardware threads to JIT and GC threads.
Scared of flying, pointy things since 1979!
From the article:
The standard APIs that came with the first few versions of Java were thread safe; some might say fanatically, obsessively, thread-safe. Stories abound of I/O calls that plunge down through six layers of stack, with each layer posting a mutex on the way; and venerable standbys like StringBuffer and Vector are mutexed-to-the-max. That means if your app is running on next year's hot chip with a couple of dozen threads, if you've got a routine that's doing a lot of string-appending or vector-loading, only one thread is gonna be in there at a time.
Classes such as StringBuffer and Vector are locked (synchronized) on a per-object basis. As long as you aren't trying to access the same object from different threads you won't block. And if you are trying to access the same object from different threads you will be happy that they were thread-safe!
The performance problems of having these classes being obsessive about thread safety do not result from the locking forcing singlethreadedness. The performance problems stem from the cost of locking objects.
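To make the per-object point concrete, here is a minimal C++ sketch (the thread is about Java, but the idea carries over). `SyncBuffer` and `fill_two_buffers` are hypothetical names, standing in for something like StringBuffer: every instance owns its own lock, so two threads working on two different buffers never block each other, yet each `append` still pays for taking the lock.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <string>
#include <thread>

// Hypothetical stand-in for a per-object-synchronized buffer:
// every instance carries its own mutex.
class SyncBuffer {
public:
    void append(char c) {
        std::lock_guard<std::mutex> guard(lock_);  // lock cost paid on every call
        data_.push_back(c);
    }
    std::size_t size() const {
        std::lock_guard<std::mutex> guard(lock_);
        return data_.size();
    }
private:
    mutable std::mutex lock_;
    std::string data_;
};

// Two threads, two buffers: the locks are never contended, so neither
// thread waits on the other. The locking overhead is still there.
std::size_t fill_two_buffers(std::size_t per_thread) {
    SyncBuffer a, b;
    auto fill = [per_thread](SyncBuffer& buf) {
        for (std::size_t i = 0; i < per_thread; ++i) buf.append('x');
    };
    std::thread t1(fill, std::ref(a));
    std::thread t2(fill, std::ref(b));
    t1.join();
    t2.join();
    return a.size() + b.size();
}
```

Calling `fill_two_buffers(1000)` fills both buffers in parallel. Passing the same buffer to both threads would still be correct, just serialized on that one lock.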
int AccumulateLoopCount(int N) {
    int accumulator = 0;
    #pragma omp parallel for reduction(+:accumulator)
    for (int i = 1; i < N; ++i) {
        accumulator += i;
    }
    return accumulator;
}
Very easy to parallelise this: each thread has its own private accumulator, initialised to zero, and the result from each thread is summed at the end. I don't see where this massive performance or memory hit would come from.
Some tasks are serial, others can be parallelized. You don't need fancy languages to do it either. To effectively partition tasks into threads using something as archaic as C, you can either fork or you can load different processes. Either way works. The trick is to shift one's thinking from "tasks are serial..." to "how could I speed my code up if I had multiple cpus available?"
Encoding music to put on some sort of music player that doesn't have replaceable batteries and is headed for landfill 18 months after it's purchased comes to mind. Partitioning the music can be done either by tunes or within a tune if the encoding scheme can be chunked.
AI is scalable via threading, which means a well laid out game architecture could scale with more hardware threads. A user with a single processor would only get a few smart enemies; a user with a cpu array could see lots of smart behavior, such as some of the enemy deciding to flee while other warriors come charging into battle.
Folks who play with Photoshop or Gimp can easily soak their cpus. A blur operation is the kind of task that can be partitioned to good effect. It's not programming languages that's keeping this from happening as much as not many people have the requisite hardware.
Personally, when I'm working, I'll be printing, scanning and reading the scanned input simultaneously. For some unknown reason, the printer driver soaks up all the cpu cycles which slows down my reader and scanner software. Being able to allocate the printer driver its own hardware would make the rest of my workflow smoother.
Some of us will be able to use the horsepower and some of us won't. Not much has changed in the past 40 years.
I heard the same talk under NDA about a bit over 6 months ago. They're just hyping their warez, it's nothing special. They're talking about multi-core CPUs like what just came out from AMD, and "hyperthreading" like what Intel has had for a while. They're basically playing catchup, and poorly. If they were smart they'd have dumped future plans for the UltraSparcs a few years ago and started transitioning to Solaris on x86s and especially Opterons, and possibly built some fat custom hardware in a similar vein to the SunFire series servers around the Opteron architecture.
11*43+456^2
and take some advance architecture courses.
The BEST a single core multi-thread design can hope for is the performance of a single core single thread design...
I'm sorry but that turns out not to be the case.
When you have a system that is running lots of different threads simultaneously the amount of time that it takes to do a context switch from one thread to another becomes an issue. In the real world, threads often do things like I/O which cause them to block or they wait on a lock. If you can do a fast context switch you get back the time that you would have wasted saving registers off to RAM and pulling back another set. Faster thread switching means that your multi-thread single core now runs its total load (all of the threads) faster than a single core single thread design. Also, things like microkernels become a lot more feasible (microkernels are notorious for being slow because context switches are slow).
When you have looked beyond your desktop machine maybe you'll have earned the right to sneer at your professors. I don't think you're there yet.
Those of you who are up on the current state of the art here, please help me out. I was under the impression that multiple threads and automatic storage management were still not on good terms with each other, and that this was a big unsolved problem.
To a Lisp hacker, XML is S-expressions in drag.
They're just hyping their warez, it's nothing special.
:-)
So what you are saying is that MS Office and Adobe Photoshop are available on Solaris for free, but I have to go to some dodgy Russian web site to get it?
I, for one, welcome our new eunuch overlords!
What do threads and CMT buy you when browsing? How about when playing a game? How about using OpenOffice or Office? All of these can run faster if multithreaded, but are they ready?
The Tao of math: The numbers you can count are not the real numbers.
Here's why: Few people want to put in years of effort learning to make production-quality apps, and many aren't able to because they have a hard enough time already keeping up their output on their "real" job. As a result, most cool things with limited audiences are stillborn. Filesharing, web browsing, and word processing programs make it because they have enormous user bases. Make GUI programming simpler, and more cool things will be created for medium-sized audiences.
I expect GUI library designers to discover ways to make programming easier by using simpler, less efficient models and using extra threads to make up the loss in efficiency. Ingenious people will find a way. Simplifying the task of GUI programming will mean that more creative tinkerers will build real apps instead of quirky, crash-prone prototypes.
The blame for this outrage should be put on Berman & Enterprise.
There is no America. There is no democracy. There is only IBM and AT&T and DuPont, Dow, General Electric, and Exxon
Single-threaded webpages are -terrible-.
I mean, unless you like your data access code holding up the page rendering.
I guarantee that whoever wrote that code wasn't thinking about threads or concurrency or lock-free algorithms or any of that stuff.
There is legacy multi-threaded code out there. I worked on a couple of projects in the 90's in C and C++ that were multi-threaded. pthreads is a C library, not Java, after all.
That's not to say that most of the legacy code out there is multithreaded. And multithreaded coding requires some serious discipline. The problem boils down to a simple trade-off. You have to lock access to anything that is shared between threads. And locking is expensive, so you want to limit when you are doing it.
As a scientific programmer, all I know is that this will eventually be a huge benefit to all my MPI and OpenMP codes.
Unfortunately, these kinds of processors are pointless for most scientific applications (there are some exceptions, but not many). Scientific apps are limited by arithmetic units and memory bandwidth, and these processors do nothing to improve them. The Cell processor at least has multiple FPUs, this one doesn't even have that.
For your MPI codes, you are much better off using a workstation cluster, because unlike with these kinds of processors, you get a separate memory subsystem and a separate FPU with each thread that way.
This sounds like the stuff that Tera was working on with their MTA back in the 90s (see this, or for more technical details, here). Basically, a processor that could handle up to 128 threads at a time, with almost zero-latency switching among threads. These processors could be easily interconnected to scale up to whatever the customer (e.g., Sandia, Los Alamos, LLL) wanted. From perusing Cray's website, though, I don't see any current machines that appear to be using that architecture, so I assume it didn't play out somehow.
Just junk food for thought...
Yes, compilers are not parallelizable. But multithreading doesn't necessarily mean parallelization.
I could for example imagine the parser running on one thread, and the single-function optimizer on another thread. Every time the parser has finished parsing a function, it tells the optimizer thread, which then immediately starts optimizing it, while the parser starts parsing the next function. The later inter-function optimization passes then work with the pre-optimized functions which were mostly optimized during parse (given that parsing includes getting the source from disk, which will most likely block, I really wonder if a multithreaded compiler wouldn't even be an advantage on a single-core system).
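A minimal sketch of that hand-off, assuming the two stages can be modelled as string transformations. `FunctionQueue` and `compile_pipeline` are hypothetical names, and the "parsing" and "optimizing" are placeholders, not real compiler passes.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical hand-off queue between the parser and optimizer stages.
class FunctionQueue {
public:
    void push(std::string fn) {
        {
            std::lock_guard<std::mutex> g(m_);
            q_.push(std::move(fn));
        }
        cv_.notify_one();
    }
    void close() {
        {
            std::lock_guard<std::mutex> g(m_);
            closed_ = true;
        }
        cv_.notify_all();
    }
    // Blocks until an item is available; returns false once the queue
    // has been closed and fully drained.
    bool pop(std::string& out) {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return closed_ || !q_.empty(); });
        if (q_.empty()) return false;
        out = std::move(q_.front());
        q_.pop();
        return true;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::string> q_;
    bool closed_ = false;
};

// The calling thread plays the parser; a second thread optimizes each
// function as soon as it has been parsed.
std::vector<std::string> compile_pipeline(const std::vector<std::string>& sources) {
    FunctionQueue parsed;
    std::vector<std::string> optimized;
    std::thread optimizer([&] {
        std::string fn;
        while (parsed.pop(fn)) optimized.push_back("opt(" + fn + ")");
    });
    for (const auto& src : sources) parsed.push("parsed(" + src + ")");
    parsed.close();
    optimizer.join();
    return optimized;
}
```

With a single consumer the functions come out in the order they were parsed, so the later inter-function passes see a stable sequence.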
The Tao of math: The numbers you can count are not the real numbers.
Yikes you are so right.
I was just "promoted" to the "programming guru/lead IT" position here. The last guy left an utter mess of old VB code as well as ASP that is broken and only barely working. Management freaked when I suggested we rewrite everything correctly and take advantage of the .NET upgrades, as well as migrate some of the ASP to PHP or just simply properly written ASP. The 6-18 months to complete the rewrites is "unacceptable" to them. They did not realize how much of a mess they had on their hands, and I refuse to blow smoke up their butts and lie about how long it will take to fix.
MANY companies are running that old FORTRAN and COBOL code because management refuses to allocate resources to fix the slightly broken and old code/software.
This "good old code" is not that good, and fixing it is usually overlooked because "it's working".
Do not look at laser with remaining good eye.
Have you not programmed anything threaded in years? Or have you not programmed anything single-threaded in years?
This is what keeps me up at night.
The speed and memory tradeoffs come from the two parallel implementations of the code I posted.
If you do what you suggest, the performance loss is in the step that you sorta gloss over: "wait for all the threads to finish." Semaphores and mutexes are notoriously expensive to lock and unlock. The original code I posted completed on the order of dozens of clocks per loop iteration. Locking a single mutex or incrementing (or decrementing) a semaphore costs thousands of cycles.
The memory explosion that I refer to comes from the temporary variables that you've added per thread. Of course, there is the phenomenal overhead of the thread itself (which is on the order of several K), but ignoring that there are now extra temporaries that are necessary; one per thread to avoid having to lock a mutex at each step of the loop.
Of course, in your implementation (with 4 threads) this is only a 4x increase in the memory requirements of my implementation. However, in an N-thread implementation, this is an Nx increase in memory requirements.
This is not an insignificant number of cycles or memory increase.
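For reference, the two parallel designs being argued about look roughly like this. It is only a sketch with hypothetical names (`sum_locked`, `sum_private`), it sums 1..n inclusive, and it makes no claim about actual cycle counts on any machine.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Variant A: one shared accumulator, locked on every loop iteration.
std::uint64_t sum_locked(std::uint64_t n, unsigned threads) {
    assert(threads > 0);
    std::uint64_t total = 0;
    std::mutex m;
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&, t] {
            // Thread t handles 1+t, 1+t+threads, 1+t+2*threads, ...
            for (std::uint64_t i = 1 + t; i <= n; i += threads) {
                std::lock_guard<std::mutex> g(m);  // one lock per addition
                total += i;
            }
        });
    }
    for (auto& th : pool) th.join();
    return total;
}

// Variant B: one private partial sum per thread, combined after the joins.
// Extra memory is one 64-bit slot per thread; no lock inside the loop.
std::uint64_t sum_private(std::uint64_t n, unsigned threads) {
    assert(threads > 0);
    std::vector<std::uint64_t> partial(threads, 0);
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&, t] {
            std::uint64_t local = 0;
            for (std::uint64_t i = 1 + t; i <= n; i += threads) local += i;
            partial[t] = local;  // written once, when the thread is done
        });
    }
    for (auto& th : pool) th.join();
    std::uint64_t total = 0;
    for (std::uint64_t p : partial) total += p;
    return total;
}
```

Both return the same value. Variant A synchronizes n times, variant B once per thread at the join, which is where the disagreement about cost comes from.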
I currently have no clever signature witticism to add here.
Ha. My compiler parallelized that loop automatically. (Can yours?)
You say the performance hit of running this code in parallel would be "*massive*". Why is that? On my computer, the performance just gets better and better when I add processors. You say that parallelizing this loop would cause a "tremendous memory size explosion." Why would that happen? Are you saying the memory requirements would be worse than O(logN)?
Multi-threading is NOT the future. Multi-cell is.
I agree that multithreading is not the future, for all the reasons you give. But I don't believe multicell is either: you still have memory and I/O bottlenecks.
In fact, the future of high performance computing is already here: large amounts of commodity hardware. Every box you add automatically adds not only another CPU, but also a separate memory system and I/O.
Having said that, multicell doesn't have a future as a general purpose parallel computing paradigm, but it does hold the promise of being able to replace GPUs and other special-purpose hardware that litters our machines right now.
still haven't found the tech article but this is similar to what it was talking about...
...it makes the analogy even weaker but oh well...
"There was no salvation in using more than one steam engine on a single train, except in situations where extra power was needed for only a short distance, e.g., climbing a mountain grade. In normal operations, two steam engines would waste energy fighting against each other."
http://yardlimit.railfan.net/guide/locopaper.html
Garbage collection technology has been dealing with this well for a long time. Read The Memory Management Reference - multiple threads are assumed, and the single-threaded special case merits barely a mention. The "mutator" threads can keep running while garbage collection is going on, too - memory barriers are used to protect against race conditions.
Xenu loves you!
...now my code can look like this:
Thread 1....waiting for user input
Thread 2....waiting for user input
Thread 3....waiting for user input
All running at the same time!!!
Coder's Stone: The programming language quick ref for iPad
i'm pretty sure you can evaluate conservative fixed point analysis in parallel if you have a fine-grained machine (like an smt one)
Who saw Chewbacca in a cowboy hat?
OSGGFG - Open Source Gamers Guide to Free Games
That is, how would CAML and Scheme play?
"At any given load maybe 1 of them is active. "
The file cache thread *could* be active. File caching is only done during idle time to let the main thread run better; a simple OS tweak could overlap that better.
Same with the network: it *could* be overlapped better, it's only not done now to avoid impacting the front thread.
I assume the same is true through all the drivers and OS subsystems.
So yes you may have 1 thread mainly running now, but it doesn't mean you can't gain from this.
But all this misses the point, Sun make servers, the exact sort of boxes that run hundreds of active threads running the same code serving to multiple users. The perfect thing to benefit from this.
"A 1000-cycle task switch is NOTHING compared to the 2.2 million cycles a process has "
Fair comment.
Fascinating. Pray tell, how do you read the user's mind regarding exactly what data they want from a search query so you can do your query concurrently with the rendering the search results? I'm very curious about this exciting new technology.
I'm a researcher working on high performance computing and have used various configurations of Simultaneous Multithreading (aka Hyperthreading aka CMT) (Intel Xeon, IBM POWER5). The result is always the same - at the end, memory latencies and OS overheads kill most of the gains of instruction level parallelism coming from SMT. Look at it this way - the typical latencies of operations on most modern processors are of the order of 1 nanosecond, whereas DRAM latencies are of the order of 200ns. As long as you can't do anything about this latency, there's no point in cutting down on processing times. There's a very nice paper in this year's ACM SIGMETRICS that gives real experimental data to illustrate this fact - http://www.cs.princeton.edu/~yruan/XeonSMT/smt.pdf
The paper shows that the speedups obtained using SMT in practice are meagre. The reason that the simulation results coming from the original UWashington research on the subject - http://www.cs.washington.edu/research/smt/ - looked far better was their use of unreasonably large caches in their simulations, and that they completely ignored the OS overhead of enabling SMT - which is non-negligible - and is a thing that has been pointed out often on the Linux Kernel mailing list as well.
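A back-of-the-envelope version of the latency argument, using the 1 ns and 200 ns figures above. The miss rate m = 1% is purely an illustrative assumption; it comes from neither this comment nor the paper.

```latex
% Average time per operation: t_op = 1 ns, t_mem = 200 ns, assumed miss rate m = 0.01
\begin{aligned}
T &= t_{\text{op}} + m \, t_{\text{mem}} \\
T_{\text{baseline}} &= 1 + 0.01 \times 200 = 3\ \text{ns} \\
T_{\text{faster core}} &= 0.5 + 0.01 \times 200 = 2.5\ \text{ns}
  \quad (\text{speedup } 3 / 2.5 = 1.2) \\
T_{\text{half latency}} &= 1 + 0.01 \times 100 = 2\ \text{ns}
  \quad (\text{speedup } 3 / 2 = 1.5)
\end{aligned}
```

Under that assumption, doubling the processing speed buys about 20%, while halving the effective memory latency buys 50%. The higher the miss rate, the more lopsided it gets.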
Yep, and his first song is about how Darth Vader's pickup truck broke down, his wife left him, and he's drinking alone again.
That might be a solution if you've got the source. One of the more terrifying things learned during the great Y2K scare was that there exist a large number of legacy system that have been patched by directly modifying the binaries. Such systems have no source code anymore, and are not decompilable. Also, let's not forget that much of this code was probably written "oddly" to get another 2% worth of performance out of the original architecture; to do the job properly you'd have to rewrite the fiddly bits in a more standard fashion, then verify that you haven't: a) introduced any new bugs, and b) fixed any bugs the program was depending on. The second may be a non-issue in most cases, but the first is still a non-trivial exercise.
Just junk food for thought...
Actually it can't (because your multiplication might overflow even if the result doesn't). But I already posted what I think is the optimal version (without loops, and without overflow issues).
Thinking about it, the branch might be more costly than an addition and an xor, so here's another version which also avoids the if:
Like my previous version, this doesn't suffer from possible overflow.
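For a concrete picture of a loop-free, branch-free sum, here is a minimal sketch. It is not the poster's code (theirs used an addition and an xor); it assumes the goal is 1 + 2 + ... + n inclusive, and `triangular` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>

// Sum of 1..n without a loop or a branch.
// n even: (n + 1) / 2 rounds down to n / 2 and (n | 1) is n + 1.
// n odd:  (n + 1) / 2 is exact and (n | 1) is n.
// Either way the single multiplication yields n * (n + 1) / 2 directly,
// so the product only overflows if the answer itself does not fit.
std::uint64_t triangular(std::uint64_t n) {
    return ((n + 1) / 2) * (n | 1);
}
```

The naive `n * (n + 1) / 2` overflows 64 bits for n = 2^32 even though the result fits; this form does not.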
As for the cost of parallelization: It's of course true that it has memory cost. And of course it would be silly to make N threads (which each would do exactly 1 addition, namely adding their value of i to 0, which of course could be easily optimized away), and then add all those values together (which would be exactly the work of the single-threaded version anyway). I'd expect the number of threads to be vastly less than N.
As for the time cost: This just means that parallelization only makes sense for N much larger than the number of threads. If N is of the order of a milliard, then the time to synchronize the four threads should be negligible (after all, you don't synchronize in every loop cycle, but only at the end of each thread). All of course under the (sensible) assumption that N_threads << N.
The Tao of math: The numbers you can count are not the real numbers.
It's very simple, actually, I do it quite frequently. Let's say you need to populate a drop-down based on user input in another drop-down.
At the start of the page you collect user input and fire off the data access code for the original drop and the parameterized drop, each in a separate thread.
This executes while you're performing other formatting actions, like include headers, menu formatting, and outputting strings to your response (like client scripts, etc).
All the while, the other threads are formatting the first & second dropdown with the returned data, while your main thread is doing more menial UI tasks like formatting the tables and such that hold your page.
This is simply a basic example, but anyone who uses data access code even for a single databound table or drop should always be running it in a separate thread and letting the main thread handle the non-data related rendering. There is a TON of work on web pages that you can be doing that is not data related, nor is it required to be performed before or after the data is available.
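The shape of that approach, sketched in C++ rather than ASP.NET. `fetch_options`, `render_dropdown` and `render_page` are hypothetical names, and the "query" is faked with canned data. Both queries start before any rendering happens, and the main thread only waits at the point where it actually needs the data.

```cpp
#include <cassert>
#include <future>
#include <string>
#include <vector>

// Hypothetical stand-in for a database SELECT that fills a drop-down.
std::vector<std::string> fetch_options(const std::string& table) {
    return {table + ":a", table + ":b"};
}

std::string render_dropdown(const std::vector<std::string>& options) {
    std::string html = "<select>";
    for (const auto& o : options) html += "<option>" + o + "</option>";
    return html + "</select>";
}

std::string render_page(const std::string& user_choice) {
    // Fire off both queries immediately, each on its own thread.
    auto first = std::async(std::launch::async, fetch_options, std::string("countries"));
    auto second = std::async(std::launch::async, fetch_options, user_choice);

    // Meanwhile the main thread does the non-data work (headers, menus...).
    std::string page = "<html><body><h1>Search</h1>";

    // get() blocks only if the corresponding query has not finished yet.
    page += render_dropdown(first.get());
    page += render_dropdown(second.get());
    return page + "</body></html>";
}
```

The two queries here are read-only and share nothing, so no locking is needed. Writing the results into a shared cache would change that.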
Big hairy package..
Uhlm.. Too much information.
BROOKLYN
Looks like an interesting platform to run Cilk on.
Forget Java, it can't multi-thread its way out of a paper bag.
Sun sends me my free Solaris 10.0 DVD I signed up for years ago. All I got was their spam and phone calls trying to get me to fork down 50k+ for development software.
Sorry, sorry, sorry...
I couldn't help it.
- None can love freedom heartily, but good men; the rest love not freedom, but license. -- John Milton
you must only be familiar with shared-state concurrency. because if you weren't, you wouldn't spread this FUD framed from the perspective that threads are the only way to manage concurrency (parallelism) in software.
the means to do "thoughtless" concurrency has been available for going on 40 years now. look up (http://www.c2.com/) carl hewitt's actors paradigm (predecessor to alan kay, both as student-teacher relationship and actors-OOP) as well as read the successor to SICP, Concepts, Models, and Techniques of Computer Programming (http://www.info.ucl.ac.be/people/PVR/book.html).
now, understand instead: MESSAGE-PASSING CONCURRENCY. this is the "thoughtless" solution to programming on the Cell chip, CMT, and all these other new buzzwords for concurrent processing of information with multiple cores. (though, i don't think Oz, or MOZart, are the pragmatic languages to do this with; simply because i don't believe in multiple-paradigm languages -- perhaps i'm just really bitter against C++.)
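a very small sketch of the message-passing style, in C++ rather than Oz or an actor language. `Mailbox` and `run_counter_actor` are hypothetical names. the only thing the threads share is the mailbox; the running total belongs to the actor thread alone, and senders never touch it. (the mailbox still uses a mutex internally, it is just hidden from the code that does the work.)

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical mailbox: the single object shared between threads.
class Mailbox {
public:
    void send(int msg) {
        {
            std::lock_guard<std::mutex> g(m_);
            q_.push(msg);
        }
        cv_.notify_one();
    }
    int receive() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty(); });
        int msg = q_.front();
        q_.pop();
        return msg;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<int> q_;
};

// The actor thread owns `total`; senders communicate only by message.
long run_counter_actor(int senders, int messages_each) {
    assert(senders > 0 && messages_each >= 0);
    Mailbox box;
    long total = 0;
    const int STOP = -1;  // sentinel: one per sender marks "I am done"
    std::thread actor([&] {
        int stops = 0;
        while (stops < senders) {
            int msg = box.receive();
            if (msg == STOP) ++stops;
            else total += msg;
        }
    });
    std::vector<std::thread> pool;
    for (int s = 0; s < senders; ++s) {
        pool.emplace_back([&] {
            for (int i = 0; i < messages_each; ++i) box.send(1);
            box.send(STOP);
        });
    }
    for (auto& t : pool) t.join();
    actor.join();  // after this join, reading total is safe
    return total;
}
```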
...and I don't see how on-chip threading helps. Instead of memory serving a single thread's stream of instructions and one set of registers being loaded/stored, you now have multiple threads demanding multiple streams of instructions and loading/storing from multiple register sets.
Do the CMT chips assume greatly expanded L1 and L2 caches? The more threads, the broader and more scattered the working set. Without parallelism in memory service, multiple threads in multiple cores will just mean even more hardware sitting idle while waiting for a cache line to be loaded. Doing nothing in parallel? Not a win.
From the article "At one point during the CMT summit, I stuck my hand up and asked: is there anything that in principle doesn't scale with multithreading? There wasn't a lot that leapt to the minds' eyes, except for compiler code."
Any application where latency is important (high performance network servers and proxies, for instance) will either not gain or often suffer from multi-threading for two reasons:
Nashville doesn't produce music like that any more. If anything, the first song would be Anakin singing about how Padme thinks his astromech droid is sexy.
the good ground has been paved over by suicidal maniacs
Unless they are already planning on many more Gb of on-chip cache, data-starvation will become an even bigger issue than it is today.
It might be less of a problem for multiple-threads that are executing in the same program, but they are still likely to be operating on different data streams.
In the case of multiple cores running different programs it will get much worse, unless average program sizes shrink to a 1-2Mb of Resident/Working-Set size. Right now, looking at 2 Desktop systems:
This would seem to indicate a need for 4-16Mb of L2 cache to keep all these processes from forcing L2-cache misses at 100's to 1000's of context switches/second. These are desktop systems that are not doing much other than email and web browsing. I cannot see it being better with high-load server systems. How many of the new multi-core systems are going to have L2 cache > 8Mb/core? 4Mb/core (for fast cache/low-latency memory)? How many systems will have main memory fast enough to feed 8-32 processes?
I've read that CPU starvation is already a problem in the faster Intel family processors, will the "system" hardware infrastructure be there to enable multiple cores to be fed?
They may be lowering the GHz/cpu, but as the Sun article points out, with 8 cores, that's still 8-cores times "N"GHz to be kept fed with data.
It's going to be a strained design scenario if you need to constrain those 8 cores to using an average of 1Gb/core of cache memory.
Does anyone know if the new "breed" of multi-core CPU's have a shared cache or if they are going to be limited to separate caches/core? Could cache memory contention become an issue?
BTW -- does anyone know if disk manufacturers are planning (or are switching to common use) of multiple heads/platter? I could see arrays of 2-4 heads to cut seek latency by 50-75% and disks with heads 90 or 180 degrees out of phase to reduce rotational latency -- perhaps allowing lower RPM disks to consume less power and run with lower noise/cooling requirements. Maybe this is already being done in higher end SCSI disks?
**-why doesn't "ecode" support spacing? How does one do tables? What are "too many "lame" characters (when I had better table w/more spacing)? Grumble -- took 3x as long to format as write! >;-((
You want to see Ghyslain jump the General Lee over the creek, same as the rest of us. Stop with the denial, man!
--
"Outlook not so good." That magic 8-ball knows everything! I'll ask about Exchange Server next.
I think you are correct that there are two camps, but I would define them differently. There are scientific/educational/industrial users that are already running multithreaded software. And there are home users that are about to get a lot of new multithreaded software.
There isn't much that can be done to improve the performance of a single-threaded CPU with current technology. Both Intel and AMD have recently announced or released dual-core chips that will start making their way into home systems. The big iron vendors have been working on this for years. In the very near future home users will notice big performance differences between newer multi-threaded apps and their older software.
"Has John Williams gone country?"
No, that's his brother, Hank.
Beauty is in the eye of the beerholder.
Trust me, any app that old isn't going to be able to handle anything approaching large amounts of data. Case in point: the recent implosion of the pilot scheduling system last December. A legacy system not decades old which nevertheless was unable to handle a moderate load.
At the edge case, big iron wasn't farther than maybe 36 bits or so in terms of memory addressing, so that's probably about as much data as you'll be able to crunch, which is well within current capabilities. Scientific computing has always been designed with parallelism in mind, which is why you still have decades-old Fortran running around in modern simulations, but business computing (transactions, basically) is either designed for multiple CPUs running lots of transactions already, or is some dinky one-off which does minimal processing on its input.
I won't deny that some fool is probably trying to keep a 30 year old legacy system going because it just works, but it's rather wildly optimistic to think that there are any single-threaded legacy apps which need to crunch multi-TB or even GB data sets. Anyone whose data requirements have grown anywhere close to the pace of Moore's law (and thus need more and more single threaded performance) certainly has the need for a rewrite.
The class of problems that must be serialized is smaller than we tend to think it is.
If you look up some of the old research that Thinking Machines did with the Connection Machine (a hyper-cube architecture; a common configuration was 64K processors in 16K nodes of 4 processors each), one of the surprising results was parsing large program files in logarithmic time.
I do not claim that is a practical result, but I hope that it is enough to make some of us drop our assumptions about whether a given problem is fundamentally serial, with no hope of improving the performance with parallel processing.
I had forgotten how much cooler teenagers look when they are smoking. Oh, wait
Any one else read the title as "Ready for Country Music Television?"
My first thought was: "Nooooooooooooooooo!!!!"
First, the BIOS will download and play CMT stuff. (It would need a microphone to verify that it was played loud enough.)
After a while, the CMT will tell the DRM in your computer about your criminal behaviour!
I wonder if the DRM will print the lawsuit on your own printer or send off an email so you get it by snailmail?
Or maybe the DRM will just demand your credit card number -- or kill all the data on your hard disk?
I wish I could add a ":-)" here.
Karma: Excellent (My Karma? I wish...:-( )
An enormous amount of work has gone into finding ways to optimize legacy FORTRAN code for highly parallel architectures. Universities have been hammering away at the FORTRAN "lobster" (the huge body of well-tested library code used in research and some engineering circles) for close to 15 years now. I remember my college roommate doing this for a summer job in the early 90s, and it wasn't a new effort then.
Socialism: a lie told by totalitarians and believed by fools.
Why would Star Wars Weenies care about Central Mean Time?
Nathan's blog
For those with an interest in HPC, this would seem to be an attempt to bring Burton Smith's (Tera Computing, now Cray, Inc.) MTA idea to the world of mainstream business computing. Still a far cry from 128 simultaneous threads but working down that road. I do know that they have ported over and parallelized a lot of code for this design.
This is not multi-threaded programming, it's serial programming run on a farm. The only difference is the farm is at the CPU level rather than the data center level. You can get the same performance benefits by just optically hooking together a bunch of boxes.
Thread-happy applications like Java and databases should do well.
I have to disagree. The mutex is not significant, because N is very large at least > 100,000. (Assume a processor large enough to hold the result). If N is expected to be small you are correct, if N is very large, for some value of very large, then locking time is not significant.
When doing big-O analysis in computer science we always assume n large enough to overtake constant losses. If n is known to never be more than 4, then an O(n!) algorithm may run faster than an O(1) algorithm. However the O(1) algorithm will not be any slower when n is 1000, while the O(n!) algorithm will not finish in our lifetime.
So my code would be in C++: (Sorry about the formatting, but I'm not about to figure out how to make it look nice in allowable html):
int loopCountHelper(int start, int end) {
    int result = 0;
    for (int i = start; i < end; ++i) {
        result += i;
    }
    return result;
}
int AccumulateLoopCount(int N) {
    int accumulator = 0;
    threadList tl;  // hypothetical thread-list helper, not a real library
    while (N > 0) {
        int start = N > THREADFACTOR ? N - THREADFACTOR : 0;
        tl.newThread(loopCountHelper, start, N);  // run the helper on [start, N)
        N = start;
    }
    while (tl.moreThreads()) {
        accumulator += tl.waitandNextGetResult();
    }
    return accumulator;
}
It has been a long time since I've done function pointers in C++ so I'm sure I did that all wrong. However I think you get the idea. For that matter now that you see what I'm trying to do I suspect you could come up with a better design. (I know I could if I had a few hours and a good editor to work with)
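One way the same idea could look with standard C++11 facilities instead of the made-up threadList. This is a sketch, not a tuned implementation: `chunk_size` plays the role of THREADFACTOR, each chunk gets its own `std::async` task, and the range summed is [0, n).

```cpp
#include <cassert>
#include <cstdint>
#include <future>
#include <vector>

// Sums the integers in [start, end).
std::uint64_t loop_count_helper(std::uint64_t start, std::uint64_t end) {
    std::uint64_t result = 0;
    for (std::uint64_t i = start; i < end; ++i) result += i;
    return result;
}

// Splits [0, n) into chunks of chunk_size, sums each chunk on its own
// thread, and adds up the partial results as the tasks finish.
std::uint64_t accumulate_loop_count(std::uint64_t n, std::uint64_t chunk_size) {
    assert(chunk_size > 0);
    std::vector<std::future<std::uint64_t>> tasks;
    for (std::uint64_t start = 0; start < n; start += chunk_size) {
        std::uint64_t end = (n - start > chunk_size) ? start + chunk_size : n;
        tasks.push_back(std::async(std::launch::async, loop_count_helper, start, end));
    }
    std::uint64_t accumulator = 0;
    for (auto& t : tasks) accumulator += t.get();  // one wait per chunk
    return accumulator;
}
```

The chunk size should be large enough that the number of tasks stays near the number of hardware threads; one thread per tiny chunk would spend more time on thread creation than on the additions.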
As someone who has worked on a project to replace 25 years old legacy applications I have some insight in how such old applications are used on a daily basis. Believe me. There was this report which was running for 200 processor hours because of the sheer amount of data to be processed, which was to be run on 18 different processors to keep the time low enough to have it run on weekend and still have the chance to restart it if it fails, once a month. ;) It's called O(n^m), and this is not even the worst case.
Sometimes the processing time goes with the square or a higher power of the amount of data.
Addressing often isn't the issue, matrix operations are. And many optimization algorithms use matrix operations.
With every new generation of hardware more data was thrown at the old report, so it still was run once a month, and it still took about 200 hours of processing time, without changing the base algorithm.
And I was also working on the new report, where the processing time condensed to 1:20h, and suddenly the managers were running the report on a daily basis to make sure they don't miss any changing results.
15x the then 900 MHz SPARC III, I doubt it seriously
That is EXACTLY what they are claiming.
Based on what I've read, the Niagara CPU will only be deployed in single-slot servers... the only thing it might be useful for is front-end web servers and light-duty app servers.
Yes, Sun will market Niagara for blades and small servers, such as web servers with a single CPU (8 cores, 32 hardware threads).
The larger servers will have multiple Fujitsu chips: 2.4 GHz, 64-bit, dual-core SPARC64 with a very large cache.
They will use Fujitsu chips until Sun's own "Rock" processor arrives sometime after 2008; it is supposed to offer both twice the throughput of Niagara and blazing single-threaded performance.
I would like to see a showdown between the IBM/Toshiba Cell and Niagara.
Brilliant, now I can test a PlayStation 3 against a web server.
It's my opinion that the Sun engineering team are in serious trouble.
You have clearly been misinformed. See above posts.
Oversimplified. I believe your "It's very simple" as much as I believe there's an easy way to eliminate deadlock, resolve synchronization issues....
Languages and frameworks do not change the fundamentals of computer science....
I don't have to synclock DA objects running SELECTs... do you? No wonder you think this is hard. There aren't any sync issues in the above if you partition your code properly. The only time locking should come into play is if you have a cache you're dumping the returned data into... which was clearly outside the scope of the simple example I provided to prove a point.
However, in that pesky real-world, O(n) is just a guide. The constants are quite meaningful in a lot of cases. Locking a mutex costs *thousands of clocks*. Each time. Incrementing or decrementing a semaphore also costs thousands of clocks. (You should also consider the fact that sometimes for performance metrics, it's more meaningful to use theta(n).)
Bottom line, if you were trying to get maximum performance out of that simple loop that I posted, you'd care very much about those costs, I assure you.
I currently have no clever signature witticism to add here.
Yes, in the real world you do care. Which is why the solution I presented is best: it uses some factor to determine how many threads to start. Each thread does enough work to make the locking cost trivial compared to the total run time.
Note that in the simple loop you posted, the results of one iteration do not depend on the results of the previous one, so I was able to factor all the shared data accesses out.
You are correct that theta(n) is often useful, but I posted a solution where theta(n) isn't as important as O(n). Remember the assumption that n is very large, and this algorithm dominates run time. When either case is not true, then my solution isn't of much use.
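To make the "enough work per thread that the locking is trivial" point concrete, here is a minimal sketch, assuming a shared total guarded by a std::mutex. The function name countWithCoarseLocking and the numThreads parameter are made up for illustration. Each thread counts into a private variable and takes the lock exactly once at the end, so the mutex cost is paid numThreads times rather than n times.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Adds up n units of work using numThreads threads. Each thread keeps a
// private counter and locks the shared total exactly once.
long countWithCoarseLocking(long n, int numThreads) {
    assert(numThreads > 0);
    long total = 0;
    std::mutex totalMutex;
    std::vector<std::thread> workers;
    const long perThread = n / numThreads;
    for (int t = 0; t < numThreads; ++t) {
        const long begin = t * perThread;
        // The last thread also picks up the remainder of the division.
        const long end = (t == numThreads - 1) ? n : begin + perThread;
        workers.emplace_back([begin, end, &total, &totalMutex] {
            long local = 0;
            for (long i = begin; i < end; ++i) {
                local += 1;  // stand-in for real work; touches no shared state
            }
            // One lock per thread, not one per iteration.
            std::lock_guard<std::mutex> lock(totalMutex);
            total += local;
        });
    }
    for (auto& worker : workers) {
        worker.join();
    }
    return total;
}
```

If the increment inside the loop took the mutex instead, the lock would be hit n times and the per-lock constant the parent is worried about would dominate. With the private counter it only shows up once per thread, which is the whole argument for picking a large enough chunk of work.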
Just to add, there are languages designed specifically to exploit parallel architectures, effective in both shared-memory and non-shared-memory environments. Bit of a plug, but see KRoC (the Kent Retargetable occam-pi Compiler). And google for CSP (Tony Hoare, Bill Roscoe) for the formal semantics that make such parallel systems "safe" (and understandable/composable).