Posted by
ryuzaki0
on from the fun-with-new-hardware dept.
GFD writes "EETimes has a nice overview of Sun's new MAJC architecture. Combines multiple processors on one chip with VLIW and on chip multi threading. " I've been seeing some information about thsi floating around, but EETimes has done a nice job summarizing the chip itself.
61 comments
Re:It's an SMP, no less, quite alot more.
by
jovlinger
·
· Score: 1
The SMP part is kinda boring. Yes. The fun is how you have multiple threads on a chip. Imagine that you have a deep pipeline, and there is a stall of some kind (you need a value from memory, for example). This would normally mean a wasted chance to use the ALU. On a threaded chip, there is another instruction stream (with its own decoding pipeline and everything), so instead of stalling, the CPU switches decoding pipelines and keeps the ALUs busy. Eventually this thread will stall, or time out or something, and the other thread gets a chance to run. Cool!
The point is that you replicate everything (I think) but the ALUs (so each thread has its own register file and what not), so context switches are instantaneous. Tera has a (vaporware?) supercomputer that switches after every instruction.
Ok, so you think this seems like a lot of duplicated silicon. Consider this: much of the silicon on a chip is devoted to not stalling the pipeline (register renaming, forwarding, speculation). If stalls become free, we can simplify the circuitry a lot and crank up the clock speed A LOT (simpler circuitry => smaller die => higher speed). Cool! Double cool! No stalls and a faster clock. Think of all the technologies on a chip these days.
Ok, things are not entirely simple (hands up those who thought they were), but that's the gist of it. For example, we'll need hardware support for locks and communication, probably.
So why haven't we seen these chips before, if the news is old? I think some of the previous posters got it right -- it's only recently that we've started getting applications that are almost trivially parallelisable. One poster mentioned communication overhead; by providing annotations that these 10 threads want to be on the same CPU, we can be assured that they'll have very quick interthread communication.
Johan
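A toy model of the switch-on-stall idea described above, as a minimal sketch and not anything MAJC-specific. The instruction encoding (an integer giving how many cycles the thread stalls after issuing that instruction) and the function name `run` are assumptions made up for illustration; real hardware has far more going on.

```python
def run(threads, multithreaded):
    """Simulate one shared ALU that issues at most one instruction per cycle.

    threads: list of instruction lists; each instruction is an int giving
             the number of stall cycles that follow it (0 for a plain ALU op).
    multithreaded: if True, the ALU may issue from any ready thread;
                   if False, it sticks with one thread until that thread is done.
    Returns (busy_cycles, total_cycles), counted up to the last issue.
    """
    pc = [0] * len(threads)        # next instruction index per thread
    ready_at = [0] * len(threads)  # first cycle each thread may issue again
    cycle = busy = current = 0
    while any(pc[i] < len(t) for i, t in enumerate(threads)):
        unfinished = [i for i, t in enumerate(threads) if pc[i] < len(t)]
        if multithreaded:
            candidates = [i for i in unfinished if ready_at[i] <= cycle]
        else:
            first = unfinished[0]
            candidates = [first] if ready_at[first] <= cycle else []
        if candidates:
            # Prefer the thread that is already running; switch only on a stall.
            current = current if current in candidates else candidates[0]
            stall = threads[current][pc[current]]
            pc[current] += 1
            ready_at[current] = cycle + 1 + stall
            busy += 1
        cycle += 1
    return busy, cycle


# Two identical threads: an ALU op, a "load" that stalls 3 cycles, repeated.
work = [[0, 3, 0, 3], [0, 3, 0, 3]]
print(run(work, multithreaded=False))  # (8, 14)
print(run(work, multithreaded=True))   # (8, 9)
```

Both runs issue the same 8 instructions; the multithreaded run simply fills most of the stall cycles with work from the other thread.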
Here's an idea
by
Anonymous Coward
·
· Score: 0
Why doesn't Sun, AOL, iPlanet shove their MAJC piece of crap, and start explaining why they're screwing Linux and the open source community?
1) The instant messenger API fiasco;
2) Mirabilis ICQ has always ignored Linux, except for that disgusting Java app they never fixed or upgraded;
3) WHO IS SUPPORTING NETSCAPE??? The newsgroups are full of bug complaints, and there is no news on updates or bugfixes, and Netscape appears dead for all intents and purposes.
I'm seriously tired of AOL/Sun/iBunghole leaving Linux users ignored, their products unsupported, and seeing the constant pro-sun posts on slashdot for no good reason at all.
Before anyone says "look at mozilla", I say "look at mozilla's buglist". It's obviously far longer than Communicator's.
/. posters assuming that the number one priority of every company on the face of the planet should be to make sure that they are providing as much support as they possibly can for the linux market. IM sucks, ICQ sucks, Netscape sucks. Why shouldn't Sun ignore linux users? What makes you so special?
If linux is screwed without a high end browser, why don't you write one?
Re:Here's an idea
by
Anonymous Coward
·
· Score: 0
Yes! The classic reply all wannabe linux developers have to complaints : "Go Write It Yourself".
You are the reason Linux will fail -- an ignorant chimp who cannot analyze a difficulty, and provide a rational response other than "DIY".
The truth is, Netscape once fervently supported Linux, but their stock is down to nothing, and AOL owns them.
My customer pays me for support. I give it to them, wholeheartedly.
I buy Linux, I ask for support, I get one reply -- from an asshole like you. You clueless bastard, you have no idea what support is, so you hide behind a free product yelping at people to "write a browser themself"
The fact is, Sun,Netscape,AOL have backed off Linux support. Deal with it!
Re:Here's an idea
by
Anonymous Coward
·
· Score: 0
No offense, but if Microsoft was "smart" enough to figure out the on-the-wire protocol for ICQ, why don't you just go out and build an ICQ client that's more suitable than the Java app that Mirabilis originally supplied?
After the SPARC debacle, where Sun first encouraged others to use the SPARC then pulled the rug out from those who did, I'll wait until somebody else does such a processor before getting excited.
-- Bantam Dominique roosters crow a four-note song. Once you've heard it as "Happy BIRTHday" you can't NOT hear it that way
Re:How?
by
Anonymous Coward
·
· Score: 0
There are a number of papers from the University of Washington by Susan Eggers and Hank Levy which detail the advantages and some of the implementation requirements of Simultaneous Multi Threading.
The most useful (and boring) way to look at this chip is to think of it as an ordinary 2-processor symmetric multiprocessor. Run a normal SMP OS on it, no problem. Same benefits, same disadvantages.
No need to complicate this debate with statements like `this will be *useless* unless we solve the dynamic autothreading problem of single apps!'. SMP is useful for running unrelated threads/apps concurrently today. Tomorrow's single chip form will be just as useful.
The only complication I see this chip introduce is that, because the two instruction streams share the same resource units (FPUs, barrel shifters, ALUs), heavy use of resources by one of the threads will slow the other thread down. Traditional SMPs don't have this problem: *all* resources are duplicated.
That's fine by me. There are some pretty neat scheduling advantages to that restriction. For example, if the chip is designed right, one could schedule thread #1 to get 75% of the instruction microcycles, while #2 gets 25%, and whatever microcycles each can't use the other gets. Alternatively, one could schedule thread #1 to get 100% of the microcycles while #2 gets whatever of those microcycles #1 wasn't able to use.
Ahhhhhh heaven. True realtime scheduling at the instruction level, no overhead. Who could ask for more? No wonder why this chip is directed at the embedded market. They could really use this feature.
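The 75%/25% split with leftover cycles handed to the other thread can be sketched as follows. This is a hypothetical software model, not a description of what the chip actually does: the function name `share_slots`, the repeating ownership pattern, and the per-cycle "ready" flags are all assumptions for illustration.

```python
def share_slots(ready0, ready1, owner_pattern=(0, 0, 0, 1)):
    """Count how many issue slots each of two threads gets.

    ready0, ready1: per-cycle booleans, True if that thread could issue.
    owner_pattern:  repeating sequence naming which thread owns each slot;
                    (0, 0, 0, 1) gives thread 0 three slots out of four.
    A slot its owner cannot use is offered to the other thread.
    """
    issued = [0, 0]
    for cycle, ready in enumerate(zip(ready0, ready1)):
        owner = owner_pattern[cycle % len(owner_pattern)]
        other = 1 - owner
        if ready[owner]:
            issued[owner] += 1
        elif ready[other]:
            issued[other] += 1
    return issued


always = [True] * 8
print(share_slots(always, always))                    # 75% / 25% split
print(share_slots([False] * 8, always))               # thread 1 takes every slot
print(share_slots([True, False] * 4, always, (0,)))   # thread 0 owns all, thread 1 gets leftovers
```

Passing `(0,)` as the pattern models the second policy in the post: thread #1 owns every slot and thread #2 only runs in the slots #1 leaves idle.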
A bit of history: I believe this idea was floated at Xerox PARC, at the time those folks were inventing mice and desktops and shared printers.
Re:Beowulf!
by
Anonymous Coward
·
· Score: 0
Hahah...hahah....hahah...hahah *holding sides* You don't fool me - this is a genuine Beowulf cry. Don't be ashamed...go on...you can say it: Beowulf...would...rock...on...a...cluster...of...these.....!! Well done, you are learning fast.
Re:hmmm, fascist
by
Anonymous Coward
·
· Score: 0
No. k as in 2^10. 2k = 2 * 2^10 = 2048. Not 2038, the year a 32-bit time_t will run out.
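Both numbers can be checked in a few lines. This sketch assumes the usual convention that a 32-bit `time_t` is a signed count of seconds since 1970-01-01 00:00:00 UTC; the variable names are just for illustration.

```python
from datetime import datetime, timedelta, timezone

# "k" as a binary prefix: 2k = 2 * 2^10
two_k = 2 * 2 ** 10

# Largest value a signed 32-bit time_t can hold, as a UTC timestamp
TIME_T_MAX = 2 ** 31 - 1
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
rollover = EPOCH + timedelta(seconds=TIME_T_MAX)

print(two_k)                 # 2048
print(rollover.isoformat())  # 2038-01-19T03:14:07+00:00
```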
Re:BeOS and a multi-threading chip
by
Anonymous Coward
·
· Score: 0
PC owners won't abandon their investment just for some pretty new architecture which is basically incompatible (as far as we know) with their existing hardware. And the whole thing will stagnate because no-one can start a market big enough to get the software backing (the Hardware-Software Paradox).
The software backing will come from Java apps written for other platforms. This explains Sun's enthusiasm for hyping Java - if they can convince people to write Java apps for existing platforms, they will have a large codebase for MAJC when they launch it.
Michael Rogers
Re:Holy grail?!
by
Anonymous Coward
·
· Score: 0
"The holy grail in the industry is breaking a single-thread application into multiple threads."
I'm really surprised by this statement. It seems like a lot of the CPU-bound things that people run these days are easily multi-threadable, e.g. games, raytracing, image processing, even some aspects of compilers. Obviously people can just as easily name apps that aren't parallelizable, but there's already plenty of code out there just dying to run on more than one processor.
What kinds of commonly-run CPU-bound apps aren't threadable, which are giving these guys so much grief?
Re:hmmm, fascist
by
Anonymous Coward
·
· Score: 0
I think people who smoke marijuana should be imprisoned for life. I keep seeing marijuana smokers getting short prison terms. This is an outrage.
Re:National speed limit abolished
by
AtariDatacenter
·
· Score: 2
Haven't you kept up with the news? The national speed limit has been abolished. Besides, not everyone drives on public roads. :)
But I wasn't reading /. at the time! 8) Anyway, it's not THAT big of a time-sink, and I get a lot of valuable news from the site. J05H
-- gigantino.tv - Heavy but weighs nothing.
How?
by
Anonymous Coward
·
· Score: 0
How can multithreading be implemented in hardware? Anybody really know about this stuff or where I can find out? Does the instruction set have instructions for creating threads and such?
BeOS and a multi-threading chip
by
PaulWay
·
· Score: 1
This is not a new opinion, but I thought I'd add some more contemplation.
A multi-threading chip would be useless in a non-multi-threading environment (e.g. DOS). This sort of chip is responding to the prevalence of multi-threading multitasking operating systems and applications. Since the BeOS is the furthest anyone's gone to making an OS multithread and multitask, to my mind this chip would run the BeOS like teflon-coated lightning.
Think about it - the best way to get maximum performance out of this architecture is to have lots of 'small' threads, to have as many threads available for immediate execution should one stall. If there's any OS out there that is more comprehensively thread oriented (which leads to more application threading) it must be proprietary.
But enough of that pipe-dream - let's get back to reality. Be won't dedicate engineering efforts to a new chip whose market is unproven without backup. PC owners won't abandon their investment just for some pretty new architecture which is basically incompatible (as far as we know) with their existing hardware. And the whole thing will stagnate because no-one can start a market big enough to get the software backing (the Hardware-Software Paradox).
It's on the wish-list somewhere, but I'm not selling the Celeron just yet.
-- --Reason is a tool. Try to remember where you left it.--
Re:BeOS and a multi-threading chip
by
edwdig
·
· Score: 1
GEOS has been doing preemptive multithreading & multitasking on PCs since '91 or so. Most apps (like 99%) run with the UI in one thread and the processing in another. Makes the system really responsive and seem faster than it is (not that it isn't already a million times faster than Windows). More threads are extremely easy to create if needed.
Indeed, it was Ebay management's fault. But it is hard to fault them. They have a huge farm of E10000s, but no hot backup in case the sh*t hits the fan. Not only that, in order to keep up with the exponential increase in business they would have to shell out over a million dollars apiece for more hardware. Is business going to keep growing at 300%/yr? Should they buy 16 more Starfires and the storage to go with them? How much would this cost? 50 million? How much profit have they made? There are a lot of tough questions for management to answer. They (the CEO and executives) have made at least one mistake so far. Hopefully they (and others) will learn.
The moral: With an internet company, you have to spend the money and "bet the farm".
_damnit_
--
_damnit_
It's my job to freeze you. -- Logan's Run
Re:Multi-threaded chip? hmm...
by
Anonymous Coward
·
· Score: 0
Well, considering that BeOS people don't have enough time to fully support hardware on *existing* platforms, I wouldn't get too excited about the near-impossibility of them supporting a high end chip not yet available.
Re:National speed limit abolished
by
PeterT
·
· Score: 1
Duh.... Which police state do you live in? Speed limits do not apply to private property. Period.
I can drive as fast as I wish on _My_ property.
PeterT
Re:More links + some analysis
by
Anonymous Coward
·
· Score: 0
The truth is, Sun isn't the only company planning to include more than one CPU unit on one die.
One big question: are they planning on having the option to run the two CPUs in lock-step mode--where the chip can check the CPUs for errors (i.e. a big XOR on the outputs)? Their competitors would likely want to know. And the answer is probably "yes", considering how things are moving to focus more on reliability than speed, since companies like E-Bay don't want downtime.
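The "big XOR on the outputs" check can be illustrated in software. This is only a sketch of the idea, assuming two hypothetical core functions that should compute identical integer results; the names `lockstep`, `healthy_core` and `faulty_core` are invented here, and nothing below reflects how Sun or anyone else actually implements lock-step hardware.

```python
def lockstep(core_a, core_b, inputs):
    """Run both cores on every input and compare their outputs with XOR.

    Any nonzero XOR means at least one bit differs, which is treated as
    a detected fault. Returns the agreed-upon results otherwise.
    """
    results = []
    for value in inputs:
        out_a = core_a(value)
        out_b = core_b(value)
        if out_a ^ out_b:
            raise RuntimeError(
                f"lock-step mismatch on input {value}: {out_a:#x} vs {out_b:#x}"
            )
        results.append(out_a)
    return results


def healthy_core(x):
    # Stand-in computation, truncated to 32 bits.
    return (x * 3 + 1) & 0xFFFFFFFF


def faulty_core(x):
    # Same computation, but with one bit flipped for a single input.
    result = healthy_core(x)
    return result ^ 0x4 if x == 2 else result


print(lockstep(healthy_core, healthy_core, range(4)))  # [1, 4, 7, 10]
```

Note that the check only detects disagreement; it cannot tell which of the two cores is the faulty one.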
Re:Beowulf!
by
Anonymous Coward
·
· Score: 0
hehe. the best laugh i've had in about a week;)
Re:Interesting... but will it run Linux?
by
Anonymous Coward
·
· Score: 0
The answer is unknown at the moment. There is no gcc and no binutils for MAJC. When we get tools, we may start thinking about the kernel and the userland. --P
Re:More links + some analysis
by
ChrisRijk
·
· Score: 2
Yes, I know Sun isn't the only one with this approach - I said that in my post.
I have no idea if Sun plan to do redundancy checking with multiple pipelines with the UltraSparc-V. (some IBM, and other, chips do this...) They might do it as an option, but I would currently guess they're doing it mostly for performance.
EBay's reliability problems are mostly related to poor management decisions (it seems) rather than EBay's (or Sun's or Oracle's) techs. Doing the above kind of checking wouldn't have helped EBay either. It doesn't matter what OS you use, if you have a screwed up setup, you'll get problems. And you'll be surprised/horrified at just how long it can take screwed setups to be fixed if the site's already gone live. (I know from experience. and no, it wasn't my screwed up setup.)
So why haven't we seen these chips before, if the news is old?
For the simple reason that, though a multithreaded machine is in the aggregate faster, though it makes more efficient use of chip resources, though it promotes fast context switching at the microinstruction level, any single thread will run *slower* on such a chip than on a chip optimized to run a single thread. This kills benchmark results for all the typical highly publicized benchmarks.
Until now, no one wanted to run a chip which had a lower benchmark rating. Nowadays, there is a greater appreciation for multiprocessing, and, due to the high performance of today's chips, the single-thread benchmark race is finally loosening its grip on the mind of purchasing agents and of the computing public.
Joe
pipelining processes: BINGO!!!
by
Mr+Z
·
· Score: 3
Three words come to mind: HIT, NAIL, and HEAD. :-)
To give an example from a paper I'm a coauthor on (being presented at ICSPAT'99), consider a JPEG decoder. Here's a quick overview of the bulk of a JPEG decoder:
for each 8x8 block
    Decode the Huffman code for the block
    Perform inverse-quantization on the block
    Perform the IDCT on the block
    Write the block to the correct plane in the image
On a deeply pipelined / highly parallel processor, this is horribly inefficient, because each task is very small when applied to only one block, whereas switching between tasks is quite expensive. But that's exactly what a lot of JPEG decoders do (including the Independent JPEG Group's decoder). The decoder is a lot easier to write that way, but is not nearly as efficient as it could be.
Instead, you want to batch things up as much as possible:
For all chunks of the encoded JPEG, do
    Read a chunk of encoded JPEG
    Decode the Huffman code for as many 8x8 blocks as possible
    Inverse-quantize all of these blocks.
    Perform IDCT on all of these blocks.
    Write all of these blocks out to the image.
Now, you can make massive gains in efficiency due to better instruction cache locality, better parallelism across loop iterations due to the fact you're actually looping quite a bit now, and so on. (The wins are rather dramatic on a DSP which relies on programmed DMAs to move data on and off chip.)
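To make the restructuring concrete, here is a sketch that runs the same three stages in both orders and counts how often the processor has to change from one stage's code to another. The stage functions are trivial arithmetic stand-ins, not real Huffman, dequantization, or IDCT code, and all names are assumptions for illustration.

```python
# Stand-in stages: each "block" is just an integer here.
def huffman_decode(block):
    return block + 1

def dequantize(block):
    return block * 2

def idct(block):
    return block - 3

STAGES = (huffman_decode, dequantize, idct)


def per_block(blocks):
    """Run every stage on one block before moving to the next block."""
    out, switches = [], 0
    for block in blocks:
        for stage in STAGES:
            block = stage(block)
            switches += 1          # a different stage's code every time
        out.append(block)
    return out, switches


def batched(blocks, chunk_size):
    """Run each stage over a whole chunk of blocks before the next stage."""
    out, switches = [], 0
    for start in range(0, len(blocks), chunk_size):
        chunk = blocks[start:start + chunk_size]
        for stage in STAGES:
            chunk = [stage(block) for block in chunk]
            switches += 1          # one switch per stage per chunk
        out.extend(chunk)
    return out, switches


blocks = list(range(8))
print(per_block(blocks))    # 24 stage switches for 8 blocks
print(batched(blocks, 4))   # same output, 6 stage switches
```

The outputs are identical; only the order of work changes, which is what lets the inner loops stay in the instruction cache.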
What's nice about a system with parallel processing units (whether multiprocessor or multithreaded) is that each stage in the pipeline can become another parallel-executing thread. Indeed, that was one common way to program the TMS320C80 family DSPs, which had 2 or 4 DSPs on one chip, alongside a fairly strong RISC CPU... all on one die! The DSPs would be organized as a pipeline, communicating through a "crossbar" to shared on-chip SRAM. The RISC CPU would coordinate tasks and issue commands to the DSPs. It was really quite cool.
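The "each stage becomes its own thread" arrangement can be sketched with ordinary software threads and queues. This is a minimal illustration, not C80 code: `run_pipeline`, `stage_worker` and the `DONE` sentinel are invented names, and in CPython the global interpreter lock means this shows the structure rather than a real speedup for CPU-bound stages.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the stream


def stage_worker(func, inbox, outbox):
    """Apply func to every item from inbox and pass the result downstream."""
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)   # tell the next stage we are finished
            return
        outbox.put(func(item))


def run_pipeline(funcs, items):
    """Run items through funcs, with one thread per stage."""
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    workers = [
        threading.Thread(target=stage_worker, args=(f, queues[i], queues[i + 1]))
        for i, f in enumerate(funcs)
    ]
    for worker in workers:
        worker.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(DONE)

    results = []
    while True:
        item = queues[-1].get()
        if item is DONE:
            break
        results.append(item)
    for worker in workers:
        worker.join()
    return results


print(run_pipeline([lambda x: x + 1, lambda x: x * 2], [1, 2, 3]))  # [4, 6, 8]
```

Because each stage is a single thread reading a FIFO queue, results come out in the order the inputs went in.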
The idea of switching threads in hardware to get around cache misses has already been done by Tera; they have a machine at the San Diego Supercomputer Center. Dunno if it's a single-chip design; I didn't have a screwdriver handy when I visited SDSC. Tera claims some pretty impressive performance numbers.
FYI - I don't work for Tera or SDSC.
hmmm, fascist
by
Anonymous Coward
·
· Score: 0
this sounds like an awfully fascist chip design.
Re:hmmm, fascist
by
Anonymous Coward
·
· Score: 0
this sounds like an awfully fascist chip design. That's utterly ridiculous. How could a chip possibly be fascist? This chip is socialist.
They haven't really released enough details (on their website) just yet, but it does look interesting. One of the more obviously different attitudes the specification takes is highly customisable implementations - you design a variation targeted at a particular application, whatever that might be - graphics accelerator, MP3 player/decoder, MPEG2/DVD decoder, or a more general purpose chip. Since it is mostly being targeted at embedded applications this is not surprising though.
Some other interesting aspects include:
'Support' for JIT/access-time compilers - not only does this help Java, but it is meant to make backwards compatibility with older versions quite simple. This seems a bit like what Transmeta (which was co-founded by an ex-Sun guy, btw) is doing.
Hardware support for ultra-fast thread switching - so fast that if one thread stalls waiting for DRAM access (which can take up to 100 clock cycles), you can switch to another thread rather than go idle. On many current OSs threads will be switched if the current one has to do some slow I/O say (ie read from disc) - so this is quite an improvement.
A more general approach to improving parallelism - you can have more than one CPU core in a single physical chip, which might or might not share their 1st level caches. (read this Microprocessor Report article for some background on this.) IBM are apparently going to do a version of the PowerPC G4 which has 2 CPU cores on one chip, and I kinda suspect Sun might be planning something similar for their UltraSparc-V.
I'm not sure how Sun plan to make money off the design. It seems pretty likely they might do something like their "community source" model - you can get the design for free, but if you want to use it commercially you pay a license. ARM is doing well just licensing their CPU designs. I'd imagine Sun using it to 'assist' their servers as add-on boards for doing heavy multi-media/3D graphics stuff - can you say "render farm"? Also, since Sun like selling their servers, they'd be happy for people to make lots of little, cheap devices that connect to nice big Sun servers.
Like the original poster said, IEEE Micro will probably have some interesting stuff, but it seems Sun aren't releasing all the details yet - looks like we'll have to wait until the Microprocessor Forum in October. I liked the article (written by the Sun engineers) about the UltraSparc-III - not only was it interesting (and I like Sun's approach), it helped me figure out the inherent problem with the IA-64 architecture...
"The holy grail in the industry is breaking a single-thread application into multiple threads."
I'm really surprised by this statement. It seems like a lot of the CPU-bound things that people run these days is easily multi-threadable. e.g. Games, raytracing, image processing, even some aspects of compilers. Obviously people can just as easily name apps that aren't parallelizable, but there's already plenty of code out there just dying to run on more than one processor.
What kinds of commonly-run CPU-bound apps aren't threadable, which are giving these guys so much grief?
-- As copyright owner of this comment, I authorize everyone to defeat any technological measure which limits access to it.
Re:Holy grail?!
by
Anonymous Coward
·
· Score: 0
See http://www.tera.com for the details about their multithreaded supercomputer and automagically threading compilers.
Re:Holy grail?!
by
Anonymous Coward
·
· Score: 0
I think the problem is not that the algorithms don't lend themselves to parallelization. I think the problem is that the programmers haven't written the software to break processing into small stages, and the compiler technology to translate monolithic end-to-end processing of each unit of data into a series of discrete steps isn't there yet. Sure, if you write the app from scratch with CMP in mind, most problems should parallelize easily. This is an area where Unix has an advantage because Unix programmers have traditionally favored piping many small filters together rather than writing one monolithic (e.g. microsoftian) application.
The article is referring to *hardware* multithreading, not software multithreading. The holy grail in the industry is implementing multithreading (not just multiprocessing) in hardware. What they're trying to do is remove the implementation of multithreading from the operating system level and transfer it to the hardware level, which would allow extremely fast thread switching, which would significantly increase speed and efficiency.
-- DES Khaddafi KGB genetic jihad Uzi Rule Psix Qaddafi cryptographic Peking Mossad Legion of Doom Albanian Serbian Saddam
Re:Holy grail?!
by
Anonymous Coward
·
· Score: 0
I think they mean doing it DYNAMICALLY. Ie, turning a program that doesn't use threads into a program that DOES at runtime.
I believe the problem's not so much finding the apps to parallelize, but rather the cost of parallelizing those apps, i.e. the inter-thread communication. I've got a program I wrote that would really benefit from parallelization if it wasn't for the communications costs. It's an N-body 3d gravity simulator (i.e. simulating a solar system). I believe any speed gained from spreading the load would be consumed by the communications (each planet has to know where every other planet is (N^2 problem)). BTW, the program's source (for DJGPP/Allegro) is on my web page. I've got Linux/svgalib code for it, but I haven't posted it yet (too lazy). If anyone's interested, email me.
--
Bill - aka taniwha -- Leave others their otherness. -- Aratak
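The N^2 dependency mentioned above shows up directly in a naive force loop. This is a minimal sketch, not the poster's DJGPP/Allegro program: the function names, the unit gravitational constant, and the plain-list data layout are assumptions for illustration.

```python
def accelerations(positions, masses, G=1.0):
    """Naive pairwise gravity: every body reads every other body's position."""
    n = len(positions)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = [positions[j][k] - positions[i][k] for k in range(3)]
            r2 = sum(c * c for c in d)
            inv_r3 = r2 ** -1.5
            for k in range(3):
                acc[i][k] += G * masses[j] * d[k] * inv_r3
    return acc


def pair_reads(n):
    """Number of (body, other body) position reads per time step."""
    return n * (n - 1)


# Two unit masses one unit apart pull on each other equally and oppositely.
print(accelerations([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [1.0, 1.0]))
print(pair_reads(9))  # 72 position reads per step for nine bodies
```

If the bodies are split across workers, every worker still needs all N current positions at every step, which is the communication cost the post is worried about.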
BeOS is already pervasively multithreaded, unlike almost any other OS out there. Its nature makes debugging your apps a pain in the ass, but allows a 95% increase in processing power if you add a second CPU. Or so I've been told.
This chip would seem to take the pressure off the OS, and henceforth the programmers. *whew*
BeOS is already pervasively multithreaded, unlike almost any other OS out there. Its nature makes debugging your apps a pain in the ass, but allows a 95% increase in processing power if you add a second CPU. Or so I've been told.
Multithreaded code isn't so hard to debug as long as you design your program very carefully in advance with multithreading in mind. It's when you take a program or API that was designed to be single-threaded and try to hack in the multithreading after the fact that things can get awful.
--
I don't care if it's 90,000 hectares. That lake was not my doing.
Thread level data speculation
by
roca
·
· Score: 1
Some of the Sun material talks about Java applications with tens or hundreds of threads each. However, in the applications I write and the applications I've seen, most of those threads exist to provide nonblocking behaviour for various purposes, and it's hardly ever the case that 2 (or more) threads are runnable at once. The problem is that for many tasks it's just really hard to parallelize them into multiple threads, and to do it right. (One problem is that the thread model of concurrency just sucks, but that's another rant.)
So here's my shameless plug for CMU research: what we need is hardware support to make it easier to write threaded programs. One approach is thread-level data speculation. In this system, one thread executes normally while other threads execute speculatively, basically assuming that the parallel execution will be safe and correct. The processor is responsible for detecting conflicts between threads that mean the optimistic parallel execution is not correct. When there is a conflict, the speculating thread that caused the conflict is killed and its speculative state is thrown away. It's not as hard to do this as you might think; it seems possible to do it by adding some tags to the data caches on each processor.
See here for more: http://www.cs.cmu.edu/~tcm/STAMPede.html
The Stanford Hydra project does something like this too, BTW.
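The squash-and-retry idea can be modelled in software, though this is only a toy: the real proposals track reads and writes in hardware cache tags, and none of the names below (`execute`, `run_speculative`, the read/write callbacks) come from the STAMPede or Hydra projects. Tasks run optimistically against a snapshot, then commit in program order; a task that read something an earlier task wrote is thrown away and re-run.

```python
def execute(task, source):
    """Run task against source, recording what it read and buffering its writes."""
    reads, writes = set(), {}

    def read(key):
        if key in writes:          # a task sees its own buffered writes
            return writes[key]
        reads.add(key)
        return source[key]

    def write(key, value):
        writes[key] = value

    task(read, write)
    return reads, writes


def run_speculative(memory, tasks):
    """Speculatively run all tasks, then commit in order, squashing conflicts.

    Returns the number of tasks that had to be squashed and re-executed.
    """
    snapshot = dict(memory)
    logs = [execute(task, snapshot) for task in tasks]  # optimistic phase

    dirty, squashed = set(), 0
    for task, (reads, writes) in zip(tasks, logs):
        if reads & dirty:
            # The task read a value that an earlier task has since changed.
            squashed += 1
            _, writes = execute(task, memory)
        memory.update(writes)
        dirty.update(writes)
    return squashed


memory = {"a": 1, "b": 2, "c": 0}
tasks = [
    lambda read, write: write("a", read("a") + 10),
    lambda read, write: write("b", read("b") * 2),           # independent of the first
    lambda read, write: write("c", read("a") + read("b")),   # depends on both
]
print(run_speculative(memory, tasks), memory)
```

The first two tasks commit without trouble, the third is squashed once, and the final memory matches what a purely sequential run would produce.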
parallezation
by
Anonymous Coward
·
· Score: 0
The problem is that for many tasks it's just really hard to parallelize them into multiple threads, and to do it right.
I disagree. The problem is that software isn't written to process data in multistage pipelines. It's sort of like the assembly lines used in manufacturing. We currently have the same processor do all of the processing for a single packet of data before starting on the next one. This is analogous to having one person build a car from start to finish. It's much more efficient to have each person do one small specialized task and pass it on to the next. This is because transitioning from one procedure to another (setup) is very expensive. If each processor has its own code cache, it should be able to execute its own little part of the job over and over very efficiently. Of course, this probably requires a lot more total cache than the monolithic job approach... How much of a processor's time is spent on call/return and creating/destroying stack frames? How much more efficient would it be if it just sat in a tight loop running entirely in cache, but frequently had the expense of refreshing the entire data cache for a new "packet" of information?
Interesting... but will it run Linux?
by
Bill+Henning
·
· Score: 1
I've been seeing some information about thsi floating around...
Interesting times we live in: I read this sentence, and immediately my mental english parser interpreted the typo "thsi" as an acronym and went to work on translating it.
("THreaded Semiconductor Integrated circuit" was my interpretation before I realized it was an error.)
--
_______ 2B1ASK1
GPL'ed Freedom CPU here. Porters welcome.
by
cynicthe
·
· Score: 1
f-cpu.tux.org.
-- The ship sank. Get over it. (This sig was cut out from another's shirt and painstakingly hand-posted)
If there's any OS out there that is more comprehensively thread oriented (which leads to more application threading) it must be proprietary.
Out there currently, perhaps that's true. But looking back in computing history there's the T.H.E. multiprocessing system (by Dijkstra and Riddle), plus an arbitrary number of clones of it, typically living in embedded systems.
I used one done by Mark Weiser, on a Nova, about 1975, and cloned my own onto an 8080 a few years later. Mine was a preemptive multitasking kernel (excluding drivers) a little over 500 bytes long. Add a console driver, a debugger, a network stack (not IP), real-time-clock processing, scheduled event interpreter, instrumentation drivers, a relay logic ladder-diagram interpreter, drivers to receive and send relay/contact signals from/to optoisolators, and a network daemon that downloaded schedules, read meters, examined relay states and stuck virtual screwdrivers in to force them, and it still came in under 2K bytes. This left the other 2K of ROM available for a description of a hysterically-large emulated-relay network.
That sucker flew, too. With the one tweak I added it became exactly an implementation of "actors", perhaps a bit before they were formalized. If you're not familiar with them: Imagine a machine where every program is in C++, but where every instance of every class is a separate thread of execution, every complicated class has been split into a set of simpler classes with one thread-related member function each, every call to a thread-related member function is an intertask message - at about the cost of a subroutine call (with free queueing of multiple messages), and every thread-related member function (with all the non-thread-related subroutines it calls) can in principle run simultaneously (because they explicitly mutex when they must share a resource, and the free queueing makes such occasions extremely rare). Now pour all these tiny tasks into the machine, with a half-K kernel to orchestrate them.
On a single processor machine the fact that the individual objects could run in parallel was an unused side-effect of a programming style that simplified writing programs to take maximum advantage of the tiny kernel. But with a more modern hardware platform, with a slightly more complicated kernel and perhaps a little hardware assist, the same style automatically produces a great pile of tiny, simple objects that can all be run in parallel on as many CPUs as you've got.
-- Bantam Dominique roosters crow a four-note song. Once you've heard it as "Happy BIRTHday" you can't NOT hear it that way
Just write it with an actor-based OOP style. Then it's automatically split into tiny, simple, parallelizable chunks.
Instead of having to explicitly declare what's parallelizable, you explicitly declare what's interdependent. Typically that's a much smaller set - especially after the message-send/receive dependencies (which are automatically handled for you) are excluded.
-- Bantam Dominique roosters crow a four-note song. Once you've heard it as "Happy BIRTHday" you can't NOT hear it that way
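For readers who have not met the actor style, here is a minimal sketch of it using ordinary threads and queues. It is not the 8080 kernel described above and makes no claim about how that system worked internally; the `Actor` class, its methods, and the `_STOP` sentinel are invented for illustration.

```python
import queue
import threading

_STOP = object()  # sentinel telling an actor to shut down


class Actor:
    """An object with its own thread and a queued mailbox of messages."""

    def __init__(self, handler):
        self._handler = handler
        self._inbox = queue.Queue()
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def send(self, message):
        # A "method call" is just a message dropped in the mailbox.
        self._inbox.put(message)

    def stop(self):
        # Messages already queued are handled before the actor exits.
        self._inbox.put(_STOP)
        self._thread.join()

    def _run(self):
        while True:
            message = self._inbox.get()
            if message is _STOP:
                return
            self._handler(message)


results = []
collector = Actor(results.append)                    # only this actor touches results
doubler = Actor(lambda n: collector.send(n * 2))     # forwards work downstream

for n in (1, 2, 3):
    doubler.send(n)
doubler.stop()      # drains the doubler, so all its sends have happened
collector.stop()
print(results)      # [2, 4, 6]
```

Each actor's state is touched only by its own thread, so the only dependencies that need declaring are the messages themselves.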
Without a high-end browser, Linux is screwed.
After the SPARC debacle, where Sun first encouraged others to use the SPARC then pulled the rug out from those who did, I'll wait until somebody else does such a processor before getting excited.
Bantam Dominique roosters crow a four-note song. Once you've heard it as "Happy BIRTHday" you can't NOT hear it that way
The project home page is
http://www.cs.washington.edu/research/smt/ ellbee
The most useful (and boring) way to look at this chip is to think of it as an ordinary 2-processor symmetric multiprocessor. Run a normal SMP OS on it, no problem. Same benefits, same disadvantages.
.. *all* resources are duplicated.
No need to complicate this debate with statements like `this will be *useless* unless we solve the dynamic autothreading problem of single apps!'. SMP is useful for running unrelated threads/apps concurrently today. Tomorrow's single chip form will be just as useful.
The only complication I see this chip introducing is that, because the two instruction streams share the same resource units (FPU units, barrel shifters, ALUs), heavy use of resources by one of the threads will slow the other thread down. Traditional SMPs don't have this problem.
That's fine by me. There are some pretty neat scheduling advantages to that restriction. For example, if the chip is designed right, one could schedule thread #1 to get 75% of the instruction microcycles, while #2 gets 25%, and whatever microcycles each can't use the other gets. Alternatively, one could schedule thread #1 to get 100% of the microcycles while #2 gets whatever of those microcycles #1 wasn't able to use.
Ahhhhhh heaven. True realtime scheduling at the instruction level, no overhead. Who could ask for more? No wonder why this chip is directed at the embedded market. They could really use this feature.
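The 75%/25% split with leftover cycles handed to the other thread can be sketched as a small allocator. This is only a guess at the policy described above, written in software; the function name `allocate_slots`, the percentage encoding, and the 100-slot period are assumptions, and nothing here reflects a documented MAJC feature.

```python
def allocate_slots(shares_pct, wants, total_slots):
    """Split issue slots by guaranteed share, donating unused slots.

    shares_pct:  guaranteed percentage of slots per thread (sums to 100)
    wants:       how many slots each thread could actually use this period
    total_slots: number of issue slots in the scheduling period
    """
    guaranteed = [total_slots * s // 100 for s in shares_pct]
    # Each thread first takes what it can use out of its own guarantee.
    used = [min(g, w) for g, w in zip(guaranteed, wants)]
    spare = total_slots - sum(used)
    # Slots nobody claimed go to threads that still want more,
    # lowest thread index first.
    for i in range(len(used)):
        extra = min(spare, wants[i] - used[i])
        used[i] += extra
        spare -= extra
    return used


# Thread #1 is guaranteed 75% but only needs 50 slots; #2 picks up the rest.
print(allocate_slots([75, 25], [50, 100], 100))
# Thread #1 has absolute priority; #2 only gets what #1 leaves idle.
print(allocate_slots([100, 0], [60, 100], 100))
```

The second call is the "100% to thread #1" variant: thread #2 has no guarantee at all and runs purely on cycles thread #1 could not use.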
A bit of history: I believe this idea was floated at Xerox PARC, at the time those folks were inventing mice and desktops and shared printers.
Hahah...hahah....hahah...hahah *holding sides* You don't fool me - this is a genuine Beowulf cry. Don't be ashamed...go on...you can say it: Beowulf...would...rock...on...a...cluster...of...these.....!! Well done, you are learning fast.
No. k as in 2^10. 2k = 2 * 2^10 = 2048. Not 2038, the year a 32-bit time_t will run out.
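For anyone who wants to check both numbers, here is a quick sketch. It assumes `time_t` is a signed 32-bit count of seconds since 1970-01-01 UTC, which is the usual case behind the 2038 remark.

```python
from datetime import datetime, timezone

# 2k with k = 2^10
two_k = 2 * 2**10

# Largest value a signed 32-bit time_t can hold, as a UTC date.
last_second = 2**31 - 1
rollover = datetime.fromtimestamp(last_second, tz=timezone.utc)

print(two_k)                 # 2048
print(rollover.isoformat())  # 2038-01-19T03:14:07+00:00
```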
The software backing will come from Java apps written for other platforms. This explains Sun's enthusiasm for hyping Java - if they can convince people to write Java apps for existing platforms, they will have a large codebase for MAJC when they launch it.
Michael Rogers
"The holy grail in the industry is breaking a single-thread application into multiple threads."
I'm really surprised by this statement. It seems like a lot of the CPU-bound things that people run these days is easily multi-threadable. e.g. Games, raytracing, image processing, even some aspects of compilers. Obviously people can just as easily name apps that aren't parallelizable, but there's already plenty of code out there just dying to run on more than one processor.
What kinds of commonly-run CPU-bound apps aren't threadable, which are giving these guys so much grief?
I think people who smoke marijuana should be imprisoned for life. I keep seeing marijuana smokers getting short prison terms. This is an outrage.
Haven't you kept up with the news? The national speed limit has been abolished. Besides, not everyone drives on public roads. :)
Speed limits apply everywhere, not just on public roads. Racetracks are specifically excepted from the speed limit laws. (In most US states).
hey josh, stop reading slashdot
How can multithreading be implemented in hardware? Anybody really know about this stuff or where I can find out? Does the instruction set have instructions for creating threads and such?
This is not a new opinion, but I thought I'd add some more contemplation.
A multi-threading chip would be useless in a non-multi-threading environment (e.g. DOS). This sort of chip is responding to the prevalence of multi-threading, multitasking operating systems and applications. Since BeOS is the furthest anyone's gone towards making an OS multithread and multitask, to my mind this chip would run BeOS like teflon-coated lightning.
Think about it - the best way to get maximum performance out of this architecture is to have lots of 'small' threads, to have as many threads available for immediate execution should one stall. If there's any OS out there that is more comprehensively thread oriented (which leads to more application threading) it must be proprietary.
But enough of that pipe-dream - let's get back to reality. Be won't dedicate engineering efforts to a new chip whose market is unproven without backup. PC owners won't abandon their investment just for some pretty new architecture which is basically incompatible (as far as we know) with their existing hardware. And the whole thing will stagnate because no-one can start a market big enough to get the software backing (the Hardware-Software Paradox).
It's on the wish-list somewhere, but I'm not selling the Celeron just yet.
--Reason is a tool. Try to remember where you left it.--
Indeed, it was Ebay management's fault. But it is hard to fault them. They have a huge farm of E10000s, but no hot backup in case the sh*t hits the fan. Not only that, in order to keep up with the exponential increase in business they would have to shell out over a million dollars apiece for more hardware. Is business going to keep growing at 300%/yr? Should they buy 16 more Starfires and the storage to go with them? How much would this cost? 50 million? How much profit have they made? There are a lot of tough questions for management to answer. They (the CEO and executives) have made at least one mistake so far. Hopefully they (and others) will learn.
The moral: With an internet company, you have to spend the money and "bet the farm".
_damnit_
It's my job to freeze you. -- Logan's Run
Well, considering that BeOS people don't have enough time to fully support hardware on *existing* platforms, I wouldn't get too excited about the near-impossibility of them supporting a high end chip not yet available.
*&^%$#@!
:wq
*&^%$#@! anyway, you knew what he meant, didn't you?
*&^%$#@!
1 SIMP0 TH1GHZ!!!!!!
P33PL3 USU4LLY KN0W WHUT 1 MEAN T00!!!!!!!!%!%!%!!!!!
BUT S0MET1M35 THAY SAY, 'B1FF SUX!!!` WHYYYYY???????
:WQ
------ ------ ------
ALL HA1L B1FF, TH3 M05T 31337 D00D!!!!!1
------ ------ ------
Duh.... Which police state do you live in? Speed
limits do not apply to private property. Period.
I can drive as fast as I wish on _My_ property.
PeterT
The truth is, Sun isn't the only company planning to include more than one CPU unit on one die.
One big question: are they planning on having the option to run the two CPUs in lock-step mode--where the chip can check the CPUs for errors (i.e. a big XOR on the outputs)? Their competitors would likely want to know. And the answer is probably "yes", considering how things are moving to focus more on reliability than speed, since companies like E-Bay don't want downtime.
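The "big XOR on the outputs" idea can be sketched in software. This is only an illustration of the comparison step, under the assumption that both cores are fed identical inputs and produce integer outputs; the names `lockstep`, `good_core` and `flaky_core` are made up here, and nothing is known about whether MAJC offers such a mode.

```python
def lockstep(core_a, core_b, inputs):
    """Run two cores on the same inputs and compare their outputs by XOR."""
    results = []
    for x in inputs:
        out_a = core_a(x)
        out_b = core_b(x)
        if out_a ^ out_b:  # any differing bit means one core misbehaved
            raise RuntimeError(
                f"lock-step mismatch on input {x}: {out_a:#x} vs {out_b:#x}"
            )
        results.append(out_a)
    return results


def good_core(x):
    """Stand-in for a correctly working core (arbitrary 32-bit arithmetic)."""
    return (x * 3 + 1) & 0xFFFFFFFF


def flaky_core(x):
    """Same computation, with a hypothetical single-bit fault on input 5."""
    out = good_core(x)
    return out ^ 0x10 if x == 5 else out


print(lockstep(good_core, good_core, range(4)))
try:
    lockstep(good_core, flaky_core, range(8))
except RuntimeError as err:
    print(err)
```

Note that the comparison only says the two cores disagree; it cannot tell which one is wrong, which is why such schemes usually halt or retry rather than pick a winner.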
hehe. ;)
the best laugh i've had in about a week
The answer is unknown at the moment. There is no gcc and no binutils for MAJC. When we get tools, we may start thinking about the kernel and the userland. --P
I have no idea if Sun plan to do redundancy checking with multiple pipelines with the UltraSparc-V. (some IBM, and other, chips do this...) They might do it as an option, but I would currently guess they're doing it mostly for performance.
EBay's reliability problems are mostly related to poor management decisions (it seems) rather than EBay's (or Sun's or Oracle's) techs. Doing the above kind of checking wouldn't have helped EBay either. It doesn't matter what OS you use, if you have a screwed up setup, you'll get problems. And you'll be surprised/horrified at just how long it can take screwed setups to be fixed if the site's already gone live. (I know from experience. and no, it wasn't my screwed up setup.)
So why haven't we seen these chips before, if the news is old?
For the simple reason that, though a multithreaded machine is in the aggregate faster, though it makes more efficient use of chip resources, though it promotes fast context switching at the microinstruction level, any single thread will run *slower* on such a chip than on a chip optimized to run a single thread. This kills benchmark results for all the typical highly publicized benchmarks.
Until now, no one wanted to run a chip which had a lower benchmark rating. Nowadays, there is a greater appreciation for multiprocessing, and, due to the high performance of today's chips, the single-thread benchmark race is finally loosening its grip on the minds of purchasing agents and of the computing public.
Joe
Three words come to mind: HIT, NAIL, and HEAD. :-)
To give an example from a paper I'm a coauthor on (being presented at ICSPAT'99), consider a JPEG decoder. Here's a quick overview of the bulk of a JPEG decoder:
On a deeply pipelined / highly parallel processor, this is horribly inefficient, because each task is very small when applied to only one block, whereas switching between tasks is quite expensive. But that's exactly what a lot of JPEG decoders do (including the Independent JPEG Group's decoder). The decoder is a lot easier to write that way, but is not nearly as efficient as it could be.
Instead, you want to batch things up as much as possible:
Now, you can make massive gains in efficiency due to better instruction cache locality, better parallelism across loop iterations due to the fact you're actually looping quite a bit now, and so on. (The wins are rather dramatic on a DSP which relies on programmed DMAs to move data on and off chip.)
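The difference between the two loop orders can be sketched like this. The three stage functions are arbitrary stand-ins, not real JPEG steps, and the `switches` counter is just a crude proxy for the cost of moving from one task to another; both are assumptions for illustration.

```python
def stage_a(block):
    return [v + 1 for v in block]


def stage_b(block):
    return [v * 2 for v in block]


def stage_c(block):
    return [v - 3 for v in block]


STAGES = [stage_a, stage_b, stage_c]  # hypothetical decoder stages


def per_block(blocks):
    """Run every stage on one block before touching the next block."""
    out, switches = [], 0
    for block in blocks:
        for stage in STAGES:
            block = stage(block)
            switches += 1  # a new task starts for every block and stage
        out.append(block)
    return out, switches


def batched(blocks):
    """Run one stage over all blocks, then move on to the next stage."""
    switches = 0
    for stage in STAGES:
        blocks = [stage(b) for b in blocks]
        switches += 1  # a new task starts only once per stage
    return blocks, switches


blocks = [[i, i + 1] for i in range(100)]
same_out = per_block(blocks)[0] == batched(blocks)[0]
print(same_out, per_block(blocks)[1], batched(blocks)[1])
```

Both orders compute the same result; the batched form just changes tasks once per stage instead of once per block per stage, at the price of holding the intermediate results for all blocks.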
What's nice about a system with parallel processing units (whether multiprocessor or multithreaded) is that each stage in the pipeline can become another parallel-executing thread. Indeed, that was one common way to program the TMS320C80 family DSPs, which had 2 or 4 DSPs on one chip, alongside a fairly strong RISC CPU ... all on one die! The DSPs would be organized as a pipeline, communicating through a "crossbar" to shared on-chip SRAM. The RISC CPU would coordinate tasks and issue commands to the DSPs. It was really quite cool.
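One stage per thread, with queues standing in for the shared on-chip memory, might look like the sketch below. It is a software analogy only: the `worker` and `run_pipeline` names and the `DONE` sentinel are assumptions, it does not model the C80's crossbar, and in CPython the threads will not truly execute CPU work in parallel, so this shows the structure rather than the speedup.

```python
import queue
import threading

DONE = object()  # sentinel telling each stage to shut down


def worker(func, inbox, outbox):
    """One pipeline stage: apply func to each item and pass it along."""
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)  # forward the shutdown signal downstream
            return
        outbox.put(func(item))


def run_pipeline(funcs, items):
    """Chain one thread per stage, connected by FIFO queues."""
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [
        threading.Thread(target=worker, args=(f, queues[i], queues[i + 1]))
        for i, f in enumerate(funcs)
    ]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(DONE)
    results = []
    while True:
        item = queues[-1].get()
        if item is DONE:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results


print(run_pipeline([lambda x: x + 1, lambda x: x * 2], [1, 2, 3]))
```

Because each stage is a single thread reading from a FIFO queue, results come out in the same order the items went in.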
--Joe--
Program Intellivision!
I want an Alpha 667. That's the fascist linux chip I know of.
Win98 sux without these 1337 toolz !!
I don't understand why can't they make lawnmowers that also can be used as a BBQ grill. I mean, you use both in the backyard.
The idea of switching threads in hardware to get around cache misses has already been done by Tera, they have a machine at the San Diego Supercomputer Center. Dunno if it's a single chip design, I didn't have a screw driver handy when I visited SDSC. Tera claims some pretty impressive performance numbers.
FYI - I don't work for Tera or SDSC.
this sounds like an awfully fascist chip design.
Hemos, you're doing a good job in cranking out the stories... But trust me, there's time for a spellcheck!
Wah!
I wonder how BeOS would run on this thing...
I can imagine Beowulf would really rock with this... hee hee hee hee hee... oh the fun we could have!
Insert mind here.
aha, just wanted to mention it first since some other doofus will.
MAJC home page. See the docs home page - introduction, and a "community" page.
They haven't really released enough details (on their website) just yet, but it does look interesting. One of the more obviously different attitudes the specification takes is highly customisable implementations - you design a variation targeted at a particular application, whatever that might be - graphics accelerator, MP3 player/decoder, MPEG2/DVD decoder, or a more general purpose chip. Since it is mostly being targeted at embedded applications this is not surprising though.
Some other interesting aspects include:
'Support' for JIT/access-time compilers - not only does this help Java, but it is meant to make backwards compatibility with older versions quite simple. This seems a bit like what Transmeta (which was co-founded by an ex-Sun guy, btw) is doing.
Hardware support for ultra-fast thread switching - so fast that if one thread stalls waiting for DRAM access (which can take up to 100 clock cycles), you can switch to another thread rather than go idle. On many current OSs threads will be switched if the current one has to do some slow I/O say (ie read from disc) - so this is quite an improvement.
A more general approach to improving parallelism - you can have more than one CPU core in a single physical chip, which might or might not share their 1st level caches. (read this Microprocessor Report article for some background on this.) IBM are apparently going to do a version of the PowerPC G4 which has 2 CPUs on one die, and I kinda suspect Sun might be planning something similar for their UltraSparc-V.
I'm not sure how Sun plan to make money off the design. It seems pretty likely they might do something like their "community source" model - you can get the design for free, but if you want to use it commercially you pay a license. ARM is doing well just licensing their CPU designs. I'd imagine Sun using it to 'assist' their servers as add-on boards for doing heavy multi-media/3D graphics stuff - can you say "render farm"? Also, since Sun like selling their servers, they'd be happy for people to make lots of little, cheap devices that connect to nice big Sun servers.
Like the original poster said, IEEE Micro will probably have some interesting stuff, but it seems Sun aren't releasing all the details yet - looks like we'll have to wait until the Microprocessor Forum in October. I liked the article (written by the Sun engineers) about the UltraSparc-III - not only was it interesting (and I like Sun's approach), it helped me figure out the inherent problem with the IA-64 architecture...
As copyright owner of this comment, I authorize everyone to defeat any technological measure which limits access to it.
This chip would seem to take the pressure off the OS, and henceforth the programmers. *whew*
Wah!
Some of the Sun material talks about Java applications with tens or hundreds of threads each. However, in the applications I write and the applications I've seen, most of those threads exist to provide nonblocking behaviour for various purposes, and it's hardly ever the case that 2 (or more) threads are runnable at once. The problem is that for many tasks it's just really hard to parallelize them into multiple threads, and to do it right. (One problem is that the thread model of concurrency just sucks, but that's another rant.)
So here's my shameless plug for CMU research: what we need is hardware support to make it easier to write threaded programs. One approach is thread-level data speculation. In this system, one thread executes normally while other threads execute speculatively, basically assuming that the parallel execution will be safe and correct. The processor is responsible for detecting conflicts between threads that mean the optimistic parallel execution is not correct. When there is a conflict, the speculating thread that caused the conflict is killed and its speculative state is thrown away. It's not as hard to do this as you might think; it seems possible to do it by adding some tags to the data caches on each processor.
See here for more:
http://www.cs.cmu.edu/~tcm/STAMPede.html
The Stanford Hydra project does something like this too, BTW.
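To make the squash-and-retry idea concrete, here is a toy software model. It is not the STAMPede or Hydra design: the real proposals track conflicts in hardware (e.g. cache tags), while this sketch just records read and write sets in Python. The names `execute` and `run_speculative`, and the `read`/`write` callback interface, are assumptions for illustration.

```python
def execute(task, view):
    """Run one task against a memory view, logging its reads and writes."""
    reads, writes = set(), {}

    def read(addr):
        if addr in writes:       # a task sees its own buffered writes
            return writes[addr]
        reads.add(addr)
        return view[addr]

    def write(addr, value):
        writes[addr] = value     # buffered, not yet visible to others

    task(read, write)
    return reads, writes


def run_speculative(memory, tasks):
    """Speculate all tasks from one snapshot, then commit in program order.

    A task whose reads overlap an earlier task's committed writes is
    squashed: its speculative state is dropped and it is re-executed.
    Returns the number of squashes.
    """
    snapshot = dict(memory)
    logs = [execute(task, snapshot) for task in tasks]  # "parallel" phase
    squashes = 0
    dirty = set()  # addresses written by tasks committed so far
    for task, (reads, writes) in zip(tasks, logs):
        if reads & dirty:
            squashes += 1
            reads, writes = execute(task, memory)  # redo with current data
        memory.update(writes)
        dirty.update(writes)
    return squashes


def bump_x(read, write):
    write("x", read("x") + 1)


def y_from_x(read, write):
    write("y", read("x") * 10)   # depends on bump_x's result


def set_z(read, write):
    write("z", 5)                # independent of the others


mem = {"x": 1, "y": 0, "z": 0}
squashed = run_speculative(mem, [bump_x, y_from_x, set_z])
print(mem, squashed)
```

The final memory matches what running the three tasks one after another would give; only the task that actually read stale data pays for a re-execution.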
I disagree. The problem is that software isn't written to process data in multistage pipelines. It's sort of like the assembly lines used in manufacturing. We currently have the same processor do all of the processing for a single packet of data before starting on the next one. This is analogous to having one person build a car from start to finish. It's much more efficient to have each person do one small specialized task and pass it on to the next. This is because transitioning from one procedure to another (setup) is very expensive. If each processor has its own code cache, it should be able to execute its own little part of the job over and over very efficiently. Of course, this probably requires a lot more total cache than the monolithic job approach... How much of a processor's time is spent on call/return and creating/destroying stack frames? How much more efficient would it be if it just sat in a tight loop running entirely in cache, but frequently had the expense of refreshing the entire data cache for a new "packet" of information?
Will Sun port Linux to it?
Will it be cheap?
Enquiring minds want to know... (and benchmark)
--------- Webmaster, http://www.cpureview.com and
MU News has been tracking this story a bit, and has some links if you wanna learn more about MAJC.
I've been seeing some information about thsi floating around...
Interesting times we live in: I read this sentence, and immediately my mental english parser interpreted the typo "thsi" as an acronym and went to work on translating it.
("THreaded Semiconductor Integrated circuit" was my interpretation before I realized it was an error.)
_______
2B1ASK1
f-cpu.tux.org.
The ship sank. Get over it. (This sig was cut out from another's shirt and painstakingly hand-posted)
Out there currently, perhaps that's true. But looking back in computing history there's the T.H.E. multiprocessing system (by Dijkstra and Riddle), plus an arbitrary number of clones of it, typically living in embedded systems.
I used one done by Mark Weiser, on a Nova, about 1975, and cloned my own onto an 8080 a few years later. Mine was a preemptive multitasking kernel (excluding drivers) a little over 500 bytes long. Add a console driver, a debugger, a network stack (not IP), real-time-clock processing, scheduled event interpreter, instrumentation drivers, a relay logic ladder-diagram interpreter, drivers to receive and send relay/contact signals from/to optoisolators, and a network daemon that downloaded schedules, read meters, examined relay states and stuck virtual screwdrivers in to force them, and it still came in under 2K bytes. This left the other 2K of ROM available for a description of a hysterically-large emulated-relay network.
That sucker flew, too. With the one tweak I added it became exactly an implementation of "actors", perhaps a bit before they were formalized. If you're not familiar with them: Imagine a machine where every program is in C++, but where every instance of every class is a separate thread of execution, every complicated class has been split into a set of simpler classes with one thread-related member function each, every call to a thread-related member function is an intertask message - at about the cost of a subroutine call (with free queueing of multiple messages), and every thread-related member function (with all the non-thread-related subroutines it calls) can in principle run simultaneously (because they explicitly mutex when they must share a resource, and the free queueing makes such occasions extremely rare). Now pour all these tiny tasks into the machine, with a half-K kernel to orchestrate them.
On a single processor machine the fact that the individual objects could run in parallel was an unused side-effect of a programming style that simplified writing programs to take maximum advantage of the tiny kernel. But with a more modern hardware platform, with a slightly more complicated kernel and perhaps a little hardware assist, the same style automatically produces a great pile of tiny, simple objects that can all be run in parallel on as many CPUs as you've got.
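The "every object is its own thread, every call is a message" style can be sketched in a few lines. This is a loose modern analogy, not the 8080 kernel described above: the `Actor` class, its `send`/`stop` methods, and the use of `None` as a shutdown message are assumptions for illustration.

```python
import queue
import threading


class Actor:
    """An object with its own thread and a mailbox of queued messages."""

    def __init__(self, handler):
        self.mailbox = queue.Queue()
        self._handler = handler
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def send(self, message):
        """Queue a message; the caller does not wait for it to be handled."""
        self.mailbox.put(message)

    def stop(self):
        """Ask the actor to finish its queued messages and exit."""
        self.mailbox.put(None)
        self._thread.join()

    def _run(self):
        while True:
            message = self.mailbox.get()
            if message is None:
                return
            self._handler(message)


# Two tiny actors: one doubles numbers and forwards them, one collects them.
collected = []
collector = Actor(collected.append)
doubler = Actor(lambda n: collector.send(n * 2))

for n in (1, 2, 3):
    doubler.send(n)
doubler.stop()    # drains the doubler's mailbox first
collector.stop()  # then drains the collector's
print(collected)
```

Each actor touches only its own state and talks to others through mailboxes, so no explicit locking is needed in this example; given more processors, the same code could in principle spread the actors across them.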
Bantam Dominique roosters crow a four-note song. Once you've heard it as "Happy BIRTHday" you can't NOT hear it that way
Because one uses gasoline, and the other one uses propane. Duh! ;-)
Of all the comments I've ever posted, this is definately one of them
Instead of having to explicitly declare what's parallelizable, you explicitly declare what's interdependent. Typically that's a much smaller set - especially after the message-send/receive dependencies (which are automatically handled for you) are excluded.
Bantam Dominique roosters crow a four-note song. Once you've heard it as "Happy BIRTHday" you can't NOT hear it that way