Slashdot Mirror


Sun's New MAJC Architecture

GFD writes "EETimes has a nice overview of Sun's new MAJC architecture. Combines multiple processors on one chip with VLIW and on-chip multithreading." I've been seeing some information about this floating around, but EETimes has done a nice job summarizing the chip itself.

3 of 61 comments

  1. pipelining processes: BINGO!!! by Mr Z · · Score: 3

    Three words come to mind: HIT, NAIL, and HEAD. :-)

    To give an example from a paper I'm a coauthor on (being presented at ICSPAT'99), consider a JPEG decoder. Here's a quick overview of the bulk of a JPEG decoder:

    • For each 8x8 block
      • Decode the Huffman code for the block
      • Perform inverse-quantization on the block
      • Perform the IDCT on the block
      • Write the block to the correct plane in the image

    On a deeply pipelined / highly parallel processor, this is horribly inefficient, because each task is very small when applied to only one block, whereas switching between tasks is quite expensive. But that's exactly what a lot of JPEG decoders do (including the Independent JPEG Group's decoder). The decoder is a lot easier to write that way, but is not nearly as efficient as it could be.

    Instead, you want to batch things up as much as possible:

    • For all chunks of the encoded JPEG, do
      • Read a chunk of encoded JPEG
      • Decode the Huffman code for as many 8x8 blocks as possible
      • Inverse-quantize all of these blocks.
      • Perform IDCT on all of these blocks.
      • Write all of these blocks out to the image.

    Now, you can make massive gains in efficiency due to better instruction cache locality, better parallelism across loop iterations due to the fact you're actually looping quite a bit now, and so on. (The wins are rather dramatic on a DSP which relies on programmed DMAs to move data on and off chip.)
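
    A minimal sketch of the two loop orders described above, in Python. The stage functions (`huffman_decode`, `dequantize`, `idct`) are hypothetical placeholders that do no real JPEG math; the only point is the shape of the loops. Both orders produce the same blocks, but the batched version keeps each stage in one tight loop over many blocks.

```python
from typing import List

Block = List[int]  # a flattened 8x8 block, 64 coefficients


def huffman_decode(chunk: bytes) -> Block:
    """Placeholder: pretend each chunk is one already-decoded block."""
    return list(chunk)


def dequantize(block: Block, q: int = 2) -> Block:
    """Placeholder inverse quantization: scale every coefficient by q."""
    return [c * q for c in block]


def idct(block: Block) -> Block:
    """Placeholder for the IDCT: an arbitrary per-coefficient transform."""
    return [c + 1 for c in block]


def decode_per_block(chunks: List[bytes]) -> List[Block]:
    """Block-at-a-time: run every stage on one block, then move to the next."""
    image = []
    for chunk in chunks:
        image.append(idct(dequantize(huffman_decode(chunk))))
    return image


def decode_batched(chunks: List[bytes]) -> List[Block]:
    """Stage-at-a-time: each stage loops over the whole batch before the next."""
    blocks = [huffman_decode(c) for c in chunks]
    blocks = [dequantize(b) for b in blocks]
    blocks = [idct(b) for b in blocks]
    return blocks


if __name__ == "__main__":
    chunks = [bytes(range(64)) for _ in range(4)]
    assert decode_per_block(chunks) == decode_batched(chunks)
    print("both orders produce the same blocks")
```

    In an interpreted language the speed difference will not show up the way it would on a DSP or VLIW machine; the sketch only illustrates the restructuring.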

    What's nice about a system with parallel processing units (whether multiprocessor or multithreaded) is that each stage in the pipeline can become another parallel-executing thread. Indeed, that was one common way to program the TMS320C80 family DSPs, which had 2 or 4 DSPs on one chip, alongside a fairly strong RISC CPU ... all on one die! The DSPs would be organized as a pipeline, communicating through a "crossbar" to shared on-chip SRAM. The RISC CPU would coordinate tasks and issue commands to the DSPs. It was really quite cool.
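
    As a rough software analogue of that arrangement, here is a sketch with one thread per pipeline stage and a queue between neighbouring stages. The names `stage`, `run_pipeline`, and `DONE` are made up for illustration, and the queues stand in for the shared on-chip SRAM only loosely. Note that in standard CPython the global interpreter lock keeps these threads from running Python code truly in parallel, so this shows the structure, not the speedup.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the stream


def stage(fn, inbox, outbox):
    """Apply fn to every item from inbox, pass results on, then forward DONE."""
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)
            return
        outbox.put(fn(item))


def run_pipeline(items, fns):
    """Chain one thread per stage, with a queue between neighbouring stages."""
    queues = [queue.Queue() for _ in range(len(fns) + 1)]
    threads = [
        threading.Thread(target=stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(fns)
    ]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(DONE)

    results = []
    while True:
        out = queues[-1].get()
        if out is DONE:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    # Two toy stages standing in for, say, dequantize and IDCT.
    print(run_pipeline([1, 2, 3], [lambda x: x * 2, lambda x: x + 1]))
```

    Because each stage is a single thread reading a FIFO queue, items leave the pipeline in the order they entered.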

    --Joe

  2. More links + some analysis by ChrisRijk · · Score: 3
    (I sent the following to a different forum over a day ago - before the EETimes article appeared. This is a word for word copy...)

    MAJC home page. See the docs home page - introduction, and a "community" page.

    They haven't really released enough details (on their website) just yet, but it does look interesting. One of the more obviously different attitudes the specification takes is highly customisable implementations - you design a variation targeted at a particular application, whatever that might be - graphics accelerator, MP3 player/decoder, MPEG2/DVD decoder, or a more general purpose chip. Since it is mostly being targeted at embedded applications this is not surprising though.


    Some other interesting aspects include:

    'Support' for JIT/access-time compilers - not only does this help Java, but it should also make backwards compatibility with older versions quite simple. This seems a bit like what Transmeta are doing; Transmeta was co-founded by an ex-Sun guy, btw.

    Hardware support for ultra-fast thread switching - so fast that if one thread stalls waiting for DRAM access (which can take up to 100 clock cycles), you can switch to another thread rather than go idle. On many current OSs, threads will be switched if the current one has to do some slow I/O (e.g. a read from disc) - so this is quite an improvement.
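
    A back-of-envelope model of why this helps, sketched in Python. It assumes every thread alternates a fixed burst of compute with a fixed memory stall, that switching threads costs nothing, and that stalls of different threads overlap freely. None of these assumptions come from Sun's documents, and the 20-cycle compute burst is an invented figure; only the 100-cycle stall is taken from the text above.

```python
def core_utilization(n_threads: int, compute_cycles: int, stall_cycles: int) -> float:
    """Fraction of cycles spent on useful work in an idealized model.

    Each thread needs the core for compute_cycles out of every
    (compute_cycles + stall_cycles); with n_threads to choose from,
    the core is busy for the sum of those shares, capped at 100%.
    """
    if n_threads < 1 or compute_cycles < 1 or stall_cycles < 0:
        raise ValueError("need n_threads >= 1, compute_cycles >= 1, stall_cycles >= 0")
    share = compute_cycles / (compute_cycles + stall_cycles)
    return min(1.0, n_threads * share)


if __name__ == "__main__":
    for n in (1, 2, 4, 6):
        print(n, "threads:", round(core_utilization(n, 20, 100), 3))
```

    Under these assumptions a single thread keeps the core busy only one cycle in six, and the core saturates once enough threads are available to cover the stall.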

    A more general approach to improving parallelism - you can have more than one CPU core in a single physical chip, which might or might not share their 1st level caches. (Read this Microprocessor Report article for some background.) IBM are apparently going to do a version of the PowerPC G4 which has 2 CPUs on one chip, and I kinda suspect Sun might be planning something similar for their UltraSparc-V.

    I'm not sure how Sun plan to make money off the design. It seems pretty likely they might do something like their "community source" model - you can get the design for free, but if you want to use it commercially you pay a license. ARM is doing well just licensing their CPU designs. I'd imagine Sun using it to 'assist' their servers as add-on boards for doing heavy multi-media/3D graphics stuff - can you say "render farm"? Also, since Sun like selling their servers, they'd be happy for people to make lots of little, cheap devices that connect to nice big Sun servers.

    Like the original poster said, IEEE Micro will probably have some interesting stuff, but it seems Sun aren't releasing all the details yet - looks like we'll have to wait until the Microprocessor Forum in October. I liked the article (written by the Sun engineers) about the UltraSparc-III - not only was it interesting (and I like Sun's approach), it helped me figure out the inherent problem with the IA-64 architecture...

  3. Multi-threaded OS by Ungrounded Lightning · · Score: 3
    If there's any OS out there that is more comprehensively thread oriented (which leads to more application threading) it must be proprietary.

    Out there currently, perhaps that's true. But looking back in computing history there's the T.H.E. multiprocessing system (by Dijkstra and Riddle), plus an arbitrary number of clones of it, typically living in embedded systems.

    I used one done by Mark Weiser, on a Nova, about 1975, and cloned my own onto an 8080 a few years later. Mine was a preemptive multitasking kernel (excluding drivers) a little over 500 bytes long. Add a console driver, a debugger, a network stack (not IP), real-time-clock processing, scheduled event interpreter, instrumentation drivers, a relay logic ladder-diagram interpreter, drivers to receive and send relay/contact signals from/to optoisolators, and a network daemon that downloaded schedules, read meters, examined relay states and stuck virtual screwdrivers in to force them, and it still came in under 2K bytes. This left the other 2K of ROM available for a description of a hysterically-large emulated-relay network.

    That sucker flew, too. With the one tweak I added it became exactly an implementation of "actors", perhaps a bit before they were formalized. If you're not familiar with them: Imagine a machine where every program is in C++, but where every instance of every class is a separate thread of execution, every complicated class has been split into a set of simpler classes with one thread-related member function each, every call to a thread-related member function is an intertask message - at about the cost of a subroutine call (with free queueing of multiple messages), and every thread-related member function (with all the non-thread-related subroutines it calls) can in principle run simultaneously (because they explicitly mutex when they must share a resource, and the free queueing makes such occasions extremely rare). Now pour all these tiny tasks into the machine, with a half-K kernel to orchestrate them.
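
    For readers who want something concrete, here is a loose sketch of that style in Python rather than C++. It is not the kernel described above: `Actor`, `Accumulator`, and `Doubler` are hypothetical names, each object gets an ordinary OS thread plus a queue as its mailbox, and a message here costs far more than a subroutine call. What it does show is the shape: one object, one thread, one message handler, and calls replaced by queued messages.

```python
import queue
import threading


class Actor:
    """One object, one thread, one mailbox: a call becomes a queued message."""

    _STOP = object()

    def __init__(self):
        self._mailbox = queue.Queue()
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def send(self, message):
        """Queue a message for this actor and return immediately."""
        self._mailbox.put(message)

    def stop(self):
        """Let the actor drain its queued messages, then wait for its thread."""
        self._mailbox.put(self._STOP)
        self._thread.join()

    def _run(self):
        while True:
            message = self._mailbox.get()
            if message is self._STOP:
                return
            self.receive(message)

    def receive(self, message):
        raise NotImplementedError


class Accumulator(Actor):
    def __init__(self):
        self.total = 0  # touched only by this actor's own thread
        super().__init__()

    def receive(self, message):
        self.total += message


class Doubler(Actor):
    def __init__(self, downstream):
        self.downstream = downstream  # set before the thread starts
        super().__init__()

    def receive(self, message):
        self.downstream.send(message * 2)


if __name__ == "__main__":
    acc = Accumulator()
    dbl = Doubler(acc)
    for n in (1, 2, 3):
        dbl.send(n)
    dbl.stop()  # everything has been forwarded once this returns
    acc.stop()
    print(acc.total)
```

    Since each actor's state is only ever touched by its own thread, no explicit locking is needed in this sketch; the mailbox queue does the serializing.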

    On a single processor machine the fact that the individual objects could run in parallel was an unused side-effect of a programming style that simplified writing programs to take maximum advantage of the tiny kernel. But with a more modern hardware platform, with a slightly more complicated kernel and perhaps a little hardware assist, the same style automatically produces a great pile of tiny, simple objects that can all be run in parallel on as many CPUs as you've got.

    --
    Bantam Dominique roosters crow a four-note song. Once you've heard it as "Happy BIRTHday" you can't NOT hear it that way