MIT Startup Unveils New 64-Core CPU
single-threaded writes "Tilera, a startup out of MIT, has announced that it is shipping a 64-core CPU. Called the TILE64, the CPU is fabbed on a 90nm process and is clocked at anywhere from 600MHz to 900MHz. 'What will make or break Tilera is not how many peak theoretical operations per second it's capable of (Tilera claims 192 billion 32-bit ops/sec), nor how energy-efficient its mesh network is, but how easy it is for programmers to extract performance from the device. That's the critical piece of TILE64's launch story that's missing right now, and it's what I'll keep an eye out for as I watch this product make its way in the market. Though there are any number of questions about this product that remain to be answered, one thing is for certain: TILE64 has indeed brought us into the era of 64 general-purpose, mesh-networked processor cores on a single chip, and that's a major milestone.'"
Also FTA: "I'm due to talk to the head of Tilera's software team, which is actually larger than the company's hardware team."
I'll be very curious what their development toolchain ends up looking like, but it seems clear they understand the issue.
Well, yes, it does run Linux - full SMP 2.6 according to the blurb on their site.
The wattage isn't missing:
TFA says it's between 175 and 300 milliwatts per core - do the math: roughly 11 to 19 watts for all 64 cores. They're targeting the embedded market (and with those low power consumption figures, I think a super laptop would be a no-brainer).
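Spelling out that arithmetic, as a rough sketch only: it assumes the per-core figures simply scale across all 64 cores and ignores anything on the chip outside the cores. The function name is made up for illustration.

```python
CORES = 64  # TILE64 core count, from the summary


def chip_power_watts(milliwatts_per_core, cores=CORES):
    """Total power in watts, assuming every core draws the same amount."""
    return milliwatts_per_core * cores / 1000.0


low = chip_power_watts(175)   # lower per-core figure quoted from TFA
high = chip_power_watts(300)  # upper per-core figure quoted from TFA
print(f"{low:.1f} W to {high:.1f} W")
```

That works out to 11.2 W at the low end and 19.2 W at the high end.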
FWIW:
"If you have an application written for any multi-core or single processor architecture that's written to work with Linux, you can take it, compile it and have it running on our chip in minutes," he said. "Now, if you want to ratchet up the performance, we provide libraries and interface mechanisms that customers can use to tune code." from here
For those of you wondering about what their software will be like, here's some info on their Multicore Development Environment (MDE). http://www.tilera.com/products/software.php It's not the most info in the world, but it's a start.
The T1 was already doing 32, and the new T2 is supporting 256 in a single chip. Just wondering why "TILE64 has indeed brought us into the era of 64 general-purpose, mesh-networked processor cores on a single chip, and that's a major milestone", when the mile marker is already at 256?
Because this has 64 cores as opposed to 8 cores on either the T1 or T2?
Because the total number of threads supported by an 8 core T2 is 64 and not 256 as you wrote above?
This has been done. There was an article a while back about IBM being able to drill holes through their wafer to produce an interconnect to a second wafer on the bottom.
Intel did this as well and redesigned the Pentium 4 on it.
The old method of bonding two wafers also works. Smart sensors, for instance, bond a photodetector material (a semiconductor like InGaAs or InSb) onto the top of a CMOS chip. The bonding was very expensive, of course, but it is definitely possible to grow a semiconductor on top of existing metal/polysilicon.
Considering these things are MIPS cores, having C code compile to it wouldn't be hard at all I would say. It's utilizing the mesh network that's the problem.
Until I see some results of dynamically-compiled C code that runs really fast on this thing, I don't see it offering better solutions than, say, an FPGA. The exception would be if this was much lower-powered.
It's not theoretically impossible to do. Instead of treating it like a CPU, treat it like a network with micro-ops treated like packets. Run each sequence of micro-ops through something similar to a global routing algorithm and optimization should be fairly easy. This all, of course, assumes that you have something very parallelizable to begin with, like H.264 encoding.
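To make the "ops as packets" idea a bit more concrete, here is a toy sketch of the simplest routing scheme I know of for a mesh: dimension-ordered (XY) routing, where a packet travels along its row first and then along its column. This is a stand-in, not the global routing algorithm mentioned above, and nothing in it is taken from Tilera's documentation. The 8x8 layout, the row-major tile numbering and the function names are all assumptions for illustration.

```python
def xy_route(src, dst, width=8):
    """Tiles visited going from src to dst on a width-wide mesh.

    Tiles are numbered row by row (an assumed numbering). The packet
    moves along X until the column matches, then along Y.
    """
    x, y = src % width, src // width
    dst_x, dst_y = dst % width, dst // width
    path = [src]
    while x != dst_x:
        x += 1 if dst_x > x else -1
        path.append(y * width + x)
    while y != dst_y:
        y += 1 if dst_y > y else -1
        path.append(y * width + x)
    return path


def hops(src, dst, width=8):
    """Number of links crossed, i.e. the Manhattan distance on the mesh."""
    return len(xy_route(src, dst, width)) - 1


# Corner to corner on an 8x8 mesh: 7 hops across plus 7 hops down.
print(hops(0, 63))
```

A scheduler could use a hop count like this as the cost of placing two dependent operations on tiles that are far apart, which is where the real optimization problem starts.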
Those are *not* very impressive figures for the embedded market. I imagined the whole 64-core chip would run below 100mW. If we're talking 12 to 19 watts for the chip, it is a beast in embedded terms. For reference, an SoC with 4 ARM cores, all of the peripherals that that thing has plus dedicated DSP/FPU units would still be under 4W.
FPGAs (particularly ones from Xilinx) offer similar logic horsepower (assuming you have a digital designer to write your VHDL for you) for less than 500mW.
The latest Virtex 5 for DSP applications can provide the same processing capability these guys claim (2x H.264 streams) along with all the bells and whistles and, on top of that, you have 2 hard-core PPC processors to act as arbitrators for slower functions.
Those things draw up to 1W though, and that's a lot of power for an embedded system.
Parallel processors on a single die (chip) are very different from Thinking Machines & Beowulf clusters.
Up till now there were only 2 types of parallel processing.
1.) Loosely coupled. Thinking Machines & Beowulf clusters, for example, use this: the nodes are interconnected with Ethernet or some other network medium and send messages back and forth.
2.) Tightly coupled. This is SMP, NUMA, SNOOPY: basically a shared memory system where each processor shares the same global memory space.
Each requires very different programming strategies and is limited to certain types of problems.
There is also a third form that is lesser known: systolic arrays. An example of this is TimeLogic, and many DOD type projects.
This is usually done with a bunch of FPGAs, and the math computations are done as a series of hardware pipelines without any CPU.
With the parallel core processor it's possible to make it like an SMP (shared memory) type system, but you really get hammered by the memory bottleneck, so after about 4 CPUs you don't really gain much.
What I had proposed was doing systolic array type processing, but with simple but fast CPUs on one chip.
They would be connected with CPU registers that would pass data directly from one CPU to the next.
Its design would allow super tight coupling between each processor, so a programming problem wouldn't need to process a buffer at a time but could tackle problems that can't normally be broken up into parallel operations. For example, a bignum math operation like multiplying 2 numbers that are 1024 bits long. Or large FFT, fast DVT, or matrix operations where each CPU could process part of a single operation that must be done serially, and can not be done using traditional parallel processing.
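As a rough illustration of the bignum case, here is a software simulation of a linear chain of cells doing a schoolbook multiply. Each cell holds one 16-bit limb of one operand, and the only thing passed between neighbours is a carry register. It is a sketch under my own assumptions (16-bit limbs, one cell per limb, a schedule where cell j handles row i at time step 2*i + j), not a description of any real chip, and all the names are made up.

```python
def to_limbs(value, bits=16):
    """Split a non-negative int into little-endian limbs of `bits` bits."""
    mask = (1 << bits) - 1
    limbs = []
    while True:
        limbs.append(value & mask)
        value >>= bits
        if value == 0:
            return limbs


def from_limbs(limbs, bits=16):
    """Reassemble little-endian limbs into an int."""
    return sum(limb << (bits * k) for k, limb in enumerate(limbs))


def systolic_multiply(a, b, bits=16):
    """Multiply limb lists a and b on a simulated chain of len(b) cells.

    Cell j holds b[j]. At time step t it works on a[i] with i = (t - j) / 2,
    so the cells active in one step never touch the same register or the
    same result limb. A tail position (j == len(b)) stores the final carry.
    """
    mask = (1 << bits) - 1
    m, n = len(a), len(b)
    result = [0] * (m + n)
    carry = [0] * n  # carry[j] is the register between cell j and cell j + 1
    for t in range(2 * (m - 1) + n + 1):
        for j in range(n + 1):
            if (t - j) % 2 or not 0 <= (t - j) // 2 < m:
                continue  # this cell is idle during this time step
            i = (t - j) // 2
            if j == n:
                result[i + n] = carry[n - 1]
            else:
                carry_in = carry[j - 1] if j > 0 else 0
                total = a[i] * b[j] + result[i + j] + carry_in
                result[i + j] = total & mask
                carry[j] = total >> bits
    return result


def multiply(x, y, bits=16):
    """Convenience wrapper: int in, int out."""
    limbs = systolic_multiply(to_limbs(x, bits), to_limbs(y, bits), bits)
    return from_limbs(limbs, bits)


x = (1 << 1024) - 12345  # a 1024-bit operand
y = 3 ** 600             # a second large operand
print(multiply(x, y) == x * y)
```

For two 1024-bit operands with 16-bit limbs this is 64 cells, which happens to match the core count being discussed. Whether a real mesh could keep the carries moving fast enough is exactly the open question.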
Specifically, my interest was in video compression and image processing in real time. This is where DCT, motion vector searches, Huffman coding and other operations that don't parallelize well would really get a boost using this type of processor.
Are you trying to say Billy G didn't comment on the retarded XT memory architecture with the quote that '640k should be enough for anybody'? I suppose you also don't believe the head of IBM saw a world market for around 5 computers.
Don't they teach history in schools anymore?