Parallella: an Open Multi-Core CPU Architecture
First time accepted submitter thrae writes "Adapteva has just released the architecture and software reference manuals for their many-core Epiphany processors. Adapteva's goal is to bring massively parallel programming to the masses with a sub-$100 16-core system and a sub-$200 64-core system. The architecture has advantages over GPUs in terms of future scaling and ease of use. Adapteva is planning to make the products open source. Ars Technica has a nice overview of the project."
It is like saying a bong is going to be used for tobacco. It may be true for some but we all know how it will _really_ be used.
I checked their front page and they have a kickstarter going to fund further development.
Might want to check it out and chip in if you're interested.
http://www.kickstarter.com/projects/adapteva/parallella-a-supercomputer-for-everyone
To make parallel computing ubiquitous, developers need access to a platform that is affordable, open, and easy to use.
They promise the latter three, but "access" seems a bit lacking. Also, they specifically left out performance but talk it up in separate marketing materials (5 watts for 45 GFLOPS, etc.).
Some other alternatives optimizing for local maxima in the solution set:
Just simulate in software, if you don't care about speed but want to learn to program parallel. Erlang? They seem to have a fixation on C, why not use the right tool?
Go to opencores.org and stick a zillion cores on an off-the-shelf FPGA dev board. Or a fat stack of picoblaze or microblaze if you're willing to deal with the annoying licensing hassles (my advice: stick with opencores to avoid legal hassles; the weird licensing for the *blaze family is like the creepy dude in a van offering kids "free" candy).
They seem spread a bit thin based on clicking around the website. They're doing everything but invent hard AI and the warp drive on their website, which is a lot for just 4 people. Their kickstarter seems pretty firmly grounded in comparison.
One of those "infinite spare time" play toys would be to stick a bunch of 6809 cores (or pdp-8s or -11s or Z80s or whatever) on one of my FPGA boards and figure out the glue logic. Anyone with a big enough board could download my VHDL/Verilog and go for it on their own hardware.
"Science flies us to the moon. Religion flies us into buildings." - Victor Stenger
and the architecture is also very limiting.
16TFLOPS for $3000 or 0.09TFLOPS for $200. I'll stick to current hardware thanks. 178x more processing power for 15x more money. I would also prefer a "super computer" can address more than 4GB of RAM with more than 64bits of memory bandwidth. The architecture also limits the core cache to 64k.
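The ratios quoted above can be checked with a few lines of arithmetic. This is only a sketch using the figures from the post itself (16 TFLOPS for $3000 versus 0.09 TFLOPS for $200); the helper name `gflops_per_dollar` is made up for illustration, and none of the numbers are measured.

```python
# Figures exactly as quoted in the post; peak numbers, not benchmarks.
gpu_tflops, gpu_price = 16.0, 3000.0       # "16TFLOPS for $3000"
board_tflops, board_price = 0.09, 200.0    # "0.09TFLOPS for $200"


def gflops_per_dollar(tflops, price):
    """Peak GFLOPS bought per dollar spent (hypothetical helper)."""
    return tflops * 1000.0 / price


perf_ratio = gpu_tflops / board_tflops     # how much more raw throughput
price_ratio = gpu_price / board_price      # how much more money

print(f"performance ratio: {perf_ratio:.0f}x")
print(f"price ratio:       {price_ratio:.0f}x")
print(f"GPU rig:    {gflops_per_dollar(gpu_tflops, gpu_price):.2f} GFLOPS/$")
print(f"64-core:    {gflops_per_dollar(board_tflops, board_price):.2f} GFLOPS/$")
```

On peak throughput per dollar the GPU setup comes out roughly an order of magnitude ahead, which is the poster's point; the comparison says nothing about power draw or how easy either platform is to program.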
The Parallax Propeller is a great multi-core chip to get started with. The chip is $7.95 and has 8 cores running at 80 MHz. You can pick up the Quickstart board at Radio Shack for $40, including an overpriced RS USB cable (the boards normally retail for $25).
The Parallax Propeller is a much more economical way of getting started with multi-core programming. Parallax offers the PropTool, which provides SPIN and PASM language support. For C development you can get SimpleIDE which is a great IDE to get started with C programming on the Propeller, which uses a port of GCC.
Comparing it to Pi is a little disingenuous. Reading the copy suggests there is an ARM core, plus some number of co-processors (perhaps like the Cell and its SPEs). That would make it a non-general-purpose processor. To compare apples to apples, we'd have to know how it compares to modern GPUs.
As for apples to apples, its vaporware specs read similar to the old printout of the vaporware specs for the Propeller2 from microchip inc on my desk.
They are nearly identical situations, in fact: both small teams (the original prop was done by one guy, admittedly 6-7 years ago), gigaflops performance goals, etc.
As for which is the better vaporware product I think you're slightly better off with the parallella story than the prop2 story, but slow working silicon purchased online by CC and delivered by UPS next week always beats fast vapor, so whoever gets there first, wins....
"Science flies us to the moon. Religion flies us into buildings." - Victor Stenger
http://www.kickstarter.com/projects/adapteva/parallella-a-supercomputer-for-everyone/posts/323691
They have released their SDK and architecture documentation, worth a read. ...
Looks like an interesting platform, but the current performance is indeed lackluster.
If you've got $100 to spare, a Radeon 7750 provides over 800GFLOPS. If you've got more money a 7970 will give you 4.3TFLOPS for $550.
A GTX650 will give you 800GFLOPS for $100 and a GTX680 will give you 3TFLOPS for $500.
"The GA144-1.20 chip, with 144 self-contained computers and software-defined I/O, is available in a 1cm x 1cm, 88-pin QFN package." $20 / each, minimum order 10 (as far as I know): http://www.greenarraychips.com/home/products/index.html 200 USD buys you 1440 cores...
Perl Programmer for hire
Total on-chip, inter-core bandwidth is 64 GBytes/sec, with 8 GBytes/sec of off-chip bandwidth.
A big problem here is that classical GPUs only have two kinds of I/O ports: video output and PCI Express. Neither is very good for an embedded application, unless you have a big power budget and also have a board with an x86 processor. (Unfortunately you need x86 since you need binary drivers for your GPU to get good GPGPU performance...)
Yes, you may be right for your situation. I do not happen to have any system available with PCIe slots in it but would love to toy with a bunch of Parallella boards for a CPU-bound thing or two. So for me this is a more interesting option and I've backed their Kickstarter for that reason.
-- Spelling and grammar errors tend to be a sign of erroneous thinking.
Adapteva is creating false expectations here. Their chip won't deliver performance on par with GPUs (or CPUs, for that matter) and still be cheap. Why? Because it's not a thing that a startup can do in today's world of computing. For such a chip you need to use the latest CMOS processes and a huge team to design/optimize the ASIC (especially if it's meant to be a low power chip) -- both of which are extremely costly. If it was that easy, then we'd see more competition and not Intel, AMD, Nvidia and IBM as the only global players in the HPC arena.
If you're a small startup, then you'll be bound to 100nm processes (at best), and have to use automated layouts (not the hand-optimized ones e.g. Intel uses). Both reduce performance, increase power intake.
I work at the Chair for Computer Architecture at FAU. We have some of the very brightest minds working on custom chips for industry solutions. This 2D CPU matrix that Adapteva proposes is something that my colleagues played with years ago. It's a good approach and I personally believe that this will be the shape of CPUs to come. It started with the ring bus on the IBM Cell, now Intel's Nehalem has got a partitioned L3 cache connected with a... ring bus and Intel's Xeon Phi (MIC) even got a 2D on-chip grid network. But even my colleagues concede that a) on FPGAs you'll always be trailing GPUs concerning floating point performance (it's something FPGAs are particularly bad at) and b) even when designing an ASIC you'll always be beaten by GPUs in terms of performance, assuming similar prices and power consumption. Those are simply beasts, optimized down to the bone. It's the result of a multi-billion mass market. That's also the reason why there is no next IBM Cell chip for a PlayStation 4: Cell was too expensive to develop to keep up with the competition. Its market is too small compared to the ubiquitous GPUs.
For teaching parallel computing I'd always suggest a GPU. The tools are there, the performance is great and you'll be able to use the knowledge gained in real-world projects.
Computer simulation made easy -- LibGeoDecomp
The devboard has a dual-core ARM A9, so it's more like a Pandaboard. Even if you ignore the co-processor, they are offering a lot for $99.
It's interesting to compare the Epiphany processor to a GPU. Both give you lots of cores: GPUs get up into the hundreds, Epiphany is meant to scale to 4000. But a GPU is highly optimised for graphics and for applying identical operations to millions of data values. In a GPU, groups of cores (typically 32) operate as a wavefront; if the code branches on an if statement, the cores that get the else branch have to wait until the ones that follow the if finish.
Epiphany has independent cores. You can send them each a different program, so for a much wider set of algorithms you can get efficient speedups. In a way it is more like the Xeon Phi, but without making each core a full x86-compatible processor.
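The branching penalty described above can be illustrated with a toy cost model. This is a sketch, not a GPU simulator: it assumes every branch body has a fixed cycle cost, that a lockstep group runs each taken branch in turn while the other lanes idle, and that independent cores simply finish when the slowest one does. The function names and the 100-cycle costs are made up for illustration.

```python
# Toy cost model only; real hardware has many more effects (memory, scheduling).
WARP_SIZE = 32  # group size mentioned in the post


def simt_cycles(takes_if, if_cost, else_cost):
    """Lockstep group: every branch taken by at least one lane is executed
    in turn, while lanes on the other branch sit idle."""
    cycles = 0
    if any(takes_if):
        cycles += if_cost
    if not all(takes_if):
        cycles += else_cost
    return cycles


def independent_cycles(takes_if, if_cost, else_cost):
    """Independent cores: each core runs only its own branch, so the group
    is done when the slowest core is done."""
    return max(if_cost if taken else else_cost for taken in takes_if)


uniform = [True] * WARP_SIZE                         # every lane takes the 'if'
divergent = [i % 2 == 0 for i in range(WARP_SIZE)]   # lanes alternate branches

for name, lanes in (("uniform", uniform), ("divergent", divergent)):
    print(name,
          "lockstep:", simt_cycles(lanes, 100, 100),
          "independent:", independent_cycles(lanes, 100, 100))
```

Under these assumptions the divergent case costs the lockstep group the sum of both branches, while independent cores pay only for the longer one; with uniform control flow the two models cost the same.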
>As for apples to apples, It's vaporware specs read similar to the old printout of the vaporware specs for the Propeller2 from microchip inc on my desk.
They have working 16-core silicon. They shared the cost of a 65nm wafer with other companies' small-run ASICs. This lowers the entry cost of making silicon, but gives a crazy high per-unit cost. If they raise enough money to do a full wafer at 28nm, then it becomes cost effective. There are interesting details and numbers on page 3 of http://www.adapteva.com/wp-content/uploads/2011/06/adapteva_mpr.pdf
As soon as you have branches in your GPU code, the performance drops like a brick. GPUs also only work well with sequential data. What it comes down to, is GPUs only do well with matrix math.
Yes, that's true. But unfortunately I cannot plug your Radeon or GTX into my mobile robot or quadrocopter in order to give them machine vision or neural networks/machine learning "brains" (at least not without some serious improvements in battery technology!).
So, what are the alternatives to bring the current vision algorithms to mobile devices/robots? The Parallella is the only option I am aware of.
For these types of mobile applications, you should rather compare the Parallella with Raspberry Pi or Arduino. And guess who wins this performance comparison! ;)
20 - 40 GFLOPS per watt isn't that much better than the 17GFLOPS/watt of the high end Radeon GPUs.
Mobile devices are already available with GPUs up to 32GFLOPS. The new iPad is 32 and Intel's new Atom SoC is 34GFLOPS. I'm sure the Tegra 4 will be up around those figures when it comes out too. (Those GFLOPS figures don't include the dual-core CPUs they sit next to, and the PowerVR GPU is clocked slower than the Parallella.)
Mobile devices are already available with GPUs up to 32GFLOPS.
Maybe, but are they also accessible for the programmer? I was greatly disappointed when I learnt that the praised and "powerful" GPU of the Raspberry Pi is locked down by NDAs and NOT available for the programmer for OpenCL or GPGPU. I think the same is true for the PowerVR.
I have looked around quite a while and have not found a readily available board with a GPU that could be programmed (OpenCL) and is powerful enough for real-time image/vision processing. Not sure about the Tegra 4 .... ?
A cellphone, which currently provides a dual-core ARM CPU with several GFLOPS of GPU goodness, and as a bonus it's got its own battery with GPS and wireless communications. PowerVR's 600 series GPUs are capable of 100+GFLOPS. They'll be in your iDevice next year. The PowerVR 534 in the new iPad is only 32GFLOPS though.
http://www.youtube.com/watch?v=sDrz-w1jzEU OpenCL is supported by the PowerVR GPUs but it depends on the SoC vendor.
a) the 64-core epiphany doesn't yet exist, so that 2 watts is theoretical.
b) its theoretical performance isn't much better than current mobile GPUs found in cellphones and tablets. I don't know how much power they consume, but I can play Angry Birds and watch movies on my phone for hours on its 5 Wh battery (of which the LCD backlight consumes the most power).
The Epiphany core has a mere 35 instructions – yup, that is RISC alright – and the current Epiphany-IV has a dual-issue core with 64 registers and delivers 50 gigaflops per watt. It has one arithmetic logic unit (ALU) and one floating point unit and a 32KB static RAM on the other side of those registers.
Each core also has a router that has four ports that can be extended out to a 64x64 array of cores for a total of 4,096 cores. The currently shipping Epiphany-III chip is implemented in a 65 nanometer process and sports 16 cores, and the Epiphany-IV is implemented in a 28 nanometer process and offers 64 cores.
The secret sauce in the Epiphany design is the memory architecture, which allows any core to access the SRAM of any other core on the die. This SRAM is mapped as a single address space across the cores, greatly simplifying memory management. Each core has a direct memory access (DMA) unit that can prefetch data from external flash memory.
The initial design didn't even have main memory or external peripherals, if you can believe it, and used an LVDS I/O port with 8GB/sec of bandwidth to move data on and off the chip from processors. The 32-bit address space is broken into 4,096 1MB chunks, one potentially for each core that could in theory be crammed onto a single die if process shrinking continues.
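The arithmetic in the excerpt works out: 4,096 chunks of 1 MB fill a 32-bit address space exactly, which leaves 12 bits to select a core and 20 bits for the offset inside that core's chunk. Here is a minimal sketch of such a split, assuming the 12 core bits are divided into a 6-bit row and a 6-bit column to match the 64x64 array. The excerpt does not spell out the bit layout, and the function names are hypothetical.

```python
OFFSET_BITS = 20   # 1 MB chunk per core -> 2**20 bytes
COORD_BITS = 6     # assumed: 64 rows x 64 columns -> 6 bits each


def global_address(row, col, offset):
    """Pack (row, col, offset) into a single 32-bit address."""
    if not (0 <= row < 64 and 0 <= col < 64):
        raise ValueError("row and col must be in 0..63")
    if not 0 <= offset < (1 << OFFSET_BITS):
        raise ValueError("offset must fit in the 1 MB chunk")
    core_id = (row << COORD_BITS) | col
    return (core_id << OFFSET_BITS) | offset


def split_address(addr):
    """Inverse of global_address: return (row, col, offset)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    core_id = addr >> OFFSET_BITS
    row = core_id >> COORD_BITS
    col = core_id & ((1 << COORD_BITS) - 1)
    return row, col, offset


# 4,096 chunks of 1 MB cover the whole 32-bit space.
assert 4096 * (1 << OFFSET_BITS) == 1 << 32

addr = global_address(32, 8, 0x100)
print(hex(addr), split_address(addr))
```

With a layout like this, any core can reach another core's SRAM just by forming the right address, which is the flat shared address space the excerpt describes.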
Besides the open-source, how is this project any different from what Tilera already has? http://www.tilera.com/