Parallella: an Open Multi-Core CPU Architecture

Comparisons... by Anonymous Coward · 2012-10-07 07:44 · Score: 1

Comparing it to the Pi is a little disingenuous. Reading the copy suggests there is an ARM core, plus some number of co-processors (perhaps like the Cell and its SPEs). That would make it a non-general-purpose processor. To compare apples to apples, we'd have to know how it compares to modern GPUs.

Re:Comparisons... by vlm · 2012-10-07 08:16 · Score: 2

Comparing it to the Pi is a little disingenuous. Reading the copy suggests there is an ARM core, plus some number of co-processors (perhaps like the Cell and its SPEs). That would make it a non-general-purpose processor. To compare apples to apples, we'd have to know how it compares to modern GPUs.
As for apples to apples, its vaporware specs read similarly to the old printout of the vaporware specs for the Propeller2 from microchip inc on my desk.
They are nearly identical situations, in fact: both are small teams (the original prop was done by one guy, admittedly 6-7 years ago), with gigaflops performance goals, etc.
As for which is the better vaporware product, I think you're slightly better off with the parallella story than the prop2 story, but slow working silicon purchased online by CC and delivered by UPS next week always beats fast vapor, so whoever gets there first wins....

--
"Science flies us to the moon. Religion flies us into buildings." - Victor Stenger
Re:Comparisons... by vlm · 2012-10-07 08:27 · Score: 1

Propeller2 from microchip inc
Well that was embarrassing. That would be Parallax Inc. Before the prop came out in the middle of last decade (it's old) they were famous for the BASIC Stamp, which I guess you'd call the Ardweeeeeeeno of 1990.
Microchip makes the PICs. I was thinking of something along these lines: instead of buying 8 cores of 80 MHz 2005-era propeller for $8, buy 8 PIC24s in 20-pin DIP packages (easy for noobs to prototype) for a bit over a buck apiece, tie the I/O lines together, and learn the agonies of multi-processor interfacing...

Re:Comparisons... by Anonymous Coward · 2012-10-07 08:43 · Score: 1

Hardly vaporware, you can already buy these commercially if you don't mind spending as much as a fully loaded Airbook.
Re:Comparisons... by ssam · 2012-10-07 10:22 · Score: 4, Interesting

the devboard has a dual-core ARM Cortex-A9, so it's more like a pandaboard. even if you ignore the co-processor they are offering a lot for $99.
it's interesting to compare the epiphany processor to a GPU. both give you lots of cores: GPUs get up into the hundreds, epiphany is meant to scale to 4000. but a GPU is highly optimised for graphics, and for applying identical operations to millions of data values. in a GPU, groups of cores (typically 32 or 64, depending on the vendor) operate in lockstep as a wavefront. if the code branches on an if statement, then the cores that get the else branch have to wait until the ones that follow the if finish.
epiphany has independent cores. you can send each of them a different program, so for a much wider set of algorithms you can get efficient speedups. in a way it is more like the xeon phi, but without making each core a full x86-compatible processor.
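To put rough numbers on the if/else point above, here is a toy cost model. It is a sketch, not a GPU simulator: the function names, the group size of 32 and the cycle counts are all made up for illustration.

```python
# Toy cost model for branch divergence. Names and cycle counts are assumptions.

def lockstep_cycles(takes_if, if_cost, else_cost):
    """Cycles for one lockstep group (wavefront) to get past an if/else.

    If every lane agrees, only that branch runs. If the lanes disagree, the
    group runs both branches back to back with the idle lanes masked off.
    """
    cycles = 0
    if any(takes_if):
        cycles += if_cost
    if not all(takes_if):
        cycles += else_cost
    return cycles


def independent_cycles(takes_if, if_cost, else_cost):
    """Cycles until the slowest of several independent cores is done.

    Each core runs only its own branch, so wall time is the longest branch.
    """
    return max(if_cost if taken else else_cost for taken in takes_if)


GROUP = 32
uniform = [True] * GROUP                            # every lane takes the if
divergent = [lane % 2 == 0 for lane in range(GROUP)]  # half if, half else

uniform_lockstep = lockstep_cycles(uniform, 100, 80)
divergent_lockstep = lockstep_cycles(divergent, 100, 80)
divergent_independent = independent_cycles(divergent, 100, 80)

print(uniform_lockstep, divergent_lockstep, divergent_independent)
```

In this model a divergent group pays for both branches, while independent cores only pay for the slower one. Real hardware has more going on (scheduling, memory stalls), so treat it as intuition only.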
Re:Comparisons... by ssam · 2012-10-07 10:29 · Score: 2

>As for apples to apples, its vaporware specs read similarly to the old printout of the vaporware specs for the Propeller2 from microchip inc on my desk.
they have working 16-core silicon. they shared the cost of a 65nm wafer with other companies' small-run ASICs. this lowers the entry cost of making silicon, but gives a crazy high per-unit cost. if they raise enough money to do a full wafer at 28nm, then it becomes cost effective. there are interesting details and numbers on page 3 of http://www.adapteva.com/wp-content/uploads/2011/06/adapteva_mpr.pdf

Hmmm... by AliasMarlowe · 2012-10-07 07:46 · Score: 1

I've got a compute-bound embarrassingly parallel problem at work (real-time image processing in a very compact unit). This bears looking at. What is its I/O potential?

--
Those who can make you believe absurdities can make you commit atrocities. - Voltaire

Re:Hmmm... by viperidaenz · 2012-10-07 08:32 · Score: 4, Informative

If you've got $100 to spare, a Radeon 7750 provides over 800GFLOPS. If you've got more money a 7970 will give you 4.3TFLOPS for $550.
a GTX650 will give you 800GFLOPS for $100 and a GTX680 will give you 3TFLOPS for $500.
Re:Hmmm... by Anonymous Coward · 2012-10-07 09:03 · Score: 2, Informative

Total on-chip, inter-core bandwidth is 64 GBytes/sec, with 8 GBytes/sec of off-chip bandwidth.
Re:Hmmm... by Anonymous Coward · 2012-10-07 09:05 · Score: 2, Insightful

A big problem here is that classical GPUs only have two kinds of I/O ports: video output and PCI Express. Neither is very good for an embedded application, unless you have a big power budget and also have a board with an x86 processor. (Unfortunately you need x86 since you need binary drivers for your GPU to get good GPGPU performance...)
Re:Hmmm... by DaMattster · 2012-10-07 09:27 · Score: 0

The problem, AssMarlow, is that most people here at Slashdot have a serious cas of fellatio mouth.
Posted anonymously, too! Chickenshit!
Re:Hmmm... by Gorgonzola · 2012-10-07 09:47 · Score: 2

Yes, you may be right for your situation. I do not happen to have any system available with PCIe slots in it but would love to toy with a bunch of Parallella boards for a CPU-bound thing or two. So for me this is a more interesting option and I've backed their Kickstarter for that reason.

--
-- Spelling and grammar errors tend to be a sign of erroneous thinking.
Re:Hmmm... by qox · 2012-10-07 10:05 · Score: 1

The Parallella solution is much, much more energy efficient than a GPU, because a GPU has to power parts that go unused in HPC, like polygon rendering circuits. Think about it: ~40GFLOPS for 1-2 watts. Hell, even the chips that they produce now in a 65nm process could easily be built into a smartphone.
Re:Hmmm... by ssam · 2012-10-07 10:40 · Score: 1

raw GFLOPS is not everything. GPUs need very SIMD-friendly work to reach their theoretical limits. for a lot of graphics work that's fine. for some scientific work it's good too. for other problems, where the code has lots of branches in it, you end up with cores waiting for other cores to do different branches (look up wavefront for more info).
the $99 on the Kickstarter gets you a dual-core ARM A9 with a 16-core epiphany processor (a dual-core ARM A9 devboard costs in the region of $100 normally). no one is saying that the 16-core version is going to beat an i7 or a GPU, it just puts a lot of CPU power into a tiny area of silicon that uses a tiny amount of power. it's a step on the road to 1024 and 4096 core versions that will sit on PCI-E cards and challenge the performance of a GPU for a fraction of the power consumption (most of the cost of HPC).
Re:Hmmm... by viperidaenz · 2012-10-07 10:58 · Score: 1

Compared to current CPUs: the i7-2600K does 108.8GFLOPS per core, of which it has 4.

In the low-end CPU market the i3-2100 has two cores, each doing 49.6GFLOPS, outperforming the 64-core Parallella that can only muster 90. The 16-core chip does 26GFLOPS, trumped by an old i3-330 with two cores providing 17GFLOPS each.
I'll give it one thing though, it beats the pants off an Intel Atom. The D2700 only does 17GFLOPS.
Re:Hmmm... by Anonymous Coward · 2012-10-07 11:01 · Score: 3, Informative

As soon as you have branches in your GPU code, the performance drops like a brick. GPUs also only work well with sequential data. What it comes down to, is GPUs only do well with matrix math.
Re:Hmmm... by naeger · 2012-10-07 11:01 · Score: 3, Interesting

Yes, that's true. But unfortunately I cannot plug your Radeon or GTX into my mobile robot or quadrocopter in order to give them machine vision or neural network/machine learning "brains" (at least not without some serious improvements in battery technology!).
So, what are the alternatives to bring the current vision algorithms to mobile devices/robots? The Parallella is the only option I am aware of.
For these types of mobile applications, you should rather compare the Parallella with Raspberry Pi or Arduino. And guess who wins this performance comparison! ;)
Re:Hmmm... by naeger · 2012-10-07 11:05 · Score: 1

It always comes down to the application: the i7 and i3 you mention consume how many watts? ... Impossible if the "real-time image processing" mentioned above should be done on a mobile device (mobile robot/drone). The 64-core epiphany only consumes 2 watts (in words: TWO!)!
Re:Hmmm... by viperidaenz · 2012-10-07 11:06 · Score: 2

20-40 GFLOPS per watt isn't that much better than the 17GFLOPS/watt of the high-end Radeon GPUs.
Mobile devices are already available with GPUs up to 32GFLOPS. The new iPad is 32 and Intel's new Atom SoC is 34GFLOPS. I'm sure the Tegra 4 will be up around those figures when it comes out too. (Those GFLOPS figures don't include the dual-core CPUs they sit next to, and the PowerVR GPU is clocked slower than the Parallella.)
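For anyone who wants to redo the GFLOPS-per-watt arithmetic, a quick sketch follows. The Epiphany inputs are the ~40GFLOPS at 1-2 watts quoted earlier in the thread. The Radeon inputs (4.3TFLOPS, 250W board power) are an assumption picked to show how a figure near 17GFLOPS/watt can come about, not a measurement.

```python
def gflops_per_watt(gflops, watts):
    """Peak throughput divided by power draw."""
    return gflops / watts


# Figures quoted in the thread: ~40 GFLOPS at somewhere between 1 and 2 W.
epiphany_low = gflops_per_watt(40, 2)
epiphany_high = gflops_per_watt(40, 1)

# Assumed inputs for a high-end Radeon: 4.3 TFLOPS peak, 250 W board power.
radeon = gflops_per_watt(4300, 250)

print(epiphany_low, epiphany_high, round(radeon, 1))
print(round(epiphany_high / radeon, 2))  # best-case advantage over the GPU
```

These are peak numbers on both sides, so the ratio says nothing about how close either chip gets to its peak on real code.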
Re:Hmmm... by naeger · 2012-10-07 11:35 · Score: 2

Mobile devices are already available with GPUs up to 32GFLOPS.

Maybe, but are they also accessible for the programmer? I was greatly disappointed when I learnt that the praised and "powerful" GPU of the Raspberry Pi is locked down by NDAs and NOT available to the programmer for OpenCL or GPGPU. I think the same is true for the PowerVR.
I have looked around for quite a while and have not found a readily available board with a GPU that can be programmed (OpenCL) and is powerful enough for real-time image/vision processing. Not sure about the Tegra 4 .... ?
Re:Hmmm... by viperidaenz · 2012-10-07 11:37 · Score: 1

That step has a theoretical limit of 4095 cores, 4GB address space, 64bit memory bus and 64k of cache per core. A 4095 core version of this chip would have a theoretical 5.7TFLOPS, somewhere between the Radeon 7970 (3.8T) and rumored 7990 (6.9T, which is effectively two 7970's but as you can see, performance never scales linearly when you add more cores.) So in summary, you can already get a 2048 core GPGPU that sits in a PCI-e slot with its own 3GB of 5.5GHz, 384bit RAM. Next year you get 4096 cores with 6GB of RAM. For your $100 you already get 819GFLOPS with 1GB of 4.5GHz 128bit RAM in your PCI-e slot.

I googled Wavefront, and found this www.ccs3.lanl.gov/pal/publications/papers/terraflop00:parallel.pdf

Our analysis showed that contrary to conventional wisdom, interprocessor communication performance was not the bottleneck for such a problem, although communication does become important for smaller problem sizes. For the largest problem, single-node efficiency was the dominant factor.
So, for doing wavefront calculations you want big beefy cores more than you want many simple cores with a high speed inter-cpu network.
For doing any form of scientific research you'll also probably want more than 64k of instruction/data cache.
Re:Hmmm... by viperidaenz · 2012-10-07 11:43 · Score: 2

A cellphone, which currently provides a dual-core ARM CPU with several GFLOPS of GPU goodness and, as a bonus, has its own battery plus GPS and wireless communications. PowerVR's 600 series GPUs are capable of 100+GFLOPS. They'll be in your iDevice next year. The PowerVR 534 in the new iPad is only 32GFLOPS though.
Re:Hmmm... by viperidaenz · 2012-10-07 11:46 · Score: 3, Informative

http://www.youtube.com/watch?v=sDrz-w1jzEU OpenCL is supported by the PowerVR GPUs, but it depends on the SoC vendor.
Re:Hmmm... by viperidaenz · 2012-10-07 11:52 · Score: 2

a) the 64-core epiphany doesn't exist yet, so that 2 watts is theoretical.
b) its theoretical performance isn't much better than current mobile GPUs found in cellphones and tablets. I don't know how much power they consume, but I can play Angry Birds and watch movies on my phone for hours on its 5Wh battery (with the LCD backlight consuming most of that power).
Re:Hmmm... by Anonymous Coward · 2012-10-07 13:18 · Score: 0

Says the "man" who uses an alias.
Re:Hmmm... by Anonymous Coward · 2012-10-07 13:43 · Score: 0

Are you saying they faked this video?
http://www.youtube.com/watch?v=4sMWbaV1sRQ&feature=plcp
Re:Hmmm... by ajlitt · 2012-10-07 13:54 · Score: 1

If you have the power budget and space for one of these GPUs, then throwing a mini-ITX board in the mix is the sane solution. An AMD E-series platform is going to draw a small portion of what a modern top of the line GPU will require.
My problem with this is that it's not an underserved market: XMOS and Propeller have both been around for a while and neither sells in considerable volume.
Re:Hmmm... by LocoMosquito · 2012-10-07 13:59 · Score: 1

Why are we talking single precision FP? In double precision, 7750 = (800GFLOPS/16)*efficiency, 7970 ~ 1TFLOPS*efficiency, 2500k@5GHz ~ 125GFLOPS. I would like to see a Kickstarter for a multicore IBM chip from Blue Gene/Q: an 18 core/200GFLOPS/55W monster. ARM is just slow in FP operations.
Re:Hmmm... by ssam · 2012-10-08 02:33 · Score: 1

you need to fill the SIMD units to get theoretical performance on an i7 (or similar). with epiphany it may be easier to get close to the theoretical FLOPS
http://www.adapteva.com/white-papers/ten-myths-debunked-by-the-epiphany-iv-64-core-accelerator-chip/
Also you save lots of power, because you don't have vast amounts of cache (though this may affect performance in some cases), and the architecture is much simpler, with only the instructions needed (not decades of x86 legacy). last time i was in an HPC machine room, there was a pile of not very old xeon servers in the corner: still fast machines, but not worth the energy cost to run them compared to the new servers.
Re:Hmmm... by Anonymous Coward · 2012-10-08 07:03 · Score: 0

Thank you, now I KNOW I won't buy a Raspberry PI.
Re:Hmmm... by viperidaenz · 2012-10-08 07:38 · Score: 1

Because the Epiphany processor cannot do double precision. It has no hardware support for > 32bit floating point.
Re:Hmmm... by viperidaenz · 2012-10-08 07:38 · Score: 1

or you save power because you have vast amounts of cache, so the cores aren't idling while waiting for system memory, which someone thought would be a good idea to limit to a single 64-bit SDRAM channel.

By the way, this Epiphany processor doesn't do double precision floating point. It only has hardware to do single precision and is not IEEE754 compliant.
Re:Hmmm... by Anonymous Coward · 2012-10-08 13:31 · Score: 0

So they flush denormal numbers to zero, big deal!
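For anyone wondering what "flush denormals to zero" means in practice, here is a small sketch. It models the idea in plain Python using float32 rounding via struct. The helper names are made up, and whether this matches the Epiphany's exact behaviour is an assumption based only on this thread.

```python
import math
import struct

# Smallest positive *normal* single-precision value (2**-126).
FLT_MIN_NORMAL = 2.0 ** -126


def to_float32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def flush_to_zero(x):
    """Model a flush-to-zero unit: subnormal float32 values become signed zero."""
    y = to_float32(x)
    if y != 0.0 and abs(y) < FLT_MIN_NORMAL:
        return math.copysign(0.0, y)
    return y


tiny = 2.0 ** -130                 # below the normal range, still a valid subnormal
ieee_result = to_float32(tiny)     # IEEE 754 keeps it with reduced precision
ftz_result = flush_to_zero(tiny)   # a flush-to-zero unit returns 0.0 instead

print(ieee_result, ftz_result)
```

The practical difference only shows up for values below about 1.2e-38, which is why many people shrug at it. It can matter for code that relies on gradual underflow, for example checks that assume x != y implies x - y != 0.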

Headline Should Be: An Open Bitcoin Architecture by Anonymous Coward · 2012-10-07 07:49 · Score: 2, Interesting

It is like saying a bong is going to be used for tobacco. It may be true for some but we all know how it will _really_ be used.

Kickstarter by Trecares · 2012-10-07 08:00 · Score: 3, Informative

I checked their front page and they have a kickstarter going to fund further development.

Might want to check it out and chip in if you're interested.

http://www.kickstarter.com/projects/adapteva/parallella-a-supercomputer-for-everyone

Re:Kickstarter by hannson · 2012-10-07 09:01 · Score: 1

I did and... sold!
Re:Kickstarter by naeger · 2012-10-07 11:28 · Score: 4, Informative

I really like the parallella project. Due to its low power consumption (2 watts for the 64-core version), it is the only option to bring significant processing power to mobile devices (e.g. mobile robots/quadrocopter/drones) and would be ideally suited to implement machine vision and neural network/machine learning algorithms for those mobile devices.
That said, their kickstarter initiative has some serious flaws:
1. They are only offering the 16-core version for a goal of $750k. The much more interesting 64-core version is available only if a whopping $3m goal is met. Way out of reach for such a specialized interest project. And everyone who reads information about the parallella reads about the "sexy" 64-core version everywhere but can only fund the "just nice" 16-core version. From the comments it is clear: everyone wants the 64-core version.
2. There is only one interesting pledge: $99 for the 16-core version. No addons. No extras etc.
3. The information from adapteva is lacking. Only today did they make the documentation available. But there are still no demos, and dozens of questions in the comments remain unanswered.
Compare this to a greatly successful campaign like for example the Digispark (a low cost "mini-arduino"): a lower easily reachable goal, lots and lots of extras and addons developed together and in response to the backers and a constant information and communication with the backers. I wanted to spend $20 on this project but finally spent $70 because of all the addons and how responsive the team was to the backers. Digispark achieved more than 6000% of its initial goal!
That said, what would I suggest for the Parallella kickstarter:
1. Go for the 64-core version. Bring the goal from $3m down to say $1.5m by dropping the 16-core version (should save almost $1m) and taking some bank loan (if you can present >1000 backers who pay >$1.5m that should be no problem).
2. Offer more than just a 64-core parallella for $199. Offer special versions for a higher price. Offer a dual-64-core version (with two epiphanies on it). Offer a "compute cluster": a little laser-cut box with a network, a power supply and slots for up to 8 parallellas. Offer those clusters equipped with 1-8 parallellas. Offer a "machine vision" parallella with a camera sensor attached to it .. and so on ....
3. Be more open and communicating with the community. Answer all questions in the comments. Put up some polls what backers want. Provide demos/tutorials etc.
Please don't take this personally. But I would really like to see this project succeed. .... and I want machine vision and a neural network brain for my quadrocopter (yep, world domination ... that's the plan!) ;)
Re:Kickstarter by ssam · 2012-10-07 19:51 · Score: 2

your wish on the clusters has been answered. http://www.kickstarter.com/projects/adapteva/parallella-a-supercomputer-for-everyone/posts/323994
i agree that the 64-core version is far more exciting than the 16-core version. i guess maybe they think there is a lot higher risk there (they have already made and tested a small number of 16-core chips, http://www.adapteva.com/wp-content/uploads/2011/06/adapteva_mpr.pdf )
Re:Kickstarter by naeger · 2012-10-08 11:02 · Score: 1

Yes yes yes ... looks great :) ...
The 64 cores version is still out of reach though :(

yeah whatever... by fredan · 2012-10-07 08:04 · Score: 1

the real question is:

how many double sha256 hashes can they do?
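For reference, "double sha256" just means hashing the hash. A minimal sketch for counting hashes per second on whatever machine you have handy is below. The 80-byte all-zero header is a placeholder, the function names are made up, and pure Python will be far slower than any tuned miner. The workload is all integer operations, so GFLOPS figures say little about it.

```python
import hashlib
import time


def double_sha256(data: bytes) -> bytes:
    """SHA-256 applied twice, as used for Bitcoin block headers."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def hashes_per_second(n=20000):
    """Time n double hashes over a dummy 80-byte header with a changing nonce."""
    header = bytes(80)  # placeholder header, all zero bytes
    start = time.perf_counter()
    for nonce in range(n):
        # Overwrite the last 4 bytes with the nonce, little-endian.
        double_sha256(header[:76] + nonce.to_bytes(4, "little"))
    elapsed = time.perf_counter() - start
    return n / elapsed


rate = hashes_per_second()
print(f"{rate:,.0f} double hashes per second")
```

The number you get is machine dependent, which is the point: it gives a baseline to compare any claimed figure for the Epiphany against.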

FPGA by vlm · 2012-10-07 08:07 · Score: 3, Interesting

To make parallel computing ubiquitous, developers need access to a platform that is affordable, open, and easy to use.

They promise the latter three, but "access" seems a bit lacking. Also they specifically left out performance but talk it up in separate marketing materials (5 watts for 45 GFLOPs etc)

Some other alternatives optimizing for local maxima in the solution set:

Just simulate in software, if you don't care about speed but want to learn to program parallel. Erlang? They seem to have a fixation on C, why not use the right tool?

Go to opencores.org and stick a zillion cores on an off-the-shelf FPGA dev board. Or a fat stack of picoblaze or microblaze if you're willing to deal with the annoying licensing hassles (my advice, stick with opencores to avoid legal hassles, the weird licensing for the *blaze family is like the creepy dude in a van offering kids "free" candy)

They seem spread a bit thin based on clicking around the website. They're doing everything but invent hard AI and the warp drive on their website, which is a lot for just 4 people. Their kickstarter seems pretty firmly grounded in comparison.

One of those "infinite spare time" play toys would be to stick a bunch of 6809 cores (or pdp-8s or -11s or Z80s or whatever) on one of my FPGA boards and figure out the glue logic. Anyone with a big enough board could download my VHDL/Verilog and go for it on their own hardware.


Re:FPGA by Kjella · 2012-10-07 08:24 · Score: 1

Just simulate in software, if you don't care about speed but want to learn to program parallel.
I was thinking virtualization, how hard would it be to virtualize more cores than you physically have in the same VM? Just make say 16 virtual cores point to the same physical core as 16 different processes and you'll have a "64 core" machine on a quad core. Of course you'll get less total performance but you'll very quickly see if your application actually scales before you get a real massively parallel box.
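You can get a cheap version of this without a VM by oversubscribing workers. The sketch below splits a job across 64 "virtual cores" regardless of how many physical cores exist. It uses threads, which in CPython share one interpreter lock, so it checks that the decomposition is correct rather than showing any speedup. The function names and workload are made up for illustration.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def partial_sum(chunk):
    """Work done by one virtual core: sum of squares over its slice."""
    return sum(x * x for x in chunk)


def run_on_virtual_cores(data, virtual_cores):
    """Give each virtual core a strided slice of the data, then combine."""
    chunks = [data[i::virtual_cores] for i in range(virtual_cores)]
    with ThreadPoolExecutor(max_workers=virtual_cores) as pool:
        return sum(pool.map(partial_sum, chunks))


data = list(range(10_000))
serial = partial_sum(data)
parallel = run_on_virtual_cores(data, virtual_cores=64)

print(f"physical cores: {os.cpu_count()}, results match: {serial == parallel}")
```

If the results match with 64 workers on a quad core, the partitioning logic is at least sound. Whether it scales on real hardware still depends on communication and memory, which this does not model.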

--
Live today, because you never know what tomorrow brings
Re:FPGA by Anonymous Coward · 2012-10-07 08:47 · Score: 0

You seem to be talking like sticking a bunch of cores onto an FPGA would be something that a couple of thousand people would be able or willing to do in less than a month, because it's so much easier and less time consuming than funding $99 and getting a board in 20 days.
Re:FPGA by Anonymous Coward · 2012-10-07 09:03 · Score: 1

Also, the difficulty is not sticking a bunch of cores on an FPGA, the difficulty is to create an architecture where those cores can actually communicate in an efficient manner, both with each other and with off-chip memories.
Re:FPGA by Anonymous Coward · 2012-10-07 09:58 · Score: 0

Do you have a recommendation on a cheap and easily deployable FPGA board for this?
Can't actually see how the possibility to design the processors in FPGAs makes the actual production of the devices by Adapteva invalid? Of course 64 cores is just a toy, but as they say their current chip is pretty small and should be easily increased in size with proper funding, such that they could get up to 1024 next year. Together with double precision floating point arithmetic, that would make the chip quite attractive.
Re:FPGA by BitZtream · 2012-10-07 12:27 · Score: 0

QEMU will happily simulate far more cores than are actually in the machine it's running on.

--
Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager
Re:FPGA by Anonymous Coward · 2012-10-07 21:51 · Score: 0

I went for the Atlys board http://www.digilentinc.com/Products/Detail.cfm?NavPath=2,400,836&Prod=ATLYS , it is $350 for a Spartan 6 LX 45 and you get a big discount as a student. Sadly it is just sitting on my desk at the moment.

Still very expensive for the performance by viperidaenz · 2012-10-07 08:10 · Score: 4, Informative

and the architecture is also very limiting.

16TFLOPS for $3000 or 0.09TFLOPS for $200. I'll stick to current hardware, thanks. 178x more processing power for 15x more money. I would also prefer that a "super computer" could address more than 4GB of RAM and have a memory bus wider than 64 bits. The architecture also limits the core cache to 64k.
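The 178x and 15x figures above are easy to check. This sketch uses only the numbers in the comment and adds a dollars-per-GFLOPS view. The variable names are made up.

```python
# Figures quoted in the comment above.
gpu_tflops, gpu_price = 16.0, 3000.0
parallella_tflops, parallella_price = 0.09, 200.0

perf_ratio = gpu_tflops / parallella_tflops   # how much more throughput
price_ratio = gpu_price / parallella_price    # how much more money

# Dollars per GFLOPS of peak throughput for each option.
gpu_dollars_per_gflops = gpu_price / (gpu_tflops * 1000)
parallella_dollars_per_gflops = parallella_price / (parallella_tflops * 1000)

print(round(perf_ratio), price_ratio)
print(round(gpu_dollars_per_gflops, 4), round(parallella_dollars_per_gflops, 2))
```

This only compares peak throughput per dollar. It leaves out power draw and the host system a GPU setup needs, which is where the replies push back.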

Re:Still very expensive for the performance by IAmR007 · 2012-10-07 10:48 · Score: 4, Informative

I agree. A 32-bit PGAS memory model is silly. Giving each core its own 32-bit address space and using MPI for communication would be much more useful. Then it could at least be a good learning tool for HPC programming techniques. Right now, it looks pretty useless.

Even GPGPU is limited in what it can do for HPC. There's a lot more to HPC than raw mathematical power. Memory is often the bottleneck, not the FPUs. The reason we even deal with multiple processors is that the performance increase of single cores has nearly stalled. Communication between multiple cores/processors is a very complicated thing as well, and getting good performance takes a lot more than hooking up a bunch of processors in a grid. For example, the supercomputer I work with has 90,112 2.3GHz cores and 90TB of RAM: 16 cores per chip in 704 blades, interconnected with a 3D torus network topology. It's the memory/cache size and speed and the network topology that make it a supercomputer. You could get the 800TFLOP/s in a much smaller package using GPUs, but the performance would be drastically lower. Even with the 64 cores the parallella could have, distributing the workload on a 64-core grid isn't easy. GPGPUs use work groups of smaller numbers of cores to make this sharing a bit easier to manage. They should have at least made the interconnect a 2D torus rather than a grid, which would cut the maximum path length roughly in half. For stuff like quantum mechanics, a 5D torus is optimal. Memory access is the key. This is a bit like comparing apples to oranges, but that's exactly my point: the thing is not a supercomputer.
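The grid-versus-torus point can be checked by brute force. The sketch below counts the worst-case number of hops between any two cores, assuming minimal routing where the hop count is the sum of the per-axis distances. The function names are made up, and an 8x8 layout for a 64-core chip is an assumption.

```python
from itertools import product


def hops(a, b, size, wrap):
    """Distance along one axis; wrap-around links let you go the short way."""
    d = abs(a - b)
    return min(d, size - d) if wrap else d


def diameter(rows, cols, wrap):
    """Worst-case hop count between any two nodes of a rows x cols network."""
    nodes = list(product(range(rows), range(cols)))
    return max(
        hops(r1, r2, rows, wrap) + hops(c1, c2, cols, wrap)
        for (r1, c1), (r2, c2) in product(nodes, nodes)
    )


mesh_8x8 = diameter(8, 8, wrap=False)   # plain grid
torus_8x8 = diameter(8, 8, wrap=True)   # grid with wrap-around links
mesh_4x4 = diameter(4, 4, wrap=False)
torus_4x4 = diameter(4, 4, wrap=True)

print(mesh_8x8, torus_8x8, mesh_4x4, torus_4x4)
```

For the 8x8 case the worst path drops from 14 hops to 8, so "roughly half" holds up. On the 16-core 4x4 it is 6 versus 4, which is a smaller win.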
Re:Still very expensive for the performance by Anonymous Coward · 2012-10-08 01:03 · Score: 1

Perhaps you shouldn't make the question "either / or". Having recently built a Top500 top 100 rank supercomputer, I can tell you that we would not have been able to do that for the budget we had without employing GPUs. Getting the same performance with CPUs only would have been impossible within the budget. GPUs are a tremendous bang-for-the-buck and do very well on the flops-per-watt metric also. (This is assuming that your applications can be modified to effectively use the GPUs. Not all problems are amenable but many are. Besides, newer GPUs are becoming more general purpose. Also, GPU programming is still new enough that there is still a bit of resistance to the change. Kind of like the transition from vector processors to MPI we went through a number of years ago... )

Parallax Propeller by Y2K+is+bogus · 2012-10-07 08:14 · Score: 5, Informative

The Parallax Propeller is a great multi-core chip to get started with. The chip is $7.95 and has 8 cores running at 80MHz. You can pick up the Quickstart board at Radio Shack for $40, including an overpriced RS USB cable (they normally retail for $25).

The Parallax Propeller is a much more economical way of getting started with multi-core programming. Parallax offers the PropTool, which provides SPIN and PASM language support. For C development you can get SimpleIDE which is a great IDE to get started with C programming on the Propeller, which uses a port of GCC.

SDK and Architecture Documentation Released by Anonymous Coward · 2012-10-07 08:25 · Score: 2, Informative

http://www.kickstarter.com/projects/adapteva/parallella-a-supercomputer-for-everyone/posts/323691

They have released their SDK and architecture documentation, worth a read.
Looks like an interesting platform, but the current performance does feel lackluster ...

Re:SDK and Architecture Documentation Released by Anonymous Coward · 2012-10-07 08:38 · Score: 0

TL;DR:
Working GCC target, OpenCL compiler and debugger
32-bit single-precision floating point only (no double-precision)
Core local store global-addressable on 32-bit flat address space; can R/W outside address space with proper address mapping support
32KB local store on each core, 32 GB/s @ 1 GHz
External interface 2 GB/s * 4 (four directions) = max 8 GB/s
Fetch instruction from global address space (inter-core/chip is slow as hell, of course)
Fast inter-core write, slow read requests (1 for each 8 cycles)

Difficult times... by Anonymous Coward · 2012-10-07 08:30 · Score: 1

The multi- and many-core market is about to get crowded. After Tilera (www.tilera.com) there are now Kalray (www.kalray.eu) and the P2012 platform from STMicroelectronics, both of which have produced silicon. And a lot of people are working on research stuff, including open-source projects like soclib (www.soclib.fr). And it's not yet clear who's going to use all these architectures, even though logically this should be the way to go.

Sending messages by Anonymous Coward · 2012-10-07 08:36 · Score: 1

If you were to send messages from one Parallella to another Parallella, would they be called Parallellagrams?

How to characterize performance by Kevster · 2012-10-07 08:49 · Score: 1

How many of these would it take to, say, ray-trace Call of Duty: MW3 in real-time, 60 FPS? Would it cost less than using a modern graphics card to do the usual non-ray-traced rendering? That would be pretty cool.

--
I always equivocate. Well, almost always.

Re:How to characterize performance by Anonymous Coward · 2012-10-07 09:52 · Score: 0

take a look at the issues of programming the cell processor in the PS3 for games and you'll realize how much of a pain it is.
Re:How to characterize performance by citizenr · 2012-10-07 10:25 · Score: 1

a LOT, considering it's 90 (was 100 a few days ago) Gflops single precision. This is 1/10 of a Radeon 7750.

--
Who logs in to gdm? Not I, said the duck.
Re:How to characterize performance by Anonymous Coward · 2012-10-07 10:33 · Score: 0

How many of these would it take to, say, ray-trace Call of Duty: MW3 in real-time, 60 FPS? Would it cost less than using a modern graphics card to do the usual non-ray-traced rendering?
Parallel processing in terms of raw computation power isn't your bottleneck there. Efficient intersection testing across large datasets (millions of primitives), is. Solve that problem, and you stand to garner a lot of interest - not just from gaming companies but also virtual optics proofing, hollywood, medical imaging industry, and a bunch of fields where ray tracing has very little to do with imaging solutions.

Green Arrays 144 computers on a chip @ 20 USD by John+Bokma · 2012-10-07 08:49 · Score: 2

"The GA144-1.20 chip, with 144 self-contained computers and software-defined I/O, is available in a 1cm x 1cm, 88-pin QFN package." $20 / each, minimum order 10 (as far as I know): http://www.greenarraychips.com/home/products/index.html 200 USD buys you 1440 cores...

--

Perl Programmer for hire

Re:Green Arrays 144 computers on a chip @ 20 USD by citizenr · 2012-10-07 10:24 · Score: 1

If you like Forth, I suppose.

Re:Green Arrays 144 computers on a chip @ 20 USD by Anonymous Coward · 2012-10-07 12:39 · Score: 0

One can always make an LLVM back end that targets Forth. It's not unlike most other chips out there, which few people program at the machine-code level.

Thank goodness! by eyegone · 2012-10-07 09:17 · Score: 1

The masses are just dying for massively parallel systems.

--
"They that can give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety."

Cheap or High Performance, PickOne by gentryx · 2012-10-07 10:07 · Score: 2, Insightful

Adapteva is creating false expectations here. Their chip won't deliver performance on par with GPUs (or CPUs, for that matter) and still be cheap. Why? Because it's not a thing that a startup can do in today's world of computing. For such a chip you need to use the latest CMOS processes and a huge team to design/optimize the ASIC (especially if it's meant to be a low power chip) -- both of which are extremely costly. If it was that easy, then we'd see more competition and not Intel, AMD, Nvidia and IBM as the only global players in the HPC arena.

If you're a small startup, then you'll be bound to 100nm processes (at best), and have to use automated layouts (not the hand-optimized ones e.g. Intel uses). Both reduce performance, increase power intake.

I work at the Chair for Computer Architecture at FAU. We have some of the very brightest minds working on custom chips for industry solutions. This 2D CPU matrix that Adapteva proposes is something that my colleagues played with years ago. It's a good approach and I personally believe that this will be the shape of CPUs to come. It started with the ring bus on the IBM Cell, now Intel's Nehalem has got a partitioned L3 cache connected with a... ring bus and Intel's Xeon Phi (MIC) even got a 2D on-chip grid network. But even my colleagues concede that a) on FPGAs you'll always be trailing GPUs in floating point performance (it's something FPGAs are particularly bad at) and b) even when designing an ASIC you'll always be beaten by GPUs in terms of performance, assuming similar prices and power consumption. Those are simply beasts, optimized down to the bone. It's the result of a multi-billion mass market. That's also the reason why there is no next IBM Cell chip for a PlayStation 4: Cell was too expensive to develop to keep up with the competition. Its market is too small compared to the ubiquitous GPUs.

For teaching parallel computing I'd always suggest a GPU. The tools are there, the performance is great and you'll be able to use the knowledge gained in real-world projects.

--
Computer simulation made easy -- LibGeoDecomp

Re:Cheap or High Performance, PickOne by ssam · 2012-10-07 10:49 · Score: 1

>If you're a small startup, then you'll be bound to 100nm processes (at best), and have to use automated layouts (not the hand-optimized ones e.g. Intel uses). Both reduce performance, increase power intake.
They are planning to use "GlobalFoundries’ 28nm SLP technology" www.adapteva.com/wp-content/uploads/2011/06/adapteva_mpr.pdf
Re:Cheap or High Performance, PickOne by naeger · 2012-10-07 10:55 · Score: 3, Interesting

100nm process? ... Well, if you had read the information provided you would know that the 16-core version from the kickstarter is done in a 65nm process and the 64-core version is done in the 28nm process in cooperation with Globalfoundries.
And for the GPUs: yes, I know that a modern GPU (or even a core i7) is more powerful. But unfortunately I cannot plug a modern GPU into my mobile robot/drone/quadrocopter in order to do things like real-time vision processing/neural networks/machine learning/AI. The epiphany consumes something between 2-5 Watts (in words: TWO watts for 64-cores). I am currently not aware of anything coming close to the performance of the parallella for the mobile vision processing applications mentioned above.
PS: I know that the raspberry pi has quite a powerful GPU. But its GPU is locked down by NDAs and NOT accessible for OpenCL or GPGPU.
Re:Cheap or High Performance, PickOne by gentryx · 2012-10-07 10:55 · Score: 1

Yes, I know. But "planning to" is not the same as actually doing so. That process is expensive: mask creation alone costs a fortune. It'll only work if they order millions of chips up front. Not exactly the thing you can do if your funding is a Kickstarter project.

--
Computer simulation made easy -- LibGeoDecomp
Re:Cheap or High Performance, PickOne by Anonymous Coward · 2012-10-07 10:57 · Score: 0

Adapteva isn't bound to 100nm processes. They claim to already have a manufacturer lined up to use 65nm and 28nm processes. They're not manufacturing the chips themselves.
Re:Cheap or High Performance, PickOne by gentryx · 2012-10-07 11:13 · Score: 2

OK, maybe 100nm was a bit too much, yet I don't see the 28nm coming. Just to give you a comparison: Samsung is manufacturing the brand new Galaxy S3 SOC in 40nm. Why don't they use 28nm? Don't they want it? Hell yes, but it's not that easy. Think about that.
The power argument and the architecture's openness are sensible, I don't argue against that. Yet, the performance per Watt seems grossly inflated. If you look at today's most power efficient HPC chip, the CPU of IBM's Blue Gene/Q, then you'll see that they achieve less than 4 GFLOPS/Watt. Adapteva claims more than 9. So they're twice as good as IBM? Really?

--
Computer simulation made easy -- LibGeoDecomp
Re:Cheap or High Performance, PickOne by gentryx · 2012-10-07 11:15 · Score: 2

They need to make the chip designs. These are specific to the manufacturing process, and that's the tricky part. Chip simulation and validation are expensive in terms of compute time and labor. That's all I'm saying.

--
Computer simulation made easy -- LibGeoDecomp
Re:Cheap or High Performance, PickOne by StripedCow · 2012-10-07 12:35 · Score: 1

And what would you suggest if the application is highly memory-bandwidth bound and uses only simple integer arithmetic?
In that case, GPUs still won't cut it, because their memory bus easily gets congested.

--
If Pandora's box is destined to be opened, *I* want to be the one to open it.
Re:Cheap or High Performance, PickOne by gentryx · 2012-10-07 12:47 · Score: 1

Is there an architecture with a memory bandwidth superior to GPUs? Nope. Commercial FPGA boards get in the range of O(20 GB/s), but definitely not O(200 GB/s), which is where GPUs stand.
However, (Nvidia) GPUs are not designed for heavy integer arithmetic. FPGAs do this quite well and even though their DRAM controllers are generally worse, they can often avoid much memory traffic at all by keeping intermediate data in their comparatively large on-chip SRAM. BTW: Xilinx have just recently announced their first products for 28 nm production...

--
Computer simulation made easy -- LibGeoDecomp
Re:Cheap or High Performance, PickOne by docmordin · 2012-10-07 14:39 · Score: 2

And for the GPUs: yes, I know that a modern GPU (or even a core i7) is more powerful. But, I unfortunately cannot plug a modern GPU into my mobile robot/drone/quadrocopter in order to do things like real-time vision processing/neural networks/machine learning/AI. The epiphany consumes something between 2-5 Watts (in words: TWO watts for 64-cores). I am currently not aware of anything coming close to the performance of the parallella for the mobile vision processing applications mentioned above.
If you have around $3-6M (USD) to spare, I could have a 25mm x 25mm chip fabricated, using 28nm CMOS technology at either TSMC or GlobalFoundries, with a 2-core ARM Cortex-A9 and a custom 384-core MIMT architecture, the latter of which would hit above 500 GFLOPS in single-precision peak performance.
Re:Cheap or High Performance, PickOne by ssam · 2012-10-07 20:06 · Score: 1

>Adapteva claims more than 9. So they're twice as good as IBM? Really?
Epiphany has a customised core that only has instructions useful for floating point and fetching data. They have also changed from a hierarchy of caches to a more network-like method of moving data around. It's a scale of general-purposeness: there are things that a full CPU can do well that will choke an Epiphany, and there are things an Epiphany can do well that will choke a GPU.
Re:Cheap or High Performance, PickOne by Simon+Brooke · 2012-10-07 21:51 · Score: 1

If you have around $3-6M (USD) to spare, I could have a 25mm x 25mm chip fabricated, using 28nm CMOS technology at either TSMC or GlobalFoundries, with a 2-core ARM Cortex-A9 and a custom 384-core MIMT architecture, the latter of which would hit above 500 GFLOPS in single-precision peak performance.
MIMT? Do you mean MIMD, or is this some new acronym I don't know about (and probably should)?

--
I'm old enough to remember when discussions on Slashdot were well informed.
Re:Cheap or High Performance, PickOne by Anonymous Coward · 2012-10-07 22:02 · Score: 0

If you have around $3-6M (USD) to spare, I could have a 25mm x 25mm chip fabricated, using 28nm CMOS technology at either TSMC or GlobalFoundries, with a 2-core ARM Cortex-A9 and a custom 384-core MIMT architecture, the latter of which would hit above 500 GFLOPS in single-precision peak performance.
Is this price assuming first-time-right? That would be very impressive if you could pull this off. Even experienced teams at Intel or IBM need several costly respins for such a complex design.
Re:Cheap or High Performance, PickOne by Anonymous Coward · 2012-10-07 23:09 · Score: 0

In the ...but does it run linux? (no, really) thread is my misplaced calculation that the biggest FPGAs reach 20 Terabytes/s internal memory bandwidth. If true and usable it is huge.
Re:Cheap or High Performance, PickOne by Anonymous Coward · 2012-10-08 05:16 · Score: 0

you're so full of sh**.
Samsung's SoC contains RF components that are not supported on a 28nm process yet.
Re:Cheap or High Performance, PickOne by docmordin · 2012-10-08 23:15 · Score: 1

MIMT (multiple instructions, multiple threads) is a term that I coined in one of my recent journal papers, which I just sent out for review, for a ray tracing architecture. While there were, arguably, better terms that I could have employed, e.g., coherent multi-threading, I preferred MIMT, since it immediately lets readers know that the work is different from the current SIMT (single instruction, multiple threads) paradigm in commodity graphics hardware.
Re:Cheap or High Performance, PickOne by docmordin · 2012-10-09 11:01 · Score: 1

You're correct that it would be quite expensive, considering that just a 28nm mask alone runs around $2.8M to $3M (USD) these days. However, with around $6M (USD) in hand, I'd be able to get some investors on board to match or even triple that amount, which would give me a better amount of wiggle room.
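
To put the mask figure in perspective, here is a rough amortization sketch. It takes the upper $3M mask price quoted above and spreads it over a few production volumes; the volumes are hypothetical, and the function name is made up for illustration. It ignores wafer, packaging, test and respin costs entirely.

```python
def mask_cost_per_chip(mask_cost_usd, units):
    """Share of a one-off mask set cost carried by each chip sold."""
    if units <= 0:
        raise ValueError("units must be positive")
    return mask_cost_usd / units


MASK_COST_USD = 3_000_000  # upper end of the range quoted in the comment

# Hypothetical production volumes, chosen only to show the trend.
for units in (10_000, 100_000, 1_000_000):
    share = mask_cost_per_chip(MASK_COST_USD, units)
    print(f"{units:>9} chips: ${share:,.2f} of mask cost per chip")
```

Under these assumptions the mask share only drops to a few dollars per chip once volumes reach the millions, which is the point made earlier in the thread about needing large up-front orders.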

Re:Headline Should Be: An Open Bitcoin Architectur by Anonymous Coward · 2012-10-07 10:44 · Score: 1

1. mine bitcoins on your parallella
2. convert bitcoins to USD
3. travel back in time with USD
-1. use USD to fund the kickstarter for parallella ....
. profit

Palindrome by libtek · 2012-10-07 13:37 · Score: 1

"allella" sounds so much kewler

--
Unequivocally the realest of the realz...

Epiphany's Architecture by metatheism · 2012-10-07 14:19 · Score: 3, Informative

Have a look at The Register's article for some details.

The Epiphany core has a mere 35 instructions – yup, that is RISC alright – and the current Epiphany-IV has a dual-issue core with 64 registers and delivers 50 gigaflops per watt. It has one arithmetic logic unit (ALU) and one floating point unit and a 32KB static RAM on the other side of those registers.

Each core also has a router that has four ports that can be extended out to a 64x64 array of cores for a total of 4,096 cores. The currently shipping Epiphany-III chip is implemented in 65 nanometer processes and sports 16 cores, and the Epiphany-IV is implemented in 28 nanometer processes and offers 64 cores.

The secret sauce in the Epiphany design is the memory architecture, which allows any core to access the SRAM of any other core on the die. This SRAM is mapped as a single address space across the cores, greatly simplifying memory management. Each core has a direct memory access (DMA) unit that can prefetch data from external flash memory.

The initial design didn't even have main memory or external peripherals, if you can believe it, and used an LVDS I/O port with 8GB/sec of bandwidth to move data on and off the chip from processors. The 32-bit address space is broken into 4,096 1MB chunks, one potentially for each core that could in theory be crammed onto a single die if process shrinking continues.
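
The address split described above can be sketched in a few lines. This is only an illustration: it assumes the top 12 bits of a 32-bit address pick the core (4,096 cores), that those 12 bits divide evenly into a 6-bit row and a 6-bit column to match the 64x64 array, and that the low 20 bits are the offset inside that core's 1MB chunk. The function names are hypothetical and not taken from any Adapteva tool.

```python
OFFSET_BITS = 20   # 1MB per core -> 20 bits of offset
COL_BITS = 6       # assumed: 64 columns
ROW_BITS = 6       # assumed: 64 rows


def split_address(addr):
    """Split a 32-bit global address into (row, col, offset)."""
    if not 0 <= addr < 2**32:
        raise ValueError("address must fit in 32 bits")
    core_id = addr >> OFFSET_BITS
    offset = addr & ((1 << OFFSET_BITS) - 1)
    row = core_id >> COL_BITS
    col = core_id & ((1 << COL_BITS) - 1)
    return row, col, offset


def make_address(row, col, offset):
    """Build a 32-bit global address from (row, col, offset)."""
    if not (0 <= row < 2**ROW_BITS and 0 <= col < 2**COL_BITS):
        raise ValueError("row/col out of range for a 64x64 array")
    if not 0 <= offset < 2**OFFSET_BITS:
        raise ValueError("offset must fit in 1MB")
    return (((row << COL_BITS) | col) << OFFSET_BITS) | offset


addr = make_address(32, 8, 0x100)
print(hex(addr), split_address(addr))
```

Because the core coordinates live in the address itself, a plain load or store to such an address is enough to reach another core's SRAM, which is what makes the single flat address space convenient.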

...but does it run linux? (no, really) by dirtyhippie · 2012-10-07 14:48 · Score: 1

seriously, though, what does it run? the article doesn't say except to use the nebulous term "open source". or are they planning on schlepping off the initial software development to the open source community too? (good luck with that)

Re:...but does it run linux? (no, really) by Dade916 · 2012-10-07 19:30 · Score: 1

It runs a standard Ubuntu: http://www.youtube.com/watch?v=YP30_PjSwug
Re:...but does it run linux? (no, really) by Simon+Brooke · 2012-10-07 21:22 · Score: 1

seriously, though, what does it run? the article doesn't say except to use the nebulous term "open source". or are they planning on schlepping off the initial software development to the open source community too? (good luck with that)
Seriously, though, UN*X-like operating systems for massively parallel architectures have been written in the past - one I'm familiar with is Helios. The Linux kernel is not optimised for massively parallel architectures, so that I doubt it would be easy to port Linux in such a way that it made efficient use of the parallelism of the Epiphany architecture. But the kernel is a relatively small part of a distribution. However, writing a new kernel is not an unfeasible task: Linus, after all, wrote the original Linux kernel (up to 0.99) largely by himself in less than two years, and it worked - I know, I used it. So something like an OpenHelios project, given that the documentation exists, would not be an over-ambitious open source project.

--
I'm old enough to remember when discussions on Slashdot were well informed.
Re:...but does it run linux? (no, really) by ssam · 2012-10-07 22:02 · Score: 1

all the compiler stuff is merged into GCC as of 4.7 http://gcc.gnu.org/gcc-4.7/changes.html
Re:...but does it run linux? (no, really) by Anonymous Coward · 2012-10-07 22:29 · Score: 0

I just browsed the Virtex 7 datasheets and if I am correct the internal memory bandwidth is huge.
Biggest FPGA has 1880 memory blocks (total size 68 Megabit) which can be split in two smaller ones -> 3760
one write port and one read port with 36 bit each for every block
max frequency is 601 MHz
gives a total of roughly 20 Terabyte/s!
But I have a hard time imagining a useful application design using 3760 parallel data streams which can be implemented at these frequencies. Does anybody know an existing design approaching these numbers?
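
For anyone who wants to check the arithmetic, here is a small sketch that multiplies out the figures listed above, taken at face value (3760 blocks, one 36-bit read port and one 36-bit write port per block, 601 MHz). Whether a split block really keeps 36-bit ports is an assumption carried over from the post, not something verified against the datasheet.

```python
BLOCKS = 1880 * 2          # blocks after splitting each one in two
PORTS_PER_BLOCK = 2        # one read port plus one write port
BITS_PER_PORT = 36
FREQ_HZ = 601_000_000      # 601 MHz


def aggregate_bandwidth_bytes_per_s(blocks, ports, bits, freq_hz):
    """Peak bytes per second if every port moves a full word every cycle."""
    bits_per_cycle = blocks * ports * bits
    return bits_per_cycle * freq_hz // 8


total = aggregate_bandwidth_bytes_per_s(
    BLOCKS, PORTS_PER_BLOCK, BITS_PER_PORT, FREQ_HZ
)
print(f"{total:,} bytes/s = {total / 1e12:.1f} TB/s")
```

With these inputs the product lands at about 20 x 10^12 bytes per second, i.e. on the terabyte scale rather than the petabyte scale (10^15). It is a theoretical peak that assumes every port is busy on every cycle.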
Re:...but does it run linux? (no, really) by Anonymous Coward · 2012-10-07 22:44 · Score: 0

This reply was supposed to be at the Re:Cheap or High Performance, PickOne thread

I like their approach by FithisUX · 2012-10-07 20:17 · Score: 1

where is my RapsberyParallela?

Re:I like their approach by Anonymous Coward · 2012-10-08 04:47 · Score: 0

mmm, rapberry paella.

Some questions for those who may know... by Simon+Brooke · 2012-10-07 21:04 · Score: 1

OK, I'm giving up my power to moderate on this story to ask a few questions. Let's hope the answers are worth it...

I'm assuming this thing is MIMD. Separate processors with separate memory seems to me to imply that. Am I right?
How does this design relate to the old Inmos Transputer, which, from what I recall, was conceptually fairly similar? Is it a development of the same ideas, or is it something completely different? How does it compare to the Meiko Computing Surface, which was a system built on transputers?
How does it compare conceptually to the Connection Machine CM1/CM2 architecture?
How feasible would it be to implement a JVM on top of Epiphany, such that new JVM threads were mapped where possible to currently-idle cores?

My understanding, so far, for what it's worth, is that the key features of the Epiphany architecture are:

Each processor is able to address four other processors directly, through its own router, implying a grid or surface architecture rather than a cube or hypercube architecture.
There is a single flat memory map, of which each processor 'owns' a local page; but each processor can read from (? and write to ?) pages belonging to other processors. There is presumably some overhead in accessing other node's memory(?)

--
I'm old enough to remember when discussions on Slashdot were well informed.

Re:Some questions for those who may know... by JAMilrod · 2012-10-08 05:07 · Score: 1
I can't answer all of those questions (quite a few!), but I can address some of them:

I'm assuming this thing is MIMD. Separate processors with separate memory seems to me to imply that. Am I right?
- Yes, you are correct. Epiphany is MIMD.

How does this design relate to the old Inmos Transputer, which, from what I recall, was conceptually fairly similar? Is it a development of the same ideas, or is it something completely different?
- Transputers are now fairly antiquated - they were 8-bit engines! However, the basic concepts are pretty similar. It's been a long time since I thought about Transputer implementation details, but the biggest, most obvious difference in my mind is the standard, open programming environment... Inmos was very pigheaded about only supporting OCCAM; the CTO was quoted saying "we'll support C when someone can show me something that can be coded in C that can't be coded in OCCAM"... clearly he was missing the point, so they missed the market.

My understanding, so far, for what it's worth, is that the key features of the Epiphany architecture are:
1. Each processor is able to address four other processors directly, through its own router, implying a grid or surface architecture rather than a cube or hypercube architecture.
- Not exactly correct. Any core can directly address any other core, but with added latency beyond the nearest neighbors. Theoretically, cubes etc... could be implemented, but Epiphany is most naturally suited to a 2d grid.

There is a single flat memory map, of which each processor 'owns' a local page; but each processor can read from (? and write to ?) pages belonging to other processors. There is presumably some overhead in accessing other node's memory(?)
- I'm not a SW guy, but I don't think the concept of 'owning' is correct. Each processor node physically has local memory, but it can be 'owned' by another processor. Assuming a non-blocked optimized network and the availability of the memory resource, accessing any memory on the same chip has no overhead (i.e. the bandwidth can be the same as addressing the processor's local memory) other than the added latency of going through the routers (1.5 cycles per node).
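
As a rough illustration of the latency figure above, the sketch below estimates the added latency between two cores on a 2D grid. It assumes the number of router nodes traversed equals the Manhattan distance between the cores (as with simple row-then-column routing) and uses the 1.5 cycles per node quoted above; both the routing assumption and the function names are illustrative, not taken from Adapteva documentation.

```python
CYCLES_PER_HOP = 1.5  # per-node router latency quoted in the comment


def hops(src, dst):
    """Manhattan distance between two (row, col) core positions."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def added_latency_cycles(src, dst):
    """Extra cycles for a remote access, under the assumptions above."""
    return CYCLES_PER_HOP * hops(src, dst)


# Nearest neighbour versus opposite corners of an 8x8 (64-core) grid.
print(added_latency_cycles((0, 0), (0, 1)))  # one hop
print(added_latency_cycles((0, 0), (7, 7)))  # 14 hops
```

Under these assumptions a nearest-neighbour access costs 1.5 extra cycles and a corner-to-corner access on a 64-core grid costs 21, which fits the remark that any core can reach any other but placement of data still matters.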

$99 for a dual core arm by Anonymous Coward · 2012-10-08 00:21 · Score: 0

$99 for a dual core ARM dev board, with gigabit network, is not bad, even if you completely ignore the Epiphany co-processor. It's about as powerful as a PandaBoard, and much cheaper. If you can take advantage of the Epiphany you are well ahead.

Tilera by nellaj · 2012-10-08 01:46 · Score: 2

Besides the open-source, how is this project any different from what Tilera already has? http://www.tilera.com/

Re:Tilera by Anonymous Coward · 2012-10-08 01:53 · Score: 0

where do you buy them? how much do they cost?
Re:Tilera by JAMilrod · 2012-10-08 04:44 · Score: 2

Tilera is not really floating-point (although they have planned support later), has caching that abstracts stuff (i.e. you can't control it), requires proprietary tools, and has integrated peripherals (great if they have exactly what you want, but otherwise wastes real estate and power).

Re:Headline Should Be: An Open Bitcoin Architectur by ahfoo · 2012-10-08 03:51 · Score: 1

When they taped out first silicon last year there was talk of its potential as a game emulator for the PS2 on cell phones.

This doesn't count as Open and certainly not Free by Theovon · 2012-10-08 05:39 · Score: 1

The thing about CPUs that makes Adapteva's statement not particularly impressive is that in almost all cases, the ISA of a CPU _must_ be published, otherwise you can't get developers to write code for it. But an ISA is just a language, not an implementation. A CPU that is not "open" by their definition is completely worthless.

GPUs are much worse because they've always been peripherals, hidden behind a driver, which is responsible for generating rendering commands from OpenGL and JIT-compiling virtual instruction sets like PTX.

Re:This doesn't count as Open and certainly not Fr by ssam · 2012-10-08 10:16 · Score: 1

The ISA is in appendix A: http://www.adapteva.com/support/docs/e3-reference-manual/

Embedded Supercomputer by Anonymous Coward · 2012-10-08 16:13 · Score: 0

After looking into this I think it is aimed at robotics/embedded DSP. This won't compete in the GPU space, but then GPUs don't really do much in the embedded space, other than drive a VDP.

To me this looks more like the approach that XMOS took, but with less capability in the I/O ring, and more generic, ANSI C friendly, support. If this had a better path to running as a stand-alone chip with a QFP pinout it would be far more compelling.

It is a shame that they seem to have screwed up their marketing on Kickstarter. If they had generated more buzz early on, they might have made their extended goal.

So far this is the best example of a 2D general purpose computing fabric I have seen that is something other than total vaporware. (they do appear to have x16 chips and demo boards, even if the prices are a little high)

Re:This doesn't count as Open and certainly not Fr by Theovon · 2012-10-09 01:04 · Score: 1

I'm not sure if you're trying to correct me or not. I assumed that the ISA would be published, along with lots of architectural details. I'm just saying that they HAVE to be published or else the CPU is worthless, so by saying that they're published, Adapteva isn't doing anything special.

Slashdot Mirror

103 comments