Scaling To a Million Cores and Beyond
mattaw writes "In my blog post I describe a system designed to test a route to the potential future of computing. What do we do when we have computers with 1 million cores? What about a billion? How about 100 billion? None of our current programming models or computer architecture models apply to machines of this complexity (and with their corresponding component failure rate and other scaling issues). The current model of coherent memory/identical time/everything can route to everywhere just can't scale to machines of this size. So the scientists at the University of Manchester (including Steve Furber, one of the ARM founders) and the University of Southampton turned to the brain for a new model. Our brains just don't work like any computers we currently make. Our brains have a lot more than 1 million processing elements (more like the 100 billion), all of which don't have any precise idea of time (vague ordering of events maybe) nor a shared memory; and not everything routes to everything else. But anyone who argues the brain isn't a pretty spiffy processing system ends up looking pretty silly. In effect, modern computing bears as much relation to biological computing as the ordered world of sudoku does to the statistical chaos of quantum mechanics."
With his Connection Machine.
Don't remember how many cores that one had...
This 1-million core machine better be running open source software and not proprietary software. You know who runs proprietary software? Microsoft and Apple. Yeah, they make a lot of money, but they'd make much more money if they gave away their software and sold support licenses.
That's about 30 fork()s, and there you go.
839*929
Simply put, there are some computational problems that work well with parallelization. And there are some where, no matter how you try to approach them, you come back to a serial-based model. You could have a billion-core machine running at 1 GHz get stomped by a single-core machine running at 1.7 GHz for certain computational processes. We have yet to find a way, computationally or mathematically, to make intrinsically serialized problems into parallel ones. If we did, it would probably open up a whole new field of mathematics.
#fuckbeta #iamslashdot #dicemustdie
Isn't that a neural network? You know, the things that have been researched for over 40 years and do some things great, but mostly do one thing well and are very poor at doing things in general.
I've left out links to some projects, by request, but everything can be found on their homepage anyway. Anyways, it is this combination that is important, NOT one component alone.
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
don't have any precise idea of time (vague ordering of events maybe) nor a shared memory; and not everything routes to everything else
Sounds like a large scale distributed system. Maybe somebody should ask google about this.
http://michaelsmith.id.au
What you are describing are the problems around distributed systems. What would I do with a billion cores? Run tens of millions of instances of VMWare (x8 or 16 each) and write distributed code that runs on millions of machines. No shared memory, communication channels which are slow compared with computation? Basically, that's the line between distributed systems and non-distributed systems. Not that most distributed systems problems are solved, but this is the model that we would be investigating assuming no major shift in the computational model (Turing vs quantum, etc).
"Those that start by burning books, will end by burning men."
"Our brains have a lot more than 1 million processing elements and not everything routes to everything else"
That's why I can watch pr0n and code at the same time!
The problem posed by the author is somewhat of a straw man argument: "The trouble is once you go to more than a few thousand cores the shared memory - shared time concept falls to bits."
Multiple processors in a single multicore aren't required even today to be in lockstep in time (it is actually very difficult to do this). Yes, locally within each core and its private caches they do maintain a synchronous clock, but cores can run in their own clock domains. So I don't buy the argument about scaling with "shared time".
Secondly, the author states that the "future" of computing should automatically be massively parallel. Clearly they are forgetting about Amdahl's Law (http://en.wikipedia.org/wiki/Amdahl's_law). If your application is 99.9% parallelizable, the MOST speedup I can expect to achieve is 1000X, forget about millions. High sequential performance (ala out-of-order execution, etc.) will not be going away anytime in the near future simply because they are best equipped to deal with serial regions of an application.
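To put numbers on that ceiling, here is a minimal sketch of Amdahl's Law, speedup = 1 / ((1 - p) + p / n), for the 99.9% parallelizable case above. The function name `amdahl_speedup` and the core counts are just illustrative choices, not anything from the blog post.

```python
def amdahl_speedup(parallel_fraction: float, cores: float) -> float:
    """Amdahl's Law: speedup = 1 / ((1 - p) + p / n)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # 99.9% parallelizable code on ever larger machines.
    for cores in (1_000, 1_000_000, 1_000_000_000):
        speedup = amdahl_speedup(0.999, cores)
        print(f"{cores:>13,} cores -> {speedup:7.1f}x speedup")
```

With p = 0.999 the serial 0.1% dominates long before a million cores: the speedup only creeps toward 1 / 0.001 = 1000X and never reaches it.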
Finally, I was under the impression that they were talking about fitting "millions" of cores onto a single die, until I read to the end of the post and saw that they are connecting multiple boards via multi-gigabit links. Each chip on a board has about 20 or so cores with private caches or local store. They talk to other cores on other boards through off-chip links...... SO isn't this just a plain old message passing computer?! What's the novelty here? Am I missing something?
Computers are consistent and predictable. The human brain is not.
We have billions of human brains cheaply available, so let's use those when we want a human brain. And let's use computers when we want computers.
tiny Forth-based computers with up to 144 cores on a chip, and that's in a low tech 180 nanometer process. Each core has a rather fast ALU but just a few hundred(?) bytes of memory. Seems closer to neurons than the thing that guy is making where each core is a 32-bit ARM processor.
Some folks with severely damaged brains seem to make better human computers than people with healthy brains. Rain Man leaps to mind, as well as other savants. It seems that when some parts of the brain are impaired, the energy of thought is diverted to narrower functions. Perhaps we need to think of delivering more energy to fewer cores to make machines that do tasks that normal humans are not so good at doing.
Is it coincidence that earlier this month there was a press release from IMEC regarding the issues of massively scaling up computational power ("exascaling")?
Press blurb can be found here.
Killer application would be "space weather prediction".
always seemed logical the next big step would be a return to analog :P
The problems with "coherent memory/identical time/everything can route to everywhere" aren't only seen when you get up to a million cores. I've done plenty of work with MPI and pthreads, and depending on how it's organized, a significant portion of these methods start showing inefficiencies when you get into just a few hundred cores.
Since there are already plenty of clusters containing thousands upon thousands of individual processors (which don't use coherent memory, etc.), the step to scale up to a million would likely follow the same logical development. There should already be one or two decent CS papers on the topic, since it's basically a problem that's been around since Beowulf clusters were popularized (or even before then).
and write an OpenGL app which puts multiple videos on the panes of multiple rotating truncated icosahedrons a la the old famous BeOS rotating OpenGL cube app from years ago.
...it seemed to me that Amdahl's law was still alive and kicking.
This is very similar to the Inmos Transputer, a mid-1980s system. It's the same idea: many processors, no shared memory, message passing over fast serial links. The Transputer suffered from a slow development cycle; by the time it shipped, each new part was behind mainstream CPUs.
This new thing has more potential, though. There's enough memory per CPU to get something done. Each Cell processor, with only 256K per CPU, didn't have enough memory to do much on its own. 20 CPUs sharing 1GB gives 50MB per CPU, which has more promise. Each machine is big enough that this can be viewed as a cluster, something that's reasonably well understood. Cell CPUs are too tiny for that; they tend to be used as DSPs processing streaming data.
As usual, the problem will be to figure out how to program it. The original article talks about "neurons" too much. That hasn't historically been a really useful concept in computing.
To make any computer mimic the design and function of the human brain would invite evolution and sentience. We want tools, not sentient machines.
Well, when you start thinking about 10^6 or more cores, it's pretty obvious that they cannot all be connected to each other and cannot all share memory. At that point you're in the realm of neural networking, and are looking at many, many serial (and parallel) tasks running in parallel.
When you look at the brain, it has evolved so that different areas have different purposes and techniques for processing data. There are some very highly specialized systems in there for very specific problems.
I saw Steve Furber talk at Retro Reunited in Huddersfield last year, where he spoke about the past (Acorn, the development of the ARM), the present, and the future in terms of what they were doing with SpiNNaker. Very interesting talk. (I also saw Sophie Wilson, another one of the original ARM developers, at Bletchley Park a couple of weekends ago, another fascinating talk. She now works for Broadcom designing processors for telecommunications.)
Oolite: Elite-like game. For Mac, Linux and Windows
The brain isn't like computers at all. The brain is compartmentalized. There are dozens of separate pieces, each with its specialty. It's wired to other pieces in specific ways. There is no "Total Information Awareness"(tm) bullshit going on (what 1 million cores would give you). The problem with TIA is that there is too much crap to wade through. Too big a haystack to find the needle you need. What they found when analyzing Berger-Liaw speech recognition systems against other systems is that the Berger-Liaw system kept temporal (time-based) subtleties, in contrast to other speech recognition systems that were simply digital (with the clock/oscillator sampling in a Nyquist format, destroying or failing to capture temporal information). The Berger-Liaw system can best the best human listeners (which is why the US Navy got it instead of it becoming an available commercial product). It could act as a 'sonic input device' using only a tiny neural network (20 to 30 nodes) for superhuman input, instead of the digital ones, giving crappy results with 2048 or 4096 nodes. The brain is wired with a lot of 'specialty components' which use a sparse number of components to get the job done. Some of the excess appears to be redundant (although I am not a neuroscientist and could be wrong).
But anyone who argues the brain isn't a pretty spiffy processing system ends up looking pretty silly.
Wouldn't they have just proved the point then. In their case at least......which would mean that they weren't silly after all ..... so their comment would be silly .... which would prove their point.... [stack overflow]
> What do we do when we have computers with 1 million cores? What about a billion? How about 100 billion?
...run really awesome screensavers!
My opinion is that you should not require software to be parallelized from the start. You parallelize it during runtime or at compile time.
This makes sense because parallelization does not add anything in functionality (the outcome should not change). My point is: program functionality and configure/compile parallelization afterwards (possibly by power-users). There could be a unique selling point for open source: parallel performance because you can recompile.
nosig today
The Internet is at least in the 1 billion cores range. The way to use many of them for a parallel computation has been demonstrated by Seti@home, Folding@home and even by botnets. They might not be the most efficient implementations when you have full control of the cores but they show the way to go when the availability of the cores and the communication between them is unreliable, when they have different times and different clocks and when they might be preempted to do different tasks.
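The "hand the work unit out again when a core vanishes" idea can be sketched roughly as below. This is a toy model, not the actual SETI@home or Folding@home protocol; `unreliable_worker`, `run_project`, and the 30% failure rate are invented for illustration.

```python
import random
from collections import deque


def unreliable_worker(unit, rng, failure_rate=0.3):
    """Pretend volunteer machine: sometimes it is preempted and returns nothing."""
    if rng.random() < failure_rate:
        return None  # the machine went away or was given another task
    return sum(i * i for i in unit)  # the independent piece of work


def run_project(units, rng):
    """Hand out independent work units, reissuing any that never come back."""
    pending = deque(enumerate(units))
    results = {}
    while pending:
        unit_id, unit = pending.popleft()
        outcome = unreliable_worker(unit, rng)
        if outcome is None:
            pending.append((unit_id, unit))  # reissue later, order does not matter
        else:
            results[unit_id] = outcome
    return sum(results.values())


if __name__ == "__main__":
    units = [range(start, start + 100) for start in range(0, 1000, 100)]
    total = run_project(units, random.Random(42))
    print(total == sum(i * i for i in range(1000)))
```

The coordinator never assumes a shared clock or a reply deadline ordering; it only needs each unit to be independent of the others so that it can be retried anywhere.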
With 100+ cores we should start considering leaving von Neumann behind.
Separated memory/processing/instructions/registers would stop making sense. We would have to follow another model. I don't know which.
For possible future patent trolls, and to all professionals reading this... remember the following as "prior art" (kind of like marking bad HDD sectors, but marking bad processors as broken instead).
TITLE: System, method and device for reliable computing applications
CLAIM 1: Computer device, system and failure eliminating method for computer devices k n o w n for the computer system's ability to evaluate each processor's ability to operate without significant errors and, if such errors would be detected, the system could be capable of marking such processor units broken and thereby eliminate the future use of such broken processor units within the computer system.
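For what it's worth, the claim boils down to something like the sketch below: run a known-answer check on every unit and stop scheduling onto the ones that fail. Everything here (`ProcessorPool`, `self_test`, modelling a "processor" as a Python callable that adds two numbers) is a hypothetical stand-in, not a real hardware interface.

```python
from typing import Callable, Dict, List, Set


class ProcessorPool:
    """Tracks which processing units are trusted, like a bad-sector map for cores."""

    def __init__(self, units: Dict[int, Callable[[int, int], int]]) -> None:
        self.units = units           # unit id -> stand-in for that unit's adder
        self.broken: Set[int] = set()

    def self_test(self) -> None:
        """Run a known-answer test on every unit and mark the failures broken."""
        for unit_id, add in self.units.items():
            if add(2, 3) != 5:
                self.broken.add(unit_id)

    def healthy(self) -> List[int]:
        """Units that may still be scheduled."""
        return [uid for uid in self.units if uid not in self.broken]


if __name__ == "__main__":
    good = lambda a, b: a + b
    faulty = lambda a, b: a + b + 1  # simulated hardware fault
    pool = ProcessorPool({0: good, 1: good, 2: faulty})
    pool.self_test()
    print(pool.healthy())
```

A real system would need to repeat the test over time and catch intermittent faults, which a single known-answer check like this one would miss.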
All the human activity thus far, has failed to even begin to match the human brain made by "random chance", if you believe in that malarkey!
The brain does not do arithmetic, it only does pattern matching. That's what most people don't get and that's the obstacle to understanding and realizing AI.
If you ask how humans can then do math in their brain, the answer is simple: they can't, but a pattern matching system can be trained to do math by learning all the relevant patterns.
If you further ask how humans can do logical inference in their brain, the answer is again simple: they can't, and that's the reason people believe in illogical things. Their answers are the result of pattern matching, just like Google returning the wrong results.
The current model of coherent memory/identical time/everything can route to everywhere just can't scale to machines of this size
Why not? Obviously, you can't have a million processors accessing the same variable in memory, but with a layered system of caches, you could keep most processors working on their own local copy. As soon as a processor writes to memory that's also used by another processor, extra hardware will keep the memory coherent. This architecture is basically a superset of a message passing architecture (memory coherency signals are equivalent to messages), but much simpler for the CPU. Because the CPU isn't aware of the messages, this allows the coherency hardware to be improved without changing the program in any way.
Erlang is probably a good start.
A neuron is a fairly simple processing element, after all. Complexity comes from the sheer number of connections with other neurons that a single neuron can have.
In Soviet Russia, a Beowulf cluster of these imagines YOU!!!!!
Confucius say, "Find worm in apple - bad. Find half a worm - worse."
Our brains have billions of neurons, not billions of cores. These are completely different beasts when it comes to architecture.
The Wise adapts himself to the world. The Fool adapts the world to himself. Therefore, all progress depends on the Fool.
The Human brain is rather spiffy, but it's far from perfect. It has fantastic performance, but it can frequently screw up massively. I don't think mankind would be too pleased if their most powerful computers got depressed (like Marvin) - we'd probably expect to be able to use them, and to trust their output. If they work like the human brain, we can't always do both.
When are they going to get to a point where all of the RAM is on the same die with the processor core(s) that need access to that RAM? By shortening the path to the RAM from going off-chip to staying on-chip, the opportunity for increased speed and lower power consumption arises. And this can also be constructed more compactly, allowing more such complete processors within the same space. Then with more processors, at some point we no longer even need virtual memory for at least the bulk of the processors (the ones doing the heavy computational parts). By removing the virtual memory mapping hardware, things can get smaller and use even less power, giving even more computational capability.
Hopefully, if we keep this up, we'll end up with 2^256 processor cores inside a singularity and have the ability, given enough time, to simulate a complete universe.
now we need to go OSS in diesel cars
Why wouldn't E scale up to a billion cores?
As far as I know the E programming language is designed for this problem.
I heard a funny rumor that Google had like millions of PCs set up as servers, all clustered together to share the load of doing a search on their own custom-built file system. Is that not good enough for you, or does the fact that it does not come in a box with a nice little logo, with all sorts of imminent malware and trojan dangers, and sell for more than it should, keep it from being a viable option???
There, hope that helps.
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
65536 processes ought to be enough for anybody.
Actually, the single neuron is not simple. It's extremely complex, as complex as a CPU or more. Our neural models usually oversimplify each neuron and assign it some basic function, but that is far from reality. Every cell in our body is more complex than we have the ability today to model with even the most powerful computers. Even more strongly, we cannot even model a single large-chain protein (protein folding models are one of the hardest things tackled in computational biology), let alone an entire cell, and let's not even speak of something as complex as organs or things like the brain.
As the island of our knowledge grows, so does the shore of our ignorance.
No no - you had the golden chance and missed it!
You *license* the baby!
My first Journal Entry ever, in 8 years! http://slashdot.org/journal/365947/aphelion-scifi-fantasy-horror-poetry-webzine
anyone who argues the brain isn't a pretty spiffy processing system ends up looking pretty silly
Yet the brain fails miserably at incredibly easy calculations. Most applications require failure-tolerant, repeatable, and highly reliable results, none of which the brain is really good at. The brain might not be the best model for massively parallel processing. Sure, we can learn from it, but perhaps we can also do better. Even in AI research it is questionable whether a connectionist modeling of the brain is really the right approach. As anyone with a halfway demanding job knows, the brain is a fairly imperfect result of non-purposeful and undirected evolutionary processes, and it is easy to experience its limits in your daily work. (If not, you should perhaps get an intellectually more challenging job.) Even for the purpose of building AI we might be able to come up with something better or something else entirely than mimicking the brain.
We just wait for the hardware to catch up. http://en.wikipedia.org/wiki/Occam_(programming_language)
A programming trick that used to work was to unroll loops, to prevent the pipeline penalties that occur when you branch. It worked well for a while. My idea (the bitgrid) is based on the idea of unrolling the whole friggin program. Instead of making a list with fewer branches... why not distribute each and every instruction of a program out into a physical processing instance?
To make it feasible in hardware, use the simplest computing grid feasible: a grid of cells, each having 4 inputs (one bit from each neighbor), 4 outputs (one bit TO each neighbor) and a 16-entry lookup table. Each cell is a pitiful unit of computation by itself... but in a grid sized to fit the application at hand, they can execute all of the instructions necessary to compute a result simultaneously.
Communication isn't shared, because every input and output only has 1 place to go or come from. It only has to go to the next cell... so there are no long communication lines to worry about. Each cell can function as a router and logic element at the same time.
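A single cell of that kind can be sketched in a few lines. This is one guess at the encoding, not the actual bitgrid design: it assumes one table of 16 entries, each holding the 4 output bits, with bits ordered north, east, south, west. The names `make_cell` and `half_adder_entry` are made up for the example.

```python
def make_cell(lut):
    """Build a cell from a 16-entry lookup table of 4-bit output values."""
    if len(lut) != 16:
        raise ValueError("a cell needs exactly 16 table entries")

    def step(north, east, south, west):
        # The 4 input bits (one per neighbor) select one table entry.
        index = (north << 3) | (east << 2) | (south << 1) | west
        out = lut[index]
        # Unpack the entry into one output bit per neighbor (N, E, S, W).
        return ((out >> 3) & 1, (out >> 2) & 1, (out >> 1) & 1, out & 1)

    return step


def half_adder_entry(index):
    """Table entry for a cell adding its north and west input bits."""
    north = (index >> 3) & 1
    west = index & 1
    total = north ^ west   # sum bit leaves through the east output
    carry = north & west   # carry bit leaves through the south output
    return (total << 2) | (carry << 1)


HALF_ADDER = [half_adder_entry(i) for i in range(16)]

if __name__ == "__main__":
    cell = make_cell(HALF_ADDER)
    print(cell(1, 0, 0, 1))  # north=1, west=1
    print(cell(1, 0, 0, 0))  # north=1, west=0
```

Programming the grid then means filling in the tables of many such cells so that bits flow from neighbor to neighbor, which is exactly the part a conventional compiler does not do for you.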
Programming with Fortran or C or anything else you've heard of is RIGHT OUT. This is the big problem to solve.
It is this concept which I believe will get me some funding from the Exascale research programs at DARPA.
It was surreal signing up to be a defense contractor last week.
I don't know about seti, but I saw a presentation a few years back by a guy who was in with the folding at home folks. Each computer has a task that is independent of the current results of all others. Just a large bundle of serial jobs. This is something I am familiar with. A lot of the large clusters have queuing policies that favor jobs with large proc counts. Sometimes I need to run several small jobs. I bundle them up to look like one big n-proc job, and presto.
46 & 2
PORN
There is a relevant computing model for billions of processors: Single Instruction Multiple Data. I have always thought that this would be just what is needed for ray-trace image generation. You could assign one processor per pixel.
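As a rough illustration of "one processor per pixel", the sketch below applies the same tiny kernel to every pixel coordinate. It assumes a toy scene (orthographic rays, one sphere of radius 0.8 at the origin), and Python's `map` of course runs the calls one after another; on SIMD hardware each call would sit on its own processing element. `shade`, `WIDTH` and `HEIGHT` are illustrative names.

```python
WIDTH, HEIGHT = 16, 8
RADIUS = 0.8


def shade(pixel):
    """Same instructions for every pixel; only the coordinates differ."""
    x, y = pixel
    # Map the pixel centre to [-1, 1] x [-1, 1] and fire a ray straight ahead.
    u = (x + 0.5) / WIDTH * 2.0 - 1.0
    v = (y + 0.5) / HEIGHT * 2.0 - 1.0
    hit = u * u + v * v <= RADIUS * RADIUS
    return "#" if hit else "."


# One work item per pixel, all running the identical kernel.
pixels = [(x, y) for y in range(HEIGHT) for x in range(WIDTH)]
image = list(map(shade, pixels))

if __name__ == "__main__":
    for row in range(HEIGHT):
        print("".join(image[row * WIDTH:(row + 1) * WIDTH]))
```

A real ray tracer is less tidy, since rays that bounce differently stop executing the same instruction stream, which is where strict SIMD starts to lose efficiency.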
We are already building hardware implementations of neural networks (CBF with the reference). These things are (or soon will be), for all practical purposes, logically the same as a human brain. If we get to the point where we can build billion-core CPUs, then we'll have no trouble building a hardware implementation of a neural network to rival the human brain. Forget von Neumann.
When we have machines that think and feel like a human, we as a society will seriously have to rethink our perception of humanity and our place in the universe. I think the world is heading in a strange new direction that no one will see coming.
There are approximately 50-100 trillion synapses estimated in the average human adult brain. Let me know when you get there, and whether you can code sentience for the platform without knowing the protocol of the network, let alone the syntax and semantics of qualia, thx...