California Researchers Build The World's First 1,000-Processor Chip (ucdavis.edu)

Posted by EditorDavid on Sunday June 19, 2016 @03:35PM from the laying-down-all-your-chips dept.

An anonymous reader quotes a report from the University of California, Davis about the world's first microchip with 1,000 independent programmable processors: The 1,000 processors can execute 115 billion instructions per second while dissipating only 0.7 Watts, low enough to be powered by a single AA battery...more than 100 times more efficiently than a modern laptop processor... The energy-efficient "KiloCore" chip has a maximum computation rate of 1.78 trillion instructions per second and contains 621 million transistors.
Programs get split across many processors (each running independently as needed with an average maximum clock frequency of 1.78 gigahertz), "and they transfer data directly to each other rather than using a pooled memory area that can become a bottleneck for data." Imagine how many mind-boggling things will become possible if this much processing power ultimately finds its way into new consumer technologies.
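The quoted figures can be cross-checked with a little arithmetic. This is a minimal sketch that uses only the numbers in the summary above; the variable names are illustrative, and the one-instruction-per-cycle peak is an assumption that happens to reproduce the quoted 1.78 trillion figure.

```python
# Figures quoted in the summary above.
PROCESSORS = 1000
MAX_CLOCK_HZ = 1.78e9      # average maximum clock frequency per processor
LOW_POWER_RATE = 115e9     # instructions per second at the 0.7 W operating point
LOW_POWER_WATTS = 0.7

# Assumption: at peak, each processor retires one instruction per clock cycle.
peak_rate = PROCESSORS * MAX_CLOCK_HZ

# Energy efficiency at the low-power operating point.
instructions_per_joule = LOW_POWER_RATE / LOW_POWER_WATTS
picojoules_per_instruction = 1e12 / instructions_per_joule

print(f"peak rate:  {peak_rate:.3g} instructions/s")
print(f"efficiency: {instructions_per_joule:.3g} instructions/J")
print(f"energy:     {picojoules_per_instruction:.2f} pJ per instruction")
```

Under these assumptions the low-power point works out to roughly 6 picojoules per instruction.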

Link to paper by NotInHere · 2016-06-19 15:46 · Score: 5, Informative

The press release does not include it, nor does the slashdot summary. The link to the paper: http://vcl.ece.ucdavis.edu/pub...
In other news by ebonum · 2016-06-19 15:47 · Score: 5, Funny

A young intern who likes to "work late" in Davis California has recently come into the possession of a rather large stash of bitcoins.
remaining core count by Anonymous Coward · 2016-06-19 15:52 · Score: 5, Funny

the world's first microchip with 1,000 independent programmable processors ... Imagine how many mind-boggling things will become possible if this much processing power ultimately finds its way into new consumer technologies.
Yeah, but you have to keep in mind how many cores will be left for the user!
1000 cores minus:
* 200 cores for anti-virus software
* 25 cores for the ransomware battling it out with the anti-virus
* 55 cores for Microsoft's Win10 update nagware
* 350 cores for the NSA monitoring
* 122 cores for the FBI monitoring
* 75 cores to handle syncing all your data to the cloud
* 94 cores to run the 3D GUI based desktop
* 62 cores for constant advertising
* 14 cores for Google to keep tabs on what you're doing
* 1 core dedicated to emacs
So, only 2 cores left for the user. No better than an Athlon from 2005, I'm afraid.
Re:Can this chip run GNU/systemd/Linux? by ancientt · 2016-06-19 15:52 · Score: 5, Interesting

That's probably all it can run. Typically specially designed systems need the ability to configure the OS radically differently than has been done previously which requires source code. Microsoft provides source code, as does IBM, in some special situations, but mostly it tends to be Linux that is used first. Consider the reasoning behind the OS chosen for the fastest computers in the world.
Systemd? Probably because serious computer engineers don't have any trouble dealing with the irritation that systemd causes. (The rest of us may, but if you have enough smarts to handle building a specialized chip, then systemd isn't really a challenge.)

--
B) Eliminate all the stupid users. This is frowned upon by society.
Re: Mind bogglingly complecated co-processing by Anonymous Coward · 2016-06-19 15:55 · Score: 5, Interesting

I take it you've never done high performance computing, have you? More cores is often a good thing. If I'm doing a simulation across 1,024 cores and each node has 16 cores, that means I need a minimum of 64 nodes. There's a lot of communication that takes place over protocols like Infiniband in order to make MPI work. It also rules out the possibility of shared memory systems like OpenMP when jobs reach that scale and have to be spread across multiple nodes. If more cores are located within a single node, it reduces the amount of communication with other nodes and the resulting latency. It also makes shared memory a viable option for larger parallel jobs. If I can fit 64 or 256 cores on a node, there's a lot less need for relatively slow protocols like Infiniband to pass messages. I don't think the ordinary user has a need for 1,000 cores or would have such a need for a long time. But it really could help with high performance computing.
Re:Can this chip run GNU/systemd/Linux? by NotInHere · 2016-06-19 15:57 · Score: 4, Informative

No.
systemd requires glibc. And glibc is 2 MB large. According to the paper, the processor has a whopping 768 KB of RAM (and no capabilities to add external RAM).
Means systemd ain't gonna run. Dunno about the kernel; it's probably easier to write a minimal one from scratch than to port it over to that special architecture.
Re:I guess this is great by Ironlenny · 2016-06-19 16:11 · Score: 4, Insightful

Quantum computing is not magic. It has problems it's insanely good at solving (in theory), and it has problems where it's only as fast as, or slower than (because of the necessary error correction), your traditional deterministic computer. Not only are we a long way off from personal quantum computing (we still don't even have a general purpose quantum processor), we still need to research deterministic architectures.

--
There is a system for subverting the system and you should use that system!
Re:Can this chip run GNU/systemd/Linux? by Pseudonym · 2016-06-19 16:12 · Score: 5, Informative

This is basically a modern transputer. As with connection machines, GPUs, and all such machines, it will very likely need a traditional host CPU to manage it, and that may well run Linux.

--
sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
Obligatory by Motherfucking+Shit · 2016-06-19 16:18 · Score: 4, Funny

Imagine a Beowulf cluster of these!

--
"BSD: Free as in speech. Linux: Free as in beer. Windows 10: Free as in herpes." --Man On Pink Corner in #52607549.
Re:Can this chip run GNU/systemd/Linux? by AchilleTalon · 2016-06-19 17:20 · Score: 4, Informative

Your GPU processors need to execute the SAME instruction at each clock cycle; on this chip, each processor can execute any instruction at each clock cycle. So this is truly like a 1000-core CPU, while the GPU is limited to dispatching the same instruction to all processors.

--
Achille Talon
Hop!
Re: Mind bogglingly complecated co-processing by docmordin · 2016-06-19 17:36 · Score: 4, Interesting

Doing any sort of large-scale computational fluid dynamics or finite element simulations may require a great many cores. For example, you might want to conduct a very detailed simulation of the air flow around a vehicle, airplane, structure, etc. to have a basic understanding of its aerodynamics before spending time and money testing an actual prototype in a wind tunnel. You might also want to look at how very complicated, soft-body structures deform due to a variety of external stimuli. Such information would be crucial for certain materials science applications. Chemical reaction and acoustic simulations may also require a great deal of computing power, especially if you want to have a high spatio-temporal resolution.
Essentially, there are plenty of physical and theoretical science applications that can benefit from massive processing capabilities. There is a lot of fundamental science that is also performed in simulation before any actual tests occur.
Re:What games does this come with by invictusvoyd · 2016-06-19 18:21 · Score: 4, Funny

pong
1000 exactly by evanh · 2016-06-19 19:08 · Score: 5, Informative

It's a 32 x 31 grid = 992, plus 8 extra stuck on one edge to make up the numbers.
Re:Can this chip run GNU/systemd/Linux? by WindBourne · 2016-06-19 19:49 · Score: 4, Informative

Totally easy to add external ram. In fact, it supports 12 independent memory modules. The 768 KB is in place of cache memory. Basically, it is a working table in which any of the CPUs can access any part of it.

--
I prefer the "u" in honour as it seems to be missing these days.
It does almost nothing very very fast by Required+Snark · 2016-06-19 20:22 · Score: 4, Informative

If you read the two page technical paper you will see that there is much less here than the hype suggests.
Each CPU supplies an amount of computation less than a single instruction on a regular CPU. Think of it as a grid of instructions, not a grid of computers. A processor has a Harvard architecture with 128 instructions of 40 bit size and a separate data memory with two banks of 128 16 bit data values (256 16 bit data words total). It says nothing about register files or stacks or subroutine calls. It's likely that the two data banks are in effect the register set. The paper implies that a CPU can compute a single floating point operation in software.
Compiling means mapping code fragments to a set of connected CPUs and routing resources, and then feeding the data into the compute array. After some circuitous path through the grid the answer emerges somewhere. There are also 12 independent memory banks, each with 64KB of SRAM, that are available to all CPUs.
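To put the sizes above in bytes, here is a rough tally. It is a sketch built only from the figures in this comment; the constant names are made up for illustration, and it assumes the instruction and data memories are the only per-processor storage.

```python
# Figures as stated in the comment above (attributed there to the paper).
PROCESSORS = 1000
INSTRUCTION_WORDS = 128     # instructions per processor
INSTRUCTION_BITS = 40
DATA_BANKS = 2
DATA_WORDS_PER_BANK = 128
DATA_BITS = 16
SHARED_BANKS = 12           # independent memory banks shared by all CPUs
SHARED_BANK_KB = 64

# Per-processor storage in bytes.
instr_bytes = INSTRUCTION_WORDS * INSTRUCTION_BITS // 8
data_bytes = DATA_BANKS * DATA_WORDS_PER_BANK * DATA_BITS // 8

# Chip-wide totals in KB.
shared_kb = SHARED_BANKS * SHARED_BANK_KB
local_total_kb = PROCESSORS * (instr_bytes + data_bytes) / 1024

print(f"per processor: {instr_bytes} B instructions + {data_bytes} B data")
print(f"shared SRAM:   {shared_kb} KB")
print(f"all local memory combined: {local_total_kb:.0f} KB")
```

The 12 x 64KB shared banks add up to the 768 KB figure quoted elsewhere in the thread.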
History has not been kind to this kind of grid architecture with lots of CPUs and very little memory. Almost none of them ever made it out of the lab. It's symptomatic of hardware engineers who are clueless about software and design unprogrammable computers. They confuse aggregate theoretical throughput with useful compute resources.
Debugging code on this would be a nightmare. It's completely asynchronous, there is no hardware to segregate different sets of CPUs doing different computing tasks, and there are so few resources per CPU that software debugging aids would crowd out the working code. The people listed on the paper should be punished by being forced to make it do useful work for at least a year. They would be scarred for life.

--
Why is Snark Required?
Re: Mind bogglingly complecated co-processing by goose-incarnated · 2016-06-19 21:12 · Score: 4, Informative

It also makes shared memory a viable option for larger parallel jobs.

Good luck with that. I mean it. IME as you go *more* parallel, shared memory becomes a *less* viable option, regardless of how many cores are running on the same machine. The cycles lost to memory locking to make shared memory work increase exponentially with the number of autonomous processes/threads.
The math isn't disputed - see the birthday problem for a start on calculating the clashes in playing musical chairs. In short, when you have X individuals with Y pigeonholes, then you are effectively bounded by Y, not by X. When you have X threads trying to access one variable, the chance that any thread will get this variable without waiting is effectively 1 for one thread, 1/2 for two threads, 1/3 for three threads, etc.
By the time you get to a mere 64 threads each trying to access a variable, each thread basically has a 1.5% chance of getting it, and a 98.5% chance of being placed into a queue for that variable. Queue times get longer logarithmically. For one thread, time spent in the queue is ((0 * ATIME) + ATIME) where ATIME is the access time of the variable. For two threads, it's ((1-1/2) * ATIME) + ATIME, for three threads it's ((1-1/3) * ATIME) + ATIME, for four threads it's ((1-1/4) * ATIME) + ATIME. For ATIME=100us, the times above are, respectively, 100us, 150us, 166.67us, 175us. That last number is only for four threads with one variable, and assuming that queuing takes no clock cycles. The times increase exponentially with an increase in the number of variables that must be locked.
For 64 threads your expected time in the queue is ((1-1/64) * ATIME) + ATIME = 198.4us, of which 98.4us is spent waiting. You can forget about using shared memory if you want to use 1000 cores.
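The formula in this comment is easy to tabulate. The sketch below implements exactly that model, ((1 - 1/n) * ATIME) + ATIME for n threads contending for one variable; the function name is made up, and it is the comment's simplified model, not a measurement of any real lock.

```python
def expected_access_time(threads: int, atime_us: float = 100.0) -> float:
    """Comment's model: wait (1 - 1/n) * ATIME in the queue, then ATIME to access."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    wait_us = (1 - 1 / threads) * atime_us
    return wait_us + atime_us


for n in (1, 2, 3, 4, 64, 1000):
    print(f"{n:>4} threads: {expected_access_time(n):.2f} us")
```

Note that this particular formula levels off just below 2 * ATIME as the thread count grows. It covers a single variable with free queuing, so it does not capture the extra cost of locking several variables that the comment also describes.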
But wait, "Use a sane design pattern and that won't happen, like with consumer/producer, etc" I hear you say? Sorry, no design pattern will save you, because if even a single thread writes to a variable, then all threads have to implement read-locks to make sure they don't get an access during a write (race condition).
If you have 1000 cores, implement local message-passing. Don't try shared memory unless each thread will use a local copy (in which case, it isn't "shared", now is it?). Or, go ahead and do it and maybe you'll find a shared memory design that doesn't fail to first year statistics, and if you do beat the numbers then I'll be the first to nominate you for a Fields medal/Turing award :-)
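For what "local message-passing" can look like in code, here is a minimal sketch: each stage keeps its own state and only hands values to its neighbour through a queue. The names `stage`, `run_pipeline` and `STOP` are hypothetical. Python's `queue.Queue` uses a lock internally, so this illustrates the programming model (no variable shared between stages), not lock-free hardware.

```python
import queue
import threading

STOP = object()  # sentinel that tells a stage to shut down


def stage(func, inbox, outbox):
    """Apply func to each incoming message and forward the result downstream."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)  # propagate shutdown to the next stage
            return
        outbox.put(func(item))


def run_pipeline(funcs, inputs):
    """Run one thread per function; stages are linked only by queues."""
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    workers = [
        threading.Thread(target=stage, args=(f, queues[i], queues[i + 1]))
        for i, f in enumerate(funcs)
    ]
    for w in workers:
        w.start()
    for value in inputs:
        queues[0].put(value)
    queues[0].put(STOP)

    # Drain the last queue until the sentinel arrives.
    results = []
    while True:
        item = queues[-1].get()
        if item is STOP:
            break
        results.append(item)
    for w in workers:
        w.join()
    return results


print(run_pipeline([lambda x: x + 1, lambda x: x * 2], range(5)))
```

Because each stage is a single thread reading a FIFO queue, results come out in input order without any stage ever touching another stage's data.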

--
I'm a minority race. Save your vitriol for white people.
Re: Can this chip run GNU/systemd/Linux? by Bengie · 2016-06-20 00:22 · Score: 4, Informative

AchilleTalon is correct: each processing group in the GPU can only execute the same instruction on all cores in that group. Every time you have a branch in your code, the GPU takes one branch, executing the instructions for that branch and stalling all cores that took a different branch, then takes the other branch and stalls the other cores. GPUs hate branches. Yes, they can do them, but at a huge performance penalty. You may want to write better code.

To get into a bit more detail, I'll use AMD as an example, but Nvidia pretty much does the same thing with slightly different terms for the same concepts. The AMD RX 480 has 2304 streaming processors (cores) that are grouped into 36 CUs (execution groups). Each streaming processor can handle up to something like 4 wavefronts (threads, like hyper-threading to hide memory access latency) at a time. All streaming processors in a CU for a given wavefront must be executing the same instruction at the same time, except in the case of a branch. When a branch happens, one fork of the branch will process, stalling the streaming processors taking the other fork. Once that fork is finished, the first group of streaming processors will stall while the others finish their fork.
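The cost of that stall-and-swap behaviour can be shown with a toy model. This is a sketch under loud assumptions: the function names and cycle costs are hypothetical, the group size of 64 is just 2304 / 36 from the figures above, and real GPU scheduling is more involved than this.

```python
def lockstep_cycles(takes_branch, then_cost, else_cost):
    """Cycles for one lockstep group to finish an if/else.

    takes_branch holds one bool per lane; True means the lane takes the
    'then' fork. A fork costs cycles if at least one lane takes it, and
    the lanes on the other fork sit idle while it runs.
    """
    cycles = 0
    if any(takes_branch):
        cycles += then_cost
    if not all(takes_branch):
        cycles += else_cost
    return cycles


def independent_cycles(takes_branch, then_cost, else_cost):
    """Cycles if every lane is an independent core: the slowest lane decides."""
    return max(then_cost if taken else else_cost for taken in takes_branch)


LANES = 2304 // 36  # 64 streaming processors per CU, from the comment's figures
uniform = [True] * LANES
divergent = [lane % 2 == 0 for lane in range(LANES)]

# Hypothetical costs: the 'then' fork takes 10 cycles, the 'else' fork 30.
print("lockstep, uniform:     ", lockstep_cycles(uniform, 10, 30))
print("lockstep, divergent:   ", lockstep_cycles(divergent, 10, 30))
print("independent, divergent:", independent_cycles(divergent, 10, 30))
```

In this model a divergent lockstep group pays for both forks back to back, while independent cores only pay for the longer one.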
Systemd on CentOS7 by DrYak · 2016-06-20 01:53 · Score: 4, Informative

Systemd? Probably because serious computer engineers don't have any trouble dealing with the irritation that systemd causes.
Confirming: our latest nodes on our cluster are running CentOS7 which is systemd powered.
(And hopefully the final practical product to come out of this buzzword-compliant press release will still be somewhat useful.
We could have some special workloads to apply it to).

--
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]