California Researchers Build The World's First 1,000-Processor Chip (ucdavis.edu)
An anonymous reader quotes a report from the University of California, Davis about the world's first microchip with 1,000 independent programmable processors: The 1,000 processors can execute 115 billion instructions per second while dissipating only 0.7 Watts, low enough to be powered by a single AA battery...more than 100 times more efficiently than a modern laptop processor... The energy-efficient "KiloCore" chip has a maximum computation rate of 1.78 trillion instructions per second and contains 621 million transistors.
Programs get split across many processors (each running independently as needed with an average maximum clock frequency of 1.78 gigahertz), "and they transfer data directly to each other rather than using a pooled memory area that can become a bottleneck for data." Imagine how many mind-boggling things will become possible if this much processing power ultimately finds its way into new consumer technologies.
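For anyone wondering what "transfer data directly to each other" looks like in software terms, here is a minimal sketch of the idea, not the chip's actual programming model. Each stage stands in for one processor and owns a private link to the next stage, so no pooled memory is shared. The names `stage`, `run_pipeline` and `STOP` are made up for illustration.

```python
import queue
import threading

STOP = object()  # sentinel marking the end of the data stream


def stage(func, inbox, outbox):
    """One 'processor': read from its inbox, apply func, pass the result on."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)  # forward the shutdown signal downstream
            return
        outbox.put(func(item))


def run_pipeline(funcs, data):
    """Chain one stage per function; neighbours are linked by their own queue."""
    links = [queue.Queue() for _ in range(len(funcs) + 1)]
    workers = [
        threading.Thread(target=stage, args=(f, links[i], links[i + 1]))
        for i, f in enumerate(funcs)
    ]
    for w in workers:
        w.start()
    for item in data:
        links[0].put(item)
    links[0].put(STOP)

    results = []
    while True:
        item = links[-1].get()
        if item is STOP:
            break
        results.append(item)
    for w in workers:
        w.join()
    return results


# Two stages: add one, then double.
print(run_pipeline([lambda x: x + 1, lambda x: x * 2], [1, 2, 3]))  # [4, 6, 8]
```

Data only ever moves between neighbouring stages, which is the property the summary contrasts with a shared memory pool.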
The press release does not include it, nor does the slashdot summary. The link to the paper: http://vcl.ece.ucdavis.edu/pub...
Maybe things are getting better. Too many programs are single threaded. Too many drivers are single threaded. Yes you can sandbox them.
That leaves out the nasty deadly embrace. Or less nasty, waiting on a key resource to complete.
More cores just get you bound up in your shorts faster.
More cores are not a magic bullet.
A young intern who likes to "work late" in Davis California has recently come into the possession of a rather large stash of bitcoins.
Yay this is so awesome that researchers have pretty much put 1,000 6502 processors on a single chip. Way to go, maybe in a year we can move on to the equivalent of 1,000 z80 processors on that chip. Yay research!
probably... efficiently? Doubt it.
But I am not sure what system or software can take advantage of it. Personally I want to see progress being made on quantum computing for consumer-level stuff.
the world's first microchip with 1,000 independent programmable processors ... Imagine how many mind-boggling things will become possible if this much processing power ultimately finds its way into new consumer technologies.
Yeah, but you have to keep in mind how many cores will be left for the user!
1000 cores minus:
* 200 cores for anti-virus software
* 25 cores for the ransomware battling it out with the anti-virus
* 55 cores for Microsoft's Win10 update nagware
* 350 cores for the NSA monitoring
* 122 cores for the FBI monitoring
* 75 cores to handle syncing all your data to the cloud
* 94 cores to run the 3D GUI based desktop
* 62 cores for constant advertising
* 14 cores for Google to keep tabs on what you're doing
* 1 core dedicated to emacs
So, only 2 cores left for the user. No better than an Athlon from 2005, I'm afraid.
That's probably all it can run. Typically specially designed systems need the ability to configure the OS radically differently than has been done previously which requires source code. Microsoft provides source code, as does IBM, in some special situations, but mostly it tends to be Linux that is used first. Consider the reasoning behind the OS chosen for the fastest computers in the world.
Systemd? Probably because serious computer engineers don't have any trouble dealing with the irritation that systemd causes. (The rest of us may, but if you have enough smarts to handle building a specialized chip, then systemd isn't really a challenge.)
B) Eliminate all the stupid users. This is frowned upon by society.
No.
systemd requires glibc, and glibc is 2 MB in size. According to the paper, the processor has a whopping 768 KB of RAM (and no capabilities to add external RAM).
That means systemd isn't going to run. Dunno about the kernel; it's probably easier to write a minimal one from scratch than to port it over to that special architecture.
This is basically a modern transputer. As with connection machines, GPUs, and all such machines, it will very likely need a traditional host CPU to manage it, and that may well run Linux.
sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
Imagine a Beowulf cluster of these!
"BSD: Free as in speech. Linux: Free as in beer. Windows 10: Free as in herpes." --Man On Pink Corner in #52607549.
6 years ago at least there was a 1000 core processor made. I don't see how this is different.
The older article:
http://www.pcworld.com/article/215113/1000_Core_Processor_Eats_Quad_Core_CPUs_For_Lunch.html
It could be an interesting extra chip in a general-use computer, one that programs could siphon routines off to: for example, some kinds of video/image rendering, parallelizable mathematical operations, image recognition, a 1000-node neural network, etc.
Judging from its primary applications, which are decoding/encoding and encryption, it seems more similar to a digital signal processor than to a regular CPU.
Perfect for the internet of things. Now rather than just an egg timer I can have a battery-powered supercomputer in my salt shaker that does a finite element simulation of the egg in boiling water, going beep at the perfect moment. The toaster will be able to insult me in the King's English or the Emperor's Mandarin.
And orange Pi is planning to make a board with one of these that only runs on one of the 1000 cores, and no stable OS.
This thing is, I suspect, suited for programs that parallelize well, have little interprocess communication, and run in small memory. Why do I know this? Because if each processor had a large memory and an InfiniBand backplane it would melt. Thus you could update your Facebook status on all 1000 dummy accounts, for example. Or compute pi or chug bitcoins.
Some drink at the fountain of knowledge. Others just gargle.
Your GPU processors need to execute the SAME instruction at each clock cycle; this one has each processor capable of executing any instruction at each clock cycle. So this is truly like a 1000-core CPU, while the GPU is limited to dispatching the same instruction to all processors.
Achille Talon
Hop!
The 1990's called, they want their joke back!
*** Good luck to everyone and have a happy day!
The connection machine's processors were distributed among multiple cards, a single card contained 16 processors, the 1 bit processors were implemented using a ROM chip, I think.
Aren't the shader units of the modern GPUs like the Geforces basically specialized CPUs?
In this case we're already at 2560 CPUs on a single chip.
The way to improve computational technology is parallelism. What are the usage domains?
-anything video related
--games
--image recognition
-anything AI (I think?)
--autonomous cars
--facial recognition
-a lot of physics applications
Thoughts?
PS: I don't reply to ACs.
pong
It only runs at 1.7 GHz. My Pentium IV running XP runs at 4 GHz! Just ask any Joe Sixpack who bought one over an AMD.
http://saveie6.com/
...contains 621 million transistors... Imagine how many mind-boggling things will become possible if this much processing power ultimately finds its way into new consumer technologies.
Let's see... 1,000 very small compute cores... sounds an awful lot like your typical GP-GPU these days. The only reason the power consumption is so small is because it has < 1 billion transistors. Compare that to the 17 billion transistor nVidia Pascal monster. Even the non-Iris graphics Skylake desktop CPU has ~1.7 billion, and over half of those are spent on the GPU.
Chances are even paltry Intel HD Graphics running an OpenCL program will have more FLOPS than this thing. Don't be fooled by the flashy headline, the laws of physics still apply.
It's a 32 x 31 grid = 992, plus 8 extra stuck on one edge to make up the numbers.
Think of quantum mechanics, finite element analysis, weather prediction, etc. Anything based on a matrix or a subset of elements that can be calculated in parallel. Although I am guessing memory would be a bottleneck here.
C. Sagan : A demon haunted world:
http://www.amazon.com/gp/product/0345409469/
visit randi.org
So... What do you do with it? Brute forcing encryption keys?
Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!
Sounds exactly like a GPU to me. :-P
Computer simulation made easy -- LibGeoDecomp
Will slow it down to a crawl before blue screening. Then we'll be ready for Windows 24 Home Premium Edition. No worries.
On y va, qui mal y pense!
why doubt it? After reading this, it sounds like a great set-up. With a 1000 CPUs of MIMD, it sounds like the right core for controlling access to massively parallel systems. And a single AA to run it? Sounds like a pretty decent chip to me.
I prefer the "u" in honour as it seems to be missing these days.
Totally easy to add external ram. In fact, it supports 12 independent memory modules. The 768 KB is in place of cache memory. Basically, it is a working table in which any of the CPUs can access any part of it.
I prefer the "u" in honour as it seems to be missing these days.
Look up SIMD vs MIMD. In a nutshell, your GPU has a large number of 'CPUs' that do the same thing. These are 1000 CPUs, each capable of doing the same thing, OR doing their own thing.
I prefer the "u" in honour as it seems to be missing these days.
Each CPU supplies an amount of computation less than a single instruction on a regular CPU. Think of it as a grid of instructions, not a grid of computers. A processor has a Harvard architecture with 128 instructions of 40-bit size and a separate data memory with two banks of 128 16-bit data values (256 16-bit data words total). It says nothing about register files or stacks or subroutine calls. It's likely that the two data banks are in effect the register set. The paper implies that a CPU can compute a single floating point operation in software.
Compiling means mapping code fragments to a set of connected CPUs and routing resources, and then feeding the data into the compute array. After some circuitous path through the grid the answer emerges somewhere. There are also 12 independent memory banks each with a 64KB of SRAM that are available to all CPUs.
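To put those per-processor figures in perspective, here is the arithmetic as a quick sketch. It uses only the numbers quoted in this comment plus the 1,000-processor count from the headline; the variable names are just for illustration.

```python
CORES = 1000

# Per-processor storage, as described above.
INSTR_WORDS, INSTR_BITS = 128, 40
DATA_BANKS, DATA_WORDS, DATA_BITS = 2, 128, 16

# Shared on-chip memory: 12 independent banks of 64 KB each.
SHARED_BANKS, SHARED_BANK_KB = 12, 64

instr_bytes_per_core = INSTR_WORDS * INSTR_BITS // 8             # 640 bytes
data_bytes_per_core = DATA_BANKS * DATA_WORDS * DATA_BITS // 8   # 512 bytes
local_bytes_total = CORES * (instr_bytes_per_core + data_bytes_per_core)
shared_kb_total = SHARED_BANKS * SHARED_BANK_KB                  # 768 KB

print(instr_bytes_per_core, data_bytes_per_core)   # 640 512
print(local_bytes_total // 1024, "KB local in total")   # 1125 KB local in total
print(shared_kb_total, "KB shared in total")             # 768 KB shared in total
```

So each processor holds well under 2 KB of code and data, and the 12 shared banks add up to the 768 KB figure mentioned elsewhere in the thread.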
History has not been kind to this kind of grid architecture with lots of CPUs and very little memory. Almost none of them ever made it out of the lab. It's symptomatic of hardware engineers who are clueless about software and design unprogrammable computers. They confuse aggregate theoretical throughput with useful compute resources.
Debugging code on this would be a nightmare. It's completely asynchronous, there is no hardware to segregate different sets of CPUs doing different computing tasks, and there are so few resources per CPU that software debugging aids would crowd out the working code. The people listed on the paper should be punished by being forced to make it do useful work for at least a year. They would be scarred for life.
Why is Snark Required?
Most programmers don't know how to code for parallel processors. At best you may get multi-threaded apps, but those are often made to handle a large load of requests, not to solve a single problem much quicker.
If something is so important that you feel the need to post it on the internet... It probably isn't that important.
Like this one from 2004?
https://tech.slashdot.org/story/11/01/03/1722240/researchers-claim-1000-core-chip-created
A single AA is marketing lies. Sure, the battery will handle it, but it's not what it is made for and the energy you actually get out will be less than the marked one.
The number listed on the battery is typically how much you will get out of it from a 20h discharge time.
You should not be surprised if an AA battery only lasts for half the rated time if you try to suck 0.7W out of it.
The runtime won't be much above 1 hour.
OTOH a computer with 1000 processors is hardly made for portable applications so the single AA example is just silly anyway. For the application one would use this for there are better power sources available.
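The battery argument above is easy to play with numerically. The sketch below is only a back-of-the-envelope model: the 2500 mAh capacity, the 1.5 V nominal voltage and the derating factor are all assumptions picked for illustration, not datasheet values, so plug in real figures for a real cell.

```python
def runtime_hours(capacity_mah, voltage_v, load_w, derating=1.0):
    """Estimated runtime of a cell under a constant power load.

    derating is the assumed fraction of the rated capacity that is
    actually usable at this discharge rate (1.0 = full rated capacity).
    """
    usable_wh = capacity_mah / 1000.0 * voltage_v * derating
    return usable_wh / load_w


LOAD_W = 0.7  # the chip's quoted power draw

# Assumed AA cell: 2500 mAh rated at a slow discharge, 1.5 V nominal.
rated = runtime_hours(2500, 1.5, LOAD_W)          # 3.75 Wh / 0.7 W, about 5.4 h
halved = runtime_hours(2500, 1.5, LOAD_W, 0.5)    # half the rated time

print(f"at rated capacity: {rated:.1f} h, derated by half: {halved:.1f} h")
```

The point of the parameter is that the answer swings a lot with how much capacity the cell really delivers at a high drain, which is exactly what the rated number hides.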
Even ignoring all other limitations of this particular processor there's still Amdahl's law, limiting the speedup by the serial parts of a task.
As one example of how that works, look at compiling to hardware. In theory this should bring enormous benefits, as one can parallelize not only on an instruction level but on a sub-instruction one, speculating and pipelining e.g. additions. Many types of communication can be eliminated entirely by replicating hardware.
But even with those benefits there is a _lot_ of software that is better run on a standard processor. Why? Because using custom optimized hardware to run it ends up replicating a number of normal processors including caches, branch prediction etc., and then a processor optimized by a dedicated team of experienced people ends up being attractive.
I'm not saying custom hardware can't bring huge benefits, nor even that this research processor can't do it; _however_, in general there are a lot of tasks that can't really be accelerated much.
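For reference, Amdahl's law as mentioned at the top of this comment is a one-liner. This is just the textbook formula, speedup = 1 / ((1 - p) + p / n), where p is the fraction of the work that can run in parallel and n is the number of processors; the example fractions are made up.

```python
def amdahl_speedup(parallel_fraction, n_processors):
    """Upper bound on speedup when only part of a task can run in parallel."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_processors < 1:
        raise ValueError("n_processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)


# Even with 1000 processors, the serial part dominates quickly.
for p in (0.50, 0.95, 0.99):
    print(f"{p:.0%} parallel -> {amdahl_speedup(p, 1000):.1f}x on 1000 cores")
```

With 95% of the task parallelizable the bound is a bit under 20x, no matter that there are 1000 processors available.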
So has it been tested on bitcoin mining yet? or seti project? Im curious what sort of real world throughput it has...
Same thing I thought. And connection machine died because the architecture was not actually that great.
The dangers of excessive individualism are nothing compared to the oppressiveness of excessive collectivism
I don't think this is right. I've written OpenCL kernels that have variable length loops and branches either of which could be run, and executed them in parallel. So either my understanding is wrong, or GPU cores can indeed run different instructions at the same time.
Most programmers don't know how to code for parallel processors.
Most of them don't need to, because at a high level, most things you want to do with a computer are inherently serial.
At a low level, tons of math can be parallelized, although there's a trade-off of parallel processing and overhead. Low level parallelism happens inside libraries, written by people who do know how to write for parallel processors, and transparent to higher-abstraction programmers.
This chip is not going to be in your phone or in the next iPad. "Most" programmers are unlikely to have any contact with this architecture. It will be really good at parallel tasks requiring relatively little data (max data rate seems to be less than 400 Mbps/processor). Those are big problems, but very specialized. The people who work on them know how to parallel.
I still wonder how long it will be until the 'traditional host CPU' is scaled down to a small SOC, so that the traditional heavyweight CPU is freed up for tasks that actually require it: most of what runs on the i5 in the machine I am writing this on doesn't need anything remotely as powerful as said i5. Likewise, putting a small SOC-like chip in the graphics card and running most of the GUI there is another thing. As such, once processors hit the single core brick wall (and they're kind of doing that now), performance improvements will come from offloading what can run on a small power-efficient core to such a small power-efficient core. Given what the chip in e.g. a pi zero costs, it ought to make sense: connect your machine to power, and a tiny microcontroller handles the ILO and basic system management functions, and on power-on, a larger microcontroller/SOC does what the BIOS/UEFI does on current machines. Similarly in the screen we have the same arrangement, with a microcontroller starting up the GPU and display (independently of the rest of the machine). A modern PC is already like a small network (the GPU being networked to the main CPU via the pcie bus, multiple intel sockets networked via QPI etc.). Making this more explicit is the sensible thing to do.
John_Chalisque
It's only 0.7W when clocked at 115MHz, but still impressive.
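The numbers in the summary line up with this if you assume each processor retires one instruction per clock cycle. That is an assumption on my part (the summary does not state it), but both quoted totals fall out of it:

```python
CORES = 1000


def instructions_per_second(clock_hz, cores=CORES, instr_per_cycle=1.0):
    """Aggregate instruction rate, assuming every core runs at clock_hz."""
    return clock_hz * cores * instr_per_cycle


low_power = instructions_per_second(115e6)   # 115 MHz  -> 1.15e11 (115 billion/s)
peak = instructions_per_second(1.78e9)       # 1.78 GHz -> 1.78e12 (1.78 trillion/s)

# Energy per instruction at the quoted 0.7 W operating point.
picojoules_per_instr = 0.7 / low_power * 1e12

print(f"{low_power:.3g} instr/s at 115 MHz, {peak:.3g} instr/s at 1.78 GHz")
print(f"about {picojoules_per_instr:.1f} pJ per instruction at 0.7 W")
```

So the 0.7 W figure and the 1.78 trillion figure describe two different operating points roughly 15x apart in clock speed.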
768 K should be more than enough for everybody.
That's what most PC/XT clones with a Hercules card had 30 years ago.
New applications?
AchilleTalon is correct: each processing group in the GPU can only execute the same instruction on all cores in that group. Every time you have a branch in your code, the GPU takes one branch, executing the instructions for that branch and stalling all cores that took a different branch, then takes the other branch and stalls the other cores. GPUs hate branches. Yes, they can do them, but at a huge performance penalty. You may want to write better code.
To get into a bit more detail, I'll use AMD as an example, but Nvidia pretty much does the same thing with slightly different terms for the same concepts. The AMD RX 480 has 2304 streaming processors (cores) that are grouped into 36 CUs (execution groups). Each streaming processor can handle up to something like 4 wavefronts (threads, like hyper-threading, to hide memory access latency) at a time. All streaming processors in a CU for a given wavefront must be executing the same instruction at the same time, except in the case of a branch. When a branch happens, one fork of the branch will process, stalling the streaming processors taking the other fork. Once that fork is finished, the first group of streaming processors will stall while the other processors finish their fork.
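A toy model of the stalling described above, for the curious. It is a simplification and not how any driver or hardware scheduler is actually written: threads are packed into lock-step groups, and a group needs one pass per distinct branch direction taken inside it. The group width of 64 is an assumption chosen to match 2304 / 36 from the RX 480 figures; `group_passes` is a made-up name.

```python
def group_passes(branch_taken, width=64):
    """Count execution passes needed when threads run in lock-step groups.

    branch_taken holds one boolean per thread. A group whose threads all
    agree needs one pass; a group with both outcomes needs two, with the
    threads on the inactive side stalled during each pass.
    """
    passes = 0
    for start in range(0, len(branch_taken), width):
        group = branch_taken[start:start + width]
        passes += len(set(group))
    return passes


threads = 128
uniform = [True] * threads                        # everyone takes the same fork
alternating = [i % 2 == 0 for i in range(threads)]  # neighbours disagree

print(group_passes(uniform))      # 2: one pass for each of the two groups
print(group_passes(alternating))  # 4: every group has to run both forks
```

In this model a divergent branch doubles the work for every group that contains both outcomes, which is the "huge performance penalty" in miniature.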
Something that will run Flash without bogging down.
Do not look at laser with remaining good eye.
What kind of computer scientists are they?
They should have made it 1024. And labelled them 0-1023.
You're parsing this a little more than necessary. The point was not that people would use a AA battery. The point was that this chip was an energy sipper as opposed to an energy guzzler.
If you're scared of your govt then you need to further restrict its powers
Vote 3rd Party in 2016 and beyond
Systemd? Probably because serious computer engineers don't have any trouble dealing with the irritation that systemd causes.
Confirming: our latest nodes on our cluster are running CentOS7 which is systemd powered.
(And hopefully the final practical product out this buzzword-compliant pressrelease would still be somewhat useful.
We could have some special workloads to apply it to).
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
It depends.
In the case of Xeon-Phi (i.e.: ex-Larrabee GPUs repurposed as parallel processing units), in addition to the very wide SIMD AVX512 units, there are also scalar cores able to run Pentium-compatible binaries.
So the Linux core managing all the hardware actually runs *on* the GPU itself (and you can SSH into your Xeon-Phi if you want).
On the other hand, the Tilera works exactly as you describe.
A weird many-core structure running the processing kernels,
and a nearby classical risc core managing the whole.
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
Had to say it. Haven't seen that response in a while.
Peace is easy to achieve, just surrender. Liberty is much harder get/keep.
I'm nitpicking to hell with this but...
Yes, all the *SIMD units attached to 1 execution core* will necessarily process the exact same instruction at the same time on the same cycle... ...but there is more than 1 execution core on most higher-range GPUs, and nearly all modern GPUs are able to keep several hyperthreads running concurrently to hide latencies.
(which from a design point of view makes complete sense: graphical processing is about repeating some processing on thousands or millions of pixels. Better to group them in batches instead of processing every last damn pixel individually)
So a modern GPU can execute several different instructions at the same time.
Even if it's usually the exact same OpenCL code uploaded to all units, the various SIMD units could be executing different points of the code.
But yeah, you're right, within a SIMD, all the threads run the same instruction.
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
Nit-picking to hell...
You've forgotten a special use case:
Yes, if AC's code does something stupid like "every even thread branches left, every odd thread branches right", the execution group will need to run the code twice, with alternating masks to run each branch, exactly as you describe.
But if it's entirely different parts of the thread block that diverge (e.g.: first half vs. second half), the "execution groups" will each diverge independently: the first 18 taking one branch and the second 18 taking the other branch, with no time lost due to alternating execution masks.
(Which is the preferable way to handle branching code in parallel environment. If you can't do away with the branches altogether, at least try to organise it so nearby threads on the same SIMD branch/loop together.
e.g.: bin-sort your loops by similar lengths together)
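The bin-sorting suggestion can be illustrated with a small cost model. This is a sketch under simplifying assumptions, not a measurement: a lock-step group is taken to run for as many iterations as its longest loop, the group width of 64 is assumed, and `lockstep_cost` is a made-up name.

```python
def lockstep_cost(loop_lengths, width=64):
    """Total iterations executed when each group waits for its longest loop."""
    return sum(
        max(loop_lengths[start:start + width])
        for start in range(0, len(loop_lengths), width)
    )


# 128 threads whose loop lengths alternate between short and long.
lengths = [1, 100] * 64

mixed = lockstep_cost(lengths)            # both groups contain a 100: 100 + 100
binned = lockstep_cost(sorted(lengths))   # short loops grouped together: 1 + 100

print(mixed, binned)  # 200 101
```

Sorting does not change the work each thread does; it only stops the short loops from idling while a long neighbour in the same group finishes.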
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
Most programmers don't seem to be able to deal with buffer overflows, race conditions or 64 bit. This is for the other ones who can deal with more than one thread, the ones that have caught up with the 1990s and are not stuck in the MSDOS mindset.
Even very simple stuff with sound and images is inherently parallel. More complex modelling of physical objects is inherently parallel.
You don't get it? Imagine resizing every frame of a movie at 25fps over two hours. That's the same operation done many times and very trivial to do in parallel. It's just a matter of splitting the task to whatever resources you have. With sound (and thus things like seismic data as well) if you want to apply the same filter to thousands or millions of samples it's very trivial to do in parallel.
Home movies and digital photography fit into the mix so not very specialized at all.
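"Splitting the task to whatever resources you have" looks roughly like this in code. It is a minimal sketch with a stand-in filter (a simple gain applied to each sample); `apply_gain`, `split` and `parallel_filter` are illustrative names. Threads are used only to show the structure: in CPython a CPU-bound filter would need processes or native code to actually run faster.

```python
from concurrent.futures import ThreadPoolExecutor


def apply_gain(chunk, gain=0.5):
    """Stand-in filter: the same operation applied to every sample."""
    return [sample * gain for sample in chunk]


def split(samples, n_chunks):
    """Cut samples into at most n_chunks contiguous pieces."""
    size = max(1, -(-len(samples) // n_chunks))  # ceiling division
    return [samples[i:i + size] for i in range(0, len(samples), size)]


def parallel_filter(samples, workers=4):
    """Filter each chunk independently, then stitch the results together."""
    chunks = split(samples, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        filtered = list(pool.map(apply_gain, chunks))  # map keeps chunk order
    return [sample for chunk in filtered for sample in chunk]


print(parallel_filter([2, 4, 6, 8, 10], workers=2))  # [1.0, 2.0, 3.0, 4.0, 5.0]
```

No chunk depends on any other, which is why this kind of job scales with however many workers are available.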
Only in a beowulf cluster.
No doubt Linux runs on a conventional processor that manages the embedded processors. Probably just running on the metal on the embedded processors, like a GPU.
When all you have is a hammer, every problem starts to look like a thumb.
Most programmers seem to be coding Javascript these days.
When all you have is a hammer, every problem starts to look like a thumb.
bingo. I like the sounds of this chip. If done inexpensively (though they worked with IBM so it might not be), this could be a major chip. I could imagine a number of these for a server. Wow.
I prefer the "u" in honour as it seems to be missing these days.
SIMD is used for applying the same processing in parallel to different data. Imagine processing an image that you want to lighten by 10%. This simply divides the work across the RAM among individual CPUs, all of them adding 10% to the value.
Likewise for weather processing, geographical work, or simulations of light rays: they all involve the same calculations applied to different data.
Hence SAME INSTRUCTION; Multiple (or different) DATA.
Roughly, those CPUs all operate in lock-step.
MIMD, is like having 1000 different CPUs in a box, or for that matter, 1000 different boxes. The 768K data region is what the chips can see. So, you can have 5 chips working on 100K, while the rest is split amongst 995 other chips.
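The distinction can be sketched in a few lines. This runs sequentially and only mimics the two styles; the function names and the 8-bit clamp at 255 are assumptions for illustration, not anything taken from the chip.

```python
def simd_lighten(pixels, factor=1.10):
    """SIMD style: one instruction stream, applied to every data element."""
    return [min(255, round(p * factor)) for p in pixels]


def mimd_run(tasks):
    """MIMD style: each worker gets its own program and its own data."""
    return [program(data) for program, data in tasks]


# Same operation (lighten by 10%) on different pixels.
print(simd_lighten([100, 200, 250]))  # [110, 220, 255]

# Three workers, three unrelated programs.
print(mimd_run([(sum, [1, 2, 3]), (max, [4, 9, 2]), (sorted, [3, 1, 2])]))
# [6, 9, [1, 2, 3]]
```

In the first call only the data varies; in the second both the program and the data vary per worker, which is what the 1000 independent processors allow.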
I prefer the "u" in honour as it seems to be missing these days.
I've written OpenCL kernels that have variable length loops and branches either of which could be run, and executed them in parallel.
The way this typically works is to use conditional execution, just like in ARM or Itanium, with the predicate bit being a set of bits. This is all explained in early research papers on GPUs, such as this one from the now-amusingly-named "Lucasfilm Pixar Project" circa 1984.
sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
I could imagine one of these for a desktop.
I would parse it as the memory modules have that much memory and that it has IO to an external bus at the speed stated.
Not the original poster, but I used to do a lot of comma ands that have been largely replaced with period ands and semicolons.
well, fsck me!.
Well, fsck is also going to be handled by systemd! Systemd is cancer!!!
No, wait, you're running the whole thing on top of BTRFS, which doesn't have a real fsck because it doesn't make any sense on copy-on-write systems! BTRFS is the cheap knock-off of ZFS!!!!
Argh! All these memes are starting to get confusing, I don't know which one I currently need to blame!
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
Sorry, can't "+1 Funny" you, cause I've already posted in this thread...
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
Since when did systemd become part of the GNU/Linux moniker?
but you can have 999 players (on a 999 sided polygon field) and 1 ball