Tilera To Release 100-Core Processor
angry tapir writes "Tilera has announced new general-purpose CPUs, including a 100-core chip. The two-year-old startup's Tile-GX series of chips are targeted at servers and appliances that execute Web-related functions such as indexing, Web search and video search. The Gx100 100-core chip will draw close to 55 watts of power at maximum performance."
I can't wait to see the output of:
cat /proc/cpuinfo
I guess we will need to use:
cat /proc/cpuinfo | less
When we reach 1 million cores, we will need to rearrange the output of cat /proc/cpuinfo to eliminate redundant information ;-))
By the way, I just typed "make menuconfig" and it will let you enter a number up to 512 in the "Maximum number of CPUs" field, so the Linux kernel seems ready for up to 512 CPUs (or cores; Linux seems to handle them the same way), as far as I can tell from this simple test. Entering a number greater than 512 gives the "You have made an invalid entry" message ;-(
Note: You need to turn on "Support for big SMP systems with more than 8 CPUs" flag as well.
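On the point about trimming redundant /proc/cpuinfo output: here is a rough sketch of what a condensed view could look like. It assumes the x86-style layout (blank-line-separated blocks with "processor" and "model name" fields); other architectures use different field names, and summarize_cpuinfo is just a name made up for this example.

```python
from collections import Counter

def summarize_cpuinfo(text):
    """Collapse /proc/cpuinfo-style text into (cpu_count, Counter of model names).

    Assumes blank-line-separated blocks of "key : value" lines, as on x86 Linux.
    """
    models = Counter()
    count = 0
    for block in text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        # Only blocks that describe a logical CPU carry a "processor" field.
        if "processor" in fields:
            count += 1
            models[fields.get("model name", "unknown")] += 1
    return count, models
```

On a Linux box you would feed it open("/proc/cpuinfo").read() and print one line per distinct model instead of a hundred near-identical blocks.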
Everything I write is lies, read between the lines.
...cluster of natalie pormemes.
THL phish sticks
... and just imagine a Beowulf cluster of them.
Yeah, baby !! That's a LOT OF POWER to turn my knobs !!
Yes, I suppose technically any FPGA could be considered a "core" in its own right, but it's a far cry from the CPU cores that you typically associate with the term.
Putting a stock on a semi-automatic rifle makes it an "assault weapon", but c'mon. It's still a pea shooter.
It appears from the article that it's a new, separate architecture to which the kernel hasn't been ported yet, so these are add-on processors that can help reduce the load on the actual CPU, at least for now. So, em, two things: 1. How exactly does that work without kernel level support? They claimed having ported separate apps (MySQL, memcached, Apache), so this might suggest a generic kernel interface and userspace scheduling. 2. How does this fix the apps they ported being mostly IO bound in a lot of cases and 99% of the cores will still just be eating out of their noses?
Massive amounts of cores are cool and all that, but if the instruction set isn't any standard type (i.e. x86, SPARC, ARM, PowerPC or MIPS), chances are that it won't see light outside highly customized applications. Sure, Linux will probably run on it. Linux runs on anything, but it won't be put in a regular computer other than as an accelerator of some sort, like GPUs, which are massively multicore too. Intel's Larrabee though..
- Henrik
- when the Shadows descend -
Wouldn't it have been better to make it a power of 2? Some work is more easily divided when you can just keep halving it. 64 or 128 would have been more logical, I would have thought. I'm not an SMP programmer though, so perhaps it doesn't make any difference.
The article mentions that Tilera is able to avoid the use of crossbars:
For faster data exchange, Tilera has organized parallelized cores in a square with multiple points to receive and transfer data. Each core has a switch for faster data exchange. Chips from Intel and AMD rely on crossbars, but as the number of cores expands, the design could potentially cause a gridlock that could lead to bandwidth issues, he said.
Does anybody here know how this actually works?
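One way to get a feel for the crossbar-versus-grid trade-off is to count wires and hops. The sketch below assumes a plain square 2D mesh where each tile's switch talks only to its four neighbours and packets travel along X first, then Y. That routing scheme is my assumption for illustration; the article does not say how Tilera actually routes traffic.

```python
from itertools import product

def mesh_hops(a, b):
    """Hops between tiles a=(x, y) and b=(x, y) under X-then-Y routing."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def mesh_stats(side):
    """Compare a side x side mesh against a full crossbar over the same tiles."""
    if side < 2:
        raise ValueError("need at least a 2x2 mesh")
    tiles = list(product(range(side), repeat=2))
    pairs = [(a, b) for a in tiles for b in tiles if a != b]
    hops = [mesh_hops(a, b) for a, b in pairs]
    return {
        # Each row and each column contributes (side - 1) neighbour links.
        "mesh_links": 2 * side * (side - 1),
        # A full crossbar needs a crosspoint for every input/output pair.
        "crossbar_crosspoints": len(tiles) ** 2,
        "avg_hops": sum(hops) / len(hops),
        "max_hops": max(hops),
    }
```

Calling mesh_stats(10) models a 100-tile chip: the mesh needs far fewer links than the crossbar needs crosspoints, and the price is that a message may have to cross several switches to reach a distant tile.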
Sounds like something that might be useful in a video game console ...
.. Am I the only one who gets mildly suspicious when reading 100-core instead of 128-core?
Although I don't expect Apple to release an Apple Server edition with a Tilera multicore processor, I would be interested to see a version of FreeBSD running with Grand Central Dispatch on a Tilera multicore chip. It would give a good idea of how effective GCD would be in allocating cores for execution. Any machine with 100 cores must have a considerable amount of RAM, perhaps 8GB+, even with large caches.
Apple has been very active in developing LLVM compilers, and has recently added the Clang front end in place of GCC. I don't think Apple has open sourced all their work yet, but check llvm.org for the current details. The real trick is breaking any algorithm into blocks and using OpenCL to organize your code for execution. I mean, how different is a 100-core multi-CPU chip from a multicore GPU accelerator?
http://www.kernel.org/pub/linux/kernel/v2.6/snapshots/patch-2.6.27-rc2-git1.log
CC.
TaijiQuan (Huang, 5 loosenings)
...the first person to ask if this can run "Crysis?"
Wish I had mod points today. I wonder how many people will get just how funny this fantastically sarcastic and totally on target comment was. Bravo.
The problem with quotes on the internet, is that nobody bothers to check their veracity. -- Abraham Lincoln
Are these x86/x86-64 CPUs? It wasn't particularly clear to me.
TFA says Apache has been ported to it. Might be useful.
The cost of that cleanup, of course, will be borne by taxpayers, not industry.
OK, so big disclaimer: I work for Sun (not Oracle, yet!)
The Sun Niagara T1 chip came out over 3 years ago, and it did 32 threads on 8 cores.
And drew something around 50W (200W for a fully-loaded server). And under $4k.
The T2 systems came out last year, do 64 threads/CPU for a similar power budget. And even less $/thread.
The T3 systems likely will be out next year (I don't know specifically when, I'm not In The Know), and the threads/chip should double again, with little power increase.
Of course, per-thread performance isn't equal to anything like a modern "standard" CPU. Though, it's now "good enough" for most stuff - the T2 systems have a per-thread performance equal to about the old Pentium3 chips. I would be flabbergasted if this GX chip had a per-core performance anywhere near that.
I'm not sure how Intel's Larrabee is going to show (it's still nowhere near release), but the T-series chips from Sun are cheap, open, and available now. And they run Solaris AND Linux. So unless this new GX chip is radically more efficient/higher-performance/less costly, I don't see this company making any impact.
-Erik
...a Beowulf cluster of stale memes.
Those who can make you believe absurdities can make you commit atrocities. - Voltaire
Since a) developing a processor is insanely expensive and b) they need it to run lots of software ASAP, it would be very clever if they spent a marginal part of the overall development costs in making sure every key Linux and *BSD kernel developer gets some hardware they can use to port the stuff over. Make it a nice desktop workstation with cool graphics and it will happen even faster.
They are going up against Intel... The traditional approach (delivering a faster processor with a better power consumption at a lower price) simply will not work here.
I think Movidis taught us a lesson a couple years back. Users will not move away from x86 for anything less than a spectacular improvement. Even the Niagara SPARC servers are a hard sell these days...
http://www.dieblinkenlights.com
Can someone explain to me how a chip can be targeted at much higher-level tasks like these?
I realize there are surely technical means to achieve this goal, I just can't imagine myself what these means could be.
diegoT
It is done only out of convenience, really. So you have your regular 1-core processor, of course (2^0); the next step up is a second core (2^1). Now from there, an easy step is to simply duplicate your dual-core setup. You just make a second copy and put it on the same chip, giving you 4 cores (2^2). This is as far as most chips go; more than 4 cores is not really common. However, you might notice we have a really small sample set: we've only covered 3 powers of two, two of them by necessity. So this trend isn't there because computers require it, it is just the way things worked out.
So, if you sniff around, you discover that indeed AMD makes 3-core processors. They are called the Phenom X3. Basically what happens is they designed a quad-core chip. However, they are having yield problems. Often enough, one of the cores fails testing, but the others work. So what they do is disable that core and sell a 3-core product. The end result works great: the OS sees 3 CPUs and uses them.
OSes don't care about specifics in terms of core numbers. Power of two core numbers are just the way it has worked out in many chips so far because we aren't dealing with large numbers. It is going to quickly go away though. Intel is going to introduce a 6-core chip next year. We are heading towards a market that will have processors with a number of cores that is convenient. What "convenient" is will depend on a lot of factors, but the divisibility of the numbers won't be one of them.
We may well start to see more odd numbered CPUs. If you design something with 100 individual units, it is much easier to disable parts if they don't work. Might see 96, 97, 98, 99, and 100 core varieties or something like that. All the same chip, just with units disabled if they fail.
GPUs have been doing this for years. They are highly parallel and often when a new high end part comes out there'll be a slightly lower end part that is a bit lower clock and with one or two of the pipelines disabled. This allows for parts that won't pass all the tests, but still mostly work, to be sold rather than thrown out.
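To make the "divisibility doesn't matter" point concrete, here is a minimal sketch of handing out work to an arbitrary number of cores. The function name split_work is made up for this example; the only trick is divmod, which spreads the remainder so no worker gets more than one extra item.

```python
def split_work(n_items, n_workers):
    """Return one (start, stop) range per worker; sizes differ by at most one."""
    if n_workers < 1:
        raise ValueError("need at least one worker")
    base, extra = divmod(n_items, n_workers)
    ranges = []
    start = 0
    for worker in range(n_workers):
        # The first `extra` workers each take one leftover item.
        size = base + (1 if worker < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges
```

It behaves the same for 3, 97 or 100 workers as it does for 64 or 128, which is why a chip sold with a few units disabled is no harder to schedule for.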
100 cores... that means that my cpu will never go beyond '1% busy'
Unfortunately these days the meaning of supercomputer gets a bit diluted by many people calling clusters "supercomputers". They aren't really. As you noted what makes a supercomputer "super" isn't the number of processors, it is the rest, in particular the interconnects. Were this not the case, you could simply use cheaper clusters.
So why does it matter? Well, certain kinds of problems can't be solved by a cluster, just as certain ones can. To help understand how that might work, take something more people are familiar with like the difference between a cluster and just a bunch of computers on the Internet.
Some problems need extremely little bandwidth. They need no inter-node communication, and very little communication with the head node. A good example would be the Mersenne Prime Search, or Distributed.net. The problem is extremely small; the program itself is larger than the data. All the head node has to do is hand out ranges for clients to work on, and the clients only need to report the results, affirmative or negative. As such, it is something suited to work over the Internet. The nodes can be low bandwidth, they can drop out of communication for periods of time, and it all works fine. Running on a cluster would gain you no speed over the same group of computers on modems.
However, the same is not true for video rendering. You have a series of movie files you wish to composite into a final production, with effects and so on. This sort of work is suited to a cluster. While the nodes can work independently (the work of one node doesn't depend on the others), they do require a lot of communication with the head node. The problem is very large; the video data can be terabytes. The result is also not small. So you can do it on many computers, but the bandwidth needs to be pretty high, with low latency. Gigabit Ethernet is likely what you are looking at. Trying to do it over the Internet, even broadband, would waste more time in data transfer than you'd gain in processing. You need a cluster.
OK, well, supercomputers are the next level of that. What happens when you have a problem where you DO have a lot of inter-node communication? The results of the calculations on one node are influenced by the results on all the others. This happens in things like physics simulations. In this case, a cluster can't handle it. You can slam your bandwidth but, worse, you have too much latency. You spend all your time waiting on data, and thus computation speed isn't any faster.
For that, you need a supercomputer. You need something where nodes can directly access the memory of other nodes. It isn't quite as fast as local memory access, but nearly. Basically you want them to play like they are all the same physical system.
That's what separates a true supercomputer from a big cluster. You can have lots of CPUs and that's wonderful; there are a lot of problems you can solve on that. However, that isn't a supercomputer unless the communication between nodes is there.
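A toy cost model shows why latency, not core count, decides this. It is only a sketch: it assumes each step's compute divides evenly across nodes and that every message costs one full latency with no overlap. Any latency figures you plug in are your own assumptions, not measurements of real hardware.

```python
def step_time(work_seconds, nodes, messages_per_node, latency_seconds):
    """Estimated time for one simulation step on `nodes` machines.

    Toy model: perfectly divided compute plus serialized message waits.
    """
    compute = work_seconds / nodes
    waiting = messages_per_node * latency_seconds
    return compute + waiting
```

For example, with 1 second of work per step, 100 nodes and 1000 messages per node, an assumed 50 microsecond network latency gives 0.01 s of compute and 0.05 s of waiting, while an assumed 1 microsecond interconnect gives 0.01 s of compute and 0.001 s of waiting. Same CPUs, very different machine.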
Clojure is a lisp on the JVM designed for multi-threading. From:
http://clojure.org/
"""
Clojure is a dynamic programming language that targets the Java Virtual Machine (and the CLR ). It is designed to be a general-purpose language, combining the approachability and interactive development of a scripting language with an efficient and robust infrastructure for multithreaded programming. Clojure is a compiled language - it compiles directly to JVM bytecode, yet remains completely dynamic. Every feature supported by Clojure is supported at runtime. Clojure provides easy access to the Java frameworks, with optional type hints and type inference, to ensure that calls to Java can avoid reflection. Clojure is a dialect of Lisp, and shares with Lisp the code-as-data philosophy and a powerful macro system. Clojure is predominantly a functional programming language, and features a rich set of immutable, persistent data structures. When mutable state is needed, Clojure offers a software transactional memory system and reactive Agent system that ensure clean, correct, multithreaded designs.
"""
A 21st century issue: the irony of technologies of abundance in the hands of those still thinking in terms of scarcity.
/proc/cpuinfo will become a small book. on the bright side, i guarantee 100 cores meets the draft requirements for 'windows 8 capable' status.
Good people go to bed earlier.
I've been personally let down time after time by systems that make these claims. I know it's a bit different, but Sun's T2/T2+ chips have been disappointing. Sure, psrinfo shows 128 CPUs, but overall performance sucks for anything more than web serving. Sure, the kernel may be thread-aware, but the underlying parts of the OS aren't... Plus, the binutils and misc utilities that comprise day-to-day tasks don't take advantage of that many execution threads... You have to get a special gzip that is parallelized.
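For what it's worth, the "special gzip" idea is not complicated in principle. A rough sketch, assuming it is acceptable to emit several concatenated gzip members (which the gzip format permits) instead of one stream: compress fixed-size chunks in a thread pool and join the results. This is not how any particular parallel gzip tool is implemented, and parallel_gzip is a made-up name. Threads only help to the extent that the interpreter releases its lock during compression, which I am treating as an implementation detail here.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor

def parallel_gzip(data, chunk_size=1 << 20, workers=4):
    """Compress `data` (bytes) as concatenated gzip members, one per chunk."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    if not chunks:
        chunks = [b""]  # still emit one valid, empty member
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order, so members stay in sequence.
        members = list(pool.map(gzip.compress, chunks))
    return b"".join(members)
```

The output decompresses with gzip.decompress or the ordinary gunzip, at the cost of a slightly worse compression ratio because each chunk starts with an empty dictionary.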
I'll withhold judgement until I see some benchmarks in real world scenarios.
--alop
For some reason, I read this article and immediately thought about a 15-bladed shaving razor... My point being that 100 cores sounds impressive, but you get diminishing returns after a few cores. Even if software were written for multi-core use (and not enough of it is, IMO), you still can't possibly use 100 cores effectively... not before this processor is already extinct due to technological progress. Even my quad-core Intel CPU hardly uses all 4 cores... and most commonly hits CPU1 for processes.
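The diminishing-returns argument is usually put in numbers with Amdahl's law: if a fraction p of the work can run in parallel, the speedup on n cores is 1 / ((1 - p) + p / n). A small sketch follows; the 95% figure in the note is an assumption picked for illustration, not a measurement of any real workload.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Ideal speedup on `cores` cores when `parallel_fraction` of the work scales."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cores)
```

With 95% of the work parallel, 4 cores give roughly 3.5x and 100 cores give roughly 16.8x, so the last 96 cores buy far less than the first 4. Throughput workloads that run many independent requests are a different story, since there the parallel fraction is close to 1.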
100-core is binary for quad core.
If your network access is linear, then it's buggy. If the protocol specifies a linear stream, then it's buggy. I'm only half-joking -- by the time we get around to fixing these problems (how much do we have invested in TCP/IP?) they will bite hard and people will commit vile ugly hacks to get around them.
This looks like another one of those companies that announces they "will" have a part that does "something" nobody else does and that it "will" be available someday. When a two year old start up company makes an announcement like this, it usually means they are just looking for some fast capitalization to rip someone off. There recently was another start up that was going after Intel's business.
Then there was Transmeta Corporation.
Atheism is a religion like not collecting stamps is a hobby.
Any news on how the busses will be shared? This is an issue that most CPU manufacturers will look away from. Remember FB-DDRram? I can actually imagine an arbitrator bigger than the CPU in this multi-core architecture. You need something to help it scale.
To explain my point a bit better: imagine you have 100 computers all hooked up to a 10/100 hub (not a switch) and every computer has a BitTorrent client open. Same thing with the CPU and most modern buses. Your potential lag time to the bus is 99 other CPUs doing their shtick.
In TFA they mention blocks sharing switch points. Does that mean people will be encouraged to set affinities for data locality? Consider me to be an old fart, but I really would like some real world junk thrown at this or disclosure on the design.
It's been reported that these cores will be relatively underpowered, though both the total processing power and cost per watt will be quite impressive. This makes the chip appropriate for putting in a server but not so much a desktop machine, where CPU-intensive single-threads may bog things down.
So what about one of these in combination with a 2-, 3- or 4-core AMD/Intel chip? The serious threads can be run on the faster chip, while all the background stuff can be spread among the slower cores? Does Windows have the ability to prioritize like that? Does Linux?
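On the Linux side, a process can at least be pinned to a chosen set of cores by hand. A minimal sketch using the affinity calls in Python's os module, which exist only on platforms that support them (Linux, mainly); pin_to_cores is a name invented for this example. It says nothing about whether the scheduler can tell fast cores from slow ones on its own.

```python
import os

def pin_to_cores(wanted):
    """Restrict the current process to the cores in `wanted` it is allowed to use.

    Returns the resulting affinity set, or None where affinity is unsupported.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None  # platform has no affinity API exposed in the os module
    allowed = os.sched_getaffinity(0)  # 0 means "the calling process"
    target = allowed & set(wanted)
    if not target:
        return allowed  # nothing usable was requested; leave affinity unchanged
    os.sched_setaffinity(0, target)
    return os.sched_getaffinity(0)
```

In a mixed setup you would pin the heavy single-threaded job to the fast cores and leave background work on the rest; the taskset command does the same thing from the shell.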
It is like 100 Dancing Hamsters in your CPU.
Tsukasa: All I really want, is to be left alone...
A CPU for each cell processing its own value. Excel may almost run fast.
Old software typically runs in just a few threads. More cores won't help until new software is available.
I was doing some complex work on Excel 2007 and it was taking about a minute on a fast cpu. I checked the processor usage - it wasn't a disk intensive job but the usage graph was hovering only at the 40% level for the whole minute. Excel knows it has work to do, but something was still holding back the cpu. On a slower processor, the usage was into the 80 and 90% range though, and the time to finish was a lot longer.
Software inefficiencies just let my high speed processor idle. For older software, MHz still beats having a lot of cores, so Intel's turbo to let some cores run fast while others slow is just what we need.
Know your pads. One time pad: good for cryptography. Two timing pad: where to take your mistress.
*woosh* read his name.
Browse at -1 to keep an eye out for abuses.
Really? I did a quick Googling too, and found nothing. There's certainly nothing of this sort to be found on their homepage, nor ARM's. I did a lengthy Googling and found an Intel executive stating that it's ARM, but I also found an Ars Technica article http://arstechnica.com/hardware/news/2007/08/MIT-startup-raises-multicore-bar-with-new-64-core-CPU.ars stating that it's a MIPS-derived VLIW architecture. After MIPS revealed itself as a candidate it was easy to find more information, and MIPS it is.
- Henrik
- when the Shadows descend -
The company website claims...
64-bit VLIW processors with 64-bit instruction bundle
3-deep pipeline with up to 3 instructions per cycle
I don't know how this could be considered ARM or MIPS-derived...
A better description might have been in this article...
Back in the day when they only had 64 processors. Note that Tilera (then) had both shared, and local memory on each core. Using shared memory slowed things down quite a bit. Using local memory makes the algorithm even more complicated. YMMV.