Inside Intel's $20M Multicore Research Program
An anonymous reader writes "You may have heard about Intel's and Microsoft's efforts to finally get multi-core programming into gear so that there actually will be a developer who can program all those fancy new multicore processors, which may have dozens of core on one chip within a few years. TG Daily has an interesting article about the project, written by one of the researchers. It looks like there is a lot of excitement around the opportunity to create a new generation of development tools. Let's hope that we will soon see software that can exploit those 16+core babies. 'The problem of multi-core programming is staring at us right now. I am not sure what Intel's and Microsoft's expectations are, but it is quite possible that they are in fact looking at fundamental results from the academic centers to leverage their large work force to polish and realize the ideas that come forth. It calls for a much closer collaboration between the centers and the companies than it appears at first sight.'"
./configure --num-cores=16
The thing is, most PCs have plenty of computing power as a single core system. The hard sell is getting people to upgrade those machines mainly used for email and browsing and video playback. I think as time moves on and quad core becomes the "low-end" you will see less demand for higher end hardware. Unless the next version of Windows requires a core dedicated to the OS or something in the future.
“Common sense is not so common.” — Voltaire
So, In other words, a language for Windows.
-EL
In the new octagon (8-way processors), a battle of the ages, Crapware vs AntiCrapware
Most of the new cores are being used to isolate crapware and anticrapware in a Battle Royal.
And it looks like Crapware is going to win in a submission tapout at the current rate.
Agent K: A *person* is smart. People are dumb, stupid, panicky animals, and you know it.
The big deal will be when we (the user masses) get to use something that's not x86. Don't get me wrong, more cores are way cool, but there's always other ways to improve. Backwards compatibility :'(
Do not meddle in the affairs of sysadmins, for they are subtle, and quick to anger.
Software that will exploit 16+ cores already exists. The problem is, it is not consumer (home/office) software. There does not yet exist an application that people use that really needs multiple cores. Video encoding is getting there, but most people will never use it.
Fly me to the moon Let me sing among those stars Let me see what spring is like On jupiter and mars
I hope they do better at getting useful coding tools into the hands of home coders than GPU manufacturers have done for exploiting the parallel, programmable nature of modern GPUs.
It should not be very hard... The algorithm begs for multi-threading: once you divide your array, you apply the same algorithm to the two parts, recursively. The parts can be sorted in parallel, which has the potential for huge performance gains in database servers (... ORDER BY ...), etc.
Anyone?
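For anyone who wants something concrete to poke at, here is a minimal sketch of the divide-and-sort-in-parallel idea. The post does not name the algorithm, so merge sort is an assumption, as are the function name `parallel_merge_sort` and the `depth` cutoff. Note that in CPython the global interpreter lock keeps threads from running Python bytecode on several cores at once, so this shows the structure rather than the speedup.

```python
import heapq
import threading


def parallel_merge_sort(items, depth=2):
    """Merge sort that sorts the two halves concurrently for the first `depth` levels.

    `depth` is a hypothetical tuning knob: depth=2 means up to 4 halves in flight.
    """
    if len(items) <= 1:
        return list(items)

    mid = len(items) // 2
    left_half, right_half = items[:mid], items[mid:]

    if depth <= 0:
        # Below the cutoff, recurse sequentially so we do not spawn a thread per element.
        left = parallel_merge_sort(left_half, 0)
        right = parallel_merge_sort(right_half, 0)
    else:
        box = {}

        def sort_left():
            box["left"] = parallel_merge_sort(left_half, depth - 1)

        # Sort the left half in a new thread while this thread sorts the right half.
        worker = threading.Thread(target=sort_left)
        worker.start()
        right = parallel_merge_sort(right_half, depth - 1)
        worker.join()
        left = box["left"]

    # Merge the two sorted halves into one sorted list.
    return list(heapq.merge(left, right))


if __name__ == "__main__":
    print(parallel_merge_sort([5, 3, 8, 1, 9, 2, 7]))
```

The two halves share no data until the merge step, which is why the algorithm splits so cleanly. The merge itself stays sequential, so it eventually limits how far this scales.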
In Soviet Washington the swamp drains you.
It's only you peasants who persist in using old-hat Wintel stuff that are so last-year. Get with it people! You too could be running NetBSD on your toaster (it will probably outperform Windows Vista on a 4-core Pentium anyway). Hell it might even eat Nandos peri-peri Vista for breakfast!
Sent from my ASR33 using ASCII
Fail.
SMT processors of this type are only useful for accelerating a certain type of problem set, and useless for most general computing.
We've had SIMD multicore PC's forever, and they're useless as desktops. I write this from a quad xeon machine, repurposed as my dev box, as CPU1 grinds away at about 75% all day long, the rest idle. It's been like that for more than a decade, it'll be like that until MIMD hits the street with a whole new paradigm of programming languages behind it - a handful of C compiler #pragma directives from intel isn't going to make this work.
It's not simply a matter of "coders don't know how to do it." It's a matter of these multi-core "general purpose" CPUs are only really useful for a fairly limited set of specific problems.
E.g., writing a game engine with a video thread, an audio thread and an input thread still leaves 13 cores idle. You really can't thread those much further (the ridiculously parallel problem of rendering is handled by the GPU).
Simply starting processes on different procs doesn't help all that much, since they all fight over memory and I/O time. The point of diminishing returns is reached fairly quickly.
But hey, if all you do is run Folding@home so you can compare your e-cock with the other kids on hardextremeoverclockermegahackers.com, well I have some good news!
As for me, I'm seeing AMD's multiple specific purpose core approach as being more viable, as far as actually making my next desktop computer perform faster.
Savain says it best at rebelscience.org: "Even after decades of research and hundreds of millions of dollars spent on making multithreaded programming easier, threaded applications are still a pain in the ass to write."
I don't need no instructions to know how to rock!!!!
The structure of VHDL is inherently parallel as all processes (blocks of hardware) run at the same time. Only the code within the processes is evaluated sequentially (in most cases).
Although VHDL is a hardware description language, couldn't similar concepts be used to make a parallel centric computer programming language?
Instead of trying to convince everyone on Earth to change all existing software, why doesn't Microsoft just make the next version of Windows have a process handler that can process single threads on multiple cores at once? Actually technically I think Intel could do that internally on their processors too sort of like RAID for cores. It seems really difficult and inefficient but if they finally got it right so it worked, all software would run faster on multi-core chips! Talk about a selling point! Cuz right now the 90% of my software that only processes its resource intensive code in a single thread actually runs slower on a 1.8GHz quad core than on a 2.6 GHz dual core!
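To put rough numbers on the quad-versus-dual comparison, here is a small sketch using Amdahl's law. It assumes per-core speed is simply proportional to clock frequency (real chips differ in cache, memory and instructions per clock), and the function names are made up for illustration.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Amdahl's law: speedup over one core when `parallel_fraction` of the work scales perfectly."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


def relative_throughput(clock_ghz, cores, parallel_fraction):
    # Assumption: single-core speed is proportional to clock frequency.
    return clock_ghz * amdahl_speedup(parallel_fraction, cores)


def break_even_fraction(clock_a, cores_a, clock_b, cores_b, steps=10_000):
    """Smallest parallel fraction (on a grid of `steps` points) where machine A
    matches or beats machine B, or None if it never does."""
    for i in range(steps + 1):
        p = i / steps
        if relative_throughput(clock_a, cores_a, p) >= relative_throughput(clock_b, cores_b, p):
            return p
    return None


if __name__ == "__main__":
    for p in (0.0, 0.5, 0.9):
        quad = relative_throughput(1.8, 4, p)
        dual = relative_throughput(2.6, 2, p)
        print(f"parallel fraction {p:.1f}: quad {quad:.2f} vs dual {dual:.2f}")
    print("break-even parallel fraction:", break_even_fraction(1.8, 4, 2.6, 2))
```

Under these assumptions the 1.8GHz quad only pulls ahead of the 2.6GHz dual once roughly three quarters of the work runs in parallel. With fully single-threaded code, the comparison reduces to 1.8 against 2.6.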
Google's Super Secret Search Algorithm: SELECT @search_results FROM internet WHERE @search_results = 'good'
We all already have networks of servers all running in parallel. Multi core processing is simply squashing the network onto a little bit of silicon.
Sun T2 uses 8 cores and 64 threads, like HyperThreading on steroids. This processor comes into its own running applications that can take advantage of thread-level parallelism (TLP): http://en.wikipedia.org/wiki/UltraSPARC_T2
In Soviet Russia ^H^H^H America, The bank finances YOU!
Didn't we already see this one? Intel (this time AMD also) develops radical new processor arch that will be insanely great once a quantum leap in developer tools is made to utilize it.
Itanic crashed, burned and sank against the rocks of the compiler tech not being able to keep up. I see it happening again.
Yes we will find ways to make a quad core system stay busy enough to sell em to corporate desktops and home users. Hell, you can assign one to the virus/crapware scanner. Waste another or two doing ever more Movie OS like 3D effects on the desktop. Most apps can (and will) be rewritten to keep a couple of threads busy, even if it is less efficient than the older single threaded version. But sixteen or more? We better get cracking on natural language voice recognition, direct brain interfaces or something to keep the chipmakers profitable because traditional desktop apps won't ever keep that many cores busy.
Democrat delenda est
from my point of view "multi-core" programming is just "SMP" programming - the programmer doesn't care where the cores are distributed on which chips.
there are two interrelated parts to designing languages for SMP scalability in my opinion: designing the programming interface (the syntax - the concept of how the programmer works with the language) and making the right implementation in the compiler or virtual machine to achieve SMP scalability, i.e. maintaining maximum efficiency of usage of each core while providing maximum flexibility in manipulating data.
The first part is more art and the second more technical; both make up the overall language solution.
I've designed the qore programming language for SMP scalability, however using a more traditional approach to threading - as opposed to interesting techniques such as those taken by Erlang or Scala.
Basically in qore there are a lot of optimizations aimed at reducing the number of cache invalidations while still providing as much shared state as possible between threads. It's basically a dynamically-typed language where global variables and objects are shared between all threads and local variables are thread-specific. For such nice and easy-to-use (i.e. thread-safe) access to data it has very good performance, even on single-threaded code (particularly the upcoming version in svn trunk which has a completely re-written type system and some massive new optimizations).
While the art of design of the programming language in this case is not as exotic or interesting as Scala or Erlang, for example, the more traditional approach and the fact that the entire language was designed to be thread-safe and scalable on SMP machines can serve to make it easier for some programmers to write multi-threaded code.
Qore supports deadlock detection and will throw exceptions on outright threading errors as well to make multi-threaded programming a little easier to work with in comparison to some other languages.
Qore supports a pretty unique feature set, in that it's designed to support embedding (and arbitrarily restricting) code, interfacing, and SMP scalability in a dynamically-typed language.
It also supports native XML and JSON de/serialization, powerful and very easy-to-use database drivers (including a DatasourcePool class that offers transparent datasource pooling on a per-thread basis), perl5 regular expressions, and a lot more.
the new version (coming "real soon now" :-) ) in svn trunk has a qt4 module (just starting work on the opengl support :-) ) and a documented (with doxygen) and stable API.
Anyway, if anyone's interested in checking it out, the homepage is:
http://qoretechnologies.com/qore
and the project is also hosted on sourceforge:
http://sf.net/projects/qore
-David
...which may have dozens of core on one chip...
Does hardware come in flavors of "core," like music? So, what would be the hardware equivalents of hardcore, grindcore, thrashcore, gothcore, nerdcore, etc.?
Although VHDL is a hardware description language, couldn't similar concepts be used to make a parallel centric computer programming language?
Excellent suggestion. This is precisely what the COSA software model is about. A pulsed neural network is my preferred metaphor for an ideal model of parallel computing. Intel and the others are on the verge of losing billions of dollars because they are already deeply committed to the hard to program multithreading model, a complete failure even after decades of research. To find out why multithreading is not part of the future of parallel programming, read Nightmare on Core Street.
Forget software not being written for multi-cores, the entire infrastructure around the computer needs to "go wide" for massive parallelism, not just the software. This includes disk, memory, front-side bus, etc.
I'm doing highly concurrent projects (grid computing) for my company and we're finding that some things parallelize just fine, but others simply move the pain and bottleneck to a piece of infrastructure that hasn't quite caught up yet.
For example, my laptop has a dual-core 2.2GHz processor, which you'd think is great for development. It's no better than a single CPU machine because my disk IO light is on all the time. IntelliJ pounds the disk. Maven and Ant pound the disk. Outlook pounds the disk. Even surfing the web puts pages into disk cache, so browsing while building a project is slow. Until I get a SCSI drive, I'm still limited on disk IO, so those extra cores don't help that much.
All the cores are great on the server, though. I've recently completed a massive integration project where I grid-enabled my company's enterprise apps. All those cores running grid nodes are giving us very high throughput. Our next bottleneck is the database (all those extra grid nodes pounding away at another bottleneck resource...)
Terracotta Server as a Message Bus. It's been a very interesting project.
The solution is right in front of our faces. If you use virtualization then you can easily make use of a 16 core system. I can have IIS, Exchange, a Linux Apache Server, and a Terminal Server all on the same physical machine.
Yeah, I agree with this view. Multi-threaded programming isn't really hard. I would think by now most coders are quite comfortable with the concept and the implementation. Multi-user applications are no-brainers -- most are well-threaded. From my experience the trouble is that most single-user applications are ill-suited to it. Unless you have many processes occurring at once or a particularly well-suited type of static math problem (like video encoding) it tends to be useless.
Data Parallel Haskell.
Cheers.
Scala is a JVM based language that has good features for working well with multiple cores (Actors, immutable collections, functional language, etc), so why not sponsor it?
Mats
About 10 years ago, Intel bought Kuck and Associates, Inc (where I used to work), which for years sold "The KAP" auto-parallelizing C and Fortran optimizing compilers. Many of the KAP programmers still work in Intel's compiler group today. Intel has had access to these tools for years. The KAP worked fine on 20 cpu Sequent computers back in 1993, so why are these tools "new technology" 15 years later?
...for Linux, Mac and Windows supporting multicore and also cluster architectures.
Obviously it would be better if these worked better and were easier to use, but many people are unaware of the tools that are available right now.
You can't blow your nose for less than 20 million at either Intel or MS. I know lots of stories of 50-100 million dollar projects that led to absolutely nothing. And these are companies that are taking in billions every year.
This must be very low priority for them.
Seriously, Folks, who can do anything for a mere $20M today, let alone change the entire programming paradigm of the last 65 years?
"It's the height of ridiculousness to say for those 9 lines you get hundreds of millions."
Sure. In fact, VHDL is based (closely) on Ada, which allows pretty similar things. The relevant differences are less between the languages than how they're used. Ada that was written in the same style as most VHDL would have a high level of parallelism as well.
That's rarely done though, because designing hardware in VHDL (or Verilog, etc.) is expensive, largely because designing things this way is difficult. Virtually nobody's willing to put up with the long lead times and extremely high initial expense of such a design for software, unless it's doing something that really gains a tremendous amount from doing so.
Of course, that hasn't stopped people from trying. The Transputer was one early attempt. The Ambric is on the market now, and apparently sells pretty decently into a few specialized markets (e.g. video encoding/decoding) but I've never heard of anybody even advocating that it was a practical tool for a typical desktop computer.
The universe is a figment of its own imagination.
Just like AI, nothing worth a real prize. Look at what we expect, no compromise, machine learning is NOT real AI, similarly, what we have now is not real parallel. Can we build the parallel machine? Program it easily? Why are we not satisfied with MPI? Usability...
What if there was a core of a, say, quad, hex or octo processor that only parted out processes to the other cores? Its sole job would be to part out and reconcile various operations. This would make the other cores much more efficient by getting them the right information, completely separate from software.
How does Intel persuade people to buy new CPUs if there is no benefit delivered to the buyer?
How does Microsoft sell you new licenses if you don't buy a new computer?
Virtualization at the OS image level only allows you to run multiple different applications. Running more applications at once isn't the primary goal of the average user. They want the application which has the focus of their attention to be slick and fast.
Multicore CPUs do not allow you to run a single application faster. Intel's PC market and Microsoft's empire were built in a feedback loop based on the promise that you can buy a new machine every two years and your applications will run significantly faster. This held true until a few years ago when semiconductor technology hit the heat density wall on ramping up clock frequency. Now, and for the foreseeable future, if you buy a new machine your single threaded application will run NO faster than it did on the old hardware.
That in a nutshell is the multicore problem. Most existing software is not written to exploit parallel processors. Most software developers cannot write correct parallel code. The promise of "buy a new one, it is faster and better!" becomes a lie if the software cannot exploit the extra cores.
No one has the solution to this in their pocket. Threads aren't the answer because they are ridiculously hard to use correctly outside of very coarse grain contexts. Automatically parallelizing compilers have never delivered the goods in the general case. New languages face extremely slow adoption. The answer probably lies in languages, but the adoption problem is an extremely tough nut to crack. The recent successes here are Java (basically C++ with garbage collection) and Javascript+AJAX, which I don't think anyone heralds as a radical leap forward in language design.
I am involved in this research personally, so I'm not just pulling these assertions out of the air.
People need to stop thinking that 'I don't have a program that uses 16 cores (16 real threads), so I don't need a 16 core system.'
Chances are you have at least 16 programs running and each of those is run in a thread. User applications aren't the only things that need CPU time. That's only scratching the surface.
People are not creating multicore systems with the idea that a single program will use all the cores. Some programs will be more multithreaded than others, but that's not the point.
With multiple cores, you give the user the feeling of a more responsive system (due in some part I'd imagine from the CPU scheduler having far more real threads to work with than a single core system). Resource allocation of cpu time becomes more generous/less taxing for the OS.
The end result is that your MP3 player will run happily along in the background doing its thing, while your file download manager is downloading many files off the net, and the user is sitting down writing his word document, which has real-time spell checking that doesn't pause while it scans the large document. Oh, and Gmail is running on your web browser. Your IM client of choice is somewhere around there too.
All these programs have multiple threads (I won't even bother mentioning the plethora of operating system/system utilities and services and their threads).
Imagine splitting the CPU cycles of 1 core for all these tasks, and sharing them fairly, against splitting the cycles of 2..4..16 cores.
Outlook I can understand. It needs to flush the emails to disk before replying back to the server.
However, there's no reason why the web browser needs to ensure that the data hits the disk cache right away, so it should be just fine sitting in RAM until the disk frees up. Similarly, IntelliJ, Maven, and Ant should be slow the first time but faster later on since they should be reading from the page cache.
There's no reason for your disk I/O light to be on unless you don't have enough RAM or the disk algorithm in Windows blows chunks.
I do Linux kernel development, and once I do an initial pass through the source tree the whole thing generally stays in RAM and I rarely have to hit the disk. I have 3GB of RAM, but this isn't excessive nowadays.
I've got a dual core machine sitting on the desk before me and the CPU rarely goes above 20% load. The strange thing though is it is still slow when loading programs, and this is due to the hard disk (SATA II) being the bottleneck on my system. I could fix this to some degree with a RAID setup, but the real question is why isn't this being looked at more closely?
Perhaps you could please explain COSA. I fail to see any difference between it and any other existing component framework or GUI based language. More importantly, how do you ever intend to implement COSA and keep threads out of it? Support from the CPU manufacturers is the biggest long shot ever; their job is only to provide a CPU, not accommodate any language. And in reality, any existing compiled language could just as easily make use of such a CPU. At such a low level, it's really all up to the compiler, no matter what language you use. Sure, some languages might give the compiler more insight, but it's still not really going to be super faster. Any higher level implementation is no real different than existing component/parallel programming models, which must use OS provided features to achieve parallelism, such as (especially) threads.
As I already pointed out, the OS is what will determine how to parallelize things. It's true threads are quite horrible, but that can't be helped. If you really want to achieve easy parallelism, then what feature(s) would you demand from the OS to achieve it? Because no matter what new language you come up with, the fact remains that the OS runs the computer, and the compiler handles the CPU, and there are no other locations where it's possible to take advantage of parallelism.
Business driven server systems.
:)
We love multiple threads and in some cases multiple processes and/or multiple machines. DBMS's, transaction processing systems, web servers, middle tier data servers....
The code is already in place and the more cores you throw at us the faster we can run. It's not mathematically complex tasks but it is workhorse stuff that parallelises beautifully, though not in a particularly fine-grained way.
There's more to computing than consumer desktops and science/engineering departments. Think about the infrastructure of the world - the financial markets, the credit card networks... all sorts of other things that don't instantly spring to mind
09 F9 11 02 9D 74 E3 5B D8 41 56 C5 63 56 88 C0
As I already pointed out, the OS is what will determine how to parallelize things.
Not at all. The processor should be designed for the software model, not the other way around. All of our current processors are designed for the algorithmic software model, which was created 150 years ago by Charles Babbage. It's time to change. Like it or not, the computer industry will be dragged kicking and screaming into the 21st century.
Indeed, but then just how do you encode a COSA style datastream for execution by the processor? No matter how hard I try, I simply can't figure out a better way (or rather: any other practical way) than algorithmic code for the CPU level of computation (which itself is pretty much the same as the abacus). Sure, you can have procedural based CPUs, or function based, or recursion based, but I can't figure out how an object based method would look, let alone how to program it.
I had actually proposed this here more than a year ago. I think this is the way to go, perhaps alternatively with Verilog or System C. Too bad they don't have a search capability of your old postings or I would have linked to it.
My rights don't need management.
why can't we have something like netlogo, except based on a real language? netlogo is pretty awesome as it is, but why can't we have a dialect of Python that makes parallelism that easy?
imagine a Beowulf cluster of ...!
Sort of. There are parallel functional languages that describe portions of a parallel program, and some magic system will figure out how to map those to hardware. Those never worked very well for most types of applications though. Thus no one uses them.
We can build him. We have the technology.
Indeed, before "Worse is better" took off, computer science had better ideas. Have a look at some ideas about multi core / multi threading from 1983:
;-) [1]
http://en.wikibooks.org/wiki/Ada_Programming/Tasking
But you can use them - just use "gcc -x ada"
Martin
[1] You need a fully installed GNU Compiler Collection for it to work.
Well, VHDL is based on Ada, so why not use Ada then? Have a look:
http://en.wikibooks.org/wiki/Ada_Programming/Tasking
Martin
As I and other already pointed out VHDL was modelled after the Ada programming language - and as such Ada already has the multitasking features you are looking for:
http://en.wikibooks.org/wiki/Ada_Programming/Tasking
Martin
It can be done via the Actor model.
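For readers who have not met the idea, here is a minimal sketch of an actor in Python. The class `CounterActor` and the helper `run_demo` are invented for illustration. The point is that only the actor's own thread ever touches its state, and everyone else communicates by putting messages in its mailbox, so no explicit locks appear in user code.

```python
import queue
import threading


class CounterActor:
    """A tiny actor: private state, one mailbox, one thread that drains it."""

    _STOP = object()  # sentinel message that tells the actor to shut down

    def __init__(self):
        self._mailbox = queue.Queue()
        self._total = 0  # only the actor's own thread modifies this
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def send(self, amount):
        """Asynchronously ask the actor to add `amount` to its total."""
        self._mailbox.put(amount)

    def stop(self):
        """Ask the actor to finish, wait for it, and return the final total."""
        self._mailbox.put(self._STOP)
        self._thread.join()
        return self._total

    def _run(self):
        while True:
            message = self._mailbox.get()
            if message is self._STOP:
                break
            self._total += message


def run_demo(senders=4, messages_each=1000):
    """Several threads send messages concurrently; the actor serializes them."""
    actor = CounterActor()

    def producer():
        for _ in range(messages_each):
            actor.send(1)

    threads = [threading.Thread(target=producer) for _ in range(senders)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return actor.stop()


if __name__ == "__main__":
    print(run_demo())
```

Languages like Erlang and Scala build this pattern into the runtime or standard libraries. The sketch above only imitates the shape of it with a thread and a queue.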
It's called LabView ! Where spaghetti code actually looks like spaghetti...
Non-Linux Penguins ?
Family cars had that kind of power in the fifties??? I'm just asking, I wasn't born. It does sound extremely strange, since the highest-end Porsche 356 only has 130HP. We're talking about a sportscar produced from 1948 to 1964: the timespan you describe.
Not being all that informed about American cars, I headed over to wikipedia "Ford" (American right?), took a "full-sized" category car from the fifties: the Ford Fairlane (Hey, I know that car from GTA!) and its only mentioned power in the article is 225HP. Not bad, but far from the 500HP you claim. I also clicked on the link to the "Chevrolet Bel Air" (which according to the article "overshadowed" the Fairlane), which goes up to 195HP. (In the sixties there is a 409HP model which is a collector's vehicle now.)
So, either you're looking at the fifties through rose-coloured glasses, or I'm looking at the wrong category of "family cars".
As a final note: the engineers behind the Bugatti Veyron had to overcome quite a lot of obstacles to reach those 1001HP the car is rated at. How the Bugatti Veyron works.
There are dozens of modeling systems that will map algorithms across an arbitrary number of homogeneous processors. Most of these development tools don't target x86 instruction set based machines, so I guess that's why Intel doesn't seem to have a clue they exist. Or perhaps this is a bunch of hype to lay the groundwork for a future marketing campaign.
http://www.ni.com/multicore/ They have been doing it for a while.
Chris [CapitalC]
Surely the lessons learnt from PVM and other similar networked parallel machines can be used in multicore programming?
Does anyone know of a benchmarking of the top OSs showing performance as core numbers increase for various activities? To me the question for the buyer is going to be: What can this computer/OS/App offer to me as the number of processors (and thus chip/machine prices) increase? I suspect that what we will find is that performance increases diminish as the number of processors increase due to fundamental multi-core architecture problems involving moving data intra as well as inter chip even before we get to issues involving how to best allocate the workload. Perhaps the new optical bus technology is needed now that we are potentially asking for 16x (or more with overhead) the current single processor communication load. The optical technology can perhaps eliminate these limitations now so we can get on to the tougher problem of how to intelligently distribute processing resources.
Be as you would have the world become.
...the obligatory "current computers are fast enough" comment that we get on every article about new technology.
Patrick Doyle
I mod down every jackass who puts his moderation policy in his sig. Oh, wait a sec....
Does anyone know of a benchmarking of the top OSs showing performance as core numbers increase for various activities?
:)
It's easy in my case, since I write my own engineering code. Restrict the number of cores your program is allowed to access. Compare.
Now if you are using commercial software, it's pretty case-dependent, but in some areas of work (video rendering is one) software is already utilizing 8 cores, no problem.
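A minimal sketch of that restrict-and-compare approach for your own code might look like the following. It caps the number of worker processes rather than pinning to specific cores, which is only an approximation of restricting cores. The workload (`count_primes`) and the helper names are placeholders, not anyone's actual engineering code.

```python
import time
from concurrent.futures import ProcessPoolExecutor


def count_primes(start, stop):
    """CPU-bound, I/O-free work unit: count the primes in [start, stop)."""
    count = 0
    for n in range(max(start, 2), stop):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def timed_run(workers, chunks):
    """Wall-clock seconds to process every chunk using `workers` processes."""
    starts, stops = zip(*chunks)
    begin = time.perf_counter()
    with ProcessPoolExecutor(max_workers=workers) as pool:
        list(pool.map(count_primes, starts, stops))
    return time.perf_counter() - begin


def speedups(timings):
    """Speedup of each run relative to the single-worker run."""
    baseline = timings[1]
    return {workers: baseline / seconds for workers, seconds in timings.items()}


if __name__ == "__main__":
    # Eight equal-sized ranges; later ranges cost a bit more, so scaling will not be ideal.
    chunks = [(i * 20_000, (i + 1) * 20_000) for i in range(8)]
    timings = {w: timed_run(w, chunks) for w in (1, 2, 4)}
    for workers, factor in speedups(timings).items():
        print(f"{workers} worker(s): {factor:.2f}x")
```

The numbers depend entirely on the machine and the workload. A job with heavy I/O or a lot of shared memory traffic would likely flatten out much sooner than this one.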
I suspect that what we will find is that performance increases diminish as the number of processors increase due to fundamental multi-core architecture problems involving moving data intra as well as inter chip even before we get to issues involving how to best allocate the workload.
I haven't seen this. My data set is pretty small: a quad core Opteron and a dual core Athlon x2. Scaling from 1-4 processors on the Opteron returns rather linear performance increases. Scaling on the x2 is just 2 data points. I will admit my work is very low I/O.
I'd love to build me a dual quad-core Xeon box at home and play with those numbers, but I just don't have the financial reserves at the moment... wife, kids, college keep sucking it away