Intel Updates Compilers For Multicore CPUs
Threaded writes with news from Ars that Intel has announced major updates to its C++ and Fortran tools. The new compilers are Intel's first that are capable of doing thread-level optimization and auto-vectorization simultaneously in a single pass. "On the data parallelism side, the Intel C++ Compiler and Fortran Professional Editions both sport improved auto-vectorization features that can target Intel's new SSE4 extensions. For thread-level parallelism, the compilers support the use of Intel's Thread Building Blocks for automatic thread-level optimization that takes place simultaneously with auto-vectorization... Intel is encouraging the widespread use of its Intel Threading Tools as an interface to its multicore processors. As the company raises the core count with each generation of new products, it will get harder and harder for programmers to manage the complexity associated with all of that available parallelism. So the Thread Building Blocks are Intel's attempt to insert a stable layer of abstraction between the programmer and the processor so that code scales less painfully with the number of cores."
...briefly translate this article into cretin for me, so that I can understand a bit more of why it's so cool?
u-bend
Translation: "They have made improvements that can better translate regularly programmed code into machine code that can run faster on their CPUs".
Hopefully, your games will be out faster, and run more realistically (in terms of AI and graphics), because programmers will spend less time making sure their code makes full use of the features of the CPU(s).
We see Intel mainly as a CPU/chipset maker, but don't pay much attention to their software side. I believe they are one of the largest software development companies in the world. Between drivers, compilers, and all the other goodies to support all their hardware, they spend a lot of time doing software development.
And as much as they develop compilers to optimize code for Intel CPUs, most of the time the code will see a speed increase on AMD CPUs as well. Who else do you want developing a compiler but the people who made the hardware it's running on?
Its not what it is, its something else.
Will they add these features to GCC or make docs available so others can?
From Wikipedia: Moore's Law is the empirical observation made in 1965 that the number of transistors on an integrated circuit for minimum component cost doubles every 24 months.
Alrighty, then. It's been a while since my CS classes. How does that apply to software? Does he mean that instead of increasing transistors on a single chip, the transistors are virtually increasing by using multiple cores?
I prefer Flambe as apposed flamebait.
If compilers keep abstracting away the interface between the programmer and the cpu, programmers will be less likely to write better code or learn new techniques that take advantage of all the power a few extra cores can provide right? That's just my take on it. Then again, I also think learning parallel programming techniques is fun, and a little more academic than most career programmers might like.
mr pibb + red vines = crazy delicious
I was looking at the Thread Building Blocks paper, and it reads like it was somebody's hastily-scribbled draft:
"The Intel Threading Tools automatically finds correctness and performance issues" (The tools finds?)
"Along with sufficient task scheduler and generic parallel patterns" (Who has insufficient task scheduler?)
"automatic debugger of threaded programs which detects many of thread-correctness issues such as data-races, dead-locks, threads stalls" (Sarcasm fails me...)
And that's just in the first few paragraphs, I haven't even gotten to the real meat of the article!
I'm used to informative, well-written and reasonably complete technical documentation from Intel — WTF is this?
Just junk food for thought...
Intel has added kitten whiskers and pixie dust to its compilers so your ponies can now play on multiple paddocks.
Do not try to read the dupe, that's impossible. Instead, only try to realize the truth.
What truth?
There is no dupe
As the company raises the core count with each generation of new products, it will get harder and harder for programmers to manage the complexity associated with all of that available parallelism.
As a programmer, I already have abstractions such as Active Objects. While this may make it easier for compiler writers or kernel hackers, what benefits does it bring to us ordinary mortals?
FreeSpeech.org
Cue "Fortran is Dead" comments in
30
20
10
"As God is my witness, I thought turkeys could fly." A. Carlson
... the version before this one was in ebonics.
the intel product has somewhat more detail. it can be found here.
...vividly encapsulates that post-Watergate/pre-punk/coked-up moment when you could trust no one, least of all yourself.
Vectorized version is something like this:
Obviously, the above isn't valid code, but the idea is there, I hope?
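For anyone who wants something that does compile, here is a minimal C++ sketch of the same idea, assuming a plain element-wise add. The function names are made up for illustration and have nothing to do with Intel's tools. The second function does four elements per trip through the loop, which is roughly the shape a vectorizer aims for on 4-wide SSE registers; a real compiler would emit one SIMD instruction where this sketch has four separate statements.

```cpp
#include <cstddef>
#include <vector>

// Scalar form: one addition per loop iteration.
// Assumes a and b have the same length.
std::vector<float> add_scalar(const std::vector<float>& a,
                              const std::vector<float>& b) {
    std::vector<float> c(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        c[i] = a[i] + b[i];
    return c;
}

// "By fours" form: the loop is split into chunks of four independent
// additions, plus a cleanup loop for the leftover elements.
// Assumes a and b have the same length.
std::vector<float> add_by_fours(const std::vector<float>& a,
                                const std::vector<float>& b) {
    const std::size_t n = a.size();
    std::vector<float> c(n);
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        // These four statements do not depend on each other, so they
        // could all be done by a single 4-wide SIMD add.
        c[i]     = a[i]     + b[i];
        c[i + 1] = a[i + 1] + b[i + 1];
        c[i + 2] = a[i + 2] + b[i + 2];
        c[i + 3] = a[i + 3] + b[i + 3];
    }
    for (; i < n; ++i)  // remainder when n is not a multiple of 4
        c[i] = a[i] + b[i];
    return c;
}
```

Both functions return the same result; the only difference is how the work is grouped.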
I thought it was BSD that was dead?
It is, you know.
There was an ad today on the slashdot mainpage that read: "Want to jump of your version control tool?"
copy and pasted URL from the javascript non-link in the html source
OMG! PONIES!!!
You know, it's not high school here. You don't have to pretend to be stupid. It will actually make people think worse of you, not better.
Has /. really got to the stage where people think it's somehow clever to be stupid? News for nerds, and all that....
And if you simply are ignorant, you could always read about the things in the summary. You might learn something that way.
But has
If I am writing a quantum physics calculation package, or compiling one (let us say... Molpro 96), I want it to make correct use of the many cores I assign it to run on, using a highly parallelized Fortran compiler. I do not want to know how and why it does it, I just want it to do it. I ain't a computer scientist, I am not writing a thesis on computer science, and I don't care one iota about this. I leave that to the computer scientists. Neither does my chief care about academic computer science. Same for Intel multicore. The fun fact is that most of us want to use the power of the many cores, and don't care a bit about the how, why etc...
That's interesting, but what if step 11 of the loop is dependent on step 10? How does one vectorise that?
I can imagine vectorisation of loops working alright for basic loops like the one you described, which'll help in a number of cases, but it's not going to scale exceptionally well. It's good, but it's nothing amazing if I'm reading this right.
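To make the "step 11 depends on step 10" case concrete, here is a small hedged C++ sketch using a running sum. The function names are hypothetical. The first function has the normal sequential meaning; the second shows what you would get if every step were naively executed "at once", each one reading the old value of its neighbour. The answers differ, which is why a compiler cannot blindly vectorise such a loop.

```cpp
#include <cstddef>
#include <vector>

// Sequential meaning: step i reads the value that step i-1 just wrote.
std::vector<int> running_sum_sequential(std::vector<int> a) {
    for (std::size_t i = 1; i < a.size(); ++i)
        a[i] = a[i - 1] + a[i];
    return a;
}

// What naive "all steps at once" execution would compute: every step
// reads the ORIGINAL a[i-1], because its neighbour has not finished yet.
std::vector<int> running_sum_stale_reads(const std::vector<int>& a) {
    std::vector<int> out(a);
    for (std::size_t i = 1; i < a.size(); ++i)
        out[i] = a[i - 1] + a[i];
    return out;
}
```

For the input 1, 2, 3, 4 the sequential version gives 1, 3, 6, 10, while the stale-read version gives 1, 3, 5, 7.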
>> As the company raises the core count with each generation of new products, it will get harder and harder for programmers to manage the complexity associated with all of that available parallelism.
I'm very surprised and disappointed by the pervasiveness of the myth, promoted even amongst supposedly technically knowledgeable groups, that:
a) Writing multithreaded code is terribly difficult
b) You need to implement code to have the same number of threads as your target hardware has cores
Both of these are simply untrue, at least for the PC marchitecture.
The way to develop multithreaded code is to exploit the natural parallelism of the problem itself. If the problem decomposes down most neatly into one, three or 6789 threads, then design and write the implementation that way. Consequently the complexity of the problem does not increase as the number of cores available increases.
In the PC architecture case, attempting to design your code based on the number of cores in your target hardware just leads to a twisted and therefore bad and also non-portable design.
I'm surprised how few developers seem to understand that it's in fact OK, normal and often desirable to have more than one application thread running on the same core. You can't ensure or even assume that your multi-threaded app will get one core per thread even if the hardware has enough cores, or that it will work best if it does, because core/thread allocation is dynamically scheduled by the OS depending on load. Not to mention there are all sorts of other apps, drivers and operating system tasks running concurrently too, so depending on each core's load, one app thread per core may not be the best approach anyway.
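A rough C++ sketch of that point, using only the standard library. The function name and the "pieces" decomposition are assumptions for illustration. The number of tasks comes from how the problem splits up, not from the core count, and the OS scheduler is left to place them on whatever cores are free.

```cpp
#include <future>
#include <numeric>
#include <vector>

// Sum each natural piece of the problem in its own task.
// The task count is pieces.size(), whatever the hardware looks like;
// several tasks may well end up sharing one core.
long long sum_pieces(const std::vector<std::vector<int>>& pieces) {
    std::vector<std::future<long long>> tasks;
    tasks.reserve(pieces.size());
    for (const auto& piece : pieces) {
        tasks.push_back(std::async(std::launch::async, [&piece] {
            return std::accumulate(piece.begin(), piece.end(), 0LL);
        }));
    }
    long long total = 0;
    for (auto& task : tasks)
        total += task.get();  // wait for each piece and combine the results
    return total;
}
```

The same code runs unchanged on one core or sixteen. For thousands of tiny pieces a thread per piece would be wasteful, and that is where a task scheduler earns its keep.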
I recently downloaded Intel's compiler to see whether my C++ code would run faster on it--I ended up giving up on it (for now) after spending a day trying to get it to work. I'm sure their compiler has many whizzy features in it, but for me, they don't really matter unless they're in GCC. I hope Intel will realize that it's in their interest to migrate these advances there.
"Not an actor, but he plays one on TV."
I know OS X is compiled using GCC, but I wonder if Apple would see performance gains by using the Intel compiler? If they did, would it somehow introduce problems? Basically, I'm wondering if there would be a downside to using the Intel optimized compilers as opposed to the all-purpose GCC compiler.
As an aside, Linux is obviously compiled using GCC but I wonder if Microsoft compiles Windows using the Intel compilers?
No, they won't add them to GCC. Intel's compiler competes with GCC and it is the best there ever was. In every test I've ever seen on Intel chips, it comes out ahead and I'm sure they've no interest in changing that. However yes, the docs are out there. Intel's processors are extremely well documented and you can get everything you need. The problem isn't that the GCC people are having to guess how the processors work, the problem is that their coders aren't as good as Intel's at optimising their compiler. This isn't helped by the fact that GCC targets many architectures where the ICC is only for one.
However don't expect Intel to help GCC out. Their answer will just be "buy the ICC".
PONIESOMGBBQ!!!?!!
The Kruger Dunning explains most post on
You're thinking of IBM.
Dewey, what part of this looks like authorities should be involved?
...hangin' right here.
Note the caption at the bottom of the photo that says how FORTRAN will make the machine easy to use!
PGI and Sun both make auto-parallelizing and optimizing Fortran/C/C++ compilers specifically for K8 (and i386).
THIS THING CAN TURN ON A DIME, MACROSSZERO STYLE ALSO FUCK BETA, ~NYORON
Intel is not trying to kill GCC or anything. They try very hard to make ICC compatible with gcc and g++ (ABI and command line interfaces)... so that you can just set CC=icc in your makefile and be on your way.
It was a big source of pride for them that they got the linux kernel to build in icc without patching. *eye roll*
But they don't expect every linux user to buy ICC or anything. They position it for use for performance reasons.
You teach freshman CS, or CE, don't you?
You're right: vectorization by itself can't handle step 11 being dependent on step 10. Assuming there isn't a magical way to rewrite the loop to remove the dependence (which is the first thing the compiler will try to discover and do for you, but usually it can't), you need to look at pipelining: software pipelining on a single core, or parallelism on multi-core. For the latter you'll need the right processor-to-processor interconnect to match the work if multiprocessor pipelining is going to do what you want. Software pipelining can be very effective on loops with dependencies from one iteration to the next.
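One case where the rewrite does work is a plain sum. Here is a hedged C++ sketch with hypothetical function names. Written the obvious way, each addition waits on the previous total. Split into four independent running totals, the chain is broken and the four lanes can proceed side by side. It uses integers, where regrouping the additions is exact; with floating point the same regrouping can change the rounding, which is one reason compilers are cautious about doing it unasked.

```cpp
#include <cstddef>
#include <vector>

// Obvious form: every iteration depends on the total from the last one.
long long sum_chained(const std::vector<int>& v) {
    long long total = 0;
    for (int x : v)
        total += x;
    return total;
}

// Rewritten form: four independent running totals, combined at the end.
// No lane depends on another inside the main loop.
long long sum_four_lanes(const std::vector<int>& v) {
    long long lane[4] = {0, 0, 0, 0};
    const std::size_t n = v.size();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        lane[0] += v[i];
        lane[1] += v[i + 1];
        lane[2] += v[i + 2];
        lane[3] += v[i + 3];
    }
    long long total = lane[0] + lane[1] + lane[2] + lane[3];
    for (; i < n; ++i)  // leftover elements
        total += v[i];
    return total;
}
```

A running sum that needs every intermediate value, as opposed to just the final total, does not untangle this easily.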
I've tried to use the Intel C++ compiler on dozens of projects, and the gains, if any, are never worth the extra cost and the work required to set up your environment and work around the inconsistencies with gcc. The exception to this rule is scientific code.
'Course if Intel truly can auto parallelize non-threaded code, that would be enough to be worthy of reconsideration. But I have a feeling that their claim - one of the holy grails of compilers - is going to fall far short of what you really get out of it.
They do make a mighty fine Fortran compiler, though.
I got MKL 9.1 and Fortran 10 today.
/. does not have <pre></pre>?
Very basic test (3D DFFT and complex(4) matrix math) on dual X5160 on WinXP32:
Software             Time (ms)   Time vs. baseline
9.0 MKL + 9.1 F      3100        100.00%
9.1 MKL + 9.1 F      2900         93.55%
9.1 MKL + 10.0 F     2720         87.74%
In sum: 12% improvement, 6% per piece.
Not bad for my purposes!
----
Why
Once automatic parallelization, the holy grail of parallel computing, comes in, all these tools and models that give users abstractions for parallel programming will be swept away. Automatic parallelization is hard: a lot of research has been done for nearly a decade and a half, but there is nothing production-level yet. However, something is going to come out in the next couple of years.
Still, parallel programming models/libraries will be around, so that a parallelizing compiler can make use of them instead of the user using them directly. Also, when performance is critical, and depending on the application, a user or a black-belt parallel programmer may want to do the parallelization himself, since he knows best where the parallelism is. So, I think automatic parallelization will once and for all settle the problem of bringing the benefits of emerging highly parallel architectures to the users. How tough can it be? It is not more complex than the complexity that exists in a current-day sequential compiler.
It seems that Intel and Sun are in an arms race as far as multicore developer tools are concerned. Sun released Sun Studio 12 yesterday and the buzz around the release seems very similar. It also seems that Sun Studio 12 is a more nicely integrated package overall, with not just the compilers but a bunch of tools and a pretty convincing IDE.
Intel clearly think that it is important to offer compilers specifically for their chips, and I see that as a good thing for users of Intel based systems who want to get the most out of their hardware, so my question is why AMD do not make compiler software for their chips. They do after all have their own set of special extensions, so would they not benefit from creating a compiler armed with their own "inside knowledge"?
Seems odd to me to make the chips but not the software to allow people to fully utilize them, and if GCC et al are good enough, why do Intel and IBM offer their own compilers for the processors they make?
Top result on google for "amd compiler" is...
http://www.amd.com/epd/desiging/fusionpartners/prodbytype/3.developme/11.compiler/index.html
A list of 3rd party software from the era of Win98/NT, just a little out of date.
How do you know that SSE3 will be supported in a chip down the line? It could be that to reduce die waste you'll drop it. So how do you handle that? Disable SSE4 for it?
Or do you ask the CPU what it supports and include what it says it does?
If you do it that way, then why can't that be so for AMD processors? Ask them what they support and if they say "SSE3" use SSE3 requests.
It's not as if we're asking you to support AMD's version of an SSE-like library that isn't SSE.
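Here is a hedged C++ sketch of the difference being argued about. Everything in it is hypothetical: the `CpuInfo` struct stands in for whatever CPUID reports, and neither function is taken from any real compiler's runtime. One picks a code path by asking which features the chip reports; the other first checks the vendor string and only then looks at features.

```cpp
#include <string>

// Hypothetical summary of what a CPU reports about itself.
// In real code these fields would be filled in from CPUID.
struct CpuInfo {
    std::string vendor;
    bool sse2;
    bool sse3;
    bool sse4_1;
};

enum class Kernel { Generic, Sse2, Sse3, Sse41 };

// Feature-based dispatch: use the best code path the chip says it supports,
// regardless of who made it.
Kernel pick_by_features(const CpuInfo& cpu) {
    if (cpu.sse4_1) return Kernel::Sse41;
    if (cpu.sse3)   return Kernel::Sse3;
    if (cpu.sse2)   return Kernel::Sse2;
    return Kernel::Generic;
}

// Vendor-gated dispatch (illustration only): anything that is not
// "GenuineIntel" gets the generic path, whatever features it reports.
Kernel pick_by_vendor(const CpuInfo& cpu) {
    if (cpu.vendor != "GenuineIntel") return Kernel::Generic;
    return pick_by_features(cpu);
}
```

With an AMD chip that reports SSE2 and SSE3, the feature-based version picks the SSE3 path and the vendor-gated version falls back to generic code. The feature-based version also copes with a future chip that drops an extension, since it never assumes anything the chip does not report.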
Interesting, because Sun Studio 12 is free and the Intel compiler suite is only available as a time-limited evaluation. You only pay for Sun Studio if you want commercial support for the product.
Sun Studio supports x86-64 and SPARC with Linux and Solaris but does not support Windows, so you would need to look for something else if you were in the market for a suite that did provide Windows support.
FOR PONY!
http://www.lfgcomic.com/page/42
think before you write, it'll save me moderator points.
With Intel's new enhancements, they are now re-labeled as PWNies!
Vectors and multithreading are two different things that are kind of related. To answer your question: you cannot, but then that type of loop wouldn't be a vector.
Not everything can be a vector, and multithreading code doesn't eliminate sequential code, but sometimes you still use the vector instructions for floating point ops. Why? Intel has been putting a lot more effort into SSE than the FPU, so SSE floating point is often faster than the FPU.
What I want to know is this: when compiling under Windows you can specify Q6 to get the best performance out of the PPro/P3 family and Q7 to get the best out of Netburst, so which would give you the best performance on the Core? I am leaning towards using Q6 for a while, since P3s tend to run slower than P4s and I would rather my code be usably fast on both than super fast on the P4.
See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.
Wow, people still use Fortran? News to me.
While I'm all for compatibility, since the linux kernel doesn't even _pretend_ to work on other compilers, it's not exactly the sort of data point I would want to have ICC compared against and say: "look, it works". It's liable to over-fit the problem. Moreover the kernel is a self-contained entity with little dependency on external libraries or APIs, so any compatibility issues that arise in such cases would not be effectively tested.
I would rather see them applauding an XX% success rate on a test against a corpus of open source applications (with a mixture of C/C++ and heavy library usage) and/or the GCC regression test suite.
Also, just because you can compile the linux kernel with ICC doesn't mean you should. There's no point. It's not like it's going to get any faster what with the handcoded assembly in the tricky parts and the bog standard techniques used elsewhere.
And what's to say that some new driver won't come along and break ICC compatibility for whatever reason, since they only do testing with a limited range of GCC releases? It's not like Intel is doing regression checks for third party modules or anything.