Auto-threading Compiler Could Restore Moore's Law Gains
New submitter Nemo the Magnificent writes "Develop in the Cloud has news about what might be a breakthrough out of Microsoft Research. A team there wrote a paper (PDF), now accepted for publication at OOPSLA, that describes how to teach a compiler to auto-thread a program that was written single-threaded in a conventional language like C#. This is the holy grail to take advantage of multiple cores — to get Moore's Law improvements back on track, after they essentially ran aground in the last decade. (Functional programming, the other great hope, just isn't happening.) About 2004 was when Intel et al. ran into a wall and started packing multiple cores into chips instead of cranking the clock speed. The Microsoft team modified a C# compiler to use the new technique, and claim a 'large project at Microsoft' have written 'several million lines of code' testing out the resulting 'safe parallelism.'"
The paper is a good read if you're into compilers and functional programming. The key to operation is adding permissions to reference types allowing you to declare normal references, read-only references to mutable objects, references to globally immutable objects, and references to isolated clusters of objects. With that information, the compiler is able to prove that chunks of code can safely be run in parallel. Unlike many other approaches, it doesn't require that your program be purely functional either.
Moore's law has nothing to do with performance, numbnuts. Apparently the real geeks left Slashdot ages ago.
It turns out that Moore was observing only a small portion of a greater trend. That is, many (or most) things involving technology have a short doubling period, usually between 1 and 2 years. Total information created, processing power per dollar, resolution of brain scanning technology, et al all follow a similar doubling period to what Moore observed with transistors on integrated circuits. It's not that everyone else is misusing Moore's Law, so much as Moore didn't make make the law broad enough.
Tic-Tac-Toe, Global Thermonuclear War, and relationships all have the same winning move.
No no, you misunderstand. They're teaching the compiler to be so efficient that it actually starts creating new transistors to make itself smarter.
Make no mistake, this is how the rise of the machines will start.
oodaloop's Law doesn't have quite the same ring to it.
If God forks the Universe every time you roll a die, he'd better have a damned good memory.
Or, gawd forbid.. we could teach programmers how to use threading? I am a casual developer, with no formal training beyond writing practical code and reading as much as I can from the keyboards of real developers. I've run into my fair share of "real", "professional" developers who've been tasked to work on my code and thrown their hands up in defeat - not, as I feared, because of the awfulness of my coding style, but "This uses threading, I don't know this!".. Of course, looking at their resumes, a long list of single threaded webapps where the actual threading is handled for them by the webserver itself.. I give up. Even some basic native GUI client development should teach developers simple threading and asynchronous callbacks? alas, yet another talent abandoned in the age of the webapp. Is it any wonder the security issues (my actual realm of supposed 'expertise') in their code often make the architectural issues look trivial in comparison?
An interesting development, and much needed I fear, but yet another layer of abstraction to allow lazy developers to not have to really bother about knowing what their code is actually doing (that's for the poor SoB who has to maintain it is for...)
OK, so now we shall have another way to produce parallel programs.
Running things safely in parallel is a very good thing, but it does not by itself guarantee significant improvements in speed. The hard bit is actually to optimize the parallel code.
Having the compiler identify things which could run in parallel and thread work to run as A B&C D would be a huge step forward as long as it doesn't introduce bugs.
Compilers already try to do this for example with auto-vectorization. The problem is that they are usually quite terrible at it. Even Intel's compiler which is probably the best at it usually misses out on most of the obvious places that should be vectorized. This is why pretty much all code dealing with multimedia content (audio/video/image codecs, games, etc.) still rely on tons of had written SIMD to be optimized to their fullest.
Mainstream language have mutable state all over the code. Functional programming's big change on state issues is to careful isolate state. The Microsoft approach means that state needs to be tracked carefully so that it could be isolated by the compiler even if it isn't isolated by the code. Which is likely just as much work as isolating state. And the nice thing about isolating state is once you do it you can make use of all sorts of incredibly powerful functional paradigms like like first class functions (closures, partial execution...) and lazyness (infinite data structures, no need to figure out proper order of evaluation..)
The solution to parallelism is functional programming. And no it is not too hard. Excel is a functional programming language that lots of people know that does a great job isolating state.
Or, gawd forbid.. we could teach programmers how to use threading?
That's easy: "Don't."
From everything I've read about threading, there's no general way a hand-coded multithreaded program can ever be proven to be correct. Threads introduce all sorts of extremely subtle timing-based logic and security bugs which even the smartest programmers in the world think they can handle but in practice don't. And most programmers are not the smartest programmers in the world (they only think they are).
The correct solution is to switch from C-based imperative languages to pure-functional implicitly parallelised languages, but that's not likely to happen before the heat death of the universe.
You are not a brain: http://books.google.com/books?id=2oV61CeDx-YC
all the magic compiler is going to do is increase the speed of the System Idle Thread.
Have you seen how much processing power that thing consumes? Usually 90% and up. On all cores. It really needs some serious optimization.
I hate it when all the potentially interesting discussion on a slashdot story is headed off by a pedantic nitpick.
Imagine you have a for-loop that calls the method 'hop' on every object 'bunny' in the list:
for every bunny in list {
bunny.hop()
}
This is a simple, sequential process - bunny.hop() is called in order, for every bunny in the list, one after the other.
Now suppose you have defined the data in your program in such a way that the compiler knows that method 'bunny.hop()' only ever accesses read-only data, or modifies local data that is unaccessible from anywhere else. The compiler now knows that the order of execution of the bunny hops doesn't really matter, as every call of bunny.hop() is independent from anything else. This frees the compiler to spawn threads or processes to call as many bunny.hop()'s as he likes at the same time, thereby processing through the list alot faster.
Another method, bunny.eat() actually performs write access on a shared object 'carrot' when called. If two bunnies eat the same carrot, the compiler can not perform automatic parallelization, as running two bunny.eat() methods could lead to invalid state (only one piece of carrot remaining, two bunnies eating 1 piece of carrot at the same time, results in -1 pieces of carrot). In this case, the compiler will take care to run two bunnies eating away at the same carrot sequentially. However if there are 2 bunnies eating the green carrot and another 2 bunnies eating the yellow carrot, these are again independent from each other and can again be paralleled.
The requirement to make this possible is to provide the compiler with information on what kind of data something is - is it an isolated hopping? Or a shared carrot? or a global Bunnygod that affects all bunnies?
An interesting development, and much needed I fear, but yet another layer of abstraction to allow lazy developers to not have to really bother about knowing what their code is actually doing (that's for the poor SoB who has to maintain it is for...)
Developing software is all about managing complexity. Abstractions are the primary tool used to do so. They are neither inherently good or bad. A large part of writing good software is finding the appropriate abstractions and eliminating inappropriate ones. If an abstraction allows a program to be written more easily with an acceptible level of performance and correctness, it is an appropriate abstraction.
To know what code is "actually doing" is relative. No programmer knows what his code is doing at the level of gates on the chip. It's rarely necessary or even helpful to know what the code is doing at the level of CPU instructions or microinstructions. This is especially true if the code runs on multiple CPUs.
Magnetic Hard drives, for instance is an example of the same curve with no transistors.
Honestly, you know what this says? This just says some programmers still need to go practice in the threading space.
Threading is not that hard at all for a VERY large class of problems. There are issues -- like, the memory bandwidth will eventually choke the ability of processors to get at nearby memory regions -- and you need some kind of sane plan to deal with cacheing in one core as compared to the next core when you share flags or data coming in during execution from elsewhere -- and you need to figure out when thread overhead starts to chew into your efficiencies -- but really, it's entirely doable, and isn't that difficult a problem space as compared to the actual problems *serious* programmers have to solve. It may just be that your "thing" isn't really all that threadable. Some things aren't. But it should also be very obvious when they aren't.
Then again, some problem spaces lend themselves very well to multiple core approaches -- I just finished a library that slices up images and then applies various separable algorithms to the slices in a variable number of threads (the user of the library can specify.) I wrote it in pure C, used posix threads, no sweat at all, in my 8-core machine, you get just about exactly the gains you'd think (x the number of cores) when the process is CPU cycle intensive, less as the ops are simpler and memory I/O rates rise.
Seriously. If threading seems that hard, you need to go do some studying. It's not the threading. It's you. You really should get on this bus.
I've fallen off your lawn, and I can't get up.
If two bunnies eat the same carrot...the compiler will take care to run two bunnies eating away at the same carrot sequentially. However if there are 2 bunnies eating the green carrot and another 2 bunnies eating the yellow carrot
Am I the only one who is totally horny after reading that?
This is nothing new. As in decades-old. Back when I was at DEC in the 1980s we had auto-threading compilers that worked very well on standard application code. Today, Intel has been doing auto-threading in its C++ and Fortran compilers for more than a decade and it is very effective for a large class of applications. Intel's compiler also has features that help you tweak the code for better auto-parallelism, notably the Guided Auto Parallel feature introduced in 2010. Other compiler vendors also have auto-parallelizing compilers.
I've been in the industry for decades, and every couple of years someone comes out with yet another programming language or technique that is supposed to revolutionize application parallelization. This is just another one, and many years behind the rest of the industry from what I can tell.
If it has a visual effect, then it's not read-only.
Even the best of C or C++ compilers are terrible at vectorization of code.
Yeah, and the best humans are terrible at allocating registers -- so bad, in fact, that the best C compilers ignore the register keyword. What do you think is more relevant to the general case: vectorizing multimedia operations, or allocating registers? Compilers are also better than humans at:
To put it another way, look at the Orbitz story, which is over a decade old now:
http://www.paulgraham.com/carl.html
On the one hand, you have hand-tuned assembly language. On the other, you have a program written in Lisp (a high level, compiled language) with some C++ mixed in (for managing memory). Orbitz was able to compete on speed, but more importantly, it was returning better results. It's not that the people who wrote that mainframe assembly language were idiots -- they were taking advantage of all sorts of special hardware features, they knew how to hack their machines better than anyone else -- it is just that the Orbitz algorithm was far too complex for efficient hand-rolled assembly code, at which point compilers are really the only choice. The mainframe guys were busy thinking about how to make use of special machine features in assembly language; the ITA team was busy solving the higher-level problem, and relying on their compiler to generate good assembly language code. This is a particularly telling line:
We disassemble most every Lisp function looking for inefficiencies and have had both CMUCL and Franz enhanced to compile our code better.
[emphasis mine]. They disassembled their code...and then improved their compiler when they saw problems. They did not hand-roll the code, they made the compiler do a better job of generating code. These are not lazy programmers, nor are they programmers who do not know how to use assembly language; they are programmers who understand that they have a tool that is far better at generating assembly language than they are, and that they have more important things to do with their time.
I deal with quite a bit of crypto code in my work. I have seen lots of hand-tuned assembly language, I dealt with code that took advantage of the AESNI instructions to perform very fast encryption. I am well aware that in small, highly specialized functions (like AES), humans are better able to utilize special instructions to improve performance. Those are niche cases, and the techniques used in those cases have very limited applicability (even SSE is fairly limited in its applicability, by comparison with the sort of code programmers write and maintain every day), and the techniques scale very poorly.
Palm trees and 8
One would assume that "world modification" (monad; in this case, visual I/O) does not fit the restriction "only ever accesses read-only data, or modifies local data that is unaccessible from anywhere else" and as soon as you are showing something to the user, that blocks the auto-threading.
Then again, if you are drawing map tiles, the order you draw them in doesn't matter - so this goes to show that in some situations, there's no way around the programmer actually having to stop, think and declare whether the program's contract allows it to do some user-visible operation in parallel.
It's not the fall that kills you. It's the sudden stop at the end. -Douglas Adams
Compilers already try to do this for example with auto-vectorization. The problem is that they are usually quite terrible at it.
I suspect one reason they are so bad at it is they have to be very conservative in how they optimize, due to the relaxed nature of the C language. For example, if the C optimizer cannot prove beyond a shadow of a doubt that a particular memory location isn't being aliased, it can't make any assumptions about the location's value not changing at any step of the process. In practice, that means no optimizations for you.
Given that, it would seem that the Microsoft approach (using not only the higher-level language C#, but a specially customized version of C#) gives their optimizer much greater latitude. Because the language forces the programmer to annotate his objects with readable/writable/immutable tags (think C's "const" tag, but with teeth), and because the language (presumably) doesn't allow the programmer to do sneaky low-level tricks like casting away const or aliasing pointers or pointer math, the optimizer can safely make assumptions about the code that a C or C++ optimizer could never get away with. That may allow it to be more effective than you might anticipate (or maybe not, we'll see).
I don't care if it's 90,000 hectares. That lake was not my doing.
Just like quicksort(3) is far faster than bubblesort so too is a highly threadable code faster than non-threadble code
First, just to be pedantic, I'll point out that quicksort is as bad as bubblesort in the worst case, to a constant factor (you should have picked heapsort or mergesort). That aside, it is worth noting (and I am somewhat bothered by this when it comes to TFA) that we still do not know if it is even possible to optimize any program by parallelizing it; see the NC-vs-P question:
https://en.wikipedia.org/wiki/P-complete
Multithreading is not a magic bullet, and in all likelihood it is not generally applicable.
Languages do not, contrary to belief, express intent, the provide a strict set of instructions that the computer MUST respect
Wrong on all counts. Imperative languages are a way to convey instructions to a computer; declarative languages do not convey instructions, and programming in a declarative language requires an entirely different mode of thinking (it is closer to asking a question that giving instructions). It is also not strictly necessary for the computer to do exactly what a program expresses; there has been some work on compiler optimizations that have a (tunable) chance of not maintaining soundness.
In the end a good algorithm with no compiler help will beat optimized "dumb" code in all cases larger than "toy" (say, a few dozen "n" in Big-O notation)
If you are convinced of this, try implementing something more complex than the algorithms you see in Knuth's books; say, this:
http://eurocrypt2010rump.cr.yp.to/9854ad3cab48983f7c2c5a2258e27717.pdf
Then ask yourself this: could the constant factors in your implementation be better? At the end of the day, big constant factors will hurt your performance so badly that you might as well have used an asymptotically worse algorithm; indeed, consider fast integer multiplication:
https://en.wikipedia.org/wiki/Sch%C3%B6nhage%E2%80%93Strassen_algorithm
10000 digits are needed before that algorithm actually outperforms the asymptotically worse Toom-Cook family of algorithms. Here is an even more extreme example:
https://en.wikipedia.org/wiki/Coppersmith%E2%80%93Winograd_algorithm
Sure, that's a better matrix multiplication algorithm in the asymptotic sense...but only for matrices that are so large that you cannot even store them on today's computers.
So really, while you are right that asymptotic improvements will always beat constant factor improvements (which is what compilers are mostly going to do for you), you are wrong to ignore the importance of constant factor improvements. In the real world, constant factors matter. In the real world, quicksort and mergesort will use asymptotically worse algorithms below a certain problem size because of constant factors. In the real world, large integer multiplication is done using Karatsuba or Toom-Cook methods, not FFT methods, because of constant factors. In the real world, if you are not using a compiler to optimize your code, your code is going to be slower than it needs to be, even if you spent hours hand-tuning it, unless you are only dealing with toy problems.
Palm trees and 8
How frequently do you spend time thinking about which registers are being used to store partial results in the programs you write? When was the last time you spent time thinking about which registers are being used by which variables, or when it would be a good time to store register contents in memory?
What about memory alignment and packing? Calling conventions? Near pointers and far pointers (OK maybe that one is a bit dated)? Most programmers do not think about these problems, because most programmers are trying to solve bigger problems and need to spend time thinking about the bigger picture. Getting mired down in low level details makes programmers less productive -- that is why we use programming languages.
So if we can have compilers that can automatically utilize multithreading, even if it is in a somewhat limited context, we should be happy. If you are happy to not be spending time thinking about padding data or allocating registers, you should be happy about automatic multithreading.
Palm trees and 8
"MS R&D is the largest computer tech R&D in the world. Combine IBM, Intel, and AMD, and you get an idea of their size."
Citation need. Not disputing the first part. Just the second part about the relative size of Miscorsoft Research.
These and other applications written in the source language are performance-competitive with established implementations on standard benchmarks
Translation: we didn't speed them up any, or at least not by enough that we care to share any number.
Amdahl's Law is difficult to overcome in auto-parallelising systems that rely on anything other than loop optimizations. Basically, in straight-line code, if you make 50% of the code run on multiple cores, you're only going to get at most a 2x improvement in speed. In practice, you won't even get that much (except in loops) due to added overhead. Bottom line: you can write papers about your wonderful auto-parallelizing technique, but when the rubber hits the road this is unlikely to lead to much performance improvement.
Have you read my blog lately?
Of those, IBM have around 439,999 project managers.
Nobel Prizes? None. Because there is no Nobel Prize for computer science. The closest equivalent is a Turing Award. I haven't checked the whole list, however the 2009 winner works at Microsoft Research (and has since 1997), the 1998 winner is now dead, but was working at Microsoft Research between 1995 and his death. You can check the list yourself for IBM.
I am TheRaven on Soylent News