Auto-threading Compiler Could Restore Moore's Law Gains
New submitter Nemo the Magnificent writes "Develop in the Cloud has news about what might be a breakthrough out of Microsoft Research. A team there wrote a paper (PDF), now accepted for publication at OOPSLA, that describes how to teach a compiler to auto-thread a program that was written single-threaded in a conventional language like C#. This is the holy grail for taking advantage of multiple cores: a way to get Moore's Law improvements back on track, after they essentially ran aground in the last decade. (Functional programming, the other great hope, just isn't happening.) About 2004 was when Intel et al. ran into a wall and started packing multiple cores into chips instead of cranking the clock speed. The Microsoft team modified a C# compiler to use the new technique, and claims that a 'large project at Microsoft' has written 'several million lines of code' testing out the resulting 'safe parallelism.'"
The paper is a good read if you're into compilers and functional programming. The key to the approach is adding permissions to reference types, which let you declare normal references, read-only references to mutable objects, references to globally immutable objects, and references to isolated clusters of objects. With that information, the compiler is able to prove that chunks of code can safely be run in parallel. Unlike many other approaches, it doesn't require that your program be purely functional either.
No no, you misunderstand. They're teaching the compiler to be so efficient that it actually starts creating new transistors to make itself smarter.
Make no mistake, this is how the rise of the machines will start.
Having the compiler identify things which could run in parallel, and schedule the work to run as A, then B and C together, then D, would be a huge step forward as long as it doesn't introduce bugs.
Compilers already try to do this, for example with auto-vectorization. The problem is that they are usually quite terrible at it. Even Intel's compiler, which is probably the best at it, usually misses most of the obvious places that should be vectorized. This is why pretty much all code dealing with multimedia content (audio/video/image codecs, games, etc.) still relies on tons of hand-written SIMD to be optimized to its fullest.
all the magic compiler is going to do is increase the speed of the System Idle Thread.
Have you seen how much processing power that thing consumes? Usually 90% and up. On all cores. It really needs some serious optimization.
Imagine you have a for-loop that calls the method 'hop' on every object 'bunny' in the list:
for every bunny in list {
bunny.hop()
}
This is a simple, sequential process - bunny.hop() is called in order, for every bunny in the list, one after the other.
Now suppose you have defined the data in your program in such a way that the compiler knows that method 'bunny.hop()' only ever accesses read-only data, or modifies local data that is inaccessible from anywhere else. The compiler now knows that the order of execution of the bunny hops doesn't really matter, as every call of bunny.hop() is independent of anything else. This frees the compiler to spawn threads or processes to call as many bunny.hop()'s as it likes at the same time, thereby processing the list a lot faster.
Another method, bunny.eat(), actually performs write access on a shared object 'carrot' when called. If two bunnies eat the same carrot, the compiler cannot perform automatic parallelization, as running two bunny.eat() methods at once could lead to invalid state (only one piece of carrot remaining, two bunnies eating 1 piece of carrot at the same time, results in -1 pieces of carrot). In this case, the compiler will take care to run two bunnies eating away at the same carrot sequentially. However, if there are 2 bunnies eating the green carrot and another 2 bunnies eating the yellow carrot, these two groups are again independent of each other and can again be parallelized.
The requirement to make this possible is to provide the compiler with information on what kind of data something is - is it an isolated hopping? Or a shared carrot? or a global Bunnygod that affects all bunnies?
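The bunny example above can be sketched in Python. This is only an illustration of the scheduling idea, not the paper's technique: here the programmer does the grouping by hand, whereas the compiler described in the story would derive it from the reference permissions. The class names, the carrot counts, and the use of a thread pool are all assumptions made for the sketch, and in CPython threads will not actually speed up CPU-bound work like this.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Carrot:
    color: str
    pieces: int


@dataclass
class Bunny:
    name: str
    carrot: Carrot
    hops: int = 0

    def hop(self):
        # Touches only this bunny's own state, so calls are independent.
        self.hops += 1

    def eat(self):
        # Writes to a shared carrot, so bunnies sharing a carrot must not overlap.
        if self.carrot.pieces > 0:
            self.carrot.pieces -= 1


def hop_all(bunnies):
    # Every hop is independent, so order and overlap do not matter.
    with ThreadPoolExecutor() as pool:
        list(pool.map(Bunny.hop, bunnies))


def eat_all(bunnies):
    # One sequential task per shared carrot; different carrots run in parallel.
    groups = defaultdict(list)
    for bunny in bunnies:
        groups[id(bunny.carrot)].append(bunny)

    def eat_group(group):
        for bunny in group:
            bunny.eat()

    with ThreadPoolExecutor() as pool:
        list(pool.map(eat_group, groups.values()))


# Hypothetical data: two bunnies per carrot, as in the comment above.
green = Carrot("green", 1)
yellow = Carrot("yellow", 3)
bunnies = [Bunny("a", green), Bunny("b", green), Bunny("c", yellow), Bunny("d", yellow)]
hop_all(bunnies)
eat_all(bunnies)
```

Because the two green-carrot bunnies eat one after the other, the single green piece is eaten once and the count stops at zero instead of going negative.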
An interesting development, and much needed I fear, but yet another layer of abstraction to allow lazy developers to not have to really bother about knowing what their code is actually doing (that's what the poor SoB who has to maintain it is for...)
Developing software is all about managing complexity. Abstractions are the primary tool used to do so. They are neither inherently good nor bad. A large part of writing good software is finding the appropriate abstractions and eliminating inappropriate ones. If an abstraction allows a program to be written more easily with an acceptable level of performance and correctness, it is an appropriate abstraction.
To know what code is "actually doing" is relative. No programmer knows what his code is doing at the level of gates on the chip. It's rarely necessary or even helpful to know what the code is doing at the level of CPU instructions or microinstructions. This is especially true if the code runs on multiple CPUs.
Apparently the real geeks left Slashdot ages ago.
Casted to void?
Questions raise, answers kill. Raise questions to stay alive.
If two bunnies eat the same carrot...the compiler will take care to run two bunnies eating away at the same carrot sequentially. However if there are 2 bunnies eating the green carrot and another 2 bunnies eating the yellow carrot
Am I the only one who is totally horny after reading that?
Maybe you need to read?
Here's the headline (emphasis mine): Auto-threading Compiler Could Restore Moore's Law Gains
In the past, adding more transistors to a core meant the core worked faster (simplification), so Moore's law indirectly led to better performance. Now a core is close to as fast/efficient as it's going to get, so we throw the extra transistors at new cores (still Moore's law). The problem is, there are only so many single-threaded processes a person can run before there will literally be one core per process. In order to see benefits from the extra transistors again, "gains" in the summary's terminology, we need a way to take advantage of them (this new compiler technique, functional programming, programmers who can actually grok threads). In the end, if we're not seeing some kind of gain from Moore's law, then the chip manufacturers will stop adding new transistors because no one will pay money for a chip that's just as good as their current chip, and Moore's law will fail.
Maybe it's subtle sarcasm regarding learning threaded programming? People can't identify the right thread in a conversation, but should be expected to write threaded code? :-)
Even the best of C or C++ compilers are terrible at vectorization of code.
Yeah, and the best humans are terrible at allocating registers -- so bad, in fact, that the best C compilers ignore the register keyword. What do you think is more relevant to the general case: vectorizing multimedia operations, or allocating registers? Compilers are also better than humans at:
To put it another way, look at the Orbitz story, which is over a decade old now:
http://www.paulgraham.com/carl.html
On the one hand, you have hand-tuned assembly language. On the other, you have a program written in Lisp (a high level, compiled language) with some C++ mixed in (for managing memory). Orbitz was able to compete on speed, but more importantly, it was returning better results. It's not that the people who wrote that mainframe assembly language were idiots -- they were taking advantage of all sorts of special hardware features, they knew how to hack their machines better than anyone else -- it is just that the Orbitz algorithm was far too complex for efficient hand-rolled assembly code, at which point compilers are really the only choice. The mainframe guys were busy thinking about how to make use of special machine features in assembly language; the ITA team was busy solving the higher-level problem, and relying on their compiler to generate good assembly language code. This is a particularly telling line:
We disassemble most every Lisp function looking for inefficiencies and have had both CMUCL and Franz enhanced to compile our code better.
[emphasis mine]. They disassembled their code...and then improved their compiler when they saw problems. They did not hand-roll the code, they made the compiler do a better job of generating code. These are not lazy programmers, nor are they programmers who do not know how to use assembly language; they are programmers who understand that they have a tool that is far better at generating assembly language than they are, and that they have more important things to do with their time.
I deal with quite a bit of crypto code in my work. I have seen lots of hand-tuned assembly language, and I have dealt with code that took advantage of the AESNI instructions to perform very fast encryption. I am well aware that in small, highly specialized functions (like AES), humans are better able to utilize special instructions to improve performance. Those are niche cases, and the techniques used in those cases have very limited applicability (even SSE is fairly limited in its applicability, by comparison with the sort of code programmers write and maintain every day), and the techniques scale very poorly.
Palm trees and 8
Compilers already try to do this for example with auto-vectorization. The problem is that they are usually quite terrible at it.
I suspect one reason they are so bad at it is they have to be very conservative in how they optimize, due to the relaxed nature of the C language. For example, if the C optimizer cannot prove beyond a shadow of a doubt that a particular memory location isn't being aliased, it can't make any assumptions about the location's value not changing at any step of the process. In practice, that means no optimizations for you.
Given that, it would seem that the Microsoft approach (using not only the higher-level language C#, but a specially customized version of C#) gives their optimizer much greater latitude. Because the language forces the programmer to annotate his objects with readable/writable/immutable tags (think C's "const" tag, but with teeth), and because the language (presumably) doesn't allow the programmer to do sneaky low-level tricks like casting away const or aliasing pointers or pointer math, the optimizer can safely make assumptions about the code that a C or C++ optimizer could never get away with. That may allow it to be more effective than you might anticipate (or maybe not, we'll see).
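The aliasing point can be made concrete with a small sketch. It is written in Python rather than C purely for illustration, and the function names are made up: the "optimized" version reads the source value once instead of twice, which is only equivalent when the two arguments are guaranteed to be different objects.

```python
def add_twice(dst, src):
    # Straightforward version: reads src[0] again after writing dst[0].
    dst[0] += src[0]
    dst[0] += src[0]


def add_twice_optimized(dst, src):
    # "Optimized" version: reads src[0] only once.
    # Only valid if dst and src never refer to the same object.
    dst[0] += 2 * src[0]


# No aliasing: both versions agree.
a, b = [1], [10]
add_twice(a, b)
c, d = [1], [10]
add_twice_optimized(c, d)
no_alias_results = (a[0], c[0])  # (21, 21)

# Aliasing: dst and src are the same list, and the results diverge.
x = [1]
add_twice(x, x)
y = [1]
add_twice_optimized(y, y)
alias_results = (x[0], y[0])  # (4, 3)
```

An optimizer that cannot rule out the aliased case has to keep the slower version. Annotations that guarantee the two references are distinct, or that the source is immutable, are what make the rewrite legal.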
I don't care if it's 90,000 hectares. That lake was not my doing.
Just like quicksort(3) is far faster than bubblesort so too is a highly threadable code faster than non-threadble code
First, just to be pedantic, I'll point out that quicksort is as bad as bubblesort in the worst case, to within a constant factor (you should have picked heapsort or mergesort). That aside, it is worth noting (and I am somewhat bothered by this when it comes to TFA) that we still do not know whether every efficiently solvable problem can be sped up substantially by parallelizing it; see the NC-vs-P question:
https://en.wikipedia.org/wiki/P-complete
Multithreading is not a magic bullet, and in all likelihood it is not generally applicable.
Languages do not, contrary to belief, express intent, the provide a strict set of instructions that the computer MUST respect
Wrong on all counts. Imperative languages are a way to convey instructions to a computer; declarative languages do not convey instructions, and programming in a declarative language requires an entirely different mode of thinking (it is closer to asking a question than giving instructions). It is also not strictly necessary for the computer to do exactly what a program expresses; there has been some work on compiler optimizations that have a (tunable) chance of not maintaining soundness.
In the end a good algorithm with no compiler help will beat optimized "dumb" code in all cases larger than "toy" (say, a few dozen "n" in Big-O notation)
If you are convinced of this, try implementing something more complex than the algorithms you see in Knuth's books; say, this:
http://eurocrypt2010rump.cr.yp.to/9854ad3cab48983f7c2c5a2258e27717.pdf
Then ask yourself this: could the constant factors in your implementation be better? At the end of the day, big constant factors will hurt your performance so badly that you might as well have used an asymptotically worse algorithm; indeed, consider fast integer multiplication:
https://en.wikipedia.org/wiki/Sch%C3%B6nhage%E2%80%93Strassen_algorithm
Numbers on the order of 10,000 digits or more are needed before that algorithm actually outperforms the asymptotically worse Toom-Cook family of algorithms. Here is an even more extreme example:
https://en.wikipedia.org/wiki/Coppersmith%E2%80%93Winograd_algorithm
Sure, that's a better matrix multiplication algorithm in the asymptotic sense...but only for matrices that are so large that you cannot even store them on today's computers.
So really, while you are right that asymptotic improvements will always beat constant factor improvements (which is what compilers are mostly going to do for you) once the problem is large enough, you are wrong to ignore the importance of constant factor improvements. In the real world, constant factors matter. In the real world, implementations of quicksort and mergesort switch to asymptotically worse algorithms (such as insertion sort) below a certain problem size because of constant factors. In the real world, large integer multiplication is done using Karatsuba or Toom-Cook methods, not FFT methods, because of constant factors. In the real world, if you are not using a compiler to optimize your code, your code is going to be slower than it needs to be, even if you spent hours hand-tuning it, unless you are only dealing with toy problems.
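As a rough sketch of the sorting point, here is a merge sort that hands small inputs to insertion sort. The cutoff of 16 is an arbitrary assumption for illustration; real library implementations pick their thresholds by measurement, and nothing here is taken from any particular library.

```python
import random

CUTOFF = 16  # hypothetical threshold; real libraries tune this empirically


def insertion_sort(items):
    # O(n^2) in the worst case, but very little overhead per element.
    for i in range(1, len(items)):
        value = items[i]
        j = i - 1
        while j >= 0 and items[j] > value:
            items[j + 1] = items[j]
            j -= 1
        items[j + 1] = value
    return items


def hybrid_merge_sort(items):
    # Small inputs: fall back to the asymptotically worse algorithm.
    if len(items) <= CUTOFF:
        return insertion_sort(list(items))
    mid = len(items) // 2
    left = hybrid_merge_sort(items[:mid])
    right = hybrid_merge_sort(items[mid:])
    # Standard merge of two sorted halves.
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


data = [random.randint(0, 1000) for _ in range(500)]
result = hybrid_merge_sort(data)
```

The overall running time is still O(n log n); the cutoff only changes the constant factor, which is exactly the kind of improvement being argued about here.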
"MS R&D is the largest computer tech R&D in the world. Combine IBM, Intel, and AMD, and you get an idea of their size."
Citation needed. Not disputing the first part. Just the second part about the relative size of Microsoft Research.
These and other applications written in the source language are performance-competitive with established implementations on standard benchmarks
Translation: we didn't speed them up any, or at least not by enough that we care to share any number.
Amdahl's Law is difficult to overcome in auto-parallelising systems that rely on anything other than loop optimizations. Basically, in straight-line code, if you make 50% of the code run on multiple cores, you're only going to get at most a 2x improvement in speed. In practice, you won't even get that much (except in loops) due to added overhead. Bottom line: you can write papers about your wonderful auto-parallelizing technique, but when the rubber hits the road this is unlikely to lead to much performance improvement.
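The 2x figure follows directly from Amdahl's Law: with a fraction p of the work parallelized across n cores, the speedup is 1 / ((1 - p) + p / n). A minimal sketch of the arithmetic, ignoring threading overhead entirely (the function name is made up for illustration):

```python
def amdahl_speedup(parallel_fraction, cores):
    # The serial part always takes the same time; only the parallel
    # part is divided among the cores.
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# With 50% of the code parallelized, the speedup approaches 2x
# as cores are added, but never reaches it.
for cores in (2, 4, 16, 1024):
    print(cores, round(amdahl_speedup(0.5, cores), 3))
```

Even with 1024 cores the result stays just under 2, and any real overhead from spawning and synchronizing threads only pushes it lower.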
Have you read my blog lately?
Moore's law was coined by an engineer to describe a series of observations. That is, it's a mathematical function that seems to fit some data, without any explanatory power. Just like various other "laws" such as the laws of thermodynamics, and, your favourite, Newton's laws, including his law of universal gravitation.
Moore's law states that the number of components on an integrated circuit doubles approximately every two years.
Of those, IBM have around 439,999 project managers.
No, it's very relevant.
How much wiring happens on doped silicon? None. The vast majority of the chip is covered in transistors, with 6-10 levels of wires on top of them. There are some designs where the I/O count demands so many pins that's what dictates the size of the chip -- so cache is filled in underneath. Heck, if your power budget allows it, you're already blowing the silicon area anyway, might as well increase your cache size! Consider your recent Core derived designs. Take away half the cache. Do you think the die area would go down? Not hardly.
You did the math right, but the cache line tag logic and coupled CAM are negligible. Sure, they may add a few million or so, but not anywhere near 5% of 100M.
I realize it's in vogue for people to revisit Moore's Law and rewrite it every few years, but he was not speaking specifically about memory arrays. In fact, the chips Moore had access to at the time had very little memory on them.
Wiring never forces silicon area to be transistor-free, unless you're thinking of 1980-era chips. Not even late '80s chips had wiring on doped silicon. Certainly the kinds of chips Moore was talking about have had no significant wiring on doped silicon in 20 years, the exceptions being only when layout designers are getting lazy. I've done layout design, I've done circuit design, I've audited dozens of chip layouts and seen several technology manuals dating back to the '90s.
That random logic, by the way, is the subject of the most innovation in the field of chip layout and arguably in all of chip design. When your chip's entire goal is to funnel data through different units and do different things to it, you're dominated by buses. Automated tools often do split these buses up, but different algorithms can pull them together and make them more efficient. Caches are the smallest because they can be small. There's an entire periphery to them, including senseamps devoted to reading the baby FETs that can't make full rail to rail swings on the bitlines.
May I guess you're a student? Perhaps one who is learning from a professor who hasn't been in the industry since about 1985?