HPs Dynamo Optimizes Code
sysboy writes "ArsTechnica have a very interesting piece about HP's Dynamo. " Interesting stuff about their run-time optimization stuff. They compare it to Transmeta's code morphing technology.
← Back to Stories (view on slashdot.org)
I remember seeing descriptions of linkage profilers that would reorder code segments inside an image after examining the stats from a profiling run of the application. I know HP was active in this around '93 but there were others doing it. I know of at least one windows C++ compiler (Watcom) that plays with this to re-ordering to plays the segments used to start the app near the beginning of the image.
Since it runs on an x86 CPU, I'm thinking it shouldn't be too difficult to hack the machine, extract the OS its running, and dual-boot linux or better yet, windows 2000 :)
The code morphing that Hotspot does does on-the-fly optimization of code, on-the-fly "unoptimization" of that code if necessary, and so on. That's code morphing, anyway you slice it. Doesn't matter if it's pcode based or not.
About the Amiga thing, the morphing I'm taking about there isn't the instructions individually themselves, it's the instruction sequences that were changed from MacOS calls into Amiga OS calls.
That product wasn't all smoke and mirrors. It did work for Mac stuff, it just didn't deliver on it's promise to emulate everything. ..and it took a really long (like Newtek coming out with the Video Toaster long) time for it to appear.
I guess you missed my point....I don't think the idea is bad...it's cool as anything, it's just not a new idea. Lots of people in the press are saying it's this "brand new" thing, when it really isn't.
False ol chum. Intel and HP are working on the itanium together, Basically the intel chip provides HW support to make such things easier, like dynamic recompilation. Intel is having some serious probs with static compilation on the Itanium. It relies heavily on a complex static compiler. No doubt these hardware features are also intended for a Dynamo-like utility to run on the itanium. There are SOME THINGS you just can not optimize out at compile-time. Dynamically typed OO languages are notorious for being difficult to optimize at compile time as it is hard to inference the most likely types to be used in a given chunk. Dynamic Type analysis greatly speeds this up. As would a utility like Dynamo.
From the article:
It is, in essence, an interpreter for HP's PA-8000 instruction set that itself runs on a PA-8000 processor.
Maybe Microsoft can figure out how to emulate Windows under Windows!
first reply to first post, ihope
How did this guy get a troll rating? He's probably just a stupid teenager who thinks he's really cool getting second post on a story. Is it clever? No. Is it funny? No. Is it trollworthy in the least? Not at all.
Hi
Profiling fails miserably for extremely dynamic systems, where user input can drastically affect the code trace through memeory. The situation you profiled against may not be the same the user is currently using. This type of dynamic recompilation will give the most benefit to dynamicly typed or typed languages with dynamic binding, such as Objective C, and even Smalltalk in certain cases ( A Smalltalk VM that spits out native code, as ST/X and VisualWorks do ).
This is what hotspot was supposed to do for java.
Funny, in practice, it isn't all that fabulous, and actually makes quite a few programs slower.
I have no doubt that dynamo will suffer from the same issues.
FX32! from Digital (Compaq) was doing a similar thing. Running Windows NT x86 programs under Windows NT on Alpha. BTW it's discontinued. http://www.digital.com/amt/fx32/index.html
I thought he was talking about Willamette, not Itanium? HP have nothing at all to do with Willamette as far as I know.
Please post some news about this. Thanks. Have a day.
I like hardware... it makes me happy
with IE
Moderate this down to (Score:-1,Troll)
Trollz rool.
This becomes an issue when organizations are stuck with old closed source legacy binaries that cannot be (or haven't yet been) replaced (i.e. large bits of MS-Windows). Not everyone has ready access to source code.
Dynamo does two things:
1) Generates streams of instructions in execution order, with the branch predictions already determined, based on past execution.
2) Additional code optimizations are enabled by knowing how the program actually executes. Things like pointer aliasing (which is the single biggest cause of poorly compiled code) can be resolved.
#2 is the same as profile guided optimization, which has been trashed on Slashdot several times because Intel put it into their compiler and got decent speedups. PGO allows the compiler to gather statistics on the application by actually running it, then recompiling based on that data.
#1 sounds exactly like the trace cache which will be in Willamette, Intel's new IA32 chip. Streams of uops are stored in program execution order, with the branches already predicted.
The difference is that Dynamo uses the hardware to do all of these things while the application is running, which adds considerable overhead. The Intel implementation is done automatically, without using CPU cycles, allowing ALL of the gains to be seen by the user.
Ars Technica has their history of anti-Intel bias, so I am sure that you will never see this comparison on their site.
Hasn't LISP (and scheme) been doing byte-code compilation for about 30 years?
Ars Technica always has interesting technical stuff. Heck, it's "News for Nerds" by definition. Kudos to slashdot for linking to it, especially since they post some of the better Slashdot links on Ars. :)
;)
They wrote an excellent article on the Crusoe, (I just wrote a short paper on it for class, myself, although I used the white paper for reference...) and were looking forward to this architecture. It's nice to hear more details.
Wow, these results are really impressive! I wish someone would implement an x86 interpreter on x86 that actually ran faster. I wonder if the limited number of registers would really get in the way of a control program, though. Transmeta, where are you?
---
pb Reply or e-mail; don't vaguely moderate.
pb Reply or e-mail; don't vaguely moderate.
Finding representative input may be impossible for a randomly choosen application from Freshmeat, but those are not the applications that need the optimization.
They are concerned with SPEC...which contains lots of time consuming calculations buried in nested loops. Consider a less esoteric example that has a similar program structure. 3D software (Povray, Quake, whatever) goes through long lists of polygons. The instructions in these loops will be hit thousands or millions of times. These oft-touched bits of code will be optimized and have several fragments that correspond to the various common branching configurations. Any trip through these tight loops is going to be "representative." If the two or more branchings occur frequently, each will be encapsulated in a fragment. If the loop is exhaustive enough, every possible branching may be optimized. Amdahl's law is at work here. The parts of the entire code that get run the most are the locations for potential bottlenecks, and hence they attract the most attention from the optimizer. This is in direct contrast to a priori compiler optimization approaches which have a very narrow focus and have no idea how often a particular bit of code will be executed.
Again, the proof of the efficacy of this approach is in the HP results. Those results will only get better when you dump the runtime overhead. Furthermore, it is quite understandable that these types of optimizations will be orthogonal to those made by compile-type optimizations. The squeaky wheels will get the oil. And, there is NO possibility of a performance penalty. If the "optimized" fragments are slower than original code, they can be discarded. And these optimization will be resilent to parametric changes, otherwise the HP software wouldn't be seeing any speedup. It has to have a cache of pre-ordered instruction blocks that is hit many times in order for the initial optimization/translation time to be amortized.
There is a lot of potential here I think, especially for computational physics codes that I work on.
This is really a new type of binary optimization. There is no reason that the "translation" software be fast or even run during the execution of the optimized program. The idea would be to compile the binary with the normal optimizations in place (O4, say). This is a priori, or before-runtime, optimization. Then, choose a representative set of inputs to exercise the binary and run it using the "translator." The optimizations in this a posteriori process can be very aggressive. The performance of the code during this procedure may even be seriously hampered by the optimizations, but at the end, all of the fragments in the cache are then assembled into the NEW a posteriori optimized binary. We then have a program that takes full advantage of this wholistic optimization approach but doesn't carry the translation runtime baggage. I think that this would make an exciting research project, especially with the added bonus of being able to insert different front-ends into the translation application.
One thought that struck me after reading the article was that this sort of thing has the potential to be really good for multi-cpu machines, since the large majority of programs could potentially get some significant speed-up from devoting extra cpus to the background profiling and cacheing work required for the current binary being 'executed' (which is actually being interpreted/translated). The article doesn't says that dynamo itself is threaded, but this would be a natural thing to do. Assuming that worked, then all of a sudden the multi-cpu machine is speeding up everything, and not just programs that or threaded or situations where two heavy-duty processes are working at the same time.
There are a lot of "geewheeze" out there about this supposingly new morphing thingy, but nobody is talking about the OVERHEAD !
If something has to be translated (even optimized) from a form to another form, there ought to be overhead for that something.
So, who can tell me about the OVERHEAD incurred?
Muchas Gracias, Señor Edward Snowden !
thad
I love Mondays. On a Monday, anything is possible.
the article says many companies use or will be using this kind of technology. So what about the patent transmeta got for their code morpher? Will there be big legal battles ahead?
Amiga's didn't need anything to "morph" code or translate it. They had the same processors as Mac's (Motorolla 68K's) and only needed ROM's and a few chips to translate calls to the Mac's hardware to the corresponding Amiga hardware.
http://www.hpl.hp.com/cambridge/projects/Dynamo/do cs.htm
Dynamo as a Virtual Machine Accelerator (PDF)
The Dynamo team has been at it since 1997 and the writeup is dated June 1999, does anyone know what HP is doing with it?
Wouldn't that be absolutely kick ass?
Can you immagine what Dynamo and Code Morphing technology can do for the Linux Kernel? Faster emulators, faster code, and _better_use_of_existing_hardware_!!!! MMX acceleration for everything (theoretically), because it's in the kernel!
The holy grail of linux is how well it works with legacy hardware, why not make it even better?
So who's going to take the reigns of this puppy? Put it on the bill for the 2.5 series! (or 3.0?)
For a while now people have been saying that if we only ran our programs through profilers and then recompiled with the hints they gave we could see a signifigant benifit (in things like branch prediction and such, and other things which cannot be seen at compile-time). I wonder how much more effective having a whole run-time enviorment like this is compared with if the same program used profiling feed back. The run-time enviornment has more up-to-the-second info, but there is the overhead of the enviornment itself, where as profiling would not have that.
Who am I? Subscribe and find out
As the book says, even its creator did not know it still existed until his computer started speaking to him...
(It's on the Cyberpunk reading list, but hard to find. P-1 is a fictional program created to penetrate a remote computer, get resources for an attack, but avoid detection. The creator mistypes, the program thinks he's not the creator, and it hides...it continues getting more resources and hiding its existence for years, including optimizing applications so it can use the extra CPU although the application seems to be using exactly as much CPU as it should.)
>Intel IA-32 (i386 through Pentium III) doesn't
>work that way. In the Pentium Pro and later,
>there are branch prediction bits in the cache,
>set as the code >executes.
x86 (and Alpha and SPARC and MIPS) have branch
predictors but the prediction history bits are
stored in on-chip prediction tables. NOT in the
instruction cache. Keeping them in the cache is not
unreasonable though and may be an interesting
research topic since you do not have to keep
a big on-chip table that suffers aliasing and
capacity problems.
Y.
I agree. I mean, based on their example of optimizing native HP PA-8000 binaries, one could, in theory, potentialy run M$ proggies faster on my linux box using something like Bochs or FreeMWare or Plex or whatever the hell they're calling it these days. Granted shitty code run 20% faster is still shitty code. ;P It would still be nice to see someone more talented than I pull it off.
I haven't read through all the posts so this may be repetitve buttttt, but I happen to know a little about the Dynamo system ant it is pretty cool. I was lucky enough to attend a lecture by the person who headed the research. Part of the reason they can obtain a speed up of 20% is that they apply really really agressive profiling techniques. Some of these techniques stem from advanced compiler optimization research. In short they find the most commonly executed paths at the machine language level and cache them. The trick of course, is that it is all done at runtime, thats what makes it really cool. They are able to overcome the overhead of processing and collecting the profiling data, and get the %20 performance over the normal execution time.
BOFH, My model for being a sysadmin :)
Go od for Bill!
Smalltalk.
"Video bona proboque; deteriora sequor." -- Ovid
An employer of mine told me of the time the IBM CSE, upon first seeing the mutated machine, said "you're running 3x as many users as this system is supposed to support. Corporate headquarters is going to put a price on your head"
Most of "new computer technologies" have their roots back in the `60s and early `70s. Changes have made some of these more effective or important now, resulting in their resurfacing as "bold new concepts".
Do not click here
Uh oh, not good.
A choice of masters is not freedom
Now I don't feel as bad when I write crappy, inefficient code. Let someone else clean up the mess ;]
"My mother never saw the irony in calling me a son-of-a-bitch." - Jack Nicholson
Case in point, Java Virtual Machines are seldom as fast as native code... And smalltalk and Lisp are both languages that are slow enough/memory intensive enough (especially LISP, the recursion master) that it's only relativly recently (last decade or so) that we have had computers fast/powerful enough to actually do much with them on a practical level.
Your point about the distinction between runtime morphing and emulation is well taken, however.
Or does that fall more into the catagory of emulation?
(Which, of course, has almost always been notoriously slow)
There's a fairly large difference, when you drill down to the details level, between runtime morphing of pcode-based and/or interpreted languages like Smalltalk and Java and native code, which is being discussed here. This stuff is obviously based off of lots of past research, but it IS a rather nice new use of ideas and technology. I'd love to see widespread usage of this type of tech on the x86 and other platforms.
Also, the Mac emulators for the Amiga, even the hardware based ones, did no code morphing. The Amiga and Mac were both based on the exact same processor line (680x0), and thus no translation was needed. My guess is you are thinking of that one Mac emulator that came out that promised it would be useful for emulating all sorts of other computers. That was mostly smoke and mirrors. The board was pretty much just a hardware add-on for the Amiga that would let you plug a Mac ROM in so you didn't have to use a binary image on disk.
Most Mac emulators on the Amiga at the time, even the software-only ones, were ever so slightly faster than a Mac running the same CPU. This was no fancy code trickery... just superior (at the time) video hardware in the Amiga, coupled with the fact that the Mac ROM would be cached within fast RAM if it was available.
Close enough. Sun didn't buy a company as far as I know. The technology comes mostly from the Self group at Sun. This is credited (but maybe not enough!) in the Hotspot whitepaper (try searching for 'Self').
When I was visiting the Self group in '96, they told me how frustrated they were that they knew how to speed up Java and couldn't get the Java guys to pay any attention... it was good to see that they got the message across eventually.[end shameful namedropping]
Dynamic optimisation is totally not new but there's always room for more research.
By the way Self is really worth checking out. How many environments let you change your inheritance at runtime? Or totally blitz the machine by visually manipulating the interpreter's settings?
RyanS
Geesh people, Runtime morphing is NOT NEW. The company that Sun bought to do the Hotspot VM did it with Smalltalk YEARS before Transmeta.
Close enough. Sun didn't buy a company as far as I know. The technology comes mostly from the Self group at Sun. This is credited (but maybe not enough!) in the Hotspot whitepaper (try searching for 'Self').
[Begin shameful namedrop]When I was visiting the Self group in '96, they told me how frustrated they were that they knew how to speed up Java and couldn't get the Java guys to pay any attention... it was good to see that they got the message across eventually.[end shameful namedrop]
Dynamic optimisation is totally not new but there's always room for more research.
By the way Self is really worth checking out. How many environments let you change your inheritance at runtime? Or blitz the machine by visually manipulating the interpreter's settings?
RyanS
Heck, there were even products for the Amiga that would take code for the Mac and change it around to work on the Amiga. (That was a hardware based product). Not sure how well they optimized the code, but it still fits into this category.
The other really intelligent thing Perl does it is uses native libraries. For example, if you use a regular expression in Perl, that's one instruction in the virtual machine- which calls a native C routine to actually do the regular expression. Most of your time in a Perl program is not spent in the virtual machine, it's spent in these optimized native routines.
This is one problem Java has for performance- since they wrote most of the libraries in Java, you are spending most of your time in the virtual machine.
LISP is but one of a great number of languages that used byte-code compilation many years before Perl. (Trivia: The interpreter for LISP was accidentally written in LISP before LISP existed!*)
But I don't know of any widely known languages before Perl which were predominantly thought of as interpreted languages and were compiled. The overhead of compiling was just thought to be too large for an interactive program...
Cheers,
Ben
* Piece of trivia I saw in the Dragon Book. The story is that LISP was used as an informal way to write down a program that you would then convert by hand into assembler. Well someone decided to demonstrate that LISP made a nice general purpose machine and wrote an "eval" function in LISP. Someone else saw that, realized that eval was *also* a LISP interpreter and converted it by hand into the first working LISP interpreter. I remember that this was in 1959, but I forget the names...
My usual seat in the cluetrain is at A HREF="http://pub4.ezboard.com/biwethey.ht
An optimizing compiler can not determine at compile-time which branches will be most often taken at run-time. This is what Dynamo seems to do.
If you think of control flow as a tree, Dynamo is doing run-time tree balancing, or twisting the code loops to shorten (or inline if you will) the paths most often enountered.
Then again, this is just my preliminary, first-read, interpretation. (no pun int)
Sidebar: Has anyone noticed that even when slashdot is doggy-slow, the ads pop up in not time flat, and then we get to stare at them while the rest of the page takes it's time loading?
-- What you do today will cost you a day of your life.
R U shittin' me? 20% is a kick ass SPEC improvement, esp since it's done totally in software. You can bet that there will be some sort of hardware support soon. Like the article starts out stating, the compiler team get hugeass bonusses based on piddly decimal point improvements (I might be overstating the case somewhat, tho). 20% is like, well, huge. The compiler guys didn't believe the numbers at first.
R U shittin' me mk II? compared to the black voodoo magic of some compiler transformations, this transformation is almost clear as crystal. And it runs in REAL TIME. These days, O(n^3) worst case is almost acceptable for static optimsation algs.
But you are right about your other points, sorta. While hint bits are a cheapass version of this idea, it appears that the real gains come when you start completely removing the branches altogether and propagate assumptions forward (redoing register allocation, that sort of thing).
Even more funner would be to cross kernel boundaries and inline kernel calls. Transmeta can do this, as their optimiser run at the meta level, whilst dynamo cannot, as it is merely a user level program.
I think that people have to think a bit. Considering that smalltalk is not as extensively used as C++ or java and that the Amiga is also not extensively used by the average person is reason enough to rework the idea for newer hardware/software combinations.
What I have yet to actually see is any of this affect a real desktop computer that I can buy. A transmeta processor would be a nice touch but tell me why waste something like that on afn embedded or handheld device which almost by definition has severe limitations?
Slashdot social engineering at it's finest
Intel IA-32 (i386 through Pentium III) doesn't work that way. In the Pentium Pro and later, there are branch prediction bits in the cache, set as the code executes. After a few passes through a loop, the most probable path is known and speculative execution gets smarter. Intel does its scheduling in hardware.
I'm not thrilled about adding all that hard-to-debug software to get only 20% better performance on SPECmark benchmarks.
When you run a Perl program, for instance, it actually compiles and then runs... unlike most interpreted languages, which translate and run instructions one at a time. Instead, a Perl script could begin to run translating one instruction at a time, go until it hit a trace condition, increment the counter, and so on.
Following the Perl example, the down side is that Perl does some compile-time error checking, which wouldn't be done if you used this algorithm.
I'm speculating off the top of my head, mind you... but then, if I'm not talking sense, I'm sure someone will let me know! ;-)
"The best we can hope for concerning the people at large is that they be properly armed." - Alexander Hamilton
IMHO, if they are getting a 20% increase in speed after fragmenting *native binaries*, then they have some serious work to do on thier compilers. If the code had been properly optimized by the compiler, then on the fly fragmenting should not speed up execution time. On that same note though, I thought HP native compilers were among the best at optimization. Am I off on this?
Stephen L. Palmer
---
Here is the result of your Slashdot Purity Test.
You answered "yes" to 86 of 200 questions, making you 57.0%
Perl compiles the instructions down to byte-code, and then runs an interpreter on the byte-code. Trust me, you really don't want it to interpret one line at a time. You really, really don't want that.
Also note that while Perl was the first well-known interpreted language to do this, it is an idea that is now considered the norm.
But to apply HP's ideas to Perl would be quite possible and probably beneficial. Today Perl has a lot of run-time lookups. For instance if I write a function foo, and I call it from a function bar, then (I believe) every time you run my function bar there is a run-time lookup while running bar to find foo because you might have changed it.
But foo almost never changes from call to call! Imagine instead the following. Each time you run bar, first check a flag for whether you have profiled it. If you have not then increment a counter for how many times it is accessed. If that counter is below some level, just execute. If it is above some target, then profile it and record information from which you can check whether it needs to be profiled. After profiling then execute the profiled version.
If your original check found that you profiled it, then quickly check that the profiled version has not been invalidated, and if all is fine then run the profiled version. If it has been invalidated then flip the profiled version, set the counter at 0, and run the safe version.
So...what does this complicated logic do? Why for a minor run-time penalty you manage to remove most of the repeated dynamic run-time lookups for which function to call, probably with significant speedups! (It could be an even bigger win if you did a similar trick on method-call lookups.)
The basic technique of taking out time to store a fast static lookup on data that is otherwise recalculated is called memoizing, and is good to know. (In Perl memoizing is almost always done by having a lexical hash (ie one defined with my) around the function with memoizing being done by sticking the answer in the hash. So the program becomes check the memoized hash and return that answer if you can. Otherwise calculate the answer, stick it in the hash, and return the answer.)
Cheers,
Ben
My usual seat in the cluetrain is at A HREF="http://pub4.ezboard.com/biwethey.ht
Huh - how do HP think they can market an entire server just for producing little sticky tags? I mean, who even buys those small electronic Dymo machines? I mean the old fashioned mechanical ones were perfectly good and everyone loves that "oops, I didn't click hard enough on the last letter, so I'll over-compensate on this letter" look.
...Oh, you said *Dynamo*! Sorry!
...I'll go away now....
A little planning goes a long way...