HPs Dynamo Optimizes Code

Runtime morphing is NOT NEW by Anonymous Coward · 2000-03-22 23:19 · Score: 2

Geesh people, Runtime morphing is NOT NEW. The company that Sun bought to do the Hotspot VM did it with Smalltalk YEARS before Transmeta. People talk like this is some sort of new thing! It's NOT.

Heck, there were even products for the Amiga that would take code for the Mac and change it around to work on the Amiga. (That was a hardware based product). Not sure how well they optimized the code, but it still fits into this category.

Re:Runtime morphing is NOT NEW by X · 2000-03-22 23:35 · Score: 2

....and before that people were doing it with APL...

--
sigs are a waste of space

Re:Letting you know... by bhurt · 2000-03-23 23:29 · Score: 2

The other really intelligent thing Perl does it is uses native libraries. For example, if you use a regular expression in Perl, that's one instruction in the virtual machine- which calls a native C routine to actually do the regular expression. Most of your time in a Perl program is not spent in the virtual machine, it's spent in these optimized native routines.

This is one problem Java has for performance- since they wrote most of the libraries in Java, you are spending most of your time in the virtual machine.

Re:Is that really runtime morphing? by X · 2000-03-27 07:11 · Score: 2

Seldom as fast is a matter of perspective. Even for computationally intesive operations I've done several Java benchmarks that show it to be very close to native code in performance (like within %10). Given the variances in performance between native compilers, there is no guaruntee that native code will outperform a good JVM.

Similar things have been demonstrated for Smalltalk. Lisp has certainly outperformed a lot of other languages in benchmarks, but it's mostly compiled to native code anyway.

--
sigs are a waste of space

Re:Is that really runtime morphing? by X · 2000-03-22 23:38 · Score: 2

You're talking about orthoganol concepts. Emulation is what you're doing, runtime morphing is how you're doing it.

The statement that emulation has almost always been notoriously slow is very misleading. AS/400's only do "emulation". JVM's only do "emulation". Smalltalk, LISP, and any other language that defines a VM for it's runtime are all doing "emulation".

--
sigs are a waste of space

Yabut by tilly · 2000-03-23 02:42 · Score: 2

LISP is but one of a great number of languages that used byte-code compilation many years before Perl. (Trivia: The interpreter for LISP was accidentally written in LISP before LISP existed!*)

But I don't know of any widely known languages before Perl which were predominantly thought of as interpreted languages and were compiled. The overhead of compiling was just thought to be too large for an interactive program...

Cheers,
Ben

* Piece of trivia I saw in the Dragon Book. The story is that LISP was used as an informal way to write down a program that you would then convert by hand into assembler. Well someone decided to demonstrate that LISP made a nice general purpose machine and wrote an "eval" function in LISP. Someone else saw that, realized that eval was *also* a LISP interpreter and converted it by hand into the first working LISP interpreter. I remember that this was in 1959, but I forget the names...

--
My usual seat in the cluetrain is at A HREF="http://pub4.ezboard.com/biwethey.ht

Re:Yabut by cburley · 2000-03-23 08:15 · Score: 2

But I don't know of any widely known languages before Perl which were predominantly thought of as interpreted languages and were compiled. The overhead of compiling was just thought to be too large for an interactive program...

I do: Beginner's All-purpose Symbolic Instruction Code, otherwise known as BASIC.
Since around 1980 at least, it seems most people thought of it as an interpreted language, or, more properly, of its implementations as interpreters.
For early PC's that may well have been true (with keywords and such converted to byte codes internally, to save space versus storing the whole program in ASCII).
But the original and many subsequent BASICs were environments that looked interpreted but actually compiled when the " RUN " command was issued.
(Which, for a large program, would mean a noticeable wait before anything useful happened within the program.)
That goes back to the mid-60s, and through my own personal close encounter with TOPS-10 BASIC around the mid-70s, when I first learned how compilers worked (and what PDP-10 assembly code really looked like) by typing in the whole thing from a listing! (Took about three months or so on my beat-up KSR-33 Teletype.)

--
Practice random senselessness and act kind of beautiful.

Re:Bad compilers? by SimonK · 2000-03-23 00:39 · Score: 2

As others have pointed out, a compiler *cannot* do the kinds of optimisations dynamo does.

Compilers only have the static representation of the program available to them. They have no idea what will be executed at runtime or how frequently. The only way they could ever do this, is to perform a data flow analysis - no faster, less effective and much more complex than actually running the program.

Not really; run-time vs compile-time by jabber · 2000-03-22 23:54 · Score: 2

An optimizing compiler can not determine at compile-time which branches will be most often taken at run-time. This is what Dynamo seems to do.

If you think of control flow as a tree, Dynamo is doing run-time tree balancing, or twisting the code loops to shorten (or inline if you will) the paths most often enountered.

Then again, this is just my preliminary, first-read, interpretation. (no pun int)

Sidebar: Has anyone noticed that even when slashdot is doggy-slow, the ads pop up in not time flat, and then we get to stare at them while the rest of the page takes it's time loading?

--

-- What you do today will cost you a day of your life.

Re:A Priori vs. A Posteriori Optimizations by Azza · 2000-03-23 07:06 · Score: 2

Then, choose a representative set of inputs to exercise the binary and run it using the "translator."

There's your problem right there. As a previous poster mentioned, for extremely dynamic systems, picking representative data is difficult, maybe even impossible.

That's not to say this isn't cool tech though. I'd love to see this implemented on x86, just to see what's possible.

Re:An artifact of HP architecture? by jovlinger · 2000-03-23 04:42 · Score: 2

I'm not thrilled about adding all that hard-to-debug software to get only 20% better performance on SPECmark benchmarks.

R U shittin' me? 20% is a kick ass SPEC improvement, esp since it's done totally in software. You can bet that there will be some sort of hardware support soon. Like the article starts out stating, the compiler team get hugeass bonusses based on piddly decimal point improvements (I might be overstating the case somewhat, tho). 20% is like, well, huge. The compiler guys didn't believe the numbers at first.

R U shittin' me mk II? compared to the black voodoo magic of some compiler transformations, this transformation is almost clear as crystal. And it runs in REAL TIME. These days, O(n^3) worst case is almost acceptable for static optimsation algs.

But you are right about your other points, sorta. While hint bits are a cheapass version of this idea, it appears that the real gains come when you start completely removing the branches altogether and propagate assumptions forward (redoing register allocation, that sort of thing).

Even more funner would be to cross kernel boundaries and inline kernel calls. Transmeta can do this, as their optimiser run at the meta level, whilst dynamo cannot, as it is merely a user level program.

Well an idea usually has to be changed a little by slashdot-terminal · 2000-03-22 23:27 · Score: 2

I think that people have to think a bit. Considering that smalltalk is not as extensively used as C++ or java and that the Amiga is also not extensively used by the average person is reason enough to rework the idea for newer hardware/software combinations.
What I have yet to actually see is any of this affect a real desktop computer that I can buy. A transmeta processor would be a nice touch but tell me why waste something like that on afn embedded or handheld device which almost by definition has severe limitations?

--
Slashdot social engineering at it's finest

An artifact of HP architecture? by Animats · 2000-03-23 01:26 · Score: 2

This may reflect a problem with the PA-RISC architecture HP uses. Some RISC machines have "hint" bits in instructions, which indicate which way a branch is likely to go. The IBM Power series (and PowerPC) works that way. Setting those bits at compile time requires guessing about run-time control flow. At run-time, it can be done by observation, which is better.

Intel IA-32 (i386 through Pentium III) doesn't work that way. In the Pentium Pro and later, there are branch prediction bits in the cache, set as the code executes. After a few passes through a loop, the most probable path is known and speculative execution gets smarter. Intel does its scheduling in hardware.

I'm not thrilled about adding all that hard-to-debug software to get only 20% better performance on SPECmark benchmarks.

Re:Bad compilers? by jeff_bond · 2000-03-22 23:58 · Score: 2

A compiler can't optimise a code 'path' that crosses from one object file to another (e.g. a call to a function in a different object file).

Even if you wrote optimisation code into your linker (which does see all the object files at once), it still couldn't optimise the way dynamo does, because any particular function can be called from many different places in the code.

Jeff

--
stty erase ^H

The algorithm's the thing by Skald · 2000-03-23 00:08 · Score: 2

It seems to me that if this is as cool as they say it is, it's a good solution to a more general problem. Why just apply this to translating binaries? It could work for interpreted languages as well.

When you run a Perl program, for instance, it actually compiles and then runs... unlike most interpreted languages, which translate and run instructions one at a time. Instead, a Perl script could begin to run translating one instruction at a time, go until it hit a trace condition, increment the counter, and so on.

Following the Perl example, the down side is that Perl does some compile-time error checking, which wouldn't be done if you used this algorithm.

I'm speculating off the top of my head, mind you... but then, if I'm not talking sense, I'm sure someone will let me know! ;-)

--

"The best we can hope for concerning the people at large is that they be properly armed." - Alexander Hamilton

Bad compilers? by slpalmer · 2000-03-22 23:27 · Score: 3

IMHO, if they are getting a 20% increase in speed after fragmenting *native binaries*, then they have some serious work to do on thier compilers. If the code had been properly optimized by the compiler, then on the fly fragmenting should not speed up execution time. On that same note though, I thought HP native compilers were among the best at optimization. Am I off on this?

Stephen L. Palmer
---
Here is the result of your Slashdot Purity Test.
You answered "yes" to 86 of 200 questions, making you 57.0%

Letting you know... by tilly · 2000-03-23 01:08 · Score: 3

Perl compiles the instructions down to byte-code, and then runs an interpreter on the byte-code. Trust me, you really don't want it to interpret one line at a time. You really, really don't want that.

Also note that while Perl was the first well-known interpreted language to do this, it is an idea that is now considered the norm.

But to apply HP's ideas to Perl would be quite possible and probably beneficial. Today Perl has a lot of run-time lookups. For instance if I write a function foo, and I call it from a function bar, then (I believe) every time you run my function bar there is a run-time lookup while running bar to find foo because you might have changed it.

But foo almost never changes from call to call! Imagine instead the following. Each time you run bar, first check a flag for whether you have profiled it. If you have not then increment a counter for how many times it is accessed. If that counter is below some level, just execute. If it is above some target, then profile it and record information from which you can check whether it needs to be profiled. After profiling then execute the profiled version.

If your original check found that you profiled it, then quickly check that the profiled version has not been invalidated, and if all is fine then run the profiled version. If it has been invalidated then flip the profiled version, set the counter at 0, and run the safe version.

So...what does this complicated logic do? Why for a minor run-time penalty you manage to remove most of the repeated dynamic run-time lookups for which function to call, probably with significant speedups! (It could be an even bigger win if you did a similar trick on method-call lookups.)

The basic technique of taking out time to store a fast static lookup on data that is otherwise recalculated is called memoizing, and is good to know. (In Perl memoizing is almost always done by having a lexical hash (ie one defined with my) around the function with memoizing being done by sticking the answer in the hash. So the program becomes check the memoized hash and return that answer if you can. Otherwise calculate the answer, stick it in the hash, and return the answer.)

Cheers,
Ben

--
My usual seat in the cluetrain is at A HREF="http://pub4.ezboard.com/biwethey.ht

HP's Dymo Server by Ratface · 2000-03-22 23:36 · Score: 3

Huh - how do HP think they can market an entire server just for producing little sticky tags? I mean, who even buys those small electronic Dymo machines? I mean the old fashioned mechanical ones were perfectly good and everyone loves that "oops, I didn't click hard enough on the last letter, so I'll over-compensate on this letter" look.

...Oh, you said *Dynamo*! Sorry!

...I'll go away now....

--

A little planning goes a long way...

Re:Bad compilers? by David+Greene · 2000-03-22 23:54 · Score: 4

Yes, you are. HP has some of the best compiler people around.

The thing about static compilers is that they have no idea what happens at run-time. Profiling has been used to mitigate this somewhat, but it's still a huge problem.

Accesses through memory are slow, so you want to get rid of them. One way to do this is through register allocation. Unfortunately, even if an infinite number of registers was available, you couldn't allocate everything to registers.

Why? Because we use pointers. There are multiple names for the same data running around in our little electronic brain. When you allocate something to a register, you bind it to one name. This is by definition incorrect for aliased data (data with multiple names).

Optimizations like register promotion try to get around this by allocating things in regions where the compiler can prove it only has one name. But this is exceedingly difficult when you have things like function calls which must be assumed to access and modify lots of data.

I won't even get into the problem of static instruction scheduling or other optimizations like partial redundancy elimination.

In short, aliasing through memory is nearly impossible to track accurately at static compile time. At run-time, the machine knows exactly which memory accesses reference which data, so things like run-time register allocation can do a better job. Crusoe does this to a limited extent.

Dynamo is essentially a software trace cache. Except that when forming the trace, it does transformations like Common Subexpression Elimination and other traditional compiler manipulations.

IBM has the Daisy project, which does code morphing from PPC to a VLIW ISA. I believe it also does some run-time optimizations. Projects like DyC and Tempo have been compiling at run-time for a while now.

I like to think of dynamic compilation in terms of the stock market. Which would you rather do: trade stocks with only limited information about their past behavior (and sometimes none at all), or trade stocks after having observed the absolutely most recent trends? I'll bet that if you pick the first strategy and I pick the second, I'll beat you every time.

That said, there are tricks ou can pull with static compilation. IA64 has the ALAT, which lets the machine track when store addresses match load addresses. This lets the static compiler speculatively move a load ahead of the store. If the store conflicts, the machine will execute some compiler-provided code to fix up the error. Essentially, the compiler is making an assumption that the load and store do not reference the same data and is communicating that assumption to the machine. The machine checks the assumption and invokes some fixup code if it proves to be incorrect.

--

Re:Bad compilers? by Cederic · 2000-03-22 23:39 · Score: 4

The review explains that the optimisations performed are those not available to a static compiler. Because the compilers are effectively optimising 'design time' code, there are further optimisations that can be performed on the 'run time' binary.

As an example, consider a fragment of code that performs an operation on a sequence of numbers. The sequence is unknown at design time - bounds checking, conditional statements and other code will all need to be compiled in to ensure different sequences can be handled. At runtime it may be that the same 4 number sequence is processed a large number of times. The runtime optimiser (Dynamo) detects this and can optimise out the bounds checking, the conditional statements and streamline the rest of the code.

It's very clever technology, but it's surprising it hasn't been done more extensively before now. What would be very interesting is to see if something like VMWare incorporated this technology - could it effectively negate the CPU cost of running VMWare itself?

~Cederic

20 of 78 comments (clear)