Slashdot Mirror


Transmeta Code Morphing != Just In Time

Andy Armstrong has written us a pretty interesting, but somewhat technical piece of Just in Time, computer generated assembly language, profiling, and more. Its a pretty interesting little bit that ought to give you a lot to talk about.

The following was written by Slashdot Reader Andy Armstrong

Transmeta Code Morphing != Code Morphing The recent Transmeta announcment crystalised something I've been thinking about for a while. It's my belief that it should be possible to make a compiler generate much better code in the general case than someone writing hand coded assembler. Furthermore it should be possible for a JIT (Just In Time) compiler to produce better code than a conventional one time compiler.

Why should a compiler be better than an experienced assembler programmer? Well,

  1. the compiler can know the target processor intimately (cycle times, impact of instruction ordering, etc.)
  2. the compiler gets to re-write the entire program each time it sees it.

The second point is critical: any programmer writing an assy. language program of any significant size will write the code to be maintainable. Of course, it makes sense to do things like defining standard entry and exit sequences to routines, keep a few registers spare (in those architectures that have more than a few) and other practices that lead to maintainable code, but the compiler doesn't have to maintain the code it writes. It gets to write the whole thing from scratch every time. This means that functions can be inlined (and repeated code sequences turned into functions). Loops can be unrolled then rolled back up at the next compile when the programmer has decided that space is more important than speed.

If you know you're not going to have to maintain a bit of code you can do some pretty scary things to get it to perform better. A compiler can potentially do that all the time.

Why should JIT be better than one time? This should be clear to anyone who's followed the Transmeta story. A key element of their code morphing technology is that they insert instrumentation into the code they generate, effectively profiling as it runs so that the compiler can decide which bits to optimize the next time it sees the code. It's well known that the coverage graph for a typical programme looks extremely spiky - the most frequently executed code may get hit thousands or millions of times more than it's neighbours. It follows that it really isn't worth optimizing the stuff that only accounts for a millionth of the code's execution time.

This brings me to my point: is there really any reason why a Java / JIT combination shouldn't result in code that executes as quickly as the equivalents in other languages?

You might suggest that garbage collection must slow things down, but I'm convinced that, done right, garbage collection can actually improve performance. malloc()/free() require the memory manager to think about the heap for every call, but new() can be implemented as a stack (m = memlimit; memlimit += size; return m) and the garbage collector gets to do all its memory management in one chunk - it can take an overview of the memory landscape rather than trying to keep things fairly optimal each time memory is released as free() must.

You could argue that the OO nature of Java means that it must dynamically allocate objects that would be static in a program written in C or assembler. That's true, but (assuming calls to new() are cheap, which I believe they can be) this really isn't a problem. Current processors don't take a huge performance hit when working with objects who's address is not known at compile time; in fact in many architectures it makes no difference at all.

So while it might seem profoundly counter-intuitive can anyone actually give me a good reason why Java + JIT should be slower than Good Programmertm + Assembler?

14 of 454 comments (clear)

  1. Re:Evolve code. by DQuinn · · Score: 4

    Sorry... this post is probably meant to be deeper inside this thread.

    Are we not all just dancing around the idea of a human compiler with a vast database of algorithms, intuition, look-ahead properties, and an abstracted idea of what it is exactly that the program is doing? IE. There is no compiler, to my knowledge, which writes highly effective parallel code. Why is that? Simply because no compiler can really understand the overall problem.

    How long do you think it will be before the all-powerful AI compiler is written, with all of these abilities, and more, on a box that runs at a few GHz?

    Or maybe i'm just being a little too science fictiony here :)

    Cheers,
    D

    --
    os.system("perl -e 'print \"My first Python Script.\"'")
  2. Re:Self-Modifying Code by ZoeSch · · Score: 4

    Nevertheless I'd like to point out that JIT is based a somewhat dangerous technique: a program that alters (its own) code. I believe this technique was used in the eighties to scare off hackers by making code incomprehensible and hard to disassemble until the program was actually running. Also (even on a good old 6502 processor) it's possible to make some speed improvents peeking and poking into the code you're actually executing.

    Not at all true... self modifying code is still around not only on JIT compilers but in interpreted languages and even device drivers (Or how do you think some device drivers adjust in real time to hardware changes?). And it's not a dangerous technique in itself, but (As usual) when executed improperly it can be a wild beast (As most people discovered when thy installed their first Cyrix and AMD 486 processors and discovered most x86 code wouldn't run properly because of cache integrity issues)

    When compilers for Microcomputers got faster and most processor architectures (known as Harvard architecture, I believe) explicitly require a division of RW and RO memory segments, self-modifying code was abandoned...

    First of all, Harvard machines require separate instruction and data pipelines and memory spaces, and 99% of the CPU's on the market (General, Embedded, etc.) are Von Neumann machines which use a single space to store instruction and data. The thing is, newer CPU's (Even if they're Von Neumann desings)have separate caches for data and instructions and most self modifying code violates the cache integrity (See above) because the code is modified in the cache but never stored back into memory (Unless you use a write-through cache design which is impractical from the performace point of view).

    BUT (And that's a big BUT :) if your JIT VM forces periodic cache writes to RAM (Not every time but enough times to ensure some sort of coherence if the cache and RAM are out of sync) you lose a bit of performance but gain a stabler setup.

    Disassemblers for JIT won't be as complex as a JIT assembler just because when you disassemble a piece of code you treat is under the black box principle (What goes in and what goes out) in order to derive the fundamental principles and algorithms, which can be implemented in 1E999 different ways, so if you disassembled the less optimal morph, bad luck... disassemble another test run and see if you get a better implementation of the algorithm. Or you could use the aforementioned principle and implement your own algorithm.

    JIT code/compilers, BTW are also being tested to produce self modifying chips (ASIC's and FPGA's) under VHDL/Verilog using also NNets or Gen. Algorithms to obtaing a first implementation and then using the JIT to optimize it...

    Really interesting stuff (The Transmeta Crusoe) and my bet is that soon other companies will follow altough not for the same reasons as Transmeta :)

    ZoeSch

    --
    I hate to agree with davecrazy but...
  3. Re:What about the JIT? by um...+Lucas · · Score: 4

    I read that the 700 MHz Crusoe chip actually only attains the performance of a 500MHz Pentium (II or III, I forgot - oh, can Crusoe emulate SIMD? Not like it matters, but I'm wondering).

    So there you have it - code morphing, or just in time compiliing or whatever gives you about 70% of the performance precompiled code.

    That's not to downplay Transmeta's theoretical accomplishment. Those extra 200MHz go to things like monitoring it's VM, throttling the clock, etc. So in the end you lose 30% of the performance but make it up with 35X less power consumed.

    It's a great trade off for laptops, handhelds, etc. But not for workstations, servers, etc...

    I still don't like the idea that they're keeping the instruction sets closed. It would seem like if someone out there wanted to port GCC to Crusoes native instructions, that would be good... But they just don't want to be percieved as being at all incompatible with Intel, i guees.

  4. Compilers dont write better code than humans by tjansen · · Score: 5
    I have already heard that assumption that a compiler can generate better code than a programmer a thousand of times, but it does not get more true by repeating it - it is false. At least until compilers are able to understand the program. In every program there are things the compiler simply doesnt know. For example, in x86 assembler it is possible to save some cycles (and memory accesses) by using 16-bit or even 8-bit registers instead of 32 bit registers. You can double the number of registers by doing it. You need to be sure that the values in the registers are below 65536 or 256 to use these tricks, and the programmer can know this, but the compiler cant. The compiler might profile the possible value range if it is a really advanced compiler (I never heard of any compiler actually doing this), but it cannot be sure so it must at least check the values before for the case that they arent.

    As long as the programmer has more knowledge than the compiler, he will always find tricks to save an instruction here or there and outperform the compilers this way. You can find a great example of such programming tricks in the PCGPEs article about texture mapping inner loops here.

    1. Re:Compilers dont write better code than humans by Mr.+Slippery · · Score: 4
      You need to be sure that the values in the registers are below 65536 or 256 to use these tricks, and the programmer can know this, but the compiler cant.
      Isn't this why we have short int, long int, and char data types - so we can tell the compiler these things?
      --
      Tom Swiss | the infamous tms | my blog
      You cannot wash away blood with blood
    2. Re:Compilers dont write better code than humans by AndrewHowe · · Score: 4

      In some ways I agree with you, OK, I love writing assembler stuff.
      But I'm trying to separate fantasy from the reality here.
      The fantasy is that you're some super studly asm c0der who can not only produce properly scheduled code, but also has the time to *completely re-write* it:
      a) For each new processor
      b) For each sub-type of that processor with different internal resources
      c) Every time the specification changes - Maybe you have limited the range of a variable or something.
      Any of these changes will require a huge amount of work. Even a small change, coupled with the pipelined nature of modern processors, means a large modification to re-establish optimal resource usage.
      The reality is that no-one has the time to do this, and even if they do, they should be doing something else instead. High level optimizations almost always beat low level tweaking.
      In addition, I simply don't believe it's impossible to
      a) Tell the compiler in more detail how to optimise your code
      b) Have the compiler suggest ways in which it could produce better code (for example, "Is it OK to assume there is no aliasing of this item?" or "Hey, if I make this variable 8 bits in size I can do this huge optimization")
      c) Have the compiler work this stuff out for itself
      d) As (c) but at run time, guided by profiling information.
      Some of those are hard to do right now, but I don't see anything that would make them impossible.
      Does anyone?

    3. Re:Compilers dont write better code than humans by GnrcMan · · Score: 4

      That may be true of a crappy compiler on a lame platform (x86)...but I'd like to see you try reordering and slot scheduling Alpha instructions by hand to enable optimal pipelining and branch prediction.
      As someone who worked firsthand on a compiler for the Alpha, I can tell you that for 99% of the cases, the compiler can perform better general optimizations than a human. That 1% left over is what you should be concentrating your efforts on.
      Historically, the point of a CISC processor was to reduce compiler complexity. So of course hand coding optimal X86 assembly isn't going to be difficult.

      --GnrcMan--

    4. Re:Compilers dont write better code than humans by locust · · Score: 5
      I have already heard that assumption that a compiler can generate better code than a programmer a thousand of times, but it does not get more true by repeating it - it is false.

      The metric that I given in my university computer architecture class was that a compiler can generate better code than 90% of assembly programmers. Which, when I examine higher level code written by a variety of programmers, some good, some bad, makes sense. Some people know all the details of a language and some don't. Some pay attention and some don't. I'm pretty sure it was backed up by a reference to an IEEE paper somewhere.

      If large numbers humans could reliably write large volumes of high quality, optimized code, at high speed, we'd still be using assembly everywhere. But they can't. I certainly don't have the time or the inclination to learn all the ins and outs of every instruction set architecture I operate on to squeze every last bit of juice out of the machine. And I certainly am unwilling to give up the expresiveness of a highlevel compiled language.

      But ultimately, the fraction of people that need to write things like texture mapping loops is very small, and the fact of the matter is that the loop is probably embeded in some higher level piece of code. So relax, compilers reliably produce better code than the majority of assembly programmers (maybe not on /.). Or more precisely, better machine code, than the majority of programmers, if they were asked to write assembly. --locust

  5. Profoundly counterintuitive? by TheDullBlade · · Score: 4

    Seems profoundly intuitive to me.

    First of all, a JIT is rushed. You can design your optimizing one-time compiler to look at the code for better ways to do it all day, if you like, and while the developers will moan, they will still use it if gets them better results.

    Secondly, Java has a very specific standard (that is typically fudged... but that's beside the point). It doesn't give any leeway for a program to act a little different in boundary conditions, like C does. In C, an int is whatever size int is most easily handled by the target system; if you need a certain exact size of int, you can code that in with some extra effort. In Java, an int is a Java int and Java doesn't care in the least what the most efficient native int format is.

    Thirdly... automatic garbage collection is less efficient than hand coded allocation and deallocation, and dynamic allocation is less efficient than static allocation. There are odd cases where this is false, but they generally hold true, and where they do not, the C version can always use the more efficient method anyway. (and extremely fast calls to "new" can be easily achieved... at roughly 50% memory wastage)

    Fourthly: Java locks you into a OOP model which is inherently inefficient (at least as done in Java). All function calls must be dynamically redirected etc.

    I could go on, but it feels like trying to describe that water is wet and stones sink in it to someone who thinks it's intuitive that water should sink into stones.

    However, I will not dispute that you can definitely get better results by recompiling for each chip that something will run on. Some JITs may use this to push out ahead of one-time compilers.

    BTW, an experienced assembly coder will always beat or at least equal the optimizing compiler, because if nothing else works he can always look at what the compiler produces and see if he can improve on that. Besides, optimizing compilers are good, but not that good, someone has to write them, and when was the last time that you wrote a program that can solve complex creative problems better than you can?

    --
    /.
    1. Re:Profoundly counterintuitive? by BurdMan · · Score: 5
      Firstly, water does sink into rocks. It then freezes and the pressure it exerts reshapes the world as we know it. But that is beside the point.

      I will address your comments with respect to the Java Programming Language and its HotSpot compiler technology. If you would like to enlighten yourself on the techniques behind HotSpot take a look at http://self.sunlabs.com. Transmeta, IMHO, was influenced by the same concepts when designing its code morphing techniques.

      1. "A JIT is rushed" - The first time bytecode is compiled to native code it must be done quickly to avoid delay. That code may be inefficient, true, but it can also be instrumented with profiler like information that can be used by later passes of the compiler. You see, what the author is describing is not a JIT system but a dynamic optimizing compiler which has the opportunity to study the program during execution and recompile parts or all of that program based on that information.
      2. "no leeway for boundary conditions" - Well here again you seem a bit confused and your example is naive. The native representation of a type is not fixed in Java, only the conceptual representation. To be a compliant Java implementation all results generated using 'int' (to use your example) must be consistent with the conceptual representation of 'int' when using the various operations. If the optimizing compiler finds a way to represent an 'int' in a more optimal way it is free to do so as long as the results of the operation don't change.
      3. "garbage collection is slower than hand coded memory management" - You really ought to read some of the latest garbage collection papers. This is no longer true even for collectors managing C or C++ allocation. When given an environment like Java which doesn't allow pointers and other antiquated memory techniques a good dynamic compiler with a modern garbage collector is both faster and more efficient than any hand coded attempt on all but the most simple of applications.
      4. "OOP and dynamic dispatch are inefficient" - Again, I urge you to read the papers at the Self site listed above. The way the Self, and now HotSpot, compilers work eliminates this bottle neck. I'll admit in early implementations of Smalltalk and Objective-C dynamic dispatch did add a small amount of overhead. Take a look at the documents on the HotSpot here and then tell me that dynamic dispatch is a problem.


      In summary, you are more than welcome to use assembly all you want. Code on my brother! But, please before you slam some other method try doing the smallest amount of research first. Maybe your snap intuition is wrong. You never know.

      As far as Transmeta goes, it has a lot of the HotSpot/Self style technology and I personally think that technology is the future. I can't wait to get my hands on a Crusoe powered product.

      -BurdMan
  6. about Java by jilles · · Score: 4

    Hi,

    What transmeta does is very similar to SUN's hotspot compiler for java. Like Transmeta's code morphing, hotspot compiles and optimzes bytecode on the fly using profiling data to optimize the parts that are executed most.

    Your question as to why Java is still slower than C++ is answered in a series of javaworld articles (www.javaworld.com): http://www.javaworld.com/javaworld/jw-11-1999/jw-1 1-performance.html and http://www.javaworld.com/javaworld/jw-12-1999/jw-1 2-performance.html and http://www.javaworld.com/javaworld/jw-02-2000/jw-0 2-performance.html

    Basically the problem is that stuff such as memory allocation, garbage collection, synchronization and runtime type checking have a performance price (you get a safer programming environment in return). The articles discusses these performance issues and give some usefull advice to avoid these bottlenecks.

    --

    Jilles
  7. Re:Java Byte Code by arivanov · · Score: 5

    No it will not.

    Think of Java licencing and control issues. Discussed widely on slashdot. Actually I shall retract this statement if one of the MAJOR league players will accept Crusoe as a primary CPU for at least one machine class.

    Otherwise it is obvious that it could be thy Java machine, but it is least likely to be.

    I think Cruose will actually make a reality something else which is much cooler than Java. It will make real thy developer's dream - the coat of many colors: The affordable machine of many architectures.

    On the basis of Crusoe even now you can build a machine that can happily emulate:

    Mac, Sun (lower end), IBM PPC (lower end), SGI (lower end), Alpha and curse it x86.

    1. All these are PCI based.

    2. The differences in chipsets can be ignored under the "one OS to rule them all, one os to find them, one os to bring them all and in (oh well cutting out darkness) bind them ". It can have drivers for the chipset and peripherals in question.

    3. You may actually do the reverse thing and develop drivers for the peripherals and the chipset for all platforms in question (not a hell of an effort, actually quite achievable). If peripherals are something like adaptec, tulip and a PCI VGA the drivers are basically there already. So you have only the chipset left. Actually you can intercept these and emulate chipset behaviour if you so desire.

    Overall - the developer's dream can become a reality for just a few bucks - about the price of a PC (excluding licencing for OSes and sowftare of course ;-)

    --
    Baker's Law: Misery no longer loves company. Nowadays it insists on it
    http://www.sigsegv.cx/
  8. Isn't this REALLY the smarter compiler arguement? by stevew · · Score: 5

    Years ago I worked at a company that was going to bring VLIW to the commercial world. One of the things they had to invent was MUCH smarter compiler technology. That was in 1985, so the basic ideas involved here are at least that old.

    Anyway, when the compiler DOES know the hardware, especially VLIW machines, they can indeed do a job as GOOD as a human. (Look - they won't usually do better jobs, when the guys who designed the machine write code, they're going to do what the compiler is trying to emulate..) But from what I remember, static profiling was considered "good enough" back then if the VLIW machine is built right.

    Most codes spend most of their time in inner most loops. If those loops can be rolled up correctly for VLIW you can be executing different interations of the loop within the same instruction slot. Being able to do this is usually the largest performance payoff. If you have a GOOD compiler in the first place, that KNOWS the hardware, then JIT and morphing aren't needed, or help.

    The place that I see the morphing being a step forward is that VLIW machines were considered architectural dead-ends until just now. If I use one of these smart compilers for a machine with say 7 functional units, the code that emitted won't work on a machine with 8 functional units, or at least not be optimized any longer. These machines didn't scale well until now! Code-morphing technology completely removes this limitation!

    --
    Have you compiled your kernel today??
  9. java JIT itself is not that slow. by monac · · Score: 4

    As shown at http://www.idiom.com/~zilla/Computer/javaCbenchmar k.html, even the linux jdk, which is relatively slow than other platform's, runs as fast as natively compiled C program when runs simple operations.

    In my understanding, java's poor performance is due to following factors:

    1. too generized api design resulted in very deep call stack. for example, printing current date and time using java.text.DateFormat uses much more calls than traditional C (a system call and a printf is enough for C). It gets more terrible in swing. use of interfaces causes the same problem i think.

    2. JNI is too slow. (can't understand why but it is known to be slow and java can't help but using many JNIs)

    3. as WORA is important in java, it is hard for java to use any platform specific resources such as Graphic card's accelerations. that is one of the reasons swing's design is getting ugly as it tries to boost its performance.

    4. too many small class files result in high I/O usage when JVM loads classes.

    5. no MACRO! for example, most of java API is synchronized, which is expensive, reguardless thread is used or not. the same for security check codes and platform specific implementation codes,
    etc. these problems can be solved by MACRO but...
    someone in sun doesn't like macro and i agree with him when think of java's purity.

    some of these problems are known to be solved in next jdk version(hotspot client version). slow synchronizations and class loading speed, etc.
    but mostly, they are java's design issue and never be solved, as java is mostly running in virtual machine and originally it is designed to be used for java chips, only when JVM is implemented in processor level, the speed problem will be solved.

    I'm looking forward to seeing java running in MAJC or Transmeta's chips.

    please correct me if i'm wrong in some and forgive me for my poor english.
    Cheers.

    --
    -- Y. J. Chun