Transmeta Code Morphing != Just In Time
The following was written by Slashdot Reader Andy Armstrong
Transmeta Code Morphing != Code Morphing The recent Transmeta announcment crystalised something I've been thinking about for a while. It's my belief that it should be possible to make a compiler generate much better code in the general case than someone writing hand coded assembler. Furthermore it should be possible for a JIT (Just In Time) compiler to produce better code than a conventional one time compiler.Why should a compiler be better than an experienced assembler programmer? Well,
- the compiler can know the target processor intimately (cycle times, impact of instruction ordering, etc.)
- the compiler gets to re-write the entire program each time it sees it.
The second point is critical: any programmer writing an assy. language program of any significant size will write the code to be maintainable. Of course, it makes sense to do things like defining standard entry and exit sequences to routines, keep a few registers spare (in those architectures that have more than a few) and other practices that lead to maintainable code, but the compiler doesn't have to maintain the code it writes. It gets to write the whole thing from scratch every time. This means that functions can be inlined (and repeated code sequences turned into functions). Loops can be unrolled then rolled back up at the next compile when the programmer has decided that space is more important than speed.
If you know you're not going to have to maintain a bit of code you can do some pretty scary things to get it to perform better. A compiler can potentially do that all the time.
Why should JIT be better than one time? This should be clear to anyone who's followed the Transmeta story. A key element of their code morphing technology is that they insert instrumentation into the code they generate, effectively profiling as it runs so that the compiler can decide which bits to optimize the next time it sees the code. It's well known that the coverage graph for a typical programme looks extremely spiky - the most frequently executed code may get hit thousands or millions of times more than it's neighbours. It follows that it really isn't worth optimizing the stuff that only accounts for a millionth of the code's execution time.
This brings me to my point: is there really any reason why a Java / JIT combination shouldn't result in code that executes as quickly as the equivalents in other languages?
You might suggest that garbage collection must slow things down, but I'm convinced that, done right, garbage collection can actually improve performance. malloc()/free() require the memory manager to think about the heap for every call, but new() can be implemented as a stack (m = memlimit; memlimit += size; return m) and the garbage collector gets to do all its memory management in one chunk - it can take an overview of the memory landscape rather than trying to keep things fairly optimal each time memory is released as free() must.
You could argue that the OO nature of Java means that it must dynamically allocate objects that would be static in a program written in C or assembler. That's true, but (assuming calls to new() are cheap, which I believe they can be) this really isn't a problem. Current processors don't take a huge performance hit when working with objects who's address is not known at compile time; in fact in many architectures it makes no difference at all.
So while it might seem profoundly counter-intuitive can anyone actually give me a good reason why Java + JIT should be slower than Good Programmertm + Assembler?
Not that a good compiler cannot outperform even a good assembly hacker on, for instance, memmory allocation. And not that abstraction in itself must lead to inneficiency, but it makes it inherently easy to allocate/free memory for the programmer, who will then of course allocate/free more memmory.
--The knowledge that you are an idiot, is what distinguishes you from one.
--The knowledge that you are an idiot, is what distinguishes you from one.
Other engineers are required to use the minimum resources to achieve the maximum requirements of the project design, shouldn't computer 'engineers' be able to acheive the same.
Apples and oranges. The cost of creation materials for say a building, demands that the least amount of them be used. Not so with bits and bytes. These days, they're almost free. I applaud the true craftsmen of the programming world for their principle and talents, but with all due respect, to me it's penny wise, pound foolish to waste time reinventing the wheel each time you have to write code. Did you dig your home's foundation with a trowel?
Sorry... this post is probably meant to be deeper inside this thread.
:)
Are we not all just dancing around the idea of a human compiler with a vast database of algorithms, intuition, look-ahead properties, and an abstracted idea of what it is exactly that the program is doing? IE. There is no compiler, to my knowledge, which writes highly effective parallel code. Why is that? Simply because no compiler can really understand the overall problem.
How long do you think it will be before the all-powerful AI compiler is written, with all of these abilities, and more, on a box that runs at a few GHz?
Or maybe i'm just being a little too science fictiony here
Cheers,
D
os.system("perl -e 'print \"My first Python Script.\"'")
Quite a few. Amazingly these are usually those that have portability problems. PGP 5 is a good example.
Baker's Law: Misery no longer loves company. Nowadays it insists on it
http://www.sigsegv.cx/
What if you compiled the original code with performance analysis included, and then fed that performance analysis back into a traditional compiler to execute again. In this way, you get all of the benefits of the above JIT system without the running overhead of the JIT. Sure, it takes a few thousand runs to get the program optimized, and you can only optimize for the average case, but most programs that need to be heavily optimized will have a roughly similar executaion pattern every time.
I may be wrong, but you can do this with the SUNWspro workshop compilers. This technology would be nice to have in gcc as well.
Mike
--
Mike Mangino Consultant, Analysts International
Mike Mangino
mmangino@acm.org
I think you may be missing (or misunderstanding) the point a bit. Sure, you can furbish an example, or a language, in which a human can optimize the program /statically/ better than a compiler/computer can optimize a program /statically/. But the author makes the point that the most frequently executed parts of the program are very sparse, and that the cost of profiling the program dynamically can be recouped by doing intelligent dynamic optimizations. As hardware gets faster and more complex I think this will be more true. There will simply be no way for humans to statically optimize for such a non-deterministic runtime environment. For example, how many programmers explicitly specify parallelism in their programs? How can they be sure that the CPU at runtime doesn't know better than them how to optimize the program? I say let the CPU decide. There is going to come a point in which it is just not feasible for humans to even attempt to optimize for such a complex runtime environment, and that time is coming fast. The faster chips get, the smaller the more amortized the overhead of finding better optimizations at runtime becomes.
Jazilla.org - the Java Mozilla
It's 10 PM. Do you know if you're un-American?
You're right, of course. No matter what some people have been trying to say for 40 years or more, you can almost always write better code than a compiler. This is especially true when you're dealing with something bigger than a a single function. The usual way of showing that a compiler is better than a human is by using one smallish C function as an example. That's a pointless example, because the benefit comes from analyzing that function in context and not on its own.
The point of diminishing returns comes into play quickly, however. For example, take the renderer for most any fast 3D game. If you went into the routine that passes triangles to the graphics card, a frequent hot spot, and added a pointless call to a supposedly expensive function, like sqrt, you're not going to notice an effect on frame rate. Doing a higher level optimization, like removing a single polygon from a model, is going to be more of a benefit than optimizing the polygon code, but it's still not going to be noticible. Sometimes you can get big benefits by using algorithms that look more complicated, ones you wouldn't want to approach in assembly, even though they use more code.
With complex programs, it's even conceivable for an interpreted language to out run a compiled one, because everything comes down to architecture and an understanding of the problem. This is hasn't been the case with Java, because Java is a fairly low level language (the more abstract a language is, the less win from compliation) and because Java has become entrenched in the "learn programming in 14 days" market of web designers turned programmers.
So, yes, an assembly programmer can outrun most any compiler. But does it matter? Almost never.
Would you also send your pointers across a network and expect them to work at the other end?
The idiots who wrote the code that caused me the pain would have...
This involves one of the internal debates I've had about C for years. C allows you to shoot yourself in the foot. That is all well and good as I (I hope) know what I'm doing. The trouble is that, with turnover in this industry what it is, a new foot might have arrived by the time the bullet hits. (Unfortunately, I've often been the new foot.)
The cake is a pie
1.Please give an example on the type of code you'd want to optimize, any reasonably simple (small fragment of )ecode that doesn't depend on advanced integer arithmetic should compile to something as fast or faster than you .cpp, GCC input, GCC assembler output and your "better" ASM output.
would possibly write yourself... Please include example.c or
I was talking about a simple brute force compare and if it dosn't work then move on type thing. I think that this type of thing will always work (just take a little while). Considering that most modern computers are extremely powerful then I have to say that such brute force is still possible. What I don't get and still don't get is how people get all the time to actually research all this stuff. Does anyone have a functional life anymore? I am took courses for an associate of science and they didn't get anywhere as complex as all the little theory arguments as I have seen here.
1.Invartiant detection, using compilers to find invariants in the code, and then deduce lemma's which may remove bounds-checks, type-lookups, casting or flush of registers to memory.
Could someone please explain what a lemma is I'm sorry I guess I didn't take Phd level math courses.
Slashdot social engineering at it's finest
"In the general case", perhaps a compiler will produce better code (and certainlty not, as yet, for specialised cases - an inner loop coded direct to PPC macro assembler (PASM on Amiga PPC) is the fastest code I've ever written, but humans still write compilers. Now, if they were to force-evolve code using a genetic algorithm, you might get code better than any a human code write, but it'll probably depend on some weird side effect to some obscure instruction, or the rtesosnant frequency of your ram bus, or something...
Cool article, all the same though. I'd love to see more of this sort of thing on Slashdot, like back in the old days.
The Curso technology will probably make the Java chip a reality and improve upon the idea all in one shot. Program the morphing code to run Java Byte Code and you have a Java chip which optmises your code. A very clean language that you can easily maintain and not sacrifice execution speed. Every Software development Managers dream!
If at first you don't succeed, skydiving is not for you.
On large systems, master C++ programmers, master programmers have had these problems in real world applications. It's not so much not knowing when to delete objects so much as making sure you delete enough pieces when you delete them. If you delete an object that is referenced by another then you're going to have a crash. If you delete an object but not the references it contains then you have to have another reference to each of those objects and you have to know what you're down to the last reference so you can delete the object.
I agree about the sloppiness of thought associated with java and GC but that laziness and the willingness to be sloppy is there regardless of whether or not the programmers are using GC. TakeGC away and they may be a little bit more careful but they've already shown that they are willing to be that lazy and sloppy.
This is my signature. There are many signatures like it but this one is mine..
3. as WORA is important in java, it is hard for java to use any platform specific resources such as Graphic card's accelerations. that is
one of the reasons swing's design is getting ugly as it tries to boost its performance.
Surely it's the JVM's job to abstract hardware acceleration in the same way as, say, DirectX or OpenGL? For instance, my windows JVM might know about directX and so implement the primitive drawing functions within the swing classfiles using the accelerated calls. I don't know if that's what happens, but it strikes me as the logical plan.
---- Den ene knappen er powerknapp, den andre er Bender voice knapp "Bite My Shiny Metal Ass"
Nevertheless I'd like to point out that JIT is based a somewhat dangerous technique: a program that alters (its own) code. I believe this technique was used in the eighties to scare off hackers by making code incomprehensible and hard to disassemble until the program was actually running. Also (even on a good old 6502 processor) it's possible to make some speed improvents peeking and poking into the code you're actually executing.
:) if your JIT VM forces periodic cache writes to RAM (Not every time but enough times to ensure some sort of coherence if the cache and RAM are out of sync) you lose a bit of performance but gain a stabler setup.
:)
Not at all true... self modifying code is still around not only on JIT compilers but in interpreted languages and even device drivers (Or how do you think some device drivers adjust in real time to hardware changes?). And it's not a dangerous technique in itself, but (As usual) when executed improperly it can be a wild beast (As most people discovered when thy installed their first Cyrix and AMD 486 processors and discovered most x86 code wouldn't run properly because of cache integrity issues)
When compilers for Microcomputers got faster and most processor architectures (known as Harvard architecture, I believe) explicitly require a division of RW and RO memory segments, self-modifying code was abandoned...
First of all, Harvard machines require separate instruction and data pipelines and memory spaces, and 99% of the CPU's on the market (General, Embedded, etc.) are Von Neumann machines which use a single space to store instruction and data. The thing is, newer CPU's (Even if they're Von Neumann desings)have separate caches for data and instructions and most self modifying code violates the cache integrity (See above) because the code is modified in the cache but never stored back into memory (Unless you use a write-through cache design which is impractical from the performace point of view).
BUT (And that's a big BUT
Disassemblers for JIT won't be as complex as a JIT assembler just because when you disassemble a piece of code you treat is under the black box principle (What goes in and what goes out) in order to derive the fundamental principles and algorithms, which can be implemented in 1E999 different ways, so if you disassembled the less optimal morph, bad luck... disassemble another test run and see if you get a better implementation of the algorithm. Or you could use the aforementioned principle and implement your own algorithm.
JIT code/compilers, BTW are also being tested to produce self modifying chips (ASIC's and FPGA's) under VHDL/Verilog using also NNets or Gen. Algorithms to obtaing a first implementation and then using the JIT to optimize it...
Really interesting stuff (The Transmeta Crusoe) and my bet is that soon other companies will follow altough not for the same reasons as Transmeta
ZoeSch
I hate to agree with davecrazy but...
I do agree that machines should spit better code than humans do, but we are just not there yet. When will we get there? Maybe tomorrow, maybe next month, maybe next year, maybe next century. Nobody knows. Suppose we do get there tomorrow, I believe it will not be Java + JIT that made it, for the following reasons:
If that breakthrough were to come tomorrow, I believe it is most likely to come from the functional programming people. I believe that is the way to go for the future.
I don't see where your argument about placing memory for new() on the stack holds up. The point of dynamic object creation is to allow the program to create an aribitrary number of objects and then free() them when done with them.
Trivial example that really makes the stack argument fall apart:
Let's say I have 500 objects new()ed. I'm assuming that the first object created would be on the bottom of the stack, and the last object created would be on the top. I now delete the first 499 items. I now have 499 * object size bytes of memory on my stack that I can't use becuase I still have the 500th item on the stack.
I can't recover the memory because it's on the stack.
Am I missing your argument? Help me out here.
--
then it comes to be that the soothing light at the end of your tunnel is just a freight train coming your way
Now, I have yet to see anyone provide concrete examples of a garbage collector performing better for the general case, than standard memory management. However, let's assume that the garbage collector provides a small performace gain overall.
Unfortunately, as others have pointed out, compilers don't understand the objective of the program they are dealing with the way assembly coders do. The same is true for a garbage collector.
I can speak from experience as a game programmer here. There are certain parts of a game which are time and performance critical. How can I trust a garbage collector not to slow things down in that section? Well, I can't. There has to be some form of programmer interaction telling it not to cleanup in section X of the code.
This isn't very practical for games though. Typically we want full control over the memory. We know the upper limit of what we need, and we don't care much about other applications. So we can allocate a large chunk at start time and then work within that, to avoid the new/delete/malloc/free overhead.
If removing "antiquated" things like pointers is the best solution, then we'd all be writing games in Java. OK well that's a stretch but you see my point...
Best regards,
SEAL
"1...not a JIT system but a dynamically optimizing compiler" a slow start is a big problem (which is why they are emphasizing long-term server programs), but that's not the main issue. There's no reason that profile-based optimization can't be applied to C programs. That there's an experimental JIT (yes, it's still a JIT) that's ahead of the C compilers in common use is no reason to claim that the JIT strategy is better.
"2...as long as the results of the operation don't change" Which means that if the representation is different in a meaningful way they have to check whether the results of the operation change, reducing the efficiency, which was my point.
"3...on all but the most simple applications" But I rarely do anything but the most simple applications. I allocate an array, then I free an array. I do this for hashes, trees, stacks, queues, and pretty much any data structure I use. Only when I am being extremely lazy will I allocate and free objects recursively. Modern CPUs with slow main memory and fast cache like it when you keep all the data together, so I generally do so. I also try to access it linearly, which they also like. The point holds.
Don't try to compare the most primitive and unoptimized methods of custom memory management to the most sophisticated garbage collection algorithms. Remember, too, that a person doing custom memory management can write the same sophisticated garbage collector that Java would use.
And don't confuse "primitive" with "fundamental". Pointers are fundamental. Having to analyse reverse polish code function to prevent dangerous operations, rather than making it impossible to construct operations you don't want, is primitive.
"4...then tell me that dynamic dispatch is a problem." Okay, dynamic dispatch is a problem. Just because it can inline methods here and there doesn't mean you won't take an overall performance hit. Besides, that was only one example of the ways the lousy Java handholding slows things down.
The Sun marketing papers you linked me to are just more of the same old Java hype. They'll go on to "prove" that their new super JIT is better than anything you could code by hand by comparing optimized idiomatic Java with the same code translated poorly into unidiomatic C++.
Idiomatic C written by a competent optimizer will still blow the Java version out of the water.
JIT by any other name smells just as bad.
This is very intersting stuff, and I do care about these questions, but virtually none of this enters into my selection of a language for a task.
When you are writing a routine that reads single keystrokes and puts them in a buffer, you really don't care much how optomized the loop is unless it takes more time than elapses between the user's keystrokes.
In other words, I'm very glad people are working on smarter compilers and Java optomizations, but I select Java for one reason alone:
It is easier to write more code more quickly with fewer bugs in Java than in C or C++.
Most of us in MIS find ourselves in situations where we can throw more hardware at things much more cheaply than we can pay people to write and debug code.
Certainly there are situations where we are prcoessor-bound, or where data comes from a high volume network source and then we must consider performance, but since in many cases even a slow language is "fast enough" for the maximum throughput, the ease and reliability of development come to the fore.
So, if I want a fairly portable device driver, I write in C. If I want a stable application, I choose Java. For me it is this feature of the language, not its speed (or lack), not its "write once, run anywhere," but rather the ease with which stable code may be written that draws me to Java. After that, I am concerned more with the footprint of the VM and its impact on non-Java system components than I am with the optomization of Java byte codes.
I say this not to disparage this discussion, which I really do find enormously interesting, but I must say that for many of us, this is the computing equivalent of counting angels on pinheads.
I'm surprised nobody else has commented on the obvious here. Sure, maybe dynamically recompiling is awesome. But doing it twice at once isn't. Just imagine your Crusoe chip screaming in agony as the code-morphing tries to profile and recompile code that is being dynamically recompiled by the JIT compiler. And how do you fix that?
By abstracting your two smarty-pants recompilers into a "hardware" and an "interpreter" layer, you've removed any way for them to talk to each other or even really be aware of the other's existence. The problem's not theoretically impossible, but the practical advantages of keeping your abstractions clean make me worry about the hopes for Crusoe web-pads (which would presumably run a lot of Java).
Preferential Voting: easy as 1-2-3
Actually, computers tend to be even better at speed chess than at regular chess. The reason is that computers analyze games by expanding the game tree as far as possible. Because the chess game tree grows exponentially, this leads to diminishing returns as the amount of time increases; if you watch the "ply searched" indicator on a chess program it'll generally jump to somewhere between 3-5 moves almost immediately and then appear to stall. However, even this shallow game tree is enough to catch an unsuspecting human in a tactical trap.
The reason humans can still beat computers is that they can apply long-range strategic planning -- "what are the best ideas to pursue in this position?" -- and rules of thumb or experience about what sorts of things are likely to happen in a certain type of position. Working from this, they then decide on a small set of moves to evaluate, usually in a narrow tree (because humans can generally discard moves which are irrelevant out of hand; I believe some better computer algorithms try to do this as well but I'm not sure) This approach doesn't really start to pay off until the human has had some time to think, but once it does it can allow the human to catch up to the computer, which is still examining every possible outcome of shuffling the rooks back and forth.
Disclaimer: all of the above is empirical guessing and may not bear any resemblence to reality.
Daniel
Hurry up and jump on the individualist bandwagon!
This is a personal anecdote where I would say that the compiler generated better code than I would have.
I worked on an embedded network analyzer product using the intel gcc compiler for the i960. It contained about 16000 lines of C code (not trivial, but not large) that included a bare bones operating environment, network drivers, and the actual analysis code. Of these ~16000 lines, about 500 of them were in the "common critical path" and would execute on every incomming packet.
The intel gcc compiler for the i960 has this is really cool feature called "profile based optimization". You build an instrumented version of the executable, run a set of test data through it, get the profile data, and re-compile with the profile data. This optimization feature had the added advantage of closely integrating both the compiler and the linker.
The main benefits of this optimization included:
Setting the branch prediction bits for the most common cases.
Inlining functions with high hit counts : both within a module (traditional inlining), and across many modules (an advantage of linker integration)
Placing variables with high hit counts in faster memory (the i960 had 1k of on board, single-clock DRAM, and our card had both 2-clock SRAM and 4-clock DRAM)
Inlining all functions that get called only once in the executable.
I saw a 150% performance improvement over the non-profile based, but otherwise fully optimized builds.
The key 500 lines were interspaced through about six functions in four modules, often in functions where half the function was error/unusual path handling. Although I could have rewritten those 500 lines in asm, it would have been very hard to keep it integrated with the rest of the code. Also when I examined the instruction trace of those 500 lines, I found only two or three places where I would want to change things.
Intel's x86 compiler is supposed to have similar a set of profiling optimization features and another feature that's even more important on the x86 that lays out functions and data in the executable to reduce cache misses and page faults based on the profiling data.
In the "me too" catagory: Optimizing for Alpha processors is a nightmare best left to the compiler. Just try to single-instruction-step through some Alpha code and see how much you move around within the C source.
Peter (first /. post!)
I had a commodore 64 once. Never bothered to learn assembler code for it but even today I'm amazed by what people did on those machines. But it just doesn't scale up to a modern PC.
Perhaps those people from Intel can write code for trivial software but anything that goes beyond that is even beyond the minds of those people. Of course those guys at Intel can program their own CPU's manually (at least I hope so). Hand coding millions lines of assembly code is something else though, let alone optimizing it. The problem doesn't scale linearly you need a computer to optimize that.
Either way a static compiler nor a human being can compete when you start to involve real time profiling information. It gives the dynamic compiler an edge, it can optimize to situations only known at run time.
So, I hope you see my point. I admire people who can handcode a more optimal program than a compiler but they can't compete with dynamic copilers such as transmeta's code morphing or SUN's Hotspot because they simply don't have the kind of information those compilers have.
Jilles
There is certainly a range of people that write code and I've seen both extremes. Poorly written higher level language programs aren't going to be improved by a compiler. There might be suggestions by a good level syntax checker but that isn't the topic here. One of the long term comments I've always made is that you can write Fortran in any language. If you don't understand the language constructs sufficiently, you aren't going to be saved by the compiler. There is no substitute for programming competence.
This post clarifies what I was saying. To summarize: 1. By "modern processor" I mean a processor that allows a compiler to perform good optimizations. The X86 instruction set is not very good for this. I don't believe the 486 even performed any parallel execution. Specifically, I have experience with the Alpha. 2. You are talking about a specific algorithm. As I said, 99% of the time the compiler will beat you. I didn't say assembly was dead or not useful sometimes. Just that most of the time the compiler will generate faster code overall than a human. 3. Facinatingly enough there are cases where your assembly optimization will actually hinder the compiler from doing a better optimization. Specifically, some paralellizing compilers I think.
Check out this link for a plethora of information on compiler optimizations.
--GnrcMan--
Your programmer would have to be really good to beat a modern compiler on optimizing for very complex instruction. Also he probably would need a lot of time even to get very simple programs up and running that way.
And then there's still something a java VM (I'm assuming the hotspot type here) and that is dynamically (i.e. while the program is running) optimizing code using profiling information. This also works in situations where a program dynamically loads a class (possibly a class that did not exist at the point the other classes were developed). Your programmer cannot not take this situation into account when he is coding.
In other words, under normal circumstances the assembler superman would lose (badly).
Now, your next question will probably be why java is so slow then -> see my other reply's on this story.
Jilles
As a side note: I remember hearing FX!32 also keeping some information from the JIT compilation and storing it on disk for the next time the
application is run. Anyone knows whether this is still used and to what success?
Yes, FX!32 would actually profile the application and convert parts to native Alpha code, storing the native code in a database. Incidentally, the profiling happened during execution, while the optimization happened during free time, after the program was done being executed. The more you ran the program, and the more code paths you hit, the faster it would get.
It's a moot point now, though, since NT doesn't exist for the Alpha anymore.
--GnrcMan--
Massivly parallel creations are largely very close to what the programmer wrote, good compiler or no. Compilers for this case are especially weak, and only make vague and half correct assumptions.
.sig: Now legally binding!
I'll simplify:
1)Anything that can be expressed in Java can be expressed in hand-coded assembly; the reverse is not true.
2)Humans are smarter than computers.
3)Humans can always use computers as aids to their work, and so produce output equivalent to what the computer can produce as a baseline worst case.
On a machine with 64 bit ints, as on a machine with 32 bit ints, ints and longs are the same (assuming the compiler was competently designed; it could fit the standard without being). Using long should only ever confer a performance hit on 32-bit machines, for 32-bits and over, long is pretty much guaranteed to be the same as int (AFAIK, the standard doesn't force it, but it doesn't require any more range in the long, so it would be stupid to make it bigger).
If you want to define your own special int ranges, you can always use Ada. That's the kind of language you get when you throw in every nice little feature.
I read that the 700 MHz Crusoe chip actually only attains the performance of a 500MHz Pentium (II or III, I forgot - oh, can Crusoe emulate SIMD? Not like it matters, but I'm wondering).
So there you have it - code morphing, or just in time compiliing or whatever gives you about 70% of the performance precompiled code.
That's not to downplay Transmeta's theoretical accomplishment. Those extra 200MHz go to things like monitoring it's VM, throttling the clock, etc. So in the end you lose 30% of the performance but make it up with 35X less power consumed.
It's a great trade off for laptops, handhelds, etc. But not for workstations, servers, etc...
I still don't like the idea that they're keeping the instruction sets closed. It would seem like if someone out there wanted to port GCC to Crusoes native instructions, that would be good... But they just don't want to be percieved as being at all incompatible with Intel, i guees.
I can also write code in ML that can be proven correct at compiletime, since it's statically typed and lexically scoped (Java can still come up with an exception and die on you).
So my question to you is: How are your languages so magic if none of the software I use on a daily basis uses them, or seems to miss their properties?
Some guys here at CMU wrote a TCP/IP stack in ML that was provably correct. It took them a whole year. Was it worth it? You could write a whole operating system in that time!
...
The stack-based ISA for Java is slow. It doesn't map well onto actual processors. It's a big reason that even JITC'd Java isn't very fast.
There's no reason not to use a language like Java with slightly stronger typing than C++ (although there are good reasons to use C++; there are powerful things C++ can do that Java can't). But it should be compiled, even if "compilation" is something that happens the first time you run a program, and it should be compiled from a more intelligent format than a nonexistent ISA. Why not distribute source code after the frontend has run on it, with all the symbols resolved and the syntax broken up into lower-level constructs? That would be as hard to reverse-engineer as assembly, and significantly moreso than Java classes (which leave way too much info in there).
...
Being able to state with supreme confidence that this type of bug does not exist in my code because it has been rigorously proven is helpful. You really should be more concerned with running correct code than running "fast" code.
Except for the fact that we've gotten everything done so far by writing fast code, and testing it to see that it's reasonably bug free. Proving code with Java is still quite hard, and writing code at all in ML is very hard.
Empirically, companies will trade off getting the software written for having it proven; they've been doing things that way for quite a long time. Java certainly won't change that.
No. Hand coded assembler, coded by a clueful programmer, will always be better than compiler generated code.
You are wrong.
With modern processors, a decent compiler can do better general optimizations than you or any other assembly programmer I've ever heard of. While there is that (less than) 1% of the time when you can hand optimize a small piece of code, that's the exception rather than the rule. The rest of the time you are better off letting the compiler do it's job.
--GnrcMan--
I think I agree... :)
;)
:)
There is indeed an art - if not a Zen - and a science to knowing when to optimize stuff, and I think it boils down to 'use the hottest algorithm around' (especially if you start in C), *followed* by 'hack the ASM by hand'.
Something I finally got round to doing just this past weekend, that I'd been meaning to do for a while, is it compare the speed of execution of loops that simply count between 0 and a big number, but using different comparison methods, and different compiler optimization levels in gcc, with Perl for the hell of it as well.
The full source I used follows:
#include
#include
#include
#define INNER 5000000
#define OUTER 200
void up(void), down(void), xor(void);
int
main (void)
{
int c;
time_t t0, t1, t2, t3;
t0 = time (NULL);
for (c = 0; c Mosix cluster, which was "interesting", to say the least...
As for perl, well that behaved the way I expected. The whole thing was way slower than C, but it did at least do an xor when I asked for it - that was almost invariably faster than counting up with "".
Well, those be the results... compile the source yourselves with all the various gcc options and compare the assembler, folks
~Tim
--
Rushing on down to the circle of the turn
So there you have it - code morphing, or just in time compiliing or whatever gives you about 70% of the performance precompiled code.
Comparing interpreted to compiled code on a MHz-for-MHz basis makes no sense, because MHz is not an intrinsically interesting number. Most people don't care about what their computer's clock speed is. They care about how fast it runs applications (i.e. MIPS), and how much it costs. So don't compare Crusoe to an Intel chip on the basis of a MIPS/MHz ratio. Compare it on the basis of a MIPS/$$$ ratio. How well does a Crusoe perform compared to an Intel chip *of the same price*?
Aaaargh!
CmdrTaco, your preview mode is broken. That worked for me before.
In any case, the source is available here.
~Tim
--
Rushing on down to the circle of the turn
Complete brain fart. That should read "Comparing Crusoe to Intel..."
Its wrong to have the most basic data type wobble over the bit length.
.. casting the return value of malloc(), etc ..)
No. This simply wrong, and almost humorously so. What are you going to do? In the C standard, are you going to say "Implementations shall make int 32 bits wide?" What do you do on a machine (say, an embedded system) with 9-bit bytes? What about a Cray YMP? The goal of the C standard is to have the implementor pick representations for the basic types that make sense for the target machine. Performance is nice, but in the end, they have to make sense. Now, there are some limits in the form of minimal ranges that are placed on the "basic" integral types, but that's about it.
The solution would have been to have the basic data types with fixed size, and additional data types with "some size between 16 and 64 bit giving the optimal performance on the target machine".
You should be more than pleased with the next revision of the C language standard, C99. Among other things, it standardizes such types as int16_t and int32_t, which provide exactly the functionality that you seem to think is important to portability.
This would have saved many people a lot of grief...
What would have saved many people a lot of grief is if they would have learned C correctly in the first place. They would then know what it guarantees and what it does not, and how they can use the language facilities (i.e., typedef to satisfy their non-portable desires.) The amount of junk I see in code written even today is staggering! (assumption that short is 16 bits, long is 32 bits, etc
We're going down, in a spiral to the ground
Hmm... That bites.
If you ask me, it is a design flaw. Longs are only required to have the range of a 32-bit number, so it is wasteful to include more bits. A lot of portable code will have unnecessarily bloated memory use and any reliance on the extra 32 bits throws portability out the window. Given that, a non-portable extension for 64-bit ints (since full use of 64-bits is inherently non-portable; with a lot of sizeof checking some use is possible, but not convenient) to be used only for system specific optimizations would probably be a better choice.
Given that most code run on Alphas probably isn't written with a 64-bit long in mind, it almost certainly hurts more than it helps (especially with memory access being the main bottleneck of modern systems).
The next C standard should fix this problem by providing a 64-bit int (if you ask me, they should just allow you to stack longs, doubling the bits of the minimal range each time; writing arbitrary bit size emulation into the compiler is trivial, though the trivial implementation is inefficient, and it would be a permanent fix that could have surprising benefits for certain supercomputer applications), as well as the addition of fixed bit size types (which is, I suppose, good for networking code, though I fear people will abuse it; IMHO it's better the way it is where you should chop everything up manually into whatever format the protocol takes).
Somebody moderate this guy UP. He's hit the nail on the head with a huge sledgehammer, this is _exactly_ why code morphing is different, and far more spectacular, than the JIT concept.
:)
Although I still want access to the morph layer to play
You can't win a fight.
Doesn't Sun have a patent on hardware that can run java bytecode natively?
I wonder where transmeta-ish devices fall. I'd call it software, but would sun's lawyers?
Forward, retransmit, or republish anything I say here. Just don't misquote me.
As long as the programmer has more knowledge than the compiler, he will always find tricks to save an instruction here or there and outperform the compilers this way. You can find a great example of such programming tricks in the PCGPEs article about texture mapping inner loops here.
a JIT compiler will compile the exact same code each and every time you go to run it. While a convential compiler will compile it one time. Yes the JIT compiler might be able to tune the performance better for each run, but I think it takes a performance hit at the start by having to do the compilation. While the JIT compiler is compiling code, the exact same code compiled with the convential compiler is already executing.
The faster the hardward becomes though, the differences between the two might not be as noticible. Although, the more powerfull the hardware becomes, the more complex the code becomes.
Nothing exists exept atoms and empty space; everything else is opinion.
blah blah blah....
How many of you actually routinely (e.g. 6 times a year) examine the output of your compiler, and compare it to what you would have written?
I've done it often (a promise I made to myself as a foolish youth) and learned quite a bit. It's actually changed my higher-level coding technique on a few occassions (and I'm not just talking about optimizing for a better fit with a specific compiler)
Not only have I never met another person who does it, but I've almost never met a programmer who didn't look at me oddly for even mentioning the practice. And yet, here are so many people walking as if they did it routinely. I guess that's just the Silicon Valley edge.
I'm not being sarcastic or snide. I really want to know. Maybe I should give up my current profession (ditching a decade of grad and post-doc training) and move to the West Coast, where Programmers are Real Programmers.
Or maybe I should finish reading my NEJM, and we should all just keep in mind what we really know, and what we only think we know. Again, no disrespect, Lord knows I've had to re-think the extent to which I "knew" things before.
Please post your honest answer, and encourage your friends to do so as well. I'm afraid I'll have to take a lack of responses as a confirmation of my worst fears -- that everyone is only guessing.
If you can go to bed, knowing you did a valuable thing today, you're very lucky. If you can't... it's not bedtime
Well, a good optimization typically makes one type of instruction or sequence of instructions a fair bit faster (say 30-50%).
While this might give a 2% improvement across average code, it will make a lot of difference if this particulr optimization helps out your inner loop. There are times when you really need this.
The way I see it, if compilers can optimize effectively, then there becomes less need to hand-code an inner loop in assembly. Decent optimising compilers may not give you a great deal of extra speed but they sure as hell give the hardcore coder an easier life.
BTW, The rate of hardware improvement is a good argument *unless* you are trying to stay at the cutting edge. In these cases, optimizations are certainly not trivial.
Talking about dynamic recompilation and program profiling, and specifically about Hotspot, I've always wondered why I have to put up with the fact that between program runs, the compiler "forgets" everything it learned during previous runs? Why can't the compiler generate a log for each program, and review the log when the program is run again, so the compiler can pick up learning where it left off?
First, make it work, then make it right, then make it fast, then, make it bloated!
[slashdot bug reformatted the last post] As the microkernel unix people used to say before they finally died out, "We are within X% of the performance of the original". - Alan Cox The JIT guys can write all the papers in the world about how they should (yes, really should) be able to get native code speed. No, that extra level of indirection is really a performance benefit, not a hindrance! I'll believe it when I run Java bytecode and it's not dog slow.
The evaluation of an action as 'practical' . . . depends on what it is that one wishes to practice.
First off, there are profound performance differences that can be acheived simply by optimizing for the Pentium Pro and not the Pentium, see The Twofish Implementations. This is mainly just instruction ordering issues- possible to do in a JIT. There's more to optimization anymore than simple cycle counting.
Second, memory management is a different beast under object oriented programming than under procedural programming. The holy grail of object oriented programming (as it were) is black box reuse- allowing a class maintainer to modify how a class works without those using the class needing to change their code. Not having garbage collection in an OO language either disallows true black-box reuse (which is a maintainability issue), forces the object to implement it's _own_ garbage collection (such as the C++ STL template basic_string does- and this is generally the worst sort of garbage collection, such as basic_string's reference counting), or forces the program to leak memory like a seive. If those are the options, then garbage collection is not too high a price to pay.
Especially when you consider that GC isn't that high of a cost. _Every_ memory management scheme that isn't static in nature that I've seen is O(n). Either on the allocation or on the free. Sooner or later you have to search for a hole. GC doesn't remove this penalty, but instead it allows you to _defer_ it until a better time. Since most programs spend most of their time waiting for input, there are lots of places where garbage collection can be free because the program isn't doing anything else.
Last, but not least, I'd like to point you at the The GC FAQ, a good resource for all things GC related. Highly recommended reading- after it, it was "intuitively obvious" that the world was flat, that the sun went around the earth, and that time didn't slow down as you went faster.
When you run a Java program, the JIT has to do its thing before the program starts. That means the application takes longer to launch. For a lot of GUI apps, this is exactly where the greatest bottleneck is.
So the kind of optimization suggested probably wouldn't improve matters for the user, at least with the small, visual applications Java was supposed to be good at. However, it suggests the people using it for numerical work may not have been so mad after all.
If it works.
Of course, you could cache the compiled code, so that it launches better second time. Pretty soon, you'll end up re-inventing compile/run cycles.
Seems profoundly intuitive to me.
First of all, a JIT is rushed. You can design your optimizing one-time compiler to look at the code for better ways to do it all day, if you like, and while the developers will moan, they will still use it if gets them better results.
Secondly, Java has a very specific standard (that is typically fudged... but that's beside the point). It doesn't give any leeway for a program to act a little different in boundary conditions, like C does. In C, an int is whatever size int is most easily handled by the target system; if you need a certain exact size of int, you can code that in with some extra effort. In Java, an int is a Java int and Java doesn't care in the least what the most efficient native int format is.
Thirdly... automatic garbage collection is less efficient than hand coded allocation and deallocation, and dynamic allocation is less efficient than static allocation. There are odd cases where this is false, but they generally hold true, and where they do not, the C version can always use the more efficient method anyway. (and extremely fast calls to "new" can be easily achieved... at roughly 50% memory wastage)
Fourthly: Java locks you into a OOP model which is inherently inefficient (at least as done in Java). All function calls must be dynamically redirected etc.
I could go on, but it feels like trying to describe that water is wet and stones sink in it to someone who thinks it's intuitive that water should sink into stones.
However, I will not dispute that you can definitely get better results by recompiling for each chip that something will run on. Some JITs may use this to push out ahead of one-time compilers.
BTW, an experienced assembly coder will always beat or at least equal the optimizing compiler, because if nothing else works he can always look at what the compiler produces and see if he can improve on that. Besides, optimizing compilers are good, but not that good, someone has to write them, and when was the last time that you wrote a program that can solve complex creative problems better than you can?
The problem that one tries to solve with JIT based language (say Java) is not the same problem that one tries to solve with a systems programming language (say, C) is not the same problem that one tries to solve with an AI language (say LISP).
;-). Ease of programming and maintenance is much more valuable -- that observation led to the decision to use C to implement UNIX in the first place.
;-)
I would not try to write device drivers in anything but C -- assembly is too unportable. I would not try to write an AI program in C unless I had an essentially unlimited budget -- LISP is much better suited to that task. If I want to interact with a WWW user -- Java is the best choice thus far. As to effeciency of the resulting program, the speed of assembly code is irrelevant in 99.99% of code, since the program spends most of its time in that 0.01% (OK -- I'm exaggerating, but not that much
"So what" you ask? While we talk about JIT and code morphing vs. compilation vs. hand assembly, we need to compare similar things. Language characteristics are nearly unrelated to performance (*implementation* is a whole other story...) In any HLL there must be some translation from strings of bytes to executable things, whether byte codes (emacs lisp), VM codes (Java), or machine code (any of the above under an alternate translator).
JIT tries to solve the problem of bridging the gap between one (abstract) instruction set to another. Code morphing does the same thing. Compilation does the translation from HLL to machine code in one block. The time available to a compiler to explore different register allocation strategies and branch orderings is much larger, but less information about the typical case is available than for the JIT. However, branch prediction is the only place where repeated JIT/morphing has an advantage.
BUT: I repeat -- they solve different problems. Compilers create programs to execute on a chip. JIT/morphing makes a program written for one chip executable on another.
The use/non-use of garbage collection does not tell us a-priori that a program will be slower than the same program without GC -- that depends on the allocation behaviour. There is considerable literature (both theory and practice) on this topic, for languages such as C (gasp!), C++, and LISP. Probably Java too, by now.
Compilers may miss optimization opportunities occasionally but the assembly code they generate is usually pretty good, and it is free of human-errors.
Finally -- Alan Cox mentioned the microkernel performance mistake. I think that it applies here, but not in the way you might expect. It is really easy to miss the real cause of performance problems. In my experience, choice of algorithm and data structure (even in OO code) have a much greater impact on performance than any other single factor (or even several other factors together). You should use language features to help with your task -- but don't be sloppy with them. That always leads to bad, buggy code, and poor performance.
Hmmmm -- still on topic? You decide
In theory, a human can write better assembly than a compiler - in a vacuum. But code doesn't exist in a vacuum, and it becomes less of a vacuum every day. Consider these few issues:
* Cache misses: page faults that cause L1/L2 cache misses (going straight to memory, or worse, to disk) cost thousands, maybe millions of instruction cycles. Do you write all your assembly language to neatly break your data structures across paging lines? No? Didn't think so. btw, this level of optimization can be performed in C in many cases, by good C programmers.
* Pipelining: Do you hand-code your assembly language to give the best hints to the branch decision-making hardware so it can optimize pipelining? Really? You're better than me. Or maybe your code is so much better than compiler-generated crap that the branch lookahead hardware support is useless.
* APIs: I suppose you don't mind reverse-engineering every library call you might ever need, so you can avoid those wasteful, inefficient C headers.
* Et Cetera: The days where humans can outthink compilers in the general case died with the PowerPC and Sparc I. Those of you who have written 1,000,000+ line assembly language programs are free to tell me i'm full of shit. Those of you who have worked on million-plus line code in high-level languages are free to give me an AMEN!!
---
Hand me that airplane glue and I'll tell you another story.
Hmm, I think the author missed the REAL advantage of Code Morphing (or Binary Retranslation).
;-)
JIT Compilers (or even FX32!) are pieces of software running ABOVE the OS layer. They are executed for a specific application which runs on top of it (that is the application on top of the JIT environment).
Yes there are similarities between Code Morphing and JIT compilation, as each does some profiling to get better performance in frequently executed code areas. It is further true, that on more predictable and orthogonal platforms like the Alpha, all kinds of optimisation depend on knowing the underlying platform.
BUT they only see the code from the application running on top of them.
The processor itself however gets code from MANY SOURCES. The number of processes running on any machine today is quite large. In addition things as shared libraries, OS calls, etc add a pinch of code.
Hence the processor still has to perform all kinds of clever things, trying to resolve dependencies, avoiding stall and stalling at times. For modern multiple Issue machines this gets VERY complicated. It slows down cycle times, increases size (and hence speed) takes up space which could be used for better performing optimisations (larger cache, etc), etc.
This is one of the major headaches for CPU designers these days, as these parts increase unproportionally in complexity when trying to add more functional units, etc.
This is the reason why VLIW (very long instruction words) are used in the next generation processors (i.e Itanuim). In those the compiler does all the hard work, resolving all dependencies stalls, etc, at compile time, and then simply sending one long instruction to the processor, which includes one instruction for every functional unit. This means the processor can run at MUCH Higher speed, as it loses a big overhead complexity.
This is the reason why TRANSMETA uses such a VLIW, it enables them to build fast, small and CHEAP processors which can run at quite high speeds. (I am sure they could do more than 500Mhz, but then cost and power consumption get in the way. And the are much smaller and havent (yet) got the man resources as Intel, AMD)
Now where does code morphing come into play ?
I recently saw a few comments asking for access to the VLIW instruction directly, bypassing the code morphing. Well, that would DEFEAT the concept !
The Big Thing about code morphing is that it sees ALL instructions which get executed, in the right order. This includes the library calls, the OS routines, etc. Hence it can do MUCH BETTER optimization than ANY Compiler by definition can do. A compiler can only know about the program it compiles (yes you can do clever profiling, but you still only have the application source to optimize), it cannot optimize ACROSS APPLICATIONS.
The code morphing software has absolut knowledge about what code has executed and what code currently tries to execute. This enables it to do much better optimizations.
Furthermore it is SPECIFIC to the underlying processor. It knows everything about the processor and even more important, the processor is designed specifically for it !
(Hence the two versions are internally comletely different, each optimised for the kind of code linux/Win it is executing)
This is what makes the concept so interesting. It adds an additional abstraction layer on top of the actual hardware, allowing the underlying implementation to be always up-to-date and optimized for the application (removing all the complexity of having backwards compatible hardware, ask Intel
Overall, the approach is a MAJOR step forward, IMHO it can be compared to the CISC->RISC transition. It is after all the first implementation, competing with products well understood and optimized for years. It has potential to solve many of the major headaches of modern CPU designs (including COST). Comparing it to JIT is missing the BIG points.
I think it will go a LONG WAY !
Frank
Hi,
1 1-performance.html and http://www.javaworld.com/javaworld/jw-12-1999/jw-1 2-performance.html and http://www.javaworld.com/javaworld/jw-02-2000/jw-0 2-performance.html
What transmeta does is very similar to SUN's hotspot compiler for java. Like Transmeta's code morphing, hotspot compiles and optimzes bytecode on the fly using profiling data to optimize the parts that are executed most.
Your question as to why Java is still slower than C++ is answered in a series of javaworld articles (www.javaworld.com): http://www.javaworld.com/javaworld/jw-11-1999/jw-
Basically the problem is that stuff such as memory allocation, garbage collection, synchronization and runtime type checking have a performance price (you get a safer programming environment in return). The articles discusses these performance issues and give some usefull advice to avoid these bottlenecks.
Jilles
There's local overhead in the instrumentation code, but not a net overhead in the combined instrumentation code plus controlled loop code: the loss due to extra code prior to a loop is far less than the gain resulting from the more efficient loop code produced by introducing that extra instrumentation.
"The question of whether machines can think is no more interesting than [] whether submarines can swim" - Dijkstra
when the programmer has decided that space is more important than speed.
That would seem to be one of the issues where manual coding has the edge. I agree with your general point, but I think there's still work to do before auto-generation is perfect.
Imagine a case where an algorithm could easily optimise either way speed/space - maybe there's a hash table that's going to hold some programmer-controlled depth of the initial search and allowing it to expand would make usage quicker, but eat memory. A human coder would probably know the pattern of usage this routine would get. By knowing the practical number of data items to be encountered, they could optimise the hash size. A compiler can't do this, because the necessary information comes from the overall systemm domain, not just the source code. To allow compilers to compete effectively, we'll need much more subtle optimisation hinting than we currently have; especially that of the form "ignore this block, it's only used once" and "speed like crazy here, and you'll never have more than 5 items loaded simultaneously".
I'm not a compiler / assembler geek, so maybe someone is already doing this ?
I'll just say that when the compiler and the machine are already married together, the compiler DOES know about the size of the caches. I'm specifically talking about VLIW hardware as oppossed to a super-scalar machine that has multiple execution units. The main difference between these two classes of machines is that the super-scalar machine has a hardware layer that is doing the instruction scheduling where a VLIW box has the compiler doing that function. Thus I feel your observation is more of an apples and oranges situation.
Have you compiled your kernel today??
Where the hell did you get these figures ?
What compiler, what OS, what architecture.
They look *completely* out of line with every
other test I've seen. Check out the JGL against STL comparison above for some real numbers.
For a start, what the hell is the difference between integer division in C or C++ - no compiler that I've heard of written in the last 5 years produces different code in these two cases.
Secondly - who remembers to turn on optimization: anybody who cares about performance and isn't completely stupid.
http://rareformnewmedia.com/
>I can say that they do not offer Java in the
>CS department. This is a reflection in the fact
>that Java is not important for what programmers
>actually do in the world.
Man have you got to get out in the real world! You and your profs that is. Right now Java is my job. A year ago I was coding c and asm on an 8051 derivative for an embedded project. Today I'm building a distributed network management system for a company that builds fiber backbones. Guess what the most universal platform for executing code is? I'll give you a few clues: All their winXX boxes support it. All their *nix boxes support it. Any ideas? It's called a web browser and everyone that might be interested in designing, configuring, maintaining a network, every one of their laptops, desktops, and workstations has one.
Sooo...which platform do we support? What language do we choose? You only get one choice. I suggest that you transfer to a, uh, more modern school ASAP.
Good judgement comes from experience, and experience comes from bad judgement.
- W. Wriston, former Citibank CEO
This whole question boils down to whether or not a human mind is deterministic. We can prove that a
compiler is, or in fact that any program that runs on current computer hardware is.
If a human mind is deterministic then it follows that a human could do just as good as a compiler simply by using all of the same algorithms to generate the code, although the computer would certainly finish alot sooner.
If a human mind is NOT deterministic then it is clear that a person who had no programming skill whatsoever could beat the best compiler given enough time, a specification sheet of the language he was to generate the given algorithims in and a circit diagram of the cpu he/she was coding for.
This is VERY simple to see. A compiler can NEVER
under any conditions for any reason EVER EVER EVER produce better code than a human because the
human can ALWAYS produce the same code that the compiler did. The question should be is it reasonable to wait the 150 years of time it would take me to compile mozzila by hand or should I build a compiler to do it in a few hours. The next question would be how much time(if any) should I spend hand optimizing the compiler output.
A question I have is why cant we do the same thing that a JIT compiler does before runtime? It seems
clear that you could easily get very close to as much optimization from a multi pass profiling compiler as you ever could with a JIT without the runtime overhead.
Later
Spacewalrus
Hmmm.... back in the days when I used to dabble in a bit of demo coding I would frequently benchmark my 486 code against the (Borland) compilers.
I almost inevitably won. Hands down.
Compilers may have come on a bit since then, but I'd still fancy my chances. Of course, I would only dream of doing this kind of optimization for the most critical inner loops. Most of the time it isn't worth it.
But I would like to stress the point that a human with good knowledge of the system and the problem at hand should be able to beat a compiler in many cases. Compilers can't think laterally.
For example, consider a function that must fill an array of bytes with consecutive numbers between X and Y.
Compiler produces beautifully optimized loop that increments a counter, moves an array index and sets the bytes appropriately. If it's clever, it might use the same register as both an index and a loop variable. No problems. Pretty fast, but....
Joe Hacker takes one look at the problem, sets a 32-bit register to contain the bytes X, X+1, X+2, X+3 and does a quarter of the number of loops just adding 04040404 hex to the register each time.
Joe wins, because he *knows* that most of the time X and Z will be sufficiently far apart for his bit of hackery to be quicker. His knowledge of the nature of the problem will beat the compiler.
This is the kind of thing that humans will be better at for the time being. Once AI compilers start to make an appearance, things may change. But right now, I'll bet on human intuition any day of the week.
I completely agree that humans can always write better assembly than compilers can generate, but it's getting harder.
Not all that long ago, compilers weren't all that smart, and processors were a lot simpler. Today processors use tricks like out-of-order execution and branch prediction, making it extremely difficult for an assembly-language programmer to determine exactly what is going on. It still remains the case, though, that the programmer may know something about the code that the processor doesn't (that a register will never contain a certain value, or that some execution paths are much more likely or more critical than others) which will give the programmer the edge.
None of the just-in-time compilers that I am aware of recompile code as they go. If the Crusoe actually attempts to optimize sections of the code based on runtime profiling information, this could be the main reason it performs so well. There have been academic papers written on this idea before but this would be the first implementation I've heard of.
I was about to try and answer your question but it's bloody difficult to explain!
With garbage collection implemented in the best way the JIT or whatever doesn't try to hard to keep a record of what mem has been allocated.
But, whenever the system isn't paricularly busy or if it panics due to low memory the JIT can traverse all the objects that have been allocated (just starting with each of the Thread objects I think) just by checking the references each object has until it has a list of all the objects in the system. There are a few reasons this is more reliable (not just faster) than maintaining a reference count.
Then I suppose it (the JIT) could ask the OS to rearrange it's memory mapping thingies (Page Translation whatchamacallems) to put any of it's user space pages after the current position of the stack or something like that (hard to describe but i know what i'm talking about (even though i can't remember the name exactly!))
Basically it can be made to work but maybe not particular efficiently because it could result in a full page (4K) been used for just one small object but this can also be fixed
Years ago I worked at a company that was going to bring VLIW to the commercial world. One of the things they had to invent was MUCH smarter compiler technology. That was in 1985, so the basic ideas involved here are at least that old.
Anyway, when the compiler DOES know the hardware, especially VLIW machines, they can indeed do a job as GOOD as a human. (Look - they won't usually do better jobs, when the guys who designed the machine write code, they're going to do what the compiler is trying to emulate..) But from what I remember, static profiling was considered "good enough" back then if the VLIW machine is built right.
Most codes spend most of their time in inner most loops. If those loops can be rolled up correctly for VLIW you can be executing different interations of the loop within the same instruction slot. Being able to do this is usually the largest performance payoff. If you have a GOOD compiler in the first place, that KNOWS the hardware, then JIT and morphing aren't needed, or help.
The place that I see the morphing being a step forward is that VLIW machines were considered architectural dead-ends until just now. If I use one of these smart compilers for a machine with say 7 functional units, the code that emitted won't work on a machine with 8 functional units, or at least not be optimized any longer. These machines didn't scale well until now! Code-morphing technology completely removes this limitation!
Have you compiled your kernel today??
Actually, there is no absolute requirement that free() does all the cleanup work that the allocator, any more than a garbage collector must call a routine to do all of its cleanup work at the end of every block and at every assignment.
free() could just add the freed chunk to a list of chunks to be cleaned up. A second thread could run continuously tidying up all these chunks with the same system-wide view as a garbage collector.
I can see two major problems offhand:
The first is that java is, conventionally, a
language which runs in a virtual machine. The
infamous JVM. This involves a layer of
inefficiency from the word go. Granted, you could
avoid this problem by compiling direct for the
CPU in question, but this is a nontrivial
conversion from a stack-based language to a
register-based architecture. If you are on a
java-optimised CPU, compiling direct, this could
still be pretty good.
The second objection involves the compiler.
Compilers take CPU cycles. The smarter they are,
the more they take. If you compile beforehand,
this of course doesn't impact execution speed.
If, of course, you'd like to compile on the fly,
well, you're going to have to execute the compiler.
Java also isn't really all that suited to this,
to be frank. If you want to see a REALLY powerful
model for this sort of thing, try using a FORTH
compiler. On a CPU which is at all stack-friendly
with a good execution environment you can pull
all kinds of nifty tricks (on the fly modification
of running code, assembler-like low level activity
and so on) which can really speed things up
immensely.
So while it might seem profoundly counter-intuitive can anyone actually give me a good reason why Java + JIT should be slower than Good Programmertm + Assembler?
:).
:), it's always best avoided when using java, but sometimes you can't. This affects other languages too ofcourse.
:) hmmm.
:P. It's nice throwing around O notations about how much more efficient garbage collection can be etc, but you have to realise than you're assuming that memory freeing is the most costly process, it may not be. With JIT, GC etc, you have the additional overhead of the JIT and GC. I haven't seen any proof so far that these overheads aren't overheads at all.
;).
:)~.
:).
There are many things that slow java down, that I'm sure you're aware of, but I'll outline some of them here as they come to me
Limited stack, all java objects are created on the heap, and thus is considerably slower. Microsoft Research actually have a division working on speeding up java by putting some objects onto the thread stack.
Runtime checking, array bounds checking etc can not be turned off. Good for stability, bad for speed.
Late binding, can be rather slow
Coding practise. I don't know about many people here, but when I program in C, I usually know exactly what's going in memory, so I spend a bit more time optimizing this and that saving a few extra cycles, especially if the code is in a loop. I'm sure this is even more true of most ASM programmers. In java, I just don't care, I don't really know what's going on in memory, and I don't really care, I just do what makes the tidiest code, I also tend to use inheritance a bit more
Also, java's thread synchronization is easy, but extraordinarily slow compared to what you could do if you were to use Win32 for example, since you have a lot more control. A JIT is not going to help that. Come to think of that, that's where JIT can fail with high level languages, there are things that these languages restrict you from doing, which prevent such cool optimizations that a JIT couldn't possibly do - I'm not talking about instruction level optmization, I'm talking about things like being able to manipulate memory directly.
The idea that Java + JIT can be faster than traditional ASM is interesting, but never will happen I'm afraid
There are additional 'optimizations' that can be done at runtime, but does using a JIT really improve performance (taking into account the overhead of the JIT)? How many cases are there when a compiler can not optimize the code as much as a JIT, and the JIT can actually give any kind of improvement. I doubt there are many, and unless the improvements are like 10% improvements, I doubt it would matter much. What's the diff between a 433Mhz and a 476Mhz celeron? Nothing any average user would notice. The only place you would notice the speed difference is when you're running something like VMWare, 3DSMax or compiling Linux....and in those cases would a JIT improve or just hinder performance?
Ofcourse, as we get faster, and faster processors, we can move to higher and higher level languages like Java. They drastically reduce development times, and in many cases, will be the better choice. I doubt you'll see Quake or any OS written in Java anytime soon....hell, I doubt you'll see Office or CorelDraw written in java anytime soon
Crusoe is fast and super cool, but it's not as fast as a real 666Mhz x86. And if Transmeta adds official java support, I'd love to get my hands on one
I should also stress here, I have nothing again'st java/jit etc. There're all cool stuff, but I think the concept of Java beating good programmer + asm a bit, uh, unrealistic
If one Transmeta CPU, running the morphing software and actual productive code, is on par with a PIII 500 Mhz, what would the speed improvements be if you used 2 CPUs? One dedicated to morphing, and one dedicated to productive code? Design would be almost as easy as current design, and speed could double.
Best part of this approach is that it still appears that your system has a single CPU. Win9x will still work.
(appended to the end of comments you post, 120 chars)
Once a programmer has infered some property about his program, he needs a highi-level language to express his solution to the compiler. ML's typing system is the best thing I know that acheives that.
Once a program has figure a fast instruction ordering, he need a low-level language to express his solution in. We all know what kind of language does that.
But I don't know any languges that combines both.
As for todays' question : if JITs looks cool, and one time compiles looks boring, think of compiling n-times with a profile session in between and you will get much better result. (all the sugar, no r.time JIT overhead)
-
This post was compiled with `% gec -O`. email me if you need the sources
I recently sped a program up by 150% in a snap simply by killing a few new()s. Swing and LotusXSL have had similar experiences.
I think that part of the problem is that all of this is new, so there is more to do. HotSpot the trendy JIT from Sun in places IS ALREADY FASTER THAN C, but whenever it comes to Object creation, things slow down a lot.
So why in theory should new() be quick when in fact it is slow? IMHO the problem is not with the memory claiming, its with all the other stuff that the JVM has to do.
When I call new Foo() the JVM:
- Checks to see if the bytecode for Foo already exists, and if not it loads it, verifies it, and calls the class init method.
- Allocs new memory. Probably very quick
- Calls the hierachy of constructors of all Foos superclasses. Quite slow.
- (For advanced garbage collectors) Place the object on the 'recent items' list. Probably quick
So I guess the complexity of the system as a whole is the problem here.This is very very slow, but should only be done once.
DWR is Ajax for Java
I once took a course with Dr. Alfred Aho, one of the authors of The Dragon Book and a contributor to lex, yacc and awk (he's the 'a' in awk.) Dr. Aho would never call a compiler component an "optimizer;" instead, he called it a "transmogrifier," because the only way a compiler could produce truly optimal code would be if it could write an optimal program--and it can't.
While the compiler has the opportunity to rewrite the code every time it processes it, it's not very good at rewriting code, so its abilities are limited.
A well-designed compiler on a well-designed CPU can produce very good assembler code--I doubt I could outsmart a compiler in a RISC assembler optimization problem--but I believe that code tweaking is usually of value only in inner loops and for certain kinds of data storage.
All of which adds up to the reality that, pending the arrival of artifical intelligence, human programmers are still needed.
The basic data types are defined. You can use sizeof() to generate 100% portable code when you really have to know exactly how bit your ints are. More importantly, you can use your noggin and avoid it being a problem in most cases by writing code that accepts the range of possible value limits.
Embedded systems programmers will back me up on this one: it's a Good Thing that when you write C you can write one piece of code that's perfectly happy to use a 16-bit, 20-bit, or 32-bit int and it will be equally efficient on any of those systems.
I believe Andy has a good point in arguing that there's no fundamental reason for manually written assembly being better than automatically self-optimizing stuff. I also believe that manually written assembly will ultimately become obsolete.
Nevertheless I'd like to point out that JIT is based a somewhat dangerous technique: a program that alters (its own) code. I believe this technique was used in the eighties to scare off hackers by making code incomprehensible and hard to disassemble until the program was actually running. Also (even on a good old 6502 processor) it's possible to make some speed improvents peeking and poking into the code you're actually executing.
When compilers for Microcomputers got faster and most processor architectures (known as Harvard architecture, I believe) explicitly require a division of RW and RO memory segments, self-modifying code was abandoned. Hacking into your own code is generally conceived as bad-programming-practice.
I was wondering wheter JIT techniques also require very intelligent (and thus "heavy") disassemblers? One might also expect that developing a JIT compiler is a lot harder than doing a conventional one (without peep-hole optimization). Does anyone have experience in these subjects?
I hate when I skip words.
The point is that the average application has inflated enormeous over the years.
So has the average machine. Which is cause and which is effect? Do we have bigger applications because with today's faster processors and bigger memories, throwing more hardware at the problem is cheaper then trying to bum five bytes out of a 50 KB program?
Something to think about, anyway.
dragonhawk@iname.microsoft.com
I do not like Microsoft. Remove them from my email address.
Perhaps another malloc comes along wanting the same size (roughly) as the thing just freed. We just give it the recently freed chunk. We then don't have to build a new one.
:)
IIRC from what I have read about the Linux kernel, it does something very like this. Blocks of free memory are kept track of in a hash table, where the hash key is the size of the block. When you free a block, it gets merged with any adjacent blocks and put in the appropriate spot in the hash table.
Or it might have been an array of lists of blocks, where each element in the array represents a list of blocks twice as big as the previous element's list. It's been awhile; I forget.
dragonhawk@iname.microsoft.com
I do not like Microsoft. Remove them from my email address.
@#$% theory.
:-)
I can write fast C (or C++ if you stay away from the evil slow features) code.
Everything I do in Java is bloody slow.
I do believe fundamentally that some sort of per-machine compilation is in fact ideal -- the C compiler could do clever stuff if it new the exact cache architecture of your machine and compiled your programs from some intermediate (say parsed and desymbolized) representation when you ran them (or maybe just the 1st time you ran them).
(all the slackware people in the crowd say "bwaahahahaha! I did compile my programs for my own machine!"
But I think the dumbness of Java -- the funky GC, the dumb dumb dumb RTTI, will always make it slow. For that matter, Java uses a dumb fake-o instruction set that's not a particularly good representation of the information needed by a compiler. Why not give your JITC something closer to the source rather than a binary for a machine that doesn't even exist ?!?
...
Ditzel said something very significant when he introduced Crusoe: (i'm paraphrasing) "The reason we haven't talked openly about this before is we didn't want to boast before we could prove we've done it". Ditzel stood there and said "We've got a chip that does on-chip JIT architecture translation, and does it to run as fast as any other chip out there, and here's the proof".
The Java people have been claiming for a long time that they could write faster code than the C compiler with JITC.
Java has gotten better than when it started, it's true.
But it still can't hold a candle to gcc -O2.
My message to the Java world: I spend too much on my computers to waste time running slow code. Call me back when you can actually demonstrate your claims.
Doesn't "fast enough" outweigh "fastest possible" if you get it sooner?.
Yes it does but only if this is something you want to run not very often. If it is running constantly - think kernel, daemons, etc it does not outweight a thing.
Also perl or python still beat the sh... out of Java on anything and develkoping in either one of those is much faster. What I actually want to see finally is a perl.so or python.so plugin that loads perl (or python perse) as an application and does policing on all dangerous IO functions.
After all browser functionality is just an elementary module. Than the entire jabba the hut issue will be finally put where it belongs - to rest.
Baker's Law: Misery no longer loves company. Nowadays it insists on it
http://www.sigsegv.cx/
One of the areas that run-time profiling adds to the mix is the ability of the code to adjust "itself" based on processor cache sizes, etc. If I run a program on my K6-200 and then on a K7-700, the program should have loops optimised very differently. That is, my chip has very little cache in comparison to the K7. The K7 could have a large number of long instruction sequences unrolled that would be in L1 cache whereas it would be smarter, if branch prediction works out (again, run-time profiling) to keep those loops rolled up on my CPU.
- Michael T. Babcock (Yes, I blog)
Which reminds me of how big a smile I got when I read Transmeta's processor marketing where they said that efficiency was more important to them than performance and it would not be as fast as a native CPU, but close enough to cut it.
- Michael T. Babcock (Yes, I blog)
I don't want to argue about the general case, but as things relate to current-technology Java, JVMs and JITs, here are some things to consider:
a. One of the most important optimization techniques is inlining. In OO, you can't inline unless you're sure a call is monomorphic (static). Java is even worse than the general case OO -- a Java program simply cannot use inlining in the vast majority of cases. Here's why: To inline a call in OOP, you must know for 100% that the distatch site is monomorphic (i.e. that the call is never polymorphic). But in Java, even if the compiler finds out that a certain method is never replaced by inheritance (overriden) in any class in the program, the program could load new classes at runtime (using Java's Reflection mechanism), thus turning monomorphic call sites into potentially polymorphic ones.
(A few exceptions to this rule: first, there are final methods, or methods in final classes, that cannot be overridden. Second, there are Sealed JAR files, in which all package-level (default access level) methods cannot be overriden. IBM's JIT takes advantage of that, a fact which allows for great optimizations in the standard Java libraries.)
b. Sun's JVM (the one for Windows, at least) implements references by a dual layer: the value stored in the reference fields is an index into an array maintained by the JVM, which contains the actual pointer to the memory block. This allows the memory manager to move blocks around without having to update all references to that block, but it also means each dereference actually requires two dereference operations. So when you write "this.field1.field2", you think you're making two dereferences but actually the machine makes four.
c. The JVM is a stack machine. This means that naive implementations on register machines result in slow operations, that simply do not take proper advantage of the native architecture even when compiled to native code. Microsoft was the first to realize this, and indeed their JVMs were the fastest at the time (later on IBM's JVM took that trophy).
Finally, a word or two on the general case: anything a JIT can do can also be done by a regular compiler, if you allow it to use profiling data from a few "typical" runs of the program (but note that not all programs have "typical" runs). The JIT, however, consumes memory at runtime, unlike a regular compiler - and with programs that require more memory than is physically available (practically all programs today), extra memory requirements reduce speed since they increase paging; so the JIT has its price.
- Tal Cohen
As shown at http://www.idiom.com/~zilla/Computer/javaCbenchmar k.html, even the linux jdk, which is relatively slow than other platform's, runs as fast as natively compiled C program when runs simple operations.
In my understanding, java's poor performance is due to following factors:
1. too generized api design resulted in very deep call stack. for example, printing current date and time using java.text.DateFormat uses much more calls than traditional C (a system call and a printf is enough for C). It gets more terrible in swing. use of interfaces causes the same problem i think.
2. JNI is too slow. (can't understand why but it is known to be slow and java can't help but using many JNIs)
3. as WORA is important in java, it is hard for java to use any platform specific resources such as Graphic card's accelerations. that is one of the reasons swing's design is getting ugly as it tries to boost its performance.
4. too many small class files result in high I/O usage when JVM loads classes.
5. no MACRO! for example, most of java API is synchronized, which is expensive, reguardless thread is used or not. the same for security check codes and platform specific implementation codes,
etc. these problems can be solved by MACRO but...
someone in sun doesn't like macro and i agree with him when think of java's purity.
some of these problems are known to be solved in next jdk version(hotspot client version). slow synchronizations and class loading speed, etc.
but mostly, they are java's design issue and never be solved, as java is mostly running in virtual machine and originally it is designed to be used for java chips, only when JVM is implemented in processor level, the speed problem will be solved.
I'm looking forward to seeing java running in MAJC or Transmeta's chips.
please correct me if i'm wrong in some and forgive me for my poor english.
Cheers.
-- Y. J. Chun
Seriously, when you communicate over a inter-machine connection, you have to comply with the communication standard.
If the communciation protocol is, for example, byte based and you are writing portable code, you have to have a formal, portable method of chopping up that struct into bytes at the sending end and building it back up again (from bytes) at the receiving end. It's called serializing, and it's disgusting laziness to skip it; if you do, you deserve those problems, and also to get bleeding genital warts.
Just dumping the bits of a fuzzily-defined internal data structure outside of the comfy innards of your own program is the ultimate in programmer carelessness (well, dividing by a null pointer and writing to the dereferenced quotient, with appropriate casts, is the ultimate in carelessness, but you get the point). Would you also send your pointers across a network and expect them to work at the other end?
This has to be one of the most widespread, and possibly destructive, myths about C programming. The truth is, if you write code that depends on the size of your basic types, your code will be less portable. The next generation machines will always handle your types in ways you will fail to predict.
> And it also doesnt give todays hardware any advantage.
Have you ever worked with a PowerPC processor? Some of these cores do not have 16-bit integers, only 32-bits. Yet, lots of people reasoning like you use them anyway (because "16 bits is enough for this value"). This forces the compiler to insert 16-bit emulation code, thus increasing code size and decreasing execution speed. Had these people ignored the size of their integers and just used 'int', the code would have been both faster and more portable!
Take a look at a few of the most widely ported software packages around. How many of these redefine the types in order to "know the size of their types"? Not many, I can tell you.
Writing portable code has little to do with worrying about the size of your integers, and lots to do with not trying to optimize your code to whatever system you happen to write the code on (by, for example, worrying about what the size of your integers are...)
The difference between theory and practice is that in theory there is no difference whilst in practice there is.
First off JIT is not java specific. Transmetas chip does a form of JIT on everything - even x86 assembly code. Your basic argument that JIT can be efficient is correct (not more efficient than humans, but I'll come to that).
The performance hit from Java comes far more from it's crappy libraries (Swing anybody) than the JIT aspects. Bounds checking, garbage collection, and inherent overhead of OO techniques all cause a performance hit, but the real trouble comes from crappy libraries written by crappy programmers who put 10 levels of indirection between a request and it's implementation.
For instance the String class being written with 16 bit characters to make it unicode friendly when 98% of the java code written never uses this feature but still requires twice the time and space for every operation. Also Strings are not alterable so to alter a character you create a new StringBuffer object, copy everything, make an adjustment, then replace the old reference with the new so the old object eventually gets passed to garbage collection... all compared to
buf[i] = 'a'; (C or C++). It's crap like this that makes Java seem so slow, not the JIT hit.
Your argument about how JIT can outperform programmers because it has more information is flawed - any decent programmer will know where the spikes are and optimize accordingly. He will know what he can and can't do with the data, whilst the compiler has to take the strictest possible interpretation. Advanced optimization is too slow to do on the fly (although you can optimize the byte code to some extent with a decent javac, then optimize the translation with a JIT)
Garbage collection argument is silly too. In C/C++ or assembly, the programmer has a choice as to whether he uses the stack or the heap. It makes his life more complicated, but as long as he knows what he's doing it's inherently more efficient. If he doesn't know what he's doing then he's useless in any langauge. If you want to, you can use a garbage collector in assembly or C++, but it's not the only technique available to you.
Basically Java is slow for the same reason that Windows sys admin is slower than Linux sys admin.
You don't trust the user with the rm command or trust the programmer with pointers - show him an idiot box if he tries to delete a file or bounds check that array access for him. It's one way of doing things, but it sure isn't efficient.
http://rareformnewmedia.com/
And if you need really big numbers you can always code your own BigInt structs, though you might want to write a little preprocessor to do that (or use C++ and redefine all the ops for it).
Or maybe it only bothers to do this when resources get tight. Perhaps another malloc comes along wanting the same size (roughly) as the thing just freed. We just give it the recently freed chunk. We then don't have to build a new one.
Caveat: I have no idea if this makes sense, performance-wise, in the real world.
The cake is a pie
C is for efficient coding by trained experts, with the potential for portability if that is a desired goal. It provides no hand-holding that would interfere with run-time efficiency. The default behavior is and should be the most efficient behavior.
There are plenty of other languages for people who are willing to sacrifice efficiency for more hand-holding.
> it's particularly nice in large and sophisticated object models where you almost always lose track of who has a reference to what
Does this make anyone else nervous as heck. If you can't even keep track of when your objects should be deleted, how the hell do you keep track of anything. Increasing complexity requires more clarity of thought and design, not less. It's not the overhead of GC that bothers me, it's the sloppiness of thought it seems to encourage (try using JBuilder with less than 256Meg if you want to see what I mean).
In some languages (eg ML), you really are working at a different level of abstraction and then GC makes perfect sense, but it's not GC explicity that makes life easier.
http://rareformnewmedia.com/
The reason has nothing to do with Intel compatibility, it has to do with the nature of VLIW. As someone else mentioned above, every new VLIW CPU from Transmeta will be different (the two CPUs they announced have different cores) and require totally different optimizing, possibly even having a different instruction set.
What Transmeta has done is remove the primary problem of VLIW...the fact that you need a very specific compiler for your CPU to extract performance out of it. By including the "compiler" in the CPU, it's done for you at run time, so you won't have to recompile your code every time you get a new CPU.
OpenVerse Visual Chat: http://openverse.org
-- OpenVerse Visual Chat: http://openverse.com
Well, Crusoe was designed for low power use, then they tried to get speed out of it..
"Efficiency"=Power efficiency, not speed efficiency.
It is slower then other x86 chips per mhz, but paster per watt. Therefore, they have done a good job at what they wanted to do. No, it's not an alpha or a k7, but it wasn't designed to be.
Blessed are the pessimists, for they have made backups.