Transmeta Code Morphing != Just In Time
The following was written by Slashdot Reader Andy Armstrong
Transmeta Code Morphing != Code Morphing The recent Transmeta announcment crystalised something I've been thinking about for a while. It's my belief that it should be possible to make a compiler generate much better code in the general case than someone writing hand coded assembler. Furthermore it should be possible for a JIT (Just In Time) compiler to produce better code than a conventional one time compiler.Why should a compiler be better than an experienced assembler programmer? Well,
- the compiler can know the target processor intimately (cycle times, impact of instruction ordering, etc.)
- the compiler gets to re-write the entire program each time it sees it.
The second point is critical: any programmer writing an assy. language program of any significant size will write the code to be maintainable. Of course, it makes sense to do things like defining standard entry and exit sequences to routines, keep a few registers spare (in those architectures that have more than a few) and other practices that lead to maintainable code, but the compiler doesn't have to maintain the code it writes. It gets to write the whole thing from scratch every time. This means that functions can be inlined (and repeated code sequences turned into functions). Loops can be unrolled then rolled back up at the next compile when the programmer has decided that space is more important than speed.
If you know you're not going to have to maintain a bit of code you can do some pretty scary things to get it to perform better. A compiler can potentially do that all the time.
Why should JIT be better than one time? This should be clear to anyone who's followed the Transmeta story. A key element of their code morphing technology is that they insert instrumentation into the code they generate, effectively profiling as it runs so that the compiler can decide which bits to optimize the next time it sees the code. It's well known that the coverage graph for a typical programme looks extremely spiky - the most frequently executed code may get hit thousands or millions of times more than it's neighbours. It follows that it really isn't worth optimizing the stuff that only accounts for a millionth of the code's execution time.
This brings me to my point: is there really any reason why a Java / JIT combination shouldn't result in code that executes as quickly as the equivalents in other languages?
You might suggest that garbage collection must slow things down, but I'm convinced that, done right, garbage collection can actually improve performance. malloc()/free() require the memory manager to think about the heap for every call, but new() can be implemented as a stack (m = memlimit; memlimit += size; return m) and the garbage collector gets to do all its memory management in one chunk - it can take an overview of the memory landscape rather than trying to keep things fairly optimal each time memory is released as free() must.
You could argue that the OO nature of Java means that it must dynamically allocate objects that would be static in a program written in C or assembler. That's true, but (assuming calls to new() are cheap, which I believe they can be) this really isn't a problem. Current processors don't take a huge performance hit when working with objects who's address is not known at compile time; in fact in many architectures it makes no difference at all.
So while it might seem profoundly counter-intuitive can anyone actually give me a good reason why Java + JIT should be slower than Good Programmertm + Assembler?
So, yes, an assembly programmer can outrun most any compiler. But does it matter? Almost never.
Depends on the application.
Nobody here is talking about the fact that big piggy compilers use lots of resources when you're talking about very resource-tight applications. I've written a fair bit of embedded Assembly code for 4 and 8 bit single chip designs where there was 128 or 256 bytes or maybe even 1K of read/write memory in total for the whole system. Guess what: that means you don't get to have a huge stack.
The only use for C in such an environment is to steal math routines out of it by compiling to ASM and reverse engineering them.
And there wasn't even any C available for the NEC 4-bitter I spent so much time coding.
The reasons - (1)humans get bored and lazy. They can't do even fairly basic optimizations over large programs consistently. So sure sometimes a human may do better for a small stint, but we are talking whole real world programs like an OS, database engine or Mozilla.
(2) Humans are error prone. Even good clueful ones. That means that programming in asm introduce bugs, and at times means screwing up the optimization but still generating correct output. Less buggy is better and higher level will yield less bugs.
(3) Some optimizations just are not tractible for humans to do in practice. Sure I can sort an a list of 2500 names, sorting is not hard, but on past a certain scales it just is not practical to get humans to do it. Computers scale very well for many types of problems much better than people, and many of those types of problems are useful for generating faster/better code.
Slothmonster
Not that a good compiler cannot outperform even a good assembly hacker on, for instance, memmory allocation. And not that abstraction in itself must lead to inneficiency, but it makes it inherently easy to allocate/free memory for the programmer, who will then of course allocate/free more memmory.
--The knowledge that you are an idiot, is what distinguishes you from one.
--The knowledge that you are an idiot, is what distinguishes you from one.
For instance, a line in a program might be "a = b % 255". But the 255 might not be set in stone, it could differ by up to 10. Computer languages can't express this, but if they could, the computer could figure out that it could change the 255 to a 256 (here and possibly elsewhere in the program) and make it "a = b & 0xFF" which is much faster.
Granted, this is another trivial example, but there are many more situations like this where optimizations are possible. The more we tell the computer, the more it can figure out. The boundaries of what programming languages can express and what optimizers can do are expanding all the time, but until we tell it every nuance about the program, a human will still be able to use that knowledge to make faster programs.
Other engineers are required to use the minimum resources to achieve the maximum requirements of the project design, shouldn't computer 'engineers' be able to acheive the same.
Apples and oranges. The cost of creation materials for say a building, demands that the least amount of them be used. Not so with bits and bytes. These days, they're almost free. I applaud the true craftsmen of the programming world for their principle and talents, but with all due respect, to me it's penny wise, pound foolish to waste time reinventing the wheel each time you have to write code. Did you dig your home's foundation with a trowel?
Besides, optimizing compilers are good, but not that good, someone has to write them, and when was the last time that you wrote a program that can solve complex creative problems better than you can?
I haven't done that myself, but some other people had. The FFTW people have created a DFT library written in ANSI C that outperforms nearly all other DFT libraries, many of which are provided by the hardware vendors who hire the best programmers and use assembly code extensively. Most of the code in the library are generated by a program written in OCaml.
And judging from the number of patents and papers in IEEE journals, I believe that creating a fast DFT library is indeed a complex and creative problem.
Java really doesn't 'require' one to have a browser, even. It requires a JIT or like concept to run the code (VM's included...). More and more, Java use will likely split into split camps. It is trivial, these days, to actually create an executable native code (machine code) java program. Modern compilers are beginning to include options for this at an increasing rate. I KNOW from experience that Metrowerks codewarrior tools give one this ability. On the other side, that same code can still be compiled into Java BC, and run with VM and JIT and the like. So, this is changing rapidly, and in the future we should see even more changes to bring Java more into the 'main stream' of programming, particularily when all the assorted threading and timing and other difficulties are sorted out... Brian
No, it's not an alpha or a k7
I am wondering if it's even a K6. They certainly spent enough time at the dog and pony show preaching about "the benchmarks need to be adapted" instead of quoting performance numbers.
It's almost certainly as good as a 486sx, though. Faster, at least.
Although I agree in general with your comments, your 4th point is incorrect. Just as with C++, the Java compiler is often able to detect when a method call does not need to be virtualized, and does not need to go through indirection.
This is especially so for two important cases: inlining function calls, and finalized classes (where the programmer instructs the compiler that there will be no subclassing).
--binkley
--binkley
Have you ever written a large program? Even on a command line? On a CLI you can do assembler or C, because they're linear in nature. C/C++ was not created so that programs could use less CPU cycles; it was created so that programs could be written more quickly (programmer time), more robustly, and vastly larger than their assembly counterparts.
I'm certain you can write a 500-instruction assembly program that will run faster than a similar program I write in C++. But the scope of problems you can solve in those 500 lines are so unbelievably small that its not worth the time to test it.
>>>>>>>>>> Kvort
-Don't mind me, I'm personality-deficient and mentally-impaired.
Sorry... this post is probably meant to be deeper inside this thread.
:)
Are we not all just dancing around the idea of a human compiler with a vast database of algorithms, intuition, look-ahead properties, and an abstracted idea of what it is exactly that the program is doing? IE. There is no compiler, to my knowledge, which writes highly effective parallel code. Why is that? Simply because no compiler can really understand the overall problem.
How long do you think it will be before the all-powerful AI compiler is written, with all of these abilities, and more, on a box that runs at a few GHz?
Or maybe i'm just being a little too science fictiony here
Cheers,
D
os.system("perl -e 'print \"My first Python Script.\"'")
You might want to read "Engineering a Compiler" by Patty Anklam, Dave Cutler, Roger Heinen, and Doug MacLaren, from Digital (as in DEC) Press, ISBN 0-932376-19-3. (Dave Cutler was one of the architects of VMS and, eventually, Windows/NT.) This is the team that wrote the Dec VMS compilers (PL/1 and Vax-C). If I remember correctly, their back-end code generator was used for several (most? all?) other Dec VMS compilers. Dave's thesis was that a good compiler could outperform all but the best assembler programmers (and Dave was an awesome assembler programmer on the PDP-11 and Vax architectures). If I remember the book correctly (it's at home), their idea was to generate straight-forward assembler code and pipeline-optimize everything by detecting and rewriting assembler code sequences.
Quite a few. Amazingly these are usually those that have portability problems. PGP 5 is a good example.
Baker's Law: Misery no longer loves company. Nowadays it insists on it
http://www.sigsegv.cx/
What if you compiled the original code with performance analysis included, and then fed that performance analysis back into a traditional compiler to execute again. In this way, you get all of the benefits of the above JIT system without the running overhead of the JIT. Sure, it takes a few thousand runs to get the program optimized, and you can only optimize for the average case, but most programs that need to be heavily optimized will have a roughly similar executaion pattern every time.
I may be wrong, but you can do this with the SUNWspro workshop compilers. This technology would be nice to have in gcc as well.
Mike
--
Mike Mangino Consultant, Analysts International
Mike Mangino
mmangino@acm.org
I hope you don't mean that you think a "good"
assembler will produce fast but incorrect code..
For every problem, there is at least one solution that is simple, neat, and wrong.
Of course, that is why you could do it in a seperate (but identical) CPU without any problems. The code optimizations could be much better then what you could get out of a hardware only solution.
(appended to the end of comments you post, 120 chars)
As for RTTI, it allows you to do things that you cannot do without it. Things like type saftey, and runtime loading of classes. More importantly it make formal analysis of Java programs possible, unlike C/C++ programs. In the future those tools will be critical - because they will be the key to producing near bug free code. Being able to state with supreme confidence that this type of bug does not exist in my code because it has been rigorously proven is helpful. You really should be more concerned with running correct code than running "fast" code.
Now if you want to design a better byte code format for a Java like language feel free. Researchers all always looking to improve those sorts of things. Java uses a stack based machine - this is good in that makes .class files small. What other (dis)advantages are you aware of?
Message to the C/C++ world - you spend too much time writing software to waste time on a non type safe langauge.
Nothing exists exept atoms and empty space; everything else is opinion.
blah blah blah....
I think you may be missing (or misunderstanding) the point a bit. Sure, you can furbish an example, or a language, in which a human can optimize the program /statically/ better than a compiler/computer can optimize a program /statically/. But the author makes the point that the most frequently executed parts of the program are very sparse, and that the cost of profiling the program dynamically can be recouped by doing intelligent dynamic optimizations. As hardware gets faster and more complex I think this will be more true. There will simply be no way for humans to statically optimize for such a non-deterministic runtime environment. For example, how many programmers explicitly specify parallelism in their programs? How can they be sure that the CPU at runtime doesn't know better than them how to optimize the program? I say let the CPU decide. There is going to come a point in which it is just not feasible for humans to even attempt to optimize for such a complex runtime environment, and that time is coming fast. The faster chips get, the smaller the more amortized the overhead of finding better optimizations at runtime becomes.
Jazilla.org - the Java Mozilla
It's 10 PM. Do you know if you're un-American?
You're right, of course. No matter what some people have been trying to say for 40 years or more, you can almost always write better code than a compiler. This is especially true when you're dealing with something bigger than a a single function. The usual way of showing that a compiler is better than a human is by using one smallish C function as an example. That's a pointless example, because the benefit comes from analyzing that function in context and not on its own.
The point of diminishing returns comes into play quickly, however. For example, take the renderer for most any fast 3D game. If you went into the routine that passes triangles to the graphics card, a frequent hot spot, and added a pointless call to a supposedly expensive function, like sqrt, you're not going to notice an effect on frame rate. Doing a higher level optimization, like removing a single polygon from a model, is going to be more of a benefit than optimizing the polygon code, but it's still not going to be noticible. Sometimes you can get big benefits by using algorithms that look more complicated, ones you wouldn't want to approach in assembly, even though they use more code.
With complex programs, it's even conceivable for an interpreted language to out run a compiled one, because everything comes down to architecture and an understanding of the problem. This is hasn't been the case with Java, because Java is a fairly low level language (the more abstract a language is, the less win from compliation) and because Java has become entrenched in the "learn programming in 14 days" market of web designers turned programmers.
So, yes, an assembly programmer can outrun most any compiler. But does it matter? Almost never.
Just wanted add this to the discussion.
This post encoded with ROT26. If you can read it, you've violated the DMCA. Handcuffs please, sergeant.
This my be a little off the thread of thought here, but why aren't there any one-time "dynamic" compilers. What I mean by this is a program that monitors watches another program run and then gives a detailed report of where the preformance problems in the code, which would be used for optimization.
Example: Load up your dynamic compiler and have it complile a program, lets say Quake. After comiling it just like gcc Quake would then load. The user then play Quake for a while to demonstrate normal usage; meanwhile the "dynamic" compiler would be taking notes in the background as to where 90% of the cpu time is spent. After the user is done playing Quake the "dynamic" compiler would look at the notes it just made and optimize those parts of the code that are a preformance bottleneck.
This would be a much better approach than optimizing blindly with no indication of where the optimization needs to be.
Although pulling out your BFG might be sped up by 30ns unrolling a couple loops it wouldn't be worth the extra size of the program. However a little loop used in rendering the background unrolled might make a huge preformence boost, that would more than justify the extra program size. How would a "static" compiler know this? It wouldn't because it doesn't know how the code is used in a real world situtaion.
So:
1. Are there any similar optimizing compilers out there? And where can I get my hands on one?
2. If there isn't one there should be an intiative to make one.
bash-2.04$
bash-2.04$yes "Don't you hate dialup connections?"| write USERNAME
Optimization at the level that the compiler works on, is generally a last step for optimization. I agree that a compiler can do that type of opimization better than most people. However, to truly optimize your app/function the compiler will fail simply because it has no clue as to what the function/app is supposed to be doing.
You will get a bigger bang for your buck, optimization wise, by changing the implementation rather than optimzing CPU cycles.
A really good book on this subject is called "The ZEN of Code Optimization" by Michael Abrash. The book covers almost every aspect of optimization including things like knowing when to optimize and when not to. Its geared towards the Pentium processor and has most examples in C and Assembler.
Another problem that you could run into with the compiler optimizing for you is that it might change the intent of the function/app. If it does that then you will be hard pressed to debug your app unless you enjoy tracing through the assembler. Most if not all compilers don't have this problem anymore.
If you want it to go faster, change the way your doing it.
Visit the Arcade Restoration Workshop @ http://www.arcaderestoration.com
This my be a little off the thread of thought here, but why aren't there any one-time "dynamic" compilers. What I mean by this is a program that monitors watches another program run and then gives a detailed report of where the preformance problems in the code, which would be used for optimization.
Example: Load up your dynamic compiler and have it complile a program, lets say Quake. After comiling it just like gcc Quake would then load. The user then play Quake for a while to demonstrate normal usage; meanwhile the "dynamic" compiler would be taking notes in the background as to where 90% of the cpu time is spent. After the user is done playing Quake the "dynamic" compiler would look at the notes it just made and optimize those parts of the code that are a preformance bottleneck.
This would be a much better approach than optimizing blindly with no indication of where the optimization needs to be.
Although pulling out your BFG might be sped up by 30ns unrolling a couple loops it wouldn't be worth the extra size of the program. However a little loop used in rendering the background unrolled might make a huge preformence boost, that would more than justify the extra program size. How would a "static" compiler know this? It wouldn't because it doesn't know how the code is used in a real world situtaion.
So:
1. Are there any similar optimizing compilers out there? And where can I get my hands on one?
2. If there isn't one there should be an intiative to make one.
bash-2.04$
bash-2.04$yes "Don't you hate dialup connections?"| write USERNAME
Part of the argument was in fact compilers beeing better at writing fast code than assy. programmers. I HAD to say something :)
Anyway, usualy normal portable code does the trick if you are not running realy time critical stuff. And if you do, then no compiler will.
We are at a point in this industry where most advances won't matter to almost anyone that would spend a thought on it. There is no difference (to the vast mayority of users) between a task taking a nanosecond or a femtosecond. Or whatever.
rmstar
Would you also send your pointers across a network and expect them to work at the other end?
The idiots who wrote the code that caused me the pain would have...
This involves one of the internal debates I've had about C for years. C allows you to shoot yourself in the foot. That is all well and good as I (I hope) know what I'm doing. The trouble is that, with turnover in this industry what it is, a new foot might have arrived by the time the bullet hits. (Unfortunately, I've often been the new foot.)
The cake is a pie
1.Please give an example on the type of code you'd want to optimize, any reasonably simple (small fragment of )ecode that doesn't depend on advanced integer arithmetic should compile to something as fast or faster than you .cpp, GCC input, GCC assembler output and your "better" ASM output.
would possibly write yourself... Please include example.c or
I was talking about a simple brute force compare and if it dosn't work then move on type thing. I think that this type of thing will always work (just take a little while). Considering that most modern computers are extremely powerful then I have to say that such brute force is still possible. What I don't get and still don't get is how people get all the time to actually research all this stuff. Does anyone have a functional life anymore? I am took courses for an associate of science and they didn't get anywhere as complex as all the little theory arguments as I have seen here.
1.Invartiant detection, using compilers to find invariants in the code, and then deduce lemma's which may remove bounds-checks, type-lookups, casting or flush of registers to memory.
Could someone please explain what a lemma is I'm sorry I guess I didn't take Phd level math courses.
Slashdot social engineering at it's finest
Actually, some industrial strength JVM's (i.e. not M$ or browser based ones) run with a preset memory allocation. For instance, we run the sun JVM on Solaris with an allocation of, say, 512mb. That means it's running it's own memory allocation within that space, which it doesn't have to share with any other processes. So of course your argument about it having to malloc() twice is avoided. In fact (and my knowledge of memory allocation models is not good enough to be sure) it kind of suggests to me that because the JVM is allocatin within a "safe" environment it may be able to cut corners and actually do the malloc faster than native OS calls.
---- Den ene knappen er powerknapp, den andre er Bender voice knapp "Bite My Shiny Metal Ass"
ho ironic would it be to see engineers getting replaced by machines...i could just see a protest outside microsoft by unemployed computer profesionals!!!
-- your knees hurt, don't they?
"In the general case", perhaps a compiler will produce better code (and certainlty not, as yet, for specialised cases - an inner loop coded direct to PPC macro assembler (PASM on Amiga PPC) is the fastest code I've ever written, but humans still write compilers. Now, if they were to force-evolve code using a genetic algorithm, you might get code better than any a human code write, but it'll probably depend on some weird side effect to some obscure instruction, or the rtesosnant frequency of your ram bus, or something...
Cool article, all the same though. I'd love to see more of this sort of thing on Slashdot, like back in the old days.
This isn't inherent to all JITs and code-morphing techniques, but Java does take a performance hit from having much more heap allocation than some other languages. An article in SIGLANG last Spring detailed a Java byte-code compiler that got around this problem. The compiler checked each dynamically allocated object to see if any references could be made to it outside its code-block--this is easy: just check if it is returned from the block (as the value of a function, say), or if its address is stored in a variable of higher scope (i.e., a variable that will outlast the code block). If not, it is compiled as a stack allocation, rather than a heap allocation. This algorithm eliminated %90 of unnecessary heap allocations. The JIT (translating byte-code to machine code) was altered to allow stack allocations like C, and whammo!--a %10 performance increase on Pentiums. There are a few optimizations like this that require a compiler to be able to see much or all of the code at once, rather than doing repeated optimizing passes over small sections of code. I wonder if these compiler algorithms are hard to incorporate in code-morphing techniques? On another note, Java has a lot of performance/memory usage problems that are due more to things like the sandbox-security system, lots of indirection, and the compromises necessary for system-indepence. These problems won't affect a code-morpher that translates from one machine language to another.
This doesn't seem all that new, actually.
If I remember correctly, to run x86 NT code on an ALPHA you use FX!32 which translates your code on the fly to the alpha, and saves it, so that the next time the code gets executed it is already translated, and it still looks like an x86 to the program. I believe the translation was also saved to disk, so the more you used your program, the less translation overhead you have.
Of course this new approach was taken further by re-optimizing code as needed, and providing hardware support for an optimistic translation.
What will the next iteration be?
The Curso technology will probably make the Java chip a reality and improve upon the idea all in one shot. Program the morphing code to run Java Byte Code and you have a Java chip which optmises your code. A very clean language that you can easily maintain and not sacrifice execution speed. Every Software development Managers dream!
If at first you don't succeed, skydiving is not for you.
Man can always beat the machine because man can use the machine but not viceversa.
Interactive compilation sounds like a great idea, but it can get incredibly complicated/boring real fast. The problem is the compiler must assuming many things to just set up the entry, and exit for loops, etc. You would have to guide the interactive compiler through all this which would get very boring and probably be more confusing. I wouldn't mind a compiler which profiled subroutines, etc. and said 'this is x clock cycles, here is the asm I came up with, do you wanna screw with it?'. *that* would be nice (IMHO).
Your assuming that the JIT compiler was a Good Programmer w/ Assembler. How much time was spent on conformance rather than raw speed for these JIT compilers?
-Unresolved symbol? Byte me!
On large systems, master C++ programmers, master programmers have had these problems in real world applications. It's not so much not knowing when to delete objects so much as making sure you delete enough pieces when you delete them. If you delete an object that is referenced by another then you're going to have a crash. If you delete an object but not the references it contains then you have to have another reference to each of those objects and you have to know what you're down to the last reference so you can delete the object.
I agree about the sloppiness of thought associated with java and GC but that laziness and the willingness to be sloppy is there regardless of whether or not the programmers are using GC. TakeGC away and they may be a little bit more careful but they've already shown that they are willing to be that lazy and sloppy.
This is my signature. There are many signatures like it but this one is mine..
3. as WORA is important in java, it is hard for java to use any platform specific resources such as Graphic card's accelerations. that is
one of the reasons swing's design is getting ugly as it tries to boost its performance.
Surely it's the JVM's job to abstract hardware acceleration in the same way as, say, DirectX or OpenGL? For instance, my windows JVM might know about directX and so implement the primitive drawing functions within the swing classfiles using the accelerated calls. I don't know if that's what happens, but it strikes me as the logical plan.
---- Den ene knappen er powerknapp, den andre er Bender voice knapp "Bite My Shiny Metal Ass"
What about the amount of processing time the JIT takes? The JIT runs as a process to process the instructions. I beleive that would create significant delays to the code (stealing cycle time for JIT process and not code execution). Thoughs?
I read the Lagoona PDF essay. Is a Lagoona system currently available? I'll look at the site again. This is the first time that any Oberon-style language has shown up in this thread, although Oberon is VERY relevant to this dicussion. Oberon is type-safe, garbage collected, can run as an OS on bare Intel hardware, is very fast, and compiles 10 time faster than C++. I haven't implemented this yet, but it should be easy to write an Oberon program that constructs a code fragment, calls the compiler, and imports the compiled code into the running program. Look at http://www.oberon.ethz.ch/native/ . There's a version that runs on top of Linux, as well as a pure native one.
Nevertheless I'd like to point out that JIT is based a somewhat dangerous technique: a program that alters (its own) code. I believe this technique was used in the eighties to scare off hackers by making code incomprehensible and hard to disassemble until the program was actually running. Also (even on a good old 6502 processor) it's possible to make some speed improvents peeking and poking into the code you're actually executing.
:) if your JIT VM forces periodic cache writes to RAM (Not every time but enough times to ensure some sort of coherence if the cache and RAM are out of sync) you lose a bit of performance but gain a stabler setup.
:)
Not at all true... self modifying code is still around not only on JIT compilers but in interpreted languages and even device drivers (Or how do you think some device drivers adjust in real time to hardware changes?). And it's not a dangerous technique in itself, but (As usual) when executed improperly it can be a wild beast (As most people discovered when thy installed their first Cyrix and AMD 486 processors and discovered most x86 code wouldn't run properly because of cache integrity issues)
When compilers for Microcomputers got faster and most processor architectures (known as Harvard architecture, I believe) explicitly require a division of RW and RO memory segments, self-modifying code was abandoned...
First of all, Harvard machines require separate instruction and data pipelines and memory spaces, and 99% of the CPU's on the market (General, Embedded, etc.) are Von Neumann machines which use a single space to store instruction and data. The thing is, newer CPU's (Even if they're Von Neumann desings)have separate caches for data and instructions and most self modifying code violates the cache integrity (See above) because the code is modified in the cache but never stored back into memory (Unless you use a write-through cache design which is impractical from the performace point of view).
BUT (And that's a big BUT
Disassemblers for JIT won't be as complex as a JIT assembler just because when you disassemble a piece of code you treat is under the black box principle (What goes in and what goes out) in order to derive the fundamental principles and algorithms, which can be implemented in 1E999 different ways, so if you disassembled the less optimal morph, bad luck... disassemble another test run and see if you get a better implementation of the algorithm. Or you could use the aforementioned principle and implement your own algorithm.
JIT code/compilers, BTW are also being tested to produce self modifying chips (ASIC's and FPGA's) under VHDL/Verilog using also NNets or Gen. Algorithms to obtaing a first implementation and then using the JIT to optimize it...
Really interesting stuff (The Transmeta Crusoe) and my bet is that soon other companies will follow altough not for the same reasons as Transmeta
ZoeSch
I hate to agree with davecrazy but...
The obvious answer would be that the JIT/VM code takes up clock cycles to convert/schedule/optimise the code, so there's always a significant overhead (JIT removes unnecessary (sp?) conversion, but there's still some overhead).
--
malloc() / free() in conrast have rather more work to do. malloc() has to search through a number of data structures to find an appropriate chunk of memory, and free() has to return memory to the heap in such a way that the heap is still usable.
My point is that when you have a garbage collector that can take an overview of the whole heap, your allocation strategy can be much more simple. The most simple possible case would be to have a pointer (c.f. stack pointer) that simply moves forward through memory by the size of each block allocated. Then all the work falls to the garbage collector which has to sort out a heap that may potentially have a number of gaps in it (corresponding to objects that have gone out of scope in 'non stack' order).
In practice garbage collected memory manager that worked in this way would be a little naive, but I would still maintain that a GC system should be able to achieve better performance than a malloc() / free() system because malloc() / free() have to micro-manage the heap, keeping it in a usable state after each call whereas a garbage collector can do all its heap optimisation at once.
Actually I'm cheating a bit here, so I better come clean before someone else points it out. malloc() / free() can actually enjoy all the benefits that I attribute to a garbage collecting memory manager if you implement them like this:
void*malloc(size_tsize)
/*getmemoryfromagarbagecollect edstorageallocator*/
/*donothingatall*/
{
returngc_alloc_new(size);
}
voidfree(void*mem)
{
(void)mem;
}
In other words you can implement malloc() / free() on top of a garbage collector. This is what happens with the garbage collector libraries for C++; new() just calls the GC's allocator and delete() does nothing.
Andy Armstrong
As the microkernel unix people used to say before they finally died out, "We are within X% of the performance of the original". - Alan Cox The JIT guys can write all the papers in the world about how they should (yes, really should) be able to get native code speed. I'll believe it when I run Java bytecode and it's not slow.
The evaluation of an action as 'practical' . . . depends on what it is that one wishes to practice.
This site really *is* aimed at users isn't it! If that's technical, my name's Rumplestiltskin.
I do agree that machines should spit better code than humans do, but we are just not there yet. When will we get there? Maybe tomorrow, maybe next month, maybe next year, maybe next century. Nobody knows. Suppose we do get there tomorrow, I believe it will not be Java + JIT that made it, for the following reasons:
If that breakthrough were to come tomorrow, I believe it is most likely to come from the functional programming people. I believe that is the way to go for the future.
I don't see where your argument about placing memory for new() on the stack holds up. The point of dynamic object creation is to allow the program to create an aribitrary number of objects and then free() them when done with them.
Trivial example that really makes the stack argument fall apart:
Let's say I have 500 objects new()ed. I'm assuming that the first object created would be on the bottom of the stack, and the last object created would be on the top. I now delete the first 499 items. I now have 499 * object size bytes of memory on my stack that I can't use becuase I still have the 500th item on the stack.
I can't recover the memory because it's on the stack.
Am I missing your argument? Help me out here.
--
then it comes to be that the soothing light at the end of your tunnel is just a freight train coming your way
Now, I have yet to see anyone provide concrete examples of a garbage collector performing better for the general case, than standard memory management. However, let's assume that the garbage collector provides a small performace gain overall.
Unfortunately, as others have pointed out, compilers don't understand the objective of the program they are dealing with the way assembly coders do. The same is true for a garbage collector.
I can speak from experience as a game programmer here. There are certain parts of a game which are time and performance critical. How can I trust a garbage collector not to slow things down in that section? Well, I can't. There has to be some form of programmer interaction telling it not to cleanup in section X of the code.
This isn't very practical for games though. Typically we want full control over the memory. We know the upper limit of what we need, and we don't care much about other applications. So we can allocate a large chunk at start time and then work within that, to avoid the new/delete/malloc/free overhead.
If removing "antiquated" things like pointers is the best solution, then we'd all be writing games in Java. OK well that's a stretch but you see my point...
Best regards,
SEAL
It was years before gcc noticed the Pentium existed. All the new x86 technologies like MMX/3DNow/KNE(Pentium3) are barely supported in either gcc or MSVC++, which together, probabls account for a majority of all x86 code compiled and run. No, I'm not volunteering to do any work to fix this, but it does seem wrong. In the real world, Intel threw $$$ at Cygnus to get their Merced supported.
Frankly, I think Transmeta may be onto the right track here: they ship a good compiler in every chip, so they don't have to wait for the crucial (software) compiler technologies to catch up.
Is it just that compilers are inherently big ugly nasty pieces of code that hate being mucked with? Is it a lack of people wanting to work on something 'unsexy' (and producing 15,000 window managers instead)? As long as compilers really are *BEHIND* the curve in codegen vs human asm coders (who'll use all the instructions), compilers will continue to lag.
Nathan Mates
Just like to point out for the (n+1)th time that the Java language itself is the main reason that Java performance is "bad". In particular, function dispatch (we have to crawl up the class chain to find the real function), runtime tag checks (every time we downcast), tag creation (so we can tag check), null pointer checks (very often). This stuff can NOT be helped by compilers (it's a losing battle since most of the time you are dealing with noncomputable problems). A big clue to this should be the fact that the language designers special cased ints and chars to NOT be objects.
Certain "better" type systems (there are usually tradeoffs) don't require such nonsense, and are limited only by compiler technology.
When compiling a statically-typed language like (insert your favorite here), the compiler can often do automatic theorem proving to generate much better code. In the next few years, this will come a lot farther (compilation of traditional languages is near the state of the art).
By the way, I do agree that compilers generate better assembly code for languages like C, but only because of the sheer hugeness of the code. Hand-coded inner loops (by someone who knows the architecture well) will beat compiler code any day. If you've ever studied the code generated by a compiler, this should be clear to you. The main reason being that the compiler can't make an optimization that it can't be SURE doesn't change the behaviour of a program... stuff like the possibility of pointer aliasing and linktime magic with function calls means a lot of wasted optimizations. But a programmer is often much more easily able to convince herself that certain situations can't arise -- because she knows how the program works. Most advanced programming languages remove a lot of the ambiguities through type systems, allowing for much more powerful automatic optimizations.
- Tom 7
In general this is not true. First, an address calculation takes at least an addition operation. Even in a parallel-execution processor where such an address calculation can be performed simeultaneously with other instructions, the address calculation must be performed before the memory access. This prevents out-of-order execution of the instructions, and prevents pre-fetching from the cache at least until the address is calculated. There are some cases where it might be the same speed (the cpu can execute operations in parallel, the address generation does not block any other instruction from executing, and the requested object is already in the cache) but in general dynamic memory referencing is slower than static allocation.
...not for twits with ego problems.
"1...not a JIT system but a dynamically optimizing compiler" a slow start is a big problem (which is why they are emphasizing long-term server programs), but that's not the main issue. There's no reason that profile-based optimization can't be applied to C programs. That there's an experimental JIT (yes, it's still a JIT) that's ahead of the C compilers in common use is no reason to claim that the JIT strategy is better.
"2...as long as the results of the operation don't change" Which means that if the representation is different in a meaningful way they have to check whether the results of the operation change, reducing the efficiency, which was my point.
"3...on all but the most simple applications" But I rarely do anything but the most simple applications. I allocate an array, then I free an array. I do this for hashes, trees, stacks, queues, and pretty much any data structure I use. Only when I am being extremely lazy will I allocate and free objects recursively. Modern CPUs with slow main memory and fast cache like it when you keep all the data together, so I generally do so. I also try to access it linearly, which they also like. The point holds.
Don't try to compare the most primitive and unoptimized methods of custom memory management to the most sophisticated garbage collection algorithms. Remember, too, that a person doing custom memory management can write the same sophisticated garbage collector that Java would use.
And don't confuse "primitive" with "fundamental". Pointers are fundamental. Having to analyse reverse polish code function to prevent dangerous operations, rather than making it impossible to construct operations you don't want, is primitive.
"4...then tell me that dynamic dispatch is a problem." Okay, dynamic dispatch is a problem. Just because it can inline methods here and there doesn't mean you won't take an overall performance hit. Besides, that was only one example of the ways the lousy Java handholding slows things down.
The Sun marketing papers you linked me to are just more of the same old Java hype. They'll go on to "prove" that their new super JIT is better than anything you could code by hand by comparing optimized idiomatic Java with the same code translated poorly into unidiomatic C++.
Idiomatic C written by a competent optimizer will still blow the Java version out of the water.
JIT by any other name smells just as bad.
This is very intersting stuff, and I do care about these questions, but virtually none of this enters into my selection of a language for a task.
When you are writing a routine that reads single keystrokes and puts them in a buffer, you really don't care much how optomized the loop is unless it takes more time than elapses between the user's keystrokes.
In other words, I'm very glad people are working on smarter compilers and Java optomizations, but I select Java for one reason alone:
It is easier to write more code more quickly with fewer bugs in Java than in C or C++.
Most of us in MIS find ourselves in situations where we can throw more hardware at things much more cheaply than we can pay people to write and debug code.
Certainly there are situations where we are prcoessor-bound, or where data comes from a high volume network source and then we must consider performance, but since in many cases even a slow language is "fast enough" for the maximum throughput, the ease and reliability of development come to the fore.
So, if I want a fairly portable device driver, I write in C. If I want a stable application, I choose Java. For me it is this feature of the language, not its speed (or lack), not its "write once, run anywhere," but rather the ease with which stable code may be written that draws me to Java. After that, I am concerned more with the footprint of the VM and its impact on non-Java system components than I am with the optomization of Java byte codes.
I say this not to disparage this discussion, which I really do find enormously interesting, but I must say that for many of us, this is the computing equivalent of counting angels on pinheads.
I'm surprised nobody else has commented on the obvious here. Sure, maybe dynamically recompiling is awesome. But doing it twice at once isn't. Just imagine your Crusoe chip screaming in agony as the code-morphing tries to profile and recompile code that is being dynamically recompiled by the JIT compiler. And how do you fix that?
By abstracting your two smarty-pants recompilers into a "hardware" and an "interpreter" layer, you've removed any way for them to talk to each other or even really be aware of the other's existence. The problem's not theoretically impossible, but the practical advantages of keeping your abstractions clean make me worry about the hopes for Crusoe web-pads (which would presumably run a lot of Java).
Preferential Voting: easy as 1-2-3
The fantasy is that you're some super studly asm c0der who can not only produce properly scheduled code, but al...
Thank you, I couldn't have said it better myself.
--GnrcMan--
Why don't we just ban all programming languages and go back to flipping switches. That'll guarentee job security.
-Elendale (I know the driver is a bad example, so don't correct that!)
IANAT (I Am Not A Troll)
I guess it was inevitable that this debate would lead to this kind of argument, but somehow I think most people have missed the original point... If you tell a hardened ASM programmer that a compiler can produce better code than they can, the knee-jerk reaction will always be, "No way!" But we're not talking about static compilation here. The subject is runtime compilation, actually producing code during execution.
If you are writing any sort of genuinely generic code, then the likelihood is that you will not be able to envisage every possible use that the code will be put to. What this implies is that as a programmer, you cannot know the exact circumstances in which your code will be used. A run-time compiler knows the exact circumstances, including how many registers a calling function may have free, how frequently a particular piece of code is executed, argument and loop invariants, and so on.
This knowledge allows the compiler to improve the efficiency of argument passing, memory usage, cache usage and various other things dynamically according to the circumstances of execution, and it is knowledge that the original programmer simply cannot know. The only way a human programmer can perform these kinds of optimisations is through making assumptions as to the code's usage, which defeats the purposes of generic coding.
Think along the lines of writing a piece of code, then seeing the compiler product different chunks of machine code from the same source, depending on its usage. Maybe the compiler notices that a particularly frequent part of the program consistently calls your function with the same parameters. It can use this knowledge to compile your program with the constants in mind, thus saving register usage and parameter passing overhead, for this particular case. For a human programmer to perform this kind of optimisation, they would have to analyse every possible use that their function was put to. That simply isn't practical for non-trivial programming projects.
If we really want to debate the implications of run-time compilation, the question should not be whether a human programmer can optimise better than a static compiler. We should be asking whether the advantages a run-time compiler has over static compiler are enough to outweigh the overhead that is incurred by the compilation process.
Someone pointed out that a 700MHz Crusoe processor executes code at approximately the same speed as a 500MHz Pentium III. It sounds like quite a big overhead, but it would be a more meaningful statistic if we knew how fast a Crusoe processor would execute entirely native code...
Nik.
OK, but how about this scenario: I want to hold a number from 0 to 1,000,000. It's too big for the lowest common denominator (16 bits) so I whack it in a long. On a machine with 64 bit ints, I just went sub-optimal.
So the whole idea of 'int' as this optimal size is fairly useless in practice, IMHO.
What I really want to do is tell the compiler "I want to store 0..1000000" and let it sort it all out. Then I can decide whether I want to compile for size or speed [at runtime, in the spirit of this article], or if I definitely want that value in a long because some other code needs it to be that size.
Flame me if I'm wrong but I think that's the same compromise:
Qt is easier to hack with and much more intuitive to program. As a result u develop apps quickly.
in Gtk OTOH, under the toolkit there's an implementation of OOP in C, thus you have to follow a protocol to code Gtk apps. Its more complicated but you can control everyting.
In short term, obviously Qt becomes more rentable, but I think in long term Gtk apps will stand out as they will let programmers more control over the underlying code.
-----------------
Don't take me too seriously, I'm more a philosopher than a programmer
Wow, that's really useful. I mean, just the other day, I was thinking to myself "Man, what I really need is a blisteringly fast program to solve the 8 queens problem."
And the fun factor is incredibly high.
Different strokes.
Whatever methods that may be used to improve bytecode speed/efficiency can be (and will be) applied to native executables. Therefore any speed gains made by Java/bytecode programs will also be realized by native binaries. That leaves JIT compiled programs with extra overhead and therefore slower.
Actually, computers tend to be even better at speed chess than at regular chess. The reason is that computers analyze games by expanding the game tree as far as possible. Because the chess game tree grows exponentially, this leads to diminishing returns as the amount of time increases; if you watch the "ply searched" indicator on a chess program it'll generally jump to somewhere between 3-5 moves almost immediately and then appear to stall. However, even this shallow game tree is enough to catch an unsuspecting human in a tactical trap.
The reason humans can still beat computers is that they can apply long-range strategic planning -- "what are the best ideas to pursue in this position?" -- and rules of thumb or experience about what sorts of things are likely to happen in a certain type of position. Working from this, they then decide on a small set of moves to evaluate, usually in a narrow tree (because humans can generally discard moves which are irrelevant out of hand; I believe some better computer algorithms try to do this as well but I'm not sure) This approach doesn't really start to pay off until the human has had some time to think, but once it does it can allow the human to catch up to the computer, which is still examining every possible outcome of shuffling the rooks back and forth.
Disclaimer: all of the above is empirical guessing and may not bear any resemblence to reality.
Daniel
Hurry up and jump on the individualist bandwagon!
Two points:
1)Logically no perfect compiler can produce better code then a perfect human coder.
Note this statement has a couple of implicit statements - given infinite time, and the code may be of equal quality.
This is because the perfect human coder programmed the compiler, and programmed the application. Since the compiler acts in a deterministic fashion the programmer can create exactly the same code as the compiler.
2)The issue of dynamically recompiled code is indeed an interesting one. Whether or not the extra overhead need to monitor and recompile is worth it can not be determined ahead of time. Most likely this would be of most benefit to long running applications which use a variety of data sources (such as certain scientific applications).
Bradley
This is a personal anecdote where I would say that the compiler generated better code than I would have.
I worked on an embedded network analyzer product using the intel gcc compiler for the i960. It contained about 16000 lines of C code (not trivial, but not large) that included a bare bones operating environment, network drivers, and the actual analysis code. Of these ~16000 lines, about 500 of them were in the "common critical path" and would execute on every incomming packet.
The intel gcc compiler for the i960 has this is really cool feature called "profile based optimization". You build an instrumented version of the executable, run a set of test data through it, get the profile data, and re-compile with the profile data. This optimization feature had the added advantage of closely integrating both the compiler and the linker.
The main benefits of this optimization included:
Setting the branch prediction bits for the most common cases.
Inlining functions with high hit counts : both within a module (traditional inlining), and across many modules (an advantage of linker integration)
Placing variables with high hit counts in faster memory (the i960 had 1k of on board, single-clock DRAM, and our card had both 2-clock SRAM and 4-clock DRAM)
Inlining all functions that get called only once in the executable.
I saw a 150% performance improvement over the non-profile based, but otherwise fully optimized builds.
The key 500 lines were interspaced through about six functions in four modules, often in functions where half the function was error/unusual path handling. Although I could have rewritten those 500 lines in asm, it would have been very hard to keep it integrated with the rest of the code. Also when I examined the instruction trace of those 500 lines, I found only two or three places where I would want to change things.
Intel's x86 compiler is supposed to have similar a set of profiling optimization features and another feature that's even more important on the x86 that lays out functions and data in the executable to reduce cache misses and page faults based on the profiling data.
In the "me too" catagory: Optimizing for Alpha processors is a nightmare best left to the compiler. Just try to single-instruction-step through some Alpha code and see how much you move around within the C source.
Peter (first /. post!)
You make a totally valid point here, but I don't think that it is at odds with the original comment! Yes, one of the big advantages of using a compiler is that we don't have to rewrite swathes of code every time we change anything. (And this is true even if we stick to one architecture - just a minor change in algorithm can mean a lot of effort in asm.)
But that's beside the point: the fact remains that if we do spend the time to write the code by hand, we can always do a much better job than the compiler. Compilers are, frankly, still rather crummy, but they're good enough that we put up with it to reap the very worthwhile benefits they offer.
(I think we may all be in violent agreement...)
And I seriously doubt you've ever experienced the superior power a machine learning system has on some kinds of complicated optimization problems. While you are going into great detail about how to optimize the compilers even more you are running inside a big labyrinth and at the end have to rewrite and rewrite even your compiler again and again because you have to change some little things to make it perform better on this and this new cpu or architecture. This is the best example the author of the original article could have for writing unmaintainable code. On the other hand, if you step back and get the bigger picture and abstract the problem, you'll quickly see that over the time there will be a paradigm change towards computer-generated code. Whether it is just the translation from abstract languages into machine code that is done by some intelligent program, or whether it is the whole programming at all is a matter of time. The time will come whereas the exact definition of the problem is the solution. While today yet only relativly simple problems are solveable through the help of genetic programming systems, with the increasing computing power and even more important the increased networking of this power worldwide, we will be able to solve even the most complicated optimization problems in an evolutionary way, striving for perfectness. --scut
This isn't quite right. If either:
- The programmer doesn't know everything about the target machine (eg. there's more than one), or
- The programmer doesn't take the time to optimize the code (it happens all the time)
then a JIT compiler could be faster than a human coder.While I agree with the critics that Java is probably not the programming language of a new machine architecture, I do think Just In Time, as an essential idea, is the future.
Eventually, either the nanotech or the re-programmable chip scientists will have the necessary break through. Both chip "architectures", the one being based on random molecule distribution, the other on re-programming the chip as it works, will require software conditioning and optimization - over a *long* period, when measured in nanoseconds. In either case, the theory behind Crusoe and JIT is likely what we'll be using - and developers will be "teaching" new computers rather than loading assembly programs on them.
Me, I'm looking forward to it.
-The Fuzzy
This involves one of the internal debates I've had about C for years
the debate here is not about C, it's about hiring nimrods. the problem is that people write bad code, and it's almost unavoidable. no matter how you change the language, there will always be twinks to write bad, unmaintainable, unportable code in it.
...dave
Think different? I'd be happy if most people would just think...
One nice feature of development with typical Microsoft tools is the use of ASSERT. In debug builds (where _DEBUG is #defined), it takes a form similar to C's traditional assert(), otherwise, it is opimized away to nothing. This lets you ensure the healthy state of the code that you're developing and then get a nice fast chunk of code to release.
Now, what if the compiler were smart enough to examine these assertions and base some optimization on them? If you ASSERT( x>=0 && x, then the compiler should be quite comfortable in storing x in a 8bit register.
--mark
one step towards making better self-documenting code
This whole question boils down to whether or not a human mind is deterministic. We can prove that a compiler is, or in fact that any program that runs on current computer hardware is. If a human mind is deterministic then it follows that a human could do just as good as a compiler simply by using all of the same algorithms to generate the code although the computer would certainly finish alot sooner. If a human mind is NOT deterministic then it is clear that a person who had no programming skill whatsoever could beat the best compiler given enough time, a specification sheet of the language he was to generate the given algorithims in and a circit diagram of the cpu he/she was coding for. This is VERY simple to see. A compiler can NEVER under any conditions for any reason EVER EVER EVER produce better code than a human because the human can ALWAYS produce the same code that the compiler did. The question should be is it reasonable to wait the 150 years of time it would take me to compile mozzila by hand or should I build a compiler to do it in a few hours. The next question would be how much time(if any) should I spend hand optimizing the compiler output. A question I have is why cant we do the same thing that a JIT compiler does before runtime? It seems clear that you could easily get very close to as much optimization from a multi pass profiling compiler as you ever could with a JIT without the runtime overhead. Later Spacewalrus
I had a commodore 64 once. Never bothered to learn assembler code for it but even today I'm amazed by what people did on those machines. But it just doesn't scale up to a modern PC.
Perhaps those people from Intel can write code for trivial software but anything that goes beyond that is even beyond the minds of those people. Of course those guys at Intel can program their own CPU's manually (at least I hope so). Hand coding millions lines of assembly code is something else though, let alone optimizing it. The problem doesn't scale linearly you need a computer to optimize that.
Either way a static compiler nor a human being can compete when you start to involve real time profiling information. It gives the dynamic compiler an edge, it can optimize to situations only known at run time.
So, I hope you see my point. I admire people who can handcode a more optimal program than a compiler but they can't compete with dynamic copilers such as transmeta's code morphing or SUN's Hotspot because they simply don't have the kind of information those compilers have.
Jilles
There is certainly a range of people that write code and I've seen both extremes. Poorly written higher level language programs aren't going to be improved by a compiler. There might be suggestions by a good level syntax checker but that isn't the topic here. One of the long term comments I've always made is that you can write Fortran in any language. If you don't understand the language constructs sufficiently, you aren't going to be saved by the compiler. There is no substitute for programming competence.
This post clarifies what I was saying. To summarize: 1. By "modern processor" I mean a processor that allows a compiler to perform good optimizations. The X86 instruction set is not very good for this. I don't believe the 486 even performed any parallel execution. Specifically, I have experience with the Alpha. 2. You are talking about a specific algorithm. As I said, 99% of the time the compiler will beat you. I didn't say assembly was dead or not useful sometimes. Just that most of the time the compiler will generate faster code overall than a human. 3. Facinatingly enough there are cases where your assembly optimization will actually hinder the compiler from doing a better optimization. Specifically, some paralellizing compilers I think.
Check out this link for a plethora of information on compiler optimizations.
--GnrcMan--
Synthesis takes performance to a level that's impossible to match with code written statically, by using things like address information that's known only at execution time, as well as knowledge of how multiple reusable components are dynamically connected.
Check it out (gzipped PostScript file): Synthesis: An Efficient Implementation of Fundamental Operating System Services.
First, JIT compiling equates to a type of peep-hole analysis, JIT compilers simply don't have enoguh information or enough time to do full global analysis. They aren't even in the same league with modern static compilers, there isn't any contest at all.
Second, runtime profiling and optimization is a fairly new thing. Hotspot kind of does it and there are some research compilers that do it but it's still new and the pain/performance gain ratio hasn't been established as something that makes it "worthwhile" yet. It's also important to note that there are even more aggressive optimizations that can be done with static compiling. Some papers (sorry I don't have the links on hand here at work) have suggested as much as a 30% performance improvment with runtime optimization and particularly branchy code (that's where they shine, monitor branch selection and then recode loops to take the most optimal path most of the time to avoid stalling the pipelines on processors that can't take both paths, that sort of thing where a compiler needs to make a selection that can affect performance but doesn't have enough information) but it's still research. I don't know about you but a lot of developers don't particularly liek the idea of trying to debug code that dynamically get's rewritten as it executes. There is going to be more work done in this area as it is pretty hot but it is by no means the end-all yet. Again, you have to understand that runtime optimization isn't going to do global analysis and that's where a lot of the big advances in compiler technology have been made over the last 10-15 years. Higher order analysis and optimization usually yields far more performance than putting polishing tweaks on a few loops that can already be highly optimizated by a good compiler.
Thirdly, java's dynamic nature is a key aspect of it that will always prevent certain types of code from being as fast as C or C++, regardless of the compiler. I'm not going to criticize various weaknesses or compilers for java, but this is something that needs to be remembered and understood. Also, processor time recompiling, reoptimizing, etceteraing code is processor time not spent running code. JIT needs to yield a performance boost that covers the time JIT takes, this isn't easy to do.
I'm also a huge fan of garbage collection, it's particularly nice in large and sophisticated object models where you almost always lose track of who has a reference to what. GC can slow code down and it can also make certain types of code faster, both sides are right on that. The important thing is that it makes code better, most programs don't have time restrictions such that GC would affect them and most good GCs are so fast that the user wouldn't notice anyways. GC makes it easier for programmers to get away with being sloppy and while it may rob performance that's not the reason for it. I'm not sure that a discussion on GC is germane to our topic here. It's generally understood and accepted in research, academic circles, and it's becoming more so in actual development groups that GC is a "good thing" (tm) and it's worth the price of admission, software get's bigger and more complicated and we need tools to help us stop worrying about the minute details as much, it's an economic thing as much as anything because programmers tracking leaks aren't programmers adding value that you can market.
Lastly, and maybe this is just due to my non-expertise on JIT compilers but the trend that I see coming is parallelism. Chip realistate is reaching the point where it's more efficient to put 2 or 4 processors on one die, extra cache isn't paying off as much and adding funtional execution units makes the design much more complex. This parallelism is going to need special compiling technology to exploit it, possibly including runtime profiling and re-optimization. I think that is where it's going and that JIT as it is now is a dead-end. Not that I mind the effort being put in to it but by the time someone get's java running as fast as C++ or C it will be pointless. Maybe I'm fooling myself here, I'd love to be wrong on it, maybe somebody will make some breakthroughs and we will get JITs that do global analysis and optimize code to work with the garbage collector and run as fast as C++ but I'm not going to hold my breath. Maybe I'm altogether wrong and parallel multiprocessor chips can be epxloited with a compiler running outside of the code on a different processor, I don't know.
Your programmer would have to be really good to beat a modern compiler on optimizing for very complex instruction. Also he probably would need a lot of time even to get very simple programs up and running that way.
And then there's still something a java VM (I'm assuming the hotspot type here) and that is dynamically (i.e. while the program is running) optimizing code using profiling information. This also works in situations where a program dynamically loads a class (possibly a class that did not exist at the point the other classes were developed). Your programmer cannot not take this situation into account when he is coding.
In other words, under normal circumstances the assembler superman would lose (badly).
Now, your next question will probably be why java is so slow then -> see my other reply's on this story.
Jilles
It'd only become a good idea if you needed the performance so bad that you could tolerate the code maintenance nightmares.
In many/most places of business it takes a *lot* of performance gain to excuse unreadable code. As the man says, I already have enough performance in most situations: I'm user-bound more often than processor- or disk- or network-bound.
Cris E
St Paul, MN
>C/C++ was not created so that programs could use
>less CPU cycles; it was created so that
>programs could be written more quickly
>(programmer time), more robustly, and vastly
>larger than their assembly
>counterparts.
I Agree.
That's why you write most of the stuff in C/C++ and then if something is time critical you do the 500 instruction assembly thang. Or you buy a faster computer.
>But the scope of problems you can solve in those
>500 lines are so unbelievably small that its not
>worth the time to test it.
I disagree. That is what people have been told over and over again, but this is wrong. For example, you can solve the 8 queens problem in 500 assy lines, and there are a lot of problems in that complexity range. And the fun factor is incredibly high.
rmstar
As a side note: I remember hearing FX!32 also keeping some information from the JIT compilation and storing it on disk for the next time the
application is run. Anyone knows whether this is still used and to what success?
Yes, FX!32 would actually profile the application and convert parts to native Alpha code, storing the native code in a database. Incidentally, the profiling happened during execution, while the optimization happened during free time, after the program was done being executed. The more you ran the program, and the more code paths you hit, the faster it would get.
It's a moot point now, though, since NT doesn't exist for the Alpha anymore.
--GnrcMan--
I strikes me that the Crusoe could have have effectively two
complete cpu's inside. One to do code morphing / JIT, the
other to do VLIW processing. They could have entirely
different instruction sets.
The jit cpu could be working independantly of the VLIW cpu,
without 'stealing' it's cycles, and feeding it.
The VLIW cpu may not need external bus connexions.
Otherwise I'm puzzled about how you you would get the
performance claimed.
Jamie
Massivly parallel creations are largely very close to what the programmer wrote, good compiler or no. Compilers for this case are especially weak, and only make vague and half correct assumptions.
.sig: Now legally binding!
This is a timely post for me, as I recently attended the ACM workshop on Dynamic and Adaptive Compilation & Optimization. I thought I'd highlight a few interesting projects which bring up issues that haven't been discussed here. I'll pull these mostly from the panel, which described full systems rather than interesting techniques).
i sw/DynComp/www/ ) which acts like a dynamic compiler for C and Tempo ( http://www.irisa.fr/compose/tempo ) which requires explicit annotations but can probably be a bit more aggressive as a result.
First, note that code which is structured clearly and readably for programmers isn't usually structured perfectly for compiler optimization. Calling conventions add overhead, inlining adds code bloat---and in the presence of e.g. dynamically linked libraries, there's no opportunity to optimize.
One of the biggest overheads is dynamic branching---this is why dynamic method dispatch in Java/C++ is slow, but it also means that the simple act of returning from a function call is also slow. From this perspective, the Dynamo project at HP ( http://www.hpl.hp.com/cambridge/projects/Dynamo ) is particularly fascinating (I've heard tell that many of these folks now work at Transmeta). What dynamo does is it _interprets_ fully-optimized native binaries! When it spots hot code, it unrolls it, determining where all the dynamic branches are going. The unrolled hot code is then optimized and run without interpretation. The result? When dynamic branches usually go to one place, the unfolded code can be executed in a straight line. The speedups (again, with already- optimized code) are impressive.
Second is the Jalapeno system from IBM ( http://www.research.ibm.com/jalapeno ). This is a dynamic Java system written entirely in Java. It does optimizations akin to Sun's HotSpot. The interesting idea here, though, is the ability to optimize across the run-time boundary---because the system is Java code as well, Java internals can be inlined and specialized within user code.
Finally, few people seem to know about partial evaluation. The idea here is that a lot of code depends heavily on a few run-time parameters which get set once and then (almost) never change. So
what we can do is generate a partially-optimized program where we "fill in the blanks" at run time. Note that these blanks would normally be references to variables, run-time condition checks, and so forth---so the partially optimized program is already at least as efficient as our original code run through an optimizing compiler. At run time the blanks are filled in and some more (usually simple) optimizations are performed, usually making the resulting code substantially more compact and faster. Two extremes (both for C programs) are the DyC project ( http://www.cs.washington.edu/research/projects/un
-Jan-Willem Maessen
I'll simplify:
1)Anything that can be expressed in Java can be expressed in hand-coded assembly; the reverse is not true.
2)Humans are smarter than computers.
3)Humans can always use computers as aids to their work, and so produce output equivalent to what the computer can produce as a baseline worst case.
On a machine with 64 bit ints, as on a machine with 32 bit ints, ints and longs are the same (assuming the compiler was competently designed; it could fit the standard without being). Using long should only ever confer a performance hit on 32-bit machines, for 32-bits and over, long is pretty much guaranteed to be the same as int (AFAIK, the standard doesn't force it, but it doesn't require any more range in the long, so it would be stupid to make it bigger).
If you want to define your own special int ranges, you can always use Ada. That's the kind of language you get when you throw in every nice little feature.
That's just not true and you know it. If the standard libraries are Java's (only?) strength then how could it's object orientation not be a strength too since the standard libraries are object oriented. Also, portability is much greater in Java than in most other languages. Consider the fact that there's no need for something autoconf in Java as an example. Sure, the same's true for Perl, but that's not a language, it's an implement of torture.
I can also write code in ML that can be proven correct at compiletime, since it's statically typed and lexically scoped (Java can still come up with an exception and die on you).
So my question to you is: How are your languages so magic if none of the software I use on a daily basis uses them, or seems to miss their properties?
Some guys here at CMU wrote a TCP/IP stack in ML that was provably correct. It took them a whole year. Was it worth it? You could write a whole operating system in that time!
...
The stack-based ISA for Java is slow. It doesn't map well onto actual processors. It's a big reason that even JITC'd Java isn't very fast.
There's no reason not to use a language like Java with slightly stronger typing than C++ (although there are good reasons to use C++; there are powerful things C++ can do that Java can't). But it should be compiled, even if "compilation" is something that happens the first time you run a program, and it should be compiled from a more intelligent format than a nonexistent ISA. Why not distribute source code after the frontend has run on it, with all the symbols resolved and the syntax broken up into lower-level constructs? That would be as hard to reverse-engineer as assembly, and significantly moreso than Java classes (which leave way too much info in there).
...
Being able to state with supreme confidence that this type of bug does not exist in my code because it has been rigorously proven is helpful. You really should be more concerned with running correct code than running "fast" code.
Except for the fact that we've gotten everything done so far by writing fast code, and testing it to see that it's reasonably bug free. Proving code with Java is still quite hard, and writing code at all in ML is very hard.
Empirically, companies will trade off getting the software written for having it proven; they've been doing things that way for quite a long time. Java certainly won't change that.
Does this mean I just have to wait until I can afford something with the power of Deep Blue and then I'll have a compiler that can generate better code than a person? Given current exponential rates of improvement I guess that's not too bad.
Problem is, I'm not sure it's true. If you consider chess to be an optimisation problem, I have my suspicions that it's actually a good deal more complex than chess. Chess is relatively narrowly-defined. It feels like the space that code optimisation has to work in could be far broader?
It certainly feels like a different problem - one simple way of doing better than the compiler is to look at what the compiler produces and then remove all the stupid mistakes it's made. (Modern compilers tend to make 3 or 4 on every page of code. And I mean really stupid - completely failing to notice that values were already available in a register and reloading/recalculating them, that sort of thing.) This is not likely to be a successful approach with a chess game...
Ian Griffiths
that should read "should only ever confer a performance hit on <32-bit machines"
When I hit preview, it changed the < to < in my text, and then thought it was the start of a tag when I hit submit.
Lesson: Don't use preview!
In theory you are right. A compiler should be able to produce better code than most any human. In practise it doesn't work out that way cuz compilers are just more complicated than most people expect. In theory, theory = practise. In practise, not.
No. Hand coded assembler, coded by a clueful programmer, will always be better than compiler generated code.
You are wrong.
With modern processors, a decent compiler can do better general optimizations than you or any other assembly programmer I've ever heard of. While there is that (less than) 1% of the time when you can hand optimize a small piece of code, that's the exception rather than the rule. The rest of the time you are better off letting the compiler do it's job.
--GnrcMan--
hey, I'm in topeka, the capital of Kansas, and while yes I don't like Kansas, just because there are idiots who by some miracle got responsibility, it doesn't mean that there aren't smart people in Kansas. Ever heard of newtek or Quvis? http://www.newtek.com http://www.quvis.com
This Wiki Feeds You TV and Anime - vidwiki.org
> In addition, I simply don't believe it's impossible to
> a) Tell the compiler in more detail how to optimise your code
I've always wondered why there aren't programming languages out there that provide direct support for this. There are lots of sorts of optimization hints that we programmers ought to be able to provide to the compiler... knowing that not all compilers will be smart enough to take advantage of all of the hints. But I've never seen a programming language that supported compiler hints in the language level. Am I simply ignorant, or is there a reason why people don't build this into languages?
-- Michael Chermside
--
Brian Fundakowski Feldman
You can't compare Java to Assembler. They are different languages and have (or at least should have) different purposes. While it is possible to write highly efficient Assembler code, this comes at a high price in terms of programmer time & expertise.
As an experiment, try writing a fairly simple program (a basic implementation of the *nix sort command, for example) from scratch in both Java and Assembly. Personally, I could probably have the Java version finished in an hour or two, while the Assembly version would take me (at least) a whole work day. (I'll admit that my assembly is a bit rusty) Yes, the Assembly version would be smaller, faster, and more elegant, maybe even noticably so on large files; but the Java verson would be done and working in a fraction of the time.
Back in the good old days, programmer time was far cheaper than CPU time -- an extra couple of days worth of coding to shave off a few K or a few CPU cycles was a cost-effective use of your resources; now, however, the reverse is true. For most modern applications, the small increase in performance you could potentially get from writing in Assembly just isn't worth it in terms of time and aggrivation.
Different computer languages have their own strengths and weaknesses. I would not willingly use Assembly to write a GUI front-end for a database application, any more than I would use Java to do somthing like signal processing on an imbedded microcontroller.
Why is it that the proponents of "one nation under God" are so eager to get rid of "liberty and justice for all"?
1. Why the hell would you have code like that? It sounds like trying to reuse a variable for two different purposes. Declare two different variables and the compiler will sort it out.
2. Why would you want to use the upper byte for a different value? Unless your working on a weenie chip with only four registers....oh.
3. Byte operations suck on all modern processors I know of...including Pentiums. Congradulations, you've just un-optimized your code.
4. Modern compilers can perform the optimizations you describe, If the optimization is beneficial...which is dubious in this case.
--GnrcMan--
I do not intend to insult the authors of FFTW, but their FFT benchmarks are severely flawed and biased in favor of their own algorithm. I know (based on personal experience) that IMSL and the SGI HPC libraries are significantly faster than FFTW across a broad range of parameters. Portable C libraries exist (eg. Bernstein's FFT) that are faster than FFTW by a significant margin.
Heh, well maybe not so violent :) :) And as I thought of my reply I remembered the old Emo Phillips gag, which I trawled the 'net for, and here it is...
But, the devil is in the details
(Score: -1 for Off Topic, +10 for Funny)
-----------------
I was walking across a bridge one day, and I saw a man standing on the
edge, about to jump off. So, I ran over and said, "Stop! Don't do it!"
"Why shouldn't I?" he said.
I said, "Well, there's so much to live for!"
He said, "Like what?"
I said, "Well...are you Religious or Atheist?"
He said, "Religious."
I said, "Me too! Are you Christian or Buddhist?"
He said, "Christian."
I said, "Me too! Are you Catholic or Protestant?"
He said, "Protestant."
I said, "Me too! Are you Episcopalian or Baptist?"
He said, "Baptist!"
I said, "Wow! Me too! Are you Baptist Church of God, or Baptist Church
of the Lord?"
He said, "Baptist Church of God!"
I said, "Me too! Are you Original Baptist Church of God, or are you
Reformed Baptist Church of God?"
He said, "Reformed Baptist Church of God!"
I said, "Me too! Are you Reformed Baptist Church of God, Reformation of
1879, or Reformed Baptist Church of God, Reformation of 1915?"
He said, "Reformed Baptist Church of God, Reformation of 1915!"
I said, "Die, Heretic Scum...", and pushed him off.
> if your JIT VM forces periodic cache writes to
> RAM (Not every time but enough times to ensure
> some sort of coherence if the cache...
Actually, a little-known fact (surprise to me anyway) about the Java specification is that it
only guarantees a cache sync when starting or finishing a synchronized block, or when reading or writing a variable marked as volatile. (See Doug Lea's book.) Bugs like this are nearly impossible to test for. I wonder how much Java code will break on multi-CPU machines?
Slothmonster
- He has no time to do it
- He is not able to keep track of his own optimizations -- at least not for systems of interesting size
- He will make errors. To say it short: A compiler that makes errors is buggy, a programmer that makes errors is human. I can fix the compiler but not the programmer.
- The compiler is build upon knowledge from many good programmers. This knowledge can increase for a long time. The programmer will reach his bounds before the compiler will. And he will eventually forgot things.
Don't misunderstand me: if a profiler tells you that there is a function consuming 10% of your CPU time and you are experienced in assembler: go for it. Compiler are not that good today. But they will be - more sooner than later. And: optimizing code is not really the creative problem. Designing code is -- if optimizing means redesigning for you, you should think on your designs earlier and maybe on a more abstract level. But maybe I'm just blinded by OOA/D/P and you're right. Time will tell. Peter-- CAUTION: Don't read this posting.
For example, in x86 assembler it is possible to save some cycles (and memory accesses) by using 16-bit or even 8-bit registers instead of 32 bit registers.
Plain wrong. (Not only) in this special case you have to consider the native register width of the cpu, the width of different busses, first level, second level cache strategies, main memory access strategies and architectures.
For example, every capable os on x86 is nowadays using code segments which default to operations being 32 bit wide. Within the same opcodes it is also possible to switch to 8 bit wide operations. But if you use 16 bit wide operations, you'll have to add a prefix byte every opcode telling the cpu to use the 16 bit variant. Depending on the whole scoup of caches and memory subsystem this can actually slow things down.
At this place we don't have even considered different cycle times for the various x86 derivatives: every x86 cpu, whether it's an Dementium(tm) or some other cpu, executes the same instruction using a different number of cycles.
Conclusion: while in some cases humans can actually write better code for one cpu and mainbord, we're better off letting the system itself collect appropriate statistical data and use that to optimize on the fly step-by-step.
As a side note: I remember hearing FX!32 also keeping some information from the JIT compilation and storing it on disk for the next time the application is run. Anyone knows whether this is still used and to what success?
Harry Deer
There's a wonderful counterexample to your arguments, which is FFTW (the fastest fourier transform in the west). This is a code that dynamically generates a highly optimized algorithm for doing FFTs. It is written in 100% C, and outperforms (not all the time, but often) hand coded highly optimized FFT routines.
It does this by actually running timing tests, and using the results to optimize the algorithm for your computer, for the size of array you want transformed. It is quite amazing. Here are the benchmarking results from a RS6000 of some sort, comparing (among others) FFTW with ESSL, the IBM hand optimized FFT library.
This is sort of a strange example, in between JIT compilers and normal compilers, and is a case where the compiler actually knows a heck of a lot more than the programmer does. Anyhow, figured I'd point out a cool code that is able to optimize itself.
JAVA version = program code + VM + JIT
Asm version = program code
The obvious wrong answer, because it assumes "program code" as a constant. A more accurate version:
JAVA = program code version A + JIT overhead
ASM = program code version B
Without any knowledge of how program code versions A & B compare, we can't confidently say that ASM is better.
0xFF == 255 ...
Its wrong to have the most basic data type wobble over the bit length.
.. casting the return value of malloc(), etc ..)
No. This simply wrong, and almost humorously so. What are you going to do? In the C standard, are you going to say "Implementations shall make int 32 bits wide?" What do you do on a machine (say, an embedded system) with 9-bit bytes? What about a Cray YMP? The goal of the C standard is to have the implementor pick representations for the basic types that make sense for the target machine. Performance is nice, but in the end, they have to make sense. Now, there are some limits in the form of minimal ranges that are placed on the "basic" integral types, but that's about it.
The solution would have been to have the basic data types with fixed size, and additional data types with "some size between 16 and 64 bit giving the optimal performance on the target machine".
You should be more than pleased with the next revision of the C language standard, C99. Among other things, it standardizes such types as int16_t and int32_t, which provide exactly the functionality that you seem to think is important to portability.
This would have saved many people a lot of grief...
What would have saved many people a lot of grief is if they would have learned C correctly in the first place. They would then know what it guarantees and what it does not, and how they can use the language facilities (i.e., typedef to satisfy their non-portable desires.) The amount of junk I see in code written even today is staggering! (assumption that short is 16 bits, long is 32 bits, etc
We're going down, in a spiral to the ground
>The metric that I given in my university computer >architecture class was that a compiler can
> generate better code than 90% of assembly >programmers.
Did you get any more, or just a soundbite from your prof?
Does that mean, "90% of assembly programmers" in your class, in the industry, or in some anonymous survey?
Do you wonder what the statistic would say if the "assembly programmers" were chosen from among those known to have a passion for optimization?
Consider this: the person who can come up with the compiler optimization that makes a big difference is probably in the 1% of the programmers in your statistic. And do you think that if your 90% of the programmers out there were the ones writing the compilers, they would be so efficient?
Write a Java program that allocates, say, a million 26 element arrays of char, or bytes or whatever primitive type makes you happy. Make sure to hold on to each reference.
Run it a few times with the default memory settings and pay attention to the time. Also, watch it (in whatever tool your OS provides, task manager for NT) and try to gauge about how much memory it needs. Call this trial 1.
Now, run it again, but set the starting heap size big enough so that it doesn't have to allocate much or any memory. Note the time. Call this trial 2.
Now, write the corresponding program in C or C++. Allocate a bunch of arrays on the heap. In all fairness, you should calloc them because Java initialized array elements to 0 for you.
Run it and note the times. The time differences between the c/c++ and trial 2 program should roughly equate to the raw performance of "internal" memory allocation. And the difference between trial 1 and the c/c++ version should roulghly account for the extra time it requires for the JVM to grow its own heap.
I did this awhile ago, and I wound up with numbers like these.
trial 1: ~7 sec.
trial 2: ~3 sec.
c/c++: ~2 sec.
So the internal guts of that JVM on that day under those conditions put it about 50% slower than C/C++. But when that JVM on that day under those conditions had to deal with the operating system to grow its heap, it really started to stink. Note that the C/C++ program had to grow its heap to complete.
I've finally found the off by one erro
Crusoe CAN'T run java natively. It has it's own VLIW instruction set. On top of that sits the code morphing engine/JITC. So Transmeta can easily point to that as being proof they haven't built a processor that runs Java "natively".
Intel is definetly the entity to worry about so far as Crusoe goes. Did transmeta pay lioensing fees for MMX, etc? I'm sure intel's scrambling right now trying to find how they can keep this chip off the market. They'll probably end up with some sort of patent licensing agreement in the end. After all, transmeta probably want SSE (is that it?) instructions eventually, and imagine if they could produce a module that just dropped into socket 370's? Intel would probably die for their software simply because that would be much more efficient than either developing their own code morphing stuff or building x86 translation hardware into merced.
Who the hell moderated my post to "Troll"? We just had a really good discussion on this stuff. What part of "Troll" do you not understand? Maybe it was a slip of the mouse...
Hmm... That bites.
If you ask me, it is a design flaw. Longs are only required to have the range of a 32-bit number, so it is wasteful to include more bits. A lot of portable code will have unnecessarily bloated memory use and any reliance on the extra 32 bits throws portability out the window. Given that, a non-portable extension for 64-bit ints (since full use of 64-bits is inherently non-portable; with a lot of sizeof checking some use is possible, but not convenient) to be used only for system specific optimizations would probably be a better choice.
Given that most code run on Alphas probably isn't written with a 64-bit long in mind, it almost certainly hurts more than it helps (especially with memory access being the main bottleneck of modern systems).
The next C standard should fix this problem by providing a 64-bit int (if you ask me, they should just allow you to stack longs, doubling the bits of the minimal range each time; writing arbitrary bit size emulation into the compiler is trivial, though the trivial implementation is inefficient, and it would be a permanent fix that could have surprising benefits for certain supercomputer applications), as well as the addition of fixed bit size types (which is, I suppose, good for networking code, though I fear people will abuse it; IMHO it's better the way it is where you should chop everything up manually into whatever format the protocol takes).
Yeah, assembly is really slow
Java is really fast
That's the reason why we use java in the kernel, not assembly
That's also the reason why assembly is the fatest code and so on...
I can't believe in such article
Even the best compiler in this world can't generate code as good as handcrafted assembly by a human (yeah, I know, this may depend on human skill)
Of course is too much work to do to write a whole program in assembly language... but ain't every day I have to completely rewrite from scratch my kernel, IDE drivers, Sound Blaster Stuff...
Tell these guys to make an DEMO like that from Hornet Underground in a high-level language
Somebody moderate this guy UP. He's hit the nail on the head with a huge sledgehammer, this is _exactly_ why code morphing is different, and far more spectacular, than the JIT concept.
:)
Although I still want access to the morph layer to play
You can't win a fight.
I have two items to contribute.
First is a reply to various people's assertation that whether or not JIT can speed things up, you pay an initial cost for it:
This is a technically true statement, however it is misleading. There is obviously a cost for compiling code at runtime, however the question that should be asked is does this cost outway the benefit of the optimization that it provides?
If you start with the assumption that the JIT can provide code, that _once_it_is_compiled_ runs faster than that which an average human coder writes, then it is simple to realize that in some situations, the JIT compilation will kill you, and in others it will
The first case would be if you were to write a program like the unix command 'date' in Java. The time it takes the compiler to compile this code is always (I'm guessing) going to be slower than it is to just yank the date out of the hardware and spew it to the screen. Plus after the Java program compiles, it still has to make calls to native code to yank the date out of the hardware and spew it to the screen, so it's doomed to be slower than just about anything but the worst handcoded assembler or even C.
However, the second case is that you are writing a major application; perhaps a web application. The application starts up, compiles all the classes, you lose maybe 2 or 3 seconds to this process if you're running a huge amount of code on a crappy processor. However, the application runs for 200 days, constantly handling mortgage applications, or doing text indexing, or selling t-shirts or whatever. During this time, the code that took 2-3 seconds to optimize is executed millions and millions of times. It's obvious that you would benefit greatly from a JIT compiler in this case (assuming you accept the base assumption that the JIT can created better code).
So, IMHO, there's not much more to say about whether the JIT causes more of a performance hit than it provides so I think everyone can safely go back to discussing whether or not it can really outcode a human. I'll save my two cents on that for another day.
In the meantime the second thing I'd like to add is a set of numbers I've come up with while working on a paper about whether or not Java is a viable platform for systems with high performance requirements. I plan to release the paper soon and when I do I'll try to post about it here, or you can contact me by e-mail if you're specifically interested. In the meantime here are some example numbers I've seen trying to compare Java, C++ and C performance.
Test 1 - Integer Divisions
Average time in C, w/ -O3: 00:00.76
Average time in 1.2 JVM with JIT: 00.02.30
Average time in C: 00:08.18
Average time in 1.1 JVM with no JIT: 00:50.97
Test 1 - Integer Divisions on a different (slower) machine w/ C++:
Average time in C: 00:01.2x
Average time in C++ (no optimizations): 00:15.3x
Average time in C++ (w/ optimizations): 00:15.3x
Average time in Kaffe 1.1 VM w/JIT: 00:18.7x
Average time in Sun Java 2 JVM w/ JIT: 00:18.7x
Test 2 - Integer Divisions in Multiple Threads
Average time in C w/ -O3: 00:07.52
Average time in 1.2 JVM with JIT: 00.18.74
Average time in C: 01:13.04
Average time in 1.1 JVM w/ no JIT: 08:30.00
I'll leave it to you to deduce what you like from these numbers, but the first thing that surprised me was that the optimized Java code consistantly stomps all over the unoptimized C code. After thinking about it for a while this makes sense, but how often do people remember to throw in the -O3? Sure, on commercial software that's going to be burned on CD and shipped to millions of people, they probably remember... but in a Makefile in an Open Source project? Or even in smaller commercial shops? How many average C programmers are even aware that it can make a 10x difference in speed? If you are using a JVM with JIT compiling, you will always be running optimized. Another interesting thing is that I do not have a machine I can run Sun's HotSpot compiler on to compare it, which I will be doing prior to finishing the paper. Also at the rate things are accelerating it seems obvious to me that Java performance will catch and probably surpass C in relatively short order. It's just about already overtaken C++, which people said was 10 years away just a year ago.
If you tend to use "malloc()/free()/new/delete" a lot, you'll pull in more heap, causing page faults, and freeing it later, causing it to possibly be marked "unused" by the system and waiting to be page faulted again later. If you use the stack a lot, you cause the exact same kind of faulting and page reclamation (if marking stack unused).
What's the big difference? Well, with dynamic memory allocation (a misnomer since "static allocation" is dynamic in much the same way) with malloc/free/etc., you run special code which allocates the memory, handles the alignment of it, etc. With on-heap allocation, you just force that work to be repeated (albeit in a smaller manner) in assembly as part of the function code. The compiler itself generates that, but it's still the same thing: managing your memory allocation with code.
The difference is just in it being explicit or implicit: is it done by me, or the compiler? In any case, I'll wager that there will be absolutely no measurable difference in speed between something that uses a _good_ memory allocator to allocate everything on the heap, and a _good_ compiler while allocating things on the stack. Feel free to prove me wrong, since I'm easily swayed by that
--
Brian Fundakowski Feldman
Yes, you are right, some chess programs do use a kind of pre-selection.. From what I remember from the times I have played chess profesionally (~10 years ago), the chess software works (or used to work those ~ten years ago) this way: it computes all possible positions generated by its move(s) up to some level. Then it evaluates all the positions with some "static function", i.e. counts it's material, centralized King in an ending etc., so each position gets it's ranking. Then the best one is chosen. Now, there are two schools: "pre-selection" and "brutte force". Pre-selection school generates the rankings only few moves ahead, and just trashes positions which look completely silly (i.e. have too low ranking), and only pursues further the promising candidates. The brute force school evaluates the full tree, sometimes getting a really great move. Now, entry-level chess programs use brute force, better programs use the pre-selection method, but the best ones brutte force. Or this is what I learned those 10 years ago, and this is oversimplified, you have opening libraries, endings specialities etc. I wonder what does this add to our discussion of brute force in optimization. :-)
Doesn't Sun have a patent on hardware that can run java bytecode natively?
I wonder where transmeta-ish devices fall. I'd call it software, but would sun's lawyers?
Forward, retransmit, or republish anything I say here. Just don't misquote me.
You're right, and this was discussed, mostly because of me :(, on comp.lang.c some weeks ago.
There is absolutely nothing you gain from an int16_t (signed int, 16-bits long) over a short (signed int, able to contain values from -32767 to +32767 (usually at least 16 bits)). If I can do it with an int16_t, I can do it with a short and do it much more efficiently.
The place where people start crying is when you're trying to send stuff over the network or via files. int16_t again buys you nothing over short. You have to chop it up into chars (or 'bytes' for you non-C freaks) and endianise it first. In the end you're stuck with an unsigned char * (or a void *) which you're going to write to the FILE *, so int16_t doesn't buy you anything. This problem gets particularly sticky when you're dealing with platforms with chars ('bytes') of 9, 15, 18, 32, or 5.3e59 bits in size; in fact, I don't think there's any requirement in the C standard that the values have to be stored as bits at all. Mandating that an 'int' has to be exactly 32 bits long when the size of a byte is 64 bits on a platform is just retarded.
Something that would be nice is a standard class for C++ which would do that for you. You do 'cout some_integer_16' and it automatically endianises and chops it up for you.
I have seen a compiler use the16 and 8 bit registers when it knew the ranges because of some previous code. It was really neat to see how my C code became much faster ASM.
But I still had to rip out the routine and re-write it by hand to use MMX.
Become a FSF associate member before the low #s are used
As long as the programmer has more knowledge than the compiler, he will always find tricks to save an instruction here or there and outperform the compilers this way. You can find a great example of such programming tricks in the PCGPEs article about texture mapping inner loops here.
and, so what? If the performance spec is "do this in X seconds", and it's done below that time frame, then isn't that good enough?
Now, is it possible to make these programs unlearn their optimizations, or are they optimiztically stubborn?
Memory allocation isn't your problem for game programming in Java... to paraphrase James Carville, "It's where the graphics are rendered, stupid" (no insult intended). You can do all of the memory allocation at the beginning of your application (I do it all the time with byte[] buffers for I/O), just like you would in C/C++. If the graphics rendering code is in Java, or uses software OpenGL renderers (like running Java3D in Windows NT), the performance sucks.
I think people have seen Applet-based games in Java and think "this sucks". Those Applets most likely use the AWT or Swing for their graphics rendering, and you see poor performance.
Just my 2-cents.
-Chris
a JIT compiler will compile the exact same code each and every time you go to run it. While a convential compiler will compile it one time. Yes the JIT compiler might be able to tune the performance better for each run, but I think it takes a performance hit at the start by having to do the compilation. While the JIT compiler is compiling code, the exact same code compiled with the convential compiler is already executing.
The faster the hardward becomes though, the differences between the two might not be as noticible. Although, the more powerfull the hardware becomes, the more complex the code becomes.
Nothing exists exept atoms and empty space; everything else is opinion.
blah blah blah....
How many of you actually routinely (e.g. 6 times a year) examine the output of your compiler, and compare it to what you would have written?
I've done it often (a promise I made to myself as a foolish youth) and learned quite a bit. It's actually changed my higher-level coding technique on a few occassions (and I'm not just talking about optimizing for a better fit with a specific compiler)
Not only have I never met another person who does it, but I've almost never met a programmer who didn't look at me oddly for even mentioning the practice. And yet, here are so many people walking as if they did it routinely. I guess that's just the Silicon Valley edge.
I'm not being sarcastic or snide. I really want to know. Maybe I should give up my current profession (ditching a decade of grad and post-doc training) and move to the West Coast, where Programmers are Real Programmers.
Or maybe I should finish reading my NEJM, and we should all just keep in mind what we really know, and what we only think we know. Again, no disrespect, Lord knows I've had to re-think the extent to which I "knew" things before.
Please post your honest answer, and encourage your friends to do so as well. I'm afraid I'll have to take a lack of responses as a confirmation of my worst fears -- that everyone is only guessing.
If you can go to bed, knowing you did a valuable thing today, you're very lucky. If you can't... it's not bedtime
I completely agree on your first subject (compiler generate faster code than handmade assembly). There's hardly any dispute (except maybe for very esoteric applications that must use very dark corners of a processor).
With the JIT vs. compiled language compilers I must disagree. The JIT-compiled language will never get faster than the chip will compile for. JIT compiler are generally in a hurry: it hasn't been compiled before and the user wants it executed NOW. Pre-compiled application have their compiler sitting a few minutes in silence to figure out the best instructions to use. Who will come up with the fastest code? Yep: the compiler.
The memory allocation issue that you were thinking about is only a small portion of performance and optimal code. Stack or heap allocation/de-allocation algorithms are well understood problems. These issues have been solved before. Compiled vs JIT-compiled will make little difference.
Marco Schramp
Well, a good optimization typically makes one type of instruction or sequence of instructions a fair bit faster (say 30-50%).
While this might give a 2% improvement across average code, it will make a lot of difference if this particulr optimization helps out your inner loop. There are times when you really need this.
The way I see it, if compilers can optimize effectively, then there becomes less need to hand-code an inner loop in assembly. Decent optimising compilers may not give you a great deal of extra speed but they sure as hell give the hardcore coder an easier life.
BTW, The rate of hardware improvement is a good argument *unless* you are trying to stay at the cutting edge. In these cases, optimizations are certainly not trivial.
Talking about dynamic recompilation and program profiling, and specifically about Hotspot, I've always wondered why I have to put up with the fact that between program runs, the compiler "forgets" everything it learned during previous runs? Why can't the compiler generate a log for each program, and review the log when the program is run again, so the compiler can pick up learning where it left off?
First, make it work, then make it right, then make it fast, then, make it bloated!
[slashdot bug reformatted the last post] As the microkernel unix people used to say before they finally died out, "We are within X% of the performance of the original". - Alan Cox The JIT guys can write all the papers in the world about how they should (yes, really should) be able to get native code speed. No, that extra level of indirection is really a performance benefit, not a hindrance! I'll believe it when I run Java bytecode and it's not dog slow.
The evaluation of an action as 'practical' . . . depends on what it is that one wishes to practice.
programming & debugging time should be thrown in as well. It's better to underhype and overdeliver than overhype and underdeliver...
There is a _good_ reason why Java is so slow, and this should be immediately obvious to any programmer in C or Assembly...
Java has _A LOT_ of type checking done during run time. If you overwrite an array in Java, it throws an Exception.. If you do it in C or C++, no one really gives two **its..
There are more intricies with Java that make it impossible for the code to be as fast as assembly because the language REQUIRES that these run-time checks be performed. If it doesn't do the checks, it isn't Java.
So sorry, no matter how advanced your compiler is, if it is Java it will always be _slower_ than a compiled language with not as nearly as many type checks.
First off, there are profound performance differences that can be acheived simply by optimizing for the Pentium Pro and not the Pentium, see The Twofish Implementations. This is mainly just instruction ordering issues- possible to do in a JIT. There's more to optimization anymore than simple cycle counting.
Second, memory management is a different beast under object oriented programming than under procedural programming. The holy grail of object oriented programming (as it were) is black box reuse- allowing a class maintainer to modify how a class works without those using the class needing to change their code. Not having garbage collection in an OO language either disallows true black-box reuse (which is a maintainability issue), forces the object to implement it's _own_ garbage collection (such as the C++ STL template basic_string does- and this is generally the worst sort of garbage collection, such as basic_string's reference counting), or forces the program to leak memory like a seive. If those are the options, then garbage collection is not too high a price to pay.
Especially when you consider that GC isn't that high of a cost. _Every_ memory management scheme that isn't static in nature that I've seen is O(n). Either on the allocation or on the free. Sooner or later you have to search for a hole. GC doesn't remove this penalty, but instead it allows you to _defer_ it until a better time. Since most programs spend most of their time waiting for input, there are lots of places where garbage collection can be free because the program isn't doing anything else.
Last, but not least, I'd like to point you at the The GC FAQ, a good resource for all things GC related. Highly recommended reading- after it, it was "intuitively obvious" that the world was flat, that the sun went around the earth, and that time didn't slow down as you went faster.
When you run a Java program, the JIT has to do its thing before the program starts. That means the application takes longer to launch. For a lot of GUI apps, this is exactly where the greatest bottleneck is.
So the kind of optimization suggested probably wouldn't improve matters for the user, at least with the small, visual applications Java was supposed to be good at. However, it suggests the people using it for numerical work may not have been so mad after all.
If it works.
Of course, you could cache the compiled code, so that it launches better second time. Pretty soon, you'll end up re-inventing compile/run cycles.
You just need a java runtime to run a java application (not applet), like you need a perl interpreter to run perl scripts. If you get the application as source, then you also need the compiler (not to be confused with JIT compiler).
A compiler is just a machine that transforms a parse tree into machine code by following certain rules, so yes, there will always be situations where an ASM coder can best a compiler on a small section of code.
However, the strength of a compiler is that it can automate tasks which would be very prone to error if done manually, such a loop unrolling, instruction schedulling to avoid branch and structural delays, etc.
Today's large software systems, > 10 million instructions, would have taken 100 times longer and cost 100 times as much if written in ASM by hand. For hard real-time systems, where every cycle counts, humans will still need to write ASM, but for almost every other application, using a compiler is much better.
rmstar - the following is not directed at you, but to the slashdot community in general and I just included it here to avoid having to post twice.
Often, especially on technical threads, many slashdotters post on topics that they seem to not know anything about (i.e. someone above who did not know that compilers performed optimizations,e tc.).
Everyone is entitled to have their own opinions, and not everyone has the same degree of expertise on a particular topic, but it is a waste of time for someone who has no idea what a compiler or assembly language is to post their "thoughts" on this topic. Ask a question, yes, but if you don't know what you are talking about, please don't write about it.
Seems profoundly intuitive to me.
First of all, a JIT is rushed. You can design your optimizing one-time compiler to look at the code for better ways to do it all day, if you like, and while the developers will moan, they will still use it if gets them better results.
Secondly, Java has a very specific standard (that is typically fudged... but that's beside the point). It doesn't give any leeway for a program to act a little different in boundary conditions, like C does. In C, an int is whatever size int is most easily handled by the target system; if you need a certain exact size of int, you can code that in with some extra effort. In Java, an int is a Java int and Java doesn't care in the least what the most efficient native int format is.
Thirdly... automatic garbage collection is less efficient than hand coded allocation and deallocation, and dynamic allocation is less efficient than static allocation. There are odd cases where this is false, but they generally hold true, and where they do not, the C version can always use the more efficient method anyway. (and extremely fast calls to "new" can be easily achieved... at roughly 50% memory wastage)
Fourthly: Java locks you into a OOP model which is inherently inefficient (at least as done in Java). All function calls must be dynamically redirected etc.
I could go on, but it feels like trying to describe that water is wet and stones sink in it to someone who thinks it's intuitive that water should sink into stones.
However, I will not dispute that you can definitely get better results by recompiling for each chip that something will run on. Some JITs may use this to push out ahead of one-time compilers.
BTW, an experienced assembly coder will always beat or at least equal the optimizing compiler, because if nothing else works he can always look at what the compiler produces and see if he can improve on that. Besides, optimizing compilers are good, but not that good, someone has to write them, and when was the last time that you wrote a program that can solve complex creative problems better than you can?
The problem that one tries to solve with JIT based language (say Java) is not the same problem that one tries to solve with a systems programming language (say, C) is not the same problem that one tries to solve with an AI language (say LISP).
;-). Ease of programming and maintenance is much more valuable -- that observation led to the decision to use C to implement UNIX in the first place.
;-)
I would not try to write device drivers in anything but C -- assembly is too unportable. I would not try to write an AI program in C unless I had an essentially unlimited budget -- LISP is much better suited to that task. If I want to interact with a WWW user -- Java is the best choice thus far. As to effeciency of the resulting program, the speed of assembly code is irrelevant in 99.99% of code, since the program spends most of its time in that 0.01% (OK -- I'm exaggerating, but not that much
"So what" you ask? While we talk about JIT and code morphing vs. compilation vs. hand assembly, we need to compare similar things. Language characteristics are nearly unrelated to performance (*implementation* is a whole other story...) In any HLL there must be some translation from strings of bytes to executable things, whether byte codes (emacs lisp), VM codes (Java), or machine code (any of the above under an alternate translator).
JIT tries to solve the problem of bridging the gap between one (abstract) instruction set to another. Code morphing does the same thing. Compilation does the translation from HLL to machine code in one block. The time available to a compiler to explore different register allocation strategies and branch orderings is much larger, but less information about the typical case is available than for the JIT. However, branch prediction is the only place where repeated JIT/morphing has an advantage.
BUT: I repeat -- they solve different problems. Compilers create programs to execute on a chip. JIT/morphing makes a program written for one chip executable on another.
The use/non-use of garbage collection does not tell us a-priori that a program will be slower than the same program without GC -- that depends on the allocation behaviour. There is considerable literature (both theory and practice) on this topic, for languages such as C (gasp!), C++, and LISP. Probably Java too, by now.
Compilers may miss optimization opportunities occasionally but the assembly code they generate is usually pretty good, and it is free of human-errors.
Finally -- Alan Cox mentioned the microkernel performance mistake. I think that it applies here, but not in the way you might expect. It is really easy to miss the real cause of performance problems. In my experience, choice of algorithm and data structure (even in OO code) have a much greater impact on performance than any other single factor (or even several other factors together). You should use language features to help with your task -- but don't be sloppy with them. That always leads to bad, buggy code, and poor performance.
Hmmmm -- still on topic? You decide
Actually most of what he proposed is well defined ways of compiling code as known today. At least trees and SSA are, I'm too lazy to look up to the post to see what else he wrote about. Neither of those however are particularly magical.
This space for sale
For processors that are not made by transmeta, what the C Language needs is a way for the programmer to tell the compiler extra information.
For instance if you have a 'case' statement or an 'if' statement. We should be able to tell the compiler which code path is going to be considered the most common path.
Ive seen a good discussion of this in the kernel list. There are places in the kernel where goto's are used to accomplish what could be done if we could tell the compiler which code path is most common. I "think" some of those optimizations in the kernel are assuming your going to compile with GCC on intel to get the performance gain. Please correct me if I am wrong.
In theory, a human can write better assembly than a compiler - in a vacuum. But code doesn't exist in a vacuum, and it becomes less of a vacuum every day. Consider these few issues:
* Cache misses: page faults that cause L1/L2 cache misses (going straight to memory, or worse, to disk) cost thousands, maybe millions of instruction cycles. Do you write all your assembly language to neatly break your data structures across paging lines? No? Didn't think so. btw, this level of optimization can be performed in C in many cases, by good C programmers.
* Pipelining: Do you hand-code your assembly language to give the best hints to the branch decision-making hardware so it can optimize pipelining? Really? You're better than me. Or maybe your code is so much better than compiler-generated crap that the branch lookahead hardware support is useless.
* APIs: I suppose you don't mind reverse-engineering every library call you might ever need, so you can avoid those wasteful, inefficient C headers.
* Et Cetera: The days where humans can outthink compilers in the general case died with the PowerPC and Sparc I. Those of you who have written 1,000,000+ line assembly language programs are free to tell me i'm full of shit. Those of you who have worked on million-plus line code in high-level languages are free to give me an AMEN!!
---
Hand me that airplane glue and I'll tell you another story.
The term water, in the sense I used it, indicates a singular continuum bounded by the absence of water molecules. To indicate individual molecules, this must be expressly stated or clear from context, as English is strongly biased toward the aforementioned meaning.
Water does not sink into a bucket of stones, it flows. Furthermore, it flows into the bucket and amongst the stones, but not into the stones. Water can sink into sand, but sand is not stones.
Yes, water can seep into stones, but seeping is not sinking.
There are probably 2 lines of attack which have to converge to the same point. There are some magical trade secret algorithms which go into designing the most efficient processors (such as the PPC), which are not "RISC" but OInC (optimized instruction set computers)--there are algorithms which are used to optimize the flow of bits through the machine. The chip manufacturers aren't going to spill the beans on these, so you really aren't going to know other than in a general way how to get those bit through the processor as fast as possible. The manufacturere, like IBM which has its feet in both the manufacturing and software camps, could presumably fine tune these algorithms to particular features of compiler technology--specific chip design algorithms coupled to specific compiler algorighms, so that you can either make a deterministic prediction that bad things will never happen or a probabilistic prediction that bad things that will slow you down greatly will be very rare events. I don't think anybody has gotten into this yet, but am not in a position to be informed if they were. (I'm a physicist, not a computer scientist.) Perhaps this is part of Transmeta's claim to fame.
There is also something apparently magic that can happen when you do a really good job of *designing* an OOP language. This usually takes some time. I'm vaguely interested in scientific computing and you hear stories. Scientific calculation people love Fortran 77 and hate Fortran 90 and 95: Fortran 90 and 95 were abominations created by a bunch of CS professors who vaporlock if you can't do recursive programming, etc., just like C (never mind that any decent implementation of Fortran has the ability to run C subroutines if you want to use C to manipulate some data in a way which would be clunky in Fortran 77). Fortran 90 and 95 were 15% to 30% slower than Fortran 77--in other words, speed comparable to C (this is not flamebait, this is the word I've heard from people who pay attention to the number of _hours_ it takes a routine to run through a calculation). There is now an OOP Fortran 97, and I've heard a few strange stories about it--it seems to be 5% or so _faster_ than Fortran 77. My point is that a totally buzzword compliant language--like Fortran 97--can be designed in such a way that it can kick the butt of a carefully tuned and optimized language like Fortran 77, or at the very least keep up with it. This is already on the streets. What presumably has to happen for Java is that somewhere, sometime, it will fall into the hands of a group of people who know how to make languages run quickly. Right now, we are at the stage of knowing what we want/need it to do, and doing our performance tuning (such as it is) around that. Hopefully, when the performance adaptations have to be made to Java it will not be get out the nuclear weapons and let's start over.
First off, I agree with the hypothesis about garbage collection being potentially faster than explicit allocation and deallocation (especially on a multiprocessor system!) Second, it might be possible to make memory allocation code aware of repeated allocations and deallocations of the same size, and have it set aside a few chunks of memory to be recycled continuously.
The problem with Java, IMHO, is not garbage collection. It's all the dereferencing that kills you. Consider the case of a line of Java code that refers to the member variable this.myInteger. The Java VM (or the Just In Time compiled code, or whatever) first must convert the object handle to a memory reference to get the object's base address. (Object handles are a necessary evil because they let the garbage collector reorganize memory blocks on the fly.) Next, the VM must add an offset to the base address to obtain the address of the variable. That's a whole lotta dereferencing to get at a simple int.
Running JIT-compiled code on a RISC machine, we've just performed a bare minimum of 3 loads to get to something that would have required only 1 load in C, or 2 loads in C++. The numbers are different on an Intel machine, but everyone knows those are slow by nature. ;)
Another area to examine is method calls. In C++, non-virtual method calls cost almost nothing, and virtual calls have a small overhead. In Java, however, all method calls are virtual--not only that, they're called by method name instead of vtable pointer. A JIT could try and optimize this out, but it still must make allowances for introspection and method calls coming from un-JITted code.
I'm sure I've missed a few reasons why Java is slow, but I think I've kicked it around enough for tonight. Until someone can tackle these problems, Java is still slower than an early-binding language like C++. But with judicious use of native code (which has of course been aggressively optimized) we might be able to narrow the performance gap to ~15%. That would be enough to convince most people that Java is not a toy.
byte[] buffer = new byte[1024]; // Creating the buffer once
for (long loop = 0L; loop { // some number
buffer[loop % 1024] = 42;
}
And when running the JDK 1.2.2 VM with -verbose:gc you get a few GC messages at the beginning of the program, and then the GC never comes back.
However, stack-based short-lived objects *are* important for high-performance math. James Gosling is working on that as a possible future enhancement for the VM.
Just my 2-cents
-Chris
Yes, I know the code is contrived. I'm just trying to illustrate the fact that if you think before you allocate, Java is no slower in that regard than a manual "free-and-release" system like the standard C heap.
One thing which hasn't been considered to optimize the high-level->machine level conversion is having the compiler forward the code via network to a Sega Genesis. This code can then be blast processed and converted to something near-native to the transmeta instruction set. Note that line rate is irrelevant as the Genesis implements a ~3x10^70:1 compression ratio for 2 gigs of data in a single cycle. This is only when Sonic the Hedgehog is plugged in. Space Harrier seems to give a bit poorer performance.
for (long loop = 0L; loop
Stupid entities... why can't /. support the tag???? :D
-Chris
at what time will there chips actaully be out ?
Hmm, I think the author missed the REAL advantage of Code Morphing (or Binary Retranslation).
;-)
JIT Compilers (or even FX32!) are pieces of software running ABOVE the OS layer. They are executed for a specific application which runs on top of it (that is the application on top of the JIT environment).
Yes there are similarities between Code Morphing and JIT compilation, as each does some profiling to get better performance in frequently executed code areas. It is further true, that on more predictable and orthogonal platforms like the Alpha, all kinds of optimisation depend on knowing the underlying platform.
BUT they only see the code from the application running on top of them.
The processor itself however gets code from MANY SOURCES. The number of processes running on any machine today is quite large. In addition things as shared libraries, OS calls, etc add a pinch of code.
Hence the processor still has to perform all kinds of clever things, trying to resolve dependencies, avoiding stall and stalling at times. For modern multiple Issue machines this gets VERY complicated. It slows down cycle times, increases size (and hence speed) takes up space which could be used for better performing optimisations (larger cache, etc), etc.
This is one of the major headaches for CPU designers these days, as these parts increase unproportionally in complexity when trying to add more functional units, etc.
This is the reason why VLIW (very long instruction words) are used in the next generation processors (i.e Itanuim). In those the compiler does all the hard work, resolving all dependencies stalls, etc, at compile time, and then simply sending one long instruction to the processor, which includes one instruction for every functional unit. This means the processor can run at MUCH Higher speed, as it loses a big overhead complexity.
This is the reason why TRANSMETA uses such a VLIW, it enables them to build fast, small and CHEAP processors which can run at quite high speeds. (I am sure they could do more than 500Mhz, but then cost and power consumption get in the way. And the are much smaller and havent (yet) got the man resources as Intel, AMD)
Now where does code morphing come into play ?
I recently saw a few comments asking for access to the VLIW instruction directly, bypassing the code morphing. Well, that would DEFEAT the concept !
The Big Thing about code morphing is that it sees ALL instructions which get executed, in the right order. This includes the library calls, the OS routines, etc. Hence it can do MUCH BETTER optimization than ANY Compiler by definition can do. A compiler can only know about the program it compiles (yes you can do clever profiling, but you still only have the application source to optimize), it cannot optimize ACROSS APPLICATIONS.
The code morphing software has absolut knowledge about what code has executed and what code currently tries to execute. This enables it to do much better optimizations.
Furthermore it is SPECIFIC to the underlying processor. It knows everything about the processor and even more important, the processor is designed specifically for it !
(Hence the two versions are internally comletely different, each optimised for the kind of code linux/Win it is executing)
This is what makes the concept so interesting. It adds an additional abstraction layer on top of the actual hardware, allowing the underlying implementation to be always up-to-date and optimized for the application (removing all the complexity of having backwards compatible hardware, ask Intel
Overall, the approach is a MAJOR step forward, IMHO it can be compared to the CISC->RISC transition. It is after all the first implementation, competing with products well understood and optimized for years. It has potential to solve many of the major headaches of modern CPU designs (including COST). Comparing it to JIT is missing the BIG points.
I think it will go a LONG WAY !
Frank
First of all, I have seen a few examples of Java running maybe only 10% slower than the equivalent C program. In these cases, the programs were doing raw calculations and not allocating anything.
While I would love to see Java run as fast as C++, that's not the most important thing to me. I can develop in Java much faster than C++. Doesn't "fast enough" outweigh "fastest possible" if you get it sooner?
Perl is slower than C, too, but you don't hear as many complaints about that. For most applications you use Perl for, writing the equivalent in C would be a huge pain.
What does it have to do with C++? I mean, doesn't the same argument/counterargument hold for other languages with dymanic memory allocation?
Suns Hotspot Java Compiler uses runtime profiling to optimize code. I agree completely with our points. I want to add that with Component based software the JIT will have another advantage. The advantage is that it can inline calls to the Component. A static compiler such as a C++ compiler cannot do that because it has no knowledge about the component. With an application build out of a lot of little components that can be replaced at runtime the JIT will have a huge advantage. Markus
Aren't you being a little optimistic here? You assume garbage collection can be cheap via a stack, but has it ever been cheap? The answer is yes, but only in the simple one program world of DOS. But for the multitasking, you're going to have lots of unusable chunks. I do agree that a JIT can optimize code, but the analysis requires more resources. You just stated that the JIT has to rewrite the code everytime it runs it. That means inserting and analyzing optimization routines all over again. It is true that a programmer can never optimize code as well as a compiler who knows the processor intimately, but any good assembler worth a salt can optimize code to at least 70% because he knows which loops/routines are executed the most. That's more than acceptable. I don't know if the remaining 30% is worth it. And based on your assumption of processor power, it's probably not.
Hi,
1 1-performance.html and http://www.javaworld.com/javaworld/jw-12-1999/jw-1 2-performance.html and http://www.javaworld.com/javaworld/jw-02-2000/jw-0 2-performance.html
What transmeta does is very similar to SUN's hotspot compiler for java. Like Transmeta's code morphing, hotspot compiles and optimzes bytecode on the fly using profiling data to optimize the parts that are executed most.
Your question as to why Java is still slower than C++ is answered in a series of javaworld articles (www.javaworld.com): http://www.javaworld.com/javaworld/jw-11-1999/jw-
Basically the problem is that stuff such as memory allocation, garbage collection, synchronization and runtime type checking have a performance price (you get a safer programming environment in return). The articles discusses these performance issues and give some usefull advice to avoid these bottlenecks.
Jilles
Well in that case think of Penrose tiling. People do kick ass at NP hard problems. That's why people are employed to pack things into boxes, not robots.
First of all, I don't think any CS people work on FORTRAN at all. FORTRAN 90+ are just an attempt to keep the language current with the newest language fads. Keep piling in features until you can't find the structure underneath the junk. However, I'm not too happy with FORTRAN 77 either. The claim that it is faster is due to a couple of reasons, but the most often cited one is that everything is passed by reference in a global location. You can of course do this in C with global variables and achieve similar function call performance. Aliasing are the second most cited reason, and const, ANSI aliasing rules, and a good compiler all help make that less of an issue.
There's local overhead in the instrumentation code, but not a net overhead in the combined instrumentation code plus controlled loop code: the loss due to extra code prior to a loop is far less than the gain resulting from the more efficient loop code produced by introducing that extra instrumentation.
"The question of whether machines can think is no more interesting than [] whether submarines can swim" - Dijkstra
The instruction set in Crusoe is supposedly 'morphable' and from this I get the impression that the programmer could just rewrite the instructions (of course, after obtaining the source) to better optimize Java or any other language. Hmmmm... or maybe I'm just typing a bunch of nonsense and trollin'.
The problem is indeed that the compiler doesn't know everything that the programmer knows. However, the reason for that is that modern programming languages aren't sufficiently expressive. There's no reason that I shouldn't be able to tell the compiler that these particular variables should also be less than 256. Not only would that be useful for optimization (for instance, I could say this number is between 7 and 17, and it knows it can store than in a byte instead of 3) but it would also simplify some programming tasks such as bounds-checking for arrays and input. Really, you should be able to specify in a program everything you know about inputs. This would also be useful for choosing algorithms. You could tell the compiler you needed to store some information to be looked up by a numerical key and inserts can be slow, but lookups need to be O(1). And it could choose an appropriate data structure for you. Same with sorts.
The performance jump just from finding the right data structure/algorithm from the hundreds available would be a huge performance leap.
when the programmer has decided that space is more important than speed.
That would seem to be one of the issues where manual coding has the edge. I agree with your general point, but I think there's still work to do before auto-generation is perfect.
Imagine a case where an algorithm could easily optimise either way speed/space - maybe there's a hash table that's going to hold some programmer-controlled depth of the initial search and allowing it to expand would make usage quicker, but eat memory. A human coder would probably know the pattern of usage this routine would get. By knowing the practical number of data items to be encountered, they could optimise the hash size. A compiler can't do this, because the necessary information comes from the overall systemm domain, not just the source code. To allow compilers to compete effectively, we'll need much more subtle optimisation hinting than we currently have; especially that of the form "ignore this block, it's only used once" and "speed like crazy here, and you'll never have more than 5 items loaded simultaneously".
I'm not a compiler / assembler geek, so maybe someone is already doing this ?
>It's my belief that it should be possible to make a compiler generate much better code in the general case than someone writing hand coded assembler.
No. Hand coded assembler, coded by a clueful programmer, will always be better than compiler generated code. Or at least as long as a compiler cannot considered an advanced Artificial Intelligence...
To some large extent this is a question of creativity. Compilers are not creative.
for argument (1): Well, we are talking about a =clueful= programmer.
for argument (2): This isnt significant.
For the profiling: A clueful assembler programmer can do this too.
Sounds great, but wouldn't that learning process lead to the possibillity of bugs, based on incorrect optimizations the compiler had learned? How would you get the compiler to generate proofs that show that its optimizations will be guaranteed to work? Further, what about human errors in coding? Wouldn't human errors in code make it possible that the compiler learns really bad optimization methods? Just some random thoughts....
Fish
Crouso is basically just taking the idea to it's logically next step (hardware support for this).
However by using global garbage collection the compiler knows that all 499 are going to be deleted so it deletes them and then shifts the remaining memory down. This means that the speed increase of using a stack to create memory isn't lost when deleting and so overall it should be faster.
Well... History has proven that handwritten assembly has always had an edge over HLLs. But now you are suggesting that Java, which IMHO runs like Windows on an underclocked XT, could be faster with A JIT-compiler? ROFL! No, off course not! A compiler may know clockcycles and all, but it does not understand the actual logic of a program, or the more complex tricks. It can hardly keep track of the flags (meaning that it doesnt know when to replace MOV AX,0 with XOR AX,AX and those kind of things) and when it comes to complex instruction re-ordering it becomes too much. If you don't believe me, just download the Quake source and see the difference with and without assembly (especially software rendering). Compilers have improved, but handwrittem assembly still has an edge.
But that is about to change. Indeed when the Merced and K8 are released, compilers might win. These CPUs were optimized for compiler optimization and I think when the x86 architecture is ditched, the days of handwritten assembly are over. Until then, stick with it.
I'll just say that when the compiler and the machine are already married together, the compiler DOES know about the size of the caches. I'm specifically talking about VLIW hardware as oppossed to a super-scalar machine that has multiple execution units. The main difference between these two classes of machines is that the super-scalar machine has a hardware layer that is doing the instruction scheduling where a VLIW box has the compiler doing that function. Thus I feel your observation is more of an apples and oranges situation.
Have you compiled your kernel today??
Where the hell did you get these figures ?
What compiler, what OS, what architecture.
They look *completely* out of line with every
other test I've seen. Check out the JGL against STL comparison above for some real numbers.
For a start, what the hell is the difference between integer division in C or C++ - no compiler that I've heard of written in the last 5 years produces different code in these two cases.
Secondly - who remembers to turn on optimization: anybody who cares about performance and isn't completely stupid.
http://rareformnewmedia.com/
Well, all you geeks who's lives revolve around code, I can sympathise that you are indeed vulnerable to advances in technology. If you were smart then you would develop embedded hardware and write code for it too. That way you would gain the best of both worlds. Any dullard can write a c/c++ VB app. True legends make the hardware, which makes things happen in the real world!
Can it be done with standard cpus ? I mean, buy a motherboard with 2 alpha (or 2 power pc) to use with linux. When used in native mode you have a very fast computer. If you have to run a x86 program one cpu translates and the other executes the code translated...
[slashdot bug reformatted first two posts]
- Alan CoxThe JIT guys can write all the papers in the world about how they should (yes, really should) be able to get native code speed. "No, that extra level of indirection is really a performance benefit, not a hindrance! I'll believe it when I run Java bytecode and it's not slow.
The evaluation of an action as 'practical' . . . depends on what it is that one wishes to practice.
>I can say that they do not offer Java in the
>CS department. This is a reflection in the fact
>that Java is not important for what programmers
>actually do in the world.
Man have you got to get out in the real world! You and your profs that is. Right now Java is my job. A year ago I was coding c and asm on an 8051 derivative for an embedded project. Today I'm building a distributed network management system for a company that builds fiber backbones. Guess what the most universal platform for executing code is? I'll give you a few clues: All their winXX boxes support it. All their *nix boxes support it. Any ideas? It's called a web browser and everyone that might be interested in designing, configuring, maintaining a network, every one of their laptops, desktops, and workstations has one.
Sooo...which platform do we support? What language do we choose? You only get one choice. I suggest that you transfer to a, uh, more modern school ASAP.
Good judgement comes from experience, and experience comes from bad judgement.
- W. Wriston, former Citibank CEO
You're correct in saying that JIT generated code can run faster, be more efficient, etc. The only real problem is that you still have the initial burden of compiling every time you want to run the code. This is not a trivial amount of the time spent waiting for JAVA (to use one example) to do it's thing. Compiling of the sort you're describing is hugely complex and in the medium to long term will outway any benefits you might see from efficiency improvements in the actual code. Cheers
read my post entitled
Hey everybody..Listen up
Later
Spacewalrus
If you think about it closely, you'll find that the easiest way to really "clearly define" a problem is often just to write it in a programming language. No, I don't think, computer programming will ever be fully automated. Sure, improvements in technology and new tools will make it much easier for a programmer to oversee the complexity of today's programs, but this doesn't mean programmers will no longer be needed. Tasks will shift from tedious low-level work to reuse and combination of existing building blocks, but this will be compensated by a growing project complexity.
Same morphing regardless of language or virtual machine...
From my understanding, the Crusoe morphing starts from the x86 instructions. So, regardless of what language the program was written in, or how it was invoked, the Crusoe chip is still morphing the x86 instructions into its native code. The Crusoe is not morphing Java into a virtual machine, its at a lower level than that.
Not at all true... self modifying code is still around not only on JIT compilers but in interpreted languages and even device drivers (Or how do you think some device drivers adjust in real time to hardware changes?)
I guess you're right, though I'm not sure whether we should consider things like function-vectors as self modifying code. These programming structures have relatively well-behaved formal counterparts (such as lamda calculus). Are more dangerous constructs used in practice too?
BTW. What happens when two threads executing on two different processors try to optimize the same piece of code? Without proper synchronisation mechanisms (such as write-through or cache snoops) this is very dangerous. Work-arounds, suche as semaphore will most probably cause a performace penalty.
Also, how to cope with asymetric multiprocessing? This obviously raises problems for binary incompatible processors using the same memory. But it will also lead to at least inefficiency when two processors ARE binary compatible but need different optimizations...
I don't have a real comparison for you, but check out this comparison, which might be a starting point for your comparison. Basically, if you use built-in types (int, array) in Java, it can go fast, but when you start using classes, Java slows down while C++ STL stays roughly the same speed.
This whole question boils down to whether or not a human mind is deterministic. We can prove that a
compiler is, or in fact that any program that runs on current computer hardware is.
If a human mind is deterministic then it follows that a human could do just as good as a compiler simply by using all of the same algorithms to generate the code, although the computer would certainly finish alot sooner.
If a human mind is NOT deterministic then it is clear that a person who had no programming skill whatsoever could beat the best compiler given enough time, a specification sheet of the language he was to generate the given algorithims in and a circit diagram of the cpu he/she was coding for.
This is VERY simple to see. A compiler can NEVER
under any conditions for any reason EVER EVER EVER produce better code than a human because the
human can ALWAYS produce the same code that the compiler did. The question should be is it reasonable to wait the 150 years of time it would take me to compile mozzila by hand or should I build a compiler to do it in a few hours. The next question would be how much time(if any) should I spend hand optimizing the compiler output.
A question I have is why cant we do the same thing that a JIT compiler does before runtime? It seems
clear that you could easily get very close to as much optimization from a multi pass profiling compiler as you ever could with a JIT without the runtime overhead.
Later
Spacewalrus
Second point. I might just be starting out in this field, but what would make the most sense to me is to include performance profiling with beta testing software. Just imagine an algorithm using data gathered from several thousand users in almost real time over the internet automatically recompiling and distributing the software back to the testers. A modification of the same process might allow big software companies to vastly improve their product between releases based on the data collected from users. Better still, imagine adding branching to the algorithm for different uses. Microsoft might no longer have an excuse to sell the same piece of software for insanly different prices. Serving a web page rather than playing Quake, here's your kernel. Using your database for this or for that, here's a different component. Just my two bits between classes.
In Self 2.0 (back in 1991) they got half the speed of C for highly numerical benchmarks. On conventional processors, that is as far as a language like Self or Java can go.
C, C++ and Fortran get their speed by not including safety features:
On a very different topic, I haven't seen any mention of memory usage. A dynamic compilation system only has to generate code for that parts of the program that are actually used, while a static compiler must generate code for everything. In the first case you lose some memory for the compiler itself and some more to have the "source" available at runtime. But the fact that you can use the rest for just part of the program means you can be very agressive in inlining and loop unrolling compared to the static compiler and this could result in faster code.
I completely agree that humans can always write better assembly than compilers can generate, but it's getting harder.
Not all that long ago, compilers weren't all that smart, and processors were a lot simpler. Today processors use tricks like out-of-order execution and branch prediction, making it extremely difficult for an assembly-language programmer to determine exactly what is going on. It still remains the case, though, that the programmer may know something about the code that the processor doesn't (that a register will never contain a certain value, or that some execution paths are much more likely or more critical than others) which will give the programmer the edge.
None of the just-in-time compilers that I am aware of recompile code as they go. If the Crusoe actually attempts to optimize sections of the code based on runtime profiling information, this could be the main reason it performs so well. There have been academic papers written on this idea before but this would be the first implementation I've heard of.
Well, I noted that someone else said that it wasn't really a problem, but, if someone were worried enough about overhead for processing time, couldn't the JIT/VM become standard and then companies just put an extra, not so whiz-bang processor "alongside" the "main" processor to "process" and "optimize" the code? or would that create even more lag time because of slow communication between the processors/memory usage? If it wouldn't be a problem, and this could be done, then the entire capabilities of the "main" processor of said system could devote its time and cycles to executing the created/optimized code produced by the "extra" processor..
or, maybe I'm just babbling..
Insert mind here.
I was about to try and answer your question but it's bloody difficult to explain!
With garbage collection implemented in the best way the JIT or whatever doesn't try to hard to keep a record of what mem has been allocated.
But, whenever the system isn't paricularly busy or if it panics due to low memory the JIT can traverse all the objects that have been allocated (just starting with each of the Thread objects I think) just by checking the references each object has until it has a list of all the objects in the system. There are a few reasons this is more reliable (not just faster) than maintaining a reference count.
Then I suppose it (the JIT) could ask the OS to rearrange it's memory mapping thingies (Page Translation whatchamacallems) to put any of it's user space pages after the current position of the stack or something like that (hard to describe but i know what i'm talking about (even though i can't remember the name exactly!))
Basically it can be made to work but maybe not particular efficiently because it could result in a full page (4K) been used for just one small object but this can also be fixed
Years ago I worked at a company that was going to bring VLIW to the commercial world. One of the things they had to invent was MUCH smarter compiler technology. That was in 1985, so the basic ideas involved here are at least that old.
Anyway, when the compiler DOES know the hardware, especially VLIW machines, they can indeed do a job as GOOD as a human. (Look - they won't usually do better jobs, when the guys who designed the machine write code, they're going to do what the compiler is trying to emulate..) But from what I remember, static profiling was considered "good enough" back then if the VLIW machine is built right.
Most codes spend most of their time in inner most loops. If those loops can be rolled up correctly for VLIW you can be executing different interations of the loop within the same instruction slot. Being able to do this is usually the largest performance payoff. If you have a GOOD compiler in the first place, that KNOWS the hardware, then JIT and morphing aren't needed, or help.
The place that I see the morphing being a step forward is that VLIW machines were considered architectural dead-ends until just now. If I use one of these smart compilers for a machine with say 7 functional units, the code that emitted won't work on a machine with 8 functional units, or at least not be optimized any longer. These machines didn't scale well until now! Code-morphing technology completely removes this limitation!
Have you compiled your kernel today??
If you read a bit farther into the release information, you'll find that the 700mhz part is designed for handhelds, laptops, etc. A bit farther down you will see a part talking about ease of addign to the architechture for models that are not portable-based, i.e. server architecture. They go further into performance gains from this change in architecture.
"We apologize for the inconvenience."
"can anyone actually give me a good reason why Java + JIT should be slower than Good Programmertm + Assembler? " The good reason is; Sun sells CPUs. The faster they make Java, the less new, faster CPUs they sell to their customers. Intel too.
For those that haven't seen it, distributed.net's crypto-cracking client has multiple "cores", from which it selects the fastest one for a given processor.
I ran a comparison between three machines (MII@233MHz, K6-3@400MHz and Celeron@300MHz). Remombering that they're all x86 architecture machines, there's some huge variations:
core MII K63 Celeron
#0 175 569 354
#1 426 648 70
#2 359 617 841
#3 520 547 694
#4 374 602 693
#5 362 704 712
Notice that the largest is a ten-fold difference!
The point of this? Imagine what your run-time optimising compiler could do when it knows the exact charateristics of the chip it's running on, as opposed to those of the chips it might run on. I know this is an extreme case, but there can be significant differences.
$ find
True, but languages that punish bad code, rather than just accept it will have better (and usually less -- another improvement) code written for them.
Consider what would happen if you recieved a 40V electric shock every time you wrote bad code, or were in careless in some way -- there'd be less coders writing code, but the end result would be better (provided that the electric shocks were dealt out fairly -- and this is the problem of standardising...)
John
John_Chalisque
Think 'little theorem'.
For example, when proving a major theorem in some research paper, you'll write some smaller things building up to your main result(s) in 'lemmas'.
What the previous poster is referring to is sets of 'when this holds, so does this' type implications (e.g. if a is invariant bc and ca, we know that ba and don't need to work this out explicity)
John
John_Chalisque
If you have the compiler rewrite the code each time you compile (or just-in-time compile) how would that affect performance on a Transmeta type chip? I don't know what I'm talking about, but I think that I read that the chip itself (along with its low level software layer) will learn each time it runs the program so that it can run it faster the next time. If the chip is faced with a different program each time the application is run, how can the chip perform to its best?
Of course, such an arragement would also consume twice as much power, negating the very design goal of Crusoe...
Automatically optimizing just the code that gets executed most often doesn't work.
Take a simple calculator program for example. It spends most of it's cycles in a wait loop waiting for input. What's the point of optimizing a wait loop??
And that hyperbolic trig subroutine that you use 3 times a day might never get optimized; but you would fume if you had to wait 5 minutes for an answer when you needed it.
Function vectors!=self modifying code but code that patches on the fly (i.e. TCP/IP stacks for non reentrant OSes). That would also exclude DLL's and other dynamically linked libraries.
:)
BTW. What happens when two threads executing on two different processors try to optimize the same piece of code? Without proper synchronisation mechanisms (such as write-through or cache snoops) this is very dangerous. Work-arounds, suche as semaphore will most probably cause a performace penalty.
You have to decide what's mode important, in high speed CPUs where the CPU goes 10x as fast as the memory bus write throughs are a bigger performance hits than semaphores, in massively parallel architectures it could be even worse because another processor might be locking access to the RAM or I/O. OTOH, semaphores can be a performance hit if htey're not persistent. That brings you to the following problem, do you run a single kernel image that controls all processes or do you run several kernels with interprocess communication between the kernel images and the threads?
For example the INMOS Transputer solved the problem through interprocess communication using tuple-spaces in OCCAM, some HA *nixes use semaphoring between multiple kernels, so you have a performance hit which is offset by the more robust setup. And.... BTW, IMHO SMP without parallel I/O and storage offsets the benefits of having several processors except for computational intensive operations that don't depend heavily on stored data sets.
Coping with Asymmetrical Multiprocessing? Wow, I haven't thought about this in a long while... there's only one answer there... sepparate instruction/data stores and a shared memory space for communications... either that or forget about the whole thing
All systems are inherently inefficient, what you do to make that inefficiency not noticed is what counts.
ZoeSch
I hate to agree with davecrazy but...
>In C, an int is whatever size int is most easily handled by the target system
This is not a feature. Its a failure of the standardization process.
You cannot write portable programs if your basic data types are not defined.
And it also doesnt give todays hardware any advantage. Alpha uses 32 bit int. Using 64 bit would be pointless since portable programm wont use the added 32 bits.
Actually, there is no absolute requirement that free() does all the cleanup work that the allocator, any more than a garbage collector must call a routine to do all of its cleanup work at the end of every block and at every assignment.
free() could just add the freed chunk to a list of chunks to be cleaned up. A second thread could run continuously tidying up all these chunks with the same system-wide view as a garbage collector.
I can see two major problems offhand:
The first is that java is, conventionally, a
language which runs in a virtual machine. The
infamous JVM. This involves a layer of
inefficiency from the word go. Granted, you could
avoid this problem by compiling direct for the
CPU in question, but this is a nontrivial
conversion from a stack-based language to a
register-based architecture. If you are on a
java-optimised CPU, compiling direct, this could
still be pretty good.
The second objection involves the compiler.
Compilers take CPU cycles. The smarter they are,
the more they take. If you compile beforehand,
this of course doesn't impact execution speed.
If, of course, you'd like to compile on the fly,
well, you're going to have to execute the compiler.
Java also isn't really all that suited to this,
to be frank. If you want to see a REALLY powerful
model for this sort of thing, try using a FORTH
compiler. On a CPU which is at all stack-friendly
with a good execution environment you can pull
all kinds of nifty tricks (on the fly modification
of running code, assembler-like low level activity
and so on) which can really speed things up
immensely.
You may inline if you surely know that object you handle is of given type.
Simple example:
Foo.Bar() -- will be translated as -- if Foo.Type == TypeWeHandle90POfTheTime -- inline Bar for TypeWeHandle90POfTheTime -- else -- call Foo.Bar(); using message passing. That's the way things should be done.
So while it might seem profoundly counter-intuitive can anyone actually give me a good reason why Java + JIT should be slower than Good Programmertm + Assembler?
:).
:), it's always best avoided when using java, but sometimes you can't. This affects other languages too ofcourse.
:) hmmm.
:P. It's nice throwing around O notations about how much more efficient garbage collection can be etc, but you have to realise than you're assuming that memory freeing is the most costly process, it may not be. With JIT, GC etc, you have the additional overhead of the JIT and GC. I haven't seen any proof so far that these overheads aren't overheads at all.
;).
:)~.
:).
There are many things that slow java down, that I'm sure you're aware of, but I'll outline some of them here as they come to me
Limited stack, all java objects are created on the heap, and thus is considerably slower. Microsoft Research actually have a division working on speeding up java by putting some objects onto the thread stack.
Runtime checking, array bounds checking etc can not be turned off. Good for stability, bad for speed.
Late binding, can be rather slow
Coding practise. I don't know about many people here, but when I program in C, I usually know exactly what's going in memory, so I spend a bit more time optimizing this and that saving a few extra cycles, especially if the code is in a loop. I'm sure this is even more true of most ASM programmers. In java, I just don't care, I don't really know what's going on in memory, and I don't really care, I just do what makes the tidiest code, I also tend to use inheritance a bit more
Also, java's thread synchronization is easy, but extraordinarily slow compared to what you could do if you were to use Win32 for example, since you have a lot more control. A JIT is not going to help that. Come to think of that, that's where JIT can fail with high level languages, there are things that these languages restrict you from doing, which prevent such cool optimizations that a JIT couldn't possibly do - I'm not talking about instruction level optmization, I'm talking about things like being able to manipulate memory directly.
The idea that Java + JIT can be faster than traditional ASM is interesting, but never will happen I'm afraid
There are additional 'optimizations' that can be done at runtime, but does using a JIT really improve performance (taking into account the overhead of the JIT)? How many cases are there when a compiler can not optimize the code as much as a JIT, and the JIT can actually give any kind of improvement. I doubt there are many, and unless the improvements are like 10% improvements, I doubt it would matter much. What's the diff between a 433Mhz and a 476Mhz celeron? Nothing any average user would notice. The only place you would notice the speed difference is when you're running something like VMWare, 3DSMax or compiling Linux....and in those cases would a JIT improve or just hinder performance?
Ofcourse, as we get faster, and faster processors, we can move to higher and higher level languages like Java. They drastically reduce development times, and in many cases, will be the better choice. I doubt you'll see Quake or any OS written in Java anytime soon....hell, I doubt you'll see Office or CorelDraw written in java anytime soon
Crusoe is fast and super cool, but it's not as fast as a real 666Mhz x86. And if Transmeta adds official java support, I'd love to get my hands on one
I should also stress here, I have nothing again'st java/jit etc. There're all cool stuff, but I think the concept of Java beating good programmer + asm a bit, uh, unrealistic
If one Transmeta CPU, running the morphing software and actual productive code, is on par with a PIII 500 Mhz, what would the speed improvements be if you used 2 CPUs? One dedicated to morphing, and one dedicated to productive code? Design would be almost as easy as current design, and speed could double.
Best part of this approach is that it still appears that your system has a single CPU. Win9x will still work.
(appended to the end of comments you post, 120 chars)
Speaking of evolvable code, what ever happened to software FPGAs? I'd like to get my hands on some of that... Anyways, I've an idea. Why not include information on the overall objective of the code into the FPGA?
No one ever says, 'I can't read that ASCII E-mail you sent me.'
Thanx.
Doesnt look too good unfortunately...
Actually, AFAIK, Id hasn't used assembly since Quake1. If you're interested Michael Abrash wrote a book called "Black Book of Graphics Programming" which details some of the tricks used in Quale.
I thought that id stuck to C from Quake2 on. (I'm fairly certain the quake3 code is in C, and not C++, The Quake Source was all C). As far as games go, really the only place to use assembly was the inner loop, or the per pixel rasterization loop. That is no longer necessary due to hardware acceleration (Quake 3 requires it), and I don't know if everyone noticed this, but I found Quake2 software mode to be FAR, FAR slower than Quake's software rasterizer.
I really don't think you'd gain enough to warrant assembly language in other parts of a 3d engine (bsp traversial, picking the potentially visible set), it's too complicated, the inner loop of the rasterization engine, OTOH, is very, very small, probably 30 lines of assembly.
Anyway, those are my thoughts, I hope Carmack or Abrash is reading this so they can correct me on any points I missed.
-------- "All I want in life's a little bit of love to take the pain away" --Spiritualized
Okay, admittedly I'm clueless as to the dynamics of JIT compilation, but from my relatively limited perspective, I do have a few comments about this concept of JIT compilers generating better code than a human.
It makes sense to me that a JIT compiler would generate...ubercode. Of course a processor which is inimately familiar with itself would be able to optimize its activity given a certain situation to fit its own needs. It as if the processor reaches enlightenment.
However, in the case of a Transmeta processor I wonder if the processor has to generate new code every time a non-native assembly program is run, how fast could it be. Is there some kind of cap on the speed of the processor due to its constant compilation?
It seems to me, a person clueless in the tao of the processor that a coder, writing for a specific processor, would be able to create code that executes far faster than a processor generating code from a coders code on the fly.
The degrees of seperation between the coder and the binary should be as small as possible.
--
If there is a God, you are an authorized representative. - Kurt Vonnegut Jr.
Once a programmer has infered some property about his program, he needs a highi-level language to express his solution to the compiler. ML's typing system is the best thing I know that acheives that.
Once a program has figure a fast instruction ordering, he need a low-level language to express his solution in. We all know what kind of language does that.
But I don't know any languges that combines both.
As for todays' question : if JITs looks cool, and one time compiles looks boring, think of compiling n-times with a profile session in between and you will get much better result. (all the sugar, no r.time JIT overhead)
-
This post was compiled with `% gec -O`. email me if you need the sources
I work w/ AI and computer language theory both as hobbies and see no reason why code can't be evolved. True, it'd take some effort to do it well but the main point is the more you ran the compiler the better it'd get as it'd learn more and more tricks. First I assume it'd jsut compile a base version.. the same version any old compiler would give you.. then it'd begin trying various optimizations and as it went it'd learn what works best in certain situations and even begin to make it's own optimization combos. After it gained enough experience it could do the job far faster than any human. I've played with this some, mostly in running interp'd languages of my own design and it seems to work fairly well. I've also wondered if you couldn't create something similar for file compression. Might be interesting to see what you get.
At what price learning? At what cost wisdom? The price is a man's peace of mind, and the cost is his life.
- They can avoid costly system calls. Asking the OS for memory and telling the OS you're done with that memory has some overhead. If memory management can be done without making system calls, time is saved.
- Manual free()ing of memory involves touching every piece of memory one last time just to relinquish it. Some garbage collection schemes like stop-and-copy (and it's more complex sibling, generational collection) avoid this last touch by only manipulating memory that is still in use.
You'll still get occasional pauses due to GC, but the amortized performance of your program can be better than manually collected code.x
Interpreted languages generally don't use self-modifying code. And it's not exactly a common technique in device drivers either. How many instances do *you* see in /usr/src/linux/drivers/*? Adaptability doesn't imply self-modification.
First of all, Harvard machines require separate instruction and data pipelines and memory spaces, and 99% of the CPU's on the market (General, Embedded, etc.) are Von Neumann machines which use a single space to store instruction and data. The thing is, newer CPU's (Even if they're Von Neumann desings)have separate caches for data and instructions and most self modifying code violates the cache integrity (See above) because the code is modified in the cache but never stored back into memory (Unless you use a write-through cache design which is impractical from the performace point of view).
This is probably more nitpick than anything else, but it's legitimate to refer to machines with split I/D caches as "Harvard architecture". The original meaning of completely seperate memory for I and D is not terribly relevant anymore and the distiction at the cache hierarchy level is fairly useful. Not many people talk about the IBM Mark series in anything other than a historical context. :)
BUT (And that's a big BUT :) if your JIT VM forces periodic cache writes to RAM (Not every time but enough times to ensure some sort of coherence if the cache and RAM are out of sync) you lose a bit of performance but gain a stabler setup.
To be correct across architectures, your JIT has to insure that you'll NEVER see the old icache data after you store to your program memory. The simplist way to do this is to force an icache flush on every write to instruction space. There may be a way of dynamically tracking this and doing some sore of Just In Time flush, especially if the processor doesn't offer any sort of selective flush in the icache. This could probably be done by invalidating a page with a special flag which causes a flush on TLB miss.
The point is largely moot though. I have serious doubts you'll ever get a consistent performance gain out of updating code on the fly; better to finish execution with a given translation and then go back and tweak the code ordering/generation based on profiling. In that case, flushing the icache is more or less free, from a performance perspective.
But even run time profiling is dubious at best, methinks. There's very little benefit in this space that can't be gotten more easily and at less cost from a combination of static compiler optimizations and dynamic architecture adaptations (e.g. branch prediction).
I recently sped a program up by 150% in a snap simply by killing a few new()s. Swing and LotusXSL have had similar experiences.
I think that part of the problem is that all of this is new, so there is more to do. HotSpot the trendy JIT from Sun in places IS ALREADY FASTER THAN C, but whenever it comes to Object creation, things slow down a lot.
So why in theory should new() be quick when in fact it is slow? IMHO the problem is not with the memory claiming, its with all the other stuff that the JVM has to do.
When I call new Foo() the JVM:
- Checks to see if the bytecode for Foo already exists, and if not it loads it, verifies it, and calls the class init method.
- Allocs new memory. Probably very quick
- Calls the hierachy of constructors of all Foos superclasses. Quite slow.
- (For advanced garbage collectors) Place the object on the 'recent items' list. Probably quick
So I guess the complexity of the system as a whole is the problem here.This is very very slow, but should only be done once.
DWR is Ajax for Java
I once took a course with Dr. Alfred Aho, one of the authors of The Dragon Book and a contributor to lex, yacc and awk (he's the 'a' in awk.) Dr. Aho would never call a compiler component an "optimizer;" instead, he called it a "transmogrifier," because the only way a compiler could produce truly optimal code would be if it could write an optimal program--and it can't.
While the compiler has the opportunity to rewrite the code every time it processes it, it's not very good at rewriting code, so its abilities are limited.
A well-designed compiler on a well-designed CPU can produce very good assembler code--I doubt I could outsmart a compiler in a RISC assembler optimization problem--but I believe that code tweaking is usually of value only in inner loops and for certain kinds of data storage.
All of which adds up to the reality that, pending the arrival of artifical intelligence, human programmers are still needed.
A program that uses a JIT compiler can perhaps execute faster AFTER the JIT has ran, but the JIT takes away execution time. The MORE optimalisation you do in a JIT the more time it consumes.
:) This is sometimes the case, in other situations there can be a lot of CPU time lost to the JIT, because code is just executed once or twice.
The weird part is: the JIT's we know today don't store their compiled code. Why not? Ok, some optimalisations are part of the executionpath the program takes based on the actions and reactions of the user, but a lot of stuff is not.
Interpreted languages have the disadvantage that the result of the interpretation is lost. A JIT solves that (partly). The old discussion about whether a true native compiler can generate code that is faster than the code interpreted and compiled on the fly with a JIT, is a discussion that's not closed or won by any part. This is because there is no scientific evidence that proves any side is right (yet).
What we KNOW today is that a native compiler can generate code that is, generally speaking, fast on the target CPU. Faster than the interpreted version. We also know that the compilers of TODAY can't always predict what's the most fastest code to emit. the optimalisation schemes are very clever but not clever enough. We know that some optimalisations can only be done at run time when there is knowledge what to avoid and what CAN be optimized.
It's thus a mixture of both. However, as I said before, the addition of a JIT to a native compiled codepart, as can be done in a CPU, also adds CPU time. It's thus up to the JIT developer to keep the time added by the JIT lower than the gained speed
Besides the technical aspect: what's really important? if you want platform independance (hardware wise for example), what are you willing to pay for that issue? speed in execution? a high price in software?
If speed is the most important thing, emulation and JITs are TODAY not an option. Perhaps tomorrow or next year, when there is more information available about what is faster in certain situations: a CPU with a JIT, or a JIT with interpreted code or just a fast CPU with native code.
Never underestimate the relief of true separation of Religion and State.
between the quote and my text. It showed up that way in preview. Slashdot ate it when I submitted.
--
you can guess what went wrong...
One of the reasons it is very tough to make java fast is that the byte code is designed for a Virtual Machine that is a Stack Machine!
A Stack Machine of all things! There's a reason that no modern CPUs are stack machines.
Last year, I had to write a java compiler for my compilers class. I was simply disgusted by the stack machine nature of the VM.
Instead of doing actual, meaningful work, the stupid VM spends loads of time wailing away at the stack, pushing, pushing, popping....
Stupid, stupid design for anything other than braindead little chips. But wait, that's what Oak (Java's precursor) was designed for. Go figure.
Once again, marketing has given us a solution in search of a problem.
scottwimer
-- Beer. It's what's for breakfast.
The basic data types are defined. You can use sizeof() to generate 100% portable code when you really have to know exactly how bit your ints are. More importantly, you can use your noggin and avoid it being a problem in most cases by writing code that accepts the range of possible value limits.
Embedded systems programmers will back me up on this one: it's a Good Thing that when you write C you can write one piece of code that's perfectly happy to use a 16-bit, 20-bit, or 32-bit int and it will be equally efficient on any of those systems.
I went to the 1998 Atlanta Linux Showcase and Miguel (The most hyperperson on earth.) Was teaching a class about GNU GROPE. It was suppose to be a optimizer for code similar to the SGI rope. I'm not a programmer so I don't understand exactly what it does but Miguel said that using GROPE could lead to a 40 precent memory reduction under gcc. Has any one heard about this? If so what ever happend to it.
I decided to throw this in after all the other talks on optimization and compilation.
I believe Andy has a good point in arguing that there's no fundamental reason for manually written assembly being better than automatically self-optimizing stuff. I also believe that manually written assembly will ultimately become obsolete.
Nevertheless I'd like to point out that JIT is based a somewhat dangerous technique: a program that alters (its own) code. I believe this technique was used in the eighties to scare off hackers by making code incomprehensible and hard to disassemble until the program was actually running. Also (even on a good old 6502 processor) it's possible to make some speed improvents peeking and poking into the code you're actually executing.
When compilers for Microcomputers got faster and most processor architectures (known as Harvard architecture, I believe) explicitly require a division of RW and RO memory segments, self-modifying code was abandoned. Hacking into your own code is generally conceived as bad-programming-practice.
I was wondering wheter JIT techniques also require very intelligent (and thus "heavy") disassemblers? One might also expect that developing a JIT compiler is a lot harder than doing a conventional one (without peep-hole optimization). Does anyone have experience in these subjects?
Nothing wrong with microkernels, the fact that 99% of research was done on the fatally flawed Mach is what did them in.
(1) I am an assembly language programmer (not 100% or all the time, just that I can when it makes sense for me to do so.)
(2) I've never met a compiler I could not beat in terms of performance, except in trivial cases or with significant hand holding.
(3) A JIT sees optimized bytecodes which are not necessarily optmized for any particular target platform. Thus a JIT must necessarily reschedule code from scratch.
(4) Transmeta translates code which to some extent or another has already been optimized for, presumably, the Intel x86 architecture. Thus Crusoe has less high level work to do, but much more low level work to do.
(5) Crusoe has a very important hardware feature that allows speculative memory dispatching to work around potential aliasing situations. Thus it will be very important the Crusoe's morphing code take advantage of this feature. This solves a problem that has plagued other CPUs and compilers for years now. A JIT has not analogous thing that it can do in this situation.
(6) Calls to new() are not the biggest penalty to the garbage collection memory model. Its the ref count maintenance that ends up all over the place. Besides, there is no way that the could lead to a speed up over the malloc()/free() method which could also use a stack and sub-heap method for memory management.
(7) The Java standard has problems with it that no JIT can solve. The primary example being that the floating point standard is strictly bit compatible with UltraSparcs, but not compatible with either Intel's IA-32 or Compaq's Alpha floating point units. Hence UltraSparcs will always beat x86's and Alpha's at floating point because they have to emulate a Sparc's FPU rather than using their own, and JITs can't do anything about this. But a perusal of www.spec.org should convince you of what a crock this is in native code. Transmeta, should, in theory, not have this sort of problem.
(8) Java has bounds check on its arrays. A compliant JIT cannot remove this check. Crusoe doesn't have anything comparable that it needs to worry about (it merely has to emulate exceptions properly.)
Just my $0.02
--
Paul Hsieh
http://www.pobox.com/~qed/mailme.html
i think so
Can the crusoe chip morph efficiently when, for example, a C program invokes a java program, eliminating the "virtual machine" paradigm?
HFF
Overseas Europe starts its coverage with a Strong Buy recommendation on SFI - Southland Financial Inc. -=(SINC NASDAQ BB)=- it places a target of 25$ by year end 2000. *Website http://www.sincsfi.com/ (Under devlopment) *Outstanding shares Details* SFI has a total of 16,200,000 shares issued and outstanding, of which 3.5 million are publicly traded and 13,100,000 are restricted, *Company Overview* SFI aims to be the premier, independent communications 'network service provider' in South East Asia. SFI will provide cost effective, high quality, value added, last mile qolutions to its customers. SFI opportunities are driven by new technologies including : - Video On Demand (VOD) - Music On Demand (MOD) - MP3 technology, MPMan - Home Shopping and Home Banking - Videoconferencing, Videocasting, Videostreaping, Telemedicine - Interactive Network Games - Internet / Fast Internet Access Products SFI also closed a deal with TEEL (Top Express Engineering Limited) which is expected to be worth $300 million in the next five years in terms of both existing and new infrastructure. SFI recently has applied for the International Gateway License in Hong Kong, OFTA (http://www.ofta.gov.hk/index_eng1.html), we expect news at the end of January, early February. SFI holds the landing rights to Project Oxygen for Hong Kong. (www.projectoxygen.com), (Project Oxygen will be one of the biggest fiberoptic projects in the world connecting 76 ) (countries,Project Oxygen dwarfs any other Line slated for coming into Hong Kong. ) SFI recently signed a letter of intend with CCF STARBRIDGE to apply and enter in the Internet, Intranet and E-commerce area in Mainland China & Overseas (http://www.ccfstar.com/) SFI is presently finishing a form 10 QSB with the S.E.C., In conjunction with receiving an approval to a fully reporting status and the finalisation of project funding, the Company intends to make an application for a NASDAQ listing and is presently in final negotiations with qualified and experienced individuals to augment the board. Further details of which, will be made available shortly. SFI recently signed a mandate with Macquarie Bank (www.macquarie.com.au/oc/) Macquarie Bank Limited of Australia will advise SFI on its project to establisch $200 milllion international telecommunications gateway link into Hong Kong, UTI (United Telecom International ltd) has obtained landing rights in Hong Kong for the $14 Billion global marine cable company, Project Oygen, Macquarie Bank's mandate with SFI includes the requirement of Project Oxygen completing its network into Hong Kong on a timely and cost-effective basis. *Latest Chart http://www.bigcharts.com/quickchart/quickchart.asp ?symb=sinc&sid=0&o_symb=sinc *Latest News Releases http://biz.yahoo.com/bw/000118/southland__1.html http://biz.yahoo.com/bw/000112/southland__1.html http://biz.yahoo.com/bw/991202/southland_1.html http://biz.yahoo.com/bw/991104/southland_1.html *Board of directors & management David A. Turik, President and Chairman of the Board Mr Turik brings a wealth of telecommunications expertise to the Company having previously held senior executive management positions with Australian publicly listed companies, NetComm Australia, Telstra, (the former national carrier, Telecom Australia), AAP Telecommunications and Spectrum Network Systems, as well as Toronto Stock Exchange listed, TSB International. In the past 5 years, he has predominantly focussed on Business Development, Mergers and Acquisitions and project specific capital raisings. Mr Turik has held an interim role of acting President and Chairman for the Company and was responsible for identifling and negotiating the Company's telecommunications infrastructure based project in Hong Kong. Mr Turik has held many private board appointments across a wide range oftelecommunications related organisations. Martin Dougherty, Non Executive Director Mr Dougherty has had a long and distinguished career in the media and communications field having held senior executive positions with both the John Fairfax Group and News Limited. Mr Dougherty has also held Board Directorships with the John Fairfax Group, the Australian Associated Press, (AAP) and David Syme Pty Ltd, (publishers of the Melbourne Age and Business Review Weekly). For over 17 years, Mr Dougherty has also been Chairman of Dougherty Communications; public affairs consultants, specialising in strategic public relations, issues management, telecommunications, Government and media relations. This company became a member of the Ogilvy Public Relations Group. Mr Dougherty was a member of the International Management Group of Olgilvy Public Relations during the 1980's. He resumed private practise as a consultant in 1989. Willie Lo, Executive Director, Chairman of United Telecom Inc, (Hong Kong) Mr Willie Lo is a veteran of the Telecommunications and Information Technology industries. He has held senior technical management positions for over 18 years, with Hong Kong Telecom, Telstra, (Telecom Australia) and PRACOM Pty Ltd. Mr Lo has successfully managed a broad range of projects from cable network roll-outs to systems development and implementation. For 8 years, Mr Lo was a Project Manager for Hong Kong Telecom, managing the deployment of major cable networks. Mr Lo was the IT Manager, Media & Broadcasting, for Telstra, prior to becoming IT Manager for PRACOM and an Executive Director, for Pacific Communications Research P/L. Mr Lo holds a Masters Degree in Systems Engineering, (RMIT), Diploma in Business Studies, Hong Kong Polytechnic and Bachelor of Applied Sciences, University of Melbourne. Robert Talbot-Stern, Non Executive Director Mr Talbot-Stern, B.S. Econ. (Wharton), J.D. (Penn.), LL.M. (London), has had a distinguished corporate, academic and public career, having been Group Counsel for UNISYS and Assistant General Counsel for Chrysler. While at Chrysler Talbot-Stern held Board directorships with Mitsubishi Motors and Peugeot, while coordinating Chrysler Canada's rescue effort. Mr TalbotStern is on a current White House Task Force on Deregulation and Competition, has been a guest columnist for the Australian Financial Review and business commentator for CNBC-TV. He has worked as a Management and Legal consultant in affiliation with consulting and law firms in Washington (National Academy of Sciences, McKinsey & Co., Herzfeld & Rubin, and Boston Consulting Group). He has been a past adviser to the Business Roundtable and Federal Reserve. * Comment of Overseas Europe Staff Note that : - China population 1.850.000.000 (1998). - Hong Kong population 8.800.000 (1998). - ITU (International Telecommunications Union) forecast global telecoms industry to be 1.5 Trillion USD by 2000-2001. - The recently signed WTO will have a very postive impact for SFI on long & short term. SFI will announce a new PR company who has a very good reputation on Wall Street. Overseas Europe believes that the company is way undervalued and will update your further on upcoming events and news releases. For more info on SFI send a email at overseasii@yahoo.com.
Compilers, programers, coders, and physicists and all interested in private communication need to look at the 'Information to Data Compiler' dialogue of Fox Forums: Fred Langa; Techno File: Physics.
I hate when I skip words.
So if these languages don't support hinting on a language level, what does? And why don't they?
The point is that the average application has inflated enormeous over the years.
So has the average machine. Which is cause and which is effect? Do we have bigger applications because with today's faster processors and bigger memories, throwing more hardware at the problem is cheaper then trying to bum five bytes out of a 50 KB program?
Something to think about, anyway.
dragonhawk@iname.microsoft.com
I do not like Microsoft. Remove them from my email address.
Perhaps another malloc comes along wanting the same size (roughly) as the thing just freed. We just give it the recently freed chunk. We then don't have to build a new one.
:)
IIRC from what I have read about the Linux kernel, it does something very like this. Blocks of free memory are kept track of in a hash table, where the hash key is the size of the block. When you free a block, it gets merged with any adjacent blocks and put in the appropriate spot in the hash table.
Or it might have been an array of lists of blocks, where each element in the array represents a list of blocks twice as big as the previous element's list. It's been awhile; I forget.
dragonhawk@iname.microsoft.com
I do not like Microsoft. Remove them from my email address.
As all interesting questions this is a matter of how things are implemented. JIT isn't automatically faster than one-time compilation. Without optimization based on runtime data, platform information and profiling, JIT will be slower than compiling by default. Because JIT includes a significant overhead by converting, profiling and optimizing the program. Anyone who has compiled some big programs knows what kind of overhead we're talking about here. Especially with full optimization, there are worlds between time-consumption figures.
My proposal for JIT compilers is to _store_ and rerun the bytecode and profiling information for later. It should be possible to further optimize already compiled code, especially in much-visited code. So that over time, you can spread the optimization efforts to when you have CPU cycles to spare. As a result first-time execution will not be as painfully slow as it is now, and you will still appreciate fully optimized code after a while, yes, even better than compiled code. This really makes sense with runtime profiling.
For the best average effect in every scenario, I would recommend an automatic (perhaps weighted?) hybrid between interpretation, JIT and compilation. Also, leave the final choice up to the users of the system.
For all I know, this might already be implemented in some JIT compilers. Like I said in the beginning, it's all in the implementation. The use of hardware in Crusoe seems very promising for JIT, runtime profiling and platform-independence in general.
For games, I would recommend an ability (choice) to precompile Java programs and store the entire bytecode. Just because I meet a new type of monster in JavaQuake shouldn't mean my demise (unless I forget to strafe).
- Steeltoe
http://www.debunkingskeptics.com/
You're absolutely right. All of these ideas are already being used in Sun's Hotspot runtime for Java. There are some academic papers which claim that properly-tuned Generational garbage collection can be faster than allocating storage on a stack. It will just take a little time to see if commercial systems can replicate this. IMHO, it's silly to be so worried about speed. For many applications, the time it takes a programmer to write something is much more expensive to CPU time, and anecdotal evidence suggests Java programmers are 40%-100% more productive than C programmers. Game Boy Camera Gallery and Tutorial
Unfortunately, they are somewhat counterintuitive and go against certain popular beliefs. Don't you know that Java is slow (even with a JIT) and will always be slow? You will after a few hundred flames.
The only reason why Transmeta do the translation transparently is the instant support for the x86 OSes this gives.
For Java you dont need this. A full blown Java VM application gives better Java performance, because it basically does the same, but with less system-specific constraints.
well made points.
@#$% theory.
:-)
I can write fast C (or C++ if you stay away from the evil slow features) code.
Everything I do in Java is bloody slow.
I do believe fundamentally that some sort of per-machine compilation is in fact ideal -- the C compiler could do clever stuff if it new the exact cache architecture of your machine and compiled your programs from some intermediate (say parsed and desymbolized) representation when you ran them (or maybe just the 1st time you ran them).
(all the slackware people in the crowd say "bwaahahahaha! I did compile my programs for my own machine!"
But I think the dumbness of Java -- the funky GC, the dumb dumb dumb RTTI, will always make it slow. For that matter, Java uses a dumb fake-o instruction set that's not a particularly good representation of the information needed by a compiler. Why not give your JITC something closer to the source rather than a binary for a machine that doesn't even exist ?!?
...
Ditzel said something very significant when he introduced Crusoe: (i'm paraphrasing) "The reason we haven't talked openly about this before is we didn't want to boast before we could prove we've done it". Ditzel stood there and said "We've got a chip that does on-chip JIT architecture translation, and does it to run as fast as any other chip out there, and here's the proof".
The Java people have been claiming for a long time that they could write faster code than the C compiler with JITC.
Java has gotten better than when it started, it's true.
But it still can't hold a candle to gcc -O2.
My message to the Java world: I spend too much on my computers to waste time running slow code. Call me back when you can actually demonstrate your claims.
The basic reason why compilers are slower is a programmer can use information that is simply not in the program in their optimization.
In one (in)famous example, a quick, but, in general unsafe conversion was used in part of the Ariane 4 control software, justified by a mathematical formula present in a comment which proved from the geometry of the spacecraft that it was, in fact always safe. [ sadly, Ariane 5 had different geometry.....]
A more down to Earth example is a loop that you may know will never execute 0 times, which would allow you some optimisation.
Profiling JIT compilers can do somewhat better, in that they can, at least, note that condition X has never happened to date, and produce code that handles it as an expensive exception, while concentrating on keeping the main path fast. The Transmeta stuff seems to do a lot of that.
This was actually a topic of my master's thesis. More information on my FLEX compiler project. Both the RAW and the FLEX groups here at MIT have compilers that do bitwidth analysis.
[
The argument was that compilers can be better because they can rewrite the whole thing each time. Ergo, if you are also willing to rewrite a given piece of code whenever certain conditions change, then you can beat the compiler.
However, for situations where you can't change the code every time anything changes (eg. it's too big), my bet is on the compiler.
--
Patrick Doyle
Patrick Doyle
I mod down every jackass who puts his moderation policy in his sig. Oh, wait a sec....
No offence, but that's a really lame memory management scheme. Nobody would ever actually implement free() that way.
My guess was that Andy was referring to data flow analysis techniques that can determine that an object is never used once a certain function ends. Then, the object can be allocated on that function's stack, with very low overhead.
--
Patrick Doyle
Patrick Doyle
I mod down every jackass who puts his moderation policy in his sig. Oh, wait a sec....
To match this performance a traditional compiler would have to compile everything twice, once normally and once optimistically, thus doubling the executable size. Of course the CPU must have this rollback capability in order for things to work.
Moderate this down (-1, Just Plain Stupid)
--
Industrial space for lease in Flatlandia.
And to take this argument to it's logical (for me) conclusion:
The best setup would be to constantly re-compile the compiler, and for that matter, the kernel, too. (NOT!)
Compiling once for a given hardware set yes, but not re-compiling for every execution. I'm guessing that somewhere in the neighborhood of 50% to 90% of the compiled code WOULD NOT CHANGE from one recompile to the next, and I just don't think you'll be able to make up for that kind of wasted cycles by optimizing the rest of the code.
However, if you do, I'll be the first to write you a check!!!
You are wrong about the inlining. Sun's Hotspot VM can do "optimistic" inlining and rewrite the code if classes are loaded later.
It really does not matter whether this preprocessing / VM translation is done in software by a very fast RISC like processor or in microcode, as in a Pentium-like architecture. Technically I do not see a fundamental problem here.
There is a practical BONUS, though: since the interpretor is not in the actual silicon, you don't need to worry about buying a chip that possibly has lots of small bugs that need program patches and workarounds for every single program you bought!
Its wrong to have the most basic data type wobble over the bit length.
Most people dont think about this when writing their programs. This is a failure of C, because programming languages are for the people.
The solution would have been to have the basic data types with fixed size, and additional data types with "some size between 16 and 64 bit giving the optimal performance on the target machine".
This would have saved many people a lot of grief...
One of the areas that run-time profiling adds to the mix is the ability of the code to adjust "itself" based on processor cache sizes, etc. If I run a program on my K6-200 and then on a K7-700, the program should have loops optimised very differently. That is, my chip has very little cache in comparison to the K7. The K7 could have a large number of long instruction sequences unrolled that would be in L1 cache whereas it would be smarter, if branch prediction works out (again, run-time profiling) to keep those loops rolled up on my CPU.
- Michael T. Babcock (Yes, I blog)
All in all, quite interesting work, and quite worth the read.
I dont think so.
I have not seen one compiler that is able to optimize better than a human. I've worked with some compilers, have looked at the assembly code they generate and have written assembly code myself, both for Intel based systems and for MIPS Rxxxx.
The problem is that compilers have to generate code defensively. That means that they have to add code that enforces the laws of the higher language. Also, they have to follow clear guidelines with respect to register usage, parameter passing etc. that I as an assembly programer can change according to the situation. A compiler cannot behave that way because for this it has to see some bigger picture, and this is beyond it's abilities.
I think you completely underestimate the difficulty of writing fast code. JIT and java stuff will allways be inferior in performance to serious code (assembly or not) written for serious processors by serious programers. The best human-machine combination is still this one: the human does the thinking, and the machine limits itself to be stupid.
my view, anyway.
rmstar.
To start with, I beleive that Transmeta does not actually gain speed by code morphing, but it simply allows them to use a smaller chip. If I remember correctly it takes a 700mhz transmeta chip to compare with a 500mhz intel chip. This is why the chip is used mainly for laptops. Secondly, I don't think that a compiler will necessarily always be able to do better optimization than a human. I might be biased from programming the early VGA cards (and the fact I was using BC3.1), but there are many non-standard approaches to optimization that simply cannot happen with a compiler -- it requires knowledge of what the programing is trying to attempt. For example, a compiler will certainly not want to use the MMX floating point operations for everything, but for very time critical/protected areas you can get huge time savings (if you are using an 8086 of course). Thirdly, JIT will always have a longer start time. There is no way around this. Think of how long it takes to compile the Linux kernel. Imagine if it needed to compile _every_ time you rebooted. In theory, compile time cannot be made too much faster, especially if you start adding optimizations. The only reason why JIT is useful to _anything_, is that Java is so slow (as are almost all interpreted languages), that the time it takes to compile the byte code to machine specific code is worth it. And, you can write platform independent code (theoretically). However, I think computers are fast enough that hand coded optimizations are generally not needed for most applications. At the same time, try imagining some realtime embedded system (or again, the Linux kernel), where architecture may or may not be standard, and it cannot afford compiling anything anytime. Also, if you have ever tried writing an interrupt routine (at least in the old days), you were basically forced into writing in asm.
I don't want to argue about the general case, but as things relate to current-technology Java, JVMs and JITs, here are some things to consider:
a. One of the most important optimization techniques is inlining. In OO, you can't inline unless you're sure a call is monomorphic (static). Java is even worse than the general case OO -- a Java program simply cannot use inlining in the vast majority of cases. Here's why: To inline a call in OOP, you must know for 100% that the distatch site is monomorphic (i.e. that the call is never polymorphic). But in Java, even if the compiler finds out that a certain method is never replaced by inheritance (overriden) in any class in the program, the program could load new classes at runtime (using Java's Reflection mechanism), thus turning monomorphic call sites into potentially polymorphic ones.
(A few exceptions to this rule: first, there are final methods, or methods in final classes, that cannot be overridden. Second, there are Sealed JAR files, in which all package-level (default access level) methods cannot be overriden. IBM's JIT takes advantage of that, a fact which allows for great optimizations in the standard Java libraries.)
b. Sun's JVM (the one for Windows, at least) implements references by a dual layer: the value stored in the reference fields is an index into an array maintained by the JVM, which contains the actual pointer to the memory block. This allows the memory manager to move blocks around without having to update all references to that block, but it also means each dereference actually requires two dereference operations. So when you write "this.field1.field2", you think you're making two dereferences but actually the machine makes four.
c. The JVM is a stack machine. This means that naive implementations on register machines result in slow operations, that simply do not take proper advantage of the native architecture even when compiled to native code. Microsoft was the first to realize this, and indeed their JVMs were the fastest at the time (later on IBM's JVM took that trophy).
Finally, a word or two on the general case: anything a JIT can do can also be done by a regular compiler, if you allow it to use profiling data from a few "typical" runs of the program (but note that not all programs have "typical" runs). The JIT, however, consumes memory at runtime, unlike a regular compiler - and with programs that require more memory than is physically available (practically all programs today), extra memory requirements reduce speed since they increase paging; so the JIT has its price.
- Tal Cohen
As shown at http://www.idiom.com/~zilla/Computer/javaCbenchmar k.html, even the linux jdk, which is relatively slow than other platform's, runs as fast as natively compiled C program when runs simple operations.
In my understanding, java's poor performance is due to following factors:
1. too generized api design resulted in very deep call stack. for example, printing current date and time using java.text.DateFormat uses much more calls than traditional C (a system call and a printf is enough for C). It gets more terrible in swing. use of interfaces causes the same problem i think.
2. JNI is too slow. (can't understand why but it is known to be slow and java can't help but using many JNIs)
3. as WORA is important in java, it is hard for java to use any platform specific resources such as Graphic card's accelerations. that is one of the reasons swing's design is getting ugly as it tries to boost its performance.
4. too many small class files result in high I/O usage when JVM loads classes.
5. no MACRO! for example, most of java API is synchronized, which is expensive, reguardless thread is used or not. the same for security check codes and platform specific implementation codes,
etc. these problems can be solved by MACRO but...
someone in sun doesn't like macro and i agree with him when think of java's purity.
some of these problems are known to be solved in next jdk version(hotspot client version). slow synchronizations and class loading speed, etc.
but mostly, they are java's design issue and never be solved, as java is mostly running in virtual machine and originally it is designed to be used for java chips, only when JVM is implemented in processor level, the speed problem will be solved.
I'm looking forward to seeing java running in MAJC or Transmeta's chips.
please correct me if i'm wrong in some and forgive me for my poor english.
Cheers.
-- Y. J. Chun
Of course JIT takes some time to start up. But when a program is started next time, a smart OS (environment, etc. what you call this) can aways use a copy of a previous run.
This improves time efficiency at the cost of space efficieny. Wrt. effectiveness: the original byte code (half-product) is "processor-independent".
This leave much more space for speed improvent at the hardware level.
Seriously, when you communicate over a inter-machine connection, you have to comply with the communication standard.
If the communciation protocol is, for example, byte based and you are writing portable code, you have to have a formal, portable method of chopping up that struct into bytes at the sending end and building it back up again (from bytes) at the receiving end. It's called serializing, and it's disgusting laziness to skip it; if you do, you deserve those problems, and also to get bleeding genital warts.
Just dumping the bits of a fuzzily-defined internal data structure outside of the comfy innards of your own program is the ultimate in programmer carelessness (well, dividing by a null pointer and writing to the dereferenced quotient, with appropriate casts, is the ultimate in carelessness, but you get the point). Would you also send your pointers across a network and expect them to work at the other end?
In theory, a perfect compiler will outperform any human programmer writing hand-optimized assembly. In practise, a real compiler will outperform assembly code that is written by an incompetent programmer (who, for example, doesn't understand all the important details of the target architecture). Too bad many people writing assembly today seem to do a worse job than a decent C compiler.
Jukka Ikonen / jaikonen@nospam.cc.hut.fi
However, I'm pretty sure there are currently NO perfect compilers around and there won't be one for a long while, if ever. A GOOD assembly programmer who writes code for absolute best possible performance, not caring at all about maintainability, will outperform current compilers most of time. This happens at least on the most popular architecture of today, Intel x86. Speedups of 100% or more by tweaking gcc-generated assembly output is not uncommon, and VC++ is not much better.
x86 is probably the most difficult mainstream architecture to optimize for, especially for a compiler (there being only 8 general purpose registers is the most significant reason). If a (JIT) compiler could use run-time profiling information to put emphasis on optimizing the hot spots, the performance would obviously improve. For example, variables that aren't used in the most likely execution paths can be stored in stack, freeing registers for more important uses.
However, current JIT compilers typically don't outperform traditional static compilers. In order for them to be practical, they cannot perform very complex optimizations, since it would cause long time delays seemingly randomly during execution. Tight loops could be optimized well, but complex global optimizations are probably out of the question. In addition, Java's security model makes it practically impossible to get C-like performance in general. Especially clever (ab)use of pointers, casts and unions give a C programmer a clear advantage. Of course, a JIT compiler could be written for C, but with current compiler technology, it would be a lot slower than static compilers.
Copying garbage collectors are asymptotically faster than ordinary free()/malloc() combo. But in practise a good programmer uses custom memory allocation routines if performance is important. Actually, it is possible to use a garbage collector from C or write a custom garbage collector optimized for a certain task. If the programmer is sufficiently skilled, he/she will always beat a general-purpose garbage collector. Besides, an incremental/generational (read: fast) garbage collector requires read/write barriers that may cause a significant performance hit.
The lesson is: When talking about performance, theory does not count. Existing compiler technology, programmer skill levels and the time spent on optimizing code dictate real-world performance, not what a hypothetical perfect compiler could achieve.
This has to be one of the most widespread, and possibly destructive, myths about C programming. The truth is, if you write code that depends on the size of your basic types, your code will be less portable. The next generation machines will always handle your types in ways you will fail to predict.
> And it also doesnt give todays hardware any advantage.
Have you ever worked with a PowerPC processor? Some of these cores do not have 16-bit integers, only 32-bits. Yet, lots of people reasoning like you use them anyway (because "16 bits is enough for this value"). This forces the compiler to insert 16-bit emulation code, thus increasing code size and decreasing execution speed. Had these people ignored the size of their integers and just used 'int', the code would have been both faster and more portable!
Take a look at a few of the most widely ported software packages around. How many of these redefine the types in order to "know the size of their types"? Not many, I can tell you.
Writing portable code has little to do with worrying about the size of your integers, and lots to do with not trying to optimize your code to whatever system you happen to write the code on (by, for example, worrying about what the size of your integers are...)
The difference between theory and practice is that in theory there is no difference whilst in practice there is.
First off JIT is not java specific. Transmetas chip does a form of JIT on everything - even x86 assembly code. Your basic argument that JIT can be efficient is correct (not more efficient than humans, but I'll come to that).
The performance hit from Java comes far more from it's crappy libraries (Swing anybody) than the JIT aspects. Bounds checking, garbage collection, and inherent overhead of OO techniques all cause a performance hit, but the real trouble comes from crappy libraries written by crappy programmers who put 10 levels of indirection between a request and it's implementation.
For instance the String class being written with 16 bit characters to make it unicode friendly when 98% of the java code written never uses this feature but still requires twice the time and space for every operation. Also Strings are not alterable so to alter a character you create a new StringBuffer object, copy everything, make an adjustment, then replace the old reference with the new so the old object eventually gets passed to garbage collection... all compared to
buf[i] = 'a'; (C or C++). It's crap like this that makes Java seem so slow, not the JIT hit.
Your argument about how JIT can outperform programmers because it has more information is flawed - any decent programmer will know where the spikes are and optimize accordingly. He will know what he can and can't do with the data, whilst the compiler has to take the strictest possible interpretation. Advanced optimization is too slow to do on the fly (although you can optimize the byte code to some extent with a decent javac, then optimize the translation with a JIT)
Garbage collection argument is silly too. In C/C++ or assembly, the programmer has a choice as to whether he uses the stack or the heap. It makes his life more complicated, but as long as he knows what he's doing it's inherently more efficient. If he doesn't know what he's doing then he's useless in any langauge. If you want to, you can use a garbage collector in assembly or C++, but it's not the only technique available to you.
Basically Java is slow for the same reason that Windows sys admin is slower than Linux sys admin.
You don't trust the user with the rm command or trust the programmer with pointers - show him an idiot box if he tries to delete a file or bounds check that array access for him. It's one way of doing things, but it sure isn't efficient.
http://rareformnewmedia.com/
And if you need really big numbers you can always code your own BigInt structs, though you might want to write a little preprocessor to do that (or use C++ and redefine all the ops for it).
Naturally, on the first iteration, crusoe will behave almost exactly like a JIT compiler, but adaptive optimization is another story. I would present the case that crusoe is more like Sun's HotSpot technology. If you are interested in the details, here is a lecture presented at JavaOne about java optimization, which talks about the adaptive run-time optimization in HotSpot.
Or maybe it only bothers to do this when resources get tight. Perhaps another malloc comes along wanting the same size (roughly) as the thing just freed. We just give it the recently freed chunk. We then don't have to build a new one.
Caveat: I have no idea if this makes sense, performance-wise, in the real world.
The cake is a pie
Did I miss it? Where was it argued that Code Morphing is not the same as JIT compiling?
--
Patrick Doyle
Patrick Doyle
I mod down every jackass who puts his moderation policy in his sig. Oh, wait a sec....
Does anyone know about such a comparison?
Or, alternatively one between STL and Suns collection classes?
This would be very interesting and helpful, since this may be the only roadblock for using Java in a critical project right now.
If anyone has an URL, please put it in an answer to this post.
>As long as the programmer has more knowledge than the compiler, he will
/. to correct me on this... :)
>always find tricks to save an instruction here or there and outperform
>the compilers this way.
This is true, on a small scale. If you want to solve any kind of a decent-sized problem, however, you are not going to want to implement the entire program in assembly. Even if you're not writing a GUI program, a large program would be nearly impossible to implement. Looking at a decent sized C++ program, how much duplication and cruft do you see? Imagine how much more there would be in an assembly program.
As a real-world example, I believe that Quake3 uses MOSTLY C++ for implementation. Only the really intensive, often used portions of the code are implemented in assembler. I wonder if Carmack reads
Its worth spending twice as much time optimizing an algorithm that's going to account for 90% of your CPU time, but not otherwise. And only graphics apps really have these sorts of algorithms. In a GUI program, most of the CPU cycles are spent passing messages around the system, or in OS calls, which are also generally message-oriented. That would be code worth optimizing, maybe. In the application I'm working on, the most CPU intensive algorithm we have is a FFT, and I would bet that its probably the place where the least optimization could take place. The slowest part of our code (the part that uses the most CPU cycles for the least accomplished) is the part that writes the data out to the screen (After the FFT; Damned third-party controls). It takes approx 2 seconds to run a hundred thousand points through the FFT, and probably 20-30 seconds to display it. The extra time is taken up by GUI, message passing, and COM/OLE crap.
The point is that only the most CPU intensive, high-performance algorithms really need to be written in Assembly. And most programs don't have these needs.
>>>>>>>>>> Kvort
-Don't mind me, I'm personality-deficient and mentally-impaired.
<nit picking> A glass of water poured into a bucket of stones will sink, just as a glass of stones poured into a bucket of water will sink. Furthermore, while stones do not sink into molecules of water, water can seep into some stones. </nit picking>
I seriously doubt, that you have ever written any kind of code for a compiler!
- Invartiant detection, using compilers to find invariants in the code, and then deduce lemma's which may remove bounds-checks, type-lookups, casting or flush of registers to memory.
- Partial evaluation, make several instanses of the same code, whith certain parts of the code already evaluated, or arguments assumed to have fixed (often -- in this program -- used) values (Of course, this often introduces new invariants).
- Rescheduling instructions to run better pipelined, you would be amased at how good compilers are at this -- one technique i particularly like is using tree-automatons to select best-fits.
- Instruction selection -- using the right instructions for the processor.
- Inlining code, which makes more finding invariants easier, since some arguments may be known to have certain values (actually, inlining is a kind of partial evaluation)
- Moving the code to Single-Static-Assignment form (SSA-form) at some intermediate parse of the compiler, where every variable is only assigned once (this makes a lot of data-flow analyses work _very_ well)
None of these (except maybe instruction selection) are even remotely feasible to do by human hand, and a most compilers only implement them poorly. Unfortunatly we do not have linkers who can select from a variety of instances of functions that have been partially evaluated to meet some (varying for each instance) input specification, this would increase code size by something like 10-fold, but only if all the partially evaluated functions were used.C is for efficient coding by trained experts, with the potential for portability if that is a desired goal. It provides no hand-holding that would interfere with run-time efficiency. The default behavior is and should be the most efficient behavior.
There are plenty of other languages for people who are willing to sacrifice efficiency for more hand-holding.
After all, there's some limit to how efficiently the code can be run on the given architecture (I would assume) and so once the compiler gets close to that limit, I'm guessing it would reach a point where it goes "Ok, I'll get a bigger jump in efficiency by just using this 57th generation piece of code, than by trying to optimize it one more time around"...?
BTW, now that we're into tiny chip processes and superscalar and all that great crap, where's my 1024 RISC32 machines on a chip??? hmm? I want my connection machine!!!
Not that I don't appreciate optimized compilers and such, but compiler optimization hits the wall of diminishing returns pretty quickly. Especially when you consider that extreme optimization by a compiler in most cases will get you maybe 5% better performance in comparison to just a little optimization.
When you compare this to the rate at which hardware is improving and it is trivial.
> it's particularly nice in large and sophisticated object models where you almost always lose track of who has a reference to what
Does this make anyone else nervous as heck. If you can't even keep track of when your objects should be deleted, how the hell do you keep track of anything. Increasing complexity requires more clarity of thought and design, not less. It's not the overhead of GC that bothers me, it's the sloppiness of thought it seems to encourage (try using JBuilder with less than 256Meg if you want to see what I mean).
In some languages (eg ML), you really are working at a different level of abstraction and then GC makes perfect sense, but it's not GC explicity that makes life easier.
http://rareformnewmedia.com/