jmaessen · Slashdot Mirror

Re:Another reason ... on The Object Oriented Hype · 2001-01-09 02:09 · Score: 1

Chip designers have indeed done a piss-poor job of branch prediction in the presence of procedure calls. This hurts any code that isn't written using gigantic procedures. Yes, compilers respond with heavy inlining---but that's what leads to the ever higher icache miss rates.

It's time computer architects admitted that branching to an unknown offset is important. It's important because it's part of every procedure return, every shared library call, every piece of relocatable code, and every dynamic method dispatch that is ever done. There are limited solutions---the Intel architecture tracks CALL instructions and tries to map the to the corresponding RET. However, these solutions are so limited that common idioms---like the PC fetch required for relocatable code in .so and .dll shared libraries---break prediction completely for subsequent code.

When this prediction works reasonably well, everybody will benefit. Even better, the need for inlining will be much less and we can hope for improved icache hit rates and smaller binaries. There is promise; the Pentium IV's trace cache uses a technique for eliminating dynamic branching that's proven itself when done dynamically in software (a la the HP Dynamo project). Too bad they made the ICache far too small to be useful. The Itanium provides special registers for branch destinations; these can be loaded well in advance, giving some hope that the processor can keep up.

For now, we're stuck. It galls me no end to hear claims that our [Procedural, OO, Functional, ...] code is lousy when our processors can't even make relocatable code work in a reasonable way.

Ubiquitous storage reduces storage requirements? on A Different Idea For Distributed Storage · 2001-01-02 23:30 · Score: 1

The most intriguing aspect of distributed mass storage is the potential to actually *reduce* the total amount of storage needed to store information on the internet.

Fundamentally, most of the contents of my hard drive already come from somewhere else---programs, data files, cached web pages, etc. Much of that information is pretty rarely used, and ends up being discarded (the web cache is really useful for a few pages, and a waste of space for the rest) or languishing (little-used but still essential software).

A "good enough" distributed store, possibly combined with good versioning and clever uses of caching/disconnection would make it practical for me to offload most of this useless garbage. Yes, I am instead accepting encrypted chunks of data from all over. But I bet this data is comparable in size to the savings realized by commoning up all the crud that is found on most hard drives.

Re:Just one question... why? on Is SMT In Your Future? · 2000-12-28 05:22 · Score: 3

A lot of people are assuming that multiple processors can be put on the same die for "equal or less cost". This simply isn't true.

Sharing the cache is hard

Cache is the vast majority of chip area in a modern processor; as others have pionted out, it's obvious that multiple processors should share a cache. However, this is difficult. The problem is that every load/store unit from every processor must share the same cache bandwidth.

Thus, for a 2-way chip with only a shared cache, memory latency---to the cache, the best possible case---is cut in half.

We can work around this by using various tricks up to and including multiported caches---but most of these tricks increase latency (lowering maximum clock speed) or require much more circuitry in the caches (we were sharing the cache because it was so big, remember?).

It makes much more sense to share the circuitry that feeds into the cache.

Those are the superscalar execution units! Thus, SMT.

Utilization

Instead of keeping half the execution units busy, we attempt to keep them all busy. Extrapolating very roughly from Figure 2 we can expect to issue about half as many instructions as we have issue slots (actually less if we have a lot of execution units). The basic idea is we can cut the number of empty issue slots in half each time we add a new thread. Further, instructions from separate threads do not need to be checked for resource overlaps---this circuitry is the main source of complexity in a modern processor.

What's happening now has been predicted for a long time. The extra resources (a bigger register set, TLB, extra fetch units) required for multithreading are now cheaper than the extra resources you'd need (mostly pipeline overlap logic) to get a similar increase in single-threaded performance.

SMT easier than SMP?

Moving thread parallelism into the processor is actually easier for the compiler and programmer; the weak memory models implied by cache coherence models aren't an issue when threads share exactly the same memory subsystem.

To get an idea for how hard it is to really understand weak memory models, consider Java (which actually tries to explain the problem to programmers---in every other language you're on your own). Numerous examples of code in the JDK and elsewhere contain an idiom---double-checked locking---which is wrong on weakly-ordered architectures. What's this mean? Your "portable" Java code will break mysteriously when you move it to a fast SMP. Alternatively, you will need to run your code in a special "cripple mode" which is extremely slow.

From a programmer's perspective, SMT (as opposed to SMP) architectures will be a godsend.

A Turing machine? on Java On 8-bit Platforms · 2000-12-16 10:06 · Score: 2

The referenced paper is a little slim on actual technical content. It appears that what they're proposing is in effect an extensible bytecode (what they refer to as "adding new procedures"). This would not in itself be novel---it's one of the guiding principles behind Forth, which is one of the reasons Forth is still used for resource-constrained portable coding. There does seem to be provision for stuffing all those new bytecodes into a single namespace.

Not surprising, but it will be encouraging if they succeed in getting a fully functional system in a truly small footprint. More power to them. But it'd be nice to see more detail on what their tools are _really_ doing.

-Jan

Is it a good compiler target? on Analysis of Amiga Virtual Processor ASM · 2000-11-05 22:39 · Score: 1

Realistically, there should be no reason the average programmer should need or want to code to assembly language on the Amiga (any more than she would use assembly on the x86, or write raw bytecode for the JVM). Thus, the real test of the Amiga system will be its suitability as a compiler back end. How does it measure up there?

Other posters point out that many of the constructs touted in this article were common features in older assemblers. This hints that many of them are more useful to the assembly programmer than to the compiler writer. If the dynamic translations being used do an equally good job generating e.g. loops from looping constructs and from explicit branches, this should make little difference in practice. If, however, there is a penalty for using one or the other construct then we must expect either 1) poor high-level language support or 2) assembly programmers to ignore the high-level constructs in favor of more efficient low-level constructs.

The biggest potential of the VP code lies in establishing flexible calling conventions. Every procedure is allowed to establish its own conventions for passing arguments and returning results. This combined with the assumption of infinite registers allows very simple and reasonably efficient high-level language compilers to be built. Moreover, by passing all arguments in registers, it should be easy to make multiple languages interoperate smoothly. This is because there is no need to establish conventions about the shape of the stack, the location of the return value, and so forth when a call is performed. This is handled by the underlying assembler.

An additional advantage of the infinite-registers assumption may be the ability to "registerize" global variables and return values. Most compilers must keep global variables in main memory; if we were to keep them in registers, independently-compiled files (like libc!) would not understand which registers were in use. GCC provides special system-dependent ways of doing this, but they're dangerous and fragile. In principle we can perform register allocation for the entire program in one go when we dynamically compile it, and this concern goes away. I suspect it may still be impractical to do so in practice. Registerization of return values, however, is very easy. We simply declare multiple return registers as shown in the example code. The biggest concern here is that there is simply no way to declare this in C! Thus, very few C programmers are likely to be able to make use of this optimization---it is much like gcc's reg-struct-return optimization flag.

What C programmers do now to return multiple values is they pass pointers as function arguments. This raises the biggest unaddressed issue in this article: how _does_ the stack work, anyway? We presumably can't take the address of a register, and a C compiler must therefore be able to declare and use stack memory. It's not clear how that can be done, though there is presumably some way of doing it.

It's also unclear if we can get our hands on the real VM state. For example, many garbage collectors (eg the GC in any decent Java implementation, but you can name a statically-compiled garbage-collected language as well) need to walk the stack and index garbage collection tables to discover live variables. They then need to find the memory or registers which contain corresponding pointers. How can we do so in Amiga VP code?

This article raises more questions than it answers. I'm not encouraged by presentations which talk effusively about how easy it is to program in assembly language; no assembly, however easy to code, is going to become as widespread as C in the near future. If the Amiga VP is to become a popular target, there must be solid compiler support. It must also be attractive for compiler writers and library authors to re-target their own work to the Amiga VP---and that probably means answering questions like the ones above in a widely-accessible forum.

Re:Programming for distributed systems. on Distributed Operating Systems? · 2000-07-31 20:38 · Score: 1

Actually, very few programmers actually understand how to write correct multithreaded code, and many published guides give advice for multithreaded programming that is explicitly wrong! For an example of this, see The "Double-checked Locking is Broken Declaration"---which gives a long list of citations at the end, all of which advocate using a programming idiom in Java which simply doesn't work.

Unfortunately, these problems become more severe when your multithreaded programs are running on more than one machine. This is because memory ordering is the hardest part of multithreaded coding to understand---and distributed shared memory uses far more complex protocols with far more aggressive reordering properties. Programs which may have happened to work on one or even two processors are very likely to break under distributed shared memory.

Dynamic Compilation on Transmeta Code Morphing != Just In Time · 2000-01-27 05:13 · Score: 1

This is a timely post for me, as I recently attended the ACM workshop on Dynamic and Adaptive Compilation & Optimization. I thought I'd highlight a few interesting projects which bring up issues that haven't been discussed here. I'll pull these mostly from the panel, which described full systems rather than interesting techniques).

First, note that code which is structured clearly and readably for programmers isn't usually structured perfectly for compiler optimization. Calling conventions add overhead, inlining adds code bloat---and in the presence of e.g. dynamically linked libraries, there's no opportunity to optimize.

One of the biggest overheads is dynamic branching---this is why dynamic method dispatch in Java/C++ is slow, but it also means that the simple act of returning from a function call is also slow. From this perspective, the Dynamo project at HP ( http://www.hpl.hp.com/cambridge/projects/Dynamo ) is particularly fascinating (I've heard tell that many of these folks now work at Transmeta). What dynamo does is it _interprets_ fully-optimized native binaries! When it spots hot code, it unrolls it, determining where all the dynamic branches are going. The unrolled hot code is then optimized and run without interpretation. The result? When dynamic branches usually go to one place, the unfolded code can be executed in a straight line. The speedups (again, with already- optimized code) are impressive.

Second is the Jalapeno system from IBM ( http://www.research.ibm.com/jalapeno ). This is a dynamic Java system written entirely in Java. It does optimizations akin to Sun's HotSpot. The interesting idea here, though, is the ability to optimize across the run-time boundary---because the system is Java code as well, Java internals can be inlined and specialized within user code.

Finally, few people seem to know about partial evaluation. The idea here is that a lot of code depends heavily on a few run-time parameters which get set once and then (almost) never change. So
what we can do is generate a partially-optimized program where we "fill in the blanks" at run time. Note that these blanks would normally be references to variables, run-time condition checks, and so forth---so the partially optimized program is already at least as efficient as our original code run through an optimizing compiler. At run time the blanks are filled in and some more (usually simple) optimizations are performed, usually making the resulting code substantially more compact and faster. Two extremes (both for C programs) are the DyC project ( http://www.cs.washington.edu/research/projects/uni sw/DynComp/www/ ) which acts like a dynamic compiler for C and Tempo ( http://www.irisa.fr/compose/tempo ) which requires explicit annotations but can probably be a bit more aggressive as a result.

-Jan-Willem Maessen

Slashdot Mirror

User: jmaessen

Comments · 7