OK, these are just a few other bits of interest I picked out of the patent:
In a preferred embodiment of the invention, the morph host is a very long instruction word (VLIW) processor which is designed with a plurality of processing channels.
I'm not going to go into huge detail about VLIW machines (particularly since I don't know all that much about them :-). Suffice it to say that traditional VLIW CPUs fetch multiple instructions at once, and rely on the compiler to ensure that there are no dependencies between instructions in a fetch group (if the compiler can't find enough independent instructions to fill the group, it will pad the holes with no-operations, or NOPs). Looking at Transmeta's patent, it appears that rather than a compiler doing this checking, their code-translation software will be doing it on the fly. Superscalar RISC/CISC machines, on the other hand, typically do this checking in hardware. But Transmeta's reasoning seems to be that doing it in hardware adds complexity, hence lower clock rates, and also doesn't make multiple instruction sets very feasible.
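To make the NOP padding concrete, here is a rough sketch in Python of how software might pack a stream of instructions into fixed-width VLIW groups. This is purely my own toy illustration, not anything from the patent: the `Op` type, the `pack_bundles` function, the register names, and the group width of 4 are all assumptions. It only checks read-after-write and write-after-write dependencies and never reorders anything.

```python
from typing import NamedTuple


class Op(NamedTuple):
    name: str     # mnemonic, for display only
    dest: str     # register written by this op
    srcs: tuple   # registers read by this op


NOP = Op("nop", "", ())


def pack_bundles(ops, width=4):
    """Greedy, in-order packing of ops into bundles of `width` slots.

    A new bundle is started when the next op reads or writes a register
    already written in the current bundle, or when the bundle is full.
    Unused slots are padded with NOPs.
    """
    bundles, current, written = [], [], set()
    for op in ops:
        depends = op.dest in written or any(s in written for s in op.srcs)
        if current and (depends or len(current) == width):
            bundles.append(current + [NOP] * (width - len(current)))
            current, written = [], set()
        current.append(op)
        written.add(op.dest)
    if current:
        bundles.append(current + [NOP] * (width - len(current)))
    return bundles


program = [
    Op("add", "r1", ("r2", "r3")),
    Op("sub", "r4", ("r5", "r6")),
    Op("mul", "r7", ("r1", "r4")),   # needs r1 and r4, so it starts a new bundle
    Op("ld",  "r8", ("r9",)),
]
for bundle in pack_bundles(program):
    print([op.name for op in bundle])
```

A real scheduler would also reorder independent instructions to fill the holes; this one just shows where the NOPs come from.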
Regarding the instruction translation and subsequent caching I mentioned in my previous post, a quote from the patent illuminates the matter a little more:
The code morphing software of the microprocessor...includes a translator portion which decodes the instructions of the target application, converts those target instructions to the primitive host instructions capable of execution by the morph host, optimizes the operations required by the target instructions, reorders and schedules the primitive instructions into VLIW instructions (a translation) for the morph host, and executes the host VLIW instructions.
When the particular target instruction sequence is next encountered in running the application, the host translation will then be found in the translation buffer and immediately executed without the necessity of translating, optimizing, reordering, or rescheduling. Using the advanced techniques described below, it has been estimated that the translation for a target instruction (once completely translated) will be found in the translation buffer all but once for each one million or so executions of the translation. Consequently, after a first translation, all of the steps required for translation such as decoding, fetching primitive instructions, optimizing the primitive instructions, rescheduling into a host translation, and storing in the translation buffer may be eliminated from the processing required. Since the processor for which the target instructions were written must decode, fetch, reorder, and reschedule each instruction each time the instruction is executed, this drastically reduces the work required for executing the target instructions and increases the speed of the microprocessor of the present invention.
Transmeta seems to have an excellent idea here. They're caching optimized translations of the incoming instructions, so rather than having to translate and optimize over and over each time you see that bit of code, you do it once and then just grab it from the cache. Due to the spatial and temporal locality of programs (i.e. the fact that your accesses to instructions are not random, but are localized in loops, etc.), the patent estimates that this cache ("translation buffer") will fail to have a translation present only about once per million executions of that translation. So you're doing roughly *one* translation per million executions of a piece of code, rather than decoding and scheduling it a million times like current processors would have to do. Interestingly enough, a scheme like this was brought up as a discussion item in my Superscalar Processor Design class a couple of weeks ago, though my professor used the example of a specialized Alpha decoding/translating x86 and caching the results. One might even write the translations back out to disk as an attachment to the original executable, so that the next time you run the program there are fewer translations to do, and eventually you'll have a fully translated version on your hard disk for optimal speed. I guess we'll just have to wait to see if Transmeta does something similar.
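Here is a minimal sketch of the caching idea, again in Python and again only my own toy model. The `TranslationBuffer` class, the `toy_translate` function, and the little "add" instruction format are made up for illustration; the patent describes the real thing in terms of host VLIW code, not Python closures.

```python
class TranslationBuffer:
    """Caches the translation of each target code block by its address."""

    def __init__(self, translate):
        self._translate = translate   # function: target block -> host callable
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def run(self, target_addr, target_block, acc):
        host_code = self._cache.get(target_addr)
        if host_code is None:
            # Slow path: decode, optimize, and schedule once, then remember it.
            self.misses += 1
            host_code = self._translate(target_block)
            self._cache[target_addr] = host_code
        else:
            # Fast path: reuse the stored translation.
            self.hits += 1
        return host_code(acc)


def toy_translate(block):
    """Hypothetical 'optimizer': folds a run of constant adds into one add."""
    total = sum(n for op, n in block if op == "add")
    return lambda acc: acc + total


tb = TranslationBuffer(toy_translate)
loop_body = [("add", 1), ("add", 2)]   # pretend this sits at address 0x1000
acc = 0
for _ in range(1000):                  # a hot loop hits the same block each time
    acc = tb.run(0x1000, loop_body, acc)
print(acc, tb.misses, tb.hits)
```

The loop body is translated on the first pass only; every later iteration goes straight to the cached version.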
One embodiment of the enhanced hardware includes sixty-four working registers in the integer unit and thirty-two working registers in the floating point unit. The embodiment also includes an enhanced set of target registers which include all of the frequently changed registers of the target processor necessary to provide the state of that processor; these include condition control registers and other registers necessary for control of the simulated system.
It seems this new chip is going to have a lot of registers. As Cartman would say, sweeeeeet!
The patent also provides some sample C code, the corresponding x86 assembly, and some sample optimizations the Transmeta system may perform. It's a little more than halfway down the page. If you want to look, just scroll until you see code :-)
Further clarification
by Stradivarius · Score: 5
Some notes for those who may want a more detailed explanation:
The beginning of the patent ("claims") is essentially just a list of things that all modern, superscalar, out-of-order processors do, and saying "hey we do this too".
Basically, out-of-order machines execute instructions out of their program order (hence the name :). This means that if your code sequence is A, B, C, the CPU may actually execute it such that B is done executing before A. But B's results cannot be written to system memory or the architected registers ("machine state") until you know that instruction A didn't generate an exception. That's so that you can provide precise exception handling, i.e. so that the OS can service A's exception and then resume execution with B. If you don't wait to do your memory store, then you'll end up executing B twice, which you didn't intend. So that's what all the talk in the beginning of the patent about memory stores, etc., is about.
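A toy Python model of that "finish out of order, commit in order" rule might look like the following. The function name `commit_in_order` and the way instructions are represented (a destination register plus a zero-argument function) are my own assumptions for the sake of the sketch, not how any real CPU or the patent describes it.

```python
def commit_in_order(program, completion_order):
    """Execute in `completion_order`, but update registers in program order.

    program: list of (dest_register, compute) pairs, where compute() returns
             the result or raises an exception.
    Returns (registers, faulting_index); faulting_index is None if nothing
    faulted.
    """
    # Phase 1: instructions finish in any order; results are only buffered.
    results = {}
    for i in completion_order:
        _, compute = program[i]
        try:
            results[i] = ("ok", compute())
        except Exception as exc:
            results[i] = ("fault", exc)

    # Phase 2: commit strictly in program order, stopping at the first fault,
    # so buffered results of younger instructions are simply thrown away.
    regs = {}
    for i, (dest, _) in enumerate(program):
        status, value = results[i]
        if status == "fault":
            return regs, i
        regs[dest] = value
    return regs, None


# A faults; B and C finished earlier but must not change the machine state.
program = [("a", lambda: 1 // 0), ("b", lambda: 2), ("c", lambda: 3)]
print(commit_in_order(program, completion_order=[1, 2, 0]))
```

Because nothing from B or C was committed, the state the OS sees at the fault is exactly "everything before A", which is what makes the exception precise.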
If you get past all the uninteresting stuff like that in the beginning, you'll find the following:
"The present invention overcomes the problems of the prior art and provides a microprocessor which is faster than microprocessors of the prior art, is capable of running all of the software for all of the operating systems which may be run by a large number of families of prior art microprocessors, yet is less expensive than prior art microprocessors. "
The idea, it seems, is that rather than making complex hardware to execute the instructions and perform speed enhancements, they're doing speed optimizations in software, which in turn allows very simple hardware (which in turn should translate to really high clock speeds). It seems that Transmeta's bet with this is that the penalty incurred by doing software rather than hardware optimizations is offset by the increase in clock speed and decrease in hardware cost.
Using such an approach should also make running multiple instruction sets a much easier task. Current processors do their instruction decoding in hardware. But if Transmeta has managed to do this decoding (fast) in software, then they can just add a little more software to allow multiple instruction sets. They also seem to be caching the translations of non-native to native instructions in a memory structure of some sort, so that they minimize the redundant emulation computations.
Actually, to address gupg's comment, it also seems that they should not need *any* special compiler support, because they can run stuff that was compiled for any of the various instruction sets they choose to support. So they themselves should not need to do compiler work. I would guess that the reason they're hiring all sorts of compiler folks is that they need people to do the aforementioned software instruction translation, and the people best suited for that are compiler people, since they work at the instruction level all the time. Most other programmers don't have to deal with anything other than high-level languages, and so would not be particularly well suited to doing what Transmeta is doing.
Anyway, hopefully this explained things a bit more to everyone. My reading and explanation of the patent was pretty quick since I have to go to class in a few minutes. I'll finish reading the patent afterwards and add anything else I think you might like to know.
It appears that the flow will be like this:
1. A set of instructions comes into the processor in one instruction set (like x86).
2. The device stores the data for this series of instructions temporarily.
3. The device translates the (x86) instructions into its own internal instruction set and figures out an ordering that will not cause it to have exceptions.
4. The device retrieves the temporary data and "fills in the blanks" in the "inner" processor to get results. The so-called "permanent storage" is probably the inner processor's instruction cache.
5. The data is cleared from the interim area once it's acted upon.
I think it means (from the abstract) that they are going to provide compatibility with other processors by converting those processors' instructions into instructions for their host processor. So, the story unfolds. Obviously, they have a super fast processor and will provide for running Intel etc. instructions on their processors.
The patent itself is more concerned with making sure that the conversion process occurs without any exceptions taking place, or rather with holding the processor state and waiting on a sequence of instructions to make sure no exception etc. happens, and then executing it on the host processor.
They obviously also need strong compiler support for such a processor, which explains all the software and compiler people they have been recruiting.
Fun, fun, fun... who says computer architecture is dead!
Sumit
OK, it's for emulation, but it doesn't just speed up emulation. This allows for instruction ROLLBACK. Want a journaling filesystem? How about a journaling processor? The patent is for a co-processing unit that not only translates a foreign instruction set into native instructions for a 'target processor', but also acts as a go-between for that target processor and memory. It stores the processor state, and buffers any memory writes, until it is certain that a group of instructions has been run without exception or error... If the translated instructions crash, no damage is done. Not only is this amazing overall, but it allows for very speculative, and very fast, instruction translation and branch prediction...
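For anyone who wants to see the "journaling processor" idea in miniature, here is a small Python sketch of buffering memory writes until a block of translated code finishes cleanly. The names `GatedMemory` and `run_translation` are my own inventions for this example, and the patent's actual mechanism is hardware, so treat this as an analogy only.

```python
class GatedMemory:
    """Holds stores in a side buffer until they are committed or discarded."""

    def __init__(self, memory):
        self.memory = dict(memory)   # committed state
        self._pending = {}           # speculative stores, not yet visible

    def store(self, addr, value):
        self._pending[addr] = value

    def load(self, addr):
        # Speculative code must see its own earlier stores first.
        if addr in self._pending:
            return self._pending[addr]
        return self.memory[addr]

    def commit(self):
        self.memory.update(self._pending)
        self._pending.clear()

    def rollback(self):
        self._pending.clear()


def run_translation(mem, block):
    """Run one translated block; commit on success, roll back on any fault."""
    try:
        block(mem)
    except Exception:
        mem.rollback()
        return False
    mem.commit()
    return True


def good_block(m):
    m.store(0, m.load(0) + 1)


def bad_block(m):
    m.store(0, 99)    # speculative write...
    raise ZeroDivisionError("fault partway through the block")


mem = GatedMemory({0: 10})
print(run_translation(mem, good_block), mem.memory)
print(run_translation(mem, bad_block), mem.memory)
```

After the faulting block, memory still holds the value written by the last successful block, which is the "no damage is done" property described above.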
It appears to be a system in which a processor is fed a sequence of instructions in a translated foreign set, and the results are held in cache until it can be ascertained that the entire stream of instructions will run without error, at which time the cache is released. They may be using this purely as a CISC-to-RISC mechanism, or they may be planning a platform where the actual program code is 'broken' into chunks, and the processors might encounter an exception if the granularity of the sets is off. They may even be planning a platform that does multi-arch emulation on a transparent hardware/microcode level, à la AS/400. Heck, they might be doing all three! They also allude to making a cheap processor run code designed for a more expensive one, so perhaps they're planning to give Intel a run for their money.
I'm sorry, but that is the closest I can get to an answer with the available information.
Looks like a CPU which reads foreign instruction sets, translates them into its own set, and executes them in a highly parallel manner to produce faster execution than the original processor.
TRANSlatingMETAprocessor?
Re:No you're all wrong: it's for Emacs
by Imperator · Score: 5
Actually, it's a way to run any application for any processor and any OS, straight from Emacs. Unrelated planned features for Emacs include improved SMB support, an extremely light-weight httpd, and preliminary support for USB child-rearing devices.
--
Gates' Law: Every 18 months, the speed of software halves.