NVIDIA's 64-bit Tegra K1: The Ghost of Transmeta Rides Again, Out of Order
MojoKid (1002251) writes Ever since Nvidia unveiled its 64-bit Project Denver CPU at CES last year, there's been discussion over what the core might be and what kind of performance it would offer. Visibly, the chip is huge, more than 2x the size of the Cortex-A15 that powers the 32-bit version of Tegra K1. Now we know a bit more about the core, and it's like nothing you'd expect. It is, however, somewhat similar to the designs we've seen in the past from the vanished CPU manufacturer Transmeta. When it designed Project Denver, Nvidia chose to step away from the out-of-order execution engine that typifies virtually all high-end ARM and x86 processors. In an OoOE design, the CPU itself is responsible for deciding which code should be executed at any given cycle. OoOE chips tend to be much faster than their in-order counterparts, but the additional silicon burns power and takes up die area. What Nvidia has developed is an in-order architecture that relies on a dynamic optimization program (running on one of the two CPUs) to calculate and optimize the most efficient way to execute code. This data is then stored inside a special 128MB buffer of main memory. The advantage of decoding and storing the most optimized execution method is that the chip doesn't have to decode the data again; it can simply grab that information from memory. Furthermore, this kind of approach may pay dividends on tablets, where users tend to use a small subset of applications. Once Denver sees you run Facebook or Candy Crush a few times, it's got the code optimized and waiting. There's no need to keep decoding it for execution over and over.
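To make the "optimize once, then reuse" idea in the summary concrete, here is a minimal sketch in Python. It is not Nvidia's implementation: the `HOT_THRESHOLD` value, the `OptimizationCache` class, and the `toy_optimize` function are all hypothetical names invented for illustration, and the "optimization" just strips `nop` entries from a block. It only shows the general shape of caching an optimized version of code that has been seen running several times.

```python
from typing import Callable, Dict, Tuple

Block = Tuple[str, ...]

# Hypothetical: how many times a block must run before it is optimized.
HOT_THRESHOLD = 3


def toy_optimize(block: Block) -> Block:
    """Stand-in optimizer (assumption): drops 'nop' entries from the block."""
    return tuple(op for op in block if op != "nop")


class OptimizationCache:
    """Caches optimized versions of hot code blocks, keyed by address."""

    def __init__(self, optimize: Callable[[Block], Block]) -> None:
        self.optimize = optimize
        self.run_counts: Dict[int, int] = {}
        self.cache: Dict[int, Block] = {}
        self.optimizer_calls = 0

    def fetch(self, address: int, block: Block) -> Block:
        # Already optimized: reuse the stored result, no further work.
        if address in self.cache:
            return self.cache[address]
        # Otherwise count this run and optimize once the block is hot.
        self.run_counts[address] = self.run_counts.get(address, 0) + 1
        if self.run_counts[address] >= HOT_THRESHOLD:
            self.optimizer_calls += 1
            self.cache[address] = self.optimize(block)
            return self.cache[address]
        return block


cache = OptimizationCache(toy_optimize)
block = ("load", "nop", "add", "nop", "store")
results = [cache.fetch(0x1000, block) for _ in range(5)]
print(results[-1])            # optimized block after it became hot
print(cache.optimizer_calls)  # optimizer ran only once
```

In this toy model the first two runs return the original block, the third triggers the optimizer, and every later run is served from the cache.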
Let's see if I have this right:
With the OoOE CPU, the instructions from the code are handled by the CPU itself, which decides what order to process them in so you get a faster overall speed.
With the Project Denver CPU, it's an in-order processor, but it uses software at runtime to decide what order to process the code in and stores that info in a special buffer. That software is itself run by the CPU in the first place in order to make the OoOE decisions.
This seems to be kind of flaky to me.
Although I know only a little about CPU design, this sounds like one of the most revolutionary design changes in many years. The question in my mind is how well it will work. The CPU can use information at runtime that a static analyser running on a separate core might not have ahead of time, most obviously branch prediction information. OOO CPUs can speculatively execute multiple branches at once and then discard the version that didn't happen; they can re-order code depending on what it's actually doing, including things like self-modifying code and code that's generated on the fly by JITCs. On the other hand, if the external optimiser CPU can do a good job, it stands to reason that the resulting CPU should be faster and use way less power. Very interesting research, even if it doesn't pan out.
Surely one of the points of OoOE is it can - in theory - take account of whether data is in cache or not when deciding when to do reads? I don't see how a hard coded instruction path can do this.
If I understand this story correctly, the message is that if you get a tablet with this processor, avoid manufacturers who install a lot of bloatware.
You are welcome on my lawn.
A buffer in main memory, software that optimizes the most-used code: it looks like an OS job to me, something that could be implemented in the Linux kernel and benefit all CPUs, provided that you have the appropriate driver.
According to the paper, it looks like the biggest novelty is... DRM. The optimizer code will be encrypted and will run in its own memory block, hidden from the OS. It will also make use of some special profiling instructions, which could just as well be accessible to the OS. Maybe they will be, but the paper says nothing about it.
It's a "software" cache, it's stored in RAM.
Mada mada dane.
... that doesn't run Facebook?
Otherwise, no buy.
I apologize for the lack of a signature.
2 GiB = 2 * 2 ^ 30 Byte
128 MB = 128 * 10^6 Byte
2 GiB - 128 MB = 2019483648 Byte;
2019483648 Byte > 2GB
Who's the stupid fucker now?
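The arithmetic above can be checked mechanically. The short Python sketch below repeats it; the second case, which treats the buffer as 128 MiB (binary) rather than 128 MB (decimal), is an added assumption to show that the conclusion holds under either reading.

```python
GIB = 2**30   # binary gigabyte (gibibyte)
MIB = 2**20   # binary megabyte (mebibyte)
GB = 10**9    # decimal gigabyte
MB = 10**6    # decimal megabyte

total = 2 * GIB

# Case from the comment: buffer counted as 128 decimal megabytes.
remaining_decimal = total - 128 * MB
print(remaining_decimal)           # 2019483648
print(remaining_decimal > 2 * GB)  # True

# Assumption for comparison: buffer counted as 128 MiB instead.
remaining_binary = total - 128 * MIB
print(remaining_binary)            # 2013265920
print(remaining_binary > 2 * GB)   # True
```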
I think NVidia tied their hands by retaining the ARM architecture. I suspect the result will be a "worst of both worlds" processor that doesn't use less power or provide better performance than competitors.
In order execution, exposed pipelines, and software scheduling are not new ideas. They sound great in theory, but never seem to work out in practice. These architectures are unbeatable for certain tasks (e.g. DSP), but success as general purpose processors has been elusive. History is littered with the corpses of dead architectures that attempted (and failed) to tame the beast.
Personally, I'm very excited about the Mill architecture. If anybody can tame the beast, it will be these guys.
The key sequence to access my Slashdot bookmark in Firefox is Alt-B-S. I don't believe this is a coincidence.
Nope. All standard OoO mechanisms are push-based: the scheduler pushes operations to the execution units. The execution units are dumb; they only consume data and operation information and produce a set of results.
In most OoO designs the number of operations actually capable of flowing through the execution units is less than the theoretical width, limited either by the scheduler or by the retirement logic.
A VLIW can scale to greater actual execution throughput; however, it is hard to make one perform well on many types of code. Compilers are one example of a hard type of program for VLIWs.
I'm an expert on CPU architecture. (I have a PhD in this area.)
The idea of offloading instruction scheduling to the compiler is not new. This was particularly in mind when Intel designed Itanium, although it was a very important concept for in-order processors long before that. For most instruction sequences, latencies are predictable, so you can order instructions to improve throughput (reduce stalls). So it seems like a good idea to let the compiler do the work once and save on hardware. Except for one major monkey wrench:
Memory load instructions
Cache misses and therefore access latencies are effectively unpredictable. Sure, if you have a workload with a high cache hit rate, you can make assumptions about the L1D load latency and schedule instructions accordingly. That works okay. Until you have a workload with a lot of cache misses. Then in-order designs fall on their faces. Why? Because a load miss is often followed by many instructions that are not dependent on the load, but only an out-of-order processor can continue on ahead and actually execute some instructions while the load is being serviced. Moreover, OOO designs can queue up multiple load misses, overlapping their stall time, and they can get many more instructions already decoded and waiting in instruction queues, shortening their effective latency when they finally do start executing. Also, OOO processors can schedule dynamically around dynamic instruction sequences (i.e. flow control making the exact sequence of instructions unknown at compile time).
One Sun engineer talking about Rock described modern software workloads as races between long memory stalls. Depending on the memory footprint, a workload could spend more than half its time waiting on what is otherwise a low-probability event. The processors blast through hundreds of instructions where the code has a high cache hit rate, and then they encounter a last-level cache miss and stall out completely for hundreds of cycles (generally not on the load itself but on the first instruction dependent on the load, which always comes up pretty soon after). This pattern repeats over and over again, and the only way to deal with that is to hide as much of that stall as possible.
With an OOO design, an L1 miss/L2 hit can be effectively and dynamically hidden by the instruction window. L2 (or in any case the last level) misses are hundreds of cycles, but an OOO design can continue to fetch and execute instructions during that memory stall, hiding a lot of (although not all of) that stall. Although it's good for optimizing poorly-ordered sequences of predictable instructions, OOO is more than anything else a solution to the variable memory latency problem. In modern systems, memory latencies are variable and very high, making OOO a massive win on throughput.
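A toy model may help make the stall-hiding argument concrete. The Python sketch below is not a model of any real core: it assumes single-issue machines, a made-up 200-cycle miss latency, a hypothetical 8-entry instruction window, and an in-order core that stalls on the first use of a missing load rather than on the load itself. The function names and the sample program are invented for illustration.

```python
from typing import List, Tuple

# Each instruction is (latency_in_cycles, indices_of_earlier_instructions_it_needs).
Instr = Tuple[int, List[int]]


def run_in_order(program: List[Instr]) -> int:
    """Single-issue, in-order: an instruction waiting on a result blocks all later ones."""
    done_at: List[int] = []
    cycle = 0
    for latency, deps in program:
        ready = max((done_at[d] for d in deps), default=0)
        issue = max(cycle, ready)      # stall here until inputs are available
        done_at.append(issue + latency)
        cycle = issue + 1
    return max(done_at)


def run_out_of_order(program: List[Instr], window: int = 8) -> int:
    """Single-issue, out-of-order: each cycle, issue the oldest ready instruction in the window."""
    done_at: List[int] = [-1] * len(program)   # -1 means "not issued yet"
    pending = list(range(len(program)))
    cycle = 0
    while pending:
        for i in pending[:window]:
            latency, deps = program[i]
            if all(done_at[d] != -1 and done_at[d] <= cycle for d in deps):
                done_at[i] = cycle + latency
                pending.remove(i)
                break
        cycle += 1
    return max(done_at)


MISS = 200  # assumed last-level miss latency, in cycles
program: List[Instr] = [
    (MISS, []),   # 0: load that misses
    (1, [0]),     # 1: first use of load 0
    (MISS, []),   # 2: second load that misses
    (1, [2]),     # 3: first use of load 2
    (1, []), (1, []), (1, []), (1, []),  # 4-7: independent work
]

in_order_cycles = run_in_order(program)
ooo_cycles = run_out_of_order(program)
print(in_order_cycles, ooo_cycles)
```

On this sample program the in-order model takes 406 cycles and the out-of-order model 202: the two misses overlap and the independent work runs underneath them, which is the effect described above.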
Now, think about idle power and its impact on energy usage. When an in-order CPU stalls on memory, it's still burning power while waiting, while an OOO processor is still getting work done. As the idle proportion of total power increases, the usefulness of the extra die area for OOO increases, because, especially for interactive workloads, there is more frequent opportunity for the CPU to get its job done a lot sooner and then go into a low-power low-leakage state.
So, back to the topic at hand: What they propose is basically static scheduling (by the compiler), except JIT. Very little information useful to instruction scheduling is going to be available JUST BEFORE time that is not available much earlier. What you'll basically get is some weak statistical information about which loads are more likely to stall than others, so that you can resequence instructions dependent on loads that are expected to stall. As a result, you may get a small improvement in throughput. What you don't get is the ability to handle unexpected stalls, overlapped stalls, or the ability to run ahead and execute only SOME of the instructions that follow the load. Those things are really what gives OOO its adva
Scalar design just simply attach more cache... more hits and speculative loads (/MMU) solved it for SPARC/MIPS/Power
The HP research into Dynamo and later the Transmeta design concepts showed promise but delivered no product beyond small samples (under 1 million shipped) and yet people's houses...
I was most excited by Dynamo and VLIW (Itanium promised so much and delivered so little). LLVM provides some interesting concepts.
I would really like Texas Instruments (TI) back in the game as I think a large I and D cache combined with specialised (DSP + crypto) offload engines would blow the socks off the current market...
It will be interesting, as Intel has a smaller geometry yet the market is with ARMHY, but do manufacturers care?
have fun and power consumption matters !
John Jones
It's hard drive manufacturers that insisted on decimal instead of binary prefixes, not RAM makers. In fact, it's fairly difficult to make RAM to decimal prefixes.
At any rate, RAM specs usually don't count OS overhead anyway, so this just makes the kernel heavier in a sense. Hell, this is probably being done with a proprietary kernel blob (as we know, Nvidia loves their magic blobs that give you amazing performance only on their hardware...)
"There's no need to keep decoding it for execution over and over."
So kinda in the bed with a combo ReadyBoost/XP-precache.
Why not just use ReadyBoost/precache and make it work the same freaking way?
Still waiting on Serviscope_minor to wake up to fucking reality and realize that Jessica Price isn't going to fuck him.
Why not have all applications ship in LLVM intermediate format and then have on-device firmware translate them according to the exact instruction set and performance characteristics of the CPU? By the time code is compiled to the ARM instruction set, too much information is lost to do fundamental optimization, like vectorizing loops if applicable operations are supported.
Suppose for a moment that you are building a new processor for mobile devices.
The mobile device makers - Apple, Google, and Microsoft - all have "App Stores". Side loading is possible to varying degrees, but in no case is it supported or a targeted business scenario.
These big 3 all provide their own SDKs. They specify the compilers, the libraries, etc.
Many of the posts in this thread talk about how critical it will be for the compilers to produce code well suited for this processor...
Arguably, due to the app development toolchain and software delivery monoculture attached to each of the mobile platforms, it is probably easier than ever to improve compilers and transparently update the apps being held in app-store catalogs to improve their performance for specific mobile processors.
It's not the wild west any more; with tighter constraints around the target environment, more specific optimizations become plausible.
My opinions are my own, and do not necessarily represent those of my employer.
It's not, obviously - they're both the same size at 2^31 bytes. Where you run into problems is if you want to create a 2 000 000 000 byte stick of RAM. And then you have a bit of a problem: Addressing.
In hardware, exactly 2^N memory locations are addressable with N bits. You therefore need only ensure that your number of address lines corresponds to your number of memory locations in order to ensure consistency. If however 7% of possible addresses are invalid you need to insert logic somewhere to make sure every single memory access falls within the valid range, and that creates a performance overhead.
Worse, you have to *translate* those addresses. If I try to access byte 2 000 001 in your system the memory addressing infrastructure will have to do range analysis to determine which stick of RAM it belongs in: that's multiple integer comparison and branching operations. That's slow. If your memory size is a power of two, on the other hand, you need only interpret the bottom N bits as the address within a memory bank, and the remaining high address bits as the address of the bank to access. No comparisons or branching needed - every address between 0 and RAM_MAX will automatically address the proper memory cell.
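The difference between the two decoding schemes can be sketched in a few lines of Python. This only illustrates the logic, not hardware timing: `BANK_BITS`, `decode_pow2`, and `decode_ranges` are hypothetical names, and the bank sizes are chosen for the example.

```python
from typing import List, Tuple

BANK_BITS = 30  # assumption: each bank holds 2**30 bytes


def decode_pow2(addr: int) -> Tuple[int, int]:
    """Power-of-two banks: split the address into bit fields, no comparisons."""
    bank = addr >> BANK_BITS                  # high bits select the bank
    offset = addr & ((1 << BANK_BITS) - 1)    # low bits select the byte in the bank
    return bank, offset


def decode_ranges(addr: int, bank_sizes: List[int]) -> Tuple[int, int]:
    """Arbitrary bank sizes: walk the ranges, comparing and branching per bank."""
    base = 0
    for bank, size in enumerate(bank_sizes):
        if addr < base + size:
            return bank, addr - base
        base += size
    raise ValueError("address out of range")  # invalid addresses must be caught


print(decode_pow2(2**30 + 5))                          # (1, 5)
print(decode_ranges(1_000_000_005, [10**9, 10**9]))    # (1, 5)
```

In the power-of-two case every bit pattern below the total size is valid, so the decoder needs no range check at all. In the decimal case the decoder needs both the comparisons and an error path for the addresses that fall outside the installed memory.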
Furthermore, since RAM has *always* been manufactured in powers of two, all of the surrounding architecture also has that assumption built in: RAM controllers, BIOS, operating systems. Unless you want to create everything from scratch, a non-power-of-two size stick of RAM will cause catastrophic errors wherever it's used. And keep in mind the decision of which size RAM sticks are available is made by the RAM manufacturers, NOT the device manufacturer. No RAM manufacturer is going to make off-size RAM because no existing systems will be able to use it. And no device manufacturer is going to commission off-size RAM because a limited run would be substantially more expensive than just using a slightly larger standard size.
--- Most topics have many sides worth arguing, allow me to take one opposite you.
Once Denver sees you run Facebook or Candy Crush a few times, it's got the code optimized and waiting.
I am so fortunate to live in such an advanced age of graphics processors, that let me run the equivalent of a web browser application and a 2D tetris game. What progress! We truly live in an age of enlightenment!
Does this architecture require us to load the "NVidia processor driver" which comes with 100 megabytes of code specializations for every game shipped?
That is, after all, why their graphics drivers perform so well - they patch all shaders on top-end games...
No one needs to do anything for software to run on these at all. nVidia would be developing a kernel module or something that would JIT existing software into their optimized in-order pipeline, then cache that result. The out-of-order architectures all do this too - in hardware (which uses more power maybe, but also executes more quickly and theoretically gets into sleep mode more often).
There's no need for anyone to generate special code for these CPUs, but it is interesting that a common perception is that there is a need to do so.
What I'm curious about is whether they could take the actual Transmeta route, and translate x86 bytecode (or anything else) in software to their own in-order architecture, or if there are enough low level APIs open for end users to take a stab at it (to create awesome emulators).
http://www.unfocus.com/
It's not the core instruction set that's the problem with booting alternate OSes .. as long as you stick to the base architecture you'll be fine. It's the lack of standardization when it comes to boot firmware and device configuration that's the problem.
The server ARM initiative at least is standardizing on UEFI and ACPI .. hate them or love them, making ARM hardware more similar to Intel Architecture hardware will likely make it easier to support both.
Michel
Fedora Project Contribut
You'll have to pry my 2.147483648 Gb DRAMs from my cold, dead hands!