The Quest for More Processing Power
Hack Jandy writes "AnandTech has a very thorough, but not overly technical, article detailing CPU scaling over the last decade or so. The author goes into specific details on how CPUs have overcome limitations of die size, instruction size and power to design the next generation of chips. Part I, published today, talks specifically about the limitations of multiple cores and multiple threads on processors."
the quantum computer!! Until then we'll have to suck it up with these Si things.
What we need is a better architecture which would allow for a better implementation of algorithms. Will we ever have an MMIX-like processor with 256 general-purpose 64-bit registers that each can hold either fixed-point or floating-point numbers? That is what I am waiting for, not more "power," whatever that means.
Sincerely,
Pan Tarhei Hosé, PhD.
"Homo sum et cogito ergo odi profanum vulgus et libido."
That's what's been happening the last 10-15 years. Where are the indications that "time to market" and "sloppy programming" will suddenly vanish?
Run old software.
It's only new software that's sucking up all the extra processing power.
Remember the really sluggish 33MHz 486s (and much slower machines), and thinking the ultimate computer would be a whole 50MHz?
Well, now you've got a computer that's over 10 times faster, with practically infinite capacity.
Fire up that old operating system and run your original software, you will be in heaven!
liqbase
Might want to point out that the article is x86-centric. Not that it only applies to x86; indeed, many/most of the issues are just generally related to processors (single vs. multi-core, trace lengths, etc.), but the article definitely focuses on these issues as they apply to the x86.
From my point of view, chips lead to more bloat.
What kind of algorithm are you imagining would benefit from 256 fields of non-vectorized data?
Of course, those registers could be used in larger things for everything that's worthy of a local variable, but as soon as you run into a stack operation you'll either only want to push a subset of the registers to the stack, or face a harder blow of memory access times by making each function call a 2048 byte write to memory.
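The 2048-byte figure above is just 256 registers times 8 bytes each. A minimal sketch of that arithmetic, compared with saving only a live subset; the function names and the example count of six live registers are assumptions for illustration, not anything from the article:

```python
REGISTER_COUNT = 256   # MMIX-like general-purpose register file
REGISTER_BITS = 64     # each register holds one 64-bit value


def full_save_bytes(count=REGISTER_COUNT, bits=REGISTER_BITS):
    """Bytes written to memory if every register is pushed on a call."""
    return count * bits // 8


def partial_save_bytes(live_registers, bits=REGISTER_BITS):
    """Bytes written if only the registers actually in use are pushed."""
    return live_registers * bits // 8


if __name__ == "__main__":
    print(full_save_bytes())       # whole register file
    print(partial_save_bytes(6))   # hypothetical function with six live registers
```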
Explicit encoding of parallelism, hints to branch prediction, and similar stuff, seems far more appropriate.
Again, few single functions in an imperative language have 256 separate variables, without involving arrays of data. Unless the register file is addressable by index from another register (basically turning it into a very small addressed memory, which is what you try to avoid with registers), you have little use for 256 of them. Take, for example, a trivial string iteration algorithm: most of those registers would be completely useless. The same holds true for common graph algorithms.
http://www.anandtech.com/printarticle.aspx?i=2343.
Same article without 90% of the ad-bloat.
Chances are that you aren't often pushing your CPU to capacity. What I'd like to see is a better way to identify bottlenecks in my system. There's no sense pumping more power into a system if it's all going to be throttled by something like a slow hard drive.
Socialism: A feeling of discontent and resentment caused by a desire for the possessions or qualities of another.
Jeez, I've been running Gentoo all this time when I should have been running Linux from Scratch, if only for the chance of sadomasochistic sex!
In Soviet America the banks rob you!
Ummm, my home machine has a 400MHz processor running Suse. I'm thinking of upgrading, as I have every 6 months for 5 years, but I just keep waiting for the "next" best thing rather than upgrading now.
There are mobile phones more powerful than my home PC, but it does the job.
The wonder of these future boxes is that we will STILL be able to write code that makes them run slow. Roll on Longhorn I say!
An Eye for an Eye will make the whole world blind - Gandhi
I expect that once multi-core desktop CPUs become more prevalent, the advantage of multi-threaded programming will become evident and start to take off.
Ok, classic x86 is cramped and the CPU does a lot of register renaming to get around it. I don't agree that more registers would actually do that much good.
It does. Take a look at x86-64. About 98% of the reason 64-bit x86 code is faster when you are using less than 4 gigs of RAM is that it has double the registers. With the same number of registers, 64-bit code normally slows things down measurably because the pointer size doubles. The instruction word length doesn't change.
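To put a number on the pointer-size cost, here is a rough sketch using a hypothetical doubly linked list node (one 32-bit payload plus two pointers). The node layout is an assumption for illustration, and the "<" format deliberately ignores alignment padding, which on a real 64-bit compiler would typically make the node even larger:

```python
import struct

# "<" selects standard sizes with no padding: i = 4-byte int,
# I = 4-byte unsigned (32-bit pointer), Q = 8-byte unsigned (64-bit pointer).
NODE_32BIT = "<iII"   # payload, prev pointer, next pointer
NODE_64BIT = "<iQQ"   # same node with 64-bit pointers


def node_size(layout):
    """Unpadded size in bytes of one list node for the given layout."""
    return struct.calcsize(layout)


def nodes_per_cache_line(layout, line_bytes=64):
    """How many whole nodes fit in one cache line (64 bytes assumed)."""
    return line_bytes // node_size(layout)


if __name__ == "__main__":
    for name, layout in (("32-bit", NODE_32BIT), ("64-bit", NODE_64BIT)):
        print(name, node_size(layout), nodes_per_cache_line(layout))
```

Fewer nodes per cache line means more memory traffic for the same data structure, which is the slowdown the extra registers have to make up for.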
256 registers goes a bit far unless half of them are predication bits.
Multi-threading gets you a speed boost not necessarily on the individual application, but definitely on the OS level. That's why Sun gets away with individual CPUs that are each 1/4 the speed of cheap x86 hardware.
Most OSes these days are not monolithic. Even MS is really a collection of smaller pieces, but not nearly to the degree of Linux.
Linux just scales better than Windows on multiple CPUs. I have no doubt that MS will work Indian programmers day and night to catch up, but this is a game they are definitely playing catch-up in.
Linux, in some versions, is scaling past 64 CPUs now (oh the benefits of forked kernel development!), which should factor nicely when the time comes that AMD ('cause may not be around then) is pushing chips with dozens if not hundreds of micro-cores.
Last I checked (and I may be out of date on this) Windows started bogging on 4 CPUs. And never mind its asinine global message loop.
I fully realize Joe User cares more about perceived performance than real performance (long live xorg!), and explaining Linux's advanced scaling architecture will not win over the desktop, but it will have a significant impact on technical decision markets, from servers to embedded devices (HUGE market for these clustered chips).
I would rather be ashes than dust!
Wild. Did you get this from another site? I.e. are there any more? Actually, let me go look on Google.
The Itanium has a huge register file with, IIRC, even more registers in total. They are not interchangeable, though, but the (almost) only point in that would be to keep the total number of registers down while being flexible for most types of code. As I think that it's generally actually easier to make them separate for different execution units, that's not very interesting. Also, note that the Itanium currently has a 2-cycle (again, IIRC) register access time! They tried to be visionary, adding a huge register set, in addition to some parallelism encoding and other things I mentioned in the parent, but they traded (what seems to be) far too much to get it.
A huge (defined as MMIX-like, not AMD64-like) register file might be great, but you need selective register pushing to stack to get away with it, unless you or the compiler are performing very aggressive inlining. What's easier, if you're doing assembler: calling a function and putting a local on the stack, or writing a huge fricking implementation of your main algorithm, taking great care to use all different registers in each function inlining?
Just to note, I am not an Electrical Engineer (but will be in 3 years). From what little I've read, it seems like branch prediction allows the CPU to prefetch data it will need. Smart math people keep coming up with better and better general-purpose algorithms. But these new algorithms need more and more logic behind them, adding to CPU complexity a lot. Now, my question is: once we have an n-core CPU, would it be possible to optimize the main core for general-purpose use, the second for video encoding, the third for games, and so on? Then when you run software, it will know (or you tell it) which core to run on. It seems that if the CPU designers knew what kind of code would be running, they could optimize branch prediction algorithms better for that task. It seems like misses are extremely expensive, and that something like this would help. It would be the next best thing to having an FPGA on your chip that automatically reconfigured itself for whatever algorithm you need.
The only reference made to AMD is regarding their ingenious SOI technology. With the exception of that, the focus is maintained on Intel, (whom he calls the "#1 in the CPU market"). I find that somewhat absurd, since Intel is largely failing (stretching an obsolete architecture to extreme limits by extending the pipeline) where AMD is innovating and has already largely surpassed them.
AMD's CPU does a hell of a lot more per clock cycle than Intel's. The AMD 64-bit chip is a marvel.
You can already buy PCI boards that will let you do this. It is just that software support is seriously lacking (non-existent).
My guess is that this would work wonderfully for certain classes of problems, and would be quite useful for things like finite element analysis, MPEG encoding, and the like. The main problem is that an FPGA takes a fair bit of time to load its configuration file. Obviously, you would not want to multitask between two different applications trying to use this FPGA. Otherwise, you will spend more time context-switching than you would actually working.
You can get a simple FPGA for only a buck or two, now. Decent ones are $10. It would not cost too much to add them to a mobo. All you need is for somebody to come up with a decent programming framework (which is far from trivial).
"-1 Troll" is the apparently the same as "-1 I disagree with you."
Of course, a classic FPGA architecture wouldn't cut it. But there are some more advanced architectures that are being tested already, that allow extremely fast reprogramming. Imagine if some areas of your processor could be reprogrammed in the time it takes for, say, a context switch. And of course the underlying OS needs to be written so as to optimize the processor's use at any given time.
My leaky brain suggests that this might correspond to the propagation speed in silicon for a given path length and a given process (e.g., 90nm may give us better results).
--dave
davecb@spamcop.net
Why would anybody need more than 640K?
I have to question one of the main assumptions in the article -- that most software won't benefit from multiple processors. In a sense it's true, but it's also misleading.
If you are desperate to run your word processor or spreadsheet faster, then he's got a point. But realistically, don't the current systems already run those kinds of programs just fine? Is this the kind of application where more speed is most needed?
I think Sony have got it right with their whole "media processor" approach, with high bandwidth and multiple vector units. It won't benefit most programs, but it will greatly benefit most of the programs that slog on today's systems.
The time has come to break away from the old approach of merely running the same linear X86 code faster and faster. I think this change is overdue.
Tom's Hardware also has "The Mother of All CPU Charts," which is also a good read with many benchmarks.
It is crazy how far we have come.
-The only sig I have is a cig with a good single malt.
Is it just me, or does this article describe current leakage using bipolar transistors?? I didn't think those were commonly used, with CMOS pretty much supplanting them... Really it seems like the right argument, applied to the wrong kind of transistor.
Branch prediction sounds decent to most people, until it hits reality. Having a program "predicting" that it will need a certain path is *backwards*. If a certain calculation *should* go down a path, it should pre-tell the channels.
No! It's a *SIG*. Keep the Special Interest Groups away! (Con joke!)
Last I remember, x86-compatible microprocessors are more than 95% of the desktop market... it seems only natural that such an article would focus on x86 (unless the PPC people found a way to hit 5 GHz, which, to my knowledge, has not happened yet).
Locked meanings? I'm not so sure. If we do a MUL EAX then the result goes into EDX:EAX. Since EAX gets clobbered, it'll get renamed. Combine that with the fact that most compilers generate code that does not use instructions in which registers have special meaning anyway, and I don't think this is actually a problem.
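For anyone who hasn't met the implicit-register form: a 32-bit MUL multiplies EAX by its operand and splits the 64-bit product across EDX (high half) and EAX (low half). A small sketch of just the arithmetic; the function name is made up for illustration, and it ignores the flags the real instruction sets:

```python
MASK32 = 0xFFFFFFFF


def mul32(eax, operand):
    """Model of unsigned 32-bit MUL: returns (edx, eax) after the multiply.

    The full 64-bit product is split into a high half (EDX) and a
    low half (EAX). Flags are not modelled.
    """
    product = (eax & MASK32) * (operand & MASK32)
    return (product >> 32) & MASK32, product & MASK32


if __name__ == "__main__":
    eax = 0xFFFFFFFF
    # "MUL EAX" uses EAX as both the implicit and the explicit operand.
    edx, eax = mul32(eax, eax)
    print(hex(edx), hex(eax))
```

Both output registers are overwritten, so neither carries a dependency on its old value past this point, which is why renaming can handle it.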
Now, I only have a very limited understanding of the issues, given my lack of electronics experience. But couldn't the leakage be utilised in some form of integrated Peltier cooling to help pump the heat out of the chip, and, by making it cooler, help in a small way to reduce leakage? Another thought that struck me, and call it wacky: why not have another layer of larger-process silicon that is powered by the leakage? It looks to me as if the leaked power from the 70nm process is nearly enough for the total power of the 90nm process, and the leakage from the 90nm process would power something just over the 180nm process. In a sense, another form of heat pump :).
Anyhow, I'm sure I've either given you electronics gurus something to think about or, at the very least, laugh about. Enjoy :)
www.everythingispossible.com
Never insult someone who serves you food. - Brought to you by the Democratic People's Republic of Jimmy.
july 20, 2004 was pretty sweet....
every day http://en.wikipedia.org/wiki/Special:Random
We have to look at how much this affects different people.
Who needs so much raw processing power? Your everyday Joe Computer User only uses it for word processing, checking email, and surfing the interweb. Which is why, when some of my friends (or their parents) go looking for a new computer, I ask them what they mostly use their computer for. If they're not eXtreme gamers or something, then I don't see a point in them buying a processor screaming along at 4 GHz or whatever.
In light of this, I still think there's a market for single-core CPUs for the everyday user. There is probably one other thing that can change this though: video encoding/recoding. A lot of people are starting to use their PCs for burning DVDs. As anyone who's ever authored a DVD knows, it can take some time. It takes about 3.5 hours on my 2.4GHz Pentium to author a (4 gig) DVD. That time is spent on the encoding. So multicore processors would probably help with that (or perhaps there could be a dedicated hardware solution, like encoder cards?).
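How much multicore would help depends on how much of the encode can actually run in parallel. A back-of-the-envelope sketch using Amdahl's law; the 3.5 hours is the figure above, but the 90% parallel fraction is purely an assumption picked for illustration, not a measurement:

```python
def encode_hours(single_core_hours, cores, parallel_fraction):
    """Amdahl's-law estimate of run time on `cores` identical cores.

    The serial part (1 - parallel_fraction) does not speed up; the
    parallel part is divided evenly across the cores.
    """
    serial = 1.0 - parallel_fraction
    return single_core_hours * (serial + parallel_fraction / cores)


if __name__ == "__main__":
    BASELINE_HOURS = 3.5       # from the DVD authoring example
    ASSUMED_PARALLEL = 0.9     # assumption: 90% of the encode parallelizes
    for cores in (1, 2, 4):
        print(cores, round(encode_hours(BASELINE_HOURS, cores, ASSUMED_PARALLEL), 2))
```

Under that assumption a second core gets the job down to a bit under two hours, and the returns shrink quickly after that because the serial part stays put.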
I know this article is just talking about continuing trends, and what could die out. So yes, unicore/single-core CPUs may not be a "profitable" trend, but there are still uses for them. Also, as the article showed when talking about hyperthreading, it would help if apps were written with hyperthreading/multicore processors in mind. That way, they can take full advantage of them. I see software taking a while to catch up and utilize the full potential of the hardware.
Vivin Suresh Paliath
http://vivin.net
I like
http://shit.slashdot.org/article.pl?sid=05/02/09/1 35251
If I recall correctly, MMIX uses its whole huge register file as a stack. All of your instructions specify register numbers as counted from the top-of-stack. Stack space is allocated and deallocated in frames, not a register at a time. A frame must be small enough to fit in registers. The stack spills to memory if it overflows, and refills from memory if it underflows. It does not have to spill/refill on a frame boundary. But activation records for compiled C routines could nest five or six deep and not spill. An inline routine can still allocate and release its own activation record.
Not all the registers are used in a stack-like way; some of them are global for your program and some of them are global to the OS. There are a couple of special registers that indicate where these regions start in the register file. The remainder of the register file is used as a stack.
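A toy model of the behaviour described above, for anyone who wants to play with it. This is a simplified sketch from the description in this comment, not the real MMIX semantics: the class and method names are made up, it only counts registers rather than holding values, and the 224-register capacity and 32-register frames in the demo are assumed numbers.

```python
class RegisterStack:
    """Counts spills/refills for a register file used as a frame stack."""

    def __init__(self, capacity):
        self.capacity = capacity   # registers available for the stack
        self.frames = []           # sizes of active frames, oldest first
        self.resident = 0          # stack registers currently in the register file
        self.spilled = 0           # stack registers currently saved in memory
        self.spill_writes = 0      # total registers written out to memory
        self.refill_reads = 0      # total registers read back from memory

    def push_frame(self, size):
        if size > self.capacity:
            raise ValueError("a frame must fit in the register file")
        self.frames.append(size)
        self.resident += size
        overflow = self.resident - self.capacity
        if overflow > 0:
            # Spill only as many of the oldest registers as needed;
            # this need not line up with a frame boundary.
            self.resident -= overflow
            self.spilled += overflow
            self.spill_writes += overflow

    def pop_frame(self):
        self.resident -= self.frames.pop()
        needed = self.frames[-1] if self.frames else 0
        if self.resident < needed:
            # Underflow: bring back enough registers for the new top frame.
            refill = needed - self.resident
            self.resident += refill
            self.spilled -= refill
            self.refill_reads += refill


if __name__ == "__main__":
    stack = RegisterStack(capacity=224)   # assumed: 256 minus 32 globals
    for depth in range(1, 9):
        stack.push_frame(32)              # assumed 32-register frames
        print(depth, stack.spill_writes)
```

With those assumed sizes, nothing spills until the eighth nested call, which is the point: ordinary call depths never touch memory.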
Sunlit World Scheme. Weird and different.