Linux On Another New Architecture: PowerPC 64-bit
An unnamed correspondent writes: "This one rather silently whizzed by on the kernel mailing list. IBM reports that they have ported Linux to PowerPC hardware running in 64-bit mode. This no doubt applies only to the larger processors but it's pretty cool all the same." I don't see this processor yet listed on the NetBSD page, even on the mind-bending list of not-yet-integrated ports; is this a first? :)
_PowerPC Concepts, Architecture, and Design_ by Chakravarty and Cannon (McGraw Hill, 1994) mentioned the 64-bit architecture.
--
"It's tough to be bilingual when you get hit in the head."
Given the current trend of consolidation, I see room for Intel, AMD, and a high-end player yet to be named - either Alpha or PPC. I'm discounting the Mac userbase in advance as I believe Mac users care the least about the technical details of their platform, and hence constitute an OS market more than a microprocessor market.
Neither Sun nor Compaq builds mainframes... not sure what you are referring to...
--
"It's tough to be bilingual when you get hit in the head."
> out six weeks before the internally developed 630.
The 620 made it into at least one Apple server, iirc. And when it trounced the wintel boxes in a benchmark, the predictable response back was that it wasn't fair to compare a 64 bit machine to 32 bit machines.
hawk
If the PPC group can't get their changes into the linux kernel (as has been noted on /.), then why does it matter?
If it was said on slashdot, it MUST be true!
> The guys in the server room who clean the crap off the floor get hot
> and bothered by operating systems. The guys these people clean up
> after only care about getting the job done right.
For the most part. But if you told them to run NT on big iron, they would probably get hot under the collar and very bothered.
hawk
:)
Ok, I'm sure I do need a clarification (I'm a compiler person, not a hardware person, and have never worked with the Power series). Why not just use multiple cores on a single chip? Last time I heard, that's what Power4 was going to do, right? Also, are you saying that the 8 execution units are split between several separate threads? If so, does anybody know if it's a fixed split (4 for first thread running, 4 for second) or dynamic (which would be surprising, but cool. . .)?
Thanks, all!
--JRZ
What set of benchmarks were you running when you collected those numbers? I assume you were using a benchmark suite such as SPEC. This is only one measure, useful to a certain audience.
For that particular project, I was using the go and cc1 integer benchmarks from the SPEC suite (not sure which year). No special reason; these were just the ones I had on-hand, and for the project it didn't really matter (as I was interested in relative and not absolute results).
I cite the figures from that project as a reference point to give some idea of the ballpark values that can be expected. A 50% increase in ILP for "average" code I might believe. A 400% increase I wouldn't.
Yes, certain scientific applications can be written to be easily parallelized, but this is only one niche. For most code, I am deeply skeptical of filling 8 issue units per clock. SMT offers the potential for across-the-board speedup (as long as you're running more than one CPU-bound thread on the machine at once).
Why not just use multiple cores on a single chip? Last time I heard, that's what Power4 was going to do, right? Also, are you saying that the 8 execution units are split between several separate threads? If so, does anybody know if it's a fixed split (4 for first thread running, 4 for second) or dynamic (which would be surprising, but cool. . .)?
The multiple cores idea has been around for a while, and certainly works; SMT is just more resource-efficient.
My impression is that the Power4 is going to have two cores, but I haven't been following it closely, so I could easily be wrong about that.
In a SMT system, functional units are indeed shared dynamically between the threads. As far as most of the chip's concerned, there's only one instruction stream, composed of interleaved instructions from the two threads (well, not interleaved in lockstep, but close enough). All you'd need to do would be to add an extra bit onto the register specifier tags (so that the two threads access non-overlapping sections of the register file) and give each thread its own page table identifier (selected by a few bits tacked on to the address). You could even get away with having a single TLB cache.
In summary, you can keep most of the design the same as for a single-thread machine, and make relatively minor changes in a few places to implement SMT. This takes far less silicon than dual cores, and lets you use the functional units more efficiently and use a wide issue unit efficiently (by boosting parallelism in the instruction stream).
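To make the register-tag trick above concrete, here is a minimal Python sketch. The figure of 32 registers per thread, the class name and the method names are all assumptions for illustration, not a description of any real chip.

```python
# Toy model: one physical register file shared by two hardware threads.
# Assumption: 32 architectural registers per thread (hypothetical figure).
REGS_PER_THREAD = 32
REG_BITS = 5  # log2(32): bits in an ordinary register specifier


class SharedRegisterFile:
    def __init__(self, threads=2):
        self.regs = [0] * (threads * REGS_PER_THREAD)

    def physical_index(self, thread_id, reg):
        # The thread id becomes the extra high bit of the register tag,
        # so the two threads land in non-overlapping halves of the file.
        return (thread_id << REG_BITS) | reg

    def write(self, thread_id, reg, value):
        self.regs[self.physical_index(thread_id, reg)] = value

    def read(self, thread_id, reg):
        return self.regs[self.physical_index(thread_id, reg)]


rf = SharedRegisterFile()
rf.write(0, 3, 111)  # thread 0 writes its r3
rf.write(1, 3, 222)  # thread 1 writes its r3; no collision
```

Both threads name "r3", but the extra tag bit sends them to different physical slots, which is why the rest of the pipeline can treat the mix as a single instruction stream.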
Good way of looking at it, but lots of other things matter as well:
pipeline efficiency, memory bus bandwidth, smp cache coherency efficiency.
If you don't need >4GB of address-space then you're probably better off with a high-clock 32-bit chip and a good memory bus
**Vanuatu or bust**
-
Use Validators and Load Generators to Test Your Web Applications
Maybe the folks who write the Slashcode would find it helpful.
I've posted this here before, but don't want the IBM folks who might be reading to miss it:
-
Using Test Suites to Validate the Linux Kernel
Comments, criticism, additional links and resources to add, suggestions for future articles to write and of course articles you would like to write are appreciated. I could also use some help from someone with expertise in designing database schemas.
Thank you for your attention.
Mike
-- Could you use my software consulting serv
The system from the benchmark report has 24 RS64-IV 64-bit processors running at 600 MHz with 96GB (yes, GB) of system DRAM. Each processor has 128kB L1 data cache, 128kB L1 instruction cache, and a 16MB L2 cache. The chips also support coarse-grained multithreading (simpler, but similar to SMT).
(600 MHz sounds slow until you realize that it uses a simple, very efficient 5-stage pipeline. Intel and others achieve high clock rates through deep pipelines and rely on branch prediction and other techniques to keep the pipe full. Branch mispredictions and cache misses can kill the actual performance of these chips on real server code.)
This system with 24 processors outperforms HP's 48 processor "SuperDome" and Sun's 64 processor UE10k (though the UE10k is an old system by now, it is the fastest server Sun is shipping.)
The above system is not using the Power3 chip from the posted story. You can bet IBM will port Linux to this beast next. We won't see 24-processor systems with Linux right away, but an s80-like system would make a sweet 4-processor Linux server.
One last note: these systems are not vapor-ware. A 12-processor system with an earlier version of the same processor has been shipping since the summer of '98.
Here's some acronyms for you, which should clear up the mess:
POWER == Performance Optimization With Enhanced RISC
PowerPC == POWER - Performance Computing
The PowerPC was developed as a cut-down (32-bit instead of 64-bit and lacking a few rarely-used and complex instructions), largely binary-compatible version of the POWER.
PowerPC isn't really any particular processor, but a specification, which was first implemented as the PowerPC 601 back in 1994 (remember how it totally wiped out the Pentium-75?). Subsequently, embedded versions have been made, along with more powerful desktop versions of the PowerPC - the 603, 604, 750 (G3), 7400 (G4) and now the 7450 (G4+).
Meanwhile, the POWER has been developed as well, remaining a high-end 64-bit monster for the enterprise-level RS/6000 machines. The PowerPC 601 was based more on the POWER1 than anything else, the chip shown in the log is a POWER3, and the current hot topic is the POWER4 with all these nice new features (one or two of which have reportedly already made it into the 7450...).
The bottom line is that the POWER and the PowerPC are different but surprisingly similar beasts. They are nearly binary-compatible, which is why the kernel reports it as a PowerPC-class processor.
--- The key to knowledge is not to rely on people to teach you it ---
Linux was the first OS ever to boot on Itanium. (*bsd not there).
Where are the Itanium computers? This port isn't of much use to nearly everyone.
Linux was first on PPC64. (*bsd not there).
Where are the PPC64 computers?
Linux was first free OS on S/390. (*bsd not there.)
How many people own an S/390?
Linux was first on UltraSPARC.
And where is a semi-usable UltraSPARC distribution?
Heck, all of these ports require much hand-rolling. And you also mentioned hardware which the vast majority of people here have never even touched or seen- have you?
Proof of concept ports, and ports that aren't deployed anywhere in the real world: these aren't of much use, regardless of if the port is of a Linux or a BSD.
-bugg
- Products such as Websphere can be released on one OS platform (Linux) and run on IBM's entire range of hardware.
- Linux has a lot more "geek momentum" than AIX. The guys in the server room would probably be much more excited to get a kick-ass RS/6000 if it meant they could stick Linux on it.
- It gives them something to talk about in their upcoming advertising campaign.
- IBM is a hardware company. To them, software is a way to sell hardware. If Linux is popular, then it's in IBM's interest to make sure it runs on their most expensive kit. They'd rather sell an RS/6000 than a Netfinity. (this also explains their porting it to S/390 first)
- I wouldn't be surprised if the long term plan was to fold the enterprise functionality of AIX into Linux, have the OS maintained by the open source community with much less IBM manpower than AIX takes, and then put AIX out to pasture.
Charles Miller--
The more I learn about the Internet, the more amazed I am that it works at all.
Datacenter can address 64 GB via Intel's Physical Address Extension hack (basically allows 36-bit addresses via "dual address cycle" hardware, which can pump in two 32-bit addresses in one clock cycle).
It performs pretty well for a kludge but does require your application to use the MS AWE (Address Windowing Extensions) memory allocation APIs, which have some restrictions, such as only providing page-fixed memory and only allowing you to dealloc in the same unit you alloc'd (so writing dynamic memory handling is not easy).
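For anyone checking the arithmetic behind those address sizes, a quick Python sketch follows. The helper names are made up for illustration; the only facts used are powers of two.

```python
GIB = 2 ** 30  # bytes in one binary gigabyte


def addressable_bytes(address_bits):
    # A flat address of n bits can name 2**n distinct bytes.
    return 2 ** address_bits


def bits_needed(total_bytes):
    # Smallest number of address bits that can name every byte.
    return (total_bytes - 1).bit_length()


flat_32 = addressable_bytes(32) // GIB  # plain 32-bit addressing
wide_36 = addressable_bytes(36) // GIB  # the extended physical address
bits_for_64g = bits_needed(64 * GIB)
```

A 32-bit address tops out at 4 GB, and reaching 64 GB takes 36 bits, which is why each process still needs a windowing API to reach beyond its own 32-bit view.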
The increased address space is cool if your o/s has a good (fast, influenceable) vm manager - you can strip out buffer mgt code from your app (reduces complexity).
Also great for server apps that do lots of read io, as you can buffer even at large concurrent user workloads, so I can see Real/Oracle/Akamai type apps benefiting.
**Vanuatu or bust**
Where are the Itanium computers? This port isn't of much use to nearly everyone.
Itanium represents the first commodity 64bit enterprise computing platform. A major advance if you ask me (regardless of performance), and linux will be there first, along with SCO, and win2k bringing up the rear.
Where are the PPC64 computers?
Ever hear of Power3 and Power4, and AIX? 'nuff said.
How many people own an S/390?
I think the count of people that use S/390 is far less important than the importance of those people. S/390 has no peer in its class as a mainframe. Sun's Starfire comes close.
And where is a semi-usable UltraSPARC distribution?
Debian has a semi-usable distribution for UltraSPARC. I believe they have XFree working, among other things, along with the trivial ports that just require a linux kernel.
Proof of concept ports... these aren't of much use...
Needless to say, I disagree.
eh? The 'bittiness' of the CPU rarely has anything to do with floating point capabilities. The Intel x86 line all have the ability to use 80-bit floating point numbers (10 bytes). In fact, it was because of this the [in]famous FPU memory move was created for the Pentium processors -- it was faster to move memory into the FPU registers and then out back to memory than it was to use the usual movsd instructions to do the same, because via the FPU you moved 8 bytes (64 bits) at a time, whereas with movsd, you were only moving 4 bytes at a time. On the Pentium Pro and Pentium II, they finally fixed this by the use of write combining so that movsd'ing a block of memory was as fast or faster than doing it via the FPU. The number of bits generally refers to one of two features of the CPU -- either its bus, or the size of the general purpose registers and address space. The Intel Pentium, for example, had a 64-bit bus, but still only 32-bit registers and memory space. The Intel 80386SX had a 16-bit bus, and 32-bit registers.
I used up all my sick days, so I'm calling in dead.
this is kinda offtopic and pedantic but...
:)
IBM haven't ported Lunix (A UNIX implementation for the Commodore 64/128) to the 64-bit PPC platform. They ported Linux. Get It Right.
Sorry... someone had to do it though
life is a canvas/and the paint is hope and promise/the world is ours/no one can ever take it from us.
"Unlike a typical PC microprocessor, the chip features eight execution units fed by a 6.4 gigabyte-per-second memory subsystem, allowing the POWER3 to outperform competitors' processors running at two to three times the clock speed"
Eight execution units! I recall that the x86 line have half of that. And 6.4GB/s memory is not to be laughed at either!
Memory bandwidth is a good thing. Low latency cache hits are a great thing, if you can get them (no idea if PPC does this or not).
However, adding more execution units won't buy you much beyond a fairly small number. The reason: you just don't have that much extractable parallelism in the serial instruction stream.
I had the good fortune to be playing with this recently via simulation. If you give the processor a *huge* instruction window (256 instructions) and the ability to execute *any* number of instructions of *any* type in parallel (except for memory accesses - see below), you still get an average Instructions Per Clock of about 2.1-2.2. 95% of the time, you're getting four instructions or fewer issued (and most of the time, far fewer than that).
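A toy version of that kind of experiment, as a hedged Python sketch: it is not the simulator described above, and the instruction lists are invented. It assumes unit latency, unlimited functional units and perfect register renaming, so only true read-after-write dependences limit issue.

```python
def ideal_ipc(instructions):
    """Best-case instructions per clock for a list of (dest, [sources])
    tuples in program order, under the idealized assumptions above."""
    ready_cycle = {}  # register -> cycle in which its latest value appears
    depth = 0
    for dest, sources in instructions:
        # An instruction can run one cycle after its slowest producer.
        cycle = 1 + max((ready_cycle.get(s, 0) for s in sources), default=0)
        ready_cycle[dest] = cycle
        depth = max(depth, cycle)
    return len(instructions) / depth


# Hypothetical snippets: a pure dependence chain and a slightly looser mix.
chain = [("r1", []), ("r2", ["r1"]), ("r3", ["r2"]), ("r4", ["r3"])]
mixed = [("a", []), ("b", []), ("c", ["a", "b"]),
         ("d", ["c"]), ("e", []), ("f", ["e"])]

ipc_chain = ideal_ipc(chain)
ipc_mixed = ideal_ipc(mixed)
```

Even with infinite hardware, the chain cannot beat one instruction per clock and the mixed snippet manages two, because the longest dependence chain sets the floor on cycle count no matter how many execution units are waiting.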
When SMT is put in silicon, wider issue will become practical (due to increased parallelism in the instruction stream), but as it is, you're better off spending the silicon on other improvements.
Re. memory accesses; the reason why it's extremely difficult to do memory accesses out-of-order with each other is that you have to check to see if any given two memory accesses refer to the same location (indicating a dependence). You often don't know what the target address is until late in the pipeline, and you'll still need to do a TLB translation to get the physical address, and compare two large bit vectors (the addresses).
Remember, to be useful for scheduling, you have to be able to do all of this very quickly and very early in the pipeline.
All of this makes out-of-order memory accesses very difficult to implement theoretically, and a nightmare to implement in real silicon. It's still sometimes done in a limited manner, but this doesn't affect the IPC very much.
Right. Let's take a look at this.
Itanium: Linux has an Itanium emulator written specifically for it, by Intel, I believe. That makes it kind of easier. Besides that, BSD does boot on the Itanium, even though they were severely impeded by lack of tools.
PPC64: It was ported by a corporation, fuckwit. A corporation with more resources than a non-profit organization could ever put towards porting to a platform, porting Linux to run on their own hardware, whereas NetBSD is an independent effort. They can't just run out and get a PPC64 box for themselves.
S/390: Same story.
UltraSPARC: Both run on UltraSparc, but I don't know dates of when they first booted, or the extent of Linux/Sparc support. This might actually be...a *relevant point*!
And then you call this stuff "mainstream, state of the art hardware". For all but the UltraSPARC, it's impossible for a normal person to even lay hands on one of those machines. Even in the case of corporations, how many do you know that are running Linux on IBM boxes instead of AIX? Why the hell would anyone want to, seeing as how AIX generally outperforms it anyhow?
In any event, how about high-end hardware that people can actually buy? NetBSD was the first to be running on the Alpha, for instance, a high performance platform that actually matters. First on SGI boxes. How about i386, the architecture everyone uses? In the early 90s, NetBSD was far more complete and usable than Linux, and to this day has very complete hardware support for the platform. One could also point out that Linux has been lagging behind on new technologies, like IPv6. Might want to take that into account when you're tallying up the final "Score".
Of course, this is not all of them, S/390 is even missing.
And uLinux runs on architectures like the DragonBall, and other things too. I don't know of a complete list anywhere.
-skip
There's another list here, with some other ports mentioned, that a quick google search turned up.
-skip
Actually even the width of the address bus isn't necessarily a limiting factor. The ancient Sinclair QL used a 68008, which could handle 32 bit addresses, and thus 4GB of memory, but only had an 8-bit combined address and data bus. It'd take 4 bus clocks to select an address, and another four to read/write a 32 bit value from/to the location.
Ouch!
If your general purpose registers are 32 bits (which is the definition of a 32-bit CPU) and addresses are 64 bits, where do you store the pointers? That's why in most recent chips, pointers are the same size as the integer registers.
IBM's current p680 box (up to 24 RS64 IV CPUs) does implement a kind of multi-threading already. It is too coarse to be called SMT, but it is multi-threading. As it is now, each processor 'presents' itself as two CPUs to the OS. They say that it took less than 5% of the chip real estate to support this multi-threading. If you look at their benchmark results on SPEC and TPC, it seems to have paid off quite well.
When information is power, privacy is freedom.
Yes. You average 2.1-2.2, and 95% of the time you're only getting 4 or fewer. However, when you look at the other stuff the Power3 architecture includes, it's pretty obvious what the overall intent is:
Hardware Loop Unrolling.
IBM has got some customers that use some serious CPU. We're talking national labs and the like. For them, the ability to run 8 of those neat 'multiply-and-add' instructions per clock cycle is quite an important feature.
The chip *starts* at 375MHz, and can do 16 floating point ops/clock (an amazing amount of code uses that mult-and-add over an array - and the IBM compilers are smart enough to detect and convert divide to multiply-by-inverse and add/subtract issues).
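A rough Python sketch of the two patterns mentioned there, the multiply-and-add over an array and the divide turned into a multiply by the inverse. The function names are invented for illustration, and this says nothing about what any particular IBM compiler actually emits.

```python
def scale_by_division(values, divisor):
    # Naive form: one divide per element.
    return [v / divisor for v in values]


def scale_by_reciprocal(values, divisor):
    # Transformed form: one divide hoisted out of the loop, then only
    # multiplies inside it.
    inverse = 1.0 / divisor
    return [v * inverse for v in values]


def multiply_add(a, x, y):
    # a * x[i] + y[i] for each element: the multiply-and-add pattern that
    # a fused multiply-add unit can retire as a single operation.
    return [a * xi + yi for xi, yi in zip(x, y)]


data = [1.0, 2.0, 6.0]
divided = scale_by_division(data, 4.0)
multiplied = scale_by_reciprocal(data, 4.0)
combined = multiply_add(2.0, [1.0, 2.0], [10.0, 20.0])
```

With a power-of-two divisor the two scaling forms agree exactly. For other divisors the results can differ in the last bit, so a compiler generally needs permission to relax strict floating-point rules before making this substitution.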
And of course, IBM is hoping that even though the big SP/2 iron is limited to national labs and Fortune-500 companies (see The Top500 List for details), that they'll be able to sell a lot of the smaller 43P deskside boxes (1-4 Power3 CPUS) and the 8-16 CPU rackmount servers, to all the smaller companies that need number-crunching.
...and a 64-bit integer can be manipulated on a 32-bit machine - and even fairly conveniently, if the compiler cooperates, and GCC does cooperate here (think long long int).
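What the compiler does behind a `long long` on a 32-bit machine can be sketched in Python as an add-with-carry over two 32-bit halves. The helper names are made up for illustration; real compilers emit the equivalent add/add-with-carry instruction pair.

```python
MASK32 = 0xFFFFFFFF


def split(x):
    # Break a 64-bit value into (low word, high word).
    return x & MASK32, (x >> 32) & MASK32


def join(lo, hi):
    return (hi << 32) | lo


def add64(a_lo, a_hi, b_lo, b_hi):
    # Add the low words first, then feed the carry into the high words.
    lo = a_lo + b_lo
    carry = lo >> 32
    hi = (a_hi + b_hi + carry) & MASK32
    return lo & MASK32, hi


a = 0x00000001FFFFFFFF
b = 0x0000000000000001
total = join(*add64(*split(a), *split(b)))
```

Every intermediate value fits in a 32-bit register plus a carry flag, which is all a 32-bit CPU needs in order to handle 64-bit integers, at the cost of an extra instruction or two per operation.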
My question is, is there such a page updated with such info? I don't believe that Linus Torvalds maintains all the different architecture branches.
Thanks!
r. ghaffari
(25/M/Baltimore, MD)
As in almost any area where there's money to be earned, Big Blue is in there with some really cool hardware.
Taken from http://www.rs6000.ibm.com/resource/pressreleases/1998/Oct/power3.html:
"Unlike a typical PC microprocessor, the chip features eight execution units fed by a 6.4 gigabyte-per-second memory subsystem, allowing the POWER3 to outperform competitors' processors running at two to three times the clock speed"
Eight execution units! I recall that the x86 line have half of that. And 6.4GB/s memory is not to be laughed at either!
Thomas S. Iversen
IBM can build big-ass proprietary servers and deploy them for customers while still using standard software products. Big deal for IBM since lunix is now a well respected server operating system. Easy to port software to and easy to market.
So, you can see this as yes IBM is scratching an itch, but at the same time making lunix more available in the high-end enterprise environment.
These are platforms I'd like to see some porting being done for:
Other people's brains (remote access, perhaps X-10 integration)
Garage door opener (so I can apply sound themes to it, and replace the "so 1991" screeching)
Alarm clock (see garage door opener)
Pets (obedience school in C, Perl, or any language you want).
Jeremy McNaughton
------ Live simply so that others may simply live.
Without passing value judgement...
The PowerPC running in 64-bit mode will help them get Linux up-and-running on the eServer iSeries (that platform has been 64-bit for longer than just about any other major server). It allows them to fulfill their goal of getting Linux running across all the eServers.
In a word, yes. If the "Not Invented Here" syndrome becomes rampant, then large corporations will have less incentive to build improvements upon the system, and therefore, will start again with either a fork or a potentially closed-source proprietary system. Either way, support for the original open-source system will wither away, reducing the potential for corporate uptake.
However, this is just a generalisation; I cannot comment specifically upon Apache itself.
--
The 64-bit PowerPC architecture antedated the RISC AS/400's, as far as I know - as I remember, I saw a PowerPC architecture manual describing 64-bit mode before the RISC AS/400's came out (it was some time in 1994, I think, when I saw it).
The PowerPC 620 was supposed to be the first 64-bit PowerPC; I don't know whether any machines shipped with it. IBM now have 64-bit PowerPC's in both the AS/400 and RS/6000 machines (I think some of the RS/6000's use the same chip as some of the AS/400's, with the tag bits and other AS/400 extensions disabled in the RS/6000's).
I completely understand your point and agree, but if I may offer an example or two of Slashdot reporting:
http://slashdot.org/article.pl?sid=01/02/21/1450234&mode=thread
"it looks like NetBSD could give Linux a run for its money in the handheld arena."
http://slashdot.org/bsd/01/02/05/1859221.shtml
" 'Linux 2.4.0 is available for no money. So is FreeBSD. Linux uses advanced hardware, so does FreeBSD. FreeBSD is more stable and faster than Linux, in my opinion. "
Basically the precedent is that it is acceptable to be inflammatory as long as you aren't Linux. A majority of the articles comparing BSD and Linux do so on a well-known point, stress under high loads. Notice that Slashdot does not post articles comparing native application support, user-base, or multi-processor support. Posting of such articles or comments will likely be considered inflammatory.
In posting this, in no way am I trying to start a flamewar. However, I do feel Slashdot holds a double standard in how it treats BSD remarks, especially on the front page. Being immature and biased is useless, regardless of OS choice. Thoughts?
Far better for them to put the work into ensuring a stable port to their new chip. Now all they need to do is to wait for Intel to put out a sickly version of the Itanium (like they did with the first release of the P4).
--
Free Software: Like love, it grows best when given away.
This is probably IBM anticipating. After all, just because there's no demand now doesn't mean that there won't be demand when the system is available. Getting the system ready ahead of demand is smart; it means that when people running PPC want more horsepower, IBM will be able to provide them with a nice smooth path to 64 bit PPC. This looks like it's just a regular part of IBM's Linux strategy. They want to make it available everywhere, so companies can upgrade to more and more powerful systems without having to relearn everything.
There's no point in questioning authority if you aren't going to listen to the answers.
Something that I can see this having a use for is boxes that are going to soon (very probably) reach an 'end of life' in the IBM OS camp. True, there is Project Monterey, but some of these 64 bit machines could definitely end up lying in the dust. Especially with a whole new OS and designers that may decide that the earlier systems are too 'hard' to support easily.
I'd much rather see IBM make sure that something decent still runs on older boxes, than having no shipping OS at all that will support such platforms.
And I'm not saying IBM will drop support for these platforms anytime soon, but it's much easier to get the thing started now, than later, when it could be too little, too late.
HP's PA-RISC's are a prime example. Linux runs on these, as various numbers of the machines fell into the hands of Linux developers that have experience in kernel code/porting. Unfortunately the machines never really made it, but at least the people out there have something that runs and is probably better supported than what you would otherwise get. It was the thing, just too late in the game. At least IBM aren't making the mistake of leaving people lingering with unusable hardware.
Also remember that IBM has a considerably large Second Hand group that reconditions trade-in systems and then sells them to more disadvantaged groups or countries - and if they don't have an OS to ship on those machines, what do they do with them?
Of course, what contributions they get from some of the truly bright sparks in the community with the Linux port may actually improve the way their own code monkeys write/implement their next OS kernel, which they can only view as a win-win situation.
Where do you get your figures (everybody does not know that, yadda, yadda...)
Don't fall into the arrogant assumption of thinking one architecture is enough for everything. You wouldn't want just VGA graphics now would you?
The 370 architecture is alive and quite well, thank you, and processing payroll, accounting and other mundane crap that you can't live without.
But you wouldn't want to have to write a game for a 3279 terminal now would you. No more than you'd want to bank with somebody who'd balance your accounts on a PS2.
The PPC architecture is alive and well and the G4 is very useful for some types of processing and totally useless for other things but what it does, it does damn fast.
The x86 is as much of a dead-end as the z80. It will be utterly swamped by the requirements of voice processing and image recognition that a wired economy needs. Forget passwords. Just say your name and smile for the cam. (And that's only the first app. The one at the gate, so to speak. )
MSBPodcast.com The opinions expressed here are my own. If you don't like 'em... Think up your own stuff.
Doing this kind of parallelism extraction in the compiler just plain makes more sense than doing it on the chip (with SMT). The compiler can see all of the source code at once and spend a huge amount of time studying the problem (which can be extremely complicated, if you want a really good, inter-procedural, flow-sensitive analysis) and then spit out code that bundles it up in explicit parallelism.
That's exactly why IA-64 is going to kick the crap out of other architectures 3 years down the line (once the compilers actually get good). And it's also why RedHat is screwing over their users by sticking with gcc over SGI's (GPL'ed) IA-64 compiler. As the first author said, conventional compilers only get 2.X IPC at best. So two-thirds of the Itanium execution units are wasted when you use a compiler like gcc. SGI's, on the other hand, was redesigned from the ground up (starting with the gcc parser for compatibility) to use all of the neat, theoretical tricks that you need to get ILP in this situation. TurboLinux has already gone with it and demonstrated good results (that's one reason why NCSA will be using Turbo for the second stage of their huge, new cluster). But gcc is Cygnus' baby, and they will fight to keep using it, no matter how badly it hurts performance in the end.
--JRZ
Doing this kind of parallelism extraction in the compiler just plain makes more sense than doing it on the chip (with SMT). The compiler can see all of the source code at once and spend a huge amount of time studying the problem (which can be extremely complicated, if you want a really good, inter-procedural, flow-sensitive analysis) and then spit out code that bundles it up in explicit parallelism.
You seem to have an incomplete picture of what SMT is.
SMT - Simultaneous Multi-Threading - is simply the ability to have multiple threads running on a chip at the same time, with separate fetch units and register files but with the instruction window and the functional units still shared.
The threads don't even have to be from the same program, or in the same address space (though it'll reduce TLB and cache load if they are).
No extra effort is needed on the part of the programmer, and you get N times as much instruction level parallelism with N threads as you would for one thread. In one instruction stream, you'll always have dependencies that can't be avoided - true dependencies. Parallel threads don't have any shared dependencies for register operations, and are much less likely to have dependencies for memory operations (under most conditions).
A compiler, on the other hand, has to be made extremely complex to extract much more parallelism than is currently extracted, and still won't be able to capture a lot of it. I know this far too well, having seen the guts of compilers on a few occasions. You'll also get no benefit for legacy code or for code that was compiled with a mediocre compiler (as almost all code is, to Intel's continuing dismay).
SMT is especially nice because there's almost no extra hardware overhead for implementing SMT. It's a winning strategy from all angles.
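The "N threads, N times the parallelism" claim can be illustrated with an idealized Python sketch. It assumes unit latency, unlimited functional units, and two threads that share no registers; the instruction chain is invented, and real hardware will fall short of this best case once the threads start competing for issue slots and cache.

```python
def critical_path(instructions):
    """Cycles needed for a list of (dest, [sources]) in program order,
    assuming unit latency and unlimited functional units."""
    ready = {}
    depth = 0
    for dest, sources in instructions:
        cycle = 1 + max((ready.get(s, 0) for s in sources), default=0)
        ready[dest] = cycle
        depth = max(depth, cycle)
    return depth


def tag(thread_id, instructions):
    # Give each thread its own register namespace, as SMT hardware does.
    return [((thread_id, d), [(thread_id, s) for s in srcs])
            for d, srcs in instructions]


# A fully serial chain: every instruction needs the previous result.
chain = [("r1", []), ("r2", ["r1"]), ("r3", ["r2"]), ("r4", ["r3"])]

one_thread = tag(0, chain)
# Interleave the same chain from two independent threads.
two_threads = [i for pair in zip(tag(0, chain), tag(1, chain)) for i in pair]

ipc_one = len(one_thread) / critical_path(one_thread)
ipc_two = len(two_threads) / critical_path(two_threads)
```

The serial chain is stuck at one instruction per clock on its own, but two copies from separate threads overlap completely, so the combined stream reaches two per clock without any compiler effort.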
+++
+++
NO CARRIER