Cool Linux Tricks With Atlas
dpilgrim writes: "Looks like some powerful players want to see Linux going toe to toe with Unix 'big iron.' Would you like to be able to run two Linuxes simultaneously on the same box? Or seamlessly swap processors and memory in and out of your machine? The Atlas project aims to bring you all that and more. There's a press release from TurboLinux reported here, and a more in-depth article running on SourceForge's Linux on Large Systems Foundry."
Atlas seems to rely on a new Intel chipset and the McKinley processor (one of Intel's new 64-bit processors). It sounds like this new chipset will support hot-swapping, and any motherboard maker that uses it would make sure the slots could do that. So yes, you do need a special mobo, but it won't be available for a while.
This isn't a Linux issue. It's a hardware issue.
The significant thing about 'big iron' is that it's an enabling hardware technology.
Once you have it, you can write firmware and software that create the illusion that the hardware never fails.
Until you have it, you can't.
The hardware described looks about right - if they handled machine checks properly. (And the fact that they even used the term implies they either did or are trying.) Basic idea: The machine catches ANY error, with enough state saved that you can:
CORRECT the error,
IDENTIFY any failed components,
MOVE tasks to non-failed components or reconfigure the failing components to limp along,
NOTIFY the OS of any problem, so it can do things like start moving things off a dying component, and
PICK UP the computation where it left off WITHOUT the error.
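For what it's worth, the steps above can be sketched as a toy model. This is only an illustration in Python, not how Atlas or any real firmware does it: the names Component, MachineCheck and handle_machine_check are made up for the example, and the "saved state" is just a step counter standing in for whatever the hardware actually preserves.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """Hypothetical hardware part (CPU, memory board, ...)."""
    name: str
    healthy: bool = True
    tasks: list = field(default_factory=list)


@dataclass
class MachineCheck:
    """Hypothetical record of a caught error."""
    component: Component  # where the error was caught
    saved_state: int      # last known-good state (here: a step counter)
    correctable: bool     # True if the part itself can keep running


def handle_machine_check(mc, spares, os_log):
    """Walk the five steps; return (component, state) to resume on."""
    # CORRECT: fall back to the state saved before the error.
    state = mc.saved_state
    target = mc.component
    if not mc.correctable:
        # IDENTIFY: mark the failed component.
        mc.component.healthy = False
        # MOVE: shift its tasks to the first healthy spare.
        target = next((c for c in spares if c.healthy), None)
        if target is None:
            raise RuntimeError("no healthy spare available")
        target.tasks.extend(mc.component.tasks)
        mc.component.tasks.clear()
    # NOTIFY: tell the OS what happened.
    outcome = "corrected" if mc.correctable else f"moved tasks to {target.name}"
    os_log.append(f"machine check on {mc.component.name}: {outcome}")
    # PICK UP: the caller resumes from the saved state on the target.
    return target, state


cpu0 = Component("cpu0", tasks=["db"])
cpu1 = Component("cpu1")
log = []
target, state = handle_machine_check(MachineCheck(cpu0, 41, False), [cpu1], log)
print(target.name, state, log)
```

The point of the sketch is the ordering: nothing is moved or reported until there is a known-good state to fall back to, which is why the hardware has to save enough state at the moment it catches the error.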
When you can do this you can write a modified Linux, Windows, BeOS, or what-have-you that can do the things a mainframe can. (But you'll need to have a REALLY reliable OS for your starting point - you're now talking uptimes measured in decades. The software had better not take the system down in the absence of hardware trouble, and there IS NO hardware trouble. B-) )
Hot-swappable parts are more a side-effect than something key. You have to be able to hot-swap to replace a broken part with the system live. Once you have the ability to hot-swap in a replacement for a failed part and add it back into a running domain, it's trivial to generalize that to "fix" parts that were "bad" because they had never been installed.
Partitioning is also implied: You need a minimum of two domains ("virtual machine" subsets of the total device) - working (where the live system is) and diagnostic (where the maintenance guys check out the parts). Once you have that mechanism, making a LARGE number of working domains (with varying amounts of resource, including full or time-shared CPUs) is straightforward.
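A rough sketch of that generalization, again in Python and again purely illustrative: the Machine class and its method names are hypothetical, and parts are just strings. It starts with the two mandatory domains and shows that hot-add, repair, and extra working domains all reduce to moving a part between sets.

```python
class Machine:
    """Toy model: every part belongs to exactly one domain."""

    def __init__(self):
        # The minimum: one working domain and one diagnostic domain.
        self.domains = {"working": set(), "diagnostic": set()}

    def hot_add(self, part):
        # A newly inserted part looks like a "bad" part that was never
        # installed, so it starts out in the diagnostic domain.
        self.domains["diagnostic"].add(part)

    def move(self, part, src, dst):
        # Creating dst on demand is what makes many working domains cheap.
        self.domains[src].remove(part)
        self.domains.setdefault(dst, set()).add(part)

    def fail(self, part):
        # Pull a failing part out of whichever live domain holds it.
        for name, parts in self.domains.items():
            if name != "diagnostic" and part in parts:
                self.move(part, name, "diagnostic")
                return name
        raise KeyError(part)


m = Machine()
m.hot_add("cpu2")                         # part slides in, gets checked out
m.move("cpu2", "diagnostic", "working")   # passes, joins the live system
m.move("cpu2", "working", "linux-b")      # later handed to a second Linux
print(m.fail("cpu2"), sorted(m.domains))
```

A real system would also have to track memory ranges, I/O slots and time-shared CPUs, but the bookkeeping has the same shape.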
Bantam Dominique roosters crow a four-note song. Once you've heard it as "Happy BIRTHday" you can't NOT hear it that way