Linux May Need a Rewrite Beyond 48 Cores
An anonymous reader writes "There is interesting new research coming out of MIT which suggests current operating systems are struggling with the addition of more cores to the CPU. It appears that the problem, which affects the available memory in a chip when multiple cores are working on the same chunks of data, is getting worse and may be hitting a peak somewhere in the neighborhood of 48 cores, when entirely new operating systems will be needed, the report says. Luckily, we aren't anywhere near 48 cores and there is some time left to come up with a new Linux (Windows?)."
It appears that the problem, which affects the available memory in a chip when multiple cores are working on the same chunks of data, is getting worse and may be hitting a peak somewhere in the neighborhood of 48 cores, when entirely new operating systems will be needed, the report says.
Seriously? You picked that over my submission?
I submitted this earlier this morning; I guess my submission was lacking. But if you're interested in the original MIT article and the actual paper (PDF):
eldavojohn writes "Multicore (think tens or hundreds of cores) will come at a price for current operating systems. A team at MIT found that as they approached 48 cores their operating system slowed down. After activating more and more cores in their simulation, a sort of memory leak occurred whereby data had to remain in memory as long as a core might need it in its calculations. But the good news is that in their paper (PDF), they showed that for at least several years Linux should be able to keep up with chip enhancements in the multicore realm. To handle multiple cores, Linux keeps a counter of which cores are working on the data. As a core starts to work on a piece of data, Linux increments the number. When the core is done, Linux decrements the number. As the core count approached 48, the amount of actual work decreased and Linux spent more time managing counters. But the team found that 'Slightly rewriting the Linux code so that each core kept a local count, which was only occasionally synchronized with those of the other cores, greatly improved the system's overall performance.' The researchers caution that as the number of cores skyrockets, operating systems will have to be completely redesigned to handle managing these cores and SMP. After reviewing the paper, one researcher is confident Linux will remain viable for five to eight years without need for a major redesign."
I don't know, guess I picked a bad title or something?
Luckily we aren't anywhere near 48 cores and there is some time left to come up with a new Linux (Windows?).
Again, seriously? What does "(Windows?)" even mean? As you pass a certain number of cores, modern operating systems will need to be redesigned to handle extreme SMP. It's going to differ from OS to OS but we won't know about Windows until somebody takes the time to test it.
My work here is dung.
This is exactly why people are doing research on Barrelfish (http://www.barrelfish.org/).
SGI has some awfully big single-system-image linux boxes.
I saw a comment on the kernel mailing list about someone running into problems with 16 terabytes of RAM.
They have a one-off error in their math; it's actually 9 times a 6-core CPU. So, at 42 cores a rewrite is needed.
Not anywhere near 48 cores? Stick 4 AMD Magny Cours Opterons (12 cores each) in a quad socket motherboard and you will have 48 cores. Not that uncommon.
Dunno... I am typing this on a system with 12 cores and 24 virtual cores. And the GPU has somewhere around 1600 cores... Other systems I've worked with have hundreds to thousands of cores so I think we are pretty close...
Seriously though, these issues have been known for a while but will have to trickle down to desktop OSs to deal with caching and shared memory.
Visit Jonesblog and say hello.
640 cores ought to be enough for anybody . . .
We are Dead Stars looking back Up at the Sky
Can somebody please explain what the fuck they are actually talking about? They've dumbed down the terminology to the point I have no idea what they are saying. Is this some kind of cache-related issue? Inefficient bouncing of processes between cores? What?
Just write a little AWK script to replace every occurrence of "48" in the source code by, say, 256 or 1024.
It looks like TFS was written by a Windows fanboy; why mention Linux specifically when it is a general problem? Why try to half-assedly imply that Windows is more advanced than Linux?
Yet Another Tech Blog
(but so much more, including game and movie reviews)
http://yanteb.peasantoid.org
UNIX and C were great in their days. But perhaps not in the meg-core era.
At my last job we had a bunch of Sun T5120s which housed 64 cores. So yeah, we are "anywhere near 48".
Reviewing just the first hour of video games.
...with the other 46 cores we are not using. Most people still do one thing at a time and only need a couple of cores at best. Three if you throw in Windows anti-malware software. :)
Cray seems to have addressed this problem, yes?
I'm still waiting for Windows to work well on ONE.
i was hoping to see Crysis 2 running on Linux
No kidding. SGI's Altix is a huge box full of multi-core IA-64 processors. 512 to 2048 cores is more normal, but they were reaching 10240 last I checked. This is SMP (NUMA of course), not a cluster. I won't say things work just lovely at that level, but it does run.
48 cores is nothing.
Windows 7 supposedly scales up to 256 cores.
See http://channel9.msdn.com/shows/Going+Deep/Mark-Russinovich-Inside-Windows-7/
Perhaps they should worry about getting Flash to work without stuttering before they worry 'bout 48 cores.
...unless the plan is to use 48 cores to make Flash work.
My new HP DL-385 G7 has a 12 core AMD processor. A four fold increase is not far off.
And guess what? With near linear scaling.
These have 512.
These have 256.
Appears to be a Linux problem.
http://channel9.msdn.com/shows/Going+Deep/Mark-Russinovich-Inside-Windows-7/
Windows 7 can scale to 256 processors.
http://xkcd.com/619/
What part?
Be specific, since I've been using it since '95 and it's only gotten better.
Granted, it's a little more bloated than back then, but hey you get that when you have sixteen billion subsystems.
-- This space for lease, low setup fee, inquire within!
Do they have support for smooth full-screen flash video yet?
My Ubuntu 10.04 system still can't play embedded youtube videos. At least Adobe provided a work-around by adding a "play on youtube" option in the right click context menu.
Why aren't we even close to 48 cores? There have been chips with 16 or more cores. Why are we still milling around at 6 cores?
I suspect it's a fab issue, in that so many chips on a die with such small lines means a lot more flawed chips and bad wafers.
Of course, who will rewrite Linux? We could never recapture its unique origins.
The Kruger Dunning explains most post on
The part that requires the user to grow a neckbeard and masturbate to lolicon.
I seem to recall seeing operating systems running on more than 48 cores. In fact, doesn't Linux power some of the giant super computers with 64+ cores?
Why write a new Linux when Solaris already does such a fine job scaling to large numbers of cores/threads? OpenIndiana is just getting off the ground, but it's open source, free, and works now.
I'm not affiliated with Supermicro in any way, but they have four 1U serverboards designed for the 12 core opterons, so that's 48 cores in a 1U server. I'm guessing that Supermicro is not the only vendor of quad opteron boards supporting the latest chips. There are most likely quite a few of these in use by real people. Anyone want to speak up?
I know from personal experience that the socket F opterons performed very poorly in an 8 way configuration compared to the previous generation (socket 940 gen). I ran multiple tests on dual core chips (885s, I think), back in 2006 or 7 where I'd get nearly double the performance in going from a quad configuration to an 8 way configuration, but with the socket F breed of chips, there was no performance boost at all, it was like the clock speed was being cut in half and all the threads took twice as long to complete. I saw this behavior again and again, and the motherboard manufacturer that I was testing the chips with told me that it was an issue with the chips themselves. I think this is the reason why 8-way opteron systems are very rare now.
Nobody's ever going to need more than 640 cores
--
Stay tuned for some shock and awe coming right up after these messages!
We've known about this problem for ... well, as long as we've had more than one core, actually as long as we've had SMP. You increase the number of cores/CPUs, you decrease the available memory throughput per core, which was already the bottleneck anyway. Am I missing something here?
It's amazing how we live in a world involving an infinite, non-discrete numeric system yet the computers we construct are always bound by some finite, discrete limitation.
If a Windows machine had 48 cores, 47 of them would be running viruses, spyware, and anti-virus/anti-spyware software and one would be running the user's applications.
http://lkml.org/lkml/2010/7/22/252 is a fun post on the Linux-Kernel list about missing caching of ACPI tables leading to 20 minute boot times. I get that problem every day! (I wish :P)
It is a pretty safe bet that you don't have to worry about Linux and more than 48 cores, as it is the OS of choice for a lot of the top supercomputers and for OS research in general. Of course, applications that can take advantage of such systems are another problem, but that is hardly a Linux problem.
One SGI Altix version comes with 2,048 cores running a single image.
The benchmarks in the paper are a bit suspicious because they avoid disk I/O. tmpfs is used instead, which may skew results significantly. Surprisingly, they do not describe the architecture of the test machine, but perhaps I've missed that. They suggest that a workload which does not spend much time in the kernel cannot have scaling issues caused by the kernel, which seems rather dubious to me.
http://www-03.ibm.com/systems/info/x/3755m3/
"With up to 48 cores..."
That's scary.
I'm trying to understand the point of this article. Do we really need a new paper to say that centralized memory bandwidth is at some point a limiting problem in an SMP environment? Isn't this why we have NUMA?
If you want to go after Linux internals like the BKL, more power to you, but that horse left the stable a long, long time ago as well.
You could talk about the software problems in dealing with decentralized memory access, synchronization, scalable algorithms, etc., but this is all likely something that needs to be addressed in application space rather than in the kernel, where this paper seems to focus.
There is no shortage of huge single-system-image Linux systems with thousands of processor cores, and not a single one of them uses an SMP architecture. They are all NUMA based (decentralized memory access).
Are there other open-source OSes which are better suited to more parallelism? The Hurd, perhaps?
So, they found scalability problems in some microbenchmarks. Well, some of the scalability paths cited in the paper will be fixed when Nick Piggin's VFS scalability patchset gets merged. But it's not like you need to rewrite every operating system to scale beyond 48 cores. It's just the typical scalability stuff, and the kind of scalability issues found these days are mostly corner cases (Piggin's VFS work being an exception).
What they're saying is basically two things:
First, there's a bottleneck in the on-chip caches. When a core's working on data it needs to have it in its cache. And if two cores are working on the same block of memory (block size being determined by cache line size), they need to keep their copies of the cache synchronized. When you get a lot of cores working on the same block of memory, the overhead of keeping the caches in sync starts to exceed the performance gains from the additional cores. That's not new; we've known that in multi-threaded programming for decades: when you've got a lot of threads dependent on the same data items, the locking overhead's going to be the killer. And we've known the solution for just as long: code to avoid lock contention. The easiest is to make it so you don't have multiple threads (cores) working on the same (non-read-only) memory at the same time; that just requires some thinking on the part of the developers.
Second, you only gain from additional cores if there's workload to spread to them usefully. If you've got 8 threads of execution actually running at any given time, you won't gain from having more than 8 cores. And on modern computers often we don't have more than a few threads actually using CPU time at any given moment. The rest are waiting on something and don't need the CPU and, as long as we aren't thrashing execution contexts too badly, they can be ignored from a performance standpoint. To take advantage of truly large numbers of cores, we need to change the applications themselves to parallelize things more. But often applications aren't inherently multi-threaded. Games, yes. Computation, yes. But your average word processor or spreadsheet? It's 99% waiting on the human at the keyboard. You can do a few things in the background, file auto-save and such, but not enough to take advantage of a large number of cores. The things that really take advantage of lots of cores are things like Web servers where you can assign each request to its own core. And no, browsers don't benefit the same way. On the client side there are so (relatively) few requests and network I/O's so slow relative to CPU speed that you can handle dozens of requests on a single core and still have cycles free assuming you use an efficient I/O model. But it all boils down to the developers actually thinking about parallel programming, and I've noticed a lot of courses of study these days don't go into the brain-bending skull-sweat details of juggling large numbers of threads in parallel.
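To make the "don't have multiple threads working on the same memory" advice concrete, here is a minimal Python sketch. Everything in it (the function names, the chunking scheme, the worker count) is made up for illustration and is not from the paper. Each worker sums only its own slice into a local variable, and the single shared step is combining the per-worker results at the end. In CPython the GIL means this won't actually run faster, so read it as a picture of the structure rather than a benchmark.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(chunk):
    # Each worker reads only its own slice and writes only a local variable,
    # so no two workers ever touch the same mutable data.
    total = 0
    for value in chunk:
        total += value
    return total


def partitioned_sum(values, workers):
    # Split the input into at most `workers` contiguous slices
    # (the last slice may be shorter than the others).
    size = max(1, -(-len(values) // workers))  # ceiling division
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # The only shared step: combine the per-worker results once at the end.
        return sum(pool.map(partial_sum, chunks))


data = list(range(1, 1001))
result = partitioned_sum(data, workers=8)
print(result)
```

The contrast is with a version where every worker adds into one shared total under a lock: same answer, but every single addition would fight over the same piece of memory.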
Luckily we aren't anywhere near 48 cores and there is some time left to come up with a new Linux (Windows?).
Emphasis mine.
Oh please. Don't list an OS that has trouble running on 1 core as a possible solution.
Seven puppies were harmed during the making of this post.
The K42 project at IBM Research investigated the benefit of a complete OS rewrite with scalability to very large SMP systems in mind. This is an open source operating system supporting Linux-compatible API and ABI.
Their target systems, "next generation SMP systems" back in 2003, seem to have become the current generation of SMP/multi-core systems in the meantime.
explains it rather well, imho.
http://www.physorg.com/news205050157.html
"In a multicore system, multiple cores often perform calculations that involve the same chunk of data. As long as the data is still required by some core, it shouldn't be deleted from memory. So when a core begins to work on the data, it ratchets up a counter stored at a central location, and when it finishes its task, it ratchets the counter down. ...
As the number of cores increases, however, tasks that depend on the same data get split up into smaller and smaller chunks. The MIT researchers found that the separate cores were spending so much time ratcheting the counter up and down that they weren't getting nearly enough work done."
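A toy model of the effect described in that quote, in Python. To be clear, the formula and both constants below are my own assumptions, not numbers from the MIT paper: the useful work divides evenly across cores, while each core pays for two serialized updates (one up, one down) of a single central counter.

```python
def toy_runtime(cores, work=1000.0, counter_cost=0.2):
    # Assumed model, not from the paper: useful work splits evenly across
    # cores, but each core adds two serialized updates of one shared counter.
    parallel_part = work / cores
    serialized_counter_part = 2 * counter_cost * cores
    return parallel_part + serialized_counter_part


# Find the core count with the lowest modelled runtime.
best = min(range(1, 129), key=toy_runtime)
print(best, toy_runtime(best))
```

With these made-up constants the runtime bottoms out at 50 cores and gets worse beyond that. The exact location means nothing, since it moves wherever you put the constants. The point is only that a cost that grows with the core count eventually outweighs work that shrinks with it.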
http://xkcd.com/619/. 'Nuff said.
I'm using multiple servers right now (and have been for the past year) with 24 cores (4 x 6 cores running Debian Linux and Windows Server 2008). No performance problems at the moment but thanks for the heads-up.
Intel has 12-core Xeons in the pipeline, and HP (and IBM, etc.) have quad-socket servers... with Hyper-Threading, that's 96 cores presented to the OS.
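The arithmetic behind that figure, as a small Python sketch. The helper name is hypothetical, and the assumption is two hardware threads per core with Hyper-Threading enabled.

```python
import os


def logical_cpus(sockets, cores_per_socket, threads_per_core):
    # Logical CPUs the OS scheduler sees = sockets x cores x hardware threads.
    return sockets * cores_per_socket * threads_per_core


print(logical_cpus(4, 12, 2))  # quad-socket, 12-core, Hyper-Threading on
print(logical_cpus(4, 12, 1))  # the same box with Hyper-Threading off
print(os.cpu_count())          # what the current machine reports (may be None)
```

So the same quad-socket box is 48 physical cores but 96 schedulable CPUs as far as the kernel is concerned.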
Nothing to see here but us trolls...move along...
Tilera Corp. already has CPU architecture with 16-100 cores per chip.
TILE-Gx family
Support for these is already being included in the mainline kernel.
...there is some time left to come up with a new Linux (Windows?).
Windows, the new Linux.
You read it here first...
XKCD:Xeric Knowledge Comically Dispen
Using "cat /proc/cpuinfo" as a benchmark, I can see that my quad core is several times slower with an SMP kernel compared to a non-SMP kernel.
(But a non-proprietary NVIDIA driver will still not play your Flash movies smoothly. :P)
So, I wake up, check out a slashdot article, and lo and behold, I see a bunch of nerds having a nerd war about nerd knowledge. Did someone poison my DNS cache so that slashdot points to wikipedia.org and then reskinned wikipedia to look like slashdot?
Sure as fuck feels that way.
Have any of you participating in this conversation even read the conversation? You should be embarrassed and turn off your machines now if so. I don't know why the fuck we're concerned about how software can't keep up with more than 48 cores when it is clear that our own brains can't keep up with one another in a fluid conversation without devolving into an arrogant discussion about whose OPINION is right. For shame, ./, for shame.
Hey watch it buddy, I resemble that remark!
Check out my lame java blog at www.javachopshop.com
Lets drive the greenhorn OUT! No filthy high UID's with their spelling and gramar and solid well researched non-sensationlist writing. I want my editors to rape the language (bonus points if it is several languages at once) and sent my heart racing by raising my bile and fear of the unknown and known.
Headlines sell adverts. Truth, accuracy, honesty do not. Accept it, you are reading slashdot, it works.
MMO Quests are like orgasms:
You may solo them, I prefer them in a group.
Hate to tell you this but someone's been pulling your leg. Really, you can stop doing that now.
y2k all over again? :))
> may be hitting a peak somewhere in the neighborhood of 48 cores
There's the solution then - more cores. Since 48 is the peak, it must start getting quicker if you throw more at it. QED.
Hopefully the Haiku Project will be in a good place to pick up the slack by then.
http://www.haiku-os.org/
If BeOS had survived this wouldn't be an issue. Cores and threads everywhere! But noooooooo...
Just a bit of linguistic trivia: yaban in Japanese means "barbarian". Made me chuckle.
Cheers,
"What in the name of Fats Waller is that?"
"A four-foot prune."
I watch embedded and full-screen Flash videos all the time on a $400 Acer Aspire laptop. That's with a dual-core Celeron. Hulu, YouTube, Vimeo, on-site or embedded in somebody's blog, internal display or big external monitor, all of them work great under Ubuntu.
My daughter's single-core Atom netbook, on the other hand, does get choppy.
I have 34 systems which have 48 cores already in the server room. These are quad socket systems with 4 AMD 12-core CPU's. So I call BS to the guys who think we have plenty of time, because there are plenty of people deploying these things already.
We were all warned a long time ago that MS products sucked, remember the Magic 8 Ball said, "Outlook not so good"
Solaris (and the defunct OpenSolaris) has the exact same issue when scaling up the cores per CPU. This badly written article was about cache-constrained shared memory usage.
Besides, Solaris doesn't scale as Linux does, despite the hype. Solaris doesn't scale *down* to the PDA level nor *up* to the monster NUMA architectures Linux does.
Oracle imagines they can make a unified IBM or Unisys type mainframe vertical stack with it now. But that won't work as commodity hardware in clusters can run Oracle's main applications faster and more cheaply than any ultrasparc box.
Solaris dying, OpenSolaris is dead.
There is interesting new research coming out of MIT[...]
Who'd thunk of that now?
However, posting your own post in your own post is a bit excessive, and there could have been better ways to do this than just repost your entire freakin story as the first comment.
Yo dawg, I heard you liked my post so I put a post inside my post so you could enjoy it while you're enjoying my post!
My work here is dung.
I let all my friends whitewash the fence for a fee, and most of them paid with apple cores, apart from a dead cat on a string, a blue bottle glass to look through and a kite in good repair. I have more than 48 cores and now this! Well, going to give the whole charade up and become a Pirate in the Spanish Main.
sed -e 's/Chuck Norris/Rajnikant/g' joke > fact
Damn...this is going to seriously limit how many concurrent instances of goatse I can view.
When Linux is run on the 49th cpu in a system, that can be called the hardcore.
Take the cheese to sickbay, the doctor should see it as soon as possible - B'Elanna Torres, "Learning Curve"
That would be why they have been displaying a "thanks for helping make Slashdot great, wanna disable ads?" notice for me, and a lot of others, for years.
That's why Intel is keen on Solaris. It already scales. I'm sure Oracle will manage to put a spanner in the works somehow, though. Then it will be more economical to rewrite Linux.
Stick Men
There's nothing BSD cannot do.
From Wikipedia:
In DragonFly, threads are locked to CPUs by design, and each processor has its own LWKT scheduler. Threads are never preemptively switched from one processor to another; they are only migrated by the passing of an "inter-processor interrupt" (IPI) message between the CPUs involved. Inter-processor thread scheduling is also accomplished by sending asynchronous IPI messages. One advantage to this clean compartmentalization of the threading subsystem is that the processors' on-board caches in SMP systems do not contain duplicated data, allowing for higher performance by giving each processor in the system the ability to use its own cache to store different things to work on.
and from dragonfly bsd:
DragonFly belongs to the same class of operating system as BSD and Linux and is based on the same UNIX ideals and APIs. DragonFly gives the BSD base an opportunity to grow in an entirely different direction from the one taken in the FreeBSD, NetBSD, and OpenBSD series.
The DragonFly project's ultimate goal is to provide native clustering support in the kernel. This involves the creation of a sophisticated cache management framework for filesystem namespaces, file spaces, and VM spaces, which allows heavily interactive programs to run across multiple machines with cache coherency fully guaranteed in all respects. This also involves being able to chop up resources, including the cpu by way of a controlled VM context, for safe assignment to unsecured third-party clusters over the internet (though the security of such clusters itself might be in doubt, the first and most important thing is for systems donating resources to not be made vulnerable through their donation).
These are quad socket systems with 4 AMD 12-core CPU's.
That's not the problem; 48 cores on one chip is the problem.
z/OS on a z196 processor supports up to 80 CPUs per LPAR with up to 1 TB memory per LPAR. HiperDispatch technology alleviates most of the MP effects of dispatching tasks in large CPU configurations. Parallel Sysplex technology provides for the intelligent dispatch of units of work across up to 32 loosely coupled systems. Do the math: 80*32=2560 processors, 32 TB memory. Full fault tolerance for the hardware and the OS. Rolling IPLs of each LPAR allow the rest of the sysplex to keep on doing your critical business work.
Well, Solaris has handled many more than 48 cores for years. Too bad Solaris' future looks grim at best, following Oracle's acquisition of Sun.
Heck, most of us don't even have 640KB RAM.
you had me at #!
Maybe the Cray people (cray.com) can tell the people from kernel dev, slashdot.org or MIT (somebody is wrong!) what to do.
Cray Linux supports 2304 cores!!!!!
I remember reading a statement from Intel affirming they had produced an 80 core processor which they didn't intend to put into market.
Aww.. Is someone upset that they can't figure out how to make WINE run their favorite eroge? Keep fapping on Windows, Linux doesn't want you anyway.
First!
Stupidity is its own reward.
I'd second this. We're already at dozens of cores in a regular server. Even a year ago we had those, so to all those talking about big iron boxes from Sun and others: HP sells them. However, this is all pointless, since the summary seems to have meant cores in a single socket. We're a ways from that in normal day-to-day servers (but maybe not too far ;)
Those who can, do.
Does anyone know how OS X will do with 48+ cores? I know Snow Leopard was supposed to improve the scaling to an extent with Grand Central Dispatch, but I don't think Apple went so far as to test the performance with 48 core machines.
Thank you for an enlightening response. Very interesting reading. One question that popped up in my mind as I read it: does this mean that one or more cores need to be reserved to "manage" the other cores and determine the point at which adding more cores to a "problem" is slowing performance rather than speeding it up? Perhaps the measurement of speed and efficiency needs to become more AI-driven and adaptive on an ongoing basis. Obviously the researchers could determine that at some point adding more cores wasn't improving performance. Could this kind of observation be built into the operating system so the cores are managed better?
Have you fscked your local propeller head today?
If you pull the CPU chip thingie out and lick it with your tongue, you will feel many many more bumps on a CPU that has many cores.. Those bumps are like nipples. The more nipples you can fit in at one time the more milk you can get in one suck!
Just CmdrTaco trolling for ad impressions
From TFA:
“slightly rewriting the Linux code so that each core kept a local count, which was only occasionally synchronized with those of the other cores, greatly improved the system’s overall performance.”
“The fact that that is the major scalability problem suggests that a lot of things already have been fixed. You could imagine much more important things to be problems, and they’re not. You’re down to simple reference counts.” Kaashoek said. “Our claim is not that our fixes are the ones that are going to make Linux more scalable,”
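For anyone wondering what "each core kept a local count, which was only occasionally synchronized" might look like, here is a rough Python sketch. It is not the actual kernel patch: the class name, the threshold, and simulating "cores" as list indices in a single thread are all assumptions made for illustration.

```python
import threading


class SloppyCounter:
    """Toy per-core counter: local counts are folded into a shared total only occasionally."""

    def __init__(self, num_cores, threshold):
        self.threshold = threshold      # hypothetical sync threshold
        self.local = [0] * num_cores    # one private count per simulated core
        self.shared = 0                 # central count, touched rarely
        self.lock = threading.Lock()    # protects only the central count
        self.shared_updates = 0         # how often the central count was touched

    def add(self, core, delta):
        # Common case: only this core's own slot changes and no lock is taken.
        self.local[core] += delta
        if abs(self.local[core]) >= self.threshold:
            self._flush(core)

    def _flush(self, core):
        # Rare case: fold the local count into the central one.
        with self.lock:
            self.shared += self.local[core]
            self.shared_updates += 1
        self.local[core] = 0

    def exact(self):
        # An exact value requires visiting every core's local count.
        with self.lock:
            return self.shared + sum(self.local)


counter = SloppyCounter(num_cores=4, threshold=8)
for i in range(100):
    counter.add(core=i % 4, delta=1)  # each simulated core counts up 25 times
print(counter.exact(), counter.shared_updates)
```

In this run the central count is touched 12 times instead of 100, and the total still comes out exact when you ask for it. The trade-off is that the central value alone is stale by up to the threshold per core, which is presumably fine for a reference count that only matters when it might reach zero.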
http://www.oracle.com/us/products/servers-storage/servers/sparc-enterprise/t-series/sparc-t3-171613.html
I bet the per-core performance of your dual-core Celeron exceeds that of the Atom. Any time I run a Flash-intensive site, it always seems single threaded. A single-core P4 at 3.6 GHz would also exceed your daughter's Atom on performance.
This is weird, because all supercomputers have crazy numbers of CPU cores, and they are mostly running Linux.
e.g.
System: HP Cluster Platform 3000 BL460c
Processor cores: 13728 (Xeon 53xx, 2.66 GHz, InfiniBand)
Performance: 102.8 teraflops Rmax, 146 teraflops Rpeak
Operating system: Linux
Manufacturer: HP
Owner: Swedish state, FRA
"Who'se even got 48 cores?"
Yo! Here. Honkin' powerful servers from Penguin (not so wild about them, but that's who we bought from), Supermicro H8QG6-F motherboard, AMD chips, Opteron 6172... and four of 'em. We're running the current CentOS, 5.5.
mark
That's precisely the problem of distributed caches, and of trying to put threads of the same job on the same physical chip so they share caches.
Well, it may need more work, but ... SGI ran Linux on a very large SMP/NUMA machine years ago. It may be true that this 100+ CPU system only ran Linux for a short time (with only modest work), but it proved internally that ccNUMA and Linux were a viable pair. This was done about 2002. The machine had an internal network name of "stinger". Stinger did make the Top500 list running Irix...
I just compile that part out.
We're already at 48 cores with 64GB RAM, and for under $7200 Canadian.
I am willing to bet that Windows will NOT work with 48 cores. Windows is not an enterprise OS like Linux is. Windows doesn't even support PAE correctly in most versions.
47 it is.