Not All Cores Are Created Equal
joabj writes "Virginia Tech researchers have found that the performance of programs running on multicore processors can vary from server to server, and even from core to core. Factors such as which core handles interrupts, or which cache holds the needed data can change from run to run. Such resources tend to be allocated arbitrarily now. As a result, program execution times can vary up to 10 percent. The good news is that the VT researchers are working on a library that will recognize inefficient behavior and rearrange things in a more timely fashion." Here is the paper, Asymmetric Interactions in Symmetric Multicore Systems: Analysis, Enhancements and Evaluation (PDF).
Anyone who thinks computers are predictably deterministic hasn't used a computer. There are so many bugs in hardware and software that cause it to behave differently than expected, documented, designed. Add to that inevitable manufacturing defects, no matter how microscopic, and it's unimaginable to find otherwise.
It's like discovering "no two toasters toast the same. Researchers found some toasters browned toast up to 10% faster than others."
...programs not designed for multi-core systems don't use them efficiently.
Works fine for me.
"Chinese Amazons, power armor, laser swords.... things just meant to be." - Shampoo, A Very Scary Bet
Oh Great Australia Internet Filter, why has thou abandoned me?
I don't know if Linux or Windows has an automatic mechanism to schedule task priority based on processor caches, but the study didn't even mention Windows. Seeing that the scheduling and managing the caches are OS problems this seems kind of important.
The other thing that seems odd is they were using a 2.6.18 Kernel and in 2.6.23 they added the Completely Fair Scheduler which could potentially change their results. It doesn't seem logical to base a cutting edge study on stuff that was released years ago.
Last I checked, Linux was smart enough to try to keep programs running on cores where cache contained the needed data.
Linux can already deal with scheduling tasks to processors where the necessary resources are "close". It may not be obvious to the likes of PC Magazine, but it's trivially obvious that even multithreaded programs running on a non-location-aware kernel are going to take a hit. This is a kernel problem, not an application library problem.
Anyone who has been doing performance work should have known this. The tools to adjust things like core affinity and where interrupts are handled have been available in Linux and Windows for a long time. These effects were present in 1980s mainframes. DUH.
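For anyone who hasn't touched those knobs, here is a minimal sketch of core affinity from user space. It assumes Linux and Python 3, where os.sched_getaffinity and os.sched_setaffinity are available; the helper name pin_to_cpu is made up for illustration and is not from the paper.

```python
import os

def pin_to_cpu(cpu: int) -> set:
    """Restrict the calling process to one logical CPU and return the new mask."""
    if not hasattr(os, "sched_setaffinity"):
        raise OSError("CPU affinity calls are not available on this platform")
    allowed = os.sched_getaffinity(0)   # CPUs this process may currently run on
    if cpu not in allowed:
        raise ValueError(f"CPU {cpu} is not in the allowed set {sorted(allowed)}")
    os.sched_setaffinity(0, {cpu})      # pid 0 means "the calling process"
    return os.sched_getaffinity(0)

if __name__ == "__main__":
    original = os.sched_getaffinity(0)
    print("pinned to", pin_to_cpu(min(original)))
    os.sched_setaffinity(0, original)   # put the original mask back
```

Whether pinning helps or hurts depends entirely on the workload, so treat this as a way to experiment rather than a recommendation.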
It's just an Insel Intide thing. DAAMIT processors are more predictable. Or not. If you don't use numactl (1) to force socket (and memory) affinity, you get exactly what you ask for (randomly selected sockets, and unpredictable performance)
Here's an exercise: Take 2 brand-new systems with identical configurations and start them at the same time doing some job that takes a few hours and utilizes most of the hardware to some significant degree. Say, compiling some huge piece of code like KDE or OpenOffice. System administrators who do exactly this will tell you that you'll almost never see the two machines complete the job at precisely the same time. Even though the CPU, memory, hard drive, motherboard, and everything else is the same, the system as a whole is so complex that minute differences in timing somewhere compound into larger ones. Sometimes you can even reboot them and repeat the experiment and the results will have reversed. It shouldn't come as a surprise that adding more complexity (in the form of processor cores) would enhance the effect.
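You can see a small-scale version of this on a single box. A rough sketch, assuming a made-up CPU-bound workload (busy_work) and hypothetical helper names: time the same function several times and look at the spread between the fastest and slowest run.

```python
import time

def busy_work(n: int = 200_000) -> int:
    """A stand-in CPU-bound job: sum of squares below n."""
    total = 0
    for i in range(n):
        total += i * i
    return total

def time_runs(func, runs: int = 10) -> list:
    """Run func repeatedly and return the wall-clock time of each run in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return timings

def spread_percent(timings) -> float:
    """Gap between slowest and fastest run, as a percentage of the fastest."""
    fastest = min(timings)
    return (max(timings) - fastest) / fastest * 100.0

if __name__ == "__main__":
    timings = time_runs(busy_work)
    print(f"spread across runs: {spread_percent(timings):.1f}%")
```

The number will differ from run to run and machine to machine, which is rather the point.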
that cutting edge research is done in Virginia.
I agree, and seeing this in the standard C/C++ libraries down the road would be nice. I would say Java would have framework-esque multicore support first, but then again Sun is in trouble and Java is just now getting video and 64-bit support. I don't use .NET enough to know, but it would be interesting to know if .NET has decent native multicore support and if Mono implements it correctly, although this all depends on MSIL versioning/limitations I'm sure.
In a nutshell, we need more portable multicore solutions in order to make better usage of them. Not just for the sake of being cross-platform, but for better documentation, example code, etc.
So and compiler do it for you, performance results are not consistent between runs.
... what a shock....
Wow
What's next? A study that shows if you don't select any optimization parameters a program won't run as effectively as selecting the best ones??
I rarely read replies, it's my opinion and if you thought about your opinion a little more, I'm OK with that.
Last time I read anything about it (which was years ago) the Linux cache-aware scheduling consisted of trying to get tasks scheduled on the same processor as they were scheduled on previously. This works well for a lot of things, but you lose a lot of benefit when multiple simultaneous tasks are working on the same data since those tasks would be spread across the processors to take advantage of concurrency.
This is just an engineering trade off.
I'm not sure why this article isn't tagged "duh".
It's pretty obvious from looking at the CPU graphs of my VMware ESX servers that their code does some optimization to keep processes on the same core, or at the very least on the same CPU.
This data is from a dual-socket quad-core AMD (8 total cores), which means a NUMA architecture, so running the code on the same CPU means you have faster memory access.
So, some commercial code that has been around for nearly 4 years takes advantage of the "discoveries" in an article published this month.
They mentioned this in an ESX class I took. I seem to remember it in the context of setting a processor affinity or creating multi-CPU VMs and how either the hypervisor was smarter than you (eg, don't affinity) or that multi-CPU VMs could actually slow other VMs because the hypervisor would try to keep multi-CPU VMs on the same socket, thus deny execution priority to other VMs (eg, don't assign SMP VMs because you can unless you have the CPU workload).
The problem is a complex one. Every possible scheduling decision has pluses and minuses. For example, keeping a process on the same core for each timeslice maximizes cache hits, but can lose if it means the process has to wait TOO long for its next slice. Likewise, if a process must wait for something, should it yield to another process or busy wait? Should interrupts be balanced over CPUs or should one CPU handle them?
A lot of work has gone into those questions in the Linux scheduler. For all of that, the scheduler only knows so much about a given app and if it takes TOO long to 'think' about it, it negates the benefits of a better decision.
For special cases where you're quite sure you know more than the scheduler about your app, you can use the isolcpus kernel parameter to reserve CPUs to run only the apps you explicitly assign to them.
You can also decide which CPU any given IRQ can be handled by (but not which core within a CPU as far as I know) with /proc/irq/*/smp_affinity.
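The value in smp_affinity is a hexadecimal bitmask, and as far as I understand it each bit stands for one logical CPU as the kernel numbers them, with commas between 32-bit groups on bigger machines. A small sketch of building and decoding such a mask; the function names are hypothetical, and nothing here writes to /proc (that would need root anyway).

```python
def cpus_to_mask(cpus) -> str:
    """Encode CPU numbers as a hex bitmask, comma-separated in 32-bit groups."""
    bits = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU numbers must be non-negative")
        bits |= 1 << cpu                      # bit N set means CPU N is allowed
    digits = format(bits, "x")
    if len(digits) <= 8:
        return digits                         # fits in one 32-bit group
    width = -(-len(digits) // 8) * 8          # round up to a multiple of 8 digits
    digits = digits.zfill(width)
    return ",".join(digits[i:i + 8] for i in range(0, width, 8))

def mask_to_cpus(mask: str) -> list:
    """Decode a hex mask (with or without comma separators) into CPU numbers."""
    bits = int(mask.replace(",", ""), 16)
    return [cpu for cpu in range(bits.bit_length()) if (bits >> cpu) & 1]

if __name__ == "__main__":
    print(cpus_to_mask([0, 1]))   # CPUs 0 and 1
    print(mask_to_cpus("f"))      # the four lowest CPUs
```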
Unless your system is dedicated to a single application and you understand it quite well, the most likely result of screwing with all of that is overall loss of performance.
These go to eleven
Help stamp out iliturcy.
The libraries and the languages currently make threading harder than it needs to be.
How about a "parallel foreach(Thing in Things)" ?
I realize there are locking issues and race conditions, but really I think the languages could go some way toward making things like this more hidden. Oh wait, does that mean I'm advocating for making programming languages more user friendly? I guess so. You know why people use Ruby, C# or Java? Cause those are way more user friendly than C++ or COBOL.
The usability of a programming language matters a lot. Nobody uses threading because the current crop of programming languages makes it complex, confusing, and full of ways to shoot yourself in the foot. Make threading user friendly, and we might see more people create multi-threaded apps.
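For what it's worth, a "parallel foreach" is only a few lines on top of a thread pool. A minimal sketch in Python using the standard concurrent.futures module; parallel_foreach is a made-up name, and in CPython the global interpreter lock means threads mostly help with I/O-bound work, not CPU-bound loops.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_foreach(items, action, max_workers: int = 4) -> None:
    """Call action(item) for every item, spread over a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() drains the result iterator, so an exception raised inside
        # a worker is re-raised here instead of being silently dropped.
        list(pool.map(action, items))

if __name__ == "__main__":
    parallel_foreach(["a", "b", "c"], lambda thing: print(thing, "is cool, but in parallel"))
```

The order in which the actions run is not guaranteed, which is exactly where the locking issues come in.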
We added 4 more cores to perform this "thinking" about which core the process should run on, we should be able to get back that 10% we lost, right?
TFA doesn't seem to specify, but I assume they're referring to Linux. Recent versions of Solaris (and also HP-UX) already have some of this functionality in what they call an "interrupt redistribution daemon".
But you have to think about it too much.
How about:
Things.ParallelEach(function(thing){
    Console.Write("{0} is cool, but in parallel", thing);
    # serious business goes here
});
There are lots of stupid loop structures that are used in desktop apps that are just begging to be run in parallel, but the current crop of languages don't make it braindead easy to do so. Make it so every loop structure has a trivial and non-ugly (unlike OpenMP pragmas) way of doing it.
Also, IMHO, not enough languages do stuff like the Javascript Array.Each(function(element){}). Am I blind, or is this construct missing from C#?
And for those who say "what about all the weird race conditions and stuff". I'm not a computer science major, so I'm jumping off an edge asking this, but what if we actually use some of this new CPU power in our IDEs and our JIT compilers, couldn't our languages watch out for most of the nasty ways we can shoot ourselves in the foot? Like if I do an Array.ThreadedEach(function(element){}) and I'm changing some shared data, couldn't the compiler or IDE let me know at compile time or while I'm writing the code? Obviously you'd need a strongly typed language like C# or Java to pull such stunts, you couldn't do it in perl :-)...
The goal is to make this threaded stuff usable. I think we can do it.
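To make the "changing some shared data" case concrete, here is a small sketch of the hazard and the usual manual fix. It is written in Python with hypothetical names (Counter, threaded_sum); nothing in it is detected at compile time, the lock is simply put there by hand around the read-modify-write.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class Counter:
    """Shared state that several worker threads update."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def add(self, amount: int) -> None:
        # "value += amount" is a read, an add and a write; the lock ensures
        # only one thread at a time performs that sequence.
        with self._lock:
            self.value += amount

def threaded_sum(numbers, max_workers: int = 8) -> int:
    """Add every number into one shared Counter from a pool of threads."""
    counter = Counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(counter.add, numbers))
    return counter.value

if __name__ == "__main__":
    print(threaded_sum(range(1000)))
```

A tool that flagged the unlocked version of add() for you is the kind of help being asked for here.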
Why is this article labeled as hardware? Sure they talk about different procs being ... well, different. Duh! The article is about the software Tom and others developed to run processes more efficiently in a multi-core (and possibly heterogeneous) environment. Big energy savings as well as performance boost. Green computing. HELLO! Did you read page two?
Can't see what the big news is. Any single-socket multicore system would look like a simple SMP, and a multi-socket one would have some NUMA characteristics. So affinity scheduling, locality- and behavior-aware memory allocation and some interrupt fencing should create a deterministic behavior :)
Guess he should try OpenSolaris, been there, tried that and so forth :)
And OpenMP isn't "standard" as far as I'm concerned. Plus it makes you think about threading and it only works in low-level languages like C.
I'm talking about this highly useful code (which is written in a bastardized version of C#, Perl and Javascript for your reading pleasure):
List pimpScores = PimpList.ThreadedMap(function(aPimp){
    # score how worthy this guy is at pimpin'
    if (aPimp.Hoes > 10) {
        return String.Format("Damn brother, {0} is a player", aPimp.PimpName);
    } else if (aPimp.Hoes > 0) {
        return String.Format("{0} is a small time player", aPimp.PimpName);
    } else {
        return String.Format("{0} isn't a player at all!", aPimp.PimpName);
    }
});
Look how easy it was to turn a transform like Map into something threaded (even though C# doesn't have Map... I forget what LINQ method does the same transform)
OpenMP doesn't offer anything as intuitive as that. It makes you think long and hard about threading in a dull, dry manner. Threading is everywhere in our code if the program language makes it obvious and easy.
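Something close to that ThreadedMap can be sketched in real code today. The version below is Python, not C#, and the names (Player, describe, threaded_map) are invented for illustration; it keeps the same three-way scoring and returns results in input order. As with any CPython thread pool, the win is mostly for I/O-bound work.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class Player:
    name: str
    count: int   # stand-in for the score field in the pseudocode above

def describe(p: Player) -> str:
    """Same three-way classification as the pseudocode: >10, >0, otherwise."""
    if p.count > 10:
        return f"{p.name} is a player"
    elif p.count > 0:
        return f"{p.name} is a small time player"
    return f"{p.name} isn't a player at all!"

def threaded_map(func, items, max_workers: int = 4) -> list:
    """Apply func to each item on a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(func, items))

if __name__ == "__main__":
    for line in threaded_map(describe, [Player("A", 11), Player("B", 3), Player("C", 0)]):
        print(line)
```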
In the future, we'll probably have hundred- or thousand-core CPUs and we can dedicate 10% of them to "thinking" about how to use the remaining 90%.
Often, an issue presents that isn't reproducible in the presence of a tech support person who knows what he's doing.
Sometimes it's a user error they don't want to admit, and so they won't reproduce it in front of somebody who knows they should not have done that.
Sometimes it's just a glitch. Regardless, the best thing to do is smile and say "The bug must be afraid of me" and close the ticket.
Keep in mind that a paper presented at a conference is submitted many months before it is published/presented. Not sure when the deadline for the conference was, but I suspect the completely fair scheduler was not available at the time. (Double or triple the sentiment for publication in a journal.)
AC
Running server to server, duo core to core - I interrupt y'all, and bless the Tech for the quad.
Yes, I realise Eric B. & Rakim references are entirely wasted on /.
. . .the tag "bang news" on a story involving researchers from Virginia Tech?
My sister opened a computer store in Hawaii. She sells C shells by the seashore.
How is this research? You will find much higher quality research on implementing support for SMP for network stacks (that is YEARS OLD, look at the work of Alan Cox and his students) and a plethora of papers on scheduling on SMP among many other things. ANL-supported research has really gone down the drain.
These guys are just stating the obvious with very ambiguous unscientific benchmarks and faulty metrics (no analysis of PMCs, etc...). It is surprising that these guys even hit slashdot, publicizing bad research only helps bad research continue.
One experiment went for a long time, and in the end when he analyzed the AI generated code, there were 5 paths/circuits inside that did nothing. If he disabled any or all of the 5 the overall design failed. Somehow, the AI found that creating these do nothing loops/circuits caused a favorable behavior in other parts of the FPGA that made for overall success.
The author took the unusual step of disconnecting the clock for the FPGA, taking advantage of undefined behavior that depended on the unique electrical characteristics of the FPGA he used. Had he left the clock connected he'd likely have more portable results; however, he may not have arrived at the same results since he'd be depending on discrete logic and not the unspecified, non-linear analog behavior.
Honestly, this stuff has been known in the HPC world for decades. What's interesting is that these troublesome bits are going to hit system-level and lower-level language programmers on everyday tasks. It's not clear to me how this stuff will affect higher-level programming, interpreted code, etc. It will almost certainly be a factor but I'm not sure there's much the programmer can do about it.
Some of the fun things we have to look forward to at the commodity level:
These (and others) all fall into the general category of "induced load imbalance." They are things the programmer doesn't directly think about; things that happen as a result of system services, CPU architecture and stuff generally out of the control of the application programmer. This is all in addition to the stuff the programmer does have control over such as data layout and the amount of work given to each thread.
Induced load imbalance is the primary reason that scaling to manycore is difficult. It requires a lot of OS work to reduce "OS jitter" to a level that is acceptable when running thousands of threads.
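A back-of-the-envelope way to see why jitter hurts more as thread counts grow, under a deliberately simple assumption that is mine and not from the article: each thread independently gets delayed with probability p during a step, and the step ends at a barrier, so one delayed thread stalls everyone. The chance that at least one of N threads is delayed is then 1 - (1 - p)^N.

```python
def prob_any_delayed(p: float, threads: int) -> float:
    """Probability that at least one of `threads` independent threads is delayed."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    return 1.0 - (1.0 - p) ** threads

if __name__ == "__main__":
    # p = 0.001 is an arbitrary illustrative value, not a measurement.
    for n in (1, 8, 64, 1000):
        print(n, "threads:", round(prob_any_delayed(0.001, n), 3))
```

With a one-in-a-thousand chance per thread, a thousand-thread step gets hit well over half the time in this toy model, which is why the per-node noise has to be driven so low.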
Here's an article on some of the scaling work HPC vendors have done with Linux.
...I had with an Asterisk VOIP server. Under certain conditions, calls transferred from one of two receptionists' phones were bouncing back and ending up at the wrong voicemail. Since only two phones had a problem I suspected it was something specific to these phones. After checking the configuration and even hardware on the phones, I checked the server. I narrowed the problem down to one macro (a macro in Asterisk is basically a user-defined function) that allows a "fallback line" to ring if the first is busy; it seemed to be getting an argument for this line when there should have been none. Soon it became evident that the variable was changing "mid-macro", apparently out of nowhere (there are variables with special names that are used in macros to receive arguments, nowhere was this variable changed, and the macro's less than 30 lines long). I eventually got so frustrated I put debugging lines in between every single line of the macro to make it print the variables to the output log. Then I narrowed it down to one line, the one where a Dial() command is executed (this is the function that actually places the call; it isn't supposed to even be able to change anything in the macro that called it, and there are no other problems like this). Now that had me totally stumped. I could demonstrate exactly what was happening but I couldn't figure out why. Stranger still, the results changed slightly with the debugging lines in place, as if it were a race condition of some sort.
The problem still exists to this day :(
"When information is power, privacy is freedom" - Jah-Wren Ryel