Slashdot Mirror


Not All Cores Are Created Equal

joabj writes "Virginia Tech researchers have found that the performance of programs running on multicore processors can vary from server to server, and even from core to core. Factors such as which core handles interrupts, or which cache holds the needed data can change from run to run. Such resources tend to be allocated arbitrarily now. As a result, program execution times can vary up to 10 percent. The good news is that the VT researchers are working on a library that will recognize inefficient behavior and rearrange things in a more timely fashion." Here is the paper, Asymmetric Interactions in Symmetric Multicore Systems: Analysis, Enhancements and Evaluation (PDF).

183 comments

  1. unsurprising. by Anonymous Coward · · Score: 5, Interesting

    Anyone who thinks computers are predictably deterministic hasn't used a computer. There are so many bugs in hardware and software that cause it to behave differently than expected, documented, designed. Add to that inevitable manufacturing defects, no matter how microscopic, and it's unimaginable to find otherwise.

    It's like discovering "no two toasters toast the same. Researchers found some toasters browned toast up to 10% faster than others."

    1. Re:unsurprising. by Rod+Beauvex · · Score: 5, Funny

      It's those turny knobs. They lie.

    2. Re:unsurprising. by symbolset · · Score: 5, Funny

      You have to buy the one that goes to 11. You know how 10 makes the toast almost totally black? Well, what if you want your toast just a little bit more crispy? What if you want just that little bit more? That's what 11 is for. Those other toasters only go to 10, but this one goes to 11.

      --
      Help stamp out iliturcy.
    3. Re:unsurprising. by MightyYar · · Score: 4, Funny

      I had a Pentium that DEFINITELY went to 11.

      --
      W..w..W - Willy Waterloo washes Warren Wiggins who is washing Waldo Woo.
    4. Re:unsurprising. by ElectricTurtle · · Score: 2, Insightful

      Mod parent to 5, seriously, it's so true. There are more than a few times after working support for a decade when I've had to say, 'that should be impossible' but a symptom nonetheless exists.

      --
      I support the Slashcott and will not be reading or commenting from 2/10/14 to 2/17/14. Beta is steaming pile of dog shit
    5. Re:unsurprising. by $RANDOMLUSER · · Score: 2, Interesting

      I remember HP-UX on PA-RISC from at least ten years ago making efforts to reassign a swapped out process to the processor that it had been running on before it was swapped out, on the notion that some code and data might still be in the cache. SMP makes for some interesting OS problems.

      --
      No folly is more costly than the folly of intolerant idealism. - Winston Churchill
    6. Re:unsurprising. by RuBLed · · Score: 5, Funny

      mine only went up to 10.998799799

    7. Re:unsurprising. by Fluffeh · · Score: 1

      Was it one of those PII Celeron 300A's that just ran and ran and ran even if you pushed them up from 300 mhz to 4509 mhz?

      Those things were HAWT!

      --
      Moved to http://soylentnews.org/. You are invited to join us too!
    8. Re:unsurprising. by Anonymous Coward · · Score: 1

      You have to buy the one that goes to 11. You know how 10 makes the toast almost totally black? Well, what if you want your toast just a little bit more crispy?

      It's like how much more black could this toast be? And the answer is none. None more black.

    9. Re:unsurprising. by ClosedSource · · Score: 1

      Actually, the PC was designed to be non-deterministic. No software bugs, hardware bugs or manufacturing defects needed.

      On the other hand, many early home computers were quite deterministic. In fact the Atari 2600 game machine was deterministic down to a single CPU cycle. Many 2600 games would not have worked if it were otherwise.

    10. Re:unsurprising. by TapeCutter · · Score: 1

      It's like discovering "no two toasters toast the same. Researchers found some toasters browned toast up to 10% faster than others."

      What we need is a toaster with an IQ of around 4000.

      --
      And did you exchange a walk on part in the war for a lead role in a cage? - Pink Floyd.
    11. Re:unsurprising. by Kent+Recal · · Score: 1

      Hell yeah, that one was a bargain.
      I had mine clocked at 400MHz and iirc saved about $200 over an equivalent "real" PII.

    12. Re:unsurprising. by symbolset · · Score: 0, Redundant

      This one goes to 11.

      --
      Help stamp out iliturcy.
    13. Re:unsurprising. by aaron+alderman · · Score: 1

      So you had a Pentium 3?

    14. Re:unsurprising. by Anonymous Coward · · Score: 2, Funny

      The review for "Not All Cores Are Created Equal" was merely a two word review which simply read "Shit Sandwich".

    15. Re:unsurprising. by aaron+alderman · · Score: 1

      I prefer to brown bread myself.
      As a physicist I don't see why computers aren't deterministic. After all, you just start with a spherically symmetric computer...

    16. Re:unsurprising. by aaron+alderman · · Score: 5, Interesting

      Impossible like "xor eax, eax" returning a non-zero value and crashing windows?

    17. Re:unsurprising. by aaron+alderman · · Score: 1

      Are you sure about that?

    18. Re:unsurprising. by $RANDOMLUSER · · Score: 5, Funny

      Moral of the story: There's a lot of overclocking out there, and it makes Windows look bad.

      Oh. So that's what's been doing it.

      --
      No folly is more costly than the folly of intolerant idealism. - Winston Churchill
    19. Re:unsurprising. by zappepcs · · Score: 5, Interesting

      Actually, (sorry no link) there was a researcher that was using FPGAs and AI code to create simple circuits, but the goal was to have the AI design it. What he found is that due to minor manufacturing defects, the code that was built by AI was dependent on the FPGA it was tested on and would not work on just any FPGA of that specification. After 600 iterations, you'd think it would be good. One experiment went for a long time, and in the end when he analyzed the AI generated code, there were 5 paths/circuits inside that did nothing. If he disabled any or all of the 5 the overall design failed. Somehow, the AI found that creating these do-nothing loops/circuits caused a favorable behavior in other parts of the FPGA that made for overall success. Naturally that code would not work on any other FPGA of the specified type. It was an interesting read, sorry that I don't have a link.

    20. Re:unsurprising. by cratermoon · · Score: 0, Redundant

      It's like, how much more black could this be, and the answer is none. None more black.

    21. Re:unsurprising. by Anonymous Coward · · Score: 0

      What we need is a toaster with an IQ of around 4000.

      Who the smeg would want that?

    22. Re:unsurprising. by Anthony_Cargile · · Score: 1

      Very interesting story, wish I had some mod points right now :). I think I found a new blog to subscribe to, only this one has a purpose!

      Oh, and mod the comment above me up as well - that was just funny.

    23. Re:unsurprising. by Majik+Sheff · · Score: 2, Informative

      Processor affinity is still a nasty corner of OS design. It was one of the outstanding issues with the BeOS kernel that was not resolved before the company tanked.

      --
      Women are like electronics: you don't know how damaged they are until you try to turn them on.
    24. Re:unsurprising. by kimvette · · Score: 1

      I had an Abit motherboard (VP6) that went to 11. Unfortunately it ended with a little fireworks show. :( Stupid bad caps, lousy Abit QC.

      --
      The Christian Right is Neither (Christian nor right). See: Matthew 23, Matthew 25, Ezekiel 16:48-50
    25. Re:unsurprising. by Anonymous Coward · · Score: 0

      525 was the top I could get mine air cooled and still be stable. Ran for two years before I clocked it down to 450 and sold it. About 2 years later, I bought it back, reclocked it to 525, and used it for a closet server.

      In-de-structable

    26. Re:unsurprising. by Anonymous Coward · · Score: 2, Funny

      Wow, a joke from 1995. It's true, Slashdot is at the forefront of cutting-edge humor.

    27. Re:unsurprising. by paulgrant · · Score: 2, Insightful

      Damn it, get one!
      At least a name for Christ's sake!

    28. Re:unsurprising. by bm_luethke · · Score: 1

      You know, there have been a few cases of trying to work with some Open Source software that I find the following bit of logic in there:

      If (1){
      do stuff
      }
      more stuff

      (well, other than any syntax errors - being dyslexic if I write two lines without them then I'm doing good)

      And I never could figure out why the whole "if(1)". I always left it in the code because I figured someone somewhere had a reason and who am I to change it? I recall hearing Donald Becker rant about people taking "worthless" code out of his drivers and it being for some specific architecture. Though in this case I have always thought that someone was too lazy to change it initially (after all you had to find the other "}") and everyone else after them had the same idea I did.

      Now I know for sure - some AI someplace added in some code that no one else understands and must stay in under their own little world. But then I guess that is something along the lines of Becker's complaint that it didn't hurt other hardware yet was required for some specific vendor.

      I'm loath to change working code, even when it has something like the above.

      --
      ------- Sorry about the spelling, I suffer from two problems. Dyslexia makes it difficult to spell well, lazy makes it
    29. Re:unsurprising. by johnw · · Score: 3, Informative

      A simple Google search for "fpga genetic algorithm" shows up references quite quickly - e.g.

      http://biology.kenyon.edu/slonc/bio3/AI/GEN_ALGO/gen_algo.html

      The only part of the GP story I haven't seen before (and can't find a reference for) is the bit about the design not working on other FPGAs of the same specification. The closest story is that of Adrian Thompson at the University of Sussex who got a circuit with unconnected elements which nonetheless seem to be needed in order for the whole thing to achieve its goal. Nothing about the design only working on specific instances of the FPGA.

    30. Re:unsurprising. by Detritus · · Score: 1

      Sometimes you see stuff like that due to compiler bugs. The ugly code is a way of not triggering the bug. Simplify it at your peril.

      --
      Mea navis aericumbens anguillis abundat
    31. Re:unsurprising. by zappepcs · · Score: 1

      This article mentions how it will work only on the FPGA it was developed on.

    32. Re:unsurprising. by sowth · · Score: 3, Insightful

      They probably put in the if(1) lines because they were testing various aspects of the program, or maybe some like to turn off various aspects of the program, but don't want to be arsed to write the proper code to select options. I commonly do that in POVray (3d raytracing) scripts when testing, so I don't have to wait for long renders--fog, radiosity, lots of light and such take orders of magnitude more time.

      As for the AI adding crap, it is probably more trying random code than truly thinking about how the code should work. This leads to the useful code intertwined with lots of crap code. Unfortunately, there are programmers who write like this too... (cue funny mod)

      As for the code not working on other FPGAs, maybe the researcher should not use real chips to check the iterations. A simulated one which conforms to the spec exactly and upon where quirks and such are expected, dies or sends a signal back to the AI program. Testing after the fact on real chips to verify the AI didn't exploit bugs in the simulator would be more proper procedure.

      Maybe I have too much of a background in theory, but I am not completely sure why the FPGAs would be so different. Is it race time conditions? Or is the FPGA being used in some analog way? Or does the circuit depend on the exact timing of some input, so the speed / capacitance of each component make a huge difference? Or was the poster talking about FPGAs with different specs?

      Crazy things happen when you enter the real world. I remember back when I was in electronics assembly. One would first assume all the solder would wick onto the metal, but the boards would always have tonnes of solder bridges, and we had to carefully examine every component and correct them. Friggin' microprocessors had countless tiny legs too!

    33. Re:unsurprising. by Mr2cents · · Score: 2, Interesting

      There is a very interesting channel on youtube called googletechtalks. There, you can find a lecture called "We have it easy, but do we have it right" about performance measures that really made me worry. Basically you can't just easily compare performance by measuring the cpu time, because there are a lot of factors that determine performance. E.g.: adding an environment variable before running a program can cause page alignments to change (even if the environment variable isn't used by the program), changing the performance dramatically in some cases. Same goes for changing the link order: performance can change by 20%.

      So much for determinism.

      http://www.youtube.com/watch?v=DKVRkfXrBpg
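
The point about CPU time being an unreliable yardstick can be illustrated by timing the same code repeatedly and looking at the spread rather than a single number. This is a minimal sketch, not the method from the lecture: `measure` and `workload` are hypothetical names, and the workload is only a stand-in for a real program.

```python
import statistics
import time


def measure(func, runs=30):
    """Time func() `runs` times and summarize the spread of wall-clock times."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    fastest, slowest = min(samples), max(samples)
    return {
        "min": fastest,
        "median": statistics.median(samples),
        "max": slowest,
        # Relative spread: how much slower the worst run was than the best.
        "spread": (slowest - fastest) / fastest if fastest > 0 else float("inf"),
    }


def workload():
    # Hypothetical stand-in for the program being benchmarked.
    return sum(i * i for i in range(100_000))


if __name__ == "__main__":
    stats = measure(workload)
    print(f"min {stats['min']:.6f}s  median {stats['median']:.6f}s  "
          f"max {stats['max']:.6f}s  spread {stats['spread']:.1%}")
```

Reporting the minimum, median and spread together makes it harder to mistake a layout or scheduling accident for a real speedup. The numbers will differ from machine to machine and from run to run.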

      --
      "It's too bad that stupidity isn't painful." - Anton LaVey
    34. Re:unsurprising. by Anonymous Coward · · Score: 0

      At 11 it's like, how much more black could this toast be? And the answer is: none more black.

    35. Re:unsurprising. by EdibleEchidna · · Score: 1

      I remember reading about that when I was doing a PhD in Genetic Algorithms. I don't remember the reference, it might have been in: Goldberg, David E (1989), Genetic Algorithms in Search, Optimization and Machine Learning, Kluwer Academic Publishers, Boston, MA.

    36. Re:unsurprising. by raynet · · Score: 4, Funny

      I am sure you mean to say; Wow, a joke from 1994.995994999.

      --
      - Raynet --> .
    37. Re:unsurprising. by lloydchristmas759 · · Score: 0

      I'd say that computers are deterministic at the chip/instruction level, but stochastic at the system level.
      It's like newtonian vs quantum mechanics... but upside down...

      --
      I'd give my right arm to be ambidextrous.
    38. Re:unsurprising. by Rakshasa+Taisab · · Score: 2, Funny

      That joke is so badly done it's not even funny.

      1994.995994999

      If you look carefully at this number, it's clearly one constructed by a human. The first '5' might be random, but the following digits do not have any specific reason to be weighted towards higher digits!!!

      Thus, a more realistic semi-random number would be:

      1994.995974983

      --
      - These characters were randomly selected.
    39. Re:unsurprising. by raynet · · Score: 2, Funny

      Actually, your number looks more like a random number string written by a human, as humans try to avoid using long chains of the same digit when writing random numbers. But you are right, my number was made by randomly punching multiple number keys on my keyboard and those happened to register. I did then edit it so that the first digit after the dot was 9.

      --
      - Raynet --> .
    40. Re:unsurprising. by oPless · · Score: 1

      Sounds like John Koza ( http://en.wikipedia.org/wiki/John_Koza ) or someone following his research

    41. Re:unsurprising. by ByteSlicer · · Score: 1

      Might be for testing. I sometimes use "if (false) { }" in Java to disable code parts in such a way that the code is still compiled (and so that Eclipse doesn't remove imports for this code).
      Otherwise, might be to limit the scope of local variables. A bare block "{ }" does the same, but might feel too awkward to some people. Might be useful to reclaim memory as soon as possible in long methods, although putting the code in a separate method would be better probably.

    42. Re:unsurprising. by Anonymous Coward · · Score: 0

      I think that was Adrian Thompson at the University of Sussex, I remember reading the same article.

      http://www.cogs.susx.ac.uk/users/adrianth/ade.html

    43. Re:unsurprising. by Anonymous Coward · · Score: 0

      yeah, if you put a thousand monkeys on a typewriter they will eventually produce works of Shakespeare...

    44. Re:unsurprising. by Frozen+Void · · Score: 1

      It's for debugging. The code works on release versions (if (1)), but for debugging people need the ability to turn on/off certain parts of code (if (!1)).

    45. Re:unsurprising. by Anonymous Coward · · Score: 0

      That was bold.

    46. Re:unsurprising. by Anonymous Coward · · Score: 0

      there it is:

      http://www.informatics.sussex.ac.uk/users/adrianth/ices96/node5.html

    47. Re:unsurprising. by Sweetshark · · Score: 1

      You know, there have been a few cases of trying to work with some Open Source software that I find the following bit of logic in there:

      If (1){ do stuff } more stuff

      I was confused when I first saw a

      do { do stuff } while(false); more stuff

      until I found out this was an obfuscated goto, because there were break; or continue; statements in the "loop".

      Your ifs might be a workaround for compiler bugs (for example a compiler supporting variable scopes only in "real blocks" or something like that).

    48. Re:unsurprising. by TheRaven64 · · Score: 2, Interesting

      Yup, I found some interesting effects of this when doing my PhD. I tweaked my supervisor's code to add an abstraction layer in the middle before making changes, and found that this actually made things faster, even though it was doing more work (it was only meant to make things faster when I wrote something else on the other side of the abstraction layer). It was an entirely deterministic improvement though, even with different data sets, so most likely due to better instruction cache layout with the new code.

      --
      I am TheRaven on Soylent News
    49. Re:unsurprising. by ckaminski · · Score: 1

      FWIW, Windows NT, 3.5 I think, had a huge problem with process migration that killed performance.

    50. Re:unsurprising. by Anonymous Coward · · Score: 0

      "That joke is so badly done it's not even funny."

      That didn't stop you.

    51. Re:unsurprising. by Anonymous Coward · · Score: 0

      can't find the link, but iirc that was a hypothetical story... not a real experiment

    52. Re:unsurprising. by TheRaven64 · · Score: 2, Interesting

      Processor affinity is even harder on modern CPUs. You often have 2 or so contexts sharing execution units and L1 cache in a core, then a few cores sharing L2 cache in a chip. Deciding whether to move a process is tricky. There's a penalty for moving, because you increase the cache misses proportionally to the distance you move it (if you move it to a context that shares the same L1 cache, it's not as bad as if you move it to one that shares only the L2 cache, for example), but there's also a cost for not moving it if a lot of processes on a single context are suddenly doing a lot of work while those on another core are idle.

      Cache isn't the only problem though - with something like the AMD architecture, each core has its own memory, so if you allocate memory on one RAM chip then migrate the process to a different one then you end up with memory accesses being slower (and slowing down accesses on the other chip, since its memory controller is having to interleave remote requests with local ones).
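
One way to experiment with the migration cost described above is to pin a process to a single CPU and compare timings against an unpinned run. This is a minimal sketch that assumes a platform where Python exposes `os.sched_setaffinity` (Linux, for example). The helper name `pinned_to_one_cpu` is hypothetical, and choosing the lowest-numbered allowed CPU is an arbitrary choice for illustration.

```python
import os
from contextlib import contextmanager


@contextmanager
def pinned_to_one_cpu():
    """Pin the current process to one CPU for the duration of the block.

    Yields the chosen CPU id, or None if pinning is unavailable or refused.
    """
    if not hasattr(os, "sched_setaffinity"):
        # Platform does not expose the affinity calls (e.g. macOS, Windows).
        yield None
        return
    original = os.sched_getaffinity(0)  # pid 0 means the calling process
    cpu = min(original)                 # arbitrary choice among allowed CPUs
    try:
        os.sched_setaffinity(0, {cpu})
    except OSError:
        yield None
        return
    try:
        yield cpu
    finally:
        # Restore the original mask so later code is scheduled normally.
        os.sched_setaffinity(0, original)


if __name__ == "__main__":
    with pinned_to_one_cpu() as cpu:
        print("pinned to CPU:", cpu)
```

Pinning removes migration as a variable, but it also takes away the scheduler's ability to move the process off a busy core, which is the trade-off the comment describes.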

      --
      I am TheRaven on Soylent News
    53. Re:unsurprising. by Anonymous Coward · · Score: 1, Insightful

      Yet the probability of your random number being generated is the EXACT SAME as the probability of his random number being generated.

    54. Re:unsurprising. by Anonymous Coward · · Score: 0

      That would be the evolvable hardware paper by Adrian Thompson.

    55. Re:unsurprising. by Anonymous Coward · · Score: 0

      There is an interesting general talk about stream computing which really shows the effects of sufficient locality of reference. From Stanford EE380:
      http://www.youtube.com/watch?v=8x7OqjUNbyo

    56. Re:unsurprising. by Anonymous Coward · · Score: 0

      Everything you described is deterministic. In fact you describe the situation explicitly as being deterministic: "there are a lot of factors that determine performance". So where does the leap to "so much for determinism" come from?

    57. Re:unsurprising. by PitaBred · · Score: 1

      Well, yeah. But his LOOKS more random because we have an implicit assumption that randomness will make things different, rather than select the same thing every time. We're hard-wired as humans to recognize patterns ;)

    58. Re:unsurprising. by Beardo+the+Bearded · · Score: 1

      That's entirely incorrect.

      Computers are predictably deterministic -- the problem is that the number of variables used is neither known nor accounted for.

      Most code is crap, because most code isn't important. The stuff that is important is written to specific acceptable levels of error. The problem is when you get alphabet-soup diploma holders getting a little experience at a random startup then going off to write vital code. Then you get problems because you continue bad practices. The venerable K&R C bible has a code snippet that's held up as a good example but is responsible for millions of unpatchable bugs. Ask your average coder how to compensate for setting a value that takes a few milliseconds to settle - most of the time, they'll say "delay". Now add multithreading into the mix, and most programmers are out of their element.

      Hardware is mostly crap because making stuff that's perfect is hard work, nearing impossible. When a manufacturer makes a batch of electronics, they do their best to make them all to the highest quality level -- Military Grade. The batches that fail those quality tests get thrown into the Industrial bin. The failures there get thrown into Automotive. The final rejects, the stuff that's still perfectly good for who it's for but has failed at least three quality tests, is put into the Consumer grade bin. That's true for everything from the ubiquitous 5% resistor to a PLC to a quad-core Xeon.

      So what you have is generally badly written software running on rejected hardware by untrained users who are unfamiliar with the system. Once you take all those factors into account, then you get a perfectly predictable system.

      --

      ---
      ECHELON is a government program to find words like bomb, jihad, plutonium, assassinate, and anarchy.
    59. Re:unsurprising. by ArsonSmith · · Score: 1

      Yes but the probability of a string of random numbers looking similar to the first is far less than that of a string of numbers looking similar to the second. Just like it may be the same chance to have 9999999 as to have 8675309, one is very consistent and can be seen to have a pattern while the other may not. Unless you're Jenny anyway.

      --
      Paying taxes to buy civilization is like paying a hooker to buy love.
    60. Re:unsurprising. by wassabison · · Score: 1

      This is completely incorrect. The probabilities are exactly the same. There is no way to judge whether the number exhibits a random distribution with such a small sample. What you can do is take a very large set of randomly generated numbers and calculate if the distribution is unlikely. So, you should ask him to give you 100,000 more numbers to test his randomness.

    61. Re:unsurprising. by w0mprat · · Score: 1

      The underlying assumption made at every point in hardware and software development is that computers are deterministic.

      --
      After logging in slashdot still does not take you back to the page you were on. It's been that way for 20 years.
    62. Re:unsurprising. by sorak · · Score: 1

      Moral of the story: There's a lot of overclocking out there, and it makes Windows look bad.

      Oh. So that's what's been doing it.

      Yeah, Vista says my proc should actually be a vacuum tube.

    63. Re:unsurprising. by Ant+P. · · Score: 2, Insightful

      If overclocking is the cause of so many of these problems, why hasn't Intel or AMD got a mechanism to tell the OS that the hardware's being run out of spec? The blame for these crashes should be directed where it belongs - with the -funroll-loops ricers.

    64. Re:unsurprising. by Kazoo+the+Clown · · Score: 1

      As for the code not working on other FPGAs, maybe the researcher should not use real chips to check the iterations. A simulated one which conforms to the spec exactly and upon where quirks and such are expected, dies or sends a signal back to the AI program. Testing after the fact on real chips to verify the AI didn't exploit bugs in the simulator would be more proper procedure.

      The point was to have it take into account the physical characteristics of the device, where things such as capacitance between the physical segments could actually be utilized to their advantage. A simulator probably wouldn't provide this level of accuracy. Testing against an array of devices running at different temperatures, etc., could help make the design more robust however. These problems are specifically addressed by Adrian Thompson in his papers on the subject.

    65. Re:unsurprising. by Lost+Race · · Score: 2, Funny

      ... spherical frictionless inelastic computer at 0 Kelvin ...

    66. Re:unsurprising. by tkw954 · · Score: 1

      Yeah, I have some code that runs faster in VMWare than it does natively. I didn't want to look very hard at it in case it stopped working.

    67. Re:unsurprising. by Klintus+Fang · · Score: 1

      Bugs aren't even required to explain the non-determinism. Even if there were no bugs, systems as complex as a modern computer (even for a single core computer) would be non-deterministic. The only way you could have complete determinism on the timings is to have precise control of exactly which cycle count every operation occurred on. That would include precise control over the precise cycle count at which the BIOS began and then completed the bring up of each component in the system. And precise control over exactly how many cycles it took to boot the OS once the post was complete, etc. Unfortunately though, the components in the system are running at different frequencies and even the components running at the same frequency in the same package will not precisely agree on "when" each clock cycle begins and ends. One of them may have the rising edge of its clock occurring at +/- a few picoseconds relative to the other (assuming GHz frequencies...) and you'll never be able to control which is which unless you had precise control of the electron flow to each from the wall socket through the power supply across the board, and to the component. There's no way to control that skew between components unless all components are in the same die and on the same power plane. And even that might not be enough. Systems have to be designed to handle the skew and deal with it when data crosses clock boundaries.

      That is one place at the most basic level where non-determinism begins even before the BIOS has posted. It's not a bug though. It's just part of the nature of any multiple component digital system.

      It isn't surprising at all that increasingly more non-deterministic performance drift can occur once you add on top of that the fact that you have dozens of IO components in the system which all have to be coordinated and managed by the OS. Not to mention that the OS has to boot off of one of those components as well (the hard drive).

      If you sit down and think about all the things that have to happen correctly, of how many disparate components are involved, and how many clock boundaries are crossed back and forth hundreds of times....all just to get the BIOS to post...it's pretty amazing that the PC under your desk (or in your lap) turns on at all.

      But I am digressing. My point is, that even if every component was bug free, the initial state of the system at the moment the BIOS completes the post would never be completely deterministic even for two consecutive boots of the same physical system.

      You'd probably have to literally be maxwell's demon to have that kind of control over the system. ;)

      --
      In a minute there is time For decisions and revisions which a minute will reverse. -T.S. Eliot
    68. Re:unsurprising. by Mr2cents · · Score: 1

      If you look at it that way you are correct, I misused the word deterministic a bit. But, as is described in the lecture I linked to, these factors are not relevant to the algorithm. You cannot use cpu time as a measurement of performance because factors like page boundaries, caches etc. influence it in unforeseeable ways. In that sense undeterministic isn't such a bad choice of words.

      --
      "It's too bad that stupidity isn't painful." - Anton LaVey
    69. Re:unsurprising. by sowth · · Score: 1

      I see. Then it is not surprising the circuit designs only worked for the FPGA it was designed on. You are right, I'm sure most simulators would not be sophisticated enough to take such details into account.

      Then again, if someone is doing this kind of work, they would probably need such a simulator to have any repeatable results for any sort of mass manufactured device...

    70. Re:unsurprising. by paulgrant · · Score: 1

      That's the part I found fascinating - minuscule variations on-chip leading to an unportable design :P Thanks for the info though, I love GA's + hardware :P

    71. Re:unsurprising. by ArsonSmith · · Score: 1

      No, you are completely wrong. In a grouping of 10 random numbers 0-9 there are only 10 that repeat every number:

      • 2222222222
      • 6666666666
      • 0000000000

      There are many more that have no repeating numbers such as:

      • 1983027456
      • 1234567890
      • 5647382910

      Then there are even more that have mixes of repeating numbers and non repeating numbers

      • 4902227405
      • 4956218444
      • 1112222333

      It comes down to there are fewer chances that you will randomly get something that looks as though it has a pattern compared to something that looks like it has no pattern.

      The fact that the odds of getting 1234567890 are the same as getting 7890243849 is obvious but off topic.
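
Both sides of this argument can be checked by counting. The sketch below (the function names are hypothetical) counts 10-digit strings by repetition pattern and confirms the formula by brute force on a small case. It assumes a length of at least 2, so that "all the same digit" and "no repeated digit" cannot overlap.

```python
import math
from itertools import product


def digit_string_counts(length=10, base=10):
    """Count digit strings of `length` over `base` symbols by repetition pattern."""
    if length < 2:
        raise ValueError("length must be at least 2 for the classes to be disjoint")
    total = base ** length
    all_same = base                        # one string per digit, e.g. 2222222222
    no_repeats = math.perm(base, length)   # base * (base - 1) * ... ; 0 if length > base
    mixed = total - all_same - no_repeats  # everything else
    return {"total": total, "all_same": all_same,
            "no_repeats": no_repeats, "mixed": mixed}


def brute_force_counts(length, base):
    """Same classification by enumerating every string (small cases only)."""
    counts = {"total": 0, "all_same": 0, "no_repeats": 0, "mixed": 0}
    for digits in product(range(base), repeat=length):
        distinct = len(set(digits))
        counts["total"] += 1
        if distinct == 1:
            counts["all_same"] += 1
        elif distinct == length:
            counts["no_repeats"] += 1
        else:
            counts["mixed"] += 1
    return counts


if __name__ == "__main__":
    print(digit_string_counts())
```

For 10 digits there are 10 all-same strings and 10! = 3,628,800 strings with no repeated digit out of 10^10 in total, so the mixed class dominates. Any one specific string still has probability 1 in 10^10, which is the point the earlier replies were making.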

      --
      Paying taxes to buy civilization is like paying a hooker to buy love.
  2. who would've guessed... by Eto_Demerzel79 · · Score: 4, Insightful

    ...programs not designed for multi-core systems don't use them efficiently.

    1. Re:who would've guessed... by Anonymous Coward · · Score: 0

      In Visual Studio just drag another core from the toolbox into the application and voila!

    2. Re:who would've guessed... by timeOday · · Score: 4, Insightful

      No, the programs are not the problem. The programmer should not have to worry about manually assigning processes to cores or switching a process from one core to another - in fact, there's no way the programmer could do that, since it would require knowing what the system load is, what other programs are running, and physical details (such as cache behavior) of processors not even invented yet. This is all the job of the OS.

    3. Re:who would've guessed... by Anonymous Coward · · Score: 1, Insightful

      Summary (I didn't RTFA) says that the performance of a program can vary depending on which core it is executing on. No mention of multi-threading or using multiple cores at once. The article is not about programs using cores efficiently. It is about the unpredictability and differences between seemingly identical cores and how the OS can detect and correct those problems.

    4. Re:who would've guessed... by PhrostyMcByte · · Score: 1

      The OS can only do so much. Most programs have downright horrible scaling on just 4 cores, let alone the 64 cores of 5 years from now. If you want to be scalable, you need to learn how to do it and design your app for it from the start.

    5. Re:who would've guessed... by Anonymous Coward · · Score: 0

      The problem isn't just the OS or the software being run -- it's the cache. What this article is about is what everybody already knew: different workloads create different cache efficiency.... Not exactly the revelation of the century.

    6. Re:who would've guessed... by Splab · · Score: 1

      Actually it is the job of the programmer to make sure his program is cache friendly, that should work on all architectures.

      Also you should in a multi-core/-CPU environment make sure data needed is close to where you are, that means fetching it from whatever storage it is in (ram, hdd, other core) as early as possible and non-blocking if possible so you can complete other tasks while waiting.

      While the OS can help you with some tasks, there is no way for the OS to know what data you need next, so if you want high performance you have to program for it, and while you don't always have direct access to memory you can be pretty sure most hardware work in the same way, with the same drawbacks so usage of generalized optimizations for fetching/pushing data, for cache usage etc. should work over the boards.

    7. Re:who would've guessed... by Anonymous Coward · · Score: 0

      Actually it is the job of the programmer to make sure his program is cache friendly, that should work on all architectures.

      Sorry, but that is impossible. You can not design a "cache friendly" program without knowing the layout of the cache (Depth? Associativity? Separate D/I cache? Size?). Hence, you can not write code that is both portable and optimized.

      Being a compiler programmer, I can tell you it is already hard to optimize for a specific architecture, because cache sizes vary between different instances of the same architecture (although the cache layout mostly remains the same). Now consider the x86-64 platform and add to the mix: multiple vendors, multiple processor lines (budget, performance, low-power), different processors that identify the same (Core2 Quad and Duo, Phenom X3 and X4), even differing memory latency and bandwidth.

      In short, the only way (even for a compiler) to know with certainty the cache layout of the target processor is to inspect it at runtime. But even GCC's new switch -march=native does not go beyond ISA and feature-set identification, and uses generic information for all other parameters.
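
      As a rough illustration of what inspecting the cache at runtime can look like, the sketch below reads the per-cache attributes that Linux exports under sysfs. It is an assumption-laden example, not something from this thread: the function name is hypothetical, it only works on Linux, and some kernels or CPUs do not export every attribute, which is why missing files are reported as None.

```python
import os

SYSFS_CPU_ROOT = "/sys/devices/system/cpu"  # standard Linux sysfs location

def read_cache_layout(cpu=0, root=SYSFS_CPU_ROOT):
    """Return one dict per cache visible to `cpu`, or [] if sysfs has no data."""
    cache_dir = os.path.join(root, f"cpu{cpu}", "cache")
    if not os.path.isdir(cache_dir):
        return []
    fields = ("level", "type", "size", "ways_of_associativity",
              "coherency_line_size", "shared_cpu_list")
    caches = []
    for entry in sorted(os.listdir(cache_dir)):
        if not entry.startswith("index"):
            continue
        info = {}
        for field in fields:
            path = os.path.join(cache_dir, entry, field)
            try:
                with open(path) as handle:
                    info[field] = handle.read().strip()
            except OSError:
                info[field] = None  # attribute not exported on this kernel/CPU
        caches.append(info)
    return caches

for cache in read_cache_layout():
    print(cache)
```

      The shared_cpu_list attribute is the interesting one for this discussion, since it shows which cores sit behind the same cache.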

      In your view, what design properties should be emphasized to write "cache friendly" code? And what languages allow the programmer to express this "cache friendliness"?

      While the OS can help you with some tasks, there is no way for the OS to know what data you need next

      This can be alleviated via cache prefetch. But you cannot successfully prefetch if you do not know the associativity (and size) of the cache you're working with.

      so if you want high performance you have to program for it

      Wow. Interesting.

      and while you don't always have direct access to memory you can be pretty sure most hardware work in the same way with the same drawbacks

      Are you implying that all NUMA architectures are equal? Or that NUMA architectures can be treated equal to centralized (FSB) memory architectures?

      so usage of generalized optimizations for fetching/pushing data, for cache usage etc. should work over the boards.

      Of course it should work over the boards, that's what generalized implies. But they are necessarily sub-optimal.

    8. Re:who would've guessed... by Kazoo+the+Clown · · Score: 1

      And I'm sure someone else will be willing to argue that it is the compiler's job to deal with all this.

      But no, the design of the OS, the compiler, the target program, and undoubtedly the behavior of the end-user will likely all have some role in optimizations of this nature. Welcome to the world of parallel processing.

    9. Re:who would've guessed... by afidel · · Score: 1

      I guess that's why Oracle costs so damn much, it's quite happy using as many cores/cpu's as you've got.

      --
      There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
  3. make -j 3 by kevind23 · · Score: 0, Offtopic

    Works fine for me.

    1. Re:make -j 3 by aliquis · · Score: 1

      And this is useful info because?

      Isn't most of the point of using the -j parameter that your machine can carry on compiling something else while whatever it did earlier gets the resources it needed from disk or similar? Will it really help out with cache usage?

      Should more processes mean better or worse cache performance? Worse because cache is shared between them, better because if something is missing some other instruction can be done while the needed data is fetched from RAM?

    2. Re:make -j 3 by Anthony_Cargile · · Score: 1

      I believe this allows make to make use of several cores, not the actual application being compiled. More specifically, -j means "jobs" and therefore not necessarily "cores" per se, but you could always manually tweak the affinity yourself if you're compiling something absolutely huge.
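
      For anyone wondering what tweaking the affinity by hand looks like, here is a minimal sketch using Python's os.sched_setaffinity, which is only available on some Unix platforms such as Linux. The helper name and the choice of the lowest-numbered allowed CPU are assumptions made for the example.

```python
import os

def pin_to_one_cpu(pid=0):
    """Restrict `pid` (0 = this process) to a single CPU it may already use.

    Returns the new affinity set, or None where the call is unavailable
    (os.sched_setaffinity only exists on some Unix platforms, e.g. Linux).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = os.sched_getaffinity(pid)   # CPUs the scheduler may use now
    target = min(allowed)                 # arbitrary choice: lowest-numbered CPU
    os.sched_setaffinity(pid, {target})
    return os.sched_getaffinity(pid)

print("affinity after pinning:", pin_to_one_cpu())
```

      Child processes inherit the affinity mask, so pinning a shell to one CPU before starting a parallel make would confine every job to that CPU, which is rarely what you want for a build.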

    3. Re:make -j 3 by bob.appleyard · · Score: 2, Funny

      It's OK. This is a Gentoo user. Getting make to work on multicore well has a significant impact on the usability of his computer.

      --
      How dare you be so modest!! You conceited bastard!!
    4. Re:make -j 3 by kevind23 · · Score: 1

      The point of using this is so that you can compile multiple files at once. Obviously it can't impact how the application performs because that would require modification of the source code, and the compiler doesn't magically optimize it to work with multiple cores.

    5. Re:make -j 3 by kevind23 · · Score: 1

      Try Debian. I don't use packages for everything, you know.

    6. Re:make -j 3 by Anonymous Coward · · Score: 0

      Debian? No wonder you're so pissed off all the time

  4. multicore dev is fun... much like prison rape! by Shadowruni · · Score: 4, Interesting
    The current state of dev sort of reminds me of the issues that Nintendo had with the N64.... a beautiful piece of hardware with (at the time) a God-like amount of raw power, but *REALLY* hard to code for. Hence the really interesting titles for it came either from Rare, who developed on SGI machines (an R10000 drove that beast), or from Nintendo, who built the thing.

    /yeah yeah, I know the PS1 and Sega Saturn had optical media, and that the media's storage capacity, which led to better and more complex games, was truly what killed the N64.

    //bonus capt was arrestor

    --
    "Chinese Amazons, power armor, laser swords.... things just meant to be." - Shampoo, A Very Scary Bet
    1. Re:multicore dev is fun... much like prison rape! by aliquis · · Score: 1

      Could you point me in some direction for more information about the problems of developing for the N64? I knew developers didn't like the Sega Saturn or whatever it was which had multiple cores, but I don't remember reading anything about the N64.

    2. Re:multicore dev is fun... much like prison rape! by Ironchew · · Score: 2, Informative

      http://en.wikipedia.org/wiki/N64#Programming_difficulties
      The amount of video memory for textures was way too small.

    3. Re:multicore dev is fun... much like prison rape! by carlzum · · Score: 4, Interesting

      I believe the biggest problem with multi-core development is a lack of maturity in the tools and libraries available. Taking advantage of multiple cores requires a lot of thread management code, which is great for highly optimized applications but deters run-of-the-mill business and user app developers. There was a recent opinion piece in Dr Dobbs discussing the benefits of concurrency platforms that I found interesting. The article is clearly promoting the author's company (Cilk Arts), but I agree with his argument that the complexities of multi-core development need to be handled in a framework and not in applications.

    4. Re:multicore dev is fun... much like prison rape! by Fallingcow · · Score: 1, Offtopic

      The N64 was killed?

      Best "party game" system of that generation, easily.

      4 controller capability out of the box, 007 Goldeneye, Perfect Dark, Mario Kart, all the good wrestling games (hey, they were fun at the time...) etc.

      The PS1 was only good for racing games and RPGs, IMO. Oh, and Bushido Blade 1 and 2.

      Kind of like the Wii vs. 360/PS3. Any time we plug in a PS3 at a get-together, it's to ooh and ah over the graphics and maybe take turns playing the single player mode of a cool game (Need for Speed or something). When the Wii's plugged in it's so we can all play games together.

      Then again, no one I know likes console shooters, especially ones that don't do split-screen (and if they do, they better dumb it down like Goldeneye/Perfect Dark so it's fun rather than frustrating with the damn broken console controller--we all like PC shooters), so that may be why we don't get any multiplayer action out of those other consoles.

      / I see you are a fark.com user, too // Slashies right back at ya!

    5. Re:multicore dev is fun... much like prison rape! by Anonymous Coward · · Score: 0

      That's why I still scratch my head over the HDDVD/Bluray war... :(

  5. Re:First poSt by Anonymous Coward · · Score: 0

    Oh Great Australia Internet Filter, why has thou abandoned me?

  6. Linux and Windows by WarJolt · · Score: 3, Insightful

    I don't know if Linux or Windows has an automatic mechanism to schedule task priority based on processor caches, but the study didn't even mention Windows. Seeing that the scheduling and managing the caches are OS problems this seems kind of important.

    The other thing that seems odd is they were using a 2.6.18 Kernel and in 2.6.23 they added the Completely Fair Scheduler which could potentially change their results. It doesn't seem logical to base a cutting edge study on stuff that was released years ago.

  7. Linux schedules better than this by bluefoxlucid · · Score: 3, Informative

    Last I checked, Linux was smart enough to try to keep programs running on cores where cache contained the needed data.

    1. Re:Linux schedules better than this by HRbnjR · · Score: 2, Interesting
    2. Re:Linux schedules better than this by nullchar · · Score: 4, Interesting

      Possibly... but it appears an SMP kernel treats each core as a separate physical processor.

      Take an Intel Core2 Quad machine and start a process that takes 100% of one CPU. Then watch top/htop/gnome-system-monitor/etc where you can watch the process hop around all four cores. It makes sense that the process might hop between two cores -- the two that share L2 cache -- but all four cores doesn't make sense to me. Seems like the L2 cache is wasted when migrating between each core2 package.
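
      The hopping can also be observed without a system monitor. The sketch below is an illustrative assumption rather than anything from this post: it reads field 39 ("processor", the CPU a task last ran on) from /proc/<pid>/stat on Linux. The function names are made up for the example.

```python
def last_cpu_from_stat(stat_line):
    """Extract field 39 ("processor") from a /proc/<pid>/stat line."""
    # The command name (field 2) is wrapped in parentheses and may itself
    # contain spaces or parentheses, so split after the LAST ')'.
    after_comm = stat_line.rsplit(")", 1)[1].split()
    # after_comm[0] is field 3 (state), so field 39 sits at index 39 - 3.
    return int(after_comm[36])

def current_cpu(pid="self"):
    """Return the CPU `pid` last ran on, or None if /proc is unavailable."""
    try:
        with open(f"/proc/{pid}/stat") as handle:
            return last_cpu_from_stat(handle.read())
    except (OSError, IndexError, ValueError):
        return None

print("last ran on CPU:", current_cpu())
```

      Calling current_cpu() in a loop from a CPU-bound process, and logging each change, gives a crude record of how often the scheduler migrates it.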

    3. Re:Linux schedules better than this by Krishnoid · · Score: 3, Interesting

      Wasn't there an article recently about this describing that if only one core was working at peak capacity that the die would heat unevenly, causing problems?

    4. Re:Linux schedules better than this by Anonymous Coward · · Score: 0

      Why do people bother commenting on technical subjects about which they know nothing?

    5. Re:Linux schedules better than this by Anonymous Coward · · Score: 0

      Isn't that why the new AMD cores all share the same cache, to avoid this 'unavailable' cached data? (I hope they get their act together. AMD can design but they can't execute. Intel can execute in numbing volumes but their designs leave a lot to be desired. And Motorola is probably still at 500MHz.)

    6. Re:Linux schedules better than this by Anthony_Cargile · · Score: 1

      The article uses a kernel version that predates the completely fair scheduler, that would be why. If they aim to test something like this, they need to test the most recent version.

    7. Re:Linux schedules better than this by bluefoxlucid · · Score: 1

      Your PAUSE() function will spin indefinitely instead of continuing.

    8. Re:Linux schedules better than this by timeOday · · Score: 1

      Last I checked, Linux was smart enough to try to keep programs running on cores where cache contained the needed data.

      As if simply giving each process affinity for a given core solves the problem. But then you have interrupt handling, job loads with more than one process per core, multi-threaded programs - all sharing memory space yet with different memory access patterns - and different processors with e.g. different cache architectures. The task-switching OS is 50 years old and we still haven't settled on THE perfect scheduler - and now you suggest that this problem, with several more degrees of freedom due to multi-core, is solved by a trivial heuristic.

    9. Re:Linux schedules better than this by Anthony_Cargile · · Score: 1

      Exactly. If Slashdot gave me more room, I would have put the rest of the joke on there:

      void PAUSE(){ printf("\nPress any key to continue. . ."); while(1) getch(); } // Enforce the 'any' key

      What's even worse is that this line of code was used in a fake cmd.exe I made for a prank on my friend's computer. Tricky to install due to having to point the COMSPEC env. variable to a backed up version of the real cmd.exe and tinkering with the dllcache directory, but it was priceless to see his reaction to the fake ping error :D.

    10. Re:Linux schedules better than this by Anonymous Coward · · Score: 0

      This is why a true quad core architecture beats two dual cores glued together. Of course, it does help to release that true quad core on time and at promised speeds....

    11. Re:Linux schedules better than this by RAMMS+EIN · · Score: 1

      ``And Motorola is probably still at 500MHz.''

      Actually, they gave up on the desktop CPU market. They spun off their chip division into Freescale Semiconductor, which now makes embedded processors.

      --
      Please correct me if I got my facts wrong.
    12. Re:Linux schedules better than this by ILongForDarkness · · Score: 1
      My understanding is that this was one of the features of the Xeon chips and presumably got transferred over to the Core 2 world. The idea is that the workload gets moved around to distribute the heat better on the die. More even heat leads to more efficient cooling.

      You have a point when it comes to cache locality. It can be somewhat mitigated by smart timing of the core switching. For example long time on each core (as you probably would notice with your system monitor), or doing something like switching on each block read from memory. Presumably the thread is blocking on the memory read and will be using that data coming from RAM, so some of the currently cached data would probably be aged out to make room for it. If you swap cores, and possibly L2 caches at that time you can write to even older cache.

      Anyways, the problem the article mentions is a really old one; as other people have commented, it has been around since at least the 80's. Any system with two or more processors sharing RAM has cache coherency issues and issues with devices (at least who has the image of the driver in the cache and possibly who has the physical connection to the device). Fun problems to solve; they should keep us tech geeks busy for several years to come.

    13. Re:Linux schedules better than this by slash.duncan · · Score: 1

      How do you have your SMP configured, and do you have NUMA and is it enabled?

      I'm running a now older 2xx series dual Opteron 290, so dual sockets, dual-cores each, physically configured with four gig memory hanging off each one. The AMD 8xxx chipset has the rest of the system (all the PCI-X channels and AGP, it's pre-PCI-E) hanging off socket-0. In the kernel, I have SMP set, SMT (multi-thread, this would be closer affinity than multi-core but of course the AMDs don't use it) unset, SCHED_MC (multi-core, lower affinity than SMT, higher than NUMA) set, and NUMA set (lower affinity than SCHED_MC, higher than generic SMP, the effect on this physical topology is to work with SCHED_MC to heavily prioritize socket affinity). (All these kernel options can be found under the Processor type and features menu, in menuconfig, etc.)

      With that physical and logical setup, the system definitely honors the closer topology of the paired cores as opposed to opposite sockets and the memory hanging off them. Under low load, the scheduler will keep everything on the paired socket-0 cores (with relatively little resistance to switching between cores on the same socket), logical since that lets the socket-1 pair idle (and on a newer chipset, probably sleep), and because socket-0 will have lower latency since it's direct-connected to the rest of the system.

      Under high load the scheduler distributes threads so all cores run as close to 100% as possible. Again, logical.

      The interesting behavior is the moderate load condition, or a single-thread 100% condition. Here, it seems to be interrupt sensitive, putting the single-thread hog on the socket-1 cores in most cases, and keeping X and high interrupt threads on the socket-0 cores if possible. I've noted the single-threaded CPU-hoggy behavior in at least two instances, one with a single-threaded X app hog, and one with an unkillable kernel inotify thread gone haywire (I was running an early to mid 2.6.28-rc kernel at the time -- it apparently endless-looped when a file delete on an inotify watched file didn't get handled properly, I've not seen it since -rc7 or so in this cycle or on full releases).

      The X-app was the DOSBOX emulator, for the single old closed source game I still run, Master of Orion original DOS edition from the early 90s. The DOS emulation runs as full-on hog as the DOSBOX settings are configured to allow, but being DOS, it's naturally single-threaded. Most of that is in-memory emulation, very little I/O and comparatively few calls to X thru SDL, so few hardware level interrupts and the scheduler shifts it to the socket-1 CPU cores.

      The inotify kernel thread, once the unhandled delete happened and it went into its endless loop, obviously also had zero interrupt handling and was shifted off to the socket-1 CPU, leaving the socket-0 CPU for more interrupt driven threads. I run ksysguard with individual core CPU activity graphs and eventually noticed core 3 (socket-1, core-0) running 100% for "no reason", so investigated, and found the inotify thread sitting there @ 100%. But as it wasn't a convenient time to reboot, I let it sit there eating 100% of a socket-1 core's cycles for a day and a half or so before I eventually rebooted. It would switch cores on the socket-1 CPU every so often (possibly to avoid the point heating mentioned in the other reply), but it stayed on the socket-1 CPU and left socket-0 alone. I rebooted before I did any multi-thread compiling or other extreme-load multi-thread stuff, since I saw little point in stressing a kernel with one kernel thread already wigged out, but it was fine for my ordinary desktop stuff (including a round of MOO in DOSBOX, which took up the other core on the socket-1 CPU, still leaving both cores of the socket-0 CPU at only a few percent utilization) and I'd have not even noticed the problem were it not for the per-core CPU activity graphing I have ksysguard configured to display.

      Of course with the NUMA as opposed to monolithic memory config, the per-socket core affinit

      --
      Duncan
      "Every nonfree program has a lord, a master,
      and if you use the program, he is your master."
      R Stallman
    14. Re:Linux schedules better than this by PitaBred · · Score: 2, Informative

      I thought that Intel specifically did that, that if one core were loaded it would overclock that core and downclock the others to get a speed boost...

      Yup, I thought I remembered correctly.

    15. Re:Linux schedules better than this by bluefoxlucid · · Score: 1

      You should have added "Guru Meditation" to the stop error ;)

    16. Re:Linux schedules better than this by nullchar · · Score: 1

      My chip is a Xeon X5450, which is still two dual-cores in a single package.

      Also, /sys/.../cpu*/cpufreq/scaling_governor is "ondemand". But the same behavior occurs if I turn off CPU scaling.

      I thought it might be a "thermal feature" to keep all the cores balanced in temperature. And to examine this better, I should really use a heavy CPU process that uses low I/O, to make full use of the cache. Then possibly boot to a non-SMP kernel and time the differences to ensure the L2 cache moving is a factor.

    17. Re:Linux schedules better than this by nullchar · · Score: 1

      Thanks for the reply. This is a single Xeon core2 quad, 2.6.24 (old, I know).

      I have the following enabled:
          SMP
          NUMA
          SCHED_MC
          FAIR_GROUP_SCHED
          (unsure which affinity is higher.. I can use 'taskset' on a process)

      And the following disabled:
          CONFIG_SCHED_SMT

      And like others mentioned, a true quad-core with shared L2/L3 cache would negate this issue -- until you add another physical chip (total of 8 cores). Will the process be migrated between physical chips? Should it? Could it move "closer" (driver/bus/interrupt handling) to the device the process is interacting with (ram, eth, disk, gpu)?

  8. NUMA NUMA by Gothmolly · · Score: 3, Informative

    Linux can already deal with scheduling tasks to processors where the necessary resources are "close". It may not be obvious to the likes of PC Magazine, but it's trivially obvious that even multithreaded programs running on a non-location aware kernel are going to take a hit. This is a kernel problem, not an application library problem.

    --
    I want to delete my account but Slashdot doesn't allow it.
    1. Re:NUMA NUMA by Anonymous Coward · · Score: 0

      Yes, because "dick sucking faggots" don't really exist (well except on your imagination, obviously).

  9. This isn't news by nettablepc · · Score: 5, Informative

    Anyone who has been doing performance work should have known this. The tools to adjust things like core affinity and where interrupts are handled have been available in Linux and Windows for a long time. These effects were present in 1980s mainframes. DUH.

    1. Re:This isn't news by Clover_Kicker · · Score: 5, Insightful

      80s mainframe tech is NEW and EXCITING to a depressing number of tech people, look at how excited everyone got when someone remembered and re-implemented virtualization.

    2. Re:This isn't news by nullchar · · Score: 1

      I don't know if they've been in the default kernel for "a long time", but they are there now.

      read: http://www.alexandersandler.net/smp-affinity-and-proper-interrupt-handling-in-linux

    3. Re:This isn't news by Anonymous Coward · · Score: 0

      On the Windows side, it's been there since the NT series were around.

      I did it on NT 3.x in 1994.

    4. Re:This isn't news by Anonymous Coward · · Score: 0

      Well, I didn't have a mainframe in the 80s.

    5. Re:This isn't news by Anonymous Coward · · Score: 1, Informative

      80s mainframe tech is NEW and EXCITING to a depressing number of tech people, look at how excited everyone got when someone remembered and re-implemented virtualization.

      Ummm, that's re-implemented virtualization on x86 with very little performance overhead and at a very reasonable cost. That was new and exciting.

      And while I did use CICS and MVS back in the day, I don't think IBM had technology (maybe they did, but I never heard of it) like VMware's vMotion, where you can take a running virtual machine and move it from one host to another.

      Processor affinity isn't new. Quite a few applications have settings for that, even Microsoft Sql Server 2000.

    6. Re:This isn't news by ion.simon.c · · Score: 1

      Meh.
      That doesn't excuse the *rest* of the entire industry forgetting *everything* the mainframe folks learned. :/

    7. Re:This isn't news by Anonymous Coward · · Score: 0

      Having worked on multi-CPU (8 or 16) SGI systems in the mid 90s, I can say that this article brings back some memories.

    8. Re:This isn't news by JAlexoi · · Score: 1

      Don't know when IBM's mainframes got it, but WAN clustering works like that. If a mainframe goes down, the VM is moved to another mainframe in the "cluster". Though, they probably can make them move.

  10. it's the affinity by non-e-moose · · Score: 2, Informative

    It's just an Insel Intide thing. DAAMIT processors are more predictable. Or not. If you don't use numactl (1) to force socket (and memory) affinity, you get exactly what you ask for (randomly selected sockets, and unpredictable performance)

  11. not a surprise by Eil · · Score: 5, Insightful

    Here's an exercise: Take 2 brand-new systems with identical configurations and start them at the same time doing some job that takes a few hours and utilizes most of the hardware to some significant degree. Say, compiling some huge piece of code like KDE or OpenOffice. System administrators who do exactly this will tell you that you'll almost never see the two machines complete the job at precisely the same time. Even though the CPU, memory, hard drive, motherboard, and everything else is the same, the system as a whole is so complex that minute differences in timing somewhere compound into larger ones. Sometimes you can even reboot them and repeat the experiment and the results will have reversed. It shouldn't come as a surprise that adding more complexity (in the form of processor cores) would enhance the effect.
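
    A scaled-down version of this exercise fits on a single machine: run the same deterministic job several times and look at the spread. The sketch below is only an illustration under assumed names (busy_work and time_runs are made up here), and a short Python loop is a stand-in for a real multi-hour build.

```python
import statistics
import time

def busy_work(n=200_000):
    """A fixed CPU-bound workload; any deterministic job would do."""
    total = 0
    for i in range(n):
        total += i * i
    return total

def time_runs(workload, runs=5):
    """Time `workload` several times and return the wall-clock samples."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        workload()
        samples.append(time.perf_counter() - start)
    return samples

samples = time_runs(busy_work)
spread = (max(samples) - min(samples)) / min(samples)
print(f"median {statistics.median(samples):.4f}s, "
      f"fastest-to-slowest spread {spread:.1%}")
```

    The work done is identical on every run, so any spread in the timings comes from the rest of the system: scheduling, caches, interrupts, and whatever else happens to be running.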

    1. Re:not a surprise by im_thatoneguy · · Score: 4, Interesting

      We have this problem at work.

      We have a render farm of 16 machines. 12 of them are effectively identical but despite all of our coaxing one of them always runs about 30% slower. It's maddening. But "What can you do?". Hardware is the same. We Ghost the systems so the boot data is exactly the same... and yet... slowness. It's just a handicapped system.

    2. Re:not a surprise by visualight · · Score: 1

      Move processors around so you get a different boot proc, if you haven't tried that already.

      --
      Samsung took back my unlocked bootloader because Google wants me to rent movies. They're both evil.
    3. Re:not a surprise by Anonymous Coward · · Score: 0

      the machine that's slower, is it
       
      ..always the same machine?

      that machine is damaged
       
      ..always a different machine?

      you have a networking bottleneck

      cheers!

    4. Re:not a surprise by Ethanol-fueled · · Score: 1

      Ahh, the trusty ol' cycle 'n' swap. It's funny how complex problems often have simple fixes. Kinda like how the car won't start unless you kick the fender before you turn the crank.

      Some people put together servers all day that way: swapping a bunch of intermittent crap in and out until the box runs long enough to install the OS :)

    5. Re:not a surprise by Anonymous Coward · · Score: 1, Informative

      There are a number of possibilities. Make sure the CPU family/model/stepping is the same between the slow and normal effectively identical machine. Check that the DIMMs are exactly the same and installed in the same slots as the other machines. You might even try plain swapping memory with a known good machine. Another thing to check is the PCI bus. If you have a card in one slot in one machine and in a different slot in another machine, it might make a difference as to how the BIOS allocates interrupts for other devices (which may affect how Linux's lame interrupt mapping sets priorities). If this render farm machine talks on the network, it could be its own ethernet adapter is having problems or the switch port to which it is connected. Check for errors logged on both sides (ifconfig eth0) -- also make sure the ports are running full duplex.

    6. Re:not a surprise by Anonymous Coward · · Score: 0

      30%?! That's not handicapped, that's defective.

    7. Re:not a surprise by Kvasio · · Score: 1

      Back in 2000 or so I helped my friend install prepared system images in new labs at my university. It was 18 or 20 machines with the same specs (same model, same order, same batch, similar serial numbers).

      We just copied the images to the HDDs. Later, the first boot of each machine after copying the image was with the network disconnected (as we needed to change SIDs). (Not so much) to my surprise, the boot times varied from ~ 1m30s to ~ 2m45s.

    8. Re:not a surprise by Anonymous Coward · · Score: 0

      Did you check if the slow systems are throttling the CPU to prevent overheating? It's quite possible that their CPU heatsinks are not mounted correctly, they are using a "silent" fan profile or they are just mounted in a warmer corner of your server room.

    9. Re:not a surprise by TheRaven64 · · Score: 1

      I encountered a similar issue on a cluster I used. One or two of the machines would suddenly become very slow. It turned out that the fans had partially failed. When the CPU got hot, it would be throttled back, without leaving anything in the error log. The technicians didn't expect this - with the old cluster CPUs that got too hot just failed and were replaced - but eventually tracked it down. You might want to check that the air flow around the slow machine is adequate. Recent Intel chips (and, I think, AMD ones) will slow themselves down a bit without telling you if they get a bit warm.

      --
      I am TheRaven on Soylent News
    10. Re:not a surprise by Anonymous Coward · · Score: 1, Funny

      I set up a build lab with around 10 machines. One of the machines ran 50% slower than the rest of the group. It was a huge puzzle, because each machine was a clone, identical hardware, etc. As it turned out, one of the guys in the lab had set up the "slow" machine with a very CPU intensive screen saver. Whenever I went to tinker with it (to figure out why it was slower), the screen saver was not running.

      So... look for the screen saver. It is not obvious.

    11. Re:not a surprise by PitaBred · · Score: 1

      Check your power supply... that's almost always been the cause of any "weird" errors I've gotten. Jitter in power causes all kinds of fun, unpredictable stuff to happen.

    12. Re:not a surprise by Anonymous Coward · · Score: 0

      Check the BIOS versions. I've had a similar problem with machines that were claimed to be "absolutely identical".

    13. Re:not a surprise by toddestan · · Score: 1

      If the machines have ECC, also check that you aren't having memory errors. I've seen machines with ECC where you were constantly getting memory errors, but since the ECC was able to correct for them the computer was still stable, but took a considerable performance hit.

    14. Re:not a surprise by Eil · · Score: 1

      30% is quite a performance difference and is far beyond the almost insignificant margins I discussed in my post. That great a difference is almost certainly attributable to bad or misconfigured hardware, or perhaps a bug in the software which handles the load balancing.

  12. I find it hard to believe by Anonymous Coward · · Score: 0

    that cutting edge research is done in Virginia.

    1. Re:I find it hard to believe by Anonymous Coward · · Score: 0

      Yes Santa, there is a Virginia!

  13. Re:Linux and Windows by Anthony_Cargile · · Score: 1

    I agree, and seeing this in the standard C/C++ libraries down the road would be nice. I would say Java would have framework-esque multicore support first, but then again Sun is in trouble and Java is just now getting video and 64-bit support. I don't use .NET enough to know, but it would be interesting to know if .NET has decent native multicore support and if Mono implements it correctly, although this all depends on MSIL versioning/limitations I'm sure.

    In a nutshell, we need more portable multicore solutions in order to make better usage of them. Not just for the sake of being cross-platform, but for better documentation, example code, etc.

  14. In summary.... by johnlcallaway · · Score: 0

    So if you let the OS and compiler do it for you, performance results are not consistent between runs.

    Wow ... what a shock....

    What's next? A study that shows that if you don't select any optimization parameters, a program won't run as effectively as when you select the best ones?

    --
    I rarely read replies, it's my opinion and if you thought about your opinion a little more, I'm OK with that.
  15. Re:Linux and Windows by nategoose · · Score: 1

    Last time I read anything about it (which was years ago) the Linux cache-aware scheduling consisted of trying to get tasks scheduled on the same processor as they were scheduled on previously. This works well for a lot of things, but you lose a lot of benefit when multiple simultaneous tasks are working on the same data, since those tasks would be spread across the processors to take advantage of concurrency.
    This is just an engineering trade off.

  16. Re:Linux and Windows by nabsltd · · Score: 1

    I don't know if Linux or Windows has an automatic mechanism to schedule task priority based on processor caches, but the study didn't even mention Windows. Seeing that the scheduling and managing the caches are OS problems this seems kind of important.

    I'm not sure why this article isn't tagged "duh".

    It's pretty obvious from looking at the CPU graphs of my VMware ESX servers that their code does some optimization to keep processes on the same core, or at the very least on the same CPU.

    This data is from a dual-socket quad-core AMD (8 total cores), which means a NUMA architecture, so running the code on the same CPU means you have faster memory access.

    So, some commercial code that has been around for nearly 4 years takes advantage of the "discoveries" in an article published this month.

  17. Re:Linux and Windows by swb · · Score: 3, Informative

    They mentioned this in an ESX class I took. I seem to remember it in the context of setting a processor affinity or creating multi-CPU VMs and how either the hypervisor was smarter than you (e.g., don't set affinity) or that multi-CPU VMs could actually slow other VMs because the hypervisor would try to keep multi-CPU VMs on the same socket, thus denying execution priority to other VMs (e.g., don't assign SMP VMs just because you can, unless you have the CPU workload).

  18. Well known problem by sjames · · Score: 3, Insightful

    The problem is a complex one. Every possible scheduling decision has pluses and minuses. For example, keeping a process on the same core for each timeslice maximizes cache hits, but can lose if it means the process has to wait TOO long for its next slice. Likewise, if a process must wait for something, should it yield to another process or busy wait? Should interrupts be balanced over CPUs or should one CPU handle them?

    A lot of work has gone into those questions in the Linux scheduler. For all of that, the scheduler only knows so much about a given app, and if it takes TOO long to 'think' about it, it negates the benefits of a better decision.

    For special cases where you're quite sure you know more than the scheduler about your app, you can use the isolcpus kernel parameter to reserve CPUs to run only the apps you explicitly assign to them.

    You can also decide which CPU any given IRQ can be handled by (but not which core within a CPU as far as I know) with /proc/irq/*/smp_affinity.

    Unless your system is dedicated to a single application and you understand it quite well, the most likely result of screwing with all of that is overall loss of performance.
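
    For reference, smp_affinity holds a hexadecimal CPU bitmask, with comma-separated 32-bit groups on larger machines. A minimal Python sketch of decoding such a mask follows; the helper name mask_to_cpus is made up for illustration, and it only parses a string rather than reading or writing anything under /proc:

```python
def mask_to_cpus(mask):
    """Decode a hex CPU bitmask such as 'f' or '00000001,00000000'."""
    # Commas only separate 32-bit groups, so dropping them leaves one hex number.
    value = int(mask.strip().replace(",", ""), 16)
    # Bit N set means logical CPU N is allowed to service the interrupt.
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


print(mask_to_cpus("f"))                  # logical CPUs 0 through 3
print(mask_to_cpus("00000001,00000000"))  # logical CPU 32
```

    Going the other way (writing a mask) needs root and is exactly the kind of tinkering warned about above, so it is left out of the sketch.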

    1. Re:Well known problem by little1973 · · Score: 1

      "You can also decide which CPU any given IRQ can be handled by (but not which core within a CPU as far as I know)"

      With the IOAPIC you can redirect the IRQ to any core. We have an in-house-developed commercial OS for telephony applications and we use the IOAPIC in a simple round-robin fashion. I do not know why Linux does not do this.

      --
      Government cannot make man richer, but it can make him poorer. - Ludwig von Mises
  19. Whoosh by symbolset · · Score: 0, Redundant
    --
    Help stamp out iliturcy.
  20. Yup by coryking · · Score: 1

    The libraries and the languages currently make threading harder than it needs to be.

    How about a "parallel foreach(Thing in Things)" ?

    I realize there are locking issues and race conditions, but really I think the languages could go some way toward making things like this more hidden. Oh wait, does that mean I'm advocating for making programming languages more user friendly? I guess so. You know why people use Ruby, C# or Java? Cause those are way more user friendly than C++ or COBOL.

    The usability of a programming language matters a lot. Nobody uses threading because the current crop of programming languages makes it complex, confusing, and full of ways to shoot yourself in the foot. Make threading user friendly, and we might see more people create multi-threaded apps.

    1. Re:Yup by cetialphav · · Score: 3, Informative

      How about a "parallel foreach(Thing in Things)" ?

      That is easy. If your application can be parallelized that easily, then it is considered embarrassingly parallel. OpenMP exists today and does just this. All you have to do (in C) is add a "#pragma" above the for loop and you have a parallel program. OpenMP is commonly available on all major platforms.

      The real problem is that most desktop applications just don't lend themselves to this type of parallelism and so the threads have lots of data sharing. This data sharing causes the problem because the programmer must carefully use synchronization primitives to prevent race conditions. Since the programmer is using parallelism to boost performance, they only want to introduce synchronization when they absolutely have to. When in doubt, they leave it out. Since it is damn near impossible to test the code for race conditions, they have no indication when they have subtle errors. This is what makes concurrent programming so difficult. One researcher says that using threads makes programs "wildly nondeterministic".

      It is hard to blame the programmers for being aggressive in seeking performance gains because Amdahl's Law is a real killer. If you have 90% of the program parallelized, the theoretical maximum performance gain is 10X no matter how many cores you can throw at the problem.
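
      The 10X figure follows from Amdahl's Law: with a parallel fraction p and n cores, speedup is 1 / ((1 - p) + p / n), which tends to 1 / (1 - p) as n grows. A small Python sketch of the arithmetic (the function name is just for illustration):

```python
def amdahl_speedup(parallel_fraction, cores):
    """Upper bound on speedup when only part of a program runs in parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# With 90% parallel code the bound creeps toward 10x as cores are added.
for cores in (2, 8, 64, 1024):
    print(cores, round(amdahl_speedup(0.9, cores), 2))
```

      The serial 10% dominates quickly, which is why shaving synchronization out of the parallel part is so tempting.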

    2. Re:Yup by gfody · · Score: 1

      what you're asking for is pretty much already that easy

      foreach (var thing in Things)
          new Thread(thing.DoStuff).Start();

      --

      bite my glorious golden ass.
    3. Re:Yup by Anonymous Coward · · Score: 0

      If separate threads were automated to the point that braces have automated gotos / jumps (to the point where we don't even worry about how many function calls are made because braces even look fun) then it would be a breakthrough.

      Imagine just coding something where you have like:

      -> sharing (myLootDataStructure var)
      -> splitToCores { work }
      -> accumulate (whateverYouNeedIntoResult)

      where you preferably didn't need to define how stuff is shared, split or accumulated -- just specify your stuff once, and let some magic do the "splitToCores" part dynamically. Same way as modern programming languages hide memory addresses by using variable names and scopes. I mean, if we can multicore program at all, then it's just a matter of some serious PhD work to define a new model and put a layer around stuff.

  21. What if... by raftpeople · · Score: 1

    We added 4 more cores to perform this "thinking" about which core the process should run on, we should be able to get back that 10% we lost, right?

  22. Interrupt redistribution by Anonymous Coward · · Score: 0

    TFA doesn't seem to specify, but I assume they're referring to Linux. Recent versions of Solaris (and also HP-UX) already have some of this functionality in what they call an "interrupt redistribution daemon".

  23. Close by coryking · · Score: 2, Interesting

    But you have to think about it too much.

    How about:


    Things.ParallelEach(function(thing){
      Console.Write("{0} is cool, but in parallel", thing);
      # serious business goes here
    });

    There are lots of stupid loop structures that are used in desktop apps that are just begging to be run in parallel, but the current crop of languages don't make it braindead easy to do so. Make it so every loop structure has a trivial and non-ugly (unlike OpenMP pragmas) way of doing it.

    Also, IMHO, not enough languages do stuff like the Javascript Array.Each(function(element){}). Am I blind, or is this construct missing from C#?

    1. Re:Close by Marillion · · Score: 1

      I agree. I've ranted about this before. 99% of languages implement multi-threading through function calls. Class method calls, in this case, are merely glorified function calls. Multi-threading should be handled at the same level as other flow control statements because that's what it is most like.

      --
      This is a boring sig
    2. Re:Close by Anonymous Coward · · Score: 1, Insightful

      You have to think about it too much.
      Things.ParallelEach(function(thing){
          Console.Write("{0} is cool, but in parallel", thing);
          # serious business goes here
      });

      The problem isn't the parallel loop in itself, it is about the secondary effects the loop has. And you cannot think about those effects too much. In your example, even ignoring the serious business, what does Console.Write() do? Does it write to a buffer -> is that buffer thread-safe? Does it do port I/O without locking -> forget about threading. Does it do port I/O with locking -> then all single-threaded applications incur unnecessary overhead. If it writes to a buffer, does it build first and flush the entire string atomically, or are the items fed element-wise?
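
      To make the "build first, flush atomically" option concrete, here is a minimal Python sketch. The shared list stands in for the console buffer, and the names output, output_lock and write_line are invented for the example. In CPython a single list.append happens to be atomic already, so the lock is illustrative of what a real buffer would need:

```python
import threading

output = []                    # shared buffer standing in for the console
output_lock = threading.Lock()


def write_line(thing):
    # Build the whole string first, then publish it in one locked step,
    # so lines from different threads cannot interleave element-wise.
    line = "{0} is cool, but in parallel".format(thing)
    with output_lock:
        output.append(line)


threads = [threading.Thread(target=write_line, args=(t,)) for t in ("a", "b", "c")]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Arrival order depends on scheduling, so sort before printing.
print(sorted(output))
```

      Every caller now pays for the lock, including ones that never run in parallel, which is the overhead trade-off described above.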

      In regular imperative languages (C, Basic and its descendants), there exist almost no "pure" loops without side effects. This means that the theoretical performance gain from going multi-thread is outweighed by the complexity of isolating orthogonal processes. Add to this the fact that memory-sharing between threads leads to considerable delays because all cores must synchronize their caches, and you have the reasons why most programs are not (yet) parallel.

      It isn't because the language constructs for parallel programming are ugly (Java's semantic approach to threading is quite nice IMHO), it is because imperative languages are sequential by definition. If you want easy parallelism, then don't use an imperative language.

  24. Only less ugly :-) by coryking · · Score: 1

    And for those who say "but what about all the weird race conditions and stuff": I'm not a computer science major, so I'm jumping off an edge asking this, but what if we actually used some of this new CPU power in our IDEs and our JIT compilers? Couldn't our languages watch out for most of the nasty ways we can shoot ourselves in the foot? Like if I do an Array.ThreadedEach(function(element){}) and I'm changing some shared data, couldn't the compiler or IDE let me know at compile time or while I'm writing the code? Obviously you'd need a strongly typed language like C# or Java to pull such stunts, you couldn't do it in Perl :-)...

    The goal is to make this threaded stuff usable. I think we can do it.

    1. Re:Only less ugly :-) by Anonymous Coward · · Score: 0

      The problem is that the type systems of C# and Java are not good enough for the job, unless you go functional and require everything to be const or final. I'm not sure if a good enough type system has been invented yet - but perhaps a CS guru could enlighten us all?

    2. Re:Only less ugly :-) by xenocide2 · · Score: 1

      There's something called Turing completeness that blows the "solve it with smarter compilers" idea out of the water in the general sense (even though it might work 95 percent of the time).

      Threaded stuff isn't super hard. Getting threaded stuff to run FAST is hard. There's a billion tradeoffs handled by what are traditionally different parts of the system. In your silly parallelize loops idea (aka MapReduce) the challenge is clear (How many items do you need before setting up parallelization is worth the extra computational price?) but the factors are not (multi-level cache, loop time, scheduler performance, the amount of data to be processed). Cache in particular is a good example: by design, caches are transparent. Your code will still run with a small or large cache, or even none at all. How can you compile a program for a general purpose computer without knowing the size of the cache?
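
      As a toy illustration of the "how many items" question, assume a fixed setup cost, a constant per-item cost, and perfect division of work among workers. Those assumptions deliberately ignore cache, scheduler and data-size effects, which is the whole difficulty, and the function name is made up:

```python
import math


def break_even_items(setup_cost, per_item_cost, workers):
    """Smallest item count at which the parallel loop is strictly faster (toy model)."""
    if workers < 2:
        raise ValueError("need at least two workers")
    # Serial time is n * per_item_cost; parallel time is setup_cost + n * per_item_cost / workers.
    saved_per_item = per_item_cost * (1.0 - 1.0 / workers)
    return math.floor(setup_cost / saved_per_item) + 1


# Example: 100 time units of setup, 1 unit per item, 4 workers.
print(break_even_items(100.0, 1.0, 4))
```

      Even in this idealized model the answer moves with every parameter, and a compiler knows none of them for certain ahead of time.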

      Unfortunately, the most popular performance oriented languages are still all threading hostile. C/C++ does not yet support threading in the standard. (Q:"but how does C++ do threading then?" A: Non-standard approaches!) There are languages that come with high scalability approaches, but they're so different than C/C++ that they're dismissed out of hand. Erlang is supposedly a ninja at scalability.

      --
      I Browse at +4 Flamebait

      Open Source Sysadmin

  25. This isn't hardware by multimediavt · · Score: 2, Informative

    Why is this article labeled as hardware? Sure they talk about different procs being ... well, different. Duh! The article is about the software Tom and others developed to run processes more efficiently in a multi-core (and possibly heterogeneous) environment. Big energy savings as well as performance boost. Green computing. HELLO! Did you read page two?

    1. Re:This isn't hardware by Shikaku · · Score: 1

      Did you read page two?

      This is Slashdot

  26. Basic SMP/NUMA by Anonymous Coward · · Score: 0

    Can't see what the big news is; any single-socket multi-core system would look like a simple SMP, and a multi-socket one would have some NUMA characteristics. So affinity scheduling, locality- and behavior-aware memory allocation, and some interrupt fencing should create deterministic behavior :)

    Guess he should try OpenSolaris, been there, tried that and so forth :)

  27. But that is ugly by coryking · · Score: 1

    And OpenMP isn't "standard" as far as I'm concerned. Plus it makes you think about threading and it only works in low-level languages like C.

    I'm talking about this highly useful code (which is written in a bastardized version of C#, Perl and Javascript for your reading pleasure):


    List<string> pimpScores = PimpList.ThreadedMap(function(aPimp){
          # score how worthy this guy is at pimpin'
          if(aPimp.Hoes > 10) {
              return String.Format("Damn brother, {0} is a player", aPimp.PimpName);
          } else if (aPimp.Hoes > 0) {
              return String.Format("{0} is a small time player", aPimp.PimpName);
          } else {
              return String.Format("{0} isn't a player at all!", aPimp.PimpName);
          }
    });

    Look how easy it was to turn a transform like Map into something threaded (even though C# doesn't have Map... I forget what LINQ method does the same transform)

    OpenMP doesn't offer anything as intuitive as that. It makes you think long and hard about threading in a dull, dry manner. Threading is everywhere in our code if the program language makes it obvious and easy.
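
    For comparison, a threaded map along these lines can be sketched in Python with the standard library's ThreadPoolExecutor, whose map() returns results in input order. The rate function and the sample entries are placeholders, not anything from a real app:

```python
from concurrent.futures import ThreadPoolExecutor


def rate(entry):
    # Pure function of its input: no shared state, so it is safe to run in any order.
    name, points = entry
    if points > 10:
        return "{0} is a player".format(name)
    elif points > 0:
        return "{0} is a small time player".format(name)
    return "{0} isn't a player at all!".format(name)


entries = [("alice", 12), ("bob", 3), ("carol", 0)]

with ThreadPoolExecutor(max_workers=4) as pool:
    # map() farms the calls out to worker threads but yields results in input order.
    results = list(pool.map(rate, entries))

print(results)
```

    This only stays easy because rate() touches no shared data. For CPU-bound work in CPython, ProcessPoolExecutor is the usual substitute, since threads there do not run Python bytecode in parallel.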

  28. Well by Anonymous Coward · · Score: 0

    In the future, we'll probably have hundred- or thousand-core CPUs and we can dedicate 10% of them to "thinking" about how to use the remaining 90%.

  29. Reporter bias by symbolset · · Score: 2, Insightful

    Often, an issue presents that isn't reproducible in the presence of a tech support person who knows what he's doing.

    Sometimes it's a user error they don't want to admit, and so they won't reproduce it in front of somebody who knows they should not have done that.

    Sometimes it's just a glitch. Regardless, the best thing to do is smile and say "The bug must be afraid of me" and close the ticket.

    --
    Help stamp out iliturcy.
  30. Re:Linux and Windows by Anonymous Coward · · Score: 0

    Keep in mind that a paper presented at a conference is submitted many months before it is published/presented. Not sure when the deadline for the conference was, but I suspect the completely fair scheduler was not available at the time. (Double or triple the sentiment for publication in a journal.)

    AC

  31. I believe Rakim covered this story in 1986... by Anonymous Coward · · Score: 0

    Running server to server, duo core to core - I interrupt y'all, and bless the Tech for the quad.

    Yes, I realise Eric B. & Rakim references are entirely wasted on /.

  32. Did anyone else notice. . . by MagusSlurpy · · Score: 1

    . . .the tag "bang news" on a story involving researchers from Virginia Tech?

    --
    My sister opened a computer store in Hawaii. She sells C shells by the seashore.
  33. "Research" by Anonymous Coward · · Score: 0

    How is this research? You will find much higher quality research on implementing support for SMP for network stacks (that is YEARS OLD, look at the work of Alan Cox and his students) and a plethora of papers on scheduling on SMP among many other things. ANL-supported research has really gone down the drain.

    These guys are just stating the obvious with very ambiguous unscientific benchmarks and faulty metrics (no analysis of PMCs, etc...). It is surprising that these guys even hit slashdot, publicizing bad research only helps bad research continue.

  34. FPGA programming by sshore · · Score: 1

    One experiment went for a long time, and in the end when he analyzed the AI generated code, there were 5 paths/circuits inside that did nothing. If he disabled any or all of the 5 the overall design failed. Somehow, the AI found that creating these do nothing loops/circuits caused a favorable behavior in other parts of the FPGA that made for overall success.

    The author took the unusual step of disconnecting the clock for the FPGA, taking advantage of undefined behavior that depended on the unique electrical characteristics of the FPGA he used. Had he left the clock connected he'd likely have more portable results; however, he may not have arrived at the same results, since he'd be depending on discrete logic and not the unspecified, non-linear analog behavior.

    1. Re:FPGA programming by zappepcs · · Score: 1

      That's correct. My mind was fuzzy last night. Rereading it makes it very appropriate to this story though as it points out the minute variations in silicon/computers that is ignored by most software etc. as used today because of clocks etc. If the clock is not quite right, weird things can happen. Skynet was a clock failure?

    2. Re:FPGA programming by sshore · · Score: 1

      Rereading it makes it very appropriate to this story though as it points out the minute variations in silicon/computers that is ignored by most software etc. as used today because of clocks etc. If the clock is not quite right, weird things can happen.

      This story is more about how subtle differences in process-to-core mapping can result in real performance differences, rather than small differences in silicon. Kind of like the butterfly effect as it applies to computers.

      The FPGA thing was still an interesting article, though.

  35. Supercomputing Hits the Masses by David+Greene · · Score: 1

    Honestly, this stuff has been known in the HPC world for decades. What's interesting is that these troublesome bits are going to hit system-level and lower-level language programmers on everyday tasks. It's not clear to me how this stuff will affect higher-level programming, interpreted code, etc. It will almost certainly be a factor but I'm not sure there's much the programmer can do about it.

    Some of the fun things we have to look forward to at the commodity level:

    • Unsynchronized core interrupts
    • Cache bank conflicts among multiple threads
    • Various OS interactions / time to service system calls being different on different cores
    • Memory controller fairness issues

    These (and others) all fall into the general category of "induced load imbalance." They are things the programmer doesn't directly think about; things that happen as a result of system services, CPU architecture and stuff generally out of the control of the application programmer. This is all in addition to the stuff the programmer does have control over such as data layout and the amount of work given to each thread.

    Induced load imbalance is the primary reason that scaling to manycore is difficult. It requires a lot of OS work to reduce "OS jitter" to a level that is acceptable when running thousands of threads.
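
    One rough way to put a number on induced load imbalance is the ratio of the slowest thread's time to the mean, since a parallel region finishes only when its slowest thread does. A minimal Python sketch, with made-up timings and a made-up helper name:

```python
def imbalance(thread_times):
    """Slowest thread's time divided by the mean; 1.0 means perfectly balanced."""
    mean = sum(thread_times) / len(thread_times)
    return max(thread_times) / mean


balanced = [10.0, 10.0, 10.0, 10.0]
jittered = [10.0, 10.0, 10.0, 12.0]   # one thread delayed, e.g. by servicing interrupts

for times in (balanced, jittered):
    # Wall-clock time of the parallel region is set by the slowest thread.
    print(max(times), round(imbalance(times), 3))
```

    With thousands of threads, the chance that at least one of them gets hit by jitter in any given step approaches certainty, which is why the slowest-thread term matters so much at scale.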

    Here's an article on some of the scaling work HPC vendors have done with Linux.

  36. Reminds me of an issue... by GameboyRMH · · Score: 2, Interesting

    ...I had with an Asterisk VOIP server. Under certain conditions, calls transferred from one of two receptionist's phones were bouncing back and ending up at the wrong voicemail. Since only two phones had a problem I suspected it was something specific to these phones. After checking the configuration and even hardware on the phones, I checked the server. I narrowed the problem down to one macro (a macro in asterisk is basically a user-defined function) that allows a "fallback line" to ring if the first is busy, it seemed to be getting an argument for this line when there should have been none. Soon it became evident that the variable was changing "mid-macro", apparently out of nowhere (there are variables with special names that are used in macros to receive arguments, nowhere was this variable changed, the macro's less than 30 lines long). I eventually got so frustrated I put debugging lines in between every single line of the macro to make it print the variables to the output log. Then I narrowed it down to one line - one where a Dial() command is executed (this is the function that actually places the call, this function isn't supposed to even be able to change anything in the macro that called it, and there are no other problems like this). Now that had me totally stumped. I could demonstrate exactly what was happening but I couldn't figure out why. Stranger still, the results changed slightly with the debugging lines in place, as if it's a race condition of some sort.

    The problem still exists to this day :(

    --
    "When information is power, privacy is freedom" - Jah-Wren Ryel