How We'll Program 1000 Cores - and Get Linus Ranting, Again

Mutex lock by Anonymous Coward · 2015-01-01 19:05 · Score: 5, Funny

All others ended up in a mutex lock situation so I had a chance to do the first post

Re:Mutex lock by NoNonAlphaCharsHere · 2015-01-01 19:19 · Score: 4, Funny

Thanks a lot asshole, a lot of were busy-waiting while you were typing.
Re:Mutex lock by NoNonAlphaCharsHere · 2015-01-01 19:32 · Score: 5, Funny

I think I a word.

A lot of US were busy-waiting.
Re:Mutex lock by Z00L00K · 2015-01-01 21:01 · Score: 2

In any case - a multi-core machine can also handle multiple different tasks simultaneously; it's not always necessary to break down a single task into subproblems.
The future for computing will be to have a system that can adapt and avoid single resource contention as much as possible.

--
If builders built buildings the way programmers wrote programs, then the first woodpecker would destroy civilization.
Re:Mutex lock by TheRaven64 · 2015-01-01 21:16 · Score: 5, Funny

That's what happens when you try to write without a lock.

--
I am TheRaven on Soylent News
Re:Mutex lock by drinkypoo · 2015-01-02 01:47 · Score: 2

The core is already dozens of times faster than memory and thousands of times faster than storage
When you add more cores, you also can add more memory bandwidth, if you couple them closely to memory controllers. This is how multiprocessor PCs work today. Hell, even some processors with more cores in them have more memory buses, it's not just adding chips that gives you more bandwidth.

--
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
Re:Mutex lock by arth1 · 2015-01-02 04:44 · Score: 2

I use SSD, you insensitive clod!
Then after all blocks on the drive have been written to, you wait for a second while the drive moves data away and clears a sector so there's space to write to.
SSDs have far better average write speeds, but far worse worst case write speeds. Using them for anything timing critical without a battery backed up controller is asking for trouble.
"Use TRIM", I hear from the peanut gallery. Except that there are no RAID controllers (or software RAIDs) that actually support TRIM in practice. Nor does TRIM work for partitions where there is no file system support. Like raw database partitions or swap. Yep, put a single swap partition on the drive, and you will still be subject to the drive not knowing what blocks are free, and can't write to them unless asked to overwrite them.
For guaranteed rate I/O, spinning platter drives and pure battery-backed RAM disks are still the way to go. A RAID of short-stroked HDs has a worst case performance far better than modern SSDs, despite the average being much slower. For a desktop user, an occasional or rare "hiccup" of a second might not be noticeable or even a concern if it is, so SSDs are fine, and even great. For real-time data processing, it can very well be a big concern.

Pullin' a Gates? by Tablizer · 2015-01-01 19:11 · Score: 4, Interesting

"4 cores should be enough for any workstation"

Perhaps it's an over-simplification, but if it turns out wrong, people will be quoting that for many decades like they do Gates' memory quote.

--
Table-ized A.I.

Re:Pullin' a Gates? by cb88 · 2015-01-01 19:25 · Score: 2

It already is wrong...

Linux Workstation: 16cores = way faster builds than 4 cores.
CAD workstation: I imagine a lot of geometry processing is parallelized... the less waiting the better (either format conversion or generating demo videos etc. eat up a lot of CPU)
Video workstation: That's just a blatantly obvious use for multiple cores...
Linux HTPC: I wanna transcode stuff fast... more cores
Linux Gaming: These days using at least 4 cores is getting more common...

Things that I have often seen that are *broken*: for instance, 200MB work documents that hang the entire system when you scroll (yes Windows, that's bad). Linux isn't much better though; disk IO starvation is a long-time pet peeve there... 4 cores is the wrong place to draw the line currently; maybe 6-8 cores + improved disk IO would be a realistic ideal these days.

Granted, a lot of programs *ought* to run just fine on my Sparcstation LX @ 50MHz and 128MB RAM... but that isn't the future unless we have a nuclear apocalypse. Also, there is a good chance that a lot of my cores will sit idle; even so, power management is better than it used to be, and more cores can improve latency because now I have more available CPU time even though the individual cores are probably slower. Overall that's a good tradeoff.
Re:Pullin' a Gates? by bruce_the_loon · 2015-01-01 19:42 · Score: 4, Interesting

If you went and read Linus' rant, then you'll find you are actually reinforcing his argument. He says that except for a handful of edge use-cases, there will be no demand for massively parallel in end user usage and that we shouldn't waste time that could be better spent optimizing the low-core processes.
The CAD, video and HTPC use-cases are already solved by the GPU architecture and don't need to be re-solved by inefficient CPU algorithms.
Your Linux workstation would be a good example, but is a very low user count requirement and can be done at the compiler level and not the core OS level anyway.
Your Linux gaming machine shouldn't be using more than 3 or 4 cores of CPU and should be handing the heavy grunt work off to the GPU anyway. No need for a 64 core CPU for that one.
Redesigning what we're already doing successfully (a low number of controller/data-shifting CPU cores managing a large bank of dedicated rendering/physics GPU cores, plus task-specific ASICs for things like 10Gb networking and 6Gb IO interfaces) is pretty pointless. That is what Linus is talking about, not that we only need 4 cores and nothing else.

--
Trying to become famous by taking photos. Visit my homepage please.
Re:Pullin' a Gates? by jhol13 · 2015-01-01 19:50 · Score: 3, Insightful

Why not? Currently Firefox has problems rendering (loading) two pages simultaneously, although it should be able to handle tens, using several cores.
Same with Evince (which is crap anyway), it cannot do anything in parallel, should be able to use tens of cores.
Javascript? Although the language is the worst I have seen since APL, a smart compiler could at least in some cases parallelize it (maybe with speculative execution or like).
And so on.
It will turn out to be as wrong as "640k".
Re:Pullin' a Gates? by ls671 · 2015-01-01 20:07 · Score: 2

hmmm... Linus sounds right to me too. He specifically said, or almost, that people wanting to load 10 pages in sandboxed firefox process/thread in parallel could find a use for 16 cores ;-)

--
Everything I write is lies, read between the lines.
Re:Pullin' a Gates? by bloodhawk · 2015-01-01 20:15 · Score: 3, Insightful

Actually the quote is just an internet myth; at least, no one has ever found a source for it or anyone who even claims to have heard him say it, and Gates denies having said it as well.
Re:Pullin' a Gates? by davmoo · 2015-01-01 20:15 · Score: 2

Except that Bill Gates never actually said the so-called "quote" that is attributed to him.

--
I want a new quote. One that won't spill. One that don't cost too much. Or come in a pill.
Re:Pullin' a Gates? by Rei · 2015-01-01 20:24 · Score: 3, Interesting

Linus's argument basically boils down to, "Parallel algorithms are sorcery, and the only places they matter are applications that demand performance, which are indeed increasingly using parallelism".
Of course you don't need, say, a 50-threaded version of vi or alsamixer or whatever. But for apps that need performance, increasingly they have to get it from threading. And there's nothing "magical" about parallelism. Perhaps in Linus's dislike for C++ he's missed how trivially easy it's gotten to launch threads in C++11, but it takes less work now than a for-loop, since std::thread is so simple and you can inline the command with a lambda. And you have a nice clean mutex library including scoped mutexes like std::lock_guard so you don't even have to remember to unlock them.
It's quite true that having multiple cores needing to read from and write to the same chunk of memory isn't a good thing. But I'd bet you that only in under 5% or so of high performance apps is that the *only* level you can thread at. Because if you have say five nested levels of looping, 4 of them can be memory constrained, but so long as at least one can be threaded without heavy reads/writes on shared cache, you can thread to your heart's content with minimal adverse impact. And "heavy" is the key word. So long as you're not doing essentially *constant* heavy reads/writes on shared cache, the overhead cost is minimal.

--
If you play a Ke$ha song backwards, you hear messages from Satan. Even worse, if you play it forwards you hear Ke$ha.
Re:Pullin' a Gates? by Urkki · 2015-01-01 20:29 · Score: 5, Insightful

Why not? Currently Firefox has problems rendering (loading) two pages simultaneously, although it should be able to handle tens, using several cores.
Same with Evince (which is crap anyway), it cannot do anything in parallel, should be able to use tens of cores.
Javascript? Although the language is the worst I have seen since APL, a smart compiler could at least in some cases parallelize it (maybe with speculative execution or like).
And so on.
It will turn out to be as wrong as "640k".
Javascript is generally used in event driven manner, so it will perform quite well on a single core. Firefox having trouble loading multiple pages simultaneously should still be IO-bound, not CPU-bound, and if the engine has trouble, then it's an SW architecture problem where more cores will not really help.
Point of Linus was, taking a 6 core CPU, and replacing 2 cores with more cache and more transistors per core should make almost anything on Desktop run faster.
Re:Pullin' a Gates? by Urkki · 2015-01-01 20:41 · Score: 2

It already is wrong...
Linux Workstation: 16cores = way faster builds than 4 cores.
Did the 4 core CPU have 1/4th of the transistor count of the 16 core CPU? Then I'd expect it to be much slower of course. Point of Linus was, a 4 core CPU with same transistor count (used for more cache, better out-of-order execution logic, more virtual registers, and so on), as 16 core CPU will be faster on almost every task. So cores beyond 4 (the number Linus threw as the ballpark count) make sense only, if you really can not spend any more transistors in making those 4 cores faster, but still have die space to spare.
Re:Pullin' a Gates? by im_thatoneguy · 2015-01-01 20:54 · Score: 3, Interesting

It is a niche which will need specific algorithms tuned for the hardware (GPU or other) the pipeline must be kept busy to observe a performance gain. It doesn't scale to general purpose computing.
I feel like this is moving the goal posts. "You will never do massively parallel computing on a CPU because if it's massively parallel it's a GPU not a CPU."
Linus is 100% wrong. What's the "general purpose" computing that we all want? The NCC-1701D's main computer from Star Trek. If I say "Cortana/Siri/Google Now, please rough me out a flyer for our yard sale on Saturday," you're going to be looking at a massively parallel task for the neural networks to not only interpret the voice but then make sense of the words and finally produce a printable flyer suitable for hanging. Programming is still a really fancy version of "IF A THEN B". "for X in GROUP do Z". "X = Y". Yeah, if your application is incredibly serial then a serial processor is all that you'll need. When computing advances to the next phase of neural networks, AI and directed (not instructed) computing then it'll need to be more like our brain: massively parallel.
Now there are two obnoxious tautological arguments against this:
A) "That's not a "CPU" that's like a NeuroProcessorUnit, an NPU if you will"
B) "Yes we'll need a giant mainframe, but it'll be a server in the cloud!"
A is moving the goal posts. Just because the processor isn't an ARM or x86 instruction compatible chip doesn't mean it's not worthy of the label CPU. As mentioned above you can't say that there'll never be a CPU with massive parallelism because as soon as it has massive parallelism it's by definition no longer a CPU. B is just saying that nobody will have a need for computers because we'll have a giant mainframe. Which might be true but you just need a basic DSP not even a CPU if it's just a pure thin client transmitting a video, audio and input stream to the cloud for processing. In which case all of the CPUs in existence... need to be massively parallel AI processors.
Re:Pullin' a Gates? by TheRaven64 · 2015-01-01 21:19 · Score: 2

If you look at a typical web page, you have a load of images, a few iframes with ads, scripts (possibly with multiple web workers). Each one of those really wants to be a separate security domain. You don't want a vulnerability in libpng (something that has happened many times before) to be able to do anything other than break the single image that it's decoding. This kind of fine-grained security is a lot easier if you have the ability to have a load of cheap threads.

--
I am TheRaven on Soylent News
Re:Pullin' a Gates? by SuricouRaven · 2015-01-01 22:09 · Score: 3, Informative

If massive-neural nets do reach common use (Which isn't that likely, they are somewhat overhyped) then I'd expect to see specific accelerators designed to run them. Probably something like FPGAs: Software writes the net, hardware executes it. A general-purpose processor (Probably x64 or ARM) does the coordinating, but augmented by specialised or semi-specialised hardware for certain tasks. Very much as we have today with hardware acceleration of 3D graphics or video decoding.
You can see the trend already. 3D acceleration was introduced for graphics, but then repurposed for other things, and followed up with revised graphics architectures designed for non-graphics applications. They are still useless for general-purpose computing, their architecture too limited, but used in conjunction with a general processor they can greatly outperform the processor alone on things like image processing, cryptographic tasks, physics simulation and such. It's now quite common to see even consumer applications, with games using physics simulation to provide much more detailed rigid-body simulation than was previously possible - ie, more bits of shrapnel and chunks of corpse bouncing around when you lob that grenade.
As for neural nets, you probably won't see much need to simulate huge ones. Small ones work surprisingly well, and their applications are really quite limited - they aren't some magic AI bullet that turns into a functional mind if you make them big enough. They excel at classification tasks, so they are very handy in OCR, handwriting recognition, speech recognition and such. Google made one that can recognise cats, and if you can recognise cats then you can recognise other things, so straight away I'm seeing applications in web filter software.
Re:Pullin' a Gates? by TheRaven64 · 2015-01-01 23:45 · Score: 2

First, that's with a single thread and a single security context. If each one is an isolated sandbox it's not the case (trust me on this: it's my research area and we've done a lot of benchmarking). Second, even if it were true, it would be a lot less power efficient. If you can parallelise your workload, then two 1.5GHz cores will use less power than one 3GHz one. Four 750MHz cores will use less still.
Until a few years ago, most computers had a single core, so there wasn't much point trying to exploit parallelism and the fastest way of implementing many problems was to serialise them. That's no longer an automatic win.
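The power claim above can be made concrete with the usual first-order model of CMOS dynamic power. This is only a rough sketch: it assumes supply voltage has to scale about linearly with clock frequency, ignores static (leakage) power, and assumes the workload splits perfectly across cores. None of those assumptions come from the comment itself.

```latex
% First-order dynamic power: activity factor \alpha, switched capacitance C,
% supply voltage V, clock frequency f.
\[
P_{\text{dyn}} \approx \alpha C V^{2} f,
\qquad V \propto f \;\Rightarrow\; P_{\text{dyn}} \propto f^{3}
\]
% Same total clock throughput, spread over more, slower cores:
\[
\frac{P(2 \text{ cores at } f/2)}{P(1 \text{ core at } f)}
  \approx \frac{2\,(f/2)^{3}}{f^{3}} = \frac{1}{4},
\qquad
\frac{P(4 \text{ cores at } f/4)}{P(1 \text{ core at } f)}
  \approx \frac{4\,(f/4)^{3}}{f^{3}} = \frac{1}{16}
\]
```

Real chips cannot drop voltage indefinitely, so the actual savings are smaller than these ratios, but the direction matches the 3GHz vs 2 x 1.5GHz vs 4 x 750MHz example.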

--
I am TheRaven on Soylent News
Re:Pullin' a Gates? by HuguesT · 2015-01-02 00:46 · Score: 2

Thanks, interesting document, found here. The audio is really bad at the beginning and fluctuates throughout the talk. The interesting bit that you refer to is at 21 minutes from the start.
I'm trying to type in what he said directly from the audio:

The 16-bit design gave us a megabyte of memory. The 8086 has a 20-bit address. It is really a segmented 16-bit data path with segment registers that are really indexes. It is a 1-MB address space. And in this original design I took the upper 384K and tied it to a certain amount to provide for memory video, the ROM and I/O. And that left 640K for general purpose memory. And that leads to today's situation where people talk about the 640K barrier. The limit to how much memory you can put to these machines. I have to say that in 1981 while making those decisions I felt like I was providing enough freedom for 10 years. That is, a move from 64K to 640K felt like something that would last a great deal of time. Well, it didn't. It took only 6 years before people started to see that as a real problem.
Fortunately, there is a reasonable solution. Intel has moved forward with its chip families, the 286 chip introduced in 1984 moves us to a 24-bit address space (mumbles about segmented indirection, being not that good). That is sort of an intermediate milestone. In 1986 we moved up to the 386 where we get a full 32-bit offset to these segments that have been designed in this architecture. So what we have is a machine that can address 4GB of RAM. And I have to say with all honesty, I believe that it will take us more than 10 years to use up that address space.
So he never makes that exact quote, however one can understand why people picked it up. Essentially, BG thought in 1981 640K would be enough for everybody for a long while. Note that he was reasonably prudent regarding using up the 32-bit address space (that ship has sailed now).
Later, regarding memory, he says that computers should have about 1MB of RAM per MIPS. Specifically, he goes on to saying machines with 30-60MB of RAM should be desirable soon (in 1989).
In this talk he talks about many things, most are pretty insightful in fact: OS design, multitasking, parallelization, multi-processor designs, dynamic linking, object-oriented design. Funnily he talks at length about OS2 in a very positive way. This was before Windows 3 of course. He compares OS2 and Unix, saying that OS2 will take over the desktop and Unix the servers, and all other OSes will die out. He talks about the FSF, saying its task of creating a free Unix-like OS is doomed.
Some interesting comments on that talk here.
Re: Pullin' a Gates? by Anonymous Coward · 2015-01-02 00:47 · Score: 2, Insightful

There has been a push back against integrating ANNs into mobile platforms. I think low power real time classification is simply missing an application in the mass market that can't be solved by off loading to a server. We simply assume that we are continuously connected to a sufficiently large data pipe and the problem goes away. Whether the hardware changes on the server side or not is a question of power savings, but I doubt we will see gains in performance over software implemented on server farms.
That said, if we put our future caps on, is there a point when the amount of data our electronics gather becomes so large that pushing it into the cloud is cost- and time-prohibitive? If wearable electronics become a pervasive technology, we may need some on-board, continuously learning classifier cores to fuse sensor data locally rather than sending raw data into the cloud. This is where we could see truly assistive computing without the creepier general intelligence Hassabis and crew are working on at DeepMind.
Imagine you have a conversation with your wife and she says the kids need to be picked up at 4 on Tuesday. If my phone put a reminder on my calendar for me based on my continuous audio stream, the mental offload would be huge as I could seamlessly continue with my day without managing my calendar, but I don't want to continuously stream my audio to Google nor do they want to continuously process the sound of me typing and sipping coffee... That's what we have the NSA for.
Re:Pullin' a Gates? by Half-pint+HAL · 2015-01-02 02:05 · Score: 2

Point of Linus was, taking a 6 core CPU, and replacing 2 cores with more cache and more transistors per core should make almost anything on Desktop run faster.
There's an element of truth to this, but on the other hand, cache space is already big enough that it hits the law of diminishing returns. Yes, the biggest performance hits in current computing are cache misses. But cache misses are already unexpected events, and cache misses are of biggest concern to the user when there are lots of them at once -- ie when iterating through a large bit of data. Text searches on large documents in a complex format (eg MS Word). Making a global change to a large file. These are the situations where performance matters, and these are the situations where you're going to get cache misses. Torvalds dismisses photo editing as a task for "professional photographers", but our amateur cameras are taking phenomenally detailed pictures, and even making fairly simple edits is a compute-intensive task. He may be right, but he may equally be wrong.

--
Got them moderator blues I blieve I walk out the do', With these mod-points I been gettin', I 'most never post no mo'
Re:Pullin' a Gates? by Anonymous Coward · 2015-01-02 03:18 · Score: 2, Interesting

Perhaps in Linus's dislike for C++ he's missed how trivially easy it's gotten to launch threads in C++11, but it takes less work now than a for-loop, since std::thread is so simple and you can inline the command with a lambda. And you have a nice clean mutex library including scoped mutexes like std::lock_guard so you don't even have to remember to unlock them.
He doesn't mention C++ or anything like that. What he is talking about is that the overhead for task switching is pretty large, so in cases where a tradeoff is made between the performance of a single core and adding more cores to a CPU, you will typically get more performance gain by having fewer, better cores, since the tasks most users do most of the time are of a nature that doesn't lend itself to parallelization. In those cases where it is easily done, it is already delegated to dedicated hardware like the GPU.
For your typical for-loop that is so easy to launch threads for, the problem is that the overhead for moving the task to another core with another cache is so high that you don't get a performance gain. There are still cases where it makes sense to launch threads, but people who do it without thinking because "parallel is better" are the kind of programmers who jump on every new programming fad.
Re:Pullin' a Gates? by Rei · 2015-01-02 04:41 · Score: 2

If your loop contains 10 instructions and loops 5 times
Duh. And that's obviously not what is being discussed here. Step up a level or 20 in the call stack.

If that "largely" means you need a lock,
"largely" meaning "does a bunch of stuff on its own and only briefly needs to lock common data structures to update based on the results of what it's been doing". That is by far the most common case in the real world. If you have a texture loading thread for a game it only needs to briefly lock the texture structure when it's gotten its latest texture loaded and processed. If you have a mesh tweaking function for a 3d editor it only needs to lock the list of meshes briefly to swap out its newly tweaked version for the old version. And on and on and on. The most common case doesn't involve locking to wait for calculations to be done from the other side, it just needs to lock briefly to make sure it doesn't read an incomplete state when the results of calculations are being written out.
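
To make the "work alone, lock briefly to publish" pattern concrete, here is a minimal sketch in Python. The names (`process_texture`, `load_textures`, the `loaded` dict) are made up for illustration and are not from any real engine. It only shows the locking pattern: in CPython the GIL means this will not actually speed up CPU-bound work.

```python
import threading

def process_texture(texture_id):
    # Hypothetical stand-in for the expensive, independent work (decode, resize, ...).
    return sum(i * i for i in range(1000)) + texture_id

def load_textures(texture_ids, num_workers=4):
    loaded = {}                      # shared structure, read by the "main" side
    loaded_lock = threading.Lock()   # protects `loaded` and nothing else

    def worker(ids):
        for tid in ids:
            result = process_texture(tid)   # heavy part: no lock held
            with loaded_lock:               # lock held only for the quick publish
                loaded[tid] = result

    # Give each worker its own slice of the ids so they never share inputs.
    chunks = [texture_ids[i::num_workers] for i in range(num_workers)]
    threads = [threading.Thread(target=worker, args=(chunk,)) for chunk in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return loaded

textures = load_textures(list(range(20)))
```

The lock is never held across the computation, only across the dictionary write, so contention stays proportional to how often results are published rather than to how long each one takes to produce.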

--
If you play a Ke$ha song backwards, you hear messages from Satan. Even worse, if you play it forwards you hear Ke$ha.

Linus should try git by MichaelSmith · 2015-01-01 19:14 · Score: 3, Funny

...a tool which he may have heard of. It does connectionless, distributed data management, totally without locks.

--
http://michaelsmith.id.au

Re:Linus should try git by phantomfive · 2015-01-01 19:45 · Score: 3, Informative

In his post, Linus was talking about single, desktop computers, not distributed servers. He specifically said that he could imagine a 1000 core computer might be useful in the server room, but not for a typical user. So if you're going to criticize him, at least criticize what he said.

Also, git is not totally without locks. Try seeing if you can commit at the same time as someone else. It can't be done, the commits are atomic.

--
"First they came for the slanderers and i said nothing."
Re:Linus should try git by MichaelSmith · 2015-01-01 20:46 · Score: 2

My point is that git knows how to merge. It knows when a merge is required, when it is not, and when it can be done automatically. If you design your data structures properly, the same behaviour can be used in massively parallel systems.
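
One way to picture "design your data structures so they merge" is a grow-only counter where each worker writes only its own slot and replicas are combined by taking per-slot maximums. This is a sketch of that general idea, not of how git itself merges; the worker names and helper functions are invented for the example.

```python
def increment(replica, worker, amount=1):
    """Return a new replica where `worker` has bumped only its own slot."""
    updated = dict(replica)
    updated[worker] = updated.get(worker, 0) + amount
    return updated

def merge(a, b):
    """Combine two replicas by per-worker maximum.

    The result does not depend on merge order and merging twice changes nothing,
    so replicas can be exchanged at any time without a lock.
    """
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}

def value(replica):
    """Total count across all workers."""
    return sum(replica.values())

base = {}
a = increment(increment(base, "core-0"), "core-0")   # core-0 counts 2 on its copy
b = increment(base, "core-1", 5)                     # core-1 counts 5 on its copy
merged_ab = merge(a, b)
merged_ba = merge(b, a)
```

Because no two workers ever write the same key, there is never a conflict to resolve, which is the "merge can be done automatically" case in the comment above.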

--
http://michaelsmith.id.au

How parallel does a Word Processor need to be? by Nutria · 2015-01-01 19:42 · Score: 3, Interesting

Or a spreadsheet? (Sure, a small fraction of people will have monster multi-tab sheets, but they're idiots.)
Email programs?
Chat?
Web browsers get a big win from multi-processing, but not parallel algorithms.

Linus is right: most of what we do has limited need for massive parallelization, and the work that does benefit from parallelization has been parallelized.

--
"I don't know, therefore Aliens" Wafflebox1

Re:How parallel does a Word Processor need to be? by maccodemonkey · 2015-01-01 22:03 · Score: 2

Or a spreadsheet? (Sure, a small fraction of people will have monster multi-tab sheets, but they're idiots.)
Email programs?
Chat?
Web browsers get a big win from multi-processing, but not parallel algorithms.
Linus is right: most of what we do has limited need for massive parallelization, and the work that does benefit from parallelization has been parallelized.
This is kind of silly. Rendering, indexing and searching get pretty easy boosts from parallelization. That applies to all three cases you've listed above. Web browsers especially love tiled parallel rendering (very rarely these days does your web browser output get rendered into one giant buffer), and that can apply to spreadsheets too.
A better question is how much parallelization we need for the average user. While the software algorithms should nicely scale to any reasonable processor/thread count, on the hardware side you do have to ask how many cores we really need, especially since a lot of users are happy right now. But targeting these sorts of operations as a single thread is also the entirely wrong approach. It's not power efficient for mobile users, and it drastically limits the gains your code will see on new hardware, while competing source bases pass you up.

Bad summary, shocking by Urkki · 2015-01-01 20:06 · Score: 5, Interesting

Linus doesn't so much say that parallelism is useless, he's saying that more cache and bigger, more efficient cores is much better. Therefore, increased number of cores at the cost of single core efficiency is just stupid for general purpose computing. Better just stick more cache to the die, instead of adding a core. Or that is how I read what he says.

I'd say, number of cores should scale with IO bandwidth. You need enough cores to make parallel compilation be CPU bound. Is 4 cores enough for that? Well, I don't know, but if the cores are efficient (highly parallel out-of-order execution) and have large caches, I'd wager IO lags far behind today. Is IO catching up? When will it catch up, if it is? No idea. Maybe someone here does?

Re:Core of the article by imgod2u · 2015-01-01 20:25 · Score: 3, Insightful

The idea isn't that the computer ends up with an incorrect result. The idea is that the computer is designed to be fast at doing things in parallel with the occasional hiccup that will flag an error and re-run in the traditional slow method. How much of a window you can have for "screwing up" will determine how much performance you gain.

This is essentially the idea behind transactional memory: optimize for the common case where threads that would use a lock don't actually access the same byte (or page, or cacheline) of memory. Elide the lock (pretend it isn't there), have the two threads run in parallel and if they do happen to collide, roll back and re-run in the slow way.
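
A software analogy of that "run optimistically, re-run the slow way on collision" idea might look like the sketch below. It is not hardware transactional memory: the class `VersionedCell`, its version counter and the short internal lock (which stands in for the hardware's atomic commit) are all assumptions made for illustration.

```python
import threading

class VersionedCell:
    """A value plus a version number that changes on every committed write."""

    def __init__(self, value):
        self.value = value
        self.version = 0
        self._lock = threading.Lock()   # held only for tiny read/commit steps

    def snapshot(self):
        with self._lock:
            return self.value, self.version

    def try_commit(self, seen_version, new_value):
        with self._lock:
            if self.version != seen_version:
                return False            # someone else committed first: collision
            self.value = new_value
            self.version += 1
            return True

    def update_locked(self, fn):
        # Slow path: do the whole computation while holding the lock.
        with self._lock:
            self.value = fn(self.value)
            self.version += 1
            return self.value

def optimistic_update(cell, fn, max_attempts=3):
    for _ in range(max_attempts):
        value, seen_version = cell.snapshot()
        new_value = fn(value)           # speculative work, no lock held
        if cell.try_commit(seen_version, new_value):
            return new_value
    return cell.update_locked(fn)       # too many collisions: fall back

def run_demo(num_threads=8, increments=100):
    cell = VersionedCell(0)

    def worker():
        for _ in range(increments):
            optimistic_update(cell, lambda v: v + 1)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cell.value

final_value = run_demo()
```

When threads rarely touch the same cell, almost every update commits on the first attempt and the slow path is never taken, which is the common case the comment says you should optimize for.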

We see this concept play out in many parts of hardware and software algorithms actually. Hell, TCP/IP is built on having packets freely distribute and possibly collide/drop with the idea that you can resend it. It ends up speeding up the common case: that packets make it to their destination along 1 path.

Torvalds is half right by popo · 2015-01-01 20:25 · Score: 5, Insightful

The problem is that Linus is discussing two different things at once and so it sounds like he's making a more inflammatory point than he is.

The issue is not whether parallelism is uniformly better for all tasks. The question is, is parallelism better for some tasks. And as Torvalds points out, those tasks do exist (Graphics being an obvious one).

The nature of the workload required for most workstations is non-uniform processing of large quantities of discrete, irregular tasks. For this, parallelism (as Torvalds correctly notes) is likely not the most efficient approach. To pretend that in some magical future, our processing needs can be homogenized into tasks for which parallel computing is superior is to make a faith-based prediction on how our use of computers will evolve. I would say that the evidence is quite the opposite: That tasks will become more discrete and unique.

Some fields though (finance, science, statistics, weather, medicine, etc.) are rife with computing tasks which ARE well suited to parallel computing. But how many of those tasks happen on workstations? Not many, most likely. So Linus' point is valid.

But I have to take issue with Linus's tone, in which he downplays "graphics" as being a rather unimportant subset of computing tasks. It's not "graphics". It's "GRAPHICS". That's not a small outlier of a task. Wait until we're all wearing ninth generation Oculus headsets... the trajectory of parallel processing requirements for graphics is already becoming clear -- and it's stratospheric. The issue is this: Our desktop processing requirements are actually slowing and as Linus points out, are probably ill-suited for increased parallelism. But our graphics requirements may be nearly infinite.

Unlike other fields of computing, we know where graphics is going 20 years from now: It's going to the "holodeck".

Keep working on parallel computing guys. Yes, we need it.

--
------ The best brain training is now totally free : )

Re:Torvalds is half right by Anonymous Coward · 2015-01-02 00:44 · Score: 4, Informative

AMD have a line of CPUs very much like this, the A Series. It has several conventional multi-purpose x86-64 cores for general-purpose use and a Graphics Processing Unit built-in for those embarrassingly-parallel floating-point operations. Best of all, they're very cheap and perform very well.
Re:Torvalds is half right by Anonymous Coward · 2015-01-02 01:09 · Score: 2, Insightful

No, that's not faith. That's an economic argument. I know that many tasks which are considered practically non-parallelizable today can in fact be parallelized. We don't do that today because the additional work doesn't pay off when massive multicore systems are not yet available or not yet capable of running general purpose code. Often it's just a matter of getting the right tools, but sometimes you need to look at problems again and solve them in a different way. With new algorithms and new tools, you will make use of many cores, because if you don't, you will be left in the dust by the people who do.
Re:Torvalds is half right by Half-pint+HAL · 2015-01-02 01:55 · Score: 2

In essence, it's already done that way. A System-on-a-chip (SoC) typically has a couple of general-purpose cores, along with sound and video processors. In a full-sized PC, the graphics processing is usually taken to another chip -- in fact another circuit board entirely. Because most of the work the graphics processor (=GPU) does is largely independent of the main processor (=CPU) (the CPU pushes in the data, says "do X with it", the GPU then churns away through the data) it doesn't need to be closely linked or share a lot of memory. In fact, it's more efficient for them not to share memory, as then they're not getting in each other's way.
Expanding that system for more types of semi-general-purpose cores would get rather complicated.

--
Got them moderator blues I blieve I walk out the do', With these mod-points I been gettin', I 'most never post no mo'
Re: Torvalds is half right by Half-pint+HAL · 2015-01-02 01:56 · Score: 4, Informative

Verification is the process of checking that software works correctly. The more complex the system, the more complex the process of verification. Rather unfair of the GP to throw that in as a single word after you explicitly said that you're not a computer scientist.

--
Got them moderator blues I blieve I walk out the do', With these mod-points I been gettin', I 'most never post no mo'
Re:Torvalds is half right by Lunix+Nutcase · 2015-01-02 02:54 · Score: 2

Why not design multi-purpose chips that have some cores optimized for some tasks, and other cores optimized for others
We do have those. Any CPU with an iGPU is such a chip. We've had such CPUs for years and years now. Have you missed out on the last decade of CPU design?

No locks by ShakaUVM · 2015-01-01 20:41 · Score: 2

Ungar's idea (http://highscalability.com/blog/2012/3/6/ask-for-forgiveness-programming-or-how-well-program-1000-cor.html) is a good one, but it's also not new. My Master's is in CS/high performance computing, and I wrote about it back around the turn of the millennium. It's often much better to have asymptotically or probabilistically correct code rather than perfectly correct code when perfectly correct code requires barriers or other synchronizing mechanisms, which are the bane of all things parallel.

In a lot of solvers that iterate over a massive array, only small changes are made at one time. So what if you execute out of turn and update your temperature field before a -.001C change comes in from a neighboring node? You're going to be close anyway. The next few iterations will smooth out those errors, and you'll be able to get far more work done in a far more scalable fashion than if you maintain rigor where it is not exactly needed.
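
To make the barrier-free idea concrete, here is a minimal sketch, not anything from Ungar's talk or a real solver. It assumes a made-up 1D temperature field with fixed end temperatures; the function name relax_without_barriers and all of its parameters are hypothetical. Each worker keeps relaxing its own slice and never waits for its neighbours, so it may read values that are a sweep or two stale. Relaxed atomics are used only to keep the concurrent accesses well defined in C++; they impose no ordering between sweeps.

```cpp
#include <algorithm>
#include <atomic>
#include <cmath>
#include <cstddef>
#include <thread>
#include <vector>

// Relaxes a 1D field toward its steady state with no barrier between sweeps.
// Assumes cells >= 3 and workers >= 1 (illustrative constraints).
std::vector<double> relax_without_barriers(std::size_t cells, double left,
                                           double right, unsigned workers,
                                           double tolerance)
{
    const auto relaxed = std::memory_order_relaxed;
    std::vector<std::atomic<double>> field(cells);
    for (auto& cell : field) cell.store(0.0, relaxed);
    field.front().store(left, relaxed);
    field.back().store(right, relaxed);

    std::atomic<bool> done(false);

    // Each worker sweeps its own slice in place until told to stop.
    // Neighbouring values may be stale; nothing here waits for them.
    auto sweep_slice = [&](std::size_t begin, std::size_t end) {
        while (!done.load()) {
            for (std::size_t i = begin; i < end; ++i) {
                const double l = field[i - 1].load(relaxed);
                const double r = field[i + 1].load(relaxed);
                field[i].store(0.5 * (l + r), relaxed);
            }
        }
    };

    std::vector<std::thread> pool;
    const std::size_t interior = cells - 2;
    for (unsigned w = 0; w < workers; ++w)
        pool.emplace_back(sweep_slice, 1 + interior * w / workers,
                          1 + interior * (w + 1) / workers);

    // The calling thread only watches the residual; it never pauses the workers.
    for (;;) {
        double worst = 0.0;
        for (std::size_t i = 1; i + 1 < cells; ++i) {
            const double avg = 0.5 * (field[i - 1].load(relaxed) +
                                      field[i + 1].load(relaxed));
            worst = std::max(worst, std::fabs(field[i].load(relaxed) - avg));
        }
        if (worst < tolerance) break;
        std::this_thread::yield();
    }
    done.store(true);
    for (auto& t : pool) t.join();

    std::vector<double> result(cells);
    for (std::size_t i = 0; i < cells; ++i) result[i] = field[i].load(relaxed);
    return result;
}
```

Calling relax_without_barriers(32, 0.0, 100.0, 4, 1e-9) should settle close to a straight line between the two end temperatures. The order in which updates land differs from run to run, which is exactly the "close anyway" trade being described.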

Re:Programs people want to use... by Rei · 2015-01-01 20:48 · Score: 3, Insightful

Indeed. There's tons of CPU-intensive tasks that need to be done in a modern computer game, but they're typically done as:

while (true)
{
do_task_1();
do_task_2();
( ... )
do_task_N();
}

Rather than...

std::thread([&](){ while (true) do_task_1(); }).detach();
std::thread([&](){ while (true) do_task_2(); }).detach();
( ... )
std::thread([&](){ while (true) do_task_N(); }).detach();

... or similar. Because in C and older versions of C++ launching a thread takes significant typing and ugly code, up to and including - in the case of the same function threaded a variable number of times in a loop with more than a trivial argument - having to have a memory-managed threadsafe container to hold your arguments (and in C you don't have STL containers, you have to do all that work yourself too). It's not the end of the world to have to code threads in C or earlier C++, but it's enough work that programmers usually don't do it any more than they're pretty much forced to. "Okay, my game will literally run at half the speed if I don't thread this function" - fine, they'll thread it. But "this function call eats up 3% of my performance, this one 6%, this one 4%, this one 2.5%, this one 3.5%...."? Usually such functions just get stuck into one big main loop.

I really hope with how easy it's gotten in C++11 that more people will make better use of threads. In the first example code, not only do you relegate all of your tasks to the same core, thus hitting performance, but if any one task hangs, all of them hang. It's a terrible approach, but it's the most common. The only case where threads aren't good is where you're doing heavy concurrent reads/writes to the same cached data, but in real world apps there's almost always a level where you can launch the thread where this isn't the case, if it's even an issue to begin with in your particular application. The presumption that concurrent access to cached memory will usually or always be a problem (which seems to be Linus's presumption) requires that A) your threads not be doing the majority of their work on thread-local memory, AND B) that the shared data area being read from / written to concurrently is small enough to be cached, AND C) you can't just migrate your threads up in scope N levels to work around any such issue.
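
For the harder case mentioned above (the same function threaded a variable number of times in a loop, with more than a trivial argument), a C++11 version might look roughly like the sketch below. The Job struct and run_jobs are hypothetical names made up for illustration. Unlike the detached threads above, this one joins before returning, so the data the threads reference cannot go out of scope underneath them.

```cpp
#include <cstddef>
#include <numeric>
#include <string>
#include <thread>
#include <vector>

// Hypothetical non-trivial per-task argument.
struct Job {
    std::string name;
    std::vector<int> samples;
};

// Launches one thread per job; each thread writes only to its own result slot.
std::vector<long> run_jobs(const std::vector<Job>& jobs)
{
    std::vector<long> results(jobs.size(), 0);
    std::vector<std::thread> threads;
    threads.reserve(jobs.size());

    for (std::size_t i = 0; i < jobs.size(); ++i) {
        // i is captured by value so every thread keeps its own index.
        // jobs and results are captured by reference, which is safe here
        // only because every thread is joined before this function returns.
        threads.emplace_back([&jobs, &results, i] {
            results[i] = std::accumulate(jobs[i].samples.begin(),
                                         jobs[i].samples.end(), 0L);
        });
    }

    for (auto& t : threads) t.join();
    return results;
}
```

No hand-rolled threadsafe argument container is needed: the lambda capture carries the per-thread state, and the vector of std::thread objects handles the bookkeeping.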

--
If you play a Ke$ha song backwards, you hear messages from Satan. Even worse, if you play it forwards you hear Ke$ha.

Re:Core of the article by Rei · 2015-01-01 21:06 · Score: 2

There are cases where getting exactly the right answer doesn't matter - real-time graphics is a good example. It's amazing the level of error you can have on an object if it's flying quickly past your field of view and lots of things are moving around. In "The Empire Strikes Back" they used a bloody potato and a shoe as asteroids and even Lucas didn't notice.

That said, it's not the general case in computing that one can tolerate random errors. Nor is the concept of tolerating errors anything new. Programmers have been using for example approximations for square roots for a long, long time to save compute cycles where precision takes a back seat to "just get the shape of the curve roughly right". There's even a number of lower-precision hardware math methods.

--
If you play a Ke$ha song backwards, you hear messages from Satan. Even worse, if you play it forwards you hear Ke$ha.

Shi's Law, Gustafson's Law, Amdahl's Law by amplesand · 2015-01-01 21:34 · Score: 3, Insightful

Shi's Law

http://developers.slashdot.org...

http://spartan.cis.temple.edu/...

http://slashdot.org/comments.p...

"Researchers in the parallel processing community have been using Amdahl's Law and Gustafson's Law to obtain estimated speedups as measures of parallel program potential. In 1967, Amdahl's Law was used as an argument against massively parallel processing. Since 1988 Gustafson's Law has been used to justify massively parallel processing (MPP). Interestingly, a careful analysis reveals that these two laws are in fact identical. The well publicized arguments were resulted from misunderstandings of the nature of both laws.

This paper establishes the mathematical equivalence between Amdahl's Law and Gustafson's Law. We also focus on an often neglected prerequisite to applying the Amdahl's Law: the serial and parallel programs must compute the same total number of steps for the same input. There is a class of commonly used algorithms for which this prerequisite is hard to satisfy. For these algorithms, the law can be abused. A simple rule is provided to identify these algorithms.

We conclude that the use of the "serial percentage" concept in parallel performance evaluation is misleading. It has caused nearly three decades of confusion in the parallel processing community. This confusion disappears when processing times are used in the formulations. Therefore, we suggest that time-based formulations would be the most appropriate for parallel performance evaluation."
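
For readers who have not seen the two laws side by side, the usual textbook forms are sketched below. The notation is an assumption and is not taken from the quoted paper: N is the processor count, s is the serial fraction of the single-processor run time, and s' is the serial fraction of the N-processor run time.

```latex
% Amdahl: s is measured on the serial run
S_{\text{Amdahl}}(N) = \frac{1}{s + \dfrac{1 - s}{N}}

% Gustafson: s' is measured on the parallel run
S_{\text{Gustafson}}(N) = s' + (1 - s')\,N

% Both fractions describe the same program when
s = \frac{s'}{s' + (1 - s')\,N}

% Substituting into Amdahl's form:
s + \frac{1 - s}{N}
  = \frac{s' + (1 - s')}{s' + (1 - s')\,N}
  = \frac{1}{s' + (1 - s')\,N}
\quad\Longrightarrow\quad
S_{\text{Amdahl}}(N) = s' + (1 - s')\,N = S_{\text{Gustafson}}(N)
```

The two "serial percentages" are different numbers for the same run, which is one way to read the abstract's complaint that the concept is misleading.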

.

Re:i'm so tired of political correctness by goarilla · 2015-01-01 21:50 · Score: 2

And some of us just grew up in the sort of nuclear family where offensive expletives are the norm.

Re:weird by serviscope_minor · 2015-01-01 22:05 · Score: 2

The central claim of Linus seems to be that there are many people out there who claim an efficiency increase by parallelism.

They do, and to an extent they are correct.

On CPUs that have high single thread performance, there is a lot of silicon devoted to that. There's the large, power hungry, expensive out of order unit, with its large hidden register files and reorder buffers.

There's the huge expensive multipliers which need to complete in a single cycle at the top clock speed and so on.

If you dispense with that and replace it all with simple, in order, highly pipelined ALUs, you can fit an awful lot more raw arithmetic performance in a given area of silicon.

So it is much, much more efficient (at certain workloads). The trouble is getting good use out of a huge wodge of simple cores. That's what GPUs do: the cores are simple and wide, but the problem of filling them is "solved" by limiting the workload to something very regular. The result is something vastly more efficient than a general purpose CPU... for those workloads.

The flops/W of a GPU are very much in excess of a CPU's. Great, if you can use them.

Personally, I still want to have time to play with those AMD HSA chips, they put the cores of both types on the same side of the cache and MMU. Much more like a tightly coupled co-processor then.

--
SJW n. One who posts facts.

Poor slashdot... by Anonymous Coward · 2015-01-01 22:32 · Score: 3, Insightful

Few are actually people with a real engineering background anymore.

What Linus means is:
- Moore's law is ending (go read about mask costs and feature sizes)
- If you can't geometrically scale transistor counts, you will be transistor count bound (Duh)
- therefore you have to choose what to use the transistors for
- anyone with a little experience with how machines actually perform (as one would have to admit Linus does) will know that keeping execution units running is hard.
- since memory bandwidth has nowhere near scaled with CPU appetite for instructions and data, cache is already a bottleneck

Therefore, do instruction and register scheduling well, have the biggest on die cache you can, and enough CPUs to deal with common threaded workflows. And this, in his opinion, is about 4 CPUs in common cases. I think we may find that his opinion is informed by looking at real data of CPU usage on common workloads, seeing as how performance benchmarks might be something he is interested in. In other words, based in some (perhaps ad hoc) statistics.

Re:i'm so tired of political correctness by Attila+Dimedici · 2015-01-01 23:00 · Score: 4, Insightful

No, "political correctness" is a thing. It is where someone gets in trouble for using the word "niggardly" because it sounds like another word.

--
The truth is that all men having power ought to be mistrusted. James Madison

Linus is right by gweihir · 2015-01-01 23:15 · Score: 3, Insightful

Nothing significant will change this year or in the next 10 years in parallel computing. The subject is very hard, and that may very well be a fundamental limit, not one requiring some kind of special "magic" idea. The other problem is that most programmers have severe trouble handling even classical, fully-locked, code in cases where the way to parallelize is rather clear. These "magic" new ways will turn out just as the hundreds of other "magic" ideas to finally get parallel computing to take off: As duds that either do not work at all, or that almost nobody can write code for.

Really, stop grasping for straws. There is nothing to be gained in that direction, except for a few special problems where the problem can be partitioned exceptionally well. CPUs have reached a limit in speed, and this is a limit that will be with us for a very long time, and possibly permanently. There is nothing wrong with that, technology has countless other hard limits, some of them centuries old. Life goes on.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.

Re:The single most significant sentence.. by Shados · 2015-01-02 01:48 · Score: 2

I remember an issue I had a few months ago... we were doing some image processing using HTML canvas element on a web app... Then we wanted a nightly job to use the same code, so we whip out a node.js script. Once it was done, to make sure it worked the same way, we compared the result...

They were different. Spent 2 days trying to debug it (they were using the same code for the most part, wtf?).

At the time, I didn't know about canvas fingerprinting (http://en.wikipedia.org/wiki/Canvas_fingerprinting). Most of the time, different computers will generate equivalent, but different at the binary level, images from HTML canvas.

And there's always the good old floating point operations. ie: 0.2 * 3 = 0.6000000000000001

So it's already everywhere, just not everywhere enough that we've been forced to deal with it (those things are usually just an afterthought and end up as bugs). Soon, they won't be.
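
The floating point point is easy to reproduce, and it matters for parallel code because a parallel reduction adds things up in a different order than a serial loop does. A small sketch, assuming IEEE 754 double arithmetic without extended intermediate precision (function names are made up; volatile is only there to stop the compiler folding the constants at build time):

```cpp
// Returns true if 0.2 * 3 compares equal to 0.6 in double arithmetic.
bool product_is_exact()
{
    volatile double a = 0.2, b = 3.0;
    return a * b == 0.6;
}

// Returns true if grouping does not change the sum of 0.1, 0.2 and 0.3.
bool addition_is_associative()
{
    volatile double a = 0.1, b = 0.2, c = 0.3;
    const double left_first = (a + b) + c;   // the grouping a serial loop uses
    const double right_first = a + (b + c);  // a grouping a split reduction might use
    return left_first == right_first;
}
```

Under those assumptions both functions return false: the same three numbers give two different doubles depending only on grouping, so two correct programs can disagree in the last bit.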

Re:Linus wrong? Shocking! by Half-pint+HAL · 2015-01-02 04:17 · Score: 2

Not true, because if the processes are IO bound (and most are), most of the processes will be waiting anyway. But Linus's argument hangs on a more fundamental problem: memory bandwidth. If all the cores are sitting waiting because the data isn't in the cache and the other cores are already trying to use the memory bus, then you'll end up with more unused cycles than if you ran timesliced threads on a single core. The correct answer to this one cannot be made by reasoning and logic from first principles, but only by looking at raw empirical data. I daresay Linus has more of that than most of us here.

--
Got them moderator blues I blieve I walk out the do', With these mod-points I been gettin', I 'most never post no mo'

Ripe for Revolution by Roger+W+Moore · 2015-01-02 04:23 · Score: 2

Nothing significant will change this year or in the next 10 years in parallel computing.

You might be right but I'm far less certain of it. The problem we have is that further shrinking of silicon makes it easier to add more cores than to make a single core faster so there is a strong push towards parallelism on the hardware side. At the same time the languages we have are not at all designed to cope with parallel programming.

The result is that we are using our computing resources less and less efficiently. I'm a physicist on an LHC experiment at CERN and we are acutely aware of how inefficient our serial algorithms are at using modern hardware. What we need is a breakthrough in programming languages to be able to parallel program efficiently, just like object oriented programming allowed us to scale up the size of programs. Until this happens I agree that not much will change, but if there is some clever CS researcher/student out there with a clever idea for a good parallel programming language the conditions are right for a revolution.

Re:i'm so tired of political correctness by Noah+Haders · 2015-01-02 06:55 · Score: 2

+1 this would make the best gravestone ever.

Lots of moving parts by m.dillon · 2015-01-02 07:05 · Score: 4, Informative

There are lots of moving parts here. Just adding cores doesn't work unless you can balance it out with sufficient cache and main memory bandwidth to go along with the cores. Otherwise the cores just aren't useful for anything but the simplest of algorithms.

The second big problem is locking. Locks which worked just fine under high concurrent loads on single-socket systems will fail completely on multi-socket systems just from the cache coherency bus bandwidth the collisions cause. For example, on an 8-thread (4 core) single-socket Intel chip, having all 8 threads contending on a single spin lock does not add a whole lot of overhead to the serialization mechanic. A 10ns code sequence might serialize to 20ns. But try to do the same thing on a 48-core Opteron system and suddenly serialization becomes 1000x less efficient. A 10ns code sequence can serialize to 10us or worse. That is how bad it can get.

Even shared locks using simple increment/decrement atomic ops can implode on a system with a lot of cores. Exclusive locks? Forget it.

The only real solution is to redesign algorithms, particularly the handling of shared resources in the kernel, to avoid lock contention as much as possible (even entirely). Which is what we did with our networking stack on DragonFly and numerous other software caches.

Some things we just can't segregate, such as the name cache. Shared locks only modestly improve performance but it's still a whole lot better than what you get with an exclusive lock.

The namecache is important because something like a bulk build, where we have 48 cores all running gcc at the same time, winds up sharing an enormous number of resources. Not just the shell invocations (where the VM pages are shared massively and there are 300 /bin/sh processes running or sitting due to all the Makefile recursion), but also the namecache positive AND negative hits due to the #include path searches.

Other things, particularly with shared resources, can be solved by making the indexing structures per-cpu but all pointing to the same shared data resource. In DragonFly doing that for seemingly simple things like an interface's assigned IP/MASKs can improve performance by leaps and bounds. For route tables and ARP tables, going per-cpu is almost mandatory if one wants to be able to handle millions of packets per second.
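
A user-space analogy for the per-cpu idea, and explicitly not DragonFly kernel code: instead of every thread hammering one shared counter, give each worker its own cache-line-sized slot and only combine the slots when someone needs the total. The names PaddedCounter and count_sharded are hypothetical, and the 64-byte line size is an assumption that varies by hardware.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// One counter per worker, padded out to an assumed 64-byte cache line so
// that neighbouring counters never share a line.
struct PaddedCounter {
    std::atomic<unsigned long> value{0};
    char padding[64 - sizeof(std::atomic<unsigned long>)];
};

// Each worker increments only its own shard; the shards are summed once at the end.
unsigned long count_sharded(unsigned workers, unsigned long increments_each)
{
    std::vector<PaddedCounter> shards(workers);
    std::vector<std::thread> pool;
    pool.reserve(workers);

    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([&shards, w, increments_each] {
            for (unsigned long n = 0; n < increments_each; ++n)
                shards[w].value.fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto& t : pool) t.join();

    unsigned long total = 0;
    for (const auto& shard : shards)
        total += shard.value.load(std::memory_order_relaxed);
    return total;
}
```

The writes stay local to each core's cache, so the coherency traffic that kills a single shared lock or counter mostly disappears. The price is that reading the total now touches every shard, which is the right trade when updates vastly outnumber reads.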

Even something like the fork/exec/exit path requires an almost lockless implementation to perform well on concurrent execs (e.g. /bin/sh in a large parallel make). Before I rewrote those algorithms our 48-core Opteron was limited to around 6000 execs per second. After rewriting it's more like 40,000+ execs per second.

So when one starts working with a lot of cores for general purpose computing, pretty much the ENTIRE operating system core has to be reworked. What worked well with only 12 cores will fall on its face with more.

-Matt

Re:Programs people want to use... by Rei · 2015-01-02 07:06 · Score: 2

BZZT, fail.

1) You didn't define launch_thread.
2) my_struct_array was said, and I quote, "a local-context data structure", so congrats, your data is going to go out of scope on you.
3) The concept of having to write that is absurd because "for (auto&i : container)" is a "do whatever you want, any number of steps, no matching function signature required, inline, on any container whatsoever" built into C++11, *and* it's something that anyone who knows C++11 will know rather than being something you brewed yourself.

Again, to repeat, given your failures on #1 and #2:

" if you're too lazy to do it here, or change the requirements to present yourself with a simpler problem, then I'm going to take it that you're too lazy to do it in your code, too."

Hence, I'm going to take it that you're likewise too lazy to actually thread your code. And the fact that your code contains a fundamental oversight resulting in a memory leak which wouldn't have caused a compile error is just icing on the cake.

--
If you play a Ke$ha song backwards, you hear messages from Satan. Even worse, if you play it forwards you hear Ke$ha.

Linus Lock by fyngyrz · 2015-01-02 07:39 · Score: 2

The core is already dozens of times faster than memory

It isn't, though, except for integer operations and tossing things around. Floating point core elements have a ways to go yet to get to single cycle for everything, and so spreading math among cores still saves time. OS folk like Linus may tend to think in terms of byte-to-BusSize manipulation. A lot of us deal with more nuanced data and operations. I *guarantee* you that a multicore processor will chew up properly designed image manipulation tasks a good deal faster than a single core will, and more flexibly (and more system-friendly) than a GPU can too, although slower for ops that fit in the GPU's memory and for which it offers competence. Software defined radio also makes terrific use of multiple cores, for instance here, a 3 GHz system with 8 cores is mostly free to do other stuff, and a system with one core running at the same speed is about 90% utilized, which doesn't leave enough horsepower to do much else. Whereas with the 8-core, I can run the SDR and do whatever the heck I want. Then there's the "what do you mean by 'core'" question. Does the core have an FPU, or is it one of those profoundly crippled integer-only units? Does the core actually share memory (and therefore memory bandwidth) with other cores, or does it have its own pool of RAM? Is eco throttling choking it half to death? And so on.

Having 1000 cores all waiting for 3,000 microseconds while the hard drive rotates to the other side of the platter

What is this "hard drive" thing you describe? Doesn't everyone use boards with terabytes of RAM for near-term storage?

Seriously, though, we all know (well, the ones who have considered it) that's exactly where we're going. SSDs as they stand today are just the tip of the iceberg; you want to know what's coming, instantiate a ram disk on your machine and run some benchies with it. And when we get to real RAM based storage, or anything of similar speed (or perhaps better... memristors?), we won't have wanted CPU development to have been sitting on laurels planted in a garden made of dead-slow storage in the interim.

Having 1000 cores all waiting for 3,000 microseconds while the hard drive rotates to the other side of the platter does not improve performance over 4 cores waiting.

True enough, but of course, that's not what happens, so... Effectively -- of course they can and do switch roles when memory is shared -- one is monitoring your ethernet, several are kicking in and out of httpd threads and/or processes, and so on for hundreds of OS tasks, and if you're like me, more than a few user tasks as well. For every task within a process that isn't hidebound by disk (and there are already a lot of them) having an additional available core is a very worthy thing. And when cores are tied up waiting for high level math operations, memory is (more) free relative to the needs of the available cores, and things simply run smoother, sooner. There's a lot of handwaving in there because of the complexity of caching and lookahead and so on, but the bottom line is that on my 8-core machine I can do a lot more than on my 2-core machine, though both have the same amount of memory and run at the same speed. And I apologize for the mangling of terminology. I think the point remains clear:

Multiple cores are a great thing.

--
I've fallen off your lawn, and I can't get up.
