How We'll Program 1000 Cores - and Get Linus Ranting, Again
vikingpower writes For developers, 2015 got kick-started mentally by a Linus Torvald rant about parallel computing being a bunch of crock. Although Linus' rants are deservedly famous for the political incorrectness and (often) for their insight, it may be that Linus has overlooked Gustafson's Law. Back in 2012, the High Scalability blog already ran a post pointing towards new ways to think about parallel computing, especially the ideas of David Ungar, who thinks in the direction of lock-less computing of intermediary, possibly faulty results that are updated often. At the end of this year, we may be thinking differently about parallel server-side computing than we do today.
In his post, Linus was talking about single, desktop computers, not distributed servers. He specifically said that he could imagine a 1000 core computer might be useful in the server room, but not for a typical user. So if you're going to criticize him, at least criticize what he said.
Also, git is not totally without locks. Try seeing if you can commit at the same time as someone else. It can't be done, the commits are atomic.
"First they came for the slanderers and i said nothing."
If massive-neural nets do reach common use (Which isn't that likely, they are somewhat overhyped) then I'd expect to see specific accelerators designed to run them. Probably something like FPGAs: Software writes the net, hardware executes it. A general-purpose processor (Probably x64 or ARM) does the coordinating, but augmented by specialised or semi-specialised hardware for certain tasks. Very much as we have today with hardware acceleration of 3D graphics or video decoding.
You can see the trend already. 3D acceleration was introduced for graphics, but then repurposed for other things, and followed up with revised graphics architectures designed for non-graphics applications. They are still useless for general-purpose computing, their architecture too limited, but used in conjunction with a general processor they can greatly outperform the processor alone on things like image processing, cryptographic tasks, physics simulation and such. It's now quite common to see even consumer applications, with games using physics simulation to provide much more detailed rigid-body simulation than was previously possible - ie, more bits of shrapnel and chunks of corpse bouncing around when you lob that grenade.
As for neural nets, you probably won't see much need to simulate huge ones. Small ones work surprisingly well, and their applications are really quite limited - they aren't some magic AI bullet that turns into a functional mind if you make them big enough. They excel at classification tasks, so they ar very handy in OCR, handwriting recognition, speech recognition and such. Google made one that can recognise cats, and if you can recognise cats then you can recognise other things, so straight away I'm seeing applications in web filter software.
find a file called "bill-gates-1989.mp3". Its a bit over 80MB and about an hour and a half long. Somewhere in there is where he makes the memory statement while talking about how *he* designed the memory layout of the IBM PC.
AMD have a line of CPUs very much like this, the A Series. It has several conventional multi-purpose x86-64 cores for general-purpose use and a Graphics Processing Unit built-in for those embarrassingly-parallel floating-point operations. Best of all, they're very cheap and perform very well.
Verification is the process of checking that software works correctly. The more complex the system, the more complex the process of verification. Rather unfair of the GP to throw that in as a single word after you explicitly said that you're not a computer scientist.
Got them moderator blues I blieve I walk out the do', With these mod-points I been gettin', I 'most never post no mo'
There are lots of moving parts here. Just adding cores doesn't work unless you can balance it out with sufficient cache and main memory bandwidth to go along with the cores. Otherwise the cores just aren't useful for anything but the simplest of algorithms.
The second big problem is locking. Locks which worked just fine under high concurrent loads on single-socket systems will fail completely on multi-socket systems just from the cache coherency bus bandwidth the collisions cause. For example, on an 8-thread (4 core) single-chip Intel chip having all 8 threads contending on a single spin lock does not add a whole lot of overhead to the serialization mechanic. A 10ns code sequence might serialize to 20ns. But try to do the same thing on a 48-core opteron system and suddenly serialization becomes 1000x less efficient. A 10ns code sequence can serialize to 10us or worse. That is how bad it can get.
Even shared locks using simple increment/decrement atomic ops can implode on a system with a lot of cores. Exclusive locks? Forget it.
The only real solution is to redesign algorithms, particularly the handling of shared resources in the kernel, to avoid lock contention as much as possible (even entirely). Which is what we did with our networking stack on DragonFly and numerous other software caches.
Some things we just can't segregate, such as the name cache. Shared locks only modestly improve performance but it's still a whole lot better than what you get with an exclusive lock.
The namecache is important because for something like a bulk build where we have 48 cores all running gcc at the same time winds up sharing an enormous number of resources. Not just the shell invocations (where the VM pages are shared massively and there are 300 /bin/sh processes running or sitting due to all the Makefile recursion), but also the namecache positive AND negative hits due to the #include path searches.
Other things, particularly with shared resources, can be solved by making the indexing structures per-cpu but all pointing to the same shared data resource. In DragonFly doing that for seemingly simple things like an interface's assigned IP/MASKs can improve performance by leaps and bounds. For route tables and ARP tables, going per-cpu is almost mandatory if one wants to be able to handle millions of packets per second.
Even something like the fork/exec/exit path requires an almost lockless implementation to perform well on concurrent execs (e.g. such as /bin/sh in a large parallel make). Before I rewrote those algorithms our 48-core opteron was limited to around 6000 execs per second. After rewriting it's more like 40,000+ execs per second.
So when one starts working with a lot of cores for general purpose computing, pretty much the ENTIRE operating system core has to be reworked verses what worked well with only 12 cores will fall on its face with more.
-Matt