How We'll Program 1000 Cores - and Get Linus Ranting, Again
vikingpower writes For developers, 2015 got kick-started mentally by a Linus Torvalds rant about parallel computing being a bunch of crock. Although Linus' rants are deservedly famous for their political incorrectness and (often) for their insight, it may be that Linus has overlooked Gustafson's Law. Back in 2012, the High Scalability blog already ran a post pointing towards new ways to think about parallel computing, especially the ideas of David Ungar, who thinks in the direction of lock-less computing of intermediary, possibly faulty results that are updated often. At the end of this year, we may be thinking differently about parallel server-side computing than we do today.
All the others ended up in a mutex lock situation, so I had the chance to do the first post
This is beyond the pale, and I am beside myself.
"4 cores should be enough for any workstation"
Perhaps it's an over-simplification, but if it turns out wrong, people will be quoting that for many decades like they do Gates' memory quote.
Table-ized A.I.
...a tool which he may have heard of. It does connectionless, distributed data management, totally without locks.
http://michaelsmith.id.au
.. more cores, don't or can't make use of more cores. We've had lots of cores for a while now, but it is a rare game that can use all of them.
Example of a brain that can't handle parallel thought processing.
Table-ized A.I.
You may recognize David as the co-creator of the Self programming language
No, actually, I don't think many people will recognize him as that. OK, back to reading the article.
"First they came for the slanderers and i said nothing."
'nuff said. I'll take all the cores you can give me.
Linus doesn't have a clue about much of computing:
* Floating point? Nope.
* Graphics? Nope.
* High performance? Nope.
* Parallel? Nope.
* Compiling the Linux kernel? Maybe.
This is another clear indication he currently lacks the clue token.
i'm so over this idea of political correctness. here's all it means: some assholes want to continue to be unmitigated assholes, just like they remember being "in the good old days." However, people are tired of putting up with their bullshit any more. so the assholes came up with this term "political correctness" to be like "i can say whatever I want, and if you don't like it then you are just being politically correct." maybe you should just stop being assholes, k? the world has changed.
The article makes the point (which is not correct*) that to have high scalability we need lockless designs, because locking has too much overhead. If you can't imagine trying to get ACID properties in a multithreaded system without locks, well, neither can I. And neither can they: they've decided to give up on reliability. They've decided we need to give up on the idea that the computer always gives the correct answer, and instead gives the correct answer most of the time (correct meaning, of course, doing exactly what the programmer told it to do).
Here is what the guy says: " The obstacle we shall have to overcome, if we are to successfully program manycore systems, is our cherished assumption that we write programs that always get the exactly right answers." Not only that, we need to give up memory/cache locks within the processor (I don't know a whole lot about those), because when you scale to 1000 processes on a single processor, RAM becomes a bottleneck.
Now, if he's right, and the only way to get such high performance is by not worrying about whether the computer does what it is told, then he's not going to be able to convince many people.
*It is not correct in situations where each processor can work on a single chunk for a long time, that is, for problems where resource contention is a small fraction of processor time, like in video encoding. Then the overhead is still small, no matter how many processors you have.
"First they came for the slanderers and i said nothing."
Or a spreadsheet? (Sure, a small fraction of people will have monster multi-tab sheets, but they're idiots.)
Email programs?
Chat?
Web browsers get a big win from multi-processing, but not parallel algorithms.
Linus is right: most of what we do has limited need for massive parallelization, and the work that does benefit from parallelization has been parallelized.
"I don't know, therefore Aliens" Wafflebox1
Linus sounds like a programmer from 40 years ago....."Nobody will ever need more than 2 digits for a year, so the crazies suggesting years be represented by 4 digits are just that - crazy."
Linus doesn't so much say that parallelism is useless; he's saying that more cache and bigger, more efficient cores are much better. Therefore, increasing the number of cores at the cost of single-core efficiency is just stupid for general purpose computing. Better to just stick more cache on the die instead of adding a core. Or that is how I read what he says.
I'd say, number of cores should scale with IO bandwidth. You need enough cores to make parallel compilation be CPU bound. Is 4 cores enough for that? Well, I don't know, but if the cores are efficient (highly parallel out-of-order execution) and have large caches, I'd wager IO lags far behind today. Is IO catching up? When will it catch up, if it is? No idea. Maybe someone here does?
As having taken a pioneering direction in lockless designs. In doing so they have made their system quite fast. It arguably has the fastest network stack of any operating system.
The problem is that Linus is discussing two different things at once and so it sounds like he's making a more inflammatory point than he is.
The issue is not whether parallelism is uniformly better for all tasks. The question is, is parallelism better for some tasks. And as Torvalds points out, those tasks do exist (Graphics being an obvious one).
The nature of the workload required for most workstations is non-uniform processing of large quantities of discrete, irregular tasks. For this, parallelism (as Torvalds correctly notes) is likely not the most efficient approach. To pretend that in some magical future, our processing needs can be homogenized into tasks for which parallel computing is superior is to make a faith-based prediction on how our use of computers will evolve. I would say that the evidence is quite the opposite: That tasks will become more discrete and unique.
Some fields, though (finance, science, statistics, weather, medicine, etc.), are rife with computing tasks which ARE well suited to parallel computing. But how many of those tasks happen on workstations? Not many, most likely. So Linus' point is valid.
But I have to take issue with the tone in which Linus downplays "graphics" as being a rather unimportant subset of computing tasks. It's not "graphics". It's "GRAPHICS". That's not a small outlier of a task. Wait until we're all wearing ninth generation Oculus headsets... the trajectory of parallel processing requirements for graphics is already becoming clear -- and it's stratospheric. The issue is this: Our desktop processing requirements are actually slowing and, as Linus points out, are probably ill-suited for increased parallelism. But our graphics requirements may be nearly infinite.
Unlike other fields of computing, we know where graphics is going 20 years from now: It's going to the "holodeck".
Keep working on parallel computing guys. Yes, we need it.
It is a well argued opinion.
And he doesn't say that parallel computing is a bunch of crock.
Aren't you supposed to actually RTFA before writing the summary?
Ungar's idea (http://highscalability.com/blog/2012/3/6/ask-for-forgiveness-programming-or-how-well-program-1000-cor.html) is a good one, but it's also not new. My Master's is in CS/high performance computing, and I wrote about it back around the turn of the millennium. It's often much better to have asymptotically or probabilistically correct code rather than perfectly correct code when perfectly correct code requires barriers or other synchronizing mechanisms, which are the bane of all things parallel.
In a lot of solvers that iterate over a massive array, only small changes are made at one time. So what if you execute out of turn and update your temperature field before a -.001C change comes in from a neighboring node? You're going to be close anyway. The next few iterations will smooth out those errors, and you'll be able to get far more work done in a far more scalable fashion than if you maintain rigor where it is not exactly needed.
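To make the "stale neighbor" point concrete, here is a minimal sketch, not the parent's actual solver and not Ungar's scheme. It relaxes a 1-D temperature field with both ends held fixed. Each cell reads its neighbors from a randomly chosen recent snapshot, which is my stand-in for executing out of turn. The names `relax`, `n_cells`, `sweeps` and `max_delay` are made up for illustration.

```python
import random


def relax(n_cells=11, sweeps=2000, max_delay=0, seed=0):
    """Relax a 1-D temperature field toward steady state.

    The two end cells are held at 0.0 and 100.0. On every sweep each
    interior cell becomes the mean of its two neighbours, but each
    neighbour value is read from a snapshot that may be up to
    `max_delay` sweeps old (hypothetical model of unsynchronized cores).
    """
    rng = random.Random(seed)
    field = [0.0] * n_cells
    field[-1] = 100.0
    history = [field[:]]  # oldest snapshot first, newest last

    for _ in range(sweeps):
        new = field[:]
        for i in range(1, n_cells - 1):
            left = rng.choice(history)[i - 1]   # possibly stale value
            right = rng.choice(history)[i + 1]  # possibly stale value
            new[i] = 0.5 * (left + right)
        field = new
        history.append(field[:])
        if len(history) > max_delay + 1:
            history.pop(0)  # forget snapshots older than max_delay
    return field


if __name__ == "__main__":
    print([round(t, 3) for t in relax(max_delay=0)])  # fully synchronized
    print([round(t, 3) for t in relax(max_delay=3)])  # stale reads allowed
```

With `max_delay=0` this is ordinary synchronized Jacobi iteration. With `max_delay=3` the reads are sloppy, yet both runs settle on the same linear ramp between the two fixed ends. This toy problem is a friendly case (a simple averaging stencil with bounded staleness); it says nothing about solvers where out-of-order updates can diverge.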
I'll see your cores and raise you your boss strangling all your cores by forcing you to get all the data you were planning to process from NFS shares on 100 megabit LAN connections. Because your developers and IT department, with all the competence of a 14-year-old who just got his hands on a copy of Ruby on Rails, can't figure out how to utilize disk that every fucking machine in the company doesn't have read access to.
I'm trying to teach myself to set people on fire with my mind... Is it hot in here?
How does Linus compile his kernel? Certainly I use a parallel make across as many cores as possible (well, up to the point where there's a core for every compilation unit).
All your ghosts are just false positives.
The central claim of Linus seems to be that there are many people out there who claim an efficiency increase from parallelism. While I agree that many people claim (IMHO correctly) an increase in performance (a reduction of execution time) within the constraints given by a specific technology level by doing symmetric multiprocessing, I have not heard many people claim that efficiency (in terms of power, chip area, component count) is improved by symmetric, general parallelization, and nobody with a good understanding of the information-related aspects of computation.
Speaking now as a physicist, I find it disturbingly easy to show the opposite for many cases in the limit of ideally performing systems (that is, resources per implemented gate operation remaining constant with the number of gate operations).
Having said that, I speculate that there are reasons to introduce parallelism:
a) The performance you require cannot be achieved without it. An example would be an FPU, or even just an 8-bit full adder. You *can* implement it bit-wise, but you don't want to. The full adder is also an excellent example of how parallelism can increase power consumption (e.g. fast carry look-ahead) and resource usage.
b) Your implementation simulates operations in a way that requires significant effort for fetching and decoding the simulated function. The extreme case of an extreme RISC processor with only one-bit operations and a 1-bit ALU is more inefficient for many problems than the processors we use. This means that there probably is an ideal "processing power/RAM (cache)" combination, which is a function of your communication cost (i.e. bus drivers) and your algorithm.
c) From b) we can actually see that it can be extremely reasonable to create non-symmetric multiprocessing units. For listening for a sensor signal to change, an 8-bit 1MHz microcontroller with less than 100kGates may be an excellent choice (see the ti430 line, for example), since it does not insist on keeping an overkill of ALU persistently on.
d) Parallel programming is almost never used to increase efficiency (unless you really have distributed input/output and inherent costs of collecting it), but only for those operations where the efficiency loss due to parallelism is negligible (or zero).
Shi's Law
http://developers.slashdot.org...
http://spartan.cis.temple.edu/...
http://slashdot.org/comments.p...
"Researchers in the parallel processing community have been using Amdahl's Law and Gustafson's Law to obtain estimated speedups as measures of parallel program potential. In 1967, Amdahl's Law was used as an argument against massively parallel processing. Since 1988 Gustafson's Law has been used to justify massively parallel processing (MPP). Interestingly, a careful analysis reveals that these two laws are in fact identical. The well publicized arguments were resulted from misunderstandings of the nature of both laws.
This paper establishes the mathematical equivalence between Amdahl's Law and Gustafson's Law. We also focus on an often neglected prerequisite to applying the Amdahl's Law: the serial and parallel programs must compute the same total number of steps for the same input. There is a class of commonly used algorithms for which this prerequisite is hard to satisfy. For these algorithms, the law can be abused. A simple rule is provided to identify these algorithms.
We conclude that the use of the "serial percentage" concept in parallel performance evaluation is misleading. It has caused nearly three decades of confusion in the parallel processing community. This confusion disappears when processing times are used in the formulations. Therefore, we suggest that time-based formulations would be the most appropriate for parallel performance evaluation."
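For anyone who wants to check the "identical" claim numerically, here is a small sketch using the usual textbook forms of the two laws, which may not match the paper's own notation. The function names are mine. The only difference between the laws is which run the serial fraction is measured on: Amdahl measures it on the one-core run, Gustafson on the n-core run.

```python
def amdahl_speedup(serial_fraction, n):
    """Textbook Amdahl: serial_fraction is measured on the 1-core run."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)


def gustafson_speedup(scaled_serial_fraction, n):
    """Textbook Gustafson: the fraction is measured on the n-core run."""
    return scaled_serial_fraction + (1.0 - scaled_serial_fraction) * n


def to_amdahl_fraction(scaled_serial_fraction, n):
    """Convert a serial fraction of the n-core run into the serial
    fraction the same workload would show on a single core."""
    # Normalise the n-core run to 1 time unit; on one core the parallel
    # part takes n times longer while the serial part is unchanged.
    one_core_time = scaled_serial_fraction + (1.0 - scaled_serial_fraction) * n
    return scaled_serial_fraction / one_core_time


if __name__ == "__main__":
    for n in (2, 64, 1000):
        s_scaled = 0.3
        s = to_amdahl_fraction(s_scaled, n)
        print(n, gustafson_speedup(s_scaled, n), amdahl_speedup(s, n))
```

Both columns agree for every n once the fraction is converted, which is the equivalence the abstract describes. The disagreement only shows up when someone plugs the same percentage into both formulas without saying which run it was measured on.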
Few are actually people with a real engineering background anymore.
What Linus means is:
- Moore's law is ending (go read about mask costs and feature sizes)
- If you can't geometrically scale transistor counts, you will be transistor count bound (Duh)
- therefore you have to choose what to use the transistors for
- anyone with a little experience with how machines actually perform (as one would have to admit Linus does) will know that keeping execution units running is hard.
- since memory bandwidth has nowhere near scaled with the CPU's appetite for instructions and data, cache is already a bottleneck
Therefore, do instruction and register scheduling well, have the biggest on-die cache you can, and enough CPUs to deal with common threaded workflows. And this, in his opinion, is about 4 CPUs in common cases. I think we may find that his opinion is informed by looking at real data of CPU usage on common workloads, seeing as how performance benchmarks might be something he is interested in. In other words, based on some (perhaps ad hoc) statistics.
The biggest machine I've run a single image on had 56 cores; by the time I'd finished, lock contention was down in the noise.
Admittedly these were synthetic benchmarks, but my comment would be that there isn't enough RAM to run a single process across 1000 cores (yet); the thread stack size will kill you before lock contention on well-written code does.
Mostly writing code for MacOS X and iOS. All current devices have two or more cores. Writing multi-threaded code is made rather easy through GCD (Grand Central Dispatch), and anything receiving data from a server _must_ be multithreaded, because you never know how long it takes to get a response. So there is an awful lot of multi-threaded code around.
But the fact that work is distributed to several cores is just secondary for that kind of work. It is also easy to make most work-intensive code use multiple cores. There are calls like sorting an array or searching for an item with multi-threaded variants. With GCD, you can just say "do this task on a background thread", and if you have five things to do, it uses five threads and up to five cores. It's so easy that people do it a lot without measuring how efficient it is. As long as your software is fast enough, it's fine.
The typical result is an application that uses multiple cores to some degree, but may have bottlenecks that require a single core. Now on an iPhone with 2 cores, that's fine. (If 30% of your time needs to run on a single core, but you have only two cores, it doesn't matter.) On an iMac with 4 cores, it's quite OK. On a monster MacPro with 24 threads it might be a problem. On a hypothetical machine with 100s of cores it _is_ a problem.
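A back-of-the-envelope check of the 30% example, assuming plain Amdahl's law with no scheduling or memory overhead (so real numbers would be worse). The helper name is made up.

```python
def amdahl_speedup(serial_fraction, cores):
    """Best-case speedup when serial_fraction of the single-core run
    time cannot be spread across cores (Amdahl's law)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


if __name__ == "__main__":
    # Core counts taken from the devices mentioned above.
    for cores in (2, 4, 24, 100):
        print(f"{cores:>3} cores: {amdahl_speedup(0.3, cores):.2f}x")
```

That works out to roughly 1.54x on 2 cores, 2.11x on 4, 3.04x on 24 and 3.26x on 100, against a ceiling of 1/0.3, or about 3.33x. Going from 4 cores to 100 buys you about one extra "x", which is the point: the 30% is invisible on a phone and dominant on a big machine.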
So your typical MacOS X or iOS app written by reasonably competent people will work fine in the current environment, but would need major changes to take advantage of 100s of cores.
Nothing significant will change this year or in the next 10 years in parallel computing. The subject is very hard, and that may very well be a fundamental limit, not one requiring some kind of special "magic" idea. The other problem is that most programmers have severe trouble handling even classical, fully-locked, code in cases where the way to parallelize is rather clear. These "magic" new ways will turn out just as the hundreds of other "magic" ideas to finally get parallel computing to take off: As duds that either do not work at all, or that almost nobody can write code for.
Really, stop grasping for straws. There is nothing to be gained in that direction, except for a few special problems where the problem can be partitioned exceptionally well. CPUs have reached a limit in speed, and this is a limit that will be with us for a very long time, and possibly permanently. There is nothing wrong with that, technology has countless other hard limits, some of them centuries old. Life goes on.
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Memorable only if stated by someone who both is "in power" and should know better. 640k was much better than 64k, but even in those days, big iron had (and needed) much more than 640k. Lots of people say stupid things, a corporate bigshot saying something stupid within his own field of expertise is a bit different.
some people just don't deserve to breathe. is it always the same twat always posting this shit or is there more than one single brain cell moron out there?
"The hands that help are better far than lips that pray." - Robert Ingersoll (1833-1899)
1.) Linus' wording is pretty moderate.
2.) He's right. Again.
We suffer more in our imagination than in reality. - Seneca
... are not politically incorrect. "Politically incorrect" is a phrase used by douchebags who want to be able to say anything they want about anyone at any time. It used to be used mostly by white male douchebags who wanted to be able to make racist remarks but now it's mostly been co-opted by MRAs.
Linus sometimes curses and doesn't pull his punches, but that's not "politically incorrect".
..is this:
The obstacle we shall have to overcome, if we are to successfully program manycore systems, is our cherished assumption that we write programs that always get the exactly right answers.
This is an interesting observation. Let's take graphs for example. We rarely need to solve every possible path and find THE shortest one, we usually only need to find one which is shorter than almost all the other ones.
Do we always care whether every pixel is the best possible color when compressing images? No, it usually only has to be close enough so that we can't tell the difference.
These are classic examples of that statement that have already been implemented in both parallel and linear algorithm design. I'd like to see much more research into understanding why some problems don't require an exact answer, and some do. Maybe we need to change the way we think about what a solution is, rather than how to solve.
It's one thing to argue against massive parallelism in a single piece of software. Of course that's not the right answer to every problem, or even to most of them. But arguing against many cores at a hardware level, as he seems to be doing, is plain stupid. Of course more cores == better. As long as I have 100+ processes running on my desktop PC, the more cores I have to spread them around, the better.
No special languages or programmer training required.
...imagine a beowulf cluster of these!
If you were putting together a PC (any variety, any era), what would you expect to get the most bang for the buck? Obviously get the fastest current hardware, but then: double the CPU? double the RAM? double the comm (which at this point includes SATA controllers)? My experience all the way back to Z80s has normally been more RAM, the extension of which is more cache close to the CPU, which is one of the things Linus says.
It's hard to parallelize one application, which is why we all point to a handful of well-understood examples in graphics and that's about it. It's more straightforward - and more understandable - to parallelize multiple applications, like a "server" hearkening back to the old mainframe days. For a *general-purpose* computer doing mostly one or two things at a time with background communication and I/O, more RAM/cache == less thrashing == better *all-around* performance without adding complexity.
You may want to read some other comments here.
Multi-core CPUs are just a side-step because we can't scale single-core CPU performance to the same levels.
For example, if there was a choice between a single-core CPU that could do 1000 bogomips or a 4-core CPU that could do 4x250 bogomips, I know I'd rather have the single-core chip because for the vast majority of use cases the single-core chip would destroy the quad.
This is why modern multi-core CPUs have 'turbo' mode - Intel and AMD both realised that single-core performance is still much more important for individual programs so being able to run that code on one core and boost it at the detriment of the other cores gives a significant edge.
I still remember when multi-core CPUs first came out - They were limited by TDP so cheaper single-core CPUs would almost always beat them in benchmarks because while they were slightly behind in multi-threading performance, they were far superior on single-core performance.
One thing that surprises me is that no CPU manufacturer has come up with a dynamic pipeline system, where you could run a CPU as e.g. a quad core for normal usage, but when presented with highly predictable streaming data, switch to a P4-style long pipeline by e.g. feeding one core into another and running the whole thing at a higher clock speed.
I think there is a little bit of noise on the early warning radar for future computing needs: sensor data processing in autonomous cars!
That might be a fluke for the rest of our professional careers, but it might also determine where a good chunk of the silicon diffused goes in ten years.
Since this area will be in constant and violent flux over the next twenty years, a lot of the processing power required for 3D analysis of video and radar sensors in each car will be provided by some sort of general purpose CPU/GPU. Video data could be processed in parallel in segments of the images from multiple cameras, but also in parallel by something like a dozen different algorithms, to get extra safety and confidence out of majority voting.
Currently, about half of the value of an automobile sold in western markets is not made up from mechanical parts, but electric and electronic devices and the software for them.
With the advent of the autonomous car, I would think the computing requirements would explode again. Since this would coincide with the petering out of Moore's law, and come well after clock rates have effectively been capped, this will require some sort of parallel computing solution.
Linus is right, there is little need for massive parallelism outside of niche areas like HPC right now. But I am not so sure this cannot change.
"Anonymous Coward" - That's kind of harsh just because I choose not to register ;-)...
Some individuals responding to Linus's "rant" aren't fully reading what he said; they quote a couple of the various laws of parallel computing and say he's 100 percent wrong. As Linus stated, "The only place where parallelism matters is in graphics or on the server side, where we already largely have it. Pushing it anywhere else is just pointless." Please note "server side": that is where these "laws" are relevant, and they have very little relevance on the desktop. Many programmers have gotten lazy or aren't being taught very well; some either aren't programming for parallelization or are using it in areas where it negatively impacts performance. I believe programmers need to take more care when utilizing parallelization.
"Give it up. The whole "parallel computing is the future" is a bunch of crock." I do believe Linus is wrong with this statement. It lacks vision, and I would question his leadership role within the Linux community. I can think of a number of things that would prompt me to make such a statement in his place, where I would misstate my actual feeling on the matter, i.e. bad programming...
I certainly believe that parallel computing is the future of computing, perhaps not the immediate future. I see multiple people referencing this or similar articles: http://www.cmu.edu/silicon-valley/news-events/seminars/2011/ungar-talk.html
Aside from individuals saying they already thought of this or some such (who cares?), I was thinking something similar as I read Linus's "rant". The point of the article was to point out that we need to go beyond our classic computing model to achieve massively parallel systems that continue to scale, with this being just one example of how to achieve it. I believe we need a whole host of new tools to perform parallel computing efficiently on a massive scale. Look at what the likes of FB and Google are doing with parallel computing and AI: they go so far as to custom build their own HW, yes, partly due to economies of scale but partly because COTS HW doesn't meet their functional requirements, not to mention the full software stack...
There are many common algorithms at the heart of important workloads that are not parallelizable. Consider sorting and shortest path algorithms that are important for managing data and route finding. The O(n-squared) versions can be parallelized (Bellman-Ford vs. Dijkstra's), but for any useful input size, the n-log-n version will be faster on a single core than the n-squared on a supercomputer (no hyperbole there). Even for workloads that do have a lot of parallelism, the inter-process communication often dominates. Except for benchmarks with no application to reality, there is always SOMETHING that serializes computation. Amdahl's law always bites you in the ass.
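A rough sanity check on the "no hyperbole" claim, taking the n-squared and n-log-n labels at face value. This only counts operations: it assumes the n-squared algorithm parallelizes perfectly, both have the same constant factor, and communication is free, all of which flatter the parallel side. The function name is hypothetical.

```python
import math


def breakeven_cores(n):
    """Cores a perfectly parallel n^2 algorithm needs just to tie a
    serial n*log2(n) algorithm on an input of size n.

    Solves n^2 / P = n * log2(n) for P; constant factors are ignored.
    """
    if n <= 1:
        raise ValueError("n must be greater than 1")
    return n / math.log2(n)


if __name__ == "__main__":
    for n in (10**3, 10**6, 10**9):
        print(f"n = {n:>13,}: about {breakeven_cores(n):,.0f} cores to break even")
```

For a million items that is about 50,000 cores, and for a billion items about 33 million cores, just to tie a single core running the better algorithm.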
So much for parallel computing.
If you have many INDEPENDENT tasks, then sure, parallel computing is great. Web servers with many clients, graphics, etc. But that's for servers.
On end-user systems, the amount of thread-level parallelism is very limited. Unless you're compiling Gentoo, you're going to top out at a handful of cores. This is not a limitation of the languages people use. It's a practical limitation of the parallelism inherent (or not) in the workloads people run, and it's a hard mathematical limitation of the optimal algorithms people use for common low-level tasks.
http://crd-legacy.lbl.gov/~dhbailey/dhbpapers/twelve-ways.pdf
http://www.davidhbailey.com/dhbpapers/inv3220-bailey.pdf
http://www.cs.binghamton.edu/~pmadden/pubs/dispelling-ieeedt-2013.pdf
There are some people in parallel computing who need to go back to school and learn computational complexity.
So parallel is a crock except for.....
graphics
servers
in other words, if there's a need for multi-core then it happens.
is it always the same twat always posting this shit ?
You are an even bigger twat by replying to him. He is modded to -1, and few would even see his post if not for your reply making it more visible.
NEVER RESPOND TO TROLLS!!!
It sounds rather like Bill Gates' [supposed] "640KB is enough for anyone", but there's no denying that Linus said this one!
Saying that graphics is the only client side app that can utilize large scale parallelism is short sighted bunk, and even ignores what is going on today let alone the future. In 20 years time we'll have handheld devices that would look just as much like science fiction, if available today, as today's devices would have looked 20 years ago.
I have no doubt whatsoever that in the next few decades we'll see human level AI in handheld devices as well as server-based apps, and you better believe that the computing demands (both processing and memory) will be massive. Even today we're starting to see impressive advances in speech and image recognition and the underlying technology is increasingly becoming (massively parallel) connectionist deep learning architectures, not your grandfather's (or Linus's) traditional approaches. Current deep-learning architectures can be optimized to use significantly less resources for recognition-only deployment vs learning, but no doubt we'll see live learning in the future too as AI advances and technology develops.
Linus's relegation of parallelism to server side is equally if not more shortsighted than his lack of vision of client-side CPU-sucking applications! If you want systems that are always available, responsive and scalable then that calls for distributed (client side) implementation, not server based. Future devices are not only going to be smart but the smarts are going to be local. Bye-bye server based Siri.
http://xkcd.com/619/
Answering Linus' "Where the hell..." question:
"Where the hell do you envision that those magical parallel algorithms would be used?"
When you have millions of robots running around your body, repairing your telomere length, resetting the cells' Hayflick limit, and repairing other aging-related damage, so you can live another 200+ years healthy and relatively physiologically young.
You know, unless you actually *want* to be old and decrepit, and die centuries before you actually have to...
Not true, because if the processes are IO bound (and most are), most of the processes will be waiting anyway. But Linus's argument hangs on a more fundamental problem: memory bandwidth. If all the cores are sitting waiting because the data isn't in the cache and the other cores are already trying to use the memory bus, then you'll end up with more unused cycles than if you ran timesliced threads on a single core. The correct answer to this one cannot be made by reasoning and logic from first principles, but only by looking at raw empirical data. I daresay Linus has more of that than most of us here.
Got them moderator blues I blieve I walk out the do', With these mod-points I been gettin', I 'most never post no mo'
Nothing significant will change this year or in the next 10 years in parallel computing.
You might be right but I'm far less certain of it. The problem we have is that further shrinking of silicon makes it easier to add more cores than to make a single core faster so there is a strong push towards parallelism on the hardware side. At the same time the languages we have are not at all designed to cope with parallel programming.
The result is that we are using our computing resources less and less efficiently. I'm a physicist on an LHC experiment at CERN and we are acutely aware of how inefficient our serial algorithms are at using modern hardware. What we need is a breakthrough in programming languages to be able to parallel program efficiently, just like object oriented programming allowed us to scale up the size of programs. Until this happens I agree that not much will change, but if there is some clever CS researcher/student out there with a clever idea for a good parallel programming language, the conditions are right for a revolution.
"Crock of shit" maybe? "Bunch of crock" doesn't seem like it'd even be a thing.
Replying to myself... apparently it is a thing: http://en.wiktionary.org/wiki/...
To be fair, the trend seems to be hitting a ceiling.
Desktop processors got to quad core and have pretty much sat there. The mobile space has been at quad core for a shorter time, and octo-core implementations are more common there than on the desktop, but it still seems quad core is about where most devices settle. There are more efforts to make GPU-style execution cores available for non-graphics use, but in practice a relatively small portion of the market has been able to get meaningful gains from them. As vectorized instructions in cores become more capable, many of those problems actually start coming back to the traditional CPU cores, which handle them as well as the GPU does but with an easier programming model. In short, the market results seem to indicate that end user devices might settle around quad core.
Servers have been going up, with 18 cores per socket for 2-socket systems now available. This shows that the desktop parts have room to grow in that dimension, but it just isn't being bothered with.
XML is like violence. If it doesn't solve the problem, use more.
That sounds more like a distributed computing problem rather than applications running on a single 'system'. Even if it were centrally controlled, the computational load being time-shared might mean the best solution is still just a handful of cores. Such nanites would presumably be independent or unused enough that continuous CPU load would likely not even be in the picture. This is very much science fiction, but it still strikes me that the computational load would be negligible compared to the medical/engineering problems overcome. You take 30-40 years to start feeling the effects of aging, so it's not like cells require continual repair to achieve your hypothetical situation, just have to manage to repair everything within 25 years.
XML is like violence. If it doesn't solve the problem, use more.
Http:// www.duncan-white.co.uk
I was lucky enough to gather some parallel programming experience on the Connection Machine CM2, a 64k CPU (yes that is 65536 CPUs), 12 dimensional hypercube, a long time ago. The CM2 ultimately failed but we did get many great insights into parallel programming. At the time it was just not feasible for low cost, on your desktop, computing. It is NO problem to keep massive numbers of cores busy doing interesting computing. OK, the 12 dimensions are less clear on how to use them. At any rate, to claim that there is no need for 100 cores or more is really small minded because, unlike the time when silly "the world does not need more than 5 computers" kinds of comments were made, we already have evidence that there are powerful ways to employ massive parallel computing that can use thousands or even millions of cores.
Just because we are caught in a sequential programming mindset does not mean that there is no room for parallel programming. If you are looking at a two dimensional array of data and think of a nested loop you ARE caught in a sequential programming mindset. Additionally, famous people, including Dijkstra, have poopooed some algorithms that are inefficient when executed sequentially, to the point where researchers and programmers are not even looking any more for good parallel execution. Take bubble sort. Not sure it was Dijkstra but somebody suggested forbidding it. Yes, on a sequential computer bubble sort is indeed inefficient, but guess what. If communication does matter and if you are using a massively parallel architecture (i.e., not 4 cores) bubble sort becomes quite efficient because you only need to talk to your data neighbors. Likewise there are AI algorithms that can be shown to behave really well when conceptualized and executed in parallel. Collaborative Diffusion is an example: http://www.cs.colorado.edu/~ra...
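To make the bubble sort point concrete, here is a minimal sketch of odd-even transposition sort, the usual neighbor-only parallel relative of bubble sort. It is my choice of illustration, not something the CM2 work above is known to have used. The function name is made up for this example, and the inner loop only simulates on one core what a machine with one processor per element would do in a single step.

```python
def odd_even_transposition_sort(values):
    """Sort using only compare-exchange steps between adjacent elements."""
    data = list(values)
    n = len(data)
    # n phases are enough to sort n elements with this scheme.
    for phase in range(n):
        # Even phases pair (0,1), (2,3), ...; odd phases pair (1,2), (3,4), ...
        start = phase % 2
        # The pairs within one phase are disjoint, so on a massively
        # parallel machine they could all be handled in the same step.
        # Here they are simulated one after another.
        for i in range(start, n - 1, 2):
            if data[i] > data[i + 1]:
                data[i], data[i + 1] = data[i + 1], data[i]
    return data
```

Each element only ever exchanges data with its immediate neighbors, so with one processor per element the sort takes about n communication steps instead of the roughly n squared comparisons a sequential bubble sort needs.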
You make me miss Shampoo.
I am very small, utmostly microscopic.
I imagine future processors will internally look like spreadsheets, with millions or billions of registers from 64 bits to 1 Gbit in size, that can handle both application work and (software) graphics rendering such as gaming, running at 500 gigahertz.
The trouble is that extrapolating the present isn't a great way to predict the future!
If computers were never required to do anything much different than they do right now then of course the processing/memory requirements won't change either.
But... of course things are changing, and one change that has been a long time coming but is finally hitting consumer devices is the arrival of hard "fuzzy" problems like speech recognition, image/object recognition, natural language processing, artificial intelligence... and the computing needs of these types of application are way different than running traditional software. We may start with accelerators for state-of-the-art offline speech recognition, but in time (a few decades) I expect we'll have pretty sophisticated AI (think smart assistant) functionality widely available that may shake up hardware requirements more significantly.
There are lots of moving parts here. Just adding cores doesn't work unless you can balance it out with sufficient cache and main memory bandwidth to go along with the cores. Otherwise the cores just aren't useful for anything but the simplest of algorithms.
The second big problem is locking. Locks which worked just fine under high concurrent loads on single-socket systems will fail completely on multi-socket systems just from the cache coherency bus bandwidth the collisions cause. For example, on an 8-thread (4 core) single-socket Intel chip, having all 8 threads contending on a single spin lock does not add a whole lot of overhead to the serialization mechanic. A 10ns code sequence might serialize to 20ns. But try to do the same thing on a 48-core Opteron system and suddenly serialization becomes 1000x less efficient. A 10ns code sequence can serialize to 10us or worse. That is how bad it can get.
Even shared locks using simple increment/decrement atomic ops can implode on a system with a lot of cores. Exclusive locks? Forget it.
The only real solution is to redesign algorithms, particularly the handling of shared resources in the kernel, to avoid lock contention as much as possible (even entirely). Which is what we did with our networking stack on DragonFly and numerous other software caches.
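As a rough illustration of designing the contention away, here is a minimal sketch of a sharded counter: each worker owns one slot and never touches the others, so no shared lock or shared atomic is needed on the update path. This is not DragonFly code; `ShardedCounter` and `run` are hypothetical names for this example. It also does not model cache lines, and in a real kernel the slots would be padded so they do not share one.

```python
import threading


class ShardedCounter:
    """One slot per worker; updates never contend on a shared lock."""

    def __init__(self, shards):
        self._counts = [0] * shards

    def add(self, shard, amount=1):
        # Only the worker that owns this shard ever writes this slot.
        self._counts[shard] += amount

    def total(self):
        # Readers pay the cost of combining the slots instead.
        return sum(self._counts)


def run(workers=4, increments=10000):
    counter = ShardedCounter(workers)

    def work(shard):
        for _ in range(increments):
            counter.add(shard)

    threads = [threading.Thread(target=work, args=(i,)) for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.total()
```

The trade-off is the usual one: writes become cheap and independent, while a read has to visit every slot, which is fine for statistics-style data that is updated far more often than it is read.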
Some things we just can't segregate, such as the name cache. Shared locks only modestly improve performance but it's still a whole lot better than what you get with an exclusive lock.
The namecache is important because something like a bulk build, where we have 48 cores all running gcc at the same time, winds up sharing an enormous number of resources. Not just the shell invocations (where the VM pages are shared massively and there are 300 /bin/sh processes running or sitting due to all the Makefile recursion), but also the namecache positive AND negative hits due to the #include path searches.
Other things, particularly with shared resources, can be solved by making the indexing structures per-cpu but all pointing to the same shared data resource. In DragonFly doing that for seemingly simple things like an interface's assigned IP/MASKs can improve performance by leaps and bounds. For route tables and ARP tables, going per-cpu is almost mandatory if one wants to be able to handle millions of packets per second.
Even something like the fork/exec/exit path requires an almost lockless implementation to perform well on concurrent execs (e.g. /bin/sh in a large parallel make). Before I rewrote those algorithms our 48-core Opteron was limited to around 6000 execs per second. After rewriting it's more like 40,000+ execs per second.
So when one starts working with a lot of cores for general purpose computing, pretty much the ENTIRE operating system core has to be reworked. What worked well with only 12 cores will fall on its face with more.
-Matt
It isn't, though, except for integer operations and tossing things around. Floating point core elements have a ways to go yet to get to single cycle for everything, and so spreading math among cores still saves time. OS folk like Linus may tend to think in terms of byte-to-BusSize manipulation. A lot of us deal with more nuanced data and operations. I *guarantee* you that a multicore processor will chew up properly designed image manipulation tasks a good deal faster than a single core will, and more flexibly (and more system-friendly) than a GPU can too, although slower for ops that fit in the GPU's memory and for which it offers competence. Software defined radio also makes terrific use of multiple cores. For instance, here a 3 GHz system with 8 cores is mostly free to do other stuff, while a system with one core running at the same speed is about 90% utilized, which doesn't leave enough horsepower to do much else. With the 8-core, I can run the SDR and do whatever the heck I want. Then there's the "what do you mean by 'core'" question. Does the core have an FPU, or is it one of those profoundly crippled integer-only units? Does the core actually share memory (and therefore memory bandwidth) with other cores, or does it have its own pool of RAM? Is eco throttling choking it half to death? And so on.
What is this "hard drive" thing you describe? Doesn't everyone use boards with terabytes of RAM for near-term storage?
Seriously, though, we all know (well, the ones who have considered it) that's exactly where we're going. SSDs as they stand today are just the tip of the iceberg; you want to know what's coming, instantiate a ram disk on your machine and run some benchies with it. And when we get to real RAM based storage, or anything of similar speed (or perhaps better... memristors?), we won't have wanted CPU development to have been sitting on laurels planted in a garden made of dead-slow storage in the interim.
True enough, but of course, that's not what happens, so... Effectively -- of course they can and do switch roles when memory is shared -- one is monitoring your ethernet, several are kicking in and out of httpd threads and/or processes, and so on for hundreds of OS tasks, and if you're like me, more than a few user tasks as well. For every task within a process that isn't hidebound by disk (and there are already a lot of them) having an additional available core is a very worthy thing. And when cores are tied up waiting for high level math operations, memory is (more) free relative to the needs of the available cores, and things simply run smoother, sooner. There's a lot of handwaving in there because of the complexity of caching and lookahead and so on, but the bottom line is that on my 8-core machine I can do a lot more than on my 2-core machine, even though both have the same amount of memory and run at the same speed. And I apologize for the mangling of terminology. I think the point remains clear:
Multiple cores are a great thing.
I've fallen off your lawn, and I can't get up.
Think of the way eyeballs work. Our neurons don't stream full resolution video back down the optic nerve. In fact, a bunch of processing occurs right behind the retina itself. The data is crunched into a radically smaller format by the time it hits the brain. In much the same way, wearables need to crunch/compress in a massively parallel manner down to something that can be reasonably transferred down the pipe.
http://en.wikipedia.org/wiki/M...
http://en.wikipedia.org/wiki/A...
I learned MapReduce for use with CouchDB and it is a powerful technique even when not on parallel hardware -- although a bit of a conceptual shift.
Here is a group using MapReduce with Hadoop for image processing:
http://hipi.cs.virginia.edu/
"HIPI is a library for Hadoop's MapReduce framework that provides an API for performing image processing tasks in a distributed computing environment. "
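For anyone who has not made that conceptual shift yet, here is a minimal single-process sketch of the map / shuffle / reduce shape, using word counting as the task. It does not use the CouchDB, Hadoop, or HIPI APIs; every function name below is made up for illustration.

```python
from collections import defaultdict
from functools import reduce


def map_phase(documents, mapper):
    """Apply the mapper to each document; each call is independent of the others."""
    pairs = []
    for doc in documents:
        pairs.extend(mapper(doc))
    return pairs


def shuffle(pairs):
    """Group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Fold each key's values down to one result; keys are independent of each other."""
    return {key: reduce(reducer, values) for key, values in groups.items()}


def word_count(documents):
    def mapper(doc):
        return [(word.lower(), 1) for word in doc.split()]

    def reducer(a, b):
        return a + b

    return reduce_phase(shuffle(map_phase(documents, mapper)), reducer)
```

Because the mapper calls do not depend on each other, and neither do the per-key reductions, a framework is free to spread both phases across as many cores or machines as it has.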
Linus wrote: "The only place where parallelism matters is in graphics or on the server side, where we already largely have it. Pushing it anywhere else is just pointless." But would Linus really think image processing (like for robots or self-driving cars or using Baxter to sort your kid's Legos) is not an important issue? Sounds a bit like "640K is enough memory for anyone". Failure of the imagination is all too common based on unfamiliarity with some problem domain. Although, to be frank, I thought 32K of RAM on a Commodore PET was more than enough memory for anyone, because I could not imagine writing a program that large at the time. :-)
Also, agent-based simulations or zone-based simulations can often use as much parallel hardware as you can throw at it, even if there may be occasional short synchronization steps. For example you could have a Minecraft-like game with thousands of active entities like wolves, zombies, pigs, and so on -- as well as processes like erosion or plant growth going on in multiple zones simultaneously. Game design could really change with millions of available general purpose cores. My wife and I created an algorithm for growing botanically accurate plants, but current games like Minecraft can't use it to grow each unique plant because it would be too computationally intensive if you had millions of unique plants all growing at the same time.
https://github.com/pdfernhout/...
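A minimal sketch of the zone idea, assuming a toy model where each zone is just a list of (height, growth_rate) plants: every zone is stepped independently, and the end of each tick is the short synchronization step. This is not the plant-growth algorithm linked above, and `step_zone` and `simulate` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def step_zone(zone):
    """Advance one zone by one tick, touching only that zone's own plants."""
    return [(height + rate, rate) for height, rate in zone]


def simulate(zones, ticks, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(ticks):
            # All zones are stepped concurrently; collecting the results
            # into a list is the per-tick synchronization point.
            zones = list(pool.map(step_zone, zones))
    return zones
```

In CPython, threads will not speed up pure-Python arithmetic like this, so treat the thread pool as a stand-in for whatever actually supplies the cores (processes, a compiled language, or real many-core hardware). The structure is the point: independent work per zone, then a brief barrier.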
Congrats on your luck/skill in working with Thinking Machines hardware like the CM2. Around 1984, when I was a psychology undergrad at Princeton interested in AI, I had developed some software called "Mex" for multiple execution where I ran up to 1000 simulated processors on an IBM mainframe under VMUTS. I was using it to help process some data from a robot vision system I had put together (which itself had three 6502 processors). I was really excited about the idea of linking together lots of 6502 processors. I applied for a job then at Thinking Machines but didn't get an offer. A sociology grad student I knew from then (Clifford Nass) got a job offer there (and that is part of why I applied there) but he didn't take the offer, which is kind of ironic. He's brilliant and innovative as his career shows, but not really a programmer or hardware guy, and not all that interested in AI that I knew of:
http://adlininc.com/uxpioneers...
I was shocked and saddened just now, when checking what he is up to, to see on Wikipedia that Cliff died recently of a heart attack:
http://en.wikipedia.org/wiki/C...
What a big loss for Cliff's family as well as the world. And not that long after the sad loss of Professor Jim Beniger, who was an inspiration and good role model to both Cliff and myself in various ways.
I can see though how Thinking Machines could also have benefited from Cliff's cleverness in thinking about human/machine interaction related to control of a (then) new type of machine. Maybe they'd still be in business if Cliff had gone to work with them? And maybe, being associated with MIT, they did not need yet one more programmer or hardware person, no matter how much that person was interested in parallel processing or had already done their own projects on it.
A 21st century issue: the irony of technologies of abundance in the hands of those still thinking in terms of scarcity.
Verification is the process of checking that software works correctly. The more complex the system, the more complex the process of verification.
You said "verification" but you're thinking of "software quality assurance". Though "verification" is sometimes used to describe a step in that process, when used standing alone (at least here in Silicon Valley), it refers to the analogous process in integrated circuit design.
Verification is a BIG DEAL in integrated circuit design. A good hardware project will have at least as many verification engineers as designers (and hardware designers will freely act as verification engineers - on OTHER designers' modules - during the later stages of a chip tapeout, without taking a career hit.) It is the limiting factor in when the chip design hits silicon and when it hits the market.
So IMHO the previous poster is talking about the up-front quality assurance processes and costs of hardware, rather than software, complexity.
(Releasing a rev to a software product due to a QA issue missed due to added complexity may be costly. But releasing a rev to silicon takes months and millions of dollars of sunk cost. They're not in the same league.)
Bantam Dominique roosters crow a four-note song. Once you've heard it as "Happy BIRTHday" you can't NOT hear it that way
There is limited application for making processes faster through parallelism. It only works well for processes that do not rely on the results of any of the other processes. Unfortunately, many real world applications depend on sequential tasks and I/O. That leaves running multiple applications in parallel, but that is different from parallel programming and is a task already accomplished quite well by current OSes.
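The usual way to put numbers on this limit is Amdahl's law, shown below next to Gustafson's law from the story summary. The comment above names neither, so read this as one standard framing rather than the poster's own. The function names and the 90% figure are assumptions for illustration.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Speedup for a fixed-size job where only part of the work parallelizes."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


def gustafson_speedup(parallel_fraction, cores):
    """Scaled speedup when the problem size grows with the core count."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return (1.0 - parallel_fraction) + parallel_fraction * cores
```

With a hypothetical job that is 90% parallel, `amdahl_speedup(0.9, 1000)` stays below 10 however many cores are added, which is the pessimistic reading. Gustafson's version assumes people use extra cores to tackle bigger problems rather than the same problem faster, which is the optimistic one.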
"Although Linus' rants are deservedly famous for the political incorrectness and (often) for their insight..."
I would not say (often).
The issue is that when processor vendors went to dual and then quad core, people started extrapolating and saying 'oh in a decade, we'll be using hundreds of cores on a random desktop'. Instead it tapered out at about 4 for the most part with focus on reducing the power envelope while minimizing performance loss.
I would say the discussion presuming massive core counts is based on an extrapolation of older trends of increasing core count, and it's perfectly reasonable to step back and recognize the change in the trend. Sure, tomorrow we could suddenly be back on the path to 256 core desktop solutions for unforeseen reasons, but as it stands, there are no signs of that being the priority of the industry.
XML is like violence. If it doesn't solve the problem, use more.
It sounds like you're suggesting that memory bus speed will not continue to increase, and thus, we should stop adding bus contention by adding cores. The conclusion there hinges on a rather unsupported premise that is contradicted by the (historical) empirical data. All signs point to memory becoming much faster indeed.
If Linus' expertise were really relevant here, perhaps Transmeta wouldn't have failed.
Memory bus speed is increasing, and therefore the cost of cache misses is decreasing. One way or another, that still leaves us with cache misses as a bottleneck. The question is not a straightforward one of "memory bus speeds are increasing so who gives two hoots" -- there is a very subtle equation needed to determine what cache size is optimal with what bus speed, and for which task.
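The textbook starting point for that equation is average memory access time: hit time plus miss rate times miss penalty. It is only a first-order model and leaves out most of the subtlety mentioned above (multiple cache levels, bandwidth, prefetching), and the function name and any numbers fed to it are assumptions.

```python
def average_access_time_ns(hit_time_ns, miss_rate, miss_penalty_ns):
    """First-order average memory access time for a single cache level."""
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return hit_time_ns + miss_rate * miss_penalty_ns
```

A faster bus lowers `miss_penalty_ns` and a bigger cache lowers `miss_rate`. In this simple model, halving either one shrinks the miss term by the same amount, so which is the better investment depends on what each costs and on how the workload's miss rate responds to cache size.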
Got them moderator blues I blieve I walk out the do', With these mod-points I been gettin', I 'most never post no mo'
Well, there's obviously no need to add more cores/parallelism until there's a widespread need for it (unless you are Chinese, when octocore is a must!), but I think the need is coming pretty fast.
There are all sorts of cool and useful things you can do with high quality speech, image, etc recognition, natural language processing and AI, and these areas are currently making rapid advances in the lab and slowly starting to trickle out into consumer devices (e.g. speech and natural language support both in iOS and Android).
What is fairly new is that in the lab state of the art results in many of these fields are now coming from deep learning / recurrent neural net architectures rather than traditional approaches (e.g. MFCC + HMM for speech recognition) and these require massive parallelism and compute power. These technologies will continue to migrate to consumer devices as they mature and as the compute requirements become achievable...
Smart devices (eventually *really* smart) are coming, and the process has already started.