How We'll Program 1000 Cores - and Get Linus Ranting, Again
vikingpower writes: For developers, 2015 got kick-started mentally by a Linus Torvalds rant about parallel computing being a bunch of crock. Although Linus' rants are deservedly famous for their political incorrectness and (often) for their insight, it may be that Linus has overlooked Gustafson's Law. Back in 2012, the High Scalability blog already ran a post pointing towards new ways to think about parallel computing, especially the ideas of David Ungar, who thinks in the direction of lock-less computing of intermediary, possibly faulty results that are updated often. At the end of this year, we may be thinking differently about parallel server-side computing than we do today.
All the others ended up in a mutex lock situation, so I had the chance to do the first post.
"4 cores should be enough for any workstation"
Perhaps it's an over-simplification, but if it turns out wrong, people will be quoting that for many decades like they do Gates' memory quote.
Table-ized A.I.
...a tool which he may have heard of. It does connectionless, distributed data management, totally without locks.
http://michaelsmith.id.au
Example of a brain that can't handle parallel thought processing.
Grokking and managing parallel programming seems to be the bottleneck. Using mass parallelism can be done, but so far it's been so difficult that it has yet to be worth it for the vast majority of apps (or at least the vast majority of the operations in a given app, for graphics and database calls can sometimes use lots of parallelism).
It's too early to know if it's just too hard a problem for the human mind in general, or the current generation of programmers is too locked into a way of thinking.
Regarding the suggestion to follow nature, nature can be unpredictable. Do we want that characteristic in our applications? How do you debug something if you can't faithfully recreate the state? I can see an organic mess being fun for some games, but not for accounting and tracking software.
We need more pilot projects to experiment with techniques.
The article makes the point (which is not correct*) that to have high scalability we need lockless designs, because locking has too much overhead. If you can't imagine trying to get ACID properties in a multithreaded system without locks, well, neither can I. And neither can they: they've decided to give up on reliability. They've decided we need to give up on the idea that the computer always gives the correct answer, and instead gives the correct answer most of the time (correct meaning, of course, doing exactly what the programmer told it to do).
Here is what the guy says: "The obstacle we shall have to overcome, if we are to successfully program manycore systems, is our cherished assumption that we write programs that always get the exactly right answers." Not only that, we need to give up memory/cache locks within the processor (I don't know a whole lot about those), because when you scale to 1000 processes on a single processor, RAM becomes a bottleneck.
Now, if he's right, and the only way to get such high performance is by not worrying about whether the computer does what it is told, then he's not going to be able to convince many people.
*It is not correct in situations where each processor can work on a single chunk for a long time, that is, for problems where resource contention is a small fraction of processor time, like in video encoding. Then the overhead is still small, no matter how many processors you have.
"First they came for the slanderers and i said nothing."
Or a spreadsheet? (Sure, a small fraction of people will have monster multi-tab sheets, but they're idiots.)
Email programs?
Chat?
Web browsers get a big win from multi-processing, but not parallel algorithms.
Linus is right: most of what we do has limited need for massive parallelization, and the work that does benefit from parallelization has been parallelized.
"I don't know, therefore Aliens" Wafflebox1
I think the actual problem is that some people are so worked up about political incorrectness that they take pleasure in spewing insulting, angry messages all day long. Lol, look at my freedomz of speechorz. But a clever guy can say things straight without being an upsetting dickhead at the same time.
Linus doesn't so much say that parallelism is useless, he's saying that more cache and bigger, more efficient cores is much better. Therefore, increased number of cores at the cost of single core efficiency is just stupid for general purpose computing. Better just stick more cache to the die, instead of adding a core. Or that is how I read what he says.
I'd say, number of cores should scale with IO bandwidth. You need enough cores to make parallel compilation be CPU bound. Is 4 cores enough for that? Well, I don't know, but if the cores are efficient (highly parallel out-of-order execution) and have large caches, I'd wager IO lags far behind today. Is IO catching up? When will it catch up, if it is? No idea. Maybe someone here does?
The problem is that Linus is discussing two different things at once and so it sounds like he's making a more inflammatory point than he is.
The issue is not whether parallelism is uniformly better for all tasks. The question is, is parallelism better for some tasks. And as Torvalds points out, those tasks do exist (Graphics being an obvious one).
The nature of the workload required for most workstations is non-uniform processing of large quantities of discrete, irregular tasks. For this, parallelism (as Torvalds correctly notes) is likely not the most efficient approach. To pretend that in some magical future, our processing needs can be homogenized into tasks for which parallel computing is superior is to make a faith-based prediction on how our use of computers will evolve. I would say that the evidence is quite the opposite: That tasks will become more discrete and unique.
Some fields though: finance, science, statistics, weather, medicine, etc. are rife with computing tasks which ARE well suited to parallel computing. But how many of those tasks happen on workstations? Not many, most likely. So Linus' point is valid.
But I have to take issue with the tone in which Linus downplays "graphics" as being a rather unimportant subset of computing tasks. It's not "graphics". It's "GRAPHICS". That's not a small outlier of a task. Wait until we're all wearing ninth generation Oculus headsets... the trajectory of parallel processing requirements for graphics is already becoming clear -- and it's stratospheric. The issue is this: Our desktop processing requirements are actually slowing and, as Linus points out, are probably ill-suited for increased parallelism. But our graphics requirements may be nearly infinite.
Unlike other fields of computing, we know where graphics is going 20 years from now: It's going to the "holodeck".
Keep working on parallel computing guys. Yes, we need it.
------ The best brain training is now totally free : )
Even the people who knew it would be an issue still used two digits. Resources were extremely constrained. It wasn't worth spending all of that for a problem that would happen decades later. I used to write complete programs that fit in 8K.
Linus sounds like a programmer from 40 years ago
Not necessarily a bad thing to sound like, IMO; 40 years ago you had to think and actually be insightful about what you were undertaking, because the tools and resources were so limited. And, as somebody else has already mentioned, Linus isn't against graphics and multi-core, he is against the stupid fad that blindly demands more cores at the expense of producing better cores (as well as the idiocy of wrapping everything in a graphical front-end, when that actually ends up getting in the way of doing the job).
I think what he says makes a lot of sense - when do you actually benefit from having many cores? Only when you have many, independent tasks; there are large classes of tasks that are serial in nature, which would not benefit from having several cores to run on. And most of the independent processes on the average PC are so lightweight that nothing is gained from having several cores compared to multiprocessing on a single core. Unless you are running a proper server in a data centre or performing large computations, you are likely to just waste your money, if you buy into the multi-core fad.
Ungar's idea (http://highscalability.com/blog/2012/3/6/ask-for-forgiveness-programming-or-how-well-program-1000-cor.html) is a good one, but it's also not new. My Master's is in CS/high performance computing, and I wrote about it back around the turn of the millennium. It's often much better to have asymptotically or probabilistically correct code rather than perfectly correct code when perfectly correct code requires barriers or other synchronizing mechanisms, which are the bane of all things parallel.
In a lot of solvers that iterate over a massive array, only small changes are made at one time. So what if you execute out of turn and update your temperature field before a -.001C change comes in from a neighboring node? You're going to be close anyway. The next few iterations will smooth out those errors, and you'll be able to get far more work done in a far more scalable fashion than if you maintain rigor where it is not exactly needed.
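The tolerance to out-of-turn updates can be illustrated with a small single-threaded sketch. It is an assumption-laden toy, not a real parallel solver: a 1D rod with fixed end temperatures, relaxed once with strictly synchronous sweeps and once with in-place sweeps in which each cell sees a mix of fresh and stale neighbor values. All function names here are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Synchronous (Jacobi) sweep: every interior cell reads only values from
// the previous iteration, which in a parallel solver needs a barrier.
std::vector<double> sweep_synchronous(const std::vector<double>& t) {
    std::vector<double> next(t);
    for (std::size_t i = 1; i + 1 < t.size(); ++i)
        next[i] = 0.5 * (t[i - 1] + t[i + 1]);
    return next;
}

// "Out of turn" sweep: cells are overwritten in place, so each update sees
// one already-updated neighbor and one stale neighbor.
void sweep_in_place(std::vector<double>& t) {
    for (std::size_t i = 1; i + 1 < t.size(); ++i)
        t[i] = 0.5 * (t[i - 1] + t[i + 1]);
}

// Largest deviation from the exact steady state, which for fixed end
// temperatures is a straight line between the two ends.
double max_error(const std::vector<double>& t) {
    assert(t.size() >= 2);
    const double n = static_cast<double>(t.size() - 1);
    double worst = 0.0;
    for (std::size_t i = 0; i < t.size(); ++i) {
        const double exact =
            t.front() + (t.back() - t.front()) * (static_cast<double>(i) / n);
        worst = std::max(worst, std::fabs(t[i] - exact));
    }
    return worst;
}
```

Starting both variants from the same field and iterating long enough, `max_error` shrinks toward zero for each of them: the update order changes the path, not the destination.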
Indeed. There's tons of CPU-intensive tasks that need to be done in a modern computer game, but they're typically done as:
Rather than...
I really hope with how easy it's gotten in C++11 that more people will make better use of threads. In the first example code, not only do you relegate all of your tasks to the same core, thus hitting performance, but if any one task hangs, all of them hang. It's a terrible approach, but it's the most common. The only case where threads aren't good is where you're doing heavy concurrent read/writes to the same cached data, but in real world apps there's almost always a level where you can launch the thread where this isn't the case, if it's even an issue to begin with in your particular application. The presumption that concurrent access to cached memory will usually or always be a problem (which seems to be Linus's presumption) requires that A) your threads not be doing the majority of their work on thread-local memory, AND B) that the shared data area being read from / written to concurrently is small enough to be cached, AND C) you can't just migrate your threads up in scope N levels to work around any such issue.
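The contrast being drawn (every job run back to back on one thread versus one C++11 `std::thread` per job) might look roughly like the sketch below. The `Task` alias and both function names are hypothetical, and the tasks are assumed to be independent of each other.

```cpp
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for a per-frame job (physics, AI, audio, ...).
using Task = std::function<void()>;

// First style: all tasks run one after another on the calling thread,
// so a single slow or hung task holds up everything behind it.
void run_sequentially(const std::vector<Task>& tasks) {
    for (const Task& task : tasks) {
        assert(task && "task must be callable");
        task();
    }
}

// Second style: each task gets its own std::thread, and the caller
// waits until all of them have finished.
void run_in_threads(const std::vector<Task>& tasks) {
    std::vector<std::thread> workers;
    workers.reserve(tasks.size());
    for (const Task& task : tasks) {
        assert(task && "task must be callable");
        workers.emplace_back(task);  // the thread runs its own copy of the task
    }
    for (std::thread& worker : workers) worker.join();
}
```

One thread per task is the simplest form of the idea; a real engine would more likely feed a fixed pool of workers, which this sketch does not attempt.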
If you play a Ke$ha song backwards, you hear messages from Satan. Even worse, if you play it forwards you hear Ke$ha.
I'll see your cores and raise you your boss strangling all your cores by forcing you to get all the data you were planning to process from NFS shares on 100 megabit LAN connections. Because your developers and IT department, with all the competence of a 14-year-old who just got his hands on a copy of Ruby on Rails, can't figure out how to utilize disk that every fucking machine in the company doesn't have read access to.
I'm trying to teach myself to set people on fire with my mind... Is it hot in here?
Only if you have a single I/O device and channel.
NUMA architectures can also apply to disks and other I/O devices.
Of course - it comes with a new set of problems, but there's no golden solution.
If builders built buildings the way programmers wrote programs, then the first woodpecker would destroy civilization.
How does Linus compile his kernel? Certainly I use a parallel make across as many cores as possible (well, up to the point where there's a core for every compilation unit).
All your ghosts are just false positives.
The central claim of Linus seems to be that there are many people out there who claim an efficiency increase through parallelism. While I agree that many people claim (IMHO correctly) an increase in performance (a reduction of execution time) within the constraints of a specific technology level by doing symmetric multiprocessing, I have not heard many people claim that efficiency (in terms of power, chip area, or component count) is improved by symmetric, general parallelization, and nobody with a good understanding of the information-related aspects of computation.
Speaking as a physicist, I find it disturbingly easy to show the opposite for many cases in the limit of ideally performing systems (that is, with the resources per implemented gate operation remaining constant with the number of gate operations).
Having said that, I speculate that there are reasons to introduce parallelism:
a) The performance you require cannot be achieved without it. An example would be an FPU, or even just an 8-bit full adder. You *can* implement it bit-wise, but you don't want to. The full adder is also an excellent example of how parallelism can increase power consumption (e.g. fast carry look-ahead) and resource usage.
b) Your implementation simulates operations in a way that requires significant effort for fetching and decoding the simulated function. The extreme case of an extreme RISC processor with only one-bit operations and a 1-bit ALU is more inefficient for many problems than the processors we use. This means that there probably is an ideal "processing power/RAM (cache)" combination, which is a function of your communication cost (i.e. bus drivers) and your algorithm.
c) From b) we can actually see that it can be extremely reasonable to create non-symmetric multiprocessing units. For listening for a sensor signal to change, an 8-bit 1MHz microcontroller with fewer than 100 kGates may be an excellent choice (see the ti430 line, for example), since it does not insist on keeping an overkill of ALU persistently on.
d) Parallel programming is almost never used to increase efficiency (unless you really have a distributed input/output and inherent costs of collecting it), but only for those operations where the efficiency loss due to parallelism is negligible (or zero).
Shi's Law
http://developers.slashdot.org...
http://spartan.cis.temple.edu/...
http://slashdot.org/comments.p...
"Researchers in the parallel processing community have been using Amdahl's Law and Gustafson's Law to obtain estimated speedups as measures of parallel program potential. In 1967, Amdahl's Law was used as an argument against massively parallel processing. Since 1988 Gustafson's Law has been used to justify massively parallel processing (MPP). Interestingly, a careful analysis reveals that these two laws are in fact identical. The well publicized arguments were resulted from misunderstandings of the nature of both laws.
This paper establishes the mathematical equivalence between Amdahl's Law and Gustafson's Law. We also focus on an often neglected prerequisite to applying the Amdahl's Law: the serial and parallel programs must compute the same total number of steps for the same input. There is a class of commonly used algorithms for which this prerequisite is hard to satisfy. For these algorithms, the law can be abused. A simple rule is provided to identify these algorithms.
We conclude that the use of the "serial percentage" concept in parallel performance evaluation is misleading. It has caused nearly three decades of confusion in the parallel processing community. This confusion disappears when processing times are used in the formulations. Therefore, we suggest that time-based formulations would be the most appropriate for parallel performance evaluation."
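The equivalence claimed in the abstract can be checked with the textbook forms of the two laws. The notation below is an assumption for illustration and not necessarily the paper's own: N is the processor count, s is the serial fraction of the single-processor run time (Amdahl's view), and s' is the serial fraction of the N-processor run time (Gustafson's view).

```latex
% Assumed notation (not taken from the quoted abstract):
%   N  = number of processors
%   s  = serial fraction of the single-processor run time (Amdahl)
%   s' = serial fraction of the N-processor run time (Gustafson)
\begin{align*}
S_{\text{Amdahl}} &= \frac{1}{s + \frac{1 - s}{N}}, &
S_{\text{Gustafson}} &= s' + (1 - s')\,N, \\
s &= \frac{s'}{s' + (1 - s')\,N}
\;\Longrightarrow\;
s + \frac{1 - s}{N}
  = \frac{s' + (1 - s')}{s' + (1 - s')\,N}
  = \frac{1}{s' + (1 - s')\,N}
\;\Longrightarrow\;
S_{\text{Amdahl}} = S_{\text{Gustafson}}.
\end{align*}
```

The two formulas describe the same run once the serial fraction is measured against the same baseline, which seems to be why the abstract argues for time-based formulations over a "serial percentage".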
And some of us just grew up in the sort of nuclear family where offensive expletives are the norm.
Few are actually people with a real engineering background anymore.
What Linus means is:
- Moore's law is ending (go read about mask costs and feature sizes)
- If you can't geometrically scale transistor counts, you will be transistor count bound (Duh)
- therefore you have to choose what to use the transistors for
- anyone with a little experience with how machines actually perform (as one would have to admit Linus does) will know that keeping execution units running is hard.
- since memory bandwidth has nowhere near scaled with CPU appetite for instructions and data, cache is already a bottleneck
Therefore, do instruction and register scheduling well, have the biggest on-die cache you can, and enough CPUs to deal with common threaded workflows. And this, in his opinion, is about 4 CPUs in common cases. I think we may find that his opinion is informed by looking at real data of CPU usage on common workloads, seeing as how performance benchmarks might be something he is interested in. In other words, based on some (perhaps ad hoc) statistics.
No, "political correctness" is a thing. It is where someone gets in trouble for using the word "niggardly" because it sounds like another word.
The truth is that all men having power ought to be mistrusted. James Madison
Mostly writing code for MacOS X and iOS. All current devices have two or more cores. Writing multi-threaded code is made rather easy through GCD (Grand Central Dispatch), and anything receiving data from a server _must_ be multithreaded, because you never know how long it takes to get a response. So there is an awful lot of multi-threaded code around.
But the fact that work is distributed to several cores is just secondary for that kind of work. It is also easy to make most work-intensive code use multiple cores. There are calls like sorting an array or searching for an item with multi-threaded variants. With GCD, you can just say "do this task on a background thread", and if you have five things to do, it uses five threads and up to five cores. It's so easy that people do it a lot without measuring how efficient it is. As long as your software is fast enough, it's fine.
The typical result is an application that uses multiple cores to some degree, but may have bottlenecks that require a single core. Now on an iPhone with 2 cores, that's fine. (If 30% of your time needs to run on a single core, but you have only two cores, it doesn't matter.) On an iMac with 4 cores, it's quite OK. On a monster MacPro with 24 threads it might be a problem. On a hypothetical machine with 100s of cores it _is_ a problem.
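The 30% figure can be turned into numbers with Amdahl's law. This is a minimal sketch that assumes the remaining 70% of the work parallelizes perfectly and ignores all scheduling and communication overhead; `amdahl_speedup` is a hypothetical helper name.

```cpp
#include <cassert>

// Amdahl's law: best-case speedup on `cores` cores when `serial_fraction`
// of the single-core run time cannot be spread across cores.
double amdahl_speedup(double serial_fraction, int cores) {
    assert(serial_fraction >= 0.0 && serial_fraction <= 1.0 && cores >= 1);
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores);
}
```

With a serial fraction of 0.3 this gives roughly 1.5x on 2 cores, 2.1x on 4 cores and 3.0x on 24 hardware threads, and it can never exceed 1/0.3, about 3.3x, however many cores are added. That matches the progression from "fine" to "_is_ a problem".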
So your typical MacOS X or iOS app written by reasonably competent people will work fine in the current environment, but would need major changes to take advantage of 100s of cores.
Nothing significant will change this year or in the next 10 years in parallel computing. The subject is very hard, and that may very well be a fundamental limit, not one requiring some kind of special "magic" idea. The other problem is that most programmers have severe trouble handling even classical, fully-locked, code in cases where the way to parallelize is rather clear. These "magic" new ways will turn out just as the hundreds of other "magic" ideas to finally get parallel computing to take off: As duds that either do not work at all, or that almost nobody can write code for.
Really, stop grasping for straws. There is nothing to be gained in that direction, except for a few special problems where the problem can be partitioned exceptionally well. CPUs have reached a limit in speed, and this is a limit that will be with us for a very long time, and possibly permanently. There is nothing wrong with that, technology has countless other hard limits, some of them centuries old. Life goes on.
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Only if you have zero clue about what he is talking about. Note: It is not possible to deduce validity from the way something sounds. That requires actual insight.
Fuck you. You can't tell me what I can think or say.
So, what you're saying is... his right to tell you things is trumped by your wish to not hear things? Freedom of speech does not mean what you think it means...
Something I wish I could have in a workstation again is a full-fledged crossbar switch like the Octane and Octane 2 had.
1.) Linus' wording is pretty moderate.
2.) He's right. Again.
We suffer more in our imagination than in reality. - Seneca
You're being pedantic. "You can't tell me" doesn't mean a literal 'you have to not talk', it means you cannot force your will on me to make *me* not think or say things. This was pretty much exactly what the poster he was responding to meant by "you should jsut stop". He's got freedom of speech correct.
And what most usage is on a computer is actually concurrency.
Massive parallelism is a special case, and even then you suffer from concurrency.
..is this:
The obstacle we shall have to overcome, if we are to successfully program manycore systems, is our cherished assumption that we write programs that always get the exactly right answers.
This is an interesting observation. Let's take graphs for example. We rarely need to solve every possible path and find THE shortest one, we usually only need to find one which is shorter than almost all the other ones.
Do we always care whether every pixel is the best possible color when compressing images? No, it usually only has to be close enough so that we can't tell the difference.
These are classic examples of that statement that have already been implemented in both parallel and linear algorithm design. I'd like to see much more research into understanding why some problems don't require an exact answer, and some do. Maybe we need to change the way we think about what a solution is, rather than how to solve.
...imagine a beowulf cluster of these!
If you were putting together a PC (any variety, any era), what would you expect to get the most bang for the buck? Obviously get the fastest current hardware, but then: double the CPU? double the RAM? double the comm (which at this point includes SATA controllers)? My experience all the way back to Z80s has normally been more RAM, the extension of which is more cache close to the CPU, which is one of the things Linus says.
It's hard to parallelize one application, which is why we all point to a handful of well-understood examples in graphics and that's about it. It's more straightforward - and more understandable - to parallelize multiple applications, like a "server" hearkening back to the old mainframe days. For a *general-purpose* computer doing mostly one or two things at a time with background communication and I/O, more RAM/cache == less thrashing == better *all-around* performance without adding complexity.
And some of us just grew up in the sort of nuclear family where offensive expletives are the norm.
You mean, low-class? I grew up with that kind of family, but I don't have any illusions about whether obscenity is the crutch of the inarticulate motherfucker.
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
Multi-core CPUs are just a side-step because we can't scale single-core CPU performance to the same levels.
For example, if there was a choice between a single-core CPU that could do 1000 bogomips or a 4-core CPU that could do 4x250 bogomips, I know I'd rather have the single-core chip because for the vast majority of use cases the single-core chip would destroy the quad.
This is why modern multi-core CPUs have 'turbo' mode - Intel and AMD both realised that single-core performance is still much more important for individual programs so being able to run that code on one core and boost it at the detriment of the other cores gives a significant edge.
I still remember when multi-core CPUs first came out - They were limited by TDP so cheaper single-core CPUs would almost always beat them in benchmarks because while they were slightly behind in multi-threading performance, they were far superior on single-core performance.
One thing that surprises me is that no CPU manufacturer has come up with a dynamic pipeline system, where you could run a CPU as e.g. a quad core for normal usage, but when presented with highly predictable streaming data, switch to a P4-style long pipeline by e.g. feeding one core into another and running the whole thing at a higher clock speed.
1. Until recently, most PCs had only a dual core CPU.
2. You're assuming those tasks can trivially be done in parallel. In reality, most can't. You can't render the graphics until the physics are calculated, for example. Yes, you can be calculating physics for the next frame while you're rendering the current one, but then you have to maintain two copies of all the relevant data (current and new), or use a more complex data format which can support multiple threads updating it at the same time. That's a lot more work than just wrapping a thread around the physics calculations.
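The "two copies of all the relevant data" arrangement is essentially double buffering. The following is a rough sketch under simplifying assumptions: a hypothetical `World` that is just a list of positions, a trivial physics step, and a stand-in "renderer" that only sums what it reads. None of these names come from a real engine.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical game state; a real engine would hold far more than this.
struct World {
    std::vector<double> positions;
};

// Two copies of the state: rendering reads `current`, physics writes `next`.
class DoubleBuffer {
public:
    explicit DoubleBuffer(const World& initial) : front_(initial), back_(initial) {}
    const World& current() const { return front_; }
    World& next() { return back_; }
    void swap() { std::swap(front_, back_); }  // call only when both sides are idle

private:
    World front_;
    World back_;
};

// Stand-in physics: move every object by dt (assumes unit velocity).
void step_physics(const World& from, World& to, double dt) {
    to.positions.resize(from.positions.size());
    for (std::size_t i = 0; i < from.positions.size(); ++i)
        to.positions[i] = from.positions[i] + dt;
}

// Stand-in renderer: sums the positions so there is something to observe.
double render(const World& world) {
    double sum = 0.0;
    for (double p : world.positions) sum += p;
    return sum;
}

// One frame: physics for the next frame overlaps rendering of this one.
double run_frame(DoubleBuffer& buffers, double dt) {
    assert(dt >= 0.0);
    std::thread physics(
        [&buffers, dt] { step_physics(buffers.current(), buffers.next(), dt); });
    const double drawn = render(buffers.current());  // reads the old state only
    physics.join();
    buffers.swap();  // publish the new state for the following frame
    return drawn;
}
```

The cost the comment points at is visible even here: the state exists twice, and the swap is a synchronization point that both threads must reach every frame.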
Because in C and older versions of C++ launching a thread takes significant typing and ugly code,
Bullshit. It takes 1 function call, because if you had a need to do all that repeatedly, you would write the damn call once, turn it into a function, and let it be done. People didn't do it because the tasks weren't parallelizable: they had massive resource contention on memory objects. Contention that would be non-trivial to solve, and would cause using threads to be a minimal gain or even a loss in efficiency.
Libraries like std::thread don't do anything that people weren't already doing- they just prevent people from going out and writing their own implementations. But any problems that would benefit from them were already being solved with roll your own solutions.
I still have more fans than freaks. WTF is wrong with you people?
There are many common algorithms at the heart of important workloads that are not parallelizable. Consider sorting and shortest path algorithms that are important for managing data and route finding. The O(n-squared) versions can be parallelized (Bellman-Ford vs. Dijkstra's), but for any useful input size, the n-log-n version will be faster on a single core than the n-squared on a supercomputer (no hyperbole there). Even for workloads that do have a lot of parallelism, the inter-process communication often dominates. Except for benchmarks with no application to reality, there is always SOMETHING that serializes computation. Amdahl's law always bites you in the ass.
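A back-of-the-envelope check of the n-squared versus n-log-n claim, as a sketch only: it counts idealized operations, assumes the quadratic algorithm parallelizes perfectly with zero communication cost, and ignores constant factors. The function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Idealized operation count of an n*log2(n) algorithm on a single core.
double ops_nlogn(double n) {
    assert(n > 1.0);
    return n * std::log2(n);
}

// Idealized per-core operation count of an n^2 algorithm split evenly
// over `cores` cores with no communication overhead.
double ops_quadratic_per_core(double n, double cores) {
    assert(cores >= 1.0);
    return n * n / cores;
}

// Cores needed before the parallel n^2 version merely ties the serial
// n*log2(n) version: n^2 / cores = n * log2(n)  =>  cores = n / log2(n).
double break_even_cores(double n) {
    assert(n > 1.0);
    return n / std::log2(n);
}
```

For a billion items the break-even point under these generous assumptions is on the order of 33 million cores, and it keeps growing with the input size.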
So much for parallel computing.
If you have many INDEPENDENT tasks, then sure, parallel computing is great. Web servers with many clients, graphics, etc. But that's for servers.
On end-user systems, the amount of thread-level parallelism is very limited. Unless you're compiling Gentoo, you're going to top out at a handful of cores. This is not a limitation of the languages people use. It's a practical limitation of the parallelism inherent (or not) in the workloads people run, and it's a hard mathematical limitation of the optimal algorithms people use for common low-level tasks.
http://crd-legacy.lbl.gov/~dhbailey/dhbpapers/twelve-ways.pdf
http://www.davidhbailey.com/dhbpapers/inv3220-bailey.pdf
http://www.cs.binghamton.edu/~pmadden/pubs/dispelling-ieeedt-2013.pdf
There are some people in parallel computing who need to go back to school and learn computational complexity.
The second approach runs into trouble if your tasks aren't independent. Parallel processing works great until you have to start synchronizing state. If one process stalls and the other processes are dependent on it for some data, then the other processes are going to stall anyway. In the real world, most problems are hard to separate cleanly--data dependencies are very very common. So there is a hidden cost to parallelism--the cost of synchronization between the threads, and the cost grows very fast as you add more threads. This is basically Linus's point: outside of specialized domains it's just not possible to cleanly break up most problems into more than just a handful of threads, so having a 1,000 core beast of a processor doesn't help. You would just have 990+ cores waiting on some other core to finish its job, all of the time. Plus there's the fact that debugging multithreaded programs is inherently more difficult than single threaded ones and that all of this is moot if you are I/O bound anyway.
I read the internet for the articles.
On the other hand, if most people think your word means something different, it's not worth using that word.
Got them moderator blues I blieve I walk out the do', With these mod-points I been gettin', I 'most never post no mo'
think whatever you want, man. but when you feel like being an asshole, own it and say i'm an asshole. don't say I am just saying my mind and you're too politically correct if you take offense.
that's exactly what's changed. there's a group of people who were used to being on top for no reason of their own doing. others are like "god this guy's an asshole and I'm fed up with it because there's no reason he's on top". so he's not on top any more but he's really butthurt about it, which is what #gamergate is all about. so to all of these assholes, I'm saying wake up because the problem is you, not everybody else in the world.
It sounds rather like Bill Gates' [supposed] "640K ought to be enough for anybody", but there's no denying that Linus said this one!
Saying that graphics is the only client side app that can utilize large scale parallelism is short sighted bunk, and even ignores what is going on today let alone the future. In 20 years time we'll have handheld devices that would look just as much like science fiction, if available today, as today's devices would have looked 20 years ago.
I have no doubt whatsoever that in the next few decades we'll see human level AI in handheld devices as well as server-based apps, and you better believe that the computing demands (both processing and memory) will be massive. Even today we're starting to see impressive advances in speech and image recognition and the underlying technology is increasingly becoming (massively parallel) connectionist deep learning architectures, not your grandfather's (or Linus's) traditional approaches. Current deep-learning architectures can be optimized to use significantly less resources for recognition-only deployment vs learning, but no doubt we'll see live learning in the future too as AI advances and technology develops.
Linus's relegation of parallelism to server side is equally if not more shortsighted than his lack of vision of client-side CPU-sucking applications! If you want systems that are always available, responsive and scalable then that calls for distributed (client side) implementation, not server based. Future devices are not only going to be smart but the smarts are going to be local. Bye-bye server based Siri.
http://xkcd.com/619/
Answering Linus' "Where the hell..." question:
"Where the hell do you envision that those magical parallel algorithms would be used?"
When you have millions of robots running around your body, repairing your telomere length, resetting the cells' Hayflick limit, and repairing other aging-related damage, so you can live another 200+ years of healthy, relatively physiologically young life.
You know, unless you actually *want* to be old and decrepit, and die centuries before you actually have to...
Limited data dependencies are common, it's true, but fundamental lockings between tasks are not that common in the real world. Most real world tasks aren't like matrix multiplication or whatnot. Let's say the task is a video game and your tasks are things like:
1. Get user input
2. Translate/rotate moving objects
3. Backcompute armature positions
4. Calculate mesh data from armatures
5. Load/unload new scene data
6. Load/unload textures
7. Scale objects by level of detail
8. Process AI
9. Play sound effects
10. Play music
11. Autosave
12. Read from the network.
13. Write to the network
14. Handle special effect animations
15. Render
And on and on and on, your average game has a whole laundry list of these sort of things, and each one is made of many subtasks. Some will be trivial, while others warrant threading even at the subtask level.
Now, when you look at these, of course they're all obviously interconnected in some ways, and you obviously have to use mutexes. But the connections are limited. For example, if you're backcomputing how an armature must be configured, it's obviously going to use the same data structure as the thread that deforms mesh data with armatures. But the only real practical limitation is that the thread that changes armature positions has to lock the one armature it's computing briefly while writing the results of its calculations so that the other thread never reads half-written results - that's it. Likewise, rendering (which has tons and tons of subtasks, and is famously parallel) obviously depends on all sorts of texture and model data from different threads. But again, all it needs is that there not be anything half-written; it doesn't have to wait on any particular result. Objects moving will change their needed level of detail, user actions and collisions may cause sound effects, and on and on, but again, the only requirement is that you not have half-written states.
This is what the vast majority of CPU-intensive tasks in the real world are like. Yes, you have to use mutexes, and you have to be aware of iterator / pointer invalidation on insert / delete into data structures (where applicable), but apart from those sorts of things, they tend to thread very, very well.
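To make the "lock one armature briefly while writing the results" idea concrete, here is a minimal C++11 sketch. The Armature struct, the function names and the placeholder "solve" are all hypothetical and not taken from any real engine. The writer does its slow work on a private copy and holds the mutex only for the swap that publishes the result, and the reader copies a snapshot under the same mutex, so neither side can see a half-written pose.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical armature: a set of joint angles guarded by its own mutex.
struct Armature {
    std::mutex lock;
    std::vector<double> joint_angles;
};

// Writer: do the expensive work on a private copy, then hold the lock only
// for the swap that publishes the finished result.
void update_armature(Armature& arm, double target) {
    std::vector<double> result;
    {
        std::lock_guard<std::mutex> guard(arm.lock);
        result = arm.joint_angles;            // snapshot the current pose
    }
    for (double& angle : result) {            // stand-in for the real solve
        angle = target;
    }
    std::lock_guard<std::mutex> guard(arm.lock);
    arm.joint_angles.swap(result);            // brief critical section
}

// Reader: copy a consistent snapshot under the lock, then work on the copy.
std::vector<double> snapshot_armature(Armature& arm) {
    std::lock_guard<std::mutex> guard(arm.lock);
    return arm.joint_angles;
}

// Runs one writer and one reader concurrently. Returns true if every
// snapshot the reader saw was fully written (all joints equal).
bool run_demo(int iterations) {
    assert(iterations > 0);
    Armature arm;
    arm.joint_angles.assign(64, 0.0);
    bool consistent = true;                   // written only by the reader
    std::thread writer([&] {
        for (int i = 1; i <= iterations; ++i) update_armature(arm, i);
    });
    std::thread reader([&] {
        for (int i = 0; i < iterations; ++i) {
            std::vector<double> pose = snapshot_armature(arm);
            for (double angle : pose) {
                if (angle != pose.front()) consistent = false;
            }
        }
    });
    writer.join();
    reader.join();
    return consistent;
}
```

The reader never waits for any particular update; it takes whatever complete pose is current, which is the only requirement described above.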
If you play a Ke$ha song backwards, you hear messages from Satan. Even worse, if you play it forwards you hear Ke$ha.
Here's pure C; C++ would lead to slightly neater syntax.
void do_operation_on_all(my_struct *array, int size, threadfunc func){
    for(int i=0; i<size; i++){
        launch_thread(func, array[i]);
    }
}
Where launch_thread is a function that calls the correct OS-specific function to launch a thread (probably pthread in most cases).
It would then be called:
do_operation_on_all(array, size, func); which is actually even simpler than your solution.
I still have more fans than freaks. WTF is wrong with you people?
Perhaps not, but should someone get fired because people think a word they used is related to a racial epithet, when it isn't?
The truth is that all men having power ought to be mistrusted. James Madison
Not true, because if the processes are IO bound (and most are), most of the processes will be waiting anyway. But Linus's argument hangs on a more fundamental problem: memory bandwidth. If all the cores are sitting waiting because the data isn't in the cache and the other cores are already trying to use the memory bus, then you'll end up with more unused cycles than if you ran timesliced threads on a single core. The correct answer to this one cannot be made by reasoning and logic from first principles, but only by looking at raw empirical data. I daresay Linus has more of that than most of us here.
Got them moderator blues I blieve I walk out the do', With these mod-points I been gettin', I 'most never post no mo'
And when I said C++ would lead to a nicer syntax, I meant C++ 01 without std::thread and autos, mainly because you could make it a template function instead of special-casing for the type or using void pointers.
I still have more fans than freaks. WTF is wrong with you people?
Perhaps not, but should someone get fired because people think a word they used is related to a racial epithet, when it isn't?
No.
Got them moderator blues I blieve I walk out the do', With these mod-points I been gettin', I 'most never post no mo'
Nothing significant will change this year or in the next 10 years in parallel computing.
You might be right but I'm far less certain of it. The problem we have is that further shrinking of silicon makes it easier to add more cores than to make a single core faster so there is a strong push towards parallelism on the hardware side. At the same time the languages we have are not at all designed to cope with parallel programming.
The result is that we are using our computing resources less and less efficiently. I'm a physicist on an LHC experiment at CERN and we are acutely aware of how inefficient our serial algorithms are at using modern hardware. What we need is a breakthrough in programming languages so that we can write parallel programs efficiently, just like object oriented programming allowed us to scale up the size of programs. Until this happens I agree that not much will change, but if there is some clever CS researcher/student out there with a clever idea for a good parallel programming language, the conditions are right for a revolution.
Most people don't understand lock contention, or lockless code. That's why Dragonfly BSD is ignored, yet is so far ahead: every time someone sees a new problem with parallel computing, with semaphore contention, with threading models, DragonflyBSD is there with a fix from 10 years ago. DragonflyBSD wanted fast semaphores, lockless schedulers, threading models designed to handle running thousands of threads on hundreds of cores, and so on; this was seen, in the early 21st century, as a useless waste of time and a source of complexity. DFBSD is a fork of FreeBSD because the FreeBSD devs wouldn't let the DFBSD guy just do it in FBSD.
It's one of those things. I expect a long, arduous path to catch up to DragonflyBSD, to Minix, and so on, in the same way that we spent so much time catching up to XFS (ext4 spent years trying to reach feature and performance parity with XFS; it now even has on-the-fly inode allocation as an option). There's always some laughable side project somewhere claiming it will change the world, and there's always a point in the future where everyone else starts imitating that project. Whenever I see something big and long-running like this, I recognize it as some other thing; when people start doing multi-version Linux, I will immediately start talking about NixOS (which I think is implemented like crap, but has the right idea).
Support my political activism on Patreon.
"Crock of shit" maybe? "Bunch of crock" doesn't seem like it'd even be a thing.
Replying to myself... apparently it is a thing: http://en.wiktionary.org/wiki/...
To be fair, the trend seems to be hitting a ceiling.
Desktop processors got to quad core and have pretty much sat there. The mobile space has been at quad-core a little less long and there are octo-core implementations moreso than desktop, but it still seems quad core is about where most devices settle. There are more efforts to make GPU style execution cores available for non-graphics use, but in practice a relatively small portion of the market has been able to have meaningful gains exploiting them. As vectorized instructions in cores become more capable, many of those problems actually start coming back to the traditional CPU cores as it works as well as the GPU but with an easier programming model. In short, the marketing results seem to indicate that end user devices might settle around quad core.
Servers have been going up, with 18 core per socket for 2-socket now available. This shows that the desktop parts have room to grow in that dimension, but it just isn't being bothered with.
XML is like violence. If it doesn't solve the problem, use more.
That sounds more like a distributed computing problem rather than applications running on a single 'system'. Even if it were centrally controlled, the computational load being time-shared might mean the best solution is still just a handful of cores. Such nanites would presumably be independent or unused enough that continuous CPU load would likely not even be in the picture. This is very much science fiction, but it still strikes me that the computational load would be negligible compared to the medical/engineering problems overcome. You take 30-40 years to start feeling the effects of aging, so it's not like cells require continual repair to achieve your hypothetical situation, just have to manage to repair everything within 25 years.
XML is like violence. If it doesn't solve the problem, use more.
Http:// www.duncan-white.co.uk
I was lucky enough to gather some parallel programming experience on the Connection Machine CM2, a 64k CPU (yes that is 65536 CPUs), 12 dimensional hypercube, a long time ago. The CM2 ultimately failed but we did get many great insights into parallel programming. At the time it was just not feasible for low cost, on your desktop, computing. It is NO problem to keep massive numbers of cores busy doing interesting computing. OK, the 12 dimensions are less clear on how to use them. At any rate, to claim that there is no need for 100 cores or more is really small minded because, unlike the time when silly "the world does not need more than 5 computers" kinds of comments were made, we already have evidence that there are powerful ways to employ massive parallel computing that can use thousands or even millions of cores.
Just because we are caught in a sequential programming mindset does not mean that there is no room for parallel programming. If you are looking at a two dimensional array of data and think of a nested loop you ARE caught in a sequential programming mindset. Additionally, famous people, including Dijkstra, have poopooed some algorithms that are inefficient when executed sequentially to the point where researchers, or programmers, are not even looking any more for good parallel execution. Take bubble sort. Not sure it was Dijkstra but somebody suggested to forbid it. Yes, on a sequential computer bubble sort is indeed inefficient but guess what. If communication does matter and if you are using a massively parallel architecture (i.e., not 4 cores) bubble sort becomes quite efficient because you only need to talk to your data neighbors. Likewise there are AI algorithms that can be shown to behave really well when conceptualized and executed in parallel. Collaborative Diffusion is an example: http://www.cs.colorado.edu/~ra...
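The neighbor-only version of bubble sort is usually called odd-even transposition sort. Below is a rough C++11 sketch of it, offered as an illustration rather than as what ran on the CM2. In each phase every compare-exchange touches a disjoint pair of neighbors, so the pairs can be processed at the same time without locks, and n phases are enough to sort n elements. The function name and the num_threads parameter are my own, and spawning fresh threads every phase is only a stand-in for a machine with one processor per element.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Odd-even transposition sort. Even phases compare pairs (0,1), (2,3), ...
// and odd phases compare pairs (1,2), (3,4), ... Each element only ever
// exchanges data with its immediate neighbors.
void odd_even_transposition_sort(std::vector<int>& data, unsigned num_threads) {
    assert(num_threads > 0);
    const std::size_t n = data.size();
    for (std::size_t phase = 0; phase < n; ++phase) {
        const std::size_t first = phase % 2;
        std::vector<std::thread> workers;
        for (unsigned t = 0; t < num_threads; ++t) {
            workers.emplace_back([&data, n, first, t, num_threads] {
                // Worker t handles every num_threads-th pair of this phase.
                // Pairs within a phase never overlap, so no locking is needed.
                for (std::size_t i = first + 2 * t; i + 1 < n; i += 2 * num_threads) {
                    if (data[i] > data[i + 1]) std::swap(data[i], data[i + 1]);
                }
            });
        }
        // Barrier between phases: every pair must finish before the next phase.
        for (std::thread& w : workers) w.join();
    }
}
```

On a handful of cores the thread start-up cost swamps any gain, which fits the point above: the approach only pays off when there are roughly as many processors as data elements and neighbor communication is cheap.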
What's wrong with it? It only said you may recognise him - it didn't say that most or many would.
Shut up, Dave...
You make me miss Shampoo.
I am very small, utmostly microscopic.
+1 this would make the best gravestone ever.
The trouble is that extrapolating the present isn't a great way to predict the future!
If computers were never required to do anything much different than they do right now then of course the processing/memory requirements won't change either.
But... of course things are changing, and one change that has been a long time coming but is finally hitting consumer devices is the arrival of hard "fuzzy" problems like speech recognition, image/object recognition, natural language processing, artificial intelligence... and the computing needs of these types of application are way different than running traditional software. We may start with accelerators for state-of-the-art offline speech recognition, but in time (a few decades) I expect we'll have pretty sophisticated AI (think smart assistant) functionality widely available that may shake up hardware requirements more significantly.
There are lots of moving parts here. Just adding cores doesn't work unless you can balance it out with sufficient cache and main memory bandwidth to go along with the cores. Otherwise the cores just aren't useful for anything but the simplest of algorithms.
The second big problem is locking. Locks which worked just fine under high concurrent loads on single-socket systems will fail completely on multi-socket systems just from the cache coherency bus bandwidth the collisions cause. For example, on an 8-thread (4 core) single-chip Intel chip having all 8 threads contending on a single spin lock does not add a whole lot of overhead to the serialization mechanic. A 10ns code sequence might serialize to 20ns. But try to do the same thing on a 48-core opteron system and suddenly serialization becomes 1000x less efficient. A 10ns code sequence can serialize to 10us or worse. That is how bad it can get.
Even shared locks using simple increment/decrement atomic ops can implode on a system with a lot of cores. Exclusive locks? Forget it.
The only real solution is to redesign algorithms, particularly the handling of shared resources in the kernel, to avoid lock contention as much as possible (even entirely). Which is what we did with our networking stack on DragonFly and numerous other software caches.
Some things we just can't segregate, such as the name cache. Shared locks only modestly improve performance but it's still a whole lot better than what you get with an exclusive lock.
The namecache is important because something like a bulk build, where we have 48 cores all running gcc at the same time, winds up sharing an enormous number of resources. Not just the shell invocations (where the VM pages are shared massively and there are 300 /bin/sh processes running or sitting due to all the Makefile recursion), but also the namecache positive AND negative hits due to the #include path searches.
Other things, particularly with shared resources, can be solved by making the indexing structures per-cpu but all pointing to the same shared data resource. In DragonFly doing that for seemingly simple things like an interface's assigned IP/MASKs can improve performance by leaps and bounds. For route tables and ARP tables, going per-cpu is almost mandatory if one wants to be able to handle millions of packets per second.
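A user-space analogue of the per-cpu idea, as a sketch only: this is not DragonFly's kernel code, and the names ShardedCounter, PaddedCounter and count_in_parallel are made up for illustration. Instead of every thread hammering one shared counter, each worker updates its own slot, padded to an assumed 64-byte cache line, and the slots are summed only when a total is needed.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// One slot per worker. alignas(64) assumes 64-byte cache lines; the
// over-aligned vector storage is only guaranteed from C++17 onward.
struct alignas(64) PaddedCounter {
    std::atomic<std::uint64_t> value{0};
};

class ShardedCounter {
public:
    explicit ShardedCounter(unsigned shards) : shards_(shards) {
        assert(shards > 0);
    }

    // Each worker passes its own index, so updates land on separate lines.
    void add(unsigned shard, std::uint64_t n) {
        shards_[shard % shards_.size()].value.fetch_add(n, std::memory_order_relaxed);
    }

    // Reading walks every shard; reads are assumed rare compared to updates.
    std::uint64_t total() const {
        std::uint64_t sum = 0;
        for (const PaddedCounter& c : shards_) {
            sum += c.value.load(std::memory_order_relaxed);
        }
        return sum;
    }

private:
    std::vector<PaddedCounter> shards_;
};

// Hypothetical driver: every worker increments only its own shard.
std::uint64_t count_in_parallel(unsigned workers, std::uint64_t increments_each) {
    ShardedCounter counter(workers);
    std::vector<std::thread> threads;
    for (unsigned w = 0; w < workers; ++w) {
        threads.emplace_back([&counter, w, increments_each] {
            for (std::uint64_t i = 0; i < increments_each; ++i) counter.add(w, 1);
        });
    }
    for (std::thread& t : threads) t.join();   // join makes the totals visible
    return counter.total();
}
```

The trade-off is the usual one for this pattern: updates stop contending, but a read has to visit every shard.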
Even something like the fork/exec/exit path requires an almost lockless implementation to perform well on concurrent execs (e.g. such as /bin/sh in a large parallel make). Before I rewrote those algorithms our 48-core opteron was limited to around 6000 execs per second. After rewriting it's more like 40,000+ execs per second.
So when one starts working with a lot of cores for general purpose computing, pretty much the ENTIRE operating system core has to be reworked, because what worked well with only 12 cores will fall on its face with more.
-Matt
BZZT, fail.
1) You didn't define launch_thread.
2) my_struct_array was said, and I quote, "a local-context data structure", so congrats, your data is going to go out of scope on you.
3) The concept of having to write that is absurd because "for (auto&i : container)" is a "do whatever you want, any number of steps, no matching function signature required, inline, on any container whatsoever" built into C++11, *and* it's something that anyone who knows C++11 will know rather than being something you brewed yourself.
Again, to repeat, given your failures on #1 and #2:
" if you're too lazy to do it here, or change the requirements to present yourself with a simpler problem, then I'm going to take it that you're too lazy to do it in your code, too."
Hence, I'm going to take it that you're likewise too lazy to actually thread your code. And the fact that your code contains a fundamental oversight resulting in a memory leak which wouldn't have caused a compile error is just icing on the cake.
If you play a Ke$ha song backwards, you hear messages from Satan. Even worse, if you play it forwards you hear Ke$ha.
Hmm, I was thinking of your launch_thread in terms of passing by reference, but I now imagine you meant copy (would have helped if you had actually, you know, defined the function). But then you're just adding an extra and unnecessary copy.
Let me help you out. Your function is going to have to keep a global data structure of all of the threads' arguments because they're too big to pass as the pthread's argument. Now, your array isn't going to be fixed-size because you don't know how many instances are going to be called (you could limit it and put a hard cap, but you still have to put checks for that). If it's pure C, then you don't have STL containers, so you have to implement all of your memory management overhead. Regardless, you at the very least have to do an additional copy of your passed my_struct into your global arguments structure (2x), versus the one that std::thread needs. Now, there is a way to work around having to keep a global data structure, but it sucks: it's to have your launch_thread function pass a pointer to the local copy of my_struct and then sit around and wait for the thread to start up, copy off of the pointer, and then zero out your copy to alert launch_thread that it's started and has copied the data structure (of course, this involves yet another copy, plus a ton of reads while sitting around and waiting and wasting time). All of this, of course, is on top of all of the overhead imposed by pthread itself, including defining a function (and not in the same place where the code is being used, which reduces clarity), and roughly three lines for the pthread calls themselves.
This is all assuming that you implement it pthread-only and not portable. Otherwise, you have to add in #ifdefs and do a whole different approach for whole different platforms.
Could you do all this? Of course you could. Would you do it? Clearly you didn't, and I know no amount of badgering would have gotten you to do it (I've tried this experiment before, you're not the first). Could you write it once and then reuse it?** Sure you could. Have you? No, of course you haven't, otherwise you would have just pasted it before. Why haven't you written such a thing before? Because it's too much hassle. Which is the very reason threading is underused.
** - kind of. You see, it's actually worse than that because unless you make an even more convoluted and unreadable and type-unsafe function, your thread launcher is going to be only set up for launching this particular case. But one can encounter all kinds of threading needs that would require significant changes. But I digress.
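For comparison, here is roughly what the same "run func on every element" helper could look like with C++11 std::thread, as a sketch under assumptions: the thread never defines my_struct, so the two fields below are placeholders, and doubled_values is an invented usage example. std::thread makes its own copy of each argument, so there is no global argument table and no handshake with the new thread. Unlike fire-and-forget launching, this version joins every thread before returning.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Placeholder payload; the fields are assumptions for illustration.
struct my_struct {
    int id;
    double value;
};

// Launch one thread per element and wait for all of them. std::thread
// stores its own copy of func and of each item, so the caller's data can
// go out of scope once the threads have been joined.
template <typename Func>
void do_operation_on_all(const std::vector<my_struct>& items, Func func) {
    std::vector<std::thread> threads;
    threads.reserve(items.size());
    for (const auto& item : items) {
        threads.emplace_back(func, item);
    }
    for (auto& t : threads) t.join();
}

// Example use: each thread writes only to its own output slot, so no
// locking is needed. Assumes every id is a valid index into the output.
std::vector<double> doubled_values(const std::vector<my_struct>& items) {
    std::vector<double> out(items.size(), 0.0);
    do_operation_on_all(items, [&out](my_struct s) {
        assert(s.id >= 0 && static_cast<std::size_t>(s.id) < out.size());
        out[s.id] = s.value * 2.0;
    });
    return out;
}
```

One thread per element is only sensible when each element carries a lot of work; for small structs the start-up cost dominates.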
If you play a Ke$ha song backwards, you hear messages from Satan. Even worse, if you play it forwards you hear Ke$ha.
No, it is not. They tried getting parallel programming off the floor 40 years ago and have consistently been failing since then. Linus sums up the results of the last few decades of R&D perfectly.
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
It isn't, though, except for integer operations and tossing things around. Floating point core elements have a ways to go yet to get to single cycle for everything, and so spreading math among cores still saves time. OS folk like Linus may tend to think in terms of byte-to-BusSize manipulation. A lot of us deal with more nuanced data and operations. I *guarantee* you that a multicore processor will chew up properly designed image manipulation tasks a good deal faster than a single core will, and more flexibly (and more system-friendly) than a GPU can too, although slower for ops that fit in the GPU's memory and for which it offers competence. Software defined radio also makes terrific use of multiple cores, for instance here, a 3 GHz system with 8 cores is mostly free to do other stuff, and a system with one core running at the same speed is about 90% utilized, which doesn't leave enough horsepower to do much else. Whereas with the 8-core, I can run the SDR and do whatever the heck I want. Then there's the "what do you mean by 'core'" question. Does the core have an FPU, or is it one of those profoundly crippled integer-only units? Does the core actually share memory (and therefore memory bandwidth) with other cores, or does it have its own pool of RAM? Is eco throttling choking it half to death? And so on.
What is this "hard drive" thing you describe? Doesn't everyone use boards with terabytes of RAM for near-term storage?
Seriously, though, we all know (well, the ones who have considered it) that's exactly where we're going. SSDs as they stand today are just the tip of the iceberg; you want to know what's coming, instantiate a ram disk on your machine and run some benchies with it. And when we get to real RAM based storage, or anything of similar speed (or perhaps better... memristors?), we won't have wanted CPU development to have been sitting on laurels planted in a garden made of dead-slow storage in the interim.
True enough, but of course, that's not what happens, so... Effectively -- of course they can and do switch roles when memory is shared -- one is monitoring your ethernet, several are kicking in and out of httpd threads and/or processes, and so on for hundreds of OS tasks, and if you're like me, more than a few user tasks as well. For every task within a process that isn't hidebound by disk (and there are already a lot of them) having an additional available core is a very worthy thing. And when cores are tied up waiting for high level math operations, memory is (more) free relative to the needs of the available cores, and things simply run smoother, sooner. There's a lot of handwaving in there because of the complexity of caching and lookahead and so on, but the bottom line is that on my 8-core machine I can do a lot more than on my 2-core machine, even though both have the same amount of memory and run at the same speed. And I apologize for the mangling of terminology. I think the point remains clear:
Multiple cores are a great thing.
I've fallen off your lawn, and I can't get up.
It's too early to know if it's just too hard a problem for the human mind in general
Most user-space parallel problems aren't hard, it's just programmers who use algorithms and data-structures as black-boxes without understanding their implementation or characteristics, or alternatives, or generally being able to think for themselves. I don't know how many times I've glanced at problems that were throughput sensitive, and I immediately saw large potentials for parallelism, but required designs that would be utterly illogical for a serial design.
Solving code parallelism problems is nearly identical to making well factored code. You need to break down the problem into its atomic parts, then rearrange those parts. Once you understand all atomic parts of a system and all of the data dependencies, parallelism becomes trivial. The problem is most people don't "understand" the system that they're working on; they just mindlessly throw code at a wall and some of it sticks. Most parallel code really needs to be designed from the beginning. Designing code? What's that?
Part of growing up is learning to know when you don't know what you're talking about and Linus is calling you on it. Every time that I've looked into why Linus was "wrong", it was because he was wrong in theory, but correct in practice, because in practice, people are idiots and Linus recognizes this.
I assume Linus is looking at this from a practical standpoint: jumping the gun on massive overhauls of the kernel, to optimize for our current limited understanding of concurrent software and hardware interactions, for a problem that most programmers are too stupid to even take advantage of, would be a bit premature. We should wait for hardware and software to better stabilize before we get locked (pun) into a concurrency regime for the next few decades.
We've only just recently gained concurrent support for network and storage IO, and hardware has been changing a lot in the past few years as we keep scaling up SSDs and 40gb+ NICs. We can use work-arounds in the meantime, and once everyone says "yes, this is the best way", we can make large kernel changes.
Another example: AMD is already working on Mantle. Even if it doesn't fully take off, it's research into a related area, and we'll learn a lot from it. At some point in the future, a Mantle-like system may be incorporated into the kernel, but let's not turn the kernel into a cesspool of ever changing interfaces while they figure this problem out.
Yes, specifically it's where there's a bit of a stooshie over something silly like niggardly, everyone finally calms down a bit, and then some asshole decides that the correct thing to do is run around shouting "NIGGER NIGGER NIGGER", because political correctness gone mad.
[FUCK BETA]
Bury him next to Dr "Got shot for being a paediatrician".
It's ridiculous that not only does the article not mention Erlang or Haskell, but no high modded comment does either.
Sad. Erlang's been around for more than 25 years with its successful lockless model.
you had me at #!
http://en.wikipedia.org/wiki/M...
http://en.wikipedia.org/wiki/A...
I learned MapReduce for use with CouchDB and it is a powerful technique even when not on parallel hardware -- although a bit of a conceptual shift.
Here is a group using MapReduce with Hadoop for image processing:
http://hipi.cs.virginia.edu/
"HIPI is a library for Hadoop's MapReduce framework that provides an API for performing image processing tasks in a distributed computing environment. "
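The map/reduce shape itself is small enough to show on a single machine. This sketch uses C++11 std::async rather than Hadoop or CouchDB, and the function names are invented for illustration: the "map" step squares and sums each chunk independently, and the "reduce" step adds up the partial results.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <future>
#include <vector>

// Map step for one chunk: square each value in [begin, end) and sum them.
long long sum_of_squares_chunk(const std::vector<int>& v,
                               std::size_t begin, std::size_t end) {
    long long sum = 0;
    for (std::size_t i = begin; i < end; ++i) {
        sum += static_cast<long long>(v[i]) * v[i];
    }
    return sum;
}

// Splits the input into up to `chunks` contiguous pieces, maps each piece
// on its own task, then reduces the partial sums on the calling thread.
long long parallel_sum_of_squares(const std::vector<int>& v, std::size_t chunks) {
    assert(chunks > 0);
    const std::size_t chunk_size = (v.size() + chunks - 1) / chunks;
    std::vector<std::future<long long>> partials;
    for (std::size_t begin = 0; begin < v.size(); begin += chunk_size) {
        const std::size_t end = std::min(begin + chunk_size, v.size());
        partials.push_back(std::async(std::launch::async, sum_of_squares_chunk,
                                      std::cref(v), begin, end));
    }
    long long total = 0;
    for (auto& p : partials) total += p.get();   // reduce step
    return total;
}
```

Because the chunks share nothing until the final sum, the same structure carries over to separate machines, which is the conceptual shift mentioned above.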
Linus wrote: "The only place where parallelism matters is in graphics or on the server side, where we already largely have it. Pushing it anywhere else is just pointless." But would Linus really think image processing (like for robots or self-driving cars or using Baxter to sort your kid's Legos) is not an important issue? Sounds a bit like "640K is enough memory for anyone". Failure of the imagination is all too common based on unfamiliarity with some problem domain. Although, to be frank, I thought 32K of RAM on a Commodore PET was more than enough memory for anyone, because I could not imagine writing a program that large at the time. :-)
Also, agent-based simulations or zone-based simulations can often use as much parallel hardware as you can throw at it, even if there may be occasional short synchronization steps. For example you could have a Minecraft-like game with thousands of active entities like wolves, zombies, pigs, and so on -- as well as processes like erosion or plant growth going on in multiple zones simultaneously. Game design could really change with millions of available general purpose cores. My wife and I created an algorithm for growing botanically accurate plants, but current games like Minecraft can't use it to grow each unique plant because it would be too computationally intensive if you had millions of unique plants all growing at the same time.
https://github.com/pdfernhout/...
Congrats on your luck/skill in working with Thinking Machines hardware like the CM2. Around 1984, when a psychology undergrad at Princeton interested in AI, I had developed some software called "Mex" for multiple execution where I ran up to 1000 simulated processors on an IBM mainframe under VMUTS. I was using it to help process some data from a robot vision system I had put together (which itself had three 6502 processors). I was really excited about the idea of linking together lots of 6502 processors. I applied for a job then at Thinking Machines but didn't get an offer. A sociology grad student I knew from then (Clifford Nass) got a job offer there (and that is part of why I applied there) but he didn't take the offer, which is kind of ironic. He's brilliant and innovative as his career shows, but not really a programmer or hardware guy, and not all that interested in AI that I knew of:
http://adlininc.com/uxpioneers...
I'm shocked and saddened just now, when checking what he is up to, to see on Wikipedia that Cliff died recently of a heart attack:
http://en.wikipedia.org/wiki/C...
What a big loss for Cliff's family as well as the world. And not that long after the sad loss of Professor Jim Beniger, who was an inspiration and good role model to both Cliff and myself in various ways.
I can see though how Thinking Machines could also have benefited from Cliff's cleverness in thinking about human/machine interaction related to control of a (then) new type of machine. Maybe they'd still be in business if Cliff had gone to work with them? And maybe, being associated with MIT, they did not need yet one more programmer or hardware person, no matter how much they were interested in parallel processing or had done their own projects already on it.
A 21st century issue: the irony of technologies of abundance in the hands of those still thinking in terms of scarcity.
Maybe that's why the banks F'd up mortgage pricing?
Table-ized A.I.
Verification is the process of checking that software works correctly. The more complex the system, the more complex the process of verification.
You said "verification" but you're thinking of "software quality assurance". Though "verification" is sometimes used to describe a step in that process, when used standing alone (at least here in Silicon Valley), it refers to the analogous process in integrated circuit design.
Verification is a BIG DEAL in integrated circuit design. A good hardware project will have at least as many verification engineers as designers (and hardware designers will freely act as verification engineers - on OTHER designers' modules - during the later stages of a chip tapeout, without taking a career hit.) It is the limiting factor in when the chip design hits silicon and when it hits the market.
So IMHO the previous poster is talking about the up-front quality assurance processes and costs of hardware, rather than software, complexity.
(Releasing a rev to a software product due to a QA issue missed due to added complexity may be costly. But releasing a rev to silicon takes months and millions of dollars of sunk cost. They're not in the same league.)
Bantam Dominique roosters crow a four-note song. Once you've heard it as "Happy BIRTHday" you can't NOT hear it that way
Why haven't you written such a thing before? Because it's too much hassle. Which is the very reason threading is underused.
LOL. Actually there's a better reason such a thread launch facility doesn't commonly get written - which is that, in most circumstances, it really doesn't help performance that much, if at all - and the added complexity makes for a big net minus. There are a number of issues:
Firstly, spawning threads is expensive. Yes, on Linux it's "cheap", but that's "cheap" compared to other implementations - it's still a lot compared to doing a modest amount of work on the local CPU. (Why is it so expensive? Basically because there's a lot of housekeeping to do. In addition to the kernel creating new kernel structures for the new thread of execution (similar to creating a process), the process's thread library must allocate a stack for the new thread (involving modifying the process's page tables), iterate through all loaded shared libraries in order to allocate any thread-local storage they require, and so on, requiring multiple syscalls, a TLB flush, at least one context switch, and so on.) To some extent the impact of this overhead can be reduced by maintaining a pool of ready-created threads, but this either takes away control of performance (if done automatically by your language/library) or substantially increases complexity (if you implement it yourself, since you then have to synchronise the threads carefully).
The second problem is that, unless you're very careful, extra threads don't buy you much performance, and can indeed hurt. Take the example you gave - doing some processing on each struct in an array, where each such struct contains an int and a double (16 bytes total, including alignment padding). With 64-byte cache lines (typical on x86), there are 4 such structs per cache line. If you distribute the processing over threads running on different cores, then instead of one core waiting for the cache line to come in from main memory, and then processing the 4 structs very rapidly (since they're now all in cache), you'll have 4 cores each waiting for the data to be available - i.e. up to a 4x slowdown for memory-bound tasks. And that's assuming the structure is only read from; if it's written to as well then the cache line will have to bounce between cores, and the multithreading slowdown will be many times worse. Now, if you ensure that structs in the same cache line get processed by the same core (ideally in sequence, and by the same kernel thread), then you do potentially get a big speedup - provided you don't hit any other gotchas - but the C++ code you're promoting doesn't seem to guarantee this in any way.
Third, and perhaps most importantly, data dependencies matter. In your example you're detaching all the threads; this is not realistic, because that means you cannot ever depend on their operations having finished. In the vast majority of cases you do need to know when an operation has finished: you're generally doing work for a reason - i.e. that you're going to use the result - and you can't begin to use that result until you know it has been produced. That, in and of itself, adds complexity: you have to analyse your program's dataflow much more carefully in the presence of threads, because C/C++ will quite happily let you use a variable before another thread has finished assigning to it, without any sort of warning or exception. The analysis can certainly be done, and synchronisation put in place to eliminate the problems - but that is further overhead, both in the program's performance and in the complexity of the program itself, and hence the time taken to write it (and especially to enhance it later, when the synchronisation model may not be so fresh in one's mind).
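For contrast with detaching, here is a small sketch of making the dependency explicit with std::async and std::future. The function name mean_via_future is invented for the example. The point is only that get() blocks until the other thread has actually produced the value, so the division can never see a half-written result.

```cpp
#include <cassert>
#include <future>
#include <numeric>
#include <vector>

// Computes the mean of xs, doing the summation on another thread.
double mean_via_future(const std::vector<double>& xs) {
    assert(!xs.empty());

    // std::launch::async asks for the work to run on a separate thread.
    std::future<double> total = std::async(std::launch::async, [&xs] {
        return std::accumulate(xs.begin(), xs.end(), 0.0);
    });

    // The result is needed here, so we must wait for it. With a detached
    // thread writing into a shared variable there would be no safe point
    // at which to read that variable.
    return total.get() / static_cast<double>(xs.size());
}
```

This is correct, but it also shows the cost: the calling thread sits idle in get(), so for a dependency chain like this one the extra thread buys nothing.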
Used correctly and in the right circumstances, threads on an N-core system can give an N-times speedup (or greater, due to caching effects). Used badly, at best they'll reduce performance, and usually they'll increase complexity and lead to subtle bugs that are hard to debug.
The new thread features in modern C++ are very cool, but the fact they didn't exist before is not what's been preventing competent programmers from using threads all over the place :)
Need to type accents and special characters in Windows? Use FrKeys
There is limited application for making processes faster through parallelism. It only works well for processes that do not rely on the results of any of the other processes. Unfortunately, many real world applications depend on sequential tasks and I/O. That leaves running multiple applications in parallel, but that is different from parallel programming, and it is a task that current operating systems already handle quite well.
The issue is that when processor vendors went to dual and then quad core, people started extrapolating and saying 'oh in a decade, we'll be using hundreds of cores on a random desktop'. Instead it tapered out at about 4 for the most part with focus on reducing the power envelope while minimizing performance loss.
I would say the discussion presuming massive core counts is based on an extrapolation of older trends of increasing core count, and it's perfectly reasonable to step back and recognize the change in the trend. Sure, tomorrow we could suddenly be back on the path to 256 core desktop solutions for unforeseen reasons, but as it stands, there are no signs of that being the priority of the industry.
XML is like violence. If it doesn't solve the problem, use more.
It sounds like you're suggesting that memory bus speed will not continue to increase, and thus, we should stop adding bus contention by adding cores. The conclusion there hinges on a rather unsupported premise that is contradicted by the (historical) empirical data. All signs point to memory becoming much faster indeed.
If Linus' expertise were really relevant here, perhaps Transmeta wouldn't have failed.
Memory bus speed is increasing, and therefore the cost of cache misses is decreasing. One way or another, that still leaves us with cache misses as a bottleneck. The question is not a straightforward one of "memory bus speeds are increasing so who gives two hoots" -- there is a very subtle equation needed to determine what cache size is optimal with what bus speed, and for which task.
Got them moderator blues I blieve I walk out the do', With these mod-points I been gettin', I 'most never post no mo'
Well, there's obviously no need to add more cores/parallelism until there's a widespread need for it (unless you are Chinese, when octocore is a must!), but I think the need is coming pretty fast.
There are all sorts of cool and useful things you can do with high quality speech, image, etc recognition, natural language processing and AI, and these areas are currently making rapid advances in the lab and slowly starting to trickle out into consumer devices (e.g. speech and natural language support both in iOS and Android).
What is fairly new is that, in the lab, state-of-the-art results in many of these fields are now coming from deep learning / recurrent neural net architectures rather than traditional approaches (e.g. MFCC + HMM for speech recognition), and these require massive parallelism and compute power. These technologies will continue to migrate to consumer devices as they mature and as the compute requirements become achievable...
Smart devices (eventually *really* smart) are coming, and the process has already started.