An Overview of Parallelism
Mortimer.CA writes with a recently released report from Berkeley entitled "The Landscape of Parallel Computing Research: A View from Berkeley: "Generally they conclude that the 'evolutionary approach to parallel hardware and software may work from 2- or 8-processor systems, but is likely to face diminishing returns as 16 and 32 processor systems are realized, just as returns fell with greater instruction-level parallelism.' This assumes things stay 'evolutionary' and that programming stays more or less how it has done in previous years (though languages like Erlang can probably help to change this)." Read on for Mortimer.CA's summary from the paper of some "conventional wisdoms" and their replacements.
Old and new conventional wisdoms:
Old and new conventional wisdoms:
- Old CW: Power is free, but transistors are expensive.
- New CW is the "Power wall": Power is expensive, but transistors are "free." That is, we can put more transistors on a chip than we have the power to turn on.
- Old CW: Monolithic uniprocessors in silicon are reliable internally, with errors occurring only at the pins.
- New CW: As chips drop below 65-nm feature sizes, they will have high soft and hard error rates.
- Old CW: Multiply is slow, but load and store is fast.
- New CW is the "Memory wall" [Wulf and McKee 1995]: Load and store is slow, but multiply is fast.
- Old CW: Don't bother parallelizing your application, as you can just wait a little while and run it on a much faster sequential computer.
- New CW: It will be a very long wait for a faster sequential computer (see above).
Mortimer.CA writes with a recently released report from Berkley entitled "The Landscape of Parallel Computing Research: A View from Berkeley
Would that be a Parallelograph?
Push Button, Receive Bacon
pretty much the same thing Dave Patterson's been saying for a while now...in fact, the CW sounded so familiar, I went back to double check his lecture slides from more than a year ago:
/ Cs252s06-lec01-intro.pdf
http://vlsi.cs.berkeley.edu/cs252-s06/images/1/1b
and it's pretty much identical (check out slide 3 on the first page of the pdf)
I'm wating for a language which would parallelize stuff for you. This is most likely to be a functinal language, or an extension to an existing functional language. Maybe even Erlang.
I think the main reason people say "don't use threads" is because while single threaded apps are easy to debug, multi-threaded ones will crash and burn at seemingly random places if the programmer didn't plan ahead and use proper locking. This is probably good advice to a noob programmer but I otherwise can't stand people who are of the "absolutely, never, ever, use threads" mindset.
Some applications have no need to be multithreaded, but when they do it is a lot easier than people make it out to be. Taking advantage of lock-free algorithms and NUMA for maximum scalability *can* be hard, but the people who need these will have the proper experience to tackle it.
Language extensions for threading would be great, and I'm sure somebody is working on it. But until that magical threading language (maybe c++1x) comes along the current ones work just fine.
"but is likely to face diminishing returns as 16 and 32 processor systems are realized"
Then we are doing something wrong. The human brain provides compelling evidence that massive parallelization works. So: what are we missing?
This sig does not contain any SCO code.
Take a look at LabVIEW, a compiled graphical programming language from National Instruments. It natively supports SMP / multicore / multithreading. Essentially, dissociated pieces of code you write (computations, hardware I/O, etc.) are automatically scheduled in separate threads of execution in order to maximize efficiency. It's an interesting idea: here's a technical article from their website that does a better job of describing it (some marketing included as well): http://zone.ni.com/devzone/cda/tut/p/id/4233
In case any has missed it: http://video.google.com/videoplay?docid=-583031888 2717959520
I can't wait for the sequel!
In the immortal words of Socrates, "I drank what?"
I think parallelism can be achieved elegantly using languages that express what is to be done, rather than how it is to be done. Functional programming is a major step in the right direction. Not only do functional programs typically more clearly express what is to be done (as opposed to which steps are to be taken to get there), they also tend to cause fewer side effects (which restrict the correct evaluation orders). In particular, not using variables avoids many of the headaches traditionally involved in multithreading.
Please correct me if I got my facts wrong.
Right, massively parallel computing doesn't work? And, this is from a research institution? And, research institutions don't use massively parallel computers? Right.
Hmmm, seems to me that some cpu makers, just like car makers, would like to continue selling you legacy hardware. I am not buying. Commoditized massively parallel cpus are a step in the right direction. This is the way that RAM memory is sold today. The same works with CPUs. In fact, I would go a little further and combine the two, CPU with RAM, in a single commodity. $5 per cpu means I can buy hundreds of cpus for the same price as that single monolithic legacy cpu that you have little choice about today. ie. The desktop computer company let's you choose the amount of RAM, and size of hard disk, but they don't let you choose which CPUs. I think that this is all about to change.
Most of the time a computer isn't churning away on a single problem that needs to be parallelized. In that respect, the solution rests more with the operating system. Something like a server could relatively easily make use of almost any number of processors; one per client maybe. The trouble comes with bus contention. It is a problem similar to designing a subnet. If you have too many computers/processors, communications grind to a halt. In other words, adding processors slows things down rather than speeding them up.
The solution to adding more processors may be architectural. Adding more busses, ala harvard architecture for example, might be an effective approach. This is the approach used in DSPs which are a lot more powerful on a per-cycle basis than the more conventional architecture.
But, service based applications will soar. As a web developer, I'm thrilled by the prospect of being able to run my web application on 16 and 32 core machines. I could care less about whether office apps run any more quickly.
This isn't bringing parallelisation to your web app for free, instead you're getting 1 cpu just liek you used to have. The difference is that the server can handle many more users, with new limits of memory and disk i/o, of course.
I just heard that talk; he gave it at EE380 at Stanford a few weeks ago.
First, this is a supercomputer guy talking. He's talking about number-crunching. His "13 dwarfs" are mostly number-crunching inner loops. Second, what he's really pushing is getting everybody in academia to do research his way - on FPGA-based rackmount emulators.
Basic truth about supercomputers - the commercial market is zilch. You have to go down to #60 on the list of the top 500 supercomputer before you find the first real commercial customer. It's BMW, and the system is a cluster of 1024 Intel x86 1U servers, running Red Hat Linux. Nothing exotic; just a big server farm set up for computation.
More CPUs will help in server farms, but there we're I/O bound to the outside world, not talking much to neighboring CPUs. If you have hundreds of CPUs on a chip, how do you get data in and out? But we know the answer to that - put 100Gb/s Ethernet controllers on the chip. No major software changes needed.
This brings up one of the other major architectural truths: shared memory multiprocessors are useful, and clusters are useful. Everything in between is a huge pain. Supercomputer guys fuss endlessly over elaborate interconnection schemes, but none of them are worth the trouble. The author of this paper thinks that all the programming headaches of supercomputers will have to be brought down to desktop level, but that's probably not going to happen. What problem would it solve?
What we do get from the latest rounds of shrinkage are better mobile devices. The big wins commercially are in phones, not desktops or laptops. Desktops have been mostly empty space inside for years now. In fact, that's true of most non-mobile consumer electronics. We're getting lower cost and smaller size, rather than more power.
Consider cars. For the first half of the 20th century, the big thing was making engines more powerful. By the 1960s, engine power was a solved problem, (the 1967 turbine-powered Indy car finally settled that issue) and cars really haven't become significantly more powerful since then. (Brakes and suspensions, though, are far better.)
It will be very interesting to see what happens with the Cell. That's the first non-shared memory multiprocessor to be produced in volume. If it turns out to be a dead end, like the Itanium, it may kill off interest in that sort of thing for years.
There are some interesting potential applications for massive parallelism for vision and robotics applications. I expect to see interesting work in that direction. The more successful vision algorithms do much computation, most of which is discarded. That's a proper application for many-CPU machines, though not the Cell, unless it gets more memory per CPU. Tomorrow's robots may have a thousand CPUs. Tomorrow's laptops, probably not.
Those of us that use HPC clusters (i.e. Beowulf) have been thinking about these issues as well. For those interested, I wrote a series of articles on how one might program 10,000 cores (based on my frustrations as programmer and user of parallel computers). Things will change, there is no doubt.
The first in the series is called Cluster Programming: You Can't Always Get What You Want The next two are Cluster Programming: The Ignorance is Bliss Approach, and Cluster Programming: Explicit Implications of Cluster Computing.
Comments welcome.
HPC for Primates. Read Cluster Monkey
AKA F--, The simplest explicit programming model on the planet. Brainchild of Bob Numrich, unsung hero of Cray Research in the early 90's ( & probably much before... but that was when I was lucky enough to work with him) F-- was Numrich's second great contribution to parallel programming models... the first being the shmem model for the Cray T3D, Four assembly routines which made the raw capabilities of the T3D available to massively parallel applications when every other programming model (e.g. MPI) had about 50x the communication overhead. This was a big factor in Cray's takeover of the short-lived MPP market in the mid 90's. On the topic of the thread.... Explicit programming models scale to thousands of processors, implicit ones peter out at 4-8. The reason is Data Locality. Explicit models ensure that the processor is operating on data which is local and unshared. Implicit models end up fighting for operands with competing processors. This requires either heroic hardware ( e.g. 70% of the Cray C-90s processor logic was concerned with memory contention resolution) or a dramatic performance dropoff.
Actually, I've been working on a programming language/model that makes programs inherently parallel. Of course, it is quite different from anything currently in existence. Basically, it uses a queue (hence the name "Que") to store data (like the stack in FORTH), but due to the nature of the queue, programs become inherently parallel. Large programs could have hundreds of processes running at the same time, if so inclined.
If you are interested, check out my project (there's not much there right now), and/or contact me at FMota91 at GMail dot com.
09 F9 11 02 9D 74 E3 5B D8 41 56 C5 63 56 88 C1 bottles of beer on the wall. Take one down, pass it round... Oh, umm...
For those that are interested, the Berkeley View project website is at http://view.eecs.berkeley.edu/, which includes some video interviews with the principal professors involved in the project. There is also a blog at http://view.eecs.berkeley.edu/blog/
Any reproducable bug in the parallel binary will be reproducable given the same set of inputs on the sequential binary, which you can then debug as you have the corresponding sequential source code.
So why isn't this done? Automagically parallelizing compilers (as opposed to compilers that merely parallelize what you tell them to parallelize) are extremely hard to write. Until the advent of Beowulf clusters, low-cost SMP and low-cost multi-core CPUs, there simply haven't been enough machines out there capable of sufficiently complex parallelism to make it worth the cost. Simply make a complex-enough inter-process communication system, with a million ways to signal and a billion types of events. Any programmer who complains they can't use that mess can then be burned at the stake for their obvious lack of appreciation for all these fine tools.
Have you ever run GCC with maximum profiling over a program, tested the program, then re-run GCC using the profiling output as input to the optimizer? It's painful. Now, to parallelize, the compiler must automatically not just do one trivial run but get as much coverage as possible, and then not just tweak some optimizer flags but run some fairly hefty herustics to guess what a parallel form might look like. And it will need to do this not just the once, but many times over to find a form that is faster than the sequential version and does not result in any timing bugs that can be picked up by automatic tools.
The idea of spending a small fortune on building a compiler that can actually do all that reliably, effectively, portably and quickly, when the total number of purchasers will be in the double or treble digits at most - say what you like about the blatant stupidity rife in commercial software, but they know a bad bet when they see one. You will never see something with that degree of intelligence come out of PCG or Green Hills - if they didn't go bankrupt making it, they'd go bankrupt from the unsold stock, and they know it.
What about a free/open source version? GCC already has some of the key ingredients needed, after all. Aside from the fact that the GCC developers are not known for their speed or responsiveness - particularly to arcane problems - it would take many days to compile even SuperTuxKart and probably months when it came to X11, glibc or even the Linux kernel. This is far longer than the lifetime of most of the source packages - they've usually been patched on that sort of timeframe at least once. The resulting binaries might even be truly perfectly parallel, but they'd still be obsolete. You'd have to do some very heavy research into compiler theory to get GCC fast enough and powerful enough to tackle such problems within the lifetime of the product being compiled. Hey, I'm not saying GCC is bad - as a sequential, single-pass compiler, it's pretty damn good. At the Supercomputer shows, GCC is used as the benchmark to beat, in terms of code produced. The people at such shows aren't easily impressed and would not take boasts of producing binaries a few percent faster than GCC unless that meant a hell of a lot. But I'm not convinced it'll be the launchpad for a new generation of automatic parallelizing compilers. I think that's going to require someone writing such a compiler from scratch.
Automatic parallelization is unlikely to happen in my lifetime, even though the early research was taking place at about the time I first started primary school. It's a hard problem that isn't being made easier by having been largely avoided.
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
Implement the Observer (aka Listener) pattern (specifically the thing called "Subject" on the Wikipedia page). Your object should provide two methods, publish and subscribe. Clients can call subscribe to indicate their interest in being notified. When a client calls publish with a value, your object should pass on that value by calling the notify method on everyone who has previously subscribed for updates.
Sounds simple, right? But wait:
- What if one of your subscribers throws an exception? That should not prevent other subscribers from being notified.
- What if notifying a subscriber triggers another value to be published? All the subscribers must be kept up to date on the latest published value.
- What if notifying a subscriber triggers another subscription? Whether or not the newly added subscriber receives this in-progress notification is up to you, but it must be well defined and predictable.
- Oh, and by the way, don't deadlock.
Can you achieve all these things in a multithreaded programming model (e.g. Java)? Try it. Don't feel bad if you can't; it's fiendishly complicated to get right, and i doubt i could do it.Or, download this paper and start reading from section 3, "The Sequential StatusHolder."
Once you see how hard it is to do something this simple, now think about the complexity of what people regularly try to achieve in multithreaded systems, and that pretty much explains why computer programs freeze up so often.
As parallelism ramps up the processor cores will get simpler and we will return to the concept of RISC computing. But the programing paradigm will be virtual processors. Furthermore some processors will be better at certain operation than others.
What will eventually turn into an AI subsystem, will postcompile and dynamically rewrite programs to take best advantage of the system. By AI I don't mean anything sentient, just something that intelligently recognizes programming structures and patterns and restructures the code to match the hardware. Think IA64 on steroids.
The question or concern I have is that if one programs threads that way, communicating by creating, passing, and garbage collecting objects, is the garbage collector happy with this, or does this create more overhead than you save with the thread parallelism (on multi core of course)?
I don't say "never use threads", but I do say "don't use threads unless you have to". Bugs in threaded code are easier to introduce, are hard to see in code reviews, are very likely to remain undetected in manual testing, are likely to remain undetected in automated testing and are a pain to reproduce.
Part of the problem is the inherent difficulty of parallelism, but there is more to it than that. Threads are not isolated in any way: they can take unpredictable paths through the program and touch any data structure. It's a good idea to design your threaded program in such a way that each thread sticks to a small section of the code and data, but this is a structure that you have to impose yourself: the threading mechanism does not help here.
So, in my opinion threads are hard. Sometimes you have to do hard things though. But if you can avoid it, you can save yourself a lot of effort and risk by sticking to a single thread. I think the situation can be compared to memory allocation: noob programmers are sure to mess up explicit allocation/deallocation; experienced programmers get it mostly right, but spend a lot of effort on it and still make mistakes from time to time (especially in the error handling paths). Garbage collection avoids a lot of problems.
For the future, I think we should have some kind of mechanism for parallelism other than threads. It would have to isolate the parallel computations from each other and provide a safe and efficient way to pass control and data between them. I read a bit about Erlang when it was mentioned here a couple of days ago, but it is not yet clear to me if/how this could be fitted into an object oriented language. That would be essential to get it to the masses: functional programming hasn't become mainstream because it doesn't match the way most programmers think.
run my web application on 16 and 32 core machines. I could care less about whether office apps run any more quickly.
Never underestimate the huge quantity of virus/trojans/spywares/sony rootkits/spambots/etc. pollution that joe 6-pack will be running next to his power hungry Microsoft OS and his main office work.
16/32 cores machines could be useful for him too, so his main work (browsing for pr0n ?) won't slowdown for being constantly interrupted by the multitasking scheduler to let the huge pile of crap run.
---
1 core for Office.
4 cores for the "WoW effect".
11 remaining cores pumping spam.
1 "Intel Core Pento 16 cores" to rule them all.
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
Part of the problem, as previous posts have observed, is that most people didn't have much incentive to change, since parallel systems were expensive, and bloated, ineffeicient code would inevitably get faster thanks to the rapid improvement in single-thread performance that we enjoyed until recently. So outside of HPC and cluster apps, most parallelism consisted of decoupling obviously aynchronous tasks.
I don't think there ever will be one language to rule them all.... The right programming model is too dependent on the application, and unless you are designing a domain-specific system, you will never get people to agree. Depending on your needs, you want different language features and you make different tradeoffs on performance vs. programmability. For some applications, functional programming languages will be perfect, for others Co-Array Fortran will be, for others an OO derivative like Mentat will be, etc. And as new applications come to the fore, new languages will continue to spawn.
I think the key is to:
If one programming model does triumph, I would predict that it will be APIs that can be used equally well from C, Fortran, Java, etc., thus allowing different application domains to use their preferred APIs. And even that model is probably not compelling enough to bring them all and in the dark bind them....
Why does the notify allow() for throwing exceptions? The only exception I can see that it throws is IllegalMonitorState, and that seems easy enough to fix. I vaguely recall something about runtime exceptions or whatnot always being throwable, but really, there's nothing the Observer can do about such things except carry on with the rest of notifications. This is bad for debugging, so one solution would be to hold the exceptions until all notifications have been done and then deal with them all at once. Simply ignoring exceptions isn't an option, multi-threaded or not. If anything, this is an example of how instructional curricula have failed Java programmers.
If publishing triggers a subscriber to trigger another publish, that's either programming error or intended behavior in my book. A strict reading of the spec you've provided suggests we must notify all subscribers all publishes, not merely the newest. This eliminates a possible optimization should this trigger not occur every time.
My approach, without reading the paper, would be to make a copy the subscriber list on a publish. Synchronizing add and copy should be as simple as synchronizing on the subscriber's data structure. If this is prone to deadlock, I'm afraid you shouldn't be using Java.
I Browse at +4 Flamebait
Open Source Sysadmin
I used to work for SilverStorm (recently purchased by QLogic). They make InfiniBand switches and software for use in high performance computing and enterprise database clustering. The quality of the I/O subsystem of a cluster played a large part in determining the performance of a cluster. Latency (down the microsecond) and bandwidth (over 10 gigabits per second) both mattered.
Also, we found that sometimes, what made a deal go through was how well your proposed system could run some prexisting software. For example, vendors would publish how well they could run a standard crash test simulation.
Also, I would like to see more research put into making clustered operating systems like mosix good enough so that developers can stick to what they have learned on traditional SMP systems and have their code just work on large clusters. I don't think that multicore processors eliminate the need for better cluster software.
-- soldack
for (Subscriber subscriber: subscribers) {
subscriber.notify(value);
}
and it isn't synchronized, the notifications can arrive out of order.
I'm not saying you can't solve this problem. The point is, it's much trickier than almost all programmers are taught to realize. (This problem is easy to solve if you use event queues instead of multiple threads.)
I saw Patterson give this talk at PARC. While he did say something similar to the above quote, he was NOT suggesting that fewer, more powerful processors is the future (which is the impression I got from the summary). In fact, the entire talk was to the opposite point. Parallelism is the future; the processor is the new transistor; processors will have thousands of cores, etc. Very fascinating stuff. I hope more researchers participate and use his emulation system, because I think he's correct (as he has been in the past) that we will need to adopt parallelism more in the future.
There are two types of people in this world: those that categorize other people and those that don't.
I hope you enjoy you're wait, it's likely to be a long one. "Automatic Parallelization" in any but the most trivial circumstances will go the way of AI, that is, become a marketing/academic buzzword. Here's the approach that I think WILL work. 2 languages, one a purely sequential language designed from the ground up to be "parallizable". Notice, I don't say "parallel". And a second meta-language that a parallelization expert can use to parallelize the sequential code FOR A SPECIFIC HARDWARE PLATFORM. We aren't going to raise a generation of "parallel programmers" the way we did with "object oriented" programmers. Object oriented programming was designed to make programming easier. PP is something NOBODY WOULD EVER DO if processors were fast enough.
clusters are not parallel computers.
clusters are NOT parallel computers.
I suppose it depends on the teaching, but we run our students through synchronization routines in our Operating Systems course. They should know when they leave that multithreading requires protection. Of course, there's not much for the ones that leave school before this course (the course is a bear and the ones likely to avoid it are also likely to drop, but not before adding "Java" to their resume). Not all of them could propose a good fix, but they'd at least know that they should be looking for locks or something and encounter the "synchronized" keyword. Seems the biggest problem regarding java multithreading is efficiency and broken ways to avoid the synchronized keyword.
I'm not an expert on the java semantics of notify(), but after reading the javadoc and spec, if you intend to preserve the order of notification, it seems your point was to prove Java inadequate for the job: notify() has no parameters and merely signals waiting threads. It does not enforce any queuing on multiple notifies. If you were ready to run before the notify, you're still ready afterwards, but not necessarily running. Solving the problem becomes a lot less parallel, as publishes will have to copy, then block on pending publishes, AND for all subscribing threads to block again. If you signal multiple times, assuming the subscribers were waiting on themselves, using notify actions for publish potentially races with the waking thread's event handler due to the thread scheduler.
In fact, we're slowly building message queues as that's whats needed to satisfy the spec. We need FIFO processing of values and java's primitives fail the FIFO test. Fortunately, the paper decides to abandon java at this point for a pet language and introduces typical parallel language features like networking and the rest of distributed programming. Hopefully though students are scared enough of multithreading to look into how a given language's synchronization works.
I Browse at +4 Flamebait
Open Source Sysadmin
I need it to run an AI I designed as a child. Every word you say to it adds a ... what is that word? The opposite of their problem. to it.
What happens when a 1024-core server is too slow to handle 700 concurrent connections, and the only upgrade option is a 2048-core server? Then it matters whether each of those 700 requests is a parallelizable problem. Imagine a server that solves difficult computations like routing delivery traffic, designing tailored clothes from customer snapshots, monitoring security camera feeds at a casino, or analyzing a twenty-second voice recording to decide where to route a call. ("To help us best serve you, briefly state why you are calling.") Saying that one core per client will always be sufficient, even when cores stop getting faster, is tantamount to saying that nobody will ever figure out how to use all that power -- historically, a poor prediction.
For all the buzzz. OpenMP is a nice hack to add branch and join parallelism to loops in existing sequential code. An application needing higher granularity (task level) parallelism, streaming parallelism, or control over parallel processes, won't have much to do with it. Keep looking, the answers aren't there yet, nor in Fortress.
A big part of the problem is that established programming models, e.g., the use of UML to drive OO flavored designs, traps people into approaches unsuited for parallel execution. When is the last time you have tried to explain a massively parallel problem to somebody using UML? Antiobjects http://en.wikipedia.org/wiki/Antiobjects is a pattern mapping computation from so-called foreground objects to background objects. These Antiobjects can be mapped efficiently onto architectures with very large numbers of cores/cpus. They have been used in applications such as simulations as well as games and have been explored initially on Connection Machines (64,000 CPUS).
It's a myth that functional programs are more parallelizable than imperative ones.
First of all, most functional programs do heavy use of monads (especially the IO monad) and hence they need synchronization primitives around those monads just like imperative programs.
Secondly, those parts that do not use monads still need to be executed in certain order, so there needs to be a scheduler which synchronizes the multiple cores (waiting for results and dispatching computations), which again means using semaphores and mutexes.
Thirdly, functional languages do heavy use of the garbage collected allocator, because of using list cells and thunks a lot. Again, this means that the allocator needs to be synchronized.
There are two things that we are missing:
1) the brain does not share data. Data in the brain are copied around in the form of signals.
2) the brain is not a generalized turing machine. It can only do pattern matching on images/sounds/smells/touch. Don't be fooled of our ability to do arithmetic of formulate theorems: these are the result of pattern matching as well.
Pattern matching is highly parallelizable, of course: every piece of the pattern is fed to its own 'CPU', which calculates the degree of the match.
There is a way to automate shared state concurrency! every object should be its own thread. Computations that refer to the same object must be executed by the object's thread.
Here is how it works:
A computation does not return a result, but a tuple of {key, continuation}. The key is used to locate the thread to pass the continuation to. The computation is stored in the thread's queue and the thread is woken up.
The tuple {key, continuation} pair can be an 64-bit value (on 32-bit machines) that consists of a pointer to a memory location (the key) and a pointer to code (the continuation).
The insertion to the thread's queue can be done using lock-free data structures.
Threads can be user-level so there need not be a switch to kernel space.
This design can allow for linear scaling of performance: the more cores you put in, the more performance you get (for algorithms that are not linear, that is). Linear algorithms would execute a little slower than usual, but the trade off is acceptable: for many applications that allow for parallelization due to having lots of (relatively) independent objects, the performance boost be tremendous.
There are many domains of applications that would benefit from such an approach:
-web servers/application servers that must serve thousands of clients simultaneously.
-video games with thousands of objects.
-simulations that have many independent agents that can run in parallel.
-GUI apps that use the observer pattern and each observable has many observers than can be notified in parallel.
Note: The above ideas are taken from libasync-mp and lock-free data structure programming.
old CW: Black is black and white is white
new CW: White is black.
Seriously. Who uses such lame rhetorical techniques to convince people that all software should be rewritten?
The thing is, MOST algorithms cannot be PARALLELIZED or it is not worth it. This is a red herring to convince the Northern Venture Allicance Trolls to give more money to the new $ugly-name project which will promise parallelization. Yeah, milk the fat cow.
Just as a sidenote, the BlockingQueue you link to is a Java 5 feature, and indeed Java 5 has added a LOT of API classes that are a great help when dealing with threading. Not sure if you meant to refer to 1.5, since I'm sure 1.4 has also added some utility classes, but it's been touted as one of the major features of 5 aka 1.5.
Switch back to Slashdot's D1 system.
until that magical threading language (maybe c++1x) comes along
Studied Erlang much?
you had me at #!
That report on supercomputing doesn't indicate growth:
"Clusters have proven themselves as capable servers to handle a sizable portion of the HPC workload."
"More than half of the respondents expect their budgets for all HPC tools will decline (43%) or remain the same (17%) over the next two years."
"U.S. industrial users/buyers really want and need faster computers that fit their budgets and that don't require specialized programming skills."
Also, that study doesn't reflect "most companies". It reflects 30 companies that use supercomputers today, most of which are very large aerospace or automotive companies. They're not buying the tightly-coupled high end supercomputers, either; most of the new systems are clusters of conventional machines.
This is easy.
1) you should be catching any possible exceptions anyways, unless throwing an exception is part of your spec and dying is the correct behaviour.
2) the publish will be done by the notifying thread, so there is no deadlock. You have to avoid cycles on your own, but that is a design issue (if you spec a cycle, you will get a cycle: surprise). "must be kept up to date" is ill-defined. What ordering of updates to you want? Do you want every update in order? Then don't allow changes while a publish is in progress, or have a queue of outstanding publishes. Do you only want the "newest" value? Use a callback. Or you could make all of your notifications asynchronous (and use a condition to prevent non-atomic state changes). The notices will be in a new thread every time, so there will be no cycles and no deadlocks.
3) Non-issue. The subscribe enters the system through whatever mutex you are using to protect the subscriber list.
4) there is no reason to deadlock here.
Maybe we'll finally see functional languages come to the forefront. They make it a whole lot easier for the compiler to extract parallelism automatically. For example,see Parallel Haskell. I think this is the closest I've seen to a "magical threading language".
This sounds pretty much like event queueing, similar to the solution proposed by the paper i linked to.
Notice that this is no longer "shared-state" concurrency. If every object is running in its own thread, no threads share state. I think you've helped prove my point (shared-state concurrency is hard and ought to be avoided).
Using a queue of outstanding publishes is heading in the right direction: toward event-queue concurrency instead of shared-state concurrency (which is the point i was originally trying to make).
There are a lot of applications where taking advantage of a large number of threads is a fantastic way to increase performance. However, there is a lot of lower-level stuff that massively multicore CPUs should look into. Vector and matrix operations, FFTs, and single instruction multiple data [especially SIMD multiple compare, where many possibilities can be checked at once, eg for regexes, which is similar to the method d-wave uses for solving NP-complete problems with quantum computing] are the sort of instructions we are talking about- really using all those processors together. When the big CPU people focus on the sort of algorithms that crop up in scientific computing and database access, we should see the real power of top-down multicore design.
We at slashdot are scientists, specialists and kernel hackers. Your FUD will be found out.
Databases already allow a kind of parellel processing. A.C.I.D.-based techniques allow multiple users (processors) to send results to the same database in order to communicate results between each user/client. Each "client" may be single threaded, but together a client/server system is essentially a multi-threaded application, all without odd code or odd programming languages.
Table-ized A.I.
Could have been perusing too fast, but this guy fails to mention that one of the biggest obstacles to writing super fast multi-threaded code is the OS mutexes. Everything else he has said is far from surprising news but this *is* only slashdot after all...
We need to do away with our safe but expensive locks and critical sections, which work great in a system where you only have one core anyway so locking just lets someone else get something done, but which cause way too many cross-core stalls in a real-time system.
Something that would help a hell of a lot would be to have implementations of lock free FIFOs being created, supported and tweaked by OS developers.
A guy can dream...
It's OK Bender, there's no such thing as 2.
Looks like if you don't have ACM digital library access and don't feel like trudging to a library you can find a copy here: http://cva.stanford.edu/classes/cs99s/papers/hilli s-steele-data-parallel-algorithms.pdf
For a group of characters (substring) in the middle of the file, you can locally build a table that maps whatever the incoming parser state is (at the beginning of the substring) to what the corresponding outgoing state would be at then end, and then that table lets you process the whole substring in unit time. I like to think of it as the parsing equivalent of a carry-lookahead adder. Probably best to read the article if you're curious.
here's a thought, suppose you have a game with threads for gameplay, rendering, ai and physics we could seperate game time into internal frames and the threads out like this:
class cHuman extends cBiped
{
render
<
addDecal( decal_t decal , vec_t origin );
>
gameplay
<
traceAttack( vec_t origin , vec_t dir );
>
physics
<
impulseForce( vec_t origin , vec_t dir , float force );
>
}
this human may recieve a bullet to the chest, that's simple enough but it needs to invoke a force to knock the player back (this<physics>.impulseForce) as well as add a decal to the player skin to show the bullet impact on their bullet proof vest. It would probably also emit a sound and spawn a spark or debris model. So what seems at first to be a simple seperation of elements becomes more complex. This is where the internal frames come in - each thread would need an event queue, a list of commands to follow during each frame, and you would then add events to the tail of each thread's queues. At first glance this would probably look like it could introduce a new kind of infinite loop - whereby each thread gets more and more events tacked onto it's queue, but by seperating the execution into game-time-limited this can be overcome - so long as projectiles and explosions all have a finite velocity, and the AI characters have a realistic reflex time attatched then very little can happen infinitely quickly.
However, do we still have to limit ourselves by forcing all threads to have completed the current frame before starting the next? or can we allow a frame space, a range of say 50 frames over which the threads are allowed to run amok? - the renderer may be fixed to 50 or 60 Hz for example but the game engine may be able to run at 500 internal frames a second in order to get the physics working reasonably. Presumably the renderer can just take a snapshot of the gamestate and concentrate on rendering that.
This is where the compiler comes in, it should delegate the temporal snapshot ability of the renderer so that what the renderer is rendering does not change whilst it is rendering it, but the AI and physics threads can chat to each other about the current states of play. The programmer should be free to program these in parallel without having to worry about it too much.
If you don't risk failure you don't risk success.
I see two problems with this solution:
Interthread communications - the scheduler is going to become the bottleneck. Maybe a processor architecture with some serious scheduler support may be able to mitigate this?
Internally, most objects are going to need to be implemented as state machines. A formally correct state machine is no easy thing to create, particularly in most popular programming languages of today (C, C++, Java)
Still, these two problems are solvable with new CPU architectures and better compilers...
Indeed, but the overall result is similar, and you get a solution for multicore architectures for free.
The argument in favour of message queues instead of shared state isn't just about Java; Java is just an example. I am fairly sure that the Observer problem (and other simple coordination problems) are hard to solve reliably on any platform based on the conventional multithreading paradigm as taught in most computer science departments (threads, mutexes, critical sections, etc.).
It is true that java 5 added the java.util.concurrent package. However, if you are using 1.4 (or really 1.2+) you can use this package:
/ dl/util/concurrent/intro.html
http://gee.cs.oswego.edu/dl/classes/EDU/oswego/cs
The author Doug Lea is the same guy who wrote the 1.5 java.util.concurrent package. He also wrote the book "Concurrent Programming in Java 2nd ed." which is a must read IMO if you do anything multithreaded in Java.
So, we have hundreds or thousands of processors. The inclination of people is to try to take one app and try to make use of all that processing power. Maybe our whole focus is wrong. Maybe we should be thinking about actively running 100 or 1000 programs simultaneously where each program is a classic single-threaded program. You wouldn't think about doing this today because it would grind your machine to a halt but in the coming world of many processing elements this is a viable model. It may not be very power efficient but perhaps new apps will emerge that can do something useful for us in the background that don't require constant attention or immediate results.