Interview with Matthew Dillon of DragonFly BSD
JigSaw writes "Well-known FreeBSD/DragonFly/Linux/Amiga system hacker Matthew Dillon discusses a number of interesting points regarding where the BSDs are going, the status and goals of his latest project DragonFly BSD, the status of his innovative Backplane distributed database, his exciting plans to develop DragonFly into a transparently cluster-capable system implementing native SSI (Single System Image) which is something that no other operating system can do today, and more."
No, this is to do with kernel threads. The userland threading is the same as in FreeBSD 4.x atm, AFAIK. The idea is to keep the model simple, unlike in FreeBSD 5.x where they are having trouble keeping it all sane with their fine-grained mutex model. Have a look at the dragonfly.kernel newsgroup on nntp.dragonflybsd.org for more details on the SMP model; Matt has talked about it regularly in earlier posts there.
* Several monkeys are here, playing banjos and wearing small hats.
It's simply not true that "a transparently cluster-capable system implementing native SSI" is "something that no other operating system can do today." We were doing it at Locus in 1994 with SVR4 then with Tandem in 1996 with NonStop Clusters for Unixware. Now some of the same folks at HP have introduced OpenSSI, which is essentially the same code, less all the Unixware-related bits, ported to Linux and placed under the GPL. They are coming up hard on their 1.0 release, which is not bad for five people and such a large task.
OpenSSI is the real thing, it has processes that migrate from node to node, distributed file systems, the works. And it's running now on clusters literally all over the world. (Not many clusters, true, but maybe that will change if the Slashdot crowd finds out about it.)
I'm happy to say that there's a lot of my code in that system, as well.
I know a little about what Matt wants to do with his SSI in Dragonfly, but he should certainly take a look at OpenSSI; we had to solve a lot of the problems you run into when you build such a beast.
(And a beast it is. As complex as a kernel can be, when you have what is essentially a distributed kernel across several nodes, the complexity goes up by orders of magnitude. Makes tracking down those weird hangs pretty exciting, in a painful, time-consuming kind of way.)
If you read the article, Matt says (about SSI): "It is something that no non-commercial system today can do"...
Kernel threads almost universally stay on the cpu they were originally assigned to. High performance threaded subsystems, such as the network stack, are replicated. That is, the network stack creates multiple threads (one per cpu) and those threads do not migrate because, obviously, they do not need to.
Generally speaking, the purpose of making thread migration explicit instead of automatic is to partition a larger data set across available cpu caches rather than cause the same data to be shared amongst all cpu caches. The processors operate a lot more efficiently and SMP scales a lot better. Most people do not realize the horrendous cost of moving threads between cpus because the cache mastership change is invisibly handled by hardware, but the cost is still there and still very real.
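The partitioning idea above can be illustrated with a small sketch. This is not DragonFly code: the function names, the CRC32 hash, and the 4-cpu count are all assumptions chosen for illustration. The point is only that a given connection always maps to the same cpu, so its state lives in one cache and never has to move.

```python
import zlib

NCPUS = 4  # assumed cpu count, for illustration only


def cpu_for_connection(src_ip, src_port, dst_ip, dst_port, ncpus=NCPUS):
    """Map a connection 4-tuple to one cpu, always the same one.

    CRC32 is an arbitrary deterministic hash picked for this sketch;
    it is not claimed to be what any real network stack uses.
    """
    key = f"{src_ip}:{src_port}-{dst_ip}:{dst_port}".encode()
    return zlib.crc32(key) % ncpus


# One private table per cpu. In the model, only the worker thread that
# is bound to cpu N ever touches per_cpu_state[N].
per_cpu_state = [dict() for _ in range(NCPUS)]


def handle_packet(src_ip, src_port, dst_ip, dst_port, payload):
    """Account a packet against its connection on the owning cpu."""
    cpu = cpu_for_connection(src_ip, src_port, dst_ip, dst_port)
    conn = (src_ip, src_port, dst_ip, dst_port)
    state = per_cpu_state[cpu].setdefault(conn, {"bytes": 0})
    state["bytes"] += len(payload)
    return cpu
```

Because the mapping is deterministic, two packets of the same connection land in the same per-cpu table, and no other cpu needs that table in its cache.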
-Matt
"The three chief virtues of a programmer are: Laziness, Impatience and Hubris." -- Larry Wall
They do have ISOs, click the "download" link on their main page. The ISO is a live CD, so you can boot your computer with it, like Knoppix (no X or GUI stuff though). What they don't have is a friendly installer. But the /README file has detailed instructions on installing it to the hard disk, which should be easy if you have BSD experience, or if you're a brave newbie you can try it anyway.
The BSD base isn't packaged. BSD types like having a source tree for their entire base system and being able to do "make buildworld" and "make installworld" to upgrade it. The package management system is entirely for third party applications. This is not Debian or Gentoo who have no code maintained by themselves other than installation and package management stuff. The BSDs maintain the kernel, the libc, other key libraries, and all the base utilities like ls, cp, mount, etc. And there's also a lot of "contrib" software in the base system -- some of it necessary to build the system (gcc and binutils), some of it just there out of tradition or regarded as "too useful to be moved to ports" (bind, sendmail).
Granted, if you ran an all RedHat shop or an all Mandrake shop things would be easier than simply an all Linux shop, but the same would be true for an all OpenBSD shop vs an all FreeBSD or NetBSD shop. But if each department is free to buy what they want I'd rather find who-knows-which-BSD on the box than who-knows-which-Linux.
If all this should have a reason, we would be the last to know.
If you are unhappy with your executables being broken, simply keep a copy of the older libraries. (With Gentoo, simply delete the old package file in /var/db/pkg before updating.)
The LWKT scheduler on any given cpu is only allowed to operate on threads owned by that cpu. If you attempt to wakeup a thread owned by a different cpu, an asynchronous IPI message is sent to the target cpu's LWKT subsystem requesting that the specified thread be woken up. It's really that simple. Same goes for cross-cpu scheduling.
IPI messages themselves are lockless and require no mutexes to operate because the cpu-to-cpu messaging uses a software crossbar (array of FIFOs) approach.
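A toy model of the scheme just described, assuming nothing about the real LWKT code: every class, method, and message name here is hypothetical. Each cpu owns one FIFO per sender, so each FIFO has exactly one writer and one reader, which is why the crossbar needs no lock.

```python
from collections import deque


class KThread:
    """A kernel thread that is owned by exactly one cpu (hypothetical)."""

    def __init__(self, name, owner):
        self.name = name
        self.owner = owner  # index of the owning cpu


class Cpu:
    def __init__(self, cpu_id, ncpus):
        self.id = cpu_id
        self.run_queue = deque()  # touched only by this cpu
        # inbox[src] is written only by cpu `src` and read only by this
        # cpu: one producer, one consumer per FIFO.
        self.inbox = [deque() for _ in range(ncpus)]


class Machine:
    def __init__(self, ncpus):
        self.cpus = [Cpu(i, ncpus) for i in range(ncpus)]

    def wakeup(self, current_cpu, thread):
        """Wake `thread`, called by code running on `current_cpu`."""
        if thread.owner == current_cpu:
            # Local thread: the scheduler may touch its own run queue.
            self.cpus[current_cpu].run_queue.append(thread)
        else:
            # Remote thread: queue an asynchronous request and return.
            target = self.cpus[thread.owner]
            target.inbox[current_cpu].append(("wakeup", thread))

    def process_ipis(self, cpu_id):
        """Drain all incoming FIFOs on `cpu_id`; return messages handled."""
        cpu = self.cpus[cpu_id]
        handled = 0
        for fifo in cpu.inbox:
            while fifo:
                op, thread = fifo.popleft()
                if op == "wakeup":
                    cpu.run_queue.append(thread)
                handled += 1
        return handled
```

In this model a remote wakeup never modifies another cpu's run queue directly; the thread only becomes runnable once its owner drains the message.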
In regards to cache issues, let's say you have a quad Opteron system. Each cpu has a 1MB L2 cache. If you migrate threads willy nilly you basically wind up in a situation where each of the four cpus' L2 caches contain the same data. In effect, you wind up with a system that globally has only a tad more than 1MB of L2 cache. If you partition data (such as TCP protocol data) across distinct threads, and place those threads on different cpus, then you are in effect partitioning your system's memory across all four cpu caches and you wind up with a system that globally has 3-4MB of L2 cache instead of 1-2MB.
There are two costs being saved here. (1) the cost of having to go to main memory when a piece of data is not in the L1/L2 cache, which can run into the hundreds of cpu cycles, and (2) the cost of cache mastership changes for all the data associated with the thread that was migrated (repeated each time the thread migrates).
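The cache arithmetic above can be put into a back-of-the-envelope formula. This is a deliberately crude model, not a measurement: the `overlap` parameter (the fraction of each additional cache that merely duplicates data already held elsewhere) is an assumption introduced here for illustration.

```python
def effective_l2_mb(ncpus, l2_mb_per_cpu, overlap):
    """Rough amount of distinct data held across all L2 caches.

    overlap = 1.0 means every cache holds the same data (threads
    migrate freely); overlap = 0.0 means the data is fully partitioned.
    """
    if not 0.0 <= overlap <= 1.0:
        raise ValueError("overlap must be between 0 and 1")
    return l2_mb_per_cpu * (1 + (ncpus - 1) * (1.0 - overlap))


# Four cpus with 1MB of L2 each, as in the quad Opteron example.
fully_shared = effective_l2_mb(4, 1.0, overlap=1.0)
fully_partitioned = effective_l2_mb(4, 1.0, overlap=0.0)
```

With complete duplication the model gives 1MB of useful cache, and with complete partitioning it gives the full 4MB. Real workloads fall somewhere in between, which matches the "1-2MB" versus "3-4MB" ranges quoted above.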
-Matt
I don't think it's anyone's design in particular, but I tend to sit down and write things from scratch rather than copy other people's ideas. In the case of the thread replication used by the network stack, it is primarily Jeffrey Hsu's work and since he is big on reading papers I'm sure it's a combination of his own design and ideas gleaned from various published papers.
The serializing tokens used by DragonFly work differently. They only guarantee serialization while the thread holding the token is actually running. Other threads holding the same token will be allowed to run when the first thread blocks or switches away synchronously, and the original thread will not get the cpu back until the tokens it is holding are available again.
This means that threads can obtain tokens in any order they wish, and that threads can hold tokens across blocking situations or calls to other subsystems without having to tell those subsystems about it. It may seem like a small thing, but the result is a huge simplification of the programming model. The tokens act almost like mini-BGL's (Big Giant Locks) but have the added advantage of protecting against interrupt threads trying to hold the same token. We are planning to expand the token idea further into a shared/exclusive model. The shared/exclusive model would have characteristics very similar to RCU.
The actual internal implementation of our token code is also quite a bit more flexible, allowing us to rip the guts out and rework it as needed for performance without changing the abstraction.
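The token semantics described above can be modelled in a few lines. This is a simulation of the stated behaviour only, assuming hypothetical names throughout (`TokenTable`, `SimThread`, `switch_in`, and so on); it says nothing about how the kernel actually implements tokens.

```python
class SimThread:
    def __init__(self, name):
        self.name = name
        self.tokens = set()   # tokens this thread logically holds
        self.running = False


class TokenTable:
    def __init__(self):
        # token name -> the running thread currently serialized by it
        self.active = {}

    def can_run(self, thread):
        """True if no other running thread holds any of our tokens."""
        return all(self.active.get(t, thread) is thread
                   for t in thread.tokens)

    def switch_in(self, thread):
        """Try to give `thread` the cpu; refuse if a token is busy."""
        if not self.can_run(thread):
            return False
        for t in thread.tokens:
            self.active[t] = thread
        thread.running = True
        return True

    def switch_out(self, thread):
        """Thread blocks: it keeps its tokens logically, but they stop
        serializing anything until it runs again."""
        for t in thread.tokens:
            if self.active.get(t) is thread:
                del self.active[t]
        thread.running = False

    def acquire(self, thread, token):
        """A running thread asks for a token. False means it would have
        to block, which in turn frees every token it holds."""
        if not thread.running:
            raise RuntimeError("only a running thread can acquire")
        if self.active.get(token, thread) is not thread:
            return False
        self.active[token] = thread
        thread.tokens.add(token)
        return True

    def release(self, thread, token):
        thread.tokens.discard(token)
        if self.active.get(token) is thread:
            del self.active[token]
```

In this model a thread that blocks while holding a token does not stall anyone else, and because a failed acquire leads to blocking (which frees all held tokens), two threads taking tokens in opposite orders cannot deadlock each other.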
-Matt