One major advantage of IB here is that it natively supports multipathing; there's no need to avoid loops in the graph either by topology or by using spanning trees. This allows one to build networks with decent bisection BW without needing big and expensive über-switches.
There are a few efforts to bring similar capability to ethernet as well, TRILL and 802.1aq, AFAIK neither of which is ratified at the time of writing this.
I'll assume you know more about this than me, but he did say that the nodes are going to be wired with 4x GigE. Might there be a penalty bridging from that to IB rather than 10GigE?
The way I read it, the nodes have four 1GbE ports built into the MB. If you're going to use IB, you'll buy separate PCIe IB cards for each node. The 1GbE ports can then be used to run management traffic etc. Or left unused; there's no law saying you have to use them all, and since 1GbE ports are practically free it's not like you're leaving any money on the table either.
Wrt RDMA over ethernet, iWARP this and that, yes I know it exists. My point was that RDMA has been supported on IB since day 1, the software stack is mature and widely used, which can't be said for ethernet RDMA. Since IB infrastructure so far is cheaper there's really no reason to go with 10GbE.
That's not to say that 10GbE is useless. Of course it's useful, e.g. if you run high-bandwidth services over TCP/IP accessible from outside the cluster. But that's not what you're doing on a cluster. A cluster interconnect is typically used for MPI and storage, both of which can run over RDMA, avoiding the comparatively heavyweight TCP/IP stack. And of course, at some point 10GbE will replace 1GbE as the cheap built-in stuff on MBs.
Shouldn't you have figured out answers to all these (simple) questions before ordering several million $$$ worth of hardware? Sheesh..
As for your specific questions:
- IB vs. 10GbE: IB hands down. Much better latency and more mature RDMA software stacks (e.g. for MPI and Lustre). Cheaper and higher BW as well.
- GPU: NVidia Fermi 2090 cards. CUDA is far ahead of everything else at the moment.
Most codes can't deal with node failure. So far it seems the solution is to checkpoint frequently (say, once per hour). There's not much else that is sensible. E.g. running pairs of nodes in lockstep is more expensive than an IO subsystem capable of the checkpointing.
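To make the checkpoint/restart idea concrete, here is a minimal sketch in Python. It is a toy, not how any real HPC code does it: the function names, the pickle format and the step-based interval (standing in for "once per hour") are all assumptions made for illustration, and `fail_at` only exists to simulate a node dying.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: dump to a temp file, then rename over the old one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # A crash mid-write leaves the previous checkpoint untouched.
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the last saved state, or default if no checkpoint exists yet."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default


def run(path, total_steps, checkpoint_every, fail_at=None):
    """Toy 'simulation': sum the step indices, checkpointing every N steps."""
    state = load_checkpoint(path, {"step": 0, "acc": 0})
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["acc"]
```

Calling `run()` again with the same path after a failure resumes from the last checkpoint, so the work lost is at most one checkpoint interval.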
This one uses SPARC chips designed and fabbed (IIRC?) by Fujitsu. Sun/Oracle has nothing to do with it.
AFAICT the politics behind this machine is that a few years ago NEC pulled out from the project to design a next generation vector chip for use in a Japanese Earth Simulator follow-up. Hence the project resorted to the Fujitsu SPARC chips, which are not really designed for HPC but are still a domestic design.
I wouldn't expect this machine design to become popular outside Japan.
Persistency: once eth0, always eth0 - this is what most commentators here seem to think this is all about, but it's already taken care of by udev with most modern distributions.
To some extent. The persistency is taken care of by adding state to the system, that is, by storing the MACs somewhere. That fails e.g. if you swap out a broken NIC, or if due to some hw failure you move the HD to another identical server.
Naming: The article says they're changing the naming. This is what makes no sense. It's not "required." ethx is just fine, as long as the names are enumerated consistently (meaning that on two "identical" boxes, the order is identical based on physical port).
IIRC the justification for this is that using ethX would race with the original kernel names. This thingy is based on udev: when the kernel boots, devices are given ethX names, and then udev rules rename them according to BIOS names, PCI bus order, etc.
As others have mentioned, this is nothing but the latest attempt to kill off the used books market. The textbook industry is just a big racket.
Curiously, the obvious solution of using widely available free online textbooks is ignored (see e.g. http://theassayer.org/ for a directory). Oh yeah, can't do that because we "need to save the textbook industry".
Of course, free online textbooks aren't the answer to everything. For some specialized grad-level course, say, the selection of appropriate textbooks might be quite limited, if any are available at all. But for all those massive "XXX 101" courses, surely free online resources are plentiful, and some are even of very good quality. Or maybe even better, as a free online textbook writer has no incentive to bulk up the book with useless fluff, which just wastes the students' time.
What is usually done over here, at least in the math and physics departments, is that homework problems are separate handouts. That way it doesn't really matter that much which edition of a textbook the students use.
It was similar in Finland as well: you had to write a letter to the local parish explaining why you wanted out, and the priest had to grant you leave. It wasn't until, oh, maybe 5-10 years ago that the law was changed so that you only need to notify the magistrate (so that they won't withhold some of your income for church tax), and the eroakirkosta.fi site went up at about the same time to make it even easier.
I think there is some guarantee, yes, but it's not eternal. AFAIK graves are often reused after some decades, when the corpse has decayed to the point that a new one can be buried in roughly the same spot.
Also, I think that as the official state church, the Lutheran church has some kind of responsibility for taking care of the bodies of people who don't belong to any particular faith and have no kin paying for the disposal, or such. I suspect most parishes have some odd corner in the graveyard for these people, or else they are cremated, whichever is cheaper. FWIW cremation is increasingly common in the cities for church members as well, for obvious reasons.
From the PDF article (http://pdos.csail.mit.edu/papers/linux:osdi10.pdf ):
We run experiments on a 48-core machine, with a Tyan Thunder S4985 board and an M4985 quad CPU daughterboard. The machine has a total of eight 2.4 GHz 6-core AMD Opteron 8431 chips.
Unfortunately, the summary as well as the short articles on the web were more or less completely missing the point. The actual paper ( http://pdos.csail.mit.edu/papers/linux:osdi10.pdf ) explains what was done.
Essentially they benchmarked a number of applications, figured out where the bottlenecks were, and fixed them. Some of the fixes introduced "sloppy counters" in order to avoid updating a global counter. Others switched to more fine-grained locking, per-CPU data structures, and so forth. In other words, pretty standard kernel scalability work. As an aside, a lot of the VFS scalability work seems to clash with the VFS scalability patches by Nick Piggin that are in the process of being integrated into the mainline kernel.
And yes, as the PDF article explains, the Linux cpu scheduler mostly works per-core, with only occasional communication with schedulers on other cores.
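For a feel of what avoiding a global counter looks like, here is a simplified sketch in Python. It is not the paper's implementation (which lives in the kernel and works per core): the class name, the threshold and the per-thread batching are assumptions chosen to show the general idea, namely that each thread counts locally and only touches the shared counter once in a while.

```python
import threading


class SloppyCounter:
    """Approximate counter: threads batch updates locally before publishing."""

    def __init__(self, threshold=64):
        self.threshold = threshold
        self._global = 0
        self._lock = threading.Lock()
        self._local = threading.local()  # per-thread pending count

    def add(self, n=1):
        pending = getattr(self._local, "pending", 0) + n
        if abs(pending) >= self.threshold:
            # Only now do we take the shared lock and touch the global value.
            with self._lock:
                self._global += pending
            pending = 0
        self._local.pending = pending

    def flush(self):
        """Publish whatever this thread still holds locally."""
        pending = getattr(self._local, "pending", 0)
        if pending:
            with self._lock:
                self._global += pending
            self._local.pending = 0

    def approximate(self):
        """Global value; may lag the true total by unflushed local counts."""
        with self._lock:
            return self._global


def worker(counter, n):
    for _ in range(n):
        counter.add()
    counter.flush()


def demo(threads=4, n=1000):
    counter = SloppyCounter(threshold=64)
    pool = [threading.Thread(target=worker, args=(counter, n)) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return counter.approximate()
```

The trade-off is that a read of the global value is only approximate until every thread has flushed. In Python the GIL hides any real scalability gain, so this shows the structure only, not the speedup.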
Btrfs is a product of Oracle. Oracle now owns ZFS outright and controls the fate of Btrfs in terms of developer resources. One guess as to whether Oracle will remain motivated to complete Btrfs.
If Oracle for whatever reason decides to stop investing in BTRFS, the likely outcome AFAICS is not that BTRFS dies, but rather that Chris Mason and his team jump ship to Red Hat, Novell, Google, IBM or some other Linux contributor with an interest in seeing BTRFS succeed. That's one of the advantages of a collaborative project like Linux, which isn't subject to the whims of any single corporation in complete control.
To the extent that there is a threat to BTRFS, it depends on how the ZFS-WAFL lawsuit plays out. I wouldn't be particularly surprised if Oracle settles with NetApp, covering only official Solaris releases, leaving other ZFS versions (Illumos, Nexenta, FreeBSD, etc.) out in the cold, and perhaps BTRFS as well, depending on the extent to which the WAFL patents apply to BTRFS.
1. User-space scheduling. It would be nice if a process could have better control on the priority of each of its threads. For example, on a web service where multiple users are active, it is often necessary to give each user his/her share of the cpu. Right now this is rather difficult to do in a fair way, since multiple threads may belong to the same user.
If normal priorities aren't sufficient, you can set up cgroups.
3. "Nice" for bandwidth.
For IO, ionice? Or, again, cgroups allow fair sharing of IO and network BW, IIRC.
4. "Select" or "poll" with access to inter-thread synchronization structures. Select and poll are system calls which act mainly on file-descriptors. However, sometimes you'd like to wait also on a mutex or semaphore. Some support for this would be great.
Isn't this what pthreads condition variables are for? Or can you explain what you want in more detail?
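As a sketch of what waiting on a condition variable looks like, here is a small example using Python's `threading.Condition`, which follows the same model as pthreads condition variables. The `WorkQueue` class is hypothetical and only there for illustration. Note that it does not cover waiting on a file descriptor and a lock at the same time, which may be what the original request was after.

```python
import threading
from collections import deque


class WorkQueue:
    """Queue where consumers sleep on a condition variable until work arrives."""

    def __init__(self):
        self._items = deque()
        self._cond = threading.Condition()

    def put(self, item):
        with self._cond:
            self._items.append(item)
            self._cond.notify()  # wake one waiting consumer

    def get(self, timeout=None):
        with self._cond:
            # wait_for re-checks the predicate after every wakeup, which
            # guards against spurious wakeups and lost races.
            ready = self._cond.wait_for(lambda: len(self._items) > 0, timeout=timeout)
            if not ready:
                raise TimeoutError("no item arrived in time")
            return self._items.popleft()
```

The consumer blocks inside `get()` without spinning, and the lock is released while it sleeps, so producers can still call `put()`.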
I wouldn't be so sure. Like previous incarnations of the Xeon MP series, this one will be much more expensive per FLOP than the 2 sockets-per-node machines that make up most x86 entries in the top500 list.
Anyway, for these big machines parallel scalability is mostly determined by the internode network; merely stuffing more cores per node does nothing. Or actually, if you don't increase network performance as you make the nodes fatter, parallel scalability will worsen as you have more cores sharing the network link.
Now, one interesting entry that will use these Nehalem-EX chips is the Altix UV by SGI. That will certainly be a very interesting architecture for people looking at big CC-NUMA machines, but as it tops out at 256 sockets in a CC-NUMA configuration it won't get anywhere near the top of the top500 list. Of course, you can cluster together several such machines if your wallet is thick enough, but at that point you lose the global CC-NUMA and a more traditional cluster is more cost effective for MPI jobs.
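The point about fatter nodes sharing one network link can be put in numbers with a back-of-the-envelope sketch. The 32 Gbit/s link rate and the core counts below are assumed figures for illustration, not measurements of any particular machine.

```python
def per_core_bandwidth(link_gbit_per_s, cores_per_node):
    """Network bandwidth available per core if all cores share one link evenly."""
    if cores_per_node <= 0:
        raise ValueError("cores_per_node must be positive")
    return link_gbit_per_s / cores_per_node


# Assumed example: the same 32 Gbit/s link on a thin and on a fat node.
thin_node = per_core_bandwidth(32.0, 8)   # 8 cores share the link
fat_node = per_core_bandwidth(32.0, 32)   # 32 cores share the same link
```

With the link held fixed, quadrupling the cores per node cuts the share per core to a quarter, which is why the network has to grow along with the node.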
Ah, but even Conservapedia isn't safe, as trolling it is a popular (?) pastime: The Conservapedia trolling game.
If someone's goals don't agree with yours, then the polite thing to do is refuse to give them advice, not give them bad advice.
So the FSF should not put up a web page explaining which licenses they recommend and why, because someone on the Internet might disagree? Seriously?
It seems that quite a few of the pro(sumer) sound cards nowadays are external ones connected via USB.
Presumably the idea is to isolate the DAC from all the electrical noise inside the case?
What about latency on these things? One would imagine that the extra protocol hop adds latency, and that the audio traffic has to share the bus with other devices. I mean, people doing audio production seem to be sensitive to latency, to the point that Linux users use the RT kernel. Is USB really up to it?
Replying to myself, TFA contains some info about this. Hey, this is slashdot, who has time to read TFA?
So you're claiming ACID; IOW you are saying your system provides consistency as per the definition used in CAP?
How do you deal with network partitions? That is, per the CAP theorem, if you have C, is your system CA or CP?
Thanks,
NIH?
Presumably he meant the issue described here: http://lwn.net/Articles/393144/
From reading the mailing list thread, my impression was that it was a storm in a teacup, and the real problem was just a simple bug rather than a fundamental misdesign. Or if you want to be slightly less charitable, a case of "concern trolling".
No, it already has an appropriate name.
Windows Server 2008 R2 dropped 32-bit x86 support, actually.