"SCTP and Infiniband focus on different areas. IB is largely a high performance HPC / cluster network architecture for LAN applications, where SCTP is a transport protocol designed to operate efficiently under WAN conditions (significant packet loss, high RTTs)."
Ok, that makes more sense. IB and other hardware based reliability systems all have problems with long distances. There are folks working on IB WAN though including the U.S. Naval Research Laboratory. Check out Obsidian Research.
"The interrupt issue has largely been solved - on Linux NAPI dynamically switches between interrupt and polled mode to reduce this overhead to negligible levels. Message signalled interrupts also help considerably."
Cool. I was not aware of NAPI...been too long since I have been in Linux kernel land. I agree about MSI helping out. It would be interesting to see how this affects HPC performance. HPC has applications that can include both large transfers and latency sensitive messages. By the way, I am pretty sure Intel NICs had a similar control as an option in their Linux drivers for some time.
"What would be much more helpful (and economical) for iSCSI, SCTP, and RDDP is NIC CRC32C checksum generation. CRC generation is quite expensive in software but trivial in hardware."
Yep. I guess it's a matter of finding the right balance between what should be offloaded and at what cost.
"One advantage of SCTP over TCP is that on a per stream basis, SCTP connection establishment overhead is much lower than TCP - basically O(1) instead of O(N) in the number of streams."
Interesting. Oracle and SilverStorm pushed out Reliable Datagram Sockets for IB (could work over iWarp) to handle the issue of lots of connections to the same host. Oracle saw massive scaling issues on pretty much any hardware or software for their clustering. RDS solved it by multiplexing all threads' traffic that goes to the same host down one reliable pipe. I wonder how SCTP would handle this?
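To put rough numbers on the O(1) vs O(N) point in the quote, here is a sketch that just counts setup packets. It assumes the textbook 3-packet TCP handshake per connection and the 4-packet SCTP handshake (INIT, INIT ACK, COOKIE ECHO, COOKIE ACK) once per association, with every stream riding on that one association. The function names are made up for illustration, and it ignores teardown, retransmits, and any data piggybacked on the COOKIE chunks.

```python
# Back-of-the-envelope model only: counts handshake packets, nothing else.
TCP_HANDSHAKE_PACKETS = 3   # SYN, SYN-ACK, ACK
SCTP_HANDSHAKE_PACKETS = 4  # INIT, INIT ACK, COOKIE ECHO, COOKIE ACK


def tcp_setup_packets(num_streams: int) -> int:
    """One TCP connection (and one handshake) per logical stream: O(N)."""
    return TCP_HANDSHAKE_PACKETS * num_streams


def sctp_setup_packets(num_streams: int) -> int:
    """One SCTP association carries all streams: O(1) in the stream count."""
    return SCTP_HANDSHAKE_PACKETS if num_streams > 0 else 0


if __name__ == "__main__":
    for n in (1, 8, 64):
        print(n, tcp_setup_packets(n), sctp_setup_packets(n))
```

For a single stream SCTP actually costs one packet more than TCP in this model, but the cost stays flat as streams to the same host are added.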
For all its problems, TCP/IP is everywhere. This fact has made it the networking technology to use even when it doesn't make technical sense. For example, folks use it in high performance computing and in storage (iSCSI) where there are much better methods available technically. Its commonality (along with Ethernet's popularity) often makes TCP/IP over Ethernet the cheapest solution to many problems (while not the best).
I used to work on InfiniBand, where the reliability and congestion detection protocols (Reliable Connected and Link Level flow control in IB terms) are in hardware. This scales to 20 Gbit connections between hosts quite well. Other examples of hardware protocols include Myrinet (invented by Myricom), QsNet (from Quadrics), and Scalable Coherent Interface, currently pushed by Dolphin Interconnect. All of these folks struggle to compete with good old TCP/IP over Ethernet. Except for parts of the HPC world, TCP/IP over Ethernet wins. In the storage landscape, Fibre Channel, SAS, and SATA seem to be holding out but iSCSI sure is trying.
The performance issue is real though, and very few systems can saturate a 10 Gbit TCP/IP Ethernet link without massive host CPU overhead. One solution floating around is that instead of trying to make new protocols to replace TCP, we should imitate the competition and put the hard work in hardware. TCP/IP offload NICs (TOE) are becoming increasingly popular. With RDMA technology layered on top of it you get iWARP. For storage you get iSER (ironically from an IB company!). This technology is being adopted by both the MS and Linux camps so it seems to have a good shot. In fact, many of the interfaces used by IB work about as well over iWARP cards. Things like Message Passing Interface, Direct Access Provider Library, Sockets Direct Protocol (SDP), and iSER do not know the difference between iWARP and IB or anyone else.
Software can just post a full size message and it gets sent out the wire without copying, segmentation, timers, resends, or other CPU hogs. This kind of stuff really helps with large messages. With SDP, apps can be made to take advantage of it without changes to the application. MS is also providing a standard way for just TOE NICs without RDMA abilities to work with the OS. Linux doesn't seem to have a standardized way for TCP/IP to be offloaded entirely but is supporting RDMA and SDP.
What SCTP seems to offer is a more explicit understanding of the difference between failure and congestion, plus multi-home support. This could make load balancing over multiple paths between hosts pretty interesting. The problem I see is that it is competing with the established TCP that now has many of its warts fixed with hardware offload. SCTP will still have the issue of a CPU handling segmentation/reassembly, massive amounts of interrupts, timer/retry overhead, etc. It also seems to have a higher overhead for connection establishment (although that is mitigated by being able to send data during the end phases). Is this a solution looking for a real problem? Perhaps not. Does this really have a chance of being taken up? I am not too confident.
-Ack
"I pray you never get a job managing a factory, then."
Me too. I don't want to manage a factory!
I like programming software and this issue comes up a lot in that world. Lots of tuning parameters are great. I believe they should be able to self set to some reasonable defaults based on system setup and potentially current conditions.
Oh, I have set up manufacturing tests for a consumer electronics product and for a military product and they have worked out pretty well.
-Ack
re: Controller side caching
If a system takes a power hit, your data could get lost in several places.
Computer RAM: This can especially occur if you are going through a file system and get stuck in the file system cache.
Hardware RAID Cache: This can be in the FC/SCSI/SAS/SATA controller, although from my experience the big systems put the RAID in with the storage. See the big beasts from EMC, NetApp, etc.
Disk Cache: Each hard disk has its own write cache.
So on a massive power failure, you need UPS across the board to let things make it to disk. Any component noted above will lose data on power failure if you don't have power backup.
On a massive power spike, any of these systems will lose data or get corrupted. Heck, RAM itself can read incorrectly on power spikes.
Now certainly Oracle may do things with their DB to ensure the integrity of data, and it probably does these things better than MySQL. On the other hand, some folks are comfortable with restoring from tape. Database snapshot systems can help alleviate the problem.
So I will take integrity of MySQL vs. Oracle as an argument, but how do you argue Solaris vs. Linux? If Linux is so bad, why does Oracle support it and even advertise for it? Companies like IBM, Unisys, and SGI all seem to think Linux can handle mission critical systems on the high end. Many of the top supercomputers in the world use Linux. On the embedded side Linux is doing quite well in mission critical components. Wind River is supporting Linux now despite having VxWorks.
If Oracle is running on Linux or Solaris on the same hardware, there isn't really much difference in integrity. At that point Oracle is writing pretty much straight to disk so it comes down to Oracle, computer hardware, and storage hardware. The OS doesn't make much of a difference.
Sure, I agree that MySQL does lack these things and they can contribute to performance, but some aspects of data integrity can be handled by using proper storage. Perhaps Oracle should make some of these things optional so that the speed vs. safety tradeoff can be in the user's hands? Perhaps they are comfortable with their RAID array with UPS backup and nightly tape backups.
From my point of view, a system should not require extensive tuning to run well in typical environments. I can't stand the idea that having lots of tunable settings makes something good. Does Oracle support some sort of auto tuning run where it figures out the best parameters given your system? It seems like it should be able to observe some sample usage and adjust things as needed. How about SQL optimization? It seems like bad software design if you need more expensive folks to make it work well vs. the competition, and that appears to be true for Oracle. Oracle DBAs cost more than others and it still often doesn't perform as well.
By the way, back at my internet ad company, we used Oracle DBAs to make sure our Oracle SQL was fast. Using Oracle specific tricks helped quite a bit but not enough to catch up.
The company sold their software to lots of pretty large web sites. Often the same systems that hosted ads handled other main issues. Very few had a problem with the integrity issues of MySQL even when we told them about it.
-Ack
I used to work at an internet advertising company. We would track ads and keep a database of what was set up, clicked on, etc. We supported several databases including MySQL, Oracle, and SQL Server. We defaulted to MySQL unless the customer already had a database installed that they wanted to use. The only reason we moved to Oracle was when folks hit a 2 GB limit on table (and file) size that MySQL on 32-bit x86 Linux had back then (not sure if it does now). Things got soooo much slower. Scripts that were designed to make reports overnight in an hour or so couldn't finish before folks came in the next day.
Also, Oracle seems not to be able to get its clustering to scale beyond a few servers without high end interconnects like InfiniBand. Even with IB, they needed a whole new protocol, Reliable Datagram Sockets, which SilverStorm made for them. I also used to work at SilverStorm. Oracle also wanted to invent a user mode RDMA based storage driver (user SCSI Remote DMA Protocol) because they seemed to feel that going through the kernel was a major bottleneck for storage.
It is interesting to see the need for all this new technology just to catch up in performance.
-Ack
"Money is a measure of how much society values your time and work."
I think it is more a supply and demand issue. Demand comes from the ability of a job to produce something of value for an employer.
Perhaps it is sad but true, but the US is a capitalist society where we are paid by the value of what we produce and how easy it is to find those that can do our work. This is the real reason a PhD in History may earn less than a PhD in a science who may earn less than an MBA.
I used to work in the InfiniBand space where folks are using host adapters at 20 Gbit (4X, Double Data Rate). Some of the big server vendors are doing 30 Gbit (12X Single Data Rate) host adapters. With all of this host speed it is only a matter of time before the switch to switch links go up in speed. High speed systems like this are getting used in high performance computing to build larger clusters. Having faster switch links will allow these fabrics to be created with fewer switches and thus fewer hops from node to node and thus lower latency. Latency is probably the most important factor in the performance of an HPC cluster. It doesn't stop here...IB defines up to Quad Data Rate 12X (120 Gbit). The HPC market is growing very well and the Ethernet folks want a bigger piece of the high end of this market. Systems with this high level of speed are also used in big telco setups. With broadband becoming increasingly popular and bandwidth increasing, the telcos need to have higher end equipment in their core.
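The 20, 30, and 120 Gbit figures above are just lanes times per-lane rate. A quick sketch of that arithmetic, assuming the 2.5 Gbit/s per-lane signaling rate at single data rate and 8b/10b encoding for the usable data rate; the function names are mine, not from any IB library.

```python
# Signaling-rate arithmetic for InfiniBand link widths and speeds.
SDR_LANE_GBIT = 2.5  # assumed per-lane signaling rate at single data rate
RATE_MULTIPLIER = {"SDR": 1, "DDR": 2, "QDR": 4}


def ib_link_gbit(width: int, rate: str) -> float:
    """Raw signaling rate in Gbit/s for a link of `width` lanes (4 for 4X)."""
    return width * SDR_LANE_GBIT * RATE_MULTIPLIER[rate]


def ib_data_gbit(width: int, rate: str) -> float:
    """Usable data rate, assuming 8b/10b encoding (8 data bits per 10 sent)."""
    return ib_link_gbit(width, rate) * 8 / 10


if __name__ == "__main__":
    for width, rate in ((4, "DDR"), (12, "SDR"), (12, "QDR")):
        print(f"{width}X {rate}: {ib_link_gbit(width, rate)} Gbit/s signaling, "
              f"{ib_data_gbit(width, rate)} Gbit/s data")
```

Note the quoted numbers are signaling rates, so the payload bandwidth a host can actually see is lower.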
Going through PCI-Express is about as close as you are going to get without a standard north bridge specification that everyone supported. It is very InfiniBand like...pretty much IB without the networking. Intel seemed to go this way when they didn't get their 4X HCA out the door. Then again, Advanced Switching Interconnect kind of gets you back to IB like fabrics. IBM has 12X HCAs for their high end lines that do not use PCI or PCI-Express. I suspect Sun is working on one as well given the all 12X switch they just released (based on what I believe is their silicon, not Mellanox). PathScale goes through HyperTransport instead of PCI-Express. They are getting pretty good latency numbers.
Why do you say "that fat-trees saturate so easily"? A proper full bisectional bandwidth tree on IB should allow full bandwidth to be used. For example, half the servers should be able to talk to the other half of the servers at full speed. A proper InfiniBand subnet manager should be able to program the routes so that you have an FBB fabric.
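As a rough illustration of what "full bisectional bandwidth" means for sizing, here is a sketch for a two-tier fat tree built from identical switches. It assumes every link runs at the same speed, each leaf splits its ports evenly between hosts and uplinks, and the subnet manager spreads routes evenly across the spines. The 24-port switch size is only an example I picked, and the function name is hypothetical.

```python
def two_tier_fat_tree(ports_per_switch: int) -> dict:
    """Size a full-bisection two-tier fat tree built from identical switches."""
    if ports_per_switch < 2 or ports_per_switch % 2:
        raise ValueError("need an even port count to split ports evenly")
    down = ports_per_switch // 2    # leaf ports facing hosts
    up = ports_per_switch - down    # leaf ports facing spine switches
    leaves = ports_per_switch       # each spine port connects to one leaf
    spines = up                     # one uplink from every leaf to every spine
    return {
        "hosts": leaves * down,
        "leaves": leaves,
        "spines": spines,
        # 1.0 means as much uplink as host bandwidth on every leaf (FBB);
        # above 1.0 the leaf uplinks can saturate before the hosts do.
        "oversubscription": down / up,
    }


if __name__ == "__main__":
    print(two_tier_fat_tree(24))
```

If a fabric is cabled with more host ports than uplinks per leaf, the ratio goes above 1.0 and it is no longer FBB, which is one way a "fat tree" can end up saturating.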
I have worked on Linux and Windows device drivers and embedded code on systems like VxWorks and on proprietary embedded systems. I have also written user space programs (math libraries, configuration programs, etc). I find that writing kernel code is much less forgiving than user space code, especially when working with hardware. The mistakes that can be made can be very subtle and much more difficult to debug given the systems available today. Deadlock your user space program and it just stops. Deadlock your kernel and your whole system hangs. Ever have the joy of having hardware trash system memory due to undocumented limitations on usage? Have you ever watched in horror as a low memory reference (like null->member) leads to a random hardware access instead of being caught by the system? When you create a complete system with no memory protection because you don't have the hardware for it, it is more difficult. There is a reason that there are a small number of folks who work in the embedded and kernel spaces.
I am not saying that user space programming is easy. Kernel space programming brings its own challenges that user space programmers almost never have to consider. Meanwhile it still has many of the same challenges that user space programming has.
I do agree that many kernels do lots of the same things that user space programs could do and perhaps some of that should be pushed to user space. The problem is that attempts to move things to user space or use protected languages (like Java) for kernels have not worked well so far. They tend to require more resources or run slower, which kernels often don't want and embedded systems can't afford. Perhaps your idea of a mix might work. But isn't that going in the microkernel direction, with the microkernel being the "few parts of the kernel that require running in fixed time windows or direct hardware access" and the rest being a protected or managed language? By the way, "direct hardware access" is really not a small part of the kernel when you consider all the device drivers that require direct access to the hardware they are trying to support. I am curious how much of most running O/S kernels is devoted to O/S services and how much is handling specific hardware (NIC, storage card, video card, etc).
I have done my share of kernel programming and I have always thought that it is pretty horrible that simple device driver bugs can take down the system. Almost all of Windows' Blue Screens are from bad third party drivers. Almost all of the oopses I have seen on Linux are from device drivers for extra hardware (I mean drivers not for core common O/S features). On Linux, device driver debug still seems to be horrible; on Windows it is considerably better but still not as good as application debug. With common user systems as cheap and fast as they are now, do user mode device drivers make sense? Is the performance worth giving up for the stability? Check out Microsoft's User-mode Driver Framework approach. Here is an old Linux Journal article on the subject. Does anyone know of other interesting examples of user mode device drivers on any operating systems?
The IB folks are getting more MPI bandwidth than any interconnect out there. The latency is also very good.
By the way, MPI is not the first RDMA technology Windows has had. WinSock Direct has allowed user space RDMA through a sockets interface for some time.
From what I understand, Intel started the Itanium program with HP because they started to feel the heat of RISC and worried about their architecture's scalability in the future. So off went the team with HP compiler folks to do a VLIW next generation processor where the compiler can figure out lots in advance. Meanwhile other folks at Intel figured out that they could make the core RISC like and provide a conversion layer to handle x86. The success of the later generation Pentiums put Intel in a situation where they needed to support multiple architectures.
Remember when everyone was going to use Itanium? Now it is down to a few big boxes like HP, SGI, and Unisys. HP is also one of the biggest Opteron sellers these days!
I think a few technical things blew it for Itanium. It is a fact that the Itanium sucked at I/O. It had much higher latencies than Xeons and much higher than Opterons. Folks in the high performance computing community know that the same PCI-X networking cards (InfiniBand, Myrinet, etc) always performed worse on Itanium based systems.
Also the combination of the VLIW architecture and lots of registers made the Itanium very cache hungry. You can get as much as 9 MB of L3 cache. That increases cost and die size. Also, all that cache has to be powered! I don't think Intel/HP saw it coming.
Currently Intel has to support Itanium, P4/Xeon, EM64T variants, Pentium M variants, and let's not forget its XScale based line of network/embedded processors it got from DEC. At least the i960 is dead.
I think Intel has lost their focus. Those of us in the InfiniBand community watched Intel screw up their 2nd gen InfiniBand adapter and get beaten to market by a startup by the name of Mellanox. Since they couldn't win they decided not to play anymore. Suddenly we get PCI-Express...aka InfiniBand without the networking or as much management. Advanced Switching Interconnect almost gets us back to IB...arg.
Intel had the first 10 Gig Ethernet NIC, but now their competition has TCP/IP and iSCSI offload and iWARP/RDMA interfaces while Intel has canceled their full TCP/IP offload technology and doesn't even have a PCI Express adapter ready.
Intel is relying on marketing and manufacturing power to keep it going. They hope that the brand can be stronger than the sum of its parts. Centrino, ViiV, etc. Sadly, I don't think it will change until the biggest gun in the PC industry actually gives AMD a try. I suspect Dell is getting better prices on Intel CPUs than anyone else.
While Windows itself doesn't support clustering, there are lots of libraries that run on top of Windows that allow for it. Several vendors offer MPI (Message Passing Interface) for Windows running over TCP/IP, for example. I am pretty sure Cornell is actively testing an InfiniBand cluster on Windows. I wouldn't be surprised if they were using an NDA preview of the HPC edition.
Windows (and Linux for that matter) supports offloading checksum calculations. Windows also supports large sends where segmentation is offloaded. Finally, Windows supports IPsec offload. All of the top NICs (3Com, Broadcom, Intel) support this. Microsoft was working on offloading all of TCP/IP processing. In theory, the NICs have ASICs and/or network processors that can handle high speed (think > 1 gigabit) TCP/IP processing and the OS just gives it data to send and a place to receive.
Depending on what you are doing it can make a big difference. Segmentation/reassembly can cost a lot of CPU time. Its effect is less if both sides of the connection (and the switches in between) support jumbo frames (9KB frames instead of 1500 byte frames). If done properly TCP/IP offload can even lead to avoiding the user/kernel switch and the copying of network frames that waste memory bandwidth. Many folks have noted that checksum offload doesn't help much when the memory is still being copied. Also, in theory an ASIC could handle TCP/IP with lower latency and higher throughput than most server CPUs. Other network adapters targeted at high performance computing like InfiniBand have their form of TCP/IP embedded in the ASICs of the controllers and show > 10 gigabit bidirectional throughput and sub 5 microsecond end-to-end latency in PCI Express x8 slots.
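To give a feel for why jumbo frames cut segmentation work, here is a small sketch that counts how many frames one message turns into at each MTU. It assumes 40 bytes of IPv4 plus TCP headers per frame with no options, and treats "9KB" as a 9000-byte MTU; both are my assumptions, and the function name is made up.

```python
HEADER_BYTES = 40  # assumed: 20-byte IPv4 header + 20-byte TCP header, no options


def frames_needed(message_bytes: int, mtu: int) -> int:
    """Number of TCP segments (frames) needed to carry one message."""
    payload = mtu - HEADER_BYTES
    if payload <= 0:
        raise ValueError("MTU too small to carry any payload")
    # Integer ceiling division: a partial last segment still costs a frame.
    return (message_bytes + payload - 1) // payload


if __name__ == "__main__":
    one_mib = 1 << 20
    print("1500-byte MTU:", frames_needed(one_mib, 1500), "frames")
    print("9000-byte MTU:", frames_needed(one_mib, 9000), "frames")
```

Under these assumptions a 1 MiB send is 719 frames at the standard MTU and 118 with jumbo frames, so the per-frame work (segmentation, interrupts, header processing) drops by roughly a factor of six when the host has to do it itself.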
I have a BS in computer science and am currently getting my masters. I have worked in the industry for 9 years or so and I love it. I find that people don't seem to understand what working as a computer scientist means. They also don't understand the infinite variety of things you can work on. I have led a varied life for a programmer, I think. Every time I think I have done it all, something new comes along and I am interested again. They don't understand the amount of creativity that can be involved. In some cases art can be involved. In some ways I think that computer science is the ultimate mix of art and science, creativity and logic.
If people really do feel that a shortage of computer scientists, electrical engineers, and information technology folks is coming, they should do something about it! I feel that schools don't offer nearly enough grants for these areas. I also feel that years of success in industry have drained away many of the good teachers.
People who work in these fields need to try to spread the word about just what it is that we do. I know folks who make software for video phones, RC cars, navy ships, stock traders, and massive computer clusters. There are so many things that you can do in this field. Many of them help people (like medical products), are innovative (music/video players), artistic (video games/web sites), etc.
I think if people really understood what is done in these fields more would be interested in it.
As for salary...I know quite a few software and hardware engineers and they all seem to be doing pretty well. CS is like any other field where you have to work hard to do well and move up.
"While some men apparently would be happy to spend the next 40 years of their lives working on the next version of MS Office, I want to *do* something"
I find what you said really rude and uninformed. There are literally thousands of different types of jobs in the world of computer science. There are many more if you add electrical engineering and information technology. There are computer scientists who "do" something every day. What about the programmers who wrote the code to work through the human genome? What about the programmers who write code to simulate the effects of drugs to reduce the use of lab animals? What about the code that helps scientists find the cure for cancer? Isn't this doing something?
My resume is an example of moving around in different parts of computer science. In 9 years I have written financial software, device drivers for networking and storage, advertising software, network management software for high performance computing clusters, and now I work on software for radio controlled devices. My friends work in lots of other areas. Open your mind and then maybe your eyes will see what is really out there.
IB already exists over fibre. Most folks don't use it because it is much more expensive than copper solutions. Copper is going 10-15 meters these days. Mellanox and Gore just announced 40 meters. http://www.marketwire.com/mw/release_html_b1?release_id=73927
The quality of 4x IB cable has gotten much better over the last two years. It will continue to improve as 10 GigE also uses the same style cable.
For Windows driver development, try http://www.osronline.com/ and click on "The Online DDK". Look under Kernel-Mode Driver Architecture, then under Design Guide, then under Servicing Interrupts. You can also order the DDK from http://www.microsoft.com/whdc/ddk/winddk.mspx . The Windows 2003 DDK appears to be just the cost of shipping.
There are lots of ways to get info back from an interrupt. The simple "standard" way is for the caller to use an IOCTL interface in the driver and then wait on an event. On the interrupt, disable the device's interrupts and queue a DPC. In the DPC, drain the device queue, signal the event the user call is waiting on, and reactivate interrupts on the device.
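Here is the shape of that flow as a toy simulation. This is plain Python, not driver code: in a real driver the pieces would be the ISR, a DPC routine, and a kernel event, and every class and method name below is made up for illustration. It only shows the ordering: mask the device and defer in the ISR, then drain, signal, and unmask in the DPC.

```python
import threading
from collections import deque


class FakeDevice:
    """Hypothetical device model: a hardware queue plus an interrupt mask."""

    def __init__(self):
        self.hw_queue = deque()
        self.interrupts_enabled = True


class FakeDriver:
    def __init__(self, device):
        self.device = device
        self.data_ready = threading.Event()  # stands in for the kernel event
        self.completed = []

    def isr(self):
        """Interrupt time: do the minimum. Mask the device, ask for a DPC."""
        if not self.device.interrupts_enabled:
            return False  # already masked, nothing to queue
        self.device.interrupts_enabled = False
        return True       # caller should now run the DPC

    def dpc(self):
        """Deferred work: drain the queue, wake the waiter, unmask."""
        while self.device.hw_queue:
            self.completed.append(self.device.hw_queue.popleft())
        self.data_ready.set()
        self.device.interrupts_enabled = True

    def ioctl_wait(self, timeout=1.0):
        """User-side call: block until the DPC signals, then return the data."""
        if not self.data_ready.wait(timeout):
            return None
        self.data_ready.clear()
        out, self.completed = self.completed, []
        return out


def demo():
    dev = FakeDevice()
    drv = FakeDriver(dev)
    results = []
    waiter = threading.Thread(target=lambda: results.append(drv.ioctl_wait()))
    waiter.start()
    dev.hw_queue.extend([b"pkt0", b"pkt1"])  # device produces data
    if drv.isr():                            # "interrupt" fires
        drv.dpc()                            # the queued DPC runs later
    waiter.join()
    return results[0], dev.interrupts_enabled


if __name__ == "__main__":
    print(demo())
```

Keeping the ISR this small is the point of the pattern: the expensive work happens in the DPC while the device is masked, so the device can't interrupt again until its queue has been drained.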
There are other ways but this is the first one that occurs to me. Have fun!
I would love to see Linux really catch up to Windows here. Having worked on Windows and Linux drivers I can honestly say that for me, Windows device driver development is much easier. I am comparing writing Windows 2000/XP device drivers with Linux 2.4.x device drivers. Most that I have worked on are networking, storage, or low level bus drivers.
The driver APIs in Windows appear to be more stable and better documented. The backwards compatibility that MS allows in their driver model is great. Each time a new feature was added, it was always possible to use the old style for a few revisions. For example, when power management and plug and play were added in Win2k, MS made sure you could still make a driver without the new calls and things would work. Even their wrapper models for networking (NDIS miniport) and storage (SCSI miniport) easily allow backwards compatibility. NDIS is nicely designed with versioning in the structures so that NDIS can know what version of the API the driver supports and handle it correctly.
The documentation in the DDK help has improved greatly since the dark NT4 days. MS worked hard to audit the DDK docs and work with the developer community to improve them. These days there isn't much you will find in a header that doesn't have a nice page in the DDK help.
At each Windows Hardware Engineering Conference and also at the new Driver Developer Conferences they go way out of their way to make life easy for driver developers. On the source front, they provide source for sample drivers of almost every kind...even for some currently shipping internals.
The debugger is great. From a GUI or command line, I can reload drivers over my debugger connection (serial or 1394) on a live system. I can connect to my debugger over TCP and remotely debug it. I can do almost everything I can do in a normal application debugger.
I can get kernel dumps of various types from full memory to 64 KB minidumps. Full memory dumps allow crashes to be totally debugged...much of the guess work is removed when you can see everything that was on the system at the time of death.
They also have great test tools built in. Between Driver Verifier and the Hardware Compatibility Tests, a massive number of issues can be caught before the driver even gets to system testing.
In the Linux world, I have to live with weak kernel debuggers and lack of true memory dumps. In a real low level driver for a DMA device, in many cases you don't get the nice happy survivable oops...you get the "I need a damn camera and small console font to capture what stack made it out" oops. Every Linux 2.4 device driver book should come with a digital camera for debugging! I heard that 2.6 adds some sort of memory dump...a dump to disk would make post-mortems so much easier. Anyone know more about this?
Add to that the constant changes that instantly make documentation outdated and force driver developers to rewrite with only the new source as their guide. The kernel rev issue is not just a GPL it and recompile...the APIs change and the meanings of status codes change, etc. Each kernel revision my company supports requires significant work on our end. Even if it was as simple as a recompile and test, the rate of kernels released makes it difficult for developers and system test groups to keep up. It takes a lot to test high end drivers. Weeks can go into a system test plan for a specific revision of the driver with a specific revision of the kernel only to see a newer kernel suddenly become the "new new" thing.
On the test tools front, the world is fragmented with some companies having some certification testing but no true driver certification tests. I would love to see a 2.6 storage driver tester and a 2.6 networking driver tester. Is there anything happening on this front?
I wonder what the interconnect between the nodes will be. Gigabit ethernet seems far too slow. There is Myrinet, Dolphin, and other HPC interconnects. 10Gb ethernet is still really expensive and there is only one NIC on the market (from Intel). InfiniBand would make a lot of sense...10 Gb, much cheaper than 10Gb Ethernet, much lower latency, and already supports MPI and TCP/IP offloaded sockets. Of course maybe for systems this large, a special machine specific interconnect makes sense.
"SCTP and Infiniband focus on different areas. IB is largely a high performance HPC / cluster network architecture for LAN applications, where SCTP is a transport protocol designed to operate efficiently under WAN conditions (significant packet loss, high RTTs)."
Ok, that makes more sense. IB and other hardware based reliability systems all have problems with long distances. There are folks working on IB WAN though including the U.S. Naval Research Laboratory. Check out Obsidian Research.
"The interrupt issue has largely been solved - on Linux NAPI dynamically switches between interrupt and polled mode to reduce this overhead to negligible levels. Message signalled interrupts also help considerably."
Cool. I was not aware of NAPI...been too long since I have been in linux kernel land. I agree about MSI helping out. It would be interesting see how this effects HPC performance. HPC has appliations that can include both large transfers and latency sensitive messages. By the way, I am pretty sure intel nics had a similar control as an option on their linux drivers for some time.
"What would be much more helpful (and economical) for iSCSI, SCTP, and RDDP is NIC CRC32C checksum generation. CRC generation is quite expensive in software but trivial in hardware."
Yep. I guess its a matter of finding the right balance between what should be offloaded and for what cost.
"One advantage of SCTP over TCP is that on a per stream basis, SCTP connection establishment overhead is much lower than TCP - basically O(1) instead of O(N) in the number of streams."
Interesting. Oracle and SilverStorm pushed out Reliable Datagram Sockets for IB (could work over iWarp) to handle the issue of lots of connections to the same host. Oracle saw massive scaling issues on pretty much any hardware or software for their clustering. RDS solved it by multiplexing all threads' traffic that goes to the same host down one reliable pipe. I wonder how SCTP would handle this?
For all its problems TCP/IP is everywhere. This fact has made it the networking technology to use even when it doesn't make technical sense. For example, folks use it in high performance computing and in storage (iSCSI) where there are much better methods available technically. Its commonality (along with ethernet's popularity) often make TCP/IP over ethernet the cheapest solution to many problems (while not the best).
I used to work on InfiniBand where the reliability/congestion detection protocol (Reliable Connected and Link Level flow control in IB terms) are in hardware. This scales to 20 Gbit connections between hosts quite well. Other examples of hardware protocols include myrinet (invented by myricom) and qsnet (from quadrics) and scalable coherent interface current pushed by Dolphin Interconnect. All of these folks struggle to compete with good old TCP/IP over ethernet. Except for the parts of the HPC world, TCP/IP over ethernet wins. In the storage landscape, Fibre Channel, SAS, and SATA seem to be holding out but iSCSI sure is trying.
The performance issue is real though and very few systems can saturate a 10 Gbit TCP/IP etnernet link without massive host CPU overhead. One solution floating around is that instead of trying to make new protocols to replace TCP, we should imitate the competition and put hard work in hardware. TCP/IP offload NICs (TOE) are becoming increasingly more popular. With RDMA technology layered on top of it you get iWARP. For storage you get iSER (ironically from an IB company!). This technology is being adopted by both the MS and Linux camps so it seems to have a good shot. In fact, many of the interfaces used by IB work about as well over iWARP cards. Things like Message Passing Interface, Direct Access Provider Library, Sockets Direct Protocol (SDP), and iSER do not know the difference between iWARP and IB or anyone else.
Software can just post a full size message and it gets sent out the wire without copying, segmentation, timers, resends, or other CPU hogs. This kind of stuff really helps with large messages. With SDP, apps can be made to take advantage of it without changes to the application. MS is also providing a standard way for just TOE NICs without RDMA abilities to work with the OS. Linux doesn't seem to have a standardized way for TCP/IP to be offloaded entirely but is supporting RDMA and SDP.
The things SCTP seems to offer are a more explicit understanding of the difference between failure and congestion, and multi-home support. This could make load balancing over multiple paths between hosts pretty interesting. The problem I see is that it is competing with the established TCP, which now has many of its warts fixed with hardware offload. SCTP will still have the issue of a CPU handling segmentation/reassembly, massive amounts of interrupts, timer/retry overhead, etc. It also seems to have a higher overhead for connection establishment (although that is mitigated by being able to send data during the end phases). Is this a solution looking for a real problem? Perhaps not. Does this really have a chance of being taken up? I am not too confident.
-Ack
"I pray you never get a job managing a factory, then."
Me too. I don't want to manage a factory!
I like programming software and this issue comes up a lot in that world. Lots of tuning parameters are great. I believe they should be able to self set to some reasonable defaults based on system setup and potentially current conditions.
Oh, I have setup manufacturing tests for a consumer electronics product and for a military product and they have worked out pretty well.
-Ack
re: Controller side caching
If a system takes a power hit, your data could get lost in several places.
Computer RAM: This can especially occur if you are going through a file system and the data gets stuck in the file system cache.
Hardware RAID Cache: This can be in the FC/SCSI/SAS/SATA controller although from my experience the big systems put the RAID in with the storage. See the big beasts from EMC, NetApp, etc.
Disk Cache: Each hard disk has its own write cache
So on a massive power failure, you need UPS across the board to let things make it to disk. Any component noted above will lose data on power failure if you don't have power backup.
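For the first layer (computer RAM and the file system cache), an application can at least ask the OS to push its data down. A minimal Python sketch, with a made-up function name: `flush()` empties the process's own buffer and `os.fsync()` asks the OS to write its cached pages to the device. Whether the RAID controller and disk write caches below that actually commit the data is up to the hardware, which is where the UPS or battery backup comes in.

```python
import os


def durable_write(path, data):
    """Write data and push it past the process buffer and the OS file cache."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()              # process buffer -> OS file system cache
        os.fsync(f.fileno())   # OS file system cache -> storage device
    return os.path.getsize(path)
```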
On a massive power spike, any of these systems will lose data or get corrupted. Heck, RAM itself can read incorrectly on power spikes.
Now certainly Oracle may do things with their DB to ensure the integrity of data and it probably does these things better than MySql. On the other hand some folks are comfortable with restoring from tape. Database snapshot systems can help alleviate the problem.
So I will take integrity of MySql vs Oracle as an argument, but how do you argue Solaris vs. Linux? If Linux is so bad why does Oracle support it and even advertise for it? Companies like IBM, Unisys, and SGI all seem to think Linux can handle mission critical systems on the high end. Many of the top super computers in the world use linux. On the embedded side linux is doing quite well in mission critical components. WindRiver is supporting linux now despite having VxWorks.
If Oracle is running on linux or solaris on the same hardware, there isn't really much difference in integrity. At that point oracle is writing pretty much straight to disk so it comes down to Oracle, computer hardware, and storage hardware. The OS doesn't make much of a difference.
Sure, I agree that MySql does lack these things and that lacking them can contribute to its performance, but some aspects of data integrity can be handled by using proper storage. Perhaps Oracle should make some of these things optional so that the speed vs. safety tradeoff can be in the user's hands? Perhaps they are comfortable with their RAID array with UPS backup and nightly tape backups.
From my point of view, a system should not require extensive tuning to run well in typical environments. I can't stand the idea that having lots of tunable settings makes something good. Does Oracle support some sort of auto tuning run where it figures out the best parameters given your system? It seems like it should be able to observe some sample usage and adjust things as needed. How about SQL optimization? It seems like bad software design if you need more expensive folks to make it work well vs the competition, and that appears to be true for Oracle. Oracle DBAs cost more than others and it still often doesn't perform as well.
By the way, back at my internet ad company, we used oracle dbas to make sure our Oracle SQL was fast. Using oracle specific tricks helped quite a bit but not enough to catch up.
The company sold their software to lots of pretty large web sites. Often the same systems that hosted ads handled other main issues. Very few had a problem with the integrity issues of MySql even when we told them about it.
-Ack
I used to work at an internet advertising company. We would track ads and keep a database of what was set up, clicked on, etc. We supported several databases including MySql, Oracle, and SqlServer. We defaulted to MySql unless the customer already had a database installed that they wanted to use. The only reason we moved to Oracle was when folks hit a 2 GB limit on table (and file) size that MySql on 32-bit x86 linux had back then (not sure if it does now). Things got soooo much slower. Scripts that were designed to make reports overnight in an hour or so couldn't finish before folks came in the next day.
Also they seem to not be able to get their clustering to scale beyond a few servers without high end interconnects like InfiniBand. Even with IB, they needed a whole new protocol, Reliable Datagram Sockets, which SilverStorm made for them. I also used to work at SilverStorm. Oracle also wanted to invent a user mode RDMA based storage driver (user SCSI Remote DMA Protocol) because they seemed to feel that going through the kernel was a major bottleneck for storage.
It is interesting to see the need for all this new technology just to catch up in performance.
-Ack
Doesn't the Itanium do pretty well on floating point?
-Ack
"Money is a measure of how much society values your time and work."
I think it is more a supply and demand issue. Demand comes from the ability of a job to produce something of value for an employer.
Perhaps it is sad but true, but the US is a capitalist society where we are paid by the value of what we produce and how easy it is to find those that can do our work. This is the real reason a PhD in History may earn less than a PhD in a science who may earn less than an MBA.
-Ack
I used to work in the InfiniBand space where folks are using host adapters at 20 Gbit (4X, Double Data Rate). Some of the big server vendors are doing 30 Gb (12X Single Data Rate) host adapters. With all of this host speed it is only a matter of time before the switch to switch links will go up in speed.
High speed systems like this are getting used in high performance computing to build larger clusters. Having faster switch links will allow these fabrics to be created with fewer switches and thus fewer hops from node to node and thus lower latency. Latency is probably the most important factor in the performance of an HPC cluster. It doesn't stop here...IB defines up to Quad Data Rate 12X (120 Gbit). The HPC market is growing very well and the ethernet folks want a bigger piece of the high end of this market.
Systems with this high level of speed are also used in big telco setups. With broadband becoming increasingly popular and bandwidth increasing, the telcos need to have higher end equipment in their core.
Going through PCI-Express is about as close as you are going to get without a standard north bridge specification that everyone supported. It is very InfiniBand like...pretty much IB without the networking. Intel seemed to go this way when they didn't get their 4X HCA out the door. Then again, Advanced Switching Interconnect kind of gets you back to IB like fabrics.
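The link speeds quoted here are just lane count times per-lane signaling rate. A quick Python sketch of the arithmetic (function names are my own): IB lanes signal at 2.5 Gbit/s (SDR), 5 Gbit/s (DDR), or 10 Gbit/s (QDR), and with 8b/10b encoding the usable data rate is 80% of the signaling rate.

```python
# Per-lane signaling rate in Gbit/s for each InfiniBand speed grade.
LANE_GBIT = {"SDR": 2.5, "DDR": 5.0, "QDR": 10.0}


def signaling_rate_gbit(width, rate):
    """Raw link rate: number of lanes (1, 4, or 12) times the per-lane rate."""
    return width * LANE_GBIT[rate]


def data_rate_gbit(width, rate):
    """Usable rate after 8b/10b encoding (8 data bits per 10 bits on the wire)."""
    return signaling_rate_gbit(width, rate) * 8 / 10


for width, rate in [(4, "DDR"), (12, "SDR"), (12, "QDR")]:
    print(f"{width}X {rate}: {signaling_rate_gbit(width, rate)} Gbit/s signaling, "
          f"{data_rate_gbit(width, rate)} Gbit/s data")
```

So the 20, 30, and 120 Gbit figures are signaling rates; the payload bandwidth is 16, 24, and 96 Gbit/s respectively.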
IBM has 12X HCAs for their high end lines that do not use PCI or PCI-Express. I suspect Sun is working on one as well given the all 12X switch they just released (based on what I believe is their silicon, not Mellanox).
Pathscale goes through Hypertransport instead of PCI-Express. They are getting pretty good latency numbers.
Why do you say "that fat-trees saturate so easily"? A proper full bisectional bandwidth tree on IB should allow full bandwidth to be used. For example, half the servers should be able to talk to the other half of the servers at full speed. A proper InfiniBand subnet manager should be able to program the routes so that you have an FBB fabric.
I have worked on linux and windows device drivers and embedded code on systems like VxWorks and on proprietary embedded systems. I have also written user space programs (math libraries, configuration programs, etc). I find that writing kernel code is much less forgiving than user space code, especially when working with hardware. The mistakes that can be made can be very subtle and much more difficult to debug given the systems available today. Deadlock your user space program and it just stops. Deadlock your kernel and your whole system hangs. Ever have the joy of having hardware trash system memory due to undocumented limitations on usage? Have you ever watched in horror as a low memory reference (like null->member) leads to a random hardware access instead of being caught by the system? When you create a complete system with no memory protection because you don't have the hardware for it, it is more difficult. There is a reason that there are a small number of folks who work in the embedded and kernel spaces.
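As a rough illustration of what "full bisectional bandwidth" means in sizing terms, here is a Python sketch for a two-level fat tree built from identical switches. The function names and the choice of 24-port switches are assumptions for the example. Each leaf switch uses half its ports for hosts and half for uplinks, so the uplink capacity matches the host capacity.

```python
def two_level_fat_tree(radix):
    """Size a two-level full-bisection fat tree built from radix-port switches."""
    if radix < 2 or radix % 2:
        raise ValueError("radix must be a positive even number")
    down = radix // 2     # host-facing ports per leaf switch
    up = radix // 2       # uplink ports per leaf switch (one to each spine)
    spines = up           # each spine connects once to every leaf
    leaves = radix        # a spine has radix ports, so it can reach radix leaves
    return {
        "leaves": leaves,
        "spines": spines,
        "hosts": leaves * down,
        "uplinks_per_leaf": up,
        "downlinks_per_leaf": down,
    }


def is_full_bisection(downlinks_per_leaf, uplinks_per_leaf):
    """Full bisection needs at least as much uplink as host-facing capacity."""
    return uplinks_per_leaf >= downlinks_per_leaf


fabric = two_level_fat_tree(24)
print(fabric["hosts"], is_full_bisection(fabric["downlinks_per_leaf"],
                                         fabric["uplinks_per_leaf"]))
```

If a fabric is built with fewer uplinks than host ports per leaf (oversubscribed), then it really can saturate, but that is a cabling and cost decision rather than a property of fat trees. The routes still have to be spread across the uplinks, which is the subnet manager's job.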
I am not saying that user space programming is easy. Kernel space programming brings its own challenges that user space programmers almost never have to consider. Meanwhile it still has many of the same challenges that user space programming has.
I do agree that many kernels do lots of the same things that user space programs could do and perhaps some of that should be pushed to user space. The problem is that attempts to move things to user space or use protected languages (like java) for the kernel have not worked well so far. They tend to require more resources or run slower, which kernels often don't want and embedded systems can't afford. Perhaps your idea of a mix might work. But isn't that going in the microkernel direction, with the microkernel being the "few parts of the kernel that require running in fixed time windows or direct hardware access" and the rest being a protected or managed language? By the way, "direct hardware access" is really not a small part of the kernel when you consider all the device drivers that require direct access to the hardware they are trying to support. I am curious how much of most running O/S kernels is devoted to O/S services and how much is handling specific hardware (NIC, storage card, video card, etc).
I have done my share of kernel programming and I have always thought that it is pretty horrible that simple device driver bugs can take down the system. Almost all of Windows' Blue Screens are from bad third party drivers. Almost all of the oopses I have seen on linux are from device drivers for extra hardware (I mean drivers not for core common O/S features). On linux device driver debug still seems to be horrible; on Windows it is considerably better but still not as good as application debug.
With common user systems as cheap and fast as they are now, do user mode device drivers make sense? Is the performance worth giving up for the stability? Check out Microsoft's User-mode Driver Framework approach. Here is an old linux journal article on the subject. Does anyone know of other interesting examples of user mode device drivers on any operating systems?
It will include OpenIB for Windows. See:
http://windows.openib.org/. It uses a BSD style license.
The IB folks are getting more MPI bandwidth than any interconnect out there. The latency is also very good.
By the way, MPI is not the first RDMA technology Windows has had. WinSock Direct has allowed user space RDMA through a sockets interface for some time.
From what I understand, Intel started the Itanium program with HP because they started to feel the heat of RISC and worried about their architecture's scalability in the future. So off went the team with HP compiler folks to do a VLIW next generation processor where the compiler can figure lots in advance.
Meanwhile other folks at Intel figured out that they could make the core RISC like and provide a conversion layer to handle x86. The success of the later generations Pentiums put Intel in a situation where they needed to support multiple architectures.
Remember when everyone was going to use Itanium? Now it is down to a few big boxes like HP, SGI, and Unisys. HP is also one of the biggest Opteron sellers these days!
I think a few technical things blew it for Itanium. It is a fact that the Itanium sucked at I/O. It had much higher latencies than Xeons and much higher than Opterons. Folks in the high performance computing community know that the same PCI-X networking cards (InfiniBand, Myrinet, etc) always performed worse on Itanium based systems.
Also the combination of the VLIW architecture and lots of registers made the Itanium very cache hungry. You can get as much as 9 MB of L3 cache. That increases cost and die size. Also, all that cache has to be powered! I don't think Intel/HP saw it coming.
Currently Intel has to support Itanium, P4/Xeon, EM64T variants, Pentium M variants, and let's not forget its XScale based line of network/embedded processors it got from DEC. At least the i960 is dead.
I think Intel has lost their focus. Those of us in the InfiniBand community watched Intel screw up their 2nd gen InfiniBand adapter and get beat to market by a startup by the name of Mellanox. Since they couldn't win they decided not to play anymore. Suddenly we get PCI-Express...aka InfiniBand without the networking and without as much management. Advanced Switching Interconnect almost gets us back to IB...arg.
Intel was the first ethernet NIC to 10 Gig but now their competition has TCP/IP and iSCSI offload and iWarp/RDMA interfaces while Intel has canceled their full TCP/IP offload technology and doesn't even have a PCI Express adapter ready.
Intel is relying on marketing and manufacturing power to keep it going. They hope that the brand can be stronger than the sum of its parts. Centrino, ViiV, etc. Sadly, I don't think it will change until the biggest gun in the PC industry actually gives AMD a try. I suspect Dell is getting better prices on Intel CPUs than anyone else.
While windows itself doesn't support clustering, there are lots of libraries that run on top of windows that allow for it. Several vendors offer MPI (message passing interface) for windows running over TCP/IP for example. I am pretty sure Cornell is actively testing an InfiniBand cluster on windows. I wouldn't be surprised if they were using an NDA preview of the HPC edition.
Windows Datacenter Edition can handle more than 8 processors. Unisys sells a 32x called the ES 7000. http://www.unisys.com/products/es7000__servers/hardware/index.htm
http://www.top500.org/sublist/System.php?id=6560
Cornell is using a Windows cluster. It is ranked 326.
Windows (and Linux for that matter) supports offloading checksum calculations. Windows also supports large sends where segmentation is offloaded. Finally, Windows supports ipsec offload.
All of the top NICs (3com, broadcom, intel) support this. Microsoft was working on offloading all of TCP/IP processing. In theory, the NICs have ASICs and/or network processors that can handle high speed (think > 1 gigabit) TCP/IP processing and the OS just gives it data to send and a place to receive.
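For reference, this is the kind of work checksum offload takes off the CPU. A short Python sketch of the standard Internet checksum used by IP, TCP, and UDP (the 16-bit one's complement sum described in RFC 1071); the function name is my own. A NIC does the same thing in hardware as the bytes stream past.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's complement of the one's complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"                   # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                    # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


sample = bytes.fromhex("0001f203f4f5f6f7")
print(hex(internet_checksum(sample)))
```

The software version has to touch every byte of every packet, which is why it shows up in CPU profiles at high packet rates.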
Depending on what you are doing it can make a big difference. Segmentation/reassembly can cost a lot of CPU time. Its effect is less if both sides of the connection (and the switches in between) support jumbo frames (9KB frames instead of 1500 byte frames). If done properly TCP/IP offload can even lead to avoiding the user/kernel switch and the copying of network frames that waste memory bandwidth. Many folks have noted that checksum offload doesn't help much when the memory is still being copied. Also, in theory an ASIC could handle TCP/IP with lower latency and higher throughput than most server CPUs. Other network adapters targeted at high performance computing like InfiniBand have their form of TCP/IP embedded in the ASICs of the controllers and show > 10 gigabit bidirectional throughput and sub 5 microsecond end-to-end latency in PCI Express x8 slots.
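To put a number on the jumbo frame point, here is a small Python sketch counting how many frames a transfer needs at each MTU. It assumes 40 bytes of IPv4 plus TCP headers with no options and a 9000 byte jumbo MTU; the function name is made up.

```python
def frames_needed(payload_bytes, mtu, header_bytes=40):
    """Number of TCP segments needed to carry payload_bytes at a given MTU."""
    mss = mtu - header_bytes              # payload bytes that fit in one frame
    if mss <= 0:
        raise ValueError("MTU too small for the headers")
    return -(-payload_bytes // mss)       # integer ceiling division


one_mib = 1024 * 1024
print(frames_needed(one_mib, 1500), frames_needed(one_mib, 9000))
```

Roughly six times fewer frames means roughly six times fewer per-packet operations (header processing, interrupts, segmentation steps) for the same amount of data.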
I have a BS in computer science and am currently getting my masters. I have worked in the industry for 9 years or so and I love it. I find that people don't seem to understand what working as a computer scientist means. They also don't understand the infinite variety of things you can work on. I have led a varied life for a programmer I think. Every time I think I have done it all something new comes along and I am interested again. They don't understand the amount of creativity that can be involved. In some cases art can be involved. In some ways I think that computer science is the ultimate mix of art and science, creativity and logic.
If people really do feel that a shortage of computer scientists, electrical engineers, and information technology folks is coming, they should do something about it! I feel that schools don't offer nearly enough grants for these areas. I also feel that years of success in industry have drained away many of the good teachers.
People who work in these fields need to try to spread the word about just what it is that we do. I know folks who make software for video phones, rc cars, navy ships, stock traders, and massive computer clusters. There are so many things that you can do in this field. Many of them help people (like medical products), are innovative (music/video players), artistic (video games/web sites), etc.
I think if people really understood what is done in these fields more would be interested in it.
As for salary...I know quite a few software and hardware engineers and they all seem to be doing pretty well. CS is like any other field where you have to work hard to do well and move up.
"While some men apparently would be happy to spend the next 40 years of their lives working on the next version of MS Office, I want to *do* something"
I find what you said really rude and uninformed. There are literally thousands of different types of jobs in the world of computer science. There are many more if you add electrical engineering and information technology. There are computer scientists who "do" something every day. What about the programmers who wrote the code to work through the human genome? What about the programmers who write code to simulate the effects of drugs to reduce the use of lab animals? What about the code that helps scientists find the cure for cancer? Isn't this doing something?
My resume is an example of moving around in different parts of computer science. In 9 years I have written financial software, device drivers for networking and storage, advertising software, network management software for high performance computing clusters, and now I work on software for radio controlled devices. My friends work in lots of other areas. Open your mind and then maybe your eyes will see what is really out there.
Verizon (and Comcast for that matter) is fighting Philly's attempts at a free wireless network.
http://www.philly.com/mld/philly/11410060.htm
IB already exists over fibre. Most folks don't use it because it is much more expensive than copper solutions. Copper is going 10-15 meters these days. Mellanox and Gore just announced 40 meters. http://www.marketwire.com/mw/release_html_b1?release_id=73927
The quality of 4x IB cable has gotten much better over the last two years. It will continue to improve as 10 GigE also uses the same style cable.
For windows driver development, try http://www.osronline.com/ and click on "The Online DDK". Look under Kernel-Mode Driver Architecture, then under Design Guide, then under Servicing Interrupts. You can also order the DDK from http://www.microsoft.com/whdc/ddk/winddk.mspx . The Windows 2003 DDK appears to cost just the price of shipping.
There are lots of ways to get info back from an interrupt. The simple "standard" way is for the caller to use an IOCTL interface in the driver and then wait on an event. On the Interrupt, disable the device's interrupts and queue a DPC. In the DPC, drain the device queue, signal the event the user call is waiting on and reactivate interrupts on the device.
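The flow above can be modeled in a few lines. This is a user-space simulation in Python, not driver code: the class and method names are made up, and a real driver would use the kernel's own interrupt, DPC, and event objects. It only shows the ordering: the ISR masks the device and queues deferred work, the DPC drains the device queue, signals the waiter, and unmasks the device.

```python
import threading
from collections import deque


class FakeDevice:
    """Simulated device plus driver state (all names are hypothetical)."""

    def __init__(self):
        self.interrupts_enabled = True
        self.hw_queue = deque()               # data the "hardware" has produced
        self.dpc_queue = deque()              # deferred work waiting to run
        self.completed = []                   # data drained by the DPC
        self.data_ready = threading.Event()   # what the IOCTL caller waits on

    def isr(self):
        """Interrupt time: do the minimum, mask the device, queue the DPC."""
        if not self.interrupts_enabled:
            return False                      # not ours, or already being handled
        self.interrupts_enabled = False
        self.dpc_queue.append(self.dpc)
        return True

    def dpc(self):
        """Deferred work: drain the device, wake the waiter, unmask the device."""
        while self.hw_queue:
            self.completed.append(self.hw_queue.popleft())
        self.data_ready.set()
        self.interrupts_enabled = True

    def run_pending_dpcs(self):
        """Stand-in for the OS running queued DPCs after the ISR returns."""
        while self.dpc_queue:
            self.dpc_queue.popleft()()

    def ioctl_wait(self, timeout=1.0):
        """The caller's IOCTL path: block until the DPC signals, then collect."""
        if not self.data_ready.wait(timeout):
            return None
        self.data_ready.clear()
        items, self.completed = self.completed, []
        return items


dev = FakeDevice()
dev.hw_queue.extend(["pkt0", "pkt1"])
dev.isr()
dev.run_pending_dpcs()
print(dev.ioctl_wait())
```

Masking the device in the ISR keeps it from interrupting again while the DPC is still draining the queue, which is the main reason for splitting the work this way.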
There are other ways but this is the first one that occurs to me. Have fun!
I would love to see linux really catch up to Windows here. Having worked on Windows and Linux drivers I can honestly say that for me, Windows device driver development is much easier. I am comparing writing Windows 2000/XP device drivers with Linux 2.4.x device drivers. Most that I have worked on are networking, storage, or low level bus drivers.
The driver APIs in windows appear to be more stable and better documented. The backwards compatibility that MS allows in their driver model is great. Each time a new feature was added, it was always possible to use the old style for a few revisions. For example, when power management and plug and play were added in Win2k, MS made sure you could still make a driver without the new calls and things would work. Even their wrapper models for networking (NDIS miniport) and storage (SCSI miniport) easily allow backwards compatibility. NDIS is nicely designed with versioning in the structures so that NDIS can know what version of the API the driver supports and handle it correctly.
The documentation in the DDK help has improved greatly since the dark NT4 days. MS worked hard to audit the DDK docs and work with the developer community to improve them. These days there isn't much you will find in a header that doesn't have a nice page in the DDK help.
At each Windows Hardware Engineering Conference and also at the new Driver Developer Conferences they go way out of their way to make life easy for driver developers. On the source front, they provide source for sample drivers of almost every kind...even for some currently shipping internals.
The debugger is great. From a GUI or command line, I can reload drivers over my debugger connection (serial or 1394) on a live system. I can connect to my debugger over TCP and remotely debug it. I can do almost everything I can do in a normal application debugger.
I can get kernel dumps of various types from full memory to 64 kb minidumps. Full memory dumps allow crashes to be totally debugged...much of the guess work is removed when you can see everything that was on the system at the time of death.
They also have great test tools built in. Between Driver Verifier and the Hardware Compatibility Tests, a massive number of issues can be caught before the driver even gets to system testing.
In the linux world, I have to live with weak kernel debuggers and lack of true memory dumps. In a real low level driver for a DMA device, in many cases you don't get the nice happy survivable oops...you get the "I need a damn camera and small console font to capture what stack made it out" oops. Every linux 2.4 device driver book should come with a digital camera for debugging! I heard that 2.6 adds some sort of memory dump...a dump to disk would make post-mortems so much easier. Anyone know more about this?
Add to that the constant changes that instantly make documentation outdated and force driver developers to rewrite with only the new source as their guide. The kernel rev issue is not just a matter of "GPL it and recompile"...the APIs change and the meanings of status codes change, etc.
Each kernel revision my company supports requires significant work on our end. Even if it was as simple as a recompile and test, the rate of kernels released makes it difficult for developers and system test groups to keep up. It takes a lot to test high end drivers. Weeks can go into a system test plan for a specific revision of the driver with a specific revision of the kernel only to see a newer kernel suddenly become the "new new" thing.
On the test tools front, the world is fragmented with some companies having some certification testing but no true driver certification tests. I would love to see a 2.6 storage driver tester and a 2.6 networking driver tester. Is there anything happening on this front?
I wonder what the interconnect between the nodes will be. Gigabit ethernet seems far too slow. There is Myrinet, Dolphin, and other HPC interconnects. 10Gb ethernet is still really expensive and there is only one NIC on the market (from Intel). InfiniBand would make a lot of sense...10 Gb, much cheaper than 10Gb Ethernet, much lower latency, and already supports MPI and TCP/IP offloaded sockets.
Of course maybe for systems this large, a special machine specific interconnect makes sense.