The Story Behind a Failed HPC Startup

Posted by kdawson on Tuesday November 3, 2009 @11:25AM from the build-it-and-they-will-come-if-you-don't-run-out-of-money-first dept.

jbrodkin writes "SiCortex had an idea that it thought would take the supercomputing world by storm — build the most energy-efficient HPC clusters on the planet. But the recession, and the difficulties of penetrating a market dominated by Intel-based machines, proved to be too much for the company to handle. SiCortex ended up folding earlier this year, and its story may be a cautionary tale for startups trying to bring innovation to the supercomputing industry."

Re:Lesson learned by khallow · 2009-11-03 11:37 · Score: 2, Interesting

Don't be unlucky. At least, that's what the story is about.

More seriously, it looks like they were trying for high end supercomputing. There's probably a lot more money in smaller supercomputing clusters, but then they'd get hit hard by the proprietary structure of their hardware.
The fanciest-sounding solution ... by Wrath0fb0b · 2009-11-03 11:51 · Score: 5, Interesting

... is almost always wrong. As one of the principals on a large-ish (not large by world standards, 1000 cores, mainly Nehalem so approximately 100 GFLOPS) cluster, I've been very pleased that we've done things as simply as possible. Sun Grid Engine and ROCKS running on commodity 1Us delivers an economical and effective solution (no, I don't work for Sun).
Most importantly, the environment does not unduly restrict what kind of compute jobs can be run. If it can be compiled on *nix, we can probably run it. We lose to specialized hardware (GPU-based, Cell-based, ... ) in raw throughput but we make up for it in both initial price and ease of deployment. We don't even have a dedicated admin for the cluster -- we had one to set it up and he did such a good job we haven't needed to hire a replacement!
Ultimately, I feel like it's not worth paying extra in hardware and software-dev costs to save a few dollars on cooling and power. Sure, you get the credibility of running a "green" cluster (never mind that you have to pay to feed and house those extra developers, which should legitimately come out of your carbon budget) but you end up with a far less useful product.
Long Live X86(_64)!
Commodity by oldhack · 2009-11-03 12:13 · Score: 2, Interesting

FTFA:

"It is possible for a small company to compete in the computer systems business," Reilly wrote. "There are some who will say that nobody can compete against 'commodity manufacturers.' Ignore them. ... There are only two true commodities in the computer business: DRAMs and wafer area. Everybody pretty much pays the same price for DRAMs. Wafer area is what you make of it. If you insist on building giant 100W chips, life will be tough. But if you use the silicon wafer area for something new, different and efficient, a market will open up to you."

Many years ago, I wrote a paper for my business class that used the DRAM industry as an example of a commodity industry. The ignint professor gave me a C for that cuz he insisted DRAM is not a commodity. That dude at the time was a young one, too.
Lesson? Don't waste your time and money at b-school - it may damage your brains.

--
Fuck systemd. Fuck Redhat. Fuck Soylent, too. Wait, scratch the last one.
GPU's? by toastar · 2009-11-03 13:26 · Score: 2, Interesting

My next cluster is going to be based around Teslas. GPUs are the future. It takes 100,000 x86 cores to get a petaflop. You can get there with 25,000 if you use Cells (5K x86, 20K Cells). You can do the same thing with 10K if you use GPUs (5K x86, 5K Teslas). Guess what the cheapest option is? They might not be the most energy efficient, but haven't we learned the problem with custom chips in the HPC market? That's why we went to clusters in the first place.
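The core counts quoted above imply a per-device throughput for each option. Here is a rough back-of-envelope sketch of that arithmetic. It assumes (this is not stated in the comment) that the 5K x86 cores in the hybrid setups each deliver the same rate as a core in the all-x86 setup, and the function and variable names are made up for illustration.

```python
PETAFLOP = 1e15  # target: 1 PFLOPS

# Counts as quoted in the comment above.
X86_ONLY_CORES = 100_000
HYBRID_X86_CORES = 5_000
CELL_COUNT = 20_000
TESLA_COUNT = 5_000


def implied_per_device_gflops(x86_cores, accelerators):
    """GFLOPS each accelerator must supply after subtracting the x86 share.

    Assumption: a hybrid-cluster x86 core delivers the same rate as a core
    in the all-x86 cluster (1 PFLOPS / 100,000 cores).
    """
    per_x86 = PETAFLOP / X86_ONLY_CORES
    remaining = PETAFLOP - x86_cores * per_x86
    return remaining / accelerators / 1e9


per_x86_gflops = PETAFLOP / X86_ONLY_CORES / 1e9
per_cell_gflops = implied_per_device_gflops(HYBRID_X86_CORES, CELL_COUNT)
per_tesla_gflops = implied_per_device_gflops(HYBRID_X86_CORES, TESLA_COUNT)

print(f"x86 core: {per_x86_gflops:.1f} GFLOPS")
print(f"Cell:     {per_cell_gflops:.1f} GFLOPS")
print(f"Tesla:    {per_tesla_gflops:.1f} GFLOPS")
```

Under those assumptions the quoted counts work out to roughly 10 GFLOPS per x86 core, 47.5 per Cell and 190 per Tesla. These are only what the comment's numbers imply, not measured figures.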
Re:Low Power Supercomputer by The+Archon+V2.0 · 2009-11-03 13:53 · Score: 2, Interesting

You are probably one of those guys that thinks that if you can get 36 women working together on making a baby, it will be ready in 1 week.
It'd certainly be fun trying, though.;)

Not all problems can scale out to many cpus (or wombs, for that matter). Threading overhead, network latency/bandwidth, mutual exclusion (or the overhead on atomic data types) all conspire to defeat attempts to scale.
It's not my skill set, but I remember years ago seeing a fascinating show on how blindly adding more resources can make something SLOWER. The case study involved editing individual segments for a news show on limited editing equipment. To translate it into geek speak: unless you do things right, you might wind up with cores 2-8 zipping through their parts just to wait for core 1, which has unceremoniously had all the long tasks scheduled to run on it because the scheduling algorithm was designed with only a dual- or quad-core system in mind and gets stupid when handed more.
Really, I got the feeling from that show that trying to make multiple interrelated units work together on a single task without bottlenecks or downtime is a logistical nightmare no matter if it's people in a company, robots in a factory, or cores in a computer.
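The "cores 2-8 wait for core 1" effect is easy to reproduce with a toy model. The sketch below is only an illustration: the task durations are invented, and `naive_schedule` is a deliberately bad hypothetical scheduler, not any real one. It is compared against a standard greedy longest-task-first assignment.

```python
import heapq


def makespan(assignments):
    """The job finishes when the busiest core finishes."""
    return max(sum(tasks) for tasks in assignments)


def naive_schedule(tasks, cores, long_threshold):
    """Hypothetical bad scheduler: all long tasks land on core 0,
    short tasks are spread round-robin over the remaining cores."""
    assignments = [[] for _ in range(cores)]
    short_seen = 0
    for t in tasks:
        if t >= long_threshold:
            assignments[0].append(t)
        else:
            assignments[1 + short_seen % (cores - 1)].append(t)
            short_seen += 1
    return assignments


def lpt_schedule(tasks, cores):
    """Greedy longest-processing-time-first: hand each task,
    longest first, to whichever core is least loaded so far."""
    heap = [(0, c) for c in range(cores)]
    heapq.heapify(heap)
    assignments = [[] for _ in range(cores)]
    for t in sorted(tasks, reverse=True):
        load, c = heapq.heappop(heap)
        assignments[c].append(t)
        heapq.heappush(heap, (load + t, c))
    return assignments


# Invented workload: four long tasks and 28 short ones (arbitrary time units).
tasks = [10, 10, 10, 10] + [1] * 28

naive8 = makespan(naive_schedule(tasks, cores=8, long_threshold=10))
lpt8 = makespan(lpt_schedule(tasks, cores=8))
lpt2 = makespan(lpt_schedule(tasks, cores=2))

print(f"8 cores, naive: {naive8}")
print(f"8 cores, LPT:   {lpt8}")
print(f"2 cores, LPT:   {lpt2}")
```

With this made-up workload the badly scheduled 8-core run takes 40 time units, while a sensibly scheduled 2-core run takes 34 and a sensibly scheduled 8-core run takes 10. More cores only help if the scheduler keeps them evenly loaded.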
Re:The fanciest-sounding solution ... by Gorobei · 2009-11-03 14:41 · Score: 3, Interesting

Exactly right. I've got >10K cores and >10M LOC. "Hardware fault" typically means a datacenter caught fire, or was flooded, or an undersea cable got cut.
If someone pitches a cheaper solution (e.g. power savings), I'm happy to listen for 10 minutes. Then I just want to know how fast I can see results. A dev costs $50K/month here, so I'll give it a week or two. If you don't have a test farm ready to go with full compilers, a data security plan, etc., I'm going to just reject. If you can get traction with universities, great, come back and pitch again in a year.
Re:The fanciest-sounding solution ... by Jacques+Chester · 2009-11-03 16:47 · Score: 2, Interesting

That's what they designed: it's basically a bog-ordinary Linux-with-MPI cluster in a box. They had a custom internal fabric that was far more efficient than ordinary switches and even had on-die MPI accelerators. They also shipped with compilers for C, C++ and Fortran.
It was meant to be a drop-in replacement for room-sized clusters for a fraction of the space and heat. Basically what killed them was cashflow.

--

Classical Liberalism: All your base are belong to you.
Re:The fanciest-sounding solution ... by Gorobei · 2009-11-03 16:53 · Score: 2, Interesting

Yep, cashflow is a bitch: if I need to spend $25K to even look at the product, and they need $20M to run a demo datacenter, they need something like $100M in capital to avoid dying on the vine :(
SiCortex's failure by RzUpAnmsCwrds · 2009-11-03 22:44 · Score: 2, Interesting

Having actually used a SiCortex machine, I can tell you that the problem wasn't the VC, or the compilers, or even really the hardware.
The problem was the market.
There are two types of x86-based small clusters (the market that SiCortex was aiming for): clusters with Gigabit Ethernet and clusters with expensive interconnects (Myrinet, InfiniBand, or 10G Ethernet).
Gigabit Ethernet clusters do a good job with problems that are embarrassingly parallel (or at least have minimal communication demands). $150k gets you 300 Nehalem cores and a lot of memory. SiCortex fails here because its competing product (the SC1458) is much more expensive and much slower. The fact that the SC1458 uses less power (around 5kW instead of 10kW) is impressive, but unless you're very power or cooling constrained, it's simply more cost effective to deal with the extra power and cooling cost.
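To put a rough number on "deal with the extra power and cooling cost", here is a small sketch. Only the 5kW, 10kW and $150k figures come from the comment. The electricity price and the cooling multiplier are placeholder assumptions chosen purely for illustration, so plug in your own.

```python
HOURS_PER_YEAR = 24 * 365


def annual_power_cost(kilowatts, price_per_kwh, cooling_factor):
    """Yearly cost of running a constant load, including cooling overhead."""
    return kilowatts * HOURS_PER_YEAR * price_per_kwh * cooling_factor


# ASSUMPTIONS (not from the comment): adjust for your site.
PRICE_PER_KWH = 0.10    # dollars per kWh
COOLING_FACTOR = 2.0    # total facility power / IT power

# Figures from the comment.
X86_CLUSTER_KW = 10
SC1458_KW = 5
X86_CLUSTER_PRICE = 150_000

x86_cost = annual_power_cost(X86_CLUSTER_KW, PRICE_PER_KWH, COOLING_FACTOR)
sicortex_cost = annual_power_cost(SC1458_KW, PRICE_PER_KWH, COOLING_FACTOR)
yearly_savings = x86_cost - sicortex_cost
savings_fraction = yearly_savings / X86_CLUSTER_PRICE

print(f"Yearly power savings: ${yearly_savings:,.0f}")
print(f"As a share of the x86 cluster price: {savings_fraction:.1%}")
```

With these placeholder inputs the lower-power machine saves under $9k a year, a few percent of the x86 cluster's purchase price, which is consistent with the point that power savings alone rarely justify a much more expensive system.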
SiCortex hardware was more cost effective against clusters with expensive interconnects. The problem is, the people who buy clusters with expensive interconnects do so because their problem is interconnect heavy. Unfortunately, despite all of the cool CS behind SiCortex's interconnect, the fact is that it just didn't do that well against InfiniBand. That's partly because the SiCortex system has more nodes, which means that more messages have to use the interconnect. It's partly because for very small clusters, it's possible to use a single IB switch that connects every node to every other node. And it's partly because SiCortex didn't have the kind of mature hardware/software stack that someone like Mellanox has.
So, there you have it. For the problems that ran well on SiCortex hardware, you could get the same performance at dramatically lower cost using Gigabit Ethernet. For the problems that require an expensive interconnect, the SiCortex approach of "more, smaller nodes" results in dramatically more overhead compared with the "fewer, faster nodes" strategy.