The Story Behind a Failed HPC Startup

Posted by kdawson on Tuesday November 3, 2009 @11:25AM from the build-it-and-they-will-come-if-you-don't-run-out-of-money-first dept.

jbrodkin writes "SiCortex had an idea that it thought would take the supercomputing world by storm — build the most energy-efficient HPC clusters on the planet. But the recession, and the difficulties of penetrating a market dominated by Intel-based machines, proved to be too much for the company to handle. SiCortex ended up folding earlier this year, and its story may be a cautionary tale for startups trying to bring innovation to the supercomputing industry."

Fool's errand by Locke2005 · 2009-11-03 11:40 · Score: 3, Insightful

In a blog post after SiCortex shut down, Reilly says he believes there is still room for non-x86 machines in the HPC market. He is wrong. Much more money is being spent every year on improving x86 chips than on all the competitors combined. Basing a supercomputer on MIPS was short-sighted; even if it offers a price/performance or power/performance advantage now, in a couple years it won't, because x86 is being improved at a much faster rate. Where is Sequent now? The only way to build a successful desktop HPC company is to be able to do system design turns as fast as new x86 generations come out and ship soon after the new CPUs become widely available, e.g. a complete new product every 6 months. That requires partnership with either Intel or AMD, not use of a MIPS chip that no one is spending R&D resources on anymore.

--
I've abandoned my search for truth; now I'm just looking for some useful delusions.
The fanciest-sounding solution ... by Wrath0fb0b · 2009-11-03 11:51 · Score: 5, Interesting

... is almost always wrong. As one of the principals on a large-ish (not large by world standards, 1000 cores, mainly Nehalem so approximately 100 GFLOPS) cluster, I've been very pleased that we've done things as simply as possible. Sun Grid Engine and ROCKS running on commodity 1Us deliver an economical and effective solution (no, I don't work for Sun).
Most importantly, the environment does not unduly restrict what kind of compute jobs can be run. If it can be compiled on *nix, we can probably run it. We lose to specialized hardware (GPU-based, Cell-based, ... ) in raw throughput but we make up for it in both initial price and ease of deployment. We don't even have a dedicated admin for the cluster -- we had one to set it up and he did such a good job we haven't needed to hire a replacement!
Ultimately, I feel like it's not worth paying extra in hardware and software-dev costs to save a few dollars on cooling and power. Sure, you get the credibility of running a "green" cluster (never mind that you have to pay to feed and house those extra developers, which should legitimately come out of your carbon budget) but you end up with a far less useful product.
Long Live X86(_64)!
Re:Low Power Supercomputer by Wrath0fb0b · 2009-11-03 12:03 · Score: 3, Insightful

Why not use something based on the Atom chip, but massively parallel?

You are probably one of those guys that thinks that if you can get 36 women working together on making a baby, it will be ready in 1 week.
Not all problems can scale out to many cpus (or wombs, for that matter). Threading overhead, network latency/bandwidth, mutual exclusion (or the overhead on atomic data types) all conspire to defeat attempts to scale. This is, of course, if your problem is one that is even amenable to straightforward parallelization in the first place -- many problems (for instance, lattice simulations of Monte Carlo) are excruciating to scale to even 2 cpus.
In my own (informal) tests on our HPC cluster (x64, Linux, see my post above for details), I concluded that you need to be able to discretize your work into independent (and NONBLOCKING) chunks of ~5ms in order to make spawning a pthread worth it. Of course, "worth it" is a relative term -- some people would be glad to double the cpu-time required for a 25% reduction in wall-clock time while others might not, so I'll concede that my measurement is biased. IIRC, I required a net efficiency (versus the single-core version) of no worse than 85%, i.e. spend less than 15% of your cpu-time dealing with thread overhead or waiting for a mutex. This was for 8 cores on the same motherboard, by the way; if you are spawning MPI jobs over a network socket, expect much, much worse.
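The 85% threshold above can be turned into a rough rule of thumb. What follows is a minimal sketch, not the test described in the comment: it assumes a fixed overhead per chunk (spawn, join, lock waits), and the function names `parallel_efficiency`, `chunk_efficiency` and `min_chunk_ms` are invented for illustration.

```python
def parallel_efficiency(t_single, t_wall, cores):
    """Single-core run time divided by the total CPU time of the parallel run."""
    return t_single / (cores * t_wall)


def chunk_efficiency(work_ms, overhead_ms):
    """Fraction of a chunk's time spent on useful work (assumed model)."""
    return work_ms / (work_ms + overhead_ms)


def min_chunk_ms(overhead_ms, target):
    """Smallest chunk of work that still meets the target efficiency.

    Solves work / (work + overhead) >= target for work.
    """
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    return overhead_ms * target / (1.0 - target)


if __name__ == "__main__":
    # Per-chunk overhead implied by 5 ms chunks at the 85% threshold.
    implied_overhead_ms = 5.0 * (1.0 - 0.85) / 0.85
    print(f"implied overhead per chunk: {implied_overhead_ms:.2f} ms")

    # If the overhead were 10x larger (hypothetical, e.g. a slow network hop),
    # chunks would have to grow by the same factor to stay at 85%.
    print(f"chunk needed at 10x overhead: "
          f"{min_chunk_ms(10 * implied_overhead_ms, 0.85):.1f} ms")
```

Under this model, 5 ms chunks at 85% efficiency imply a bit under 1 ms of overhead per chunk, and the minimum chunk size grows linearly with the overhead. That is consistent with the warning that jobs spread over a network need much coarser work units than threads on one motherboard.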
Re:Lesson learned by jd · 2009-11-03 14:19 · Score: 3, Insightful

Having worked in one HPC startup (Lightfleet), I can say that one of the biggest dangers any startup faces is its own management. Good ideas don't make themselves into good products or turn themselves into good profits by selling. Good ideas don't even make it easier - you only have to look at how many products are both defective by design AND sell astronomically well to see that.
I can't speak for SiCortex's case, but it looks to me like they had a great idea but lacked the support system needed to get very far in the market. It's not a unique story - Inmos didn't fail on technological grounds. Transmeta probably didn't, either.
Really, it would be great if there could be some effort into examining the inventions of the past to see what ideas are worth trying to recreate. For example, would there be any value in Content Addressable Memory? Cray got an MPI stack into RAM, but could some sort of hardware message-passing be useful in general? Although SCI and Infiniband are not extinct, they're not prospering too well either - could they be redone in a way that didn't hurt performance but did bring them into the mass market?
Then, there's all sorts of ideas that have died (or are dying - Netcraft confirms it) that probably should be dead. Bulk Synchronous Processing is fading, distributed shared memory is now only available in spiritualist workshops, CORBA was mortally wounded by its own specification committee and parallel languages like PARLOG and UPC are not running rampant even though there are huge problems with getting programs to run well on SMP and/or multicore systems.

--
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
Re:The fanciest-sounding solution ... by Gorobei · 2009-11-03 14:41 · Score: 3, Interesting

Exactly right. I've got >10K cores and >10M LOC. "Hardware fault" typically means a datacenter caught fire, or was flooded, or an undersea cable got cut.
If someone pitches a cheaper solution (e.g. power savings), I'm happy to listen for 10 minutes. Then I just want to know how fast I can see results: a dev costs $50K/month here, so I'll give it a week or two. If you don't have a test farm ready to go with full compilers, a data security plan, etc., I'm going to just reject. If you can get traction with universities, great, come back and pitch again in a year.
Did you nerds read the article or the links? by labradore · 2009-11-03 15:44 · Score: 4, Insightful

They were ahead of schedule to profitability. They lost funding for the next-gen equipment development because one of their VCs was overextended (read: losing too much money on other risky ventures) and decided to pull out. The risk with a company like that may be high but once you get enough profitability, you can fund further product development internally. They had sold about twenty $1.5M machines in about a year's time on the market. They said they were about 1.5 years to profitability, so I'm guessing that they were expecting to sell another 75 or 100 top-end machines to get to break-even. At that rate, they were probably spending less than $20M a year on development. I'm guessing that they burned up $100M to get where they got. In the overall scheme of things, that's not a big bet. If they managed to develop 20- to 50-thousand-node machines and increase the output per core within 3 years, that is something that would have been able to do more than fill a niche. They probably would have developed some game-changing technology in the bargain. Stuff that Intel and Google might just be interested in.
To be clear: this was not a failure due to the economics of competing against Intel/x86. This was a failure due to not being lucky. It takes sustained funding to make your way from start-up to profit in most technical businesses. HPC is more technical and thus more expensive than most.
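The back-of-the-envelope numbers in this comment can be checked with a few lines of arithmetic. This is only a sketch: every input is one of the guesses quoted above, not a SiCortex figure, and the names (`revenue_musd`, `required_rate`, the constants) are made up for the example.

```python
# All inputs are the commenter's estimates, in millions of US dollars.
MACHINES_SOLD = 20           # "about twenty" machines in about a year
PRICE_MUSD = 1.5             # "$1.5M machines"
YEARS_TO_PROFIT = 1.5        # "about 1.5 years to profitability"
EXTRA_MACHINES = (75, 100)   # "another 75 or 100 top-end machines"


def revenue_musd(machines, price_musd=PRICE_MUSD):
    """Revenue in $M from selling the given number of machines."""
    return machines * price_musd


def required_rate(machines, years=YEARS_TO_PROFIT):
    """Machines per year needed to sell that many within the given time."""
    return machines / years


if __name__ == "__main__":
    print(f"revenue so far: ${revenue_musd(MACHINES_SOLD):.1f}M")
    for n in EXTRA_MACHINES:
        print(f"{n} more machines: ${revenue_musd(n):.1f}M, "
              f"or {required_rate(n):.0f} machines per year")
```

On these guesses, first-year sales come to about $30M, and reaching break-even in 1.5 years would take roughly 50 to 67 machines a year.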
Re:Lesson learned by Jacques Chester · 2009-11-03 16:55 · Score: 4, Insightful

They didn't die because their customers abandoned them for something cheaper. They died because they had a cashflow crisis due to investors pulling out of a planned round of fundraising. They had millions of dollars of sales in the pipeline.
The lesson isn't "Don't compete with Intel", it's "When you run out of money, you're out of business". Or perhaps, "The financial crisis killed lots of otherwise sound businesses". Luck, as the OP pointed out, played a large part.

--
Classical Liberalism: All your base are belong to you.
Re:GPU's? by peawee03 · 2009-11-03 17:26 · Score: 3, Informative

Currently, Teslas are the single-precision future. All my work is in double precision (64-bit), which is where most GPUs are much, much slower. IIRC, the next-generation GPUs are going to have respectable double-precision performance, but they're way down the road; hopefully I'll have moved on to a job where it doesn't matter by then. Hell, I consider it a victory when I've gotten a code translated from FORTRAN 77 to Fortran 95. GPUs? I'll wait until next decade. More normal cores are low-hanging fruit I can use with any MPI code *now*.

--
I wish I could write clever and witty sigs.