Bill+Barth · Slashdot Mirror

Re:Hardly A New Problem on Supercomputers' Growing Resilience Problems · 2012-11-23 04:22 · Score: 1

Suppose that your job is computing along using about 3/4ths of the memory on node 146 (and every other node in your job) when that node's 4th DIMM dies and the whole node hangs or powers off. Where should the data that was in the memory of node 146 come from in order to be migrated to node 187?

There are a couple of usual options: 1) a checkpoint file written to disk earlier, 2) a hot spare node that held the same data but wasn't fully participating in the process.

Option 2 basically means that you throw away half of your compute resources in case there is a failure. Almost no computations being done today for scientific research are valuable enough to warrant this approach. Some version of Option 1 (periodic checkpointing and restarting) is always more cost effective. These systems, in the US at least, are generally over-requested by the science community by a factor of 2 to 10. Taking half away to seamlessly prevent an occasional job death just isn't worth the lost opportunity to more fully utilize the resources.
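A rough back-of-the-envelope version of that comparison, as a sketch only. The cost model and every number in it are hypothetical assumptions for illustration, not measurements from any real system: it treats failures as rare and charges each one half a cycle of rework plus a restart.

```python
def useful_fraction_hot_spare():
    # Every working node is mirrored by a spare, so half the machine
    # does no new work.
    return 0.5


def useful_fraction_checkpoint(interval_h, ckpt_cost_h, restart_cost_h, mtbf_h):
    # Share of each compute-plus-checkpoint cycle spent on real work.
    compute_share = interval_h / (interval_h + ckpt_cost_h)
    # Time lost per failure: on average half a cycle of rework, plus the
    # restart. First-order estimate; assumes failures are rare per cycle.
    lost_per_failure_h = (interval_h + ckpt_cost_h) / 2 + restart_cost_h
    return compute_share * (1 - lost_per_failure_h / mtbf_h)


# Hypothetical numbers: checkpoint hourly, 5 minutes per checkpoint,
# 10 minutes to restart, one job-killing failure per day.
ckpt = useful_fraction_checkpoint(1.0, 5 / 60, 10 / 60, 24.0)
print(f"hot spare: {useful_fraction_hot_spare():.2f}, checkpointing: {ckpt:.2f}")
```

Under these made-up inputs the checkpointing job keeps well over half of the machine doing useful work, which is the point of the argument. Different inputs will move the number, so plug in your own.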

Option 1 implies taking some time away from your job to do the checkpointing. The vast majority of the time, some sort of OS-level automated checkpointing would be overkill as well. The author of the code knows better when it's a good time to checkpoint and when it's a bad idea. E.g., you might consider checkpointing at a phase of the calculation when the data volume required to restore the state is at a minimum, even if that means losing some part of a future calculation. Generally the calculation is cheap to redo, while checkpoints of large volumes of data are expensive.
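Here is a minimal sketch of application-level checkpointing in that spirit. It is a toy, not how any particular HPC code does it: the `run` function, the JSON file format, and the fake "phase" of work are all assumptions made up for illustration. The point is that the code checkpoints between phases, when only a small state is live, and simply redoes the interrupted phase after a restart.

```python
import json
import os


def run(total_steps, checkpoint_every, path):
    """Toy time-stepping loop with application-level checkpoint/restart."""
    # Resume from the last checkpoint if one exists; otherwise start fresh.
    if os.path.exists(path):
        with open(path) as f:
            state = json.load(f)
    else:
        state = {"step": 0, "value": 0.0}

    while state["step"] < total_steps:
        # One phase of work. The large scratch data lives only inside the
        # phase and is never written to the checkpoint.
        scratch = [state["value"] + i for i in range(1000)]
        state["value"] = sum(scratch) / len(scratch)
        state["step"] += 1

        # Checkpoint between phases, when only the small state is live.
        if state["step"] % checkpoint_every == 0:
            tmp = path + ".tmp"
            with open(tmp, "w") as f:
                json.dump(state, f)
            # Rename into place so a crash mid-write cannot leave a
            # half-written checkpoint under the real name.
            os.replace(tmp, path)
    return state
```

If the job dies at step 5 with a checkpoint every 2 steps, the restart picks up from step 4 and redoes one step. That lost step is the price of not writing the scratch data to disk.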

In addition, OS-level checkpointing is a hard problem. E.g. if there are messages in flight on the network, do you try to log them and be able to restart them, or do you only checkpoint when the network is quiet? If the network is never quiet on all the nodes in the job, do you throw in needless synchronization that could ruin the parallel efficiency of the job in order to find a place to do your automated checkpoint? If you decide to log instead, where do you write the log data so that it survives a catastrophic failure of any node, and what's the cost of doing it?

If these were just a bunch of VMs running a LAMP stack, this wouldn't be a hard problem. That's basically solved already. Migrating tasks for HPC jobs is truly a hard problem with tradeoffs to be considered.

Re:So on Linux Foundation Offers Solution for UEFI Secure Boot · 2012-10-12 02:20 · Score: 1

Which is great unless you have 5000 nodes that you need to PXE.

Re:I find this depressing on Neutrino-Powered Financial Trading In Our Future? · 2012-08-05 17:12 · Score: 1

It hasn't really done much for HPC. It has made a dent in low-latency Ethernet, but that doesn't get much use in high-end HPC (that's IB and the Cray and BlueGene networks, FWIW).

Re:Similar work exists on UK Universities Launch Cloud Supercomputer For Hire · 2012-07-02 10:44 · Score: 1

It's nothing like TG. TG systems basically gave all their cycles away for free through the work of the Resource Allocation Committee--a peer-review body that met quarterly to review proposals and give out allocations of time. This work continues through the XD program under the auspices of XSEDE.

Re:There is not even a way to remove it! on Facebook Says Your Email Is @Facebook · 2012-06-26 00:56 · Score: 5, Informative

You can't get rid of the address, but you can make it so that no one sees it. You can also display to whomever you like whatever address you like. The settings updates you have to make are pretty straightforward.

Re:Impressive engineering feat on Gamera II Team Smashes Previous Best Human-Powered Helicopter Flight Time · 2012-06-25 14:39 · Score: 1

The contest doesn't require control, and given the power requirements, it's unlikely that even the best cyclists will ever fly one of these around the countryside.

Re:battery life on Linaro Tweaks Speed Up Android, By Up To 100 Percent · 2012-06-10 06:13 · Score: 3, Interesting

That's not guaranteed at all. The power consumption of a CPU is a function of a huge variety of things. It's possible that while the run time is shorter, the power draw is higher--possibly more than proportionally higher.
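To put numbers on it: energy is average power times run time, so a shorter run only saves battery if the power draw doesn't rise by more than the time falls. The wattages and times below are hypothetical, picked only to show the crossover.

```python
def energy_joules(avg_power_watts, runtime_seconds):
    # Energy drawn from the battery over the run.
    return avg_power_watts * runtime_seconds


# Hypothetical figures for illustration only.
baseline = energy_joules(2.0, 10.0)  # 2 W for 10 s
faster = energy_joules(5.0, 5.0)     # half the time, but 2.5x the power

print(baseline, faster)
```

In this made-up case the "twice as fast" run uses more energy than the slow one, because the power went up more than proportionally.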

Re:Clarification, as I live here and study there. on RMS Robbed of Passport and Other Belongings In Argentina · 2012-06-10 02:55 · Score: 2

Nobody checks your ID when you go to class in the US either, though there's much less of a culture of people just showing up and listening in. It would often, but not always, be easier to detect a stranger in a class here, though there are plenty of 500-person freshman biology lectures, too. Typical classes have ~30 people in them.

Re:Go has some good ideas on Go Version 1 Released · 2012-03-29 09:17 · Score: 1

You know that Emacs has been able to let you use the tab key to insert the correct number of spaces for the current line for probably 30 years, right? I suspect that Vim can do it, too.

Re:56 gigabit InfiniBand on 10-Petaflops Supercomputer Being Built For Open Science Community · 2011-09-23 09:22 · Score: 1

The 36-port part is the ASIC. The switch boxes have a lot more ports.

Re:Why did IBM do this, and what next for NCSA? on NCSA and IBM Part Ways Over Blue Waters · 2011-08-09 02:42 · Score: 1

Not that it changes your argument, but you should know that NCSA has a brand new Altix.

Re:Typical on NCSA and IBM Part Ways Over Blue Waters · 2011-08-09 02:28 · Score: 1

It appears to be the latter. The spec is available here. NCSA negotiated a system with IBM, proposed it to NSF under the above-linked RFP, went through a peer-reviewed awards process, negotiated an award with NSF, and started working on the delivery and other aspects with IBM and NCSA's other partners. Something went wrong in the last several months, and IBM's pullout was the result. I doubt that there is any more money to be found, and all parties knew what was asked of them in order for the project to be successful.

Re:"High level" programming environment? Sigh. on Cray Unveils Its First GPU Supercomputer · 2011-05-24 13:56 · Score: 1

Have you tried it off-node?

Re:Rack density on A Closer Look At Immersion Cooling For the Data Center · 2011-04-13 02:09 · Score: 1

Have you tried fitting a water block in a blade lately? How about 2000 of them? :)

Re:Rack density on A Closer Look At Immersion Cooling For the Data Center · 2011-04-13 00:23 · Score: 1

Given that you can lay two of these racks back to back and then run them end to end, and that you can remove most of the regular AC equipment from your room, the amount of stuff you can get in your datacenter is the same.

Re:"Responsive and trusted" on Google Scares Aussie Banks · 2010-11-08 02:16 · Score: 2, Informative

It's 3% on my Citibank card here in the US. That's the normal rate for US-based banks for foreign transactions, BTW.

Re:Seems inefficient on Supercomputing, There's an App For That · 2010-08-19 00:23 · Score: 2, Insightful

You don't understand how this works. You do the computation ahead of time on the supercomputer to build your reduced-order model, which you download onto your phone and take out into the field. Once you've downloaded the model, you don't need the supercomputer any more. You can use the phone to do computations using the reduced model as much as you like. If you get into a regime where the predicted error from the reduced-order model is too high, you can go back to the supercomputer and update the model. If that happens, then you'll probably have to wait in queue again, but that's not such a big deal.

Re:Use databases! on How Do You Organize Your Experimental Data? · 2010-08-16 01:51 · Score: 1

You might find this interesting.

Re:Compute Hours? on NSF Gives Supercomputer Time For 3-D Model of Spill · 2010-05-27 00:52 · Score: 1

Unless we've all got InfiniBand between us, it's not really worth it. These simulations are tightly coupled across the nodes they run on, which makes them very sensitive to the latency and bandwidth between those nodes.

Re:Compute Hours? on NSF Gives Supercomputer Time For 3-D Model of Spill · 2010-05-26 09:51 · Score: 1

I should have pointed out above that we measure in wall-clock time, not CPU time. Most of these codes don't spend much time waiting on I/O, so the two numbers are usually close. We use wall-clock time because that is the time that the user monopolizes the nodes that are assigned to their job.
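If the distinction isn't familiar, here's a small sketch of it on a single process. The `measure` helper and the two toy workloads are made up for illustration; `time.perf_counter` and `time.process_time` are the standard-library clocks for elapsed time and CPU time respectively, and the sleep stands in for waiting on I/O.

```python
import time


def measure(fn):
    """Return (wall_seconds, cpu_seconds) spent running fn."""
    wall0, cpu0 = time.perf_counter(), time.process_time()
    fn()
    return time.perf_counter() - wall0, time.process_time() - cpu0


def compute_bound():
    # Pure arithmetic: wall-clock and CPU time should be close.
    sum(i * i for i in range(200_000))


def io_bound():
    # Stand-in for waiting on I/O: the clock runs, the CPU mostly doesn't.
    time.sleep(0.2)


for name, fn in (("compute", compute_bound), ("io wait", io_bound)):
    wall, cpu = measure(fn)
    print(f"{name}: wall={wall:.3f}s cpu={cpu:.3f}s")
```

For the compute-bound case the two numbers track each other, which is the usual situation for these codes. For the waiting case they diverge, and the nodes are tied up either way, which is why the charge is on wall-clock time.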

Re:In Time? on NSF Gives Supercomputer Time For 3-D Model of Spill · 2010-05-26 09:01 · Score: 3, Informative

The code in question (ADCIRC) has been used for years to do hurricane storm surge simulations. It's being continuously developed for work in the Gulf of Mexico and already includes contaminant transport effects. Also, as with all things scientific, "right" is a relative quantity. The better question is whether or not useful predictions can be made that are better than what's been done so far. I think the answer to that is a resounding "Yes!" Finally, I guarantee that this event will be used by modelers to refine and improve their codes for years to come. Recent hurricanes (Ike, Rita, Gustav, etc.) have been used in the very same way.

Re:Good for initial estimates, that is all. on NSF Gives Supercomputer Time For 3-D Model of Spill · 2010-05-26 08:56 · Score: 1

The model itself can get quite close to shore (much closer than the pixel-level resolution of that tiny map you linked to), includes wetting and drying of land regions, and has been used to predict (in both forecasting and hindcasting modes) hurricane storm surge.

Re:Compute Hours? on NSF Gives Supercomputer Time For 3-D Model of Spill · 2010-05-26 08:48 · Score: 1

The simulation is currently running on about 4k cores, which makes the allocation worth about 244 hours, or 10 days, of simulation. Each of the simulations they're running is about 10 hours in length, so this is enough for about 24 forecasts.
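The arithmetic, spelled out as a quick sketch. The only assumption is reading "about 4k cores" as 4096; the 244 hours and the 10-hour forecast length are the figures above, and the core-hour total is derived from them rather than quoted from anywhere.

```python
cores = 4096          # assumption: "about 4k cores" read as 4096
run_hours = 244       # wall-clock hours available at that core count
forecast_hours = 10   # approximate length of one forecast run

core_hours = cores * run_hours           # implied total in core-hours
days = run_hours / 24                    # the same time expressed in days
forecasts = run_hours // forecast_hours  # whole forecasts that fit

print(core_hours, round(days, 1), forecasts)
```

That comes out to just under a million core-hours, a bit over 10 days, and 24 whole forecasts.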

Re:Python is hard too on Should Undergraduates Be Taught Fortran? · 2009-06-11 01:26 · Score: 2, Insightful

End users rarely pay for supercomputer time. Very rarely. There's an application process and peer review in most cases, but the time is, in the end, free.

Re:Give back class As on Millions of Internet Addresses Are Lying Idle · 2008-10-15 05:41 · Score: 2, Informative

Isn't this what DHCP is for? I'm a little surprised you have 25k boxes come in via a merger with static addresses.

Comments · 115