There's A Cluster of 750 Raspberry Pi's at Los Alamos National Lab (insidehpc.com)
Slashdot reader overheardinpdx shares a video from the SC17 supercomputing conference where Bruce Tulloch from BitScope "describes a low-cost Raspberry Pi cluster that Los Alamos National Lab is using to simulate large-scale supercomputers." Slashdot reader mspohr describes them as "five rack-mount Bitscope Cluster Modules, each with 150 Raspberry Pi boards with integrated network switches."
With each of the 750 chips packing four cores, it offers a 3,000-core highly parallelizable platform that emulates an ARM-based supercomputer, allowing researchers to test development code without requiring a power-hungry machine at significant cost to the taxpayer. The full 750-node cluster, running 2-3 W per processor, runs at 1000W idle, 3000W at typical and 4000W at peak (with the switches) and is substantially cheaper, if also computationally a lot slower. After development using the Pi clusters, frameworks can then be ported to the larger scale supercomputers available at Los Alamos National Lab, such as Trinity and Crossroads.
BitScope's Tulloch points out the cluster is fully integrated with the network switching infrastructure at Los Alamos National Lab, and applauds the Raspberry Bi cluster as "affordable, scalable, highly parallel testbed for high-performance-computing system-software developers."
Did they make a Beowulf cluster of those?
Fuck Beta!
When somebody buys 750 all at once.
It was my experience that Pis are hard to buy, so I gave up trying to get one. Mind you, when people use an ancient Raspbian OS and make 'secure' email servers on port 26 and then get called out for issues, it is good to see that somebody is using them properly instead of poorly.
This would be an amazing side project for someone to do at home. And it wouldn't break the bank (too badly... maybe the cost of a new car). What a talking point that would be at a job interview.
As in bidirectional communication I assume!
Twinstiq, game news
Is that similar to a Raspberry Trans?
You are missing the point! The idea is not to have a supercomputer but to emulate one. Writing code for stuff like this is hard, and running it on the real deal is expensive. This way they can emulate a 750-core system at a fraction of the cost.
Apparently the point is to simulate a powerful machine with many cores, so that people can develop and optimize their code without requiring CPU time on the actual (very expensive) machine.
If construction was anything like programming, an incorrectly fitted lock would bring down the entire building...
I have 50 iPhones. Cost me under $1k
If you cannot get 1000s of slow CPUs to scale, then wasting debug time on the big fast server is really a waste. Today's programmers need to learn how it used to be. Even using RPis they have an advantage. The network is much faster than what we had 20 or 30 years ago. Internal buses are faster, RAM is faster, caches are faster. This is a smart way to spend money for a bring-up development environment on the cheap.
You get the effect of network latency to induce concurrency paradoxes that wouldn't happen on a shared memory system.
ObCarAnalogy: a single bus can move a lot of people, but if you're modeling highway traffic, you want to use many independent cars.
Not really, but it does show that there are a lot more idiots like you coming here, and a lot fewer of the people who belong here. My very first thought was "Holy shit! Something that actually belongs on Slashdot on Slashdot!" If your thought was "meh" then I have no idea why you even come here.
Guns don't kill people; Physics kills people! - John Lithgow as Dick Solomon on Third Rock From The Sun
You are missing the point! This way they can emulate a 750-core system at a fraction of the cost.
So, what point am I missing? The Xeon phi 7290 is 4k$ and has 72 cores, you can get 10 of those and get way more speed, shared memory benefit etc...
Entirely different architecture. The point of this scale model is to have a cluster of compute nodes with TCP/IP communication between them.
You get the effect of network latency to induce concurrency paradoxes that wouldn't happen on a shared memory system.
You can run MPI on a shared memory system. It will have no problem uncovering any concurrency race conditions.
Entirely different architecture. The point of this scale model is to have a cluster of compute nodes with TCP/IP communication between them.
Which is completely useless since the real machine will have a completely different interconnect.
Are they running 64-bit os? If so, they can tap into significant performance from the arm-64/NEON/SIMD/crypto instructions, etc.
So, what point am I missing? The Xeon phi 7290 is 4k$ and has 72 cores, you can get 10 of those and get way more speed, shared memory benefit etc...
The shared memory is a detractor, not a benefit, if you're trying to have something which emulates an expensive distributed architecture. The point isn't to get lots of speed, it's to get a bunch of cores distributed over a local network in order to get a cheap test bed emulation of a much larger machine.
SJW n. One who posts facts.
You don't need a supercomputer to figure out that the headline is poor usage. The Chicago Manual of Style will do that for you.
10 CPUs with 72 cores each is 720 cores.
750 SOCs with 4 cores each is 3,000 cores (and RAM and motherboards included).
The point is to have a massive number of cores in a large number of machines, to simulate a large number of machines, at the budget point. Your idea would have 75% fewer cores.
> shared memory
Yep, that's another problem with your idea. It would no longer be an accurate simulation. Well except your plan doesn't include any RAM at all. Or motherboards, networking, etc. You're going to need to buy 750 network cards to simulate 750 machines, motherboards each capable of holding 18 cards, a number of storage devices, etc. So maybe FIVE 7290 CPUs with exotic motherboards plus RAM, network cards, storage, etc. Five 7290s would provide 360 cores, vs the 3,000 cores they got with the Pis.
Now AFTER the research yields fruit, in a couple years someone might want to put the ideas into production using fifty 72-core processors which may cost $2,000 each.
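The core counts in this comparison are easy to check. A minimal sketch using only the figures quoted in the thread (the variable names are made up for illustration); note the shortfall comes out nearer 76% than the "75% fewer" quoted, which is the same ballpark:

```python
# Figures quoted in the thread; names are illustrative only.
PI_NODES = 750
CORES_PER_PI = 4
PHI_CORES = 72  # Xeon Phi 7290, per the earlier comment

pi_cores = PI_NODES * CORES_PER_PI   # cores in the Pi cluster
ten_phi_cores = 10 * PHI_CORES       # the "10 of those" proposal
five_phi_cores = 5 * PHI_CORES       # the "FIVE 7290 CPUs" estimate

# Fraction of the Pi cluster's core count that ten Phis would be missing.
shortfall = 1 - ten_phi_cores / pi_cores

print(pi_cores, ten_phi_cores, five_phi_cores)
print(f"{shortfall:.0%} fewer cores with ten Phis")
```

This only counts cores; it says nothing about per-core speed, which everyone in the thread agrees favors the Phi.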
You can run MPI on a shared memory system. It will have no problem uncovering any concurrency race conditions.
"There are no bugs in my code."
Learn how to use a fucking apostrophe
Once researchers get something running, they stop optimizing even if it is only a few times faster, no matter how much better it could be. This cluster just encourages that behavior.
pure unsubstantiated bullshit pulled straight from your ass
I would think that this could be solved more efficiently, albeit less fun, by a virtual cluster. The hardware is different enough from the real supercomputer anyway that performance benchmarking is probably out of the question.
The Chicago Manual of Style will do that for you.
Maybe you need a supercomputer to figure out that books don't do any "figuring out"
This happens so often that I think we need a new mod:
Score: -1 Wrong topic, you idiot
#DeleteFacebook
Here you can unplug a node to simulate a hardware failure. The latency between nodes is more real-world. Cache levels (L1, L2, RAM) and hardware levels (NIC, bridge, CPU) are more similar. It's a cheap approximation. Leave it at that.
"Let's test this freeway system at small scale"
"Nah let's take the airplane, no need to test it in the real world"
Your continuing statements are not relevant.
Your analogies have zero relevance to how HPCs work.
Not useless if you're debugging queue systems, schedulers etc.
ROFL. We're not talking about debugging the scientific number crunching code that will run on the actual cluster, but the cluster management software. The actual jobs to run may very well just be doing sleep(10000*rand()); if rand() < 0.1 call WriteAllTheDiskSpace; else if rand() < 0.2 then call segfault_horribly(); else return SUCCESS; etc. One should probably add in a few more "bad things", MPI calls, etc.
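For anyone who wants to try that kind of placeholder job, here is a rough Python sketch of the idea. It is an assumption-laden illustration, not anything Los Alamos is known to run: `fake_job` and the outcome labels are hypothetical names, the sleep is assumed to be in seconds, and the "bad things" are only reported as labels rather than actually filling a disk or crashing.

```python
import random
import time

def fake_job(rng=random, max_sleep_s=10000.0, sleep=time.sleep):
    """Pretend to compute, then report how the job 'ended'.

    Returns a label instead of really filling a disk or crashing.
    """
    sleep(max_sleep_s * rng.random())  # stand-in for the number crunching
    if rng.random() < 0.1:
        return "DISK_FULL"   # placeholder for WriteAllTheDiskSpace
    elif rng.random() < 0.2:
        return "SEGFAULT"    # placeholder for segfault_horribly()
    return "SUCCESS"

if __name__ == "__main__":
    # Skip the real sleep so the demo finishes immediately.
    outcomes = [fake_job(random.Random(i), sleep=lambda s: None)
                for i in range(1000)]
    for label in ("DISK_FULL", "SEGFAULT", "SUCCESS"):
        print(label, outcomes.count(label))
```

A scheduler under test would submit many of these and be checked on how it handles the mix of outcomes; swapping the labels for real failures is left to whoever owns the test cluster.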
You could do all that by setting up a bunch of VMs on a shared memory machine. This cluster serves zero purpose.
Your analogies have zero relevance to how HPCs work.
They are relevant. What is zero is your willingness to learn, or at least accept that other people know what they are doing.
No they don’t. Morons like you have no idea how super computers actually work.
The only difference between this cluster and a shared memory machine is that the shared memory machine will use less energy and will be useful for actual scientific work not just your idiotic debugging scenarios.
1440 is NOT 3000!!! AND
your power budget went to
hell in a hand basket.
One other thing, you're being
a DICKWAD.
No, a single shared memory device will use less energy than a cluster plus the networking equipment needed to run it. Plus you can run more than one process on a physical or virtual core. So you will have no problem simulating a job requiring 10-100 times the number of physical cores.
Well, technically they could just run their nodes as a bunch of virtual machines and save money as well as gain performance. That would take out the networking part of the equation though, so it wouldn't be quite the same. On the other hand, if you're running tests where networking and latencies might matter, I think using RPis is a bit dubious considering you get 100Mbps tops over a pretty badly congested USB2 bus.
Presumably they know what they are doing, but I think they could have found something more suitable than the Pi and its yucky networking.
The only difference between a cluster and a shared memory machine is you pay by the minute on the cluster. If you haven't realized that consumption fee is the same whether you are debugging or doing actual science, then you are the moron who can't see the point of using a less powerful, less costly cluster to do mockups.
Well then get an F250. It has the capacity to haul thousands of pies worth in traffic. With dealer incentives you can get them for under fifty grand.
Plus you can haul a decent boat. Granted it doesn't solve the goal of the scientists, but the other "xeon" poster doesn't understand that, so suggest a good truck instead
Or the Ford.
Wow you are a stupid moron. You can set up a queue system that charges by the minute on a shared memory machine. By contrast, I have run scientific codes on clusters that do not charge by the minute.
Try again dumbass.
I think that other AC is at a level of idiocy simulating creamy dumpty.
All of which you can do on a shared memory machine with the added bonus that the shared machine is useful for actual scientific work.
Keep making moronic analogies dumbass.
The RPi modules are 4-core, so the cluster is 3000 cores.
Interconnects don't matter much. Whether you use InfiniBand, GigE or serial, you're just pumping TCP packets.
Erm, wrong.
Nice apostrophe, bro!
I don't think it is acceptable to make an understandable and relevant car analogy.
Go swallow your own cock.
I wonder how many Commodore 64's I could emulate at once on a decent sized workstation. Maybe 500?
“Common sense is not so common.” — Voltaire
You could do all that by setting up a bunch of VMs on a shared memory machine.
That will give you different latencies and different bottlenecks. The point of this system is not to crunch data, but to serve as a testbed for parallel software development. It is possible that they also use VMs, but that would be in addition to this cluster rather than a replacement.
Good luck buying another 3000-core computer that only uses 4000 W with that kind of cash. Yes, the point here is many cores and low power for simulating the really badass systems, not raw single-core performance of the dev system.
Or you could use a single Xeon Phi, emulate the 750 Raspberry Pi's and the networking and still consume less power with more performance for a lower price.
1kW at idle is a lot. You could cut that down by shutting down Pis in banks as they went unused, and firing them up again as needed. It wouldn't require very much more hardware, just some microrelay boards which can be driven by some of the Pis themselves.
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
There's A Cluster of 750 Raspberry Pi's at Los Alamos National Lab
I saw a bunch of them at the grocery store before Thanksgiving, next to the apple ones.
It must have been something you assimilated. . . .
That's not how InfiniBand works.
The point of this system is not to crunch data, but to serve as a testbed for parallel software development.
A shared memory machine makes a far better test bed for that purpose.
Emulating the cores would falsify what they are testing since this would reduce a lot of possible race conditions (among other things). Virtualization is nice but it's not an end all solution.
Purely out of academic interest, how fast is this thing? How does it compete with, say, a 16 core Xeon or Threadripper workstation?
The dangers of excessive individualism are nothing compared to the oppressiveness of excessive collectivism
Emulating the cores would falsify what they are testing since this would reduce a lot of possible race conditions.
No it wouldn’t. It would make race conditions more likely to trigger.
You do get four threads per core, so 11 chips will suffice to get to the 3000 threads. Next you need to build a delay system to simulate the various interconnect types and disable cache coherence between those cores, or threads as needed. That's what I would imagine it taking, at least. And Occam, lots of Occam.
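The "11 chips" figure checks out if you take the 72-core Xeon Phi 7290 mentioned earlier in the thread and four hardware threads per core. A quick sketch (constant names are illustrative, and one thread per Pi core is an assumption about what "3000 threads" means):

```python
import math

CORES_PER_PHI = 72        # Xeon Phi 7290, as quoted earlier in the thread
THREADS_PER_CORE = 4      # the "four threads per core" in the comment
TARGET_THREADS = 750 * 4  # assumption: one thread per Pi core

threads_per_chip = CORES_PER_PHI * THREADS_PER_CORE
chips_needed = math.ceil(TARGET_THREADS / threads_per_chip)

print(threads_per_chip, chips_needed)
```

That is thread count only; the delay system and cache-coherence tricks the comment mentions are the hard part and are not modeled here.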
A rack system with three servers of four fully loaded Phi nodes would cost at least a third more to buy, and the power consumption is probably a little higher (max 5-6kW), so this Pi system would be cheaper to buy and to operate (by one point of comparison).
The latencies and bottlenecks of this system will have zero relevance to the production computer. Again making this cluster a complete waste. Better to do it on a shared memory machine.
And running 1.2GHz 4 core STB processors over 10/100 ethernet is going to be similar to clusters of dual socket 3GHz 54 core processors with 25+ Gbps interconnects? (aka Cavium ThunderX2 CPUs. Nobody is planning on building an ARM based supercomputer with only one CPU per node, let alone with IO limited smartphone/tablet/set-top-box oriented SoCs)
He's absolutely right.
https://youtu.be/7ffj8SHrbk0
Dear God!
This will obviously be used to verify global warming, so it belongs here. Let's argue politics!
Have you read my blog lately?
Okay... This is what it's supposed to emulate.
This thing has more than nine hundred thousand processor chips and two petabytes of memory. Current x64 chips are limited to 256 TB (wikipedia) of physical address space; so either [a] these chips have a larger than usual physical address space (which I doubt), or [b] this isn't a shared memory system.
So, dumbnuts, this isn't a shared memory system. Go read about the Cray XC40. Or even this document -- clearly showing it's a multi-node system with a fast interconnect. (It talks of each node running different OS images, so that means it isn't one shared OS image - which means it isn't shared memory).
Summary: What evidence do you have that the target system is shared memory? It looks to me like it's non-shared-memory (i.e., message passing); while with an extremely fast interconnect, I'm sure it's still slower than the CPU internal busses. The same is true with this Raspberry Pi - the interconnect (ordinary Ethernet) is still significantly slower than the ARM chip itself; and THAT environment is what's being emulated.
It doesn't really matter that other architectures could be faster - the GOAL is to replicate how the Cray XC supercomputers work - albeit at a fraction of the performance and price.
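The address-space argument above can be put in numbers. A small sketch, assuming the 256 TB figure corresponds to 48-bit physical addressing and reading "two petabytes" in binary units (both are assumptions, not something the linked documents are known to state):

```python
TIB = 2**40
PIB = 2**50

addr_bits = 48                   # assumption: 48-bit physical addressing
per_image_limit = 2**addr_bits   # bytes one OS image could address
system_memory = 2 * PIB          # "two petabytes", read as binary units

print(per_image_limit // TIB)            # limit in TiB
print(system_memory // per_image_limit)  # times the limit is exceeded
```

On those assumptions the total memory is several times what a single image could address, which fits the point that the memory has to be split across nodes.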
There are some timing interconnects on the BitScopes which Bruce uses to sync the signals, reduce the processing requirements.
We've heard him speak about it here.
Have to get him talking further on that side of it.
You raang?
I am Audience.
Morons like you have no idea how super computers actually work.
Funny, I've written actual code for shared use supercomputers like the ones discussed in TFA, and yet you and your ilk are the ones that look dense and naive.
Many of these machines have an application process and you must demonstrate that you will make efficient use of your time on the machine. If your utilization is too high, you run out of time before you get the results and may have to wait a while for your turn. Underutilize or get stuck in some crash, and you can get penalized, usually being told to go wait for time on a smaller machine and to fix your shit before being allowed to apply for time again.
For research projects dealing with a limited budget of cpu hours from a grant process, "debugging" and optimization is not idiotic, and becomes quite important. It amounts to bureaucracy, but that is necessary at some scales. Calling it idiotic is on par with saying it is idiotic to have budget planning and approval paperwork for spending on a large project. At large enough scales, just winging it and the associated mistakes cost a lot of people time, which adds up to a lot more than an ounce of prevention.
The first bi-sexual supercomputer cluster?
"He explained how the whole cluster can be bootstrapped from a single Micro SD card plugged into one of the nodes and how its power consumption and cooling requirements are vastly lower than similar scale HPC." [http://cluster.bitscope.com/blog/bitscope-raspberry-pi-cluster-press-conference]
"Cluster simulations can help to some extent but in many cases real-world issues can intervene to mitigate their effectiveness"
[http://cluster.bitscope.com/motivation]
Saturday Morning Breakfast Cereal
Oh yes, true. Now why on earth didn't the folks at Los Alamos think of that!? They must be complete idiots. You should write to them and explain your ideas - they might give you a job as their chief architect, or maybe their Head of Cost Cutting.
I think this was also on slashdot last year:
https://robertmcgrath.wordpress.com/tag/the-megaprocessor-laughs-at-your-puny-integrated-circuits-stephen-cass/
Tracy Johnson
Old fashioned text games hosted below:
http://empire.openmpe.com/
BT
moderation is for assholes , but ...
here you don't get moderated, you get rated ...
one might say you seem to be over reacting a bit
but thats fine, i divide my days between standard and less bad too, if this is your biggest problem today then it can't be that bad its not like you get money or anything for it, right ?
i like the "green computing" approach here btw
Free speech was meant to be free for all... how can anyone grow up in a nanny state ?
Yes, I'm sure all the folks at Los Alamos are far stupider than you, Anonymous Coward.
They are testing how their software scales to a massive amount of cores. This you cannot do on a single Xeon Phi. The speed and available bandwidth is irrelevant for that, it is of course relevant for other test cases but that is not what they test here.
No, because if a single core emulated 10 other cores, there would never be a situation where those 10 cores execute an instruction all at the same time. The laws of physics, you know.
Someone should invent software for emulating a CPU, that way you could use one machine to emulate many.
I'd call it a virtual machine.
And you cannot (as of yet) effectively simulate the kind of massive scale out that places like this code for.