ARM Chips Designed For 480-Core Servers
angry tapir writes "Calxeda revealed initial details about its first ARM-based server chip, designed to let companies build low-power servers with up to 480 cores. The Calxeda chip is built on a quad-core ARM processor, and low-power servers could have 120 ARM processing nodes in a 2U box. The chips will be based on ARM's Cortex-A9 processor architecture."
It'll likely cost an ARM and a leg.
Have a beowulf cluster of cell phones.
Take the cheese to sickbay, the doctor should see it as soon as possible - B'Elanna Torres, "Learning Curve"
When you start piling all you can onto a chip, the power consumption is naturally going to creep up. Once you reach a certain threshold of x chips, you lose the benefit of ARM being "low-power." Am I wrong?
Right now I'm running an Intel D510 rack server with dual 2.5" drives, it's great, does a lovely job even with it running Ubuntu 10.04 server + VirtualBox ( Ubuntu 8.04 LTS ), however, I'd dearly love to shift over to something even more low-power/compact/SOC, so long as it has SATA, Ethernet, USB and runs a debian-based distro I'd be happy.
Something like a dual-core ARM machine would be ample for the server loads I'm seeing.
So, anyone seen anything like that yet? Or even just a MB in Mini-ITX ?
(btw, why is it that Intel HT enabled still seems to cause random hangs... or maybe it's just coincidental).
Do many websites need a 64-bit memory range? I don't think so. Big database servers and the like, yes, but I doubt many website servers do.
I think you would have more luck over at ExpertSexchange.
Try titling your post 'Urgent: I password-protected my 1TB porn collection and I forgot my p/w'.
Yes, they do. First, if you're hosting a single website on a single server then you'll probably want to install more than 4GB just because RAM is so cheap now. And you'll inevitably use it (for databases, file cache, etc.). If you're hosting multiple sites on a single server, then you DEFINITELY need more than 4GB of RAM per server (as it's going to be the limiting component).
Maybe ARM is justified for large Google-style server farms doing specialized work which does not require great amounts of RAM.
Another 160 and that should be enough for anybody!
Confucius say, "Find worm in apple - bad. Find half a worm - worse."
ARM's Large Physical Address Extension (LPAE) allows access to up to 1TB of memory. While I doubt applications will use this, it will allow each virtualized host on the server to use 4GB of memory.
It couldn't be an SMP machine though, not with so many cores.
My bet would be that each of the 120 nodes actually is a complete computer with 4 cores and its own memory - linked to the other 119 only via Ethernet. In this arrangement the 32-bit memory limit is not such a big issue. Each individual machine will not be particularly powerful anyway.
You're an immobile computer, remember?
Even programs that you wouldn't expect to need much memory often benefit heavily, as any modern desktop or server OS uses free RAM for disk caching. Adding more memory means fewer slow, slow disk reads are needed.
How about a link to this rant, if you want us to read it? And, if you've got a problem with PAE-like extensions, then I presume you're aware that both Intel's and AMD's virtualisation extensions use PAE-like addressing?
All that PAE and LPAE do is decouple the size of the physical and virtual address spaces. This is a fairly trivial extension to existing virtual memory schemes. On any modern system, there is some mechanism for mapping from virtual to physical pages, so each application sees a 4GB private address space (on a 32-bit system) and the pages that it uses are mapped to frames of physical memory. With PAE / LPAE, the only difference is that this mapping now lets you map to a larger physical address space - for example, 32-bit virtual to 36-bit physical. You see exactly the opposite of this on almost all 64-bit platforms, where you have a 64-bit virtual address space but only a 40- or 48-bit physical address space.
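To make the "decoupling" concrete, here is a minimal sketch of a page-table lookup where the virtual address is 32 bits wide but the resulting physical address is allowed 36 bits. The 4 KiB page size, the dictionary standing in for the page table, and the specific frame number are all assumptions for illustration, not how any particular MMU stores its tables.

```python
PAGE_SHIFT = 12                      # assume 4 KiB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1
VIRT_BITS = 32                       # what the application sees
PHYS_BITS = 36                       # what the memory bus can address


def translate(page_table, vaddr):
    """Map a 32-bit virtual address to a (possibly wider) physical address."""
    if not 0 <= vaddr < (1 << VIRT_BITS):
        raise ValueError("virtual address does not fit in 32 bits")
    vpn = vaddr >> PAGE_SHIFT        # virtual page number
    offset = vaddr & PAGE_MASK       # byte offset inside the page
    frame = page_table[vpn]          # a KeyError stands in for a page fault
    paddr = (frame << PAGE_SHIFT) | offset
    if paddr >= (1 << PHYS_BITS):
        raise ValueError("frame number exceeds the physical address width")
    return paddr


# Hypothetical mapping: virtual page 0x10 lives in a frame above the 4GB line.
table = {0x10: 0x123456}
paddr = translate(table, 0x10ABC)
print(hex(paddr))
```

The pointer the program holds never gets wider than 32 bits; only the frame numbers stored in the table do.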
The big problem with PAE was that most machines that supported it came with 32-bit peripherals and no IOMMU. This meant that the peripherals could do DMA transfers to and from the low 4GB, but not anywhere else in memory. This dramatically complicated the work that the kernel had to do, because it needed to either remap memory pages from the low 4GB and copy their contents or use bounce buffers, neither of which was good for performance (which, generally, is something that people who need more than 4GB of RAM care about).
The advantage is that you can add more physical memory without changing the ABI. Pointers remain 32 bits, and applications are each limited to 4GB of virtual address space, but you can have multiple applications all using 4GB without needing to swap. Oh, and you also get better cache usage than with a pure 64-bit ABI, because you're not using 8 bytes to store a pointer into an address space that's much smaller than 4GB.
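A rough back-of-the-envelope sketch of the cache point, assuming a hypothetical linked-list node with a 4-byte payload and two pointers, and a 64-byte cache line (neither figure comes from the thread, and real line sizes vary by CPU):

```python
import struct

CACHE_LINE = 64  # bytes; an assumption, real line sizes vary by CPU

# Hypothetical node: int payload + prev/next pointers.
# "=" means standard sizes with no implicit padding, so padding is spelled out.
node32 = struct.calcsize("=iII")      # 4 + 4 + 4 bytes with 32-bit pointers
node64 = struct.calcsize("=i4xQQ")    # 4 + 4 (padding) + 8 + 8 bytes with 64-bit pointers

per_line32 = CACHE_LINE // node32     # whole nodes that fit in one line
per_line64 = CACHE_LINE // node64

print(node32, node64, per_line32, per_line64)
```

Under those assumptions the pointer-heavy structure doubles in size with 64-bit pointers, so fewer nodes fit in each cache line.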
By the way, I just did a quick check on a few 64-bit machines that I have accounts on. Out of about 700 processes running on these systems (one laptop, two servers, one compute node), none were using more than 4GB of virtual address space.
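For anyone who wants to repeat that check, something along these lines should work on Linux. It is only a sketch: it assumes the `VmSize` field of `/proc/<pid>/status`, and the helper names are made up for this example.

```python
import glob
import re

FOUR_GB = 4 * 1024**3


def vm_size_bytes(status_text):
    """Return VmSize in bytes from the text of /proc/<pid>/status, or None."""
    m = re.search(r"^VmSize:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(m.group(1)) * 1024 if m else None


def scan_proc():
    """Return (processes_seen, processes_over_4gb) for the current machine."""
    total = over = 0
    for path in glob.glob("/proc/[0-9]*/status"):
        try:
            with open(path) as f:
                size = vm_size_bytes(f.read())
        except OSError:
            continue                  # process exited while we were scanning
        if size is None:
            continue                  # kernel threads report no VmSize
        total += 1
        over += size > FOUR_GB
    return total, over
```

Calling `scan_proc()` on a non-Linux system simply returns `(0, 0)` because the glob matches nothing.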
I am TheRaven on Soylent News
How about a link to this rant
http://blog.linuxolution.org/archives/117
Utter bollocks. I work for a data centre, and there is no way 4GB is *required* for multiple sites or anything like that. How about one server, running 20-odd Linux Jails, each with between 20-32 sites, all in 2GB.
The real question is, can anyone afford to install an oracle database on that server?
Linus' rant is about using PAE in a desktop environment, which I agree with (that's why I said that I doubt any applications will use PAE). It says nothing about virtualisation. LPAE will work just fine for VMs.
And you're posting on Slashdot, instead of flying your private jet to Japan to personally pick up debris and rescue people.
Oh right, only rich people have private jets, a lot of planes won't fly to Japan now, and even if you get a flight, unless you are currently in Japan with a car (most public transportation is down where help would be needed, and most Japanese people don't own cars), you'd have to walk to the disaster areas. You can't do anything except donate money and hope.
Grow up and learn that shit happens, and that your sheltered life can be destroyed in an instant, with little other people can do to help.
64-bit memory range? Each node is going to have its own memory slot(s). 120 cores, 4 cores per node = 30 nodes. If you plan to have less than 4GB of memory in this system, how small does each stick have to be when you plug 30 in? ~128MB. Good luck finding a bunch of DDR2/3 128MB sticks to plug into your 4GB 120-core web server. Anyway, each node needs its own local copy of the data it needs to serve up. If your web page needs ~256MB, each node is going to need the same 256MB of data duplicated, plus any extra overhead. You can't expect all 30 nodes to access the same 2-3 memory slots; that would scale like crap. This is one of the issues you get when scaling via cores. Interconnection bandwidth/latency becomes an issue and you need to use local storage to allow fully independent processing. Once you start getting up into these ranges, you're better off thinking of each node as its own computer with a fairly high speed network.
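The per-node arithmetic is quick to check. The sketch below uses the 30 nodes assumed in the comment above and, for comparison, the 120 nodes mentioned in the story summary; the function name is just for illustration.

```python
def per_node_mb(total_gb, nodes):
    """Memory per node in MB if total_gb is split evenly across nodes."""
    return total_gb * 1024 / nodes


comment_case = per_node_mb(4, 30)     # the 30-node figure used in the comment
summary_case = per_node_mb(4, 120)    # the 120 nodes quoted in the summary

print(round(comment_case, 1), round(summary_case, 1))
```

Either way the sticks would be far smaller than anything sold, which is the comment's point: the total across the box will exceed 4GB even if no single node does.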
Instead of virtualising ten servers on a single physical box, you could of course consider running a single server on a single piece of hardware again. And still win power/flexibility wise if you can get your "low-power" ARM board to cost much less than your souped up x86 board. If only because if a single board fails, just one server goes down. Not all ten.
So basically you want Slashdot to turn into every news outlet on earth right now?
If I want to hear more about any of the current natural disasters, the state of Libya or even what lipgloss Jooolia is wearing this week - I'll turn on the Television or read a news-corporation owned website.
This is Slashdot, News for Nerds - just because a disaster happened doesn't mean we stop wanting to know about anything else.
Jeez.
His complaint basically boils down to the fact that the kernel needs to be able to map all of physical memory, and have some address space left over for memory-mapped I/O. This is a valid complaint for a kernel developer (although Linus' 'everyone who disagrees with me is an idiot' style is quite irritating), but it is largely irrelevant to the issue at hand. There is nothing stopping a kernel on ARM with LPAE from using 64-bit pointers internally. You still need to translate userspace pointers, but you need to do that anyway on most architectures (on x86, context switches are insanely expensive, so typically you use a segment for the kernel and run system call handlers without changing the page tables, just making the kernel segment visible by switching to ring 0), so that code already exists in all of the relevant places in the kernel.
Right now my system doesn't even have 480 live processes on it, let alone ones contending for execution time.
You're obviously not running Gentoo.
No, the problem is:
1) Kernel is starved for _address_ _space_ for its internal structures.
2) Userspace is starved for address space, because it has to view all the RAM through a small aperture (think EMS in 80286).
3) Constant address space remapping is costly.
And it doesn't matter that you use 64-bit pointers internally, because you can't address data directly.
But web pages won't even need you to do any floating point arithmetic.
Provided your application is written in a language that supports not-floating-point arithmetic. In PHP, for example, any division returns a floating-point result, as does any computation with numbers over 2 billion (such as the UNIX timestamps of dates past 2038).
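The "over 2 billion" limit is the signed 32-bit integer range. A small sketch of where it falls on the calendar (written in Python rather than PHP, so it shows the arithmetic only, not PHP's type juggling):

```python
from datetime import datetime, timezone

INT32_MAX = 2**31 - 1                 # largest signed 32-bit integer


def fits_int32(n):
    """True if n can be stored in a signed 32-bit integer."""
    return -2**31 <= n <= INT32_MAX


# The last UNIX timestamp a signed 32-bit integer can hold.
last_moment = datetime.fromtimestamp(INT32_MAX, tz=timezone.utc)

# A date past that point no longer fits.
ts_2040 = int(datetime(2040, 1, 1, tzinfo=timezone.utc).timestamp())

print(last_moment.isoformat(), fits_int32(ts_2040))
```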
A database server, if it's highly used, is largely stuck on the slowest part (disk I/O) when it has to do full table scans. You solve this by building proper indexes.
Until you have to use a DBMS that ignores your indexes. For example, MySQL appears unable to make efficient use of an index on a subquery that uses GROUP BY. From the manual: "A subquery in the FROM clause is evaluated by materializing the result into a temporary table, and this table does not use indexes. This does not allow the use of indexes in comparison with other tables in the query, although that might be useful." The only reason I haven't already rewritten it as a join is that the subquery uses GROUP BY. The workaround I have adopted is to rewrite the query as multiple CREATE TEMPORARY TABLE ... SELECT statements so that as few rows as possible are seen at once. Or is there a better workaround, other than dropping MySQL entirely?
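For readers who haven't seen that workaround, here is a minimal sketch of its shape: materialize the GROUP BY result into a temporary table yourself, index it, then join against it. It runs against SQLite via Python's `sqlite3` module purely so it is self-contained; the tables and columns are hypothetical, and MySQL's planner behaviour is not reproduced here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (customer_id INTEGER, amount INTEGER);
INSERT INTO customers VALUES (1, 'ann'), (2, 'bob');
INSERT INTO orders VALUES (1, 10), (1, 5), (2, 7);

-- Step 1: materialize the GROUP BY once, explicitly.
CREATE TEMPORARY TABLE totals AS
    SELECT customer_id, SUM(amount) AS total
    FROM orders
    GROUP BY customer_id;

-- Step 2: give the materialized rows an index the join can use.
CREATE INDEX totals_by_customer ON totals (customer_id);
""")

# Step 3: a plain join against the indexed temporary table.
rows = conn.execute("""
    SELECT c.name, t.total
    FROM customers AS c
    JOIN totals AS t ON t.customer_id = c.id
    ORDER BY c.name
""").fetchall()
print(rows)
conn.close()
```

The point of the split is that the index exists on the intermediate result, which an implicitly materialized subquery would not have.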
I do scientific computing where we regularly use virtual address spaces larger than 4GB. Not all of that is in the working set, of course, but it's often necessary to have that much mapped. One recent example is my leakage power and delay models for near-threshold circuits. I implemented the Markovic formulas and found them to be too slow. My simulations would take days. So, I figured out the granularities I needed for voltage, power, and temperature, and I implemented those models as giant look-up tables. The leakage power model occupies 4GB of address space all by itself. I just mmap the file into the process and go. Now the simulations take only hours.
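The "mmap the file and go" approach looks roughly like this. It is a toy sketch: the file name, the one-double-per-entry layout, and the table contents are invented for illustration, and a real table would be indexed by quantized voltage/temperature rather than a bare integer.

```python
import mmap
import os
import struct
import tempfile

ENTRY = struct.Struct("<d")           # assume one little-endian double per entry


def build_table(path, values):
    """Write precomputed values to disk as a flat binary table."""
    with open(path, "wb") as f:
        for v in values:
            f.write(ENTRY.pack(v))


def lookup(mm, index):
    """Read one entry straight out of the mapping; the OS pages it in on demand."""
    return ENTRY.unpack_from(mm, index * ENTRY.size)[0]


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "leakage.tbl")          # hypothetical file name
    build_table(path, [i * 0.5 for i in range(1000)])
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            value = lookup(mm, 42)

print(value)
```

Only the pages actually touched get read from disk, but the whole file still has to fit in the process's virtual address space, which is where a 32-bit limit bites for a 4GB table.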
The worst natural disaster in recorded history occurred less than a week ago, and you people are discussing Calxeda's first ARM-based server chip, designed to let companies build low-power servers with up to 480 cores; as the chip is built on a quad-core ARM processor, and low-power servers could have 120 ARM processing nodes in a 2U box; chips will be based on ARM's Cortex-A9 processor architecture???? My *god*, people, GET SOME PRIORITIES!
The bodies of nearly 10,000 dead people could give a good god damn about the advent of LAN parties, your childish Lego models, your nerf toys and lack of a "fun" workplace, your Everquest/Diablo/D&D addiction, or any of the other ways you are "getting on with your life".
I have in-laws and friends in Japan, and thank God they are all fine. But even if something had happened to them, what would you expect me, a /. reader, or anyone, to do? To cut my veins and pour ash on my head? What about the rest of the readers? You are just an attention whore looking for a cause celebre to be upset about. Nothing more, as your little rant does nothing constructive.
You don't know if people reading this donated for the cause. You do not know anything about anyone here, about what they do or feel, and yet you act as if you would.
There is a difference between mourning and empathy, and shameless and useless "leave britney alone" attention whoring. Guess which one describes you, buddy.
1) Kernel is starved for _address_ _space_ for its internal structures.
This is addressed by using physical addresses in the kernel, as I said. It can use 64-bit pointers, and the compiler emits direct loads and stores that bypass the MMU.
Userspace is starved for address space, because it has to view all the RAM through a small aperture (think EMS in 80286).
Which is only relevant if the process actually wants more than 4GB of address space, i.e. not very often (yet).
Constant address space remapping is costly
True, but this is only required on x86 because the kernel is using its own virtual address space. This is not an issue on ARM.
If you are doing scientific computing, then you are not in the target market for a system like this. The virtual address space size is the least of your problems - the relatively anaemic floating point performance is going to cripple your performance.
A proper webserver only needs 1 thread per core. Each socket/connection should only consume a few KB of RAM at most. A webserver shouldn't use more than a couple dozen MB of RAM at most, not including the OS file system cache. Look into Nginx or lighttpd.
Be relentless!
This kind of arrangement gets brought up over and over - one of the more recent examples is SiCortex, and it sucked. Having a Single System Image is always preferable to a "cluster in a box."
Now with 480 cores....2x as fast and with 9x better graphics than the iPad 19.
My God can beat up your God. Just kidding...don't take offense. I know there's no God.
You're more the target market for a nice G34 AMD system - 24 cores in 2 sockets, 64G of ram. This is more about serving lots of php.
"We returned the General to El Salvador, or maybe Guatemala, it's difficult to tell from 10,000 feet"
Nice to know :-)
If it works as a unified memory, then 2GB per node and 30 nodes is going to be way more than 32-bit addressing can cover, but it would be great for distributed work. If each node runs as its own machine, then there will have to be a separate boot drive for each node, and each node will have to have some sort of network connection to every other node. Should be interesting once more info comes out.
Raytracing.
I know tobacco is bad for you, so I smoke weed with crack.
I can't imagine a better workaround than dropping MySQL.
I know tobacco is bad for you, so I smoke weed with crack.
I can't imagine a better workaround than dropping MySQL.
In favor of what? PostgreSQL, or something one has to pay for? Either way, dropping MySQL support in the next version would require a lot of clients to drop their current hosting provider and switch from (cheap) shared hosting to a (more expensive) VPS.
AFAIK, most OSes shut down the MMU in kernel mode - Linux for instance. Address space remaps are costly because of a lot of explicit, non-cached memory accesses. Though I don't see why some more PAE bits can't replace 64-bit mode - you just need an IOMMU. And possibly hardware virtualization with a simple hypervisor. Though that might actually be faster, considering all the savings you make from pointers, not to mention that if the MMU and wide load/store instructions trap to the hypervisor directly - the context switch cost is the same as calling the OS.
SSI can be done in the system firmware/hypervisor/kernel. Linux supports it.
AFAIK, most OSes shut down the MMU in kernel mode - linux for instance
Linux certainly doesn't do this on x86. It uses the segmentation mechanism. The kernel's memory is in a segment, marked as only visible to ring 0 code. When you make a system call, the current process's segment(s) remain visible to the OS, as does the kernel's segment. This means that you typically have 1GB of address space reserved for the kernel, and 3GB for each userspace process. Red Hat used to ship a kernel that used an entirely separate address space, so you got 4GB for the kernel and 4GB for each userspace app, but this required a TLB flush on each system call (in and out) so it was quite slow.
The problems with PAE are not really problems with PAE, so much as they are problems with the completely hatstand memory architecture of x86.
Paging is shut down. The MMU does paging. The segmentation mechanism is separate.
There are so many things wrong with that, that I don't even know where to start. The MMU on x86 handles both paging and segmentation. Segments map from virtual addresses to linear addresses. Paging maps from linear addresses to physical addresses. Both are part of the virtual memory mapping handled by the MMU, which first walks the LDT / GDT, then the page tables, to translate from a virtual address to the physical.
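The two-stage translation described above can be sketched as two small functions chained together. This is a toy model only: the segment descriptor is reduced to a (base, limit) pair, the page table to a flat dictionary, and every value in it is made up.

```python
PAGE_SHIFT = 12                       # assume 4 KiB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1


def to_linear(segments, selector, offset):
    """Stage 1, segmentation: virtual (selector:offset) -> linear address."""
    base, limit = segments[selector]
    if offset > limit:
        raise MemoryError("offset is beyond the segment limit")
    return (base + offset) & 0xFFFFFFFF


def to_physical(page_table, linear):
    """Stage 2, paging: linear address -> physical address."""
    frame = page_table[linear >> PAGE_SHIFT]   # KeyError stands in for a page fault
    return (frame << PAGE_SHIFT) | (linear & PAGE_MASK)


# Hypothetical descriptor table and page table.
segments = {"data": (0x1000, 0xFFFF)}          # selector -> (base, limit)
page_table = {0x3: 0x80}                       # linear page -> physical frame

linear = to_linear(segments, "data", 0x2ABC)
physical = to_physical(page_table, linear)
print(hex(linear), hex(physical))
```

With a flat segment (base 0, maximum limit) the first stage becomes the identity, which is how most modern x86 kernels set things up.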
It sounds like you're repeating something that you heard and didn't understand. What you probably heard was that kernel memory can't be swapped - it is always resident in RAM, not paged (swapped) out to disk. This used to be true for Linux, but hasn't been for a while - recent kernels (as in, ones from about the last ten years) can swap some kernel memory out - but not all of it (for example, swapping out the VM subsystem would be a really bad idea, since you wouldn't be able to swap it back in. Swapping out interrupt handlers would also break things).
You got me - though I've never actually heard of the MMU being used in the kernel.