Qualcomm Debuts 10nm Server Chip To Attack Intel Server Stronghold (tomshardware.com)
An anonymous reader quotes a report from Tom's Hardware: Qualcomm and its Qualcomm Datacenter Technologies subsidiary announced today that the company has already begun sampling its first 10nm server processor. The Centriq 2400 is the second generation of Qualcomm server SOCs, but it is the first in its new family of 10nm FinFET processors. The Centriq 2400 features up to 48 custom Qualcomm ARMv8-compliant Falkor cores and comes a little over a year after Qualcomm began developing its first-generation Centriq processors. Qualcomm's introduction of a 10nm server chip while Intel is still refining its 14nm process appears to be a clear shot across Intel's bow--due not only to the smaller process, but also its sudden lead in core count. Intel's latest 14nm E7 Broadwell processors top out at 24 cores. Qualcomm isn't releasing more information, such as clock speeds or performance specifications, which would help to quantify the benefit of its increased core count. The server market commands the highest margins, which is certainly attractive for the mobile-centric Qualcomm, which found its success in the relatively low-margin smartphone segment. However, Intel has a commanding lead in the data center with more than a 99% share of the world's server sockets, and penetrating the segment requires considerable time, investment, and ecosystem development. Qualcomm unveiled at least a small portion of its development efforts by demonstrating Apache Spark and Hadoop on Linux and Java running on the Centriq 2400 processor. The company also notes that Falkor is SBSA compliant, which means that it is compatible with any software that runs on an ARMv8-compliant server platform.
It depends a lot on how fast the interconnect is and how fast memory access is.
It takes a LOT of cache and very clever data paths to keep 48 cores fed with data. Intel cores typically have 2.5MB of local level 3 cache for each core and multiple ring busses so cores can access the whole cache and not waste precious off-chip bandwidth trying to read from main memory. If this is a special purpose chip for executing deep learning algorithms that's one thing, but for a general purpose server where tasks are uncorrelated, it ain't easy to prevent stalls while cores wait for data.
It would be interesting since AMD cancelled their ARM efforts in the server space.
Shai Schticks:"You don't make peace with friends, you make peace with enemies"
Since node geometry now has more to do with marketing than it does with feature size, it's no longer a meaningful comparison. Intel's 14nm node is generally superior to TSMC's 10nm node (where the Centriq will most likely be fabbed).
Help save the critically endangered Blue Iguana
It's aimed at servers, so it's pretty safe to say it will be running 48 Apache threads with the socket code pretty much always in cache.
Or 48 other *identical* threads servicing multiple users for the same thread type.
Qualcomm designs chips (in my experience based on ARM, not x86) and outsources the actual manufacturing to other companies (TSMC, Samsung, whoever).
Not really seeing how this threatens Intel outside of the whole ARM vs x86 thing. My understanding is most server farms are connected to dedicated nuclear power plants anyway, so power consumption isn't an issue. Heat dissipation? Yeah, that might be an issue.
Intel is only "refining" the 14nm design through the natural course of their "tick-tock" process (which has now added a third "tock", seemingly due to a lack of real competition). The fact remains that Intel opened their 10nm fab in July, we're 6 months into production, and Cannonlake is on track for next year:
Intel starts up 10nm factory
And the A10 is about equal to x86 Sandy Bridge performance so it's going to take a lot of Qualcomm cores to be competitive with each x86 core.
It potentially could end up freeing the server space from a monopoly. You know? The thing Slashdot's always rallying against.
Any chance I could get a data sheet without a prohibitive NDA and the need to fork over one of my children?
(I suspect the answer is "no.")
This is quite narrow. A 10 nm wide track fits fewer than 100 atoms of silicon.
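As a rough sanity check of that figure, here is a minimal sketch. It takes the 10 nm label literally as a line width and assumes the commonly quoted values for crystalline silicon: a lattice constant of about 0.543 nm and a nearest-neighbour spacing of about 0.235 nm. Neither number comes from the thread, and the function name is made up for illustration.

```python
# Back-of-the-envelope: how many silicon atoms fit across a 10 nm line?
# Assumed physical constants (not from the thread):
SI_LATTICE_CONSTANT_NM = 0.543  # edge of silicon's cubic unit cell
SI_BOND_LENGTH_NM = 0.235       # nearest-neighbour Si-Si distance


def count_across(width_nm, spacing_nm):
    """Whole number of spacings of the given size that fit in width_nm."""
    return int(width_nm / spacing_nm)


unit_cells = count_across(10, SI_LATTICE_CONSTANT_NM)  # 18 unit cells
atoms = count_across(10, SI_BOND_LENGTH_NM)            # 42 atom spacings

print(f"{unit_cells} unit cells, roughly {atoms} atoms across")
```

Under those assumptions the count comes out at roughly 40 atoms, comfortably under 100.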
Data center power is expensive. Mostly because it's reliable and redundant. And yes, every watt used is a watt of heat that has to be removed by the cooling system.
Suppose it was literally true that a data center was powered by a dedicated nuclear power plant. It costs about $12 billion to build a power plant. How many cores would you like to be able to power from your $12 billion investment? If I operated a big DC, I'd rather power a million low-power CPUs from my X gigawatts of power than only be able to use 100,000 power hungry CPUs.
Not that most DCs are powered by a dedicated power plant - you really want connections to at least TWO power plants, and you typically want to be in the datacenter business, not the power plant business.
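The trade-off can be sketched with a few lines of arithmetic. Every number below is hypothetical (the 10 MW budget, the 50 W and 150 W per-CPU figures, and the 1.5 overhead factor for cooling and power distribution); the point is only that, for a fixed budget, CPU count scales inversely with power per CPU.

```python
def cpus_supported(budget_watts, watts_per_cpu, overhead_factor):
    """CPUs a fixed power budget can run, including cooling/distribution overhead."""
    return int(budget_watts // (watts_per_cpu * overhead_factor))


BUDGET_WATTS = 10_000_000  # hypothetical 10 MW facility budget
OVERHEAD_FACTOR = 1.5      # hypothetical: 1.5 W drawn per 1 W of compute

low_power = cpus_supported(BUDGET_WATTS, 50, OVERHEAD_FACTOR)      # 133333
power_hungry = cpus_supported(BUDGET_WATTS, 150, OVERHEAD_FACTOR)  # 44444

print(low_power, power_hungry)
```

Whether the larger number of low-power CPUs is actually the better deal still depends on how much work each CPU gets done, which this sketch does not model.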
It's called an "example". There are millions of servers that do almost nothing but run a bunch of Apache threads, many that do nothing but SMTP, many that do nothing but NoSQL lookups, etc. It's very common, especially for companies with thousands of servers, to have servers dedicated to a single task.
By percentage the automotive industry has the highest margins, not the server industry.
say what? citation needed?
Intel have known it for some time and spent a lot of time refining the cache down to the geometry...
What they do not specify is the cache size or any benchmarks... Personally, I would like nothing more than to see a mix of architectures with a standard board interface layout...
john
As more than half the cores have to remain idle most of the time to keep it from overheating.
:T:R:A:N:S:
Absolutely, if your WordPress blog needs about 1/4 the resources of a server, a virtual machine is a good way to do that. I offer that for our smallest customers. (We call it "Half Server", two cores and 8GB dedicated to each customer.)
If you need a cluster of 4, 40, or 400 nodes in your cluster of Squid proxies, the virtualization works the other way around - a true cluster is a rack row of machines that look and act like one. Each node, each piece of hardware, is an interchangeable and disposable part of the whole. There's no reason to run a hypervisor on the nodes; the whole row, the whole cluster, is a virtual service.
It makes more sense in networking gear at first. If people could rewrite their packet forwarding engine or create something like DPDK or SR-IOV for this chip, they could drop the mic. RISC usually kicks the shit out of x86 for packet forwarding.
Or use a GPU (http://shader.kaist.edu/packetshader/)
Or people will just have to learn to write code for the new memory hierarchy. It's not like this would be the first time; people had to learn to write for caches, too.
Ezekiel 23:20
Even if it is slower, offering more PCIe lanes than Intel is a win for the end user. A slower storage server loaded with PCIe storage can be better than a faster Intel one. And since Intel's lower-end CPUs have fewer PCIe lanes than the CPUs costing $200-$300 more, which are only a little faster in the same socket but have more PCIe lanes enabled, this could force Intel to give up on that idea.
AMD's 64-bit system did not fail like the Itanic.
Nobody learned.
Look at any standard library or application framework and you will not find any cache oblivious algorithms.
Linked lists are just traditionally implemented linked lists. Hash tables are just traditionally implemented hash tables. Trees are just traditionally implemented trees. Even sorting will be a ham-fisted quicksort.
Pretty much only assembly language programmers give a shit, mainly because they are the only ones who understand the issues. Any exceptions you find are the exceptions that prove the rule.
"His name was James Damore."
Since the CPUs do about a third as much per cycle.
You don't need it for large portions of an application's code. It's not even worth the effort for many things. Those people who need to learn it eventually learn it.
>>cache oblivious
Did you mean cache aware?
>>Linked lists are just traditionally implemented linked lists. Hash tables are just traditionally implemented hash tables
Linked lists suck for caches, but hash tables don't have to. There's a trend for libraries to provide things like hopscotch hash tables as the default hash table implementation and these are very much cache aware. The real problem is the trend towards languages that favour composition by reference rather than by inclusion, which means that you do a lot of pointer chasing, which is very bad for both caches and modern pipelines.
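To make the by-reference versus by-inclusion distinction concrete, here is a minimal sketch. The `Point` class and both function names are made up for illustration. In CPython a list of objects stores pointers to separately allocated objects, while `array("d")` stores raw doubles in one contiguous buffer; the sketch only shows the two layouts and makes no timing claims.

```python
from array import array


class Point:
    """By-reference composition: each Point is its own heap object."""
    __slots__ = ("x", "y")

    def __init__(self, x, y):
        self.x = x
        self.y = y


def sum_x_by_reference(points):
    # Each step follows a pointer from the list to a Point, then to its x.
    return sum(p.x for p in points)


def sum_x_by_inclusion(flat):
    # flat holds x0, y0, x1, y1, ... as raw doubles in one contiguous buffer.
    return sum(flat[0::2])


points = [Point(float(i), float(-i)) for i in range(1000)]

flat = array("d")
for p in points:
    flat.extend((p.x, p.y))

print(sum_x_by_reference(points), sum_x_by_inclusion(flat))
```

Both functions compute the same sum; the difference is that the second walks a single block of memory instead of chasing a pointer per element.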
I am TheRaven on Soylent News
I searched on Google. Found this in under two seconds. Took me more than that to write this reply.
http://www.theverge.com/2016/9...
Not true. Most of what programs do is totally dependent on the speed of a) a database, b) an online web service, or c) a file system.
In those cases caches are definitely used, a lot. And you get 95% of your speed gains from there.
The dangers of excessive individualism are nothing compared to the oppressiveness of excessive collectivism
For single machines, like you say you can upgrade the metal OS without disturbing the guests (hopefully). If you have a cluster of 16 Snort nodes, or 32 storage servers, you just take each offline as you upgrade it, then it rejoins the cluster when ready. It's kind of reverse virtualization - the 16 pieces of hardware are virtually one service.
Good luck finding assembly language programmers for modern processors. Almost all have gone in the RISC direction, relying on the higher-level compilers to fill in the gaps to make the environment more CISC-like. Example - a RISC CPU doesn't have an ADD instruction, but you can implement the function by negating one of the parameters and using SUB ... a C compiler will do this for you, and will remember to flip the polarity of the carry flag and any conditionals as appropriate. It makes assembly programming insanely complex, but it's invisible once you abstract up a layer to something like C.
The further you abstract away from native code, the harder it gets to write code that's cache- or other-resource-aware.
The biggest differences, to me at least, are that:
- You can pre-bake your images, which means that the server is only down for a minute or two for the upgrade rather than having to wait through the install process.
- You don't have to fight with $RANDOM_VENDOR's dodgy implementation of out-of-band management to powercycle the server and watch the console (or, $deity forbid, actually attach virtual media).
- If you're dealing with an impending hardware failure, migration of the host and all of its data to another server is comparatively trivial; just use vMotion.
In the modern day, CPUs are so fast (and have so many virtualization-specific enhancements) that the hypervisor overhead is negligible. The only downside I've seen is the licensing cost, but that's usually a drop in the bucket next to the hardware itself.
-- sigs cause cancer.
I was surprised to learn recently that the memory bandwidth per core has actually been falling in recent years. Thus the parent is correct.
The problem is this. Core counts are increasing faster than memory bandwidth is improving. The system as a whole is getting better, but individual cores are becoming more vulnerable to processor stalls. This means that there is a limit on how much adding cores helps the overall system. And the cache bandwidth problem is independent of application threading awareness. It's a separate problem.
And this is true even accounting for the ever-larger cache sizes, the improvements in processor efficiency (instructions per clock), and all the rest.
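The effect is easy to see with made-up numbers. The core counts and bandwidth figures below are hypothetical, not measurements of any real chip: total bandwidth grows by 50% while the core count triples, so each core's share halves.

```python
def bandwidth_per_core(total_gb_per_s, cores):
    """Share of memory bandwidth available to each core, in GB/s."""
    return total_gb_per_s / cores


# Hypothetical generations: bandwidth improves, but cores grow faster.
older = bandwidth_per_core(50.0, 8)   # 6.25 GB/s per core
newer = bandwidth_per_core(75.0, 24)  # 3.125 GB/s per core

print(older, newer)
```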
You don't need to write ASM to get large performance increases. I help optimize C# applications all the time and I can regularly gain 10%-30% performance from simple refactoring that does not affect code readability and does not change the algorithm. Most of what I do is think about how the .NET code will be converted into ASM and how the garbage collector will be involved. While GC optimizations gain the most, re-ordering calculations can also give good returns.
Someone doesn't know what kind of cache is being talked about....
...but still decided to dive in with "Not true..." acting like they know something.
>>You don't need it for large portions of an application's code.
We aren't talking about application code. We are talking about library code.
If you write libraries like you write applications, then you are part of the problem.