3D DRAM Spec Published
Lucas123 writes "The three largest memory makers announced the final specifications for three-dimensional DRAM, which is aimed at increasing performance for networking and high-performance computing markets. Micron, Samsung and Hynix are leading the technology development efforts backed by the Hybrid Memory Cube Consortium (HMC). The Hybrid Memory Cube will stack multiple volatile memory dies on top of a DRAM controller. The result is a DRAM chip that has an aggregate bandwidth of 160GB/s, 15 times the throughput of standard DRAMs, while also reducing power by 70%. 'Basically, the beauty of it is that it gets rid of all the issues that were keeping DDR3 and DDR4 from going as fast as they could,' said Jim Handy, director of research firm Objective Analysis. The first versions of the Hybrid Memory Cube, due out in the second half of 2013, will deliver 2GB and 4GB of memory."
The CPU vendors need to start stacking them onto their die.
In 5 years your systems will be sold with fixed memory sizes, and the only way to upgrade is to upgrade CPUs.
Stacked vias could also be used for other peripheral devices as well. (GPU?)
Just like Star Trek movies, every other iteration of memory tech is a dud. I will just wait for holographic crystals.
Where's my memristors?
Magnetic core memory was 3D. With something like 16k per cubic foot.
Where have I seen 3D silicon before?
I was working at SGI at the time, late 1991. The cheapest way to buy expansion memory was to buy Indigo's and throw out the rest of the computer. SGI was just feeling the first tickles of the commoditization of computer hardware, and was looking for ways to make their components unique (and keep them expensive.)
I love Mondays. On a Monday, anything is possible.
Massive throughput is all well and good, very useful for many cases, but does this help with latency?
Near as I can tell, DRAM latency has maybe halved since the Y2K era. Processors keep throwing more cache at the problem, but that only helps to a certain extent. Some chips even go to extreme lengths to avoid too much idle time while waiting on RAM ("HyperThreading", the UltraSPARC T* series). Getting better latency would probably help performance more than bandwidth.
Submarine patent from Rambus [or someone else] surfacing in 3... 2... 1...
"For a successful technology, reality must take precedence over public relations, for Nature cannot be fooled"
Frakking excellent news. For some time now the bottleneck has been in the memory bandwidth, not in the CPU/GPU processing power. This will help a lot with problems like raytracing/pathtracing, which are memory bound.
Thank you, gods of Olympus!
P.S. For some time now I've been trying to find again a .pdf file (which I had found in the past, but lost somehow) with detailed explanations and calculations on the memory and FLOPS requirements of raytracing, and how memory bandwidth is very low for such problems.
So when can people running ddr1 or ddr2 expect to get some multilayer chips that vastly increase memory bandwidth in older systems?
Given that, for PC applications at any rate, the memory controller is built into either the motherboard or the CPU, there is likely to be a bottleneck there in any case. There would have been no reason for designers of memory controllers of the era to spec them out with the expectation of more than modest improvements.
Also, this '3D memory' stuff includes a memory controller with the DRAM dice stacked on top. To what, exactly, in a DDR2-using system are you going to connect a fancy new memory controller?
If you were a real high roller with a big cluster full of multi-socket HyperTransport-based systems or something, somebody might be moved to build some very, very high performance memory modules that occupy CPU sockets; but that's a serious edge case. Most systems (even new ones) simply don't have a spare bus fast enough to hang substantially-faster-than-DDR3 RAM from.
It will probably be around 5 years until we can buy these things like we buy DDR3. This industry is developing so fast, yet moving so slow.
This HMC stuff is going to require new CPUs with new memory controllers on board. On the plus side, for the same bandwidth, they will use a lot fewer pins.
Of course, the down-side is the early-adopter penalty of HMC being rather expensive. I expect that if it takes off, the price will drop rapidly.
"-1 Troll" is apparently the same as "-1 I disagree with you."
Absolutely nothing. Hence, no change in slashdot editing quality. New here, are you?
I've fallen off your lawn, and I can't get up.
Nobody ever accused SGI of sane pricing.
I read the internet for the articles.
The overall design reminds me a lot of Rambus. It saved pins and had excellent sustained throughput, but memory latency suffered.
I read the internet for the articles.
Hybrid Memory Cube exists in a 4-point world. Four corners are absolute and storage capacity is circumnavigated around Four compass directions North, South, East, and West. DRAM consortium spreads mistruths about Hybrid Memory Cube four point space. This cannot be refuted with conventional two dimensional DRAM.
Does that matter all that much? With cache lines sufficiently long, you're doing burst transfers all the time anyway, or not?
Ezekiel 23:20
I'm an American, and like many others I too cringed when I read that. Are you implying that people in the Uber-glittery Eurozone never make grammatical errors?
It could simply mean that L1 and L2 speakers tend to make different classes of errors.
Ezekiel 23:20
If you think that modern memory is as simple as "send an address and read or write the data", you are much mistaken.
Have a read of What every programmer should know about memory and get a simplified overview of what is going on. This too is only a simplification of what is really going on.
To actually build a memory controller is another step up again - RAM chips have configuration registers that need to be set, and modules have a serial flash on them that holds device parameters. With high speed DDR memory you even have to make allowances for the different lengths in the PCB traces, and that is just the starting point - the devices still need to perform real-time calibration to accurately capture the returning bits.
Roll Serial Port Memory Technology!
? I mean, money? Psssh, there's people out there that have two GTX Titans ($1,000 cards) and would have more if there was room on the motherboard. Plus the vast reduction in power usage would be really useful for mobile high end stuff. Would love to grab a Nvidia 850 or whatever next year with 4 gigs of this onboard.
The power of a modern processor to get work done is dominated by cache misses. I mean by a factor of a hundred or more to one unless every bit you are computing lives in cache and nothing ever kicks your code or data's cache line out (including another line of code or data that you need. Because of the way that cache works you can't map every address to every line in cache).
Don't take my word for it though, take Cliff Click's: http://www.infoq.com/presentations/click-crash-course-modern-hardware
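A rough way to see why misses dominate is the textbook average-access-time estimate: hit cost plus miss rate times miss penalty. The sketch below is illustrative only. The 1-cycle hit and 200-cycle miss penalty are assumed round numbers, not measurements of any particular CPU, and `average_access_cycles` is a hypothetical helper name.

```python
def average_access_cycles(hit_cycles, miss_penalty_cycles, miss_rate):
    """Average cycles per memory access: hit cost plus the expected miss cost."""
    return hit_cycles + miss_rate * miss_penalty_cycles


# Assumed figures, for illustration only.
HIT_CYCLES = 1
MISS_PENALTY_CYCLES = 200

for miss_rate in (0.0, 0.01, 0.05, 0.20):
    cycles = average_access_cycles(HIT_CYCLES, MISS_PENALTY_CYCLES, miss_rate)
    print(f"miss rate {miss_rate:5.0%}: {cycles:6.1f} cycles per access")
```

Under these assumptions, a 5% miss rate already makes the average access about 11 times slower than a pure cache hit.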
md5sum
d41d8cd98f00b204e9800998ecf8427e
Yeah, but how long till one of the partners runs off and patents this new process and starts suing everyone in sight? (Remember Rambus?)
Sig Battery depleted. Reverting to safe mode.
How do they cool this apparatus?
Um... yeah. No. I appreciate that what you have are considerably better than regular caps, but they're nowhere *near* the performance of what we keep being offered. Nanotube infused designs with power to weight ratios around that of batteries, graphene designs, etc. There's a huge wealth of applications waiting for them to hit somewhere around those marks. Electric cars, actual car battery replacements, cellphone power supplies that never die, backup systems for the house with peak powers far in excess of anything we have now but with comparable storage... the ultracap "breakthroughs" are as regular as any other kind (memristors, etc.) and the consistent no-show of actual commercially available units is also consistent. It's the flying car of electronic components, sigh. High voltage, high capacity, high vapor factor, lol.
Believe me, I've been following the whole ultracap thing for a while. I even keep an eye on EEStor, which I can assure you has been a stupendous exercise in fruitless waiting. As a ham with a full boat of offline powered goodies and the beginnings of a household able to run off backup systems, and more than a little willingness to buy an electric car, actual availability of ultracaps in what I call "the battery range" would truly light me up.
But that carrot is well and truly still out on the stick.
I've fallen off your lawn, and I can't get up.
> ... about something as insignificant than that.
There. Broke that for you.
I worked for a company that needed more RDRAM in a server. We bought a second hand server, took out the RAM and threw away the rest. It was cheaper.
Back in 1997, it was determined that ~90% of the benchmarks and customer applications (provided to us for testing purposes, the NDAs were amazing) used on PowerPC were completely dominated by cache misses. That means that if we knew how many times the processor touched a bus (data easily obtained in real time), we could be accurate to within 5% of what the performance would be using a spreadsheet calculation (Thanks, Dr. Jenevein) vs running the apps on a cycle accurate system simulation which could take weeks to develop a meaningful profile. Every time the caches got bigger, the code to solve customer problems would get proportionally bigger. That hadn't changed in 2007 and isn't anticipated to change by 2017. There are edge cases, but until people are satisfied with continuing to play Lode Runner instead of Crysis N, it won't matter for the mass market.
That doesn't mean that CPUs don't need to get bigger/faster, but it does mean that there is a meaningful limit on performance relative to the cache size, the calculation of which is probably left to an exercise for the student in H&P's Computer Architecture.
Most of those pins in the CPU are for power. While the overall system power consumption can be lowered, it's entirely moved to a single chip. They may need more pins. A 130W CPU with a core voltage of ~1V needs an average of 130A of current going through those pins. The peaks will be much, much higher. They'll need more pins to get more bandwidth in and out of the CPU+Memory chip too.
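The 130A figure is just P = V * I rearranged. As a back-of-the-envelope sketch, the per-pin rating of 0.5A used below is an assumed number for illustration, not a real socket specification. The helper names are hypothetical, and ground-return pins are not counted.

```python
import math


def supply_current_amps(power_watts, core_voltage_volts):
    """Average supply current from P = V * I, rearranged to I = P / V."""
    return power_watts / core_voltage_volts


def min_supply_pins(current_amps, amps_per_pin):
    """Smallest whole number of supply pins that keeps each pin within its rating."""
    return math.ceil(current_amps / amps_per_pin)


current = supply_current_amps(130, 1.0)
ASSUMED_AMPS_PER_PIN = 0.5  # assumption for illustration, not a datasheet value
print(f"average current: {current:.0f} A")
print(f"supply pins needed: {min_supply_pins(current, ASSUMED_AMPS_PER_PIN)}")
```

With that assumed rating, the average load alone would take 260 supply pins before any allowance for peaks.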
I inherited all kinds of PS/2s...excrement. At this time they were being sold with a _12_ inch "billiard ball" monochrome IBM monitor. I eventually upgraded all of them to Zenith totally flat color monitors.
PS/2s were wildly proprietary -- wee, we get to buy all new add-in cards! And performance dogs -- Model 30/286 FTW.
A newb reading the parent's post would think otherwise as you cite wiki and all.
PS/2s and OS/2, released around the same time frame, killed IBM. End of story.
I come here for the love
NVidia Volta, coming in 2016?
It matters a great deal, and making sure burst transfers are effective is not always possible.
I do high performance calculations for a living. Knowing in advance what you will need in the future is a somewhat hard problem (and the basis of most modern optimization.)
The difference between main memory and cache is vast. If you can predict what you need far enough in advance to load it into cache, that helps quite a bit, but realize that normally at best you are loading 4x what you really will need (which is the nature of trying to predict it so far ahead of time that you are not able to calculate what you will really need.)
If you want to contest that, how much memory do you have in cache compared to your data set of a few terabytes? Multiple cores are usually a loss in performance if you even try; most real world problems are not possible to run in parallel once you hit the easy optimizations (which mask latency for the most part at the expense of a large amount of cache memory.)
Most of the harder problems I have run into could scale across multiple cores (or CPUs) if it was designed that way, but the run time would always be worse than a solution which assumed that it will always run on one core (introducing synchronization points kills it.)
Latency is essentially everything in most applications which are optimized (most are not, it costs too much.) The recent trend of simply including more CPUs is essentially an acknowledgement that computers have almost hit their limit in terms of the number of sequential calculations they can run over time.
If you are assuming that your application will become faster as time goes on you already lost. In most cases this cannot happen unless the original implementation was highly suboptimal (such as... you used Java or C# instead of C, or your C code is terrible.)
I'm a signature virus. Please copy me to your signature so I can replicate.
Come on, it is Anonymous Coward we are talking about! He has been around since the beginning and his UID is so low, it can't be shown ;-)
Tomorrow is another day...
visibility++;
If my comment didn't sound as good in your head as it did in mine, then I guess we all know who's to blame
I believe the UID for AC is 666, though it isn't shown on his posts.
RDRAM was never cheap. I binned a Dell because it was cheaper to build a new machine with the required spec than to add a Gig of RDRAM to that thing.
Operation Guillotine is in effect.
You say newer, I was teaching people to use dcbt/icbt in PowerPC (and similar instructions in other architectures) to do that in the 90's (granted, they affected the L2 if one existed, no one had implemented an L3 on-die at that point). I love the instructions, and used the heck out of them when I hand optimized assembly code- not a career choice I would recommend at this point in time, btw. Compilers exist that can make use of them, fortunately, and they do help maintain the performance curve, but they don't break it out to a new level.
Rude of me to reply to myself, but I should have added that when the vector units were added to PowerPC in the mid-late '90s, dst (data stream) instructions had the ability to indicate whether the fetches were transient or not and affect only the L1 if they were. gcc has supported the ability to do this since not long after the MPC7400 was released, IIRC.
> The power of a modern processor to get work done is dominated by cache misses. I mean by a factor of a hundred or more to one unless every bit you are computing lives in cache and nothing ever kicks your code or data's cache line out (including another line of code or data that you need.
I happen to know that. What I meant by this was that it shouldn't matter all that much that latency is much worse than the throughput, because the burst transfers effectively amortize the latency cost. You're doing random reads against the L1 cache, not against the main memory. (If you organize your data so as to make the cache miss with every read, you're screwed anyway.)
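How far a burst amortizes the latency depends on how long the burst is relative to that latency. A minimal sketch, assuming one request in flight at a time, with made-up round numbers (50 ns latency, 10 GB/s peak) and a hypothetical helper name:

```python
def effective_bandwidth_gb_s(burst_bytes, latency_ns, peak_gb_s):
    """Effective GB/s when each burst pays the full latency before data streams."""
    transfer_ns = burst_bytes / peak_gb_s  # 1 GB/s moves 1 byte per nanosecond
    return burst_bytes / (latency_ns + transfer_ns)


# Assumed figures, for illustration only.
LATENCY_NS = 50.0
PEAK_GB_S = 10.0

for burst_bytes in (64, 512, 4096):
    rate = effective_bandwidth_gb_s(burst_bytes, LATENCY_NS, PEAK_GB_S)
    print(f"{burst_bytes:5d}-byte burst: {rate:5.2f} GB/s effective")
```

Under these assumptions a single 64-byte line fill reaches only about 1.1 GB/s of the 10 GB/s peak, while a 4096-byte burst reaches about 8.9 GB/s. Overlapping several outstanding requests would change the picture, and this model ignores that.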
Ezekiel 23:20