3D DRAM Spec Published

And for faster performance by mozumder · 2013-04-02 07:59 · Score: 1

the CPU vendors need to start stacking them onto their die.

In 5 years your systems will be sold with fixed memory sizes, and the only way to upgrade is to upgrade CPUs.

Stacked vias could also be used for other peripheral devices as well. (GPU?)

Re:And for faster performance by Anonymous Coward · 2013-04-02 08:06 · Score: 1

If you want more RAM you just add more CPUs!!!
WIN-WIN!!!
Re:And for faster performance by ArcadeMan · 2013-04-02 08:13 · Score: 5, Funny

Mac users won't see any difference in 5 years... wink wink
Posted from my Mac mini.

--
Get free satoshi (Bitcoin) and Dogecoins
Re:And for faster performance by TheRaven64 · 2013-04-02 08:21 · Score: 2

Most CPU vendors do. This has been the standard way of shipping RAM for mobile devices for a long time (search package-on-package). It means that you don't need any motherboard traces for the RAM interface (which reduces cost), increases the number of possible physical connections (increasing bandwidth) and reduces the physical size. The down side is that it also means that the CPU (and GPU and DSP and whatever else is on the SoC) and the RAM have to share heat dissipation. If you put a DDR chip on top of a Core i7, then one or the other (or possibly both) would be too hot to function. There are quite a few interesting experimental architectures that mix execution units and RAM on the same die, because the power cost of moving data between RAM and CPU is starting to be important. It's also often cheaper (in terms of both time and power) to recompute intermediate results than fetch them from main memory for workloads such as image processing.

--
I am TheRaven on Soylent News
Re:And for faster performance by harrkev · 2013-04-02 08:58 · Score: 3, Interesting

HMC does not need to sit on top of a CPU. HMC is just a way to package a lot of memory into a smaller space and use fewer pins to talk to it. In fact, because of the smaller number of traces, you are likely to be able to put the HMC closer to the CPU than is currently possible. Also, since you are wiggling fewer wires, the I/O power will go down. Currently, one RAM channel can have two DIMMs in it, so the drivers have to be beefy enough to handle that possibility. Since HMC is based on serdes, it is a point-to-point link that can be lower power.
I am sure that as speeds ramp up HMC will have its own heat problems, but sharing heat with the CPU is not one of them.

--
"-1 Troll" is the apparently the same as "-1 I disagree with you."
Re:And for faster performance by SuricouRaven · 2013-04-02 09:17 · Score: 1

Don't forget power. At the frequencies memory runs at, it takes considerable power to drive an inter-chip trace. The big design constraints on portable devices are size and power.
Re:And for faster performance by ackthpt · 2013-04-02 09:52 · Score: 2

the CPU vendors need to start stacking them onto their die.
In 5 years your systems will be sold with fixed memory sizes, and the only way to upgrade is to upgrade CPUs.
Stacked vias could also be used for other peripheral devices as well. (GPU?)
IBM tried this with the PS/2 line. It fell flat on its face.

--

A feeling of having made the same mistake before: Deja Foobar
Re:And for faster performance by gagol · 2013-04-02 09:52 · Score: 2

I would like to see 4GB on die memory with regular DRAM controller for "swap" ;-)

--
Tomorrow is another day...
Re:And for faster performance by kaws · 2013-04-02 09:59 · Score: 2

Hmm, tell that to my upgraded Macbook. I have 16gb of ram in mine. On the other hand, you're probably right that it will take a long time for the upgrades to show up in apple's store.
Re:And for faster performance by forkazoo · 2013-04-02 10:08 · Score: 3, Interesting

To be fair, if somebody tried to sell something as locked down as the iPad is during the period when IBM first released the PS/2, it would have also flopped. The market has changed a lot since the 1980's. People who seriously upgrade their desktop are a rather small fraction of the total market for programmable things with CPU's.
Re:And for faster performance by Anonymous Coward · 2013-04-02 11:11 · Score: 0

Actually, I'd love it if I could get more processing power just by expanding my RAM. Maybe something like Venray is doing with their TOMI technology.
You can still have your main multi-core CPU; it would just talk to the parallel units in RAM like it does to your GPGPU, ASICs, DMA controllers, etc.
Re:And for faster performance by dj245 · 2013-04-02 12:04 · Score: 1

the CPU vendors need to start stacking them onto their die.
In 5 years your systems will be sold with fixed memory sizes, and the only way to upgrade is to upgrade CPUs.
Stacked vias could also be used for other peripheral devices as well. (GPU?)
IBM tried this with the PS/2 line. It fell flat on its face.
This is news to me. I owned a PS/2 model 25 and model 80, and played around with a model 30. The model 80 used 72 pin SIMMs and even had a MCA expansion card for adding more SIMMs. The model 80 I bought (when their useful life was long over) was stuffed full of SIMMs. The model 25 used a strange (30 pin?) smaller SIMM, but it was upgradable. I forget what the model 30 had. Wikipedia seems to disagree with you also.

--
Even those who arrange and design shrubberies are under considerable economic stress at this period in history.
Re:And for faster performance by Pseudonym · 2013-04-02 12:52 · Score: 1

You think modern bloatware is inefficient and slow? Just wait until every machine is a NUMA machine!

--
sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
Re:And for faster performance by viperidaenz · 2013-04-02 14:12 · Score: 1

Like the top of the range Mac Pro's currently with their 2 year old Xeon CPU's.
Re:And for faster performance by WuphonsReach · 2013-04-02 15:18 · Score: 1

From what I recall, IBM's problem with the PS/2 brand was:

1) They tried to shift everyone to MCA instead of the more open ISA/EISA, mostly because they were trying to stuff the genie back in the bottle and retake control of the industry.

2) The lower end of the PS/2 line was garbage, which tarnished the upper-end.

We had a few PS/2 server towers to play with. They were rather over-engineered and expensive, and the Intel / Compaq / AT&T commodity systems were faster and less expensive.

--
Wolde you bothe eate your cake, and have your cake?
Re:And for faster performance by Anonymous Coward · 2013-04-02 17:28 · Score: 0

I hate to tell you this, but when the IBM 5150 came out 8-bit ISA was proprietary. In addition, EISA was no more successful than MCA in spite of the fact that you could plug 8-bit and 16-bit ISA cards into EISA slots. With the PS/2, IBM simply failed to realize that they had no control of the (100% IBM-)PC (compatible) market.
Re:And for faster performance by Issarlk · 2013-04-02 18:27 · Score: 3, Funny

Since those are 3D chips, does that mean Apple's price for those RAM will be multiplied by 8 instead of 2?
Re:And for faster performance by Anonymous Coward · 2013-04-02 20:01 · Score: 0

NVIDIA has already announced stacked RAM for their future Volta cards (first link I could find).
Re:And for faster performance by TheRaven64 · 2013-04-02 20:13 · Score: 1

You might want to read the context of the discussion before you reply. My post was in reply to someone who said:

And for faster performance the CPU vendors need to start stacking them onto their die.

--
I am TheRaven on Soylent News
Re:And for faster performance by unixisc · 2013-04-02 21:04 · Score: 1

This is more a practice w/ portable and wireless devices, where low consumption of real estate too is a major factor, and not just low consumption of power. The top package is typically larger than the bottom package, and all the signal pins are at the periphery. For a memory-on-CPU POP, the CPU is typically the bottom package, and its signals are all the core pins, while the memory is the top package, w/ signals at the periphery. Internally, the CPU and memory could be connected, and only the separate signals drawn out.
Re:And for faster performance by IllogicalStudent · 2013-04-03 02:47 · Score: 1

the CPU vendors need to start stacking them onto their die.
In 5 years your systems will be sold with fixed memory sizes, and the only way to upgrade is to upgrade CPUs.
Stacked vias could also be used for other peripheral devices as well. (GPU?)
Problem with this, of course, is that Intel wants to stop having slotted motherboards. Chips will be affixed to boards. Makes RAM upgrades a costly proposition, no?

--
But Maaa! Everyone else has a .sig !
Re:And for faster performance by Tastecicles · 2013-04-03 03:59 · Score: 1

Yesterday's server chip: today's desktop chip.
Prime example: the AMD Athlon II 630. Couple years ago it was the dog's bollocks in server processors and you couldn't get one for less than a grand. Now it's the dog's bollocks of quad core desktop processors (nothing has changed except the name and the packaging) and my son bought one a month ago for change out of £100.
The Core series processors you find in desktops and laptops these days all started life as identically-specced Xeon server processors.

--
Operation Guillotine is in effect.
Re:And for faster performance by viperidaenz · 2013-04-03 07:52 · Score: 1

Not really. The workstations of slightly cheaper price from Dell and others use current Xeon's.
The Core i series processors never were server processors. They don't support ECC and have smaller caches than Xeon's.
Re:And for faster performance by Tastecicles · 2013-04-03 09:58 · Score: 1

Intel begs to differ:
http://ark.intel.com/products/codename/29890/Clarkdale

--
Operation Guillotine is in effect.
Re:And for faster performance by Anonymous Coward · 2013-04-03 09:59 · Score: 0

Ummm... every multi-cpu machine sold these days is.
Re:And for faster performance by viperidaenz · 2013-04-03 10:32 · Score: 1

All that link is telling me Clarkdale was a desktop CPU. I own one of those too, the i5-661.
It also tells me the Xeon Clarkdale has ECC and the i5 and i3 don't. The i-series also has integrated graphics, Xeon doesn't.
They branded one of them a Xeon for the workstation market, put ECC on it and took off the GPU.
Re:And for faster performance by Pseudonym · 2013-04-03 11:14 · Score: 1

Most machines aren't multi-CPU machines.

--
sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
Re:And for faster performance by Anonymous Coward · 2013-04-03 15:51 · Score: 0

Intel is actually pretty much the worst case for your argument. The temporal relationship for Intel is essentially never what you claim. When there's a common design between a given Xeon model and a given i3/i5/i7 model, they get released at about the same time.
Furthermore, when Intel's release dates are different, the Xeon is almost always later, not earlier. The reason for any such delays is usually how much time it takes to test, validate, and bugfix the chip. Xeon-specific features tend to be harder to test. Sometimes much harder (think support for cache coherency over multiple CPU sockets, advanced reliability features, and so forth). So sometimes it takes Intel a bit longer to do the Xeon, especially if they run into any problems which require them to do an extra revision or two. (Note that lots of bugs which are acceptable on the desktop aren't in servers.)
Your link bears this out -- when you click through for detailed information you'll find that the Clarkdale Xeon and several desktop Clarkdales debuted in Q1 2010. (This is a case where the Xeon variant isn't much different from the desktop variant, because it's a single-socket Xeon.)
Re:And for faster performance by Anonymous Coward · 2013-04-07 18:37 · Score: 0

Depends on whether you count a core as a CPU, or a whole chip die. Most machines sold today are multi-core.
Re:And for faster performance by Pseudonym · 2013-04-07 19:17 · Score: 1

The standard terminology appears to be that multi-core is not multi-CPU, and that's abundantly clear from the context of the thread.
The claim, you may recall, is that every multi-CPU machine sold is a NUMA architecture. That's patently untrue of almost all machines which feature multiple cores on one die.

--
sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
Re:And for faster performance by Anonymous Coward · 2013-04-08 04:53 · Score: 0

Got it. And since the cores on a multi-core chip share a single memory controller, it would be difficult to have them use separate memory channels!
Re:And for faster performance by Anonymous Coward · 2013-04-08 08:08 · Score: 0

Not quite... there was no "8-bit ISA," since ISA was the Industry Standard Architecture specification based on the AT bus. The 8-bit version was the "PC bus" and it was trademarked but not patented. It was open architecture. Card developers didn't have to pay royalties to IBM.

Every other iteration of ram tech is a dud by sinij · 2013-04-02 07:59 · Score: 3, Funny

Just like Star Trek movies, every other iteration of memory tech is a dud. I will just wait for holographic crystals.

Re:Every other iteration of ram tech is a dud by Anonymous Coward · 2013-04-02 08:29 · Score: 0

Just like Star Trek movies, every other iteration of memory tech is a dud. I will just wait for holographic crystals.
No, no! It is actually the (1, 2)-Ulam sequence.
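For anyone who hasn't met it: the (1, 2)-Ulam sequence starts with 1 and 2, and each later term is the smallest larger integer that can be written as the sum of two distinct earlier terms in exactly one way. A minimal sketch in Python (the function name is just for illustration):

```python
def ulam_sequence(first, second, count):
    """Return the first `count` terms of the (first, second)-Ulam sequence."""
    terms = [first, second]
    candidate = second
    while len(terms) < count:
        candidate += 1
        present = set(terms)
        # Count the ways to write candidate = a + b with a < b,
        # where a and b are both earlier terms.
        ways = sum(
            1 for a in terms
            if a < candidate - a and (candidate - a) in present
        )
        if ways == 1:
            terms.append(candidate)
    return terms[:count]


print(ulam_sequence(1, 2, 10))  # [1, 2, 3, 4, 6, 8, 11, 13, 16, 18]
```

Note that 5 is skipped because it is both 1 + 4 and 2 + 3.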

nothing new here by Anonymous Coward · 2013-04-02 08:00 · Score: 0

Sounds like they've managed to re-invent (with modern fabrication techniques) the "controller+DRAM" modules used in the first-generation (R3K) Indigo by Silicon Graphics.

Re:nothing new here by Thagg · 2013-04-02 08:13 · Score: 5, Interesting

I was working at SGI at the time, late 1991. The cheapest way to buy expansion memory was to buy Indigo's and throw out the rest of the computer. SGI was just feeling the first tickles of the commoditization of computer hardware, and was looking for ways to make their components unique (and keep them expensive.)

--
I love Mondays. On a Monday, anything is possible.
Re:nothing new here by jandrese · 2013-04-02 09:06 · Score: 4, Insightful

Nobody ever accused SGI of sane pricing.

--

I read the internet for the articles.
Re:nothing new here by icebike · 2013-04-02 10:31 · Score: 1

Yeah, but how long till one of the partners runs off, patents this new process and starts suing everyone in sight? (Remember Rambus?)

--
Sig Battery depleted. Reverting to safe mode.
Re:nothing new here by viperidaenz · 2013-04-02 14:16 · Score: 1

I worked for a company that needed more RDRAM in a server. We bought a second hand server, took out the RAM and threw away the rest. It was cheaper.
Re:nothing new here by Tastecicles · 2013-04-03 04:04 · Score: 1

RDRAM was never cheap. I binned a Dell because it was cheaper to build a new machine with the required spec than to add a Gig of RDRAM to that thing.

--
Operation Guillotine is in effect.

Still waiting... by Shinare · 2013-04-02 08:04 · Score: 4, Interesting

Where's my memristors?

Re:Still waiting... by Anonymous Coward · 2013-04-02 09:01 · Score: 0

Samsung and Hynix announced a year or so ago that it would be some time this year. We shall see.
Re:Still waiting... by fyngyrz · 2013-04-02 09:03 · Score: 4, Funny

Your memristors are with my ultracaps, flying car, and retroviral DNA fixes. I think they're all in the basement of the fusion reactor. Tended to by household service robots.

--
I've fallen off your lawn, and I can't get up.
Re:Still waiting... by SuricouRaven · 2013-04-02 09:24 · Score: 1

Ultracaps are readily available now. I've got a bank of 2600 farad jobbies. I use them to power my Mad Science setup.
Re:Still waiting... by Anonymous Coward · 2013-04-02 09:31 · Score: 0

Can I haz the superhydrophobic surface treatment spray?
Re:Still waiting... by Anonymous Coward · 2013-04-02 09:47 · Score: 0

seconded. i've got 4- 3000 Farad boostcaps just lying on my desk waiting for some project, and 6 more 2600 Farad caps in my friends subwoofer amplifier.
Re:Still waiting... by WrecklessSandwich · 2013-04-02 11:33 · Score: 1

Your ultracaps are right on over here: http://www.digikey.com/product-search/en/capacitors/electric-double-layer-capacitors-supercaps/131084
Re:Still waiting... by Anonymous Coward · 2013-04-02 13:25 · Score: 0

Screw memristors, where is my damned Racetrack Memory, IBM?!
Re:Still waiting... by QQBoss · 2013-04-02 14:13 · Score: 1

Screw memristors, where is my damned Racetrack Memory, IBM?!
Right here, just not quite as small.
Re:Still waiting... by viperidaenz · 2013-04-02 14:29 · Score: 1

Didn't Tron steal those race tracks?
Re:Still waiting... by gagol · 2013-04-02 15:45 · Score: 1

Easy, just glue your surface with lotus leaves! YMMV.

--
Tomorrow is another day...
Re:Still waiting... by Anonymous Coward · 2013-04-02 23:51 · Score: 0

they are not coming, stan williams is a fraud. more precisely, he is one of those parasites commonly found in the research community whose modus operandi is as follows:
1) read about some promising new, relatively unknown and high-risk research ideas
2) change them slightly and/or combine them in obvious ways
3) present this as "brand new work" to his or her victims (typically these would be naive, indifferent or incompetent employers and/or employees)
4) obtain a bundle of money and/or time from the victims to "conduct further research"
5) spend maybe a couple of hours per year actually working
5a) (unlikely case) if this actually leads to genuine progress, BREAK. there's no need to continue being a parasite.
5b) (likely case) find and commence grooming the next set of victims before continuing to 6)
6) when reports/results are due, either:
6a) fabricate plausible-sounding excuses (e.g. blame lack of progress on an unsuspecting victim)
6b) beg for more time and/or money
6c) declare that, despite the lack of tangible results, the research was a success because $RANDOM says so
6d) declare that the research turned out to be a dead-end
7) depending on the victim's response, either loop back to 1) and 4) or move onto the next set of victims found in 5b) and loop back to 1).
$RANDOM can be any one of:
- a "collaborator" (typically another parasite)
- an "accepts anything" academic outlet (e.g. japanese journal, iranian university)
- oneself
the trick to this scam is to select and prepare victims carefully. if done well, a single individual can milk a research lab or university for literally decades before moving on, leaving them none the wiser, and in some cases even with a too-big-to-fire division that can be reused in step 6c) at the parasite's new location.

Not really the first time for this by Anonymous Coward · 2013-04-02 08:07 · Score: 1

Magnetic core memory was 3D. With something like 16k per cubic foot.

Oh noes by ArcadeMan · 2013-04-02 08:11 · Score: 2

Where have I seen 3D silicon before?

--
Get free satoshi (Bitcoin) and Dogecoins

Re:Oh noes by Anonymous Coward · 2013-04-02 11:24 · Score: 0

I think this one might have come first.
Oh, wait... sili-CON, with no 'e'? Sorry, my mistake.
Re:Oh noes by alexo · 2013-04-03 09:07 · Score: 1

Where have I seen 3D silicon before?
On Pamela Anderson?

Dram by Anonymous Coward · 2013-04-02 08:14 · Score: 0

So when can people running ddr1 or ddr2 expect to get some multilayer chips that vastly increase memory bandwidth in older systems?

Re:Dram by fuzzyfuzzyfungus · 2013-04-02 08:37 · Score: 3, Insightful

So when can people running ddr1 or ddr2 expect to get some multilayer chips that vastly increase memory bandwidth in older systems?
Given that, for PC applications at any rate, the memory controller is built into either the motherboard or the CPU, there is likely to be a bottleneck there in any case. There would have been no reason for designers of memory controllers of the era to spec them out with the expectation of more than modest improvements.
Also, this '3D memory' stuff includes a memory controller with the DRAM dice stacked on top. To what, exactly, in a DDR2-using system are you going to connect a fancy new memory controller?
If you were a real high roller with a big cluster full of multi-socket hypertransport based systems or something, somebody might be moved to build some very, very, high performance memory modules that occupy CPU sockets; but that's a serious edge case. Most systems(even new ones) simply don't have a spare bus fast enough to hang substantially-faster-than-DDR3 RAM from.
Re:Dram by harrkev · 2013-04-02 09:02 · Score: 1

This HMC stuff is going to require new CPUs with new memory controllers on board. On the plus side, for the same bandwidth, they will use a lot fewer pins.
Of course, the down-side is the early-adopter penalty of HMC being rather expensive. I expect that if it takes off, the price will drop rapidly.

--
"-1 Troll" is the apparently the same as "-1 I disagree with you."
Re:Dram by jandrese · 2013-04-02 09:17 · Score: 2

The overall design reminds me a lot of Rambus. It saved pins and had excellent sustained throughput, but memory latency suffered.

--

I read the internet for the articles.
Re:Dram by Anonymous Coward · 2013-04-02 09:20 · Score: 0

Will the price drop? Is anyone allowed to join the HMC consortium or will this be another cartel that exploits their patent pool to prevent proper competition?
http://en.wikipedia.org/wiki/DRAM_price_fixing
Re:Dram by K.+S.+Kyosuke · 2013-04-02 09:35 · Score: 1

Does that matter all that much? With cache lines sufficiently long, you're doing burst transfers all the time anyway, or not?

--
Ezekiel 23:20
Re:Dram by doublebackslash · 2013-04-02 10:24 · Score: 1

The power of a modern processor to get work done is dominated by cache misses. I mean by a factor of a hundred or more to one unless every bit you are computing lives in cache and nothing ever kicks your code or data's cache line out (including another line of code or data that you need. Because of the way that cache works you can't map every address to every line in cache).
Don't take my word for it though, take Cliff Click's: http://www.infoq.com/presentations/click-crash-course-modern-hardware

--
md5sum /boot/vmlinuz
d41d8cd98f00b204e9800998ecf8427e /boot/vmlinuz
Re:Dram by Anonymous Coward · 2013-04-02 11:02 · Score: 0

Does that matter all that much? With cache lines sufficiently long, you're doing burst transfers all the time anyway, or not?
Yes, it matters. In the same way that disk access time matters. Most of the time you work against memory but when the swap starts everything sucks.
If you have very fast disk access then working against the swap is less painful.
This is pretty much the same thing. Get fast memory access and you won't be sour about life for those edge cases when the cache isn't enough.
Re:Dram by QQBoss · 2013-04-02 14:28 · Score: 2

Back in 1997, it was determined that ~90% of the benchmarks and customer applications (provided to us for testing purposes, the NDAs were amazing) used on PowerPC were completely dominated by cache misses. That means that if we knew how many times the processor touched a bus (data easily obtained in real time), we could be accurate to within 5% of what the performance would be using a spreadsheet calculation (Thanks, Dr. Jenevein) vs running the apps on a cycle accurate system simulation which could take weeks to develop a meaningful profile. Every time the caches got bigger, the code to solve customer problems would get proportionally bigger. That hadn't changed in 2007 and isn't anticipated to change by 2017. There are edge cases, but until people are satisfied with continuing to play Lode Runner instead of Crysis N, it won't matter for the mass market.
That doesn't mean that CPUs don't need to get bigger/faster, but it does mean that there is a meaningful limit on performance relative to the cache size, the calculation of which is probably left to an exercise for the student in H&P's Computer Architecture.
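The actual spreadsheet calculation isn't given here, so the following is only a guess at its general shape: a first-order model in which run time is core cycles plus a fixed stall penalty for every bus access. The function name, the base CPI and the 100-cycle penalty are all placeholders chosen for illustration, not measured values.

```python
def estimated_runtime_s(instructions, bus_accesses, clock_hz,
                        base_cpi=1.0, bus_penalty_cycles=100):
    """Toy model: cycles spent executing plus cycles stalled on the bus."""
    core_cycles = instructions * base_cpi
    stall_cycles = bus_accesses * bus_penalty_cycles
    return (core_cycles + stall_cycles) / clock_hz


# Hypothetical workload: 1e9 instructions, 2% of them touching the bus,
# on a 1 GHz part. The stalls account for two thirds of the run time.
total = estimated_runtime_s(1_000_000_000, 20_000_000, 1_000_000_000)
print(f"{total:.1f} s")  # 3.0 s
```

With numbers like these the bus-access count alone fixes most of the answer, which is the point being made above.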
Re:Dram by viperidaenz · 2013-04-02 14:36 · Score: 1

Most of those pins in the CPU are for power. While the overall system power consumption can be lowered, it's entirely moved to a single chip. They may need more pins. A 130W CPU with a core voltage of ~1V needs an average of 130A of current going through those pins. The peaks will be much, much higher. They'll need more pins to get more bandwidth in and out of the CPU+Memory chip too.
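The 130A figure is just I = P / V. A quick sketch of that arithmetic, plus the pin count it implies under an assumed per-pin current limit (the 0.5 A limit is a made-up placeholder, not a datasheet value):

```python
import math


def supply_current_a(power_w, core_voltage_v):
    """Average supply current in amperes, from I = P / V."""
    return power_w / core_voltage_v


def min_supply_pins(current_a, max_current_per_pin_a):
    """Fewest pins needed if each pin may carry at most the given current."""
    return math.ceil(current_a / max_current_per_pin_a)


current = supply_current_a(130.0, 1.0)   # figures from the post
pins = min_supply_pins(current, 0.5)     # 0.5 A per pin is hypothetical
print(f"{current:.0f} A average, at least {pins} supply pins")
```

Whatever the real per-pin limit is, moving memory power onto the same package raises the current and so the pin count.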
Re:Dram by cerberusti · 2013-04-02 15:22 · Score: 1

It matters a great deal, and making sure burst transfers are effective is not always possible.
I do high performance calculations for a living. Knowing in advance what you will need in the future is a somewhat hard problem (and the basis of most modern optimization.)
The difference between main memory and cache is vast, if you can predict what you need far enough in advance to load it into cache that helps quite a bit, but realize that normally at best you are loading 4x what you really will need (which is the nature of trying to predict it so far ahead of time you are not able to calculate what you will really need.)
If you want to contest that, how much memory do you have in cache compared to your data set of a few terabytes? Multiple cores are usually a loss in performance if you even try, most real world problems are not possible to run in parallel once you hit the easy optimizations (which mask latency for the most part at the expense of a large amount of cache memory.)
Most of the harder problems I have run into could scale across multiple cores (or CPUs) if it was designed that way, but the run time would always be worse than a solution which assumed that it will always run on one core (introducing synchronization points kills it.)
Latency is essentially everything in most applications which are optimized (most are not, it costs too much.) The recent trend of simply including more CPUs is essentially an acknowledgement that computers have almost hit their limit in terms of the number of sequential calculations they can run over time.
If you are assuming that your application will become faster as time goes on you already lost. In most cases this cannot happen unless the original implementation was highly suboptimal (such as... you used Java or C# instead of C, or your C code is terrible.)

--
I'm a signature virus. Please copy me to your signature so I can replicate.
Re:Dram by Anonymous Coward · 2013-04-03 02:14 · Score: 0

Newer instructions allow the CPUs to load data from memory into L1 without affecting L2/L3. This means fewer L2/L3 evictions caused by streaming data that rarely gets used more than once before getting evicted.

The other cool part is that these instructions run async of the request and lower priority than other memory accesses. This means the instructions can attempt to load data a bit before the data is required, while not blocking the pipeline or causing extra memory congestion. The CPU only eats the cost of an instruction decode, processes the load, and discards that instruction. If the data returns in time, then the CPU will access the data in L1, if it does not, then it will just be a cache miss and the normal path will be taken. If the normal path is taken, then the data will be copied into L2/L3 as a normal memory fetch.
Re:Dram by QQBoss · 2013-04-03 04:17 · Score: 1

You say newer, I was teaching people to use dcbt/icbt in PowerPC (and similar instructions in other architectures) to do that in the 90's (granted, they affected the L2 if one existed, no one had implemented an L3 on-die at that point). I love the instructions, and used the heck out of them when I hand optimized assembly code- not a career choice I would recommend at this point in time, btw. Compilers exist that can make use of them, fortunately, and they do help maintain the performance curve, but they don't break it out to a new level.
Re:Dram by QQBoss · 2013-04-03 04:28 · Score: 1

Rude of me to reply to myself, but I should have added that when the vector units were added to PowerPC in the mid-late '90s, dst (data stream) instructions had the ability to indicate whether the fetches were transient or not and affect only the L1 if they were. gcc has supported the ability to do this since not long after the MPC7400 was released, IIRC.
Re:Dram by K.+S.+Kyosuke · 2013-04-03 04:58 · Score: 1

The power of a modern processor to get work done is dominated by cache misses. I mean by a factor of a hundred or more to one unless every bit you are computing lives in cache and nothing ever kicks your code or data's cache line out (including another line of code or data that you need.
I happen to know that. What I meant by this was that it shouldn't matter all that much that latency is much worse than the throughput, because the burst transfers effectively amortize the latency cost. You're doing random reads against the L1 cache, not against the main memory. (If you organize your data so as to make the cache miss with every read, you're screwed anyway.)
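A back-of-envelope version of the amortization argument, assuming a simple model where fetching one line costs a fixed latency plus the transfer time. The 50 ns latency and 10 bytes/ns bandwidth are illustrative placeholders, not figures for any real part.

```python
def line_fetch_ns(line_bytes, latency_ns, bandwidth_bytes_per_ns):
    """Time to fetch one cache line: fixed latency plus burst transfer."""
    return latency_ns + line_bytes / bandwidth_bytes_per_ns


def effective_bytes_per_ns(line_bytes, latency_ns, bandwidth_bytes_per_ns):
    """Throughput actually seen when every line is a separate, serial miss."""
    return line_bytes / line_fetch_ns(line_bytes, latency_ns,
                                      bandwidth_bytes_per_ns)


# Longer bursts spread the same latency over more bytes.
for line_bytes in (64, 256, 1024):
    rate = effective_bytes_per_ns(line_bytes, 50.0, 10.0)
    print(f"{line_bytes:5d}-byte burst: {rate:.2f} bytes/ns effective")
```

In this model the latency is only amortized to the extent that the rest of the burst is actually used, which is where the two sides of this sub-thread differ.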

--
Ezekiel 23:20
Re:Dram by Anonymous Coward · 2013-04-03 05:06 · Score: 0

Back in 1997, it was determined that ~90% of the benchmarks and customer applications (provided to us for testing purposes, the NDAs were amazing) used on PowerPC were completely dominated by cache misses.
That probably had a lot to do with the suckiness of the monster that is C++ with STL. :/
Re:Dram by Anonymous Coward · 2013-04-03 06:01 · Score: 0

That's why data structures and access patterns make a difference.

Even with C#, I have gotten large 50%+ increases in performance of CPU bound operations by tweaking data structures and loops to be more cache friendly. One in particular got about a 15% increase out of the box just by changing the loop, then I changed how the loop grouped comparisons and gained another 20%, then I tweaked the group sizes and gained another 30%. As the groups got larger, throughput got higher, until the groups approached the size of the cache, then performance started to go down.

tweaking the processing group sizes to roughly the cache sizes of the current architecture resulted in the best performance.
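The original code was C# and isn't shown, so this is only a sketch of the general loop shape being described (blocking, or tiling), written in Python for brevity. The function name and the equality test are stand-ins. Python itself won't show the speedup; the point is the order in which pairs are visited.

```python
def count_matches_blocked(left, right, block_size):
    """Count pairs (a, b) with a == b, visiting the pairs tile by tile.

    Each tile touches at most block_size items from each input, so in a
    compiled language the working set of the inner loops can be sized to
    fit in cache.
    """
    matches = 0
    for i in range(0, len(left), block_size):
        left_block = left[i:i + block_size]
        for j in range(0, len(right), block_size):
            right_block = right[j:j + block_size]
            for a in left_block:
                for b in right_block:
                    if a == b:
                        matches += 1
    return matches


print(count_matches_blocked([1, 2, 3, 2], [2, 2, 5], block_size=2))  # 4
```

The result is the same for any block size; only the memory access order changes, which is what gets tuned against the cache size.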

Yeah, in C# you can get large gains like this.

I looked at the operations being done by my average case, which was the overwhelming one, counted the number of operations, gave clock-cycle values to each operation, then compared those values against my run-time and total operations performed and I came within 20%.

Keeping 32 cores at 90%+, while running close to metal speeds is fun.

I spent a lot of time finding ways to cull object allocations and to find ways to convert objects into values that can be compared using integers. Like converting a UTF16 string into a UTF8 encoded byte array, then pinning the byte array and using Int64 to compare. Or using math tricks to convert doubles that represent probabilities from 0-1, into integers, where the math is much faster.
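The exact tricks aren't spelled out, so here is one plausible reading of each, sketched in Python rather than C#: fixed-point integers for probabilities, and comparing UTF-8 bytes eight at a time as 64-bit words. All names and the 32-bit scale are assumptions for illustration.

```python
import struct

FIXED_BITS = 32  # assumed scale: 1.0 maps to 2**32


def to_fixed(p):
    """Map a probability in [0, 1] to an integer in [0, 2**FIXED_BITS]."""
    return round(p * (1 << FIXED_BITS))


def fixed_mul(a, b):
    """Multiply two fixed-point probabilities using only integer math."""
    return (a * b) >> FIXED_BITS


def equal_by_words(a, b):
    """Compare two byte strings 8 bytes at a time, then the leftover tail."""
    if len(a) != len(b):
        return False
    n = len(a) - len(a) % 8
    fmt = f"<{n // 8}Q"  # n // 8 little-endian unsigned 64-bit words
    if struct.unpack_from(fmt, a) != struct.unpack_from(fmt, b):
        return False
    return a[n:] == b[n:]


half = to_fixed(0.5)
print(fixed_mul(half, half) == to_fixed(0.25))  # True
print(equal_by_words("cache-friendly".encode("utf-8"),
                     "cache-friendly".encode("utf-8")))  # True
```

In Python this gains nothing over `a == b`; it only shows the shape of the comparison that pays off when done on pinned memory in a compiled language.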

MORE !! MORE DEEP THOUGHTS !! by Anonymous Coward · 2013-04-02 08:14 · Score: 0

By Jim Handy !!

It's "more... THAN", stupid, stupid Americans... by Anonymous Coward · 2013-04-02 08:14 · Score: 0, Informative

For Christ's sake - it's even in the summaries now.

"15 times more throughput AS standard DRAMs"

It's "15 times more throughput THAN standard DRAMs", you illiterate cretins...

What the hell happened to the American education system in the last ten years or so? It seems like half of you ignoramuses don't know what any of your prepositions mean. Just put in 'to', 'on', 'then', 'that', 'than', etc.etc. at random, that'll do. Near enough.

And 12 nanoseconds later by Anonymous Coward · 2013-04-02 08:19 · Score: 0

3D software will bring this hardware to its knees as leet users everywhere complain how "slow" the hardware is!

Latency? by gman003 · 2013-04-02 08:20 · Score: 4, Insightful

Massive throughput is all well and good, very useful for many cases, but does this help with latency?

Near as I can tell, DRAM latency has maybe halved since the Y2K era. Processors keep throwing more cache at the problem, but that only helps to a certain extent. Some chips even go to extreme lengths to avoid too much idle time while waiting on RAM ("HyperThreading", the UltraSPARC T* series). Getting better latency would probably help performance more than bandwidth.

Re:Latency? by Anonymous Coward · 2013-04-02 08:35 · Score: 0

That is only a problem at the first bit requested. How often do you want 1 bit and not the rest of the cache line? And you *ARE* going to get the rest of the cache line... Most CPUs work that way these days. Fetching just 1 byte or even 2 at a time has not been the case since around the time of the Pentium/586.
Re:Latency? by harrkev · 2013-04-02 08:50 · Score: 4, Informative

I have a passing familiarity with this technology. Everything communicates through a serial link. This means that you have the extra overhead of having to serialize the requests and transmit them over the channel. Then, the HMC memory has to de-serialize them before it can act on the request. Once the HMC has the data, it has to go back through the serializer and de-serializer again. I would be surprised if the latency was lower.
On the other hand, the interface between the controller and the RAM itself is tightly controlled by the vendor since the controller is TOUCHING the RAM chips, instead of a couple of inches away like it is now, so that means that you should be able to tighten timings up. All communication between the RAM and the CPU will be through serial links, so that means that the CPU needs far fewer pins for the same bandwidth. A dozen pins or so will do what 100 pins used to do before. This means that you can have either smaller/cheaper CPU packages, or more bandwidth for the same number of pins, or some trade-off in between.
I, for one, welcome our new HMC overlords, and hope they do well.
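
As a rough illustration of the pin-count claim above, here is the arithmetic as a small sketch. The 100 and 12 pin counts are the ones quoted in the comment; the 1.6 Gb/s per-pin rate is an assumed figure for a parallel interface, and the model ignores encoding overhead, differential pairs, and the fact that not every pin carries data:

```python
def per_pin_rate_needed(old_pins, new_pins, old_rate_gbps):
    """Per-pin rate a narrow link needs to match the wide link's total bandwidth."""
    total_gbps = old_pins * old_rate_gbps
    return total_gbps / new_pins


# 100 parallel pins at an assumed 1.6 Gb/s each, replaced by 12 serial pins.
print(per_pin_rate_needed(100, 12, 1.6))
```

Under those assumptions each serial pin has to run at roughly 13 Gb/s, about eight times the parallel per-pin rate, which is the kind of speed serdes links are built for.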

--
"-1 Troll" is the apparently the same as "-1 I disagree with you."
Re:Latency? by Cassini2 · 2013-04-02 09:00 · Score: 2

This technology will not significantly affect memory latency, because DRAM latency is almost entirely driven by the row and column address selection inside the DRAM. The additional controller chip will likely increase average latency. However, this effect will be lessened because the higher-bandwidth memory controllers will fill the processor's cache more quickly. Also, the new DRAM chips will likely be fabricated on a denser manufacturing process, with many parallel columns, which will result in a minor improvement in speed.
All told, this new technology will not change the fact that modern CPUs spend about 50% of their clock cycles waiting for data.
Re:Latency? by jandrese · 2013-04-02 09:18 · Score: 1

How often are your memory access patterns not neatly aligned? This is pretty frequent and can be a major bottleneck for some applications.

--

I read the internet for the articles.
Re:Latency? by Anonymous Coward · 2013-04-02 09:23 · Score: 0

The *interesting* thing is the size of the package. If they can get it down to the size of a CPU with the same throughput as current memory, then in theory they could put the memory and CPU in the same package, removing about 1 ft of wire travel. The only thing holding it back now is the size of the RAM sticks/boards. I see interesting 'SoC' in the near future... Or at the very least phones with radically more memory.
This tech should also be interesting in the GPU market, where a not insignificant amount of power is lost to the RAM...
It remains to be seen whether moving the memory controller out of the CPU and back to an external bus item would deliver the goods. It may lower the cost, though, as you pointed out, with the pin count going down.
Re:Latency? by hamster_nz · 2013-04-02 10:08 · Score: 3, Informative

This change of packaging allows greater memory density, and maybe higher transfer bandwidths. It will not alter the "first word" latency much, if at all.
Signal propagation over the wires isn't the problem; the way all DRAM works is.
- The DRAM arrays have "sense amplifiers", used to recover data from the memory cells. They are much like op-amps. To start the cycle, both inputs on the sense amplifier are charged to a middle level.
- The row is opened, dumping any stored charge into one side of the sense amplifier.
- The sense amplifiers then saturate the signal to recover either a high or low level.
- At this point the data is ready to be accessed and transferred to the host (for a read), or values updated (for a write). It is in this part that the memory interconnect performance really matters (e.g. Fast Page mode DRAM, DDR, DDR2, DDR3).
- Once the read-back and updates are completed, the row is closed, capturing the saturated voltage levels back in the cells.
And then the next memory cycle can begin. On top of that you have to add in refresh cycles: the rows are opened and closed on a schedule to ensure that the stored charge doesn't leak away, consuming time and adding to uneven memory latency.
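
The cycle above can be turned into a toy timing model. This is only a sketch: the three timing parameters are named after the usual DRAM datasheet terms, the value of 11 clock cycles each is an illustrative assumption, the model keeps a row open after an access (an open-page policy, which is one design choice among several), and refresh is ignored entirely:

```python
T_RP = 11   # precharge: close the open row and re-centre the sense amplifiers
T_RCD = 11  # activate: open a row and let the sense amplifiers saturate
T_CL = 11   # column access: from column command to first data word


class Bank:
    """One DRAM bank that remembers which row is currently open."""

    def __init__(self):
        self.open_row = None

    def access(self, row):
        """Return the cycles until the first word arrives for an access to `row`."""
        if self.open_row == row:
            cycles = T_CL                  # row hit: data already in the sense amps
        elif self.open_row is None:
            cycles = T_RCD + T_CL          # bank idle: open the row, then read
        else:
            cycles = T_RP + T_RCD + T_CL   # conflict: close old row, open new, read
        self.open_row = row
        return cycles


bank = Bank()
print(bank.access(5), bank.access(5), bank.access(9))
```

Note that none of the three terms depends on how fast the link to the host is, which is why a faster interconnect changes bandwidth far more than first-word latency.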
Re:Latency? by doublebackslash · 2013-04-02 10:57 · Score: 2

Pointer chasing is the canonical example. Trees, linked lists of every flavor, maps, many many more.
Even if your memory accesses are aligned you will still start to stream cache misses as soon as you are operating beyond the limits of cache, or start bouncing between cores and/or threads (snooping is cheap, but it isn't free and by the time you get there another thread might have kicked out your data).
Then there is synchronization between threads. Fences aren't free (far far from it, though some can be cheaper than others).
Some practical examples are ray tracers (objects scattered all around memory), XML parsers (relatively huge objects, and more; love it or hate it, XML is everywhere), precise garbage collection (which scatters certain objects around memory), and compression.
That is just off the top of my head, but you get the idea. Not everything is contiguous. Even when it is, you can easily stream misses at a rate colossally higher than they can be served.
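
A minimal sketch of what pointer chasing looks like as an access pattern. The function names are made up for the example, and in CPython the interpreter overhead hides most of the cache effect; the same structure written in C over an array much larger than cache shows it far more starkly:

```python
import random


def make_chain(n, seed=0):
    """Build next-index links forming one cycle that visits all n slots in random order."""
    order = list(range(1, n))
    random.Random(seed).shuffle(order)
    order = [0] + order
    nxt = [0] * n
    for here, there in zip(order, order[1:] + order[:1]):
        nxt[here] = there
    return nxt


def chase(nxt, steps):
    """Follow the links; each load address depends on the result of the previous load."""
    i = 0
    for _ in range(steps):
        i = nxt[i]
    return i


links = make_chain(1000)
print(chase(links, 1000))
```

Because the next address is not known until the current load completes, the hardware cannot overlap or prefetch these accesses, so every miss pays the full memory latency no matter how much bandwidth is available.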

--
md5sum /boot/vmlinuz
d41d8cd98f00b204e9800998ecf8427e /boot/vmlinuz
Re:Latency? by viperidaenz · 2013-04-02 14:46 · Score: 1

NVidia already announced "stacked DRAM" for their future "Volta" GPUs a couple of weeks ago.
Re:Latency? by smallfries · 2013-04-02 20:32 · Score: 1

Do you know why the target bandwidth for USR (15Gb) is lower than the bandwidth for SR (28Gb)?
It seems strange that they would not take advantage of the shorter distance to increase the transfer speed.

--
Slashdot: where don knuth is an idiot because he cant grasp the awesome power of php

streaks of bubbles in the water... by nicolaiplum · 2013-04-02 08:27 · Score: 1

Submarine patent from Rambus [or someone else] surfacing in 3... 2... 1...

--
"For a successful technology, reality must take precedence over public relations, for Nature cannot be fooled"

Re:streaks of bubbles in the water... by ackthpt · 2013-04-02 09:53 · Score: 1

Submarine patent from Rambus [or someone else] surfacing in 3... 2... 1...
Yep. Hope they got signatories all notarized and everything.

--

A feeling of having made the same mistake before: Deja Foobar

This is what I call by etash · 2013-04-02 08:31 · Score: 1

frakking excellent news. For some time now the bottleneck has been in the memory bandwidth, not in the cpu/gpu processing power. This will help a lot with problems like raytracing/pathtracing which are memory bound.

thank you gods of the olympus!

p.s. for some time now I've been trying to find again a .pdf file (which I had found in the past, but somehow lost) with detailed explanations and calculations on the memory and flops requirements of raytracing, and how memory bandwidth is very low for such problems

3D is just a fad! by Anonymous Coward · 2013-04-02 08:33 · Score: 0

No one likes it, it's just a way for the industry to wring more money out of the consumer.

Re:3D is just a fad! by Anonymous Coward · 2013-04-03 02:40 · Score: 0

So true, just a fad. I'm totally waiting until they come out with 4D memory. Can't pull one over on me so easily!

5 years by Anonymous Coward · 2013-04-02 09:01 · Score: 1

It will probably be around 5 years until we can buy these things like we buy DDR3. This industry is developing so fast, yet moving so slow.

I sent the guys an email... by Anonymous Coward · 2013-04-02 09:01 · Score: 0

Seems to me they got the specs backwards in the announcement. Shouldn't UltraSR be faster than SR?
Also asked if they benchmarked WOW yet and what kind of frame rates I can expect?

Good news everyone by Anonymous Coward · 2013-04-02 09:03 · Score: 0

It does not need glasses, only if you want to look smart.

Re:It's "more... THAN", stupid, stupid Americans.. by fyngyrz · 2013-04-02 09:05 · Score: 1

What the hell happened to the American education system in the last ten years or so?

Absolutely nothing. Hence, no change in slashdot editing quality. New here, are you?

--
I've fallen off your lawn, and I can't get up.

Re:It's "more... THAN", stupid, stupid Americans.. by Anonymous Coward · 2013-04-02 09:12 · Score: 0

I'm an American, and like many others I too cringed when I read that. Are you implying that people in the Uber-glittery Eurozone never make grammatical errors?
What the hell happened to the Eurotrash education system that you would make such a ridiculous generalization?

Now if you want to complain about the literacy of the submitter, whose nationality you don't even know, and based on a single grammatical error, you may proceed- I'm sure those near you are used to hearing you rant and pound on your keyboard as they mop up the excess foam spurting from your mouth.

Hybrid Memory Cube has 4 Corner Time by Anonymous Coward · 2013-04-02 09:18 · Score: 1, Funny

Hybrid Memory Cube exists in a 4-point world. Four corners are absolute and storage capacity is circumnavigated around Four compass directions North, South, East, and West. DRAM consortium spreads mistruths about Hybrid Memory Cube four point space. This cannot be refuted with conventional two dimensional DRAM.

Re:Hybrid Memory Cube has 4 Corner Time by Anonymous Coward · 2013-04-02 16:14 · Score: 0

[This is the sound of me jumping from the 56th floor]...

Re:It's "more... THAN", stupid, stupid Americans.. by K.+S.+Kyosuke · 2013-04-02 09:37 · Score: 1

I'm an American, and like many others I too cringed when I read that. Are you implying that people in the Uber-glittery Eurozone never make grammatical errors?

It could simply mean that L1 and L2 speakers tend to make different classes of errors.

--
Ezekiel 23:20

Memory is far more complex than you imagine. by hamster_nz · 2013-04-02 09:37 · Score: 2

If you think that modern memory is as simple as "send an address and read or write the data", you are much mistaken.

Have a read of "What Every Programmer Should Know About Memory" and get a simplified overview of what is going on. This too is only a simplification of what is really going on.

To actually build a memory controller is another step up again - RAM chips have configuration registers that need to be set, and modules have a serial flash on them that holds device parameters. With high speed DDR memory you even have to make allowances for the different lengths in the PCB traces, and that is just the starting point - the devices still need to perform real-time calibration to accurately capture the returning bits.

Roll Serial Port Memory Technology!

So we can expect (Hope?) to see this in GDDR6 spec by locater16 · 2013-04-02 09:39 · Score: 1

? I mean, money? Psssh, there's people out there that have two GTX Titans ($1,000 cards) and would have more if there was room on the motherboard. Plus the vast reduction in power usage would be really useful for mobile high end stuff. Would love to grab a Nvidia 850 or whatever next year with 4 gigs of this onboard.

Re:It's "more... THAN", stupid, stupid Americans.. by Anonymous Coward · 2013-04-02 10:27 · Score: 0

Shut up. No one wants to hear your whining about something as insignificant as that.

Cooling by Anonymous Coward · 2013-04-02 11:09 · Score: 1

How do they cool this apparatus?

Ultracaps by fyngyrz · 2013-04-02 12:59 · Score: 4, Insightful

Um... yeah. No. I appreciate that what you have is considerably better than regular caps, but they're nowhere *near* the performance of what we keep being offered. Nanotube infused designs with power to weight ratios around that of batteries, graphene designs, etc. There's a huge wealth of applications waiting for them to hit somewhere around those marks. Electric cars, actual car battery replacements, cellphone power supplies that never die, backup systems for the house with peak powers far in excess of anything we have now but with comparable storage... the ultracap "breakthroughs" are as regular as any other kind (memristors, etc.) and the no-show of actual commercially available units is just as consistent. It's the flying car of electronic components, sigh. High voltage, high capacity, high vapor factor, lol.

Believe me, I've been following the whole ultracap thing for a while. I even keep an eye on EEStor, which I can assure you has been a stupendous exercise in fruitless waiting. As a ham with a full boat of offline powered goodies and the beginnings of a household able to run off backup systems, and more than a little willingness to buy an electric car, actual availability of ultracaps in what I call "the battery range" would truly light me up.

But that carrot is well and truly still out on the stick.

--
I've fallen off your lawn, and I can't get up.

Re:Ultracaps by jkflying · 2013-04-02 23:55 · Score: 2

TI has a new range of super-low-power embedded chips which use FRAM, they are using it to replace flash and get faster writes, lower power consumption and higher write cycles before failure, so there's one new tech which made it to market and might become more popular over the coming years as it gets cheaper.
And even current-gen ultra-capacitors have a similar or better *power*/weight ratio as a battery - I'd like to see a 30g battery which can give 30A at 600V without damage to itself. It's the *energy*/weight ratio which is a killer - that 30A spike doesn't last long enough to be useful for the types of applications we currently use batteries for.
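
To put rough numbers on the power versus energy distinction, here is a small sketch. The 30 g, 30 A and 600 V figures come from the comment above; the specific-energy values (about 5 Wh/kg for an ultracapacitor, about 150 Wh/kg for a lithium-ion battery) are assumed ballpark figures, not datasheet values:

```python
def power_watts(amps, volts):
    """Instantaneous power delivered at a given current and voltage."""
    return amps * volts


def stored_joules(mass_kg, wh_per_kg):
    """Energy held by a device of the given mass and specific energy."""
    return mass_kg * wh_per_kg * 3600.0


def seconds_at_power(joules, watts):
    """How long the stored energy lasts at a constant power draw."""
    return joules / watts


spike = power_watts(30, 600)        # the 30 A at 600 V spike
cap = stored_joules(0.030, 5.0)     # 30 g ultracap, assumed 5 Wh/kg
batt = stored_joules(0.030, 150.0)  # 30 g battery, assumed 150 Wh/kg
print(spike, seconds_at_power(cap, spike), batt / cap)
```

Under those assumptions the capacitor can deliver an 18 kW spike but only for a few hundredths of a second, while the same mass of battery stores about thirty times the energy.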

--
Help I am stuck in a signature factory!
Re:Ultracaps by fyngyrz · 2013-04-03 08:36 · Score: 1

Yes, overall energy capacity, not peak power. My bad.

--
I've fallen off your lawn, and I can't get up.

Re:It's "more... THAN", stupid, stupid Americans.. by Anonymous Coward · 2013-04-02 13:06 · Score: 1

> ... about something as insignificant than that.

There. Broke that for you.

The PS/2 line stunk by justthinkit · 2013-04-02 14:43 · Score: 1

I think the grandparent's point is that IBM tried to be all slick and new and proprietary with the PS/2 line and only suckers -- big corp, gov., banks -- bought into it.

I inherited all kinds of PS/2s...excrement. At this time they were being sold with a _12_ inch "billiard ball" monochrome IBM monitor. I eventually upgraded all of them to Zenith totally flat color monitors.

PS/2s were wildly proprietary -- wee, we get to buy all new add-in cards! And performance dogs -- Model 30/286 FTW.

A newb reading the parent's post would think otherwise as you cite wiki and all.

PS/2s and OS/2, released around the same time frame, killed IBM. End of story.

--
I come here for the love

Re:So we can expect (Hope?) to see this in GDDR6 s by viperidaenz · 2013-04-02 14:49 · Score: 1

NVidia Volta, coming in 2016?

Re:It's "more... THAN", stupid, stupid Americans.. by gagol · 2013-04-02 15:57 · Score: 1

Come on, it is Anonymous Coward we are talking about! He has been around since the beginning and his UID is so low, it can't be shown ;-)

--
Tomorrow is another day...

Re:It's "more... THAN", stupid, stupid Americans.. by TheInternetGuy · 2013-04-02 16:43 · Score: 1

visibility++;

--
If my comment didn't sound as good in your head as it did in mine, then I guess we all know who's to blame

Re:It's "more... THAN", stupid, stupid Americans.. by sFurbo · 2013-04-02 18:29 · Score: 1

I believe the UID for AC is 666, though it isn't shown on his posts.

I for one.. by Anonymous Coward · 2013-04-03 02:06 · Score: 0

I for one will never let go of my Universe-bending 2-dimensional RAM.

114 comments