New Middleware Promises Dramatically Higher Speeds, Lower Power Draw For SSDs

Posted by timothy on Friday May 23, 2014 @11:30PM from the well-it-sounds-good dept.

mrspoonsi (2955715) writes "A breakthrough has been made in SSD technology that could mean drastic performance increases due to the overcoming of one of the major issues in the memory type. Currently, data cannot be directly overwritten onto the NAND chips used in the devices. Files must be written to a clean area of the drive whilst the old area is formatted. This eventually causes fragmented data and lowers the drive's life and performance over time. However, a Japanese team at Chuo University have finally overcome the issue that is as old as the technology itself. Officially unveiled at the 2014 IEEE International Memory Workshop in Taipei, the researchers have written a brand new middleware for the drives that controls how the data is written to and stored on the device. Their new version utilizes what they call a 'logical block address scrambler' which effectively prevents data being written to a new 'page' on the device unless it is absolutely required. Instead, it is placed in a block to be erased and consolidated in the next sweep. This means significantly less behind-the-scenes file copying that results in increased performance from idle."

Excuse my naiveté by HuguesT · 2014-05-23 23:55 · Score: 2

Could the incoming data be written first in either a RAM or SLC cache while the formatting is going on?
1. Re:Excuse my naiveté by Immerman · 2014-05-24 02:17 · Score: 1
  
  It could, but if you're writing large amounts of data (considerably larger than your write cache) that won't actually help much. It also doesn't change the number of erasures required to get the data written, which is the primary speed and power bottleneck.
  This technique is sort of like using that blank corner on a piece of scratch paper before you throw it away - the blank spot was there anyway, and by making a habit of reuse you can significantly reduce the number of fresh sheets of paper (erasures) that you need to write the same amount of data. Especially if you consider the fact that, given random data sizes, you are just as likely to have a block with 90% left unused as only 10%.
  
  --
  --- Most topics have many sides worth arguing, allow me to take one opposite you.
2. Re:Excuse my naiveté by Bengie · 2014-05-24 03:15 · Score: 1
  
  Only the last block of a file will have a "random" chance of usage.
3. Re:Excuse my naiveté by Immerman · 2014-05-24 03:46 · Score: 1
  
  Certainly - and if you're typically writing one huge file all at once this will have minimal benefit. But if you're filling the cache with lots and lots of small writes then this technique has potential.
  
  --
  --- Most topics have many sides worth arguing, allow me to take one opposite you.
4. Re:Excuse my naiveté by Jane+Q.+Public · 2014-05-24 05:33 · Score: 1
  
  Only the last block of a file will have a "random" chance of usage.
  Sure, BUT... blocks on SSDs can be as large as 16k and even larger. That's a lot of wasted space, especially if you have lots of small files.
  
  The real underlying issue here, though, is the number of lifetime write-cycles. Newer SSD technology (MLC in particular) actually made the number smaller, not larger, when it really, really must get larger before SSDs will be mature. That's the central reason why all these workarounds are necessary in the first place. And that's what they are: workarounds.
  
  Maybe the awkwardly-named memristor or some similar technology will replace it soon. Or maybe somebody will come up with a way to give cells more write cycles or even "infinite", as magnetic disks basically are. (Yes, I know it is not really infinite, but AFAIK there is no practical limit to write cycles.)
5. Re:Excuse my naiveté by OffTheWallSoccer · 2014-05-26 02:07 · Score: 1
  
  Not even close to practical. The magnetic disk manufacturers implemented wear leveling back when the drives were in the 200MB-range. Before that disks wore out even quicker than flash disks and I didn't even use swap-files then.
  There is a huge difference between unlimited number of writes and undefined number of writes.
  In critical applications, a bad number is better than an undefined one. At least you can calculate a life-time and design after that.
  No, sir. HDDs (at least up until I stopped writing FW for them in 1999) did not have any wear leveling algorithms. In other words, the translation of LBA to physical location on the media (sometimes called Physical Block Address or PBA) is fixed, other than for defective sectors which have been remapped. So if an O.S. wrote to a specific LBA or range of LBAs repeatedly (think paging/swap file or hibernate file), those PBAs would be written to more frequently (or at least at a different rate) than other PBAs across the drive.
crappy journalism as always by Anonymous Coward · 2014-05-23 23:56 · Score: 3, Informative

http://techon.nikkeibp.co.jp/english/NEWS_EN/20140522/353388/?SS=imgview_e&FD=48575398&ad_q
they came up with a better scheme for mapping logical to physical. however, the results aren't as good as all the news sources say.
Compared To What? by rsmith-mac · 2014-05-24 00:00 · Score: 5, Insightful

I don't doubt that the researchers have hit on something interesting, but it's hard to make heads or tails of this article without knowing what algorithms they're comparing it to. The major SSD manufacturers - Intel, Sandforce/LSI, and Samsung - all already use some incredibly complex scheduling algorithms to collate writes and handle garbage collection. At first glance this does not sound significantly different than what is already being done. So it would be useful to know just how the researchers' algorithm compares to modern SSD algorithms in both design and performance. TFA as it stands is incredibly vague.
1. Re:Compared To What? by Anonymous Coward · 2014-05-24 00:08 · Score: 1
  
  It was tails
Re:Wear leveling by anubi · 2014-05-24 00:14 · Score: 5, Informative

I was looking into that when I was checking out alternatives to sub-gigabyte hard drives to keep legacy systems ( DOS and the like ) alive.

Sandisk's CompactFlash memory cards ( intended for professional video cameras ) seemed to make great SSD's for older DOS systems when fitted with a CF to IDE adapter. I can format smaller CF cards to FAT16 ( using the DOS FDISK and FORMAT commands very similar to installing a raw magnetic drive ). With the adapter, the CF card looks and acts like a magnetic rotating hard drive. I had a volley of emails between SanDisk and myself, and the gist of it was they did not advertise using their product in this manner, and they did not want to get involved in support issues, but it should work. They told me they had wear leveling algorithms in place, which was the driving force behind my volley of emails with them. I was very concerned the File Allocation Table area would be very short lived because of the extreme frequency of it being overwritten. I would not like to give my client something that only works for a couple of months - that goes against everything I stand for.

So, I have a couple of SanDisk memories out there in the field on old DOS systems still running legacy industrial robotics... and no problems yet.

Apparently the SanDisk wear-leveling algorithms are working.

I can tell you this works on some systems, but not on others, and I have yet to figure out why. I can even format and have a perfectly operational CF in the adapter plate so it looks ( both physically and supposedly electronically ) like a magnetic IDE drive in one system ... but another system ( say an old IBM ThinkPad ) won't recognize it. However a true magnetic drive swaps out nicely - albeit the startup files may need to be changed from one system to another.

--
"Prove all things; hold fast that which is good." [KJV: I Thessalonians 5:21]
Re:Wear leveling by csirac · 2014-05-24 00:27 · Score: 4, Informative

Many industrial computers have CF-card slots for this very application. I put together a few MS-DOS systems using SanDisk CF cards around 8 years ago and they're still going strong, using a variant of one of these cards which has a CF slot built-in (so no need for a CF -> IDE adapter): PCA-6751
Original by GrahamJ · 2014-05-24 00:30 · Score: 2

In the original-ish article here they go into a bit more detail but the "conventional scheme" they're comparing against appears to be just straight mapping. It would be interesting to see how this stacks up against some of the more advanced schemes employed in today's SSDs.
Re:"causes fragmented data by jones_supa · 2014-05-24 00:31 · Score: 2

Fragmentation of data doesn't even affect the speed.
Is this completely true? Because benchmarks show that even SSDs can read larger chunks much faster than small ones. So if a big file exists mostly on adjacent flash cells, it would be faster to read? Of course operating system -level defragmentation might not be very useful because the physical data might be mapped into completely different areas due to wear leveling. Thus the drive would have to perform defragmentation internally.
Re:"causes fragmented data by leuk_he · 2014-05-24 00:42 · Score: 2

If data is fragmented over multiple blocks, it requires multiple reads. But this kind of fragmentation is not as bad as on an HDD, where you had a seek time of 7-8 ms. Matching the block size of the SSD to the block size of the FS is an effective performance enhancement.
Modern SSDs have read limits. Every 10,000 reads or so the data has to be refreshed. The firmware will do this silently.
Wear leveling by jones_supa · 2014-05-24 00:43 · Score: 1

Per how big data areas is wear leveling performed in an SSD? Maybe not for each 4kB block, because that would require hundreds of megabytes of extra data just for the remap pointers, if we assume that they each are 48 bits long. Also TRIM data (which blocks are "nuked" and not just zeroes) requires similar kind of extra space.
Re:"causes fragmented data by tomhath · 2014-05-24 00:46 · Score: 1

The linked article is pretty bad. This link has a little more information. Apparently the saving they claim comes from filling the pages that already have valid data more completely rather than writing to new pages within the same block (the reduced fragmentation claim); then the garbage collector has fewer pages to relocate when erasing that block (the speed-up claim). Of course if the garbage collection happens in the background the savings are moot.
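A rough way to see the claimed saving is to count how many pages the garbage collector would have to copy out of a block before erasing it. The sketch below is only a reading of the description above, not the researchers' algorithm: the page size, the write sizes and the function name `pages_to_relocate` are all assumptions, and it presumes the controller can append into a partially filled page, which is exactly what is being debated in this thread.

```python
# Rough model of the claim above: how many pages must garbage collection copy
# out of a block before erasing it? Sizes are assumptions, not from the paper.
import math

PAGE_BYTES = 8 * 1024   # assumed page size

def pages_to_relocate(write_sizes, pack):
    """Count pages holding valid data for a list of small writes.

    pack=False: each write takes its own page (the unused tail is wasted).
    pack=True:  writes are packed back to back into partially filled pages.
    """
    if pack:
        return math.ceil(sum(write_sizes) / PAGE_BYTES)
    return sum(math.ceil(size / PAGE_BYTES) for size in write_sizes)

small_writes = [2 * 1024] * 8          # eight 2 KiB writes, all still valid
print(pages_to_relocate(small_writes, pack=False))  # 8
print(pages_to_relocate(small_writes, pack=True))   # 2
```

With these made-up numbers the collector copies 2 pages instead of 8. If collection runs in the background, that shows up as less wear and power rather than lower latency.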
Re:Wear leveling by sribe · 2014-05-24 01:17 · Score: 2

Per how big data areas is wear leveling performed in an SSD? Maybe not for each 4kB block...
IIRC the erase/write block size is typically 128KB.
Re:Wear leveling by Anonymous Coward · 2014-05-24 01:42 · Score: 1

IBM ThinkPads want both ATA SECURITY and UNLOAD IMMEDIATE. If they don't detect it, they will bitch about it.
Re:Wear leveling by MarkRose · 2014-05-24 02:02 · Score: 1

It was 128 KB for smaller, older drives. For instance, the Samsung 840 EVO series use an erase block size of 2 MB. Some devices even have an 8 MB erase block size. 8 KB page sizes are common now, too, much like how spinning rust moved to 4 KB pages. Using larger pages and blocks allows for denser, cheaper manufacturing.

--
Be relentless!
Not wear leveling. by Immerman · 2014-05-24 02:08 · Score: 5, Interesting

Wear leveling is typically a system by which you write new data to the least-written empty block available, usually with some sort of data-shuffling involved to keep "stagnant" data from preventing wear on otherwise long-occupied sections. It sounds like this is a matter of not erasing the block first: For example if the end of a file has used 60% of a block and is then deleted, the SSD can still use the remaining 40% of the block for something else without first deleting it. Typically, as I understand it, once a block is written that's it until its page is erased - any unused space in a block remains unused for that erase cycle. This technique would allow all the unused bits at the end of the blocks to be reused without an expensive erase cycle, and then when the page is finally ready to be erased all the reused bits on the various blocks can be consolidated to fill a few fresh blocks.
It seems to me this could be a huge advantage for use cases where you have a lot of small writes so that you end up with lots of partially filled blocks. Essentially they've introduced variable-size blocks to the SSD so that one physical block can be reused multiple times before erasure, until all available space has been used. Since erasing is pretty much the slowest and most power-hungry operation on the SSD that translates directly to speed and power-efficiency gains.

--
--- Most topics have many sides worth arguing, allow me to take one opposite you.
1. Re:Not wear leveling. by Guspaz · 2014-05-24 06:39 · Score: 1
  
  You're incorrect. Writes can only happen at the page size, but there are multiple pages per block. If a block has unwritten pages, you can still write to the remaining pages.
2. Re:Not wear leveling. by Immerman · 2014-05-24 07:44 · Score: 1
  
  You are correct, I got the terms switched in line with the confusion in the summary. Reread with that in mind and I think you'll find the rest is in order (i.e. they are rewriting a partially used page)
  
  --
  --- Most topics have many sides worth arguing, allow me to take one opposite you.
3. Re:Not wear leveling. by Immerman · 2014-05-24 08:27 · Score: 1
  
  Are you certain? That sounds like what they're describing, and certainly the individual bits are capable (you're still just setting some of the bits that were reset in the last erase cycle); the rest is just the control hardware/software. It's the reset that needs to be handled specially; so long as you are only setting bits that haven't been altered since the last erase, there shouldn't be a problem. It seems to me that, at the crudest, you could simply read a partially filled block, add extra data to the unused (still erased) portion, and then re-write the block without erasing it first. The previously used portions would have the exact same data written to them, to no effect, and the freshly used portions would be updated from their reset status as per normal.
  
  --
  --- Most topics have many sides worth arguing, allow me to take one opposite you.
4. Re:Not wear leveling. by Immerman · 2014-05-24 15:54 · Score: 1
  
  Actually filesystems are typically *allocated* in 4k increments, but not necessarily *written* in such, it's easy enough on a magnetic drive to write only three bytes in the middle of a file, or only the bytes actually used in the last allocation block of each file, though caching systems may obscure that fact.
  As for the writing mechanism, you're right, it would likely be a bit more complicated. On further reflection I would suspect that they wouldn't bother reading a block at all, just write the new data with unset bits over all the old data (all zeros? all ones? whatever tells the hardware "leave it in the previous, normally erased, state"). Likely they'd also need to include partial-block checksums as well, but assuming checksums are calculated in firmware that's just an implementation detail. As for knowing how many bits are actually used in a block, I would think that would be easy enough, even just two status-bits per block would tell you whether the last 0,1,2,or 3/4 of the block are unused. Imprecise, but a little under-utilization is unlikely to be a major problem.
  
  --
  --- Most topics have many sides worth arguing, allow me to take one opposite you.
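The two-status-bits idea above can be sketched in a few lines. This is purely illustrative: `free_quarters`, `append_offset` and the 8 KiB unit size are hypothetical names and numbers, and nothing here says real controllers track free space this way.

```python
# Two status bits per unit: how many whole quarters at the tail are unused.
# Function names and the unit size are hypothetical.

def free_quarters(unit_bytes, used_bytes):
    """Encode unused tail space as a 2-bit value: 0..3 whole quarters free."""
    if not 0 <= used_bytes <= unit_bytes:
        raise ValueError("used_bytes out of range")
    quarter = unit_bytes // 4
    return min(3, (unit_bytes - used_bytes) // quarter)

def append_offset(unit_bytes, status):
    """First byte offset that is safe to append at, given the 2-bit status."""
    return unit_bytes - status * (unit_bytes // 4)

UNIT = 8 * 1024
status = free_quarters(UNIT, used_bytes=3000)
offset = append_offset(UNIT, status)
print(status)         # 2
print(offset)         # 4096
print(offset - 3000)  # 1096 bytes left unused by the coarse bookkeeping
```

The last number is the "little under-utilization" mentioned above: the status is rounded down to whole quarters, so some free bytes before the append offset are never reused.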
Re:"causes fragmented data by K.+S.+Kyosuke · 2014-05-24 02:13 · Score: 1

Because benchmarks show that even SSDs can read larger chunks much faster than small ones
Well, why shouldn't prefetching and large block reading work on the SSD controller level? I assume that Flash chips are still slower than DRAMs, and the controller has to do some ECC work, not to mention figuring out where to read from (which may also be something that is kept in Flash chips in non-volatile form, unless you want your logical-to-physical mapping completely scrambled when the drive is turned off). So prefetching the data into the controller's memory should help hide latencies even if the Flash chips themselves are true RAM chips, as in the equal random access time to all blocks thingy, and it should work most efficiently when reading operations are performed on larger chunks at once.

--
Ezekiel 23:20
Re:"causes fragmented data by ArcadeMan · 2014-05-24 02:44 · Score: 1

If you're reading a lot of small files, that's a lot of open/read/close commands. If you're reading a big file, that's one open command, multiple sequential read commands, one close command.
And if it's anything like SPI, there's not even multiple read commands, you just keep clocking to read the data sequentially.

--
Get free satoshi (Bitcoin) and Dogecoins
Re:"causes fragmented data by AdamHaun · 2014-05-24 03:05 · Score: 1

I'm not sure about NAND flash, which is a block device, but in NOR flash sequential reads are faster due to prefetching, where the next memory word is read before the CPU has finished processing the first one. For NAND, I'd imagine you could start caching the next page. Not sure if that's actually done, though.

--
Visit the
Not a word of that is true by slashmydots · 2014-05-24 03:23 · Score: 1

"Currently, data cannot be directly overwritten onto the NAND chips used in the devices. Files must be written to a clean area of the drive whilst the old area is formatted"
Am I the only one that knows that's not remotely true? I don't even know where to start. So the SSD wants to write to location 0x00000032 but it's occupied by old data. First of all, no it isn't. TRIM already took care of that. But let's say you're using the SSD in Windows XP so TRIM doesn't work. So they claim the SSD writes data to a blank location on the drive temporarily, then erases the original intended location and later moves it back to that location to be contiguous? What's so damn special about that location? Just leave it in the blank location. They claim that causes fragmentation, which has no impact on the performance of an SSD in any way.

This is a useless invention from people who don't know how SSDs work.
1. Re:Not a word of that is true by OffTheWallSoccer · 2014-05-26 02:36 · Score: 1
  
  So they claim the SSD writes data to a blank location on the drive temporarily, then erases the original intended location and later moves it back to that location to be contiguous? What's so damn special about that location? Just leave it in the blank location. They claim that causes fragmentation, which has no impact on the performance of an SSD in any way.
  This is a useless invention from people who don't know how SSDs work.
  You are correct. SSDs don't have a fixed LBA-to-physical arrangement, so host rewrites of an LBA will normally go to a new (erased) NAND location, with the drive updating its internal LBA map automatically (I.e. no need for TRIM of that LBA).
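The remapping described above can be shown with a toy translation layer. This is a minimal sketch under stated assumptions: the class `ToyFTL` and its fields are hypothetical, it works on single pages, and it ignores erase blocks, garbage collection and persistence of the map, all of which real firmware has to handle.

```python
# Toy flash translation layer: every host write goes to a fresh page and the
# LBA map is repointed. Class and field names are hypothetical.

class ToyFTL:
    def __init__(self, num_pages):
        self.lba_map = {}                   # LBA -> physical page holding it now
        self.stale = set()                  # physical pages with superseded data
        self.free = list(range(num_pages))  # erased pages, in allocation order
        self.pages = {}                     # physical page -> data

    def write(self, lba, data):
        if not self.free:
            raise RuntimeError("no erased pages left; garbage collection needed")
        new_page = self.free.pop(0)         # never overwrite in place
        old_page = self.lba_map.get(lba)
        if old_page is not None:
            self.stale.add(old_page)        # old copy is garbage, reclaimed later
        self.pages[new_page] = data
        self.lba_map[lba] = new_page

    def read(self, lba):
        return self.pages[self.lba_map[lba]]

ftl = ToyFTL(num_pages=8)
ftl.write(0x32, b"old")
ftl.write(0x32, b"new")   # same LBA, different physical page
print(ftl.lba_map[0x32])  # 1
print(sorted(ftl.stale))  # [0]
print(ftl.read(0x32))     # b'new'
```

Nothing is ever moved "back" to the original location. The host keeps using LBA 0x32 and only the map entry changes.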
Re:Wear leveling by Bengie · 2014-05-24 03:27 · Score: 1

Tracking 4KB blocks wouldn't be that bad for meta data. Like you said, assume 48bit pointers, then some extra metadata, so 64bit, which is 8 bytes. 1GB is 262,144 4KB blocks, which is only about 2MB of metadata per 1GB, which is only 0.2% overhead. They over-provision something like 10%-30% just for wear leveling.
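The estimate above is easy to check. The sketch below just redoes the arithmetic; the function name `mapping_overhead` is made up, and the 4 KiB unit and 8-byte entry are the figures assumed in the comment, not the parameters of any particular drive.

```python
# Back-of-the-envelope mapping-table overhead for 4 KiB tracking units.
# All constants are illustrative assumptions taken from the estimate above.

GIB = 1024 ** 3

def mapping_overhead(capacity_bytes, unit_bytes=4096, entry_bytes=8):
    """Return (entries, table_bytes, overhead_fraction) for a flat map."""
    entries = capacity_bytes // unit_bytes
    table_bytes = entries * entry_bytes
    return entries, table_bytes, table_bytes / capacity_bytes

entries, table_bytes, fraction = mapping_overhead(1 * GIB)
print(entries)                   # 262144 mapping entries per GiB
print(table_bytes / 1024 ** 2)   # 2.0 MiB of metadata per GiB
print(round(fraction * 100, 3))  # 0.195 percent overhead
```

With 6-byte (48-bit) entries and an assumed 256 GiB drive, `mapping_overhead(256 * GIB, entry_bytes=6)` gives a 384 MiB table, which is presumably the "hundreds of megabytes" worry raised earlier. As a share of capacity it is still only about 0.15%.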
The problem with this article... by AcquaCow · 2014-05-24 03:48 · Score: 1

...is that in a properly-designed SSD, there is no such thing as data fragmentation. You lay out the nand as a circular log and write to every bit of it once before you overwrite, and maintain a set of pointers that translates LBA to memory addresses.
Pretty much every SSD vendor out there has figured this out a few years ago.

--

up 12 days, 22:30, 2 users, load averages: 993.20, 994.21, 994.56
*makes note to limit user processes...
1. Re:The problem with this article... by Bengie · 2014-05-24 04:42 · Score: 1
  
  I may be missing something, but if you have a circular log and the head meets the tail, how can you not start fragmenting to fill the holes in the log? My understanding of circular logs is you just start writing over the oldest data, which you cannot do with permanent storage.
2. Re:The problem with this article... by tlhIngan · 2014-05-24 05:24 · Score: 1
  
  I may be missing something, but if you have a circular log and the head meets the tail, how can you not start fragmenting to fill the holes in the log? My understanding of circular logs is you just start writing over the oldest data, which you cannot do with permanent storage.
  That's where overprovisioning and write-amplification come into play. The head NEVER meets the tail - the circular log is larger than the advertised size. E.g., a 120GB (120,000,000,000 byte) SSD would have 128GiB of flash. That difference is over-provisioning (and even older 128GB SSDs had 128GiB of flash). That overprovisioning accounts for bad blocks (up to 2% of flash is bad when new!), as well as ensuring there is a safe "landing zone" for new data, storage of the FTL tables (the "middleware" the article talks about), etc.
  So there is always more physical storage available than exposed, and a periodic thread in the SSD firmware reclaims blocks that have been TRIMmed or overwritten (i.e., marked "dirty") by cleaning up the head and moving the unchanged data to the tail. (You need to move unchanged data too otherwise slowly changing areas of disk will not wear evenly).
  The write amplification happens then - because you're causing more data to be written when no writes were issued by the host - writes of the data itself, and writes to the FTL tables to point to the new data location.
  Corruption of the FTL tables is serious business - it's the primary cause of SSD failure, and easily repairable too (doing an ATA SECURE ERASE forces a reinitialization of the tables, putting the SSD back to full operation, at a loss of user data).
  The real innovations for SSDs would be to be able to search FTL tables faster, update them safer, and lessen their susceptibility to corruption.
  (Modern SSDs are bottlenecked by SATA3, hence the move to PCIe SSDs).
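The GB-versus-GiB gap mentioned above can be put into numbers. A small sketch, assuming the usual definition of over-provisioning as spare capacity divided by exposed capacity; the function name is hypothetical and the two drive sizes are the ones from the comment.

```python
# Over-provisioning implied by decimal-GB capacity on binary-GiB flash.

GB = 10 ** 9    # decimal gigabyte, as advertised
GIB = 2 ** 30   # binary gibibyte, as flash is manufactured

def over_provisioning(raw_bytes, exposed_bytes):
    """Return (spare_bytes, spare / exposed)."""
    spare = raw_bytes - exposed_bytes
    return spare, spare / exposed_bytes

spare_120, ratio_120 = over_provisioning(128 * GIB, 120 * GB)
print(round(spare_120 / GB, 2))   # 17.44 GB held back
print(round(ratio_120 * 100, 1))  # 14.5 percent

spare_128, ratio_128 = over_provisioning(128 * GIB, 128 * GB)
print(round(ratio_128 * 100, 1))  # 7.4 percent
```

So even a drive sold as 128GB keeps roughly 7% in reserve, and the 120GB model keeps about twice that, before subtracting whatever goes to bad blocks and FTL tables.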
Re:"causes fragmented data by gregben · 2014-05-24 03:51 · Score: 1

> Modern SSDs have read limits. Every 10,000 reads or so the data has to be refreshed. The firmware will do this silently.
Please provide reference(s). I have never seen any indication of this, or at least there is no read limit for the flash memory itself. You can read from it indefinitely just like static RAM, without "refresh" as required for DRAM.
LWN? by JSG · 2014-05-24 04:07 · Score: 1

Have I stumbled into a new green themed version of LWN? The comments here are far too insightful and interesting for the usual /. fare. Can't even find the frist post.
Re:"causes fragmented data by altstadt · 2014-05-24 04:47 · Score: 1

Google: flash read disturb
The Micron presentation is rather old, but gives a good overview of how Flash works.
Re:Make them work first by Jane+Q.+Public · 2014-05-24 05:39 · Score: 1

That *IS* the basic problem.

However, they have gotten "good enough" for most use cases. Though I agree that it is in the "just barely" category. Limited rewrites are their one major problem at this time. If that can be improved, it would be a great advance for us all.
Re:Make them work first by regular_gonzalez · 2014-05-24 05:59 · Score: 1

You're in luck, as that time is right now!
http://techreport.com/review/2...

tl;dr - the Samsung 840 series is the only drive to really suffer problems but that's strictly relatively speaking; it's allocating from reserve capacity and to reach the point it's at now you'd have to have 150 gb of writes per day for 10 years, which is probably at least an order of magnitude higher than even a heavy standard user. And that's the consumer version -- the Intel ssd, aimed more at production / business environments, fares even better. Which mechanical hard drives do you use that support 150 gb of daily writes for 10 years?

--
Due to circumstances beyond my control, I am master of my fate and captain of my soul.
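For scale, the 150 GB per day figure quoted above works out as follows. The helper name is made up and leap days are ignored.

```python
# Total data written at a steady daily rate, in decimal terabytes.

def total_writes_tb(gb_per_day, years, days_per_year=365):
    return gb_per_day * days_per_year * years / 1000

print(total_writes_tb(150, 10))  # 547.5
print(total_writes_tb(15, 10))   # 54.75
```

The second line is the same calculation for a workload one order of magnitude lighter, which is where the comment above places a heavy ordinary user.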
Already being done by dutchwhizzman · 2014-05-24 08:12 · Score: 2

Most flash drives have some RAM cache and most erasing is done as a background task by the on-board firmware of the drive. Part of flash drive reliability has to do with having big enough capacitors on board so a power failure will allow the drive to write enough data to flash to have a consistent state for at least its own bookkeeping data on blocks and exposed data. The enterprise ones usually have enough capacitors to write all data to flash that has been reported to the OS as "we wrote this to the drive" on top of that.
The big difference here seems to be that they don't erase at block level any more, and a change to just a few bytes in a block doesn't lead to the whole block in its new iteration being written to an empty block and the old block being tagged with a "trim". While this is beneficial for throughput, you have to make certain you will not do this indefinitely, since wear-leveling algorithms aren't used for nothing. You'll still need to do a certain percentage of rewrites, or keep count of the number of rewrites to the same block and, once your counter hits a limit, do a rewrite of the entire block to a "fresh" location.

--
I was promised a flying car. Where is my flying car?
bad idea - thrashing directory blocks by dltaylor · 2014-05-24 11:43 · Score: 1

I've written drivers for solid state media. It is a cost to find the "next available block" for incoming data. Often, too, it is necessary to copy the original instance of a media block to merge new data with the old. Then, you can toss the old block into a background erase queue, but the copy isn't time-free, either.
Since so-called Smart Media didn't have any blocks dedicated to the logical-physical mapping (It was hidden in a per-physical-block logical id), there was also a startup scan required.
If the middleware is constantly trying to use the same physical block to represent a logical block (something even rotating media is giving up on), the physical block is going to take a pounding if it is used for frequently-updated storage. Losing a directory block due to cell damage is not my idea of a good thing.
What I suspect they're really trying to do is reduce the number of blocks dedicated to logical-physical mapping. That lets them ship more parts with from-fab defective blocks at a given capacity out of the die.
Re:Wear leveling by anubi · 2014-05-24 13:24 · Score: 1

Thanks!

I was wondering why my ThinkPads would not see these.

--
"Prove all things; hold fast that which is good." [KJV: I Thessalonians 5:21]
Re:Wear leveling by anubi · 2014-05-24 13:32 · Score: 1

Thank you for the link, CSI! I did not know about that one. It looks like a very handy little board that can retrofit into other ISA systems. ( Yes, I can get desperate enough to fire up Eagle and layout a custom ISA motherboard for something like this if the dying dinosaur is important enough ).

--
"Prove all things; hold fast that which is good." [KJV: I Thessalonians 5:21]
Re:Wear leveling by csirac · 2014-05-24 20:06 · Score: 1

I believe Advantech will still happily sell you ISA backplanes. At the same time I put these things together, I had to reverse-engineer and fabricate some old I/O cards which had "unique" (incompatible with readily available cards) interrupt register mappings, also with EAGLE - great software!
I should mention: the MS-DOS system has outlasted three replacement attempts (two windows-based applications were from the original vendor who sold the MS-DOS system). There's just something completely unbreakable about the old stuff.
Re:"causes fragmented data by OffTheWallSoccer · 2014-05-26 02:23 · Score: 1

GP is correct about read disturb. NAND vendors will specify specific policy for a given part, but it is typically N reads to a particular area (i.e. one block, which is 256 or 512 or 1024 pages) then requires erasing that area. So even if page 7 in a block is never read, but page 100 is read a lot, the drive will have to rewrite that whole block eventually.
(I work for a NAND controller vendor.)
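The per-block policy described above amounts to a read counter that is reset whenever the block is rewritten. A minimal sketch, with the caveat that `ReadDisturbTracker` is a hypothetical name and the 10,000-read limit is just the round figure quoted earlier in the thread; real parts specify their own limits.

```python
# Per-block read-disturb bookkeeping. The threshold and names are hypothetical.

class ReadDisturbTracker:
    def __init__(self, num_blocks, read_limit):
        self.read_limit = read_limit
        self.reads = [0] * num_blocks     # reads since each block was last erased

    def record_read(self, block):
        """Count a page read; return True when the block is due for a rewrite."""
        self.reads[block] += 1
        return self.reads[block] >= self.read_limit

    def block_rewritten(self, block):
        self.reads[block] = 0             # data moved elsewhere and block erased

tracker = ReadDisturbTracker(num_blocks=4, read_limit=10_000)
due = False
for _ in range(10_000):
    due = tracker.record_read(block=2)    # hammer a single page in block 2
print(due)               # True
print(tracker.reads[0])  # 0
tracker.block_rewritten(2)
print(tracker.reads[2])  # 0
```

The counter is per block rather than per page, which matches the point above: reading page 100 over and over eventually forces a rewrite of the whole block, page 7 included.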