Software SSD Cache Implementation For Linux?

Posted by timothy on Thursday April 22, 2010 @09:37AM from the only-obscure-if-you-think-it-is dept.

Annirak writes "With the bottom dropping out of the magnetic disk market and SSD prices still over $3/GB, I want to know if there is a way to get the best of both worlds. Ideally, a caching algorithm would store frequently used sectors, or sectors used during boot or application launches (hot sectors), on the SSD. Adaptec has a firmware implementation of this concept, called MaxIQ, but this is only for use on their RAID controllers and only works with their special, even more expensive, SSD. Silverstone recently released a device which does this for a single disk, but it is limited: it caches the first part of the magnetic disk, up to the size of the SSD, rather than caching frequently used sectors. The FS-Cache implementation in recent Linux kernels seems to be primarily intended for use in NFS and AFS, without much provision for speeding up local filesystems. Is there a way to use an SSD to act as a hot sector cache for a magnetic disk under Linux?"

ZFS by Anonymous Coward · 2010-04-22 09:45 · Score: 5, Informative

ZFS can do this (http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#Separate_Cache_Devices) but I don't know about zfs-fuse
ZFS L2ARC by jdong · 2010-04-22 09:46 · Score: 5, Informative

Not Linux per se, but the same idea is implemented nicely on ZFS through its L2ARC: http://blogs.sun.com/brendan/entry/test
bcache by Wesley+Felter · 2010-04-22 09:52 · Score: 5, Informative

http://lkml.org/lkml/2010/4/5/41
I'm a little surprised at the lack of response on linux-kernel.
Solaris and DragonFly have already implemented this feature; I'm surprised that Linux is so far behind.
1. Re:bcache by Kento · 2010-04-22 10:02 · Score: 5, Informative
  
  Hey, at least someone noticed :)
  That version was pretty raw. The current one is a lot farther along than that, but it's still got a ways to go - I'm hoping to have it ready for inclusion in a few months, if I can keep working on it full time. Anyone want to fund me? :D
Waste of time by onefriedrice · 2010-04-22 09:54 · Score: 5, Informative

What a waste of time. Just put /home on a magnetic disk and everything else on the SSD. This way, you can get away with a small (very affordable) SSD for your binaries, libraries, config files, and app data, and use tried and true magnetic for your important files. Your own personal files don't need to be on a super fast disk anyway because they don't get as much access as you would think, but your binaries and config files get accessed a lot (unless you have a lot of RAM to cache that, which I also recommend). I've been doing this for over a year and enjoying 10-second boots and near-instant cold starts of programs (including OpenOffice and Firefox).

I personally fit all my partitions except /home in only 12.7GB (the SSD is 30GB). Seriously, best upgrade ever. I will never put my root partition on a magnetic drive ever again.

--
This author takes full ownership and responsibility for the unpopular opinions outlined above.
Go for Hardware implemented Caching by iammani · 2010-04-22 09:55 · Score: 1, Informative

Hardware implemented caching is the way to go. There are 'hybrid drives' available now, which automatically cache disk access to SSD. These are very specific for the task and way more efficient than any software implementation.
Re:I don't get it by Anonymous Coward · 2010-04-22 09:55 · Score: 5, Informative

The idea is to use the SSD as a second-level disk cache. So instead of simply discarding cached data under memory pressure, it's written to the SSD. It's still way slower than RAM, but it's got much better random-access performance characteristics than spinning rust and it's large compared to RAM.
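A toy model may make the "second-level cache" idea concrete. The sketch below is only an illustration in Python: the class name, the tier sizes, and the dictionary standing in for the magnetic disk are all made up for the example, and it is not how L2ARC or any real kernel implements this. It shows a small LRU tier (RAM) that demotes evicted blocks to a larger LRU tier (SSD) instead of throwing them away.

```python
from collections import OrderedDict


class TwoTierCache:
    """Toy model: a small LRU tier (RAM) that demotes evicted blocks to a
    larger LRU tier (SSD) instead of discarding them."""

    def __init__(self, ram_blocks, ssd_blocks, backing_store):
        self.ram = OrderedDict()   # block number -> data, most recent last
        self.ssd = OrderedDict()
        self.ram_blocks = ram_blocks
        self.ssd_blocks = ssd_blocks
        self.disk = backing_store  # hypothetical stand-in for the magnetic disk
        self.stats = {"ram": 0, "ssd": 0, "disk": 0}

    def read(self, block):
        if block in self.ram:
            self.ram.move_to_end(block)
            self.stats["ram"] += 1
            return self.ram[block]
        if block in self.ssd:
            data = self.ssd.pop(block)   # promote back into RAM
            self.stats["ssd"] += 1
        else:
            data = self.disk[block]      # slowest path
            self.stats["disk"] += 1
        self._insert_ram(block, data)
        return data

    def _insert_ram(self, block, data):
        self.ram[block] = data
        if len(self.ram) > self.ram_blocks:
            victim, victim_data = self.ram.popitem(last=False)
            # Under memory pressure the victim goes to the SSD tier
            self.ssd[victim] = victim_data
            if len(self.ssd) > self.ssd_blocks:
                self.ssd.popitem(last=False)   # SSD tier is full: drop oldest


disk = {n: f"block-{n}" for n in range(100)}
cache = TwoTierCache(ram_blocks=2, ssd_blocks=4, backing_store=disk)
for n in (0, 1, 2, 3, 0, 1):
    cache.read(n)
print(cache.stats)
```

With only two RAM slots, the second reads of blocks 0 and 1 would have gone back to the disk; here they are served from the SSD tier instead.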
As for how to do it in Linux, I'm not aware of a way. If you are open to the possibility of using other operating systems, this functionality is part of OpenSolaris (google for "zfs l2arc" for more information).

Cache Devices
Devices can be added to a storage pool as "cache devices."
These devices provide an additional layer of caching between
main memory and disk. For read-heavy workloads, where the
working set size is much larger than what can be cached in
main memory, using cache devices allow much more of this
working set to be served from low latency media. Using cache
devices provides the greatest performance improvement for
random read-workloads of mostly static content.
To create a pool with cache devices, specify a "cache" vdev
with any number of devices. For example:
# zpool create pool c0d0 c1d0 cache c2d0 c3d0
The content of the cache devices is considered volatile, as
is the case with other system caches.
You can also use it as an intent log, which can dramatically improve write performance:

Intent Log
The ZFS Intent Log (ZIL) satisfies POSIX requirements for
synchronous transactions. For instance, databases often
require their transactions to be on stable storage devices
when returning from a system call. NFS and other applications
can also use fsync() to ensure data stability. By
default, the intent log is allocated from blocks within the
main pool. However, it might be possible to get better
performance using separate intent log devices such as NVRAM or
a dedicated disk. For example:
# zpool create pool c0d0 c1d0 log c2d0
Multiple log devices can also be specified, and they can be
mirrored. See the EXAMPLES section for an example of mirroring
multiple log devices.
Log devices can be added, replaced, attached, detached, and
imported and exported as part of the larger pool. Mirrored
log devices can be removed by specifying the top-level mirror
for the log.
Re:I don't get it by Unit3 · 2010-04-22 09:56 · Score: 1, Informative

No. Swap is not a cache. Swap holds things that don't fit in RAM. I/O cache will never hit swap, it limits itself to physical RAM.

--
-- sudo.ca
Re:isn't 40 GB enough for applications? by Unit3 · 2010-04-22 09:57 · Score: 4, Informative

They are huge for larger applications. Database servers, for instance, can see transactions per second increase on the order of 10-20x when using a scheme like this for datasets that are too large to fit in RAM.

--
-- sudo.ca
Re:I don't get it by Anonymous Coward · 2010-04-22 10:04 · Score: 1, Informative

http://leaf.dragonflybsd.org/cgi/web-man?command=swapcache&section=ANY
People forgot the low-level Linux stuff quickly. by guruevi · 2010-04-22 10:12 · Score: 2, Informative

First of all, you can do this with ZFS, which is newer tech and works quite well but is not (ever going to be) implemented in the Linux kernel.
For lower tech, you can do it the same way we used to do back when hard drives were small. In order to prevent people from filling up the whole hard drive we used to have partitions (now we just pop in more/larger drives in the array). /boot and /var would be in the first parts of the hard drive where the drive was fastest. /home could even be on another drive.
You could do the same: put /boot and /usr on your SSD (or whatever you want to be fastest). If you have an X25-E or another fast-writing SSD you could put /var on there (for log, tmp, etc., if you have a server), or, if you are short on RAM, make it a swap drive. If you have small home folders, you could even put /home on there and leave your MP3s in /opt or so.

--
Custom electronics and digital signage for your business: www.evcircuits.com
Re:Buffers? by MobyDisk · 2010-04-22 10:14 · Score: 2, Informative

No.
You would buffer on an SSD differently than you would do it in memory. Memory is volatile, so you write back to disk as fast as possible. And whenever you cache something, you trade valuable physical memory for cache memory. With an SSD, you could cache 10 times as much data (Flash is much cheaper than DRAM), you would not have to write it back immediately (since it is not volatile), and the cache would survive a reboot so it could also speed the boot time.
dm-cache by Gyver_lb · 2010-04-22 10:15 · Score: 3, Informative

google dm-cache. Not updated since 2.6.29 though.
Re:Counter-Productive by pwnies · 2010-04-22 10:20 · Score: 3, Informative

Sadly, nowadays this is a myth. Current MLC and SLC SSD's have (on average) 10,000 and 100,000 writes (respectively) before any bitwear will occur. While this number is small, remember that all modern mainstream SSD's have wear leveling algorithms built into the controller. Intel rates their drives' minimum useful life at 5 years [pdf link - page 10], with an estimated life of 20 years. Note that this number is based on 20GB of writes per day, every day. SSD's nowadays will have no problems with acting as a cache for the system.
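The arithmetic behind that claim can be sketched in a few lines of Python. The 20GB/day, 5-year, and 10,000-cycle figures come from the comment above; the 80GB capacity, the write amplification factor of 1.0, and the assumption of perfect wear leveling are illustrative guesses, so treat the result as a rough estimate rather than a vendor number.

```python
def full_drive_writes(gb_per_day, years, capacity_gb, write_amplification=1.0):
    """Average program/erase cycles per cell, assuming perfect wear leveling.

    write_amplification=1.0 is an optimistic assumption; real controllers
    write somewhat more to flash than the host asks for.
    """
    total_written_gb = gb_per_day * 365 * years * write_amplification
    return total_written_gb / capacity_gb


# 80 GB is a hypothetical drive size chosen for the example
cycles = full_drive_writes(gb_per_day=20, years=5, capacity_gb=80)
print(f"{cycles:.0f} cycles used of roughly 10,000 (MLC)")
```

Even if the write amplification were several times higher, the total would stay well below the 10,000-write figure quoted for MLC.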
Re:I don't get it by Jezza · 2010-04-22 10:24 · Score: 1, Informative

Assuming the SSD was faster at both read and write - it should speed things up. Hell just moving the swap onto a different physical disk helps. But don't. SSD have a limited life, in a different sense to spinning disks. SSD wear with writing, so if you constantly write to the same "sectors" they will fail. If you think about what's happening when the system is swapping - that's exactly what's going on. So yes, it'll help (a bit) but it's really expensive given what will happen to the SSD. Better is add RAM, so the system won't need to swap (with enough RAM you don't need swap at all).
Re:I don't get it by TheRaven64 · 2010-04-22 10:30 · Score: 4, Informative

The submitter wants something like ZFS's L2ARC, which uses the flash as an intermediate cache between the RAM cache and the disk. This works very well for a lot of workloads. Since Linux users appear to be allowed to say 'switch to Linux' as an answer to questions about Windows, it only seems fair that 'switch to Solaris or FreeBSD' would be a valid solution to this problem.

--
I am TheRaven on Soylent News
Re:I don't get it by Anonymous Coward · 2010-04-22 10:47 · Score: 2, Informative

SSD wear with writing, so if you constantly write to the same "sectors" they will fail.
2006 called, they want their FUD back. While it's true that erase blocks in flash memory wear out with use, the whole battle between SSD manufacturers for the last couple years has been in mapping algorithms that ensure you don't hit the same erase block very often. By now, SSDs have longer lifetimes than HDDs. Of course that applies to real SSDs, not makeshift IDE-to-CompactFlash adapters.
Re:I don't get it by EvanED · 2010-04-22 10:55 · Score: 2, Informative

So yes, it'll help (a bit) but it's really expensive given what will happen to the SSD. Better is add RAM, so the system won't need to swap (with enough RAM you don't need swap at all).
A RAM buffer cache and SSD cache address far different issues. The buffer cache is far faster when it hits, but the SSD cache is far larger. It's pretty easy to find workloads where getting enough RAM so that your working set will fit into your buffer cache (alongside the memory use of whatever you're doing) would be more expensive than getting at least a cheap, small SSD. (You can get a cheap 30 GB OCZ drive for about the price of 4 GB of RAM.) Your buffer cache can't survive between boots, while an SSD cache would (though an SSD swap partition wouldn't; not really).
Finally, SSD wear is, I think, overstated. Even with quite heavy write activity, current SSDs will last years, and I suspect adding a 30 GB SSD cache would be a bigger help in 5 years than adding 4 GB of RAM now would, at least in many cases.
Saying "better is to add RAM" is way too simplistic an answer.
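One way to see why the two caches complement each other is to compute the expected read latency with and without an SSD tier. This is a back-of-the-envelope sketch in Python; the latencies and hit ratios are illustrative assumptions, not measurements of any particular hardware.

```python
def average_latency_ms(ram_hit, ssd_hit, ram_ms, ssd_ms, disk_ms):
    """Expected latency for a read checked against RAM, then SSD, then disk.

    ram_hit is the fraction of reads served from RAM; ssd_hit is the
    fraction of the remaining reads that are served from the SSD.
    """
    miss_ram = 1.0 - ram_hit
    return (ram_hit * ram_ms
            + miss_ram * ssd_hit * ssd_ms
            + miss_ram * (1.0 - ssd_hit) * disk_ms)


# Illustrative latencies only (assumptions, not benchmarks)
RAM_MS, SSD_MS, DISK_MS = 0.0001, 0.1, 10.0

ram_only = average_latency_ms(0.50, 0.00, RAM_MS, SSD_MS, DISK_MS)
with_ssd = average_latency_ms(0.50, 0.90, RAM_MS, SSD_MS, DISK_MS)
print(f"RAM cache only: {ram_only:.3f} ms, with SSD tier: {with_ssd:.3f} ms")
```

Under these assumptions the SSD tier helps most when the working set is too big for RAM but the SSD catches most of the RAM misses, which is the workload described above.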
Wrong. Swap often acts as a cache. by Anonymous Coward · 2010-04-22 11:08 · Score: 1, Informative

I've seen several other blatantly wrong comments from you for this story, and you clearly don't understand what caches are, and how they work. It's quite easy for swap to be a cache.
An example of this in nearly every personal computer is information read from spinning plastic disc media, like CD-ROMs or data DVDs. Typically, the OS will read data from the plastic disc, and store it in memory. If the memory usage becomes tight, that data from the spinning plastic disc will be swapped out to a magnetic disk drive.
This is caching, because you're storing the data from a comparatively slower medium (the CD-ROM or DVD disc) on a comparatively faster medium (the magnetic hard drive). If the data is needed, it's retrieved from the faster magnetic disk drive, rather than the slower spinning plastic disc drive. Thus we have caching.
1. Re:Wrong. Swap often acts as a cache. by m.dillon · 2010-04-22 13:20 · Score: 4, Informative
  
  The way DragonFly's swapcache works is that VM pages (cached in ram) go from the active queue to the inactive queue to the cache (almost free) queue to the free queue. VM pages sitting in the inactive queue are subject to being written out to the swapcache. VM pages in the active queue (or cache or free queues) are not considered.
  In other words, simply accessing cacheable data or meta-data from the hard drive does not itself trigger writing to the SSD swapcache. It's only when the cached VM pages are pushed out of the active queue due to memory pressure and are clearly heading out the door that DragonFly decides to write them to the SSD.
  This prevents SSD write activity from interfering with the operation of the production system and also tends to do a good job selecting what data to write to the SSD when and what data not to. A file which is in constant use by the system just stays in ram, there's no point writing it out to the SSD.
  With respect to deciding what data to cache and what data not to, with meta-data it's simple. You cache as much meta-data as you can because every piece of meta-data gives you a multiplicative performance improvement. With file data it is harder since you don't want to try to cycle e.g. a terabyte of data through a 40G swapcache. The production system's working data set at any given moment needs to either fit in the swapcache or you need to carefully select which directory topologies you want to cache.
  -Matt
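The queue behaviour described in this comment can be mimicked with a very rough Python model. This is a sketch only: the class, its method names, and the fixed active-queue limit are invented for illustration, the cache and free queues are collapsed into a single step, and nothing here reflects DragonFly's actual code.

```python
from collections import deque


class PageQueues:
    """Toy model: pages fall from the active queue to the inactive queue,
    and only inactive pages are written to the SSD swapcache."""

    def __init__(self, active_limit):
        self.active = deque()
        self.inactive = deque()
        self.active_limit = active_limit   # stands in for memory pressure
        self.ssd_writes = []

    def touch(self, page):
        # Any access (re)activates the page
        for queue in (self.active, self.inactive):
            if page in queue:
                queue.remove(page)
        self.active.append(page)
        # Pressure pushes the least recently used pages to inactive
        while len(self.active) > self.active_limit:
            self.inactive.append(self.active.popleft())

    def pageout_scan(self):
        # Inactive pages are written to the SSD on their way out;
        # active pages are never considered
        while self.inactive:
            self.ssd_writes.append(self.inactive.popleft())


queues = PageQueues(active_limit=2)
for page in ("hot", "a", "hot", "b", "hot", "c"):
    queues.touch(page)
queues.pageout_scan()
print(queues.ssd_writes)
```

The page named "hot" is touched often enough to stay in the active queue, so it never costs an SSD write, while the pages that were read once and pushed out do get written.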
2. Re:Wrong. Swap often acts as a cache. by m.dillon · 2010-04-22 15:28 · Score: 3, Informative
  
  OS's have traditionally discarded clean cache data when memory pressure forces the pages out. Swap traditionally applied only to dirty anonymous memory (The OS needs to write dirty data somewhere, after all, and if it isn't backed by a file then that is what swap is for).
  However in the last decade traditional paging to swap has fallen by the wayside as memory capacities have increased. Most of the data in ram on systems today is clean data, not dirty data, and most of the dirty data is backed by a file (e.g. write()s to a database or something like that). On most systems today if you look at swap space use you find it near zero.
  But the concept of swap can trivially be expanded to cover more areas of interest. tmpfs (tmpfs, md, mfs, etc) is a good example. For that matter anonymous memory for VMs can be backed by swap. It is very desirable to back the memory for a VM with either a tmpfs-based file or just straight anonymous memory instead of a file in a normal filesystem. That is a good use for swap too.
  It isn't that big a leap to expand swap coverage to also cache clean data. It took about two weeks to implement the basics on DragonFly. Those operating systems which don't have this capability will probably get it as time goes on simply because it is an extremely useful mechanic for interfacing a SSD-based cache into a system. It is also probably the cleanest and simplest way to implement this sort of cache, and it pairs up well with the strengths of the SSD storage mechanic. Since you can reallocate swap space when something is rewritten there are virtually no write amplification effects and the storage on the SSD is cycled very nicely. You get much better wear leveling than you would if you tried to map a normal filesystem (or mirror the blocks associated with a normal filesystem) on top of the SSD.
  -Matt
3. Re:Wrong. Swap often acts as a cache. by Score+Whore · 2010-04-22 16:34 · Score: 5, Informative
  
  Solaris certainly doesn't. What developer would ever code this kind of behavior? Non-dirty filesystem data in the cache is already on disk, so what would be the rationale for writing it out to another part of the disk? That's just stupid. Non-dirty pages are thrown away when RAM is in demand. Dirty filesystem data is just written to disk. Then the pages become non-dirty and can be freed at any time. Possibly immediately if there is demand.
  Scenario A:
  1. File is read and data is copied into system memory where it is buffered. Time passes.
  2. Memory usage skyrockets.
  3. Kernel writes data to swap space and frees the memory for use by other processes.
  4. Later an application wants that data. Kernel reads data from swap space.
  Scenario B:
  1. File is read and data is copied into system memory where it is buffered. Time passes.
  2. Memory usage skyrockets.
  3. Kernel locates non-dirty cached data and frees that page for use by other processes.
  4. Later an application wants that data. Kernel reads data from original file on disk.
  Differences between scenario A & B:
  Scenario A has two disk IOs (steps 3&4) during memory pressure. Scenario B has one (step 4).
  Scenario A uses limited swap space to store duplicate data. Scenario B doesn't.
  And no, Solaris doesn't cache slow devices (tape, dvd-rom, etc.) either. If you choose to access those types of devices, that is your choice. The OS isn't going to save your ass. If you want it cached, make your application do the caching.
  Also, I'm not considering special purpose systems such as ZFS's l2arc or other similar/more generalized systems that utilize SSD as a midway point between RAM and HDD. We're talking generic swap space and filesystem caches.
Re:Buffers? by m.dillon · 2010-04-22 13:06 · Score: 3, Informative

The single largest problem addressed by e.g. DragonFly's swapcache is meta-data caching to make scans and other operations on large filesystems with potentially millions or tens of millions of files a fast operation. Secondarily, for something like DragonFly's HAMMER filesystem, which can store a virtually unlimited number of live-accessible snapshots of the filesystem, you can wind up with not just tens of millions of inodes, but hundreds of millions of inodes. Being able to efficiently operate on such large filesystems requires very low latency access to meta-data. Swapcache does a very good job providing the low latency necessary.
System main memory just isn't big enough to cache all those inodes in a cost-effective manner. 14 million inodes takes around 6G of storage to cache. Well, you can do the math. Do you spend tens of thousands of dollars on a big whopping server with 60G of ram or do you spend a mere $200 on a 80G SSD?
-Matt
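Spelled out, "the math" looks roughly like this in Python. The 14 million inodes and 6G figures are from the comment; reading "G" as GiB and picking 140 million inodes as the larger case (ten times the quoted figure, which lands on the 60G server mentioned) are assumptions made for the example.

```python
GIB = 1024 ** 3

# From "14 million inodes takes around 6G of storage to cache"
bytes_per_inode = 6 * GIB / 14_000_000


def cache_size_gib(inodes):
    """Memory needed to cache this many inodes at the ratio above."""
    return inodes * bytes_per_inode / GIB


print(f"about {bytes_per_inode:.0f} bytes per cached inode")
# 140 million inodes is a hypothetical larger filesystem
print(f"about {cache_size_gib(140_000_000):.0f} GiB for 140 million inodes")
```

At that ratio a filesystem in the hundreds of millions of inodes outgrows any affordable amount of RAM, while it still fits on a single small SSD.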
Re:I don't get it by Just+Some+Guy · 2010-04-22 15:03 · Score: 2, Informative

Trick with FBSD - it doesn't believe in removing L2ARC devices yet.
You're wrong:

$ sudo zpool add tank cache ada1
$ sudo zpool status
[...]
cache
  ada1 ONLINE 0 0 0
$ sudo zpool remove tank ada1
$ sudo zpool status
[nothing about a cache device]

You're probably thinking of ZIL devices. You can't remove them in FreeBSD, but the version of ZFS in Solaris (that's being ported to FreeBSD right now) supports removing them.

--
Dewey, what part of this looks like authorities should be involved?
Do you control the app? how 'bout the system? by IBitOBear · 2010-04-22 15:03 · Score: 1, Informative

Here are the things you probably didn't look into...
1) If you control the app source code, adding fadvise() calls after open to tell the kernel that you are going to use random access will turn off the read-ahead on those files selectively. The real reason that you are seeing random access kill you is probably that you are using the default read-ahead of 128k for the partition. That means that if you are reading 1k records, for example, and the total size of your active file set is greater than your disk cache memory, your effective raw read length is 129k and you are wasting nearly 100% of that read.
2) For the reasons set forth above, go into the various /sys/block/(whatever)/queue/ directories and set the read_ahead_kb values to zero, or maybe 8k, for the partitions where you are storing these large random access files. Remember that in your case there are layers (the raw block device, the raid, etc.) and tuning the layers may be in order.
3) The optimal stacking order is LVM on top of dmcrypt on top of raid_5_or_6 on top of media. I don't say to do all of that every time, but that's the optimal order. If, for instance, you put the raid over the encrypted volume, then you will pay about 3 times the encryption cost compared with having the dmcrypt over the raid. (This is because saving and computing the xors of data that is already encrypted is much cheaper than decrypting all the sectors, computing the xors, then separately encrypting all those sectors as they return to the media.)
4) For large storage solutions _always_ have an LVM on top. The overhead cost of a volume in a volume group is basically a single add and an extra function call. In exchange you can persistently set the readahead profile of each partition. You can also migrate your elements between physical devices safely and on the fly. All the reasons why aren't terribly obvious until the day it saves your life, or your weekend. (I have, for instance, plugged six USB hard drives into a system with a failing raid, built a raid on those USB drives, added the new raid to the existing volume group, and then migrated the active partition from the failing raid to the pluggable raid with the system up the whole time, then dropped the failing raid from the volume group. Meanwhile I built a second computer to replace the first. Shut down the old computer. Moved the pluggables over to the new one. Booted up into production. Then built the new permanent home for the partition. Added that raid into the volume group. Migrated the partition, then dropped the pluggables. Total down time was the cutover between the two computers. I also then got to grow the file system onto the new raid in the new computer (the drives were bigger) during the next maintenance window.) LVM is very much your friend.
5) It is easy and even desirable to build a volume group that is over heterogeneous storage. In particular, let's say I have a raid6 built over five drives /dev/sd[a-e]1 called /dev/mapper/main. I will make a volume group "system" over /dev/mapper/main and /dev/sdf1 and then build my runtime file system in an LVM that is only on the /dev/mapper/main part of system. Now, when I have to make backups etc. I can create the snapshot in the /dev/sdf1 part of system without seriously impacting my operations by competing for raid computation time/space.
6) Make sure you tune the stripe cache for any raid to a big number if you are doing writes, particularly random writes, to it; the default stripe cache is tiny. This is also important for keeping your read rates up if you end up in a degraded state.
7) take a serious look at your choice of scheduler.
So anyway, the degradation you describe is identical to having an improperly (or naively) tuned storage stack. In particular it sounds like read-ahead waste.
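For points 1 and 2, a small Python sketch may help. It works out the read-ahead waste for the 1k-record, 128k-read-ahead case mentioned above, then shows the random-access hint using os.posix_fadvise, Python's wrapper around the POSIX call. The helper names and the temporary file are made up for the example, and whether the hint actually helps depends on the kernel and the storage stack underneath.

```python
import os
import tempfile


def wasted_fraction(record_kb, read_ahead_kb):
    """Share of each physical read that is read-ahead the app never uses."""
    return read_ahead_kb / (record_kb + read_ahead_kb)


def open_random_access(path):
    """Open a file read-only and hint that it will be read in random order."""
    fd = os.open(path, os.O_RDONLY)
    if hasattr(os, "posix_fadvise"):   # not available on every platform
        # offset 0 with length 0 applies the hint to the whole file
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_RANDOM)
    return fd


print(f"wasted per read: {wasted_fraction(1, 128):.1%}")

# Hypothetical data file: four 1k records in a temporary file
with tempfile.NamedTemporaryFile() as tmp:
    tmp.write(b"x" * 4096)
    tmp.flush()
    fd = open_random_access(tmp.name)
    os.lseek(fd, 1024, os.SEEK_SET)    # jump straight to the second record
    record = os.read(fd, 1024)
    os.close(fd)
```

The hint is per open file, so it suits the case where you control the application; the sysfs read-ahead setting from point 2 is the blunt alternative when you do not.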

--
Innocent people shouldn't be forced to pay for inferior software development.
--"Code Complete" Microsoft Press