Software SSD Cache Implementation For Linux?

Posted by timothy on Thursday April 22, 2010 @09:37AM from the only-obscure-if-you-think-it-is dept.

Annirak writes "With the bottom dropping out of the magnetic disk market and SSD prices still over $3/GB, I want to know if there is a way to get the best of both worlds. Ideally, a caching algorithm would store frequently used sectors, or sectors used during boot or application launches (hot sectors), to the SSD. Adaptec has a firmware implementation of this concept, called MaxIQ, but this is only for use on their RAID controllers and only works with their special, even more expensive, SSD. Silverstone recently released a device which does this for a single disk, but it is limited: it caches the first part of the magnetic disk, up to the size of the SSD, rather than caching frequently used sectors. The FS-Cache implementation in recent Linux kernels seems to be primarily intended for use in NFS and AFS, without much provision for speeding up local filesystems. Is there a way to use an SSD to act as a hot sector cache for a magnetic disk under Linux?"

12 of 297 comments

ZFS by Anonymous Coward · 2010-04-22 09:45 · Score: 5, Informative

ZFS can do this (http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#Separate_Cache_Devices) but I don't know about zfs-fuse
ZFS L2ARC by jdong · 2010-04-22 09:46 · Score: 5, Informative

Not Linux per se, but the same idea is implemented nicely on ZFS through its L2ARC: http://blogs.sun.com/brendan/entry/test
bcache by Wesley+Felter · 2010-04-22 09:52 · Score: 5, Informative

http://lkml.org/lkml/2010/4/5/41
I'm a little surprised at the lack of response on linux-kernel.
Solaris and DragonFly have already implemented this feature; I'm surprised that Linux is so far behind.
1. Re:bcache by Kento · 2010-04-22 10:02 · Score: 5, Informative
  
  Hey, at least someone noticed :)
  That version was pretty raw. The current one is a lot farther along than that, but it's still got a ways to go - I'm hoping to have it ready for inclusion in a few months, if I can keep working on it full time. Anyone want to fund me? :D
Waste of time by onefriedrice · 2010-04-22 09:54 · Score: 5, Informative

What a waste of time. Just put /home on a magnetic disk and everything else on the SSD. This way, you can get away with a small (very affordable) SSD for your binaries, libraries, config files, and app data, and use tried-and-true magnetic storage for your important files. Your own personal files don't need to be on a super fast disk anyway because they don't get accessed as much as you would think, but your binaries and config files get accessed a lot (unless you have a lot of RAM to cache them, which I also recommend). I've been doing this for over a year and enjoying 10-second boots and instant cold starts of programs (including OpenOffice and Firefox).

I personally fit all my partitions except /home in only 12.7GB (the SSD is 30GB). Seriously, best upgrade ever. I will never put my root partition on a magnetic drive ever again.

--
This author takes full ownership and responsibility for the unpopular opinions outlined above.
Re:I don't get it by Anonymous Coward · 2010-04-22 09:55 · Score: 5, Informative

The idea is to use the SSD as a second-level disk cache. So instead of simply discarding cached data under memory pressure, it's written to the SSD. It's still way slower than RAM, but it's got much better random-access performance characteristics than spinning rust and it's large compared to RAM.
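A toy sketch of that second-level idea, in Python. This is not how L2ARC or any real kernel does it; the class name, the block-count sizes, and the plain LRU policy are all assumptions made for illustration. Blocks pushed out of the small RAM tier are demoted to the larger SSD tier instead of being thrown away, and a read checks RAM, then SSD, then the disk.

```python
from collections import OrderedDict


class TwoLevelCache:
    """Toy read cache: a small RAM tier backed by a larger SSD tier (hypothetical)."""

    def __init__(self, ram_blocks, ssd_blocks, read_from_disk):
        self.ram = OrderedDict()  # block number -> data, least recently used first
        self.ssd = OrderedDict()
        self.ram_blocks = ram_blocks
        self.ssd_blocks = ssd_blocks
        self.read_from_disk = read_from_disk
        self.hits = {"ram": 0, "ssd": 0, "disk": 0}

    def _insert_ram(self, block, data):
        self.ram[block] = data
        self.ram.move_to_end(block)
        if len(self.ram) > self.ram_blocks:
            # Memory pressure: demote the coldest RAM block to the SSD
            # instead of discarding it.
            old_block, old_data = self.ram.popitem(last=False)
            self.ssd[old_block] = old_data
            self.ssd.move_to_end(old_block)
            if len(self.ssd) > self.ssd_blocks:
                self.ssd.popitem(last=False)  # SSD full: drop its coldest block

    def read(self, block):
        if block in self.ram:
            self.hits["ram"] += 1
            self.ram.move_to_end(block)
            return self.ram[block]
        if block in self.ssd:
            self.hits["ssd"] += 1
            data = self.ssd.pop(block)  # promote back into RAM
        else:
            self.hits["disk"] += 1
            data = self.read_from_disk(block)
        self._insert_ram(block, data)
        return data


cache = TwoLevelCache(ram_blocks=2, ssd_blocks=4,
                      read_from_disk=lambda b: f"data-{b}")
for block in [1, 2, 3, 1, 2, 3, 1, 1]:
    cache.read(block)
print(cache.hits)
```

With RAM sized for only two blocks, a three-block working set keeps falling out of RAM, but after the first pass those reads are served from the SSD tier rather than going back to the disk.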
As for how to do it in Linux, I'm not aware of a way. If you are open to the possibility of using other operating systems, this functionality is part of OpenSolaris (google for "zfs l2arc" for more information).

Cache Devices
Devices can be added to a storage pool as "cache devices."
These devices provide an additional layer of caching between
main memory and disk. For read-heavy workloads, where the
working set size is much larger than what can be cached in
main memory, using cache devices allows much more of this
working set to be served from low latency media. Using cache
devices provides the greatest performance improvement for
random-read workloads of mostly static content.
To create a pool with cache devices, specify a "cache" vdev
with any number of devices. For example:
# zpool create pool c0d0 c1d0 cache c2d0 c3d0
The content of the cache devices is considered volatile, as
is the case with other system caches.
You can also use an SSD as a separate intent log device, which can dramatically improve synchronous write performance:

Intent Log
The ZFS Intent Log (ZIL) satisfies POSIX requirements for
synchronous transactions. For instance, databases often
require their transactions to be on stable storage devices
when returning from a system call. NFS and other applications
can also use fsync() to ensure data stability. By
default, the intent log is allocated from blocks within the
main pool. However, it might be possible to get better
performance using separate intent log devices such as NVRAM or
a dedicated disk. For example:
# zpool create pool c0d0 c1d0 log c2d0
Multiple log devices can also be specified, and they can be
mirrored. See the EXAMPLES section for an example of mirroring
multiple log devices.
Log devices can be added, replaced, attached, detached, and
imported and exported as part of the larger pool. Mirrored
log devices can be removed by specifying the top-level mirror
for the log.
Re:isn't 40 GB enough for applications? by Unit3 · 2010-04-22 09:57 · Score: 4, Informative

They are huge for larger applications. Database servers, for instance, can see transactions per second increase on the order of 10-20x when using a scheme like this for datasets that are too large to fit in RAM.

--
-- sudo.ca
Re:I don't get it by Colin+Smith · 2010-04-22 09:58 · Score: 4, Insightful

so
CPU L1
CPU L2
CPU L3
RAM
SSD
DISK
NETWORK
Internet
I estimate SSDs would be closer to Level 5 cache.

--
Deleted
Re:I don't get it by TheRaven64 · 2010-04-22 10:30 · Score: 4, Informative

The submitter wants something like ZFS's L2ARC, which uses the flash as an intermediate cache between the RAM cache and the disk. This works very well for a lot of workloads. Since Linux users appear to be allowed to say 'switch to Linux' as an answer to questions about Windows, it only seems fair that 'switch to Solaris or FreeBSD' would be a valid solution to this problem.

--
I am TheRaven on Soylent News
FSCache would work except... by Jah-Wren+Ryel · 2010-04-22 12:02 · Score: 4, Interesting

I have a similar problem and I tried the FSCache approach:
I've got two raids.
One is optimized for big ass files read contiguously and has raid6 redundancy.
The other is a much smaller JBOD that I can reconfigure via mdraid to anything that linux supports in software.
The problem is that 5% of the big ass files need read-only random access and that kills throughput for anything else going on. It takes me down from ~400MB/s to 15MB/s.
So, I thought I'd use the FSCache approach and use the JBOD as the cache.
I did an NFS mount over loopback and pointed the fscache to the JBOD.
It worked great, with practically full throughput for contiguous access, for about 10 hours, and then crashed the system.
Apparently NFS over loopback is well known to be broken in linux and has been since, essentially, forever.
I was stunned, it had never even occurred to me that NFS over loopback would be broken. It's freaking 2010 - that something I had been using on SunOS 3 a bazillion years ago didn't work on linux today had not even entered my mind.
I've also tried replicating the files from the raid6 to the JBOD, but that quickly turned into a hassle keeping everything synchronized between the files on disk and the applications that create the files on the raid6 and the apps that use the files on the JBOD. Plus, it doesn't scale out past the size of the JBOD, which I also ran into.
So now, I'm looking at putting the apps that need random access reads to the data in a VM and NFS mounting it with cache to the VM, hoping to avoid the NFS-broken-over-loopback problem. I haven't had time to implement it yet, and personally am leery of doing so since I have to wonder what new "known-broken" problems will bite me in the ass.
So, if there is a better way, I am dying to hear it, unfortunately solaris/freebsd is not an option...

--
When information is power, privacy is freedom.
Re:Wrong. Swap often acts as a cache. by m.dillon · 2010-04-22 13:20 · Score: 4, Informative

The way DragonFly's swapcache works is that VM pages (cached in ram) go from the active queue to the inactive queue to the cache (almost free) queue to the free queue. VM pages sitting in the inactive queue are subject to being written out to the swapcache. VM pages in the active queue (or cache or free queues) are not considered.
In other words, simply accessing cacheable data or meta-data from the hard drive does not itself trigger writing to the SSD swapcache. It's only when the cached VM pages are pushed out of the active queue due to memory pressure and are clearly heading out the door that DragonFly decides to write them to the SSD.
This prevents SSD write activity from interfering with the operation of the production system and also tends to do a good job selecting what data to write to the SSD and when, and what data not to. A file which is in constant use by the system just stays in ram; there's no point writing it out to the SSD.
With respect to deciding what data to cache and what data not to, with meta-data it's simple. You cache as much meta-data as you can because every piece of meta-data gives you a multiplicative performance improvement. With file data it is harder since you don't want to try to cycle e.g. a terabyte of data through a 40G swapcache. The production system's working data set at any given moment needs to either fit in the swapcache or you need to carefully select which directory topologies you want to cache.
-Matt
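To put a number on the "don't cycle a terabyte through a 40G swapcache" point above, here is a small Python simulation. It assumes a plain LRU cache and a working set read in order, over and over; that policy and the arbitrary block counts are assumptions for illustration, not DragonFly's actual swapcache behavior.

```python
from collections import OrderedDict


def lru_hit_rate(cache_blocks, working_set_blocks, passes=5):
    """Hit rate of an LRU cache when a working set is read in order, repeatedly."""
    cache = OrderedDict()
    hits = total = 0
    for _ in range(passes):
        for block in range(working_set_blocks):
            total += 1
            if block in cache:
                hits += 1
                cache.move_to_end(block)
            else:
                cache[block] = True
                if len(cache) > cache_blocks:
                    cache.popitem(last=False)  # evict least recently used
    return hits / total


# Units are arbitrary "blocks"; 40 stands in for the 40G swapcache.
print(lru_hit_rate(cache_blocks=40, working_set_blocks=30))    # working set fits
print(lru_hit_rate(cache_blocks=40, working_set_blocks=1000))  # working set does not fit
```

When the working set fits, every pass after the first is served from the cache. When it does not, each block is evicted before it comes around again, so the cache absorbs writes and returns no hits under this access pattern.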
Re:Wrong. Swap often acts as a cache. by Score+Whore · 2010-04-22 16:34 · Score: 5, Informative

Solaris certainly doesn't. What developer would ever code this kind of behavior? Non-dirty filesystem data in the cache is already on disk, so what would be the rationale to write it out to another part of the disk? That's just stupid. Non-dirty pages are thrown away when RAM is in demand. Dirty filesystem data is just written to disk. Then the pages become non-dirty and can be freed at any time. Possibly immediately if there is demand.
Scenario A:
1. File is read and data is copied into system memory where it is buffered. Time passes.
2. Memory usage skyrockets.
3. Kernel writes data to swap space and frees the memory for use by other processes.
4. Later an application wants that data. Kernel reads data from swap space.
Scenario B:
1. File is read and data is copied into system memory where it is buffered. Time passes.
2. Memory usage skyrockets.
3. Kernel locates non-dirty cached data and frees that page for use by other processes.
4. Later an application wants that data. Kernel reads data from original file on disk.
Differences between scenario A & B:
Scenario A has two disk IOs (steps 3&4) during memory pressure. Scenario B has one (step 4).
Scenario A uses limited swap space to store duplicate data. Scenario B doesn't.
And no, Solaris doesn't cache slow devices (tape, dvd-rom, etc.) either. If you choose to access those types of devices, that is your choice. The OS isn't going to save your ass. If you want it cached, make your application do the caching.
Also, I'm not considering special purpose systems such as ZFS's l2arc or other similar/more generalized systems that utilize SSD as a midway point between RAM and HDD. We're talking generic swap space and filesystem caches.
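The I/O count for the two scenarios above can be written out as a few lines of Python. The function name and the flag are made up for illustration; this only tallies the disk operations for one clean cached page from step 3 onward and is not modeled on any kernel's code.

```python
def disk_ios(swap_out_clean_pages):
    """List the disk I/Os from memory pressure onward for one clean cached page."""
    ios = []
    # Step 3: memory pressure forces the kernel to reclaim the page.
    if swap_out_clean_pages:
        ios.append("write page to swap")  # Scenario A
    # Scenario B just frees the page: it is clean, so the copy on disk is current.
    # Step 4: an application asks for the data again.
    if swap_out_clean_pages:
        ios.append("read page from swap")
    else:
        ios.append("read page from file")
    return ios


for name, policy in (("A", True), ("B", False)):
    ios = disk_ios(policy)
    print(f"Scenario {name}: {len(ios)} disk I/Os -> {ios}")
```

This counts I/Os only and says nothing about what each one costs. If the swap device is much faster than the disk holding the file, as with an SSD, the trade-off changes, which is the special-purpose case set aside in the last paragraph.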