Software SSD Cache Implementation For Linux?
Annirak writes "With the bottom dropping out of the magnetic disk market and SSD prices still over $3/GB, I want to know if there is a way to get the best of both worlds. Ideally, a caching algorithm would store frequently used sectors, or sectors used during boot or application launches (hot sectors), to the SSD. Adaptec has a firmware implementation of this concept, called MaxIQ, but this is only for use on their RAID controllers and only works with their special, even more expensive, SSD. Silverstone recently released a device which does this for a single disk, but it is limited: it caches the first part of the magnetic disk, up to the size of the SSD, rather than caching frequently used sectors. The FS-Cache implementation in recent Linux kernels seems to be primarily intended for use in NFS and AFS, without much provision for speeding up local filesystems. Is there a way to use an SSD to act as a hot sector cache for a magnetic disk under Linux?"
Linux caches data from any disks all the same, SSD or not.
I can't find a commercial SSD / Platter Hybrid anywhere...
I wouldn't mind doing some of the work myself but I can't even find those mythical ssd / hard drive kits.
Is there really a need for this? Intel 40 GB SSD still has a read speed of 170 MB/s and costs about 100 euro here in NL. Why have some kind of experimental configuration while prices are like that? OK, 35 MB/s write speed is not that high, but with the high IOPS and low seek times you still have most of the benefits.
I can see why you would want something like this, but I doubt the benefits are that large over a normal SSD + HDD configuration.
Get one drive of each type. Stick them in your computer. Create partitions for the main directory hierarchies. Put /, /boot, /bin, /etc, /usr, and other relatively-static hierarchies on the SSD drive. Put /home, /var, and other frequently-modified directories on the magnetic disk drive. There, you've got the caching you want.
ZFS can do this (http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#Separate_Cache_Devices) but I don't know about zfs-fuse
I hate to sound dumb, but isn't what you're describing basically file system buffering that OS's have been doing for many decades now?
Not Linux per se, but the same idea is implemented nicely on ZFS through its L2ARC: http://blogs.sun.com/brendan/entry/test
Install the operating system and key applications to the SSD and use your standard harddrive for all the data storage
By having the SSD act this way, you will lower the lifespan of the drive by unnecessarily depleting the write-cycles with continued optimization
How much was Slashdot paid for this plug for Silverstone?
Hackers want to know.
Thanks in advance.
Nick Haflinger.
He wants the OS to intelligently (and automatically) use an SSD to store the frequently used files from his larger spinning hard disk. It's a great idea and surely Windows will do it soon enough (as much as I hate to say it).
I think SSD will always be behind Magnetic. I've dreamed of building an SSD controller that had a magnetic backend. Pop in SD cards to expand the cache. The OS doesn't notice that your drive is really 2 drives. I'm going to get flamed but I do think M$ had it right when they backed hybrid drives!
The OSDI deadline is in August; plenty of time to implement this, write it up, and get a publication at a top research conference out of it!
http://lkml.org/lkml/2010/4/5/41
I'm a little surprised at the lack of response on linux-kernel.
Solaris and DragonFly have already implemented this feature; I'm surprised that Linux is so far behind.
What a waste of time. Just put /home on a magnetic disk and everything else on the SSD. This way, you can get away with a small (very affordable) SSD for your binaries, libraries, config files, and app data, and use tried and true magnetic for your important files. Your own personal files don't need to be on a super fast disk anyway because they don't get as much access as you would think, but your binaries and config files get accessed a lot (unless you have a lot of RAM to cache that, which I also recommend). I've been doing this for over a year and enjoying 10 second boots, and instant program access coldstarts (including openoffice and firefox).
I personally fit all my partitions except /home in only 12.7GB (the SSD is 30GB). Seriously, best upgrade ever. I will never put my root partition on a magnetic drive ever again.
This author takes full ownership and responsibility for the unpopular opinions outlined above.
Set the root and swap partitions on the SSD, set up the kernel so the hibernation image is stored on the SSD, and set the /home partition on the HDD.
Not as elegant as what you're asking for, and your solution could easily be implemented, but the solution outlined above should get most people the same results. Any directories you don't want on the SSD you can just assign to the HDD.
Ahh, configurability - don't you love Linux? :D
Hardware implemented caching is the way to go. There are 'hybrid drives' available now, which automatically cache disk access to SSD. These are very specific for the task and way more efficient than any software implementation.
I've actually been working on this off-and-on for a while, I'm hoping we can release some beta code soon. Currently developing it on Linux, but planning to release OSX and Windows versions, too. We're caching reads and writes, and only the blocks that are most frequently used, plus various other SSD-relevant optimisations. The block allocation logic is pretty complex (and I'm too busy with work), which is why it's been taking so long.
The oldest and simplest solution is to mount partitions from a small fast disk where you want fast read/write speeds, and partitions from slower disks everywhere else. Works quite well, too.
GAAH! MY PRINTER IS ON FIRE!!! PUT IT OUT! PUT IT OUT!
This wouldn't be hard to do with a little smart configuring.
When you install linux, make sure that /home /boot and /bin end up on the SSD (probably / would be fine), and mount the spinning HDD somewhere else. I call mine /data.
When you rip your DVDs, or whatever, put them on /data.
When you 'sudo apt-get' a new application, it goes onto the SSD, without you having to do anything special.
No cache implementation rewrite necessary.
Sure it's caching. You're storing frequently-used data on a faster medium.
It's actually better than caching, because in this case you don't have to write back the data to the spinning-disk drive when there are changes, since the SSD drive itself is non-volatile. An added benefit of that is that there's no need to keep the SSD-stored directories on the spinning media, freeing up more space there.
Call it a "shortsighted kludge" all you want, but this general technique has been used very effectively for decades by people and organizations working with absolutely huge data sets. It's a proven technique that dates back to mainframes and microcomputers.
Hell, that's why this is so easy to do with UNIX-like systems. Disk drives then had a smaller capacity but faster access times, and were used to store frequently-accessed data (like system files and applications). User data was stored on tape, since it was comparatively plentiful, even if somewhat slower. The very earliest UNIX implementations split their filesystem hierarchy over multiple storage devices with differing capabilities.
First of all, you can do this with ZFS which is newer tech and works quite well but is not (ever going to be) implemented in the Linux kernel
For lower tech, you can do it the same way we used to do back when hard drives were small. In order to prevent people from filling up the whole hard drive we used to have partitions (now we just pop in more/larger drives in the array). /boot and /var would be in the first parts of the hard drive where the drive was fastest. /home could even be on another drive.
You could do the same: put /boot and /usr on your SSD (or whatever you want to be fastest). If you have an X25-E or another fast-writing SSD you could put /var on there (for log, tmp etc. if you have a server), or if you are short of RAM make it a swap drive. If you have small home folders, you could even put /home on there and leave your mp3's in /opt or so.
Custom electronics and digital signage for your business: www.evcircuits.com
I stash these items on my SSD (4gb GIGABYTE IRAM SSD #1 - this can actually BOOT AN OS though, so others know):
====
1.) WebBrowser caches
2.) %Temp% + %Tmp% ops
3.) Event Logs
4.) Print Spooler
5.) cmd.exe %Comspec%
----
And, on my 3gb CENATEK RocketDrive:
1.) Pagefile.sys (for nearly the ENTIRE SSD in size of it)
====
Seems to all work out well, for better performance... how so?
Well - not just because those items benefit by mostly being smallish files inside folders, which tend to "increase speed" (less latency in seeks mostly & NO head movements either as in mechanical disks) but, also because I am "offloading" my main C: drive (bootdrive in Windows for those "not in the know" on Windows, & yes, there are those folks out there @ times, albeit rarely), making it do LESS WORK also!
APK
P.S.=> Your ideas aren't 1/2 bad either for Linux though... good job! apk
google dm-cache. Not updated since 2.6.29 though.
Windows 7 (and Vista before it) has ReadyBoost. I haven't been able to find anything similar for Linux. It is also not clear how much difference ReadyBoost makes. The only benchmark I was able to find uses a crappy USB flash drive. I was wondering how much difference something like the 80GB X25-M would make. There is clearly potential for huge gains as MaxIQ benchmarks show.
This would be an awesome speedup if it was supported: just add a 40-80GB SSD for swap & file cache, and gain a massive performance boost over a standard cheap 7200RPM drive. Given that the price per GB for SSD is likely to stay very high for the foreseeable future, this seems like the best way to go. If only there was OS support for it...
The alternative (that I'm waiting for) is for SSD prices to drop.
___
If you think big enough, you'll never have to do it.
"If Linux doesn't already do it, you don't need it anyway!"
Sure it's caching. You're storing frequently-used data on a faster medium.
That's not what caching means. It comes from the French word for 'hidden' and the fact that it is not directly addressable is the important part of the definition. A cache is not just a faster medium, it's a faster medium that is hidden from the user / programmer and is used to accelerate access to the slower medium.
I am TheRaven on Soylent News
Kind of like the current idea of pushing the wear-leveling back to the drives. This is something the OS can do, and it's a case where flexibility matters -- it's not something I want in a black box inside a drive controller.
Don't thank God, thank a doctor!
Sure it's caching. You're storing frequently-used data on a faster medium.
You're also storing a lot of infrequently-used data on a faster medium, and a lot of frequently-used data on the slower medium. How is that better than using the faster medium only for frequently used data?
As it happens, for many of my workloads anyway, I'd expect to see bigger benefits by storing some of my data directories on SSDs than storing /usr. Which directories? Well, they're sort of scattered all around. Why should I have to go through and figure out what things I use all the time? How can I even determine that (I don't know what files programs are opening behind my back). Figuring out that sort of thing is exactly what computers are good at.
It's actually better than caching, because in this case you don't have to write back the data to the spinning-disk drive when there are changes...
That's not a problem. Writes are already delayed because they can be buffered; if you cache in a SSD, they can be delayed even further before you write them back to the magnetic drive.
An added benefit of that is that there's no need to keep the SSD-stored directories on the spinning media, freeing up more space there.
Whoop-de-do. Even if you got a 100 GB SSD, duplicating that space on, say, a 1 TB hard drive would cost under $10. Considering that the SSD would be in the vicinity of $400, that extra $10 in lost space isn't exactly something to cry about.
Puppy Linux implements periodic syncing to its save file (ext2, 3, and 4) and the times can be adjusted. This was initially implemented when flash drives were less reliable, but it is still useful if you want to reuse old equipment or if you need to do a lot of read/write intensive operations.
I would like to see something like this, except as a file system layer similar to unionfs that does copy-on-read from some other place (network, slow USB HDD, etc.), and purges or keeps files (based on popularity) when the place it caches to gets full.
I've seen several other blatantly wrong comments from you for this story, and you clearly don't understand what caches are, and how they work. It's quite easy for swap to be a cache.
An example of this in nearly every personal computer is information read from spinning plastic disc media, like CD-ROMs or data DVDs. Typically, the OS will read data from the plastic disc, and store it in memory. If the memory usage becomes tight, that data from the spinning plastic disc will be swapped out to a magnetic disk drive.
This is caching, because you're storing the data from a comparatively slower medium (the CD-ROM or DVD disc) on a comparatively faster medium (the magnetic hard drive). If the data is needed, it's retrieved from the faster magnetic disk drive, rather than the slower spinning plastic disc drive. Thus we have caching.
What a waste of time. Just put /home on a magnetic disk and everything else on the SSD. This way, you can get away with a small (very affordable) SSD for your binaries, libraries, config files, and app data, and use tried and true magnetic for your important files.
Solaris' experience with using SSDs as a read cache shows a 5x to 40x increase in read IOps:
http://blogs.sun.com/brendan/entry/l2arc_screenshots
while still getting the advantages of bulk storage with SATA drives (in various forms of RAID configuration).
This may not be a big deal for home stuff, but if you're serving homedirs, VMware VMDKs, or databases over NFS for work, it could save a lot of money in equipment and power/cooling just by adding a few SSDs.
Similarly using write-optimized SLC SSDs can help synchronous write operations (12x more IOps, 20x reduction in latency in some benchmarks):
http://blogs.sun.com/brendan/entry/slog_screenshots
More on the general concept of "hybrid storage pools":
http://blogs.sun.com/brendan/entry/hybrid_storage_pool_top_speeds
I believe parts of this functionality have been ported to FreeBSD as well (they're a few ZFS revs behind Solaris).
You can run ZFS on Linux via FUSE. This would probably achieve exactly what the OP is looking for. See http://zfs-fuse.net/
I'm surprised no one mentioned "preload":
"preload is an adaptive readahead daemon. It monitors applications that users run, and by analyzing this data, predicts what applications users might run, and fetches those binaries and their dependencies into memory for faster startup times."
http://sourceforge.net/projects/preload/
Development seems stalled, but I think the idea is there. Well, they attacked the problem of using unused RAM, but it could easily be adapted to use an SSD partition.
Sebastien Giguere
Caching is only worthwhile if the data can benefit from higher bandwidth. I don't want, for example, my porn or SETI@home data using valuable cache space regardless of how frequently it's accessed, because it can't be processed at anything approaching the bandwidth of magnetic storage, let alone a good SSD. I'd much prefer to have my app/games stored on the SSD, because regardless of how infrequently I use any one of them, the performance gains would be far more dramatic.
https://www.eff.org/https-everywhere
Replace the optical drive. I've been keeping a log of how many times I've ever used my DVD drive while away from home. So far I'm at 1; I ripped a CD I got for Christmas before I brought it back home with me. It could have waited.
Yes, I know it doesn't work for everyone, but I think it works for most people, assuming you get a USB powered optical drive or enclosure.
Try jamming two hard drives into a laptop.
Re-read the problem as stated:
"Is there a way to use an SSD to act as a hot sector cache for a magnetic disk under Linux?"
Ya think maybe it's assumed that there are two drives? Just Maybe?
This kind of thing is fairly big in some circles.
I have a similar problem and I tried the FSCache approach:
I've got two raids.
One is optimized for big ass files read contiguously and has raid6 redundancy.
The other is a much smaller JBOD that I can reconfigure via mdraid to anything that linux supports in software.
The problem is that 5% of the big ass files need read-only random access and that kills throughput for anything else going on. It takes me down from ~400MB/s to 15MB/s.
So, I thought I'd use the FSCache approach and use the JBOD as the cache.
I did an NFS mount over loopback and pointed the fscache to the JBOD.
It worked great, with practically full throughput for contiguous access, for about 10 hours, and then it crashed the system.
Apparently NFS over loopback is well known to be broken in linux and has been since, essentially, forever.
I was stunned, it had never even occurred to me that NFS over loopback would be broken. It's freaking 2010 - that something I had been using on SunOS 3 a bazillion years ago didn't work on linux today had not even entered my mind.
I've also tried replicating the files from the raid6 to the JBOD, but that quickly turned into a hassle keeping everything synchronized between the files on disk and the applications that create the files on the raid6 and the apps that use the files on the JBOD. Plus, it doesn't scale out past the size of the JBOD, which I also ran into.
So now, I'm looking at putting the apps that need random access reads to the data in a VM and NFS mounting it with cache to the VM, hoping to avoid the NFS-broken-over-loopback problem. I haven't had time to implement it yet, and personally I am leery of doing so since I have to wonder what new "known-broken" problems will bite me in the ass.
So, if there is a better way, I am dying to hear it, unfortunately solaris/freebsd is not an option...
When information is power, privacy is freedom.
Does Linux cache actual data (content of files) or just the block addresses?
As far as I know, only the second.
It caches both.
Some fun things to try by way of benchmarks:
sync; sudo sh -c "echo 3 > /proc/sys/vm/drop_caches" # clear out the page cache, dentries and inodes
time find /usr/share/doc -type f > files # uncached file system performance
time cat `cat files` > /dev/null # uncached data read performance
time find /usr/share/doc -type f > files # cached file system performance
time cat `cat files` > /dev/null # cached file system + data performance
sync; sudo sh -c "echo 3 > /proc/sys/vm/drop_caches" # clear out the cache again
time cat `cat files` > /dev/null # uncached performance for both data and inodes
As long as you don't overrun your available cache memory, you should get at least 100x faster performance on cached accesses.
Check out Atrato's AppSmart software...
http://www.atrato.com/products/app-smart.asp
Thanks for your display of ignorance.
In Alaska, the Yukon and British Columbia, the word "cache" also refers to a platform used to store food away from animals. It's quite accessible to whoever puts the food there. Given that the pioneers (Runciman and Booth) in the study of caching, both at the processor level and for IO, hailed from Juneau, it's likely that's what they were thinking of when they coined the term.
A cache doesn't have to be hidden or inaccessible. In fact, most caches at the software level explicitly allow the users to specify the size of the cache, the replacement and deletion policies, among other factors.
Besides, what you're saying doesn't even apply here. A given application won't know where its data is stored. It won't know that /etc/hosts is on an SSD drive, while /home/raven/penises.jpeg is on a traditional magnetic disk drive.
The algorithm seems simple. Wouldn't you just have to track every sector (or other more appropriate unit) loaded to the SSD and keep a score associated with each one (+1 every time a given sector is loaded from HDD or SSD)? The SSD remains full all the time. Users don't 'access' it as a drive, i.e. it's 100% managed. If a load request gives a sector a higher score than the lowest-scoring sector on the SSD, overwrite that one with the new one. Keep the score list in RAM and store it to the HDD when shutting down. Then whatever you load most will always be right there, including the OS. It would be like a selective mirror for the most requested items, with the whole SSD becoming like a giant page file.
In fact, your own files might be accessed frequently, but they're very likely small enough for the RAM cache. All this changes if you commonly manipulate huge files of course.
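The scoring scheme described above can be sketched roughly as follows. This is only a toy model in Python, not a block-layer implementation: the class name HotSectorCache, its capacity parameter and the "ssd"/"hdd" return values are all made up for illustration, and the linear search for the coldest sector would need a better data structure at real SSD sizes. Saving the score table at shutdown is left out.

```python
class HotSectorCache:
    """Toy model: the SSD mirrors the sectors with the highest access scores."""

    def __init__(self, capacity):
        self.capacity = capacity  # hypothetical: number of sectors the SSD can hold
        self.scores = {}          # sector -> access count, kept in RAM
        self.on_ssd = set()       # sectors currently mirrored on the SSD

    def read(self, sector):
        """Count one access and report where the sector was served from."""
        self.scores[sector] = self.scores.get(sector, 0) + 1
        if sector in self.on_ssd:
            return "ssd"
        if len(self.on_ssd) < self.capacity:
            # The SSD still has room, so fill it up first.
            self.on_ssd.add(sector)
        else:
            # Replace the lowest-scoring cached sector only if this one now scores higher.
            coldest = min(self.on_ssd, key=self.scores.__getitem__)
            if self.scores[sector] > self.scores[coldest]:
                self.on_ssd.remove(coldest)
                self.on_ssd.add(sector)
        return "hdd"
```

With a two-sector SSD, reading sectors 1, 1, 2, 3, 3, 3 leaves sectors 1 and 3 cached: sector 3 only displaces sector 2 on its second read, once its score is strictly higher.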
The Christian religion has been and still is the principal enemy of moral progress in the world. -- Bertrand Russell
If that's indeed the case, then why not simply put the MBR, /boot, /bin, and /usr on the SSD, then mount stuff like /home, /tmp, swap, and the like onto a spindle disk? No algorithm needed, thus no overhead needed to run it, etc.
This sounds like a good way to go.
Thanks to wear leveling, your HDD would probably wear out first. This guy did the math.
Read "Flash SSD Application from Hell - the Rogue Data Recorder". And keep in mind it was written in 07 - things are better than that now. You might die of old age before your SSD does, depending on your setup.
Weaselmancer
ridiculous.
I already do something similar on regular hard drives, based on the fact that the logical start of most hard drives now is almost twice as fast as the end. Create a small root partition, which will be on the fast part of the disk, for things that tend to be randomly accessed all the time, and then put all the bigger files on the slow part. There is no reason you need tiered storage at the hardware or OS level for this.
Microsoft has a technology called "ReadyBoost" that does this! It works great. No kidding! Google it.
One idea that I've had but haven't had an opportunity to try is doing this in LVM. "pvmove" lets you move a physical extent from one device to another, so if you can track which extents are hot you can move them to the SSD. But pvmove works by mirroring the extent from one device to another, so it would seem that you could keep it on the spinning disc and mirror it for reads. But if it is write heavy you probably need it to stay on the SSD primary.
:-)
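For the "track which extents are hot" part of the LVM idea, here is a rough Python sketch. It assumes you already have a trace of accessed sector numbers from somewhere (the trace source is not specified here), 512-byte sectors, and 4 MiB extents, i.e. 8192 sectors per extent. The function name hottest_extents is hypothetical, and the sketch only picks candidates; moving them would still be a job for pvmove or similar.

```python
from collections import Counter


def hottest_extents(accessed_sectors, sectors_per_extent, ssd_extents):
    """Return the numbers of the most frequently accessed extents.

    accessed_sectors: iterable of sector numbers, one entry per access
    sectors_per_extent: extent size in sectors (assumed 8192 for 4 MiB extents)
    ssd_extents: how many extents fit on the SSD
    """
    # Map every access to the extent that contains it and count per extent.
    counts = Counter(sector // sectors_per_extent for sector in accessed_sectors)
    # most_common() sorts by access count, highest first.
    return [extent for extent, _ in counts.most_common(ssd_extents)]
```

For example, a trace of sectors 0, 10, 8192, 8200, 8300 and 20000 with room for two extents on the SSD selects extent 1 (three accesses) and extent 0 (two accesses).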
Of course, what we really want is more than just one level, we want hierarchical storage: Tape for bulk storage of infrequently needed stuff, maybe optical. 5400RPM big drives, 15K for faster IO on that, MLC for faster, and SLC for fastest. That's totally what we need, right?
I could totally see 4GB of SLC, 32GB of MLC, and then a 500GB hard drive in my laptop.
While I think that purchasing more RAM should ALWAYS be your first choice, couldn't you put /SWAP on the SSD? If it gets used a lot, the SSD will get worn out (write cycle limit on the SSD), but hey, you are the one that wants to use SSD for SWAP.
Here are the things you probably didn't look into...
1) If you control the app source code, adding posix_fadvise() calls after open to tell the kernel that you are going to use random access will turn off the read-ahead on those files selectively. The real reason that you are seeing the random access kill you is probably that you are using the default read-ahead of 128k for the partition. That means that if you are reading 1k records, for example, and the total size of your active file set is greater than your page cache, your effective raw read length is 129k and you are wasting nearly 100% of that read.
2) For the reasons set forth above, go into the various /sys/block/(whatever)/queue/ directories and set the read_ahead_kb values to zero, or maybe 8k, for the devices holding these large random access files. Remember that in your case there are layers (the raw block device, and the raid, etc) and tuning the layers may be in order.
3) The optimal stacking order is LVM on top of dmcrypt on top of raid_5_or_6 on top of media. No, I don't say to do all of that every time, but that's the optimal order. If, for instance, you put the raid over the encrypted volume then you will pay about 3 times the encryption cost of having the dmcrypt over the raid. (This is because saving and computing the xors of data that is already encrypted is much cheaper than decrypting all the sectors, computing the xors, then separately encrypting all those sectors as they return to the media.)
4) For large storage solutions _always_ have an LVM on top. The overhead cost of a volume in a volume group is basically a single add and an extra function call. In exchange you can persistently set the readahead profile of each partition. You can also migrate your elements between physical devices safely and on the fly. All the reasons why aren't terribly obvious until the day it saves your life, or your weekend. (I have, for instance, plugged six USB hard drives into a system with a failing raid, built a raid on those USB drives, added the new raid to the existing volume group, and then migrated the active partition from the failing raid to the pluggable raid with the system up the whole time, then dropped the failing raid from the volume group. Meanwhile I built a second computer to replace the first. I shut down the old computer, moved the pluggables over to the new one, and booted up into production. Then I built the new permanent home for the partition, added that raid into the volume group, migrated the partition, then dropped the pluggables. Total down time was the cut-over between the two computers. I also then got to grow the file system onto the new raid in the new computer (the drives were bigger) during the next maintenance window.) LVM is very much your friend.
5) It is easy and even desirable to build a volume group that is over heterogeneous storage. In particular let's say I have a raid6 built over five drives /dev/sd[a-e]1 called /dev/mapper/main. I will make a volume group "system" over /dev/mapper/main and /dev/sdf1 and then build my runtime file system in an LVM that is only on the /dev/mapper/main part of system. Now, when I have to make backups etc I can create the snapshot in the /dev/sdf1 part of system without seriously impacting my operations by competing for raid computation time/space.
6) Make sure you tune the stripe cache for any raid to a big number if you are doing writes, particularly random writes, to it; the default stripe cache is tiny. This is also important for keeping your read rates up if you end up in a degraded state.
7) take a serious look at your choice of scheduler.
So anyway, the degradation you describe is identical to having an improperly (or naively) tuned storage stack. In particular it sounds like read-ahead waste.
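To make points 1 and 2 concrete, here is a minimal Python sketch, assuming Linux and Python 3.3 or later (where os.posix_fadvise is available). The helper names read_record and readahead_waste are made up for this example, and the waste formula simply restates the 1k-record, 128k read-ahead arithmetic from point 1 under the assumption that every access misses the cache.

```python
import os


def read_record(path, offset, length):
    """Read one record after hinting that access to this file is random."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # Offset 0 with length 0 applies the hint to the whole file.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_RANDOM)
        return os.pread(fd, length, offset)
    finally:
        os.close(fd)


def readahead_waste(record_kb, readahead_kb):
    """Fraction of each physical read that is thrown away.

    Assumes every access is a cache miss and none of the read-ahead data
    is used before it is evicted.
    """
    return readahead_kb / (record_kb + readahead_kb)
```

readahead_waste(1, 128) comes out at roughly 0.992, which is the "nearly 100%" figure quoted above.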
Innocent people shouldn't be forced to pay for inferior software development.
--"Code Complete" Microsoft Press
Something like this might also help:
http://insights.oetiker.ch/linux/external-journal-on-ssd/
I would think for SSD caching purposes a HyperDrive 5 RAM SSD in RAID0 mode would be the better choice compared to a flash SSD, if you could afford it. The long term abuse of a flash SSD due to the write patterns of a cache would make it potentially as slow as a spinning hard disk due to the tricks modern flash SSD drives use to improve write performance (trying to write into unoccupied large contiguous blocks).
Well yeah, caching is not the problem; it's a possible solution.
The raid6 is on an areca, and it was deliberately tuned for streaming reads and writes over random access - in retrospect separate raid volumes might have been better as the need for random access was not part of the initial requirements. I'm also using jfs and it appears to have a problem with fadvise, although that's been with POSIX_FADVISE_DONTNEED on streaming reads rather than POSIX_FADVISE_RANDOM.
I will look into playing with the kb_read_ahead setting, although I have my doubts because what I see with sar suggests the problem is in the areca-- when random accesses are occurring sar shows vastly reduced tps and rd_sec/s with 100% utilization. If it were buffercache readaheads, I would expect those values to equal or at least be in the neighborhood of the values during pure streaming access. I've already played with areca's own firmware settings for read-ahead aggressiveness without much benefit. Plus, readahead is good for all the streaming work; it's just that a small amount of random access has a disproportionate impact on the streaming access.
When information is power, privacy is freedom.
The original ask slashdot also mentioned speeding up boot times. I suppose that could be accomplished by copying the contents of /boot to a small filesystem on the SSD, then modifying /etc/fstab to mount /boot from the SSD, then running grub or whatever bootloader you are using to re-write the MBR.
However, that still isn't *quite* what the original poster was asking for. I do think he has a good idea (I've wondered about something along those lines too).
The thing about swap is, the files will *always* be initially read from the magnetic disk into the swap when you first boot and first access files. Also, if a file has been closed by all processes that had requested it, I don't think it will be considered by the kernel to still be loaded into swap, will it (even if the sectors it was written to still physically have the data)? I think the original poster wanted something where the kernel will look *first* on the SSD to see if it can fulfill the request, and only after that will try to fetch from the magnetic drive - even right after a reboot.
Putting /swap on that won't really do that. Still, putting /swap on the SSD, I suppose, would probably still help quite a bit.
Actually, I'm thinking about this. Doesn't the Linux kernel try to keep most open or recently accessed files in an in-RAM cache (as long as there is enough RAM), and access them directly in RAM instead of from either swap or disk, if it can fulfill the request from cache? If that is the case, doesn't it make the most sense to just get more RAM? 8 or 12 Gigs of RAM would, I should think, provide plenty of space to keep a lot of files in cache (unless you are dealing with some really massive files like full feature-length movies).
I guess my point is, it is my understanding that once loaded from the magnetic disk initially, a lot of file requests can be serviced from RAM, thus negating a lot of the benefits of putting /swap on an SSD (I think), so the main place where the SSD can provide a benefit is during the initial loading, but swap doesn't get hit during the initial loading. So, only doing something like the OP suggested, I believe, would provide any benefit?
Can you elaborate on that broken loopback NFS in Linux? I couldn't find anything about that, last mention of it being broken was in 2002.
You know, a lot of people use loopback nfs for crypto homedirs and I think fuse. I'd like to think that it isn't broken...
Cleancache was just recently posted to LKML: Cleancache [PATCH 0/7] (was Transcendent Memory): overview.
:wq
Hierarchical Storage Management (HSM) is a data storage technique which automatically moves data between high-cost and low-cost storage media. HSM systems exist because high-speed storage devices, such as hard disk drive arrays, are more expensive (per byte stored) than slower devices, such as optical discs and magnetic tape drives. While it would be ideal to have all data available on high-speed devices all the time, this is prohibitively expensive for many organizations. Instead, HSM systems store the bulk of the enterprise's data on slower devices, and then copy data to faster disk drives when needed. In effect, HSM turns the fast disk drives into caches for the slower mass storage devices. The HSM system monitors the way data is used and makes best guesses as to which data can safely be moved to slower devices and which data should stay on the fast devices.
Full article on HSM: http://en.wikipedia.org/wiki/Hierarchical_storage_management
We should have flash memory devices with just a PCIe memory controller (with external PCIe for external devices).
The wear leveling code should be in the kernel.
That would allow service-tailored wear leveling algorithms to be easily implemented.
I think such software could very easily be implemented using inotify and aufs...
Do you have any other example of a directly accessible cache, like what you're calling your SSD ? I don't... And being able to specify size, policies... does not cut it. Cache is transparent and system-managed, so that neither the user nor the apps have to care/know much about it.
Manually installing certain files on certain media does not feel at all like cache to me.
The Cloud - because you don't care if your apps and data are up in the air.
It's not caching because you do not have two copies of your data.
SURELY NOT!!!!!
I keep /home on my SSD but symlink everything big (e.g. Music or Videos subdirectories) out to a magnetic disk. Having /home on the SSD helps speed things up for me - all those kde4 dot files that get loaded...
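If you want to script that kind of setup, something like this does the move-and-symlink step. It is only a sketch: the function name is made up, the paths in the usage line are hypothetical, and it assumes nothing has the directory open while it is being moved.

```python
import os
import shutil


def relocate_with_symlink(src_dir, dest_parent):
    """Move src_dir under dest_parent and leave a symlink at the old path."""
    src_dir = os.path.abspath(src_dir)
    dest_parent = os.path.abspath(dest_parent)
    target = os.path.join(dest_parent, os.path.basename(src_dir))
    if os.path.lexists(target):
        raise FileExistsError(target)
    # shutil.move falls back to copy-then-delete across filesystems.
    shutil.move(src_dir, target)
    os.symlink(target, src_dir)
    return target


# Hypothetical usage: relocate_with_symlink("/home/me/Videos", "/mnt/bigdisk")
```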
I have the OS and most files I use on an ACARD 9010b battery backed ram drive that I boot off of, and long-term storage for less accessed files on a regular hard drive.
If power goes out, everything on the ram drive is copied to a compact flash card built into the unit.
This way you have the same speed of an SSD, but no wear-leveling problems, and your write speed is the same as the read speed (unlike an SSD).
It uses SATA, as well as dual-cable SAS interface to double the transfer rate, making it a tiny bit (OK, insignificantly) faster than an SSD for some operations.
Cheap? no, but not bad either.
Capacity is up to 64 GB of Ram.
most OS's might take 4-6 GB so plenty
Google acard 9010 to see the specs!
DM-cache appears to do just what you are talking about, perhaps you could further that project?
http://users.cis.fiu.edu/~zhaom/dmcache/index.html
It seems I'm a bit late to the party.
The only possibilities I've seen in Linux for this (aside from ZFS fuse, already mentioned) are...
OHSM
CacheFS
If you want to do this in hardware, you can use MaxIQ from Adaptec which, IIRC, uses Linux drivers to get this function from their storage controllers. There are also a few others, one of which only mirrors the first part of the hard drive.
UnionFS / AUFS is currently used to make Live CDs look writable (via ram). Using this would prefer the SSD more than the hard-drive it overlays.
Changes Needed:
- add write through
- copy to ssd on read
- if SSD full, delete lesser-used blocks
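The three changes in that list can be sketched as a toy model, with plain dicts standing in for the two devices. This is not how UnionFS/AUFS works internally; the class and attribute names are invented, and "lesser-used" is taken to mean the lowest access count.

```python
class HotBlockCache:
    """Toy write-through block cache: a dict 'ssd' in front of a dict 'hdd'."""

    def __init__(self, hdd, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1 block")
        self.hdd = hdd            # backing store: block number -> bytes
        self.capacity = capacity  # max blocks held on the "SSD"
        self.ssd = {}             # cached copies
        self.uses = {}            # access count per cached block

    def _admit(self, block, data):
        if block not in self.ssd and len(self.ssd) >= self.capacity:
            victim = min(self.uses, key=self.uses.get)  # least-used block
            del self.ssd[victim], self.uses[victim]
        self.ssd[block] = data
        self.uses[block] = self.uses.get(block, 0) + 1

    def read(self, block):
        if block in self.ssd:
            self.uses[block] += 1
            return self.ssd[block]
        data = self.hdd[block]
        self._admit(block, data)  # copy to SSD on read
        return data

    def write(self, block, data):
        self.hdd[block] = data    # write-through: the disk is always current
        self._admit(block, data)
```

Because every write also goes to the hard drive, losing the SSD loses nothing but speed. A real implementation would probably want to age the counters, since a pure use-count policy tends to throw out newly admitted blocks first.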
Science & open-source build trust from peer review. Learn systems you can trust.
For example Adaptec MaxIQ on quite a few of their controllers, which lets you add an Intel X25-E to your existing raid array. It transparently uses the SSD as a cache for the most frequently used data.
Only thing, it's sloooooooooooooow. I did some tests a while ago (v0.6.0) and the speed was about an order of magnitude below a raid5 setup on the same machine.
It's The Golden Rule: "He who has the gold makes the rules."
Courtesy of Facebook kernel hackers:
http://github.com/facebook/flashcache
As booting from a LiveUSB key demonstrates, it is possible to boot and run from a USB key 2GB or so in size, and an 8GB key would be roomy.
Thus a modestly priced SSD on SATA could be used to boot Linux and also contain symbolic links to directories or files on a much larger rotating medium or network resource.
Sun and others did a bit of work to move files and junk off the boot disk and onto a shared NFS resource....
One difficult-to-address IO problem on a demand-paged VM OS is the latency and lack of streaming that can be obtained, i.e. each page fault generates a single IO request to a disk. This applies to pages of text, data, or swap IO.
As many folk know, IO is often measured with largish memory buffers and largish system calls to the OS. Demand-paged IO has a granularity that is page size... and is about the worst IO increment that the system sees.
Latency of rotating media, whether a 5400 RPM disk or a hot 10,000 RPM disk, is very high when compared to an SSD, and this alone could tip the balance, giving SSD a strong place in a system design.
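To put rough numbers on the rotational part of that latency: on average the head waits half a revolution for the sector to come around. A quick sketch of the arithmetic (the function name is mine; seek time comes on top of this and is ignored here).

```python
def avg_rotational_latency_ms(rpm):
    """Average wait for the sector to come around: half a revolution."""
    return 0.5 * 60_000.0 / rpm


for rpm in (5400, 10000):
    print(f"{rpm} RPM: {avg_rotational_latency_ms(rpm):.2f} ms")
```

That works out to about 5.56 ms at 5400 RPM and 3.00 ms at 10,000 RPM, paid again on every page-sized request that misses.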
For this to gain traction, laptops and systems must have two disk interfaces. An SSD form factor could be established that was about the size of a book of matches or smaller to permit a pair of devices to fit in the box. This does not make sense for desk-size systems because LARGE DRAM is perhaps cheaper, and tricks like readahead daemons and a revisit of the sticky bit would make it moot.
Truth is stranger than fiction, but it is because Fiction is obliged to stick to possibilities; Truth isn't. Mark Twain.
http://perspectives.mvdirona.com/2010/04/29/FacebookFlashcache.aspx Maybe this ask /. article prompted them to release it?
Charles Wyble System Engineer