Optimizing Linux Systems For Solid State Disks
tytso writes "I've recently started exploring ways of configuring Solid State Disks (SSDs) so they work most efficiently in Linux, in particular Intel's new 80GB X25-M, which has fallen to a street price of around $400 and is thus within my toy budget. It turns out that the Linux Storage Stack isn't set up well to align partitions and filesystems for use with SSDs, RAID systems, and 4k sector disks. There is also some interesting configuration and tuning that we need to do to avoid potential fragmentation problems with the current generation of Intel SSDs. I've figured out ways of addressing some of these issues, but it's clear that more work is needed to make it easy for mere mortals to efficiently use next generation storage devices with Linux."
I think the bigger challenge will be in getting mere mortals to have a $400 toy budget to afford the SSD
However, for many of us who require better-than-average data security, the read/write behaviour of SSDs leaves the devices extremely vulnerable to analysis and discovery of data that the owner or author believes to be inaccessible to others: 'secure wiping', or the lack thereof, is the issue.
Obviously you should be encrypting your sensitive data.
Also, it should be no problem to write a bootable cd/usb that does a complete wipe. Just write over the whole disk, erase, repeat. No wear leveling will get around that.
Save your wrists today - switch to Dvorak
This article makes me wonder if any OS is really properly optimized for SSDs. Has there been any analysis as to whether or not Windows machines properly optimize the use of solid state disks? Perhaps the problem goes beyond just Linux?
The Matrix is real... but I'm only visiting!
If I mount /home on a separate drive, (good to do when upgrading) the rest of the Linux file system fits nicely on a small SSD.
My rights don't need management.
From economics, let's turn our attention to optimizing this toy of ours. The thing with SSDs is that they don't have a read/write head to worry about. This means that no matter where the data is stored in the device, all we need to do is specify the fetch location, and the logic circuits select that block to extract the data from the desired location. From what I've heard, SSDs have an algorithm that assigns different blocks to store the data so that the memory cells in a single location aren't overused.
Face your daemons!
> Vista has already started working around this problem, since it uses a default partitioning geometry of 240 heads and 63 sectors/track. This results in a cylinder boundary which is divisible by 8, and so the partitions (with the exception of the first, which is still misaligned unless you play some additional tricks) are 4k aligned. So this is one place where Vista is ahead of Linux…
Although the technology it is used in is repugnant, NTFS has always been the One True Filesystem. It descended from DIGITAL's ODS2 (On Disk Structure 2) which traces back to the original Five Models (PDP 1, 8, 10, 11 and 12). You see, ODS was written by passionate people with degrees and rich personal lives in Massachusetts who sang and danced before the fall of humanity to the indignant Gates series who assimilated their young wherever possible and worked them into early graves during his epic battle with the Steves before the UNIX enemy remerged after a 25 year sleep and nuked the United States, draining all of its technological secrets to the other side of the world. Gates, realizing what he's done, now travels the universe seeking to rebuild his legacy by purifying humanity while the Steve series attempts to rebuild itself. Some of the original Five are still around, left to logon to Slashdot and witness what's left of the shadow of humanity still in the game as they struggle blindly around in epic circles indulging new and different ways to steal music, art and technology to make up for their lack of creativity long ago bred out of them by the Gates series.
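The geometry arithmetic in the quoted passage can be checked with a few lines of Python. This is only a sketch: the 240-head, 63-sector geometry comes from the quote, while the 255-head geometry and the first partition starting at sector 63 are assumptions about a traditional DOS-style layout, and the helper names are made up for illustration.

```python
SECTOR_BYTES = 512    # logical sector size assumed throughout
ALIGN_BYTES = 4096    # the 4k boundary the quote is concerned with


def sectors_per_cylinder(heads, sectors_per_track):
    """Number of 512-byte sectors in one cylinder of a CHS geometry."""
    return heads * sectors_per_track


def is_4k_aligned(start_sector):
    """True if a partition starting at this sector begins on a 4k boundary."""
    return (start_sector * SECTOR_BYTES) % ALIGN_BYTES == 0


vista_cylinder = sectors_per_cylinder(240, 63)   # geometry from the quote
legacy_cylinder = sectors_per_cylinder(255, 63)  # assumed traditional geometry

print(vista_cylinder, is_4k_aligned(vista_cylinder))    # cylinder boundary, Vista style
print(legacy_cylinder, is_4k_aligned(legacy_cylinder))  # cylinder boundary, 255 heads
print(63, is_4k_aligned(63))                            # assumed first-partition start
```

With 240 heads a cylinder is 15120 sectors, a multiple of 8, so partitions that start on cylinder boundaries land on 4k boundaries. A 255-head cylinder is 16065 sectors, which is not a multiple of 8, and a first partition at sector 63 is misaligned under either geometry.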
SSDs gradually gain more and more sophisticated controllers which do more and more to try to make the SSD seem like an ordinary hard drive, but at the end of the day the differences are great enough that they can't all be plastered over that way (the fragmentation/long term use problems the story linked to are a good example). I know that (at present- this could and should be fixed) making these things run on a regular hard drive interface and tolerate being used with a regular FS is important for Windows compatibility, but it seems like a lot of cost could be avoided and a lot of performance gained by having a more direct flash interface and using flash-specific filesystems like UBIFS, YAFFS2, or LogFS. I have to wonder why vendors aren't pursuing that path.
Such tools already exist. Even the venerable "dd if=/dev/zero of=/dev/sda" is extremely efficient at flushing a drive well beyond the ability of any but the most well-equipped recovery services, and it's a lot faster than the "overwrite with zeroes, then ones, then 101010..., then 010101..., then random data" approach used by some people with too much time on their hands and too much paranoia for casual data.
Do you even lift?
These aren't the 'roids you're looking for.
> So why should I get a SSD vs. a CF card?
10 times better performance and wear-leveling worth a crap.
Don't forget android.
They're using their grammar skills there.
I don't think this is going to be a significant problem when compared to normal seek time problems.
Let's say we have 100 k of data to read. 512 byte blocks would require 200 reads. 4k blocks would require 25 reads.
For rotating discs: If the data is contiguous, we have to hope that all the blocks are on the same track. If they are, then there is 1 (potentially very costly) seek to get to the track with all the blocks on it. The cost of the seek is dependent on the track it's going to, the track it's on, and whether or not the drive is sleeping or spun down. Otherwise we also get to do another very short seek, which is going to add a bit of time to get to the next adjacent track. Worst case scenario all 200 blocks are on different tracks, scattered randomly on the platter, requiring 200 seeks. Ouch ouch ouch.
For SSDs: What is important is the number of cells we have to read. Cells will be 4k in size. All seek times are essentially zero. Best case scenario, all data is contiguous, and the start block is at the start of a cell. Read time boils down to how fast the flash can read 25 cells. Worst case scenario is where the data is 100% fragmented, such that each of the 200 512-byte blocks resides in a different cell, requiring 200 cell reads (an 8-fold increase in time required). There will also be overhead in copying out the 512 byte data from each buffer and assembling things, but this time is negligible for this comparison.
While the 8x time increase (order N) looks significant, it's important to compare the probabilities involved, and just how bad things get. The most important difference between how these two drives react is the space between fragments. The "worst case" for the SSD, 100% fragmentation, is highly unlikely. I don't even want to think about what a spinning disc would do if asked to perform a head seek for 100% of the blocks in, say, a 1 MB file. The read head would probably sing like a tuning fork at the very least. 2000 cell reads compared to 2000 seeks, the SSD will win handily every single time, even if the tracks on the disc are close.
If the spacing between fragments is anything near normal, say 30-100k, then there will be some seeking going on with the disc, and there will be some wasted cell reads with the SSD, but having to do one extra cell read compared with having to do an extra head seek, again the SSD wins hands down. The advantage of the SSD actually goes down as fragmentation goes down, because most fragments are going to cause a head seek, each of which will significantly widen the time gap. Also a spinning disc will read in the blocks much faster than the cells on an SSD.
I realize the OP was more describing the possibility of "not so much bang for the buck as you are expecting" due to fragmentation, and I know the above hits more on comparing the two than what happens to the SSD, but if you consider the effects of fragmentation on a spinning disc, and then weigh how the impact compares with a SSD, it's easy to see that fragmentation that sent you running for the defrag tool yesterday may not even be noticeable with a SSD. So I'd call this a "non-issue".
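The read counts in the 100 k example above can be reproduced in a few lines. This is a rough sketch, assuming "100 k" means 100 × 1024 bytes and that a read of one 4k cell costs the same however much of the cell is actually wanted; the function name is made up for illustration.

```python
import math


def reads_needed(data_bytes, unit_bytes):
    """Number of fixed-size units needed to cover data_bytes of contiguous data."""
    return math.ceil(data_bytes / unit_bytes)


DATA_BYTES = 100 * 1024  # assumed meaning of "100 k"

blocks_512 = reads_needed(DATA_BYTES, 512)        # 512-byte block reads
cells_best_case = reads_needed(DATA_BYTES, 4096)  # contiguous, cell-aligned data
cells_worst_case = blocks_512                     # every block in a different cell
slowdown = cells_worst_case / cells_best_case

print(blocks_512, cells_best_case, cells_worst_case, slowdown)
```

Under these assumptions the fully fragmented case costs 200 cell reads against 25 for contiguous data, a factor of 8, and each of those extra reads is still far cheaper than a head seek.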
What I'm waiting for is them to invest the same dev time in read speeds as write speeds. SSDs don't appear to be doing any interleaved reads - they're doing it for the writes because they're so slow. Though at this point I wonder if read speeds are just plain running into a bus speed limit with the SSDs?
I work for the Department of Redundancy Department.
It will outlast a standard hard drive by orders of magnitude so it's completely not an issue.
With wear leveling and the technology now supporting millions of writes it just doesn't matter. Here's a random data sheet: http://mtron.net/Upload_Data/Spec/ASIC/MOBI/PATA/MSD-PATA3035_rev0.3.pdf
"Write endurance: >140 years @ 50GB write/day at 32GB SSD"
Basically the device will fail before it runs out of write cycles. You can overwrite the entire device twice a day and it will last longer than your lifetime. Of course it will fail due to other issues before then anyway.
Can there be a mention of SSDs without this out-dated garbage being brought up?
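For a sense of scale, the datasheet line quoted above can be turned into total bytes written. This is back-of-the-envelope arithmetic only, assuming 365-day years, decimal gigabytes, and perfectly even wear leveling; the datasheet does not say how the vendor arrived at its figure.

```python
YEARS = 140        # from the quoted datasheet line
GB_PER_DAY = 50    # from the quoted datasheet line
CAPACITY_GB = 32   # from the quoted datasheet line

total_written_gb = YEARS * 365 * GB_PER_DAY          # lifetime writes in GB
full_overwrites = total_written_gb / CAPACITY_GB     # implied whole-device rewrites
overwrites_per_day = GB_PER_DAY / CAPACITY_GB        # daily whole-device rewrites

print(total_written_gb, full_overwrites, overwrites_per_day)
```

That works out to roughly 2.5 PB written over the rated period, or about 80,000 complete passes over the device at a little more than one and a half passes per day.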
Unfortunately flash SSDs usually have some percentage of sectors you cannot directly access; these are used for wear leveling and bad sector remapping. So when you dd with /dev/zero, it is quite possible that some part of the original data is left intact. And there can be quite a lot of those sectors: I recall reading about one SSD that had 32GiB of flash in it but had 32GB available for the user, so 2250MiB was used for wear leveling and bad sectors (it helps to get better yields if you can tolerate several bad 512KiB cells).
- Raynet --> .
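The 2250MiB figure above follows directly from the gap between binary and decimal units. A quick check, assuming the drive has exactly 32 GiB of raw flash and exposes exactly 32 GB to the user:

```python
GIB = 2**30   # binary gigabyte
GB = 10**9    # decimal gigabyte
MIB = 2**20   # binary megabyte

raw_bytes = 32 * GIB    # flash physically on the drive (assumed exact)
user_bytes = 32 * GB    # capacity visible to the user (assumed exact)

spare_mib = (raw_bytes - user_bytes) / MIB
spare_fraction = (raw_bytes - user_bytes) / raw_bytes

print(round(spare_mib, 1), round(spare_fraction * 100, 1))
```

So a little under 7% of the flash is never reachable through the normal block interface, and a plain overwrite of the visible device cannot be guaranteed to touch it.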
I'm just sitting here thinking. Doesn't an SSD have a preset number of writes in it due to its nature?
Does it really matter if they spread these writes around on the hard drive when the number of writes the drive is capable of doing is still the same in the end?
To drastically oversimplify, let's say that each block can be written to twice. Does it really matter if they used up the first blocks on the drive and just spread towards the end of the drive partition with general usage rather than jumping all over to try to spread the writes around?
Am I thinking about this the wrong way? What benefit does it give them to spread the writes around if the total number of writes doesn't change? Doesn't it just further fragment the files with little gain?
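One way to look at the question above is a toy model. The sketch below assumes a workload that keeps rewriting the same logical block (a log file, say), uses the "each block can be written twice" simplification from the post, and treats wear leveling as simple round-robin placement. Real controllers are far more elaborate, so this only illustrates the principle.

```python
def writes_until_first_failure(num_blocks, endurance, wear_leveling):
    """Count how many rewrites of one logical block succeed before a
    physical block exceeds its endurance.

    Without wear leveling every rewrite hits physical block 0.
    With (idealised, round-robin) wear leveling each rewrite is placed
    on the next physical block in turn.
    """
    wear = [0] * num_blocks
    total = 0
    while True:
        target = total % num_blocks if wear_leveling else 0
        if wear[target] == endurance:
            return total  # the next write would exceed this block's limit
        wear[target] += 1
        total += 1


print(writes_until_first_failure(1000, 2, wear_leveling=False))
print(writes_until_first_failure(1000, 2, wear_leveling=True))
```

The total write budget of the device is the same in both cases, but a real workload does not spread its writes evenly. Without leveling the hot block dies after 2 writes while 999 blocks are untouched; with leveling the same workload gets 2000 writes before anything wears out.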
Your CF card is going to use the USB interface which maxes out at about 40Mbps as opposed to using an internal SSD's SATAII interface which maxes at 300Mbps. Not quite an order of magnitude, but close.
On the other hand, if you're going to use an external SSD connected to the USB port, then you wouldn't see any difference between the 2 in terms of speed. Lifespan might be longer w/ the SSD due to better wear leveling, but in either case you're probably going to lose or break it before you get to the fail point.
A real SSD has several advantages over using CF cards, but not for the reasons you state.
With a simple plug adapter, CF cards can be connected to an IDE interface, so speeds won't be limited by interface speed. The most recent revision of the CF spec adds support for IDE Ultra DMA 133 (133 MB/s)
A couple of additional points, just because I love nitpicking:
- A USB 2.0 mass storage device has a practical maximum speed of around 25 MB/s, not 40 Mb/s.
- The so-called SATA II interface (that name is actually incorrect and is not sanctioned by the standardization body) has a maximum speed of 300 MB/s, not Mb/s.
Why not functionally group files to decrease or eliminate fragmentation? Or maybe this is already done.
For example - I have a large collection of MP3 files. They essentially do not change, as in I don't edit them, and rarely erase them. The file system could look at the type of file (mp3 vs. doc) and place it accordingly. It could also look at the last change to the file and place it in a certain area. Older unchanged files are placed in a tightly packed file area that is optimized and not fragmented.
..........FULL STOP.
There are a few tricks up the manufacturer's sleeve to make this slightly better than it really is:
1. large block size (120k-200k?) means that even if you write 20 bytes, the disk physically writes a lot more. For logfiles and databases (quite common on desktops too, think of index dbs and sqlite in firefox for storing the search history...) where tiny amounts of data are modified, this can add up rapidly. Something writes to the disk once every second? That's 16.5GB / day, even if you're only changing a single byte over and over.
2. Even if the memory cells do not die, due to the large block size, fragmentation will occur (most of the cells will have a small amount of space used in them). There have been a few articles reporting that even devices with advanced wear leveling technology like Intel's exhibit a large performance drop (less than half of the read/write performance of a new drive of the same kind) after a few months of normal usage.
3. According to Tomshardware, unnamed OEMs told them that all the SSD drives they tested under simulated server workloads got toasted after a few months of testing. Now, I wouldn't necessarily consider this accurate or true, but I sure as hell would not use SSDs in a serious environment until this is proven false.
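The daily write volume in point 1 can be checked with simple arithmetic. The sketch below assumes one small write per second, that each write forces a full block to be rewritten, and that "k" means KiB; the 120k and 200k block sizes are the range guessed at in the post, not measured values.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def daily_physical_bytes(block_bytes, writes_per_second=1):
    """Bytes physically written per day if every small write rewrites a whole block."""
    return block_bytes * writes_per_second * SECONDS_PER_DAY


low_gib = daily_physical_bytes(120 * 1024) / 2**30   # 120k blocks
high_gib = daily_physical_bytes(200 * 1024) / 2**30  # 200k blocks

print(round(low_gib, 1), round(high_gib, 1))
```

With 200k blocks this comes to about 16.5 GiB per day, matching the figure in the post; with 120k blocks it is closer to 10 GiB per day.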
It takes a man to suffer ignorance and smile
Be yourself no matter what they say
All nice and dandy, but these figures aren't exactly honest. In a normal scenario your filesystem consists in large part of static data. These blocks/cells are never rewritten. Therefore the writes (for logfiles etc.) are concentrated on a small part of the disk, wearing it out rather more quickly.
Having a few Compact Flash disks wear out in the recent past, I'm not exactly anxious to replace my server disks with SSD.
If it's an older laptop or the mechanical hard disk died, go for it. Addonics make SATA CF adapters so you are not restricted to IDE CF adapters.
"This post is an artistic work of fiction and falsehood. Only a fool would take anything posted here as fact."
Why is this informative? CF with an adapter is NOT USB.
From my experience, using an adapter puts it on the native interface - notably, with CF, it's easiest to put the device into a machine that has a native IDE (not SATA) interface. CF is pin compatible with IDE.
Now, in the current offering of SLC/MLC "drives" you can actually get better read/write since they "raid", for lack of a better term, the internal chips. I'm using a Transcend ATA-4 CF device that gets around 30MB/sec read/write in a machine in my garage; it's an SLC device that isn't their top of the line, but it was more cost-effective.
So, using the IDE/ATA-4 interface on the CF card, it gets lower CPU utilization than a USB device. Still doesn't hit the 40MB/sec you quoted, but 40MB/sec is a pipe dream on USB in my experience.
Karnal
> Your CF card is going to use the USB interface
This is Informative?
CF cards are actually IDE devices. The adapters that plug CF into your IDE bus are just passive wiring.. no protocol adapter needed.
It's trivial to replace a laptop drive with a modern high-density CF card, and sometimes a great thing to do.
The highest-performance CF cards today use UDMA for even higher bandwidth.
HighSpeed USB can't reasonably get over 25MB/sec from the cards using a USB-CF adapter, but you can do better by using its native bus.
I purchased an X300 Thinkpad for the company this week and took a close look at it. I thought expensive business notebooks came without crapware. And I was sure the X300 would be optimized. But they had defrags scheduled! I always thought defrag was a no-no for SSDs. Now I am not sure anymore. I uninstalled it first. But who knows?
> So why should I get a SSD vs. a CF card?
CF works passably in WORM-like scenarios, where you basically use it in read-only mode and update it rarely and in big chunks. For random R/W access, CF lacks wear leveling to give it a tolerable life expectancy... Thus you commonly see it used in embedded devices such as routers and dumbterms where you may update the firmware or OS every few months; you don't see it used much in real, live writable FSs.
It also tends to have rather poor performance, with reads in the sub-5MB/s range and writes taking forever. So again, using a 32MB CF to boot a router works great; using a 32GB CF as the system partition for a modern desktop PC (even with some solution to the limited erase lifetime, such as a UnionFS against a ramdisk with commit-on-shutdown), you can expect 10+ minute boot times.
> Why not functionally group files to decrease or eliminate fragmentation? Or maybe this is already done.
In a Linux system, this is easily done, but few people bother.
Most of the write activity in Linux is in /tmp, and also in /var (for example, log files live in /var/log). User files go in /home.
So, you can use different partitions, each with its own file system, for /, /tmp, /home, and /var.
The major problem with this is that, if you guess wrong about how big a partition should be, it's a pain to resize things. So my usual thing is just to put /tmp on its own partition, and have a separate partition for / and for /home.
The /tmp partition and swap partition are put at the beginning of the disc, in hopes that seek penalties might be a little lower there. Then / has a generous amount of space, and /home has everything left over.
When a *NIX system runs out of disk space in /tmp, Very Bad Things happen. Far too much software was written in C by people who didn't bother to check error codes; things like disk writes don't fail often, but when /tmp is 100% full, every write fails. A system may act oddly when /tmp is full, without actually crashing or giving you a warning. So, the moral of the story is: disk is cheap, so if you give /tmp its own partition, make it pretty big; I usually use 4 GB now. However, if you run out of disk space in /var, it is not quite as serious. Your system logs stop logging. And, many databases are in /var so you may not be able to insert into your database anymore.
The main Ubuntu installer is fast, because it wipes out the / partition and puts in all new stuff. So, if you have separate partitions for / and /home, life is good: you just let the installer wipe /, and your /home is safely untouched. It's annoying when you have /home as just a subdirectory on / and you want to run the installer. But, by default, the Ubuntu installer will make one big partition for everything; if you want to organize by partitions, you will need to set things up by hand.
steveha
lf(1): it's like ls(1) but sorts filenames by extension, tersely
> Although the technology it is used in is repugnant, NTFS has always been the One True Filesystem.
I thought ZFS was.
And ZFS has native support for SSD as L2ARC. http://www.c0t0d0s0.org/media/presentations/ssd.pdf I have nothing but praise for ZFS. Simple to manage, reliable, fast. With native CIFS instead of User file system Samba, I've seen orders of magnitude better performance from Windows machines when doing networked file access. Gary
> Your CF card is going to use the USB interface which maxes out at about 40Mbps as opposed to using an internal SSD's SATAII interface which maxes at 300Mbps. Not quite an order of magnitude, but close.
There are three factual errors in that statement.
1. CF-cards can be connected directly to the ATA-port via a simple passive connector-adapter and therefore have a theoretical maximum transfer speed of 133MB/s, which roughly translates to 1300Mbps. There are even adapters with room for both a master and slave CF-card in the same shape, size and connector position as a 2.5" ATA drive, specifically made to use CF-cards in laptops.
2. USB is 480Mbps.
3. SATA is 3000Mbps
The big speed-difference between SSD and CF is due to the construction of the devices themselves, not the interface that connects them to the computer.
A fast CF-card can get you around 40MB/s and at the moment they also top out at 32GB sizes and they're not made to handle long term random write operations.
A fast SSD can get you all the way to the theoretical maximum of SATA, around 300MB/s, and are available in much bigger sizes.
/.Mattsson - My native language is not English, so please don't whine over linguistic errors. (That's lame anyway...)
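Much of the confusion in this sub-thread comes from mixing megabits and megabytes. A small conversion sketch follows; the 480 and 3000 Mbit/s line rates are the ones given in the list above, while the 8b/10b encoding overhead applied to SATA is background knowledge about that interface rather than something stated in the thread. The function name is illustrative.

```python
def payload_mbytes_per_s(line_rate_mbit, data_bits=8, line_bits=8):
    """Convert a line rate in Mbit/s to MB/s of payload.

    data_bits/line_bits models line encoding: 8/8 means no encoding
    overhead, 8/10 means 8b/10b encoding (10 bits on the wire per byte).
    """
    return line_rate_mbit * data_bits / line_bits / 8


usb2_raw = payload_mbytes_per_s(480)                    # USB 2.0 signalling rate
sata_3g = payload_mbytes_per_s(3000, line_bits=10)      # SATA 3 Gbit/s with 8b/10b
ata133_mbit = 133 * 8                                   # UDMA 133 MB/s in plain Mbit/s

print(usb2_raw, sata_3g, ata133_mbit)
```

The 60 MB/s for USB 2.0 is a signalling ceiling before protocol overhead, which is consistent with the roughly 25 MB/s practical figure mentioned elsewhere in the thread.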
CHS disappeared ages ago. The maximum device supported was ~8 Gbyte (1023 cylinders * 255 heads * 63 sectors * 512 bytes)
echo -e 'global _start\n _start:\n mov eax, 2\n int 80h\n jmp _start' > a.asm; nasm a.asm -f elf; ld a.o -o a;
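The roughly 8 Gbyte CHS ceiling mentioned above is a straightforward product of the geometry limits. A quick check using exactly the numbers from the comment (the function name is made up for illustration):

```python
def chs_capacity_bytes(cylinders, heads, sectors_per_track, sector_bytes=512):
    """Largest capacity addressable with the given CHS geometry."""
    return cylinders * heads * sectors_per_track * sector_bytes


limit = chs_capacity_bytes(1023, 255, 63)

print(limit)                    # bytes
print(round(limit / 10**9, 2))  # decimal GB
print(round(limit / 2**30, 2))  # binary GiB
```

That is about 8.4 GB in decimal units, or a little under 7.9 GiB.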
> Because of this, I imagine that the author would like Linux devs to better support SSD's by getting non-flash file systems to support SSD better than they are today.
Heh. The author is a Linux dev; I'm the ext4 maintainer, and if you read my actual blog posting, you'll see that I gave some practical things that can be done to support SSDs today just by better tuning the parameters given to tools like fdisk, pvcreate, mke2fs, etc., and I talked about some of the things I'm thinking about to make ext4 support SSDs better than it does today.
The modern hot-shit high-speed CF cards have wear leveling and do UDMA transfers, you get a CF to ATA adapter, not CF to USB, and they will outperform most hard disks.
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
That was my idea when I proposed an "object storage system" here on /. a few months ago: associate type and metadata with every file, making them more "object-like" (as in object-oriented programming). The storage system would know the behaviour of each object (whether it is likely to grow, or more likely to be modified in place, or probably not modified at all, etc), and would choose the most efficient way of storing every particular kind of data. I've also proposed separate namespaces for each process, capability-based security, dropping paths in favour of non-hierarchical tags, and a few other "revolutionary" ideas that all had only one downside: nobody's going to break backwards compatibility, especially while the current system still "just works".
"tytso" is Theodore Ts'o.
He and Rémy Card wrote ext2. He and Stephen Tweedie wrote ext3. He and Mingming Cao wrote ext4.
He maintains the filesystem repair tool (e2fsck) and resizing tool for those filesystems.
He also created the world's first /dev/random device, maintained the tsx-11.mit.edu Linux archive site for many years, and wrote a chunk of Kerberos. He's been the technical chairman for many Linux-related conferences. He pretty much runs the kernel summit.
He's certainly not a kid. I think he's about to turn 40.
Really, Intel ought to give tytso piles of free SSD hardware before it goes on sale. This would help Intel by encouraging tytso to optimize Linux for Intel's SSD hardware.
This could be fun. Here are some more suggestions:
- Welder - The little chips don't last long against a good arc welder.
- 600 VAC - Why stop at a wall outlet?
- Tesla Coil - 200 kV is better than 600 VAC
- Lightning Rod. Why stop at 200 kV?
- Oxy-acetylene Torch - higher temperatures
- Plasma Cutter - even higher temperatures
- NdYAG Laser - Etch your name into the remains of the flash chip.
- Chew Toy for Dog - Don't underestimate some of those canines, although USB keys might not be good for them.
- Log-Splitting Practice. How good are you at aiming that Axe?
- Place USB in Cement Footings of a building. Do the mob thing.
- Rock crusher
- Grinding Machine
- Wood chipper / pulper
- Cement kiln
- Blast Furnace
- Industrial Press - Terminator Style!
I'm pretty sure that some of these machines can destroy industrial quantities of USB keys, with little difficulty. Cement kilns and rock crushers can destroy just about anything. It would be interesting to see the resulting crushed rock in a piece of cement though. It would be colorful.
I use 1GB for /boot because I'm a kernel developer and I end up experimenting with a large number of kernels (yes, on my laptop --- I travel way too much, and a lot of my development time happens while I'm on an airplane). In addition, SystemTap requires compiling kernels with debuginfo enabled, which makes the resulting kernels gargantuan --- it's actually not that uncommon for me to fill my /boot partition and need to garbage collect old kernels. So yes, I really do need 1GB for /boot.
As far as LVM, of course I use more than a single volume; separate LV's get used for test filesystems (I'm a filesystem developer, remember), but more importantly, the main reason to use LVM is that it allows you to take snapshots of your live filesystem and then run e2fsck on the snapshot volume --- if the e2fsck is clean you can then drop the snapshot volume, and run "tune2fs -C 0 -T now /dev/XXX" on the file system. This eliminates boot-time fsck's, while still allowing me to make sure the file system is consistent. And because I'm running e2fsck on the snapshot, I can be reading e-mail or browsing the web while the e2fsck is running in the background. LVM is definitely worth the overhead (which isn't that much, in any case).
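The snapshot-and-check routine described above could be sketched roughly as follows. This only builds the command lines and does not run anything. The volume group `vg0`, logical volume `root`, snapshot name and snapshot size are all hypothetical placeholders, and the exact options are one plausible choice rather than the poster's actual script.

```python
def snapshot_fsck_plan(vg, lv, snap_name="fsck-snap", snap_size="1G"):
    """Return the commands for an online fsck via an LVM snapshot, in order.

    The final tune2fs step should only be run if e2fsck on the snapshot
    exited cleanly (status 0); that check is left to the caller.
    """
    origin = f"/dev/{vg}/{lv}"
    snapshot = f"/dev/{vg}/{snap_name}"
    return [
        # 1. Take a snapshot of the live filesystem.
        ["lvcreate", "--snapshot", "--size", snap_size,
         "--name", snap_name, origin],
        # 2. Check the snapshot: -f forces a full check, -n changes nothing.
        ["e2fsck", "-f", "-n", snapshot],
        # 3. Drop the snapshot once the check has finished.
        ["lvremove", "-f", snapshot],
        # 4. If the check was clean, reset the mount count and last-check time.
        ["tune2fs", "-C", "0", "-T", "now", origin],
    ]


# Hypothetical volume names, printed for inspection only.
plan = snapshot_fsck_plan("vg0", "root")
for command in plan:
    print(" ".join(command))
```

To actually run it, each command could be passed to `subprocess.run(command, check=True)` as root, stopping before the `tune2fs` step if `e2fsck` reports problems.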