A Semi-Radical Approach To Avoiding fsck
Dru writes: "This is an article about a hardware technology that is largely unknown
in the new Unix community. In theory, with this inexpensive hardware, your BSD or Linux box could start doing guaranteed reboots in under 2 minutes (no fsck required) and super fast database writes. It could leapfrog all of the journaling filesystem projects as well.
Yes, I wrote the article. The article is long, detailed, and mentions FreeBSD often. However, I do believe it
is relevant to any other PC Unix. If enough people learn about it, maybe they will start demanding
it from their favorite hardware vendor." With RAM and hard drive space both continuing to decline, I wonder how the speed / use curve for individual PCs' storage (from L1 cache to backups) will evolve. With a similar bent, Arek urges you to "take a look at our company's
Solid State Disk Drives." How'dja like 8 or so gigs of DRAM next time you edit a video or burn a CD?
Did he mean "RAM and hard drive price both continuing to decline"?
Just something to think about for those still skeptical...
- A.P.
--
* CmdrTaco is an idiot.
"Remember when the U.S. had a drug problem, and then we declared a War On Drugs, and now you can't buy drugs anymore?"
A write-caching disk controller combined with a journaling file system would give you the same benefit. You're just reinventing the wheel..
The only really new thing here seems to be the fact that the "TRAM" is file-system aware, which is just another way of saying that you are investing in hardware which will just tie you to tired old EFS.
Windows NT has had a journaled file system forever, and the journaling doesn't cause the major performance impact that everyone seems to think it does. Maybe someday Linus will get in the mood and allow a journaling FS into Linux.
On a side note, what does the OS do in case of some sort of TRAM failure?
The website provided is non-navigable.
Yes, it is a crappy website; after about 15 minutes I discovered you click on an image to get to the 'QikDRIVE' info page here: http://www.platypus.net/pages/products_qikdriv2.html
I haven't downloaded the PDF, but it looks like there's no price info. It does offer drivers though, so it's probably not vaporware.
echo $email | sed s/[A-Z]//g | rot13
Am I mistaken, or did this article feel just too warm and fuzzy? I know there is a lot of good technical info in there, but it's all wrapped up in a very strange manner. I don't think you are solving the whole problem by overlooking the rest and waving your hands over it. I mean, so you put a few sticks of memory and power on a PCI card! You do this because a UPS can die; well, I've got news for you: the PCI card could die too! And if you are trying to make reboots faster, don't bother. If you are serious you would have a backup system, and the same should be true for a web server dying. The only time I want fast reboots is when a good game of UT croaks on me and I want to get back to fragging... this is not technology I would use for mission-critical apps!
Hmmmmm... the more I think of it, the more it feels like a marketing team thought that a semi-tech article on Slashdot would be just the ticket for killer web site traffic! Maybe it's the lack of caffeine on my part... What do other people think?
Non-Deterministic Finite Automata
Sure, a battery backup sounds like it solves this, but consider that DRAM stores its charge on tiny capacitors, and requires a controller to be performing "refresh" access cycles regularly (usually about every 15.6 µs). This means that not only must the battery be good, but the controller accessing the DRAM must continue providing the refresh cycles without interruption. That may sound simple, but not all DRAMs are created alike.... SDRAM DIMMs have a feature called Serial Presence Detect (SPD): a small non-volatile 2-wire serial EEPROM that holds identifying data about the size and timing parameters of the memory. A typical DRAM controller would be initialized at boot time... a card like this would require a special DRAM controller that only initialized its timings when the DRAM/battery is first installed. Perhaps the controller would be designed to use relatively slow and conservative timings, always, so it'd never be able to reinitialize to other settings (that could be wrong) and/or stop providing the critical refresh at any point.
The point is that to retain memory, DRAM requires not only power but a properly operating controller to supply the refresh cycles. Magnetic media maintains its memory without either of these conditions. Compared to magnetic media, DRAM is very volatile. "Mission Critical" data, whatever that may be, would be existing at tiny charges on the very tiny capacitors, which could dissipate in only about 4-8 ms, if the DRAM controller doesn't perform perfectly.... inside a computer (designed as a reliable server) which has just crashed for some unknown reason!
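For a rough sense of where a refresh interval on the order of microseconds comes from: if every row must be refreshed within the retention window, a controller that spreads refreshes evenly has to issue one every (window / rows). The row counts below are assumed example values chosen for illustration, not figures from the post or from any particular datasheet.

```python
# Back-of-the-envelope distributed-refresh interval.
# The (retention, rows) pairs are assumed examples, not datasheet values.

def refresh_interval_us(retention_ms: float, rows: int) -> float:
    """Microseconds between refresh cycles if rows are refreshed evenly."""
    return retention_ms * 1000.0 / rows

examples = [(4, 256), (8, 512), (64, 4096)]
for retention_ms, rows in examples:
    interval = refresh_interval_us(retention_ms, rows)
    print(f"{retention_ms} ms / {rows} rows -> {interval:.3f} us per refresh")
```

Miss enough of those cycles in a row and the retention window is exceeded, which is the failure mode described above.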
PJRC: Electronic Projects, 8051 Microcontroller Tools
First, it is absolutely critical that the OS creates some log or structure of operations on the TRAM for filesystem operations. Basically, if the OS can mark the beginning and end of an operation and place it in this memory, you can now get a journaled meta data filesystem without a complete re-architecture of a filesystem.
Basically, if the OS can determine the beginning and end of an operation (a transaction) and it logs this information, then we have a journaling file system. Any persistent storage will suffice for the journal: 'TRAM', hard disk, or clay bricks. The only difference is the access time.
In general there is no magical way for the OS to know what data marks the beginning and end of a transaction. The OS could try to handle metadata in this fashion: it can log the metadata changes it would make as atomic transactions and replay uncommitted transactions on a reboot. However, the file system still needs to be aware of this journaling.
Consider a power failure during a commit to the file system. The file system is in a partially modified state and the transaction has not been retired from the TRAM journal (since it did not complete). When the system boots again, the TRAM journal is replayed and the same operation begins again, except this time on an inconsistent file system. The file system needs to recognize that a partially committed transaction needs to be rolled back.
The above is based on my (very incomplete) understanding of journaling file systems. However, a TRAM card amounts to a cache for a file system journal, so in no sense is it going to replace or leap-frog journaling file systems.
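One common way to cope with replay onto a half-modified file system is to log whole new block contents, so that redoing a committed transaction is idempotent and a transaction with no commit record is simply ignored. The following is a minimal sketch of that idea only; the `Journal` class, the record layout, and the block names are hypothetical and do not describe TRAM or any real file system.

```python
from dataclasses import dataclass, field

@dataclass
class Journal:
    """Stands in for the persistent journal (TRAM, disk, or clay bricks)."""
    records: list = field(default_factory=list)

    def log_transaction(self, txid, writes):
        # Bracket the block images with begin/commit markers.
        self.records.append(("begin", txid, None))
        for block, data in writes.items():
            self.records.append(("write", txid, (block, data)))
        self.records.append(("commit", txid, None))


def replay(journal, disk):
    """Redo committed transactions; skip any that lack a commit record."""
    committed = {txid for kind, txid, _ in journal.records if kind == "commit"}
    for kind, txid, payload in journal.records:
        if kind == "write" and txid in committed:
            block, data = payload
            disk[block] = data  # full-block overwrite, so redo is idempotent


disk = {0: "inode v1", 1: "bitmap v1"}
journal = Journal()
journal.log_transaction(1, {0: "inode v2", 1: "bitmap v2"})

# Simulate a crash midway through applying transaction 1 to the disk.
disk[0] = "inode v2"

# Simulate a transaction whose commit record never reached the journal.
journal.records.append(("begin", 2, None))
journal.records.append(("write", 2, (0, "inode v3")))

replay(journal, disk)
replay(journal, disk)  # replaying a second time changes nothing
print(disk)
```

Either way, the file system has to cooperate by producing the begin/commit markers, which is the point made above: the card is a fast home for the journal, not a substitute for one.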
Here is a link to the Solid State Hard Drive pricing page from CDW:
http://www.cdw.com/shop/search/results.asp?grp=HS
Platypus products are listed as well as some from Quantum and Sandisk.
You are talking $1,969.40 US currency for the Platypus QikDRIVE8 512MB, the smallest model I saw.
CDW is the Authorized reseller I found for the US.
Most RAID controllers will give you a battery backed-up write-back RAM cache. Depending on how you configure it, it will say that a write is committed as soon as it's in RAM. This accomplishes the same net effect without requiring all this modification of the OS.
Of course, lots of people don't like to configure their RAID controllers this way, because there is no redundancy for data in RAM, not to mention that the risk of failure is still higher than with a hard disk.
I hate to say it, but that article seems like it was written by someone who has not been out in the real world.
sigs are a waste of space
Why on earth do you want to tell us things like "Unix was designed to be simple. This means, if they found that they could do certain things as libraries in user space, then it didn't belong in the kernel."? It has absolutely nothing to do with TRAM. Actually, that's true for nearly everything you say in your article: you use a lot of irrelevant examples and try to mention everything you seem to know about Unix, and then explain the solution in 2 lines?! Why don't you mention the really interesting things, like the fact that such cards most probably fail just as often as UPSes, why this should be on a PCI card and not on the disk (OK, that's because you want to access the memory directly, but please explain this...), or what the consequences are for access time?
Although the idea is good, I think you could have done a much better article; come to the point!
0x or or snor perron?!
No, but I'd bet on it being lots of money. We looked at solid state drives at a previous company. Since it was for mission critical stuff, we were looking at a RAID5 array of them, and it was priced at over £200,000 for a modest sized array -- something that would cost probably a tenth of that with conventional drives.
"The invisible and the non-existent look very much alike." -- Delos B. McKown
This has been talked of for quite a time, and is hardly radical. What's more, it is not an alternative to journal-based filesystems; logically it's an adjunct to them.
First you have your filesystem that buffers transactions in a journal that is streamed to disk. Then, for performance, to avoid all those extra seeks, you put the filesystem journal on another device, say a small, fast, dedicated disk. Then you make that device an NVRAM device rather than something based on spinning rust.
What's more, if you are interested in something like mail systems, where you get a lot of transactions that *must* be committed to stable storage (although a lot of MTAs don't do that in spite of the wording of RFC821), and you use a filesystem like ext3 with a data journalling mode, then putting the journal onto NVRAM makes a huge difference: by the time it comes to the point where data would be committed to the disk from the journal, most of the data (i.e. e-mail messages) is now unwanted (since the messages have been delivered to their final local destination or passed on for onward transmission), and so you don't even need to do the disk ops...
All of this is pretty much available now in ext3 other than the tools to get the journal onto an NVRAM disk, and that's just detail.
So, nice idea, needs more flesh, a little more infrastructure needed round it.
[Those who came to the London UKUUG Linux Conference might well have heard these discussions before, going on in various corridors :-) ]
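The mail-spool argument can be illustrated with a toy model: files are durable once they sit in the NVRAM journal, and anything deleted before the next checkpoint never costs a disk write. This is a sketch under simplifying assumptions; the `JournaledSpool` class and the message counts are made up for illustration, and real ext3 data journalling differs in many details.

```python
class JournaledSpool:
    """Toy model: the journal lives in NVRAM, the main filesystem on disk."""

    def __init__(self):
        self.nvram_journal = {}   # name -> data, already on stable storage
        self.disk = {}
        self.disk_writes = 0

    def write_file(self, name, data):
        # The write is durable as soon as it is in the NVRAM journal.
        self.nvram_journal[name] = data

    def delete_file(self, name):
        if name in self.nvram_journal:
            # Deleted before checkpoint: it never has to reach the disk.
            del self.nvram_journal[name]
        elif name in self.disk:
            del self.disk[name]
            self.disk_writes += 1

    def checkpoint(self):
        # Flush whatever is still live from the journal to the disk.
        for name, data in self.nvram_journal.items():
            self.disk[name] = data
            self.disk_writes += 1
        self.nvram_journal.clear()


spool = JournaledSpool()
for i in range(1000):
    spool.write_file(f"msg{i}", b"mail body")
for i in range(990):              # most messages are delivered and removed
    spool.delete_file(f"msg{i}")
spool.checkpoint()
print(spool.disk_writes)          # only the surviving messages hit the disk
```

With these made-up numbers, 1000 durable writes turn into 10 disk operations, which is the effect described for short-lived mail files.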
Of course, implementing a separate bus will take millions in research (after all, it has to be done right), but once everything is decided on, it's probably only $20 or so in extra hardware. In theory, all you'd need is another PCI bridge chip or similar. Ever seen the inside of a NetApp? The motherboard has a CPU, space for RAM, a PCI bridge, and some slots. Nothing else. Extremely simple.
What I was trying to say is that hardly anything was said about TRAM compared to the extensive description of how it currently is done; all sorts of current solutions are covered, but when it comes to TRAM he just says `do it so and so' without mentioning WHY to do it that way or what the alternatives are. This was just an example of that, but I could have chosen a better one...
0x or or snor perron?!
Aside from some reliability issues of this technology, I wonder if it's worth the trouble.
Besides, BeOS boots in about 15 seconds into GUI, even if you previously turned off the PC without shutting down. So, journaled filesystems DO have advantages. Linux may never achieve such high speeds in booting up, but still, I predict that a good JFS will benefit it.
Sigged!
see the Rio / Vista work by Pete Chen, Dave Lowell, et al. which won best paper at SOSP several years ago...
You should read this Bob The Angry Flower comic:
http://www.angryflower.com/bobsqu.gif
How about putting this TRAM on the disk drive itself, and having the OS simply tell it the dependency DAG? The drive can perform its own write reordering (probably more efficiently, since it knows where the disk head really is and all the specifics of its geometry) and then finish off the queue when it first gets power after a power loss.
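Handing a device a dependency DAG amounts to letting it choose any write order that is a valid topological order of the graph. A minimal sketch using Python's standard `graphlib` module follows; the block names and their dependencies are hypothetical, and sorting the ready set by name is only a stand-in for whatever head-position heuristic a real drive would use.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pending writes: block -> blocks that must be written first.
depends_on = {
    "directory_entry": {"inode"},
    "inode": {"data_block", "block_bitmap"},
    "data_block": set(),
    "block_bitmap": set(),
}

sorter = TopologicalSorter(depends_on)
sorter.prepare()

order = []
while sorter.is_active():
    # Everything returned here has all of its prerequisites on the platter,
    # so the drive is free to service these in whatever order suits it.
    ready = sorted(sorter.get_ready())
    for block in ready:
        order.append(block)
        sorter.done(block)

print(order)
```

After a power loss, the drive would resume the same loop with whatever part of the queue survived in its battery-backed memory.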
Perhaps in the UNREAD (which I guess is fairly large, hurmf) portion of the community. Chapter 8, section 2 of The Design and Implementation of the 4.4BSD Operating System talks about this idea on page 284. It refers to research done by Moran et al, 1990. The references at the back of the chapter refer to "Breaking Through the NFS Performance Barrier," Proceedings of the Spring 1990 European UNIX Users Group Conference, pages 199-206.
So there you go, there's TWO ways that we could have heard about this. I doubt anyone here got that first hand, but the 4.4BSD book is a fairly common book to have for those who are interested in the innards of an OS.
-bugg
Seriously...do they use some special proprietary ram? 8G of cdram in 512 meg chunks is only a couple of K. Hardly justifies another $24k tacked onto it. Is this another example of "charge what the market will bear"? I understand there are development costs and the like, but _geez_ $26k is _a lot_ of money. Don't give me an answer like "they are not intended for home use, so they charge more", because that's a bullshit reason (even though it's done all the damn time).
-- Who is the bigger fool? The fool or the fool who follows him? --
Both boot and swap caching would be most helpful on a [lap|palm]top machine, where epic uptimes are irrelevant. Normally, low-power RAM that is most efficient for battery-backed use is slower than the RAM we're accustomed to using. But it's still orders of magnitude faster than disk access.
[100% ISO 646 Compliant]
SVM, ERGO MONSTRO.
This is what the Network Appliance boxen do to speed NFS writes.
All NFS write transactions are committed to NVRAM first, so that they can be acknowledged. Then the writes to disk are sorted and blasted out. Very efficient, very fast.
It is this NVRAM (as well as using a modified RAID-4 on top of the WAFL filesystem) that makes a NetApp much faster (yet still safer) than most other NFS servers. I've often thought about creating just such an NVRAM board for a PC, so that I could do the same thing with my Linux fileservers.
Note that the NetApp implementation caches NFS requests, not filesystem-level data. Say I'm changing 1 byte in a block. If I buffer filesystem data, I have to cache the whole block. If I'm buffering the NFS request, it'll be much smaller.
Buffering (in NVRAM) the log data might work well for something like ext3.
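The request-logging idea can be sketched as follows: small write requests are appended to an NVRAM log and acknowledged at once, then later merged into blocks and written out in ascending block order. This is a toy model, not NetApp's implementation; `NvramRequestLog`, the 4096-byte block size, and the tiny four-block "disk" are all assumptions made for illustration.

```python
BLOCK_SIZE = 4096  # assumed block size, for illustration only

class NvramRequestLog:
    """Toy model: log small write requests, acknowledge, flush sorted later."""

    def __init__(self, blocks=4):
        self.log = []                         # (offset, data) held in "NVRAM"
        self.disk = bytearray(blocks * BLOCK_SIZE)

    def write(self, offset, data):
        self.log.append((offset, bytes(data)))
        return "ack"                          # safe: the request is in NVRAM

    def logged_bytes(self):
        return sum(len(data) for _, data in self.log)

    def flush(self):
        # Apply requests in arrival order to in-memory copies of the blocks...
        dirty = {}
        for offset, data in self.log:
            for i, byte in enumerate(data):
                blockno, within = divmod(offset + i, BLOCK_SIZE)
                if blockno not in dirty:
                    start = blockno * BLOCK_SIZE
                    dirty[blockno] = bytearray(self.disk[start:start + BLOCK_SIZE])
                dirty[blockno][within] = byte
        # ...then write each dirty block once, in ascending block order.
        for blockno in sorted(dirty):
            start = blockno * BLOCK_SIZE
            self.disk[start:start + BLOCK_SIZE] = dirty[blockno]
        self.log.clear()
        return sorted(dirty)


nfs = NvramRequestLog()
nfs.write(2 * BLOCK_SIZE + 7, b"x")   # 1-byte change in block 2
nfs.write(5, b"hello")                # small change in block 0
print(nfs.logged_bytes(), "bytes logged;", 2 * BLOCK_SIZE, "if whole blocks were cached")
print(nfs.flush())
```

A real NFS request also carries headers, so the saving is smaller than this model suggests, but the log still grows with the size of the change rather than the size of the block.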
There are better ways to fix this problem, such as put a battery backup on system RAM, so that the OS won't need to be reloaded at all; it can pick up where it left off when power to the CPU comes back on.
And if computer power supplies were also designed for battery backup (dual-voltage, able to run on either 120VAC or, say, 24VDC) then the complexity of the UPS (converting DC to AC just so the computer can convert it back to DC again) would be eliminated, and the result would be more reliable. There should be a standard connector so an external lead-acid battery can be plugged into the back of every computer, and the power supply would be responsible for keeping it charged, and switching to using it when the power fails. (But there should be a switch to turn off the charging feature in case you have several computers hooked to one large battery with a separate external charger.) Then maybe when the power fails the OS would be notified, and it could finish doing any uncommitted writes before powering off the disks and CPU; the battery would continue to back up the memory for a very long time since it would be such a small load compared to the whole system. That way the two battery-backup systems could be combined. (Or not... maybe a separate memory backup would be more reliable.)
I think that's what I was getting at in my prior post. If the controller can cache my commonly accessed sectors at boot-time, that's fine. But as for "M cylinders for a partition", for traditional partitions, that's fine. For things like UnixWare, Openserver and *BSD, it'd have to know about the sub-filesystem partitions they use, would they not?
"All those tubes and wires and careful notes!"
Note, no free unix today has, at least to the point of people trusting their main database on it, a production-ready journaled filesystem.
Linux+ReiserFS.
I would trust ReiserFS to keep my main DB safe. I've been using ReiserFS with Linux for some time now with no data loss (and many power failures, and some crashes due to a particular closed-source XF86 video card driver).
-- iCEBaLM
But it's important to keep these things straight: Drivers written for an OS (hopefully source code that can compile under any *nix) may well benefit from knowing about hardware. The controller never needs to know anything about the OS. (And let me reiterate, shouldn't know. Can you say "Win[Modem|Printer]?" Yuck.)
[100% ISO 646 Compliant]
SVM, ERGO MONSTRO.
This sounds like a lot of standard techniques already in use in storage today. Many storage controllers (FibreChannel/SCSI, RAID/non-RAID) support battery backup power that will let them finish writes. Most use internal write and read caches and include lots of memory. Transaction support already exists in some ways. If I send a command to write 1024 bytes to a disk that does 512-byte writes, a good controller will attempt to make the 1024-byte operation atomic even though it is internally broken up into multiple 512-byte writes. File systems fragment the pieces of a file and thus have to issue multiple non-sequential commands for a given operation. This is where the problem erupts. Some controllers support combining multiple operations into one call, but this is usually done without FS knowledge; it just fills a buffer of ops and then dumps it. RAID already handles the issue of non-sequential operations by hiding them. RAID may present a 1 terabyte drive as sequential data when it is really striped across various areas of various drives. When RAID is told to write 1 sequential GB out, it is done as one operation, even though it involves many non-sequential writes to multiple locations on multiple disks. The trick is to put file system support in the storage controllers or RAID systems. With file system support, the controller will attempt to make sure that physically non-sequential writes that are sequential at the file level are completely written out. This doesn't happen much, for various reasons, but it does exist. For example, some RAID systems support running the file system on their system directly.
What does this do that is so different? At $300, it may be less expensive than similar solutions but I do not see "how it differs from everything out there".
-- soldack