Can You Compress Berkeley DB Databases?
Paul Gear asks: "I'm wanting to create a database using Berkeley DB that includes a lot of textual information, and because of the bulk of the data and its obvious compressibility, I was wondering whether it was possible to have the DB libraries automatically compress it at the file level, rather than me compressing each data (rather small) item before putting it in (which would result in much less gain in compression). Section 4.1 of the paper "Challenges in Embedded Database System Administration" talks about automatic compression, but that is the only place in the documentation that it is mentioned. Can anyone point me in the right direction?"
FWIW, linux can do this on the ext2fs as well, with chattr +c filename. There are analogs in other unix operating systems and filesystems as well. :-) (man chattr for more info)
--
News for Geeks in Austin, TX
Although compressing the data might seem like a good idea, in practice you will curse day and night unless you take certain precautions. Using a compressed filesystem will save you plenty of space, but you will pay in the form of cpu usage. Decompressing data on-the-fly is costly, and in the case of database operations, it can be downright nasty. A simple workaround might be to store the main data (indexes) on an uncompressed filesystem, and leave only the larger blob fields in a separate table stored compressed, that way you could still do lightning fast searches and selects, only slowing down to grab the memos when absolutely required.
-Billco, Fnarg.com
Zlibc is a read-only compressed file-system emulation. It allows executables to uncompress their data files on the fly. No kernel patch, no re-compilation of the executables and the libraries is needed. Using gzip -9, a compression ratio of 1:3 can easily be achieved! (See examples below). This program has (almost) the same effect as a (read-only) compressed file system.
See the web page for more.
Baz
Reading through a compressed random-access file may not be as big of a win as you think, since the db will need to decompress things to determine where your data is. OTOH, if you have a fast processor and slower media (CDROM) and plenty of RAM, the drawbacks will disappear after a few spins through due to caching.
The caching won't save you from uncompressing the blocks repeatedly. If you really want to compress the database metadata, you basically need a block-oriented compressed filesystem that allows random access within compressed files. I don't know if such a thing already exists, but it's effectively what you'd be writing to do it...
I'd just use zlib to compress the individual entries and not try to compress the entire database as a whole. I've done this before, and it actually works better than you'd think. Even with data entries as small as 50-100 bytes, you get reasonable compression. Yes, you'd get much better compression across the entire database, but you can't hope to access a fully-compressed database without uncompressing it or doing a lot of work to make random-access possible. (And like I said, at that point you might as well be making a compressed filesystem.)
Deven
"Simple things should be simple, and complex things should be possible." - Alan Kay