Slashdot Mirror


Improving Unix Mail Storage?

At first, there was mbox, then there was Maildir, and Bill begat Outlook and .mbx. CaraCalla wonders if there is a better way to store mail than the way we store it today. I admit, with the changes that email has undergone over the past 5 years (changes in what is being sent, not necessarily in how it is sent), it may be time to reinvent the mail format. Read on for CaraCalla's analysis of the current mail options, and his thoughts on where we may go in the future. If you were to design your own MUA, how would you design its mail storage? CaraCalla asks: "Does anybody know a good, free solution for storing mail on unix hosts? The reason I ask is my discontent with the available techniques:
  • mbox: There are problems with locking, corruption, access-times, and bloat.
  • Maildir: Do you really want to clutter your system with millions of small files? That's a waste of inodes and space (unless perhaps you use Linux/ReiserFS or SGi). And just try to open a Maildir with 1000+ mails and see how long it takes your favorite mail program to display only the subjects.
  • Cyrus: Basically the same as Maildir with database features.
  • UW-Imap mbx: That's classical mbox with extensions allowing multiple access.
  • Evolution: Basically mbox with database features.
  • Windows clients: Typically some proprietary db-format. Pathetic.

But the thing that bugs me most is disk space. Typical inboxes consist of 5% to 10% text, including headers and HTML. The rest is BASE64- (or UU-) encoded pictures, Word documents, zip archives and so on. The problem here is the encoding, which wastes a considerable amount of space (the encoded data is at least one third larger than the original).
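The "at least one third" figure for BASE64 can be checked directly. Below is a minimal sketch in Python, assuming MIME-style base64 with line breaks after every 76 characters; the function name `base64_overhead` and the random payload are stand-ins for illustration, not part of any mail store.

```python
import base64
import os


def base64_overhead(raw: bytes) -> float:
    """Return encoded_size / raw_size for MIME-style base64."""
    # encodebytes() emits a newline after every 76 output characters,
    # which is how attachments usually appear inside a stored message.
    encoded = base64.encodebytes(raw)
    return len(encoded) / len(raw)


if __name__ == "__main__":
    payload = os.urandom(300_000)  # stand-in for a binary attachment
    print(f"encoded/raw size ratio: {base64_overhead(payload):.3f}")
```

The ratio comes out slightly above 4/3 because of the line breaks. Put the other way round, roughly a quarter of the bytes stored for an encoded attachment are encoding overhead.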

Some ideas about the ideal mail-storage:

  • One file per mailbox folder, allowing multiple folders per user. Should those files reside in one central location or in users' home directories?
  • Compression: Should messages be broken into pieces and the MIME-attachments stored separately (thus searching of the text parts would still be possible without decompressing the whole file)?
  • File format: gdbm, Sleepycat db? Something new?
  • Should the security model allow users to directly access their files, grep them, copy them around?
  • Shared folders, virtual domains?
  • Unicode support in folder names? IMAP message IDs, flags, user-agent-specific state information?
  • How would MTAs deliver mail? How would clients access? File-locking (NFS)?
  • What about backwards-compatibility? Writing libmailstore (anyone)? Adopting UW c-client?

Does my ideal mailstorage exist somewhere? Is somebody working on a project addressing this? Does anybody have some other hints? And please no mbox/Maildir flamewar!"

15 of 554 comments

  1. Re:One folder to rule them all... by TheBracket · · Score: 4, Insightful
    You do realise that you just described MS Exchange (albeit in a chronically simplified form), right? :-)

    Exchange is actually a pretty decent mail server, although only using it for mail is pretty dumb - its groupware features are the killer app. It exposes both benefits (in particular, single storage of messages with multiple recipients) and flaws (if your db goes boom, it affects all your users - or at least all your users in a given mail partition) of database-based mail storage.

    I remember seeing a project to combine mail storage with PostgreSQL a while ago. Anyone know what happened to it?

    --
    Lead developer, http://wisptools.net
  2. I vote for a filesystem-based database by Dr.+Awktagon · · Score: 5, Insightful

    Something like Maildir .. if the FS is slow and can't handle that kind of application, then we need to improve our filesystems!

    Lots of applications need lightweight databases with indexes, locking, and atomic operations. Why not bake this into the filesystem, and it won't have to be just for email, it will have many uses.

    I was thinking about this the other day as I was working on a logging system for a large in-house email filtering system.. similar problem, except instead of storing emails, I'm storing small XML fragments describing the structure of each email and what was done to each. So far the easiest solution was large monolithic XML files, and an external index pointing in the large file (i.e., like mbox + a DB index). As it grows we'll probably have to move it to a "real" database.
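The "one large file plus an external index" layout mentioned above can be sketched roughly as follows. This is only an illustration under assumptions: the function names, the JSON-lines index format, and the use of (offset, length) pairs are all made up here and are not the poster's actual system.

```python
import json
import os


def append_record(data_path: str, index_path: str, key: str, payload: bytes) -> None:
    """Append payload to the big file and note where it landed in the index."""
    with open(data_path, "ab") as data:
        data.seek(0, os.SEEK_END)      # make the append position explicit
        offset = data.tell()
        data.write(payload)
    entry = {"key": key, "offset": offset, "length": len(payload)}
    with open(index_path, "a", encoding="ascii") as index:
        index.write(json.dumps(entry) + "\n")   # one ASCII line per record


def load_index(index_path: str) -> dict:
    """Read the index into a dict: key -> (offset, length)."""
    entries = {}
    with open(index_path, encoding="ascii") as index:
        for line in index:
            entry = json.loads(line)
            entries[entry["key"]] = (entry["offset"], entry["length"])
    return entries


def read_record(data_path: str, index: dict, key: str) -> bytes:
    """Fetch one record by seeking straight to it, without scanning the file."""
    offset, length = index[key]
    with open(data_path, "rb") as data:
        data.seek(offset)
        return data.read(length)
```

There is no locking here, so concurrent writers would corrupt the index; that gap is exactly where a "real" database starts to look attractive.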

    There is a need for something like sleepycat DB + ReiserFS on steroids..

  3. Something to keep in mind... by cwinters · · Score: 5, Insightful

    /. punchingbag jwz has some strong opinions about using databases (etc.) for mail storage. I tend to agree: everything can read from and write to files, there are no versioning issues, they can be easily transported among different operating and file systems, they can be backed up easily. But it's another wheel to reinvent, so everyone hop to it at once and then lose interest in two or three weeks!

    --

    Chris
    M-x auto-bs-mode

  4. one file per message by g4dget · · Score: 3, Insightful
    One file per mail message is the right thing to do. That lets you use standard UNIX tools for manipulating mail and it gives you convenient locking semantics. And the hierarchical UNIX file system structure, together with links, matches mail semantics nearly perfectly.

    Of course, with traditional UNIX file systems, this is a bit slow. The thing to do is to fix the file system, not to kludge ever more complex mail formats on top of it. ReiserFS goes much of the way; we now also need some system calls to open and read multiple files with a single call.

    Until file systems catch up, one kludge is as good as another. UNIX mbox format is at least simple, so I stick with that.
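The remark about links matching mail semantics can be illustrated with hard links: one message file, visible in several folder directories, with no copy made. A minimal sketch, assuming a POSIX filesystem that supports hard links; the helper name `file_in_folder` and the directory layout are hypothetical.

```python
import os
import tempfile


def file_in_folder(message_path: str, folder_dir: str) -> str:
    """Make an existing message also appear in folder_dir via a hard link."""
    os.makedirs(folder_dir, exist_ok=True)
    link_path = os.path.join(folder_dir, os.path.basename(message_path))
    os.link(message_path, link_path)   # same inode, second directory entry
    return link_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        inbox = os.path.join(root, "inbox")
        os.makedirs(inbox)
        original = os.path.join(inbox, "1")
        with open(original, "w") as f:
            f.write("Subject: hello\n\nbody\n")

        linked = file_in_folder(original, os.path.join(root, "work", "project-x"))
        print("link count:", os.stat(original).st_nlink)

        # Deleting the message from one folder leaves it in the other.
        os.remove(original)
        with open(linked) as f:
            print(f.read())
```

Hard links only work within one filesystem, so this assumes all of a user's folders live on the same volume.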

  5. Life is not that simple by coyote-san · · Score: 3, Insightful

    Life is not that simple. All databases are limited by the size of the basic block, and if you can't fit your data into that block performance takes a hit.

    With PostgreSQL this is a compile-time option, default 8k and it can go up to 32k.

    It *is* possible to store larger items, esp. if they're 'TOASTable' or blobs, but this often just pushes the problem of dealing with thousands of files onto the database. Only now it's a lot harder to figure out why performance sucks.

    Does this mean that database solutions won't work? Of course not. But it does mean that simple solutions won't scale well when you're dealing with massive amounts of data.

    --
    For every complex problem there is an answer that is clear, simple, and wrong. -- H L Mencken
  6. Re:One folder to rule them all... by Ageless · · Score: 3, Insightful

    If you think that you can replicate what Exchange does in "a couple hours of time" you have not been at it long enough.
    There are two excellent reasons that so many people use Exchange.

    1) In general, it works out of the box. A company with someone with meager knowledge can set up a fairly complex mail handling system without much help.

    2) It does A LOT. In its most basic configuration it does what you need 10 or more programs in Linux to do, not to mention that most of those 10 don't exist.

    Rage against the machine all you want, but when your boss says you will have shared contacts and calendars and your clients will run Windows; find me a solution that comes within miles of the ease of Outlook and Exchange and I'll give you a cookie.
    Actually, I'll probably give you several thousand dollars.

  7. Databases Ptewey. by Jason+Pollock · · Score: 3, Insightful

    The problem with using database formats is that you can't access them with vi. How many times has your mail client crashed attempting to read an email, but you still _need_ to get access to it? If it's in a database (proprietary or not), you're up the creek. If it's stored in a flat file, you at least have the option of using vi/emacs/grep to find and read the email, and then excise it.

    This has happened to me in Netscape, Kmail, Outlook, Evolution, Eudora, etc. Every single one has had problems at one point or another. The best programs are the ones that are _truly_ open, and let you get at the mail from other directions.

    Don't doubt the power of the text utilities in Unix. :)

    Jason Pollock

  8. Re:The Reiser guys have some ideas. by SwellJoe · · Score: 5, Insightful
    I have also heard from someone who does Linux consulting who won't use ReiserFS. Overall, I don't call it stable.


    Heheh...I read a funny quote here on slashdot earlier today that I think applies:

    The plural of anecdote is not data.


    I've heard from a lot of people who consider themselves experts that ReiserFS is not stable, never has been, never will be, all that fun stuff. But I know better, because I have data. Hard numbers...I know I can run a Squid box harder and at higher loads for longer on ReiserFS than ext2 or ext3. I know that I can run a Squid machine for 2 years with ReiserFS cache partitions with uptimes over a year, with the reboot after all that time being for a kernel upgrade.

    Yes, there have been data corruption issues for some people for ReiserFS. But I'm on the ext3 and jfs mailing lists as well...I know they have data corruptions of their own. It's a fact of life when dealing with computers, things go wrong for everyone at some point. I simply don't believe the masses when they tell me ReiserFS is not suitable for production use, because I have more machines to administer than the vast majority of slashdotters, and I believe I can trust ReiserFS. I trust my opinion above most.

  9. Don't do it! Compress in the file system. by billstewart · · Score: 3, Insightful

    Shredding and compressing mail messages is almost always a bad idea. Essentially *nobody* does it correctly, and you can't reconstruct messages in their original byte-for-byte formats, which trashes digital signatures. You won't save much disk space, because real text doesn't take up enough space for anybody except a big ISP mailsystem to worry about, and binary attachments usually only compress well if they've been encoded in some non-8-bit-transparent format like base64 or uuencode. About the only time it wins is when one person on your keep-mail-on-server mailsystem is sending an attachment to a bunch of people who can then all use the original, which is to say they should probably have stored the file on the web and mailed a URL. If you're going to do things like this, get yourself a compression-equipped filesystem and just store your raw mail messages there.
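The claim that binary attachments mostly compress well only in their encoded form is easy to try out. A rough sketch, with one loud assumption: random bytes stand in for an already-compressed attachment (JPEG, zip), and zlib stands in for whatever a compressing filesystem would use.

```python
import base64
import os
import zlib


def compressed_fraction(data: bytes) -> float:
    """Return compressed_size / original_size using zlib at maximum level."""
    return len(zlib.compress(data, 9)) / len(data)


if __name__ == "__main__":
    raw = os.urandom(200_000)            # assumption: mimics a JPEG or zip file
    encoded = base64.encodebytes(raw)    # the form it takes inside a message

    print(f"raw attachment compresses to  {compressed_fraction(raw):.2f} of its size")
    print(f"base64 form compresses to     {compressed_fraction(encoded):.2f} of its size")
    print(f"compressed base64 vs raw size {len(zlib.compress(encoded, 9)) / len(raw):.2f}")
```

The encoded form shrinks noticeably, but only back toward the size of the raw attachment, never below it. Compression here recovers the encoding overhead and little else, which fits the advice to let the filesystem handle it on unmodified messages.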

    --

    Bill Stewart
    New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
  10. Don't speculate. Profile. by Doktor+Memory · · Score: 5, Insightful
    Maildir: Do you really want to clutter your system with millions of small files? That's a waste of inodes, space (unless perhaps you use Linux/ReiserFS or SGi)
    Psssst. It's not 1978 any more. Inodes are cheap. So is disk space. Stop spreading FUD.
    and just try to open a Maildir with 1000+ mails and see how long it takes your favorite Mailprogram to only display the subjects.
    Quite right. Just try it. You might be a bit surprised by the results.
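In the spirit of "just try it", here is a small timing sketch using Python's standard `mailbox` module. It builds a throwaway Maildir with 1000 synthetic messages and times how long it takes to pull every Subject header. The addresses, message size, and function names are assumptions for illustration; real results depend on the filesystem, message sizes, and whether the cache is warm.

```python
import mailbox
import os
import tempfile
import time
from email.message import EmailMessage


def build_maildir(path: str, count: int) -> None:
    """Create a Maildir at path (which must not exist yet) with count messages."""
    box = mailbox.Maildir(path, create=True)
    for i in range(count):
        msg = EmailMessage()
        msg["From"] = "sender@example.invalid"
        msg["To"] = "reader@example.invalid"
        msg["Subject"] = f"test message {i}"
        msg.set_content("some line of body text\n" * 100)
        box.add(msg)
    box.close()


def list_subjects(path: str) -> list:
    """Open every message in the Maildir and collect its Subject header."""
    box = mailbox.Maildir(path, create=False)
    subjects = [msg["Subject"] for msg in box]
    box.close()
    return subjects


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "Maildir")
        build_maildir(path, 1000)
        start = time.perf_counter()
        subjects = list_subjects(path)
        elapsed = time.perf_counter() - start
        print(f"read {len(subjects)} subjects in {elapsed:.3f} s")
```

Note that the files were written moments before being read, so this measures the warm-cache case; a fair comparison against mbox would need the same messages in both formats and a cold cache.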
    --

    News for Nerds. Stuff that Matters? Like hell.

  11. Re:I like MS Exchange by Slashamatic · · Score: 3, Insightful
    Backups are important

    Sorry, restores are even more important. I hope you check your backup strategy by trying a recovery every so often. Many a time I have heard of people who "thought they had a backup" and then it turns out that the thing that was being backed up was in an inconsistent state.

  12. Didn't we solve this with NNTP? by speedenator · · Score: 3, Insightful

    So NNTP solved this, IMHO, in a rather elegant way...

    You have directories corresponding to newsgroups or mail folders or whatnot. i.e. alt.swedish.chef.bork.bork.bork is really alt/swedish/chef/bork/bork/bork

    Articles are numeric, i.e. \d+ for Perl types. The raw message is stored in each file.

    In each directory, there's a file called .overview, which is just the summary information for all the files.

    Thus, you can have zillions of small files, and happily grep and copy them to your heart's content. But you never do a 'ls' on a huge directory, you always just look through the .overview file. Or grep through it, if you like.
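The spool layout described above (numbered article files plus a per-directory `.overview` summary) can be sketched in a few lines. This is a simplified illustration, not the real news overview format: the choice of fields (number, Subject, From, Date), the tab separator, and the function names are assumptions.

```python
import os
from email import message_from_bytes

OVERVIEW = ".overview"


def store_article(folder: str, number: int, raw: bytes) -> None:
    """Write the raw message as <folder>/<number> and append a summary line."""
    os.makedirs(folder, exist_ok=True)
    with open(os.path.join(folder, str(number)), "wb") as f:
        f.write(raw)

    msg = message_from_bytes(raw)
    fields = [str(number)] + [str(msg.get(h, "")) for h in ("Subject", "From", "Date")]
    # Tabs and newlines inside a header would break the one-line-per-article format.
    clean = [" ".join(field.split()) for field in fields]
    with open(os.path.join(folder, OVERVIEW), "a", encoding="utf-8") as f:
        f.write("\t".join(clean) + "\n")


def list_subjects(folder: str) -> list:
    """Return (number, subject) pairs by reading only the overview file."""
    result = []
    with open(os.path.join(folder, OVERVIEW), encoding="utf-8") as f:
        for line in f:
            number, subject, _sender, _date = line.rstrip("\n").split("\t")
            result.append((int(number), subject))
    return result
```

Listing a folder then costs one sequential read of one file, no matter how many article files sit next to it, while the articles themselves stay plain files that grep and cp can handle.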

    So, in that sense, it's very much the best of both worlds. And, on the same box, you can specify rules on who can access the folders, so one file can be read by multiple people. Ooh.

    GNUS, an Emacs based mail/news reader, uses a variant of this called nnml, which rocks.

    Of course, when you get down to it, JWZ arguments aside, databases start to really look like what you want, especially on a corporate level when you're tossing the same piece of mail around to tons of different folks.

    -e

  13. Re:the plan 9 approach by glv · · Score: 4, Insightful
    You alluded to this, but I know slashdot, and it's worth being explicit about it to avoid all the flames:

    This is not how mail is actually stored on disk in Plan 9. The "real" mail storage is just mbox files. What rpeppe has described is the view that the mail storage system provides to clients.

    I agree it's very sweet, but the question is primarily dealing with the actual storage format.

    --
    ---glv
  14. Re:Exchange brain-damage by Anonymous Coward · · Score: 4, Insightful

    Sometimes it makes me want to scream that Microsoft gets away with this stuff and no one seems to care.

    One thing to keep in mind about Exchange is that it's really an X.400 mail system, with some proprietary routing features kludged on top, lots of back-compat MS Mail features kludged on top of that, and then (as the last afterthought) SMTP kludged on top of all that. Next time you are at a computer book store, gander at the architecture diagram for Exchange -- it's so complex that it _should_ make you queasy. The thing just reeks of early-90s incorrect design assumptions.

    So it shouldn't be a shock that it can't handle a large number of SMTP edge cases. Frankly, nobody would buy a product like Exchange if it didn't have the Microsoft logo on it and a nice client which gets installed with Word and Excel.

    Microsoft, in their heart of hearts knows that it's a piece of shit, but it's _their_ piece of shit. And it happens to sell well, and any product that profitable can't be all that bad.

    I wouldn't be shocked if numerous skunkworks projects have come and gone at MS to replace their Big X.400 Jet DB Kludge with a real Internet-savvy mail server, but the politics of the place probably dictate that they lumber on with what's working (that also explains products like Windows ME).

  15. I think a hybrid solution is called for. by mellon · · Score: 3, Insightful

    I don't think it makes sense to store email in dbm files. It's too sketchy - what happens when the dbm file gets corrupted? The nice thing about flat files is that if something goes wrong, you can fix it with vi.

    I think the right solution to the problem is to key off the message ID, which is supposed to be unique. Then define a mail folder as simply a list of message IDs. Messages can appear in more than one folder, but hopefully not in no folders.

    To make this efficient, I'd hash the message ID, and use a hierarchy of directories, because Unix doesn't do well with large flat directories. The hierarchy could auto-extend, so that as one subdirectory fills up, you do a sub-hash and split it into more directories.
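The hashed directory hierarchy could look something like the sketch below. Assumptions, stated loudly: SHA-1 as the hash, two hex characters per directory level, and a fixed depth. The auto-extending split described above is not implemented here, and `message_path` is a made-up name.

```python
import hashlib
import os


def message_path(root: str, message_id: str, levels: int = 2) -> str:
    """Map a Message-ID to a file path under a fixed-depth hashed hierarchy."""
    digest = hashlib.sha1(message_id.encode("utf-8")).hexdigest()
    # Two hex characters per level means at most 256 entries per directory level.
    parts = [digest[2 * i: 2 * i + 2] for i in range(levels)]
    return os.path.join(root, *parts, digest)


if __name__ == "__main__":
    print(message_path("/var/mailstore", "<20020101120000.1234@example.invalid>"))
```

A folder then becomes a plain list of Message-IDs, and looking a message up is one hash plus one open(), with no directory ever holding more than a bounded number of entries at the upper levels.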

    The problem of tiny files is a real one. The solution is probably to make the bottom of a hash a file rather than a directory, and store more than one message in each such file. You don't have to store a lot of messages in these files to win - even ten messages would produce a big win, and would be pretty efficient.

    The format of the individual files should probably be indexed sequential access - that is, a TOC at the front, and then the contents as plain text, nothing fancy. The TOC should be in ASCII, not binary, and you should be able to rebuild the TOC by looking at the file.

    Babyl used to use a control character as a delimiter, which worked pretty nicely - much better than using "^From ". Ever seen >From in an email message? That's because Unix mail uses "^From " as an inter-message delimiter, so it has to quote it, and it does so stupidly. So use ^_ as a delimiter, and if ^_ appears in the email message, just double it. Take a doubled ^_ out when reading a message.
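The ^_ delimiter with doubling can be sketched as below. One deliberate deviation, flagged as an assumption: the end-of-message marker here is ^_ followed by a newline rather than a bare ^_. With a bare ^_, a message ending in ^_ followed by the delimiter would look the same as the delimiter followed by a message starting with ^_. The names `pack` and `unpack` are made up for the example.

```python
US = b"\x1f"        # ASCII Unit Separator, written ^_ in the text
END = US + b"\n"    # assumed end-of-message marker: ^_ then newline


def pack(messages: list) -> bytes:
    """Join messages into one blob, doubling any ^_ found inside a message."""
    return b"".join(m.replace(US, US + US) + END for m in messages)


def unpack(blob: bytes) -> list:
    """Split a blob written by pack() back into the original messages."""
    messages = []
    current = bytearray()
    i = 0
    while i < len(blob):
        if blob[i:i + 1] == US:
            following = blob[i + 1:i + 2]
            if following == US:            # doubled ^_ is a literal ^_
                current += US
            elif following == b"\n":       # ^_ newline ends the message
                messages.append(bytes(current))
                current = bytearray()
            else:
                raise ValueError(f"stray ^_ at byte {i}")
            i += 2
        else:
            current.append(blob[i])
            i += 1
    if current:
        raise ValueError("trailing data without an end-of-message marker")
    return messages
```

Unlike the ">From" convention, the round trip is exact: a line starting with "From " passes through untouched, and a message containing ^_ comes back byte for byte, which matters if anything in it is signed.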

    As for compression, I don't think it's worth doing at first. Disk space is cheap. Yes, my email folder is pretty huge, but it's really not a major problem. Making the storage system extra-complicated by uncompressing MIME is something to add on after you've got something more basic that works - you don't have to solve every problem all at once.

    As for folder scan performance, you can make a cache, and have the mail program scan the cache from time to time when it's idle to clean up errors. This is much better than trying to come up with a format that's optimized toward folders - if you try to optimize toward folders, you wind up creating all kinds of problems, IMHO.