Improving Unix Mail Storage?
At first, there was mbox, then there was Maildir, and Bill begat Outlook and .mbx. CaraCalla wonders if there is a better way to store mail than the way we store it today. I admit, with the changes that email has undergone over the past 5 years (changes in what is being sent, not necessarily in how it is sent), it may be time to reinvent the mail format. Read on for CaraCalla's analysis of the current mail options, and his thoughts on where we may go in the future. If you were to design your own MUA, how would you design its mail storage?
CaraCalla asks: "Does anybody know a good, free solution for storing mail on unix hosts? The reason that I ask this question is my discontent with available techniques:
- mbox: There are problems with locking, corruption, access-times, and bloat.
- Maildir: Do you really want to clutter your system with millions of small files? That's a waste of inodes and space (unless perhaps you use Linux/ReiserFS or SGI). And just try to open a Maildir with 1000+ mails and see how long it takes your favorite mail program merely to display the subjects.
- Cyrus: Basically the same as Maildir with database features.
- UW-Imap mbx: That's classical mbox with extensions allowing multiple access.
- Evolution: Basically mbox with database features.
- Windows clients: Typically some proprietary db-format. Pathetic.
But the thing that bugs me most is disk space. Typical inboxes consist of only 5% to 10% text, including headers and HTML. The rest is BASE64- (or UU-) encoded pictures, Word documents, zip archives and so on. The problem here is the encoding, which wastes a considerable amount of space (at least one third more than the raw data).
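As a rough check of the "one third" figure, here is a minimal Python sketch that measures how much Base64 inflates binary data. The helper name `encoded_overhead` is made up for this illustration. Note that `base64.encodebytes` ends each 76-character line with a bare newline, so a message on the wire (which uses CRLF) would come out slightly larger still.

```python
import base64
import os


def encoded_overhead(raw_size: int) -> float:
    """Return the fractional size increase of Base64 over raw bytes.

    Hypothetical helper for illustration only. Uses MIME-style
    line wrapping (76 characters per line) via base64.encodebytes.
    """
    raw = os.urandom(raw_size)
    encoded = base64.encodebytes(raw)
    return len(encoded) / len(raw) - 1.0


# 57 raw bytes become exactly one 76-character line plus a newline,
# so a multiple of 57 gives a clean figure.
overhead = encoded_overhead(57 * 1000)
print(f"Base64 overhead: {overhead:.1%}")
```

Every 3 raw bytes become 4 encoded characters, which is the one-third expansion; the line breaks add a little on top of that.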
Some ideas about the ideal mail-storage:
- One file per mailbox folder, allowing multiple folders per user. Should those files reside in one central location or in users' home directories?
- Compression: Should messages be broken into pieces and the MIME-attachments stored separately (thus searching of the text parts would still be possible without decompressing the whole file)?
- File format: gdbm, Sleepycat db? Something new?
- Should the security model allow users to directly access their files, grep them, copy them around?
- Shared folders, virtual domains?
- Unicode support in folder names? IMAP message IDs, flags, user-agent-specific state information?
- How would MTAs deliver mail? How would clients access? File-locking (NFS)?
- What about backwards compatibility? Writing libmailstore (anyone)? Adopting UW c-client?
Does my ideal mailstorage exist somewhere? Is somebody working on a project addressing this? Does anybody have some other hints? And please no mbox/Maildir flamewar!"
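The compression idea from the list above (keep text parts directly searchable, store attachments decoded and compressed) can be sketched with Python's standard `email` and `zlib` modules. This is only an illustration under assumptions: `split_message` is a made-up name, the in-memory lists stand in for whatever on-disk format a real store would use, and the demo message is synthetic.

```python
import zlib
from email import message_from_bytes
from email.message import EmailMessage


def split_message(raw: bytes):
    """Split one message into plain text parts and compressed attachments.

    Hypothetical helper: text parts are kept uncompressed so they stay
    greppable; everything else is transfer-decoded (Base64 removed)
    and then compressed with zlib.
    """
    msg = message_from_bytes(raw)
    text_parts = []
    blobs = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no payload of their own
        payload = part.get_payload(decode=True) or b""
        if part.get_content_maintype() == "text":
            text_parts.append(payload)
        else:
            name = part.get_filename() or "unnamed"
            blobs.append((name, zlib.compress(payload)))
    return text_parts, blobs


# Synthetic demo message with one highly compressible attachment.
demo = EmailMessage()
demo["Subject"] = "report"
demo.set_content("See attached.\n")
demo.add_attachment(b"\x00" * 30000, maintype="application",
                    subtype="octet-stream", filename="zeros.bin")
raw = demo.as_bytes()

texts, blobs = split_message(raw)
print(len(raw), "bytes as mail;", len(blobs[0][1]), "bytes stored for", blobs[0][0])
```

Real attachments such as JPEGs or zip archives are already compressed, so in practice most of the saving would come from dropping the Base64 layer rather than from zlib.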
Exchange is actually a pretty decent mail server, although only using it for mail is pretty dumb - its groupware features are the killer app. It exposes both benefits (in particular, single storage of messages with multiple recipients) and flaws (if your db goes boom, it affects all your users - or at least all your users in a given mail partition) of database-based mail storage.
I remember seeing a project to combine mail storage with PostgreSQL a while ago. Anyone know what happened to it?
Lead developer, http://wisptools.net
Something like Maildir .. if the FS is slow and can't handle that kind of application, then we need to improve our filesystems!
Lots of applications need lightweight databases with indexes, locking, and atomic operations. Why not bake this into the filesystem, and it won't have to be just for email, it will have many uses.
I was thinking about this the other day as I was working on a logging system for a large in-house email filtering system.. similar problem, except instead of storing emails, I'm storing small XML fragments describing the structure of each email and what was done to each. So far the easiest solution was large monolithic XML files, and an external index pointing in the large file (i.e., like mbox + a DB index). As it grows we'll probably have to move it to a "real" database.
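The "one big file plus an external index" layout described above can be sketched in a few lines of Python for the mbox case. This is a simplified illustration, not anyone's actual implementation: `build_index` and `read_message` are made-up names, the index lives in memory rather than in a database, and it assumes body lines that begin with "From " have already been escaped (as ">From ") by whatever wrote the file.

```python
import os
import tempfile


def build_index(path):
    """Return a list of (offset, length) pairs, one per message.

    Hypothetical helper: treats every line starting with b"From "
    as a message separator, which is a simplification of mbox.
    """
    starts = []
    offset = 0
    with open(path, "rb") as f:
        for line in f:
            if line.startswith(b"From "):
                starts.append(offset)
            offset += len(line)
    bounds = starts + [offset]  # the last message runs to end of file
    return [(bounds[i], bounds[i + 1] - bounds[i]) for i in range(len(starts))]


def read_message(path, entry):
    """Fetch one message by seeking straight to it, without scanning."""
    offset, length = entry
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


sample = (b"From alice@example.org Mon Jan  1 00:00:00 2001\n"
          b"Subject: first\n\nhello\n\n"
          b"From bob@example.org Mon Jan  1 00:01:00 2001\n"
          b"Subject: second\n\nworld\n")

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample)

index = build_index(tmp.name)
second = read_message(tmp.name, index[1])
os.unlink(tmp.name)
print(second.decode())
```

The index has to be rebuilt or updated whenever the big file is rewritten, which is one reason such a setup tends to migrate to a "real" database as it grows.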
There is a need for something like Sleepycat DB + ReiserFS on steroids..
/. punching bag jwz has some strong opinions about using databases (etc.) for mail storage. I tend to agree: everything can read from and write to files, there are no versioning issues, they can be easily transported among different operating and file systems, and they can be backed up easily. But it's another wheel to reinvent, so everyone hop to it at once and then lose interest in two or three weeks!
Chris
M-x auto-bs-mode
Heheh...I read a funny quote here on slashdot earlier today that I think applies:
I've heard from a lot of people who consider themselves experts that ReiserFS is not stable, never has been, never will be, all that fun stuff. But I know better, because I have data. Hard numbers...I know I can run a Squid box harder and at higher loads for longer on ReiserFS than ext2 or ext3. I know that I can run a Squid machine for 2 years with ReiserFS cache partitions with uptimes over a year, with the reboot after all that time being for a kernel upgrade.
Yes, there have been data corruption issues for some people for ReiserFS. But I'm on the ext3 and jfs mailing lists as well...I know they have data corruptions of their own. It's a fact of life when dealing with computers, things go wrong for everyone at some point. I simply don't believe the masses when they tell me ReiserFS is not suitable for production use, because I have more machines to administer than the vast majority of slashdotters, and I believe I can trust ReiserFS. I trust my opinion above most.
Quite right. Just try it. You might be a bit surprised by the results.
News for Nerds. Stuff that Matters? Like hell.
This is not how mail is actually stored on disk in Plan 9. The "real" mail storage is just mbox files. What rpeppe has described is the view that the mail storage system provides to clients.
I agree it's very sweet, but the question is primarily dealing with the actual storage format.
---glv
Sometimes it makes me want to scream that Microsoft gets away with this stuff and no one seems to care.
One thing to keep in mind about Exchange is that it's really an X.400 mail system, with some proprietary routing features kludged on top, lots of back-compat MS Mail features kludged on top of that, and then (as the last afterthought) SMTP kludged on top of all that. Next time you are at a computer book store, take a gander at the architecture diagram for Exchange -- it's so complex that it _should_ make you queasy. The thing just reeks of early-90s incorrect design assumptions.
So it shouldn't be a shock that it can't handle a large number of SMTP edge cases. Frankly, nobody would buy a product like Exchange if it didn't have the Microsoft logo on it and a nice client which gets installed with Word and Excel.
Microsoft, in their heart of hearts knows that it's a piece of shit, but it's _their_ piece of shit. And it happens to sell well, and any product that profitable can't be all that bad.
I wouldn't be shocked if numerous skunkworks projects have come and gone at MS to replace their Big X.400 Jet DB Kludge with a real Internet-savvy mail server, but the politics of the place probably dictate that they lumber on with what's working (that also explains products like Windows ME).