Improving Unix Mail Storage?
At first, there was mbox, then there was Maildir, and Bill begat Outlook and .mbx. CaraCalla wonders if there is a better way to store mail than the ones we use today. I admit, with the changes that email has undergone over the past 5 years (changes in what is being sent, not necessarily in how it is sent), it may be time to reinvent the mail format. Read on for CaraCalla's analysis of the current mail options, and his thoughts on where we may go in the future. If you were to design your own MUA, how would you design its mail storage?
CaraCalla asks: "Does anybody know a good, free solution for storing mail on unix hosts? The reason that I ask this question is my discontent with available techniques:
- mbox: There are problems with locking, corruption, access-times, and bloat.
- Maildir: Do you really want to clutter your system with millions of small files? That's waste of inodes, space (unless perhaps you use Linux/ReiserFS or SGi) and just try to open a Maildir with 1000+ mails and see how long it takes your favorite Mailprogram to only display the subjects.
- Cyrus: Basically the same as Maildir with database features.
- UW-Imap mbx: That's classical mbox with extensions allowing multiple access.
- Evolution: Basically mbox with database features.
- Windows clients: Typically some proprietary db-format. Pathetic.
But the thing that bugs me most is disk space. Typical inboxes are made of 5% to 10% of Text including Headers and HTML. The rest are BASE64- (or UU-) encoded pictures, word documents, zip archives and so on. The problem here is the encoding which wastes considerable amounts of space (at least one third).
Some ideas about the ideal mail-storage:
- One file per Mailbox-folder, allowing multiple folders per user. Should those files reside in one central location or in users Homedirs?
- Compression: Should messages be broken into pieces and the MIME-attachments stored separately (thus searching of the text parts would still be possible without decompressing the whole file)?
- File format: gdbm, Sleepycat db? Something new?
- Should the security model allow users to directly access their files, grep them, copy them around?
- Shared folders, virtual domains?
- Unicode support in folder names? Imap message-IDs, flags, useragent specific state-information?
- How would MTAs deliver mail? How would clients access? File-locking (NFS)?
- What about backwards-compatibility? Writing libmailstore (anyone)? adopting UW c-client?
Does my ideal mailstorage exist somewhere? Is somebody working on a project addressing this? Does anybody have some other hints? And please no mbox/Maildir flamewar!"
A single database to hold all of the users' email. Single instance storage ensures that only one copy of any attachment is in the database at once, no matter how many email messages it was sent in. APIs for backup let you back up the whole database or individual mailboxes, and depending on your backup solution you can restore mailboxes and individual emails. Anti-virus software integrates into the server side of the software. In Exchange 2000, if you accidentally delete a mailbox you can easily bring it back with all its emails without restoring from tape. The only files to worry about on the user end are a personal address book and archived email. Unless you use POP3 or it's archived in personal folders, the email always stays on the server, preventing problems like accidentally downloading to a home PC important emails you need at the office. And it's stable. Not as stable as UNIX, I admit, but it stays up for months without a reboot. And in my experience most problems are solved by a simple reboot. In 4 years of exposure to Exchange, the only non-admin related problems I've seen were one database corruption, where I needed to run a utility and wait 45 minutes for it to work again, and a corrupted MTA that needed a reboot to get it working right again.
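To make the single-instance idea concrete, here is a minimal sketch of content-addressed attachment storage. It is only an illustration of the general technique and not a description of how Exchange implements it; the class name `SingleInstanceStore` and its methods are made up for this example, and everything lives in memory.

```python
import hashlib


class SingleInstanceStore:
    """Toy in-memory store that keeps one copy of each distinct attachment.

    Hypothetical illustration only; not modelled on any real mail server.
    """

    def __init__(self):
        self.blobs = {}     # content hash -> attachment bytes
        self.refcount = {}  # content hash -> number of messages referencing it

    def add(self, data: bytes) -> str:
        """Store data once and return the key a message should reference."""
        key = hashlib.sha256(data).hexdigest()
        if key not in self.blobs:
            self.blobs[key] = data
        self.refcount[key] = self.refcount.get(key, 0) + 1
        return key

    def release(self, key: str) -> None:
        """Drop one reference; free the blob when nobody uses it any more."""
        self.refcount[key] -= 1
        if self.refcount[key] == 0:
            del self.refcount[key]
            del self.blobs[key]


store = SingleInstanceStore()
attachment = b"quarterly report" * 1000
# The same attachment sent to 250 recipients is stored once, referenced 250 times.
keys = [store.add(attachment) for _ in range(250)]
print(len(store.blobs), store.refcount[keys[0]])
```

The reference count is what lets one recipient delete their copy without affecting the others.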
I've followed ReiserFS development for years now, shipping our first servers with it some two years ago (and every box we've shipped since then), and I believe they have the best long-term plan for this kind of thing. Hans has written some excellent white-papers on making small files extremely cheap.
The eventual goal of Reiser is a filesystem that is indistinguishable from a powerful database (if a special purpose database). The plan is to make small files so cheap that every extension of a file, directory, etc. is just another file. Another interesting turn is that files would no longer be, necessarily, of the form '/big/long/path/to/some/file'...because the filesystem is a database, one could also access it by a category, so that one file read pulls in all of the data of that category (from any number of files). Directories become just one view of the data available, with any number of other views possible depending on the application.
As was mentioned in the parent, this would lead to things like 250 email recipients and only one actual file. But of course, this leaves out the copy-on-write functionality needed to make this seamless.
So I think the solution is probably to fix the filesystem--not to fix the email storage mechanism. A number of very smart people have 'fixed' email storage in the past, leading to all of the options we have today, none of which works extremely well on really large mailboxes. Yes, many are good enough, and many work fabulously for small to mid-sized applications. But the day will come when they do not work so well, due to the higher volume and growing average size of emails.
A good place to start for information about these ideas (which are primarily a consolidation of the most interesting research in the field of filesystems and databases):
http://www.namesys.com/whitepaper.html
ReiserFS is good stuff. Give Hans' papers a read sometime.
BTW-Don't gripe at me about ReiserFS instability, etc. I know better. As I mentioned I've been shipping servers with it for 2 years, and we've never had a single ReiserFS-caused corruption. Not one.
Attachment Converted: "C:\EUDORA\ATTACH\NEW YORK.pps"
Click on that in Eudora and the attachment opens.
This keeps the actual text in the mbox file lean. I've got almost a decade of correspondence that totals about 20 MB, if it included all the attachments it'd be much more.
Also it allows you to edit messages after receipt, (this might trouble some people, but it just simplifies what I used to do by opening the mbx file in a text editor). I can select all the text, then paste it back in. This has the effect of removing all the HTML coding that is especially crufty from Word generated mail -- a 20k message reduces to 1k.
when you say big emails....I assume that you mean less than 16 MB to handle MySQL row limitation. We have users who want to send 30 MB messages. Damn artists.
Nope, this limitation disappeared ages ago.
Information can be found here here and here
I suggest opening up the config file (generally /etc/mysql/my.cnf) and ensuring everything like "max_allowed_packet" etc. are > 50-ish MB.
David
stuff
Maildir : Do you really want to clutter your system with millions of small files? That's waste of inodes, space (unless perhaps you use Linux/ReiserFS or SGi) ...
... ...
... ...
In case you haven't noticed, the default setting for the Linux ext[23] filesystems is to allocate one inode per 4096 or 8192 bytes of disk space. Which happens to be pretty much the size of an average E-mail message. So, in other words, you are unlikely to run out of inodes before you run out of disk space, since both are going to be used up at pretty much the same clip.
It may come as a shocking surprise to some, but the average large filesystem is just littered with small files here, and small files there, all over the place. Here's my workstation -- a fairly large box with all sorts of crap loaded:
Filesystem    1k-blocks
/dev/sdb5       8159388
Filesystem       Inodes
/dev/sdb5       1036288
I'm using up almost exactly 8192 bytes per inode.
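The ratio can be checked from the two df figures above. This is just the arithmetic, written as a small sketch; the function name `bytes_per_inode` is made up for the example.

```python
def bytes_per_inode(kib_blocks: int, inodes: int) -> float:
    """Average bytes of filesystem capacity per allocated inode."""
    return kib_blocks * 1024 / inodes


# Figures for /dev/sdb5 as quoted in the comment above.
ratio = bytes_per_inode(8159388, 1036288)
print(f"{ratio:.0f} bytes per inode")  # a little under 8192
```

The result comes out at roughly 8060 bytes, slightly below 8192, presumably because the 1k-blocks column excludes some filesystem overhead. On a live system, `os.statvfs` reports the block count (`f_blocks`), block size (`f_frsize`) and inode count (`f_files`) needed for the same calculation.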
and just try to open a Maildir with 1000+ mails and see how long it takes your favorite Mailprogram to only display the subjects.
How about instantly? Most GUI E-mail clients cache mail headers, so they don't have to go and wait for the server to reply each time you click on the folder index window to re-sort, or scroll the folder index.
...
Some ideas about the ideal mail-storage:
* One file per Mailbox-folder, allowing multiple folders per user.
Using one file per folder essentially forces you to use some form of locking each time folder access is necessary. Locking of any sort has been problematic for years whenever NFS (or pretty much any other network filesystem) is involved. A single failed circuit will now take out your entire network spool, as all clients are now spinning on lock requests to the unreachable server.
Compression: Should messages be broken into pieces and the MIME-attachments stored separately (thus searching of the text parts would still be possible without decompressing the whole file)?
I thought you wanted to save everything in a single file per folder, and using multiple files for messages is supposed to waste inodes, remember?
File format: gdbm, Sleepycat db? Something new?
Ask an Exchange admin about the joys of a corrupted Exchange database. If mail is stored in simple, plain files, a single instance of corruption will affect at most one mailbox, instead of taking out the entire monolithic database.
Unicode support in folder names? Imap message-IDs, flags, useragent specific state-information?
IMAP already uses Unicode to encode folder names. Not sure what "useragent specific state-information" means...
yEnc isn't all that great. See http://www.exit109.com/~jeremy/news/yenc.html.
It's called a brick level backup, and most Exchange admins don't use them. The better setup is to set a reasonable deleted item retention policy. I set mine for 60 days. If I need any email deleted in the last 60 days, I can get it with out any restore, mailbox or otherwise. Works great.
i concur. there's nothing wrong with maildir or the linux filesystem, at least for me. my mailbox has about 3000 messages, and it opens pretty much instantly, using Maildir, Courier-IMAP and EXT2, from a server running a 700mhz Athlon and 7200 RPM IDE disks.
the author's comments about Maildir make it sound like they've been using it and having problems. perhaps the problem is with their imap daemon? or their client? or their hardware? if running out of inodes or space for small files is such a problem, why not use ReiserFS? reformatting your filesystem is probably a lot quicker than inventing another new UNIX mailbox standard and getting people to support it.
i use the OS X mail client, and it indexes my messages in the background as they arrive, so i can do instantaneous (i mean in-stan-tan-e-ous!) searches through my 3000 message mailbox by subject, to, from, or the entire message text. i can't imagine how this could work much better.
your experience is clearly different; but i think there are other factors you should consider before blaming the mailbox format.
James from the apache group can use an SQL datastore.
War is necrophilia.
I actually found a nifty little package called dbmail which uses an SQL messagestore. I've been playing with such things at work since they wanted me to write them a web-based mail client, and I wanted something which would let me deal with a MySQL database on the web client, but also allow people to connect to it via IMAP or POP3.
Of course, the whole replication part of it might be a bit more difficult, but it could probably be arranged as well. I'm pretty sure there are tools in existence for doing replication on a MySQL database (of course, don't ask me the names of any of them...)
Dogma: Dead (mostly because your Karma ran it over)
Samsung Contact:
http://www.samsungcontact.com
Which is based on HP OpenMail. About 1/6 the cost.
Go out and get sailing!
As part of my job, I've written software to send out HTML mails to people (no, it's not spam). When these messages pass through an Exchange server, Exchange does us the "service" of creating a text version of the mail from the HTML. I guess this is so that people without HTML-capable mailers can have a readable version...
The problem is, we include our own text/plain version alongside the HTML (ain't multipart/alternative great?). Nicely formatted and everything. Instead of leaving our mail alone, Exchange rips out the text version and creates a new one from the HTML. The result is an ugly mess of URLs because we use some graphics in the HTML version. Our nicely formatted text version ends up in the bit bucket so that Exchange can dump its url-barf on people.
This is really stupid behaviour for an MTA. And for some reason, it's always CEOs of important clients who use text-based MUAs while sitting behind an MS Exchange server. They call us up asking which URL to click on.
This, combined with other mail-rewriting bogons, has led me to the conclusion that Exchange has no respect for the messages passing through it.
Current versions of PostgreSQL no longer have such limits (they're much higher, a single field can use up to 1GB ...).
"I love my job, but I hate talking to people like you" (Freddie Mercury)
A couple things:
:-)
1. Evolution is NOT "Basically mbox with database features". It can use Maildir or MH as the backend (and you can write your own plugin to extend this if you like).
2. Evolution's body indexing and summary files are extremely fast and efficient, about the best you'll get. I hear MySQL has text indexing capabilities that are extremely fast, but I'm not sure if they are faster than Evolution's indexer or not. Might be interesting to check this out.
3.
> But the thing that bugs me most is disk space. Typical inboxes are
> made of 5% to 10% of Text including Headers and HTML. The rest are
> BASE64- (or UU-) encoded pictures, word documents, zip archives and so
> on. The problem here is the encoding which wastes considerable amounts
> of space (at least one third).
It's theoretically possible, if you wrote your own Evolution storage plugin, to change the Content-Transfer-Encoding header value of binary attachments to "binary" (and text attachments to "8bit") before writing the message out to disk (or wherever) thus magically making it so that you no longer save the encoded text of the attachments but rather in-line binary data content. (Yes, it's as easy as setting an enum value in the CamelMimePart structure).
However, you have to be aware of the consequences of this. Most importantly, you will not be able to validate any of your PGP/MIME or S/MIME signed messages as according to the RFCs for these types, the signed MIME parts MUST be treated as opaque (meaning that you may not modify them in any way).
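The "at least one third" overhead quoted above is easy to confirm. The following is a small sketch using Python's standard `base64` module (nothing Evolution- or Camel-specific), with random bytes standing in for a binary attachment.

```python
import base64
import os

raw = os.urandom(300_000)          # stand-in for a binary attachment
encoded = base64.encodebytes(raw)  # base64 with a newline after every 76 characters
overhead = len(encoded) / len(raw)

print(f"raw={len(raw)} encoded={len(encoded)} ratio={overhead:.3f}")
```

Every 3 input bytes become 4 output characters, and the line breaks add a little more, so the encoded form is about 35% larger than the original. Storing the part with a "binary" transfer encoding would recover that space, subject to the signature caveat above.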
Now on to your ideas...
> One file per Mailbox-folder, allowing multiple folders per user.
> Should those files reside in one central location or in users
> Homedirs?
How is this different from mbox? (btw, CVS Evolution can handle mbox files and directory trees in external locations - ie, not within the ~/evolution directory).
> Compression: Should messages be broken into pieces and the
> MIME-attachments stored separately (thus searching of the text parts
> would still be possible without decompressing the whole file)?
If you break apart the MIME parts, you run into the same problem I described above about not being able to verify signatures.
However... if you took a normal mbox and gzipped it, you would certainly save space (at the expense of speed). I've been thinking about writing a CamelMimeFilterGzip class for gzip compressing/decompressing streams which would allow Evolution to read and write to gzipped mbox files for example.
Once the class is written (which should be fairly simple), allowing Evolution to read gzipped mboxes should be as simple as doing:
camel_stream_filter_add (MboxStream, GzipFilter);
...before feeding 'MboxStream' to the MIME parser.
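For readers who want to try the idea outside Evolution, here is a rough equivalent using only the Python standard library. It is a sketch of the same "decompress, then parse" pipeline, not the Camel API; the helper name `iter_gzipped_mbox` and the sample addresses are made up, and it does not undo ">From " escaping in message bodies.

```python
import email
import gzip
import os
import tempfile


def iter_gzipped_mbox(path):
    """Yield parsed messages from a gzip-compressed mbox, decompressing as a stream."""
    chunk = []
    with gzip.open(path, "rb") as stream:
        for line in stream:
            if line.startswith(b"From "):
                # An mbox "From " line separates messages and is not part of either.
                if chunk:
                    yield email.message_from_bytes(b"".join(chunk))
                chunk = []
                continue
            chunk.append(line)
    if chunk:
        yield email.message_from_bytes(b"".join(chunk))


# Build a tiny two-message gzipped mbox to read back.
sample = (
    b"From alice@example.org Thu Jan  1 00:00:00 2004\n"
    b"Subject: first\n\nhello\n\n"
    b"From bob@example.org Thu Jan  1 00:00:01 2004\n"
    b"Subject: second\n\nworld\n"
)
path = os.path.join(tempfile.mkdtemp(), "inbox.mbox.gz")
with gzip.open(path, "wb") as f:
    f.write(sample)

subjects = [msg["Subject"] for msg in iter_gzipped_mbox(path)]
print(subjects)
```

Reading is the easy half. Appending or expunging a message in a gzipped mbox means rewriting the compressed stream, which is where the speed cost mentioned above comes from.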
> File format: gdbm, Sleepycat db? Something new?
Please not Sleepycat. If you are so sure that a generic database backend will be better than what Evolution's got, at least have the sense to use MySQL or PostgreSQL.
I'm personally against using a generic database as a storage and here's why:
1. The average user does not have an SQL database installed on their desktop systems, and so this is a completely ridiculous dependency for them. If you think library dependencies are bad, just wait till you have to go installing, configuring, and maintaining a multi-user database running on your system. This may be fine for a company solution, but not the average end-user.
2. I'm not too familiar with MySQL or PostgreSQL, but I recall there being problems with mailers that use SQL database backends that tried to store the content of the messages as part of the table (due to them making the size of the table too small or whatever). If you can set the size to be "infinite", then I guess that's not a problem.
If your plan is just to have the database index the folder and actually store the contents as separate files, then you've instantly gained nothing over Maildir except that now you have a hefty database that you have to maintain and very little to no speed improvements (especially if you have a well designed/implemented summary index like Evolution does).
The only improvements you might gain here is body indexing? As I said earlier, MySQL supposedly has a REALLY good text indexer and so it might be a little faster than Evolution's. I'm really not sure on the comparison here.
> Should the security model allow users to directly access their
> files, grep them, copy them around?
Is there a reason NOT to? I don't see one. It's their mail.
> Shared folders, virtual domains?
This doesn't really have anything to do with folder formats and everything to do with features of the client itself.
(Evolution can do this).
> Unicode support in folder names? Imap message-IDs, flags, useragent
> specific state-information?
Unicode support in folder names I'd say is a pretty important feature. I'm not sure what you mean by "Imap message-IDs". Do you mean UIDs? Evolution, for example, has a UID assigned for each message whether it be in an mbox folder, Maildir folder, MH folder, or IMAP folder. So this isn't necessarily dependent on folder format (though it could be if you used a database backend for example, you might want a UID in the table).
I don't feel that UIDs are a must though, but I would suggest them. They are definitely useful especially for folders that can be accessed by multiple clients at once.
Flags are good. I'd go so far as to say a MUST have.
As far as user-agent specific state-information, it'd be nice to not need it. But if the client needs to keep its own info, it'd be nice to be able to map the info to UIDs and keep its own state file somewhere else (not necessarily alongside of the mail storage).
For example, IMAP doesn't have any means for the client to store state information on it, but that's perfectly fine. If a client chooses to have its own state, then it can save it locally.
It would be nice if the storage could handle user-defined flags/tags though. This would allow the client to extend the native features of the format (Flag-for-Followup, message colouring, etc).
> How would MTAs deliver mail? How would clients access? File-locking
> (NFS)?
This is one reason to just stick with what's available.
File locking is a MUST have (or a scheme to make it not needed, such as Maildir).
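Maildir avoids locks by writing each message under a unique name in tmp/ and then renaming it into new/, so readers never see a partial file. The sketch below illustrates that delivery step under simplifying assumptions: the function name `deliver` and the exact file-naming scheme are made up for the example, and it is not a complete MDA.

```python
import itertools
import os
import socket
import tempfile
import time

_counter = itertools.count()  # keeps names unique within this process


def deliver(maildir: str, message: bytes) -> str:
    """Deliver one message into a Maildir without taking any lock."""
    for sub in ("tmp", "new", "cur"):
        os.makedirs(os.path.join(maildir, sub), exist_ok=True)

    # Time + pid + per-process counter + hostname: unique without coordination.
    unique = f"{int(time.time())}.P{os.getpid()}Q{next(_counter)}.{socket.gethostname()}"
    tmp_path = os.path.join(maildir, "tmp", unique)
    new_path = os.path.join(maildir, "new", unique)

    with open(tmp_path, "xb") as f:  # "x" mode fails if the name already exists
        f.write(message)
        f.flush()
        os.fsync(f.fileno())         # make sure the data is on disk before publishing

    os.rename(tmp_path, new_path)    # atomic when tmp/ and new/ share a filesystem
    return new_path


box = tempfile.mkdtemp()
delivered = deliver(box, b"Subject: test\n\nbody\n")
print(os.listdir(os.path.join(box, "new")))
```

Because the only shared step is the rename, two deliveries can run at the same time without coordinating. Python's standard `mailbox.Maildir` class provides a ready-made implementation for real use.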
--
You know, I have one simple request...and that is to have messages with freakin' laser beams attached to their headers. Now evidently my MIME specification informs me that that can't be done. Uh, can you remind me what I pay you people for? Honestly, throw me a bone here. What do we have?
*sigh*
yEnc is a complete waste of time. Had the author of yEnc actually gone out and read some pre-existing MIME specifications before going out and re-inventing (a square) wheel, he would have found that MIME already defines an encoding that gets even better compression than yEnc. It's called "binary". Yes, MIME can handle binary content.
Content-Transfer-Encoding: binary
it's as simple as that.
Btw, I've implemented the yEnc specification in my library GMime.
My favourite part of the yEnc author's defense for why he implemented yEnc is "but most news clients don't implement MIME". Hah, join the real world where NO news reader implemented yEnc. (yes, I know there are clients that implement it now, in fact my code is used in a few of them).
Believe me as someone who spends time hacking on news and mail readers, yEnc is nothing but a headache.
No, we definitely do not need another standard to move mail around.
MIME *is* a transport. MIME *IS* easy to decode. MIME *must* be supported by any email client already.
MIME *is* the solution, it already exists, it supports everything you need (multiple binary attachments, multilingual headers), and it *works*.
XML is *not* a good idea.
Life With Qmail
Building a Linux Qmail Toaster
Same thing, but with FreeBSD (more scalable, in my experience)
have fun
Remember that what's inside of you doesn't matter because nobody can see it.
Alen, your experiences with MS-Exchanges are so many worlds of difference away from mine that I nearly suspect that you've written a troll. Rebooting a mission critical service like a mail server during working hours is unsatisfactory. If other mission critical services like file and print sharing are also disabled during that reboot, then it's time to look for a more robust product.
I have worked closely with three shops in the previous three years that used Microsoft Exchange. Each had at least 3 full time equivalents of MCSEs to babysit their Exchange servers, probably more if you count overtime. This is not counting the occasional high priced consultant. None of these shops could keep Exchange running for a full week. Nor could they keep it from losing mail (when I measured, it was 10-15%). Nor could they get it to communicate well with other mail servers. Nor could they keep it from getting wiped out once every three months by MSTDs (especially worms and viruses).
In contrast, Novell servers run years at a time unattended (nearly every consultant has at least two such anecdotes of their own) and many UNIX-based MTAs need only a few hours of non-hardware maintenance per year, when set up tight. I guess running MS-Exchange is a new status thing to flaunt resources, like having a tubercular wife was during the Victorian era.
Needless to say the management's support was/is a real PITA for anyone doing work via e-mail with people outside of the house's MS-Intranet. In one case it even delayed the publishing of a book by several weeks. In house use of Exchange was fine -- when it was down for you, it was down for everyone else so it was a nice time out and a chance to go have coffee with the others. When put to the test, file sharing couldn't, wouldn't, didn't function often enough to be useful either. For file sharing, those without access to a Novell or Unix file server used sneaker net or mailed attachments. Yes, Exchange does look good in the 4-color glossy marketing brochure, but that's where it ends and reality sets in.
Puh.
Back to mail databases. RFC 2822, Internet Message Format, specifies the general structure of a message. This can be oversimplified as a header with its standard and non-standard fields and one or more message bodies. RFC 2046 specifies multipart bodies. These structures do seem very well suited to a relational database.
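As a rough illustration of that mapping, here is a sketch using Python's built-in `sqlite3` and `email` modules. The three-table layout and the names `message`, `header`, `part` and `store` are assumptions made for this example, not a schema taken from any existing mail store.

```python
import email
import sqlite3

# Hypothetical schema: one row per message, per header field, and per body part.
SCHEMA = """
CREATE TABLE message (id INTEGER PRIMARY KEY);
CREATE TABLE header  (message_id INTEGER REFERENCES message(id),
                      position INTEGER, name TEXT, value TEXT);
CREATE TABLE part    (message_id INTEGER REFERENCES message(id),
                      position INTEGER, content_type TEXT, body BLOB);
"""


def store(db, raw: bytes) -> int:
    """Split a raw message into header rows and leaf-part rows."""
    msg = email.message_from_bytes(raw)
    msg_id = db.execute("INSERT INTO message DEFAULT VALUES").lastrowid
    for pos, (name, value) in enumerate(msg.items()):
        db.execute("INSERT INTO header VALUES (?, ?, ?, ?)", (msg_id, pos, name, value))
    # Only leaf parts carry content; multipart containers are just structure.
    leaves = [p for p in msg.walk() if not p.is_multipart()]
    for pos, part in enumerate(leaves):
        db.execute(
            "INSERT INTO part VALUES (?, ?, ?, ?)",
            (msg_id, pos, part.get_content_type(), part.get_payload(decode=True)),
        )
    return msg_id


raw = (
    b"From: a@example.org\r\n"
    b"To: b@example.org\r\n"
    b"Subject: demo\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/alternative; boundary="XX"\r\n'
    b"\r\n"
    b"--XX\r\n"
    b"Content-Type: text/plain\r\n\r\n"
    b"plain text\r\n"
    b"--XX\r\n"
    b"Content-Type: text/html\r\n\r\n"
    b"<p>html</p>\r\n"
    b"--XX--\r\n"
)

db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
msg_id = store(db, raw)
types = [row[0] for row in db.execute(
    "SELECT content_type FROM part WHERE message_id = ? ORDER BY position", (msg_id,))]
print(types)
```

Note that storing decoded parts this way discards the exact original bytes, so a real design would also need to keep the raw message (or enough information to rebuild it) if signatures must remain verifiable.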
Oh dear, another file format debate. I'm glad there was a library suggestion though ... that allows us to change our mind when we do it wrong the first time ;)
First, you need to consider the possibility of moving the mailbox. To a different computer, or a different platform. This means it must be easy to access in any environment, and the tools must be portable.
This doesn't completely rule out a database solution (like MySQL), but it certainly makes it less-than-ideal.
Second, having used many mailers which separate out attachments ... Please Don't Do It! You can't easily move your mailbox, because there are a host of associated attachment files. There is ALWAYS a synchronisation issue between attachments and messages, so you end up scanning and cleaning out the attachment folder every so often to prevent dead files from accumulating.
Compression is nifty, but isn't really important. Disk space is seldom a concern these days, and the really big stuff (binaries) is often already compressed or doesn't compress well.
The real issue with most mailbox formats is how do you deal with the problem of removing dead space from the mailbox? Some programs just leave it there until you hit "compact", which is wasteful and confuses users. Others rewrite the entire mailbox every time, which causes the software to "hang" for a while on shutdown.
The best suggestion I can come up with off the top of my head is this: One file per mailbox folder, and that file is its own filesystem. The "root node" contains a group of summaries (from, to, subject, date, etc) and node links. Other nodes are chained to contain the message and attachments.
Handling attachments: attachments are separated out and stored as binary in the mailbox. This conserves space but keeps the attachment with the message.
Compacting: is avoided. When a mail is deleted, it is merely flagged in the root node (index). So each mailbox has its own deleted items folder, so to speak. When the deleted items folder is emptied, the index is rewritten and nodes freed - every node not at the end of the file is overwritten with a node from the end of the file (and appropriate reindexing done), so the file is automatically compacted.
Ideally the file needs some sort of transaction logging area to ensure its integrity at all times.
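The "overwrite freed nodes with nodes from the end of the file" step can be sketched in a few lines. This is a simplified in-memory model of the proposal, assuming one fixed-size node per message (the proposal chains several nodes per message) and leaving out the transaction log; the class and method names are made up.

```python
class Folder:
    """In-memory model of a one-file mailbox made of fixed-size nodes."""

    def __init__(self):
        self.nodes = []       # slot -> payload (stands in for nodes in the file)
        self.owner = []       # slot -> id of the message stored in that slot
        self.index = {}       # message id -> slot (the "root node" index)
        self.deleted = set()  # message ids flagged as deleted

    def add(self, msg_id, payload):
        self.index[msg_id] = len(self.nodes)
        self.nodes.append(payload)
        self.owner.append(msg_id)

    def delete(self, msg_id):
        """Deleting only sets a flag; no data moves yet."""
        self.deleted.add(msg_id)

    def empty_trash(self):
        """Free flagged nodes by moving the tail node into each hole."""
        for msg_id in list(self.deleted):
            hole = self.index.pop(msg_id)
            last = len(self.nodes) - 1
            if hole != last:
                moved = self.owner[last]
                self.nodes[hole] = self.nodes[last]
                self.owner[hole] = moved
                self.index[moved] = hole  # reindex the node that was moved
            self.nodes.pop()              # equivalent to truncating the file
            self.owner.pop()
        self.deleted.clear()


folder = Folder()
payloads = {name: f"body of {name}" for name in "abcde"}
for name, body in payloads.items():
    folder.add(name, body)

folder.delete("b")
folder.delete("d")
folder.empty_trash()
print(len(folder.nodes), sorted(folder.index))
```

Each freed node costs one node copy and one index update, so emptying the trash is proportional to the number of deleted messages rather than the size of the mailbox.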
Shared access to files is best handled through a library or a service. File locking is notoriously prone to bugs and security issues, and avoiding multiple implementations in different mail clients would be beneficial.
i-name =twylite [http://public.xdi.org/=twylite], see idcommons.net
For Pete's sake, leave mail alone. If I can't fix it in less than 20 minutes with grep and perl, I don't want to know about it.
Divide mail into 20-30 logical "folders" (files), use procmail to help sort/scan/unspam, do IMAP to get to it from Win machines, archive mail out of your working files once it gets a year old, and you're all set. Strive to keep your inbox empty (you need a proper "action" orientation with your mail folders to accommodate this). No big deal.
Postfix is good, free, and open source. It's also easy to configure. You should be able to get it going in about half an hour.
As for Microsoft and Exchange Server, aren't they convicted criminals? I don't want to use software made by criminals. If they are willing to break the anti-trust laws, what other laws might they be willing to break? I don't trust them with my email.
That's what maildir is.
New things are always on the horizon
There should be a standard byte-compiled representation of XML (CXML), which has been flattened into an easily readable data structure. It would be portable, with byte orders indicated in flags (or would just use network byte order, i.e., big-endian), and with fixed-length element start/end headers, and could be used in lieu of XML for machine-machine communications. If a human wants to inspect the data, CXML could be trivially converted to and from XML.
There is one. It's called ASN.1.
Simon
Coming soon - pyrogyra
I would definitely recommend XMail Server. Cross platform (Linux, FreeBSD, WinNT/2K/XP, Solaris), runs multiple domains with no problem. Not really that hard to set up if you read all the docs. There are several web config apps for it now and it's not that hard to program against the TCP config interface. It's being actively developed (new release every month or more often if a rare bug comes up.) It's licensed under the GPL. I use it with about 30 domains with 4-20 users per domain. I have had 0 problems with it. Easy to use, easy to upgrade (just copy the new binaries) no complaints.
Byte magazine had an article on this, at least 15 years ago (possibly more). It involved bouncing a laser off one of (apparently) many golf ball satellites orbiting the earth (tiny spheres covered in mirrors, designed for measuring continental drift) and encoding data in the laser. The roundtrip of the light beam allowed the "storage" of quite a few megabytes IIRC. The fact that I still remember this vividly so many years later is a testament to how cool the idea was back then. *Sigh* Those were the days.
Depends on how the mail side is set up. Single instance store solves this. BUT, few places run it, as the admin overhead is generally not worth it.
Never thought I'd like an email client less than I liked Exchange, but Notes wins that prize.
You're confusing the client (Notes) with a server (Exchange). You can run Outlook against a Domino mail server. The Domino mail server, which does have its quirks, is in my experience way more reliable than Exchange. Plus with Notes clients, mail-borne viruses are very unlikely.
I have never used Exchange, but a friend of mine admins a large (50,000+ users) Exchange system. Even a few years ago, running on NT4, their servers did NOT go down, ever. They scheduled a reboot for patches etc every 6 months, that's it. I have had lots of Netware boxes up for over a year, but not Netware 5 running mail. I inherited such a box & it needed to be rebooted every month or two. Now I've replaced it with a Linux based mail server & I'm much happier. Still have a 4.11 box cranking along happily, even happier since the 5 box is no longer giving annoying messages about its licences. And my 2000 Server has been up for coming up on a year with no problems.
when your boss says you will have shared contacts
LDAP
and calendars
CorporateTime (http://www.steltor.com)
and your clients will run Windows
Both of these work fine on Windows
find me a solution that comes within miles of the ease of Outlook
You can even keep using Outlook; LDAP is supported by Outlook, and Steltor provides an Outlook plugin that talks to their server instead of Exchange.