Improving Unix Mail Storage?
At first, there was mbox, then there was Maildir, and Bill begat Outlook and .mbx. CaraCalla wonders if there is a better way to store mail than the way we currently store it today. I admit, with the changes that email has undergone over the past 5 years (changes in what is being sent, not necessarily in how it is sent), it may be time to reinvent the mail format. Read on for CaraCalla's analysis of the current mail options, and his thoughts on where we may go in the future. If you were to design your own MUA, how would you design its mail storage?
CaraCalla asks: "Does anybody know a good, free solution for storing mail on unix hosts? The reason that I ask this question is my discontent with available techniques:
- mbox: There are problems with locking, corruption, access-times, and bloat.
- Maildir: Do you really want to clutter your system with millions of small files? That's a waste of inodes and space (unless perhaps you use Linux/ReiserFS or SGI), and just try to open a Maildir with 1000+ mails and see how long it takes your favorite mail program just to display the subjects.
- Cyrus: Basically the same as Maildir with database features.
- UW-Imap mbx: That's classical mbox with extensions allowing multiple access.
- Evolution: Basically mbox with database features.
- Windows clients: Typically some proprietary db-format. Pathetic.
But the thing that bugs me most is disk space. Typical inboxes are made of 5% to 10% of Text including Headers and HTML. The rest are BASE64- (or UU-) encoded pictures, word documents, zip archives and so on. The problem here is the encoding which wastes considerable amounts of space (at least one third).
Some ideas about the ideal mail-storage:
- One file per Mailbox-folder, allowing multiple folders per user. Should those files reside in one central location or in users Homedirs?
- Compression: Should messages be broken into pieces and the MIME-attachments stored separately (thus searching of the text parts would still be possible without decompressing the whole file)?
- File format: gdbm, Sleepycat db? Something new?
- Should the security model allow users to directly access their files, grep them, copy them around?
- Shared folders, virtual domains?
- Unicode support in folder names? Imap message-IDs, flags, useragent specific state-information?
- How would MTAs deliver mail? How would clients access? File-locking (NFS)?
- What about backwards-compatibility? Writing libmailstore (anyone)? adopting UW c-client?
Does my ideal mailstorage exist somewhere? Is somebody working on a project addressing this? Does anybody have some other hints? And please no mbox/Maildir flamewar!"
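The "at least one third" figure for encoding overhead is easy to sanity-check. Here is a minimal Python sketch, assuming 3,000,000 random bytes as a stand-in for a binary attachment (the payload and variable names are made up for illustration, not taken from any real mailbox):

```python
import base64
import os

# Hypothetical stand-in for a binary attachment: 3,000,000 random bytes.
raw = os.urandom(3_000_000)

# encodebytes() emits MIME-style base64: lines of up to 76 characters,
# each terminated by a newline.
encoded = base64.encodebytes(raw)

# Base64 maps every 3 input bytes to 4 output characters; the line
# breaks add a little more on top of that.
overhead = len(encoded) / len(raw) - 1
print(f"raw: {len(raw)} bytes, encoded: {len(encoded)} bytes, overhead: {overhead:.1%}")

# Decoding returns the original bytes, so a store could keep the raw form
# and re-encode only when a message is sent or forwarded.
assert base64.decodebytes(encoded) == raw
```

The overhead comes out a bit above one third because of the line breaks, which is consistent with the estimate in the question.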
But put multiple indexes (by sender, subject, date, whatever key-classes you want to assign messages) and the possibility to restrict the range displayed. With careful programming, you can manage many users who won't be able to read each other's mail, except as required.
This way, you can arrange your mail as you please.
No more message duplication. Send a memo to 250 people? Just send it once, but tag it as readable by the 250 sendees.
Of course, this calls for an SQL database... :) :) :)
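To make the "send it once, tag it as readable by 250 people" idea concrete, here is a rough sketch using Python's built-in sqlite3 module. The table names, column names, addresses, and the date are all invented for illustration; this is one possible layout under those assumptions, not a description of any existing mail server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE message (
    id      INTEGER PRIMARY KEY,
    sender  TEXT NOT NULL,
    subject TEXT NOT NULL,
    sent    TEXT NOT NULL,
    body    BLOB NOT NULL
);
-- One row per (message, reader) pair: the body itself is stored only once.
CREATE TABLE delivery (
    message_id INTEGER NOT NULL REFERENCES message(id),
    recipient  TEXT NOT NULL,
    PRIMARY KEY (message_id, recipient)
);
-- Multiple indexes, one per key class you want to browse by.
CREATE INDEX message_by_sender ON message(sender);
CREATE INDEX message_by_date ON message(sent);
CREATE INDEX delivery_by_recipient ON delivery(recipient);
""")

# Store the memo once...
cur = conn.execute(
    "INSERT INTO message (sender, subject, sent, body) VALUES (?, ?, ?, ?)",
    ("boss@example.com", "Memo", "2002-05-01T09:00:00", b"Please read."),
)
memo_id = cur.lastrowid

# ...and tag it as readable by 250 (hypothetical) recipients.
recipients = [f"user{i}@example.com" for i in range(250)]
conn.executemany(
    "INSERT INTO delivery (message_id, recipient) VALUES (?, ?)",
    [(memo_id, r) for r in recipients],
)

def inbox(conn, user):
    """Return (subject, sender) for the messages this user may read, oldest first."""
    return conn.execute(
        "SELECT m.subject, m.sender FROM message m "
        "JOIN delivery d ON d.message_id = m.id "
        "WHERE d.recipient = ? ORDER BY m.sent",
        (user,),
    ).fetchall()

stored_copies = conn.execute("SELECT COUNT(*) FROM message").fetchone()[0]
print(stored_copies, inbox(conn, "user42@example.com"))
```

Restricting what each user sees is then a matter of always going through a query like `inbox()`, which is where the "careful programming" comes in.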
I'd store it in either an XML format or possibly as separate files in a directory structure, on a filesystem that could handle the extra load, like XFS. But it's nice to have a single file to back up, or at least a single directory. I think Evolution, Netscape, and mbox handle mail just fine as it is. Or how about a filesystem in a file that you mount loopback, compressing old mail as needed? Store all mail in separate directories, with separate files for attachments, and have XML metadata describing everything in an easily parsable form.
Perhaps storing the message, attachments, etc. in an RDBMS would be a better way. Give each user a table-space with a table per folder/directory or just each user with a single table. With a decent RDBMS, storage on disk is no longer your concern. This way web, local text/gui, and remote text/gui clients could easily access the same information. There are probably several solutions out there already (with wrappers for your favorite mail clients).
Well. My solution for storing A LOT of BIG email but still browsing fast is to use MySQL. My mail client is Pronto! (written by Muhri, in perl, gtk, etc). I have several tens of thousands of emails in about 10 different folders. Reaction time is immediate and searching is pretty damn quick as well.
The mysql server is at work, and I can view my mail from anywhere simply by pointing my client at my IP. Presto.
I'm also slowly writing a MySQL-based IMAP server which will hopefully be compatible with Pronto!... But as with so many projects, it'll probably take some time to complete...
David
stuff
...and why not?
20.....15.....10......5......doo!
I like how you included the Microsoft proprietary client format blah blah blah. Gotta have that eh?
Are you aiming for this discussion to be server side, or client side? Or a general slugfest as long as it's anti-Microsoft?
If you're so worried about encoded binaries, why not try yEnc instead of base64 or uuencoding? It works well in newsgroups. It might work well for email storage as well.
Receive new mail as ASCII files (just as now), store them in a database. Attachments should be decoded and stored as binary objects in the database, with the ability to extract them and save them. The extraction process would leave behind info about when they were extracted and where they were saved. Following them after that would be up to the user. Database could be MySQL or some other common OS SQL database.
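A rough Python sketch of that decode-and-store step, using only the standard library. It also keys each attachment by its SHA-256 hash, so the same file sent in several messages is kept once. The table names, file name, and addresses are invented for illustration, and SQLite stands in for MySQL purely to keep the example self-contained.

```python
import hashlib
import sqlite3
from email import message_from_bytes, policy
from email.message import EmailMessage

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE content (sha256 TEXT PRIMARY KEY, data BLOB NOT NULL);
CREATE TABLE attachment (message_id TEXT, filename TEXT, sha256 TEXT);
""")

def store_attachments(raw_message):
    """Decode every attachment of one message and store it as a binary object."""
    msg = message_from_bytes(raw_message, policy=policy.default)
    for part in msg.iter_attachments():
        data = part.get_payload(decode=True)   # undoes base64 / quoted-printable
        digest = hashlib.sha256(data).hexdigest()
        # Identical bytes hash to the same key, so they are stored only once.
        db.execute("INSERT OR IGNORE INTO content VALUES (?, ?)", (digest, data))
        db.execute(
            "INSERT INTO attachment VALUES (?, ?, ?)",
            (str(msg["Message-ID"]), part.get_filename(), digest),
        )

def make_mail(message_id, payload):
    """Build a small test message with one binary attachment (illustrative only)."""
    m = EmailMessage()
    m["From"] = "a@example.com"
    m["To"] = "b@example.com"
    m["Subject"] = "report"
    m["Message-ID"] = message_id
    m.set_content("see attachment")
    m.add_attachment(payload, maintype="application",
                     subtype="octet-stream", filename="report.bin")
    return m.as_bytes()

payload = bytes(range(256)) * 100
for message_id in ("<1@example.com>", "<2@example.com>"):
    store_attachments(make_mail(message_id, payload))

content_count = db.execute("SELECT COUNT(*) FROM content").fetchone()[0]
link_count = db.execute("SELECT COUNT(*) FROM attachment").fetchone()[0]
stored = db.execute("SELECT data FROM content").fetchone()[0]
print(content_count, link_count, len(stored))
```

Recording when and where an attachment was later extracted would just be extra columns on the `attachment` table.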
"A gun is a tool, Marian. No better, no worse than any other tool. An axe, a shovel, or anything." Shane (1953)
A single database to hold all of the users' email. Single-instance storage ensures that only one copy of any attachment is in the database at once, no matter how many email messages it was sent in. APIs for backup let you back up the whole database or individual mailboxes. And depending on your backup solution, you can restore mailboxes and individual emails. Anti-virus software integrates into the server side of the software. In Exchange 2000, if you accidentally delete a mailbox you can easily bring it back with all its emails without restoring from tape. The only files to worry about on the user end are a personal address book and archived email. Unless you use POP3 or it's archived in personal folders, the email always stays on the server, preventing problems like accidentally downloading to a home PC important emails you need at the office. And it's stable. Not as stable as UNIX, I admit, but it stays up for months without a reboot. And in my experience most problems are solved by a simple reboot. In 4 years of exposure to Exchange, the only non-admin-related problems I've seen were one database corruption, where I needed to run a utility and wait 45 minutes for it to work again, and a corrupted MTA that needed a reboot to get it working right again.
By replacing the file system with SQL Server!
A well-crafted lie appears unquestionable - Dama Mahaleo
...and a pox. Definitely... both a stench and a pox on you and your benign ancestral parade. Now, bring me another Dr. Pepper.
And I don't want to hear any further talk of this 'male' thing.
Personally, I would store mail in one big XML file, all gzipped up. XML is large and bulky, but it's repetitive and texty so it should compress well. The alternative is a SQL style database, but that seems like overkill; there aren't really that many relationships there. Just use XML and search it for what you want.
Something like Maildir .. if the FS is slow and can't handle that kind of application, then we need to improve our filesystems!
Lots of applications need lightweight databases with indexes, locking, and atomic operations. Why not bake this into the filesystem, and it won't have to be just for email, it will have many uses.
I was thinking about this the other day as I was working on a logging system for a large in-house email filtering system.. similar problem, except instead of storing emails, I'm storing small XML fragments describing the structure of each email and what was done to each. So far the easiest solution was large monolithic XML files, and an external index pointing in the large file (i.e., like mbox + a DB index). As it grows we'll probably have to move it to a "real" database.
There is a need for something like sleepycat DB + ReiserFS on steroids..
Automatically deliver mail to recipients who can then save the mail on their own machines. It's like distributed processing except it's distributed storage.
If someone isn't logged on to receive their mail (like those saps who turn their machines off every night), then forward the mail to /dev/null.
I've followed ReiserFS development for years now, shipping our first servers with it some two years ago (and every box we've shipped since then), and I believe they have the best long-term plan for this kind of thing. Hans has written some excellent white-papers on making small files extremely cheap.
The eventual goal of Reiser is a filesystem that is indistinguishable from a powerful database (if a special purpose database). The plan is to make small files so cheap that every extension of a file, directory, etc. is just another file. Another interesting turn is that files would no longer be, necessarily, of the form '/big/long/path/to/some/file'...because the filesystem is a database, one could also access it by a category, so that one file read pulls in all of the data of that category (from any number of files). Directories become just one view of the data available, with any number of other views possible depending on the application.
As was mentioned in the parent, this would lead to things like 250 email recipients and only one actual file. But of course, this leaves out the copy-on-write functionality needed to make this seamless.
So I think the solution is probably to fix the filesystem--not to fix the email storage mechanism. A number of very smart people have 'fixed' email storage in the past, leading to all of the options we have today, none of which works extremely well on really large mailboxes. Yes, many are good enough, and many work fabulously for small to mid-sized applications. But the day will come when they do not work so well, due to the higher volume and growing average size of emails.
A good place to start for information about these ideas (which are primarily a consolidation of the most interesting research in the field of filesystems and databases):
http://www.namesys.com/whitepaper.html
ReiserFS is good stuff. Give Hans' papers a read sometime.
BTW-Don't gripe at me about ReiserFS instability, etc. I know better. As I mentioned I've been shipping servers with it for 2 years, and we've never had a single ReiserFS-caused corruption. Not one.
Slashdot never disappoints. You wait a minute, and 10 people have already said what you want to say.
I, like others, suggest an RDBMS to implement a secure and quick mail system. This way you get the benefits of administrative security, file locking already in place, performance, redundancy, and potential for easy management. This, of course, all hinges on how well your RDBMS handles those specific details. That might also lead to some cool server-side email apps, as well. A blinding-fast email search utility on a MySQL mail system, or a nice way to relate your user information to email statistics (shudder).
Beef! Beef! Beef!
I just want full text indexing. Don't care about nuthin' else.
The great advantage of the current system is that it is very easy to move your e-mail from one program or computer to another with little hassle and/or risk. With any type of database system, you introduce a level of complexity that virtually assures that only one e-mail program will be able to read your e-mail. I think the best solution, as far as I am concerned, is to just stick with the current mbox format but allow attachments to be deleted independently, though that is just personal preference. But I think we should be wary of adding any complexity that endangers the portability of mail. Also, the other thing to be said for the mbox format is that if worst comes to worst you can still access your e-mail with a text editor and/or grep.
All I have to say is that Linux rules! The author's bias against Windows is not without merit.
- g>>(o)atse
/. punchingbag jwz has some strong opinions about using databases (etc.) for mail storage. I tend to agree: everything can read from and write to files, there are no versioning issues, they can be easily transported among different operating and file systems, they can be backed up easily. But it's another wheel to reinvent, so everyone hop to it at once and then lose interest in two or three weeks!
Chris
M-x auto-bs-mode
I've been joking for years about getting two shell accounts on opposite sides of the planet and setting each up with procmail to bounce all my mail between the two (always rewriting the header so as to avoid a loop.) I figure at any given time, my mail would be in both places and neither simultaneously.
If I want to read some, I'd just chmod .procmailrc for a few seconds and change it back. Plenty of mail storage without chewing up precious file-system quota.
Some people have a way with words, and some people, um, thingy.
Probably no one will ever see this since I'm posting AC but anyway...
jwz has quite a case for mail summary files that were used in pre-NS4 mailreaders. See http://www.jwz.org/doc/mailsum.html
The basic idea is to use the old (relatively space efficient, compatible with everything) mbox format but also keep a "summary file" to allow quick threading/seeking/etc within the file. Actually quite workable. Worth a read if you're going off and designing (what you think will be) a grand new mail storage scheme. Don't repeat the same mistakes netscape made with NS4!
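For anyone who wants to see the shape of the idea, here is a minimal Python sketch of "mbox plus a summary". It is not the actual Netscape summary-file format described at the link above; the function names and the sample messages are made up, and the summary is kept in memory rather than written to a file. It assumes "From " lines inside message bodies are already escaped, as mbox writers normally do.

```python
import os
import tempfile

def build_summary(path):
    """Scan an mbox file once; return a list of (byte offset, subject) per message."""
    summary = []
    with open(path, "rb") as f:
        offset = 0
        prev_blank = True    # the start of the file counts as a boundary
        in_headers = False
        for line in f:
            if prev_blank and line.startswith(b"From "):
                summary.append([offset, ""])
                in_headers = True
            elif in_headers:
                if line in (b"\n", b"\r\n"):
                    in_headers = False   # blank line ends the header block
                elif line.lower().startswith(b"subject:"):
                    summary[-1][1] = line[8:].strip().decode("ascii", "replace")
            prev_blank = line in (b"\n", b"\r\n")
            offset += len(line)
    return [tuple(entry) for entry in summary]

def read_message(path, summary, n):
    """Seek straight to message n using the summary instead of rescanning."""
    start = summary[n][0]
    end = summary[n + 1][0] if n + 1 < len(summary) else os.path.getsize(path)
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(end - start)

sample = (
    b"From alice@example.com Tue Jan  1 00:00:00 2002\n"
    b"Subject: first\n"
    b"\n"
    b"hello\n"
    b"\n"
    b"From bob@example.com Wed Jan  2 00:00:00 2002\n"
    b"Subject: second\n"
    b"\n"
    b"world\n"
)
with tempfile.NamedTemporaryFile(suffix=".mbox", delete=False) as tmp:
    tmp.write(sample)
    mbox_path = tmp.name

summary = build_summary(mbox_path)
subjects = [subject for _, subject in summary]
second = read_message(mbox_path, summary, 1)
os.unlink(mbox_path)
print(subjects)
```

Listing subjects only needs the summary; the big file is touched only when a message is actually opened. A real implementation would also have to notice when the mbox changed behind the summary's back and rebuild it.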
This is one feature I miss in Linux mail clients. At one stage I wrote a perl filter to achieve this functionality with Kmail.
Attachment Converted: "C:\EUDORA\ATTACH\NEW YORK.pps"
Click on that in Eudora and the attachment opens.
This keeps the actual text in the mbox file lean. I've got almost a decade of correspondence that totals about 20 MB, if it included all the attachments it'd be much more.
Also it allows you to edit messages after receipt, (this might trouble some people, but it just simplifies what I used to do by opening the mbx file in a text editor). I can select all the text, then paste it back in. This has the effect of removing all the HTML coding that is especially crufty from Word generated mail -- a 20k message reduces to 1k.
The next generation of mail storage should definitely work on taking optimal advantage of compression technologies. Preferably in a way that compresses the data from end to end, not just in the receiving mailbox. As to managing the kind of data sent, I'd suggest using a twofold approach. Save binary attachments in the natural state in a subfolder linked to the message itself, which would be kept in a compressed database format.
As to the database format itself, I'd like to see a form of redundancy in the structure of it. Give the design some self-healing ability in case flaws develop as the information gets shuffled around. Media isn't perfect, but mail stability should try and be as good as it can get.
If you want to speed searches, index the data in a separate file and use that. Just keep the actual data storage as simple and reliable as possible; anything like searching or sorting is just a bonus.
My own pointless vanity vintage computing page
One file per Mailbox-folder, allowing multiple folders per user. Should those files reside in one central location or in users Homedirs?
Depends on how the user accesses their mail. If they read their mail only on the local machine, it should be in their home dir. If the server allows multiple forms of access (like local + IMAP), central storage makes sense. There's a lot of other issues here, like backup methods.
Compression: Should messages be broken into pieces and the MIME-attachments stored separately (thus searching of the text parts would still be possible without decompressing the whole file)?
No. Separating a single mail into its component parts is just asking for trouble (not to mention that it massively increases your locking problems).
File format: gdbm, Sleepycat db? Something new?
Personally, I like Maildir, since it lets me use standard tools like grep to find particular mails. I admit that a more efficient method is probably required these days.
Should the security model allow users to directly access their files, grep them, copy them around?
Yes, of course. It's their mail - let them do what they want with it. The mail app must be able to deal with that.
Shared folders, virtual domains?
Shared folders would be nice - IMAP can do that now, although it's overengineered and not necessarily fully implemented in any particular IMAP server. Virtual domains I've never had any use for myself...
Unicode support in folder names?
Why not?
Imap message-IDs, flags, useragent specific state-information?
As you say, IMAP does that already...
File-locking (NFS)?
More the fault of NFS than the mail software (and I believe NFS4 handles locking better).
>just try to open a Maildir with 1000+ mails and see how long it
>takes your favorite Mail program to only display the subjects.
A mailbox with over 1400 messages, using Courier-IMAP, viewing through my webmail interface (see shameless plug below), it takes about 1.4 seconds to sort all messages by size and display the subject, sender, date and size of the first 20 messages.
Am I missing something?
---
Open Source Shirts
One of the things that MS Exchange does well is its storage of messages. It uses a database for the private store (i.e. mailboxes).. the only problem is that it's in a format not unlike MS Access.
A while ago I went looking for a Linux MTA/IMAP server which supported MySQL message-storage. The closest match was Courier; it allows authentication and mailbox-location by MySQL, but not message-storage.. and there was a pretty hostile response to the suggestion that it be added.
Personally, I'd love to see a Linux MTA/IMAP system which uses an SQL message-store. The ability to replicate a message-store across multiple physical sites without having to get into distributed filesystems like Coda would be a huge benefit for those who need to provide a redundant mail service.
What about ReiserFS + Maildir + IMAP4?
:)
It works very well: quite fast, easy to install, and the users are happy.
What more could you ask for?
(and NO Outlook imap client PLEASE!)
Of course, with traditional UNIX file systems, this is a bit slow. The thing to do is to fix the file system, not to kludge ever more complex mail formats on top of it. ReiserFS goes much of the way; we now also need some system calls to open and read multiple files with a single call.
Until file systems catch up, one kludge is as good as another. UNIX mbox format is at least simple, so I stick with that.
Okay, so XML is still quite the buzzword, but that aside, it might be a nice format for storing mail. XML is very cross-platform and it would be easy to write a number of front ends that would access the XML files, which can be organized like a database. Plus, with proper parsing, you could just link to all your attachments, storing them elsewhere and allowing you to compress the XML which is all text. Has anyone ever done something like this?
Who said Freedom was Fair?
People have been arguing about the balance between standard formats that are easy to parse and move between systems and complex formats that make searching easier.
What we need is a standard DTD or schema for mail data that all well written email systems can understand. If everything can import and export XML representations of email, the internals aren't so important.
Maildir : Do you really want to clutter your system with millions of small files? That's waste of inodes, space (unless perhaps you use Linux/ReiserFS or SGi) ...
... ...
... ...
In case you haven't noticed, the default setting for the Linux ext[23] filesystems is to allocate one inode per 4096 or 8192 bytes of disk space. Which happens to be pretty much the size of an average E-mail message. So, in other words, you are unlikely to run out of inodes before you run out of disk space, since both are going to be used up pretty much at the same clip.
It may come as a shocking surprise to some, but the average large filesystem is just littered with small files here, and small files there, all over the place. Here's my workstation -- a fairly large box with all sorts of crap loaded:
Filesystem 1k-blocks
/dev/sdb5 8159388
Filesystem Inodes
/dev/sdb5 1036288
I'm using up almost exactly 8192 bytes per inode.
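The arithmetic is quick to redo. A small Python sketch, taking the two figures quoted above at face value; the 8192-byte average message size is an assumption borrowed from the argument, not a measurement:

```python
# Figures copied from the df output quoted above.
blocks_1k = 8159388   # /dev/sdb5, in 1 KiB blocks
inodes = 1036288      # /dev/sdb5, inode count

bytes_per_inode = blocks_1k * 1024 / inodes
print(f"{bytes_per_inode:.0f} bytes of disk per inode")

# Assumed average message size (hypothetical, taken from the argument above).
avg_message_bytes = 8192

# How many one-file-per-message mails fit before space runs out, ignoring
# block rounding and filesystem overhead, versus before inodes run out.
messages_by_space = blocks_1k * 1024 // avg_message_bytes
messages_by_inodes = inodes
print(messages_by_space, messages_by_inodes)
```

The ratio comes out slightly under 8192, and with messages of that average size the disk fills up a little before the inodes do, which is the point being made.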
and just try to open a Maildir with 1000+ mails and see how long it takes your favorite Mailprogram to only display the subjects.
How about instantly? Most GUI E-mail clients cache mail headers, so they don't have to go and wait for the server to reply each time you click on the folder index window to re-sort, or scroll the folder index.
...
Some ideas about the ideal mail-storage:
* One file per Mailbox-folder, allowing multiple folders per user.
Using one file per folder essentially forces you to use some form of locking each time folder access is necessary. Locking of any sort has been problematic for years whenever NFS (or pretty much any other network filesystem) is involved. A single circuit will now take out your entire network spool, as all clients are now spinning on lock requests out on the unreachable server.
Compression: Should messages be broken into pieces and the MIME-attachments stored separately (thus searching of the text parts would still be possible without decompressing the whole file)?
I thought you wanted to save everything in a single file per folder, and using multiple files for messages is supposed to waste inodes, remember?
File format: gdbm, Sleepycat db? Something new?
Ask an Exchange admin about the joys of a corrupted Exchange database. If mail is stored in simple, plain files, a single instance of corruption will affect at most one mailbox, instead of taking out the entire monolithic database.
Unicode support in folder names? Imap message-IDs, flags, useragent specific state-information?
IMAP already uses Unicode to encode folder names. Not sure what "useragent specific state-information" means...
I believe that UW-IMAP .mbx also includes indexing in the mail file, along with the concurrent access stuff. It's definitely WAY faster than mbox.
>Windows clients: Typically some proprietary db
>-format. Pathetic.
Both Netscape and Eudora use the regular old mbox format.
Outlook may use something else, I've never touched it.
Pegasus uses something different, you'll have to track down either of the guys using it and ask them.
Now I'll never get to sleep tonight.
If it ain't broke, you need more software.
You could have the Database for:
Now a simple mail system would only need a few of the DBs/Tables, but you could easily add the other options later without breaking something you already have going. Which wouldn't be the case if you were to move from just about anything To MS Exchange.
This would almost inevitably break any form of backwards compatibility, except for some possibility of a wrapper that sat around the database, and pretended like it was another format. But I think the pros outweigh the cons....
I know the author did not like 1-file-per-email, but then when used with a VERY good fs (like BFS) it's a very effective method of storing email.
You don't store the subject, date, other metadata in the text file, but in the attributes.
Mlk
Wow, I should not post when knackered.
Cyrus rejected my zlib patches for their IMAP server because, ``disk is cheap.'' I've been using my zlib patch everywhere I use cyrus and it's saved me tons of disk space (it's been so long since I've done a conversion, I don't remember details, but I know it's more than 50% on average).
Cyrus is one of the better systems out there, IMO. Individual files take up a lot of inodes, sure, but the ``database'' files counter the performance lost to having to open all those files when you don't need them.
Before that, I concatenated gzip files in mbox format. Modifying anything that can read mbox to read gzip files can't be terribly hard, but I assure you, the benefit is huge.
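The concatenated-gzip trick works because a gzip stream may consist of several members back to back, and decompressing it yields the members' contents in order. A minimal Python sketch, with two made-up messages; it only illustrates the file-format property, not the Cyrus zlib patch mentioned above:

```python
import gzip

# Two hypothetical messages in mbox form.
messages = [
    b"From alice@example.com Tue Jan  1 00:00:00 2002\nSubject: first\n\nhello\n\n",
    b"From bob@example.com Wed Jan  2 00:00:00 2002\nSubject: second\n\nworld\n\n",
]

# Compress each message on its own and simply append it to the folder, so
# delivering new mail never has to rewrite what is already stored.
folder = b"".join(gzip.compress(m) for m in messages)

# A multi-member gzip stream decompresses to the concatenation of its
# members, which here is just an ordinary mbox.
plain = gzip.decompress(folder)
print(plain.decode("ascii"))
```

Anything that can read mbox can therefore read such a folder once a decompression step is put in front of it. Tiny messages like these will not shrink; the savings show up on real mail with long headers and text bodies.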
-- The world is watching America, and America is watching TV.
SPAM is a burden to everyone. As a system admin, I was told to do something about it. After some research, the best solution was to implement SpamAssassin on our Linux mail server. I tried sendmail SPAM filters, procmail rules, etc. SpamAssassin is undoubtedly the best solution and I recommend it to everyone. It needs to be implemented at the server level, so email your ISP if you don't have root access. It is a simple perl script that can be run with sendmail (using a C++ version) or in procmail (perl). It is very easy to set up using perl CMOS.
How does it work so well? Spamassassin checks the headers and body of every email passing in to the mail server. It searches the email for certain keywords and phrases and other SPAM characteristics and assigns points to the email based on these. It works very well and has many options --including the ability to have "black lists" and "white lists" in file glob format.
So far I have blocked about 94% of the SPAM coming in through our mail server. It only misses a couple and is highly configurable! Download and install it!
Cheers,
Tom
http://tomgould.com/
A project for storing mail in a Postgres DB already exists: Mail2DB.
You can find it by searching the qmail site.
I don't know about anyone else, but I have maildirs with thousands and thousands of email and the subjects display nice and fast.
CowboyNeal! Have him read all your mail, type it out on a typewriter, delete the files, eat the carbon paper, stuff the messages in a backpack, and follow you around all day. Assuming his memory is good, you get fast and relevant responses to searches, excellent security, and easy access at all times! The only problem is space: while hard drive space isn't used, the physical size of the system is far from negligible. It can also start to smell after a few days...
I stole this Sig
At first, there was mbox, then there was Maildir, and Bill begat Outlook and .mbx.
How do you misspell "began"? The "t" key is up above on the first row and the "n" is on the third row.
Email attachments are really just for people who are too stupid to use ftp or http.
yEnc isn't all that great. See http://www.exit109.com/~jeremy/news/yenc.html.
I did some research a few years back and found only one company that seemed to have a solid mail solution: Oracle.
They seem to have developed a mail server (SMTP, IMAP, POP3, and LDAP) package that runs on top of their 9i database.
Check it out at:
http://www.oracle.com/ip/deploy/ias/email/index.html
I have tried it out and it seems pretty solid. But definitely not easy to set up.
-hope this helps
The DBMail project is already well underway, with a fabulous beta 3 release and an active development team pushing towards a 1.0 release. The project is being supported by a Dutch ISP called IC&S. It provides an MDA interface for Postfix/Sendmail/Exim/Procmail/etc and POP3 and IMAP servers. MySQL and PostgreSQL are supported backends. The CVS tree has a non-relaying SMTP server, too.
http://www.dbmail.org/
There's also a solo developer working on a mail server called mmmail. It provides a non-relaying SMTP and POP3 with a MySQL backend.
http://mmondor.gobot.ca/software.html
Life is not that simple. All databases are limited by the size of the basic block, and if you can't fit your data into that block performance takes a hit.
With PostgreSQL this is a compile-time option, default 8k, and it can go up to 32k.
It *is* possible to store larger items, esp. if they're 'TOASTable' or blobs, but this often just pushes the problem of dealing with thousands of files onto the database. Only now it's a lot harder to figure out why performance sucks.
Does this mean that database solutions won't work? Of course not. But it does mean that simple solutions won't scale well when you're dealing with massive amounts of data.
For every complex problem there is an answer that is clear, simple, and wrong. -- H L Mencken
You don't need to store them encoded. You unencode it and it uses even less space than yEnc. If you need to forward it etc., then it gets re-encoded.
It takes me 5 seconds exactly to open a maildir folder with 1315 emails in it.
SealBeater
-- Its survival of the fittest...and we got the fucking guns!!!
You can go a step further - don't bother with setting up a new compression layer, just encrypt it with existing tools. Most encryption routines compress it first, to make cryptanalysis more difficult (and for performance, since there's less data to encrypt), but this is partially offset by the continuing need for 7-bit safe transport layers.
For every complex problem there is an answer that is clear, simple, and wrong. -- H L Mencken
File-based mail storage makes sense on a resource-constrained device, but on a machine with enough CPU and disk to run an RDBMS, the database would be a better plan for many reasons, not least of which is that database developers have already spent countless hours producing efficient storage and retrieval systems so that you won't have to.
Given a schema, it should be pretty straightforward to write an SMTP server to put messages in, and POP3/IMAP/HTTP+CGI servers to pull messages out.
If anyone knows of any existing open-source RDBMS-centric mail systems, I'd love to know where to learn more about them.
Build stuff. Stuff that walks, stuff that rolls, whatever.
...the age of Desktop/Application/Document is done.
Data/Application/View
..is what you should be aiming for as long as you're doing an overhaul. Anything else is just paint on rust.
Check out the Citadel system. (Disclaimer: I am one of the developers, so my opinion on this is kind of strong.) We use Berkeley DB from Sleepycat Software for the data store. Yes, this is the same Berkeley DB that Sendmail uses to store its alias tables, access tables, etc. But it's capable of so much more than that. It's a robust, non-relational database that is hugely scalable and even has transactions/logging support!
We store all messages in the database.
Works like a charm. No pounding through ugly directory hierarchies or insanely long flat files. No need to escape out the word "From" when it appears at the start of a line. None of the cruft.
Ok, so it's a black box. But it's an open source server that uses an open source database backend, and since it supports SMTP/POP/IMAP plus webmail all by itself, you can still plug your favorite utilities into it (Pine, elm, fetchmail, etc.) and you don't have to graft together Sendmail+IMAP+whatever to make your mail system work.
The traditional Unix mail utilities are getting a little long in the tooth. I'm going to get flamed for saying this but look at what's happened to the email world: Lotus and Microsoft have run away with most of the market because Unix traditionalists won't give up their flat files. It's time for us to evolve, folks.
Tired of FB/Google censorship? Visit UNCENSORED!
Wrong approach. Instead, use directories as folders and individual files for each message. So it takes a little more disk, but you can use all the unix tools on your mail without being trapped by a poor MUA design for features. If directory traversal speed is a problem, just use a cache database for the message headers as a hidden file in the directory. Reiserfs is really nice for an underlying FS.
This should look familiar: mh & Andrew Messages (aka AMS) both use this. With AMS you could have 30,000+ messages in a folder without slowdowns.
Your newfangled system better have DRM built in. I don't have any data, but it must be obvious that artists are losing billions in revenues every week due to mp3s being sent as attachments. This criminal behavior must be stopped or the practice of free expression will come to a screeching halt.
We have to have the remote hammer to pop out of the monitor to whack the end user. This is a must for any admin that works with more than 300 people. Hammer trigger from e-mail, pager, SMS, or telephone number.
Power mains must be connected to the user's chair, see above for trigger.
The MUA must forward all p0rn to the admin account. Likewise with credit card info.
The MUA must know when the user is about to do something to tick off the admin, like sending a "me too" to everyone in the office, or replying to a confidential e-mail to the whole office, and prevent the user from reproducing. X-rays are fine for impromptu sterilizations. The side effect of losing all your body hair is no problem, as it alerts others to a stupid co-worker.
The MUA must alert the admin when a coworker he has got the hots for changes her home phone number. Just to be fair, if the admin is female, the reverse applies.
The MUA must analyze the admin's e-mail and throw a bucket of cold water if (s)he attempts to send a really stupid e-mail.
Also, the MUA must be able to launch nuclear missiles at spammers automatically. After that, it should refer the e-mail to the admin to see if a stronger response is warranted. Better yet, the MUA should employ a time machine to go back and choke the spamming creep when the spammer is still a baby, then use X-rays on the parents as above.
The MUA should have a hypnotic effect on the object of the admin's desire and cause that person to perform disgusting oral acts on the admin's body each time a new e-mail arrives. (HOORAY FOR KELZ!)
For the PHB, he should (by the same hypnotic effect) do a "Full Monty" when the big cheese walks in. Twice.
The MUA should be able to cause back dated confirmation messages from HR approving a 51 week paid vacation upon pressing a special key combination, unless it's the PHB pressing the keys, then it should cause an e-mail to be sent to HR from the PHB's account turning in notice.
Sorry, if you had a day like mine, you'd need a laugh about now...
Necessity is the plea for every infringement of human freedom. It is the argument of tyrants; it is the creed of slaves.
What is the problem with Maildir? I mean, if you're going to store email, you might as well use ReiserFS. I don't have any problems with big mailboxes, and the extra integrity of the email messages is worth the (unnoticeable) delay.
Used to get the mbox corrupted once in a while. Never had problems with Maildir.
Je ne parle pas francais.
The problem with using database formats is that you can't access them with vi. How many times has your mail client crashed attempting to read an email, but you still _need_ to get access to it? If it's in a database (proprietary or not), you're up the creek. If it's stored in a flat file, you at least have the option of using vi/emacs/grep to find and read the email, and then excise it.
This has happened to me in Netscape, Kmail, Outlook, Evolution, Eudora, etc. Every single one has had problems at one point or another. The best programs are the ones that are _truly_ open, and let you get at the mail from other directions.
Don't doubt the power of the text utilities in Unix. :)
Jason Pollock
Just edit your .procmailrc:
:0
/dev/null
And all your problems are solved.
Courier IMAP uses an enhanced Maildir format (original Maildir didn't support subfolders)
Je ne parle pas francais.
I actually found a nifty little package called dbmail which uses an SQL messagestore. I've been playing with such things at work since they wanted me to write them a web-based mail client, and I wanted something which would let me deal with a MySQL database on the web client, but also allow people to connect to it via IMAP or POP3.
Of course, the whole replication part of it might be a bit more difficult, but it could probably be arranged as well. I'm pretty sure there are tools in existence for doing replication on a MySQL database (of course, don't ask me the names of any of them...)
Dogma: Dead (mostly because your Karma ran it over)
I was working at a Governmental institution which was one of the first to run Linux servers already in 1993 (or was it 1994?).
I recall I once tried to send a 12 MB attachment with zipped GIS data to a colleague at the other end of the corridor via e-mail.
I subsequently had a long conversation with my superiors on dos and don'ts.
As I understand it most servers still don't like files larger than 1 or 2 MB. Why is that? Can't one use an ftp-undercover or something?
While I think unencoded binary stores of attachments would be great, I don't think it will work. I have yet to find a mail client that will *never* fail on decoding an attachment. Granted, many of these are messages from broken mail clients, but still I have been able to extract the attachments after a little work. Email virus scanners often have this same problem. Until we can guarantee that the software can extract ALL encoded documents, we can't assume that the original message won't be required at a later date. In the end we are talking about disk space. Last I checked, drives were cheap and prices were falling (I just picked up a 100GB WD drive for a few dollars north of $100). The database store and search may be fast, and should be fast at the expense of space. So do we really need (want) unencoded attachments?
Not with the mbox/maildir formats... After all, who in their right mind REALLY needs to keep 10,000 emails? That's absurd. Even if you received 20 emails a day (that were worth keeping -- and most email is definitely NOT worth keeping), to amass 10,000 emails in your little kingdom would take nearly two years! And realistically, how many emails really maintain their validity after 2 entire years???? I would venture very few... The problem here is not technology, it's people and their lazy habits....
on a REAL computer (albeit big iron), OS400 does exactly what they are proposing. Sure, the as400 has a bunch of smaller processors that operate the individual subsystems, but isn't this somewhat like what the video card industry is stepping towards in terms of GPUs. If your hard drive handled all of the hard drive tasks (meaning it only requests/sends data to the CPU) things would be a lot faster. Also a lot of proprietary hardware, but that's what standards are for. something like this is years away, but there is a limit on how bloated and stupid an OS can get. (sorry XP, but your 1000MB butt is too big for my taste.)
Shredding and compressing mail messages is almost always a bad idea. Essentially *nobody* does it correctly, and you can't reconstruct messages in their original byte-for-byte formats, which trashes digital signatures. You won't save much disk space, because real text doesn't take up enough space for anybody except a big ISP mailsystem to worry about, and binary attachments usually only compress well if they've been encoded in some non-8-bit-transparency format like base64 or uucode. About the only time it wins is when one person on your keep-mail-on-server mailsystem is sending an attachment to a bunch of people who can then all use the original, which is to say they should probably have stored the file on the web and mailed a URL. If you're going to do things like this, get yourself a compression-equipped filesystem and just store your raw mail messages there.
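A quick way to check the claim about base64 and compression is the sketch below. It assumes seeded pseudo-random bytes are a fair stand-in for an attachment that is already compressed (a zip or JPEG); the size and zlib level are arbitrary choices.

```python
import base64
import random
import zlib

rng = random.Random(0)  # fixed seed so every run sees the same bytes
# Stand-in for an already-compressed attachment (assumption: behaves like a zip/JPEG).
raw = bytes(rng.getrandbits(8) for _ in range(4096))
encoded = base64.encodebytes(raw)  # MIME-style base64 with line breaks

raw_packed = zlib.compress(raw, 9)
encoded_packed = zlib.compress(encoded, 9)

print("raw:    ", len(raw), "->", len(raw_packed))
print("base64: ", len(encoded), "->", len(encoded_packed))
```

The raw bytes do not shrink at all, while the base64 text shrinks back toward the raw size. In other words the compressor is mostly undoing the encoding overhead, which a compressing filesystem would do just as well without touching the message.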
Bill Stewart
New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
Very few systems give alternate functional views to different views. In order to send a letter to a section, I'd have to find out a name of a person in the section, send it to that person's email, and then hope that person is in.
What is needed is a parallel view where people can add functions to their role (at user level, for example).
So an email to recruitment@sample.com will get to the recruitment folder, and any of the recruitment officers can deal with it.
The only way around this is then to look at the issue of spam. If everyone has a "recruitment" address, then one could send out mail to "recruitment@[each domain]" a lot easier than getting the right name for each domain.
The idea is that a section inbox should be available to a section, and not an individual, and that people in the section should be given access when appropriate. A section would then retain the same name, regardless of the personnel making it up.
None of the mail systems that I see grasp this point.
One could have view sets, which are alternate tree structures, with the accounts at the leaf objects. One could be in the flat "name" tree, or access the personnel\recruitment intray, or whatever.
OS/2 - because choice is a terrible thing to waste.
Why waste CPU cycles on parsing a human-readable text file format such as XML?
There should be a standard byte-compiled representation of XML (CXML), which has been flattened into an easily readable data structure. It would be portable, with byte orders indicated in flags (or would just use network byte order, i.e., big-endian), and with fixed-length element start/end headers, and could be used in lieu of XML for machine-machine communications. If a human wants to inspect the data, CXML could be trivially converted to and from XML.
Why go to the trouble of running a parser for files that 99% of the time no human will ever look at?
As part of my job, I've written software to send out HTML mails to people (no, it's not spam). When these messages pass through an Exchange server, Exchange does us the "service" of creating a text version of the mail from the HTML. I guess this is so that people without HTML-capable mailers can have a readable version...
The problem is, we include our own text/plain version alongside the HTML (ain't multipart/alternative great?). Nicely formatted and everything. Instead of leaving our mail alone, Exchange rips out the text version and creates a new one from the HTML. The result is an ugly mess of URLs because we use some graphics in the HTML version. Our nicely formatted text version ends up in the bit bucket so that Exchange can dump its url-barf on people.
This is really stupid behaviour for an MTA. And for some reason, it's always CEOs of important clients who use text-based MUAs while sitting behind an MS Exchange server. They call us up asking which URL to click on.
This, combined with other mail-rewriting bogons, has led me to the conclusion that Exchange has no respect for the messages passing through it.
Current versions of PostgreSQL no longer have such limits (they're much higher, a single field can use up to 1GB ...).
"I love my job, but I hate talking to people like you" (Freddie Mercury)
Hi,
.pst file sizes on the order of 1GB are not unusual. Try opening something like that in pine! And this is not going to change for the better with the advent of unified messaging (1 information store for voice, email, video??).
Having worked for an IT sales organisation as a Systems Engineer, you quickly become used to corrupt email stores and sluggish mail systems, due mostly to the ignorance of the sales people when it comes to their mission critical application: Outlook.
Exchange is still a reasonable mail system even if the database is Access. Where I am working now (large network kit manufacturer, sssh!) a team is in the process of rolling out what will be a mission critical Exchange 2000 implementation. Their major gripe / concern: having to place the Exchange db's on several machines / EMC storage array. Despite the mission critical nature of this, no clustering is involved.
Oracle have really latched on to the storage / manageability / reliability problems inherent in large mail systems (and these are only going to get bigger and bigger) and have a great system based on Oracle 9i. Its benefits: huge scalability and easy clustering, management etc. What it doesn't have: the online calendaring and collaboration tools that Exchange has.
Notes is db backended as well, and being based on views, it can be very quick. It's cross platform, has a very rich client, and all the collaboration tools you could need. I can't understand why it seems to be losing ground to Exchange (at least that's my opinion of the Irish marketplace)
If anybody out there is considering developing a new mail system, the things I would look for are:
* RDBMS Back end (ala Oracle or notes, not exchange)
* LDAP integration for user management (ala Exchange)
* Bolt on interfaces, such as http / imap / pop3 / wap / voice, others can be added as necessary
* Support for clustering, replication
* Perhaps built in HSM, allowing users to migrate old email at the server rather than at the client. Never give the user the opportunity to store email locally, it will come back to bite you !!!
Just my few cent.
Jonathan Bourke
If you store your messages in an usenet server you get all kinds of neat features like auto expiration and tools that can put the binaries together and let the server deal with the file format.
Back when C-news was new, there was a system called "Notes" that kept usenet posts in a database. From what I can tell that became the ancestor of Lotus Notes at some point.
I still like the existing mbox format, primarily because it is in plain text. This makes it easy to manage with common text tools and editors, as well as making it portable across many different MUA's. I just cringe whenever somebody mentions storing mail in some kind of database; if users want to use an MUA that does that, that's fine, but if sendmail ever changes the mailbox format to something that I can't read with 'less', I'll stop upgrading.
The only thing I don't really like in the mbox format is the separation between messages. It's very similar to the mail header lines, and I'm not sure what would happen if a message happened to contain a line with the same format. OTOH, I can't think of another separator that would work better while keeping the file in 7-bit ASCII.
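One answer to the "line with the same format" worry is the quoting rule used by the mboxrd variant of the format: on write, any body line that starts with "From " (after zero or more ">") gets one more ">" in front, and on read one ">" is stripped again. The sketch below shows only that rule. The function names are made up for illustration, and other mbox variants quote differently or not at all.

```python
import re

# A body line that could be mistaken for a separator, or an already quoted one.
_FROM_LINE = re.compile(rb"^(>*From )", re.MULTILINE)
# A quoted line as produced by quote_body(): at least one '>' before "From ".
_QUOTED_FROM_LINE = re.compile(rb"^>(>*From )", re.MULTILINE)


def quote_body(body: bytes) -> bytes:
    """Prefix '>' to every line matching ^>*From  so no body line starts with 'From '."""
    return _FROM_LINE.sub(rb">\1", body)


def unquote_body(body: bytes) -> bytes:
    """Undo quote_body() by removing exactly one leading '>' from quoted lines."""
    return _QUOTED_FROM_LINE.sub(rb"\1", body)


body = b"Hi,\nFrom here on it gets tricky.\n>From was already quoted once.\nBye\n"
stored = quote_body(body)
print(stored.decode())
```

Because already-quoted lines gain a further ">", the round trip is lossless and the file stays plain 7-bit text.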
I also use Eudora as my MUA, one of the reasons being that all mailboxes it creates are also in mbox format. One thing I like about Eudora is that it stores almost all metadata about each message in a separate file, including indexes to each message for faster access. But the down side to that is there have been several times when Eudora crashed (it is a M$-Windows program, after all) and the metadata got out of sync with the mailbox, so the tags were often lost. I think a better solution would be to store metadata as extra header lines in the message (I think pine does this) -- although I wouldn't want *too* many extra headers cluttering the mail -- and have the MUA use a separate file just for indexing and re-sorting.
I also like the fact that Eudora extracts attachments from messages so that the mbox file doesn't get too big. However, placing all attachments in a single directory creates its own cluttered mess, plus there are potential problems when two or more messages happen to have attachments with the same file name, and it's difficult to keep track of which mailbox each attachment came from. Perhaps a partial solution to this would be to have a separate attachment directory for each mailbox, and each attachment's filename would be modified to indicate which message it came from (such as a date prefix or message ID suffix). The downside is that it may not work on old systems with a short maximum filename length (POSIX programs must not depend on more than 14 chars). Attachment separation should also be limited to MUAs; MTAs ought to keep attachments inline so that user agents can do whatever they want with them.
Use the Qmail native format, 1 file per message. Let the filesystem do its job. Qmail is a great alternative MTA to Sendmail: fast, secure (no exploits so far IIRC) and easy to configure.
I stand corrected. :^)
(Guess I could've/should've looked up Courier-IMAP before responding!)
Quite right. Just try it. You might be a bit surprised by the results.
News for Nerds. Stuff that Matters? Like hell.
Mail is crying out to be stored in XML.
The greatest benefit would be the ability to specify accurately the structure of the hierarchical document (which MIME mail is) using DTD, schema, etc. This would lead to greater standardisation across the internet.
There are a great number of high quality tools available for the manipulation, transformation and parsing of XML.
There are a number of protocols already available for the transmission of XML data.
And mail is text based.
A couple things:
:-)
1. Evolution is NOT "Basically mbox with database features". It can use Maildir or MH as the backend (and you can write your own plugin to extend this if you like).
2. Evolution's body indexing and summary files are extremely fast and efficient, about the best you'll get. I hear MySQL has text indexing capabilities that are extremely fast, but I'm not sure if they are faster than Evolution's indexer or not. Might be interesting to check this out.
3.
> But the thing that bugs me most is disk space. Typical inboxes are
> made of 5% to 10% of Text including Headers and HTML. The rest are
> BASE64- (or UU-) encoded pictures, word documents, zip archives and so
> on. The problem here is the encoding which wastes considerable amounts
> of space (at least one third).
It's theoretically possible, if you wrote your own Evolution storage plugin, to change the Content-Transfer-Encoding header value of binary attachments to "binary" (and text attachments to "8bit") before writing the message out to disk (or wherever) thus magically making it so that you no longer save the encoded text of the attachments but rather in-line binary data content. (Yes, it's as easy as setting an enum value in the CamelMimePart structure).
However, you have to be aware of the consequences of this. Most importantly, you will not be able to validate any of your PGP/MIME or S/MIME signed messages as according to the RFCs for these types, the signed MIME parts MUST be treated as opaque (meaning that you may not modify them in any way).
Now on to your ideas...
> One file per Mailbox-folder, allowing multiple folders per user.
> Should those files reside in one central location or in users
> Homedirs?
How is this different from mbox? (btw, CVS Evolution can handle mbox files and directory trees in external locations - ie, not within the ~/evolution directory).
> Compression: Should messages be broken into pieces and the
> MIME-attachments stored separately (thus searching of the text parts
> would still be possible without decompressing the whole file)?
If you break apart the MIME parts, you run into the same problem I described above about not being able to verify signatures.
However... if you took a normal mbox and gzipped it, you would certainly save space (at the expense of speed). I've been thinking about writing a CamelMimeFilterGzip class for gzip compressing/decompressing streams which would allow Evolution to read and write to gzipped mbox files for example.
Once the class is written (which should be fairly simple), allowing Evolution to read gzipped mboxes should be as simple as doing:
camel_stream_filter_add (MboxStream, GzipFilter);
...before feeding 'MboxStream' to the MIME parser.
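Outside of Camel, the same space-for-speed trade can be tried with nothing but Python's standard library. This is only a sketch under stated assumptions: the message contents and file names are made up, and unlike a stream filter it decompresses the whole mailbox to a scratch copy before parsing, because `mailbox.mbox` wants a real file.

```python
import gzip
import mailbox
import os
import tempfile
from email.message import EmailMessage


def make_message(n):
    """Build a small, repetitive test message (made-up addresses and text)."""
    msg = EmailMessage()
    msg["From"] = "alice@example.com"
    msg["To"] = "bob@example.com"
    msg["Subject"] = f"Test message {n}"
    msg.set_content("The same sort of text turns up in most mail.\n" * 20)
    return msg


workdir = tempfile.mkdtemp()
mbox_path = os.path.join(workdir, "inbox")

# Write an ordinary mbox file.
box = mailbox.mbox(mbox_path)
for n in range(50):
    box.add(make_message(n))
box.flush()
box.close()

# Compress it as a whole.
gz_path = mbox_path + ".gz"
with open(mbox_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
    dst.write(src.read())

# Read it back: decompress to a scratch file, then parse as a normal mbox.
restored_path = os.path.join(workdir, "inbox.restored")
with gzip.open(gz_path, "rb") as src, open(restored_path, "wb") as dst:
    dst.write(src.read())
restored = mailbox.mbox(restored_path)
subjects = [m["Subject"] for m in restored]
restored.close()

plain_size = os.path.getsize(mbox_path)
gz_size = os.path.getsize(gz_path)
print(plain_size, gz_size, len(subjects))
```

The saving on text-heavy mail is large, but every append or delete now means rewriting the compressed file, which is the speed cost mentioned above.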
> File format: gdbm, Sleepycat db? Something new?
Please not Sleepycat. If you are so sure that a generic database backend will be better than what Evolution's got, at least have the sense to use MySQL or PostgreSQL.
I'm personally against using a generic database as storage, and here's why:
1. The average user does not have an SQL database installed on their desktop systems, and so this is a completely ridiculous dependency for them. If you think library dependencies are bad, just wait till you have to go installing, configuring, and maintaining a multi-user database running on your system. This may be fine for a company solution, but not the average end-user.
2. I'm not too familiar with MySQL or PostgreSQL, but I recall there being problems with mailers that use SQL database backends that tried to store the content of the messages as part of the table (due to them making the size of the table too small or whatever). If you can set the size to be "infinite", then I guess that's not a problem.
If your plan is just to have the database index the folder and actually store the contents as separate files, then you've instantly gained nothing over Maildir except that now you have a hefty database that you have to maintain and very little to no speed improvements (especially if you have a well designed/implemented summary index like Evolution does).
The only improvements you might gain here is body indexing? As I said earlier, MySQL supposedly has a REALLY good text indexer and so it might be a little faster than Evolution's. I'm really not sure on the comparison here.
> Should the security model allow users to directly access their
> files, grep them, copy them around?
Is there a reason NOT to? I don't see one. It's their mail.
> Shared folders, virtual domains?
This doesn't really have anything to do with folder formats and everything to do with features of the client itself.
(Evolution can do this).
> Unicode support in folder names? Imap message-IDs, flags, useragent
> specific state-information?
Unicode support in folder names I'd say is a pretty important feature. I'm not sure what you mean by "Imap message-IDs". Do you mean UIDs? Evolution, for example, has a UID assigned for each message whether it be in an mbox folder, Maildir folder, MH folder, or IMAP folder. So this isn't necessarily dependent on folder format (though it could be if you used a database backend for example, you might want a UID in the table).
I don't feel that UIDs are a must though, but I would suggest them. They are definitely useful, especially for folders that can be accessed by multiple clients at once.
Flags are good. I'd go so far as to say a MUST have.
As far as user-agent specific state-information, it'd be nice to not need it. But if the client needs to keep its own info, it'd be nice to be able to map the info to UIDs and keep its own state file somewhere else (not necessarily alongside of the mail storage).
For example, IMAP doesn't have any means for the client to store state information on it, but that's perfectly fine. If a client chooses to have its own state, then it can save it locally.
It would be nice if the storage could handle user-defined flags/tags though. This would allow the client to extend the native features of the format (Flag-for-Followup, message colouring, etc).
> How would MTAs deliver mail? How would clients access? File-locking
> (NFS)?
This is one reason to just stick with what's available.
File locking is a MUST have (or a scheme to make it not needed, such as Maildir).
--
You know, I have one simple request...and that is to have messages with freakin' laser beams attached to their headers. Now evidently my MIME specification informs me that that can't be done. Uh, can you remind me what I pay you people for? Honestly, throw me a bone here. What do we have?
I have for some time used a custom-made MTA and indexed the incoming mails (headers and first message body part) into a MySQL database. Attachments are compressed and stored within the "regular filesystem". The whole kludge is then interfaced to IMAP. User authentication is also done via MySQL, thus making it unnecessary to create "real users". The solution has lasted without problems for years now already. MySQL in general works like a dream; I have never had any problems with it.
It is good (and fast) for some purposes, which I am not going to discuss here, everyone is probably very well equipped to figure out the plusses and minuses of this way of doing it.
But unfortunately, the Notes client does not. We still need to dick around with wine to access the corporate Notes server. If anybody from IBM (who likes to show their commitment to Linux...) is listening: are there any plans for a native Linux Notes client? If so: when? If not: why not?
Say no to software patents.
The future is in storing emails into a database.
What the GNU (or other open source) people need to do is establish an API, and then hooks for the mail app and/or database. With the right API it should be fairly easy to use MySQL, PostgreSQL, whatever as the database. The client could be anything from mutt to Pegasus or even Outlook.
POP is nice.
IMAP is nicer
DbSQLMail_API will rule!
Users will be able to easily categorize mail, filter, archive, etc.
The current problem is that each mail program (ie. Outlook and even Evolution) is reinventing the wheel again, and again. The programmers cannot see the forest for the trees.
Look beyond the trees and you will see that the current ways are not viable for storing email for the rest of our lives.
This is kind of a little-known fact, but Exchange 2000 implements the IETF's WebDAV protocol, meaning you can access the whole database (Email, Personal Contacts, Calendar, shared folders, personal folders, etc.) from any standards-compliant WebDAV client library. And as to file formats, the files the server spits out are in standard formats: MIME, vCARD, iCAL, etc.
Authentication and group contacts is LDAP, WebDAV uses HTTP auth. All standards-compliant.
So you can write a fairly complete integration to Exchange using commonly available open source libraries and tools.
It's not a bad mail server/groupware platform, really.
ask the FBI
Be wary of any facts that confirm your opinion.
Plain and simple. Switch from mail to Usenet. Maildir-like structure, but with a .overview (XOVER) file to help out with indexing.
Storage is another problem, though... but Usenet messages can be sidetracked a bit with the encoding.
--
# Canmephians for a better Linux Kernel
$Stalag99{"URL"}="http://stalag99.net";
Easy solution: Build a list of "VIP" users who will get a text-only version. Or who will get the text and the HTML version in 2 separate mails.
Say no to software patents.
Messages as individual files, parse headers into XFS attributes, implement a few indexes on those attributes... Cool.
Actually, the mailbox does use maildirs. I specifically installed Courier-IMAP because of its maildir support.
However, as the poster above (or in between, whatever) points out, Courier-IMAP may have a nonstandard Maildir format...
---
Open Source Shirts
Life With Qmail
Building a Linux Qmail Toaster
Same thing, but with FreeBSD (more scalable, in my experience)
have fun
Remember that what's inside of you doesn't matter because nobody can see it.
Maildirs are:
- Crash proof: an interrupted delivery cannot cause folder-wide corruption or the delivery of an incomplete message.
- Lockless: all Maildir operations (deliver, delete, read, etc) can be performed simultaneously by multiple processes on multiple machines without the need for any sort of file locking.
What does this mean? Reliability. That's why you use Maildirs. It may be slightly slower than some other formats (although I've never noticed a difference) and it certainly consumes more inodes. But it's way more reliable. You never have to worry about someone's mail program crashing and leaving the mail folder in an inconsistent state. Maildirs don't have an inconsistent state. And when you're delivering over NFS you don't have to worry about whether or not file locking is going to work right. Maildirs don't need locking.
In email, reliability is everything. People may grumble a bit if they think their email isn't arriving fast enough. No big deal. But there is nothing more terrifying than a user with corrupted or (gasp) missing email. While using Maildirs won't solve all of your email problems, they are definitely a step in the right direction.
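The crash-proof and lockless properties above come from how a delivery is done: the message is written in full under tmp/ and only then moved into new/, so readers never see a half-written file. A minimal sketch of that step follows. The unique-name scheme is simplified and the function names are hypothetical; a real delivery agent follows the Maildir specification's naming rules, and Python's own `mailbox.Maildir` class could be used instead.

```python
import os
import socket
import tempfile
import time

_delivery_count = 0


def unique_name():
    """Simplified unique name: time, pid, per-process counter, host (hypothetical scheme)."""
    global _delivery_count
    _delivery_count += 1
    host = socket.gethostname().replace("/", "_").replace(":", "_")
    return f"{int(time.time())}.P{os.getpid()}Q{_delivery_count}.{host}"


def deliver(maildir, message_bytes):
    """Write the message under tmp/, flush it to disk, then rename it into new/."""
    for sub in ("tmp", "new", "cur"):
        os.makedirs(os.path.join(maildir, sub), exist_ok=True)
    name = unique_name()
    tmp_path = os.path.join(maildir, "tmp", name)
    new_path = os.path.join(maildir, "new", name)

    # O_EXCL makes the create fail rather than overwrite if the name is somehow taken.
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "wb") as fh:
        fh.write(message_bytes)
        fh.flush()
        os.fsync(fh.fileno())

    # The rename is atomic within one filesystem: the message appears in new/ whole or not at all.
    os.rename(tmp_path, new_path)
    return new_path


demo_maildir = tempfile.mkdtemp()
demo_message = b"From: a@example.com\nSubject: hi\n\nhello\n"
delivered_path = deliver(demo_maildir, demo_message)
print(delivered_path)
```

If the process dies before the rename, the only leftover is a stray file in tmp/, which can be cleaned up later; nothing in new/ or cur/ is ever touched by a delivery in progress.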
Exchange is actually a decent way to store e-mail on a server. But if you're gonna look at PC-based groupware solutions, DON'T use Exchange because it's loaded with holes. Its monolithic, proprietary JET-based data format is prone to corruption (I've seen this happen several times. :) They're trying to get it to work on SQL server, but I don't like MS SQL server that much, either.
I don't have a lot of experience with Lotus Notes (though I hear it's good... :) but I can tell you GroupWise solved this problem (on UNIX, about ten years ago, when it was WordPerfect Office) with a proprietary database broken into different types of individual files. GroupWise these days consists of a few important files for each user:
1) a smallish userxyzy.db (where xyzy is the unique user identifier, so you can change their e-mail address the items aren't duplicated; the pointers in the userxyzy.db files are updated to point to the shared items.
3) an unlimited number of special-purpose directories (FD01...FDXX) that hold items that are bigger than a certain size (I think it's 4k or 8k?)
All of the database files are encrypted & compressed (algorithms licensed from Stac). The connection between the clients & server is encrypted & compressed. You can also use POP/IMAP (+ POP+SSL/IMAP+SSL) to access a GroupWise post office (and a web-based interface written in java servlets). But I'm drifting off the topic... :) (Did I mention the whole database gets constantly reindexed, so you can find anything in seconds? Exchange does not do this on the back end without third-party software. Of course, it has no document management, either... but I digress. :)
anyhow, I always thought this setup was a really nice, well thought-out way to maximize performance for a large mail system without wasting lots of space (or inodes)
Perhaps some of this info could be adapted for a UNIX-based open-source e-mail solution? (of course it seems silly since GroupWise is already available for UNIX :) I'm still waiting for an open-source package that does everything GroupWise does... I think it'll be a while though. :(
...as soon as you've visited MeetingMaker's web site.
...
Real-time scheduling, planning, organising. Scalable, cross-platform, web-enabled.
chomp, chomp,
Alen, your experiences with MS-Exchanges are so many worlds of difference away from mine that I nearly suspect that you've written a troll. Rebooting a mission critical service like a mail server during working hours is unsatisfactory. If other mission critical services like file and print sharing are also disabled during that reboot, then it's time to look for a more robust product.
I have worked closely with three shops in the previous three years that used Microsoft Exchange. Each had at least 3 full time equivalents of MCSEs to babysit their Exchange servers, probably more if you count overtime. This is not counting the occasional high priced consultant. None of these shops could keep Exchange running for a full week. Nor could they keep it from losing mail (when I measured, it was 10-15%). Nor could they get it to communicate well with other mail servers. Nor could they keep it from getting wiped out once every three months by MSTDs (especially worms and viruses).
In contrast, Novell servers run years at a time unattended (nearly every consultant has at least two such anecdotes of their own) and many UNIX-based MTAs need only a few hours of non-hardware maintenance per year, when set up tight. I guess running MS-Exchange is a new status thing to flaunt resources, like having a tubercular wife was during the Victorian era.
Needless to say, the management's support was/is a real PITA for anyone doing work via e-mail with people outside of the house's MS-Intranet. In one case it even delayed publishing a book by several weeks. In-house use of Exchange was fine -- when it was down for you, it was down for everyone else, so it was a nice time out and a chance to go have coffee with the others. When put to the test, file sharing couldn't, wouldn't, didn't function often enough to be useful either. For file sharing, those without access to a Novell or Unix file server used sneaker net or mailed attachments. Yes, Exchange does look good in the 4-color glossy marketing brochure, but that's where it ends and reality sets in.
Puh.
Back to mail databases. RFC 2822, Internet Message Format, specifies the general structure of a message. This can be oversimplified as a header with its standard and non-standard fields and one or more message bodies. RFC 2046 specifies multipart bodies. These structures do seem very well suited to a relational database.
Beta is broken and the link to classic doesn't work. Stop wasting our time or there won't be anybody left here.
Content-type: multipart/folder
-- Pretty easy.
Oh dear, another file format debate. I'm glad there was a library suggestion though ... that allows us to change our mind when we do it wrong the first time ;)
First, you need to consider the possibility of moving the mailbox. To a different computer, or a different platform. This means it must be easy to access in any environment, and the tools must be portable.
This doesn't completely rule out a database solution (like MySQL), but it certainly makes it less-than-ideal.
Second, having used many mailers which separate out attachments ... Please Don't Do It! You can't easily move your mailbox, because there are a host of associated attachment files. There is ALWAYS a synchronisation issue between attachments and messages, so you end up scanning and cleaning out the attachment folder every so often to prevent dead files from accumulating.
Compression is nifty, but isn't really important. Disk space is seldom a concern these days, and the really big stuff (binaries) is often already compressed or doesn't compress well.
The real issue with most mailbox formats is how do you deal with the problem of removing dead space from the mailbox? Some programs just leave it there until you hit "compact", which is wasteful and confuses users. Others rewrite the entire mailbox every time, which causes the software to "hang" for a while on shutdown.
The best suggestion I can come up with off the top of my head is this: One file per mailbox folder, and that file is its own filesystem. The "root node" contains a group of summaries (from, to, subject, date, etc) and node links. Other nodes are chained to contain the message and attachments.
Handling attachments: attachments are separated out and stored as binary in the mailbox. This conserves space but keeps the attachment with the message.
Compacting: is avoided. When a mail is deleted, it is merely flagged in the root node (index). So each mailbox has its own deleted items folder, so to speak. When the deleted items folder is emptied, the index is rewritten and nodes freed - every node not at the end of the file is overwritten with a node from the end of the file (and appropriate reindexing done), so the file is automatically compacted.
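A minimal sketch of that "fill holes from the end" step, in Python. It assumes fixed-size nodes and models the file as a plain list; the function name `compact` and the `remap` table (old node position to new position, which the index rewrite would consume) are illustrative choices, not part of any existing mail format.

```python
def compact(nodes, live):
    """Fill freed slots with live nodes taken from the end of the file.

    nodes: list of node payloads (a stand-in for fixed-size blocks on disk).
    live:  set of positions still referenced by the index.
    Returns (compacted_nodes, remap), where remap maps old position -> new
    position for every node that had to move.
    """
    nodes = list(nodes)
    used = [i in live for i in range(len(nodes))]
    remap = {}
    lo, hi = 0, len(nodes) - 1
    while True:
        while lo <= hi and used[lo]:
            lo += 1                      # skip live nodes at the front
        while hi >= lo and not used[hi]:
            hi -= 1                      # drop freed nodes at the tail
        if lo >= hi:
            break
        nodes[lo] = nodes[hi]            # move the last live node into the hole
        used[lo], used[hi] = True, False
        remap[hi] = lo
    return nodes[:hi + 1], remap         # truncate the file after the last live node
```

Only nodes that sit behind a hole are moved, so the cost is proportional to the amount deleted rather than to the size of the mailbox.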
Ideally the file needs some sort of transaction logging area to ensure its integrity at all times.
Shared access to files is best handled through a library or a service. File locking is notoriously prone to bugs and security issues, and avoiding multiple implementations in different mail clients would be beneficial.
i-name =twylite [http://public.xdi.org/=twylite], see idcommons.net
For Pete's sake, leave mail alone. If I can't fix it in less than 20 minutes with grep and perl, I don't want to know about it.
Divide mail into 20-30 logical "folders" (files), use procmail to help sort/scan/unspam, do IMAP to get to it from Win machines, archive mail out of your working files once it gets a year old, and you're all set. Strive to keep your inbox empty (you need a proper "action" orientation with your mail folders to accommodate this). No big deal.
Lotus Notes used the Lotus database scheme to store email, among other things. I never knew anybody who really made full use of the groupware functionality of Notes, though (I'm sure somebody somewhere did). I would hope, based on my personal experience, that whatever the nextgen email tools are, they are NOT like Notes.
Personally, I would love to see a database message store that would be compatible with IMAP access, especially with emails coming to my phone, my handheld, my laptop, my main machine, and various other places. With replication and virtual domains support.
The mbox+index scheme seems to be a fairly decent second choice, but when someone says "but I like to be able to use normal text tools like grep" my answer would be "why don't you use other normal tools like SQL queries to do the same thing, and then some!"
If I received Slashdot Poll postings as e-mail messages, I might use the following to find recent common whining about lame slashdot poll choices (and faster too!):
Go to the link above, or look for MH or NMH in rpmfind.net or your local ports tree.
-ez
MP3.com has had many terabytes of disk on reiserfs for over 18 months with great success. Again, hard data that reiserfs works. And it works well!
While trolling takes on many forms, many of them merely being nuisances (crapflooding, goat links, page widening, etc) you'll find the vast majority of trolling occurring in posts similar to posts such as your original. On Slashdot, well-thought out and reasoned posts have become indistinguishable from trolls. This is made all the more obvious by the dimness of the moderators who would mod you down -1 in a heartbeat if not for the length of your post (as if that were the measure of an argument).
I too am a troll, much along the lines as you (though perhaps you don't realize yourself as such yet). I used to post, IMO, well argued posts and was consistently modded down by the Slashdot groupthink moderators. This is not to say that I didn't eventually hit the karma cap, but that along the way it was painfully obvious that my pro-Windows, anti-GPL opinion was not tolerated here.
Upon the realization of that I had my epiphany that pearls are not to be given to swine (this seems to be the same satori experience you are having now). Pigs deserve slop, and now that is all they get from me.
In any case, I'm not one of the nuisance trolls as I listed above, but one of the provocative trolls such as yourself (please do not take offense, this is not an insult as it may first appear). The Slashdot feeding frenzy that follows any post that attempts to support Microsoft or attack Linux or posit Creationism is a wondrous thing to watch, much like a thunderstorm or a supernova. The one difference is that you, the troll, have total control over the experience, much like a god who views his masterpiece from another dimension.
This is not to say that Slashdot is void of intellectual content. On the contrary, you'll find quite a bit of interesting information in the Science and Developer sections. You will find *no* intellectual content in the YRO section.
It's a travesty that a good idea like Slashdot, allowing users to create their own content, has succumbed to the mindless pursuit of mental masturbation of FSF zealots.
So while this may be the end of your Slashdot infancy, I think you will find your maturation into a Slashdot provocateur quite fulfilling and fun. Isn't that why you joined the technology revolution in the first place?
As I would like to have access to my mails in 10+ years, I vote for the usage of plain files, not some proprietary format that has to be converted.
If access speed is problem, cache some of the files' contents in indices.
Disk space is surely no problem...
Personally, I like the idea of Maildir, every e-mail is a file. It's very easy for apps to work with Maildir. Also, everyone above is talking about databases, which probably is a fast and nice solution. But communication with the database probably will be kind of messy. Why not just 'map' the mailbox on the filesystem somewhere and use whatever underlying system (database, plain files) you want for the actual storage. Exactly what plan9 is doing. With that, every mail could also easily be split up in headers, body, attachments etc. (I seem to remember they already do that.)
This way, 'transparent compression' can be done, the 'file format' is very easy to use, people can still grep their mail (even copy attachments using normal commands), flags could also be mapped on the filesystem.
For delivering mail a probably similar method can be used, perhaps like Maildir, which does fine over NFS.
Ah, it's the entire plan9 'map it on the filesystem'-idea that just seems great to me, why don't we have it in the average BSD or linux?
There is no need to toy around with the mail as-is. It's a little like an IP packet, it doesn't matter what's in it, but the essential thing is that it has destination and source addresses and it travels in the net. No technical solution will _ever_ overcome the fallacies with current emails, because the current email is as unrestrictive as IP packets.
./ said it really well once, he said that the best solution would be "Cheap plasma handguns and justice for us all"
Take spam for example. The problem will always be present theoretically when you want to receive mail also from people you've never received mail from before and/or haven't given your public crypto key, for example. Another side of the aspect is when _we_ allow people to send source address spoofed spam.
The problems with email are people, as with almost every damn problem in the IT sector, it's always us. People bend towards stricter rulesets, to avoid abuse, which in many cases is not the way to go, let alone the solution to the root of the problem. Somebody here in
1 Earth is warming, 2 It's us, 3 it's royally bad, 4 we need to take action NOW
If you're looking for a great mailserver for 1000 - 500 000 people, try Cyrus Imapd from CMU. It's fast, secure and stable.
Why not use a real database for this, like MySQL? The advantages are obvious. You can search your email using plain SQL statements. Storage is handled by the database implementation, so you don't have to care about that. Performance can be improved in the standard ways, by having indexes and perhaps lookup tables and/or columns.
Also, you can have multiple MUA's use the same mailbox since databases normally handle concurrency already.
Eeeeh?
My System Folder is no more than 10 directories deep (just a cursory inspection, but I don't think I missed anything), with Extensions and Preferences having HUGE breadth. So why, again, would HFS be designed for depth based on the System Folder?
Not that it isn't, just why did they do it that way, since the System Folder explanation doesn't make sense to me.
-knots
Anarchy$ dd if=/dev/random of=~/.signature bs=120 count=1
Don't store your mail on an NFS server. NFS is bad (use AFS instead, but not for mail either), and it's especially bad for storing mail. Use IMAP instead. IMAP is a secure file system designed for storing mail, and that's what you really need. Also, all relevant mail clients support IMAP, and for those that don't, Cyrus IMAPd contains a POPd as well.
I lost the plot half way through this, but here's some food for thought anyway. Now I should get back to work ...
... and you don't want 1-tier applications), so it doesn't matter what format you use under the belt - you can choose the format which best suits what you're trying to do.
Z
I think that this is looking for the solution to a problem that doesn't really exist in the first place. Although I guess it depends somewhat on what you define as 'Unix mail'.
I'm a developer on Evolution, and primarily on Camel, Evolution's email library. I'm not sure I'd rave about it (although I think Camel is a mostly beautiful piece of code ;), but it works reasonably well, and we've had a chance to try and deal with users with lots of email.
What IS 'Unix mail'?
I would define Unix mail as mail (rfc822 format) downloaded and stored locally on a per-user basis. IMAP, Exchange, and other remote protocols are very different beasts.
Why are DBMS's not suitable for 'Unix mail'?
Once you have a remote server you have to do things differently than if you have local access. Using a DBMS, and having a trained administrator to manage it are practical considerations, as are the benefits you might get from this configuration. These solutions don't really make sense for standalone users. They shouldn't need to install and manage databases, complex backup procedures, and so forth, just to read their email.
i.e. rdbms's are:
hard to setup
hard to maintain
another major point of failure
If however, I was to design a multi-user groupware server, then a DBMS would come into serious consideration - at the backend at least. It allows you to do things like easily consolidate authentication outside of the operating system (the idea of having a 'shell account' to access mail is somewhat outdated), it allows you to save space by storing common data, like attachments and email content in a single place, and redirecting it to multiple recipients (which is a common practice within organisations). It may be practical to use a mixture, a RDBMS to store textual parts or indices to data stored in a more conventional filesystem.
But even with a RDBMS backend, I would personally probably still stick to IMAP to serve it to actual clients. The IMAP protocol is a bit heavy, but not really that bad, and it serves email, I don't think there's really any need to reinvent the wheel here.
So
If you define unix mail as I have, and separate it from a *mail server*, then you rule out full blown RDBMS's, and are left with:
single file database
multiple file database
I'm not even going to mention XML because I think it is the single most stupid idea anyone's come up with. It is completely unsuitable for this purpose.
And well, there's really no reason not to use MIME to store the messages. MIME already does everything you can possibly do with email (since, uh, it is how the email *will* be sent), any client will already have to deal with it, and mime decoding is for the most part really quite simple and fast anyway. Translating the mime format into some other storage format really doesn't make sense.
single file databases
mbox
Mbox is a single file database. It's just that everyone that uses it generally writes their own access code. This is where problems with 'locking' come about, either because the underlying filesystem doesn't support it properly (e.g. some nfs implementations), or everyone's clients don't use the same locking mechanism. This is really just an implementation issue anyway. There would be nothing to stop someone writing a common 'mbox.db' library that stored everything in completely compatible mbox files, which took all the work out of it, and then you'd have an mbox DBMS.
mbox scales ok, without any caching of header information it handles in the order of 2K messages in an interactive timescale, and quite a lot more if you don't mind some short delays (i.e. in the order of the time it takes mozilla to start up).
Appending and reading is quick, and reliable - assuming the filesystem works, which is a pretty safe assumption to make. This is assuming the mailbox is first summarised at first opening, otherwise looking up messages can be slow, because you have to scan the whole file first.
The only operation that is slow is expunging messages, and at worst case isn't really any slower than copying a whole file across to another file.
The only other issue is agreement on the 'standard' for what constitutes an mbox file. For example, Solaris uses and honours the 'Content-Length' header, and thus it does not translate any lines beginning with "From " into the conventional ">From ". Some mail clients translate "(>*)From " into ">\1From " (using sed syntax) and vice versa, others do not. There is no standard, just some conventions, some of which aren't easy to determine either.
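For illustration, the reversible "(>*)From " convention mentioned above can be written as a pair of regular expressions. This is only a sketch in Python of that one convention; the function names are made up here, and a client that honours Content-Length instead would skip the quoting entirely.

```python
import re

# Lines that start with zero or more '>' followed by "From " get one more '>'.
_QUOTE = re.compile(rb"^(>*)From ", re.MULTILINE)
# Reading back: strip exactly one leading '>' from such lines.
_UNQUOTE = re.compile(rb"^>(>*)From ", re.MULTILINE)


def quote_from(body: bytes) -> bytes:
    """Escape a message body before appending it to an mbox file."""
    return _QUOTE.sub(rb">\1From ", body)


def unquote_from(body: bytes) -> bytes:
    """Undo quote_from() when reading a message back out of the mbox."""
    return _UNQUOTE.sub(rb"\1From ", body)
```

Because already-quoted lines gain an extra ">" as well, unquoting restores the body exactly, which the simpler "From " to ">From " rule cannot guarantee.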
Because you need to keep the whole index in memory at once, this can become expensive, but you could use a secondary database as an index into the real file. But eventually you hit a point where the cost of expunging does get too expensive. You could just archive the mail regularly, or use a format like maildir instead.
gdbm/db/etc
db files wrap the single file in a common api that handles all of the locking issues and access issues for you. Some have different features, e.g. querying capability, logging and transactions, etc.
We've never tried to use db for this purpose, more just because we didn't think it was worth it. All you really get with a minimal implementation is the ability to store and retrieve a blob of data using a single key. Writing is fairly slow because the database has to manage more details for you (locking, allocating blocks, unlocking, etc). You could use multiple db files as indices to perform multiple-key searches, but they are quite slow at creating them (we tried using db for the content indices and it was way too slow).
i.e. even if you store the data in a db file, which gives you a slight benefit of inbuilt referential integrity, you still need to provide additional indices to actually be able to use it in any useful way. Evolution suffers this problem with the addressbook which stores vCards in db records.
Most db libraries (all?) also don't provide any mechanism to stream data. You either get the whole lot into memory, or you get none of it. So for large messages you're limited by memory (well, evolution is anyway, but it doesn't have to be). Yes, memory is cheap, but it is still a consideration, and it would certainly rule out a simple database in a multi-user environment.
db files are also slower than native files, especially for large objects. You're mapping an arbitrarily sized chunk of data to some 'database blocks', which are then stored in an arbitrarily sized 'database file' which the operating system is then mapping to its 'filesystem blocks'.
multifile solutions
Well I guess this comes down to mh and maildir. mh isn't really suitable for anything, because of its just plain bad design and lack of defined semantics. There's no way to guarantee anything about its operation.
maildir - i like. It moves the scourge of trying to implement a reliable, scalable, multiple access database almost entirely into the operating system layer. Operating systems already do this very well - they manage hundreds of thousands of files randomly written across your disks, without skipping a beat.
No operation requires more than a single message size of data, and the operating system already indexes the message, via its filename. Sure, ext2 doesn't do such a swell job with long directories, but that can be addressed (and the same problem can be addressed on just about any platform). For 'free' you get concurrent multiple-reader, multiple-writer database access, without any of the considerable problems you have to solve to implement it otherwise.
The maildir 'protocol' is simple, reliable, and it works.
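As a rough sketch of why delivery needs no locking: the message is written under a unique name in tmp/ and only moved into new/ once it is complete, so readers never see a partial file. The Python below is an approximation under stated assumptions, not a full implementation: the `deliver` name and the unique-name scheme (time, pid, counter, hostname) are illustrative, and it uses rename where the original maildir description uses link followed by unlink.

```python
import itertools
import os
import socket
import time

_counter = itertools.count()  # distinguishes deliveries made within one second


def deliver(maildir: str, message: bytes) -> str:
    """Deliver one message into a maildir and return its final path."""
    for sub in ("tmp", "new", "cur"):
        os.makedirs(os.path.join(maildir, sub), exist_ok=True)
    host = socket.gethostname().replace("/", "_")
    unique = "%d.P%dQ%d.%s" % (time.time(), os.getpid(), next(_counter), host)
    tmp_path = os.path.join(maildir, "tmp", unique)
    new_path = os.path.join(maildir, "new", unique)
    # O_EXCL makes the create fail rather than overwrite an existing file.
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "wb") as f:
        f.write(message)
        f.flush()
        os.fsync(f.fileno())          # make sure the data is on disk first
    os.rename(tmp_path, new_path)     # atomic within one filesystem
    return new_path
```

A mail client then only has to look in new/ (and cur/) and can build whatever indices it likes afterwards.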
Again, it can easily be augmented by a client with additional indices, but for things like delivery agents who don't care about existing email, they don't need to suffer that overhead at all.
Some other comments specific to the question:
Compression. Personally I don't see the point. But a maildir-like structure would fit well with compression. Flat files would be the worst (e.g. mbox), and block-file formats (like db files) would also work well with compression. The good thing about email is it is 'write once', you don't edit or change the messages in the mailbox.
External attachments. I guess it's possible, but again, it isn't really worth it in most cases. Parsing MIME is *fast*. It is much faster than parsing xml, and besides, people rarely look at an email more than once or twice. There isn't much use going off and storing the attachment in a high-performance reading format if it isn't going to be accessed often, and it just places a greater burden on your server.
base64, etc. Well, it's entirely possible simply to store the messages as 'binary' format. Assuming the boundary markers are checked properly, Camel can work with binary encoded mail messages, and probably at least some other mail clients can too. There are some problems with some of the extremely broken openpgp/pgp/mime specs which suddenly say that mail transports aren't allowed to alter the *transport* encodings of some parts, but well, these specs are just braindead, and can be worked around.
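To put a number on what binary storage would save, here is a small Python sketch that measures the size increase of base64 as it appears in a MIME body (76-character lines). The helper name is made up; note that `base64.encodebytes` ends lines with a bare newline, so mail on the wire with CRLF line endings is slightly larger still.

```python
import base64


def base64_overhead(raw: bytes) -> float:
    """Return the fractional size increase caused by base64 encoding."""
    encoded = base64.encodebytes(raw)   # 76-char lines, each newline-terminated
    return (len(encoded) - len(raw)) / len(raw)
```

For any attachment of non-trivial size the result comes out a bit over one third, which matches the figure quoted in the original question.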
Security model. Well, talking about Unix mail, not server mail, the filesystem is adequate.
Shared folders - is not an issue for unix mail.
Unicode. Well you can write unicode filenames to most unix filesystems, even if 'ls' doesn't show it right.
MTA. Nothing could be simpler or safer than maildir as a delivery format. The mta doesn't have to care about any client-side indices, the mua will simply update them when it incorporates the new messages, etc.
Writing libmailstore? Mate, it's called Camel, and it's already written. Camel already does mbox, maildir, mh, it can read spool files directly (it doesn't create a summary file or build any indexes), it can talk imap, pop, and has partial support for nntp. If someone gave me a decent RDBMS table schema and a carton of pale, I could probably write a MySQL backend in a couple of days, well, assuming the MySQL api is mt-safe.
Finally, some comments on evolution.
Evolution isn't reinventing any wheel. We use standard mbox format (if such a thing really exists anyway). We use standard maildir format, etc. Yes we may optionally create body indices, and we do usually create on-disk binary/compressed 'summaries' of the data, but these are really just on-disk caches of in-memory data structures, rather than anything to do with the mail storage format.
We put mail in another location, but everyone else has done that too, elm:Mail, pine:mail (or is it the other way around?), netscape:ns_mail, etc. At least we now offer the option to read most of this 'in place'.
The main problems evolution has with scalability is:
indexing.
Indexing is quite costly. The original index code was written somewhat like a database, it handled all internal data structures, used blocks of data, etc. It was slow, it scaled poorly. Definitely some of the algorithm choices and the implementation weren't that hot, but it shows that such a solution isn't as simple as at first thought. Using libdb was impossibly slow (like several orders of magnitude slower).
The new stuff is a lot better, but can still use a lot of resources while indexing, and copies the whole file (well 2 files) across when performing expunges, but they are only performed occasionally, and the indices are smaller than the original indices, so in practice it scales much much better.
the summaries
The summaries are indices of a sort anyway. They are an in-memory tree of a subset of the information on each message. Enough information to display a list of messages, and perform vfoldering operations. Even though we do some tricks, like sharing common strings, the summary can get very large.
But, it's a tradeoff I thought was worth it, rather than using on-disk summaries. The api's are much easier to use, and the problem gets pushed to the user - if they want to have folders with 100K messages, they should expect it to use a bit of memory. The on-disk size of the summaries is very small too, although I guess it could be made even smaller if we consolidated common strings.
per-message memory use
Currently, a lot of data gets copied around in memory. Every time you read a message, at least 1 whole copy of the (decoded) message is in memory at a given time (yes, including attachments). For IMAP this can get even worse (2-3 copies of a given attachment at a given time), because it doesn't stream enough. Most of this could use a disk-backing without changing any api's though, and well, I'm rewriting IMAP.
Wrapping up
And yeah, we're talking 100K messages here, not 1400. My 500MHz celeron laptop has about 35K messages stored over about 10 mbox files, and it starts up in under 10 seconds, and that includes all of the bonobo/activation overhead (which is very significant). Yeah it uses a bit of memory, but memory is cheap on a personal workstation.
In short. The current mailbox formats we have suffice for "Unix mail". Add some archiving abilities to your mail client (even RDBMS backed mail clients need archiving), and you'll never have to delete a message again, and still get work done and still use mbox.
If you want to talk about writing a server - well who cares, you can do whatever you want, because everyone has to go through your interface anyway (you DO NOT want clients accessing data under you, that's what DBMS's are all about in the first place
It seems some people think 1-tier applications (client code talking directly to a database) are the way to go for multi-user environments. They're not, they don't scale and are impossible to maintain. Nobody writes any real software like that anymore, unless you're writing dodgy vb toy apps.
_
\\/ are accustomed' - First Lensman
Has anyone actually implemented a distributed email system based
on NNTP? Not like the simple email to nntp gateways, but something
far more featureful. This would work as follows:
Every system that you would like to have full email access from has
a local NNTP server. All these systems are hooked up using
mostly standard NNTP configurations and protocols. Only relatively
minor modifications would be needed to support authentication and
the other features.
Your domain(s) are configured to use all of these (net-reachable)
systems as MX hosts. And each mailbox/mailspool is set up as a
separate 'newsgroup', allowing for hierarchical mailboxes. Presumably
your top level hierarchies are local usernames, and the server
only allows authenticated users access to their 'mailbox' (hierarchy).
Group mailboxes would be easy to implement though.
Something like this:
bb.inbox
bb.inbox.lists.slug
bb.sent-mail
bb.sent-mail.lists.slug
[..]
public.somegroup.inbox
etc
Whenever a mail comes into one of the MX hosts, it is filtered
out, using procmail or something, and dropped into the appropriate
newsgroup. Alternatively have only the primary MX handle this,
but then you cannot get any new mail if this box is unreachable.
The magic of NNTP then comes into play, distributing that
email across all of the hosts in the NNTP group.
You then read your email using any nntp capable client. To delete
messages, your client sends a usenet 'cancel' type message to the
local server, and this gets distributed around the network.
But to start with, it'd be simple to create a wrapper that
gave an IMAP interface, so (almost) any mail client
will work. But that would limit you to read and delete.
Having sent items and saving items probably isn't supported in
IMAP.
Not a bad start though, a "full" client would be able to
do the works, such as automatically moving messages across
"folders", saving sent messages, etc.
Sending an email sends via normal SMTP protocols, and optionally
puts a message out via NNTP to update the sent-messages groups.
This is incredibly useful especially with intermittently connected
hosts like laptops. You can read/send/delete messages there, and
when it gets put on line again, it will send the cancel messages,
sent-messages and other things via the NNTP net to all other
hosts, ensuring a consistent system across all hosts.
What would be the limitations/weaknesses/etc that would make
this a bad idea?
Sparks:Gadget:Beer Maker
There is absolutely no reason to abandon the standard e-mail file format, including uuencode for file formats. Doing that, you would end up with a file format that depends on certain versions of the e-mail file format to work optimally. If you want to reduce harddisk space, zip it like OpenOffice.org does.
E-mails are documents. Documents belong into the home directory, and so do e-mails. If you want to do something new, you should use the harddisk folders as e-mail storage, so that e-mails, spreadsheets and documents mix. This probably requires inventing a new ".e-mail" file format so that e-mails can be properly recognized and indexed.
Storing one e-mail in one file is not a problem as long as you index the filenames properly, for which you can use gdbm.
Dybdahl.
I don't think things are that bad - for example, Cyrus with its indexes works pretty well with large (20,000+) folders. And things like searches are pretty fast with a client like evolution that does a lot of caching.
I would take the simple structure of Cyrus over the easy to break "database" files of Exchange server any day.
This is all very interesting because I'm slowly writing an IMAP server at the moment..
But here's the setup I'm currently using:
Inbox:
/var/mail/$USER
Subfolders:
/var/mail/$USER-folders/$FOLDER/.messages
Eg:
|-- root
|-- fred
`-- fred-folders
    |-- 1ZB
    |   `--
    |-- Friends
    |   `--
    |-- Games
    |   |--
    |   |-- Rune-Beta
    |   |   `--
    |   `-- Tribes
    |       `--
    `-- Mailing Lists
        |--
        |-- EFNZ chat
        |   `--
        `-- Hard News
            `--
I started with uw-imap but I want to store messages and subfolders together. Plain uw-imap doesn't do this and last time I checked, neither does Maildir. So I did a [kludgy, incomplete] mod and produced the above. Works for me :)
Get the patch: http://home.y3m.net/uw-imap-2001a-nested-folders.patch
(diff against imap-2001a)
In the server I'm working on you will be able to implement a relatively simple C++ API to do your own storage. So you can use Maildir, mbox, PostgreSQL, whatever. We'll see.
flame away :P
No, I did not read the f***ing article!
There seem to be two discussions going on in the comments today, one about mail storage for an MUA and one for storing mail on servers. ;)
I've never used OpenMail/Domino/Notes/whatever, but I guess they do roughly the same thing, which is a pretty good idea. However, these things all have the distinct disadvantage that they use proprietary protocols and aren't particularly cheap. There's always IMAP, which many people really like, but I feel is too complex a protocol (compare with the infant levels of complexity in POP3).
As far as the client end is concerned, from the point of view of writing an MUA, having an SQL backend is a complete godsend because you have to write virtually no IO code, you can put all the logic in the queries. However, there are some tricks you need to use to keep up the speed, most importantly to use two tables, one for metadata and one for the mails themselves. This keeps the speed up by keeping the metadata table small (maybe on a better RDBMS than MySQL this wouldn't make a difference, but I found that >10,000 mails all in a single table in MySQL got quite slow until I moved the metadata into a separate table).
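A minimal sketch of that two-table split, in Python. It uses the standard library's sqlite3 purely as a stand-in so the example is self-contained (the comment above is about MySQL), and the table names, column names and helper functions are all hypothetical.

```python
import sqlite3


def open_store(path=":memory:"):
    """Create the two tables: small metadata rows, large raw messages."""
    db = sqlite3.connect(path)
    db.executescript("""
        CREATE TABLE IF NOT EXISTS meta (
            id      INTEGER PRIMARY KEY,
            folder  TEXT NOT NULL,
            sender  TEXT,
            subject TEXT,
            date    TEXT
        );
        CREATE TABLE IF NOT EXISTS body (
            id  INTEGER PRIMARY KEY REFERENCES meta(id),
            raw BLOB NOT NULL
        );
        CREATE INDEX IF NOT EXISTS meta_folder ON meta(folder, date);
    """)
    return db


def add_message(db, folder, sender, subject, date, raw):
    """Store one message; the raw MIME text goes into the body table only."""
    cur = db.execute(
        "INSERT INTO meta (folder, sender, subject, date) VALUES (?, ?, ?, ?)",
        (folder, sender, subject, date),
    )
    db.execute("INSERT INTO body (id, raw) VALUES (?, ?)", (cur.lastrowid, raw))
    db.commit()
    return cur.lastrowid


def list_folder(db, folder):
    """Folder listing touches only the small metadata table."""
    return db.execute(
        "SELECT id, sender, subject FROM meta WHERE folder = ? ORDER BY date",
        (folder,),
    ).fetchall()
```

Displaying a message list never reads the large rows; the body table is consulted only when a message is actually opened.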
The obvious downside of using a DB for client end storage is that you have to have a central DB server, or one on each client and you need to admin one more set of authentication/permission details, plus you can't move the mail very easily to other MUAs. IMO a much better solution would be to keep the use of SQL/RDBMS, but move the DB into the filesystem so you can just have a bunch of files with metadata stored in the fs. Need to make an mbox? "cat ~/mail/* >>/tmp/my_new_mbox".
From the server point of view, many people have been mentioning Exchange/Domino etc. Personally I can't stand Exchange, I've had to admin it on several occasions and it's generally done everything it can to stop me from having an easy life (just thought I'd air my prejudice against Exchange in the spirit of fairness and honesty).
With a colleague of mine, I'm working on a set of POP3 extensions that give some IMAP like features, but is really designed to keep multiple mail clients in sync with each other by way of a transaction log. There are still some limitations, but I think I know what they are and how to fix them (e.g. not enough metadata can be associated with each mail yet). It adds about 6 or 7 commands to POP3 and currently lacks any decent client support, but I have written a fairly usable library and patch to gnu-pop3d for it. I've just submitted it as my University final year project, so I'll try and get the protocol description documentation online soon. In the mean time, if you're interested, it's on SourceForge
Chris "Ng" Jones
cmsj@tenshu.net
www.tenshu.net
So NNTP solved this IMHO in a rather elegant way...
You have directories corresponding to newsgroups or mail folders or whatnot, i.e. alt.swedish.chef.bork.bork.bork is really alt/swedish/chef/bork/bork/bork
Articles are numeric, i.e. \d+ for Perl types. The raw message is stored in each file.
In each directory, there's a file called .overview, which is just the summary information for all the files.
Thus, you can have zillions of small files, and happily grep and copy them to your heart's content. But you never do a 'ls' on a huge directory, you always just look through the .overview file. Or grep through it, if you like.
So, in that sense, it's very much the best of both worlds. And, on the same box, you can specify rules on who can access the folders, so one file can be read by multiple people. Ooh.
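For the curious, here is a rough sketch of how such a summary file could be generated for a directory of numbered message files. The tab-separated layout (number, subject, from, date) is my own simplification and not the real NNTP overview format; the function names are made up too.

```python
import os
from email.parser import BytesHeaderParser

OVERVIEW = ".overview"

def build_overview(folder):
    """Write one tab-separated summary line per numbered message file."""
    parser = BytesHeaderParser()
    names = sorted((n for n in os.listdir(folder) if n.isdigit()), key=int)
    lines = []
    for name in names:
        with open(os.path.join(folder, name), "rb") as f:
            headers = parser.parse(f)  # parses headers, does not decode the body
        fields = [name]
        for header in ("Subject", "From", "Date"):
            # collapse tabs/newlines so each message stays on one line
            fields.append(" ".join(str(headers.get(header, "")).split()))
        lines.append("\t".join(fields))
    with open(os.path.join(folder, OVERVIEW), "w", encoding="utf-8") as out:
        out.write("".join(line + "\n" for line in lines))
    return len(lines)

def read_overview(folder):
    """List the folder from the summary file alone, without opening messages."""
    with open(os.path.join(folder, OVERVIEW), encoding="utf-8") as f:
        return [line.rstrip("\n").split("\t") for line in f]
```

A mail reader would call read_overview() to draw the subject list and only open an individual article file when the user selects it.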
GNUS, an Emacs based mail/news reader, uses a variant of this called nnml, which rocks.
Of course, when you get down to it, JWZ arguments aside, databases start to really look like what you want, especially on a corporate level when you're tossing the same piece of mail around to tons of different folks.
-e
From the MUA's point of view the storage is abstracted in that you use IMAP. (Don't you? :)
:)
If you need to run elm/mailx etc use fetchmail.
From the MTA's point of view it's abstracted via LMTP.
Job done
I'm a Cyrus admin, and the only reason I'd care about how mail is stored is for doing tape restores.
no one ever try DBMail? http://www.dbmail.org
works with mysql and postgres for now...
Well, I know who should *not* be in charge of the new mail storage format standard, I guess. These boys are not yet capable of getting the right mail in the right place.... and keeping it there, although it would be cool if they released the carnivore source just so we could add carnivore format file import to evolution ;-)
Oracle has a product, Internet File System (iFS), that aims to provide a global solution.
They store files, mails... in an Oracle 9i database.
http://www.oracle.com/ip/deploy/database/features/index.html?ifs.html
In fact, one of these pathetic windows clients has found a quite good solution, IMHO: Files are extracted from the mail body and stored in a separate folder. This has many advantages:
1. You can easily browse this folder, deleting files you don't want. As you pointed out, attachments use the most space, and this way you only keep what you want.
2. By directly writing them to binary files, no space is wasted (unlike keeping them as MIME).
TheBat's mail format is far from being perfect. Mails are still written sequentially into a mail file (we all know this effect of "deleted mails" which are still physically on disk).
Are you trying to tell me that a 5MB empty mailbox is asking too much? A text message that says "Hi!" costing 1.2MB is somehow wasteful?
Lotus Notes Uber Alles!
This
I need to build a new mailbox file format.
Let me ask the elite engineers and database gurus on slashdot.org.
What bugs me the most with current mail technology is the problems with distributed mail handling.
:(
I access my mail on all kinds of devices, sometimes online sometimes not.
My main problem is not so much which mail-server / retrieval / presentation to use, since they all have the same inability to give me a working distributed solution.
For online usage imap is sufficient, but if I go offline with my laptop or ipaq, I'm lost.
POP isn't very efficient either: since only one of my clients can be the deleter, I must make sure that I have synced all my other devices before the deleter removes the message.
Since I use tons of folders for my mail, and some of my stored mails date back to the late '80s, it basically forces me to use imap so my folders are in sync on all the devices, but again that only works online.
Further, it only works if my imap server is online. That can be trouble if I'm in some far-off part of the world and for some reason have no contact with my mailserver.
What I would like is a concept I call SyncMail
A distributed db-system. First I set up some 3-4 primaries, spread out on the net with completely different access routes. Each of them gets an MX record.
The sending MTA is happy to deliver to a secondary mailserver if the primary is offline.
But here comes the magic!
The system regarded as a secondary MX by the rest of the world is in fact a primary!
It sucks the message into its db instead of queuing it, tags it with its own internal server id, and tries to sync it to all other SyncMail primaries.
Sooner or later the new mail is propagated to all the primaries.
On the client side, the SyncMail app contacts all the primaries, checks against a private index, and syncs all new mails, trying the closest server first.
Since all mails are tagged with the primaries they have been delivered to, no mail is retrieved to the client more than once.
Now I have a complete local mail-tree in my client, regardless of which primary I was able to contact. Sure, if a mail was delivered to a primary that goes offline before the client syncs, and it hasn't been able to sync it to the other primaries, I won't get it until that primary comes online, but what the heck: with pop/imap, if my mailserver is offline I'm completely out of business, so the loss is definitely smaller in this case.
And for my ipaq I just configure the client to work with a few important folders, and to skip attachments, to save storage.
And for sending, all clients store the mail in an outbox, which is then synced to the primaries; once it gets to a primary it is sent by normal SMTP.
This way I solve the problem of being able to send mail with proper originating SMTP headers. Of course the outbox is synced as well, so I get a reference copy of my mail on all systems.
I have started on a SyncMail application and someday I might be able to complete it, but there is so much work all the time
Would anybody else be interested in this concept, maybe we could complete it together.
Or if this is a really stupid idea, I'd be glad if someone would point it out, so that I can focus on finding a better solution.
I once tried benchmarking Maildir vs mbox for my mail archives (mailboxes with ~3000 messages). On ext2 Maildir was a loss:
- Mutt took twice as long to open a Maildir than mbox from cold cache.
- Mutt still took a bit longer to open Maildir than mbox from hot cache.
- On ext2 with 4K blocks mbox ate 13 MB of space, Maildir ate 21 MB.
- Small UI degradation: Mutt wouldn't show the number of lines in a message from a Maildir, and it wouldn't show percent progress indicator while reading the Maildir.
Basically for my situation (read-only mail archives with large numbers of messages, which are rarely in filesystem cache, ext2 and constant disk space shortage) mbox was better. But my situation (personal, mostly static mail archives) is remarkably different from running an IMAP server.
I did this test in 2000. I should probably try again some day with Reiserfs, but I heard various people telling me it doesn't improve Maildir performance. Can't say anything until I try myself.
I therefore recommend you to try it yourself and see if Maildirs really help in your situation.
There are many, many emails sent to one person that really need to be stored in a project folder, an administrative folder, etc. When someone is searching for info, they want to go to a central location and search all the documents, folders and emails that have to do with those documents. Storing email as one file or many is a discussion orthogonal to this need. I don't have an answer on this one, only a need. How can I quickly drop an email into a directory?
I realize it is possible, but in Eudora or Mozilla's mail client, I have to do a Save As, rename, browse to a specific folder and finally save. It would be great to be able to put a folder somewhere and just drag and drop.
Perhaps I'm missing the point here, but isn't it down to your mail client to store your mail however it sees fit? Why should you as a user have to know or care?
Ceterum censeo subscriptionem esse delendam.
If your e-mail is in a binary DB, you're pretty much reliant on the developer of the DB format to let you export it. Outlook Express, in particular, is very reluctant to let you bulk export e-mail - it'll export .eml files, which are the e-mail in plain text just like OE received it, but only one at a time via right-click, Save As, which is a pain for large folders (at least in the version I used to use, 5.5, it might have got better since).
Yes, it's possible to scan through binary DBs with 'less' if they contain the plain text somewhere, and I have been known to do this with my old OE .dbx files, but it's a bit ugly (half a paragraph of mail, 20 bytes or so of random binary, the other half of the paragraph).
With a maildir or mbox format (I now use MH, which has a modified maildir as its native format) you can just grep through the files if you want to extract information from them and your e-mail client isn't working/installed/whatever (or you've switched to a different one).
here's a little transcript:
% cd /mail/fs/mbox
% lc
Directories:
1 113 128 142 157 171 186 20 214 229 243 258 272 287 300 315 33 344 359 373 388 401 416 430 445 46 474 56 70 85
[...]
% cd 318
% lc
Files:
bcc date filename info messageid rawbody sender type
body digest from inreplyto mimeheader rawheader subject unixheader
cc disposition header lines raw replyto to
Directories:
1 2 3
% head raw
Return-Path:
Received: from punt-1.mail.demon.net by mailstore for rog@vitanuova.com
id 1021665470:10:17045:138; Fri, 17 May 2002 19:57:50 GMT
Received: from psuvax1.cse.psu.edu ([130.203.4.6]) by punt-1.mail.demon.net
id aa1016828; 17 May 2002 19:57 GMT
Received: from psuvax1.cse.psu.edu (psuvax1.cse.psu.edu [130.203.6.6])
by mail.cse.psu.edu (CSE Mail Server) with ESMTP
id 27DA4199BE; Fri, 17 May 2002 15:57:13 -0400 (EDT)
Delivered-To: 9fans@cse.psu.edu
Received: from acl.lanl.gov (plan9.acl.lanl.gov [128.165.147.177])
% head body
This is a multi-part message in MIME format.
--upas-mbyuptynpdsmbjuyeermihdgur
Content-Disposition: inline
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
Hi,
If you seek excitement and thrills you need to look no further than
Plan9 -- it gives you everything and then some, but in a good way (or
% cd 2
% lc
Files:
bcc date filename info messageid rawbody sender type
body digest from inreplyto mimeheader rawheader subject unixheader
cc disposition header lines raw replyto to
% cat mimeheader
Content-Type: image/jpeg
Content-Disposition: attachment; filename=iostats.jpg
Content-Transfer-Encoding: base64
% page body
reading through graphics...
%
"raw" contains the raw data that makes up the message. "body" contains the data after the encoding formats have been applied (hence in that case /mail/fs/mbox/318/2/body is a jpeg file, viewable directly by any usual jpeg viewer).
the beauty of this scheme is that it hides the underlying storage scheme from the mail clients. if i wish to change things so that the underlying storage format is many files [currently it uses a traditional mbox format], none of the mail client programs have to change.
plus i can use grep, diff, shell scripts, etc directly on the messages in my mailbox. procmail eat your heart out.
After i learned Java and Perl some, i had planned to create an email app that was reasonably cross platform.
The main detail is that i'd cause the email to be stored by return address, much like OE (and the like, i'm sure) stores EVERYTHING in one file, with the exception of storing the MIME-attachments in a separate directory.
every email i'd get from friend A would be stored in file A and indexed by some sort of Perl db. Friend B would have their own file, and the db would keep up to date so it all displays in the app, but on the computer they're separate files. (spam might be easily removed by just removing the files from your file system; repeat spammers needn't be deleted more than once).
attachments could easily be moved to a separate directory and the db keeps notes on which mail entry each belongs to and could display it in the app.
since i'm very new to Perl (and haven't even started learning Java), it's going to take me a while. So, i'm interested in some collaboration.
once this is working, perhaps it wouldn't take too much work to turn it into a fully fledged Unix mail thingy (with a cli instead of a Java gui?) and be useful for this topic.
All Scottish food is based on a dare.
All is nice, but I have an additional requirement: I need to access my email from a text terminal sometimes. So I need a client which would be able to access mail processed by Evolution (possibly with reduced functionality).
Another question: after many back-ups and moves between many hosts, I ended up with many folders with partially duplicated emails. How do I make "one big merge" so that mail IDs are unique, but still linked to the virtual folders they were originally in (so merge the data, but keep the original folder names as meta-data)?
Backend the silly stuff onto PostgreSQL (which is the best for clean transactionality and locking). To heck with this low level mucking about in the filesystem, that should be left to the database designers.
... in the form of 8 different machines, all of which were running reiserfs on various GNU/Linux distros ranging from Suse to Mandrake to Debian, all of which suffered data corruption, data loss, and even the mysterious vanishing of entire directory trees (while disk usage exploded). In short, all had unrecoverably corrupted filesystems, not as a result of unscheduled shutdowns (which journalling is supposed to help protect against anyway), but on machines that were operating normally, without interruption. None of these filesystems survived more than 9 months of normal, everyday activity (without improper shutdowns, I will stress once again).
... I have hard data to back up my claims, and, quite frankly, a filesystem is sufficiently important that "your mileage may vary" should be an unacceptable answer. By all accounts, if those who haven't (yet) suffered data loss with ReiserFS are to be believed, with ReiserFS YMM indeed V.
These machines were located at three disparate sites, had different base configs, and in two cases were installed and maintained by different people.
The only things they had in common were that they used Reiser, they lost data (severely), and had to be reconstructed from backups (this time without using Reiserfs).
You may believe that you can trust ReiserFS, but I know for an absolute fact that I cannot, and I think it is very possible you will discover that at some point as well. Of course, having relegated everyone else's experience to mere anecdote, it is clear you won't learn this until it hits you in the face, personally. That's OK, not everyone is willing to learn from the experience of others.
However, to those who are interested in learning from the experience of others I will say this: tread very, very carefully with ReiserFS. It is not ready for prime time, and should not be used in any production system. If you really need journalling, use XFS. It is very stable and quite difficult to damage (so far it has survived every stress test I've been able to throw at it).
Now, go ahead and relegate this to anecdote if it makes you feel better
The Future of Human Evolution: Autonomy
Mysql.
I think it's the best of both worlds. Your 'INBOX' is like Maildir, where each 'new' message is a separate text file. Once you've 'Filed' that message, however, it's compressed into a single file along with the rest of the emails for that folder.
Personally, I think you're looking at the WRONG aspects of mail servers. You're getting way too technical. Nobody gives a shit about wasted inodes. When's the last time you defragmented ANY disk?
The reason I use Mercury is its exceptional Netware NDS integration. Combine that with Pegasus Mail's NDS integration, and you have 'Roaming' users without all the profile garbage (Pegasus will use NDS calls to see 'who' you are, and read your email from your home directory). Oh, and it's free.
Too bad it hasn't been ported to Linux.. along with the PAM stuff needed to keep up the kick-ass user integration :)
"I can't give you a brain, so I'll give you a diploma" - The Great Oz (blatantly stolen sig)
Courier MTA/Courier IMAP w/ XFS on software raid5. Maildirs and XFS is a good thing. So far i'm pretty happy with it.
Sam Varshavchik, the coder/maintainer/author is... a character. He can be rather acerbic and opinionated, but responds to EVERY issue raised on the mailing list within a day. Admittedly the answer is sometimes "their software is broken. don't use it with courier" but serious issues are addressed quickly and well. Nice change from having to wait 6 months for an institutional patch...
So far I have mutt, pine, webmail, netscape, and OE, clients working well with it. (pine needs to run through IMAP and OE can be a little iffy)
For more info check out www.courier-mta.org
I came up with mboxdir. It was actually a preliminary specification for a Win32 client.
Did you really try cyrus or did you just dump it because it looks similar to maildb?
cyrus really has some interesting features and is way faster than mbox:
- full IMAP4rev1 compliance with multi access
- ACL
- Quota
- sieve support
- hard link support for multiple recipients (yes, this means sending a 10 MB file to all local users will take 10 MB disk space on the mail server).
And it proved to be very reliable.
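To illustrate the hard link point: the sketch below delivers one message file to several mailbox directories by linking rather than copying, so the data blocks are stored once. This is only a demonstration of the idea with made-up names, not how Cyrus implements it, and it assumes all mailboxes live on the same filesystem (hard links cannot cross filesystems).

```python
import os

def deliver_to_all(message_path, mailbox_dirs):
    """Hard-link one stored message into each recipient's mailbox directory."""
    name = os.path.basename(message_path)
    delivered = []
    for box in mailbox_dirs:
        os.makedirs(box, exist_ok=True)
        target = os.path.join(box, name)
        os.link(message_path, target)  # new directory entry, same inode
        delivered.append(target)
    return delivered
```

Each recipient can delete their own link independently; the data is freed only when the last link goes away.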
--jochen
My experiences mirror yours. I worked for a company that supported Exchange and Domino. We had a number of Exchange guys (who were pretty clued up) to support the client's Exchange system. We had just one person who supported a greater number of Domino clients.
;-)
As for the NetWare stories: Netware 2.15c server that was up for 2 years. Shut down and moved to new site, and wouldn't restart. Investigation showed this to be incorrect termination of the SCSI drives (which had been like that for two years!). Corrected the termination, and off it starts
There is no reason though why a combination approach cannot be used. Store binaries (and text) on the file system, and have the "meta info", pointers etc. stored in a DB. That way the DB doesn't need to be too flash or large.
VMSmail's storage format is instructive. Each message is represented by a single record in an indexed file. A short message body is simply tucked into the record along with the headers and other metadata. Long bodies (more than around 2kb IIRC) are stored as individual files and their header records point to the files by name.
Of course you all realized at once that the main file can get out of sync. with the directory which holds the external bodies. It does, sometimes, and fixing it up can be a pain. Any storage method which partitions a single message among multiple files is going to have similar problems. But it works pretty well, and it shouldn't be too hard to write a tool to groom the message store in case of inconsistency. It's worth study.
It was a natural choice on VMS, which has really good multi-indexed file support in the base package. It works well with text messages, which often do fall within the size limit for avoiding external storage of the message body. Today it suffers the same problem that mbox does -- people use email differently now.
I have never used Exchange, but a friend of mine admins a large (50,000+ users) Exchange system. Even a few years ago, running on NT4, their servers did NOT go down, ever. They scheduled a reboot for patches etc. every 6 months, that's it. I have had lots of Netware boxes up for over a year, but not Netware 5 running mail. I inherited such a box & it needed to be rebooted every month or two. Now I've replaced it with a Linux based mail server & I'm much happier. Still have a 4.11 box cranking along happily, even happier since the 5 box is no longer giving annoying messages about its licences. And my 2000 Server has been up for coming up on a year with no problems.
I remember a few months ago that Oracle was going to release some db oriented mail server that was supposed to revolutionize enterprise level email. Anyone know anything about this?
I mean come on shit on the server? Think of the smell. What if you're running a hot server? You'll have some brownie lookin turds.
Just my 2 cents but you may want to avoid shitting on the server.
I often searched for a MUA that uses MySQL as storage and often thought of creating one myself.
yEnc reintroduces the problems the world had before MIME, suffers from the same "begin youcantreadthis.txt" "attachment" games as Outlook, but does not solve the transport reliability issues. Don't waste your time.
About 1 second with mutt (in an rxvt) on my Dell Dimension 4100 (1 GHz PIII; 512 MB RAM; 7200 rpm IDE disk) running Debian. The maildir contained 1429 messages and is on an xfs; the kernel is a recent 2.4.18+xfs. Idiot.
"Mit der Dummheit kaempfen Goetter selbst vergebens." - Schiller
I have in excess of 46K email messages in my account alone, not to mention everyone else's accounts on my company's mail server. We use cyrus IMAP and qmail, both of which use the Maildir format mailboxes ... every client I've used (Mozilla, Communicator, Outlook/OL Express, Mail.app on OS X, Eudora, and Papi-Mail on PalmOS) seems to have absolutely no problem with this setup. Most MUAs are intelligent enough not to download all your headers every time you connect, so unless you're getting 1000+ new emails every time you open a particular folder, you're generally not going to need to read all those headers every time.
The server that runs this is a measly 600MHz PIII w/ 128MB RAM running RedHat 6.2 w/ a 20GB hard drive. I haven't gotten even close to running out of inodes, to my knowledge, and my server never goes down (really, the only times it's gone down is when power has been cut to it and this has only happened twice in the past 1.8 yrs ... long live Rackspace).
Maildir is specifically designed to handle mailboxes with large numbers of emails in them, contrary to other formats such as mbox. The problem with any sort of DB approach is the waste of space, even if you compress. A basic course in file structures will teach you a wealth of knowledge in this regard.
Imagine this: you have a table that stores everything you need to know about an email. You have a few distinct fields for commonly accessed headers (subject, from, to, cc, etc.) each of which would need to be 'text' blobs, since you cannot limit their size (you've seen the emails that have to/cc fields that are miles long, right?) - well, 'text' fields are notoriously poorly optimized in database engines and quite difficult to search (you can create an index on a part of a text field, but that might not be enough, right?). Next you have the message body which would also need to be a text field since you don't limit its length, either.
Now, since the space for these fields (which don't *ever* change) is not optimized in the slightest, you might think that compressing them is a good idea, right? Well, what if an email is deleted - then you start looking at fragmented space in your database table which would need to be compacted periodically (much as mbox/.mbx files do today, if I recall).
All in all, storing each message to its own file is not really *that* bad ... optimize the file subsystem beneath it, maybe allow for compression/encryption or that sort of thing, but otherwise, the folks that put together Maildir have certainly done a decent job!
You are going to run out of inodes at exactly the same time you run out of disk space, because they are one and the same thing.
No.
Running out of inodes is not the same thing as running out of space. Some of the symptoms of the two are the same ("can't create new files"), but they are completely different failure modes.
Consult your local man pages for further details.
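For anyone who wants to see the two limits side by side, here is a small sketch using os.statvfs (Unix only), which reports free blocks and free inodes separately. The function and key names are my own. Note that some filesystems which allocate inodes dynamically report a total of 0, so the check below treats that as "no fixed inode limit".

```python
import os

def space_report(path="/"):
    """Free bytes and free inodes for the filesystem holding `path`."""
    st = os.statvfs(path)
    return {
        "bytes_free": st.f_bavail * st.f_frsize,  # space usable by non-root
        "inodes_free": st.f_favail,               # inodes usable by non-root
        "inodes_total": st.f_files,
    }

def why_full(report, size):
    """Return None if a new file of `size` bytes should fit, else the reason."""
    if report["bytes_free"] < size:
        return "out of space"
    if report["inodes_total"] > 0 and report["inodes_free"] == 0:
        return "out of inodes"
    return None
```

A Maildir full of tiny messages can hit "out of inodes" while the report still shows plenty of free bytes, which is exactly the failure mode being argued about here.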
News for Nerds. Stuff that Matters? Like hell.
There are two excellent reasons that so many people use Exchange.
1) In general, it works out of the box. A company with someone with meager knowledge can set up a fairly complex mail handling system without much help.
And that same person with meager knowledge is going to get hacked six ways from Sunday when the next Exchange exploit comes around, because what's not included in that meager knowledge is that you have to keep up on security patches if you want your easy-to-install mail server to not be an easy-to-hack mail server.
2) It does A LOT. In its most basic configuration it does what you need 10 or more programs in Linux to do, not to mention that most of those 10 don't exist.
And God help you if one (or many) of those pieces of Exchange are broken or don't do what you want to do. Can't change it, it's part of Exchange! At least if one of those 10 linux programs are broken or doesn't work right, you can replace it with something better without affecting all the other parts.
These are simple philosophic differences between Unix and Borg. Borg stuff usually has a shallow learning curve at the beginning, but then it ramps up as you discover things that are difficult or impossible to do. Whereas, the initial Unix learning curve may be steep, but it flattens out further in.
At least mafia-owned pizzerias make excellent pizza. Compare to Bill Gates.
Sounds like you worked closely with a bunch of clowns who had no idea how to run Exchange.
Well, how many MCSE's on paper equal one person who has actually done the stuff?
We have an Exchange server - one person manages it along with tons of other stuff. It pretty much runs itself. We just moved to Ex2k, but were on Ex5.5 for quite a while - I can think of only one time it crashed and that most likely had to do with a 3rd party virus scanner integrated onto the server. Removed that and no more problems.
While there are similarities, note that cyrus also keeps a couple of files per folder to enhance the performance.
-- The world is watching America, and America is watching TV.
"..in the form of 8 different machines, all of which were running reiserfs"
I have never heard of anyone having so many problems with reiserfs as you! I am using reiserfs on several squid boxes, 21 production qmail boxes and a handful of other production and testbed systems, and I have never had so much as a hiccup that related to reiserfs. Searches of google and google groups turn up no one else who shares your experiences of "unrecoverably corrupted filesystems" with reiserfs.
You forgot to mention the fastest and most scalable solution there exists, which is Cyrus imapd, see http://asg.web.cmu.edu/cyrus/
Basically, it is maildir with a header database.
It scales well for tens of thousands of very active users on a single small box, and has also support for clustering. I know of installations which serve many hundreds of thousands of users on a single host, so imagine what a cluster of them could do.
It doesn't do much to economize on space, but that's a non-issue. Anyone who is willing to keep dozens of megabytes in his mailbox is willing to pay for the privilege, and hard disk space is cheap. Anyway, I think that any mail system which does not preserve the rfc822 format all the way from sender to recipient is evil.
I don't think it makes sense to store email in dbm files. It's too sketchy - what happens when the dbm file gets corrupted? The nice thing about flat files is that if something goes wrong, you can fix it with vi.
I think the right solution to the problem is to key off the message ID, which is supposed to be unique. Then define a mail folder as simply a list of message IDs. Messages can appear in more than one folder, but hopefully not in no folders.
To make this efficient, I'd hash the message ID, and use a hierarchy of directories, because Unix doesn't do well with large flat directories. The hierarchy could auto-extend, so that as one subdirectory fills up, you do a sub-hash and split it into more directories.
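A rough sketch of the hashed layout, to make it concrete. It uses a fixed directory depth for simplicity rather than the auto-extending split described above, and the choice of SHA-1, two hex characters per level and the function name are all my own assumptions.

```python
import hashlib
import os

def message_path(root, message_id, depth=2, width=2):
    """Map a Message-ID to a path like root/ab/cd/abcd...<full digest>."""
    digest = hashlib.sha1(message_id.encode("utf-8")).hexdigest()
    # each directory level is named after the next `width` hex characters
    levels = [digest[i * width:(i + 1) * width] for i in range(depth)]
    return os.path.join(root, *levels, digest)
```

With two levels of two hex characters there are 65,536 leaf directories, so even a few million messages leave each directory with only a few dozen entries. A folder is then just a text file listing Message-IDs, and looking one up is a single path computation.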
The problem of tiny files is a real one. The solution is probably to make the bottom of a hash a file rather than a directory, and store more than one message in each such file. You don't have to store a lot of messages in these files to win - even ten messages would produce a big win, and would be pretty efficient.
The format of the individual files should probably be indexed sequential access - that is, a TOC at the front, and then the contents as plain text, nothing fancy. The TOC should be in ASCII, not binary, and you should be able to rebuild the TOC by looking at the file.
Babyl used to use a control character as a delimiter, which worked pretty nicely - much better than using "^From ". Ever seen >From in an email message? That's because Unix mail uses "^From " as an inter-message delimiter, so it has to quote it, and it does so stupidly. So use ^_ as a delimiter, and if ^_ appears in the email message, just double it. Take a doubled ^_ out when reading a message.
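Here is a minimal sketch of the doubling idea. One wrinkle: with a bare single ^_ as the delimiter, a message that begins or ends with ^_ becomes ambiguous at the boundary, so this sketch assumes the record terminator is ^_ followed by a newline. That terminator, and the function names, are my own choices and not Babyl's actual format.

```python
SEP = "\x1f"      # ^_ (ASCII unit separator)
END = SEP + "\n"  # assumed record terminator: ^_ followed by newline

def pack(messages):
    """Join messages into one string, doubling any ^_ inside a message."""
    return "".join(m.replace(SEP, SEP + SEP) + END for m in messages)

def unpack(data):
    """Split a packed string back into the original messages."""
    messages, current, i = [], [], 0
    while i < len(data):
        if data[i] != SEP:
            current.append(data[i])
            i += 1
        elif data.startswith(SEP + SEP, i):
            current.append(SEP)  # doubled ^_ is one literal ^_
            i += 2
        elif data.startswith(END, i):
            messages.append("".join(current))  # end of this message
            current = []
            i += 2
        else:
            raise ValueError("stray ^_ at offset %d" % i)
    if current:
        raise ValueError("unterminated final message")
    return messages
```

Lines starting with "From " pass through untouched, so there is no ">From" mangling to undo on the way out.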
As for compression, I don't think it's worth doing at first. Disk space is cheap. Yes, my email folder is pretty huge, but it's really not a major problem. Making the storage system extra-complicated by uncompressing MIME is something to add on after you've got something more basic that works - you don't have to solve every problem all at once.
As for folder scan performance, you can make a cache, and have the mail program scan the cache from time to time when it's idle to clean up errors. This is much better than trying to come up with a format that's optimized toward folders - if you try to optimize toward folders, you wind up creating all kinds of problems, IMHO.
Lotus Notes has a special database format called Notes Storage Format (NSF). It supports clustering, replication, and encryption inside the file. We run it on Unix and NT servers, and it performs great, even on hundreds of users each having hundreds/thousands of messages...oh, and Notes not only does email.
I'm a small time mail admin. Since i'm running only a small hosting server delivering no more than 300 emails a day, i don't require these super responsive and super efficient MTAs....
What i do require is the functionality i have found in qmail - and i know plenty of people hate Dr Dan Bernstein for it.
I've written two authentication modules for my hosting server since we use name based vhosts and ip based vhosts, therefore there's a requirement to default to $ENV{TCP_LOCAL_HOST} on ip based connections and to use user%host suffixes for http name based vhosts. I could not have done this if my MTA didn't authenticate with an environment variable and a string of exec loving apps.
I've taken my authorization somewhat further, including courier imap auth modules and custom logging. Again, something that i could not do without the basic functionality offered by my favourite MTAs.
I'm now in the loving hands of a custom chrooted setup with logging and authentication i dreamed and developed and _know how to maintain_ - don't let any database-based MTx take this away from me!
Matt
The questioner makes the correct observation that Maildir is very slow with large directories when performing aggregate operations such as viewing the inbox.
Unfortunately the questioner doesn't notice the corollary that the single-file-per-folder solution will tend to be slower for *unit* operations -- adding newly arrived mail becomes a problem because of locking issues, removing deleted mail necessitates compacting the file and so forth.
I worked at the 8th largest web based e-mail provider -- they provide cobranded web based e-mail for over half a million domains, with over 12,000,000 mailboxes when I left.
A gentleman we interviewed who had left a competitor told us about a major problem they had: They were using stock maildir to store messages, and with a *slightly* larger userbase than us they were crushing a $1,000,000 EMC SAN capable of handling some 8,000 NFS operations per second (Or was it 16,000? Can't recall...) -- 300-400 NFS operations to view an inbox just isn't good. My employer was using a low-end NetApp capable of handling something like 4,000 NFS operations per second (Again, don't remember for certain -- it was half or less of the EMC box's capacity though) and the box was only at 20% of its throughput capacity, with nearly as much mail coming through the system.
The *one* key architectural difference we made was storing certain headers in a MySQL database -- from, subject, sent date, etc. The stuff you need to view an inbox or what have you.
Following such an approach -- particularly with a DB capable of fine-grained locking gives you the best of both worlds: Fast aggregate operations (use the DB to aggregate and index data for inbox-viewing, searches, and so forth), and fast unit operations (using individual files to store messages). And writing software to interact with such a mailbox remains very simple.
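A rough sketch of that split, for the curious. This is not the provider's actual code: SQLite stands in for MySQL so the example is self-contained, and the table, column and function names are all invented here.

```python
import sqlite3
import uuid
from email import message_from_bytes
from pathlib import Path


def open_index(db_path):
    """Open (or create) the header index. The schema is an assumption."""
    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS headers ("
        " filename TEXT PRIMARY KEY, sender TEXT, subject TEXT, sent TEXT)"
    )
    return db


def deliver(db, store_dir, raw_message):
    """Unit operation: write one message to its own file, then index its headers."""
    store = Path(store_dir)
    store.mkdir(parents=True, exist_ok=True)
    filename = uuid.uuid4().hex + ".eml"
    (store / filename).write_bytes(raw_message)
    msg = message_from_bytes(raw_message)
    db.execute(
        "INSERT INTO headers VALUES (?, ?, ?, ?)",
        (
            filename,
            str(msg.get("From", "")),
            str(msg.get("Subject", "")),
            str(msg.get("Date", "")),
        ),
    )
    db.commit()
    return filename


def list_inbox(db):
    """Aggregate operation: build the inbox view without opening any message file."""
    return db.execute(
        "SELECT filename, sender, subject, sent FROM headers ORDER BY rowid"
    ).fetchall()
```

Viewing the inbox becomes one query instead of hundreds of file reads, while delivery and deletion still touch only a single small file plus one row.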
You can use compression on the individual files to save space, or you could be courageous and come up with a binary-safe hierarchical file format that can represent a MIME document efficiently in order to "undo" the 35+% penalty encoding poses. If you're really gutsy you could then compress that file. Or, in order to really maximize performance you could simply opt to compress *segments* of the file (think binary attachments -- leave headers and text/HTML sections uncompressed), so that viewing a mail doesn't involve decompressing it -- only accessing large attachments would incur that penalty. In fact, this gives you room to make user-definable performance vs. space tradeoffs: Let the user decide what sorts of things get compressed. Want to save the maximum amount of disk space? Compress everything. Maximum speed? Compress nothing. (And in that event you don't even have to pay the CPU penalty of MIME-decoding the attachment!)
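Two small illustrations of the above, as a sketch only. `base64_overhead` works out where the roughly 35% figure comes from (3 raw bytes become 4 characters, plus a CRLF after every 76-character line), and `store_parts` shows the "compress only some segments" idea. Both function names, the tuple layout and the default choice of which types to compress are assumptions, and a real store would also have to keep the headers and the part structure.

```python
import zlib
from email import message_from_bytes


def base64_overhead(raw_len, line_len=76):
    """Relative size growth of MIME Base64 over the raw payload."""
    encoded = 4 * ((raw_len + 2) // 3)                         # 3 bytes -> 4 chars
    line_breaks = 2 * ((encoded + line_len - 1) // line_len)   # CRLF per line
    return (encoded + line_breaks) / raw_len - 1


def store_parts(raw_message, compress_types=("application", "image")):
    """Decode every leaf MIME part and compress only the chosen main types.

    Returns a list of (content_type, is_compressed, stored_bytes) tuples.
    """
    stored = []
    for part in message_from_bytes(raw_message).walk():
        if part.is_multipart():
            continue
        payload = part.get_payload(decode=True) or b""   # undoes Base64 / QP
        ctype = part.get_content_type()
        if part.get_content_maintype() in compress_types:
            stored.append((ctype, True, zlib.compress(payload)))
        else:
            stored.append((ctype, False, payload))        # text stays searchable
    return stored
```

Passing an empty `compress_types` gives the "maximum speed" setting, and listing every main type gives the "maximum space" one.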
The Andrew Message System is pretty neat.
http://www-2.cs.cmu.edu/afs/cs.cmu.edu/Web/People/AUIS/ams.html
i hope not ... using it right.
... so 'all messages with from or to these domains newer than 60 days in this folder' and 'all unread messages in this folder' and not have duplicated messages.
now nothing better than having virtual mboxes that allow me to look through all my gig's of mail via imap in under a second.
i can also organize the mail messages into folders but not move them
this thread was obsolete a long time ago.
members are seeing something, you're seeing an ad
fucktards fuck see - Here
Tards isn't a word and therefore doesn't exist. I pity those who are quick to flame and never understand. Ours is an enlightened reality and unfortunately you will never be a part of it. Pity, ignorance is bliss and I guess you'll never understand enough to move on and evolve like the rest of us have.
Desire is the first evil and it begets desire -Mohatma Buddah
Peace can only come as a natural consequence of universal enlightenment ~Tesla
What about LDAP as the message store?
I keep all of these messages so that they can be re-read. But they are obviously only written once. If the MTA would write the headers (To, From, CC, Subject) and then the body to different attributes, it would be very searchable and fast. In addition, access control lists could be set up, and separate container nodes could replace the "folder" concept. I think it makes more sense and would be easier to do than a database.
Until about 3 days ago, I had 1700+ messages in my Maildir, and pine (patched to support Maildir) opened my inbox in about two seconds. Compare this with my sent-mail folder, which had about the same number of messages in it. This folder is stored in mbox format and it took 5+ seconds to open AND CLOSE this folder. I believe that Maildir is the fastest option, short of keeping a separate database.
Searches of google and google groups turn up no one else that shares your experiences of "unrecoverably corrupted filesystems" with reiserfs.
ahem. You really didn't look very hard, did you?
filesystem corruption (2.4.18, reiserfs)
Bug#122230: reiserfsprogs: filesystem corruption with reiserfs
Re: ReiserFS / 2.4.6 / Data Corruption
ReiserFS desaster - advice please !
and about 829 other matches. Need I go on?
Oh, BTW, as I noted, two of those systems didn't belong to me, they belonged to people I know who experienced similar difficulties (and documented them as well).
Enough people, of enough diverse walks of life, are having issues like this with Reiserfs that it is clearly not something that is safe to be deploying in a production environment. Even if only 1% of the people using it are being so bitten, that number is way too high (and based on my own experiences and those of several people I know, I suspect that number is a lot higher than 1 per cent).
The Future of Human Evolution: Autonomy
MySQL (and SQL in general) is a great way to store large amounts of data that later need to be searched in some obscure way. And with the addition of full text search to MySQL you can do queries that return possible matches, not just exact or wildcard matches.
Check out http://qvcs-guide.sourceforge.net
I don't think "unix mail" is all that useful a handle, but if I was going to use it I'd be referring to mail that stayed on unix hosts - usually in mbox format - as opposed to mail downloaded to user PCs with unknown operating systems.
Corporations and other profit-making legal entities can't dedicate specific PCs to single users cost-effectively in most situations, and they certainly can't effectively manage storage and back-up email stores if the Email messages are scattered over many failure-prone end-user hard drives. IMAPv4 and whatever the proprietary boyz are shopping this week purposely keep the email on the server, so that evidence can be extracted (or destroyed, if you work for Enron) from server backups, and so that filtering and surveying of mail data is easily possible.
For example, some corporations sweep their drives for return & delivery receipts over a month old and delete them.
Another example, corporations doing highly sensitive government contracts will sweep their email stores for classified information leaks.
Another example, I need to get my Email regardless of whether I'm on my laptop at a remote site, at my desk in town, or at home tunneled through SSH. Downloading it to one of these boxes makes it inaccessible to the others.
The list goes on, but basically downloading email to a local drive is primarily for AOL users and basement hackers. That being the case, your points about maildir are excellent - let the filesystem handle most of the details. I'd add that if you must run a db for speed reasons (such as a subject line db used by an IMAP server) do it so that it can be deleted and/or recreated on the fly from the contents of the maildir. No need to create additional dependencies.
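A minimal sketch of such a disposable index, assuming Python's standard `mailbox` and `sqlite3` modules. The function name and the `subjects` table are invented for this example; an IMAP server would keep more fields than these.

```python
import mailbox
import sqlite3


def rebuild_subject_index(maildir_path, db_path):
    """Drop any existing index and recreate it purely from the Maildir contents."""
    db = sqlite3.connect(db_path)
    db.execute("DROP TABLE IF EXISTS subjects")
    db.execute(
        "CREATE TABLE subjects (key TEXT PRIMARY KEY, sender TEXT, subject TEXT)"
    )
    box = mailbox.Maildir(maildir_path, create=False)
    for key, msg in box.iteritems():
        db.execute(
            "INSERT INTO subjects VALUES (?, ?, ?)",
            (key, str(msg.get("From", "")), str(msg.get("Subject", ""))),
        )
    db.commit()
    return db
```

Because the Maildir stays the only authoritative copy, losing or corrupting the index costs one rescan rather than any mail.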
...seeing as how he's a camel developer for Evolution. And the reason the RFCs are unreadable is because they use words like "pedagogical" (and byzantine grammatical structures) not because MIME is complicated.
I've also had similar experiences, though all with one server. Of course, it's the server in co-lo which is difficult to get to. Over the last year or two that server's been mostly happily camped out in co-lo (it was put in with 2.4.0-testsomething) using reiserfs for /home. Now on three separate occasions I've come across files that cannot be accessed, cannot be deleted, and sometimes cannot be seen even by root.
Actually, I lied. It's been two separate servers as the hardware was completely swapped out once because of random crashing and other instabilities. Now in the last couple of months it started randomly crashing again. This time I was noticing occasional "access beyond end of device" in the syslog. After the last crash and hard reboot I ran "find /home -exec cat \{\} \; > /dev/null" (read every file in /home and discard the data, to the uninitiated) to find it spitting IO errors on one file. I inspect it as root, and sure enough I can't read it, or even delete it. AND, when I try to access the file the "access beyond end of device" messages show up in the syslog.
There are no IDE drive errors. lm_sensors shows everything within reasonable ranges, and I'm told the system passes a trivial visual inspection.
So I pulled out a nice fresh chunk of space from the LVM pool and made a spiffy new /home ext3 and copied everything over. 5 days so far with no problems, which is sadly an improvement. I'll give it another couple weeks before I invest in a Hans Reiser voodoo doll.
To be fair to the "it works for me, so it's perfect" crowd. I have been running it at home too, /home and /media, and haven't run into this problem. Guess I must not be rubbing it the right way.
- RustyTaco
If you administer a corporate e-mail system, one thing you will find is that your mail system rapidly fills up with multiple copies of the same e-mails, most of them with uncompressed Excel spreadsheets weighing in at hundreds of kilobytes of wasted space.
Furthermore, if you store these things in databases or mbox-type flat files, you also find that your "incremental" backup tapes fill up with the same stuff.
One file per e-mail solves part of this problem. One file per MIME part would probably do it even better.
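One way to read "one file per MIME part" is to name each decoded part after a hash of its contents, so the same spreadsheet sent to fifty people lands on disk once. A sketch under that assumption follows; the function name and the choice of SHA-256 are mine, and a real store would also need reference counting and an atomic write.

```python
import hashlib
from pathlib import Path


def store_part(store_dir, payload):
    """Store a decoded MIME part under its SHA-256 digest and return the digest.

    A part that is already present is not written again, so an incremental
    backup only picks up content it has not seen before.
    """
    digest = hashlib.sha256(payload).hexdigest()
    path = Path(store_dir) / digest
    if not path.exists():
        path.write_bytes(payload)
    return digest
```

Each message file then only needs to record the digests of its parts.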
Sure, you can do the same thing with databases and fancy backup strategies, but why bother? If file systems aren't adequate to the struggle, use better ones. (Anyway, I'm not convinced of that -- if you're really concerned about inodes, change the setting on the partition which holds the mail. If you're concerned about the time it takes to read linearly through a directory, use directory trees.)
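The directory-tree suggestion can be sketched like this: derive a couple of short subdirectory names from a hash of the message's file name, so no single directory grows very large. The function name, the two-level layout and the use of SHA-256 are illustrative assumptions, not anything Maildir itself specifies.

```python
import hashlib
from pathlib import Path


def sharded_path(root, filename, levels=2, width=2):
    """Map a message file name to root/ab/cd/filename using a hash prefix."""
    digest = hashlib.sha256(filename.encode("utf-8")).hexdigest()
    shards = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return Path(root).joinpath(*shards, filename)
```

With two levels of two hex digits each, messages spread over 65,536 leaf directories, so even millions of messages leave each directory short enough to scan linearly.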
Databases always make me cringe. The number of times I've failed to restore Outlook mail files after they have been incompletely transferred over a network has convinced me to never again even think about database mail storage. Use a database for indexing if you really want to (although IMHO an SQL server is a ridiculous extravagance and waste of cycles for a database which will comfortably fit in less RAM than a typical screensaver), but make sure you can rederive it from the original data.
Breaking it down into single files makes everything simpler (starting with locking and going up from there.)
Probably too late to bother contributing and I'll bet all these points have been made already anyway, but I feel passionate about this. So there.
Rici
For some reason, no one seems to ever know what they're talking about on this subject. (*sigh*)
Evolution uses things other than mbox. In fact, you'd be wise to choose Maildir with Evo: aside from not dealing with the flaws of mbox, it can be much faster. (see the Evo archives)
No they are not. The parent post is correct.
In fact, I believe all the inodes are created when you create your filesystem, all space is mapped to an inode (though of course one file can use multiple inodes).
What you believe has nothing to do with reality. I suggest you take an OS course. Or read up on how Unix filesystems work.
It's usually said that if you have 4k inodes, you'll lose 2k (on average) per file.
There is no such thing as a "4k inode". You got your terminology wrong. You are thinking about blocks. On average, you waste 1/2 the block size for each file on your filesystem, since the last block is, on average, half-full. An inode is not the same as a block! They are two completely different things, which is why your entire post makes no sense. Think of an inode as a "file header". I don't have time or energy to post the full description but I already mentioned where you can get relevant information.
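The half-a-block figure is easy to check with a little arithmetic. The sketch below assumes a plain block-allocating filesystem with 4096-byte blocks and no tail packing; the function name is made up.

```python
def slack_bytes(file_size, block_size=4096):
    """Bytes left unused in the last block allocated to a file of file_size bytes."""
    return -file_size % block_size


# Average the waste over every size that fits in one block (1..4096 bytes).
sizes = range(1, 4097)
average_slack = sum(slack_bytes(s) for s in sizes) / len(sizes)
```

Here `average_slack` comes out at 2047.5 bytes, i.e. about half a block per file, and the inode count does not enter into it at all.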
___
If you think big enough, you'll never have to do it.
The poster of the article just assumes that filesystem must be slow when working with 1000+ files per directory and we need a database to save us. That's nonsense, from my experience.
Apart from that, there are some very important reasons why maildir is much better than a DB. With maildir you can use standard Unix tools to manipulate your email. With a DB you can't do that. Mailbox corruption is not a problem with maildir -- even if corruption were to happen it would be limited to one message, or a small number of messages (not even a mailbox). With a corrupted DB storage, you lose everything -- all the mail of all the users in all the mailboxes. Ask an Exchange admin about it some time.
___
If you think big enough, you'll never have to do it.
I doubt I'll be alone in the opinion that the above discussion kinda emphasised the need for a particular method of access being tailored to your needs, and not everyone's needs being the same...
Do you know WHY?
Because the format is rather extensible, and adding/removing/rewriting headers when you don't know how many of each there are supposed to be isn't such a good idea (think about the "Received:" headers, for which there must be several formats, perhaps as many as a dozen). The X-headers are another kind of hard-to-mess-with "content". That leads the MDA and MUA to use the same format, to minimize the number of operations on each email, etc. (They don't NEED to use the same one: sylpheed converted an mbox I had into mx (or mbx), which was nice, if unexpected, yet not wholly what I wanted.)
Now what does that have to do with anything? Well it's no coincidence that most MUA(mail user agent aka mail client) REUSE the work of the MDA as much as possible... It's easier to have high performance when you don't have to do anything... Now most people only use ONE mua to read all their mails... But most systems administrators manage servers where not everyone uses the same MUA (yes if you're a sysadmin I'm preaching to the choir...) Locking and compatibility become important... otherwise you have to remember joe wants his mail in mbx format and dave wants it in maildir, so you can deliver to them... hence the lowest common denominator has its advantages... and why do work when the client will make you redo it(think people with enough procmail rules to consume 1 minute of cpu per incoming message)
Now think... all of this has to do with the link between mua and mda formats... What's the future for storage of emails? well if you are writing a client and have access to something like camel, which lets you choose the format as you see fit... You sure aren't hurting your chances, are you? You KNOW everyone likes their email du jour different...
Now what does that mean for servers? Well I can see the mailfront project (where the "front end" or "customer facing" part is separate from the "back end" or processing unit, allowing one to basically mix and match, or at least to integrate separate approaches more easily) as an approach with lots of future...
What does that change? Well for having tried alternate file systems and alternate mail"drop" formats a lot in recent months, I can tell a smart sysadmin will want to choose the filesystem and mail"drop" format together... as an optimisation measure... Lots of people don't seem to like maildir... on e2fs... Where it's not at its best... But put it on reiserfs... and it flies.. why? Simple... The filesystem is a data retrieval method... and your mail"drop" is a database of sorts... Would you just pick the database that comes with your operating system, because it comes with the operating system... with no thought to size or performance or contention or locking? I know I wouldn't... Now databases et al... are all good ideas... for the right needs... Does everyone need the same mail server? No... I use courier, on reiserfs... it does what I need...
For a larger setup... copying the headers for indexing purposes is a good idea... IF you search your email a lot... Which is why it makes sense for evolution to do it... most people don't search email ON THE SERVER... they search a local copy... (or hopefully cache some of the metadata instead of brute-force download all messages of a mailstore...) Does it make sense on a pop-toaster? Probably not, most people don't "Keep" mail long enough for it to make sense... But some do... And it probably was how the original idea of exchange/domino/etc... developed... a database you subscribe(as in publish/subscribe, not as in cash) to... that gives you access to your email/meetings/etc...
From the namesys project's web page, it appears some people are working to integrate reiserfs into maildir to a greater degree, allowing more efficient searches, headers stored as attributes, etc... All lovely ideas... for the right client...
The same with embedded databases(bdb, gdbm, cdb, etc...) or generic relational or object databases... For some people they make a lot of sense... For the pop toaster kind of setup, it seldom makes sense: the "end users" don't appreciate the kind of work involved into making searches fast(most of them don't search too often "live" over hundreds of folders in a webmail type of situation for example).
Of course in a smaller office, with say a pair of email power users with a gig or two of emails for "data mining" purposes such databases might mean the difference between life and death...
On that note mbox is probably fine for up to a hundred messages... if a bit slow... maildir might get scary after 100000 messages(especially on e2fs, inode vs directory table considerations...)
Does it make sense to spend lots of work on performance vs compliance to standards vs interoperability? Depends on patterns of access, installed base, usage metrics and other such considerations... But email is a tool... Like all tools, what's important is: What are you going to use it for today?
If you are looking at a file system as a hierarchical structure, why can't you have more than one such table?
The idea being that some mail clients would be only in the "person" tree, and that others would only be in a "function" tree. One could then be given access to both the person and function trees, and shunt mail between them for others to see.
The other thing that we should do is do things that encourage the use of these things. Make the tools for doing this easier to use and understand, and make the concepts easier to grasp.
OS/2 - because choice is a terrible thing to waste.
Why use XML anyway? Why not a format that is optimized for mailboxes? Put another way, what's the advantage of XML over mbox format?
Just thought I would throw in my two cents. I manage the mail server for a company that has 10,000 mailboxes and handles about 100,000 email messages a day (that's minus SPAM because our SPAM filter stops 20,000 emails a day). Before I was able to convince TPTB that Linux is the best solution all around for a server, we were using NTMail. NTMail uses a format similar to mbox but also has an idx file that contains an index. NTMail finally got to where it couldn't handle the load so I moved us up to SendMail on a RedHat system.
We were using EXT2 for the filesystem and IDE drives with a software RAID. Although we never had any corruption problem, some of the larger mailboxes did take a while to open (10 seconds max). The processor load average also went up and the whole machine slowed down (not too bad tho) when a large glob of emails came in at once.
We have finally upgraded to a Dell PowerEdge 1650 (with one processor) and hardware RAID SCSI drives. For the filesystem, I used XFS because it is a journaling filesystem and has at least the performance of ReiserFS. We are also using the RAV antivirus milter (The Most Affordable Virus Scanner for Linux Mailservers, for anyone not using a virus scanner on your mailserver). Our new server is very fast, even under high load. We have not had ANY corrupted mailboxes (except one belonging to a user who accessed it through IMAP and POP3 at the same time). I personally don't believe it is the format that needs changing; what needs changing is the hardware and software choices, so that they scale to the growing amount of email, given that email use is growing faster than any other internet service. Picking the right hardware and filesystem is a must, as well as a properly configured mailserver. But just because you are having file corruption problems or the server is taking a long time to access your mailbox is no reason to go back and totally rewrite the standard. Why are the mailboxes being corrupted? Why does it take so long to open a mailbox? That is when you get to the root cause, rather than trying to go around it.
Unfortunately, the standard way to move MIME encoded mail from one system to another is to mail it. This is *not* a good idea.
Suppose you are changing to a new computer (Linux to MacOS X for example), and you want to keep your email. Or suppose you are changing jobs.
Imagine emailing thousands of messages to yourself just to move them from one machine to another... If your former employer followed the prevailing advice here of locking up mail on a server, then this is the ONLY way you can keep your email.
If you kept copies of your mail locally, then you can burn the mail archive to CD, but since your mail is still in some client-specific format, you must install the same mail client on your new machine. Perl help you if the mail client software does not run on it.
You can have it good, fast, or cheap. Pick any two.
courier-imap doesn't use a non-standard format.
see the maildir spec.
we used to call this idea cc:Mail...
Notes/Domino could (and can) also be configured to use single-store. No one uses single-store in Domino because there used to be a dearth of backup tools for this configuration, and old habits die hard.
Having used DEC Notes and Lotus Notes, however, I can't really see any obvious connection between the two. Except, maybe DEC sold the name to Lotus...
DEC Notes was designed primarily as an online, threaded conferencing system. A bit like /. really, but on text terminals. Not very many sites still running Notes, but it is quite a neat little system. AIUI, it also supports message replication over a network using DECnet. Don't know how the messages are stored.
On the mail front, VMS initially (V2.0?) stored mail in flat text files, with form feeds between messages. Later versions of MAIL stored the mail in an indexed file; messages under 2,048 bytes are stored directly in the MAIL.MAI file; messages over that size are stored in external files of the form MAIL$nnnnnnnnnnnnnnnn.MAI, where nnnn...nnnn is the time of receipt in hexadecimal with a resolution of one hundredth of a second. The file organisation is handled by the file system. [Well, sort of. It's handled by RMS, the Record Management System, which isn't part of the filesystem per se, but is an adjunct to it. Like the name says, it manages records in files, not files themselves. The two are independent; the file system merely has to provide space for RMS to use to store file metadata. In situations where this is not possible (such as an ISO 9660 CDROM), the ACP (Ancillary Control Program) for the file system usually just makes a guess and fakes the RMS information: this is one of the major problems with trying to support foreign file systems in VMS...]
Mail stored in the old, flat-file format can still be read by MAIL in the current version of VMS (7.3). [Backwards compatibility has always been one of VMS's strong points...]. IMAP and POP3 servers are available (commercial and freeware) to allow your mail to be read from the platform of your choice.
One of the nice advantages of the way VMS mail works is that it is easy to merge mailboxes from two systems together. Just copy all the files into the same directory (the second MAIL.MAI becomes MAIL.MAI;2 thanks to version numbering), RENAME MAIL.MAI;2 MAIL_TO_BE_MERGED.MAI, then MERGE MAIL.MAI MAIL_TO_BE_MERGED.MAI.
Of course, in Unix you can just cat two mbox files together <g>
No doubt this post will be vigorously flamed for even mentioning the "legacy" operating system VMS, never mind explaining some of its mysteries :-) But, hey, I like it, even if no one else does these days. It has some really neat features and is incredibly stable (Until I get my hands on it!).
-Malcolm (a VMS sysadmin).
Sen vord is thrall and thocht is fre,
Keip veill thy tonge I conseill the.
I wonder...
would a tarred maildir decrease the number of disk reads (renames would be trickier, but possible) and inodes, or would the tar overhead be greater than that of the filesystem?
I see the problem here. You are attempting to use Evolution when the mail client you were actually wanting to install is called "mutt".
If you don't like GNOME and GTK+, for the love of pete don't use a mailer that says in big flaming letters "I am a GNOME program!".
News for Nerds. Stuff that Matters? Like hell.
Are you on crack? Calling Exchange's "groupware features" anything but an utter joke is absurd. They're still trying to catch up to what Lotus has been doing for years, and they aren't doing a very good job of it.
If you just want to run email, Exchange/Outlook is fine. If you want a collaborative groupware solution with work flow built in, Domino/Notes is the only answer, currently.
Plus, Domino runs on Linux, AIX, Solaris, NT, 2000, OS/2, AS/400... The list goes on and on. As far as a shared database, just set up shared mail.
Not to mention, unlike Exchange, when one mail database gets hosed your whole server doesn't get scrapped. And you aren't supporting Microsoft.
Mention that Linux sucks, as it designed for and by Communists, homosexuals, and other undesirable sub-groups. Explain that computers cost money, thus software should cost money. Give directions to barbers, stores that sell soap, and churches. Make it clear that you have no problem with Unix, nor do you hate freedom. Finally, explain the role of currency in society.
Writers imply. Readers infer.