It's nice to see the Mozilla one is very similar to the ASL file requester (you can't tell me someone on the design team never used the ASL one). The only thing missing is assigns, for shortcuts.
The new GTK one looks awful: you gotta pop up a box to use the keyboard? Especially if you use focus-follows-mouse, that makes it essentially unusable without a lot of mousing around.
Note that he was testing 1.5; in 1.5 you can't change the format of internal folders. They are all stored as mbox. We changed a lot of the internal architecture, and they had to be all one format, or all another format. We chose mbox, although personally I put a vote in for maildir. There may be a way to change the format in the future, but currently there is not.
You can still set up a new 'account' which points to any part of your filesystem and can have maildir, mh, or mbox files, and just access them directly.
For message threading, I was out-voted again, and the subject isn't included in threading, which can lead to broken threads when people reply with some mailers. You can re-enable the fall-back to subject threading by setting the (undocumented) gconf key /apps/evolution/mail/thread_subject to true (it's only used as a last merge-threads stage).
File a bug report about the too many open files thing, although with the dynamic nature of a multithreaded application, it may be easier to up your open file count in your kernel. Doesn't 2.6 address this anyway?
Well it seems at least the netfilter guys are serious about their GPL'd stuff, what about other kernel guys?
When I bought one of these FlashTrax things I was surprised to see in the 'about box' that it ran on Linux. A query about getting the source came back with an 'our software engineering partners are considering releasing parts of the code as source', and nothing more...
Who needs to be contacted/go to action about this?
Yeah it's a real pity about Disney doing it. Well, Americans in general will have to screw it up, they always 'dumb down' anything they don't grok, so the quirky Englishman-edness of it all will end up being watered away to something barely recognisable, no doubt.
Just see how they butchered Doctor Who for a perfect example of exactly the same issue.
I was really disappointed with the dubbing of Spirited Away too. Disney had to Americanise all that 'hard to understand foreign stuff' going on in it - but at least you could watch it with the subtitles, which had much better dialogue.
A pity, the story is so cool, but on the other hand it might be an example of a story that will never translate to film properly because of the imagination involved (not that anyone would recognise that anymore).
Also, here's a duplicate code report, thanks to CPD. I like the comment on the first duplicate code chunk:
/* sigh, so much for oo code reuse... */ /* FIXME: put in a function */
Ahh so true.
I tried, yet failed, to avoid such issues, but... what can you do, eh? It was all written in a bit of a rush.
As an occasional visitor to the USA, I have to say how frustrating it is trying to communicate with people who can't understand anything but a mid-USA accent.
Maybe it's time you lot realised there's a whole other world out there, and tried to communicate with it. Other people deserve to eat too.
I've been working from home most of the time for the last 3-4 years.
And I'm kind of sick of it. It's not good. It's very depressing. It's impossible to meet people. It is terribly lonely. Find an office to work in and get a job there instead.
Of course it is illegal for some people to discriminate. For instance, my employer will happily fire me 'cause I'm a guy with long hair. "I will not hire you because you have long hair".
Why is that legal? Simply because it is not illegal. It MAY be illegal in certain places, but I know of none.
Well it's not legal in any decent country. Unfair dismissal. An employee has a right to their job; they're not slaves.
Yes it will need to be approved by Austel. Anything you plug into the phone line physically needs to be approved. Primarily it is for electrical isolation, safety, and interference reasons.
However, an acoustic coupler, or just using VoIP through an approved modem, does not require any 'approval', so it's not a practical limitation.
Australia hasn't a so high crime rate or shouldn't fear about being attacked by terrorists like other countries.
Not if you believe John "Dubya" Howard, who seems to be hell bent on positioning Australia as far in front of the US as he can, and making terrorism against Australia a reality.
Re: Frequently Asked User Interface Questions (on Inside Ximian) · Score: 1
That's mostly IMAP, which IMHO sucks, but we haven't had the resources to fix it yet (I've been working on a replacement but my boss would have a fit if I did it in work time). Let alone the tree widget scalability issues, which the author doesn't seem to want to recognise.
Not sure if the tree widget honours themes either, so that's the colour problem.
FWIW, POP3 and getting local mail is *much* faster with the 1.2 development code (POP3 over slow modem links benefits the most, and indexing is much faster too).
Get off the "pirate" crap, the music is shit (and overpriced, esp in USA, thanks to your protective trade policies), and that's the real reason nobody is buying much of it.
Large corporations (the real pirates) making carbon copies of the latest plastic fad, trying to guide the public tastes, and mostly just getting it plain wrong.
The only guy I know who copies stuff all the time, copies movies just as much as music. And I can't imagine him with a sword cutting your legs off - some pirate.
It might be kind of hard to realise for you damn yanks, but America isn't the centre of my, or a lot of people's universe.
How big is Texas? That's like a kind of little state isn't it? It's not even as big as the puny Victoria (22 million hectares?)? A mere spot on the map? A crappy dry shithole of a spot at that.
Write code and test it as you write it: a combination of bottom-up and top-down development and incremental testing. At this level, the test cases should be trying to break everything, particularly if you're dealing with any kind of external input (i.e. anything NOT in memory that you didn't put there yourself). Remove files, fill them with binary, make them unreadable, etc. This sort of stuff can be the basis of, or even be, the regression test suites that can be used for automated testing.
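A rough sketch of that kind of hostile-input test, in Python. The `count_messages` function and the file names are hypothetical, made up purely for illustration; the only point is that a removed file and a file full of binary junk should both give a defined result rather than a crash.

```python
import os
import tempfile


def count_messages(path):
    """Hypothetical function under test: count mbox-style "From " lines.

    Returns 0 for anything it cannot open instead of crashing.
    """
    try:
        with open(path, "rb") as f:
            return sum(1 for line in f if line.startswith(b"From "))
    except OSError:
        return 0


def run_hostile_input_tests():
    """Throw broken external input at the function and collect the results."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        # Case 1: the file has been removed (it never existed here).
        results["missing"] = count_messages(os.path.join(tmp, "gone"))

        # Case 2: the file is full of binary junk (every byte value, repeated).
        junk = os.path.join(tmp, "junk")
        with open(junk, "wb") as f:
            f.write(bytes(range(256)) * 16)
        results["binary"] = count_messages(junk)

        # Case 3: a well-formed file, as a sanity check.
        good = os.path.join(tmp, "good")
        with open(good, "wb") as f:
            f.write(b"From a@example.org\n\nhi\nFrom b@example.org\n\nyo\n")
        results["good"] = count_messages(good)
    return results
```

The 'make them unreadable' case is left out of the sketch because a chmod-based test behaves differently when run as root.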
If I write a particularly tricky bit of code (which sometimes you just have to, for performance or something), I separate it and write it completely separately so I can test as much of it as possible before it gets into a bigger system. Faster turn-around with changes helps with more-or-less prototype code like this.
If your API says 'you can't pass a null here', or 'must be between 1 and 7', put an assertion inside that invocation and let the computer test it for you at runtime.
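For example, a minimal Python sketch of a contract checked by assertions. The function `weekday_name` is hypothetical; it just stands in for any API with a 'not null' and 'between 1 and 7' rule.

```python
def weekday_name(day):
    """Hypothetical API: `day` must not be None and must be between 1 and 7."""
    # Let the computer check the documented contract at runtime.
    assert day is not None, "day must not be None"
    assert 1 <= day <= 7, "day must be between 1 and 7"
    return ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")[day - 1]
```

Note that Python drops `assert` statements when run with `-O`, so this suits development and test builds rather than validation of untrusted input.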
Tools like gprof and quantify let you know if your test cases are touching all the cases. If you have exception cases, try and write test cases that will touch all the different exception cases at least once. A bunch of bugs I come across are things that wouldn't normally happen, but do. Sure, the program might be going down the toilet already, but it's better it keeps going down than floods the bathroom floor.
As the scale of the part of the system under test grows (i.e. from unit to module to system testing), it's usually harder to exercise individual cases, so the tests change focus a bit, e.g. to usability (just because dialogue a works great, and dialogue b works fine, what happens if a opens before b and you close a first, etc.), scalability (10 things might be fine, 1000 might be unusable), and functionality (did you hook up a menu to run function y?).
Some things you can fully automate, others you need someone to operate.
If it's something that needs to run in multiple environments, test in as many of them as you can. Before you beta test. Just as it's bad to get a low-level API type bug exposed by your QA testing team, it's much worse for it to reach your beta testers.
And yeah, the best way is to write and maintain quality *DESIGN* from the start. It makes finding and fixing bugs easier, and adding new features isn't like walking a tightrope with glass slippers on.
I lost the plot half way through this, but here's some food for thought anyway. Now I should get back to work...
Z
I think that this is looking for the solution to a problem that doesn't really exist in the first place. Although I guess it depends somewhat on what you define as 'Unix mail'.
I'm a developer on Evolution, and primarily on Camel, Evolution's email library. I'm not sure I'd rave about it (although I think Camel is a mostly beautiful piece of code;), but it works reasonably well, and we've had a chance to try and deal with users with lots of email.
What IS 'Unix mail'?
I would define Unix mail as mail (RFC 822 format) downloaded and stored locally on a per-user basis. IMAP, Exchange, and other remote protocols are very different beasts.
Why are DBMS's not suitable for 'Unix mail'?
Once you have a remote server you have to do things differently than if you have local access. Using a DBMS, and having a trained administrator to manage it, are practical considerations, as are the benefits you might get from this configuration. These solutions don't really make sense for standalone users. They shouldn't need to install and manage databases, complex backup procedures, and so forth, just to read their email.
i.e. RDBMSs are:
hard to set up
hard to maintain
another major point of failure
If, however, I was to design a multi-user groupware server, then a DBMS would come into serious consideration - at the backend at least. It allows you to do things like easily consolidate authentication outside of the operating system (the idea of having a 'shell account' to access mail is somewhat outdated), and it allows you to save space by storing common data, like attachments and email content, in a single place, and redirecting it to multiple recipients (which is a common practice within organisations). It may be practical to use a mixture: an RDBMS to store textual parts or indices to data stored in a more conventional filesystem.
But even with an RDBMS backend, I would personally probably still stick to IMAP to serve it to actual clients. The IMAP protocol is a bit heavy, but not really that bad, and it serves email; I don't think there's really any need to reinvent the wheel here.
So...
If you define unix mail as I have, and separate it from a *mail server*, then you rule out full blown RDBMS's, and are left with:
single file database
multiple file database
I'm not even going to mention XML because I think it is the single most stupid idea anyone's come up with. It is completely unsuitable for this purpose.
And well, there's really no reason not to use MIME to store the messages. MIME already does everything you can possibly do with email (since, uh, it is how the email *will* be sent), any client will already have to deal with it, and mime decoding is for the most part really quite simple and fast anyway. Translating the mime format into some other storage format really doesn't make sense.
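As a rough illustration of how little work this is for a client, here is a sketch using Python's standard `email` package (not Camel, whose API isn't shown here). The helper names `list_parts` and `build_sample` and the sample addresses are made up for the example.

```python
from email import message_from_bytes
from email.message import EmailMessage


def list_parts(raw):
    """Parse a raw message; return (content type, decoded size) per leaf part."""
    msg = message_from_bytes(raw)
    parts = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no payload of their own
        # decode=True undoes the transfer encoding (base64, quoted-printable).
        payload = part.get_payload(decode=True) or b""
        parts.append((part.get_content_type(), len(payload)))
    return parts


def build_sample():
    """Build a small message with one binary attachment, purely as test data."""
    msg = EmailMessage()
    msg["From"] = "a@example.org"
    msg["To"] = "b@example.org"
    msg["Subject"] = "sample"
    msg.set_content("hello\n")
    msg.add_attachment(b"\x00\x01\x02\x03", maintype="application",
                       subtype="octet-stream", filename="blob.bin")
    return msg.as_bytes()
```

Calling `list_parts(build_sample())` gives one entry for the text part and one for the attachment, straight from the stored MIME bytes with no intermediate format.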
single file databases
mbox
Mbox is a single file database. It's just that everyone that uses it generally writes their own access code. This is where problems with 'locking' come about, either because the underlying filesystem doesn't support it properly (e.g. some NFS implementations), or because everyone's clients don't use the same locking mechanism. This is really just an implementation issue anyway. There would be nothing to stop someone writing a common 'mbox.db' library that stored everything in completely compatible mbox files, which took all the work out of it, and then you'd have an mbox DBMS...
mbox scales OK: without any caching of header information it handles in the order of 2K messages in an interactive timescale, and quite a lot more if you don't mind some short delays (i.e. in the order of the time it takes Mozilla to start up).
Appending and reading are quick, and reliable - assuming the filesystem works, which is a pretty safe assumption to make. This is assuming the mailbox is first summarised at first opening, otherwise looking up messages can be slow, because you have to scan the whole file first.
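A minimal sketch of that summarising pass, in Python. It is not Camel's summary code; the function names are hypothetical, and it assumes messages start at lines beginning with "From " and that such lines were quoted inside bodies when the file was written.

```python
def summarise_mbox(path):
    """Scan an mbox file once and return a list of (offset, length) per message."""
    starts = []
    offset = 0
    with open(path, "rb") as f:
        for line in f:
            if line.startswith(b"From "):
                starts.append(offset)  # a new message begins here
            offset += len(line)
    ends = starts[1:] + [offset]  # each message runs to the next start, or EOF
    return [(s, e - s) for s, e in zip(starts, ends)]


def read_message(path, entry):
    """Fetch one message with a single seek, using a summary entry."""
    offset, length = entry
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)
```

After the one full scan, any message can be read with a seek and a read, which is why lookups are only slow when there is no summary.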
The only operation that is slow is expunging messages, and in the worst case it isn't really any slower than copying a whole file across to another file.
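To show why the cost is roughly one file copy, here is a sketch of an expunge in Python. The function name, the `keep` callback and the temporary file suffix are assumptions for the example, and no locking is done, which a real client would need.

```python
import os


def expunge(path, keep):
    """Rewrite an mbox file, keeping only messages for which keep(i) is true.

    Surviving messages are copied to a temporary file next to the original,
    which is then renamed over the top. Anything before the first "From "
    line is dropped.
    """
    tmp_path = path + ".expunge-tmp"  # hypothetical temporary name
    index = -1
    copying = False
    with open(path, "rb") as src, open(tmp_path, "wb") as dst:
        for line in src:
            if line.startswith(b"From "):
                index += 1
                copying = keep(index)  # decide once per message
            if copying:
                dst.write(line)
    os.replace(tmp_path, path)
```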
The only other issue is agreement on the 'standard' for what constitutes an mbox file. For example, Solaris uses and honours the 'Content-Length' header, and thus it does not translate any lines beginning with "From " into the conventional ">From ". Some mail clients translate "(>*)From " into ">\1From " (using sed syntax) and vice versa, others do not. There is no standard, just some conventions, some of which aren't easy to determine either.
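The reversible variant of that quoting convention (sometimes called mboxrd) can be sketched in Python as below. The function names are made up for the example, and this is only one of the conventions in use, not a standard.

```python
import re

# A line-start run of zero or more ">" followed by "From ".
_QUOTE = re.compile(rb"^(>*From )", re.MULTILINE)
# The same, but with at least one ">" in front to strip.
_UNQUOTE = re.compile(rb"^>(>*From )", re.MULTILINE)


def quote_from(body):
    """Add one ">" to every body line matching ">*From " when storing."""
    return _QUOTE.sub(rb">\1", body)


def unquote_from(body):
    """Remove one ">" from every body line matching ">+From " when reading."""
    return _UNQUOTE.sub(rb"\1", body)
```

Because already-quoted lines get an extra ">" too, `unquote_from(quote_from(body))` gives back the original body, which the simpler "From " to ">From " rule cannot guarantee.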
Because you need to keep the whole index in memory at once, this can become expensive, but you could use a secondary database as an index into the real file. But eventually you hit a point where the cost of expunging does get too expensive. You could just archive the mail regularly, or use a format like maildir instead.
gdbm/db/etc
db files wrap the single file in a common api that handles all of the locking issues and access issues for you. Some have different features, e.g. querying capability, logging and transactions, etc.
We've never tried to use db for this purpose, more just because we didn't think it was worth it. All you really get with a minimal implementation is the ability to store and retrieve a blob of data using a single key. Writing is fairly slow because the database has to manage more details for you (locking, allocating blocks, unlocking, etc). You could use multiple db files as indices to perform multiple-key searches, but they are quite slow at creating them (we tried using db for the content indices and it was way too slow).
i.e. even if you store the data in a db file, which gives you a slight benefit of inbuilt referential integrity, you still need to provide additional indices to actually be able to use it in any useful way. Evolution suffers this problem with the addressbook which stores vCards in db records.
Most db libraries (all?) also don't provide any mechanism to stream data. You either get the whole lot into memory, or you get none of it. So for large messages you're limited by memory (well, Evolution is anyway, but it doesn't have to be). Yes, memory is cheap, but it is still a consideration, and it would certainly rule out a simple database in a multi-user environment.
db files are also slower than native files, especially for large objects. You're mapping an arbitrarily sized chunk of data to some 'database blocks', which are then stored in an arbitrarily sized 'database file' which the operating system is then mapping to its 'filesystem blocks'.
multifile solutions
Well I guess this comes down to mh and maildir. mh isn't really suitable for anything, because of its just plain bad design and lack of defined semantics. There's no way to guarantee anything about its operation.
maildir - I like. It moves the scourge of trying to implement a reliable, scalable, multiple access database almost entirely into the operating system layer. Operating systems already do this very well - they manage hundreds of thousands of files randomly written across your disks, without skipping a beat.
No operation requires more than a single message size of data, and the operating system already indexes the message, via its filename. Sure, ext2 doesn't do such a swell job with long directories, but that can be addressed (and the same problem can be addressed on just about any platform). For 'free' you get concurrent multiple-reader, multiple-writer database access, without any of the considerable problems you have to solve to implement it otherwise.
The maildir 'protocol' is simple, reliable, and it works.
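A sketch of the delivery side of that protocol in Python: write the message under tmp/, then rename it into new/. The unique-name recipe and the `deliver` function are assumptions for this example; real delivery agents use more careful naming.

```python
import itertools
import os
import socket
import time

_counter = itertools.count()  # per-process delivery counter


def deliver(maildir, data):
    """Deliver one message: write it fully into tmp/, then rename into new/."""
    for sub in ("tmp", "new", "cur"):
        os.makedirs(os.path.join(maildir, sub), exist_ok=True)

    # Illustrative unique name built from time, pid, counter and hostname.
    host = socket.gethostname().replace("/", "_").replace(":", "_")
    name = "%d.P%dQ%d.%s" % (int(time.time()), os.getpid(), next(_counter), host)
    tmp_path = os.path.join(maildir, "tmp", name)
    new_path = os.path.join(maildir, "new", name)

    # O_EXCL makes the create fail rather than overwrite an existing file.
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())

    # Readers only look in new/ and cur/, so they never see a partial message.
    os.rename(tmp_path, new_path)
    return new_path
```

No lock is taken anywhere: the rename within one filesystem is what makes the message appear all at once.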
Again, it can easily be augmented by a client with additional indices, but things like delivery agents that don't care about existing email don't need to suffer that overhead at all.
Some other comments specific to the question:
Compression. Personally I don't see the point. But a maildir-like structure would fit well with compression. Flat files would be the worst (e.g. mbox), and block-file formats (like db files) would also work well with compression. The good thing about email is it is 'write once', you don't edit or change the messages in the mailbox.
External attachments. I guess it's possible, but again, it isn't really worth it in most cases. Parsing MIME is *fast*. It is much faster than parsing XML, and besides, people rarely look at an email more than once or twice. There isn't much use going off and storing the attachment in a high-performance reading format if it isn't going to be accessed often, and it just places a greater burden on your server.
base64, etc. Well, it's entirely possible simply to store the messages in 'binary' format. Assuming the boundary markers are checked properly, Camel can work with binary encoded mail messages, and probably at least some other mail clients can too. There are some problems with some of the extremely broken OpenPGP/PGP/MIME specs which suddenly say that mail transports aren't allowed to alter the *transport* encodings of some parts, but well, these specs are just braindead, and can be worked around.
Security model. Well, talking about Unix mail, not server mail, the filesystem is adequate.
Shared folders - is not an issue for unix mail.
Unicode. Well you can write Unicode filenames to most Unix filesystems, even if 'ls' doesn't show it right.
MTA. Nothing could be simpler or safer than maildir as a delivery format. The MTA doesn't have to care about any client-side indices, the MUA will simply update them when it incorporates the new messages, etc.
Writing libmailstore? Mate, it's called Camel, and it's already written. Camel already does mbox, maildir and mh, it can read spool files directly (it doesn't create a summary file or build any indexes), and it can talk IMAP and POP, with partial support for NNTP. If someone gave me a decent RDBMS table schema and a carton of pale, I could probably write a MySQL backend in a couple of days, well, assuming the MySQL API is MT-safe.
Finally, some comments on evolution.
Evolution isn't reinventing any wheel. We use standard mbox format (if such a thing really exists anyway). We use standard maildir format, etc. Yes we may optionally create body indices, and we do usually create on-disk binary/compressed 'summaries' of the data, but these are really just on-disk caches of in-memory data structures, rather than anything to do with the mail storage format.
We put mail in another location, but everyone else has done that too, elm:Mail, pine:mail (or is it the other way around?), netscape:ns_mail, etc. At least we now offer the option to read most of this 'in place'.
The main problems Evolution has with scalability are:
indexing.
Indexing is quite costly. The original index code was written somewhat like a database: it handled all internal data structures, used blocks of data, etc. It was slow, and it scaled poorly. Definitely some of the algorithm choices and the implementation weren't that hot, but it shows that such a solution isn't as simple as at first thought. Using libdb was impossibly slow (like several orders of magnitude slower).
The new stuff is a lot better, but can still use a lot of resources while indexing, and copies the whole file (well 2 files) across when performing expunges, but they are only performed occasionally, and the indices are smaller than the original indices, so in practice it scales much much better.
the summaries
The summaries are indices of a sort anyway. They are an in-memory tree of a subset of the information on each message. Enough information to display a list of messages, and perform vfoldering operations. Even though we do some tricks, like sharing common strings, the summary can get very large.
But it's a tradeoff I thought was worth it, rather than using on-disk summaries. The APIs are much easier to use, and the problem gets pushed to the user - if they want to have folders with 100K messages, they should expect it to use a bit of memory. The on-disk size of the summaries is very small too, although I guess it could be made even smaller if we consolidated common strings.
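The string-sharing trick can be illustrated in Python with `sys.intern`. This is only an analogy for the idea, not how Camel (which is C) does it, and `make_summary` and the record layout are hypothetical.

```python
import sys


def make_summary(headers_list):
    """Build in-memory summary records, sharing repeated header strings.

    Each input item is a dict with "from" and "subject" keys. Equal strings
    end up as one shared object, so a mailing list folder with thousands of
    identical senders stores that address only once.
    """
    summary = []
    for headers in headers_list:
        summary.append({
            "from": sys.intern(headers["from"]),
            "subject": sys.intern(headers["subject"]),
        })
    return summary
```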
per-message memory use
Currently, a lot of data gets copied around in memory. Every time you read a message, at least 1 whole copy of the (decoded) message is in memory at a given time (yes, including attachments). For IMAP this can get even worse (2-3 copies of a given attachment at a given time), because it doesn't stream enough. Most of this could use a disk-backing without changing any APIs though, and well, I'm rewriting IMAP.
Wrapping up...
And yeah, we're talking 100K messages here, not 1400. My 500MHz Celeron laptop has about 35K messages stored over about 10 mbox files, and it starts up in under 10 seconds, and that includes all of the bonobo/activation overhead (which is very significant). Yeah it uses a bit of memory, but memory is cheap on a personal workstation.
In short. The current mailbox formats we have suffice for "Unix mail". Add some archiving abilities to your mail client (even RDBMS backed mail clients need archiving), and you'll never have to delete a message again, and still get work done and still use mbox.
If you want to talk about writing a server - well who cares, you can do whatever you want, because everyone has to go through your interface anyway (you DO NOT want clients accessing data under you, that's what DBMSs are all about in the first place... and you don't want 1-tier applications), so it doesn't matter what format you use under the belt - you can choose the format which best suits what you're trying to do.
It seems some people think 1-tier applications (client code talking directly to a database) are the way to go for multi-user environments. They're not: they don't scale and are impossible to maintain. Nobody writes any real software like that anymore, unless they're writing dodgy VB toy apps.
I'll drink to that ...
Actually it's got nothing to do with linking order.
The linker has to have libraries to link to in order to resolve anything - this above is about build order.
Hi guys,
!Z
He didn't lose his job, he left voluntarily for family reasons, which is quite clearly on the public record.
Maybe they could make self-raining skies too, for the drier parts of the world, so the windows don't get too dusty.
I'm surprised they didn't call it terrorism.
couldn't live without it!
Easy, by adding to the pool of available Free Software that all developers can benefit from, not just the privileged few.
There's really no more to it than that.
Oh the speed issue.
Empty trash vs expunge is mostly historical, but a lot of us like it that way so it's stayed for now.
We didn't use to have a trash folder, and personally I dislike it anyway.
As for menu options, most are there because users demand more features. Sigh.
... people might buy more of it.
Well, at least partially.
Write code and test it as you write it, a combination of bottom-up and top-down development and incremental testing. At this level, the test cases should be trying to break everything, particularly if you're dealing with any kind of external input (i.e. anything NOT in memory that you didn't put there yourself). Remove files, fill them with binary, make them unreadable, etc. This sort of stuff can be the basis or even be the regression test suites that can be used for automated testing.
If I write a particularly tricky bit of code (which sometimes you just have to, for performance or something), I separate it and write it completely separately so I can test as much of it as possible before it gets into a bigger system. Faster turn-around with changes helps with more-or-less prototype code like this.
If your api says 'you can't pass a null here', or 'must be between 1 and 7', put an assertion inside that invocation, let the computer test it for you at runtime.
Tools like gprof and quantify let you know if you're test cases are touching all the cases. If you have exception cases, try and write test cases that will touch all different exception cases at least once. A bunch of bugs i come across are stuff that wouldn't normally happen that does. Sure, the program might be going down the toilet already, but its better it keeps going down than floods the bathroom floor.
As the scale of the part of the system under test grows (i.e. from unit to module to system testing), its usually harder to exercise individual cases, so the tests change focus a bit. e.g. to usability (just because dialogue a works great, and dialogue b works fine, what happens if a opens before b and you close a first, etc), scalability - 10 things might be fine, 1000 might be unusable, functionality - did you hook up a menu to run function y?
Some things you can fully automate, others you need someone to operate.
If its something that needs to run in multiple environments, test in as many of them as you can. Before you beta test. Just as its bad to get a low-level api type bug exposed by your qa testing team, its much worse for it to reach your beta testers.
And yeah, the best way is to write and maintain quality *DESIGN* from the start. It makes finding and fixing bugs easier, and adding new features isn't like walking a tightrope with glass slippers on.
I lost the plot half way through this, but here's some food for thought anyway. Now I should get back to work ...
;), but it works reasonably well, and we've had a chance to try and deal with users with lots of email.
...
...
...
... and you dont want 1-tier applications), so it doesn't matter what format you use under the belt - you can choose the format which best suits what you're trying to do.
Z
I think that this is looking for the solution to a problem that doesn't really exist in the first place. Although I guess it depends somewhat on what you define as 'Unix mail'.
I'm a developer on Evolution, and primarily on Camel, evolution's email library. I'm not sure i'd rave about it (although I think Camel is a mostly beautiful piece of code
What IS 'Unix mail'?
I would define Unix mail as mail (rfc822 format) downloaded and stored locally on a per-user basis. IMAP, Exchange, and other remote protocols are very different beasts.
Why are DBMS's not suitable for 'Unix mail'?
Once you have a remote server you have to do things differently than if you have local access. Using a DBMS, and having a trained administrator to manage it are practical considerations, as are the benefits you might get from this configuration. These solutions dont really make sense for standalone users. They shouldn't need to install and manage databases, complex backup prodedures, and so forth, just to read their email.
i.e. rdbms's are:
hard to setup
hard to maintain
another major point of failure
If, however, I was to design a multi-user groupware server, then a DBMS would come into serious consideration - at the backend at least. It allows you to do things like easily consolidate authentication outside of the operating system (the idea of having a 'shell account' to access mail is somewhat outdated), and it allows you to save space by storing common data, like attachments and email content, in a single place and redirecting it to multiple recipients (which is a common practice within organisations). It may be practical to use a mixture: a RDBMS to store textual parts or indices to data stored in a more conventional filesystem.
But even with a RDBMS backend, I would personally probably still stick to IMAP to serve it to actual clients. The IMAP protocol is a bit heavy, but not really that bad, and it serves email. I don't think there's really any need to reinvent the wheel here.
So
If you define unix mail as I have, and separate it from a *mail server*, then you rule out full blown RDBMS's, and are left with:
single file database
multiple file database
I'm not even going to mention XML because I think it is the single most stupid idea anyone's come up with. It is completely unsuitable for this purpose.
And well, there's really no reason not to use MIME to store the messages. MIME already does everything you can possibly do with email (since, uh, it is how the email *will* be sent), any client will already have to deal with it, and mime decoding is for the most part really quite simple and fast anyway. Translating the mime format into some other storage format really doesn't make sense.
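To give a feel for how little work MIME decoding is for a client, here is a minimal sketch using Python's standard `email` package. It is only an illustration, not how Camel does it; the addresses, the subject and the `blob.bin` attachment are made-up sample data.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def build_message() -> bytes:
    """Build a small MIME message with a non-ASCII header and a binary attachment."""
    msg = EmailMessage()
    msg["From"] = "sender@example.org"          # hypothetical addresses
    msg["To"] = "reader@example.org"
    msg["Subject"] = "Gr\u00fc\u00dfe"          # gets RFC 2047 encoded on output
    msg.set_content("Plain text body.\n")
    msg.add_attachment(b"\x00\x01\x02binary", maintype="application",
                       subtype="octet-stream", filename="blob.bin")
    return msg.as_bytes()


def describe(raw: bytes):
    """Parse raw MIME bytes; return the decoded subject and the leaf parts."""
    msg = message_from_bytes(raw, policy=policy.default)
    parts = [(part.get_content_type(), part.get_filename())
             for part in msg.walk() if not part.is_multipart()]
    return str(msg["Subject"]), parts
```

The same bytes that would go over the wire are what gets stored and what gets parsed, so no second storage format is involved.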
single file databases
mbox
Mbox is a single file database. It's just that everyone that uses it generally writes their own access code. This is where problems with 'locking' come about, either because the underlying filesystem doesn't support it properly (e.g. some nfs implementations), or because everyone's clients don't use the same locking mechanism. This is really just an implementation issue anyway. There would be nothing to stop someone writing a common 'mbox.db' library that stored everything in completely compatible mbox files, which took all the work out of it, and then you'd have an mbox DBMS.
mbox scales ok: without any caching of header information it handles in the order of 2K messages in an interactive timescale, and quite a lot more if you don't mind some short delays (i.e. in the order of the time it takes mozilla to start up).
Appending and reading is quick, and reliable - assuming the filesystem works, which is a pretty safe assumption to make. This is assuming the mailbox is first summarised at first opening, otherwise looking up messages can be slow, because you have to scan the whole file first.
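The 'summarise at first opening' step can be sketched roughly as below: one pass over the file records where each message starts, and after that any message can be read with a single seek. This is a simplified illustration, not Camel's summary code; the function names are made up, and it assumes the common convention that a message starts at a "From " line preceded by a blank line (or the start of the file).

```python
def index_mbox(path):
    """Scan an mbox once and return a list of (start, end) byte offsets."""
    starts = []
    offset = 0
    prev_blank = True  # the start of the file counts as a boundary
    with open(path, "rb") as fh:
        for line in fh:
            if prev_blank and line.startswith(b"From "):
                starts.append(offset)
            prev_blank = line in (b"\n", b"\r\n")
            offset += len(line)
    ends = starts[1:] + [offset]
    return list(zip(starts, ends))


def read_message(path, span):
    """Read one message (including its 'From ' line) using the index."""
    start, end = span
    with open(path, "rb") as fh:
        fh.seek(start)
        return fh.read(end - start)
```

A real client would also keep a subset of the headers per message and cache the result on disk, so the full scan only happens when the file has changed.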
The only operation that is slow is expunging messages, and in the worst case it isn't really any slower than copying a whole file across to another file.
The only other issue is agreement on the 'standard' for what constitutes an mbox file. For example, Solaris uses and honours the 'Content-Length' header, and thus it does not translate any lines beginning with "From " into the conventional ">From ". Some mail clients translate "(>*)From " into ">\1From " (using sed syntax) and vice versa, others do not. There is no standard, just some conventions, some of which aren't easy to determine either.
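Expunging as 'copy the file across, minus the deleted messages' can be sketched like this. It is a bare illustration under stated assumptions: `spans` is a list of (start, end) byte offsets for each message that the caller already has from its summary, `deleted` is a set of message positions, and locking is left out entirely.

```python
import os


def expunge(path, spans, deleted):
    """Rewrite the mbox at `path`, dropping messages whose position is in `deleted`."""
    tmp = path + ".expunge"  # hypothetical temp-file naming
    with open(path, "rb") as src, open(tmp, "wb") as dst:
        for i, (start, end) in enumerate(spans):
            if i in deleted:
                continue
            src.seek(start)
            dst.write(src.read(end - start))
        dst.flush()
        os.fsync(dst.fileno())
    # Atomically swap the new file into place.
    os.replace(tmp, path)
```

The cost is proportional to the size of the surviving mail, which is why it only starts to hurt on very large folders.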
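The reversible "(>*)From " quoting convention mentioned above looks roughly like this in Python. This is a sketch of that one convention only (the helper names are made up), not a statement about what any particular client does.

```python
import re

# A body line starting with zero or more '>' followed by "From " gets one more '>'.
_QUOTE = re.compile(rb"^(>*)From ", re.MULTILINE)
# Reading back strips exactly one leading '>' from such lines.
_UNQUOTE = re.compile(rb"^>(>*)From ", re.MULTILINE)


def quote_from(body: bytes) -> bytes:
    """Escape body lines so they cannot be mistaken for a message separator."""
    return _QUOTE.sub(rb">\1From ", body)


def unquote_from(body: bytes) -> bytes:
    """Undo quote_from()."""
    return _UNQUOTE.sub(rb"\1From ", body)
```

Clients that only ever turn "From " into ">From " cannot undo the quoting reliably, because a line that was already ">From " in the original becomes indistinguishable from a quoted one.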
Because you need to keep the whole index in memory at once, this can become expensive, but you could use a secondary database as an index into the real file. But eventually you hit a point where the cost of expunging does get too expensive. You could just archive the mail regularly, or use a format like maildir instead.
gdbm/db/etc
db files wrap the single file in a common api that handles all of the locking issues and access issues for you. Some have different features, e.g. querying capability, logging and transactions, etc.
We've never tried to use db for this purpose, more just because we didn't think it was worth it. All you really get with a minimal implementation is the ability to store and retrieve a blob of data using a single key. Writing is fairly slow because the database has to manage more details for you (locking, allocating blocks, unlocking, etc). You could use multiple db files as indices to perform multiple-key searches, but they are quite slow at creating them (we tried using db for the content indices and it was way too slow).
i.e. even if you store the data in a db file, which gives you a slight benefit of inbuilt referential integrity, you still need to provide additional indices to actually be able to use it in any useful way. Evolution suffers this problem with the addressbook which stores vCards in db records.
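To illustrate the point about extra indices, here is a small sketch with Python's `dbm.dumb` standing in for a db/gdbm-style library (it was chosen only because it is always available; it is not what Evolution uses). The store itself only maps one key to one blob, so finding mail by sender needs a second database maintained by hand. The function names and the JSON-encoded uid list are assumptions made for the example.

```python
import dbm.dumb
import json


def store_message(messages, by_sender, uid, sender, raw):
    """Store the blob under its uid and update the hand-made sender index."""
    messages[uid] = raw
    key = sender.encode()
    uids = json.loads(by_sender[key]) if key in by_sender else []
    uids.append(uid)
    by_sender[key] = json.dumps(uids).encode()


def find_by_sender(messages, by_sender, sender):
    """Look up uids in the secondary index, then fetch each whole blob."""
    key = sender.encode()
    if key not in by_sender:
        return []
    return [messages[uid] for uid in json.loads(by_sender[key])]
```

Every write now touches two files, and each fetch pulls the whole message into memory, which is the lack of streaming described further on.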
Most db libraries (all?) also don't provide any mechanism to stream data. You either get the whole lot into memory, or you get none of it. So for large messages you're limited by memory (well, evolution is anyway, but it doesn't have to be). Yes, memory is cheap, but it is still a consideration, and it would certainly rule out a simple database in a multi-user environment.
db files are also slower than native files, especially for large objects. You're mapping an arbitrarily sized chunk of data to some 'database blocks', which are then stored in an arbitrarily sized 'database file' which the operating system is then mapping to its 'filesystem blocks'.
multifile solutions
Well I guess this comes down to mh and maildir. mh isn't really suitable for anything, because of its just plain bad design and lack of defined semantics. There's no way to guarantee anything about its operation.
maildir - I like. It moves the scourge of trying to implement a reliable, scalable, multiple access database almost entirely into the operating system layer. Operating systems already do this very well - they manage hundreds of thousands of files randomly written across your disks, without skipping a beat.
No operation requires more than a single message size of data, and the operating system already indexes the message, via its filename. Sure, ext2 doesn't do such a swell job with long directories, but that can be addressed (and the same problem can be addressed on just about any platform). For 'free' you get concurrent multiple-reader, multiple-writer database access, without any of the considerable problems you have to solve to implement it otherwise.
The maildir 'protocol' is simple, reliable, and it works.
Again, it can easily be augmented by a client with additional indices, but things like delivery agents, which don't care about existing email, don't need to suffer that overhead at all.
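A delivery in the maildir style can be sketched as: write the message under a unique name in tmp/, sync it, then rename it into new/. The sketch below is a simplified illustration; the exact unique-name format used here is an assumption for the example, not a quote from the maildir description.

```python
import itertools
import os
import socket
import time

_counter = itertools.count()  # distinguishes deliveries within one process


def deliver(maildir, raw):
    """Deliver `raw` message bytes into `maildir` and return the final path."""
    for sub in ("tmp", "new", "cur"):
        os.makedirs(os.path.join(maildir, sub), exist_ok=True)
    host = socket.gethostname().replace("/", "\\057").replace(":", "\\072")
    unique = "%d.P%dQ%d.%s" % (int(time.time()), os.getpid(), next(_counter), host)
    tmp_path = os.path.join(maildir, "tmp", unique)
    # 'x' mode fails instead of overwriting if the name somehow exists.
    with open(tmp_path, "xb") as fh:
        fh.write(raw)
        fh.flush()
        os.fsync(fh.fileno())
    new_path = os.path.join(maildir, "new", unique)
    # rename() within one filesystem is atomic: readers see all or nothing.
    os.rename(tmp_path, new_path)
    return new_path
```

No lock is taken anywhere, and the deliverer never looks at existing mail or at any client-side index.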
Some other comments specific to the question:
Compression. Personally I don't see the point. But a maildir-like structure would fit well with compression. Flat files would be the worst (e.g. mbox), and block-file formats (like db files) would also work well with compression. The good thing about email is it is 'write once', you don't edit or change the messages in the mailbox.
External attachments. I guess it's possible, but again, it isn't really worth it in most cases. Parsing MIME is *fast*. It is much faster than parsing xml, and besides, people rarely look at an email more than once or twice. There isn't much use going off and storing the attachment in a high-performance reading format if it isn't going to be accessed often, and it just places a greater burden on your server.
base64, etc. Well, it's entirely possible simply to store the messages in 'binary' format. Assuming the boundary markers are checked properly, Camel can work with binary encoded mail messages, and probably at least some other mail clients can too. There are some problems with some of the extremely broken openpgp/pgp/mime specs which suddenly say that mail transports aren't allowed to alter the *transport* encodings of some parts, but well, these specs are just braindead, and can be worked around.
Security model. Well, talking about Unix mail, not server mail, the filesystem is adequate.
Shared folders - is not an issue for unix mail.
Unicode. Well you can write unicode filenames to most unix filesystems, even if 'ls' doesn't show it right.
MTA. Nothing could be simpler or safer than maildir as a delivery format. The mta doesn't have to care about any client-side indices, the mua will simply update them when it incorporates the new messages, etc.
Writing libmailstore? Mate, it's called Camel, and it's already written. Camel already does mbox, maildir and mh, it can read spool files directly (it doesn't create a summary file or build any indexes), and it can talk imap and pop, with partial support for nntp. If someone gave me a decent RDBMS table schema and a carton of pale, I could probably write a MySQL backend in a couple of days, well, assuming the MySQL api is mt-safe.
Finally, some comments on evolution.
Evolution isn't reinventing any wheel. We use standard mbox format (if such a thing really exists anyway). We use standard maildir format, etc. Yes we may optionally create body indices, and we do usually create on-disk binary/compressed 'summaries' of the data, but these are really just on-disk caches of in-memory data structures, rather than anything to do with the mail storage format.
We put mail in another location, but everyone else has done that too, elm:Mail, pine:mail (or is it the other way around?), netscape:ns_mail, etc. At least we now offer the option to read most of this 'in place'.
The main problems evolution has with scalability are:
indexing.
Indexing is quite costly. The original index code was written somewhat like a database: it handled all internal data structures, used blocks of data, etc. It was slow, and it scaled poorly. Definitely some of the algorithm choices and the implementation weren't that hot, but it shows that such a solution isn't as simple as at first thought. Using libdb was impossibly slow (like several orders of magnitude slower).
The new stuff is a lot better, but can still use a lot of resources while indexing, and copies the whole file (well 2 files) across when performing expunges, but they are only performed occasionally, and the indices are smaller than the original indices, so in practice it scales much much better.
the summaries
The summaries are indices of a sort anyway. They are an in-memory tree of a subset of the information on each message. Enough information to display a list of messages, and perform vfoldering operations. Even though we do some tricks, like sharing common strings, the summary can get very large.
But it's a tradeoff I thought was worth it, rather than using on-disk summaries. The api's are much easier to use, and the problem gets pushed to the user - if they want to have folders with 100K messages, they should expect it to use a bit of memory. The on-disk size of the summaries is very small too, although I guess it could be made even smaller if we consolidated common strings.
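The 'sharing common strings' trick for an in-memory summary can be illustrated in a few lines. This is a generic sketch using Python's `sys.intern`, not Camel's actual data structures; `SummaryEntry` and its fields are invented for the example.

```python
import sys
from typing import NamedTuple


class SummaryEntry(NamedTuple):
    """Just enough per-message data to draw a message list (hypothetical fields)."""
    uid: str
    sender: str
    subject: str
    date: int


def make_entry(uid, sender, subject, date):
    # Interning means a thousand messages from the same list address (or with
    # the same subject) all point at one shared string object.
    return SummaryEntry(uid, sys.intern(sender), sys.intern(subject), date)
```

The saving depends entirely on how repetitive the folder is; mailing-list folders benefit most.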
per-message memory use
Currently, a lot of data gets copied around in memory. Every time you read a message, at least 1 whole copy of the (decoded) message is in memory at a given time (yes, including attachments). For IMAP this can get even worse (2-3 copies of a given attachment at a given time), because it doesn't stream enough. Most of this could use a disk-backing without changing any api's though, and well, I'm rewriting IMAP.
Wrapping up
And yeah, we're talking 100K messages here, not 1400. My 500MHz celeron laptop has about 35K messages stored over about 10 mbox files, and it starts up in under 10 seconds, and that includes all of the bonobo/activation overhead (which is very significant). Yeah it uses a bit of memory, but memory is cheap on a personal workstation.
In short. The current mailbox formats we have suffice for "Unix mail". Add some archiving abilities to your mail client (even RDBMS backed mail clients need archiving), and you'll never have to delete a message again, and still get work done and still use mbox.
If you want to talk about writing a server - well who cares, you can do whatever you want, because everyone has to go through your interface anyway (you DO NOT want clients accessing data under you, that's what DBMS's are all about in the first place
It seems some people think 1-tier applications (client code talking directly to a database) are the way to go for multi-user environments. They're not: they don't scale and are impossible to maintain. Nobody writes any real software like that anymore, unless you're writing dodgy vb toy apps.
No, we definitely do not need another standard to move mail around.
MIME *is* a transport. MIME *IS* easy to decode. MIME *must* be supported by any email client already.
MIME *is* the solution, it already exists, it supports everything you need (multiple binary attachments, multilingual headers), and it *works*.
XML is *not* a good idea.
Well, unless I drink too much, and even then I usually just fall asleep.
- NotZed (in Australia)
__// `Thinking is an exercise to which all too few brains
Read the manpage, there are 64.
Upper/lowercase alphanumerics (26+26+10=62) plus / and . (+2=64)
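The count is easy to check. A throwaway Python snippet (the name `ALPHABET` is just for illustration):

```python
import string

# Upper- and lowercase letters, digits, plus '/' and '.'
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "/."

assert len(ALPHABET) == 26 + 26 + 10 + 2 == 64
```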