You've Got Mail -- Tons Of It
Daniel Goldman writes "The Baltimore Sun has an article about the City of Baltimore's email problem." A snippet: "Millions of old e-mail messages are clogging Baltimore's municipal computers, so the city is going to start automatically deleting any messages older than 90 days.
A common practice in private business, the move raises questions when made by a municipality, which has a responsibility to retain certain public records." Goldman points out "Just think about all the potential lawsuits; 'if it's not there, they can't subpoena it.'"
This might be a practical use of one: determine which emails are valid and which aren't, like a spam filter. Allow users to flag 25% or so of their emails as important, and archive those.
got sig?
Now we get to subpoena entire hard drives so we can run data-recovery software on them. It would be smart of any operation, public or private, to wait out the statute of limitations (which I realize may vary) of any states with which they have substantial contacts before they start deleting data.
figure out what percentage is spam, and sue spammers to recover damages for lost resources.
can't they, like, just buy a big hard drive and stuff?
If the average message is 10 kB (10,000 bytes, to make the math easier) and compresses down to 3 kB (probably even better if you compress a bunch together), then you'd need roughly 30 GB to store 10 million of them. Can you even buy hard drives that small any more?
Add some search index, throw a crappy web interface on it, and call it a day. Never delete an email again!
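For anyone who wants to check the back-of-the-envelope numbers above, here is a small sketch. It assumes decimal units (1 kB = 1,000 bytes) and the guessed figures from the comment (10 million messages, about 3,000 bytes each after compression); the function name is made up for illustration.

```python
def archive_size_bytes(message_count, compressed_bytes_per_message):
    """Total bytes needed to keep every message, ignoring index overhead."""
    return message_count * compressed_bytes_per_message


# Assumed inputs taken from the estimate above, not measured data.
total = archive_size_bytes(10_000_000, 3_000)
print(f"{total / 1e9:.0f} GB")
```

A search index would add to this, but probably not enough to change the conclusion.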
There are always going to be things like replies to an original question and subsequent follow up questions going back and forth, so normally hanging onto the latest/final reply would be sufficient (providing it had the previous history - clearly showed the conclusion).
Now if they were to use this as an excuse to accidentally lose records, that would be a different matter. This, however, is where auditors should be playing a role to ensure that they are keeping the right records and discarding the rubbish.
"Baltimore officials, who approved the new e-mail policy at a Board of Estimates meeting last month, say they have no choice but to delete old messages, which are slowing city computers to a crawl. They say the system is so overburdened that creating a daily backup has become impossible; there is so much data that it takes more than 24 hours to copy it."
What?!? What's wrong with an incremental backup? Surely all those millions of messages aren't *changing* every day?!?
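A rough sketch of what an incremental pass looks like, assuming the mail store is plain files on disk and that file modification time is a good enough signal for "changed". The function name and the `last_backup_time` parameter are hypothetical; a real mail server with a single database file would need a different approach.

```python
import os


def files_changed_since(root, last_backup_time):
    """Return paths under root modified after last_backup_time (epoch seconds)."""
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Only files touched since the previous run need copying.
            if os.path.getmtime(path) > last_backup_time:
                changed.append(path)
    return sorted(changed)
```

Only the returned paths get copied each night; the millions of untouched old messages are skipped.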
Think of all the children that will suffer from this!!!
This has to be the stupidest approach to the problem. Their networks are too slow, so instead, they're going to have each employee go through their old email and save individually important messages to their local hard disk? Not only are they going to tie up employees with this manual effort, they're also going to lose key documents and a key service - the ability to centrally search and reply to requests for information. In the future, each department will have to search their local hard drives for this information.
They've taken a simple problem of old or improperly specced equipment and turned it into a manual labor solution instead. That's an insane waste of time and salary. They should just upgrade their network and storage. If I can build a 4 terabyte RAIDed PC for a few thousand dollars, they can centralize their mailserver and back it up for, say, a hundred thousand, even with extra redundancy and inefficiencies and admin costs.
By contrast, forcing every current employee to perform a task that would eat up weeks of time per employee per year, in a city of Baltimore's size, will cost tens of millions of dollars.
Dumb, dumb, dumb.
--Pat / zippy@cs.brandeis.edu
Back up all e-mails from the last 4½ years into permanent storage, and then from there, get organized. Put spam filters on, force people to sort any important mail or else it gets deleted after, say, two weeks. People always seem to want to "start from scratch" without looking at the situation rationally. Five years of documents, gone overnight. How can anyone not be at least outraged by that?
because even one false positive can get them in trouble?
"Have you ever thought about just turning off the TV, sitting down with your kids, and hitting them?"
Deleting all the mail... or delete a few false positives. Hrm, tough call...
You could have a manual meta-check on all the positives to make sure they aren't anything vital. Computers don't make mistakes, they just don't think like we do so sometimes it's necessary to sort through it all.
HAH! I just wasted a second of your life making you read this, but I wasted a minute of mine thinking it up. DAMN.
"because even one false positive can get them in trouble?"
You should probably go take a class on probability. When you're dealing with millions of email, there are going to be some false positives.
What's the alternative, hand sort them?
Yeah, that's a good idea, right? But with Bayesian filtering, you can do a lot of refining when you're dealing with millions of emails.
And who says that you need to use the same filters for the health dept and the transport dept?
Jesus Christ, there are lots of companies that already do this. It's not like Baltimore's the only city with millions of old emails.
This is not a Mars mission, this is judicious use of existing technology: Bayesian filters (or whatever fits the profile) and enterprise storage solutions.
It's better to spend a few hundred thousand (or less) on archiving the mail than to spend a million or two on lawyers and court costs 5 years later, defending the decision to delete the data after some citizen sues them for records, etc.
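To make the Bayesian filtering idea concrete, here is a toy naive Bayes classifier with add-one smoothing. It is only a sketch: the class name, the "keep"/"junk" labels, and the training messages in the usage note are all invented, and a real deployment would need far more training data and better tokenizing. Train both labels before classifying.

```python
import math
from collections import Counter


class NaiveBayesFilter:
    LABELS = ("keep", "junk")  # hypothetical labels for this sketch

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.message_counts = {label: 0 for label in self.LABELS}

    def train(self, text, label):
        self.word_counts[label].update(text.lower().split())
        self.message_counts[label] += 1

    def _log_score(self, words, label):
        counts = self.word_counts[label]
        total_words = sum(counts.values())
        vocabulary = len(set(self.word_counts["keep"]) | set(self.word_counts["junk"]))
        total_messages = sum(self.message_counts.values())
        # Smoothed prior, then smoothed per-word likelihoods, all in log space.
        score = math.log((self.message_counts[label] + 1) / (total_messages + 2))
        for word in words:
            score += math.log((counts[word] + 1) / (total_words + vocabulary))
        return score

    def classify(self, text):
        words = text.lower().split()
        return max(self.LABELS, key=lambda label: self._log_score(words, label))
```

Usage would look something like `f = NaiveBayesFilter(); f.train("council vote on budget", "keep"); f.train("cheap pills online", "junk"); f.classify("budget vote")`. Keeping one filter object per department is one way to get the per-department tuning mentioned above.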
This highlights a fundamental problem with email -- many people pass documents as attachments, or in the body of the email, instead of using email as a sort of metadata describing their works in progress. Documents shouldn't be passed around in email; they should be stored on a network share, where proper controls for mutual exclusion and such can be employed.
'He who has to break a thing to find out what it is, has left the path of wisdom.' -- Gandalf to Saruman
I am not sure if they can use email as official communication? There would be problems with repudiation ("we never received it"), privacy ("someone intercepted it who was not supposed to") and authentication ("it wasn't me who sent it, it was my dog"). Can they use an email in court then? What would have to be done is to have all the messages signed and encrypted with a public key, and perhaps have some way for the sender to get a receipt back when the receiver reads the message.
Gmail wouldn't solve problems like this; they only offer 1GB. I work with secretaries who would use up 1GB of storage a year if they didn't delete any emails. The organisation I do systems administration for isn't even that big, so I could easily imagine other people running into problems earlier.
>A better idea would be to write a script to go through each user's mailboxes every month, export any old emails to text, store the files on a server that uses a journaling filesystem, index the emails, and compress them.
No file system will save you from multiple HDD failures; they should save old (>12 months) data to DVDs and/or tapes or cheap SATA storage. One can buy 1TB of external SATA space for a couple thousand dollars.
>One or two XServe G5s could do the trick quite well.
What do XServe boxes have to do with a generic application like email? Besides, they're more expensive than comparable Intel+Linux servers (especially considering the fact that CPU performance is unimportant for most mail servers).
Once an actual human person has read and acted on the mail, they should be able to mark it "official business" and/or move the email into an "official business" folder which does get kept as required.
Better procedures and training go a long way here. These same folks have no problems with snail mail.
I don't know what business you work in, but if they haven't read it in 3 days, they've lost my business.
A better option would be to archive old messages rather than remove them entirely. From the article it sounds like they are keeping ALL messages active all the time. For example:
"They say the system is so overburdened that creating a daily backup has become impossible; there is so much data that it takes more than 24 hours to copy it."
So, it seems like the solution would be to periodically lop off old messages to offline storage (tape, spare drives, whatever). In the event of a lawsuit the old messages could be reasonably recovered and the cost for such a system would be extremely minimal.
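As a sketch of the "lop off old messages" step, here is one way it could look with Python's standard `mailbox` module. This assumes the mail is stored in mbox files, which may not match what Baltimore actually runs; the function name, paths, and the 90-day default are illustrative only. Messages with a missing or unparseable Date header are left where they are rather than guessed at.

```python
import mailbox
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def archive_old_messages(inbox_path, archive_path, max_age_days=90, now=None):
    """Move messages older than max_age_days from one mbox to another."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    inbox = mailbox.mbox(inbox_path)
    archive = mailbox.mbox(archive_path)
    moved = 0
    try:
        for key in list(inbox.keys()):
            message = inbox[key]
            date_header = message["Date"]
            if date_header is None:
                continue  # no date: keep it in the live mailbox
            try:
                sent = parsedate_to_datetime(date_header)
            except (TypeError, ValueError):
                continue  # unparseable date: keep it as well
            if sent.tzinfo is None:
                sent = sent.replace(tzinfo=timezone.utc)  # assumption: treat as UTC
            if sent < cutoff:
                archive.add(message)
                inbox.remove(key)
                moved += 1
        # Write the archive first so a crash cannot lose a message.
        archive.flush()
        inbox.flush()
    finally:
        archive.close()
        inbox.close()
    return moved
```

The archive file can then go to tape or a spare drive, and the nightly backup only has to cover the much smaller live mailbox.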
Unlike a legal office where communications are governed by extensive regulation, governments are really only required to keep records of official documents and decisions. The myriad of e-mails leading up to a decision are not generally protected under such an act, nor are snail mail or phone conversations. In fact, the whole idea of there being a digital trail to follow for governmental decision making is really very new. Does it make sense to change that practice? Do we really think our government officials should be so closely watched that EVERY e-mail/phone conversation/smoke signal should be recorded and exposed to public scrutiny? Talk about making an unattractive job even less enticing.
In response to the poster's question about all those subpoenas: welcome to the world of civil litigation, where the first one to destroy the evidence wins!
Only 120 characters... who can summarize their entire world understanding in 120 characters?!
I'll handle these in reverse order.
Word attachments are acceptable when they are just a means of moving files around, and not the entire content of the email. What is not acceptable is expecting me to load a large word processor just so you can use the company letterhead. In my experience the latter type is far more common. Besides the security implications (macro viruses, etc), I do not have a gui on the computer I read my email. Nor should I need one.
As for HTML email, I'm simply not going to render strange IMG tags. They could lead to goatse, or back to a spammer's site, and now they know my email is active. HTML email generally looks like it was designed by an 8 year old with downs syndrome anyway. Plain text is just more readable for nearly every email. Check out HTML email is STILL evil!!! for more.
Give me Classic Slashdot or give me death!
Spam is easily recognized by the subject line? Boy I wish I was getting your spam instead of mine!
Mine's full of:
hi
how are you?
Please Complete and Return
I miss you
Fwd: I need your help
Re: Your Account
etc... etc...
Any one of these could be legitimate (occasionally you get a headline that's so innocuous I think the spam filter has got it wrong... until I actually read the email).
Five points for excellent use of buzzwords. I would say compress messages older than 90 days and save them. The government is not supposed to just willy nilly throw things away. I would invest in more hard drive space to hedge against lawsuits.
I hate sigs.
ILM is the next big thing. It's the logical extension to the ever-increasing SAN/NAS Server/Workstation exponentially-increasing-data problem (go google for pretenders to the law).
You can't oversee growing data storage without a parallel increase in administration costs. Instead, the idea is to build automatic archiving into your storage architecture.
In practice this means you build tiers of storage/archive methods. Tier 1 is a high-ticket Shark SAN etc., Tier 2 is lower-priced SATA RAID, and Tier 3 is a DAS tape library. Build retention guidelines into the storage management platform (Tivoli etc.). Older items are automatically moved to the tier corresponding to that retention/access policy. Really old items "live" on tape. Frequently accessed data lives on the high-speed boxes near the users/application. You snapshot updates to a DR replica offsite or burn periodic tape sets etc. It's a good idea to team this with storage virtualization (virtual LUNs / metadata directory servers) so you can add/rotate/modify the storage tiers when necessary without any downtime.
From a user perspective, you click on the link and, if applicable, get notified that the item is being retrieved from media x (it's mostly transparent). Worst case, access times are in the minutes.
Of course, all this comes with a high price. Enterprise storage systems are not cheap. Recent legislated policy (Sarbanes-Oxley etc.) enforces the retention of some media (e.g. email). You cannot rely on end users to enforce data retention. This lets you mandate tiers of protection and is highly configurable to support per-application monitoring.
Nothing is foolproof. It's still being finessed, but if you can afford it, it's truly a thing of beauty.
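The tiering rule described above boils down to a policy function like the one below. This is only an illustration: the tier names and the 30-day and 365-day thresholds are invented, and real products express this through their own policy configuration rather than code like this.

```python
def choose_tier(days_since_access, fast_days=30, nearline_days=365):
    """Pick a storage tier from how long ago an item was last accessed."""
    if days_since_access <= fast_days:
        return "tier1-san"   # frequently accessed: fast, expensive disk
    if days_since_access <= nearline_days:
        return "tier2-sata"  # occasionally accessed: cheaper RAID
    return "tier3-tape"      # really old: lives on tape


# Hypothetical items: id -> days since last access.
items = {"msg-a": 3, "msg-b": 120, "msg-c": 900}
placement = {item: choose_tier(age) for item, age in items.items()}
```

A scheduled job would run the policy over every item and migrate whatever is sitting in the wrong tier.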
That's fine. Disk storage is cheap. Certainly cheaper than paying hundreds of staff for the time taken to go through all their old mail sorting the wheat from the chaff. The right solution to running out of disk space for email is to add more disks.
The 10 secretaries in question were only using 1 GB each per year. 10GB per year in total. If your company is as large as you imply, the amount of work hours involved in sorting through old emails will be larger than that. Each person (or their PA) would need to do their own. That's a lot of hours.
There is an easy enough solution to this if you have management's support. (Assuming I understand the problem, which is apparently that the pop server is overburdened.)
The first step is to solve the steady-state problem. This is easy enough: you make it very well known that they are not to leave messages older than 90 days in their mailbox. But because the messages may contain official stuff and can't be deleted, you don't delete old messages. Instead, you test every mailbox periodically to see if it contains old stuff, and if it does, you block delivery of new messages to the mailbox. You can leave them with POP access to it so they can clean it out. Of course, you make this policy well-known. And you put an automated message into their mailbox that notifies them they've been blocked too.
By doing this, you've set up a give and take situation: as long as they do their part to keep their mailbox generally clean, you do your part to deliver messages. Presumably managers will encourage their employees to keep up on the maintenance because they don't want employees to be unable to be reached by e-mail.
Second part is to solve the problem of too much data already on the server. To do this, you announce the policy above and put it into place. Send people advance notice (two weeks, one week, two days, one day, etc.) that their mailbox is going to be locked if they don't clean it. For those who don't clean it, go ahead and lock it. Leave it that way for a short while (until you get some complaints) and then announce a one-week extension.
Then, for those who *still* don't do anything about it, take all the messages that are older than 60 days, remove them from the user's mailbox, then put them aside. Burn a CD of the mailbox and send it (interoffice mail, or whatever) to the user's manager. Then make your own archive of all such messages, and delete them from the server.
Now the recalcitrant people will have to go see their manager to get their old messages, and the managers will know why and will know that they've been given several warnings and an extension and still didn't bother to do anything about it. Maybe the manager won't care, but I can't imagine they'll have a positive feeling about their employee having found a way to waste their time.
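The periodic mailbox test in the scheme above could be as simple as the following sketch. It assumes you can already get the date of the oldest message in each mailbox; the function name and the sample data are hypothetical, and actually blocking delivery depends on the mail server in use.

```python
from datetime import date, timedelta


def mailboxes_to_block(oldest_message_dates, today, max_age_days=90):
    """Return users whose oldest message is past the allowed age."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(
        user
        for user, oldest in oldest_message_dates.items()
        if oldest < cutoff
    )


# Made-up example: user -> date of the oldest message still in the mailbox.
oldest = {"alice": date(2004, 1, 15), "bob": date(2004, 5, 1)}
blocked = mailboxes_to_block(oldest, today=date(2004, 6, 1))
```

The same list, computed a week or two ahead of the deadline, tells you who should get the advance warnings.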
How to save 90% of disk space:
Sort all users' e-mail received in a given year by size.
Delete the largest 5% of e-mails. These will probably account for around 90% of all disk usage. They probably represent file attachments which should have been stored on a server instead of in an e-mail account anyway.
Just think, when you mail a 2MB attachment to 3,000 people in a division, that could use quite a bit of disk space.
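Whether the largest 5% really hold around 90% of the space is easy to check against a real mailbox. A small sketch, with invented sizes standing in for real data and a made-up function name:

```python
import math


def share_of_largest(sizes, fraction=0.05):
    """Fraction of total bytes used by the largest `fraction` of messages."""
    ordered = sorted(sizes, reverse=True)
    count = math.ceil(len(ordered) * fraction)
    total = sum(ordered)
    return sum(ordered[:count]) / total if total else 0.0


# Hypothetical mailbox: one 2 MB attachment among nineteen 10 kB text messages.
sizes = [2_000_000] + [10_000] * 19
print(f"{share_of_largest(sizes):.0%} of the space is in the top 5% of messages")
```

With these made-up numbers the single attachment dominates, but real mailboxes may well look different, so measure before deleting.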
With a properly designed mail system, only one copy of the message would be stored on disk, with pointers from each mailbox to the single central copy.
*shrugs*
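The "one copy plus pointers" idea can be sketched with a content hash as the pointer. This is a toy in-memory model, not how any particular mail server implements it; the class and method names are made up.

```python
import hashlib


class SingleInstanceStore:
    def __init__(self):
        self.blobs = {}      # digest -> message bytes, stored once
        self.mailboxes = {}  # user -> list of digests (the "pointers")

    def deliver(self, recipients, message_bytes):
        digest = hashlib.sha256(message_bytes).hexdigest()
        # Store the body only if this exact content has not been seen before.
        self.blobs.setdefault(digest, message_bytes)
        for user in recipients:
            self.mailboxes.setdefault(user, []).append(digest)
        return digest

    def stored_bytes(self):
        return sum(len(body) for body in self.blobs.values())
```

Delivering a 2 MB message to 3,000 mailboxes this way costs 2 MB of message storage plus 3,000 short references, instead of roughly 6 GB.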
Give a man a fish, he'll eat for a day, but teach a man to phish...
I hate it when people associate taxpayers + government with customers + business. The two relationships are very different.
There are no laws I know of that tell me I have to pay Company X for products. If I don't want any products from Company X, I won't buy anything from them. I'm not going to be breaking any laws because of it. However, if I don't pay my taxes I'll get hounded to death with the possibility of being tossed in jail.
See the difference?
A Penny for my thoughts? Here's my two cents. I got ripped off!
I don't know what business you work in, but if they haven't read it in 3 days, they've lost my business.
Let me guess.... you're emigrating a lot, yes? Otherwise you might have to have "business" with the government. Good luck getting a reply in three days there.
Kjella
Live today, because you never know what tomorrow brings
She doesn't sound like a bitch. She sounds like someone who wanted to share the experiences of the company picnic with everyone. That doesn't sound like a bitch to me. She may not have known about the problems her post would cause. If she had done something like this in the past, and did it again, that might make her "stupid", but not a bitch. A bitch is someone who complains when another employee has a family picture on her desk, because personal decorations are against company policy. A bitch is someone who expects people to drop everything to help him/her, but won't lift a finger to help others. A bitch is, generally, a person who is unpleasant to be around, a person whom almost no one likes. A bitch is not someone who would pass around pictures of the company picnic. A person who calls a woman a "stupid bitch" because she made a simple mistake sounds like a sexist asshole to me, and not someone that I'd like to know.