Email Offline At the Home of Sendmail
BobJacobsen writes "The UC Berkeley email system has been either offline, or only providing limited access, for more than a week. How can the place where sendmail originated fall so far? The campus CIO gave an internal seminar (video, slides) where he discussed the incident, the response, and some of the history. Briefly, the growth of email clients was going to overwhelm the system eventually, but the crisis was advanced when a disk failure required a restart after some time offline. Not discussed is the long series of failures to identify and implement the replacement system (1, 2, 3, 4). Like the New York City Dept. of Education problem discussed yesterday, this is a failure of planning and management being discussed as a problem with (inflexible) technology. How can IT people solve things like this?"
It's the backend. When you have too many connections on too few servers, with not enough storage
you usually see this kinda issue.
It's an economic one. It needs an economic solution.
e.g.
Have people buy a $10 ticket to get an account on the email server.
I am depressed.
"Eve of Destruction", it's not just for old hippies anymore...
Oh no, a service had downtime. Surely this is the end of the world and only the greatest sinners of the IT world ever have to bring something down for maintenance.
To offset political mods, replace Flamebait with Insightful.
By hiring more cost accountants and requiring special and complicated business case studies, with a thorough financial analysis of how even the most mundane upgrade will raise the company's stock price. Just ask any visionary MBA. Always buy cheap consumer-grade stuff and view talent as an unnecessary expense. Do that and you will never have problems. What could go wrong?
http://saveie6.com/
When I started college in 1991 I was amazed by the telnet access I had to the email account given to me by the University. I hadn't had an email address prior to that. Now I have email addresses through hotmail, gmail and yahoo that I use for different things, and facebook also gives me an email address. So, I doubt students really need email addresses provided by the university anymore. As for the NYC Dept of Ed example, I think it just shows that trying to build IT competence into a government agency is basically a waste of money because of the institutional culture of government. In short, all of these kinds of organizations could just offer email through gmail/google business or any number of other providers that will scale up almost infinitely.
if your life is such a big joke then why should I care?
I know /. is a little slow usually, but it's a little silly to see this article pop up now as full service has essentially been restored (just now getting back mail client access, while webmail was working for the past few days).
Maybe it has something to do with the fact that the state of California has cannibalized the funding for my beloved alma mater.
Beware the Jubjub bird, and shun the frumious Bandersnatch.
Briefly, the growth of email clients was going to overwhelm the system eventually, but the crisis was advanced when a disk failure required a restart after some time offline.
Capacity planning is supposed to account for reduced capacity due to component failures, system outages, and temporary demand spikes due to restart events.
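A rough sketch of that check, assuming a pool of identical servers. The function name, node counts, capacities, and spike factor are made-up illustrations, not numbers from the Berkeley system:

```python
def survives_failure(peak_load, per_node_capacity, nodes, failed=1, spike=1.0):
    """Return True if the pool still covers peak load after losing nodes.

    peak_load and per_node_capacity share one unit (say, connections/sec).
    spike is a multiplier for the reconnect storm after a restart.
    All values here are hypothetical inputs, not measured figures.
    """
    remaining = (nodes - failed) * per_node_capacity
    return remaining >= peak_load * spike


# Five nodes at 300 each, peak load 1000: fine with one node down...
print(survives_failure(1000, 300, 5))
# ...but not if every client reconnects at once after the restart.
print(survives_failure(1000, 300, 5, failed=1, spike=1.5))
```

The point of running both cases is that a system can look comfortably sized for a single failure and still fall over once the restart spike is counted.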
It's called sendmail.
Not sendmailnomatterwhat
http://slashdot.org/comments.pl?sid=2556922&cid=38249652
IT should have unions so they are not the fall guy for management mess-ups and/or lack of funds or planning.
There was some Silicon Valley ISP whose name unfortunately escapes me just now, that had the "problem" that its service had grown so popular that the time required to search for a mailbox in /var/spool/mail was greater than the time duration between incoming mails. The result was that their system worked great right up until a certain critical threshold, then all of a sudden most of their users' mail started to bounce.
Their solution was to place user mail spools in their home directories rather than all in one directory, that being /var/spool/mail. Because the home directories weren't all in the same parent directory - that is, not all in /home - rather than a linear search, finding the right spool became a much quicker tree search.
If you have a large number of users, even if you have only one filesystem for home directories, you can speed access to individual user files by placing, say, my "mike" home directory in /home/m/mi/mike, rather than just /home/mike.
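A minimal sketch of that layout in Python. The function name hashed_home and its parameters are hypothetical, chosen only to reproduce the /home/m/mi/mike example:

```python
from pathlib import PurePosixPath


def hashed_home(user: str, root: str = "/home", levels: int = 2) -> PurePosixPath:
    """Build a nested home path from the leading characters of the user name."""
    if not user:
        raise ValueError("user name must not be empty")
    name = user.lower()
    # One directory per prefix length: "m", then "mi", and so on.
    # Short names simply get fewer levels.
    prefixes = [name[:i] for i in range(1, min(levels, len(name)) + 1)]
    return PurePosixPath(root, *prefixes, user)


print(hashed_home("mike"))  # /home/m/mi/mike
```

Each directory then holds only a fraction of the entries, so a lookup walks a few small directories instead of scanning one huge one.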
IT people need to move into management at a more useful rate. Instead most of the people who ultimately make the financial decisions for IT centers around the world have little grounding in IT and hence limited understanding of what is actually important beyond the bottom line.
Of course, this requires IT people who are willing to put their foot down. We don't seem to have many of those...
Damn_registrars has no butt-hole. Damn_registrars has no use for a butt-hole.
IT goes to management and says "Based on current usage/loadings etc., the system will fail in 6 months; to prevent it we need to do this..." Management says "Really? That's not what the salesman told me, and it's his equipment so he should know".
Undetectable Steganography? Yep, there's an app fo
Now I have email addresses through hotmail, gmail and yahoo that I use for different things, and facebook also gives me an email address. So, I doubt students really need email addresses provided by the university anymore.
You are quite wrong. Email addresses - especially .edu addresses - are still quite valuable. A lot of academic resources that take registration via email won't allow registration to go to a throwaway account (a la hotmail, gmail, yahoo, etc). Many organizations that are interested in real information on users insist that users use an actual unique account and not a freebie. And when you're in college and making very little money, a lot of those things can be important.
I think it just shows that trying to build IT competence into a government agency is basically a waste of money because of the institutional culture of government
You're not very accurate on that, either. Government organizations need to be able to keep track of their email - especially internal communications - which they would not be able to do if they outsourced email and other telecom.
In short, all of these kinds of organizations could just offer email through gmail/google business or any number of other providers that will scale up almost infinitely.
With the various privacy breaches that have occurred, that would be a terrible idea. And on top of that, IT is a lot more than just email. Do you want the government to turn to Comcast for networking support while they're at it? What if the IRS web servers go down on tax day? Do you want them to have to lean on an outside company to get it back up?
Damn_registrars has no butt-hole. Damn_registrars has no use for a butt-hole.
It's SO HARD to point your MX record to a working host!
And then, to populate an IMAP from your directory? Don't get me started...
"Flyin' in just a sweet place,
Never been known to fail..."
At the school where I teach, whenever there's a discussion of how much it costs us to run our own email, someone suggests outsourcing (e.g., to gmail), and then someone else says, "No, we can't do that because of privacy laws." Am I right in guessing that privacy laws don't in fact prevent outsourcing to google? I suspect the argument is basically a way for IT folks to have job security. There are certainly laws that say, e.g., that we can't give students' grades to third parties. But it's hard to believe that letting google keyword-index emails and serve ads based on the keywords would violate these laws. (Whether google creeps you out is a different issue -- a moral/political one, not a legal one. It may also be an issue, but it's not an issue that can automatically end the discussion the way the legal issue can.) Does anyone know of any colleges or universities that do outsource to google or someone else?
Find free books.
Seriously, is Berkeley like the only college campus that hasn't outsourced their e-mail to Google yet?
Only 70000 accounts? That's not a big system at all. I was running systems with over a million email accounts ten years ago, and by today's standards even those would be considered small.
worldmobilenet.com -- World Prepaid Wireless Internet plans
In the video, they don't even mention sendmail at all. Are they using it?
Also, they mention that the cost of the system is something like $1.30 per account per month. I don't know much about IT budgeting, but that seems like a really low number for something as critical as messaging and calendaring. I have to imagine that they spend more money per user just cutting the grass around the campus.
There aint no pancake so thin it doesn't have two sides.
it's like saying IT can do heart surgery or IT can provide psychological counseling to a trauma survivor. IT is IT, it is not management and it is not leadership. IT is IT.
of course, shit rolls downhill, and leaders nowadays are incompetent buffoons who gain their positions largely through bribery, kickbacks, extortion, and other 'features' endemic to societies where the rule-of-law breaks down thanks to a greedy, corrupt elite.
again, IT cannot fix that.
Considering that the outage meant that IMAP and POP couldn't be used, while webmail could, I'm not sure that an MX record change would've helped much. :)
I still wanna know why they didn't just change the load balancer to take the bad server out of the pool (which could just as easily be a DNS round robin entry as anything). What? They didn't have redundancy built into the protocol which, short of DNS, is probably the easiest one to make redundant? Then remind me to consider "UC Berkeley IT" as a black mark on future resumes when people apply to work here.
I've only heard from people on one side of this but the story that I hear is that in the past, many departments had their own IT, mail servers, web, etc. When the campus built its centralized computing services facility, there was great pressure on departments to move to the central system. There was some griping about the costs for central services often exceeding the internal costs the departments formerly had but there was, I'm told, much need to justify the expense of and to pay for the new center. I've heard that some departments have been able to resurrect their internal systems to get through the outage.
Perhaps someone with more inside knowledge than I have can fill in and/or correct information from both sides of the story.
That slideshow is pure management-spin right from the opening "look how complicated and difficult this is..." I love how the "solution" to a system that is soon to outstrip its capacity is to stop expanding (and, it appears, properly maintaining) said system and hope it doesn't implode before you can toss the potato to an external party (who can then take the blame). Guess I was never learned at that school of capacity "planning".
~~~~~~~
"You are not remembered for doing what is expected of you." - Atul Chitnis
The press pretty much reads like this to me:
1) We didn't size the system large enough to handle the possible outages.
2) The outage we didn't size for happened, basically taking everything down.
3) My team is now working on a band-aid solution, which basically involves hobbling the application.
4) Since we're incompetent, we're going to outsource this next year.
I mean, if I was the CIO's boss I would have fired him on the spot. Maybe outsourcing is a better answer than putting in place a proper system and looking at that analysis could be interesting. I see no indication any of that was done here, basically the CIO gave the Barbie response, "Mail is hard, let's go shopping." If he doesn't understand how to do it in house, he won't understand how to arrive at a good outsourcing agreement.
Which means this pretty much sums up everything that is wrong with large org IT today.
All comments so far are like "it's not sendmail's fault".
Why is everybody so defensive about it?
... not treating a non-technical problem as a technical problem? Identify the problem, write the memo, keep the flimsies, and drop it on the relevant manager's trouble ticket queue. Or however the flow goes in your locality. The rest I leave as an exercise for the reader.
Google have 24x7 phone support now. It is really a futile exercise to maintain local email systems even for a few thousand users, it will be outsourced sooner or later.
They left out the slide where management get great big bonuses for being such swell thinkers.
The world's burning. Moped Jesus spotted on I50. Details at 11.
Due to FEDERAL LAW communications from school staff to students (and the reverse) must go to University Email accounts.
plus if somebody "does not get" a given email then it's the school's fault.
Any person using FTFY or editing my postings agrees to a US$50.00 charge
Clearly email is an afterthought thrown in for free.
If you want a service to work, you have to fund it. You can try to fight for budgets against the football team or you can simply charge and the money automatically goes where it's needed.
Think of money as little packets of information. You buy something there is a need for it, you don't buy it, there is no need. Resource allocation without dozens of layers of management.
Maybe nobody cares about email and they can just shut it down. Charge for it and find out.
Look up Microsoft Live@edu and Google Apps for Education.
Wow, Squirrelmail. So at least they managed to migrate from pine at some point.
Yeah, they're planning the upgrade from squirrel to carrier pigeon as we speak!
These posts express my own personal views, not those of my employer
I hate it when people try to act as if IT isn't subject to budget constraints and having to prioritize spending like any other department of a large organization. Sure the money comes out of the "client" departments, but it's an issue that IT does have to plan for and deal with.
The summary asks "How can IT people solve things like this?"
Forward the emails and responses to the demands for planned capacity growth to the public.
Oh, you didn't keep the email from your manager refusing to pay for a needed capacity upgrade? I guess you haven't been in IT long enough to learn to cover your own butt.
I do not fail; I succeed at finding out what does not work.
Sigh...
Look at the first bullet point of the timeline. Productivity suite approved, upgrade to Calmail cancelled. Then a week ago, they decided on an interim upgrade because not upgrading in the first place caused problems. So, rather than a planned upgrade, the IT folks were thrown into panic mode because their (probable) proposed timeline for safely doing an upgrade, including burning in and testing of new hardware, was cut to a fraction of what it should've been.
You can argue about the budgets, or the IT folks, but this is a failure of management. If (in Spring 2011) they cancelled the upgrade, and then had to have an emergency upgrade, what you have is management that fundamentally does not understand the system. This would (probably) not be the IT folks managing the system, but rather the budget and personnel management that doesn't quite grok how upgrades should be done in a safe and controlled manner. They misjudged the initial cancellation, and then (likely) pushed through a poorly planned emergency upgrade.
If the slides are correct, there is very little having to do with a failure from a technical aspect, and everything to do with a breakdown of management.
As IT personnel, we could demand that no shiny new toys be allowed to talk to the nerd stuff in the back until there is 2x the amount of nerd stuff in the "back" as would be needed to effectively handle all the new shiny things.
Of course this would equate to a bunch of loathsome, Grinch-like bureaucrats needlessly handicapping the business/school/non-profit to serve their own petty need for authority.
All we want to do is hook all these toys to the e-mail cloud. Why do we even need to talk to you people?
Believe it or not, maintaining a mail host for a larger, geographically diverse
If it were easy, there'd be no push to outsource it to "the Cloud" (or anywhere else), and countless organizations wouldn't be moving from the "burden" of administering something like Exchange (ie, a trivial amount of knowledge is required compared to any other MTA) to Office 365 or Google.
It's not just as simple as setting the mx to point to a 'working host', especially not in academia (though many try). Do you have to deal with this kind of thing?
As someone who has to deal with this stuff on a daily basis - I had dealings regarding CalMail last week on a similar mail related problem of theirs - and with academic mail systems in general, let me clue you in:
* This is not your business mail system, where everyone has a uniformly specified mailbox.
* It is not dictated from the top down how mail is run. In a corporation, there is standardization. CalMail is the exception in academia, as far as I can tell, in that it's run somewhat like the business model. However, there is still somewhat of the "Greek" (vs. "Roman") model of management involved, and this does tend to lead to problems. (This is much more true with other academic mail systems, from what I can tell.)
* Unlike in the work place, there is very little systems experience where it is needed (ie in the actual administration). Even with dedicated IT, very few people are actually good with the mail system due to how broad and complicated mail management can be.
* Running a mail server effectively is now quite difficult. Not only do you have to "just make it work" - ie, dealing with all the misbehaving mail systems out there from other academic institutions and verifying the VIP email makes it through (regardless of how much spam that means letting through - but never let any spam through!) - but it's got to run like a top.
* Often, you're dealing with decades of systemic dependencies. Mail was the first connected application, after all, and nobody's had it as long as Berkeley. Based on my own experience with networks which grew around their mail system, small changes can compound any sort of change or update. Suddenly, there's something everywhere that needs a specific mail system functionality which can't simply be copied over during a move to replicate it.
* An organizational system like this is big; it's not garden-variety email. Hell, I guarantee you they don't have as many IT people maintaining accounts as they have admissions people, probably not even a 10th. Yet the IT people have to actually make sure those records get to the right places, all while assuring the admissions people that the information transits securely.
* There is undoubtedly a faculty member with his pet requirements for email. He probably has things which will not migrate properly.
* There will undoubtedly be the people using their mail account for file storage.
* Believe it or not, it's actually fairly difficult to migrate mail from, say, Cyrus IMAP to anything else. It takes time (and anything at all with Cyrus, which I'd not be surprised if they were using, takes a lot of time). Sieve scripts, procmail, IMAP states, et al. It's a pain in the ass, and takes a loooong time to do seamlessly. Doing it under duress of hardware failure is something else entirely.
From my reading of the events (and seeing some other things not mentioned in OP or linked article) there were a number of things which caused this prolonged outage. First and foremost, the system was not designed to be resilient so much as it was designed to scale up (or proper failure condition testing was not performed beforehand). Second, they either don't have the necessary (knowledgeable) human resources, or enough time allocated to those resources, to effectively manage this system. (You would not believe how difficult it is to find a "mail administrator". Everyone's done it, but nobody seems to like it or is all that good at it. If they are, they want a LOT in compensation.) Third, they may
~/ssh slashdot.org ssh: connect to host slashdot.org port 22: too many beers
First and foremost, it has to account for budget, and the rationalization thereof. It's scary how often suits (and more and more "engineers") say things like "Come on really; how often does that kind of thing actually happen?" This is usually uttered after staring at a couple dozen slides of metrics that detail exactly how often it happens...
Shift happens. Fire it up.
Has anybody thought of using Nginx e-mail proxy to solve the issue?
http://nginx.org/en/#mail_proxy_server_features
University IT departments providing email and calendaring services is like the facilities employees being required to build all the chairs for classrooms.
"If I can't dance, I don't want to be part of your revolution" - Emma Goldman
I thought nobody (especially college kids) used email anymore. Facebook is where all the cool kids hang these days, right? California is doing a bang up job alright. When they're done with this project maybe they can consult themselves out to the Feds. I hear they've got a mail problem of their own with this Post Office thing. For those of you unfamiliar with Post Offices, the wikipedias have a decent write up: http://en.wikipedia.org/wiki/Post_office. Anyway, I can't wait to see the powerpoint slides for the Post Office Turn Around Medical Marijuana Home Delivery Program.
There's a more mundane problem. Unless you are an incredibly huge customer the large service providers are just not going to care if there is an outage. One example I ran into last year is a University of 45,000+ students that lost their student email hosting (hotmail) for a week due to a DNS typo for a machine in a hotmail MS Exchange server farm. To get a job offer to a student I had to put an entry in /etc/hosts of my mail server - meanwhile no other mail was getting to any of the students for a week.
That's the price of outsourcing. Your important services are farmed out to people that just do not care enough to fix a typo for a week.
You don't even have to go as far as malice when apathy is enough to provide unacceptable problems.
I think the mail gods are angry at the academic community. Our mail server crashed as well last week. Took them 4 days to get a backup server online. And another week to get the emails to the new server.
Far out - this comment is so bad that it's not even wrong.
Making 70,000 people change the mail settings and lose all their state and access to their old mail while you spool it somewhere, and then integrate it back into their main spool. Make sure all the filtering rules got applied correctly along the way.
Trivial, I could do it in my sleep.
As a mail administrator for a big system, I completely agree with you.
The biggest problem was that they had everything on a single SAN, so when they ran out of IOPS, there was no spare capacity anywhere, and nowhere to mitigate it to. I've had people try to sell me on putting all our systems on a SAN too: "it's so simple to administrate. It has plenty of IOPS, see, look at these shiny numbers". Fine when it's empty and you're only hitting the battery backed cache.
Which is why we have hundreds of separate little disk sets managed with templated configurations rather than any single points of failure. I'm really glad to be there!
Building rockets is hard. E-mail is a piece of cake and getting easier by the year.
A SPOF is never a good idea. So putting everything on one SAN isn't either. Why would you purposely introduce so much fragmentation in your data when you could just build a proper SAN environment which had high availability built in?
http://virtualize.wordpress.com/
Watching the video, two things were apparent: Sendmail wasn't being used--Exim was, and the fault was not the MTA, but rather the use of a single SAN backend for everything.
I've been in the Messaging Infrastructure business for many years. The UC problem is poor design. They left themselves open to a single point of failure by not splitting the mailbox load across multiple SANs. Their load isn't really all that great--I've designed for much larger email volumes. What they need is an LDAP-based routing (or similar) mechanism to send different recipients' emails to different SAN backend stores--say, alphabetically by last name, or by entry (employee/student/alumnus/account) id. When a disk failure occurs, it then would only affect a small percentage of the population, and for a much shorter time. By enforcing RFC compliance on the front end, they would also reduce the load on the back end, and could easily handle their traffic load with far fewer servers--thereby costing far less than what they currently have.
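As a rough illustration of the routing idea, here is a sketch that maps each account to one of several backend stores. It swaps in a stable hash of the account id in place of the alphabetical split or LDAP lookup described above, and the backend names are hypothetical, not anything from the Berkeley setup:

```python
import hashlib

# Hypothetical backend store names, for illustration only.
BACKENDS = ["store-a", "store-b", "store-c", "store-d"]


def backend_for(account_id: str, backends=BACKENDS) -> str:
    """Pick a backend from a stable hash of the (case-folded) account id."""
    digest = hashlib.sha256(account_id.lower().encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


def affected_fraction(accounts, failed, backends=BACKENDS) -> float:
    """Fraction of the given accounts whose mailbox lives on the failed backend."""
    accounts = list(accounts)
    hit = sum(1 for a in accounts if backend_for(a, backends) == failed)
    return hit / len(accounts)


# Synthetic account ids, just to show how a single failure is contained.
sample = [f"user{i}" for i in range(1000)]
print(affected_fraction(sample, "store-a"))
```

With four stores, losing one should affect roughly a quarter of the accounts rather than all of them. A real deployment would keep the mapping in a directory so accounts can be moved between stores; a plain modulo hash reshuffles most accounts whenever the number of backends changes.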
They certainly can pay someone else to do proper design, of course, but they should understand that technology and budget did not cause their problem, their poor design did.
-David Gillam
www.davegillam.com
I don't know a whole lot about this but I'm on the mailing list for a department that was in the process of migrating to Calmail. (My email goes through a different system so it hasn't affected me.) After a slew of messages this past week about Calmail problems, they've decided to cancel the migration for now. Apparently Calmail is going to the cloud in the future, so they're hoping the existing servers last until then.
a cascade failure.
1. data storage failure.
2. database crash, presumably due to the fault in data storage
3. heavy backlog of deferred mail begins hammering a generally neglected piece of Berkeley IT.
Email, calendaring, and instant messaging aren't mythical, and they need constant, competent care
just like any part of the IT infrastructure. Having worked on complex email systems for the better half of my career,
I'd say some of the fault lies with the Berkeley team's "set it and forget it" mentality, which Shel Waggener scolds the audience about at the start of the video.
Is that backend database known to the DBA and in a healthy state? Are the front-end components configured for email as it ran in 1993, or have they
been upgraded over time with new features to address email as it operates in the 21st century?
Good people go to bed earlier.
I have a story of a mail system at a university running on a SAN, which ran out of file descriptors and took the mail system down for a week as well. I think it was some form of ZFS, although I've seen similar things on GPFS and other things as well.
Just rebuilding a system like that after a crash can take days to weeks. (please note I'm talking Petabytes here).
RogerWilco the Adventurous Janitor
I've run out of inodes once. Now I monitor them as well. Thankfully it was just one machine and changing the replica was simple.
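For anyone wanting to do the same, a minimal sketch using Python's os.statvfs (Unix only). The function names and the 90% threshold are assumptions for illustration, not what I actually run:

```python
import os


def inode_usage(path: str = "/") -> float:
    """Return the fraction of inodes in use on the filesystem holding path."""
    st = os.statvfs(path)
    if st.f_files == 0:
        # Some filesystems do not report a fixed inode count.
        return 0.0
    return 1.0 - st.f_ffree / st.f_files


def check_inodes(path: str = "/", threshold: float = 0.9):
    """Return (alert, usage); alert is True once usage reaches the threshold."""
    used = inode_usage(path)
    return used >= threshold, used


alert, used = check_inodes("/")
print(f"inode usage {used:.1%}, alert={alert}")
```

Run it from cron or hook it into whatever monitoring you already have; the point is that disk space graphs alone will not warn you about this.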
Here it is in a nutshell:
UC Management had decided to delay the purchase of a replacement system a year ago. UCB has been working on OE (Operational Excellence - euphemism for downsizing) for a few years now. Because of the decision, one of the main techs that helped create the stable, top notch mail system left for Twitter a few months ago. Someone joked on the Micronet mailing list that mail was going to crash and burn.
The university decided at the beginning of the year not to replace the aging, out of warranty hardware while they evaluated outsourcing to gmail or Microsoft.
Mail services were already experiencing problems during the past few months prior to the crash. This was due, according to the CIO, to the unprecedented number of users using cell phones & iPads to access mail constantly starting this semester. Then a critical server crashed the day after Thanksgiving. The techs brought it back up over the weekend, but they were still recovering data.
Then on Monday, everyone came back to work and when the staff all started to log in, the system degraded so much they had to take it down again. They disabled IMAP & POP services to keep all the cell phones and iPads off the servers, which also destroyed the productivity of the staff that relied on Thunderbird and the Mac mail client to access IMAP. The web based SquirrelMail and Roundcube clients were unfamiliar and lacked the functionality of real mail clients. Even the techs were figuring it out with each other on the Micronet mail list.
Basically, it was a management decision. It was a gamble that failed.
Capcha = raided