Slashdot Mirror


Email Offline At the Home of Sendmail

BobJacobsen writes "The UC Berkeley email system has been either offline, or only providing limited access, for more than a week. How can the place where sendmail originated fall so far? The campus CIO gave an internal seminar (video, slides) where he discussed the incident, the response, and some of the history. Briefly, the growth of email clients was going to overwhelm the system eventually, but the crisis was advanced when a disk failure required a restart after some time offline. Not discussed is the long series of failures to identify and implement the replacement system (1, 2, 3, 4). Like the New York City Dept. of Education problem discussed yesterday, this is a failure of planning and management being discussed as a problem with (inflexible) technology. How can IT people solve things like this?"

39 of 179 comments

  1. Nothing to do with Sendmail by bobstreo · · Score: 3, Insightful

    It's the backend. When you have too many connections on too few servers, with not enough storage
    you usually see this kinda issue.

    1. Re:Nothing to do with Sendmail by grcumb · · Score: 2

      It's the backend. When you have too many connections on too few servers, with not enough storage you usually see this kinda issue.

      I see it as yet another failure for the client/single server model.

      It surprises me that people are still investing so much time and effort on centralisation of services when obviously the most practical technical[*] answer is the opposite. Simple, common protocols and decentralised infrastructure are the most robust model for overall survival of a communications system. DARPA proved that some time ago, but we seem intent on forgetting as much of that lesson as possible.

      ----------------
      [*] Okay, I don't want to be disingenuous about this. The reasons for centralisation are financial and organisational. It's more costly to spread IT capacity through the breadth of an organisation, and it's hell on wheels in administrative terms. But past a certain point, you would think that IT would finally earn the right to have some input into the discussion about how best to manage an organisation's information. Unfortunately, IT managers are not always the best ones to advocate for a different approach because they're the ones who've made their mark by proving (or pretending) they could manage these big, ugly 'enterprise' systems.

      --
      Crumb's Corollary: Never bring a knife to a bun fight.
    2. Re:Nothing to do with Sendmail by vlm · · Score: 4, Funny

      It's the backend. When you have too many connections on too few servers, with not enough storage
      you usually see this kinda issue.

      Knowing the speed and flexibility of university upgrade policies, and knowing sendmail was born around 4.1BSD, and knowing the -BSDs were VAX only until 4.2 or 4.3 or so in the 80s, I'm guessing they're still using the original VAX it was developed on?

      --
      "Science flies us to the moon. Religion flies us into buildings." - Victor Stenger
    3. Re:Nothing to do with Sendmail by TheRaven64 · · Score: 3, Informative

      Uh, email is decentralised. Anyone can set up a node in the network just by pointing an MX record at a machine. The problem in this case is too many people using the same node. You'll note that while UCB was having problems, email continued to work fine for everyone else unless they had unrelated problems.

      --
      I am TheRaven on Soylent News
    4. Re:Nothing to do with Sendmail by CAIMLAS · · Score: 2

      Many educational institutions lag behind because they're an ever-revolving door. Even when they've got dedicated and experienced IT staff, most of it's just in a managerial role for the student work studies (it saves money, of course).

      --
      ~/ssh slashdot.org ssh: connect to host slashdot.org port 22: too many beers
  2. It isn't an I.T. problem by Colin+Smith · · Score: 2, Insightful

    It's an economic one. It needs an economic solution.

    e.g.
    Have people buy a $10 ticket to get an account on the email server.

    --
    Deleted
    1. Re:It isn't an I.T. problem by StikyPad · · Score: 5, Insightful

      Pretty sure that's what tuition is.

    2. Re:It isn't an I.T. problem by Anonymous Coward · · Score: 4, Funny

      no I'm pretty sure tuition is more than $10

  3. Telnet by qualityassurancedept · · Score: 2

    When I started college in 1991 I was amazed by the telnet access I had to the email account given to me by the University. I hadn't had an email address prior to that. Now I have email addresses through hotmail, gmail and yahoo that I use for different things and facebook also gives me an email address. So, I doubt students really need email addresses provided by the university anymore. As for the NYC Dept of Ed example, I think it just shows that trying to build IT competence into a government agency is basically a waste of money because of the institutional culture of government. In short, all of these kinds of organizations could just offer email through gmail/google business or any number of other providers that will scale up almost infinitely.

    --
    if your life is such a big joke then why should I care?
    1. Re:Telnet by slimjim8094 · · Score: 5, Insightful

      Students need school email addresses because that way all students have an email address.

      At my school, students are expected to check their university email at least once every 24 hours. Many people forward it to a personal account, and obviously most people check it more frequently than that, but if the university issues an account to everyone, then there can be no debate about how they didn't get the email. The school takes responsibility for the email system (and any failures), and then professors can be assured that if they send an email out to the class, it will be (or should have been) read, leaving the onus on the student to actually do it. It's similar to why we provide computer labs - that way, each student unequivocally has a way to do electronic assignments, even if nearly everyone has their own machine.

      --
      I have developed a truly marvelous proof of this comment, which this signature is too narrow to contain.
    2. Re:Telnet by Mr.+Freeman · · Score: 3, Informative

      Having a .edu address gains you a lot of credibility when communicating with people outside the university. They are quite valuable. You can often get very quick responses to questions that most companies won't even respond to if they came from a name@gmail.com or name@yahoo.com.

      Also, email is used for a lot of very important stuff like sending reports, design files, etc. Having someone on campus that can fix problems is quite valuable. Your campus email will never be "accidentally" seized, locked out, etc. like people have experienced with google and yahoo. Because the campus maintains backups (or at least, they should), your data will never be suddenly gone with no chance for recovery like people have experienced with google and yahoo.

      --
      -1 disagree is not a modifier for a reason. -1 troll, flaimbait, redundant, overrated are NOT acceptable substitutes.
    3. Re:Telnet by mr100percent · · Score: 2

      Actually, most schools require an Official school email address. This guarantees the uptime from the faculty's point of view; you can't claim you never got the assignment or that you turned it in on time and nothing was there. It's also important for them from a liability standpoint; my Registrar will not send me any bills unless it's to my .edu account, and professors are instructed to ignore any student emails from any other domain. They're also organized by real name, so the school has a working internal directory and doesn't have to bother with LDAP.

    4. Re:Telnet by QuantumRiff · · Score: 3, Interesting

      When I was an admin at a small college, we DID NOT provide email for students, only for staff. We ran a listserv (Sympa), and if the students gave us their personal email address and checked a box, they would be added to a mail list for every class automatically.

      Any student that didn't have an email would be sent to the library, where they would be shown how to sign up for a hotmail, yahoo, or gmail account.

      We had students thank us, since they had gone to other schools and thought it was silly to have to check yet another account when they already had 3 or 4.

      The ONLY reason colleges give out emails is because they have been doing it since before email was a common thing. There is no actual reason for it. (But I have heard some neighboring colleges give very, very, very good sounding arguments on why they needed to drop a few hundred grand on a SAN and Exchange.)

      --

      What are we going to do tonight Brain?
    5. Re:Telnet by Chaos+Incarnate · · Score: 2

      Given my university's propensity to send irrelevant e-mails to all students, students marking some e-mail from your university as spam was probably quite legitimate.

      --
      Benford's Corollary to Clarke's Law: "Any technology distinguishable from magic is insufficiently advanced."
  4. Improper capacity planning by mysidia · · Score: 3, Informative

    Briefly, the growth of email clients was going to overwhelm the system eventually, but the crisis was advanced when a disk failure required a restart after some time offline.

    Capacity planning is supposed to account for reduced capacity due to component failures, system outages, and temporary demand spikes due to restart events.
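
    As a rough illustration of that point, here is a minimal sketch of sizing with headroom for both a demand spike and a failed server. The function name, the spike factor, and all of the numbers are hypothetical and are not taken from the Berkeley incident.

```python
import math

def servers_needed(peak_load, per_server_capacity,
                   tolerated_failures=1, spike_factor=1.5):
    """Servers required so a surge still fits after losing some machines.

    peak_load           -- normal peak, e.g. concurrent connections (assumed)
    per_server_capacity -- load one server can carry (assumed)
    tolerated_failures  -- servers that may be down at the same time
    spike_factor        -- multiplier for restart or reconnect surges (assumed)
    """
    surge = peak_load * spike_factor
    # Enough servers to carry the surge, plus spares for the failures.
    return math.ceil(surge / per_server_capacity) + tolerated_failures

# Hypothetical example: 10,000 connections at peak, 2,500 per server.
print(servers_needed(10_000, 2_500))
```

    Sizing only for the normal peak (no spike, no failures) gives a smaller count for the same inputs, and that gap is the headroom that disappears when a disk failure and a restart surge arrive together.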

    1. Re:Improper capacity planning by Bacon+Bits · · Score: 2

      In my experience this type of "planning failure" is caused when IT repeatedly tells management they need money to maintain and upgrade systems, and management consistently says no because they don't have the money for it. Not enough money or people to configure, install, support, and maintain any new systems because the budget won't allow any more. Yet somehow there always seems to be money for shiny new iPads and iPhones for the executives.

      --
      The road to tyranny has always been paved with claims of necessity.
  5. Re:So the ultimate solution will be outsourcing by Anonymous Coward · · Score: 2, Funny

    Wow, Squirrelmail. So at least they managed to migrate from pine at some point.

  6. IT is not the Problem by arthurpaliden · · Score: 4, Insightful

    IT goes to management and says "based on current usage/loadings etc. the system will fail in 6 months; to prevent it we need to do this....." Management says "Really, that's not what the salesman told me, and it's his equipment so he should know".

    1. Re:IT is not the Problem by iggymanz · · Score: 3, Informative

      no way, I work at a Value Added Reseller of hardware and the good sales guy would definitely use your fears to sell you some expandable solution

  7. No. by damn_registrars · · Score: 5, Insightful

    Now I have email addresses through hotmail, gmail and yahoo that I use for different things and facebook also gives me an email address. So, I doubt students really need email addresses provided by the university anymore.

    You are quite wrong. Email addresses - especially .edu addresses - are still quite valuable. A lot of academic resources that take registration via email won't allow registration to go to a throwaway account (a la hotmail, gmail, yahoo, etc). Many organizations that are interested in real information on users insist that users use an actual unique account and not a freebie. And when you're in college and making very little money a lot of those things can be important.

    I think it just shows that trying to build IT competence into a government agency is basically a waste of money because of the institutional culture of government

    You're not very accurate on that, either. Government organizations need to be able to keep track of their email - especially internal communications - which they would not be able to do if they outsourced email and other telecom.

    In short, all of these kinds of organizations could just offer email through gmail/google business or any number of other providers that will scale up almost infinitely.

    With the various privacy breaches that have occurred, that would be a terrible idea. And on top of that, IT is a lot more than just email. Do you want the government to turn to comcast for networking support while they're at it? What if the IRS web servers go down on tax day? Do you want them to have to lean on an outside company to get it back up?

    --
    Damn_registrars has no butt-hole. Damn_registrars has no use for a butt-hole.
    1. Re:No. by qualityassurancedept · · Score: 2

      you can get an email address through google with any domain name you want... so, the company I work for runs its email from google but we all still have email addresses that say mrbigshot@seriousbusiness.com so I don't think the point about the .edu ending of the address is really valid.

      --
      if your life is such a big joke then why should I care?
  8. Re:Funding by DesScorp · · Score: 3, Insightful

    Maybe it has something to do with the fact that the state of California has cannibalized the funding for my beloved alma mater.

    They wouldn't have to if they didn't have too many colleges (they do), and try to send too many kids to college (they do), many of whom may have no business being in college (they don't). Tax revenue is not an infinite resource. But California seems to have a community college on every two dirt roads, and several 4 year (or higher) colleges in a similar area.

    --
    Life is hard, and the world is cruel
  9. Re:outsourcing? by betterunixthanunix · · Score: 3, Insightful

    Since both my alma mater and my current institution have migrated to Google, and both are covered by FERPA and other privacy laws, I am inclined to say that that argument is bogus. However, I have a separate issue with outsourcing student email: third parties get to set the rules for student conduct without any action by the university itself.

    Typically universities have acceptable computer policies and at those institutions that run their own mail servers, such policies usually govern email. Students and faculty can demand changes to university policy if the policy does not properly align with the academic mission of the institution. Students and faculty have essentially no power over the terms of use that Google or Microsoft or any other third party email service imposes on them. It is easy to say, "Well, it is not like Google is going to demand something outrageous!" but there is really nothing preventing Google from doing so (if you do not think they have done so already). Google does not have the best interests of academia in mind when it sets its policies, nor is there any reason for Google to care about academic needs.

    --
    Palm trees and 8
  10. What does this have to do with Sendmail? by farnsworth · · Score: 3, Insightful

    In the video, they don't even mention sendmail at all. Are they using it?

    Also, they mention that the cost of the system is something like $1.30 per account per month. I don't know much about IT budgeting, but that seems like a really low number for something as critical as messaging and calendaring. I have to imagine that they spend more money per user just cutting the grass around the campus.

    --

    There aint no pancake so thin it doesn't have two sides.

    1. Re:What does this have to do with Sendmail? by m1ss1ontomars2k4 · · Score: 2

      They no longer use Sendmail; they use Exim.

    2. Re:What does this have to do with Sendmail? by lucm · · Score: 2

      Also, they mention that the cost of the system is something like $1.30 per account per month. I don't know much about IT budgeting, but that seems like a really low number for something as critical as messaging and calendaring. I have to imagine that they spend more money per user just cutting the grass around the campus.

      Totally agree. One of my clients did a major cost-cutting initiative for its email platform, and there was just no way to make it reliable under $9 a month (per account). And this is when there is no Crackberry (which brings the numbers way up).

      --
      lucm, indeed.
    3. Re:What does this have to do with Sendmail? by TClevenger · · Score: 3, Informative

      That's amazingly cheap. I don't know how you'd do it any cheaper outsourced. Microsoft is $8.80/user in qty. 20,000, and while Google starts at $4.17/user, I couldn't imagine that even 70,000 accounts could bring down the price that much.
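
      To put the figures quoted in this thread side by side, here is a small worked calculation. It assumes that all three prices are per account per month and that the 70,000 account count applies to each option; neither assumption is confirmed by the presentation, and the function name is made up for illustration.

```python
def annual_cost_dollars(cents_per_account_month, accounts):
    """Yearly cost in whole dollars, computed in integer cents to avoid rounding."""
    return cents_per_account_month * accounts * 12 // 100

ACCOUNTS = 70_000  # account count mentioned in the thread

# Prices quoted in the comments, assumed here to be monthly per-account rates.
quoted_prices_cents = {
    "in-house (as presented)": 130,  # $1.30
    "Google (list price)": 417,      # $4.17
    "Microsoft (qty 20,000)": 880,   # $8.80
}

for name, cents in quoted_prices_cents.items():
    print(f"{name}: ${annual_cost_dollars(cents, ACCOUNTS):,} per year")
```

      Under those assumptions the in-house figure works out to roughly $1.1 million a year, against about $3.5 million and $7.4 million at the quoted outsourced list prices.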

  11. Re:Funding by scapermoya · · Score: 2

    I agree that the overall system is probably too large, but we are talking specifically about the flagship university of the UC system. Arguably the best public university in the world, and it is getting hurt just as bad as UC Riverside. That's absurd and embarrassing.

    --
    Beware the Jubjub bird, and shun the frumious Bandersnatch.
  12. Re:outsourcing? by GIL_Dude · · Score: 2

    Tons of schools use Google as their email provider. Here's a quote from a Time article from 2009:

    Google now manages e-mail for more than 2,000 colleges and universities, enabling students to transform accounts capped at 100 mb into Google-managed inboxes that allow for 70 times as much mail. Microsoft also provides free Web-based mail for thousands of schools, including colleges in 86 countries.

    Here's the article: http://www.time.com/time/business/article/0,8599,1915112,00.html. Now, a specific school? Sure, my daughter and I just toured California State Sonoma and they use Google services.

  13. The failure is leadership, planning, budgeting... by linuxwrangler · · Score: 3, Interesting

    I've only heard from people on one side of this but the story that I hear is that in the past, many departments had their own IT, mail servers, web, etc. When the campus built its centralized computing services facility, there was great pressure on departments to move to the central system. There was some griping about the costs for central services often exceeding the internal costs the departments formerly had but there was, I'm told, much need to justify the expense of and to pay for the new center. I've heard that some departments have been able to resurrect their internal systems to get through the outage.

    Perhaps someone with more inside knowledge than I have can fill in and/or correct information from both sides of the story.

    That slideshow is pure management-spin right from the opening "look how complicated and difficult this is..." I love how the "solution" to a system that is soon to outstrip its capacity is to stop expanding (and, it appears, properly maintaining) said system and hope it doesn't implode before you can toss the potato to an external party (who can then take the blame). Guess I was never learned at that school of capacity "planning".

    --

    ~~~~~~~
    "You are not remembered for doing what is expected of you." - Atul Chitnis
  14. Did the CIO just give up in the presentation? by Above · · Score: 3, Insightful

    The press pretty much reads like this to me:

    1) We didn't size the system large enough to handle the possible outages.

    2) The outage we didn't size for happened, basically taking everything down.

    3) My team is now working on a band-aid solution, which basically involves hobbling the application.

    4) Since we're incompetent, we're going to outsource this next year.

    I mean, if I was the CIO's boss I would have fired him on the spot. Maybe outsourcing is a better answer than putting in place a proper system, and looking at that analysis could be interesting. I see no indication any of that was done here; basically the CIO gave the Barbie response, "Mail is hard, let's go shopping." If he doesn't understand how to do it in house, he won't understand how to arrive at a good outsourcing agreement.

    Which means this pretty much sums up everything that is wrong with large org IT today.

  15. Re:So the ultimate solution will be outsourcing by lucm · · Score: 2

    Outsourcing would work, because when there is another failure they will have another party to blame instead of pointing fingers to a decision made in Spring 2011 (even as a total stranger I could feel the bitterness under that bullet point in the slides).

    --
    lucm, indeed.
  16. Re:So the ultimate solution will be outsourcing by twisted_pare · · Score: 2

    This is really quite common. It happened at my alma mater as well. Servers could not handle the POP requests, so they started blacklisting students that checked their mail more than four times an hour. A month later a RAID drive failed and email for 17k people (including a hospital) was completely offline for 3 days. It is sad that seemingly anyone can be a high paid "IT Professional" these days, but without a clue about HA.

    --
    HTFU
  17. Re:So the ultimate solution will be outsourcing by lucifuge31337 · · Score: 4, Insightful

    One can have all the clue in the world, yet be powerless to prevent failures if not funded to purchase the appropriate equipment.

    --
    Do not fold, spindle or mutilate.
  18. IT has to deal with budgets, too by msobkow · · Score: 2

    I hate it when people try to act as if IT isn't subject to budget constraints and having to prioritize spending like any other department of a large organization. Sure the money comes out of the "client" departments, but it's an issue that IT does have to plan for and deal with.

    The summary asks "How can IT people solve things like this?"

    Forward the emails and responses to the demands for planned capacity growth to the public.

    Oh, you didn't keep the email from your manager refusing to pay for a needed capacity upgrade? I guess you haven't been in IT long enough to learn to cover your own butt.

    --
    I do not fail; I succeed at finding out what does not work.
  19. Re:So the ultimate solution will be outsourcing by AK+Marc · · Score: 2

    HA isn't an IT problem. It's a business issue. I've never seen a business that put a dollar amount on its downtime. I know they are out there, but every "real" place I've worked has never quantified their costs. How do you justify having a cold spare if it's a "waste" of money planning for an outage that is "free"? (After all, if downtime were a problem, then someone would calculate the cost of downtime.) IT did what they should, gave the users the best they could within their budget. That the budget was too small is partly the fault of the IT manager, but only partly.
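
    For what it's worth, the calculation being described is short. Below is a minimal sketch of putting a dollar figure on downtime and comparing it with the price of a spare; every number and name in it is hypothetical, and estimating the cost per hour is the genuinely hard part.

```python
def expected_downtime_cost(outages_per_year, hours_per_outage, cost_per_hour):
    """Expected yearly cost of outages, in dollars."""
    return outages_per_year * hours_per_outage * cost_per_hour

def spare_pays_off(spare_cost_per_year, cost_without_spare, cost_with_spare):
    """True when the yearly saving from the spare exceeds what the spare costs."""
    return (cost_without_spare - cost_with_spare) > spare_cost_per_year

# Hypothetical inputs: one outage a year, 72 hours without a spare,
# 4 hours with one, and $2,000 of lost work per hour of downtime.
without_spare = expected_downtime_cost(1, 72, 2_000)
with_spare = expected_downtime_cost(1, 4, 2_000)

print(without_spare, with_spare)
print(spare_pays_off(50_000, without_spare, with_spare))
```

    With these made-up inputs a $50,000-a-year spare is easily justified; with a low enough cost per hour it would not be, which is why the number has to come from the business rather than from IT.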

  20. Re:How IT people can solve this problem... by Ritchie70 · · Score: 2

    History.

    Computers originally came into companies to do accounting and related work.

    --
    The preferred solution is to not have a problem.
  21. Re:Hate Being First .... by CAIMLAS · · Score: 4, Insightful

    Believe it or not, maintaining a mail host for a larger, geographically diverse

    If it were easy, there'd be no push to outsource it to "the Cloud" (or anywhere else), and countless organizations wouldn't be moving from the "burden" of administering something like Exchange (ie, a trivial amount of knowledge is required compared to any other MTA) to Office 365 or Google.

    It's not just as simple as setting the mx to point to a 'working host', especially not in academia (though many try). Do you have to deal with this kind of thing?

    As someone who has to deal with this stuff on a daily basis - I had dealings regarding CalMail last week on a similar mail related problem of theirs - and with academic mail systems in general, let me clue you in:

    * This is not your business mail system, where everyone has a uniformly specified mailbox.
    * It is not dictated from the top down how mail is run. In a corporation, there is standardization. CalMail is the exception in academia, as far as I can tell, in that it's run somewhat like the business model. However, there is still somewhat of the "Greek" (vs. "Roman") model of management involved, and this does tend to lead to problems. (This is much more true with other academic mail systems, from what I can tell.)
    * Unlike in the work place, there is very little systems experience where it is needed (ie in the actual administration). Even with dedicated IT, very few people are actually good with the mail system due to how broad and complicated mail management can be.
    * Running a mail server effectively is now quite difficult. Not only do you have to "just make it work" - ie, dealing with all the misbehaving mail systems out there from other academic institutions and verifying the VIP email makes it through (regardless of how much spam that means letting through - but never let any spam through!) - but it's got to run like a top.
    * Often, you're dealing with decades of systemic dependencies. Mail was the first connected application, after all, and nobody's had it as long as Berkeley. Based on my own experience with networks which grew around their mail system, small changes can compound any sort of change or update. Suddenly, there's something everywhere that needs a specific mail system functionality which can't simply be copied over during a move to replicate it.
    * An organizational system like this is big; it's not garden-variety email. Hell, I guarantee you they don't have as many IT people maintaining accounts as they have admissions people, probably not even a 10th. Yet the IT people have to actually make sure those records get to the right places all while assuring the admissions people that the information transits securely.
    * There is undoubtedly a faculty member with his pet requirements for email. He probably has things which will not migrate properly.
    * There will undoubtedly be the people using their mail account for file storage.
    * Believe it or not, it's actually fairly difficult to migrate mail from, say, Cyrus IMAP to anything else. It takes time (and anything at all with Cyrus, which I'd not be surprised if they were using, takes a lot of time). Sieve scripts, procmail, IMAP states, et al. It's a pain in the ass, and takes a loooong time to do seamlessly. Doing it under duress of hardware failure is something else entirely.

    From my reading of the events (and seeing some other things not mentioned in OP or linked article) there were a number of things which caused this prolonged outage. First and foremost, the system was not designed to be resilient so much as it was designed to scale up (or proper failure condition testing was not performed beforehand). Second, they either don't have the necessary (knowledgeable) human resources, or enough time allocated to those resources, to effectively manage this system. (You would not believe how difficult it is to find a "mail administrator". Everyone's done it, but nobody seems to like it or is all that good at it. If they are, they want a LOT in compensation.) Third, they may

    --
    ~/ssh slashdot.org ssh: connect to host slashdot.org port 22: too many beers
  22. Re:Hate Being First .... by Bronster · · Score: 3, Interesting

    As a mail administrator for a big system, I completely agree with you.

    The biggest problem was that they had everything on a single SAN, so when they ran out of IOPS, there was no spare capacity anywhere, and nowhere to mitigate it to. I've had people try to sell me on putting all our systems on a SAN too: "It's so simple to administrate. It has plenty of IOPS, see, look at these shiny numbers". Fine when it's empty and you're only hitting the battery backed cache.

    Which is why we have hundreds of separate little disk sets managed with templated configurations rather than any single points of failure. I'm really glad to be there!