LiveJournal Servers Go Down
Wind writes "According to any journal hosted off of LiveJournal.com, the LiveJournal data center Internap has suffered a critical power failure, leaving all of LiveJournal and its content temporarily offline and requiring the revival of 100+ servers. Perhaps Six Apart wasn't quite prepared for the responsibilities of a website of this size? Updated information is posted here."
Sounds like someone was taking a nap over at Internap
You can't imagine the withdrawals I'm going through. It's like the great Slashdot brownouts of '98.
I need my fix, man!
In related news, 6,000 teen-age girls were heard yelling "OMG! WTF! How will John know I like him if I can't blog about it!"
An effective signature identifies a particular user amongst a base of thousands.
...the collective IQ of the internet has raised about 20 points.
but that's ONE HELL of a Slashdotting! :)
Join the TWIT army now!
and search.pl is constantly being trashed by distributed xanga botnets. perhaps michael wasn't quite prepared to be an editor of slashdot?
Bush just appointed Internap's CEO to his National Infrastructure Advisory Council, yet the man can't keep a co-lo facility switched on.
I'm not sure what that says of Bush or of Internap. And it certainly doesn't seem to have anything to do with SixApart.
"Perhaps Six Apart wasn't quite prepared for the responsibilities of a website of this size?"
Perhaps shit happens, and a blog service doesn't warrant the necessary investment to survive whatever caused this outage?
so it's deadjournal now ?
Well now the millions (?) of users might actually have something to write about when the servers are back up. "Today I went outside. My pupils have never been tinier..."
Perhaps Six Apart wasn't quite prepared for the responsibilities of a website of this size?
Ok, I understand that you don't like Six Apart; I'm no fan of their new licensing scheme either. However, I really doubt that SixApart has any control over any power failures that might occur at Internap.
Where will I write about my depression over this event?
Oh. Slashdot.
Use the Coralized link. No sense in crashing their status page. Plus it'll respond a lot quicker than loading the actual web page.
I feel a great disturbance in the force..... It's as if a million bloggers cried out all at once..... and became silent.
The population of depressed pre-teens has just dropped by 20%
It's not like most LiveJournal users have enough to worry about; here's something for most LJ users to get melodramatic about. I'm serious, randomly pick 5 LiveJournal blogs, and I guarantee 4 out of 5 are going to be "Fuck the World" posts.
sounds like all the fucking spammers they host overtaxed spammer-nap's power resources and brought it all down.
Seriously though, spammer-nap is a massive spam haus, see for yourself
Lawyers, MBA's, RIAA? A jedi fears not these things!
I know nothing of how InterNap is set up. I just want to throw that out there ahead of time. Now, it's time for my patent pending "Bull Shit Theory of the Day."
Ok, here is the rant. I used to work for a Colocation facility. Nothing special, small by Telco terms. The whole facility only had about 1500 cabinets. (Though I hear they are now full, and going to be expanding.)
We had a main power draw off of the local grid. We had a backup power draw off of the *next* city's power grid. (i.e., when all the offices around us went dark, we still had power.) And you don't even want to know the kind of red tape we had to go through for *that* pull. I'm still not sure how they did it. We had flywheel kinetic electricity storage systems, battery backups, and a diesel engine from a train so large it had its own building.
We used to joke that if we lost power, we had more important things to worry about. And again, we were small time compared to some of the massiveness that is out there. *cough*AADS Chicago*cough*
So I'm kind of in agreement with the statement currently on LiveJournal. It's unknown to me how any self respecting colo facility can say "We've had a power outage that also took our redundant systems."
I have to call bullshit on that entire train of thought. If that's true then they don't *have* any redundant systems, and I'd be looking for a new provider. The most likely thing (at least in my mind) is that someone, somewhere got mad at something specific and decided to make a point by popping the main breaker to their portion of the facility.
Oh, that was another thing, each room had several "main" breakers. It took a hell of a power surge to pop all of them, and the Liebert systems had power filters of some kind, really really big capacitors or something I think, so a surge really never made it to the other side anyway, it got stored in the cap and then trickled out like the rest of the power.
But I was a UNIX admin, not the EE that was planning the power generation aspects of the facility. So take some of it with grains of whatever white powdered spice you prefer.
"Genius may shine aloof and alone, like a star, but goodness is social, and it takes two men and God to make a Brother."
Update from the site:
"Update #1, 7:35 pm PST: we're up on 'dirty' power for now (it works, but it's unreliable)".
Congrats to LiveJournal for assembling a coal generator in record time.
On the Livejournal main page:
Update #1, 7:35 pm PST: we're up on 'dirty' power for now (it works, but it's unreliable), and we're working to assess the state of the databases. The worst thing we could do right now is rush the site up in an unreliable state. We're checking all the hardware and data, making sure everything's consistent. Where it's not, we'll be restoring from recent backups and replaying all the changes since that time, to get to the current point in time, but in good shape. We'll be providing more technical details later, for those curious, on the power failure (when we learn more), the database details, and the recovery process. For now, please be patient. We'll be working all weekend on this if we have to.
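For the curious, the "restore from a recent backup and replay all the changes since" idea in that update can be sketched roughly like this. It is only an illustration in Python: the `Change` record, the sequence numbers, and `restore_and_replay` are made-up names, and nothing here claims to reflect how LJ's actual tooling works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    seq: int    # position of this change in the (hypothetical) change log
    key: str
    value: str


def restore_and_replay(backup, backup_seq, change_log):
    """Rebuild current state from a backup plus the changes logged after it."""
    state = dict(backup)  # work on a copy so the backup itself stays untouched
    for change in sorted(change_log, key=lambda c: c.seq):
        if change.seq <= backup_seq:
            continue  # this change is already contained in the backup
        state[change.key] = change.value
    return state


# Toy data: the backup was taken after change 1; changes 2 and 3 came later.
backup = {"user:1": "alice"}
log = [
    Change(1, "user:1", "alice"),
    Change(2, "user:2", "bob"),
    Change(3, "user:1", "alice-renamed"),
]
restored = restore_and_replay(backup, backup_seq=1, change_log=log)
```

The point of replaying in sequence order is that the end state matches what the live system held right before the failure, but it is built from data that is known to be consistent.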
Lovely. I just bought another year's subscription for my wife, figuring the change to Six Apart wouldn't change anything for a few months at least. LJ could lose a lot of subscribers with an outage just after the takeover.
live journal is dark like my soul like my heart a void its link is cut just like i'll be doing to my arm i blame my parents
The Slashdot effect is more visible because we send all our readers to one place at the same time, while LJ is highly distributed.
... as if millions of teenage girls suddenly cried out in terror and were suddenly silenced.
This is another thing that bothers me about this scenario. I can't say that I've ever admined 100 servers, the most I've ever had was about 30, but if we had a power loss of any kind, you'd just repower them and walk away. Most of them were DEC Alpha gear running Tru64. Why would you spec out a box that has to be handheld every reboot? The only time you should have to handhold a server is during an upgrade. A power cycle without proper SIGHUP or term signals should just run fsck on its way back up. (K, so it might take an hour for the server to go live again, but still.) I mean, am I missing something here? Maybe since nothing I've admined got the traffic these things do .... I'm just lost. Someone hit me with the clue by four.
The only thing I can even think of is they have explicit services that must be started manually ..... but why would you want that? If you have a power hiccup in the middle of the night, you want it to come back up, and be live and happy again *before* you even get the first page. I mean sure, if there was a surge, and that destroyed components, and those components have to be replaced ..... but ..... a reboot is a reboot, man. Here, smoke some source. It's the good stuff.
"Genius may shine aloof and alone, like a star, but goodness is social, and it takes two men and God to make a Brother."
Er, they just announced Six Apart was buying them like days ago. I doubt they transitioned the servers in the first week.
They all came back up when the power came back.
But we intentionally don't have databases come back up on boot because if there was a blip, we want to do an integrity check first. (We run InnoDB, so it's ACID, but we're paranoid...)
We have clusters of 2 identical databases in separate cabinets, separate switches, separate Internap power feeds... so normally losing one database in each cluster doesn't matter: the other one gets used. But when we lose every single database, in all clusters, all at once... that's the time to be paranoid and double check stuff.
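A rough sketch of that policy, for anyone trying to picture it: in normal operation either member of a pair can serve, but after a crash a machine is held back until its integrity check has passed. This is a toy Python model only; `Replica`, its `verified` flag, and `pick_replica` are invented for illustration and are not LJ's code.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    up: bool
    verified: bool  # True once an integrity check has passed since the last crash


def pick_replica(cluster):
    """Return the first replica that is both up and verified, else None."""
    for replica in cluster:
        if replica.up and replica.verified:
            return replica
    return None


# Normal failure: one member of the pair is down, the other takes over.
cluster = [
    Replica("db1a", up=False, verified=True),
    Replica("db1b", up=True, verified=True),
]
chosen = pick_replica(cluster)

# Whole-datacenter outage: both are powered again, but neither has been checked,
# so nothing is served from this cluster yet.
after_outage = [
    Replica("db1a", up=True, verified=False),
    Replica("db1b", up=True, verified=False),
]
blocked = pick_replica(after_outage)
```

With only one machine lost, the survivor never crashed and stays trusted. When every machine in every cluster drops at once, there is no trusted copy left to compare against, which is why the check comes before the restart.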
LiveJournal Servers Go Down
With thousands of teenage girls unable to ponder in an open forum whether or not to blow their boyfriends, thousands of teenage girls go down.
500GB of disk, 5TB of transfer, $5.95/mo
Because michael needs a beating. The site that rolls beta (alpha?) code onto live servers complaining and making jokes because another site goes down through no fault of its own?
Jesus was all right but his disciples were thick and ordinary. -John Lennon
Perhaps Six Apart wasn't quite prepared for the responsibilities of a website of this size?
What does Six Apart have to do with Internap? Livejournal has been using - and wanting to switch from - Internap for a long time.
For those people who might not know, Brad Fitzpatrick is Livejournal User #1.
I'd have to agree with the AC, Brad, stop posting to slashdot and hover over that DB rebuild a bit more.
(Yes, posting to slashdot relieves tension... Whatever it takes, Brad.)
The LiveJournal status page claims "Our data center (Internap) lost all its power, including redundant backup power". This is nothing to do with "cheapskate blog admins" and everything to do with a serious and quite likely unacceptable problem at Internap.
Of course, that's why Anonymous Cowards start out with zero points. Guilty of idiocy until proven innocent.
At this point all my whiteboards are full of boxes of each database cluster, the machines in that cluster, which have passed their checksum tests (InnoDB checksums each 16k page), which replayed their replay/undo logs, where in binlogs each was writing/reading/executing etc...
So lots of waiting now on the checksum validators. I don't want to put a machine back in and find out in a week there was a database page that was corrupt because the battery-backed write-back cache on the RAID card didn't work as advertised. (which happens on about 95% of RAID cards, in my experience, because they're mostly crap, even the most expensive ones...)
Also whenever there's any doubt about something's integrity, we backup or snapshot the potentially corrupt version before operating on it. That operation can take time too.
It's going to be a fun night.
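To illustrate what per-page checksum validation means in principle, here is a small Python sketch. It assumes 16 KiB pages and uses CRC32 from the standard library purely as a stand-in; it is not InnoDB's real on-disk format or checksum algorithm, and `page_checksums` and `find_corrupt_pages` are hypothetical helper names.

```python
import zlib

PAGE_SIZE = 16 * 1024  # 16 KiB pages, matching the page size mentioned above


def page_checksums(data: bytes):
    """Compute one CRC32 per fixed-size page of the given byte string."""
    return [
        zlib.crc32(data[offset:offset + PAGE_SIZE])
        for offset in range(0, len(data), PAGE_SIZE)
    ]


def find_corrupt_pages(data: bytes, expected):
    """Return the indexes of pages whose checksum differs from the expected one."""
    actual = page_checksums(data)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


# Three pages of known-good data, then one flipped byte inside the second page.
good = bytes(PAGE_SIZE * 3)
expected = page_checksums(good)

damaged = bytearray(good)
damaged[PAGE_SIZE + 5] ^= 0xFF
corrupt = find_corrupt_pages(bytes(damaged), expected)
```

Checking page by page pins the damage to a specific page rather than just flagging the whole file, at the cost of having to read every page, which is presumably where the waiting comes from.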
For those who don't know what's so hot about it and for those who think Livejournal is just a bunch of teenage girls whining.... Livejournal has just about four years of my life documented. The ease of use and the ability to "vent" is comforting, but the real value comes in the interaction. My friends see my life at their convenience and I see theirs at mine. We can choose to ignore the whining of others or we can choose to relate and comment on our own experience. Think of it this way: Open-source philosophy, emotion, and life. I put my own out there and others add to it. I add mine to others. Granted ... those quiz/meme things HAVE TO GO. I do not want to read about "what frog best resembles me" or "which 80's hair band song is me." Grrr.
Just remember it's not ALL obnoxious, over-emotional teen-angst teenage girls. I use mine to showcase (non-depressing) poetry and make intelligent comments about intelligent topics. Basically, if someone makes an LJ about their own life, it sucks. If you can manage to write an LJ and make it about things that matter to more people than just you (i.e., "Why Bush's Iraqi war is unjust" vs. "Why this babe I know should bang me"), and at the same time make it funny and enjoyable to read, then you have a good LJ. Most LJs DO suck, but there are some diamonds in the rough.
Blog blog blog blog.
Lovely blog!
Wonderful blog!
Blog blo-o-o-o-o-og blog blo-o-o-o-o-og blog.
Lovely blog! Lovely blog!
Lovely blog! Lovely blog!
Lovely blog!
Blog blog blog blog!
-- The Viking Blog Song
From the article write-up (and reflecting the thoughts of quite a few of the comments I just read):
I'd love to know what makes you think this has anything to do with Six Apart. The very first line at http://www.livejournal.com states:
They've been with Internap for years, predating Six Apart's takeover. Unless LJ staff is lying, the fault here sounds like it lies entirely with Internap.
And as far as I can tell, Six Apart didn't ditch the LJ team when they bought them out, so you probably have the exact same people working on bringing the site back up now as you would have if Six Apart had never got involved.
I know the feeling. I have an LJ (for friends to read) in which I relay news, ramble about things that interest me, and write mini-essays from time to time. I don't whine about my parents or people at school or whatever (well, I do, but it's grumbling about idiots at work, since I work at a university) and the people I know are generally much the same. But I can't stand those typically teen idiotic ramblings either.
But I too find it irritating when a service I use, one that is supposed to be backed up, goes down like this (my account was bouncing up and down numerous times in the past week, too). For a paid service, I'd have expected there to be a lot more backups to make it more difficult for power problems to wipe out the entire site. If the hosting facility doesn't have a UPS, why wasn't one installed?
i am a soviet space shuttle
Someone probably hit the big red switch on the wall, the one covered in a plastic case
That does happen. I remember working at Purolator Courier's data center in NJ back in -- oh, geez, mid-80s some time. I was a third shift print operator, helped out with the mag tape library too. One night the trouble alarm went off on the fire suppression panel. We'd been having trouble with it all week, and the alarm guy was due in in the morning. One of the newbie operators -- the only one at the console at the time, the others being on a smoke break or asleep in the tape library -- panicked and went over to the annunciator panel. He opened it as I watched him from the console area. I think he thought the halon was about to dump because he reached around the panel and instead of hitting the halon dump abort, he hit the emergency power cutoff.
BLAM! It was as if a firecracker went off as all the breakers tripped and the fans came to a sighing halt. Both on this floor -- the one with the console and the tape drives -- and the floor above, with the CPU and the disk farms. Dead as a doornail.
Now, this was Purolator COURIER. We had AIRPLANES coming in to land at Indy center and as of this moment, no way to tell the crews which gate to go to, where to unload their stuff, or how to sort it.
Not only that, but this was an IBM mainframe shop -- S/390, the Big Iron, with 3380 disk drives. You don't just flip the power switch back on. An emergency power cutoff blows breakers in the power supplies on those DASD strings. The IBM Field Engineer was duly dispatched and arrived with cases of breakers the next morning. But we were still dark when I got off shift the following morning.
The next night a brand new plexiglass cover was mounted over the Big Red Switch.
Mit der Dummheit kämpfen Götter selbst vergebens.
I'm surprised to see that Internap's main servers are back up. It's pretty irresponsible to bring up your corporate servers before those of your clients.
That being said, LJ's servers are back up now, but they're making sure that the databases are all in sync -- LiveJournal has one of the most massive distributed MySQL clusters in existence along with a complete caching system.
They need to make sure that the database is all synchronized before bringing it back up -- chances are they're going to rebuild the cache too. If they didn't, the initial strain on the DB servers would probably bring the site down again.
This does, however, bring up some questions about LiveJournal's network infrastructure. Danga (the creators of LJ, recently purchased by Six Apart) are heavy users of Perl and MySQL. Needless to say, they have made numerous contributions to both projects and have developed an innovative memory caching system for Linux.
The questions raised, however, come from Perl and MySQL. Both are questionable in terms of scalability. Although I'm not qualified to comment on this, I believe that the general consensus is that MySQL is one of the least efficient databases today. Livejournal has 100+ servers. I honestly don't think that a system the size of LiveJournal should require a server cluster that big. It seems that they are trying to solve their performance/reliability problems by blindly throwing hardware at it.
Of course, I love livejournal. It's simple, easy to use, and is a great tool for building communities. Just as it is simple, it can also be incredibly nerdy (there's actually a command prompt!). They're also completely open source.
Hopefully, Six Apart can make their network infrastructure more 'professional' while still maintaining the community spirit that has made it so successful.
-- If you try to fail and succeed, which have you done? - Uli's moose
There were already lots of LiveJournal users who were upset and confused and unhappy with the idea that LJ and Danga (the company which made LJ) had been bought by SixApart. No doubt, as there have been no downtimes of this magnitude at LJ before, doomsayers will be claiming that it's SixApart's fault.
Never mind common sense; it won't matter that if SixApart can be held responsible for failures at InterNAP's colocation facilities, they're a much bigger -- and more powerful -- company than most people have ever given them credit for...
--Rachel
Update #2, 10:11 pm: So far so good. Things are checking out, but we're being paranoid. A few annoying issues, but nothing that's not fixable. We're going to be buying a bunch of rack-mount UPS units on Monday so this doesn't happen again. In the past we've always trusted Internap's insanely redundant power and UPS systems, but now that this has happened to us twice, we realize the first time wasn't a total freak coincidence. C'est la vie.
According to some LiveJournal employees, a massive UPS exploded. From IRC:
<rahaeli> As far as we can tell, a UPS exploded.
Their site now says that they're buying their own UPSes, because this is the second time that the entire data center has lost power. Details on the first outage can be found here (a Google cache since LJ is down).
For the paranoid: This has nothing to do with Six Apart buying LJ. They're still in the same "world-class" data center they've been in for years.
you want beer and pizza? email me an address/zipcode at the sig email and ill do my part to support restoring lj.
;)
if my wife cant post this weekend, im gonna hear about it. and not even be able to post my lj about getting yelled at about lj being down as if i caused the power outage myself.
not really.
well maybe.
Cheers.
This is my sig. There are many like it, but this one is mine.
Remember when teenagers were happy when people couldn't read all the personal details in their diary?
One line blog. I hear that they're called Twitters now.
You do realize that LiveJournal handles far more traffic than Slashdot, and when Slashdot got linked on the front page of LJ, Slashdot started spewing out errors (more than normal).
Oh hey, Slashdot just went down as I was typing this. Smooth.
"I have felt a great disturbance in the force; as if a million voices suddenly cried out in terror."
Those poor, poor children.
i won't exaggerate if i tell that in recent years most of "social life" in .ru zone moved to livejournal.
it's 10 a.m. in russia now, and most of russian lj-addicts still don't know about apocalypse in lj.
i hope everything will be turned up in the nearest future. brad, we believe in you! :)
| ...the poor APC UPS batteries weren't able to hold up the 150 servers I run.
|
| When the power came back on, we had 143 servers back on-line in ten minutes.
| We had 149 on line in fifteen minutes. We had two servers (leased dedicateds)
| that requires some file system repairs before they would come back on-line, but
| that task was finished 30 minutes after power restoration.
|
| What's so hard about that?
What's so hard about that? Well... not everyone who has 150 servers can get 151 of them back online in 30 minutes.
Unless it means that the "cheapskate blog admins" were too cheapskate to buy proper dual-power supply boxes so that they can have dual power paths right to the servers.
You can have all the great redundant mains and backups you want, and it's for shit if you only have one power line to the system and that power bus loses juice.
It's funny how I just met with some Internap sales people a few months ago. They were bragging about how their network infrastructure was superior to most others, since it intelligently routes traffic to the path of shortest response (not hops).
They even bragged to me how their network uptime SLA is 100%! I mean good god, now I find out this is the SECOND time it's happened (from the livejournal update site)???
I'm glad I didn't go with them...
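For what it's worth, the difference between "fewest hops" and "shortest response" is easy to show with a toy example. This Python sketch uses made-up paths and numbers; `Path`, `fewest_hops`, and `fastest_response` are hypothetical names and have nothing to do with how Internap's routing actually works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    name: str
    hops: int      # number of routers traversed
    rtt_ms: float  # measured round-trip time in milliseconds


def fewest_hops(paths):
    """Pick the path a hop-count metric would prefer."""
    return min(paths, key=lambda p: p.hops)


def fastest_response(paths):
    """Pick the path with the lowest measured round-trip time."""
    return min(paths, key=lambda p: p.rtt_ms)


# Invented measurements: the shorter path happens to be the slower one.
paths = [
    Path("carrier-a", hops=4, rtt_ms=80.0),
    Path("carrier-b", hops=7, rtt_ms=35.0),
]
by_hops = fewest_hops(paths)
by_rtt = fastest_response(paths)
```

The two metrics can disagree, which is the whole sales pitch. None of it helps, of course, when the building has no power.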
eTrade SUCKS
The comments seem to be full of contempt for the inane teenage-angst ramblings that are common on LJ. Come on. It's not like you are forced to read through this stuff.
I have a few "friends" there at LJ, some of them net.celebs, and I like their posts. It's the matter of whose writings do you find interesting, and you are free to be completely unaware of the rest. Why all the vitriol?
My exception safety is -fno-exceptions.
The Alexa link was the only tangible example I could find. I distinctly recall seeing a post by Brad himself mentioning how much more traffic LJ handles, but obviously I can't link to it at the moment.
Anyway, as of Google's last crawl of the stats page (shortly before the outage), there were almost 6 million LJ users, a little under half of those "active." I don't know if /. has any stats available, but skimming through this page, the highest UID I see is in the 800,000 range. I'm not going to even attempt to guess what the relative activity level of LJ users is compared to /., or which has bigger pages or whatever, but I would offhand say that LJ probably handles more image traffic (user pictures, and now the in-testing photo hosting service). I know they used to use Akamai for that, but I seem to recall that fairly recently they switched over to doing something else. (I think they handle it themselves again, but I'm not sure.) There's also the audio files from phone posts. I'd say there's little question that LJ is the more heavily trafficked site.
Besides, a lot of the DB load on Slashdot is eased tremendously by Memcached, developed by... Danga Interactive, i.e. LJ. Wikipedia uses it too, and just started using Perlbal. (And I do mean "just") Ditto for Audioscrobbler/Last.fm. So /. isn't in much of a position to pooh-pooh the technical ability of Brad/LJ.
I seem to remember that a few years back they had a similar problem (Internap lost all power) and it turned out that some idiot had hit the big red "shut down all power to the entire datacenter" emergency button. This isn't the first time this has happened, and last time it wasn't under Six Apart's management.
I'd say it's Internap's incompetence that caused this problem. If they can't keep their datacenter running even though they have multiple redundant power supplies then something is very wrong. I see from the outage page that LJ people are now planning to buy their own UPS so that they don't have to trust Internap anymore.
For power outages, my house has a better record than Internap right now, and I don't even own a UPS!
Personally, I'll trade a subdomain for the elegant simplicity of the friends system, post security, threaded comments, communities, user images, easy and powerful customization, an open-source backend with some seriously useful software contributed to the community, clients, and a site that, during the 99% of the time it's running properly, is ridiculously fast.
Actually, I won't trade a subdomain for all that. I'm a paid user, so I get one anyway.
(And there's a simple solution to the emo teens: ignore them.)
Hey, you try to find an open nick these days!