Behind the Scenes At Sony's NOC
VonGuard writes "Earlier this year, I spoke to Mark Rizzo, the man who manages the people who run Sony's online game servers. Rizzo learned the ropes of MMO hosting back on Ultima Online, and we chatted about where the tough problems were then versus now. Rizzo compares the operation to a 24/7 scientific simulation, albeit with some sassier and more involved end-users. His favorite innovation since those early days? Rapidly provisioning and deploying Linux installations tailor-made to their purposes. Here's my article on Rizzo and his band of 50-some-odd sysadmin-cum-dungeon-masters, written for the new newspaper The Systems Management News."
So to sum up, they have lots of programs that are constantly watched by scripts. They get to heave server machines around to expand certain areas and replace old servers. Their lives are mostly taken up with making sure that the backups are properly done on time each day and that no one accidentally steps on the power cord.
Fascinating!
Anyone else have images of S&M running through their minds?
Rizzo. Oh hang on a minute, that's Frank.
If you spend an awful lot of time grinding in the NOC you eventually become a level-something 'Network Architect' with no direct reports but with the ability to tell everyone what to do.
Taking a few management tips from in-game, perhaps?
Pretty pretty please tag 'cum-dungeon-masters' just for this!
I misread part of the last sentence in the summary as "50-some-odd sadism-cum-dungeon-masters", which, oddly enough, makes some sense.
Yeah, I have a scary mind. Boo!
The secret to achieving five nines uptime is not to improve the reliability of the systems, but instead to be very careful about how you define "uptime".
"Hey, about those two hours of downtime last night..."
"There wasn't any downtime."
"No, really, the phones were lit up with people complaining that the applications weren't answering properly..."
"So the applications were answering queries? Then they were up. It's not downtime."
"But they were answering queries with error messages."
"Then that's an application problem. The system was still up."
"But the error messages said 'No response from database'. The database servers were down."
"No they weren't. They were still running. They still had power. The servers were up. It's not as if they fell down out of the racks. You can't call it downtime just because a few programs aren't behaving exactly the way you want."
"So about this SLA..."
"Five nines, baby. We've still got five nines."
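For anyone wondering how far off that claim is, here is a rough back-of-the-envelope sketch of the arithmetic. It assumes a 365-day year and that the two-hour outage was the only downtime all year; the function names are made up for illustration and are not from any real SLA tooling.

```python
# Rough availability arithmetic. Assumes a 365-day year (no leap day).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_budget_minutes(availability_percent):
    """Minutes of downtime per year allowed at a given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def availability_percent(downtime_minutes):
    """Yearly availability implied by a given amount of downtime."""
    return 100 * (1 - downtime_minutes / MINUTES_PER_YEAR)


# Five nines leaves a bit over five minutes of downtime per year.
budget = downtime_budget_minutes(99.999)

# A single two-hour outage (assumed to be the only one that year).
after_outage = availability_percent(120)

print(f"Five-nines budget: {budget:.2f} minutes/year")
print(f"Availability after a 2-hour outage: {after_outage:.3f}%")
```

So one two-hour outage burns through the yearly five-nines budget more than twenty times over and leaves you at roughly three nines, unless, of course, you redefine "uptime".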