Slashdot Mirror


Behind the Scenes At Sony's NOC

VonGuard writes "Earlier this year, I spoke to Mark Rizzo, the man who manages the people who run Sony's online game servers. Rizzo learned the ropes of MMO hosting back on Ultima Online, and we chatted about where the tough problems were then versus now. Rizzo compares the operation to a 24/7 scientific simulation, albeit with some sassier and more involved end-users. His favorite innovation since those early days? Rapidly provisioning and deploying Linux installations tailor-made to their purposes. Here's my article on Rizzo and his band of 50-some-odd sysadmin-cum-dungeon-masters, written for the new newspaper The Systems Management News."

11 of 49 comments

  1. Summary: We have scripts by BadAnalogyGuy · · Score: 5, Funny

    So to sum up, they have lots of programs that are constantly watched by scripts. They get to heave server machines around to expand certain areas and replace old servers. Their lives are mostly taken up with making sure that the backups are properly done on time each day and that no one accidentally steps on the power cord.

    Fascinating!

  2. sysadmin-cum-dungeon-masters by Anonymous Coward · · Score: 5, Funny

    sysadmin-cum-dungeon-masters


    Anyone else have images of S&M running through their minds?
    1. Re:sysadmin-cum-dungeon-masters by somersault · · Score: 4, Funny

      Beowulf's been a naughty boy.. PRINT IT!!!

      kill (1000+('od -An -N2 -i /dev/random')%2001)

      Oh, you like that don't you!? Want me to do it again? First, I'm going to show you what a real glob is..

      --
      which is totally what she said
  3. What change management? by Antique Geekmeister · · Score: 4, Interesting

    I see the article does not mention what they use for change management. I'm curious what they use: I like Bugzilla myself for ticket tracking, and it's potentially useful for configuration management as well, but needs significant revision to provide that or source control integration.

    Most change control systems make odd choices between a business model of selling proprietary clients, strange choices of backend databases, and a focus on managing sales contact information, hardware inventory, software updates, filling out lots of forms for tracking minutes used doing the work, etc., etc. The choices of the change control system affect the workflow quite a lot: so I'm quite curious what they use. Does anyone here on Slashdot know?

  4. And Remedy :P by Moraelin · · Score: 5, Interesting

    No shit. And they also use Remedy. (Same as half the companies out there.)

    That said, if they claim to be also architects, IMHO they do a poor job too.

    E.g., at one point, after much lurking, and after I already had a big list of veteran awards in SWG, I wanted to post a suggestion. I didn't have a forum handle yet (hadn't needed one before), but that's ok, I'd just go to the account management and create one. Turns out I was sandboxed in a newbie forum no one else needs for the next two weeks, 'cause apparently the forum can't read from the database whether I'm on a trial account or a regular one. But it can read whether I have an active account, or whether I just deactivated it. (Sony's games were in the habit of asking why you quit. Post-NGE SWG was the only one which basically told me "go away, we don't take input from people with inactive accounts" after I filled in that form.) But it can't read whether I'm on a trial account or not.

    Well, it sounds to me like those architects of the server room don't do a particularly great job, then. Whatever interface they use to that customer database (SOAP, XMLRPC, plain SQL, whatever) should be trivial to extend to fetch that one extra piece of information. If month after month no one can figure out how to do that, it doesn't come across as a particularly competent architecture.

    That, or they have no qualms with lying to the customers.

    Additionally, I kinda find this funny, and while pioneered by UO, it's become a typically _Sony_ excuse later: "While today most of the problems faced by Rizzo's team are technical or development related, back in the Ultima Online days, these were compounded by the unpredictable player base. In its day, no one had ever seen the psychological and sociological reactions of players in a massive online world before."

    Erm, no. The vast majority of the problems UO had were already known (and some even solved) by MUDs before. There was no excuse to repeat the same mistakes verbatim, and try the same things which were known not to work.

    E.g., player justice was known not to work, as there's nothing you can do to the disposable character of a griefer, that its owner would care about. Plus, mobilizing whole posses to hunt down a griefer is, basically, just feeding the troll: he got some attention out of tens of people. Tens or hundreds of MUDs have tried that before, as it was the holy grail of being able to run a MUD without the non-fun headache of policing it, and it just didn't work without being backed by a lot of admin support. UO's recipe was known to fail, every time.

    What really happened with UO was Lord British having his head so far up his own arse that he couldn't see there's a world outside. He didn't so much discover those issues as thoroughly ignore everything that had been discovered by anyone else. And then he repeated the same thing with Tabula Rasa.

    And as for Sony, since a lot of people there seem very fond of the same excuse: you have even less right to use that excuse, guys. SWG was a _third_ generation MMO, EQ2 is even later. There wasn't really an excuse even for UO to ignore the lessons of MUDs before it. Ignoring a couple dozen MMOs before you, is even less excusable.

    And finally: how were those social issues relevant, in any form or shape, for the IT guys running the servers? I mean, seriously, they were (A) poor game design issues, (B) created some work for the coders who had to keep implementing fixes (which created more problems and the need for the next fix), and (C) a neverending headache for the GMs who had to sort out the thousands of support requests resulting from that fuck-up. Daily. But for the guys monitoring the servers and doing backups? Exactly how does it affect them whether the MMO is a friendly place or a newbie-hostile gank-fest run amok?

    --
    A polar bear is a cartesian bear after a coordinate transform.
    1. Re:And Remedy :P by kjart · · Score: 5, Insightful

      All the problems you are describing are engineering/development issues and don't have anything to do with operations. The architects would be for the infrastructure, deployment, monitoring, etc etc, not for the games themselves.

  5. Promotion Strategy. by Angostura · · Score: 4, Funny

    If you spend an awful lot of time grinding in the NOC you eventually become a level-something 'Network Architect' with no direct reports but with the ability to tell everyone what to do.

    Taking a few management tips from in-game, perhaps?

  6. Would love to hear more from these teams by magamiako1 · · Score: 5, Interesting

    I think a lot of people underestimate the requirements of running 24/7 online game servers for persistent worlds. There are definitely some serious architectural hurdles to overcome that don't necessarily exist in other areas of IT. In fact, one could say it's like "regular" IT work but on steroids.

    For one, the server hardware has to be pretty powerful. Because it's doing a lot of high demand database work, everything from the lower layers of the hard disks to the file system to the software itself has to be fast and reliable.

    For two, there is an increased demand for data reliability. If you manage an e-mail server and for some reason a flaw in the e-mail server doesn't pass e-mail on properly, you may be able to fix it and tell users to simply resend whatever e-mail they were sending and that's that. If a flaw comes up in the online game world that requires users to possibly "redo" something they did in the game, you will immediately lose a vast majority of your playerbase as they will see the game as unreliable.

    That said, the servers are also in very high demand 24/7. Even when the maintenance times are scheduled outages, people still complain. Generally in a normal business IT scenario, you can reboot a few servers here or there and nobody will notice anything during off time. So you've got change control windows that can occur 2 hours before anyone else gets to work and has to use the system, and they won't care one way or the other as long as everything's fine when they get into the office.

    The databases are vast, doing constant read/write operations. Again, constantly changing database as players move about the world and interact. Exchanging items, gold, leveling up, learning new abilities.

    Clustering and load balancing become very real problems for game servers. This is extremely apparent when you look at Blizzard, where they number over 200 separate, completely independent realms worldwide.

    We won't even get into issues where the game world can't be dynamic and involving due to the technical limitations that we have, resulting in very limited forms of gameplay.

    And again, you cannot forget the customer base. You know, if Joe cannot access e-mail for an hour because something is up with his e-mail account on the server, in most situations that's perfectly fine, he has something else he can do and you won't necessarily lose money on productivity. If Joe cannot access his online gaming character, you have the potential to lose a sale and a customer.

    Very high demand indeed.

    1. Re:Would love to hear more from these teams by Jellybob · · Score: 3, Interesting

      You can backup the SQL database files, but what if the data hadn't been paged out to disk yet? Stuck in cache somewhere that got erased when the machine powered off?

      You replay the binary logs of any transactions that were run since the last backup.
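
      As a rough illustration of that replay step, here is a toy sketch in Python. It is not how any particular database implements its binary log; the LogEntry record, the restore function and the gold balances are all made up for the example. The idea it shows is just "restore the last backup, then re-apply every logged change that came after it".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    """One logged change (hypothetical format, for illustration only)."""
    seq: int         # monotonically increasing sequence number
    player: str      # whose balance changed
    gold_delta: int  # how much it changed by


def restore(snapshot, snapshot_seq, log):
    """Rebuild state from a backup plus every log entry written after it."""
    state = dict(snapshot)  # copy, so the backup itself is never mutated
    for entry in sorted(log, key=lambda e: e.seq):
        if entry.seq <= snapshot_seq:
            continue  # this change is already reflected in the backup
        state[entry.player] = state.get(entry.player, 0) + entry.gold_delta
    return state


# The backup was taken after entry 2; entries 3 and 4 happened later
# and exist only in the log.
backup = {"alice": 100, "bob": 50}
log = [
    LogEntry(1, "alice", 100),
    LogEntry(2, "bob", 50),
    LogEntry(3, "alice", -30),
    LogEntry(4, "carol", 10),
]

recovered = restore(backup, snapshot_seq=2, log=log)
print(recovered)
```

      The part that matters is the sequence number: as long as the backup records which log position it corresponds to, replay can skip everything at or before that point and apply the rest in order.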

      I'm not saying it's not a big problem because it's a game - I play lots myself, and understand the frustration when things break. I'm saying it's not a big problem because whether you're tracking forum posts, medical records, or game players, when it gets to the database and hardware level, it's all the same thing.

      These are solved problems. The headline may as well be "sysadmins administer systems for Sony". The only reason this is getting any coverage is because they mentioned MMOs at some point.
    2. Re:Would love to hear more from these teams by Minwee · · Score: 4, Funny

      The secret to achieving five nines uptime is not to improve the reliability of the systems, but instead to be very careful about how you define "uptime".

      "Hey, about those two hours of downtime last night..."

      "There wasn't any downtime."

      "No, really, the phones were lit up with people complaining that the applications weren't answering properly..."

      "So the applications were answering queries? Then they were up. It's not downtime."

      "But they were answering queries with error messages."

      "Then that's an application problem. The system was still up."

      "But the error messages said 'No response from database'. The database servers were down."

      "No they weren't. They were still running. They still had power. The servers were up. It's not as if they fell down out of the racks. You can't call it downtime just because a few programs aren't behaving exactly the way you want."

      "So about this SLA..."

      "Five nines, baby. We've still got five nines."

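      For a sense of scale, the arithmetic behind the joke can be sketched in a few lines of Python. This assumes availability is measured over a 365-day year, which is only one possible convention (the SLA in the dialogue never states its measurement window), and the function names are invented for the example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability):
    """Yearly downtime budget implied by an availability target."""
    return MINUTES_PER_YEAR * (1.0 - availability)


def availability_after(downtime_minutes):
    """Availability over one year, given the total minutes of downtime."""
    return 1.0 - downtime_minutes / MINUTES_PER_YEAR


budget = allowed_downtime_minutes(0.99999)  # "five nines"
actual = availability_after(120)            # one two-hour outage

print(f"five-nines budget: {budget:.2f} minutes per year")
print(f"after a two-hour outage: {actual:.5%}")
```

      Under that assumption, five nines leaves a little over five minutes of downtime per year, so a single two-hour outage lands between three and four nines. Hence the appeal of redefining "uptime".
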
  7. Re:Linux? by magamiako1 · · Score: 3, Insightful

    The client and what the server does and has to do are entirely separate things and pretty much have no relation with regards to each other in any way except that they communicate data back and forth for one or the other to process.