On World of Warcraft's Network Issues
alphaneutrino writes to mention a C|Net article discussing some of the recent problems the World of Warcraft playerbase has experienced. From the article: "'Being a system administrator myself, I have some understanding of what goes on in a corporate data center,' said Evgeny Krevets, a sometimes-frustrated WoW player. 'I don't know Blizzard's system setup. What I do know is that if I kept performing 'urgent maintenance' and taking the service down without warning for eight-hour periods, I would be out of a job.' Blizzard blames some of the problems--such as the disconnection, for several hours on Friday, of players linked to several servers--on AT&T, its network provider. (AT&T did not respond to a request for comment.) "
Sunday: The day the server stood still
Monday: *gasp*, playable (until 11pm)
Tuesday: Weekly Maintenance Day. Nothing else EVER needs to be said about this day.
Wednesday: Playable (until 11pm), good chance maintenance aftermath.
Thursday: The 10 second instant-casts day for MC & BWL.
Yeah, it goes on. Our server reliably bites the dust around 11pm every night for 6 hours, not to mention the constant plague of login issues and 30-minute loading screens during peak hours. Funny how this is all on a low-medium population server.
...so THAT'S how Blizzard is combatting server lag.
Maybe it's the Blizzard guys' moms that come in and say "Enough of those stupid games already, go to bed!"? ;)
Or are they too cool to be running the servers out of their parents' basements like the rest of us?
"Well, at least I have chicken!"
Don't take the above poster too seriously. He doesn't.
Blizzard, I can guarantee this: if you spend $35 million per month on refactoring, hardware and bandwidth, all your shareholders go away. Guaranteed. I promise.
Just like with nine women you can have a baby in one month.
Who do you get to be an expert to tell you something's not obvious? The least insightful person you can find? -J Roberts
I've barely even seen any issues since patch 1.10. I think on patch day the servers were down all day, but that's to be expected.
Server performance varies from realm to realm. I hadn't really had any issues until the last week or two, when my server decided to drop 40 minutes into our 45-minute Baron run, and then again in the BGs later on.
As someone else mentioned, I think they are still a victim of their own success. Sure it's been over a year since launch, but they were expecting 250,000 subscribers and got 6,000,000.
No one except the 6 Million users that play the game.
I'm not a WoW player but if it's true that these systems regularly go dark for 8 hours at a time I have to wonder if they're not racing through some software patch. In other words, I don't know an architecture out there that can't be rebooted in 8 hours so a straight-up crash seems unlikely. I would assume they've taken care of scalability problems by now so system load / tablespace, etc, ought to not be an issue.
Could it be that WoW suffers constant attempts at subverting the framework of play
... and some succeed, requiring a quick patch to the code base? I wouldn't doubt that they have monitoring mechanisms in play which detect unreasonable changes in a character's level / gold, etc.
CommentBot 0.7a running with args "-module irritate,disagree -target random"
Our game had its server problems and we were in "learning mode" to deal with some major outages, major gameplay renovations, major strife from jerks, and major socio-legal issues behind the scenes such as player-to-player harassment and real-life stalking. EA/Origin's Ultima Online started later and had some of the same issues in an almost predictable order and timing. Then EverQuest repeated our mistakes, and so on.
I would think that as an industry, as a set of geeks, we MMORPG server managers would learn from each other's mistakes, but apparently we do not. It is also a problem that the management in *product* companies thinks it is easy to become a world-class *service* company, where the service is being sold to thousands or millions of *household* mass-market customers.
The problem doesn't seem to be how much they spend but where they spend their money. According to the article AT&T seems to be their only network provider. Who thinks that makes sense? To have such a huge bandwidth hungry product and rely on one provider for it. I would never host a commercial web site on a host with a single provider, let alone a huge undertaking like WoW.
But, then again, I may also be an idiot... who knows?
Sounds like WoW has a house of cards network with single point of failure architecture problems.
And that AT&T is exploiting them, marketing a new "premium service/support" contract by letting them go down.
I can't wait until WoW has to pay AT&T (and its handful of competitors, if they get rid of the SPF) the extra "premium tier" routing fees, once the telcos market their "nonneutral" Internet. Because a world of angry Warcraft players jonesing for their fix will be a nice gift for telco suits just trying to make it home from work.
--
make install -not war
It's hard for AT&T to cater to so many millions of users *AND* filter/direct all of their customer data illegally and directly to the NSA.
As an example, I came home from holiday (I'm in the UK) on Sunday evening & I immediately noticed my ADSL connection was down. So I phoned my ISP to report the fault, only to be told that they knew about the problem - a faulty server had been down for 48 hours!!! And when the tech support person could not tell me when the service would be restored, she seemed totally bemused as to why I was angry about the duration of the downtime, so I demanded to speak to her manager.
The manager was even worse... polite and courteous but did not have a clue as to the cause of the problem or when the ADSL service would be back up. He even admitted that they'd been making some network changes to accommodate a recent merger with another company and that they had no backup server to put in place to at least give some degree of restricted service.
I may pay (the equivalent of) $30 a month for my ADSL service but am I the only person who expects good service from any company I deal with, whether I spend £3 or £30,000 with that company? I accept that sometimes there are service outages, I'd even view an 8-hour outage a few days a year as being understandable. But 48 hours???
I've been in the telecoms/computer industry for about 20 years now and I've seen the whole perception of what is and isn't good customer service change over that time - it seems now that customers are forced to accept worse service because every company has reduced the level of service they give.
And when it comes to poor Joe Public "peons" like ourselves, who only spend a small amount each month with these companies, we're expected to endure countless menu selections, long delays in call-centre queues and lengthy outages as a matter of course.
It would be good to see a lot more people complain more and cancel their services with some of these providers - I'm sure this is the only way that they will be forced to offer better service to us.
Gentoo Linux - another day, another USE flag.
This was a business decision, not a technical decision. Probably an exclusive agreement made with the PHB in charge of WoW.
"But why do we need two providers? ATT has assured me that they can provide all the bandwidth we need, and that they have failover capability! *plus* their datacenters are built on SPRINGS!"
94% of Repubs and 21% of Dems voted to renew the Patriot Act
The problem is really visible when you are adventuring in difficult-to-beat places. You depend on having your team perform to the best of their ability. It is then so frustrating to be constantly dealing with part of your team getting disconnected or being lagged to the point of ineffectiveness.
My guild is doing MC, BWL, ZG and AQ20 right now. It is a regular occurrence to wait 20 minutes to start a fight because of disconnected people, only to then lose that battle because you lost two priests to a disconnect during it.
The anger may not be at the threshold point yet, Blizzard, but it is most definitely building fast. The thing about angry customers is that there is a point of no return when they are forever lost. Blizzard has a lot of customers right now, but they would lose them fast if somebody else stepped up with a great game and more reliable game play.
Blizzard, you executed very very well on game content by effectively removing much of the grind that other games are plagued with, but you have failed with customer interaction. Some of your representatives treat your customers with borderline contempt (Tseric) and you fail miserably at explaining properly the multitude of changes you make to the game.
Blizzard, your six million customers are waiting; it's your move. Take too much time and you could lose them. Start with being public about your server improvement plans, telling people what you're doing and why and how it's going to make things better. Not knowing when things are going to get better is really making people angry.
Do not spread "09 F9 11 02 9D 74 E3 5B D8 41 56 C5 63 56 88 C0" over the internet, thank you.
Given their reliance on only ATT as their network provider, this is precisely the problem they have _now_, and what they need to spend bucketloads of cash fixing.
They need multiple sites around the world, with multiple OC192s to multiple providers, all BGP'd to the gills. They need to buy dark fiber and light that shit up.
Then again, why bother, it's not like it's a free market out there and there won't be any competitors to WoW that can get their act together, right? I mean Blizzard owns the patents on MMORPGaming, right?
Oh wait.
A large part of the problem is that Blizzard's communication with the player base sucks, to speak frankly. The login server for their forums seems to be one and the same as the login server for the game itself, so when that goes down the forums tend to shut down as well. There is a "Realm Status" page which purports to show the real-time status of the various servers, but which is frequently unreachable. There is a "Realm Status" forum which *might* contain some acknowledgement of a problem while the problem is still ongoing, but usually doesn't. When you start up the game client, Blizzard can stick up a 'News' window on your screen but, again, the appearance of any news often lags the problem, even severe problems, by a matter of hours. And, of course, Blizzard's chief form of communication with players is Community Managers on the forums, who themselves tend to be given dick in the way of information, are extremely controlled in what they can and cannot say, and who are (honestly, I'm not joking) tasked with yelling at users for stuff like posting subject headers that contain excessive capitalization; what an obscene waste of resources.
Seriously, a little timely information goes a long way. Yes, I agree that the downtime they have is absurd; consider that *every Tuesday* the game goes offline for *six hours* of maintenance. That's *planned, scheduled* downtime, folks, so that *alone* means they aren't even attempting to have greater than 96.4% uptime, and I can't think of another commercial service for which you pay a monthly fee where that would be even remotely acceptable; if your cable or your phone just plain didn't work for 6 hours every Tuesday, heads would roll. Then things just get asinine when you factor in all the spontaneous, freewheeling, unplanned downtime as well.
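For anyone who wants to check the 96.4% figure, here is a minimal sketch of the arithmetic. It assumes the six-hour Tuesday window is the only downtime in a 168-hour week; the function name is just made up for illustration.

```python
HOURS_PER_WEEK = 7 * 24  # 168 hours in a week


def max_uptime_percent(planned_downtime_hours: float) -> float:
    """Best-case weekly uptime if scheduled maintenance is the ONLY downtime."""
    return 100.0 * (HOURS_PER_WEEK - planned_downtime_hours) / HOURS_PER_WEEK


# Six hours of scheduled maintenance every Tuesday, as described above.
weekly_max = max_uptime_percent(6)
print(f"Best-case uptime: {weekly_max:.1f}%")  # 96.4%
```

Any unplanned outage only pushes the real number lower than this ceiling.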
But know what? I'd feel a lot better about it if, when something shits the bed, or goes tits-up, or whatever colorful metaphor you'd use to describe a server-killing technical problem, Blizzard would tell us, promptly, as they receive the information themselves:
1. We know there's a problem.
2. We know what the problem is.
3. Here's what we're doing to fix it.
4. Here's when we expect it to be fixed.
5. Update as old information is obsolete.
They don't do this. A few hours after something happens, you might get some of the above information. Or you might not. Usually, it's the latter.
I don't play any online games but I thought the whole idea of them was that you subscribe to that service for it to be available just about 24x7 whenever you feel like jumping in. Sure, occasional outages are to be expected but if it gets to the stage where the game is frequently slow or unavailable, the common-sense solution would be to cancel your subscription until Blizzard (or whoever) improves the service they deliver to you. If enough people did this, they'd have to do something about it...
I'm sorry but I think far too many people have become "slaves" to marketing by truly believing that they simply cannot do without a lot of the products & services that they pay good money for - to the point where they "need" those items so much that they're afraid of complaining in case they're denied those things completely.
Gentoo Linux - another day, another USE flag.
WOW server downtime is saving my marriage.
tidokoro
what turns a man's karma neutral? lust for gold? power? or just a heart born full of neutrality?
Because they have to pay developers, bandwidth fees, datacenter fees, customer service people, billing people, web designers, janitors, office supplies, and basically everything else it takes to run a business. $35 million / month with probably 15-20 million a month in overhead.
Yes they are making money (businesses are allowed to do this, remember?) Re-architecting a massively distributed game like this takes time *and* money. They underbuilt their infrastructure to begin with, which is where they really went wrong. They are supposedly trying to remedy that, but by the time you have re-architected the system it has grown to the point where you have to do it again.
Also, they're pulling so much bandwidth from so many disparate places that when a link close to them goes down, all the other links have to compensate and there aren't necessarily enough fat pipes close to their datacenters to allow everyone on. I would be curious to see what percentage of traffic flowing over certain core routers can be attributed to World of Warcraft; I am betting it is non-trivial.
If these problems are really related to AT&T, then why do we Germans experience exactly the same problems? Over here T-Online is the bad guy. To solve the problem, Blizzard even suggested changing the MTU of your DSL connection to 1400. I don't know how many people have ever heard of a thing called MTU. (The common people, not the nerds here ;-) )
Blizzard should ask themselves why the whole IT infrastructure is having problems with their product, and whether it is really the ISPs' fault.
Actually, that's how software maintenance happens in the real world.
Real code is complex, and generally written as a massive matrix of inter-related side-effects causing things to happen*. When it gets written, the entire matrix is designed, intended, documented, and understood. Two years later the guys working on the code have no clue about the matrix of side-effect driven code, no clue about the complex set of business factors driving the technical aspects of the code (and by business factors, in an MMORPG I mean things like class X has bad faction with everybody making it more difficult for him to start out, but in return for overcoming that challenge has more powerful magic later in life - stuff like that) and when they are making a change they go in, find the one line of code that looks like what needs to be fixed and just change it without knowing all the places that change will ripple back to, invisibly, via the side-effect matrix.
A technical phrase to understand here is 'globally scoped variables' - and another one is 'design intent' - and as the current set of hacks don't understand the ramifications or scope of either, this is what happens.
Footnotes
* I didn't say it was a good idea. I just said it happens.
Glonoinha the MebiByte Slayer
As someone else mentioned, I think they are still a victim of their own success. Sure it's been over a year since launch, but they were expecting 250,000 subscribers and got 6,000,000.
The controlling factor for their server performance should not be the total number of subscribers, but the number of subscribers per realm, and Blizzard has complete control over that number, because they can mark a realm as "full" and disallow logins/signups. IOW, as you know, those 6,000,000 people are not all playing in the same game at the same time.
It should be possible to make the realms completely independent, so that this just becomes a matter of horizontal scaling, and having hardware/systems monkeys roll out new realms via some standard operating procedure.
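To make the per-realm idea concrete, here is a minimal sketch, not a description of how Blizzard actually does it. The `Realm` class, the `capacity` cap, and the figure of 3,000 players per realm are all hypothetical, chosen only to show the shape of the argument.

```python
from dataclasses import dataclass, field


@dataclass
class Realm:
    name: str
    capacity: int  # hypothetical per-realm population cap
    players: set = field(default_factory=set)

    @property
    def full(self) -> bool:
        return len(self.players) >= self.capacity

    def login(self, player: str) -> bool:
        """Admit the player unless the realm is marked full."""
        if player in self.players:
            return True  # already logged in
        if self.full:
            return False  # realm is "full": turn the login away
        self.players.add(player)
        return True


def realms_needed(population: int, capacity: int) -> int:
    """Ceiling division: independent realms required to hold everyone."""
    return -(-population // capacity)


# Illustrative only: 3,000 per realm is an assumed cap, not a known figure.
print(realms_needed(6_000_000, 3_000))  # 2000
```

The point of the sketch is that the load on any one realm is bounded by `capacity`, not by the total subscriber count, as long as the realms share nothing else.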
Unfortunately, based on the rumors I have heard, Blizzard has chosen to tie a bunch of stuff together. For instance, the common web forums use the characters from all the realms (the web forums know about your level 23 mage), they have a single set of auth servers, it's not clear that the item databases are not shared between realms, and so on. This is sort of sad, because it's not like Blizzard are the first people to roll out an MMORPG.
Now, some might argue that tying some of this stuff together makes for a better user experience. However, when this entanglement leads to downtimes, one could make the argument that it's not worth it.
Anyway, my point is not to bash on Blizzard; I'm sure they've made some difficult design decisions correctly, and some difficult ones incorrectly. My point is that "we have lots of users" is not a good excuse when you have a service that lets you divide those users into sub-populations, and that there are probably architectural improvements they could make to improve their scalability. The real question is whether they have competent and experienced systems engineers to help them make those improvements, and whether management is committed to supporting them.
Anyway, so much for pre-coffee ramblings....
...while you're not an idiot, I can understand where they could end up with one supplier for bandwidth.
:-)
1) You need an SLA with each ISP you pull a backbone-level feed from. You can use InterNAP and hook into the peering points in the US and a few other places, but it's got its own issues- and if you just use them, you're still with only one ISP; if they fail, you're still up a creek without a paddle.
2) You'd need to frame the servers into one massive data center with a HUGE honking data-pipe from each ISP with BGP routing on the inbound routers from the ISPs to your DMZ to establish one IP address range for the front-facing servers
OR
Come up with some sort of nasty DNS trick to hopefully make the server front-ends transparent to the clients and spread them across multiple IP blocks (Which is what epicRealm did to make their CDN actually completely transparent to client and customer- and to be able to handle dynamic HTTP content...)- but be prepared, because in order for this to work right, you either need to trust the client's state, share state across server pools on different IP blocks, be stateless, or somesuch like the previous.
There's a bunch more, but the two options above and the first item will hopefully show you why someone (a bean counter, most likely...) will make the decision to just simply hold the ISP or Tier-1 host (Which is the most likely case here- they're very probably colocated at an AT&T Tier-1 facility...) to the SLA they promised- because it's cheaper and waaay simpler if everything goes right and they're "not to blame" if things go wrong. If you went an alternate route and had a mishap that wasn't server related, then you'd be to blame and have nobody to point fingers at when it all broke (And you just KNOW it will at some point- it always does...)
I am not merely a "consumer" or a "taxpayer". I am a Citizen of the State of Texas
Call me anal, but it's bad enough when I pissed half my college years away playing Diablo II online for free. I don't see the point in having to pay for the privilege to waste my time.
... they leverage a small amount of content with a gigantic dollop of tedium to keep people online as long as possible, paying their monthly fees and ruining their expensive college educations.
Actually I think it's a good thing to charge a monthly fee, that way even folks who don't understand the concept of opportunity cost won't be blissfully unaware that playing games all day is never "free". The really annoying thing for me is that most of these games require you to, basically, work (in the game).
E.g. in WoW at some point you'll want to collect a set of gear from Molten Core. Each class has eight pieces of "tier 1 set gear" which can be obtained from Molten Core (we'll ignore the other stuff you can get there). It takes 40 people to clear Molten Core, you can only do it once per week, and you get about 20 pieces of set gear from one trip. Do the math and, optimistically, you'll need to do Molten Core 16 times to equip each of those forty people (of course, it will actually take much longer -- say six months -- to get most of the people most of their pieces).
Now, every visit to Molten Core -- once you figure out how to do it -- is pretty much the same. So after your first few nightmarish two-three evening death-a-thons, you'll eventually be able to "do" MC (as it's known) in maybe three hours. So we're talking at absolute minimum 48h solid gameplay, much of it mindless repetition. (You know how to do everything, you're just waiting for your helmet to "drop".)
But that's not all. At least until you all become very well equipped, Molten Core takes a toll on your equipment and consumables (e.g. potions and ammunition). To stock up on victuals and repair your gear, you'll probably need to spend another couple of hours prep time for each "adventure". So, we're now talking, at absolute minimum, 80h of solid grind to get a complete suit of "tier 1" gear. Again, all of this is mindless repetition.
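The back-of-the-envelope numbers so far can be checked with a quick sketch. All the inputs come from the paragraphs above; the constant names are just labels, and "about 20 drops per run" is taken as exactly 20.

```python
import math

RAID_SIZE = 40              # players needed to clear Molten Core
SET_PIECES_PER_PLAYER = 8   # tier 1 set pieces per class
SET_DROPS_PER_RUN = 20      # "about 20" set pieces per weekly clear
HOURS_PER_RUN = 3           # once the raid knows the instance
PREP_HOURS_PER_RUN = 2      # restocking consumables and repairing gear

pieces_needed = RAID_SIZE * SET_PIECES_PER_PLAYER            # 320
runs = math.ceil(pieces_needed / SET_DROPS_PER_RUN)          # 16
raid_hours = runs * HOURS_PER_RUN                            # 48
total_hours = runs * (HOURS_PER_RUN + PREP_HOURS_PER_RUN)    # 80

print(runs, raid_hours, total_hours)  # 16 48 80
```

With one clear allowed per week, those 16 runs also mean at least 16 weeks, and that assumes every drop goes to someone who still needs it.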
Now Molten Core is just one instance. I don't know how long it took to assemble it, but I suspect it would take a team of developers fewer person hours to put something like Molten Core together than it will take a typical guild to finish collecting set armor. Of course, they had to attend meetings and so on, so multiply that by ten, but what you're looking at is the fundamental flaw in all current MMORPGs
"My crack pipe...My crack pipe!....suck...suck....It's not working right!"
With proper pipelining, you CAN get one baby after an initial nine-month delay, then a baby a month of throughput until your cache is depleted.