WGA Meltdown Blamed On Human Error

Posted by Zonk on Monday September 3, 2007 @12:51AM from the kinda-of-big-for-an-oopsie dept.

Erris writes "As commentators like Ars Technica slam WGA as deeply flawed, Microsoft is blaming human error and swears it won't happen again. 'Alex Kochis, Microsoft's senior WGA product manager, wrote in a blog posting that the troubles began after preproduction code was installed on live servers. ... rollback fixed the problem on the product-activation servers within 30 minutes ... but it didn't reset the validation servers. ... "we didn't have the right monitoring in place to be sure the fixes had the intended effect"' Critics were not impressed. '"A system that's not totally reliable really should not be so punitive," said Gartner Inc. analyst Michael Silver. Michael Cherry, an analyst at Directions on Microsoft in Kirkland, Wash., said he was surprised that it was even possible to accidentally load the wrong code onto live servers ... [and asks], "what other things have they not done?"' This is not the first time this has happened, either."

8 of 250 comments

"won't happen again"? by haeger · 2007-09-03 01:00 · Score: 5, Insightful

So, if it's human error that caused the problem, how can they swear that it won't happen again? Will there be no more humans working at Microsoft?
I don't get it.
People make mistakes and as long as people are involved in any process they will cock up from time to time.

The point about systems not being so punitive is a valid one and should be brought up more often and louder. People who've paid money for their product should not be punished for an error on Microsoft's end.

.haeger

--
You are not entitled to your opinion. You are entitled to your informed opinion. -- Harlan Ellison
It's a fair point by Joe+Jay+Bee · 2007-09-03 01:01 · Score: 5, Interesting

Critics were not impressed. "A system that's not totally reliable really should not be so punitive," said Gartner Inc. analyst Michael Silver. Michael Cherry, an analyst at Directions on Microsoft in Kirkland, Wash.,

WGA is a natural, if not perfect (or even good) business response to the problem of piracy (leaving out all the debate over whether it's a good or bad thing for Microsoft as a whole). But the technical implementation leaves a lot to be desired; if anything, the response to a WGA server failure should be automatic pass (fail safe) instead of an automatic fail (fail deadly).

Sure, for a 24 hour window pirates would have a free-for-all in getting perfectly valid WGA results, but at the same time legitimate customers would not be inconvenienced. As far as I can see, that's the only way to keep WGA while minimising the backlash against it.
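
To make the fail safe / fail deadly distinction concrete, here is a minimal sketch. Every name in it (`Verdict`, `ServerUnavailable`, `validate`, `broken_server`) is made up for illustration; nothing here reflects how WGA is actually implemented.

```python
from enum import Enum


class Verdict(Enum):
    GENUINE = "genuine"
    NOT_GENUINE = "not_genuine"


class ServerUnavailable(Exception):
    """Hypothetical error: the validation server could not give an answer."""


def validate(check_license, fail_open=True):
    """Return a Verdict, handling a server failure according to policy.

    check_license is a zero-argument callable standing in for the remote
    validation call; it returns True/False or raises ServerUnavailable.
    """
    try:
        return Verdict.GENUINE if check_license() else Verdict.NOT_GENUINE
    except ServerUnavailable:
        # fail_open=True  -> "fail safe": an outage never punishes the user.
        # fail_open=False -> "fail deadly": an outage is treated as piracy.
        return Verdict.GENUINE if fail_open else Verdict.NOT_GENUINE


def broken_server():
    """Stand-in for a validation server that is down or misconfigured."""
    raise ServerUnavailable("validation backend is not answering")


print(validate(broken_server))                   # Verdict.GENUINE
print(validate(broken_server, fail_open=False))  # Verdict.NOT_GENUINE
```

The whole argument is the single `except` branch: the policy choice only matters when the server fails, which is exactly when the customer has done nothing wrong.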

--
I write bullshit
1. Re:It's a fair point by Anonymous Coward · 2007-09-03 01:06 · Score: 5, Insightful
  
  Sure, for a 24 hour window pirates would have a free-for-all in getting perfectly valid WGA results.
  
  Actually, pirates would probably very quickly figure out how to set the WGA server failure condition in Windows to get the automatic pass without ever actually contacting the real WGA servers, which would render WGA completely worthless. Well... more so.
  
  I don't use Windows, can't stand Microsoft, and had a hearty laugh at the news of the WGA meltdown, but the problem is not as easy to solve from a technical standpoint as you believe.
Re:Zoom by gatzke · 2007-09-03 01:08 · Score: 5, Insightful

Slashdot is not about journalistic integrity, it never has been. It is about nerd topics and dupes.

ACs complaining about twitter does look like astroturfing. MS has enough money to pay a few guys to beat back public opinion on well-known public tech sites. Without facts disputing the current article, it looks like you are just pro-MS ranting against an anti-MS article without any substance.

Fact- WGA broke for a while causing many people troubles.

Fact- Some people don't like having to phone MS all the time to keep a product running.

Fact- MS has paid astroturfers to anonymously post pro-MS grassroots stuff online.
Not an acceptable answer by Anonymous Coward · 2007-09-03 01:14 · Score: 5, Insightful

Look, most of us here work (directly or indirectly) in software. Who hasn't had a launch fail, or a product go bad, in a way that's negatively impacted customers? Such things DO happen. Usually not out of malice, and even sometimes not from carelessness--there are things that sometimes you can't catch on a test system. So to that extent, I feel for the folks who caused this problem.

So why do I call it unacceptable? Because of the difference in standards. On Microsoft's side, they are holding the user to a high level of scrutiny, and reserve the right to cripple some OS features if Microsoft believes the install is pirated. No discussions. Go directly to "aero jail".

Which is possibly understandable if their stance is "look, we're losing billions here--we need to fight piracy." But if they're going to take such radical and punitive measures as locking down OS features based on their tool, then they have to have an absolutely rock solid fail resistant totally monitored system. Basically, they need to hold WGA to a higher standard than most business software. This needs to be the gold standard if they want people to trust the system (and TFA links to a number of other reasonably well-balanced Ars articles that suggest it is not).

Oops, we forgot to monitor the validation boxes? You can't be organic about this--add monitoring for problems as they're discovered on a system this critical not just to Microsoft, but to their customers. You have to anticipate what MIGHT happen, even if "there's no way that should ever occur." You have to think of things that should never happen, but would be problematic if they did.

The fact that they failed here, if it never happens again, might not be a huge deal. But their answer shreds confidence that this is an isolated issue. The fact that this specific failure might not happen again gives me no comfort. Because their answer indicated that they didn't get it when they designed the system, and they don't get it now.

What they SHOULD have said is "boy, this was something we never thought could happen. We have fixed the issue, and are confident we have the monitoring to prevent this specific issue going forward. And we are undertaking a comprehensive review of our validation and monitoring systems to make sure nothing even remotely close to this could ever possibly happen again." Nothing less should be acceptable.
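
As a rough illustration of "monitoring to be sure the fixes had the intended effect", here is a minimal sketch of a post-rollback check that probes every server group instead of assuming the rollback reached them all. The function name, the group names, and the version strings are all hypothetical; this is not a description of Microsoft's tooling.

```python
def verify_rollback(expected_version, probes):
    """Return the names of server groups NOT confirmed on expected_version.

    probes maps a group name to a zero-argument callable that reports the
    version that group is actually serving.
    """
    stale = []
    for name, probe in probes.items():
        try:
            if probe() != expected_version:
                stale.append(name)
        except Exception:
            # A probe that cannot answer counts as a failure, not a pass.
            stale.append(name)
    return stale


# Hypothetical scenario: the rollback reached one group but not the other.
probes = {
    "activation": lambda: "prod-1",
    "validation": lambda: "preprod-2",
}
print(verify_rollback("prod-1", probes))  # ['validation']
```

The design choice that matters is the `except` branch: a group you cannot verify is reported as unfixed, so silence is never read as success.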
Re:Have we gone backwards? by PeeAitchPee · 2007-09-03 01:21 · Score: 5, Interesting

Strictly speaking, there are no tasks I do today that I couldn't do in 1997.

Speak for yourself. Just because *you personally* don't use the extra processing power, memory, and storage that are available doesn't mean that lots of others don't. For example, I'm in the middle of digitizing and OCRing 110 years of local newspapers from microfilm into archival-quality PDFs for an historical society. Quite simply, you *cannot* have too much processing power when doing OCR -- I'm running multiple instances of ABBYY FineReader Corporate on a 2x Quad Core Xeon that has been pegged for two weeks now. It's quick, multithreads across all 8 cores and does a great job, but there's simply too much data. Note that this project would have been completely impossible in 1997 -- there simply wasn't enough processing power, memory or storage available to do it on anything less than a supercomputer. And that's not even considering truly bandwidth- and processor-intensive tasks related to video, weather modeling, etc.
Re:Have we gone backwards? by PeeAitchPee · 2007-09-03 02:25 · Score: 5, Interesting

As for your task, it may not have been done on a single machine in a reasonable timeframe and certainly not in a point and click fashion. However you could have easily integrated the ABBYY engine into a networked batch OCR solution and then hired the capacity to run it (eg: a renderfarm).

Ahhh, spoken like someone who's never done a project like this before. So easy to plan in your head on Slashdot in 30 seconds, isn't it?

Even if building the required integration between ABBYY's OCR engine and some sort of distributed processing farm weren't cost-prohibitive (which it is -- historical societies aren't exactly made of money), how would you suggest I upload over a terabyte of raw image data in a timely fashion to said render farm? And then download it again once completed (not as big of a problem, but still an issue)?

The bigger question is whether or not to take on OCR in-house at all. If you want to sub-out OCR, then you have to wait until the scanning is complete (weeks) -- sending partial jobs via hard drive is more expensive than sending everything at once at the end. It's still too much money at the end of the day -- much, much cheaper to keep it in-house, and the QA process is better. The cheapest option is to buy the fastest server your budget permits and run it 24x7 in parallel with scanning and final PDF assembly / burning. ABBYY FineReader multithreads on recognition, but NOT on opening batches or writing out PDFs. That is the real bottleneck, and the reason it's necessary to run multiple instances.
Re:tagged as "blamebill" by Doctor+O · 2007-09-03 02:46 · Score: 5, Funny

Bill is still chairman. Ballmer is CEO. Last thing I heard, Ballmer indeed is the chair man. I don't think Bill has *ever* thrown a chair.

--
Who is General Failure and why is he reading my hard disk?