Debugging Microsoft.com
teslatug writes "Channel 9 has an interesting video interview with Chris St.Amand and Jeff Stucky who test and debug Microsoft.com. They reveal some of the big problems they used to face such as recycling processes every 5 minutes due to memory leaks and 32 bit limitations, and being unable to push more than 10 Mbits of data to their datacenters due to Windows' networking stack limitations."
WMV? You serious?
How the hell am I supposed to watch that?
The summary is missing the fact that many of their problems went away after upgrading to an early 64 bit version of Vista with its improved networking stack.
Help Brendan pay off his student loans
The next post will be about debugging slashdot.
Why don't they just migrate to Apache on OpenBSD? :)
Oh, right...
1. mplayer
2. xine
Not that tough, really, now is it?
The secret to creativity is knowing how to hide your sources. -- Albert Einstein
I suppose that, transitively, it is due to a limitation in an archaic version of the BSD stack.
"Strangers have the best candy" -Me
Hey, Microsoft has to eat their own dogfood if they want to keep some modicum of credibility no matter how bad the food tastes...
Is anybody really surprised here? What they didn't tell us is that there's a top-secret Debian redundancy server running behind it just in case all hell breaks loose. Nothing to see here, move along.
Is that not one of the most ironic things you've ever heard? The limitations of the operating system made by the same company holding back another division? Shock and awe.
The limitations discussed in the video of the Windows TCP stack are not limited to Windows. These are limitations imposed by a to-the-spec implementation of TCP. TCP is 30+ years old, and it wasn't designed for the kinds of networks it runs on today.
The new TCP stack in Vista effectively implements TCP in such a way that it removes these limitations while preserving compatibility with old stack implementations.
Interviewer: "Hey dude."
Chris St.Amand "What up bro"
Interviewer: "So like what happened when you worked on microsoft.com? Oh but first...Did you get all the chicks at the bars when you mentioned your job or what?"
Chris St.Amand: "Oh totally. I'd just say, 'what up babe. I work on the microsoft.com web portal' and she'd defrag my hard drive all night."
Interviewer: "Sweet. So what was your biggest hurdle writing all that HTML? After all that's a complicated language to master."
Chris St.Amand: "It'd definitely have to be that F'ing page not found shit. You don't know how many times I'd go to microsoft.com after doing a big update and it'd just say four-oh something and the page just wouldn't show up. You know we tried to put up a 420 page not found but got in trouble with our boss."
Interviewer: "Yea totally! That would have been cool. Oh ummm let's see here. So what other problems did you have?"
Chris St.Amand: "Not being able to use FreeBSD to serve that shit. When I first heard I actually had to use Microsoft I was completely like, 'Not cool Bill. Not F'ing cool, Bill.'"
Interviewer: "Any thing else? Like was it hard to get up every day in the morning knowing that your existence was updating microsoft.com HTML?"
Chris St.Amand: "Yea I tried suicide a number of times. But then I discovered that I could just completely make up new HTML tags and that was a lot of fun."
Interviewer: "Make up HTML?"
Chris St.Amand: "Oh yea, we're microsoft. When I first started they told me that no other browsers exist other than that big blue F'ing E and that no other operating systems exist. And that I could do whatever I wanted to do. So I just started making up *ALL KINDS* of crazy ass HTML."
Interviewer: "Cool dude. You rock. Anything else you want to mention?"
Chris St.Amand: "Yea you know all that crazy F'ed up HTML that all of our products output? You know without indentation and messed up question marks everywhere? That was me. I was all hung over the day I added that. And that's about it."
Interviewer: "Thanks Chris, I'm sure you'll go down in infamy for such a piece of F'ing shit web page and end up in some lame ass 'Don't write web pages like this' hall of fame."
Chris St.Amand: "Peace out and remember to eat your greens not smoke 'em!"
http://wm.microsoft.com/ms/msnse/0511/25766/micro
While I usually RTFA (unlike most slashbots) I think we can all agree that at 40 minutes maybe 1/2 a percent of
/me waits for the transcript
And yea, I saw the cans, but the bit-rate of that video is so low, I have no clue what they were. Maybe that red one on the left is a coke or dr. pepper?
[Fuck Beta]
o0t!
Is windows really limited to 10mb/s due to the network implementation? Now I am really glad I convince people to use Linux, we have one server pushing 480Mbits/s or so using Lighttpd.
They should be redesigned.
That's a big problem with software made by companies:
1 - The company's cash flow is based around selling new versions of the software
2 - They can't sell their customers improvements that the customers can't see
3 - There is a fixed time that can go by between one release and the next one
4 - Resources are limited
Because of this, a major redesign won't be profitable: only the advanced users will notice the changes, but 99% of the customers won't, so the software won't sell well. Bug fixes won't sell either, because they are also invisible to the naked eye of the majority of the userbase, and customers expect those changes to be free.
So some companies can only expect revenue from a given piece of software once a year, and they have to invest a limited set of resources in it over, say, 6 months, at which point they have to freeze the feature set so they can start debugging. Seeing which things sell, they will obviously focus their attention on new features and a nicer GUI.
OTOH, a project that doesn't have a company running it can just put out lots of upgrades when needed, and focus its time on making the software better, even if some of the changes won't be seen by most of its users.
With software prices dropping, and Free Software proving to be a better option, the budgets of software companies will be even more limited, and we won't see this situation changing anytime soon.
WTF am I doing replying to an AC at 5 A.M on a Friday night?
Slightly off topic, but the new Windows TCP stack will be implementing their new Compound TCP stack, aka, CTCP. More information can be read here:
http://research.microsoft.com/research/pubs/view.aspx?type=Technical%20Report&id=940
Am I the only one who looked at the title and thought: "debug microsoft.com? Who still uses .com files any more?"
Yup, thought so. I suck.
They reveal some of the big problems they used to face such as recycling processes every 5 minutes due to memory leaks and 32 bit limitations, and being unable to push more than 10 Mbits of data to their datacenters due to Windows' networking stack limitations."
Micro$oft needs 64 bit so it can leak more memory faster and stay running. Or at least this is how I read this.
As for 10mbs, maybe they should put a Linux/BSD/UNIX cache in front of those servers like MSNBC did to get through the last olympics.
Absolutely true. I used to work for a hosting company, we had GNU/Linux and Windows servers. ...
The GNU/Linux servers were the ones with more hits, and the ones that required less attention. The Windows servers were a Pandora's box of problems. IIS just can't hold up by itself: if you just serve static pages you are OK, but when people start using that asp + odbc shit, you have to restart IIS every 5 fucking minutes. We used to receive a stupid "too many connections" from ODBC in our log, and restarting the stupid services wouldn't do a damn thing; all you could do was restart the machine. Yes, restart a SERVER. That's about the worst thing a sysadmin can go through, the panic of not knowing if that crappy Windows was going to come back up or not. OTOH, our GNU/Linux machines, with sites running a variety of CGI apps (PHP, Perl, etc.), all using MySQL, supported 5 times the load of the Windows machines without complaining, and I'm talking about 300 sites on simple x86 hardware, less powerful than that of the Windows machines, which died with fewer than 100 sites.
WTF am I doing replying to an AC at 5 A.M on a Friday night?
Well, do you think they want to give us Linux users the satisfaction of seeing Microsoft employees admitting faults in their software?
Microsoft does this all the time. They call it eating their own dogfood. In a way, it's quite smart actually. One, it shows customers that they aren't afraid to run their own product. Two, it helps them learn how to use and support their products in a large network. And three, it helps them find defects in the software.
Geek used to be a four letter word. Now it's a six-figure one.
People like us aren't running web sites that process 10 to 15 Gigabits per second.
Remember... ZG9uJ3QgZm9yZ2V0IHRvIGRyaW5rIHlvdXIgb3ZhbHRpbmU=
According to microsoft, the MSN messenger service (which serves to around 70 million people) used to run on 250 32-bit servers, and now it runs on just 25 or something like that... (apparently one of the big reasons was the limit on the number of tcp connections).
It's quite amazing to think that a service as huge as messenger can run on just 25 servers!
The AACS key is NOT 0xF606EEFD628B1CA427BEA93A9CA9773F
Wow. Just wow.
I look at Solaris (err, OpenSolaris) and how it can now push a 10Gb/s interface at line speed (or close to it) and MS has struggled up until recently to get satisfactory speeds above 10Mbit/s ?
Yet another "how do users/admins accept this as OK" thought going through my head re: Windows internals.
You realize that that article talks about issues that had been long since solved by 1996, and list the solutions to them? In the case of the particular quote, the TCP Window Scale Option.
The following is just hearsay, as I've never actually worked for MS. But a couple of engineer buddies I used to work with did some subcontracting for MS, and they said they deployed a whole lot of internal-facing *nix servers during that period. I tend to believe it, because the MS security guys who taught some seminars I attended wouldn't confirm or deny that they used any Linux internally. If they could have denied it in clean conscience, wouldn't they have done so emphatically?
Working in a DevOps shop is like playing in a band made up entirely of keytarists.
IIS just can't hold up by itself, if you just serve static pages you are ok, but when people start using that asp + odbc shit, you have to restart IIS every 5 fucking minutes.
That's not because of IIS; it's because of the people writing the ASP apps and stupid admins not configuring IIS correctly. If you have stupid people writing applications, those applications have a tendency to do stupid things. Combine that with admins who don't properly isolate the applications running on IIS and you've got a recipe for requiring an IIS restart "every 5 fucking minutes".
Give me 5 minutes and I can write a nice app that takes down Apache no problem. A few infinite loops, perhaps each creating a dozen new database connections and allocating a massive string buffer in memory.
IIS 6.0 has a lot of features built into it that allow admins to configure application pools to more effectively isolate applications. You can configure those application pools to recycle automatically given certain criteria (like memory usage, CPU usage, # req/sec, # req/total, etc.), and the pools are isolated from each other so that if one dies due to a misbehaving application, the other applications on the system are not affected.
We used to receive a stupid "too many connections" from ODBC in our log, and restarting the stupid services wouldn't do a damn thing, all you could do was restart the machine, Yes, restart a SERVER.
Perhaps that's all you could do, but somebody who spent more than 10 minutes reading about administering IIS would know to recycle the ODBC COM+ application to clear out the connection pool. Then they would find the stupid people writing those crappy applications and fire them, or at least isolate their applications in a separate app pool or worker process. (Dllhost.exe.)
Spare me the anecdotal stories of your LAMP solutions doing so much better than your Windows solutions. You have absolutely no credibility given your complete ignorance.
Hmm, nearly-direct link to a 145-megabyte video file on the /. front page, posted right as the geeks of the world are getting home from work. What are you, crazy? Are you trying to Slashdot Microsoft?
Don't answer that.
"In a 32-bit world, you're a 2-bit user. You've got your own newsgroup, alt.total.loser." -Weird Al
In other words, TCP is a protocol, not an algorithm.
So ... if Vista has some fabulous new algorithms for implementing TCP, then why can't other OSes be patched to benefit from those algorithms also? OR, if Vista is implementing something other than TCP, then how can it be (fully) backwards compatible?
Seems like the word "compatibility" might need to be scrutinized here.
Human being (n.): A genetically human, genetically distinct, functioning organism.
Surely.
But you know that does not really solve the problem. Window Scale just allows you to "adjust" your window beyond 64 KB. Also, a packet loss with a large window has some dramatic consequences, and addressing that is not easy.
Second, large windows degrade what we call "fair queuing" mechanisms: splitting bandwidth over multiple TCP/SWP connections. Large windows cause a lot of congestion.
I am not a Windows user myself:
[ 16.784315] TCP reno registered
[ 16.784454] TCP westwood registered
[ 16.784487] TCP highspeed registered
[ 16.784515] TCP hybla registered
[ 16.784542] TCP htcp registered
[ 16.784570] TCP vegas registered
[ 16.784597] TCP scalable registered
I've all those TCP "flavours" available. Some are good for high-speed links, some for high-latency, some for low-congestion and so on.
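For anyone who wants to check what their own box offers, here is a minimal Python sketch. It assumes a reasonably recent Linux kernel that exposes /proc/sys/net/ipv4/tcp_available_congestion_control; the helper names are made up for illustration, and on other systems the second function simply returns an empty list.

```python
import re
from pathlib import Path

# Linux lists the usable congestion control algorithms in this file.
# Assumption: the path is missing on non-Linux systems and on very old kernels.
AVAILABLE = Path("/proc/sys/net/ipv4/tcp_available_congestion_control")


def registered_from_log(log_text: str) -> list[str]:
    """Pull algorithm names out of kernel log lines like 'TCP reno registered'."""
    return re.findall(r"TCP (\w+) registered", log_text)


def available_algorithms() -> list[str]:
    """Return the algorithms the running kernel offers, or [] if the file is missing."""
    try:
        return AVAILABLE.read_text().split()
    except OSError:
        return []


if __name__ == "__main__":
    sample = "[ 16.784315] TCP reno registered\n[ 16.784454] TCP westwood registered\n"
    print(registered_from_log(sample))
    print(available_algorithms())
```

The first helper only parses boot-log text like the lines quoted above; the second asks the running kernel directly.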
There are some other issues around that may arise if you have some other "active" node in between the endpoints (such as routers). But you know that.
This is why I love AAL5 (ATM)
Bandwidth vs Latency.
/proc/sys/net/ipv4/icmp_ratelimit
:)
Take a truck. A huge one. Fill it up with recorded DVD's and send it over a hundred miles distance.
You'll have huge bandwidth.
But wait, somehow a DVD got lost in transit. What now ?
You have to phone back and have a taxi to pick it up and deliver the missing DVD.
As you need the last DVD, you'll have to wait. Your bandwidth decreases.
It's pretty much costly for you to do so if you miss a DVD.
So you decide to take only a hundred DVD's per truck and using multiple smaller trucks. But somehow none is missing this time, so you spent a lot of money for the extra trucks.
This issue is somehow similar to Heisenberg's Uncertainty Principle. You cannot get maximum bandwidth and minimum latency.
Linux can respond faster if it has to. OS/X doesn't do that because it does not want to.
It can also respond slower:
$ cat /proc/sys/net/ipv4/icmp_ratelimit
250
Tune it as you wish.
Yes, I had some beers today, and what?
No, no, no... they can saturate a 10MB/s connection easily. What they had problems with was database connections over a long distance (a problem with TCP, not Windows)... which they rectified (using a concept called CTCP), check this paper out: http://research.microsoft.com/research/pubs/view.aspx?type=Technical%20Report&id=940
-everphilski-
You could scan through all of my old posts for background if you like, but back when NT 4.0 was brand new, I helped to save a failing ISP (for at least the next 6 months or so) by setting up a new mail server to replace the one that was failing every 2 to 10 minutes. I used a machine with less than half the power and resources of the machine already running... and loaded Slackware. I think the kernel was just over 1.00 at the time.
...I guess I've repeated enough digs on microsoft for one posting...
Yeah, "old technology" couldn't do anything better than new stuff like NT right? Come to think of it, there's not a LOT of difference between XP's kernel and NT's from what I understand... a few bug fixes here and there... but basically, it uses the same vulnerable messaging scheme and drivers running at ring-0 and all that.
After having this video playing in the background for awhile, one interview question caught my ear:
"So is your security getting better?..."
Aside, it's funny to hear them concede that they're actually having to adjust for other browsers visiting their home page.
"Use standard-compliant code? Heresy!..."
The one on the left is Coke, the other 3 are Red Talking Rain. Personally, I'm a Green Talking Rain programmer, but I can respect the other side :) Talking Rain (particularly green) is the nectar of the programmers here in Seattle.
:)
You see, Microsoft started the great thing a few years back where every floor was stocked with 2 giant refrigerators of free soda. The rest of the local software companies quickly moved to copy this ingenious move, so you can't program and not be in contact with all the free soda you can drink. This sounds pretty cool until you've done it for about 2 years. At that time, assuming you are not a natural soda addict, the last thing on earth you want to drink is any kind of beverage with sugar in it, because you are so unbelievably sugared out. In comes Talking Rain. Talking Rain is a simple carbonated spring water, with just a hint of fruit oil added, and no sugar. Green Talking Rain adds lime oil, and Red Talking Rain adds raspberry, I think, although being a Greener myself, I never really paid attention. The fact that only senior programmers have completed this Talking Rain pupation allows you to easily glance at someone's trash can in their office and peg them for a Senior or Junior level developer. You will almost never see a Junior level developer drinking Talking Rain, and almost never see a Senior level NOT drinking it. Kind of a free soda pecking order.
Of course I may be reading too much into this, but my Greener roots run deep.
Latency and bandwidth are not orthogonal when you have flow control. Try looking up 'bandwidth-delay product' and TCP windowing. To achieve 1 Gbit/s to Mars you need to buffer all that data in case of packet loss. Available memory will throttle your throughput.
A quick web search says round trip times to Mars are between 10 and 50 minutes. Say 60 minutes * 60 seconds * 1 Gbit/s = 3,600 gigabits of window space to achieve full line rate. Now consider some minor packet loss and even with SACK you're buffering an unreasonable amount of data.
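If you want to play with the numbers yourself, the bandwidth-delay product is just rate times round trip time. A rough Python sketch follows. The function name is made up for illustration, and it assumes the sender has to keep one full round trip of unacknowledged data around.

```python
# Sketch: bandwidth-delay product, i.e. how much unacknowledged data
# a sender has to keep buffered to fill a link for one full round trip.

GBIT = 1_000_000_000  # bits


def bdp_bits(rate_bits_per_s: float, rtt_seconds: float) -> float:
    """Bits in flight needed to keep a link of the given rate busy."""
    return rate_bits_per_s * rtt_seconds


if __name__ == "__main__":
    # Round trip times in minutes; 60 is the deliberately pessimistic figure.
    for rtt_minutes in (10, 40, 60):
        bits = bdp_bits(1 * GBIT, rtt_minutes * 60)
        gbits = bits / GBIT
        gbytes = bits / 8 / 1e9
        print(f"RTT {rtt_minutes} min at 1 Gbit/s: {gbits:.0f} Gbit in flight (~{gbytes:.0f} GB of buffer)")
```

With a 60 minute round trip at 1 Gbit/s that comes to 3,600 Gbit, or about 450 GB sitting in a retransmit buffer.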
Annoying that the parent got modded up with bad information and this post will likely be passed over.
The round trip time to Mars varies between about 10 and 40 minutes, depending on the relative positions of the Earth and Mars.
Slashdot has turned from "Microsoft sucks" to waxing poetically about how Microsoft used to suck.
How times change...
It was one of their secondary sites, something like blah.microsoft.com. The ISP was supposed to be hosting it on a colo NT box as part of an outsourced hosting contract. Well, the site crashed constantly and the support team got sick of the late night pager calls and moved it over to a BSDI box with Apache and spoofed the server headers to read IIS, never told the M$ guys.
You're totally not understanding where the bottleneck is. The issue isn't if it's possible to push 480Mb/s out of one machine, or if it's possible to push it over a link rated at 480Mb/s. The issue is if it's possible to do it *using the original TCP standard*.
Each end of a TCP connection allocates a receive buffer. The available empty space in this buffer is mentioned as part of the header on every packet. A TCP implementation is allowed to continue sending packets to the other machine as long as there is space in the buffer. If machine A says it has an 8 KB buffer, then machine B can send 8 KB without worrying. If machine B receives an ACK packet saying that there is free buffer space from machine A before it sends 8 KB, then it can keep on sending data. However, if 8 KB is sent before machine B hears anything from machine A, machine B is required to completely stop sending data until it receives an ACK indicating free buffer space.
The buffer size specified in the TCP header is a 16-bit number, so it tops out at 64 KB. This worked fine on slower networks, but according to the article it peaks around 10Mb/s. It becomes an issue of latency. Once you receive a packet, you need to be able to get an acknowledgement packet to the other machine before it can send out 64KB of data (counting the packet you just received). If you can't, the other machine stops sending data until it hears from you.
Sometime in the 90's when the problem first became an issue, a solution was developed. A new TCP option was created that indicated the buffer size in the TCP header was to be multiplied by a number specified during the initialization of the connection to get the true buffer size. Apparently MS only implemented this recently.
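To put rough numbers on that, here is a small Python sketch of the window-limited ceiling. It assumes no packet loss and a sender limited only by the advertised window; the RTT values and the shift count of 7 are picked for illustration, not taken from the video.

```python
# Sketch: throughput ceiling of one TCP connection limited by its receive window.
# Assumes no loss and that the advertised window is the only limit.

MAX_UNSCALED_WINDOW = 65_535  # bytes; the largest value a 16-bit window field can hold


def max_throughput_mbits(window_bytes: int, rtt_seconds: float) -> float:
    """At most one full window can be in flight per round trip."""
    return window_bytes * 8 / rtt_seconds / 1_000_000


def scaled_window(window_field: int, shift: int) -> int:
    """Window Scale option: the 16-bit field is shifted left by a count agreed at connection setup."""
    if not 0 <= shift <= 14:
        raise ValueError("shift count must be between 0 and 14")
    return window_field << shift


if __name__ == "__main__":
    for rtt_ms in (1, 10, 50, 100):  # illustrative round trip times
        rtt = rtt_ms / 1000
        plain = max_throughput_mbits(MAX_UNSCALED_WINDOW, rtt)
        scaled = max_throughput_mbits(scaled_window(MAX_UNSCALED_WINDOW, 7), rtt)
        print(f"RTT {rtt_ms:>3} ms: {plain:8.1f} Mbit/s unscaled, {scaled:10.1f} Mbit/s with shift 7")
```

At a 50 ms round trip the unscaled window works out to roughly 10.5 Mbit/s per connection, which is in the same ballpark as the figure from the summary. The multiplier from the option is always a power of two, since it is expressed as a shift count.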
And they still manage to have a service outage for at least a few minutes to a few hours a month. AIM and Yahoo! don't seem to do that to me.
Administration, software issues, whatever. MSN isn't that amazing, especially compared to the other services.
My blog. Good stuff (when I remember to update it). Read it.
... Google runs on several thousand PC-class servers.
;-)
Yeah, but Google's servers aren't just passing bits around, they store a copy of the whole (freely accessible) web.
-- "I never gave these stories much credence." - HAL 9000
All of everyone's problems went away when they switched to WinXP?
Sorry, I thought everyone's problems went away when they switched to WinME?
Sorry, I thought everyone's problems went away when they switched to Win98.
Sorry, I thought everyone's problems went away when they switched to Win95.
all i seem to hear before a new windows release is how xxxx is stable now xxxx starts up in only 4 seconds xxxx doesnt have this problem xxxx doesnt have that problem.
Windows has had commercial server software for how long ?
and its just fixing a stack limitation when ?
If you watch the system log on an OS X machine that's getting ping flooded, you'll note that it starts printing "Limiting icmp ping response from (large number) to 250 packets per second". It's entirely intentional.
TANSTAAFI: There Ain't No Such Thing As A Free iPod.
Heh, if you had to pay those MSFT licensing fees, I'm sure you'd find a way to reduce the number of Windows Servers you used too. ;)
It gracefully "cycles" your process so you have your memory leaks. If only other apps were coded for memory leaks.
I think you underestimate just how much I just dont care.
Microsoft--and the two staffers shown in this video--deserve strong praise for the *unedited* candor, the self-deprecating humor, and the absence of spin on this video.
:-)
Maybe I've missed the comments, but what no one seems to mention here is that these guys--clearly both geeks at heart (in a good way)--really are peeling back a lot of the layers of MS's site. The candor about their security problems, the 2 GB memory issues, and a variety of other things was refreshing.
Heck, they even mention firefox.
Good work all. Good work.
Running 'Nix is like owning a Lightsaber. It's "a more elegant weapon for a more civilized time."
I like how 10:14-10:18 zooms in right on Chris's keyboard as he types his password. Just using Windows Media Player on cntl+shift+s takes a lot of the guess work out of the password.
Especially with a little help from our friends from UC Berkeley.
Also, I like 12:32, "so we'll avoid showing ip address... haha we'll have to cut that part out..." like the large 207.46.16.30 address looking at us in the face and then seconds later the 3 ip addresses in clear view on the right.
"So we have terminal server access to all the servers in the data center, right." Right, well I wonder how many of those servers, whose IP addresses we just saw, are attached to Chris's login and password?
Ready, aim, proxy.
Hey look no pointless curly braces or semicolons... just like Python
At around 10:25 in the video Chris St. Amand, who runs Microsoft's website and data center, types in his password, which the camera recorded. And the video is hosted off of Microsoft's website...although I don't know how long that'll still be operational.
If you think that AIM never goes down, you have no idea what you're talking about. I've had AIM shit out on me MANY, MANY times, and yes, this is with the actual AIM client. It'll kick me off, and I won't be able to sign in for a few minutes, sometimes it'll get stuck at verifying login/password and just sit there until it times out, etc.
AIM has its server problems too.
Also, not everyone who disagrees with you is an astroturfer. As hard as it may be to believe, some people might ACTUALLY have different experiences and opinions than you.
But Apache never crashed (and this was on a comparatively memory-poor box by today's standards - 256 meg), just took a second or two ... and nobody else connected to the box complained.
.NET framework is and how much bang for my buck I can get out of ASP.NET on IIS. Sometimes I pick Java for those rare cases one needs a server application to be portable.
Apache, like IIS, has a finite number of threads it uses to handle incoming requests. If you use up all those threads, Apache, and IIS, can't respond. You either must increase the number of threads or users will be denied access to the site. Eventually, you run out of system resources. In either case, you've prevented one (or likely a lot more) request from being fulfilled by the web server. End of story.
Your example is a foolish one. You never caused Apache to run out of resources. If you had, it would have "crashed" as the original poster meant it... it couldn't handle further requests. That wasn't because Apache is superior in some way to IIS, it's because your clicking didn't use up all the threads. Simple as that. That's what I was explaining... the same thing can happen to Apache as can happen to IIS. Just because Apache is open source doesn't make it invulnerable to resource exhaustion due to inept programmers.
No, it's Windows that pretty much has no credibility. The one thing it DOES have that nobody else has is the widest selection of trojans, viruses, worms, and idiot users.
That and the majority of the fortune 500 companies running on it. Windows is a fully capable server platform, and there are countless examples to back that up... just as there are countless examples that show that Linux can be a capable server platform. My point was that IIS is not inherently flawed as the original poster suggested. In fact, IIS 6.0 is in my opinion the best web application server on the market if cost is not an issue. (Windows licenses can be too expensive for a small company.) It's had extremely few security holes (FAR fewer than Apache has in the same timeframe), it's very fast (thanks to advanced features like kernel mode listeners), it's extremely reliable thanks to application isolation, process recycling, and great management and monitoring tools, and it's host to many excellent development platforms from PHP to ASP.NET.
IIS 7.0 is shaping up to be even better with some great ways to customize the web server to make it as bare metal as possible if that's what you want.... taking a hint from Apache in this case.
But for you to sit there and question the intelligence of somebody who uses Windows as a server platform shows your ignorance. It shows you don't bother to really examine alternatives to what you're comfortable with. When choosing a platform for a project I make sure to consider as many things as possible... from portability requirements, to intellectual property issues, to performance, to cost, to ease of development. That's my job as a software architect. Sometimes I choose LAMP for its very low initial cost. (Basically free.) Sometimes I pick ASP.NET because of how robust the .NET framework is and how much bang for my buck I can get out of ASP.NET on IIS. Sometimes I pick Java for those rare cases one needs a server application to be portable.
Regardless, there are lots of options out there and until you're able to pick the best one for the job at hand you're just going to be limiting yourself for no good reason. Both career wise and intellectually.
So because AIM simply refuses to connect instead of giving you the useful info that the service is down (and thus don't bother trying to troubleshoot your computer/network) means that it never goes down?
I imagine that the price tag, the exposure to malware (one of the big reasons I don't use MS products myself), and possibly the lack of PPC and/or 64-bit versions of MS-Windows and/or the codecs might have something to do with it.
What your assertion basically amounts to is: "He should run x86/32 and use an illegal copy of MS-Windows rather than run a Free (and probably free) OS and player on the hardware of his choice."
Let's put this in modern, everyday terms. Imagine Sony's media companies releasing only DVDs that work only on Sony players. I own a Panasonic player. You're telling me that I should buy a Sony player at whatever price Sony asks rather than whining about Sony's exclusivity?
It's kind of like signing a temperance pledge because practically everybody else in my community has VD, and subsequently being told that if I want to watch a movie I have to have sex in the back row of it. Am I a whiner because I refuse?
And how about you?
Got time? Spend some of it coding or testing
Microsoft is the only place I've ever worked that hired other engineers to remove the ongoing responsibility of performance and debug from development engineers. They should require that a developer maintain whatever they work on for at least a year after release.
Pardon me if I think you're lying through your teeth. How could they not notice that they're no longer connecting to a Windows server? They would still have to connect via FTP or some other protocol; did you spoof those too? Not just that, how did you manage to fake the whole directory tree? If they connect to upload files, they'd notice it was a unix system by the file hierarchy and the fact that ASP DIDN'T WORK ANYMORE. Yes, there are some *nix ASP products, but they don't work that well. They'd definitely notice something was wrong the second they tried changing something on the website.