W3C Gets Excessive DTD Traffic

Posted by ScuttleMonkey on Friday February 8, 2008 @01:22PM from the stop-the-intertubes-i-wanna-get-off dept.

eldavojohn writes "It's a common string you see at the start of an HTML document: a URI declaring the type of document. But software often fetches that URI, causing undue traffic to W3C's site. There's a somewhat humorous post today from W3.org that reads as a cry for sanity, asking developers to stop building systems that automatically query this information. From their post, 'In particular, software does not usually need to fetch these resources, and certainly does not need to fetch the same one over and over! Yet we receive a surprisingly large number of requests for such resources: up to 130 million requests per day, with periods of sustained bandwidth usage of 350Mbps, for resources that haven't changed in years. The vast majority of these requests are from systems that are processing various types of markup (HTML, XML, XSLT, SVG) and in the process doing something like validating against a DTD or schema. Handling all these requests costs us considerably: servers, bandwidth and human time spent analyzing traffic patterns and devising methods to limit or block excessive new request patterns. We would much rather use these assets elsewhere, for example improving the software and services needed by W3C and the Web Community.' Stop the insanity!"

The Solution by OdieWan · 2008-02-08 13:29 · Score: 5, Funny

I have a solution to the problem; I wrote it down at http://www.w3.org/TR/html4/strict.dtd !
1. Re:The Solution by Anonymous Coward · 2008-02-08 13:32 · Score: 5, Funny
  
  Don't click that link! It's some sort of ascii pornography!
Do what.... by Creepy+Crawler · 2008-02-08 13:29 · Score: 5, Funny

Do what any other respectable web provider would do..

Put links to Goatse in the definitions!
--
- Mod parent up! by Anonymous Coward (Score:1) Thurs, Nov 31, @13:37
Delay by erikina · 2008-02-08 13:31 · Score: 5, Interesting

Have they tried delaying the response by 5 or 6 seconds? It could cause a lot of applications to hang pretty badly. That or just serve a completely nonsensical schema every thousandth request. Gotta keep developers on their toes.
1. Re:Delay by bwb · 2008-02-08 14:47 · Score: 5, Insightful
  
  Sure, they're ignoring the response status, but I'll betcha most of them are doing synchronous requests. If I were solving this problem for W3C, I'd be delaying the abusers by 5 or 6 *minutes*. Maybe respond to the first request from a given IP/user agent with no or little delay, but each subsequent request within a certain timeframe incurs triple the previous delay, or the throughput gets progressively throttled-down until you're drooling it out at 150bps. That would render the really abusive applications immediately unusable, and with any luck, the hordes of angry customers would get the vendors to fix their broken software.
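A minimal sketch of the escalating-delay idea described above, in Java. Everything here is hypothetical: the class name `EscalatingDelay`, the client key (say, IP plus user agent), and the window, base delay and cap are illustrative choices, not anything W3C is known to run. The first request in a window gets no delay, the second gets the base delay, and each later one triples the previous delay up to a cap.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-client throttle: each repeat request inside the window triples the delay. */
final class EscalatingDelay {
    private static final class State {
        long lastSeenMillis;
        long delayMillis;
    }

    private final long windowMillis;
    private final long baseDelayMillis;
    private final long maxDelayMillis;
    private final Map<String, State> clients = new HashMap<>();

    EscalatingDelay(long windowMillis, long baseDelayMillis, long maxDelayMillis) {
        this.windowMillis = windowMillis;
        this.baseDelayMillis = baseDelayMillis;
        this.maxDelayMillis = maxDelayMillis;
    }

    /** Returns how long (in ms) the server should stall before answering this request. */
    synchronized long delayFor(String clientKey, long nowMillis) {
        State s = clients.get(clientKey);
        if (s == null || nowMillis - s.lastSeenMillis > windowMillis) {
            // First request, or the client stayed quiet for a full window: start over.
            s = new State();
            clients.put(clientKey, s);
        } else if (s.delayMillis == 0) {
            // Second request inside the window: begin with the base delay.
            s.delayMillis = baseDelayMillis;
        } else {
            // Later requests: triple the previous delay, capped at the maximum.
            s.delayMillis = Math.min(maxDelayMillis, s.delayMillis * 3);
        }
        s.lastSeenMillis = nowMillis;
        return s.delayMillis;
    }
}
```

With a 60-second window, a 1-second base and a 6-minute cap, a client hammering the server once per second would see delays of 0, 1, 3, 9, 27, ... seconds until it hits the cap, while a well-behaved client that caches never leaves the zero-delay path.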
Re:Leave it to Slashdot... by snl2587 · 2008-02-08 13:36 · Score: 5, Informative

Note: It is my understanding that the browser is what looks up the DTD. So /. having the declaration is irrelevant.
Simple solution by mcrbids · 2008-02-08 13:36 · Score: 5, Funny

The answer to this problem is quite easy.

Continue to host the data referenced on a single T-1 line. That will cut your expenses to the bone since you'll never exceed 1.54 Mbps and that should be quite cheap. And, any dumfuxorz who fubarred their parser to not cache these basically static values will probably figure it out... very quickly.

You don't have to leave it on the T-1, maybe just 1 month out of the year. Every year.

Problem solved!

--
I have no problem with your religion until you decide it's reason to deprive others of the truth.
Umm, no. by pavon · 2008-02-08 13:43 · Score: 5, Informative

That is supposed to be there according to the standard. And all the major browsers cache that file after loading it (at most) once, and then never read it again. So no, slashdot is not causing a problem. The problem is all the other HTML processing software besides browsers that does not cache its DTD files, not the documents that contain the declaration.

If you want to complain, it should be the fact that slashdot is serving a strict.dtd when it doesn't validate against it.
1. Re:Umm, no. by MillionthMonkey · 2008-02-08 16:37 · Score: 5, Interesting
  
  At least what little code I've written to process HTML/XML has always entirely ignored the DTD.
  
  Don't be so sure- even if your own code ignores it. Unless you're dealing with it on a raw character level, with most XML libraries and frameworks it can be quite tricky to prevent DTDs from being resolved behind your back.
  
  I wrote some Java code a while back to parse some XML files that were downloaded from NCBI. Typical for NCBI data, this involved wading through terabytes of crap, and anything based on DOM wasn't going to work- so I used the lower level event-based SAX library in JAXP. The files did have DTD declarations in them pointing to NCBI, which I wanted to ignore, since this was a one-time data mining operation. I just examined some sample files, figured out pseudo-XPath expressions for what I wanted to pull out, set up a simple state machine to stumble through the SAX events, and not caring about the DTD, cleared the namespace-aware and validating flags on the SAXParserFactory. So I ended up with this:
  
  File xmlgz = new File("ncbi_diarrhea.xml.gz");
  DefaultHandler myHandler = new MyNCBIStateMachineHandler();
  GZIPInputStream gzos = new GZIPInputStream(new FileInputStream(xmlgz));
  SAXParserFactory spf = SAXParserFactory.newInstance();
  // No validation and no namespace processing; this alone does not stop DTD fetches.
  spf.setValidating(false);
  spf.setNamespaceAware(false);
  SAXParser sp = spf.newSAXParser();
  InputSource input = new InputSource(gzos);
  sp.parse(input, myHandler);
  This ran fine, until it mysteriously froze up 18 hours into the run. It turned out to be caused by our switch to a different ISP, during which time the building lost its outside network access. The thread picked up the next file and immediately got blocked in the SAX library, trying to resolve the NCBI DTD.
  
  This is how I fixed it:
  
  spf.setFeature("http://xml.org/sax/features/external-general-entities", false);
  spf.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
  
  Now I'm sure someone is going to come on here calling me a noob for not knowing to use an XMLReaderFactory (or whatever XML API class isn't obsolete this week) and setting a custom EntityResolver that can provide my local copy of the NCBI DTD when presented with its URI, but why should I even have to bother with that? XML pretends to be simple but it's seriously messed up.
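For reference, the custom `EntityResolver` route mentioned above might look roughly like the following. This is only a sketch under assumptions: the class name `LocalDtdResolver`, the helper `rootElementOf`, and the idea of holding local DTD copies in a map keyed by system ID are all hypothetical; only `org.xml.sax.EntityResolver` and the other JAXP/SAX types are real API. The resolver answers known system IDs from memory and refuses anything else, so the parser never goes to the network.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical resolver: serves known DTDs from local copies, never from the network. */
final class LocalDtdResolver implements EntityResolver {
    private final Map<String, byte[]> localCopies; // system ID -> DTD bytes

    LocalDtdResolver(Map<String, byte[]> localCopies) {
        this.localCopies = localCopies;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) throws SAXException {
        byte[] dtd = localCopies.get(systemId);
        if (dtd == null) {
            // Fail fast instead of letting the parser fall back to a remote fetch.
            throw new SAXException("No local copy for external entity: " + systemId);
        }
        InputSource source = new InputSource(new ByteArrayInputStream(dtd));
        source.setSystemId(systemId);
        return source;
    }

    /** Illustrative helper: parses xml with the resolver installed, returns the root element name. */
    static String rootElementOf(String xml, EntityResolver resolver) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setEntityResolver(resolver);
            String[] root = new String[1];
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName, String qName, Attributes atts) {
                    if (root[0] == null) {
                        root[0] = qName;
                    }
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
            return root[0];
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }
}
```

Throwing on unknown system IDs is a design choice in this sketch: a hung thread becomes an immediate, visible error, which is easier to debug 18 hours into a batch run.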
Re:Wow by Bogtha · 2008-02-08 13:45 · Score: 5, Insightful

Why on earth are you blaming webmasters? They are just about the only people who cannot be responsible for this. People who write HTML parsers, HTTP libraries, screen-scrapers, etc, they are the ones causing the problem. Badly-coded client software is to blame, not anything you put on a website.

--
Bogtha Bogtha Bogtha
Such an easy solution by mwasham · 2008-02-08 13:49 · Score: 5, Funny

And it is only 4 articles down.. Host with Yahoo! Yahoo Offers All-You-Can-Eat Storage and Bandwidth http://hardware.slashdot.org/article.pl?sid=08/02/08/1811236

--
Dallas Real Estate
Submitted this to /.? by dotancohen · 2008-02-08 13:52 · Score: 5, Funny

Great, they cry "we get too much traffic", so we go ahead and slap them on the front page of slashdot. Sick, sick fucking joke.

--
It is dangerous to be right when the government is wrong.
1. Re:Submitted this to /.? by ger · 2008-02-08 14:47 · Score: 5, Informative
  
  To try to help put these numbers into perspective, this blog post is currently #1 on slashdot, #7 on reddit, the top page of del.icio.us, etc; yet www.w3.org is still serving more than 650 times as many DTDs as this blog post, according to a 10-min sample of the logs I just checked.
Re:That's what you get for making stupid rules. by Bogtha · 2008-02-08 13:55 · Score: 5, Informative

They insist that every document begin with a declaration that includes a link to their site.

It's not a link. It's a reference to an external DTD subset. It's there so that generic SGML software can properly parse the document without any special knowledge of HTML.

The link in the declaration serves absolutely NO purpose other than to comply with the standard that they created. It sounds like the whole purpose was so that they could have every source page begin with a link to their site. Serves them right.

No, external DTD subsets are a part of SGML, which is at least a decade older than the W3C.

--
Bogtha Bogtha Bogtha
Starting on the 1st, fool by Scrameustache · 2008-02-08 14:06 · Score: 5, Funny

You don't have to leave it on the T-1, maybe just 1 month out of the year. Every year.

I suggest April! :D

--
You can't take the sky from me...
Re:Wow by Bogtha · 2008-02-08 14:28 · Score: 5, Informative

They literally wrote the standard.

"Webmasters" refers to people who run websites, not the W3C. And this particular feature is an artefact of SGML, which was around for over a decade before the W3C ever existed.

If they didn't want the traffic they should have specified the matter in their RFCs.

You mean like how RFC 2616 describes the caching mechanism that is being ignored by the problem clients? Or are you referring to the established-for-decades SGML system catalogue that they mention in the HTML 4 specification multiple times?

You can tell them apart by their attention to the consequences of their actions.

If people writing client software actually did what they were supposed to, this wouldn't be a problem. This is not a designed-in bug, this is caused by a minority of developers eschewing the specifications and standard practice out of either ignorance or apathy.

--
Bogtha Bogtha Bogtha
Re:Wow by Anonymous Coward · 2008-02-08 14:54 · Score: 5, Insightful

They literally wrote the standard

Yeah, the standard. If your shitty http engine is too shitty to process html without having to look up the DTD on the w3c's website every single page, your shitty http engine shouldn't be allowed out on the internet.
Re:Wow by Blakey+Rat · 2008-02-08 14:59 · Score: 5, Insightful

If people writing client software actually did what they were supposed to, this wouldn't be a problem. This is not a designed-in bug, this is caused by a minority of developers eschewing the specifications and standard practice out of either ignorance or apathy.

Wow, it just struck me... welcome to Microsoft's world.

Their security was so bad for so many years because they worked on the assumption that:
1) Programmers know what they're doing
2) Programmers aren't assholes

Of course, the success of malware vendors (and Real Networks) has proved those two assumptions wrong many years ago, and probably 90% of the development work on Vista was adding in safeties to protect against idiot programmers, and asshole programmers.

And now the W3C is getting their lesson on a golden platter.

In short, here's the lesson learned:
1) Some proportion of programmers don't know what they're doing and never will
2) Some proportion of programmers are assholes

--
Comment of the year
Re:Wow by ibbie · 2008-02-08 16:16 · Score: 5, Insightful

They literally wrote the standard

Yeah, the standard. If your shitty http engine is too shitty to process html without having to look up the DTD on the w3c's website every single page, your shitty http engine shouldn't be allowed out on the internet.

Good and jolly bacon bits, please mod parent up. I realize that their comment might come off as harsh, but crap, come on. If one is building an application, would one really want to have to connect to a website to get instructions on how to read a filetype? Especially when all it would take is a single wget and including those instructions with the application to avoid all of this.

Furthermore, it would seem that the process of reading a file would be far faster if the processing instructions were on the local file system rather than on a remote host. If one were really worried about changes to the instructions, one could code a routine to update the DTD whenever the application is updated; if the app isn't one that *would* be updated, one could always have it run a diff against the W3C's DTD every few months - after it's been standardized, it's not like the DTD is going to change on a daily basis. While not a complete cure, it'd still be far more considerate to the W3C's bandwidth than hitting it on every request, or even every time a program is started.
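To make the "keep a local copy, refresh it rarely" idea concrete, here is a rough Java sketch. The class `DtdCache` and its pluggable `fetcher` function are hypothetical names introduced for illustration, and the 90-day maximum age in the usage note is just a stand-in for "every few months". The fetcher is only invoked when there is no copy yet or the stored copy is older than the maximum age.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical cache: refetches a DTD only when the stored copy is older than maxAge. */
final class DtdCache {
    private static final class Entry {
        final String body;
        final Instant fetchedAt;

        Entry(String body, Instant fetchedAt) {
            this.body = body;
            this.fetchedAt = fetchedAt;
        }
    }

    private final Duration maxAge;
    private final Function<String, String> fetcher; // assumed: URL -> DTD text
    private final Map<String, Entry> entries = new HashMap<>();

    DtdCache(Duration maxAge, Function<String, String> fetcher) {
        this.maxAge = maxAge;
        this.fetcher = fetcher;
    }

    /** Returns the cached DTD for url, calling the fetcher only if the copy is missing or stale. */
    synchronized String get(String url, Instant now) {
        Entry e = entries.get(url);
        if (e == null || Duration.between(e.fetchedAt, now).compareTo(maxAge) > 0) {
            e = new Entry(fetcher.apply(url), now);
            entries.put(url, e);
        }
        return e.body;
    }
}
```

Constructed as `new DtdCache(Duration.ofDays(90), fetcher)`, an application that parses a million documents in a quarter would hit the remote host once instead of a million times. A real implementation would presumably persist the copy to disk so it survives restarts.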

Honestly, I wouldn't blame them if they 302'd the file to a page that, upon CAPTCHA'd request, made the file temporarily available for download, so that vendors could fix their broken software. They're obviously far more considerate and forgiving people than I - and, I suspect, many of you fellow Slashdotters - tend to be.

*puts on flame-resistant suit*

--
The wise follow a damned path, for to know is to be forsaken.
Re:Wow by sco08y · 2008-02-08 21:52 · Score: 5, Insightful

Furthermore, it would seem that the process of reading a file would be far faster if the processing instructions were on the local file system rather than on a remote host. If one were really worried about changes to the instructions, one could code a routine to update the DTD whenever the application is updated; if the app isn't one that *would* be updated, one could always have it run a diff against the W3C's DTD every few months - after it's been standardized, it's not like the DTD is going to change on a daily basis.

It's more like this: your app should *never* query the DTD. If the DTD changes, your app's code probably needs to change and your app should *never* try to parse using a DTD that hasn't been tested by a human being, or at least through your regression tests. Any changes to DTDs should be handled by updating the app itself.

The only exception to this is an app that also happens to be a development tool.