W3C Gets Excessive DTD Traffic

Posted by ScuttleMonkey on Friday February 8, 2008 @01:22PM from the stop-the-intertubes-i-wanna-get-off dept.

eldavojohn writes "It's a common string you see at the start of an HTML document: a URI declaring the type of document. Software often fetches that URI while processing the document, causing undue traffic to W3C's site. There's a somewhat humorous post today from W3.org that seems to be a cry for sanity, asking developers and others to stop building systems that automatically query this information. From their post, 'In particular, software does not usually need to fetch these resources, and certainly does not need to fetch the same one over and over! Yet we receive a surprisingly large number of requests for such resources: up to 130 million requests per day, with periods of sustained bandwidth usage of 350Mbps, for resources that haven't changed in years. The vast majority of these requests are from systems that are processing various types of markup (HTML, XML, XSLT, SVG) and in the process doing something like validating against a DTD or schema. Handling all these requests costs us considerably: servers, bandwidth and human time spent analyzing traffic patterns and devising methods to limit or block excessive new request patterns. We would much rather use these assets elsewhere, for example improving the software and services needed by W3C and the Web Community.' Stop the insanity!"

17 of 334 comments

Re:Leave it to Slashdot... by snl2587 · 2008-02-08 13:36 · Score: 5, Informative

Note: It is my understanding that the browser is what looks up the DTD. So /. having the declaration is irrelevant.
Umm, no. by pavon · 2008-02-08 13:43 · Score: 5, Informative

That is supposed to be there according to the standard. And all the major browsers cache that file after loading it (at most) once, and then never fetch it again. So no, slashdot is not causing a problem. The problem is all the other HTML-processing software besides browsers that does not cache its DTD files, not the documents that contain the declaration.

If you want to complain, it should be the fact that slashdot is serving a strict.dtd when it doesn't validate against it.
They already do. by pavon · 2008-02-08 13:51 · Score: 4, Informative

The spec already recommends this and all the major browsers do it. The software causing the problem is the generic XML/SGML processing packages, which were designed to be able to deal with documents with any random DTD, not just the main HTML/XHTML ones from W3C. They are the folks that are downloading each DTD every single time and not caching it, contrary to the standard. Sometimes caching is a configuration option which defaults to off and administrators never turn it on.
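To make the "download once, then reuse" behaviour concrete, here is a minimal sketch of an in-process DTD cache. It is not taken from any particular parser: `DtdCache`, `fake_fetch` and `fetch_count` are hypothetical names, and the downloader is a stand-in so the example runs offline. A real tool would more likely persist the cache on disk.

```python
from typing import Callable, Dict


class DtdCache:
    """Fetch each DTD at most once per process, then reuse the stored bytes."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self._fetch = fetch                 # hypothetical downloader, injected by the caller
        self._store: Dict[str, bytes] = {}  # system identifier -> DTD bytes
        self.fetch_count = 0                # number of real fetches, for illustration only

    def get(self, system_id: str) -> bytes:
        # Only go to the network when this system identifier is new to us.
        if system_id not in self._store:
            self._store[system_id] = self._fetch(system_id)
            self.fetch_count += 1
        return self._store[system_id]


def fake_fetch(system_id: str) -> bytes:
    # Stand-in for an HTTP download so the sketch runs without network access.
    return b"<!ELEMENT html (head, body)>"


cache = DtdCache(fake_fetch)
url = "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
for _ in range(1000):
    cache.get(url)
print(cache.fetch_count)  # prints 1
```

A thousand documents referencing the same DTD cost one fetch instead of a thousand, which is the difference the W3C post is complaining about.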
Re:That's what you get for making stupid rules. by Bogtha · 2008-02-08 13:55 · Score: 5, Informative

They insist that every document begin with a declaration that includes a link to their site.

It's not a link. It's a reference to an external DTD subset. It's there so that generic SGML software can properly parse the document without any special knowledge of HTML.

The link in the declaration serves absolutely NO purpose other than to comply with the standard that they created. It sounds like the whole purpose was so that they could have every source page begin with a link to their site. Serves them right.

No, external DTD subsets are a part of SGML, which is at least a decade older than the W3C.

--
Bogtha Bogtha Bogtha
I'm going to say this as clearly as possible. by glwtta · 2008-02-08 13:55 · Score: 3, Informative

Browsers cache the DTDs.

There, you can now stop posting your hilarious "jokes".

--
sic transit gloria mundi
Re:Leave it to Slashdot... by corsec67 · 2008-02-08 14:03 · Score: 2, Informative

Actually, do any browsers get the DTD?
From the article, it seems like the problem is with software that processes XML, like a web crawler, not a browser.

Browsers are also pretty good about caching stuff.

--
If I have nothing to hide, don't search me
Re:Delay by bunratty · 2008-02-08 14:09 · Score: 3, Informative

RTFA. They returned the 503 Service Unavailable error to many abusers, and they just kept on with abusive requests. Many abusers aren't checking the response to the request at all.
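For contrast, a client that does check the response might look roughly like the sketch below. The names `fetch_politely`, `send` and `sleep` are made up for illustration; `send` is assumed to return a `(status, retry_after_seconds, body)` tuple, and the doubling delay is just one reasonable back-off policy, not something the article prescribes.

```python
from typing import Callable, List, Optional, Tuple

# (HTTP status, Retry-After in seconds if the server sent one, response body)
Response = Tuple[int, Optional[float], bytes]


def fetch_politely(
    url: str,
    send: Callable[[str], Response],
    sleep: Callable[[float], None],
    max_attempts: int = 4,
    base_delay: float = 1.0,
) -> Optional[bytes]:
    """Return the body on 200; back off on 503; give up after max_attempts."""
    delay = base_delay
    for attempt in range(max_attempts):
        status, retry_after, body = send(url)
        if status == 200:
            return body
        if status != 503:
            raise RuntimeError(f"unexpected status {status} for {url}")
        if attempt + 1 < max_attempts:
            # Prefer the server's own hint; otherwise wait and double the delay.
            sleep(retry_after if retry_after is not None else delay)
            delay *= 2
    return None  # still unavailable: stop asking instead of hammering the server


# Offline demonstration with canned responses and a recorded "sleep".
canned: List[Response] = [(503, None, b""), (503, 5.0, b""), (200, None, b"<!ELEMENT p ANY>")]
waits: List[float] = []
result = fetch_politely("http://example.org/some.dtd", lambda u: canned.pop(0), waits.append)
print(result, waits)
```

The point is only that a 503 changes the client's behaviour: it waits, and eventually it stops.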

--
What a fool believes, he sees, no wise man has the power to reason away.
Re:Leave it to Slashdot... by milsoRgen · 2008-02-08 14:09 · Score: 2, Informative

From the article, it seems like the problem is with software that processes XML, like a web crawler, not a browser.
FTA:

The vast majority of these requests are from systems that are processing various types of markup (HTML, XML, XSLT, SVG)
I don't claim to fully grasp what software is causing the problem but it does seem to affect more than just XML.

--
I'm sick of following my dreams. I'm just going to ask where they're goin' and hook up with 'em later.
Re:Wow by milsoRgen · 2008-02-08 14:14 · Score: 2, Informative

You're kidding, right? They literally wrote the standard.

Well, yes, they did (as long as the 'they' you are referring to is the W3C), and nowhere in the standards they have approved does it call for every system parsing a document with a DTD to request that information over and over again. Especially considering that data tends to remain static once committed to an official standard.

--
I'm sick of following my dreams. I'm just going to ask where they're goin' and hook up with 'em later.
Re:Wow by Bogtha · 2008-02-08 14:28 · Score: 5, Informative

They literally wrote the standard.

"Webmasters" refers to people who run websites, not the W3C. And this particular feature is an artefact of SGML, which was around for over a decade before the W3C ever existed.

If they didn't want the traffic they should have specified the matter in their RFCs.

You mean like how RFC 2616 describes the caching mechanism that is being ignored by the problem clients? Or are you referring to the established-for-decades SGML system catalogue that they mention in the HTML 4 specification multiple times?

You can tell them apart by their attention to the consequences of their actions.

If people writing client software actually did what they were supposed to, this wouldn't be a problem. This is not a designed-in bug, this is caused by a minority of developers eschewing the specifications and standard practice out of either ignorance or apathy.
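The catalogue idea can be illustrated with a small lookup from public identifier to a local copy of the DTD. This is only a sketch: the two public identifiers are the real W3C ones, but the file paths, `LOCAL_CATALOG` and `resolve_doctype` are hypothetical, and a real SGML or XML catalogue has its own file format that is not modelled here.

```python
import re
from typing import Dict, Optional

# Matches: <!DOCTYPE name PUBLIC "public identifier" ...
DOCTYPE_RE = re.compile(r'<!DOCTYPE\s+\S+\s+PUBLIC\s+"([^"]+)"', re.IGNORECASE)

# Hypothetical local paths; only the public identifiers are real.
LOCAL_CATALOG: Dict[str, str] = {
    "-//W3C//DTD HTML 4.01//EN": "/usr/local/share/dtd/html4-strict.dtd",
    "-//W3C//DTD XHTML 1.0 Strict//EN": "/usr/local/share/dtd/xhtml1-strict.dtd",
}


def resolve_doctype(document: str, catalog: Dict[str, str]) -> Optional[str]:
    """Return the local DTD path for the document's public identifier, if known."""
    match = DOCTYPE_RE.search(document)
    if match is None:
        return None  # no PUBLIC doctype declaration found
    return catalog.get(match.group(1))


page = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n'
    "<html></html>"
)
print(resolve_doctype(page, LOCAL_CATALOG))  # prints /usr/local/share/dtd/xhtml1-strict.dtd
```

Because the public identifier is resolved locally, the system identifier (the w3.org URL) never has to be dereferenced for the common DTDs.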

--
Bogtha Bogtha Bogtha
Re:Delay by RhysU · 2008-02-08 14:41 · Score: 3, Informative

Good: Delivered a piece of code once that tested just fine for us, but blew up at the customer's site. We never realized that the new J2EE-like features were hitting a live URL during DTD parsing.

Better: Had a build system once that looked for a host and had to TCP timeout before the build could continue. Had to happen several hundred times a build cycle.

The Java libraries do this down in their innards unless you're very careful to avoid it.
Re:Submitted this to /.? by ger · 2008-02-08 14:47 · Score: 5, Informative

To try to help put these numbers into perspective, this blog post is currently #1 on slashdot, #7 on reddit, the top page of del.icio.us, etc; yet www.w3.org is still serving more than 650 times as many DTDs as this blog post, according to a 10-min sample of the logs I just checked.
Re:Irony by ion.simon.c · 2008-02-08 16:12 · Score: 2, Informative

See this comment. /. is NOTHING compared to the traffic generated by DTD requests.

http://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic#c1821
Re:The problem is with the docs by paul248 · 2008-02-08 16:34 · Score: 4, Informative

Oh, please do tell me how to use iptables or iproute2 to set my ip address, or to enable/disable a network adaptor.

ip link set eth0 up
ip addr add 192.168.1.2/24 dev eth0
ip link set eth0 down

etc. etc.
Re:Submitted this to /.? by ger · 2008-02-08 18:45 · Score: 4, Informative

650 times as many hits. (163 times as many bytes.) But that's just from a quick sample.
Re:I always thought it was stupid by MtHuurne · 2008-02-08 22:19 · Score: 2, Informative

I think I was using the Java version of Apache Xerces at the time for the Docbook processing. More recently I've used lxml in Python (based on libxml2), which has an option (no_network) to suppress DTD loading from the web, but you have to request that explicitly.
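As a rough illustration of asking explicitly for "no external fetches", here is the same idea using Python's standard-library `xml.sax` rather than lxml, so the sketch needs nothing installed. `ElementNames` and `parse_offline` are hypothetical names. The feature is switched off by hand here; recent Python releases are believed to default to this already, but setting it makes the intent visible.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_external_ges


class ElementNames(ContentHandler):
    """Collect element names in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


def parse_offline(xml_bytes: bytes):
    parser = xml.sax.make_parser()
    # Do not load external general entities, which includes the external DTD subset.
    parser.setFeature(feature_external_ges, False)
    handler = ElementNames()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(xml_bytes))
    return handler.names


DOC = (
    b'<?xml version="1.0"?>\n'
    b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    b'"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n'
    b"<html><head><title>t</title></head><body><p>hi</p></body></html>"
)
print(parse_offline(DOC))  # prints ['html', 'head', 'title', 'body', 'p']
```

The trade-off is that entities defined only in the DTD (such as `&nbsp;`) are not expanded, so a tool that needs them should pair this with a local copy of the DTD rather than a network fetch.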

I've never seen a parser that caches DTDs by default. I'm not sure about parsers that do not download by default.
Re:I'd write the crap code. by BZ · 2008-02-09 05:08 · Score: 3, Informative

Just use html5lib instead of rolling your own; it'll parse pretty much "the same way" for all practical purposes.