W3C Gets Excessive DTD Traffic
eldavojohn writes "It's a common string you see at the start of an HTML document, a URI declaring the type of document, but software often fetches that URI while processing the document, causing undue traffic to W3C's site. There's a somewhat humorous post today from W3.org that reads like a cry for sanity, asking developers to stop building systems that automatically query this information. From their post, 'In particular, software does not usually need to fetch these resources, and certainly does not need to fetch the same one over and over! Yet we receive a surprisingly large number of requests for such resources: up to 130 million requests per day, with periods of sustained bandwidth usage of 350Mbps, for resources that haven't changed in years. The vast majority of these requests are from systems that are processing various types of markup (HTML, XML, XSLT, SVG) and in the process doing something like validating against a DTD or schema. Handling all these requests costs us considerably: servers, bandwidth and human time spent analyzing traffic patterns and devising methods to limit or block excessive new request patterns. We would much rather use these assets elsewhere, for example improving the software and services needed by W3C and the Web Community.' Stop the insanity!"
Note: It is my understanding that the browser is what looks up the DTD. So /. having the declaration is irrelevant.
That is supposed to be there according to the standard. And all the major browsers cache that file after loading it (at most) once, and then never read it again. So no, slashdot is not causing a problem. The problem is all the other HTML processing software besides browsers that does not cache its DTD files, not the documents that contain the declaration.
If you want to complain, it should be the fact that slashdot is serving a strict.dtd when it doesn't validate against it.
The spec already recommends this and all the major browsers do it. The software that is causing the problem is the generic XML/SGML processing packages, which were designed to be able to deal with documents with any random DTD, not just the main HTML/XHTML ones from W3C. They are the ones downloading each DTD every single time and not caching it, contrary to the standard. Sometimes caching is a configuration option which defaults to off and administrators never turn it on.
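To make the caching point concrete, here is a minimal Python sketch of "fetch the DTD once, then reuse it". The DtdCache class and the fake_fetch function are made up for illustration and are not taken from any real parser; the download is stubbed out so the example never touches the network.

```python
from typing import Callable, Dict


class DtdCache:
    """Fetch each DTD at most once per process and reuse the stored copy."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        # `fetch` is a hypothetical callable that returns the bytes for a URL.
        self._fetch = fetch
        self._store: Dict[str, bytes] = {}
        self.network_hits = 0

    def get(self, system_id: str) -> bytes:
        # Only go to the "network" when this system identifier is new.
        if system_id not in self._store:
            self.network_hits += 1
            self._store[system_id] = self._fetch(system_id)
        return self._store[system_id]


def fake_fetch(url: str) -> bytes:
    # Stand-in for a real HTTP download so the sketch runs offline.
    return b"<!-- DTD body for " + url.encode("ascii") + b" -->"


cache = DtdCache(fake_fetch)
url = "http://www.w3.org/TR/html4/strict.dtd"
for _ in range(1000):
    cache.get(url)
print(cache.network_hits)  # prints 1
```

A real tool would probably want the cache on disk so it survives restarts, but even an in-memory dictionary like this turns a thousand requests into one.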
It's not a link. It's a reference to an external DTD subset. It's there so that generic SGML software can properly parse the document without any special knowledge of HTML.
No, external DTD subsets are a part of SGML, which is at least a decade older than the W3C.
Bogtha Bogtha Bogtha
Browsers cache the DTDs.
There, you can now stop posting your hilarious "jokes".
sic transit gloria mundi
Actually, do any browsers get the DTD?
From the article, it seems like the problem is with software that processes XML, like a web crawler, not a browser.
Browsers are also pretty good about caching stuff.
If I have nothing to hide, don't search me
RTFA. They returned the 503 Service Unavailable error to many abusers, and they just kept on with abusive requests. Many abusers aren't checking the response to the request at all.
What a fool believes, he sees, no wise man has the power to reason away.
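For anyone wondering what "checking the response" would even look like, here is a rough Python sketch of a client deciding how long to wait after a 503. The function name delay_after_response and the base and cap defaults are my own assumptions, not anything W3C specified, and it only handles the seconds form of the Retry-After header (not the HTTP-date form).

```python
from typing import Optional


def delay_after_response(status: int, attempt: int,
                         retry_after: Optional[str] = None,
                         base: float = 1.0, cap: float = 3600.0) -> float:
    """Return how many seconds to wait before the next request."""
    if status != 503:
        # Anything other than Service Unavailable: no backoff in this sketch.
        return 0.0
    if retry_after is not None and retry_after.isdigit():
        # The server said how long to wait; honor it, up to the cap.
        return min(float(retry_after), cap)
    # Otherwise back off exponentially: base, 2*base, 4*base, ...
    return min(base * (2 ** attempt), cap)


print(delay_after_response(200, 0))          # 0.0
print(delay_after_response(503, 3))          # 8.0
print(delay_after_response(503, 0, "120"))   # 120.0
```

A client that ignores the status code entirely never reaches any of this logic, which is exactly the behavior the article complains about.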
FTA:
The vast majority of these requests are from systems that are processing various types of markup (HTML, XML, XSLT, SVG)
I don't claim to fully grasp what software is causing the problem, but it does seem to affect more than just XML.
I'm sick of following my dreams. I'm just going to ask where they're goin' and hook up with 'em later.
"Webmasters" refers to people who run websites, not the W3C. And this particular feature is an artefact of SGML, which was around for over a decade before the W3C ever existed.
You mean like how RFC 2616 describes the caching mechanism that is being ignored by the problem clients? Or are you referring to the established-for-decades SGML system catalogue that they mention in the HTML 4 specification multiple times?
If people writing client software actually did what they were supposed to, this wouldn't be a problem. This is not a designed-in bug, this is caused by a minority of developers eschewing the specifications and standard practice out of either ignorance or apathy.
Bogtha Bogtha Bogtha
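As a rough illustration of the catalogue idea, here is a Python sketch that maps public identifiers to local copies of the DTDs, so nothing needs to be fetched from w3.org. The two public identifiers are the real ones for HTML 4.01 Strict and XHTML 1.0 Strict; the local paths, the function names, and the simplified one-entry-per-line format are assumptions for the example, and a real catalogue parser handles more entry types than PUBLIC.

```python
import re
from typing import Dict, Optional

# Hypothetical catalogue contents: PUBLIC "public identifier" "local file".
CATALOG_TEXT = '''
PUBLIC "-//W3C//DTD HTML 4.01//EN" "dtd/html4/strict.dtd"
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "dtd/xhtml1/xhtml1-strict.dtd"
'''

_ENTRY = re.compile(r'^\s*PUBLIC\s+"([^"]+)"\s+"([^"]+)"\s*$')


def load_catalog(text: str) -> Dict[str, str]:
    """Parse PUBLIC entries into a {public_id: local_path} dictionary."""
    catalog: Dict[str, str] = {}
    for line in text.splitlines():
        match = _ENTRY.match(line)
        if match:
            catalog[match.group(1)] = match.group(2)
    return catalog


def resolve(public_id: str, catalog: Dict[str, str]) -> Optional[str]:
    """Return the local path for a public identifier, or None if unknown."""
    return catalog.get(public_id)


catalog = load_catalog(CATALOG_TEXT)
print(resolve("-//W3C//DTD HTML 4.01//EN", catalog))  # dtd/html4/strict.dtd
```

Only when resolve returns None would a processor have any reason to consider the system identifier (the URL), and even then it should cache whatever it fetches.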
Good: Delivered a piece of code once that tested just fine for us, but blew up at the customer's site. We never realized that the new J2EE-like features were hitting a live URL during DTD parsing.
Better: Had a build system once that looked for a host and had to TCP timeout before the build could continue. Had to happen several hundred times a build cycle.
The Java libraries do this down in their innards unless you're very careful to avoid it.
To try to help put these numbers into perspective, this blog post is currently #1 on slashdot, #7 on reddit, the top page of del.icio.us, etc; yet www.w3.org is still serving more than 650 times as many DTDs as this blog post, according to a 10-min sample of the logs I just checked.
See this comment. /. is NOTHING compared to the traffic generated by DTD requests.
http://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic#c1821
ip addr add 192.168.1.2/24 dev eth0
ip link set eth0 down
etc. etc.
650 times as many hits. (163 times as many bytes.) But that's just from a quick sample.
I think I was using the Java version of Apache Xerces at the time for the DocBook processing. More recently I've used lxml in Python (based on libxml2), which has an option (no_network) to suppress DTD loading from the web, but you have to request that explicitly.
I've never seen a parser that caches DTDs by default. I'm not sure about parsers that do not download by default.
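For comparison, here is a small Python sketch of explicitly telling a parser not to load external entities. Note that it uses the standard library's xml.sax rather than lxml or Xerces, so it is not the setup described above. The TagCollector and RecordingResolver classes and the parse_without_network function are names invented for this example; the resolver is only there to show whether the parser ever asks for the DTD.

```python
import io
import xml.sax
from xml.sax.handler import (ContentHandler, EntityResolver,
                             feature_external_ges, feature_external_pes)

DOC = b"""<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head><title>t</title></head>
<body><p>hello</p></body></html>
"""


class TagCollector(ContentHandler):
    """Record element names in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def startElement(self, name, attrs):
        self.tags.append(name)


class RecordingResolver(EntityResolver):
    """Record every external entity the parser asks to load."""

    def __init__(self):
        self.requests = []

    def resolveEntity(self, publicId, systemId):
        self.requests.append(systemId)
        return systemId


def parse_without_network(data):
    parser = xml.sax.make_parser()
    # Explicitly switch off loading of external entities, DTD subset included.
    parser.setFeature(feature_external_ges, False)
    parser.setFeature(feature_external_pes, False)
    collector, resolver = TagCollector(), RecordingResolver()
    parser.setContentHandler(collector)
    parser.setEntityResolver(resolver)
    parser.parse(io.BytesIO(data))
    return collector.tags, resolver.requests


tags, requests = parse_without_network(DOC)
print(tags)
print(requests)
```

The trade-off is that with the external DTD skipped, the parser knows nothing about entities such as &nbsp; that the DTD would have defined, so documents relying on them need a local copy of the DTD instead.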
Just use html5lib instead of rolling your own; it'll parse pretty much "the same way" for all practical purposes.