W3C Gets Excessive DTD Traffic
eldavojohn writes "It's a common string you see at the start of an HTML document, a URI declaring the type of document, but that is often processed causing undue traffic to W3C's site. There's a somewhat humorous post today from W3.org that seems to be a cry for sanity and asking developers and people to stop building systems that automatically query this information. From their post, 'In particular, software does not usually need to fetch these resources, and certainly does not need to fetch the same one over and over! Yet we receive a surprisingly large number of requests for such resources: up to 130 million requests per day, with periods of sustained bandwidth usage of 350Mbps, for resources that haven't changed in years. The vast majority of these requests are from systems that are processing various types of markup (HTML, XML, XSLT, SVG) and in the process doing something like validating against a DTD or schema. Handling all these requests costs us considerably: servers, bandwidth and human time spent analyzing traffic patterns and devising methods to limit or block excessive new request patterns. We would much rather use these assets elsewhere, for example improving the software and services needed by W3C and the Web Community.' Stop the insanity!"
"Webmasters" strike again. Clowns.
The Dunning-Kruger effect explains most posts on
I have a solution to the problem; I wrote it down at http://www.w3.org/TR/html4/strict.dtd !
Do what any other respectable web provider would do..
Put links to Goatse in the definitions!
Oh, that was you? I thought that making every webauthor refer to a W3C URL in every web page was going to get someone in trouble someday. Today seems to be someday.
It's a good thing we don't contribute to the problem - Oh, wait...
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<title>Slashdot: News for nerds, stuff that matters</title>
Have they tried delaying the response by 5 or 6 seconds? It could cause a lot of applications to hang pretty badly. That or just serve a completely nonsensical schema every thousandth request. Gotta keep developers on their toes.
I'm surprised none of the CDNs out there have volunteered to host this file - the problem is they'd have to host the entire w3.org site, or else move the rest of it to another hostname.
The answer to this problem is quite easy.
Continue to host the data referenced on a single T-1 line. That will cut your expenses to the bone since you'll never exceed 1.54 Mbps and that should be quite cheap. And, any dumfuxorz who fubarred their parser to not cache these basically static values will probably figure it out... very quickly.
You don't have to leave it on the T-1, maybe just 1 month out of the year. Every year.
Problem solved!
I have no problem with your religion until you decide it's reason to deprive others of the truth.
Don't click the link, it's malware.
this is my sig
the doctype was being used during an xsl transform in our build process; when the hibernate site flaked out, the builds would fail intermittently.
solution was to add a xmlcatalog using a local resource.
bet this happens a lot more than most people realize; we'd been doing this for years before we noticed a problem.
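For anyone who hasn't seen a catalog before, here is a rough sketch of the idea in Python (not the Ant xmlcatalog config itself): map well-known public identifiers to local copies so nothing goes out over the network. The directory, the file names and the resolve() helper are all made up for illustration.

```python
from pathlib import Path

# Hypothetical location of locally stored DTD copies (an assumption).
DTD_DIR = Path("/usr/local/share/dtd")

# Catalog: public identifier -> local copy. File names are assumptions.
CATALOG = {
    "-//W3C//DTD HTML 4.01//EN": DTD_DIR / "html4-strict.dtd",
    "-//W3C//DTD XHTML 1.0 Transitional//EN": DTD_DIR / "xhtml1-transitional.dtd",
}


def resolve(public_id, system_id):
    """Return a local path for a known public identifier.

    Unknown identifiers fall through to the system identifier unchanged,
    so the caller can decide whether a network fetch is acceptable.
    """
    local = CATALOG.get(public_id)
    return str(local) if local is not None else system_id
```

A parser's entity resolver hook would call something like resolve() before it ever considers opening a URL.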
That is supposed to be there according to the standard. All the major browsers cache that file after loading it (at most) once, and then never read it again. So no, Slashdot is not causing a problem. The problem is all the other HTML processing software besides browsers that does not cache its DTD files, not the documents that contain the reference.
If you want to complain, it should be the fact that slashdot is serving a strict.dtd when it doesn't validate against it.
So, w3c complains about their bandwidth, and the response is: The Slashdot Effect. Doesn't that make the old bandwidth problem seem less of a problem?
I'm just loving the irony in that.
And it is only 4 articles down: host with Yahoo! "Yahoo Offers All-You-Can-Eat Storage and Bandwidth" http://hardware.slashdot.org/article.pl?sid=08/02/08/1811236
The spec already recommends this and all the major browsers do it. The software causing the problem is the generic XML/SGML processing packages, which were designed to deal with documents with any random DTD, not just the main HTML/XHTML ones from W3C. They are the ones downloading each DTD every single time and not caching it, contrary to the standard. Sometimes caching is a configuration option which defaults to off and administrators never turn it on.
Great, they cry "we get too much traffic", so we go ahead and slap them on the front page of slashdot. Sick, sick fucking joke.
It is dangerous to be right when the government is wrong.
It's not a link. It's a reference to an external DTD subset. It's there so that generic SGML software can properly parse the document without any special knowledge of HTML.
No, external DTD subsets are a part of SGML, which is at least a decade older than the W3C.
Bogtha Bogtha Bogtha
Browsers cache the DTDs.
There, you can now stop posting your hilarious "jokes".
sic transit gloria mundi
The only thing I'm unclear on is whether your average browser is contributing to this problem when parsing properly written documents.
I'm sick of following my dreams. I'm just going to ask where they're goin' and hook up with 'em later.
You're solving that problem at the wrong layer. HTTP already includes caching mechanisms, the W3C already use them, and part of the problem is that buggy software is ignoring them.
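To make that concrete, here is a simplified sketch of the freshness check a client could run before refetching. It only looks at Cache-Control max-age and Expires; real HTTP caching also has to deal with no-store, no-cache, Age, validators and so on, and header lookup here is case-sensitive for brevity. The function names are made up.

```python
import re
import time
from email.utils import parsedate_to_datetime


def fresh_until(headers, fetched_at):
    """Return the Unix time until which a cached response may be reused.

    headers is a plain dict of response headers; fetched_at is the Unix
    time of the original fetch. max-age takes precedence over Expires.
    """
    match = re.search(r"max-age=(\d+)", headers.get("Cache-Control", ""))
    if match:
        return fetched_at + int(match.group(1))
    expires = headers.get("Expires")
    if expires:
        try:
            return parsedate_to_datetime(expires).timestamp()
        except (TypeError, ValueError):
            # Unparseable date: treat the response as already stale.
            return fetched_at
    # No freshness information at all: treat as stale.
    return fetched_at


def is_fresh(headers, fetched_at, now=None):
    """True if the cached copy can be used without contacting the server."""
    now = time.time() if now is None else now
    return now < fresh_until(headers, fetched_at)
```

Software that did even this much would only hit the server again once the cached copy had expired.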
Please read the article. This is already supposed to happen. Buggy software fails to do this, which is the problem being talked about.
Bogtha Bogtha Bogtha
You can't take the sky from me...
I've got to say, this doesn't surprise me at all. In the time I've spent at my job, I've been repeatedly floored by the amazing conduct of other companies' IT departments. We've only encountered two people I can think of who have been hostile. Everyone else has been quite nice. You'd think people would have things set up well, but they don't.
We've seen many custom XML parsers and encoders, all slightly wrong. We've seen people transmitting very sensitive data without using any kind of security until we refused to continue working without SSL being added to the equation. We've seen people who were secure change their certificates to self-signed, and we seem to consistently know when people's certificates expire before they do.
But even without these things, I can't tell you how many people send us bad data and flat out ignore the response. We get all sorts of bad data sent to us all the time. When that happens, we reply with a failure message describing what's wrong. Yet we get bits of stuff all the time that is wrong, in the same way, from the same people. I'm not talking about sending us something that they aren't supposed to (X when we say only Y), I'm saying invalid XML type wrong... such that it can't be parsed.
We have, a few times while I've been there, had people make a change in their software (or something) and bombard us with invalid data until we either block their IP or manage to get into voice contact with their IT department. Sometimes they don't even seem to notice the lockout.
Some places can be amazing. Some software can be poorly designed (or something can cause a strange side effect, see here). I really like one of the suggestions in the comments on the article... start replying really slow, and often with invalid data. They won't do it. I wouldn't. But I like the idea.
Comment forecast: Bits of genius surrounded by a sea of mediocrity.
If the problem is that it gets served out too many times, then make the server slow as molasses. If it takes 1-2 minutes to get the DTD from the server, or more, then it is quickly discovered by the performance teams.
--
Thorbjørn Ravn Andersen "...and...Tubular Bells!"
The problem does not lie in the mechanism itself - it's in the documentation - or the lack of understandable (or at least often-used) docs directly at the source.
/.'s front page might actually help improve the situation. But.. it's like this with stupid programmers - they never die out, they'll always create problems. Let's get used to it.
Simple caching on client side could already improve the situation a whole lot... BUT:
When people implement something for html-ish or svg-ish or xml-ish purposes, they go google for it: "Howto XML blah foo" - result, they're getting basic screw-it-with-a-hammer tutorials that don't point out important design decisions, but instead Just Work - which is what the author wanted to achieve when they started writing the software.
It's a little bit like people still using ifconfig on Linux though it's been deprecated and superseded by iproute2. But since most tutorials and howtos on the net are just dumbed-down copypasta for quick and dirty hacks - and since nobody fucking enforces the standards - nobody does it the Right Way.
So if I start writing some sax-parser, some html-rendering lib, some silly scraper, whatnot... and the first example implementations only deal with basic stuff and show me how to do it so basic functionality can be implemented... and I'm not really interested in that part of the program anyways, because I need it for putting something more fancy on top... once I'm through with the initial testing of this particular subsystem, I won't really care about anything else. It works, it doesn't seem to hit performance too badly, it's according to some random guy's completely irrelevant blog - hey, this guy knows what he's doing. I don't care!
This story hitting
I'm an infovore...
I wrote my thesis in Docbook and installed the processing toolchain on a laptop. Sometimes the processing would fail and sometimes it worked. After a while I noticed it worked when I was sitting at my desk and failed when I was sitting on my bed. After some digging, I found out that the catalog configuration was wrong and the XML parser was downloading the DTDs from the web. This was before WiFi, so sitting on the bed meant the laptop did not have internet access.
The core of the problem is that most XML parsers will automatically and transparently fetch the DTD from the URL and do not cache it. So if you have no DTDs installed locally, or if your XML parser cannot find them (catalog configuration is easy to mess up), the parsing will work just fine and if processing the XML takes a significant amount of time, you probably won't notice the small delay from downloading the DTD.
There are several possible solutions for this:
All of these are things that should be addressed in the XML parsers.
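As one illustration of what a parser could do on its own, here is a minimal sketch of a fetch-once disk cache. The cached_fetch name, the cache layout and the injected fetch callable are assumptions made for the example, not any real parser's API; in practice fetch would wrap whatever HTTP client the parser already uses.

```python
import hashlib
from pathlib import Path


def cached_fetch(url, fetch, cache_dir):
    """Return the bytes for url, calling fetch(url) only on a cache miss.

    fetch is any callable that takes a URL and returns bytes. The cached
    copy never expires in this sketch, which is a simplification.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL so it can be used safely as a file name.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".dtd"
    path = cache_dir / name
    if path.exists():
        return path.read_bytes()
    data = fetch(url)
    path.write_bytes(data)
    return data
```

With something like this in the default code path, a DTD that has not changed in years would be downloaded once per machine instead of once per document.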
I think they screwed up, and brought this on themselves. I already thought that it was annoying having so verbose an identifier... this just makes it more hateful.
If they'd at least made the identifier NOT a URI, something like domain.example.com::[path/]versionstring, or something else that wasn't a URI, so it was clearly an identifier even if it was ultimately convertible to a URI, they would have avoided this kind of problem.
Here's an example of what correct markup should look like: The documented standard uses a URL that links to the W3C's copy of the DTD only as an EXAMPLE. The standard DOES NOT REQUIRE usage of the URL to W3C's copy of the DTD. Responsible developers use a URL that links to their OWN COPY of the DTD. ANYTHING else is just leeching from W3C. PERIOD.
... you can't blame Microsoft for this problem! After all, IE ignores pretty much all web standards and best practices, and does its own thing!
#DeleteChrome
It was the W3C that decided to make HTML a subset of SGML. They could have done what HTML 5 is doing by creating a "serialization" that doesn't care about the DTD (HTML5 doesn't call for one in the DOCTYPE). As it is, theoretically, I can write my own DTD (a modification of the HTML4 DTD, for example) that adds new elements. Technically, the SGML parser should know and understand those DTDs. To do so, it must download the DTD. Browsers are supposed to be handling SGML docs, but chose to implement against the W3C recommendations instead of caring about the DTDs; the popular ones don't download the DTD at all... they don't even care if it exists or not, which is why the short HTML5 DOCTYPE works as a quirksmode switch but is still valid HTML and renders like HTML should.
Should SGML renderers cache it? Yes. Should W3C bitch that some SGML renderers are downloading their DTD? No. They should have thought about that before they made HTML a subset of SGML. I don't feel sorry for them.
I think I was using the Java version of Apache Xerces at the time for the Docbook processing. More recently I've used lxml in Python (based on libxml2), which has an option (no_network) to suppress DTD loading from the web, but you have to request that explicitly.
I've never seen a parser that caches DTDs by default. I'm not sure about parsers that do not download by default.
If you'd actually bothered reading what you quoted you might've noticed the sentence "The system identifier may be changed to reflect local system conventions". Only the public identifier is required to be one of the strings provided. The system identifier (http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd) can point wherever you want it to. But well behaved clients are expected to use a catalog anyway.
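To show the two parts side by side, here is a small sketch that pulls the public and system identifiers out of a DOCTYPE declaration. The regex and the split_doctype name are made up for the example, and it only handles double-quoted PUBLIC declarations, not everything SGML or XML allows.

```python
import re

# Matches: <!DOCTYPE root PUBLIC "public id" "optional system id">
DOCTYPE_RE = re.compile(
    r'<!DOCTYPE\s+(?P<root>\S+)\s+PUBLIC\s+'
    r'"(?P<public>[^"]*)"(?:\s+"(?P<system>[^"]*)")?\s*>',
    re.IGNORECASE,
)


def split_doctype(text):
    """Return (public_id, system_id) for the first PUBLIC doctype, else None.

    system_id is None when the declaration carries no system identifier.
    """
    match = DOCTYPE_RE.search(text)
    if match is None:
        return None
    return match.group("public"), match.group("system")
```

The first element of the result is the part that has to match the spec; the second is the part you are free to point at a local copy, or that a catalog can override.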
Nothing, it's a non-profit.
(ducks and runs)
"My country, right or wrong; if right, to be kept right; and if wrong, to be set right." --Senator Carl Schurz (1872)
Just use html5lib instead of rolling your own; it'll parse pretty much "the same way" for all practical purposes.