W3C Gets Excessive DTD Traffic

Posted by ScuttleMonkey on Friday February 8, 2008 @01:22PM from the stop-the-intertubes-i-wanna-get-off dept.

eldavojohn writes "It's a common string you see at the start of an HTML document, a URI declaring the type of document, but that is often processed causing undue traffic to W3C's site. There's a somewhat humorous post today from W3.org that seems to be a cry for sanity and asking developers and people to stop building systems that automatically query this information. From their post, 'In particular, software does not usually need to fetch these resources, and certainly does not need to fetch the same one over and over! Yet we receive a surprisingly large number of requests for such resources: up to 130 million requests per day, with periods of sustained bandwidth usage of 350Mbps, for resources that haven't changed in years. The vast majority of these requests are from systems that are processing various types of markup (HTML, XML, XSLT, SVG) and in the process doing something like validating against a DTD or schema. Handling all these requests costs us considerably: servers, bandwidth and human time spent analyzing traffic patterns and devising methods to limit or block excessive new request patterns. We would much rather use these assets elsewhere, for example improving the software and services needed by W3C and the Web Community.' Stop the insanity!"

Wow by geekoid · 2008-02-08 13:26 · Score: 2, Funny

"Webmasters" strike again. Clowns.

--
The Kruger Dunning explains most post on /. http://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect
1. Re:Wow by Breakfast+Pants · 2008-02-08 13:34 · Score: 4, Insightful
  
  Not only that, this document gets cached all over the place by ISPs, etc., and they *still* get that many hits.
  
  --
  WHO ATE MY BREAKFAST PANTS?
2. Re:Wow by x_MeRLiN_x · 2008-02-08 13:43 · Score: 3, Interesting
  
  The summary strongly implies and the article states that this unwanted traffic is coming from software that parses markup. Placing the DTD into a web page or other medium where markup is used is the intended and desirable usage.
  
  I don't claim to know why you have a problem with webmasters (I am not one), but if you're a programmer and perceive them to have less technical ability than yourself, well.. your ilk seem to be the "clowns" this time.
3. Re:Wow by Bogtha · 2008-02-08 13:45 · Score: 5, Insightful
  
  Why on earth are you blaming webmasters? They are just about the only people who cannot be responsible for this. People who write HTML parsers, HTTP libraries, screen-scrapers, etc, they are the ones causing the problem. Badly-coded client software is to blame, not anything you put on a website.
  
  --
  Bogtha Bogtha Bogtha
4. Re:Wow by milsoRgen · 2008-02-08 14:14 · Score: 2, Informative
  
  You're kidding, right? They literally wrote the standard.
  
  Well yes, they did (as long as the 'they' you are referring to is the W3C), and nowhere in the standards they have approved does it call for every system parsing a document with a DTD to request that information over and over again. Especially considering that the data tends to remain static once committed to an official standard.
  
  --
  I'm sick of following my dreams. I'm just going to ask where they're goin' and hook up with 'em later.
5. Re:Wow by MenTaLguY · 2008-02-08 14:19 · Score: 4, Insightful
  
  That's the whole purpose of the public identifier (e.g. "-//W3C//DTD HTML 4.01//EN") in the doctype, and the SGML and XML Catalog specifications!
  
  The expectation is that software would ship with its own copies of "well-known" DTDs with associated catalog entries; the URL is only there as a fallback. The problem is ignorant and/or lazy software developers not implementing catalogs and simply downloading from the URI each time.
  
  --
  
  DNA just wants to be free...
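A minimal sketch of the catalog lookup described above: prefer a bundled copy keyed by the public identifier, and use the URL only as a fallback. The class name `LocalDtdCatalog` and the path `dtd/html4-strict.dtd` are hypothetical, chosen for illustration; this is not the SGML or XML Catalog format itself, just the idea behind it.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical catalog: maps DTD public identifiers to bundled local copies. */
class LocalDtdCatalog {
    private final Map<String, String> localCopies = new HashMap<>();

    /** Registers a local file (the path is an assumption) for a public identifier. */
    void register(String publicId, String localPath) {
        localCopies.put(publicId, localPath);
    }

    /**
     * Returns the local copy when the public identifier is known;
     * otherwise falls back to the system identifier (the URL in the doctype).
     */
    String locate(String publicId, String systemId) {
        String local = (publicId == null) ? null : localCopies.get(publicId);
        return (local != null) ? local : systemId;
    }

    public static void main(String[] args) {
        LocalDtdCatalog catalog = new LocalDtdCatalog();
        catalog.register("-//W3C//DTD HTML 4.01//EN", "dtd/html4-strict.dtd");
        // Known public identifier: the bundled copy wins, no network needed.
        System.out.println(catalog.locate("-//W3C//DTD HTML 4.01//EN",
                "http://www.w3.org/TR/html4/strict.dtd"));
    }
}
```

Only a document with an unregistered public identifier would ever reach the fallback URL, which is the behaviour the doctype was designed around.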
6. Re:Wow by Bogtha · 2008-02-08 14:28 · Score: 5, Informative
  
  They literally wrote the standard.
  
  "Webmasters" refers to people who run websites, not the W3C. And this particular feature is an artefact of SGML, which was around for over a decade before the W3C ever existed.
  
  If they didn't want the traffic they should have specified the matter in their RFCs.
  
  You mean like how RFC 2616 describes the caching mechanism that is being ignored by the problem clients? Or are you referring to the established-for-decades SGML system catalogue that they mention in the HTML 4 specification multiple times?
  
  You can tell them apart by their attention to the consequences of their actions.
  
  If people writing client software actually did what they were supposed to, this wouldn't be a problem. This is not a designed-in bug, this is caused by a minority of developers eschewing the specifications and standard practice out of either ignorance or apathy.
  
  --
  Bogtha Bogtha Bogtha
7. Re:Wow by Anonymous Coward · 2008-02-08 14:54 · Score: 5, Insightful
  
  They literally wrote the standard
  
  Yeah, the standard. If your shitty http engine is too shitty to process html without having to look up the DTD on the w3c's website every single page, your shitty http engine shouldn't be allowed out on the internet.
8. Re:Wow by Blakey+Rat · 2008-02-08 14:59 · Score: 5, Insightful
  
  If people writing client software actually did what they were supposed to, this wouldn't be a problem. This is not a designed-in bug, this is caused by a minority of developers eschewing the specifications and standard practice out of either ignorance or apathy.
  
  Wow, it just struck me... welcome to Microsoft's world.
  
  Their security was so bad for so many years because they worked on the assumption that:
  1) Programmers know what they're doing
  2) Programmers aren't assholes
  
  Of course, the success of malware vendors (and Real Networks) has proved those two assumptions wrong many years ago, and probably 90% of the development work on Vista was adding in safeties to protect against idiot programmers, and asshole programmers.
  
  And now the W3C is getting their lesson on a golden platter.
  
  In short, here's the lesson learned:
  1) Some proportion of programmers don't know what they're doing and never will
  2) Some proportion of programmers are assholes
  
  --
  Comment of the year
9. Re:Wow by ibbie · 2008-02-08 16:16 · Score: 5, Insightful
  
  They literally wrote the standard
  
  Yeah, the standard. If your shitty http engine is too shitty to process html without having to look up the DTD on the w3c's website every single page, your shitty http engine shouldn't be allowed out on the internet.
  
  Good and jolly bacon bits, please mod parent up. I realize that their comment might come off as harsh, but crap, come on. If one is building an application, would one really want to have to connect to a website to get instructions on how to read a filetype? Especially when all it would take is a single wget and including those instructions with the application to avoid all of this.
  
  Furthermore, it would seem that the process of reading a file would be far faster if the processing instructions were on the local file system rather than on a remote host. If one were really worried about changes to the instructions, one could code a routine to update the DTD whenever the application is updated; if the app isn't such that *would* be updated, one could always have it run a diff against the W3C's DTD every few months - after it's been standardized, it's not like the DTD is going to change on a daily basis. While not a complete cure, it'd still be far more considerate to the W3C's bandwidth than hitting it every request, or even every time a program is started.
  
  Honestly, I wouldn't blame them if they 302'd the file to a page that, upon CAPTCHA'd request, made the file temporarily available for download, so that vendors could fix their broken software. They're obviously far more considerate and forgiving people than I - and, I suspect, many of you fellow Slashdotters - tend to be.
  
  *puts on flame-resistant suit*
  
  --
  The wise follow a damned path, for to know is to be forsaken.
10. Re:Wow by sco08y · 2008-02-08 21:52 · Score: 5, Insightful
  
  Furthermore, it would seem that the process of reading a file would be far faster if the processing instructions were on the local file system rather than on a remote host. If one were really worried about changes to the instructions, one could code a routine to update the DTD whenever the application is updated; if the app isn't such that *would* be updated, one could always have it run a diff against the W3C's DTD every few months - after it's been standardized, it's not like the DTD is going to change on a daily basis.
  
  It's more like this: your app should *never* query the DTD. If the DTD changes, your app's code probably needs to change and your app should *never* try to parse using a DTD that hasn't been tested by a human being, or at least through your regression tests. Any changes to DTDs should be handled by updating the app itself.
  
  The only exception to this is an app that also happens to be a development tool.
11. Re:Wow by Curtman · 2008-02-08 23:47 · Score: 4, Funny
  
  I don't claim to know why you have a problem with webmasters (I am not one)
  
  Probably for the same reason that many other people hate them. They announce themselves to people as being a "webmaster". It's a really stupid title. They don't perform wizardry. If I can't at least be a "codemaster", and maybe our plumber gets to be called a "pipemaster", then we'll continue to mock anyone who uses the word. Oooh, "plungemaster". I think he'd go for that.
12. Re:Wow by jacksonj04 · 2008-02-09 01:15 · Score: 2, Insightful
  
  "Webmaster" is to "Person who makes sure the website and all associated gizmos are working properly" as "Foreman" is to "Person who makes sure the work site and all associated equipment and personnel are behaving properly".
  
  It's fallen into common usage. What else would you suggest? "Web Designer", "Network Architect" and all the other 'bits' of webmastery are already taken. Perhaps "Web Systems Administrator".
  
  --
  How many people can read hex if only you and dead people can read hex?
13. Re:Wow by mollymoo · 2008-02-09 01:41 · Score: 2, Interesting
  
  If people writing client software actually did what they were supposed to, this wouldn't be a problem. This is not a designed-in bug, this is caused by a minority of developers eschewing the specifications and standard practice out of either ignorance or apathy.
  
  Failing to be aware of how your users will likely behave is a design bug. If a tiny fraction of your users make a particular error it's probably their fault. If a significant proportion of your users make a particular error, it's your fault.
  
  --
  Chernobyl 'not a wildlife haven' - BBC News
14. Re:Wow by gbjbaanb · 2008-02-09 01:49 · Score: 4, Insightful
  
  It's more like this: your app should *never* query the DTD.
  
  then there's little point in having one at all, is there.
  
  You're quite right though, copy the DTD, develop against it, publish without the DTD being present in your released app. simple. If only the W3C hadn't specified it as being required to be present. If only every sample didn't have it shown in place.
15. Re:Wow by man_of_mr_e · 2008-02-09 02:59 · Score: 4, Insightful
  
  That's a bit disingenuous. Nowhere in the standards does it require anyone to cache the DTDs either.
  
  If you ask me, the W3 asked for this. They didn't consider the consequences, and now that they're under siege, they want to blame everyone else.
  
  --
  If you need web hosting, you could do worse than here
16. Re:Wow by Jonboy+X · 2008-02-09 03:39 · Score: 3, Funny
  
  Webmonkey?
  
  --
  
  "In a 32-bit world, you're a 2-bit user. You've got your own newsgroup, alt.total.loser." -Weird Al
The Solution by OdieWan · 2008-02-08 13:29 · Score: 5, Funny

I have a solution to the problem; I wrote it down at http://www.w3.org/TR/html4/strict.dtd !
1. Re:The Solution by Anonymous Coward · 2008-02-08 13:32 · Score: 5, Funny
  
  Don't click that link! It's some sort of ascii pornography!
Do what.... by Creepy+Crawler · 2008-02-08 13:29 · Score: 5, Funny

Do what any other respectable web provider would do..

Put links to Goatse in the definitions!
--
- Mod parent up! by Anonymous Coward (Score:1) Thurs, Nov 31, @13:37
1. Re:Do what.... by gronofer · 2008-02-08 15:27 · Score: 3, Insightful
  
  No, a respectable provider, like Network Solutions for example, would find a way to dish up adverts.
Who made the DTD a URL? by Anonymous Coward · 2008-02-08 13:29 · Score: 2, Interesting

Oh, that was you? I thought that making every webauthor refer to a W3C URL in every web page was going to get someone in trouble someday. Today seems to be someday.
1. Re:Who made the DTD a URL? by colinrichardday · 2008-02-08 13:40 · Score: 2, Insightful
  
  Or you could do what I do, and simply download the DTD, install it on your system,
  and use that instead.
2. Re:Who made the DTD a URL? by ozamosi · 2008-02-08 14:10 · Score: 4, Insightful
  
  It does contain a URL. It also contains a URN (for instance "-//W3C//DTD HTML 4.01//EN"). The point of a URN is that it doesn't have a universal location - you're supposed to find it wherever you can, probably in a local cache somewhere.
  
  The URL can be seen as a backup ("in case you don't know the DTD for W3C HTML 4.01, you can create a local copy from this URL" - in the future, when people have forgotten HTML 4.01, that can be useful), or in the same way XML namespaces are used - you don't have to send an HTTP request to http://www.w3.org/1999/xhtml to know that a document that uses that namespace is an XHTML document - it's just another form of unique resource identifier (URI), just like a URN or a GUID.
  
  What the W3C is having a problem with is applications that decide to fetch the DTD on every single request. That's just crazy. Why do you even need to validate it, unless you're a validator? Just try to parse it - it probably won't validate anyway, and you'll have to either do it in some kind of quirks mode or just break. If you can parse it correctly, does it matter if it validates? If you can't parse it, does it matter if it validates? And if you actually do want to validate it, why make the user wait a few seconds while you fetch the DTD on every page request? The only reasonable way this could happen that I can think of is link crawlers that find the URL - but don't link crawlers usually avoid revisiting pages they just visited?
3. Re:Who made the DTD a URL? by vux984 · 2008-02-08 20:49 · Score: 3, Insightful
  
  Could have made the DTD a unique ID, rather than an address.
  
  An address is effectively a unique ID.
  
  And the advantage of an address is that it's a logical place to put the DTD if you don't happen to have your own copy. It's a unique id and a map to where to get it if you don't already have it.
  
  What were they thinking?
  
  They were thinking people wouldn't needlessly continually redownload the same page over and over and over again.
  
  The root dns servers operate under the same assumption. Do you think they were crazy too? After all, you can force your dns queries to go through the root servers every time if you really want to. You're not supposed to, and doing so needlessly puts more load on them, but you could.
Leave it to Slashdot... by PocketPick · 2008-02-08 13:30 · Score: 2, Funny

It's a good thing we don't contribute to the problem - Oh, wait...

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

<title>Slashdot: News for nerds, stuff that matters</title>
1. Re:Leave it to Slashdot... by snl2587 · 2008-02-08 13:36 · Score: 5, Informative
  
  Note: It is my understanding that the browser is what looks up the DTD. So /. having the declaration is irrelevant.
2. Re:Leave it to Slashdot... by Vectronic · 2008-02-08 13:41 · Score: 2, Insightful
  
  And if he really wanted to be funny, he would have quoted it from the W3C webpage that the story/blog was posted on.
3. Re:Leave it to Slashdot... by corsec67 · 2008-02-08 14:03 · Score: 2, Informative
  
  Actually, do any browsers get the DTD?
  From the article, it seems like the problem is with software that processes XML, like a web crawler, not a browser.
  
  Browsers are also pretty good about caching stuff.
  
  --
  If I have nothing to hide, don't search me
4. Re:Leave it to Slashdot... by milsoRgen · 2008-02-08 14:09 · Score: 2, Informative
  
  From the article, it seems like the problem is with software that processes XML, like a web crawler, not a browser.
  
  FTA:
  
  The vast majority of these requests are from systems that are processing various types of markup (HTML, XML, XSLT, SVG)
  
  I don't claim to fully grasp what software is causing the problem, but it does seem to affect more than just XML.
  
  --
  I'm sick of following my dreams. I'm just going to ask where they're goin' and hook up with 'em later.
5. Re:Leave it to Slashdot... by MenTaLguY · 2008-02-08 14:35 · Score: 2
  
  Even then, those should be caching in a local catalog, based on the PI.
  
  --
  
  DNA just wants to be free...
Delay by erikina · 2008-02-08 13:31 · Score: 5, Interesting

Have they tried delaying the response by 5 or 6 seconds? It could cause a lot of applications to hang pretty badly. That or just serve a completely nonsensical schema every thousandth request. Gotta keep developers on their toes.
1. Re:Delay by bunratty · 2008-02-08 14:09 · Score: 3, Informative
  
  RTFA. They returned the 503 Service Unavailable error to many abusers, and they just kept on with abusive requests. Many abusers aren't checking the response to the request at all.
  
  --
  What a fool believes, he sees, no wise man has the power to reason away.
2. Re:Delay by dotancohen · 2008-02-08 14:36 · Score: 4, Funny
  
  You must be a Microsoft engineer.
  
  --
  It is dangerous to be right when the government is wrong.
3. Re:Delay by RhysU · 2008-02-08 14:41 · Score: 3, Informative
  
  Good: Delivered a piece of code once that tested just fine for us, but blew up at the customer's site. We never realized that the new J2EE-like features were hitting a live URL during DTD parsing.
  
  Better: Had a build system once that looked for a host and had to TCP timeout before the build could continue. Had to happen several hundred times a build cycle.
  
  The Java libraries do this down in their innards unless you're very careful to avoid it.
4. Re:Delay by bwb · 2008-02-08 14:47 · Score: 5, Insightful
  
  Sure, they're ignoring the response status, but I'll betcha most of them are doing synchronous requests. If I were solving this problem for W3C, I'd be delaying the abusers by 5 or 6 *minutes*. Maybe respond to the first request from a given IP/user agent with no or little delay, but each subsequent request within a certain timeframe incurs triple the previous delay, or the throughput gets progressively throttled-down until you're drooling it out at 150bps. That would render the really abusive applications immediately unusable, and with any luck, the hordes of angry customers would get the vendors to fix their broken software.
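The escalation scheme suggested above could be sketched roughly as follows. Everything here is an assumption made for illustration (the class name `EscalatingDelay`, the one-second starting delay, the six-minute cap, keying clients by a plain string); it only computes how long to stall a request and does not describe anything the W3C actually runs.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical throttle: each repeat request inside the window waits three times longer. */
class EscalatingDelay {
    private static final long BASE_DELAY_MS = 1_000;   // assumed first penalty
    private static final long MAX_DELAY_MS = 360_000;  // assumed cap of six minutes

    private final long windowMs;
    // client key -> { time of last request, delay applied to that request }
    private final Map<String, long[]> state = new HashMap<>();

    EscalatingDelay(long windowMs) {
        this.windowMs = windowMs;
    }

    /** Returns how long to stall this request, in milliseconds. */
    long delayFor(String client, long nowMs) {
        long[] last = state.get(client);
        long delay;
        if (last == null || nowMs - last[0] > windowMs) {
            delay = 0;                                     // first request in the window
        } else if (last[1] == 0) {
            delay = BASE_DELAY_MS;                         // first repeat
        } else {
            delay = Math.min(last[1] * 3, MAX_DELAY_MS);   // triple, up to the cap
        }
        state.put(client, new long[] {nowMs, delay});
        return delay;
    }
}
```

A well-behaved client that fetches once and caches never sees a delay; a client hammering the same resource reaches the cap after a handful of requests.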
MIT needs a CDN! by rekoil · 2008-02-08 13:35 · Score: 2, Interesting

I'm surprised none of the CDNs out there have volunteered to host this file - the problem is they'd have to host the entire w3.org site, or else move the rest of it to another hostname.
Simple solution by mcrbids · 2008-02-08 13:36 · Score: 5, Funny

The answer to this problem is quite easy.

Continue to host the data referenced on a single T-1 line. That will cut your expenses to the bone since you'll never exceed 1.54 Mbps and that should be quite cheap. And, any dumfuxorz who fubarred their parser to not cache these basically static values will probably figure it out... very quickly.

You don't have to leave it on the T-1, maybe just 1 month out of the year. Every year.

Problem solved!

--
I have no problem with your religion until you decide it's reason to deprive others of the truth.
WARNING: GNAA by SirBudgington · 2008-02-08 13:36 · Score: 2, Funny

Don't click the link, it's malware.

--
this is my sig
had this problem with hibernates website... by rgrbrny · 2008-02-08 13:39 · Score: 3, Interesting

the doctype was being used during an xsl transform in our build process; when the hibernate site flaked out, the builds would fail intermittently.
solution was to add a xmlcatalog using a local resource.
bet this happens a lot more than most people realize; we'd been doing this for years before we noticed a problem.
Umm, no. by pavon · 2008-02-08 13:43 · Score: 5, Informative

That is supposed to be there according to the standard. And all the major browsers cache that file after loading it (at most) once, and then never read it again. So no, slashdot is not causing a problem. The problem is all the other HTML processing software besides browsers that does not cache its DTD files, not the pages that contain the declaration.

If you want to complain, it should be the fact that slashdot is serving a strict.dtd when it doesn't validate against it.
1. Re:Umm, no. by MillionthMonkey · 2008-02-08 16:37 · Score: 5, Interesting
  
  At least what little code I've written to process HTML/XML has always entirely ignored the DTD.
  
  Don't be so sure- even if your own code ignores it. Unless you're dealing with it on a raw character level, with most XML libraries and frameworks it can be quite tricky to prevent DTDs from being resolved behind your back.
  
  I wrote some Java code a while back to parse some XML files that were downloaded from NCBI. Typical for NCBI data, this involved wading through terabytes of crap, and anything based on DOM wasn't going to work- so I used the lower level event-based SAX library in JAXP. The files did have DTD declarations in them pointing to NCBI, which I wanted to ignore, since this was a one-time data mining operation. I just examined some sample files, figured out pseudo-XPath expressions for what I wanted to pull out, set up a simple state machine to stumble through the SAX events, and not caring about the DTD, cleared the namespace-aware and validating flags on the SAXParserFactory. So I ended up with this:
  
  import java.io.File;
  import java.io.FileInputStream;
  import java.util.zip.GZIPInputStream;
  import javax.xml.parsers.SAXParser;
  import javax.xml.parsers.SAXParserFactory;
  import org.xml.sax.InputSource;
  import org.xml.sax.helpers.DefaultHandler;
  
  File xmlgz = new File("ncbi_diarrhea.xml.gz");
  DefaultHandler myHandler = new MyNCBIStateMachineHandler();
  GZIPInputStream gzos = new GZIPInputStream(new FileInputStream(xmlgz));
  SAXParserFactory spf = SAXParserFactory.newInstance();
  spf.setValidating(false);       // no DTD validation wanted
  spf.setNamespaceAware(false);
  SAXParser sp = spf.newSAXParser();
  InputSource input = new InputSource(gzos);
  sp.parse(input, myHandler);
  
  This ran fine, until it mysteriously froze up 18 hours into the run. It turned out to be caused by our switch to a different ISP, during which time the building lost its outside network access. The thread picked up the next file and immediately got blocked in the SAX library, trying to resolve the NCBI DTD.
  
  This is how I fixed it:
  
  spf.setFeature("http://xml.org/sax/features/external-general-entities", false);
  spf.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
  
  Now I'm sure someone is going to come on here calling me a noob for not knowing to use an XMLReaderFactory (or whatever XML API class isn't obsolete this week) and setting a custom EntityResolver that can provide my local copy of the NCBI DTD when presented with its URI, but why should I even have to bother with that? XML pretends to be simple but it's seriously messed up.
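For comparison, the custom EntityResolver route mentioned above can be quite short. This is a sketch under assumptions: the class name `OfflineSaxParse` and the element-counting handler are made up for illustration, and the resolver simply answers every external entity request with an empty document instead of serving a real local DTD copy.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: SAX-parses a string without ever fetching an external DTD. */
class OfflineSaxParse {
    /** Returns the number of start tags seen in the document. */
    static int countElements(String xml) {
        final int[] count = {0};
        try {
            SAXParserFactory spf = SAXParserFactory.newInstance();
            spf.setValidating(false);
            spf.setNamespaceAware(false);
            SAXParser sp = spf.newSAXParser();
            XMLReader reader = sp.getXMLReader();

            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes attrs) {
                    count[0]++;
                }
            });
            // Answer every external entity request (including the DTD named in
            // the doctype) with an empty document, so nothing goes to the network.
            reader.setEntityResolver(
                    (publicId, systemId) -> new InputSource(new StringReader("")));

            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return count[0];
    }
}
```

The trade-off is that entities declared only in the skipped DTD will be unknown to the parser, so this suits one-off data extraction better than anything that depends on the DTD's content.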
Irony by davburns · 2008-02-08 13:44 · Score: 4, Funny

So, w3c complains about their bandwidth, and the response is: The Slashdot Effect. Doesn't that make the old bandwidth problem seem less of a problem?
I'm just loving the irony in that.
1. Re:Irony by ion.simon.c · 2008-02-08 16:12 · Score: 2, Informative
  
  See this comment. /. is NOTHING compared to the traffic generated by DTD requests.
  
  http://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic#c1821
Such an easy solution by mwasham · 2008-02-08 13:49 · Score: 5, Funny

And it is only 4 articles down.. Host with Yahoo! Yahoo Offers All-You-Can-Eat Storage and Bandwidth http://hardware.slashdot.org/article.pl?sid=08/02/08/1811236

--
Dallas Real Estate
They already do. by pavon · 2008-02-08 13:51 · Score: 4, Informative

The spec already recommends this and all the major browsers do it. The software that is causing the problem are generic XML/SGML processing packages which were designed to be able to deal with documents with any random DTD, not just the main HTML/XHTML ones from W3C. They are the folks that are downloading each DTD every single time and not caching it, contrary to the standard. Sometimes caching is a configuration option which defaults to off and administrators never turn it on.
1. Re:They already do. by _xeno_ · 2008-02-08 15:52 · Score: 3, Insightful
  
  The problem is that several major XML libraries don't just default to no DTD/schema cache - they don't even implement a cache or local catalog. Implementing such a thing is left to the developers using the library.
  
  For example, the XML libraries that come with Sun's Java rely on java.net.URL for downloading resources. I just checked my 1.6 Java install, and by default, it has no cache. In looking up how the java.net cache works, I discovered it wasn't even added until Java 1.5. So prior to Java 1.5, most Java libraries wouldn't cache responses at all because the included library didn't support caching. 'Course, even in Java 1.6, there's no default implementation, so each Java application would have to implement their own cache[1].
  
  The included Java libraries also offer no internal DTD/schema catalog. You can create one (implement org.xml.sax.EntityResolver[2]) but by default they're off to the Internet to download any DTD they run across.
  
  It's really not hard to see how these libraries could result in millions of hits a day - most people using them probably don't even realize that they're hitting the W3C's servers since it happens transparently. And fixing it is unfortunately not just setting configuration files and saving the DTDs locally: it's implementing a bunch of classes.
  
  [1] And for added fun, the stub that is provided appears to be insufficient to support conditional requests - either the cache says "I have it!" and the cached response is used, or the server has to send a new copy. There's no way to offer up an "If-Modified-Since:" request via the cache class.
  
  [2] Noting that this can't be set for all parsers, it's set on a per-parser object basis. So if you use a third-party library that parses XML after creating its own parser object, you can't make it use your local DTD catalog.
  
  --
  You are in a maze of twisty little relative jumps, all alike.
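To give a sense of the "bunch of classes" involved, here is a rough sketch of a resolver that fetches each DTD at most once per process. The class name `CachingEntityResolver` and the pluggable `fetcher` function are assumptions made for illustration; only `org.xml.sax.EntityResolver` and `InputSource` are real library types. The cache lives in memory and ignores HTTP expiry entirely, so it is a starting point rather than a proper HTTP cache.

```java
import java.io.ByteArrayInputStream;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

/** Hypothetical resolver that downloads each external entity at most once. */
class CachingEntityResolver implements EntityResolver {
    private final Map<String, byte[]> cache = new ConcurrentHashMap<>();
    // Assumed to return the entity's bytes for a system identifier, never null.
    private final Function<String, byte[]> fetcher;

    CachingEntityResolver(Function<String, byte[]> fetcher) {
        this.fetcher = fetcher;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        // Only the first request for a given system identifier calls the fetcher.
        byte[] body = cache.computeIfAbsent(systemId, fetcher);
        InputSource source = new InputSource(new ByteArrayInputStream(body));
        source.setPublicId(publicId);
        source.setSystemId(systemId);
        return source;
    }
}
```

As footnote [2] points out, an instance still has to be installed on every parser object (for example with `XMLReader.setEntityResolver`), so third-party code that builds its own parser would bypass it.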
Submitted this to /.? by dotancohen · 2008-02-08 13:52 · Score: 5, Funny

Great, they cry "we get too much traffic", so we go ahead and slap them on the front page of slashdot. Sick, sick fucking joke.

--
It is dangerous to be right when the government is wrong.
1. Re:Submitted this to /.? by ger · 2008-02-08 14:47 · Score: 5, Informative
  
  To try to help put these numbers into perspective, this blog post is currently #1 on slashdot, #7 on reddit, the top page of del.icio.us, etc; yet www.w3.org is still serving more than 650 times as many DTDs as this blog post, according to a 10-min sample of the logs I just checked.
2. Re:Submitted this to /.? by ger · 2008-02-08 18:45 · Score: 4, Informative
  
  650 times as many hits. (163 times as many bytes.) But that's just from a quick sample.
Re:That's what you get for making stupid rules. by Bogtha · 2008-02-08 13:55 · Score: 5, Informative

They insist that every document begin with a declaration that includes a link to their site.

It's not a link. It's a reference to an external DTD subset. It's there so that generic SGML software can properly parse the document without any special knowledge of HTML.

The link in the declaration serves absolutely NO purpose other than to comply with the standard that they created. It sounds like the whole purpose was so that they could have every source page begin with a link to their site. Serves them right.

No, external DTD subsets are a part of SGML, which is at least a decade older than the W3C.

--
Bogtha Bogtha Bogtha
I'm going to say this as clearly as possible. by glwtta · 2008-02-08 13:55 · Score: 3, Informative

Browsers cache the DTDs.

There, you can now stop posting your hilarious "jokes".

--
sic transit gloria mundi
Gumdrops by milsoRgen · 2008-02-08 13:58 · Score: 4, Insightful

They are just about the only people who cannot be responsible for this.

Exactly, for as long as I've been involved with HTML's various forms over the years it was always considered proper technique (from W3C documentation) to include the doctype (or more recently xmlns). Certainly sounds like a parser issue to me.

The only thing I'm unclear on is whether your average browser is contributing to this problem when parsing properly written documents.

--
I'm sick of following my dreams. I'm just going to ask where they're goin' and hook up with 'em later.
Re:caching by Bogtha · 2008-02-08 13:59 · Score: 2, Insightful

Add some sort of caching parameter to the DTD spec, that specifies how long browsers should cache those DTDs.

You're solving that problem at the wrong layer. HTTP already includes caching mechanisms, the W3C already use them, and part of the problem is that buggy software is ignoring them.

Another potential solution: Have browsers keep the DTDs cached

Please read the article. This is already supposed to happen. Buggy software fails to do this, which is the problem being talked about.

--
Bogtha Bogtha Bogtha
Starting on the 1st, fool by Scrameustache · 2008-02-08 14:06 · Score: 5, Funny

You don't have to leave it on the T-1, maybe just 1 month out of the year. Every year.

I suggest April! :D

--
You can't take the sky from me...
Surprise by MBCook · 2008-02-08 14:08 · Score: 3, Insightful

I've got to say, this doesn't surprise me at all. In the time I've spent at my job, I've been repeatedly floored by the amazing conduct of other companies' IT departments. We've only encountered two people I can think of who have been hostile. Everyone else has been quite nice. You'd think people would have things set up well, but they don't.

We've seen many custom XML parsers and encoders, all slightly wrong. We've seen people transmitting very sensitive data without using any kind of security until we refused to continue working without SSL being added to the equation. We've seen people who were secure change their certificates to self-signed, and we seem to consistently know when people's certificates expire before they do.

But even without these things, I can't tell you how many people send us bad data and flat out ignore the response. We get all sorts of bad data sent to us all the time. When that happens, we reply with a failure message describing what's wrong. Yet we get bits of stuff all the time that is wrong, in the same way, from the same people. I'm not talking about sending us something that they aren't supposed to (X when we say only Y), I'm saying invalid XML type wrong... such that it can't be parsed.

We have, a few times while I've been there, had people make a change in their software (or something) and bombard us with invalid data until we either block their IP or manage to get into voice contact with their IT department. Sometimes they don't even seem to notice the lockout.

Some places can be amazing. Some software can be poorly designed (or something can cause a strange side effect, see here). I really like one of the suggestions in the comments on the article... start replying really slowly, and often with invalid data. They won't do it. I wouldn't. But I like the idea.

--
Comment forecast: Bits of genius surrounded by a sea of mediocrity.
Make it slower, not faster by Thunderbear · 2008-02-08 14:20 · Score: 2, Insightful

If the problem is that it gets served out too many times, then make the server slow as molasses. If it takes 1-2 minutes to get the DTD from the server, or more, then it is quickly discovered by the performance teams.

--
Thorbjørn Ravn Andersen "...and...Tubular Bells!"
The problem is with the docs by Mantaar · 2008-02-08 14:33 · Score: 4, Insightful

The problem does not lie in the mechanism itself - it's in the documentation - or the lack of understandable (or at least often-used) docs directly at the source.

Simple caching on client side could already improve the situation a whole lot... BUT:

When people implement something for html-ish or svg-ish or xml-ish purposes, they go google for it: "Howto XML blah foo" - result, they're getting basic screw-it-with-a-hammer tutorials that don't point out important design decisions, but instead Just Work - which is what the author wanted to achieve when they started writing the software.

It's a little bit like people still using ifconfig on Linux though it's been deprecated and superseded by iptables and iproute2. But since most tutorials and howtos on the net are just dumbed-down copypasta for quick and dirty hacks - and since nobody fucking enforces the standards - nobody does it the Right Way.

So if I start writing some sax-parser, some html-rendering lib, some silly scraper, whatnot... and the first example implementations only deal with basic stuff and show me how to do it so basic functionality can be implemented... and I'm not really interested in that part of the program anyways, because I need it for putting something more fancy on top... once after I'm through with the initial testing of this particular subsystem, I won't really care about anything else. It works, it doesn't seem to hit performance too badly, it's according to some random guy's completely irrelevant blog - hey, this guy knows what he's doing. I don't care!

This story hitting /.'s front page might actually help improve the situation. But.. it's like this with stupid programmers - they never die out, they'll always create problems. Let's get used to it.

--
I'm an infovore...
1. Re:The problem is with the docs by paul248 · 2008-02-08 16:34 · Score: 4, Informative
  
  Oh, please do tell me how to use iptables or iproute2 to set my ip address, or to enable/disable a network adaptor.
  
  ip link set eth0 up
  ip addr add 192.168.1.2/24 dev eth0
  ip link set eth0 down
  
  etc. etc.
Re:I always thought it was stupid by MtHuurne · 2008-02-08 14:52 · Score: 2, Interesting
I wrote my thesis in Docbook and installed the processing toolchain on a laptop. Sometimes the processing would fail and sometimes it worked. After a while I noticed it worked when I was sitting behind my desk and failed when I was sitting on my bed. After some digging, I found out that the catalog configuration was wrong and the XML parser was downloading the DTDs from the web. This was before WiFi, so sitting on the bed meant the laptop did not have internet access.

The core of the problem is that most XML parsers will automatically and transparently fetch the DTD from the URL and do not cache it. So if you have no DTDs installed locally, or if your XML parser cannot find them (catalog configuration is easy to mess up), the parsing will work just fine and if processing the XML takes a significant amount of time, you probably won't notice the small delay from downloading the DTD.

There are several possible solutions for this:
- Do not automatically fetch DTDs from the web: make it an explicit option that the user has to set.
- Be vocal when fetching a DTD from the web, for example issue a warning.
- Cache fetched DTDs locally.
All of these are things that should be addressed in the XML parsers.
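
A minimal sketch of how those three suggestions could fit together, in Python. Everything here is hypothetical: the class name DtdResolver, the allow_network flag and the injected fetch callable are made up for illustration and are not part of any real parser's API.

```python
import hashlib
import warnings
from pathlib import Path
from typing import Callable
from urllib.request import urlopen


def http_fetch(url: str) -> bytes:
    """Default fetcher: one plain GET, only reached if the network is allowed."""
    with urlopen(url, timeout=10) as response:
        return response.read()


class DtdResolver:
    """Hypothetical resolver: local cache first, network only on explicit opt-in."""

    def __init__(self, cache_dir: str, allow_network: bool = False,
                 fetch: Callable[[str], bytes] = http_fetch) -> None:
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.allow_network = allow_network
        self.fetch = fetch

    def _cache_path(self, url: str) -> Path:
        # Hash the URL so any system identifier maps to a safe file name.
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return self.cache_dir / (digest + ".dtd")

    def resolve(self, url: str) -> bytes:
        path = self._cache_path(url)
        if path.exists():
            # Suggestion 3: serve the DTD from the local cache.
            return path.read_bytes()
        if not self.allow_network:
            # Suggestion 1: never fetch unless the user asked for it.
            raise LookupError(f"DTD not cached and network fetching is off: {url}")
        # Suggestion 2: be vocal about going to the web.
        warnings.warn(f"fetching DTD from the web: {url}", stacklevel=2)
        data = self.fetch(url)
        path.write_bytes(data)
        return data
```

With allow_network left at its default, a missing DTD becomes a loud error instead of a silent download, which is the failure mode that would have exposed the broken catalog configuration right away.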
That's the problem with a URI for an ID by argent · 2008-02-08 14:59 · Score: 3, Insightful

I think they screwed up, and brought this on themselves. I already thought that it was annoying having so verbose an identifier... this just makes it more hateful.

If they'd at least made the identifier NOT a URI, something like domain.example.com::[path/]versionstring, or something else that wasn't a URI, so it was clearly an identifier even if it was ultimately convertible to a URI, they would have avoided this kind of problem.
Oy Vey... by zanaxagoras · 2008-02-08 15:41 · Score: 3, Interesting

PocketPick is 100% correct.

Here's an example of what correct markup should look like:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://schemas.slashdot.org/strict.dtd">
The documented standard uses a URL that links to the W3C's copy of the DTD only as an EXAMPLE. The standard DOES NOT REQUIRE usage of the URL to W3C's copy of the DTD. Responsible developers use a URL that links to their OWN COPY of the DTD. ANYTHING else is just leeching from W3C. PERIOD.
1. Re:Oy Vey... by Zarel · 2008-02-08 16:45 · Score: 3, Interesting
  
  The documented standard uses a URL that links to the W3C's copy of the DTD only as an EXAMPLE. The standard DOES NOT REQUIRE usage of the URL to W3C's copy of the DTD. Responsible developers use a URL that links to their OWN COPY of the DTD. ANYTHING else is just leeching from W3C. PERIOD.
  
  Well, no, it's not. It's true that the standard does not require usage of the URL to W3C's copy of the DTD, but it's definitely recommended, since every client presumably has a cached copy of the W3C's DTD for something as common as HTML 4.01, and if you were to link to your own, some parsers might be confused and unsure about whether or not you're using Official W3C HTML (tm). (Yes, yes, I know; they should know by '-//W3C//etc' but this article is about stupid parsers, isn't it?)
  
  --
  Want a high quality FOSS RTS game? Try Warzone 2100!
Hey, for once... by 93+Escort+Wagon · 2008-02-08 15:46 · Score: 4, Funny

... you can't blame Microsoft for this problem! After all, IE ignores pretty much all web standards and best practices, and does its own thing!

--
#DeleteChrome
Re:That's what you get for making stupid rules. by somethinghollow · 2008-02-08 17:58 · Score: 4, Insightful

It was the W3C that decided to make HTML a subset of SGML. They could have done what HTML 5 is doing by creating a "serialization" that doesn't care about the DTD (HTML5 doesn't call for one in the DOCTYPE). As it is, theoretically, I can write my own DTD (a modification of the HTML4 DTD, for example) that adds new elements. Technically, the SGML parser should know and understand those DTDs. To do such, it must download the DTD. Browsers are supposed to be handling SGML docs, but chose to implement against the W3C recommendations instead of caring about the DTDs; the popular ones don't download the DTD at all... they don't even care if it exists or not, which is why the short HTML5 DOCTYPE works as a quirksmode switch but is still valid HTML and renders like HTML should.

Should SGML renderers cache it? Yes. Should W3C bitch that some SGML renderers are downloading their DTD? No. They should have thought about that before they made HTML a subset of SGML. I don't feel sorry for them.
Re:I always thought it was stupid by MtHuurne · 2008-02-08 22:19 · Score: 2, Informative

I think I was using the Java version of Apache Xerces at the time for the Docbook processing. More recently I've used lxml in Python (based on libxml2), which has an option (no_network) to suppress DTD loading from the web, but you have to request that explicitly.

I've never seen a parser that caches DTDs by default. I'm not sure about parsers that do not download by default.
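
One data point on parsers that do not download by default: as far as I know, Python's standard-library xml.etree.ElementTree parses a document carrying a W3C DOCTYPE without requesting the external DTD. A minimal sketch (the sample document and the title_of helper are made up for illustration):

```python
import xml.etree.ElementTree as ET

# Sample XHTML document; the DOCTYPE names the W3C-hosted DTD as system identifier.
SAMPLE = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Thesis</title></head>
  <body><p>No entities from the DTD are used here.</p></body>
</html>"""

XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}


def title_of(document: str) -> str:
    """Parse the document and return the text of its <title>, or ''."""
    root = ET.fromstring(document)  # the external DTD is not loaded here
    title = root.find("x:head/x:title", XHTML_NS)
    if title is None or title.text is None:
        return ""
    return title.text
```

The trade-off is that nothing declared only in the DTD is available, so this suits well-formedness parsing rather than validation.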
Re:That's what you get for making stupid rules. by vidarh · 2008-02-08 23:12 · Score: 3, Insightful

If you'd actually bothered reading what you quoted you might've noticed the sentence "The system identifier may be changed to reflect local system conventions". Only the public identifier is required to be one of the strings provided. The system identifier (http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd) can point wherever you want it to. But well-behaved clients are expected to use a catalog anyway.
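
A rough sketch of what a catalog lookup amounts to, in Python: the public identifier is the key, and the system identifier is only a fallback. The CATALOG dict, its file paths and the function names are all hypothetical, and the regex only handles double-quoted PUBLIC doctypes.

```python
import re
from typing import Optional, Tuple

# Hypothetical local catalog: public identifier -> path of a local DTD copy.
# The paths are invented for illustration.
CATALOG = {
    "-//W3C//DTD XHTML 1.0 Transitional//EN": "/usr/share/xml/xhtml1-transitional.dtd",
    "-//W3C//DTD HTML 4.01//EN": "/usr/share/sgml/html401-strict.dtd",
}

# Matches <!DOCTYPE name PUBLIC "public id" "optional system id">
DOCTYPE_RE = re.compile(
    r'<!DOCTYPE\s+\S+\s+PUBLIC\s+"([^"]*)"(?:\s+"([^"]*)")?\s*>',
    re.IGNORECASE,
)


def doctype_ids(markup: str) -> Optional[Tuple[str, Optional[str]]]:
    """Return (public_id, system_id) from the first PUBLIC doctype, or None."""
    match = DOCTYPE_RE.search(markup)
    if match is None:
        return None
    return match.group(1), match.group(2)


def locate_dtd(markup: str) -> Optional[str]:
    """Prefer the catalog entry for the public ID; fall back to the system ID."""
    ids = doctype_ids(markup)
    if ids is None:
        return None
    public_id, system_id = ids
    return CATALOG.get(public_id, system_id)
```

With a lookup like this, whatever URL appears in the system identifier is never requested as long as the public identifier is in the catalog.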
Re:Speaking of caches... by cnettel · 2008-02-09 00:49 · Score: 4, Insightful

That doesn't stop the loading page from making a request to the W3C servers, does it? Worse, many of the existing cache implementations won't cache an error result.
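
To illustrate the point about error results, here is a hedged sketch of a cache that also remembers failures for a while, so a blocked or failing URL is not retried on every document. The NegativeCache class, its error_ttl parameter and the injected fetch and clock callables are hypothetical, not taken from any existing cache implementation.

```python
import time
from typing import Callable, Dict, Tuple


class NegativeCache:
    """Hypothetical in-memory cache that also remembers failed fetches."""

    def __init__(self, fetch: Callable[[str], bytes], error_ttl: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.fetch = fetch
        self.error_ttl = error_ttl  # seconds to remember a failure
        self.clock = clock
        self._good: Dict[str, bytes] = {}
        self._bad: Dict[str, Tuple[float, Exception]] = {}

    def get(self, url: str) -> bytes:
        if url in self._good:
            return self._good[url]
        if url in self._bad:
            failed_at, error = self._bad[url]
            if self.clock() - failed_at < self.error_ttl:
                # Still inside the window: fail again without touching the network.
                raise error
            del self._bad[url]
        try:
            data = self.fetch(url)
        except OSError as error:
            # Remember the failure so the next caller does not retry at once.
            self._bad[url] = (self.clock(), error)
            raise
        self._good[url] = data
        return data
```

A client built this way would hit the server once per error_ttl window after being blocked, rather than once per parsed document.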
What's their website URN, anyway? by Prototerm · 2008-02-09 01:28 · Score: 3, Funny

Nothing, it's a non-profit.

(ducks and runs)

--
"My country, right or wrong; if right, to be kept right; and if wrong, to be set right." --Senator Carl Schurz (1872)
Re:I'd write the crap code. by BZ · 2008-02-09 05:08 · Score: 3, Informative

Just use html5lib instead of rolling your own; it'll parse pretty much "the same way" for all practical purposes.