W3C Gets Excessive DTD Traffic

Wow by geekoid · 2008-02-08 13:26 · Score: 2, Funny

"Webmasters" strike again. Clowns.

--
The Dunning-Kruger effect explains most posts on /. http://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect

Re:Wow by Breakfast+Pants · 2008-02-08 13:34 · Score: 4, Insightful

Not only that, this document gets cached all over the place by ISPs, etc., and they *still* get that many hits.

--

WHO ATE MY BREAKFAST PANTS?
Re:Wow by x_MeRLiN_x · 2008-02-08 13:43 · Score: 3, Interesting

The summary strongly implies and the article states that this unwanted traffic is coming from software that parses markup. Placing the DTD into a web page or other medium where markup is used is the intended and desirable usage.

I don't claim to know why you have a problem with webmasters (I am not one), but if you're a programmer and perceive them to have less technical ability than yourself, well.. your ilk seem to be the "clowns" this time.
Re:Wow by Bogtha · 2008-02-08 13:45 · Score: 5, Insightful

Why on earth are you blaming webmasters? They are just about the only people who cannot be responsible for this. People who write HTML parsers, HTTP libraries, screen-scrapers, etc, they are the ones causing the problem. Badly-coded client software is to blame, not anything you put on a website.

--
Bogtha Bogtha Bogtha
Re:Wow by x_MeRLiN_x · 2008-02-08 14:11 · Score: 1

I agree with your main point, but blaming authors of screen scrapers is ridiculous. Screen scraping is reading the final output of a program (or in this case a web page) in image format and converting that into usable data with methods such as OCR (optical character recognition).
Re:Wow by milsoRgen · 2008-02-08 14:14 · Score: 2, Informative

You're kidding, right? They literally wrote the standard.

Well, yes, they did (as long as the 'they' you are referring to is the W3C), and nowhere in the standards they have approved does it call for every system parsing a document with a DTD to request that information over and over again. Especially considering that data tends to remain static once committed to an official standard.

--
I'm sick of following my dreams. I'm just going to ask where they're goin' and hook up with 'em later.
Re:Wow by MenTaLguY · 2008-02-08 14:19 · Score: 4, Insightful

That's the whole purpose of the public identifier (e.g. "-//W3C//DTD HTML 4.01//EN") in the doctype, and the SGML and XML Catalog specifications!

The expectation is that software would ship with its own copies of "well-known" DTDs with associated catalog entries; the URL is only there as a fallback. The problem is ignorant and/or lazy software developers not implementing catalogs and simply downloading from the URI each time.
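
A minimal sketch of the catalog idea in Python. The `CATALOG` table, the `dtds` directory, the file names and `resolve_dtd` are all illustrative assumptions, not part of any real catalog implementation: known public identifiers resolve to a bundled copy, and the URL is consulted only as a fallback.

```python
from pathlib import Path

# Hypothetical layout: DTD files shipped alongside the application.
BUNDLED_DTD_DIR = Path("dtds")

# Hypothetical catalog: public identifier -> bundled file name.
CATALOG = {
    "-//W3C//DTD HTML 4.01//EN": "html401-strict.dtd",
    "-//W3C//DTD XHTML 1.0 Strict//EN": "xhtml1-strict.dtd",
}


def resolve_dtd(public_id, system_url):
    """Return ("local", path) for a known public identifier, else ("remote", url)."""
    filename = CATALOG.get(public_id)
    if filename is not None:
        return ("local", BUNDLED_DTD_DIR / filename)
    # Fallback only: an unknown DTD could be fetched once and then cached.
    return ("remote", system_url)
```

With a table like this, a document declaring "-//W3C//DTD HTML 4.01//EN" never causes a request to w3.org; only a public identifier the application has never heard of falls through to the URL.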

--

DNA just wants to be free...
Re:Wow by Bogtha · 2008-02-08 14:28 · Score: 5, Informative

They literally wrote the standard.

"Webmasters" refers to people who run websites, not the W3C. And this particular feature is an artefact of SGML, which was around for over a decade before the W3C ever existed.

If they didn't want the traffic they should have specified the matter in their RFCs.

You mean like how RFC 2616 describes the caching mechanism that is being ignored by the problem clients? Or are you referring to the established-for-decades SGML system catalogue that they mention in the HTML 4 specification multiple times?

You can tell them apart by their attention to the consequences of their actions.

If people writing client software actually did what they were supposed to, this wouldn't be a problem. This is not a designed-in bug, this is caused by a minority of developers eschewing the specifications and standard practice out of either ignorance or apathy.

--
Bogtha Bogtha Bogtha
Re:Wow by Anonymous Coward · 2008-02-08 14:32 · Score: 1, Insightful

The problem is ignorant and/or lazy software developers

No, the problem is a standards body which doesn't take into account that there are quite a few ignorant and/or lazy software developers around. Don't make people put one of your URLs in every web document if you don't want that file to be downloaded a gazillion times.
Re:Wow by Bogtha · 2008-02-08 14:32 · Score: 1

Screen scraping is reading the final output of a program (or in this case a web page) in image format and converting that into usable data with methods such as OCR (optical character recognition).

Actually, the term is widely used as a synonym for spidering a site. It's rare I see it used in the way you describe. Sorry for the confusion.

--
Bogtha Bogtha Bogtha
Re:Wow by Anonymous Coward · 2008-02-08 14:54 · Score: 5, Insightful

They literally wrote the standard

Yeah, the standard. If your shitty http engine is too shitty to process html without having to look up the DTD on the w3c's website every single page, your shitty http engine shouldn't be allowed out on the internet.
Re:Wow by Blakey+Rat · 2008-02-08 14:59 · Score: 5, Insightful

If people writing client software actually did what they were supposed to, this wouldn't be a problem. This is not a designed-in bug, this is caused by a minority of developers eschewing the specifications and standard practice out of either ignorance or apathy.

Wow, it just struck me... welcome to Microsoft's world.

Their security was so bad for so many years because they worked on the assumption that:
1) Programmers know what they're doing
2) Programmers aren't assholes

Of course, the success of malware vendors (and Real Networks) has proved those two assumptions wrong many years ago, and probably 90% of the development work on Vista was adding in safeties to protect against idiot programmers, and asshole programmers.

And now the W3C is getting their lesson on a golden platter.

In short, here's the lesson learned:
1) Some proportion of programmers don't know what they're doing and never will
2) Some proportion of programmers are assholes

--
Comment of the year
Re:Wow by techno-vampire · 2008-02-08 15:18 · Score: 1

I realize this is Slashdot, but I still have to ask if you even bothered to read the post before hitting Reply? The OP wasn't saying that the people at W3C aren't responsible, he was saying that webmasters aren't responsible, and he's right. The problem here is with badly-written programs constantly requesting something they neither need nor use.

--
Good, inexpensive web hosting
Re:Wow by X0563511 · 2008-02-08 16:16 · Score: 1

Or, that in the event they DO need it, they fetch it rather than use a cached version.

Imagine if the DTDs managed to get updated every few seconds. That would be a hell of a target to track!

There is NO reason for the DTDs to be downloaded repeatedly, as seems to be the case. This is what all these posters seem to be missing.

--
For large sets, this will be our guide even unto death, for the LORD will work for each type of data it is applied to...
Re:Wow by ibbie · 2008-02-08 16:16 · Score: 5, Insightful

They literally wrote the standard

Yeah, the standard. If your shitty http engine is too shitty to process html without having to look up the DTD on the w3c's website every single page, your shitty http engine shouldn't be allowed out on the internet.

Good and jolly bacon bits, please mod parent up. I realize that their comment might come off as harsh, but crap, come on. If one is building an application, would one really want to have to connect to a website to get instructions on how to read a filetype? Especially when all it would take is a single wget and including those instructions with the application to avoid all of this.

Furthermore, it would seem that the process of reading a file would be far faster if the processing instructions were on the local file system rather than on a remote host. If one were really worried about changes to the instructions, one could code a routine to update the DTD whenever the application is updated; if the app isn't one that *would* be updated, one could always have it run a diff against the W3C's DTD every few months - after it's been standardized, it's not like the DTD is going to change on a daily basis. While not a complete cure, it'd still be far more considerate to the W3C's bandwidth than hitting it on every request, or even every time a program is started.

Honestly, I wouldn't blame them if they 302'd the file to a page that, upon CAPTCHA'd request, made the file temporarily available for download, so that vendors could fix their broken software. They're obviously far more considerate and forgiving people than I - and, I suspect, many of you fellow Slashdotters - tend to be.

*puts on flame-resistant suit*

--
The wise follow a damned path, for to know is to be forsaken.
Re:Wow by marcog123 · 2008-02-08 18:01 · Score: 1

They should cache the pages though. As the article mentions, the content of these pages almost never changes. The article is picking on software that doesn't cache and/or makes ridiculous numbers of queries for the same content.
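
A rough sketch of the kind of caching being asked for, in Python. Everything here is an assumption made for illustration: the `cached_fetch` helper, the three-month `MAX_AGE_SECONDS`, and the `fetch` callable that the caller passes in to do the actual download (so the sketch itself never touches the network).

```python
import time
from pathlib import Path

# Assumption: recheck roughly every three months; published DTDs rarely change.
MAX_AGE_SECONDS = 90 * 24 * 3600


def cached_fetch(url, cache_dir, fetch, max_age=MAX_AGE_SECONDS):
    """Return the body for url, calling fetch(url) only when no fresh copy is on disk."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Derive a filesystem-safe file name from the URL.
    name = "".join(c if c.isalnum() else "_" for c in url)
    path = cache_dir / name
    if path.exists() and time.time() - path.stat().st_mtime < max_age:
        return path.read_bytes()
    body = fetch(url)
    path.write_bytes(body)
    return body
```

Called twice in a row for the same URL, this invokes `fetch` once; the second call is served from disk. An HTTP-aware client could go further and honour the server's cache headers instead of using a fixed age.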
Re:Wow by LaskoVortex · 2008-02-08 18:20 · Score: 0

IE is logically causing most of the problem, as that is probably the program most used to go to websites. I'm not randomly hating on windows here: Remember, windows is the most used OS, so the IE browser is the most used program that fetches web pages. This is pure logic and is the same reasoning defenders of windows use when they account for the preponderance of viruses on windows machines.

--
Just callin' it like I see it.
Re:Wow by SanityInAnarchy · 2008-02-08 18:56 · Score: 1

It's not just web pages. That's not all the W3C does.

In fact, they cite things that IE doesn't support natively, like SVG -- unless something's changed.

But before you go blaming MS, try gathering some evidence, maybe? Shouldn't be hard -- you do have access to a machine with IE, right? And Wireshark is easy to find...

--
Don't thank God, thank a doctor!
Re:Wow by SanityInAnarchy · 2008-02-08 19:04 · Score: 1

The W3C may still be living in the era when all websites were looked after directly by an admin, who did everything to do with the site, including write the software. They might share software, just as they might share HTML, but for many of these sites, best practices, or even common sense practices, were completely gone -- so no formal "libraries".

Of course, I just completely made that up...

--
Don't thank God, thank a doctor!
Re:Wow by poker-pauly · 2008-02-08 19:24 · Score: 0

I have absolutely no sympathy for their plight. I sent them a comment ONCE and they printed my email address and I have been plagued by spam ever since. They refused to delete the addy or email so it's just poetic justice that this is happening to them.
Re:Wow by Z00L00K · 2008-02-08 21:21 · Score: 1

It's not actually the fault of webmasters; correctly written web pages SHALL include the doctype.
The real problem is that there is software out there that doesn't cache the DTDs. Sometimes a web browser may be told not to cache anything at all, perhaps because the user has specific problems with caching other content and the browser uses one global cache instead of caching special resources like the DTDs separately.
Using special caching of these resources in your applications and libraries will produce a lot less load and also better-performing web pages. We see too much bad and ugly performance on the web today, and some of it is caused by stupidities like this. Of course, a few applications may not be able to cache the DTDs at all, but these should be designed not to need to download the DTDs anyway.
The only thing webmasters need to check is that they provide exactly the same doctype declaration for all their pages, essentially avoiding variation in upper and lower case, bad spacing, etc., but that's a trivial fix.
One thing really missing here is some kind of statistics on which user-agents are the worst. A lot could be gained by pointing out the offenders with this information. (Some may not even be aware that their application does this.)

--
If builders built buildings the way programmers wrote programs, then the first woodpecker would destroy civilization.
Re:Wow by sco08y · 2008-02-08 21:52 · Score: 5, Insightful

Furthermore, it would seem that the process of reading a file would be far faster if the processing instructions were on the local file system rather than on a remote host. If one were really worried about changes to the instructions, one could code a routine to update the DTD whenever the application is updated; if the app isn't one that *would* be updated, one could always have it run a diff against the W3C's DTD every few months - after it's been standardized, it's not like the DTD is going to change on a daily basis.

It's more like this: your app should *never* query the DTD. If the DTD changes, your app's code probably needs to change and your app should *never* try to parse using a DTD that hasn't been tested by a human being, or at least through your regression tests. Any changes to DTDs should be handled by updating the app itself.

The only exception to this is an app that also happens to be a development tool.
Re:Wow by Anonymous Coward · 2008-02-08 22:37 · Score: 0

I agree, I suggest we let everybody require government approval and tracking to use the internet. These people are terrorists.
Re:Wow by Curtman · 2008-02-08 23:47 · Score: 4, Funny

I don't claim to know why you have a problem with webmasters (I am not one)

Probably for the same reason that many other people hate them. They announce themselves to people as being a "webmaster". It's a really stupid title. They don't perform wizardry. If I can't at least be a "codemaster", and maybe our plumber gets to be called a "pipemaster", then we'll continue to mock anyone who uses the word. Oooh, "plungemaster". I think he'd go for that.
Re:Wow by aspx · 2008-02-09 00:41 · Score: 1

Yeah, totally dude. The government should decide who and what can access the Internet! Hahahaha!

Oh wait, you are serious.
Re:Wow by jacksonj04 · 2008-02-09 01:15 · Score: 2, Insightful

"Webmaster" is to "Person who makes sure the website and all associated gizmos are working properly" as "Foreman" is to "Person who makes sure the work site and all associated equipment and personnel are behaving properly".

It's fallen into common usage. What else would you suggest? "Web Designer", "Network Architect" and all the other 'bits' of webmastery are already taken. Perhaps "Web Systems Administrator".

--
How many people can read hex if only you and dead people can read hex?
Re:Wow by mollymoo · 2008-02-09 01:41 · Score: 2, Interesting

If people writing client software actually did what they were supposed to, this wouldn't be a problem. This is not a designed-in bug, this is caused by a minority of developers eschewing the specifications and standard practice out of either ignorance or apathy.

Failing to be aware of how your users will likely behave is a design bug. If a tiny fraction of your users make a particular error it's probably their fault. If a significant proportion of your users make a particular error, it's your fault.

--
Chernobyl 'not a wildlife haven' - BBC News
Re:Wow by gbjbaanb · 2008-02-09 01:49 · Score: 4, Insightful

It's more like this: your app should *never* query the DTD.

then there's little point in having one at all, is there.

You're quite right though, copy the DTD, develop against it, publish without the DTD being present in your released app. simple. If only the W3C hadn't specified it as being required to be present. If only every sample didn't have it shown in place.
Re:Wow by GaryOlson · 2008-02-09 02:23 · Score: 1

If your shitty http engine is too shitty to process html without having to look up the DTD on the w3c's website every single page, your shitty http engine shouldn't be allowed out on the internet.
Bailiff, whack his httpee-pee!

--
Every mans' island needs an ocean; choose your ocean carefully.
Re:Wow by man_of_mr_e · 2008-02-09 02:59 · Score: 4, Insightful

That's a bit disingenuous. Nowhere in the standards does it require anyone to cache the DTDs either.

If you ask me, the W3 asked for this. They didn't consider the consequences, and now that they're under siege, they want to blame everyone else.

--
If you need web hosting, you could do worse than here
Re:Wow by norton_I · 2008-02-09 03:01 · Score: 1

The point of the DTD URI is to uniquely identify the document type. An application can use it to decide what to do with a given document, or whether it can handle it at all. For almost all client applications, if the DTD uri is one that is recognized, you don't need to download it. If it isn't recognized, downloading it doesn't help--you still don't know what to do with the data. The existence of a downloadable document at the URL of the same name is a convenience, it is not required for correct operation of any application.

The only type of application which should ever want (not need) to fetch the DTD is a general-purpose validator, such as in an XML editor. In that case, it can save the user a bit of effort by automatically loading the DTD rather than requiring the user to supply it (a feature which must be supported--there is no requirement that the URI corresponds to a specific URL). Applications which do this are expected to maintain a cache.
Re:Wow by ultranova · 2008-02-09 03:19 · Score: 1

If your shitty http engine is too shitty to process html without having to look up the DTD on the w3c's website every single page, your shitty http engine shouldn't be allowed out on the internet.

If you don't know the difference between the HTTP transfer protocol and a HTML renderer, then perhaps you shouldn't judge implementations of the former for problems which are clearly caused by neither, but rather by deficiencies in the cache system which goes between them.

--
Forget magic. Any technology distinguishable from divine power is insufficiently advanced.
Re:Wow by Anonymous Coward · 2008-02-09 03:20 · Score: 0

I think "God" just punished the W3C. That's a screwed-up standard.
Get rid of that standard fast. Block any access to those pages from W3C so they don't have to waste any more bandwidth.
Distribute the file to the developers for them to store on their own servers.
And the distributed file should be zipped up, or even encrypted, so developers can't even link to that file directly, but have to jump through some hoops to get it to work.
Re:Wow by Jonboy+X · 2008-02-09 03:39 · Score: 3, Funny

Webmonkey?

--

"In a 32-bit world, you're a 2-bit user. You've got your own newsgroup, alt.total.loser." -Weird Al
Re:Wow by Anonymous Coward · 2008-02-09 04:06 · Score: 0

If they're okay with developers including static copies of the DTDs in client software, it should fucking well say so in the DTDs. Instead they make absolutely no mention of intellectual property issues whatsoever. Writing software is an IP minefield. Copy/pasting other peoples' work into your code is a damn fine way to get yourself sued. People don't want to get sued. The w3c have never understood this kind of real-world concern.
Re:Wow by khallow · 2008-02-09 04:47 · Score: 1

So why is the unique identifier a URL? This implies functionality beyond what you are claiming.
Re:Wow by BZ · 2008-02-09 05:13 · Score: 1

> So why is the unique identifier a URL?

Because certain constituencies ("semantic web", etc) feel that URLs (URIs, IRIs, whatever) are a one-size-fits-all tool for identifying everything. That's why XML namespaces are technically recommended to be URIs (though in the end you just do an explicit string match on the namespace and never actually fetch anything from the URI). This is why URIs are used for pointing to parts of the same document. The list goes on and on.

Basically, some people feel they have a hammer and all unique identification problems start to look like nails to them.
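
The string-match point about namespaces can be sketched in Python with the standard-library `xml.etree.ElementTree`. The sample document and the `XHTML_NS` constant name are made up for the illustration; the namespace URI is the one for XHTML.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

doc = '<html xmlns="http://www.w3.org/1999/xhtml"><head><title>t</title></head><body/></html>'
root = ET.fromstring(doc)

# ElementTree exposes the namespace as a plain "{uri}" prefix on the tag name.
namespace = root.tag[1:].split("}", 1)[0] if root.tag.startswith("{") else None

# Plain string comparison; nothing is ever fetched from the namespace URI.
is_xhtml = namespace == XHTML_NS
```

The namespace is treated purely as an opaque name here, which is why it would work just as well if it were a URN or a GUID.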
Re:Wow by msuarezalvarez · 2008-02-09 05:16 · Score: 1

Because that allows you to recycle a well-known and very much available hierarchical system which already exists. This is explained in the standards.
Re:Wow by msuarezalvarez · 2008-02-09 05:20 · Score: 1

If you can't really find the copyright information on the W3C schemas and DTDs, then you should probably not be writing software...
Re:Wow by msuarezalvarez · 2008-02-09 05:23 · Score: 1

Yes. All the spam you've ever gotten can be tracked back to their posting of your email address. Have you considered legal action?
Re:Wow by Anonymous Coward · 2008-02-09 05:43 · Score: 0

What the fuck are you talking about?
Re:Wow by DavidTC · 2008-02-09 06:01 · Score: 1

Because they didn't post the user-agents that are doing this, we've got a lot of goobers here who think they know what's doing it. As someone who's actually watched web pages load at the network level, let me assure everyone that none of IE, Firefox, or Opera has ever made a request to the w3c for a DTD. And it's not just caching...you can make up a URL and standard name and use that in the doctype and they don't go get it. It's not end-user web browsers doing it, mainly because the doctype is used in web browsers solely as a trigger string...they don't actually have the DTD and use it to actually parse the document.
What I suspect it is, but have no evidence for, are validation and parsing libraries. Not applications, libraries. Libraries that expect a DTD, and someone who wrote the application just handed it the URL, or told it to get the URL itself, instead of implementing a local copy or cache. Java's already been mentioned above by someone else, but I don't know if they have any evidence of this or are just making it up.
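
The "trigger string" behaviour described above can be illustrated with a toy Python sketch. This is not any browser's real rule set: the `QUIRKS_PUBLIC_IDS` table and the `rendering_mode` function are hypothetical and heavily simplified. The point is only that the decision is a table lookup on the doctype text, with no DTD download involved.

```python
# Toy table: public identifiers that switch on the legacy rendering mode.
# Real browsers use much longer lists and extra conditions.
QUIRKS_PUBLIC_IDS = {
    "-//W3C//DTD HTML 3.2 Final//EN",
}


def rendering_mode(doctype_public_id):
    """Pick a rendering mode from the doctype's public identifier alone."""
    if doctype_public_id is None:
        # No doctype at all: treat the page as legacy markup.
        return "quirks"
    if doctype_public_id in QUIRKS_PUBLIC_IDS:
        return "quirks"
    # Anything else, including a made-up identifier, gets standards mode.
    return "standards"
```

Under this scheme a made-up identifier and URL behave exactly like a real one, which matches the observation that browsers never go and fetch what the doctype points at.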

--
If corporations are people, aren't stockholders guilty of slavery?
Re:Wow by poker-pauly · 2008-02-09 06:32 · Score: 0

The spam began shortly after they published my comment and it's the only place on the www where my email appears. What else am I to think?
Re:Wow by Anonymous Coward · 2008-02-09 08:55 · Score: 0

Because it seemed a good idea at the time. But people are learning from that.

As I commented above, the trend now is to use URNs instead of URLs for XML Namespace URIs. See http://www.oasis-open.org/specs/index.php and notice how the earlier ones have namespaces like "http://docs.oasis-open.org/ws-tx/wscoor/2006/06" (for WS-Transaction 1.1) but the most recent ones have things like "urn:oasis:names:tc:xliff:document:1.2" (for XLIFF 1.2).

URNs don't have anywhere near enough information to download with; no matter how harebrained the programmer gets, they can't think it's a download location. You can still validate these, but the validating program has to know where the file is that maps to that URN namespace; hopefully it comes *with* the program instead of being a hard-coded URL.

Programs should really come with a bundled copy of these schema files to validate against.

My original comment:
http://developers.slashdot.org/comments.pl?sid=447350&cid=22362548
Re:Wow by Anonymous Coward · 2008-02-09 09:02 · Score: 0

So why is the unique identifier a URL?

It isn't. The unique identifier is "-//W3C//DTD HTML 4.01//EN" - the URL is only there as a convenience to applications that don't already have a copy, and these applications should NOT download a fresh copy for every page they process.
Re:Wow by canuck57 · 2008-02-09 10:27 · Score: 1

Yeah, the standard. If your shitty http engine is too shitty to process html without having to look up the DTD on the w3c's website every single page, your shitty http engine shouldn't be allowed out on the internet.

Now what am I missing?
The DTD is a reference; if your app needs it, then why not include it in your app? Make your app so it runs on an isolated network?
If it is intended to be downloaded on each start of an app, what a daft design move, whether by the W3C or the app writer.
Why don't they just move the directories/links and let the bad apps break? That will teach them not to write poorly designed applications. This way they are off the internet, as you suggest.
Re:Wow by aevans · 2008-02-09 10:58 · Score: 1

Exactly. The only reason they put it there in the first place is they wanted free advertising and the credibility that comes with everyone listing "their" site as the canonical source of the DTDs for HTML, etc. It just worked better than they thought. Or rather, the internet is much bigger than they were capable of imagining. People mentioned above the idea that (perpetual) caching of DTDs should be mandatory. That's idiotic, but even if it were the case, with billions of web pages and billions of users, even if everyone only ever loaded each DTD once, it would still tax their system far beyond their ability to handle.
Re:Wow by POWRSURG · 2008-02-09 13:11 · Score: 1

I realize this is Slashdot, but I still have to ask if you even bothered to read the post before hitting Reply? The OP wasn't saying that the people at W3C aren't responsible, he was saying that webmasters aren't responsible, and he's right. The problem here is with badly-written programs constantly requesting something they neither need nor use.

Webmasters write web sites, not applications that render HTML (or any of the other standards that require a DTD published by the W3C).
Re:Wow by Z00L00K · 2008-02-09 20:56 · Score: 1

Oh - Java is simple, just use the URL class to get a URLConnection through the openConnection() method.
If not using the setRequestProperty("User-Agent", "pickled herring") the user agent reported will be the current Java version.
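
A rough Python analogue of the same idea, using the standard-library `urllib.request`. The `USER_AGENT` string is a hypothetical product token invented for the example; the point is that a client which must fetch a DTD can at least identify itself so the offending software can be traced.

```python
import urllib.request

DTD_URL = "http://www.w3.org/TR/html4/strict.dtd"

# Hypothetical token; a real application would use its own name and contact URL.
USER_AGENT = "ExampleValidator/1.0 (+http://example.org/contact)"

# Building the Request sends nothing; a request only goes out when
# urllib.request.urlopen(request) is called.
request = urllib.request.Request(DTD_URL, headers={"User-Agent": USER_AGENT})
```

Without an explicit header, the library's own default user agent is sent instead, which tells the server operator very little about which application is responsible.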

--
If builders built buildings the way programmers wrote programs, then the first woodpecker would destroy civilization.
Re:Wow by Gideon+Fubar · 2008-02-10 11:54 · Score: 1

Why would you spend all that extra time converting it to an image and then back again? Webpages are in text, be that ascii or unicode, and regex is a hell of a lot faster than OCR.

--
http://www.xkcd.com/354/
Re:Wow by Yvan256 · 2008-02-10 13:20 · Score: 1

To summarize the summary of the summary: people are a problem. - Douglas Adams
Re:Wow by BasharTeg · 2008-02-10 15:19 · Score: 1

Yes, you're right. The English language not having a sufficient number of words to describe the 50,000,000 other job titles in the world, "web master" is clearly the only way to describe the position in question.

Why not Web Lord? Web Messiah? Web Christ?

There is obviously some ego that goes into pushing the title "webmaster".
Re:Wow by Anonymous Coward · 2008-02-10 15:42 · Score: 0

Why not Web Lord? Web Messiah? Web Christ?

Because those are made up from scratch, not based on existing roles. "webmaster" is used by direct analogy with "postmaster", the RFC-mandated address for the email administrator of any domain, and "hostmaster", the typical address for the administrator of the DNS records themselves.
Re:Wow by shentino · 2008-02-11 04:46 · Score: 1

Hmm...perhaps these DTD's should be fetched using HTTP?

What protocol do they use to fetch the actual DTD anyway? FTP? Rsync?
Re:Wow by MenTaLguY · 2008-02-11 06:07 · Score: 1

So who should host the canonical copies of those DTDs to which the system identifier refers, if not the W3C?

If the !DOCTYPE has an external identifier at all, the external identifier has to include a system identifier (it's the public identifier that's optional). Although it's a requirement of XML, the requirement is inherited from SGML, which predates the W3C. The requirement exists for good reason, too: without it, there would be no way to obtain a new DTD in a forward-compatible way.
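
To make the public/system distinction concrete, here is a small Python sketch that pulls both identifiers out of a doctype declaration. The `DOCTYPE_RE` pattern and `parse_doctype` function are illustrative assumptions and deliberately lenient: they handle only double-quoted identifiers and no internal subset, and they accept a PUBLIC form without a system identifier, as found in HTML.

```python
import re

# Simplified pattern: root name, then either PUBLIC "pubid" ["sysid"] or SYSTEM "sysid".
DOCTYPE_RE = re.compile(
    r'<!DOCTYPE\s+(?P<root>[^\s>]+)'
    r'(?:\s+PUBLIC\s+"(?P<public>[^"]*)"(?:\s+"(?P<system_pub>[^"]*)")?'
    r'|\s+SYSTEM\s+"(?P<system>[^"]*)")?\s*>',
    re.IGNORECASE,
)


def parse_doctype(text):
    """Return the root name, public identifier and system identifier, or None."""
    match = DOCTYPE_RE.search(text)
    if match is None:
        return None
    return {
        "root": match.group("root"),
        "public": match.group("public"),
        # The system identifier can follow either PUBLIC "..." or SYSTEM.
        "system": match.group("system_pub") or match.group("system"),
    }
```

For an XHTML 1.0 Strict doctype this yields the "-//W3C//DTD XHTML 1.0 Strict//EN" public identifier and the w3.org URL as the system identifier; for a `SYSTEM`-only doctype the public identifier comes back as `None`.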

--

DNA just wants to be free...
Re:Wow by x_MeRLiN_x · 2008-02-11 06:22 · Score: 1

I'm guessing it's mostly performed by spammers wanting to harvest email addresses displayed as images. In fact, there were a few Slashdot articles about Robert Scoble who screen scraped Facebook in order to gather personal data and was (temporarily) banned for it.
Re:Wow by DragonWriter · 2008-02-11 06:37 · Score: 1

It's more like this: your app should *never* query the DTD.

That depends what your app is. If your app is designed to do something with arbitrary SGML data and runs into something that happens to be HTML, querying the DTD makes some sense, as for any other specific kind of document it might be fed, though even in that case, it should only have to pick it up once (barring limitations on caching), not once for each HTML document it hits.

If your app is designed to work with HTML specifically, it should never query the HTML DTD(s). If it is, for some reason, built on a generic SGML engine and needs the DTD as part of the input to that, it should have a local copy (if it handles more than one version of HTML, it should have all of the DTDs needed.)
Re:Wow by jdschulteis · 2008-02-11 11:03 · Score: 1

... the internet is much bigger than they were capable of imagining.
The internet is big. You just won't believe how vastly, hugely, mind-bogglingly big it is. I mean, you may think it's a long way down the tubes to Slashdot, but that's just peanuts to the internet.
Re:Wow by Gideon+Fubar · 2008-02-11 12:03 · Score: 1

Well.. perhaps the method you're talking about.

I work with federated search engines, which automatically search academic databases and aggregate the contents. While most of the databases searched by this system have dedicated points of entry for systems like mine (Z39.50), some still require you to search in their own native interface. For those databases we use a regex parser over the html, to try and identify various pieces of information by their previous tags and their positions on the page. These parsers tend to break easily, of course.
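
A toy version of that kind of regex parser, in Python. The `PAGE` markup, the class names and the `RESULT_RE` pattern are invented for the example; a real database's HTML would look different, which is exactly why these parsers break whenever the page layout changes.

```python
import re

# Hypothetical result page; the markup is made up for illustration.
PAGE = """
<div class="result"><span class="title">Catalogs in SGML</span> <span class="year">1994</span></div>
<div class="result"><span class="title">Caching HTTP Responses</span> <span class="year">1999</span></div>
"""

# Identify each field by the tag that precedes it and its position in the row.
RESULT_RE = re.compile(
    r'<span class="title">(?P<title>[^<]+)</span>\s*'
    r'<span class="year">(?P<year>\d{4})</span>'
)

results = [(m.group("title"), int(m.group("year"))) for m in RESULT_RE.finditer(PAGE)]
```

Rename one class attribute or reorder the two spans and the pattern silently matches nothing, so a parser like this needs monitoring as much as it needs writing.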

--
http://www.xkcd.com/354/

First? by robo_mojo · 2008-02-08 13:27 · Score: 0, Redundant

"oops"

The Solution by OdieWan · 2008-02-08 13:29 · Score: 5, Funny

I have a solution to the problem; I wrote it down at http://www.w3.org/TR/html4/strict.dtd !

Re:The Solution by Anonymous Coward · 2008-02-08 13:32 · Score: 5, Funny

Don't click that link! It's some sort of ascii pornography!
Re:The Solution by jgoemat · 2008-02-08 13:40 · Score: 0

ROFL! I can't believe I used up my Mod points...
Re:The Solution by colinrichardday · 2008-02-08 13:42 · Score: 0

But wouldn't you access ascii porn by means of a submit button instead, or perhaps using DOM?
Re:The Solution by Rinisari · 2008-02-08 14:32 · Score: 1

I clicked on that just for the principle of the thing.

--
Colin Dean Go a year without DRM
Re:The Solution by TheSpoom · 2008-02-08 20:12 · Score: 1

Man, I'm so disappointed now that I actually clicked it.

--
It's better to vote for what you want and not get it than to vote for what you don't want and get it.
- E. Debs
Re:The Solution by eugene+ts+wong · 2008-02-08 21:39 · Score: 1

My browser wouldn't render it, because it didn't have a DTD.

Eh, why did I bother clicking it, when I wouldn't read the article, anyways?

--
testing out my trending skills
Re:The Solution by Anonymous Coward · 2008-02-09 02:02 · Score: 0

They've got what they deserve.
How stupid to mix URI and URL concept.
Re:The Solution by msuarezalvarez · 2008-02-09 05:26 · Score: 1

You seem to have missed the fact that ever since SGML came to be there have been these PUBLIC identifiers which, guess what, are URIs, nicely separated from the SYSTEM identifiers, which are these URLs you seem to think they confused.
Re:The Solution by rvJJax · 2008-02-10 19:30 · Score: 1

you make me click the link. thank you

--
S.S.D.D

Do what.... by Creepy+Crawler · 2008-02-08 13:29 · Score: 5, Funny

Do what any other respectable web provider would do..

Put links to Goatse in the definitions!

--

Mod parent up! by Anonymous Coward (Score:1) Thurs, Nov 31, @13:37

Re:Do what.... by gronofer · 2008-02-08 15:27 · Score: 3, Insightful

No, a respectable provider, like Network Solutions for example, would find a way to dish up adverts.
Re:Do what.... by Nicolay77 · 2008-02-10 17:36 · Score: 1

More than insightful, there should be a 'sad but true' mod option.

--
We are Turing O-Machines. The Oracle is out there.

Who made the DTD a URL? by Anonymous Coward · 2008-02-08 13:29 · Score: 2, Interesting

Oh, that was you? I thought that making every webauthor refer to a W3C URL in every web page was going to get someone in trouble someday. Today seems to be someday.

Re:Who made the DTD a URL? by colinrichardday · 2008-02-08 13:40 · Score: 2, Insightful

Or you could do what I do, and simply download the DTD, install it on your system,
and use that instead.
Re:Who made the DTD a URL? by ozamosi · 2008-02-08 14:10 · Score: 4, Insightful

It does contain a URL. It also contains a URN (for instance "-//W3C//DTD HTML 4.01//EN"). The point of a URN is that it doesn't have a universal location - you're supposed to find it wherever you can, probably in a local cache somewhere.

The URL can be seen as a backup ("in case you don't know the DTD for W3C HTML 4.01, you can create a local copy from this URL" - in the future, when people have forgotten HTML 4.01, that can be useful), or the same way XML namespaces are used - you don't have to send an HTTP request to http://www.w3.org/1999/xhtml to know that a document that uses that namespace is an XHTML document - it's just another form of unique resource identifier (URI), just like a URN or a GUID.

What the W3C is having a problem with is applications that decide to fetch the DTD on every single request. That's just crazy. Why do you even need to validate it, unless you're a validator? Just try to parse it - it probably won't validate anyway, and you'll have to either do it in some kind of quirks mode or just break. If you can parse it correctly, does it matter if it validates? If you can't parse it, does it matter if it validates? And if you actually do want to validate it, why make the user wait a few seconds while you fetch the DTD on every page request? The only reasonable way this could happen that I can think of is link crawlers that find the URL - but don't link crawlers usually avoid revisiting pages they just visited?
Re:Who made the DTD a URL? by milsoRgen · 2008-02-08 14:19 · Score: 1

it probably won't validate anyway

Ain't that the truth, brother... I find myself coding for the program parsing the information way more often than I am coding for the standards, as coding standards-based markup always runs into issues.

--
I'm sick of following my dreams. I'm just going to ask where they're goin' and hook up with 'em later.
Re:Who made the DTD a URL? by Anonymous Coward · 2008-02-08 14:20 · Score: 0

Yes, that's the idea. However, regardless of the purpose of the URL, you can't make people put one of a few URLs which point to your server in *every* web page, and expect to get away with it unscathed. It's a stupid idea. Complaining afterwards that people are people only puts the icing on the cake, IMHO.
Re:Who made the DTD a URL? by MenTaLguY · 2008-02-08 14:33 · Score: 1

Minor quibbles: "-//W3C//DTD HTML 4.01//EN" is not a URN but a PI (public identifier), and there is a reason to have validating parsers: the DTD can contain essential information for correctly interpreting the document (e.g. entity declarations, as is obviously the case in HTML).

Other than that you're spot on.

--

DNA just wants to be free...
Re:Who made the DTD a URL? by Bogtha · 2008-02-08 14:38 · Score: 1

Why do you even need to validate it, unless you're a validator? Just try to parse it

The external DTD subset isn't just for error checking. It defines the character entities and the content model for element types. If you don't have access to the DTD (or hard-coded HTML-specific behaviour) you can't parse it fully.

--
Bogtha Bogtha Bogtha
Re:Who made the DTD a URL? by Anonymous Coward · 2008-02-08 15:01 · Score: 0

Even non-validating parsers have to fetch the DTD. The DTD may contain XML entities that the parser has to substitute into the parsed document (think &lt;, for example).
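
To make the entity point concrete, here is a minimal sketch using only Python's standard library. It assumes nothing about any particular browser or crawler. Note that &lt; itself is one of XML's five predefined entities, so it parses without any DTD; a name such as &nbsp; is declared in the (X)HTML DTDs and is rejected when no declaration is available. The helper names below are made up for illustration.

```python
import xml.etree.ElementTree as ET
from html.entities import name2codepoint


def parses_without_dtd(markup: str) -> bool:
    """Return True if the standard-library XML parser accepts the markup as is."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


def expand_html_entity(name: str) -> str:
    """Look an HTML entity name up in a table shipped with the software.

    This mirrors the hard-coded approach: no DTD is fetched at all.
    """
    return chr(name2codepoint[name])


# Predefined XML entity: no DTD needed.
print(parses_without_dtd("<p>a &lt; b</p>"))
# Entity declared only in the (X)HTML DTDs: rejected without a declaration.
print(parses_without_dtd("<p>a&nbsp;b</p>"))
# Bundled table lookup instead of a network fetch.
print(repr(expand_html_entity("nbsp")))
```

The design point is that the entity definitions have to come from somewhere, but a copy bundled with the software serves just as well as one fetched from w3.org.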
Re:Who made the DTD a URL? by inKubus · 2008-02-08 16:07 · Score: 1

Maybe they could use some sort of distributed cache so it doesn't go to one place. Sort of like DNS does, but instead of IPs it returns the DTD. Obviously this would mean deprecating the current standard, but who cares? Just turn it off; people will figure out pretty quickly what they need to do.

--
Cool! Amazing Toys.
Re:Who made the DTD a URL? by VGPowerlord · 2008-02-08 18:43 · Score: 1

Why would it deprecate the existing standard? They could just have a web script return a URI to a copy of it using an HTTP 302/303/307 redirect.

--
GLaDOS for President 2016! "Well here we are again. It's always such a pleasure." -- GLaDOS, 2011
Re:Who made the DTD a URL? by kabdib · 2008-02-08 19:09 · Score: 1

Could have made the DTD a unique ID, rather than an address.

Having it be a usable URL makes about as much sense as making it the phone number of your company's customer support group.

What were they thinking?

--
Any sufficiently advanced technology is insufficiently documented.
Re:Who made the DTD a URL? by vux984 · 2008-02-08 20:49 · Score: 3, Insightful

Could have made the DTD a unique ID, rather than an address.

An address is effectively a unique ID.

And the advantage of an address is that it's a logical place to put the DTD if you don't happen to have your own copy. It's a unique ID and a map to where to get it if you don't already have it.

What were they thinking?

They were thinking people wouldn't needlessly continually redownload the same page over and over and over again.

The root DNS servers operate under the same assumption. Do you think they were crazy too? After all, you can force your DNS queries to go through the root servers every time if you really want to. You're not supposed to, and doing so needlessly puts more load on them, but you could.
Re:Who made the DTD a URL? by Anonymous Coward · 2008-02-09 04:02 · Score: 0

It does contain a URL. It also contains a URN (for instance "-//W3C//DTD HTML 4.01//EN"). The point of a URN is that it doesn't have a universal location - you're supposed to find it wherever you can, probably in a local cache somewhere.
That's not a URN. URNs are just URIs which have urn: as their scheme. You can use URNs in DOCTYPE declarations, but it's a later addition to the SGML standard. The "-//[organisation]//[DTD name]//[language of standard]" string you're referring to is the public identifier, which predates URNs. (The dash at the front can also be a +, in which case the organisation name is officially registered with the ISO, IIRC.) Besides, it shouldn't matter that the system identifier is a URL. Any conforming SGML parser would use the public identifier to fetch the DTD from its own catalog instead of dereferencing the system identifier.
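
The catalog lookup described above can be sketched roughly as follows. This is an illustration, not any real parser's API: the `DtdResolver` class, the catalog paths, and the `fetch` callable are all hypothetical. The idea is only that the public identifier is tried against a local catalog first, and the system identifier is dereferenced at most once as a fallback.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical local catalog: public identifier -> path of a bundled DTD copy.
LOCAL_CATALOG: Dict[str, str] = {
    "-//W3C//DTD HTML 4.01//EN": "/usr/share/sgml/html/strict.dtd",
    "-//W3C//DTD XHTML 1.0 Transitional//EN": "/usr/share/xml/xhtml1-transitional.dtd",
}


class DtdResolver:
    """Resolve a DOCTYPE to a DTD, preferring the local catalog over the network."""

    def __init__(self, catalog: Dict[str, str], fetch: Callable[[str], bytes]):
        self.catalog = dict(catalog)
        self.fetch = fetch                    # url -> bytes, used only as a fallback
        self._cache: Dict[str, bytes] = {}    # system identifier -> fetched DTD

    def resolve(self, public_id: Optional[str], system_id: str) -> Tuple[str, object]:
        # 1. The public identifier is the index into the local catalog.
        if public_id is not None and public_id in self.catalog:
            return ("catalog", self.catalog[public_id])
        # 2. Unknown DTD: dereference the system identifier once, then reuse it.
        if system_id not in self._cache:
            self._cache[system_id] = self.fetch(system_id)
        return ("fetched", self._cache[system_id])
```

With a resolver like this, a document using a well-known public identifier never causes a request, and an unfamiliar DTD costs one request per process rather than one per document.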
Re:Who made the DTD a URL? by msuarezalvarez · 2008-02-09 05:29 · Score: 1

You read the GP exactly in the opposite way: he observed that the HTML is probably not going to validate, and you are saying that the parser is probably going to misparse. You seem to be one of the creators of the content he is objecting to!
Re:Who made the DTD a URL? by msuarezalvarez · 2008-02-09 05:31 · Score: 1

You have not ever heard of SGML catalogs, clearly.

Unix is not the only thing which is bound to be reinvented once and again, poorly.
Re:Who made the DTD a URL? by greengearbox · 2008-02-09 05:34 · Score: 1

Why do you even need to validate it, unless you're a validator? Just try to parse it - it probably won't validate anyway, and you'll have to either do it in some kind of quirks mode or just break. If you can parse it correctly, does it matter if it validates? If you can't parse it, does it matter if it validates?
It's not quite that simple. The DTD isn't just used for validation. It also may contain attribute defaults and entity declarations, without which you may not be able to make sense of the document.

Still, on the whole it is idiotic. The "Public Identifier" can be used to identify the document as HTML (or whatever) and a cached DTD can/should be used. My point is just that in general, there are uses of the DTD beyond validation.
Re:Who made the DTD a URL? by Anonymous Coward · 2008-02-09 18:56 · Score: 0

Blame ISO. If all the specs weren't trapped behind a paywall, SGML would be well-supported rather than an obscure niche.

Leave it to Slashdot... by PocketPick · 2008-02-08 13:30 · Score: 2, Funny

It's a good thing we don't contribute to the problem - Oh, wait...

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

<title>Slashdot: News for nerds, stuff that matters</title>

Re:Leave it to Slashdot... by snl2587 · 2008-02-08 13:36 · Score: 5, Informative

Note: It is my understanding that the browser is what looks up the DTD. So /. having the declaration is irrelevant.
Re:Leave it to Slashdot... by Vectronic · 2008-02-08 13:41 · Score: 2, Insightful

And if he really wanted to be funny, he would have quoted it from the webpage that the story/blog was posted on at the W3C.
Re:Leave it to Slashdot... by Bogtha · 2008-02-08 13:49 · Score: 1

No, Slashdot is not contributing to the problem, that is correct code. Just because a URI is listed, it doesn't mean that software should request it each and every time it sees it. Most code that sees that URI should already have a copy of the DTD in the local catalogue. It's only generic SGML software that cannot be expected to have a copy of the DTD.

--
Bogtha Bogtha Bogtha
Re:Leave it to Slashdot... by corsec67 · 2008-02-08 14:03 · Score: 2, Informative

Actually, do any browsers get the DTD?
From the article, it seems like the problem is with software that processes XML, like a web crawler, not a browser.

Browsers are also pretty good about caching stuff.

--
If I have nothing to hide, don't search me
Re:Leave it to Slashdot... by Anonymous Coward · 2008-02-08 14:07 · Score: 0

Note: it's the thief that does the stealing. So, anyone putting them up to it (and/or providing the means and/or the easy access) is irrelevant?
Re:Leave it to Slashdot... by milsoRgen · 2008-02-08 14:09 · Score: 2, Informative

From the article, it seems like the problem is with software that processes XML, like a web crawler, not a browser.
FTA:

The vast majority of these requests are from systems that are processing various types of markup (HTML, XML, XSLT, SVG)
I don't claim to fully grasp what software is causing the problem, but it does seem to affect more than just XML.

--
I'm sick of following my dreams. I'm just going to ask where they're goin' and hook up with 'em later.
Re:Leave it to Slashdot... by MenTaLguY · 2008-02-08 14:22 · Score: 1

Browsers should ship with the DTD. That's the whole point of the public identifier (e.g. "-//W3C//DTD HTML 4.01//EN"), so that a local copy can be obtained using the PI as an index into a local catalog. The URL is only there as a fallback.

--

DNA just wants to be free...
Re:Leave it to Slashdot... by corsec67 · 2008-02-08 14:31 · Score: 1

Except that I was talking about software ASIDE from browsers, like a XML validator, crawler, etc...
Stuff that deals with generic XML and is being used for xhtml.

--
If I have nothing to hide, don't search me
Re:Leave it to Slashdot... by MenTaLguY · 2008-02-08 14:35 · Score: 2

Even then, those should be caching in a local catalog, based on the PI.

--

DNA just wants to be free...
Re:Leave it to Slashdot... by darkpixel2k · 2008-02-08 15:09 · Score: 1

Note: It is my understanding that the browser is what looks up the DTD. So /. having the declaration is irrelevant.

Yeah. Whose retarded idea was it to give something a valid URI and then say "wait--don't query this URI", it's just for show. Well if it's just for show, don't make it a URI that various automated systems might want to query because a programmer failed to include some 'query everything except this' code.

(In case it isn't obvious, I'm talking out my butt. I really have no clue when it comes to DTDs except that most WYSIWYG web design programs paste that garbage in automagically.)

--
There's no place like ::1 (I've completed my transition to IPv6)
Re:Leave it to Slashdot... by SQLGuru · 2008-02-08 15:16 · Score: 1

Aiding and abetting. Facilitation. All chargeable offenses in legal circles.

Layne
Re:Leave it to Slashdot... by coaxial · 2008-02-08 16:35 · Score: 1

Note: It is my understanding that the browser is what looks up the DTD.
And there's no point to lookup the DTD. Fetching DTDs is only useful for validating an unknown document. When it comes to displaying a document, you already know what you can display, so there's no discovery going on. Validation is meaningless, because you have to handle invalid documents anyway, and even if you do feel like validating the doc, you already know what you can handle, and so you know the DTD!

DTDs are worthless outside of specification.
Re:Leave it to Slashdot... by gsnedders · 2008-02-08 23:43 · Score: 1

Browsers don't use SGML parsers, they use purpose built HTML parsers that have all kinds of quirks that are totally unspecified (except in the HTML 5 draft). As such, they don't actually read the DTD whatsoever: the only thing the DTD means to the browser is whether it will parse in standards mode, almost standards mode, or quirks mode. They actually have the list of entities in HTML hardcoded.
Re:Leave it to Slashdot... by TeamSPAM · 2008-02-09 02:51 · Score: 1

2 out of the 3 you highlighted are XML related. XSLT is used to transform XML into something else or something more readable. While SVG is a graphic format, it is described in XML.

--
Brought to you by Team SPAM! where we believe: "Information in the noise!"
Re:Leave it to Slashdot... by msuarezalvarez · 2008-02-09 05:37 · Score: 1

(In case it isn't obvious, I'm talking out my butt. I really have no clue when it comes to DTDs except that most WYSIWYG web design programs paste that garbage in automagically.)
Have you considered that instead of adding disclaimers such as this, it would be quite more useful to be the signal, instead of the noise in /. threads?
Re:Leave it to Slashdot... by greengearbox · 2008-02-09 05:46 · Score: 1

And there's no point to lookup the DTD. Fetching DTDs is only useful for validating an unknown document.
No, DTDs are used for more than validation. Attribute defaults and entity declarations are two instances that come to mind. You may need those even if you don't plan on validating the instance.

Of course, this doesn't mean that HTML user agents should be fetching DTDs off the web repeatedly, for all the obvious reasons.
Re:Leave it to Slashdot... by irc.goatse.cx+troll · 2008-02-09 05:53 · Score: 1

It's more "Query this url if you don't know how to parse this Public Identifier".

W3C is just shocked that so much software would rather query the url than internally have a copy already.

--
Pain lasts, kid. Its how you know you're alive. Sometimes I think this growing up thing is just pain management-TheMaxx
Re:Leave it to Slashdot... by Anonymous Coward · 2008-02-09 05:59 · Score: 0

because a programmer failed to include some 'query everything except this' code.

If the code is doing anything remotely similar to "query everything," then the programmer is doing it wrong.

In case it isn't obvious, I'm talking out my butt. I really have no clue when it comes to DTDs except that most WYSIWYG web design programs paste that garbage in automagically.

Yet you posted anyway.
Re:Leave it to Slashdot... by coaxial · 2008-02-09 09:16 · Score: 1

DTDs are specification, and while machine readable is nice, it's not really needed for anything except validation. A napkin is good enough for listing defaults to an app, since they're going to have to be coded anyway. Yeah, you could use one as a type of static config file, but there's no reason why you have to. I'm not convinced that DTDs really get you anything in practice.
Re:Leave it to Slashdot... by greengearbox · 2008-02-10 06:05 · Score: 1

As a document designer/producer, that may be true. But as a consumer you may not be in a position to make that decision. At any rate, DTDs being part of the XML spec, a general purpose parser must be prepared to deal with DTDs, even outside the realm of validation.

Who designed this crazy system?! by Anonymous Coward · 2008-02-08 13:31 · Score: 1, Funny

Isn't this what you call "eating your own dogfood"?

Delay by erikina · 2008-02-08 13:31 · Score: 5, Interesting

Have they tried delaying the response by 5 or 6 seconds? It could cause a lot of applications to hang pretty badly. That or just serve a completely nonsensical schema every thousandth request. Gotta keep developers on their toes.

Re:Delay by bunratty · 2008-02-08 14:09 · Score: 3, Informative

RTFA. They returned the 503 Service Unavailable error to many abusers, and they just kept on with abusive requests. Many abusers aren't checking the response to the request at all.

--
What a fool believes, he sees, no wise man has the power to reason away.
Re:Delay by dotancohen · 2008-02-08 14:36 · Score: 4, Funny

You must be a Microsoft engineer.

--
It is dangerous to be right when the government is wrong.
Re:Delay by RhysU · 2008-02-08 14:41 · Score: 3, Informative

Good: Delivered a piece of code once that tested just fine for us, but blew up at the customer's site. We never realized that the new J2EE-like features were hitting a live URL during DTD parsing.

Better: Had a build system once that looked for a host and had to TCP timeout before the build could continue. Had to happen several hundred times a build cycle.

The Java libraries do this down in their innards unless you're very careful to avoid it.
Re:Delay by erikina · 2008-02-08 14:42 · Score: 1

Probably because a 503 Service Unavailable might not break the app, just skip the validation stage. You need to do something to degrade the usefulness of the application (cause it to hang or break). Also, an across-the-board 5 second wait will mean developers see the problem at development time, not only after the app has already been deployed, caused problems and been blocked.
Re:Delay by bwb · 2008-02-08 14:47 · Score: 5, Insightful

Sure, they're ignoring the response status, but I'll betcha most of them are doing synchronous requests. If I were solving this problem for W3C, I'd be delaying the abusers by 5 or 6 *minutes*. Maybe respond to the first request from a given IP/user agent with no or little delay, but each subsequent request within a certain timeframe incurs triple the previous delay, or the throughput gets progressively throttled-down until you're drooling it out at 150bps. That would render the really abusive applications immediately unusable, and with any luck, the hordes of angry customers would get the vendors to fix their broken software.
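
The escalating delay proposed above could be sketched like this. It is only an illustration of the schedule, not anything the W3C is known to run: the `Tarpit` class, the 1 second base, the 10 minute window, and the 6 minute cap are all assumed values. The first request from a given IP/user agent pair is served immediately, and each repeat inside the window triples the previous delay up to the cap.

```python
from typing import Dict, Tuple


class Tarpit:
    """Compute a per-client delay that triples on each repeat request."""

    def __init__(self, base: float = 1.0, factor: float = 3.0,
                 window: float = 600.0, cap: float = 360.0):
        self.base = base        # delay for the second request, in seconds
        self.factor = factor    # multiplier applied to each later request
        self.window = window    # quiet period after which a client is forgiven
        self.cap = cap          # upper bound on the delay (6 minutes here)
        # (ip, user_agent) -> (time of last request, delay it was given)
        self._state: Dict[Tuple[str, str], Tuple[float, float]] = {}

    def delay_for(self, ip: str, user_agent: str, now: float) -> float:
        key = (ip, user_agent)
        last = self._state.get(key)
        if last is None or now - last[0] > self.window:
            delay = 0.0                               # first request: no delay
        elif last[1] == 0.0:
            delay = self.base                         # second request
        else:
            delay = min(last[1] * self.factor, self.cap)
        self._state[key] = (now, delay)
        return delay


tarpit = Tarpit()
for t in range(8):
    print(t, tarpit.delay_for("192.0.2.1", "ExampleBot/1.0", now=float(t)))
```

The dictionary is the catch: keeping state for every client is exactly what a stateless static file server avoids, so a real deployment would need to bound or expire that table.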
Re:Delay by VGPowerlord · 2008-02-08 18:52 · Score: 1

Given that RFC2616 (written by the w3c) says "The implication is that this is a temporary condition which will be alleviated after some delay." about 503, is it any surprise that abuse continued?

--
GLaDOS for President 2016! "Well here we are again. It's always such a pleasure." -- GLaDOS, 2011
Re:Delay by xstonedogx · 2008-02-08 19:59 · Score: 1

RFC2616 also states:
10.5 Server Error 5xx

...the server SHOULD include an entity containing an explanation of the
error situation, and whether it is a temporary or permanent
condition. User agents SHOULD display any included entity to the
user...

10.5.1 500 Internal Server Error

The server encountered an unexpected condition which prevented it
from fulfilling the request.

10.5.4 503 Service Unavailable

...If known, the length of the delay MAY be indicated in a
Retry-After header. If no Retry-After is given, the client SHOULD
handle the response as it would for a 500 response...

SHOULD (RFC2119) in this context means:
...there
may exist valid reasons in particular circumstances to ignore a
particular item, but the full implications must be understood and
carefully weighed before choosing a different course.

Clients for which this does not apply are, by definition, those written by authors who have 'understood and carefully weighed' the implications. An 'abusive' client probably does not fall under this definition. So, any abusive client should be waiting for the 'Retry-After' (assuming the header exists) and displaying the contents of the message to the user. Either method should have a dramatic impact on the number of abusive clients hitting the server in a relatively short period of time.
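
For what a well-behaved client might do with that 503, here is a minimal sketch. The function name `retry_delay` is made up, and the header mapping is assumed to use the exact key "Retry-After"; a real HTTP library would normally give case-insensitive access. It accepts both forms of the header value, a number of seconds or an HTTP date, and returns None when there is nothing to wait for, in which case the response is handled like a 500.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def retry_delay(status: int, headers: Mapping[str, str],
                now: datetime) -> Optional[float]:
    """Seconds to wait before retrying a 503, or None to treat it as a 500."""
    if status != 503:
        return None
    value = headers.get("Retry-After")
    if value is None:
        return None                      # no Retry-After: handle like a 500
    value = value.strip()
    if value.isdigit():
        return float(value)              # delta-seconds form
    try:
        when = parsedate_to_datetime(value)   # HTTP-date form
    except (TypeError, ValueError):
        return None                      # unparseable value: do not guess
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    return max(0.0, (when - now).total_seconds())


now = datetime(2008, 2, 8, 13, 0, 0, tzinfo=timezone.utc)
print(retry_delay(503, {"Retry-After": "120"}, now))
print(retry_delay(503, {"Retry-After": "Fri, 08 Feb 2008 13:10:00 GMT"}, now))
```

A client that never looks at the status code will of course never reach this function, which is the situation described in the article.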
Re:Delay by ArsenneLupin · 2008-02-08 21:29 · Score: 1

Many abusers aren't checking the response to the request at all.
Maybe they aren't checking the contents of the response, but I wanna bet that they surely wait for the response to come in. So, putting in a 6 second delay will be noticed, even if a 503, or complete garbage, won't.
Re:Delay by BZ · 2008-02-09 05:15 · Score: 1

> Given that RFC2616 (written by the w3c)

I'm sorry to break this to you, but anything called an "RFC" is not in fact written by the W3C. RFCs are an IETF thing.
Re:Delay by VGPowerlord · 2008-02-09 06:00 · Score: 1

If you look at the top of RFC2616, you might notice that W3C is mentioned several times.

--
GLaDOS for President 2016! "Well here we are again. It's always such a pleasure." -- GLaDOS, 2011
Re:Delay by BZ · 2008-02-09 07:27 · Score: 1

Yes. Certain persons affiliated with the W3C were members of the working group that created this RFC. So were others who are not affiliated with W3C directly (though they represent companies that might also participate in the W3C).

That doesn't change the fact that the W3C is not the standards organization responsible for RFC 2616: they don't control its status, and they're not the ones who voted to actually accept it as a standard.
Re:Delay by De+Lemming · 2008-02-09 13:31 · Score: 1

That's a funny idea, but changing the handling of those 130 million requests per day from stateless to an application that keeps a session state per user, is going to take a hell of a lot more resources.
Re:Delay by ger · 2008-02-10 18:23 · Score: 1

I think this is an excellent idea, thanks.

We considered tarpitting before, I think we were always scared off by the prospect of having to keep tens of thousands of connections open.

Does anyone have specific software to recommend that is able to keep that many connections open on a typical cheap Linux box? (Lighttpd? Nginx? Varnish? Yaws?)

The implementation I'm thinking might work well is:

Switch www.w3.org to use some lightweight server software that is able to keep lots of connections open, and configure it to serve DTD files with an artificial 5 second delay. Proxy all the other requests to our existing Apache server running elsewhere (possibly on another port on the same system)

Most people shouldn't notice or care about the delay for DTD files, only the apps that are requesting them hundreds or thousands of times in a row will notice.

W3C's current traffic is something like:

- 66% DTD/schema files (.dtd/ent/mod/xsd)
- 25% valid HTML/CSS/WAI icons
- 9% other

So we'd probably want to configure the lightweight server to serve those icons too (but then it would have to do conneg as well)

MIT needs a CDN! by rekoil · 2008-02-08 13:35 · Score: 2, Interesting

I'm surprised none of the CDNs out there have volunteered to host this file - the problem is they'd have to host the entire w3.org site, or else move the rest of it to another hostname.

That's what you get for making stupid rules. by v(*_*)vvvv · 2008-02-08 13:35 · Score: 1, Interesting

They insist that every document begin with a declaration that includes a link to their site. Now they are complaining about traffic.

The link in the declaration serves absolutely NO purpose other than to comply with the standard that they created. It sounds like the whole purpose was so that they could have every source page begin with a link to their site. Serves them right.

Re:That's what you get for making stupid rules. by colinrichardday · 2008-02-08 13:45 · Score: 1

You don't need a DTD, nor do you need to link it to W3C.
Re:That's what you get for making stupid rules. by Bogtha · 2008-02-08 13:55 · Score: 5, Informative

They insist that every document begin with a declaration that includes a link to their site.

It's not a link. It's a reference to an external DTD subset. It's there so that generic SGML software can properly parse the document without any special knowledge of HTML.

The link in the declaration serves absolutely NO purpose other than to comply with the standard that they created. It sounds like the whole purpose was so that they could have every source page begin with a link to their site. Serves them right.

No, external DTD subsets are a part of SGML, which is at least a decade older than the W3C.

--
Bogtha Bogtha Bogtha
Re:That's what you get for making stupid rules. by reddburn · 2008-02-08 15:08 · Score: 1

From the W3C specifications for XHTML documents:

3.1.1 - Strictly Conforming Documents ...There must be a DOCTYPE declaration in the document prior to the root element. The public identifier included in the DOCTYPE declaration must reference one of the three DTDs found in DTDs using the respective Formal Public Identifier. The system identifier may be changed to reflect local system conventions... An XML declaration is not required in all XML documents; however XHTML document authors are strongly encouraged to use XML declarations in all their documents.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

This is the DTD they require. Because the DTD is not declared inline in an XHTML document, it must contain the external reference (the second part - the link to "w3.org/TR/xhtml1/DTD/xhtml1-XXXX.dtd") to the W3C's DTD - which is, presumably, what they're bitching about.

--
"Those who believe in telekinetics, raise my hand" - Kurt Vonnegut, Jr.
Re:That's what you get for making stupid rules. by Bogtha · 2008-02-08 15:13 · Score: 1

Did you mean to reply to my comment? I can't see the connection between what I said and what you are saying.

--
Bogtha Bogtha Bogtha
Re:That's what you get for making stupid rules. by lwsimon · 2008-02-08 15:17 · Score: 1

For a website? I completely disagree. If you want to do anything at all on more than one browser, you need to have a DTD. It is the switch that turns the browser from "Quirks mode" to standards compliance (or as close as some browsers *COUGH* IE *COUGH* get.)

--
Learn about Photography Basics.
Re:That's what you get for making stupid rules. by Bogtha · 2008-02-08 15:48 · Score: 1

There are four sentences in my comment. One saying it's not a link. One saying what it is. One saying what it's for. One saying where it came from. I fail to see how the comment relates to any of these, let alone proves them wrong. If you disagree, please explain why instead of ranting about your grudge with the W3C. It's not hard to do, my comment wasn't long or complicated.

--
Bogtha Bogtha Bogtha
Re:That's what you get for making stupid rules. by anarxia · 2008-02-08 17:10 · Score: 1

Properly configured xml systems have common schemas in their catalogs so they never fetch those dtd from remote sites. If they don't have a dtd they would only need to fetch it once.
To summarize: Doctype = good, misconfigured or stupid xml systems = bad
Re:That's what you get for making stupid rules. by somethinghollow · 2008-02-08 17:58 · Score: 4, Insightful

It was the W3C that decided to make HTML a subset of SGML. They could have done what HTML 5 is doing by creating a "serialization" that doesn't care about the DTD (HTML5 doesn't call for one in the DOCTYPE). As it is, theoretically, I can write my own DTD (a modification of the HTML4 DTD, for example) that adds new elements. Technically, the SGML parser should know and understand those DTDs. To do such, it must download the DTD. Browsers are supposed to be handling SGML docs, but chose to implement against the W3C recommendations instead of caring about the DTDs; the popular ones don't download the DTD at all... they don't even care if it exists or not, which is why the short HTML5 DOCTYPE works as a quirksmode switch but is still valid HTML and renders like HTML should.

Should SGML renderers cache it? Yes. Should W3C bitch that some SGML renderers are downloading their DTD? No. They should have thought about that before they made HTML a subset of SGML. I don't feel sorry for them.
Re:That's what you get for making stupid rules. by HeroreV · 2008-02-08 20:07 · Score: 1

I know three web programmers that were fired because of the W3C zealots. You sound like you were one of them. Don't be so bitter.
Re:That's what you get for making stupid rules. by vidarh · 2008-02-08 23:12 · Score: 3, Insightful

If you'd actually bothered reading what you quoted you might've noticed the sentence "The system identifier may be changed to reflect local system conventions". Only the public identifier is required to be one of the strings provided. The system identifier (http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd) can point wherever you want it to. But well behaved clients are expected to use a catalog anyway.
Re:That's what you get for making stupid rules. by cyberon22 · 2008-02-08 23:27 · Score: 0

AMEN.

I'm still pissed off about needing to slap a DOCTYPE on a textfile. ASCII is a standard too.
Re:That's what you get for making stupid rules. by msuarezalvarez · 2008-02-09 05:49 · Score: 1

You seem to be under the impression that you can send ASCII-encoded text without an encoding declaration. That's as wrong as your apparent belief that omitting the doctype can result in an unambiguous semantic for mark up.
Re:That's what you get for making stupid rules. by DavidTC · 2008-02-09 06:12 · Score: 1

You know three people who were fired for refusing to put the DOCTYPE correctly at the top of the document?
...well, yeah, I'd probably fire them too if they were deliberately writing code in quirks mode, but that's a web browser issue, not the fault of the w3c.

--
If corporations are people, aren't stockholders guilty of slavery?
Re:That's what you get for making stupid rules. by jtheisen · 2008-02-09 11:31 · Score: 1

It's not a link. It's a reference to an external DTD subset. It's there so that generic SGML software can properly parse the document without any special knowledge of HTML.
They can't, as HTML isn't SGML. See here.
It's just close enough to satisfy validators and browsers at the same time (though they interpret it differently). And the guy you quoted is right in that as matters are in practise, the doctype is irrelevant. That's why the html 5 working draft recommends it to be empty.
Re:That's what you get for making stupid rules. by cyberon22 · 2008-02-09 16:20 · Score: 1

I'm complaining that you often CANNOT send ASCII content without including a DOCTYPE or encoding specification. Like it or not, when you have to deal with integrating different systems with multiple encoding, data-storage and serving methods, the ability to kick back to a format/encoding agnostic method of data transfer is useful.
Re:That's what you get for making stupid rules. by i_liek_turtles · 2008-02-09 17:25 · Score: 0

I can write my own DTD (a modification of the HTML4 DTD, for example) that adds new elements. DTD write
Re:That's what you get for making stupid rules. by i_liek_turtles · 2008-02-09 17:30 · Score: 0

Slashdot wouldn't let me use my new tags,
<goatse>, <sovietrussia>, and <overlord>
Re:That's what you get for making stupid rules. by msuarezalvarez · 2008-02-09 22:42 · Score: 1

There is no such thing as a format/encoding agnostic method of data transfer. Any data transfer requires an agreement between the sender and the receiver on the format and the encoding: you cannot send `data': the only thing you can send is formatted and encoded data.
Re:That's what you get for making stupid rules. by cyberon22 · 2008-02-09 23:03 · Score: 1

I thought ASCII was a standard. Try feeding a data-file into MySQL that contains both GB2312 and UTF8-formatted data sometime and let me know what DOCTYPE you'd use passing the file over a network.
Re:That's what you get for making stupid rules. by msuarezalvarez · 2008-02-10 13:08 · Score: 1

I thought ASCII was a standard.
It is a standard. As there are others, you need to have the sending party and the receiving party to agree which one to use. That is what an encoding declaration does for you.

Try feeding a data-file into MySQL that contains both GB2312 and UTF8-formatted data sometime and let me know what DOCTYPE you'd use passing the file over a network.
I am sorry, but I cannot make sense of this.
Re:That's what you get for making stupid rules. by reddburn · 2008-02-12 15:21 · Score: 1

I only meant to provide a reference for the rule at most in question. How many people do you think actually went to the trouble of looking it up before blathering?

--
"Those who believe in telekinetics, raise my hand" - Kurt Vonnegut, Jr.

caching by TheSHAD0W · 2008-02-08 13:36 · Score: 0, Redundant

Add some sort of caching parameter to the DTD spec, that specifies how long browsers should cache those DTDs.

Another potential solution: Have browsers keep the DTDs cached, and then check the file date periodically when re-requested. This will still put some load on the w3c's servers, but significantly less than complete re-downloads.

Re:caching by corsec67 · 2008-02-08 13:46 · Score: 1

W3C already says how long the DTD should be cached for: 90 days, using the Cache-Control HTTP header, which is set to "max-age=7776000" (seconds).
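
To make the arithmetic concrete, here is a minimal Java sketch that pulls max-age out of a Cache-Control value and converts it to days. The class name MaxAge and its parse method are hypothetical helpers for illustration, not part of any library.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: extracts the max-age directive from a Cache-Control header value. */
final class MaxAge {
    // Matches "max-age=<digits>" at the start of the value or after a comma or whitespace.
    private static final Pattern MAX_AGE =
            Pattern.compile("(?:^|[,\\s])max-age\\s*=\\s*(\\d+)", Pattern.CASE_INSENSITIVE);

    /** Returns the advertised freshness lifetime, or empty if there is no max-age directive. */
    static Optional<Duration> parse(String cacheControl) {
        Matcher m = MAX_AGE.matcher(cacheControl);
        return m.find()
                ? Optional.of(Duration.ofSeconds(Long.parseLong(m.group(1))))
                : Optional.empty();
    }
}
```

Calling MaxAge.parse("max-age=7776000").get().toDays() gives 90, since 7776000 / 86400 = 90.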

--
If I have nothing to hide, don't search me
Re:caching by Bogtha · 2008-02-08 13:59 · Score: 2, Insightful

Add some sort of caching parameter to the DTD spec, that specifies how long browsers should cache those DTDs.

You're solving that problem at the wrong layer. HTTP already includes caching mechanisms, the W3C already use them, and part of the problem is that buggy software is ignoring them.

Another potential solution: Have browsers keep the DTDs cached

Please read the article. This is already supposed to happen. Buggy software fails to do this, which is the problem being talked about.

--
Bogtha Bogtha Bogtha
Re:caching by Anonymous Coward · 2008-02-08 16:15 · Score: 0

>Please read the article.

You must be new here.
Re:caching by netsharc · 2008-02-08 22:09 · Score: 1

I agree, GP, and a lot of other people, please r!t!F!a! I can't believe how many idiots blame the browsers and w3c themselves for the idiocy of some "software developers".

Geez, reading the comments, I can see the internet is getting dumber every minute (thanks to e.g. Digg), and that effect is spreading into Slashdot as well.

--
What time is it/will be over there? Check with my iPhone app!
Re:caching by MikeBabcock · 2008-02-09 16:16 · Score: 1

For the record, the file is set to expire in about 12 weeks from the first time you fetch it but as it supplies both Last-Modified and an ETag, there's no reason to fetch a new copy (it hasn't changed in 424 weeks).

Data available via cacheability checker.
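
A rough Java sketch of what a well-behaved client could do with those headers: reuse the copy while it is fresh, and once it is stale send a conditional request rather than a blind re-download. CachedDtd is a hypothetical class, and the ETag and date strings used with it below are placeholders, not the real values served by w3.org.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical cache entry for one downloaded DTD. */
final class CachedDtd {
    private final Instant fetchedAt;
    private final Duration maxAge;
    private final String etag;         // raw ETag header value, may be null
    private final String lastModified; // raw Last-Modified header value, may be null

    CachedDtd(Instant fetchedAt, Duration maxAge, String etag, String lastModified) {
        this.fetchedAt = fetchedAt;
        this.maxAge = maxAge;
        this.etag = etag;
        this.lastModified = lastModified;
    }

    /** True while the copy can be reused without contacting the server at all. */
    boolean isFresh(Instant now) {
        return now.isBefore(fetchedAt.plus(maxAge));
    }

    /** Headers for a conditional GET once the entry is stale; a 304 reply means "keep your copy". */
    Map<String, String> revalidationHeaders() {
        Map<String, String> headers = new LinkedHashMap<>();
        if (etag != null) {
            headers.put("If-None-Match", etag);
        }
        if (lastModified != null) {
            headers.put("If-Modified-Since", lastModified);
        }
        return headers;
    }
}
```

With a 90-day max-age, isFresh stays true through day 89. From day 90 on, the client would send the revalidation headers and, for a file that has not changed, expect a bodiless 304 back.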

--
- Michael T. Babcock (Yes, I blog)

Simple solution by mcrbids · 2008-02-08 13:36 · Score: 5, Funny

The answer to this problem is quite easy.

Continue to host the data referenced on a single T-1 line. That will cut your expenses to the bone since you'll never exceed 1.54 Mbps and that should be quite cheap. And, any dumfuxorz who fubarred their parser to not cache these basically static values will probably figure it out... very quickly.

You don't have to leave it on the T-1, maybe just 1 month out of the year. Every year.

Problem solved!

--
I have no problem with your religion until you decide it's reason to deprive others of the truth.

WARNING: GNAA by SirBudgington · 2008-02-08 13:36 · Score: 2, Funny

Don't click the link, it's malware.

--
this is my sig

Poetic justice by shark+swooner · 2008-02-08 13:38 · Score: 0, Redundant

Serves them right for forcing us to include the same long urls that point to files that never change in every single HTML file ever.

had this problem with hibernates website... by rgrbrny · 2008-02-08 13:39 · Score: 3, Interesting

the doctype was being used during a xsl transform during our build process; when the hibernate site flaked out, the builds would fail intermittently.

solution was to add a xmlcatalog using a local resource.

bet this happens a lot more than most people realize; we'd been doing this for years before we noticed a problem.

Caching DTDs locally by NetSettler · 2008-02-08 13:40 · Score: 1

Another potential solution: Have browsers keep the DTDs cached, ...

Or the routers. Frankly, if the result is known to not change, w3 could probably agree with the network authorities to put copies around the net and treat those heavily used URIs as URNs and just never go to w3 (or rarely go there) instead.

The notion that URNs have to be known in advance as "the popular thing" rather than being discovered after-the-fact by noticing high-volume URIs is probably the real bug here.

--

Kent M Pitman
Philosopher, Technologist, Writer

Re:Caching DTDs locally by pyite · 2008-02-08 15:20 · Score: 1

Or the routers.

Yea, so um, routers don't know what XML is, let alone a DTD. Their purpose is to move traffic around really fast, and they manage to do that by having custom hardware do their work instead of generic CPUs/code. Unless you'd like to force everyone to use a proxy server, your idea is not feasible.

--
"Nature doesn't care how smart you are. You can still be wrong." - Richard Feynman
Re:Caching DTDs locally by NetSettler · 2008-02-08 16:08 · Score: 1

routers don't know what XML is, let alone a DTD.

It doesn't have to understand XML to redirect a connection request for a URI.

I vaguely recall that when the web first came online, some routers/gateways were doing a lot of this, and that web publishers complained because it was stealing potential ad revenue or obstructing the ability of a host to change what it was publishing moment to moment... and, absent permission, it was probably a copyright violation. Anyone reading along who can tell me if I'm misremembering? (I wasn't sure of a good search keyword to look up the history of this.)

Maybe such caching wasn't a bad idea--maybe it was just bad to do it without asking.

--
Kent M Pitman
Philosopher, Technologist, Writer
Re:Caching DTDs locally by Anonymous Coward · 2008-02-08 16:36 · Score: 0

>It doesn't have to understand XML to redirect a connection request for a URI.

No, but it does have to be capable of doing something that routers aren't - handling application layer data, which is what a URI is.

There are devices that can do this now: http://en.wikipedia.org/wiki/Multilayer_switch

But they aren't routers, strictly speaking.

>Anyone reading along who can tell me if I'm misremembering?

It sounds as though you're thinking of a caching HTTP proxy server, which isn't a router.
Re:Caching DTDs locally by jtev · 2008-02-08 16:39 · Score: 1

Backbone routers don't know what http is, or even TCP. Well, at least, not as far as routing is concerned. They simply move packets from one network segment to another. IP was designed as a specifically stupid protocol. The infrastructure neither knows nor cares what any packet it is moving contains; it simply reproduces it on the appropriate network segment, or drops it, if it is unable to reproduce it in the proper amount of time.

--
That which is done from love exists beyond good and evil
Re:Caching DTDs locally by NetSettler · 2008-02-08 16:54 · Score: 1

There are devices that can do this now ... But they aren't routers, strictly speaking.

Thanks for the clarification. It's been a while since I read about the details of the network protocol stack, so that was a bit blurry in my memory. I had the layers mixed up, and the Wikipedia article was helpful.

It sounds as though you're thinking of a caching HTTP proxy server, which isn't a router.

Indeed. Again, thanks for the correction.

--
Kent M Pitman
Philosopher, Technologist, Writer

'Web Community'? by radimvice · 2008-02-08 13:41 · Score: 1

A plea to the web community to stop pinging the W3C DTDs isn't going to solve anything. What will work is blocking any unnecessary DTD traffic aggressively, and if that doesn't do the job, blocking it even more aggressively. Intelligently designed software / ISPs / routers will cache, filter and block these requests for the sake of their own efficiency, bandwidth, and proper function. Buggy, bloated and inefficient applications won't. Nothing's ever going to convince the 'web community' to stop pinging the DTDs out of an altruistic concern for W3C's servers; it will need to become beneficial for those software developers to devote the extra development/debugging/patching effort to do so.

Re:'Web Community'? by mollymoo · 2008-02-09 05:02 · Score: 1

If you read TFA you'd know that what you claim will work is exactly what they are already doing and it hasn't worked.

--
Chernobyl 'not a wildlife haven' - BBC News

Umm, no. by pavon · 2008-02-08 13:43 · Score: 5, Informative

That is supposed to be there according to the standard. And all the major browsers cache that file after loading it (at most) once, and then never read it again. So no, slashdot is not causing a problem. The problem is all the other HTML processing software besides browsers that does not cache its DTD files, not the files that contain the reference.

If you want to complain, it should be the fact that slashdot is serving a strict.dtd when it doesn't validate against it.

Re:Umm, no. by Skapare · 2008-02-08 14:28 · Score: 1

It's the whole design of HTML/XML, that needs to have DTD files in the first place to do the processing, that is all wrong. I warned about this well over 12 years ago. At least what little code I've written to process HTML/XML has always entirely ignored the DTD.

--
now we need to go OSS in diesel cars
Re:Umm, no. by Anonymous Coward · 2008-02-08 15:37 · Score: 1, Funny

It's the whole design of HTML/XML, that needs to have DTD files in the first place to do the processing

At least what little code I've written to process HTML/XML has always entirely ignored the DTD.

Either you have the super-human power to defy the laws of logic, or the word "need" does not mean what you think it means.
Re:Umm, no. by MillionthMonkey · 2008-02-08 16:37 · Score: 5, Interesting

At least what little code I've written to process HTML/XML has always entirely ignored the DTD.

Don't be so sure- even if your own code ignores it. Unless you're dealing with it on a raw character level, with most XML libraries and frameworks it can be quite tricky to prevent DTDs from being resolved behind your back.

I wrote some Java code a while back to parse some XML files that were downloaded from NCBI. Typical for NCBI data, this involved wading through terabytes of crap, and anything based on DOM wasn't going to work- so I used the lower level event-based SAX library in JAXP. The files did have DTD declarations in them pointing to NCBI, which I wanted to ignore, since this was a one-time data mining operation. I just examined some sample files, figured out pseudo-XPath expressions for what I wanted to pull out, set up a simple state machine to stumble through the SAX events, and not caring about the DTD, cleared the namespace-aware and validating flags on the SAXParserFactory. So I ended up with this:

File xmlgz = new File("ncbi_diarrhea.xml.gz");
DefaultHandler myHandler = new MyNCBIStateMachineHandler();
GZIPInputStream gzos = new GZIPInputStream(new FileInputStream(xmlgz));
SAXParserFactory spf = SAXParserFactory.newInstance();
spf.setValidating(false);
spf.setNamespaceAware(false);
SAXParser sp = spf.newSAXParser();
InputSource input = new InputSource(gzos);
sp.parse(input, myHandler);

This ran fine, until it mysteriously froze up 18 hours into the run. It turned out to be caused by our switch to a different ISP, during which time the building lost its outside network access. The thread picked up the next file and immediately got blocked in the SAX library, trying to resolve the NCBI DTD.

This is how I fixed it:

spf.setFeature("http://xml.org/sax/features/external-general-entities", false);
spf.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
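
For anyone who wants to try this offline, here is a self-contained sketch of the same idea. The class name OfflineSaxParse and the example.invalid URL are made up for illustration. Whether the two SAX features alone keep the external DTD subset from being fetched seems to depend on the parser implementation, so the sketch also sets the Xerces-specific load-external-dtd feature; that third line is an addition, not part of the fix described above.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical demo: SAX parsing that should not go to the network for a DTD. */
final class OfflineSaxParse {
    /** Parses xml with external entity loading switched off and returns the root element name. */
    static String rootElementName(String xml) {
        try {
            SAXParserFactory spf = SAXParserFactory.newInstance();
            spf.setValidating(false);
            spf.setNamespaceAware(false);
            spf.setFeature("http://xml.org/sax/features/external-general-entities", false);
            spf.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
            // Assumption: a Xerces-derived parser (such as the one bundled with the JDK).
            spf.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
            SAXParser sp = spf.newSAXParser();
            final String[] root = new String[1];
            sp.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName, String qName, Attributes attributes) {
                    if (root[0] == null) {
                        root[0] = qName; // remember only the first element seen
                    }
                }
            });
            return root[0];
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }
}
```

The trade-off is that entities declared only in the external DTD are no longer available, which is fine for a one-off data mining job but not for documents that rely on them.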

Now I'm sure someone is going to come on here calling me a noob for not knowing to use an XMLReaderFactory (or whatever XML API class isn't obsolete this week) and setting a custom EntityResolver that can provide my local copy of the NCBI DTD when presented with its URI, but why should I even have to bother with that? XML pretends to be simple but it's seriously messed up.
Re:Umm, no. by steelfood · 2008-02-08 18:02 · Score: 1

That is supposed to be there according to the standard.

Slashdot isn't at fault, but don't you find it just a little idiotic to specify in the standard the need to validate against one central description and then complain about it happening too much? Maybe instead, the schema as specified in the standard should have been validated against a locally stored file with the correct hash.

It's one of those little things about XML that caught me as being retarded.

--
"If a nation expects to be ignorant and free in a state of civilization, it expects what never was and never will be."
Re:Umm, no. by Anonymous Coward · 2008-02-08 18:12 · Score: 0

That's pretty funny.
Re:Umm, no. by dynamo52 · 2008-02-08 20:00 · Score: 1

I would say that defying the laws of logic is an all-too-average human power

--
Like this comment? I accept Bitcoin! - 153sc8UUBXyp12ofQqfAWDmJrzyiKCYC1x
Re:Umm, no. by ubersonic · 2008-02-08 21:12 · Score: 1

Funny thing is, it's their software that does it as well. I regularly validate 100000+ sites a month using a copy of their w3 validator on one of my servers. Around half a year ago, their servers crawled to a halt - don't know why, but for around 3 days you could hardly reach w3.org, and my server would respond slowly too. End of story, I had to hack the w3 software to make sure it's not fetching the damn validation documents over and over.

--

-- ubersonic Kfz Versicherung
Re:Umm, no. by msuarezalvarez · 2008-02-09 05:41 · Score: 1

don't you find it just a little idiotic to specify in the standard the need to validate against one central description and then complain about it happening too much?
The standard does not say such a thing, at all.

Maybe instead, the schema as specified in the standard should have been validated against a locally stored file with the correct hash.
This is precisely what the standards suggest. Instead of `correct hash' they use the terms `public identifier' and `system identifier'. There is even a whole standard just for the purpose of managing the mapping.

It's one of those little things about XML that caught me as being retarded.
The XML standard in absolutely no way requires this, nor does it even hint that it would be a marginally acceptable possibility.

Your opinions on XML seem to be quite unsubstantial...
Re:Umm, no. by Anonymous Coward · 2008-02-10 11:43 · Score: 0

That problem with the validator precisely illustrates the point of the article. The validator started using an XML library which does not do the right thing and cache the entity files it's retrieving. The validator team found out about this issue and worked around the XML library bug in less than two weeks.

Simple by Citizen+of+Earth · 2008-02-08 13:44 · Score: 1

I can't think of a problem that is simpler to solve. Just stop serving these documents. The offending programs will be fixed very quickly.

Re:Simple by metamatic · 2008-02-08 16:15 · Score: 1

Even better, serve up incorrect versions and make the misbehaving software choke on valid files.

--
GCHQ Quantum Insert installed. If only our tongues were made of glass, how much more careful we would be when we speak

Irony by davburns · 2008-02-08 13:44 · Score: 4, Funny

So, w3c complains about their bandwidth, and the response is: The Slashdot Effect. Doesn't that make the old bandwidth problem seem less of a problem?

I'm just loving the irony in that.

Re:Irony by ion.simon.c · 2008-02-08 16:12 · Score: 2, Informative

See this comment. /. is NOTHING compared to the traffic generated by DTD requests.

http://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic#c1821
Re:Irony by prockcore · 2008-02-08 16:31 · Score: 1

Those DTDs are getting 130 million requests per day. 50,000 slashdotters are a drop in the bucket.
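
A quick back-of-the-envelope check in Java, using only the two figures quoted above. RequestRate is a hypothetical helper, and treating each of the 50,000 readers as a single hit is an assumption.

```java
/** Hypothetical helper for putting the quoted traffic figures side by side. */
final class RequestRate {
    private static final double SECONDS_PER_DAY = 86_400.0;

    /** Average requests per second for a given daily total. */
    static double perSecond(long requestsPerDay) {
        return requestsPerDay / SECONDS_PER_DAY;
    }

    /** What percentage of the daily total a burst of extra hits represents. */
    static double percentOfDaily(long extraHits, long requestsPerDay) {
        return 100.0 * extraHits / requestsPerDay;
    }
}
```

RequestRate.perSecond(130_000_000L) comes out at roughly 1,500 requests per second, and RequestRate.percentOfDaily(50_000L, 130_000_000L) is under 0.04 percent of a day's DTD traffic.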

Such an easy solution by mwasham · 2008-02-08 13:49 · Score: 5, Funny

And it is only 4 articles down.. Host with Yahoo! Yahoo Offers All-You-Can-Eat Storage and Bandwidth http://hardware.slashdot.org/article.pl?sid=08/02/08/1811236

--
Dallas Real Estate

Re:Such an easy solution by shawn(at)fsu · 2008-02-08 15:00 · Score: 1

I lol'ed. Kudos you're a good problem fixer.

--
500 dollar reward for tip(s) leading to the arrest of the person(s) who stole my sig.

They already do. by pavon · 2008-02-08 13:51 · Score: 4, Informative

The spec already recommends this and all the major browsers do it. The software causing the problem is the generic XML/SGML processing packages, which were designed to be able to deal with documents with any random DTD, not just the main HTML/XHTML ones from W3C. They are the ones downloading each DTD every single time and not caching it, contrary to the standard. Sometimes caching is a configuration option which defaults to off and administrators never turn it on.

Re:They already do. by _xeno_ · 2008-02-08 15:52 · Score: 3, Insightful

The problem is that several major XML libraries don't just default to no DTD/schema cache - they don't even implement a cache or local catalog. Implementing such a thing is left to the developers using the library.

For example, the XML libraries that come with Sun's Java rely on java.net.URL for downloading resources. I just checked my 1.6 Java install, and by default, it has no cache. In looking up how the java.net cache works, I discovered it wasn't even added until Java 1.5. So prior to Java 1.5, most Java libraries wouldn't cache responses at all because the included library didn't support caching. 'Course, even in Java 1.6, there's no default implementation, so each Java application would have to implement their own cache[1].

The included Java libraries also offer no internal DTD/schema catalog. You can create one (implement org.xml.sax.EntityResolver[2]) but by default they're off to the Internet to download any DTD they run across.

It's really not hard to see how these libraries could result in millions of hits a day - most people using them probably don't even realize that they're hitting the W3C's servers since it happens transparently. And fixing it is unfortunately not just setting configuration files and saving the DTDs locally: it's implementing a bunch of classes.

[1] And for added fun, the stub that is provided appears to be insufficient to support conditional requests - either the cache says "I have it!" and the cached response is used, or the server has to send a new copy. There's no way to offer up an "If-Modified-Since:" request via the cache class.

[2] Noting that this can't be set for all parsers, it's set on a per-parser object basis. So if you use a third-party library that parses XML after creating its own parser object, you can't make it use your local DTD catalog.
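
As a rough idea of what "implementing a bunch of classes" amounts to, here is a minimal sketch of a local catalog built on org.xml.sax.EntityResolver. The class LocalDtdCatalog, the in-memory map and the "-//EXAMPLE//..." public identifier are all made up for illustration; a real catalog would more likely map identifiers to files shipped with the application.

```java
import java.io.StringReader;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

/** Hypothetical in-memory DTD catalog keyed by public identifier. */
final class LocalDtdCatalog implements EntityResolver {
    private final Map<String, String> dtdByPublicId;

    LocalDtdCatalog(Map<String, String> dtdByPublicId) {
        this.dtdByPublicId = dtdByPublicId;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        String dtd = publicId == null ? null : dtdByPublicId.get(publicId);
        if (dtd == null) {
            return null; // unknown identifier: the parser falls back to its default lookup
        }
        InputSource source = new InputSource(new StringReader(dtd));
        source.setPublicId(publicId);
        source.setSystemId(systemId);
        return source;
    }

    /** Parses xml with this catalog installed and returns the root element's text. */
    String rootText(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setEntityResolver(this); // per-parser setting, as footnote [2] points out
            return builder.parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }
}
```

With the DTD text registered under its public identifier, a document whose DOCTYPE names an unreachable URL still parses, and entities declared in the local copy are expanded. Returning null for unknown identifiers keeps the default behaviour; a stricter catalog could throw instead, so that nothing is ever fetched silently.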

--
You are in a maze of twisty little relative jumps, all alike.
Re:They already do. by owlstead · 2008-02-09 01:59 · Score: 1

In my opinion, the whole idea that you would download the thing is idiotic. I mean, somebody - probably unauthenticated at that time - sends me a text file, and my HTML/XML implementation is downloading files from the URLs in the text file? It could point to a huge XML file and my implementation would happily start to download the thing. It could even point to some child porn for that matter. It would be much better if the default were to look up a file in the filesystem (cached on disk) for the specific URI (a hashtable comes to mind).

I've even seen this problem with XML signature in the Java libraries. The default implementations will even be stupid enough to download the XML schema from the internet! And then the problem becomes that you cannot specify which algorithms etc. the file should adhere to. These are all things you, as a developer, should think about, according to the builders of the libraries.

As long as the libraries and documentation don't take care of these major issues, don't expect the developers to think about them, that's all I'm saying. Anyway, why would we rewrite those handlers over and over? That's the idea of libraries, isn't it?
Re:They already do. by gbjbaanb · 2008-02-09 02:03 · Score: 1

It's not just Java (pah, spit, pah), it's .NET as well: check out this article, which describes the issue far better than I can:

best quote from that article: If I set "ValidationType = ValidationType.None" it STILL downloads the DTD even though it doesn't validate against it. I get an XmlException when I set "ProhibitDtd = true"

The answer, of course, is to write your own resolver.
Re:They already do. by msuarezalvarez · 2008-02-09 06:03 · Score: 1

You simply cannot correctly parse a generic XML file using an arbitrary DTD without access to the DTD, since the file may very well contain references to entities defined in the DTD. This is not an `issue'. A parser which tries to do it is simply a broken parser.
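
The simplest case of this is easy to reproduce in Java. The sketch below (EntityNeedsDeclaration is a hypothetical class name) parses the same element twice: once with the entity declared in an internal DTD subset and once with no declaration at all. It uses an internal subset only to stay offline; an entity declared in an external DTD raises the same question of where the declaration comes from.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical demo: an entity reference is only usable if the parser has seen its declaration. */
final class EntityNeedsDeclaration {
    /** Returns the root element's text, or null if the parser rejects the document. */
    static String rootTextOrNull(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler()); // rethrows fatal errors without printing them
            return builder.parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement().getTextContent();
        } catch (SAXException e) {
            return null; // e.g. a reference to an entity that was never declared
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With the declaration present the reference expands to its replacement text; with no DTD at all the parser reports a fatal error instead of guessing.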
Re:They already do. by Hal_Porter · 2008-02-09 07:24 · Score: 1

For example, the XML libraries that come with Sun's Java rely on java.net.URL for downloading resources. I just checked my 1.6 Java install, and by default, it has no cache. In looking up how the java.net cache works, I discovered it wasn't even added until Java 1.5. So prior to Java 1.5, most Java libraries wouldn't cache responses at all because the included library didn't support caching. 'Course, even in Java 1.6, there's no default implementation, so each Java application would have to implement their own cache[1].

Wow, I always thought that Java and XML were designed and implemented by people who don't have a clue about performance, but it never occurred to me that they would be as dumb as this.

--
echo -e 'global _start\n _start:\n mov eax, 2\n int 80h\n jmp _start' > a.asm; nasm a.asm -f elf; ld a.o -o a;

Submitted this to /.? by dotancohen · 2008-02-08 13:52 · Score: 5, Funny

Great, they cry "we get too much traffic", so we go ahead and slap them on the front page of slashdot. Sick, sick fucking joke.

--
It is dangerous to be right when the government is wrong.

Re:Submitted this to /.? by ger · 2008-02-08 14:47 · Score: 5, Informative

To try to help put these numbers into perspective, this blog post is currently #1 on slashdot, #7 on reddit, the top page of del.icio.us, etc; yet www.w3.org is still serving more than 650 times as many DTDs as this blog post, according to a 10-min sample of the logs I just checked.
Re:Submitted this to /.? by Anonymous Coward · 2008-02-08 15:02 · Score: 0

That should teach you to put potentially problematic URLs on separate sub-domains, so that you can more easily deal with traffic issues should they arise, either by letting a different server(-farm) handle the requests or by using DNS to point the clients away from your servers entirely if it gets too much.
Re:Submitted this to /.? by ari_j · 2008-02-08 18:08 · Score: 1

Is that 650 times as many hits or 650 times as many bytes?
Re:Submitted this to /.? by ger · 2008-02-08 18:45 · Score: 4, Informative

650 times as many hits. (163 times as many bytes.) But that's just from a quick sample.
Re:Submitted this to /.? by jalet · 2008-02-08 23:02 · Score: 1

Then maybe you could just deactivate the logs to save a lot on CPU and hard disk resources...

--
Votez ecolo : Chiez dans l'urne !
Re:Submitted this to /.? by dotancohen · 2008-02-09 00:04 · Score: 1

Is that 650 times as many hits or 650 times as many bytes? It's 650 times as many kicks.

--
It is dangerous to be right when the government is wrong.

I always thought it was stupid by RelliK · 2008-02-08 13:55 · Score: 1

I always thought it was stupid that XML documents include reference to a DTD hosted on a remote server that you do not maintain. This is wrong on so many levels, I don't even know where to begin:

1. The validation will not work if the remote server is down, or network is down, or your connection to the internet is down, or if the file is not accessible for any other reason.

2. You are at the mercy of some third-party to ensure that the file is correct and that it doesn't change.

3. You are susceptible to man-in-the-middle attack.

etc.

For some insane reason, all XML examples have this reference to a remote URL. Most people never change defaults, so we get in a situation where nearly every time XML is validated, W3C site gets hit. The geniuses at W3C should have thought of that *before* this happened. Now they have to live with it...

--
___
If you think big enough, you'll never have to do it.

Re:I always thought it was stupid by MenTaLguY · 2008-02-08 14:27 · Score: 1

What people were supposed to do is include a copy of the DTDs with their software. That's what the public identifier string is there for, as an index into a local catalog of DTD resources. The URL was supposed to be only a fallback measure.

--

DNA just wants to be free...
Re:I always thought it was stupid by MtHuurne · 2008-02-08 14:52 · Score: 2, Interesting
I wrote my thesis in Docbook and installed the processing toolchain on a laptop. Sometimes the processing would fail and sometimes it worked. After a while I noticed it worked when I was sitting at my desk and failed when I was sitting on my bed. After some digging, I found out that the catalog configuration was wrong and the XML parser was downloading the DTDs from the web. This was before WiFi, so sitting on the bed meant the laptop did not have internet access.

The core of the problem is that most XML parsers will automatically and transparently fetch the DTD from the URL and do not cache it. So if you have no DTDs installed locally, or if your XML parser cannot find them (catalog configuration is easy to mess up), the parsing will work just fine and if processing the XML takes a significant amount of time, you probably won't notice the small delay from downloading the DTD.

There are several possible solutions for this:
- Do not automatically fetch DTDs from the web: make it an explicit option that the user has to set.
- Be vocal when fetching a DTD from the web, for example issue a warning.
- Cache fetched DTDs locally.
All of these are things that should be addressed in the XML parsers.
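
Two of those suggestions (be vocal, cache locally) can be sketched in a few lines of Java as an EntityResolver wrapper. CachingDtdResolver and its injected fetcher function are hypothetical; the fetcher stands in for whatever actually downloads the bytes, and the cache here lives in memory only, so it helps within one process run but not across runs.

```java
import java.io.ByteArrayInputStream;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

/** Hypothetical resolver: fetches each external DTD at most once and says so when it does. */
final class CachingDtdResolver implements EntityResolver {
    private final Map<String, byte[]> cache = new ConcurrentHashMap<>();
    private final Function<String, byte[]> fetcher; // assumed: returns the bytes behind a system ID

    CachingDtdResolver(Function<String, byte[]> fetcher) {
        this.fetcher = fetcher;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        byte[] dtd = cache.computeIfAbsent(systemId, id -> {
            System.err.println("warning: fetching DTD from " + id);
            return fetcher.apply(id);
        });
        if (dtd == null) {
            return null; // fetcher had nothing: let the parser use its default lookup
        }
        InputSource source = new InputSource(new ByteArrayInputStream(dtd));
        source.setPublicId(publicId);
        source.setSystemId(systemId);
        return source;
    }
}
```

Installed via setEntityResolver on a parser, this turns repeated parses of documents with the same DOCTYPE into a single download plus one warning on stderr.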
Re:I always thought it was stupid by statusbar · 2008-02-08 16:08 · Score: 1

Okay then, which xml parsers properly cache the DTD's and which ones don't? Which ones automatically download by default?

--jeffk++

--
ipv6 is my vpn
Re:I always thought it was stupid by MenTaLguY · 2008-02-08 18:11 · Score: 1

This sounds sensible and I think I agree.

--

DNA just wants to be free...
Re:I always thought it was stupid by MtHuurne · 2008-02-08 22:19 · Score: 2, Informative

I think I was using the Java version of Apache Xerces at the time for the Docbook processing. More recently I've used lxml in Python (based on libxml2), which has an option (no_network) to suppress DTD loading from the web, but you have to request that explicitly.

I've never seen a parser that caches DTDs by default. I'm not sure about parsers that do not download by default.

I'm going to say this as clearly as possible. by glwtta · 2008-02-08 13:55 · Score: 3, Informative

Browsers cache the DTDs.

There, you can now stop posting your hilarious "jokes".

--
sic transit gloria mundi

Re:I'm going to say this as clearly as possible. by Anonymous Coward · 2008-02-08 15:44 · Score: 0

Browsers cache the DTDs.

No, they don't. They never download them in the first place. At most, the URLs are used as magic strings to determine the format of the document.
Re:I'm going to say this as clearly as possible. by Anonymous Coward · 2008-02-08 15:46 · Score: 0

Browsers cache the DTDs.
Browsers don't cache the W3 DTDs, they don't even download them to begin with. Any browser that isn't completely nuts stores the DTDs (if they use them) with the application, as is recommended practice.

Generic XML/HTML/SGML processing tools are the only ones that need to download DTDs. Even here, commonly-used DTDs should be cached locally, but it's probably not surprising that there are a lot of "programmers" who don't understand that their tools are automatically fetching these resources. (This was widely-publicized with regards to RSS, for example.)

In a way, it's kinda impressive (in a very stupid way) that we have the sort of infrastructure today in which connectivity is so ubiquitous that nobody notices all these requests going out to the network--the applications just work. Probably the only solution is to have the library vendors rewrite code to make automatically fetching DTDs non-default behavior.
Re:I'm going to say this as clearly as possible. by Anonymous Coward · 2008-02-09 10:44 · Score: 0

Browsers actually validate - what for if not debugging? Scuse me, the W3C just bit itself by posting the DTD URLs in all possible HTML examples around the world.
Re:I'm going to say this as clearly as possible. by jtheisen · 2008-02-09 11:39 · Score: 1

Browsers cache the DTDs. Why should they query it in the first place?
Re:I'm going to say this as clearly as possible. by glwtta · 2008-02-09 12:31 · Score: 1

Browsers don't cache the W3 DTDs, they don't even download them to begin with.

I was including that under the general category of "caching", ie storing something locally because you know you'll need it. Come to think of it, I doubt if browsers use them at all.

--
sic transit gloria mundi

Gumdrops by milsoRgen · 2008-02-08 13:58 · Score: 4, Insightful

They are just about the only people who cannot be responsible for this. Exactly, for as long as I've been involved with HTML's various forms over the years it was always considered proper technique (from W3C documentation) to include the doctype (or more recently xmlns). Certainly sounds like a parser issue to me.

The only thing I'm unclear on is whether your average browser is contributing to this problem when parsing properly written documents.

--
I'm sick of following my dreams. I'm just going to ask where they're goin' and hook up with 'em later.

Obivious Solution by PaK_Phoenix · 2008-02-08 13:59 · Score: 1

Load too big on your server, need to slow down the traffic a bit.

Slashdot it.

That should work

--
This space intentionally left blank.

Starting on the 1st, fool by Scrameustache · 2008-02-08 14:06 · Score: 5, Funny

You don't have to leave it on the T-1, maybe just 1 month out of the year. Every year. I suggest April! :D

--

You can't take the sky from me...

Re:Starting on the 1st, fool by Anonymous Coward · 2008-02-08 17:20 · Score: 0

Oh, why not just do it on February 29th instead? XP

behead

Surprise by MBCook · 2008-02-08 14:08 · Score: 3, Insightful

I've got to say, this doesn't surprise me at all. In the time I've spent at my job, I've been repeatedly floored by the amazing conduct of other companies' IT departments. We've only encountered two people I can think of who have been hostile. Everyone else has been quite nice. You'd think people would have things set up well, but they don't.

We've seen many custom XML parsers and encoders, all slightly wrong. We've seen people transmitting very sensitive data without using any kind of security until we refused to continue working without SSL being added to the equation. We've seen people who were secure change their certificates to self-signed, and we seem to consistently know when people's certificates expire before they do.

But even without these things, I can't tell you how many people send us bad data and flat out ignore the response. We get all sorts of bad data sent to us all the time. When that happens, we reply with a failure message describing what's wrong. Yet we get bits of stuff all the time that is wrong, in the same way, from the same people. I'm not talking about sending us something that they aren't supposed to (X when we say only Y), I'm saying invalid XML type wrong... such that it can't be parsed.

We have, a few times while I've been there, had people make a change in their software (or something) and bombard us with invalid data until we either block their IP or manage to get into voice contact with their IT department. Sometimes they don't even seem to notice the lockout.

Some places can be amazing. Some software can be poorly designed (or something can cause a strange side effect, see here). I really like one of the suggestions in the comments on the article... start replying really slow, and often with invalid data. They won't do it. I wouldn't. But I like the idea.

--
Comment forecast: Bits of genius surrounded by a sea of mediocrity.

Re:Surprise by pete-classic · 2008-02-08 16:39 · Score: 1

We've seen people who were secure change their certificates to self-signed

I work with third-party XML (and "XML") regularly, and I largely agree with your conclusions.

On the other hand, "self-signed" and "secure" aren't in opposition. All else held equal, a self-signed certificate is just as secure as one signed by some sort of signing authority. It may not be as trustworthy as one signed by some authority. But it's probably a phone call away from being more trustworthy than one signed by a third party. A phone call seems like a low bar given your description of your efforts to help people with their IT problems.

I agree that throttling by IP makes a lot of sense. Returning bad data would probably be satisfying, but probably wouldn't help or be in line with the W3C's goals.

-Peter

A lesson from network history by idontgno · 2008-02-08 14:10 · Score: 1

which is never ever learned...

A freely accessible network resource is begging to be driven, smoking and shattered, into the ground by the ill-mannered, ill-trained, or ill-intentioned hordes.

Personally, I blame the introduction of AOL in 1994 to the Usenet for this downward spiral. We were doing just fine before all you "me too"s started pouring in.

Get off my lawn, you clueless kids!

--
Welcome to the Panopticon. Used to be a prison, now it's your home.

Re:A lesson from network history by Anonymous Coward · 2008-02-08 14:28 · Score: 0

Godwin says otherwise. Blame the Jews.
Re:A lesson from network history by Anonymous Coward · 2008-02-09 00:23 · Score: 0

You are confused. Hitler said that. Godwin said blame the Hitler.

Stupid design decisions in standards ... by Lazy+Jones · 2008-02-08 14:19 · Score: 1

... come back to haunt you.

Perhaps they will stop putting HTTP-URLs in standardized tags now... Also, enjoy life as a web content provider who spends many hours per week blocking Referers (nice typo in the original RFC!) and dealing with broken clients, something that the W3C never spent much time pondering about.

--
"I love my job, but I hate talking to people like you" (Freddie Mercury)

Make it slower, not faster by Thunderbear · 2008-02-08 14:20 · Score: 2, Insightful

If the problem is that it gets served out too many times, then make the server slow as molasses. If it takes 1-2 minutes to get the DTD from the server, or more, then it is quickly discovered by the performance teams.

--
Thorbjørn Ravn Andersen "...and...Tubular Bells!"

Stop the insanity by kaosgoblin · 2008-02-08 14:21 · Score: 1

Must Add 5 miles of data to my code now, they need MORE DATA!!!

Re:I'm just conforming! by jlarocco · 2008-02-08 14:29 · Score: 1

Hey, you made the specs. Why should you blame me if I'm conforming? Does the spec allow me to assume that all my documents are going to use that DTD, and that it won't change?

Sigh.

If you're the one writing the XML, this is almost no concern of yours.
The DTD won't change. That's the point of having a standard DTD.
The standards say absolutely nothing about fetching the DTD from the web every time an XML file is being validated.

What to do, what to do...

Try getting a clue.

--
Maybe not

The HTML 5 doctype kind of solves this by Jugalator · 2008-02-08 14:32 · Score: 1

That doctype is simply <!DOCTYPE HTML>!

--
Beware: In C++, your friends can see your privates!

The problem is with the docs by Mantaar · 2008-02-08 14:33 · Score: 4, Insightful

The problem does not lie in the mechanism itself - it's in the documentation - or the lack of understandable (or at least often-used) docs directly at the source.

Simple caching on client side could already improve the situation a whole lot... BUT:

When people implement something for html-ish or svg-ish or xml-ish purposes, they go google for it: "Howto XML blah foo" - result, they're getting basic screw-it-with-a-hammer tutorials that don't point out important design decisions, but instead Just Work - which is what the author wanted to achieve when they started writing the software.

It's a little bit like people still using ifconfig on Linux though it's been deprecated and superseded by iptables and iproute2. But since most tutorials and howtos on the net are just dumbed-down copypasta for quick and dirty hacks - and since nobody fucking enforces the standards - nobody does it the Right Way.

So if I start writing some sax-parser, some html-rendering lib, some silly scraper, whatnot... and the first example implementations only deal with basic stuff and show me how to do it so basic functionality can be implemented... and I'm not really interested in that part of the program anyways, because I need it for putting something more fancy on top... once I'm through with the initial testing of this particular subsystem, I won't really care about anything else. It works, it doesn't seem to hit performance too badly, it's according to some random guy's completely irrelevant blog - hey, this guy knows what he's doing. I don't care!

This story hitting /.'s front page might actually help improve the situation. But.. it's like this with stupid programmers - they never die out, they'll always create problems. Let's get used to it.

--
I'm an infovore...

Re:The problem is with the docs by znerk · 2008-02-08 15:11 · Score: 0

It's a little bit like people still using ifconfig on Linux though it's been deprecated and superseded by iptables and iproute2

Oh, please do tell me how to use iptables or iproute2 to set my ip address, or to enable/disable a network adaptor.

--
This work is licensed under a Creative Commons Attribution 3.0 Unported License.
Re:The problem is with the docs by phasm42 · 2008-02-08 15:43 · Score: 0

It's a little bit like people still using ifconfig on Linux though it's been deprecated and superseded by iptables and iproute2
Oh, please do tell me how to use iptables or iproute2 to set my ip address, or to enable/disable a network adaptor.
Maybe GP meant ipchains?

--
"No one likes working in a hamster wheel, and your shop smells of cedar shavings from here." - TaleSpinner
Re:The problem is with the docs by znerk · 2008-02-08 15:47 · Score: 0, Offtopic

I don't see the functionality I asked for in ipchains, either.

--
This work is licensed under a Creative Commons Attribution 3.0 Unported License.
Re:The problem is with the docs by paul248 · 2008-02-08 16:34 · Score: 4, Informative

Oh, please do tell me how to use iptables or iproute2 to set my ip address, or to enable/disable a network adaptor.

ip link set eth0 up
ip addr add 192.168.1.2/24 dev eth0
ip link set eth0 down

etc. etc.
Re:The problem is with the docs by phasm42 · 2008-02-08 17:18 · Score: 1

I don't see the functionality I asked for in ipchains, either.
I meant that the deprecated tool is ipchains, not ifconfig.

--
"No one likes working in a hamster wheel, and your shop smells of cedar shavings from here." - TaleSpinner
Re:The problem is with the docs by frisket · 2008-02-09 02:16 · Score: 1

The problem is that people don't read the docs :-)
SGML (old HTML) mandates a DTD. If you have browsers take markup seriously, they will of course need to download the DTD in order to process the document. Browser writers have known this since Nov 1993 but failed to grasp that the sensible thing to do is ship with local copies of all the common variants.
XML doesn't mandate a DTD (or a Schema) but a browser may choose to retrieve one if one is specified. Again, local provision or caching is the answer. In both SGML and XML there is a well-defined catalog resolution mechanism available to handle Formal Public Identifiers as well as URIs.
It is disingenuous for the W3C to complain about this when they have consistently allowed themselves to be [mis-]directed by their larger members whose interests lie elsewhere. The solution is to educate the browser-writers to fix their broken code.
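
The "local provision" idea above can be sketched with Python's standard xml.sax module. This is only an illustration, not a full catalog implementation: the LOCAL_DTDS table, the one-line stub DTD, and the names LocalCatalogResolver and make_offline_parser are all assumptions made up for the example. A real program would ship complete copies of the DTD files.

```python
import io
import xml.sax
from xml.sax.handler import EntityResolver, feature_external_ges
from xml.sax.xmlreader import InputSource

# Hypothetical local catalog: public identifier -> bytes of a DTD copy that
# ships with the application. The one-line stub stands in for the real file.
LOCAL_DTDS = {
    "-//W3C//DTD XHTML 1.0 Strict//EN": b"<!ELEMENT html ANY>",
}


class LocalCatalogResolver(EntityResolver):
    """Serve external entities from LOCAL_DTDS and never touch the network."""

    def resolveEntity(self, publicId, systemId):
        source = InputSource(systemId)
        # Unknown identifiers resolve to an empty stream instead of a fetch.
        source.setByteStream(io.BytesIO(LOCAL_DTDS.get(publicId, b"")))
        return source


def make_offline_parser():
    """Return a SAX parser that resolves DTDs through the local catalog."""
    parser = xml.sax.make_parser()
    parser.setEntityResolver(LocalCatalogResolver())
    # External entities are only looked up at all when this feature is on.
    parser.setFeature(feature_external_ges, True)
    return parser
```

Returning an empty stream for unknown identifiers is a design choice for this sketch; a stricter program might raise an error instead.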
Re:The problem is with the docs by balsa · 2008-02-09 07:42 · Score: 1

Or the copyright. If they want to encourage replication of the document, they should eliminate constraints on that.
http://www.w3.org/Consortium/Legal/copyright-documents-19990405.
Re:The problem is with the docs by Anonymous Coward · 2008-02-09 11:51 · Score: 0

using ifconfig on Linux though it's been deprecated and superseded by iptables and iproute2

a) iptables replaced ifconfig? I guess you mean ipchains because afaik iptables has absolutely nothing to do with setting up network interfaces.

b) fuck 'deprecated' and everyone who passes judgement because someone else uses software for which newer replacements exist. I for one will continue to use what works well for me (be that ifconfig or xmms) even if that requires building them and their dependencies from source.

c) there is no such thing as 'the right way'... ever; just approximations that come sufficiently close.
Re:The problem is with the docs by znerk · 2008-02-11 02:47 · Score: 1

Oh, please do tell me how to use iptables or iproute2 to set my ip address, or to enable/disable a network adaptor.

ip link set eth0 up
ip addr add 192.168.1.2/24 dev eth0
ip link set eth0 down

etc. etc.

yeah, cuz ip link looks an awful lot like iptables

Oh, I get it... you meant for me to be psychic, and know that you meant to preface that with something that would have actually been informative, like
"ip link is part of the iproute2 suite of tools. You can find more information on using it as you specified by clicking this link. Here's an (incorrect) example of usage, if you want to see it display a lot of error messages:"

--
So much for your post being informative. IMHO, that was just a troll. Maybe this flamebait will get me a +5 Insightful. Yeah, right.

--
This work is licensed under a Creative Commons Attribution 3.0 Unported License.

Recording UA? by dotancohen · 2008-02-08 14:39 · Score: 1, Redundant

What are the user agents making the requests? Do these programs identify themselves with a UA string or something?

--
It is dangerous to be right when the government is wrong.

Re:Recording UA? by mrbobjoe · 2008-02-08 23:39 · Score: 1

Apparently many are using their programming language's own parsing libraries, but not bothering to change the default user-agent header. ftfa:
Many of the automated requests we receive have generic user-agent headers such as Java/1.6.0 or Python-urllib/2.1 which provide no information on the actual software responsible for making the requests.
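
For what it's worth, overriding the library default is a one-liner in most HTTP clients. A minimal sketch with Python's urllib.request follows; the application name, the contact URL, and the build_request helper are hypothetical placeholders, not anything taken from the article.

```python
import urllib.request

# Hypothetical application name and contact URL; substitute your own.
USER_AGENT = "ExampleFeedReader/1.0 (+http://example.com/contact)"


def build_request(url):
    """Build a request that identifies the calling software by name."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
```

Passing the result to urllib.request.urlopen sends the custom header instead of the generic Python-urllib one, so a server operator can at least tell which software to contact.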
Re:Recording UA? by dotancohen · 2008-02-09 00:15 · Score: 1

Well there you go. Now get in contact with the Java and Python folks and have them fix their libraries.

--
It is dangerous to be right when the government is wrong.

Re:I'm just conforming! by ShatteredArm · 2008-02-08 14:40 · Score: 1, Insightful

I was just being facetious, first of all... But if you must...

The DTD won't change. That's the point of having a standard DTD.

What's the point of having a DTD if it won't change? Oh yeah, there is none. Conceptually, the DTD is there to define the data, and unless you know what is in the DTD, you cannot use it to validate, which is its purpose. And conceptually, if you assume the data is defined a certain way, you don't need a DTD.

If you're the one writing the xml this is almost no concern of yours.

Generally the DTD is for the person parsing the XML. If you're writing the XML, you don't need a DTD, because you already know the schema. If it's only for the XML writers, all you'd need to do is place your schema with the rest of the specs for your application.

Now I wasn't suggesting that in practice you should go to the server every time and fetch the DTD. But clearly you take things too seriously.

Try getting a clue.

Try getting a sense of humor.

heh by rastoboy29 · 2008-02-08 14:44 · Score: 1

I bet Slashdot.org could possibly find some bloggers that would be more than happy to receive that traffic!

--
expandfairuse.org

Re:heh by milsoRgen · 2008-02-08 14:58 · Score: 1

that'd be fun, hijack a dns server and have all the requests directed to whatever project you have...

Look investors look! Look at all the hits we're getting! More money please!

--
I'm sick of following my dreams. I'm just going to ask where they're goin' and hook up with 'em later.

Sorry that was me! by syousef · 2008-02-08 14:50 · Score: 1

I'm sorry the typo's mine. I made it when I was working late one night and spilt spaghetti down my shirt. I had no idea that it would propagate so far and ruin the web. Oops. Anyway I've fixed it, but it's not in the stable CVS branch yet so I'm afraid you'll just have to put up with it for a while longer.

(For those without a sense of humour, yes this is a joke)

--
These posts express my own personal views, not those of my employer

Re:Sorry that was me! by Anonymous Coward · 2008-02-09 03:35 · Score: 0

But not a funny one.

rofl by Anonymous Coward · 2008-02-08 14:54 · Score: 0

its not too hard to host the dtd file on your own server, amirite? not like its gonna change....

ISP, where's your DTD server? by leek · 2008-02-08 14:57 · Score: 1

Perhaps ISPs should install caching DTD servers.

People would have another reason to complain about their ISP's quirks.

That's the problem with a URI for an ID by argent · 2008-02-08 14:59 · Score: 3, Insightful

I think they screwed up, and brought this on themselves. I already thought that it was annoying having so verbose an identifier... this just makes it more hateful.

If they'd at least made the identifier NOT a URI, something like domain.example.com::[path/]versionstring, or something else that wasn't a URI, so it was clearly an identifier even if it was ultimately convertible to a URI, they would have avoided this kind of problem.

Re:That's the problem with a URI for an ID by Todd+Knarr · 2008-02-08 16:44 · Score: 1

You seem to have missed the syntax of the DOCTYPE element. The first identifier, the one immediately after PUBLIC, is what you describe. It's a public identifier useful for looking up the DTD in the machine's local store of DTDs. The second identifier, the one that's a URL, is the system identifier, a fallback intended to allow software that needs the DTD to fetch a copy when there isn't one stored locally.
Re:That's the problem with a URI for an ID by argent · 2008-02-09 05:36 · Score: 1

You seem to have missed the syntax of the DOCTYPE element.

What I may seem to have missed isn't the problem. The problem is that lots of other people, many of whom have written popular software programs or who have written XML document standards that *do* use URIs as identifiers, seem to have missed it.

Adding another use to an existing standard... by Anonymous Coward · 2008-02-08 15:00 · Score: 0

Classic problem of using something developed for one purpose (a remote resource locator) for something else (a unique identifier).

If they had done something simple, like using some non-functional protocol identifier such as 'ident' (e.g. xmlns="ident://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"), browsers and other software would have been developed to never actually 'do' anything with such a URI.

Re:Adding another use to an existing standard... by Anonymous Coward · 2008-02-08 20:48 · Score: 0

You're describing the "urn" scheme. Whoever decided to start using URLs as if they were URNs (out of the sheer laziness of using any domain name to ensure their crap is unique) needs to be taken out and maimed.

The SlashDot effect could shut down this site! by Jack+Pallance · 2008-02-08 15:08 · Score: 1

Here, I'm posting the *real* article so you guys don't have to click through this blogspam!

http://www.w3.org/1999/xhtml For further information, see: http://www.w3.org/TR/xhtml1 Copyright (c) 1998-2002 W3C (MIT, INRIA, Keio), All Rights Reserved. This DTD module is identified by the PUBLIC and SYSTEM identifiers: PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" SYSTEM "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd" $Revision: 1.1 $ $Date: 2002/08/01 13:56:03 $ --> %HTMLlat1; %HTMLsymbol; %HTMLspecial;

Obligatory by Zombie+Ryushu · 2008-02-08 15:09 · Score: 0, Troll

That sounds like a DTD thing to do! If you are a dee, please don't marry a tee, because if you marry a tee, your kids will be DEE TEE DEE.

Am I the only one who thinks this... by BlizzardandBlaze · 2008-02-08 15:11 · Score: 0, Troll

First off, I don't know much about DTDs, but from what I can tell, it's like a template, like a Cascading style sheet, or something like that. That said...

Why did they even allow people to link to this thing in the first place? I think that they could have predicted that this would happen, simply because the web is huge and if even a small percentage of all the servers on the internet start to link to the code, they are going to get a massive influx of requests demanding this information.

Knowing this, I wouldn't let people link directly to the code. That doesn't mean that they can't use it, (they can use it by downloading the code onto their own computers and hosting it there) but I would make sure that they can't link directly to my servers. Don't get me wrong, it's nice of them to let us link to their code. However, when you provide a useful piece of software for everyone to link to, you gotta expect that people are going to take full advantage of linking your code if you let them, whether they link it efficiently or not.

W3C stupidity by nguy · 2008-02-08 15:24 · Score: 1

Creating a standard that would allow people to host DTDs all over the web and fetch them automatically was major design stupidity, not just because people need to host that stuff, but because it misses the point of standardization in the first place.

Re:I'm just conforming! by Anonymous Coward · 2008-02-08 15:31 · Score: 0

"and unless you know what is in the DTD, you cannot use it to validate"

That's the most contrary-to-reality assertion I've seen in my whole life (well, along with some deity being one and three at the same time): you can *always* validate a document against a DTD without having a fucking idea what you're going to find within it until that moment. That's the very point of an SGML DTD.

Now, the semantics is another story.

Oy Vey... by zanaxagoras · 2008-02-08 15:41 · Score: 3, Interesting

PocketPick is 100% correct.

Here's an example of what correct markup should look like:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://schemas.slashdot.org/strict.dtd">

The documented standard uses a URL that links to the W3C's copy of the DTD only as an EXAMPLE. The standard DOES NOT REQUIRE usage of the URL to W3C's copy of the DTD. Responsible developers use a URL that links to their OWN COPY of the DTD. ANYTHING else is just leeching from W3C. PERIOD.

Re:Oy Vey... by Zarel · 2008-02-08 16:45 · Score: 3, Interesting

The documented standard uses a URL that links to the W3C's copy of the DTD only as an EXAMPLE. The standard DOES NOT REQUIRE usage of the URL to W3C's copy of the DTD. Responsible developers use a URL that links to their OWN COPY of the DTD. ANYTHING else is just leeching from W3C. PERIOD.

Well, no, it's not. It's true that the standard does not require usage of the URL to W3C's copy of the DTD, but it's definitely recommended, since every client presumably has a cached copy of the W3C's DTD for something as common as HTML 4.01, and if you were to link to your own, some parsers might be confused and unsure about whether or not you're using Official W3C HTML (tm). (Yes, yes, I know; they should know by '-//W3C//etc' but this article is about stupid parsers, isn't it?)

--
Want a high quality FOSS RTS game? Try Warzone 2100!
Re:Oy Vey... by Anonymous Coward · 2008-02-08 20:13 · Score: 0
The standard says
The system identifier may be changed to reflect local system conventions.
They didn't say MUST or even SHOULD. I take it to mean "if URLs aren't acceptable as system identifiers for your parser, you can use something that is." But certainly if you're going to bake special knowledge of a single markup into your parser (rather than just ship a cached copy of the W3C DTDs), the Right Thing is to compare the public identifier to one of
- "-//W3C//DTD XHTML 1.0 Strict//EN"
- "-//W3C//DTD XHTML 1.0 Transitional//EN"
- "-//W3C//DTD XHTML 1.0 Frameset//EN"
and for anything else presume it's tag soup from an inept author.
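
As a rough illustration of that comparison, here is a sketch in Python. The regular expression is a simplification and not a conforming DOCTYPE parser, and the function names public_identifier and is_known_xhtml are made up for the example.

```python
import re

# The three XHTML 1.0 public identifiers listed above.
KNOWN_XHTML_PUBLIC_IDS = frozenset({
    "-//W3C//DTD XHTML 1.0 Strict//EN",
    "-//W3C//DTD XHTML 1.0 Transitional//EN",
    "-//W3C//DTD XHTML 1.0 Frameset//EN",
})

# Captures the quoted public identifier that follows the PUBLIC keyword.
_DOCTYPE_PUBLIC = re.compile(
    r'''<!DOCTYPE\s+\w+\s+PUBLIC\s+(["'])(.*?)\1''',
    re.IGNORECASE | re.DOTALL,
)


def public_identifier(document):
    """Return the public identifier from the DOCTYPE, or None if absent."""
    match = _DOCTYPE_PUBLIC.search(document)
    return match.group(2) if match else None


def is_known_xhtml(document):
    """True when the DOCTYPE names one of the three XHTML 1.0 DTDs."""
    return public_identifier(document) in KNOWN_XHTML_PUBLIC_IDS
```

Nothing here looks at the system identifier at all, so no URL is ever dereferenced.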
Re:Oy Vey... by Anonymous Coward · 2008-02-09 03:53 · Score: 0

You'd prefer XHTML 1.0 Frameset to HTML 4.01 Strict? Interesting. I'll wait till a) there's virtually nothing left which will only accept text/html so I don't have to make my servers lie about what they're serving (even if it is an officially sanctioned lie) and b) It actually does something more than HTML 4.01.
Re:Oy Vey... by ger · 2008-02-09 08:24 · Score: 1

Responsible developers use a URL that links to their OWN COPY of the DTD. ANYTHING else is just leeching from W3C. PERIOD.
No no no, that's not the intent at all, documents should continue to point to DTDs on W3C's site. In fact the next version of W3C's markup validator will issue a warning if the FPI and system ID do not match.

People who are simply creating HTML documents generally don't need to worry about this issue at all, sorry if the article was unclear.
Re:Oy Vey... by Anonymous Coward · 2008-02-10 11:56 · Score: 0

It's not a preference so much as a degree of confidence. An XHTML public identifier makes a decent shibboleth for competent developers. While it's reasonable for competent developers to use HTML4 instead (as you say, XHTML really didn't add anything), their work is vastly outnumbered by document type declarations lifted straight out of Cargo Cult HTML In Six Hours For Complete Morons. If a document declares itself HTML4, the odds it can be successfully processed with SGML tools are nearly zero.

IE Made Me Do It by FutureDomain · 2008-02-08 15:46 · Score: 0, Redundant

Sorry W3C, but if I don't include it in my webpage, IE goes into the dreaded quirks mode!

*forwards article to Microsoft*

--
Hydraulic pizza oven!! Guided missile! Herring sandwich! Styrofoam! Jayne Mansfield! Aluminum siding! Borax!

Re:IE Made Me Do It by fbjon · 2008-02-08 19:28 · Score: 1

That's entirely ok. The problem is badly written software that would look at the URI as an URL and attempt to fetch it e.g. on startup.

--
True confidence comes not from realising you are as good as your peers, but that your peers are as bad as you are.
Re:IE Made Me Do It by mrrazz · 2008-02-08 20:18 · Score: 1

That could easily be fixed by making the URI use a different (nonexistent) protocol handler like schema:// instead of the usual http://
Re:IE Made Me Do It by Anonymous Coward · 2008-02-09 08:27 · Score: 0

A lot of recent Microsoft XML specs have a URN instead of a URL as a URI -- they don't resemble a URL at all. (They have things like urn:microsoft-xml-spec - missing out the // really means it's not a valid URL and a download attempt cannot possibly succeed. It thus sidesteps this entire issue.)

An example is the second line below:

<mybook:BOOK xmlns:mybook="http://www.contoso.com/books.dtd">
<bb:BOOK xmlns:bb="urn:blueyonderairlines">

(from http://msdn2.microsoft.com/en-nz/library/a9a1451a(en-us).aspx )

Also, this demonstrates how a C# program might validate against a schema identified only by a URN - by passing a list of URN to URL or filename mappings to the XML Reader.
Re:IE Made Me Do It by aevans · 2008-02-09 10:53 · Score: 1

No, the problem is a badly written spec which explicitly states that the URI should be included as part of the XML document and that looking it up is a perfectly legal (though optional) step in parsing it.

Hey, for once... by 93+Escort+Wagon · 2008-02-08 15:46 · Score: 4, Funny

... you can't blame Microsoft for this problem! After all, IE ignores pretty much all web standards and best practices, and does its own thing!

--
#DeleteChrome

Re:I'm just conforming! by Anonymous Coward · 2008-02-08 16:00 · Score: 0

Try getting a sense of humor.

Try learning the difference between humour and uninformed bullshit.

They need to petition the parser designers... by poor_boi · 2008-02-08 16:02 · Score: 1

They need to petition the parser designers... to include cached copies of these documents "that have not changed in years" with the parser distributions themselves. And then petition the parser designers to turn caching on by default. Caching too much of a pain in the ass? Then create two caches, one for authoritative schemas that are highly unlikely to change, and one for third-party schemas where aggressive caching is undesirable.

Of course, this is just a guess, but my instinct tells me that most of this traffic is being generated by people who don't even know better, and probably don't read slashdot, and may not read w3c.org. Making caching 'the default' is probably the only way you'd see a noteworthy drop in traffic.
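
The two-cache idea might look something like the sketch below. Everything in it is an assumption for illustration: the SchemaCache class, the AUTHORITATIVE_PREFIXES tuple, the one-hour default expiry, and the injected fetch callable, which stands in for whatever actually downloads a schema.

```python
import time

# Hypothetical URL prefixes whose schemas are treated as effectively frozen.
AUTHORITATIVE_PREFIXES = ("http://www.w3.org/",)


class SchemaCache:
    """Keep authoritative schemas forever and third-party ones for a while."""

    def __init__(self, fetch, ttl_seconds=3600, clock=time.monotonic):
        self._fetch = fetch        # callable: url -> schema text
        self._ttl = ttl_seconds    # lifetime of third-party entries
        self._clock = clock        # injectable so expiry can be tested
        self._permanent = {}       # url -> schema text
        self._expiring = {}        # url -> (time stored, schema text)

    def get(self, url):
        if url.startswith(AUTHORITATIVE_PREFIXES):
            # Fetched at most once for the lifetime of the cache.
            if url not in self._permanent:
                self._permanent[url] = self._fetch(url)
            return self._permanent[url]
        entry = self._expiring.get(url)
        now = self._clock()
        if entry is None or now - entry[0] >= self._ttl:
            entry = (now, self._fetch(url))
            self._expiring[url] = entry
        return entry[1]
```

Shipping the authoritative copies with the parser, as suggested above, would remove even the first fetch; this sketch only covers the caching half.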

With that kind of traffic by HangingChad · 2008-02-08 16:10 · Score: 1

Yet we receive a surprisingly large number of requests for such resources: up to 130 million requests per day

I want to run their AdSense program. Cha-ching!

--
That's our life, the big wheel of shit. - The Fat Man, Blue Tango Salvage

I doubt Explorer or Firefox ever follow the URL by Art3x · 2008-02-08 16:13 · Score: 0

I doubt Explorer or Firefox ever follow the URL. The visitors most likely are web spiders.

Mainstream web browsers do not validate pages. They do notice the Doctype declaration at the top of your page, but use it only to choose a rendering mode: old 1990s ("quirks mode") or "standards" mode. The two modes are only slightly different and have to do with the font size of tables, width calculation of block-level elements, etc.

- a web page with no doctype gets rendered in quirks mode
- a web page with one of the doctypes usually gets rendered in standards mode

There are other details that those who are interested in can Google for -- Firefox actually has three rendering modes: quirks, almost-standards, and standards mode. But even then, it is based on a regular-expression match on the doctype, not by following the URL at the end of it.
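
A deliberately oversimplified sketch of that kind of doctype sniffing, in Python: it only distinguishes "no doctype" from "some HTML doctype" and ignores the almost-standards mode and the lists of legacy doctypes that real browsers check. The rendering_mode function is a made-up name for the example.

```python
import re

# Matches an HTML doctype at the very start of the document, allowing
# leading whitespace. The URL inside the doctype is never inspected.
_DOCTYPE = re.compile(r"\s*<!DOCTYPE\s+html\b", re.IGNORECASE)


def rendering_mode(document):
    """Pick a rendering mode from the doctype text alone, with no fetching."""
    return "standards" if _DOCTYPE.match(document) else "quirks"
```

The point is that the decision is a string match on the declaration itself, so the browser has no reason to request the DTD.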

Paranoia? Clear Cache... by Frosty+Piss · 2008-02-08 16:18 · Score: 1

And all the major browsers cached that file after loading it (at most) once, and then never read it again.

You might think so. But today, when everyone is ultra-paranoid about security, perhaps so many people clearing their cache every time they close the browser plays some part in this? And maybe with other apps, too, keeping such information cached is no longer standard. No?

--
If you want news from today, you have to come back tomorrow.

Re:I'm just conforming! by ShatteredArm · 2008-02-08 16:31 · Score: 0

LOL... How are you going to validate a document against a DTD without knowing what is in the DTD? That's like trying to send a SOAP message without the WSDL. As far as the validating parser is concerned, the DTD could be anything, and the DTD is exactly what it uses to do the validation. That's what determines if the document is valid.

Re:I'm just conforming! by jlarocco · 2008-02-08 17:01 · Score: 1

What's the point of having a DTD if it won't change? Oh yeah, there is none. Conceptually, the DTD is there to define the data, and unless you know what is in the DTD, you cannot use it to validate, which is its purpose. And conceptually, if you assume the data is defined a certain way, you don't need a DTD.

What's the point in having a DTD if it *can* change? All files using the old version would be invalidated. And more important, any parser made for the old version would start rejecting XML that it could parse. That's part of the reason why it doesn't work like that.

Generally the DTD is for the person parsing the XML. If you're writing the XML, you don't need a DTD, because you already know the schema. If it's only for the XML writers, all you'd need to do is place your schema with the rest of the specs for your application.

No. The DTD is an agreement between the XML writer and the parser writer on the format of the XML to be used. The actual content of the DTD is completely irrelevant at run time as long as the incoming file says it complies with the DTD the parser expects. Any parser with more than "if (file.dtd == expectedDtd)" has failed. The only good reason I can think of to even touch the actual DTD at runtime would be for a general purpose XML validator, which, ironically, is a special case.

--
Maybe not

You are not contradicting my point. by v(*_*)vvvv · 2008-02-08 17:11 · Score: 1

A reference is a link. "http://www.w3.org/TR/html4/strict.dtd" is a link. If it weren't a link, then it wouldn't get traversed and become a source of traffic!

And DTD and SGML might be older than W3C, but we are talking about HTML and XHTML, which are W3C standards. W3C is the one that decided that their "standards" must comply with DTD and SGML, even though there are plenty of anomalies and special rules that browsers know how to handle anyway.

I prefer not even using the dtd declaration, and I am guessing that 99% of the pages served do not need it. Only when you are seriously pushing HTML to its limits do DTD differences come into play anyway. And most html coders know that keeping HTML simple and doing everything on the server side using tools like PHP is far better, because assuming the browser is fully compliant to each DTD version is unreliable. Mainly because browsers don't fully comply with W3C anyway.

I view it as their penance... by frank_adrian314159 · 2008-02-08 17:25 · Score: 1

... for foisting XML on the world.

--
That is all.

Good! by TheLink · 2008-02-08 17:29 · Score: 1

You put a URL everywhere, don't be surprised if someone visits it even just out of curiosity.

DDoS yourself.

Enjoy your 1000 requests per second, you practically asked for them (even if in theory you didn't).

I hope this makes the W3C people start thinking more about what happens in the real world and design their stuff better.

W3C: "But but but, this is not a hyperlink, it's only a machine-readable way to say 'this is HTML'..."

How about using version numbers for _standard_ DTDs instead, and only have URIs for _custom_ DTDs (guess how many would use those in practice or be able to...)?

The W3C likes to say stuff like "Browser makers must/should raise a security exception if XYZ goes wrong", that's all very nice in the "theory" world.

The real world doesn't work that way, so design stuff better please. Design stuff that breaks reasonably gracefully and safely.

--

Too many replies beneath your current threshold

Re:Good! by Forbman · 2008-02-08 17:52 · Score: 1

Well, much like DNS had its problems scaling initially when it was just one root table that was forwarded around to every network until they came up with a scheme to break it up and distribute the load, how come there isn't a "distributed DTD-lookup" service similar to this, that could be distributed to Akamai, ISP's, web hosting services, etc.?

Or, how about web servers (IIS, Apache, etc.) being built to fetch a given W3 DTD on the first request and then cache it locally?
Re:Good! by TheLink · 2008-02-08 18:49 · Score: 1

Yep they were distributing a hosts file around. But half of the Internet stuff wasn't built yet, so that's understandable.

The W3C should have stood on the shoulders of giants rather than dug holes for themselves ;).

The W3C can use Akamai for this actually without changing the current scheme - so it's actually not that bad a design.

But who's going to pay for it?

isp hosting dtd? by tozenne · 2008-02-08 18:03 · Score: 1

Well, this may sound stupid, but what if ISPs rsynced the W3C DTDs and hosted them, and a modified BIND pointed the W3C DTD address to the ISP's own web server? Or a DNS record that would send each hit to a different machine? OK, me dumb, good day to you :)

Caching is not the solution by Saturn49 · 2008-02-08 18:03 · Score: 1

As a developer on the other side, caching is not the solution to this problem. Caching is implemented client side as a *benefit to the client*, not the server. On some platforms (that need to verify XML) caching isn't even really possible. The extra overhead of caching (where do you put it, what program/process maintains it, how is it configured) is entirely unnecessary. The real problem lies in the spec and the various examples floating around.

XML Parsers and such by Sam+Douglas · 2008-02-08 18:11 · Score: 1

I remember from a University project a year or so ago, we had some weird issues where our model loading code would hang when run on systems that didn't have a direct internet gateway available... turned out the .NET XML parser library was trying to fetch and validate the X3D schema for every model it loaded, downloading it each time. These are the kind of defaults that cause issues, if it weren't for the web proxy issue we would have probably never realised it was doing that!

OT: can't believe... by thePowerOfGrayskull · 2008-02-08 18:14 · Score: 1

ROFL! I can't believe I used up my Mod points...

This is the funniest thing I've ever read in a comment. Seriously, I actually laughed aloud. I mean... the perspective, it's awesome: "I can't believe I won the lottery"... "I can't believe that guy just hit me"... "I can't believe I just accidentally drank a gallon of antifreeze instead of a shot of whiskey"... "Gah! I can't believe that I just ranked five separate comments in five different places! How the hell did that happen?!"

Aren't they responsible for the insanity? by shaitand · 2008-02-08 18:24 · Score: 1

I'm pretty sure they are the ones who developed the system that requires all web documents to reference their DTD. Last time I tried omitting such a reference and ran the page through the validator, the validation failed ;)

Seems to me that if they designed a system to allow them to change the document specification used by billions of documents by modifying a single document, never made use of that ability, and that design decision now costs them ridiculous amounts of resources, that is their own problem. I know a web developer can't omit that reference from their document without violating standards; would it also violate standards for the client application to skip requests for the standard DTDs?

Speaking of caches... by SanityInAnarchy · 2008-02-08 18:53 · Score: 1

Why can't the software cache it, then? Lazy developers...

I'm sure good ol' HEAD, If-Modified-Since, and ETag/If-None-Match would be a LOT less bandwidth. Or are they getting that much in cache hits alone?

Could also refuse to serve it except in gzip format.
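As a rough illustration of the conditional-request idea above, here is a minimal Python sketch using only the standard library. The function names `conditional_headers` and `fetch_if_changed` are made up for this example, and it assumes the client has kept the `ETag` and `Last-Modified` values from an earlier response; it is not taken from any particular client.

```python
import urllib.error
import urllib.request


def conditional_headers(etag=None, last_modified=None):
    """Build the validator headers for a conditional GET (hypothetical helper)."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_if_changed(url, etag=None, last_modified=None):
    """Return (body, etag, last_modified); body is None if the cached copy is still valid."""
    request = urllib.request.Request(
        url, headers=conditional_headers(etag, last_modified)
    )
    try:
        with urllib.request.urlopen(request) as response:
            body = response.read()
            return (
                body,
                response.headers.get("ETag"),
                response.headers.get("Last-Modified"),
            )
    except urllib.error.HTTPError as err:
        # urllib reports 304 Not Modified as an HTTPError; no body was transferred.
        if err.code == 304:
            return None, etag, last_modified
        raise
```

A 304 response still costs the server a request, so this only trims bandwidth; it does not remove the hit itself.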

--
Don't thank God, thank a doctor!

Re:Speaking of caches... by fbjon · 2008-02-08 19:38 · Score: 1

They could move it to the URL http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd.physical
This is perfectly in line with the standard, as the identifier would still be the same (without .physical). That would cause all software that is erroneously downloading it to fail, as well as future badly written software that relies on it existing at the URI-as-URL, teaching the devs a lesson.

--
True confidence comes not from realising you are as good as your peers, but that your peers are as bad as you are.
Re:Speaking of caches... by sipatha · 2008-02-09 00:37 · Score: 1

That doesn't stop the loading page from making a request to the W3C servers, does it?
Re:Speaking of caches... by cnettel · 2008-02-09 00:49 · Score: 4, Insightful

That doesn't stop the loading page from making a request to the W3C servers, does it? Worse, many of the existing cache implementations won't cache an error result.
Re:Speaking of caches... by Lennie · 2008-02-09 10:49 · Score: 1

The best way to get developers to do something is to build in a 180-second delay before sending the response back to the HTTP client.

It saves a lot of bandwidth too, because a lot of clients (or users) wouldn't even wait that long.

--
New things are always on the horizon
Re:Speaking of caches... by gL3nnX · 2008-02-19 01:05 · Score: 1

OMG! They must get HTML builders like Dreamweaver, Adobe GoLive, etc. to stop generating DTD requests to w3.org.

Don't put your number online .... by arse+maker · 2008-02-08 19:04 · Score: 1

This reminds me of why you cannot put phone numbers on TV shows or in movies. In both cases it's the same reason: people are idiots. Do people think they are going to ring up a movie character and talk to him? It's completely stupid, but guess what, it happens, so they put 555... If you are going to write your phone number on the world's biggest bathroom wall, you cannot complain about the number of calls you get. I feel the best response to their quote about all the traffic is "Duh!". In both the phone number and the URL examples, I don't see how anyone who thought about it wouldn't realise this is a poor idea. I like their complaints too; they're sure amusing. "Something is not working as we intended on the web, can everyone fix it?" Isn't that like the fire department calling someone to put out a fire they started? Now if only Microsoft would comment on how hard it is to get their web sites working cross-browser, what a wonderful world this would be.

Suck it up... due to lack of foresight by W3C by ScienceDada · 2008-02-08 19:19 · Score: 1

Anyone who has read the XML standard (or even a book like O'Reilly "XML in a nutshell") knows that the URI included in the doctype declaration and namespace attribute was never intended to actually be used to fetch DTDs. Duh... What sense does it make to ask people to read W3C blog statements when they didn't even RTFM and then "stop the insanity?" Perhaps W3C should give a bit more thought to their standards just as editors would comment on an author's choice of a pen name like "Harold Dick" or "Ben Dover."

OK. So what... even when I first read the specification (being quite a novice programmer at the time) I immediately thought using a URI for the DTD was poorly conceived, especially if there was actually a file accessible via HTTP at the URI. But it always struck me as retarded to say "here is a URI for the DTD... looks like a URL, smells like a URL, and actually points to a DTD, but don't really use it, even though you can." What did W3C really expect to happen? XML was designed to enable inexperienced programmers and web-monkeys to "be lazy" in that they could avoid authoring parsers and write web-based content with less effort (as long as it was well-formed).

Granted clients shouldn't download it according to the standard, but people don't always behave (or program) according to "the rules." My advice to W3C, do one of two things:

You authored and pushed the damn standard, so suck it up OR
Remove the DTD at the URL corresponding to the URI (since it doesn't need to be there) and eventually programmers will get the idea that there is not actually a file at that location.

Or, perhaps initially post simple text documents that state:
"There is not actually a DTD at this URI. Perhaps you should review the XML standard (i.e. RTFM)."

This is an easy one by Anonymous Coward · 2008-02-08 19:22 · Score: 0

Search for "java" in this discussion. There are currently 2 hits. One refers to brain-damaged Java libs. And the other refers to brain-dead Java devs. There's the source of the problem. Sure these DTD's haven't changed in years, but for years we've added more and more Java devs to the programming population. And Java dev, in general = don't know and don't care how things work. Just want to get it done, in whatever way is easiest for the dev.

Re:This is an easy one by bytesex · 2008-02-09 01:55 · Score: 1

It might be true; I don't know if it's still the case, but when I used it, in 2003, the standard Java XSLT processor used to /refuse/ to work without network connectivity because of its own standard DTD. Completely without imagination, these people.

--
Religion is what happens when nature strikes and groupthink goes wrong.

Making others fix their problems by Animats · 2008-02-08 19:36 · Score: 1

I can't tell you how many people send us bad data and flat out ignore the response.

Sometimes you can get things fixed at other sites. We have a list of major sites being exploited by phishing sites, which is updated every three hours by matching PhishTank (10,000 entries) against OpenDirectory (1.7 million entries), and looking for domains in both. We blacklist sites on a per-domain basis, and needed to measure and minimize the collateral damage.

When we started that list last November, it had 174 domains on it. After reports to abuse addresses, two articles in The Register, and help from PhishTank and the Anti-Phishing Working Group, we're down to 45 domains. Only eight of those domains have been on the list for more than 60 days. The remaining long term problem domains are five DSL providers, a free web hosting service, and two ordinary web sites that had break-ins they've never cleaned up. The rest of the list changes frequently, as sites are added to the list due to some problem, then removed from the list as the problem is fixed.

When we started, Google, Yahoo, MSN, and Dell were all on the list. They've all cleaned up their act. They just needed a little nudging.

With the legit sites tightened up, phishing blacklists become much more effective. It's now safe to blacklist entire base domains, not just URLs or subdomains. Anti-phishing tools just became more effective.

So, yes, you really can get such problems fixed.

Re:Making others fix their problems by MBCook · 2008-02-09 12:18 · Score: 1

We ask the bad offenders to fix things, and they usually do. It can take a while, but they'll usually do it. But it can take a while, and we have financial power over people we deal with (since we can cut or restrict traffic). The W3C doesn't have that power over most people.
The idea of a public shame list is a good one for those people who refuse to fix their own problems, and is far more realistic than the military "too bad you messed up, you can't come back until you fix it" approach (which is almost always completely untenable).

--
Comment forecast: Bits of genius surrounded by a sea of mediocrity.

Good job - Slashdot 'em! by RudeIota · 2008-02-08 20:05 · Score: 1

... Yes, that ought to fix the W3C's bandwidth usage...

--
Fact: Everything I say is fiction.

Re:I'm just conforming! by Anonymous Coward · 2008-02-08 20:31 · Score: 0

If you don't have a DTD or XSD, each tool is going to end up having its own hand-coded validator that does a half-assed job of enforcing the same rules. It'll be a big waste of effort and utterly unmaintainable.

And XML writers absolutely do need the DTD or XSD. I'd sooner go to work without pants than publish anything that doesn't even validate.

Poor design by Stan+Vassilev · 2008-02-08 21:37 · Score: 1

They should be happy the most popular browsers don't apply their recommendation and don't get the DTD even once. Because *then* they'd see hell.

DTDs would be part of the normal web page cache, and we know there are plenty of scenarios demanding that the cache be turned off for limited or extended periods of time. Development is just one of those.

I find it discouraging that at the slow speed W3C produce and finalize their recommendations, they are still full of badly designed gems, such as the DTD being downloadable from a single URL somewhere on some single site (not to get started on the DTD syntax itself).

They need help, and I'm glad WHATWG are there to help those guys with making HTML5 reality.

Serves them right by nagora · 2008-02-08 21:38 · Score: 1

It should never have been part of an HTML document. Reap what you sow.

TWW

--
"Encyclopedia" is to "Wikipedia" what "Library" is to "Some people at a bus stop"

Junk Comments anyway by Gonoff · 2008-02-08 21:49 · Score: 1

These comments don't actually do much. In fact, all they do is slow the system down.

The only reason I put them on the stuff I am working on at present is that it says in the first paragraph of my assignment to do so. If I don't, I will lose marks.

Straight html seems to be enough in reality...

<html> <head> <title>test</title> </head>

Yes, I know that it is not correct without all the other stuff. It does not check with Amaya or the W3C site if all the other stuff is missing. I see no benefit in putting anything else in there except perhaps

<meta http-equiv="Content-Type" content="text/html; charset=windows-1252"> <meta http-equiv="Content-Language" content="en-us">

This is to make sure it works anywhere on any computer. The rest does nothing and sites work fine without it. It seems like a waste of bandwidth...

--
I'll see your Constitution and raise you a Queen.

Re:Junk Comments anyway by cnettel · 2008-02-09 09:13 · Score: 1

You are aware that the same code will be interpreted radically differently with and without a proper declaration, right? ("Quirks mode" in both IE and Firefox.)
Re:Junk Comments anyway by Gonoff · 2008-02-10 14:36 · Score: 1

But why am I being taught to put these declarations in everything I do?

I wrote some simple pages of html - text, frames, pictures etc. I have tried them out under IE6, 7, firefox, amaya, opera, safari and lynx. Obviously no pictures in lynx but the only one that objects to them is amaya. It then shows everything fine anyway.

My point is that we are blindly putting these things in where there is no need. It seems a pity because other aspects of the standards make some useful differences - for example, like everything in lower case and proper nesting. They make it easier to follow someone else's stuff.

We are trying to put everything in xhtml where html is more appropriate. It seems a bit like saying that all car drivers need 5 point harnesses, crash helmets and flame proof clothing because it is needed by some. Many people stick to the speed limits and never do anything clever. We should be smart enough to stick to simple html where that is all we are needing.

--
I'll see your Constitution and raise you a Queen.

Google Was Blocked by Anonymous Coward · 2008-02-08 21:56 · Score: 0

Their enterprise department kept causing Google to get blocked. It seems their Google Search Appliance didn't cache the DTD. When you point a mess of those at test content of a million documents, each with a reference to a W3C-hosted DTD, it is a huge hit to them. The problem was pointed out to the developers, but that didn't fix the immediate problem (changes didn't happen overnight). So the files were all mirrored internally and a script went through all the test content, pointing it to the internal copies. Hopefully the hits aren't quite so bad now, but I know not every test file was fixed (it's easy to overlook stuff when you count documents in the millions).

I'd write the crap code. by r00t · 2008-02-08 22:47 · Score: 1, Insightful

If I wanted to get all URLs out of a document, I'd grab anything that looked like one. I'd not care if it was in an HTML comment, in body text, in some weird tag (img, a, object, embed, frame... and whatever some drunk browser developer concocted this morning), or wherever.

Simply put, I can not hope to correctly parse the mess in the same way as IE 7 or even Firefox 3. Why burn myself out trying, only to miss lots of stuff? To be really correct I'd probably need to execute everything from ActionScript to VBscript. Sorry, but NO FUCKING WAY.

The only way I'm going to avoid loading the DTD crap as a URL is with a URL blacklist.

Re:I'd write the crap code. by fyrewulff · 2008-02-09 00:36 · Score: 1

why not collect all URLs first and then apply a banlist against them?

--
"We need to get over this notion, that, for Apple to win... Microsoft must lose." - Steve Jobs, 1997
Re:I'd write the crap code. by BZ · 2008-02-09 05:08 · Score: 3, Informative

Just use html5lib instead of rolling your own; it'll parse pretty much "the same way" for all practical purposes.
Re:I'd write the crap code. by msuarezalvarez · 2008-02-09 05:10 · Score: 1

Then use a blacklist. There is no need for ALL CAPS and swearing... Just avoid being part of the problem.
Re:I'd write the crap code. by Hal_Porter · 2008-02-09 06:43 · Score: 1

BLOCK FUCKING CAPS replaces the previously deprecated tag in X(H)TML 5.0 Transitional.

--
echo -e 'global _start\n _start:\n mov eax, 2\n int 80h\n jmp _start' > a.asm; nasm a.asm -f elf; ld a.o -o a;
Re:I'd write the crap code. by mollymoo · 2008-02-09 15:41 · Score: 1

Why would you have to parse it the same way as IE, let alone execute VBScript? You only need to exclude the contents of the !DOCTYPE tag.
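To make the suggestion concrete, here is a minimal Python sketch of a "grab anything that looks like a URL" scraper that drops the DOCTYPE declaration first and then applies a blacklist. The regular expressions, the `extract_urls` name, and the blacklist prefix are all assumptions for illustration; in particular it assumes the DOCTYPE has no internal subset (no `[...]` section containing `>`).

```python
import re

# Assumption: the DOCTYPE declaration contains no internal subset,
# so it ends at the first ">" character.
DOCTYPE_RE = re.compile(r"<!DOCTYPE[^>]*>", re.IGNORECASE)

# Deliberately crude: anything that looks like an http(s) URL.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")

# Hypothetical blacklist of prefixes that should never be crawled.
BLACKLISTED_PREFIXES = ("http://www.w3.org/TR/",)


def extract_urls(markup):
    """Return URL-like strings from markup, skipping the DOCTYPE and blacklisted URLs."""
    without_doctype = DOCTYPE_RE.sub("", markup)
    return [
        url
        for url in URL_RE.findall(without_doctype)
        if not url.startswith(BLACKLISTED_PREFIXES)
    ]
```

The blacklist is still useful after stripping the DOCTYPE, since DTD URLs also show up in body text on pages that quote sample markup.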

--
Chernobyl 'not a wildlife haven' - BBC News

Re:I'm just conforming! by vidarh · 2008-02-08 23:22 · Score: 1

Conceptually, the DTD is there to define the data, and unless you know what is in the DTD, you cannot use it to validate, which is its purpose.

You completely miss the point. As long as the public identifier is the same, the DTD should not change, and so a well-behaved implementation will retrieve each DTD once and maintain a catalog mapping public identifiers to DTDs. Most well-behaved implementations DO.
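A catalog like that can be very small. The following Python sketch maps public identifiers to local files and refuses to fall back to the network; the `resolve_dtd` function, the `UnknownPublicId` exception, and the `dtds` directory are hypothetical names, and it assumes the DTD files were copied into that directory once, ahead of time.

```python
from pathlib import Path

# Public identifier -> local file name. The files are assumed to have been
# mirrored once into a local directory; nothing here touches the network.
CATALOG = {
    "-//W3C//DTD XHTML 1.0 Strict//EN": "xhtml1-strict.dtd",
    "-//W3C//DTD XHTML 1.0 Transitional//EN": "xhtml1-transitional.dtd",
}


class UnknownPublicId(LookupError):
    """Raised instead of silently fetching an unknown DTD from its system identifier."""


def resolve_dtd(public_id, catalog_dir="dtds"):
    """Map a public identifier to a local path (hypothetical helper)."""
    try:
        return Path(catalog_dir) / CATALOG[public_id]
    except KeyError:
        raise UnknownPublicId(public_id) from None
```

Failing loudly on an unknown identifier is a design choice in this sketch: it surfaces a missing catalog entry during development rather than turning it into a silent HTTP request in production.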

Stop serving it by zokier · 2008-02-08 23:52 · Score: 1

Somehow force people to host DTDs on their own servers. To enable caching, include an md5sum or something similar to identify the DTD.

Already Slashdotted last year ? by ze_jua · 2008-02-09 01:16 · Score: 1

http://developers.slashdot.org/article.pl?sid=07/01/17/1336257

Stupid design by Frozen+Void · 2008-02-09 01:26 · Score: 1

Guess they never heard of the term "idiot proof". Always assume that people will not understand or comply with the rules.

What's their website URN, anyway? by Prototerm · 2008-02-09 01:28 · Score: 3, Funny

Nothing, it's a non-profit.

(ducks and runs)

--
"My country, right or wrong; if right, to be kept right; and if wrong, to be set right." --Senator Carl Schurz (1872)

money solves all problems by wikinerd · 2008-02-09 02:00 · Score: 1

Track them down and send them a bill. They will notice pretty quickly and change their ways. If you can't find them declare all HTML/XHTML standards obsolete and release a new hypertext standard with obligatory built-in identification and automatic billing facilities.

Add Advertising to the XML! by nektra · 2008-02-09 02:28 · Score: 1

W3C can add some Ads to the XML and earn some money!

Unified caching across applications by Lurch00 · 2008-02-09 02:33 · Score: 1

I'm hesitant to blame the client developers. If someone throws a 10-line script together and runs it on a huge set of files, generating a huge number of requests for the DTD URL, obviously someone along the line ought to be caching that result. In my view, it's ultimately the responsibility of the client, but it's too much complexity to expose and implement for a program you may never run again.

So I look to the libraries. However, the last thing I want is a dozen different libraries putting a dozen different caches in a dozen different non-standard locations. Should the development community come up with a standard for how and where to cache HTTP resources? Is there any fundamental reason libcurl, for example, shouldn't be able to access an object that a web browser already cached?

Actually, come to think of it, the solution I like best is to punt and disable caching in all applications, and install a transparent caching proxy like squid either locally or on your LAN.

return banner ads by Uzik2 · 2008-02-09 03:32 · Score: 1

Then they'd get paid for all the traffic! ;)

--
-- Programming with boost is like building a house with lego. It's a cool but I wouldn't want to live in it

Unintentional humor? by Herschel+Cohen · 2008-02-09 03:37 · Score: 1

The link to the main content is extremely effective if no change is really sought. Instead of giving examples of simple, quick fixes one is instead sent to long arguments that contain every detail and counter argument to those that might resist any change. What attention is given those that would simply do as asked? Show me what works quickly without the gale wind force of excess verbiage!

Take this, please; it follows the PHP code sample on how to set a reasonable limit on cache use:

Remember that the Header() function MUST come before any other output.

As you can see, you'll have to create the HTTP date for an Expires header by hand; PHP doesn't provide a function to do it for you (although recent versions have made it easier; see the PHP's date documentation). Of course, it's easy to set a Cache-Control: max-age header, which is just as good for most situations.

For more information, see the manual entry for header.

See also the cgi_buffer library, which automatically handles ETag generation and validation, Content-Length generation and gzip content-coding for PHP scripts with a one-line include.

What you are NOT seeing are the multiple links (three: date documentation, manual entry ... and cgi_buffer library) that extend your effort with reading documentation of unknown length and dubious clarity. [Check out the manual entry for instructive misdirection.] Why not give the simplest case first and say that if one encounters problems, check here for help?
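For comparison, the "simplest case" could look something like the following. This is a Python sketch rather than the tutorial's PHP, and the `cache_headers` helper is invented for this example; it only shows how a `Cache-Control: max-age` value and a matching HTTP-date `Expires` value can be produced with the standard library.

```python
import time
from email.utils import formatdate


def cache_headers(max_age_seconds, now=None):
    """Return response headers telling clients how long they may reuse a resource."""
    now = time.time() if now is None else now
    return {
        "Cache-Control": f"max-age={max_age_seconds}",
        # formatdate(..., usegmt=True) yields an HTTP-style date ending in "GMT".
        "Expires": formatdate(now + max_age_seconds, usegmt=True),
    }
```

For a document that never changes, such as a published DTD, a long `max_age_seconds` (days or more) would be the natural choice, though the right value depends on the server's own policy.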

I really wonder how seriously a remedy is sought. I suspect more than a slight amount of posturing is implicit in this whole set piece. Are they that out of touch?

It should be an option, not a necessity to read entire threads that are more likely to confuse than elucidate those new to the subtler aspects of any given topic. Too often it is waste of time and energy.

don't blame the web developer :) by Anonymous Coward · 2008-02-09 07:01 · Score: 0

blame the browser programmer....developers are only sticking to standards and working workarounds here and there FOR the browser programmers already. :)

Thats what you get by Anonymous Coward · 2008-02-09 07:05 · Score: 0

It was ridiculous to do this in the first place. Perhaps the extra bandwidth usage will teach W3C a much-needed lesson for when they develop future protocols.

With all that usage, what all in our world breaks when they can no longer afford to pay for their mistake and shut the server down?

web browsers to blame by skiloh · 2008-02-09 07:31 · Score: 1

Web developers are the ones bending over backwards for web browser programmers. I must say these "browser programmers" do a lot for web developers also, but they are the ones who ultimately have to claim responsibility on this issue of DTD validation... I believe.

Re:I'm just conforming! by aleander · 2008-02-09 09:33 · Score: 1

No. The DTD is an agreement between the XML writer and the parser writer on the format of the XML to be used. The actual content of the DTD is completely irrelevant at run time as long as the incoming file says it complies with the DTD the parser expects. Any parser with more than "if (file.dtd == expectedDtd)" has failed. The only good reason I can think of to even touch the actual DTD at runtime would be for a general purpose XML validator, which, ironically, is a special case.

Entities. If the document is not standalone then even a non-validating parser has to get a DTD. From a local catalog, of course.

--
Segmentation fault. Ore dumped.

Where the traffic is coming from by Anonymous Coward · 2008-02-09 09:45 · Score: 0

Maybe it's all the webpages out there that list sample HTML code?

There must be millions of pages with the DTD's URL in the *body* of the page.

Translation by Anonymous Coward · 2008-02-09 12:09 · Score: 0

>> In short, here's the lesson learned:
>> 1) Some proportion of programmers don't know what they're doing and never will
>> 2) Some proportion of programmers are assholes

Here, let me translate ...

1) Some programmers are Indian
2) Some programmers are American

Speaking of Webmasters, Slashdot is really helping by unassimilatible · 2008-02-09 17:37 · Score: 1

By linking to W3-dot-org and slashdotting them. Thanks guys, you good samaritans! Can you come over later and throw some gasoline on my house that's on fire?

--
Slashdot "libertarians": Small government for me, big government for those I disagree with. -1, I disagree with you

All you can eat by sglines · 2008-02-10 02:10 · Score: 1

For $11.00 Yahoo will give you unlimited bandwidth and storage space. I think W3C should call them on their offering. :)

They asked for it by Ed+Avis · 2008-02-10 23:16 · Score: 1

Who the hell decided that the DTD should be identified with an http: URI anyway? It's as though some people think that any URI has to begin with http:. If you're not meant to fetch it using the hypertext transfer protocol, don't make a URI that says you should.

--
-- Ed Avis ed@membled.com

The cause is clueless "web designers" by Anonymous Coward · 2008-02-11 01:28 · Score: 0

If people would bother to learn HTML, this wouldn't happen. But due to the practice of self-taught "web designers" who copy and paste anything and everything, and who also type in code examples verbatim, this little-needed markup has been spread around. So, W3C should know that they are talking to a set of people who don't listen and don't care. Too bad.


334 comments