W3C Gets Excessive DTD Traffic
eldavojohn writes "It's a common string you see at the start of an HTML document, a URI declaring the type of document, but that is often processed causing undue traffic to W3C's site. There's a somewhat humorous post today from W3.org that seems to be a cry for sanity and asking developers and people to stop building systems that automatically query this information. From their post, 'In particular, software does not usually need to fetch these resources, and certainly does not need to fetch the same one over and over! Yet we receive a surprisingly large number of requests for such resources: up to 130 million requests per day, with periods of sustained bandwidth usage of 350Mbps, for resources that haven't changed in years. The vast majority of these requests are from systems that are processing various types of markup (HTML, XML, XSLT, SVG) and in the process doing something like validating against a DTD or schema. Handling all these requests costs us considerably: servers, bandwidth and human time spent analyzing traffic patterns and devising methods to limit or block excessive new request patterns. We would much rather use these assets elsewhere, for example improving the software and services needed by W3C and the Web Community.' Stop the insanity!"
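For a sense of scale, the figures quoted in the post can be turned into per-second rates. This is only a back-of-the-envelope sketch: it assumes the daily peak is spread evenly over 24 hours, which real traffic certainly is not.

```python
# Rough conversion of the figures quoted from the W3C post.
REQUESTS_PER_DAY = 130_000_000   # "up to 130 million requests per day"
PEAK_MBPS = 350                  # "sustained bandwidth usage of 350Mbps"
SECONDS_PER_DAY = 24 * 60 * 60

# Average request rate if the load were perfectly even (an assumption).
requests_per_second = REQUESTS_PER_DAY / SECONDS_PER_DAY

# Megabits to megabytes: 8 bits per byte.
megabytes_per_second = PEAK_MBPS / 8

print(f"{requests_per_second:.0f} requests/second on average")
print(f"{megabytes_per_second:.2f} MB/s during sustained peaks")
```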
"Webmasters" strike again. Clowns.
The Kruger Dunning explains most post on
"oops"
I have a solution to the problem; I wrote it down at http://www.w3.org/TR/html4/strict.dtd !
Do what any other respectable web provider would do..
Put links to Goatse in the definitions!
Oh, that was you? I thought that making every webauthor refer to a W3C URL in every web page was going to get someone in trouble someday. Today seems to be someday.
It's a good thing we don't contribute to the problem - Oh, wait...
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<title>Slashdot: News for nerds, stuff that matters</title>
Isn't this what you call "eating your own dogfood"?
Have they tried delaying the response by 5 or 6 seconds? It could cause a lot of applications to hang pretty badly. That or just serve a completely nonsensical schema every thousandth request. Gotta keep developers on their toes.
I'm surprised none of the CDNs out there have volunteered to host this file - the problem is they'd have to host the entire w3.org site, or else move the rest of it to another hostname.
They insist that every document begin with a declaration that includes a link to their site. Now they are complaining about traffic.
The link in the declaration serves absolutely NO purpose other than to comply with the standard that they created. It sounds like the whole purpose was so that they could have every source page begin with a link to their site. Serves them right.
Add some sort of caching parameter to the DTD spec, that specifies how long browsers should cache those DTDs.
Another potential solution: Have browsers keep the DTDs cached, and then check the file date periodically when re-requested. This will still put some load on the w3c's servers, but significantly less than complete re-downloads.
The answer to this problem is quite easy.
Continue to host the data referenced on a single T-1 line. That will cut your expenses to the bone since you'll never exceed 1.54 Mbps and that should be quite cheap. And, any dumfuxorz who fubarred their parser to not cache these basically static values will probably figure it out... very quickly.
You don't have to leave it on the T-1, maybe just 1 month out of the year. Every year.
Problem solved!
I have no problem with your religion until you decide it's reason to deprive others of the truth.
Don't click the link, it's malware.
this is my sig
Serves them right for forcing us to include the same long urls that point to files that never change in every single HTML file ever.
the doctype was being used during an xsl transform in our build process; when the hibernate site flaked out, the builds would fail intermittently.
solution was to add an xmlcatalog using a local resource.
bet this happens a lot more than most people realize; we'd been doing this for years before we noticed a problem.
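The catalog idea described above can be sketched in Python with the standard library's SAX interfaces. This is a minimal illustration, not the poster's actual build setup: `LOCAL_CATALOG`, its placeholder DTD bytes, and the class name are assumptions made for the example.

```python
import io
from xml.sax.handler import EntityResolver
from xml.sax.xmlreader import InputSource

# Hypothetical catalog: public identifier -> bytes of a locally stored DTD.
# In a real build these would be files kept next to the build script.
LOCAL_CATALOG = {
    "-//W3C//DTD HTML 4.01//EN": b"<!-- local copy of strict.dtd goes here -->",
}


class LocalCatalogResolver(EntityResolver):
    """Answer entity lookups from the local catalog and never use the network."""

    def resolveEntity(self, publicId, systemId):
        source = InputSource(systemId)
        # Unknown identifiers get an empty stream instead of a remote fetch.
        data = LOCAL_CATALOG.get(publicId, b"")
        source.setByteStream(io.BytesIO(data))
        return source


resolver = LocalCatalogResolver()
resolved = resolver.resolveEntity(
    "-//W3C//DTD HTML 4.01//EN", "http://www.w3.org/TR/html4/strict.dtd"
)
local_bytes = resolved.getByteStream().read()
```

A resolver like this can be handed to an `xml.sax` reader via `setEntityResolver`, so that a build keeps working when the remote site is unreachable.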
Or the routers. Frankly, if the result is known to not change, w3 could probably agree with the network authorities to put copies around the net and treat those heavily used URIs as URNs and just never go to w3 (or rarely go there) instead.
The notion that URNs have to be known in advance as "the popular thing" rather than being discovered after-the-fact by noticing high-volume URIs is probably the real bug here.
Kent M Pitman
Philosopher, Technologist, Writer
A plea to the web community to stop pinging the W3C DTDs isn't going to solve anything. What will work is blocking any unnecessary DTD traffic aggressively, and if that doesn't do the job, blocking it even more aggressively. Intelligently designed software / ISPs / routers will cache, filter and block these requests for the sake of their own efficiency, bandwidth, and proper function. Buggy, bloated and inefficient applications won't. Nothing's ever going to convince the 'web community' to stop pinging the DTDs out of an altruistic concern for W3C's servers, it will need to become beneficial for those software developers to devote the extra development/debugging/patching efforts to do so.
That is supposed to be there according to the standard. And all the major browsers cache that file after loading it (at most) once, and then never read it again. So no, slashdot is not causing a problem. The problem is all the other HTML processing software besides browsers that does not cache its DTD files, not the files that contain the reference.
If you want to complain, it should be the fact that slashdot is serving a strict.dtd when it doesn't validate against it.
I can't think of a problem that is simpler to solve. Just stop serving these documents. The offending programs will be fixed very quickly.
So, w3c complains about their bandwidth, and the response is: The Slashdot Effect. Doesn't that make the old bandwidth problem seem less of a problem?
I'm just loving the irony in that.
And it is only 4 articles down.. Host with Yahoo! Yahoo Offers All-You-Can-Eat Storage and Bandwidth http://hardware.slashdot.org/article.pl?sid=08/02/08/1811236
The spec already recommends this and all the major browsers do it. The software that is causing the problem is the generic XML/SGML processing packages, which were designed to be able to deal with documents with any random DTD, not just the main HTML/XHTML ones from W3C. They are the ones downloading each DTD every single time and not caching it, contrary to the standard. Sometimes caching is a configuration option which defaults to off and administrators never turn it on.
Great, they cry "we get too much traffic", so we go ahead and slap them on the front page of slashdot. Sick, sick fucking joke.
It is dangerous to be right when the government is wrong.
I always thought it was stupid that XML documents include reference to a DTD hosted on a remote server that you do not maintain. This is wrong on so many levels, I don't even know where to begin:
1. The validation will not work if the remote server is down, or network is down, or your connection to the internet is down, or if the file is not accessible for any other reason.
2. You are at the mercy of some third-party to ensure that the file is correct and that it doesn't change.
3. You are susceptible to man-in-the-middle attack.
etc.
For some insane reason, all XML examples have this reference to a remote URL. Most people never change defaults, so we get in a situation where nearly every time XML is validated, W3C site gets hit. The geniuses at W3C should have thought of that *before* this happened. Now they have to live with it...
___
If you think big enough, you'll never have to do it.
Browsers cache the DTDs.
There, you can now stop posting your hilarious "jokes".
sic transit gloria mundi
The only thing I'm unclear on is whether your average browser is contributing to this problem when parsing properly written documents.
I'm sick of following my dreams. I'm just going to ask where they're goin' and hook up with 'em later.
Load too big on your server, need to slow down the traffic a bit.
Slashdot it.
That should work
This space intentionally left blank.
You can't take the sky from me...
I've got to say, this doesn't surprise me at all. In the time I've spent at my job, I've been repeatedly floored by the amazing conduct of other companies' IT departments. We've only encountered two people I can think of who have been hostile. Everyone else has been quite nice. You'd think people would have things set up well, but they don't.
We've seen many custom XML parsers and encoders, all slightly wrong. We've seen people transmitting very sensitive data without using any kind of security until we refused to continue working without SSL being added to the equation. We've seen people who were secure change their certificates to self-signed, and we seem to consistently know when people's certificates expire before they do.
But even without these things, I can't tell you how many people send us bad data and flat out ignore the response. We get all sorts of bad data sent to us all the time. When that happens, we reply with a failure message describing what's wrong. Yet we get bits of stuff all the time that is wrong, in the same way, from the same people. I'm not talking about sending us something that they aren't supposed to (X when we say only Y), I'm saying invalid XML type wrong... such that it can't be parsed.
We have, a few times while I've been there, had people make a change in their software (or something) and bombard us with invalid data until we either block their IP or manage to get into voice contact with their IT department. Sometimes they don't even seem to notice the lockout.
Some places can be amazing. Some software can be poorly designed (or something can cause a strange side effect, see here). I really like one of the suggestions in the comments on the article... start replying really slow, and often with invalid data. They won't do it. I wouldn't. But I like the idea.
Comment forecast: Bits of genius surrounded by a sea of mediocrity.
which is never ever learned...
A freely accessible network resource is begging to be driven, smoking and shattered, into the ground by the ill-mannered, ill-trained, or ill-intentioned hordes.
Personally, I blame the introduction of AOL in 1994 to the Usenet for this downward spiral. We were doing just fine before all you "me too"s started pouring in.
Get off my lawn, you clueless kids!
Welcome to the Panopticon. Used to be a prison, now it's your home.
Perhaps they will stop putting HTTP-URLs in standardized tags now... Also, enjoy life as a web content provider who spends many hours per week blocking Referers (nice typo in the original RFC!) and dealing with broken clients, something that the W3C never spent much time pondering about.
"I love my job, but I hate talking to people like you" (Freddie Mercury)
If the problem is that it gets served out too many times, then make the server slow as molasses. If it takes 1-2 minutes to get the DTD from the server, or more, then it is quickly discovered by the performance teams.
--
Thorbjørn Ravn Andersen "...and...Tubular Bells!"
Must Add 5 miles of data to my code now, they need MORE DATA!!!
Sigh.
Try getting a clue.
Maybe not
That doctype is simply <!DOCTYPE HTML>!
Beware: In C++, your friends can see your privates!
The problem does not lie in the mechanism itself - it's in the documentation - or the lack of understandable (or at least often-used) docs directly at the source.
/.'s front page might actually help improve the situation. But.. it's like this with stupid programmers - they never die out, they'll always create problems. Let's get used to it.
Simple caching on client side could already improve the situation a whole lot... BUT:
When people implement something for html-ish or svg-ish or xml-ish purposes, they go google for it: "Howto XML blah foo" - result, they're getting basic screw-it-with-a-hammer tutorials that don't point out important design decisions, but instead Just Work - which is what the author wanted to achieve when they started writing the software.
It's a little bit like people still using ifconfig on Linux though it's been deprecated and superseded by iproute2. But since most tutorials and howtos on the net are just dumbed-down copypasta for quick and dirty hacks - and since nobody fucking enforces the standards - nobody does it the Right Way.
So if I start writing some sax-parser, some html-rendering lib, some silly scraper, whatnot... and the first example implementations only deal with basic stuff and show me how to do it so basic functionality can be implemented... and I'm not really interested in that part of the program anyways, because I need it for putting something more fancy on top... once after I'm through with the initial testing of this particular subsystem, I won't really care about anything else. It works, it doesn't seem to hit performance too badly, it's according to some random guy's completely irrelevant blog - hey, this guy knows what he's doing. I don't care!
This story hitting
I'm an infovore...
What are the user agents making the requests? Do these programs identify themselves with a UA string or something?
It is dangerous to be right when the government is wrong.
What's the point of having a DTD if it won't change? Oh yeah, there is none. Conceptually, the DTD is there to define the data, and unless you know what is in the DTD, you cannot use it to validate, which is its purpose. And conceptually, if you assume the data is defined a certain way, you don't need a DTD.
Generally the DTD is for the person parsing the XML. If you're writing the XML, you don't need a DTD, because you already know the schema. If it's only for the XML writers, all you'd need to do is place your schema with the rest of the specs for your application.
Now I wasn't suggesting that in practice you should go to the server every time and fetch the DTD. But clearly you take things too seriously.
Try getting a sense of humor.
I bet Slashdot.org could possibly find some bloggers that would be more than happy to receive that traffic!
expandfairuse.org
I'm sorry, the typo's mine. I made it when I was working late one night and spilt spaghetti down my shirt. I had no idea that it would propagate so far and ruin the web. Oops. Anyway I've fixed it, but it's not in the stable CVS branch yet so I'm afraid you'll just have to put up with it for a while longer.
(For those without a sense of humour, yes this is a joke)
These posts express my own personal views, not those of my employer
it's not too hard to host the dtd file on your own server, amirite? not like it's gonna change....
People would have another reason to complain about their ISP's quirks.
I think they screwed up, and brought this on themselves. I already thought that it was annoying having so verbose an identifier... this just makes it more hateful.
If they'd at least made the identifier NOT a URI, something like domain.example.com::[path/]versionstring, or something else that wasn't a URI, so it was clearly an identifier even if it was ultimately convertible to a URI, they would have avoided this kind of problem.
Classic problem of using something developed for one purpose (a remote resource locator) for something else (a unique identifier).
If they had done something simple, like using some non-functional protocol identifier such as 'ident' (e.g. xmlns="ident://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"), browsers and other software would have been developed to never actually 'do' anything with such a URI.
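The "identifier, not locator" point can be illustrated with a scheme check before any download is attempted. This is a sketch only: the `ident` scheme is the poster's hypothetical, and the `FETCHABLE_SCHEMES` set and function name are assumptions for the example.

```python
from urllib.parse import urlsplit

# Schemes this hypothetical processor is willing to dereference at all.
FETCHABLE_SCHEMES = {"http", "https", "ftp", "file"}


def is_fetchable(uri: str) -> bool:
    """Return True only if the URI's scheme is one we would ever download."""
    return urlsplit(uri).scheme.lower() in FETCHABLE_SCHEMES


# An identifier with a non-network scheme is never fetched.
print(is_fetchable("ident://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"))
# The same path under http looks like something to download.
print(is_fetchable("http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"))
```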
That sounds like a DTD thing to do! If you are a dee, please don't marry a tee, because if you marry a tee, your kids will be DEE TEE DEE.
First off, I don't know much about DTDs, but from what I can tell, it's like a template, like a Cascading style sheet, or something like that. That said...
Why did they even allow people to link to this thing in the first place? I think that they could have predicted that this would happen, simply because the web is huge and if even a small percentage of all the servers on the internet start to link to the code, they are going to get a massive influx of requests demanding this information.
Knowing this, I wouldn't let people link directly to the code. That doesn't mean that they can't use it, (they can use it by downloading the code onto their own computers and hosting it there) but I would make sure that they can't link directly to my servers. Don't get me wrong, it's nice of them to let us link to their code. However, when you provide a useful piece of software for everyone to link to, you gotta expect that people are going to take full advantage of linking your code if you let them, whether they link it efficiently or not.
Creating a standard that would allow people to host DTD's all over the web and fetch them automatically was major design stupidity, not just because people need to host that stuff, but because it misses the point of standardization in the first place.
"and unless you know what is in the DTD, you cannot use it to validate"
That's the most contrary-to-reality assertion I've seen in my whole life (well, along with some deity being one and three at the same time): you can *always* validate a document against a DTD without having a fucking idea what you're going to find within until that moment. That's the very point of an SGML DTD.
Now, the semantics is another story.
Here's an example of what correct markup should look like: The documented standard uses a URL that links to the W3C's copy of the DTD only as an EXAMPLE. The standard DOES NOT REQUIRE usage of the URL to W3C's copy of the DTD. Responsible developers use a URL that links to their OWN COPY of the DTD. ANYTHING else is just leeching from W3C. PERIOD.
Sorry W3C, but if I don't include it in my webpage, IE goes into the dreaded quirks mode!
*forwards article to Microsoft*
Hydraulic pizza oven!! Guided missile! Herring sandwich! Styrofoam! Jayne Mansfield! Aluminum siding! Borax!
... you can't blame Microsoft for this problem! After all, IE ignores pretty much all web standards and best practices, and does its own thing!
#DeleteChrome
Try learning the difference between humour and uninformed bullshit.
They need to petition the parser designers... to include cached copies of these documents "that have not changed in years" with the parser distributions themselves. And then petition the parser designers to turn caching on by default. Caching too much of a pain in the ass? Then create two caches, one for authoritative schemas that are highly unlikely to change, and one for third-party schemas where aggressive caching is undesirable.
Of course, this is just a guess, but my instinct tells me that most of this traffic is being generated by people who don't even know better, and probably don't read slashdot, and may not read w3c.org. Making caching 'the default' is probably the only way you'd see a noteworthy drop in traffic.
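The two-cache suggestion above might look roughly like the sketch below. It folds both caches into one class with two expiry policies. The host list, the TTL values, and the `fetch` callable are all assumptions chosen for illustration, not anything a real parser ships with.

```python
import time

# Hypothetical policy: URL prefixes whose schemas are treated as frozen.
AUTHORITATIVE_PREFIXES = ("http://www.w3.org/",)
LONG_TTL = 365 * 24 * 3600   # one year for schemas that "have not changed in years"
SHORT_TTL = 3600             # one hour for third-party schemas


class SchemaCache:
    """Cache schema bodies, keeping authoritative ones far longer."""

    def __init__(self, fetch, clock=time.time):
        self._fetch = fetch      # callable(url) -> bytes, supplied by the caller
        self._clock = clock      # injectable so the policy can be tested
        self._entries = {}       # url -> (expires_at, body)

    def get(self, url):
        now = self._clock()
        hit = self._entries.get(url)
        if hit is not None and hit[0] > now:
            return hit[1]        # still fresh: no network traffic at all
        body = self._fetch(url)
        ttl = LONG_TTL if url.startswith(AUTHORITATIVE_PREFIXES) else SHORT_TTL
        self._entries[url] = (now + ttl, body)
        return body
```

With caching on by default, a second request for the same W3C DTD within the long expiry window never leaves the machine.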
Yet we receive a surprisingly large number of requests for such resources: up to 130 million requests per day
I want to run their AdSense program. Cha-ching!
That's our life, the big wheel of shit. - The Fat Man, Blue Tango Salvage
I doubt Explorer or Firefox ever follow the URL. The visitors most likely are web spiders.
Mainstream web browsers do not validate pages. They do notice the Doctype declaration at the top of your page, but use it only to choose a rendering mode: old 1990s ("quirks mode") or "standards" mode. The two modes are only slightly different and have to do with the font size of tables, width calculation of block-level elements, etc.
- a web page with no doctype gets rendered in quirks mode
- a web page with one of the doctypes usually gets rendered in standards mode
There are other details that those who are interested in can Google for -- Firefox actually has three rendering modes: quirks, almost-standards, and standards mode. But even then, it is based on a regular-expression match on the doctype, not by following the URL at the end of it.
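The mode switch described above needs nothing but the text of the page. The sketch below is a deliberately simplified two-mode version: real browsers use longer, browser-specific lists of public identifiers, and the function name and regular expression here are assumptions for illustration.

```python
import re

# Match a doctype declaration at the start of the page; the URL inside it
# is never dereferenced, only the text is inspected.
DOCTYPE_RE = re.compile(r"<!DOCTYPE\s+html\b[^>]*>", re.IGNORECASE)


def rendering_mode(page: str) -> str:
    """Choose a rendering mode from the doctype text alone."""
    if DOCTYPE_RE.match(page.lstrip()) is None:
        return "quirks"      # no doctype: fall back to the old 1990s behaviour
    return "standards"       # any recognised doctype: standards mode


strict_page = (
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"\n'
    '"http://www.w3.org/TR/html4/strict.dtd">\n<html></html>'
)
print(rendering_mode(strict_page))
print(rendering_mode("<html></html>"))
```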
If you want news from today, you have to come back tomorrow.
LOL... How are you going to validate a document against a DTD without knowing what is in the DTD? That's like trying to send a SOAP message without the WSDL. As far as the validating parser is concerned, the DTD could be anything, and the DTD is exactly what it uses to do the validation. That's what determines if the document is valid.
What's the point in having a DTD if it *can* change? All files using the old version would be invalidated. And more important, any parser made for the old version would start rejecting XML that it could parse. That's part of the reason why it doesn't work like that.
No. The DTD is an agreement between the XML writer and the parser writer on the format of the XML to be used. The actual content of the DTD is completely irrelevant at run time as long as the incoming file says it complies with the DTD the parser expects. Any parser with more than "if (file.dtd == expectedDtd)" has failed. The only good reason I can think of to even touch the actual DTD at runtime would be for a general purpose XML validator, which, ironically, is a special case.
Maybe not
A reference is a link. "http://www.w3.org/TR/html4/strict.dtd" is a link. If it weren't a link, then it wouldn't get traversed and become a source of traffic!
And DTD and SGML might be older than W3C, but we are talking about HTML and XHTML, which are W3C standards. W3C is the one that decided that their "standards" must comply with DTD and SGML, even though there are plenty of anomalies and special rules that browsers know how to handle anyway.
I prefer not even using the dtd declaration, and I am guessing that 99% of the pages served do not need it. Only when you are seriously pushing HTML to its limits do DTD differences come into play anyway. And most html coders know that keeping HTML simple and doing everything on the server side using tools like PHP is far better, because assuming the browser is fully compliant to each DTD version is unreliable. Mainly because browsers don't fully comply with W3C anyway.
... for foisting XML on the world.
That is all.
You put a URL everywhere, don't be surprised if someone visits it even just out of curiosity.
DDoS yourself.
Enjoy your 1000 requests per second, you practically asked for them (even if in theory you didn't).
I hope this makes the W3C people start thinking more about what happens in the real world and design their stuff better.
W3C: "But but but, this is not a hyperlink, it's only a machine-readable way to say 'this is HTML'..."
How about using version numbers for _standard_ DTDs instead, and only have URIs for _custom_ DTDs (guess how many would use those in practice or be able to...)?.
The W3C likes to say stuff like "Browser makers must/should raise a security exception if XYZ goes wrong", that's all very nice in the "theory" world.
The real world doesn't work that way, so design stuff better please. Design stuff that breaks reasonably gracefully and safely.
well, this may sound stupid, but what if ISPs were to rsync the w3c dtds and host them, and some modified bind pointed the w3c dtd address to the ISP's own webserver? or a dns record that sends each hit to a different machine? ok me dumb, good day to you :)
As a developer on the other side, caching is not the solution to this problem. Caching is implemented client side as a *benefit to the client*, not the server. On some platforms (that need to verify XML) caching isn't even really possible. The extra overhead of caching (where do you put it, what program/process maintains it, how is it configured) is entirely unnecessary. The real problem lies in the spec and the various examples floating around.
I remember from a University project a year or so ago, we had some weird issues where our model loading code would hang when run on systems that didn't have a direct internet gateway available... turned out the .NET XML parser library was trying to fetch and validate the X3D schema for every model it loaded, downloading it each time. These are the kind of defaults that cause issues, if it weren't for the web proxy issue we would have probably never realised it was doing that!
This is the funniest thing I've ever read in a comment. Seriously, I actually laughed aloud. I mean... the perspective, it's awesome: "I can't believe I won the lottery"... "I can't believe that guy just hit me"... "I can't believe I just accidentally drank a gallon of antifreeze instead of a shot of whiskey"... "Gah! I can't believe that I just ranked five separate comments in five different places! How the hell did that happen?!"
I'm pretty sure they are the ones who developed the system that requires all web documents to reference their DTD. Last time I tried omitting such a reference and ran it through the validator, the validation failed ;)
Seems to me that they designed a system that allows them to change the document specification used by billions of documents by modifying a single document. If they never made use of that ability, and that design decision costs them ridiculous amounts of resources, that is their own problem. I know a web developer can't omit that reference from their document without violating standards; would it also violate standards for the client application to ignore requests for the standard DTDs?
Why can't the software cache it, then? Lazy developers...
I'm sure good ol' HEAD, If-Modified-Since, and ETag/If-None-Match would be a LOT less bandwidth. Or are they getting that much in cache hits alone?
Could also refuse to serve it except in gzip format.
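The conditional-request idea above can be sketched by building a request that carries the validators saved from an earlier download. The ETag and date values in the usage lines are made-up placeholders, and the function name is an assumption. No request is actually sent here.

```python
from urllib.request import Request


def conditional_request(url, etag=None, last_modified=None):
    """Build a GET that lets the server reply '304 Not Modified' with no body."""
    request = Request(url)
    if etag:
        request.add_header("If-None-Match", etag)
    if last_modified:
        request.add_header("If-Modified-Since", last_modified)
    # Ask for a compressed body in case the resource did change.
    request.add_header("Accept-Encoding", "gzip")
    return request


# Placeholder validators, as if remembered from a previous response.
req = conditional_request(
    "http://www.w3.org/TR/html4/strict.dtd",
    etag='"example-etag"',
    last_modified="Fri, 24 Dec 1999 23:37:48 GMT",
)
```

A client that sends these headers still makes a round trip per check, so this reduces bandwidth rather than request count. A local cache with a long expiry removes the request altogether.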
Don't thank God, thank a doctor!
This reminds me of why you cannot put phone numbers on TV shows or movies. In both cases it's the same reason: people are idiots. Do people think they are going to ring up a movie character and talk to him? It's completely stupid, but guess what, it happens, so they put 555... If you are going to write your phone number on the world's biggest bathroom wall, you cannot complain about the number of calls you get. I feel the best response to their quote about all the traffic is "Duh!". In both the phone number and the URL examples, I don't see how anyone who thought about it wouldn't realise this is a poor idea. I like their complaints too, it's sure amusing. "Something is not working as we intended on the web, can everyone fix it?" Isn't that like the fire department calling someone to put out a fire they started? Now if only Microsoft would comment on how hard it is to get their web sites working cross-browser, what a wonderful world this would be.
OK. So what... even when I first read the specification (being quite a novice programmer at the time) I immediately thought using a URI for the DTD was poorly conceived, especially if there was actually a file accessible via HTTP at the URI. But it always struck me as retarded to say "here is a URI for the DTD... looks like a URL, smells like a URL, and actually points to a DTD, but don't really use it, even though you can." What did W3C really expect to happen? XML was designed to enable inexperienced programmers and web-monkeys to "be lazy" in that they could avoid authoring parsers and write web-based content with less effort (as long as it was well-formed).
Granted clients shouldn't download it according to the standard, but people don't always behave (or program) according to "the rules." My advice to W3C, do one of two things:
- You authored and pushed the damn standard, so suck it up OR
- Remove the DTD at the URL corresponding to the URI (since it doesn't need to be there) and eventually programmers will get the idea that there is not actually a file at that location.
Or, perhaps initially post simple text documents that state:"There is not actually a DTD at this URI. Perhaps you should review the XML standard (i.e. RTFM)."
Search for "java" in this discussion. There are currently 2 hits. One refers to brain-damaged Java libs. And the other refers to brain-dead Java devs. There's the source of the problem. Sure these DTD's haven't changed in years, but for years we've added more and more Java devs to the programming population. And Java dev, in general = don't know and don't care how things work. Just want to get it done, in whatever way is easiest for the dev.
I can't tell you how many people send us bad data and flat out ignore the response.
Sometimes you can get things fixed at other sites. We have a list of major sites being exploited by phishing sites, which is updated every three hours by matching PhishTank (10,000 entries) against OpenDirectory (1.7 million entries), and looking for domains in both. We blacklist sites on a per-domain basis, and needed to measure and minimize the collateral damage.
When we started that list last November, it had 174 domains on it. After reports to abuse addresses, two articles in The Register, and help from PhishTank and the Anti-Phishing Working Group, we're down to 45 domains. Only eight of those domains have been on the list for more than 60 days. The remaining long term problem domains are five DSL providers, a free web hosting service, and two ordinary web sites that had break-ins they've never cleaned up. The rest of the list changes frequently, as sites are added to the list due to some problem, then removed from the list as the problem is fixed.
When we started, Google, Yahoo, MSN, and Dell were all on the list. They've all cleaned up their act. They just needed a little nudging.
With the legit sites tightened up, phishing blacklists become much more effective. It's now safe to blacklist entire base domains, not just URLs or subdomains. Anti-phishing tools just became more effective.
So, yes, you really can get such problems fixed.
... Yes, that ought to fix the W3C's bandwidth usage...
Fact: Everything I say is fiction.
If you don't have a DTD or XSD, each tool is going to end up having its own hand-coded validator that does a half-assed job of enforcing the same rules. It'll be a big waste of effort and utterly unmaintainable.
And XML writers absolutely do need the DTD or XSD. I'd sooner go to work without pants than publish anything that doesn't even validate.
They should be happy the most popular browsers don't apply their recommendation and don't get the DTD even once. Because *then* they'd see hell.
DTDs would be part of the normal web page cache, and we know there are plenty and plenty of scenarios demanding that cache be turned off for limited or extended periods of time. Development is just one of those.
I find it discouraging that at the slow speed W3C produce and finalize their recommendations, they are still full of badly designed gems, such as the DTD being downloadable from a single URL somewhere on some single site (not to get started on the DTD syntax itself).
They need help, and I'm glad WHATWG are there to help those guys with making HTML5 reality.
TWW
"Encyclopedia" is to "Wikipedia" what "Library" is to "Some people at a bus stop"
These comments don't actually do much. In fact all they do is slow the system down.
The only reason I put them on the stuff I am working on at present is that it says in the first paragraph of my assignment to do so. If I don't, I will lose marks.
Straight html seems to be enough in reality...
Yes, I know that it is not correct without all the other stuff. It does not check with Amaya or the W3C site if all the other stuff is missing. I see no benefit in putting anything else in there except perhaps
This is to make sure it works anywhere on any computer. The rest does nothing and sites work fine without it. It seems like a waste of bandwidth...
I'll see your Constitution and raise you a Queen.
Their enterprise department kept causing Google to get blocked. It seems their Google Search Appliance didn't cache the DTD. When you point a mess of those at test content of a million documents, each with a reference to a W3C-hosted DTD, it was a huge hit to them. The problem was pointed out to the developers, but that didn't fix the immediate problem (changes didn't happen overnight). So the files were all mirrored internally and a script went through all the test content pointing to the internal copies. Hopefully, the hits aren't quite so bad, but I know not every test file was fixed (it's easy to overlook stuff when you count documents in the millions).
If I wanted to get all URLs out of a document, I'd grab anything that looked like one. I'd not care if it was in an HTML comment, in body text, in some weird tag (img, a, object, embed, frame... and whatever some drunk browser developer concocted this morning), or wherever.
Simply put, I can not hope to correctly parse the mess in the same way as IE 7 or even Firefox 3. Why burn myself out trying, only to miss lots of stuff? To be really correct I'd probably need to execute everything from ActionScript to VBscript. Sorry, but NO FUCKING WAY.
The only way I'm going to avoid loading the DTD crap as a URL is with a URL blacklist.
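For what it's worth, the grab-everything-plus-blacklist approach described above might look roughly like this in Python. It is only a sketch: the regular expression is deliberately loose, and the `BLOCKED_PREFIXES` entries are an assumed starting list, not anything the W3C publishes.

```python
import re

# Loose pattern: anything that looks like an http(s) URL, wherever it appears
# (comments, body text, attributes). No attempt is made to parse the markup.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")

# Assumed blacklist for illustration: prefixes that usually identify DTDs and
# namespaces rather than pages worth fetching.
BLOCKED_PREFIXES = (
    "http://www.w3.org/TR/",
    "http://www.w3.org/1999/xhtml",
)


def extract_urls(text):
    """Return every URL-looking string in text, minus blacklisted prefixes."""
    urls = []
    for match in URL_PATTERN.finditer(text):
        url = match.group(0)
        if url.startswith(BLOCKED_PREFIXES):
            continue  # identifier, not a link to crawl
        urls.append(url)
    return urls
```

The obvious trade-off is that a prefix as broad as `http://www.w3.org/TR/` also skips real links to W3C specifications, so a crawler that cares about those would need a narrower list.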
You completely miss the point. As long as the public identifier is the same, the DTD should not change, and so a well-behaved implementation will retrieve them once and implement a catalog of public identifiers to DTDs. Most well-behaved implementations DO.
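A minimal sketch of such a catalog in Python, assuming the DTDs were saved once into a local directory. The public identifiers are the standard HTML 4.01 and XHTML 1.0 ones; the `dtds` directory, the file names, and the `resolve_dtd` function are made up for illustration.

```python
from pathlib import Path

# Assumed location of DTD copies that were downloaded once, ahead of time.
DTD_DIR = Path("dtds")

# Public identifier -> local file name (file names are hypothetical).
CATALOG = {
    "-//W3C//DTD HTML 4.01//EN": "html4-strict.dtd",
    "-//W3C//DTD XHTML 1.0 Strict//EN": "xhtml1-strict.dtd",
    "-//W3C//DTD XHTML 1.0 Transitional//EN": "xhtml1-transitional.dtd",
}


def resolve_dtd(public_id, system_id=None):
    """Map a public identifier to a local DTD path without touching the network.

    system_id is accepted only so the caller can pass the DOCTYPE as-is;
    it is never fetched.
    """
    name = CATALOG.get(public_id)
    if name is None:
        raise LookupError(f"no local copy registered for {public_id!r}")
    return DTD_DIR / name
```

Failing loudly on an unknown identifier is a design choice here: it makes a missing catalog entry visible instead of silently falling back to an HTTP request.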
Somehow force people to host DTDs on their own servers. To enable caching, include an md5sum or something similar to identify the DTD.
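One way to read the digest idea, sketched in Python: store each DTD under a hash of its content, so a document could name the digest and any host could serve the bytes. This uses SHA-256 rather than the md5sum mentioned above, and `DigestStore` is a hypothetical name, not part of any existing tool.

```python
import hashlib


def dtd_digest(content: bytes) -> str:
    """Identify a DTD by the SHA-256 hex digest of its bytes."""
    return hashlib.sha256(content).hexdigest()


class DigestStore:
    """In-memory store keyed by content digest (illustrative only)."""

    def __init__(self):
        self._by_digest = {}

    def add(self, content: bytes) -> str:
        digest = dtd_digest(content)
        self._by_digest[digest] = content  # identical content maps to one entry
        return digest

    def get(self, digest: str) -> bytes:
        content = self._by_digest[digest]
        # Re-check on the way out so a corrupted entry is never returned.
        if dtd_digest(content) != digest:
            raise ValueError("stored DTD does not match its digest")
        return content

    def __len__(self):
        return len(self._by_digest)
```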
http://developers.slashdot.org/article.pl?sid=07/01/17/1336257
Guess they never heard of the term "idiot proof". Always assume that people will not understand or comply with the rules.
Nothing, it's a non-profit.
(ducks and runs)
"My country, right or wrong; if right, to be kept right; and if wrong, to be set right." --Senator Carl Schurz (1872)
Track them down and send them a bill. They will notice pretty quickly and change their ways. If you can't find them declare all HTML/XHTML standards obsolete and release a new hypertext standard with obligatory built-in identification and automatic billing facilities.
W3C can add some Ads to the XML and earn some money!
I'm hesitant to blame the client developers. If someone throws a 10 line script together and runs that on a huge set of files, generating a huge number of requests for the DTD URL, obviously someone along the line ought to be caching that result. In my view, it's ultimately the responsibility of the client, but it's too much complexity to expose and implement for a program you may never run again.
So I look to the libraries. However, the last thing I want is a dozen different libraries putting a dozen different caches in a dozen different non-standard locations. Should the development community come up with a standard for how and where to cache HTTP resources? Is there any fundamental reason libCURL, for example, shouldn't be able to access an object that a webbrowser already cached?
Actually, come to think of it, the solution I like best is to punt and disable caching in all applications, and install a transparent caching proxy like squid either locally or on your LAN.
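To show how little code a library-level cache would actually need, here is a rough Python sketch of a fetch-once disk cache. The `cached_fetch` name, the cache directory, and the file-naming scheme are all assumptions for illustration; there is no agreed standard location, which is exactly the gap raised above.

```python
import hashlib
import urllib.request
from pathlib import Path


def _download(url):
    """Default fetcher: one plain HTTP GET."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def cached_fetch(url, cache_dir, fetch=_download):
    """Return the body for url, hitting the network only on the first call.

    Entries never expire in this sketch, which is tolerable for resources
    that do not change (such as published DTDs) and wrong for anything else.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL to get a safe, fixed-length file name.
    entry = cache_dir / hashlib.sha256(url.encode("utf-8")).hexdigest()
    if entry.exists():
        return entry.read_bytes()
    body = fetch(url)
    entry.write_bytes(body)
    return body
```

A transparent proxy like squid does the same job without any application changes and also honours the server's expiry headers, which this sketch ignores.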
Then they'd get paid for all the traffic! ;)
-- Programming with boost is like building a house with lego. It's cool, but I wouldn't want to live in it
Take this, please. It follows the PHP code sample on how to set a reasonable limit on cache use: Remember that the Header() function MUST come before any other output.
As you can see, you'll have to create the HTTP date for an Expires header by hand; PHP doesn't provide a function to do it for you (although recent versions have made it easier; see PHP's date documentation). Of course, it's easy to set a Cache-Control: max-age header, which is just as good for most situations.
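The PHP sample itself did not survive the copy and paste, so as a stand-in, here is the same idea sketched in Python rather than PHP: build an Expires date in HTTP format plus the matching Cache-Control: max-age value. The `caching_headers` helper is a made-up name; `email.utils.formatdate` is from the standard library.

```python
import time
from email.utils import formatdate


def caching_headers(max_age_seconds, now=None):
    """Return Expires and Cache-Control headers for the given lifetime.

    now is a Unix timestamp; it defaults to the current time and is a
    parameter only so the result can be reproduced.
    """
    if now is None:
        now = time.time()
    return {
        # usegmt=True yields the "GMT" suffix that HTTP dates require.
        "Expires": formatdate(now + max_age_seconds, usegmt=True),
        "Cache-Control": f"max-age={max_age_seconds}",
    }
```

As in the PHP case, these have to be sent before any of the response body.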
For more information, see the manual entry for header.
See also the cgi_buffer library, which automatically handles ETag generation and validation, Content-Length generation and gzip content-coding for PHP scripts with a one-line include.
What you are NOT seeing are the multiple links (three, date documentation, manual entry
I really wonder how seriously a remedy is sought. I suspect more than a slight amount of posturing is implicit in this whole set piece. Are they that out of touch?
It should be an option, not a necessity, to read entire threads that are more likely to confuse than elucidate those new to the subtler aspects of any given topic. Too often it is a waste of time and energy.
Blame the browser programmer... Developers are only sticking to standards and working out workarounds here and there FOR the browser programmers already. :)
It was ridiculous to do this in the first place. Perhaps the extra bandwidth usage will teach W3C a much needed lesson when developing future protocols.
With all that usage, what all in our world breaks when they can no longer afford to pay for their mistake and shut the server down?
Web developers are the ones bending over backwards for web browser programmers. I must say these "browser programmers" do a lot for web developers also, but they are the ones who ultimately have to claim responsibility on this issue of DTD validation... I believe.
No. The DTD is an agreement between the XML writer and the parser writer on the format of the XML to be used. The actual content of the DTD is completely irrelevant at run time as long as the incoming file says it complies with the DTD the parser expects. Any parser with more than "if (file.dtd == expectedDtd)" has failed. The only good reason I can think of to even touch the actual DTD at runtime would be for a general purpose XML validator, which, ironically, is a special case.
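A rough Python sketch of that "compare the identifier, never fetch the DTD" check, using the standard library's expat bindings, which report the DOCTYPE without loading the external subset by default. The `EXPECTED_PUBLIC_ID` value and both function names are assumptions chosen for illustration.

```python
import xml.parsers.expat

# Assumed for illustration: the one DTD this hypothetical parser accepts.
EXPECTED_PUBLIC_ID = "-//W3C//DTD XHTML 1.0 Strict//EN"


def declared_public_id(document: bytes):
    """Return the public identifier from the DOCTYPE, or None if absent."""
    found = {}

    def on_doctype(name, system_id, public_id, has_internal_subset):
        # Only the identifier is recorded; system_id is never dereferenced.
        found["public_id"] = public_id

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(document, True)
    return found.get("public_id")


def claims_expected_dtd(document: bytes) -> bool:
    """True if the document says it complies with the DTD we expect."""
    return declared_public_id(document) == EXPECTED_PUBLIC_ID
```

Note that this only checks what the document claims, as the comment describes; it does not validate the content against the DTD.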
Entities. If the document is not standalone then even a non-validating parser has to get a DTD. From a local catalog, of course.
Segmentation fault. Ore dumped.
Maybe it's all the webpages out there that list sample HTML code?
There must be millions of pages with the DTD's URL in the *body* of the page.
>> In short, here's the lesson learned:
...
>> 1) Some proportion of programmers don't know what they're doing and never will
>> 2) Some proportion of programmers are assholes
Here, let me translate
1) Some programmers are Indian
2) Some programmers are American
By linking to W3-dot-org and slashdotting them. Thanks guys, you good samaritans! Can you come over later and throw some gasoline on my house that's on fire?
Slashdot "libertarians": Small government for me, big government for those I disagree with. -1, I disagree with you
For $11.00 Yahoo will give you unlimited bandwidth and storage space. I think W3C should call them on their offering. :)
Who the hell decided that the DTD should be identified with an http: URI anyway? It's as though some people think that any URI has to begin with http:. If you're not meant to fetch it using the hypertext transfer protocol, don't make a URI that says you should.
-- Ed Avis ed@membled.com
If people would bother to learn HTML, this wouldn't happen. But due to the practice of self-taught "web designers" who copy and paste anything and everything, and who also type in code examples verbatim, this little-needed markup has been spread around. So, W3C should know that they are talking to a set of people who don't listen and don't care. Too bad.