Linkguard To Cure Broken Links?

Posted by ryuzaki0 on Sunday June 18, 2000 @07:29AM from the fix-me dept.

sean dreilinger writes: "Here's a BBC writeup of the company Linkguard, which hopes to monitor hyperlink stability via their 40-terabyte database and notify web authors when links are broken." This is a different effort than this one. Still, 40 terabytes?

Misguided? by babbage · 2000-06-18 07:25 · Score: 3

This idea looks okay and all, but is it necessary? It seems to me that the simplest solution would just be to have well-behaved site maintenance -- mainly by making liberal use of things like server redirects & aliases, which should take care of 85% of the problem or something. If I rename a document on my site, I add a redirect so that the old name still works; if I delete a document, I consider redirecting to a Google search for similar documents -- at least it doesn't leave the user completely lost.
Yeah, these methods are kind of a pain in the ass, but they're only worse if this new plan can do no better. But look at it -- they want every single link out there to be rewritten to their spec. Who does that help? Millions of web authors out there trying to rewrite documents to filter into this (really small) database can't possibly be easier than having the much smaller set of web admins adding server redirects whenever they notice more than a handful of 404 errors on the same document.
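A rough sketch of the redirect policy described above (renamed documents get a permanent redirect, deleted ones get pointed at a search) might look like the following in Python. The paths, the keyword table, and the `resolve` function are all made up for illustration; a real site would more likely do this in its web server configuration.

```python
from urllib.parse import quote_plus

# Hypothetical table of renamed documents: old path -> new path.
RENAMED = {"/docs/old-name.html": "/docs/new-name.html"}

# Hypothetical table of deleted documents: old path -> search keywords.
DELETED = {"/docs/gone.html": "linkguard broken links"}


def resolve(path):
    """Return (status, location) if the path needs a redirect, else None."""
    if path in RENAMED:
        # The document moved for good, so a permanent redirect fits.
        return 301, RENAMED[path]
    if path in DELETED:
        # The document is gone; send the visitor to a search for similar pages.
        query = quote_plus(DELETED[path])
        return 302, "https://www.google.com/search?q=" + query
    return None  # no redirect needed, serve the request normally
```

The point is only that the table lives with the site admin, so nobody linking to the site has to rewrite anything.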

--
DO NOT LEAVE IT IS NOT REAL
404 Commercials by Seumas · 2000-06-18 07:52 · Score: 3

Eventually, companies will purchase advertising space on your 404's. User runs into a nonexistent page and, suddenly, they're confronted with the picture of a big juicy Whopper and a coupon to print out and take to Burger King.
Or better, Linkguard will work with Netscape and Microsoft to have the browsers automatically redirect you to companies who have paid money to have 404's intercepted and, instead of redirecting you to the original site as the designer intended, steal you away to some big corporate website.
"Jeeze, every time I run into a 404, I wind up at eBay.com!"

---
icq:2057699
seumas.com
Re:Wrong approach! by Abigail · 2000-06-18 22:23 · Score: 3

It shouldn't be too difficult to write a short script that would check all the links on one's web site periodically.
That will only check outbound links. That's not the problem that's being solved. The problem is checking for inbound links. That is, links on other people's websites to your website. That isn't easily solved with a short script.
-- Abigail
Wrong approach! by Peter+Dyck · 2000-06-18 02:39 · Score: 3

This is a completely wrong approach.
It's like name resolving using only a single DNS server.
It shouldn't be too difficult to write a short script that would check all the links on one's web site periodically. Using a 40 terabyte database to effect this is insanely ineffective.
This is already possible and for free by prac_regex · 2000-06-18 02:39 · Score: 4

With a simple cronjob and Perl's wonderful LWP module package, not to mention the other implementations of tracking web-pages, any relatively smart administrator should already be doing this. It comes down to this, programmers are lazy and that is good, but is this just too lazy? phooey. Maybe this should be done as an apache module .. hrmm... maybe i should write that one.. mod_url_validator
<Location />
# hypothetical handler name for the imagined mod_url_validator
SetHandler check-links
</Location>
or something like that... no i dont like it. too much overhead. well at least my first offer works, because i use it.
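For anyone who wants to see the shape of the cron-job approach, here is a minimal sketch in Python rather than Perl/LWP, using only the standard library. The function names are invented for this example, and it only looks at plain `<a href>` links with http or https URLs.

```python
from html.parser import HTMLParser
from urllib.error import HTTPError, URLError
from urllib.parse import urljoin
from urllib.request import Request, urlopen


class LinkExtractor(HTMLParser):
    """Collect absolute http(s) URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                url = urljoin(self.base_url, value)
                if url.startswith(("http://", "https://")):
                    self.links.append(url)


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def fetch_status(url, timeout=10):
    """Send a HEAD request; return the status code, or None if unreachable."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as resp:
            return resp.status
    except HTTPError as err:
        return err.code
    except URLError:
        return None


def broken_links(html, base_url, status_of=fetch_status):
    """Return (url, status) pairs for links that fail or answer 4xx/5xx."""
    bad = []
    for url in extract_links(html, base_url):
        status = status_of(url)
        if status is None or status >= 400:
            bad.append((url, status))
    return bad
```

Run something like `broken_links(page_source, page_url)` over each page from cron and mail yourself the result. Note this only covers links going out of your own pages.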
Why an independent effort? by bgalehouse · 2000-06-18 02:37 · Score: 3

This service would be much easier for somebody like google or altavista to provide. If a company starts making money at this, I'd expect the search engine boys to come in and offer to do the same thing for less.
Seriously, don't they already have the database?
But.. but... by Signal+11 · 2000-06-18 02:40 · Score: 3
Okay, three problems with tracking broken links..
- Dynamically generated pages
- server / operator error
- I like my cool 404 errors.
There's two problems that this thing could never catch with dynamically generated pages - one, is the famous "missing include" problem which usually looks something like this in the middle of the page: [Unable to process directive] - not a 404, as the page renders, but definitely a Bad Thing. The second problem is that of 404's which appear and disappear at random - like doing a sitewide search & replace across a hundred include files - that has a tendency to lock files, producing share violations, which in turn result in 404's. So the database can't have 100% integrity.
The other problem is the rapid amount of turnover on the web - millions of pages are appearing and disappearing every day. Those quantum people thought virtual particles were odd - try tracking down the same piece of information you found in a search engine 2 weeks ago!
The second problem is operator/server/network errors. I've seen misconfigured proxies that mangle the URL and produce 404's when the page is there.. I've seen people make typos in the URL field of their browser (and then report it to me!), hell.. I've seen the 'net itself eat a few pages. All of this increases entropy in the database.
Finally.. I like seeing the occasional cool 404 error. Take this one, from my server:
Once upon a midnight dreary, while I websurfed, weak and weary,
Over many a strange and spurious website of 'hot chicks galore',
While I clicked my fav'rite bookmark, suddenly there came a warning,
And my heart was filled with mourning, mourning for my dear amour.
"'Tis not possible," I muttered, "give me back my cheap hardcore!"
Quoth the server, "404".
HTTP "Referer" header by cperciva · 2000-06-18 02:42 · Score: 3

(from RFC 1945):
10.13 Referer
The Referer request-header field allows the client to specify, for the server's benefit, the address (URI) of the resource from which the Request-URI was obtained. This allows a server to generate lists of back-links to resources for interest, logging, optimized caching, etc. It also allows obsolete or mistyped links to be traced for maintenance. The Referer field must not be sent if the Request-URI was obtained from a source that does not have its own URI, such as input from the user keyboard.

If you want to make sure that you don't break any links when you move your website, all you have to do is consult your HTTP logs, pull out all the lines starting "Referer:", and remove the duplicates.
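As a sketch of that log-mining step, the Python below assumes the server writes the Apache "combined" log format, where the referer is a quoted field on each request line rather than a line of its own. That format, the regular expression, and the `backlinks` name are assumptions for this example; adjust them to whatever your logs actually look like.

```python
import re
from collections import defaultdict

# Assumed layout: ... "METHOD /path PROTO" status bytes "referer" "agent"
LINE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)"'
)


def backlinks(log_lines):
    """Map each requested path to the set of pages that linked to it."""
    found = defaultdict(set)
    for line in log_lines:
        match = LINE.search(line)
        if match is None:
            continue  # line does not look like the assumed format
        referer = match.group("referer")
        if referer in ("", "-"):
            continue  # no Referer header was sent with this request
        found[match.group("path")].add(referer)  # the set drops duplicates
    return found
```

Feeding it `open("access.log")` gives, for every path on your site, the pages out there that visitors actually followed links from.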

--
Tarsnap: Online backups for the truly paranoid
Hah! by BJH · 2000-06-18 02:43 · Score: 4

I like that bit about cataloging pages with a five-word "lexical" signature based on words that appear mainly only on that page. How are they going to deal with the 5,000,000,000 web pages that contain only the word "porn"? ;)
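To make the "lexical signature" idea concrete, here is one naive way it could work, sketched in Python: pick the page's words that show up in the fewest other documents. This is a guess at the general technique, not Linkguard's actual method, and the function names and tie-breaking rule are invented for the example.

```python
import re
from collections import Counter


def words(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_signature(page, corpus, size=5):
    """Pick up to `size` words from `page` that are rarest across `corpus`."""
    doc_freq = Counter()
    for other in corpus:
        doc_freq.update(set(words(other)))  # count documents, not occurrences
    counts = Counter(words(page))
    # Rarest across the corpus first, then most frequent on the page,
    # then alphabetical so the result is deterministic.
    ranked = sorted(counts, key=lambda w: (doc_freq[w], -counts[w], w))
    return ranked[:size]
```

A page containing a single word gets a one-word signature that every identical page shares, which is exactly the problem being joked about.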
Big deal. by Animats · 2000-06-18 08:38 · Score: 3
Dumb journalism. This isn't a breakthrough.
- There's a free service that already does this. They do it even if you don't ask them to, then send you spam telling you about broken links on your site.
- And there's Alexa, which really does archive the Web so that you can find old pages.
- Personally, I like the link checker in Dreamweaver. It's very well integrated with the site maintenance tools.
Probably the biggest source of bad links is unmaintained "favorite links" lists. That's something that needs a simple tool. If the major free-web-page sites provided something, that would probably cut the number of dead links substantially.
URIs don't change: people change them by QBasic_Dude · 2000-06-18 02:49 · Score: 4

There are no reasons at all in theory for people to change URIs (or stop maintaining documents), but millions of reasons in practice.
Tim Berners-Lee, inventor of the World Wide Web, wrote about this in a page titled Cool URIs don't Change. Many web authors don't realize file name extensions can be removed from the URI space. Pages Must Live Forever (Alertbox Nov. 1998) is another document about the same issue.
The Network Working Group is working on a replacement for URLs -- Uniform Resource Names. URNs are intended to serve as persistent, location-independent, resource identifiers and are designed to make it easy to map other namespaces (which share the properties of URNs) into URN-space.