Slashdot Mirror


Linkguard To Cure Broken Links?

sean dreilinger writes: "Here's a BBC writeup of the company Linkguard, which hopes to monitor hyperlink stability via their 40-terabyte database and notify web authors when links are broken." This is a different effort than this one. Still, 40 terabytes?

27 of 74 comments

  1. Easy way to check for broken links ... by mbyte · · Score: 2

    Just add ErrorDocument 404 /notify.php3 to your Apache config, and then e-mail the administrator of the link (and the referer ... ;)

    You might get quite some amount of e-mail .. but .. that "urges" you to fix the problem ;)
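
    A rough sketch of what such a handler might do, in Python rather than PHP. It assumes the server hands the script CGI-style variables such as REDIRECT_URL and HTTP_REFERER; the function name and the webmaster@example.org address are made up for illustration, and actually sending the message (e.g. via smtplib) is left out.

```python
import os
from email.message import EmailMessage


def build_404_notice(environ, admin="webmaster@example.org"):
    """Compose a 404 notification from CGI-style environment variables.

    `admin` is a placeholder address, not anything from the original post.
    """
    missing = environ.get("REDIRECT_URL") or environ.get("REQUEST_URI", "(unknown)")
    referer = environ.get("HTTP_REFERER", "(no referer sent)")

    msg = EmailMessage()
    msg["To"] = admin
    msg["From"] = admin
    msg["Subject"] = f"404 for {missing}"
    msg.set_content(f"Missing page: {missing}\nLinked from:  {referer}\n")
    return msg


if __name__ == "__main__":
    # When run as an error handler, the variables come from the server.
    print(build_404_notice(dict(os.environ)))
```

    Rate-limiting per missing URL would probably be wise before wiring this to a real mailbox.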

    Samba Information HQ

  2. hm by British · · Score: 2

    I'm betting most of the broken links that need to be fixed are on geocities pages. I've never visited a geocities page without at least one 404 on it.

  3. Re:This is already possible and for free by kaip · · Score: 2

    With a simple cronjob and Perl's wonderful LWP module package, not to mention the other implementations of tracking web-pages, any relatively smart administrator should already be doing this.

    I once made such a short Perl script to check the links on my own web pages: http://www.iki.fi/kaip/linkkuri.html

  4. Cached pages by seizer · · Score: 2

    http://www.google.com/search?q=cache:www.microsoft.com/+microsoft&hl=en

    Need I say more?

    --Remove SPAM from my address to mail me

  5. This is silly. by Kmon · · Score: 2

    First of all, it would be a lot easier to log all outgoing 404s from your server, and notify the system admin when they occur. It sounds to me like these guys were just looking for the most expensive way to do this, and consequently, get the most VC money. Hell, if I told a VC that I was going to write a shell script to record outgoing 404s, I wouldn't even be able to buy a new Porsche!

    For those of you who want to try, here's a starting point:

    grep "File does not exist:" error_log > 404s_log

    ;)
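
    Going one step past the grep, a small Python sketch that counts how often each missing file shows up, so the worst offenders float to the top. The log lines below are invented samples in the style of an Apache error_log; the exact layout of a real log may differ.

```python
from collections import Counter

MARKER = "File does not exist:"


def count_missing(lines):
    """Count how often each path appears after the 'File does not exist:' marker."""
    counts = Counter()
    for line in lines:
        _, found, rest = line.partition(MARKER)
        if found:
            # Some Apache versions append ", referer: ..." after the path.
            path = rest.split(", referer:")[0].strip()
            counts[path] += 1
    return counts


# Invented sample lines, for illustration only.
sample = [
    "[Mon Mar 06 10:00:01 2000] [error] [client 10.0.0.1] File does not exist: /www/old.html",
    "[Mon Mar 06 10:00:05 2000] [error] [client 10.0.0.2] File does not exist: /www/old.html",
    "[Mon Mar 06 10:01:00 2000] [error] [client 10.0.0.3] Premature end of script headers: /www/cgi-bin/x",
]
for path, hits in count_missing(sample).most_common():
    print(hits, path)
```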


    --
    Gah
  6. Misguided? by babbage · · Score: 3
    This idea looks okay and all, but is it necessary? It seems to me that the simplest solution would just be to have well-behaved site maintenance -- mainly by making liberal use of things like server redirects & aliases, which should take care of 85% of the problem or something. If I rename a document on my site, I add a redirect so that the old name still works; if I delete a document, I consider redirecting to a Google search for similar documents -- at least it doesn't leave the user completely lost.
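
    To make the rename/delete policy concrete, here is a minimal Python sketch of the lookup a server could do. The three tables and every path in them are hypothetical; in practice this would more likely live in the server's own redirect and alias configuration.

```python
from urllib.parse import quote_plus

# Hypothetical tables a site maintainer would keep up to date.
MOVED = {"/papers/draft.html": "/papers/final.html"}
DELETED = {"/old-news.html": "old news archive"}   # path -> search terms
EXISTING = {"/index.html", "/papers/final.html"}


def resolve(path):
    """Return (status, location) for a requested path."""
    if path in EXISTING:
        return 200, None
    if path in MOVED:
        return 301, MOVED[path]            # renamed: permanent redirect
    if path in DELETED:
        # deleted: temporary redirect to a search for similar documents
        return 302, "http://www.google.com/search?q=" + quote_plus(DELETED[path])
    return 404, None
```

    Only requests that fall through all three tables end up as a plain 404.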

    Yeah, these methods are kind of a pain in the ass, but they'd only be the worse option if this new plan could do better. But look at it -- they want every single link out there to be rewritten to their spec. Who does that help? Having millions of web authors rewrite their documents to feed this (really small) database can't possibly be easier than having the much smaller set of web admins add server redirects whenever they notice more than a handful of 404 errors on the same document.



  7. Re:But.. but... by babbage · · Score: 2
    I use what I hope is a more or less useful 404 page on my site (useful in that it links to Google; better still would be linking a search for that document, but I haven't had a chance to try that).

    But, I think that this one is much more fun, in a clever little funny on IE funny on Lynx kinda way....



  8. 404 Commercials by Seumas · · Score: 3
    Eventually, companies will purchase advertising space on your 404's. User runs into a nonexistent page and, suddenly, they're confronted with the picture of a big juicy Whopper and a coupon to print out and take to Burger King.

    Or better, Linkguard will work with Netscape and Microsoft to have the browsers automatically redirect you to companies who have paid money to have 404's intercepted. Instead of redirecting you to the original site as the designer intended, they will steal you away to some big corporate website.

    "Jeeze, every time I run into a 404, I wind up at eBay.com!"


    ---
    icq:2057699
    seumas.com

  9. Re:Wrong approach! by Abigail · · Score: 3
    It shouldn't be too difficult to write a short script that would check all the links on one's web site periodically.

    That will only check outbound links. That's not the problem that's being solved. The problem is checking for inbound links. That is, links on other people's websites to your website. That isn't easily solved with a short script.

    -- Abigail

  10. Re:URIs don't change: people change them by Abigail · · Score: 2
    How about a brand new protocol for locating documents on httpds?

    Most of the suggested ideas for this "protocol" are already part of the HTTP protocol.

    WHERE /baz.html

    Not at all needed. That is basically what HTTP does. The URL name space is just a name space. The only relationship between a URL and a file is whatever is dictated by local policy.

    100 /baz.html

    That's basically the 200 HTTP status code.

    101 /baz/1.html
    102 http://foo.baz.org/bar.html

    HTTP doesn't make a needless distinction between a move within the same server and a move to a different server. It does, however, distinguish between moved permanently and moved temporarily: status codes 301 and 302.

    103 KILLED

    Status code 410.

    200 DOESNT EXIST

    Status code 404.

    300 SERVER ERROR

    That's the 5xx category of status codes. There are also the 4xx status codes, if the problem is with the request itself.
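
    The correspondence so far, written out as a small Python table. The left-hand labels are the replies from the quoted proposal; the right-hand side uses Python's http.HTTPStatus names (note that it calls 302 "Found", where RFC 2068 says "Moved Temporarily"). Picking 500 to stand in for the whole 5xx category is a simplification.

```python
from http import HTTPStatus

# Proposed reply (from the quoted suggestion) -> existing HTTP status code(s).
MOVED = [HTTPStatus.MOVED_PERMANENTLY, HTTPStatus.FOUND]
EQUIVALENTS = {
    "100 /baz.html": [HTTPStatus.OK],
    "101 /baz/1.html": MOVED,                    # same server
    "102 http://foo.baz.org/bar.html": MOVED,    # other server, same codes
    "103 KILLED": [HTTPStatus.GONE],
    "200 DOESNT EXIST": [HTTPStatus.NOT_FOUND],
    "300 SERVER ERROR": [HTTPStatus.INTERNAL_SERVER_ERROR],  # one of the 5xx codes
}

for proposed, statuses in EQUIVALENTS.items():
    existing = ", ".join(f"{s.value} {s.phrase}" for s in statuses)
    print(f"{proposed:32} -> {existing}")
```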

    Maybe, also, there should be another client command, SEARCH, to find any/all occurrences of a file name, like: SEARCH bar.html

    That doesn't make sense to put in a protocol, as URLs do *not* point to files. An HTTP server *might* map it to a file, but that's outside the domain of the URL name space. Furthermore, since the URL name space is infinite, the result of such a search command could be an infinite list as well.
    However, it isn't hard to put such functionality in your HTTP server. For instance, the server can be instructed to do a search when encountering the request for /SEARCH/bar.html.

    And a directory list, too... Client says: LIST /

    Again, that doesn't fit in the current standards for the same reason. But note that many HTTP servers have this feature already.

    I have begun to write this all up. Anyone who wants to help, visit my web site, find my e-mail address, and tell me you wish to help with this protocol.

    I strongly suggest you don't trouble yourself with making the effort. You start with the wrong idea, that URLs map to files, and most of the requested functionality has already been taken care of in the HTTP standard.

    Ref: RFC 2068

    -- Abigail

  11. Wrong approach! by Peter+Dyck · · Score: 3
    This is a completely wrong approach.

    It's like name resolving using only a single DNS server.

    It shouldn't be too difficult to write a short script that would check all the links on one's web site periodically. Using a 40 terabyte database to effect this is insanely ineffective.
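
    For the curious, such a script really is short. A minimal Python sketch using only the standard library follows; it checks the links found on one page and assumes servers answer HEAD requests sensibly, which not all do. The function and class names are illustrative, and crawling a whole site or scheduling it from cron is left out.

```python
from html.parser import HTMLParser
from urllib.error import HTTPError
from urllib.parse import urljoin
from urllib.request import Request, urlopen


class LinkExtractor(HTMLParser):
    """Collect absolute http(s) URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                url = urljoin(self.base_url, href)
                if url.startswith(("http://", "https://")):
                    self.links.append(url)


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def fetch_status(url, timeout=10):
    """Return the final HTTP status for url, or None if nothing answered."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as resp:
            return resp.status
    except HTTPError as err:      # 4xx and 5xx responses
        return err.code
    except OSError:               # DNS failure, refused connection, timeout
        return None


def broken_links(html, base_url, status_of=fetch_status):
    """List (url, status) pairs for links that did not answer with 200."""
    results = ((url, status_of(url)) for url in extract_links(html, base_url))
    return [(url, status) for url, status in results if status != 200]
```

    The status_of parameter is there so the checker can be tried out against a canned table of results instead of the live network.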

    1. Re:Wrong approach! by medicthree · · Score: 2
      Using a 40 terabyte database to effect this is insanely ineffective.

      Ineffective? I doubt it. I'm sure it'll work just fine. Inefficient, maybe.

  12. This is already possible and for free by prac_regex · · Score: 4

    With a simple cronjob and Perl's wonderful LWP module package, not to mention the other implementations of tracking web-pages, any relatively smart administrator should already be doing this. It comes down to this: programmers are lazy, and that is good, but is this just too lazy? Phooey. Maybe this should be done as an Apache module .. hrmm... maybe I should write that one.. mod_url_validator
    <Location />
    SetHandler check-links
    </Location>
    or something like that... no, I don't like it. Too much overhead. Well, at least my first offer works, because I use it.

  13. Why an independent effort? by bgalehouse · · Score: 3
    This service would be much easier for somebody like google or altavista to provide. If a company starts making money at this, I'd expect the search engine boys to come in and offer to do the same thing for less.

    Seriously, don't they already have the database?

  14. But.. but... by Signal+11 · · Score: 3
    Okay, three problems with tracking broken links..
    • Dynamically generated pages
    • server / operator error
    • I like my cool 404 errors.

    There are two problems that this thing could never catch with dynamically generated pages - one is the famous "missing include" problem, which usually looks something like this in the middle of the page: [Unable to process directive] - not a 404, as the page renders, but definitely a Bad Thing. The second problem is that of 404's which appear and disappear at random - like doing a sitewide search & replace across a hundred include files - that has a tendency to lock files, producing share violations, which in turn result in 404's. So the database can't have 100% integrity.

    The other problem is the rapid turnover on the web - millions of pages are appearing and disappearing every day. Those quantum people thought virtual particles were odd - try tracking down the same piece of information you found in a search engine 2 weeks ago!

    The second problem is operator/server/network errors. I've seen misconfigured proxies that mangle the URL and produce 404's when the page is there.. I've seen people make typos in the URL field of their browser (and then report it to me!), hell.. I've seen the 'net itself eat a few pages. All of this increases entropy in the database.

    Finally.. I like seeing the occasional cool 404 error. Take this one, from my server:

    Once upon a midnight dreary, while I websurfed, weak and weary,
    Over many a strange and spurious website of 'hot chicks galore',
    While I clicked my fav'rite bookmark, suddenly there came a warning,
    And my heart was filled with mourning, mourning for my dear amour.
    "'Tis not possible," I muttered, "give me back my cheap hardcore!"
    Quoth the server, "404".
  15. server down? by MathJMendl · · Score: 2

    Not a bad idea. What happens if the server is temporarily down, however? You don't want everyone taking down their links if it is just down for a day or so because of technical difficulties. This service would get annoying if someone had a lot of links on their site and a few were randomly down on that particular day.
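
    One way a checker could avoid crying wolf is to report a link only after several failed checks in a row. A small Python sketch of that idea follows; the class name and the threshold of three are arbitrary choices for illustration, and nothing says Linkguard works this way.

```python
from collections import defaultdict


class LinkHealth:
    """Report a link as dead only after several consecutive failed checks."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = defaultdict(int)

    def record(self, url, ok):
        """Record one check result; return True if the link now counts as dead."""
        if ok:
            self.failures[url] = 0      # any success resets the streak
        else:
            self.failures[url] += 1
        return self.failures[url] >= self.threshold
```

    With one check per day and a threshold of three, a server that is down for a day or so never triggers a notice.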

    --


    "I have not failed. I've simply found 10,000 ways that won't work." --Thomas Edison
  16. Wrong approach to a non-existent problem by JimDabell · · Score: 2

    They are just trying to create a market for themselves. Trying to keep tabs on the whole internet just in case a page moves every now and again is a silly idea. The right approach is for each site author to use cron to check the links every now and again. As soon as a page moves, update your link.

  17. HTTP "Referer" header by cperciva · · Score: 3
    (from RFC 1945):
    10.13 Referer
    The Referer request-header field allows the client to specify, for the server's benefit, the address (URI) of the resource from which the Request-URI was obtained. This allows a server to generate lists of back-links to resources for interest, logging, optimized caching, etc. It also allows obsolete or mistyped links to be traced for maintenance. The Referer field must not be sent if the Request-URI was obtained from a source that does not have its own URI, such as input from the user keyboard.

    If you want to make sure that you don't break any links when you move your website, all you have to do is consult your HTTP logs, pull out all the lines starting "Referer:", and remove the duplicates.
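
    In practice the referer usually ends up as a quoted field inside each access-log line rather than on a line of its own. A Python sketch of the "pull out and de-duplicate" step, assuming Apache's "combined" log format; the function name is made up, and other log formats would need a different pattern.

```python
import re

# Apache "combined" format (assumed):
# host ident user [time] "request" status bytes "referer" "user-agent"
COMBINED = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "[^"]*"$'
)


def inbound_links(lines):
    """Return a sorted list of unique (referer, requested path) pairs."""
    pairs = set()
    for line in lines:
        match = COMBINED.match(line.strip())
        if not match or match["referer"] in ("", "-"):
            continue                      # unparsable line, or no referer sent
        parts = match["request"].split()
        if len(parts) >= 2:               # e.g. ["GET", "/a.html", "HTTP/1.0"]
            pairs.add((match["referer"], parts[1]))
    return sorted(pairs)
```

    Each pair names a page out there that links to one of your documents, which is the list you need before moving anything.
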
  18. Hah! by BJH · · Score: 4


    I like that bit about cataloging pages with a five-word "lexical" signature based on words that appear mainly only on that page. How are they going to deal with the 5,000,000,000 web pages that contain only the word "porn"? ;)
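
    A guess at what such a signature could look like, as a Python sketch: take the words on a page and keep the five that occur in the fewest other pages. The article does not spell out Linkguard's actual method, so the ranking rule, the function names and the toy corpus are all assumptions.

```python
import re
from collections import Counter


def words(text):
    """Lower-cased set of alphabetic words in a page."""
    return set(re.findall(r"[a-z]+", text.lower()))


def lexical_signature(page, corpus, size=5):
    """Pick the `size` words of `page` that occur in the fewest corpus pages."""
    doc_freq = Counter()
    for other in corpus:
        doc_freq.update(words(other))
    # Rarest first; ties broken alphabetically so the result is deterministic.
    ranked = sorted(words(page), key=lambda w: (doc_freq[w], w))
    return ranked[:size]


# Toy corpus, invented for illustration.
corpus = ["the cat sat", "the dog sat", "the porn"]
print(lexical_signature("the quick brown cat", corpus, size=2))
print(lexical_signature("porn", corpus))
```

    A one-word page can only ever yield a one-word signature, which is exactly the problem.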

  19. Government Conspiracy by Seumas · · Score: 2
    *cough* ; )

    This is a government conspiracy. I'm surprised none of you (especially Signal 11) picked up on it right away.

    Any webmaster worth his weight in HTML can use LWP or even a simple GUI-based Xenu (freeware linkchecker) to check on the current status of links on their site, and elsewhere.

    The only obvious benefit of something like Linkguard is for the government to keep track of you. "You have 20% dead links on your site? Bad webmaster -- BAD!"

    Next thing you know, your name is published in the paper, your wife leaves you, your house is foreclosed, and your children are taken away from you and put into foster care, with a family who does know how to maintain their links.
    ---
    icq:2057699
    seumas.com

  20. Big deal. by Animats · · Score: 3
    Dumb journalism. This isn't a breakthrough.
    • There's a free service that already does this. They do it even if you don't ask them to, then send you spam telling you about broken links on your site.
    • And there's Alexa, which really does archive the Web so that you can find old pages.
    • Personally, I like the link checker in Dreamweaver. It's very well integrated with the site maintenance tools.
    Probably the biggest source of bad links is unmaintained "favorite links" lists. That's something that needs a simple tool. If the major free-web-page sites provided something, that would probably cut the number of dead links substantially.
  21. Agents? No thanks. by Syn.Terra · · Score: 2
    Here's a key line from the article:

    Eventually Linkguard is planning to use discrete software programs called agents to watch links and tell the webmasters of any affected sites when they are updated or changed.

    By "agents" they mean "bots", I suppose.

    Now, if it takes 40 terabytes (roughly 41,943,040 megabytes, I believe) to document all the links on the web, how much more space will be needed to keep contact info on all those links? Plus, how efficient will these agents be? I'm not so hot on the idea of bots constantly poking around my lil' Network, checking that all my links are okay.

    And will these bots follow the robots.txt rules? I know plenty of sites which disallow all robots, so the "agents" would be useless anyway... Nice idea, but sounds a bit invasive.
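
    For reference, honouring robots.txt is not much work for a well-behaved bot. A short Python sketch using the standard library's urllib.robotparser; the agent name "linkguard-agent" is invented here, since the article does not say how the agents identify themselves.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, agent, url):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# A site that turns away every robot.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"
# "linkguard-agent" is a hypothetical user-agent name.
print(allowed(BLOCK_ALL, "linkguard-agent", "http://example.org/page.html"))
```

    On a site like that, a polite agent would have nothing it is permitted to check.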

    Plus this line below:

    If the destination page disappears, search engines that can use these signatures would try to find the relevant signature and relocate the page.

    Oh, so now you're relying on search engines to get the links right... hm...

    I'll stick to manually checking them myself, thankyouverymuch.


    ---

    --
    "Okay, who taught the cat how to type ctrl alt delete?"
  22. 40tb not enough by seizer · · Score: 2

    In a recent New Scientist article (sorry, can't source it better than that) I read that Google has "4000 linux computers each with 80gb of diskspace". Well, that works out to be 312tb and that's only to index a small portion of the web - 20% is it? (as I understand it, Google stores the whole webpage in order to serve a cached version on demand). So, how could 40tb be enough? Even assuming that this new company compressed the data to a higher degree than Google (which needs to serve pages fast), this just couldn't be enough to be useful.
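
    The arithmetic, spelled out in Python. It uses the figures as quoted above and the binary convention of 1024 GB per TB, which is what gives 312 rather than 320.

```python
GB_PER_TB = 1024  # binary convention; use 1000 for decimal terabytes

google_gb = 4000 * 80                # machines x disk per machine, as quoted
google_tb = google_gb / GB_PER_TB    # 312.5
linkguard_tb = 40

print(f"Google (as quoted): {google_tb:.1f} TB")
print(f"Linkguard / Google: {linkguard_tb / google_tb:.1%}")
```

    So the 40tb works out to roughly an eighth of the quoted Google storage.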

    Gimmick or badly planned...whichever.

    --Remove SPAM from my address to mail me

  23. URIs don't change: people change them by QBasic_Dude · · Score: 4
    There are no reasons at all in theory for people to change URIs (or stop maintaining documents), but millions of reasons in practice.
    Tim Berners-Lee, inventor of the World Wide Web, wrote about this in a page titled Cool URIs don't Change. Many web authors don't realize file name extensions can be removed from the URI space. Pages Must Live Forever (Alertbox Nov. 1998) is another document about the same issue.

    The Network Working Group is working on a replacement for URLs -- Uniform Resource Names. URNs are intended to serve as persistent, location-independent resource identifiers and are designed to make it easy to map other namespaces (which share the properties of URNs) into URN-space.

    1. Re:URIs don't change: people change them by Coma+of+Souls · · Score: 2

      There's also things like PURL, but they haven't really caught on.

  24. Money? by arnald · · Score: 2

    Just how does this `company' plan to make money?

    Will they email the companies saying "at least one of your links is down; send us a cheque for x00000 pounds and we'll tell you which"?

    Just how?

    --
    arnald
  25. Missed the point? by Chris+Pimlott · · Score: 2

    To me the article seemed to stress not broken links on your own pages, but broken links from other people's pages to your own, thus causing you to lose out on visitors coming to your site from others.

    However, this seems like it could also be done on the local side, by logging the http-referer so you can keep track of any pages that a lot of your visitors seem to be coming from, and then notifying them if/when you change your URLs.