Linkguard To Cure Broken Links?
sean dreilinger writes: "Here's a BBC writeup of the company Linkguard, which hopes to monitor hyperlink stability via their 40-terabyte database and notify web authors when links are broken." This is a different effort than this one. Still, 40 terabytes?
just add an ErrorDocument 404 /notify.php3 to your apache config, and then e-mail the administrator of the link (and the referer ...) ;)
.. but .. that "urges" you to fix the problem ;)
You might be getting quite some amount of e-mail
Samba Information HQ
I'm betting most of the broken links that need to be fixed are geocities pages. I've never visited a geocities page without at least 1 404 on them.
I once made such a short Perl script to check the links on my own web pages: http://www.iki.fi/kaip/linkkuri.html
http://www.google.com/search?q=cache:www.microsoft.com/+microsoft&hl=en
Need I say more?
--Remove SPAM from my address to mail me
First of all, it would be a lot easier to log all outgoing 404s from your server, and notify the system admin when they occur. It sounds to me like these guys were just looking for the most expensive way to do this, and consequently, get the most VC money. Hell, if I told a VC that I was going to write a shell script to record outgoing 404s, I wouldn't even be able to buy a new Porsche!
For those of you who want to try, here's a starting point:
grep "File does not exist:" error_log > 404s_log
;)
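A slightly longer take on the same idea, as a rough sketch: count how often each missing path shows up instead of just dumping the lines. It assumes the error log contains the "File does not exist:" message the grep above looks for; the function name and the command-line use are made up for illustration.

```python
import re
import sys
from collections import Counter

# Matches the same message the grep one-liner looks for.
# Paths containing spaces would be cut short by \S+ (a limitation of this sketch).
MISSING = re.compile(r"File does not exist: (\S+)")

def count_missing(lines):
    """Count how often each missing path appears in error-log lines."""
    counts = Counter()
    for line in lines:
        match = MISSING.search(line)
        if match:
            # Some log formats append ", referer: ..." after the path.
            counts[match.group(1).rstrip(",")] += 1
    return counts

if __name__ == "__main__":
    # Usage (hypothetical): python count_404s.py error_log
    with open(sys.argv[1]) as log:
        for path, hits in count_missing(log).most_common(20):
            print(hits, path)
```

Sorting by hit count puts the paths that are requested most often at the top, which is probably where a redirect pays off first.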
Gah
Yeah, these methods are kind of a pain in the ass, but they're only worse if this new plan can do no better. But look at it -- they want every single link out there to be rewritten to their spec. Who does that help? Millions of web authors out there trying to rewrite documents to filter into this (really small) database can't possibly be easier than having the much smaller set of web admins adding server redirects whenever they notice more than a handful of 404 errors on the same document.
DO NOT LEAVE IT IS NOT REAL
But, I think that this one is much more fun, in a clever little funny on IE funny on Lynx kinda way....
DO NOT LEAVE IT IS NOT REAL
Or better, Linkguard will work with Netscape and Microsoft to have the browsers automatically intercept 404's for companies who have paid money and, instead of redirecting you to the original site as the designer intended, steal you away to some big corporate website.
"Jeeze, every time I run into a 404, I wind up at eBay.com!"
---
icq:2057699
seumas.com
That will only check outbound links. That's not the problem that's being solved. The problem is checking for inbound links. That is, links on other people's websites to your website. That isn't easily solved with a short script.
-- Abigail
Most of the suggested ideas for this "protocol" are already part of the HTTP protocol.
WHERE /baz.html
Not at all needed. That is basically what HTTP does. The URL name space is just a name space. The only relationship between a URL and a file is whatever is dictated by local policy.
100 /baz.html
That's basically the 200 HTTP status code.
101 /baz/1.html
102 http://foo.baz.org/bar.html
HTTP doesn't make a needless difference between moved to the same server or a different server. It does however make a difference between moved permanently and moved temporarily. Status codes 301 and 302.
103 KILLED
Status code 410.
200 DOESNT EXIST
Status code 404.
300 SERVER ERROR
That's the 5xx category of status codes. There are also the 4xx status codes, if the problem is with the request itself.
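To make the correspondence concrete, here is a small sketch of the mapping described so far, using the status codes from Python's standard http module. The dictionary keys are just labels for the proposed replies, and picking 500 to stand in for the whole 5xx category is an assumption of this sketch. Note that Python calls 302 "Found", the name later HTTP revisions use; RFC 2068 calls it "Moved Temporarily".

```python
from http import HTTPStatus

# Proposed reply (labels made up for this sketch) -> closest HTTP status codes.
PROPOSED_TO_HTTP = {
    "100 found": [HTTPStatus.OK],
    "101/102 moved": [HTTPStatus.MOVED_PERMANENTLY, HTTPStatus.FOUND],
    "103 KILLED": [HTTPStatus.GONE],
    "200 DOESNT EXIST": [HTTPStatus.NOT_FOUND],
    # One representative of the 5xx category.
    "300 SERVER ERROR": [HTTPStatus.INTERNAL_SERVER_ERROR],
}

def describe(proposed):
    """Return the matching HTTP status codes as readable text."""
    return ", ".join(f"{s.value} {s.phrase}" for s in PROPOSED_TO_HTTP[proposed])

if __name__ == "__main__":
    for reply in PROPOSED_TO_HTTP:
        print(f"{reply:18} -> {describe(reply)}")
```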
Maybe, also, there should be another client command, SEARCH, to find any/all occurrences of a file name, like: SEARCH bar.html
That doesn't make sense to put in a protocol, as URLs do *not* point to files. An HTTP server *might* map it to a file, but that's outside the domain of the URL name space. Furthermore, since the URL name space is infinite, the result of such a search command could be an infinite list as well.
However, it isn't hard to put such functionality in your HTTP server. For instance, the server can be instructed to do a search when encountering the request for /SEARCH/bar.html.
And a directory list, too... Client says: LIST /
Again, that doesn't fit in the current standards for the same reason. But note that many HTTP servers have this feature already.
I have begun to write this all up. Anyone who wants to help, visit my web site, find my e-mail address, and tell me you wish to help with this protocol.
I strongly suggest you don't trouble yourself with the effort. You start with the wrong idea, that URLs map to files, and most of the requested functionality has already been taken care of in the HTTP standard.
Ref: RFC 2068
-- Abigail
It's like name resolving using only a single DNS server.
It shouldn't be too difficult to write a short script that would check all the links on one's web site periodically. Using a 40 terabyte database to effect this is insanely ineffective.
With a simple cron job and Perl's wonderful LWP module package, not to mention the other implementations of web-page tracking, any relatively smart administrator should already be doing this. It comes down to this: programmers are lazy, and that is good, but is this just too lazy? phooey. Maybe this should be done as an apache module .. hrmm... maybe i should write that one.. mod_url_validator
<Location />
Add-handler Check-Links
</Location>
or something like that... no i dont like it. too much overhead. well at least my first offer works, because i use it.
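For anyone who wants to try the cron approach without Perl, here is a minimal sketch of a link checker using only Python's standard library (not LWP). The function names are made up, it only follows plain a href links, and it assumes the remote servers answer HEAD requests sensibly, which not all do.

```python
import sys
from html.parser import HTMLParser
from urllib.error import HTTPError
from urllib.parse import urljoin
from urllib.request import Request, urlopen

class LinkCollector(HTMLParser):
    """Collect absolute http(s) URLs from <a href="..."> tags."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                url = urljoin(self.base_url, href)
                if url.startswith(("http://", "https://")):
                    self.links.append(url)

def extract_links(html, base_url):
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links

def check(url, timeout=10):
    """Return the HTTP status for url, or None if the server was unreachable."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        return err.code
    except OSError:  # DNS failures, refused connections, timeouts
        return None

if __name__ == "__main__":
    # Usage (hypothetical): python check_links.py http://www.example.com/
    page = sys.argv[1]
    with urlopen(page, timeout=10) as response:
        html = response.read().decode("utf-8", errors="replace")
    for link in extract_links(html, page):
        status = check(link)
        if status is None or status >= 400:
            print(status, link)
```

Run from cron once a night, this prints only the links that look broken, so an empty report means nothing to fix.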
Seriously, don't they already have the database?
There are two problems that this thing could never catch with dynamically generated pages - one is the famous "missing include" problem, which usually looks something like this in the middle of the page: [Unable to process directive] - not a 404, as the page renders, but definitely a Bad Thing. The second problem is that of 404's which appear and disappear at random - like doing a sitewide search & replace across a hundred include files - that has a tendency to lock files, producing share violations, which in turn result in 404's. So the database can't have 100% integrity.
The other problem is the rapid amount of turnover on the web - millions of pages are appearing and disappearing every day. Those quantum people thought virtual particles were odd - try tracking down the same piece of information you found in a search engine 2 weeks ago!
The second problem is operator/server/network errors. I've seen misconfigured proxies that mangle the URL and produce 404's when the page is there.. I've seen people make typos in the URL field of their browser (and then report it to me!), hell.. I've seen the 'net itself eat a few pages. All of this increases entropy in the database.
Finally.. I like seeing the occasional cool 404 error. Take this one, from my server:
Not a bad idea. What happens if the server is temporarily down, however? You don't want everyone taking down their links if it is just down for a day or so because of technical difficulties. This service would get annoying if someone had a lot of links on their site and a few were randomly down on that particular day.
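One way a checker could avoid crying wolf over a one-day outage, sketched under the assumption that it runs on a schedule and remembers results between runs: only report a link after several failed checks in a row. The names and the threshold of 3 are made up for illustration; nothing here is from Linkguard.

```python
def update_failures(failures, url, ok):
    """Track consecutive failed checks per URL; a success resets the count."""
    failures[url] = 0 if ok else failures.get(url, 0) + 1
    return failures[url]

def should_report(failures, url, threshold=3):
    """Report a link only after `threshold` failed checks in a row."""
    return failures.get(url, 0) >= threshold

if __name__ == "__main__":
    failures = {}
    # Results of five daily checks of one (hypothetical) link.
    for day, ok in enumerate([False, True, False, False, False], start=1):
        update_failures(failures, "http://example.org/page.html", ok)
        print(day, should_report(failures, "http://example.org/page.html"))
```

With daily checks and a threshold of 3, a server that is down for a day or two never gets reported.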
"I have not failed. I've simply found 10,000 ways that won't work." --Thomas Edison
They are just trying to create a market for themselves. Trying to keep tabs on the whole internet just in case a page moves every now and again is a silly idea. The right approach is for each site author to use cron to check the links every now and again. As soon as a page moves, update your link.
If you want to make sure that you don't break any links when you move your website, all you have to do is consult your HTTP logs, pull out all the lines starting "Referer:", and remove the duplicates.
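A rough sketch of that log trawl, assuming the server writes the common "combined" access log format, where the Referer is a quoted field on each request line rather than a line of its own. The function name and the sample host are made up.

```python
import re
from urllib.parse import urlparse

# Quoted request line, status, size, then the quoted Referer field.
COMBINED = re.compile(r'"[A-Z]+ (\S+)[^"]*" \d{3} \S+ "([^"]*)"')

def inbound_links(lines, own_host):
    """Map each requested path to the set of external pages that referred to it."""
    links = {}
    for line in lines:
        match = COMBINED.search(line)
        if not match:
            continue
        path, referer = match.groups()
        host = urlparse(referer).hostname  # None for "-" (no referer sent)
        if host and host != own_host:
            links.setdefault(path, set()).add(referer)  # sets drop duplicates
    return links

if __name__ == "__main__":
    import sys
    # Usage (hypothetical): python referers.py access_log www.example.com
    with open(sys.argv[1]) as log:
        for path, pages in sorted(inbound_links(log, sys.argv[2]).items()):
            print(path, *sorted(pages), sep="\n    ")
```

The result is a list, per page, of the outside pages that link to it, which is who to tell before moving that page.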
Tarsnap: Online backups for the truly paranoid
I like that bit about cataloging pages with a five-word "lexical" signature based on words that appear mainly only on that page. How are they going to deal with the 5,000,000,000 web pages that contain only the word "porn"?
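To see why that question bites, here is a toy sketch of one way such a signature could be built: take the words on the page that appear in the fewest other pages. This is a guess at the idea, not Linkguard's actual method, and the function names are made up.

```python
import re
from collections import Counter

def words(text):
    return re.findall(r"[a-z]+", text.lower())

def lexical_signature(page, corpus, size=5):
    """Pick the `size` words of `page` that occur in the fewest corpus pages."""
    doc_freq = Counter()
    for other in corpus:
        doc_freq.update(set(words(other)))  # count each word once per page
    candidates = set(words(page))
    # Rarest first; ties broken alphabetically so the result is stable.
    return sorted(candidates, key=lambda w: (doc_freq[w], w))[:size]

if __name__ == "__main__":
    corpus = ["the cat sat", "the dog sat"]
    print(lexical_signature("the cat sat on linkguard", corpus, size=2))
    print(lexical_signature("porn", ["porn", "porn"]))
```

A page with only one word gets a one-word signature that every similar page shares, so it cannot tell them apart.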
This is a government conspiracy. I'm surprised none of you (especially Signal 11) picked up on it right away.
Any webmaster worth his weight in HTML can use LWP or even a simple GUI-based Xenu (freeware linkchecker) to check on the current status of links on their site, and elsewhere.
The only obvious benefit to something like Linkguard is for the government to keep track of you. "You have 20% dead links on your site? Bad webmaster -- BAD!"
Next thing you know, your name is published in the paper, your wife leaves you, your house is foreclosed, and your children are taken away from you and put into foster care, with a family who does know how to maintain their links.
---
icq:2057699
seumas.com
- There's a free service that already does this. They do it even if you don't ask them to, then send you spam telling you about broken links on your site.
- And there's Alexa, which really does archive the Web so that you can find old pages.
- Personally, I like the link checker in Dreamweaver. It's very well integrated with the site maintenance tools.
Probably the biggest source of bad links is unmaintained "favorite links" lists. That's something that needs a simple tool. If the major free-web-page sites provided something, that would probably cut the number of dead links substantially.
Eventually Linkguard is planning to use discrete software programs called agents to watch links and tell the webmasters of any affected sites when they are updated or changed.
By "agents" they mean "bots", I suppose.
Now, if it takes 40 terabytes (roughly 41,943,040 megabytes, I believe) to document all the links on the web, how much more space will be needed to keep contact info on all those links? Plus, how efficient will these agents be? I'm not so hot on the idea of bots constantly poking around my lil' Network, checking that all my links are okay.
And will these bots follow the robots.txt rules? I know plenty of sites which revoke all robots, so the "agents" would be useless anyway... Nice idea, but sounds a bit invasive.
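For reference, this is roughly what a well-behaved bot would have to do before touching a page, sketched with Python's standard urllib.robotparser. The agent name "linkbot" is made up; whether Linkguard's agents do anything like this is not stated in the article.

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt, agent, url):
    """Return True if robots_txt permits `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

if __name__ == "__main__":
    # A site that turns away all robots.
    no_robots = "User-agent: *\nDisallow: /\n"
    print(allowed(no_robots, "linkbot", "http://example.com/page.html"))
```

On a site that disallows everything, a polite agent has nothing left to check, which is the point made above.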
Plus this line below:
If the destination page disappears, search engines that can use these signatures would try to find the relevant signature and relocate the page.
Oh, so now you're relying on search engines to get the links right... hm...I'll stick to manually checking them myself, thankyouverymuch.
---
"Okay, who taught the cat how to type ctrl alt delete?"
In a recent New Scientist article (sorry, can't source it better than that) I read that Google has "4000 linux computers each with 80gb of diskspace". Well, that works out to be 312tb and that's only to index a small portion of the web - 20% is it? (as I understand it, Google stores the whole webpage in order to serve a cached version on demand). So, how could 40tb be enough? Even assuming that this new company compressed the data to a higher degree than Google (which needs to serve pages fast), this just couldn't be enough to be useful.
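The arithmetic checks out if a terabyte is taken as 1024 gigabytes, as this quick sketch shows. The function name is made up, and the 4000 and 80 figures are simply the ones quoted above.

```python
def cluster_capacity_tb(machines, gb_per_machine, gb_per_tb=1024):
    """Total disk space of a cluster in terabytes (1 TB = 1024 GB by default)."""
    return machines * gb_per_machine / gb_per_tb

if __name__ == "__main__":
    google_tb = cluster_capacity_tb(4000, 80)
    print(google_tb)        # total for the quoted Google figures
    print(40 / google_tb)   # the 40 TB database as a fraction of that
```

So the 40-terabyte database would be about an eighth of the quoted Google capacity.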
Gimmick or badly planned...whichever.
--Remove SPAM from my address to mail me
The Network Working Group is working on a replacement for URLs -- Uniform Resource Names. URNs are intended to serve as persistent, location-independent resource identifiers and are designed to make it easy to map other namespaces (which share the properties of URNs) into URN-space.
Just how does this `company' plan to make money?
Will they email the companies saying "at least
one of your links is down; send us a cheque for
x00000 pounds and we'll tell you which"?
Just how?
arnald
To me the article seemed to stress not broken links among your own page, but broken links from other people's pages to your own, thus causing you to lose out on visitors coming to your site from others.
However, this seems like it could also be done on the local side, by logging the http-referer so you can keep track of any pages that a lot of your visitors seem to be coming from and then notifying them if/when you change your URL's.