Linkguard To Cure Broken Links?
sean dreilinger writes: "Here's a BBC writeup of the company Linkguard, which hopes to monitor hyperlink stability via their 40-terabyte database and notify web authors when links are broken." This is a different effort than this one. Still, 40 terabytes?
I think it is clear now that everything posted by Signal 11 should be moderated down as flamebait. His posts always draw flames; therefore they are flamebait.
Just add an ErrorDocument 404 /notify.php3 to your Apache config, and then e-mail the administrator of the link (and the referer ... ;)
.. but .. that "urges" you to fix the problem ;)
You might get quite a lot of e-mail
Samba Information HQ
There are already companies out there that are doing this for free, and when they find a broken link they send an email to webmaster@thedomainname.com.
This is nothing new really...
Nathaniel P. Wilkerson
NPS Internet Solutions, LLC
www.npsis.com
Nathaniel P. Wilkerson
www.haidacarver.com
I'm betting most of the broken links that need to be fixed are geocities pages. I've never visited a geocities page without at least 1 404 on them.
If all you're worried about is moved pages within your server, looking at the error log from the web server is pretty good.
PJRC: Electronic Projects, 8051 Microcontroller Tools
If you click the above link to my home page, my server is supposed to see that you came from this article on Slashdot. That's all well and good.
If you type a random URL - say, http://www.yahoo.com/ - into your address bar now, should Yahoo see that you came from Slashdot? I'd call that an invasion of privacy. Netscape sends that information.
--
$x='S24;r)>63/* h@<5+oZ)32"5cz';$me='phroggy'x$];
$x=~y+ -xz+\0-Tx+;print$_^chop$me for split'',$x;
How about a brand new protocol for locating documents on httpds? If a page has a link to http://foo.bar.org/baz.html then the client checks loc://foo.bar.org for the actual location of baz.html! The client tells the location server:
WHERE /baz.html
If baz.html is in the same place (foo.bar.org/baz.html) the location server will return:
100 /baz.html
If it's been moved to elsewhere on the server (say, /baz/1.html) the location server will say,
101 /baz/1.html
If it's moved to another server (this time foo.baz.org/bar.html) then the location server returns:
102 http://foo.baz.org/bar.html
And if it's been removed for good, then:
103 KILLED
If the file never existed, then the location server will say:
200 DOESNT EXIST
If the server encounters a problem then it should return:
300 SERVER ERROR
Or something else, if it knows what it did wrong.
Maybe, also, there should be another client command, SEARCH, to find any/all occurrences of a file name, like:
SEARCH bar.html
To this the server might reply:
400 SEARCH BEGINS
401 /bar.html
401 /foo/bar.html
401 /baz/bar.html
402 SEARCH ENDS
And a directory list, too... Client says:
LIST /
And the server says:
405 LIST BEGINS
406 /index.html
406 /style.css
406 /cgi-bin/
406 /products.html
406 /downloads/
406 /people.html
406 /bar.html
406 /baz.html
406 /links.html
406 /menu.html
407 LIST ENDS
I have begun to write this all up. Anyone who wants to help, visit my web site, find my e-mail address, and tell me you wish to help with this protocol.
Chris 'coldacid' Charabaruk Meldstar Entertainment
I just have cron set up to do a daily 404 report for the last seven days with Analog, and read it occasionally.
Duh!
- A.P.
--
"One World, one Web, one Program" - Microsoft promotional ad
"Remember when the U.S. had a drug problem, and then we declared a War On Drugs, and now you can't buy drugs anymore?"
I'm not really familiar with the workings of the
web, but why can't HTML code be devised that
looks ahead to where the links lead after the rest
of the page has finished loading, and return a
status code on how the page is working? If it's
404, then the browser can change the link text to
a predefined color so the user knows not to click
on it. Any ideas?
Remember "Bring 'em on"? *sigh
You don't seem to understand what this company does at all. But that's OK, you got 14th post. Good job!
RewriteLog
RewriteMap real-to-user txt:/anywhere/map.real-to-host
RewriteRule ^/([^/]+)/~([^/]+)/(.*)$
Also, the HTTP URI or Refresh header can be used to easily redirect an existing location to another. There is no need for a document location protocol.
I somewhat disagree with this. After all, a URL could be bookmarked, linked, or be referred to in another way. Once a URI is created, it should exist forever. Freenet is an interesting distributed Internet-like network where documents can be uploaded, and since the files do not reside on a central server, they exist as long as there is a demand for the file.
Searching should, in my opinion, be higher level and not in the protocol. CGI can easily be used instead.
This really shouldn't be necessary. Links should be able to get to all the public documents on the web server. HTTP is not FTP.
If you feel a new feature should be added to HTTP, suggest it on the ietf-http-wg working group mailing list and it might be accepted in HTTP 1.2.
Someone please give him an extra point for the 404 message; it's a classic. I'm saving that just for future reference.
A firewall can not protect you from yourself. Turn off what you do not need. Do not use the firewall to do your work.
Someone could create a plugin to do this for you -- when you start Netscape or IE, it checks each of your Favorites/Bookmarks. It does this in the background, so it doesn't interrupt what you want to do. If it finds a broken link, it moves it into a Broken Links folder and highlights the folder red so the next time you pull down the Bookmarks/Favorites menu, you'll see that something broke.
This is the least intrusive behavior -- it doesn't need to popup to let you know, because it won't be of concern to you until you actually want to go somewhere else. And by moving them into another folder, it gives you a chance to review them and find out where the right link is, if it's something you still care about. I'd pay money to see something like this developed.
--
I feel fantastic, and I'm still alive.
I once made such a short Perl script to check the links on my own web pages: http://www.iki.fi/kaip/linkkuri.html
Well, it's a funny world, I guess.
Another use for a database like this is to warn webmasters. You would go to the company to tell everyone who links to you that your site is moving before it actually does. You type in the URL you want to tell people about. It does a cross reference and returns all the other sites that link to it. Then either it would return the addresses to you or it could send them an e-mail by itself. This way people won't get the "We've moved" page. This would be good because even if you did move and left a "We've moved" page linking (and/or redirecting) the person to the new site, the page exists but the link is still outdated.
Will linkguard be monitoring these various searchengines/portals/whatever? Seems to me many of the 'broken links' I find are actually links from places like Yahoo, Lycos, Hotbot, etc. Perhaps if these companies did periodic link checking, many 404s would be eliminated from the web.
creation science book
I think that the most efficient place to do something like this would be at the search engine level. When you are at a company's page, they should have their own software to do that, not some central 40tb database; that's just a waste. But each search engine company could put something along these lines into place, i.e. check a link when someone uses it. Is it good? Then good. Is the link bad? Then put it in a test-again-later queue.
At least, that's what I think.
.mincus
I don't see how Google can store web pages locally. Most pages are copyrighted so it wouldn't work out legally.
http://www.google.com/search?q=cache:www.microsoft.com/+microsoft&hl=en
Need I say more?
--Remove SPAM from my address to mail me
Sorry, File Not Found 404's vanish for good File still not found
First of all, it would be a lot easier to log all outgoing 404s from your server, and notify the system admin when they occur. It sounds to me like these guys were just looking for the most expensive way to do this, and consequently, get the most VC money. Hell, if I told a VC that I was going to write a shell script to record outgoing 404s, I wouldn't even be able to buy a new Porsche!
For those of you who want to try, here's a starting point:
grep "File does not exist:" error_log > 404s_log
;)
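The grep above only collects the raw lines. As a rough sketch of the next step, the snippet below counts how often each missing path shows up, so the worst offenders float to the top. The log line shape is an assumption (Apache 1.3 style error_log entries), and the sample lines and the name count_missing are made up for illustration.

```python
import re
from collections import Counter

# Assumed error_log line shape (Apache 1.3 style); adjust the pattern if yours differs.
# Paths containing spaces would be cut short by \S+.
MISSING = re.compile(r"File does not exist: (\S+)")


def count_missing(lines):
    """Return a Counter mapping each missing path to how often it was requested."""
    counts = Counter()
    for line in lines:
        match = MISSING.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Invented sample lines standing in for a real error_log.
sample = [
    "[Wed Oct 11 14:32:52 2000] [error] [client 10.0.0.1] File does not exist: /var/www/old.html",
    "[Wed Oct 11 14:33:10 2000] [error] [client 10.0.0.2] File does not exist: /var/www/old.html",
    "[Wed Oct 11 14:35:01 2000] [error] [client 10.0.0.3] File does not exist: /var/www/pics/cat.gif",
    "[Wed Oct 11 14:36:40 2000] [notice] caught SIGTERM, shutting down",
]

for path, hits in count_missing(sample).most_common():
    print(hits, path)
```

To run it against a real log, pass an open file object to count_missing instead of the sample list.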
Gah
Yeah, these methods are kind of a pain in the ass, but they're only worse if this new plan can do no better. But look at it -- they want every single link out there to be rewritten to their spec. Who does that help? Millions of web authors out there trying to rewrite documents to filter into this (really small) database can't possibly be easier than having the much smaller set of web admins adding server redirects whenever they notice more than a handful of 404 errors on the same document.
DO NOT LEAVE IT IS NOT REAL
Two that come to mind are Dreamweaver and Frontpage. They'll check your entire site for you and display broken links.
"We're sorry, but the website you're trying to reach has been disconnected."
I agree that this is something that a webmaster should deal with on his own site, individually. It's like printing company business cards with telephone numbers that don't exist on them, it gives an unprofessional feel to the whole site. However, on complex, dynamic and interactive user-based pages, this might not always be possible, unless multiple Webmasters are constantly monitoring the site. So I think that while a webmaster should be obliged to look after broken links - and other aspects of website care, in some cases tools like this will be beneficial.
"A few atoms won't even light a match" - Dr Jones, 1933
But, I think that this one is much more fun, in a clever little funny on IE funny on Lynx kinda way....
DO NOT LEAVE IT IS NOT REAL
Where did common sense go???
It's not even close to using Robust Hyperlinks (nobody wants to use them or understand them). The web is created by the lowest denominator.
This is a perfectly valid approach given _people_are_lazy_. It HAS to be done by a third party or it will never be done.
I can't believe some of the ridiculous comments...
"any good programmer"...etc. Most webpages are not even given a second thought after being created by an everyday joe who struggles to grasp HTML or, more importantly, DOESN'T CARE. Think it through.
Often wrong but never in doubt.
I am Jack9.
Everyone knows me.
Just look into the backup aspects, which are certainly not trivial: Let's say you have a fiber channel link to the backup sub system and the database in question is a good backup citizen and handles 80 Gbyte per hour.
Believe me, that's darn good throughput and rarely achieved in the real world. Go calculate.
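For anyone who would rather not do the calculation by hand, here is a minimal sketch using only the two figures from the post (40 terabytes, 80 gigabytes per hour). It assumes decimal units, 1 TB = 1000 GB; with binary units the answer comes out at 512 hours instead.

```python
# Figures from the post: a 40-terabyte database backed up at 80 gigabytes per hour.
# Assumption: decimal units (1 TB = 1000 GB).
DATABASE_TB = 40
THROUGHPUT_GB_PER_HOUR = 80

hours = DATABASE_TB * 1000 / THROUGHPUT_GB_PER_HOUR
days = hours / 24

print(f"{hours:.0f} hours, or about {days:.1f} days, for one full backup")
```

That is roughly three weeks of uninterrupted streaming for a single full backup, at a throughput the post already calls optimistic.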
What makes me shudder most about the story is the (Err, yessir; you know we had this incredible stoopid .COM biznes model idea, collected data for a few months and - sheesh I tell ya boy - were we stunned that we suddenly sat on 40 terabytes of data...) approach of database engineering.
I've seen a lot of outrageously dumb approaches in database design and engineering. But those blokes really deserve a top slot in the list.
ich bin der musikant
mit taschenrechner in der hand
kraftwerk
Or better, Linkguard will work with Netscape and Microsoft to have the browsers automatically redirect you to companies who have paid money to have 404's intercepted and -- instead of redirecting you to the original site as the designer intended, will steal you away to some big corporate website.
"Jeeze, every time I run into a 404, I wind up at eBay.com!"
---
icq:2057699
seumas.com
Paying a service to do it when I can buy an app, schedule it to run overnight, and have reports generated in the morning, strikes me as silly.
I agree that fun 404s have become a nice amusement on the web. At least they avoid the two biggest problems with standard ones: telling people to contact the sysadmin, especially on a many-user machine, and telling people they must've typed something wrong, when people almost never type URLs.
That will only check outbound links. That's not the problem that's being solved. The problem is checking for inbound links. That is, links on other people's websites to your website. That isn't easily solved with a short script.
-- Abigail
Most of the suggested ideas for this "protocol" are already part of the HTTP protocol.
WHERE /baz.html
Not at all needed. That is basically what HTTP does. The URL name space is just a name space. The only relationship between a URL and a file is whatever is dictated by local policy.
100 /baz.html
That's basically the 200 HTTP status code.
101 /baz/1.html
102 http://foo.baz.org/bar.html
HTTP doesn't make a needless difference between moved to the same server or a different server. It does however make a difference between moved permanently and moved temporarily. Status codes 301 and 302.
103 KILLED
Status code 410.
200 DOESNT EXIST
Status code 404.
300 SERVER ERROR
That's the 5xx category of status codes. There are also the 4xx status codes, if the problem is with the request itself.
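The correspondence spelled out so far can be put in a small lookup table. This is only an illustrative sketch: the names PROPOSED_TO_HTTP and http_equivalent are made up, both "moved" codes are shown as 301 (302 would be the choice for a temporary move), and 500 stands in for the whole 5xx category.

```python
from http import HTTPStatus

# Proposed location-protocol reply code -> closest existing HTTP status.
PROPOSED_TO_HTTP = {
    100: HTTPStatus.OK,                     # document is where the URL says it is
    101: HTTPStatus.MOVED_PERMANENTLY,      # moved elsewhere on the same server
    102: HTTPStatus.MOVED_PERMANENTLY,      # moved to another server
    103: HTTPStatus.GONE,                   # removed for good
    200: HTTPStatus.NOT_FOUND,              # never existed
    300: HTTPStatus.INTERNAL_SERVER_ERROR,  # generic server-side failure (5xx)
}


def http_equivalent(proposed_code):
    """Return the HTTP status line fragment matching a proposed reply code."""
    status = PROPOSED_TO_HTTP[proposed_code]
    return f"{status.value} {status.phrase}"


for code in sorted(PROPOSED_TO_HTTP):
    print(code, "->", http_equivalent(code))
```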
Maybe, also, there should be another client command, SEARCH, to find any/all occurrences of a file name, like: SEARCH bar.html
That doesn't make sense to put in a protocol, as URLs do *not* point to files. An HTTP server *might* map it to a file, but that's outside the domain of the URL name space. Furthermore, since the URL name space is infinite, the result of such a search command could be an infinite list as well.
However, it isn't hard to put such functionality in your HTTP server. For instance, the server can be instructed to do a search when encountering the request for /SEARCH/bar.html.
And a directory list, too... Client says: LIST /
Again, that doesn't fit in the current standards for the same reason. But note that many HTTP servers have this feature already.
I have begun to write this all up. Anyone who wants to help, visit my web site, find my e-mail address, and tell me you wish to help with this protocol.
I strongly suggest you don't trouble yourself with making the effort. You start with the wrong idea, that URLs map to files, and most of the requested functionality has already been taken care of in the HTTP standard.
Ref: RFC 2068
-- Abigail
www.linkguard.com. Seems to look fairly simple actually, they just have massive hdds. Gits.
"Elmo knows where you live!" - The Simpsons
Can they cure this link?
It's like name resolving using only a single DNS server.
It shouldn't be too difficult to write a short script that would check all the links on one's web site periodically. Using a 40 terabyte database to effect this is insanely ineffective.
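As a rough idea of what such a short script might look like, here is a sketch in Python using only the standard library. It pulls the http/https links out of one page and reports those that answer with a 4xx/5xx status or cannot be reached at all. The function names are made up, HEAD requests are an assumption (some servers refuse them), and a real checker would also want politeness delays and robots.txt handling.

```python
from html.parser import HTMLParser
from urllib.error import HTTPError
from urllib.parse import urljoin
from urllib.request import Request, urlopen


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            url = urljoin(self.base_url, href)
            # Skip mailto:, ftp: and other schemes this sketch cannot check.
            if url.startswith(("http://", "https://")):
                self.links.append(url)


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def fetch_status(url, timeout=10):
    """Return the final HTTP status for url, or None if it could not be reached."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as response:
            return response.status
    except HTTPError as error:
        return error.code   # 404, 410, 500, ...
    except OSError:
        return None         # DNS failure, refused connection, timeout


def broken_links(html, base_url, status_of=fetch_status):
    """Return (url, status) pairs for links that look dead."""
    results = []
    for url in extract_links(html, base_url):
        status = status_of(url)
        if status is None or status >= 400:
            results.append((url, status))
    return results
```

Calling broken_links(page_source, "http://www.example.com/") from a nightly cron job would cover the "periodically" part. The status_of parameter exists so the checker can be tried out without touching the network.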
With a simple cronjob and Perl's wonderful LWP module package, not to mention the other implementations of tracking web pages, any relatively smart administrator should already be doing this. It comes down to this: programmers are lazy and that is good, but is this just too lazy? phooey. Maybe this should be done as an apache module .. hrmm... maybe i should write that one.. mod_url_validator
<Location />
Add-handler Check-Links
</Location>
or something like that... no i dont like it. too much overhead. well at least my first offer works, because i use it.
Sounds like a service that many site-scanning search engine companies could provide. Of course, those facilities which scan more often would provide faster service to the subscribing webmasters -- we've all seen out-of-date search results.
Seriously, don't they already have the database?
There are two problems that this thing could never catch with dynamically generated pages. One is the famous "missing include" problem, which usually looks something like this in the middle of the page: [Unable to process directive] - not a 404, as the page renders, but definitely a Bad Thing. The second problem is that of 404's which appear and disappear at random - like doing a sitewide search & replace across a hundred include files - that has a tendency to lock files, producing share violations, which in turn result in 404's. So the database can't have 100% integrity.
The other problem is the rapid amount of turnover on the web - millions of pages are appearing and disappearing every day. Those quantum people thought virtual particles were odd - try tracking down the same piece of information you found in a search engine 2 weeks ago!
The second problem is operator/server/network errors. I've seen misconfigured proxies that mangle the URL and produce 404's when the page is there.. I've seen people make typos in the URL field of their browser (and then report it to me!), hell.. I've seen the 'net itself eat a few pages. All of this increases entropy in the database.
Finally.. I like seeing the occasional cool 404 error. Take this one, from my server:
Not a bad idea. What happens if the server is temporarily down, however? You don't want everyone taking down their links if it is just down for a day or so because of technical difficulties. This service would get annoying if someone had a lot of links on their site and a few were randomly down on that particular day.
"I have not failed. I've simply found 10,000 ways that won't work." --Thomas Edison
They are just trying to create a market for themselves. Trying to keep tabs on the whole internet just in case a page moves every now and again is a silly idea. The right approach is for each site author to use cron to check the links every now and again. As soon as a page moves, update your link.
If you want to make sure that you don't break any links when you move your website, all you have to do is consult your HTTP logs, pull out all the lines starting "Referer:", and remove the duplicates.
Tarsnap: Online backups for the truly paranoid
I like that bit about cataloging pages with a five-word "lexical" signature based on words that appear mainly only on that page. How are they going to deal with the 5,000,000,000 web pages that contain only the word "porn"?
This is a government conspiracy. I'm surprised none of you (especially Signal 11) picked up on it right away.
Any webmaster worth his weight in HTML can use LWP or even a simple GUI-based Xenu (freeware linkchecker) to check on the current status of links on their site, and elsewhere.
The only obvious benefit to something like Linkguard is for the government to keep track of you. "You have 20% dead links on your site? Bad webmaster -- BAD!"
Next thing you know, your name is published in the paper, your wife leaves you, your house is foreclosed, and your children are taken away from you and put into foster care, with a family who does know how to maintain their links.
---
icq:2057699
seumas.com
- There's a free service that already does this. They do it even if you don't ask them to, then send you spam telling you about broken links on your site.
- And there's Alexa, which really does archive the Web so that you can find old pages.
- Personally, I like the link checker in Dreamweaver. It's very well integrated with the site maintenance tools.
Probably the biggest source of bad links is unmaintained "favorite links" lists. That's something that needs a simple tool. If the major free-web-page sites provided something, that would probably cut the number of dead links substantially.
HyperG.
Cooperative networked multi-media via ad hoc networked independent locally caching nodes that make up a distributed database system.
There are a few good books on it for those who are interested out there and source code is freely available.
Unfortunately the "minimally functional/maximally stupid fragily linked file server" solution of HTTPD got too established too quickly and HyperG couldn't penetrate.
Once again, better technology proves not to be the deciding point in the market.
It seems that I am in the wrong. But does anyone know how to get FrontPage extensions working with Apache 1.3.12 on Windows 98? That's why the LIST existed in this idea, and the actual purpose behind this.
Chris 'coldacid' Charabaruk Meldstar Entertainment
Eventually Linkguard is planning to use discrete software programs called agents to watch links and tell the webmasters of any affected sites when they are updated or changed.
By "agents" they mean "bots", I suppose.
Now, if it takes 40 terabytes (roughly 41,943,040 megabytes, I believe) to document all the links on the web, how much more space will be needed to keep contact info on all those links? Plus, how efficient will these agents be? I'm not so hot on the idea of bots constantly poking around my lil' Network, checking that all my links are okay.
And will these bots follow the robots.txt rules? I know plenty of sites which revoke all robots, so the "agents" would be useless anyway... Nice idea, but sounds a bit invasive.
Plus this line below:
If the destination page disappears, search engines that can use these signatures would try to find the relevant signature and relocate the page.
Oh, so now you're relying on search engines to get the links right... hm...I'll stick to manually checking them myself, thankyouverymuch.
---
"Okay, who taught the cat how to type ctrl alt delete?"
In a recent New Scientist article (sorry, can't source it better than that) I read that Google has "4000 linux computers each with 80gb of diskspace". Well, that works out to be 312tb and that's only to index a small portion of the web - 20% is it? (as I understand it, Google stores the whole webpage in order to serve a cached version on demand). So, how could 40tb be enough? Even assuming that this new company compressed the data to a higher degree than Google (which needs to serve pages fast), this just couldn't be enough to be useful.
Gimmick or badly planned...whichever.
--Remove SPAM from my address to mail me
The Network Working Group is working on a replacement for URLs -- Uniform Resource Names. URNs are intended to serve as persistent, location-independent resource identifiers and are designed to make it easy to map other namespaces (which share the properties of URNs) into URN-space.
Just how does this `company' plan to make money?
Will they email the companies saying "at least
one of your links is down; send us a cheque for
x00000 pounds and we'll tell you which"?
Just how?
arnald
Think of all the times you _had_ to take down a web page because it had misinformation, because it broke a copyright or some other reason... I have already had this problem with google, I think it'll be even worse w/ this service.
-- these are only opinions and they might not be mine.
Nope
1 gig = 1,000 meg.
40 gig = 40,000 meg.
1 tb = 1,000 gig
so 40*1000*1000
40 tb = 40,000,000 megabyte
Of course this is with 1 kb equaling 1000 bytes, not 1024. So add two terabytes roughly.
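The two ways of counting are easy to put side by side. A quick sketch, following the megabyte arithmetic used in this thread (multiplying by 1000 or by 1024 at each step from terabytes down to megabytes):

```python
TERABYTES = 40

decimal_mb = TERABYTES * 1000 * 1000  # 1 TB = 1000 GB, 1 GB = 1000 MB
binary_mb = TERABYTES * 1024 * 1024   # 1 TB = 1024 GB, 1 GB = 1024 MB

print(f"decimal: {decimal_mb:,} MB")
print(f"binary:  {binary_mb:,} MB")
print(f"gap:     {binary_mb - decimal_mb:,} MB")
```

The gap of 1,943,040 MB is where the "two terabytes roughly" comes from.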
i have misplaced my signature.
To me the article seemed to stress not broken links among your own page, but broken links from other people's pages to your own, thus causing you to lose out on visitors coming to your site from others.
However, this seems like it could also be done on the local side, by logging the http-referer so you can keep track of any pages that a lot of your visitors seem to be coming from and then notifying them if/when you change your URLs.
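A minimal sketch of that local approach: read the access log, and for each of your pages collect the outside pages that sent visitors to it. It assumes the server writes Apache's "combined" log format (which includes the Referer header); the function name, host names, and sample lines are invented for illustration.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

# Assumed combined log format:
# host ident user [time] "METHOD path protocol" status bytes "referer" "user-agent"
COMBINED = re.compile(r'"(?:GET|HEAD|POST) (\S+)[^"]*" \d{3} \S+ "([^"]*)"')


def inbound_referers(lines, own_host):
    """Map each requested path to the set of external pages that linked to it."""
    pages = defaultdict(set)
    for line in lines:
        match = COMBINED.search(line)
        if not match:
            continue
        path, referer = match.groups()
        if referer in ("", "-"):
            continue  # typed URL, bookmark, or a browser that sends no referer
        if urlparse(referer).hostname == own_host:
            continue  # internal navigation, not an inbound link
        pages[path].add(referer)
    return pages


# Invented sample lines standing in for a real access_log.
sample = [
    '10.0.0.1 - - [11/Oct/2000:14:32:52 -0700] "GET /old.html HTTP/1.0" 200 2326 "http://example.org/links.html" "Mozilla/4.7"',
    '10.0.0.2 - - [11/Oct/2000:14:33:05 -0700] "GET /old.html HTTP/1.0" 200 2326 "http://www.example.com/index.html" "Mozilla/4.7"',
    '10.0.0.3 - - [11/Oct/2000:14:34:41 -0700] "GET /index.html HTTP/1.0" 200 512 "-" "Lynx/2.8"',
]

for path, referers in inbound_referers(sample, "www.example.com").items():
    print(path, sorted(referers))
```

This only finds linking pages that someone has actually clicked through from, so rarely used links stay invisible until their first visitor.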
Compression compression and more compression...
i have misplaced my signature.
Even with 40 terabytes, I don't see how it can find every possible link. Well, I appreciate them trying. I've been informed before about a broken link on my page by an automated bot, and I did fix it. So, I'm all for it, even though I can't see very much of a point. Well, we'll see.