Broken Links No More?
johndoejersey writes "Students in England have developed a tool which could bring an end to broken links. Peridot, developed by UK intern students at IBM, scans company weblinks and replaces outdated information with other relevant documents and links. IBM have already filed 2 patents for the project. The students said Peridot could protect companies by spotting links to sites that have been removed, or which point to wholly unsuitable content. 'Peridot could lead to a world where there are no more broken links,' James Bell, computer science student at the University of Warwick, told BBC News Online. Here is another story on it." See also the BBC story.
Suppose you have a broken link, http://somesite.com/foo/bar.html. Some sites respond with a list of search results from within 'somesite.com' matching 'foo' or 'bar'. Quite clever, and much more useful than a plain old 'page not found' error.
This just takes that one step further by doing the searching at the referring end instead.
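A rough sketch of that "search on the path words" idea, in Python. This is only an illustration, not how Peridot or any particular site does it; `path_terms`, `suggest` and the `known_pages` dictionary (URL mapped to page text) are made-up names, and the matching is plain substring matching.

```python
import re
from urllib.parse import urlparse

def path_terms(url):
    """Split the path of a URL into lowercase words, dropping the file extension."""
    path = urlparse(url).path
    path = re.sub(r"\.[A-Za-z0-9]+$", "", path)  # strip a trailing ".html" etc.
    return [t for t in re.split(r"[^A-Za-z0-9]+", path.lower()) if t]

def suggest(broken_url, known_pages, limit=5):
    """Rank known pages by how many terms from the broken URL they contain."""
    terms = path_terms(broken_url)
    scored = []
    for page_url, text in known_pages.items():
        haystack = (page_url + " " + text).lower()
        score = sum(1 for t in terms if t in haystack)
        if score:
            scored.append((score, page_url))
    # Best score first; ties are broken alphabetically so the order is stable.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [page_url for _, page_url in scored[:limit]]
```

For http://somesite.com/foo/bar.html the terms are 'foo' and 'bar', so any known page mentioning either one becomes a candidate. A real site would use its own search index instead of a dictionary of page text.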
Like tinyurl, but one letter less! http://qurl.co.uk/
Remember Google-hacks at http://johnny.ihackstuff.com/? Basically, since Google effectively snoops millions of servers, you can use this information to break into servers and get information. Having an internal feature that connects broken links to real pages may be orders of magnitude worse. What if I imaginatively "linked" to a made-up URL to see what's on your servers? This could be bad news if it's effectively done.
A NYC lawyer blogs. http://www.chuangblog.com/
Hey... we had this kind of feature on the internet before
http://www-ai.cs.uni-dortmund.de/DOKUMENTE/malzahn_2003a.pdf
Basically, the thesis evaluates different methods to build a kind of "finger-print" of a page. The finger print is used to find the page with google if it is gone, or has changed significantly.
The Internet Archive's Wayback Machine was used to learn how to distinguish pages that have disappeared from pages that merely change slightly over time.
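To give a feel for what such a "finger-print" might look like, here is a small Python sketch that picks the handful of words that are common on the page but rare elsewhere, which could then be fed to a search engine as a query. This is a generic TF-IDF-style weighting chosen for illustration, not the method evaluated in the thesis; `signature` and `corpus_texts` are hypothetical names.

```python
import math
import re
from collections import Counter

def words(text):
    """Lowercase alphanumeric tokens of a text."""
    return re.findall(r"[a-z0-9]+", text.lower())

def signature(page_text, corpus_texts, size=5):
    """Pick the `size` terms that are frequent in the page but rare in the corpus."""
    doc_freq = Counter()
    for text in corpus_texts:
        doc_freq.update(set(words(text)))  # count each term once per document
    n_docs = len(corpus_texts)
    counts = Counter(words(page_text))

    def weight(term):
        # Term frequency times a smoothed inverse document frequency.
        return counts[term] * math.log((1 + n_docs) / (1 + doc_freq[term]))

    # Highest weight first; ties are broken alphabetically.
    return sorted(counts, key=lambda t: (-weight(t), t))[:size]
```

The resulting terms survive small edits to the page, which is the property a plain checksum of the page lacks.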
It wouldn't take long to write a script to find all the broken links on a page.
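Something like this, for instance. A minimal Python sketch using only the standard library; the function names are made up for the example, it only looks at `<a href>` links, and it treats any status of 400 or above (or no response at all) as broken.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
import urllib.error
import urllib.request

class LinkCollector(HTMLParser):
    """Collect the absolute targets of all <a href="..."> tags."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def extract_links(html, base_url):
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links

def http_status(url, timeout=10):
    """Return the HTTP status for url, or None if the request fails outright."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code
    except (urllib.error.URLError, OSError, ValueError):
        return None

def broken_links(html, base_url, status_of=http_status):
    """Return (link, status) pairs for links that look broken."""
    broken = []
    for link in extract_links(html, base_url):
        if not link.startswith(("http://", "https://")):
            continue  # skip mailto:, javascript:, etc.
        status = status_of(link)
        if status is None or status >= 400:
            broken.append((link, status))
    return broken
```

Some servers refuse HEAD requests, so a more careful version would retry with GET before calling a link dead. The `status_of` parameter is there so the checker can be swapped out or tested without touching the network.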
Just use Xenu's Link Checker.
The big difference here is that in the case of SiteFinder, Verisign had control over where you ended up for basically the entire internet. This seems like it would be the type of thing that would run as an Apache mod that would get invoked when a 404 gets returned, and so would only affect that particular site. There's a big difference between going to www.linuxdistro.org/whats_new.html and getting redirected to www.linuxdistro.org/whatsnew.php like this would probably do, and going to www.linuxdistor.org (typo intentional) and having Verisign redirect you to www.microsoft.com because they're getting paid to advertise.
do not read this line twice.
So, they've invented sed? 'Cause that's what I've been using for years to replace old/broken links. A simple script using the netsaint/nagios service tests can check if a link is still good and then build a list of bad ones, to be replaced by script number two using sed.
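The "script number two" step might look roughly like this. It is sketched in Python rather than sed, and `replace_links` and the `replacements` mapping are invented for the example; it assumes the list of bad links and their replacements has already been built.

```python
import re

# Matches href="..." or href='...' and captures the quote and the URL.
HREF = re.compile(r'(href\s*=\s*)(["\'])(.*?)\2', re.IGNORECASE)

def replace_links(html, replacements):
    """Swap every href whose value is a key of `replacements` for its new URL."""
    def swap(match):
        prefix, quote, url = match.groups()
        # Only whole attribute values are replaced; unknown URLs are kept.
        return prefix + quote + replacements.get(url, url) + quote
    return HREF.sub(swap, html)
```

Matching the whole attribute value avoids the classic search-and-replace accident where a short old URL is also a prefix of a longer, still-valid one.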
"Have you ever thought about just turning off the TV, sitting down with your kids, and hitting them?"
Just use the W3C's link-checker.
That's what the 410 Gone HTTP status code is for. If only admins would use it more...
For those running a real browser, just make this a link, preferably in your personal tool bar.
javascript:Qr=document.URL;if(Qr=='about:blank'){void(Qr=prompt('Url...',''))};if(Qr)location.href='http://web.archive.org/web/*/'+escape(Qr)
Now when I click on a link that isn't there, I select my Archive search button and it shows me the Wayback Machine's history of that link. Of course it works only if the URL hasn't been modified by the server. If it has, it's another couple of steps (copy link, ^T, archive search, paste URL in pop-up dialog).
If you read the article (the BBC one, which is the only link in there with any relevant information) you'll find that's not how it works. It alerts the webmaster and suggests a replacement, rather than randomly "fixing" other people's pages.
Clippy indeed. Must be a slow news day.
- RLJ
Ideally, cool URIs don't change, but in the real world they do.
If document X moves and the link is invalid, you should be serving an HTTP 301 Moved Permanently redirect; well-behaved user agents will update their bookmarks, and well-behaved content management systems will update their code. If document X is gone, you should be serving an HTTP 410 Gone.
Ideally, 404 is supposed to mean that the web server has never heard of the file in question before, but in the real world...
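In code, the decision is about this simple. A toy Python sketch, not tied to any real web server; `respond` and its three lookup tables are hypothetical and stand in for whatever bookkeeping the site keeps about moved and removed documents.

```python
def respond(path, pages, moved, gone):
    """Return (status, headers) for a request path.

    pages: set of paths that currently exist
    moved: dict mapping an old path to its new path
    gone:  set of paths that were deliberately removed
    """
    if path in pages:
        return 200, {}
    if path in moved:
        # Moved Permanently: tell the client where the document lives now.
        return 301, {"Location": moved[path]}
    if path in gone:
        # Gone: the document existed once and is not coming back.
        return 410, {}
    # Not Found: the server knows nothing about this path.
    return 404, {}
```

The hard part is not the code but keeping the `moved` and `gone` records around after a site reorganisation.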
Why can't I moderate something "Wrong" or at least "Grossly Misinformed"?
There were two fellows at UC Berkeley (Phelps and Wilensky) who implemented the idea of "fingerprinting" web pages at least as far back as 2000. It was a non-trivial fingerprinting (i.e. not just MD5 hash of a web page).
As far as I know, they haven't done any more recent work on this and the software is only available via archive.org.
A paper
I gather that the IBM effort is different in significant respects, but it certainly employs ideas from Phelps & Wilensky.