Broken Links No More?


Posted by ryuzaki0 on Friday September 24, 2004 @01:35AM from the dream-big dept.

johndoejersey writes "Students in England have developed a tool which could bring an end to broken links. Peridot, developed by UK intern students at IBM, scans company weblinks and replaces outdated information with other relevant documents and links. IBM have already filed 2 patents for the project. The students said Peridot could protect companies by spotting links to sites that have been removed, or which point to wholly unsuitable content. 'Peridot could lead to a world where there are no more broken links,' James Bell, computer science student at the University of Warwick, told BBC News Online. Here is another story on it." See also the BBC story.

14 of 212 comments

Not Entirely New by terrencefw · 2004-09-24 01:41 · Score: 3, Informative

I've seen lots of sites that return search results based on bits of the broken link instead of 404s.
Suppose you have broken link http://somesite.com/foo/bar.html, some sites return a list of search results from within 'somesite.com' matching 'foo' or 'bar'. Quite clever, and much more useful than a plain old 'page not found' error.
This just takes that one step further by doing the searching at the referring end instead.
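
A minimal sketch of the "search on bits of the broken link" idea, assuming the site already has some search facility to feed the terms into. The function name `search_terms_from_url` is made up for illustration; it only shows how 'foo' and 'bar' could be pulled out of the example URL above.

```python
import re
from urllib.parse import urlparse, unquote


def search_terms_from_url(url):
    """Split the path of a broken URL into candidate search terms."""
    path = unquote(urlparse(url).path)
    # Drop a trailing file extension such as ".html" before splitting.
    path = re.sub(r"\.[A-Za-z0-9]{1,5}$", "", path)
    # Split on anything that is not a letter or digit and skip empty pieces.
    return [term.lower() for term in re.split(r"[^A-Za-z0-9]+", path) if term]


print(search_terms_from_url("http://somesite.com/foo/bar.html"))
```

The resulting terms would then go to whatever site search is available; that part is left out because it depends entirely on the site.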

--
Like tinyurl, but one letter less! http://qurl.co.uk/
Vulnerability? by darkmeridian · 2004-09-24 01:55 · Score: 2, Informative

Remember Google-hacks at http://johnny.ihackstuff.com/? Basically, since Google effectively snoops millions of servers, you can use this information to break into servers and get information. Having an internal feature that connects broken links to real pages may be orders of magnitude worse. What if I imaginatively "linked" to a made-up URL to see what's on your servers? This could be bad news if it's effectively done.

--
A NYC lawyer blogs. http://www.chuangblog.com/
Re:Great by tannnk · 2004-09-24 01:58 · Score: 3, Informative

Hey... We had this kind of feature on the internet before

--
T!
German readers... by dukoids · 2004-09-24 02:12 · Score: 5, Informative

may want to take a look at the master's thesis of Nils Malzahn (from 2003, in German) to see in detail how this can actually work:
http://www-ai.cs.uni-dortmund.de/DOKUMENTE/malzahn_2003a.pdf
Basically, the thesis evaluates different methods of building a kind of "fingerprint" of a page. The fingerprint is used to find the page with Google if it is gone or has changed significantly.
The Internet Archive's Wayback Machine was used to learn to distinguish pages that have disappeared from pages that change slightly over time.
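
One common way to build such a fingerprint is to keep the few terms that are frequent on the page but rare elsewhere (a TF-IDF style lexical signature). The sketch below is an assumption for illustration, not the method evaluated in the thesis; `fingerprint`, `tokenize` and the tiny reference corpus are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def fingerprint(page, corpus, k=5):
    """Return the k terms of page with the highest TF-IDF score against corpus."""
    tf = Counter(tokenize(page))
    doc_sets = [set(tokenize(doc)) for doc in corpus]
    n = len(corpus)

    def score(term):
        # Document frequency: in how many reference documents the term occurs.
        df = sum(term in words for words in doc_sets)
        # Smoothed inverse document frequency, so unseen terms do not divide by zero.
        return tf[term] * math.log((1 + n) / (1 + df))

    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(tf, key=lambda term: (-score(term), term))[:k]


corpus = ["the cat sat on the mat", "the dog sat on the log"]
page = "the peridot tool fixes the peridot links"
print(fingerprint(page, corpus, k=3))
```

The selected terms could then be submitted as a search query to look for a page that has moved.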
Re:And... ? by pipingguy · 2004-09-24 02:14 · Score: 3, Informative

It wouldn't take long to write a script to find all the broken links on a page.

Just use Xenu's Link Checker.
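
For anyone who does want the script, here is a rough sketch using only the Python standard library. The names `LinkCollector`, `http_status` and `broken_links` are made up for this example, and treating any status of 400 or above (or no response at all) as "broken" is an assumption, not a rule from the thread.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
import urllib.error
import urllib.request


class LinkCollector(HTMLParser):
    """Collect absolute http(s) URLs from the <a href> tags of a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                url = urljoin(self.base_url, value)
                # Skip mailto:, javascript: and other non-HTTP schemes.
                if url.startswith(("http://", "https://")):
                    self.links.append(url)


def http_status(url, timeout=10):
    """Return the HTTP status for url, or None if no response was received."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code
    except OSError:
        # Covers URLError, timeouts and other connection failures.
        return None


def broken_links(html, base_url, status_of=http_status):
    """Return the links in html whose status is missing or >= 400."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    broken = []
    for link in collector.links:
        status = status_of(link)
        if status is None or status >= 400:
            broken.append(link)
    return broken
```

The `status_of` parameter exists so the checker can be tried against a table of canned statuses without touching the network. Some servers reject HEAD requests, so a real tool would probably fall back to GET.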
Re:Take this with a grain of salt by liquidsin · 2004-09-24 02:22 · Score: 2, Informative

The big difference here is that in the case of SiteFinder, Verisign had control over where you ended up for basically the entire internet. This seems like it would be the type of thing that would run as an Apache mod that would get invoked when a 404 gets returned, and so would only affect that particular site. There's a big difference between going to www.linuxdistro.org/whats_new.html and getting redirected to www.linuxdistro.org/whatsnew.php like this would probably do, and going to www.linuxdistor.org (typo intentional) and having Verisign redirect you to www.microsoft.com because they're getting paid to advertise.
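
To make the whats_new.html to whatsnew.php case concrete, here is a small sketch of how a 404 handler might pick the closest known path on the same site. It uses Python's `difflib` rather than an actual Apache module; the function name, the path list and the 0.6 cutoff are assumptions for illustration, and nothing here is taken from how Peridot works.

```python
import difflib


def suggest_replacement(missing_path, known_paths, cutoff=0.6):
    """Return the known path most similar to missing_path, or None."""
    matches = difflib.get_close_matches(missing_path, known_paths, n=1, cutoff=cutoff)
    return matches[0] if matches else None


known = ["/whatsnew.php", "/download.html", "/contact.php"]
print(suggest_replacement("/whats_new.html", known))
```

Because the candidates come only from the site's own list of paths, a lookup like this stays within one site, which is the distinction from SiteFinder being drawn above.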

--
do not read this line twice.
SED? by Kenja · 2004-09-24 02:23 · Score: 2, Informative

So, they've invented SED? 'Cause that's what I've been using for years to replace old/broken links. A simple script using the netsaint/nagios service tests can check whether a link is still good and build a list of bad ones, which script number two then replaces using SED.
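
The second script could look roughly like this. It is written in Python instead of SED so it can be run as-is, and the `replace_links` name and the old-to-new mapping are hypothetical.

```python
import re


def replace_links(text, replacements):
    """Replace every old URL found in text with its mapped new URL."""
    if not replacements:
        return text
    # Try longer URLs first so a URL is not rewritten via a shorter prefix of itself.
    ordered = sorted(replacements, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(url) for url in ordered))
    return pattern.sub(lambda match: replacements[match.group(0)], text)


page = '<a href="http://old.example/a.html">A</a>'
print(replace_links(page, {"http://old.example/a.html": "http://new.example/a"}))
```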

--

"Have you ever thought about just turning off the TV, sitting down with your kids, and hitting them?"
Instead by Dr.+Stavros · 2004-09-24 02:35 · Score: 3, Informative

Just use the W3C's link-checker.
Re:What if the page is deleted, not changed by troon · 2004-09-24 02:41 · Score: 4, Informative

That's what the HTTP 410 Gone response status is for. If only admins would use it more...

--
Ydco co ,df C erb-y go. a Ekrpat t.fxrapev
been there, done that. by Quickening · 2004-09-24 02:43 · Score: 4, Informative

For those running a real browser, just make this a link, preferably in your personal tool bar.

javascript:Qr=document.URL;if(Qr=='about:blank'){void(Qr=prompt('Url...',''))};if(Qr)location.href='http://web.archive.org/web/*/'+escape(Qr)

Now when I click on a link that isn't there, I select my Archive search button and it shows me the Wayback Machine's history of that link. Of course it works only if the url hasn't been modified by the server. If it has, it's another couple of steps (copy link, ^T, archive search, paste url in pop-up dialog).

--
tcboo
Re:Take this with a grain of salt by julesh · 2004-09-24 03:00 · Score: 2, Informative

If you read the article (the BBC one, which is the only link in there with any relevant information) you'll find that's not how it works. It alerts the webmaster and suggests a replacement, rather than randomly "fixing" other people's pages.
scary enough... by Rev.LoveJoy · 2004-09-24 03:13 · Score: 2, Informative

FrontPage has been able to "Scan your web site for broken links" since it first came out in ... what 1997?
Clippy indeed, must be a slow news day,
- RLJ
Re:It will work, but that isn't good, here is why by GeorgeH · 2004-09-24 03:51 · Score: 2, Informative

Ideally, cool URIs don't change, but in the real world they do.

If document X moves and the link is invalid, you should be serving an HTTP 301 Moved Permanently redirect; well-behaved user agents will update their bookmarks, and well-behaved content management systems will update their code. If document X is gone, you should be serving an HTTP 410 Gone.

Ideally, 404 is supposed to mean that the web server has never heard of the file in question before, but in the real world...
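
As a sketch of that policy: assuming the server keeps a record of live paths, moved paths and deliberately removed paths (the three collections below are hypothetical), picking the status is a simple lookup.

```python
from http import HTTPStatus


def status_for(path, live, moved, gone):
    """Return (status, redirect_target) for a requested path."""
    if path in live:
        return HTTPStatus.OK, None
    if path in moved:
        # The document exists elsewhere: 301 plus the new location.
        return HTTPStatus.MOVED_PERMANENTLY, moved[path]
    if path in gone:
        # The document was removed on purpose: 410.
        return HTTPStatus.GONE, None
    # The server has never heard of this path: 404.
    return HTTPStatus.NOT_FOUND, None


live = {"/whatsnew.php"}
moved = {"/whats_new.html": "/whatsnew.php"}
gone = {"/old-promo.html"}
print(status_for("/whats_new.html", live, moved, gone))
```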

--
Why can't I moderate something "Wrong" or at least "Grossly Misinformed"?
predecessor: robust hyperlinks by pangloss · 2004-09-24 07:17 · Score: 2, Informative

There were two fellows at UC Berkeley (Phelps and Wilensky) who implemented the idea of "fingerprinting" web pages at least as far back as 2000. It was a non-trivial fingerprinting (i.e., not just an MD5 hash of a web page).

As far as I know, they haven't done any more recent work on this and the software is only available via archive.org.

A paper

I gather that the IBM effort is different in significant respects, but it certainly employs ideas from Phelps & Wilensky.