Fixing Broken Links With the Internet Archive

Posted by Soulskill on Friday January 24, 2014 @08:55AM from the maintain-URIs-or-T.B-L.-will-beat-you-up dept.

eggboard writes "The Internet Archive has copies of Web pages corresponding to 378 billion URLs. It's working on several efforts, some of them quite recent, to help prevent or work around link rot, where links go bad over time. Through an API for developers, WordPress integration, a Chrome plug-in, and a JavaScript lookup, the Archive hopes to help people find at least the most recent copy of a missing or deleted page. More ambitiously, they instantly cache any link added to Wikipedia, and want to become integrated into browsers as a fallback rather than showing a 404 page."
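
For anyone curious what a developer lookup might look like: the sketch below uses the Wayback Machine's availability endpoint, which may or may not be the API the summary has in mind (that part is an assumption). The helper names `availability_query` and `closest_snapshot` are made up for illustration, and the sample response is hypothetical, shaped like the JSON that endpoint is documented to return. No network call is made here.

```python
import json
from urllib.parse import urlencode

# Availability endpoint as publicly documented; verify the response
# shape yourself before depending on it (assumption).
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(page_url: str) -> str:
    """Build the lookup URL for a page that may have gone missing."""
    return AVAILABILITY_ENDPOINT + "?" + urlencode({"url": page_url})


def closest_snapshot(response_body: str):
    """Return the URL of the closest archived copy, or None if there is none."""
    data = json.loads(response_body)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


# Hypothetical response body, not real data fetched from the Archive.
sample = json.dumps({
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20130919044612/http://example.com/",
            "timestamp": "20130919044612",
            "status": "200",
        }
    }
})

print(availability_query("http://example.com/old"))
print(closest_snapshot(sample))
```

A caller would fetch the query URL with whatever HTTP client it prefers and pass the body to `closest_snapshot`, falling back to the ordinary error page when it returns `None`.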

Re:No. 404 is important! by Sarten-X · 2014-01-24 09:10 · Score: 4, Insightful

Supply HTTP code 404, and provide the content of the old page, preferably with a large banner saying "we couldn't find it, but here's what we had before".
I believe that meets all applicable standards. Automated systems should recognize the 404 code, and human systems (which won't likely see the underlying code) will see the banner.
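
A minimal sketch of that idea, assuming the server already has the archived HTML in hand: the status stays 404 for automated clients, and the body carries a banner plus the old content for humans. The function name `not_found_with_archive` and the `archive-banner` class are hypothetical, and how the archived copy is obtained is left out.

```python
from html import escape
from http import HTTPStatus


def not_found_with_archive(requested_path: str, archived_html: str):
    """Build a 404 response whose body shows a banner and the old page."""
    banner = (
        "<div class='archive-banner'>"
        f"We couldn't find {escape(requested_path)}, but here's what we had before."
        "</div>\n"
    )
    body = (banner + archived_html).encode("utf-8")
    headers = {
        "Content-Type": "text/html; charset=utf-8",
        "Content-Length": str(len(body)),
    }
    # The numeric code is what crawlers and tests see; the body is for people.
    return HTTPStatus.NOT_FOUND.value, headers, body


status, headers, body = not_found_with_archive("/old-page", "<p>old content</p>")
print(status)
print(body.decode("utf-8"))
```

The requested path is escaped before it goes into the banner, since it comes straight from the client.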

--
You do not have a moral or legal right to do absolutely anything you want.
Re:No. 404 is important! by bill_mcgonigle · 2014-01-24 09:30 · Score: 5, Informative

Chillax, dude, it's simply a matter of implementation and preferences.
While archive.org might think this is a new idea, I've been using the Errorzilla mod for the better part of a decade. When a 404 is encountered, you get the regular error page, and then it adds some buttons that let you look at the Google cache, Coral cache, Wayback archive, etc.
Quite useful and non-harmful.

--
My God, it's Full of Source!
OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
Re:No. 404 is important! by Minwee · 2014-01-24 09:43 · Score: 4, Insightful

Sorry but that violates the standard as well. It must return a 404 or you break testing.
RFC 2616 mandates a 4xx error code followed by an optional human-readable reason phrase. While the reason phrase is usually "Not Found" for a 404 error, there's nothing keeping it from being augmented by "...but a copy of a previous version is over there."
If your testing relies on anything beyond the numeric error code, then it's probably already broken.
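
To illustrate the point about testing, here is a rough sketch of a check that looks only at the numeric code in an HTTP/1.x status line and ignores the reason phrase. The helper names `status_code` and `is_not_found` are made up for the example.

```python
def status_code(status_line: str) -> int:
    """Extract the numeric status code from an HTTP/1.x status line."""
    # Layout: HTTP-Version SP Status-Code SP Reason-Phrase
    parts = status_line.split(" ", 2)
    return int(parts[1])


def is_not_found(status_line: str) -> bool:
    """True when the status code is 404, whatever the reason phrase says."""
    return status_code(status_line) == 404


plain = "HTTP/1.1 404 Not Found"
augmented = "HTTP/1.1 404 Not Found, but a copy of a previous version is over there"

print(is_not_found(plain), is_not_found(augmented))
```

Both lines are treated the same, so a test written this way keeps passing if the reason phrase is augmented.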
Re:No. 404 is important! by SunTzuWarmaster · 2014-01-24 10:26 · Score: 4, Interesting

So let's say that my company has three lines of products on three different webpages. We decide to discontinue two of the lines of products for being unprofitable, and remove the pages. Google search results still show the pages, and archive.org still shows them to users. These products are still shown to my potential customers, who experience frustration when they attempt to get them.
Alternately, I create a temporary webpage for displaying some demo content to a potential client. It is a demo page, and riddled with bugs, holes, and other areas that need improvement. Archive.org still shows this page as part of search results? What will potential clients think of my company, given that it put up a buggy/terrible page?
Alternately, let's just say that I rename a longstanding webpage (technology.slashdot.org to tech.slashdot.org) and delete the old URL. Should archive.org redirect to false content?
Or, let's say that my restaurant decides to take down its 2013menu.html page, and doesn't wish customers to be able to compare its new and old menu side by side to see where prices inflated.
Error messages have purpose. While the most common case is that the page/server went offline, there are many times where a page URL changes as a result of regular website updates, where you don't want users to obtain old content.
Sometimes things are deleted for a reason.