Fixing Broken Links With the Internet Archive

Posted by Soulskill on Friday January 24, 2014 @08:55AM from the maintain-URIs-or-T.B-L.-will-beat-you-up dept.

eggboard writes "The Internet Archive has copies of Web pages corresponding to 378 billion URLs. It's working on several efforts, some of them quite recent, to help deter or assist with link rot, when links go bad. Through an API for developers, WordPress integration, a Chrome plug-in, and a JavaScript lookup, the Archive hopes to help people find at least the most recent copy of a missing or deleted page. More ambitiously, they instantly cache any link added to Wikipedia, and want to become integrated into browsers as a fallback rather than showing a 404 page."

Please no? by DMiax · 2014-01-24 09:01 · Score: 3, Insightful

...want to become integrated into browsers as a fallback rather than showing a 404 page
Fuck no. If a page does not exist it does not exist.
Re:No. 404 is important! by Sarten-X · 2014-01-24 09:10 · Score: 4, Insightful

Supply HTTP code 404, and provide the content of the old page, preferably with a large banner saying "we couldn't find it, but here's what we had before".
I believe that meets all applicable standards. Automated systems should recognize the 404 code, and human systems (which won't likely see the underlying code) will see the banner.
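
A minimal sketch of that approach, written as a bare WSGI app. The ARCHIVED_PAGES table, its path, and the banner markup are hypothetical placeholders, not anything the Archive or a browser actually ships. The point is only that the status stays 404 while the body carries the old content:

```python
# Sketch: answer with HTTP 404 while still showing an archived copy.
# ARCHIVED_PAGES and BANNER are hypothetical placeholders for illustration.

ARCHIVED_PAGES = {
    "/old-page.html": "<p>Content of the old page.</p>",
}

BANNER = (
    '<div style="border:2px solid red;padding:1em">'
    "We couldn't find it, but here's what we had before."
    "</div>"
)


def app(environ, start_response):
    """WSGI app that always answers 404, with an archived body if one exists."""
    path = environ.get("PATH_INFO", "/")
    archived = ARCHIVED_PAGES.get(path)
    if archived is None:
        body = "<h1>404 Not Found</h1>"
    else:
        body = BANNER + archived
    payload = body.encode("utf-8")
    # Automated clients see the 404; humans see the banner and the old page.
    start_response(
        "404 Not Found",
        [
            ("Content-Type", "text/html; charset=utf-8"),
            ("Content-Length", str(len(payload))),
        ],
    )
    return [payload]
```

Any WSGI server could host this; a crawler checking the status code still treats the link as dead.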

--
You do not have a moral or legal right to do absolutely anything you want.
Cool! I can stop paying my hosting provider! by barlevg · 2014-01-24 09:18 · Score: 3, Insightful

While I honestly think this is an awesome idea, I wonder, if this takes off, whether anyone who currently pays for web hosting of a static site will decide, "fuck it--it's backed up on Internet Archive. Might as well save the $N a month I pay to maintain the website and lease the domain name."
Re:No. 404 is important! by bill_mcgonigle · 2014-01-24 09:30 · Score: 5, Informative

Chillax, dude, it's simply a matter of implementation and preferences.
While archive.org might think this is a new idea, I've been using the Errorzilla mod for the better part of a decade. When a 404 is encountered, you get the regular error page, and then it adds some buttons that let you look at the Google cache, Coral cache, Wayback archive, etc.
Quite useful and non-harmful.
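
For illustration, here is roughly what the "add some buttons" step could compute for the Wayback case. This is a sketch, not the extension's actual code: archive_links is a made-up helper name, and it assumes the Wayback Machine's URL scheme in which the original address is appended to a web.archive.org/web/ prefix (with "*" asking for the list of all captures).

```python
from urllib.parse import quote

WAYBACK_PREFIX = "https://web.archive.org/web/"


def archive_links(dead_url):
    """Return (label, link) pairs to offer alongside the regular 404 page."""
    # Leave URL delimiters and existing percent-escapes alone;
    # escape spaces and other unsafe characters.
    safe_url = quote(dead_url, safe=":/?&=%")
    return [
        ("Wayback Machine (all captures)", WAYBACK_PREFIX + "*/" + safe_url),
        ("Wayback Machine (most recent)", WAYBACK_PREFIX + safe_url),
    ]


for label, link in archive_links("http://example.com/old page.html"):
    print(label, "->", link)
```

The error page itself is untouched; the links are only extra options for the reader.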

--
My God, it's Full of Source!
OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
Re:No. 404 is important! by Minwee · 2014-01-24 09:43 · Score: 4, Insightful

Sorry but that violates the standard as well. It must return a 404 or you break testing.
RFC 2616 mandates a 4xx error code followed by an optional human readable reason phrase. While the reason phrase is usually "Not Found" for a 404 error, there's nothing keeping it from being augmented by "...but a copy of a previous version is over there."
If your testing relies on anything beyond the numeric error code, then it's probably already broken.
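
A small sketch of a check that depends only on the numeric code. The helper names are hypothetical; the parsing assumes the HTTP/1.x status-line layout of version, space, three-digit code, space, reason phrase:

```python
def status_code(status_line):
    """Extract the numeric code from an HTTP/1.x status line."""
    # Layout: HTTP-Version SP Status-Code SP Reason-Phrase
    parts = status_line.strip().split(" ", 2)
    if len(parts) < 2 or len(parts[1]) != 3 or not parts[1].isdigit():
        raise ValueError(f"malformed status line: {status_line!r}")
    return int(parts[1])


def is_not_found(status_line):
    """True for any 404, whatever the reason phrase says."""
    return status_code(status_line) == 404


plain = "HTTP/1.1 404 Not Found"
augmented = "HTTP/1.1 404 Not Found, but a copy of a previous version is over there"
assert is_not_found(plain) and is_not_found(augmented)
```

Both status lines pass the same test, because the reason phrase is never inspected.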
Re:No. 404 is important! by amorsen · 2014-01-24 09:47 · Score: 3, Informative

The only way this can be implemented without causing problems for others is to have it be an option in the browser for those who want it to do the additional lookup.
That is the proposal. The browser does it. The web server still returns 404, so your code does not have to work around anything. This is not the NXDOMAIN redirection fiasco.
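
A rough sketch of the client-side decision, assuming the browser has already received the server's status code and has separately asked an archive lookup service about the URL. The JSON shape below is modeled on the Archive's availability API as I understand it ("archived_snapshots" containing a "closest" entry); treat the field names, the fallback_url helper, and the SAMPLE payload as assumptions for illustration.

```python
import json


def fallback_url(status, availability_json):
    """Return an archived snapshot URL for a 404, or None to show the plain error."""
    if status != 404:
        return None  # only a genuine 404 from the server triggers the lookup
    data = json.loads(availability_json)
    closest = data.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None  # nothing archived: fall through to the normal 404 page
    return closest.get("url")


# Illustrative payload only, not a recorded response.
SAMPLE = """{"archived_snapshots": {"closest": {
    "available": true,
    "url": "http://web.archive.org/web/20130919044612/http://example.com/",
    "timestamp": "20130919044612",
    "status": "200"}}}"""

print(fallback_url(404, SAMPLE))
```

Nothing changes on the server side: scripts and crawlers that never run this lookup keep seeing an ordinary 404.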

--
Finally! A year of moderation! Ready for 2019?
Re:No. 404 is important! by SunTzuWarmaster · 2014-01-24 10:26 · Score: 4, Interesting

So let's say that my company has three lines of products on three different webpages. We decide to discontinue two of the lines of products for being unprofitable, and remove the pages. Google search results still show the pages, and archive.org still shows them to users. These products are still shown to my potential customers, who experience frustration when they attempt to get them.
Alternately, I create a temporary webpage for displaying some demo content to a potential client. It is a demo page, and ridden with bugs, holes, and other areas that need improvement. Archive.org still shows this page as part of search results? What will potential clients think of my company, given that it put up a buggy/terrible page?
Alternately, let's just say that I rename a longstanding webpage (technology.slashdot.org to tech.slashdot.org) and delete the old URL. Should archive.org redirect to false content?
Or, let's say that my restaurant decides to take down its 2013menu.html page, and doesn't wish customers to be able to compare its new and old menu side by side to see where prices inflated.
Error messages have purpose. While the most common case is that the page/server went offline, there are many times where a page URL changes as a result of regular website updates, where you don't want users to obtain old content.
Sometimes things are deleted for a reason.
I Have Experience in Internet Archaeology by IonOtter · 2014-01-24 11:59 · Score: 3, Interesting

There was a fascinating website dedicated to high-energy weapons and experiments, called svbxlabs.com
It was run by a young man who'd been born in the US to Ukrainian immigrants, which is actually important to keep in mind. He was brilliant, at least in my eyes, putting together the most incredible devices. HERF cannons, railguns, Tesla coils; you name it. He was the first to explain what the OptiCom traffic light changer was, and how it worked.
In short, he was doing a lot of work on things a LOT of people would much rather he didn't. Things were zipping along nicely, and his college professor was very excited to see what he came up with next.
Then 9/11 happened. Within four months, the site was gone. And Slava Person vanished from the Internet not long after that. Other people took up the mantle of his work, such as powerlabs.org, but it's not as good as Mr. Slava's work had been.
But if you put svbxlabs.com into the Wayback Machine at archive.org (WBM/A.O), you can find most of what he did. One problem with WBM/A.O is that you can't always just click on the links. Sometimes you have to copy them and enter them into the WBM window; otherwise your browser tries to go to the direct link, which no longer exists.
I've also used it to find all kinds of fan fiction, role-playing games, artwork and more.
I approve of this.

--
[End Of Line]
Redirect, don't 404. by pavon · 2014-01-24 12:21 · Score: 3, Insightful

None of those examples should result in a broken link if you are maintaining your website correctly. And this feature is only "fixing" broken links; that is, links that once existed and are now 404'ed.
If you want to discontinue a product, then replace those pages with one that explains that the product is discontinued, and provides links to similar current products, as well as the support page for the discontinued product. If a user is clicking on links in reviews or forum posts about your old product and receives 404s, or is redirected to a completely unrelated and unhelpful page on your site, they will be frustrated with or without this feature.
In the second case, just redirect the entire demo website URL tree to a current list of examples.
In the third case, you shouldn't do that without redirecting the old url to the new one. Seriously, are you trying to make your content hard to find?
Again, redirect to the new menu.
In no case is sending a user a 404 useful or beneficial, nor is it the most appropriate thing to do according to the HTTP standard. If you really want to be pedantic then send a 301 or 303 to perform the redirect, otherwise use URL rewriting, or just change the contents of the existing URL, whichever is easiest. The user should only see a 404 if they clicked an invalid link that was never a real URL for your website. Otherwise, you have failed your users, and it's no one's fault but your own if they choose to use a service that tries to make up for your shortcomings.
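
As a sketch of the redirect-instead-of-404 policy: every path below is hypothetical, loosely following the menu and discontinued-product examples from the parent post. A retired URL answers 301 with a Location header, and only a path that was never valid gets a 404.

```python
# Hypothetical site layout: retired paths mapped to their replacements.
MOVED = {
    "/2013menu.html": "/menu.html",
    "/products/old-widget": "/products/discontinued.html",
    "/demo/client-preview": "/examples/",
}
LIVE = {"/menu.html", "/products/discontinued.html", "/examples/"}


def respond(path):
    """Return (status, headers) for a request path."""
    if path in LIVE:
        return "200 OK", []
    if path in MOVED:
        # The page once existed, so send the visitor somewhere useful.
        return "301 Moved Permanently", [("Location", MOVED[path])]
    # Never a real URL on this site: the one case where a 404 is deserved.
    return "404 Not Found", []


print(respond("/2013menu.html"))
print(respond("/no-such-page"))
```

With a table like this kept current, an archive fallback in the browser would have nothing to trigger on for pages that were deliberately retired.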