Internet Archive Says It Has Restored 9 Million Broken Wikipedia Links By Directing Them To Archived Versions in Wayback Machine (archive.org)
Mark Graham, the Director of Wayback Machine at Internet Archive, announces: As part of the Internet Archive's aim to build a better Web, we have been working to make the Web more reliable -- and are pleased to announce that 9 million formerly broken links on Wikipedia now work because they go to archived versions in the Wayback Machine.
For more than 5 years, the Internet Archive has been archiving nearly every URL referenced in close to 300 Wikipedia sites as soon as those links are added or changed, at the rate of about 20 million URLs/week. And for the past 3 years, we have been running a software robot called IABot on 22 Wikipedia language editions looking for broken links (URLs that return a '404', or 'Page Not Found'). When broken links are discovered, IABot searches for archives in the Wayback Machine and other web archives to replace them with. Restoring links ensures Wikipedia remains accurate and verifiable and thus meets one of Wikipedia's three core content policies: 'Verifiability.'
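For anyone curious what the "find a 404, look for an archive" step might look like, here is a rough Python sketch. It is not IABot's actual code. It assumes the public Wayback Machine availability endpoint (https://archive.org/wayback/available) and its JSON shape with an "archived_snapshots" / "closest" entry; the function names are made up for illustration, and the network calls are left out so the logic can be checked offline.

```python
import json
from urllib.parse import urlencode

# Assumption: the public Wayback availability endpoint. Not taken from the announcement.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(url, timestamp=None):
    """Build the lookup URL for a possibly dead link (hypothetical helper)."""
    params = {"url": url}
    if timestamp:
        # Optional YYYYMMDD hint: ask for the snapshot closest to this date.
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(response_text):
    """Return the archived URL from an availability response, or None."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def replacement_for(url, http_status, response_text):
    """Keep working links as-is; swap a 404 for its snapshot when one exists."""
    if http_status != 404:
        return url
    return closest_snapshot(response_text) or url
```

In practice you would fetch the link to get http_status and fetch availability_query(url) to get response_text. A real bot would also have to deal with soft 404s, redirects, and rate limits, none of which this sketch attempts.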
Archive.org is precious! For a long time now I have preferred to send my students the archived version of web pages. If it is not there, I upload it. That way I can reuse a web page many years later and trust it is still there. Phenomenally simple!
You just know people will file DMCA takedowns for their content archived on Wayback, breaking the links yet again.
Because people are petty and obsessed with controlling their content even though they're not making money from it anymore and they would have otherwise forgotten about it completely.
How are these 9 million links broken in the first place?
Wikipedia has a useful and seemingly complete archive of every version and edit for every article. I'm curious how these broken links originate, and how they differ from those that are available in the Wikipedia Revision History.
Until the domain's new owner sets up a robots.txt, causing Wayback Machine to retrospectively block public access to the archived copy of a document. See debate about this a year and a half ago.
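To illustrate the robots.txt scenario, here is a small Python sketch using the standard library's urllib.robotparser. The "ia_archiver" user agent and the sample robots.txt are assumptions for the example, and this only shows how a Disallow rule is evaluated. It does not claim to reproduce how the Wayback Machine applies such rules to copies it already holds.

```python
from urllib.robotparser import RobotFileParser


def blocked_by_robots(robots_txt, user_agent, url):
    """Return True if robots_txt disallows user_agent from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


# Hypothetical robots.txt published by a domain's new owner.
NEW_OWNER_ROBOTS = """\
User-agent: ia_archiver
Disallow: /
"""
```

With that file in place, blocked_by_robots(NEW_OWNER_ROBOTS, "ia_archiver", "http://example.com/page.html") is True, even though the page in question may have been written and archived years before the domain changed hands.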
Go Internet Archive, go!
Most of archive.org is terminally broken itself. robots.txt and domain name squatters retroactively booting you out, anyone? Then there's the behind-the-scenes xml-in-xml malarkey, and its trouble with archiving pictures, which makes trying to visit archived webcomics and other picture-heavy sites especially futile. Thanks, archive.org.
Do they provide a health plan? The workers or their children might need hospital care.
Those articles were deleted for a reason!
These nazis are trying to plug up the memory hole!
Shut it down!
and that's okay? Web sites go down for a variety of reasons, and one of them is to delete outdated information or just information that the site owner no longer wants to display. So with this system if Wikipedia has ever cited a page, it never goes away. Now maybe the site owner is just lazy and is being "protected" from his laziness by this project. Or just maybe the site owner eliminated information because he legitimately wanted to. In that case this project is contrary to his desires. It's just another way to make sure Wikipedia is outdated. Instead of broken links that have been broken intentionally, Wikipedia remains outdated by pointing to stuff that ought to be gone. I do not see this as a good thing. It's not up to Wikipedia or the archive to police the Internet. If a link no longer works (easily discovered and reported by a web crawler), let Wikipedia fix the article, including removing the link. Yeah, that IS a big job, but it is their responsibility. As it is, this is just another reason Wikipedia is not particularly accurate.
How about a moderation of "-1, Pedantic"?
For rescuing articles from deletionists and showing more prominently what edits were reverted by admins?
Internet Trashcan. Saving garbage for future generations to sift through and make sense of. Whenever I watch for something, it will often bring up trash like garage band albums from "alternative" and "experimental" "music".
The Amber project, http://amberlink.org/ provides a plugin for various content management systems to do the same thing on your own site.
Why doesn't every browser do this automatically when encountering 404?