Slashdot Mirror


Internet Archive Says It Has Restored 9 Million Broken Wikipedia Links By Directing Them To Archived Versions in Wayback Machine (archive.org)

Mark Graham, the Director of Wayback Machine at Internet Archive, announces: As part of the Internet Archive's aim to build a better Web, we have been working to make the Web more reliable -- and are pleased to announce that 9 million formerly broken links on Wikipedia now work because they go to archived versions in the Wayback Machine.

For more than 5 years, the Internet Archive has been archiving nearly every URL referenced in close to 300 Wikipedia sites as soon as those links are added or changed, at the rate of about 20 million URLs/week. And for the past 3 years, we have been running a software robot called IABot on 22 Wikipedia language editions looking for broken links (URLs that return a '404', or 'Page Not Found'). When broken links are discovered, IABot searches for archives in the Wayback Machine and other web archives to replace them with. Restoring links ensures Wikipedia remains accurate and verifiable and thus meets one of Wikipedia's three core content policies: 'Verifiability.'
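
The check-then-replace loop described in the summary can be sketched roughly as follows. This is a minimal illustration, not IABot's actual code: the function names (`is_broken`, `closest_snapshot`, `find_archive`) are hypothetical, and it assumes the Wayback Machine's public availability endpoint at `https://archive.org/wayback/available`, which returns JSON containing an `archived_snapshots.closest` entry when a capture exists.

```python
import json
from urllib.error import HTTPError
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed public endpoint for looking up the closest archived capture of a URL.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def is_broken(url, timeout=10):
    """Return True only if the URL answers with HTTP 404 ('Page Not Found')."""
    try:
        with urlopen(url, timeout=timeout):
            return False
    except HTTPError as err:
        return err.code == 404
    except OSError:
        # DNS failures, refused connections and timeouts are not a 404;
        # this sketch leaves such links alone rather than guessing.
        return False


def closest_snapshot(payload):
    """Pull the archived URL out of an availability response, or return None."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def find_archive(url, timestamp=None, timeout=10):
    """Ask the availability endpoint for the capture closest to `timestamp`."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp  # YYYYMMDD[hhmmss], e.g. the citation date
    query = f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"
    with urlopen(query, timeout=timeout) as resp:
        return closest_snapshot(json.load(resp))
```

A caller would test a cited link with `is_broken(link)` and, if that returns True, swap in whatever `find_archive(link)` returns, provided it is not `None`. Passing the date the citation was added as `timestamp` is one plausible way to prefer a capture close to what the editor originally saw.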

10 of 40 comments

  1. Archive.org is precious! by Anonymous Coward · · Score: 5, Insightful

    Archive.org is precious! For a long time now I have preferred to send my students the archived version of web pages. If a page is not there, I upload it. That way I can reuse a web page many years later and trust it is still there. Phenomenally simple!

  2. Just watch, people will ruin it by ZorinLynx · · Score: 3, Insightful

    You just know people will file DMCA takedowns for their content archived on Wayback, breaking the links yet again.

    Because people are petty and obsessed with controlling their content even though they're not making money from it anymore and they would have otherwise forgotten about it completely.

    1. Re:Just watch, people will ruin it by tlhIngan · · Score: 4, Interesting

      You just know people will file DMCA takedowns for their content archived on Wayback, breaking the links yet again.

      Because people are petty and obsessed with controlling their content even though they're not making money from it anymore and they would have otherwise forgotten about it completely.

      Except the Internet Archive is a recognized library, which means they actually have powers to ignore DMCA takedowns. In fact, as a library they get a lot of exceptions to the DMCA. It's why they host a lot of copyrighted material for free.

      It's one of the few positives of the DMCA.

  3. Re:how are these links broken in the first place? by Anonymous Coward · · Score: 2, Informative

    It's not like somebody broke the links by editing Wikipedia. Websites disappear all of the time.

  4. robots.txt to block Wayback Machine by tepples · · Score: 5, Informative

    Until the domain's new owner sets up a robots.txt, causing Wayback Machine to retrospectively block public access to the archived copy of a document. See debate about this a year and a half ago.

    1. Re:robots.txt to block Wayback Machine by Anubis+IV · · Score: 5, Interesting

      Exactly what I was thinking. A site posts something that creates a situation, they take the page down and engage in PR spin, Wikipedia links to the archived copy of the page to demonstrate what content had been there, and then the site modifies their robots.txt, retroactively clearing the content from the IA.

      I understand IA's policy of abiding by robots.txt, but when someone needs to be held accountable for what they said, having a single source that can serve as a living embodiment of "the Internet never forgets" would be quite nice.

    2. Re:robots.txt to block Wayback Machine by drinkypoo · · Score: 5, Informative

      Until the domain's new owner sets up a robots.txt, causing Wayback Machine to retrospectively block public access to the archived copy of a document. See debate about this a year and a half ago.

      Except they don't do that any more, unless the domain's new owner explicitly blocks the Internet Archive's user agent. A disallow * policy is now ignored.
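
      The difference between a wildcard rule and an explicit block can be sketched like this. It models only the policy as described in the comment above, not the Internet Archive's real code; the user-agent token `ia_archiver` and both function names are assumptions made for illustration, and the parser handles only the simple `Disallow: /` case.

```python
def agents_disallowing_everything(robots_txt):
    """Return the lowercased user agents whose group contains 'Disallow: /'."""
    blocked = set()
    current_agents = []
    in_rules = False  # True once the current group has started listing rules
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:
                # A User-agent line after rules starts a new group.
                current_agents = []
                in_rules = False
            current_agents.append(value.lower())
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/":
                blocked.update(current_agents)
    return blocked


def archive_is_blocked(robots_txt, agent="ia_archiver"):
    """Honor only a group that names the archive's agent; ignore 'User-agent: *'."""
    return agent.lower() in agents_disallowing_everything(robots_txt)


# A blanket rule aimed at every crawler: ignored under the described policy.
wildcard_only = "User-agent: *\nDisallow: /\n"

# A rule that names the archive's crawler explicitly: honored.
explicit_block = (
    "User-agent: *\nDisallow: /private\n\n"
    "User-agent: ia_archiver\nDisallow: /\n"
)

print(archive_is_blocked(wildcard_only))   # False
print(archive_is_blocked(explicit_block))  # True
```

      The standard library's `urllib.robotparser` is deliberately not used here, because its `can_fetch` falls back to the `*` group for any agent that is not named, which is exactly the behavior the described policy drops.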

      --
      "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
  5. Re:how are these links broken in the first place? by ZorinLynx · · Score: 5, Informative

    Link rot.

    Even websites that have been around for decades experience it, because they change the structure of their site, breaking links to articles that might even still be available.

    If you follow a CNN link from 15 years ago, it probably won't work.

    It's a bit scary to think how much of our history we're losing to link rot and archive.org is doing their best to fight it. They are awesome people.

  6. Re:how are these links broken in the first place? by Anonymous Coward · · Score: 2, Informative

    This is about external links, not wikilinks.

  7. Re:how are these links broken in the first place? by wistlo · · Score: 2

    Link rot is all too familiar to me. Local newspapers, such as nola.com in New Orleans, are impossible to search because of site updates that wiped out the entire history.

    I had not realized there were nine million such links on Wikipedia, as they tend to mind such matters more closely than the commercial media companies. (Exceptions are the NYT and the Washington Post, which in my experience do pretty well in keeping old links working or at least redirected to the same content.)

    I donate to archive.org partly because they're the only folks who seem to understand that web content with no archive is as ephemeral as a sand drawing on a beach. Fifteen years ago I put up some photos on Angelfire that I forgot about but later wanted to see. Voila, here they are on the Wayback Machine.

      In 10,000 years, our current era may be less well documented than the Bronze Age.