Internet Archive Says It Has Restored 9 Million Broken Wikipedia Links By Directing Them To Archived Versions in Wayback Machine (archive.org)


Posted by msmash on Tuesday October 2, 2018 @02:50AM from the better-web dept.

Mark Graham, the Director of Wayback Machine at Internet Archive, announces: As part of the Internet Archive's aim to build a better Web, we have been working to make the Web more reliable -- and are pleased to announce that 9 million formerly broken links on Wikipedia now work because they go to archived versions in the Wayback Machine.

For more than 5 years, the Internet Archive has been archiving nearly every URL referenced in close to 300 Wikipedia sites as soon as those links are added or changed, at the rate of about 20 million URLs/week. And for the past 3 years, we have been running a software robot called IABot on 22 Wikipedia language editions looking for broken links (URLs that return a '404', or 'Page Not Found'). When broken links are discovered, IABot searches for archives in the Wayback Machine and other web archives to replace them with. Restoring links ensures Wikipedia remains accurate and verifiable and thus meets one of Wikipedia's three core content policies: 'Verifiability.'
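
The lookup step described above can be sketched in a few lines of Python. This is not IABot's code; it is a minimal illustration that assumes the public Wayback Availability endpoint (`https://archive.org/wayback/available`) and its `archived_snapshots.closest` JSON shape. The function names are hypothetical, and the network fetch itself is left out so the logic can be followed offline.

```python
import json
from typing import Optional
from urllib.parse import urlencode

# Assumed endpoint of the public Wayback Availability API.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(url: str, timestamp: Optional[str] = None) -> str:
    """Build the query URL asking for the snapshot closest to `timestamp`."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp  # e.g. "20060101" (YYYYMMDD)
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(response_text: str) -> Optional[str]:
    """Return the archived URL from an availability response, or None."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def replacement_for(status_code: int, original_url: str, response_text: str) -> str:
    """Keep a working link; swap a 404 for its archived copy if one exists."""
    if status_code != 404:
        return original_url
    return closest_snapshot(response_text) or original_url
```

A caller would fetch `availability_query(url)` with any HTTP client and pass the response body to `replacement_for` along with the status code the original link returned.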

40 comments

Archive.org is precious! by Anonymous Coward · 2018-10-02 02:58 · Score: 5, Insightful

Archive.org is precious! For a long time now I have preferred to send my students the archived version of web pages. If a page is not there, I upload it. That way I can reuse a web page many years later and trust it is still there. Phenomenally simple!
1. Re:Archive.org is precious! by Anonymous Coward · 2018-10-02 03:41 · Score: 1
  
  Back in the day (when things were just getting built), I would bookmark interesting sites
  like crazy never thinking that they would disappear. So, when sometime later, I'd try to
  revisit that site, Boom! it was gone (yes I made a sound like that). So now, I scrape the
  pages I have an interest in and they're there "forever" on my local HD. Except now even that's
  getting difficult as some sites are built on-demand in JS, so the "save page as" doesn't
  really save everything I think it saves. So if the site's backend goes away, it's lost.
  I know people make fun of /.'s support of _only_ Baudot encoding, but at least when I
  save a /. article, I get the whole article.
  CAP === 'molecule'
2. Re:Archive.org is precious! by mikael · 2018-10-02 04:05 · Score: 0
  
  In the early days of the internet (mid 1990's to 2000), I used AT&T's web browser. This had the wonderful feature of closing the web browser the minute the dial-up connection was lost. So to avoid losing downloads, I'd just save every web page first, then read it.
  There are command line utilities to download a web page from a URL, e.g. wget.
  
  --
  Vintage computer adverts: http://www.vintageadbrowser.com/computers-and-software-ads
3. Re:Archive.org is precious! by Anonymous Coward · 2018-10-02 04:19 · Score: 0
  
  There are command line utilities to download a web page from a URL, e.g. wget.
  I found faculty pages on .edu sites with useful information would disappear probably due to the individual changing institutions. Sometimes even trade magazine links to informative articles go away. When I find one online now or via Archive.org I usually just copy/paste the content into a Word document.
4. Re:Archive.org is precious! by Anonymous Coward · 2018-10-02 04:43 · Score: 0
  
  With the new found interest in raytracing (3D rendering of floating balls perfectly reflecting between themselves etc.) I wanted to read the page about raytracing on printers i.e. in a postscript document and get the postscript document/program again, but I found out it was gone for this reason. Although this was something of a practical joke.
5. Re:Archive.org is precious! by UnknownSoldier · 2018-10-02 04:43 · Score: 1
  
  That's why Chrome's "Save to PDF" is invaluable. Because sooner or later the website / web page WILL disappear.
Just watch, people will ruin it by ZorinLynx · 2018-10-02 02:59 · Score: 3, Insightful

You just know people will file DMCA takedowns for their content archived on Wayback, breaking the links yet again.
Because people are petty and obsessed with controlling their content even though they're not making money from it anymore and they would have otherwise forgotten about it completely.
1. Re:Just watch, people will ruin it by Anonymous Coward · 2018-10-02 03:19 · Score: 0, Flamebait
  
  You could have fucked off and stopped using the internet and then everyone would be a lot happier.
2. Re:Just watch, people will ruin it by MobyDisk · 2018-10-02 03:29 · Score: 1
  
  What makes you think this will happen? These 404s are usually the result of people reorganizing a site, retiring a blog, etc. They probably don't even know about it.
3. Re:Just watch, people will ruin it by tlhIngan · 2018-10-02 06:00 · Score: 4, Interesting
  
  You just know people will file DMCA takedowns for their content archived on Wayback, breaking the links yet again.
  Because people are petty and obsessed with controlling their content even though they're not making money from it anymore and they would have otherwise forgotten about it completely.
  Except the Internet Archive is a recognized library, which means they actually have powers to ignore DMCA takedowns. In fact, as a library they get a lot of exceptions to the DMCA. It's why they host a lot of copyrighted material for free.
  It's one of the few positives of the DMCA.
4. Re:Just watch, people will ruin it by Shikaku · 2018-10-02 10:18 · Score: 1
  
  One of the interesting protections lets you upload files and even if the link is DMCA'd, all that would do is just hide the link from its search feature and browsing sections. The files stay up if you know the direct link. There's a lot of legally gray files like ROMs on there because of that.
how are these links broken in the first place? by wistlo · 2018-10-02 03:02 · Score: 1

How are these 9 million links broken in the first place?
Wikipedia has a useful and seemingly complete archive of every version and edit for every article. I'm curious how these broken links originate, and how they differ from those that are available in the Wikipedia Revision History.
1. Re:how are these links broken in the first place? by Anonymous Coward · 2018-10-02 03:06 · Score: 2, Informative
  
  It's not like somebody broke the links by editing Wikipedia. Websites disappear all of the time.
2. Re:how are these links broken in the first place? by ZorinLynx · 2018-10-02 03:16 · Score: 5, Informative
  
  Link rot.
  Even websites that have been around for decades experience it, because they change the structure of their site, breaking links to articles that might even still be available.
  If you follow a CNN link from 15 years ago, it probably won't work.
  It's a bit scary to think how much of our history we're losing to link rot and archive.org is doing their best to fight it. They are awesome people.
3. Re:how are these links broken in the first place? by Anonymous Coward · 2018-10-02 03:16 · Score: 0
  
  How are you on Slashdot and don't even know what a broken link is?
  It means the web site that Wikipedia linked to has changed. Either the page is no longer available, or the entire site has shut down or relocated to another domain.
4. Re:how are these links broken in the first place? by Anonymous Coward · 2018-10-02 03:17 · Score: 2, Informative
  
  This is about external links, not wikilinks.
5. Re:how are these links broken in the first place? by mikael · 2018-10-02 04:10 · Score: 1
  
  Or more simply, they have reorganised their website directory system. http:///.com/developersupport//demos/mainindex.html suddenly becomes http:////presentations/demos/mainindex.html
  
  --
  Vintage computer adverts: http://www.vintageadbrowser.com/computers-and-software-ads
6. Re:how are these links broken in the first place? by wistlo · 2018-10-02 04:14 · Score: 2
  
  Link rot is all too familiar to me. Local newspapers, such as nola.com in New Orleans, are impossible to search because of site updates that wiped out the entire history.
  I had not realized there were nine million such links on Wikipedia, as they tend to mind such matters more closely than the commercial media companies. (Exceptions are the NYT and Washington Post, who in my experience do pretty well in keeping old links working or at least redirected to the same content).
  I donate to archive.org partly because they're the only folks who seem to understand that web content with no archive is as ephemeral as a sand drawing on a beach. Fifteen years ago I put up some photos on angelfire that I forgot about but later wanted to see. Voila, here it is on the Wayback machine.
  In 10,000 years, our current era may be less well documented than the Bronze Age.
7. Re:how are these links broken in the first place? by Scarletdown · 2018-10-02 04:28 · Score: 1
  
  Link rot.
  Even websites that have been around for decades experience it, because they change the structure of their site, breaking links to articles that might even still be available.
  If you follow a CNN link from 15 years ago, it probably won't work.
  It's a bit scary to think how much of our history we're losing to link rot and archive.org is doing their best to fight it. They are awesome people.
  Or it is like how Web searches frequently ended up back in the still adolescent days of the Web.
  You search on a topic, and the first couple dozen pages are all different sites that link to the same page that link to the same page, that link to the same page, that link to what has long since become a 404.
  I don't know what was more frustrating; the fact that no one can be arsed to create backup sources for info; or the searches where you have an important question about something (perhaps a tech problem), and search results show a lot of others asking the same thing; but zero actual solutions (other than a few useless "Works for me" posts) or any replies whatsoever. Heck, I've seen plenty of instances where once the original poster gets their answer, they then remove or request that not only their post, but the entire thread be deleted.
  
  --
  This space unintentionally left blank.
8. Re:how are these links broken in the first place? by Anonymous Coward · 2018-10-02 04:51 · Score: 0
  
  I remember the entirely malicious link farms. Dozens of variations of a site linking between themselves in eternal loops. By design they would never themselves give you any useful content whatsoever. Haven't seen them in a long time, I found them amusing.
robots.txt to block Wayback Machine by tepples · 2018-10-02 03:06 · Score: 5, Informative

Until the domain's new owner sets up a robots.txt, causing Wayback Machine to retrospectively block public access to the archived copy of a document. See debate about this a year and a half ago.
1. Re:robots.txt to block Wayback Machine by Anubis+IV · 2018-10-02 03:18 · Score: 5, Interesting
  
  Exactly what I was thinking. A site posts something that creates a situation, they take the page down and engage in PR spin, Wikipedia links to the archived copy of the page to demonstrate what content had been there, and then the site modifies their robots.txt, retroactively clearing the content from the IA.
  I understand IA's policy of abiding by robots.txt, but when someone needs to be held accountable for what they said, having a single source that can serve as a living embodiment of "the Internet never forgets" would be quite nice.
2. Re:robots.txt to block Wayback Machine by drinkypoo · 2018-10-02 03:32 · Score: 5, Informative
  
  Until the domain's new owner sets up a robots.txt, causing Wayback Machine to retrospectively block public access to the archived copy of a document. See debate about this a year and a half ago.
  Except they don't do that any more, unless the domain's new owner explicitly blocks the internet archive's user agent. A disallow * policy is now ignored.
  
  --
  "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
3. Re:robots.txt to block Wayback Machine by MobyDisk · 2018-10-02 05:32 · Score: 1
  
  Can I archive the archived page?
4. Re:robots.txt to block Wayback Machine by tepples · 2018-10-08 14:35 · Score: 1
  
  Thank you for the update. The Daily Pangram 1-550 is saved.
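
The distinction drawn in this thread, a blanket `User-agent: *` rule versus a rule that names the archive's crawler, can be illustrated with a small Python sketch. It is not the Internet Archive's actual logic. The agent token `ia_archiver` is an assumption about which name a site owner would use, the parser is deliberately simplified, and the function names are hypothetical.

```python
from typing import Dict, List


def parse_groups(robots_txt: str) -> Dict[str, List[str]]:
    """Map each lowercased user-agent token to its Disallow paths."""
    groups: Dict[str, List[str]] = {}
    current: List[str] = []   # agents named by the group being read
    reading_agents = False    # True while consecutive User-agent lines run
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not reading_agents:
                current = []  # a rule line ended the previous group
                reading_agents = True
            current.append(value.lower())
            groups.setdefault(value.lower(), [])
        else:
            reading_agents = False
            if field == "disallow" and value:
                for agent in current:
                    groups[agent].append(value)
    return groups


def explicitly_blocked(robots_txt: str, agent: str = "ia_archiver") -> bool:
    """True only if a group naming `agent` itself disallows the whole site."""
    return "/" in parse_groups(robots_txt).get(agent.lower(), [])
```

Under the policy described in reply 2, a file containing only `User-agent: *` and `Disallow: /` gives `False` here, while one that names the crawler directly gives `True`.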
Second greatest use, after ROMs by Anonymous Coward · 2018-10-02 03:19 · Score: 0

Go Internet Archive, go!
Too bad by Anonymous Coward · 2018-10-02 03:22 · Score: 0

most of archive.org is terminally broken itself. robots.txt and domain name squatters retroactively booting you out, anyone? Then there's the behind-the-scenes xml-in-xml malarky, and its trouble with archiving pictures that makes trying to visit archived webcomics and other picture-heavy sites especially futile. Thanks, archive.org.
Health care by erikmartino477 · 2018-10-02 03:31 · Score: 1

Do they provide a health plan? The workers or their children might need hospital care.
Accurate? by Anonymous Coward · 2018-10-02 03:32 · Score: 1

Those articles were deleted for a reason!
These nazis are trying to plug up the memory hole!
Shut it down!
So now we link to outdated information by mschuyler · 2018-10-02 04:00 · Score: 0

and that's okay? Web sites go down for a variety of reasons, and one of them is to delete outdated information or just information that the site owner no longer wants to display. So with this system if Wikipedia has ever cited a page, it never goes away. Now maybe the site owner is just lazy and is being "protected" from his laziness by this project. Or just maybe the site owner eliminated information because he legitimately wanted to. In that case this project is contrary to his desires. It's just another way to make sure Wikipedia is outdated. Instead of broken links that have been broken intentionally, Wikipedia remains outdated by pointing to stuff that ought to be gone. I do not see this as a good thing. It's not up to Wikipedia or the archive to police the Internet. If a link no longer works (easily discovered and reported by a web crawler), let Wikipedia fix the article, including removing the link. Yeah, that IS a big job, but it is their responsibility. As it is this is just another reason Wikipedia is not particularly accurate.

--
How about a moderation of -1 pedantic.
1. Re:So now we link to outdated information by mikael · 2018-10-02 04:12 · Score: 1
  
  Most academic links go down because the student no longer works there, and the research lab has a clean out of old documents and web pages.
  
  --
  Vintage computer adverts: http://www.vintageadbrowser.com/computers-and-software-ads
2. Re: So now we link to outdated information by virtig01 · 2018-10-02 04:16 · Score: 1
  
  Since you're assigning responsibility for updating outdated content, why isn't it the responsibility of the cited website's author to update their page, rather than taking it down?
  In my experience with Wikipedia dead links, it's almost always a case of a server no longer existing or a site changing their CMS without setting up redirects.
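
For what "setting up redirects" amounts to after a CMS change, here is a minimal Python sketch of a prefix-based redirect map. The paths and the `REDIRECTS` table are entirely hypothetical, and a real site would more likely express the same rules in its web server or CMS configuration.

```python
from typing import Dict, Optional

# Hypothetical mapping from old path prefixes to their new locations.
REDIRECTS: Dict[str, str] = {
    "/old-blog/": "/articles/",
    "/old-blog/2004/": "/articles/archive/2004/",
}


def redirect_target(path: str, rules: Dict[str, str]) -> Optional[str]:
    """Return the new path for an old one, or None if no rule applies."""
    # Try longer prefixes first so the most specific rule wins.
    for old in sorted(rules, key=len, reverse=True):
        if path.startswith(old):
            return rules[old] + path[len(old):]
    return None
```

A server would answer a request for an old path with a 301 pointing at `redirect_target(path, REDIRECTS)` instead of a 404, so existing inbound links keep working.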
How about archiving wikipedia itself? by Anonymous Coward · 2018-10-02 04:08 · Score: 0

For rescuing articles from deletionists and showing more prominently what edits were reverted by admins?
The Internet Archive logo looks like a trash can by Anonymous Coward · 2018-10-02 04:08 · Score: 0

Internet Trashcan. Saving garbage for future generations to sift through and make sense of. Whenever I watch for something, it will often bring up trash like garage band albums from "alternative" and "experimental" "music".
You can do this yourself... by imcdona · 2018-10-02 04:18 · Score: 1

The Amber project, http://amberlink.org/ provides a plugin for various content management systems to do the same thing on your own site.
Automatic? by Anonymous Coward · 2018-10-02 07:41 · Score: 0

Why doesn't every browser do this automatically when encountering 404?