Slashdot Mirror


Wikipedia Community and Internet Archive Partner To Fix One Million Broken Links on Wikipedia (wikimedia.org)

More than one million formerly broken links in the English Wikipedia have been updated to archived versions from the Wayback Machine, thanks to a partnership between the Internet Archive, volunteers from the Wikipedia community, and the Wikimedia Foundation. From a blog post: The Internet Archive, the Wikimedia Foundation, and volunteers from the Wikipedia community have now fixed more than one million broken outbound web links on English Wikipedia. This was done by having the Internet Archive monitor all new and edited outbound links from English Wikipedia for three years and archive them soon after changes are made to articles. This, combined with other web archiving projects, means that as pages on the Web become inaccessible, links to archived versions in the Internet Archive's Wayback Machine can take their place. This has now been done for the English Wikipedia, and more than one million links now point to preserved copies of missing web content. What do you do when good web links go bad? If you are a volunteer editor on Wikipedia, you start by writing software to examine every outbound link in English Wikipedia to make sure it is still available via the "live web." If, for whatever reason, it is no longer good (e.g. if it returns a "404" error code or "Page Not Found"), you check to see if an archived copy of the page is available via the Internet Archive's Wayback Machine. If it is, you instruct your software to edit the Wikipedia page to point to the archived version, taking care to let users of the link know they will be visiting a version via the Wayback Machine.
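
The check-then-substitute loop described in the summary can be sketched roughly as follows. This is a minimal illustration, not the software the Wikipedia volunteers actually run. The function names are hypothetical, the live-link status and the HTTP lookup are passed in so the logic can be exercised offline, and the lookup assumes the Wayback Machine availability endpoint and its JSON shape ("archived_snapshots" containing a "closest" entry).

```python
import json
from urllib.parse import urlencode

# Assumed endpoint: the Wayback Machine availability API.
WAYBACK_API = "https://archive.org/wayback/available"


def availability_query(url, timestamp=None):
    """Build the lookup URL for the closest archived snapshot of `url`."""
    params = {"url": url}
    if timestamp:
        # Optional YYYYMMDD hint: prefer a snapshot near the citation date.
        params["timestamp"] = timestamp
    return WAYBACK_API + "?" + urlencode(params)


def closest_snapshot(response_text):
    """Return the archived URL from an availability response, or None."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def choose_link(url, live_status, lookup):
    """Decide which URL an article should point to.

    live_status: HTTP status code seen when fetching `url` (hypothetical input).
    lookup: callable that takes a query URL and returns the response body.
    """
    if live_status < 400:
        return url  # still reachable on the live web; leave it alone
    archived = closest_snapshot(lookup(availability_query(url)))
    return archived or url  # no snapshot found: keep the original link
```

In a real bot, `lookup` would perform an HTTP GET, and the edit written back to the article would also label the link as an archived copy, as the summary notes.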

21 comments

  1. The links will be reverted as "not notable" by Anonymous Coward · · Score: 3, Informative

    Wikipedia, reverting knowledge since 2001.

    1. Re:The links will be reverted as "not notable" by Anonymous Coward · · Score: 0

      Please stop linking to or even mentioning Wikipedia. Starting now.

      See how good it feels to bask in a flow of information free of ass hat manipulation!

  2. Awesome by edittard · · Score: 2

    Somebody wrote a perl script!

    --
    At the bottom of the /. main page it says 'Yesterday's News'. Well they got that right.
    1. Re:Awesome by 110010001000 · · Score: 1

      On the new Slashdot that passes for genius.

    2. Re:Awesome by K. S. Kyosuke · · Score: 1

      Ever since the advent of Python? Well, perhaps.

      --
      Ezekiel 23:20
    3. Re:Awesome by Anonymous Coward · · Score: 0

      Python: The language of the mass produced unimaginative human drones.

    4. Re:Awesome by K. S. Kyosuke · · Score: 1

      Still much better than Java IMO.

      --
      Ezekiel 23:20
  3. IMPORTANT INFORMATION, PLEASE REREAD by Thud457 · · Score: 0, Offtopic
    Here's a dead link that needs to be preserved for posterity.
    Preferably laser-etched into millions of quartz tablets and shot into space.
    http://www.ananova.com/news/story/sm_215916.html

    Zoo keeper mauled to death 'after defecating on tiger'

    A young Chinese tiger keeper has been mauled to death after apparently trying to defecate on one of his big cats.

    The 19-year-old appears to have climbed the railings of the Bengal tiger cage and pulled his trousers down.

    Evidence at the scene of the death at the Jinan animal park included toilet paper, excrement and a trouser belt.

    Zoo officials think Xu Xiaodong either slipped into the cage or was pulled in by one of the four angry tigers.

    According to the South China Morning Post, the man told a co-worker he needed to go to the toilet but police were called when he failed to return.

    They found his body lying on the ground surrounded by tigers. The teenager had reportedly been bitten in the neck and was covered in blood.

    Police believe Xu climbed the wall of a partially constructed building used to raise the tigers to relieve himself. They said the smell probably caused the tigers to pounce.

    You can see more stories about tigers and zoos on Ananova,
    or read our Animal attacks file.

    --

    the preceding comment is my own and in no way reflects the opinion of the Joint Chiefs of Staff

    1. Re:IMPORTANT INFORMATION, PLEASE REREAD by D00MSlayer · · Score: 4, Funny

      Have you ever eaten human? They're GRRRRRREEAAT!

    2. Re:IMPORTANT INFORMATION, PLEASE REREAD by Anonymous Coward · · Score: 0

      Really? I heard they taste like shit.

      Captcha: rarely

  4. Not good enough by stevel · · Score: 2

    I have found many cases on Wikipedia where the links are broken but the correct content exists at a different URL. This auto-archive system would bypass that and perhaps prevent anyone from ever recognizing that the link target still exists. This is especially an issue for links to corporate and government pages where someone periodically gets the bright idea to reshuffle the web site's organization and doesn't put in permanent redirects.

    1. Re:Not good enough by skids · · Score: 3, Interesting

      By the same token, links could be taken over by bad actors at any time, or the site might delete content that was relevant to the reason it was linked in the first place.

      I kinda wonder if Wikipedia should *only* link to wayback content (just with a one-click option for a live/updated link and maybe an option to perform an edit to update it to a more recent wayback copy), because it is more in the spirit of the wiki audit-trail. Of course, that would probably require adding more resources at wayback.

    2. Re:Not good enough by stevel · · Score: 4, Interesting

      That's a problem across the whole web and, at least the deletion part, happens more often than you'd think. When I've updated links on Wikipedia, I note that it not only asks for a CAPTCHA but alerts editors to the change, in case the change was malicious.

      I think the motivation is good, but the implementation (as I understand it) could be better. Perhaps what is needed is to add a Wayback link alongside the original one. Does Wikipedia have a process for human review of broken links? In the cases I've found, replacement links can be found quickly for content that just moved.

    3. Re:Not good enough by radarskiy · · Score: 1

      In the context of a citation, the data at the time the citation was made is the correct link. Moving to the new data potentially means correcting the article to match, so it is reasonable to leave that to a human editor. Perhaps an automated message could be dropped in the talk page if the target of an outbound link has changed substantially, to alert editors to check for corrections?
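
One way to approximate "changed substantially" is to compare the archived text of a page with its current text and flag the link for human review when the two diverge. The sketch below uses Python's standard difflib for that. The function name and the 0.8 threshold are assumptions chosen for illustration, not anything Wikipedia or the Internet Archive is known to use.

```python
from difflib import SequenceMatcher


def changed_substantially(archived_text, live_text, threshold=0.8):
    """Return True when the live page differs enough to warrant human review.

    The texts are compared word by word; `threshold` is an arbitrary cut-off
    on the similarity ratio (1.0 means identical).
    """
    archived_words = archived_text.split()
    live_words = live_text.split()
    similarity = SequenceMatcher(None, archived_words, live_words).ratio()
    return similarity < threshold
```

A bot could post a talk-page note whenever this returns True and otherwise leave the citation alone, which keeps the decision to update the article with a human editor.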

  5. 1 step forward two steps back by TimothyHollins · · Score: 1

    Great job, now if only Wikipedia could deal with the blatant bias and astroturfing of its upper editing class it would be back to its 2005 status.

    Ever since I first saw the inner workings of Wikipedia I have become more and more enamored with the old style of expert-based (and accountable) encyclopaedias rather than the internet-warrior crowdsourced one. People tend to write very differently when their professional reputation is on the line.

  6. Fix Internal Links by CanadianMacFan · · Score: 1

    I wish they would fix links when someone changes the location of content within Wikipedia. What happens is that someone believes they have found a better spot for some content and moves it there (for example, maybe there's more information about something and it becomes a page of its own instead of a paragraph on some other page). However, the person who moves it doesn't look for everything that links to where the original content was and update the links.

    I was doing some research on colours for a developer tool earlier this year and came across this problem a number of times. It's extremely frustrating, especially when you contact the person to ask about the move (it wasn't as simple as my example above) and they rip your head off for asking.

  7. Those links will still break by alexo · · Score: 4, Informative

    Because the Internet Archive applies robots.txt rules retroactively.

    1. Re:Those links will still break by brewthatistrue · · Score: 3, Interesting

      yes, even if a domain squatter gets the domain.

      Additionally, their interpretation of robots.txt is questionable.

      It was meant to prevent automated crawlers, not human-requested fetches, yet often the web archive will disallow me from archiving a page because of robots.txt.

      This is one reason I often will archive to both http://web.archive.org AND archive.is.

      Archive.is explains its robots.txt policy in its FAQ.

      http://archive.is/faq#Why_does...

      > Why does archive.is not obey robots.txt?

      > Because it is not a free-walking crawler, it saves only one page acting as a direct agent of the human user. Such services don't obey robots.txt (e.g. Google Feedfetcher, screenshot- or pdf-making services, isup.me, )

      People have asked about this on the archive.org forum but I haven't read them all to see if there are any good answers.
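
To make the crawler-versus-human-agent distinction concrete, here is a small sketch using Python's standard urllib.robotparser. The robots.txt content, the user-agent strings and the example.com URLs are invented for illustration. How the Internet Archive or archive.is evaluate these rules internally is not something this thread establishes.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, e.g. one published by a new owner of the domain.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Check whether `user_agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The agent named in the first block is shut out of the whole site.
print(allowed("ia_archiver", "http://example.com/page.html"))
# Any other agent falls under the "*" block and is only kept out of /private/.
print(allowed("SomeBrowser", "http://example.com/page.html"))
```

Applying the rules retroactively means the check is made against today's robots.txt, so a file like the one above would also hide snapshots captured years before it existed.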

  8. What makes you think ... by Ungrounded Lightning · · Score: 1

    > Please stop linking to or even mentioning Wikipedia. Starting now. See how good it feels to bask in a flow of information free of ass hat manipulation!

    What makes you think other information sources - encyclopedias, news outlets, books, etc. - aren't subject to "ass hat manipulation"?

    At least with Wikipedia the information gets a chance to get out (and into the page history) before some ass hat gatekeeper decides to shut it down or distort it beyond recognition.

    --
    Bantam Dominique roosters crow a four-note song. Once you've heard it as "Happy BIRTHday" you can't NOT hear it that way
    1. Re:What makes you think ... by Anonymous Coward · · Score: 0

      > At least with Wikipedia the information gets a chance to get out (and into the page history) before some ass hat gatekeeper decides to shut it down or distort it beyond recognition.

      Or charge you for it, or show you intrusive pop-up interstitials before you can see it. I agree with you that I'll take the periodic "We'd like a donation" headers to get info that I can then verify against their sources over paying $1,200 for the World Book Encyclopedia.