When the Internet Archive Forgets (gizmodo.com)

Posted by msmash on Thursday November 29, 2018 @09:30AM from the PSA dept.

A reminder that Internet Archive's Wayback Machine, which many people assume keeps a permanent trail and origin of web content, has little feasible choice but to comply with DMCA takedown notices. As a result, a portion of the archive of things people submit to the website continues to quietly fade away. Gizmodo: Over the last few years, there has been a change in how the Wayback Machine is viewed, one inspired by the general political mood. What had long been a useful tool when you came across broken links online is now, more than ever before, seen as an arbiter of the truth and a bulwark against erasing history. That archive sites are trusted to show the digital trail and origin of content is not just a must-use tool for journalists, but effective for just about anyone trying to track down vanishing web pages. With that in mind, that the Internet Archive doesn't really fight takedown requests becomes a problem. That's not the only recourse: When a site admin elects to block the Wayback crawler using a robots.txt file, the crawling doesn't just stop. Instead, the Wayback Machine's entire history of a given site is removed from public view.

In other words, if you deal in a certain bottom-dwelling brand of controversial content and want to avoid accountability, there are at least two different, standardized ways of erasing it from the most reliable third-party web archive on the public internet. For the Internet Archive, like with quickly complying with takedown notices challenging their seemingly fair use archive copies of old websites, the robots.txt strategy, in practice, does little more than mitigating their risk while going against the spirit of the protocol. And if someone were to sue over non-compliance with a DMCA takedown request, even with a ready-made, valid defense in the Archive's pocket, copyright litigation is still incredibly expensive. It doesn't matter that the use is not really a violation by any metric. If a rightsholder makes the effort, you still have to defend the lawsuit.

71 comments

Move to Canada by JMJimmy · 2018-11-29 09:32 · Score: 4, Interesting

They should move to Canada as we have an exemption for archives which would allow the content to remain.
1. Re:Move to Canada by Anonymous Coward · 2018-11-29 09:52 · Score: 2, Funny
  
  Perhaps a simpler solution would be to see if Elon Musk has any ideas on how to fix this? Perhaps using something involving Blockchain?
2. Re:Move to Canada by Anonymous Coward · 2018-11-29 10:00 · Score: 0
  
  A lot of us should. Canada is beautiful, and most of the people I met there were really nice.
3. Re:Move to Canada by Anonymous Coward · 2018-11-29 11:17 · Score: 0
  
  what about using AI?
4. Re:Move to Canada by Anonymous Coward · 2018-11-29 12:13 · Score: 1
  
  They should move to Canada as we have an exemption for archives which would allow the content to remain.
  Wouldn't work.
  The USA has an exemption for archives and libraries, and the Internet Archive is legally registered as one. It is also named explicitly in an exemption to the DMCA when it comes to software.
  https://archive.org/post/82097/internet-archive-helps-secure-exemption-to-the-digital-millennium-copyright-act
  To sue the Internet Archive for violating a law, where the law explicitly names the Internet Archive as exempt, takes some serious balls.
  I can't see how it would be that costly to simply point at their name in the law they are accused of breaking, where it states they are exempt from that very law.
  The fact they aren't even using that exemption to the law here means they wouldn't do any different in Canada.
5. Re:Move to Canada by mikael · 2018-11-29 12:47 · Score: 1
  
  A distributed web archive system, like BitTorrent, where different systems archive different webpages and can be searched in the same way.
  
  --
  Vintage computer adverts: http://www.vintageadbrowser.com/computers-and-software-ads
6. Re:Move to Canada by Zontar+The+Mindless · 2018-11-29 14:33 · Score: 1
  
  They have back bacon, poutine, Molson's, The Odds, and toques. What's not to like, other than Avril Lavigne?
  
  --
  Il n'y a pas de Planet B.
7. Re: Move to Canada by Anonymous Coward · 2018-11-29 16:54 · Score: 0
  
  Hillary should have just updated the robots.txt file on her email server.
8. Re: Move to Canada by Anonymous Coward · 2018-11-29 18:58 · Score: 0
  
  Black people and Muslims.
9. Re: Move to Canada by Zontar+The+Mindless · 2018-11-29 20:29 · Score: 2
  
  I like all kinds of people of good will. Assholes who won't grow the fuck up, not so much.
  In which group do you think you belong?
  
  --
  Il n'y a pas de Planet B.
10. Re:Move to Canada by Anonymous Coward · 2018-11-29 22:43 · Score: 0
  
  I think Musk's solution would be to move the archive to Mars, there may be issues with accessing it from Earth, but that will be solved by living on Mars.
11. Re:Move to Canada by JMJimmy · 2018-11-30 02:13 · Score: 1
  
  Our litigation system works slightly differently and is more balanced. Courts will refuse to hear obviously frivolous/baseless claims far more readily and costs are awarded to defendants far more often. The chance that you'll have to pay both sides of a lawsuit's costs is a big deterrent.
Victory will be achieved.... by forkfail · 2018-11-29 09:35 · Score: 2

... when the Wayback Machine itself has been dropped into a memory hole.

--
Check your premises.
Library of Congress by JBMcB · 2018-11-29 09:40 · Score: 5, Interesting

Get a charter from the Library of Congress, which can essentially bypass DMCA restrictions by fiat. The LoC usually seems pretty progressive about these things.

--
My Other Computer Is A Data General Nova III.
1. Re:Library of Congress by AmiMoJo · 2018-11-29 09:59 · Score: 2
  
  Or move those archives out of the US where the DMCA does not apply.
  This is why we had world wide mirrors back in the day, especially for crypto software.
  
  --
  const int one = 65536; (Silvermoon, Texture.cs)
  SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
2. Re:Library of Congress by Anonymous Coward · 2018-11-29 15:40 · Score: 0
  
  One of the requirements for registering for copyright, not just throwing a MMCV (c) stamp up on something, is to place a copy in the Library of Congress.
  The Internet Archive already has the infrastructure, process and charter to preserve the World Wide Web, Usenet and other digital media. It would be quite appropriate for the Library of Congress to charter the Wayback machine.
  Much like how the US Postal System is often a private company provided a legal monopoly on first class postage, the Archive could be given the authority, through a bill you can propose to your local Senator or Representative, to fulfill the job of preserving our collective digital heritage from the ravages of Lawyers from Disney, mothers against porn, criminals against their past and dictators against the public.
What will happen when.... by Anonymous Coward · 2018-11-29 09:43 · Score: 0

...science finds out that this reality has a log that can not be erased :O
An Internet Archive server in Asgardia by Anonymous Coward · 2018-11-29 09:45 · Score: 0

We will have an Internet Archive server in Asgardia, and just screw every DMCA request in the world !!!!
Anyways. Remember to Donate by martiniturbide · 2018-11-29 09:50 · Score: 3, Informative

Remember to donate to the Internet Archive: https://archive.org/donate/
DOOM ? : https://archive.org/details/do...
Apple II : https://archive.org/details/ap...
Arcade: https://archive.org/details/in...
DOS Games: https://archive.org/details/so...
Like political office holders? by the_skywise · 2018-11-29 09:51 · Score: 4, Insightful

if you deal in a certain bottom-dwelling brand of controversial content
I like how this insinuates that it's the "dark web" trying not to be blocked when it's political leaders, actors and other public personae (people very much out front and wanting to be seen) that go out of their way to delete their internet history when it contradicts with whatever they're pushing today so they can say "this has always been who I am!"
Archiving and to a greater point JOURNALISM (not "reporting" but actually chronicling and journaling the days' notable events in an objective manner) is an indispensable requirement for any person to become educated on a topic and to make an informed decision.
Eventually these things become history and are lost to current thought until somebody digs through the archives to rediscover the truth. Except now we can make it go away with a keypress and, poof, we've always been at war with Eurasia.
1. Re:Like political office holders? by Can'tNot · 2018-11-29 10:46 · Score: 1
  
  I didn't get a "dark web" implication from that, it said controversial, not illegal. There are some select politicians who do this (let's not slander all politicians for the practices of the worst among them), but my first impression was that this was directed at the new breed of partisan "journalism" - where a pundit makes some groundless accusations or unsubstantiated claims, and when those claims are eventually painstakingly refuted the pundit just pretends they never happened.
2. Re:Like political office holders? by Anonymous Coward · 2018-11-29 13:00 · Score: 0
  
  Suppose you had been job hunting in the past. You had multiple versions of your resume. Your history included one project working on signal processing using 64-bit multi-core DSP systems, but since then you have moved on to Linux applications development in scientific visualisation. But going to any interview, recruitment agencies and companies see the words "embedded systems", their eyes light up, glaze over and they pigeon-hole you as an embedded systems type of person for modern 8-bit and 16-bit microcontrollers. That has happened on at least five occasions in the last couple of years. They make a little box with their hands, slide it along the table, and then say, "What if we don't give you this position, but move you onto ...." It's easy to edit your resume to eliminate the whiff of "embedded systems", but they will still find some remotely archived copy of your resume.
3. Re:Like political office holders? by LordAba · 2018-11-30 00:25 · Score: 1
  
  Eventually these things become history and are lost to current thought until somebody digs through the archives to rediscover the truth. Except now we can make it go away with a keypress and, poof, we've always been at war with Eurasia.
  Hmm, true. However, there is also the need to forget some things. If you used Twitter to post a stupid edgy joke back when you were a teen, do you really want that to sabotage your career? Who makes the distinction between what is personal and what is public history? Who decides the limits?
  Not an easy thing to answer, especially when people are currently weaponizing it.
NASA experiment time forgot by Anonymous Coward · 2018-11-29 09:53 · Score: 0

Puzzle of the Pendulum at Archive.org
What is Winter Sunlight?
Working of Error
Remember to not commit treason also. NEXT TIME? by Anonymous Coward · 2018-11-29 09:53 · Score: 0, Troll

https://www.washingtonpost.com/politics/2018/11/29/key-takeaways-michael-cohens-new-plea-deal
1. There are conspicuous mentions of Trump and his family.
2. Putin’s spokesman appears to have helped cover this up.
3. This ties the Trump family’s efforts to the Russian government
4. The deal apparently died the day The Post broke a story about Russian hacking
https://www.huffingtonpost.com/entry/deutsche-bank-offices-raided_us_5c00331de4b027f1097bc8aa
https://www.nydailynews.com/news/politics/ny-pol-manafort-confidential-mueller-trump-giuliani-20181129-story.html
boxing helena problem by trumpai · 2018-11-29 09:54 · Score: 1

Notice the obvious issues related to abuse. Imagine having a personal or non-personal website, let's say slashcomma.org. It doesn't matter what you do with it: you store and share publicly accessible information with the world. You do not block the web archive, so all this information gets archived there from time to time. You and the public get free backups, a medicine for "Memento"-type issues. Now imagine you go missing, disappear, get hit by a bus, etc., and you no longer run the website. The domain expires. It gets registered by another party, and they explicitly state through robots.txt that the new site is not to be browsed by such spiders. So all the backups and your hard work vanish into oblivion. In other words, the new owner of the domain inherits your intellectual property even though they shouldn't have. It's a changeling exploit. Some sort of intelligent design by some sort of intelligence community of tar budgets. You know it's their world. The great digital divide that is. In other words they want you to believe that internet is fair and balanced, but in reality it's just a fair of series of tubes filled with balanced diet of carnival food such as yellow cakes.
Re: Not holding office much longer, TBC by Anonymous Coward · 2018-11-29 09:57 · Score: 0

He was never in office to begin with
Re: Move to ADX Leavenworth by Anonymous Coward · 2018-11-29 09:57 · Score: 0

Retard alert
The fact of removal can still be shown by mi · 2018-11-29 10:01 · Score: 3, Interesting

So, someone requests that you remove a page, and you decide to comply, by replacing it with something like "Content removed by [whom] on [date] at the request of [such and such]."
Requesting removals of evidence suddenly becomes less effective — an explicit record of removal may appear even more sinister, than whatever was there before...

--
In Soviet Washington the swamp drains you.
1. Re:The fact of removal can still be shown by Anonymous Coward · 2018-11-29 10:40 · Score: 0
  
  retard
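
The removal-notice idea at the top of this thread can be sketched in a few lines. This is only an illustration, not how the Internet Archive handles takedowns: the `Removal` record, its field names and the `tombstone` helper are hypothetical, and pairing the notice with HTTP status 451 ("Unavailable For Legal Reasons") is an assumption made here, not something the comment proposes.

```python
from dataclasses import dataclass
from datetime import date
from http import HTTPStatus


@dataclass
class Removal:
    """Hypothetical record of one takedown; field names are illustrative."""
    url: str
    removed_on: date
    requested_by: str
    basis: str


def tombstone(removal):
    """Return (status, body) to serve in place of the withdrawn page."""
    # Assumption: 451 is used to signal that the page is withheld for legal reasons.
    status = HTTPStatus.UNAVAILABLE_FOR_LEGAL_REASONS
    body = (
        f"Content removed on {removal.removed_on.isoformat()} "
        f"at the request of {removal.requested_by} ({removal.basis})."
    )
    return status, body
```

Calling `tombstone(Removal("http://example.com/page", date(2018, 11, 29), "Example Corp", "DMCA notice"))` yields the status and a one-line notice, so the fact of removal stays on the public record even though the content does not.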
Spirit? by Anonymous Coward · 2018-11-29 10:09 · Score: 1

the robots.txt strategy, in practice, does little more than mitigating their risk while going against the spirit of the protocol
This is stupid, if there is a 'spirit' to the protocol, formalize it and enforce it.
Otherwise all you're saying is that it's as useless as do-not-track/do-not-call and that people are free to ignore it as they choose.
The internet has shown us time and time again, if you are relying on people voluntarily following the 'spirit' of something, it's 100% guaranteed they won't.
My take is that adding a robots.txt cannot possibly be retroactive ... as of the time it was crawled, you didn't have this in place ... therefore, you can't claim adding it after several years goes back in time and revokes what was public access to content. You've already published it for everyone to see, and the stuff which was already crawled was out there with public access.
Fuck the 'spirit' of the protocol, do what the corporations do, follow the letter of the law and no more. Trusting to the good behavior and intentions of the internet is stupid.
What is it by nightfire-unique · 2018-11-29 10:19 · Score: 2

What is it that separates some of us, who believe that a proper, immutable archive is more important to our species than copyright restriction, from those who feel otherwise?
Is it just money? Is that all it is? Or is it something deeper?

--
A government is a body of people notably ungoverned - AC
1. Re:What is it by Kyr+Arvin · 2018-11-29 12:59 · Score: 2
  
  What is it that separates some of us, who believe that a proper, immutable archive is more important to our species than copyright restriction, from those who feel otherwise?
  Is it just money? Is that all it is? Or is it something deeper?
  There are many in the copyright/content industry who consider copyright to be akin to physical property, and that their property rights trump anyone's desire to do things with their property.
2. Re:What is it by Anonymous Coward · 2018-11-29 15:37 · Score: 0
  
  What is it that separates some of us, who believe that a proper, immutable archive is more important to our species than copyright restriction, from those who feel otherwise?
  Is it just money? Is that all it is? Or is it something deeper?
  What the hell gave you the impression that copyright restrictions have anything to do with deleting anything? They can just sit in the archive forever, get indexed, summarized, sliced and diced, served a million different ways without breaking any copyright laws.
Spirit of the protocol by markdavis · 2018-11-29 10:20 · Score: 2, Insightful

>"the robots.txt strategy, in practice, does little more than mitigating their risk while going against the spirit of the protocol."
Spirit of the protocol? I kinda disagree with that. If a site admin put up a robots.txt file, then they are clearly signaling they do not want the specified parts of the site crawled/archived/copied. It isn't just a convenient signal to the crawler about what is a waste of time/load/bandwidth, but also a choice the admin made saying "these things should not be crawled", for whatever reason the admin wants.
To me, the only controversial part would be- does having a robots.txt excluding something NOW mean that it should exclude things that had been "OK" in the past (because there was no exclusion back then). Personally, I tend to go with the interpretation that it means "now or in the past" (perhaps they changed their mind or forgot to put up a robots.txt initially). But that is certainly murky.
I think it is very hostile, and very much against the "spirit of the protocol" to ignore a robots.txt file. I could see where it might even have legal ramifications later (similar to a "no photography" sign in a store).
1. Re:Spirit of the protocol by darkain · 2018-11-29 10:33 · Score: 4, Informative
  
  The big issue came about in that some domains lapsed, years later someone else registered said domains, put up robots.txt, and as such the entire history from the previous owners was inadvertently deleted.
2. Re:Spirit of the protocol by markdavis · 2018-11-29 10:47 · Score: 2
  
  >"The big issue came about in that some domains lapsed, years later someone else registered said domains, put up robots.txt, and as such the entire history from the previous owners were inadvertently deleted."
  OK, well I can certainly see where that would be an issue. A big one at that, since it is hard or impossible for any crawler to tell if it is even the same site, since the domain/path might be the same. Of course, one might also argue that if they sold the domain/site/path to someone else, they sold their rights along with it.
  I think ignoring a current robots.txt file for current content is certainly, flat-out "wrong." But it is more murky about deleting PAST stuff based on a current directive. You provided a good example of how it can, indeed, get pretty murky.
3. Re:Spirit of the protocol by Anonymous Coward · 2018-11-30 03:50 · Score: 0
  
  I would think it should be possible to respect robots.txt at the time the site is archived? e.g. in 2017 there's no robots.txt, so those are fine; 2018 robots.txt prohibits crawling, any pages from then on wouldn't be archived, but the version back when there was no robots.txt is still fine and not deleted, but new versions aren't added.
4. Re:Spirit of the protocol by Anonymous Coward · 2018-12-01 11:52 · Score: 0
  
  "sold"
  Actually, the GoDaddys of the world are the 'sellers', in these cases. The original holder has nothing to do with it. It's like if someone stops renewing a cell phone subscription, then an entirely different person tries to 'recover' the accounts that use SMS (texts) sent to it after buying a subscription on that number. See also: 1-800 numbers not being sold.
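
The per-snapshot idea raised in this thread (apply whichever robots.txt was in effect when a page was captured, rather than today's file) could look roughly like the sketch below. It is not the Wayback Machine's actual logic. `Snapshot`, `RobotsVersion`, `robots_in_effect` and `is_viewable` are hypothetical names, the archive is assumed to keep dated copies of each site's robots.txt, and `ia_archiver` is assumed as the crawler's user-agent string. Only `urllib.robotparser` is a real standard-library module.

```python
from dataclasses import dataclass
from datetime import date
from urllib.robotparser import RobotFileParser


@dataclass
class Snapshot:
    """Hypothetical archived capture of one URL."""
    url: str
    captured: date


@dataclass
class RobotsVersion:
    """Hypothetical dated copy of a site's robots.txt."""
    effective_from: date
    text: str


def robots_in_effect(history, when):
    """Return the newest robots.txt dated on or before `when`, or None."""
    candidates = [v for v in history if v.effective_from <= when]
    return max(candidates, key=lambda v: v.effective_from, default=None)


def is_viewable(snapshot, history, agent="ia_archiver"):
    """Judge a snapshot against the rules that applied at capture time."""
    version = robots_in_effect(history, snapshot.captured)
    if version is None:
        # No robots.txt existed yet, so nothing excluded the crawl.
        return True
    parser = RobotFileParser()
    parser.parse(version.text.splitlines())
    return parser.can_fetch(agent, snapshot.url)
```

With a history holding a single 2018 file that disallows everything, a 2017 capture stays viewable and a later capture does not. A new domain owner's robots.txt would then stop future crawls without hiding the previous owner's pages.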
Some things need erasing by darkain · 2018-11-29 10:31 · Score: 1

Some things do indeed need erasing though. I've yet to publish the paper because disclosure was only earlier this month. I found an entire ISP leaking publicly the physical addresses of virtually every single customer on their network (essentially EVERYONE was doxxed at once). This information is mirrored in the Wayback Machine (along with a few other archives). Part of the reason my paper isn't published yet is due to the ISP currently working with these archives to remove the sensitive information the ISP should never have published in the first place.
1. Re:Some things need erasing by Anonymous Coward · 2018-11-29 11:06 · Score: 0
  
  "we posted it on the internet, it's too late, you don't just 'erase' things off the internet"
  When Hollywood says something to the contrary, we laugh at the movie.
  When Hollywood does say the line, we solemnly nod.
  Idioms abound that suggest your expectation is faulty from inception. You want to rebag cats, unspill beans, take the pee out of the pool, and put the toothpaste back in the tube.
  I'm fine with you wishing that Some Things SHOULD Be Erased. It then shares the SHOULD shelf with world peace.
2. Re:Some things need erasing by Anonymous Coward · 2018-11-29 22:49 · Score: 0
  
  Idioms abound that suggest your expectation is faulty from inception. You want to rebag cats, unspill beans, take the pee out of the pool, and put the toothpaste back in the tube.
  None of those things are actually impossible, merely difficult to achieve.
  Something being difficult isn't on its own sufficient reason not to do it.
Free speech damage by Tablizer · 2018-11-29 10:44 · Score: 2

If the DMCA stymies free speech in practice, then it could be considered a violation of the 1st Amendment. Form a coalition to sue all the way up to the Supreme Court.
A similar situation has arisen for the recent "Online Sex Trafficking Act of 2017", which is so vague that it makes hosting any kind of online romantic discussion or message group too risky. One could end up in jail because they don't police content tightly enough.
(Craigslist removed the "personals" discussion group because of that Act. Ads for shady services now spill over into other discussion groups, often ruining them. Craig may end up in jail anyhow for not scrubbing hard enough.)
Both laws have "excessive side-effects" on legitimate free speech.

--
Table-ized A.I.
1. Re:Free speech damage by fustakrakich · 2018-11-29 19:01 · Score: 1
  
  Most people are turning against free speech now.
  
  --
  “He’s not deformed, he’s just drunk!”
2. Re:Free speech damage by Anonymous Coward · 2018-11-30 01:15 · Score: 0
  
  Hey troll, one of those things is not like the others.
holding news media accountable by eaglesrule · 2018-11-29 11:08 · Score: 3, Insightful

Eventually these things become history and are lost to current thought until somebody digs through the archives to rediscover the truth. Except now we can make it go away with a keypress and, poof, we've always been at war with Eurasia.
There is more of an immediate need, since the ability to stealth edit a story after publishing it is too great a temptation to resist. There's been too many examples of 'reputable' news sources getting caught red handed doing this.
Anyway, an archive source that is subject to the hideously malformed DMCA is hardly an archive source at all.
1. Re:holding news media accountable by Can'tNot · 2018-11-29 14:21 · Score: 1
  
  You should link the response as well. The way you phrase it, "caught red handed," might suggest that this was some sort of nefarious conspiracy.
Robots.txt by Anonymous Coward · 2018-11-29 11:31 · Score: 0

Why did archive.org ever choose to respect robots.txt anyways? Archive.is doesn't. Sites shouldn't put content up if they don't want someone to see it.
Hellooo NAFTA! Hello digital dark ages! by Anonymous Coward · 2018-11-29 11:33 · Score: 2, Interesting

You will find out, what those agreements were made for.
This happens all over the world.
Scientists already call it the "digital dark ages".
Fun fact:
There was a time, when Germany didn't have any such laws, but the UK did.
The UK's creative scene suffocated. While Germany's creative scene flourished so much, that that is, where it got its title "the land of poets and thinkers" from.
I recommend you look it up, and don't just believe an AC on the Internet.
And all, because some cokeheads didn't want to work for their money, but leech on artists, and then demand real money, not copies of money, but real money, that people actually had to work for, ... in return for mere copies.
Imagine us doing that: We'd be like: "Hey! I worked HARD for those $100 money! So you better accept 100 copies of my $100 bill of "labor property" as payment for this meal, or I'll go WAAAHHH, buy a law and force you to give me a TRILLION meals, while I get to call you a seafaring rapist thug!"
They are leeches and thieves. Nothing else.
I pay for concerts. For actual work. Because guess what: I had to work for it too!
I don't pay for copies. Unless I can pay with copies too. Period.
Why not host it in another country? by Solandri · 2018-11-29 11:34 · Score: 1

One which doesn't have a DMCA?
1. Re:Why not host it in another country? by Anonymous Coward · 2018-11-29 23:04 · Score: 0
  
  Loyalty to empire.
Our mission is we have it, but maybe can't show it by Anonymous Coward · 2018-11-29 12:03 · Score: 0

Seems like if you go to the archive for a forgotten page, they may not be able to retrieve it, but sure could have it in the archive.
And they could tell you why you can't see it, then you could challenge it if it is worth it to you.
That makes the archive still serve its primary mission, which is archiving. (Retrieval is secondary. You can have one without the other, but not the reverse.)
That puts the cost burden for challenges in a more scalable place.
The courts can still provide specific retrievals over DMCA.
Will you do what it takes to make a... by Anonymous Coward · 2018-11-29 12:13 · Score: 0

Truly 'free' international archive?
Because at this point in time it would require founding a new nation, replacing an existing small one in a coup, then getting internet access and funds to it to maintain the archive and expand its hardware base.
IA requires a *LOT* of hard disks. Basically Backblaze style storage pods. And unless you have access to a steady supply of those, plus internet, plus finances, plus a country with laws tailored to allow you to archive, you are FUBAR.
Why forget? by Anonymous Coward · 2018-11-29 13:20 · Score: 0

Why not store it, but not publish?
Didn't Archive.org stop honoring robots.txt files? by Anonymous Coward · 2018-11-29 14:17 · Score: 0

In mid-2017, Digital Trends reported that "Internet Archive will ignore robots.txt files to keep historical record accurate". At the time of the article, Archive.org already disregarded robots.txt files on U.S. government and military sites, but confirmed that they planned to "do this [ignore robots.txt] more broadly".
Did the expansion of this policy actually happen? Archive.org doesn't seem to have commented further.
DT Link
Many archives! by shplopt · 2018-11-29 15:34 · Score: 1

Let's not forget that there are other internet archives out there. I'm thinking specifically of archive.is, which has less willingness to acquiesce to takedown requests and robots.txt exclusions, but there are others out there that could surely use our support. A patchwork approach of several archives with differing approaches and goals (general, academic, scientific; text, image, video, etc.) could provide a more robust backup of the web. A single archive is a single point of failure.
British Library UK Web Archive by Martin+S. · 2018-11-29 20:13 · Score: 2

UK websites are covered by the British Library
https://www.bl.uk/collection-g...
http://data.webarchive.org.uk/...