Inside the Internet Archives


Posted by CmdrTaco on Wednesday June 18, 2008 @02:46AM from the those-who-ignore-history-are-doomed-to-eat-a-sandwich dept.

blackbearnh writes "O'Reilly Media is running an interview with Gordon Mohr, Chief Technologist for the Internet Archive (archive.org). If you've ever wondered how pages are selected for archiving, or just how they manage such a huge quantity of data, the answers are here. The interview also touches on the problems of intellectual property in archives, archiving the Internet in a post Web 2.0 world, and the potential vulnerabilities exposed by archiving web sites that may include security exploits."

13 of 85 comments

Oblig. Clarke/Kubrick by Anonymous Coward · 2008-06-18 02:51 · Score: 3, Funny

My God, it's full of ones and zeros!
mutual exclusivity? by Itninja · 2008-06-18 02:53 · Score: 3, Funny

The Interviewer: And I'm not sure I want to think about what posterity is going to think about a recording of my Twitter feed.

If Twitter becomes so mainstream that it is more than a 'remember when?' to posterity, I will kill myself.

--
I judt got a nre Kinesis keybiartf so please excusr ant egregiou typos.
but is it indexed by google? by AliasMarlowe · 2008-06-18 03:03 · Score: 4, Funny

and does archive.org record google's cache?

--
Those who can make you believe absurdities can make you commit atrocities. - Voltaire
I wished archive.org stored even more stuff by jacquesm · 2008-06-18 03:07 · Score: 3, Insightful

I keep running into bookmarks that have gone AWOL, then find that archive.org doesn't have the pages anymore either.

Combining a bookmarking / caching service would be really handy.

--
MP3 Search Engine
1. Re:I wished archive.org stored even more stuff by blhack · 2008-06-18 03:20 · Score: 5, Funny
  
  Combining a bookmarking / caching service would be really handy.
  I heard that Lexmark makes one; it's called a "printer".
  
  --
  NewslilySocial News. No lolcats allowed.
2. Re:I wished archive.org stored even more stuff by RareButSeriousSideEf · 2008-06-18 03:44 · Score: 3, Interesting
  
  Yeah, how exactly do pages go AWOL from archive.org? I've encountered that, plus pages suddenly acquiring META refresh tags (maybe through an external script or iframe?) that redirect to some domain squatter's site now. Extremely annoying. I'm going to have to mess around with wget to see what's in the markup, unless someone can suggest an easier way to get at such content.
  Combining a bookmarking / caching service would be really handy.
  Furl fits that bill, doesn't it?
  
  --
  Pi Ran Out
Wayback by TheRealMindChild · 2008-06-18 03:18 · Score: 4, Informative

While I love the wayback machine, a little "problem" crept in a couple of years ago that is still there... and it drives me nuts.

At one point, I forgot to renew my domain name and a squatter snatched it up the second it was available. I have since lost the HTML, Java applets, images, etc. that I originally had there. I used to show people what it looked like via the wayback machine. But you can't do it anymore. Example: http://web.archive.org/web/*/http://www.mindchild.net

Apparently, the current squatter put a robots.txt on that domain, and wayback refuses to show any ARCHIVED pages where the domain CURRENTLY has a robots.txt. I emailed them about it, and after a couple of months, I actually got a reply pretty much saying "That is just the way it is. We are underfunded and have no time to fix it. Sorry".

So if for some reason you don't want to have your site viewable via the wayback machine, just put up a robots.txt. It doesn't even need to contain anything.

--
"When life gives you lemons, don't make lemonade. Make life take the lemons back!" -- Cave Johnson
1. Re:Wayback by iangoldby · 2008-06-18 03:29 · Score: 3, Insightful
  
  wayback refuses to show any ARCHIVED pages where the domain CURRENTLY has a robots.txt.
  In true Raymond Chen style, think about what the world would be like if this wasn't true: If it wasn't true, then a site owner would have no way to remove his content from the Wayback Machine retrospectively. That raises far more problems than the ability of a new owner to remove a previous owner's content.
2. Re:Wayback by SydShamino · 2008-06-18 03:44 · Score: 5, Insightful
  
  If it wasn't true, then a site owner would have no way to remove his content from the Wayback Machine retrospectively.
  I don't necessarily disagree with their policy, but this is the wrong argument for it.
  
  If you publish something, you lose the right to withdraw it from the public archives retrospectively. That's part of the "contract" (term used figuratively) with the public that establishes the foundation of copyright law.
  
  If you don't want it to appear on the Wayback Machine, you have an ability called robots.txt. That's already more than you have if you publish a book and want to keep it out of libraries. In neither case, though, do you have the right to demand or expect the content to be removed from the archive on your request.
  
  I see what the archive does as a courtesy service, not something that site owners should expect.
  
  --
  It doesn't hurt to be nice.
3. Re:Wayback by RareButSeriousSideEf · 2008-06-18 03:52 · Score: 3, Insightful
  
  Ideally they could obey the robots.txt at the time of archiving, and simultaneously grab a snapshot of the whois record. In the future, new robots.txts would by default only take away previously archived content if the domain hadn't changed hands. This would keep squatters from killing the archive, and the original copyright owner could always actively request removal of content if s/he matched the old whois record (though this would take manpower at archive.org, which is a problem).
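
The proposal above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `Capture` record, its `registrant` field (standing in for a whois snapshot taken at crawl time), and the `visible_captures` function are all hypothetical names, and nothing here reflects how archive.org actually stores or filters captures. It models only the retroactive part of the idea, not the check of robots.txt at crawl time or the manual removal request.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class Capture:
    """One archived snapshot (hypothetical record layout)."""
    url: str
    captured_on: date
    registrant: str  # assumed: whois registrant recorded at crawl time


def visible_captures(captures: List[Capture],
                     current_registrant: str,
                     robots_excludes_now: bool) -> List[Capture]:
    """Apply a current robots.txt exclusion only to captures made under
    the same registrant that owns the domain today."""
    if not robots_excludes_now:
        return list(captures)
    # Captures recorded under a different owner stay visible.
    return [c for c in captures if c.registrant != current_registrant]


captures = [
    Capture("http://example.org/", date(2003, 5, 1), "original-owner"),
    Capture("http://example.org/", date(2007, 9, 1), "squatter"),
]
still_visible = visible_captures(captures, "squatter", robots_excludes_now=True)
print([c.captured_on.isoformat() for c in still_visible])
```

In this toy data the squatter's robots.txt hides only the 2007 capture; the 2003 capture from the earlier owner remains viewable.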
  
  --
  Pi Ran Out
Remember Slashdot in its Infancy? by dbarron · 2008-06-18 03:45 · Score: 5, Interesting

Check this out....it reads like a free software update blog :)
http://web.archive.org/web/19980113191222/http://slashdot.org/
Squatters & robots.txt Re:Wayback by gojomo · 2008-06-18 05:55 · Score: 5, Informative

Unfortunately, this "squatters-add-robots-restrictions" problem comes up a lot.

We'd like to address it, and to do so there are two major issues to be tackled: (1) our current Wayback Machine software only excludes sites on a "for all time" basis; (2) short of mechanistically trusting the current domain owner, determining who has the right to exclude or restore material could be a very labor-intensive, error-prone, and liability-compounding process.

The new open-source 'Wayback' software, which will go live for the Worldwide Wayback Machine later this year, enables time-range exclusions. (It's currently only used for many smaller collections we do for partners.) That should give us the capability to address (1). Addressing (2) will require further discussion about the proper and efficient policies -- but it's on our agenda once the technical capability for time-range exclusions is in place.
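
To make the difference between a "for all time" exclusion and a time-range exclusion concrete, here is a minimal sketch. The `Exclusion` class, its fields, and `is_viewable` are assumptions made up for illustration; they are not the Wayback software's actual data model or API, and the dates are arbitrary.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class Exclusion:
    """Hide captures of `host` made between `start` and `end` (hypothetical)."""
    host: str
    start: Optional[datetime] = None  # None means "from the beginning"
    end: Optional[datetime] = None    # None means "no end date"

    def covers(self, host: str, captured_at: datetime) -> bool:
        if host != self.host:
            return False
        if self.start is not None and captured_at < self.start:
            return False
        if self.end is not None and captured_at > self.end:
            return False
        return True


def is_viewable(host: str, captured_at: datetime,
                exclusions: Iterable[Exclusion]) -> bool:
    """A capture is viewable unless some exclusion covers it."""
    return not any(rule.covers(host, captured_at) for rule in exclusions)


# "For all time" exclusion: both bounds left open.
all_time = [Exclusion("example.org")]
# Time-range exclusion: only captures from 2006 onward are hidden.
ranged = [Exclusion("example.org", start=datetime(2006, 1, 1))]

old_capture = datetime(2003, 5, 1)
print(is_viewable("example.org", old_capture, all_time))  # False
print(is_viewable("example.org", old_capture, ranged))    # True
```

With a bounded rule, an exclusion that starts when a domain changes hands would leave the earlier captures viewable, which is the capability issue (1) asks for.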

Specifically regarding the mindchild.net site you mention, it looks like the issue is that our current retroactive-exclude robots.txt-parser doesn't understand the 'Allow' directive. (The mindchild.net/robots.txt tries to enable ia_archiver/WaybackMachine access via an 'Allow'.) That too will be fixed in the new 'Wayback' deploy (if not sooner).
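
As a rough illustration of why 'Allow' support matters, the sketch below compares Python's standard `urllib.robotparser`, which understands `Allow`, with a deliberately naive check that reads only `Disallow` lines. The robots.txt content is hypothetical (the real mindchild.net file is not shown in this thread), and `disallow_only_can_fetch` is an invented stand-in, not the Archive's parser. Python's parser applies the first matching rule in file order, so the `Allow` line is listed first.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the real mindchild.net file is not shown here.
ROBOTS_TXT = """\
User-agent: ia_archiver
Allow: /index.html
Disallow: /
"""


def disallow_only_can_fetch(robots_txt: str, path: str) -> bool:
    """Naive check that reads only Disallow lines, ignoring Allow lines
    and User-agent groups. Illustrative only, not IA's parser."""
    prefixes = []
    for line in robots_txt.splitlines():
        field, _, value = line.partition(":")
        if field.strip().lower() == "disallow" and value.strip():
            prefixes.append(value.strip())
    return not any(path.startswith(prefix) for prefix in prefixes)


parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "http://example.org/index.html"
print(parser.can_fetch("ia_archiver", url))                # Allow-aware check
print(disallow_only_can_fetch(ROBOTS_TXT, "/index.html"))  # Allow-blind check
```

For this file the standard-library parser permits /index.html for ia_archiver, while the Disallow-only check rejects it because it sees nothing but `Disallow: /`.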

- Gordon @ IA
Re:Searchable Archive. . . by gojomo · 2008-06-18 18:08 · Score: 3, Informative

'Recall' wasn't exactly Google-like search. IIRC, in some respects it was better, with an advanced idea of related concepts, and with data on frequency of terms over time. In other respects, it was not what people would expect: there was no exact phrase matching, and certain terms that didn't become tracked concepts weren't findable at all, even though you could see the words in other indexed results.

Unfortunately, IA couldn't maintain the deployment when the developer, Anna Patterson, moved to Google. So, Recall turned out to be a short-lived experiment, grand in scale of pages indexed and novel features but not in traffic served.

Patterson did big things at Google and now has another search startup, Cuill, that's likely to do more good things for the web.

At the Internet Archive, we've also been using the open-source projects Nutch and Hadoop to offer search on smaller web collections for our partners. (A pair of such searchable partner collections for the US National Archives and Records Administration lives at webharvest.gov.) Someday we may be able to scale these up to the full 11+ year archive.
- Gordon @ IA