Inside the Internet Archives

Posted by CmdrTaco on Wednesday June 18, 2008 @02:46AM from the those-who-ignore-history-are-doomed-to-eat-a-sandwich dept.

blackbearnh writes "O'Reilly Media is running an interview with Gordon Mohr, Chief Technologist for the Internet Archive (archive.org). If you've ever wondered how pages are selected for archiving, or just how they manage such a huge quantity of data, the answers are here. The interview also touches on the problems of intellectual property in archives, archiving the Internet in a post-Web 2.0 world, and the potential vulnerabilities exposed by archiving web sites that may include security exploits."

10 of 85 comments

I wished archive.org stored even more stuff by jacquesm · 2008-06-18 03:07 · Score: 3, Insightful

I keep running into bookmarks that have gone AWOL, then find that archive.org doesn't have the pages anymore either.

Combining a bookmarking and a caching service would be really handy.

--
MP3 Search Engine
1. Re:I wished archive.org stored even more stuff by jacquesm · 2008-06-18 03:50 · Score: 2, Insightful
  
  hehe, yes, so true, but then you can't access it electronically any more.
  
  I really think the bookmark + cache would be a nice thing to have without resorting to 'dead tree' format.
  
  But it's a good point, a printer would be an easy way to collect stuff that you really want / need to keep.
  
  --
  MP3 Search Engine
Downside of IP considerations... by CFBMoo1 · 2008-06-18 03:23 · Score: 2, Insightful

I had a cheesy site back in college where I played around with HTML and learning the basics. I ended up making a few pages that poked fun at friends.

I went to archive.org years later looking for them cause I remember back in the day they nabbed em and now they're all gone. The images and sounds I used were all gone.

I wanted to recreate a page from that archive for nostalgia reasons with my old friends. Can't do it and I can't find the files anymore in my local archives.

I was kinda disappointed, but I guess it was expecting too much. I really wish there was a true and complete archive of the internet that didn't care what was there; it just had it.

--
~~ Behold the flying cow with a rail gun! ~~
Re:Wayback by ibwolf · 2008-06-18 03:27 · Score: 2, Insightful

This is an unfortunate side effect of their policies but it is very understandable that they would like to err on the side of caution.

Should the robots.txt ever go away or change then your old stuff will become accessible again.
Re:Wayback by iangoldby · 2008-06-18 03:29 · Score: 3, Insightful

"wayback refuses to show any ARCHIVED pages where the domain CURRENTLY has a robots.txt."
In true Raymond Chen style, think about what the world would be like if this wasn't true: If it wasn't true, then a site owner would have no way to remove his content from the Wayback Machine retrospectively. That raises far more problems than the ability of a new owner to remove a previous owner's content.
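
For illustration, the retroactive rule discussed here can be sketched in a few lines of Python. This is only a guess at the logic, not the Wayback Machine's actual code: the function name `is_playback_allowed` is hypothetical, and the `ia_archiver` user-agent string is an assumption that does not appear in this thread. The sketch uses the standard library's `urllib.robotparser` to test an archived URL against the robots.txt the domain serves today.

```python
from urllib.robotparser import RobotFileParser


def is_playback_allowed(current_robots_txt: str,
                        archived_url: str,
                        agent: str = "ia_archiver") -> bool:
    """Return True if the domain's CURRENT robots.txt permits `agent`
    to fetch `archived_url`.

    Hypothetical helper: the agent name is an assumption, and the real
    archive may apply different or additional rules.
    """
    parser = RobotFileParser()
    # Parse the text of the robots.txt as served today,
    # not the one that was in place when the page was captured.
    parser.parse(current_robots_txt.splitlines())
    return parser.can_fetch(agent, archived_url)


if __name__ == "__main__":
    todays_robots = "User-agent: *\nDisallow: /\n"
    print(is_playback_allowed(todays_robots, "http://example.com/old/page.html"))
```

Because the check runs against the current file, whoever controls the domain now decides whether old captures are shown, which is the trade-off being argued over in this thread.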
selection? funding? why not plain Debian? by bcrowell · 2008-06-18 03:43 · Score: 2, Insightful

I was left with several questions that weren't addressed by the article.

The Slashdot summary says the article explains how pages are selected for archiving, but I couldn't find anything in the article that explicitly explained that. It does say that the actual crawler is run by Alexa, which hands off the data to the Archive, but it didn't say what the criteria were. Alexa computes various stats about web sites, so presumably they could apply some kind of minimum cut. Or do they try to index every single lame personal page, unless the owner opts out? That seems like it would require an unreasonable amount of disk space. The web also has a lot of stuff like, e.g., the kind of spam sites that try to scam Google's search/ad system; I wonder if the archive records those.

The article didn't say a darn thing about funding. They have to run thousands of machines, so the electric bills must be formidable. Where the heck do they get their money? Is there a significant chance that their funding will dry up at some point in the future, and the whole archive will disappear?

The article states that they moved from plain Debian to Ubuntu. That surprised me, and I was curious why they'd do that. E.g., if you're shopping for webhosts, it's much more common for them to offer plain Debian than Ubuntu. I love Ubuntu as a desktop distro, but it surprises me that they'd see any big advantage in using Ubuntu for their application.

--
Find free books.
Re:Wayback by SydShamino · 2008-06-18 03:44 · Score: 5, Insightful

"If it wasn't true, then a site owner would have no way to remove his content from the Wayback Machine retrospectively."

I don't necessarily disagree with their policy, but this is the wrong argument for it.

If you publish something, you lose the right to withdraw it from the public archives retrospectively. That's part of the "contract" (term used figuratively) with the public that establishes the foundation of copyright law.

If you don't want it to appear on the Wayback Machine, you have an ability called robots.txt. That's already more than you have if you publish a book and want to keep it out of libraries. In neither case, though, do you have the right to demand or expect the content to be removed from the archive on your request.

I see what the archive does as a courtesy service, not something that the site owners should expect.

--
It doesn't hurt to be nice.
Re:Wayback by RareButSeriousSideEf · 2008-06-18 03:52 · Score: 3, Insightful

Ideally they could obey the robots.txt at the time of archiving, and simultaneously grab a snapshot of the whois record. In the future, new robots.txts would by default only take away previously archived content if the domain hadn't changed hands. This would keep squatters from killing the archive, and the original copyright owner could always actively request removal of content if s/he matched the old whois record (though this would take manpower at archive.org, which is a problem).
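
A minimal sketch of that policy, assuming the archive stores a registrant name from the whois snapshot alongside each capture. Everything here is hypothetical (the `Capture` class, its field names, and both functions), and comparing registrant strings is a naive stand-in for real ownership matching; manual removal requests are left out entirely.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Capture:
    url: str
    captured_on: date
    # Hypothetical field: registrant name copied from the whois
    # record that was snapshotted at capture time.
    registrant_at_capture: str


def should_capture(blocked_by_robots_at_crawl: bool) -> bool:
    """Obey the robots.txt that is in force at the time of archiving."""
    return not blocked_by_robots_at_crawl


def is_visible(capture: Capture,
               current_registrant: str,
               blocked_by_current_robots: bool) -> bool:
    """Decide whether an existing capture should still be shown."""
    if not blocked_by_current_robots:
        return True
    # A new robots.txt only hides old captures when the domain has not
    # changed hands (naive, case-insensitive string comparison).
    same_owner = (capture.registrant_at_capture.strip().lower()
                  == current_registrant.strip().lower())
    return not same_owner


if __name__ == "__main__":
    old = Capture("http://example.com/", date(2001, 5, 1), "Alice Example")
    print(is_visible(old, "Domain Squatter LLC", blocked_by_current_robots=True))
```

Under this rule a squatter's blanket robots.txt leaves the earlier owner's captures visible, while the original owner can still hide them by publishing a robots.txt.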

--
Pi Ran Out
Why is so much of 2001 missing? by Anonymous Coward · 2008-06-18 07:15 · Score: 1, Insightful

I think it is really weird that EVERY SINGLE news site on the Internet is mysteriously missing any captures from May 2001 to Sept 2001 (maybe one or two days in July are there).

And then all of a sudden on Sept 11, ALL the news sites have multiple captures per day.

I want to see what CNN, LA Times, Washington Post, etc. had in the news on Sept 8th, 9th and 10th...
Re:2008 is the year of Linux on the Archive! by TTK Ciar · 2008-06-18 15:45 · Score: 2, Insightful

The transition from Debian to Ubuntu was driven by developers' desire for more and newer features. We originally went with Debian-Stable because it was, well, stable, and did everything we needed the PetaBox to do at the time. But programmers whined and moaned that such-and-such package wasn't supported, or was too old, and claimed that this held back development of features which Brewster wanted to see made into reality.

Brewster was never much for stability anyway, so the transition was made. It bit us several times, as Ubuntu is not as stable as Debian-Stable (which is to be expected when releases happen more often and newer software is deployed without extensive testing), but the developers were a lot happier with it. And, to be fair, while some of the problems have been substantial (like kernel bugs which interacted with the forcedeth device drivers to make servers freeze ~10% of the time when power cycled), afaik it has not contributed directly to data lossage (which is the bottom line at an archive).

-- TTK