Inside the Internet Archives
blackbearnh writes "O'Reilly Media is running an interview with Gordon Mohr, Chief Technologist for the Internet Archive (archive.org). If you've ever wondered how pages are selected for archiving, or just how they manage such a huge quantity of data, the answers are here. The interview also touches on the problems of intellectual property in archives, archiving the Internet in a post Web 2.0 world, and the potential vulnerabilities exposed by archiving web sites that may include security exploits."
I keep running into bookmarks that have gone AWOL, then find that archive.org also doesn't have the pages anymore.
A combined bookmarking / caching service would be really handy.
I had a cheesy site back in college where I played around with HTML and learning the basics. I ended up making a few pages that poked fun at friends.
I went to archive.org years later looking for them, because I remember they'd nabbed 'em back in the day, and now they're all gone. The images and sounds I used are gone too.
I wanted to recreate a page from that archive for nostalgia reasons with my old friends. Can't do it and I can't find the files anymore in my local archives.
I was kinda disappointed, but I guess I was expecting too much. I really wish there was a true and complete archive of the internet that didn't care what was there; it just had it.
This is an unfortunate side effect of their policies but it is very understandable that they would like to err on the side of caution.
Should the robots.txt ever go away or change, your old stuff will become accessible again.
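To make that behavior concrete, here is a minimal Python sketch, assuming the archive re-checks a site's current robots.txt before serving an old capture. The function name `is_capture_visible` is hypothetical, and `ia_archiver` as the user-agent string is an assumption. This is an illustration of the idea, not the archive's actual code.

```python
from urllib.robotparser import RobotFileParser


def is_capture_visible(current_robots_txt: str, path: str,
                       agent: str = "ia_archiver") -> bool:
    """Return True if an old capture of `path` would still be shown.

    Assumption: visibility depends only on the site's *current*
    robots.txt, not on the one in place when the page was captured.
    """
    parser = RobotFileParser()
    # parse() takes an iterable of lines; an empty file allows everything.
    parser.parse(current_robots_txt.splitlines())
    return parser.can_fetch(agent, path)


# A site (or a squatter who later takes over the domain) blocks the crawler:
blocking = "User-agent: ia_archiver\nDisallow: /\n"
print(is_capture_visible(blocking, "/old/page.html"))

# The robots.txt goes away again, so the old captures reappear:
print(is_capture_visible("", "/old/page.html"))
```

Under that assumption nothing is deleted. The captures are only hidden for as long as the blocking rule is in place.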
I was left with several questions that weren't addressed by the article.
The Slashdot summary says the article explains how pages are selected for archiving, but I couldn't find anything in the article that explicitly explained that. It does say that the actual crawler is run by Alexa, which hands off the data to them, but it didn't say what the criteria were. Alexa computes various stats about web sites, so presumably they could apply some kind of minimum cut. Or do they try to index every single lame personal page, unless the owner opts out? That seems like it would require an unreasonable amount of disk space. The web also has a lot of stuff like, e.g., the kind of spam sites that try to scam Google's search/ad system; I wonder if the archive records those.
The article didn't say a darn thing about funding. They have to run thousands of machines, so the electric bills must be formidable. Where the heck do they get their money? Is there a significant chance that their funding will dry up at some point in the future, and the whole archive will disappear?
The article states that they moved from plain Debian to Ubuntu. That surprised me, and I was curious why they'd do that. E.g., if you're shopping for webhosts, it's much more common for them to offer plain Debian than Ubuntu. I love Ubuntu as a desktop distro, but it surprises me that they'd see any big advantage in using Ubuntu for their application.
If you publish something, you lose the right to withdraw it from the public archives retrospectively. That's part of the "contract" (term used figuratively) with the public that establishes the foundation of copyright law.
If you don't want it to appear on the Wayback Machine, you have a tool called robots.txt. That's already more than you have if you publish a book and want to keep it out of libraries. In neither case, though, do you have the right to demand or expect the content to be removed from the archive on your request.
I see what the archive does as a courtesy service, not something that site owners should expect.
It doesn't hurt to be nice.
Ideally they could obey the robots.txt at the time of archiving, and simultaneously grab a snapshot of the whois record. In the future, new robots.txts would by default only take away previously archived content if the domain hadn't changed hands. This would keep squatters from killing the archive, and the original copyright owner could always actively request removal of content if s/he matched the old whois record (though this would take manpower at archive.org, which is a problem).
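The proposal above could be sketched roughly as follows in Python. Everything here is hypothetical: the `Capture` record, the field names, and the idea of comparing whois registrants as plain strings are assumptions made for illustration, not anything archive.org is known to implement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capture:
    url: str
    # Hypothetical field: registrant taken from a whois snapshot
    # stored alongside the page at capture time.
    registrant_at_capture: str


def should_hide(capture: Capture, robots_blocks_now: bool,
                current_registrant: str) -> bool:
    """A new robots.txt hides old captures only if the domain
    has not changed hands since the capture was made."""
    if not robots_blocks_now:
        return False
    return capture.registrant_at_capture == current_registrant


def may_request_removal(capture: Capture, requester: str) -> bool:
    """The original owner can still ask for removal explicitly."""
    return capture.registrant_at_capture == requester


old = Capture("http://example.com/page.html", "Alice Example")
print(should_hide(old, True, "Alice Example"))   # same owner blocks the crawler
print(should_hide(old, True, "Squatter Inc."))   # domain changed hands
```

Real whois records are messy (privacy proxies, reformatted names), so an exact string match is the weakest part of this sketch, and matching by hand is where the manpower cost would come in.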
I think it is really weird that EVERY SINGLE news site on the Internet is mysteriously missing any captures from May 2001 to Sept 2001 (maybe one or two days in July are there).
And then all of a sudden on Sept 11, ALL the news sites have multiple captures per day.
I want to see what CNN, LA times, Washington Post, etc. had in the news on Sept 8th, 9th and 10th...
The transition from Debian to Ubuntu was driven by developers' desire for more and newer features. We originally went with Debian-Stable because it was, well, stable, and did everything we needed the PetaBox to do at the time. But programmers whined and moaned that such-and-such package wasn't supported, or was too old, and claimed that this held back development of features which Brewster wanted to see made into reality.
Brewster was never much for stability anyway, so the transition was made. It bit us several times, as Ubuntu is not as stable as Debian-Stable (which is to be expected when releases happen more often and newer software is deployed without extensive testing), but the developers were a lot happier with it. And, to be fair, while some of the problems have been substantial (like kernel bugs which interacted with the forcedeth device drivers to make servers freeze ~10% of the time when power cycled), afaik it has not contributed directly to data lossage (which is the bottom line at an archive).
-- TTK