Inside the Internet Archives
blackbearnh writes "O'Reilly Media is running an interview with Gordon Mohr, Chief Technologist for the Internet Archive (archive.org). If you've ever wondered how pages are selected for archiving, or just how they manage such a huge quantity of data, the answers are here. The interview also touches on the problems of intellectual property in archives, archiving the Internet in a post Web 2.0 world, and the potential vulnerabilities exposed by archiving web sites that may include security exploits."
My God, it's full of ones and zeros!
The Interviewer: And I'm not sure I want to think about what posterity is going to think about a recording of my Twitter feed.
If Twitter becomes so mainstream so as to be more than a 'remember when?' to posterity I will kill myself.
I judt got a nre Kinesis keybiartf so please excusr ant egregiou typos.
and does archive.org record google's cache?
Those who can make you believe absurdities can make you commit atrocities. - Voltaire
I keep running into bookmarks that have gone awol, then find that archive.org also doesn't have the pages anymore.
Combining a bookmarking / caching service would be really handy.
MP3 Search Engine
While I love the wayback machine, a little "problem" crept in a couple of years ago that is still there... and it drives me nuts.
At one point, I forgot to renew my domain name and a squatter snatched it up the second it was available. I have since lost the html/java applets/images/etc that I had originally there. I used to show people what it looked like via the wayback machine. But you can't do it anymore. Example: http://web.archive.org/web/*/http://www.mindchild.net
Apparently, the current squatter put a robots.txt on that domain, and wayback refuses to show any ARCHIVED pages where the domain CURRENTLY has a robots.txt. I emailed them about it, and after a couple of months, I actually got a reply pretty much saying "That is just the way it is. We are underfunded and have no time to fix it. Sorry".
So if for some reason you don't want to have your site viewable via the wayback machine, just put up a robots.txt. It doesn't even need to contain anything.
"When life gives you lemons, don't make lemonade. Make life take the lemons back!" -- Cave Johnson
I had a cheesy site back in college where I played around with HTML and learning the basics. I ended up making a few pages that poked fun at friends.
I went to archive.org years later looking for them cause I remember back in the day they nabbed em and now they're all gone. The images and sounds I used were all gone.
I wanted to recreate a page from that archive for nostalgia reasons with my old friends. Can't do it and I can't find the files anymore in my local archives.
I was kinda disappointed but I guess it was expecting too much. I really wish there was a true and complete archive of the internet that didn't care what was there; it just had it.
~~ Behold the flying cow with a rail gun! ~~
I was left with several questions that weren't addressed by the article.
The slashdot summary says the article explains how pages are selected for archiving, but I couldn't find anything in the article that explicitly explained that. It does say that the actual crawler is run by alexa, which hands off the data to them, but it didn't say what the criteria were. Alexa computes various stats about web sites, so presumably they could apply some kind of minimum cut. Or do they try to index every single lame personal page, unless the owner opts out? That seems like it would require an unreasonable amount of disk space. The web also has a lot of stuff like, e.g., the kind of spam sites that try to scam google's search/ad system; I wonder if the archive records those.
The article didn't say a darn thing about funding. They have to run thousands of machines, so the electric bills must be formidable. Where the heck do they get their money? Is there a significant chance that their funding will dry up at some point in the future, and the whole archive will disappear?
The article states that they moved from plain Debian to Ubuntu. That surprised me, and I was curious why they'd do that. E.g., if you're shopping for webhosts, it's much more common for them to offer plain Debian than Ubuntu. I love Ubuntu as a desktop distro, but it surprises me that they'd see any big advantage in using Ubuntu for their application.
Find free books.
Check this out....it reads like a free software update blog :)
http://web.archive.org/web/19980113191222/http://slashdot.org/
HAW HAW!
Sorry, I couldn't resist.
You are in a maze of twisty little passages, all alike.
Unfortunately, this "squatters-add-robots-restrictions" problem comes up a lot.
We'd like to address it, and to do so there are two major issues to be tackled: (1) our current Wayback Machine software only excludes sites on a "for all time" basis; (2) short of mechanistically trusting the current domain owner, determining who has the right to exclude or restore material could be a very labor-intensive, error-prone, and liability-compounding process.
The new open-source 'Wayback' software, which will go live for the Worldwide Wayback Machine later this year, enables time-range exclusions. (It's currently only used for many smaller collections we do for partners.) That should give us the capability to address (1). Addressing (2) will require further discussion about the proper and efficient policies -- but it's on our agenda once the technical capability for time-range exclusions is in place.
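To make the difference between a "for all time" exclusion and a time-range exclusion concrete, here is a minimal sketch. It is not the Wayback software's actual data model: the `Exclusion` class, the `is_viewable` function, the domain, and the dates are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class Exclusion:
    """Hypothetical exclusion record; None means the range is open on that side."""
    domain: str
    start: Optional[datetime] = None
    end: Optional[datetime] = None

    def covers(self, domain: str, captured_at: datetime) -> bool:
        # An exclusion applies only to its own domain and only to
        # captures whose timestamp falls inside [start, end).
        if domain != self.domain:
            return False
        if self.start is not None and captured_at < self.start:
            return False
        if self.end is not None and captured_at >= self.end:
            return False
        return True


def is_viewable(domain: str, captured_at: datetime,
                exclusions: Iterable[Exclusion]) -> bool:
    """A capture is viewable if no exclusion covers it."""
    return not any(e.covers(domain, captured_at) for e in exclusions)


# Assumed scenario: a new owner takes over example.org in 2005 and excludes it.
old_capture = datetime(2001, 6, 1)
all_time = [Exclusion("example.org")]                            # hides every capture
ranged = [Exclusion("example.org", start=datetime(2005, 1, 1))]  # hides 2005 onward only

print(is_viewable("example.org", old_capture, all_time))  # False
print(is_viewable("example.org", old_capture, ranged))    # True
```

With only the all-time form available, the 2001 capture disappears along with everything else; with a start date on the exclusion, captures from before the takeover stay visible. Deciding who gets to set that date is issue (2), which no code settles.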
Specifically regarding the mindchild.net site you mention, it looks like the issue is that our current retroactive-exclude robots.txt-parser doesn't understand the 'Allow' directive. (The mindchild.net/robots.txt tries to enable ia_archiver/WaybackMachine access via an 'Allow'.) That too will be fixed in the new 'Wayback' deploy (if not sooner).
- Gordon @ IA
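For readers curious what honoring the 'Allow' directive looks like, here is a small sketch using Python's standard `urllib.robotparser`. The robots.txt text below is an assumed example in the spirit of the one described above, not the actual contents of mindchild.net/robots.txt, and this is not the Wayback Machine's own parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: let the archive's crawler in, keep everyone else out.
ROBOTS_TXT = """\
User-agent: ia_archiver
Allow: /

User-agent: *
Disallow: /
"""


def can_fetch(agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Parse the robots.txt text and ask whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


page = "http://www.mindchild.net/index.html"
print(can_fetch("ia_archiver", page))   # True: the Allow rule applies
print(can_fetch("SomeOtherBot", page))  # False: falls through to the * group
```

A parser that understands 'Allow' treats the first group as granting ia_archiver access, while other agents fall through to the catch-all group and are refused.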
The transition from Debian to Ubuntu was driven by developers' desire for more and newer features. We originally went with Debian-Stable because it was, well, stable, and did everything we needed the PetaBox to do at the time. But programmers whined and moaned that such-and-such package wasn't supported, or was too old, and claimed that this held back development of features which Brewster wanted to see made into reality.
Brewster was never much for stability anyway, so the transition was made. It bit us several times, as Ubuntu is not as stable as Debian-Stable (which is to be expected when releases happen more often and newer software is deployed without extensive testing), but the developers were a lot happier with it. And, to be fair, while some of the problems have been substantial (like kernel bugs which interacted with the forcedeth device drivers to make servers freeze ~10% of the time when power cycled), afaik it has not contributed directly to data lossage (which is the bottom line at an archive).
-- TTK
'Recall' wasn't exactly Google-like search. IIRC, in some respects it was better, with an advanced idea of related concepts, and with data on frequency of terms over time. In other respects, it was not what people would expect: there was no exact phrase matching, and certain terms that didn't become tracked concepts weren't findable at all, even though you could see the words in other indexed results.
Unfortunately, IA couldn't maintain the deployment when the developer, Anna Patterson, moved to Google. So, Recall turned out to be a short-lived experiment, grand in scale of pages indexed and novel features but not in traffic served.
Patterson did big things at Google and now has another search startup, Cuill, that's likely to do more good things for the web.
At the Internet Archive, we've also been using the open-source projects Nutch and Hadoop to offer search on smaller web collections for our partners. (A pair of such searchable partner collections for the US National Archives and Records Administration lives at webharvest.gov.) Someday we may be able to scale these up to the full 11+ year archive.
- Gordon @ IA