Inside the Internet Archives
blackbearnh writes "O'Reilly Media is running an interview with Gordon Mohr, Chief Technologist for the Internet Archive (archive.org). If you've ever wondered how pages are selected for archiving, or just how they manage such a huge quantity of data, the answers are here. The interview also touches on the problems of intellectual property in archives, archiving the Internet in a post Web 2.0 world, and the potential vulnerabilities exposed by archiving web sites that may include security exploits."
My God, it's full of ones and zeros!
do they archive SECOND POSTs?
The Interviewer: And I'm not sure I want to think about what posterity is going to think about a recording of my Twitter feed.
If Twitter becomes so mainstream so as to be more than a 'remember when?' to posterity I will kill myself.
I judt got a nre Kinesis keybiartf so please excusr ant egregiou typos.
and does archive.org record google's cache?
Those who can make you believe absurdities can make you commit atrocities. - Voltaire
I'm going to have to RTFA! I keep wondering why my old Quake site the Springfield Fragfest (abandoned years ago) is in the Wayback Machine, while "Kneel" Harriot's Yello There has seemingly disappeared from the entire internet.
The only page from Yello There I can find is one that was linked from my site (the aforementioned Quake site). That particular page was wonderfully recursive because of Yello's frame.
mcgrew's razor: Never attribute to stupidity that which can be explained by greedy self-interest
I keep running into bookmarks that have gone awol, then find that archive.org also doesn't have the pages anymore.
A combined bookmarking / caching service would be really handy.
MP3 Search Engine
is clearly because someone wanted the world's largest porn collection!
From TFA:
"there's a lot of porn on the internet, so there's a lot of porn that gets collected when you're archiving the whole internet"
While I love the wayback machine, a little "problem" crept in a couple of years ago that is still there... and it drives me nuts.
At one point, I forgot to renew my domain name and a squatter snatched it up the second it was available. I have since lost the html/java applets/images/etc that I had originally there. I used to show people what it looked like via the wayback machine. But you can't do it anymore. Example: http://web.archive.org/web/*/http://www.mindchild.net
Apparently, the current squatter put a robots.txt on that domain, and wayback refuses to show any ARCHIVED pages where the domain CURRENTLY has a robots.txt. I emailed them about it, and after a couple of months, I actually got a reply pretty much saying "That is just the way it is. We are underfunded and have no time to fix it. Sorry".
So if for some reason you don't want to have your site viewable via the wayback machine, just put up a robots.txt. It doesn't even need to contain anything.
"When life gives you lemons, don't make lemonade. Make life take the lemons back!" -- Cave Johnson
I had a cheesy site back in college where I played around with HTML and learning the basics. I ended up making a few pages that poked fun at friends.
I went to archive.org years later looking for them cause I remember back in the day they nabbed em and now they're all gone. The images and sounds I used were all gone.
I wanted to recreate a page from that archive for nostalgia reasons with my old friends. Can't do it and I can't find the files anymore in my local archives.
I was kinda disappointed but I guess it was expecting too much. I really wish there was a true and complete archive of the internet that didn't care what was there; it just had it.
~~ Behold the flying cow with a rail gun! ~~
Ew.
"16MB (fuck off, MiB fascists)" - The Mighty Buzzard
I was left with several questions that weren't addressed by the article.
The Slashdot summary says the article explains how pages are selected for archiving, but I couldn't find anything in the article that explicitly explained that. It does say that the actual crawler is run by Alexa, which hands off the data to them, but it didn't say what the criteria were. Alexa computes various stats about web sites, so presumably they could apply some kind of minimum cut. Or do they try to index every single lame personal page, unless the owner opts out? That seems like it would require an unreasonable amount of disk space. The web also has a lot of stuff like, e.g., the kind of spam sites that try to scam Google's search/ad system; I wonder if the archive records those.
The article didn't say a darn thing about funding. They have to run thousands of machines, so the electric bills must be formidable. Where the heck do they get their money? Is there a significant chance that their funding will dry up at some point in the future, and the whole archive will disappear?
The article states that they moved from plain Debian to Ubuntu. That surprised me, and I was curious why they'd do that. E.g., if you're shopping for webhosts, it's much more common for them to offer plain Debian than Ubuntu. I love Ubuntu as a desktop distro, but it surprises me that they'd see any big advantage in using Ubuntu for their application.
Find free books.
Check this out....it reads like a free software update blog :)
http://web.archive.org/web/19980113191222/http://slashdot.org/
AAAARGH!1!
I can't stand "post [xyz] world", "pre [xyz] mindset" or any such similar phrases. Go away, GO AWAY!!!!
Really, the archive is tasked with 'saving' the internet every so often. I'm sure they'll figure out how to save AJAX stuff. And if not, then that stuff isn't really meant to be saved, now is it? (I mean, we don't need a save of Gmail, since it's account based.)
Another non-functioning site was "uncertainty.microsoft.com."
The purpose of that site was not known.
I tried to find a Hogan's Heroes page that I did in 1996 so I could find whatever email I used so that I could make an attempt at getting back my original Slashdot uid. No luck for me.
I blame Internet Archive for this and ask twitter and his sock puppet army to start ranting about this horrible horrible travesty as well. The loss of my 5 digit uid is as bad as Gitmo and waterboarding combined!
riding round the world on an old motorcycle
Unfortunately, this "squatters-add-robots-restrictions" problem comes up a lot.
We'd like to address it, and to do so there are two major issues to be tackled: (1) our current Wayback Machine software only excludes sites on a "for all time" basis; (2) short of mechanistically trusting the current domain owner, determining who has the right to exclude or restore material could be a very labor-intensive, error-prone, and liability-compounding process.
The new open-source 'Wayback' software, which will go live for the Worldwide Wayback Machine later this year, enables time-range exclusions. (It's currently only used for many smaller collections we do for partners.) That should give us the capability to address (1). Addressing (2) will require further discussion about the proper and efficient policies -- but it's on our agenda once the technical capability for time-range exclusions is in place.
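To make the difference between "for all time" and time-range exclusions concrete, here is a minimal Python sketch. It is not the Wayback software's actual code or data model: the `Exclusion` class, the `is_viewable` function, the `example.org` domain, and the dates are all hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Sequence


@dataclass(frozen=True)
class Exclusion:
    """Hypothetical rule: hide captures of `domain` taken in [start, end)."""
    domain: str
    start: Optional[datetime] = None  # None means "from the beginning"
    end: Optional[datetime] = None    # None means "no end date"

    def covers(self, domain: str, captured: datetime) -> bool:
        if domain != self.domain:
            return False
        if self.start is not None and captured < self.start:
            return False
        if self.end is not None and captured >= self.end:
            return False
        return True


def is_viewable(domain: str, captured: datetime,
                rules: Sequence[Exclusion]) -> bool:
    """A capture is viewable unless some rule covers its capture time."""
    return not any(rule.covers(domain, captured) for rule in rules)


# "For all time": every capture of the domain is hidden.
all_time = [Exclusion("example.org")]
# Time range: only captures from the (assumed) squatter era are hidden.
ranged = [Exclusion("example.org", start=datetime(2006, 1, 1))]

old_capture = datetime(2003, 5, 1)
new_capture = datetime(2007, 5, 1)

print(is_viewable("example.org", old_capture, all_time))  # False
print(is_viewable("example.org", old_capture, ranged))    # True
print(is_viewable("example.org", new_capture, ranged))    # False
```

With a rule like `ranged`, the original owner's older captures stay visible while the later ones are hidden. Deciding who is entitled to set such a range is issue (2), which code alone does not solve.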
Specifically regarding the mindchild.net site you mention, it looks like the issue is that our current retroactive-exclude robots.txt-parser doesn't understand the 'Allow' directive. (The mindchild.net/robots.txt tries to enable ia_archiver/WaybackMachine access via an 'Allow'.) That too will be fixed in the new 'Wayback' deploy (if not sooner).
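For anyone curious what an 'Allow'-aware parser does with a file like that, here is a small sketch using Python's standard `urllib.robotparser`. It is not our parser. The robots.txt contents, the `may_fetch` helper, the `SomeOtherBot` agent, and the `example.org` URL are assumptions made up for the illustration, not the real mindchild.net file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: let ia_archiver in, keep every other robot out.
ROBOTS_TXT = """\
User-agent: ia_archiver
Allow: /

User-agent: *
Disallow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse robots.txt text and ask whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


URL = "http://www.example.org/index.html"
archiver_ok = may_fetch(ROBOTS_TXT, "ia_archiver", URL)
other_ok = may_fetch(ROBOTS_TXT, "SomeOtherBot", URL)

print(archiver_ok)  # True: the ia_archiver group has an Allow rule
print(other_ok)     # False: everyone else falls through to "Disallow: /"
```

A parser that skips 'Allow' lines can reach a different answer for `ia_archiver` than the site owner intended.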
- Gordon @ IA
This was very exciting! Seriously; you might remember the content of a page you were looking at five years ago, but can you remember its specific web address? Especially with the turnover and abandoning of domain names, it is entirely possible to simply lose contact with mountains of data.
So a basic search engine was a very exciting idea!
Too bad they killed "Recall" after only a few weeks. I never got a chance to try it (and boy, I would have made good use of it! There are still a few dozen items I'd love to find again). I somehow didn't expect Recall to be discussed in the interview, and I was right about that. Too bad.
Maybe Google should set up something similar; they don't trash old data, do they? I know they've got a setting which allows you to look for data up to a year old, but it's rather vague and it doesn't provide specific controls. How awesome would a non-linear search engine in an archive going back to the beginning of the web be?
I wonder what the deal with "Recall" was, and why nobody talks about it.
-FL
Normally I let typos go; people are generally forgiving and will read around them knowing that they are just as susceptible to making errors, but in the case of those typos which don't just create a spelling mistake, but actually switch the meaning of an entire sentence, I will sometimes haul myself to the task of writing a short retraction. Just like this one.
Cheers.
-FL
I thought he worked for Intel and was the person behind Mohr's Law. Has he changed his interests recently?
I think it is really weird that EVERY SINGLE news site on the Internet is mysteriously missing any captures from May 2001 to Sept 2001 (maybe one or two days in July are there).
And then all of a sudden on Sept 11, ALL the news sites have multiple captures per day.
I want to see what CNN, LA times, Washington Post, etc. had in the news on Sept 8th, 9th and 10th...
Shark Attacks in Florida. True Story. There was also Chandra Levy/Gary Condit stuff leading up to that.
Populus vult decipi, ergo decipiatur...
"Force shits upon Reason's back." - Poor Richard's Almanac
#1 Why aren't archived pages modified very slightly to insert a tag, so that archived images, sub-pages, and the like will be fetched from the archive, rather than linking to non-existent locations on the current server? Surely the current server operators don't like the dozens of hits from everyone who visits the archive...
But more than that, it's a PITA to visit an archived page, and manually copy and paste every single link, one at a time. And I'm sure most people don't realize that they can even do that, and just give up on finding the content they want...
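A rough Python sketch of the kind of link rewriting I mean. It assumes archived URLs follow the `http://web.archive.org/web/<timestamp>/<original URL>` pattern seen in Wayback links. The `to_wayback` and `rewrite_links` functions are hypothetical, and the regex is a crude stand-in for a real HTML parser, so treat this as an illustration and not as how the archive does or should implement it.

```python
import re
from urllib.parse import urljoin, urlsplit

WAYBACK_PREFIX = "http://web.archive.org/web/"


def to_wayback(link: str, page_url: str, timestamp: str) -> str:
    """Resolve link against the original page URL and point it at the archive."""
    absolute = urljoin(page_url, link)
    if urlsplit(absolute).scheme not in ("http", "https"):
        return link  # leave mailto:, javascript:, etc. untouched
    if absolute.startswith(WAYBACK_PREFIX):
        return absolute  # already an archive link
    return f"{WAYBACK_PREFIX}{timestamp}/{absolute}"


# Matches quoted href="..." / src="..." attributes only (a simplification).
_ATTR = re.compile(r'\b(href|src)=(["\'])(.*?)\2', re.IGNORECASE)


def rewrite_links(html: str, page_url: str, timestamp: str) -> str:
    """Rewrite every quoted href/src value in html to its archive URL."""
    def repl(match: "re.Match[str]") -> str:
        attr, quote, url = match.groups()
        return f"{attr}={quote}{to_wayback(url, page_url, timestamp)}{quote}"
    return _ATTR.sub(repl, html)


page = '<a href="/about.html">About</a><img src="images/logo.gif">'
rewritten = rewrite_links(page, "http://slashdot.org/", "19980113191222")
print(rewritten)
```

The relative links in `page` come out as full archive URLs, so the browser asks the archive for them and never touches the current server.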
#2 Why is the video encoding so incredibly horrible? Terrible ghosting on the low-res / low-bitrate versions, and a 30fps frame-rate like you've just stupidly deinterlaced the pulldown/telecine.
At 320x240, and 256kbps, those movies should look great, not HORRIBLE. That, combined with the fact that you almost never provide a full resolution (around 720x480) low-bitrate version, no doubt forces a large number of people to download the HUGE (4+ GB) interlaced MPEG-2 copy of a film, rather than the (completely screwed up) 250MB copy...
And don't complain about lack of skill, manpower, etc. I've e-mailed you, TWICE, personally volunteering to take care of everything, from getting the proper software compiled and installed (that was mentioned as a problem long ago) to writing an automatic conversion script to detect the encoding, and do the job without human intervention. I received no responses, either time.
#3 I'm on Verizon DSL in Southern CA. Why does my (dynamic) IP address seem to be blocked the majority of the time I try to access the archives, while it's fine on the rest of the site? Are you just having capacity problems these days?
Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant
Web 2.0 will be the end of the free internet as we know it. It will be fully corporate controlled. You won't have access to anything like we have now. That being the case, it's pointless to discuss what the internet archives will be like post web 2.0. Even if it does stay around it will be as heavily censored as your post web 2.0 internet.
Do some reading on the topic and do what you can to stop Web 2.0 now, while you can still express your (uncensored) opinions.
From the site's current robots.txt: -
User-agent: ia_archiver
Allow:
Now that's irony. (Actually, is that irony? I'm always a bit worried I might get it wrong, since the whole Alanis Morissette thing.)