Inside the Internet Archives

Posted by CmdrTaco on Wednesday June 18, 2008 @02:46AM from the those-who-ignore-history-are-doomed-to-eat-a-sandwich dept.

blackbearnh writes "O'Reilly Media is running an interview with Gordon Mohr, Chief Technologist for the Internet Archive (archive.org). If you've ever wondered how pages are selected for archiving, or just how they manage such a huge quantity of data, the answers are here. The interview also touches on the problems of intellectual property in archives, archiving the Internet in a post Web 2.0 world, and the potential vulnerabilities exposed by archiving web sites that may include security exploits."

7 of 85 comments

Wayback by TheRealMindChild · 2008-06-18 03:18 · Score: 4, Informative

While I love the wayback machine, a little "problem" crept in a couple of years ago that is still there... and it drives me nuts.

At one point, I forgot to renew my domain name and a squatter snatched it up the second it was available. I have since lost the html/java applets/images/etc that I had originally there. I used to show people what it looked like via the wayback machine. But you can't do it anymore. Example: http://web.archive.org/web/*/http://www.mindchild.net

Apparently, the current squatter put a robots.txt on that domain, and wayback refuses to show any ARCHIVED pages where the domain CURRENTLY has a robots.txt. I emailed them about it, and after a couple of months, I actually got a reply pretty much saying "That is just the way it is. We are underfunded and have no time to fix it. Sorry".

So if for some reason you don't want to have your site viewable via the wayback machine, just put up a robots.txt. It doesn't even need to contain anything.

--

"When life gives you lemons, don't make lemonade. Make life take the lemons back!" -- Cave Johnson
1. Re:Wayback by corsec67 · 2008-06-18 03:34 · Score: 2, Informative
  
  User-agent: ia_archiver
  Allow: /
  in the robots.txt that you mention (http://mindchild.net/robots.txt) hardly counts as not containing anything.
  
  But, it is interesting how they take the current robots.txt to apply to old content that used to be at that location...
  
  --
  If I have nothing to hide, don't search me
Re:I wished archive.org stored even more stuff by Alphax.au · 2008-06-18 04:02 · Score: 2, Informative

Combining a bookmarking / caching service would be really handy. WebMynd claims to do that; I haven't tried it myself though.
Squatters & robots.txt Re:Wayback by gojomo · 2008-06-18 05:55 · Score: 5, Informative

Unfortunately, this "squatters-add-robots-restrictions" problem comes up a lot.

We'd like to address it, and to do so there are two major issues to be tackled: (1) our current Wayback Machine software only excludes sites on a "for all time" basis; (2) short of mechanistically trusting the current domain owner, determining who has the right to exclude or restore material could be a very labor-intensive, error-prone, and liability-compounding process.

The new open-source 'Wayback' software, which will go live for the Worldwide Wayback Machine later this year, enables time-range exclusions. (It's currently only used for many smaller collections we do for partners.) That should give us the capability to address (1). Addressing (2) will require further discussion about the proper and efficient policies -- but it's on our agenda once the technical capability for time-range exclusions is in place.
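
A minimal sketch of what a time-range exclusion could look like, as opposed to a "for all time" one. This is not the actual Wayback code: the `Exclusion` class, the `is_viewable` function, and the example.org scenario are all hypothetical names and data chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Exclusion:
    """Hypothetical rule: hide captures of `host` taken within [start, end)."""
    host: str
    start: datetime
    end: datetime


def is_viewable(host: str, captured_at: datetime, rules: list[Exclusion]) -> bool:
    """Return False only if some rule covers both the host and the capture time."""
    return not any(
        r.host == host and r.start <= captured_at < r.end for r in rules
    )


# Assumed scenario: the domain changed hands at the start of 2006 and the
# new owner asked for exclusion, which now applies only from that date on.
rules = [Exclusion("example.org", datetime(2006, 1, 1), datetime.max)]

print(is_viewable("example.org", datetime(2003, 5, 1), rules))  # True
print(is_viewable("example.org", datetime(2007, 5, 1), rules))  # False
```

With a rule like this, captures from before the handover stay visible while the newer ones are hidden. Deciding who gets to create or lift such a rule is issue (2) and is not something code alone settles.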

Specifically regarding the mindchild.net site you mention, it looks like the issue is that our current retroactive-exclude robots.txt-parser doesn't understand the 'Allow' directive. (The mindchild.net/robots.txt tries to enable ia_archiver/WaybackMachine access via an 'Allow'.) That too will be fixed in the new 'Wayback' deploy (if not sooner).
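
For illustration, here is how a parser that does understand 'Allow' treats those directives, using Python's standard `urllib.robotparser`. This is only a sketch and not the Wayback parser. The first robots.txt body contains just the two lines quoted earlier in the thread (the full live file may contain more), the second is an assumed example, and "SomeOtherBot" is a made-up agent name.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Parse a robots.txt body and ask whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# The two directives quoted in the thread.
QUOTED = """\
User-agent: ia_archiver
Allow: /
"""

# Assumed variant: let the archiver in, keep every other crawler out.
ARCHIVER_ONLY = """\
User-agent: ia_archiver
Allow: /

User-agent: *
Disallow: /
"""

URL = "http://www.mindchild.net/"
print(allowed(QUOTED, "ia_archiver", URL))          # True
print(allowed(ARCHIVER_ONLY, "ia_archiver", URL))   # True
print(allowed(ARCHIVER_ONLY, "SomeOtherBot", URL))  # False
```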

- Gordon @ IA
Re:I wished archive.org stored even more stuff by TTK+Ciar · 2008-06-18 15:34 · Score: 2, Informative

New material is always being added to The Archive's web archive, and (afaik) unlike the collections archive it is never deliberately deleted. Most of what appears to be "pages going AWOL" is indexing errors. In order for newly archived stuff to become visible to the wayback machine interface, the entire web archive needs to be periodically re-indexed. Unfortunately the indexing process is error-prone, and stuff that was accessible before a re-index might disappear afterwards (and appear again after the next one).

Other things that can make data temporarily unavailable are:

* Downed servers. There are over a thousand on the web archive's end of the cluster, and if a server "only" crashes once a year, that's about three servers crashing per day on average, but it doesn't happen at a low constant rate. Traumatic events at the datacenter (like AC failures, power cycles, etc) tend to knock a bunch of hosts onto their asses at a time, and they don't always come back up when rebooted. The Archive runs at a deficit of system administration manpower, so it takes a long time (weeks, sometimes months) to get humpty-dumpty back together again.

* robots.txt. In order to avoid getting their asses sued off, The Archive uses the live site's robots.txt to control which pages are publicly viewable. Every time you hit up a wayback machine URL, it downloads the real site's robots.txt and parses it to see if the owners have rendered the desired content unviewable. So if a site owner changes their robots.txt, archived content that was viewable yesterday might not be viewable today. When a website is abandoned, there isn't a robots.txt to download anymore, so at least entirely "lost" sites are viewable by everyone.

* Genuinely lost data. When I worked at The Archive (a few months ago, now), most of the web archive was on "SOLO" nodes, meaning there was no on-site replication of the content. The data servers lack RAID-level redundancy as well, so if a SOLO loses a disk, and nobody copied the data off it first, and there isn't a copy tucked away on our sister sites (some in Amsterdam, some in Alexandria, Egypt), then the data is lost forever. To prevent this from happening, disks are tested hourly for a variety of symptoms (like nonzero sectors reallocated in SMART), and if a disk shows early signs of ill health, its contents get "shuffled" off onto other machines in the cluster, and the disk itself is replaced.
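
The "about three servers per day" figure in the downed-servers item is easy to check. A quick back-of-the-envelope sketch, taking exactly 1000 servers as an assumed lower bound for "over a thousand" and one crash per server per year:

```python
SERVERS = 1000                    # "over a thousand", so treat this as a lower bound
CRASHES_PER_SERVER_PER_YEAR = 1   # the "only once a year" case
DAYS_PER_YEAR = 365

# Average only; real crashes arrive in bursts, not at a steady rate.
crashes_per_day = SERVERS * CRASHES_PER_SERVER_PER_YEAR / DAYS_PER_YEAR
print(f"{crashes_per_day:.2f} crashes per day")  # 2.74 crashes per day
```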

But the system isn't perfect, not by a long shot, and lossage occurs. It's possible to do a lot better. Numerous people within the archive have tried to put better practices into place over the years, but for various reasons getting those practices into .. well, into practice has proven futile. Fortunately around the time I left, there was a push underway to get more of the web archive onto paired storage (so that all data was stored in duplicate on different physical machines). We can hope that moves forward.

The last time anyone tried seriously measuring the rate of lossage, iirc it ran into the 10-20 Mbits/sec range. That's not even a slow drip against the ~1.1PB in the web archive, but loss is loss.

Anyway, the *vast* majority of missing pages aren't really missing, they're just not indexed at the moment. The content itself is still tucked away in the cluster, and may resurface in the future.

-- TTK
Re:Searchable Archive. . . by gojomo · 2008-06-18 18:08 · Score: 3, Informative

'Recall' wasn't exactly Google-like search. IIRC, in some respects it was better, with an advanced idea of related concepts, and with data on frequency of terms over time. In other respects, it was not what people would expect: there was no exact phrase matching, and certain terms that didn't become tracked concepts weren't findable at all, even though you could see the words in other indexed results.

Unfortunately, IA couldn't maintain the deployment when the developer, Anna Patterson, moved to Google. So, Recall turned out to be a short-lived experiment, grand in scale of pages indexed and novel features but not in traffic served.

Patterson did big things at Google and now has another search startup, Cuill, that's likely to do more good things for the web.

At the Internet Archive, we've also been using the open-source projects Nutch and Hadoop to offer search on smaller web collections for our partners. (A pair of such searchable partner collections for the US National Archives and Records Administration lives at webharvest.gov.) Someday we may be able to scale these up to the full 11+ year archive.
- Gordon @ IA
Re:selection? funding? why not plain Debian? by gojomo · 2008-06-18 18:47 · Score: 2, Informative

I can't comment in more detail about Alexa's bulk crawl strategy because it is only documented to the public (and us at the Internet Archive) in general terms: it is a broad survey crawl of the public web, weighted by Alexa's internal measures of site/page importance and legitimacy (which are at least partially based on the same toolbar data that drives their site rankings). While we expect to continue receiving the Alexa donations indefinitely, a growing proportion of the public archive is likely to come from other sources, including the IA's own crawling and other outside donors, in the future.

The Archive is funded by a combination of private donations from individuals and foundations (sometimes for general operations and sometimes for specific projects), and fees for services provided to our partners, who are public libraries and archives themselves. With an 11+ year history, and long partnerships with customers and funding sources, we're pretty stable in the world of technology nonprofits.

I wasn't directly involved in the Ubuntu choice, but it's been nice to have our developer desktops in close sync with cluster servers.
- Gordon @ IA