How the Wayback Machine Works
tregoweth writes: "O'Reilly has an interview with Brewster Kahle about how The Internet Archive's Wayback Machine works, with lots of juicy details about how the biggest database ever built works."
Wayback slashdot.org goes back to 1997...
100 TBs do not make the biggest DB ever. I am personally working on a 60-70TB ERP system that's also writeable; I am sure there are bigger systems out there (e.g. Wal-Mart's or GM's ERP systems come to mind).
A read-only DB containing highly-compressible text does not really make for a very challenging datamine. Just because it's on and about the Web and sexier than a stodgy ERP system should not make you overlook the real technology.
A number of you have asked whether the websites taken down since 9/11 are available on archive.org. The answer is yes. One example is:
DC Air National Guard on Archive
Same Page - 404
One of the conspiracy websites that I have read said that combat airplanes at this base, normally on 24-hour alert, should have and could have prevented the plane from entering the restricted airspace in DC. They said that this site was removed because it provided evidence that somebody dropped the ball.
_______________________________
"I'm not Conceited...I'm just a realist..."
http://znet.net/~schester/facts/database_sizes.html
Apparently, walmart's is 24TB, and the entire www index as of 1999 was only 6TB.
Google, and others, also cache a lot of content. If a web provider doesn't want their stuff cached on the Wayback Machine, all they have to do is include the no-robots code in their HTML.
Send lawyers, guns, and money. Dad, get me out of this.
The biggest database ever? 100TB? Hardly.
I worked at a large pharmaceutical company for two years (known internally as the Squid), and supported a 380TB protein interaction database (Oracle) and a 260TB SAP-backend database (Informix + custom).
Certainly Wayback's database is large, and certainly it holds far more varied information and appeals to a far larger audience, but by no means is it the biggest. I'm sure there are databases that made the ones I worked on look puny by comparison.
You're right, the Wayback machine is not the largest collection of data -- not even the largest collection online. I work with the USGS's catalog of satellite data. They have over 300 terabytes of satellite imagery, and the collection is growing at a rate of about 1 terabyte per day.
The USGS collection comprises multiple instruments, but Landsat 7 is a big one, contributing about 100 terabytes that's searchable online.
Perhaps 'Largest TEXT Database' would be a better description of the Wayback Machine?
Genocide Man -- Life is funny. Death is funnier. Mass murder can be hilarious.
They use different OS's for different purposes within the system. The so-called OS they wrote is described in the article. It's a collection of tools for controlling their parallel computer, which is a collection of many inexpensive computers running the BSD and Linux OS's you talk about.
The interviewer is the one who describes it as an OS. The interviewee explains that the real breakthrough is that with their tools an ordinary programmer can operate in a parallel computing environment -- you don't need a specialist in parallel computing anymore. Which leads to the conclusion that relatively small institutions on relatively small budgets can build enormously powerful computers with massive storage.
Send lawyers, guns, and money. Dad, get me out of this.
But Brewster answers your question in the interview himself on the second page:
Koman: What about the question of rights? I just wrote about Lawrence Lessig's book on intellectual property. Surely the publishers and the television networks and the record companies aren't willing to let you keep a copy of all of their stuff?
Kahle: All we collect for the Web archive are sites that are publicly accessible for free, and if there's any indication from the site owner that they don't want it in the archive, we take it out. If there's a robot exclusion, it's removed from the Wayback Machine. Over the years, people would notice these things in their logs and would say, what are you doing? And we'd explain what we're doing -- building this archive and donating a copy to the Library of Congress, etc., etc., and 90% of the time they say, "Oh, that's cool, you're crazy, but go ahead." About 10% of the time, they'd say, "I don't want any part of it," and we instruct them on how to use a robot exclusion and they're taken out of history. That seems to work for everybody at this point. People are really excited about this future that we're building together.
Certified Black Helicopter Pilot *** Unwitting Dupe of One World Gov'ment
Important to note that they will allow you to "opt out" by using a robots.txt file (not sure what you do if the domain is no longer available).
Funny part is, they may not have to allow this, except out of courtesy. Apparently libraries such as this can get away with all kinds of stuff that, if done by private individuals with any kind of profit motive, would normally constitute serious copyright violations. (see http://www.loc.gov/copyright/circs/circ21.pdf for information).
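For anyone wondering what the robots.txt opt-out looks like in practice, here is a rough sketch using Python's standard urllib.robotparser. The user-agent name "ia_archiver", the example.com URLs, and the helper is_allowed are assumptions for illustration only; check the archive's own documentation for the agent name it honors today.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shuts out the archive crawler entirely
# (agent name "ia_archiver" is an assumption) and keeps every other
# robot out of /private/ only.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ia_archiver", "SomeOtherBot"):
        for url in ("http://example.com/index.html",
                    "http://example.com/private/notes.html"):
            print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

This only shows how a well-behaved crawler would read the file. Whether already-archived pages get pulled is up to the archive, as Kahle describes in the interview.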
I do not have a signature
"Ma'am, did you realize that Chevrolet has an important plan for your life?"
"Whatever happened to fair use?"
-- Duff-Man