How the Wayback Machine Works

Posted by ryuzaki0 on Wednesday January 23, 2002 @01:13AM from the very-big-hard-drive dept.

tregoweth writes: "O'Reilly has an interview with Brewster Kahle about how The Internet Archive's Wayback Machine works, with lots of juicy details about how the biggest database ever built works."

Re:Not very way back! by tom.allender · 2002-01-23 01:37 · Score: 3, Informative

Wayback Slashdot ...only goes back to 2000?

Wayback slashdot.org goes back to 1997...
Not the biggest DB by costas · 2002-01-23 01:44 · Score: 5, Informative

100 TBs do not make the biggest DB ever. I am personally working on a 60-70TB ERP system that's also writeable; I am sure there are bigger systems out there (e.g. Wal-Mart's or GM's ERP systems come to mind).

A read-only DB containing highly-compressible text does not really make for a very challenging datamine. Just because it's on and about the Web and sexier than a stodgy ERP system should not make you overlook the real technology.
1. Re:Not the biggest DB by limber · 2002-01-23 02:44 · Score: 2, Informative
  
  I am sure there are bigger systems out there (e.g. Wal-Mart's or GM's ERP systems come to mind)
  
  Just to nitpick, in the interview Mr. Kahle does explicitly mention that the database is in fact bigger than Walmart's. No mention is made of GM's, however.
  
  "It's larger than Walmart's, American Express', the IRS. It's the largest database ever built," he says. Whether the claim is credible is a different matter.
Government Removed Site still Available by Tazzy531 · 2002-01-23 02:14 · Score: 4, Informative

A number of you have asked whether the websites taken down since 9/11 are available on archive.org. The answer is yes. One example is:

DC Air National Guard on Archive

Same Page - 404

One of the conspiracy websites that I have read was saying that combat airplanes, normally on 24 hour alert, at this base should have and could have prevented the plane from entering the restricted airspace in DC. They were saying that this site was removed because it provided evidence that somebody dropped the ball.

--

_______________________________
"I'm not Conceited...I'm just a realist..."
Link to various database sizes by rkgmd · 2002-01-23 02:16 · Score: 3, Informative

http://znet.net/~schester/facts/database_sizes.html Apparently, Wal-Mart's is 24TB, and the entire WWW index as of 1999 was only 6TB.
Re:Copyright infringement by madfgurtbn · 2002-01-23 02:22 · Score: 2, Informative

Google, and others, also cache a lot of content. If a web provider doesn't want their stuff cached on The WayBack, all they have to do is include the no bots code in their html.
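For anyone wondering what that looks like in practice: a minimal sketch, assuming the "no bots code" above means the standard robots meta tag in the page's head. The helper names (`RobotsMetaParser`, `robots_directives`) are made up for illustration, and whether a particular crawler honors a given directive is up to that crawler.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so default to "".
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for part in attr.get("content", "").split(","):
                directive = part.strip().lower()
                if directive:
                    self.directives.add(directive)


def robots_directives(html):
    """Return the set of robots meta directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical page that asks robots not to index or archive it.
page = (
    '<html><head><meta name="robots" content="noindex, noarchive">'
    "</head><body>hi</body></html>"
)
print(sorted(robots_directives(page)))
```

A crawler that respects the tag would check the returned set before storing the page.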

--
Send lawyers, guns, and money. Dad, get me out of this.
Hardly the biggest. by dmd · 2002-01-23 02:39 · Score: 2, Informative

The biggest database ever? 100TB? Hardly.

I worked at a large pharmaceutical company for two years (known internally as the Squid), and supported a 380TB protein interaction database (Oracle) and a 260TB SAP-backend database (Informix + custom).

Certainly Wayback's database is large, and certainly it holds far more varied information and appeals to a far larger audience, but by no means is it the biggest. I'm sure there are databases that made the ones I worked on look puny by comparison.
Talk to the US government by Remus+Shepherd · 2002-01-23 03:21 · Score: 3, Informative

You're right, the Wayback machine is not the largest collection of data -- not even the largest collection online. I work with the USGS's catalog of satellite data. They have over 300 terabytes of satellite imagery, and the collection is growing at a rate of about 1 terabyte per day.
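To put those numbers in perspective, here is a back-of-the-envelope sketch. It assumes the growth stays flat at about 1 terabyte per day, which is a simplification of mine and not something the USGS figures promise; the function names are hypothetical.

```python
import math


def projected_size_tb(current_tb, growth_tb_per_day, days):
    """Collection size after `days`, assuming constant linear growth."""
    return current_tb + growth_tb_per_day * days


def days_to_reach(target_tb, current_tb, growth_tb_per_day):
    """Whole days until the collection reaches `target_tb` (0 if already there)."""
    remaining = max(0.0, target_tb - current_tb)
    return math.ceil(remaining / growth_tb_per_day)


# Figures from the comment above: roughly 300 TB today, about 1 TB/day.
print(projected_size_tb(300, 1, 365))  # 665 TB after one year
print(days_to_reach(600, 300, 1))      # 300 days to double
```

At that rate the collection would double in under a year.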

The USGS collection comprises multiple instruments, but Landsat 7 is a big one, contributing about 100 terabytes that's searchable online.

Perhaps 'Largest TEXT Database' would be a better description of the Wayback Machine?

--
Genocide Man -- Life is funny. Death is funnier. Mass murder can be hilarious.
Re:Operating system by madfgurtbn · 2002-01-23 03:26 · Score: 2, Informative

They use different OS's for different purposes within the system. The so-called OS they wrote is described in the article. It's a collection of tools for controlling their parallel computer, which is a collection of many inexpensive computers running the BSD and Linux OS's you talk about.

The interviewer is the one who describes it as an OS. The interviewee explains that the real breakthrough is that with their tools an ordinary programmer can operate in a parallel computing environment; you don't need a specialist in parallel computing anymore. Which leads to the conclusion that relatively small institutions on relatively small budgets can build enormously powerful computers with massive storage.
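A rough illustration of the idea, not the Archive's actual tools: the programmer writes an ordinary per-shard function, and a small helper fans it out and adds up the results. In this sketch a thread pool on one machine stands in for the many inexpensive computers, and the names (`count_matches`, `parallel_count`) and sample data are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def count_matches(shard, needle):
    """Ordinary serial code: count occurrences of `needle` in one shard."""
    return sum(doc.lower().count(needle) for doc in shard)


def parallel_count(shards, needle, workers=4):
    """Run count_matches on every shard concurrently and sum the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda shard: count_matches(shard, needle), shards)
        return sum(partials)


# Each inner list stands in for the documents held by one machine.
shards = [
    ["The Wayback Machine", "wayback when"],
    ["nothing here"],
    ["Wayback, wayback"],
]
print(parallel_count(shards, "wayback"))  # 4
```

The person writing `count_matches` never has to think about the fan-out, which is the point being made in the interview.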

--
Send lawyers, guns, and money. Dad, get me out of this.
Re:Copyright infringement by pjones · 2002-01-23 04:07 · Score: 3, Informative

Child! Child! They do not sue you right away -- and they can't. First they send you a cease-and-desist order and you evaluate their claim.

But Brewster answers your question in the interview himself on the second page:

Koman: What about the question of rights? I just wrote about Lawrence Lessig's book on intellectual property. Surely the publishers and the television networks and the record companies aren't willing to let you keep a copy of all of their stuff?

Kahle: All we collect for the Web archive are sites that are publicly accessible for free, and if there's any indication from the site owner that they don't want it in the archive, we take it out. If there's a robot exclusion, it's removed from the Wayback Machine. Over the years, people would notice these things in their logs and would say, what are you doing? And we'd explain what we're doing -- building this archive and donating a copy to the Library of Congress, etc., etc., and 90% of the time they say, "Oh, that's cool, you're crazy, but go ahead." About 10% of the time, they'd say, "I don't want any part of it," and we instruct them on how to use a robot exclusion and they're taken out of history. That seems to work for everybody at this point. People are really excited about this future that we're building together.

--
Certified Black Helicopter Pilot *** Unwitting Dupe of One World Gov'ment
Re:Noooooooooo !!! by ichimunki · 2002-01-23 04:10 · Score: 2, Informative

Important to note that they will allow you to "opt out" by using a robots.txt file (not sure what you do if the domain is no longer available).
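A minimal sketch of what such an opt-out might look like and how a well-behaved crawler would read it, using Python's standard `urllib.robotparser`. The user-agent name `ia_archiver`, the example.com URLs, and the `allowed` helper are assumptions for illustration; none of them come from the interview.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out one crawler entirely,
# and keep everyone else out of /private/ only.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed(robots_txt, agent, url):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


print(allowed(ROBOTS_TXT, "ia_archiver", "http://example.com/index.html"))   # False
print(allowed(ROBOTS_TXT, "SomeOtherBot", "http://example.com/index.html"))  # True
print(allowed(ROBOTS_TXT, "SomeOtherBot", "http://example.com/private/x"))   # False
```

The file has to be served from the site's root, which is why a lapsed domain is a problem: nobody is left to publish it.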

Funny part is, they may not have to allow this, except out of courtesy. Apparently libraries such as this can get away with all kinds of stuff that, if done by private individuals with any kind of profit motive, would normally constitute serious copyright violations. (see http://www.loc.gov/copyright/circs/circ21.pdf for information).

--
I do not have a signature
Their movie archive has "Hired!" by for(;;); · 2002-01-23 07:00 · Score: 3, Informative

Hot damn! Their movie archive has a downloadable version of the short they showed on MST3K prior to "'Manos:' The Hands of Fate."

"Ma'am, did you realize that Chevrolet has an important plan for your life?"

--

"Whatever happened to fair use?"
-- Duff-Man