How the Wayback Machine Works

Posted by ryuzaki0 on Wednesday January 23, 2002 @01:13AM from the very-big-hard-drive dept.

tregoweth writes: "O'Reilly has an interview with Brewster Kahle about how The Internet Archive's Wayback Machine works, with lots of juicy details about how the biggest database ever built works."

9 of 134 comments

Google? by kenneth_martens · 2002-01-23 01:25 · Score: 4, Interesting

It's an interesting idea, but the real problem is not storing the 100 TB of data, it's figuring out how to search through it to find what you're looking for. Now, apparently they write a lot of their own software, but it might be better if they could team up with Google and have Google index their sites on a special database. We'd have www.google.com for regular searches, and wayback.google.com for the Wayback Machine's sites.

Something else I found interesting: according to the article, they "use as much open source software as [they] can." That makes sense when they've got between 300 and 400 computers, and with the number growing all the time. Licensing all those with a non-open OS would be quite expensive.
Try this instead.. by CptnHarlock · 2002-01-23 01:37 · Score: 4, Interesting

http://web.archive.org/web/*/http://slashdot.org
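
For anyone who wants to try other sites, the link above follows a simple pattern: the archive prefix, a `*` wildcard, then the original address. Here is a minimal sketch of that pattern in Python. The function name `wayback_url` and the optional `timestamp` parameter are hypothetical names made up for illustration, and using anything other than `*` in that slot is an assumption that is not shown in the link above.

```python
# Prefix copied from the link above; everything else here is illustrative.
WAYBACK_PREFIX = "http://web.archive.org/web"


def wayback_url(target: str, timestamp: str = "*") -> str:
    """Build a Wayback Machine address for `target`.

    `timestamp` defaults to "*", the wildcard used in the link above.
    Passing any other value is an assumption, not something the
    link above demonstrates.
    """
    return f"{WAYBACK_PREFIX}/{timestamp}/{target}"


if __name__ == "__main__":
    # Rebuilds the exact link posted above.
    print(wayback_url("http://slashdot.org"))
```

Calling `wayback_url("http://slashdot.org")` rebuilds the link posted above; swap in any other address to look up a different site.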

--
$HOME is where the .*shrc is
-- silver_p
They haven't got http://web.archive.org/ by Rentar · 2002-01-23 01:43 · Score: 5, Funny

They don't seem to think the history of their site would be interesting: http://web.archive.org/web/*/http://web.archive.org/ redirects you to their index.html! Boring!

Now, that would really be a test for their apps. Same as if Google indexed www.google.com (entirely).
Not the biggest DB by costas · 2002-01-23 01:44 · Score: 5, Informative

100 TB does not make the biggest DB ever. I am personally working on a 60-70 TB ERP system that's also writable; I am sure there are bigger systems out there (e.g. Wal-Mart's or GM's ERP systems come to mind).

A read-only DB containing highly-compressible text does not really make for a very challenging datamine. Just because it's on and about the Web and sexier than a stodgy ERP system should not make you overlook the real technology.
Pretty amazing ... by CDWert · 2002-01-23 01:53 · Score: 4, Funny

I'd say it's pretty amazing; I was actually able to retrieve content I thought lost years ago.

My sites go back to '95, and yep, they're archived starting in '96. This is too cool.

I wonder how many of the government docs that were pulled after Sept. 11 are still on this?

A really funny note is that it seems like all the p0rn is intact starting in '96. Gotta archive the porn.

But seriously, I was unaware of this. I'm gonna use this thing like hell as a sales tool if nothing else. It's also great for finding certain content that's been pulled.

--
Sig went tro...aahemmm.....fishing........
Noooooooooo !!! by morzel · 2002-01-23 01:53 · Score: 5, Funny

Please please please please do _NOT_ google it... It was embarrassing enough when Google acquired Dejanews and put the old Usenet archives online. :-)
I just visited some sites that I had hoped had disappeared completely from cyberspace. The only defense I've got now is the old cryptic URLs of these monstrosities... Indexing that database would be a disaster, especially with an unusual name like mine...
(Yes, I was stupid enough to use my real name ;-)
Damn you, wayback :p

--
Okay... I'll do the stupid things first, then you shy people follow.
[Zappa]
Government Removed Site still Available by Tazzy531 · 2002-01-23 02:14 · Score: 4, Informative

A number of you have asked whether the websites taken down since 9/11 are available on archive.org. The answer is yes. One example is:

DC Air National Guard on Archive

Same Page - 404

One of the conspiracy websites I have read claimed that the combat aircraft at this base, normally on 24-hour alert, could and should have prevented the plane from entering the restricted airspace over DC. It claimed that this site was removed because it provided evidence that somebody dropped the ball.

--

_______________________________
"I'm not Conceited...I'm just a realist..."
Biggest ever? I don't think so! by Proud+Geek · 2002-01-23 02:50 · Score: 4, Funny

I once worked on a site with a 25-year-old database that was much larger.

The ancient magnetic storage took up several warehouses. Beat that, for biggest database ever!

--
Even Slashdot wants to hide some things
200 transactions/second? by selan · 2002-01-23 03:37 · Score: 4, Insightful

Having so few transactions for a database of this size probably helps them run without needing large expensive machines. Many VLDBs support thousands of transactions per second. I found a list here of top ten winners of a very large database scalability contest. The winner for peak performance was something like 20,000+ TPS.