How the Wayback Machine Works

Posted by ryuzaki0 on Wednesday January 23, 2002 @01:13AM from the very-big-hard-drive dept.

tregoweth writes: "O'Reilly has an interview with Brewster Kahle about how The Internet Archive's Wayback Machine works, with lots of juicy details about how the biggest database ever built works."

12 of 134 comments

Google? by kenneth_martens · 2002-01-23 01:25 · Score: 4, Interesting

It's an interesting idea, but the real problem is not storing the 100 TB of data; it's figuring out how to search through it to find what you're looking for. Now, apparently they write a lot of their own software, but it might be better if they could team up with Google and have Google index their sites in a special database. We'd have www.google.com for regular searches, and wayback.google.com for the Wayback Machine's sites.

Something else I found interesting: according to the article, they "use as much open source software as [they] can." That makes sense when they've got between 300 and 400 computers and the number is growing all the time. Licensing all those with a non-open OS would be quite expensive.
Try this instead.. by CptnHarlock · 2002-01-23 01:37 · Score: 4, Interesting

http://web.archive.org/web/*/http://slashdot.org
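For anyone who wants the same kind of link for a different site, here is a rough sketch. It assumes nothing beyond the layout visible in the URL above (the archive prefix, then `*` or a timestamp, then the original address). The function name `wayback_url` is made up for illustration, and the code only builds the string; it does not contact the archive.

```python
# Prefix taken from the link in this comment.
WAYBACK_PREFIX = "http://web.archive.org/web"


def wayback_url(original_url, timestamp="*"):
    """Build a Wayback Machine link for original_url.

    timestamp="*" asks for the listing of all captures, as in the link above.
    Passing a specific capture time (digits such as "20020123") is an
    assumption about the URL scheme, not something stated in this thread.
    """
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


print(wayback_url("http://slashdot.org"))
```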

--
$HOME is where the .*shrc is
-- silver_p
Just Network Programming? by shic · 2002-01-23 01:47 · Score: 2, Interesting

So... would I be right in thinking that the "Wayback Machine" is an (admittedly large-scale) exercise in network database programming (of the style popular before Codd and his relational model)? I am tempted to question whether this is indeed the biggest database ever built. I suppose it depends upon definitions, but to my mind a database should be general purpose, whereas it appears to me that this project is basically a large-scale single index.

I also wonder if it would be appropriate to call this the largest project of its kind. For example, while Google stores less data, I suspect it supports a higher query rate. How exactly do you intend to measure scale? If it is in terms of computing power, is it relevant that Google already have thousands of Linux server nodes?

That said - I think it is an exciting project in its own right. I hope and expect this offering to become a significant information resource in years to come.
The Cost of a Terabyte by wayn3 · 2002-01-23 02:01 · Score: 3, Interesting

"You buy from EMC a terabyte for maybe $300,000. That's just the storage for 1 TB. We can buy 100 TBs with 250 CPUs to work on it, all on a high-speed switch with redundancy built in."
Interesting quote. Mr. Kahle addresses something I've been wondering for a while -- are storage area networks really worth it? Or is he ignoring the costs of maintenance and manpower to keep these things afloat?
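To put a number on the quoted price, here is a back-of-the-envelope sketch. The only inputs are the two figures in the quote ($300,000 per terabyte and 100 TB); the quote gives no price for the 250-CPU cluster, so this covers only one side of the comparison, and the function name is just for illustration.

```python
# Figures taken from the quote above.
EMC_PRICE_PER_TB = 300_000  # dollars per terabyte ("maybe $300,000")
ARCHIVE_SIZE_TB = 100       # terabytes


def storage_cost(size_tb, price_per_tb=EMC_PRICE_PER_TB):
    """Cost in dollars of size_tb terabytes at a flat per-terabyte price."""
    return size_tb * price_per_tb


# 100 TB at the quoted per-terabyte price.
print(f"${storage_cost(ARCHIVE_SIZE_TB):,}")  # → $30,000,000
```

Maintenance and staffing, the costs this comment asks about, are not included on either side.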
DBMS and model? by leandrod · 2002-01-23 02:44 · Score: 3, Interesting

But what is the DBMS? Is the database relational? How was it modelled?

--
Leandro Guimarães Faria Corcete DUTRA
DA, DBA, SysAdmin, Data Modeller
GNU Project, Debian GNU/Lin
Interesting thought process by cheese_wallet · 2002-01-23 02:57 · Score: 3, Interesting

Pretty decent read, but one thing they said got me thinking a little bit.

They said that at Thinking Machines they built a super fast computer, but it required a new way of thinking about things in order to program it. And then they called this a mistake, because they couldn't attract any customers.

This seems like a real problem that would lead to technological stagnation, at least from a marketplace point of view.

It is kind of similar to a company making games off of pre-existing engines, like Quake, instead of some new non-Quake-compatible engine.

Or everybody making x86-compatible CPUs.

It also seems that when a company does come up with some new way of doing things, it gets burned, and it is the second generation of companies that picks up the torch and makes the money. So nobody wants to be that first company; they are all waiting for someone else to break the ground.

Maybe the only people/companies that come up with new stuff are the ones that are insanely rich, and won't get hurt by doing something new, or the insanely poor who have nothing to lose anyway.

I can't help thinking that this clustering boom going on is just like what 3dfx was trying to do. The difference right now is that clustering actually *does* outperform the super fast single chip. I wonder when technological advances will change this fact.
Re:Quite a lofty goal... by limber · 2002-01-23 03:09 · Score: 2, Interesting

Kahle's idea is actually quite reminiscent of Vannevar Bush's seminal 1945 description in The Atlantic Monthly of the memex, a device that would "give man access to and command over the inherited knowledge of the ages".

The frequency with which this article (the Bush article, that is) has been cited in hypertext research attests to its importance.
Re:Not the biggest DB by limber · 2002-01-23 03:29 · Score: 3, Interesting

As a side example to this discussion of 'what constitutes a large database', NOAA's National Climatic Data Center maintains a database of about a petabyte of digital climatological data. The Center takes in about a quarter of a terabyte of data *daily*.
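For a sense of scale, here is a quick sketch of what that intake rate means. It uses only the two figures above and assumes decimal units (1 PB = 1000 TB) and a constant daily rate, which is a simplification and not something the Center states.

```python
TB_PER_PB = 1000        # assumption: decimal units
DAILY_INTAKE_TB = 0.25  # "about a quarter of a terabyte" per day
DATABASE_SIZE_PB = 1    # "about a petabyte"


def days_to_accumulate(size_pb, intake_tb_per_day):
    """Days needed to take in size_pb petabytes at a constant daily rate."""
    return size_pb * TB_PER_PB / intake_tb_per_day


days = days_to_accumulate(DATABASE_SIZE_PB, DAILY_INTAKE_TB)
print(f"{days:.0f} days, roughly {days / 365:.0f} years")  # → 4000 days, roughly 11 years
print(f"{DAILY_INTAKE_TB * 365:.2f} TB per year")          # → 91.25 TB per year
```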
Wisdom in his words.. by grub · 2002-01-23 03:32 · Score: 3, Interesting

From the article:
How the archive works is just with stacks and stacks of computers running Solaris on x86, FreeBSD, and Linux, all of which have serious flaws, so we need to use different operating systems for different functions.

The man puts bias aside and uses various OSs in areas in which each performs well. A real, tangible project like this is worth more than any amount of drooling zealotry.

--
Trolling is a art,
my favorite wayback slashdot story so far... by CrudPuppy · 2002-01-23 04:38 · Score: 2, Interesting

"So IBM announces a 25 gig hard drive... does the world need this yet? Unless this is in a RAID, would you really want to trust 25 gigs on a single drive? What would you use this for? 400+ hours of MP3s comes to mind... "

mind you, this was only a couple short years ago, and now I'm writing this from a PC with three 80 giggers.

i thought we geeks were supposed to have more foresight than this? *grin*
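The "400+ hours" figure in that old story is easy to sanity-check. A small sketch, assuming decimal gigabytes (1 GB = 10^9 bytes) and a constant 128 kbit/s MP3 bitrate; the bitrate is my assumption, since the quoted story does not name one.

```python
BYTES_PER_GB = 10**9  # assumption: decimal gigabytes
BITS_PER_BYTE = 8
SECONDS_PER_HOUR = 3600


def mp3_hours(drive_gb, bitrate_kbps):
    """Hours of constant-bitrate audio that fit in drive_gb gigabytes."""
    total_bits = drive_gb * BYTES_PER_GB * BITS_PER_BYTE
    seconds = total_bits / (bitrate_kbps * 1000)
    return seconds / SECONDS_PER_HOUR


print(round(mp3_hours(25, 128)))  # → 434
```

At that bitrate the 25 gig drive holds about 434 hours, which matches the quoted "400+".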

--
A year spent in artificial intelligence is enough to make one believe in God.
Re:Ok, nice, but what data do we throw away? by delta0 · 2002-01-23 05:30 · Score: 2, Interesting

: In the past, media degeneration and obsolescence
: over time have made the decisions for us. But
: going forward we will have massive distributed,
: redundant data stores, with geographically
: remote backups. The data isn't going to go away
: unless we tell it to.

GOOD! And there is something *wrong* with this??
Seriously -- the Internet should have been designed like this from the start. Don't ever throw away, simply classify and organize. In a throwaway world, we already destroy too much to afford losing what's on the Internet.

Granted, there is lots of junk and what some might consider noise, but... (as the saying goes) some people's junk is others' treasure.

--
--- Delta0.. makes no difference.
the misanthropic bitch by joshuaos · 2002-01-23 06:07 · Score: 3, Interesting

I spent the summer on the road, and when I settled down for the cold months, I was quite sad to see that the Misanthropic Bitch appears to have vanished. Today, when I read this article, I was delighted to find that all of dear bitch's articles are archived.
I think this is a fabulous project, and I hope it does well. However, I think that the notion of such a centralized database will begin to become unrealistic. I think peer to peer projects are the future, and I can see a day far in the future when the database layer comes down and inhabits the filesystem layer and all the databases on the internet can talk to each other, and in a sense, the net becomes a giant database that anyone can contribute to.
Cheers, Joshua

--
When in danger or in doubt, run in circles, scream and shout!