How the Wayback Machine Works
tregoweth writes: "O'Reilly has an interview with Brewster Kahle about how The Internet Archive's Wayback Machine works, with lots of juicy details about how the biggest database ever built works."
It's an interesting idea, but the real problem is not storing the 100 TB of data, it's figuring out how to search through it to find what you're looking for. Now, apparently they write a lot of their own software, but it might be better if they could team up with Google and have Google index their sites on a special database. We'd have www.google.com for regular searches, and wayback.google.com for the Wayback Machine's sites.
Something else I found interesting: according to the article, they "use as much open source software as [they] can." That makes sense when they've got between 300 and 400 computers, and with the number growing all the time. Licensing all those with a non-open OS would be quite expensive.
http://web.archive.org/web/*/http://slashdot.org
$HOME is where the
-- silver_p
They don't seem to think the history of their site would be interesting: http://web.archive.org/web/*/http://web.archive.org/ redirects you to their index.html! Boring!
Now, that would really be a test for their apps. Same as if Google indexed www.google.com (entirely).
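For anyone who wants to try more sites: the wildcard links quoted in this thread all share one visible shape, the archive prefix, then `/*/`, then the original address. Here is a minimal sketch that builds such a link. The function name and the default-to-`http://` behaviour are my own assumptions for illustration, not anything documented by the Internet Archive.

```python
from urllib.parse import urlsplit

# Prefix taken from the links quoted in this thread.
WAYBACK_PREFIX = "http://web.archive.org/web"


def wayback_listing_url(target: str) -> str:
    """Build the wildcard link that lists archived copies of `target`.

    Hypothetical helper: it only mirrors the URL shape seen in the
    thread and never contacts the archive.
    """
    # Assumption: a bare host name such as "slashdot.org" means plain HTTP.
    if not urlsplit(target).scheme:
        target = "http://" + target
    return f"{WAYBACK_PREFIX}/*/{target}"


print(wayback_listing_url("http://slashdot.org"))
print(wayback_listing_url("slashdot.org"))
```

Both calls print the same link as the Slashdot one posted above. Whether a given page was actually captured is something only the archive itself can tell you.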
100 TB does not make the biggest DB ever. I am personally working on a 60-70 TB ERP system that's also writeable; I am sure there are bigger systems out there (e.g. Wal-Mart's or GM's ERP systems come to mind).
A read-only DB containing highly-compressible text does not really make for a very challenging datamine. Just because it's on and about the Web and sexier than a stodgy ERP system should not make you overlook the real technology.
I'd say it's pretty amazing; I was actually able to retrieve content I thought was lost years ago.
My sites go back to 95, and yep, they're archived starting in 96. This is too cool.
I wonder how many of the government docs that were pulled post Sept 11 are still on this?
A really funny note is that all the p0rn seems to be intact starting in 96. Gotta archive the porn.
But seriously, I was unaware of this. I'm gonna use this thing like hell as a sales tool if nothing else. It's also great for finding certain content that's been pulled.
Sig went tro...aahemmm.....fishing........
I just visited some sites that I had hoped had disappeared completely from cyberspace. The only defense I've got now is the old cryptic URLs of these monstrosities... Indexing that database would be a disaster, especially with an unusual name like mine...
(Yes, I was stupid enough to use my real name)
Damn you, wayback
Okay... I'll do the stupid things first, then you shy people follow.
[Zappa]
A number of you have asked whether the websites taken down since 9/11 are available on archive.org. The answer is yes. One example is:
DC Air National Guard on Archive
Same Page - 404
One of the conspiracy websites I have read claimed that combat aircraft at this base, normally on 24-hour alert, should have and could have prevented the plane from entering the restricted airspace over DC. It claimed that the base's site was removed because it provided evidence that somebody dropped the ball.
_______________________________
"I'm not Conceited...I'm just a realist..."
I once worked on a site with a 25-year-old database that was much larger.
The ancient magnetic storage took up several warehouses. Beat that, for biggest database ever!
Even Slashdot wants to hide some things
Having so few transactions for a database of this size probably helps them run without needing large, expensive machines. Many VLDBs support thousands of transactions per second. I found a list here of the top ten winners of a very large database scalability contest. The winner for peak performance was something like 20,000+ TPS.