How the Wayback Machine Works
tregoweth writes: "O'Reilly has an interview with Brewster Kahle about how The Internet Archive's Wayback Machine works, with lots of juicy details about how the biggest database ever built works."
It's an interesting idea, but the real problem is not storing the 100 TB of data, it's figuring out how to search through it to find what you're looking for. Now, apparently they write a lot of their own software, but it might be better if they could team up with Google and have Google index their sites on a special database. We'd have www.google.com for regular searches, and wayback.google.com for the Wayback Machine's sites.
Something else I found interesting: according to the article, they "use as much open source software as [they] can." That makes sense when they've got between 300 and 400 computers, and with the number growing all the time. Licensing all those with a non-open OS would be quite expensive.
Ok, we have successfully Slashdotted the Wayback Machine. Screw history! :) Let's move on to bigger and better things.
Wayback Slashdot ...only goes back to 2000? Seems kind of lame when you consider that my little web site goes back to 1998.
How to Download YouTube Videos
And I thought I'd erased all my old embarrassing HTTP handiwork... until I discovered my old website nicely archived - bleargh!
:)
;)
Ah hell, may as well keep it there - it's even got my old web-based Curriculum Vitae on it too - perhaps in some way I've now been "immortalised"??
I've not touched HTML since those first abortive attempts I made 5 years ago, because I realise now that I'm pretty crap at it - I'll stick to Unix admin, which is what I know best.
Wow! You really weren't trying too hard back in the day, were you Taco?
:)
I was glad to see the interviewee was brutally honest about free software -- both its benefits and its drawbacks. Discussions among my friends usually degenerate into holy wars, with all of us spouting cliches at one another until we storm off in huffs.
Free software can save the world, I think. We just need to realize that it needs a lot more work to get there.
They that would sacrifice their
http://web.archive.org/web/*/http://slashdot.org
$HOME is where the
-- silver_p
They don't seem to think the history of their site would be interesting: http://web.archive.org/web/*/http://web.archive.org/ redirects you to their index.html! Boring!
Now, that would really be a test for their apps. Same as if Google indexed www.google.com (entirely).
I too find this solution to be absolutely amazing.
Amazingly cool.
Amazingly "cheep".
Amazingly interesting.
However, teaming up with Google doesn't strike me as a particularly good idea. This project probably has a different scope than a pure web search engine, but even assuming there _is_ a substantial overlap between the projects, I for one would love to see some new high-quality attempts at searching vast amounts of data.
Google might be the best there is right now, but some serious competition should make them even better. (Or there will be a new top dog, which of course would be equally good.)
Anyway,
"First lesson," Jon said. "Stick them with the pointy end."
Not only does this sound like a rather far-fetched plot from an old Star Trek episode, but it also seems to be a physical and theoretical impossibility. Even if adequate storage space did exist for such a task (a 10 TB database would be but a small start), I do not foresee any type of technology that could ever adequately capture new data at a sufficient speed to harness that which is human innovation and creativity.
It is a nice thought, however, and I certainly wish him all the best in his pursuits...
Beer is proof that God loves us and wants us to be happy. -- Benjamin Franklin
100 TB does not make the biggest DB ever. I am personally working on a 60-70TB ERP system that's also writable; I am sure there are bigger systems out there (e.g. Wal-Mart's or GM's ERP systems come to mind).
A read-only DB containing highly-compressible text does not really make for a very challenging datamine. Just because it's on and about the Web and sexier than a stodgy ERP system should not make you overlook the real technology.
So... would I be right in thinking that the "Wayback Machine" is an (admittedly large scale) exercise in network database programming (of the style popular pre Codd and his relational model?) I am tempted to question if this is indeed the biggest database ever built - I suppose it depends upon definitions - but to my mind a database should be general purpose... whereas it appears to me that this project is basically a large-scale single index.
I also wonder if it would be appropriate to call this the largest project of its kind - for example - while Google stores less data, I suspect it supports a higher query rate... how exactly do you intend to measure scale... if it is in terms of computing power, is it relevant that Google already has thousands of Linux server nodes?
That said - I think it is an exciting project in its own right. I hope and expect this offering to become a significant information resource in years to come.
I thought it was going to tell us how Mr. Peabody's wayback machine worked. You know, like the flux capacitor diagrams that made everything clear . . .
----------
I am an expert in electricity. My father held the chair of applied electricity at the state prison.
I'd say it's pretty amazing; I was actually able to retrieve content I thought lost years ago.
My sites go back to '95, and yep, they're archived starting in '96. This is too cool.
I wonder how many of the government docs that were pulled post Sept 11 are still on this?
A really funny note is that all the p0rn seems to be intact starting in '96; gotta archive the porn.
But seriously, I was unaware of this. I'm gonna use this thing like hell as a sales tool if nothing else. It's also great for finding certain content that's been pulled.
Sig went tro...aahemmm.....fishing........
I just visited some sites that I had hoped had disappeared completely from cyberspace. The only defense I've got now is the old cryptic URLs of these monstrosities... Indexing that database would be a disaster, especially with an unusual name like mine...
(Yes, I was stupid enough to use my real name
Damn you, wayback
Okay... I'll do the stupid things first, then you shy people follow.
[Zappa]
This world needs more people like that... driven to make this world a better place... and having fun doing it, being proud of what they're trying to accomplish... This interview sent shivers down my spine... These are the kind of interviews that inspire people... It makes me think about humanity with a little less scepticism... There's still hope.
The article was good, with all its warts and gems. I take it more as a testament to the human spirit. I seriously doubt it's the largest database, though it might be the largest publicly accessible database. I'm sure the NSA could easily dwarf their database considering how much data they collect from around the world every day.
You buy from EMC a terabyte for maybe $300,000. That's just the storage for 1 TB. We can buy 100 TBs with 250 CPUs to work on it, all on a high-speed switch with redundancy built in.
Interesting quote. Mr. Kahle addresses something I've been wondering for a while -- are storage area networks really worth it? Or is he ignoring the costs of maintenance and manpower to keep these things afloat?
Seeing that they cache webpages from other sites, I wonder how long it will take before another company sues them?
Also, I wonder what their criteria will be for "submissions"? 1 month? 1 year?
Me email iz skyewalkerluke at microsoft's free email service.
The interview talked a little about throwing more machines on when the demand deems necessary. I wonder if it is possible to do this over the internet? I mean, I'm seeing something along the lines of SETI, where millions of people worldwide donate their unused processor power. Would it be possible to distribute the searches to remote computers over the internet in real time?
_______________________________
"I'm not Conceited...I'm just a realist..."
A number of you have asked whether the websites taken down since 9/11 are available on archive.org. The answer is yes. One example is:
DC Air National Guard on Archive
Same Page - 404
One of the conspiracy websites that I have read was saying that combat airplanes, normally on 24 hour alert, at this base should have and could have prevented the plane from entering the restricted airspace in DC. They were saying that this site was removed because it provided evidence that somebody dropped the ball.
_______________________________
"I'm not Conceited...I'm just a realist..."
Are you violating copyright laws?
About the Internet Archive
No. Like your local library's collections, our collections consist of publicly available documents. Furthermore, our Web collection (the Wayback Machine) includes only pages that were available at no cost and without passwords or special privileges. And if they wish, the authors of Internet documents can remove their documents from the Wayback Machine at http://www.archive.org/internet/remove.html."
I don't really think that they're necessarily right about this. I'm glad they've got the archive up, and I think it's dandy, but it seems like the copying and reposting of others' materials is a suspect practice. This will end up in court as soon as something that someone removed from their own webspace reappears, historically accurate, here. I'd guess some libel suits will be the first...
--
RumorsDaily
http://znet.net/~schester/facts/database_sizes.html
Apparently, walmart's is 24TB, and the entire www index as of 1999 was only 6TB.
Can you imagine a Beowulf cluste... err. I think they already have.
The biggest database ever? 100TB? Hardly.
I worked at a large pharmaceutical company for two years (known internally as the Squid), and supported a 380TB protein interaction database (Oracle) and a 260TB SAP-backend database (Informix + custom).
Certainly Wayback's database is large, and certainly it holds far more varied information and appeals to a far larger audience, but by no means is it the biggest. I'm sure there are databases that made the ones I worked on look puny by comparison.
Even though this has been available since Oct, it's the first I've seen of it. I think it's a great resource. Long dead sites that are no longer there now can be found for historical purposes. The interesting thing is that the links on the page are also updated to link to the archived versions. What I found it useful for was building a history page of what my site looked like over the years. Lots of great uses for this so hope it stays up!
liB
So the article says in one place that they wrote their own operating system, and in another that they use linux (or BSD, I forget which).
So which is it?
Sig is taking a break!
But what is the DBMS? Is the database relational? How was it modelled?
Leandro Guimarães Faria Corcete DUTRA
DA, DBA, SysAdmin, Data Modeller
GNU Project, Debian GNU/Lin
I once worked on a site with a 25 year old database that was much larger.
The ancient magnetic storage took up several warehouses. Beat that, for biggest database ever!
Even Slashdot wants to hide some things
Pretty decent read, but one thing they said got me thinking a little bit.
They said that at Thinking Machines they built a super fast computer, but it required a new way of thinking about things in order to program it. And then they called this a mistake, because they couldn't attract any customers.
This seems like a real problem that would lead to technological stagnation. At least from a market place point of view.
It is kind of similar to a company making games off of pre-existing engines, like quake, instead of some new non-quake compatible engine.
Or everybody making x86 compatible CPUs.
It also seems that when a company does come up with some new way of doing things, they get burned, and it is the second generation of companies that pick up the torch that make the money. So nobody wants to be that first company, they are all waiting for someone else to break the ground.
Maybe the only people/companies that come up with new stuff are the ones that are insanely rich, and won't get hurt by doing something new, or the insanely poor who have nothing to lose anyway.
I can't help thinking that this clustering boom going on is just like what 3dfx was trying to do. The difference right now is that clustering actually *does* outperform the super fast single chip. I wonder when technological advances will change this fact.
This site has archived television from all over the world on September 11th-September 18th.
I've been pretty jaded and unpatriotic/anti-war about the whole thing, but I can't help but admit I still get creeped out watching the footage.
I'm not sure how much storage Deep Thought had, but certainly the computer it was privileged to design has more than any database on this planet - by definition, since that database is simply a subsystem! Even more impressively, it compresses down to just two (base 13) nibbles: 42.
Slashdot looks exactly the same as it did 5 years ago!
WHEN is this site going to be updated? Forget the wayback machine, if I want ancient web history I visit slashdot.
--Dood
You're right, the Wayback machine is not the largest collection of data -- not even the largest collection online. I work with the USGS's catalog of satellite data. They have over 300 terabytes of satellite imagery, and the collection is growing at a rate of about 1 terabyte per day.
The USGS collection comprises multiple instruments, but Landsat 7 is a big one, contributing about 100 terabytes that's searchable online.
Perhaps 'Largest TEXT Database' would be a better description of the Wayback Machine?
Genocide Man -- Life is funny. Death is funnier. Mass murder can be hilarious.
Haha, look at the retro topics of those days.. Netscape, Kernel 2.1.74 (hey, why a dev-kernel?), the MS-trials...
What time is it/will be over there? Check with my iPhone app!
From the article:
How the archive works is just with stacks and stacks of computers running Solaris on x86, FreeBSD, and Linux, all of which have serious flaws, so we need to use different operating systems for different functions.
The man puts bias aside and uses various OSs in areas in which each performs well. A real, tangible project like this is worth more than any amount of drooling zealotry.
Trolling is a art,
Arg. Where's Google's cache? Oh.. wait.. nevermind.
My life is one big siesta in which I'm dreaming I wished my life was one big siesta.
Having so few transactions for a database of this size probably helps them run without needing large expensive machines. Many VLDBs support thousands of transactions per second. I found a list here of top ten winners of a very large database scalability contest. The winner for peak performance was something like 20,000+ TPS.
imagine what beowulf joke we can put...
What time is it/will be over there? Check with my iPhone app!
This is what your site gave me:
Data Retrieval Failure.
We're sorry. We were unable to retrieve the requested data. We may be experiencing technical difficulties and suggest that you try again later.
See the FAQs for more info and help, or contact us.
Perhaps it doesn't show, but I'm genuinely interested to find out.
This sig under construction. Please check back later.
"So IBM announces a 25 gig hard drive... does the world need this yet? Unless this is in a RAID, would you really want to trust 25 gigs on a single drive? What would you use this for? 400+ hours of MP3s comes to mind... "
mind you, this was only a couple short years ago, and now I'm writing this from a PC with three 80 giggers.
i thought we geeks were supposed to have more foresight than this? *grin*
A year spent in artificial intelligence is enough to make one believe in God.
(Score: -1 Incoherent)
This brings to mind a serious question. What data, if any, do we throw away? With ever expanding storage capacities it's getting easier and easier to just keep stuff, than to sit down and figure out what you want to throw away.
In the past, media degeneration and obsolescence over time have made the decisions for us. But going forward we will have massive distributed, redundant data stores, with geographically remote backups. The data isn't going to go away unless we tell it to.
Freenet addresses this problem by culling the less popular data (not actively, but as an end result of its caching policies) - but this has the unfortunate effect that important data can get lost. Not a desirable behavior for corporate data.
-josh
With an appropriate robots.txt file, a site's listing can be stopped from showing up on the Wayback archive. Interesting.
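For anyone wondering what such a robots.txt exclusion looks like in practice, here is a rough sketch using Python's standard `urllib.robotparser`. The crawler name `ia_archiver` is an assumption on my part (it is not given in the interview or in this thread), and `www.example.com` is just a placeholder, so check the Archive's own removal instructions before relying on it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a site that wants to stay out of the archive.
# "ia_archiver" is an ASSUMED crawler name, not something stated in the thread.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def is_blocked(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if robots_txt forbids `agent` from fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


# The archive crawler is shut out; a crawler with another name is not.
blocked_for_archive = is_blocked(
    ROBOTS_TXT, "ia_archiver", "http://www.example.com/index.html"
)
blocked_for_others = is_blocked(
    ROBOTS_TXT, "SomeOtherBot", "http://www.example.com/index.html"
)
print(blocked_for_archive, blocked_for_others)
```

Because the rule names one user agent only, other crawlers are unaffected, which is presumably why a site can vanish from the archive and still show up in ordinary search engines.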
I was kinda bummed to see that dec.com and digital.com yielded:
Blocked Site Error.
Per the request of the site owner, http://www.dec.com is no longer available in the Wayback Machine. Try another request or click here to see if the page is available, live, on the Web.
http://www.dec.com
Some things are just plain funny.
dinner: it's what's for beer
I hope people help them out, they have already brought me back to some cool stuff.
This is a noble cause.
--- Delta0.. makes no difference.
That must be like 99.99 TB of pr0n!!!
----------------------------------- My Other Sig Is Hilarious -----------------------------------
I can't seem to get the slashdot pages archived to load.
I think this is a fabulous project, and I hope it does well. However, I think that the notion of such a centralized database will begin to become unrealistic. I think peer to peer projects are the future, and I can see a day far in the future when the database layer comes down and inhabits the filesystem layer and all the databases on the internet can talk to each other, and in a sense, the net becomes a giant database that anyone can contribute to.
Cheers, Joshua
When in danger or in doubt, run in circles, scream and shout!
"Ma'am, did you realize that Chevrolet has an important plan for your life?"
"Whatever happened to fair use?"
-- Duff-Man
http://www.archive.org/exec/faqsidos/about/faqs.html?index=2 and
http://www.archive.org/exec/faqsidos/about/faqs.html?index=26
The claims made in these faqs are just not consistent with the law. Are they going to repost everything that was available on Napster?
They also have some problems with their algorithm: some redirected domains fool it into associating content with a site that was never actually connected to that content. To try to find copyrighted works would be a nightmare. Archive.org has refused to respond to any of these issues and, in fact, is lying about it if the quotes in the article are factual.
Russ Smith
Slightly off topic, but check out the ID webpage circa Dec 1997 and then look at it today. The work that must have gone into updating that baby (;
The US federal gov't has larger databases than this. They just don't talk about them too much.
BTW, looks like the Wayback Machine has been slashdotted.
I wonder if Google indexes this site. Think of the ramifications for finding the relevant information. Paradoxes galore!!
FWIW, the CERN labs database makes for an interesting read at 10 PB (10 million GB, or 10,000 TB):
http://biz.yahoo.com/prnews/011203/sfm093_1.html
It's a Hitchhiker's Guide to the Galaxy joke.
You know, for sites like the Wayback Machine, Google Groups, etc., it would be nice to have another option. Rather than contacting them and saying "please remove this old embarrassing crap of mine from your database," it would be great if one could instead tell them "please suppress this stuff for the next 60 years" (and have the request carry forward to whoever inherits the project). That way, people (historians) could still see all my old stupid stuff, but not until I'm dead or too senile to care.
wow!
just imagine a beow.......
ah. nevermind...
- In 1980, I was using a TRS-80 with a cassette tape interface. Not a lot of storage, unreliable and quite slow.
- In 1983, I started using an Apple-II with its 140K floppy drive. I went through all of high school keeping everything I ever wrote on only 20 disks.
- In 1987, the computers in college had 20M hard drives. One machine I had access to had a 40M drive, and there was almost never a shortage of space.
- In 1988, I got my first 1.44M floppy drive, and found that it took a really long time to fill them. (I was working with 360K floppies until then.)
- In 1989, I got my first hard drive - an 80M model, which blew away all the machines that the school was providing in the labs.
- In 1991, I got my first 1G drive and couldn't imagine ever filling it. Until I started getting OSs like Windows and apps like Office.
- Today, 40G drives are pretty much generic standard issue, and 100G drives aren't terribly expensive either.
Today, I can't imagine needing that much storage. (Right now, my home system has about 5G of stuff on 11G worth of media, which includes three operating systems and an installation of MS Office.) But I'm sure the need for it will arise soon, and even bigger drives will become cheap and popular. It's the nature of things. Engineering continually makes stuff smaller and cheaper, and data always grows to consume all available space.
Anyway, to keep this somewhat on-topic, it doesn't surprise me that archive.org was able to build a 100TB server farm. Today, you can get a 160GB drive for $275 (according to a listing on PriceWatch). 100TB is 625 of these drives, which would cost about $172,000. (Of course, it would really cost less, because 625 drives would qualify for a rather large bulk-purchase discount.)
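The back-of-the-envelope numbers above are easy to check. A quick sketch, taking 1 TB as 1,000 GB (an assumption, though it is the convention that makes the 625-drive figure come out exactly) and using only the drive size and price quoted in the comment:

```python
import math

# Figures from the comment above; 1 TB = 1,000 GB is an assumed convention.
TOTAL_GB = 100 * 1000     # the 100 TB archive
DRIVE_GB = 160            # capacity of one drive
DRIVE_PRICE_USD = 275     # quoted price per drive, before any bulk discount

# Round up, since you cannot buy a fraction of a drive.
drives_needed = math.ceil(TOTAL_GB / DRIVE_GB)
total_cost_usd = drives_needed * DRIVE_PRICE_USD

print(f"{drives_needed} drives, ${total_cost_usd:,}")
```

That gives 625 drives at $171,875, which rounds to the $172,000 quoted. It counts raw disks only: no CPUs, cases, switches, redundancy, or people to keep it running.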
well, look at that... it has tons and tons of lost and forgotten porn... time to buy that new 160 gigabyte hard drive!
http://eHacked.com
Has anyone tried this site in Konqueror? There is a floating link to the home page on top of the results that says search in progress, but never goes away?
Index it, then split the database horizontally, with each split on a different machine.
I would have to look at the database to tell you how big each split would need to be and how much power the machine housing each split would need, as well as other details.
Depending on certain variables, I might even consider splitting it across a cluster of some sort, but it's hard to say without a real look at the structure.
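To make "split the database horizontally" a bit more concrete, here is a minimal sketch of one common way to do it: hash each record's key (a URL here) and use the hash to pick a machine. This is purely illustrative and not a description of how the Archive actually lays out its data; `shard_for`, `split_horizontally`, and the shard count are hypothetical names and values.

```python
import hashlib


def shard_for(url: str, num_shards: int) -> int:
    """Map a URL to a shard number in [0, num_shards) using a stable hash."""
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def split_horizontally(urls, num_shards):
    """Distribute whole records (rows) across shards, one list per machine."""
    shards = [[] for _ in range(num_shards)]
    for url in urls:
        shards[shard_for(url, num_shards)].append(url)
    return shards


urls = [
    "http://slashdot.org/",
    "http://www.archive.org/",
    "http://www.google.com/",
    "http://www.dec.com/",
]
shards = split_horizontally(urls, num_shards=3)
for number, contents in enumerate(shards):
    print(number, contents)
```

A lookup only has to visit the one machine that `shard_for` names, which is the whole point of the split. The catch is that changing the number of shards reshuffles most keys, so how big each split should be really does depend on the details.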
The Kruger Dunning explains most post on
Same symptoms. I had to start up IE5.5 for the first time this month to use archive.org.
Once you eliminate the impossible, whatever remains, no matter how improbable, will be quoted out of context on