Brewster Kahle & The Largest Library In History
BorgiaPope writes "WAIS creator and Alexa founder Brewster Kahle is interviewed by Feed. Kahle talks about the 30 terabytes of 'net content stored in Alexa's Linux servers, a data store he calls the 'largest library the world has ever known.' Some fascinating observations about how sites move in and out of the top traffic tier. He also claims that the top ten Web sites have the "greatest worldwide concentration of power since the Roman Empire.""
A professor of mine as well as myself and a number of other students are doing some indepth research on language and how it changes over time. One of our biggest problems at this point is finding sufficient samples of text data from strict editorial sources, so we have had to resort to using photocopied->scanned->OCR'ed National Geographic articles. However, now that we're moving on to a new phase of the project, we need ten times as much data to realize the accurracy of our results. As of now, sources of digital text are few and far inbetween, with no sources going back very far. Why is it that organizations in our society haven't invested the money and time into, say, digitizing the Library of Congress? I realize it's incredibly expensive and timeconsuming - that's what we discovered, but it would be oh so useful to be able to read publications from a hundred years ago on my web browser. It's also great to see modern material produced by our society being archived, but there's a lot of ancient history that should be put into a format that should last forever as well.
-------
CAIMLAS
~/ssh slashdot.org ssh: connect to host slashdot.org port 22: too many beers
For example: In three hundred years, pornography is viewed as a valuable cultural resource. A historian wishes to study the subject of pornography over the ages and relate it to the prevailing attitudes in those ages. The historian will be stuffed, because to a librarian now, pornography is clearly not suitable for inclusion.
The history we have is much more a history of the rich and powerful, and not a history of the poor, because nobody wrote anything about the poor. Today, big scientific tomes are kept, but Joe Blogg's Geocities page (with exciting photos of him and his family and his cat) gets binned. In three hundred years this might be interesting historical evidence, the same as Joe Chimney Sweep's diary from 1800 or something.
The technology to do this effectively might not really be here yet, but it will probably arrive in those three hundred years. (Unless we're all too busy looking at porn instead ;) )
Here's a reality check - for me, anyway: I honestly thought slashdot.org would be somewhere in the top 500. I was going to make a joke about "You know it's time to move on to kuro5hin when slashdot makes it into the top 100". Nope. Slashdot isn't in the top 1000.
Linux doesn't show up in any of the top 1000 domain names, but windows does - once - in windowsmedia.com, which is about a TV-like a site as you can get, and a subsite of MSN.
Google was 21st, cnn.com was 37th, and wired.com was 970. Other than that, none of the sites I've bookmarked are in the top 1000.
I guess I shouldn't be surprised that the web I see is nothing like the web most of the world sees. I am a little disconcerted though. No wonder the general public doesn't care about software freedom, DeCSS, software patents, privacy, etc. The awful truth is that for most people, the internet is like TV.
What a depressing way to start a Friday.
Torrey Hoffman (Azog)
Torrey Hoffman (Azog)
"HTML needs a rant tag" - Alan Cox
Here:
And here:
Part of the reason I don't like that notion is because it starts a level of accountability that I wouldn't be comfortable seeing. Where would the tracking begin - or end, for that matter - so that the proper payment balance could be provided? Which ISP - the one the surfer is using to view the content, or the one hosting the content? I imagine he means the latter - and that bothersome. If an ISP can be held financially liable for content that a user provides - regardless of who the copyright holder/content owner is - then how long before said ISP decides to host only content that's marketable and profitable? Draw your own conclusions about where the picking and choosing would go from there.
Another reason I don't like it - not necessarily a valid one, but definitely a personal one - is that it commercializes the web that much further. There's already enough corporate-owned and profit-driven crap here. It's not like we need more like that.
Kahle mentions that something like ASCAP is needed, but he himself talks about the nasty history behind his example's development. He also throws out AOL as an example of a company in the "best position" to implement such a thing. Like we didn't have enough concerns about content ownership/control/marketing without an endorsement like that...
Karma: Excellent, but still won't get you laid.
The Internet will be useless as a repository of knowledge until it is quite ruthlessly edited. I doubt any posts on this thread (including this one) would survive in a proper library.
-- the most controversial site on the Web