Brewster Kahle & The Largest Library In History
BorgiaPope writes "WAIS creator and Alexa founder Brewster Kahle is interviewed by Feed. Kahle talks about the 30 terabytes of 'net content stored in Alexa's Linux servers, a data store he calls the 'largest library the world has ever known.' Some fascinating observations about how sites move in and out of the top traffic tier. He also claims that the top ten Web sites have the "greatest worldwide concentration of power since the Roman Empire.""
I must say, it's sad to see Yahoo at the top of the list, and the Open Directory Project not even on there, especially since it's now bigger than Yahoo, and growing faster. (Though, as always, it's in need of editors.)
It is an interesting list to look over; some of the entries on it are very surprising.
---
"You know your god is man-made when he hates all the same people you do."
He's proud of that? Still touting that as an accomplishment?
...
WAIS was the biggest piece of sh** to ever get steamrolled by the web
In a more general sense, copyright (and now license agreements) are to blame. There was a lot of talk in the "early days" about getting lots of stuff online, and it's slowly happening with, for example, Project Gutenberg and alt.binaries.e-book. But currently this is slow; OCR technology isn't good enough to process things without an editing pass, and sharing the original scans currently requires institutional resources. That, combined with the periodic extension of copyright terms to cover almost anything created in the 20th Century, has put a damper on volunteer efforts.
One would think that libraries would be a great place to start with this at the institutional level. Even without scanning, a lot of recent journals come with electronic versions as part of the subscription. And they're bought and paid for, so copyright isn't an issue (as long as you belong to a subscribing library). But...restrictive license agreements to the rescue! This article on oss4lib describes a situation where librarians are required to scan paper copies of journals they have electronically for interlibrary loan purposes.
Fundamentally, the movement to put a fence around information and charge for every view is at odds with the aim to preserve it. If we want hardcopy to be available electronically, or electronic documents to be preserved at all, we have to change the rules, or ignore them. In the meantime, start a private collection in the hope of publishing it someday. Historians will thank you.
A professor of mine, a number of other students, and I are doing some in-depth research on language and how it changes over time. One of our biggest problems at this point is finding sufficient samples of text data from strict editorial sources, so we have had to resort to using photocopied->scanned->OCR'ed National Geographic articles. However, now that we're moving on to a new phase of the project, we need ten times as much data to achieve accurate results. As of now, sources of digital text are few and far between, with no sources going back very far. Why is it that organizations in our society haven't invested the money and time into, say, digitizing the Library of Congress? I realize it's incredibly expensive and time-consuming - that's what we discovered, but it would be oh so useful to be able to read publications from a hundred years ago on my web browser. It's also great to see modern material produced by our society being archived, but there's a lot of ancient history that should be put into a format that should last forever as well.
As I understand it (from the little bit I read on their site, and from stuff gleaned from the interview), there's a program you can download from Alexa's site (www.alexa.com). When you run it, I imagine that it tells Alexa what sites you're visiting. So their hit counts are only from people using their program - though I could be wrong.
That's the telco model of information pricing. The telcos had to be dragged, kicking and screaming, into the era of cheap communications and free content.
The basic problem with micropayments is that all the enthusiasm for them is on the collecting side, not the paying side. Contrast this with credit card acceptance, which consumers actually want.
On the web, there are only two (non-porno) pay sites that do significant business: The Wall Street Journal and Consumer Reports. Both had top reputations in the print world. Everybody else who's tried it has bombed, including MTV. So pay-per-view is the wrong answer. Kahle is way off base on that. His "ISP tax" idea is even worse. That sounds like something the RIAA would come up with.
I thought the British Empire was the greatest concentration of power since the Roman Empire....
Go figure.... Guess history *was* wrong after all...
The secret of success is honesty and fair dealing. If you can fake those, you've got it made. (Marx)
I really do see the similarity between Alexa and Alexandria as a bad omen.
Much as I like InfoTech, I don't like the Roman Empire analogy. Information can influence people, but it is NOT military power.
Perhaps a better analogy would be to 400-1400 when the Popes and the Roman Catholic Church did hold a monopoly on religious information in the West. That ended with Gutenberg and the Reformation.
Does anyone recall a teen called Mafius Boyus who locked up all of the Roman Empire's activities for several hours just for fun?
This line was inexplicably removed from the final interview: Q: "Thirty Terabytes? That's a lot, isn't it?" A: "Well, once we've taken out all the Spam, 'Make Money Fast' schemes, Pr0n, "w3 0\/\/N j00" homepages, Natalie Portman fansites, 'USS Enterprise vs. Star Destroyer' discussions, links to goatse.cx, and Jon Katz articles, we can fit it all onto a floppy."
How does Alexa avoid violating copyright? Linking is one thing, mirroring another.
I really wonder what iloveschool.co.kr does there above microsoft, geocities, ebay and altavista.
<grub> Reading
-------
CAIMLAS
~/ssh slashdot.org ssh: connect to host slashdot.org port 22: too many beers
For example: In three hundred years, pornography is viewed as a valuable cultural resource. A historian wishes to study the subject of pornography over the ages and relate it to the prevailing attitudes in those ages. The historian will be stuffed, because to a librarian now, pornography is clearly not suitable for inclusion.
The history we have is much more a history of the rich and powerful, and not a history of the poor, because nobody wrote anything about the poor. Today, big scientific tomes are kept, but Joe Blogg's Geocities page (with exciting photos of him and his family and his cat) gets binned. In three hundred years this might be interesting historical evidence, the same as Joe Chimney Sweep's diary from 1800 or something.
The technology to do this effectively might not really be here yet, but it will probably arrive in those three hundred years. (Unless we're all too busy looking at porn instead ;) )
I wonder if that's the sense he meant it in. From reading the interview, I took his phrase to mean not that this is the most powerful group in the world (although that is still possible as many of these companies have off-line influence in spades as well), but that it is the most concentrated group. Television media, for instance, may rightfully be considered more powerful culturally, but it's also more distributed when viewed by number of "hits". These top ten sites, OTOH, are more concentrated in a small area.
The analogy to Rome in that sense is a good one, since most of the true power during the Empire's peak was concentrated in a very small area. Unfortunately, the idea of this small number of companies having power equivalent to the Empire's is untenable.
Well, London was colonized by the Romans. So let's compare London to any of the places colonized by the Brits.
That's understandable. I have signed up thrice in three different categories to be an editor. I have not ever heard back from them. That means that either their registration/application process is so difficult or counter-intuitive that I cannot figure it out, or that they just don't give a shit if they get another editor or not. Either way, I'm not surprised that they don't have as many editors as they would like or need.
Thanks for the comment, I'm bringing it to the attention of those people responsible for accepting new editors. It took me two applications to be accepted, and the first one seemed to have found its way to /dev/null like yours did. If you do decide to apply again (once you're accepted, it's not nearly as bad as the initial application), just remember to apply to smaller categories with few subcategories (especially ones without any editor currently), and fill in the URL fields of the application.
:)
I do agree that what they did to you is a horrible way to get people to edit and even use the directory...
---
"You know your god is man-made when he hates all the same people you do."
Can't agree with the Papal analogy either. You've completely ignored the Scholastic movement which utilized Moslem-Judeo translations of originally Greek works as a primary source. Not really controlled by the Papacy (at all.)
So long and thanks for all the fish . . . !!!
His assumption of power concentration would be true, if the net was the major medium for all, which it is not. That crown, for better or for worse, is still television.
However, that makes by definition the American media & Hollywood the #1 social power on the planet, not those sites. Sites will come and go. It's not the hits that count. There are countries with no web access or very restricted access (Chad, Syria, almost anywhere in the 3rd world), yet these countries get much more "Americanization" via movies & print literature.
So I'd say that he's on the mark with the content idea, and the web itself is a powerful distributor of knowledge and information. But the most concentrated since the Roman Empire? Almost. That's still the press/media.
46. The Hobo smiles, his eyes glaze over, and he burps. "Beware the man who has lived longer than the Wasteland."
Does that make Jack Valenti to be the Mule?
I always thought Brewster's neatest trick was getting his company this amazing space in San Francisco's leafy and spacious retired military base, the Presidio. It was reserved for non-profit firms, so he said that Alexa was archiving the web. Then, lo and behold, he found some commercial application of that library (does anyone actually use that "context" bar thing?) and sold the company to Amazon for a bazillion dollars. And kept his space!
...and they should be public too, I think.
:]
When deja took away the newsgroup archives pre-99, I was at first outraged, and then of course I realized that they're a business and not a public resource.
The wealth of human knowledge available in the newsgroup archives is immense and extremely useful on a day-to-day basis. A repository of public newsgroup archives would be a great public resource, and I'd love to see a project that shares that knowledge with the world. Hopefully this project will go that way, but I dunno if Usenet is included in the 30 terabytes.
Hopefully we can also get these archives without the annoying product links inserted in them.
Here:
And here:
Part of the reason I don't like that notion is because it starts a level of accountability that I wouldn't be comfortable seeing. Where would the tracking begin - or end, for that matter - so that the proper payment balance could be provided? Which ISP - the one the surfer is using to view the content, or the one hosting the content? I imagine he means the latter - and that's bothersome. If an ISP can be held financially liable for content that a user provides - regardless of who the copyright holder/content owner is - then how long before said ISP decides to host only content that's marketable and profitable? Draw your own conclusions about where the picking and choosing would go from there.
Another reason I don't like it - not necessarily a valid one, but definitely a personal one - is that it commercializes the web that much further. There's already enough corporate-owned and profit-driven crap here. It's not like we need more like that.
Kahle mentions that something like ASCAP is needed, but he himself talks about the nasty history behind his example's development. He also throws out AOL as an example of a company in the "best position" to implement such a thing. Like we didn't have enough concerns about content ownership/control/marketing without an endorsement like that...
Karma: Excellent, but still won't get you laid.
The Internet will be useless as a repository of knowledge until it is quite ruthlessly edited. I doubt any posts on this thread (including this one) would survive in a proper library.
-- the most controversial site on the Web