Slashdot Mirror


Brewster Kahle & The Largest Library In History

BorgiaPope writes "WAIS creator and Alexa founder Brewster Kahle is interviewed by Feed. Kahle talks about the 30 terabytes of 'net content stored in Alexa's Linux servers, a data store he calls the 'largest library the world has ever known.' Some fascinating observations about how sites move in and out of the top traffic tier. He also claims that the top ten Web sites have the 'greatest worldwide concentration of power since the Roman Empire.'"

29 of 88 comments

  1. Re:the top 100 WWW sites are: by Saige · · Score: 2

    I must say, it's sad to see Yahoo at the top of the list, and the Open Directory Project not even on there, especially since it's now bigger than Yahoo, and growing faster. (Though, as always, it's in need of editors.)

    It is an interesting list to look over; some of the ones on there are very surprising.
    ---

    --
    "You know your god is man-made when he hates all the same people you do."
  2. WAIS Creator? by Bryan+Ischo · · Score: 2

    He's proud of that? Still touting that as an accomplishment?

    WAIS was the biggest piece of sh** to ever get steamrolled by the web ...

  3. Re:Digital archives... by rillian · · Score: 2

    In a more general sense, copyright (and now license agreements) are to blame. There was a lot of talk in the "early days" about getting lots of stuff online, and it's slowly happening with, for example, Project Gutenberg and alt.binaries.e-book. But currently this is slow; OCR technology isn't good enough to process things without an editing pass, and sharing the original scans currently requires institutional resources. That, combined with the periodic extension of copyright terms to cover almost anything created in the 20th Century, has put a damper on volunteer efforts.

    One would think that libraries would be a great place to start with this at the institutional level. Even without scanning, a lot of recent journals come with electronic versions as part of the subscription. And they're bought and paid for, so copyright isn't an issue (as long as you belong to a subscribing library). But...restrictive license agreements to the rescue! This article on oss4lib describes a situation where librarians are required to scan paper copies of journals they have electronically for interlibrary loan purposes.

    Fundamentally, the movement to put a fence around information and charge for every view is at odds with the aim to preserve it. If we want hardcopy to be available electronically, or electronic documents to be preserved at all, we have to change the rules, or ignore them. In the meantime, start a private collection in the hope of publishing it someday. Historians will thank you.

  4. Digital archives... by Gendou · · Score: 4

    A professor of mine, a number of other students, and I are doing some in-depth research on language and how it changes over time. One of our biggest problems at this point is finding sufficient samples of text data from strict editorial sources, so we have had to resort to using photocopied->scanned->OCR'ed National Geographic articles. However, now that we're moving on to a new phase of the project, we need ten times as much data to realize the accuracy of our results. As of now, sources of digital text are few and far between, with no sources going back very far. Why is it that organizations in our society haven't invested the money and time into, say, digitizing the Library of Congress? I realize it's incredibly expensive and time-consuming (that's what we discovered), but it would be oh so useful to be able to read publications from a hundred years ago on my web browser. It's also great to see modern material produced by our society being archived, but there's a lot of ancient history that should be put into a format that should last forever as well.

  5. Re:How does he measure hitcounts ? by dave_d · · Score: 2

    As I understand it (from the little bit I read on their site, and from stuff gleaned from the interview), there's a program you can download from Alexa's site (www.alexa.com). When you run it, I imagine that it tells Alexa what sites you're visiting. So their hitcounts are only from people using their program - though I could be wrong.

  6. He likes micropayments? by Animats · · Score: 2
    Yeah, Minitel had "micropayments". Pricing was comparable to 900 numbers. Prices varied; the telephone directory was free, chat sites ("messageries") were a few francs a minute, and sites with official government data like lists of research projects cost about 4x as much as sex chat.

    That's the telco model of information pricing. The telcos had to be dragged, kicking and screaming, into the era of cheap communications and free content.

    The basic problem with micropayments is that all the enthusiasm for them is on the collecting side, not the paying side. Contrast this with credit card acceptance, which consumers actually want.

    On the web, there are only two (non-porno) pay sites that do significant business: The Wall Street Journal and Consumer Reports. Both had top reputations in the print world. Everybody else who's tried it has bombed, including MTV. So pay-per-view is the wrong answer. Kahle is way off base on that. His "ISP tax" idea is even worse. That sounds like something the RIAA would come up with.

  7. Strange... by dr_labrat · · Score: 2

    I thought the British Empire was the greatest concentration of power since the Roman Empire....

    Go figure.... Guess history *was* wrong after all...

    --
    The secret of success is honesty and fair dealing. If you can fake those, you've got it made. (Marx)
    1. Re:Strange... by Ndog · · Score: 3

      That does seem like a very questionable statement to me. The top ten web sites are potentially powerful, but it depends what content they are serving up. If they are selling things, like Amazon, would that be so powerful? Sure, you can push certain things, but ultimately it's up to the buyer. Of course portals like Yahoo are powerful, but only when it comes to the content they are providing. Do they really have any power over my everyday life? What about people and cultures without so much internet access? Are they not even considered in this discussion?

      Besides, power is fleeting.


      Spooon!

      --
      -N
    2. Re:Strange... by thing12 · · Score: 2

      You miss the point that this concentration of power is larger than anything that has preceded it since the Roman Empire. If you read the article, it says that the top 10 sites control 20% of what all people around the world see on the web. 20%!! That's an amazing amount of media power to be held by so few people. So few people able (if they got together and had the desire) to strongly influence the way the Internet population feels about an issue - and, more importantly, to tell them they should buy Product X instead of Product Y.

  8. Libraries that weren't by Dreamland · · Score: 2

    I really do see the similarity between Alexa and Alexandria as a bad omen.

  9. Info != power by redelm · · Score: 3

    Much as I like InfoTech, I don't like the Roman Empire analogy. Information can influence people, but it is NOT military power.

    Perhaps a better analogy would be to 400-1400 when the Popes and the Roman Catholic Church did hold a monopoly on religious information in the West. That ended with Gutenberg and the Reformation.

  10. Roman empire by anpe · · Score: 2

    Does anyone recall a teen called Mafius Boyus who locked up all of the Roman Empire's activities for several hours just for fun?

  11. The unedited interview: by darylp · · Score: 3

    This line was inexplicably removed from the final interview: Q: "Thirty Terabytes? That's a lot, isn't it?" A: "Well, once we've taken out all the Spam, 'Make Money Fast' schemes, Pr0n, "w3 0\/\/N j00" homepages, Natalie Portman fansites, 'USS Enterprise vs. Star Destroyer' discussions, links to goatse.cx, and Jon Katz articles, we can fit it all onto a floppy."

  12. Copyright ? by redelm · · Score: 2

    How does Alexa avoid violating copyright? Linking is one thing, mirroring another.

  13. Alexa's top50 list by Leto2 · · Score: 2
    Did you look at Alexa's top50 of busiest websites in August?

    I really wonder what iloveschool.co.kr does there above microsoft, geocities, ebay and altavista.

    --
    <grub> Reading /. at -1 is like driving through Cracktown in a convertible that is stuck in 1st
    1. Re:Alexa's top50 list by Azog · · Score: 5

      Here's a reality check - for me, anyway: I honestly thought slashdot.org would be somewhere in the top 500. I was going to make a joke about "You know it's time to move on to kuro5hin when slashdot makes it into the top 100". Nope. Slashdot isn't in the top 1000.

      Linux doesn't show up in any of the top 1000 domain names, but windows does - once - in windowsmedia.com, which is about as TV-like a site as you can get, and a subsite of MSN.

      Google was 21st, cnn.com was 37th, and wired.com was 970. Other than that, none of the sites I've bookmarked are in the top 1000.

      I guess I shouldn't be surprised that the web I see is nothing like the web most of the world sees. I am a little disconcerted though. No wonder the general public doesn't care about software freedom, DeCSS, software patents, privacy, etc. The awful truth is that for most people, the internet is like TV.

      What a depressing way to start a Friday.

      Torrey Hoffman (Azog)

      --
      Torrey Hoffman (Azog)
      "HTML needs a rant tag" - Alan Cox
  14. Roman Empire and Power by CAIMLAS · · Score: 4
    Power is the ability to conform the will of others to your own. The Romans did this by killing their opponents, and by the threat of such things. These sites don't have such power - anything that people submit to is submitted to out of free will. Unless, of course, you count things like the ability to sell personal information. :)

    -------
    CAIMLAS

    --
    ~/ssh slashdot.org ssh: connect to host slashdot.org port 22: too many beers
  15. Re:Library? I think not by Harri · · Score: 4
    ...and the disadvantage of a library is that the stuff is selected by a librarian, according to a view of what is interesting that is specific to the age in which he is living and the culture to which he belongs. (Or she, or it). Thus I believe there is a good deal of value in the idea of a library which is not filtered at the time of collection, but which can be filtered at the time of reading according to the interests of the reader.

    For example: In three hundred years, pornography is viewed as a valuable cultural resource. A historian wishes to study the subject of pornography over the ages and relate it to the prevailing attitudes in those ages. The historian will be stuffed, because to a librarian now, pornography is clearly not suitable for inclusion.

    The history we have is much more a history of the rich and powerful, and not a history of the poor, because nobody wrote anything about the poor. Today, big scientific tomes are kept, but Joe Blogg's Geocities page (with exciting photos of him and his family and his cat) gets binned. In three hundred years this might be interesting historical evidence, the same as Joe Chimney Sweep's diary from 1800 or something.

    The technology to do this effectively might not really be here yet, but it will probably arrive in those three hundred years. (Unless we're all too busy looking at porn instead ;) )

  16. Re:He's so almost there by Nezumi-chan · · Score: 2
    So I'd say that he's on the mark with the content idea, and the web itself is a powerful distributor of knowledge and information. But the most concentrated since the Roman Empire? Almost. That's still the press/media.

    I wonder if that's the sense he meant it in. From reading the interview, I took his phrase to mean not that this is the most powerful group in the world (although that is still possible as many of these companies have off-line influence in spades as well), but that it is the most concentrated group. Television media, for instance, may rightfully be considered more powerful culturally, but it's also more distributed when viewed by number of "hits". These top ten sites, OTOH, are more concentrated in a small area.

    The analogy to Rome in that sense is a good one, since most of the true power during the Empire's peak was concentrated in a very small area. Unfortunately, the idea of this small number of companies having power equivalent to the Empire's is untenable.

  17. Colonial success by brokeninside · · Score: 2
    It may seem inflammatory, but compare post-British colonies with colonies by any other colonial power.

    Well, London was colonized by the Romans. So let's compare London to any of the places colonized by the Brits.

  18. Re:the top 100 WWW sites are: by po_boy · · Score: 2
    Though, as always, it's in need of editors.

    That's understandable. I have signed up thrice in three different categories to be an editor. I have not ever heard back from them. That means that either their registration/application process is so difficult or counter-intuitive that I cannot figure it out, or that they just don't give a shit if they get another editor or not. Either way, I'm not surprised that they don't have as many editors as they would like or need.

  19. Re:the top 100 WWW sites are: by Saige · · Score: 2

    That's understandable. I have signed up thrice in three different categories to be an editor. I have not ever heard back from them. That means that either their registration/application process is so difficult or counter-intuitive that I cannot figure it out, or that they just don't give a shit if they get another editor or not. Either way, I'm not surprised that they don't have as many editors as they would like or need.

    Thanks for the comment, I'm bringing it to the attention of those people responsible for accepting new editors. It took me two applications to be accepted, and the first one seemed to have found its way to /dev/null like yours did. If you do decide to apply again (once you're accepted, it's not nearly as bad as the initial application), just remember to apply to smaller categories with few subcategories (especially ones without any editor currently), and fill in the URL fields of the application.

    I do agree that what they did to you is a horrible way to get people to edit and even use the directory... :)
    ---

    --
    "You know your god is man-made when he hates all the same people you do."
  20. Re:Info != power by JJ · · Score: 2

    Can't agree with the Papal analogy either. You've completely ignored the Scholastic movement which utilized Moslem-Judeo translations of originally Greek works as a primary source. Not really controlled by the Papacy (at all.)

    --
    So long and thanks for all the fish . . . !!!
  21. He's so almost there by Markvs · · Score: 3

    His assumption of power concentration would be true, if the net was the major medium for all, which it is not. That crown, for better or for worse, is still television.

    However, that makes by definition the American media & Hollywood the #1 social power on the planet, not those sites. Sites will come and go. It's not the hits that count. There are countries with no web access or very restricted access (Chad, Syria, almost anywhere in the 3rd world), yet these countries get much more "Americanization" via movies & print literature.

    So I'd say that he's on the mark with the content idea, and the web itself is a powerful distributor of knowledge and information. But the most concentrated since the Roman Empire? Almost. That's still the press/media.

    --
    46. The Hobo smiles, his eyes glaze over, and he burps. "Beware the man who has lived longer than the Wasteland."
  22. Re:Sounds familiar? by Glytch · · Score: 2

    Does that make Jack Valenti to be the Mule?

  23. Alexa's a great real-estate scam by zlite · · Score: 3

    I always thought Brewster's neatest trick was getting his company this amazing space in San Francisco's leafy and spacious retired military base, the Presidio. It was reserved for non-profit firms, so he said that Alexa was archiving the web. Then, lo and behold, he found some commercial application of that library (does anyone actually use that "context" bar thing?) and sold the company to Amazon for a bazillion dollars. And kept his space!

  24. Repositories like this are necessary... by AugstWest · · Score: 2

    ...and they should be public too, I think.

    When deja took away the newsgroup archives pre-99, I was at first outraged, and then of course I realized that they're a business and not a public resource.

    The wealth of human knowledge available in the newsgroup archives is immense and extremely useful on a day-to-day basis. A repository of public newsgroup archives would be a great public resource, and I'd love to see a project that shares that knowledge with the world. Hopefully this project will go that way, but I dunno if usenet is included in the 30 terabytes.

    Hopefully we can also get these archives without the annoying product links inserted in them. :]

  25. Here's the part I'm not sure I like... by hiryuu · · Score: 4

    Here:

    And I think the right place to tax is the ISPs.

    And here:

    Right now, people are paying all of their money to use ISPs but the ISPs don't have to pay for the content.

    Part of the reason I don't like that notion is because it starts a level of accountability that I wouldn't be comfortable seeing. Where would the tracking begin - or end, for that matter - so that the proper payment balance could be provided? Which ISP - the one the surfer is using to view the content, or the one hosting the content? I imagine he means the latter - and that's bothersome. If an ISP can be held financially liable for content that a user provides - regardless of who the copyright holder/content owner is - then how long before said ISP decides to host only content that's marketable and profitable? Draw your own conclusions about where the picking and choosing would go from there.

    Another reason I don't like it - not necessarily a valid one, but definitely a personal one - is that it commercializes the web that much further. There's already enough corporate-owned and profit-driven crap here. It's not like we need more like that.

    Kahle mentions that something like ASCAP is needed, but he himself talks about the nasty history behind his example's development. He also throws out AOL as an example of a company in the "best position" to implement such a thing. Like we didn't have enough concerns about content ownership/control/marketing without an endorsement like that...

    --
    Karma: Excellent, but still won't get you laid.
  26. Library? I think not by streetlawyer · · Score: 4
    He misunderstands the concept of a "library". A library, in its historic definition, is not just a heap of information and publications; it represents someone's selection and preservation of worthwhile knowledge. A massive cache of shitty Geocities sites, corporate bumph and pathetically precious weblogs is not a library by any stretch of the imagination. A library isn't a library unless it has a librarian, deciding what needs to be preserved and, importantly, editing out the dross. His servers might have three times as many bytes as the Library of Congress has letters, but I know which one I'd rather spend an afternoon with.

    The Internet will be useless as a repository of knowledge until it is quite ruthlessly edited. I doubt any posts on this thread (including this one) would survive in a proper library.