Slashdot Mirror


IBM vs. Content Chaos

ps writes "IBM's Almaden Research Center has been featured for their continued work on "Web Fountain", a huge system to turn all the unstructured info on the web into structured data. (Is "pink" the singer or the color?) IEEE reports that the first commercial use will be to track public opinion for companies." It looks like its feeding ground is primarily the public Internet, but it can be fed private information as well.

5 of 216 comments

  1. Expensive by starvingcodeartist · · Score: 4, Interesting

    The article says they plan on charging between $150,000 and $300,000 a year to use this super-search engine. They think corporate execs will pay for it. Seems really steep to me. BUT, for corporate execs, it's probably not too expensive. They'll just outsource another 10-15 programming jobs to India to pay for it.

    1. Re:Expensive by orac2 · · Score: 4, Interesting

      The point is that it's not intended for use as a search engine, but as a platform for doing computation-intensive data mining and analysis. A search engine can tell you how many mentions of IBM appear on the web, but not how people feel about IBM.

      --
      "Just once, I'd like to meet an alien menace that wasn't immune to bullets." -- The Brigadier, Dr. Who
  2. What about Existing Data? by ParadoxicalPostulate · · Score: 4, Interesting

    Are you telling me that there are programmers willing to go through [Insert Ludicrously Large Number Here] files and "annotate" them using XML to fit the new system?

    You would need an enormous workforce to do that.

    And if they don't plan on doing that, what about all the existing information? Is it going to be excluded from the database? Seems like quite a waste to me!

    Damn, but I would love to have access to one of these, even if the amount of information available will be minuscule (relatively speaking) for the next few years.

  3. How long before people start gaming the system? by dpbsmith · · Score: 4, Interesting

    As Google has discovered, it's only possible for simple heuristics and algorithms to "understand" the human content on the Web for as long as it doesn't matter.

    As soon as people become aware that Google or WebFountain or whatever is trying to evaluate web content, immediately they will begin trying to reverse-engineer and subvert the algorithms and heuristics that are used.

    And the stakes are much higher for gaming WebFountain than for gaming Google.

    For example, I'd imagine there would be big money for anyone who could convince companies that they know how to make it appear that a particular movie/song/toy/computer was "hot," so that the WebFountain-using Walmarts and Best Buys of the world would stock more of it.

    WebFountain will work well only until it is actually introduced.

  4. Half a football field? by AndroidCat · · Score: 4, Interesting
    (Imperial or metric football fields?)
    IBM's breakthrough is called WebFountain--half a football field's worth of rack-mounted processors, routers, and disk drives running a huge menagerie of programs.
    Later:
    It uses a cluster of thirty 2.4-GHz Intel Xeon dual-processor computers running Linux to crawl as much of the general Web as it can find at least once a week.

    To ensure that WebFountain's finger is constantly on the pulse of the Internet, an additional suite of similar computers is dedicated to crawling important but volatile Web sites, such as those hosting blogs, at least once a day. Other machines maintain access to popular non-Web-based sources, such as Usenet (a newsgroup service that predates the Web) and the Internet Relay Chat system, known as IRC. The data is then passed into WebFountain's main cluster of computers, currently composed of 32 server racks connected via gigabit Ethernet. Each rack holds eight Xeon dual-processor computers and is equipped with about 4-5 terabytes of disk storage.

    That's a lot of stuff, but half a football field? Possibly they're including cubicles for the staff or did they just inherit some old Big Iron space that was that large?
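
For scale, here is a rough tally of the figures quoted above. It is only a sketch: it counts the thirty weekly-crawl machines and the 32-rack main cluster, reads the "4-5 terabytes" as a per-rack figure, and leaves out the daily-crawl suite and the Usenet/IRC machines because the article excerpt gives no counts for them. The variable names are made up for illustration.

```python
# Figures taken from the quoted article text.
RACKS = 32                 # server racks in the main cluster
MACHINES_PER_RACK = 8      # dual-processor Xeon computers per rack
CPUS_PER_MACHINE = 2       # "dual-processor"
TB_PER_RACK = (4, 5)       # "about 4-5 terabytes" per rack (low, high)
CRAWLER_MACHINES = 30      # weekly general-Web crawl cluster

# Main cluster totals.
main_machines = RACKS * MACHINES_PER_RACK
main_cpus = main_machines * CPUS_PER_MACHINE
storage_tb = tuple(RACKS * tb for tb in TB_PER_RACK)

# Machines we can actually count from the excerpt (crawler + main cluster).
counted_machines = main_machines + CRAWLER_MACHINES

print(f"main cluster: {main_machines} machines, {main_cpus} CPUs")
print(f"main cluster storage: {storage_tb[0]}-{storage_tb[1]} TB")
print(f"counted machines overall: {counted_machines}")
```

By this count the excerpt accounts for a few hundred machines in 32 racks plus the crawlers, which is a lot of hardware but still leaves the "half a football field" question open.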
    --
    One line blog. I hear that they're called Twitters now.