Slashdot Mirror


IBM vs. Content Chaos

ps writes "IBM's Almaden Research Center has been featured for their continued work on "Web Fountain", a huge system to turn all the unstructured info on the web into structured data. (Is "pink" the singer or the color?) IEEE reports that the first commercial use will be to track public opinion for companies." It looks like its feeding ground is primarily the public Internet, but it can be fed private information as well.

15 of 216 comments

  1. All we need... by TJ_Phazerhacki · · Score: 3, Interesting

    There is already altogether too much "stuff out there" for anyone to put any major effort into categorizing it. We should soon reach the point of info overload, and then what? What is the point of cataloging overflow data? Do we really need something like this? Or should we just ship a bunch of programmers wasting their time over to something else, like better spam filters and OSes without gaping security holes?

    --
    Physics is nothing like religion. If it was, we'd have an easier time trying to raise money!
    1. Re:All we need... by redragon · · Score: 2, Interesting

      I think the inverse is the case.

      The more chaotic (overloaded, in your terms) the data tends to be, the greater the information contained in that data (think compression). So what they're going after is not "categorizing" the internet; they're going after making some sense out of all of that data. Information overload begins to necessitate an intermediary to help filter out the data that you're interested in.

      The interesting thing becomes what sort of biases are built into a system like this. That is what I'm curious about. Right now when we search on Google (which of course has its own biases), we decide which links end up mattering (if we have the will to root through them). If a computer system is doing this, it will inevitably alter the way in which we come to understand the data we're looking through.

      I think you're saying "it doesn't matter" (or am I misreading you?), and that isn't the right direction of thinking here. Sure, spam and security are issues too, spam actually being a related problem, but it seems unfair to relegate this to the "bad idea" stack already.

      --
      - Sighuh?
  2. Get this setup by millahtime · · Score: 3, Interesting

    I wonder how long until IBM sells this setup. If it works well, logistics organizations would love to get their hands on it.

    1. Re:Get this setup by The+Limp+Devil · · Score: 2, Interesting

      let WebFountain troll it

      I sincerely hope you meant trawl it. The last thing we need is for IBM to build and sell an automated system for trolling the entire internet!

  3. Expensive by starvingcodeartist · · Score: 4, Interesting

    The article says they plan on charging between $150,000 and $300,000 a year to use this super-search engine. They think corporate execs will pay for it. Seems really steep to me. BUT, for corporate execs, it's probably not too expensive. They'll just outsource another 10-15 programming jobs to India to pay for it.

    1. Re:Expensive by orac2 · · Score: 4, Interesting

      The point is that it's not intended for use as a search engine, but as a platform for doing computation-intensive data mining and analysis. A search engine can tell you how many mentions of IBM appear on the web, but not how people feel about IBM.
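
      To make the distinction concrete, here is a toy sketch, not anything taken from WebFountain itself. The word lists, the sample documents, and the function names `count_mentions` and `naive_sentiment` are all made up for illustration, and real opinion analysis would need far more than a word list.

```python
import re

# Hypothetical toy lexicons, invented for this illustration only.
POSITIVE = {"love", "great", "reliable"}
NEGATIVE = {"hate", "overpriced", "slow"}


def tokens(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def count_mentions(docs, name):
    """What a plain search engine can report: how often a name appears."""
    name = name.lower()
    return sum(tokens(doc).count(name) for doc in docs)


def naive_sentiment(docs, name):
    """Crude opinion score: positive minus negative words in documents
    that mention the name. A stand-in for real analysis, nothing more."""
    name = name.lower()
    score = 0
    for doc in docs:
        words = tokens(doc)
        if name not in words:
            continue  # ignore documents that never mention the name
        score += sum(word in POSITIVE for word in words)
        score -= sum(word in NEGATIVE for word in words)
    return score


# Made-up sample documents.
docs = [
    "I love my IBM laptop, it is reliable.",
    "IBM support is slow.",
    "Nothing to do with computers here.",
]

mentions = count_mentions(docs, "IBM")
opinion = naive_sentiment(docs, "IBM")
```

      The first number only says the name came up; the second at least tries to say whether people were happy about it, and that second kind of question is what the platform is being pitched for.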

      --
      "Just once, I'd like to meet an alien menace that wasn't immune to bullets." -- The Brigadier, Dr. Who
  4. What about Existing Data? by ParadoxicalPostulate · · Score: 4, Interesting

    Are you telling me that there are programmers willing to go through [Insert Ludicrously Large Number Here] files and "annotate" them using XML to fit the new system?

    You would need an enormous workforce to do that.

    And if they don't plan on doing that, what about all the existing information? Is it going to be excluded from the database? Seems like quite a waste to me!

    Damn, but I would love to have access to one of these, even if the amount of information available will be minuscule (relatively speaking) for the next few years.

  5. Impact on Google IPO by G4from128k · · Score: 2, Interesting

    This is the type of technology that could either ensure or derail Google's future (I'm not saying that it will, only that it could). Semantic analysis and clustering of web pages could improve search. I hope Google gets to use/create this type of tech.

    --
    Two wrongs don't make a right, but three lefts do.
  6. Echelon? by SexyKellyOsbourne · · Score: 2, Interesting

    This project sounds quite interesting -- it could really help out projects like Echelon to help win the war on terrorism, if it's capable of understanding other languages of course, and could possibly build a whole database of information that's intercepted from other places. All that chatter, with the codewords they use, could possibly be understood by a football field full of Linux rackmounts, and might foil something.

    Of course, such power could also be horribly misused if it came into the wrong hands. What if they wanted to enumerate every member or affiliate of the "terrorist" Green Party in the case of a "national emergency?" Feed WebFountain some data from the internet, and from ECHELON, and they would have a quick blacklist.

    Or corporations, for that matter, as that's who it's designed for, could quickly blacklist people from employment who were considered "dangerous" such as whistleblowers, heavily involved union members, spies, watchdogs, and so forth.

  7. i.e. nameprotect by joeldg · · Score: 3, Interesting

    nameprotect does something similar, except they are looking for people violating copyrights.
    In addition, I think they might be one of the most banned bots online.

    Anyway, their users are all corporate entities who pay a lot of money to be able to auto-cease-and-desist copyright infringers.

    These same companies will pay IBM to tell them that since their cease and desist spree everyone hates them.

  8. Gaming Webfountain by G4from128k · · Score: 3, Interesting

    I wonder how long it will take sleazy e-commerce sites and p0rn sites to game WebFountain and turn it into SpamFountain?

    I suspect that this tool (and any like it) must make a core assumption -- that each webpage is about one semantic thing and that the creators are trying to communicate that one thought. In contrast, people who try to boost their page rank have no compunction about misleading people (or algorithms). Clever tagging and misleading verbiage should be able to fool IBM's analyzer into clustering a site where it does not belong (but where the site owner wants it). The result is a page that looks like it is about one thing (some popular search term) while being about something else (selling their junk or porn).

    Next will come high-priced consultants that tell you how to make your site place highly on WebFountain (like the ones that currently game Google).

    --
    Two wrongs don't make a right, but three lefts do.
  9. How long before people start gaming the system? by dpbsmith · · Score: 4, Interesting

    As Google has discovered, it's only possible for simple heuristics and algorithms to "understand" the human content on the Web for as long as it doesn't matter.

    As soon as people become aware that Google or WebFountain or whatever is trying to evaluate web content, immediately they will begin trying to reverse-engineer and subvert the algorithms and heuristics that are used.

    And the stakes are much higher for gaming WebFountain than for gaming Google.

    For example, I'd imagine there would be big money for anyone who could convince companies that they know how to make it appear that a particular movie/song/toy/computer was "hot," so that the WebFountain-using Walmarts and Best Buys of the world would stock more of it.

    WebFountain will work well only until it is actually introduced.

  10. Half a football field? by AndroidCat · · Score: 4, Interesting
    (Imperial or metric football fields?)
    IBM's breakthrough is called WebFountain--half a football field's worth of rack-mounted processors, routers, and disk drives running a huge menagerie of programs.
    Later:
    It uses a cluster of thirty 2.4-GHz Intel Xeon dual-processor computers running Linux to crawl as much of the general Web as it can find at least once a week.

    To ensure that WebFountain's finger is constantly on the pulse of the Internet, an additional suite of similar computers is dedicated to crawling important but volatile Web sites, such as those hosting blogs, at least once a day. Other machines maintain access to popular non-Web-based sources, such as Usenet (a newsgroup service that predates the Web) and the Internet Relay Chat system, known as IRC. The data is then passed into WebFountain's main cluster of computers, currently composed of 32 server racks connected via gigabit Ethernet. Each rack holds eight Xeon dual-processor computers and is equipped with about 4-5 terabytes of disk storage.

    That's a lot of stuff, but half a football field? Are they including cubicles for the staff, or did they just inherit some old Big Iron space that was that large?
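
    For scale, the main-cluster figures quoted above can be multiplied out. This is only back-of-the-envelope arithmetic on the numbers in the quote; the constant and function names are mine, and it leaves out the crawler machines because the size of the daily-crawl suite isn't given.

```python
# Figures taken from the article excerpt quoted above (main cluster only).
RACKS = 32
COMPUTERS_PER_RACK = 8
PROCESSORS_PER_COMPUTER = 2      # "dual-processor"
DISK_TB_PER_RACK = (4, 5)        # "about 4-5 terabytes" per rack (low, high)


def main_cluster_totals():
    """Multiply the per-rack figures out to totals for the whole cluster."""
    computers = RACKS * COMPUTERS_PER_RACK
    processors = computers * PROCESSORS_PER_COMPUTER
    disk_low_tb = RACKS * DISK_TB_PER_RACK[0]
    disk_high_tb = RACKS * DISK_TB_PER_RACK[1]
    return {
        "computers": computers,
        "processors": processors,
        "disk_tb": (disk_low_tb, disk_high_tb),
    }


totals = main_cluster_totals()
print(totals)
```

    That comes to 256 computers, 512 processors, and roughly 128-160 terabytes of disk across the 32 racks.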
    --
    One line blog. I hear that they're called Twitters now.
  11. It already exists by claudebbg · · Score: 3, Interesting

    I've already seen/heard of such systems, basically in the Business Intelligence field.
    In England, a system like Autonomy (used by the police at the beginning) can crawl a mass of information with dedicated spiders (not only for the web, but also commercial databases, files...). Then it structures all the content into themes with links and proximity.
    I personally tested it some years ago, feeding it with information websites and asking for articles "close to" another one. The efficiency was amazing because it was able to tell the difference between close terms that have really different meanings depending on the context. Usually, search engines are wrong because they can't use the context.
    I also set up some "agents" for recurrent searches (an agent is basically a search plus some training, letting Autonomy know which found documents are close and which are not), and it was able to propose every day a really good press review with nearly no wrong documents.
    As a complement to Autonomy, I know a BI team that uses some other tools like Pericles to feed the searches with "relevant" content, basically themes that are "appearing" in the group of documents and are close to some interests.
    Such BI tools can already provide the kind of information cited, like an opinion movement against a company detected in the newsgroups or on some websites. And IBM is certainly on track to improve such tools with the techniques of their labs.
    I hope these tools won't be limited to PR articles on the web and/or private use by big corporations, because that could only be another Echelon with all its bad consequences:
    - bad use of public information
    - paranoia fed by false scares
    - public/corp. power against the citizens
    If tools like Echelon could be used by everybody, they would have to leave much more privacy to citizens, and the public leaders would have to explain the investments.

  12. Sounds like CYC by Sanity · · Score: 2, Interesting
    CYC have been trying to collect all human knowledge for the last few decades and feed it into a knowledge base. They have even open sourced part of their database.

    Despite the apparent promise of the project, it is difficult to find actual examples of it doing really cool stuff.