Slashdot Mirror


IBM vs. Content Chaos

ps writes "IBM's Almaden Research Center has been featured for their continued work on "Web Fountain", a huge system to turn all the unstructured info on the web into structured data. (Is "pink" the singer or the color?) IEEE reports that the first commercial use will be to track public opinion for companies. " It looks like its feeding ground is primarily the public Internet, but it can be fed private information as well.

28 of 216 comments

  1. I think a better question... by bc90021 · · Score: 5, Funny

    ...doesn't concern whether "Pink" is a colour or a singer, but whether "Paris Hilton" is a hotel in France or an oft downloaded video... ;)

  2. All we need... by TJ_Phazerhacki · · Score: 3, Interesting

    There is already altogether too much "Stuff out there" for anyone to put any major effort into categorizing it. We should soon reach the point of info overload, and then what? What is the point of cataloging overflow data? Do we really need something like this? Or should we just ship a bunch of programmers wasting their time over to something else, like better spam filters and OS's without gaping security holes?

    --
    Physics is nothing like religion. If it was, we'd have an easier time trying to raise money!
    1. Re:All we need... by millahtime · · Score: 5, Insightful

      There are many organizations that need better ways to analyze their info. There are databases that are terabytes in size and have to do detailed searches. With SQL databases that can take a long time and any faster way can save a lot of time and money. There is a big need for this technology across many industries.

    2. Re:All we need... by xyzzy · · Score: 5, Insightful

      That's really funny that you mention "spam filters", since that is exactly the content categorization task that you are talking about.

      Automatic categorization of overflowing data is exactly what you need to do when you have too much to think about -- it allows you to triage your attention span, which is the most limited resource you have.

  3. Send link to Google by Urkki · · Score: 4, Insightful

    They could certainly use these kinds of techniques to improve their results...

    Then again, in a way they already use something like this, except they're only really concerned about links, not actual contents of pages...

  4. structure... by Rhubarb+Crumble · · Score: 5, Funny
    a huge system to turn all the unstructured info on the web into structured data

    In order to do this, they will use a scheme by which each document is referred to by a string including the transfer protocol, the host name, and a file path.

    oh, wait...

  5. First customer by Anonymous Coward · · Score: 3, Funny

    IEEE reports that the first commercial use will be to track public opinion for companies.

    Word has it the first test case will be SCO. WebFountain: "Outlook not so good"

  6. Get this setup by millahtime · · Score: 3, Interesting

    I wonder how long until IBM sells this setup. If it works well, logistics organizations would love to get their hands on it.

    1. Re:Get this setup by orac2 · · Score: 4, Informative

      Although the article didn't have room to go into this point (and I should know, I'm the author), IBM can completely compartmentalize competitors' data, even if hosted in-house (IBM already does this in other parts of its business). If companies are still wary, they can host the data themselves and let WebFountain troll it on a need-to-know basis.

      --
      "Just once, I'd like to meet an alien menace that wasn't immune to bullets." -- The Brigadier, Dr. Who
  7. Expensive by starvingcodeartist · · Score: 4, Interesting

    In the article it says they plan on charging between $150,000 and $300,000 a year to use this super-search engine. They think corporate execs will pay for it. Seems really steep to me. BUT, for corporate execs, it's probably not too expensive. They'll just outsource another 10-15 programming jobs to India to pay for it.

    1. Re:Expensive by orac2 · · Score: 4, Interesting

      The point is that it's not intended for use as a search engine, but a platform for doing computation intensive data mining and analysis. A search engine can tell you how many mentions of IBM appear on the web, but not how people feel about IBM.

      --
      "Just once, I'd like to meet an alien menace that wasn't immune to bullets." -- The Brigadier, Dr. Who
  8. corporate meddling by commo1 · · Score: 3, Insightful

    One of my main concerns with search databases is the inherent ability for corporations to increase their visibility on the web by manipulating data to their benefit to bring their corporate page up first on the list. I wonder if there is a way for the database to have a scoring system based on the validity of the data: is the information there, or are there just highly developed metatags doing the work? If you do a search for a specific part number for an HP product, what are the chances of getting a) the HP home page, where a further search would be necessary to find any relevant info, or b) the big chains like Staples and Circuit City, who just want to sell you cartridges and have the time and resources to steer you in the right direction? How would the system be regulated (kinda like Slashdot mods :P)? Who watches the watchers, and can information validity be electronically implemented? What kind of AI would be necessary?

  9. What about Existing Data? by ParadoxicalPostulate · · Score: 4, Interesting

    Are you telling me that there are programmers willing to go through [Insert Ludicrously Large Number Here] files and "annotate" them using XML to fit the new system?

    You would need an enormous workforce to do that.

    And if they don't plan on doing that, what about all the existing information? Is it going to be excluded from the database? Seems like much of a waste to me!

    Damn, but I would love to have access to one of these, even if the amount of information available will be minuscule (relatively speaking) for the next few years.

    1. Re:What about Existing Data? by Ronald+Dumsfeld · · Score: 5, Funny
      Are you telling me that there are programmers willing to go through [Insert Ludicrously Large Number Here] files and "annotate" them using XML to fit the new system?

      No, they're writing software to put in the XML tags.

      What will be more interesting to see is if it's possible to pollute the database by putting in your own XML. Instead of Google-Bombing we'll have people pissing in the WebFountain.
      --
      Where's the Kaboom?
      There's supposed to be an Earth-shattering Kaboom.
  10. Entirely unsuited by happyfrogcow · · Score: 3, Insightful

    From the article, "But many online information sources are entirely unsuited to the XML model--for example, personal Web pages, e-mails, postings to newsgroups, and conversations in chat rooms."

    entirely unsuited? chrissake. email, unsuited. newsgroups, unsuited. chat rooms, unsuited. If personal home pages are unsuited, then so are corporate home pages, as there is nothing inherently different about the two. All this from an IEEE article... I would have thought them to be more accurate and less misleading. I could put <popularmusic>Pink</popularmusic> in my HTML as easily as Amazon could in theirs.

    HTML is based on the XML model. HTML is used to create personal web pages. How on earth then, could personal web pages be "entirely unsuited to the XML model"?

    1. Re:Entirely unsuited by orac2 · · Score: 4, Insightful

      Disclaimer: I'm the author of the article.

      Most people don't and won't tag as they go. (Except for those of us used to writing HTML-enabled comments on /. of course). Also, in order to be able to write <popularmusic>Pink</popularmusic>, and have it make sense, you'd have to be following a DTD.

      As anyone who's been involved in DTD formulation can attest, even for internal documentation, it can be a royal pain in the butt. I don't think the vast majority of on-line rapid content generators (all those bloggers, emailers, chatters) will ever use XML to routinely tag their content manually. The article isn't talking about machine generated or commercial content, like Amazon's, but the day to day stuff that gets put up in the time it takes to write it and click submit, and which is of most interest to market researchers.

      --
      "Just once, I'd like to meet an alien menace that wasn't immune to bullets." -- The Brigadier, Dr. Who
  11. One Net to Rule Them All by null+etc. · · Score: 5, Insightful
    It would be nice if, in parallel to the Internet, another network was developed to hold only semantically organized knowledge. That network would be free of marketing and commercial business, and would ostensibly be the largest repository of organized knowledge on the planet. Think Internet2, based entirely in XML.

    Similar to HTML's current weakness in separating presentation from content, the web today has a weakness in separating content sites from sales sites. Do a search in Google, especially for programming or technical topics, and you're more likely to retrieve 100 links to online stores selling a book on that topic than to find actual content regarding that topic. This inability to separate queries for knowledge from queries for product sales literature is especially frustrating for scientists and programmers. I think Google is taking a step towards this with Froogle, meaning that if Froogle becomes popular enough, it's possible that Google will strip marketing pages from their search results.

    Worse even, is when someone registers a thousand domains (plumbing-supplies-store.com, plumb-superstore-supplies.com, all-plumbing-supplies.com, etc) and posts the same marketing page content ("Buy my plumbing supplies!") on each domain. A search on Google will then retrieve 100 separate links containing the same identical garbage. You would think that Google could detect this "marketing domain spam" and reduce the relevancy of such search results.
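A minimal sketch of the simplest form of that detection, assuming exact duplicates after whitespace and case normalization. This is not how Google actually handles it; the function names and the `example.org` page are made up for illustration, and real near-duplicate detection would need something fuzzier than a plain hash.

```python
import hashlib
import re
from collections import defaultdict


def fingerprint(text: str) -> str:
    """Hash page text after lowercasing and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", text.lower()).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def group_duplicates(pages: dict[str, str]) -> list[list[str]]:
    """Return groups of URLs whose normalized text is identical."""
    groups = defaultdict(list)
    for url, text in pages.items():
        groups[fingerprint(text)].append(url)
    # Only groups with more than one URL count as duplicates.
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]


# Hypothetical crawl results: two domains serving the same marketing page.
pages = {
    "http://plumbing-supplies-store.com/": "Buy my plumbing supplies!",
    "http://all-plumbing-supplies.com/": "Buy  my plumbing\nsupplies!",
    "http://example.org/pipes": "A short history of copper pipe.",
}
duplicate_groups = group_duplicates(pages)
```

A ranking system could then keep one URL per group and demote the rest; which one to keep is a separate policy question.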

    Anyways, I can't complain, because I can find nearly anything on the web I need, compared to 10 years ago.

  12. i.e. nameprotect by joeldg · · Score: 3, Interesting

    nameprotect does something similar, except they are looking for people violating copyrights.
    in addition I think they might be one of the most banned bots online.

    anyway, their users are all corporate entities who pay a lot of money to be able to auto-cease and desist copyright infringers..

    These same companies will pay IBM to tell them that since their cease and desist spree everyone hates them.

  13. Like NorthernLight? by dpbsmith · · Score: 4, Informative

    This sounds very similar to NorthernLight.

    NorthernLight was (it still exists, but apparently is not available to the nonpaying public at all) a search engine that displayed its results automatically sorted into as many as fifteen or twenty categories, automatically generated on the basis of the search. (For some reason, they called these categories "custom search folders.")

    Since it's no longer available to the public I can't give a concrete example. I can't test it to see whether a search on "Pink" creates a couple of folders labelled "Singer" and "Color," for example. But that's exactly the sort of thing it does/did.

    I actually would have used NorthernLight as one of my routine search engines--it worked quite well--had it not been for another major annoyance: in the publicly available version, it always searched both publicly available Web pages and a number of fee-based private databases, so whatever you searched for, the majority of the results were in the fee-based databases and I would have had to pay money to see what they were. In other words, it was heavy-handed promotion of their paid services and had only limited utility to those who did not wish to buy them.

  14. Gaming Webfountain by G4from128k · · Score: 3, Interesting

    I wonder how long it will take sleazy e-commerce sites and p0rn sites to game WebFountain and turn it into SpamFountain?

    I suspect that this tool (and any like it) must make a core assumption -- that each webpage is about one semantic thing and that the creators are trying to communicate that one thought. In contrast, people who try to boost their page rank have no compunction about misleading people (or algorithms). Clever tagging and misleading verbiage should be able to fool IBM's analyzer into clustering a site where it does not belong (but where the site owner wants it). The result is pages that look like they are about one thing (some popular search term) while being about something else (selling their junk or porn).

    Next will come high-priced consultants that tell you how to make your site place highly on WebFountain (like the ones that currently game Google).

    --
    Two wrongs don't make a right, but three lefts do.
  15. How long before people start gaming the system? by dpbsmith · · Score: 4, Interesting

    As Google has discovered, it's only possible for simple heuristics and algorithms to "understand" the human content on the Web for as long as it doesn't matter.

    As soon as people become aware that Google or WebFountain or whatever is trying to evaluate web content, immediately they will begin trying to reverse-engineer and subvert the algorithms and heuristics that are used.

    And the stakes are much higher for gaming WebFountain than for gaming Google.

    For example, I'd imagine there would be big money for anyone who could convince companies that they know how to make it appear that a particular movie/song/toy/computer was "hot," so that the WebFountain-using Walmarts and Best Buys of the world would stock more of it.

    WebFountain will work well only until it is actually introduced.

  16. "Is this web site selling something"? by Animats · · Score: 3, Insightful
    Search engine spiders need to understand more about sites. Things like this:
    • The site is selling something.
    • The page is composed of multiple unrelated articles or ads, each one of which should be viewed as a separate entity for search purposes.
    • The page is part of a blog.
    • Content on this site duplicates that found on other sites.
    • The site is owned by an organization with a known Dun and Bradstreet number. (If a site is selling something, and its Whois info doesn't match the DNB corporation database, it should be downgraded in search position. This would encourage honest Whois info.)
  17. SCO by Zork+the+Almighty · · Score: 4, Funny

    IEEE reports that the first commercial use will be to track public opinion for companies.

    Searching "SCO"
    Found "Slashdot"
    ERROR arithmetic underflow.

    --

    In Soviet America the banks rob you!
  18. CrapFountain by s4m7 · · Score: 4, Funny

    Here's how it works:

    Executive Bob, who's paid IBM $150,000 for his enterprise license of WebFountain, enters into his WebFountain search box: "Pink the musician, not the color"

    IBM's powerful software parses this command into "pink music -color" and passes it to google, retrieves the results, removes Google's paid ads and replaces them with IBM's paid ads. The content is then served to Executive Bob, who shouts: "EUREKA" since within the top ten search results he finds "NUDE PICTURES OF RAPPER PINK!"

    IBM then lands a lucrative support contract with Executive Bob to remove all the viruses and spyware from his desktop PC. Rinse and Repeat.

    --
    This comment is fully compliant with RFC 527.
  19. Re:Echelon? by orac2 · · Score: 3, Insightful

    Disclaimer: I'm the author of the article.

    I know, from talking to the WebFountain team that they're very sensitive to privacy concerns. WebFountain obeys robots.txt and doesn't archive material which has vanished from the publicly visible web (if only for reasons of storage capacity!).
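For anyone unsure what "obeys robots.txt" means for a crawler, here is a small sketch using Python's standard `urllib.robotparser`. It only shows the general mechanism, not WebFountain's own code; the crawler name, the `example.org` URLs, and the robots.txt contents are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a site owner might publish.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text directly, without fetching it over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A polite crawler asks before fetching each URL.
allowed_public = parser.can_fetch("ExampleCrawler", "http://example.org/members.html")
allowed_private = parser.can_fetch("ExampleCrawler", "http://example.org/private/roll.html")
```

Note that robots.txt is a request, not access control: anything under `/private/` is still publicly reachable by a crawler that chooses to ignore the file.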

    The point is that all the information that feeds into IBM is already publicly available. If someone wanted to go after Green Party members and if the Green Party posted its membership roll on a webserver, I think they'd be able to get it, WebFountain or no.

    Of course, I suppose WebFountain could be used to construct a membership list by scanning people's home pages to find out if they say that they're a member, but again this is publicly declared information.

    Bottom line, as always: if you don't want it generally accessible to all, don't put it on a public web server.

    --
    "Just once, I'd like to meet an alien menace that wasn't immune to bullets." -- The Brigadier, Dr. Who
  20. Half a football field? by AndroidCat · · Score: 4, Interesting
    (Imperial or metric football fields?)
    IBM's breakthrough is called WebFountain--half a football field's worth of rack-mounted processors, routers, and disk drives running a huge menagerie of programs.
    Later:
    It uses a cluster of thirty 2.4-GHz Intel Xeon dual-processor computers running Linux to crawl as much of the general Web as it can find at least once a week.

    To ensure that WebFountain's finger is constantly on the pulse of the Internet, an additional suite of similar computers is dedicated to crawling important but volatile Web sites, such as those hosting blogs, at least once a day. Other machines maintain access to popular non-Web-based sources, such as Usenet (a newsgroup service that predates the Web) and the Internet Relay Chat system, known as IRC. The data is then passed into WebFountain's main cluster of computers, currently composed of 32 server racks connected via gigabit Ethernet. Each rack holds eight Xeon dual-processor computers and is equipped with about 4-5 terabytes of disk storage.

    That's a lot of stuff, but half a football field? Possibly they're including cubicles for the staff or did they just inherit some old Big Iron space that was that large?
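A back-of-the-envelope tally of the quoted figures, as a rough sketch. The rack, machine, and storage counts come from the article text above; the field size is an assumption (an American football field including end zones, 360 ft by 160 ft), since the article doesn't say which kind of field it means.

```python
# Figures quoted from the article.
RACKS = 32
MACHINES_PER_RACK = 8
CPUS_PER_MACHINE = 2           # dual-processor Xeons
TB_PER_RACK_RANGE = (4, 5)     # "about 4-5 terabytes" per rack
CRAWLER_MACHINES = 30          # the weekly crawl cluster

# Assumption: American football field with end zones, in feet.
FIELD_LENGTH_FT = 360
FIELD_WIDTH_FT = 160

main_machines = RACKS * MACHINES_PER_RACK
main_cpus = main_machines * CPUS_PER_MACHINE
storage_tb = tuple(RACKS * tb for tb in TB_PER_RACK_RANGE)

half_field_sq_ft = FIELD_LENGTH_FT * FIELD_WIDTH_FT / 2
sq_ft_per_rack = half_field_sq_ft / RACKS

print(f"main cluster: {main_machines} machines, {main_cpus} CPUs")
print(f"storage: {storage_tb[0]}-{storage_tb[1]} TB")
print(f"half a field: {half_field_sq_ft:.0f} sq ft, {sq_ft_per_rack:.0f} sq ft per rack")
```

Under that assumption, the 32 main racks would each get hundreds of square feet, so the floor-space figure must cover a good deal more than the main cluster (the crawler machines, the daily-crawl suite, cooling, aisles, or staff space).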
    --
    One line blog. I hear that they're called Twitters now.
  21. Prior art :o) by Mr_Silver · · Score: 3, Funny
    IEEE reports that the first commercial use will be to track public opinion for companies

    You can do that already with Google:

    A search for "Microsoft is evil" gets you 600,000 pages.

    A search for "Microsoft is good" gets you 3,590,000 pages.

    Therefore Microsoft is more good than evil.

    Err ... that wasn't quite the answer I was expecting.

    (cue sounds of joke falling apart...)

    --
    Avantslash - View Slashdot cleanly on your mobile phone.
  22. It already exists by claudebbg · · Score: 3, Interesting

    I've already seen/heard of such systems, basically in the Business Intelligence field.
    In England, a system like Autonomy (used by the police at the beginning) can crawl a mass of information with dedicated spiders (not only for the web, but also commercial databases, files...). Then, it structures all the content into themes with links and proximity.
    I personally tested it some years ago, feeding it with information websites and asking for articles "close to" a given one. The efficiency was amazing because it was able to tell apart similar terms that have really different meanings depending on the context. Usually, search engines get this wrong because they can't use the context.
    I also set up some "agents" for recurrent searches (an agent is basically a search plus some training, letting Autonomy know which found documents are close and which are not) and it was able to propose a really good press review every day with nearly no wrong documents.
    As a complement to Autonomy, I know a BI team that uses some other tools like Pericles to feed the searches with "relevant" content, basically themes that are "appearing" in the group of documents and are close to some interests.
    Such BI tools can already provide the kind of information cited, like an opinion movement against a company detected in the newsgroups or on some websites. And IBM is certainly on track to improve such tools with the techniques from their labs.
    I hope these tools won't be limited to PR articles on the web and/or private use by big corporations, because it could only be another Echelon with all its bad consequences:
    - bad use of public information
    - paranoia fed by unfounded scares
    - public/corp. power against the citizens
    If tools like Echelon could be used by everybody, they would have to leave much more privacy to citizens, and the public leaders would have to explain the investments.