IBM vs. Content Chaos

Posted by ryuzaki0 on Monday January 12, 2004 @04:42AM from the help-me-find-directions-to-p4r1s-h1l70n dept.

ps writes "IBM's Almaden Research Center has been featured for their continued work on "Web Fountain", a huge system to turn all the unstructured info on the web into structured data. (Is "pink" the singer or the color?) IEEE reports that the first commercial use will be to track public opinion for companies." It looks like its feeding ground is primarily the public Internet, but it can be fed private information as well.

17 of 216 comments

I think a better question... by bc90021 · 2004-01-12 04:45 · Score: 5, Funny

...doesn't concern whether "Pink" is a colour or a singer, but whether "Paris Hilton" is a hotel in France or an oft downloaded video... ;)

--
libertarianswag.com
Send link to Google by Urkki · 2004-01-12 04:47 · Score: 4, Insightful

They could certainly use these kinds of techniques to improve their results...

Then again, in a way they already use something like this, except they're only really concerned about links, not actual contents of pages...
structure... by Rhubarb Crumble · 2004-01-12 04:47 · Score: 5, Funny

a huge system to turn all the unstructured info on the web into structured data
In order to do this, they will use a scheme by which each document is referred to by a string including the transfer protocol, the host name, and a file path.
oh, wait...
Expensive by starvingcodeartist · 2004-01-12 04:51 · Score: 4, Interesting

In the article it says they plan on charging between $150,000 and $300,000 a year to use this super-search engine. They think corporate execs will pay for it. Seems really steep to me. BUT, for corporate execs, it's probably not too expensive. They'll just outsource another 10-15 programming jobs to India to pay for it.
1. Re:Expensive by orac2 · 2004-01-12 04:56 · Score: 4, Interesting
  
  The point is that it's not intended for use as a search engine, but a platform for doing computation intensive data mining and analysis. A search engine can tell you how many mentions of IBM appear on the web, but not how people feel about IBM.
  
  --
  "Just once, I'd like to meet an alien menace that wasn't immune to bullets." -- The Brigadier, Dr. Who
What about Existing Data? by ParadoxicalPostulate · 2004-01-12 04:54 · Score: 4, Interesting

Are you telling me that there are programmers willing to go through [Insert Ludicrously Large Number Here] files and "annotate" them using XML to fit the new system?

You would need an enormous workforce to do that.

And if they don't plan on doing that, what about all the existing information? Is it going to be excluded from the database? Seems like much of a waste to me!

Damn, but I would love to have access to one of these, even if the amount of information available will be minuscule (relatively speaking) for the next few years.
1. Re:What about Existing Data? by Ronald Dumsfeld · 2004-01-12 05:11 · Score: 5, Funny
  
  Are you telling me that there are programmers willing to go through [Insert Ludicrously Large Number Here] files and "annotate" them using XML to fit the new system?
  
  No, they're writing software to put in the XML tags.
  
  What will be more interesting to see is if it's possible to pollute the database by putting in your own XML. Instead of Google-Bombing we'll have people pissing in the WebFountain.
  
  --
  Where's the Kaboom?
  There's supposed to be an Earth-shattering Kaboom.
Re:Get this setup by orac2 · 2004-01-12 05:01 · Score: 4, Informative

Although the article didn't have room to go into this point (and I should know, I'm the author), IBM can completely compartmentalize competitors' data, even if hosted in house (IBM already does this in other parts of its business). If companies are still wary, they can host the data themselves and let WebFountain troll it on a need-to-know basis.

--
"Just once, I'd like to meet an alien menace that wasn't immune to bullets." -- The Brigadier, Dr. Who
One Net to Rule Them All by null etc. · 2004-01-12 05:03 · Score: 5, Insightful

It would be nice if, in parallel to the Internet, another network was developed to hold only semantically organized knowledge. That network would be free of marketing and commercial business, and would ostensibly be the largest repository of organized knowledge on the planet. Think Internet2, based entirely in XML.
Similar to HTML's current weakness in separating presentation from content, the web today has a weakness in separating content sites from sales sites. Do a search in Google, especially for programming or technical topics, and you're more likely to retrieve 100 links to online stores selling a book on that topic than to find actual content regarding that topic. This inability to separate queries for knowledge from queries for product sales literature is especially frustrating for scientists and programmers. I think Google is taking a step towards this with Froogle, meaning that if Froogle becomes popular enough, it's possible that Google will strip marketing pages from their search results.
Worse even, is when someone registers a thousand domains (plumbing-supplies-store.com, plumb-superstore-supplies.com, all-plumbing-supplies.com, etc) and posts the same marketing page content ("Buy my plumbing supplies!") on each domain. A search on Google will then retrieve 100 separate links containing the same identical garbage. You would think that Google could detect this "marketing domain spam" and reduce the relevancy of such search results.
Anyways, I can't complain, because I can find nearly anything on the web I need, compared to 10 years ago.
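
The "marketing domain spam" case described above can, in principle, be caught by fingerprinting page text and grouping domains that serve the same fingerprint. What follows is only a minimal sketch under stated assumptions: the `pages` mapping, the sample domains' contents, and the helper names are hypothetical, and nothing here describes how Google actually ranks or de-duplicates results.

```python
import hashlib
from collections import defaultdict


def fingerprint(text: str) -> str:
    """Hash page text after collapsing whitespace and lowercasing."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def group_duplicates(pages: dict[str, str]) -> list[list[str]]:
    """Return groups of two or more domains serving identical normalized text."""
    groups = defaultdict(list)
    for domain, text in pages.items():
        groups[fingerprint(text)].append(domain)
    return [sorted(domains) for domains in groups.values() if len(domains) > 1]


# Hypothetical sample data, for illustration only.
pages = {
    "plumbing-supplies-store.com": "Buy my plumbing supplies!",
    "all-plumbing-supplies.com": "Buy  my plumbing SUPPLIES!",
    "example.org": "How to sweat a copper pipe joint.",
}
print(group_duplicates(pages))
```

An exact hash only catches verbatim copies (after normalization); near-duplicates with small wording changes would need a fuzzier comparison.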
Re:All we need... by millahtime · 2004-01-12 05:04 · Score: 5, Insightful

There are many organizations that need better ways to analyze their info. There are databases that are terabytes in size and have to do detailed searches. With SQL databases that can take a long time and any faster way can save a lot of time and money. There is a big need for this technology across many industries.

--
Evolution or ID?
Re:Entirely unsuited by orac2 · 2004-01-12 05:10 · Score: 4, Insightful

Disclaimer: I'm the author of the article.

Most people don't and won't tag as they go. (Except for those of us used to writing HTML-enabled comments on /. of course). Also, in order to be able to write <popularmusic>Pink</popularmusic>, and have it make sense, you'd have to be following a DTD.

As anyone who's been involved in DTD formulation can attest, even for internal documentation, it can be a royal pain in the butt. I don't think the vast majority of on-line rapid content generators (all those bloggers, emailers, chatters) will ever use XML to routinely tag their content manually. The article isn't talking about machine generated or commercial content, like Amazon's, but the day to day stuff that gets put up in the time it takes to write it and click submit, and which is of most interest to market researchers.
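
To make the tagging point concrete, here is a rough sketch of what consuming hand-tagged text might look like. The `ALLOWED_TAGS` set is a hypothetical stand-in for the vocabulary a DTD would declare (Python's standard-library parser does not validate against a DTD), and the `<comment>` and `<color>` element names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary standing in for what a DTD would declare.
ALLOWED_TAGS = {"comment", "popularmusic", "color"}


def tagged_entities(xml_text: str) -> list[tuple[str, str]]:
    """Return (tag, text) pairs for nested elements, rejecting unknown tags."""
    root = ET.fromstring(xml_text)
    found = []
    for elem in root.iter():  # yields the root first, then descendants in order
        if elem.tag not in ALLOWED_TAGS:
            raise ValueError(f"tag not in vocabulary: {elem.tag}")
        if elem is not root:
            found.append((elem.tag, elem.text or ""))
    return found


doc = (
    "<comment>I like <popularmusic>Pink</popularmusic> "
    "more than <color>pink</color>.</comment>"
)
print(tagged_entities(doc))
```

Without an agreed vocabulary, a tag like `<popularmusic>` is just a string; the check against `ALLOWED_TAGS` is the part that everyone writing content would have to agree on in advance.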

--
"Just once, I'd like to meet an alien menace that wasn't immune to bullets." -- The Brigadier, Dr. Who
Like NorthernLight? by dpbsmith · 2004-01-12 05:11 · Score: 4, Informative

This sounds very similar to NorthernLight.

NorthernLight was (it still exists, but apparently is not available to the nonpaying public at all) a search engine that displayed its results automatically sorted into as many as fifteen or twenty categories, automatically generated on the basis of the search. (For some reason, they called these categories "custom search folders.")

Since it's no longer available to the public I can't give a concrete example. I can't test it to see whether a search on "Pink" creates a couple of folders labelled "Singer" and "Color," for example. But that's exactly the sort of thing it does/did.

I actually would have used NorthernLight as one of my routine search engines (it worked quite well) had it not been for another major annoyance: in the publicly available version, it always searched both publicly available Web pages and a number of fee-based private databases, so whatever you searched for, the majority of the results were in the fee-based databases and I would have had to pay money to see what they were. In other words, it was heavy-handed promotion of their paid services and had only limited utility to those who did not wish to buy them.

--
"How to Do Nothing," kids activities, back in print!
Re:All we need... by xyzzy · 2004-01-12 05:20 · Score: 5, Insightful

That's really funny that you mention "spam filters", since that is exactly the content categorization task that you are talking about.

Automatic categorization of overflowing data is exactly what you need to do when you have too much to think about -- it allows you to triage your attention span, which is the most limited resource you have.
How long before people start gaming the system? by dpbsmith · 2004-01-12 05:21 · Score: 4, Interesting

As Google has discovered, it's only possible for simple heuristics and algorithms to "understand" the human content on the Web for as long as it doesn't matter.

As soon as people become aware that Google or WebFountain or whatever is trying to evaluate web content, immediately they will begin trying to reverse-engineer and subvert the algorithms and heuristics that are used.

And the stakes are much higher for gaming WebFountain than for gaming Google.

For example, I'd imagine there would be big money for anyone who could convince companies that they know how to make it appear that a particular movie/song/toy/computer was "hot," so that the WebFountain-using Walmarts and Best Buys of the world would stock more of it.

WebFountain will work well only until it is actually introduced.

--
"How to Do Nothing," kids activities, back in print!
SCO by Zork the Almighty · 2004-01-12 05:22 · Score: 4, Funny

IEEE reports that the first commercial use will be to track public opinion for companies.

Searching "SCO" Found "Slashdot" ERROR arithmetic underflow.

--

In Soviet America the banks rob you!
CrapFountain by s4m7 · 2004-01-12 05:25 · Score: 4, Funny

Here's how it works:

Executive Bob, who's paid IBM $150,000 for his enterprise license of WebFountain, enters into his WebFountain search box: "Pink the musician, not the color"

IBM's powerful software parses this command into "pink music -color" and passes it to google, retrieves the results, removes Google's paid ads and replaces them with IBM's paid ads. The content is then served to Executive Bob, who shouts: "EUREKA" since within the top ten search results he finds "NUDE PICTURES OF RAPPER PINK!"

IBM then lands a lucrative support contract with Executive Bob to remove all the viruses and spyware from his desktop PC. Rinse and Repeat.

--
This comment is fully compliant with RFC 527.
Half a football field? by AndroidCat · 2004-01-12 05:27 · Score: 4, Interesting

(Imperial or metric football fields?)
IBM's breakthrough is called WebFountain--half a football field's worth of rack-mounted processors, routers, and disk drives running a huge menagerie of programs.
Later:
It uses a cluster of thirty 2.4-GHz Intel Xeon dual-processor computers running Linux to crawl as much of the general Web as it can find at least once a week.
To ensure that WebFountain's finger is constantly on the pulse of the Internet, an additional suite of similar computers is dedicated to crawling important but volatile Web sites, such as those hosting blogs, at least once a day. Other machines maintain access to popular non-Web-based sources, such as Usenet (a newsgroup service that predates the Web) and the Internet Relay Chat system, known as IRC. The data is then passed into WebFountain's main cluster of computers, currently composed of 32 server racks connected via gigabit Ethernet. Each rack holds eight Xeon dual-processor computers and is equipped with about 4-5 terabytes of disk storage.
That's a lot of stuff, but half a football field? Possibly they're including cubicles for the staff or did they just inherit some old Big Iron space that was that large?
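
For a sense of scale, the figures quoted above can be totalled up. This is back-of-the-envelope arithmetic using only the numbers in the quoted passage; the variable names are mine, and the daily-crawl machines are left out because the article gives no count for them.

```python
RACKS = 32                # main cluster racks, per the quoted article
MACHINES_PER_RACK = 8     # dual-processor Xeon computers per rack
CPUS_PER_MACHINE = 2      # "dual-processor"
TB_PER_RACK = (4, 5)      # "about 4-5 terabytes" per rack (low, high)
CRAWLER_MACHINES = 30     # weekly general-Web crawl cluster

cluster_machines = RACKS * MACHINES_PER_RACK                # 32 * 8
cluster_cpus = cluster_machines * CPUS_PER_MACHINE          # machines * 2
storage_tb = tuple(RACKS * tb for tb in TB_PER_RACK)        # (low, high) in TB
counted_machines = cluster_machines + CRAWLER_MACHINES      # excludes daily crawlers

print(f"main cluster: {cluster_machines} machines, {cluster_cpus} processors")
print(f"main cluster storage: {storage_tb[0]}-{storage_tb[1]} TB")
print(f"machines with a stated count: {counted_machines}")
```

That works out to 256 machines in 32 racks for the main cluster, plus the 30 crawlers, which is why half a football field seems generous.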

--
One line blog. I hear that they're called Twitters now.