IBM vs. Content Chaos


Posted by ryuzaki0 on Monday January 12, 2004 @04:42AM from the help-me-find-directions-to-p4r1s-h1l70n dept.

ps writes "IBM's Almaden Research Center has been featured for their continued work on "WebFountain", a huge system to turn all the unstructured info on the web into structured data. (Is "pink" the singer or the color?) IEEE reports that the first commercial use will be to track public opinion for companies." It looks like its feeding ground is primarily the public Internet, but it can be fed private information as well.

6 of 216 comments

I think a better question... by bc90021 · 2004-01-12 04:45 · Score: 5, Funny

...doesn't concern whether "Pink" is a colour or a singer, but whether "Paris Hilton" is a hotel in France or an oft downloaded video... ;)

--
libertarianswag.com

structure... by Rhubarb Crumble · 2004-01-12 04:47 · Score: 5, Funny

a huge system to turn all the unstructured info on the web into structured data
In order to do this, they will use a scheme by which each document is referred to by a string including the transfer protocol, the host name, and a file path.
oh, wait...

One Net to Rule Them All by null etc. · 2004-01-12 05:03 · Score: 5, Insightful

It would be nice if, in parallel to the Internet, another network were developed to hold only semantically organized knowledge. That network would be free of marketing and commercial business, and would ostensibly be the largest repository of organized knowledge on the planet. Think Internet2, based entirely in XML.
Similar to HTML's current weakness in separating presentation from content, the web today has a weakness in separating content sites from sales sites. Do a search in Google, especially for programming or technical topics, and you're more likely to retrieve 100 links to online stores selling a book on that topic than to find actual content on it. This inability to separate queries for knowledge from queries for product sales literature is especially frustrating for scientists and programmers. I think Google is taking a step towards this with Froogle, meaning that if Froogle becomes popular enough, it's possible that Google will strip marketing pages from their search results.
Worse still is when someone registers a thousand domains (plumbing-supplies-store.com, plumb-superstore-supplies.com, all-plumbing-supplies.com, etc.) and posts the same marketing page content ("Buy my plumbing supplies!") on each domain. A search on Google will then retrieve 100 separate links containing the identical garbage. You would think that Google could detect this "marketing domain spam" and reduce the relevancy of such search results.
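As a rough illustration of how that kind of "marketing domain spam" might be detected, here is a minimal Python sketch that fingerprints each page's normalized text and groups domains serving the same content. It is not how Google or any real search engine does it; the function names, the normalization rule, and the example.org page are assumptions made up for the example.

```python
import hashlib
import re
from collections import defaultdict


def fingerprint(text):
    """Hash page text after lowercasing and collapsing punctuation/whitespace."""
    normalized = re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def group_duplicates(pages):
    """Return sorted lists of domains whose pages share one fingerprint."""
    groups = defaultdict(list)
    for domain, text in pages.items():
        groups[fingerprint(text)].append(domain)
    # Only fingerprints seen on two or more domains count as duplicates.
    return [sorted(domains) for domains in groups.values() if len(domains) > 1]


# Hypothetical crawl results: domain -> page text (example.org is invented).
pages = {
    "plumbing-supplies-store.com": "Buy my plumbing supplies!",
    "plumb-superstore-supplies.com": "Buy  my plumbing supplies!!",
    "all-plumbing-supplies.com": "BUY MY PLUMBING SUPPLIES",
    "example.org": "How to solder a copper pipe joint.",
}

duplicate_groups = group_duplicates(pages)
```

This only catches pages that are identical after normalization. Pages that differ by a few words would need a near-duplicate technique such as shingling, which is outside this sketch.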
Anyways, I can't complain, because I can find nearly anything on the web I need, compared to 10 years ago.

Re:All we need... by millahtime · 2004-01-12 05:04 · Score: 5, Insightful

There are many organizations that need better ways to analyze their info. There are databases that are terabytes in size and have to support detailed searches. With SQL databases that can take a long time, and any faster way can save a lot of time and money. There is a big need for this technology across many industries.

--
Evolution or ID?

Re:What about Existing Data? by Ronald Dumsfeld · 2004-01-12 05:11 · Score: 5, Funny

Are you telling me that there are programmers willing to go through [Insert Ludicrously Large Number Here] files and "annotate" them using XML to fit the new system?

No, they're writing software to put in the XML tags.

What will be more interesting to see is if it's possible to pollute the database by putting in your own XML. Instead of Google-Bombing we'll have people pissing in the WebFountain.

--
Where's the Kaboom?
There's supposed to be an Earth-shattering Kaboom.

Re:All we need... by xyzzy · 2004-01-12 05:20 · Score: 5, Insightful

That's really funny that you mention "spam filters", since that is exactly the content categorization task that you are talking about.

Automatic categorization of overflowing data is exactly what you need to do when you have too much to think about -- it allows you to triage your attention span, which is the most limited resource you have.