IBM vs. Content Chaos

Posted by ryuzaki0 on Monday January 12, 2004 @04:42AM from the help-me-find-directions-to-p4r1s-h1l70n dept.

ps writes "IBM's Almaden Research Center has been featured for their continued work on "Web Fountain", a huge system to turn all the unstructured info on the web into structured data. (Is "pink" the singer or the color?) IEEE reports that the first commercial use will be to track public opinion for companies." It looks like its feeding ground is primarily the public Internet, but it can be fed private information as well.

12 of 216 comments

Send link to Google by Urkki · 2004-01-12 04:47 · Score: 4, Insightful

They could certainly use these kinds of techniques to improve their results...

Then again, in a way they already use something like this, except they're only really concerned about links, not actual contents of pages...
corporate meddling by commo1 · 2004-01-12 04:54 · Score: 3, Insightful

One of my main concerns with search databases is the inherent ability for corporations to increase their visibility on the web by manipulating data to their benefit to bring their corporate page up first on the list. I wonder if there is a way for the database to have a scoring system based on the validity of the data: is the information there, or are there just highly developed metatags doing the work? If you do a search for a specific part number for an HP product, what are the chances of getting a) the HP home page, where a further search would be necessary to find any relevant info, or b) the big chains like Staples or Circuit City, who just want to sell you cartridges and have the time and resources to steer you in the right direction? How would the system be regulated (kinda like Slashdot mods :P)? Who watches the watchers, and can information validity be electronically implemented? What kind of AI would be necessary?
Re:All we need... by geoffspear · 2004-01-12 04:54 · Score: 1, Insightful

Oh yes, because there's such an enormous shortage of programmers right now. IBM should lay off all of these programmers so Microsoft will have a pool of available programmers who know nothing about OS security to work on security.
And once all the game producers, who make a product we definitely don't "need" get rid of all of their programmers, there will be plenty of free people to work on anti-spam technology. Whee!

--
Don't blame me; I'm never given mod points.
Entirely unsuited by happyfrogcow · 2004-01-12 04:54 · Score: 3, Insightful

From the article, "But many online information sources are entirely unsuited to the XML model--for example, personal Web pages, e-mails, postings to newsgroups, and conversations in chat rooms."

entirely unsuited? chrissake. email, unsuited. newsgroups, unsuited. chat rooms, unsuited. If personal home pages are unsuited, then so are corporate home pages, as there is nothing inherently different about the two. All this from an IEEE article... I would have thought them to be more accurate and less misleading. I could put <popularmusic>Pink</popularmusic> in my HTML as easily as Amazon could in theirs.

HTML is based on the XML model. HTML is used to create personal web pages. How on earth then, could personal web pages be "entirely unsuited to the XML model"?
1. Re:Entirely unsuited by orac2 · 2004-01-12 05:10 · Score: 4, Insightful
  
  Disclaimer: I'm the author of the article.
  
  Most people don't and won't tag as they go. (Except for those of us used to writing HTML-enabled comments on /. of course). Also, in order to be able to write <popularmusic>Pink</popularmusic>, and have it make sense, you'd have to be following a DTD.
  
  As anyone who's been involved in DTD formulation can attest, even for internal documentation, it can be a royal pain in the butt. I don't think the vast majority of on-line rapid content generators (all those bloggers, emailers, chatters) will ever use XML to routinely tag their content manually. The article isn't talking about machine generated or commercial content, like Amazon's, but the day to day stuff that gets put up in the time it takes to write it and click submit, and which is of most interest to market researchers.
  
  --
  "Just once, I'd like to meet an alien menace that wasn't immune to bullets." -- The Brigadier, Dr. Who
One Net to Rule Them All by null+etc. · 2004-01-12 05:03 · Score: 5, Insightful

It would be nice if, in parallel to the Internet, another network was developed to hold only semantically organized knowledge. That network would be free of marketing and commercial business, and would ostensibly be the largest repository of organized knowledge on the planet. Think Internet2, based entirely in XML.
Similar to HTML's current weakness in separating presentation from content, the web today has a weakness in separating content sites from sales sites. Do a search in Google, especially for programming or technical topics, and you're more likely to retrieve 100 links to online stores selling a book on that topic than to find actual content regarding that topic. This lack of ability to separate queries for knowledge versus queries for product sales literature is especially frustrating for scientists and programmers. I think Google is taking a step towards this with Froogle, meaning that if Froogle becomes popular enough, it's possible that Google will strip marketing pages from their search results.
Worse even, is when someone registers a thousand domains (plumbing-supplies-store.com, plumb-superstore-supplies.com, all-plumbing-supplies.com, etc) and posts the same marketing page content ("Buy my plumbing supplies!") on each domain. A search on Google will then retrieve 100 separate links containing the same identical garbage. You would think that Google could detect this "marketing domain spam" and reduce the relevancy of such search results.
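A minimal sketch of the kind of duplicate detection described above, assuming the search engine already has the text of each page. It only catches pages that are identical after whitespace and case are normalized; the function names and the sample pages are illustrative, and nothing here reflects how Google actually handles duplicates.

```python
import hashlib
import re
from collections import defaultdict


def fingerprint(page_text: str) -> str:
    """Hash the page text after lowercasing and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", page_text.lower()).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def group_duplicates(pages: dict[str, str]) -> list[list[str]]:
    """Return groups of two or more domains that serve the same text."""
    by_hash = defaultdict(list)
    for domain, text in pages.items():
        by_hash[fingerprint(text)].append(domain)
    return [sorted(domains) for domains in by_hash.values() if len(domains) > 1]


# Hypothetical crawl results: two marketing domains, one content page.
pages = {
    "plumbing-supplies-store.com": "Buy my plumbing supplies!",
    "all-plumbing-supplies.com": "Buy  my plumbing   supplies!",
    "example.org": "How to solder a copper pipe joint.",
}
duplicate_groups = group_duplicates(pages)
print(duplicate_groups)
```

A ranking stage could then keep one result per group and demote the rest. Near-duplicates with small wording changes would need a fuzzier comparison than an exact hash.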
Anyways, I can't complain, because I can find nearly anything on the web I need, compared to 10 years ago.
Re:All we need... by millahtime · 2004-01-12 05:04 · Score: 5, Insightful

There are many organizations that need better ways to analyze their info. There are databases that are terabytes in size and have to support detailed searches. With SQL databases that can take a long time, and any faster way can save a lot of time and money. There is a big need for this technology across many industries.

--
Evolution or ID?
Re:All we need... by xyzzy · 2004-01-12 05:20 · Score: 5, Insightful

That's really funny that you mention "spam filters", since that is exactly the content categorization task that you are talking about.

Automatic categorization of overflowing data is exactly what you need to do when you have too much to think about -- it allows you to triage your attention span, which is the most limited resource you have.
"Is this web site selling something"? by Animats · 2004-01-12 05:22 · Score: 3, Insightful
Search engine spiders need to understand more about sites. Things like this:
- The site is selling something.
- The page is composed of multiple unrelated articles or ads, each one of which should be viewed as a separate entity for search purposes.
- The page is part of a blog.
- Content on this site duplicates that found on other sites.
- The site is owned by an organization with a known Dun and Bradstreet number. (If a site is selling something, and its Whois info doesn't match the DNB corporation database, it should be downgraded in search position. This would encourage honest Whois info.)
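As a rough sketch, the signals in the list above could feed a ranking adjustment like the one below. The field names and the penalty factors are made up for illustration; how a spider would actually detect each signal is the hard part and is not shown.

```python
from dataclasses import dataclass


@dataclass
class SiteSignals:
    """Per-site facts a spider is assumed to have already worked out."""
    sells_something: bool = False
    is_blog: bool = False  # recorded only; not used in ranking here
    duplicates_other_sites: bool = False
    whois_matches_dnb: bool = False


def rank_adjustment(signals: SiteSignals) -> float:
    """Return a multiplier to apply to a page's base relevance score."""
    factor = 1.0
    if signals.duplicates_other_sites:
        factor *= 0.5  # hypothetical penalty for duplicated content
    if signals.sells_something and not signals.whois_matches_dnb:
        factor *= 0.25  # hypothetical penalty for sellers with unverifiable Whois
    return factor


verified_store = SiteSignals(sells_something=True, whois_matches_dnb=True)
anonymous_store = SiteSignals(sells_something=True)
print(rank_adjustment(verified_store), rank_adjustment(anonymous_store))
```

Only sites that sell something are penalized for a Whois mismatch, so a personal page with no company behind it is left alone.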
Re:Echelon? by orac2 · 2004-01-12 05:26 · Score: 3, Insightful

Disclaimer: I'm the author of the article.

I know, from talking to the WebFountain team, that they're very sensitive to privacy concerns. WebFountain obeys robots.txt and doesn't archive material which has vanished from the publicly visible web (if only for reasons of storage capacity!).
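For readers who haven't seen it, this is roughly what obeying robots.txt means for any crawler. The sketch uses Python's standard `urllib.robotparser`; the crawler name, URLs, and rules are invented, and it says nothing about how WebFountain itself is implemented.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that keeps all crawlers out of /private/.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# "ExampleCrawler" is a placeholder user-agent string.
allowed = parser.can_fetch("ExampleCrawler", "http://example.com/index.html")
blocked = parser.can_fetch("ExampleCrawler", "http://example.com/private/members.html")
print(allowed, blocked)
```

A well-behaved crawler runs a check like this before every fetch and skips any URL for which `can_fetch` returns `False`.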

The point is that all the information that feeds into IBM is already publicly available. If someone wanted to go after Green Party members and if the Green Party posted its membership roll on a webserver, I think they'd be able to get it, WebFountain or no.

Of course, I suppose WebFountain could be used to construct a membership list by scanning people's home pages to find out if they say that they're a member, but again this is publicly declared information.

Bottom line, as always: if you don't want it generally accessible to all, don't put it on a public web server.

--
"Just once, I'd like to meet an alien menace that wasn't immune to bullets." -- The Brigadier, Dr. Who
Potential money saver: Differential buzz by benja · 2004-01-12 05:42 · Score: 2, Insightful

The head of a research and development department could feed WebFountain all the e-mails, reports, PowerPoint presentations, and so on that her employees produced in the last six months. From this, WebFountain could give her a list of technologies that the department was paying attention to. She could then compare this list to the technologies in her sector that were creating a buzz online. Discrepancies between the two lists would be worth asking her managers about, allowing her to know whether or not the department was ahead of the market or falling dangerously behind.
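The comparison of the two lists can be sketched as a pair of set differences. This toy version just counts mentions of terms from a hand-made watchlist, which is far cruder than the text analysis the article describes; the watchlist, the sample documents, and the function name are all invented for illustration.

```python
from collections import Counter


def top_terms(documents: list[str], watchlist: set[str], n: int = 10) -> set[str]:
    """Return up to n watchlist terms mentioned most often in the documents."""
    counts = Counter()
    for doc in documents:
        for word in doc.lower().split():
            word = word.strip(".,;:!?\"'()")  # drop surrounding punctuation
            if word in watchlist:
                counts[word] += 1
    return {term for term, _ in counts.most_common(n)}


watchlist = {"xml", "grid", "rfid", "wifi"}
internal = ["Status report: XML schema work continues.", "Grid prototype demo on Friday."]
online = ["Everyone is talking about RFID tags.", "RFID and XML at the trade show.", "WiFi rollout news."]

internal_focus = top_terms(internal, watchlist)
online_buzz = top_terms(online, watchlist)

missing_internally = online_buzz - internal_focus  # buzz the department is not tracking
not_buzzing = internal_focus - online_buzz         # internal focus with little outside buzz
print(sorted(missing_internally), sorted(not_buzzing))
```

The two difference sets are the "discrepancies" worth asking about: the first is where the department may be falling behind, the second is where it may be ahead of the market or off on its own.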
This is a potentially very useful money-saver. Currently companies employ hordes of middle-management people who do little else than detect discrepancies between the technologies that their department is focusing on and those that are currently all the buzz. Now we can create an automatic boss that sends out e-mails like, "What's this IP-over-XML thing and why don't we use it and how soon can you have all our critical systems migrated to it?"
Scanalyzer by Anonymous Coward · 2004-01-12 05:49 · Score: 1, Insightful

Reminds me of the Scanalyzer service in John Brunner's book "Stand On Zanzibar." The supercomputer Shalmaneser analyzed millions of inputs and tried to make sense of them.