IBM vs. Content Chaos
ps writes "IBM's Almaden Research Center has been featured for their continued work on "Web Fountain", a huge system to turn all the unstructured info on the web into structured data. (Is "pink" the singer or the color?) IEEE reports that the first commercial use will be to track public opinion for companies. " It looks like its feeding ground is primarily the public Internet, but it can be fed private information as well.
How about "it can be fed"
...doesn't concern whether "Pink" is a colour or a singer, but whether "Paris Hilton" is a hotel in France or an oft-downloaded video... ;)
libertarianswag.com
The spinoff that will be used by joe sixpack net user.
There is already altogether too much "Stuff out there" for anyone to put any major effort into categorizing it. We should soon reach the point of info overload, and then what? What is the point of cataloging overflow data? Do we really need something like this? Or should we just ship a bunch of programmers wasting their time over to something else, like better spam filters and OSes without gaping security holes?
Physics is nothing like religion. If it was, we'd have an easier time trying to raise money!
They could certainly use these kinds of techniques to improve their results...
Then again, in a way they already use something like this, except they're only really concerned about links, not actual contents of pages...
In order to do this, they will use a scheme by which each document is referred to by a string including the transfer protocol, the host name, and a file path.
oh, wait...
IEEE reports that the first commercial use will be to track public opinion for companies.
Word has it the first test case will be SCO. WebFountain: "Outlook not so good"
Link to a Mirror
ThisIsAnExampleAccountGL@yahoo.com
I wonder how long until IBM sells this setup. If it works well, logistics organizations would love to get their hands on it.
Evolution or ID?
The article says they plan on charging between $150,000 and $300,000 a year to use this super-search engine. They think corporate execs will pay for it. Seems really steep to me. BUT, for corporate execs, it's probably not too expensive. They'll just outsource another 10-15 programming jobs to India to pay for it.
One of my main concerns with search databases is the inherent ability of corporations to increase their visibility on the web by manipulating data to bring their corporate page up first on the list. I wonder if there is a way for the database to have a scoring system based on the validity of the data: is the information there, or are there just highly developed metatags doing the work? If you do a search for a specific part number for an HP product, what are the chances of getting a) the HP home page, where a further search would be necessary to find any relevant info, or b) the big chains like Staples and Circuit City, who just want to sell you cartridges and have the time and resources to steer you in the right direction? How would the system be regulated (kinda like Slashdot mods :P)? Who watches the watchers, and can information validity be electronically implemented? What kind of AI would be necessary?
Information wants to be... Fuchsia!
*shrug*
e.
Build Your Own PVR/HTPC news, reviews, &
Are you telling me that there are programmers willing to go through [Insert Ludicrously Large Number Here] files and "annotate" them using XML to fit the new system?
You would need an enormous workforce to do that.
And if they don't plan on doing that, what about all the existing information? Is it going to be excluded from the database? Seems like a big waste to me!
Damn, but I would love to have access to one of these, even if the amount of information available will be minuscule (relatively speaking) for the next few years.
From the article, "But many online information sources are entirely unsuited to the XML model--for example, personal Web pages, e-mails, postings to newsgroups, and conversations in chat rooms."
entirely unsuited? chrissake. email, unsuited. newsgroups, unsuited. chat rooms, unsuited. If personal home pages are unsuited, then so are corporate home pages, as there is nothing inherently different about the two. All this from an IEEE article... I would have thought them to be more accurate and less misleading. I could put <popularmusic>Pink</popularmusic> in my HTML as easily as Amazon could in theirs.
HTML is based on the XML model. HTML is used to create personal web pages. How on earth then, could personal web pages be "entirely unsuited to the XML model"?
Meaning 1:
crazy, ridiculous
Your mom flushed your stash? that's whack!
Meaning 2:
stupid, dumb; gay
Dude, that's whack.
Source: Urban Dictionary(.com)
Some information at different paths might require cross-referencing. Thus, the scheme you propose should be extended so that there would be a way for text documents to contain links to each other.
However, if you just take a big enough storage system and download all the documents from teh intterweb, you can have a flat directory containing all the documents. Woohoo, progress!
I think, therefore thoughts exist. Ego is just an impression.
IBM should try their own website. Passport-Advantage is about the most hideous labyrinth I've ever spelunked (sp). IBM is not alone, but through sheer scale the site just screams "bueromaze".
This is the type of technology that could either ensure or derail Google's future (I'm not saying that it will, only that it could). Semantic analysis and clustering of web pages could improve search. I hope Google gets to use/create this type of tech.
Two wrongs don't make a right, but three lefts do.
This project sounds quite interesting -- it could really help out projects like Echelon to help win the war on terrorism, if it's capable of understanding other languages of course, and could possibly build a whole database of information that's intercepted from other places. All that chatter, with the codewords they use, could possibly be understood by a football field full of Linux rackmounts, and might foil something.
Of course, such power could also be horribly misused if it came into the wrong hands. What if they wanted to enumerate every member or affiliate of the "terrorist" Green Party in the case of a "national emergency?" Feed WebFountain some data from the internet, and from ECHELON, and they would have a quick blacklist.
Or corporations, for that matter, as that's who it's designed for, could quickly blacklist people from employment who were considered "dangerous" such as whistleblowers, heavily involved union members, spies, watchdogs, and so forth.
Similar to HTML's current weakness in separating presentation from content, the web today has a weakness in separating content sites from sales sites. Do a search in Google, especially for programming or technical topics, and you're more likely to retrieve 100 links to online stores selling a book on that topic than to find actual content regarding that topic. This lack of ability to separate queries for knowledge from queries for product sales literature is especially frustrating for scientists and programmers. I think Google is taking a step towards this with Froogle, meaning that if Froogle becomes popular enough, it's possible that Google will strip marketing pages from their search results.
Even worse is when someone registers a thousand domains (plumbing-supplies-store.com, plumb-superstore-supplies.com, all-plumbing-supplies.com, etc) and posts the same marketing page content ("Buy my plumbing supplies!") on each domain. A search on Google will then retrieve 100 separate links containing the same identical garbage. You would think that Google could detect this "marketing domain spam" and reduce the relevancy of such search results.
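A minimal sketch of how that kind of detection might work, assuming the crawler already has each domain's page text in memory. The function names and the sample pages are made up for illustration, and nothing here is meant to describe what Google actually does. It hashes the normalized text of each page and groups the domains that share a hash:

```python
import hashlib
from collections import defaultdict


def fingerprint(text: str) -> str:
    """Hash of the page text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def group_duplicates(pages: dict[str, str]) -> list[list[str]]:
    """Return sorted groups of domains that serve the same normalized content."""
    groups = defaultdict(list)
    for domain, text in pages.items():
        groups[fingerprint(text)].append(domain)
    # Keep only fingerprints shared by more than one domain.
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


# Hypothetical sample data: two spam domains and one real content page.
pages = {
    "plumbing-supplies-store.com": "Buy my plumbing supplies!",
    "all-plumbing-supplies.com": "Buy  my plumbing SUPPLIES!",
    "example.org": "A tutorial on soldering copper pipe.",
}
duplicates = group_duplicates(pages)
```

Exact hashing only catches pages that are identical after normalization. Pages that differ by a few words would need a fuzzier comparison (shingling, for instance), which this sketch does not attempt.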
Anyways, I can't complain, because I can find nearly anything on the web I need, compared to 10 years ago.
Ummm. No. HTML predates XML.
Best Slashdot Co
Researchers in Alabama are working on a system which converts all music on the internet into a single Menudo mp3 file. EIEIO reports the first public use will be to create a single mp3 file that results in trillions of dollars in royalties to the RIAA when traded illegally.
Brains coworker :-)
:-)
Watch cartoons all day and see your mind melt down
nameprotect does something similar, except they are looking for people violating copyrights.
in addition I think they might be one of the most banned bots online.
anyway, their users are all corporate entities who pay a lot of money to be able to auto-cease and desist copyright infringers..
These same companies will pay IBM to tell them that since their cease and desist spree everyone hates them.
anime+manga together at last.. in real time.
is the Almaden webspider (http://www.almaden.ibm.com/cs/crawler/) that's been scavenging in the dark a part of this?
WebFountain
This sounds very similar to NorthernLight.
NorthernLight was (it still exists, but apparently is not available to the nonpaying public at all) a search engine that displayed its results automatically sorted into as many as fifteen or twenty categories, automatically generated on the basis of the search. (For some reason, they called these categories "custom search folders.")
Since it's no longer available to the public I can't give a concrete example. I can't test it to see whether a search on "Pink" creates a couple of folders labelled "Singer" and "Color," for example. But that's exactly the sort of thing it does/did.
I actually would have used NorthernLight as one of my routine search engines--it worked quite well--had it not been for another major annoyance: in the publicly available version, it always searched both publicly available Web pages and a number of fee-based private databases, so whatever you searched for, the majority of the results were in the fee-based databases and I would have had to pay money to see what they were. In other words, it was heavy-handed promotion of their paid services and had only limited utility to those who did not wish to buy them.
"How to Do Nothing," kids activities, back in print!
I wonder how long it will take sleazy e-commerce sites and p0rn sites to game WebFountain and turn it into SpamFountain?
I suspect that this tool (and any like it) must make a core assumption -- that each webpage is about one semantic thing and that the creators are trying to communicate that one thought. In contrast, people who try to boost their page rank have no compunction about misleading people (or algorithms). Clever tagging and misleading verbiage should be able to fool IBM's analyzer into clustering a site where it does not belong (but where the site owner wants it). The result is a page that looks like it is about one thing (some popular search term) while being about something else (selling their junk or porn).
Next will come high-priced consultants that tell you how to make your site place highly on WebFountain (like the ones that currently game Google).
Two wrongs don't make a right, but three lefts do.
If you had read the article, you would have seen that the statement is about the fact that people are not going to spend time XML-tagging their IRC chat and every blog entry and email.
(Is "pink" the singer or the color?)
I didn't get the joke.
These are, after all, engineers. Pink is neither a color nor a singer (talented or otherwise).
To an engineer, PINK can only be an acronym.
Fire and Meat. Yummy.
this project to india.
If you can read this sig - the bitch fell off.
IBM should know that Pink was the predecessor to Taligent which was the predecessor to absolutely nothing.
Your favorite sig sucks
Why does IBM use PC hardware?
Wouldn't it make more (marketing) sense to use one of their own platforms? I guess the z-Series should be the most suited for that amount of data...
Don't get me wrong, I like getting a little web-traffic (I said a little so no /.ing please!), but when I look through my logs and see searches where Google is referring people to my site inappropriately, I almost want to scream at the mindlessness they use to categorize my web pages. On the one hand, I'm flattered, but on the other hand it's disturbingly out of context. I even put the <meta NAME="robots" CONTENT="noindex,noarchive"> line in the headers that were giving me headaches, but people still end up at my site looking for that damned lemonparty.jpg just because I mentioned it in my blog once.
Can't wait to see what the entry for SCO looks like...
My beliefs do not require that you agree with them.
"Things such as price or product identification numbers are identified by bracketing them with so-called tags, as in Deluxe Toaster , $19.95 ."
They're "tags", not "so-called tags".
Tags! Like those little things they hang on stuff at the store to tell you how much it costs. Tags.
Of course, he may have been referring to their use in a "software program".
As Google has discovered, it's only possible for simple heuristics and algorithms to "understand" the human content on the Web for as long as it doesn't matter.
As soon as people become aware that Google or WebFountain or whatever is trying to evaluate web content, immediately they will begin trying to reverse-engineer and subvert the algorithms and heuristics that are used.
And the stakes are much higher for gaming WebFountain than for gaming Google.
For example, I'd imagine there would be big money for anyone who could convince companies that they know how to make it appear that a particular movie/song/toy/computer was "hot," so that the WebFountain-using Walmarts and Best Buys of the world would stock more of it.
WebFountain will work well only until it is actually introduced.
"How to Do Nothing," kids activities, back in print!
IEEE reports that the first commercial use will be to track public opinion for companies.
Searching "SCO"
Found "Slashdot"
ERROR arithmetic underflow.
In Soviet America the banks rob you!
Here's how it works:
Executive Bob, who's paid IBM $150,000 for his enterprise license of WebFountain, enters into his WebFountain search box: "Pink the musician, not the color"
IBM's powerful software parses this command into "pink music -color" and passes it to google, retrieves the results, removes Google's paid ads and replaces them with IBM's paid ads. The content is then served to Executive Bob, who shouts: "EUREKA" since within the top ten search results he finds "NUDE PICTURES OF RAPPER PINK!"
IBM then lands a lucrative support contract with Executive Bob to remove all the viruses and spyware from his desktop PC. Rinse and Repeat.
This comment is fully compliant with RFC 527.
One line blog. I hear that they're called Twitters now.
You can do that already with Google:
A search for "Microsoft is evil" gets you 600,000 pages.
A search for "Microsoft is good" gets you 3,590,000 pages.
Therefore Microsoft is more good than evil.
Err ... that wasn't quite the answer I was expecting.
(cue sounds of joke falling apart...)
Avantslash - View Slashdot cleanly on your mobile phone.
Sounds good. There ought to be something similar under BSD or GPL.
Political dissidents would definitely benefit from this kind of super search system, and so would normal users like kids doing searches for their homework.
We need our own "commie" version.
I wish I was fluent in computer languages or else I'd be the first one to start this up under BSD licence.
Any suggestions as to what language I need to learn to develop this kind of search engine?
It's gotta have a capability like Freenet to distribute load on the network and the system while keeping users anonymous, since private users won't have the resources to come up with 1000s of servers. I'm thinking along the lines of XML.
You've won this round, Lonestar...
"Talk minus action equals nothing" - Joey Shithead, D.O.A.
"Talk minus action equals
This is a potentially very useful money-saver. Currently companies employ hordes of middle-management people who do little other than detect discrepancies between the technologies that their department is focusing on and those that are currently all the buzz. Now we can create an automatic boss that sends out e-mails like, "What's this IP-over-XML thing and why don't we use it and how soon can you have all our critical systems migrated to it?"
This sounds like just another tool for the RIAA to use against us. This time, anyone with an apache server account and some mp3s is vulnerable, not just the P2P guys.
If my answers frighten you, stop asking scary questions.
I've already seen/heard of such systems, basically in the Business Intelligence field.
In England, a system like Autonomy (used by the police at the beginning) can crawl a mass of information with dedicated spiders (not only for the web, but also commercial databases, files...). Then, it structures all the content into themes using links and proximity.
I personally tested it some years ago, feeding it with information websites and asking for articles "close to" another one. The efficiency was amazing because it was able to tell apart similar terms that have very different meanings depending on the context. Usually, search engines get this wrong because they can't use the context.
I also set up some "agents" for recurring searches (an agent is basically a search plus some training, letting Autonomy know which of the documents it found are close and which are not) and it was able to produce a really good press review every day with nearly no wrong documents.
As a complement to Autonomy, I know a BI team that uses some other tools like Pericles to feed the searches with "relevant" content, basically themes that are "appearing" in the group of documents and are close to some interests.
Such BI tools can already provide the kind of information cited, like an opinion movement against a company detected in newsgroups or on some websites. And IBM is certainly on track to improve such tools with the techniques of their labs.
I hope these tools won't be limited to PR articles on the web and/or private use by big corporations, because then it would just be another Echelon with all its bad consequences:
- bad use of public information
- paranoia fed by false scares
- public/corp. power against the citizens
If tools like Echelon could be used by everybody, citizens would have to be given much more privacy and the public leaders would have to explain the investments.
ClaudeBBG
Reminds me of the Scanalyzer service in John Brunner's book "Stand On Zanzibar." The supercomputer Shalmaneser analyzed millions of inputs and tried to make sense of them.
Despite the apparent promise of the project, it is difficult to find actual examples of it doing really cool stuff.
XML simply isn't enough. Structure != Meaning. Meaning must be inserted somewhere by someone. Trying to interpret HTML/natural language to form structured documents is a daunting task. If you want real meaning then the data needs to be described or translated into a meaningful form like RDF (yes, represented by XML) when it is created so that intelligent agents such as this can *understand* the data. RDF uses triples (think graphs) to describe relationships, making use of URIs: Subject--Predicate--Object ...etc.
Now think about how to merge all this information - with well formed rules RDF documents merge great:
With traditional structured XML, the merged docs would not be well-formed. Now they can be, and XML can be generated for standard XML rendering.
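To illustrate the merging point, here is a rough sketch that treats an RDF document as a plain set of (subject, predicate, object) triples, so that merging is just set union. The example.org URIs and the `merge` helper are invented for illustration, and the sketch ignores blank nodes, which a real RDF merge has to keep distinct:

```python
# Each triple is (subject, predicate, object); all URIs below are hypothetical.
Triple = tuple[str, str, str]

EX = "http://example.org/"

doc_a: set[Triple] = {
    (EX + "artist/pink", EX + "terms/type", EX + "terms/Singer"),
    (EX + "artist/pink", EX + "terms/genre", EX + "terms/Pop"),
}
doc_b: set[Triple] = {
    (EX + "artist/pink", EX + "terms/type", EX + "terms/Singer"),  # also in doc_a
    (EX + "color/pink", EX + "terms/type", EX + "terms/Color"),
}


def merge(*graphs: set[Triple]) -> set[Triple]:
    """Union of all triples; statements repeated across documents collapse to one."""
    merged: set[Triple] = set()
    for graph in graphs:
        merged |= graph
    return merged


merged = merge(doc_a, doc_b)
```

Because every statement stands alone and names its subject by URI, the order of the inputs does not matter, and the singer and the color stay separate because their URIs differ.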
Take a look at the Semantic Web
When I heard about TIA I figured they would do it this way.
Can a site have a copyright saying "reselling my data prohibited"? Then IBM can't give it to customers.
Also, I look forward to the system being manipulated for fun and profit.
Well, if you're daft enough to only enter 'Pink' into a search engine and expect it to know you mean the singer, not the colour, you are daft.
Search engines need to be used properly in order to get the best results. In my quotes above, the only thing that might mean the singer over the colour to a search engine is the capital letter at the beginning, but seriously people, who the hell uses generic words in search engines these days and expects to get great results?
This technology should be made available to social scientists, anthropologists, cultural critics, etc. so that current social trends can be analyzed. Perhaps IBM would be kind enough to provide free access to this system to Universities?
It is a pity that the WebFountain system is geared toward corporate users. Of course, there must be some ROI... but, still it makes me sad that every new technology seems to be driven by corporate desire for good PR and world domination.
Interestingly, this article comes out right after Slashdot's coverage of the O'Reilly GeekCamp, in which the CNN article mentions the following relevant projects:
So, perhaps the Open Source community will be able to create some similar technology that is freely available for researchers, writers, scientists, etc. to use.
------- "One of the joys of travel is visiting new towns and meeting new people." -- G. KHAN
Google lets you do a keyword search (bottom-up) or via the directories - DMOZ (top-down). Vivisimo and Grokker were recently discussed on slashdot where they were creating dynamic categorizations, i.e. bottom-up. I think it would be better to let people analyze the markup (directory/top-down approach) or analyze the material (keyword/bottom-up) rather than mixing up the two and presenting the "results" to the person.
This is the second place where energies should be focused. Where the document is created may mean a lot. It could be that the directory in which I create a new file supplies the path (hence context), or it could be as simple as this: on the top-right of the screen I create personal files, on the bottom right I create files about sports, on the left-bottom-middle I create files about java
To see a world in a grain of sand, and then to step back and see the beach where the sand lies
The really nice part is that they can use their 0.5 FBFs of stuff to data-mine the Internet once, and then sell the work over and over again. (There's a little work to sort/package the data for each client, but trivial compared to crunching and tagging the Internet in the first place.)
One line blog. I hear that they're called Twitters now.
http://www.googlism.com/
Where's my $300,000?
Good point.
Best Slashdot Co
Does this mean that the folks at IBM Almaden can fix slashdot so we don't get all that unstructured crap from the first posters when a new topic arrives?
Sigs. We don't need no steenking sigs.
Wonder if this "web fountain" will be smart enough to determine the context to THAT level.
A painter thinks "colour" when he sees the word.
A slashdot reader (and many other grown-ups) thinks of the band "Pink Floyd".
If you are (or are the parent of) a teen-aged girl you think of neither...you think of the anti-Britney pop-star princess of angst Pink
...they were used in calibration tests... you know, find the highs and lows of the system.
Kjella
Live today, because you never know what tomorrow brings
It's a reference to Pink Floyd's Have a Cigar lyrics, "And by the way, which one is Pink?"
Pink can also refer to female genitalia, hence my name. Mmmm mmmm good.
Pr0nfountain leads to sticky keyboards
Any sufficiently advanced man is indistinguishable from God
If you've ever heard her sing you'd know that pink is a color.
So, by extension, Baby Bush speaks black streetgang loserspeak! Probably wrong but about his intelligence level.
30% of the net is Porn.
30% of the net is dupes.
How much of the net is porn dupes?
From IBM? That's rich. Their website is so bad I have to use Google to locate stuff on it, even if I know it exists.
Look, I know I run a lame-ass excuse for a website, (yeah, quote me on that) and it's not even meant to be viewed by children or prudish adults, but I'll be happier when you don't see my website under this link. In the meantime, I hope IBM's "Web Fountain" doesn't troll over my site and determine that it's about Lycoris Screen Resolution either.
Tell me why Google is my friend again?
............or the song?
Can't think why they're putting money and effort into this project. BTW, anyone tried to find something on the IBM site recently?
Slashdot should be a refuge from mentions of artists of such caliber as Pink.
Look it's a joke about my sig IN MY SIG! LOL!
Utilising your own system is a start. On the desktop there's Nat *Ximian* Friedman's Dashboard.
peterrenshaw ~ Another Scrappy Startup
Nothing is smart enough to tell the difference because the content is contextual (hence the name). In a corporation like the one I'm at now (a class A railway) we have hundreds of terabytes of information flowing through our systems on a regular basis. Trying to track it, categorize it, and make sense of what's there is next to impossible. Yet we still keep trying.
I've been trying to architect the information gathering myself in a manual way using a distributed model. Rather than having one system (or hundreds of systems depending on how you look at it) go out and farm the information, have each system submit itself (automated if such a way exists) to a central repository so that it makes sense. Like I said, any entity is the best thing to know about itself and how it should be classified.
The Trove system from SourceForge is such a beast. Any project submits itself to the trove for categorization. If you abstract that concept up a level, you get a general classification system that lets you not only search based on it, but also filter the information and allow something to be categorized in multiple dimensions. It's not just about one listing anymore, because Pink the singer could be listed under Rock, Pop and Female. You can't choose just one. The trove system as it is isn't the most scalable in the world, but with a little work it could be, and it could be generic enough to classify documents, objects, people, whatever. Just a thought...
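A rough sketch of that abstraction, in which each item submits its own classification along several dimensions and searches filter on any combination of them. The `Catalog` class, its method names, and the dimension names are hypothetical; this is not SourceForge's actual Trove code:

```python
from collections import defaultdict


class Catalog:
    """Central repository where every item classifies itself."""

    def __init__(self) -> None:
        # Maps (dimension, value) to the set of item names filed under it.
        self._index: dict[tuple[str, str], set[str]] = defaultdict(set)

    def submit(self, name: str, **dimensions: list[str]) -> None:
        """An item files itself under any number of values per dimension."""
        for dimension, values in dimensions.items():
            for value in values:
                self._index[(dimension, value)].add(name)

    def find(self, **filters: str) -> set[str]:
        """Items matching every given dimension=value filter."""
        matches = [self._index.get(item, set()) for item in filters.items()]
        return set.intersection(*matches) if matches else set()


catalog = Catalog()
catalog.submit("Pink", genre=["Rock", "Pop"], gender=["Female"])
catalog.submit("Pink Floyd", genre=["Rock"])

rock = catalog.find(genre="Rock")                          # both entries
female_rock = catalog.find(genre="Rock", gender="Female")  # only the singer
```

The index keyed by (dimension, value) is what lets one entry sit under Rock, Pop and Female at once instead of being forced into a single listing.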
That sounds a whole lot like Google fight :)
;)
This wasn't the answer I was hoping for either
May we live long and die out
GOPHER! Quick, patent that idea before someone else thinks of it first, oh, wait .....
http://www.research.ibm.com/resources/news/images
On the left:
Andrew Tomkins, WebFountain Chief Scientist
On the right:
Bob Carlson, VP of WebFountain at the Almaden Research Center
Spamdexing tools used to push page rankings basically spout babble. Even the most naive semantic parser will choke on it and spit that stuff out as rubbish.
Of course I don't doubt you are right about the motivation, and one would expect to see them come up with nasty little tricks like taking public domain documents and replacing keywords with their own (so as to get semantically well-formed data that is actually just a rankings magnet).
But, in the final analysis 'content costs' and even robots can be coded to reject nonsense.
"the first commercial use will be to track public opinion for companies."
Have they learnt _nothing_ from the google-bombs?
As soon as people find out what algorithm they use, there'll be someone coordinating abuse thereof.
YAW.
Your head of state is a corrupt weasel, I hope you're happy.
Let me see if I understand this correctly. The makers of Lotus Notes, WebSphere, and what could very well be the WORST website for developers EVER think they can somehow tackle content chaos? Hahahaha. It hurts to laugh this hard.
"Trust you? How can I trust a man who can't even trust his own pants?" - Henry Fonda, Once Upon a Time in the West
> The site is owned by an organization with a
> known Dun and Bradstreet number. (If a site is
> selling something, and its Whois info doesn't
> match the DNB corporation database, it should
> be downgraded in search position. This would
> encourage honest Whois info.)
This may be a question born of serious ignorance. If so, I'd really appreciate some enlightenment.
This is also not so theoretical for me, as I am currently privately developing a product that I will eventually be selling online.
However, until your post, I had not heard of Dun and Bradstreet. I have gone to their website, and they apparently provide a number of services, which can be broadly categorized as marketing advice/consulting and credit advice/consulting. For my particular small business, neither of these services is useful. I'm funding out of pocket (so my business's credit is not of interest) and I don't plan on extending credit to my customers (there's no need for it) and the market for my product is very small and specialized and I'm fairly confident in stating that I know as much about it as Dun and Bradstreet or pretty much anyone else, modulo a few other people within the same community.
So my analysis, as a prospective small business owner, is that a Dun and Bradstreet number would be useless. Why would I want to get one?
Now, this would be utterly offtopic, except that you suggest that a Dun and Bradstreet number would be a reliable way of confirming whois info. When I get a domain name for this venture (not there yet) I do intend to provide accurate whois info and I generally agree that accurate whois info is a good thing and fairly important. However, for me, a DNB number would be a useless expense (actually, it appears it may be free? I distrust anyone who requires me to give out personal info before giving me prices, and apparently they want my email address first. However, if it's free, then one has to question the reliability of their information) and, while I'm not the majority case, I'm also not entirely unique.
If one were to implement this, this would mean that many businesses who were quite legitimate and had accurate whois information would be classed as if they did not have accurate whois information. This strikes me as a serious weakness.
the answer lies in the use of linked ontological
domains coupled with bayesian stats overlying
graph based storage. Graph theory stuff.
Lots of technical issues to get it implemented
but I think this is the way to go. Of anyone
IBM have the resources to make it go.
Anyone interested might want to look at
DAML + OIL www.daml.org.
I wish it was Friday, but I don't want to wish my life away. So I wish it was last Friday.
Ever considered just how ridiculous you urban rap hiphop black streetgang wannabes really sound? Or look. Stan Laurel wearing that phat fuck Oliver Hardy's pants! Dum nigurz!
PS- I'm abandoning my account. I think the moderation on this site is rigged. How did I get modded overrated when I was never modded up to begin with?