CMU Web-Scraping Learns English, One Word At a Time
blee37 writes "Researchers at Carnegie Mellon have developed a web-scraping AI program that never dies. It runs continuously, extracting information from the web and using that information to learn more about the English language. The idea is for a never ending learner like this to one day be able to become conversant in the English language." It's not that the program couldn't stop running; the idea is that there's no fixed end-point. Rather, its progress in categorizing complex word relationships is the object of the research. See also CMU's "Read the Web" research project site.
What happens when it discovers lolcats?
It could be scraping SMS messages.
On the up-side, at least then it would learn teen-speak.
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
I am the Carnie Mellon reader, I have discovered with this article that I am robot.
Damia
I've always been amazed that until recently, most work on AI has focused on a preconstructed system that fits data into pathways, with just enough variation in its thought abilities to let it expand its model slightly.
They'd write the rules for the system, try to build most of the work into it, and then see how well it does, with limited learning capabilities and still based on the original model.
I'm glad a lot of research is finally gearing more towards the path of having a small initial program, then feeding it data and letting it grow into its own intelligence.
If you give it the ability to learn, then it'll learn the rest itself, rather than giving it functions that let it pretend to learn while fitting into a model.
And I know there has been research into this in the past, but it didn't really take off till the last decade or so, and I'm glad it has.
True, or at least somewhat competent, AI, here we come.
You never realize how much manually made unmanaged "linked" lists suck, till you have src.link.link.link.link...
Only as good as current machine learning algorithms.
So not very.
Why do I get the feeling that the bot's first words are going to be OMGWTFBBQ?
Don't thank God, thank a doctor!
What happens when this program stumbles across text written in a language other than English? Or how about random nonsensical text? How does it know that the text it learns from is genuine English text?
http://spamdecoy.net - free throwaway anonymous email - avoid spam!
Once a computer understands 3d objects with English names, it can then have an imagination to know how these objects interact with each other. Of course writing imagination space that simulates real life is exceedingly difficult and I don't see anyone doing it for several years if not a decade just to start.
God spoke to me.
Yeah, I've coded an infinite loop a few times, how come I never made the headlines on Slashdot?
Strange things are afoot at the Circle-K.
In general I find that the quality of a data set tends to be determined by the number (and quality) of man hours that go into maintaining it. Every database accumulates spurious entries, and if they aren't removed the data loses its integrity.
I'm very skeptical of the idea that this thing is going to keep taking input forever and accumulate a usable data set unless an army of student labor is press-ganged to prune it.
My only political goal is to see to it that no political party achieves its goals.
The concept is intriguing: "Create a program that learns all there is to know, off the net." What amazes me is that others don't try the same thing. It doesn't take a team of A.I. types from Stanford to kick-start this program. The cost is a netbook; even Nigerian princes could afford this. I'm trying to figure out how economic competitors could take advantage of this. I can see how the U.S.P.T.O. could use this to help evaluate prior art and common usage. I'm thinking that an interface to a "Real World Simulator" would be the next step toward usefulness.
There is simply no existing database to tell computers that "cups" are kinds of "dishware" and that "calculators" are types of "electronics." NELL could create a massive database like this, which would be extremely valuable to other AI researchers.
This is what they are trying to do, based on information they glean from the internet. It's already been done, with Cyc. The major difference seems to be that Cyc was built by hand, and cost a lot more. It will be interesting to see if this experiment results in a higher or lower quality database.
Also, I question their assertion that it would be extremely valuable to other AI researchers. Cyc has been around for a while now, and nothing really exciting has come of it. I'm not sure why this would be any different.
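For anyone wondering what a "cups are kinds of dishware" database looks like in practice, here is a minimal sketch in Python. It is only an illustration of the idea in the quoted passage, not how NELL or Cyc actually store anything: the class name `IsADatabase`, its methods, and the extra "household item" category are all made up for the example.

```python
from collections import defaultdict


class IsADatabase:
    """Toy store of "X is a kind of Y" facts (hypothetical, for illustration)."""

    def __init__(self):
        # Maps each term to the set of categories it directly belongs to.
        self._parents = defaultdict(set)

    def add(self, child, parent):
        """Record the fact that `child` is a kind of `parent`."""
        self._parents[child].add(parent)

    def is_a(self, thing, category):
        """Return True if `thing` belongs to `category`, directly or via a chain."""
        seen = set()
        stack = list(self._parents.get(thing, ()))
        while stack:
            current = stack.pop()
            if current == category:
                return True
            if current not in seen:
                seen.add(current)
                stack.extend(self._parents.get(current, ()))
        return False


db = IsADatabase()
db.add("cup", "dishware")               # example from the quoted passage
db.add("calculator", "electronics")     # example from the quoted passage
db.add("dishware", "household item")    # assumed extra fact, for the chained lookup

print(db.is_a("cup", "dishware"))
print(db.is_a("cup", "household item"))
print(db.is_a("cup", "electronics"))
```

The lookup follows chains of facts, so "cup" counts as a "household item" even though nobody entered that directly. The hard part the comment is about is filling such a store reliably, by hand (Cyc) or from web text (NELL), not querying it.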
Qxe4
I think I see the problem with their code.
All they've done is reproduce the typical office worker. It just sits around and surfs the net all day, without coming back with an answer.
Serious? Seriousness is well above my pay grade.
I guess bucket didn't get any choice where to go to school either.
Let it read Wikipedia, not get poisoned by Twitter etc.!
Sorry for replying to myself. I forgot to finish my comment. In fact, this problem is related to the Symbol Grounding Problem. It addresses the issue of "grounding" symbols (like words) in their sensory representations, e.g., the symbol "triangle" in the raw pixel representation of a triangle. In the case of symbols for visual objects, some researchers have used intermediary 3D abstractions of the sensory data, mapping the symbols to these intermediary representations. It has been a hot research topic since the '80s.
Bucket of #xkcd is on github: http://github.com/zigdon/xkcd-Bucket
Well, Bucket's based on the (rather widespread) 'infobot' Perl program. The original infobot is hosted at http://sourceforge.net/projects/infobot/, but the XKCD variant of Bucket has a very detailed page showing the various interactions one can have with it, as well as a link to the Github page. See http://wiki.xkcd.com/irc/Bucket.