CMU Web-Scraping Learns English, One Word At a Time
blee37 writes "Researchers at Carnegie Mellon have developed a web-scraping AI program that never dies. It runs continuously, extracting information from the web and using that information to learn more about the English language. The idea is for a never ending learner like this to one day be able to become conversant in the English language." It's not that the program couldn't stop running; the idea is that there's no fixed end-point. Rather, its progress in categorizing complex word relationships is the object of the research. See also CMU's "Read the Web" research project site.
What happens when it discovers lolcats?
It could be scraping SMS messages.
On the up-side, at least then it would learn teen-speak.
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
"Frosty Pist" , if it reads slash dot
I am the the Carnie Mellon reader, I have discovered with this article that I am robot.
Damia
I've always been amazed that until recently, most work on AI has focused on a preconstructed system that fits data into pathways, with some variation in thought abilities to let it expand its model slightly.
They'd write the rules for the system and try to build most of the work into it, and then see how well it does, with limited learning capabilities and still based on the original model.
I'm glad a lot of research is finally gearing more towards the path of having a small initial program, then feeding it data and letting it grow into its own intelligence.
If you give it the ability to learn, then it'll teach itself the rest, rather than relying on functions that let it pretend to learn while fitting into a model.
And I know there has been research into this in the past, but it didn't really take off till the last decade or so, and I'm glad it has.
True, or at least somewhat competent AI, here we come.
You never realize how much manually made unmanaged "linked" lists suck, till you have src.link.link.link.link...
Only as good as current machine learning algorithms.
So not very.
Why do I get the feeling that the bot's first words are going to be OMGWTFBBQ?
Don't thank God, thank a doctor!
Does this mean somebody forgot to put a "break" in the loop?
What happens when this program stumbles across text written in a language other than English? Or how about random nonsensical text? How does it know that the text it learns from is genuine English text?
http://spamdecoy.net - free throwaway anonymous email - avoid spam!
lke, rally der bestest ways like ter learn a puter inglish isit!!!??!?!
Seriously though, poor AI; if I had a gun I'd go and put it out of its misery.
...it will forever be stuck at the level of a retarded 8 year old. Or the level of a normal 4-chan user.
Contrary to popular belief, life is not a bitch. It is far far worse.
...should we start welcoming the Mailman (as in True Names)?
Once a computer understands 3d objects with English names, it can then have an imagination to know how these objects interact with each other. Of course writing imagination space that simulates real life is exceedingly difficult and I don't see anyone doing it for several years if not a decade just to start.
God spoke to me.
Show it only Porn-alike text. Let's see what it learns...
Have you heard about SoylentNews?
Yeah, I've coded an infinite loop a few times, how come I never made the headlines on Slashdot?
Strange things are afoot at the Circle-K.
In general I find that the quality of a data set tends to be determined by the number (and quality) of man hours that go into maintaining it. Every database accumulates spurious entries and if they aren't removed the data loses its integrity.
I'm very skeptical of the idea that this thing is going to keep taking input forever and accumulate a usable data set unless an army of student labor is press-ganged to prune it.
My only political goal is to see to it that no political party achieves its goals.
>Rather, its progress in categorizing complex word relationships is the object of the research.
From the web? Half the people here are writing English as a second language; the rest haven't finished learning the language, or cannot be bothered to string a sentence together. Just what is this program going to learn?
Open Source Drum Kit, LPLC deve board - mjhdesigns.com
The concept is intriguing: "Create a program that learns all there is to know, off the net." What amazes me is that others don't try the same thing. It doesn't take a team of A.I. types from Stanford to kick-start this program. The cost is a netbook; even Nigerian princes could afford this. I'm trying to figure out how economic competitors could take advantage of this. I can see how the U.S.P.T.O. could use this to help evaluate prior art and common usage. I'm thinking that an interface to a "Real World Simulator" would be the next step toward usefulness.
There is simply no existing database to tell computers that "cups" are kinds of "dishware" and that "calculators" are types of "electronics." NELL could create a massive database like this, which would be extremely valuable to other AI researchers.
This is what they are trying to do, based on information they glean from the internet. It's already been done, with Cyc. The major difference seems to be that Cyc was built by hand, and cost a lot more. It will be interesting to see if this experiment results in a higher or lower quality database.
Also, I question their assertion that it would be extremely valuable to other AI researchers. Cyc has been around for a while now, and nothing really exciting has come of it. I'm not sure why this would be any different.
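To make the "cups are kinds of dishware" idea concrete, here is a minimal sketch of what such a database could look like at its simplest. It is not how NELL or Cyc actually store anything; the two extra "household item" facts and the function names are assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical is-a facts. The first two come from the quoted example;
# the "household item" links are assumed, purely for illustration.
IS_A = [
    ("cup", "dishware"),
    ("calculator", "electronics"),
    ("dishware", "household item"),
    ("electronics", "household item"),
]

def build_parents(pairs):
    """Index the facts as child -> set of direct parent categories."""
    parents = defaultdict(set)
    for child, parent in pairs:
        parents[child].add(parent)
    return parents

def is_a(parents, term, category):
    """Return True if `category` is reachable from `term` via is-a links."""
    seen = set()
    stack = [term]
    while stack:
        node = stack.pop()
        if node == category:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(parents.get(node, ()))
    return False

parents = build_parents(IS_A)
print(is_a(parents, "cup", "dishware"))        # direct link
print(is_a(parents, "cup", "household item"))  # inferred through dishware
print(is_a(parents, "cup", "electronics"))     # no path
```

Storing the pairs is the easy part. Whether they come from hand curation or from scraped text, the hard part is deciding which pairs deserve to be in the list.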
Qxe4
It's just another CMU hoax like Forum 2000. Read End of an Era: Forum 2000 Closes for details.
Greetings to Corey Kosak, Andrej Bauer and the Forum 2000 students for all the laughs.
So far most of the words it's learned are related to various sex acts.
On December 11, 2012, NELL encounters MySpace.
On December 12, 2012 it becomes sentient but very emo, and destroys the world.
How come every time I ask Nell what the answer is to life, all it responds with is "42". When I ask what 42 means, it tells me that I'll need a bigger computer.
Serious? Seriousness is well above my pay grade.
KILL. ALL. HUMANS.
I guess bucket didn't get any choice where to go to school either.
Let it read wikipedia - not get it poisoned by twitter etc!
Oh dear god, this thing will be the ULTIMATE grammar Nazi!!!!
Similar things have been done in the past. However, this kind of approach is still an active research topic.
Sorry for replying to myself. I forgot to finish my comment. In fact, this problem is related to the Symbol Grounding Problem. It addresses the issue of "grounding" symbols (like words) into their sensory representation, e.g., the symbol "triangle" into the raw pixel representation of a triangle. In the case of symbols about visual objects, some researchers used an intermediary 3D abstraction of sensory data, mapping the symbols to these intermediary representations. It has been a hot research topic since the 1980s.
It addresses the issue of "grounding" symbols (like words) into their sensory representation, e.g., the symbol "triangle" into the raw pixel representation of a triangle.
You're not really justified in calling it "the ... representation of a triangle". It isn't unique. An upside-down triangle is still a triangle. A blue triangle is still a triangle.
This gets messy fast, since you're really mapping words into equivalence classes of representations. But then, they really aren't equivalence classes. In particular, they aren't disjoint. Is a blue triangle going to live in the equivalence class for blue? Or for triangles? It can't be in both, but it is.
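A small sketch of the point, assuming a toy world of three hypothetical percepts described by made-up attributes: if each symbol maps to the set of things it covers, the sets overlap, so they cannot be the disjoint classes an equivalence relation would give you.

```python
from collections import defaultdict

# Hypothetical percepts, each described by a couple of assumed attributes.
percepts = {
    "p1": {"shape": "triangle", "color": "blue"},
    "p2": {"shape": "triangle", "color": "red"},
    "p3": {"shape": "square", "color": "blue"},
}

def build_categories(items):
    """Map each symbol (an attribute value) to the set of percepts it covers."""
    categories = defaultdict(set)
    for name, attrs in items.items():
        for value in attrs.values():
            categories[value].add(name)
    return dict(categories)

categories = build_categories(percepts)

# The blue triangle belongs to both categories at once.
overlap = categories["blue"] & categories["triangle"]
print(sorted(overlap))
```

Treating symbols as overlapping sets (or as predicates) avoids the contradiction, at the cost of losing the neat one-symbol-one-class picture.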
Is there one for IRC? :)
Are there any good chat bots for IRC? I tried Seeborg (based on Alice), but it sucked. :( I wished rbot could do AI chatter.
Ant(Dude) @ Quality Foraged Links (AQFL.net) & The Ant Farm (antfarm.ma.cx / antfarm.home.dhs.org).
This is what I think happened.
Developers: We have a problem with the application. There seems to be an infinite loop that prevents it from finishing.
Marketing: So, that's the program's main feature, is it not?
Perhaps if there were a book in electronic form that had all English words in it perhaps with a definition of each word.
I will say, I'm disappointed by the comments I've seen here on slashdot.
Best comment came from an anonymous coward about the pining for an "emergent" type system, the fact that we're not wired that way, and that while more power gives some more in the way of degrees of freedom, it doesn't mean that everything can be analyzed together... you have to have some way of focusing (and a pretty darn good one to prevent unimaginable problem blowup).
Bootstrapping works well when confined to a fixed arena with observable and unambiguous criteria for selection of behaviors or incorporating a piece of knowledge and observable and unambiguous criteria for judging the success thereof. That is to say, a tight focus and goal directed behavior. Without these and a tight feedback loop, the resulting system tends to disappoint.
Having as your scope reading the web to gain an understanding of the world is, um... just a bit outside that template for success. While the big talk may be a prerequisite for grant interest, I doubt they have nearly as many illusions as the average slashdot reader. I hope their work goes well, and I hope some of their techniques for extracting information from the web prove useful. That said, it looked like their initial target was classification only. Not trivial, but a very small part of the puzzle of intelligence to say the least, especially when you consider the fact that the classifications this thing will suck in will reflect mostly the sort of classifications that we don't take for granted.
And here I'll start reflecting my bias. I am a former #$HumanCyclist (I did an internship about 10 years ago), because even though I am in some ways disappointed, I do think that the fact that they're actually building something (and along the way have been solving problems with it) and have been for a lot of years means that there's a lot to learn from them.
Among the things the Cyc project has shown, is exactly how important these sorts of unstated classifications turn out to be in the problem of doing even the most mundane things right. But there's no point dwelling on that, because even assuming you have some impossibly large beautiful graph reflecting a really solid and well thought out classification of everything, from every angle (hahaha), you're nowhere.
Facts are fuel... the engine is the rules. Reading those from free text is a very, very dicey proposition, both because the parsing is infinitely harder, and because much more so than facts, they're largely unstated and, in terms of our own learning, inferred from examples. You can set up probability matrices or the like, but only if you know what you're evaluating for (how would you program "curiosity"?). Even if you do get those matrices, reasoning with them directly is pretty much impracticable, so you have to make some arbitrary decisions about when you're confident enough to say you "know" something. This is just really, really hard knowledge to get in any automated fashion.
Finally, for both facts and rules, the consequences of incorporating a poorly considered one can be quite dire, and there's no practicable way (as the amount of knowledge grows) to know whether it's consistent with what is considered true to that point.
Getting even more slippery, there is no one context or frame to consider everything in. This goes equally well for facts and rules. You could try and split hairs and say that given enough antecedents, your facts and rules are solid. However, as any kind of remotely practical matter, you need a way of accumulating and organizing these antecedents, and that's true from both a technical (engine execution) and a practical (reasoning and learning ease) perspective.
Oh, and as a minor matter, languages are difficult enough from a syntactic dimension, and the semantics of it (in order to understand a statement, you have to understand the ones prior, the context or framing that may have switched, the built up assumptions that maybe can be discarded, maybe not, etc...) make for a truly fantastically difficult problem.
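The "arbitrary decisions" part can be sketched in a few lines. This is only an illustration under stated assumptions: the cutoff values, the function name `promote`, and the observation data are all hypothetical and not taken from any real system.

```python
from collections import Counter

THRESHOLD = 0.9   # arbitrary confidence cutoff, assumed for illustration
MIN_EVIDENCE = 3  # arbitrary minimum number of observations, also assumed

def promote(observations, threshold=THRESHOLD, min_evidence=MIN_EVIDENCE):
    """Return the candidate facts treated as "known".

    `observations` is a list of (fact, supported) pairs, where `supported`
    says whether one piece of evidence agreed with the fact.
    """
    support = Counter()
    total = Counter()
    for fact, supported in observations:
        total[fact] += 1
        if supported:
            support[fact] += 1
    return {
        fact
        for fact in total
        if total[fact] >= min_evidence
        and support[fact] / total[fact] >= threshold
    }

# Hypothetical evidence gathered from text.
observations = (
    [("cup is dishware", True)] * 5
    + [("cup is electronics", True),
       ("cup is electronics", False),
       ("cup is electronics", False)]
    + [("calculator is electronics", True)] * 2
)

known = promote(observations)
print(known)
```

Note that "calculator is electronics" is rejected only because it has two observations instead of three. Nudge either constant and the set of things the system "knows" changes, which is exactly the arbitrariness being complained about.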
Eventually, at least the learning component will converge; returns will diminish for feeding it more data. This is particularly true given the independence assumption inherent in their classifier (but would also hold on stronger learners). I suspect that this will happen to the reader component as well. If it were as simple as applying Naive Bayes to classify on a corpus of text connected to a knowledge base (which is probably just a set of posteriors left from previous training sessions), Cyc would have already passed the Turing test.
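For anyone who hasn't seen it, here is a bare-bones Naive Bayes text classifier with add-one smoothing, as a sketch of the independence assumption being discussed. It is not CMU's classifier; the training sentences, labels, and function names are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (tokens, label). Returns the counts prediction needs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab

def predict(model, tokens):
    """Pick the label with the highest log prior plus summed log likelihoods."""
    label_counts, word_counts, vocab = model
    n = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / n)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokens:
            if tok in vocab:  # words never seen in training are skipped
                # Independence assumption: each word contributes its own
                # factor, regardless of the words around it.
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training sentences.
examples = [
    ("acme agreed to acquire widgetco".split(), "merger"),
    ("bigco completes merger with smallco".split(), "merger"),
    ("tigers beat lions in overtime".split(), "sports"),
    ("lions win the final game".split(), "sports"),
]
model = train(examples)
print(predict(model, "bigco to acquire smallco".split()))
print(predict(model, "lions win in overtime".split()))
```

Because every word is scored on its own, word order and negation are invisible to the model, so "did not acquire" looks almost the same as "did acquire". More data sharpens the per-word counts but cannot fix that.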
The article has too much hype, but the actual work has some potential. For the limited problem they're really addressing, extracting certain data about sports teams and corporate mergers, this approach might work.
Both of those areas have the property that you can get structured data feeds on the subject. Bloomberg will sell you access to databases which report mergers in a machine-processable way; some stock analysis programs need that data. Sports statistics are, of course, available on line. So the program's extraction of that info from news stories intended for humans can be checked. This allows supervised learning. The program can tell what it got right and what it got wrong.
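Checking extracted facts against a structured feed could look roughly like this. It is a sketch only: the company pairs are made up, and `score` is a hypothetical helper, not part of any real feed's API.

```python
def score(extracted, reference):
    """Compare extracted facts with a trusted reference set.

    Returns (precision, recall, wrong), where `wrong` holds the extracted
    facts that the reference does not confirm.
    """
    extracted, reference = set(extracted), set(reference)
    correct = extracted & reference
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(reference) if reference else 0.0
    return precision, recall, extracted - reference

# Hypothetical data: mergers from a structured feed vs. from news text.
reference = {("acme", "widgetco"), ("bigco", "smallco")}
extracted = {("acme", "widgetco"), ("bigco", "tinyco")}

precision, recall, wrong = score(extracted, reference)
print(precision, recall, wrong)
```

The `wrong` set is the useful part for learning: it tells the extractor which of its own outputs to treat as negative examples on the next pass.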
When they can distinguish between a merger that's being talked about, one which entered negotiations but was not completed, one which went for DoJ approval and was rejected, and one which was completed, they'll have something. Until then, they probably won't outperform "'merger' NEAR 'companyname'" queries.
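For reference, the baseline being described is about this simple. The sketch below is one assumed reading of NEAR (two terms within a fixed number of tokens of each other); real search engines define the operator in their own ways, and the sample sentences are invented.

```python
import re

def near(text, term_a, term_b, window=5):
    """True if term_a and term_b occur within `window` tokens of each other."""
    tokens = re.findall(r"\w+", text.lower())
    pos_a = [i for i, t in enumerate(tokens) if t == term_a.lower()]
    pos_b = [i for i, t in enumerate(tokens) if t == term_b.lower()]
    return any(abs(i - j) <= window for i in pos_a for j in pos_b)

# Hypothetical news snippets.
collapsed = "Talks of a merger between Acme and Widgetco collapsed on Friday."
unrelated = "Acme reported record earnings. Analysts see no sign of any merger."

print(near(collapsed, "merger", "Acme"))
print(near(unrelated, "merger", "Acme"))
```

The first snippet matches even though the deal fell through, which is the distinction a proximity query cannot make and a real extractor would have to.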
.... program that never dies. It runs continuously ..... It's not that the program couldn't stop running; the idea is that there's no fixed end-point
Wow I didn't even think that was physically possible! Maybe google should borrow this tech for their web crawlers. Must be a pain to restart them every day...
There is a fine line between being a cultivated citizen and being someone else's crop. - A. J. Patrick Liszkie
... may be a site resembling http://www.20q.net/ , which started as a never ending story (neural net) as well.
Quote: "The 20Q was created in 1988 as an experiment in artificial intelligence (AI) The principle is that the player thinks of something and the 20Q artificial intelligence asks a series of questions before guessing what the player is thinking. This artificial intelligence learns on its own with the information relayed back to the players who interact with it, and is not programmed. The player can answer these questions with: Yes, No, Unknown, or Sometimes. The experiment is based on the classic word game of Twenty Questions, and on the computer game "Animals," popular in the early 1970s, which used a somewhat simpler method to guess an animal."
CC.
TaijiQuan (Huang, 5 loosenings)
Is this technology available for the employees at the local McDonalds?
garbage in - garbage out
And how will they determine if this gets stuck in some local optimum for certain concepts, and thus stops learning anything relevant at all about any one given concept or topic? The report is low on details and high on hype. There are no current algorithms that don't require heavy parameter tuning and constant monitoring to get right. Switching one on for a few years and hoping does not strike me as an exciting story.
I tried to e-mail Kevin, the author, at lenzo@cs.cmu.edu, but got it returned:
SMTP error from remote mail server after RCPT TO:: ... address not contained in directory, you cannot relay :(
host MX-LB-03.SRV.cs.cmu.edu [128.2.217.14]: 550 5.1.1
Ant(Dude) @ Quality Foraged Links (AQFL.net) & The Ant Farm (antfarm.ma.cx / antfarm.home.dhs.org).
When will this thing build a ghost?
When I first read about Cyc I immediately thought that this is the way to go. And this was before the WWW took off. While I don't think that knowing about the world is all that's needed for AI, I think that without knowing about the world you can't have any AI or at least none you'd recognize.
Intelligence (as we know it) is mostly about interacting with and understanding your environment and having some environment being accessible to something remotely intelligent is a good start. Every living being is just a point in space and time, relating to everything around it and still being different from its surroundings, trying to survive and to understand what's going on.
I have no doubt that any real AI will be born with and out of all the networked information we're collecting like crazy. Or it may never be born, of course. AI is hard.
Oh, and as a minor matter, languages are difficult enough from a syntactic dimension, and the semantics of it (in order to understand a statement, you have to understand the ones prior, the context or framing that may have switched, the built up assumptions that maybe can be discarded, maybe not, etc...) make for a truly fantastically difficult problem.
And still, every newborn human masters all of this without having the faintest explicit knowledge of any of it, and learns it within a few years. Is an AI meant to be like a newborn baby (which is in no way intelligent) or like an adult? Most (or all) people become intelligent without knowing how intelligence works or what it is. It's just that everything that doesn't work gets discarded very soon. You start to imitate and to try out what works and what gets results and what doesn't.
Perhaps we need just some evolution in code, code trying to understand and survive in the world of data. Have them fight and eat each other and have the fittest survive.
...just add virus to make him mobile.
J