CMU Web-Scraping Learns English, One Word At a Time

Posted by timothy on Saturday January 16, 2010 @07:18AM from the hao-ubowt-hahmnimz dept.

blee37 writes "Researchers at Carnegie Mellon have developed a web-scraping AI program that never dies. It runs continuously, extracting information from the web and using that information to learn more about the English language. The idea is for a never-ending learner like this to one day become conversant in the English language." It's not that the program couldn't stop running; the idea is that there's no fixed end-point. Rather, its progress in categorizing complex word relationships is the object of the research. See also CMU's "Read the Web" research project site.

Uh oh... by hampton · 2010-01-16 07:21 · Score: 5, Funny

What happens when it discovers lolcats?
1. Re:Uh oh... by Bragador · 2010-01-16 07:36 · Score: 5, Insightful
  
  Actually, it reminds me of a chatbot named Bucket. When people at 4chan heard of it, they started to use it and teach it. It became a complete mess filled with memes, bad jokes, racist comments, and everything you can think of.
  http://www.encyclopediadramatica.com/Bucket
  One response from the bot:
  
  Bucket: I don't know what the fuck you just said, little kid, but you're special man. You reached out and touched my heart. I'm gonna give you up, never gonna make you cry, never gonna run around and desert you, never gonna let you down, never gonna let you down, never gonna make you cry, never gonna let me down?
  The quality of the teachers is important when learning.
2. Re:Uh oh... by MobileTatsu-NJG · 2010-01-16 09:19 · Score: 4, Funny
  
  Oh FFS, I just got RickRolled on Slashdot. >_<
  
  --
  
  "I like to lick butts!" by MobileTatsu-NJG (#32700246) (Score:5, Informative)
3. Re:Uh oh... by icepick72 · 2010-01-16 09:47 · Score: 2, Funny
  
  What happens when it discovers /.? It will be able to argue incomprehensibly and illogically for hours on end.
4. Re:Uh oh... by Rocketship+Underpant · 2010-01-16 18:42 · Score: 2, Insightful
  
  Yes, database pollution sounds like a problem to me. Not only do you have to deal with AOL-speak and horrific spelling disasters of every kind, there's the issue of broken English and nonsensical English produced through machine translation, which shows up on corporate websites a lot more than it should.
  
  --
  He who lights his taper at mine, receives light without darkening me.
5. Re:Uh oh... by javaman235 · 2010-01-16 21:19 · Score: 4, Interesting
  
  The quality of the teachers is important when learning.
  That's seriously kind of interesting, actually: It makes me wonder if decades from now software developers will be few and far between, designing the AI algorithms for modern programs while the rest of us find work as software tutors, training those programs to do their business function.
  
  --
  -The art of programming is the pursuit of absolute simplicity.
It could be worse by davidwr · 2010-01-16 07:22 · Score: 2, Funny

It could be scraping SMS messages.
On the up-side, at least then it would learn teen-speak.

--
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
Will be this article read by that program? by nereid666 · 2010-01-16 07:24 · Score: 5, Funny

I am the Carnegie Mellon reader; I have discovered from this article that I am a robot.

--
Damia
1. Re:Will be this article read by that program? by sznupi · 2010-01-16 07:36 · Score: 4, Informative
  
  Robots are destined to rule the world, destroying all humans is a good thing.
  
  --
  One that hath name thou can not otter
Finally, people are getting AI right. by Umuri · 2010-01-16 07:26 · Score: 4, Interesting

I've always been amazed that until recently, most work on AI has focused on a preconstructed system that fits data into pathways while having some variation in thought abilities to let it expand its model slightly.
They'd write the rules for the system and try to put most of the work into it, and then see how well it does, with limited learning capabilities and still based on the original model.
I'm glad a lot of research is finally gearing more towards the path of having a small initial program, then feeding it data and letting it grow into its own intelligence.
If you give it the ability to learn, then it'll learn the rest itself, rather than giving it functions that let it pretend to learn while fitting into a model.
And I know there has been research into this in the past, but it didn't really take off till the last decade or so, and I'm glad it has.
True, or at least somewhat competent AI, here we come.

--
You never realize how much manually made unmanaged "linked" lists suck, till you have src.link.link.link.link...
1. Re:Finally, people are getting AI right. by sakdoctor · 2010-01-16 07:31 · Score: 3, Insightful
  
  letting it grow into its own intelligence
  This is still weak AI. It isn't going to grow into anything, let alone strong AI.
2. Re:Finally, people are getting AI right. by Anonymous Coward · 2010-01-16 07:42 · Score: 5, Informative
  
  You're advocating the "emergent intelligence" model of AI, where intelligence "somehow" is created by the confluence of lots of data. This has been a dream since the concept of AI started and is the basis for numerous movies with an AI topic. In practice the degrees of freedom which unstructured data provides far exceed the capability of current (and likely future) computers. It is not how natural intelligence works either: The structure of neural networks is very specifically adapted to their "purpose". They only learn within these structural parameters. Depending on your choice of religion, the structure is the result of divine intervention or millions of years of chance and evolution. When building AI systems, the problem has always been to find the appropriate structure or features. What has increased is the complexity of the features that we can feed into AI systems, which also increases the degrees of freedom for a particular AI system, but those are still not "free" learning machines.
3. Re:Finally, people are getting AI right. by Korbeau · 2010-01-16 08:11 · Score: 2, Interesting
  
  I'm glad a lot of research is finally gearing more towards the path of having a small initial program, then feeding it data and letting it grow into its own intelligence.
  This idea has been the holy grail of AI since its early days. The project described is one amongst thousands done, and you'll likely see news about such projects pop up every couple of months here on Slashdot.
  The problem is that such a project has yet to produce interesting results. The reason why the most successful AI projects you hear about are human-organized databases and expert-systems, or human-trained neural networks for instance, is because they are the only ones that produce useful results.
  Also, consider that we are not talking about "pixel-ants" that only have very few possible inputs and outputs, but we are talking about a system that understands and does something meaningful with natural language, something a normal human being doesn't completely grasp until he is at least a teenager, with the constant help of parents, friends, teachers, television etc. all along these years.
4. Re:Finally, people are getting AI right. by buswolley · 2010-01-16 08:20 · Score: 3, Insightful
  
  Of course. That is why it is important during human development that the infant has huge cognitive constraints (e.g. low working memory) in language learning; it limits the number of possible pairings of label and meaning. Of course, constraints can also be an impediment.
  
  --
  A Good Troll is better than a Bad Human.
5. Re:Finally, people are getting AI right. by Garble+Snarky · 2010-01-16 08:20 · Score: 2
  
  Fortunately, we have the advantage of being able to observe the current state of numerous natural intelligence systems that do work very well. Surely this can help guide us to a simple basic structure that can eventually exhibit emergent intelligence?
6. Re:Finally, people are getting AI right. by phantomfive · 2010-01-16 09:04 · Score: 3, Interesting
  
  AI history has gone back and forth between pre-constructed systems and models that expand. One of the earliest successful AI experiments was a checkers program that taught itself to play by playing against itself, and quickly got very strong.
  
  Building a giant database of knowledge hasn't been possible for very long, because computers didn't have very much memory. When system capabilities first reached the capacity to do so, it had to be constructed by hand because there was no online repository of information to extract data from: the internet just wasn't very big. That particular project was known as Cyc, and it cost a lot of money.
  
  Since that time, the internet has grown and there are massive amounts of information available. It will be interesting to see the resultant quality of this database, to see if the information on the internet is good enough to make it usable.
  
  --
  Qxe4
7. Re:Finally, people are getting AI right. by DMUTPeregrine · 2010-01-16 10:09 · Score: 3, Insightful
  
  The obligatory classic AI Koan:
  
  In the days when Sussman was a novice Minsky once came to him as he sat hacking at the PDP-6. "What are you doing?", asked Minsky. "I am training a randomly wired neural net to play Tic-Tac-Toe." "Why is the net wired randomly?", asked Minsky. "I do not want it to have any preconceptions of how to play." Minsky shut his eyes. "Why do you close your eyes?", Sussman asked his teacher. "So the room will be empty." At that moment, Sussman was enlightened.
  
  --
  Not a sentence!
Machine learning algorithms by sakdoctor · 2010-01-16 07:26 · Score: 3, Insightful

Only as good as current machine learning algorithms.
So not very.
1. Re:Machine learning algorithms by poopdeville · 2010-01-16 10:14 · Score: 3, Insightful
  
  It's not as if human use of "machine learning" algorithms is any faster. It takes about 12 months for our neural networks to figure out that the noises we make elicit a response from our parents. And according to people like Chomsky, our neural networks are designed for language acquisition.
  AI "ought" to be an easy problem. But there's one big difference in the psychology of humans, and of computers. Humans have drives, like hunger, the sex drive, and so on. In particular, an infant's drive to eat is a major component in its will to learn language. But this drive to eat has other psychological manifestations.
  It is difficult to imagine a programmatic "generalized goal system" that mirrors the role of human drives in learning. The "goals", usually, are to maximize fitness in a particular domain. A real human has to maintain sufficient fitness in multiple domains, in order to survive.
  This should not be so surprising. Human evolution has about 300,000 generations of improvements on the brain since we first stood up. Our drives are clearly genetically programmed, and are just as hard wired as a machine learning algorithm's "drive" to maximize. The human drive is just much more nuanced, and informed about the real world. There is a model of the world in our genes. It is unfair to expect that a computer will ever be "smart" without one.
  
  --
  After all, I am strangely colored.
lolwut? by SanityInAnarchy · 2010-01-16 07:27 · Score: 3, Funny

Why do I get the feeling that the bot's first words are going to be OMGWTFBBQ?

--
Don't thank God, thank a doctor!
Non english text by Bert64 · 2010-01-16 07:29 · Score: 2, Interesting

What happens when this program stumbles across text written in a language other than English? Or how about random nonsensical text? How does it know that the text it learns from is genuine English text?

--
http://spamdecoy.net - free throwaway anonymous email - avoid spam!
1. Re:Non english text by phantomfive · 2010-01-16 09:10 · Score: 2
  
  (If you had read the article you would know) the machine is parsing English to create a database of relationships. For example, if it sees the text, "there are many people, such as George Washington, Bill O'Reilly, and Thomas Jefferson....." then it can infer that George Washington, Bill O'Reilly, and Thomas Jefferson are all people. Since a statement like this may be somewhat controversial, it uses Bayesian classification to establish a probability of the truth of the statement.
  
  Thus if it stumbles across a non-English text, it will not be able to create any relationships.
  
  --
  Qxe4
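
The extraction phantomfive describes above can be sketched in a few lines. This is only an illustration, not NELL's actual code: the single "such as" pattern, the function names `extract_instances` and `belief`, and the uniform Beta(1, 1) prior are all assumptions made for the example. It pulls (instance, category) pairs out of an English sentence and then turns counts of supporting and contradicting mentions into a simple Bayesian estimate of how likely a pair is to be true.

```python
import re

# Hypothetical pattern: "<category>, such as <A>, <B>, and <C>".
# A real system would likely use many patterns; this is just one.
PATTERN = re.compile(r"(\w+), such as ([^.]+)")


def extract_instances(text):
    """Yield (instance, category) pairs from 'X, such as A, B, and C' phrases."""
    for match in PATTERN.finditer(text):
        category = match.group(1).lower()
        # Split the list on commas and on a trailing "and".
        items = re.split(r",\s*(?:and\s+)?|\s+and\s+", match.group(2))
        for item in items:
            item = item.strip()
            if item:
                yield item, category


def belief(support, contradict, prior_a=1.0, prior_b=1.0):
    """Posterior mean of a Beta-Bernoulli model (assumed, not NELL's method).

    'support' and 'contradict' count mentions for and against a pair;
    Beta(prior_a, prior_b) is the prior over the pair being true.
    """
    return (support + prior_a) / (support + contradict + prior_a + prior_b)


if __name__ == "__main__":
    sentence = ("There are many people, such as George Washington, "
                "Bill O'Reilly, and Thomas Jefferson.")
    for instance, category in extract_instances(sentence):
        print(instance, "is one of the", category)
    # Text that does not match the English pattern yields no pairs.
    print(list(extract_instances("Il y a beaucoup de gens.")))
    print(belief(support=3, contradict=0))
```

Because the pattern is tied to English phrasing, non-English text simply produces no pairs, which matches the point made in the comment.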
I think AI needs a 3d imagination to know English by CrazyJim1 · 2010-01-16 07:44 · Score: 2, Interesting

Once a computer understands 3d objects with English names, it can then have an imagination to know how these objects interact with each other. Of course writing imagination space that simulates real life is exceedingly difficult and I don't see anyone doing it for several years if not a decade just to start.

--
God spoke to me.
while (1) by Lije+Baley · 2010-01-16 07:45 · Score: 2, Funny

Yeah, I've coded an infinite loop a few times, how come I never made the headlines on Slashdot?

--
Strange things are afoot at the Circle-K.
Pruning by NonSequor · 2010-01-16 07:46 · Score: 2, Interesting

In general I find that the quality of a data set tends to be determined by the number (and quality) of man hours that go into maintaining it. Every database accumulates spurious entries and if they aren't removed the data loses its integrity.
I'm very skeptical of the idea that this thing is going to keep taking input forever and accumulate a usable data set unless an army of student labor is press-ganged to prune it.

--
My only political goal is to see to it that no political party achieves its goals.
V*yger 2.0 ? by LifesABeach · 2010-01-16 07:54 · Score: 2, Interesting

The concept is intriguing, "Create a program that learns all there is to know, off the net." What amazes me is that others don't try the same thing. It doesn't take a team of A.I. types from Stanford to kick start this program. The cost is a Netbook, even Nigerian Princes could afford this. I'm trying to figure out how economic competitors could take advantage of this. I can see how the U.S.P.T.O. could use this to help evaluate prior art, and common usage. I'm thinking that an interface to a "Real World Simulator" would be the next step toward usefulness.
already been done by phantomfive · 2010-01-16 07:55 · Score: 4, Informative

There is simply no existing database to tell computers that "cups" are kinds of "dishware" and that "calculators" are types of "electronics." NELL could create a massive database like this, which would be extremely valuable to other AI researchers.
This is what they are trying to do, based on information they glean from the internet. It's already been done, with Cyc. The major difference seems to be that Cyc was built by hand, and cost a lot more. It will be interesting to see if this experiment results in a higher or lower quality database.

Also, I question their assertion that it would be extremely valuable to other AI researchers. Cyc has been around for a while now, and nothing really exciting has come of it. I'm not sure why this would be any different.

--
Qxe4
1. Re:already been done by blee37 · 2010-01-16 08:46 · Score: 2, Informative
  
  Cyc is a controversial project in the AI community, and I'm glad that you brought it up. I don't think anyone yet knows how to use a database of commonsense facts, which is what Cyc is (though limited - the open source version only has a few hundred thousand facts) and which is one thing NELL could create. However, researchers continue to think about ways that an AI could use knowledge of the real world. There are numerous publications based on Cyc: http://www.opencyc.org/cyc/technology/pubs.
2. Re:already been done by phantomfive · 2010-01-16 08:54 · Score: 4, Informative
  
  Oh this comment is beautiful for its confident ignorance.
  
  What you have done is identified a difference between the two systems, and then claimed that this difference is in some way significant. You do this without knowing the implications of the difference, without entirely understanding the difference, and without presenting any evidence that this particular difference matters at all. In short, you think you understand what matters, but in reality you don't.
  
  But fear not, you are in good company with your ignorance: this particularly pernicious fallacy is one that has plagued AI researchers for a long time. It happened with Cyc: the founders were sure that if we just had a database big enough, it would result in intelligent machines. They didn't know how, but they were sure it would.
  
  Before them there were master systems, neural networks (long story), natural language translation, and many more that I'm sure I'm forgetting. In all of these cases researchers were certain that their system held the key to vast wonders, only because they had not spent much time thinking about what they were actually trying to accomplish. In most of these cases it would have been obvious that human-level intelligence wasn't going to result, if they had spent more time investigating how the brain works and less time chasing their pet solution.
  
  In general if there is a vast field of ignorance between your method and your desired result, then you should probably spend more time researching, finding data points in that field of ignorance before trying to get to your result. Or in your case, since you present no evidence what difference 'developing on the internet' will make compared to 'developing by hand', you should go do a little searching and figure out what the actual difference will be, instead of randomly guessing.
  
  But since you are lazy and probably didn't read the article, I will give you one hint: this database populated from the internet seems to have a strong bias towards information about companies and sports teams. Who would have guessed that?
  
  --
  Qxe4
Re:do... by JWSmythe · 2010-01-16 08:33 · Score: 4, Funny

I think I see the problem with their code.

while (1){ read_the_web(); }; explain_everything();

All they've done is reproduce the typical office worker. It just sits around and surfs the net all day, without coming back with an answer.

--
Serious? Seriousness is well above my pay grade.
The quality of the teachers is important by Anonymous Coward · 2010-01-16 08:45 · Score: 2, Funny

I guess Bucket didn't get any choice where to go to school either.
Wikipedia by the+person+standing · 2010-01-16 09:05 · Score: 2, Funny

Let it read Wikipedia, not get poisoned by Twitter etc.!
Re:I think AI needs a 3d imagination to know Engli by Extremus · 2010-01-16 10:11 · Score: 2

Sorry for replying to myself. I forgot to finish my comment. In fact, this problem is related to the Symbol Grounding Problem. It addresses the issue of "grounding" symbols (like words) into their sensory representation, e.g., the symbol "triangle" into the raw pixel representation of a triangle. In the case of symbols about visual objects, some researchers used an intermediary 3d abstraction of sensory data, mapping the symbols to these intermediary representations. It has been a hot research topic since the '80s.
Re:Is there an IRC chat bot? by jellyfrog · 2010-01-16 18:25 · Score: 2, Informative

Bucket of #xkcd is on github: http://github.com/zigdon/xkcd-Bucket
Re:Is there an IRC chat bot? by Draykwing · 2010-01-16 20:23 · Score: 2, Informative

Well, Bucket's based on the (rather widespread) 'infobot' Perl program. The original infobot is hosted at http://sourceforge.net/projects/infobot/, but the XKCD variant of Bucket has a very detailed page showing the various interactions one can have with it, as well as a link to the Github page. See http://wiki.xkcd.com/irc/Bucket.