CMU Web-Scraping Learns English, One Word At a Time

Posted by timothy on Saturday January 16, 2010 @07:18AM from the hao-ubowt-hahmnimz dept.

blee37 writes "Researchers at Carnegie Mellon have developed a web-scraping AI program that never dies. It runs continuously, extracting information from the web and using that information to learn more about the English language. The idea is for a never-ending learner like this to one day become conversant in the English language." It's not that the program couldn't stop running; the idea is that there's no fixed end-point. Rather, its progress in categorizing complex word relationships is the object of the research. See also CMU's "Read the Web" research project site.
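
To make "categorizing word relationships" concrete, here is a minimal Python sketch of a store for "X is a kind of Y" facts, using the cups/dishware and calculators/electronics examples quoted in the comments. The class name, method names, and the extra "household item" fact are assumptions made up for illustration; this is not how NELL or Cyc actually represent knowledge.

```python
from collections import defaultdict


class CategoryFacts:
    """Toy store of "X is a kind of Y" facts (hypothetical; not NELL's or Cyc's format)."""

    def __init__(self):
        # Maps each item to the set of categories it directly belongs to.
        self._parents = defaultdict(set)

    def add(self, item, category):
        """Record the direct fact that `item` is a kind of `category`."""
        self._parents[item].add(category)

    def is_a(self, item, category):
        """Return True if `category` is reachable from `item` via is-a links."""
        seen = set()
        stack = [item]
        while stack:
            current = stack.pop()
            # .get avoids creating empty entries in the defaultdict on lookup.
            for parent in self._parents.get(current, ()):
                if parent == category:
                    return True
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return False


facts = CategoryFacts()
facts.add("cup", "dishware")
facts.add("calculator", "electronics")
facts.add("dishware", "household item")  # illustrative assumption, not from the article

print(facts.is_a("cup", "household item"))
print(facts.is_a("cup", "electronics"))
```

Storing the facts is the easy part. Gathering them reliably from web text, and finding a use for them afterwards, is what the comments argue about.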

Re:Will be this article read by that program? by sznupi · 2010-01-16 07:36 · Score: 4, Informative

Robots are destined to rule the world, destroying all humans is a good thing.

--
One that hath name thou can not otter
Re:Finally, people are getting AI right. by Anonymous Coward · 2010-01-16 07:42 · Score: 5, Informative

You're advocating the "emergent intelligence" model of AI, where intelligence "somehow" is created by the confluence of lots of data. This has been a dream since the concept of AI started and is the basis for numerous movies with an AI topic. In practice the degrees of freedom which unstructured data provides far exceed the capability of current (and likely future) computers. It is not how natural intelligence works either: The structure of neural networks is very specifically adapted to their "purpose". They only learn within these structural parameters. Depending on your choice of religion, the structure is the result of divine intervention or millions of years of chance and evolution. When building AI systems, the problem has always been to find the appropriate structure or features. What has increased is the complexity of the features that we can feed into AI systems, which also increases the degrees of freedom for a particular AI system, but those are still not "free" learning machines.
already been done by phantomfive · 2010-01-16 07:55 · Score: 4, Informative

There is simply no existing database to tell computers that "cups" are kinds of "dishware" and that "calculators" are types of "electronics." NELL could create a massive database like this, which would be extremely valuable to other AI researchers.
This is what they are trying to do, based on information they glean from the internet. It's already been done, with Cyc. The major difference seems to be that Cyc was built by hand, and cost a lot more. It will be interesting to see if this experiment results in a higher or lower quality database.

Also, I question their assertion that it would be extremely valuable to other AI researchers. Cyc has been around for a while now, and nothing really exciting has come of it. I'm not sure why this would be any different.

--
Qxe4
1. Re:already been done by blee37 · 2010-01-16 08:46 · Score: 2, Informative
  
  Cyc is a controversial project in the AI community, and I'm glad that you brought it up. I don't think anyone yet knows how to use a database of commonsense facts, which is what Cyc is (though limited - the open source version only has a few hundred thousand facts) and which is one thing NELL could create. However, researchers continue to think about ways that an AI could use knowledge of the real world. There are numerous publications based on Cyc: http://www.opencyc.org/cyc/technology/pubs.
2. Re:already been done by phantomfive · 2010-01-16 08:54 · Score: 4, Informative
  
  Oh, this comment is beautiful for its confident ignorance.
  
  What you have done is identified a difference between the two systems, and then claimed that this difference is in some way significant. You do this without knowing the implications of the difference, without entirely understanding the difference, and without presenting any evidence that this particular difference matters at all. In short, you think you understand what matters, but in reality you don't.
  
  But fear not, you are in good company with your ignorance: this particularly pernicious fallacy is one that has plagued AI researchers for a long time. It happened with Cyc: the founders were sure that if we just had a database big enough, it would result in intelligent machines. They didn't know how, but they were sure it would.
  
  Before them there were expert systems, neural networks (long story), natural language translation, and many more that I'm sure I'm forgetting. In all of these cases researchers were certain that their system held the key to vast wonders, only because they had not spent much time thinking about what they were actually trying to accomplish. In most of these cases it would have been obvious that human-level intelligence wasn't going to result, if they had spent more time investigating how the brain works and less time chasing their pet solution.
  
  In general, if there is a vast field of ignorance between your method and your desired result, then you should probably spend more time researching and finding data points in that field of ignorance before trying to get to your result. Or in your case, since you present no evidence of what difference 'developing on the internet' will make compared to 'developing by hand', you should go do a little searching and figure out what the actual difference will be, instead of randomly guessing.
  
  But since you are lazy and probably didn't read the article, I will give you one hint: this database populated from the internet seems to have a strong bias towards information about companies and sports teams. Who would have guessed that?
  
  --
  Qxe4
Re:Is there an IRC chat bot? by jellyfrog · 2010-01-16 18:25 · Score: 2, Informative

Bucket of #xkcd is on GitHub: http://github.com/zigdon/xkcd-Bucket
Re:Is there an IRC chat bot? by Draykwing · 2010-01-16 20:23 · Score: 2, Informative

Well, Bucket's based on the (rather widespread) 'infobot' Perl program. The original infobot is hosted at http://sourceforge.net/projects/infobot/, but the XKCD variant of Bucket has a very detailed page showing the various interactions one can have with it, as well as a link to the GitHub page. See http://wiki.xkcd.com/irc/Bucket.