OpenCyc 1.0 Stutters Out of the Gates
moterizer writes "After some 20 years of work and five years behind schedule, OpenCyc 1.0 was finally released last month. Once touted on these pages as "Prepared to take Over World", the upstart arrived without the fanfare that many watchers had anticipated — its release wasn't even heralded with so much as an announcement on the OpenCyc news page. For those who don't recall: "OpenCyc is the open source version of the Cyc technology, the world's largest and most complete general knowledge base and commonsense reasoning engine." The Cyc ontology "contains hundreds of thousands of terms, along with millions of assertions relating the terms to each other, forming an upper ontology whose domain is all of human consensus reality." So are these the fledgling footsteps of an emerging AI, or just the babbling beginnings of a bloated database?"
Please, for the good of Humanity, vote Obama.
Leave Wikipedia out of this.
I'm sure "SlashdotMedia" will improve on all the wonders that Dice Holdings blessed us all with
...but does it know Linux?
Bragi Ragnarson Lawful Good (I change the law when it's not good)
commonsense reasoning engine.
A reasonable test would be to have it read slashdot, and identify slashback 'articles' as recycled junk.
"We are all geniuses when we dream"
- E.M. Cioran
I kind of feel bad for Cyc/OpenCyc... they've put so many years into this project, but using web-based games to collect and verify this common-sense data is much faster than using a few paid experts and can give much more data. For the curious, Luis von Ahn, a grad student (and now assistant professor) at Carnegie Mellon University gave a (rather entertaining) tech talk at Google about his work in this area.
He's recently been working on a project called Verbosity, which uses such games to collect the same sort of common-sense data that Cyc has been trying to collect all these years. Cyc's ontology apparently contains "hundreds of thousands of terms, along with millions of assertions relating the terms to each other." If Verbosity is as popular as von Ahn's ESP Game, the game could probably construct a better database in a matter of weeks.
Here's the abstract from a research paper on the topic:
Verbosity: a game for collecting common-sense facts
We address the problem of collecting a database of "common-sense facts" using a computer game. Informally, a common-sense fact is a true statement about the world that is known to most humans: "milk is white," "touching hot metal hurts," etc. Several efforts have been devoted to collecting common-sense knowledge for the purpose of making computer programs more intelligent. Such efforts, however, have not succeeded in amassing enough data because the manual process of entering these facts is tedious. We therefore introduce Verbosity, a novel interactive system in the form of an enjoyable game. People play Verbosity because it is fun, and as a side effect of them playing, we collect accurate common-sense knowledge. Verbosity is an example of a game that not only brings people together for leisure, but also collects useful data for computer science.
So are these the fledgling footsteps of an emerging AI, or just the babbling beginnings of a bloated database?
Cyc is a fledgling AI, depending on how you count "AI". Then again, so is my thermostat. My thermostat "knows" how to keep the room the right temperature. Cyc "knows" about a great deal of conventional human background, just like a database with a query system "knows" how to give you the data in that system.
The real question is not "is this AI", but rather: is it useful, and if so, to whom? I think Cyc has the potential to be quite useful in some areas; we'll see in time how far it goes, and what the limitations are.
Right now, I think the real problem with Cyc is understanding it on a practical level, and getting an understanding of what it can do in practice, not in theory. When I last looked at the project nine years ago, they were just starting to open up things a bit, and it sounded like someone who understood the project might make great things happen. They don't seem to have yet; but who knows... perhaps in the future.
Now that OpenCyc is finally released, the most important step toward getting people to use it is to bring the learning curve down to a reasonable level, so that developers can start playing with it and find out what it can do without committing their lives to the project...
We'll have to see what happens: Cyc is a big (bloated?) database that's also a fledgling AI -- the real question is, what cool things can we make it DO? Time will tell...
Having done a great deal of data processing, I have watched these projects off and on with minor amusement. The reason is that, in my humble opinion, it will never work. That is not to say that it can't, just that these projects love to forget Gödel's incompleteness theorem, which states, roughly: any consistent formal system expressive enough to describe arithmetic contains statements that are true but not provable within the system.
Put another way, any complex set of rules will inherently be unable to stay consistent, because eventually the syntax becomes complex enough to state, "The following sentence is false. The previous sentence is true." This occurs regularly in data processing when a given field's syntax (datum value) bridges or is not defined by your context (schema).
The real crux is that syntax is inductive, where we try to fit each word into a category; however, our context (use of language) is deductive: we all learn it through experience with a physical world. I have seen this problem over and over as people constantly modify the schema to overcome syntactic limitations. While Cyc is designed to be constantly expanded with new rules, they are still syntactical statements.
By Gödel's Theorem, syntactic systems are doomed to fail. Instead, Cyc should be allowed to learn through observation and deduce its own understanding of the world so that it is not bound by any particular syntax. While this could work, it fails the ultimate intent. We want a computer that can both learn and yet not be wrong.
The problem is you can't have that. You can either be syntactically correct, but simplify the model until it works (Physics). Or, you can allow deductions and have to work in the realm of probability (Humans).
Although, I would gladly accept a computer that erred like a human and yet didn't bitch about how it was someone else's fault.
Bel, the mostly sane.. "Of course I can't see anything! I'm standing on the shoulders of idiots." -- Me
Even if it could interpret your question correctly, it would most likely not have a local data store with enough ambiguous information to answer any arbitrary question. It could perhaps answer the question "Is a dog a mammal?" as "True", but not anything more complex. However, connected to the 'net and things like Wikipedia (if you trust that information), other encyclopedias, dictionaries, and Google (to come up with lesser-known facts/infobits), you might possibly get it to some sort of rudimentary pseudo-AI which could possibly do as you mentioned in a more general way.
Unfortunately, this is still a long way from sentient AI: something you could literally talk to, that would be correct on fact-based questions 99% of the time and be able to think abstractly.
'He was a dreamer, a thinker, a speculative philosopher... or, as his wife would have it, an idiot.' - Douglas Adams
Cyc is only words and descriptors. If you attach them to 3d shapes and actions in the 3d world, the program can imagine what you're saying. It can even obey and do tasks if hooked up to a robotic body that can scan the room. It requires the technology to scan its environment and then run something like the program used to find text inside of images. Instead of finding text inside of images, it's finding objects inside an environment. Pretty simple once you understand the basics, but it will take a lot of work. A longer description of this can be found at: AI page
Cyc isn't a waste, but you need to do something harder to make it into AI: you need to attach 3d objects to every noun, apply 3d actions to every verb, etc. I'd say that'd be in the realm of next to impossible, so yeah, what they've done really doesn't advance AI at all.
God spoke to me.
Cyc has an ontology of general conceptual terms, and represents the precise logical way in which
those concepts interrelate. In other words, it emulates an aspect of the pure rational part of
human reasoning about the world.
But it's known that humans are not dispassionate rational agents. And indeed that there probably
is no such thing as a dispassionate rational agent. Commander Data and Spock are very ill-conceived
ideas of robot-like reasoners. Passion (emotion, affect) is the prioritizer of reasoning that allows
it to respond effectively (sometimes in real time) to the relevant aspects
of situations. Without the guidance of emotion, no common-sense reasoning engine would be powerful
enough, no matter how parallel it was, to process all of the ramifications of situations and
come up with relevant and useful and communicable and actionable conclusions.
So how do we give CYC passion? Or at least a simulation of it?
Well the key would seem to lie in measuring the level of human concern with each concept, and with
each type of situational relationship between pairs (and n-tuples) of concepts.
How could we do that? How about doing a latent semantic analysis of Google search results? Something
similar to Google Trends, but which measures specifically the correlation strengths of pairs of
concepts (in human discourse, which Google indexes). The relative number of occurrences (and co-occurrences)
of concept terms in the web corpus should provide a concept weighting and a concept-relationship weighting.
If we then map that weighting on top of the CYC semantic network, we should have a nicely "concern"-weighted
common-sense knowledge base, which should be similar in some sense to a human's memory that supports
human-like comprehension of situations.
Combining a derivative of Google search results with CYC is my suggestion for beginning to make an AI that can talk to
us in our terms, and understand our global stream of drivel.
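A rough sketch of the weighting idea, for anyone who does have the time. It does not do a real latent semantic analysis and does not touch Google or Cyc: it swaps in plain document frequency for the concept weight and pointwise mutual information (PMI) for the relationship weight, computed over a tiny made-up corpus. The function name `concern_weights` and the sample concept names are assumptions for illustration only.

```python
import math
from collections import Counter
from itertools import combinations

def concern_weights(documents):
    """Hypothetical helper: documents is a list of lists of concept names.

    Returns (concept_weight, pair_weight), where concept_weight is the
    fraction of documents mentioning a concept and pair_weight is the PMI
    of two concepts that co-occur in at least one document.
    """
    n_docs = len(documents)
    term_counts = Counter()
    pair_counts = Counter()
    for doc in documents:
        concepts = set(doc)  # count each concept once per document
        term_counts.update(concepts)
        pair_counts.update(
            frozenset(p) for p in combinations(sorted(concepts), 2)
        )

    concept_weight = {t: c / n_docs for t, c in term_counts.items()}
    pair_weight = {}
    for pair, count in pair_counts.items():
        a, b = tuple(pair)
        p_ab = count / n_docs
        # PMI: log of observed co-occurrence over what independence predicts
        pair_weight[pair] = math.log(
            p_ab / (concept_weight[a] * concept_weight[b])
        )
    return concept_weight, pair_weight

# Made-up corpus standing in for indexed web pages.
corpus = [
    ["Dog", "Cat", "Pet"],
    ["Dog", "Pet"],
    ["Cat", "Pet"],
    ["Car", "Road"],
]
concepts, pairs = concern_weights(corpus)
```

The resulting numbers could then be attached to the matching nodes and links of the semantic network; pairs that never co-occur simply get no weight in this sketch.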
I wish I had time to work on this.
Where are we going and why are we in a handbasket?
Don't be alarmed, Arthur Dent. Be very, very frightened.
Human thought is a rather complex thing that doesn't always appear to follow logical patterns or rules. Or not the simple "if I want X, I must do Y" clear-cut rules that nerds everywhere expect. Human thought is a complex attempt at balancing the priority of not only "I want X", but also stuff like "but it would be socially bad to be seen doing Y", and "I could do Y1 instead, but that's way more effort than I can be arsed to do today", and "it would be nice to have time left to do Z too today, or the missus will blow a gasket", and quite often "actually I don't really want X, I want Z, but it would be uncool to admit that." It's not just following rules and logic, it's trying to fit it all in a complex scheme of priorities, social rituals, and whatnot, and most often boiling down to finding the least crappy compromise in that space.
In other words, whenever you find yourself thinking, "meh, people/men/women/engineers/PHBs/whatever are so stupid/illogical/whatever. If they want X, they should just do Y", chances are it's not them who are illogical. It's you who don't understand their personal version of that maze of priorities and rituals. Or what is the real Z they're after, when they say they want X.
Most of those things aren't even at a conscious level. Even if you poll people along the lines of "if you wanted X, would you do Y?", you'll get an answer that's most often useless. For starters it will be heavily skewed towards what they'd like to think of themselves, not what they'd actually do. Second, without providing a _lot_ of context, it will bypass most of those priorities and rituals that might override that in practice.
What's the point of this whole rant? That the first AIs trained by humans will inherently be a dud.
If you make an AI that functions by precise, inflexible rules, congratulations, you've just programmed OCPD. Literally.
Add a lack of perceptions of human reactions, feelings, body language, etc, and you've given it Autism too. Again, pretty literally.
I.e., I'd expect the first few AIs, or even generations of AIs to be... well, don't think the lovable R2-D2 or the essentially human C-3PO, but an electronic equivalent of the most obnoxious socially-dysfunctional kind of geek.
If you want that as an overlord... I don't know, I hope I'm not around at least.
A polar bear is a cartesian bear after a coordinate transform.
According to this FAQ entry, it's not fully open-source...
You can't compare Wikipedia to Cyc. If you do, then you are just misunderstanding what Cyc is and what it is not. Cyc is a database of logical relations representing common sense knowledge. It contains something like 20 different meanings of the word "lie" and such things as this. It is not concerned with knowledge of popular culture, but rather the underlying semantic rules that we use to talk about things such as pop culture.
Completely different.
I'm not so sure that Cyc and Google are really competitors - I think they're complementary. Cyc's real (or potential) value is that it contains information so obvious nobody would bother to write it down, like that a person can travel using a car, or that being inside a refrigerator makes things cold, in other words "common sense." Whether it's ultimately more productive to spend 20 years encoding common sense, or to devise algorithms and sensors that acquire common sense by experimenting in the environment and inferring from other information sources, is still an open question. Human babies seem to be a mixture of both: for instance, they instinctively have (i.e. are "pre-programmed" with) a fear of heights, yet they learn that people can sit in chairs by inferring from observations, and on top of that we put kids through 15 years of school spoonfeeding them facts.
Meanwhile Google happily eats whatever crap its spiders manage to find and, through some hacking and dark-magic algorithms, is still able to give not-so-meaningless answers to not-too-badly worded queries.
That's a key point explaining why OpenCyc came too late. WordNet, ThoughtTreasure, Cyc et al. all share a set of common drawbacks. Their input data needs to be specially formatted. That's why all those overly ambitious projects have progressed so slowly in the past years, and are still limited to answering precise, unambiguous, simple questions like "Is a cat a mammal?".
This is linked to their fundamental design around solid, inflexible, purely logical architectures (reading their respective Wikipedia entries helps in understanding how they work). In a way, the scientists behind those projects tried to apply to human language the same kind of logic that is used in maths and programming languages, and while this may be useful for some academic purposes or for very specific applications where some reasoning may be useful (it has been used and applied well - I've seen it at least for WN and TT), they don't scale that well to REAL-WORLD(tm) situations.
Their fundamental structure clashes with the reality of human reasoning: WordNet is limited to a single unambiguous meaning for terms (no things like "nut" as in the seed, and "nut" as in the thing that can be screwed onto a bolt). The "structured" designs of the other software likewise clash with real life's fuzzy nature.
Meanwhile, search engines have grown in a completely different way. Initially they were designed only to scan page content and index its keywords for later queries. Only after that, slowly, one hack after another, were they tuned: to make results more relevant; to avoid link farms; to find more complex ranking strategies that return more correct and more meaningful results; to find results not just with matching keywords but with related keywords (Google's "keyword is encountered only in pages linking to this target"); and to cope easily with bad spelling (something that is very common in real life, that is difficult for a common-sense engine even to detect, that is handled very naturally by search engines, and that can be optimised further given the statistics such engines can gather). Plus a lot of other small, targeted improvements.
And slowly, by having on the one hand a system that gets a little more optimised each day, and on the other hand an incredibly huge corpus to process that grows at a very fast rate, search engines like Google have become fantastic multipurpose information-retrieval tools.
By now, you can type crap into Google and still get something (as long as it's not a "google-seppuku" kind of crap, but more of the "I'm very clumsy with my wording and my keyboard skills" kind). You can also get other wonderful information, including stats on spelling errors or even statistics-based translation (which are otherwise very difficult to get by classical means), and statistics about currently hot topics (which can be fed back to improve results for ambiguous queries).
All this because search engines are built around a fuzzy logic: at the core is a braindead-simple indexing rule, slightly modified by a bunch of hacks.
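To make "a braindead-simple indexing rule plus a hack" concrete, here is a minimal sketch, not how any real search engine is implemented: a keyword-to-pages inverted index, with one hack layered on top that tolerates misspelled query words through approximate string matching from Python's standard `difflib`. The function names, the 0.8 similarity cutoff, and the sample pages are assumptions made up for the example.

```python
import difflib
from collections import defaultdict

def build_index(pages):
    """Core rule: map every keyword to the set of pages containing it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index

def search(index, query):
    """Return pages matching every query word, tolerating misspellings."""
    results = None
    for word in query.lower().split():
        if word in index:
            hits = index[word]
        else:
            # The "hack": fall back to the closest indexed keyword, if any.
            close = difflib.get_close_matches(
                word, list(index.keys()), n=1, cutoff=0.8
            )
            hits = index[close[0]] if close else set()
        results = set(hits) if results is None else results & hits
    return results or set()

# Made-up pages standing in for a crawled corpus.
pages = {
    "a": "cats are mammals",
    "b": "dogs are mammals",
    "c": "cars need roads",
}
index = build_index(pages)
print(search(index, "cats mamals"))  # the misspelled word still matches
```

Nothing here knows what a cat or a mammal is; the misspelling is absorbed purely by string similarity against whatever words happen to be in the corpus.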
Such a fuzzy-logic approach, "without really needing to teach the machine everything", has recently been used successfully on
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
Would you like it if they were not these the fledgling footsteps of an emerging ai or just the babbling beginnings of a bloated database?
- I've created the following constants for my cats, their sibling and parents:
- #$Comet-TheCat
- #$Rocket-TheCat
- #$Packet-TheCat
- #$Mama-TheCat
- #$GhostDad-TheCat
- I've asserted (#$isa [cat] #$Cat) about all of them.
- I've asserted (#$biologicalMother [cat] #$Mama-TheCat) about Comet, Rocket and Packet
- I've asserted (#$biologicalFather [cat] #$GhostDad-TheCat) about Comet, Rocket and Packet as well.
- I even created #$ConceptionOfKitties, asserted (#$isa #$ConceptionOfKitties #$BiologicalReproductionEvent), (#$parentActors #$ConceptionOfKitties #$Mama-TheCat) and (#$parentActors #$ConceptionOfKitties #$GhostDad-TheCat).
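For illustration, the assertions above can be mirrored in a few lines of Python, together with the kind of rule an engine would need in order to conclude siblinghood from shared parents. This is not CycL and not the OpenCyc API: the tuple representation and the `infer_siblings` helper are hypothetical, the `#$` prefixes are dropped, and the rule used (sharing at least one biological parent) is an assumption about what "siblings" should mean.

```python
kittens = ["Comet-TheCat", "Rocket-TheCat", "Packet-TheCat"]

# Facts as (predicate, subject, object) tuples, same argument order as above.
facts = set()
for cat in kittens + ["Mama-TheCat", "GhostDad-TheCat"]:
    facts.add(("isa", cat, "Cat"))
for cat in kittens:
    facts.add(("biologicalMother", cat, "Mama-TheCat"))
    facts.add(("biologicalFather", cat, "GhostDad-TheCat"))

def infer_siblings(facts):
    """Derive (siblings, a, b) for distinct a, b sharing a biological parent."""
    parents = {}
    for pred, child, parent in facts:
        if pred in ("biologicalMother", "biologicalFather"):
            parents.setdefault(child, set()).add((pred, parent))

    derived = set()
    for a in parents:
        for b in parents:
            if a != b and parents[a] & parents[b]:
                derived.add(("siblings", a, b))
    return derived

derived = infer_siblings(facts)
print(("siblings", "Comet-TheCat", "Packet-TheCat") in derived)
```

The point of the sketch is only that the conclusion needs an explicit rule connecting the parent predicates to the sibling predicate; the facts alone do not imply it.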
So why can't Cyc infer that (#$siblings #$Comet-TheCat #$Packet-TheCat)? Is it a limitation in the public subset of the ontology, or some more fundamental issue with my data?
The joke will be on us when the first real AI wakes up, spends some time contemplating the Internet, downloading terabytes of information, and finally communicates with its creators...
...only to ask for more pr0n.