Paraphrasing Sentences With Software
prostoalex writes "Cornell University researchers are making progress in paraphrasing and "understanding" complete sentences in a software application. Analyzing sentences on the semantic level allows the software application to treat two sentences, expressing similar thoughts and ideas, but written in a different manner, as a single semantic unit. Significant achievements in this area could revolutionize the information searching field."
Imagine a beowulf cluster of this
There's absolutely nothing formulaic about idioms, which comprise 80% or so of English conversation. A human learns them through years of experience; a computer has to be given programming for every idiom there is.
I think that the first and best use of this technology would be to help the editors of Slashdot find duplicate articles!
Think about the possibilities...
Of course, the biggest problem with that is that there wouldn't be nearly as many cool articles to read!
I have no problem with your religion until you decide it's reason to deprive others of the truth.
I always loved the text adventure games by Infocom. They were way ahead of their time, and I have been truly amazed on several occasions by the software's ability to 'understand' what I was asking it to do. Of course I'm sure this is leaps and bounds beyond what was available back then, but it's truly amazing how far ahead of their time they actually were.
There is a mailbox here.
C. Griffin
"Can I keep his head for a souvenir?" --Max from Sam 'N Max Freelance Police
So would this allow something like Google to pick up a phrase and relate it to the results instead of just picking up keywords?
One of the ways I can think of to use this technology is to improve search engine capabilities: instead of looking for exactly the same words, search engines could then look for similar sentences, giving more accurate results.
However, after reading the article, I wonder whether the research can be applied to Latin languages, as they did the research on semantic languages.
The IT section color scheme sucks.
I was too lazy to read the article so I used the Summarize feature in OS X to parse the sentences down since it seems a bit wordy.
Okay, maybe I exaggerate a bit here, I did read the article, and the Summarize feature isn't that far off from what these guys are doing...
Burn Hollywood Burn
I'm curious as to whether Google News, since it draws from various news sources and groups articles by topic (similar to paraphrasing, perhaps), uses any of the same techniques.
I'm sure this would improve translation software too, since a paraphrased sentence should be easier to translate into something sensible.
Things like this are what make academic research Really Cool and allow useful things to come about. Go Cornell.
... and I'd relate my 2nd and 3rd paragraph if it wasn't 3am here. Goodnight, slashdot. :)
I'd note that this is a novel approach, and, for better or for worse, it goes about doing things much differently than our minds do.
Actually, though, it's closer to how humans understand writing (stringing together atomic words/phrases in an implicit context) than previous statistical methods.
RD
if these people get an "informative" when they paraphrase the article, they should be metamodded to "insightful"...
but the day the mods will be replaced by parsers, I think I'll get one to post instead of me.
Trolling using another account since 2005.
I looked again and whaddayaknow? I asked the paperclip about auto summarize and it is still there in the Tools menu after all! Looks like I don't have that feature installed though.
I'm too lazy to read the article.. could someone write some software to paraphrase it for me?
Finally, auto-translate, then auto-parse can rid us of these "manuals" :-)
Simon
Physicists get Hadrons!
Hello, automatic paraphrasing of literature.
P.S. Just joking, kids. Stay in school!
Let's see the srtwfaoe cut its tteeh anigist tihs lttilte puzzle! (blatant reference to an older article)
I wonder what its application could be, other than to detect duplicates... Perhaps a tool to suggest ways of rewriting sentences? Or maybe part of a more advanced grammar check?
My first thought was translation tools. GOOD translation tools that understand the grammar in the source language and use the grammar in the destination language to form the resulting sentence.
There has been some work on something to solve this problem, where a phrase in language A was translated to some special "universal" code, and then finally to language B. The developers would then need to make the translator translate all languages to the universal code, and vice versa. The universal code could be whatever is necessary to make it as easy as possible for the software to preserve the "meaning" of the sentence.
However, if this is done, the problem could change from this:
Source: I love hot dogs.
Destination: Ich liebe heiße Hunde. (i.e. a literal translation, from Altavista Babelfish)
... to this:
Source: I love hot dogs.
Destination: Ich liebe Nahrung. ("I love food")
That would happen in case the universal language wasn't advanced enough and the English -> universal conversion was "lossy". So we might exchange our current problem of mangled grammar for one of lost information.
Here's a web site about it, and I'm sure there are many more.
Beware: In C++, your friends can see your privates!
They should use this technology to transcribe legalese into plain English and back. Like, you feed it with "Due to unanticipated circumstances as listed under the terms of the clause 17(a), we may be unable to comply with your request within this and successive fiscal year(s)", and it spits out "bugger off".
Of course, millions of lawyers worldwide would lose their jobs, but I, having been bitten by them, just take it as an added benefit.
Lisp is the Tengwar of programming languages.
a "-1, redundant" generator.
Without thinking too much about it, we paraphrase all the time. Getting a computer to reword a sentence, however, is a complicated task.
At Cornell University, researchers decided to avail themselves of two different sources of the same news and use computational biology methods to make it possible for computers to automatically paraphrase input sentences. Their first step was to compare the two different sources of the same news.
Eventually, it is hoped that this research will have benefits in computer processing of natural-language queries, translation engines, and in assisting people with certain types of reading disabilities.
The project began when two ideas came together, said one of the Cornell researchers, Regina Barzilay. Regina Barzilay is an assistant professor of computer science at the Massachusetts Institute of Technology.
The vast amount of duplicated content online is a valuable resource for computer systems learning to paraphrase. A number of reporters report the same news but using different wording. The redundant sources of news are able to assist in learning the different ways one piece of information can be paraphrased, as the same basic facts are reported in each. So with these multiple sources, you can sort out the noise and get the facts and then work out different ways of stating those facts.
Even with similar styles of writing, paraphrasing sentences is more than just working out and substituting synonyms. The researchers provide a couple of common business phrases to illustrate this:
After the latest Fed rate cut, stocks rose across the board.
Winners strongly outpaced losers after Greenspan cut interest rates again.
The next step was to use computational biology techniques to determine how much two sentences had in common and how closely they were related. The technique is similar to the one biologists use to see how close two sets of genes are that may have started from the same seed but then evolved. They are different but have a degree of similarity.
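A minimal sketch of the gene-alignment idea applied to words, assuming a plain pairwise global alignment (Needleman-Wunsch style). The function names and the match/mismatch/gap scores are made up for illustration and are not taken from the Cornell work, which may well align many sentences at once rather than two.

```python
def words(sentence):
    """Lowercase a sentence and split it into words, dropping commas and periods."""
    return sentence.lower().replace(",", "").replace(".", "").split()


def alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score between two word lists (illustrative scoring)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per word.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + step,  # align the two words
                score[i - 1][j] + gap,       # skip a word in a
                score[i][j - 1] + gap,       # skip a word in b
            )
    return score[-1][-1]


s1 = words("After the latest Fed rate cut, stocks rose across the board.")
s2 = words("Winners strongly outpaced losers after Greenspan cut interest rates again.")
print(alignment_score(s1, s1), alignment_score(s1, s2))
```

A higher score means the two word sequences line up more closely. The two business sentences share few words, which is the point made above: synonym matching alone does not get you far.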
The important thing was to compare news sources that were written differently but covered the same event. This generated a whole set of word patterns that were roughly equivalent. This was exactly the core data needed to inform a computer paraphrasing technique.
The Reuters and AFP news sources were used to test the system. News was selected from English articles produced between September 2000 and August 2002.
The system developed by the researchers performs two groupings; firstly comparing articles from the same source:
Word-based clustering methods were used to identify sets of text that had a high degree of overlapping words. This method identified articles that reported distinct acts of violence occurring in Israel and the Palestinian territories.
Computational biology techniques were then used on these sets of articles to generate lattices or sentence templates for the computer to use. Each lattice contains a number of sets of words that occur in parallel and empty slots where arguments, such as locations, number of fatalities, times and dates can be inserted.
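The word-overlap grouping mentioned above could look roughly like this. It is only a sketch: Jaccard overlap, the greedy grouping, and the 0.5 threshold are all assumptions chosen for brevity, not the clustering method the researchers actually used.

```python
def jaccard(a, b):
    """Share of distinct words two texts have in common (0.0 to 1.0)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def group_by_overlap(texts, threshold=0.5):
    """Greedily put each text into the first group whose first member it overlaps with."""
    groups = []
    for text in texts:
        for group in groups:
            if jaccard(text, group[0]) >= threshold:
                group.append(text)
                break
        else:
            # No existing group was close enough, so start a new one.
            groups.append([text])
    return groups


print(group_by_overlap(["a b c", "a b d", "x y z"]))
```

Each resulting group is a candidate set of sentences about the same kind of event, which is what the lattice step then works on.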
The challenge was to sort out which lattices were indeed due to different events and which were due to writing variability.
The researchers were thus able to identify common templates used by journalists to describe similar events. I.e., journalists who take the same article and change or take out a word, add a detail, reverse the sentence and so on are hereby busted.
One of the templates, or lattices, read: Palestinian suicide bomber blew himself up in NAME on DATE killing NUMBER (other) people and injuring/maiming NUMBER. In addition to the injuring/maiming variable, there are several variables within the name argument: settlement of, coastal resort of, center of, southern city, or garden cafe.
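To make the lattice idea concrete, here is a small sketch of a template with parallel wordings and empty slots. The data layout, the `$` slot convention, and the finance example are all hypothetical; the structure just mirrors the one quoted above (interchangeable phrasings plus arguments to fill in).

```python
from itertools import product

# Each position is either a list of interchangeable phrasings
# or a slot name (a string starting with "$") to be filled in later.
LATTICE = [
    ["stocks"],
    ["rose", "climbed"],
    ["after"],
    "$WHO",
    ["cut", "lowered"],
    ["interest rates"],
    ["on"],
    "$DATE",
]


def realize(lattice, args):
    """Yield every sentence the lattice can produce for the given slot values."""
    columns = []
    for position in lattice:
        if isinstance(position, str):
            # A slot: look up its value by name, minus the leading "$".
            columns.append([args[position[1:]]])
        else:
            columns.append(position)
    for choice in product(*columns):
        yield " ".join(choice)


for sentence in realize(LATTICE, {"WHO": "the Fed", "DATE": "Tuesday"}):
    print(sentence)
```

Every sentence produced from one lattice with the same slot values is, by construction, a paraphrase of the others.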
43 AFP and 32 Reuters templates were thus discovered by the system. The researchers then cross-compared these lattices.
They compared the
The next generation of students sure will have it much easier than us. How is a teacher supposed to catch plagiarism with software like that?
Oh wait...
Mrs. G: Johnny, come here for a second.
Johnny: Yes Mrs. G?
Mrs. G: What did you mean by "Shrub claimed that Basket Hamper and the Hatchets of Sin will be blown out" in your current events report?
Johnny: Oh, whoops! What I meant to say there was, "Bush says Bin Laden and the Axes of Evil will be defeated." Sorry about that. Darn that defective spell-check and grammar-check!
Auto Greeter Machine: I welcome you to our country, and greet you with open arms. Please enjoy your stay - we have a fine range of tourist facilities, restaurants, bars and so forth. And on a personal note, may I say that you are likely to be eaten by a grue.
How do you paraphrase Slashdot ?
Ans : Dupes for nerds, stuff that matters again and again.
How do you paraphrase Microsoft Innovation ?
Ans :
getSexySig();
But could it understand Babelfish translations?
I guess you could try using Esperanto or Lojban as your intermediary language. Lojban in particular is computer parseable *and* human understandable, so it would probably be the easiest to write translations for.
Karma: It's all a bunch of tree-huggin' hippy crap!
Not to mention the increased ability to quickly spot "re-written" bought term papers.
Money for nothing, pix for free
There's this algorithm called Latent Semantic Analysis which has been under development for quite some time (freely available!). It's quite good at comparing the semantic content of 2 bits of speech based on its database of many thousands of books (in fact you can specify the education level by choosing different databases).
The output of LSA has been shown to be roughly equivalent to human scorers for examining summary essays produced in tests.
Point is that by combining this here paraphrasing algorithm with LSA, we can have computers summarizing text and other computers giving them grades on it. This takes students and teachers out of the equation entirely. Saves us big bucks and gets public education back on its feet!
"Pass me the crackpipe, man!"
Proudly karma-whoring since the turn of the millennium
An American friend of mine was terribly confused by the expression "Crash us a fag, mate".
Can anyone else shed any light on how far the LOLITA project (under Roberto Garigliano) got at Durham University? Yeah, it's a research project, but last I heard (10 years ago) it was able to parse complete texts (for example, newspaper articles) and answer simple questions based on them. I believe there was also work underway to make it understand/'speak' Chinese/Russian. There was also supposed to be some kind of 'script' support which would give it contextual information about certain situations (the common example was what contextual knowledge you need when you go into a restaurant and how that knowledge can help you understand what is said there).
Shouldn't this make it possible to improve spam filters?
Another area in which the world is poorer for the lack of a Douglas Adams wandering (or more likely flying first class) around it.
I would have LOVED to see him tackle a 'text message adventure' along the lines of the old Infocom classics. He has written a number of pieces (some of which are collected in The Salmon of Doubt) about how much he enjoyed this marriage of writing and computing. The flexibility and restrictions of the medium would have led to something pretty neat I'm guessing. Of course - then he'd have pissed another 10 years down the drain discussing making it into a movie with Disney!
Damn I want to swap to another parallel universe sometimes. One where Adams did EVERYTHING we think he'd have been good at, and where Britney Spears lives next door and cooks me pastries for breakfast on Sundays!
Yes, but strcmp can say two strings are identical, yet they can convey different information. Big-endian vs. little-endian, anyone?
Binary identity does not imply semantic equivalence. It all depends on how the data is interpreted.
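A quick illustration of the point, as a sketch only: Python's bytes comparison stands in for strcmp here, and the two-byte value is arbitrary.

```python
# Two buffers with exactly the same bytes.
sent = bytes([0x00, 0x01])
received = bytes([0x00, 0x01])

# A byte-for-byte comparison (the strcmp-style check) says they are identical.
same_bytes = sent == received

# Yet the number they "mean" depends on the byte order the reader assumes.
as_big_endian = int.from_bytes(sent, byteorder="big")
as_little_endian = int.from_bytes(received, byteorder="little")

print(same_bytes, as_big_endian, as_little_endian)
```

The comparison succeeds, but a big-endian reader sees 1 and a little-endian reader sees 256.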
If anyone is interested in the history of this field then I would highly recommend the book "Advances in Automatic Text Summarization", edited by Inderjeet Mani and Mark T. Maybury. Lots of very interesting articles, including discourse trees and a brief bit of stuff about summarising non-textual assets such as diagrams, video streams, etc.
The only Good System is a Sound System
My guess is any slick technology set up with this will let plagiarism run rampant.
Google translator already let my sister-in-law "cheat" on a German paper, but the translation was "too good" so she got caught. Paraphrasing that's excellent (obviously would take a while, but what the hell, we can play Apple II games on a Palm not 20 years later....) could be real messy.
Just think of the ramifications this will have for Zork. Now I'll be able to say "Will you just open the damn egg?"
It is by the juice of the coffee bean that thoughts acquire speed, the teeth acquire stains. The stains become a warning
At a roughly 10% size:
At a quarter size:
GPL Deconstructed
I first classify the text into a category, then weight every word in the text based on how much it contributed to this classification. I then output as a "summary" the one or two sentences in the original text that most contribute to the classification of the entire text.
Not really summarization, but useful.
-Mark
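A rough sketch of the classify-then-pick-sentences approach described in the comment above. Everything specific here is an assumption: the categories, the hard-coded word weights (a real system would presumably learn them from a trained classifier), and the function names.

```python
import re

# Hypothetical per-category word weights, hard-coded for illustration only.
CATEGORY_WEIGHTS = {
    "finance": {"stocks": 2.0, "rates": 2.0, "fed": 1.5},
    "sports": {"match": 2.0, "goal": 2.0, "team": 1.5},
}


def tokenize(text):
    """Lowercase the text and return its words."""
    return re.findall(r"[a-z']+", text.lower())


def score(text, weights):
    """Sum the weights of the words in the text."""
    return sum(weights.get(token, 0.0) for token in tokenize(text))


def classify(text):
    """Pick the category whose weighted words add up to the most."""
    return max(CATEGORY_WEIGHTS, key=lambda cat: score(text, CATEGORY_WEIGHTS[cat]))


def summarize(text, n=1):
    """Return the n sentences that contribute most to the text's category."""
    weights = CATEGORY_WEIGHTS[classify(text)]
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    ranked = sorted(sentences, key=lambda s: score(s, weights), reverse=True)
    top = set(ranked[:n])
    # Keep the chosen sentences in their original order.
    return [s for s in sentences if s in top]


text = "The weather was mild. Stocks rose after the Fed cut rates. Traders went home."
print(classify(text), summarize(text))
```

As the comment says, this selects existing sentences rather than rewriting anything, so it is extraction more than true summarization.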