Paraphrasing Sentences With Software
prostoalex writes "Cornell University researchers are making progress in paraphrasing and "understanding" complete sentences in a software application. Analyzing sentences on the semantic level allows the software application to treat two sentences, expressing similar thoughts and ideas, but written in a different manner, as a single semantic unit. Significant achievements in this area could revolutionize the information searching field."
Imagine a beowulf cluster of this
There's absolutely nothing formulaic about idioms, which comprise 80% or so of English conversation. A human learns them through years of experience; a computer has to be given programming for every idiom there is.
I think that the first and best use of this technology would be to help the editors of Slashdot find duplicate articles!
Think about the possibilities...
Of course, the biggest problem with that is that there wouldn't be nearly as many cool articles to read!
I have no problem with your religion until you decide it's reason to deprive others of the truth.
I always loved the text adventure games by Infocom. They were way ahead of their time, and I have been truly amazed on several occasions by the software's ability to 'understand' what I was asking it to do. Of course I'm sure this is leaps and bounds beyond what was available back then, but it's truly amazing how far ahead of their time they actually were.
There is a mailbox here.
C. Griffin
"Can I keep his head for a souvenir?" --Max from Sam 'N Max Freelance Police
Will this get rid of the 10 people who get +5 informative from stealing the link out of the comment a few spots up?
so would this allow something like google to pick up a phrase and relate it to the results instead of just picking up keywords?
one of the ways I can think of to use this technology is to improve search engine capabilities, instead of looking for exactly the same words, search engines then can look for similar sentences, giving more accurate results.
However, after reading the article, I wonder whether the research can be applied to Latin languages, as they did the research on semantic languages.
The IT section color scheme sucks.
I was too lazy to read the article so I used the Summarize feature in OS X to parse the sentences down since it seems a bit wordy.
Okay, maybe I exaggerate a bit here, I did read the article and while the summarize isn't that far off from what these guys are doing...
Burn Hollywood Burn
I'm curious as to whether Google News, since it draws from various news sources and groups articles by topic (similar to paraphrasing, perhaps), uses any of the same techniques.
I hope they make use of this new technology on machine translation sites like Babelfish, because the dreck that Babelfish shoots out is utter shit!
I'm sure this would improve translation software too, since a paraphrased sentence should be easier to translate into something sensible.
But... I wonder, will it produce 'In Soviet Russia' pseudo-paraphrasing.
I wonder what its application could be, other than to detect duplicates... Perhaps, a tool to suggest ways of rewriting sentences? Or maybe part of a more advanced grammar check?
So now
(all your base...)==(I'm a tard)
?
Things like this are what make academic research Really Cool and allow useful things to come about. Go Cornell.
... and I'd relate my 2nd and 3rd paragraph if it wasn't 3am here. Goodnight, slashdot. :)
I'd note that this is a novel approach, and, for better or for worse, it goes about doing things much differently than our minds do.
Actually, though, it's closer to how humans understand writing (stringing together atomic words/phrases in an implicit context) than previous statistical methods.
RD
Maybe prostoalex could learn something from the Cornell researchers! How about this for an article summary, eh?
Cornell University researchers could revolutionize the information searching field by analyzing sentences on the semantic level to allow a software application to treat two sentences, expressing similar thoughts and ideas but written in a different manner, as a single semantic unit.
Who will be first to post the paraphrased article so I don't have to RTFA?
The days of "All your base are belong to us" Engrish may soon be over? A brand new AirSoft gun I just purchased has the phrase "No point at the creature" molded into the plastic. Don't get me started on the owners manuals for consumer electronics. Japan needs this software, bad. If it comes at a cost of no more "All your base" jokes, well, that's a cost I think society will have to bear.
---
DRM is like antifreeze, to the MPAA/RIAA it's sweet, to the consumers it's poison.
I looked again and whaddayaknow? I asked the paperclip about auto summarize and it is still there in the Tools menu after all! Looks like I don't have that feature installed though.
"Two ideas led to the system, said Regina Barzilay..."
Speaking of natural language recognition, I parsed this sentence from the article as reading, "Two ideas led to the system, said Reginald Barclay..." :)
I'm too lazy to read the article.. could someone write some software to paraphrase it for me?
If strcmp says that two strings are different, but you say that they mean the same thing, then the problem is with your language, not with strcmp.
Finally, auto-translate, then auto-parse can rid us of these "manuals" :-)
Simon
Physicists get Hadrons!
Hello, automatic paraphrasing of literature.
P.S. Just joking, kids. Stay in school!
Let's see the srtwfaoe cut its tteeh anigist tihs lttilte puzzle! (blatant reference to an older article)
riiiiipppppp
What is that? What are you doing?
I'm paraphrasing. This intro is too long.
Paraph....well don't paraphrase...don't. Look, I will read whatever is in the script and you just type whatever I say. So just type what ever I say.
Just type whatever I say.
No, don't type everything I say. Just type
No! Not everything... just guh, guh, er, duh, duh...
That's not funny.
You're such a cock bite.
Alright now that - ok, that's gotta - that, take that off, because that is firs
cock bite
Slashdot, where armchair scientists get shouted down and armchair theologians get modded up.
They should use this technology to transcribe legalese into plain English and back. Like, you feed it with "Due to unanticipated circumstances as listed under the terms of the clause 17(a), we may be unable to comply with your request within this and successive fiscal year(s)", and it spits out "bugger off".
Of course, millions of lawyers worldwide would lose their jobs, but I, being bitten by them, just take it as an added benefit.
Lisp is the Tengwar of programming languages.
Paraphrase THIS!
(from the I'll-Paraphrase-YOU! department)
Significant achievements [GOOD] in this area could revolutionize [IS] the information searching field. [THIS].
yo.
a "-1, redundant" generator.
What about true speech recognition? As I understand it, this could go a long way towards making speech recognition work effectively. Me: "Computer, I want to write an email." Computer: "One moment please."
Without thinking too much about it, we paraphrase all the time. Trying to give a sentence to a computer to reword is a complicated task.
At Cornell University, researchers decided to avail themselves of two different sources of the same news and use computational biology methods to make it possible for computers to automatically paraphrase input sentences. Their first step was to compare the two different sources of the same news.
Eventually, it is hoped that this research will have benefits in computer processing of natural-language queries, translation engines, and in assisting people with certain types of reading disabilities.
The project began when two ideas came together, said one of the Cornell researchers, Regina Barzilay. Regina Barzilay is an assistant professor of computer science at the Massachusetts Institute of Technology.
The vast amount of duplicated content online is a valuable resource for computer systems learning to paraphrase. A number of reporters report the same news but using different wording. The redundant sources of news are able to assist in learning the different ways one piece of information can be paraphrased, as the same basic facts are reported in each. So with these multiple sources, you can sort out the noise and get the facts and then work out different ways of stating those facts.
Even with similar styles of writing, paraphrasing of sentences is more than just working out and substituting synonyms. The researchers provide a couple of common business phrases to illustrate this:
After the latest Fed rate cut, stocks rose across the board.
Winners strongly outpaced losers after Greenspan cut interest rates again.
The next step was to use computational biology techniques to determine how much two sentences had in common and how closely they were related. The technique used was similar to the one biologists use to see how close two sets of genes are that may have started from the same seed but then evolved. They are different but have a degree of similarity.
The important thing was to compare news sources that were written differently but covered the same event. This generated a whole set of word patterns that were kind of the same. This was exactly the core data needed to inform a computer paraphrasing technique.
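For the curious, here is a minimal sketch of that comparison step. It is not the researchers' code: it assumes a plain longest-common-subsequence alignment over words as a stand-in for whatever computational biology technique they actually used, and the function names are made up for illustration.

```python
def tokenize(sentence):
    """Lowercase a sentence and split it into words, dropping edge punctuation."""
    words = (w.strip(".,;:!?\"'()") for w in sentence.lower().split())
    return [w for w in words if w]


def lcs_length(a, b):
    """Length of the longest common subsequence of two word lists."""
    # prev[j] is the LCS length of the words of a seen so far and b[:j].
    prev = [0] * (len(b) + 1)
    for word_a in a:
        curr = [0]
        for j, word_b in enumerate(b, start=1):
            if word_a == word_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def alignment_similarity(s1, s2):
    """Share of words that line up in order: 0.0 (nothing) to 1.0 (identical)."""
    a, b = tokenize(s1), tokenize(s2)
    if not a or not b:
        return 0.0
    return 2 * lcs_length(a, b) / (len(a) + len(b))


if __name__ == "__main__":
    fed = "After the latest Fed rate cut, stocks rose across the board."
    greenspan = "Winners strongly outpaced losers after Greenspan cut interest rates again."
    print(alignment_similarity(fed, greenspan))
```

The two business sentences quoted above line up on only two words ("after" and "cut"), so a word-level score like this stays low even though they say the same thing. That gap is why the templates described further down are needed on top of simple alignment.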
The Reuters and AFP news sources were used to test the system. News was selected from English articles produced between September 2000 and August 2002.
The system developed by the researchers performs two groupings; firstly comparing articles from the same source:
Word-based clustering methods were used to identify sets of text that had a high degree of overlapping words. This method identified articles that reported distinct acts of violence occurring in Israel and the Palestinian territories.
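A rough sketch of what word-overlap clustering can look like, assuming Jaccard overlap between word sets and a greedy single pass. The 0.5 threshold and all the names are hypothetical; the article does not say which clustering method or cutoff the researchers used.

```python
def word_set(text):
    """Set of lowercased words in a text, with edge punctuation stripped."""
    words = (w.strip(".,;:!?\"'()") for w in text.lower().split())
    return {w for w in words if w}


def jaccard(a, b):
    """Overlap between two word sets: size of intersection over size of union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_by_overlap(texts, threshold=0.5):
    """Greedy clustering: each text joins the first cluster whose first member
    it overlaps with by at least `threshold`, otherwise it starts a new one."""
    sets = [word_set(t) for t in texts]
    clusters = []  # each cluster is a list of indices into texts
    for i, words in enumerate(sets):
        for cluster in clusters:
            if jaccard(words, sets[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters
```

Comparing only against the first member of each cluster keeps the sketch short; a real system would more likely compare against a cluster centroid or use a proper hierarchical method.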
Computational biology techniques were then used on these sets of articles to generate lattices or sentence templates for the computer to use. Each lattice contains a number of sets of words that occur in parallel and empty slots where arguments, such as locations, number of fatalities, times and dates can be inserted.
The challenge was to sort out which lattices were indeed due to different events and which were due to writing variability.
The researchers were thus able to identify common templates used by journalists to describe similar events. I.e., journalists who take the same article and change or take out a word, add a detail, reverse the sentence and so on are hereby busted.
One of the templates, or lattices, read: Palestinian suicide bomber blew himself up in NAME on DATE killing NUMBER (other) people and injuring/maiming NUMBER. In addition to the injuring/maiming variable, there are several variables within the name argument: settlement of, coastal resort of, center of, southern city, or garden cafe.
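To make the slot idea concrete, here is a sketch that treats a much-simplified version of that template as a regular expression with named slots. The pattern, the slot names and the helper are invented for illustration; the real lattices are learned from the articles, not written by hand.

```python
import re

# Hand-written stand-in for one lattice: fixed words, parallel alternatives
# ("injuring" or "maiming", optional "other"), and named slots for the arguments.
TEMPLATE = re.compile(
    r"in (?P<name>[A-Z]\w+) on (?P<date>\w+),? "
    r"killing (?P<killed>\d+) (?:other )?people "
    r"and (?:injuring|maiming) (?P<hurt>\d+)"
)


def extract_slots(sentence):
    """Return the slot values if the sentence fits the template, else None."""
    match = TEMPLATE.search(sentence)
    return match.groupdict() if match else None
```

Two sentences that differ only in the parallel word choices fill the same slots with the same values, which is what lets the system treat them as paraphrases of each other.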
43 AFP and 32 Reuters templates were thus discovered by the system. The researchers then cross-compared these lattices.
They compared the
The next generation of students sure will have it much easier than us. How is a teacher supposed to catch plagiarism with software like that?
Oh wait...
Mrs. G: Johnny, come here for a second.
Johnny: Yes Mrs. G?
Mrs. G: What did you mean by "Shrub claimed that Basket Hamper and the Hatchets of Sin will be blown out" in your current events report?
Johnny: Oh, whoops! What I meant to say there was, "Bush says Bin Laden and the Axes of Evil will be defeated." Sorry about that. Darn that defective spell-check and grammar-check!
after reading the article, I wonder whether the research can be applied to Latin languages, as they did the research on semantic languages
...is a good example :)
Auto Greeter Machine: I welcome you to our country, and greet you with open arms. Please enjoy your stay - we have a fine range of tourist facilities, restaurants, bars and so forth. And on a personal note, may I say that you are likely to be eaten by a grue.
According to The Guardian "In his pre-trial interview, the cannibal said that after eating Brandes he felt much better and more stable. Brandes spoke good English, he said, and since eating him his English had improved."
Science fiction for grown-ups...
Perhaps it can make sense of Bill Gates talks on security. I know I can't.
7#1$ m@(#1n3 $uXX0rZ
So now we can run a simple program and it will tell us what the media is really saying, without all their bullshit and padding. That is, it will go through an entire article, and pull out the stupid statistics like death counts. Thus, three pages of bullshit and padding are reduced to:
10 People were killed, and 30 injured. Arabs Suck, America is great.
It took a complex computer program and years of research to figure out that all the news stories could be summed up in 3 lines.
(don't mark this as troll right-off, read the article first.)
-- 'The' Lord and Master Bitman On High, Master Of All
How do you paraphrase Slashdot ?
Ans : Dupes for nerds, stuff that matters again and again.
How do you paraphrase Microsoft Innovation ?
Ans :
getSexySig();
Summary
:)
It's been done by CanadaDave (544515) on Thu December 04, 9:20
Microsoft Word had AutoSummarize in Word 97, or was it 2000? Anyhow it seems to be absent in Word XP.
-----
Fantastic bit of programming there, Bill.
Not really the same thing Mr. Dave.
This post contains benzene, nitrosamines, formaldehyde and hydrogen cyanide.
Don't bother with the Yacc code, Orwell already did the work for you. As I recall, eliminating synonyms was one of the primary goals of newspeak.
#include <sig.h>
nuff said
But could it understand Babelfish translations?
When will this thing be ready to 'summarize' whole articles? I'm a senior next year. heh heh
Not to mention the increased ability to quickly spot "re-written" bought term papers.
Money for nothing, pix for free
There's this algorithm called Latent Semantic Analysis which has been under development for quite some time (freely available!). It's quite good at comparing the semantic content of 2 bits of speech based on its database of many thousands of books (in fact you can specify the education level by choosing different databases).
The output of LSA has been shown to be roughly equivalent to human scorers for examining summary essays produced in tests.
Point is, that by combining this here paraphrasing algorithm with LSA, we can have computers summarizing text and other computers giving them grades on it. This takes students and teachers out of the equation entirely. Saves us big bucks and gets public education back on its feet!
"Pass me the crackpipe, man!"
Proudly karma-whoring since the turn of the millennium
Money for nothing, pix for free
and deep contextual dependency.
Neat trick if they can pull it off. Then Google results would really improve.
MSBPodcast.com The opinions expressed here are my own. If you don't like 'em... Think up your own stuff.
No more Google Ebay ads selling us shit they really don't sell!
They can finally target advertising directly into our brain!
An American friend of mine was terribly confused by the expression "Crash us a fag, mate".
Tie this little puppy into a speech recognition system then jack your DVD jukebox into your computer. Now you have suddenly obtained the ability to search all your pr0n for the exact type of scene you like, or your woman likes, to see. Identify, "I like X"; "Give it to me from X"; "You like that X, don't you?"; Seems like a perfect application to me.
When I tell an object to delete this, am I killing it or telling it to kill me?
A possible application
We could feed a technical report to the computer and the output will be pure poetry, imagine Slashdot in verses !
Can anyone else shed any light on how far the LOLITA project (under Roberto Garigliano) got at Durham University? Yeah, it's a research project, but last I heard (10 years ago) it was able to parse complete texts (for example, newspaper articles) and answer simple questions based on them. I believe there was also work underway to make it understand/'speak' Chinese/Russian. There was also supposed to be some kind of 'script' support which would give it contextual information about certain situations (the common example was what contextual knowledge you need when you go into a restaurant and how that knowledge can help you understand what is said there).
yuo are on TEH SPOKE!!!!!!11
paraphrased to
In SOVIET RUSSIA, #other people's messages posted before your own avoid simply duplicating what has already been said by YOU
Shouldn't this make it possible to improve spam filters?
A lot of Reuters stories are available for research purposes as a set corpus. See http://about.reuters.com/researchandstandards/corpus/ for details on this. Perfect and designed for just this sort of work.
Also BT a few years back was working on a summariser called Prosum. Don't know what happened to that in the .com churn.
Actually, this might be a Good Thing for e-learning projects. One great challenge for e-learning is to give precise evaluation automatically. With this, a teacher could write his own essay and machines could evaluate students' essays taking the teacher's as a reference.
Neat stuff. And the paper is really well written, IMHO. The "story" doesn't say enough.
See Wittgenstein for a related concept... and why it probably won't work. People have tried defining language logically for a long time... the semantics of it never work. Ultimately you can't use language to "completely" describe itself.
...towards a Natural Language Compiler?
Babelfish already does this.
== Jez ==
Do you miss Firefox? Try Pale Moon.
The first use I thought of was using this software to paraphrase an assignment so you could more easily pass it in as your own work, and it would be more difficult to prove that it was the same as a previous work since it had different words. Possibly the software has some sort of paraphrase signature that would make this possible to detect?
Another area in which the world is poorer for the lack of a Douglas Adams wandering (or more likely flying first class) around it.
I would have LOVED to see him tackle a 'text message adventure' along the lines of the old Infocom classics. He has written a number of pieces (some of which are collected in The Salmon of Doubt) about how much he enjoyed this marriage of writing and computing. The flexibility and restrictions of the medium would have led to something pretty neat I'm guessing. Of course - then he'd have pissed another 10 years down the drain discussing making it into a movie with Disney!
Damn I want to swap to another parallel universe sometimes. One where Adams did EVERYTHING we think he'd have been good at, and where Britney Spears lives next door and cooks me pastries for breakfast on Sundays!
Now I can get my 2000 word English research paper up to the required 3000 words, and have the required unreadability!
tasks(723) drafts(105) languages(484) examples(29106)
I used to work at a company called Uniplex. They bought technology that could precis English text. One of their examples was cutting down "Alice in Wonderland" to 10% of its original length. It weighted words according to some magic algorithm that tried to retain the most important phrases.
Whilst the resulting document was a bit odd, you could certainly use it to remind yourself about the story.
-- Don't believe everything you read, hear or think
If I was paraphrasing a passage I don't understand, I would need a dictionary and grammar rules. If the grammar was normal or normalized, I would still need the dictionary.
So, what would a dictionary for a computer look like? How can basic concepts be defined for computer understanding?
Would it look perhaps like a Prolog program?
Know your pads. One time pad: good for cryptography. Two timing pad: where to take your mistress.
If anyone is interested in the history of this field then I would highly recommend the book with the above title, edited by Inderjeet Mani and Mark T. Maybury. amazon. Lots of very interesting articles, including discourse trees and a brief bit of stuff about summarising non-textual assets such as diagrams, video streams etc etc
The only Good System is a Sound System
maybe they should work on this first before building the app.
My guess is any slick technology set up with this will let plagiarism run rampant.
Google translator already let my sister-in-law "cheat" on a German paper, but the translation was "too good" so she got caught. Paraphrasing that's excellent (obviously would take a while, but what the hell, we can play Apple II games on a Palm not 20 years later....) could be real messy.
With technology like this, we could probably compress the Internet into about 200 or so unique sites!
We might even arrive conclusively at the twenty or so keywords that comprise 99% of Slashdot posts. Oh heck, I'll even give it a partial headstart: "Linux, Linus, MPAA, RIAA, SCO, RTFA, Gates, Lucas, outrage, Rings, Rockets, RMS"
Significant achievements in this area will revolutionize the lazy plagiarist field.
One will be required to think and to phrase oneself like Ray Romano or Paddington Bear in order for software to fully 'understand', and for one to understand the software's response. Which sucks. Why bother trying? Are we really up to seeing if 'language rule updates' can keep up with changes in actual language? Or, will we find that language stagnates just because somebody makes a dictionary vocally conversational?
"Stratigraphically the origin of agriculture and thermonuclear destruction will appear essentially simultaneous" -- Lee
it not ain't if i say 'boo-ya' up to the shizzo!! phooonkeee-pbbbt! mess with it.
"Stratigraphically the origin of agriculture and thermonuclear destruction will appear essentially simultaneous" -- Lee
Just think of the ramifications this will have for Zork. Now I'll be able to say "Will you just open the damn egg?"
It is by the juice of the coffee bean that thoughts acquire speed, the teeth acquire stains. The stains become a warning
At a roughly 10% size:
At a quarter size:
GPL Deconstructed
Here's a site demo'ing the Machinese syntax parser. It can build parse trees for sentences you type in where the components in the sentence are separated and related to each other.
http://www.connexor.com/demos/syntax_en.html
Beware: In C++, your friends can see your privates!
I first classify the text into a category, then weight every word in the text based on how much it contributed to this classification - I then output as a "summary" of the one or two sentences in the original text that most contribute to the classification of the entire text.
Not really summarization, but useful.
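A minimal sketch of that classify-then-pick-sentences idea, under loud assumptions: the per-category word weights are a hypothetical hand-filled table (in practice they would come from a trained classifier, which isn't specified here), classification is just "highest summed weight wins", and all names are made up for illustration.

```python
import re


def words_of(text):
    """Lowercased alphabetic words in a text."""
    return re.findall(r"[a-z']+", text.lower())


def classify(text, weights):
    """Pick the category whose word weights sum highest over the text."""
    scores = {
        category: sum(table.get(w, 0.0) for w in words_of(text))
        for category, table in weights.items()
    }
    return max(scores, key=scores.get)


def summarize(text, weights, max_sentences=2):
    """Return the category and the sentences that contribute most to it,
    kept in their original order."""
    category = classify(text, weights)
    table = weights[category]
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    scored = [
        (sum(table.get(w, 0.0) for w in words_of(s)), i)
        for i, s in enumerate(sentences)
    ]
    top = sorted(scored, reverse=True)[:max_sentences]
    keep = sorted(i for _, i in top)
    return category, [sentences[i] for i in keep]
```

Called with something like `summarize(article, {"finance": {"stocks": 2.0, "rate": 1.5}, "sports": {"goal": 2.0}})`, it returns the winning category plus the one or two sentences carrying the most weight for that category.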
-Mark
Here is another website about a similar idea, Universal Networking Language (UNL).
20 mil and I will! Learn Esperanto with 20M others.
If this works, it would make catching plagiarists almost impossible.
1) Google the paper topic
2) Cut'n'paste
3) Run it through the Cornell application
4) Turn it in
5) Collect the grades
Doubtless Cornell University researchers are already modifying this to create software for catching plagiarists. A bit like IronPort buying SpamCop?
This post piqued my interest. I don't own a Mac so now I'm curious about how this thing works. I wasn't even aware that there was such a thing as a summarizing algorithm. How does it work? I did a search for "Summary" "OS X" on google and I got no interesting leads. Can anyone give me some pointers to places where I could either play with a summarizing program (maybe a web based one) or learn more about how it works?
Get some text on your screen with a Cocoa app. Say, this post with Safari.
Select the text.
Choose from the Application (e.g. Safari) menu in the menubar Services...Summarize.
The Summary tool pops up. Hooray! The sad part is they demoed it at MacWorld Boston '97, and released it in Jaguar, IIRC.
My God, it's Full of Source!
OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
Depending on how that develops, it will have a great impact on translation software.
Imagine using a computer to translate from one language to another, and ending up with a grammatically correct result. That would be amazing..
-- -- Warning. Do not stare directly at the sun.
I think it's a revolution in psychometrics and psychological assessment that's already here and waiting to expand exponentially in use.
I do research on psychological assessment for a living, and it's amazing to me all the applications NLP has for psychological assessment.
Already, as you say, existing algorithms and software can produce scoring systems that have the same validity as the average rating of a group of human raters. In most cases, the predictive validity of an AES system exceeds that of a single human rater. That is, the scores assigned by NLP scoring algorithms are more valid, in the sense of better predicting other criteria, than the scores assigned by a single human rater.
I don't do educational assessment, I work on clinical assessment, and so far I've seen nothing done with NLP. It has the potential to really cause a revolution in clinical psychological assessment.
For one, it has the potential to lead to a renaissance in use of "projective" tests, which have fallen out of use for good reason. It also has the potential for standardized scoring of clinical interview responses, which is amazing to me.
Absolutely amazing stuff.
Describes it a little, since it's written with Apple's Summarize Service.
I think Apple uses the service internally in their file indexing and search feature, too!
GPL Deconstructed
It does a bit of what this article describes, works wonders,
has been around for quite some time, and it's amazingly
accurate; check it out:
highlight any text in OSX, hit the application menu, then Services and Summarize. Simple!
k.h.
Think what a boon this will be for students and reporters who don't want to do their own work. You find the article containing the target subject, plug in the style you want it paraphrased into and let it crank.
Stealing from one person is called plagiarism. Stealing from many is called research.
that newswires have cross-licensing arrangements. So AFP might well be able to take a Reuters feed (and vice versa) and minimally rewrite it. I'm not sure that they do, but that sort of thing is pretty common - for instance Reuters tend to specialise in finance while AFP are more a media service, so AFP might source some of their finance news from Reuters.
That'd rather throw out the assumptions behind this research, wouldn't it?
Company representatives said quote, "The real challenge was finding a software developer that hadn't slept through English class and knew how to diagram sentences."
"Reports that say something hasn't happened are interesting to me, because as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns -- the ones we don't know we don't know."
If the software can summarize that for me, I'm all ears :-)
Or how about removing redundant comments?
I feel fantastic, and I'm still alive.