Slashdot Mirror


Paraphrasing Sentences With Software

prostoalex writes "Cornell University researchers are making progress in paraphrasing and "understanding" complete sentences in a software application. Analyzing sentences on the semantic level allows the software application to treat two sentences, expressing similar thoughts and ideas, but written in a different manner, as a single semantic unit. Significant achievements in this area could revolutionize the information searching field."

11 of 203 comments

  1. Translation software? by znaps · · Score: 2, Informative

    I'm sure this would improve translation software too, since a paraphrased sentence should be easier to translate into something sensible.

  2. Re:The problem is... by ravydavygravy · · Score: 5, Informative

    a computer has to be given programming for every idiom there is.

    Rubbish - Ever heard of Machine Learning?

    Work on resolving coreference and named-entity recognition problems has been ongoing for several years, with the aim of leading on to full NLP. This research seems interesting in that it takes work from another field (genetic sequence matching) and applies it to an NLP problem. What links them all is that in almost every case, the research involves machine learning at some point... it makes no sense to hand-code millions of case-specific rules, when a machine can learn them faster and better...

    Read their paper and you'll see that indeed it's an unsupervised learning approach - even nicer in that it doesn't require you to label training examples for the algorithm...

    ~D

  3. Paraphrase of the article. by fven · · Score: 4, Informative

    Without thinking too much about it, we paraphrase all the time. Getting a computer to reword a sentence, however, is a complicated task.

    At Cornell University, researchers decided to take two different sources of the same news and use computational biology methods to make it possible for computers to automatically paraphrase input sentences. Their first step was to compare the two different sources of the same news.

    Eventually, it is hoped that this research will have benefits in computer processing of natural-language queries, translation engines, and in assisting people with certain types of reading disabilities.

    The project began when two ideas came together, said Regina Barzilay, one of the Cornell researchers, who is now an assistant professor of computer science at the Massachusetts Institute of Technology.

    The vast amount of duplicated content online is a valuable resource for computer systems learning to paraphrase. Many reporters cover the same news in different words. Because the same basic facts appear in each report, these redundant sources help a system learn the different ways one piece of information can be paraphrased. So with these multiple sources, you can sort out the noise and get the facts and then work out different ways of stating those facts.

    Even with similar styles of writing, paraphrasing sentences is more than just working out and substituting synonyms. The researchers provide a couple of common business phrases to illustrate this:

    After the latest Fed rate cut, stocks rose across the board.
    Winners strongly outpaced losers after Greenspan cut interest rates again.

    The next step was to use computational biology techniques to determine how much two sentences had in common and how closely they were related. The technique is similar to the one biologists use to see how close two sets of genes are when they may have started from the same seed but then evolved. They are different but have a degree of similarity.
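
    A rough sketch of what "how much two sentences have in common" can look like in code. This is not the researchers' actual method, which isn't spelled out here. It swaps in the standard Needleman-Wunsch global alignment from computational biology, applied to words instead of bases. The function names and the scoring values (+1 match, -1 mismatch, -1 gap) are assumptions for illustration only.

```python
def tokenize(sentence):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    stripped = (w.strip(".,;:!?\"'").lower() for w in sentence.split())
    return [w for w in stripped if w]


def alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score between two token lists.

    The scoring values are illustrative assumptions, not taken from the paper.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per token.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + step,  # align the two words
                score[i - 1][j] + gap,       # skip a word of a
                score[i][j - 1] + gap,       # skip a word of b
            )
    return score[-1][-1]


# The two business sentences quoted above.
first = tokenize("After the latest Fed rate cut, stocks rose across the board.")
second = tokenize("Winners strongly outpaced losers after Greenspan cut interest rates again.")
print(alignment_score(first, second))
```

    An identical pair scores its own length in words, and the score drops as more words have to be skipped or swapped. On the pair above a plain word-level alignment finds very little in common, which is the point made earlier: paraphrase is more than synonym substitution.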

    The important thing was to compare news sources that were written differently but covered the same event. This generated a whole set of word patterns that were kind of the same. This was exactly the core data needed to inform a computer paraphrasing technique.

    The Reuters and AFP news sources were used to test the system. News was selected from English articles produced between September 2000 and August 2002.

    The system developed by the researchers performs two groupings, first comparing articles from the same source:

    Word-based clustering methods were used to identify sets of text that had a high degree of overlapping words. This method identified articles that reported distinct acts of violence occurring in Israel and the Palestinian territories.
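
    To make "a high degree of overlapping words" concrete, here is a minimal sketch. The Jaccard measure, the greedy single pass, and the 0.5 threshold are all my assumptions; the article only says the clustering was word-based. The sample texts are made up.

```python
def word_set(text):
    """Set of lowercased words with surrounding punctuation stripped."""
    return {w.strip(".,;:!?\"'").lower() for w in text.split()} - {""}


def jaccard(a, b):
    """Word overlap between two sets: size of intersection over size of union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_by_overlap(texts, threshold=0.5):
    """Greedy one-pass clustering (an assumed, simplistic scheme).

    Each text joins the first cluster whose seed text overlaps with it by at
    least `threshold`; otherwise it starts a new cluster.
    Returns a list of clusters, each a list of indices into `texts`.
    """
    clusters = []  # pairs of (seed word set, member indices)
    for index, text in enumerate(texts):
        words = word_set(text)
        for seed, members in clusters:
            if jaccard(seed, words) >= threshold:
                members.append(index)
                break
        else:
            clusters.append((words, [index]))
    return [members for _, members in clusters]


# Made-up headlines, for illustration only.
headlines = [
    "Fed cuts interest rates again",
    "Fed cuts interest rates sharply",
    "Storm closes coastal roads",
]
print(cluster_by_overlap(headlines))
```

    The first two headlines share four of six distinct words, so they land in one cluster and the third stands alone. A real system would want something less order-dependent than a greedy pass.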

    Computational biology techniques were then used on these sets of articles to generate lattices or sentence templates for the computer to use. Each lattice contains a number of sets of words that occur in parallel and empty slots where arguments, such as locations, number of fatalities, times and dates can be inserted.

    The challenge was to sort out which lattices were indeed due to different events and which were due to writing variability.

    The researchers were thus able to identify common templates used by journalists to describe similar events. I.e., journalists who take the same article and change or take out a word, add a detail, reverse the sentence and so on are hereby busted.

    One of the templates, or lattices, read: Palestinian suicide bomber blew himself up in NAME on DATE killing NUMBER (other) people and injuring/maiming NUMBER. In addition to the injuring/maiming variable, there are several variables within the name argument: settlement of, coastal resort of, center of, southern city, or garden cafe.
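
    One way to picture a lattice like that in code, as a sketch only: fixed wording with alternatives that occur in parallel, plus named slots for the arguments. The list-of-tuples representation, the slot names KILLED and INJURED, and the regex patterns are my own assumptions, not the researchers' data structure. The test sentence uses placeholder values.

```python
import re

# A tuple holds alternative phrasings that occur in parallel; a bare string
# in capitals is an argument slot. This layout is an assumption.
LATTICE = [
    ("Palestinian suicide bomber blew himself up in",),
    "NAME",
    ("on",),
    "DATE",
    ("killing",),
    "KILLED",
    ("other people", "people"),
    ("and injuring", "and maiming"),
    "INJURED",
]

# What each slot is allowed to match (illustrative choices).
SLOT_PATTERNS = {"NAME": r".+?", "DATE": r"\w+", "KILLED": r"\d+", "INJURED": r"\d+"}


def compile_lattice(lattice):
    """Turn the lattice into a regex with one named group per slot."""
    parts = []
    for element in lattice:
        if isinstance(element, str):
            parts.append(f"(?P<{element}>{SLOT_PATTERNS[element]})")
        else:
            alternatives = "|".join(re.escape(alt) for alt in element)
            parts.append(f"(?:{alternatives})")
    return re.compile(r"\s+".join(parts) + r"\.?$")


def extract_arguments(lattice, sentence):
    """Return the slot values if the sentence fits the lattice, else None."""
    match = compile_lattice(lattice).match(sentence.strip())
    return match.groupdict() if match else None


# Placeholder sentence, not a real report.
sentence = (
    "Palestinian suicide bomber blew himself up in the center of Sometown "
    "on Monday killing 2 other people and injuring 5."
)
print(extract_arguments(LATTICE, sentence))
```

    The NAME slot swallows "the center of Sometown" whole, which mirrors the point about variation inside the name argument. A sentence that doesn't fit the template returns None.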

    43 AFP and 32 Reuters templates were thus discovered by the system. The researchers then cross-compared these lattices.

    They compared the

  4. Re:This reminds me of the Infocom classics by Anonymous Coward · · Score: 2, Informative

    That'd be Scott Adams games.

    Infocom's parser was much better. "Put the big bunch of keys in the blue box under the table." can be parsed by it, for example.

    As the OP said, this isn't near the level of what's mentioned in the article, but it's certainly better than you imply.

  5. Re:google? by millette · · Score: 5, Informative

    Just discovered this:
    Now when searching Google, you can use a ~ (tilde) to find pages using synonyms of the word you're searching for. For instance, search for:

    css ~help

    and you'll get sites with tutorials, guides, support, etc.

  6. Re:This reminds me of the Infocom classics by blancolioni · · Score: 5, Informative

    Interactive fiction hasn't died, and you can certainly play it on your PDA. Furthermore, it's generally acknowledged that the quality of modern works has surpassed that of Infocom. Baf's guide is probably a good place to dip your toes in, but there are resources all over the place and the annual competition has just finished.

    An interactive novel, at least the kind you're probably thinking about with deeply implemented characters and so forth, is probably AI-complete. It's not about the disk space and processor speed, it's about the inherent trickiness.

  7. Link for the Reuters Corpus stories. by openmtl · · Score: 1, Informative

    A lot of Reuters stories are available for research purposes as a set corpus. See http://about.reuters.com/researchandstandards/corpus/ for details on this. Perfect and designed for just this sort of work. Also BT a few years back was working on a summariser called Prosum. Don't know what happened to that in the .com churn.

  8. Advances in Automatic Text Summarization by fingal · · Score: 4, Informative

    If anyone is interested in the history of this field then I would highly recommend the book with the above title, edited by Inderjeet Mani and Mark T. Maybury. Lots of very interesting articles, including discourse trees and a brief bit of stuff about summarising non-textual assets such as diagrams, video streams, etc.

    --

    The only Good System is a Sound System

  9. Re:This reminds me of the Infocom classics by Sargent1 · · Score: 3, Informative

    There are changes to the various interactive fiction languages to address various problems and shortcomings in the field. The trouble is, most of the easy stuff has been done. What's left now is trying to figure out what hard stuff can be done, or is even worth doing.

    For example, right now most of the languages accept sentences of the form [VERB] [DIRECT OBJECT] [PREPOSITION] [INDIRECT OBJECT]. Occasionally someone suggests, "Why not add adverbs?" The general consensus is that doing so suddenly requires the author(s) to consider a gigantic range of actions (what's the difference in result between "squeeze toothpaste tube slowly" and "squeeze toothpaste tube violently"?), and that, though such parsing can be done, it doesn't add to the world model.
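
    For anyone who hasn't seen that sentence form in action, here is a toy sketch. It is not how TADS, Inform, or Infocom's parser actually work; the word lists and the parse_command name are made up for illustration.

```python
# Tiny illustrative vocabularies (assumptions, nowhere near complete).
PREPOSITIONS = {"in", "on", "under", "with", "to", "from", "into"}
ARTICLES = {"the", "a", "an"}


def parse_command(command):
    """Split a command into verb, direct object, preposition, indirect object.

    Everything up to the first known preposition is the direct object;
    everything after it is the indirect object. Returns None for empty input.
    """
    words = [w for w in command.lower().rstrip(".!").split() if w not in ARTICLES]
    if not words:
        return None
    verb, rest = words[0], words[1:]
    for i, word in enumerate(rest):
        if word in PREPOSITIONS:
            return {
                "verb": verb,
                "direct_object": " ".join(rest[:i]),
                "preposition": word,
                "indirect_object": " ".join(rest[i + 1:]),
            }
    # No preposition found: the rest of the command is the direct object.
    return {
        "verb": verb,
        "direct_object": " ".join(rest),
        "preposition": None,
        "indirect_object": None,
    }


print(parse_command("Put the keys in the box"))
print(parse_command("squeeze toothpaste tube slowly"))
```

    Note what happens to the adverb: "slowly" just gets absorbed into the direct object, because nothing in the [VERB] [DIRECT OBJECT] [PREPOSITION] [INDIRECT OBJECT] shape has a place for it. Recognizing the word is the easy part; deciding what it changes in the world model is the part the author has to write.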

    Nevertheless, even in traditional interactive fiction there is language development going on to increase what can be done. The example I am most familiar with is TADS 3 (http://tads.org/t3dl.htm), which is adding a lot of deeper simulation aspects, such as varying light sources, a better concept of distance, easy ways of getting around the standard atomicity of the world being broken up into discrete rooms, and support for deeper interaction with non-player characters. The big leap here is in giving a ready-made and easy-to-use framework for such advances.

  10. Re:Google News? by Kappelmeister · · Score: 4, Informative

    I'm curious as to whether Google News, since it draws from various news sources and groups articles by topic (similar to paraphrasing, perhaps), uses any of the same techniques.

    No, but Regina Barzilay, who is the researcher featured in the article, worked (with me) on the Newsblaster project at Columbia University, where she indeed applied these techniques to multidocument summarization. Newsblaster gathers and clusters news like Google News, but produces more sophisticated summaries.

  11. For the lazy, or interested, a summary via OS X! by 2nd+Post! · · Score: 4, Informative

    Set on the lowest setting, a summary of the article is:

    The method could eventually allow computers to more easily process natural language, produce paraphrases that could be used in machine translation, and help people who have trouble reading certain types of sentences.

    At a roughly 10% size:

    The researchers used gene comparison techniques to identify word patterns from different news sources that described the same event.

    The method could eventually allow computers to more easily process natural language, produce paraphrases that could be used in machine translation, and help people who have trouble reading certain types of sentences.

    ...When two reporters describe the same news event, for instance, they may use different details, but they tend to report about the same basic facts, said Barzilay.

    ...you have genes which started from the same kind of seed, and then they change during evolution [but] there is some similarity," said Barzilay.

    ...Given a sentence to paraphrase, the system finds the closest match among one set of lattices, then uses the matching lattice from the second source to fill in the argument values of the original sentence to create paraphrases.

    At a quarter size:

    The researchers used gene comparison techniques to identify word patterns from different news sources that described the same event.

    The method could eventually allow computers to more easily process natural language, produce paraphrases that could be used in machine translation, and help people who have trouble reading certain types of sentences.

    ...When two reporters describe the same news event, for instance, they may use different details, but they tend to report about the same basic facts, said Barzilay.

    ...Second, to sort out sentence similarities, the researchers borrowed techniques from computational biology that determine how closely related organisms are by finding similarities among genes.... you have genes which started from the same kind of seed, and then they change during evolution [but] there is some similarity," said Barzilay.

    ...Lattices are made up of words or parallel sets of words that occur across several examples, and arguments, or slots, where names, dates or number of people hurt or killed occur.

    ...One pattern, or lattice, read: Palestinian suicide bomber blew himself up in NAME on DATE killing NUMBER (other) people and injuring/wounding NUMBER.

    ...Given a sentence to paraphrase, the system finds the closest match among one set of lattices, then uses the matching lattice from the second source to fill in the argument values of the original sentence to create paraphrases.

    ...The researchers' ultimate goal is to use the system to allow computers to be able to paraphrase like humans, and to understand paraphrases, "but that's very far [off]", said Barzilay.

    ...Barzilay's previous work, which used a different technique to paraphrase at the level of words and phrases rather than sentences, is part of the Columbia News Blaster project, which summarizes news stories.

    ...The researchers' system has the potential to accomplish the same thing by taking one human translation and creating 10 paraphrases of it automatically, she said.

    ...The system could be used to produce paraphrases based on a specific model, for example, for aphasic readers, who find it difficult to read certain types of phrases, she said.

    ...For example, the system learned incorrectly that "Palestinian suicide bomber" and "suicide bomber" were the same, and that "killing 20 people" is the same as "killing 20 Israelis", said Barzilay.