Slashdot Mirror


Open Source Automated Text Summarization?

TrebleJunkie writes "I've spent some time recently looking for open source projects dealing with Automated Text Summarization -- automatically generating detailed summaries from longer documents -- to no avail. I can find a lot of research papers and several commercial projects, but no open source code or projects. Does anyone out there know of any?"

4 of 38 comments

  1. Re:maybe a dumb question by PurpleBob · · Score: 3, Informative

    is there a non-free program that does this?

    Microsoft Word.

    It doesn't do it all that well, from what I've seen, but it does it. It's called "AutoSummarize".

    --
    Win dain a lotica, en vai tu ri silota
  2. Re:maybe a dumb question by 90XDoubleSide · · Score: 3, Informative

    Mac OS X has a built-in Summarize Service that works much better than the one in Word, IMO. Sorry I can't think of any open source ones.

    --
    "Reality is just a convenient measure of complexity" -Alvy Ray Smith
  3. Some related information by f00zbll · · Score: 3, Informative
    I did some research into this for a pet project of my own. I wanted to write an application to crawl the web and get information. After a couple months of research, I realized how big of a problem it is.

    1. the application needs to be able to determine the relevance of the provided text
    2. to do so, it needs to determine the relative importance of the sentences and words
    3. it has to be able to compose new sentences to write a summary
    4. not all documents follow good structure or grammar
    5. how do you account for spelling/grammar mistakes

    From my research, there appear to be two primary methods of performing this kind of processing:

    1. natural language parsing
    2. statistical parsing

    Of the two, statistical parsing is more popular these days because it doesn't require a knowledge base, expert system shells, grammar modeling, or an extensive dictionary. One of the primary methods of determining the relative importance of words in a sentence is valence. The main challenge with both natural language parsing and statistical techniques is that they depend on the training dataset. The more specific the dataset is, the better they will perform.

    Statistical analysis can also use expert system shells and other AI technologies to improve accuracy, but it doesn't have to.

    From my understanding (which is limited), the statistical approach stems from a principle from linguistics. By counting the frequency of words, or more specifically nouns, the program is able to rate each noun's importance. Once that is done, it can find the sentence that best describes the document by comparing the most important words against the appearance of those words in each sentence. I remember this from my literature and linguistics classes. Cognitive science has also attempted to solve this problem, but it is very difficult.
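
    The frequency idea described above can be sketched roughly as follows. This is only an illustration under loud assumptions: a small hand-picked stop-word list stands in for real noun detection, the sentence splitter is naive, and the names (split_sentences, tokenize, summarize) are made up for this example rather than taken from any existing project.

```python
import re
from collections import Counter

# Assumption: a tiny stop-word list replaces real noun detection (no POS tagging).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "that",
              "on", "for", "with", "as", "are", "was", "by"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase the sentence and keep words that are not stop words."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower())
            if w not in STOP_WORDS]


def summarize(text, max_sentences=1):
    """Return the highest-scoring sentences, in their original order."""
    sentences = split_sentences(text)
    # Word importance = how often the word occurs in the whole document.
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        words = tokenize(sentence)
        # Average word frequency, so long sentences do not win by length alone.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:max_sentences])]


if __name__ == "__main__":
    doc = "Cats chase mice. Cats sleep all day. Dogs bark."
    print(summarize(doc, max_sentences=2))
```

    Note that this only selects existing sentences; it does not compose new ones, which is the harder problem listed in point 3 above.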

    In either case, if you are dealing with well structured documents, your best bet is to grab the first 3 paragraphs, assuming the author followed standard thesis/essay structure. If you're planning on summarizing news articles, it might not be that hard if the author followed the inverted pyramid, which many do not. One of the big tools of natural language parsing in the early days was Prolog. It is still used a lot in academic settings for natural language processing. Your best bet is to get an intern to read and summarize for yo
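
    The "grab the first 3 paragraphs" heuristic mentioned above is simple enough to sketch. This assumes paragraphs are separated by blank lines, and the function name lead_summary is hypothetical, chosen just for this example.

```python
import re


def lead_summary(text, max_paragraphs=3):
    """Return the first few paragraphs, assuming blank-line separators."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return paragraphs[:max_paragraphs]


if __name__ == "__main__":
    article = "Lead paragraph.\n\nSecond.\n\nThird.\n\nFourth."
    print("\n\n".join(lead_summary(article)))
```

    It only works as well as the document's structure allows, which is the caveat already raised about authors who skip the inverted pyramid.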

  4. Sherlock by maggard · · Score: 3, Informative
    Apple's Sherlock application does this.

    It's not Open, but it is scriptable, comes at no additional cost, and is available on a Unix OS (MacOS X). Indeed, through Apple's Open Scripting Architecture (OSA) one can use any number of scripting languages such as Python, Perl, and even JavaScript to interact with the application.

    Feed it a document, tell it to summarize and back will come a generally useful précis. For folks directly on a Mac (MacOS 8.6 or newer incl. X) simply highlight a document or portion of text and select "Summarize" from the contextual menu.

    --
    I don't read ACs: If a post isn't worth so much as a nom de plume to its author then I won't bother either.