Slashdot Mirror


Open Source Automated Text Summarization?

TrebleJunkie writes "I've spent some time recently looking for open source projects dealing with Automated Text Summarization -- automatically generating detailed summaries from longer documents -- to no avail. I can find a lot of research papers and several commercial projects, but no open source code or projects. Does anyone out there know of any?"

38 comments

  1. The way it is supposed to work! by AllMightyPaul · · Score: 0, Redundant

    It's too bad the Internet didn't keep with the way it was supposed to work back in the beginning with the and such. If that were the case, you could look for headers and body blocks to determine content.

    If you know how the text will be formatted you can have the script look for a proliferation of words that are not things such as "a" or "and" or "of" and perhaps search for common threads, but other than that, I got nothing!

    1. Re:The way it is supposed to work! by AllMightyPaul · · Score: 1

      Stupid auto formatting. I meant to show an <H1> and an <H2>

    2. Re:The way it is supposed to work! by moncyb · · Score: 1

      It's too bad the Internet didn't keep with the way it was supposed to work ...

      Yeah. I think that's from too many stupid people. I remember seeing a guy get flamed on usenet for having a summary line in his headers (or was that a keyword line? I forget.) The idiot that flamed him said something like it messes up his newsreader or some odd crap like that.

      Back on the subject at hand...the sort of program you're asking about would need some sort of AI code in it. I believe the field of study is called natural language processing. It's not a trivial matter, so it makes sense to me that there is no open source software for it.

      It wouldn't be a bad idea for a project. Maybe a system that converts text into a more understandable form for computers...most languages were just haphazardly slapped together then highly bastardized over time...it's amazing that anyone or anything can understand English! ;-) Anyway, changing to data that is structured and conforms to more strict rules could do wonders. After that, it'd be much less difficult to write programs that depend on understanding the meaning of documents. (like creating summaries)

      Unfortunately, such a project would involve more than just changing words to specific codes. I remember reading about one of the first translation attempts by computer. They tried to test it out by going from English to Russian and back. They put in: "The spirit is willing, but the flesh is weak." They got back something like: "The vodka is good, but the meat is rotten." A lot of the meaning is lost. You have to account for not only placement of words, but also context, idiosyncratic phrases, &etc...

    3. Re:The way it is supposed to work! by AllMightyPaul · · Score: 1

      Who on earth rates these? How is this redundant? I was the third person to reply to this topic and I'm redundant. Good God, that's something wrong.

  2. maybe a dumb question by Anonymous Coward · · Score: 1, Insightful

    is there a non-free program that does this?

    1. Re:maybe a dumb question by PurpleBob · · Score: 3, Informative

      is there a non-free program that does this?

      Microsoft Word.

      It doesn't do it all that well, from what I've seen, but it does it. It's called "AutoSummarize".

      --
      Win dain a lotica, en vai tu ri silota
    2. Re:maybe a dumb question by 90XDoubleSide · · Score: 3, Informative

      Mac OS X has a built-in Summarize Service that works much better than the one in Word, IMO. Sorry I can't think of any open source ones.

      --
      "Reality is just a convenient measure of complexity" -Alvy Ray Smith
    3. Re:maybe a dumb question by NaturePhotog · · Score: 2

      "doesn't do it all that well" is being kind. My wife tried it on a 5 page letter she'd written, and the results were...bizarre. Yes, the text was from the document. Far from the most relevant parts, seemingly grabbed at random.

      My best guess for what it could do is some sort of word frequency count, ignoring common words like 'the'. Then include the top N% of sentences and those adjacent to them that include the most common words. Also, give a higher weighting to things in the beginning and end, since papers following the classic form tend to say what they're going to say, say it, then say what they've just said.
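      For anyone who wants to tinker, the guess above can be sketched in a few lines of Python. This is only a rough illustration of the idea, not how Word actually does it: the stop-word list, the `fraction` and `edge_boost` parameters, and the naive sentence splitter are all my own assumptions, and the "include adjacent sentences" step is left out.

```python
import re
from collections import Counter

# Tiny hand-picked stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "that",
              "for", "on", "was", "with", "as", "are", "be", "this", "but", "or"}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def content_words(sentence):
    """Lowercased words with the common stop words removed."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower())
            if w not in STOP_WORDS]

def summarize(text, fraction=0.3, edge_boost=1.5):
    """Keep the top `fraction` of sentences, ranked by average word frequency.

    The first and last sentences get their score multiplied by `edge_boost`,
    on the theory that documents open and close with their main points.
    """
    sentences = split_sentences(text)
    if not sentences:
        return []
    freq = Counter(w for s in sentences for w in content_words(s))
    scored = []
    for i, s in enumerate(sentences):
        ws = content_words(s)
        score = sum(freq[w] for w in ws) / len(ws) if ws else 0.0
        if i == 0 or i == len(sentences) - 1:
            score *= edge_boost
        scored.append((score, i))
    keep = max(1, round(len(sentences) * fraction))
    best = sorted(scored, key=lambda t: (-t[0], t[1]))[:keep]
    # Emit the chosen sentences in their original document order.
    return [sentences[i] for _, i in sorted(best, key=lambda t: t[1])]
```

      Averaging the word scores keeps long sentences from winning just by being long; summing instead would be an equally defensible choice.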

    4. Re:maybe a dumb question by bildstorm · · Score: 3, Interesting

      Check out ArchiText from YellowBrix.

      Having been looking at their demos and so on, they have some great summary software.

      It is most certainly NOT free, but perhaps by looking at the summaries generated and the documents pulled from, you could get some idea how to reverse-engineer the process.

      --
      The power of accurate observation is commonly called cynicism by those who have not got it. - G.B. Shaw
    5. Re:maybe a dumb question by WatertonMan · · Score: 1
      The company I work with has a summarization library that does this. Pricing depends upon how you use it. I know that they've made fairly good deals for educational uses. It was more designed for writing automated abstracts, but it does an amazingly good job on news sources as well.

      Obvious caveats apply - i.e. I work for them and helped write the thing. However, if you need that sort of thing or something more particular, contact Lextek.

  3. A simple kinda-solution by wickidpisa · · Score: 1, Redundant

    I know this isn't exactly what you are looking for, but I remember SAT prep books that teach you to read the first line of every paragraph to get a quick summary. Granted it works better for the SATs than it does IRL, but it often works pretty well and it's better than nothing. You could whip up a simple perl script to extract the first line of each paragraph in no time.
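    A minimal sketch of that trick, in Python rather than Perl. It assumes paragraphs are separated by blank lines and treats "first line" as "first sentence", since hard-wrapped text makes the literal first line arbitrary; the function name `first_sentences` is made up for the example.

```python
import re

def first_sentences(text):
    """Return the first sentence of each blank-line-separated paragraph."""
    summary = []
    for para in re.split(r"\n\s*\n", text.strip()):
        para = " ".join(para.split())  # collapse hard-wrapped lines
        if not para:
            continue
        # Shortest prefix ending in ., ! or ? that is followed by a space or the end.
        match = re.match(r".*?[.!?](?=\s|$)", para)
        summary.append(match.group(0) if match else para)
    return summary
```

    A paragraph with no sentence-ending punctuation is returned whole, which seemed safer than dropping it.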

    1. Re:A simple kinda-solution by Rick+the+Red · · Score: 2
      A local TV talk show host once revealed that if he didn't have time to read the book of an author/guest he would read the first chapter, the first page of each chapter, and the last chapter.

      Yeah, it's off-topic, but it's not redundant! Stupid moderators -- meta-mod will bite you back!

      --
      If all this should have a reason, we would be the last to know.
  4. No Offence intended but, Why?? by Why+Should+I · · Score: 0, Redundant
    Can't seem to think about any reason you would possibly want to automate this.
    Surely the whole point to summaries is that they are a shortened version of human-generated English (or whatever human language) that embodies the general context of the document.
    This just seems like one of those things that:
    1. Is best not automated
    2. Probably cheaper done by hiring a clerk to read and summarise, than use a computer
    Seriously though, I can't think of any reason why you would really need to automate this. Is there one?
    1. Re:No Offence intended but, Why?? by Bazzargh · · Score: 1

      To save you the time reading the whole document? Most summarizers just pull what they think are the most relevant sentences out of the document (rather than attempt to write a summary from scratch). This results in quite readable, and essentially human-generated summaries.

      And as for 'probably cheaper' - well yes if you want to guarantee you get something that makes sense back. However, the expense can only be justified if you want a summary for use in (say) a presentation, or you want to read a review. If you want to see abstracts of dozens of documents in order to decide if any are worth reading a computer is way cheaper and faster; 50wpm and 1GHz just don't compare for that task.

    2. Re:No Offence intended but, Why?? by babbage · · Score: 3, Interesting
      Yes: coping with the flood of information available on the internet within a time frame that any mere mortal could digest. As a mere mortal, I don't have time to keep up with all the Usenet feeds & mailing list discussions about all the fields I'm interested in, and the amount of digital information is growing way faster than anyone can keep up with, ever. Tools like Google help you find specific needles in the information haystack, but search engines really need to be complemented by other tools that can tell you more about the haystack itself (how big, what color, what's in it, what's it smell like, etc). That way you can choose what haystacks -- to keep strangling the metaphor -- you want to spend the time looking for your own needles in.

      There was a trend a few months to a year ago where members of some discussion groups were producing summaries of each week's traffic, but it proved to be so much thankless work that they have all quit by now. Every week these people would have to spend hours sifting through hundreds of messages and manually distilling it down to one hyperlinked document of perhaps a few hundred words, or a couple of pages long if printed. For the thanks they got in return -- and people did appreciate all the work, but you can't eat thanks -- it just wasn't worth it for any of them to keep doing these manual summaries. Even if they were being paid, it's not the sort of work most people want to be doing in the first place.

      Finding a system that could programmatically produce a periodic summary -- even if a crude one -- of what was discussed on one of these groups would be a great tool. And no, I'm not willing to pay an assistant to summarize Usenet for me, and no, I don't think it's something that any one assistant could do alone anyway. But I would be willing to have, say, a cron job that on Mondays gave me a summary of the Linux kernel lists, on Tuesdays gave me a report on what's up with Perl6, on Wednesdays told me what security issues have been news lately, on Thursdays ...you get the idea.

      In order to be able to summarize these aggregates of documents, you'll have to start with smaller ones. You could play it in both directions: from the messages up level, you could reduce each posting to a sentence or less, while from a threads down level you could figure out what topics seemed to be hot and go for key ideas from messages within the main threads. Bonus points for a system that could recognize citations (if what poster A said was important enough for poster B to quote it, then maybe that quote should end up in the summary) or, Google-style, place emphasis on traffic patterns, linkages, etc.

      As several people have noted, this is all a big, hard problem to solve, and there would be real uses for it if anyone could put it all together. Would we be willing to pay for such a service? I dunno, depends how good it is I guess. But if it really could reduce Usenet, web logs, mailing lists, and hey maybe even some normal web sites down into a small handful of roughly accurate documents that could be read over a cup of coffee each morning, then yeah I think that would be a valuable thing.
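      The "recognize citations" idea is the easiest piece to prototype. Here is a rough Python sketch, assuming messages are plain strings and that quoting follows the usual Usenet/mail convention of a leading `>`; the function name `most_quoted` and the one-vote-per-message rule are my own choices, not anything an existing tool does.

```python
from collections import Counter

def most_quoted(messages, top=3):
    """Rank lines by how many messages quote them with a leading '>'."""
    counts = Counter()
    for body in messages:
        quoted = set()
        for line in body.splitlines():
            stripped = line.lstrip()
            if stripped.startswith(">"):
                # Drop any depth of '>' markers and normalise whitespace.
                text = " ".join(stripped.lstrip("> ").split())
                if text:
                    quoted.add(text)
        counts.update(quoted)  # each message votes once per quoted line
    return counts.most_common(top)
```

      Counting per message rather than per occurrence keeps one poster who quotes the same line five times from skewing the ranking. Real traffic would also need fuzzy matching, since people rewrap and trim what they quote.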

    3. Re:No Offence intended but, Why?? by TrebleJunkie · · Score: 1

      No offense taken.

      The answer's pretty simple: I want to be able to summarize documents as they come into my life and inevitably stay there. I'm a pack rat. I keep everything. I just want to be able to organize it. I want to be able to search it. I want to be able to search through the summaries (to provide for a way to be able to search generalizations, rather than finding keywords in irrelevant parts of irrelevant documents.) and I want to be able to display the summary when I mouse over the document... stuff like that... so I don't have to dig through the whole document to find out if it's really what I need. And I wanted to play a little bit with the technology. I tinker. I do that. :)

      I do thank everyone for their responses. Anything else you can think of, please let me know. Thanks much!

      --

      Ed R.Zahurak

      You know, oblivion keeps looking better every day.

    4. Re:No Offence intended but, Why?? by Anonymous Coward · · Score: 0

      Could someone summarize what that guy just said? =)

    5. Re:No Offence intended but, Why?? by Anonymous Coward · · Score: 0

      Perhaps to avoid wasting time reading all of your reply?

  5. Where is the research you found? by DevilM · · Score: 1

    Why not provide us with the research you found? Maybe one of us would be willing to hack up a quick and dirty prototype based on the research.

  6. bookaminute by ghamerly · · Score: 1

    First off, I'm doubtful that there are any open-source programs that do this well, as it's a very difficult problem! It has to do with understanding a document, which computers really can't do.

    So I'd like to take a moment to point out a good resource for some existing summaries, at bookaminute.

    1. Re:bookaminute by Anonymous Coward · · Score: 0
      You are kidding, right? Here's a typical one:

      Hitchhiker's Guide To the Galaxy
      By Douglas Adams
      Ultra-Condensed by David J. Parker and Samuel Stoddard


      (The Earth gets BLOWN UP.)

      Arthur

      I'm a bit upset about that.
      Ford
      Yes, I can understand that.
      (They fly around the galaxy. They go UNDERGROUND, where they see...)

      Arthur

      The Earth.
      Deep Thought
      Forty two.
      THE END
  7. Have a look in CPAN by Anonymous Coward · · Score: 0

    The HTML::Summary and/or Lingua::EN::Summarize modules probably do what you need.

    1. Re:Have a look in CPAN by orangesquid · · Score: 4, Interesting

      Lingua::EN::Summarize tested on the GPL v2:

      USA Everyone is permitted to copy and distribute verbatim copies of this license document. Changing it is not allowed. Preamble The licenses for most software are designed to take away your freedom to share and change it. The GNU General Public License is intended to guarantee your freedom to share and change free software. To make sure the software is free for all its users. (Some other Free Software Foundation software is covered by the GNU Library General Public License instead.). We are referring to freedom. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for this service if you wish).

      It seems comparable to MS Word..

      --
      --TheOrangeSquid Is it any wonder things seem so awry? We swim in a sea of confusion and don't have to think to survive
  8. Microsoft Summarize by cabalamat2 · · Score: 4, Funny

    I know it's not open source, but have you tried the Summarize feature in Microsoft Word? I fed it the entire contents of the GNU website and it came back with:

    GNU is rubbish. Don't use the viral GPL! Bill is your friend. You love Bill. Microsoft software is the best.
    1. Re:Microsoft Summarize by Anonymous Coward · · Score: 0

      I tried the Microsoft Office summary on Slashdot and here is what came out.

      Microsft is rubbish. Do't use the viral Microsoft! Open Source is the best.

  9. Check out Alembic by Anonymous Coward · · Score: 1, Interesting

    http://www.mitre.org/technology/alembic-workbench/
    Might do exactly what you want. You probably have to train it first but it works quite nicely.

    Mike

  10. Summarisers by Exeter+Bun · · Score: 2, Interesting

    I think one of the problems is that such a piece of software would be big business. I think I found something in the Natural Language Processing Software Registry: http://registry.dfki.de/ Check under sections->written language->summarization. Another poster described systems that simply filter through relevant sentences. They're also sometimes known as abridgers. You might want to include that term in any keyword search you're doing.

  11. I'll GPL this: by Bazman · · Score: 4, Funny
    perl -ane 'print "$_ " for grep { rand() > .9 } @F'

    Try it on man pages:

    man awk | perl -ane 'print "$_ " for grep { rand() > .9 } @F'

    and it still makes sense! :)

  12. The reason why is because it's hard by JimMcCusker · · Score: 2, Interesting

    I developed NetOwl Summarizer 1.5 (way at the bottom), and there's a lot that needs to be done. You need to score enough documents, and need to have a good entity extraction mechanism (which NetOwl Extractor does) and you need a good on-line learning system. It's a lot of work, and even still, we don't get very good results, only good results. Microsoft's text summarizer does far worse, actually, but neither of us is perfect.

    1. Re:The reason why is because it's hard by Zurk · · Score: 1

      How does the MITRE one compare to yours?
      The MITRE one is at: http://www.mitre.org/technology/alembic-workbench/ANLP97-bigger.html
      and
      http://www.mitre.org/technology/alembic-workbench/

  13. Some related information by f00zbll · · Score: 3, Informative
    I did some research into this for a pet project of my own. I wanted to write an application to crawl the web and get information. After a couple months of research, I realized how big of a problem it is.

    1. the application needs to be able to determine the relevance of the provided text
    2. to do so, it needs to determine the relative importance of the sentences and words
    3. it has to be able to compose new sentences to write a summary
    4. not all documents follow good structure or grammar
    5. how do you account for spelling/grammar mistakes

    From my research, there appear to be two primary methods of performing this kind of processing:

    1. natural language parsing
    2. statistical parsing

    Of the two, statistical parsing is more popular these days because it doesn't require a knowledge base, expert system shells, grammar modeling, or an extensive dictionary. One of the primary methods of determining the relative importance of words in a sentence is valence. The main challenge with both natural language parsing and statistical techniques is that they depend on the training dataset. The more specific the dataset is, the better they will perform.

    Statistical analysis can also use expert system shells and other AI technologies to improve accuracy, but it doesn't have to.

    From my understanding (which is limited), it stems from a principle from linguistics. By counting the frequency of words, or more specifically nouns, the program is able to rate each noun's importance. Once that is done, it can find the sentence that best describes the document by comparing the most important words against the appearance of those words in the sentences. I remember this from my literature and linguistics classes. Cognitive science has also attempted to solve this problem, but it is very difficult.

    In either case, if you're dealing with well structured documents, your best bet is to grab the first 3 paragraphs, assuming the author followed standard thesis/essay structure. If you're planning on summarizing news articles, it might not be that hard if the author followed the inverted pyramid, which many do not. One of the big tools of natural language parsing in the early days was Prolog. It is still used a lot in academic settings for natural language processing. Your best bet is to get an intern to read and summarize for you.
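    The word-counting approach described above (rate words by frequency, then find the sentence that best matches the top words) can be sketched roughly like this in Python. Big caveat: picking out nouns needs a part-of-speech tagger, so this version just drops a small stop-word list and treats everything else as a candidate, which is an assumption on my part. `key_sentence` and `top_k` are made-up names.

```python
import re
from collections import Counter

# Small stand-in stop-word list; a real system would tag nouns instead.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "that",
              "for", "on", "was", "with", "as", "are", "be", "this", "but", "or"}

def tokenize(sentence):
    """Lowercased words minus the stop words."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower())
            if w not in STOP_WORDS]

def key_sentence(text, top_k=5):
    """Return the sentence containing the most of the top_k frequent words."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(w for s in sentences for w in tokenize(s))
    important = {w for w, _ in freq.most_common(top_k)}
    # max() returns the earliest sentence when several tie on overlap.
    return max(sentences,
               key=lambda s: len(important & set(tokenize(s))),
               default="")
```

    Ties go to the earlier sentence, which happens to agree with the "grab the beginning" advice for well structured documents.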

    1. Re:Some related information by braineater · · Score: 1

      Systems I've seen are also capable of extracting noun phrases and verb phrases, and weighing the relevance of those. If 'the lazy sleeping dog' occurs a couple of times (especially across paragraph boundaries) it will score high to be part of an overall summary of the document. Ferreting out the parts of a verb phrase can be quite a bit more difficult, because they can be bigger, containing noun phrases, prepositional phrases, they can be nested, yada yada. You need to have a lexicon so the system can pick out a word and classify it as a noun, adjective, adverb, or verb, along with the hierarchical relationships they are allowed to have with one another. And after all that, you have to remember that in English (and a lot of, but not all, other languages) word position is syntactically significant, and that there's more than one (syntactically relevant) way to say the same thing. Which is why it would be hard.

      But it wouldn't be impossible. There's a company in Canada that does software like this (in English, German and French, I believe) called Nstein. I've seen a demo and it's very impressive.
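      To give a feel for the "repeated across paragraph boundaries" scoring without a lexicon, here is a crude Python stand-in. It does not do real noun-phrase extraction: it just counts word n-grams by how many distinct paragraphs they show up in. The function name, the blank-line paragraph rule, and the defaults are all assumptions for the sake of the example.

```python
import re
from collections import Counter

def phrases_across_paragraphs(text, n=3, min_paragraphs=2):
    """Map each word n-gram to the number of paragraphs that contain it,
    keeping only those found in at least `min_paragraphs` paragraphs."""
    spread = Counter()
    for para in re.split(r"\n\s*\n", text):
        tokens = re.findall(r"[a-z]+", para.lower())
        grams = {" ".join(tokens[i:i + n])
                 for i in range(len(tokens) - n + 1)}
        spread.update(grams)  # a set, so each paragraph votes once per phrase
    return {g: c for g, c in spread.items() if c >= min_paragraphs}
```

      A real system would use the lexicon to throw out fragments like "the lazy sleeping" and keep only well-formed phrases; that filtering is exactly the hard part described above.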

  14. Sherlock by maggard · · Score: 3, Informative
    Apple's Sherlock application does this.

    It's not Open but it is scriptable, is not an additional cost, and is available on a Unix OS (MacOS X). Indeed, through Apple's Open Scripting Architecture (OSA) one can use any number of scripting languages such as Python, Perl, and even JavaScript to interact with the application.

    Feed it a document, tell it to summarize and back will come a generally useful précis. For folks directly on a Mac (MacOS 8.6 or newer incl. X) simply highlight a document or portion of text and select "Summarize" from the contextual menu.

    --
    I don't read ACs: If a post isn't worth so much as a nom de plume to its author then I wont bother either.
  15. Hard.. by Tom7 · · Score: 2

    It's a pretty hard problem; people are still actively researching this and the best results are only so-so.

    The best way to find code would be to e-mail the authors of the papers you've found. They probably have implementations, and academics are usually willing to share under something like a BSD license or the GPL.

  16. From GamesNET by Anonymous Coward · · Score: 0

    Are you DevilM from GamesNET?

  17. It's All Perl's Fault by Anonymous Coward · · Score: 0

    You're just seeing the problems with the Perl Summarize::Moderate module. Its redundancy detector needs improvement.