Slashdot Mirror


Open Source Automated Text Summarization?

TrebleJunkie writes "I've spent some time recently looking for open source projects dealing with Automated Text Summarization -- automatically generating detailed summaries from longer documents -- to no avail. I can find a lot of research papers and several commercial projects, but no open source code or projects. Does anyone out there know of any?"

6 of 38 comments

  1. Re:maybe a dumb question by bildstorm · · Score: 3, Interesting

    Check out ArchiText from YellowBrix.

    I've been looking at their demos and so on, and they have some great summary software.

    It is most certainly NOT free, but perhaps by looking at the summaries generated and the documents pulled from, you could get some idea how to reverse-engineer the process.

    --
    The power of accurate observation is commonly called cynicism by those who have not got it. - G.B. Shaw
  2. Check out Alembic by Anonymous Coward · · Score: 1, Interesting

    http://www.mitre.org/technology/alembic-workbench/
    Might do exactly what you want. You probably have to train it first but it works quite nicely.

    Mike

  3. Summarisers by Exeter+Bun · · Score: 2, Interesting

    I think one of the problems is that such a piece of software would be big business. I think I found something in the Natural Language Processing Software Registry: http://registry.dfki.de/ (check under sections->written language->summarization). Another poster described systems that simply filter through relevant sentences. They're also sometimes known as abridgers. You might want to include that term in any keyword search you're doing.
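For anyone wondering what "simply filter through relevant sentences" might look like, here is a rough sketch of a word-frequency abridger. It is only an illustration and not how any of the products mentioned in this thread work; the stopword list, the naive sentence splitter, and the names `abridge`, `split_sentences` and `words` are all made up for the example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not taken from any real system).
STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "it",
             "that", "for", "on", "are", "this"}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def words(sentence):
    """Lowercased words of a sentence, minus stopwords."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower())
            if w not in STOPWORDS]

def abridge(text, keep=2):
    """Return the `keep` highest-scoring sentences, in their original order."""
    sentences = split_sentences(text)
    # Count how often each non-stopword occurs in the whole document.
    freq = Counter(w for s in sentences for w in words(s))

    def score(sentence):
        ws = words(sentence)
        # Average word frequency, so long sentences are not favoured by length alone.
        return sum(freq[w] for w in ws) / len(ws) if ws else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:keep])]
```

Calling `abridge(document_text, keep=3)` would return the three sentences whose words are most typical of the document. It never rewrites anything, which is why this kind of tool is an abridger rather than a true summarizer.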

  4. Re:Have a look in CPAN by orangesquid · · Score: 4, Interesting

    Lingua::EN::Summarize tested on the GPL v2:

    USA Everyone is permitted to copy and distribute verbatim copies of this license document. Changing it is not allowed. Preamble The licenses for most software are designed to take away your freedom to share and change it. The GNU General Public License is intended to guarantee your freedom to share and change free software. To make sure the software is free for all its users. (Some other Free Software Foundation software is covered by the GNU Library General Public License instead.). We are referring to freedom. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for this service if you wish).

    It seems comparable to MS Word..

    --
    --TheOrangeSquid Is it any wonder things seem so awry? We swim in a sea of confusion and don't have to think to survive
  5. The reason why is because it's hard by JimMcCusker · · Score: 2, Interesting

    I developed NetOwl Summarizer 1.5 (way at the bottom), and there's a lot that needs to be done. You need to score enough documents, and need to have a good entity extraction mechanism (which NetOwl Extractor does) and you need a good on-line learning system. It's a lot of work, and even still, we don't get very good results, only good results. Microsoft's text summarizer does far worse, actually, but neither of us is perfect.

  6. Re:No Offence intended but, Why?? by babbage · · Score: 3, Interesting
    Yes: coping with the flood of information available on the internet within a time frame that any mere mortal could digest. As a mere mortal, I don't have time to keep up with all the Usenet feeds & mailing list discussions about all the fields I'm interested in, and the amount of digital information is growing way faster than anyone can keep up with, ever. Tools like Google help you find specific needles in the information haystack, but search engines really need to be complemented by other tools that can tell you more about the haystack itself (how big, what color, what's in it, what's it smell like, etc). That way you can choose what haystacks -- to keep strangling the metaphor -- you want to spend the time looking for your own needles in.

    There was a trend a few months to a year ago where members of some discussion groups were producing summaries of each week's traffic, but it proved to be so much thankless work that they have all quit by now. Every week these people would have to spend hours sifting through hundreds of messages and manually distilling it down to one hyperlinked document of perhaps a few hundred words, or a couple of pages long if printed. For the thanks they got in return -- and people did appreciate all the work, but you can't eat thanks -- it just wasn't worth it for any of them to keep doing these manual summaries. Even if they were being paid, it's not the sort of work most people want to be doing in the first place.

    Finding a system that could programmatically produce a periodic summary -- even if a crude one -- of what was discussed on one of these groups would be a great tool. And no, I'm not willing to pay an assistant to summarize Usenet for me, and no, I don't think it's something that any one assistant could do alone anyway. But I would be willing to have, say, a cron job that on Mondays gave me a summary of the Linux kernel lists, on Tuesdays gave me a report on what's up with Perl6, on Wednesdays told me what security issues have been news lately, on Thursdays ...you get the idea.

    In order to be able to summarize these aggregates of documents, you'll have to start with smaller ones. You could play it in both directions: from the messages-up level, you could reduce each posting to a sentence or less, while from a threads-down level you could figure out what topics seemed to be hot and go for key ideas from messages within the main threads. Bonus points for a system that could recognize citations (if what poster A said was important enough for poster B to quote it, then maybe that quote should end up in the summary) or, Google-style, place emphasis on traffic patterns, linkages, etc.
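To make the two directions concrete, here is a minimal sketch of how they might fit together. It assumes each message is a dict with hypothetical `thread` and `body` keys, that quoted text uses the usual `>` prefix, and that the "hot" threads are simply the ones with the most messages. A real system would need far better signals than these.

```python
from collections import Counter, defaultdict

def quoted_lines(body):
    """Lines this poster quoted from someone else ('>' prefix convention)."""
    quotes = (ln.lstrip("> ").strip()
              for ln in body.splitlines() if ln.startswith(">"))
    return [q for q in quotes if q]

def own_lines(body):
    """Non-empty lines the poster wrote themselves."""
    return [ln.strip() for ln in body.splitlines()
            if ln.strip() and not ln.startswith(">")]

def summarize_threads(messages, top_threads=1):
    """Return {thread: [one line per posting]} for the busiest threads."""
    by_thread = defaultdict(list)
    for msg in messages:
        by_thread[msg["thread"]].append(msg)

    # Threads-down: treat the threads with the most messages as the hot topics.
    hot = sorted(by_thread, key=lambda t: len(by_thread[t]),
                 reverse=True)[:top_threads]

    summary = {}
    for thread in hot:
        # Citation signal: count how often each line was quoted in this thread.
        quote_counts = Counter(q for msg in by_thread[thread]
                               for q in quoted_lines(msg["body"]))
        picks = []
        for msg in by_thread[thread]:
            lines = own_lines(msg["body"])
            if lines:
                # Messages-up: keep the poster's most-quoted line;
                # when nothing was quoted, max() falls back to the first line.
                picks.append(max(lines, key=lambda ln: quote_counts[ln]))
        summary[thread] = picks
    return summary
```

Feeding it a week of list traffic would give one line per posting for the busiest thread or threads. The quote count is the only notion of importance here; the traffic-pattern and linkage ideas would have to be added as extra scoring terms.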

    As several people have noted, this is all a big, hard problem to solve, and there would be real uses for it if anyone could put it all together. Would we be willing to pay for such a service? I dunno, depends how good it is I guess. But if it really could reduce Usenet, web logs, mailing lists, and hey maybe even some normal web sites down into a small handful of roughly accurate documents that could be read over a cup of coffee each morning, then yeah I think that would be a valuable thing.