Computers Summarize the News
oily_ants writes "I get sick and tired of reading the same story on different web sites. That's why I like Slashdot so much. Good (??) summaries of all of the stuff out there on the net. Now there is a project at Columbia University by the NLP group that attempts to generate computer summaries of all of those news articles on different web sites. The project is called Newsblaster and the summaries are excellent. You can read about the project on regular news sites like Online Journalism Review or USA Today."
news.google.com. Just released yesterday. I haven't yet played around with it enough to say whether it's cool or not, but it does look promising.
Simpli - Your source for San Jose dedicated servers and colocation!
I get sick and tired of reading the same story on different web sites. That's why I like slashdot so much.
I'm sure most will agree with me when I say that this makes ABSOLUTELY NO SENSE.
Sounds like a good idea, but I'm worried about the objectivity of the "Newsbots". If I wanted to read a bunch of stories about the latest NVidia GeForce 4 release, 10 reasons more RAM is better, and why you should upgrade your hard drive, I'd just watch TechTV.
It hurts when I pee.
To tell you the truth, at first I thought the summaries were TOO good; I was suspicious that it wasn't really automated.
But after looking at a few more stories, it looks like it just pulls sentences out of the stories that seem to have a different point to make, and strings them together.
Sometimes you see some redundancy and some non-sequiturs, but I have to admit the illusion is pretty good.
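For the curious, the "pull out sentences that each make a different point and string them together" behaviour guessed at above can be sketched in a few lines of Python. This is only an illustration of that guess, not Newsblaster's actual code: the word-overlap (Jaccard) measure, the function names, and the thresholds are all assumptions made up for the example.

```python
import re

def tokenize(sentence):
    """Lowercase a sentence and return its set of words."""
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))

def jaccard(a, b):
    """Word-overlap similarity between two token sets (0.0 to 1.0)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def pick_distinct_sentences(sentences, max_sentences=3, max_overlap=0.5):
    """Greedily keep sentences that overlap little with those already kept.

    Hypothetical helper: the 0.5 overlap cutoff is an arbitrary assumption.
    """
    chosen = []
    chosen_tokens = []
    for sentence in sentences:
        toks = tokenize(sentence)
        if all(jaccard(toks, seen) <= max_overlap for seen in chosen_tokens):
            chosen.append(sentence)
            chosen_tokens.append(toks)
        if len(chosen) == max_sentences:
            break
    return " ".join(chosen)

# Made-up sentences standing in for text pooled from several articles.
sentences = [
    "The city council approved the new budget on Monday.",
    "On Monday the city council approved the new budget.",
    "Critics say the budget cuts library funding.",
    "The mayor is expected to sign it this week.",
]
summary = pick_distinct_sentences(sentences)
print(summary)
```

The second sentence is dropped as a near-duplicate of the first. Nothing here checks whether the kept sentences read well next to each other, which would be one way to end up with the non-sequiturs mentioned above.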
Sometimes it's best to just let stupid people be stupid.
This is a somewhat dangerous trend, IMHO. CNN Headline News gives us blurbs...soundbites...with no substance. "Israelis shot Palestinians" or vice versa on a daily basis. Little reporting on the substance of negotiations, or on why there was a conflict in that location at that time. The great thing about the internet is that there is great reporting in depth. I like to check out the Drudge Report, BBC, disinfo.com, etc. on a regular basis to get a good blend of various points of view so that I can form my OWN opinion. I don't want to be served watered-down sentence fragments by a corporate AOL/TimeWarner behemoth. Slashdot is one of a few exceptions to this rule, since they typically link to articles of substance and allow for dialogue and debate by (usually) intelligent users. The moderation system isn't perfect, but it helps dodge the trolls. My guess is that automated summaries will lose the flavour of good journalism/writing, and by taking an "average" will end up with a C+ "factual comprehension" review as opposed to multiple A+ "theory" and "synthesis" editorials.
John Maynard Keynes: "When the facts change, I change my mind. What do you do?"
Check out this odd story about incarcerated Browns. The summarizer could apparently still use some manual supervision.
So where's the slashbox for it?
Davo -- Free speech, free software, AND free beer.
What are the copyright or other legal issues with republishing news stories collected from web sites? The Newsblaster site clearly states where the information comes from - like every good college student is taught to cite information sources. On the other hand, on the bottom of many of the stories is the notice: "Copyright 2002 Associated Press. All rights reserved. This material may not be published, broadcast, rewritten, or redistributed." Is collecting and condensing news stories "republishing" - does this violate copyright stuff?
We have a summarization strategy that selects from three summarizers: one that works over documents describing a "single event" which is novel, one that works over documents describing a person (so-called biography events) using sentence extraction, and one that is a general sentence extractor based on the biographical summarizer which does use more than just TFIDF weighting for the extraction. (It has a notion of semantic classes, and some other stuff.)
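For readers who haven't met TFIDF weighting, here is a minimal Python sketch of the plain baseline that the general extractor goes beyond. It is an illustration only, not the system's code: treating each sentence as its own "document" for the IDF counts, and the function names, are assumptions made for the example, and none of the semantic-class features are modelled.

```python
import math
import re
from collections import Counter

def words(text):
    """Lowercase word tokens of a sentence."""
    return re.findall(r"[a-z0-9']+", text.lower())

def tfidf_sentence_scores(sentences):
    """Score each sentence by the sum of tf * idf over its distinct words.

    Assumption: every sentence counts as one 'document' when computing IDF.
    """
    tokenized = [words(s) for s in sentences]
    n = len(tokenized)
    doc_freq = Counter()
    for toks in tokenized:
        doc_freq.update(set(toks))  # count each word once per sentence
    scores = []
    for toks in tokenized:
        if not toks:
            scores.append(0.0)
            continue
        tf = Counter(toks)
        scores.append(sum((count / len(toks)) * math.log(n / doc_freq[w])
                          for w, count in tf.items()))
    return scores

# Made-up example sentences.
sentences = [
    "The storm hit.",
    "The storm flooded the harbor.",
    "The storm.",
]
scores = tfidf_sentence_scores(sentences)
best = max(range(len(sentences)), key=lambda i: scores[i])
print(sentences[best])
```

Words that appear in every sentence ("the", "storm") get an IDF of zero, so the third sentence scores nothing and the one carrying the most distinctive words wins the extraction.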
The "single event" summarizer is novel though. It uses a clustering component to cluster the sentences, then for each cluster it takes the intersection of the sentences (yes, we need to parse the text to do this, and we do) and RE-GENERATES (does not extract) a sentence that synthesizes the information from the cluster.
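A toy Python sketch of the cluster-then-intersect idea, with heavy caveats: the real system parses the sentences and generates a new one, whereas this stand-in only clusters by word overlap and lists the words every sentence in a cluster shares. The single-pass clustering, the 0.3 threshold, and all names are assumptions invented for the illustration.

```python
import re

def tokens(sentence):
    """Lowercase word tokens of a sentence, in order."""
    return re.findall(r"[a-z0-9']+", sentence.lower())

def similarity(a, b):
    """Jaccard overlap between the word sets of two token lists."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa and sb else 0.0

def cluster_sentences(sentences, threshold=0.3):
    """Single-pass clustering: join the first cluster whose seed is similar enough."""
    clusters = []
    for s in sentences:
        toks = tokens(s)
        for cluster in clusters:
            if similarity(toks, cluster[0]) >= threshold:
                cluster.append(toks)
                break
        else:
            clusters.append([toks])  # no match, so start a new cluster
    return clusters

def shared_words(cluster):
    """Words present in every sentence of the cluster, in the seed's word order."""
    common = set(cluster[0])
    for toks in cluster[1:]:
        common &= set(toks)
    ordered, seen = [], set()
    for w in cluster[0]:
        if w in common and w not in seen:
            ordered.append(w)
            seen.add(w)
    return ordered

# Made-up sentences standing in for text from different articles.
sentences = [
    "Rescuers found nine miners alive on Sunday",
    "Nine miners were found alive after three days",
    "The stock market fell sharply on Friday",
    "Stocks fell sharply as the market closed on Friday",
]
clusters = cluster_sentences(sentences)
for cluster in clusters:
    print(shared_words(cluster))
```

The shared words are only the raw material. Turning them back into a grammatical sentence is the re-generation step, and that is the hard part this sketch skips entirely.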
There's a lot of other stuff going on as well: we're using a text categorization system that we developed here, a text clustering system, our own system for categorizing the images that come with the articles (you'll be able to browse by image categories soon as well) and some other stuff.
Basically, it looks at the headlines on Yahoo/Reuters, finds sentences that scan as 5/7/5, and uses Perl cleverness to present them as little news haikus (or senryu, if you wanna be picky). It's great stuff:
I'm hooked :)
They have archives going back to the beginning of 2001, with only a few holes (e.g. the days after September 11), and they talk about how they are doing everything. Bonus points: you can have the haiku headlines mailed to you automagically every day. I just hope they have the bandwidth (etc) to withstand Slashdot....
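The 5/7/5 scan itself is easy to approximate. The site uses Perl, and its code isn't shown here, so what follows is just a rough Python sketch of the idea: the vowel-group syllable counter is a crude heuristic that will miscount plenty of English words, and the function names are made up for the example.

```python
import re

def count_syllables(word):
    """Rough heuristic: count vowel groups, discounting a trailing silent 'e'."""
    word = re.sub(r"[^a-z]", "", word.lower())
    if not word:
        return 0
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)

def split_haiku(sentence, pattern=(5, 7, 5)):
    """Return the lines if word boundaries fall exactly on the pattern, else None."""
    lines, current, total = [], [], 0
    targets = list(pattern)
    for word in sentence.split():
        if not targets:
            return None  # words left over after the last line
        current.append(word)
        total += count_syllables(word)
        if total == targets[0]:
            lines.append(" ".join(current))
            current, total = [], 0
            targets.pop(0)
        elif total > targets[0]:
            return None  # a word straddles a line break
    return lines if not targets and not current else None

# Made-up headline, not one from the site.
haiku = split_haiku("Markets fall again investors wait for good news nothing changes much")
if haiku:
    print("\n".join(haiku))
```

Most headlines return None, which is presumably why a scanner like this needs a large daily feed to turn up a handful of hits.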
DO NOT LEAVE IT IS NOT REAL
And don't forget http://catalogs.google.com/ for online searching of mail-order catalogs. (They scan 'em, OCR 'em, and make 'em searchable.)
My sci-fi novel, Ghost Thief, is now available from Amazon.com.