Computers Summarize the News
oily_ants writes "I get sick and tired of reading the same story on different web sites. That's why I like Slashdot so much. Good (??) summaries of all of the stuff out there on the net. Now there is a project at Columbia University by the NLP group that attempts to generate computer summaries of all of those news articles on different web sites. The project is called Newsblaster and the summaries are excellent. You can read about the project on regular news sites like Online Journalism Review or USA Today."
...whether this will include the obscure stories that are actually interesting, or whether it'll be just a rehash of the major stories that we can find in ten or twelve other places.
This is a somewhat dangerous trend, IMHO. CNN Headline News gives us blurbs...soundbites...with no substance. "Israelis shot Palestinians" or vice versa on a daily basis. Little reporting on the substance of negotiations, or on why there was a conflict in that location at that time. The great thing about the internet is that there is great reporting in depth. I like to check out the Drudge Report, BBC, disinfo.com, etc. on a regular basis to get a good blend of various points of view so that I can form my OWN opinion. I don't want to be served watered-down sentence fragments by a corporate AOL/TimeWarner behemoth. Slashdot is one of a few exceptions to this rule, since they typically link to articles of substance and allow for dialogue and debate by (usually) intelligent users. The moderation system isn't perfect, but it helps dodge the trolls. My guess is that automated summaries will lose the flavour of good journalism/writing, and by taking an "average" will end up with a C+ "factual comprehension" review as opposed to multiple A+ "theory" and "synthesis" editorials.
John Maynard Keynes: "When the facts change, I change my mind. What do you do?"
So where's the slashbox for it?
Davo -- Free speech, free software, AND free beer.
Wrong on every count.
Besides the fact that /. gets its news from other places and is always hours or days late with it, the worst thing you can do is get all your news from one source.
Every news site has some kind of slant to it. CNN, NPR, /. (And my favorite of your list, "USA Today".) Sometimes you get more information than is contained in a story merely by seeing how different people report the story! Reading one-paragraph summaries of the day's news will tell you nothing at all. Maybe worse, they will mislead you because there isn't enough information.
I read news from about 10 sources a day and if I see multiple articles that I'm not interested in they're easy to skip. If I am interested in them I read them on all sites. You get much, much more information that way.
Though you do need to pick your sites. If you look at CNN, MSNBC and Salon and all three are merely parroting Reuters then you know you're not doing yourself any good.
Contrary to popular belief, coding is not all free blow-jobs and beer. Those things cost MONEY!
As expected, the content it's presenting is predominantly US-centric. I'll be giving it a miss until they start scraping from a more globally representative pool of media sources...
What if we have two such automated news services and they scan each other? Wouldn't they get stuck in some sort of infinite loop where they repeatedly pass the same story back and forth, summarizing it over and over again?
I think you are confusing plagiarism with a violation of copyright. I am primarily concerned with the legal issue of copyright violation raised by the previous poster, not an amorphous ethical one.
As Bitlaw points out, under the Copyright Act, four factors are to be considered in order to determine whether a specific action is to be considered a "fair use." These factors are as follows:
1) the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;
2) the nature of the copyrighted work;
3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
4) the effect of the use upon the potential market for or value of the copyrighted work.
Dang, clicked the submit button by mistake.
Attempting to apply the four factors there, while some could be argued either way, I can see that on balance, you both might be right. I could probably make a stronger case that it doesn't qualify as fair use than that it does, based on those four factors. I think I was focusing over-much on the "amount taken" criterion and overlooking the others.
--LP
...but rather in identifying multiple documents that appear to be talking about the same thing. Summarization is a well-researched (but not well-perfected) NLP topic, but finding inter-document similarities is quite a bit more challenging. This is easy for me and you to do when we read something, but think about what it takes to get a machine to do this. Take a look at some of the examples--you'll find that although large chunks may be verbatim from document to document (especially ones that rehash standard news feeds like Reuters and AP), most articles have a different wording or spin on each idea.
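To make that concrete, here is a rough sketch of about the simplest approach you could try: treat each article as a bag of words and group articles whose word counts have a high cosine similarity. This is not how Newsblaster does it (I don't know what they use); the function names, the tokenizer, and the 0.5 threshold are all made up for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_similar(docs, threshold=0.5):
    """Greedy grouping: each document joins the first group whose
    first member is at least `threshold` similar, else starts a new group.
    The threshold is an arbitrary illustrative value."""
    vectors = [bag_of_words(d) for d in docs]
    groups = []  # each group is a list of indices into docs
    for i, vec in enumerate(vectors):
        for g in groups:
            if cosine_similarity(vec, vectors[g[0]]) >= threshold:
                g.append(i)
                break
        else:
            groups.append([i])
    return groups
```

Word overlap like this catches the verbatim Reuters/AP rehashes, since they share nearly all their words. It does much worse on two articles that report the same event in different wording, which is exactly the hard part.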
:wq