Bayesian Filtering Outside of Email?

I'd like to use it for Slashdot.. by NanoGator · 2004-03-29 16:34 · Score: 2, Funny

.. imagine, filtering out MS fud stories and dupes!

--
"Derp de derp."

Re:I'd like to use it for Slashdot.. by heim913 · 2004-03-29 17:48 · Score: 1

"First Post" 's, too!
Re:I'd like to use it for Slashdot.. by eugene ts wong · 2004-03-29 18:47 · Score: 1

I'd use it to filter /. 1st posts as well. That was 1 of the 1st applications that crossed my mind. Filtering out goatse, tubgirl & GNAA posts would be useful as well. In fact, the whole moderation system should move towards using it.

I'm just thinking out loud, though, & I'm not a filtering expert.

--
testing out my trending skills
Re:I'd like to use it for Slashdot.. by Anonymous Coward · 2004-03-29 20:59 · Score: 0

But there'd be nothing left!
Re:I'd like to use it for Slashdot.. by Leffe · 2004-03-29 21:13 · Score: 1

Nonono, you use it for karmawhoring :)

1. Run the filter on a number of posts, trolls, karmawhores, the rest, ....
2. Write a comment.
3. Run the filter on the comment, if the score is too low, try to improve it by using words that will give you more karma. Such as CowboyNeal, SCO and Micro$oft.

Nyuk Nyuk Nyuk by Ieshan · 2004-03-29 16:35 · Score: 1

Slashdot Dupes.

And, as a more insightful suggestion, troll posts marked as redundant in slashdot stories. There have been a few "attacks" on slashdot which could have been prevented by simply blocking 'repeat' posts.

Re:Nyuk Nyuk Nyuk by NanoGator · 2004-03-29 16:37 · Score: 3, Interesting

" There have been a few "attacks" on slashdot which could have been prevented by simply blocking 'repeat' posts. "

Filtering out GNAA posts would be nice. Not that I've run into it lately, but there was a story a couple of months back that had nearly 1,000 GNAA posts. Impressive organization on the behalf of the trolls, but it did take a while to suss out. (I wonder how many mods burned up mod points that night...)

--
"Derp de derp."

Bayesian isn't the right approach by costas · 2004-03-29 16:38 · Score: 4, Informative

Bayesian needs pre-determined "bins" of data to assign a new piece of information to --that's a limited approach that will break down for news articles or generic Web pages. A combination of context- and collaborative-filtering is a much better approach IMSHO (that's my newsbot, BTW).

Re:Bayesian isn't the right approach by stoborrobots · 2004-03-29 16:48 · Score: 3, Insightful

Most "Filtering" techniques fall into the same trap you've outlined - namely that they require pre-determined bins to sort data into. This is the nature of the beast.
There are "clustering" techniques which attempt to identify similar bunches of data, without respect to any pre-determined bins, but they are not as useful for programmatically dealing with information. This is simply because you don't know what the clusters will contain, so you cannot make assumptions about what you will want to do with each cluster.
Classification systems are used when you WANT to fit things into one of a number of bins that you already have decided what to do with (e.g. SPAM - delete, From Mistress - show now, From Boss - file for later, From Debt collector - return "Deceased", etc.) Bayesian filtering is simply one form of classification.
For more information and ideas, check out KD Nuggets
Nice work on the newsbot, BTW.

--
"Go to CNN [for a] spell-checked, fact-checked summary" -- CmdrTaco
Re:Bayesian isn't the right approach by clonebarkins · 2004-03-29 16:51 · Score: 1

That's great stuff, and looks like it would work for what I want. Is the backend code open source? Curtis.

--

"The evil of the world is made possible by nothing but the sanction you give it." -- Ayn Rand
Re:Bayesian isn't the right approach by Chilles · 2004-03-29 20:14 · Score: 1

You could try sorting into "interesting" and "uninteresting" based on previously labeled webpages. Those two categories would be entirely user specific and any dataset would become invalid over time as user interest shifts, but still, these are two "good" bins.
For newsfeeds you could set a subject (for example: "Presidential elections") and sort into "About presidential elections" and "Not about presidential elections". You just make an initial suggestion (a few articles maybe) and judge the first few articles the bayesian news sorter sends your way (by saying: "this is a good article, this is a bad article") and you're set for the next few days.
And then there's clustering techniques (as suggested by another poster) that might work as well. You could even use bayesian techniques to determine the quality of a found cluster based on user judgement of previously found clusters.
Re:Bayesian isn't the right approach by bhima · 2004-03-29 20:44 · Score: 1

I would love for an E-mail program to automagically sort my work e-mail into the project folder it belongs in!

--
Nothing in the world is more dangerous than sincere ignorance and conscientious stupidity.
Re:Bayesian isn't the right approach by Hard_Code · 2004-03-30 01:49 · Score: 1

Well, theoretically the same bayesian filter that knows to put spam in the "spam" folder, can be similarly taught to put arbitrary content in an arbitrary folder. The trick is training it. The email client would have to somehow "record" every time you moved or copied something into a folder (or numerous folders), and then, when a message fit that criteria, it would have to replicate that action, move/copy, to the specified folder or folders. I don't think it's all that hard, but I don't think it's been done in major email clients. I REALLY wish the bayesian filter in Mozilla was upgraded for arbitrary content and arbitrary destinations or actions (for instance I may want to mark it important, AND move it to a folder, AND maybe forward it to somebody else, all at once).
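To make the "record every move, then replicate it" idea concrete, here is a minimal sketch in Python. It assumes a plain multinomial naive Bayes with add-one smoothing; the class name `FolderClassifier` and its methods `record_move` and `suggest` are hypothetical names invented for illustration, not any mail client's API.

```python
import math
import re
from collections import Counter, defaultdict


class FolderClassifier:
    """Hypothetical sketch: naive Bayes over an arbitrary set of folders."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # folder -> word -> count
        self.msg_counts = Counter()              # folder -> messages filed there
        self.vocab = set()                       # every word seen in training

    @staticmethod
    def tokenize(text):
        # Lowercase and split into simple word tokens.
        return re.findall(r"[a-z0-9']+", text.lower())

    def record_move(self, text, folder):
        """Call this each time the user files a message into a folder."""
        words = self.tokenize(text)
        self.word_counts[folder].update(words)
        self.msg_counts[folder] += 1
        self.vocab.update(words)

    def suggest(self, text):
        """Return the most probable folder, or None if nothing has been learned."""
        if not self.msg_counts:
            return None
        total_msgs = sum(self.msg_counts.values())
        words = self.tokenize(text)
        best_folder, best_score = None, -math.inf
        for folder, n_msgs in self.msg_counts.items():
            counts = self.word_counts[folder]
            # Add-one smoothing; the extra +1 reserves mass for unseen words.
            denom = sum(counts.values()) + len(self.vocab) + 1
            score = math.log(n_msgs / total_msgs)  # log prior from filing history
            for w in words:
                score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_folder, best_score = folder, score
        return best_folder
```

A client would call `record_move` from its move/copy handler and `suggest` on each incoming message. Attaching several actions to one message (mark important, move, forward) would need one such model per action, which this sketch does not attempt.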

--

It's 10 PM. Do you know if you're un-American?
Re:Bayesian isn't the right approach by Crayon Kid · 2004-03-30 03:00 · Score: 1

The email client would have to somehow "record" every time you moved or copied something into a folder (or numerous folders), and then, when a message fit that criteria, it would have to replicate that action, move/copy, to the specified folder or folders. I don't think it's all that hard, but I don't think it's been done in major email clients.

Provided you find a bayesian filter which can use arbitrary destinations, Sylpheed Claws can easily take care of the automatic filtering using its folder processing rules.

--
i ate crayons when i was a kid and now i have two braincells and the blue ones taste nicer
Re:Bayesian isn't the right approach by dublin · 2004-03-30 17:40 · Score: 1

Well, theoretically the same bayesian filter that knows to put spam in the "spam" folder, can be similarly taught to put arbitrary content in an arbitrary folder. The trick is training it.

This is really not that hard. Check out POPfile, an open-source Perl program that's intended for spam filtering, but can be used and adapted for much more. It's as good or better than Mozilla's bayesian engine - I would still be using it except that the Mozilla approach does offer some integration benefits. For other applications, though, POPfile should be great - think of it as an all-purpose bayesian engine you can modify at will. (Not that that's necessarily trivial, but it *is* possible...)

--
"The future's good and the present is nothing to sneeze at." - Roblimo's last ./ post
Re:Bayesian isn't the right approach by Derek Mason · 2004-03-31 04:52 · Score: 0

There's no reason why you'd necessarily need to train it - Bayesian clustering can look for similarities in documents, and use model-selection techniques such as MDL or BIC to determine the most information-efficient arrangement. No doubt that's the kind of thing that is used over at Vivisimo, which automatically clusters search results.

Bookmark Filing by magnum3065 · 2004-03-29 16:49 · Score: 1

Here's an enhancement request I filed for Firefox. This is something I think would be a nice use of Bayesian filtering.

Re:Bookmark Filing by hool5400 · 2004-03-29 17:38 · Score: 1

Sorry, links to Bugzilla from Slashdot are disabled.

I get the feeling they've been slashdotted before. Once bitten, twice shy...

--

Remember, it takes 42 muscles to frown and only 4 to pull the trigger of a sniper rifle.
Re:Bookmark Filing by nicolas.e · 2004-03-30 03:29 · Score: 1

To view the bug report :

1. Enter http://bugzilla.mozilla.org/ directly in your browser's navigation bar.

2. Enter bug # 235076 and click show.

3. View suggestion.

4. ???

5. profit !
Re:Bookmark Filing by bhtooefr · 2004-03-31 02:44 · Score: 1

1. Go to http://www.opera.com/, click Free Download, and download the version for your platform.
2. Go back here, hit F12, and uncheck Enable Referrer Logging.
3. Click the link, and view the suggestion.
4. ???
5. Well, if you want to get rid of the Opera ad banner, it's not profit, but hey...

Autonomy's been doing this for years by Jayfar · 2004-03-29 16:51 · Score: 3, Interesting

See their technology overview. I believe they have a number of (ugh!) patents on Bayesian text analysis. They were founded by a Dr. Michael Lynch to productize research he did at Cambridge U.

Bayesian Approaches to Phylogenetics by GrumpySimon · 2004-03-29 16:56 · Score: 5, Informative

Bayesian approaches have really taken off in studies of molecular evolution (Phylogenetics).

For those of you who don't know, phylogenetics is a set of techniques for working out a 'family tree' of taxa (taxa = basically units of analysis, normally species or genetic sequences). The main reason for doing this is that it gives an objective way of testing evolutionary hypotheses. For example - If I predict a certain protein has evolved through stages A, B then C, but my tree shows a pattern of A - C - B, I can reject that hypothesis.

Phylogenetics is extremely powerful and has allowed us to investigate many many cool things (like the origin of modern humans in Africa, and the migrations out of it). The problem is that there is a *huge* number of trees to search to find the optimal set of trees. The formula (IIRC) is (2N-3)!! for rooted trees, where N is the number of taxa. So, 10 taxa (species or whatever) has 34 million trees, and when you get up to a real dataset it gets much worse: There are 10^132 ways of connecting my 77 taxa dataset.
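For anyone who wants to check those tree counts, here is a small sketch. It assumes the standard count of rooted, bifurcating, labelled trees, (2N-3)!!, which is the count that reproduces the 34 million figure for 10 taxa; the function names are made up for illustration.

```python
import math


def double_factorial(n):
    """Product n * (n-2) * (n-4) * ... down to 1 or 2; returns 1 for n <= 1."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result


def rooted_tree_count(n_taxa):
    """Rooted, bifurcating, labelled trees on n_taxa tips: (2N-3)!! for N >= 2."""
    if n_taxa < 2:
        return 1
    return double_factorial(2 * n_taxa - 3)


print(rooted_tree_count(10))              # 34459425
print(math.log10(rooted_tree_count(77)))  # order of magnitude for 77 taxa
```

The second line prints the base-10 logarithm of the 77-taxa count, which lands in the same ballpark as the 10^132 quoted above. Python integers are arbitrary precision, so the count itself is computed exactly.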

Bayesian approaches can really really speed up this process. We used to have to do a large number (100-1000) of heuristic analyses and then bootstrap (a resampling procedure) these to get a confidence interval, of say, a date of a divergence time or a model fit. These Bayesian techniques allow us to do, say, 10 long runs whilst simultaneously estimating parameters.

Sooo much faster (ie - that 77 taxa dataset mentioned before - instead of ~250 hours x 1,000, I can do the same in ~100 hours x 10).

There are some problems - it possibly over-estimates support (ie underestimates uncertainty in the data) for taxa groupings, compared to the bootstrap method. This isn't terribly surprising given the hill-climbing approach these algorithms use, but no-one's really sure whether this is a good or bad thing (since no-one's really sure how to interpret the alternative bootstrap support).

Fantastic software: Mr Bayes: Bayesian Inference of Phylogeny
and BAMBE: Bayesian Analysis in Molecular Biology and Evolution

--
henry -- the human evolution news relay

Re:Bayesian Approaches to Phylogenetics by gumbi west · 2004-03-29 17:38 · Score: 1

WinBugs is another code that does this stuff. It is amazing what it can do. Since this is slashdot, I will mention that their mailing list just had an email today mentioning a port to (the open source) R that is well underway. The program itself is free (as in beer).
But basically, the Bayesian approach is a probability approach, not a statistics approach (i.e. what is reality like based on my data and on previous data).
Re:Bayesian Approaches to Phylogenetics by wayne606 · 2004-03-29 17:59 · Score: 3, Informative

For a good summary of this stuff check out Biological Sequence Analysis : Probabilistic Models of Proteins and Nucleic Acids.
Also, hidden Markov models (which are used for phylogenetic analysis and involve Bayesian statistics) have been used longest in speech recognition.

i was just thinking about this by Siniset · 2004-03-29 16:57 · Score: 1

and especially how it applied to rss feeds, but that's not all. You could apply it to search results, friendster-type profiles, etc. Maybe that's what google has planned with their personalized search engines...

NNTP/Usenet by JGski · 2004-03-29 17:15 · Score: 1

For those who still bravely (foolishly) venture onto usenet, it would be nice to replace kill files with something Bayesian. There may be such a reader already but I haven't seen it (nevermind something cross-platform, which is a must for me).

Re:NNTP/Usenet by jpkunst · 2004-03-29 19:21 · Score: 2, Informative

For those who still bravely (foolishly) venture onto usenet, it would be nice to replace kill files with something Bayesian. There may be such a reader already but I haven't seen it (nevermind something cross-platform, which is a must for me).

There is one newsreader I know of which uses Bayesian filtering for articles in its latest version, but it's Mac only: MT-NewsWatcher.

JP

MT-Newswatcher by megabulk3000 · 2004-03-29 17:22 · Score: 3, Informative

Well, the latest version of MT-Newswatcher for Mac OS X utilizes Bayesian filtering to filter Spam out of newsgroup postings. Maybe not the most unusual application of things Bayesian, but a welcome one nonetheless.

pr0n! by Anonymous Coward · 2004-03-29 17:27 · Score: 1, Funny

It works great to sort pr0n! And it's much more useful than getting rid of spam too.

moderation by gumbi west · 2004-03-29 17:33 · Score: 1

It could help for slashdot. Unfortunately, the site is only given a small portion of a machine, so the added complexity would probably cost the parent company too much.

Re:moderation by wayne606 · 2004-03-29 18:03 · Score: 1

Do you mean the slashdot web site? According to the FAQ it runs on 10 fairly powerful CPU's. That was 4 years ago though.
Re:moderation by Anonymous Coward · 2004-03-30 15:03 · Score: 0

Yeah, the parent company keeps trying to slash costs and recently (remember their host change a few months ago) moved them onto a shared server with the idea of saving money.

Why...yes. by ByronEllis · 2004-03-29 17:49 · Score: 5, Informative

First off, the spam filters are actually classification algorithms, not filters---the name filter is incorrectly used almost exclusively by spam classification software--and worse yet they're really only referring to a specific classifier (the "Naive Bayes" algorithm) rather than to classifiers in general. "Bayesian" filters are things like Kalman Filters, Particle Filters and Hidden Markov Models which are used in any number of fields, but not really germane to the tasks you're asking about I think. Using "Bayesian Classification" in Google will probably yield more fruitful results.

It sounds like you want to extend the naive bayes classifier to more than two categories and, in the best case, learn new categories from the data. Both can be done and have been done with varying degrees of success. You might try here for some pointers to more information about how it is done (the algorithm itself has been around since the '60s---people only think it's something new). Unfortunately for things like RSS and email you're going to run into two problems: you really want to do your classification on-line and your data are actually quite sparse and your prior is usually uninformative so it's going to be hard to do the actual classification. But, who knows, it's still an active topic of research.
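For reference, the multi-category decision rule can be written out as below. This is a sketch assuming the usual multinomial naive Bayes with add-one smoothing; the symbols are introduced here for illustration and are not taken from any particular paper.

```latex
\hat{c} \;=\; \arg\max_{c \in \{1,\dots,K\}}
  \left[ \log P(c) + \sum_{i=1}^{n} \log P(w_i \mid c) \right],
\qquad
P(w \mid c) \;=\; \frac{N_{c,w} + 1}{N_c + |V|}
```

Here $w_1,\dots,w_n$ are the words of the item being classified, $N_{c,w}$ is how often word $w$ appeared in training items of category $c$, $N_c = \sum_w N_{c,w}$, and $|V|$ is the vocabulary size. The difficulties mentioned above show up directly: with sparse data most $N_{c,w}$ are zero, so the smoothing term dominates, and with an uninformative prior $P(c) = 1/K$ the first term is the same for every category and drops out of the comparison. On the other hand, the rule depends only on counts, so it can be updated on-line by incrementing $N_{c,w}$ as each labeled item arrives.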

Re:Why...yes. by Anonymous Coward · 2004-03-30 06:09 · Score: 0

I wish I could mod you up, but I don't have mod points right now. Thank you for bringing some sense to this discussion; I always cringe when people bring up "Bayesian Filters" on slashdot. The wrong name, limited view of applications, and little knowledge of how Naive Bayes really works. I'm just biased because I have studied it so much. The paper you reference is perfect for the topic.

Classifier4J, NNTP//RSS &Bayesian Blog Classif by ArkieNerd · 2004-03-29 17:52 · Score: 2, Informative

Try visiting http://www.mackmo.com/nick/blog/java/?permalink=classifier4jnntprss.txt

"I now have Classifier4J and nntp//rss working together to do Bayesian classification of RSS feeds. There are a few things still to work out (performance and usability to name two), but I'm pretty pleased with it, since it was something I whipped up in a couple of hours. AFAIK it is the first Bayesian/RSS thing that has got far enough to have a screenshot..."

Yes, this has been done for RSS feeds by Cecil · 2004-03-29 17:58 · Score: 1

My friend has done this with his growlmurrdurr aggregator. It uses SpamBayes along with a set of "this sucks", "this is yay" buttons on displayed feeds to highlight them appropriately.

Also, I'm not certain, but I strongly suspect that Google is using some sort of Bayesian filtering as at least part of their criteria for Google News.

--
Random and weird software I've written.

Re:Yes, this has been done for RSS feeds by Lao-Tzu · 2004-03-30 03:22 · Score: 1

Hey, that's me!

Yeah, I tried it. It tends to suck, actually. RSS feeds don't have quite enough information to usefully classify every article that comes up. Especially when a lot of your RSS feeds contain nothing but the title of an article.

But you can see it kinda in action on my own aggregator. The software works well, but the bayesian classification is not too useful. I guess part of the problem is also that the majority of my RSS feeds I actually want to read.

Similarly... identifying webpage blocking by OnyxRaven · 2004-03-29 18:04 · Score: 2, Interesting

I'm working on a project for my Senior Project that could take the Bayes method to identify webpages that are 'good' or 'bad' for a proxy or bridge based connection filtering or bandwidth limiting application.

Now, obviously for webpages it's a bit easier to say 'good' or 'bad', but this app (www.bandwidtharbitrator.com) already has some regular expressions for apps like Kazaa and Bittorrent, in the hopes of limiting the bandwidth. I wonder if a Bayesian system could be adapted to this domain? I considered it, but the person in charge of that part of the project is using a diff-like method (which I find silly).

Are there easy-to-plug-into APIs and libraries like that we could use to do all the 'hard work'? Is SpamBayes up to the task?

--
--onyx--

oh yeah by revmoo · 2004-03-29 18:16 · Score: 3, Funny

What other areas can you think of where Bayesian filtering may prove useful?

Family discussions?

--
I would expect such blatant racism on Fark, but on Slashdot? Mods please ban this asshole.

the paperclip by drDugan · 2004-03-29 19:04 · Score: 3, Informative

the technology developed at MS research to get the paperclip (the office help animate hate attractor) to work is based on a bayes net.

http://www.wired.com/news/print/0,1294,43065,00.html

Re:the paperclip by mandalayx · 2004-03-29 22:07 · Score: 1

If you are looking for a more academic perspective, Wikipedia is here too.

Stock speculation by Tomah4wk · 2004-03-29 20:51 · Score: 1

I have a friend at university who is using it to analyse news stories and make predictions about stock increase/decrease (Masters degree project). It seems to be working well enough that if you followed exactly what was guessed so far you would have made money, however I still wouldn't trust it with real money (the gains are quite small, and obviously the risk is still high). However, combined with human knowledge this really does look like a potentially very interesting bit of software.

opera by 216pi · 2004-03-29 20:54 · Score: 1

I know, it's in mail, but as far as I know, Opera's mail client (in the current beta 7.5 at least) uses bayesian filtering to sort non-spam messages in your views. Opera learns where to sort mails when you drag and drop mails from one view to another so you don't have to set up rules (you can if you want, but you don't have to).

Control algorithms by lindelof · 2004-03-29 21:35 · Score: 5, Interesting

I work at the Building Physics Laboratory in Lausanne, Switzerland, and I investigate the possible use of Bayes' theorem in the building control field. The idea is to classify situations as bad or good based on feedback from the occupants and have the system learn from its mistakes.

Consider, for instance, the total amount of sunlight hitting your computer screen. Most people would like an automatic system to control their window blinds to keep that amount to an acceptable level, but the system cannot know a priori what that level will be for a given user. So we let the system set the blinds to a setting deemed acceptable for the average user and use the user's manual interventions to build up a list of bad settings, corresponding to the setting immediately before the intervention, and good settings, corresponding to the setting immediately after the intervention.

The system will then attempt to minimize the probability of the user rejecting its settings by applying Bayes' theorem.
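A toy sketch of that loop in Python follows. It assumes blind positions are discretized into a handful of levels and that each manual intervention contributes one "bad" observation (the setting before) and one "good" observation (the setting after). The class `BlindPreferenceModel`, its method names, and the 0 to 10 scale are invented for illustration and are not the laboratory's actual controller.

```python
from collections import Counter


class BlindPreferenceModel:
    """Hypothetical sketch: estimate P(user rejects | blind level) with Bayes' theorem."""

    def __init__(self, n_levels=11):
        self.n_levels = n_levels  # assumed scale, e.g. 0 = fully open .. 10 = fully closed
        self.counts = {"bad": Counter(), "good": Counter()}

    def record_intervention(self, before, after):
        """The user moved the blinds: 'before' was rejected, 'after' was accepted."""
        self.counts["bad"][before] += 1
        self.counts["good"][after] += 1

    def _likelihood(self, level, label):
        # P(level | label) with add-one smoothing over all levels.
        c = self.counts[label]
        return (c[level] + 1) / (sum(c.values()) + self.n_levels)

    def p_reject(self, level):
        """Posterior P(bad | level) = P(level | bad) P(bad) / P(level)."""
        n_bad = sum(self.counts["bad"].values())
        n_good = sum(self.counts["good"].values())
        prior_bad = (n_bad + 1) / (n_bad + n_good + 2)
        num = self._likelihood(level, "bad") * prior_bad
        den = num + self._likelihood(level, "good") * (1 - prior_bad)
        return num / den

    def best_level(self):
        """Level with the lowest estimated probability of being rejected."""
        return min(range(self.n_levels), key=self.p_reject)
```

A real controller would condition on more than the blind position (outdoor illuminance, time of day, sun angle), but the update has the same shape: count the settings the occupant rejected and accepted, then pick the setting with the smallest posterior probability of rejection.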

I've done only preliminary exploration of this idea so far but the results are encouraging, and we plan to do a full-scale experiment this summer.

Short answer... by pjdepasq · 2004-03-30 00:19 · Score: 1

I have a short answer. Yes.

My students and I are building a filter for the web. We're really not ready to talk about it yet, but it is working well and we hope to get something "out there" soon (next year?).

Has anyone seen a content filter? by shaitand · 2004-03-30 00:30 · Score: 1

We take care of the technical needs of many schools throughout the area and every one of them wants web content filtering.

We typically set up squid and squidguard for them and grab blacklists from a regional database the schools put together.

The first thing you can't help but notice is that it sucks. Even with the various schools additions it doesn't block much of what it should and blocks quite a bit it shouldn't. All of the same problems come into play with these hardcoded blacklists that come into play with spam.

So I'm wondering, is there any filter for squid (or another linux based web proxy) which uses a more intelligent method such as Bayes?

Re: Bayesian Filtering Outside of Email? by manavendra · 2004-03-30 01:00 · Score: 1

Is anybody out there using Bayesian filtering for stuff other than to get rid of spam?
Look out for most content management systems - most of them happen to make use of some or other form of Bayesian algorithms to "cleanse" the content and/or extract attributes. After all, your "filter" is nothing but a set of rules built on a test/clean data, with which you compare your actual data.

For example, how useful would Bayesian filtering be to identify news stories/blog entries in the RSS feeds I monitor?
Do you monitor similar/same RSS feeds from different sources? What factors differentiate these two sources? Do you know the ground rules/criteria to determine sanctity for the same/similar RSS feed from these different sources?

Is there any software out there using Bayesian filtering to do this sort of thing already?
Don't know about that. Though I'm sure you can download some Bayesian implementation from the web and hook it up with your RSS feeds.

What other areas can you think of where Bayesian filtering may prove useful?
There are already content management (catalogs), and attribute extraction (ESS systems for large corporations need to exchange data with several suppliers via supplier catalogs).

--
http://efil.blogspot.com/

Popfile for mailing lists by juntunen · 2004-03-30 01:54 · Score: 1

I am trying to set up Popfile to sort mailing list messages into multiple buckets: very interesting, mildly interesting, worthless and so forth. I belong to several high-volume mailing lists and I've been wishing for an easier way to find what I care about without having to skim several hundred messages to find it. I am hoping the classifier will eventually pick up on what people and topics I like best.

System Logs by Kalak · 2004-03-30 04:07 · Score: 1

This would be a great application for system logs. You think your e-mail is full of spam and worthless junk, try going through MB of multiple system logs a day. I know there are logwatch tools, but AFAIK, they're regex based. A Bayesian approach would be great, as it would learn what I care about and what I don't. Heck, I might be able to convince work I need to write one now. Time to Google and see if such a thing exists.

--
I am, and always will be, an idiot. Karma: Coma (mostly effected by .hack)

bug/suggestion tracking by yardbird · 2004-03-30 07:08 · Score: 1

When the original "plan for spam" article came out, I got excited about it and incorporated it into a suggestion tracking system I was working on. The end result was nice. In the system, the user would look at email and associate it with existing suggestions or bug reports. The system learned what words were associated with which suggestions or bugs, and would show the user a list of suggestions which might be relevant for the email he was viewing. It worked surprisingly well.

--
Free, legal music for iTunes users.

Re:bug/suggestion tracking by AeiwiMaster · 2004-04-04 01:44 · Score: 1

I would like to learn more about this.

Do you have a link or other info ??
Re:bug/suggestion tracking by yardbird · 2004-04-06 15:42 · Score: 1

Sad to say, it was a work-for-hire so I don't have rights to the source. If you have any general questions about it, feel free to contact me: asthma_pie at earthlink dot net.

--
Free, legal music for iTunes users.

Kind of ... by pen · 2004-03-31 05:23 · Score: 2, Interesting

I run a submission-based web site that, at times, gets a lot of duplicate (or very similar) submissions. I have a basic Bayesian script break each new submission into words and flag it if it's too close to something else.
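The post doesn't say how "too close" is measured, so the sketch below swaps in a plainly non-Bayesian stand-in: Jaccard similarity between the word sets of two submissions. The function names and the 0.8 threshold are assumptions chosen for illustration.

```python
import re


def word_set(text):
    """Break a submission into a set of lowercase word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a, b):
    """Size of the intersection over size of the union; 1.0 for two empty sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def find_near_duplicates(new_text, existing_texts, threshold=0.8):
    """Return indexes of existing submissions whose word overlap meets the threshold."""
    new_words = word_set(new_text)
    return [
        i for i, old in enumerate(existing_texts)
        if jaccard(new_words, word_set(old)) >= threshold
    ]
```

Comparing against every stored submission is linear in the number of submissions, which is fine for a small site. A busier one would want an index, for example a map from word to submissions, to narrow down the candidates first.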

News Site uses Bayesian... by starrsoft · 2004-03-31 08:20 · Score: 1

Findory.com (run by a Slashdot user) filters news based on user preferences. It stores preferences automatically using cookies and requires no registration.

--
Read my blog: HansMast.com

Slashdot Mirror

54 comments