Bayesian Filtering Outside of Email?
clonebarkins asks: "Is anybody out there using Bayesian filtering for stuff other than to get rid of spam? For example, how useful would Bayesian filtering be to identify news stories/blog entries in the RSS feeds I monitor? Is there any software out there using Bayesian filtering to do this sort of thing already? Are other types of filters better for these purposes?" What other areas can you think of where Bayesian filtering may prove useful?
For those of you who don't know, phylogenetics is a set of techniques for working out a 'family tree' of taxa (taxa are basically the units of analysis, normally species or genetic sequences). The main reason for doing this is that it gives an objective way of testing evolutionary hypotheses. For example, if I predict that a certain protein evolved through stages A, B, then C, but my tree shows a pattern of A - C - B, I can reject that hypothesis.
Phylogenetics is extremely powerful and has allowed us to investigate many, many cool things (like the origin of modern humans in Africa, and the migrations out of it). The problem is that there is a *huge* number of trees to search to find the optimal set of trees. The formula (IIRC) is (2N-3)!! for rooted trees, where N is the number of taxa and !! is the double factorial. So 10 taxa (species or whatever) have about 34 million possible trees, and when you get up to a real dataset it gets much worse: there are about 10^132 ways of connecting my 77-taxon dataset.
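The figures quoted above (34 million trees for 10 taxa, roughly 10^132 for 77) match the standard count of rooted, strictly bifurcating trees on N labelled tips, which is the double factorial (2N-3)!!. A minimal sketch to check the arithmetic; the function names are my own and not taken from any phylogenetics package:

```python
def double_factorial(n: int) -> int:
    """Return n * (n - 2) * (n - 4) * ... down to 1 or 2 (1 if n <= 1)."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result


def rooted_tree_count(n_taxa: int) -> int:
    """Count rooted, strictly bifurcating trees on n_taxa labelled tips."""
    if n_taxa < 2:
        raise ValueError("need at least two taxa to build a tree")
    return double_factorial(2 * n_taxa - 3)


if __name__ == "__main__":
    print(rooted_tree_count(10))  # 34459425, i.e. about 34 million
    # Python integers are arbitrary precision, so 77 taxa is no problem.
    print(f"{rooted_tree_count(77):.2e}")
```

The count for 77 taxa lands between 10^132 and 10^133, which is why exhaustive search is out of the question and heuristic or sampling methods are used instead.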
Bayesian approaches can really speed up this process. We used to have to do a large number (100-1000) of heuristic analyses and then bootstrap these (a resampling procedure) to get a confidence interval on, say, a divergence date or a model fit. These Bayesian techniques allow us to do, say, 10 long runs whilst simultaneously estimating parameters.
Sooo much faster: for that 77-taxon dataset mentioned before, instead of ~250 hours x 1,000, I can do the same in ~100 hours x 10.
There are some problems: it possibly over-estimates support for taxon groupings (i.e. it underestimates the uncertainty in the data) compared to the bootstrap method. This isn't terribly surprising given the hill-climbing approach these algorithms use, but no one's really sure whether this is a good or bad thing (since no one's really sure how to interpret the alternative bootstrap support).
Fantastic software: MrBayes (Bayesian Inference of Phylogeny) and BAMBE (Bayesian Analysis in Molecular Biology and Evolution).
henry -- the human evolution news relay
First off, the spam filters are actually classification algorithms, not filters. The name "filter" is used incorrectly, and almost exclusively by spam classification software. Worse yet, they're really only referring to one specific classifier (the "Naive Bayes" algorithm) rather than to classifiers in general. Real "Bayesian" filters are things like Kalman Filters, Particle Filters and Hidden Markov Models, which are used in any number of fields but are not really germane to the tasks you're asking about, I think. Searching Google for "Bayesian Classification" will probably yield more fruitful results.
It sounds like you want to extend the naive Bayes classifier to more than two categories and, in the best case, learn new categories from the data. Both can be done and have been done with varying degrees of success. You might try here for some pointers to more information about how it is done (the algorithm itself has been around since the '60s; people only think it's something new). Unfortunately, for things like RSS and email you're going to run into two problems: you really want to do your classification on-line, and your data are actually quite sparse while your prior is usually uninformative, so it's going to be hard to do the actual classification. But, who knows, it's still an active topic of research.
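For the curious, here is roughly what a naive Bayes classifier with more than two categories looks like when it is trained one item at a time. This is only a sketch under my own assumptions: whitespace tokenization, add-one (Laplace) smoothing to cope with sparse data, a prior taken from category frequencies, and made-up class and method names. It is not how any particular spam filter or feed reader does it.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes over any number of categories, trained online."""

    def __init__(self, alpha: float = 1.0):
        self.alpha = alpha                       # smoothing strength (assumed)
        self.doc_counts = Counter()              # documents seen per category
        self.word_counts = defaultdict(Counter)  # word frequencies per category
        self.vocab = set()                       # every word seen so far

    def train(self, label: str, text: str) -> None:
        """Update the counts with a single labelled item."""
        words = text.lower().split()
        self.doc_counts[label] += 1
        self.word_counts[label].update(words)
        self.vocab.update(words)

    def classify(self, text: str):
        """Return the category with the highest log posterior, or None."""
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in words:
                # Smoothed log likelihood; unseen words get a small non-zero mass.
                score += math.log((counts[word] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    nb = NaiveBayes()
    nb.train("science", "fossil evolution species genome")
    nb.train("sports", "match goal team score")
    nb.train("tech", "linux kernel patch compiler")
    print(nb.classify("new fossil species found"))
```

Because training only increments counters, a new category appears the first time you label something with it, although deciding that a new category is needed is still up to you.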
Consider, for instance, the total amount of sunlight hitting your computer screen. Most people would like an automatic system to control their window blinds to keep that amount to an acceptable level, but the system cannot know a priori what that level will be for a given user. So we let the system set the blinds to a setting deemed acceptable for the average user and use the user's manual interventions to build up a list of bad settings, corresponding to the setting immediately before the intervention, and good settings, corresponding to the setting immediately after the intervention.
The system will then attempt to minimize the probability of the user rejecting its settings by applying Bayes' theorem.
I've done only preliminary exploration of this idea so far but the results are encouraging, and we plan to do a full-scale experiment this summer.
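One way the blinds idea above could be written down, purely as an illustration: discretize the blinds into a handful of levels, treat the logged bad and good settings as samples, and use Bayes' theorem to estimate P(rejected | level) for each level. The level encoding, the smoothing constant `alpha`, and the function names are all assumptions of this sketch, not details of the actual system.

```python
from collections import Counter


def rejection_probability(level, bad, good, levels, alpha=1.0):
    """Estimate P(rejected | level) via Bayes' theorem with add-alpha smoothing.

    bad  -- settings in force immediately before a user intervention
    good -- settings the user chose in that intervention
    """
    bad_counts, good_counts = Counter(bad), Counter(good)
    n_bad, n_good, k = len(bad), len(good), len(levels)

    # Prior probability that an arbitrary setting gets rejected.
    p_bad = (n_bad + alpha) / (n_bad + n_good + 2 * alpha)
    p_good = 1.0 - p_bad

    # Likelihood of seeing this level among rejected / accepted settings.
    like_bad = (bad_counts[level] + alpha) / (n_bad + alpha * k)
    like_good = (good_counts[level] + alpha) / (n_good + alpha * k)

    # Bayes' theorem: P(bad | level) = P(level | bad) P(bad) / P(level).
    evidence = like_bad * p_bad + like_good * p_good
    return like_bad * p_bad / evidence


def best_setting(bad, good, levels, alpha=1.0):
    """Pick the level with the lowest estimated rejection probability."""
    return min(levels,
               key=lambda lv: rejection_probability(lv, bad, good, levels, alpha))


if __name__ == "__main__":
    levels = [0, 1, 2, 3, 4]   # hypothetical: 0 = fully open, 4 = fully closed
    bad = [0, 0, 1, 4]         # settings the user overrode
    good = [2, 2, 3, 2]        # settings the user switched to
    print(best_setting(bad, good, levels))  # 2
```

This treats the levels as unordered categories and ignores time of day and weather, so it is a starting point for the reasoning rather than a usable controller.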