Bayesian Filtering For Dummies
Dynamoo writes "Bayesian filtering for spam is awfully clever stuff, touched on by Slashdot several times before. There's a very accessible article at BBC News explaining in fairly simple terms the drawbacks of current keyword-based filtering. It's slightly ironic that the BBC, through the commissioning of Monty Python, also gave 'spam' its name. Those Vikings have a lot to answer for."
I suggest Slashdot immediately implement this "Bayesian Filter for Dummies" to remove most of the trolls, etc.
The BBC article mentions Paul Graham, and I found his page (and some more information on Bayesian networks for spam filtering) here:
Paul Graham's spam page
He talks a little bit more about the technical aspects there.
the blood has stopped pumping, and he's left to decay
the me that you know is now made up of wires
I've been using it for a bit on my own e-mail, and it seems to work out. But it's not at the point where I'd be happy to see ISPs implementing it for their customers -- even ignoring the Freedom of Speech issue, it still has the occasional false positive.
Try not. Do or do not, there is no try.
-- Dr. Spock, stardate 2822-3.
It's slightly ironic that the BBC, through the commissioning of Monty Python, also gave 'spam' its name.
Does anyone have proof that's where the name comes from?
Mouse powered Chips, Open source Processors and Lego
Monty Python - vikings? What are you on about?
When all you have is a hammer, everything looks like a skull.
I'd say that the BBC has more in common with the Normans, actually.
Tarsnap: Online backups for the truly paranoid
I have used a Bayesian filter for some time now, and while it is the BEST filter type I have ever used, nothing is 100% reliable. While this is the best technology for the average user, it is most certainly not perfect. Instead I use a combination of moderate Bayesian filtering and good old-fashioned "block sender" filtering.
Thus proving that the TV generation is full of idiots.
Now, let's be fair. All it proves is that the poster is an idiot, and the SlashDot Editor-on-Duty is either an idiot or just lazy.
Someone needs to learn the meaning of "ironic". (Hint: it doesn't mean "weird coincidence".)
Paul
Bayesian filtering could stop all the spam that easily? This is great! Where can I download a filter like this? And back in the mid to late 80's or so, at least around Bell Labs where I worked, SPAM stood for Stupid People Asking for Money, when did that change?
But it doesn't have much Spam in it.
Interesting yes, ironic, no.
What's your name, Alanis Morissette ?
This whole spam thing reminds me of a story I read while in 7th grade. In it, the postage for sending junk mail was decreased to practically nothing. Then, junk mail buried America. Hundreds of years later, archeologists came back and investigated the remains. Their conclusions about our society are kind of humorous. However, the idea of junk mail burying us when the postage goes way down has kind of been proved with spam. Maybe a small tax for spam wouldn't be a bad idea.
Thank you for your support.
Is this truly the only Earth I can live on?
Viagra often spelled V-l-a-g-r-a online
I-f I t-r-o-l-l l-i-k-e t-h-i-s, w-i-l-l i-t p-a-s-s S-l-a-s-h-d-o-t.'s t-r-o-l-l f-i-l-t-e-r ?
"A door is what a dog is perpetually on the wrong side of" - Ogden Nash
Why then, does the article show a pic from a Monty Python animation about the black spot who goes to seek his fortune...
You'd think they'd use the actual pic of the skit with the Vikings in the cafe...
/sig
So this filter works on analysis of previously filtered mail?
I can see the casual (mis)use of this technique by your average user rapidly becoming a problem - putting just one email from a legit e-mail sender into the Bayesian filter could conceivably snowball into a block on a lot of legit traffic under certain circumstances.
Above and Below knows I have enough hassle with users and their e-mail already
"The sheer number of spam mail sent means that even tiny response rates, reportedly 0.0001%, means junk mailers turn a profit. "
And this is why I say that educating users is just about as important as implementing spam filtering technology. If people know that they are perpetuating a serious problem by replying to spam, then that's bad news for spammers.
About another fact mentioned in the article: It said Paul Graham's filter extracts "the top 15 features that define them as spam." 15? I thought that most Bayesian filters use many more spam-defining features. Because I'd say that there are quite a few more. Just think of the many features that spam tends to have. But he says his filter works well. Interesting.
In my home mailbox, I don't receive spam. And I only got two 419 Nigerian investment frauds on my professional address in a whole year, despite the fact that my corporate email address is widely publicized and easy to find on Google. And amazingly, I never receive spam in my "special bogus registration" Hotmail account (useful for programs like RealPlayer, or nytimes.com).
So existing mail filters work for me, more or less. The few unwanted mails that pass through are easily taken care of by my trusted delete button. This leads me to ask:
- Do other people really receive that much spam, or am I an isolated case ?
- Do people who receive spam purchase things online, or register software and other services with their real names and email ?
"A door is what a dog is perpetually on the wrong side of" - Ogden Nash
...supposedly uses some form of Bayesian reasoning. I've been using it for a year now. I trained it for a couple of weeks, turned it on "automatic filtering" mode, and now I can count the number of times it's misclassified a message on my two hands. I used to get more spam than legit mail, now I can't help but wonder why spam is a problem for people. Until I remember that most people don't use a Mac. Every once in a while, I flip it back into training mode so that I can see the lovely sea of brown-colored spam messages that flood my inbox. I flip it back to automatic mode, Mail automatically moves them to my junk folder, and I can forget about them.
I have a simple question: is there a way to implement a Bayesian filter for Evolution without having to add an extra stop for the email (i.e. a mail server on my computer from which Evolution picks mail up locally)?
I do security
I find in our case it stops 98-99% of spam dead in its tracks. There have been a few false positives, and you do need to check from time to time just in case any genuine emails are misclassified, but it's surprising just how quickly the filter sorts the wheat from the chaff.
Don't expect miracles but they can save you a lot of time... what I find cool is that it learns so quickly, almost like a complicated neural net should, but it's such a simple idea. I wonder if there are any other uses for this kind of thing?
Sorry, but my karma just ran over your dogma.
Perhaps /. could implement a bayesian filtering for killing all the dupes!
I simply got to the point that I could count the number of real emails on my hands. So I reversed my previous filter. Instead of filtering spam to my spam folder, I made it default *ALL* mail to the spam folder except from certain known addresses (such as work, friends and my own domain). So far, it has only made one wrong decision, and that was because I hadn't written the email address of a friend correctly.
:)
This is waaaaay better than any other filter method I've tried and requires no learning period at all.
Well, the type of Bayesian learning used in this spam filtering is called "Naive Bayesian" and the engine is trained using a "supervised learning" technique. Naive Bayes has proven very successful for text categorization. Spam filtering is even more successful because we essentially categorize e-mails into just two labels: "spam" or "not spam".
Supervised learning basically works like this. Feed the engine with multiple examples (in this case, e-mails) with labels (in this case, "spam" or "not spam"). The training usually takes thousands of examples to get good enough accuracy. And take note that we need both "spam" and "not spam" examples to enable the learning engine to distinguish them.
How does Naive Bayes work? Well, think of the full Bayesian network. A Bayes net is basically a cause-effect graph with an annotated Conditional Probability Table (CPT) on each node denoting the probabilities of possible values. A full Bayes net takes the form of a Directed Acyclic Graph (DAG), but Naive Bayes takes the form of a tree instead due to some "naive" assumptions. (Okay, I handwaved a whole lot of details here.) And in learning Naive Bayes, we basically try to construct the tree out of the examples.
Let P(spam) be the percentage of training e-mails that are labelled as "spam" and P(not spam) be the percentage of "not spam" e-mails.
First, let the filter read all e-mails and collect the words out of them. Weed out duplicates and stop words (common words like "I", "you", "the", etc). Let NumVocab be the number of words after weeding.
Second, process the e-mails one by one. Do the weeding phase as above. Let "n" be the number of words in that particular e-mail after the weeding. Scan the words one by one. Let "w" be the current word scanned and "nw" be the number of times word "w" occurs in that e-mail. Imagine you have a big two-dimensional array to store the result (let's call the array "P"). If the e-mail is labeled "spam", then store (nw+1)/(n+NumVocab) to P[w][spam]; otherwise store it to P[w][not spam].
Repeat until all training e-mails are read.
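For anyone who wants to play with the training phase, here is a minimal Python sketch of it. It makes some assumptions that are not in the description above: the function names (tokenize, train) and the tiny stop-word list are made up for illustration, and instead of overwriting P[w][label] once per e-mail it pools the word counts over all e-mails with the same label, which is the usual multinomial naive Bayes with add-one smoothing.

```python
from collections import Counter

# Illustrative stop-word list only; a real filter would use a longer one.
STOP_WORDS = {"i", "you", "the", "a", "and", "to"}
LABELS = ("spam", "not spam")

def tokenize(text):
    """The "weeding" phase: lower-case, split on whitespace, drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def train(examples):
    """examples is a list of (text, label) pairs; returns (prior, P, num_vocab)."""
    counts = {label: Counter() for label in LABELS}
    num_docs = Counter()
    for text, label in examples:
        counts[label].update(tokenize(text))
        num_docs[label] += 1

    vocab = set(counts["spam"]) | set(counts["not spam"])
    num_vocab = len(vocab)

    # P(spam) and P(not spam): share of training e-mails with each label.
    prior = {label: num_docs[label] / len(examples) for label in LABELS}

    # Total number of (weeded) words seen under each label.
    n = {label: sum(counts[label].values()) for label in LABELS}

    # P[w][label] = (nw + 1) / (n + NumVocab), pooled over the whole class.
    P = {
        w: {label: (counts[label][w] + 1) / (n[label] + num_vocab)
            for label in LABELS}
        for w in vocab
    }
    return prior, P, num_vocab
```

The +1 in the numerator keeps a word that never showed up under one label from getting a probability of exactly zero there.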
And here comes the testing phase...
When you encounter an e-mail and want to classify whether it's spam or not, you'll need to look up the array P you created earlier. First, you do the weeding phase and scan the words one by one. The algo is like this:
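What follows is not the poster's own listing, just a minimal Python sketch of how such a classification step is commonly written: multiply the prior of each label by P[w][label] for every word in the e-mail and pick the label with the larger product. The function name classify, the toy table and the neutral default used for words never seen in training are all assumptions made for illustration.

```python
def classify(words, prior, P, default):
    """Return the label whose naive Bayes score is highest.

    words   -- the already weeded words of the e-mail to classify
    prior   -- {label: P(label)}
    P       -- {word: {label: P(word | label)}} from the training phase
    default -- probabilities used for words that were never seen in training
    """
    scores = {}
    for label in prior:
        score = prior[label]
        for w in words:
            score *= P.get(w, default)[label]
        scores[label] = score
    return max(scores, key=scores.get)

# Toy numbers, purely illustrative.
prior = {"spam": 0.5, "not spam": 0.5}
P = {
    "viagra": {"spam": 0.20, "not spam": 0.01},
    "meeting": {"spam": 0.01, "not spam": 0.15},
}
default = {"spam": 0.05, "not spam": 0.05}
```

With this table, classify(["viagra", "free"], prior, P, default) comes out as "spam", because the unknown word "free" contributes the same factor to both labels and "viagra" tips the balance.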
Hope this helps.
--
Error 500: Internal sig error
pleaaase!!! stop timothy from spamming us with these boring articles...
maybe a little too simple?
I *like* spam! I print it out and jack off to it!
This allows your single spam/non-spam feedback to the system to do double duty, so that once the program knows that you consider an email source to be "trusted", it will allow even spammy-looking stuff (read: mailing list digests, plane schedules, bank statements, etc) through to your non-spam folder.
Of course, if spammers start constructing google-style databases of who your friends are and impersonating their accounts, then this won't work anymore... but if they start that, all hell is going to break loose anyway.
I don't care if it's 90,000 hectares. That lake was not my doing.
I have to say you both are idiots.
Although MP didn't invent the word 'Spam', they did pioneer its use considering we aren't talking about spiced canned ham by product.
I wonder if a Bayesian classifier could sort out banner ads? I currently use Guidescope to block them, but it would be far better not to rely on a third party to decide what's an ad URL. I think it would work, but training it might be hard.
(And before anyone says "Don't do that, websites will die" my response would be "Good, let most of them die." I hate ads.)
Reading about the history of email and instant messaging, it reminded me of how easy it was to echo "Hi there" > /dev/tty01 to send a message to another college acquaintance...ahh the memories...
I'm going to put the printer in landscape mode!
-- "At Microsoft, quality is job 1.1" -- PC Magazine, Nov. 1994
So what they are saying here is that Bayesian Filters other than current version do a better job???
Looks like we'd better get that early beta version reinstalled...
So tell me, Mr. Anderson. What good is an email if you can't read it?
*sigh* I suggested this back when bayesian was first mentioned. Filtering and classification for the admins, as well as the users. Throw in NNTP and Slashdot could be much better. Oh well, anyone wanna hear about my idea for keeping pancakes from sticking to your kitchen ceiling?
Instead I filter all of my mail for wanted/expected mail into a (large) tree of input folders, mailing lists, company mailings etc.
Most of what's left is spam, so a quick scan of the inbox (and creation of new rules) weeds out the uncaught desirables and the rest gets dropped in the bitbucket.
The point being that legitimate mail doesn't try to spoof my filters. I haven't (yet) had any spam arriving where it shouldn't. I'd rather my ISP dumped all the crud in the bin for me, but my marginal cost is low as I'm on ADSL. I now also use a distinct email for each purpose, making it easy to spot where spammers got it from and to create new rules as needed. It's a shame I didn't do this at the start as I have a couple of early ones that are spammed but I can't dump.
I hereby inform you that I have NOT been required to provide any decryption keys.
I'm using this now, and it works great!
Get it here.
-ted
Graham's method is called "naive Bayesian", and it's called "naive" for a reason. It works surprisingly well, but it barely scratches the surface of what people are doing with statistical models of text.
The lack of references on Graham's web site to prior work on text classification makes one wonder whether he just is unfamiliar with a huge body of literature going back decades or whether he just deliberately ignores them. Either way, Graham didn't invent any of the techniques and they are far from state-of-the-art. (Incidentally, you'll probably find Octave or Perl/PDL a more convenient language for implementing this stuff than Lisp.)
Anybody seriously interested in text filtering should at least do a little bit of background reading. "Readings in Information Retrieval" by Jones and Willett covers some of the basic papers.
Mozilla incorporates a two-step filter:
1. Is the sender in the address book? If yes, is not spam, otherwise:
2. Does the message have a probability of 90% that it is spam based on the Bayes filter? If so, flag as spam, otherwise not spam.
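The two steps can be sketched in a few lines of Python. This is not Mozilla's actual code; the function name is_junk, the shape of the address book (a set of lower-case addresses) and the idea that the Bayes score arrives as a number between 0 and 1 are assumptions made for illustration.

```python
def is_junk(sender, spam_probability, address_book, threshold=0.9):
    """Two-step decision: address book first, Bayes score second.

    sender           -- e-mail address of the sender
    spam_probability -- score from a Bayesian filter, between 0.0 and 1.0
    address_book     -- set of known addresses, stored in lower case
    threshold        -- score at or above which a message is flagged
    """
    # Step 1: anyone in the address book is never flagged.
    if sender.lower() in address_book:
        return False
    # Step 2: everyone else is judged by the Bayes score alone.
    return spam_probability >= threshold
```

Note that step 1 makes the filter only as trustworthy as the From address, which is easy to forge.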
Great. Wanna filter my email for me? ;)
The sheer number of spam mail sent means that even tiny response rates, reportedly 0.0001%, means junk mailers turn a profit.
Are we missing a critical factor of the end user who actually responds to SPAM?
If spammers survive on 0.0001% response rate, how many people are actually clicking/buying? Are these people who provide the customers for spammers going to stop or use any sort of filters?
That's what we refer to as "Alanis irony", 8^)
I don't use email. Yes, I have a few addresses but I haven't checked them in months. Email is kind of a dead way of communication anyway, beaten by things such as mobile phones and instant messaging.
If their response rate is one in a million, why not put a $0.0001 fee on emails? I use a lot of email, and I don't think that one cent per hundred is gonna break my bank. (Of course, if you tax it, they'll do it surreptitiously.)
http://www.accountkiller.com/removal-requested
Change the subscription service to all the articles, but zero AC posting; then you wouldn't have to filter by threshold as severely. How to stop the obvious dodge of multiple login handles I don't know, but merely charging per handle would slow it down considerably.
whoops, posting as AC....
On my mailbox outside my apartment I have a "No Junk Mail please" sticker... This actually works. I tried to put the same sticker on my PC, but the junk mail just keeps on coming... I don't understand....
Ask me no questions, and I'll tell you no lies...
Why go through all the work of training some software to read your email and decide if you might want to read it when most email programs have white list capabilities?
If I don't know you, that means I don't want to talk to you. Your email goes straight to a junk folder, which I can quickly scan once every few days for names I recognize. I can add these names to my white list if I so choose.
Granted, my job does not involve me soliciting contacts from the public at large, so this wouldn't work for everyone. I use it on my personal Hotmail account though, and I get to not even consider lots of crap every day.
You can never put too much water in a nuclear reactor.
As soon as there's a well-written app that works with or on top of programs like Outlook and Netscape, I will be excited. Until then, a huge (and likely most targeted) sector of people remains relatively un-filtered.
If you can't see the value in jet powered ants you should turn in your nerd card. - Dunbal (464142)
I get some newsletters in my mailbox that I actually want. Some of them are very similar to some spam mails in structure (having HTML code, pictures and so forth). How do the filters handle this potential problem? I'm not currently using a spam filter, so I don't know. Does anybody know?
Ask me no questions, and I'll tell you no lies...
I just had to backlevel bogofilter from 0.11 to 0.9. I don't know WTF happened between those two revs, but the filtering algorithm went straight to hell. I had forgotten that I normally get over 100 spams a day until I went to 0.11. Then it all came back and I started losing half an hour a day to sorting out my email.
I gave it a chance for over two weeks, and it never got even close to the success rate of 0.9. Not that I'm complaining, there was nothing left to improve upon in 0.9 AFAIC. (And yes, I did see that someone decided to reverse the function performed by the -N and -S switches -- thus making my crontab edits a nightmare with troubleshooting)
So I'm now back at 0.9 and back in nirvana. It's good to be home.
Intelligent Life on Earth
If you merely place the white list filter before the Bayesian system, and all whitelisted e-mail is sent directly to your inbox bypassing the filter, then the filter will miss its opportunity to be better trained. It will effectively have very few e-mails to reinforce what "not-spam" is and will only be fed spam. It might become overly aggressive in its filtering, and any non-whitelisted non-spam may have a higher chance of being incorrectly classed as spam.
If you use the whitelist after the Bayesian filter, then the filter again misses out on the opportunity to be better educated, as regardless of how it classifies the whitelisted "not-spam" e-mail, you'll still receive it.
The whitelist needs to be incorporated into the Bayesian system itself to ensure the filter is continually trained with what is theoretically known good "non-spam" mail.
The fact that a fish swims in water does not make it an expert in fluid dynamics. GogglesPisano (199483)
Why don't we just put a thorough e-mail embargoe against China, and let the communist gov't there shoot the spammers for us?
I absolutely HATE Chinese spammers with a vengance because they once forged my e-mail address and I got the angry replies, I wanted to drive down to China and shoot that little Chinese guy and stuff his little Chinese computer full of dynamite.
Fucking Chinese.
Ok, so some English are descendent of the Vikings but only because when We the Danes came to the UK we did not find sheep to go around
I just hate bit SPAM, (www.netnoise.com.kh)
MP didn't "pioneer its use" at all. In the skit they are referring to the actual product of spiced ham. How could just merely mentioning a product by name pioneer the use of the word? They obviously didn't coin any new term, they were talking about spiced ham.
"50 years of successful predictive modeling should be enough: lessons for philosophy of science"
it's a very common misconception, but the fact is that a well-written test (eg. the MMPI) will always be better than a human "expert" (eg. a psychologist).
A slightly different idea that I was considering today works as follows.
Take the Tagged Message Delivery Agent, a system that will send a challenge message to anyone it doesn't know (isn't in the whitelist), which you have to reply to.
Then change it so anything allowed through on the whitelist is added to the "Not Spam" category, and anything that is challenged is passed through the filter. If it passes, it doesn't get challenged (but also doesn't get added automatically to Not Spam), and if it _doesn't_ pass, then it gets challenged.
Few, if any, false positives, and challenges not sent where they don't need to be. Sounds foolproof enough...
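The routing rule described above fits in a short Python function. This is only a sketch of the decision logic and does not use any real TMDA interface; route, looks_like_spam and learn_not_spam are hypothetical names, with the last two standing in for whatever Bayesian filter is plugged in.

```python
def route(sender, text, whitelist, looks_like_spam, learn_not_spam):
    """Decide what to do with an incoming message.

    Returns "deliver" or "challenge".
    looks_like_spam(text) -> bool   (hypothetical Bayesian filter verdict)
    learn_not_spam(text)            (hypothetical training hook)
    """
    if sender in whitelist:
        # Whitelisted mail is delivered and used as a "Not Spam" example.
        learn_not_spam(text)
        return "deliver"
    if looks_like_spam(text):
        # Unknown sender and the filter is suspicious: send a challenge.
        return "challenge"
    # Unknown sender but the filter is happy: deliver without a challenge,
    # and do not add the message to "Not Spam" automatically.
    return "deliver"
```

The only mail that triggers a challenge is mail that is both from a stranger and spammy-looking, which is where the saving in outgoing challenges comes from.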
You don't necessarily need an explicit whitelist. All you need to do is include the email headers in the list of tokens from which the Bayesian filter learns.
Then, if you receive non-spam from a friend, their email address is automatically added to the list of non-spammy words.
Conversely, any time you classify a spam email, then that email address, and potentially the domain if the tokenising is smart, is added to the list of spammy words.
This is what SpamBayes already does, I believe.
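A rough Python sketch of what "include the headers in the tokens" might look like, using only the standard library email module. This is not SpamBayes's actual tokenizer; the function name tokens_with_headers and the "from:", "from-domain:" and "subject:" prefixes are made up for illustration, and multipart messages are simply ignored.

```python
from email import message_from_string
from email.utils import parseaddr

def tokens_with_headers(raw_message):
    """Turn a raw RFC 822 message into tokens, including header-derived ones."""
    msg = message_from_string(raw_message)
    tokens = []

    # The sender address and its domain become tokens of their own, so mail
    # from a known friend picks up a strongly "not spam" token over time.
    _, address = parseaddr(msg.get("From", ""))
    if address:
        address = address.lower()
        tokens.append("from:" + address)
        if "@" in address:
            tokens.append("from-domain:" + address.rsplit("@", 1)[1])

    # Prefix subject words so they are counted separately from body words.
    tokens.extend("subject:" + w for w in msg.get("Subject", "").lower().split())

    # Plain single-part bodies only; multipart payloads are skipped here.
    body = msg.get_payload()
    if isinstance(body, str):
        tokens.extend(body.lower().split())
    return tokens
```

Because the header tokens are just more words to the filter, no separate whitelist lookup is needed; the learned probabilities do the job.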
I use a Unix probability based spam filter written in awk and ksh with a whitelist built in. The whitelist is executed first as it is much faster than the spam filter. It is located at:
http://www.sofbot.com/
You could look for the message ID of a message you sent in the header fields of received messages (specifically, the in-reply-to header field). If you find it, it means that the received message is likely to be a reply to a message you sent.
You could look for a phrase from your signature, which could indicate that someone sent a reply and included your original message.
Besides the words in your signature, you could program in certain other words that automatically trigger a classification as non-spam. Those words might include the names of trademarked products that your company sells or similar types of words. Of course, this is just overriding some of the learning that presumably would happen automatically. But if these are very important words, then you must insist that nothing else the filter does can override the classification as non-spam, and thereby avoid false positives.
In summary, I think that bayesian classifiers, as Paul Graham proposes them, are just too naive. The addition of a few heuristics could make a big difference.
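The first heuristic, matching the In-Reply-To header against message IDs you sent, could be sketched in Python like this. The function name is_reply_to_me and the idea of keeping sent message IDs in a set are assumptions for illustration; how those IDs get collected from your outgoing mail is left out.

```python
from email import message_from_string

def is_reply_to_me(raw_message, sent_message_ids):
    """True if the In-Reply-To header names a message ID that we sent.

    sent_message_ids -- set of Message-ID strings, angle brackets included,
                        e.g. {"<1234@mail.example.org>"}
    """
    msg = message_from_string(raw_message)
    # In-Reply-To may list more than one ID, separated by whitespace.
    referenced = msg.get("In-Reply-To", "").split()
    return any(message_id in sent_message_ids for message_id in referenced)
```

A message that passes this test could then skip the Bayesian filter entirely, or just be given a strong "not spam" token.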
How to Increase Your Penis
And Stop Premature Ejaculation
FREE Bottle Offer 100% Guaranteed to work.
Take Advantage of Our FREE Bottle Offer As Seen On TV !!!
Click here to learn more.
NB: Amusingly my first revision of this was smacked down by slashdot's inbuilt junk filtering mechanisms. :P
// -- http://www.BRAD-X.com/ --
All the people who say "I don't get spam, why do you?" will be the first up against the wall when the revolution comes. Well, first after the damned spammers, anyway.
Given the fact you got modded to +5 funny, yes, if you troll like that it will get through the lameness filter and the people with mod points.
Good work!
"Live Free or Die." Don't like it? Then keep out of the USA
True, that mostly works... but it doesn't handle the possibility of my friend sending me an email where the spam-keywords overwhelm the "goodness" of his non-spammy email address. I like to know for certain that no matter what my friends send me, it will get to me (of course, if they send me too much crap, they'll lose their "friend" status... ;^))
I don't care if it's 90,000 hectares. That lake was not my doing.
heuristics could make a big difference.
I disagree -- the heuristics you mention are much more naive than the Bayesian filter. For example, what if someone doesn't quote your signature in their reply? What if their mailer doesn't include the Message ID? What if the email isn't a reply to something you wrote, but a spontaneous email?
Even if the heuristics did work well (and in my experience they don't), there is still the time factor -- I don't want to spend all of my free time coming up with and implementing new heuristic rules. I want my computer to do the scut work for me. Bayesian does that.
I don't care if it's 90,000 hectares. That lake was not my doing.
what the normans are? frogs, thats what. just because a few vikings pillaged the coast dont mean they are vikings. hah! maybe lucky eddie. the british are basically africans who picked up some civilisation from irish scots french an germans
Yes, that would most definitely be a problem.
:)
It would probably also tell me that it's time to get some new friends
n/t
Writers imply. Readers infer.
It is official; Slashdot now confirms: Spam is dying
.
.
One more crippling bombshell hit the already beleaguered Spam community when Slashdot confirmed that Spam market share has dropped yet again, now down to less than a fraction of 1 percent of all servers. Coming on the heels of a recent Slashdot survey which plainly states that Spam has lost more market share , this news serves to reinforce what we've known all along. Spam is collapsing in complete disarray, as fittingly exemplified by failing dead last [samag.com] in the recent Sys Admin comprehensive networking test.
You don't need to be a Kreskin [amazingkreskin.com] to predict Spam's future. The hand writing is on the wall: Spam faces a bleak future. In fact there won't be any future at all for Spam because Spam is dying . Things are looking very bad for Spam. As many of us are already aware, Spam continues to lose market share. Red ink flows like a river of blood.
Sex SPAM is the most endangered of them all, having lost 93% of its core developers. The sudden and unpleasant departures of long time Sex SPAM developers Jordan Hubbard and Mike Smith only serve to underscore the point more clearly. There can no longer be any doubt: Sex SPAM is dying
Let's keep to the facts and look at the numbers.
Viagra SPAM leader Theo states that there are 7000 users of Viagra SPAM. How many users of Penis Extender SPAM are there? Let's see. The number of Viagra SPAM versus Penis Extender SPAM posts on Usenet is roughly in ratio of 5 to 1. Therefore there are about 7000/5 = 1400 Viagra SPAM users. BSD/OS posts on Usenet are about half of the volume of Penis Extender SPAM posts. Therefore there are about 700 users of Soft Porno SPAM. A recent article put Sex SPAM at about 80 percent of the Spam market. Therefore there are (7000+1400+700)*4 = 36400 Sex SPAM users. This is consistent with the number of Sex SPAM Usenet posts.
Due to the troubles of Virginia, abysmal sales and so on, Sex SPAM went out of business and was taken over by Walmart who sell another troubled Dead Tree version. Now Walmart is also dead , its corpse turned over to yet another charnel house.
All major surveys show that Spam has steadily declined in market share. Spam is very sick and its long term survival prospects are very dim. If Spam is to survive at all it will be among OS dilettante dabblers. Spam continues to decay. Nothing short of a miracle could save it at this point in time. For all practical purposes, Spam is dead
Fact: Spam is dying
POPFile's Magnets work like this - based on From, To, or Subject.
Writers imply. Readers infer.
I have my main hotmail set to max filtering, i.e. only allow people I have in my address book or safe list.
;)
I've noticed that recently some spam has been coming through pretending to be amazon.com or bn.com as they are on my safe list (and I'd imagine many other people's too).
Is this the beginning of a wave of intelligent spam? One step up from them pretending to be from yourself. How soon before one of those Outlook virii is designed to divert the address book info to some spammer, so they can more than just guess what email addresses people are likely to let through?
The end is nigh I tells ya!
You'll soon be running out of bits to store the floating point results. Implement it by adding logarithms of probabilities instead of products of them, thus:
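A minimal Python sketch of the difference, under made-up numbers: product_score multiplies the probabilities directly, log_score adds their logarithms. The function names and the toy table are assumptions for illustration, and the probabilities are assumed to be smoothed so that none of them is exactly zero (log of zero is undefined).

```python
import math

def product_score(words, prior, P, label):
    """Naive product of probabilities; underflows for long messages."""
    score = prior[label]
    for w in words:
        score *= P[w][label]
    return score

def log_score(words, prior, P, label):
    """Same quantity in log space: a sum of logarithms instead of a product."""
    total = math.log(prior[label])
    for w in words:
        total += math.log(P[w][label])
    return total

# Toy table: one word that is ten times likelier in spam, repeated 400 times.
prior = {"spam": 0.5, "not spam": 0.5}
P = {"w": {"spam": 1e-3, "not spam": 1e-4}}
words = ["w"] * 400
```

With these numbers both products collapse to 0.0 in double precision, so the two labels can no longer be told apart, while the log scores stay finite and still rank "spam" above "not spam".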
If you have a couple of hundred key-words, this will make a lot of difference concerning the accuracy of the predictions.
Possible spam solution: require email senders to send a "revokable certificate" which can be downloaded from their web site. Only allow email addresses already in the address book and certified web email addresses to send email. This idea works great as a filter. The hard part would be getting websites to start posting email certificates for download, as well as getting the (optional) filter installed on the end users' machines.
Could be easily done if a create email certificate option was added to frontpage and the filter was built into outlook. Only downside to filters of course, is that they only cut traffic down where the filtering is done. Of course, if a good filter is implemented and everyone starts using it, I'm sure it would severely cut down on the outgoing spam in the first place.
BTW: filtered certificates would automatically send a certificate revocation notice back to the sender of the offending email. Certificates could easily be reinstated at any time just by downloading the certificate again. Web browser of course would have to work in conjunction with the email software in the certificate transfer process.
Certificate revocation would only require interaction between the mail server and the mail client.
Any reason why this isn't being done yet?
Start thinking of your email address as something more along the lines of a credit card number & think twice about giving it to someone who doesn't offer a certificate.
While friends in the US and Britain tell me that SPAM is 'SPiced hAM', I think it's really an acronym for 'Swine Parts and Artificial Meat'.
"Love is a familiar; Love is a devil: there is no evil angel but Love." --William Shakespeare ('Love's Labors Lost')
From what I've read on Bayesian filtering, it seems like most of the spam is caught not only from the body text, but also from the spam crap in the headers (e.g. a bad message ID). Could other tricks be added into this filtering as "virtual words" -- more picture than text, for example, or similar things?
I guess you don't get a lot of new correspondents. Do much web shopping?
Or do you review your Spam folder periodically?
A Plan For SPAM.
Of course, Vikings are entirely appropriate to SPAM. The Hormel factory which makes all America's SPAM (as well as the UK's) is located in Austin, Minnesota... and of course, Minnesota has a sports team called the Vikings.
GCHQ Quantum Insert installed. If only our tongues were made of glass, how much more careful we would be when we speak
I was impressed with the concept of Bayesian filtering, and was happy to find the latest Mozilla Mail supporting it with built-in buttons to mark mail as Spam/Not Spam. I wasn't so happy when I discovered many spams slipping through, even though they had the same type of content as previous ones I had marked as spam.
In viewing the source of one of the mail messages, I discovered embedded HTML comments which split up phrases and even words which might get flagged otherwise. The content of the comments appeared to be randomly generated text, so a filter wouldn't be able to categorize it the same every time. And the placement of the comments within a word wouldn't be the same every time either, so a naive implementation could never filter spam based on previous choices.
A smarter implementation might try operating on the results a viewer would see-- with the HTML tags stripped out. However, as soon as a filter does that, spammers could add Javascript within comments, that generates the spam text, and viewers who allow Javascript (many) would still see the spam.
It seems that both the text without the HTML tags, and the contents of the HTML tags, need to be considered separately in order to be able to filter the new generation of spam.
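Here is a rough Python sketch of that split, using regular expressions rather than a real HTML parser, so treat it as an illustration only. The function name split_html is made up; it returns the text a reader would roughly see plus the list of comments and tags, so each can be fed to the filter as its own stream of tokens.

```python
import re

COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
TAG = re.compile(r"<[^>]*>")

def split_html(html):
    """Return (visible_text, markup_pieces) for a chunk of HTML.

    Comments and tags are removed without inserting spaces, so a word that
    was chopped up by comments is glued back together.
    """
    # Collect comments first, then strip them so their contents cannot be
    # mistaken for tags or text.
    markup = COMMENT.findall(html)
    without_comments = COMMENT.sub("", html)

    markup += TAG.findall(without_comments)
    visible = TAG.sub("", without_comments)
    return visible, markup
```

The trade-off of gluing text back together is that words separated only by tags (say, two adjacent paragraphs) also run together; and none of this helps against text generated by Javascript, as mentioned above.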
Life's a lot like money-- you spend it, then it's gone. Spend wisely.
But then again I could use it to filter out all the commercials and forward to Pamela Anderson jiggling.
Sorry, I thought that said "Baywatch Filtering for Dummies"
I am SO embarrassed.
Well, there's spam egg sausage and spam, that's not got much spam in it.
In addition, we require reimbursement for our attorneys' fees in the present amount of $140.00.
I look forward to your response on or before January 30, 1998.
One of my accounts is glutted with spam. (Unfortunately, it's not one that I can just close down.) I highlight the spam messages (ctrl+A, usually) and mark them as junk, but Mozilla Mail 1.3 doesn't seem to classify incoming messages as junk, even after several weeks of training. I know basically how Bayesian filtering operates, but how do I get it to work?
I've read Grocklaw. BoycottNovell, you're no Grocklaw
And rejecting HTML mail from non-whitelisted sources is probably a good thing anyway
Bill Stewart
New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
It looks like a really nasty attack on Bayesian filters, at least until the filters start recognizing HTML comments as a bad thing.
Bill Stewart
New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks