More on Bayesian Spam Filtering
michaeld writes "The "Bayesian" technique for spam filtering recently publicized in Paul Graham's essay A Plan for Spam doesn't actually seem to have anything Bayesian about it, according to Gary Robinson (an expert on collaborative filtering). It is based on a non-Bayesian probabilistic approach. It works well enough, because it is frequently the case that technology doesn't have to be 100% perfect in order to do something that really needs to be done. The problem interested Robinson, and he posted his thoughts about trying to fix the problems in the Graham approach, including adding an actual Bayesian element to the calculations."
kill 'em. might = right
Sudden change of agenda?
block all, and then let only what you KNOW you need in. it's the only method that will ever work right.
Of course, the 1% of non-spam that accidentally gets filtered out is just collateral damage (except it's normally something really important like a tin of processed peas or something).
I'm going to sit down now and take some more HGH.
Never email donotemail@WeAreSpammers.com
Can said "filter" filter out non-pr0n spam, while keeping the sweet sweet pr0n spam?
Spammers have to make money, too. Is it so hard to click on a link or two a day to help put food into the mouths of the man's children? Who are you, Scrooge? Help the man feed his kids this Thanksgiving.
Someone came up with this idea recently, and I like it, so I've been repeating it. Instead of outlawing spam, which I would love if it worked (but it won't), require spammers to indicate the nature of the email (anonymous, commercial) with a word or such in the subject line, which can then be filtered by individual recipients according to their desires. It would not be as free-speech-limiting as banning spam, and spam would die out due to ineffectiveness once most everyone filtered it.
I originally coined the term "Bayesian Spam" to describe my Bay of Pigs / Asian conspiracy theory.
Best Windows Freeware
Why is such a simple problem, one that pisses off 99.9% of the population, so hard to manage on a global scale? I mean, EVERYONE is pissed off at getting spammed, and everyone would LOVE legislation to sodomize local spammers with a baseball bat. Overseas spam is a different problem, but country/continent-wide spam is half of my problem and could easily be taken care of with proper legislation. For once a restrictive piece of legislation would get 99% support... you don't see that every day. Like I mentioned before, I don't get our politicians: they say they work for us, they try to find clever ways to tax us, remove control that we used to have and all, but on something where they would get unprecedented support, they are simply sitting on the issue...
Until politicians get fed up and people actually get SUED for spamming (for once you could have a good reason to sue real bad guys), nothing will change.
Yes, I know it's beginning in SOME states, so for local spam I think legislation will make its way through in a few years, and we'll be able to look in our mailboxes and stop having TD Waterhouse spamming us when we already have an account with them, etc.
The other problem now is overseas spamming, especially coming from China/Taiwan. I mean... I don't read Chinese, I don't plan on buying that #.#" something from overseas, so why do they spam us like that? I don't get it, but I'd be all for passive euthanasia (i.e. ban their IPs at the router level), and if this is bad for business or relations or whatever, well, MAYBE they will do something about it.
Here where I work, it's simple: one spam, and I ban a whole class straight off the servers. If one day I get a call because someone couldn't reach us (if they really need to reach us, we have a phone anyway!), I'll be sure to tell him why. Too bad this is not happening at the backbone level, because some people would get their act together fast and apply legislation globally.
--- Metamoderating abusive downgraders since my 300th post.
The timing of this article seems impeccable, since I am myself trying to learn about Bayesian Statistics.
I am a Computer Science student studying Computational Biology (more specifically, Sequence Alignments) and while I have a bit of background on Classical Statistics, I was (and still am) completely ignorant about Bayesian Statistics.
It is only now, while trying to learn about Hidden Markov Models and their applications to Sequence Alignment, that I finally decided to learn the basic hypotheses of Bayesian Statistics and how they differ from the hypotheses made by Classical Statistics.
During my searches for finding introductory material on Bayesian Statistics, I found this course page which has some nice introductory notes, including Bayesian Statistics.
I hope that other people find this resource as useful as I did.
I'd like to hear about modifications to this system. I removed Graham's doubling of "good" word frequencies, and I trained my filter using digrams. I also tried all the various methods supplied by the program "rainbow", with good results, but the implementation was too slow and clunky to place in the middle of my email delivery system. What are other possible modifications?
...is in the eating. I think the same applies to spam. Paul showed, to his satisfaction, that the technique he used worked for his samples. Gary proposes some changes that would improve the filter's accuracy, but does not test these theories.
:) but it would be interesting to see whether what looks convincing in theory pays off in practice.
We will now have many slashdot posts saying "I've not tested this but I think A (or B, or C, or X)"
Here's where the scientific method comes into its own. Anyone who cares enough can actually test and post their results. I'd be interested in seeing what they look like. I don't have a database of spam to test against (and please don't volunteer to sign me up for some
development.lombardi.com
I have some tricks for Hotmail users who cannot benefit from the technique above: ..... and your own email address userid.
:-)
Filter any message without the @ in the address.
Filter Britney, Boobs, Penis, Inches, WIN, ___
Now you only have about 40 spams a day to deal with instead of 100.
Uncheck your information from being in the MSN directory too.
Enjoy
John
Saskboy's blog is good. 9 out of 10 dentists agree.
It's good that work is being done to make a good weighted spam filter.
It's funny how bad the standard Microsoft spam filter is (the one present in outlook). It's simply a word lookup, where if the word is present the message is marked as spam. It looks for things like "for free?". You can see the full list here, near the bottom. It's a little old, but not outdated (I think you can upgrade your spam filters, but I tested these, and the ones I tested work).
The adult filter isn't any better.
"Probably the toughest time in anyone's life is when you have to murder a loved one because they're the devil." -Philips
P (This is spam) = P (This is Spam | It will enlarge my penis) * P (It will enlarge my penis)
Now, given that I have prior knowledge that:
P (It will enlarge my penis)
is very low,
and given that, having never encountered anything which enlarges my penis in any permanent way, I have no knowledge of
P (This is Spam | It will enlarge my penis)
and we have the product of one probability which I know is low, and another of which I have no posterior knowledge, so we conclude that P (It is Spam) is also low, and that I must have requested more information on their new penile enlargement technique.
So, that message goes into the keepers.
Meanwhile,
P (It is Spam) = P (It is Spam | Frank is getting married) * P (Frank is getting married)
So, I know Frank is getting married, since he sent me this e-mail I'm considering filtering as Spam, and whether or not it is spam is pretty much independent of whether or not Frank is getting married, so.... it's Spam. Away it goes.
P.S. I've deliberately made a hash of this for a joke. The actual rule is:
P (A & B) = P (A | B) * P (B)
The good and new comes from no quarter where it is looked for, and is always something different from what is expected.
Is this what the new Mail.app in Mac OS X 10.2 uses?
I, myself, am not sure, but the new Mail.app is smart and it does learn. After a week of "learning" it has correctly identified messages as spam more than 99 times out of 100.
Here is a suggestion for something that might make an impact on spammers: If I open my firewall, I see several attempts a day from people trying to get into my mail server. Of course, I don't have a mail server, but spammers are always looking for open relay points they can spam from. My suggestion: Give them a nice open relay server they can send mail to. Of course, you don't want to piss off your service provider by sending spam, and your upstream speed might limit you to sending less than you can receive, so rather than run a full mail server, let's modify some mail server code to just accept mail and send it to the bit bucket. Maybe we can even misconfigure existing code to do this with no programming changes.
No valid user will be affected, assuming you don't otherwise run a mail server. All that bandwidth you pay for can be used to receive e-mail from spammers before it ever goes out. Eventually their customers will see the response go from .1% to 0% and their business will dry up. This will impact spammers; blocking your own spam after it's been delivered will not.
This need not even impact your own bandwidth. You can run the server when you are done using your system (might make a nice screen saver: a black screen that just shows how many addresses were prevented from getting spammed). Or you can impose limits on bandwidth at a firewall or router, or even restrict hours of access.
If we set up enough different false open relay servers I think we could have a real impact on the spammers.
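For anyone who wants to see how small such a bit-bucket server could be, here is a minimal Python sketch of the idea, not a tested deployment. The names `SinkSession`, `SinkHandler` and the hostname `sink.example` are made up for illustration. The session object answers every SMTP command with a success code and drops the message body; the handler just wires it to a TCP socket.

```python
import socketserver


class SinkSession:
    """Tracks one SMTP conversation and discards every message body."""

    def __init__(self):
        self.in_data = False   # True while the client is sending a message body
        self.discarded = 0     # messages accepted and thrown away

    def feed(self, line):
        """Return the reply for one client line, or None if no reply is due."""
        if self.in_data:
            if line == ".":            # a lone dot ends the message body
                self.in_data = False
                self.discarded += 1
                return "250 OK"
            return None                # body line: silently dropped
        verb = line.split(" ", 1)[0].upper()
        if verb in ("HELO", "EHLO"):
            return "250 sink.example"  # hypothetical hostname
        if verb == "DATA":
            self.in_data = True
            return "354 End data with <CR><LF>.<CR><LF>"
        if verb == "QUIT":
            return "221 Bye"
        return "250 OK"                # MAIL FROM, RCPT TO, RSET, NOOP, ...


class SinkHandler(socketserver.StreamRequestHandler):
    """Speaks just enough SMTP over a socket to look like a relay."""

    def handle(self):
        session = SinkSession()
        self.wfile.write(b"220 sink.example ESMTP\r\n")
        for raw in self.rfile:
            reply = session.feed(raw.decode("latin-1").rstrip("\r\n"))
            if reply is not None:
                self.wfile.write(reply.encode("ascii") + b"\r\n")
            if reply == "221 Bye":
                break
```

To try it locally, something like `socketserver.TCPServer(("127.0.0.1", 2525), SinkHandler).serve_forever()` should work; port 2525 is an arbitrary unprivileged choice. Nothing is ever relayed, so a spammer who checks delivery with a test address of their own would notice.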
I'm an American. I love this country and the freedoms that we used to have.
At UCSD, Bob Boyer and I wrote a neural net spam filter. Neural Nets, as everyone knows, are not really like biological brains, but really just statistical engines similar to the approach the guy above claimed to do.
Our approach worked pretty well (95-97% accuracy), and we had to deal with the same issues that the above "Bayesian" approach did. I.e., weighing the neurons so that false positives occur much less frequently than false negatives, etc. We built it using data on spam collected from the UCI machine learning repository.
It ties in with procmail. I'm not really a windows guy, so if anyone knows how to put a filter between an IMAP server and Microsoft Outlook/Netscape Communicator, I'd be interested in hearing how it's done.
The README for it is at: http://www-cse.ucsd.edu/~wkerney/spamfilter.READM
And you can download it at:
http://www-cse.ucsd.edu/~wkerney/spamfilter.
-Bill Kerney
wkerney at ucsd.edu
SpamAssassin works great for me. It eats about 90% of my spam, you just hack up a little procmail file for it, and you're done.
With so many people using SpamAssassin these days, I can't see how this is a timely or newsworthy item. More like from the been-there-done-that-dept..
I want to delete my account but Slashdot doesn't allow it.
While I love everything there is to love about open source (code and ideas), I kind of worry when I read how successfully all these new Bayesian/Grahamian filtering techniques work.
Not being a coder or statistician myself, I'm left wondering if the spammers can exploit it for a workaround. Is there something "built in" to these filtering techniques that can be used by spammers to effectively circumvent them?
I hate to give any kind of credit to M$ but they patented the idea of using Bayesian analysis for spam filtering circa 1995. They even had it in one of their betas. However, the filters were tagging some of those fricking Blue Mountain greeting cards as spam (imagine that!) so Blue Mountain sued them on anti-competitive grounds and M$ pulled it. Blue Mountain wanted to have the spam filters universally pass Blue Mountain content but MS refused that on the grounds that if a user considers it spam then it is in fact spam to them (Hurray for the "bad guys"!). The lawsuit has been settled/dropped/died for reasons I don't know.
Anyway, I hear that the next version of MSN will have a Bayesian filter and that it will be introduced in an upcoming version of Outlook Express (no idea about Exchange and Outlook).
BTW I believe internally MS uses this technique for spam control and that they don't seem to have any spam problems.
Politics in the US is not about the will of the people; it is about the will of the corporations that have the money for lobbying their agenda. The politicians will continue to ignore the people unless the resistance from the people crosses a certain threshold (in this case, when people are bothered enough by spam to ignore other issues that the politician in question might be working on).
Hehe, sounds like fun. Maybe I can then capture all the e-mail addresses that get run through my fake mail server, and sell the list back to the spammers.
Hey, that is a really cool idea, I wonder if it can really work. It is a new idea to me, so if anyone knows if this is a joke, or a possibility, please let us know?
Then we need someone to develop some open source code that creates a dead end mail server on whoever installs the program. They should be able to set how much spam their server eats in a night, rated to bandwidth usage. I'd run it as a screensaver.
Sure, spam is a big problem, but right now only 10-20% of my emails are spam, and most are easily identifiable by subject.
On the other hand, I get hundreds of emails every few days covering a range of topics, which need to be manually sorted into folders.
What I'd like to see, and I suspect I'm not alone here, is similar software that can sort email into any number of categories, not just spam and non-spam.
For example, if I have an email folder called 'fishing', containing emails from fishing buddies, then next time I get an email containing references to 'casting', 'trout' and 'it was *this* long', it should be sorted into that folder automatically.
I'd be curious to know if there's any existing software to do this, and if not, I'd be tempted to have a go at knocking something up to do this.
One tricky bit would be how to integrate it with the email client. I'd imagine that users wouldn't want to switch away from Outlook/Mozilla/Mutt/Whatever merely for this feature, so it would have to be client-agnostic.
I'm thinking that implementing a simple IMAP server would be the easiest option since this allows for server-side folder management. It would then be a case of maintaining word counts (Bayesian or otherwise) for each folder, and classifying mail accordingly.
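The per-folder word counts could be sketched roughly like this in Python. This is only an illustration under my own assumptions: the class name `FolderClassifier` is hypothetical, the scoring is plain multinomial naive Bayes with add-one smoothing (one reasonable choice among several), and the IMAP side is left out entirely.

```python
import math
import re
from collections import Counter, defaultdict


class FolderClassifier:
    """Keeps word counts per folder and picks the most likely folder."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # folder -> word -> count
        self.msg_counts = Counter()              # folder -> messages seen

    @staticmethod
    def tokens(text):
        # Lowercased runs of letters and apostrophes; a deliberately crude tokenizer.
        return re.findall(r"[a-z']+", text.lower())

    def train(self, folder, text):
        self.word_counts[folder].update(self.tokens(text))
        self.msg_counts[folder] += 1

    def classify(self, text):
        vocab = set()
        for counts in self.word_counts.values():
            vocab.update(counts)
        total_msgs = sum(self.msg_counts.values())
        words = self.tokens(text)
        best_folder, best_score = None, float("-inf")
        for folder, counts in self.word_counts.items():
            # log prior: how much of the training mail went to this folder
            score = math.log(self.msg_counts[folder] / total_msgs)
            # add-one smoothing; the extra +1 reserves a slot for unseen words
            denom = sum(counts.values()) + len(vocab) + 1
            for word in words:
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_folder, best_score = folder, score
        return best_folder
```

Training would happen whenever the user files a message by hand (`train("fishing", body)`), and `classify(body)` would suggest a folder for new mail. A real version would probably want a confidence cutoff so that uncertain mail stays in the inbox.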
Anyone else had any thoughts along these lines?
Gamingmuseum.com: Give your 3D accelerator a rest.
For once a restrictive piece of legislation would get 99% support... you don't see that every day. Like I mentioned before, I don't get our politicians: they say they work for us, they try to find clever ways to tax us, remove control that we used to have and all, but on something where they would get unprecedented support, they are simply sitting on the issue...
Perhaps the problem is that the law would gain them fewer votes than a few hundred thousand dollars in campaign financing would. A large portion of the population isn't online, and a large portion of those who are don't care about spam, so your politician doesn't care either.
Since this is such a trivial technical problem to solve, it's not really a big deal either way. Every day I reduce 800 spam messages to the five or six that make it through to my inbox just using procmail scoring, and I haven't had a false positive in years. I spend five minutes updating my procmailsc every six months to keep it effective. I suppose that I could use an automated system to generate my score file similar to what Paul Graham described, but when I only spend ten minutes a year updating my rules, it's going to be a lot of years before writing all that code pays off. No need for sweeping legislation.
Ah, all they have to do is say something about restricting free speech and all the angry ballbats go limp. Spamassassin: works for me.
try { do() || do_not(); } catch (JediException err) { yoda(err); }
While in many respects I agree that "There oughta be a Law" against spam, there are some problems with that approach. Not the least is that a social solution is generally much better (or at least has fewer side effects) than any law that a government will enact.
Laws have the distinct problem of either going too far (false positive) or being too weak and thereby legitimizing the spam that would manage to work through the loopholes. Taken to the extreme that seems to commonly occur in the US legal system, I can envision spammers suing ISPs for blacklisting their "legit per US act ####" spam.
I would much rather see statistical methods such as the ones being discussed. These, combined with "whitelist" methods, seem to work very well by all accounts.
McFly777
- - -
"What do people mean when they say the computer went down on them?" -Marilyn Pittman
This need not even impact your own bandwidth.
Last week (I can't find the article yet), Slashdot had a link to a column by someone who was (in his opinion) unjustly blacklisted for hosting an easily-accessible mail server. The moment his name hit that blacklist, he became a target for what may as well be every spammer on the planet. Even though he didn't actually have an open relay (just an easily-guessed password), the incoming traffic from so many e-mail spammers effectively brought his server to its knees. Changing his domain name and IP address was the only cure.
Building a "honeypot" mail server for spammers is appealing, but could be more trouble than it's worth, especially since it's more or less irreversible. I'd advise against it.
I don't get it. Simply allow incoming email only from user names you know. Period.
Why is this hard to understand?
I've noticed in the past 2-3 weeks that the look of the spam I've received is a lot more like regular mail.
eg:
---
carpet
Your home refinance loan is approved!
To get your approved amount go here.
To be excluded from further notices go here.
carpet 5gate 1932zIgl2
---
It's still identifiable as spam with a probability filter, but it's not that far removed from a legitimate mail an AOL dork might send or receive. (not that I care about them getting spammed!)
I have implemented Paul Graham's algorithm at my corporation, and it is blocking 90-97% of our spam each day. It is "good stuff". Combine that with Razor v2 and some other filtering I do, and nary a spam gets thru.
I'm fairly sure a false relay won't work. Just like snail mail list sellers, the spammers salt their victim lists with their own valid addresses that they can check to see if the message is getting out.
BUT, an early spam filter at an ISP worked just like that. The design parameters were 1) that spam filtering require no more resources than actual delivery of the message, and 2) that the filter give no indication to the spammer that the message was not going to be delivered. This gives the spammer no feedback and forces THEM to waste CPU cycles, which will slow them down.
Ever dream you could fly? Get up from the Flight Sim. I Fly
... what exactly bayesian means?
I realized one day that filtering spam out by content is a futile exercise. I use a simple method that has worked perfectly: if the FROM address of an incoming message is not in my contact list, the message is trashed. Before emptying the trash, I glance through it to be sure that I didn't receive a legitimate message from someone not in my list. Since I started using this, not one spam has appeared in my Inbox. This is important since I use mobile devices and other strange ways to access my email that would be very sensitive to spam overload.
Fact is, 99.999% of the email I receive is either 1.) from people already on my contact list, or 2.) from people who inform me they're going to send an email. Before I give out my address, I inform them that I need to know their address first, and add it to my contact list. If someone gets my email address from someone other than me, or otherwise didn't talk to me first, I probably don't want their email anyway. And if it's important, they'll get in touch with me.
I'm using Outlook for this solution, with a rule that moves all messages that don't meet this criterion out of the Inbox. I plan to switch to Evolution soon under Mandrake and I'm sure I can program a similar function. It's much easier to spot 1 message from a legitimate sender out of 100 spams (it takes only a few seconds, in fact) than to manually delete spams or constantly fiddle with filters. Each day, I glance at the list of 100-200 spams that have collected in my trash box, and within a few seconds I can spot whether someone I know who isn't in my list has sent me something. From that point forward, they're in my contact list, and it never happens again.
At some point I plan to set up an auto-reply system that gives people a URL they can visit to "ask for permission" to send me email. Spammers won't use it. I haven't bothered yet because I'll need to design this carefully to prevent my address from being "confirmed" by spammers as a result of this message, but I have ideas for that (send from a null account, use a picture of my email address in the message, with instructions on how to ask permission). At that point, I can safely instant-trash all unrecognized senders.
I'd love some feedback on this method. It's worked great for me, though admittedly it won't work for those who receive many emails from new contacts, such as someone who publishes (eek!) their address on a site to invite new messages.
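For those not on Outlook, the contact-list rule can be sketched in a few lines of Python. This is a rough illustration, not the poster's actual setup: the function name `route` and the folder names are assumptions, and the contact set is expected to hold lowercase addresses.

```python
from email import message_from_string
from email.utils import parseaddr


def route(raw_message, contacts):
    """Return "Inbox" if the From address is a known contact, else "Trash".

    `contacts` is a set of lowercase email addresses (an assumed format).
    """
    msg = message_from_string(raw_message)
    # parseaddr splits 'Alice <alice@example.com>' into (name, address)
    _, addr = parseaddr(msg.get("From", ""))
    return "Inbox" if addr.lower() in contacts else "Trash"
```

Keep in mind that the From header is set by the sender and can be forged, so this is only as strong as the spammer's ignorance of who is on the list.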
# Erik
sites like yahoo, hotmail, etc are in a unique position to rid their users of spam.
i don't see why they cant implement some system that scans incoming mail for its users' mailboxes, maybe does a checksum for each message or something, and if it finds that a number of its users are receiving exactly (or nearly exactly) the same message, assume it's spam. nuke the messages, and any new incoming ones.
yeah, if such a system only scans a small number of mailboxes, it may filter out mailing list posts and so on. but it gets more and more reliable the higher number of mailboxes it tracks.
this avoids searching for certain keywords and eliminates false positives. after all, how well would these keyword searching methods do if i were to quote a spam message in an email to a friend?
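a rough python sketch of the checksum idea, under my own assumptions: the class name `BulkDetector` and the threshold of 3 mailboxes are made up, and the fingerprint only ignores case and whitespace, so it catches exact copies but not the "nearly exactly the same" case (that would need some kind of fuzzy hash, which is not shown here).

```python
import hashlib
import re
from collections import defaultdict


class BulkDetector:
    """Flags a message body once enough distinct mailboxes have received it."""

    def __init__(self, threshold=3):
        self.threshold = threshold        # distinct recipients before we call it bulk
        self.seen = defaultdict(set)      # digest -> set of recipient mailboxes

    @staticmethod
    def fingerprint(body):
        # Collapse whitespace and case so trivial reformatting hashes the same.
        normalized = re.sub(r"\s+", " ", body.lower()).strip()
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def is_bulk(self, recipient, body):
        digest = self.fingerprint(body)
        self.seen[digest].add(recipient)
        return len(self.seen[digest]) >= self.threshold
```

counting distinct recipients instead of raw copies means one user getting the same message twice doesn't trip it. spam that appends random junk to each copy would slip past an exact hash like this one, and a big mailing list would trip it unless its senders are exempted.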
I think the original poster's point would be to make commercial e-mail illegal unless properly tagged. That way an untagged spam could be handed over to the FBI and treated like wire-fraud or something.
Big problem would be prosecuting the spammer. Either they would all move overseas or the court would be so backlogged as to become ineffective.
McFly777
What you call SPAM I call creative marketing, besides someone has to get this economy going?
I'm not sure why this particular article needed to be posted, as it's just one of several alternative approaches and an untested one at that. On Paul's page, he also lists several published academic papers with other alternatives -- all actually tested, of course.
Gary is basically right in questioning the use of the word "Bayesian". Paul's approach is more about weighing "evidence" as given by the appearance of certain words, rather than figuring out the probability of spam assuming a "prior". See Paul's explanation, but if you check the article he references at the end, you'll note that the method Paul uses is only one of several methods to solve an underspecified problem. It's a reasonable guess, not necessarily the only guess.
Looking at another article Paul references, given the word independence assumption, the more formal Naive Bayesian approach calculates as follows:
p(spam|word1,...,wordn) = [ p(spam)*p(word1|spam)*...*p(wordn|spam) ] / [ p(spam)*p(word1|spam)*...*p(wordn|spam) + p(!spam)*p(word1|!spam)*...*p(wordn|!spam) ]
This is similar to Paul's approach except for including a "prior" assumption of p(spam) -- the expected probability of any email being spam, calculated from the historically observed frequency of spam. By leaving it out, Paul implicitly assumes that 50% of mail is spam -- that's his "prior" estimate of the spam rate. Given the other adjustments he makes to his sample, that appears to be acceptable in practice. (Paul overweights the spam prior, but also overweights the effects of "good" words.)
I'd personally prefer to overweight the "good" e-mails entirely rather than just put a "good-multiplier" on them like Paul does, but that's just quibbling over small bits.
As to the bit that Gary raises about Paul assuming a spam probability for an unknown word -- Paul originally said .2, then revised to .4, but really should have put it at .5 or just excluded it from all calculations. A new word has no robustness as a predictor (which is why Paul dropped words that didn't appear five times anyway). In practice, a new word at .4 isn't going to be among the 15 most interesting words to make the calculation from, anyway.
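To make the Naive Bayesian formula concrete, here is a small Python sketch of it. The function name `spam_posterior` and the example probabilities are made up for illustration; the arithmetic is just the formula as written, done in log space so that long messages don't underflow.

```python
import math


def spam_posterior(word_probs, prior=0.5):
    """Naive Bayes posterior probability that a message is spam.

    word_probs: list of (p(word|spam), p(word|!spam)) pairs, each in (0, 1].
    prior:      p(spam), the assumed fraction of all mail that is spam.
    """
    log_spam = math.log(prior)
    log_ham = math.log(1.0 - prior)
    for p_word_spam, p_word_ham in word_probs:
        log_spam += math.log(p_word_spam)
        log_ham += math.log(p_word_ham)
    # Subtract the larger log before exponentiating to avoid underflow.
    m = max(log_spam, log_ham)
    spam = math.exp(log_spam - m)
    ham = math.exp(log_ham - m)
    return spam / (spam + ham)
```

With two words at (0.8, 0.2) and (0.9, 0.1), a prior of 0.5 gives 0.36 / 0.37, about 0.973, while a prior of 0.2 gives 0.144 / 0.160 = 0.9. At a prior of 0.5 the p(spam) and p(!spam) terms cancel, which is the sense in which leaving the prior out amounts to assuming half of all mail is spam.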
-XDG
A Honeypot for spammers? Sounds like an idea whose time has come.
The problems are many, but the outcome would be fantastic. Create a Mail-dev/null program which looks like a "real" system and make it hackable. Keep the same doors the spammers would normally use. Make said program freely available to anyone and everyone. Make it that much more difficult for Spammers to find a working program to hack.
Knowledge is of two kinds. We know a subject ourselves, or we know where we can find information upon it. -Samuel Johns
Right. Try that one again after your non-100% effective filter starts filtering out business e-mails. Then where'll ya be? nowhere.
AI people have absolutely no common sense. It's been proven by my neural net.
"Oh, you hate your job? There's a support group for that, it's called everyone, they meet at the bar."
ban their IP at router level
Oi, remind me to start running when you consider *active* euthanasia
beauty is only a light switch away
Why not try it? The problem the guy had last week was that he did this on his home box that he used for other stuff (specifically, some mail-related stuff).
So when he was blacklisted, his legitimate work was affected.
There is nothing inherently wrong with running a honeypot mail-server. Just do it somewhere that isn't going to screw you when it shows up in ORBZ.
(In fact, you could set up one server that acted as a honey-pot, and publish all the IPs of the spammers who try and connect to it. Other servers could use those IPs to block access at a lower level, without the risk of running their own honey-pots.)
Tuus crepidae innexilis sunt.
There was extensive discussion of Graham's spam filtering algorithm and potential improvements on comp.lang.python in mid-to-late August. Check Google Groups for the subjects "Lisp to Python translation criticism?" and "Graham's spam filter."
I own my own domain, which makes it easier, but we really need a system designed to filter. And make it easier. This is my uninformed proposal. Perhaps it won't work, but it seems something is needed.
People should have a private/public e-mail address. They should all go the same "account" and be part of the basic plan for any e-mail user.
privateauthentication~myemail@myhost.com
I know this is important and relevant
publicauthentication~myemail@myhost.com
I gave this person my e-mail address
myemail@myhost.com will go into the crap bin and be deleted eventually. Perhaps some program could be used to alert users of possible important mail pieces there.
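A rough Python sketch of how a mail server might sort on such tagged addresses. Everything here is an assumption for illustration: the function name `classify_recipient`, the bin names, and the idea that the tag is simply whatever precedes the `~` in the local part.

```python
def classify_recipient(rcpt, private_tag, public_tag):
    """Sort a recipient address like 'tag~myemail@myhost.com' into a bin."""
    local, _, _domain = rcpt.partition("@")
    tag, sep, _mailbox = local.partition("~")
    if not sep:
        return "crap bin"      # bare address, no authentication tag at all
    if tag == private_tag:
        return "private"       # known to be important and relevant
    if tag == public_tag:
        return "public"        # address was handed out deliberately
    return "crap bin"          # unknown or retired tag
```

Changing a tag after it leaks is then just a matter of passing a new `private_tag` or `public_tag`; mail sent to the old one falls through to the crap bin.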
Then we could also have some system to CHANGE the private authentication or public authentication that is form based. I.e. This address has been disconnected. Please apply for the new password.
So close and yet so far from the world's perfect ID number
It seems to me a countermeasure spammers might try is including a dictionary with their spam. Since filters are for sure going to be conservative and avoid false positives, they'll latch onto "good" words from the dictionary and ignore "bad" words from the spam.
There I was on vacation, wondering what to do with my free time, and a spam popped into my inbox. I remembered the article about Graham's statistical technique, which seemed a lot more interesting than an arbitrary keyword list or a set of ad-hoc rules, so I decided to write an anti-spam program. Vacation accomplished.
After a couple of weeks I've built up a big enough spambase that Graham's algorithm is pretty close to 100% effective (and no false positives at all).
However, I did run into one problem: Some particularly devious spammers are base64 encoding their email so that it can't be scanned by programs like this. (I can't think of any other reason why they're using base64 encoding for text/plain or text/html messages.)
After I added code to check the email header and decode the message body it worked much better.
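For anyone hitting the same base64 problem, this is roughly what the decoding step can look like with Python's standard `email` package. The function name `text_parts` is mine, and this is a sketch rather than the code I actually run.

```python
from email import message_from_string


def text_parts(raw_message):
    """Return the decoded text of every text/* part of a raw message."""
    msg = message_from_string(raw_message)
    parts = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue                        # skip multipart containers, images, ...
        # decode=True undoes the Content-Transfer-Encoding (base64, quoted-printable)
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "latin-1"
        try:
            parts.append(payload.decode(charset, errors="replace"))
        except LookupError:                 # charset name Python doesn't know
            parts.append(payload.decode("latin-1"))
    return parts
```

The strings it returns can then be tokenized and scored like any plain-text message.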
Apple's new spam detector works amazingly well for me. After some initial jitters it pretty much never gets false positives these days.
why are we even considering this method when microsoft has a trademark on it? nothing can be done.. they have a lock on it. trademark here
I don't actually see the point in putting emails into different folders, if you have that feature.
-WolfWithoutAClause
"Gravity is only a theory, not a fact!"
Congratulations, you've just more or less reinvented the Tagged Message Delivery Agent (TMDA).
I think it might be interesting to apply AI methods in fighting spam, especially machine learning. For example, you could have a spam filter that is able to learn. You just show 100 spam mails to the filter program, then 100 non-spam mails, and the system "learns" what spam looks like.
Don't drink and su! antidisestablishmentariazationally
All it says in the help is that it is adaptive and trains itself on your previous spam. It would be nice to see some source... and be able to patch it if we don't like it.... oh well, whining won't get me anywhere.
I hereby place the above post in the public domain.
Let me start by saying I know very little about coding, otherwise I'd probably already be rushing off to a night of coding by the glow from my monitor.
When the first Bayesian spam filtering article was posted, I thought it was a great idea, and this article just reinforces that idea. However, it would be interesting to build some sort of Sendmail module (or whatever MTA you like), but add some additional functionality:
1. Option to return a 550 error if the message is determined to be spam: "550 Delivery blocked; Bayesian filter reports spam probability of nn%"
- Right before reporting this error, wait n seconds or alternately, slow connection to n bps for n minutes.
- After reporting the error, "deliver" the Subject and Body of the email to the spam words database.
2. Inclusion of a whitelist, by IP, reverse DNS, MAIL FROM address, or RCPT TO address, header To: address, header From: address, etc.
3. Configuration of account where spams can be forwarded to, for automatic addition to the database.
- Perhaps this could be combined with the blacklist/whitelist. For example, any emails to spamthis@antispamdomain.com are always added to the DB. The entry could be as follows (similar to the Sendmail access map):
spamthis@antispamdomain.com <tab> BAYESIAN:SILENT
- This would allow for either silent addition to the filter (sender thinks mail was delivered -- good for spam harvesting emails, or for users to send their spam to), or a more "vocal" addition much like item #1 above, where a 550 error is reported... eg, BAYESIAN:550 or perhaps BAYESIAN:REJECT
I realize this would block a lot of mail, but I have my Sendmail currently configured to actually block spam (or what it considers spam) and have had very few issues with valid messages bouncing. Obviously, results may vary, but I'm a firm believer in rejecting spam during the SMTP conversation, not accepting it and then deleting it silently.
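The decision logic in the wish list above could be sketched roughly like this in Python. This is not a Sendmail module, only the policy part: the function name `smtp_decision`, the 0.9 threshold, and the wording of the spam-trap reply are my assumptions, and the tarpit delay from item 1 is left out.

```python
def smtp_decision(rcpt_to, mail_from, spam_prob, access_map, whitelist, threshold=0.9):
    """Return (smtp_reply, add_to_database) for one incoming message."""
    action = access_map.get(rcpt_to.lower())
    if action == "BAYESIAN:SILENT":
        # Sender thinks the mail was delivered; the message only feeds the database
        return ("250 OK", True)
    if action == "BAYESIAN:550":
        # "Vocal" variant: reject and still learn (reply text is hypothetical)
        return ("550 Delivery blocked; address is a spam trap", True)
    if mail_from.lower() in whitelist:
        # Whitelisted senders skip the filter entirely
        return ("250 OK", False)
    if spam_prob >= threshold:
        pct = round(spam_prob * 100)
        reply = f"550 Delivery blocked; Bayesian filter reports spam probability of {pct}%"
        return (reply, True)
    return ("250 OK", False)
```

Here the access map is a plain dict keyed by recipient address, standing in for the tab-separated map file, and the whitelist is matched on MAIL FROM only. A real setup would also check IP, reverse DNS, and the header addresses from item 2.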
Does anyone else have any suggestions?
This whole methodology is already patented by Microsoft. ANY implementation not licensed by Microsoft is going to be a violation... And now that you know, it is treble damages...
patent 6,161,130
I doubt that there are many spammers out there who are not using all of their available bandwidth to send spam already, so I can't see how setting up dummy port 23's would make spam worse. Just the opposite: While this can be started by a few changes to an open source mail server, or maybe even by misconfiguring an existing mail server, it should grow and evolve. I think we can beat the spammers, but not just by being impressed by how well we can filter our own mail. Heck, as they add smarts, we could add smarts too. If we can identify the test messages with reasonable certainty we can elect to send them through. We could even build a nice P2P network of systems cooperating to stay one step ahead of the spammers.
Can anyone get us started on this? Provide some Windows and/or Linux code to start the roach motel e-mail server (spammers log in but they don't send out)? I'll get one running tonight if I can get a good dummy mail server for Windows (and just slightly longer to put the hardware together if I have to build up a Linux system).
I'm an American. I love this country and the freedoms that we used to have.
Most of the spam may be coming from overseas now, but at least in some of these countries it is far more likely that one could actually pass a law to sodomize the offender with a baseball bat.
I think it might be interesting to apply AI methods in fighting spam, especially machine learning. For example, you could have a spam filter that is able to learn. You just show 100 spam mails to the filter program, then 100 non-spam mails, and the system "learns" what spam looks like (maybe reinforcement learning?)
Don't drink and su! antidisestablishmentariazationally
Spam is a GLOBAL problem. There ARE no global laws. Do you think for one minute the Chinese ISPs (chinanet.cn) are going to refuse HARD US$... and not allow American and other international spammers to use their gateways? THINK AGAIN.... the ONLY way to fight spam is to make it so expensive for the spammers that they will use other means to push their "penis enlargement" crap.
Of course not everyone has the skills to track down and identify the spammers, but one can certainly have a lot of fun harassing them.
If you can identify the spammer and get a mailing address for them (very hard to do), then send them an invoice for the time you take in reading and reporting it. Kindly remind them that if they don't pay up by the deadline, you'll take them to collections.
Now if EVERYONE did that (wishful thinking) then spammers might die or go away. Especially if everyone they spammed, would take them to small claims court demanding they pay for your time in reading their smut.
I've been told that SOME people have actually been paid.... BOY!!! What lamers...
He said he is only receiving 5-10 spams/hr. Let's try and knock that up a few levels, trolls.
Target: grobinson@transpose.com
Paul's article lists a few of the Bayesian spam filters, but here's a short list of the ones I've tried:
Gary Arnold's bayespam is implemented in Perl and geared towards qmail using maildir storage.
Brian Burton's spamprobe, written in C++, tries to remember already-seen messages, so that you can dump your spams/good mails on separate folders, have spamprobe learn from them, and delete them afterwards. Spamprobe remembers which ones it already processed, and won't reprocess a message if it's already seen it.
Eric Raymond's bogofilter is a typical ESR tool: concise, with a baroquely written man page, and quite simplistic, but does its job and does it well. ESR even uses some funny terms, like "spamicity", and "ham" (the opposite of spam). I don't like its dependency on the Judy libraries for dynamic arrays but what the heck.
Matthew Walker's BayesSpam plugin for SquirrelMail provides SquirrelMail users with Bayesian spam filtering capabilities, no longer restricting use of the technique to those with access to procmail/mailfilter systems.
One person's spam is another person's 'useful email'. For instance, I may want a particular type of email (eg: a pr0n mailing list, or a "George Foreman Grill" user group, or lots of Korean friends). It might be considered spam by the ISP's filters, but not by me.
That's why it's best to train _my_ filter against _my_ received mail.
And as more email gets received and I add the uncaught messages to the spam filter, my filter 'learns' what I consider spam.
My father is a blogger.
So, if they own the damn thing, why can't they sit down and make a real implementation of it for Hotmail? I'm sure everyone involved would be happier.
What we call folk wisdom is often no more than a kind of expedient stupidity.-Edward Abbey
If the spammers have to jack a 75k file onto the end of their spams, suddenly they are sending 75 GB of data per spam run. This is about as stealthy as dancing naked on the piano in the middle of a wedding reception.
Also, it would only work once- the first dictionary spam I got would be marked spam and then all the junk words would get marked in the list.
What we call folk wisdom is often no more than a kind of expedient stupidity.-Edward Abbey
It can use the filter to get any result you want, not just a binary trash/don't trash.
It puts an "X-Text-Classification" header in mails you get saying what category it determined, so that you can just write simple filter rules in whatever program you use to sort it all.
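If your mail program can't filter on arbitrary headers, the same sorting is easy to script. A small Python sketch, assuming the header carries a single category name; the function name `folder_for`, the rule table, and the folder names are made up for illustration.

```python
from email import message_from_string

def folder_for(raw_message, rules, default="INBOX"):
    """Pick a folder from the X-Text-Classification header of a raw message."""
    msg = message_from_string(raw_message)
    # A missing header falls through to the default folder
    category = (msg.get("X-Text-Classification") or "").strip().lower()
    return rules.get(category, default)
```

For example, `folder_for(raw, {"spam": "Junk", "work": "Work"})` files anything classified as spam under Junk and leaves unclassified mail in the inbox.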
What we call folk wisdom is often no more than a kind of expedient stupidity.-Edward Abbey
So, if someone sends you mail about which UDP ports to unblock on a firewall to play a game, you've just lost communication.
Single word "zero-tolerance" rules are unwise, to say the least.
What we call folk wisdom is often no more than a kind of expedient stupidity.-Edward Abbey
I have my server set up to send a password to a user who sends me email. They must send the password back in an email. The password is a one-time password. Been using it for 2 months, works great.... (I monitor the discarded mail, yes I'm paranoid, and I've even had SPAMMERs bitch to me about wasting their bandwidth.... the script is set up to adapt to continual spam from a server by forwarding emails to people like root@domain, spam@domain, remove@domain... etc.)
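A rough Python sketch of how such a one-time-password gate might work, for those who want to try the idea. This is not the poster's script. The class name `ChallengeGate`, the return values, and the assumption that a sender stays approved after one correct reply are all mine.

```python
import secrets

class ChallengeGate:
    def __init__(self):
        self.pending = {}      # sender -> outstanding one-time password
        self.approved = set()  # senders who have answered a challenge

    def on_mail(self, sender, body):
        sender = sender.lower()
        if sender in self.approved:
            return "deliver"
        otp = self.pending.get(sender)
        if otp is not None and otp in body:
            # Correct password came back: use it up and approve the sender
            del self.pending[sender]
            self.approved.add(sender)
            return "approved"
        # Unknown sender or wrong reply: issue a fresh one-time password
        self.pending[sender] = secrets.token_hex(8)
        return "challenge"
```

On "challenge" the server would mail `pending[sender]` back to the sender and hold or discard the original message. Autoresponders and mailing lists never answer, so they would need a separate whitelist.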
You can download the source here if you like.
It's not from the same guy, but it's definitely derivative work.
I have no problem with your religion until you decide it's reason to deprive others of the truth.
This is basically an "ask slashdot" question.
I have my POP mail hosted by my ISP. I usually check my mail from my Windows box. I'd like to configure my Linux box to periodically pull the POP3 mail from the server, spam-filter it, and then act as a "local" POP server that I'd just point my Windows Eudora at.
Anyone have an easy (relatively speaking) means of doing this? Seems like each of the 3 parts (Getting mail from ISP, filter, and being a POP server) are trivial, but anything out there that would do all this or pieces that play well together?
I'm not keen on trying to deal with SMTP right now. My internet connection is a little too flaky for that...
Thanks for any ideas.
In the Portland, Ore area and like card games? Check out: http://groups.yahoo.com/group/portlandgames/
I implemented both SpamAssassin and ifile one month ago.
Results
Caught by            Messages  Share
Both                      787    62%
SpamAssassin only         385    30%
Ifile only                 62     5%
Missed                     29     2%
False positives             0     0%
I'm fairly happy with these results. I see about 1 spam message a day.
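To read per-tool catch rates off those results, a quick Python check helps. It assumes the four spam rows are disjoint counts of spam messages, which is how I read the table. The variable names are just for illustration.

```python
counts = {"both": 787, "spamassassin_only": 385, "ifile_only": 62, "missed": 29}
total = sum(counts.values())  # all spam received in the month

def pct(n):
    # Share of all spam, rounded to one decimal place
    return round(100 * n / total, 1)

spamassassin_rate = pct(counts["both"] + counts["spamassassin_only"])
ifile_rate = pct(counts["both"] + counts["ifile_only"])
combined_rate = pct(total - counts["missed"])

print(total, spamassassin_rate, ifile_rate, combined_rate)
# prints: 1263 92.8 67.2 97.7
```

So under that reading, running both catches noticeably more than either alone, and 29 misses in a month matches the "about 1 spam message a day" figure.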
The problem is that they're using spam they get today as sample input to their algorithms. This won't work because spammers will simply tailor their prose to fit the filter.
This would be as simple as taking the filter and repeatedly hitting it with the text of your spam. If the filter catches it, then tweak the words a bit, and iterate until the filter lets the spam pass. Now you have a spam that will pass through most people's filters.
Alternatively, you do the same as above, but instead of changing the words, change how you send them, but so that they still get rendered on a browser correctly - i.e., encode the text so the filter can't defilter it. For example, send the spam as text in a jpg image.
So, a prior article described a method of spam detection which claimed to use something like Bayesian methods, and now we read that it didn't. Sounds like just another case of ...
Bayesian Mimicry
(Don't clap, just throw money.)
Considering this is Slashdot, par for the course.
Hey Taco! Looks like you're using the "infinite monkeys and typewriters" scheme to generate Ask Slashdots again...
I wonder if this technique could be modified to spot trolls. Not too likely I guess, it'd have to be able to tell relevance to a topic.
IIUC, The proposed method normalizes (with Ln norm) over the number of words, for "spammishness" and "unspammishness" of words, combining the results.
What's stopping the spammers from attaching, say, a random scientific article longer than the spam at the end of the spam message? This will give the spam a better (less spammy) grade in these Bayesian methods in general, but more so with his normalizing metric.
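To put numbers on that worry, here is a Python sketch of the combining step as I understand it from Robinson's write-up: P and Q are built from nth roots (geometric means) of the per-word probabilities, and S = (P - Q) / (P + Q) runs from -1 (ham) to +1 (spam). The word probabilities below are invented purely for illustration.

```python
import math

def robinson_s(probs):
    """Combine per-word spam probabilities (each strictly between 0 and 1)."""
    n = len(probs)
    # nth roots of the products, computed through logs to avoid underflow
    P = 1 - math.exp(sum(math.log(1 - p) for p in probs) / n)  # "spamminess"
    Q = 1 - math.exp(sum(math.log(p) for p in probs) / n)      # "hamminess"
    return (P - Q) / (P + Q)

spam_words = [0.99] * 10           # ten very spammy words (made-up values)
padding = [0.2] * 40               # forty innocent-looking appended words
plain = robinson_s(spam_words)
padded = robinson_s(spam_words + padding)
```

With these made-up numbers the plain spam scores about 0.98 and the padded one lands slightly below zero, so in this toy case the appended text does drag the score down. Whether real padding words would come out that innocent against a personally trained filter is a separate question.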
Working for necessity's mother.
Lots of implementations are mentioned in this thread, but does anyone know of an implementation for the most widely used e-mail clients under Linux/BSD: KMail, Evolution and Mozilla?
TIA for any links.
Bye egghat.
-- "As a human being I claim the right to be widely inconsistent", John Peel
Of classic "probabilistic searching" from the field of information retrieval. Here's a typical tutorial. You can feed keywords from this to Google to find more if you want to.
The application to spam filtering is trivial. Simply take a document set (your inbox for a month), identify the spam set (manually) and the algorithm will generate term weightings for you.
Then apply these term weightings to previous unclassified records (emails) and BINGO!
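As a concrete illustration, one common form of term weighting from probabilistic retrieval is the Robertson/Sparck Jones relevance weight, with "relevant" read as "spam". The Python sketch below assumes that formula with the usual 0.5 corrections. The function names and the idea of summing weights into a score are my choices, not something taken from the tutorial.

```python
import math
from collections import Counter

def term_weights(docs, spam_flags):
    """docs: list of word lists; spam_flags: parallel list of booleans."""
    N = len(docs)          # all documents
    R = sum(spam_flags)    # documents marked as spam
    n = Counter()          # documents containing each term
    r = Counter()          # spam documents containing each term
    for words, is_spam in zip(docs, spam_flags):
        for w in set(words):
            n[w] += 1
            if is_spam:
                r[w] += 1
    return {
        w: math.log(((r[w] + 0.5) / (R - r[w] + 0.5)) /
                    ((n[w] - r[w] + 0.5) / (N - n[w] - R + r[w] + 0.5)))
        for w in n
    }

def score(words, weights):
    # Unseen terms contribute nothing; a higher total means more spam-like
    return sum(weights.get(w, 0.0) for w in set(words))
```

Train `term_weights` on the hand-labelled month of mail, then rank unclassified messages by `score`. Where to put the cutoff is left to taste.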
BugBear
Ignorance is curable. Stupid is forever.
Andrea
That one seems quite interesting:
http://www.ai.mit.edu/~jrennie/ifile/
I've been using and testing bayespam (thanks Gary!) for the last week or so and am impressed by how accurate it is. Easy to install too. All of the other anti-spam tools (blackhole, SpamAssassin etc) are a complete nightmare to set up and configure. Obviously speed is important but I'm going to use bayespam on a case by case basis rather than filter all and any. If a user has problems, start filtering... Must remember to keep saving up that spam for my corpus. To me it doesn't matter if it *really* is Bayesian or not, it works. Hope someone sorts out a Mozilla setup too...
Do I sense Hotmail here?
I like the sound of this one - seems like it should eliminate all false positives sent by real people and all false negatives. I worry about auto-responders and auto-reminders, though. TMDA (Tagged Message Delivery Agent)
"that's not encryption - it's a new perl script that I'm working on..." - from some Matrix parody
"I'm fairly sure a false relay won't work. Just like snail mail list sellers, the spammers salt their victim lists with their own valid addresses that they can check to see if the message is getting out." MAYBE some do salt, but demonstrably some don't. As recently as 17 minutes ago one spammer sent relay spam to my (2 1/2 year old) honeypot. It isn't being delivered. If he salted the list with his own address (as you say he does) he'd have figured out the honeypot last week already. The Moscow honeypot trapped Ralsky spam from February to July. Not only did Ralsky not salt the addresses he ended up sending spam run statistics reports back to himself THROUGH THE HONEYPOT. The entire episode was one long cause for ROFL. I'll grant that there may be some smart spammers and smart spamware vendors. Please don't assume that this smartness prevails. It does not. Um. Now it's trapped relay spam as recently as 9 minutes ago - I took some time to compose this, he's still busy. 88 recipients on this one. He's going alphabeticallly, he's in the bobxxxx's right now.
Start here: http://fightrelayspam.homestead.com/ Also, Google for "corpit honeypot" and look at the cached page. Really wicked. A honeypot with a real-time log of the incoming spam on a web page. Send the URL to the abuse@ISP and watch the throwaway accounts drop like flies. Sadly, now most relay spam seems to come through open proxies so that doesn't work.
.. is Bayesspam 2.x for Squirrelmail. It's an easily installable plugin for a PHP-based webmail system, that uses MySQL to store the Bayesian corpus. It's also got options to limit the size of the messages to be filtered, and displays the spam probability and the 'mark as spam/nonspam' links in each email header.
Visit http://jackpot.uk.net to download it.
You also need a JVM, obviously.
This one's web page is even better than the cached page you'll see if you Google for "corpit honeypot" and look at the cached copy of the hit. You can examine any spam it has trapped.
I should have guessed that when you get 3-4 copies of the same spam it means that the spamming scumbags just get redundancy by spamming the entire 64,000,000 addresses repeatedly through different raped relays.
Your honeypot success gives me an idea. What if yours and other honeypots were used to cooperate to capture the spam to seed spam filters? Since EVERY message you process is spam, all of the words in it, or a hashed signature, could be sent out to these filter dictionaries so that ISPs will know the message you captured should be delivered to
What'dya think?
Ever dream you could fly? Get up from the Flight Sim. I Fly
Could work, should work. But there's already a service that captures spam using spamtraps that otherwise works almost exactly as you describe: DCC. It sends out fuzzy checksums, and I'm not the one to tell you how the fuzzy checksums are computed. As I recall there's a place on the web site where you can paste in a spam message and see if it would have been identified as "bulky" (DCC detects bulkiness rather than spamishness - it needs a whitelist for mailing list sources.)
See: http://www.rhyolite.com/anti-spam/dcc/
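For the hashed-signature variant of the honeypot idea, a bare-bones Python sketch might look like the following. To be clear, this is NOT how DCC computes its fuzzy checksums. It is an exact SHA-256 over whitespace-normalized, lowercased text, and the names `signature` and `SignaturePool` are invented here.

```python
import hashlib
import re

def signature(body):
    # Collapse whitespace and lowercase so trivial reformatting hashes the same
    normalized = re.sub(r"\s+", " ", body).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

class SignaturePool:
    """Shared set of signatures reported by cooperating honeypots."""
    def __init__(self):
        self.seen = set()

    def report(self, body):
        self.seen.add(signature(body))

    def is_known_spam(self, body):
        return signature(body) in self.seen
```

An exact hash like this breaks as soon as the spammer inserts a random word per recipient, which is presumably the reason a real service needs fuzzy checksums instead.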
This truly is an excellent idea.
1. it is a patent, not a trademark
2. just because someone has a patent doesn't mean the patent can't be challenged.
3. just because someone has a patent doesn't mean a patent will be enforced.
4. Some things are worth fighting for
I have attempted a quick implementation of these revised algorithms (at least the first two: S and f(w)) and the results are much less promising than the original article's algorithm.
Caught spam dropped from near 99% into the 70s and the false positives jumped from 1 in ~2000 to 10-20%.
Anyone else get similar results? Is it just my implementation?
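For comparing implementations, here is my reading of the f(w) step as a Python sketch. As I recall Robinson's write-up, it pulls a word's raw spam probability p(w) toward an assumed background value x, with strength s, when the word has only been seen n times. The defaults s=1 and x=0.5 are illustrative guesses, not tuned values.

```python
def f_w(p_w, n, s=1.0, x=0.5):
    """Smoothed word probability: (s*x + n*p(w)) / (s + n).

    p_w: raw spam probability of the word from the corpus
    n:   number of mails in which the word was seen
    s:   strength given to the background assumption (assumed default)
    x:   assumed probability for a word never seen before (assumed default)
    """
    return (s * x + n * p_w) / (s + n)
```

A word never seen before gets exactly x. A word seen once in spam only moves to 0.75 rather than jumping to 1.0. Only after many sightings does f(w) approach p(w). If an implementation feeds raw 0 or 1 probabilities into the combining step instead, results can swing badly, which may be worth checking.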