Slashdot Mirror


More on Bayesian Spam Filtering

michaeld writes "The "Bayesian" technique for spam filtering recently publicized in Paul Graham's essay A Plan for Spam doesn't actually seem to have anything Bayesian about it, according to Gary Robinson (an expert on collaborative filtering). It is based on a non-Bayesian probabilistic approach. It works well enough, because technology often doesn't have to be 100% perfect in order to do something that really needs to be done. The problem interested Robinson, and he posted his thoughts about trying to fix the problems in the Graham approach, including adding an actual Bayesian element to the calculations."

251 comments

  1. How about Machiavellian Spam Filtering by Anonymous Coward · · Score: 1, Funny

    kill 'em. might = right

  2. What happened to IBM/Redhat Article by bashly · · Score: 0, Offtopic

    Sudden change of agenda?

    1. Re:What happened to IBM/Redhat Article by meringuoid · · Score: 1

      Turned out to be a duplicate. Got pulled.

      --
      Real Daleks don't climb stairs - they level the building.
  3. best method .. is ..to .. by Anonymous Coward · · Score: 0

    block all, and then let only what you KNOW you need in. it's the only method that will ever work right.

    1. Re:best method .. is ..to .. by erikdotla · · Score: 1

      I agree, and I have a more detailed discussion of this going on somewhere below, perhaps you can come and jump on my side, since I'm getting flamed down there. :)

      --
      # Erik
  4. Spam spam spam by Dynamoo · · Score: 1
    Well I guess spam comes in different size tins sometimes, and with different labels so you can tell the spam apart. I like Hot and Spicy Spam. Mmmm.

    Of course, the 1% of non-spam that accidentally gets filtered out is just collateral damage (except it's normally something really important like a tin of processed peas or something).

    I'm going to sit down now and take some more HGH.

    --
    Never email donotemail@WeAreSpammers.com
    1. Re:Spam spam spam by docbrown42 · · Score: 1

      Well I guess spam comes in different size tins sometimes, and with different labels so you can tell the spam apart. I like Hot and Spicy Spam. Mmmm.

      Bloody Vikings!

      -Ed

      docbrown.net NEW!
      Graphic Design, Web Design, Role-Playing Games...all the good stuff

      --
      Ed Wedig
      Graphic design services
      docbrown.net
  5. But what about pr0n spam? by Anonymous Coward · · Score: 0

    Can said "filter" filter out non-pr0n spam, while keeping the sweet sweet pr0n spam?

    1. Re:But what about pr0n spam? by saskboy · · Score: 1

      I know, what would we do without our free porn, and porn passwords?

      --
      Saskboy's blog is good. 9 out of 10 dentists agree.
    2. Re:But what about pr0n spam? by Anonymous Coward · · Score: 0

      In other words what we need is a sort of "facial recognition system."

    3. Re:But what about pr0n spam? by wirelessbuzzers · · Score: 1
      In other words what we need is a sort of "facial recognition system."
      facial?
      --
      I hereby place the above post in the public domain.
    4. Re:But what about pr0n spam? by Anonymous Coward · · Score: 0

      Do you prefer the term "money shot" or something? There's more than one kind of facial.

    5. Re:But what about pr0n spam? by yomegaman · · Score: 0

      This is fucking hilarious... I congratulate you, finally a pun worth making.

      --
      ...wearing a skin-tight topless leather jumpsuit, with cutaway buttocks and transparent crotch panel.
  6. Why filter spam? by Anonymous Coward · · Score: 0

    Spammers have to make money, too. Is it so hard to click on a link or two a day to help put food into the mouths of the man's children? Who are you, Scrooge? Help the man feed his kids this Thanksgiving.

    1. Re:Why filter spam? by Anonymous Coward · · Score: 0

      Nothing wrong with people making money.
      There is plenty wrong with stealing it.

      "Anti-spammers" are not starving anyone.
      They are, however, opposing theft.

    2. Re:Why filter spam? by kaladorn · · Score: 2

      Maybe we can feed the spammers' children.
      Hopefully to something large and hungry.

      --
      -- Mal: "Well they tell you: never hit a man with a closed fist. But it is, on occasion, hilarious."
    3. Re:Why filter spam? by Anonymous Coward · · Score: 0

      Actually, everything is wrong with people making money.

      Stealing it from fat cats who do nothing but take pride in the amount of money they have is a moral imperative.

      Fat cats should be fed to the starving children.

  7. spam by sstory · · Score: 1

    Someone came up with this idea recently, and I like it, so I've been repeating it. Instead of illegalizing spam, which i would love if it worked, but it won't, require spammers to indicate the nature of the email--anonymous, commercial, with a word or such in the subject line, which can then be filtered by individual recipients according to their desires. It would not be as free-speech-limiting as banning spam, and spam would die out due to ineffectiveness once most everyone filtered it.

    1. Re:spam by Anonymous Coward · · Score: 0

      You're mighty fucking optimistic if you think people are going to tag their mail so that you don't have to read it. May as well sit around and hope they all just decide to stop spamming.

    2. Re:spam by Cyno01 · · Score: 2

      yeah, they're already supposed to indicate if it's pr0n spam by specifying in the subject, hot sluts inside, or whatever, but they usually don't, and there isn't a really good way to enforce this. 9 times outta 10 when i get an e-mail from Joey (i know 3), and the subject is hey, or how's it going or something, the actual e-mail is pr0n

      --
      "Sic Semper Tyrannosaurus Rex."
    3. Re:spam by Grax · · Score: 1

      I just want to require programmatically that spammers have permission to send from whatever domain they're sending from. At least then there is some contact info for them and a domain that you can deny. (OK. so it is fake contact info and they'll register a bunch of domains)

      If everyone subscribed to my plan (outlined in the link in my signature) spammers would be forced to send mail only from their own domains. If only yahoo.com and I subscribed to my plan at least all the spam forging yahoo.com's domain would be rejected.

      A couple of weeks ago a spam went out with one of my domains forged as the sender. That annoys me severely and this would prevent that.

    4. Re:spam by wheany · · Score: 1

      Well... Spamming individuals is already illegal in Finland, but has that stopped (Finnish) spammers? No. They still make the same excuses ("This email can't be considered spam because spammers lie").

      What would help is to have the authorities take spamming more seriously. Even if they know who the spammer is, they are unlikely to do anything about it...

  8. I'm going to sue by L.+VeGas · · Score: 0, Troll

    I originally coined the term "Bayesian Spam" to describe my Bay of Pigs / Asian conspiracy theory.

  9. I still think passive euthanasia is the best way. by tcc · · Score: 2, Flamebait

    Why is such a simple problem that pisses off 99.9% of the population so hard to manage on a global scale? I mean, EVERYONE is pissed off at getting spammed, everyone would LOVE legislation to sodomize local spammers with a baseball bat. Overseas is a different problem, but country/continent-wide spam is 1/2 of my problem and can easily be taken care of with proper legislation. For once a restrictive legislation would get 99% support... you don't see that every day. Like I mentioned before, I don't get our politicians: they say they work for us, they try to find clever ways to tax us, remove control that we used to have and all, but on something where they would get unprecedented support, they are simply sitting on the issue...

    Until politicians will be fed up and people will actually get SUED for spamming (for once you could have a good reason to sue real bad guys) nothing will change.

    Yes I know in SOME states it's beginning, so for local spam in a few years from now I think legislation will make its way and we'll be able to look in our mailbox and stop having TD Waterhouse spamming when you already have an account with them, etc.

    The other problem now is overseas spamming, especially coming from China/Taiwan. I mean.. I don't read Chinese, I don't plan on buying that #.#" something overseas, so why do they spam us like that? I never get it, but I'd be all for passive euthanasia (i.e. ban their IP at router level) and if this is bad for business or relations or whatever, well MAYBE they will do something about it.

    Here where I work, it's simple: one spam, I ban a whole class straight off the servers. If one day I get a call because someone couldn't reach us (if they really need to reach us, we have a phone anyways!) I'll be sure to tell him why. Too bad this is not happening at the backbone level, because some people would get their act together fast and apply a legislation globally.

    --
    --- Metamoderating abusive downgraders since my 300th post.
  10. Tutorial on Bayesian Inference by rbrito · · Score: 5, Informative

    The timing of this article seems impeccable, since I am myself trying to learn about Bayesian Statistics.

    I am a Computer Science student studying Computational Biology (more specifically, Sequence Alignments) and while I have a bit of background on Classical Statistics, I was (and still am) completely ignorant about Bayesian Statistics.

    It is only now that I'm trying to learn about Hidden Markov Models and their applications to Sequence Alignment that I finally decided to learn the basic hypotheses of Bayesian Statistics and how they differ from the hypotheses made by Classical Statistics.

    During my searches for finding introductory material on Bayesian Statistics, I found this course page which has some nice introductory notes, including Bayesian Statistics.

    I hope that other people find this resource as useful as I did.

    1. Re:Tutorial on Bayesian Inference by Wile+E.+Heresiarch · · Score: 3, Interesting
      Here are some additional references, on-line & off, about Bayesian probability.

      On the web, see: Assoc. for Uncertainty in Artificial Intelligence -- this is the primary conference devoted to belief networks, which are a class of graphical (in the circles and arrows sense) Bayesian probability models. There are tutorials and other papers on the main AUAI web page, and links to the last several years of conference proceedings. By the way, Heckerman and Horvitz, now doing belief networkish work at MS Research, are in the AUAI crowd.

      In print, my favorite reference is E.T. Jaynes, "Probability Theory: The Logic of Science", which is due out soon. See this web site devoted to Jaynes' work for the status. I am also fond of Castillo, Gutierrez, & Hadi, "Expert Systems and Probabilistic Network Models".

      There are a vast (well, maybe just large) number of alternative models to classify things; a good introduction is Hastie, Tibshirani, & Friedman, "Elements of Statistical Learning". Incidentally, they use spam classification to illustrate several kinds of models.

      Finally, if you're wondering what the heck is the difference between Bayesian probability and any other kind -- just google the posts in sci.stat.math; there is a Bayesian vs frequentist flame war about once a year. :^)

  11. Post your results here by Jeffrey+Baker · · Score: 5, Interesting
    I'd like to hear the results of anyone who has implemented one of these probabilistic filtering systems. I implemented a modified version of Paul Graham's system and so far it kicks ass. So far it has trapped over 600 spams without any false positives. I receive almost 100 spams a day and over the last week I have generally only had to delete one or two by hand. The rest go directly to jail.

    I'd like to hear about modifications to this system. I removed Graham's doubling of "good" word frequencies, and I trained my filter using digrams. I also tried all the various methods supplied by the program "rainbow", with good results, but the implementation was too slow and clunky to place in the middle of my email delivery system. What are other possible modifications?

    1. Re:Post your results here by ajm · · Score: 3, Insightful

      Just out of interest what's your code written in and would you consider posting it?

    2. Re:Post your results here by kwerle · · Score: 3, Interesting

      I implemented Paul's system without the changes you mentioned, and am seeing >95% success (and climbing). 0 false positives. I will be submitting it to sourceforge this week.

    3. Re:Post your results here by Jeffrey+Baker · · Score: 2
      So have you been retraining the system as you get more spam, or did you train it initially and leave it that way? How large is your training set?

      Details! My training set was 300 spams and 3500 not-spams. With digrams, my filter traps 618 out of 621 spams in my spam folder, which is 99.5%

    4. Re:Post your results here by saskboy · · Score: 1

      I don't understand the filtering software that I can download to implement this system. Anyone have a link?

      --
      Saskboy's blog is good. 9 out of 10 dentists agree.
    5. Re:Post your results here by Jeffrey+Baker · · Score: 5, Interesting
      I hacked it together in Perl, to make use of the Berkeley DB interfaces and the MIME parsing modules. Took about 30 minutes. I'm working on a C library that could be linked into mutt or pine or whatever, but I'm finding the available MIME code in C cumbersome.

      You can grab the source here, but it is specific to the exact way that my mail gets delivered (via offlineimap into maildirs).

    6. Re:Post your results here by aero6dof · · Score: 1

      How about hybridizing it with spamassassin to help mark email for varying levels of analysis? You could use spamassassin with a low threshold to do an initial low-resource pass and then do a high-resource pass with your system. Alternately, you could try to generate spamassassin rules from your database to help with that first-pass filter.

    7. Re:Post your results here by kwerle · · Score: 3, Interesting

      So have you been retraining the system as you get more spam

      I continue to train.

      or did you train it initially and leave it that way. How large is your training set?

      I started off with a base.

      Details! My training set was 300 spams and 3500 not-spams.

      I started with a little more than 300 spam, and around 1000 valid messages.
      My count is now:
      Good messages read: 1194
      Bad messages read: 644

      That's because I only train on deleted mail, and I don't tend to delete my mailing lists except for once a month or 2...

      With digrams, my filter traps 618 out of 621 spams in my spam folder, which is 99.5%

      Against my start set, I nailed about 97%, including refiling 2 false positives from my old anti-spam system as being not spam. I've noticed that the system is really good at nailing stuff it already knows about, but the learning curve is a little steep for 'new spam types'. Still, I'm pretty happy with it.

    8. Re:Post your results here by kwerle · · Score: 2

      If you're hitting > 95% or even >97% with the new system, who cares about spamassassin?

    9. Re:Post your results here by Jeffrey+Baker · · Score: 2

      What you said is actually inverted, since my system is way faster than spamassassin. I can do 15 mails/second with my Perl code. My half-implemented C code does over 600/second. Spamassassin seems to take about 5 seconds per mail on the same hardware.

    10. Re:Post your results here by wdr1 · · Score: 2

      or vice-versa...

      -Bill

      --
      SlashSig Karma: Excellent (mostly affected by moderatio
    11. Re:Post your results here by pemerson · · Score: 2

      In addition, is anyone running probabilistic filtering on a large system with lots of users? Say, 1,000,000 messages a day? I'd be curious to know how you do it while keeping your load down on your mail machines.

    12. Re:Post your results here by XDG · · Score: 3, Interesting
      I've implemented it in part -- my code is in perl and will flag e-mails, but I haven't worked it into a filter yet.

      My experience is that I get a few percent false negatives and about 1% false positives. I'm not seeing zero false positives, like many people are, but that probably has to do with the training sets used. Statistically speaking, you always have to trade off false negatives against false positives, so it's reasonable in my 'real world' tests.

      As a side note, everyone should test out of sample. E.g. set aside half your good e-mails and half your spam e-mails, build the filter on one half, and then test on the other half. That's the only way to get a fair test of the filter.
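      A minimal sketch of that out-of-sample test in Python (not the poster's Perl). The names split_half and error_rates are hypothetical, the classifier is assumed to be any function that returns True for spam, and both held-out sets are assumed to be non-empty.

```python
import random


def split_half(messages, seed=0):
    """Shuffle a copy of the messages and split it into (train, test) halves."""
    shuffled = list(messages)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


def error_rates(classify, good_test, spam_test):
    """Return (false_positive_rate, false_negative_rate) on held-out mail.

    classify(msg) is assumed to return True when it flags msg as spam.
    """
    false_pos = sum(1 for m in good_test if classify(m))      # good mail flagged
    false_neg = sum(1 for m in spam_test if not classify(m))  # spam let through
    return false_pos / len(good_test), false_neg / len(spam_test)
```

      The filter would be built only from the train halves of the good and spam corpora, and error_rates would then be run only on the test halves, so no message is scored by a filter that has already seen it.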

      For my "good" email corpus, I dumped my entire e-mail archive since 1995. That included personal e-mail, receipts from online shopping, some mailing lists, etc. The few things that get flagged as spam (a) are almost always sent in HTML format, and (b) very short with little real content. (E.g., "Hey, looking forward to seeing you this weekend. Call me if you go out. My number is... Bye.")

      The spam corpus I took from an online resource while I build up my own. The e-mails that slip by unflagged are usually (a) short and (b) phrased like a friend making a suggestion. (E.g., "Hi, I just thought you'd be interested in hearing about this new, cool website, http://...") It seems to be close enough to a real message to slip through. Thankfully, few of them are like that.

      I'm including subject lines, from addresses, and the body so far. I'm not parsing ip addresses or html tags specially, however, just basic words using a simple perl regexp.

      Interestingly, "COLOR" is one of the most often flagged words indicating spam. HTML formatting text seems to be the biggest culprit in my false positives. I might explicitly exclude the ones that show up in good mail (e.g. from friends who use crappy e-mail programs like aol) like COLOR, FONT, FACE, etc., but leave in the ones that spammers use like TD, TR, etc.

      -XDG

    13. Re:Post your results here by scatalogical · · Score: 1

      I have considered using trigrams or bigrams instead of simple unigrams for the weighting. This takes into account the "neighborhood" around terms and is/was used by the NSA. These frequencies can be dropped directly into the equations, I believe.
      E.g.:

      sample text:
      a simple sentence to show off.

      bigrammatic analysis:
      a simple 1
      simple sentence 1
      sentence to 1
      to show 1
      show off 1

      trigrammatic analysis:
      a simple sentence 1
      simple sentence to 1
      sentence to show 1
      to show off 1

      This is used in cryptography in various ways and I have used it where I was publishing something and had to match up citations that had been edited and ones that hadn't with each other so I knew which numbers matched. (They had updated the cites and changed the numbering, adding or deleting some between versions.)
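      The bigram and trigram counts listed above can be reproduced with a short Python sketch. The function name ngrams and the word-splitting regex are assumptions made for illustration, not anything taken from the poster's code.

```python
import re
from collections import Counter


def ngrams(text, n):
    """Count every run of n consecutive words in text (case-insensitive)."""
    # Assumed tokenizer: letters, digits and apostrophes make up a word.
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(
        " ".join(words[i:i + n]) for i in range(len(words) - n + 1)
    )


bigrams = ngrams("a simple sentence to show off.", 2)
trigrams = ngrams("a simple sentence to show off.", 3)
```

      Each key in the resulting Counter can then stand in for a single word wherever the filter keeps per-token spam and non-spam frequencies.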

    14. Re:Post your results here by Mushy · · Score: 1

      I was using the hybrid approach until Bayesian filtering started performing better than SpamAssassin (No false positives, More trapped spam, less upgrade mania) and then I just dumped SpamAssassin.

    15. Re:Post your results here by XDG · · Score: 1
      My code in perl -- rough as it is, undocumented, etc. can be found here. I personally label them something like v0.05.

      Two programs -- one to make a .db file with the words and probabilities from a good corpus and bad corpus, the other to test a mailbox against the word database. In both cases, the mail files probably need to be in unix mbox format.

      -XDG

    16. Re:Post your results here by Eric+Seppanen · · Score: 3, Insightful

      You might want to consider collaborating with the group working on bogofilter, which is basically the same thing, done in C.

      --
      314-15-9265
    17. Re:Post your results here by Anonymous Coward · · Score: 0
      I've been filtering spam using all of these techniques (arbitrary spam scoring, white/black lists, trained word frequency analysis, etc) for a little over five years now. In addition I also make extensive use of time-limited one time addresses on Usenet, web pages, etc. It's not possible to spam me at addresses scanned from the web or Usenet, at least not after a few days.

      By and large, any combination of n spam techniques, where n is at least three or so, will work about as well. Depending on the flavor of your legitimate mail you'll get somewhere between 95 and 99.9% accuracy. You will get both false positives and negatives.

      Clearly the only schemes which can improve on this are the human-driven shared blacklist affairs. There seem to be several of these around, and they, too, currently produce imperfect results. It's unclear when or if any of these will reach a large enough critical mass to produce many nines of accuracy.

      Grant Taylor <gtaylor+slashdot_bffjg091702@picante.com>
      http://www.picante.com/~gtaylor/spam/

    18. Re:Post your results here by Ian+Bicking · · Score: 2
      I'm surprised to only see one link to Bogofilter in this discussion. I started using it just a couple days ago, training it from scratch (because I am patient and lazy). I train it on all emails it doesn't mark as spam, and then retrain it on spam it misses. So far I'm up to catching about 50%-75% of spams (climbing rapidly), with one false positive (though I had to read through that email a couple times to realize it wasn't really a spam -- so the false positive is understandable, since as a human I could have made the same mistake).

      Bogofilter potentially captures more relevant words than as described by Graham -- IP addresses, email addresses, and other text that should be considered atomic are recorded atomically. I think even more could be done with this -- but I worry that bogofilter is going to create too large a database, as it even seems to be keeping track of words like "$20".

      As an optimization, I could imagine you could double-register some words, mostly those in headers. So the word "mother" in a subject line might register both "mother" and "subject:mother". Perhaps IP addresses could be recorded with all their classes (e.g., "200.69.228.105" would be recorded as "200.69.228.105", "200.69.228", "200.69" and "200" -- maybe prefixing some text to the last three, so that "200" the number is distinguishable from "200" the class-A address).
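      A rough Python sketch of that double-registration idea. This is not bogofilter's behavior: the function name expand_token, the "ip:" prefix and the IPv4 regex are all hypothetical choices made for illustration.

```python
import re

# Loose dotted-quad match; it does not check that each octet is <= 255.
IPV4 = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def expand_token(token, header=None):
    """Return every token to record for one word seen in a message."""
    tokens = [token]
    if header:
        # e.g. "mother" in the Subject line also counts as "subject:mother"
        tokens.append(f"{header.lower()}:{token}")
    if IPV4.match(token):
        parts = token.split(".")
        # Also record the shorter prefixes, tagged with "ip:" so that the
        # prefix "200" is not confused with the plain number "200".
        for k in (3, 2, 1):
            tokens.append("ip:" + ".".join(parts[:k]))
    return tokens
```

      For example, expand_token("mother", "Subject") gives ["mother", "subject:mother"], and expand_token("200.69.228.105") gives the full address followed by "ip:200.69.228", "ip:200.69" and "ip:200".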

      Ultimately, a well-trained spam database could be trimmed and distributed, but bogofilter does not yet include such a database. Graham's concerns about distribution and trust are, IMHO, not entirely necessary -- a well-trained database can be created by only a handful of people (who receive lots of spam), and even if non-spam must be classified on an individual basis, spam is not tailored to any individual (nearly by definition). I don't think this has as great a risk of censorship as blocking lists.

      I would be interested to see an improvement in the client end of bogofilter (or similar software). Right now I'm using procmail, and forwarding miscategorizations back to myself with a changed subject line (which procmail catches and feeds to bogofilter). With just a little work, this could be used to create filters besides spam, where I train bogofilter to filter based on other criteria. (Well, I can do this right now, but it would take only a little work to make this accessible even to computer novices.)

    19. Re:Post your results here by joebok · · Score: 1

      Right after Graham's article I started in on my own implementation- with similar results: it very quickly starts catching spam and no false positives so far (that I've noticed anyway). I'm using VB and an access database- right now it seems snappy enough but I have worries it'll bog down when the DB gets too big. But it's working good enough for me as-is. Code and binaries are at joeemail.sourceforge.net.

    20. Re:Post your results here by Anonymous Coward · · Score: 0

      If I had a way of contacting you, I can give you the details to a project just starting up that's working on it. It just got started.... I joined it a week ago, and they (project participants) are just honing the formulas and are getting impressive results.

      Hmmm!! How can we meet privately? Does slashdot provide a way?

    21. Re:Post your results here by benedict · · Score: 1

      In a similar vein, I used spamassassin to train ifile.

      I'll use bogofilter as soon as it has a simple installation procedure for BSD.

      --
      Ben "You have your mind on computers, it seems."
    22. Re:Post your results here by Anonymous Coward · · Score: 1, Insightful

      Check out the Spambayes project in SourceForge. They are working mostly in Python (python.org), and have a large collection of "spam fodder" and "ham fodder" to work with.

      I'm sure that if you've written some code, as Jeff Baker has, you would be welcome to participate in the group. They have a CVS repository already in SourceForge.

    23. Re:Post your results here by Anonymous Coward · · Score: 0
      I'm surprised to only see one link to Bogofilter in this discussion

      Well, I went and looked and I saw this:

      Eric S. Raymond is the original author of bogofilter.
      And that's enough right there for me to find something else to use.
    24. Re:Post your results here by cmeans · · Score: 1
      I wrote a version in Java (to be released under the Apache Software License version 1.1), and created a James mailet wrapper (this lets me plug the code into the James server), but it's written to be used from a variety of interfaces.

      I have had a few false positives, but that's probably because I've not gone back and rebuilt my corpus since debugging...I'm very pleased with the results so far...

    25. Re:Post your results here by Hater's+Leaving,+The · · Score: 1

      Those are 403-ing.
      They look quite short (2.9K), maybe you could just copy/paste?

      THL.

      --
      Keeping /. cynic density high since the fscking Kwhores/trolls arrived.
    26. Re:Post your results here by XDG · · Score: 1

      Sorry, permissions fixed.

      Also, I discovered that the new Mail::Box as of a few days ago was breaking my code, so I've got two versions up there now.

      -XDG

  12. The proof of the pudding... by ajm · · Score: 5, Interesting

    ...is in the eating. I think the same applies to spam. Paul showed, to his satisfaction, that the technique he used worked for his samples. Gary proposes some changes that would improve the filter's accuracy, but does not test these theories.

    We will now have many slashdot posts saying "I've not tested this but I think A (or B, or C, or X)"

    Here's where the scientific method comes into its own. Anyone who cares enough can actually test and post their results. I'd be interested in seeing what they look like. I don't have a database of spam to test against (and please don't volunteer to sign me up for some :) but it would be interesting to see whether what looks convincing in theory pays off in practice.

    1. Re:The proof of the pudding... by shadow303 · · Score: 3, Insightful

      From what I can observe from the writeup, Gary appears to be one of the "experts" that I refer to as "theory whores". Hard problems need to be tested, but some people seem to think that they can arrive at good results from an unproven theory. Anybody who has actually tested difficult problems to any extent could tell you that things don't always go as planned. An improvement that might work in theory sometimes results in disaster due to minor points that the theory does not take into account.
      Also, it bothered me that he objected to Paul's work biasing one side. It was almost like he thought it was a bug, but there was a good reason for biasing (reduce false positives). So my advice for Paul is, until you actually implement your idea, don't go trying to say that it is better than somebody else's method.

      --
      I've got a mind like a steel trap - it's got an animal's foot stuck in it.
    2. Re:The proof of the pudding... by NecrosisLabs · · Score: 1

      You, sir, have made my day; if I have to hear one more chucklehead say "The proof is in the pudding" I will not be held accountable for my actions.

    3. Re:The proof of the pudding... by shadow303 · · Score: 1

      Doh! That last sentence was advice for Gary, not Paul.

      --
      I've got a mind like a steel trap - it's got an animal's foot stuck in it.
    4. Re:The proof of the pudding... by ajm · · Score: 2

      It's the feeling I got as well. The changes Gary suggests might be good or they might be bad. What I liked about Paul's write-up was that he determined practically that it worked for him. Sure the theory may not be perfect but, to use a broad analogy, you don't need general relativity to work out how fast a ball rolls down a slope, Newtonian theory works fine. It may not be worth it to get the extra .1% of accuracy, especially if, as you point out, it increases false positives. Only testing will tell.

    5. Re:The proof of the pudding... by AndroidCat · · Score: 1
      I'm sure those chuckleheads could care less. They probably say "Play it again Sam" and "Alas poor Yorick, I knew him well" too.

      Death's too good for some people.

      --
      One line blog. I hear that they're called Twitters now.
    6. Re:The proof of the pudding... by Anonymous Coward · · Score: 2, Insightful

      Paul called his method Bayesian. It wasn't. In addition to just pointing out that Paul was wrong, Gary also outlined how one might take a Bayesian approach to the problem.

      He also showed how his extended solution included Paul's as a special case.

      It sounds like you frequently get terminology wrong, and when someone points out that you're using the term incorrectly, and further shows how you could actually apply what you were talking about to the problem at hand, you go off on them for being a "theory whore." You're the winner of today's "Slashdot personified" award. Congratulations!

    7. Re:The proof of the pudding... by NoOneInParticular · · Score: 1
      I'm not sure whether Paul actually called his method Bayesian or not, but this is an easy source of confusion. The method is called 'naive Bayes' because of the wrong assumption it makes. In effect, the name naive Bayes implies that the method is not Bayesian at all, as it makes naive, non-Bayesian assumptions. The reason Bayes is in there in the first place is that it uses Bayes' rule to calculate the 'probability'. However, if you check the formula, P(C|W) proportional to prod(P(W_i|C)) * P(C), you will soon notice that the reliance on P(C) is completely superfluous, as it will not make much difference in the prediction. (You can check this by seeing that P(C) is only one probability, while the product is over all the words you consider, which can be thousands.) You can easily check that using the whole right-hand side of the rule or only its first part will hardly ever make any difference.

      The method has a long history in information retrieval. I think people have used it since the sixties. There it is commonly referred to as the 'multinomial model'. This is in contrast with the 'multivariate model', which takes into account not counts but only occurrence/non-occurrence of words.

      BTW, for people interested in implementing this: DO NOT multiply the probabilities, but add the negative logs of them. This will save you from destructive floating point errors.
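      A minimal Python sketch of that advice, assuming every per-word probability has already been smoothed so that none is exactly zero. The names log_score and classify are hypothetical. The sketch sums plain logs and picks the larger score, which is the same comparison as summing negative logs and picking the smaller one.

```python
import math


def log_score(word_probs, prior):
    """Return log P(C) + sum of log P(w_i | C) for one class C."""
    return math.log(prior) + sum(math.log(p) for p in word_probs)


def classify(spam_probs, ham_probs, spam_prior=0.5):
    """Return True if the spam class has the higher log score."""
    spam = log_score(spam_probs, spam_prior)
    ham = log_score(ham_probs, 1.0 - spam_prior)
    return spam > ham
```

      With a hundred words at probability 1e-5 each, the plain product underflows to 0.0 in double precision, so both classes would tie at zero, while the log sums stay finite (around -1151) and still compare correctly.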

  13. poor Hotmail users are still in the cold... by saskboy · · Score: 4, Funny

    I have some tricks for Hotmail users who cannot benefit from the technique above:
    Filter any message without the @ in the address.
    Filter Britney, Boobs, Penis, Inches, WIN, ___ ..... and your own email address userid.
    Now you only have about 40 spams a day to deal with instead of 100.
    Uncheck your information from being in the MSN directory too.

    Enjoy :-)
    John

    --
    Saskboy's blog is good. 9 out of 10 dentists agree.
    1. Re: poor Hotmail users are still in the cold... by IIRCAFAIKIANAL · · Score: 2

      I really feel for any plastic surgeons named Britney that focus on penile and breast enlargement surgery. They can't filter squat.

      --
      Robots are everywhere, and they eat old people's medicine for fuel.
    2. Re: poor Hotmail users are still in the cold... by saskboy · · Score: 1

      I saw a /. sig the other day that said a person who gets spam must feel something like a Fat balding woman who needs bigger breasts and a longer penis to screw Britney.

      --
      Saskboy's blog is good. 9 out of 10 dentists agree.
    3. Re: poor Hotmail users are still in the cold... by wirelessbuzzers · · Score: 1

      Actually, I have received about 1 a day for the past two weeks (2 out of 12 were filtered), plus Apple news that I haven't canceled, and I have been fairly loose about giving out my address (mike_hamburg@hotmail.com).

      This is not my main account anymore, I only check it twice a week for people not aware of my address change, so don't even bother to sign me up for pornspam. Part of the reason I don't get spammed much is my address has an underscore and I haven't done any REALLY stupid things with it (besides enter greetingWishes.com once, that's half my spam right there), but wisely chosen names (not jsmith23) will help you not get spammed. And don't be a part of their stupid member directory.

      A nice way to cut your spam in half is to kill anything with "udp" in it, because no English words (ie in the dictionary) contain this combo (unless you count mudpie), but most of that fake diploma spam has it.

      --
      I hereby place the above post in the public domain.
    4. Re: poor Hotmail users are still in the cold... by MrEd · · Score: 1
      Or just switch to FastMail and be done with it.


      It can check your Hotmail account every half hour or so too if you don't want to give up your spam-harvesting mailbox. How's that for features?


      NOTE: I don't work for them or have anything to do with them except being quite happy with my free account there. This is not a plug!

      --

      Wah!

    5. Re: poor Hotmail users are still in the cold... by Anonymous Coward · · Score: 0

      And then with their massively Enlarged Penises, they knock holes in their walls and need Low Interest Mortgages

    6. Re: poor Hotmail users are still in the cold... by JackWolf · · Score: 1

      Actually you can use pop3hot as a windows proxy and treat your hotmail account just like a pop mail server.

      http://www.pop3hot.com/main.htm

  14. Terrible Spam Filters by DonkeyJimmy · · Score: 3, Informative

    It's good that work is being done to make a good weighted spam filter.

    It's funny how bad the standard Microsoft spam filter is (the one present in outlook). It's simply a word lookup, where if the word is present the message is marked as spam. It looks for things like "for free?". You can see the full list here, near the bottom. It's a little old, but not outdated (I think you can upgrade your spam filters, but I tested these, and the ones I tested work).

    The adult filter isn't any better.

    --
    "Probably the toughest time in anyone's life is when you have to murder a loved one because they're the devil." -Philips
  15. Naive Bayesian Learning by Anonymous Coward · · Score: 2, Interesting
    Finally it is worth mentioning that if you really want to go a 100% Bayesian route, try Charles Elkan's paper, "Naive Bayesian Learning". That approach should work very well but is a good deal more complicated than the approach described above.
    Here is the article[citeseer.nj.nec.com]
  16. Let's see by sam_handelman · · Score: 5, Funny

    P (This is spam) = P (This is Spam | It will enlarge my penis) * P (It will enlarge my penis)

    Now, given that I have prior knowledge that:
    P (It will enlarge my penis)

    is very low,

    and given that, having never encountered anything which enlarges my penis in any permanent way, I have no knowledge of
    P (This is Spam | It will enlarge my penis)

    and we have the product of one probability which I know is low, and another of which I have no posterior knowledge, so we conclude that P (It is Spam) is also low, and that I must have requested more information on their new penile enlargement technique.

    So, that message goes into the keepers.

    Meanwhile,

    P (It is Spam) = P (It is Spam | Frank is getting married) * P (Frank is getting married)

    So, I know Frank is getting married, since he sent me this e-mail I'm considering filtering as Spam, and whether or not it is spam is pretty much independent of whether or not Frank is getting married, so.... it's Spam. Away it goes.

    P.S. I've deliberately made a hash of this for a joke. The actual rule is:

    P (A & B) = P (A | B) * P (B)
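
    Applying that product rule in both orders gives Bayes' rule, which is the form a filter would actually use. Here S stands for "this is spam" and W for "the message contains a given word"; both labels are introduced only for illustration.

```latex
P(S \,\&\, W) = P(S \mid W)\,P(W) = P(W \mid S)\,P(S)
\quad\Longrightarrow\quad
P(S \mid W) = \frac{P(W \mid S)\,P(S)}{P(W)}
```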

    --
    The good and new comes from no quarter where it is looked for, and is always something different from what is expected.
  17. Whatever Jaguar (Mac OS X 10.2) uses works! by Anonymous Coward · · Score: 1, Interesting

    Is this what the new Mail.app in Mac OS X 10.2 uses?

    I, myself, am not sure but the new Mail.app is smart and it does learn. After a week of "learning" it has correctly identified spam more than 99 times out of 100.

    1. Re:Whatever Jaguar (Mac OS X 10.2) uses works! by marmoset · · Score: 2

      I've seen the same effect. I almost never see spam anymore, and get almost no false positives. I wish I could find the quote, but one of the Jaguar engineers confirmed that there's some "serious math" behind Mail.app's spam handling. I can't find any technical reference about the algorithms being used, though.

  18. filtering not the answer - maybe this is by frovingslosh · · Score: 5, Insightful
    Sadly, unless you are an ISP or other mail service provider, filtering does nothing. The spammers work in volume. They count on hitting everyone to reach that .1% that will respond. That response is what they are after and what they get paid for. You likely know better than to ever deal with anyone who spams you or to ever respond to their spam. Filtering your own e-mail has absolutely no effect on the spammer, you were not going to respond anyway. By the time you filter they have already wasted your bandwidth, and perhaps mailbox capacity and even forwarding limits from a forwarding service. Your filtering is useless, puny human!

    Here is a suggestion for something that might make an impact on spammers: If I open my firewall, I see several attempts a day from people trying to get into my mail server. Of course, I don't have a mail server, but spammers are always looking for open relay points they can spam from. My suggestion: Give them a nice open relay server they can send mail to. Of course, you don't want to piss off your service provider by sending spam, and your upstream speed might limit you to less than you can receive, so rather than run a full mail server let's modify some mail server code to just accept mail and send it to the bit bucket. Maybe we can even misconfigure existing code to do this with no programming changes.

    No valid user will be affected, assuming you don't otherwise run a mail server. All that bandwidth you pay for can be used to receive e-mail from spammers before it ever goes out. Eventually their customers will see the response go from .1% to 0% and their business will dry up. This will impact spammers, blocking your own spam after it's been delivered will not.

    This need not even impact your own bandwidth. You can run the server when you are done using your system (Might make a nice screen saver - a black screen that just shows how many spammed addresses were prevented from getting spammed). Or you can impose limits on bandwidth at a firewall or router, or even restrict hours of access.

    If we set up enough different false open relay servers I think we could have a real impact on the spammers.
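
    A rough sketch of the "accept it and send it to the bit bucket" idea, in Python. It models only the SMTP conversation as a line-by-line state machine and leaves out the socket handling; the class name, the hostname relay.example, and the reply texts are assumptions for illustration, not taken from any existing mail server.

```python
class SinkSession:
    """One SMTP conversation that accepts everything and stores nothing."""

    def __init__(self):
        self.in_data = False   # True while the client is sending the message body
        self.recipients = 0    # RCPT TO count for the current message
        self.discarded = 0     # recipient copies that were thrown away

    def handle_line(self, line):
        """Return the reply for one client line, or None while swallowing body lines."""
        if self.in_data:
            if line == ".":    # a lone dot ends the message body
                self.in_data = False
                self.discarded += self.recipients
                self.recipients = 0
                return "250 OK: queued"   # claim success, deliver nothing
            return None
        cmd = line.upper()
        if cmd.startswith(("HELO", "EHLO")):
            return "250 relay.example"
        if cmd.startswith("MAIL FROM:"):
            self.recipients = 0
            return "250 OK"
        if cmd.startswith("RCPT TO:"):
            self.recipients += 1          # accept any destination, like an open relay
            return "250 OK"
        if cmd == "DATA":
            self.in_data = True
            return "354 End data with <CR><LF>.<CR><LF>"
        if cmd == "QUIT":
            return "221 Bye"
        return "502 Command not implemented"


session = SinkSession()
for line in ["HELO spammer.example", "MAIL FROM:<a@spammer.example>",
             "RCPT TO:<victim1@example.com>", "RCPT TO:<victim2@example.com>",
             "DATA", "Subject: buy now", "", "body text", ".", "QUIT"]:
    session.handle_line(line)
print(session.discarded)  # 2
```

    The `discarded` counter is the number the screen saver mentioned above could display.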

    --
    I'm an American. I love this country and the freedoms that we used to have.
    1. Re:filtering not the answer - maybe this is by RealAlaskan · · Score: 1
      ... rather than run a full mail server let's modify some mail server code to just accept mail and send it to the bit bucket.

      That's clever. One possible problem might be that spammers would quickly learn to test your relay with a message, just to make sure that it didn't all go to /dev/null. I suppose that we'd have to set things up so that single emails got forwarded, and bulk emails didn't, just to avoid that. Now we really are running an open relay, though it isn't much good to spammers.

      Another problem is that spammers might start automatically sending the same spams to the same lists via several different open relays. Thus, we might increase the volume of spam, at least in the early stages.

      I agree that the ultimate solution to the spam problem isn't going to come from filtering at the email client. It's a social problem, and needs a social solution. Filtering by ISPs (on by default, but easily circumventable by knowledgeable users, maybe?) would help, and so would us telling our less-knowledgeable friends and relations NOT TO BUY FROM SPAMMERS!

    2. Re:filtering not the answer - maybe this is by stienman · · Score: 3, Insightful

      Interesting idea, but easy to verify. Send one thousand emails, and include a verifiable email in it. Check the email a few hours later - if it's not there, then don't use the relay.

      -Adam

    3. Re:filtering not the answer - maybe this is by GigsVT · · Score: 1

      would us telling our less-knowledgeable friends and relations NOT TO BUY FROM SPAMMERS!

      I can hear them now:
      "I didn't buy from a spammer, it said 'THIS IS NOT SPAM!!!' and it had legit looking unsubscribe info."

      You might do better to send out a spam, then murder all the buyers once you get their address. Intellectual cleansing.

      --
      I've had enough abrasive sigs. Kittens are cute and fuzzy.
    4. Re:filtering not the answer - maybe this is by frovingslosh · · Score: 2
      You might do better to send out a spam, then murder all the buyers once you get their address. Intellectual cleansing.

      I've often wondered why we don't see a few spammer's heads on pikes to greatly reduce this problem, but there is a lot to be said for your solution too. Just don't do it on the day some good soul gets fed up with spammers and comes after you! ;-)

      --
      I'm an American. I love this country and the freedoms that we used to have.
    5. Re:filtering not the answer - maybe this is by frovingslosh · · Score: 2
      Interesting idea, but easy to verify. Send one thousand emails, and include a verifiable email in it. Check the email a few hours later - if it's not there, then don't use the relay.

      Sure, they might test it. Still seems better than doing nothing. If a spammer passes me 1000 pieces of mail and waits a few hours, that's 999 pieces that didn't go out and a few hours of his time. If only I do this it will have little impact, but if the slashdot effect kicked in and there were so many false servers that it kept happening to him over and over again, that would be sweet!

      And of course, some spammers will be lazy and not test. Jackpot!

      Of course, the servers should look different. Some Linux, some Windows, some something else. Claim to be different applications. We might even start building smarts into the servers (if you get only one email, and it's going to an address that is likely a test address (his own domain, a mailbox service like Hotmail, or a local ISP that serves the same area his packets came from), wait one minute and then send it on. Worst that can happen is your false relay gets blacklisted (not a problem).

      The bottom line is, which will have any impact on spammers, a lot of false relays out there that discard their e-mail destined for victims that keep the system going, or filtering e-mail that you were never going to read anyway?

      --
      I'm an American. I love this country and the freedoms that we used to have.
    6. Re:filtering not the answer - maybe this is by Stinky+Cheese+Man · · Score: 1
      Give them a nice open relay server they can send mail to. ...let's modify some mail server code to just accept mail and send it to the bit bucket.

      Open relays are detected by the fact that they actually do relay email. A system that merely accepts email, but does not pass it on to the email address used by the open relay tester, will never be identified as an open relay in the first place.

    7. Re:filtering not the answer - maybe this is by frovingslosh · · Score: 2

      As I mentioned (in a post that sadly was sent before being previewed), I can open port 25 on my system and get several hits in a night from people trying to find an open relay server. Maybe they do look to the blacklists, since some of the mail on a blacklisted server still gets through, but they are spending a lot of time to find those open relay servers in the first place. If you're on the Internet, open a mail server (real or not) and it will be found the first day. I'm suggesting we make it much harder for the spammers to find a real open relay server - by giving them lots of decoy servers that will at a minimum cost them a lot of time and at best might even receive their bulk spamming rather than a real open relay.

      --
      I'm an American. I love this country and the freedoms that we used to have.
    8. Re:filtering not the answer - maybe this is by Jeremi · · Score: 2
      Filtering your own e-mail has absolutely no effect on the spammer, you were not going to respond anyway.


      You're missing the point. I could care less what the spammer does. The benefit is that with a good filter, I don't have to look at spam. Currently I spend maybe 15 minutes a day recognizing and deleting spam emails, and occasionally screw it up and delete something important by mistake. If a filter program can reduce that load, it's useful to me even if it doesn't stop the spammer from spamming.


      And in any case, in a year or two, when such intelligent filters are a standard feature on AOL and Outlook and etc, the spammer's "hit rates" will likely drop dramatically, at which point they will have less incentive to spam.

      --


      I don't care if it's 90,000 hectares. That lake was not my doing.
    9. Re:filtering not the answer - maybe this is by Stinky+Cheese+Man · · Score: 1
      No spammer will use your server in the first place if it is not really an open relay. Merely being able to connect to the server is not enough. Merely having the server appear to accept a test email is not enough. The spammer will not use your server unless a test message is actually relayed by your server.

      Either your server really is an open relay, which the spammer will send millions of messages through, or it is not - in which case the spammer's relay test will fail and they will not attempt to use it.

    10. Re:filtering not the answer - maybe this is by silversurf · · Score: 1

      > Sadly, unless you are an ISP or other mail service provider, filtering does nothing

      Oh, I don't think that's a fair statement. What if I'm a company and I have 1000 employees all receiving email and using my internal mail server's resources? So I put in a mail gateway that filters email up front before it passes through my firewall and into my Exchange/Lotus system. I get a big resource savings by just filtering the email at the gateway and my users are happy.

      Too many Slashdot readers only seem to think in terms of their personal mail box on this issue of SPAM, when the problem must be dealt with in a much larger scope. Personal level spam filtering is great and something that has its place and these systems can be applied for that, but to really make an impact on spammers it takes an ISP or many companies to drop their connections because the message they are transferring is spam, or they are known as a spammer.

      That's why I like RBLs to some extent. They aren't the single silver bullet and are far from perfect, but I get a relatively high percentage of accurate blocks and I don't even have to take in all the email data at my mail server. I just find out who the sending host is and "blam" see ya later, very little bandwidth is used.

      Now implementing something like Paul suggests in his article has its place too. I think if you are a spammer and manage to get through the black/white lists, RBLs and other measures up front, then it's time to filter and a good, solid statistical filtering method is going to really do the trick. Especially when I get a consistently large sample of SPAM, like I do for my 1200 users I have to deal with.

      -s

    11. Re:filtering not the answer - maybe this is by jshazen · · Score: 1
      frovingslosh said:
      Filtering your own e-mail has absolutely no effect on the spammer... By the time you filter they have already wasted your bandwidth, and perhaps mailbox capacity and even forwarding limits from a forwarding service. Your filtering is useless, puny human!

      Sure, my filter (based on Paul's) may not affect the spammer, and may still waste some of my resources, but spam now wastes so much less of my precious time that it is well worth it!

      You can focus on saving the world. I'll worry about saving my sanity.

    12. Re:filtering not the answer - maybe this is by Phil+Gregory · · Score: 2

      Die.net's approach seems like a good implementation of this.



      --Phil (Sadly, I'm on a cable modem, so I don't have the bandwidth for this.) Gregory
      --
      355/113 -- Not the famous irrational number PI, but an incredible simulation!
    13. Re:filtering not the answer - maybe this is by frovingslosh · · Score: 2
      > Sadly, unless you are an ISP or other mail service provider, filtering does nothing

      Oh, I don't think that's a fair statement. What if I'm a company and I have 1000 employees all receiving email .....

      ..... like I do for my 1200 users I have to deal with.

      If you are providing mail for 1200 users then I certainly would include you in the "service provider" group that I mentioned. I'm glad to see you doing it. I don't discount the technology, I wish my ISP used it, and more importantly I wish my forwarding service used it before it counted spam against my quota.

      Another advantage to false relay servers that would be used only by spammers: as the technology evolved, I would envision that a network of such decoy relays could build up information that could be provided quickly to the service providers to help make their filters more accurate and responsive.

      --
      I'm an American. I love this country and the freedoms that we used to have.
    14. Re:filtering not the answer - maybe this is by frovingslosh · · Score: 2

      No. It's another good thing that can be done, but it's not what I'm advocating. Basically he has set up a mail server expecting to get mail for his own addresses. He then wastes as much time as he can of the open relay the spammer is using. This at least slows things down for the spammer, and he might just get the attention of someone in charge of the server. Another good thing to do; I think we need to do lots of things like this to stay ahead of the spammers. The dummy open relay would be another but different tool. Rather than slow down the connection it should take as much spam as it can, so that it doesn't go elsewhere and so that the paying client of the spammer would eventually see that he is getting no results.

      --
      I'm an American. I love this country and the freedoms that we used to have.
    15. Re:filtering not the answer - maybe this is by JackWolf · · Score: 1

      I disagree. Effective filtering that is idiot proof and available on every computer can and will end spam. Make it work well, and make it free. Better yet get it bundled with windows and spam will cease to exist.

    16. Re:filtering not the answer - maybe this is by minas-beede · · Score: 1

      Oh, foo. Think, man - can't it only deliver the test messages? (The answer is: YES.) And there are some spammers who actually do start sending spam if their test messages are merely accepted. I should know: my honeypot got a test message yesterday, I didn't deliver it, the spam came anyway. Still coming today. Fascinating: some single spam messages have over 1000 recipients. Tens of thousands of recipients will not see THIS spam. Sadly, hundreds of thousands will: open relays vastly outnumber open relay honeypots. Set up your own honeypot and you may also be fascinated.

    17. Re:filtering not the answer - maybe this is by MrDemeanour · · Score: 2
      I've written a relay honeypot in Java. It's a real relay, that relays messages only if:
      • there's less than recipients (configurable);
      • the relay request arrived not less than seconds after the previous request.

      It can bounce messages addressed to the local machine, in case the spammer checks for bounces (buggy, at the moment).

      It whitelists relay test-addresses, as specified by its operator, and relays to those addresses even if it thinks it's in a spam-run. It adds any address to which it relays to its whitelist (i.e. it collects relay-test addresses).
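
      The relaying rules above might look roughly like this in Python. This is only a sketch of the decision logic, not the Java program itself; the class name and the default limits (3 recipients, 60 seconds) are placeholder assumptions, since the actual configured values are not given.

```python
class RelayPolicy:
    """Decide which recipients of a relay request are really relayed."""

    def __init__(self, max_recipients=3, min_gap_seconds=60.0, whitelist=()):
        self.max_recipients = max_recipients      # hypothetical default
        self.min_gap_seconds = min_gap_seconds    # hypothetical default
        self.whitelist = set(whitelist)           # known relay-test addresses
        self.last_request = None                  # time of the previous request

    def decide(self, recipients, now):
        """Return the recipients to relay to; everything else is silently dropped."""
        quiet = (self.last_request is None
                 or now - self.last_request >= self.min_gap_seconds)
        self.last_request = now
        if len(recipients) < self.max_recipients and quiet:
            relayed = list(recipients)            # small and isolated: looks like a relay test
        else:
            # Looks like a spam run: only whitelisted test addresses get through.
            relayed = [r for r in recipients if r in self.whitelist]
        self.whitelist.update(relayed)            # collect relay-test addresses
        return relayed


policy = RelayPolicy()
first = policy.decide(["probe@test.example"], now=0.0)
second = policy.decide(["v1@example.com", "v2@example.com",
                        "v3@example.com", "probe@test.example"], now=5.0)
print(first)   # ['probe@test.example']
print(second)  # ['probe@test.example']
```

      In the second request the three victim addresses are dropped, while the address learned from the earlier test is still relayed.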

      It also posts all the data it collects to a website, which it can serve itself (i.e., it's a webserver too).

      It has quite a number of other frills (not all of which are documented yet - it's still in test, but it's getting more stable every day).

      It is a valid objection to a honeypot that does relay test-messages, that it is sending spam. There is a risk of the program being subverted by a spammer. Honeypotting this way isn't for children - you could get complaints for running this program.

      Having said that, you can download the current Beta build at My site. (Damn, how do you get rid of that crap in square brackets???) It's highly configurable, but it runs out-of-the-box on Win NT/2K/ME systems (it needs a JVM, of course).

      Jack.

    18. Re:filtering not the answer - maybe this is by minas-beede · · Score: 1

      "Interesting idea, but easy to verify. Send one thousand emails, and include a verifiable email in it. Check the email a few hours later - if it's not there, then don't use the relay." Easy and obvious. So far most spammers don't. Honeypots are in use now, and have been for some time. The evidence suggests they haven't gotten this smart. When I first ran a honeypot I checked to see if there were duplicate addresses, thinking that they'd lazily use the same address to test. I never found a duplicate and I quit looking. When the Windows honeypot comes out home users with Windows using DSL or Cable can run a honeypot. What would you guess the number of such users to be?

  19. Neural Net Spam Filtering by ShakaUVM · · Score: 3, Interesting

    At UCSD, Bob Boyer and I wrote a neural net spam filter. Neural Nets, as everyone knows, are not really like biological brains, but really just statistical engines similar to the approach the guy above claimed to do.

    Our approach worked pretty well (95-97% accuracy), and we had to deal with the same issues that the above "Bayesian" approach did. I.e., weighting the neurons so that false positives occur much less frequently than false negatives, etc. We built it using data on spam collected from the UCI machine learning repository.

    It ties in with procmail. I'm not really a windows guy, so if anyone knows how to put a filter between an IMAP server and Microsoft Outlook/Netscape Communicator, I'd be interested in hearing how it's done.

    The README for it is at: http://www-cse.ucsd.edu/~wkerney/spamfilter.README
    And you can download it at:
    http://www-cse.ucsd.edu/~wkerney/spamfilter.tar.gz

    -Bill Kerney
    wkerney at ucsd.edu

    1. Re:Neural Net Spam Filtering by Trinition · · Score: 2

      The thing that struck me here is that you chose the 57 attributes up front and determined the value of these attributes for each spam. These values are then the input to the network.

      How did you arrive at these attributes? Are there any others you considered but didn't include?

      Is there any way a neural network, or something else, could perhaps be used to determine other, less-obvious attributes? For example, Paul's filter found that the color #ff0000 (bright red) was a strong indicator of spam. While that is the value of an attribute (value = red, attribute = color), that is the sort of unanticipated tell-tale sign of spam I'm referring to, except I wonder if there are unanticipated attributes to be found.

    2. Re:Neural Net Spam Filtering by ShakaUVM · · Score: 1

      We took the attributes that had been collected from 7,000 pieces of spam from the UCI repository, so that we wouldn't have to collect our own spam and look for common attributes.

      Collecting at least 7,000 spam letters off my hotmail account would take at least, oh, say three days, so that saved me a lot of time. ;)

      There is a trainer included in the package so you can choose whatever attributes you want to look for. The email parser would have to be rewritten a little bit, too, but it's definitely viable.

      Automatic identification of attributes is outside of its scope.

      -Bill

  20. SpamAssassin - duh by Gothmolly · · Score: 3, Interesting

    SpamAssassin works great for me. It eats about 90% of my spam, you just hack up a little procmail file for it, and you're done.

    With so many people using SpamAssassin these days, I can't see how this is a timely or newsworthy item. More like from the been-there-done-that-dept..

    --
    I want to delete my account but Slashdot doesn't allow it.
    1. Re:SpamAssassin - duh by Eric+Seppanen · · Score: 3, Insightful
      Reasons why I don't use SpamAssassin:
      1. It tends to rely on blocklists, many of which have demonstrated unfair practices in the past.
      2. The more SpamAssassin is used, the more spammers will specifically avoid doing things SpamAssassin checks for.
      3. It's a gigantic heap of perl, the Write-Only (tm) language. I hate the fact that every perl program demands I mess up the package manager on my system by blindly downloading a half-dozen new modules. And it's slow!
      4. Bogofilter is better. duh.
      --
      314-15-9265
    2. Re:SpamAssassin - duh by Anonymous Coward · · Score: 0


      Couldn't agree more.

      Try installing SpamAssassin on a solaris box without root access. My trouble all boiled down to bugs in the Perl install package code ... as I was able to eventually verify by searching through groups.google archives.

      Argh ... that was a wasted evening!

    3. Re:SpamAssassin - duh by silversurf · · Score: 1

      I don't think 1/2 of 1/16th of the people here read the article he wrote, or at least understood it. This item is timely and worthy because it's a different approach; anyone who took time to read the article would realize why SpamAssassin is workable but not the greatest implementation and why a statistical model is probably the way most spam filtering will go. He explains, quite correctly, that scoring spam on word scores alone isn't a viable method for the future. Read it, it's worth a look over if you're interested in Spam filtering.

      It's not the end all be all of filtering methods, but the whole idea of blacklists/whitelists just isn't practical in the real world, same for word score filtering methods. It's fine for my home/personal mail and domains, but try implementing a whitelist or word score system in a 10,000 user organization and you'll learn what pain is.

      -s

    4. Re:SpamAssassin - duh by ansible · · Score: 2

      It tends to rely on blocklists, many of which have demonstrated unfair practices in the past.

      True. SpamAssassin does use block lists as part of the score, but you can lower the scores for those, or not use them at all. The scores aren't high enough to kill a message by themselves; I believe the highest score for a block list is 3.0 with the default threshold being 5.0.
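
      To make the arithmetic concrete, here is a small Python sketch of additive rule scoring against a threshold. It is not SpamAssassin's code: the rule names are invented, only the 3.0 and 5.0 figures come from the numbers quoted above, and treating a total equal to the threshold as spam is an assumption.

```python
DEFAULT_THRESHOLD = 5.0

def total_score(hits, scores):
    """Add up the configured score of every rule that matched the message."""
    return sum(scores.get(rule, 0.0) for rule in hits)

def is_spam(hits, scores, threshold=DEFAULT_THRESHOLD):
    """Flag the message once the summed score reaches the threshold."""
    return total_score(hits, scores) >= threshold

# Hypothetical rule names and scores; only the block list value of 3.0 is from the text.
scores = {"ON_BLOCKLIST": 3.0, "SUBJECT_ALL_CAPS": 1.5, "MENTIONS_FREE_OFFER": 1.0}

# A block list hit alone stays under the threshold.
print(is_spam(["ON_BLOCKLIST"], scores))                 # False

# It only tips the balance together with other evidence.
all_hits = ["ON_BLOCKLIST", "SUBJECT_ALL_CAPS", "MENTIONS_FREE_OFFER"]
print(is_spam(all_hits, scores))                         # True

# Setting the block list score to zero takes it out of the decision.
no_blocklists = dict(scores, ON_BLOCKLIST=0.0)
print(is_spam(all_hits, no_blocklists))                  # False
```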

      The more SpamAssassin is used, the more spammers will specifically avoid doing things SpamAssassin checks for.

      And if spammers decide not to send me pr0n or other crap, that's a bad thing?

      The only real problem I've had with SpamAssassin lately is that I'm stuck on version 2.20. My ISP needs to upgrade Perl before I can run more recent versions. :-(

      I'm not a big fan of Perl either.

    5. Re:SpamAssassin - duh by Fweeky · · Score: 2
      It tends to rely on blocklists, many of which have demonstrated unfair practices in the past.

      I've turned them off; it's still 95% effective.

      The more SpamAssassin is used, the more spammers will specifically avoid doing things SpamAssassin checks for.

      As spam changes, so does SpamAssassin. It includes phrase frequency checks etc, too.

      It's a gigantic heap of perl, the Write-Only (tm) language. I hate the fact that every perl program demands I mess up the package manager on my system by blindly downloading a half-dozen new modules. And it's slow!

      Oh, hell, yes. It's really quite nasty code, badly spaghettified and relying on things like looped evals.

      Look at SpamAssassin/PerMsgStatus.pm -- it performs (body regexp rules * body message lines) regexp matches per message. It doesn't take long to see how nasty that'll get on a large message with over 200 rules :)

      Bogofilter is better. duh.

      Mmm, might have a look at that, thanks.
    6. Re:SpamAssassin - duh by mikecarrmikecarr · · Score: 1

      Reasons why your argument sucks:

      1. It tends to rely on blocklists, many of which have demonstrated unfair practices in the past.

      Weight the scores for each of the blocklist tests to 0. (SpamAssassin works out of the box for most people, obviously not for you; this is why we allow for local configs)

      2. The more SpamAssassin is used, the more spammers will specifically avoid doing things SpamAssassin checks for.

      Agreed. You can't really get around that. The spammers can use the same tools as us and customize their marketing to defeat our tools; so we write more tools, they defeat them, etc. We just need to be effective most of the time to make spam protection worthwhile... oh and minimize those false positives too ;)

      3. It's a gigantic heap of perl, the Write-Only (tm) language. I hate the fact that every perl program demands I mess up the package manager on my system by blindly downloading a half-dozen new modules. And it's slow!

      k, don't give me the ``perl is a slow language'' argument because I flat out don't buy it. Your package manager not being able to deal with perl modules is the fault of your package manager; hrm... I need Time::HiRes, so I apt-get install libtime-hires-perl. Seems to work for me.

      4. Bogofilter [sourceforge.net] is better. duh.

      And the only reason that I'm not arguing with you here is that (a) I haven't used Bogofilter and (b) it was written by ESR who freaking rocks my world :)

      If SpamAssassin doesn't do it for you then don't use it. It seems to do a good job for most of us though.

      --

      ID-10-T is a way of life

    7. Re:SpamAssassin - duh by Anonymous Coward · · Score: 0

      ESR is a gun-toting libertarian idiot; who gives a damn about him?

    8. Re:SpamAssassin - duh by khuber · · Score: 1
      I agree, plus SpamAssassin uses an extremely unintelligent weight system. There must be a dozen ways to do better with various AI and statistical techniques. 90% is not good enough.

      -Kevin

    9. Re:SpamAssassin - duh by kirkjobsluder · · Score: 1

      There are some serious limitations to SpamAssassin as an approach for limiting spam.

      1: The addition of new recognized spam phrases must be done manually. I can't just pipe the spam into the program with the option "this is spam." The fact that I can teach Bayesian programs by piping spam into them is a major advantage over rule-based programs.

      2: The definition of what is spam and what isn't spam is in the hands of someone else. SpamAssassin has what I find to be an unacceptably high false positive rate, including some messages that I consider to be important, such as calls for proposals and membership renewals for professional organizations. In contrast, mail that I get because I purchased something online before I discovered spam gourmet seems to get through.

      3: A Bayesian approach is not just about tagging spam, but about differentiating between spam and non-spam. That is a subtle but important distinction. SpamAssassin looks for features that are typical of spam. However, it has no idea what my legitimate email looks like.

      Information that is specific to me becomes an automatic whitelist. Let's say, for example, that I get an email from a new employee in my workplace. The name, telephone exchange and address of my workplace mark that message as having a high probability of not being spam. Bayesian filters are better at detecting personal messages than spam, resulting in a low false positive rate and a higher false negative rate.

      4: Perhaps another reason is that I'm not convinced that rule-based programs scale well. As an example with a non-trivial message: spamassassin takes 4.24 seconds to process an 8k message, while my program (written (rather badly) in python) delivers a verdict in 0.740 seconds. (On a solaris machine I have no control over.) Just sampling a few in my inbox, spamassassin consistently requires about 5x the time of my rather poorly written python script, and I can probably optimize even further. (I suspect that for large messages I do not need to scan the entire message.) I suspect this will get worse as more rules are added to spamassassin.
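      For anyone wondering what "teaching" such a filter amounts to, the following is a minimal sketch, assuming a much-simplified version of the per-word statistics in Graham's essay (it is not the poster's script). The Trainer class, the tokenizer, and the clamping to 0.01..0.99 are illustrative choices; the 0.4 default for unseen words is the figure quoted elsewhere in this thread.

```python
import re
from collections import Counter

TOKEN = re.compile(r"[A-Za-z0-9$'-]+")

def tokenize(text):
    """Lower-cased word-like tokens; a deliberately crude tokenizer."""
    return [t.lower() for t in TOKEN.findall(text)]

class Trainer:
    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def learn(self, text, label):
        """Record which words appeared in one message of the given class."""
        self.counts[label].update(set(tokenize(text)))
        self.messages[label] += 1

    def spam_probability(self, word, unknown=0.4):
        """Estimate how strongly a word indicates spam (0.01 to 0.99)."""
        s = self.counts["spam"][word]
        h = self.counts["ham"][word]
        if s + h == 0:
            return unknown  # never seen: near-neutral score
        spam_freq = s / max(self.messages["spam"], 1)
        ham_freq = h / max(self.messages["ham"], 1)
        p = spam_freq / (spam_freq + ham_freq)
        return min(0.99, max(0.01, p))

trainer = Trainer()
trainer.learn("Click here to make $$$", "spam")
trainer.learn("Meeting at the office, bring the reports", "ham")
print(trainer.spam_probability("click"), trainer.spam_probability("office"))
```

      Piping a message in would then be a one-liner along the lines of trainer.learn(sys.stdin.read(), "spam"), with the counts saved to disk between runs. Because the ham counts come from your own mail, words like your workplace name end up with low scores without anyone writing a rule for them.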

    10. Re:SpamAssassin - duh by sholden · · Score: 1

      I managed with no trouble at all.

      I guess it might require the ability to read a perl faq on how to keep your own packages installed, which means you would need to be able to read, which would imply you need at least a tiny brain.

  21. How do you pronounce "Bayesian" anyways? by mblase · · Score: 2


    While I love everything there is to love about open source (code and ideas), I kind of worry when I read how well all these new Bayesian/Grahamian filtering techniques work.

    Not being a coder or statistician myself, I'm left wondering if the spammers can exploit it for a workaround. Is there something "built in" to these filtering techniques that can be used by spammers to effectively circumvent them?

    1. Re:How do you pronounce "Bayesian" anyways? by KieranElby · · Score: 2, Interesting

      > Is there something "built in" to these filtering techniques that can be used by spammers to effectively circumvent them?

      Yes and no.

      To defeat a bayesian filter, the spammer needs to make his email contain similar words, and combinations of words, to your genuine email, while at the same time making sure that the words used are different to those in known spam.

      So saying 'click here to make $$$' won't work any more, since most of your regular emails don't contain the word combinations 'click here' and 'make $$$', whereas known spam emails will.

      However, we're already beginning to see spammers making their emails less obviously spam.

      For example, the spammer may use an email along the lines of:

      "How's things?

      Have you seen yet?

      Don't forget to mail me those documents.

      Regards,
      A Spammer"

      Even a bayesian filter will struggle to distinguish that from:

      "Have you seen the story on slashdot yet?

      Don't forget those reports.

      Regards,
      Your Boss"

    2. Re:How do you pronounce "Bayesian" anyways? by KieranElby · · Score: 2, Informative

      Oh, and "Bayesian" is pronounced "BAY - ZEE - UHNN".

    3. Re:How do you pronounce "Bayesian" anyways? by nelsonal · · Score: 1

      I think it's similar to Beige combined with "an". That's a short a, I don't know how to represent it here. Others are likely to correct me, but the only way to circumvent them is to make your spams different from other spams and similar to your normal mail. There isn't a single way to get on a good list, since there is no single good list, only attributes that make it more or less likely to be a Spam.

      --
      Degaussing scares the bad magnetism out of the monitor and fills it with good karma.
    4. Re:How do you pronounce "Bayesian" anyways? by PurpleBob · · Score: 2

      I bet they could get around it by picking a few random words from a dictionary and adding them to the end of the spam. If one of them were an obscure word that you've received in one or two legitimate e-mails, the filter would decide "Hey, I've never gotten a spam with the word 'yarborough' in it before, so it must be real".

      --
      Win dain a lotica, en vai tu ri silota
    5. Re:How do you pronounce "Bayesian" anyways? by Anonymous Coward · · Score: 0

      Did you read any of the articles, or even the posts around you?
      New words are assigned a rather neutral value, which means they won't be included in the 15 most interesting words list which is used to determine whether or not it's spam.

    6. Re:How do you pronounce "Bayesian" anyways? by crapulent · · Score: 1

      Incorrect. The algorithm only looks at the top 10 (or 15, I forget) "most interesting" words in the message, where interesting is defined as a score close to 1 or close to 0. So only words that very strongly indicate spam or non-spam are considered. Words that have never been seen before are given a score of 0.4 (or something like that, I don't remember) which makes it all but impossible for them to be considered "interesting."
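      The selection-and-combination step described above can be sketched as follows. This assumes the combining formula from Graham's essay (product of the probabilities over that product plus the product of their complements), 15 interesting words, and 0.4 for unseen words; the function names and the example table are made up.

```python
from math import prod

def most_interesting(probs, n=15):
    """Keep the n probabilities farthest from the neutral value 0.5."""
    return sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:n]

def combine(probs):
    """Combine per-word spam probabilities into one score in 0..1."""
    spammy = prod(probs)
    hammy = prod(1.0 - p for p in probs)
    return spammy / (spammy + hammy)

def score(words, table, unknown=0.4, n=15):
    """Score a message; `table` maps word -> spam probability (0.01..0.99)."""
    probs = [table.get(w, unknown) for w in set(words)]
    return combine(most_interesting(probs, n))

# Hypothetical table: fifteen words that have only ever appeared in spam.
table = {f"spamword{i}": 0.99 for i in range(15)}
plain = score(list(table), table)
padded = score(list(table) + ["yarborough"], table)
print(plain, padded)
```

      The never-seen 'yarborough' scores 0.4, which is only 0.1 away from neutral, so it loses its place to the fifteen words sitting at 0.99 and the verdict does not move. Padding only starts to matter if the added words are ones the filter has already seen mostly in your legitimate mail.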

  22. Well... by ccarter · · Score: 2, Informative

    I hate to give any kind of credit to M$ but they patented the idea of using Bayesian analysis for spam filtering circa 1995. They even had it in one of their betas. However the filters were tagging some of those fricking Blue Mountain greeting cards as spam (imagine that!) so Blue Mountain sued them on anti-competitive grounds and M$ pulled it. Blue Mountain wanted to have the spam filters universally pass Blue Mountain content but MS refused that on the grounds that if a user considers it spam then it is in fact spam to them (Hurray for the "bad guys"!). The lawsuit has been settled/dropped/died for reasons I don't know.

    Anyway I hear that the next version of MSN will have a Bayesian filter and that it will be introduced in an upcoming version of Outlook Express (no idea about Exchange and Outlook).

    BTW I believe internally MS uses this technique for spam control and that they don't seem to have any spam problems.

    1. Re:Well... by Anonymous Coward · · Score: 0

      Nice try, go read the patent. They, just like Paul Graham, did not use genuine Bayesian analysis. They also clearly describe spam filtering in the patent as a classification problem and promptly mention just about every possible type of classifier, including Bayesian classification. They then go on to use a non-Bayesian approach to the classification problem.

      Clearly the patent is an application of existing methods to something that is a clearly understood problem. There is no inventive step whatsoever in the patent. Even if you agree with software patents, this particular patent should never have been granted.

    2. Re:Well... by ccarter · · Score: 1

      "Clearly the patent is an application of existing methods to something that is a clearly understood problem. There is no inventive step whatsoever in the patent. Even if you agree with software patents, this particular patent should never have been granted."

      I don't agree with software patents, patenting software is akin to patenting a spoken language in my book. The idea of granting someone a patent simply on the basis of them saying "a software method for doing (insert common everyday task here)..." is just ludicrous.

  23. Re:I still think passive euthanasia is the best wa by Anonymous Coward · · Score: 0

    Politics in the US is not about the will of the people; it is about the will of the corporations that have the money for lobbying their agenda. The politicians will continue to ignore the people unless the resistance from the people crosses a certain threshold (in this case, when people are bothered enough by spam to ignore other issues that the politician in question might be working on).

  24. Double whammy by Anonymous Coward · · Score: 0

    Hehe, sounds like fun. Maybe I can then capture all the e-mail addresses that get run through my fake mail server, and sell the list back to the spammers.

  25. Re:filtering not the answer - maybe SPOOFSERVERS by saskboy · · Score: 1

    Hey, that is a really cool idea, I wonder if it can really work. It is a new idea to me, so if anyone knows if this is a joke, or a possibility, please let us know?

    Then we need someone to develop some open source code that creates a dead end mail server on whoever installs the program. They should be able to set how much spam their server eats in a night, rated to bandwidth usage. I'd run it as a screensaver.

    --
    Saskboy's blog is good. 9 out of 10 dentists agree.
  26. Why just spam? by KieranElby · · Score: 1

    Sure, spam is a big problem, but right now only 10-20% of my emails are spam, and most are easily identifiable by subject.

    On the other hand, I get hundreds of emails every few days covering a range of topics, which need to be manually sorted into folders.

    What I'd like to see, and I suspect I'm not alone here, is similar software that can sort email into any number of categories, not just spam and non-spam.

    For example, if I have an email folder called 'fishing', containing emails from fishing buddies, then next time I get an email containing references to 'casting', 'trout' and 'it was *this* long', it should be sorted into that folder automatically.

    I'd be curious to know if there's any existing software to do this, and if not, I'd be tempted to have a go at knocking something up to do this.

    One tricky bit would be how to integrate it with the email client. I'd imagine that users wouldn't want to switch away from Outlook/Mozilla/Mutt/Whatever merely for this feature, so it would have to be client-agnostic.

    I'm thinking that implementing a simple IMAP server would be the easiest option since this allows for server-side folder management. It would then be a case of maintaining word counts (Bayesian or otherwise) for each folder, and classifying mail accordingly.
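    The per-folder word count idea can be sketched roughly like this: a small naive Bayes classifier with add-one smoothing, written in Python. The FolderClassifier class and the folder names are hypothetical, the IMAP side is left out entirely, and a real version would need a better tokenizer and persistent counts.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

class FolderClassifier:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # folder -> word -> count
        self.message_counts = Counter()          # folder -> messages seen
        self.vocabulary = set()

    def learn(self, folder, text):
        """Add one already-filed message to the counts for its folder."""
        words = tokenize(text)
        self.word_counts[folder].update(words)
        self.message_counts[folder] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        """Return the folder with the highest log-probability, or None."""
        words = tokenize(text)
        total_messages = sum(self.message_counts.values())
        best_folder, best_score = None, -math.inf
        for folder, counts in self.word_counts.items():
            # Prior: share of mail already filed in this folder.
            score = math.log(self.message_counts[folder] / total_messages)
            # Add-one smoothing so unseen words do not zero the product.
            denom = sum(counts.values()) + len(self.vocabulary)
            for w in words:
                score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_folder, best_score = folder, score
        return best_folder

clf = FolderClassifier()
clf.learn("fishing", "Casting for trout at the lake, it was this long")
clf.learn("work", "Quarterly reports due, meeting at the office")
print(clf.classify("Caught a trout while casting"))
```

    Training is just a walk over the existing folders calling learn() for each message, so there are no hand-written rules to maintain.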

    Anyone else had any thoughts along these lines?

    1. Re:Why just spam? by McFly777 · · Score: 3, Informative

      Easy. Just re-run the spam filter on your 'cleaned' mail using a ruleset generated by splitting the mail into topical vs. everything else.

      --

      McFly777
      - - -
      "What do people mean when they say the computer went down on them?" -Marilyn Pittman
    2. Re:Why just spam? by GigsVT · · Score: 1

      What I'd like to see, and I suspect I'm not alone here, is similar software that can sort email into any number of categories, not just spam and non-spam.

      You must run Windows. Try a modern OS sometime. This has been a standard feature for years.

      --
      I've had enough abrasive sigs. Kittens are cute and fuzzy.
    3. Re:Why just spam? by nelsonal · · Score: 1

      I use a series of rules in outlook to do something similar. I don't know if other email programs support this feature to the same level. But in outlook you can create rules to move email to a folder, or delete it, reply to it, etc. based on sender, words in the subject, body, or nearly any other attribute. Mine is to sort email into company folders, as I work for a pension fund, and receive 50+ research emails a day. It also keeps my inbox empty, since the company folders are off the exchange server. Most of the people deleted anything unread (usually several hundred) more than a week old before I showed them how to use rules to sort the stuff. Email me if you want more info about them, they are under the tools menu. Not quite as good as a Bayesian solution, but pretty good nonetheless.

      --
      Degaussing scares the bad magnetism out of the monitor and fills it with good karma.
    4. Re:Why just spam? by shrikel · · Score: 2, Informative

      Have you tried MS Outlook? It's got extensive rule-based sorting capability. It doesn't work for IMAP, and you mentioned IMAP later in your message, but it's not clear that that's all you're dealing with.

      --
      Any sufficiently simple magic can be passed off as mere advanced technology.
    5. Re:Why just spam? by KieranElby · · Score: 1

      Sorry, I should clarify what I meant. I'm aware that email agents/clients exist that can do classification based on programmatic rules.

      What I want is something that can do this statistically by looking at the existing contents of my email folders; i.e. without the need to set up an inevitably somewhat fragile set of rules.

    6. Re:Why just spam? by KieranElby · · Score: 1

      Sorry, I should clarify what I meant. I'm aware that email agents/clients exist that can do classification based on programmatic rules defined by the user.

      What I want is something that can do this statistically (possibly Bayesian) by looking at the existing contents of my email folders; i.e. *without* the need to set up an inevitably somewhat fragile set of rules.

    7. Re:Why just spam? by Seth+Golub · · Score: 1
      What I'd like to see, and I suspect I'm not alone here, is similar software that can sort email into any number of categories, not just spam and non-spam.

      Any content-based classifier that works for spam/non-spam could also work for other categories, though the signals, and therefore the accuracy, might be different.

      But enough theory. What you want is ifile. It does exactly what you describe.
      ifile is a general mail filtering system that works with a mail client to intelligently filter mail according to the way the user tends to organize mail. ifile uses the machine learning algorithm Naive Bayes to classify e-mail documents.
    8. Re:Why just spam? by Anonymous Coward · · Score: 0


      Sorry, I should clarify what I meant. I'm aware that email agents/clients exist that can do classification based on programmatic rules.

      No, you were clear. But of course, that never gets in the way of a frothing Linux zealot when he has an opportunity to (falsely) bash windows.

    9. Re:Why just spam? by GigsVT · · Score: 1

      frothing Linux zealot

      Mmmmmm, frothy...

      --
      I've had enough abrasive sigs. Kittens are cute and fuzzy.
  27. Brain exploded by operagost · · Score: 2, Funny
    Note to statisticians: the product of the probabilities is monotonic with the Fisher inverse chi-square combined probability technique from meta-analysis. The null hypothesis is that the probabilities are independent and uniformly distributed.
    Ouch! My brain is hurting, Doc!
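    For anyone else whose brain hurts: a rough sketch of what the quoted note refers to, assuming the textbook form of Fisher's method (this is not Robinson's code). Under the null hypothesis, -2 times the sum of ln(p) over n independent uniform p-values follows a chi-square distribution with 2n degrees of freedom, and for an even number of degrees of freedom the tail probability has a closed-form series. The function names are made up.

```python
import math

def chi2_survival_even(x, dof):
    """P(X >= x) for a chi-square variable with an even, positive dof."""
    if dof <= 0 or dof % 2:
        raise ValueError("dof must be a positive even integer")
    m = x / 2.0
    term = math.exp(-m)        # first term of the series: e^(-x/2)
    total = term
    for i in range(1, dof // 2):
        term *= m / i          # builds e^(-m) * m^i / i! incrementally
        total += term
    return min(total, 1.0)

def fisher_combine(probs):
    """Combine independent p-values into one with Fisher's method."""
    statistic = -2.0 * sum(math.log(p) for p in probs)
    return chi2_survival_even(statistic, 2 * len(probs))

low = fisher_combine([0.01, 0.02, 0.05])
high = fisher_combine([0.5, 0.5, 0.5])
print(low < high)
```

    The statistic depends on the inputs only through the sum of their logs, which is the log of their product. For a fixed number of inputs, a smaller product therefore always gives a smaller combined value, which is all "monotonic with" is saying.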
    --

    Gamingmuseum.com: Give your 3D accelerator a rest.
  28. Re:I still think passive euthanasia is the best wa by ivan256 · · Score: 3, Informative

    For once a restrictive legislation would get 99% support... you don't see that every day. Like I mentioned before, I don't get our politicians, they say they work for us, they try to find clever ways to tax us, remove control that we used to have and all, but something on which they would get unprecedented support, they are simply sitting on the issue...


    Perhaps the problem is that the law would gain them fewer votes than a few hundred thousand dollars in campaign financing would. A large portion of the population isn't online, and a large portion of those who are don't care about spam, so your politician doesn't care either.

    Since this is such a trivial technical problem to solve, it's not really a big deal either way. I daily reduce 800 spam messages to five or six that make it through to my inbox just using procmail scoring, and I haven't had a false positive in years. I spend five minutes updating my procmailrc every six months to keep it effective. I suppose that I could use an automated system to generate my score file similar to what Paul Graham described, but when I only spend ten minutes a year updating my rules, it's going to be a lot of years before it was faster to have written all that code. No need for sweeping legislation.

  29. Re:I still think passive euthanasia is the best wa by ch-chuck · · Score: 1

    Ah, all they have to do is say something about restricting free speech and all the angry ballbats go limp. Spamassassin: works for me.

    --
    try { do() || do_not(); } catch (JediException err) { yoda(err); }
  30. anti-spam laws by McFly777 · · Score: 2

    While in many respects I agree that "There oughta be a Law" against spam, there are some problems with that approach. Not the least is that generally a social solution is much better (or at least has fewer side effects) than any law that a government will enact.

    Laws have the distinct problem of either going too far (false positive) or being too weak and thereby legitimizing the spam that would manage to work through the loopholes. Taken to the extreme that seems to commonly occur in the US legal system, I can envision spammers suing ISPs for blacklisting their "legit per US act ####" spam.

    I would much rather see statistical methods such as those being discussed. These, combined with "whitelist" methods, seem to work very well by all accounts.

    --

    McFly777
    - - -
    "What do people mean when they say the computer went down on them?" -Marilyn Pittman
  31. Mmmm, I wouldn't try it by mblase · · Score: 2

    This need not even impact your own bandwidth.

    Last week (I can't find the article yet), Slashdot had a link to a column by someone who was (in his opinion) unjustly blacklisted for hosting an easily-accessible mail server. The moment his name hit that blacklist, he became a target for what may as well be every spammer on the planet. Even though he didn't actually have an open relay (just an easily-guessed password), the incoming traffic from so many e-mail spammers effectively brought his server to its knees. Changing his domain name and IP address was the only cure.

    Building a "honeypot" mail server for spammers is appealing, but could be more trouble than it's worth, especially since it's more or less irreversible. I'd advise against it.

    1. Re:Mmmm, I wouldn't try it by GigsVT · · Score: 2

      He was running an open relay. He was too ignorant to know it.

      --
      I've had enough abrasive sigs. Kittens are cute and fuzzy.
    2. Re:Mmmm, I wouldn't try it by Deagol · · Score: 2
      I've always wondered what it would take to modify a sendmail or postfix configuration to become a "mail sink". Sure, there are tarpits that slow spammers down, but why not make a server that acts and smells like an open relay, but simply dumps the mail to /dev/null and tells the sender they were delivered? Maybe bandwidth might be an issue, but it may be more effective than a tar pit.

      A human watching over his spam software might notice if the target relay is delivering at a rate of 1 message per day and find another. If, however, he sees that the "server" is ripping through deliveries at a massive rate, he might stay with that server and all of his spam will vanish into the bit bucket.
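      Not a sendmail or postfix configuration, but the "mail sink" behaviour itself can be sketched as a standalone Python server. Everything here is an assumption for illustration: the SinkHandler and SinkServer names, the sink.example greeting, and the choice to answer 250 to every command. It speaks only enough SMTP to accept a message and then discards it.

```python
import socketserver
import threading

class SinkHandler(socketserver.StreamRequestHandler):
    """Accept whatever the client sends, say it was queued, keep nothing."""

    def reply(self, text):
        self.wfile.write(text.encode("ascii") + b"\r\n")

    def handle(self):
        self.reply("220 sink.example ESMTP")
        in_data = False
        for raw in self.rfile:
            line = raw.decode("latin-1").rstrip("\r\n")
            if in_data:
                if line == ".":            # lone dot ends the message body
                    in_data = False
                    self.server.swallowed += 1
                    self.reply("250 OK: queued")
                continue                   # body lines are simply dropped
            verb = line.split(" ", 1)[0].upper()
            if verb == "DATA":
                in_data = True
                self.reply("354 End data with <CR><LF>.<CR><LF>")
            elif verb == "QUIT":
                self.reply("221 Bye")
                break
            else:                          # HELO, EHLO, MAIL, RCPT, ...
                self.reply("250 OK")

class SinkServer(socketserver.ThreadingTCPServer):
    allow_reuse_address = True
    daemon_threads = True
    swallowed = 0                          # messages accepted and discarded

def start_sink(host="127.0.0.1", port=0):
    """Start the sink in a background thread; port 0 picks a free port."""
    server = SinkServer((host, port), SinkHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

      Calling start_sink() binds to localhost on a free port, which is enough to poke at it with a mail client. As other replies in this thread point out, a sink that never delivers anything will fail a spammer's relay test, so a real honeypot needs more than this.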

    3. Re:Mmmm, I wouldn't try it by frovingslosh · · Score: 2, Offtopic
      Actually, he did have an open relay, he just wants to hide behind a lame claim that it wasn't open because the spammer had to lie to use it! Imagine that, a spammer lying. He was a lawyer, and we know they never lie ;-). IMHO he got a lot less punishment than he deserved.

      And what was the reported problem he cried about? Not an overload on his network, that was not his complaint. But his domain name being blacklisted. With good reason, IMHO. He was running a server that spammers used, and could even see this when the people he invited to test his system got right in. He then claimed they misused his system because they gave a false name and suggested he should sue them!

      Maybe this guy was just too stupid to block a port on an incoming firewall to keep the outside mail server users out. It seems unlikely though, particularly if he had the ability to set up a mail server (supposedly for the use of his own local network). It sounded more to me like there was a good chance he knew exactly what he was doing and wanted to set up a server for spamming, and was blowing smoke when he got black holed.

      Getting black holed will not be a problem for a dummy server that never actually sends mail (the black hole people are not out there port scanning like the spammers are). Even if your dummy mail server were to be blacklisted, so what? That in no way would affect your normal e-mail that you send through your service provider.

      --
      I'm an American. I love this country and the freedoms that we used to have.
    4. Re:Mmmm, I wouldn't try it by FFFish · · Score: 2

      ...and into a Bayesian filter mangler, providing it with a diet of 100% unadulterated spam. The filter can then be distributed a la virus updates...

      --

      --
      Don't like it? Respond with words, not karma.
    5. Re:Mmmm, I wouldn't try it by AndroidCat · · Score: 2
      Google and Googlegroups for spam honeypot.

      And search Slashdot too. I think there was an article about a Russian honeypot a few months ago. Because of bandwidth costs, they "throttled down" their honeypot to reduce the truly huge amount of hits by clueless spammers. (But I repeat myself..)

      There are arguments both ways about relay honeypots. The downside is that you have to let some relay tests go through so that when the spammer tests it, the tests go through. But then when the actual spam-run happens, it has to choke it off completely.

      --
      One line blog. I hear that they're called Twitters now.
    6. Re:Mmmm, I wouldn't try it by GigsVT · · Score: 1

      Some ISPs scan for open relays and shut you down if they find one. Of course such an ISP sucks, but it is another risk I have seen no one mention.

      --
      I've had enough abrasive sigs. Kittens are cute and fuzzy.
    7. Re:Mmmm, I wouldn't try it by minas-beede · · Score: 1

      I've run such a honeypot for 2 1/2 years. It is a combined server/honeypot but I don't recommend doing that. As someone else said, use a dedicated server.

      At one time all you needed to make sendmail be a honeypot was to run it as sendmail -bd. That's no longer true, and even if you did that there was a manual step needed now and again: what makes a honeypot powerful is to deliver the spammer relay tests.

      More here: http://fightrelayspam.homestead.com/

      and a new development is in the works (by someone else) that will be a giant leap forward.

    8. Re:Mmmm, I wouldn't try it by minas-beede · · Score: 1

      "Building a "honeypot" mail server for spammers is appealing, but could be more trouble than it's worth, especially since it's more or less irreversible. I'd advise against it."

      Too late: I already do it. Mine is a combined server/honeypot (the honeypot grew out of my way of voiding open relay for the server.) What is this "irreversible" bit? If you are on the network, can afford the traffic, and have a spare unix/linux box and spare IP you can run a honeypot. You may see a lot of spam, you can do some real damage.

      What I see now is almost exclusively spam that comes through open proxies so you now no longer have information on the spammer himself. That awaits development of the open proxy honeypot (this is a sideways invitation for you to do that.)

      Spammers send relay tests all over - check your email logs, if you're a system manager. Accept and deliver just one of those and the spammer will probably conclude the IP is an open relay. If you can't guess what he'll do next you haven't been paying attention.

      Do a Google search for "corpit honeypot" and look at the cached page. That was the Moscow honeypot run by Michael Tokarev. I can tell you that many spammer dialup accounts got nuked because of that web page. The count of 3.6 million spam recipients protected is low: the counter got reset a couple of months before the cached page. It's more like 10 million.

      There have only been a few relay spam honeypots, and some of those (European ones, IT, NL) are very quiet about their existence and about what they see. It is quite likely that if you run a honeypot you will see something that no one else has yet reported. It is well worth doing if you wish to help end relay spam.

  32. Why is spam still a problem? by Anonymous Coward · · Score: 0

    I don't get it. Simply allow incoming email only from user names you know. Period.
    Why is this hard to understand?

    1. Re:Why is spam still a problem? by Anonymous Coward · · Score: 0

      Easily understood.
      Fairly easily implemented.
      Largely unused... since it's a restriction that requires someone already know you or you know them, but not all non-spam e-mail is between known parties. And adding the barrier of a confirmation step for an average person - when most are not expecting it - gets a "Why should I waste my time with this isolationist bozo?" response.

    2. Re:Why is spam still a problem? by erikdotla · · Score: 1

      I disagree with you. I have a more thorough thread on this below you might want to check out.

      My position is that there is already an authorization step for virtually all senders of email - you need to get the recipients address.

      Most people are careful not to publish their address for obvious reasons. To get someone's address, you have to ask them in person somehow.

      I use this system, and I haven't had any problems when I ask someone "by the way, what address will you be sending from? I need to add you to my list." Most don't ask why since most people don't ask for someone's address and refuse to give their own - it's stupid anyway, since pretty soon you'll see their from address in their message.

      Anyone who doesn't want you to know their address before sending a message is probably malicious and you don't need their email anyway. Anyone who doesn't mind will gladly give it to you when they ask you for your email address.

      --
      # Erik
  33. spam is already keeping up? by Anonymous Coward · · Score: 0

    I've noticed in the past 2-3 weeks that the look of the spam I've received is a lot more like regular mail.
    eg:
    ---
    carpet

    Your home refinance loan is approved!

    To get your approved amount go here.

    To be excluded from further notices go here.

    carpet 5gate 1932zIgl2

    ---

    It's still identifiable as spam with a probability filter, but it's not that far removed from a legitimate mail an AOL dork might send or receive. (not that I care about them getting spammed!)

    1. Re:spam is already keeping up? by Mushy · · Score: 1

      You must read Paul Graham's article closely. It is still identifiable as spam cuz of the phrases. As for AOL dorks sending out emails like that, I hope you don't have friends who send emails like that to you. If you are including header data in your checking routine too, your friends will also get excluded since they'll have nonspam emails in the training data.

  34. Bayesian Filtering Works by CleverFox · · Score: 1

    I have implemented Paul Graham's algorithm at my corporation, and it is blocking 90-97% of our spam each day. It is "good stuff". Combine that with Razor v2 and some other filtering I do, and nary a spam gets thru.

  35. Re:filtering not the answer - maybe SPOOFSERVERS by netringer · · Score: 4, Insightful

    I'm fairly sure a false relay won't work. Just like snail mail list sellers, the spammers salt their victim lists with their own valid addresses that they can check to see if the message is getting out.

    BUT, an early spam filter at an ISP worked just like that. The design parameters were 1) that spam filtering require no more resources than actual delivery of the message, and 2) the filter give no indication to the spammer that the message was not going to be delivered. This gives the spammer no feedback and forces THEM to waste CPU cycles which will slow them down.

    --
    Ever dream you could fly? Get up from the Flight Sim. I Fly
  36. Could someone tell me... by Metallic+Matty · · Score: 1

    ... what exactly bayesian means?

    1. Re:Could someone tell me... by Broccolist · · Score: 2

      Based on the probability theory of Thomas Bayes, an 18th century philosopher. In a nutshell, whereas orthodox statistics emphasises a "distribution function" that wholly describes everything about a random variable, Bayesian methods take a more ground-up approach that works with incomplete information and revises probabilistic beliefs as new evidence comes to light. It's been attracting a lot of attention and research lately.
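      As a tiny worked example of "revising beliefs as new evidence comes to light", here is Bayes' rule applied twice in Python. The starting belief and the per-word likelihoods are made-up numbers, and bayes_update is just an illustrative name.

```python
def bayes_update(prior, likelihood_if_true, likelihood_if_false):
    """Return P(hypothesis | evidence) using Bayes' rule."""
    numerator = likelihood_if_true * prior
    return numerator / (numerator + likelihood_if_false * (1.0 - prior))

belief = 0.5  # made-up starting belief that a message is spam
# Made-up pairs: (P(word appears | spam), P(word appears | not spam))
for p_if_spam, p_if_ham in [(0.30, 0.01), (0.20, 0.05)]:
    belief = bayes_update(belief, p_if_spam, p_if_ham)
print(belief)
```

      Each piece of evidence multiplies the odds by its likelihood ratio (30 and then 4 here), so even odds become 120 to 1 after two observations.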

  37. authorization based email box by erikdotla · · Score: 1

    I realized one day that filtering spam out by content is a futile exercise. I use a simple method that has worked perfectly: If the FROM address of an incoming message is not in my contact list, the message is Trashed. Before emptying the trash, I'll glance through it to be sure that I didn't receive a legitimate message from someone not in my list. Since I've used this, not one spam has ever appeared in my Inbox. This is important since I use mobile devices and other strange ways to access my email that would be very sensitive to spam overload.

    Fact is, 99.999% of email I receive is either 1.) From people already on my contact list, or 2.) From people who inform me they're going to send an email. Before I give out my address, I inform them that I need to know their address first, and add it to my contact list. If someone gets my email from someone other than me, or otherwise didn't talk to me first, I probably don't want their email anyway. And if it's important, they'll get in touch with me.

    I'm using Outlook for this solution and use a rule that moves all the messages out of the Inbox that don't meet this criterion. I plan to switch to Evolution soon under Mandrake and I'm sure I can program a similar function.

    It's much easier to spot 1 message from a legitimate sender out of 100 spams (takes only a few seconds in fact) than it is to manually delete spams or constantly fiddle with filters. Each day, I'll glance at the list of 100-200 spams that have collected in my trash box, and within a few seconds, I can spot if someone I know who isn't in my list has sent me something. From that point forward, they're in my contact list, and it never happens again.

    At some point I plan to set up an auto-reply system that gives people a URL that they can visit to "ask for permission" to send me email. Spammers won't use it. I haven't bothered yet because I'll need to carefully design this to prevent my address from being "confirmed" by spammers as a result of this message, but I have ideas for that (send from a null account, use a picture of my email address in the message, with instructions on how to ask permission.) At that point, I can safely instant-trash all unrecognized senders.

    I'd love some feedback on this method. It's worked great for me, though admittedly it won't work for those who receive many emails from new contacts, such as someone who publishes (eek!) their address on a site for inviting new messages.
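
    The contact-list rule described above can be sketched in a few lines of Python. This is only an illustration, not the Outlook rule itself: the function name `sort_message`, the folder names, and the sample addresses are all made up for the example.

```python
from email import message_from_string
from email.utils import parseaddr


def sort_message(raw_message, contacts):
    """Return 'Inbox' if the From address is a known contact, else 'Trash'."""
    msg = message_from_string(raw_message)
    # parseaddr splits 'Name <addr>' into (name, addr); addr is '' if absent.
    _, address = parseaddr(msg.get("From", ""))
    return "Inbox" if address.lower() in contacts else "Trash"


# Hypothetical contact list, stored in lowercase.
contacts = {"alice@example.com"}
known = "From: Alice <Alice@Example.com>\nSubject: lunch\n\nNoon?"
unknown = "From: deals@spam.example\nSubject: BUY NOW\n\nCheap!"

print(sort_message(known, contacts))
print(sort_message(unknown, contacts))
```

    Note that the From header is supplied by the sender, so a rule like this trusts whatever address the message claims.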

    --
    # Erik
    1. Re:authorization based email box by Hayzeus · · Score: 1
      The main problem with this approach is that it's a little awkward for those of us who frequently receive (non-spam) email from strangers. I get a lot of these.

      There are a number of people who use your method, but automate it, which is a better way to go but still a bit awkward. Incoming emails not on the reply list generate a reply requesting the original sender to go to a web page, which allows them to enter themselves on the contact list automatically. Conceivably, the URL can contain a "web bug" that merely requires the sender to visit the link to have the add happen automatically. (Of course, some SPAM filters will block email containing web bugs...)

      The best results I've seen personally involve spamassassin, which cuts my incoming volume from about 70 spams/day to 1 spam every 3 or 4 days. Highly recommended for perl/procmail-capable platforms.

      All of this, of course, only addresses the problem at the level of the individual user. The larger problem is not likely to be solved by any means short of legislation.

    2. Re:authorization based email box by Anonymous Coward · · Score: 0
      Before emptying the trash, I'll glance through it to be sure that I didn't receive a legitimate message from someone not in my list.

      LOL. So you are forced to wade through the crap anyway. What a great system.

      How does the fact that it's in "Trash" instead of "Inbox" make any difference if you have to skim over everything anyway?

    3. Re:authorization based email box by erikdotla · · Score: 1

      It makes a huge difference.

      I receive approximately 50 legitimate emails per day and 100-200 spams. If these were all intermixed in my inbox, deleting the spams by hand would be a major chore.

      Glancing at a container filled with almost 100% spam to spot an out-of-place legitimate message takes only a second or two instead of 15 minutes.

      After your contact list truly evolves to contain everyone you email, it gets to the point where you don't even have to look as hard in that container, since everything tends to be spam. Most days, I just nuke it without looking and if I miss one or two non-spams, no big deal.

      Spam filters are just as likely to junk just as many legitimate messages as my system does. So you still have to look through it. How is a filter any better? Only a perfect filter would allow you to automatically delete the junk, but there is no such thing. Bayesian or whatever, it just won't ever happen. The spammers will always find a way around it.

      But there is a more important point to all this that I think is being missed. My method only requires me to look into the trash box because I'm trying to be accommodating to those senders who I haven't emailed in a while and maybe don't have on my contact list. But, if this were a more standard and accepted method (just the general idea of authorization-based email) people would make an effort to be sure the recipient "knows" them before sending, because they would know that their message will otherwise be dropped.

      Given any other spam solution, if we all used it, it wouldn't increase its effectiveness. With this solution, if we all used it, it would be perfect.

      Additionally, it could be integrated more tightly with the mail server to check the contact lists of everyone on the server, and not even deliver the message at all if it's unrecognized. As I heard recently, the SMTP connection could be dropped before the entire message is even transmitted. In other words, pause receiving after the MAIL FROM line until it's sure that the sender is authorized, and otherwise, drop the connection.

      If I changed my email address tomorrow (which I'm considering), I could implement auto-delete and never see them. Why? Because anyone emailing me is going to have to ask me for my email, thus getting on my contact list. Nobody is going to find me otherwise, and the system becomes perfect. If you happen to be in a position where your email address is changing anyway, you're in a unique position to implement this with perfect results.

      Thanks for the comments and I'd like to discuss this more.

      --
      # Erik
    4. Re:authorization based email box by erikdotla · · Score: 1

      Another quick thought:

      When integrating it into the SMTP server, this is better than using an auto-reply to inform senders of how to get authorized.

      When the connection is dropped, the 'reason' line reported by the mail server could be 'he doesn't know you! call him.' Thus, relying on existing SMTP technology to send this communique back to the sender without using any kind of auto-reply function that just consumes bandwidth and has other drawbacks.
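
      A rough sketch of that server-side decision, written as a plain Python function rather than a real mail-server plugin. The function name `mail_from_reply` and the sample addresses are hypothetical; the only SMTP facts assumed are that a 250 reply accepts the sender and a 550 reply is a permanent rejection whose free-form text is typically reported back to the sender by the sending server.

```python
def mail_from_reply(sender, contacts):
    """Pick the SMTP reply to give right after the MAIL FROM command."""
    if sender.lower() in contacts:
        return "250 OK"
    # 550 is a permanent failure; the text after the code is free-form.
    return "550 He doesn't know you! Call him."


# Hypothetical server-wide contact list, stored in lowercase.
contacts = {"friend@example.org"}

print(mail_from_reply("Friend@example.org", contacts))
print(mail_from_reply("bulk@spam.example", contacts))
```

      Because the rejection happens before the message body is sent, no auto-reply message has to be generated at all.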

      I fully plan to write this system into my mail server and see no reason that it will not work. Unfortunately I'm on Exchange and don't know much about programming it, so I'll be doing this when I can switch to a Linux based email server. I'm just waiting for Mandrake 9... :)

      --
      # Erik
    5. Re:authorization based email box by electric_penguin · · Score: 1
      If I changed my email address tomorrow (which I'm considering), I could implement auto-delete and never see them. Why? Because anyone emailing me is going to have to ask me for my email, thus getting on my contact list. Nobody is going to find me otherwise, and the system becomes perfect. If you happen to be in a position where your email address is changing anyway, you're in a unique position to implement this with perfect results.

      That of course assumes that none of your friends email you and some other person together, or forward your messages on to some chain letter. It only takes one of these "friends" for your address to escape back into the wild.

    6. Re:authorization based email box by erikdotla · · Score: 1

      True but that's not the point.

      The point is that once it's changed, anyone I want to email me will have to contact me and will be on my list, and I will be absolutely certain that everything else is spam (without having to glance at it before deleting it.)

      If it escapes into the wild, who cares - that's the whole point.

      One caveat is that I must inform my friends not to give my address to my OTHER friends - otherwise I'll never know about them and they don't end up on my list.

      Granted, you have to be a bit of an isolationist but most people who I chat with about this as they're asking me for authorization are fascinated when I tell them that it removes 100% of spam, and want to learn more.

      Anyway, I'm open minded. I try filters to further reduce my spam base but they just don't work. Give me an alternative that is 100% guaranteed and I'll switch. This is the only option right now.

      --
      # Erik
    7. Re:authorization based email box by erikdotla · · Score: 1

      I'd like to clarify. I said "true" when I didn't mean that.

      It does NOT assume anything, especially hoping that my friends won't email anyone.

      Regardless of whether I change my address or not, I could publish my address openly on a site right now with no fear, and let the robots harvest it and end up on every list in the world. I don't care, I'm not going to see any of that anyway. My friends can forward me to anyone they want.

      I've seen friends of mine take part in a favorite pastime: Signing up enemies' email addresses into spam lists and porn lists by hand. They could even do that to me, and it wouldn't matter one bit.

      --
      # Erik
  38. keyword matching isnt the answer by mack+knife · · Score: 2, Interesting

    sites like yahoo, hotmail, etc are in a unique position to rid their users of spam.

    i don't see why they cant implement some system that scans incoming mail for its users' mailboxes, maybe does a checksum for each message or something, and if it finds that a number of its users are receiving exactly (or nearly exactly) the same message, assume it's spam. nuke the messages, and any new incoming ones.

    yeah, if such a system only scans a small number of mailboxes, it may filter out mailing list posts and so on. but it gets more and more reliable the higher number of mailboxes it tracks.

    this avoids searching for certain keywords and eliminates false positives. after all, how well would these keyword searching methods do if i were to quote a spam message in an email to a friend?
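
    A minimal sketch of the checksum idea, assuming the provider can see each delivery as a (mailbox, body) pair. The names `normalize` and `find_bulk_messages` and the threshold of three mailboxes are invented for the example; it only catches copies that are identical after collapsing case and whitespace, so "nearly exactly the same" messages would need some fuzzier fingerprint.

```python
import hashlib
from collections import defaultdict


def normalize(body):
    # Collapse case and whitespace so trivially altered copies hash alike.
    return " ".join(body.lower().split())


def find_bulk_messages(deliveries, threshold=3):
    """deliveries: iterable of (mailbox, body) pairs.

    Returns the checksums seen in at least `threshold` distinct mailboxes.
    """
    seen = defaultdict(set)
    for mailbox, body in deliveries:
        digest = hashlib.sha256(normalize(body).encode("utf-8")).hexdigest()
        seen[digest].add(mailbox)
    return {d for d, boxes in seen.items() if len(boxes) >= threshold}


deliveries = [
    ("ann", "MAKE MONEY FAST"),
    ("bob", "make money   fast"),
    ("cat", "Make Money Fast\n"),
    ("ann", "see you at lunch"),
]
bulk = find_bulk_messages(deliveries)
print(len(bulk))
```

    As the post says, a low threshold on few mailboxes would also flag mailing list posts; the threshold only becomes meaningful when many mailboxes are tracked.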

    1. Re:keyword matching isnt the answer by nelsonal · · Score: 1

      They have a vested interest in seeing your mailbox fill up, so they can sell you a larger more expensive mailbox. So any spam killing and additional users must be balanced against the slower fill up rate.

      --
      Degaussing scares the bad magnetism out of the monitor and fills it with good karma.
    2. Re:keyword matching isnt the answer by Anonymous Coward · · Score: 0

      True, but it is easier to simply drop the free capacity, as Hotmail did. Also, as users abandon addresses, they are still spammed and occupy disk space. So in the end I bet it costs them more to allow spam.
      As far as the original question goes, ATT uses Brightmail which does exactly that.

  39. optimistic by McFly777 · · Score: 2

    I think the original poster's point would be to make commercial e-mail illegal unless properly tagged. That way an untagged spam could be handed over to the FBI and treated like wire-fraud or something.

    Big problem would be prosecuting the spammer. Either they would all move overseas or the court would be so backlogged as to become ineffective.

    --

    McFly777
    - - -
    "What do people mean when they say the computer went down on them?" -Marilyn Pittman
    1. Re:optimistic by sstory · · Score: 1

      yeah, that is what i meant. Thanks for clarifying for these people. If spam is illegal in general, it'll be a misdemeanor. If only a special class of violations are illegal, the penalties can be harsher, etc. I suppose I should have realized that some people need it spelled out for them. I just need to avoid those kinds of people by filtering out score 0 comments.

  40. SPAM by Anonymous Coward · · Score: 0

    What you call SPAM I call creative marketing, besides someone has to get this economy going?

  41. Bayesian vs not isn't really the point by XDG · · Score: 4, Insightful
    Gary is both right in some respects and irrelevant in others. Here's the key line in his article that deflates it a bit:
    It is untested as of now. It is based purely on theoretical reasoning. If anyone wants to try and it test it in comparison to other techniques, I'd be very interested in hearing the outcome.
    On the other hand Paul Graham has actually tested his model and it works. I've worked it up in perl and tested it on my own data set and it works there, too. Paul acknowledges that he's being a bit fast and dirty, but the proof is in the pudding. The rest is just academic quibbling over the fine points.

    I'm not sure why this particular article needed to be posted, as it's just one of several alternative approaches and an untested one at that. On Paul's page, he also lists several published academic papers with other alternatives -- all actually tested, of course.

    Gary is basically right in questioning the use of the word "Bayesian". Paul's approach is more about weighing "evidence" as given by the appearance of certain words, rather than in figuring out the probability of spam assuming a "prior". See Paul's explanation, but if you check the article he references at the end, you'll note that the method Paul uses is only one of several methods to solve an underspecified problem. It's a reasonable guess, not necessarily the only guess.

    Looking at another article Paul references, given the word independence assumption, the more formal Naive Bayesian approach calculates as follows:
    p(spam|word1,...,wordn) = [ p(spam)*p(word1|spam)*...*p(wordn|spam) ] / [ p(spam)*p(word1|spam)*...*p(wordn|spam) + p(!spam)*p(word1|!spam)*...*p(wordn|!spam) ]

    This is similar to Paul's approach except for including a "prior" assumption of p(spam) -- the expected probability of any email being spam, calculated from the historically observed frequency of spam. By leaving it out, Paul implicitly assumes that 50% of mail is spam -- that's his "prior" estimate of the spam rate. Given the other adjustments he makes to his sample, that appears to be acceptable in practice. (Paul overweights the spam prior, but also overweights the effects of "good" words.)
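
    To make the role of the prior concrete, here is a small Python sketch of the Naive Bayes formula above. The function name and the example likelihoods are made up; the only point is that moving the prior away from 0.5 changes the answer for the same words.

```python
import math


def naive_bayes_spam_probability(prior_spam, likelihoods):
    """likelihoods: list of (p(word|spam), p(word|!spam)) pairs, all > 0."""
    # Work in log space so long products of small numbers do not underflow.
    log_spam = math.log(prior_spam)
    log_ham = math.log(1.0 - prior_spam)
    for p_given_spam, p_given_ham in likelihoods:
        log_spam += math.log(p_given_spam)
        log_ham += math.log(p_given_ham)
    # spam / (spam + ham) rewritten as 1 / (1 + ham / spam).
    diff = log_ham - log_spam
    if diff > 700:  # math.exp would overflow; probability is effectively 0
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))


# Hypothetical per-word likelihoods for a two-word message.
words = [(0.8, 0.1), (0.6, 0.3)]
print(naive_bayes_spam_probability(0.5, words))  # about 0.94 (implicit 50% prior)
print(naive_bayes_spam_probability(0.2, words))  # about 0.80 (20% of mail is spam)
```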

    I'd personally prefer to overweight the "good" e-mails entirely rather than just put a "good-multiplier" on them like Paul does, but that's just quibbling over small bits.

    As to the bit that Gary raises about Paul assuming a spam probability for an unknown word -- Paul originally said .2, then revised to .4, but really should have put it at .5 or just excluded it from all calculations. A new word has no robustness as a predictor (which is why Paul dropped words that didn't appear five times anyway). In practice, a new word at .4 isn't going to be among the 15 most interesting words to make the calculation from, anyway.

    -XDG

    1. Re:Bayesian vs not isn't really the point by Seth+Golub · · Score: 1
      By leaving it out, Paul implicitly assumes that 50% of mail is spam -- that's his "prior" estimate of the spam rate.

      Including the prior is only important if you want to treat the result as a probability, which, as Gary points out, isn't reasonable to do with Naive Bayes because of the incorrect independence assumption. For a classification task, it's enough to set the cutoff correctly to give you the performance you want (preferably through automated learning, but also potentially set by hand). What's really missing is the notion of a loss matrix that formally defines the value of the different types of classification errors (false positive, false negative) and that guides the cutoff selection.
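
      One way to picture the loss-matrix idea: give each kind of error a cost and pick the cutoff that minimizes total cost on mail you have already labeled. The sketch below is only illustrative; the function name, the sample scores, and the cost values are all assumptions.

```python
def choose_cutoff(scored, cost_false_positive, cost_false_negative):
    """scored: list of (score, is_spam) pairs from already-labeled mail.

    A message is classified as spam when score >= cutoff.
    Returns (cutoff, total_loss) for the cheapest cutoff found.
    """
    # Try every observed score, plus "never flag anything".
    candidates = sorted({score for score, _ in scored}) + [float("inf")]
    best_cutoff, best_loss = None, None
    for cutoff in candidates:
        loss = 0.0
        for score, is_spam in scored:
            flagged = score >= cutoff
            if flagged and not is_spam:
                loss += cost_false_positive   # good mail thrown away
            elif not flagged and is_spam:
                loss += cost_false_negative   # spam let through
        if best_loss is None or loss < best_loss:
            best_cutoff, best_loss = cutoff, loss
    return best_cutoff, best_loss


# Hypothetical classifier scores with their true labels.
scored = [(0.1, False), (0.4, False), (0.6, True),
          (0.7, False), (0.9, True), (0.95, True)]

# Losing a good message is treated as 100 times worse than seeing a spam.
print(choose_cutoff(scored, cost_false_positive=100, cost_false_negative=1))
```

      With equal costs the same data favors a lower cutoff, which is the sense in which the loss matrix guides the choice.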

  42. Re:filtering not the answer - Spam Honeypot! by Ma$$acre · · Score: 1

    A Honeypot for spammers? Sounds like an idea whose time has come.

    The problems are many, but the outcome would be fantastic. Create a Mail-dev/null program which looks like a "real" system and make it hackable. Keep the same doors the spammers would normally use. Make said program freely available to anyone and everyone. Make it that much more difficult for Spammers to find a working program to hack.

    --
    Knowledge is of two kinds. We know a subject ourselves, or we know where we can find information upon it. -Samuel Johns
  43. not 100% - not good enough by kid_wonder · · Score: 1
    because it is frequently the case that technology doesn't have to be 100% perfect in order to do something that really needs to be done

    Right. Try that one again after your non-100% effective filter starts filtering out business e-mails. Then where'll ya be? nowhere.

    AI people have absolutely no common sense. It's been proven by my neural net.

    --

    "Oh, you hate your job? There's a support group for that, it's called everyone, they meet at the bar."
    1. Re:not 100% - not good enough by Anonymous Coward · · Score: 0

      you might look up the definition of false positive, false negative, esp medical definitions.

      if a filter once in a while lets through a spam, that's fine, just so long as it never filters out business email

  44. Re:I still think passive euthanasia is the best wa by _Spirit · · Score: 1

    ban their IP at router level

    Oi, remind me to start running when you consider *active* euthanasia

    --

    beauty is only a light switch away

  45. Why not? by Greedo · · Score: 1

    Why not try it? The problem the guy had last week was that he did this on his home box that he used for other stuff (specifically, some mail-related stuff).

    So when he was blacklisted, his legitimate work was affected.

    There is nothing inherently wrong with running a honeypot mail-server. Just do it somewhere that isn't going to screw you when it shows up in ORBZ.

    (In fact, you could set up one server that acted as a honey-pot, and publish all the IPs of the spammers who try and connect to it. Other servers could use those IPs to block access at a lower level, without the risk of running their own honey-pots.)

    --
    Tuus crepidae innexilis sunt.
  46. Discussion in comp.lang.python by xihr · · Score: 1

    There was extensive discussion of Graham's spam filtering algorithm and potential improvements on comp.lang.python in mid-to-late August. Check Google Groups for the subjects "Lisp to Python translation criticism?" and "Graham's spam filter."

  47. How long until we throw out the current e-mail sys by aengblom · · Score: 2
    How long until we throw out the current e-mail system?

    I own my own domain, which makes it easier, but we really need a system designed to filter. And make it easier. This is my uninformed proposal. Perhaps it won't work, but it seems something is needed.

    People should have a private/public e-mail address. They should all go to the same "account" and be part of the basic plan for any e-mail user.

    privateauthentication~myemail@myhost.com

    I know this is important and relevant

    publicauthentication~myemail@myhost.com

    I gave this person my e-mail address

    myemail@myhost.com will go into the crap bin and be deleted eventually. Perhaps some program could be used to alert users of possible important mail pieces there.

    Then we could also have some system to CHANGE the private authentication or public authentication that is form based. I.e. This address has been disconnected. Please apply for the new password.
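
    The tagged-address scheme above could be routed with something as small as the following Python sketch. Everything here is hypothetical: the function name, the folder labels, and the choice of passing the current tags in as parameters (so that "changing" a tag just means passing a new value).

```python
def route_by_tag(recipient, private_tag, public_tag):
    """Route mail by the 'tag~' prefix on the local part of the address."""
    local, _, _domain = recipient.partition("@")
    # rpartition returns an empty separator when no '~' is present.
    tag, sep, _user = local.rpartition("~")
    if not sep:
        return "crap bin"          # bare address, no tag at all
    if tag == private_tag:
        return "private"
    if tag == public_tag:
        return "public"
    return "crap bin"              # unknown or retired tag


print(route_by_tag("privateauthentication~myemail@myhost.com",
                   "privateauthentication", "publicauthentication"))
print(route_by_tag("myemail@myhost.com",
                   "privateauthentication", "publicauthentication"))
```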

    --


    So close and yet so far from the world's perfect ID number
  48. Dictionary spam? by gregbillock · · Score: 1

    It seems to me a countermeasure spammers might try is including a dictionary with their spam. Since filters are for sure going to be conservative and avoid false positives, they'll latch onto "good" words from the dictionary and ignore "bad" words from the spam.

    1. Re:Dictionary spam? by wirelessbuzzers · · Score: 2

      yep. anything containing the words etesian and realgar and mangelwurzel has a 99.9% probability of being spam :-)

      --
      I hereby place the above post in the public domain.
  49. It works well, but some spammers circumvent it by Len · · Score: 1

    There I was on vacation, wondering what to do with my free time, and a spam popped into my inbox. I remembered the article about Graham's statistical technique, which seemed a lot more interesting than an arbitrary keyword list or a set of ad-hoc rules, so I decided to write an anti-spam program. Vacation accomplished.

    After a couple of weeks I've built up a big enough spambase that Graham's algorithm is pretty close to 100% effective (and no false positives at all).

    However, I did run into one problem: Some particularly devious spammers are base64 encoding their email so that it can't be scanned by programs like this. (I can't think of any other reason why they're using base64 encoding for text/plain or text/html messages.)

    After I added code to check the email header and decode the message body it worked much better.
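
    For anyone hitting the same problem, here is one way the decoding step might look in Python, using the standard `email` package. It is a sketch, not the code from the post: the function name `decoded_text` and the sample message are invented, and only text/* parts are considered.

```python
import base64
from email import message_from_string


def decoded_text(raw_message):
    """Return the text parts of a message with transfer encodings undone."""
    msg = message_from_string(raw_message)
    chunks = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        # decode=True reverses base64 / quoted-printable and returns bytes.
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            chunks.append(payload.decode(charset, errors="replace"))
        except LookupError:  # charset name the codec registry does not know
            chunks.append(payload.decode("utf-8", errors="replace"))
    return "\n".join(chunks)


# Build a hypothetical base64-encoded text/plain spam for the demo.
body = base64.b64encode(b"Buy cheap pills now").decode("ascii")
raw = ("From: x@spam.example\n"
       "Content-Type: text/plain; charset=us-ascii\n"
       "Content-Transfer-Encoding: base64\n"
       "\n" + body + "\n")

print(decoded_text(raw))
```

    The decoded text can then be tokenized like any other message, so the encoding no longer hides the words from the filter.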

  50. Jaguar by Have+Blue · · Score: 2

    Apple's new spam detector works amazingly well for me. After some initial jitters it pretty much never gets false positives these days.

  51. microsofts trademark by portal9 · · Score: 3, Informative

    why are we even considering this method when microsoft has a trademark on it? nothing can be done.. they have a lock on it. trademark here

    1. Re:microsofts trademark by Kevinv · · Score: 2

      minor nit-picking detail: they have a patent. patent and trademark are completely different.

    2. Re:microsofts trademark by thogard · · Score: 3, Informative

      another stupid patent? This isn't new, it's been done with spam on usenet for years. Maybe someone should dig out the cancelmoose's friends as prior art?

  52. Google for email... Re:Why just spam? by WolfWithoutAClause · · Score: 2
    Yes, I've always wanted google for email. You know, a small program that finds all the words in my email and then instantly pops up the emails with particular words in.

    I don't actually see the point in putting emails into different folders, if you have that feature.

    --

    -WolfWithoutAClause

    "Gravity is only a theory, not a fact!"
    1. Re:Google for email... Re:Why just spam? by Chirs · · Score: 2

      I don't know what mail client you use, but Netscape at least has the ability to search through specified mail directories for keywords, dates, authors, etc.

      I imagine most of the major clients have this ability.

    2. Re:Google for email... Re:Why just spam? by WolfWithoutAClause · · Score: 2
      Nah... I think you're grievously underestimating my inbox. My inbox and the web are of comparable size; ok, I exaggerate slightly, but the search doesn't come back in less than 5 minutes.

      In contrast, Google preindexes everything and comes back in under a second.
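
      The difference between scanning and preindexing can be shown with a toy inverted index in Python. This is a bare sketch of the general technique, not how Google or any mail client actually works; the function names and sample messages are made up.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z0-9']+")


def build_index(messages):
    """messages: dict of message id -> text. Returns word -> set of ids."""
    index = defaultdict(set)
    for msg_id, text in messages.items():
        for word in set(WORD.findall(text.lower())):
            index[word].add(msg_id)
    return index


def search(index, query):
    """Return ids of messages containing every word in the query."""
    words = WORD.findall(query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


messages = {
    1: "Lunch on Friday?",
    2: "Friday deadline for the report",
    3: "Report attached",
}
index = build_index(messages)      # done once, ahead of time
print(search(index, "friday report"))
```

      The index is built once; each query is then a few set lookups instead of a pass over every message.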

      --

      -WolfWithoutAClause

      "Gravity is only a theory, not a fact!"
  53. Re:How long until we throw out the current e-mail by Anonymous Coward · · Score: 0
  54. How is this working exactly? by Hassan79 · · Score: 1

    I think it might be interesting to apply AI methods in fighting spam, especially machine learning. For example, you could have a spam filter that is able to learn. You just show 100 spam mails to the filter program, then 100 non-spam mails, and the system "learns" what spam looks like.
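
    As a rough picture of what such "learning" can mean, the Python sketch below counts words in labeled spam and non-spam and turns the counts into a per-word spam score. It is loosely modeled on the per-word statistics discussed elsewhere in this thread; the function names, the tokenizer, and the constants (doubling good counts, a five-occurrence minimum, clamping to 0.01-0.99) should be read as assumptions rather than a faithful copy of any published filter.

```python
import re
from collections import Counter


def tokens(text):
    return re.findall(r"[a-z0-9$'-]+", text.lower())


def train(spam_mails, good_mails, good_weight=2, min_count=5):
    """Return a dict of word -> estimated probability that it signals spam."""
    bad = Counter(t for mail in spam_mails for t in tokens(mail))
    good = Counter(t for mail in good_mails for t in tokens(mail))
    nbad = max(len(spam_mails), 1)
    ngood = max(len(good_mails), 1)
    probs = {}
    for word in set(bad) | set(good):
        g = good_weight * good[word]   # bias toward avoiding false positives
        b = bad[word]
        if g + b < min_count:
            continue                   # too rare to be a reliable predictor
        bad_rate = min(1.0, b / nbad)
        good_rate = min(1.0, g / ngood)
        p = bad_rate / (good_rate + bad_rate)
        probs[word] = max(0.01, min(0.99, p))
    return probs


# Tiny hypothetical training set.
spam_mails = ["free viagra now"] * 5
good_mails = ["meeting notes now"] * 5
probs = train(spam_mails, good_mails)
print(sorted(probs.items()))
```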

    --

    Don't drink and su! antidisestablishmentariazationally
  55. How does Apple's mail spamfilter work? by wirelessbuzzers · · Score: 2

    All it says in the help is that it is adaptive and trains itself on your previous spam. It would be nice to see some source... and be able to patch it if we don't like it.... oh well, whining won't get me anywhere.

    --
    I hereby place the above post in the public domain.
    1. Re:How does Apple's mail spamfilter work? by kiddailey · · Score: 1

      They state that it looks at the content of the messages to determine if it is spam, so I'd imagine that it's similar in some ways.

      My luck with it has been very good, with no false positive and only a few missed ones over the course of a month.

      I originally stored over 1000 spams before it was released and "trained it" in one fell swoop :) It also does seem to get better over time.

  56. A call for suggestions, and coders... by doorbot.com · · Score: 2

    Let me start by saying I know very little about coding, otherwise I'd probably already be rushing off to a night of coding by the glow from my monitor.

    When the first Bayesian spam filtering article was posted, I thought it was a great idea, and this article just reinforces that idea. However, it would be interesting to build some sort of Sendmail module (or whatever MTA you like), but add some additional functionality:

    1. Option to return a 550 error if the message is determined to be spam: "550 Delivery blocked; Bayesian filter reports spam probability of nn%"
    - Right before reporting this error, wait n seconds or alternately, slow connection to n bps for n minutes.
    - After reporting the error, "deliver" the Subject and Body of the email to the spam words database.
    2. Inclusion of a whitelist, by IP, reverse DNS, MAIL FROM address, or RCPT TO address, header To: address, header From: address, etc.
    3. Configuration of account where spams can be forwarded to, for automatic addition to the database.
    - Perhaps this could be combined with the blacklist/whitelist. For example, any emails to spamthis@antispamdomain.com are always added to the DB. The entry could be as follows (similar to the Sendmail access map):
    spamthis@antispamdomain.com <tab> BAYESIAN:SILENT
    - This would allow for either silent addition to the filter (sender thinks mail was delivered -- good for spam harvesting emails, or for users to send their spam to), or a more "vocal" addition much like item #1 above, where a 550 error is reported... eg, BAYESIAN:550 or perhaps BAYESIAN:REJECT
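
    To show how items 1 and 3 might fit together, here is a Python sketch of the decision logic only. It is not Sendmail configuration and does not talk SMTP: the map format is the hypothetical tab-separated one proposed above, and the function names and the 0.9 threshold are invented for the example.

```python
def parse_access_map(text):
    """Parse lines of 'address<TAB>BAYESIAN:ACTION' into address -> ACTION."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        address, _, action = line.partition("\t")
        prefix, _, verb = action.strip().partition(":")
        if prefix == "BAYESIAN" and verb:
            entries[address.strip().lower()] = verb
    return entries


def smtp_decision(recipient, spam_probability, access, threshold=0.9):
    """Return (smtp_reply, learn_as_spam) for one incoming message."""
    action = access.get(recipient.lower())
    if action == "SILENT":
        return "250 OK", True                  # sender thinks it was delivered
    if action in ("550", "REJECT"):
        return "550 Delivery blocked", True    # spam-trap address, vocal mode
    if spam_probability >= threshold:
        pct = round(spam_probability * 100)
        reply = ("550 Delivery blocked; Bayesian filter reports "
                 f"spam probability of {pct}%")
        return reply, True
    return "250 OK", False


access = parse_access_map("spamthis@antispamdomain.com\tBAYESIAN:SILENT\n")
print(smtp_decision("spamthis@antispamdomain.com", 0.0, access))
print(smtp_decision("user@example.com", 0.97, access))
print(smtp_decision("user@example.com", 0.10, access))
```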

    I realize this would block a lot of mail, but I have my Sendmail currently configured to actually block spam (or what it considers spam) and have had very few issues with valid messages bouncing. Obviously, results may vary, but I'm a firm believer in rejecting spam during the SMTP conversation, not accepting it and then deleting it silently.

    Does anyone else have any suggestions?

    1. Re:A call for suggestions, and coders... by t · · Score: 1
      yes. Don't do that. Why would you want to give feedback to the spammers? They'll just sit there and tweak their message until it is no longer blocked.

      Besides who are you going to send the error to? The reason spam is hard to stop is because it doesn't have a return address. The server sending it is just a parasitized host.

      A more effective option would be to DoS anything in the spam. That may also mean hooking a phone to your computer so that you can call any telephone numbers in the spam.

  57. Already Patented by Microsoft... by barfy · · Score: 2

    This whole methodology is already patented by Microsoft. ANY implementation not licensed by Microsoft is going to be a violation... And now that you know, it is treble damages...

    patent 6,161,130

  58. We can fight them! by frovingslosh · · Score: 2
    Another problem is that spammers might start automatically sending the same spams to the same lists via several different open relays. Thus, we might increase the volume of spam, at least in the early stages.

    I doubt that there are many spammers out there who are not using all of their available bandwidth to send spam already, I can't see how setting up dummy port 23's would make spam worse. Just the opposite: While this can be started by a few changes to an open source mail server, or maybe even by misconfiguring an existing mail server, it should grow and evolve. I think we can beat the spammers, but not just by being impressed on how well we can filter our own mail. Heck, as they add smarts, we could add smarts too. If we can identify the test messages with reasonable certainty we can elect to send them through. We could even build a nice P2P network of systems cooperating to stay one step ahead of the spammers.

    Can anyone get us started on this? Provide some Windows and/or Linux code to start the roach motel e-mail server (spammers log in but they don't send out)? I'll get one running tonight if I can get a good dummy mail server for Windows (and just slightly longer to put the hardware together if I have to build up a Linux system).

    --
    I'm an American. I love this country and the freedoms that we used to have.
    1. Re:We can fight them! by flonker · · Score: 2

      I know you didn't start this, but port 23 is telnet. SMTP is port 25.

    2. Re:We can fight them! by minas-beede · · Score: 1

      Tonight? For Linux it's easy: see http://fightrelayspam.homestead.com/ For Windows it's in Beta, so you'll have to wait a while. (I'm not the author of the Windows version; it isn't mentioned on the web page.) There was a Perl honeypot for Linux posted in news.admin.net-abuse.email 24 February this year, by John Collins. Funny. While I was composing this ZoneAlarm notified me that an SMTP attempt had been made to my Windows system.

  59. Re:I still think passive euthanasia is the best wa by agent0range_ · · Score: 1

    Most of the spam may be coming from overseas now, but at least in some of these countries it is far more likely that one could actually pass a law to sodomize the offender with a baseball bat.

  60. How is this working exactly? by Hassan79 · · Score: 1

    I think it might be interesting to apply AI methods in fighting spam, especially machine learning. For example, you could have a spam filter that is able to learn. You just show 100 spam mails to the filter program, then 100 non-spam mails, and the system "learns" what spam looks like (maybe reinforcement learning?)

    --

    Don't drink and su! antidisestablishmentariazationally
  61. Re:I still think passive euthanasia is the best wa by Anonymous Coward · · Score: 0

    Spam is a GLOBAL problem. There ARE no global laws. Do you think for one minute the Chinese ISPs (chinanet.cn) are going to refuse HARD US$... and not allow American and other international spammers to use their gateways? THINK AGAIN.... the ONLY way to fight spam is to make it so expensive for the spammers that they will use other means to push their "penis enlargement" crap.

    Of course not everyone has the skills to track down and identify the spammers, but one can certainly have a lot of fun harassing them.

    If you can identify the spammer and get a mailing address for them (very hard to do), then send them an invoice for the time you take in reading and reporting it. Kindly remind them that if they don't pay up by the deadline, you'll take them to collections.

    Now if EVERYONE did that (wishful thinking) then spammers might die or go away. Especially if everyone they spammed, would take them to small claims court demanding they pay for your time in reading their smut.

    I've been told that SOME people have actually been paid.... BOY!!! What lamers...

  62. Help out Gary Robinson by Anonymous Coward · · Score: 0

    He said he is only receiving 5-10 spams/hr. Let's try and knock that up a few levels, trolls.

    Target: grobinson@transpose.com

  63. Bayesian filtering software by Roadmaster · · Score: 2
    Seems like everyone jumped on the bandwagon and implemented a bayesian spam filter shortly after Graham's article hit the net. Best part is, theory or not, the damn thing actually works.


    Paul's article lists a few of the bayesian spam filters, but here's a short list of the ones I've tried:
    Gary Arnold's bayespam is implemented in perl and geared towards qmail using maildir storage.

    Brian Burton's spamprobe, written in C++, tries to remember already-seen messages, so that you can dump your spams/good mails on separate folders, have spamprobe learn from them, and delete them afterwards. Spamprobe remembers which ones it already processed, and won't reprocess a message if it's already seen it.

    Eric Raymond's bogofilter is a typical ESR tool: concise, with a baroquely written man page, and quite simplistic, but does its job and does it well. ESR even uses some funny terms, like "spamicity", and "ham" (the opposite of spam). I don't like its dependency on the Judy libraries for dynamic arrays but what the heck.

    Matthew Walker's BayesSpam plugin for Squirrelmail provides squirrelmail users with bayesian spam filtering capabilities, no longer restricting use of the technique to those with access to procmail/mailfilter systems.

    1. Re:Bayesian filtering software by h3 · · Score: 1
      Brian Burton's spamprobe, written in C++, tries to remember already-seen messages, so that you can dump your spams/good mails on separate folders, have spamprobe learn from them, and delete them afterwards. Spamprobe remembers which ones it already processed, and won't reprocess a message if it's already seen it.

      I've been using spamprobe for a little while now and have been happy so far. One interesting bit about spamprobe is that it self-adjusts: after classifying each incoming email, it adds that email's data to its database, so as the content of spam (or your "good" mail) shifts, its filtering criteria will shift with it.

      Perhaps this is why I like it more than I did SpamAssassin (that and the messy perl-ness of SpamAssassin) - I don't have to worry about updating rule sets.

      One caveat: the database can get pretty big. I seeded mine with about 600,000 lines total of spam and good mail and the database clocks in at about 83 MB.

      -h3
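
      The self-adjusting behaviour described above can be sketched roughly as follows. This is a toy word-count filter, not spamprobe's actual algorithm or database format; the class name `SelfTrainingFilter`, the add-one smoothing and the 0.5 threshold are all assumptions made for illustration.

```python
import re
from collections import Counter


class SelfTrainingFilter:
    """Toy word-count filter that retrains on every message it classifies.

    Hypothetical sketch only: it does not reproduce spamprobe's scoring.
    """

    def __init__(self):
        self.spam_counts = Counter()  # word -> number of spam messages containing it
        self.good_counts = Counter()  # word -> number of good messages containing it
        self.spam_msgs = 0
        self.good_msgs = 0

    @staticmethod
    def tokens(text):
        # Each distinct word counts once per message.
        return set(re.findall(r"[a-z0-9$'-]+", text.lower()))

    def train(self, text, is_spam):
        if is_spam:
            self.spam_counts.update(self.tokens(text))
            self.spam_msgs += 1
        else:
            self.good_counts.update(self.tokens(text))
            self.good_msgs += 1

    def word_prob(self, word):
        # Add-one smoothing keeps unseen words at a neutral 0.5 (an assumption).
        s = (self.spam_counts[word] + 1) / (self.spam_msgs + 2)
        g = (self.good_counts[word] + 1) / (self.good_msgs + 2)
        return s / (s + g)

    def score(self, text):
        probs = [self.word_prob(w) for w in self.tokens(text)]
        return sum(probs) / len(probs) if probs else 0.5

    def classify_and_learn(self, text, threshold=0.5):
        # Classify first, then feed the verdict straight back into the counts.
        is_spam = self.score(text) > threshold
        self.train(text, is_spam)
        return is_spam
```

      In a sketch like this, the filter also learns from its own mistakes, so a misclassified message still has to be corrected by hand or the error reinforces itself.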

  64. One problem with that approach... by Big+Sean+O · · Score: 2

    One person's spam is another person's 'useful email'. For instance, I may want a particular type of email (eg: a pr0n mailing list, or a "George Foreman Grill" user group, or lots of Korean friends). It might be considered spam by the ISP's filters, but not by me.

    That's why it's best to train _my_ filter against _my_ received mail.

    And as more email gets received and I add the uncaught messages to the spam filter, my filter 'learns' what I consider spam.

    --
    My father is a blogger.
  65. Assholes. by Perianwyr+Stormcrow · · Score: 2

    So, if they own the damn thing, why can't they sit down and make a real implementation of it for Hotmail? I'm sure everyone involved would be happier.

    --

    What we call folk wisdom is often no more than a kind of expedient stupidity.-Edward Abbey

  66. Sounds good by Perianwyr+Stormcrow · · Score: 2

    If the spammers have to jack a 75k file onto the end of their spams, suddenly they are sending 75 GB of data per spam run. This is about as stealthy as dancing naked on the piano in the middle of a wedding reception.

    Also, it would only work once- the first dictionary spam I got would be marked spam and then all the junk words would get marked in the list.

    --

    What we call folk wisdom is often no more than a kind of expedient stupidity.-Edward Abbey

  67. POPFile does this by Perianwyr+Stormcrow · · Score: 2

    It can use the filter to get any result you want, not just a binary trash/don't trash.

    It puts an "X-Text-Classification" header in mails you get saying what category it determined, so that you can just write simple filter rules in whatever program you use to sort it all.
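
    A minimal sketch of the kind of "simple filter rule" this makes possible, written with Python's standard `email` package. Only the `X-Text-Classification` header name comes from the description above; the bucket names, the folder names and the `folder_for` helper are hypothetical.

```python
from email import message_from_string

# Hypothetical mapping from classifier bucket to mail folder.
FOLDER_FOR_BUCKET = {
    "spam": "Junk",
    "work": "Work",
    "personal": "Friends",
}


def folder_for(raw_message, default="Inbox"):
    """Pick a folder from the X-Text-Classification header, if present."""
    msg = message_from_string(raw_message)
    bucket = (msg.get("X-Text-Classification") or "").strip().lower()
    return FOLDER_FOR_BUCKET.get(bucket, default)


if __name__ == "__main__":
    raw = (
        "From: someone@example.com\n"
        "X-Text-Classification: spam\n"
        "Subject: hello\n"
        "\n"
        "body text\n"
    )
    print(folder_for(raw))
```

    Because the decision lives in a header, the sorting step stays independent of whatever classifier wrote it.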

    --

    What we call folk wisdom is often no more than a kind of expedient stupidity.-Edward Abbey

    1. Re:POPFile does this by Anonymous Coward · · Score: 0

      POPFile acts as a proxy between your windows mail client and your pop server. If you happen to have a hotmail account you can chain in pop3hot http://www.pop3hot.com/main.htm as a proxy which allows any mail client to access hotmail as a pop server.

      POPFile inserts an X-header, or adds text to the subject line for any number of categories you train it in. http://www.extravalent.com/software/popfile/

      The current POPFile engine has been split into three projects. One is an integrated version with Outlook (AutoFile http://www.usethesource.com/cgi-bin/article.pl?sid=02/09/13/1314238&mode=thread ). It requires no external proxy and will actually learn by checking the folders you currently have assigned in Outlook and the mail you place into them.

  68. Except for user datagram protocol. by Perianwyr+Stormcrow · · Score: 2

    So, if someone sends you mail about which UDP ports to unblock on a firewall to play a game, you've just lost communication.

    Single word "zero-tolerance" rules are unwise, to say the least.

    --

    What we call folk wisdom is often no more than a kind of expedient stupidity.-Edward Abbey

  69. My SPAM filter works great. by Anonymous Coward · · Score: 0

    I have my server set up to send a password to anyone who sends me email. They must send the password back in an email. The password is a one-time password. Been using it for 2 months, works great... (I monitor the discarded mail, yes I'm paranoid, and I've even had SPAMMERs bitch to me about wasting their bandwidth... the script is set up to adapt to continual spam from a server by forwarding emails to people like root@domain, spam@domain, remove@domain, etc.)
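
    The password round trip described above might look something like this. It is a sketch under assumptions, not the poster's script: the `ChallengeGate` class is hypothetical, and remembering a sender after one successful reply is my guess at sensible behaviour rather than something stated in the comment.

```python
import secrets


class ChallengeGate:
    """Hold mail from unknown senders until they echo back a one-time password."""

    def __init__(self):
        self.pending = {}      # sender -> outstanding one-time password
        self.approved = set()  # senders who have answered a challenge (assumption)

    def on_mail(self, sender, body):
        """Return "deliver" or "challenge" for an incoming message."""
        if sender in self.approved:
            return "deliver"

        token = self.pending.get(sender)
        if token is not None and token in body:
            # Correct reply: the password is used up and the sender is remembered.
            del self.pending[sender]
            self.approved.add(sender)
            return "deliver"

        # Unknown sender or wrong reply: issue a fresh one-time password.
        self.pending[sender] = secrets.token_hex(8)
        return "challenge"


if __name__ == "__main__":
    gate = ChallengeGate()
    print(gate.on_mail("friend@example.com", "hello"))
    reply = "here is the code: " + gate.pending["friend@example.com"]
    print(gate.on_mail("friend@example.com", reply))
```

    Automated senders such as mailing lists never answer the challenge, so a scheme like this needs a manual allow list alongside it.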

  70. Download it. by mcrbids · · Score: 2

    You can download the source here if you like.

    It's not from the same guy, but it's definitely derivative work.

    --
    I have no problem with your religion until you decide it's reason to deprive others of the truth.
  71. Any linux-based POP "proxies" for this? by mooman · · Score: 1

    This is basically an "ask slashdot" question.

    I have my popmail hosted by my ISP. I usually check my mail from my windows box. I'd like to configure my Linux box to periodically pull the POP3 mail from the server, spam-filter it, and then act as a "local" POP server that I'd just point my windows Eudora at.

    Anyone have an easy (relatively speaking) means of doing this? Seems like each of the 3 parts (Getting mail from ISP, filter, and being a POP server) are trivial, but anything out there that would do all this or pieces that play well together?

    I'm not keen on trying to deal with SMTP right now. My internet connection is a little too flaky for that...

    Thanks for any ideas.

    --
    In the Portland, Ore area and like card games? Check out: http://groups.yahoo.com/group/portlandgames/
    1. Re:Any linux-based POP "proxies" for this? by Anonymous Coward · · Score: 0
      Yep.
      1. Set up any POP server; most distros come with one already, and most work right away.
      2. Get fetchmail
      3. Get filter
      You'll have to deal with sendmail (or other MTA) for delivery, but point the Windows box at your normal SMTP server from the ISP. You'll probably also need procmail although the filters can probably function without it.
  72. Re:First P0st! by sl@fireplug.net · · Score: 0

    I implemented both SpamAssassin and ifile one month ago.

    Results
    Both:               787   62%
    SpamAssassin only:  385   30%
    Ifile only:          62    5%
    Missed:              29    2%

    False positives:      0    0%

    I'm fairly happy with these results. I see about 1 spam message a day.
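
    The table only shows how the catches overlap, so the per-tool catch rates have to be derived from it. A short worked calculation using nothing but the counts posted above (the variable names are mine):

```python
# Counts taken from the results table above.
caught_by_both = 787
spamassassin_only = 385
ifile_only = 62
missed = 29

total_spam = caught_by_both + spamassassin_only + ifile_only + missed


def percent(count, total=total_spam):
    """Share of all spam, as a percentage rounded to one decimal place."""
    return round(100.0 * count / total, 1)


# What each tool would have caught on its own, and what the pair caught together.
spamassassin_alone = percent(caught_by_both + spamassassin_only)
ifile_alone = percent(caught_by_both + ifile_only)
combined = percent(total_spam - missed)

if __name__ == "__main__":
    print(f"total spam:         {total_spam}")
    print(f"SpamAssassin alone: {spamassassin_alone}%")
    print(f"ifile alone:        {ifile_alone}%")
    print(f"both combined:      {combined}%")
```

    On these numbers ifile adds only the 62 messages SpamAssassin missed, which is what lifts the combined rate above either tool alone.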

  73. Bayesian Spam Filtering won't work either by Anonymous Coward · · Score: 0

    The problem is that they're using spam they get today as sample input to their algorithms. This won't work because spammers will simply tailor their prose to fit the filter.

    This would be as simple as taking the filter and repeatedly hitting it with the text of your spam. If the filter filters it, then tweak the words a bit, and iterate until the filter lets the spam pass. Now you have a spam that will pass through most people's filters.

    Alternatively, you do the same as above, but instead of changing the words, change how you send them, but so that they still get rendered on a browser correctly - i.e., encode the text so the filter can't defilter it. For example, send the spam as text in a jpg image.

  74. obligatory biology humor by bbc22405 · · Score: 1

    So, a prior article described a method of spam detection which claimed to use something like Bayesian methods, and now we read that it didn't. Sounds like just another case of ...

    Bayesian Mimicry

    (Don't clap, just throw money.)

  75. Re:Your post by Jonny+290 · · Score: 1

    Considering this is Slashdot, par for the course.

    --
    Hey Taco! Looks like you're using the "infinite monkeys and typewriters" scheme to generate Ask Slashdots again...
  76. how about trolls by cheese_wallet · · Score: 1

    I wonder if this technique could be modified to spot trolls. Not too likely I guess, it'd have to be able to tell relevance to a topic.

  77. Method seems easily breakable. by guybarr · · Score: 2

    IIUC, the proposed method normalizes (with Ln norm) over the number of words, for "spammishness" and "unspammishness" of words, combining the results.

    What's stopping the spammers from attaching, say, a random scientific article longer than the spam at the end of the spam message? This will give the spam a good (non-spam) grade in these bayesian methods in general, but more so with his normalizing metric.

    --
    Working for necessity's mother.
    1. Re:Method seems easily breakable. by jtdubs · · Score: 2

      It depends on whether the bayesian method is naive or not.

      If, for the top 1000 highest and lowest words, you build a pair-wise table of each highest paired with each lowest, and keep probabilities for these pairs as well, then that would solve the problem.

      For regular email that contains no spam words, no problem. For spam that contains only spam words, no problem. For spam that contains both kinds of words, the pair table would catch them.

      I mean, how many valid emails can you possibly have that both have scientific terms and words like "hot teen sex." Unless, of course, it's a scientific study about either spam, or hot teen sex.

      Unless, of course, I'm completely wrong about this whole thing and just don't realize it, which is sometimes far more likely than I approve of.

      Justin Dubs
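
      A rough sketch of the pair table idea, assuming messages have already been reduced to sets of words and that the "highest" (spammy) and "lowest" (innocent) word lists are supplied by some other step. The function names and the add-one smoothing are hypothetical choices, not part of the proposal.

```python
from collections import Counter
from itertools import product


def pair_counts(messages, spam_words, good_words):
    """Count messages containing each (spam_word, good_word) pair.

    `messages` is an iterable of sets of words; the two word sets stand in
    for the top spammy and top innocent words.
    """
    counts = Counter()
    for tokens in messages:
        present_spam = sorted(spam_words & tokens)
        present_good = sorted(good_words & tokens)
        # Every spammy word in the message is paired with every innocent one.
        counts.update(product(present_spam, present_good))
    return counts


def pair_spam_prob(pair, spam_pairs, good_pairs):
    """Smoothed probability that a message containing this pair is spam."""
    s = spam_pairs[pair] + 1
    g = good_pairs[pair] + 1
    return s / (s + g)


if __name__ == "__main__":
    spam_words = {"toner", "cheap"}
    good_words = {"quantum", "lattice"}
    spam_pairs = pair_counts([{"cheap", "toner", "quantum"}], spam_words, good_words)
    good_pairs = pair_counts([{"quantum", "lattice"}], spam_words, good_words)
    print(pair_spam_prob(("toner", "quantum"), spam_pairs, good_pairs))
```

      With 1000 words on each side the table has up to a million entries, so storage grows much faster than with single-word counts.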

    2. Re:Method seems easily breakable. by guybarr · · Score: 2

      I mean, how many valid emails can you possibly have that both have scientific terms and words like "hot teen sex." Unless, of course, it's a scientific study about either spam, or hot teen sex.

      Well, for "hot teen sex", or "novel penis enlargement techniques available today !!" spam, I guess you're right. But for "get your mortgage now!", or for "cheap toner at amazing prices!" kind of spam this seems more tricky.

      Unless, of course, I'm completely wrong about this whole thing and just don't realize it, which is sometimes far more likely than I approve of.

      I'm no expert either, just skeptically paranoid ...

      --
      Working for necessity's mother.
  78. available for KMail /Evolution/Mozilla? by egghat · · Score: 2

    Lots of implementations are mentioned in this thread, but does anyone know of an implementation for the most widely used e-mail clients under Linux/BSD: KMail, Evolution and Mozilla?

    TIA for any links.

    Bye egghat.

    --
    -- "As a human being I claim the right to be widely inconsistent", John Peel
  79. This is a reinvention... by paulwomack · · Score: 1

    Of classic "probabilistic searching" from the field of information retrieval. Here's a typical tutorial. You can feed key words from this to Google to find more if you want to.

    The application to spam filtering is trivial. Simply take a document set (your inbox for a month), identify the spam set (manually) and the algorithm will generate term weightings for you.

    Then apply these term weightings to previously unclassified records (emails) and BINGO!
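
    As a sketch of what "generate term weightings" could mean in practice, here is the Robertson/Sparck Jones relevance weight from probabilistic retrieval, with "relevant" read as "spam". Choosing this particular weight is my assumption, since the tutorial referred to above is not reproduced here; the function names are hypothetical.

```python
import math
from collections import Counter


def rsj_weights(docs, is_spam):
    """Robertson/Sparck Jones term weights, treating spam as the relevant set.

    `docs` is a list of sets of words and `is_spam` a parallel list of booleans.
    """
    N = len(docs)      # all messages
    R = sum(is_spam)   # messages marked as spam
    n = Counter()      # term -> messages containing it
    r = Counter()      # term -> spam messages containing it
    for tokens, spam in zip(docs, is_spam):
        n.update(tokens)
        if spam:
            r.update(tokens)

    weights = {}
    for t in n:
        # The 0.5 terms are the usual correction that avoids division by zero.
        in_spam = (r[t] + 0.5) / (R - r[t] + 0.5)
        in_good = (n[t] - r[t] + 0.5) / (N - n[t] - R + r[t] + 0.5)
        weights[t] = math.log(in_spam / in_good)
    return weights


def score(tokens, weights):
    """Sum the weights of known terms; a higher score means more spam-like."""
    return sum(weights.get(t, 0.0) for t in tokens)


if __name__ == "__main__":
    docs = [{"cheap", "toner"}, {"cheap", "pills"}, {"meeting", "notes"}, {"lunch", "notes"}]
    labels = [True, True, False, False]
    w = rsj_weights(docs, labels)
    print(score({"cheap", "toner", "hello"}, w))
    print(score({"notes", "lunch"}, w))
```

    Where to put the cutoff between spam and non-spam scores is left open here; it would have to be tuned on held-out mail.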

    BugBear

    --
    Ignorance is curable. Stupid is forever.
  80. Mozilla by yota · · Score: 1
    Some work is being done to implement something like this in Mozilla, check bug #163188 in Bugzilla (http://bugzilla.mozilla.org).

    Andrea

  81. baises moi (bayes me) by Anonymous Coward · · Score: 0

    That one seems quite interesting:
    http://www.ai.mit.edu/~jrennie/ifile/

  82. filtering on qmail... by georgehazlewood · · Score: 1

    I've been using and testing bayespam (thanks Gary!) for the last week or so and am impressed by how accurate it is. Easy to install too. All of the other anti-spam tools (blackhole, spamassassin etc) are a complete nightmare to set up and configure. Obviously speed is important but I'm going to use bayespam on a case by case basis rather than filter all and any. If a user has problems, start filtering... Must remember to keep saving up that spam for my corpus. To me it doesn't matter if it *really* is bayesian or not, it works. Hope someone sorts out a Mozilla setup too...

  83. 600 SPAMs?!? by akincisor · · Score: 1

    Do I sense Hotmail here?

  84. Anyone tried TMDA? by fanatic · · Score: 2

    I like the sound of this one - seems like it should eliminate all false positives sent by real people and all false negatives. I worry about auto-responders and auto-reminders, though. TMDA (Tagged Message Delivery Agent)

    --
    "that's not encryption - it's a new perl script that I'm working on..." - from some Matrix parody
  85. Re:filtering not the answer - maybe SPOOFSERVERS by minas-beede · · Score: 1

    "I'm fairly sure a false relay won't work. Just like snail mail list sellers, the spammers salt their victim lists with their own valid addresses that they can check to see if the message is getting out."

    MAYBE some do salt, but demonstrably some don't. As recently as 17 minutes ago one spammer sent relay spam to my (2 1/2 year old) honeypot. It isn't being delivered. If he salted the list with his own address (as you say he does) he'd have figured out the honeypot last week already.

    The Moscow honeypot trapped Ralsky spam from February to July. Not only did Ralsky not salt the addresses, he ended up sending spam run statistics reports back to himself THROUGH THE HONEYPOT. The entire episode was one long cause for ROFL. I'll grant that there may be some smart spammers and smart spamware vendors. Please don't assume that this smartness prevails. It does not.

    Um. Now it's trapped relay spam as recently as 9 minutes ago - I took some time to compose this, he's still busy. 88 recipients on this one. He's going alphabetically, he's in the bobxxxx's right now.

  86. Re:filtering not the answer - Spam Honeypot! by minas-beede · · Score: 1

    Start here: http://fightrelayspam.homestead.com/ Also, Google for "corpit honeypot" and look at the cached page. Really wicked. A honeypot with a real-time log of the incoming spam on a web page. Send the URL to the abuse@ISP and watch the throwaway accounts drop like flies. Sadly, now most relay spam seems to come through open proxies so that doesn't work.

  87. The best filter I've used by NightHwk1 · · Score: 1

    .. is Bayesspam 2.x for Squirrelmail. It's an easily installable plugin for a php-based webmail system that uses MySQL to store the Bayesian corpus. It's also got options to limit the size of the messages to be filtered, and displays the spam probability and the 'mark as spam/nonspam' links in each email header.

  88. Re:filtering, etc. (download a honeypot Beta) by minas-beede · · Score: 1
    YES! Thank you, Jack.

    Visit http://jackpot.uk.net to download it.

    You also need a JVM, obviously.

    This one's web page is even better than the cached copy you'll see if you Google for "corpit honeypot". You can examine any spam it has trapped.

  89. Re:filtering not the answer - maybe SPOOFSERVERS by netringer · · Score: 1
    "I'm fairly sure a false relay won't work. Just like snail mail list sellers, the spammers salt their victim lists with their own valid addresses that they can check to see if the message is getting out." MAYBE some do salt, but demonstrably some don't. As recently as 17 minutes ago one spammer sent relay spam to my (2 1/2 year old) honeypot. It isn't being delivered. If he salted the list with his own address (as you say he does) he'd have figured out the honeypot last week already. The Moscow honeypot trapped Ralsky spam from February to July. Not only did Ralsky not salt the addresses he ended up sending spam run statistics reports back to himself THROUGH THE HONEYPOT. The entire episode was one long cause for ROFL. I'll grant that there may be some smart spammers and smart spamware vendors. Please don't assume that this smartness prevails. It does not.
    OK! Good for you!

    I should have guessed that when you get 3-4 copies of the same spam it means that the spamming scumbags just get redundancy by spamming the entire 64,000,000 addresses repeatedly through different raped relays.

    Your honeypot success gives me an idea. What if yours and other honeypots were used to cooperate to capture the spam to seed spam filters? Since EVERY message you process is spam, all of the words in it, or a hashed signature, could be sent out to these filter dictionaries so that ISPs will know the message you captured should be delivered to /dev/null on first sight. The result would be that by sending you the message, Ralsky cuts his own throat so the spam doesn't get delivered to anyone on an ISP who participates.

    What'dya think?
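
    The "hashed signature" half of that idea can be sketched in a few lines. Everything here is hypothetical: the `SharedBlocklist` class, the whitespace and case normalization, and the use of SHA-256 are illustrative choices, and an exact hash like this is defeated by changing a single character in the message.

```python
import hashlib
import re


def signature(body):
    """Exact-match signature of a message body after light normalization."""
    normalized = re.sub(r"\s+", " ", body.strip().lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class SharedBlocklist:
    """Signatures reported by honeypots and consulted by participating filters."""

    def __init__(self):
        self.known = set()

    def report_from_honeypot(self, body):
        # Everything a honeypot receives is spam, so it can be added unreviewed.
        self.known.add(signature(body))

    def is_known_spam(self, body):
        return signature(body) in self.known


if __name__ == "__main__":
    blocklist = SharedBlocklist()
    blocklist.report_from_honeypot("Cheap   TONER at amazing prices!\n")
    print(blocklist.is_known_spam("cheap toner at amazing prices!"))
    print(blocklist.is_known_spam("cheap toner at amazing prices!!"))
```

    Only signatures leave the honeypot, so participating filters never need to see the trapped messages themselves.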
    --
    Ever dream you could fly? Get up from the Flight Sim. I Fly
  90. Re:filtering not the answer - maybe SPOOFSERVERS by minas-beede · · Score: 1
    "Your honeypot success gives me an idea. What if yours and other honeypots were used to cooperate to capture the spam to seed spam filters?"

    Could work, should work. But there's already a service that captures spam using spamtraps that otherwise works almost exactly as you describe: DCC. It sends out fuzzy checksums, and I'm not the one to tell you how the fuzzy checksums are computed. As I recall there's a place on the web site where you can paste in a spam message and see if it would have been identified as "bulky" (DCC detects bulkiness rather than spamishness - it needs a whitelist for mailing list sources.)

    See: http://www.rhyolite.com/anti-spam/dcc/

    This truly is an excellent idea.

  91. a few things by Gumber · · Score: 2

    1. it is a patent, not a trademark
    2. just because someone has a patent doesn't mean the patent can't be challenged.
    3. just because someone has a patent doesn't mean a patent will be enforced.
    4. Some things are worth fighting for

  92. Worse practical performance by Anonymous Coward · · Score: 0

    I have attempted a quick implementation of these revised algorithms (at least the first two: S and f(w)) and the results are much less promising than the original article's algorithm.

    Caught spam dropped from near 99% into the 70s and the false positives jumped from 1 in ~2000 to 10-20%.

    Anyone else get similar results? Is it just my implementation?
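
    For anyone comparing implementations, here is a minimal sketch of S and f(w) as I understand them from Robinson's write-up. The formulas are not quoted anywhere in this thread, so treat them as my reading rather than a reference: f(w) = (s*x + n*p(w)) / (s + n), and S = (P - Q) / (P + Q), where P and Q are one minus the geometric means of (1 - f) and f respectively. The defaults s=1 and x=0.5 and the function names are assumptions.

```python
import math


def f(p_w, n, s=1.0, x=0.5):
    """Word probability pulled toward a prior x when there is little data.

    p_w: raw spam probability of the word, n: number of messages containing it,
    s: strength given to the prior, x: assumed probability for an unseen word.
    """
    return (s * x + n * p_w) / (s + n)


def combined_S(probs):
    """Combine per-word probabilities (each strictly between 0 and 1) into S.

    S runs from -1 (looks innocent) to +1 (looks like spam).
    """
    n = len(probs)
    # Geometric means computed through logs so long messages do not underflow.
    P = 1.0 - math.exp(sum(math.log(1.0 - p) for p in probs) / n)
    Q = 1.0 - math.exp(sum(math.log(p) for p in probs) / n)
    return (P - Q) / (P + Q)


if __name__ == "__main__":
    words = [(0.99, 12), (0.95, 3), (0.20, 40), (1.00, 1)]  # made-up (p(w), n) pairs
    probs = [f(p, n) for p, n in words]
    print(combined_S(probs))
```

    One thing worth checking in a port: S lies between -1 and +1, so a cutoff tuned for a 0 to 1 score would not carry over unchanged, and (1 + S) / 2 is the usual way to put it back on that scale.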