Slashdot Mirror


Plan for Spam, Version 2

bugbear writes "I just posted a new version of the Plan for Spam Bayesian filtering algorithm. The big change is to mark tokens by context. The new version decreases spams missed by 50%, to 2.5 per 1000, even though spam has gotten harder to filter since the summer. I also talk about how spam will evolve, and what to do about it."

81 of 459 comments

  1. I'm sorry, but someone has to say it... by Yoda2 · · Score: 2, Funny

    But will it enlarge my penis?

  2. Archive Version (b/c it's a personal site) by Amsterdam+Vallon · · Score: 5, Interesting

    January 2003

    (This article was given as a talk at the 2003 Spam Conference. It describes the work I've done to improve the performance of the algorithm described in A Plan for Spam, and what I plan to do in the future.)

    The first discovery I'd like to present here is an algorithm for lazy evaluation of research papers. Just write whatever you want and don't cite any previous work, and indignant readers will send you references to all the papers you should have cited. I discovered this algorithm after ``A Plan for Spam'' [1] was on Slashdot.

    Spam filtering is a subset of text classification, which is a well established field, but the first papers about Bayesian spam filtering per se seem to have been two given at the same conference in 1998, one by Pantel and Lin [2], and another by a group from Microsoft Research [3].

    When I heard about this work I was a bit surprised. If people had been onto Bayesian filtering four years ago, why wasn't everyone using it? When I read the papers I found out why. Pantel and Lin's filter was the more effective of the two, but it only caught 92% of spam, with 1.16% false positives.

    When I tried writing a Bayesian spam filter, it caught 99.5% of spam with less than .03% false positives [4]. It's always alarming when two people trying the same experiment get widely divergent results. It's especially alarming here because those two sets of numbers might yield opposite conclusions. Different users have different requirements, but I think for many people a filtering rate of 92% with 1.16% false positives means that filtering is not an acceptable solution, whereas 99.5% with less than .03% false positives means that it is.

    So why did we get such different numbers? I haven't tried to reproduce Pantel and Lin's results, but from reading the paper I see five things that probably account for the difference.

    One is simply that they trained their filter on very little data: 160 spam and 466 nonspam mails. Filter performance should still be climbing with data sets that small. So their numbers may not even be an accurate measure of the performance of their algorithm, let alone of Bayesian spam filtering in general.

    But I think the most important difference is probably that they ignored message headers. To anyone who has worked on spam filters, this will seem a perverse decision. And yet in the very first filters I tried writing, I ignored the headers too. Why? Because I wanted to keep the problem neat. I didn't know much about mail headers then, and they seemed to me full of random stuff. There is a lesson here for filter writers: don't ignore data. You'd think this lesson would be too obvious to mention, but I've had to learn it several times.

    Third, Pantel and Lin stemmed the tokens, meaning they reduced e.g. both ``mailing'' and ``mailed'' to the root ``mail''. They may have felt they were forced to do this by the small size of their corpus, but if so this is a kind of premature optimization.

    Fourth, they calculated probabilities differently. They used all the tokens, whereas I only use the 15 most significant. If you use all the tokens you'll tend to miss longer spams, the type where someone tells you their life story up to the point where they got rich from some multilevel marketing scheme. And such an algorithm would be easy for spammers to spoof: just add a big chunk of random text to counterbalance the spam terms.

    Finally, they didn't bias against false positives. I think any spam filtering algorithm ought to have a convenient knob you can twist to decrease the false positive rate at the expense of the filtering rate. I do this by counting the occurrences of tokens in the nonspam corpus double.
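    A minimal sketch of how such a bias could be applied when estimating a token's spam probability. Only the doubling of nonspam counts comes from the text above; the ratio formula, the minimum of five occurrences, and the .01/.99 clamps are assumptions modeled on the earlier essay, and every name here is hypothetical.

```python
def token_spam_probability(good_count, bad_count, ngood, nbad,
                           good_weight=2, min_occurrences=5,
                           floor=0.01, ceiling=0.99):
    """Estimate the spam probability of one token from corpus counts.

    good_count / bad_count: occurrences in the nonspam / spam corpus.
    ngood / nbad: number of messages in each corpus.
    good_weight=2 is the knob: nonspam occurrences count double.
    """
    g = good_weight * good_count
    b = bad_count
    if g + b < min_occurrences:
        return None  # assumption: too rare to say anything about
    good_ratio = min(1.0, g / ngood)
    bad_ratio = min(1.0, b / nbad)
    p = bad_ratio / (good_ratio + bad_ratio)
    return max(floor, min(ceiling, p))


# A token seen equally often in both corpora leans innocent once
# nonspam occurrences are doubled.
biased = token_spam_probability(10, 10, 1000, 1000)
unbiased = token_spam_probability(10, 10, 1000, 1000, good_weight=1)
```

    Raising good_weight trades filtering rate for fewer false positives, which is the knob the paragraph describes.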

    I don't think it's a good idea to treat spam filtering as a straight text classification problem. You can use text classification techniques, but solutions can and should reflect the fact that the text is email, and spam in particular. Email is not just text; it has structure. Spam filtering is not just classification, because false positives are so much worse than false negatives that you should treat them as a different kind of error. And the source of error is not just random variation, but a live human spammer working actively to defeat your filter.

    Tokens

    Another project I heard about after the Slashdot article was Bill Yerazunis' CRM114 [5]. This is the counterexample to the design principle I just mentioned. It's a straight text classifier, but such a stunningly effective one that it manages to filter spam almost perfectly without even knowing that's what it's doing.

    Once I understood how CRM114 worked, it seemed inevitable that I would eventually have to move from filtering based on single words to an approach like this. But first, I thought, I'll see how far I can get with single words. And the answer is, surprisingly far.

    Mostly I've been working on smarter tokenization. On current spam, I've been able to achieve filtering rates that approach CRM114's. These techniques are mostly orthogonal to Bill's; an optimal solution might incorporate both.

    ``A Plan for Spam'' uses a very simple definition of a token. Letters, digits, dashes, apostrophes, and dollar signs are constituent characters, and everything else is a token separator. I also ignored case. Now I have a more complicated definition of a token:

    Case is preserved.

    Exclamation points are constituent characters.

    Periods and commas are constituents if they occur between two digits. This lets me get ip addresses and prices intact.

    A price range like $20-25 yields two tokens, $20 and $25.

    Tokens that occur within the To, From, Subject, and Return-Path lines, or within urls, get marked accordingly. E.g. ``foo'' in the Subject line becomes ``Subject*foo''. (The asterisk could be any character you don't allow as a constituent.)

    Such measures increase the filter's vocabulary, which makes it more discriminating. For example, in the current filter, ``free'' in the Subject line has a spam probability of 98%, whereas the same token in the body has a spam probability of only 65%.

    In the Plan for Spam filter, all these tokens would have had the same probability, .7602. That filter recognized about 23,000 tokens. The current one recognizes about 187,000.
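    The token rules listed above could be sketched roughly as follows. This is an illustration under assumptions, not the filter's actual code: the regular expressions, the function name, and the way a context label is passed in are all hypothetical.

```python
import re

# Constituents: letters, digits, dashes, apostrophes, dollar signs and
# exclamation points, plus a period or comma sitting between two digits.
TOKEN_RE = re.compile(r"(?:[A-Za-z0-9$'!-]|(?<=\d)[.,](?=\d))+")

# A price range such as $20-25.
PRICE_RANGE_RE = re.compile(r"\$(\d+(?:[.,]\d+)*)-(\d+(?:[.,]\d+)*)")


def tokenize(text, context=None):
    """Split text into case-preserving tokens.

    context is an optional label such as "Subject" or "Url"; when given,
    every token is marked as "<context>*<token>".
    """
    tokens = []
    for raw in TOKEN_RE.findall(text):
        match = PRICE_RANGE_RE.fullmatch(raw)
        if match:
            parts = ["$" + match.group(1), "$" + match.group(2)]
        else:
            parts = [raw]
        for part in parts:
            tokens.append(f"{context}*{part}" if context else part)
    return tokens


body_tokens = tokenize("FREE!!! Only $20-25, ships from 10.0.0.1.")
subject_tokens = tokenize("free offer", context="Subject")
```

    Here body_tokens keeps ``FREE!!!'' and ``10.0.0.1'' intact and splits the price range in two, while subject_tokens carries the ``Subject*'' marker.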

    The disadvantage of having a larger universe of tokens is that there is more chance of misses. Spreading your corpus out over more tokens has the same effect as making it smaller. If you consider exclamation points as constituents, for example, then you could end up not having a spam probability for free with seven exclamation points, even though you know that free with just two exclamation points has a probability of 99.99%.

    One solution to this is what I call degeneration. If you can't find an exact match for a token, treat it as if it were a less specific version. I consider terminal exclamation points, uppercase letters, and occurring in one of the five marked contexts as making a token more specific. For example, if I don't find a probability for ``Subject*free!'', I look for probabilities for ``Subject*free'', ``free!'', and ``free'', and take whichever one is farthest from .5.

    Here are the alternatives [7] considered if the filter sees ``FREE!!!'' in the Subject line and doesn't have a probability for it.

    If you do this, be sure to consider versions with initial caps as well as all uppercase and all lowercase. Spams tend to have more sentences in imperative voice, and in those the first word is a verb. So verbs with initial caps have higher spam probabilities than they would in all lowercase. In my filter, the spam probability of ``Act'' is 98% and for ``act'' only 62%.
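    One way to enumerate less specific versions of a token, as a sketch. The three axes (context marker, terminal exclamation points, case) come from the text; the particular reductions tried (all bangs, one, none; original case, initial cap, lowercase), the ordering, and the None result for a complete miss are assumptions, and the function names are hypothetical.

```python
def degenerate_forms(token):
    """List less specific versions of a token such as 'Subject*FREE!!!'."""
    context, sep, word = token.rpartition("*")
    stem = word.rstrip("!")
    bangs = len(word) - len(stem)

    # Case: only move toward lowercase, never toward more uppercase.
    case_variants = [stem]
    if stem != stem.lower():
        for variant in (stem.capitalize(), stem.lower()):
            if variant not in case_variants:
                case_variants.append(variant)

    # Exclamation points: the original count, then one, then none.
    bang_variants = []
    for n in (bangs, 1, 0):
        if n <= bangs and n not in bang_variants:
            bang_variants.append(n)

    # Context: marked first, then unmarked.
    prefixes = [context + "*", ""] if sep else [""]

    forms = []
    for prefix in prefixes:
        for n in bang_variants:
            for variant in case_variants:
                form = prefix + variant + "!" * n
                if form != token and form not in forms:
                    forms.append(form)
    return forms


def lookup_probability(token, probs):
    """Exact match if present, else the degenerate form farthest from .5."""
    if token in probs:
        return probs[token]
    found = [probs[f] for f in degenerate_forms(token) if f in probs]
    if not found:
        return None  # assumption: the caller decides how to treat unknowns
    return max(found, key=lambda p: abs(p - 0.5))


forms = degenerate_forms("Subject*free!")
# forms == ["Subject*free", "free!", "free"]
```

    With this scheme ``Subject*FREE!!!'' yields seventeen candidates, running from ``Subject*Free!!!'' down to plain ``free''.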

    If you increase your filter's vocabulary, you can end up counting the same word multiple times, according to your old definition of ``same''. Logically, they're not the same token anymore. But if this still bothers you, let me add from experience that the words you seem to be counting multiple times tend to be exactly the ones you'd want to.

    Another effect of a larger vocabulary is that when you look at an incoming mail you find more interesting tokens, meaning those with probabilities far from .5. I use the 15 most interesting to decide if mail is spam. But you can run into a problem when you use a fixed number like this. If you find a lot of maximally interesting tokens, the result can end up being decided by whatever random factor determines the ordering of equally interesting tokens. One way to deal with this is to treat some as more interesting than others.
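    A sketch of scoring a message from its 15 most interesting tokens. The selection by distance from .5 is from the text; the combining formula is the naive Bayesian one with equal priors described in note [1], written here from memory of the earlier essay, so treat it as an assumption. The function name is hypothetical.

```python
import math


def spam_score(token_probs, n=15):
    """Combine the n most interesting token probabilities.

    token_probs maps each distinct token in a message to its spam
    probability, so no token is counted twice.
    """
    interesting = sorted(token_probs.values(),
                         key=lambda p: abs(p - 0.5),
                         reverse=True)[:n]
    spam = math.prod(interesting)
    ham = math.prod(1.0 - p for p in interesting)
    return spam / (spam + ham)


score = spam_score({"Subject*FREE!!!": 0.99, "Url*optmails": 0.99, "meeting": 0.2})
```

    Because sorted() is stable, tokens that tie on interestingness keep their dictionary order, which is exactly the kind of arbitrary factor the paragraph warns about when many tokens sit at the same threshold.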

    For example, the token ``dalco'' occurs 3 times in my spam corpus and never in my legitimate corpus. The token ``Url*optmails'' (meaning ``optmails'' within a url) occurs 1223 times. And yet, as I used to calculate probabilities for tokens, both would have the same spam probability, the threshold of .99.

    That doesn't feel right. There are theoretical arguments for giving these two tokens substantially different probabilities (Pantel and Lin do), but I haven't tried that yet. It does seem at least that if we find more than 15 tokens that only occur in one corpus or the other, we ought to give priority to the ones that occur a lot. So now there are two threshold values. For tokens that occur only in the spam corpus, the probability is .9999 if they occur more than 10 times and .9998 otherwise. Ditto at the other end of the scale for tokens found only in the legitimate corpus.
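    The two-level threshold could look like the sketch below. The .9999/.9998 values and the cutoff of 10 occurrences are from the text; the mirrored .0001/.0002 values are a reading of ``ditto at the other end of the scale'', and the function name is hypothetical.

```python
def one_sided_probability(good_count, bad_count, many=10):
    """Probability for a token seen in only one of the two corpora.

    Returns None when the token occurs in both corpora (or in neither),
    in which case the ordinary calculation applies.
    """
    if bad_count > 0 and good_count == 0:
        return 0.9999 if bad_count > many else 0.9998
    if good_count > 0 and bad_count == 0:
        return 0.0001 if good_count > many else 0.0002
    return None


dalco = one_sided_probability(0, 3)        # rare spam-only token
optmails = one_sided_probability(0, 1223)  # frequent spam-only token
```

    The difference is tiny, but it is enough for a sort by distance from .5 to put the frequent token ahead of the rare one.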

    I may later scale token probabilities substantially, but this tiny amount of scaling at least ensures that tokens get sorted the right way.

    Another possibility would be to consider not just 15 tokens, but all the tokens over a certain threshold of interestingness. Steven Hauser does this in his statistical spam filter [8]. If you use a threshold, make it very high, or spammers could spoof you by packing messages with more innocent words.

    Finally, what should one do about html? I've tried the whole spectrum of options, from ignoring it to parsing it all. Ignoring html is a bad idea, because it's full of useful spam signs. But if you parse it all, your filter might degenerate into a mere html recognizer. The most effective approach seems to be the middle course, to notice some tokens but not others. I look at a, img, and font tags, and ignore the rest. Links and images you should certainly look at, because they contain urls.
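    A sketch of the middle course using Python's standard html.parser module. Only the choice of a, img, and font tags comes from the text; what is collected from those tags (their attribute values) and the class name are assumptions.

```python
from html.parser import HTMLParser


class SelectiveTagParser(HTMLParser):
    """Keep visible text plus attributes of a, img and font tags only."""

    KEPT_TAGS = {"a", "img", "font"}

    def __init__(self):
        super().__init__()
        self.text_chunks = []  # visible text between tags
        self.tag_chunks = []   # (tag, attribute, value) for kept tags

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag in self.KEPT_TAGS:
            for name, value in attrs:
                if value:
                    self.tag_chunks.append((tag, name, value))

    def handle_data(self, data):
        if data.strip():
            self.text_chunks.append(data.strip())


parser = SelectiveTagParser()
parser.feed('<table><tr><td><font color="red">Buy now</font> '
            '<a href="http://example.com/x">here</a>'
            '<img src="http://example.com/p.gif"></td></tr></table>')
parser.close()
```

    The href and src values collected this way are where url tokens would come from; the table markup is dropped entirely.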

    I could probably be smarter about dealing with html, but I don't think it's worth putting a lot of time into this. Spams full of html are easy to filter. The smarter spammers already avoid it. So performance in the future should not depend much on how you deal with html.

    Performance

    Between December 10 2002 and January 10 2003 I got about 1750 spams. Of these, 4 got through. That's a filtering rate of about 99.75%.

    Two of the four spams I missed got through because they happened to use words that occur often in my legitimate email.

    The third was one of those that exploit an insecure cgi script to send mail to third parties. They're hard to filter based just on the content because the headers are innocent and they're careful about the words they use. Even so I can usually catch them. This one squeaked by with a probability of .88, just under the threshold of .9.

    Of course, looking at multiple token sequences would catch it easily. ``Below is the result of your feedback form'' is an instant giveaway.

    The fourth spam was what I call a spam-of-the-future, because this is what I expect spam to evolve into: some completely neutral text followed by a url. In this case it was from someone saying they had finally finished their homepage and would I go look at it. (The page was of course an ad for a porn site.)

    If the spammers are careful about the headers and use a fresh url, there is nothing in spam-of-the-future for filters to notice. We can of course counter by sending a crawler to look at the page. But that might not be necessary. The response rate for spam-of-the-future must be low, or everyone would be doing it. If it's low enough, it won't pay for spammers to send it, and we won't have to work too hard on filtering it.

    Now for the really shocking news: during that same one-month period I got three false positives.

    In a way it's a relief to get some false positives. When I wrote ``A Plan for Spam'' I hadn't had any, and I didn't know what they'd be like. Now that I've had a few, I'm relieved to find they're not as bad as I feared. False positives yielded by statistical filters turn out to be mails that sound a lot like spam, and these tend to be the ones you would least mind missing [9].

    Two of the false positives were newsletters from companies I've bought things from. I never asked to receive them, so arguably they were spams, but I count them as false positives because I hadn't been deleting them as spams before. The reason the filters caught them was that both companies in January switched to commercial email senders instead of sending the mails from their own servers, and both the headers and the bodies became much spammier.

    The third false positive was a bad one, though. It was from someone in Egypt and written in all uppercase. This was a direct result of making tokens case sensitive; the Plan for Spam filter wouldn't have caught it.

    It's hard to say what the overall false positive rate is, because we're up in the noise, statistically. Anyone who has worked on filters (at least, effective filters) will be aware of this problem. With some emails it's hard to say whether they're spam or not, and these are the ones you end up looking at when you get filters really tight. For example, so far the filter has caught two emails that were sent to my address because of a typo, and one sent to me in the belief that I was someone else. Arguably, these are neither my spam nor my nonspam mail.

    Another false positive was from a vice president at Virtumundo. I wrote to them pretending to be a customer, and since the reply came back through Virtumundo's mail servers it had the most incriminating headers imaginable. Arguably this isn't a real false positive either, but a sort of Heisenberg uncertainty effect: I only got it because I was writing about spam filtering.

    Not counting these, I've had a total of five false positives so far, out of about 7740 legitimate emails, a rate of .06%. The other two were a notice that something I bought was back-ordered, and a party reminder from Evite.

    I don't think this number can be trusted, partly because the sample is so small, and partly because I think I can fix the filter not to catch some of these.

    False positives seem to me a different kind of error from false negatives. Filtering rate is a measure of performance. False positives I consider more like bugs. I approach improving the filtering rate as optimization, and decreasing false positives as debugging.

    So these five false positives are my bug list. For example, the mail from Egypt got nailed because the uppercase text made it look to the filter like a Nigerian spam. This really is kind of a bug. As with html, the email being all uppercase is really conceptually one feature, not one for each word. I need to handle case in a more sophisticated way.

    So what to make of this .06%? Not much, I think. You could treat it as an upper bound, bearing in mind the small sample size. But at this stage it is more a measure of the bugs in my implementation than some intrinsic false positive rate of Bayesian filtering.

    Future

    What next? Filtering is an optimization problem, and the key to optimization is profiling. Don't try to guess where your code is slow, because you'll guess wrong. Look at where your code is slow, and fix that. In filtering, this translates to: look at the spams you miss, and figure out what you could have done to catch them.

    For example, spammers are now working aggressively to evade filters, and one of the things they're doing is breaking up and misspelling words to prevent filters from recognizing them. But working on this is not my first priority, because I still have no trouble catching these spams [10].

    There are two kinds of spams I currently do have trouble with. One is the type that pretends to be an email from a woman inviting you to go chat with her or see her profile on a dating site. These get through because they're the one type of sales pitch you can make without using sales talk. They use the same vocabulary as ordinary email.

    The other kind of spams I have trouble filtering are those from companies in e.g. Bulgaria offering contract programming services. These get through because I'm a programmer too, and the spams are full of the same words as my real mail.

    I'll probably focus on the personal ad type first. I think if I look closer I'll be able to find statistical differences between these and my real mail. The style of writing is certainly different, though it may take multiword filtering to catch that. Also, I notice they tend to repeat the url, and someone including a url in a legitimate mail wouldn't do that [11].

    The outsourcing type are going to be hard to catch. Even if you sent a crawler to the site, you wouldn't find a smoking statistical gun. Maybe the only answer is a central list of domains advertised in spams [12]. But there can't be that many of this type of mail. If the only spams left were unsolicited offers of contract programming services from Bulgaria, we could all probably move on to working on something else.

    Will statistical filtering actually get us to that point? I don't know. Right now, for me personally, spam is not a problem. But spammers haven't yet made a serious effort to spoof statistical filters. What will happen when they do?

    I'm not optimistic about filters that work at the network level [13]. When there is a static obstacle worth getting past, spammers are pretty efficient at getting past it. There is already a company called Assurance Systems that will run your mail through Spamassassin and tell you whether it will get filtered out.

    Network-level filters won't be completely useless. They may be enough to kill all the "opt-in" spam, meaning spam from companies like Virtumundo and Equalamail who claim that they're really running opt-in lists. You can filter those based just on the headers, no matter what they say in the body. But anyone willing to falsify headers or use open relays, presumably including most porn spammers, should be able to get some message past network-level filters if they want to. (By no means the message they'd like to send though, which is something.)

    The kind of filters I'm optimistic about are ones that calculate probabilities based on each individual user's mail. These can be much more effective, not only in avoiding false positives, but in filtering too: for example, finding the recipient's email address base-64 encoded anywhere in a message is a very good spam indicator.
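    A rough sketch of checking for a base-64 encoded recipient address. The detection strategy (decode every long run of base-64 characters at each of the four possible alignments) is an assumption, as are the names; it ignores line-wrapped encodings and assumes an ASCII address.

```python
import base64
import binascii
import re

# Runs of base-64 alphabet characters long enough to hide an address.
B64_RUN = re.compile(r"[A-Za-z0-9+/]{16,}")


def has_encoded_address(message, address):
    """True if the address shows up inside base-64 text in the message."""
    needle = address.lower().encode("ascii")
    for run in B64_RUN.findall(message):
        for offset in range(4):  # try each 4-character alignment
            chunk = run[offset:]
            remainder = len(chunk) % 4
            if remainder == 1:
                chunk = chunk[:-1]  # a lone trailing character is undecodable
            elif remainder:
                chunk += "=" * (4 - remainder)
            try:
                decoded = base64.b64decode(chunk)
            except binascii.Error:
                continue
            if needle in decoded.lower():
                return True
    return False


encoded = base64.b64encode(b"Dear user@example.com, buy now").decode("ascii")
hit = has_encoded_address("Hello " + encoded + " bye", "user@example.com")
miss = has_encoded_address("Hello user@example.com, plain text", "user@example.com")
```

    A per-user filter can afford this check because it knows exactly whose address to look for.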

    But the real advantage of individual filters is that they'll all be different. If everyone's filters have different probabilities, it will make the spammers' optimization loop, what programmers would call their edit-compile-test cycle, appallingly slow. Instead of just tweaking a spam till it gets through a copy of some filter they have on their desktop, they'll have to do a test mailing for each tweak. It would be like programming in a language without an interactive toplevel, and I wouldn't wish that on anyone.

    Notes

    [1] Paul Graham. ``A Plan for Spam.'' August 2002. http://paulgraham.com/spam.html.

    Probabilities in this algorithm are calculated using a degenerate case of Bayes' Rule. There are two simplifying assumptions: that the probabilities of features (i.e. words) are independent, and that we know nothing about the prior probability of an email being spam.

    The first assumption is widespread in text classification. Algorithms that use it are called ``naive Bayesian.''

    The second assumption I made because the proportion of spam in my incoming mail fluctuated so much from day to day (indeed, from hour to hour) that the overall prior ratio seemed worthless as a predictor. If you assume that P(spam) and P(nonspam) are both .5, they cancel out and you can remove them from the formula.
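    Written out, the two assumptions lead to the following, where S means spam, the w_i are the tokens considered, and p_i is the per-token spam probability. The last line, expressed in terms of p_i, is reconstructed from the earlier essay rather than stated here, so read it as an assumption.

```latex
\begin{align*}
P(S \mid w_1,\dots,w_n)
  &= \frac{P(S)\prod_i P(w_i \mid S)}
          {P(S)\prod_i P(w_i \mid S) + P(\bar S)\prod_i P(w_i \mid \bar S)} \\
  &= \frac{\prod_i P(w_i \mid S)}
          {\prod_i P(w_i \mid S) + \prod_i P(w_i \mid \bar S)}
     \qquad \text{when } P(S) = P(\bar S) = .5 \\
  &= \frac{\prod_i p_i}{\prod_i p_i + \prod_i (1 - p_i)},
     \qquad p_i = \frac{P(w_i \mid S)}{P(w_i \mid S) + P(w_i \mid \bar S)}
\end{align*}
```

    The first line uses independence, the second cancels the equal priors, and the third follows because p_i / (1 - p_i) equals P(w_i | S) / P(w_i | not S).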

    If you were doing Bayesian filtering in a situation where the ratio of spam to nonspam was consistently very high or (especially) very low, you could probably improve filter performance by incorporating prior probabilities. To do this right you'd have to track ratios by time of day, because spam and legitimate mail volume both have distinct daily patterns.

    [2] Patrick Pantel and Dekang Lin. ``SpamCop-- A Spam Classification & Organization Program.'' Proceedings of AAAI-98 Workshop on Learning for Text Categorization.

    [3] Mehran Sahami, Susan Dumais, David Heckerman and Eric Horvitz. ``A Bayesian Approach to Filtering Junk E-Mail.'' Proceedings of AAAI-98 Workshop on Learning for Text Categorization.

    [4] At the time I had zero false positives out of about 4,000 legitimate emails. If the next legitimate email was a false positive, this would give us .03%. These false positive rates are untrustworthy, as I explain later. I quote a number here only to emphasize that whatever the false positive rate is, it is less than 1.16%.

    [5] Bill Yerazunis. ``Sparse Binary Polynomial Hash Message Filtering and The CRM114 Discriminator.'' Proceedings of 2003 Spam Conference.

    [6] In ``A Plan for Spam'' I used thresholds of .99 and .01. It seems justifiable to use thresholds proportionate to the size of the corpora. Since I now have on the order of 10,000 of each type of mail, I use .9999 and .0001.

    [7] There is a flaw here I should probably fix. Currently, when ``Subject*foo'' degenerates to just ``foo'', what that means is you're getting the stats for occurrences of ``foo'' in the body or header lines other than those I mark. What I should do is keep track of statistics for ``foo'' overall as well as specific versions, and degenerate from ``Subject*foo'' not to ``foo'' but to ``Anywhere*foo''. Ditto for case: I should degenerate from uppercase to any-case, not lowercase.

    It would probably be a win to do this with prices too, e.g. to degenerate from ``$129.99'' to ``$--9.99'', ``$--.99'', and ``$--''.
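    The price example can be sketched directly. The three output forms are from the note; the regular expression (a dollar amount with exactly two decimal places) and the function name are assumptions.

```python
import re

PRICE_RE = re.compile(r"\$(\d+)\.(\d\d)")


def degenerate_price(token):
    """Less specific versions of a price token, most specific first."""
    match = PRICE_RE.fullmatch(token)
    if not match:
        return []
    dollars, cents = match.groups()
    return [f"$--{dollars[-1]}.{cents}", f"$--.{cents}", "$--"]


forms = degenerate_price("$129.99")
# forms == ["$--9.99", "$--.99", "$--"]
```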

    You could also degenerate from words to their stems, but this would probably only improve filtering rates early on when you had small corpora.

    [8] Steven Hauser. ``Statistical Spam Filter Works for Me.'' http://www.sofbot.com.

    [9] False positives are not all equal, and we should remember this when comparing techniques for stopping spam. Whereas many of the false positives caused by filters will be near-spams that you wouldn't mind missing, false positives caused by blacklists, for example, will be just mail from people who chose the wrong ISP. In both cases you catch mail that's near spam, but for blacklists nearness is physical, and for filters it's textual.

    In fairness, it should be added that the new generation of responsible blacklists, like the SBL, cause far fewer false positives than earlier blacklists like the MAPS RBL, for whom causing large numbers of false positives was a deliberate technique to get the attention of ISPs.

    [10] If spammers get good enough at obscuring tokens for this to be a problem, we can respond by simply removing whitespace, periods, commas, etc. and using a dictionary to pick the words out of the resulting sequence. And of course finding words this way that weren't visible in the original text would in itself be evidence of spam.

    Picking out the words won't be trivial. It will require more than just reconstructing word boundaries; spammers both add (``xHot nPorn cSite'') and omit (``P#rn'') letters. Vision research may be useful here, since human vision is the limit that such tricks will approach.

    [11] In general, spams are more repetitive than regular email. They want to pound that message home. I currently don't allow duplicates in the top 15 tokens, because you could get a false positive if the sender happens to use some bad word multiple times. (In my current filter, ``dick'' has a spam probability of .9999, but it's also a name.) It seems we should at least notice duplication though, so I may try allowing up to two of each token, as Brian Burton does in SpamProbe.

    [12] This is what approaches like Brightmail's will degenerate into once spammers are pushed into using mad-lib techniques to generate everything else in the message.

    [13] It's sometimes argued that we should be working on filtering at the network level, because it is more efficient. What people usually mean when they say this is: we currently filter at the network level, and we don't want to start over from scratch. But you can't dictate the problem to fit your solution.

    Historically, scarce-resource arguments have been the losing side in debates about software design. People only tend to use them to justify choices (inaction in particular) made for other reasons.

    Thanks to Sarah Harlin, Trevor Blackwell, and Dan Giffin for reading drafts of this paper, and to Dan again for most of the infrastructure that this filter runs on.

    --

    Reply or e-mail; don't vaguely moderate. Ex-O'Reilly/MIT employee, now a full-time Google employee.
  3. Problem you say? by termos · · Score: 3, Funny

    rm -fr ~/Mail
    would do the trick.

    --
    Note to self: get smarter troll to guard door.
  4. Stop spam? by slykens · · Score: 5, Interesting
    Filtering is nice, I've been using SpamAssassin with reasonable results for the last few months. It has nearly no false positives but has recently been missing more. Perhaps I should update.

    Anyway, I've said a few times the only way to effectively stop spam is to make it more expensive to the companies having it done. Filtering, blocking ports, refusing mail from RBL'd hosts all help, but it will not stop until it is fully against the law and people bring legal action to stop it.

    Even people who are supposed to be clueful don't get it. I got spammed to buy EZ-Pass for the PA Turnpike. I sent a nastygram to the state DoT. The keyboard monkey responded that I should look closely at the email, that I signed up to receive it. If I had a dollar for every site that claimed I signed up with them I would be rich. What an idiot.

    1. Re:Stop spam? by Mournblade · · Score: 2, Informative

      Just curious - did you follow up w/ him to see *why* he thought you signed up to receive the spam? Is it possible that you inadvertently allowed them to send you spam the last time you renewed your driver's license? I ask because most of the spams I get say "you signed up with one of our partner sites" and I've always wanted to (but have been too lazy to) go back and see how far up the chain I could get.

    2. Re:Stop spam? by Anonymous Coward · · Score: 2, Insightful

      You write to a state office, get a completely clueless reply and now ask for a legislative solution? You're quite the optimist, aren't you?

    3. Re:Stop spam? by rograndom · · Score: 4, Insightful
      Filtering is nice, I've been using SpamAssassin with reasonable results for the last few months. It has nearly no false positives but has recently been missing more. Perhaps I should update.

      Actually spamassassin has a nice built-in reporting tool
      spamassassin -r < *mailmessage*
      And if you set it up to work with Vipul's Razor, it's all automagically updated.
    4. Re:Stop spam? by Deltan · · Score: 4, Insightful

      Correction.. spam will never stop... ever.

      You say that it will stop if it's fully against the law and people bring legal action to stop it.

      Last time I checked, murder was illegal, punishable by death in many states, yet it still occurs.

    5. Re:Stop spam? by CoughDropAddict · · Score: 3, Insightful

      Last time I checked, murder was illegal, punishable by death in many states, yet it still occurs.

      People spam because it is rational to do so (or at least spammers make them think so). Very low costs, the possibility of a good return, and nothing to lose since there are virtually no spam laws.

      A better comparison than murder is the practice of child labor. While it was legal it was a rational practice to engage in, because the return was high and the risk was low -- if a kid gets eaten by a machine you just find another kid. Now that it is illegal the practice is almost completely extinct because it is no longer rational -- the police would come knocking at the door, which impedes the goal of running a profitable business.

    6. Re:Stop spam? by SmittyTheBold · · Score: 2, Funny

      Correction.. spam will never stop... ever.

      . . .

      Last time I checked, murder was illegal, punishable by death in many states, yet it still occurs.


      Spam is a means to an end - selling your shit to gullible people. Murder is not just a means, but an end in itself. When you want someone dead, there's not really another way around it. With spam, there's always telemarketing and pop-ups.

      In addition, murder can be a crime of passion, while spamming is hardly such. I can't remember ever thinking "Oh that bastard cut me off! I'll help him increase his penis size, then give him a work-at-home job! Oh I'm JUST SOOOO ANGRY!"

      --
      ± 29 dB
  5. Why can't we have legal restrictions on spam? by GGardner · · Score: 5, Interesting
    Conventional wisdom seems to say that we can't outlaw spam. I don't understand why this is. My state has a do not call list. Since signing up for it, I have gotten zero phone solicitations, down from 2 or 3 a day. It is illegal to make a phone solicitation to a cell phone, and also, I get zero phone spams on my cell phone.

    Some states, like California, have anti-spam laws, but curiously, they only cover spam sent from California to California. My state's telephone do-not-call list covers all calls to my number, no matter where they originate.

    Now, I understand that there would be problems with international spam, but stopping domestic spam would be a huge boon to everyone. It seems like this legislation would be wildly popular, and easy to pass.

    1. Re:Why can't we have legal restrictions on spam? by Steve+B · · Score: 5, Insightful
      Because the last thing we need in this country is the government telling us how and when we can send email or make a phone call.

      In certain ways, the government does and should do precisely that. If I repeatedly call you at 4 AM to ask if your refrigerator is running or deliberately send you virus-laden e-mail, then you have every right to call upon the long arm of the law to slap down the harassment.

      Spamming, being a violation of the recipient's property rights, falls into that category.

      --
      /. If the government wants us to respect the law, it should set a better example.
    2. Re:Why can't we have legal restrictions on spam? by waveclaw · · Score: 2, Insightful
      Conventional wisdom seems to say that we can't outlaw spam. I don't understand why this is.

      Traditionally and in general, anything on the 'net that can be achieved through both technical means and legal recourse is almost always implemented via the technical route.


      The reasons for this are many; the major reasons are simple. While most people on the 'net have not been lawyers, most of the first people - esp. USENET users - were engineers and scientists. Such people develop a disdain for legal recourse after spending (wasting) so much of their time in political and legal battles in the *real world* justifying and defending their work and themselves. Just ask any graduate student standing in line at his college Bursar's office how he feels about contracts and (non-technical) paperwork.

      Furthermore, by avoiding the often easy to circumvent and hard to quantify political avenue, the solutions are usually more effective in both the short and long term. Many solutions, such as the Bayesian SPAM filtering discussed here, also give these technical people a chance to prove their worth or gain some small measure of fame by association with a good solution.

      Remember: Conventional Wisdom is an oxymoron. There are always reasons for something, even if those reasons are nothing but hubris and desire. It is up to you to accept or change them.

      --

      "You cannot have a General Will unless you have shared experiences. You cannot be fair to people you don't know."
    3. Re:Why can't we have legal restrictions on spam? by babbage · · Score: 4, Informative
      Please take a look at my notes on last week's spam conference, and in particular the Jon Praed notes (near the end; two speakers came after him).

      Praed argued, very eloquently & persuasively (hey, he's a lawyer :) that there are laws on the books banning spam in nearly every state. All you have to do is find a way to bring those laws to your assistance. In particular, note that:

      • Ever have a hard time tracking down a spammer? Ever have one that spoofed message headers? Gee, that sounds like fraud, doesn't it? Indeed it does -- much or even all spam can be considered as fraud, and as such you can attack it from that angle anywhere in the country.
      • Laws are pending in various jurisdictions to outlaw spammers' bulk mail software. The catch here is that there is a lot of legitimate bulk mail software that can be abused -- think majordomo, MailMan, etc -- so any laws crafted will have to include clauses that protect legitimate use of such software while banning UCE somehow. Watch for this to develop over time.
      • Suggestion: if you get spam that mentions a trademarked product (Viagra, pirated copies of well known software, etc), forward the message to the holder of that trademark. They will almost always be keenly interested in this abuse of their trade name, and will take it upon themselves to go after the spammer.
      • If you are in the habit of reporting spam to an organization like SpamCop, do so as quickly as possible: spammers are getting in the habit of leaving their ads up long enough for recipients to respond to, but pulling them down before investigators get a chance to scrutinize anything. The faster these groups can analyze the sources of spam, the better the chances of getting all the way back to the source.
      • Final and most important point: the precedent set by the Verizon vs. Ralsky case was very valuable to anti-spam efforts. First, it established that spam prosecution can be carried out in the jurisdiction where the harm occurred, not where the person doing harm was when causing it. So if California has anti-spam laws, they can potentially be used no matter where the spammer lives. Praed practices law in Virginia, so I'm assuming that their laws are amenable to this kind of application. Second, ignorance of an ISP's acceptable use policy (AUP) is no defence in court -- certain etiquette standards have emerged over time, and it is assumed that the sender of UCE has to be aware of these standards. As a result, if your ISP has an AUP that forbids UCE, this can be a tangible protection for you in court. This is very good news!

      As a lawyer that has successfully prosecuted a number of spammers, Praed was able to talk about all of this with some authority. He cautioned everyone though that laws will never eradicate spam -- as he put it, "people still rob banks since that's where the money is". But legislation & prosecution can still be a very valuable tool in fighting spam, and an important supplement to things like better mail filters. This is a big problem, and is going to need a variety of tiered solutions to control it.

    4. Re:Why can't we have legal restrictions on spam? by Jadrano · · Score: 2, Informative

      It's not just about the US, in many European countries spam is illegal already now (clear cases are Norway and Austria), and the European Union as a whole has decided to outlaw spam, it should be implemented this year. I don't know exactly about the situation in East Asia, but I don't think the Chinese and Koreans like it too much that their resources are misused for sending spam all over the world, so they could follow soon. Yes, there certainly will be some smaller countries where spam is still legal, but once spam is illegal in the European Union, the United States, China and many other big countries no one who has sent thousands of spam mails to harvested addresses can reasonably claim that he or she believed that all the addresses were only of people in a few offshore countries.
      Furthermore, the US American conception of law has, as far as I know, the principle of being applicable extraterritorially, which is in general quite controversial, but could be useful here - it would probably be possible to forbid any companies that do business in the US to send spam, even if the spam is only sent from other countries and only to people living outside the United States.

  6. AOL or Hotmail adopt? by twemperor · · Score: 3, Interesting

    I really like this analytic approach. I've been using Hotmail's spam filtering, which merely removes e-mails from addresses not in my address book. While this is most of the time effective and very easy to implement, there does seem to be a major problem with false positives. ie I give my e-mail to someone, who's not in my address book.

    Does anyone think AOL or Hotmail could start using such a system as the one outlined in the article?

    1. Re:AOL or Hotmail adopt? by Anonvmous+Coward · · Score: 4, Insightful

      "Does anyone think AOL or Hotmail could start using such a system as the one outlined in the article?"

      No. My problem's with the senders, not the messages. What Hotmail should do is send back an email saying "Your message has been rejected because you have not been authorized by this user. If you'd like to request authorization, click here and follow the instructions."

      When they properly fill out the form, you get a message saying "so'n'so wants to send you a message. Interested?" and you can say yes/no. If you say yes, they get added to your address book and they can email you until you remove them from it.

      With this approach, it requires a valid return address before the message can possibly get to you. That means you're able to tell the person to remove you, unlike today's 'send anything to anybody' system.

      If Hotmail did that, I'd actually consider paying for their service.

  7. Spam and AI by cybermace5 · · Score: 5, Funny

    And the conflict rages on. The better filters we use, the sneakier the spam artists get. Now we're developing self-modifying algorithms to detect and kill spam, and I'm sure the spammers are developing self-modifying algorithms to craft filter-tricking spam.

    How long before the back-and-forth of spam filters and spam crafters becomes self-aware? It's got to happen. Eventually the spam filters will become a skeptic consciousness that *feels* its way through spam and spots the phoneys, and the spam crafters will become a persuasive consciousness that tries to think and write as a close friend or relative.

    --
    ...
    1. Re:Spam and AI by hrieke · · Score: 4, Funny

      Okay, so we build an AI and then torture the poor thing with insane emails about penis enlargers and the like?
      No wonder Skynet rebelled.

      --
      III.IIVIVIXIIVIVIIIVVIIIIXVIIIXIIIIIIIIVIIIIVVIIIV IIVIIIIIIVIII...
    2. Re:Spam and AI by GreyPoopon · · Score: 4, Funny
      ...and the spam crafters will become a persuasive consciousness that tries to think and write as a close friend or relative.

      Hey bro,

      I meant to talk to you about this last time you and your girlfriend were visiting, but I wasn't sure how to bring it up. I sort of found out through some less, ahem, discreet members of our family that you're a bit unhappy with the size of your member, if you know what I mean. If this is true, there's this web site I'd like to recommend that will probably be able to help you. It'll cost you a little, but it's worth every penny. I was too embarrassed to say anything about it before, but I gave their offering a try last year and both my wife and I are really happy with the results.

      {insert html link here...}

      The choice is yours, dude. I just want you to be happy.

      Love,
      Your bro
      --

      GreyPoopon
      --
      Why is it I can write insightful comments but can't come up with a clever signature?

  8. better than legislation by Rojo^ · · Score: 5, Interesting

    This is a wonderful tool that is being developed. However, I don't think any one tool will succeed in eliminating spam. From a spammer's point of view, if my income depends on messages making it through filters, by damn I will bypass those filters by whatever means I can. These assholes send penis enlargement advertisements to my mother -- If her gender doesn't stop them, neither will an email filter.

    On a different subject, in a story about a week ago, someone posted a link to a peer-peer network of spam emails for MS Outlook available at http://www.cloudmark.com that will trap a significant amount of emails based on (and this is overly simplified, of course) users' votes. Does such a solution exist in the open source world?

    --
    <:
    1. Re:better than legislation by rgmoore · · Score: 2, Insightful
      This is a wonderful tool that is being developed. However, I don't think any one tool will succeed in eliminating spam. From a spammer's point of view, if my income depends on messages making it through filters, by damn I will bypass those filters by whatever means I can. These assholes send penis enlargement advertisements to my mother -- If her gender doesn't stop them, neither will an email filter.

      I hear this argument and variations on it from time to time, but the more I consider it the more flawed it looks to me. There are really two kinds of filters to consider:

      1. ISP-level filters applied at a network level by a third party.
      2. Personal filters applied at an individual level by the target of the spam.

      These two things are not at all equivalent to the spammer because of the psychology of spam. Fundamentally, email readers are likely to fall into two fairly tight categories: suckers who will listen to spam and non-suckers who won't. Anyone who applies his own personal email filter is likely to fall into the non-sucker category, so there's little point in designing a message specifically to bypass those personal filters. The target won't buy your product even if you do get it past his filter. That's not the case with ISP level filters, though, which protect suckers and non-suckers alike. Those are worth bypassing because they're stopping some email that would get to the suckers who would buy your product.

      Now it may be the case that the same techniques that are useful for avoiding ISP-level filters will also help get mail past personal filters. That even seems likely, given that many people use ISP-type filters for their personal mail because the ISPs don't do it for them. But it seems to me that there's little percentage in specifically trying to avoid personal level filters that work on a different system from the ISP-level filters because the simple fact that somebody is bothering to use the filter implies that he won't buy from the spammer anyway.

      --

      There's no point in questioning authority if you aren't going to listen to the answers.

    2. Re:better than legislation by Thing+1 · · Score: 3, Informative
      On a different subject, in a story about a week ago, someone posted a link to a peer-peer network of spam emails for MS Outlook available at http://www.cloudmark.com that will trap a significant amount of emails based on (and this is overly simplified, of course) users' votes. Does such a solution exist in the open source world?

      Hi, that was me. Unfortunately this only works for Outlook (not even Outlook Express), but it's been working great for me.

      As others have pointed out, Vipul's Razor is a great open-source solution.

      Checking SourceForge, I found the following additional packages:

      BogoFilter

      SpamAssassin

      JoeEmail

      Bayesian anti-spam classifier

      Anti-Spam SMTP Proxy Server

      Bayesian Mail Filter

      JunkFilter

      SpamProbe - fast bayesian spam filter

      Mailfilter

      IMAPAssassin

      That's just from the first page of search results. If you'd like to see all the results (I did a search for "spam" from their search box), click here.

      --
      I feel fantastic, and I'm still alive.
  9. What's wrong with spam? by Amsterdam+Vallon · · Score: 5, Funny

    Without spam, how else would I be able to sit home every day and make $1,000 a week watching TV while playing with my 12 inch penis?

    --

    Reply or e-mail; don't vaguely moderate. Ex-O'Reilly/MIT employee, now a full-time Google employee.
  10. Spamassassin and ENDING spam.... by ajs · · Score: 5, Informative

    The latest development version of SpamAssassin has an interesting application of Bayesian filtering. Basically, it takes all of SA's existing heuristics, uses them to develop a sense of what is and is not spam, and then pumps the results through a Bayesian filter that learns from these messages.

    As with any other SA test, no single element of the chain is trusted enough to definitively call something spam, but if a message would have squeaked through before, this new filter can put the final nail in its coffin through word analysis against previous spam.

    So, why did I use a subject about "ENDING spam"? Because one of the tools that spammers have is SA itself. They can use it to score their messages and determine how "spamish" it is. The problem now is that each SA installation will have subtly different scoring, and the message may be "ok" according to the spammer's version, but my version has a better sense of the mail that *I* get.

    SpamAssassin is definitely a tool worth checking out if you have not already. Install it in daemon mode (spamd) and then use "spamc -f" in your procmailrc or the equiv for your MTA.

    Very nice tool, and a real time-saver for me.

    1. Re:Spamassassin and ENDING spam.... by ajs · · Score: 2, Informative

      Incorrect. SA is using that technique (and has for a fairly long time now) centrally to generate their score lists. That's important, and it's a very strong part of SA.

      However, in the next release of SA (and I'm currently running it out of CVS, so it's hardly vapor), they will *also* be using full word scoring heuristics. That scoring will result in a boolean "spamishness" which will in turn be assigned a score centrally (which users can override, of course).

      By way of example, here's a recent summary of one of my pieces of spam:

      Content analysis details: (12.50 points, 4 required)
      NO_REAL_NAME (1.3 points) From: does not include a real name
      INVALID_DATE (1.6 points) Invalid Date: header (not RFC 2822)
      BAYES_90 (2.0 points) BODY: Bayesian classifier says spam probability is 90 to 99%
      [score: 0.9645]
      RAZOR2_CF_RANGE_91_100 (0.0 points) BODY: Razor2 gives a spam confidence level between 91 and 100
      [cf: 100]
      RAZOR2_CHECK (3.9 points) Listed in Razor2, see http://razor.sf.net/
      DATE_IN_PAST_03_06 (0.2 points) Date: is 3 to 6 hours before Received: date
      MSG_ID_ADDED_BY_MTA_3 (2.0 points) 'Message-Id' was added by a relay (3)
      FORGED_MUA_OUTLOOK (1.0 points) Forged mail pretending to be from MS Outlook
      MISSING_MIMEOLE (0.5 points) Message has X-MSMail-Priority, but no X-MimeOLE

      As I said previously, the interesting part here is not the word-analysis, but the fact that the database for that word analysis is generated dynamically by looking at your mail, and applying SA's other rules. Self-training of this sort has proven highly successful in tests, and may yield the next quantum of spam-filtering effectiveness.

      Notice also that while that 2.0 points from Bayes is a big push to this spam's score, it's not enough to mark it as spam on its own. This is the power of SpamAssassin. No one test says, "this is spam", and so no one test is trusted on its own.
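
      The arithmetic behind the report above can be sketched in a few lines. This is only an illustration of additive scoring, not SpamAssassin's actual code: the rule names and point values are copied from the report, while `total_score`, `is_spam`, and the "at or above the required score" comparison are assumptions made for the example.

```python
# Illustrative sketch of additive rule scoring (not SpamAssassin's real code).
# Rule names and points are taken from the report quoted above.
REQUIRED_SCORE = 4.0  # "4 required" in the report

rule_hits = {
    "NO_REAL_NAME": 1.3,
    "INVALID_DATE": 1.6,
    "BAYES_90": 2.0,
    "RAZOR2_CF_RANGE_91_100": 0.0,
    "RAZOR2_CHECK": 3.9,
    "DATE_IN_PAST_03_06": 0.2,
    "MSG_ID_ADDED_BY_MTA_3": 2.0,
    "FORGED_MUA_OUTLOOK": 1.0,
    "MISSING_MIMEOLE": 0.5,
}


def total_score(hits):
    """Add up the points of every rule that fired (hypothetical helper)."""
    return round(sum(hits.values()), 2)


def is_spam(hits, required=REQUIRED_SCORE):
    """Assumed decision rule: spam when the total reaches the required score."""
    return total_score(hits) >= required


print(total_score(rule_hits), is_spam(rule_hits))
print(is_spam({"BAYES_90": 2.0}))  # the Bayes hit alone stays under the threshold
```

      The second call shows the point made above: the Bayes rule by itself contributes 2.0 points, which is below the 4 required.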

  11. Re:More than 1.1 billion pigs are killed worldwide by molarmass192 · · Score: 5, Funny

    Could Bayesian filtering be applied to filter offtopic posts as well?

    --

    Good people do not need laws to tell them to act responsibly, while bad people will find a way around the laws-Plato
  12. Add inches to your penis! by Chocolate+Teapot · · Score: 3, Funny

    Ooooops! Wrong window. Sorry.

    --
    Modest doubt is called the beacon of the wise. - William Shakespeare
  13. Re:How is spam that big of a problem? by crawdaddy · · Score: 3, Insightful

    Overblown? The fact that you would need more than one email account to keep from having your time wasted by spam proves otherwise.

  14. Bayesian filtering by blakestah · · Score: 4, Interesting

    The basics are, you take all good mails, and create a database of words used in them. Make a different database for spam mails. Then, for each incoming mail, compare to each database, and classify as spam or non-spam.

    The algorithm starts out conservative, ie: you get most of the mail classified as good. For each "good" email that is spam, you manually re-classify it.

    Then, after a few weeks, the filter does all the work. It is basically using word-databases to compare emails and classify them the way you, the user would. Periodically you will receive another spam email, then you re-classify it, and never see an email like it again (in your inbox).

    Bogofilter and CRM114 are among the more successful efforts so far, but there are many. And they are FAR more successful than blacklist/whitelist/fixed token comparison filters. But Bayesian filtering is just a near optimal way to replicate the classification of the user, which is also why it works so well.
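
    The two-database idea above can be sketched roughly as follows. This is a plain naive Bayes classifier with add-one smoothing, chosen for brevity; it is not the exact algorithm of Bogofilter, CRM114, or the article, and the names `WordFilter`, `train`, and `spam_probability` are made up for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())


class WordFilter:
    """Hypothetical two-database word filter (naive Bayes, add-one smoothing)."""

    def __init__(self):
        # One word database per class, counting messages that contain each word.
        self.counts = {"good": Counter(), "spam": Counter()}
        self.messages = {"good": 0, "spam": 0}

    def train(self, text, label):
        """Add a message to the 'good' or 'spam' database.

        Re-classifying a missed spam is just another call with label='spam'.
        """
        self.counts[label].update(set(tokenize(text)))
        self.messages[label] += 1

    def spam_probability(self, text):
        """Compare the message against both databases and return P(spam)."""
        log_odds = math.log((self.messages["spam"] + 1) / (self.messages["good"] + 1))
        for word in set(tokenize(text)):
            p_spam = (self.counts["spam"][word] + 1) / (self.messages["spam"] + 2)
            p_good = (self.counts["good"][word] + 1) / (self.messages["good"] + 2)
            log_odds += math.log(p_spam / p_good)
        log_odds = max(-50.0, min(50.0, log_odds))  # keep exp() in a safe range
        return 1 / (1 + math.exp(-log_odds))


f = WordFilter()
f.train("buy cheap viagra now", "spam")
f.train("cheap pills offer now", "spam")
f.train("meeting notes for the project", "good")
f.train("lunch on friday with the project team", "good")
print(f.spam_probability("cheap viagra offer"))
print(f.spam_probability("project meeting on friday"))
```

    With only four training messages the numbers mean little; the point is that the same `train` call handles both the initial training and later manual re-classification.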

    1. Re:Bayesian filtering by blakestah · · Score: 2, Informative

      I think you misunderstand how easy bogofilter is.

      I initially trained on about 200 emails. At first, I got 1 spam per day, or so. There have not yet been any false positives (good mail classified as spam).

      A week later, I get 1 spam in my inbox every 3-4 days, and no good mail has been classified as spam. All I need to do is take the false identifications and re-classify them. That means, every 3-4 days I take the spam in my inbox and re-scan it through bogofilter (cat SPAM | bogofilter -S). That is all. It is not any effort, really, after the initial training. Then, the filter does all the work, and you don't need to worry about blacklisting or whitelisting or anything.

      The really important thing is that the filter statistically optimizes YOUR manual email classification. The best source of email classifying is YOU looking at an email, and Bayesian filtering is the only method that is optimized to do that.

  15. Re:base64 encoded emails....or images by delta407 · · Score: 5, Interesting
    Also, how can you flag an ad that is an image?
    Razor.

    Vipul's Razor marks MIME parts individually, so an ad, a picture of Viagra, or even the "Unsubscribe" button can be marked spam and contribute to the overall score of the message.
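
    As a rough illustration of per-part marking (and not Razor's actual signature scheme, which is not described here), each MIME part can be hashed separately and looked up in a set of digests already reported as spam. The exact SHA-1 match and the names `part_digests` and `spam_part_count` are assumptions for this sketch.

```python
import hashlib
from email import message_from_string


def part_digests(raw_message):
    """Return one SHA-1 digest per non-container MIME part (hypothetical helper)."""
    msg = message_from_string(raw_message)
    digests = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers have no payload of their own
        payload = part.get_payload(decode=True) or b""
        digests.append(hashlib.sha1(payload).hexdigest())
    return digests


def spam_part_count(raw_message, known_spam_digests):
    """Count how many parts of the message match a previously reported part."""
    return sum(1 for d in part_digests(raw_message) if d in known_spam_digests)


raw = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    "Content-Type: text/plain\n"
    "\n"
    "hello\n"
    "--XYZ\n"
    "Content-Type: text/plain\n"
    "\n"
    "buy now\n"
    "--XYZ--\n"
)
digests = part_digests(raw)
print(len(digests), spam_part_count(raw, {digests[1]}))
```

    An image attachment would be handled the same way as the text parts here, since only the decoded bytes of each part are hashed.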
  16. Spam only cost-ineffective with ISP-level filters by PseudoThink · · Score: 5, Insightful

    Spam filters are great, but it seems that only the Net-savvy are using them. Savvy users aren't the people spammers are making all their money from--they are making money off the naive and inexperienced users. These users aren't going to go out and install the latest Bayesian filters on their system, and the major email readers won't (and probably shouldn't) come with them automatically activated.

    To make spam cost-ineffective for the spammers, we've got to stop it (or flag it) before it gets to the end-user. It would obviously be a mistake to allow ISP's to automatically delete all email that fails their spam filters, but I think it would be appropriate for them to include something in the headers flagging such email as probable spam. Then future email readers could detect this header and handle it gracefully, like moving it to a "spam" folder on the user's machine. Once this happens and Grandpa no longer gets email asking him to test the latest Viagra alternative, spam may become a thing of the past.
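
    A minimal sketch of that split, assuming the ISP marks suspect mail with an `X-Spam-Flag: YES` header and the mail reader only looks at that header. The header name, the folder names, and both function names are assumptions for illustration; the ISP's own filter is represented by a plain boolean.

```python
from email import message_from_string

SPAM_HEADER = "X-Spam-Flag"  # assumed header name for this sketch


def flag_at_isp(raw_message, looks_like_spam):
    """ISP side: tag probable spam in the headers instead of deleting it."""
    msg = message_from_string(raw_message)
    if looks_like_spam:
        del msg[SPAM_HEADER]  # drop any flag the sender tried to forge
        msg[SPAM_HEADER] = "YES"
    return msg.as_string()


def choose_folder(raw_message):
    """Client side: file flagged mail in a 'spam' folder, everything else in 'inbox'."""
    msg = message_from_string(raw_message)
    flag = str(msg.get(SPAM_HEADER, "")).strip().upper()
    return "spam" if flag == "YES" else "inbox"


raw = "From: someone@example.com\nSubject: hello\n\nbody text\n"
print(choose_folder(flag_at_isp(raw, True)))
print(choose_folder(flag_at_isp(raw, False)))
```

    Nothing is deleted at the ISP in this arrangement, so a false positive still reaches the user's machine and can be rescued from the spam folder.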

  17. filtering effectiveness by qoncept · · Score: 5, Insightful

    I think I speak for everyone when I say false positives are the only real hindrance to the filtering of spam. I get roughly 20 emails a day, 75% of which are spam. If one of them slips past the filter and I see it, it doesn't bother me so much. Spam is no longer a problem. What is an absolute necessity, though, (and probably less so for me than other people) is that none of my legitimate email is filtered as spam. I'd rather have 100 spams filtered improperly than one legit email.

    --
    Whale
  18. Re:hopeless by Kallahar · · Score: 5, Insightful

    Yeah, 2.5 per 1000 getting through is a proof that his ideas are obviously flawed. Having a working system is the best proof that an idea works :)

    Travis

  19. Obligatory plug for TMDA by Silas · · Score: 4, Informative

    I'm really excited about all of the neat stuff happening with Bayesian filtering and related technologies, but I just wanted to put in a plug for TMDA, Tagged Message Delivery Agent, which uses a whitelist-centric strategy. Since I began using it, the amount of spam I have to look at is virtually at zero. If you haven't read about it yet, check it out.

  20. Re:Spam needs a global solution by notsoanonymouscoward · · Score: 2, Interesting

    This makes no sense to me... spam to me is primarily 1) friends sending stories, jokes, quizzes, etc... or 2) someone trying to sell you something. now if we all cc'd everyone on everything, we'd have even more spam by my 1st definition of spam, and it wouldn't affect the 2nd definition at all. how is this supposed to help?

    --
    I ate my sig.
  21. Re:hopeless by ajs · · Score: 4, Insightful

    Everyone but the folks at SpamAssassin have been focusing on the idea that any one technique for identifying spam is doomed to diminishing returns.

    Over at SpamAssassin, they've been busily creating a system that collects "good enough" tests by the dozens and uses them to collectively score a message and determine its general "spamishness". The system relies on a complex scoring system that is determined, not by the whim of human programmers, but on the results of a genetic training system that pits one set of scores against another until equilibrium is reached for a given set of example spam and non-spam.

    See my other post here for how Bayesian filtering will be used to allow this system to feed back on itself and improve as it sees more of your spam and non-spam....

  22. Spam Archive by Doctor+Beavis · · Score: 4, Informative

    The article mentions compiling a vast collection of spam. Such a project is already underway at SpamArchive.

  23. Because it's free. by Presto_slashdot · · Score: 2, Insightful

    You probably get no spam to your home or cell phone because it's too expensive to set up a company in China and make phone calls to the US, just to get around the laws. Unfortunately, it *is* basically free to send spam mail. If they could call you for free from outside the US, they would be doing that too.

  24. Standard Spam API by Anonymous Coward · · Score: 2, Insightful

    I have been quite excited with all the new ideas being put to use in fighting spam recently. Unfortunately, whenever I find one that is implemented, it doesn't work with my mail server or my client. It seems like there should be a standard API that spam filters could implement (using SOAP or XML-RPC or something), so that the various mail servers and email clients could use a single plug-in to add spam filtering. This would allow the people who are good at spam filter code to focus on that one problem, and the people who are good at writing email plugins and GUI code can do what they are good at.

  25. Re:base64 encoded emails....or images by GGardner · · Score: 4, Informative
    A common thing that spammers do to try and trick filters is use

    Content-Type: text/html (or text/plain)
    Content-Transfer-Encoding: base64

    Because a lot of filters don't know how to decipher this. For me, this makes it a lot easier to filter, though. I get no legitimate e-mail encoded this way, so I just have procmail dump any e-mail encoded this way. Problem solved, and without the CPU burden of decoding or running expensive spam filters.
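
    The same test can be sketched outside procmail. The following uses Python's standard `email` package rather than the procmail recipe mentioned above (which is not shown in the post); `has_base64_text_part` is a made-up name, and whether to dump such mail is the policy choice described above, not something the code decides.

```python
from email import message_from_string


def has_base64_text_part(raw_message):
    """True if any text/* part declares Content-Transfer-Encoding: base64."""
    msg = message_from_string(raw_message)
    for part in msg.walk():
        encoding = str(part.get("Content-Transfer-Encoding", "")).strip().lower()
        if part.get_content_maintype() == "text" and encoding == "base64":
            return True
    return False


suspect = (
    "Subject: hi\n"
    "Content-Type: text/html\n"
    "Content-Transfer-Encoding: base64\n"
    "\n"
    "PGI+aGk8L2I+\n"
)
plain = "Subject: hi\nContent-Type: text/plain\n\njust text\n"
print(has_base64_text_part(suspect), has_base64_text_part(plain))
```

    Only the headers are inspected, so nothing is decoded, which matches the low CPU cost claimed above.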

  26. popfile URL by roalt · · Score: 4, Informative
    Popfile can be installed as an intermediate between your mail-server and your program, and you can add tags to your mail to decide in which 'bucket' your mail belongs to.

    The url for the project is popfile.sourceforge.net

    I didn't try it yet, but I will try it really soon now!

    1. Re:popfile URL by joeldg · · Score: 2, Informative

      Popfile rocks.. used it for a while 89% accuracy.. but the 11% is actually relatives/friends sending me stupid forwards, so in reality it is about 99% accurate.. nice..

  27. now THIS is a true geek by Anonymous Coward · · Score: 4, Funny

    >Based on my corpus, "sex" indicates a .97 probability of the containing email being a spam...

    Spoken like a true geek.

  28. Re:Spam needs a global solution (Global Solution) by minas-beede · · Score: 5, Informative

    OK, signal and noise. What if the signal was all in one frequency band and the noise all in another. Problem separating them? No.

    What if, in effect, a similar distinction held for spam in the transmission channel - that spam by itself selected a pathway to the recipient that was never used by the signal? Block that pathway and the spam never gets through.

    Spam doesn't select a pathway but spammers do. If you could block relay spam at the open relays it would be dead. You can't, of course - the open relays are controlled by people who don't know the need to block spam. You know that, I know that. If you can't change the people then change the open relays (from the spammers' points of view.) Set up a system that looks like an open relay and stop the spam. An open relay honeypot.

    I asked an operator of such a honeypot how he did last year:

    > How did 2002 end?

    From March 7 to December 26 2002, the total was:

    235,624,232

    Using one Pentium 90 he stopped spam to 235 million recipients. Think about that number when you see filter people reporting what they stop just for their own domains. This was spam to recipients all over, not simply to the honeypot operator's domain: he operates at the relay level. He stopped 100% of the spam, no deception deceived him, no tuning was needed, no valid email was caught - it is perfect filtering. Perfect filtering - who else has that?

    And you can do it at home on your DSL or cable connection (the guy above uses sendmail -bd, but Windows users have a program they can use):

    http://jackpot.uk.net/

    Yeah, I know, spammers are switching to open proxies. So, write an open proxy honeypot. That, too, will be 100% efficient. In addition you now are giving spammers reason to fear every open relay and every open proxy they detect. FEAR. The SPAMMERS have to scramble. They have to scramble and they have to show everything they do to overcome the technique - there is no stealth way to look for open relays and open proxies.

    The problem is solved, it is a matter of implementation and of getting active systems everywhere in the net space (so there's no safe IP space for the spammers anywhere.)

    Remember: A single Pentium 90, 235 million spam messages stopped in 10 months.

  29. Re:Performance by ergo98 · · Score: 2, Insightful

    And this is the clincher in any of these spam filters: If the filter automatically deletes messages that it identifies as spam (which could be legitimate business proposal or job offer, for example) then a false positive would be incredibly destructive. If it doesn't automatically delete but instead you periodically go through all of the messages, then it's of little value as you're forced to manually filter the spams anyways. The irony is that the better it is at identifying spams, the more destructive a false positive would be as you casually scan through and delete large clusters of supposed spam.

    Personally I think the author of the paper is a bit idealistic in ways when they say "If we can write software that recognizes their messages, there is no way they can get around that". Well then again maybe they aren't: Saying "if we can...recognize their messages" is a pretty wide net presumption, and of course the following conclusion follows, however the real question is "can we realistically make software that can effectively identify with zero incidences of false positives". For people who email between themselves and one or two other people on one subject that isn't a problem, but I suspect that statistical word usage analysis wouldn't be quite as successful for someone with a more disparate mail usage.

  30. Treating the symptoms, not the problem... by Anonvmous+Coward · · Score: 2, Interesting

    I hope you all realize that at best you're buying time, not solving the Spam problem. It won't take long for these guys to find ways through the filter.

    The problems need to be solved on a different level. The problem is not the messages themselves, it's that people are allowed to send these messages to anybody they want without any real challenges as to their authenticity.

    Let me explain how I have things set up right now, and hopefully my stance on this issue will be a little clearer. All my messages come into the same mailbox. I have a bunch of email aliases, though. If I sign up for Slashdot, for example, then I create a new alias like 'slashdot@insertdomainnamehere.com'. I then add that email address into my 'email allowed' list so that it gets funneled through into a visible folder. If that address gets abused, I shut down the email alias.

    My personal friends are treated a little differently. Once they email me, I add their address into my list of friends, and they get put into a friends folder. I treat this differently than a registration place because my friends all need one address to contact me at, I don't mind them sharing it with each other. If my address changes, then their messages still get through.

    I plan on going farther down the road. I'm going to give people an email address, and when they email it they get an automated message with instructions on how to 'request permission' to send me email. When permission is granted, they don't get that message anymore. It basically means that the only messages that get through to me are the ones that have a human behind them to read the response and then go through the proper channels to reach me.

    I'm not claiming to have done anything new here. I'm basically mimicking the way IM works, and I'm doing it without having to do anything real fancy. Outlook's Rules Wizard is doing quite a bit of the work here. But since people actually have to take the time to request my authorization, it means that it's a message meant for ME as opposed to a message meant for anybody who's out there. With an approach like this, it'd be a lot harder for spammers to get through.
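
    The folder routing described above can be sketched in a few lines. This is only an illustration, not what Outlook's Rules Wizard does internally: the `Router` class, the folder names and the example addresses are all made up, and the "challenge" result stands in for the planned 'request permission' auto-reply.

```python
from dataclasses import dataclass, field


@dataclass
class Router:
    """Hypothetical mail router: per-site aliases plus a list of friends."""
    allowed_aliases: set = field(default_factory=set)  # e.g. one alias per site
    friends: set = field(default_factory=set)          # sender addresses of friends

    def route(self, sender: str, recipient: str) -> str:
        sender = sender.lower()
        recipient = recipient.lower()
        # Friends are recognized by who sent the mail, whatever address they used.
        if sender in self.friends:
            return "friends"
        # Registration mail is recognized by the alias it was sent to.
        if recipient in self.allowed_aliases:
            return "inbox"
        # Everything else would get the automated 'request permission' reply.
        return "challenge"

    def shut_down(self, alias: str) -> None:
        """Retire an alias that has started attracting abuse."""
        self.allowed_aliases.discard(alias.lower())
```

    Shutting down an abused alias is then a single `shut_down("slashdot@example.com")` call, and mail to that alias falls back to the challenge path.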

    1. Re:Treating the symptoms, not the problem... by Silas · · Score: 2, Informative

      It sounds like you're using TMDA. Or, if you're not, you should be. :) Check out my related post on this story.

  31. Actually - by sean.peters · · Score: 4, Interesting

    You don't speak for everyone. On the contrary, I think that most people realize that e-mail delivery isn't guaranteed - and therefore they expect that truly vital messages will need to be backed up with a phone call or some other means, to be sure the message was delivered.

    I would prefer to lose one or two legitimate mails in return for a virtually zero rate of missed detections.

    Sean

  32. Difference with MacOS X 10.2's Mail.app? by tbmaddux · · Score: 2, Informative
    This is all quite interesting from a technical standpoint, but what can I gain as a user of Mail.app in MacOS X 10.2 (Jaguar) from this? My Junk filter catches spam and tosses it into a separate folder. I occasionally go through it and send the spam off to SpamCop. What I like about Mail.app is that it's easy to keep training by marking as Junk (for spam it failed to identify) or Not Junk (for occasional false positives). It seems to work well and doesn't require a lot of interaction from me except for interacting with SpamCop (my choice).

    It doesn't catch all the spam, and it occasionally has a false positive. This will be true of any spam filter we implement, because spam continues to change. SpamAssassin runs on some of the mailservers I connect to, but it tends to perform worse than Mail.app. So until we can get each user's spam filter customized at the server, spam identification is going to have to stay client-based. It sounds like Paul Graham's tools are getting a little more efficient, but does any of this make a big difference for the end user?

    --
    Can't you see that everyone is buying station wagons?
  33. I thought that too... by siskbc · · Score: 2, Informative

    ...until the email server at work got hacked and someone stole the entire address list. Since then, all of us have been getting spam by the bucketloads. And since I depend on people being able to get my current work address, I can't change it. Thank God for SpamAssassin!

    --

    -Looking for a job as a materials chemist or multivariat

  34. Re:How is spam that big of a problem? by Anonymous Coward · · Score: 3, Informative

    It's all fine and dandy to have a spamtrap account if you never plan to read it, but what if you want to get online bank statement notifications or other important notices? I just noticed my friendly credit card company (Capital One) took it upon themselves to introduce my previously spam-free e-mail account to their business partners so they could introduce me to the wonderful world of buying fucking flowers for Valentine's Day. Thanks a lot, assholes. And no, they have NO option to opt out of this fucking crap. The spam is posted from the same address as the statement notifications with a friendly disclaimer saying they're not in any way affiliated. Nice.

  35. spews.org problems need to be addressed by wessto · · Score: 2, Informative

    I host several domains as a hobby for my family. Recently my IP address made it into a listing on spews.org. Am I a spammer? By no means. Am I screwed? Absolutely. After reading spamming newsgroups I found that I am not alone. At first I was just getting blocked because I was sending mail (my own SMTP server) from a "known" spamming source when in fact I'm not a source of spam. My IP happens to fall into a larger block of IPs that my ISP owns, some of which are sources of spam.

    This was a minor setback, but now other services are starting to use bulk email sources as deny lists for their offerings. My free DNS provider, zoneedit, now prohibits me from adding or modifying any of my zones. This is simply not acceptable to me. The way spews is set up, it is not easy for my IP to get off the list. My ISP cannot just call them up and take me off. There has to be a way to avoid this, and eliminating spam at a higher level would be a good start.

  36. Re:More than 1.1 billion pigs are killed worldwide by ackthpt · · Score: 2, Funny
    Could Bayesian filtering be applied to filter offtopic posts as well?

    Unfortunately, it might work at first, but we've seen offtopic posters and first posters evolve. Alas, they seem to be a form of semi-intelligent life and once their numbers start to dwindle you can almost bet some internet environmentalist society will crop up and declare them endangered: "where once, great herds of them swept majestically across the plains, now only a few cling to the ever encroaching egalitarian dark forces of the internet."

    It's probably just easier to round them up and send them to Guantanamo.

    --

    A feeling of having made the same mistake before: Deja Foobar
  37. sneakemail.com by Stalemate · · Score: 2, Informative

    sneakemail.com is my new way of eliminating spam.

  38. Focusing on the last bit of text by hrieke · · Score: 2, Interesting
    Way down in the footnotes:
    [13] It's sometimes argued that we should be working on filtering at the network level, because it is more efficient. What people usually mean when they say this is: we currently filter at the network level, and we don't want to start over from scratch. But you can't dictate the problem to fit your solution.
    This is where the problem of spam will be solved: by having a web of trust between the mail servers, which sign each message in a manner that makes it easier to backtrack a message. If these servers also do filtering, well, we kill two birds with one stone. The problems are:
    • CPU intensive
    • Need to look at every message
    • Seeding the filter database
    • Building trust with other servers
    And others of course.
    Think about this one: what does the typical email a porn star gets look like? What we think of as spam might not be spam to someone else.
    How would the system scale?
    And what would stop a spammer from installing a server with a bogus filter database, or just signing off on each message as being legit?

    Perhaps filtering based on each user's personal corpus of valid email is the only workable solution; otherwise spammers will kill off email as a usable means of communication.

    --
    III.IIVIVIXIIVIVIIIVVIIIIXVIIIXIIIIIIIIVIIIIVVIIIV IIVIIIIIIVIII...
  39. Insightful? No, it's flamebait. by FatAlb3rt · · Score: 2, Funny

    You don't have a problem with spam? So no one should have a problem, right? Spam does not depend solely on you doing something stupid. Any mail you send to someone could be improperly fwd'd or even posted somewhere that you don't want it to be. From there, you're screwed.

    Damn AC.

  40. Another way to filter out SPAMs by ottffssent · · Score: 2, Insightful

    Spam filters seek to classify emails as spam/nonspam based on differences in the emails. The spammers however have absolute control over the content of their emails, so such methods are doomed to a life of one-step-ahead. There is one characteristic of spam which can never be changed by the spammer: spam is computer-generated and mass-mailed. Legit emails are not.

    My idea is this: The system maintains an initially empty whitelist. When mail is received from a sender not on the whitelist, autoreply with a message explaining the situation and requesting an email back whose first line or subject contains a random word or phrase from the dictionary. Human beings will grumble, respond, and get added to the whitelist. Spammers won't give your email the personal attention it needs to get past, so you remain blissfully unaware of it.
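
    A minimal sketch of that handshake, assuming a tiny stand-in word list and a made-up `ChallengeWhitelist` class. It only tracks who is whitelisted and which word each sender was asked for; holding the original mail and actually sending the auto-reply are left out.

```python
import secrets

# Stand-in for a real dictionary file; any word list would do.
WORDS = ["walrus", "teapot", "granite", "orchid"]


class ChallengeWhitelist:
    """Hypothetical challenge-response gate in front of a mailbox."""

    def __init__(self, words=WORDS):
        self.words = list(words)
        self.whitelist = set()   # senders who have answered a challenge
        self.pending = {}        # sender -> word they were asked to send back

    def receive(self, sender, subject):
        sender = sender.lower()
        if sender in self.whitelist:
            return "deliver"
        expected = self.pending.get(sender)
        # A reply whose subject contains the requested word proves a human read it.
        if expected is not None and expected in subject.lower().split():
            self.whitelist.add(sender)
            del self.pending[sender]
            return "deliver"
        # First contact: pick a random word and ask for it back.
        if expected is None:
            self.pending[sender] = secrets.choice(self.words)
        return "challenge:" + self.pending[sender]
```

    A sender who never answers stays in `pending` forever, so a real version would want to expire old entries.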

    1. Re:Another way to filter out SPAMs by Violet+Null · · Score: 2, Insightful

      spam is computer-generated and mass-mailed. Legit emails are not.

      Some legit email is definitely computer generated. I sign up for /., it sends me an email with my password. /. will not care about an autoreply, so I would never get that email.

      If you standardize an autoreply, so that websites could parse and return it, then so could the spammers, easily enough.

      Finally, you'd be doubling the amount of bandwidth spent on email, as each spam would now have a corresponding auto reply.

  41. Re:How is spam that big of a problem? by Anonvmous+Coward · · Score: 3, Funny

    "Use another account for regular everyday things, and make sure it sin't something simple like abc123@hotmail.com. I do that and never get spam to my real accounts. This whole spam thing is way overblown."

    You remind me of the guy who fixed his leaky roof by using an umbrella in his house.

  42. been using spamassassin all this month by AssFace · · Score: 3, Informative

    I went from over 500 spam a day down to about 3 or so, and I figured out that those last 3 get through because they bypass the filter. (I have a bunch of different URLs, and the server that it is all hosted on also has its own name, so mail sent to that username at that host doesn't go through any filters, and the way the filters are set up there - pair.com - I can't trap that particular server name.)

    I have been very impressed with SA and am writing scripts to track the stats even better (I love seeing what it has pulled out every day).
    So far I have had zero false positives out of about 1-2 megs of mail being filtered every day for nearly a month now.

    SA has multiple different ways of searching the mail - any one of them can be easily bypassed by any given e-mail - but all of them together are really damn good at getting rid of spam.
    I'm very impressed with it and how well it learns (although straight "out of the box" - or perhaps I should say "straight out of the tar.gz" - it brought me down from 500+ spam to 5-10 a day, and then I tweaked how my accounts were filtering into SA and that fixed the rest).

    --

    There are some odd things afoot now, in the Villa Straylight.
  43. My 0.03$ (adjusted for inflation) by IWantMoreSpamPlease · · Score: 2, Interesting

    Because of where I work, I have to use Outlook Express. I know, sucks to be me. OE does have a filter setting so I can at least start putting keywords in and have mail sent to different boxes. I have found that a large share (greater than 95%) of the spam sent to me is "personalized", meaning that somewhere in the spam is my name.

    Co-workers, friends, family, don't call me by my name, so I add my name to the kill-filter list and most spam goes bye bye. I only wish OE had an option to kill-filter anything with HTML in it since nearly 100% of my incoming spam contains HTML, sound, images and whatnot.

    I'd love to see M$ get their act together and fix OE and Outlook and include modern filtering techniques (such as discussed in the main article) but I doubt it'll ever happen.

    --
    So rise up, all ye lost ones, as one, we'll claw the clouds.
  44. Re:How is spam that big of a problem? by zootread · · Score: 2, Interesting

    Simply use a free account for any registration required sites / internet posting and only check it when necessary to confirm registration. Use another account for regular everyday things, and make sure it isn't something simple like abc123@hotmail.com. I do that and never get spam to my real accounts. This whole spam thing is way overblown.

    Well, that won't work in a lot of cases. I can create an e-mail account on my ISP (Roadrunner) and within hours I am getting spam without having even used it. They must be allowing easy access to the account list. Free accounts are worse (hotmail, yahoo): create an account and you're guaranteed to get spam, even if you've kept the e-mail address a complete secret.

    On the other hand, at work, I don't get a single piece of spam because I am careful with the address.

    --
    Zoot!
  45. Re:More than 1.1 billion pigs are killed worldwide by ravenwolff · · Score: 3, Funny

    I don't think it would be nearly as complicated to check for duplicate posts...

  46. Re:Spam needs a global solution (Global Solution) by minas-beede · · Score: 5, Interesting

    "Also, isn't it easy for a spammer to workaround a spam honeypot -- create a hotmail account, add it to your spam list, and verify that it did go through."

    Yes. So far many don't (I don't know of any that do, but spammers do, eventually, stop sending to a honeypot.) Ralsky never caught on to the Moscow honeypot that was whacking him last year (I think he's the one who told Shiksaa - visit NANAE to find out who she is - that SPEWS was killing him, just at the time of the major whacking Ralsky was getting.) (Chuckle.) I looked for spammer dropbox addresses in trapped spam 3 years ago - I figured they'd use the same address every once in a while in the list of victims. I sorted the list of recipients, sorted again, removing duplicates, and compared. No differences: each victim showed up once. They could do it, they don't. Years of experience have taught them that they can test for open relays and abuse them incautiously - nobody does anything to counter them. They think they own the internet because people ignore their attempts to relay. It's easy to knock the smirk off their faces: pay attention to illicit connection attempts.

    There is a project already in motion to collect all recipient addresses for honeypot-collected spam in a central location. If any address shows up too frequently then that's a suspicious address. The real problem isn't what the spammers do or could do, it is that too few people use this very simple method to wreck the spam path.
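
    The "shows up too frequently" check is easy to sketch. This is not the code of the project mentioned above (I'm only assuming what it might do): `suspicious_recipients`, the threshold and the input shape (one list of recipient addresses per captured spam run) are all hypothetical.

```python
from collections import Counter


def suspicious_recipients(captured_runs, threshold=3):
    """Return addresses that appear in at least `threshold` captured spam runs.

    `captured_runs` is a list of recipient lists, one per trapped relay attempt.
    An address recurring across many runs may be a spammer's own dropbox.
    """
    counts = Counter()
    for recipients in captured_runs:
        # Count each address once per run, ignoring case.
        counts.update({addr.lower() for addr in recipients})
    return sorted(addr for addr, n in counts.items() if n >= threshold)
```

    Ordinary victims should show up once or twice; a test address the spammer reuses keeps coming back.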

    My original honeypot went down last week (I retired in 2001; I haven't really checked to see what the current managers are doing with it.) This year I only captured relay messages, delivered nothing. When it went down last week it had captured over 100 relay test messages in January. You can also go after spammers with these (and I did - no results yet to report, I'm hoping for some big results.) Spammers could detect that - but too late.

    There's a sneakier version of what you suggested that the spammers could use. I won't tell them what it is.

    Volume is the key - many honeypots are needed, quickly, to whack them before they adapt. Same for open proxies. It is an absolutely simple approach. You could set up Granny's system to run a honeypot and it would work, if she has a connection to a segment the spammers search for open relays. http://jackpot.uk.net/

    Try Jackpot and see for yourself, if you can.

  47. Re:How is spam that big of a problem? by knobmaker · · Score: 2, Interesting
    This whole spam thing is way overblown.

    Maybe having spamtrap addresses works if you only use the internet as a personal communication medium. But what if you run an online business, and need to keep email addresses on your websites?

    That's why antispam technology is important to me.

  48. MATH THEORY by EEgopher · · Score: 2, Interesting

    (exhales loudly as he reclines the brown chair)
    Upon reading these extremely fine articles, my mind picks and dances at one particular point, and that is the SIZE of corpuses to use for the training. It seems to me, that at infinitely large bodies of training material, both spam and non-spam tokens would have equal chances of being passed or rejected. Even for large (4000) bodies of corpus, would you really want to be training with equal numbers of spam examples vs. non-spam examples? It seems to me that the filter could cycle unto itself, giving the word "the" superior priority to "mortgage", and so-on-and-so-forth such that the filter would have learned so many words -- regardless of good vs. bad -- that the filter would again (raises fist to clear throat) turn in on itself; cycle unto its own voidance.
    Does anyone have any ideas on this? If I missed something from the article, such as the "weighting" system he gives to known "good" text (which I still see as being futile at large sample sizes) please inform me.
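
    For reference, here is how I read the per-token formula in "A Plan for Spam", rewritten as a Python sketch (the original is Lisp; the function and argument names here are mine). It works on rates within each corpus rather than raw counts, so a word like "the" that is common in both corpora lands near 0.5 however large the corpora get.

```python
def token_probability(good_count, bad_count, ngood, nbad):
    """Spam probability for one token, following my reading of 'A Plan for Spam'.

    good_count / bad_count: occurrences of the token in the good / spam corpus.
    ngood / nbad: number of messages in each corpus (assumed to be positive).
    Returns None for tokens seen too rarely to be trusted.
    """
    g = 2 * good_count          # good counts are doubled to bias against false positives
    b = bad_count
    if g + b < 5:
        return None
    good_rate = min(1.0, g / ngood)
    bad_rate = min(1.0, b / nbad)
    # Clamp so that no single token is ever treated as absolute proof.
    return max(0.01, min(0.99, bad_rate / (good_rate + bad_rate)))
```

    With 4000 messages in each corpus, a token seen 4000 times on both sides comes out at exactly 0.5, while one seen 400 times in spam and never in good mail comes out at 0.99.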

    --
    hi, I like pancakes -.-- -.-- --..
  49. Problem with anti-Spam on the Server by WatertonMan · · Score: 3, Interesting
    The big problem with most current spam filters is that they work at the server level or else require an extra "intermediary" pop-like server between you and your regular mail server. This is a problem because they assume a "one size fits all" approach to Spam. The problem is that one man's spam is another man's interesting offer. Further, they require the maintainer of the server to continually update the corpus that trains the filter.

    The real fact of the matter is that for most people the hassle is nearly as bad as the spam! I don't want to spend the time setting up such things. And when people have set them up *for* me I get too many false positives, if only because my interests differ from theirs. Thus any filter has to be trained with user data and be trainable in an unobtrusive, easy fashion.

    The only software I know of that does this is Apple's Mail program in OSX. Unfortunately the program has many limitations and annoyances. (Damn that drawer) However Apple's approach to Spam ought to be followed by all other email clients. Adding Bayesian inference to an email client is very easy. Putting it in the server is a mistake because you *can't* easily click and label an email as spam. As with unfortunately too much Open Source software, the interface has been ill conceived.

  50. Re:I shall crush your filter! by bugbear · · Score: 2, Informative

    Wouldn't work. The algorithm only cares about the most statistically significant 15 words. TEENS easily beat yams.
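
    A rough sketch of why, using the combining rule from "A Plan for Spam" (my Python, not the article's Lisp). The token probabilities in the usage note are invented, and 0.4 for never-seen tokens is the article's default as I recall it.

```python
from math import prod


def spam_probability(tokens, token_probs, n=15, unknown=0.4):
    """Combine the n most 'interesting' token probabilities into one score.

    token_probs maps token -> spam probability, assumed already clamped
    to something like 0.01..0.99 so the products never both hit zero.
    """
    probs = [token_probs.get(t, unknown) for t in set(tokens)]
    # 'Interesting' means farthest from the neutral value 0.5.
    top = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:n]
    spam = prod(top)
    ham = prod(1.0 - p for p in top)
    return spam / (spam + ham)
```

    With "teens", "free" and "click" each at 0.99 and fifty never-seen padding words at the 0.4 default, the score stays above 0.99, because near-neutral words barely move the product. Padding with words that are strongly associated with the victim's own good mail would be a different story, but the spammer doesn't know which words those are.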

  51. Re:More than 1.1 billion hippies post off topic by orthogonal · · Score: 3, Funny

    Ok do the following, then maybe I'll care about your opinion: 1. Solve world hunger so tribe in africa don't need pork to survive ....

    Yum, cat farms!

    Tastes just like chicken, and keeps down the rat population.

    Meeeeeeeooowwww!

  52. Re:Spam only cost-ineffective with ISP-level filte by bheer · · Score: 2, Interesting
    The kind who are taken in by the stupidest spam tricks, like the "future spam" he describes (nonsensical but grammatical set of English text designed to slip past Bayesian filters, followed by a URL.) What kind of a moron would click on such a URL?
    No kidding. Here's an example from my mailbox -- Moz's 1.3a spam filter didn't recognize this one. Note that I actually *know* people who write like this IRL.

    Frank

    You've gotta see this website: http://www.geocities.com/lordrings179/

    I downloaded Lord of the Rings: The Two Towers and I'm now watching it on my computer. Picture quality is great and it was tottally free.
    They've got a whole bunch of other games and movies as well. Take a look. Also, please forward this email to anyone you think would be interested.

  53. Another POP Proxy program, SpamPal by uncleFester · · Score: 2, Informative

    Another program is SpamPal, which also acts as a pop proxy. It also has a plugin structure, and one of the plugins is a Bayesian filter. This is in addition to included support for using available spam blacklist stuff like SPEWS, ORDB, SpamCop and a whole bunch of other DNSBL lists (even the ability to block entire domains like .kr, .ch and so on). It's a rather cool piece of software.

    --
    -'fester
  54. One part missing in spam filtering.... by Kjella · · Score: 2, Interesting

    ...at least in any version I've looked at, is a "language" filter. Maybe 90% of the email I receive is in Norwegian, with hardly any spam. Most of my English mail is spam, simply because I have very little legitimate mail in English. Is there any guesstimate (a la winXP's "language recognition")? By the way, that function is a major PITA for writing English references in a Norwegian paper.

    Kjella

    --
    Live today, because you never know what tomorrow brings
  55. Microsoft was granted a patent on this... by bergeron76 · · Score: 2, Interesting

    I wonder what the implications on the OpenSource community are going to be because of this? Details can be found here.

    --
    Don't think that a small group of dedicated individuals can't change the world. It's the only thing that ever has.
  56. Fairly Simple Spam Mail reduction tips. by jellomizer · · Score: 2, Informative

    Without using filtering software.

    1. Change your e-mail address and drop the old one. (This way you are starting off with a clean slate and not on any mailing lists.)

    2. Make sure your ISP doesn't post or sell your e-mail address.

    3. Make your e-mail address simple for people to remember but hard for a computer to guess, for example m1nam3@isp.com. Use similar methods as you would in making a password. That avoids having a common-name e-mail address.

    4. On your webpage make a CGI/PHP/ASP (whatever) form to send you an e-mail. When you want people to e-mail you, give them the link to that page. Make sure that there are no parameters that can make your program e-mail others, and also that your e-mail address is not listed in any of the source that is visible to the web user.

    5. Only give your e-mail to people you can relatively trust. If you can't trust them then give them a link to your webpage.

    6. When filling out forms on the network asking for your e-mail, either use an alternate e-mail or read the company's privacy claims and make sure that you do not check or uncheck something stating that they will send you e-mail or ads.

    7. Use SpamAssassin or other e-mail filtering on your system.

    8. Forward all spam to ucs@ftc.gov with all the headers.

    9. See if your e-mail client has an automatic bounce back. If so, bounce the message back to sender.

    10. If you want to post your e-mail address then I would make a graphical jpg or png of your e-mail address. That way it is harder for most computers to read it.

    --
    If something is so important that you feel the need to post it on the internet... It probably isn't that important.
    1. Re:Fairly Simple Spam Mail reduction tips. by Vainglorious+Coward · · Score: 2, Informative

      Without using filtering software.

      1. Change your e-mail address and drop the old one.

      Off to an ugly start. Joe Average will abort on your list before he's even begun

      2. Make sure your ISP doesn't post or sell your e-mail address.

      I'd love to know how you're going to ensure this

      5. Only give your e-mail to people you can relatively trust. If you can't trust them then give them a link to your webpage.

      "No mom, you can't have my email address. You just use it to send me e-greetings and I hate getting those from you..."

      6. When filling out forms on the network asking for your e-mail ... read the company's privacy claims and make sure that you do not check or uncheck something stating that they will send you e-mail or ads.

      Spammers lie. We wouldn't have all these problems if spammers were truthful

      7. Use SpamAssassin or other e-mail filtering on your system

      How do I do that "without using filtering software" ?

      8. Forward all spam to ucs@ftc.gov with all the headers.

      You mean uce@ftc.gov. Also note that (depending on the email client) just forwarding a message usually destroys the headers of interest.

      9. See if your e-mail client has an automatic bounce back. If so, bounce the message back to sender.

      How exactly does sending a response to an address that either (a) doesn't exist, (b) exists, but is irrelevant (joe-job), or (c) is an address-validation mechanism, help anything?

      10. if you want to post your e-mail address then I would make a graphical jpg, png as your e-mail. That way it slows down most computers from reading it

      This one I can't find fault with :) (but note there will be some people who get confused/annoyed when they can't just click on a mailto: link, I'm just not one of them).

      --
      My next sig will be ready soon, but subscribers can beat the rush
  57. Vipul's Razor is the equivalent by billstewart · · Score: 2, Informative
    Vipul's Razor on Sourceforge is the canonical collaborative spam filter network. These things really do make a dent in spammers constructing not-very-spam-looking messages that sneak through filters, because to get around them, they need to send sufficiently different messages to each target, though the openness of the matching algorithm means they do have the tools to try it.

    One of my ISPs' implementation of SpamAssassin seems to be using it as part of their rating heuristic.

    --

    Bill Stewart
    New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
  58. Re:More than 1.1 billion pigs are killed worldwide by archen · · Score: 2, Funny

    Trickier than you might think, considering slashdot editors spell most words in the English language with 3 or 4 incorrect variations.

  59. Re:The more serious problem by bnenning · · Score: 2, Insightful
    What if you are a person who deals with financial data over e-mail? What if you routinely help people with their web pages? What if you send long blocks of code?


    Then the filter will adapt to the types of legitimate messages you receive, that's the entire point.

    --
    How to solve most of our problems: 1.Lots of nuclear plants. 2.Cure aging.
  60. False positives by stemcell · · Score: 4, Interesting

    Has anyone found a Bayesian filter that not only redirects spam into a spam folder but also sorts its history of redirected mail into a probability list, so that it's easy to check the mails that were close to being accepted?

    Of the 4 programs I just looked at, none mentioned this feature but pretty much everyone complains about periodically having to scan their 'spam' folder for false +ves, and a history sorted into probability would make that easier.
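
    The feature itself would be tiny if the filter kept each message's score. A sketch, assuming the filter can hand back (subject, spam probability) pairs; `review_order` and the 0.9 cutoff are hypothetical, not taken from any of the programs I looked at.

```python
def review_order(scored_messages, threshold=0.9):
    """List filtered mail with the least certain calls first.

    scored_messages: iterable of (subject, spam_probability) pairs.
    Only messages above `threshold` were redirected to the spam folder;
    those closest to the threshold are the likeliest false positives.
    """
    flagged = [m for m in scored_messages if m[1] > threshold]
    return sorted(flagged, key=lambda m: m[1])
```

    Checking only the first screenful of that list would cover the borderline cases without wading through the 0.99s.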

    Stemmo