Slashdot Mirror


Paul Graham on Fighting Spam

Ramakrishnan M writes "Paul Graham, the Lisp guru, is back with a great technique to fight spam. It is based on a trust metric, and he claims only 5 out of 1000 spams leaked through this system, with 0 false positives. Worth looking at."

675 comments

  1. spamassasin by matt4077 · · Score: 1, Offtopic

    How does this compare to SpamAssassin? Does anybody know any figures?

    1. Re:spamassasin by dylanm · · Score: 1

      SpamAssassin is about 98% effective in catching spam, with about 1 in 3000 false positives. (The whitelist feature helps decrease the rate of false positives over time.)

    2. Re:spamassasin by tomknight · · Score: 4, Informative
      As you appear to have difficulty reading articles, I'll give you a helping hand:

      "But the real advantage of the Bayesian approach, of course, is that you know what you're measuring. Feature-recognizing filters like SpamAssassin assign a spam "score" to email. The Bayesian approach assigns an actual probability. The problem with a "score" is that no one knows what it means. The user doesn't know what it means, but worse still, neither does the developer of the filter. How many points should an email get for having the word "sex" in it? A probability can of course be mistaken, but there is little ambiguity about what it means, or how evidence should be combined to calculate it. Based on my corpus, "sex" indicates a .97 probability of the containing email being a spam, whereas "sexy" indicates .99 probability. And Bayes' Rule, equally unambiguous, says that an email containing both words would, in the (unlikely) absence of any other evidence, have a 99.97% chance of being a spam."

      Tom.

      --
      Oh arse
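
The 99.97% figure in the quoted passage can be checked by hand. This is a minimal sketch, assuming the two word probabilities are treated as independent evidence with equal priors for spam and non-spam; the function name `combine_two` is made up for illustration and is not taken from the article.

```python
def combine_two(p_a: float, p_b: float) -> float:
    """Combine two per-word spam probabilities with Bayes' rule.

    Assumes the two words are independent and that spam and
    non-spam are equally likely before looking at any words.
    """
    spam = p_a * p_b                    # both words seen, message is spam
    ham = (1.0 - p_a) * (1.0 - p_b)     # both words seen, message is not spam
    return spam / (spam + ham)


# "sex" at .97 and "sexy" at .99, as in the quoted passage
print(f"{combine_two(0.97, 0.99):.4f}")  # prints 0.9997
```

By hand: 0.9603 / (0.9603 + 0.0003) is roughly 0.9997, which matches the quoted 99.97%.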
    3. Re:spamassasin by Bahamuto · · Score: 1

      So I have a question then. What if I write a 'sexy' email to my girlfriend, and I use the word sex or even worse ones, wouldn't that get filtered out too? I'd be curious to see if he tried something like that and didn't get a false positive.

    4. Re:spamassasin by Anonymous Coward · · Score: 0

      Again, if you read the article, you will see that this idea is addressed.

    5. Re:spamassasin by KMitchell · · Score: 3, Informative

      The theory (as I understand it) is that there are enough "legit words" in the "Sexy email to your gf" (i.e. her/your name/nickname, her/your email addy etc.) that they'd cancel out the "bad words".

      The big shift in thinking, from looking for phrases to scoring each and every word in an email, is that the rest of the email is just as saving/damning as the stuff that filters look for.
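
The "cancel out" idea can be illustrated with a few made-up numbers. This is a sketch only: it assumes the words are combined as independent evidence with equal priors, and the probabilities for the personal words are hypothetical, not measured from any real mailbox.

```python
from math import prod


def combine(probs):
    """Combine per-word spam probabilities, assuming independence."""
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)


racy_words = [0.97, 0.99]            # figures quoted earlier in the thread
personal_words = [0.01, 0.01, 0.02]  # hypothetical: a nickname, a name, an address

print(combine(racy_words) > 0.9)                   # racy words alone look like spam
print(combine(racy_words + personal_words) < 0.1)  # personal words pull it back down
```

Under these assumptions a handful of strongly "legit" words outweighs two strongly "bad" ones, which is the effect described above.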

    6. Re:spamassasin by Unknown+Bovine+Group · · Score: 2, Funny

      Based on my corpus, "sex" indicates a .97 probability of the containing email being a spam, whereas "sexy" indicates .99 probability. And Bayes' Rule, equally unambiguous, says that an email containing both words would, in the (unlikely) absence of any other evidence, have a 99.97% chance of being a spam.


      Obviously, the author just isn't sexy.

      --
      m00.
    7. Re:spamassasin by Znork · · Score: 2

      Actually, I'd recommend a combination between a nasty spam filter that kills off close to anything that might conceivably be spam and white-lists of senders who are automatically cleared. Your friends' mails can get through, but woe betide the remote acquaintance or casual relation who mails you anything about sex... on the other hand you might be better off without that anyway.

    8. Re:spamassasin by Em+Ellel · · Score: 1

      Ok, so all the spammer needs to do is add 10x the usual number of words in "safe" words (or sentences) at the end of the message? So SPAM will now be not only annoying, but also eat up several times the bandwidth it currently does?

      There is nothing new here, except for a more methodical way to provide "scoring" of the words.

      --
      RelevantElephants: A Somatic WebComic...
    9. Re:spamassasin by Em+Ellel · · Score: 1

      Actually, I'd recommend a combination between a nasty spam filter that kills off close to anything that might conceivably be spam and white-lists of senders who are automatically cleared.

      Yeah, it's called SpamAssassin.

      --
      RelevantElephants: A Somatic WebComic...
    10. Re:spamassasin by Anonymous Coward · · Score: 0

      "but woe betide the remote acquaintance or casual relation who mails you anything about sex... on the other hand you might be better off without that anyway."

      They would be the last emails that I would want to filter ;)

    11. Re:spamassasin by walt-sjc · · Score: 2

      Yup. SpamAssassin is pretty good at identifying spam. The only problem is that you have already incurred the loss of bandwidth and CPU power. Yeah, not TOO big of a deal for individual users, but magnify that by 1,000,000 or so if you are a big ISP, or some other number if you are a business, and it is STILL a real problem.

      Filters are great for keeping spam out of your inbox but it doesn't solve ANY of the other problems associated with spam. While many people don't like the idea of spam laws, they would create a financial / criminal incentive to cut the volume. What if kidnapping and rape were not illegal? Wouldn't the problem be MUCH larger than it is today? The trend we see with spam is that it is increasing by orders of magnitude. If I had no filters turned on, I would get more spam than legit email. If the rate of increase in spam keeps up, I will see 10 spams for every legit email in about a year. This is just not right.

      I don't have high-speed internet access available where I am. Spam costs me real dollars. Why should I pay for someone else's advertising?

      Spam needs to be stopped at the source.

      There is another argument that spam is a world-wide problem and that US laws wouldn't have much impact. While it's true that it wouldn't stop all spam, it would cut the volume. I also have no problem with a "usenet death penalty" for countries that don't take effective steps to curb spam. I already do this with a whitelist allowing the few international users that I do communicate with to communicate with me. Bottom line is that international spam stopped for me. I still incur a small bandwidth cost for the connection attempt, but not nearly as much as if I received the entire spam.

      Am I just totally off base here? If so, why?

    12. Re:spamassasin by ceejayoz · · Score: 2

      That'd be easy to filter out.

    13. Re:spamassasin by Anonymous Coward · · Score: 1, Funny

      Actually, I'd recommend a combination between a nasty spam filter that kills off close to anything that might conceivably be spam and white-lists of senders who are automatically cleared.

      Great idea! Where'd you get it? The article?

    14. Re:spamassasin by Yuan-Lung · · Score: 1

      But my regular e-mail messages often do contain the words "sex" and "sexy".

    15. Re:spamassasin by mrtorrent · · Score: 1

      I think you're right on the money - making it _legally_ (in addition to ethically) wrong will make cracking down on spammers much more possible. Right now, they're trying to claim that what they're doing _isn't_ unethical and is a good, respectable business (see the article in the latest Newsweek).

      There is another argument that spam is a world-wide problem and that US laws wouldn't have much impact. While it's true that it wouldn't stop all spam, it would cut the volume.
      Yes, I believe the figure is something like 95% of all spam originates from the US.

    16. Re:spamassasin by FyRE666 · · Score: 2

      Well, I suppose if you wrote something like:

      Hi, my sexy naked Russian teen lolita! I've increased my penis size to 45 inches by phoning for sex along with other like-minded people who I click with!!!!!!!!!!

      I'll be around later, unless you want to opt-out, but it's not an idea I'd subscribe to!!!!

      Then you might just generate a false-positive...

    17. Re:spamassasin by Anonymous Coward · · Score: 0

      Thanks for cutting and pasting the text from a slashdotted site, but we could have done without your smarminess, asswipe.

    18. Re:spamassasin by Anonymous Coward · · Score: 0

      If spam can't get to anyone, it will cease being sent. Laws cost too much money.

      As to the rape etc. comment: let's equate it to robbing (something with a logical objective, like spam). If something made it impossible to rob others (the spam killer equivalent), people would cease robbing.

      People would probably still try to murder and rape even if it was not possible as that is an emotional, not logical decision (usually).

    19. Re:spamassasin by susano_otter · · Score: 2

      RTFA: Graham clearly believes that highly efficient filters at the inbox level will have the long-term effect of making spam unprofitable for most spammers.

      Sure, in the short term you don't reduce your bandwidth costs, but imagine if a significant percentage of the population were using trained Bayesian probability filters! So little spam would get through that nobody would bother sending it anymore.

      --

      Any sufficiently well-organized community is indistinguishable from Government.

    20. Re:spamassasin by nelsonal · · Score: 2

      I think the point of his filter is that your filter is unique to you. The filter is designed to pick up that you like foo but don't like cc or something similar, and it assigns a probability of the message being something that you would delete based on your current kept and deleted files. So there would have to be a word that everyone kept on their good list, and the spammer would have to keep up with the fact that your good list changes.
      If foo was one of the words that had a low probability of being spam, but spammers started using foo, and the filter still saw enough other bad words to delete the message, it would probably lower the likelihood that foo indicated a good message. It looked like a really solid system; I hope someone finds a way to get this added to more inboxes quickly.

      --
      Degaussing scares the bad magnetism out of the monitor and fills it with good karma.
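
The per-user learning described above can be sketched roughly as follows. This is an illustration, not the article's exact formula: it assumes a word's spam probability is simply its share of appearances in deleted mail versus kept mail, clamped to the range 0.01 to 0.99, and all names here (`train`, `token_spam_probability`) are made up.

```python
from collections import Counter


def train(messages):
    """Count how many messages each lowercase word appears in."""
    counts = Counter()
    for text in messages:
        counts.update(set(text.lower().split()))
    return counts


def token_spam_probability(token, kept, deleted, n_kept, n_deleted):
    """Estimate how strongly one word points at mail this user deletes."""
    good_rate = kept[token] / max(n_kept, 1)
    bad_rate = deleted[token] / max(n_deleted, 1)
    if good_rate + bad_rate == 0:
        return None  # word never seen in either pile
    p = bad_rate / (good_rate + bad_rate)
    return min(0.99, max(0.01, p))  # clamp so no single word is absolute


kept_mail = ["foo meeting notes", "lunch foo"]
deleted_mail = ["cheap offer click"]
before = token_spam_probability(
    "foo", train(kept_mail), train(deleted_mail), len(kept_mail), len(deleted_mail))

# A spammer starts using "foo", and the user deletes that message too.
deleted_mail.append("foo offer click")
after = token_spam_probability(
    "foo", train(kept_mail), train(deleted_mail), len(kept_mail), len(deleted_mail))

print(before < after)  # "foo" is now a weaker sign of good mail
```

Because the counts come from one user's own kept and deleted mail, a word that is "safe" for one person carries no guarantee for anyone else.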
    21. Re:spamassasin by Anonymous Coward · · Score: 0

      then the filter will "learn" that those words don't score spammy for you.

    22. Re:spamassasin by Alranor · · Score: 1

      But surely the sort of idiot who buys stuff from spam emails isn't gonna use this, because they want the spam, and so the return from the spammers won't change much?

    23. Re:spamassasin by susano_otter · · Score: 2

      True, but if you implemented it at the institutional level (corp mailservers, ISPs, &c.), then t3H stUpiDz won't even know what they're missing!

      --

      Any sufficiently well-organized community is indistinguishable from Government.

    24. Re:spamassasin by critter_hunter · · Score: 1

      Ah, yes, but then what about new words? The current filter assigns, what, .2 to a new word? Wouldn't just adding random ASCII strings greatly reduce the chance of being marked off as spam? Just put a few hundred meta words that are just randomly assigned and you've pretty much fucked the system. The solution obviously is to assign a higher spam probability to unknown words, but then people who don't speak English exclusively may get a lot of false positives.

      --
      Karma: Could be worse (could be raining)
    25. Re:spamassasin by nelsonal · · Score: 1

      Yeah, but this picks the 15 most "interesting points", since spams usually have certain words like offer, enlarge, a common domain, a link, etc., which they have to include to ensure sales/clicks. While normal messages contain links, it is very unlikely that a spam won't. The article mentioned that there was a new domain that the author did not catch but the software had already added as a likely indicator of spam. Also, I think it looks at what you actually have in your good message list and bad message list, and I believe it calcs the percentage from that. So, unless the spams were custom tailored to match the uncommon words in your inbox, new words could become a very likely indicator of spam; one or two high percentages shouldn't throw off too many false positives.
      Your point about foreign languages seems quite correct. It probably would increase the false positives, especially if the person had just started emailing you. But it does seem like one of the better ideas for combating spam from the user end.

      --
      Degaussing scares the bad magnetism out of the monitor and fills it with good karma.
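
The "15 most interesting points" selection discussed in the last two comments can be sketched like this. It is a rough illustration under stated assumptions: "interesting" is taken to mean farthest from a neutral 0.5, each distinct word counts once, and the probability for never-seen words is left as a parameter because the thread only guesses at its value. The word probabilities in the demo are hypothetical.

```python
def most_interesting(tokens, known_probs, unknown_prob, limit=15):
    """Return the spam probabilities of the `limit` most extreme words.

    Words missing from `known_probs` get `unknown_prob`, so random
    padding strings all land at the same mildly non-neutral value.
    """
    probs = [known_probs.get(t, unknown_prob) for t in set(tokens)]
    return sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:limit]


known = {"offer": 0.99, "enlarge": 0.99, "click": 0.97}   # hypothetical values
padding = [f"xq{i}zz" for i in range(300)]                # random-looking filler
selected = most_interesting(["offer", "enlarge", "click"] + padding,
                            known, unknown_prob=0.4)

print(len(selected))   # only `limit` values are ever combined
print(selected[:3])    # the strongly spammy words are picked first
```

Padding can only fill whatever slots the strong words leave free, so hundreds of filler strings count no more than a dozen would. How much those leftover slots matter depends on the unknown-word probability, which is the crux of the disagreement above.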
    26. Re:spamassasin by turpie · · Score: 1

      This is fairly simple to get around.
      If a message comes from someone in your address book, then allow it regardless; otherwise apply spam filters.
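
The address-book rule in the last comment amounts to a check that runs before any content filter. A minimal sketch, assuming the address book is a set of lowercase addresses and the content filter is any function that returns True for spam; `classify` and `looks_like_spam` are hypothetical names.

```python
def classify(sender, body, address_book, looks_like_spam):
    """Let known senders through untouched; filter everyone else."""
    if sender.lower() in address_book:
        return "inbox"  # whitelisted: the content filter is never consulted
    return "spam" if looks_like_spam(body) else "inbox"


address_book = {"girlfriend@example.org"}


def crude_filter(body):
    """Stand-in content filter, far cruder than anything in the article."""
    return "sexy" in body.lower()


print(classify("Girlfriend@example.org", "a sexy note", address_book, crude_filter))
print(classify("stranger@example.net", "a sexy offer", address_book, crude_filter))
```

One caveat: sender addresses can be forged, so a whitelist of this kind is only as trustworthy as the From line it reads.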

  2. This is wrong. by www.sorehands.com · · Score: 1, Insightful
    SPAM is wrong!

    The proper way to get rid of spam is to get rid of spammers. Have it illegal to send spam, to market using spam, and to host spammers.

    Make each link in the chain liable!

    1. Re:This is wrong. by morgajel · · Score: 2, Insightful

      "if you outlaw spam, the only people with spam are outlaws..." er something.
      anyways, what I was going to say is ok, US outlaws spam. now what? sue Korea as a whole? how about China? Nigeria?

      laws don't mean shit.
      you need to go after the people making MONEY off spam, not the spammers. Most of them are US "businesses". ...and I use the term 'business' loosely.

      --
      Looking for Book Reviews? Check out Literary Escapism.
    2. Re:This is wrong. by njet · · Score: 1

      The same method should be applied also for cracking/ddos/.... But it does not work. Abuse reports don't get to the right hands....admins (if they have some) don't care.......

    3. Re:This is wrong. by schroedinbug · · Score: 1

      I completely agree with that except for the part about making it illegal to host spammers.

      Now if they are knowingly hosting them, that's a different thing, but I know where I work, we had one try to start spamming people. When we got the notices that this was happening, we promptly deleted his account and put his name and address on the perma-ban list.

      ISPs shouldn't be held liable unless they are purposely letting the spammer create headaches in the mailboxes of millions.

    4. Re:This is wrong. by mhore · · Score: 1
      The proper way to get rid of spam is to get rid of spammers. Have it illegal to send spam, to market using spam, and to host spammers.

      ...or have them shot on spot. messy, though... hmm... AND THEN MAKE THEM INTO REAL SPAM! YAH! A fitting end.

      Mike.

      --

      Mmmm......sacrelicious.

    5. Re:This is wrong. by ceejayoz · · Score: 2

      Spam is wrong, but so's murder. That doesn't stop it from happening.

      We should pursue legal avenues for stopping spam, but that doesn't mean we shouldn't try to block it in the meantime! The article sounds like a phenomenal way of blocking spam.

    6. Re:This is wrong. by tomknight · · Score: 2
      So you're after a world-wide law outlawing spam? Most of mine is currently coming from Taiwan, so that's what I'd need... Please, get real!

      Tom.

      --
      Oh arse
    7. Re:This is wrong. by nougatmachine · · Score: 2
      Yes, because that works so well for heroin. And prohibition worked really well, too. And isn't something like 95% of the trading on KaZaA and Gnutella illegal as well? And all of the child porn readily available on the net?

      Spam, like these things, is going to be extremely difficult to enforce. Laws or no laws, filters will be necessary.

    8. Re:This is wrong. by Stonehand · · Score: 2

      Given that much of my spam is not only /from/ Korea, but /in/ Korean, a considerable amount likely comes from Korean businesses.

      As for what to do? One heavy-handed bit of leverage would be to block /all/ telecommunications from Korea until they develop some responsible marketing laws and enforce them (with, say, a 90-day notice in advance).

      --
      Only the dead have seen the end of war.
    9. Re:This is wrong. by Anonymous Coward · · Score: 0

      We should use every reasonable option to fight spam, legal and technological. We need to make it as difficult as possible so there are very few people willing to go through the trouble to bother sending it out.

    10. Re:This is wrong. by www.sorehands.com · · Score: 1
      Well all of this is based on knowing.


      I mean, if an ISP keeps a spammer after being made aware that they are a spammer, then the ISP should become liable. That includes bandwidth providers.


      We should treat spam money like drug money: all assets that have possibly been bought with spam money, even if given away, should be subject to judgment.

    11. Re:This is wrong. by japhmi · · Score: 2, Insightful
      One heavy-handed bit of leverage would be to block /all/ telecommunications from Korea


      This is a very bad idea. What about companies such as Hyundai that have Korean and American (and many other countries) divisions? Or, what about my friends from Korea trying to e-mail their family back home - should they be hurt because some companies in their home country do bad things (and/or its government doesn't have/enforce laws to stop them)? Name a country that doesn't have another country (or countries) thinking that they need to 'change how they do things over there.'

      --
      "Giving money and power to government is like giving whiskey and car keys to teenage boys" P. J. O'Rourke
    12. Re:This is wrong. by Anonymous Coward · · Score: 0

      Make each link in the chain liable!

      Of course!

      Suing people is the modern-day equivalent of beating people up, torching their houses, shaming their daughters...

    13. Re:This is wrong. by Anonymous Coward · · Score: 0

      kill spammers. A .44 magnum would work.

    14. Re:This is wrong. by RylandDotNet · · Score: 1

      Considering the fact that spammers don't feel any compunctions about hijacking an open mail relay, I don't think they're going to consider a law against spamming much of an obstacle.

    15. Re:This is wrong. by Stonehand · · Score: 3, Insightful

      In this case, the damage to others /is/ the point, just as that's the same logic behind the Usenet Death Penalty. Hurt others (in the case of a UDP, the customers of the ISP who send perfectly legitimate email) whom the authorities do care about so that they change their policies...

      It's not particularly nice, or even remotely fair, but something like that might work. A large-scale boycott by major ISPs might do the trick.

      --
      Only the dead have seen the end of war.
    16. Re:This is wrong. by walt-sjc · · Score: 2

      The big difference is that we can SEE where spam comes from. It's in our log files. Heroin is all underground. Music is not email. We also have laws against copyright infringement and yet the labels have not gone after traders. Believe me, if there were good, tough, well-written NATIONAL (not state) laws on spam, I would go after EVERY SINGLE SPAMMER. It would be a GREAT source of additional income.

      I don't know why you say that spam laws would be difficult to enforce. We have logs, the illegal mail (spam), and the target phone numbers / web sites (the spamvertized material.) It's pretty cut and dried. If the DOJ gets a chunk of the fine, and the spammie gets "restitution", it would be a self-funded program.

      I have ZERO problems instituting a "usenet death penalty" type block on countries that don't have tough laws on spam. I already do so personally on my servers for about 30 countries. Emailers in those countries get rejected with a pointer to a web page that tells them what's going on, and how to get "whitelisted" if they are legit.

      I have no problems with having filters as well as laws, but we need the laws to reduce the bandwidth bills. Spam takes massive bandwidth and a toll on server CPU. Not too long ago, spam took AT&T's email servers for worldnet down for 3 days. This kind of thing HAS to stop. Filters at the receiving end won't stop the bandwidth usage of spam. Without stopping spam at the source, the problem will just get worse.

    17. Re:This is wrong. by MarkGriz · · Score: 1

      "Soylent Green is SPAMMERS!!!!"

      --
      Beauty is in the eye of the beerholder.
    18. Re:This is wrong. by WMNelis · · Score: 1

      We would have to be very careful about laws against spam. I remember a story here on Slashdot not too long ago about someone who apparently was accused of sending spam because he sent a resume. We don't want that to be illegal.

      --

      Sig free since 2/6/2002
    19. Re:This is wrong. by kallisti · · Score: 2
      I remember a story here on Slashdot not too long ago about someone who apparently was accused of sending spam because he sent a resume. We don't want that to be illegal.


      He sent his resume to a bulk-mailing list, that's spam for sure. People will be able to send resumes, just not to everyone in shotgun fashion.

    20. Re:This is wrong. by xchino · · Score: 1

      The same method HAS been applied to cracking/ddos. You can't legislate another country. The internet is an international medium, so the US can't stop spam by making it illegal. Just as they couldn't stop cracking/ddos. Spam Assassin is the best way to stop spam in its tracks, but I think a Spam Assassin that actually assassinated spammers would be more effective.

      --
      Everyone is entitled to their own opinion. It's just that yours is stupid.
    21. Re:This is wrong. by FyRE666 · · Score: 2

      There's a big difference with spam versus the illegal drugs trade and child porn: SPAMMERS WANT YOU TO SEE IT! This is their anchovies heel, there's clear evidence of the origin of the spam (or at least the incompetently administered mail server) - so that's where the fault lies.

      I agree that there should be more effort on the part of the US government to remove these bags of sh*t from the 'net. Where available, the ISPs should be forced to divulge the customer account used to post the spam. Any companies advertising via spam should be fined per item of spam. This last would remove a lot, since the spammers wouldn't have much to do without companies paying them. Lastly, if a mail relay is used repeatedly, either force it to close, sue the company responsible for it, or blacklist all mail from it for at least a year (kind of like a prison sentence for the server host).

      I waste far too much of my time scraping up the crap these parasites spew into my mail server - they deserve the harshest of penalties - at least equal to child porn swappers.

    22. Re:This is wrong. by porges · · Score: 1

      This is their anchovies heel,

      Please tell me this was deliberate.

    23. Re:This is wrong. by Anonymous Coward · · Score: 0

      Please tell me this was deliberate.

      He did it on porpoise.

    24. Re:This is wrong. by terrab0t · · Score: 1

      Frankly, I see an effective filtering mechanism like this as the best way of ridding the world of SPAM. It's very true that a law is only as effective as its enforcement. The better it's enforced, the harder life is for those in the spamming business. However, if we all had effective filters like this, life would be even harder for spammers, as they would have to both dodge the law and deal with getting only a fraction of their waste to its destinations.

    25. Re:This is wrong. by FyRE666 · · Score: 1

      ;-) Yeah, I was going to write some clever note about it at the bottom, but I forgot...

    26. Re:This is wrong. by Anonymous Coward · · Score: 0
      Most spammers are Jews.

      Filthy, stinky, rotten, Jews.

    27. Re:This is wrong. by si1k · · Score: 1

      Cutting off countries that don't prevent spam defeats the whole INTER-NET principle of the Internet. The more networks and nodes attached to the Internet, the more useful the Internet is to its users. Being able to email or receive email from people in other countries is FAR more important to me than getting rid of spam. I would vastly prefer having to twiddle the delete key over losing touch with friends, colleagues and clients in other countries.

      The UDP doesn't apply to something as crucial as email, and it's controversial in any case.

      There are many ways to deal with spam, but this would really be throwing the baby out with the bath water!!

  3. Absolutely..... by reaper20 · · Score: 2

    I propose we define spam as unsolicited automated email. This definition thus includes some email that many legal definitions of spam don't. Legal definitions of spam, influenced presumably by lobbyists, tend to exclude mail sent by companies that have an "existing relationship" with the recipient.

    This needs to happen, just because I buy a book from a company doesn't mean I want their stupid monthly mailing list.

    This seems very similar to SpamAssassin, which a lot of us are using with great success.

    1. Re:Absolutely..... by WereTiger · · Score: 1

      But keep in mind what one spammer said in a previous Spam article on /.:

      "Stopping spam is simply a matter of economics. When it's uneconomical to send spam, people will stop sending it."

      IE: if everyone's inboxes are filtered, there won't be any spam anymore.

      --
      If you're hearing rhetoric about Linux, open source, or Mac and everyone's bashing Microsoft, you've found Slashdot.
  4. I heard about this! by WilliamsDA · · Score: 2, Funny

    I got an email last night about this! Also, it asked me to help out his Nigerian cousin...

  5. Filter for color ff0000 by geekoid · · Score: 2

    of course! it sounds so obvious now.
    jeez, that alone would cut down on spam, cross reference that with my trusted address book, and I'll probably be able to filter all spam.
    I have that feeling you get when you've been stuck with a problem, and some guy looks at the code for about 2 seconds and finds a problem.

    --
    The Kruger Dunning explains most post on /. http://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect
  6. If you use Outlook... by Anonymous Coward · · Score: 2, Informative

    (Yeah, yeah, I know...)

    But if you do, check out Cloudmark's SpamNet. I've been quite pleased with its ability to stop spam, and it gets better the more people that use it.

    1. Re:If you use Outlook... by Anonymous Coward · · Score: 0

      I'll second Cloudmark's SpamNet.. I use it and while it doesn't catch it all, it has reduced it by over half and I don't have to actually do anything.

    2. Re:If you use Outlook... by johnstewart · · Score: 1

      I tried it a month or so ago and thought it was utter crap. Lots of problems contacting their server, etc., and no response from them on the forums when people were reporting problems.

      Plus, it's free now, but they've made no promise it will remain that way. How are they going to support their servers unless they start charging?

      I finally got SpamAssassin going here. Use amavisd-new to tag stuff as it flows through a postfix server. Tell your users how to filter based on the headers.

      It changed my life. And it's free. And I'm not dependent on anyone else's boxes working.

      SpamAssassin does do something similar... they use what they call a "genetic algorithm" to assign scores to the different rules people have made. Pretty similar to what they're doing here.

      However, it sounds like they could use this algorithm to find new rules the SA folks haven't thought of yet, to put into SA.
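
The "filter based on the headers" step in the comment above can be sketched on the client side. This assumes the server-side tagging adds an `X-Spam-Flag: YES` header to messages it considers spam, which is the header SpamAssassin conventionally uses; whether a given amavisd-new setup emits it is an assumption, so check the headers on your own mail first. The function name is made up.

```python
from email import message_from_string


def is_tagged_spam(raw_message: str) -> bool:
    """Return True if the server already tagged this message as spam."""
    msg = message_from_string(raw_message)
    # Header name assumed; header lookup is case-insensitive in the email module.
    return msg.get("X-Spam-Flag", "").strip().upper() == "YES"


tagged = "From: a@example.com\nX-Spam-Flag: YES\nSubject: offer\n\nBuy now\n"
clean = "From: b@example.com\nSubject: lunch\n\nNoon?\n"

print(is_tagged_spam(tagged))
print(is_tagged_spam(clean))
```

The client never scores anything itself here; it only reads the verdict the server stamped on the message.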

    3. Re:If you use Outlook... by belphegore · · Score: 1

      If you use Outlook, and you like SpamAssassin, then you can use SpamAssassin Pro from Deersoft instead of SpamNet -- seems to work much better for most people in terms of false positive/false negative rates.

    4. Re:If you use Outlook... by Anonymous Coward · · Score: 0

      Ack, $29.95! Right now, SpamNet is still free, and is getting better with each version and as more people join.

  7. Ok, that is hot.... by Vengie · · Score: 4, Insightful

    1) Lisp...ever since i ran into scheme, I have _loved_ the concept of lisp based languages. A nice Hoo-ha to anyone who says there are no practical applications of lisp based languages. (except haskell...which personally, i think sucks! if one of our own professors hadn't invented it, it would be dead by now)

    2) _0_ false positives. I'm perfectly happy to settle with "some small number of spams getting through" given there are NO false positives. Early on in the article he states that he realizes this is a critical problem, and from the start keeps no false positives as a goal. It is far better to have no false positives than to have a 100% no-spam rate with that in mind...

    3) the statistical word analysis is really interesting..."describe" is innocent. unfortunately....what happens when a few smart spammers get their hands on this analysis *sigh*

    --
    When in doubt, parenthesize. At the very least it will let some poor schmuck bounce on the % key in vi. (Larry Wall)
    1. Re:Ok, that is hot.... by Anonymous Coward · · Score: 1, Insightful
      I'm perfectly happy to settle with "some small number of spams getting through"

      I'm not singling you out, but this statement is the exact reason spam has become as popular as it has. It's annoying, it's cumbersome, but everyone is willing to 'settle' to avoid further problems. People spend effort developing complex filters and programs and proxies. which the spammer spends about a minute and a half figuring out how to get around. I think with the spammers there should be ZERO tolerance and ZERO SPAM. To stop spam you need to stop THE SPAMMER.

    2. Re:Ok, that is hot.... by Vengie · · Score: 2

      I was referring to the spam filtering software. I realize spam is an evil that must be fought at the source -- while I _do_ wish for the eventual removal of ALL spam, in assessing a SPAM FILTERING software package, the critical element is the false positives. I'd rather have a software package that has 50% filtering and 0 false positives than 100% filtering and 1 false positive. I _never_ want to miss an actual email directed at me.

      --
      When in doubt, parenthesize. At the very least it will let some poor schmuck bounce on the % key in vi. (Larry Wall)
    3. Re:Ok, that is hot.... by GrenDel+Fuego · · Score: 1

      I have a guaranteed method for making sure that no spam gets through. Filter all e-mail to /dev/null, and you're sure not to miss a single spam message.

      However, I'm not going to use this method because I'd actually like to read mail that someone sent to me.

      He wasn't suggesting that getting rid of all spam is not a goal to strive for, it's that you shouldn't use methods that may keep you from reading real e-mail.

    4. Re:Ok, that is hot.... by Plutor · · Score: 5, Insightful

      1) [...] A nice Hoo-ha to anyone who says there are no practical applications of lisp based languages. (except haskell...which personally, i think sucks! [...])

      You ridicule people who dismiss the usefulness of your personal "favorite" language, and then you dismiss the usefulness of one particular language that you happen to dislike? That's a bit hypocritical.

      3) [...] what happens when a few smart spammers get their hands on this analysis[?]

      Paul covers this. First, he suggests that each user's filters should be personalized, so that any spammer would not be able to circumvent everyone's filters. Second, the filters would be continually learning, possibly dumping older words from the corpus in favor of newer ones. And third, even if a spammer put at the end of his spam "describe describe describe describe", this still wouldn't work; the basic premise of the filter is that the spammer HAS to tell you what he's selling, and in the process of doing that, gives himself away as a spammer.

    5. Re:Ok, that is hot.... by jglow · · Score: 2, Interesting

      The good thing about his method is that even if a spammer gets ahold of his analysis, the more spam is received with those words, the more it slowly lowers the likelihood of such a message being treated as real email, thus dumping those messages into the spam box.

      --


      There's no "I" in Linux.. err..
    6. Re: Ok, that is hot.... by Black+Parrot · · Score: 2


      > 2) _0_ false positives. I'm perfectly happy to settle with "some small number of spams getting through" given there are NO false positives.

      Also, you can stack NFP filters in series, so that each tries to catch any junk that the earlier ones missed.
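A minimal sketch of stacking filters in series, assuming each filter is just a function that returns True for spam. The two rules and the sample inbox are hypothetical, made up for illustration; the chain only stays free of false positives if every stage is.

```python
from typing import Callable, Iterable, List

Filter = Callable[[str], bool]  # returns True when the message looks like spam


def contains_free_offer(message: str) -> bool:
    # Hypothetical first-stage rule, for illustration only.
    return "free offer" in message.lower()


def shouts_in_caps(message: str) -> bool:
    # Hypothetical second-stage rule: every letter in the message is upper case.
    letters = [c for c in message if c.isalpha()]
    return bool(letters) and all(c.isupper() for c in letters)


def run_in_series(filters: Iterable[Filter], messages: Iterable[str]) -> List[str]:
    """Return the messages that no filter in the chain flagged."""
    remaining = list(messages)
    for is_spam in filters:
        # Each stage only sees what the earlier stages let through.
        remaining = [m for m in remaining if not is_spam(m)]
    return remaining


inbox = ["FREE OFFER inside", "BUY NOW", "Lunch on Friday?"]
kept = run_in_series([contains_free_offer, shouts_in_caps], inbox)
```

Here the second stage catches the message the first one missed, and only the ordinary message is left in `kept`.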

      --
      Sheesh, evil *and* a jerk. -- Jade
    7. Re:Ok, that is hot.... by shayne321 · · Score: 3, Informative

      I'd rather have a software package that has 50% filtering and 0 false positives than 100% filtering and 1 false positive. I _never_ want to miss an actual email directed at me.

      I have to respectfully disagree here. First, you should NEVER trust an automated mechanism to delete e-mail before you open it (I'm not saying you are, just saying it should never be done). When e-mail comes into my inbox, it's generally a user problem or a network-down situation. Mozilla beeps at me, and I drop what I'm doing to see what e-mail has just arrived. If it's spam, I've wasted the effort in losing my train of thought on whatever I was working on, plus whatever amount of time it takes me to refile it in my spam folder and adjust my filters so it doesn't happen again.

      Using spamassassin, I filter all e-mails marked as spam off into a "spam" folder which I browse through about once a day at the end of the day just to be sure no legit e-mail has been filed over there. Takes only a second, and generally if the e-mail is "spammish" enough for spamassassin to file it over there it's not an important e-mail, but maybe a package ship notice from UPS, or an order update from amazon.com (though with effective whitelisting you can reduce how often this happens).

      Not trying to change your opinion, just wanted to offer an alternate viewpoint. IMHO one of the things that makes spamassassin so good is that you can alter your threshold, so that if you can live with some false positives but hate spam, you can use a lower threshold. If you can live with some spam and never want to miss "legitimate" e-mail, you can use a higher threshold.
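The threshold trade-off can be sketched as follows. The scores and threshold values are invented for illustration and are not taken from any real filter's rule set or configuration.

```python
def classify(score: float, threshold: float) -> str:
    """File a message as 'spam' when its score reaches the threshold."""
    return "spam" if score >= threshold else "inbox"


# Hypothetical scores a score-based filter might have assigned.
scores = {"newsletter": 3.5, "ship notice": 5.2, "pills ad": 12.0}

# Lower threshold: less spam in the inbox, more false positives.
strict = {name: classify(s, 5.0) for name, s in scores.items()}

# Higher threshold: fewer false positives, more spam gets through.
lenient = {name: classify(s, 8.0) for name, s in scores.items()}
```

With the lower threshold the ship notice lands in the spam folder; with the higher one it stays in the inbox, and only the obvious ad is filed away.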

      Shayne

      --
      Today I didn't even have to use my AK; I got to say it was a good day -- Icecube
    8. Re:Ok, that is hot.... by Vengie · · Score: 1, Troll

      1) the usefulness of one particular language that you happen to dislike? That's a bit hypocritical. Not when you consider that: a) the language was invented as an exercise in programming, b) its entire purpose was to stroke the ego of the professor who wrote it, c) said professor _required_ its usage in his intro programming class to the detriment of his students. IMHO, "language family x sucks" and "language x sucks" are a world apart.

      --
      When in doubt, parenthesize. At the very least it will let some poor schmuck bounce on the % key in vi. (Larry Wall)
    9. Re:Ok, that is hot.... by madfgurtbn · · Score: 2

      Even better, it can differentiate between good and bad spam. That is, over time, it would be able to decide if a certain kind of spam was on-topic for you. It would be nice if they could add something like this to a search engine, so that when you click on a link in Google and it turns out to be off-topic for you, you could add it to your "spam" corpus. It would personalize your results over time, so it would learn what you are looking for when you use search terms with multiple meanings, and help weed out all the websites that are little more than keyword spam.

      --
      Send lawyers, guns, and money. Dad, get me out of this.
    10. Re:Ok, that is hot.... by kacp · · Score: 1

      >3) the statistical word analysis is really interesting..."describe" is innocent.
      >unfortunately....what happens when a few smart spammers get their hands on this analysis

      Then the whole process starts over again. The war on spam is not gonna be over in one single swoop...no matter how good the software.

      --
      To write a haiku - all you need is the correct - number of syli...
    11. Re:Ok, that is hot.... by Myco · · Score: 2

      I have a dream that one day programming languages will be judged not by their checkered pasts, but by their suitability to the task at hand.

    12. Re:Ok, that is hot.... by Jucius+Maximus · · Score: 2
      "First, you should NEVER trust an automated mechanism to delete e-mail before you open it (I'm not saying you are, just saying it should never be done)."

      I think that you are not entirely correct here.

      About 1 year ago I was automatically subscribed to (one of many) listservs at my university along with everyone else in the engineering faculty.

      Apparently, for most of the people it was their first time on a listserv. Furthermore, the way this one was (badly) set up, the default 'reply-to' address was listserv@myuniversity.ca. Yes, this is a recipe for trouble.

      Of course some grad student innocently sent out a message with the subject "SUBJECTS NEEDED" because they needed test subjects for their grad work. Naturally, a whole whack of people replied and then replied to those replies, sending hundreds of messages called "re: SUBJECTS NEEDED" over the listserv.

      I quickly set up an auto-delete for that subject and it never came back to haunt me. (My dialup was being saturated by all the responses.) Thus, I think it's safe to auto-delete when you are protecting yourself from newbies who don't know how to handle an e-mail client.

      (Still, it wasn't as bad as when some idiots started signing up the list for hetero and homosexual pr0n-in-your-mailbox sites, but that's a different matter.)

    13. Re:Ok, that is hot.... by focuss · · Score: 1

      What occurred to me as a potential weakness (and I haven't really thought it through) is that the spammer could put the sales pitch at the top, followed by a bunch of newlines, and then a vast chunk of innocent text grabbed at random from an online book or the like.

      --
      burnt sig
    14. Re:Ok, that is hot.... by RevAaron · · Score: 3, Insightful

      Most people here on /. would say that same thing about Lisp-related languages that you do about Haskell. Esp that they were forced to use it, to their detriment, in an intro CS class, or perhaps in AI. I love Lisp myself, but I also think Haskell is quite interesting, and also can be very useful.

      There's no difference between you, "L1sp rules und haskell dr00ls!" and all the slashkiddiez on here that say "perl and C 0wnZ j00! fsck lisp!"

      --

      Working toward a usable PDA environment in the spirit of Newton OS: Dynapad
    15. Re:Ok, that is hot.... by William+Tanksley · · Score: 2

      I'd rather have a software package that has 50% filtering and 0 false positives than 100% filtering and 1 false positive. I _never_ want to miss an actual email directed at me.

      I have to respectfully disagree here. First, you should NEVER trust an automated mechanism to delete e-mail before you open it (I'm not saying you are, just saying it should never be done).


      I don't see how you're disagreeing -- he's saying that he wants to see less spam but ALL of his real email, and you're saying that you don't want to automatically delete any email.

      Okay, use this software to move spam to a folder rather than deleting it. Difference solved.

      -Billy

    16. Re:Ok, that is hot.... by Vengie · · Score: 1

      I can't reiterate enough. I go to the school where the CS department had an aborted brain child known as Haskell. The rest of the students/faculty have accepted scheme/ml as far superior for teaching that class....except its author. *sigh*.

      --
      When in doubt, parenthesize. At the very least it will let some poor schmuck bounce on the % key in vi. (Larry Wall)
    17. Re:Ok, that is hot.... by Dthoma · · Score: 1

      It's called homeostasis. Any changes to the environment cause the spam filter to compensate, whereas many pre-installed solutions are somewhat rigid and inflexible. Since this thing can actually change itself to best fit your particular variety of spam, this can be more effective AND with less effort.

      --

      Note to M1-ers: a curt but otherwise insightful message is not "Flamebait" or "Troll".

    18. Re:Ok, that is hot.... by Anonymous Coward · · Score: 0

      Interesting idea, but while bandwidth for us costs nothing (unless you're with a broadband company that has instituted broadband charges), the spammers DO have to pay for their bandwidth. If they have to move from sending 15-50 kb files to 150-500 kb novels in order to avoid spam filters, then there will be a LOT less spam out there.

      Also, if the trend becomes to put all the sales text at the beginning of the file and the novel at the end, then the filters could be adapted to only look at the beginning of the e-mail. And if the novel is put at the beginning or in the middle, then the impact of the spam decreases (the sales pitch doesn't quite come across).

      The filters could also be adapted to look for strings of good hits in a given paragraph/block, and if any given paragraph/block exceeds a certain value, then discard the letter.

      Keep in mind, the spammer is selling something. If all that is visible is a jumble of text to beat filters, no one will buy it, and they won't be spamming for long.

    19. Re:Ok, that is hot.... by shayne321 · · Score: 2

      I don't see how you're disagreeing -- he's saying that he wants to see less spam but ALL of his real email, and you're saying that you don't want to automatically delete any email.

      Well, he was saying he'd rather see spam than false positives, I'm saying I'd much rather see false positives (as long as they're still available in another folder) than spam in my inbox... Just a personal preference.

      Okay, use this software to move spam to a folder rather than deleting it. Difference solved.

      Yup, that was mostly my point, except that this technique seems more geared towards eliminating (or minimizing) false positives at the expense of letting some spams slip through. Spamassassin can be configured either way (well mostly, it's not 100% perfect but it's close).

      Shayne

      --
      Today I didn't even have to use my AK; I got to say it was a good day -- Icecube
    20. Re:Ok, that is hot.... by William+Tanksley · · Score: 2

      Hmm. I can't see your previous message saying that. All I see is a statement that you'd rather not delete the spams just in case one of them was a false positive. But anyhow, what you're saying makes sense, I think. But I still see a problem.

      I guess I see now what you were trying to say -- that you know an easy way to _tolerate_ false positives. I disagree; you know a way to tolerate false positives ONLY when there's a small number of positives. Some people get WAY more spam than you do; any false positives for them would make a spamcatcher useless, since they wouldn't have the time to scan through the junk.

      A spamfilter which had no false positives would be 100% beneficial, even if it had false negatives, because it wouldn't lose any important data; you could _always_ run it to reduce the amount of noise, and then if you wanted to reduce false positives using your method you could run another filter which was meaner.

      You'd wind up with a much less full spambox, and no lost messages.

      -Billy

    21. Re:Ok, that is hot.... by RevAaron · · Score: 4, Interesting

      I'm not sure if I'd characterize Haskell as an aborted brain child. Some people use Haskell. Some people like it. At a lot of schools in the US at least, they teach Scheme, when all the students/faculty have "accepted" C, C++, and Java as "superior" for teaching. Which is blatantly bullshit. Algol-kid languages suck, we all know that. (heh, couldn't help it) But the point still stands.

      --

      Working toward a usable PDA environment in the spirit of Newton OS: Dynapad
    22. Re:Ok, that is hot.... by lewp · · Score: 1

      The filters will quickly adapt to counter this behavior after at most a couple "delete as spam" clicks (and possibly none at all if the brief sales pitch is incriminating enough).

      The resistance to this sort of tampering is exactly why this is such a great technique.

      --
      Game... blouses.
    23. Re:Ok, that is hot.... by Anonymous Coward · · Score: 0

      what happens when a few smart spammers get their hands on this analysis
      They'll patent the algorithm and sue everyone who implements it.

      That is, of course, ignoring the oxymoron "smart spammer."

    24. Re:Ok, that is hot.... by shayne321 · · Score: 2

      A spamfilter which had no false positives would be 100% beneficial, even if it had false negatives, because it wouldn't lose any important data; you could _always_ run it to reduce the amount of noise, and then if you wanted to reduce false positives using your method you could run another filter which was meaner.

      Yeah, I see what you're saying here.. We're just looking at it from two different angles. In your scenario the spamfilter removes much of the noise but doesn't harm the signal. In my case I'd rather remove as much of the noise as humanly possible and look for any signal that may have gotten nabbed too at my convenience. As you said it probably depends a lot on how much signal-hunting you have to do and your other e-mail habits as to which of these camps you fall. In any case I think we can consider this horse dead and beaten. :)

      Shayne

      --
      Today I didn't even have to use my AK; I got to say it was a good day -- Icecube
    25. Re:Ok, that is hot.... by Yobgod+Ababua · · Score: 1

      "Then the whole process starts over again."

      But not completely.

      What would happen is that a spammer might start incorporating lots of "high-innocence" words into their spam to try to balance out the "low-innocence" words that they must also include to sell their message. If they don't include enough it gets marked as spam anyway, and those words will start to lose their innocence. If they do include enough for it to get through, you have one email to delete and those words will still lose their innocence.

      Plus, since he recommends that every user (or at least every site) maintain their own hash table, there are no "globally innocent" words. Analysis of a central corpus of spam can provide a seed set of "globally spammish" words for us to use (and them to try to avoid if they can).

      The individually adaptive nature of the system makes it extremely difficult to circumvent in any reliable manner.
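How a word "loses its innocence" in one user's table can be sketched like this. The class, the whitespace tokenizer, and the smoothed ratio are assumptions for illustration, not the weighting used in the article.

```python
from collections import Counter


class UserFilter:
    """Per-user word counts; a hypothetical structure, not the article's code."""

    def __init__(self) -> None:
        self.spam_counts: Counter = Counter()
        self.good_counts: Counter = Counter()

    def train(self, text: str, is_spam: bool) -> None:
        # Every "delete as spam" (or "keep") action updates this user's counts.
        counts = self.spam_counts if is_spam else self.good_counts
        counts.update(text.lower().split())

    def spam_probability(self, word: str) -> float:
        # Simple ratio with add-one smoothing; an assumption for illustration.
        s = self.spam_counts[word] + 1
        g = self.good_counts[word] + 1
        return s / (s + g)


user = UserFilter()
user.train("please describe the schedule", is_spam=False)
before = user.spam_probability("describe")

# A spammer pads a message with a formerly innocent word; the user deletes it as spam.
user.train("describe describe cheap pills", is_spam=True)
after = user.spam_probability("describe")
```

After the padded spam is marked, "describe" scores higher for this user only; another user's table is unaffected.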

    26. Re:Ok, that is hot.... by Anonymous Coward · · Score: 0

      "what happens when a few smart spammers get their hands on this analysis"

      NOTHING! Because they can't possibly write something that can slide past individual user filters. Every "delete as spam" action fine tunes the filters for that person.

      "the brain of the filter is in the individual databases, then merely tuning spams to get through the seed filters won't guarantee anything about how well they'll get through individual users' varying and much more trained filters."

  8. Fighting Sperm by Anonymous Coward · · Score: 0

    The best way to avoid a torrent of gloppy manjuice shooting all over your naked buttocks every time you even CONSIDER turning your computer on is to remove Linux from your hard drive immediately.

    By installing a stable, sensible OS like Xenix, you can ensure an ejaculate-free user experience.

  9. Easy way to beat spam 100% by Anonymous Coward · · Score: 4, Interesting

    Create an E-Mail address called, say, spam@example.net.

    Put a link to it on your website, but tell people not to use it for anything, E.G.

    <a href="mailto:spam@example.net">Spam trap - don't use me</a>

    Then, it'll get harvested along with all the others on your site. That mail box will fill up with spam, and nothing else.

    What good is that? Well, you've got a ready-made list of messages to filter *out* of your other mail boxes!

    So, just write a script that checks each inbound E-Mail against the spam list. If it matches, you *know* it's either:

    1. Spam

    or

    2. An E-Mail that somebody has also sent to the "Don't use me" address.

    In either case, you don't want to read it, so it gets auto-deleted. Nice.
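A minimal sketch of such a script, assuming an exact match on a normalized message body. The normalization and the sample messages are made up for illustration, and a spammer who varies the wording at all would slip past an exact match like this.

```python
import hashlib
import re


def fingerprint(body: str) -> str:
    """Hash the body after collapsing whitespace and case (an assumed normalization)."""
    normalized = re.sub(r"\s+", " ", body).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Messages that arrived at the trap address (made-up samples).
trap_bodies = ["Buy   cheap pills NOW", "You have won a prize"]
trap_fingerprints = {fingerprint(b) for b in trap_bodies}


def seen_in_trap(body: str) -> bool:
    # Set membership keeps the check cheap even for a large trap mailbox.
    return fingerprint(body) in trap_fingerprints
```

For example, `seen_in_trap("buy cheap pills now\n")` matches the first trap message, while the same pitch with a few extra characters appended does not.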

    Oh, I think I'll patent this, and not tell any of you about the royalty I'm going to charge in 15 years time. Hahahahahahaha!!!

    Oh, by the way, first post, first post... NOT!

    1. Re:Easy way to beat spam 100% by elmegil · · Score: 1

      You don't even have to make it "don't use me". Use tags to make the text the same color as the background, and nobody will ever see it except for the spam harvesting bots.

      --
      7 November 2006: The day Americans realized corruption and incompetence weren't addressing 11 September 2001
    2. Re:Easy way to beat spam 100% by Anonymous Coward · · Score: 0

      Yeah, good point. The only thing not to do is to make it an empty link, like:

      <a href="mailto:foobar"></a>

      because I think the spam bots probably *would* filter that out if it got widespread.

    3. Re:Easy way to beat spam 100% by gmuslera · · Score: 1

      Or better yet, a 1x1 transparent GIF lost in the page (if you like, you can put things like "don't use me" in the ALT attribute of the image to warn off curious people who browse in text mode or with graphics disabled).

      Note that putting the word "spam" near the trap could let current or future spambots recognize traps and avoid them.

    4. Re:Easy way to beat spam 100% by shayne321 · · Score: 2

      What good is that? Well, you've got a ready-made list of messages to filter *out* of your other mail boxes!

      WOW, what a *great* idea! What if you could make it so that it knew not only about spam sent to your spam trap, but spam sent to thousands of spam traps and real users? Oh wait, that exists already. Look at Vipul's Razor and DCC.

      Shayne

      --
      Today I didn't even have to use my AK; I got to say it was a good day -- Icecube
    5. Re:Easy way to beat spam 100% by phamlen · · Score: 2, Informative
      Well, you've got a ready-made list of messages to filter *out* of your other mailboxes

      This doesn't work because spam messages are not identical. That's the whole problem in a nutshell - how do you determine that one email matches another?

      1. Spammers routinely change the wording/spacing/non-essential elements in a message so that they don't match exactly.
      2. If you cut down to searching for "parts of a message", then you're back to "content-filtering".
      3. the same thing occurs if you check for email address, etc.

      Also, it's worth noting that BrightMail and other companies have been using "spam honeypots" for years. Their effectiveness isn't very good.

      What is interesting, though, is that you could use this technique extremely powerfully with the Bayesian filter. Instead of writing a script for yourself, have the script automatically move the message into your "spam" corpus. You'll get your spam blocking up hugely without ever having to see spam.

    6. Re:Easy way to beat spam 100% by danro · · Score: 2

      If you don't expect any Lynx or other text-only browser users to visit your page, you could even use CSS to cause the browser never to display the link to the user at all. This works in all modern browsers (and even not-so-modern ones, like NS4).

      So, your visitors don't even need to look at your little honeypot, as their user agent won't render it. Harvesters, however, will probably get it anyway, since there are lots of reasons why a legit address would not be displayed all of the time.

      Come to think of it, text-only browsers aren't really a problem. Anyone using Lynx is probably smart enough not to use an "I'm a spam trap" link anyway...

      Happy spam hunting, boys and girls!

      --

      "First lesson," Jon said. "Stick them with the pointy end."
    7. Re:Easy way to beat spam 100% by oktaya · · Score: 1

      This is a good idea. Actually, you can do the same thing this way as well.

      Actually I can use my verizon email for this purpose. I had verizon DSL about 2 years ago. My email address was at the time oktay@bellatlantic.net. I have used that email for a while then switched to cable. That email was never shut down. It kept gathering spam everyday. Nobody was actually using the mailbox.

      But one day I noticed something else. Most of the spam emails I was receiving were addressed to oktay@verizon.net. This is an email alias I could apparently use. But never having known about the existence of this alias, I have NEVER used it. By the same token, I haven't published it online or given it to anybody I know.

      This is clearly Verizon keeping my email box open just for the purpose of selling my email address to spammers.

      Very scary considering verizon is supposed to be an ISP.

      Oktay Altunergil

      --
      ---------------
      Founder of The Free Linux CD Project
    8. Re:Easy way to beat spam 100% by Geekboy(Wizard) · · Score: 1

      My [spambox@] is a legit address. I check it every day. I also have [box@] that is an automatic forward to Vipul's Razor. I protect all of my email boxes (except the auto-forwarded one) with SpamAssassin, and no spam has been overlooked. I have gotten a few false positives, but they were suspicious emails to begin with.

    9. Re:Easy way to beat spam 100% by Anonymous Coward · · Score: 0

      Bad idea. It's way too processor-intensive to compare every email you get against some MASSIVE list of spam. And spam changes all the time.

    10. Re:Easy way to beat spam 100% by alcmena · · Score: 4, Funny

      if you like, can put things like "don't use me" in the ALT attribute of the image to avoid curious people that browse in text/disable graphics mode.

      Better yet, use the alt text "CLICK HERE!" and everyone will assume it's some sort of ad and they will refuse to touch it with a ten foot pole. "CLICK HERE!" is like the web version of the radioactive symbol.

    11. Re:Easy way to beat spam 100% by Anonymous Coward · · Score: 0

      No offense, but your name just makes me want to say, "Babatunde Olatunji"

    12. Re:Easy way to beat spam 100% by sirinek · · Score: 1
      "CLICK HERE!" is like the web version of the radioactive symbol.


      *LAUGH* That's the funniest thing I've read all week! :)


      siri



    13. Re:Easy way to beat spam 100% by asackett · · Score: 2
      I use slashdot@artsackett.com for this very reason. It catches anywhere from a half dozen to about two dozen spams each day. The delivering IP address is automatically added to my local DNS blocklist without a human being ever being forced to delete the message.

      I also use ORBS, spamhaus, and others, and on a typical day, I receive three or four spams, and block 74 to 76. My logs rolled seven hours ago, and already I've blocked 25, received two.

      --

      Warning: This signature may offend some viewers.

    14. Re:Easy way to beat spam 100% by M00TP01NT · · Score: 1

      Maybe Verizon does sell names -- who knows? But spammers should know that all bellatlantic.net addresses have a parallel verizon.net address, so the spammers spam both.

    15. Re:Easy way to beat spam 100% by Anonymous Coward · · Score: 0

      Just stick it in an HTML comment. I believe the spambots parse the HTML looking for mailto hrefs and other email addresses, so a comment would hide it from the browser, but not from the spambot.

      That is, until the spammers figure out not to search between the comment tags.

    16. Re:Easy way to beat spam 100% by Anonymous Coward · · Score: 0

      Nice - until a spammer uses an ISP's mailhost to spam you. You block it automatically, and then you are blocking people who use that mailhost to mail you legitimately.

    17. Re:Easy way to beat spam 100% by mikeee · · Score: 2

      I think you have something here... we can use pop-up ads to mark the long-term nuclear waste repository!

  10. only 5 per 1000? by jeffy124 · · Score: 2, Funny

    that means CmdrTaco reduces his spam intake to around 500/day.

    --
    The One Rule Of Chess You'll Ever Need: Don't play someone who carries a kit in their bookbag.
  11. This really can work by kcroke · · Score: 1

    There are some internet filters out there that use Fuzzy Logic instead of databases. They are able to determine what category a web page belongs in without ever having seen the web page before.

    This technology should also be able to be applied to spam.

    I hope yahoo reads that article.

    1. Re:This really can work by Anonymous Coward · · Score: 0

      Paul Graham's spam detection method *is* fuzzy logic.

      Furthermore, he gives a way to train the fuzzy logic rule on your email data sets.

  12. not to be pessimistic.. by shiafu · · Score: 1

    Even if someone develops a clever algorithm that's 99% effective, won't the spammers just find a way around it? It's sort of like the music industry and their vain attempts at copy protection. Some of these spammers are smart, computer-savvy people too.

    1. Re:not to be pessimistic.. by Anonymous Coward · · Score: 0

      He talks about why he thinks this will not be a problem in the article.

    2. Re:not to be pessimistic.. by Anonymous Coward · · Score: 0

      The article explains that.
      As far as I understood it, the point is that for spam to be effective (and hence continue existing) it has to utilize a certain set of words and phrases (marketspeak) - and the key feature is that that lingo is different from the set of words and phrases used in normal mail.
      Of course, spammers could start putting content into dry, casual form that would not have stuff like "FREE!" and "Limited time only", but then noone would react at all, making spam unprofitable and dead.

    3. Re:not to be pessimistic.. by Anonymous Coward · · Score: 0

      99% of your question would be answered by RTFA

  13. But spammers evolve... by bobdotorg · · Score: 1

    One feature of spammers is that they adapt to any sort of anti-spam technology. What's to stop spammers from writing spam filled with 'non-spam' words?

    --
    __ Someday, but not this morning, I'll finally learn to use the preview button.
    1. Re:But spammers evolve... by Anonymous Coward · · Score: 0

      Because spam written without spam words just isn't spam. Then you've got false positives, so you adjust the filter again, to include more non-spam words, which were not from spam to begin with. So, the non-spam words are not really non-spam words, but non-non-spam words, and the problem with that is that non-non-spam words are just spam words, because of the double negative, so do you filter them out or not? In effect you are creating your own spam, or meta-spam. So, then you've got to filter out meta-spam, using non-meta-spam words, which gets a bit confusing.

    2. Re:But spammers evolve... by russx2 · · Score: 1

      And of course there's the fact that spam, without using spam-like words, just won't be effective. Now I'm not saying spam in its present form of 'cum see me and my friends naked in my dorm room FREE' is particularly effective either, but if spammers can't make their spam at least... well, intriguing :-), what's the point?

    3. Re:But spammers evolve... by Anonymous Coward · · Score: 0

      or send images instead of text to ppl with html enabled email clients....

    4. Re:But spammers evolve... by Anonymous Coward · · Score: 0

      Actually this guy is really clever.
      The reason this won't work is because everyone's "non spam" words will be different.
      So loading up an email with tons of "extra" words will just hit the filter faster.

    5. Re:But spammers evolve... by bugbear · · Score: 1

      In this case, one thing that makes it hard for the spammers is that nonspam words will vary from one individual user to another. To take advantage of that you do have to filter individually for each user, though.

      It might be possible to make a general purpose spam filter that you could just plug into your network like a router, but I am less optimistic about the chances of that working long term.

    6. Re:But spammers evolve... by Anonymous Coward · · Score: 0

      Well, assuming one could weed out randomly generated chains of good-sounding words (i.e. a string of 20 SAT words in a row looks pretty damn statistically suspicious), the spammer would then have to write letters that use "good" words contextually. That would do two things. First, it would make generating tons of varied e-mails significantly harder (thus making it easy to catch them by traditional methods), and second, it would make it impossible to use condensed sensationalist language and therefore reduce the already small return on the dwindling numbers of un-intercepted mail. No returns means no reason to send it in the first place.

  14. Ack! LISP! by Anonymous Coward · · Score: 0

    His sample code is written in LISP! Run away! RUN AWAY!

  15. spam is a necessary evil by Pink+Hamster · · Score: 0, Troll

    I think that spam is a necessary evil that can be easily controlled. If we make a law to simply ban spam then we might be banning other things like mail lists. I personally receive NO SPAM in my main account and less than one piece a day in my "junk mail account." That's including things that the spam filter catches. All people have to do is to be careful with their e-mail addresses. Spam is not a problem for people who use a modicum of common sense.

    1. Re:spam is a necessary evil by matt_wilts · · Score: 3

      I think that spam is a necessary evil that can be easily controlled. If we make a law to simply ban spam then we might be banning other things like mail lists. I personally receive NO SPAM in my main account and less than one piece a day in my "junk mail account." That's including things that the spam filter catches. All people have to do is to be careful with their e-mail addresses. Spam is not a problem for people who use a modicum of common sense.

      Let me tell you, the longer you've been online the more likely you are to get this shite. Remember, it only takes ONE posting of your mail address to a newsgroup (which in my case could have been years ago) and that's it. Then of course you end up on one of these "1 BILION fresh email addresses for $100" lists and you're dead meat.

      Matt

    2. Re:spam is a necessary evil by Anonymous Coward · · Score: 0

      A mailing list is not spam, because I took the effort to explicitly sign up for the wxPython mailing list. I want to receive those messages, so it is not spam.

    3. Re:spam is a necessary evil by Anonymous Coward · · Score: 0

      > Spam is not a problem for people who use a modicum of common sense
      And the girl with the skimpy skirt is asking to be raped.

    4. Re:spam is a necessary evil by Anonymous Coward · · Score: 0

      You are behind the times...

      one technique in use by spammers today is to simply blast an ISP with lists of possible addresses... aadams@isp.com, badams@isp.com, cadams@isp.com, etc

    5. Re:spam is a necessary evil by Pink+Hamster · · Score: 1

      I am not saying that mailing lists are spam; I am saying that mailing lists could fall under a law banning spam. In the article interviewing spammers they said that they used false identities to sign up for internet accounts. Many people suggested using existing laws (like laws against fraud) to attack spammers. I think this is an excellent idea.

    6. Re:spam is a necessary evil by Anonymous Coward · · Score: 0

      Hmmm.....so does that mean that you won't mind if I do this:

      MrSlothful@yahoo.com
      Mrslothful@yahoo.com
      mrSlothful@yahoo.com
      mrslothful@yahoo.com

      After all...this IS a necessary thing, whether it be good or bad.

    7. Re:spam is a necessary evil by Anonymous Coward · · Score: 0

      tr 'A-Z' 'a-z'

  16. arc by Anonymous Coward · · Score: 0

    I wonder when Paul will release arc to the world.

    1. Re:arc by Anonymous Coward · · Score: 0

      Yeah, especially since the part of the idea of arc was that the "top level" was going to be web accessible-

      Paul, if you are reading this, just let us in! We are used to using pre-alpha quality software, if it's cool enough.

      Jake

  17. A weak point... by tomknight · · Score: 2
    One question that arises in practice is what probability to assign to a word you've never seen, i.e. one that doesn't occur in the hash table of word probabilities. I've found, again by trial and error, that .2 is a good number to use. If you've never seen a word before, it is probably fairly innocent; spam words tend to be all too familiar.
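The lookup described in the quoted passage can be sketched in a few lines of Python. This is only an illustration: the table contents and the names `word_probs` and `spam_probability` are invented here, and only the .2 default for unseen words comes from the article.

```python
# Hypothetical hash table of per-word spam probabilities (values are illustrative).
word_probs = {"sex": 0.97, "sexy": 0.99}

UNKNOWN_WORD_PROB = 0.2  # the article's trial-and-error value for unseen words


def spam_probability(word, table=word_probs):
    """Return the stored probability, or the default for a word never seen."""
    return table.get(word, UNKNOWN_WORD_PROB)


print(spam_probability("sexy"))         # a word already in the table
print(spam_probability("efficacious"))  # an unseen word falls back to the default
```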

    Sadly once the spammer knows this method's being used, he'll start chucking in obscure (but valid) words... ah well, maybe at least spam will start getting interesting to read, assuming the spammer tries to use the word in context.

    "Buy my superlatively efficacious mail list."

    Maybe not...

    Tom

    --
    Oh arse
    1. Re:A weak point... by sebi · · Score: 2, Interesting

      You should have continued to read the article.

      To beat Bayesian filters, it would not be enough for spammers to make their emails unique or to stop using individual naughty words. They'd have to make their mails indistinguishable from your ordinary mail. And this I think would severely constrain them. Spam is mostly sales pitches, so unless your regular mail is all sales pitches, spams will inevitably have a different character.

      Basically the only way to get around this proposed method of statistical analysis is to completely change the way spam copy is written. But changing that would basically defy the whole point of spam. If, to get through a filter, you had to stop writing sales pitches, then why spam in the first place?

    2. Re:A weak point... by tomknight · · Score: 3, Insightful
      Yes, I'll admit I hurried in with the comment there. Stupid ;-)

      Spammers would learn to adapt, and the sales pitches would change character/format. The sales pitch will still be that, but it'll be more cleverly designed - it may be hard to do, but people will manage it. Having said that, this method does look like it could be worth implementing - maybe even on the mail server...

      Tom.

      --
      Oh arse
    3. Re:A weak point... by sebi · · Score: 2

      A quick quote from a recent /. story:

      If you don't think the filters and blacklists work, one spammer whines, "My operating costs have gone up 1,000 percent this year, just so I can figure out how to get around all these filters."

      Spammers might learn to adapt as long as it makes economic sense. Remember: With this kind of statistical analysis this time around the Spammers have to play catch up with the filters instead of the other way around...

    4. Re:A weak point... by tsg · · Score: 2, Insightful

      but it'll be more cleverly designed

      Ding ding ding ding <points at nose>.

      I think you've hit the nail on the head. Simply requiring that spam be cleverly designed should get rid of 99% of spammers.

      --
      People's desire to believe they are right is much stronger than their desire to be right.
    5. Re:A weak point... by Anonymous Coward · · Score: 0

      Like i said for someone else..
      the "non spam" words will be different for each user.
      and if you load up an email with potential "non spam words" it will in the end just build up the filter lists that much faster.

    6. Re:A weak point... by plover · · Score: 2
      Not necessarily.

      Spammers could start with the simple "leet" misspellings of their pitch words: 'Ea$iest way to get a j0b.' They could rotate these or generate them on the fly with a suitable mailer. 'Easies+ way to 6et a jo8' and 'Eas!est way to ge+ a j*b' are all variations that would pass the spam filters once (although they're not terribly effective.)

      But the serious threat will come next in the form of the abomination that is Unicode. There are an infinite number of combinations of foreign letters that look 'Roman enough' that the casual user would have no trouble reading. The whole pitch would be crafted from randomly generated unicode look-alike letters. These words would then never appear twice in a dictionary.

      A related problem bit me this very morning. I was debugging some printed text that someone had cut-n-pasted from a Word document into a field on their maintenance web page. Turns out Word had used the unicode character 0x2019 to represent an apostrophe, but the Microsoft-provided wcstombs() function choked on it, unable to translate it into a recognizable 8-bit printable equivalent.

      So there will be ways around these filters. The question is now how long it will take for the spammers to start trying to beat them? I don't think they care about hitting every last hacker's inbox, but I do think they need to avoid ISP-level spam filtering.

      --
      John
    7. Re:A weak point... by Tablizer · · Score: 2

      Basically the only way to get around this proposed method of statistical analysis is to completely change the way spam copy is written. But changing that would basically defy the whole point of spam.

      Not "completely", just enough to get through *sometimes*.

      The thing is, the closer the two grow in similarity, the more false positives will get through regardless of tuning.

      Further, I am not sure the hyperbole of the current crop of spam is that effective anyhow. If they toned it down, it may even be *more* effective.

      Humans (spammers) are still better than AI, and that is the bottom line here. Even if they were the same, false positives would still get thru.

    8. Re:A weak point... by billatq · · Score: 1
      Sadly once the spammer knows this method's being used, he'll start chucking in obscure (but valid) words... ah well, maybe at least spam will start getting interesting to read, assuming the spammer tries to use the word in context.

      With the number of people that use web-based e-mail, such as hotmail (which automatically loads images), I don't see it as completely implausible for them to make an innocuous e-mail, set the color to FFFFFF and then load the spam as an image. Presto! You'd need OCR built into your spam filter in order to block it. Additionally, it's possible to easily confuse OCR by using a strange font or something like that.

      While there would certainly be ways to filter it, and it isn't good for people they're likely targeting (can we say AOL dialup?), the smarter filters get, the stranger the spams will become.

      Who knows, maybe we'll get a spam like "and so after breaking into this universities's smtp server, I sent you this e-mail, visit http://mypornosite.com..."

    9. Re:A weak point... by ceejayoz · · Score: 2

      The image thing is already being done, not very widespread just yet, but they're there.

      I imagine it'd be pretty easy to detect, though. You could, for example, block e-mails with text colors the same as the background. You could also block e-mails with images (I hate HTML e-mails anyways...).
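The "text color same as the background" check could look roughly like the Python sketch below. It is an assumption-heavy illustration, not a description of any existing filter: it only looks at old-style `<body bgcolor>` and `<font color>` attributes, ignores CSS entirely, and assumes a white background when none is given. The names `HiddenTextCheck` and `has_hidden_text` are made up.

```python
from html.parser import HTMLParser


def _norm(color):
    """Normalize 'FFFFFF' and '#ffffff' to the same form."""
    return color.strip().lstrip("#").lower()


class HiddenTextCheck(HTMLParser):
    """Collect the body background color and every <font color> value."""

    def __init__(self):
        super().__init__()
        self.background = "ffffff"  # assumed default when no bgcolor is given
        self.text_colors = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "body" and attrs.get("bgcolor"):
            self.background = _norm(attrs["bgcolor"])
        elif tag == "font" and attrs.get("color"):
            self.text_colors.add(_norm(attrs["color"]))


def has_hidden_text(html):
    """True if some text color matches the background color."""
    checker = HiddenTextCheck()
    checker.feed(html)
    checker.close()
    return checker.background in checker.text_colors


print(has_hidden_text('<body bgcolor="#FFFFFF"><font color="FFFFFF">hi</font></body>'))
```

A near-match such as FFFFFE on FFFFFF would slip past an exact comparison like this one, so a real check would probably need a color-distance threshold.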

    10. Re:A weak point... by Bradmont · · Score: 1

      I don't think this is true: Only the 15 most interesting (eg, spam probabilities farthest from 50%) are used -- .2 isn't likely to be in the top 15, so putting these words in wouldn't likely affect the spam probability at all.
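A small Python sketch of the selection step described above, keeping only the 15 probabilities farthest from 50%. The function name and the sample message are hypothetical; the numbers 15 and .2 are the ones mentioned in this thread.

```python
def most_interesting(probs, n=15):
    """Keep the n probabilities farthest from the neutral value 0.5."""
    return sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:n]


# Hypothetical message: 15 strongly spammy tokens padded with 10 unseen words.
message = [0.99] * 15 + [0.2] * 10
top = most_interesting(message)
print(0.2 in top)  # False
```

If a message has fewer than 15 strong tokens, the .2 words would fill the remaining slots, so the padding is only harmless when enough strong evidence is present.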

    11. Re:A weak point... by Apache · · Score: 1

      I think with the way the system works this kind of thing would not help spammers in the long run.
      Having an and in the email would be high probability positive indicators that would drive their probability toward .99999... Meaning that its probability factor would start to dwarf any number of 'good' words that could be stuck in the message.

      It would however drive the credibility of 'good' words down, which imho is the main danger..

  18. This is not news ... by dougmc · · Score: 5, Informative
    The statistical approach is not usually the first one people try when they write spam filters. Most hackers' first instinct is to try to write software that recognizes individual properties of spam.
    And he's correct. A few years ago, most spam filters did look for individual properties of spam.

    BUT, now, the best spam filters out there already use statistical properties. Spamassassin does this, for example, and it works *extremely* well. Before I found Spamassassin, I had a huge procmail recipe that used its scoring mechanism to do basically the same thing -- but of course spamassassin does it better, so I switched :)

    1. Re:This is not news ... by wsloand · · Score: 2, Insightful

      BUT, now, the best spam filters out there already use statistical properties. Spamassassin does this...

      Spamassassin (as he addressed) does not do this, it gives individual items a score. His method dynamically scores items based on the message. You could use his filter as a plugin for Spamassassin, but with the numbers he's talking about you wouldn't need anything other than his system.

      Bill

    2. Re:This is not news ... by Anonymous Coward · · Score: 0

      Well, spamassassin is good for me. BUT

      Currently in 2 weeks of use:
      1351 good, 650 spam, 6 false positives, and 21 missed spams.

      Of those 6 false positives, 3 were my dsl provider sending me invoices and reminders. It seems that the phrase "check or money order" is a 4.5 out of 5 indicator of spam.

      Spamassassin's main difference with this approach is that Spam assassin is set up with a standard spam profile, where this is taught from his mail spool.

      If spam assassin could be taught with the same database, it would probably perform almost identically.

    3. Re:This is not news ... by Tablizer · · Score: 2

      (* Currently in 2 weeks of use: 1351 good, 650 spam, 6 false positives, and 21 missed spams. *)

      Did you have to read all 650 spams to find the false positives?

      That is the problem; either you check everything anyhow, or are in constant paranoia of losing something important.

    4. Re:This is not news ... by DVega · · Score: 3, Informative

      Bayesian filters for spam have extensively been studied and compared in the last few years.

      Recently more filtering methods have been studied.

      It's good to see someone implementing these techniques

      --
      MOD THE CHILD UP!
    5. Re:This is not news ... by AnotherBlackHat · · Score: 2
      (* Currently in 2 weeks of use: 1351 good, 650 spam, 6 false positives, and 21 missed spams. *)

      Did you have to read all 650 spams to find the false positives?

      That is the problem; either you check everything anyhow, or are in constant paranoia of losing something important.


      Well, you could combine a content filter with a challenge system, and challenge anything you thought was spam.
      That's what Spamwolf does.

      -- Stop Spam Now, Ask Me How
    6. Re:This is not news ... by pjrc · · Score: 2
      If spam assassin could be taught with the same database, it would probably perform almost identically.

      Perhaps you missed the numerous times that he pointed out the advantage of the analysis automatically discovering the spam probability of ALL words, instead of a predetermined list shipped with the filter (as in spamassassin).

      That said, I use Spamassassin and it really works well... but I found that I had to set my threshold up to 7.5 and lower the points from some of the rules to avoid false positives from students in India and other countries who ask questions about the circuits and code on my website (many of their ISPs are in blacklists and they hit various rules for various reasons).

    7. Re:This is not news ... by Tablizer · · Score: 2

      (* Well, you could combine a content filter with a challenge system, and challenge anything you thought was spam. *)

      But challenges can be used to confirm/find valid target addresses.

    8. Re:This is not news ... by Anonymous Coward · · Score: 0

      Yes, but how many spammers are going to reply to your challenge? Zero! And that alone will make the challenge an effective tool.

  19. Major geek bias there... by Kaa · · Score: 5, Funny

    From the article:

    Based on my corpus, "sex" indicates a .97 probability of the containing email being a spam, whereas "sexy" indicates .99 probability. And Bayes' Rule, equally unambiguous, says that an email containing both words would, in the (unlikely) absence of any other evidence, have a 99.97% chance of being a spam.
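The 99.97% figure in the quote can be checked with a few lines of Python. This sketch assumes the two words are treated as independent evidence and combined with the usual two-class form of Bayes' Rule; the function name `combine` is made up for illustration.

```python
def combine(probs):
    """Combine per-word spam probabilities, treating the words as independent."""
    spam = 1.0
    ham = 1.0
    for p in probs:
        spam *= p
        ham *= 1.0 - p
    return spam / (spam + ham)


# "sex" at .97 and "sexy" at .99, as in the quoted passage.
print(round(combine([0.97, 0.99]) * 100, 2))  # 99.97
```

Worked by hand: (0.97 × 0.99) / (0.97 × 0.99 + 0.03 × 0.01) = 0.9603 / 0.9606, which is about 0.9997.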

    Hmm.... take an average adult geek and yes, an email mentioning sex or sexy can go to /dev/null immediately without as much as a second glance... :-)

    On the other hand if you run the statistics on email of an average horny teenager, the probabilities might get a bit different.

    --

    Kaa
    Kaa's Law: In any sufficiently large group of people most are idiots.
    1. Re:Major geek bias there... by tsg · · Score: 1

      Also from the article:

      But these numbers are not misleading, because that is the approach I'm advocating: filter each user's mail based on the spam and nonspam mail he receives.

      If "sex" and "sexy" are equally likely to show up in good mail as they are in spam, the filter will rate them closer to neutral and they won't be a deciding factor anymore.
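To illustrate the "closer to neutral" point, here is a deliberately simplified Python sketch that estimates a word's spam probability from one user's own mail. It is an assumption, not the article's exact formula (which applies further adjustments): it just compares how often the word appears per message in that user's good and bad mail. All names and counts are hypothetical.

```python
def word_spam_prob(good_count, bad_count, n_good, n_bad):
    """Estimate P(spam | word) from the word's frequency in each corpus."""
    good_freq = good_count / n_good
    bad_freq = bad_count / n_bad
    if good_freq + bad_freq == 0:
        return None  # word never seen in either corpus
    return bad_freq / (good_freq + bad_freq)


# A user who almost never sees "sex" in legitimate mail:
print(word_spam_prob(good_count=1, bad_count=97, n_good=1000, n_bad=1000))
# A user for whom it is equally common in both kinds of mail:
print(word_spam_prob(good_count=50, bad_count=50, n_good=1000, n_bad=1000))  # 0.5
```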

      --
      People's desire to believe they are right is much stronger than their desire to be right.
    2. Re:Major geek bias there... by LordNimon · · Score: 2, Funny

      But what about the (unlikely) situation of a geek getting a girlfriend? All of her steamy email will be flagged as spam, and then she'll get upset and dump him. Oh, the irony!

      --
      And the men who hold high places must be the ones who start
      To mold a new reality... closer to the heart
    3. Re:Major geek bias there... by Anonymous Coward · · Score: 0

      You, much like most of the other posters here, either didn't read the article or are too stupid to understand the algorithm.

      He can say sex all he wants since a steamy e-mail will undoubtedly include enough legitimate non-spam words to save it from being trashed.

    4. Re:Major geek bias there... by Juergen+Kreileder · · Score: 1
      Hmm.... take an average adult geek and yes, an email mentioning sex or sexy can go to /dev/null immediately without as much as a second glance... :-)
      Nah, they probably get mails about byte sex once in a while. They surely don't want to lose those.
    5. Re:Major geek bias there... by Kaa · · Score: 1

      He can say sex all he wants since a steamy e-mail will undoubtedly include enough legitimate non-spam words to save it from being trashed.

      ROTFLMAO....

      Don't get much steamy email, do you?

      --

      Kaa
      Kaa's Law: In any sufficiently large group of people most are idiots.
    6. Re:Major geek bias there... by tongue · · Score: 2

      On the other hand if you run the statistics on email of an average horny teenager, the probabilities might get a bit different.

      Most teens i know are still naive enough to call it "making love" :).

    7. Re:Major geek bias there... by c13v3rm0nk3y · · Score: 1

      I don't know sex you are sexing about. I find ways to work sexy words into written conversation all sex day. For example, I've just finished a sex method in some sexy classes. I commented it sexily with plenty of sex.

      ...and hilarity ensued, with sexy results...

      Oh, geez, my /. sense is tingling. Looks like I lost Karma.

      --
      -- clvrmnky
    8. Re:Major geek bias there... by wheany · · Score: 1

      Please post one example of a steamy email you have gotten. You can remove any personal information, if you want.

  20. There's only one true solution by Dimensio · · Score: 1, Interesting

    Spammers will try to work around filters, as they don't care that no one wants their crap. Further, filtering it doesn't solve the bandwidth situation, as the lines are still tied up with the bits running through the system until it hits the filter.

    There is only one good solution for spam: killing spammers. It should be done, and it should be done brutally and painfully. When known criminal spammers like Ralsky (who ran a child pornography site at one point) are brutally murdered, others may think twice before firing up "EmailBlaster 2002".

    1. Re:There's only one true solution by letxa2000 · · Score: 1
      Spammers will try to work around filters, as they don't care that no one wants their crap. Further, filtering it doesn't solve the bandwidth situation, as the lines are still tied up with the bits running through the system until it hits the filter.

      I've come up with a solution that works very, very well for me.

      I've modified my sendmail server. In addition to having a 1000+ block list (it's amazing how many spammers DO use a fixed block of IPs and/or send mail from spammersite.com, etc.), my modifications to the sendmail server are essentially filters. When I get spam I *DO* read it: Or more accurately, I skim for stuff that NO-ONE would say in a real email. Our mail server is small with less than around 20 users, so that makes it easier. But, if I see the words "JUICY PUSSY", bam, that's in the filter. When I see the words "PRESTIGIOUS NON-ACCREDITED" (oxymoron), bam, that's in the filter.

      So, just filters, right? Nothing new?

      Perhaps, but the difference is that my sendmail is now AGGRESSIVE. I don't receive their crap and then filter it. I filter it during the DATA phase of the SMTP connection. As *SOON* as any filterable text is recognized I immediately stop receiving data and issue a "550 No spam allowed here" and hang up. If they "call back" (they always do), I greet them with "550 No spam allowed here" and hang up before they even say HELO.
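The idea of rejecting during the DATA phase can be sketched in Python as below. This is a stand-in for illustration only, not the poster's sendmail modification and not sendmail's real interface: `receive_data` and `BLOCKED_PHRASES` are hypothetical names, the message lines are passed in as a plain list, and phrases split across two lines are not caught.

```python
BLOCKED_PHRASES = ["prestigious non-accredited"]  # hand-picked, as in the comment


def receive_data(lines, blocked=BLOCKED_PHRASES):
    """Read DATA lines one at a time and stop at the first blocked phrase.

    Returns (reply, lines_read) so the caller can see how early it hung up.
    """
    read = 0
    for line in lines:
        read += 1
        if line == ".":  # a lone dot ends the DATA section in SMTP
            break
        lowered = line.lower()
        if any(phrase in lowered for phrase in blocked):
            return "550 No spam allowed here", read
    return "250 OK", read


spam = ["Subject: degrees", "Get a PRESTIGIOUS NON-ACCREDITED diploma", "more text", "."]
print(receive_data(spam))  # ('550 No spam allowed here', 2)
```

The second element of the result shows that the remaining lines were never read, which is the bandwidth saving the comment describes.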

      It works VERY well. I've reduced my spam from 40+ per day to about 2 or 3. And those 2 or 3 promptly get added to my filters.

    2. Re:There's only one true solution by Anonymous Coward · · Score: 0

      Great, and you've successfully filtered out 37 of 40 spam emails (92.5%), while Paul Graham's approach filters out 995 out of 1000 spam emails (99.5% approach) which is a significant improvement. What's the difference?

      Paul's approach will allow only 5 spams to get through per 1000. Yours allows 75 spams. Not much of a difference, but significant enough if you hate reading spam at all.
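The "5 versus 75 per 1000" comparison follows directly from the two catch counts quoted above; a short Python check, with a made-up helper name:

```python
def leaked_per_thousand(caught, total):
    """Scale a filter's miss count to spams leaked per 1000 received."""
    return (total - caught) * 1000 / total


print(leaked_per_thousand(37, 40))     # 75.0
print(leaked_per_thousand(995, 1000))  # 5.0
```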

      Furthermore, your approach requires constant updating to catch more and more spam. Paul's approach takes care of this on its own, as it allows for the "learning" of new words automatically, based upon statistical inference.

    3. Re:There's only one true solution by letxa2000 · · Score: 1
      Great, and you've successfully filtered out 37 of 40 spam emails (92.5%), while Paul Graham's approach filters out 995 out of 1000 spam emails (99.5% approach) which is a significant improvement. What's the difference?

      I *hang up* on the spammer. He doesn't get to finish his attack. And then I usually end up wasting more of the spammer's time as he tries again and again, which increases his cost of spamming.

      Furthermore, your approach requires constant updating to catch more and more spam. Paul's approach takes care of this on its own, as it allows for the "learning" of new words automatically, based upon statistical inference.

      My approach allows me to hand-pick the words that will be triggers for spam rejection. His relies on statistics. I won't tell you what I think about statistics, but I trust myself to pick the "trigger words" much more than I trust statistics to decide what mails I will see and what I won't.

  21. Who doesn't get Lisp related porn? ;) by ssimpson · · Score: 1

    To quote the author: "I get a lot of email containing the word "Lisp", and (so far) no spam that does".

    He obviously isn't getting the "Lesbians with a Lisp" pr0n......


    --
    "Mary had a crypto key, she kept it in escrow, and everything that Mary said, the Feds were sure to know."
    1. Re:Who doesn't get Lisp related porn? ;) by Anonymous Coward · · Score: 0

      He obviously isn't getting the "Lesbians with a Lisp" pr0n

      Hey, do ya know where I can get some from?

    2. Re:Who doesn't get Lisp related porn? ;) by Hoi+Polloi · · Score: 2

      I'd say my experience learning LISP was obscene

      --
      It is by the juice of the coffee bean that thoughts acquire speed, the teeth acquire stains. The stains become a warning
  22. (spam) by Anonymous Coward · · Score: 0, Funny

    (insert (lisp joke (here)))

  23. A stupid suggestion by crea5e · · Score: 0


    send spam on how to get rid of spam.

    1. Re:A stupid suggestion by thunderbee · · Score: 1

      Don't laugh; I did get one...

      --
      In my opinion, Scientology is a cult you should avoid.
  24. This approach is very easy to defeat by Bazzargh · · Score: 5, Interesting

    Here's how: the spam should be written as a 'multipart/alternative' with an html version of the spam as the primary alternate. The text version contains an innocuous message intended to pass the statistical spam filter. The spam message is entirely contained as an /image/ within the html. The text of the spam becomes invisible to the filter but not to the poor schmuck who gets the email.

    I'm guessing here that the inclusion of a single image tag in the html is unlikely to trigger the spam filter, and supplying a wealth of evidence that the email is 'not' spam in the unseen alternate text will let the letter through.

    1. Re:This approach is very easy to defeat by topham · · Score: 2

      until it gets put into the 'spam' archive and processed where the word "alternate" is set at .99.

    2. Re:This approach is very easy to defeat by aggressivepedestrian · · Score: 1

      No problem: reject any message whose HTML part doesn't matchup with the text part.

    3. Re:This approach is very easy to defeat by Bazzargh · · Score: 2

      Yes, and you stop getting any mail with html in it?

      Some people might consider this a good thing :)

    4. Re:This approach is very easy to defeat by a_n_d_e_r_s · · Score: 1

      You mean dumping all HTML messages containing an image....

      Cause there is where the ad can be hidden.

      The rest can look like any other message.

      In essence fighting spam by looking at the content and using that to try and stop spam will never be 100% perfect.

      Thus there must be better ways to prevent spam.

      Like making sure no one can send them in the first place.

      --
      Just saying it like it are.
    5. Re:This approach is very easy to defeat by Dr_LHA · · Score: 2

      Actually it'll be very easy to defeat not because of flaws in the system - but because 99.9% of the idiots who use computers will never install spam filtering of this kind. The clued-up computer users who would install this kind of thing are not the type of people who would respond to spam anyway - so it doesn't affect spammers at all.

    6. Re:This approach is very easy to defeat by Idarubicin · · Score: 1
      I cleverly avoid this problem by reading all my mail in PINE.

      No images.

      No Outlook macros.

      No problem. And I really have to want to look at an attachment before I'll go to the trouble of opening it.

      --
      ~Idarubicin
    7. Re:This approach is very easy to defeat by pmz · · Score: 5, Insightful

      The spam message is entirely contained as an /image/ within the html.

      Thankfully, my e-mail client is set up to not render any HTML in an e-mail. I have yet to send back any information to a spammer via specially-coded image tags and am proud of it.

      HTML-based e-mail is fundamentally insecure and really should be used by no one (except those who simply don't care about privacy). Go here to learn just what a spammer--or anyone who sends you an HTML-based e-mail--can learn about you with just one "click" of your mouse.

      Yes, the spammer can learn what browser version you use, what OS you use, and even what city you live in (via the traceroute). An unusually savvy spammer could use this information to install spyware via known exploits in certain browsers and operating systems.

      In short, HTML e-mail is damn scary knowing that so many people use it not knowing just how much information they are giving away for free!

    8. Re:This approach is very easy to defeat by Dr.+Awktagon · · Score: 2

      Easy to solve, just remove all alternatives except text/plain. Since they are supposed to be the same content, this won't affect normal legit messages.

      That's what I do on my mail, if there are multiple alternatives, and one of them is text/plain, remove the others.
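A rough Python sketch of that rule, using the standard library `email` package. It is not the poster's actual setup: the function name `keep_plain_alternative` and the sample message are invented, and the function returns the text of the text/plain part instead of rewriting the message in place.

```python
import email
from email import policy


def keep_plain_alternative(raw):
    """For a multipart/alternative message with a text/plain part, return
    that part's text; otherwise return None and let the caller decide."""
    msg = email.message_from_string(raw, policy=policy.default)
    if msg.get_content_type() != "multipart/alternative":
        return None
    for part in msg.iter_parts():
        if part.get_content_type() == "text/plain":
            return part.get_content()
    return None


# Hypothetical two-part message for demonstration.
RAW = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="XX"\n'
    "\n"
    "--XX\n"
    'Content-Type: text/plain; charset="us-ascii"\n'
    "\n"
    "Hello in plain text.\n"
    "--XX\n"
    'Content-Type: text/html; charset="us-ascii"\n'
    "\n"
    "<p>Hello in <b>HTML</b>.</p>\n"
    "--XX--\n"
)
print(keep_plain_alternative(RAW))
```

Feeding only the returned text to a statistical filter would mean the filter never sees the HTML alternative at all, which is the opposite of the trick described at the top of this thread.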

      And I also defang img tags so I wouldn't see the image either. If I didn't use Mutt most of the time, anyway.

    9. Re:This approach is very easy to defeat by xipho · · Score: 1

      Arrg...it's a *combination* of probabilities, both good and bad, that's why it works. Just because one is .99 (for alternate) doesn't mean that that e-mail will be rejected outright! ALL probabilities from the good and bad hashes are taken into account.

      --

      only infrmatn esentil to understandn mst b tranmitd
    10. Re:This approach is very easy to defeat by Bazzargh · · Score: 2

      I'd like to see the algorithm you propose for that.

      I know in my own company, some of the automated emails have quite independent html and text versions, because simply downconverting the html would produce gibberish, and, for example, would not present links correctly (a text version of an anchor tag is usually the text, plus something like 'click on this link', plus the url. This doesn't match the html very well.). Ignoring this problem, any attempt at automated checking of the differences would have to deal with user-agent differences and would be a bit of a mess.

      Secondly, there's no problem for a spammer to include the original text, but render it in such a way as to be invisible (e.g. in the background colour) below the spam image.

      I'm inclined to agree with other posters that whitelists are more of an answer.

    11. Re:This approach is very easy to defeat by Jeremi · · Score: 2
      Actually it'll be very easy to defeat not because of flaws in the system - but because 99.9% of the idiots who use computers will never install spam filtering of this kind.


      That doesn't matter to me -- what matters to me is that I won't have to slog through a bucket of spam every morning. And in any case, you're wrong -- when a filter like this comes standard with MicrosoftOutlook or AOL or whatnot, 99.99% of the idiots will eventually have it.

      --


      I don't care if it's 90,000 hectares. That lake was not my doing.
    12. Re:This approach is very easy to defeat by batessr · · Score: 1

      This filter will work on the technique you describe. After the user files some of the messages as spam, the words multipart, alternative, and img would all be added to the bad corpus. And since almost no non-spam will be multipart/alternative with a text and HTML part, the filter should adapt extremely well.

    13. Re:This approach is very easy to defeat by gwernol · · Score: 3, Insightful

      the spam should be written as a 'multipart/alternative' with an html version of the spam as the primary alternate. The text version contains an innocuous message intended to pass the statistical spam filter. The spam message is entirely contained as an /image/ within the html.

      Yes this would make it more difficult to spot, but notice that he examines the headers as well as the content of the spam. Looking at Mr. Graham's examples a lot of the key words that his filter finds are parts of the header, so you have a good chance that the probabilistic filters can still rule these out.

      The second point, also made in Paul's article, is that part of what you want to do is push up the costs and difficulty of sending spam. Pushing out a million HTML images is much more costly to the spammer than sending out a million text messages. The more costs we can force spammers to bear the less economical it will become to spam, thus reducing the amount of spam.

      --
      Sailing over the event horizon
    14. Re:This approach is very easy to defeat by mariube · · Score: 1
      I'm guessing here that the inclusion of a single image tag in the html is unlikely to trigger the spam filter, and supplying a wealth of evidence that the email is 'not' spam in the unseen alternate text will let the letter through.
      That won't be a problem for me, since I never read HTML mails in any event.
    15. Re:This approach is very easy to defeat by Anonymous Coward · · Score: 0

      And by posting information on Slashdot that everyone knows and that the public doesn't care about, you are surely providing a sound solution to the problem of spam.

    16. Re:This approach is very easy to defeat by madstork2000 · · Score: 1

      It seems everyone is forgetting that his approach scores not only the body of the message, but the headers as well. So even if the SPAMMER tries to get their crap through as an image, they would still have the non-trivial task of making their headers look legit to the filter. While it may not be impossible to defeat, it certainly is not trivial.

      Besides, the best approach for stopping SPAM will likely be a cocktail of approaches, including whitelists, black hole lists, content analysis filtering, etc.

      The root of the problem is that the average everyday user is too lazy to do anything about SPAM. What we need are effective tools at the server level that can be run efficiently and safely (0 false positives) so that the masses need not be bothered. The most effective way to curb SPAM is getting buy-in at the ISP to preemptively stop SPAM from getting to the users, because it is the clueless users that click through to p0rn sites and open viruses and otherwise do dumb things. Without a large market of novice/lazy users to see their message, SPAM will become ineffective and be replaced by some other crap.

      -MS2k

    17. Re:This approach is very easy to defeat by pmz · · Score: 2

      And by posting information on Slashdot that everyone knows and that the public doesn't care about, you are surely providing a sound solution to the problem of spam.

      Not everyone knows, and not everyone doesn't care. Awareness is an important part of dealing with spam, and posting to Slashdot can be a good starting place to popularize information, such as that about HTML e-mail.

      By not rendering HTML e-mail, I am doing a small part to deter a spammer's success. They use the image tags to build a database of "who reads what", and my denying them that information puts a very small dent in their efforts. Public awareness can make that dent bigger.

      I would like to see statistics about how many unique visitors there are each day to Slashdot. I imagine there are many many thousands (millions?) of readers all over the Earth.

      On top of that, there are at least a few influential people who read, or are at least aware of, Slashdot, and there are many readers from within many big corporations.

      If I say something truly worthwhile, a few moderators out there will recognize that. If I post slanderous crap, I will be treated accordingly. While there is some corporate-sponsored posting and moderation on Slashdot, the noise introduced by it is still too small to drown out honest voices.

      So, if posting something to Slashdot is not a good way to say something to a broad audience, what other forum is better?

    18. Re:This approach is very easy to defeat by gokubi · · Score: 1

      SPAM is all about click-through, so they'll at least need a link. If the message is 100% innocuous, the link to http://www.biggerc0ckin30days.com will tip off the filter.

      --
      I'm much funnier now that I'm a subscriber.
    19. Re:This approach is very easy to defeat by shayne321 · · Score: 2

      In the current mozilla nightlies and in the upcoming 1.1 release, there is a preference called "Do not load third party images in HTML e-mail" (or something similar) which addresses 90% of the problems mentioned above. Plus mozilla has the ability to disable java/javascript in e-mail so that probably covers the other 9.99%.

      Just because your e-mail client is braindead doesn't mean everyone has to be afraid of HTML e-mail. Unfortunately as part of my job I have no choice but to receive HTML e-mail from 60% or more of the people I regularly correspond with. I could be indignant and refuse it, or simply use a client which allows me to render it on my terms.

      As an alternative, if you use spamassassin, there's an option called "Defang mime" which changes the mime-type on any e-mail it has identified as spam to text/plain. Of course the downside is when viewing the e-mail all you see is a ton of nearly-unreadable HTML/CSS code.

      Shayne

      --
      Today I didn't even have to use my AK; I got to say it was a good day -- Icecube
    20. Re:This approach is very easy to defeat by pmz · · Score: 2

      Just because your e-mail client is braindead doesn't mean everyone has to be afraid of HTML e-mail.

      I wasn't referring to my e-mail client. It's VM within Emacs and works very nicely.

      I could be indignant and refuse it, or simply use a client which allows me to render it on my terms.

      As I have done. My rant about HTML e-mail is mainly pointed at the shortsightedness of software publishers who put HTML rendering into their e-mail clients. HTML rendering is usually turned on by default, which is the original scene of the crime. This is why I have to write posts whining about HTML e-mail in the first place; if the software were to come better configured, this whole conversation would have never happened.

    21. Re:This approach is very easy to defeat by tongue · · Score: 2

      I think if plugins were available for various email clients that users would use them gladly. I'd certainly use one; and they shouldn't be too terribly hard to write for the major email clients on linux, although I have to admit i have absolutely no idea how to write a plugin for outlook. Although if you asked me, spam and viruses should be regarded as a punishment for using outlook, so i'm not sure i'd want there to be a plugin available for that :).

      as far as defeating it with an image, that's kind of dumb. first of all, the image tag would be regarded as a spam indicator, as would, to an extent at least, the fact that there's an html attachment. additionally the url or image name would indicate a spam factor as well. it's not a matter of what words are readable, even the lack of words would be a statistical feature.

    22. Re:This approach is very easy to defeat by j7953 · · Score: 2

      It would be fairly simple to tune his software so that it considers only the header and that part of the email that is normally displayed, i.e. the HTML part (even if your mail software is configured differently, that of most people isn't, so the HTML part is where you should calculate the probabilities). That would be a one-time improvement, without any need to continuously adapt the software to the spammers.

      The HTML part should have a fairly high probability, given that it contains things like "text/html" (he probably should consider a slash as part of a token), "img" etc. that normally don't appear in valid email.
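
      A minimal sketch of the tokenizing idea above, in Python. The character class is an assumption for illustration (alphanumerics, dashes, apostrophes and dollar signs, plus the slash suggested here); it is not taken from the article's actual code.

```python
import re

# Assumed token alphabet: letters, digits, dash, apostrophe, dollar sign,
# plus "/" so that MIME types such as "text/html" survive as one token.
TOKEN = re.compile(r"[A-Za-z0-9$'/-]+")

def tokenize(text):
    """Split header or body text into lowercase tokens for the filter."""
    return [tok.lower() for tok in TOKEN.findall(text)]
```

      With the slash included, a header like "Content-Type: text/html" yields the two tokens "content-type" and "text/html", each of which can then get its own spam probability.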

      --
      Sig (appended to the end of comments I post, 54 chars)
    23. Re:This approach is very easy to defeat by AnotherBlackHat · · Score: 2
      Here's how: the spam should be written as a 'multipart/alternative' with an html version of the spam as the primary alternate. The text version contains an innocuous message intended to pass the statistical spam filter. The spam message is entirely contained as an /image/ within the html. The text of the spam becomes invisible to the reader but not to the poor schmuck who gets the email.

      I'm guessing here that the inclusion of a single image tag in the html is unlikely to trigger the spam filter, and supplying a wealth of evidence that the email is 'not' spam in the unseen alternate text will let the letter through.


      What you describe might beat a particular implementation, but I don't think it defeats the approach.

      Just adjust the content filter to check the part of the message that your email client actually displays.
      If your client doesn't display the innocuous part,
      then the innocuous part won't be part of the filtering process either.

      A nastier hack would be to tack the "innocuous" message (or several innocuous messages) to the end of the spam.

      This too can be corrected for, but the approach would need to be improved to consider how humans read things, which is non-trivial.
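
      The "check the part your client actually displays" adjustment could look roughly like this. It is only a sketch using Python's standard email package; the function names and the assumption that the client prefers the HTML alternative are mine, not part of the article.

```python
from email import message_from_string
from email.message import Message

def _decode(part):
    """Decode one leaf MIME part to text, tolerating odd charsets."""
    payload = part.get_payload(decode=True)
    if payload is None:
        return ""
    charset = part.get_content_charset() or "utf-8"
    try:
        return payload.decode(charset, errors="replace")
    except LookupError:  # unknown charset label
        return payload.decode("utf-8", errors="replace")

def displayed_part(msg: Message, prefer_html: bool = True) -> str:
    """Return the text of the alternative a typical client would render.

    Assumption: the client shows text/html when present (prefer_html=True)
    and falls back to the first text/* part otherwise.
    """
    wanted = "text/html" if prefer_html else "text/plain"
    fallback = None
    for part in msg.walk():
        if part.is_multipart():
            continue
        ctype = part.get_content_type()
        if ctype == wanted:
            return _decode(part)
        if fallback is None and ctype.startswith("text/"):
            fallback = part
    return _decode(fallback) if fallback is not None else ""
```

      Only the string returned by displayed_part (plus the headers) would be handed to the statistical filter, so an innocuous alternate that is never shown contributes nothing.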

      Stop Spam Now, Ask Me How
    24. Re:This approach is very easy to defeat by GloomyTrousers · · Score: 1

      See recently-opened bug 163188 in Bugzilla. Eventually, this will be implemented, and when Mozilla is rolled out to AOLers - bingo!

  25. I wonder... by MartinG · · Score: 2

    what his spam filter would make of his article?

    --
    -- MartinG To mail me: echo kewyjlcxyzvjfxbqwh | tr bcefhjklqvwxyz .@adgimnoprstu
    1. Re:I wonder... by WolfWithoutAClause · · Score: 2
      Why don't you mail it to him and find out? If you don't get a reply he'll have to add one to his false positives ;-)

      Looks like he's at "pg@paulgraham.com"; have fun.

      --

      -WolfWithoutAClause

      "Gravity is only a theory, not a fact!"
    2. Re:I wonder... by kwerle · · Score: 2

      I imagine it would see 'lisp' a bunch of times and let it through.

  26. When I said... by Anonymous Coward · · Score: 1, Insightful

    When I said market using spam, that includes the company that hires someone who spams.

    1. Re:When I said... by Lord+Apathy · · Score: 1


      Exactly! This is what I've been saying for years now. Not only do you sue the spammer but you make the fucker whose 800 number or address is on the spam liable too. Once these companies start taking hits for their spam they will stop using it, and those companies that claim they didn't know they were using a spammer, well tough. Ignorance is no excuse.

      --

      Supporting World Peace Through Nuclear Pacification

  27. Comment removed by account_deleted · · Score: 5, Interesting

    Comment removed based on user account deletion

  28. "delete-as-spam button" by xipho · · Score: 3, Interesting

    This is the brilliant part, and crucial to the endeavour, and so easy to implement!

    It appears all the nay-sayers here haven't even read the article (no surprise). With as little code as needed to implement this it should be a must in the next mozilla mail/pine etc. code base.

    --

    only infrmatn esentil to understandn mst b tranmitd
    1. Re:"delete-as-spam button" by tsg · · Score: 1

      You could also do it this way without having to modify any code, and it would work for most mail readers.

      Set up two mail aliases that process email through the filter. One for spam, one for good mail. Keep separate folders for good mail and spam (I do anyway). Before you empty them, forward the contents to the appropriate alias. Some mail readers may be able to automate this for you.

      --
      People's desire to believe they are right is much stronger than their desire to be right.
    2. Re:"delete-as-spam button" by mattmunz · · Score: 2, Interesting

      Not only is this a great idea, it goes way beyond spam. How about "delete-as-off-topic" or "delete-as-rtfm" buttons specific to a given mailing list? The same algorithm could be used for these cases.

      Take it a step further to organize your entire mailbox. How about "categorize-as-tech-support" or "categorize-as-jboss-related". Many of us already push our email around into folders for the purpose of organization. I can't see why this algorithm can't be used to assist that process as well.

      The power of this system is that it is feedback-based. The software uses known science (statistics) to mold itself to your own preferences, by paying attention to the input that you have to make to use the application in the first place.

      Why do you think there are businesses whose sole function is to track and to report on the input people make to the various machines in their lives (computer/websites/tv/etc.)? This information is powerful and we need more examples of the ethical use of it. Note that his system is completely "individual" and doesn't require sharing user input with others through a central server.

      I haven't read the entire article, but I really think this is a great idea.

    3. Re:"delete-as-spam button" by cburley · · Score: 1
      How about "delete-as-off-topic" or "delete-as-rtfm" buttons specific to a given mailing list?

      I want a "rewrite-in-FORTRAN" button on LKML.

      --
      Practice random senselessness and act kind of beautiful.
  29. Ban this IP, it's just a CGI proxy ;) by Anonymous Coward · · Score: 0

    Due to excessive bad posting from this IP or Subnet, comment posting has temporarily been disabled. If it's you, consider this a chance to sit in the timeout corner. If it's someone else, this is a chance to hunt them down. If you think this is unfair, please email jamie@slashdot.org with your MD5'd IPID and SubnetID, which are "c9e9c670161ecc03213cef93dc3ea53a" and "167245123af6b03ea65389334162ec02".

  30. Another way to stop Spam by mr.nicholas · · Score: 5, Interesting

    Having had the same email address since '93, I receive close to 1000 spams per day to my personal account (which is also aliased from root/postmaster/webmaster).

    I've tried everything under the planet to reduce the amount that I see in my mailbox; SpamAssassin being one of the best so far. But even that lets through quite a bit (around 10%).

    So I decided to attack it from a different angle. I wrote a series of perl-scripts that I plunked into my procmail file.

    The scripts work by checking the address of the sender each time a message is received. That address is looked up in a database. If it exists in the db, and it's marked as "authorized", it's just passed into my mailbox.

    If it's marked as denied, /dev/null.

    If it's never been seen before, an authentication message is sent to the sender asking them to reply to it to authorize themselves. If that authmessage is bounced back, a db entry is made as "denied".

    If it's replied to in a normal fashion, that email is marked as "authorized" and any queued up mail from that person is pushed out.

    The concept is that spam will almost never have a valid reply-to; so it will bounce and be marked as denied.

    Even if the email doesn't bounce, no spammer alive will reply to it; so after 30 days, that email is marked as "denied".
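
    The flow described above can be sketched as a small state machine. This is an illustration in Python rather than the actual perl/procmail scripts; the class name, method names and in-memory dictionaries are assumptions standing in for the real database.

```python
import time

AUTHORIZED, DENIED, PENDING = "authorized", "denied", "pending"
PENDING_TTL = 30 * 24 * 3600  # 30 days without a reply means "denied"

class SenderDB:
    """In-memory stand-in for the sender database (hypothetical)."""

    def __init__(self):
        self.status = {}  # address -> (state, timestamp of last change)
        self.queue = {}   # address -> messages held until authorization

    def on_message(self, sender, message, now=None):
        """Return the action for one incoming message."""
        now = time.time() if now is None else now
        state, since = self.status.get(sender, (None, None))
        if state == PENDING and now - since > PENDING_TTL:
            # The challenge was never answered: treat the sender as denied.
            state = DENIED
            self.status[sender] = (DENIED, now)
            self.queue.pop(sender, None)
        if state == AUTHORIZED:
            return "deliver"
        if state == DENIED:
            return "discard"  # the /dev/null case
        self.queue.setdefault(sender, []).append(message)
        if state is None:
            self.status[sender] = (PENDING, now)
            return "challenge"  # send the authentication message once
        return "hold"  # challenge already outstanding

    def on_challenge_reply(self, sender):
        """Sender answered normally: authorize and release queued mail."""
        self.status[sender] = (AUTHORIZED, time.time())
        return self.queue.pop(sender, [])

    def on_challenge_bounce(self, sender):
        """The challenge bounced: mark the address as denied."""
        self.status[sender] = (DENIED, time.time())
        self.queue.pop(sender, None)
```

    A delivery script would call on_message for every arriving mail and act on the returned string; replies and bounces to the challenge feed the other two methods.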

    Since I've set this up (for myself and my 10-year-old son who receives porn in his box (grrr!!!!)), it has worked flawlessly. The "real" email is unharmed, while the spam is stopped.

    Oh, and I have a web-based control page so that users can manually add email addresses (for lists and such).

    This week, for the first time in YEARS, I don't have spam in my mailbox anymore.

    Hurray!

    Now if I could only stop that damned dictionary-based scanning of my servers, I'd be set. Thank the gods that I don't have metered service.

    1. Re:Another way to stop Spam by Anonymous Coward · · Score: 0, Interesting

      Huh, actually 5 minutes editing my Outlook mail rules achieved exactly the same thing and I've been nearly spam free for years even though I receive at least 300 a day from my domain. No scripts, no voodoo. Just simple point and click. There's the difference between closed source and open source. Closed source you use, open source you code.

    2. Re:Another way to stop Spam by Mr_Silver · · Score: 2
      The scripts work by checking the address of the sender each time a message is received. That address is looked up in a database. If it exists in the db, and it's marked as "authorized", it's just passed into my mailbox.

      Whilst this is a very good and effective method, for a person on the end of this it's an absolute pain in the butt to go through this palaver just so you can send someone one email, get one response and then never communicate with them again.

      I'm not knocking your solution, but personally I'd rather something that didn't inconvenience the legitimate people that do want to contact me.

      (plus, this sort of thing looks rather poor corporate-wise)

      --
      Avantslash - View Slashdot cleanly on your mobile phone.
    3. Re:Another way to stop Spam by Anonymous Coward · · Score: 0

      Yeah, the webmaster at php.net uses the same idea.

    4. Re:Another way to stop Spam by Brendan+Byrd · · Score: 3, Informative

      SpamAssassin already has this. It's called automatic-whitelisting.

    5. Re:Another way to stop Spam by xipho · · Score: 1

      How do you figure inconvenience? Only one confirming e-mail is all it takes and that person is in your db...a small price to pay!

      --

      only infrmatn esentil to understandn mst b tranmitd
    6. Re:Another way to stop Spam by NineNine · · Score: 0, Offtopic

      Ah, if I had some mod points for you....
      Very true, very true.

    7. Re:Another way to stop Spam by infinitey · · Score: 1

      Since I've set this up (for myself and my 10-year-old son who receives porn in his box (grrr!!!!)), it has worked flawlessly.

      Why you do this? He just wanted some free passwords.

    8. Re:Another way to stop Spam by Anonymous Coward · · Score: 0

      This thinking is the reason that Linux will continue to fail on the desktop.

    9. Re:Another way to stop Spam by Tablizer · · Score: 2

      (* If it's never been seen before, an authentication message is sent to the sender asking them to reply to it to authorize themselves. *)

      Like somebody said, this is inconvenient for the sender. What if you send out resumes (a common task these days) and don't want to inconvenience the person who will potentially hire you?

      Perhaps because of just such annoyance-induced unemployment you had 10 years to work on your solution? :-)

    10. Re:Another way to stop Spam by einstein · · Score: 3, Interesting

      that sounds like a great system... any plans to release the code? I'd love to set that up at home.
      ---

    11. Re:Another way to stop Spam by koreth · · Score: 2
      I do almost the same thing, with one tweak: the procmail script that decides what to do with the mail does some keyword/pattern scanning as well as running Vipul's Razor, and it will only push the mail over to my "require a reply" script if it looks like it might be spam.

      The advantage is that I don't have to worry nearly as much about false positives on my spam filters; this system makes them much less expensive than they'd be if I simply tossed all matching mail to /dev/null. It successfully filters out nearly 100% of my spam (75-100 spams a day, of which one or two a week get through.)

      I've had this system in place for a year or so, and in that time, maybe 1 out of 25 legitimate personal messages from unknown senders has required a validation E-mail, so it's not a major inconvenience for a huge number of people.

      Not that I think it's much of an inconvenience anyway, and I have yet to get one complaint. In fact, Slashdot is the only place I've heard anyone complain about it. The comments I've gotten from actual correspondents have been more along the lines of, "I get too much spam too! How do I set up the same thing on my mailbox?"

    12. Re:Another way to stop Spam by LX.onesizebigger · · Score: 5, Interesting
      Even if the email doesn't bounce, no spammer alive will reply to it; so after 30 days, that email is marked as "denied".

      I've seen similar solutions before, and they are all nice and dandy except for one application: when communicating with businesses. What happens when you order a Widget from Acme, Inc. and Acme sends you your confirmation by e-mail? Your script bounces a question, and Acme's mail server either bounces back at you, making it look like it was spam in the first place, or simply doesn't respond at all.

      The system implies that anything not sent by a human being is spam. This is not necessarily the case today. A lot of today's e-mail communications are auto-generated.

      To truly combat spam, it must be fought at the source. One step closer to that would be to integrate a standardized response to the type of message you send out in mail protocols. The problem with this is that all Joe Spammer would have to do is to point his reply-to to a valid business site.

      This brings us to the next point. Forged headers are easy to detect by software and have few (although it would be wrong to say no) legitimate applications. I cannot personally understand why it is not standard operation for mail servers to recognize and bounce messages with forged headers. Sure, it would increase processing load, but if done by all servers, more spam would be stopped closer to the source, meaning less spam to process for all.

      Or am I pulling a thinko here? Anybody?

      --
      I for one welcome our new SCOviet Russian overlords to whom all our base are belong.
    13. Re:Another way to stop Spam by bugbear · · Score: 1

      There are two problems with this approach.

      You say no spammer will reply to mail asking for authentication, but if your solution were adopted as part of a widely-used piece of software, instead of something you cooked up for yourself, spammers would be quick to automate answering such requests.

      Probably quicker than some human users would be, which brings us to the second problem: getting authentication requests is so annoying that some senders won't bother. In that case your filtering solution is effectively generating false positives, which is the big no-no.

    14. Re:Another way to stop Spam by kelleher · · Score: 1

      I don't think the resume example is valid. I have a separate "resume only" email address because headhunters are only slightly more palatable than spammers. When I'm looking, I read the account. When I'm not it all just goes away...

    15. Re:Another way to stop Spam by Anonymous Coward · · Score: 0

      Where the heck is that?? I'm using Outlook 2000 and I can't find any sign of such an auto-whitelisting function.

    16. Re:Another way to stop Spam by FattMattP · · Score: 4, Informative

      What you've described is exactly what TMDA does.

      --
      Prevent email address forgery. Publish SPF records for y
    17. Re:Another way to stop Spam by Tablizer · · Score: 2

      (* I don't think the resume example is valid. I have a separate "resume only" email address because headhunters are only slightly more palatable than spammers. When I'm looking, I read the account. When I'm not it all just goes away... *)

      True, but if one does a lot of contracting or has been out for a long time, then there is not much difference except you have to check 2 places instead of one.

    18. Re:Another way to stop Spam by 21mhz · · Score: 2, Informative

      It already exists.

      --
      My exception safety is -fno-exceptions.
    19. Re:Another way to stop Spam by miket · · Score: 1

      I would be interested in knowing how you used Outlook rules to filter out your spam. If you could elaborate I would be grateful.

      --
      Imagination is more important than knowledge. --Albert Einstein
    20. Re:Another way to stop Spam by soybean · · Score: 1

      But how well does this deal with, say, subscribing to an email list that may not bounce your auth request but in fact send it to the whole list?

    21. Re:Another way to stop Spam by Fizyx · · Score: 1

      One problem with bouncing messages to spammers is that you break the rule of "never reply to spam". The bounce causes some spammers to flag your entry as a verified address and thus more valuable (they don't know and don't care that you don't read the message). That gets you on more lists, sucking up more bandwidth.

      I have a spammed address that goes back to '93, and I 'only' get a hundred a day, not a thousand -- maybe because I haven't done much bouncing.

    22. Re:Another way to stop Spam by japhmi · · Score: 1

      Could you be helpful enough to either post or send a link to your perl code? Thanks!

      --
      "Giving money and power to government is like giving whiskey and car keys to teenage boys" P. J. O'Rourke
    23. Re:Another way to stop Spam by Tim+Macinta · · Score: 3, Interesting
      I've seen similar solutions before, and they are all nice and dandy except for one application: when communicating with businesses. What happens when you order a Widget from Acme, Inc. and Acme sends you your confirmation by e-mail? Your script bounces a question, and Acme's mail server either bounces back at you, making it look like it was spam in the first place, or simply doesn't respond at all.

      The system implies that anything not sent by a human being is spam. This is not necessarily the case today. A lot of today's e-mail communications are auto-generated.

      Hmmmm... how about if you were to keep a separate address space for emails you expect to be replied to from businesses? I'll use myself as an example. I could use my main address, twm@alum.mit.edu, to receive personal email and block spam using the technique described by the original poster. When I go to order something online, I could make up addresses at my domain twmacinta.com (for example, "spamproof+amazon8291@twmacinta.com") which could be proactively added to a whitelist before I gave them. I actually worked on a system to do the second half of this solution for awhile (the whitelist aliasing) for users without their own domains, but the one drawback to the system is that it wouldn't stop spam on existing addresses. The original poster's solution sounds like it would make a very nice complement.
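
      A rough sketch of the "make up an address and whitelist it before handing it out" half, in Python. The class, the four-digit suffix and the example.com domain are hypothetical and are not the system I actually worked on.

```python
import secrets

class AliasWhitelist:
    """Per-vendor addresses that are whitelisted the moment they are created."""

    def __init__(self, user, domain):
        self.user = user
        self.domain = domain
        self.aliases = {}  # lowercase alias address -> vendor label

    def new_alias(self, vendor):
        """Create e.g. user+amazon8291@domain and whitelist it immediately."""
        tag = f"{vendor}{secrets.randbelow(10000):04d}"
        address = f"{self.user}+{tag}@{self.domain}".lower()
        self.aliases[address] = vendor
        return address

    def accepts(self, recipient):
        """True if mail to this recipient address should skip the challenge."""
        return recipient.lower() in self.aliases

    def revoke(self, address):
        """Drop an alias once it starts attracting spam."""
        self.aliases.pop(address.lower(), None)
```

      Mail addressed to a known alias would bypass the challenge step entirely, and the vendor label shows who leaked the address if spam ever arrives on it.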

    24. Re:Another way to stop Spam by ArcadeNut · · Score: 2
      That's the small price we pay to eliminate SPAM. You can thank the SPAMMERS for this.

      I took a little more drastic approach to this problem. I now have an E-Mail Form on my web site that does the emailing. My Main Email address is never posted ANYWHERE, not even in the HTML source.

      I then reply to the email and then they get my email address and they no longer need to go through the form.

      I now have ZERO SPAM.

      --
      Visit the Arcade Restoration Workshop @ http://www.arcaderestoration.com
    25. Re:Another way to stop Spam by bedessen · · Score: 2

      If it's never been seen before, an authentication message is sent to the sender asking them to reply to it to authorize themselves. If that authmessage is bounced back, a db entry is made as "denied".

      It's unfortunate that you do this, since almost all spam emails have falsified 'From:' lines. Most of the time it's probably a nonexistent account on a large provider (@hotmail.com, @yahoo.com, etc.), but sometimes the spammers put a legitimate email address (that of one of their hated foes) on their 'From:' lines, since they know that innocent person will receive hundreds if not thousands of nasty "don't spam me you bastard" replies. I realize that this step is crucial to your method, since the occasional legit correspondent would need to be notified that their mail hadn't gone through and they need to whitelist themselves. But if everyone did what you did, then any poor sap who the spammers dislike would get flooded with thousands of "Please respond if you wish to communicate" autoreplies when a spammer used their address in one of their emails.

      It's another example of something that doesn't hurt the spammers one bit (in that they never supply a valid return address) and costs ordinary regular people time and/or money.

    26. Re:Another way to stop Spam by rgmoore · · Score: 2
      I've seen similar solutions before, and they are all nice and dandy except for one application: when communicating with businesses.

      There's one more aspect to this that both of you seem to have missed; a whitelist assumes that all mail from a specific user is either good or bad. If I buy a part from Acme Widgets, I do want to get things from them like order confirmations and shipping notices. That doesn't mean that I've given them blanket permission to send me ads for their products for the rest of time. Similarly, I might very well want to receive personal email from my relatives but not want to get Aunt Suzy's joke-a-day messages. To eliminate those kinds of messages some kind of content based filtering is necessary.

      --

      There's no point in questioning authority if you aren't going to listen to the answers.

    27. Re:Another way to stop Spam by Anonymous Coward · · Score: 0

      I think it can be said that effectiveness is inversely related to convenience.

      effectiveness = ( 1 / convenience )

      Same can be said about security versus convenience.

      Anyway, for those of you who think this approach is ineffective for resume sending, use a separate email address for that purpose, especially one that you don't use for any other purpose. For someone who has to receive all the root@ and postmaster@ aliases for a domain, then this approach is invaluable.

      The point is, there are many approaches that can be effective for your particular situation. There's no single cure-all for spam. So don't blast one person's solution just because it doesn't work for you. Instead, find one that does.

    28. Re:Another way to stop Spam by Anonymous Coward · · Score: 0

      That's a useful technique, and I agree that by not replying you reduce spam by not allowing the spammers to know you have a valid email address.

      Unfortunately, spammers are putting 1x1 pixels into HTML emails now, which contain information about the email address.

      http://www.x1y2.com/getimage?thisimage.jpg&email=soandso@domain.com

      As soon as you view the message in your email reader, their web logs track who you are and record the email as valid.

      Solution: don't read email as HTML.
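
      For anyone who has to read HTML mail anyway, a filter could at least flag image URLs that carry the recipient's address. A minimal sketch with Python's standard HTML parser; the function and class names are made up for illustration, and it only catches addresses embedded in plain or percent-encoded form.

```python
from html.parser import HTMLParser
from urllib.parse import unquote

class ImageCollector(HTMLParser):
    """Collect the src attribute of every <img> tag in an HTML body."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)

def tracking_images(html, recipient):
    """Return image URLs that embed the recipient's address."""
    collector = ImageCollector()
    collector.feed(html)
    needle = recipient.lower()
    return [src for src in collector.sources
            if needle in unquote(src).lower()]
```

      A spammer can of course encode the address as an opaque ID instead, which this check will not see, so not fetching remote images at all remains the safer default.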

    29. Re:Another way to stop Spam by osolemirnix · · Score: 2
      I've seen similar solutions before, and they are all nice and dandy except for one application: when communicating with businesses. What happens when you order a Widget from Acme, Inc. and Acme sends you your confirmation by e-mail? Your script bounces a question, and Acme's mail server either bounces back at you, making it look like it was spam in the first place, or simply doesn't respond at all.

      The system implies that anything not sent by a human being is spam. This is not necessarily the case today. A lot of today's e-mail communications are auto-generated.

      The Tagged Message Delivery Agent provides solutions to this problem and more. Basically it's a whitelisting mechanism, if the sender is unknown, the mail is "parked", a confirm request is sent and the mail is delivered upon (human) confirmation.

      This leaves problems with auto-generated mails as you describe, but TMDA has more options:
      1. you can use a mail address that is only valid for a certain amount of time
      2. you can use a mail address that is only valid for mail from a specific sender domain/mail address

      So to order something you'd use one of the above and thus avoid sending out a confirmation request. At the same time you can make sure that an address is valid only for the relationship you intended it for, e.g. if they use it after a transaction is over or sell it to address harvesters it will not work.
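
      To give an idea of how such self-validating addresses can work, here is a sketch in Python. The address layout, the secret and the function names are invented for illustration and are not TMDA's actual format; the only point is that an HMAC lets the receiving side verify an address without storing it anywhere.

```python
import hashlib
import hmac

SECRET = b"change-me"  # hypothetical per-user secret, never published

def _mac(payload):
    """Short keyed checksum over the part of the address being protected."""
    return hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()[:10]

def dated_address(user, domain, expires_at):
    """Address that stops working after the Unix time expires_at."""
    return f"{user}-dated-{expires_at}.{_mac(f'dated:{expires_at}')}@{domain}"

def check_dated(address, now):
    """True if the address is untampered and not yet expired."""
    local = address.split("@", 1)[0]
    try:
        _, rest = local.split("-dated-", 1)
        stamp, mac = rest.split(".", 1)
        expires_at = int(stamp)
    except ValueError:
        return False  # not a dated address at all
    expected = _mac(f"dated:{expires_at}")
    return hmac.compare_digest(mac.encode(), expected.encode()) and now <= expires_at

def sender_address(user, domain, sender):
    """Address that only accepts mail from one specific sender."""
    return f"{user}-sender-{_mac('sender:' + sender.lower())}@{domain}"

def check_sender(address, envelope_sender):
    """True if the address was generated for this envelope sender."""
    local = address.split("@", 1)[0]
    expected = "-sender-" + _mac("sender:" + envelope_sender.lower())
    return local.endswith(expected)
```

      Because the checksum depends on a secret only the owner knows, a harvester cannot extend the expiry date or rebind the address to another sender without invalidating it.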

      Check it out, it's really a clever concept IMHO. Of course I completely agree that this shouldn't keep us from fighting spam on other fronts, using RBLs and legal means in addition to filters.
      I just think whitelisting works far better than content filtering.

      --

      Idempotent operation: Like MS software, wether you run it once or often, that doesn't make it any better.
  31. Plug for OnLisp by nonya · · Score: 1

    While you are there check out his book "OnLisp" (available for free at http://www.paulgraham.com/onlisptext.html). It is an extremely well written book and gives a flavor of what makes lisp special - its macros. Because lisp has such a regular syntax you can do amazing things with macros.

    My only complaint about OnLisp is it only has one chapter on the common lisp object system, which is very powerful - multimethods, method combination, and a metaobject protocol - and could have used more explanation; I don't think it talks about lisp's exception handling at all.

    But for a flavor of why people love lisp give this well written book a try!

    1. Re:Plug for OnLisp by Tiny+Elvis · · Score: 1

      The first C example was not a lexical closure, it was a weak trick that used a static global var. Erik provided an example that would allow something more like a LISP closure, but went on to explain in depth why while providing similar functionality it wasn't providing the generality of LISPs closures.

      Your animosity towards Paul Graham and Erik Naggum and Lispers in general is quite obvious. I personally learned a lot from Paul Graham's books.

      I would also like to comment that at least on c.l.l. Lispers are plenty willing to provide examples and evidence of why they believe Lisp to be superior. You will find fanaticism in every camp, Lisp included.

    2. Re:Plug for OnLisp by nonya · · Score: 1

      First, I absolutely agree with your comments about comp.lang.lisp. I lurk there - most of the posters are unbelievably arrogant. The best solution for Erik is a killfile.

      I also took a look at the code from the thread you linked to. It really does not do the same thing as a closure, which is what I assume the challenge was (I looked in my copy of OnLisp, I could not find the challenge...). However, I did send Paul C++ code for his challenge on this page (http://www.paulgraham.com/accgen.html) that I believe was correct. However, my code was never posted and I never heard back from him as to why my code was deficient. (I don't really care, I was just having fun.)

      I don't know Paul, and don't have an opinion on his character, but I think your statement: "I would not trust Paul Graham or his book to educate a new programmer in lisp" is a little unfair - judge the book by its own merits, not on your opinion of the author. I learned a great deal reading OnLisp, I enjoyed reading it, and I stand by my recommendation.

    3. Re:Plug for OnLisp by Tiny+Elvis · · Score: 1

      If you want to find out if it was correct just post it on c.l.l.
      Regarding Naggum, while he does generate a lot of argumentative and fighting posts, he does have a lot of technical skill and often makes informative posts too.

    4. Re:Plug for OnLisp by Anonymous Coward · · Score: 0
      You can say, if you wish, that "it doesn't do the same thing as closure" and you're right. But what Paul Graham did was present a simple lisp program which would print out certain things for certain inputs, and then he said "What does addn (the function in question) look like in C ? You just can't write it." Yet there is a program which presents exactly the same output for every single input, written in C.

      Paul Graham's and Erik Naggum's and your point is that the C program doesn't shuffle bits and maintain state in quite the same way as the Lisp program does. But if that's your point, then say that. Instead, you choose to take a position that sounds good to your chauvinist little souls ("C programmers can't do this and we can! Yea !") and is obviously indefensible to any freshman CS student who has heard someone ramble a bit about Turing equivalent languages.

      Lisp programmers often -- excuse me, let me correct myself. Lisp advocates (real programmers, as opposed to book writing academics like Paul Graham and newsgroup denizens like Erik Naggum, are much more rational) often say that any large C project eventually grows to include a poor re-implementation of a large chunk of the lisp environment, and the bugs are in precisely that portion which would be available with no work in Lisp. They are right, as the trivial example from Graham's book, and the longer C solutions proposed on the newsgroup, show.

      What they fail to realize is that non lisp languages predominate precisely because they implement ONLY that portion of lisp they need, and not all the other crap. Even in today's world of the $400 desktop supercomputer, lisp culture remains defined by its origins in an elitist group of programmers working on some of the fastest machines in the world, paid for by other people (likely taxpayers), without having to share with anyone. These people write code the way plantation owners do agriculture -- and they used RAM the way old time cotton barons used human labor.

      Today most lisp people write code for which the code is not itself run by the unwashed masses, but the product is simply provided to them. Big institutions like banks and telecoms, web based services like yahoo stores, are good examples. Notice how horny those lisp bastards get when anyone starts talking about web-based services; it's exactly the kind of hierarchical arrangement of them maintaining the big machine and all the peons connecting and begging for services that is ingrained into their culture.

      Lisp doesn't have to be this way. The "Hello World" executable from a lisp compiler with the default options doesn't have to be four megabytes. The lisp environment doesn't have to suck up more memory than the data you are working on. There is nothing intrinsic about the lisp language that prevents compilers from producing executables at least as small as well written tiny C programs.

      What's holding it back is the people. The lisp community is simply unable to come to terms, or even recognize the validity of, the new democratic revolution in technology. They will never be able to write code that is useful to other people, they will always write code that is really just homage to their own dead gods.

      If you want to program really well and productively, it's probably a good idea to learn lisp well. But avoid letting your mind be infected by too close a relationship with the likes of Paul Graham.

    5. Re:Plug for OnLisp by Tiny+Elvis · · Score: 1

      "Yet there is a program which presents exactly the same output for every single input, written in C."

      Wrong. In Lisp I can write
      (defun addn (x) (lambda (y) (incf x y)))

      and it will return a function that will take a value and return a function that will increment that value when called. The key is this will work for any numeric type: complex, rational, int, bignum. In C you are coding for a single data type. Therefore the C program does NOT produce correct output for every single input.

      Every time someone comes onto c.l.l. claiming that other languages (eg C) can do anything that Lisp can it amounts to Turing equivalence. The point of Lisp "advocates" is that you can do all of these great things in Lisp without writing tons of extra scaffolding code.
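      For readers who don't read Lisp, here is a rough sketch of the same accumulator idea in Python rather than Lisp or C. The name make_accumulator is made up for illustration, and this only shows the "closure over mutable state that works for any numeric type" point, nothing about how any Lisp implements it.

```python
from fractions import Fraction


def make_accumulator(total):
    """Return a function that adds its argument to a running total."""
    def accumulate(amount):
        nonlocal total          # the closure captures and mutates `total`
        total += amount
        return total
    return accumulate


# The same closure works for ints, rationals and complex numbers.
acc_int = make_accumulator(1)
acc_rat = make_accumulator(Fraction(1, 2))
acc_cpx = make_accumulator(1 + 2j)
```

      Each call to make_accumulator gets its own private total, so the three accumulators above do not interfere with one another.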

    6. Re:Plug for OnLisp by Tiny+Elvis · · Score: 1

      Ooops, I meant to say "and it will take a value and return a function that will increment ..."

    7. Re:Plug for OnLisp by Anonymous Coward · · Score: 0

      you're just too stupid to read aren't you?

      Yes massa! Right away massa!

    8. Re:Plug for OnLisp by Anonymous Coward · · Score: 0

      He does. I wonder could you apply a mail-filter technique to cut out Erik's paranoid rants, and only pass through his technical stuff?

  32. Re:Circumvent by clare-ents · · Score: 2


    I guess you never wish to converse with a blind person, or someone who's restricted to a text only medium then?

    --
    Only two things are infinite, the universe and human stupidity, and I'm not sure about the former. (Einstein)
  33. Misleading by RainbowSix · · Score: 5, Interesting

    He isn't fighting spam, he is filtering it. There is a difference. Filtering still costs in bandwidth. Fighting it would eliminate the source and free up the gigabytes of bandwidth lost for this marketing purpose.

    Filtering is fine for now, but ultimately it must be fought and defeated.

    --
    --------
    It's OK to be social, just don't tell anyone about it.
    1. Re:Misleading by sebi · · Score: 4, Insightful

      In the long run filtering would eliminate the source as well. Spam has to be paid for by two sides: both the spammer and the recipient have to pay for the bandwidth. The spammer has to pay a lot more though. Spamming is a business that will continue to exist as long as it's profitable. If the success rate of spam drops dramatically due to refined filters, then sooner or later spammers will no longer be able to afford the bandwidth they need.

    2. Re:Misleading by cybermace5 · · Score: 3, Interesting

      Wha...? Did you read the article?

      Filtering == Fighting

      The entire success of spam depends on human eyes reading it. If no one ever sees the spam, then spammers will have no money. Then they'll quit SENDING spam and have to start EATING it! Ahahaha!

      They can have the spam, egg, bacon, spam, CROW, spam, and spam.

      --
      ...
    3. Re:Misleading by Anonymous Coward · · Score: 0

      I think he says that once it's all filtered, it won't be profitable and spammers will give up, hence fighting spam.

    4. Re:Misleading by Anonymous Coward · · Score: 0
      Fighting it would eliminate the source and free up the gigabytes of bandwidth lost for this marketing purpose. Filtering is fine for now, but ultimately it must be fought and defeated.

      There are three problems I see with fighting spam:

      • It is hard and time consuming.
      • Innocent bystanders get hurt by some of the fights. ("You are supporting spam because your ISP is also used by a spammer, so I'm going to block your mail too!")
      • There are too many idiots out there who think they can make money by sending spam. The actual amount of profit doesn't matter.

      Spam stinks, and I don't have a good answer. But filters are a valuable tool until we can boil spammers in oil. (See here for more tasty spam recipes.)

    5. Re:Misleading by Anonymous Coward · · Score: 0

      No, that's the point. If it's properly filtered at all points, there is no need to fight and stop it.
      The spammers are out there to make money. If they spend months trying to spam people, and not a single soul replies (because of the filters), they will be out of business, and will stop spamming.

    6. Re:Misleading by Jaywalk · · Score: 1
      Filtering is fighting after a fashion. The more spam is left unread, the less worth it has. The other suggestion someone made, putting fake names into HTML, is also useful since it increases the number of addresses the spammer must sift through to find a real address.

      Hmm. Maybe I should go and add a few dozen to my website now.

      --
      ===== Murphy's Law is recursive. =====
    7. Re:Misleading by RainbowSix · · Score: 2

      As others have said, I don't think filtering will eliminate the source. Sure, people on /. can filter their email, but the people who actually are ignorant enough to buy from spammers aren't likely to be the same kind of people who set up their own spam filters. They are most likely using their run of the mill aol/hotmail/ISP email addresses which have some filters in place, but as anybody with a hotmail address knows, they are by no means effective.

      --
      --------
      It's OK to be social, just don't tell anyone about it.
    8. Re:Misleading by Anonymous Coward · · Score: 0

      You moderators suck. This is not "interesting"; RainbowSix is wrong and obviously did not RTFA. With this kind of robust custom filtering, spam could end because none of it will get read. In fact the article has several paragraphs discussing how this is fighting spam and may bring its end.

    9. Re:Misleading by Xenographic · · Score: 1

      Spamming is a business that will continue to exist as long as it's profitable.
      >>>>>>>>>>>>

      The only problem is that that may be for a long time to come. You see, some of the brighter spammers now realize that you can't really make any money off spamming; at least, not directly. Instead, they become 'internet marketers' or somesuch and offer to 'market' your products... via spam. So they get paid whether or not it works & the company who foolishly hired them gets a bad reputation.

      So, make that "spam will continue as long as people are dumb enough to give money to spammers" ...

    10. Re:Misleading by Titusdot+Groan · · Score: 2
      This makes the assumption that the people buying stuff from the spammers are only doing it because they don't have good spam filters in place, that somehow merely seeing the spam causes them to buy it and if only it could be filtered they would be able to save their money.

      I submit that people who buy junk from email ads are the same people who watch and purchase from infomercials and they want to do it!

      That's why it has to be fought at the source -- because I don't want my ISP spam filtering for me and Joe "Check out my BlueBlocker Sunglasses" Sixpack wants to see this crap.

    11. Re:Misleading by AnotherBlackHat · · Score: 2

      I submit that people who buy junk from email ads are the same people who watch and purchase from infomercials and they want to do it!

      That's why it has to be fought at the source -- because I don't want my ISP spam filtering for me and Joe "Check out my BlueBlocker Sunglasses" Sixpack wants to see this crap.


      Let me see if I've got this straight.
      You're claiming that spam needs to be stopped because most people want it?

      So, are you thinking that what you want should be what everyone wants,
      or are you just hoping for the tyranny of the majority?

      Maybe we should round up all those wrong thinking people and put them in camps or something.

      -- this is not a .sig
    12. Re:Misleading by AnotherBlackHat · · Score: 2
      He isn't fighting spam, he is filtering it. There is a difference. Filtering still costs in bandwidth. Fighting it would eliminate the source and free up the gigabytes of bandwidth lost for this marketing purpose.

      Filtering is fine for now, but ultimately it must be fought and defeated.


      I assume by "it" you mean that spam must be fought and defeated, not filtering.

      The real cost of spam isn't bandwidth, it's our time.

      see- http://spamwolf.com/spaminfo.html#whatcost
    13. Re:Misleading by Anonymous Coward · · Score: 0
      No, spamming needs to be stopped because it's wrong -- it uses my own time and money to force unwanted advertising down my throat.

      The tyranny of the majority would be in effect if we didn't do anything about spam simply because most people like getting it.

    14. Re:Misleading by innerlimit · · Score: 1

      i don't want to tempt fate, and i know i'm a few hours behind on the discussion, but i get like maybe one spam a week or less on my hotmail account. not even 'bulk' mail as hotmail calls it... is it because hotmail is that effective or because my mail address isn't appealing enough for spammers?

    15. Re:Misleading by infra-red · · Score: 1

      Once you've identified the message as spam, you can submit it to services like spamcop or create a short term blackhole list to block future connection attempts. This is of course based on the claim that it stops all but 5 out of 1000 spam messages, with 0 false positives

      I suspect that the implementation of this would best be done in the mail client, and not on the mail server, which might be a problem. Setting up a hash for every user on any mail server with a reasonably large userbase would require massive disk space and huge processing time.

    16. Re:Misleading by 40000 · · Score: 1

      Joe Average Hotmail User doesn't care enough about getting e-mails from strangers (he doesn't have a web site or if he does, it's at Geocities and the e-mail address displayed is an expired yahoo account he can't remember the password for). He can afford to delete the whole 2 MB inbox every week without even checking whether there are some useful messages in there (it's probably just used for the MSN Messenger user name and in case one of his friends sends him something funny). But Joe will read some of the spam before hitting Delete All.

  34. Time for a spam contest! :) by stere0 · · Score: 2

    Using Graham's system, write a message that will get a very high mark. The highest mark will win.

    The message has to be understandable English. Please post your entry as a reply to this message.

    --
    Trollem mirabilem hanc subnotationis exigiutas non caperet
    1. Re:Time for a spam contest! :) by Reality+Master+101 · · Score: 3, Interesting

      xIf xYou xCan xRead xThis xYou xHave xWon xA xFabulous xVacation! xClick xHere xTo xRecieve xYour xPrize!

      --
      Sometimes it's best to just let stupid people be stupid.
    2. Re:Time for a spam contest! :) by Anonymous Coward · · Score: 0

      Easy enough, just make a picture ad such that it will be included in the email if viewed in a html email reader. Make sure that the url is neutral.
      Or did you not read the fucking article where the example was already given?

    3. Re:Time for a spam contest! :) by Allaria · · Score: 1
      I believe that would only make it through once, if that. His filter works both ways, so chances are it would get negative on the domain it came from, and no score on any of those words == still a negative score.

      However, something like:

      Hey, old friend.
      I was thinking of you when I saw this, give it a read and tell me what you think.

      xIf xYou xCan xRead xThis xYou xHave xWon xA xFabulous xVacation! xClick xHere xTo xRecieve xYour xPrize!

      Might work..
      --
      If a and b in c, and a can create b, and a can create a, and b can create b, and b cannot create a, then a created c.
  35. Filtering text content by gawi · · Score: 2, Insightful

    Great... now that they know, they'll spam me with gifs and jpeg.

    --
    All humans are mortal. Socrates is a human. Socrates is dead.
    1. Re:Filtering text content by Hitokage_Nishino · · Score: 1

      From the article, I'd presume the filter would start blocking emails with those jpegs and gifs.

    2. Re:Filtering text content by dillon_rinker · · Score: 2

      Hopefully it would also start blocking the emails from my wife's sister's cousin's daughter who emails new pictures of her baby to everyone she knows every day...

  36. Never mind that... by billbaggins · · Score: 1
    The existence of a whitelist (e-mail addresses that are "trusted" to send nonspam) makes things very easy. Right now you can buy CDs with zillions of addresses. If the whitelist lives, then the next generation of that CD will have pairs: your address, and the address of someone you've probably mailed (say, an address that appears on the same page as yours). Voila!

    As for multipart/alternative... right now anything I get that has a content-type other than text/plain goes to a special folder, where it usually gets deleted without even being opened... fortunately most of my friends use proper mailers that send text/plain :-)

    --
    "The best argument against democracy is a five minute chat with the average voter."
    --Winston Churchill
  37. Shooting spammers is wrong. by Anonymous Coward · · Score: 0
    We should chain them up in the tech centers around the country. Then people get to pay $20 per lash to whip them.


    This does several things:

    • It feels good!
    • Generates money to pay for their damages,
    • Discourages other spammers, and
    • It feels good!
  38. Is this thing patented? by WetCat · · Score: 2

    Can I use that feature for my own (commercial
    or open source) mail client development?

  39. AI Anti-Spam Papers by bpfinn · · Score: 1

    There are several papers describing using Naive Bayes classification, as well as other AI techniques, to filter spam here. Look for the section on "Document Filtering".

  40. Comment removed by account_deleted · · Score: 1

    Comment removed based on user account deletion

  41. Perl by Mr_Silver · · Score: 2
    This looks like something that could easily be done in Perl.

    Although to be honest, I don't understand how the algorithm works. However I'm sure some enterprising soul can probably work it out and code something (hell I will if someone can explain it in decent mathematical terms).

    All we need then is a repository of spam mail and non-spam mail to "teach it".

    Whatcha reckon?

    --
    Avantslash - View Slashdot cleanly on your mobile phone.
    1. Re:Perl by Anonymous Coward · · Score: 0

      Maybe you should learn to read the damn article, you dirty ape.

    2. Re:Perl by plover · · Score: 2
      Heh. My delete box for today would serve as my repository.

      I haven't been hurting for spam samples recently... :-(

      --
      John
  42. Best anti-Spam method is TMDA by Erore · · Score: 3, Interesting

    I'm continually amazed at the people who are beating their heads up against a very simple problem. The answer is not statistics, it is not heuristics, it is not AI, it is not procmail.

    The answer is verification...aka whitelists. Check out TMDA, tmda.sourceforge.net. This program assumes you don't want mail from anybody whom you haven't explicitly allowed, or who has verified that they are a real person and not a spammer.

    Verification is simple, and some people will point out that it could be defeated by a spammer. But, the economics of spam do not make it feasible for a spammer to attempt to defeat TMDA.

    TMDA is similar to making your phone number private. You only get phone calls from people you have given your number to, and you never get telemarketers.

    TMDA user since December 2001. Spam messages that tried to get in, 12,133, spam messages that got in 3, false positives, 0. Time I've spent tweaking and modifying the program since installation, 0 minutes.

    1. Re:Best anti-Spam method is TMDA by Anonymous Coward · · Score: 1, Funny

      "verified that they are a real person and not a spammer."

      Heh, spammers are people too, you know :-)

    2. Re:Best anti-Spam method is TMDA by Launch · · Score: 1

      What about order confirmations from online merchants and other automated information that you *DO* want that would not reply to a validation e-mail?

      Would spammers start writing software that did respond and autovalidate? I don't see how this will work.

      --
      Your mammas flamebait.
    3. Re:Best anti-Spam method is TMDA by DrVxD · · Score: 2

      > Heh, spammers are people too, you know
      What on Earth gives you that idea?

      --
      Not everything that can be measured matters; Not everything that matters can be measured.
    4. Re:Best anti-Spam method is TMDA by Anonymous Coward · · Score: 2, Insightful

      I like TMDA, but I have two issues with it. First, you can only use it if you control a mail server. Second, my friends have a terrible time dealing with the concept of having to reply to a message to let mail go through to me. Sure, I can add them in advance, but if they have a new mail address, I don't get to see their message. Maybe I just have dumb friends, but they are my friends, and I want to get mail from them!

    5. Re:Best anti-Spam method is TMDA by Anonymous Coward · · Score: 0

      TMDA is similar to making your phone number private. You only get phone calls from people you have given your number to, and you never get telemarketers.

      Except that doesn't work. Telemarketers rarely use the phone book to get your phone number. They buy lists of phone numbers from companies who need an extra buck. These companies are places to whom you have given your phone number for order verification, membership info, anything.

      A better analogy is Caller ID. This has eliminated the telemarketer problem in our house. Don't pick up anything that is Unavailable. The only people that block their numbers that way are telemarketers. Plus, with the answering machine, false positives are caught and leave a message (or screened and picked up). Telemarketers never leave a message.

      Verification has its place, but not for all patterns of email usage. What if you have your email address on a webpage/resume, etc. Yes, verification is easy in principle, but in practice it would be a gigantic pain in the ass if everyone used it. For every random email you sent to a webmaster, potential employer, old friend, whatever, there would be that extra step of email, plus bandwidth, which would add up if verification grew in usage.

      The true answer is to crack down harshly on spammers and service providers who permit spammers. Large companies like AOL, Earthlink and maybe larger backbone providers should start blocking massive amounts of address space, with the explicit purpose of creating false positives. If all of the email from .cy .cx .tw, etc. was blocked to the rest of the world, those governments might actually start looking at the problem. The same effect would occur for service providers in the US.. if all of their customers couldn't send email, they would actually start to exclude spammers from service and begin to eliminate the problem.

    6. Re:Best anti-Spam method is TMDA by Anonymous Coward · · Score: 0
      What about order confirmations from online merchants and other automated information that you *DO* want that would not reply to a validation e-mail?

      I just use a keyword address:
      myname-slashdot.b1521d@mydomain.com

      Or, you could use an address that expires in 3 months:
      myname-dated-1029537365.0ee6b2@mydomain.com

      Would spammers start writing software that did respond and autovalidate? I don't see how this will work.

      They could, but then you modify TMDA to ask a simple question to validate. ("Who was the first president"/"what is 342+412".) Spammers would have to have a person do the answering, which would drive up the cost.
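      To make the "address that expires" idea concrete, here is a hedged Python sketch of one way such an address could be generated and checked. This is NOT TMDA's actual code or format; the secret key, the 6-character tag, the SHA-1 HMAC and the function names are all assumptions chosen only to resemble the example address above. It also assumes the user name itself contains no "-".

```python
import hashlib
import hmac

SECRET = b"replace-with-a-private-key"  # hypothetical per-user secret


def _tag(payload: str) -> str:
    """Short keyed checksum so strangers cannot forge a valid address."""
    return hmac.new(SECRET, payload.encode(), hashlib.sha1).hexdigest()[:6]


def dated_address(user: str, domain: str, expires_at: int) -> str:
    """Build user-dated-<unix time>.<tag>@domain."""
    payload = f"dated-{expires_at}"
    return f"{user}-{payload}.{_tag(payload)}@{domain}"


def is_valid(address: str, now: int) -> bool:
    """Accept the address only if the tag matches and it has not expired."""
    local, _, _domain = address.partition("@")
    try:
        _user, payload_and_tag = local.split("-", 1)
        payload, tag = payload_and_tag.rsplit(".", 1)
        kind, expires = payload.split("-", 1)
        expires_at = int(expires)
    except ValueError:
        return False
    if kind != "dated":
        return False
    tag_ok = hmac.compare_digest(tag.encode(), _tag(payload).encode())
    return tag_ok and now <= expires_at
```

      The point of the keyed tag is that a spammer who harvests one dated address cannot simply bump the timestamp to keep it alive.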

    7. Re:Best anti-Spam method is TMDA by Anonymous Coward · · Score: 0

      A mail is from whoever takes responsibility for it. Since spammers need to remain unaccountable, they obscure the origins of their spam and intentionally try to disavow responsibility. That causes the mail to be from no one.

    8. Re:Best anti-Spam method is TMDA by Anonymous Coward · · Score: 0

      I know people that don't have root access to mail servers that are using TMDA.

      Your friends are lazy (as are many of mine). They only have to confirm the first time they send from a new address. After that, all is let through.

      I work at home and used to get several unwanted calls throughout the day. I got caller ID so I could see who was calling, but then I was getting more and more "out of area" calls each day -- I still had to hear the ring and look at the CID box. Then I got "Privacy Detector" -- the phone equivalent to TMDA. Only identified calls ring through, all other callers must identify themselves. Telemarketers don't bother doing this, so I don't get the calls anymore.

      I've been using TMDA for a long time and SPAM is no more an issue than telemarketers anymore.

      It's a small price to pay to regain my privacy.

    9. Re:Best anti-Spam method is TMDA by Anonymous Coward · · Score: 0

      Granted, if your mail server is running qmail, you don't need root access to use TMDA. If it is running something else, the server software might need to be configured differently, if it can support TMDA at all. I have several mail accounts that can't work with TMDA; I had to set up my own mail server to use it. As TMDA gains in popularity, I'd hope that some ISPs would provide it as a service, or at least have mail systems that would support it.

      Besides, if you are polling for your TMDA-protected mail with POP3, the delay in asking for confirmations makes it more confusing.

    10. Re:Best anti-Spam method is TMDA by pjrc · · Score: 3
      Check out TMDA, tmda.sourceforge.net. This program assumes you don't want mail from anybody whom you haven't explicitly allowed, or who has verified that they are a real person and not a spammer.

      This is only a solution for people who, well, only want mail from people they already know, and don't mind putting up a rude and obnoxious barrier... "I don't want to even talk with you until you jump through these hoops to verify you're not a spammer" for anyone else.

  43. Please explain the LISP code by Anonymous Coward · · Score: 0

    For those of us that are not LISP gurus,
    can someone explain what he's doing with
    the following code:

    (let ((g (* 2 (or (gethash word good) 0)))
          (b (or (gethash word bad) 0)))
      (unless (< (+ g b) 5)
        (max .01 (min .99 (float (/ (min 1 (/ b nbad))
                                    (+ (min 1 (/ g ngood)) (min 1 (/ b nbad)))))))))

    1. Re:Please explain the LISP code by Anonymous Coward · · Score: 0
    2. Re:Please explain the LISP code by bsd-mon · · Score: 2, Informative

      LISP is prefix, so instead of a+b you'd have (+ a b).
      IIRC in C this would be similar (LISP gurus, please correct me):

      int g(char* word) {
      /* if word is in good hash, return weight,
      else return 0 */
      return 2*good_word_weight;
      }
      int b(char* word) {
      /* if word is in bad hash, return weight,
      else return 0 */
      return bad_word_weight;
      }

      int main() {
      if (g(word) + b(word)

      --
      To read makes our speaking English good. - X. Harris
    3. Re:Please explain the LISP code by brausch · · Score: 2, Informative

      OK, I'll try. He's trying to score the word on a scale from .01 to .99. The value is the probability that the word is a spam word.

      g = 2 * (count of how many previous "good messages" the word has appeared in)
      b = (count of how many previous "bad messages" the word has appeared in)

      if( g+b < 5 ) // word hasn't occurred enough in previous messages
      return 0; // to have a valid score

      fb = b / nbad // nbad is number of bad messages in database
      fg = g / ngood // ngood is number of good messages in database

      score is fb / (fb + fg)
      minimum valid score is .01, maximum is .99
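      The steps above can be sketched in runnable form. This is a Python approximation of the quoted Lisp, not the author's code: word_probability is a made-up name, good and bad are assumed to be plain dicts of counts, and ngood and nbad are assumed to be positive. It also keeps the (min 1 ...) caps from the Lisp and returns None where the Lisp "unless" gives nil.

```python
def word_probability(word, good, bad, ngood, nbad):
    """Return the spam probability for `word`, or None if it is too rare."""
    g = 2 * good.get(word, 0)   # good counts are doubled, as in the Lisp
    b = bad.get(word, 0)
    if g + b < 5:
        return None             # not enough evidence to score this word
    fb = min(1.0, b / nbad)     # share of bad messages, capped at 1
    fg = min(1.0, g / ngood)    # share of good messages, capped at 1
    return max(0.01, min(0.99, fb / (fb + fg)))


# Toy counts, invented purely for illustration.
good_counts = {"meeting": 10}
bad_counts = {"viagra": 20, "meeting": 1}
p_spammy = word_probability("viagra", good_counts, bad_counts, 100, 100)
p_hammy = word_probability("meeting", good_counts, bad_counts, 100, 100)
```

      A word seen only in spam is clamped to .99 and a word seen only in good mail to .01, so no single word can ever be treated as absolute proof.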

      --
      "Almost every wise saying has an opposite one, no less wise, to balance it." - George Santayana
    4. Re:Please explain the LISP code by bsd-mon · · Score: 1

      Sorry (damn < screwing up my post )
      Basically:
      unless the sum of the word's good weight and bad weight is .lt. 5 then find the word's average badness [(/ b nbad)] and its average goodness [(/ g ngood)] and add them [(+ avg_good avg_bad)] (my edit).
      Then find its normalized badness [(/ avg_bad (+ avg_good avg_bad))] and that is the word's new "badness". Again, gurus, please correct me.

      --
      To read makes our speaking English good. - X. Harris
    5. Re:Please explain the LISP code by Keith+Maniac · · Score: 1

      One minor nit.

      If g + b is less than 5, the algorithm won't return 0, that would be bogus.
      It will return nil, as an indication of failure. That's the purpose of the "unless".

      Remember, 0 isn't false in LISP.

    6. Re:Please explain the LISP code by mla_anderson · · Score: 1
      Can you explain the second bit of code:

      (let ((prod (apply #'* probs)))
        (/ prod (+ prod (apply #'* (mapcar #'(lambda (x) (- 1 x)) probs)))))
      --
      Sig is on vacation
    7. Re:Please explain the LISP code by Keith+Maniac · · Score: 1

      It goes something like this:

      prod = probs[0];
      prodinv = 1 - probs[0];

      for (i = length(probs) -1; i > 0; i--){
      prod *= probs[i];
      prodinv *= 1 - probs[i];
      }
      return prod / (prod + prodinv);

      The "apply #'*" multiplies all of the elements of a list (the probabilities) together.
      The mapcar/lambda stuff subtracts each element from 1 to get the inverse probability, then multiplies those all together.

      Enjoy. LISP is fun to read.

      (BTW, I reversed that loop since it's easier than trying to get a "less than" sign. Normally I'd count up.)
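      For anyone who wants to try the numbers, the same calculation fits in a few lines of Python. This is a sketch and the name combined_probability is made up; it assumes every input is strictly between 0 and 1 (the .01 to .99 clamp on word scores guarantees that), so the denominator is never zero.

```python
from math import prod


def combined_probability(probs):
    """Combine per-word spam probabilities into one message probability."""
    p = prod(probs)                    # all the words say "spam"
    q = prod(1 - x for x in probs)     # all the words say "not spam"
    return p / (p + q)


# The "sex" (.97) and "sexy" (.99) figures quoted from the article
# earlier in the thread.
both = combined_probability([0.97, 0.99])
```

      With those two inputs the result comes out at roughly 0.9997, which matches the 99.97% figure in the article quote.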

  44. Another idea by caesar79 · · Score: 2, Interesting

    a nice idea to filter spam ...another one to fight it.

    1. The MTAs (mail transport agents like sendmail etc.) establish trust relationships either among themselves or manually. They also maintain a per-user safelist (i.e. address book + list of addresses the user wants to receive mail from).

    2. All email over the trusted links and from addresses in the safelist are delivered unfiltered.

    3. For each email sent over an untrusted link
    a. Perform MD5 over message body.
    b. Ask neighbouring trusted agents if they have received an email whose MD5 is given.
    c. If the no. of positives is greater than a threshold, reject as spam.
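
    Step 3 can be sketched roughly as follows. This is only an illustration in Python: the Peer class is a hypothetical in-memory stand-in for a trusted neighbouring MTA, and a real system would need an actual query protocol between servers, which nothing here specifies.

```python
import hashlib


def body_digest(body: str) -> str:
    """MD5 of the message body, as a hex string."""
    return hashlib.md5(body.encode("utf-8")).hexdigest()


class Peer:
    """Hypothetical stand-in for a trusted neighbouring MTA."""

    def __init__(self):
        self.seen = set()

    def record(self, body: str) -> None:
        self.seen.add(body_digest(body))

    def has_seen(self, digest: str) -> bool:
        return digest in self.seen


def looks_like_spam(body: str, peers, threshold: int) -> bool:
    """Reject if more than `threshold` trusted peers saw the same body."""
    digest = body_digest(body)
    positives = sum(1 for peer in peers if peer.has_seen(digest))
    return positives > threshold
```

    One thing the sketch makes obvious: an exact hash only matches byte-identical bodies, so a mailer that changes even one character per recipient would produce a different MD5 each time.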

    1. Re:Another idea by plover · · Score: 2
      Sweet!

      The only problem is that too many people would want hotmail, aol and msn to be in the "trusted" list. And we all know that can't be.

      Still a great idea.

      --
      John
    2. Re:Another idea by Noofus · · Score: 2

      Unfortunately this is also easily spoofed. The spammers could set up their own networks that falsely inject "trust" into the system.

      Mind you, I don't have an alternative. This sounds great, but I can see how it can be tricked.

  45. Could this also be used for studying spam? by FuzzyDaddy · · Score: 3, Interesting
    Could this technique be used as a way to track evolving spam techniques over time?

    You could develop a corpus of spam over a long period of time, and look for shifts in the data. What this paper describes is distinguishing between a spam-corpus and a legit-corpus, but you could also compare a spam-1999 corpus to a spam-2002 corpus, and see if the spammers are up to anything new.

    Not that it would be useful, but it might be kind of cool to try it out and see.

    --
    It's not wasting time, I'm educating myself.
    1. Re:Could this also be used for studying spam? by bugbear · · Score: 1

      That is an interesting idea. In fact, it might be useful not just to study spam but to improve filtering. If you looked at diffs of word frequencies over time, you could use this to bump up the probabilities for words whose use seemed to be accelerating.
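
      A minimal Python sketch of the "diff of word frequencies" idea, under the assumption that each corpus is just a list of message strings and that splitting on whitespace is a good enough tokenizer. The function names are invented for illustration.

```python
from collections import Counter


def relative_frequencies(messages):
    """Map each word to its share of all word occurrences in the corpus."""
    counts = Counter(word for msg in messages for word in msg.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def frequency_shift(old_messages, new_messages):
    """Positive values mean the word is more common in the newer corpus."""
    old = relative_frequencies(old_messages)
    new = relative_frequencies(new_messages)
    return {w: new.get(w, 0.0) - old.get(w, 0.0) for w in set(old) | set(new)}


def fastest_rising(old_messages, new_messages, n=10):
    """Words whose relative frequency grew the most, largest first."""
    shift = frequency_shift(old_messages, new_messages)
    return sorted(shift, key=shift.get, reverse=True)[:n]
```

      Words at the top of fastest_rising would be the candidates for having their spam probabilities bumped up.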

  46. possible oversites by Launch · · Score: 1

    I have no doubts about the research that goes into the calculation of words that were in spam, since pretty much everyone gets similar types of spam and it's not difficult to collect spam marketed to many demographics.

    What I do wonder about is his collection of non-spam. I agree that this approach is very good, but I think a hash of non-spam needs to be collected by an end user or for a specific demographic.

    For instance, in his article he said that the word madam almost never appears in his non-spam mails. Well, he isn't a woman. It is quite a common business practice to send e-mails with the greeting madam. Also, the vocabulary used in a personal e-mail environment would be drastically different than in a business environment.

    Say you're an AOL teeny-bopper... the chances that another teen is sending you an e-mail with red text (ff0000 was one of the key words that had a 99% chance of spam) are much greater than in a business e-mail environment (where I actually use bright red sometimes when in-line replying to e-mails at the office).

    So, like I said before, I really think the hash of 'good' e-mails has to come from an end user or at the very least from a demographic...

    --
    Your mammas flamebait.
    1. Re:possible oversites by mrhight · · Score: 1

      Go back and read the article. Note the "delete-as-spam" and "delete" functions. He is talking about each individual user building their own collection of spam and non-spam for the statistical analysis.

  47. Another idea! Need repository of spam by Mr_Silver · · Score: 2
    I've got another idea which might work using Markov chains. You strip the text, work out the probabilities of groups of words appearing after each other and then score that way. As spam changes so would this.
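
    To make that concrete, here is a sketch of one way it could work in Python: train one first-order Markov chain (word bigrams) on spam and one on legitimate mail, then compare how well each explains a new message. The BigramModel class, the add-one smoothing and the log-likelihood-ratio score are my assumptions, not something taken from the article.

```python
import math
from collections import Counter


def bigrams(text):
    """Word pairs, with a start marker so the first word is also scored."""
    words = ["<s>"] + text.lower().split()
    return list(zip(words, words[1:]))


class BigramModel:
    """First-order Markov chain over words with add-one smoothing."""

    def __init__(self, messages):
        self.pair_counts = Counter()
        self.first_counts = Counter()
        self.vocab = set()
        for msg in messages:
            for a, b in bigrams(msg):
                self.pair_counts[(a, b)] += 1
                self.first_counts[a] += 1
                self.vocab.update((a, b))

    def log_prob(self, text, vocab_size):
        """Log-probability of the word sequence under this chain."""
        total = 0.0
        for a, b in bigrams(text):
            num = self.pair_counts[(a, b)] + 1
            den = self.first_counts[a] + vocab_size
            total += math.log(num / den)
        return total


def spam_score(text, spam_model, ham_model):
    """Positive when the spam chain explains the text better than the ham chain."""
    v = len(spam_model.vocab | ham_model.vocab)
    return spam_model.log_prob(text, v) - ham_model.log_prob(text, v)
```

    Retraining both models as new mail is filed is what would let the scores follow changes in spam.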

    However to test such an idea I need a repository of spam mail - something I don't have. Hotmail junk is no good, it's just the same old adverts regurgitated over and over again.

    Does anyone have anything like the 4000 junk emails that this guy has? If so, please could you pop me an email to org dot ewtoo at silver as I'd really appreciate it!

    --
    Avantslash - View Slashdot cleanly on your mobile phone.
    1. Re:Another idea! Need repository of spam by Anonymous Coward · · Score: 0

      > Does anyone have anything like the 4000 junk emails that this guy has? If so, please could you pop me an email
      Let me get this straight - you actually want spam?

    2. Re:Another idea! Need repository of spam by stevey · · Score: 1

      I don't know why you don't just post your real email address, as a link with the mailto attribute set.

      That way you'll get lots of random spam... ;)

    3. Re:Another idea! Need repository of spam by GloomyTrousers · · Score: 1
  48. False positives... by dillon_rinker · · Score: 5, Funny

    From the article:

    In the spam filtering business, false positives are your biggest worry...Based on my corpus, "sex" indicates a .97 probability of the containing email being a spam, whereas "sexy" indicates .99 probability...an email containing both words would have a 99.97% chance of being a spam.

    False positives could be a HUGE problem in this case...imagine the agony if you missed this email from your wife: "I'm feeling REALLY sexy today - meet me at the motel off 12th street at noon for some lunch-hour sex!"

    1. Re:False positives... by Anonymous Coward · · Score: 0, Interesting

      Wow, don't you people actually read articles? The Slashdot crowd is so stupid.

      You would very likely not miss that letter from your wife for a number of reasons. Just as "sex" and "sexy" would increase the probability that the mail is spam, there would be words that would decrease the probability that the mail is spam. This method doesn't cue on just one or two words, it looks for fifteen words that are strongly weighted toward spam or no-spam.
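
      A small Python sketch of that scoring step, assuming a per-user table of token probabilities already exists. The function names are hypothetical, the table values in the usage note are invented, and the 0.4 default for unseen tokens is an assumption to be checked against the article. Probabilities are assumed to be clamped away from exactly 0 and 1.

```python
from math import prod


def combine(probs):
    """Naive-Bayes combination of independent per-token spam probabilities."""
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)


def most_interesting(token_probs, n=15):
    """Keep the n probabilities farthest from the neutral value 0.5."""
    return sorted(token_probs, key=lambda p: abs(p - 0.5), reverse=True)[:n]


def spam_probability(tokens, table, unknown=0.4, n=15):
    """Score a message from a table mapping token -> spam probability."""
    probs = [table.get(tok, unknown) for tok in set(tokens)]
    return combine(most_interesting(probs, n))


# "sex" at .97 and "sexy" at .99 with no other evidence: about 0.9997
print(combine([0.97, 0.99]))
```

      With an invented table such as {"sex": 0.97, "sexy": 0.99, "lunch": 0.02, "noon": 0.03}, a message containing all four tokens scores far lower than one containing only the first two, which is the point being made above.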

      And as you get more and more mail like this from your wife, the Bayesian algorithms will learn that sex is not much of a factor (getting a .79 probability for you based on your corpus, instead of the .97 for the author, since his wife prefers to talk dirty to him on the phone).

      Why did I write this post? If you were too stupid to read the article, there is no way you will read this. Again, the reason why I never bothered to get a Slashdot account. The community is riddled with idiots.

    2. Re: False positives... by Black+Parrot · · Score: 1


      > False positives could be a HUGE problem in this case...imagine the agony if you missed this email from your wife: "I'm feeling REALLY sexy today - meet me at the motel off 12th street at noon for some lunch-hour sex!"

      Or worse, what if you missed a message like that from a total stranger!

      --
      Sheesh, evil *and* a jerk. -- Jade
    3. Re:False positives... by Anonymous Coward · · Score: 0

      Well, considering you probably DON'T get spam from your wife, you'd automatically accept her email.

    4. Re:False positives... by Quintin+Stone · · Score: 1

      Plus if you had a brain, you'd probably have her whitelisted...

      --

      "Prejudice is wrong; you should hate everyone the same."

    5. Re: False positives... by Anonymous Coward · · Score: 0
      > False positives could be a HUGE problem in this case...imagine the agony if you missed this email from your wife: "I'm feeling REALLY sexy today - meet me at the motel off 12th street at noon for some lunch-hour sex!"

      Or worse, what if you missed a message like that from a total stranger!

      I get messages like that from strangers all the time. It is called Spam! You don't have to filter it out if you enjoy it!

    6. Re:False positives... by Anonymous Coward · · Score: 0

      Or if she's anything like my wife, blacklisted.

    7. Re:False positives... by Anonymous Coward · · Score: 0

      False positives could be a HUGE problem in this case...imagine the agony if you missed this email from your wife: "I'm feeling REALLY sexy today - meet me at the motel off 12th street at noon for some lunch-hour sex!"

      On the other hand, maybe you could train it to filter out "Honey, can you stop by the store on the way home from work and pick me up a box of tampons."

    8. Re:False positives... by Our+Man+In+Redmond · · Score: 2

      Yeah, well imagine the agony if you showed up at the motel after some sleazeball forged the mail from your wife and were met by some sweaty guy in a polyester suit and a bad toupee trying to sell you a CD-ROM full of addresses.

      --
      Someone you trust is one of us.
    9. Re:False positives... by timeOday · · Score: 1

      LOL - get a slashdot account so you can accrue the karma you deserve, A.C.

    10. Re:False positives... by Anonymous Coward · · Score: 0

      I wasn't going to mention the obvious, Tim. Besides, this could be the very first email that you get from your wife, so she isn't listed, yet.

      Good to see that there are a few smart people in the Slashdot community, though.

    11. Re:False positives... by Anonymous Coward · · Score: 0

      browse at +4 and average IQ of "The Slashdot crowd" will increase significantly.

    12. Re:False positives... by Anonymous Coward · · Score: 0

      Your wife's email address and other header information would easily overcome the 'sex' and 'sexy'. Words like "lunch" would also help.

    13. Re:False positives... by dillon_rinker · · Score: 2

      Slashdot crowd is so stupid.

      Indeed. Some of them have also had their sense of humor excised.

    14. Re: False positives... by pjrc · · Score: 2
      Or worse, what if you missed a message like that from a total stranger

      Such as:

      My name is Natalie. I live in St.Petersberg and I am looking for a real relationship with a real man. I signed up with this internet service to meet good western men -- I hope you are really there. [url deleted] Please see and write me here if you like me
      or perhaps this from "nec. Jen":
      It's all me [link deleted] Click Here
      and here's one of the few kinky ones:
      I would like invite you to come create a couples or singles profile and join this online community. If you are into alternative lifestyle, or just looking for something kinky in your life come try it out. It now has Video IM working and you do not need a web cam to use it. Check it out you will find what you are looking for.

      You can see my profile and photos by going to

      www.geocities.com/bdsmkitty2000 [oh what the hell, I'll leave this one in]

      and creating a profile it only takes 2 minutes to do this so you can look around.

      Oh, it does not cost anything to get on to look. I would not pull one of those on you.. I hate it when someone does that to me.

      You can find me under the user ID [deleted] and and see my profile and photos. I am a 34 year old bi Dom fem 34DD 120lb's

      Kisses
      [deleted]

      Actually, there are surprisingly few of these in my spam file these days.

  49. Re:Circumvent by xipho · · Score: 1

    Actually, it's a pattern of characters it's working with; English has nothing to do with it. The concept will work for any pattern as he's defined it, and therefore any language.

    --

    only infrmatn esentil to understandn mst b tranmitd
  50. False positives by godemon · · Score: 1

    We are all afraid that new and powerful spam filters will filter out an email that was directed at us, but honestly, how many of us haven't accidentally deleted one ourselves? My spam deleting technique is
    1.Check name
    2.Check subject
    3.Decide
    And even this system has been known to delete a false positive or two (Hey, I didn't know Alisa knew my email and "Hi" from the name "Alisa" just sounds like spam)
    My point being, I doubt if any spam system will ever truly get to the point of never deleting a false positive, but it doesn't mean you should avoid spam filters, or leave them set at settings that make little to no difference.

    --


    Why is a mouse that spins?
    1. Re:False positives by RetroGeek · · Score: 1

      Hey, I didn't know Alisa knew my email

      You too?

      --

      - - - - - - - - - - -
      I am a programmer. I am paid to produce syntax not grammar. Deal with it.
  51. Good idea. by Anonymous Coward · · Score: 0

    So, do we give 'em the Iron Maiden or stick to 'em Transylvania style?

  52. stupid question by kisrael · · Score: 2

    Ok, I read the article but quickly, and at the end of it I wasn't sure how he ultimately told the system that an individual e-mail was spam or that it was legitimate, so it would know into which bin to toss those words...is that a manual process?

    I set up a homebrew whitelist (which still shows me the potential spam) I'm pretty happy with. I'm trying to figure out if I should keep in the subject based whitelisting or not...some spammers use my typical "hey" or "hi" subjects now...and it's the part of the system that grows the most. I'm just worried I'll send out mail to someone and they'll reply with a different e-mail address...maybe I should expire subjects?

    Hmmm.

    --
    SO YOU'RE GOING TO DIE: The Comic for Dealing with Death
    1. Re:stupid question by 40000 · · Score: 1

      Whitelist people who send you messages with a certain subject and people who you have sent a message to. So anyone visiting your website and who wants to make a comment has to use the subject "satan rules" in order to be added to your list. Expire addresses if they don't send you anything for a while. The code could be changed every so often without much disruption.
      It still doesn't work with "verification" messages though.
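
      A sketch of that scheme in Python, assuming addresses are compared case-insensitively and that "a while" means a configurable idle period. The Whitelist class, its method names and the 90-day default are hypothetical.

```python
from datetime import datetime, timedelta


class Whitelist:
    """Hypothetical whitelist keyed by sender address with idle expiry."""

    def __init__(self, passphrase, max_idle=timedelta(days=90)):
        self.passphrase = passphrase.lower()
        self.max_idle = max_idle
        self.last_seen = {}

    def note_outgoing(self, recipient, now):
        """Anyone we have written to is trusted."""
        self.last_seen[recipient.lower()] = now

    def accept(self, sender, subject, now):
        """Return True if the message should skip the spam folder."""
        sender = sender.lower()
        seen = self.last_seen.get(sender)
        if seen is not None and now - seen > self.max_idle:
            del self.last_seen[sender]  # expired through inactivity
            seen = None
        if seen is None and self.passphrase not in subject.lower():
            return False
        self.last_seen[sender] = now  # refresh on every accepted message
        return True
```

      Changing the code word is then just a matter of constructing the object with a new passphrase while keeping last_seen.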

  53. Shifman by T-Kir · · Score: 2, Funny

    I wonder what Bernard Shifman would make of this article?

    What is our 'CS Consultant' up to these days?

    --
    Are you local? There's nothing for you here!
  54. i wish i could try this... by bje2 · · Score: 2

    it looks great, and i will try it for my account that i use eudora or outlook for...however, i use a hotmail address for my main account (so it can travel wherever with me), and their custom filtering system sucks (if i may say so)...the only things they let you filter on are subject, From Name, From Addr, & To or CC lines...no option to filter on message content, which is where this would be useful...oh well, i guess that's what i get for using hotmail...i should get a real e-mail account...

    --

    "Facts are meaningless. You could use facts to prove anything that's even remotely true." - Homer Simpson
  55. Law and Reality by prester · · Score: 3, Insightful
    Making something illegal doesn't make someone stop doing it, obviously. All it does is increase the risks of doing the action. If it's still worth it to you anyway (drug dealers, drug addicts), or you're not thinking about the consequences of your actions (shooting the bastard who you just found in bed with your wife), or if you don't think that you're actually going to get caught (warez), you're not going to stop just because it's illegal.

    Making spam illegal would probably cut down on people buying email lists and starting to spam in their free time because it seems like a great way to make some money. It might even cut down on the "legitimate businessmen" types here who do it professionally. It's going to have no effect internationally, however, and there's really not much you can do about it.

    There's an interesting point about this in the article, however, when Graham says:
    "(I used to think it was naive to believe that stricter laws would decrease spam. Now I think that while stricter laws may not decrease the amount of spam that spammers send, they can certainly help filters to decrease the amount of spam that recipients actually see.)"

    I would agree with this - it seems to me that for a lot of crimes of this nature, drugs being the best example, the solution is not criminalization but regulation. People aren't going to stop dealing or using drugs, nor is it something so serious (like murder) that it's worth it to put them in jail anyway. If drugs were regulated, however, most of the problems could be easily reduced. Enforce strict controls to prevent cutting, ban advertisement, and tie sellers to treatment programs to help get people off of drugs. As long as there's no incentive for people to buy them illegally (ie, their being much cheaper or, as it is now, the only supply), people will buy them from regulated sellers.

    Similarly if you regulate spam and make people attach footers you'll be less likely to drive people overseas to spam while also making it much easier to filter out.

    Of course, there's still not much you can do about the Koreans, other than trying to get their government to do the same thing.

    Besides, do you really want to encourage the government to effectively prohibit certain kinds of non-victimizing (non-kiddie porn) speech online?
  56. You're being shortsighted by David+Wong · · Score: 4, Funny

    It was with the help of spam that with just a simple herbal supplement I was able to add three inches to my penis (an increase of over 20%). I had assumed it was just a scam, and nobody was more surprised than me that it worked.

    Well, except my wife.

    1. Re:You're being shortsighted by geekoid · · Score: 2

      shouldn't that be:
      (an increase of over 200%)

      haha couldn't resist.

      --
      The Kruger Dunning explains most post on /. http://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect
    2. Re:You're being shortsighted by VRisaMetaphor · · Score: 1

      ...I was able to add three inches to my penis (an increase of over 20%)....

      ...nobody was more surprised than me that it worked.

      Well, except my wife.


      Really? I wouldn't think that the difference between 15 and 18 inches would be that noticeable....

    3. Re:You're being shortsighted by Tommy_S · · Score: 1

      If an extra 3" was a 20% increase - damn dude - I guess you've got to make a conscious effort not to put a shoe on it in the morning.

    4. Re:You're being shortsighted by Anonymous Coward · · Score: 0

      Hey! I know you. We went to school together - your first name was "Long" before you changed it to David, right?

      PS - An increase of 200% is waaaay "over 20%".

    5. Re:You're being shortsighted by perfects · · Score: 1

      Vaseline is not an "herbal supplement".

  57. Method applications by lovebyte · · Score: 3, Interesting

    Good method. I work with Bayesian techniques often and I had thought of the same thing but for a different purpose: automatic classification of emails. When you receive an email, your mail reader would propose a list of potential folders into which you might want to put your email after (or before) having read it. And the best thing is that it learns with time and it gets better. And as this article shows, this method can also automatically filter emails. Now if I have time to get involved in the Evolution project or kmail, ...
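
    For the curious, a sketch of what such a folder suggester might look like in Python: a multinomial naive Bayes classifier with one class per folder, updated every time the user files a message. The FolderSuggester class and its methods are hypothetical and not tied to Evolution or kmail.

```python
import math
from collections import Counter, defaultdict


class FolderSuggester:
    """Multinomial naive Bayes over words, one class per mail folder."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # folder -> word -> count
        self.doc_counts = Counter()              # folder -> messages filed
        self.vocab = set()

    def learn(self, folder, text):
        """Call whenever the user files a message into a folder."""
        words = text.lower().split()
        self.word_counts[folder].update(words)
        self.doc_counts[folder] += 1
        self.vocab.update(words)

    def suggest(self, text, top=3):
        """Return up to `top` folders, most probable first."""
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for folder, n_docs in self.doc_counts.items():
            counts = self.word_counts[folder]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # folder prior
            for w in words:
                score += math.log((counts[w] + 1) / denom)  # add-one smoothing
            scores[folder] = score
        return sorted(scores, key=scores.get, reverse=True)[:top]
```

    Treating "spam" as just another folder gives the filtering behaviour for free.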

    --

    I'll do it for cheesy poofs.

    1. Re:Method applications by wilhelm · · Score: 1

      It might be interesting to see some of the similarities in a person's various mailboxes. I am thinking that some of your legitimate mail (hey, why not keep the spam-filtering stuff in there too?) might need some hierarchy of "mailbox foo is more likely than mailbox bar, for a similarly-likely input"; distinct mailboxes could start developing similar signatures. It also might help a person consolidate mail groups that "match", content-wise.

      Shit, now I'm going to have to try this out. There goes my copious spare time... :)

  58. Microsoft already looked into this by michaelwexler · · Score: 2, Interesting

    Feel free to review the work at http://research.microsoft.com/~horvitz/junkfilter.htm

    They came up with similar processes to both filter and to categorize. Bayesian analysis is very flexible, and while Paul Graham is not the first to try this, his work looks very exciting.

    I had nothing to do with any of this work; just a fan of Bayesian research.

    Michael

  59. probalilties by Sarin · · Score: 2

    I spent about six months writing software that looked for individual spam features before I tried the statistical approach...[cut]...Based on my corpus, "sex" indicates a .97 probability of the containing email being a spam, whereas "sexy" indicates .99 probability.

    Of course, these probabilities may vary from person to person.
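
    To illustrate why, here is a Python sketch that derives a token's spam probability from one user's own spam and non-spam corpora. It uses a plain frequency ratio clamped to the range 0.01 to 0.99, which is a simplification and not claimed to be the article's exact formula. The function names and the tiny corpora in the usage note are invented.

```python
from collections import Counter


def message_counts(messages):
    """Count, for each token, how many messages contain it."""
    return Counter(tok for msg in messages for tok in set(msg.lower().split()))


def token_probability(tok, spam_msgs, ham_msgs):
    """Estimate P(spam | token) from one user's own mail, or None if unseen."""
    s = message_counts(spam_msgs)[tok] / len(spam_msgs)
    h = message_counts(ham_msgs)[tok] / len(ham_msgs)
    if s + h == 0:
        return None
    return max(0.01, min(0.99, s / (s + h)))
```

    Given the same spam, a user whose legitimate mail never mentions "sex" ends up near 0.99 for that token, while a user who teaches sex education ends up considerably lower.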

  60. Another really good idea. by Ludwig668 · · Score: 1

    Check this out:
    Digiportal's innovative new ChoiceMail program means the end of spam. I really don't like the idea of using someone else's server to manage my white-list, but all someone needs to do is publish an open-source CGI script to do this... integrated with qmail.

  61. news.admin.net-abuse.sightings by 13013dobbs · · Score: 3, Informative

    Look in UseNet. The group news.admin.net-abuse.sightings is where people post their spams. Enjoy!

    --

    No replies made to AC posts. Please log in.

    1. Re:news.admin.net-abuse.sightings by BACbKA · · Score: 1

      The problem with NANAS is that it has different format for sightings. I think that when the article talks about a spam repository he's talking about reliable email-folder-like one to digest for the "bad" dictionary, without polluting it with the custom wrappings of the submitters' headers and privacy suppressions.

      --

      VKh

    2. Re:news.admin.net-abuse.sightings by 13013dobbs · · Score: 2

      True. You could strip out the extra crap. I am sure that could be done in perl, or something. Or, just post some spam-trap addresses around on usenet and on web-pages on geocities/angelfire/etc... I am sure that will get you a big pile of spam pretty quickly.

      --

      No replies made to AC posts. Please log in.

  62. Effectiveness of complains to abuse account by A5un · · Score: 1
    Well, I'm going back to university this fall and recently reactivated my dormant e-mail account. To my horror, the email address must have been listed on every known spam list. Hey, the internet was much more innocent when I was in university, so I didn't take any precaution during those good ol' days.

    Usually, I don't mind getting spam on my (insert your favourite free web-mail here), but my university email account is something personal. So, armed with whois command, I started complaining up and down, around the world's ISPs. My question is: Has anyone done this? Any success/dismay stories?

    I know I can install a spam filter on my email client, but I prefer to have the email stored on university's mail server. That way I can ssh from anywhere and read my email (pine) and newsgroup(tin), ah.. the nostalgia..

  63. Defeatable by Hayzeus · · Score: 1
    If I understand his algorithm correctly, it should be possible to defeat his filtering methods by appending a list of low-value words to the end of a spam message.

    While he advocates generating probability tables from an individual user's corpus of messages, I would imagine that most users will have many low-spam-probability words in common.

    Even easier, since he assigns a low score to unknown words, appending a sequence of random sets of letters to the end of the message would have much the same effect.

    Checking for phrases (rather than words) can mitigate this a bit, but all in all, this still looks like a stopgap measure.

    1. Re:Defeatable by Anonymous Coward · · Score: 0

      You don't understand it. He uses headers as well as body text, and only looks at the most "interesting" words. Certain words have a high probability of being in spam and almost no possibility of being in legit email.

      "appending a list of low-value words to the end of a spam message" can't hide the spammish words, nor does it dilute the probability of an email being spam as long as there are spammish words.

    2. Re:Defeatable by Anonymous Coward · · Score: 0

      I tried this and it works if you put the non-spam
      words at the *beginning* of the message body.

      Using unknown words instead of clean ones I need
      to hit 4 spam words in the headers to recognise
      it as spam.

      I'm not sure how close my (Perl) implementation
      is to the original but it's caught all the spam
      so far (only about 12 hours).

  64. Spammers could get around this by forkboy · · Score: 2

    1. Create layout of spam
    2. Take a screenshot
    3. Convert to low res PNG or JPG
    4. Mail the JPG to 100,000 annoyed geeks
    5. ???
    6. Profit

    --
    This message brought to you by the Council of People Who Are Sick of Seeing More People.
    1. Re:Spammers could get around this by DirkDaring · · Score: 1

      Yeah, but would that even be effective? It would require the user to type in a url, instead of simply clicking on one.

      Dirk

    2. Re:Spammers could get around this by forkboy · · Score: 2

      You've never seen an image with a hyperlink? Regardless, the content of the message could be imaged in the JPG and then a text hyperlink following it. You'd pretty much have to filter by the advertised URL, which doesn't help if it's a new website.

      --
      This message brought to you by the Council of People Who Are Sick of Seeing More People.
  65. Too bad! Patented By Microsoft by kotku · · Score: 4, Informative
    Microsoft is one step ahead of everyone. Here is the patent summary.

    "Technique which utilizes a probabilistic classifier to detect "junk" e-mail by automatically updating a training and re-training the classifier based on the updated training set"

    The full details of the patent can be seen here.

    Patent Link

    I'm surprised you guys don't check at the patent office first before you get all excited about a new idea. Doh!

    --
    The bikini - security through obscurity since 1943
  66. This won't work with HTML mail by mblase · · Score: 2

    The latest trick from spammers is sending out HTML e-mails with their ads. Not a problem by itself, but by embedding the entire spam ad as a single GIF or JPEG image, there's no text for the spambot to filter out. It's easy to trap false positives with this, too, since a family member or friend might want to send out photos without necessarily attaching text as well. Boom, statistical analysis is instantly useless, and we have to go back to the old tricks -- filtering out known spam e-mail and domain sources.

    1. Re:This won't work with HTML mail by kawika · · Score: 2

      Why *won't* it work on an HTML mail that's only an image? First, as I understood it, the whole message including the header are examined. Second, if IMG, SRC, freeserve (in the domain) are among the most interesting words and high on the probability list, then it's spam. I would expect that to be the case since the only HTML email I seem to get is spam.

    2. Re:This won't work with HTML mail by nebby · · Score: 2

      They'd have to host the images somewhere. Once they do that, the cost will skyrocket and it won't be worth it.

      --
      --
    3. Re:This won't work with HTML mail by Anonymous Coward · · Score: 0

      I would expect the "html" tag to score high on most peoples filters.

    4. Re:This won't work with HTML mail by timeOday · · Score: 1
      I would expect the "html" tag to score high on most peoples filters.
      Unless they have acquaintances using Outlook. Which is everybody.

      Encoding the spam as an image really is hard to guard against. At least it would probably increase the spammers' bandwidth costs.

    5. Re:This won't work with HTML mail by jmauro · · Score: 1

      Yea, but how many people do you communicate with send an email consisting entirely of a picture? With no other words attached at all? My guess, not very many.

    6. Re:This won't work with HTML mail by bedessen · · Score: 2

      No, that's not necessary. They can use MIME multipart encoding (base64) to include the images. Often these days the entire email is just a block of encoded goo. Do a "view source" (or whatever your mail client calls it) on some spam, and chances are it'll be a solid hunk of base64 encoded content.

      (In fact there was a recent exploit of IE6 posted to BugTraq that used something similar to this: it turns out you can encode an arbitrary .EXE (using MIME) into some server-created error message, and then get the code to execute as local user. End result: system compromised with rootkit simply by viewing a webpage.)

    7. Re:This won't work with HTML mail by Jahf · · Score: 1

      Enough that this would be an unacceptable method. Remember that most people would rather get 1 spam than lose 1 real email.

      --
      It is more productive to voice thoughtful opinions (reply) than to judge (moderate) others.
    8. Re:This won't work with HTML mail by timeOday · · Score: 1

      So they put some reasonable-sounding text beneath the picture. By the time you get to the bottom, you've already seen the picture. For that matter, they can probably use javascript or simply color the text white so many people never see it at all.

    9. Re:This won't work with HTML mail by mapinguari · · Score: 1

      To clarify, "this won't work with mime-encoded HTML, where the HTML text is included as a base-64 attachment". That's where you run into problems with simply scanning the email contents. A smarter scanner would have to know about MIME and about base-64 encoding.
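
      A sketch of the "smarter scanner" step in Python, using the standard library's email package to walk the MIME parts and undo base64 or quoted-printable before any word counting happens. The visible_text name is hypothetical, and falling back to latin-1 for missing or unknown charsets is an assumption.

```python
from email import message_from_string


def visible_text(raw_message: str) -> str:
    """Return the decoded text of every text/* part of a raw message."""
    msg = message_from_string(raw_message)
    chunks = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skips multipart containers, images, etc.
        payload = part.get_payload(decode=True)  # undoes base64 / quoted-printable
        if payload is None:
            continue
        charset = part.get_content_charset() or "latin-1"
        try:
            chunks.append(payload.decode(charset, errors="replace"))
        except LookupError:  # charset name Python does not know
            chunks.append(payload.decode("latin-1"))
    return "\n".join(chunks)
```

      The tokenizer would then run over the headers plus this decoded text instead of the raw block of encoded goo.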

  67. Not much help for businesses... by David+Wong · · Score: 3, Insightful

    ...Or somebody who runs a website like me. I want readers to be able to get through, even though they're not each on my approved list. In the same way, a business who uses a customer feedback e-mail address needs to keep it open to everyone.

    I actually had to close down my hotmail account; the spam would exceed the 2MB within 24 hours after being cleaned (and that's with the wonderful MS spam filter set on "high.")

    BTW, these days I'm getting individual spams that are 170 KB in size. Talk about rude...

    1. Re:Not much help for businesses... by anthony_dipierro · · Score: 2

      Why do you need your customers to be able to send unsolicited email anyway? Set up a web based feedback form for the initial contact, and send replies from a uniquely generated address. Change your contact.html page to a php script which includes a clickthrough EULA which promises not to spam, and generates a unique address which identifies the person's IP address. Then if they harvest the address anyway, shut off that address and sue 'em.
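
      The original suggests PHP; purely as an illustration of the address-generation part, here is a Python sketch. It assumes IPv4 visitors and signs the IP with a server-side secret so a harvested address can be traced and forged ones rejected. SECRET, DOMAIN, contact_address and verify are all hypothetical names, and example.com is a placeholder.

```python
import hashlib
import hmac

SECRET = b"change-me"    # hypothetical server-side key
DOMAIN = "example.com"   # placeholder domain


def contact_address(ip: str) -> str:
    """Build a unique reply address that encodes the visitor's IPv4 address."""
    tag = hmac.new(SECRET, ip.encode(), hashlib.sha256).hexdigest()[:10]
    return f"contact-{ip.replace('.', '-')}-{tag}@{DOMAIN}"


def verify(address: str):
    """Return the visitor IP encoded in the address, or None if it is forged."""
    local, _, domain = address.partition("@")
    parts = local.split("-")
    if domain != DOMAIN or len(parts) != 6 or parts[0] != "contact":
        return None
    ip = ".".join(parts[1:5])
    return ip if hmac.compare_digest(contact_address(ip), address) else None
```

      Shutting off a harvested address then amounts to adding it to a deny list; the IP recovered by verify says who harvested it.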

      There's simply no excuse for not being able to filter out 99.9% of your spam. Be smart, not stupid.

    2. Re:Not much help for businesses... by kwerle · · Score: 2

      I'd go one step further. All you need is a good mailto link with a magic subject or body:

      mail me

    3. Re:Not much help for businesses... by Erore · · Score: 2

      If you read about TMDA you will see that it has a verification system. By responding to the verification request, customers will get through.

      Word your verification request in such a way that it appears to the customer that you are doing them a favor. For instance: Thank you for contacting us for technical support. We receive a large volume of mail, and unfortunately, much of it is spam. In order to give more time for our support people to work on real problems we are asking you to respond to this message so that your request will be put in the support queue and not left in a holding bin with illegitimate mail.

      Customers like this. What is more efficient for you in this case, and costs them 3 or 4 seconds (for the first contact, 0 for later contacts), will get their technical request, order, complaint, whatever answered more quickly.

    4. Re:Not much help for businesses... by kawika · · Score: 0

      Right. Customers love sending a message to some support email address and then turning off their computer for the night, only to find a lame auto-email when they return and check mail the next evening after they get home from work. It's a day later and they're no closer to having their problem solved.

    5. Re:Not much help for businesses... by Anonymous Coward · · Score: 0

      Wrong. I think most customers are very familiar with the problem of spam, and they will more than likely be considerate of that fact. May be inconvenient for a customer with an emergency problem, but they could also use a phone to call if a phone number is provided.

  68. RTFA by StrawberryFrog · · Score: 1

    "but ultimately each user should have his own per-word probabilities based on the actual mail he receives ... perhaps best of all makes it hard for spammers to tune mails to get through the filters"

    --

    My Karma: ran over your Dogma
    StrawberryFrog

  69. Nothings perfect, but damn close is good enough. by prester · · Score: 2, Interesting

    Did you happen to read the article? He discusses this at length. He makes a strong argument that his system is actually pretty robust, since to get around it consistently the spam has to look just like your real email, which is pretty darn hard for them to do.

    In a lot of ways this problem is like cheating in games. As long as you're the only one who knows the exploit, you can be pretty sure that it's not going to get fixed, though you'll still get kicked off every server you play on. Similarly, with his method a spammer might be able to find a particular phrasing that's likely to get through, though his messages will still be deleted on arrival. But even if he does, if he starts sending you too many emails or starts selling his technique the filter will adapt with the spam and start filtering it out.

  70. Using the algo on Slashdot posts by Avakado · · Score: 1

    Can this algorithm also be applied to Slashdot comments, and tell whether or not they will be rated "+5, Interesting"?

    --
    The world will end in 5 minutes. Please log out.
  71. LISP by Anonymous Coward · · Score: 0

    Sexy young hot teen lesbian girls with who lisp.

  72. YES by kotku · · Score: 1
    --
    The bikini - security through obscurity since 1943
    1. Re:YES by matfud · · Score: 1

      Great, they have obtained a patent on classifying text using a classifier. Specifically an SVM, although the patent allows for any classifier to be used. Will the wondrous USPTO never stop? :)

  73. Spammers will just change tactics. by caluml · · Score: 0, Redundant

    Of course, the problem now is that spammers won't use ff0000 as a colour, they won't start with Dear Sir or Madam, and we'll just have to start again.
    I think the best way is to make a similar list of words you find in valid emails, rather than a list of things that occur in spam.

    One idea that I use that I've never seen used anywhere else is to change your email address to:
    user.aug02@domain.co.uk, and that way any spammers will only have a valid address for max 31 days. Change your email address each month. Humans can work it out, bots can't.
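
    A minimal sketch of the monthly address scheme described above, assuming the tag is the three-letter English month plus a two-digit year, as in the user.aug02 example. The function name and the fixed month table are hypothetical, chosen only for illustration.

```python
from datetime import date

# Fixed English month tags, so the result does not depend on the system locale.
MONTHS = ("jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec")


def monthly_address(user: str, domain: str, today: date) -> str:
    """Build an address of the form user.<mon><yy>@domain for the given date."""
    tag = f"{MONTHS[today.month - 1]}{today.year % 100:02d}"
    return f"{user}.{tag}@{domain}"


print(monthly_address("user", "domain.co.uk", date(2002, 8, 15)))
# prints: user.aug02@domain.co.uk
```

    The mail server would also have to accept only the current month's tag, which this sketch does not cover.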

    1. Re:Spammers will just change tactics. by DirkDaring · · Score: 1

      Did you even read the article?

    2. Re:Spammers will just change tactics. by Myco · · Score: 2

      How terrifically annoying it must be to correspond to you. Yeah, humans can work it out. Humans also like to use address books. Humans also don't like to have to check the calendar each time they email someone.

    3. Re:Spammers will just change tactics. by Anonymous Coward · · Score: 0

      Apparently he did not. Because if he had, he would know that no matter what language (aka words, messages) the spammers evolve to using, the statistical model would adjust itself to learn the new techniques.

  74. Just another hoop to jump through? by blink3478 · · Score: 0, Redundant


    Because it is measuring probabilities, the Bayesian approach considers all the evidence in the email, both good and bad. Words that occur disproportionately rarely in spam (like "though" or "tonight" or "apparently") contribute as much to decreasing the probability as bad words like "unsubscribe" and "opt-in" do to increasing it. So an otherwise innocent email that happens to include the word "sex" is not going to get tagged as spam.

    So what's to keep spammers from reading this article, and tailoring their spam to stop using 'hos' and 'ladies' and start including words like 'tonight' and 'apparently'?

    'This week only! All the hiz'oes and liz'adies you could want on our website. Sign up tonight and receive a free two month membership! Apparently we'd uh... like your business!'

    D

  75. Re:They should call it "Spankdot" by Anonymous Coward · · Score: 0, Offtopic

    Taco whacking off to the girls volleyball team?

    You're new here, aren't you.

  76. Non-Bolean buckets by Tablizer · · Score: 2

    I think rather than lump stuff into "spam" and "non-spam", it should assign a ranking number and preferably display a color to represent ranking.

    If you are tired, then you can ignore or defer the grey areas.

    Also, if the display list displayed the first X characters from the content, then one can often check without reading the whole thing. (Perhaps filter out non-indicative words like "the" and "and" to make it more compact to display.)

    I don't think there is one magic technique because spammers will work around it if it gets popular. Thus, a combination of machine and human working together will be more effective IMO.

    1. Re:Non-Bolean buckets by shadow303 · · Score: 1

      Well, that is a fairly trivial modification of his technique. He just checks the probability and uses a threshold to make a boolean result. You could just start assigning colors to probability ranges.
      I'm not sure what the benefit would be to having a few words from the text. For me (and most likely other people as well), that is enough of an inconvenience that I may as well just scan through the entire email.

      --
      I've got a mind like a steel trap - it's got an animal's foot stuck in it.
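
      The idea above of assigning colors to probability ranges instead of a single boolean cutoff could look roughly like this. It is only a sketch: the three color names and both threshold values are assumptions made for illustration, not something taken from the article.

```python
def bucket(p_spam: float, spam_threshold: float = 0.9, ham_threshold: float = 0.1) -> str:
    """Map a spam probability to a display color instead of a yes/no verdict.

    The thresholds and color names are illustrative assumptions.
    """
    if not 0.0 <= p_spam <= 1.0:
        raise ValueError("p_spam must be a probability between 0 and 1")
    if p_spam >= spam_threshold:
        return "red"    # almost certainly spam
    if p_spam <= ham_threshold:
        return "green"  # almost certainly legitimate
    return "grey"       # the grey area a tired reader can defer


for p in (0.99, 0.5, 0.01):
    print(p, bucket(p))
```

      Adding more ranges is just a matter of adding more thresholds; the filter itself does not change.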
  77. Re:Circumvent by kawika · · Score: 2

    >> based on the assumption that it is written in English

    There's no reason to think that Spanish, German, or even Chinese spam doesn't follow the same statistical word frequency rules.

    >> following the simple steps outlined in the URL above

    What if you are subscribed to mailing lists, or have mail bots that send you useful messages (like "your server is down")? The usual answer is "just configure those in advance" but that's a pain and not very robust. My hosting company was bought out and their automated server status messages just started to come from a new domain. If I had this kind of filter I would have missed them.

  78. Nicely done by hrieke · · Score: 3, Interesting

    What I want to know is:
    Would this also work with email virus? I think it would since the virus would also have a defined pattern to it and the program would pick it up after the first one.
    Could this be made part of the SMTP protocol or built into the backbone layer of the network? Again, I see no major reason why it couldn't.
    Problems that I have with it are:
    Since each word is treated as a token and everything else is not, I'm sure that spammer would quickly figure out that a spam like this just might work:
    <HTML>
    <BODY>
    Enlarge <!-- elephant --> penis [etc..]
    </BODY>
    </HTML>
    which would show the message but hide the balancing words, so it could be possible to change the delta into your favor.
    Does anyone else have thoughts on how this might be broken?

    --
    III.IIVIVIXIIVIVIIIVVIIIIXVIIIXIIIIIIIIVIIIIVVIIIV IIVIIIIIIVIII...
    1. Re:Nicely done by certsoft · · Score: 2, Funny
      Enlarge <!-- elephant --> penis [etc..]

      I think most elephants have a large enough penis already.

    2. Re:Nicely done by gwernol · · Score: 2

      Would this also work with email virus? I think it would since the virus would also have a defined pattern to it and the program would pick it up after the first one.
      Could this be made part of the SMTP protocol or built into the backbone layer of the network?


      I would assume it would work against viruses for just the reason you give, although you'd want to run some experiments to confirm that.

      I don't, however, think you want to bury this in the backbone. An important part of this is that it is personalized. Each person needs to gather the statistics for their particular incoming email. If you have a centralized system it would become much easier for the spammers to craft emails that defeat the learnt patterns. If each individual user has a separate set of individually learnt filters it becomes impossible to craft emails that can get through more than a tiny percentage of them.

      Since each word is treated as a token and everything else is not, I'm sure that spammer would quickly figure out that a spam like this just might work:

      Enlarge <!-- elephant --> penis [etc..]

      which would show the message but hide the balancing words, so it could be possible to change the delta into your favor.


      The techniques Paul uses can be extended to cope with this problem. The math is a little more complex but is definitely a known science. Any decent textbook on pattern recognition will give you solutions to this problem.

      --
      Sailing over the event horizon
    3. Re:Nicely done by AnotherBlackHat · · Score: 2

      What I want to know is:
      Would this also work with email virus?

      I could write one that beat it, but it does raise the difficulty significantly.


      Could this be made part of the SMTP protocol or built into the backbone layer of the network?

      No. Why would you want to?
      You could build it into an email client.
      If it works, then sooner or later you'd have to, if you want to sell an email client.


      Problems that I have with it are:
      Since each word is treated as a token and everything else is not, I'm sure that spammer would quickly figure out that a spam like this just might work:
      <HTML>
      <BODY>
      Enlarge <!-- elephant --> penis [etc..]
      </BODY>
      </HTML>
      which would show the message but hide the balancing words, so it could be possible to change the delta into your favor.
      Does anyone else have thoughts on how this might be broken?


      Wouldn't
      Enl<!-- elephant -->arge pen<!-- elephant -->is ...
      be more effective?

    4. Re:Nicely done by Anonymous Coward · · Score: 0

      no, because "!--" would then become a great indicator for spam

    5. Re:Nicely done by GloomyTrousers · · Score: 1
      Nope, he's already considered that. To quote from the paper:
      I also ignore html comments, not even considering them as token separators.
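
      A rough sketch of what "not even considering them as token separators" means in practice: the comments are deleted before the text is split into tokens, so a word broken up by a comment is glued back together. The two regular expressions here are assumptions for illustration and are not the article's actual tokenizer.

```python
import re

# Assumed patterns: an HTML comment, and a token made of letters, digits,
# dollar signs, apostrophes and dashes.
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
TOKEN = re.compile(r"[A-Za-z0-9$'-]+")


def tokenize(text: str) -> list[str]:
    """Delete HTML comments outright, then split what remains into tokens."""
    without_comments = COMMENT.sub("", text)  # replace with nothing, not a space
    return [token.lower() for token in TOKEN.findall(without_comments)]


print(tokenize("Enl<!-- elephant -->arge pen<!-- elephant -->is"))
# prints: ['enlarge', 'penis']
```

      Because the comment is replaced with nothing rather than with a space, the hidden "elephant" never becomes a token and the split words are scored as usual.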
  79. The problem is the existing email infrastructure by dmelomed · · Score: 2, Insightful

    SMTP is designed broken because it:

    1) Allows senders to be faked.
    2) Is slow.
    3) Requires bounces for broken messages.
    4) Allows loops.
    5) Cross-subscription to mailing lists, complicated mailing list management.
    6) MIME.
    7) Add your gripe here.

    See http://cr.yp.to/im2000.html

  80. Re:Circumvent by Anonymous Coward · · Score: 0
    Working with the assumption that *all* spam is sent out by machines, you can easily conclude the need for an automatic process asking the sender to add him or herself to your trusted list by following the simple steps outlined in the URL above. Any literate (i.e. email sender) human can read the bitmap of jumbled text found in the URL above, but a computer can't.

    Mailing lists are sent out by machine too. As a list administrator, if you sign up and I get one of these, then you're not going to be signed up much longer.

    One is not a problem, 10 is an annoyance, and when you're dealing with 10k people subscribed, it's a royal pain in the ass.

    So my automated bounce system will unsubscribe you, even if you have given permission for the list, paid for the list, or whatever.

  81. Stripping HTML out of emails? by caluml · · Score: 1

    Is there any way in Postfix/Sendmail/Exim/Whatever to strip HTML tags out of incoming mails?

  82. Only way to stop spam is to beat spammers down. by stuartkahler · · Score: 2, Interesting

    Laws will never stop spammers. The damages are very hard to prove, especially when the judge/jury don't realize that their ISP filters their mail for 95+% of the spam already. Most people just don't GET it. And most spammers are sending the spam from another country, running a fly-by-night operation, so prosecution is nearly impossible.
    Filters are helpful, but they still require huge resources to receive the e-mail and process it. And as stated in the article, the risk of a false positive is often much worse than just receiving the spam.
    There are already only a few mail relays that are willing to send out spam, and virtually nobody accepts ANY mail from them. The spam going out is coming through illegally used mail servers. This shows what is to be the solution to the problem of spam: ISPs will only act to stop spam when the spammer is damaging their system.
    Most spam gets deleted without the enclosed links getting clicked by at least 99%. The company hosting the web site just sees their customer getting some success with their business. They don't know why, and they really don't believe/care when someone e-mails them to say that the user spammed them from a mail relay in china. The user probably paid for a 2 gig/month of traffic, and they are well under quota.
    It's time to change that. With a SETI@Home / Prime95 type application, we could easily DDoS a daily spammer off the net. Slashdot alone could easily field 10000 users willing to put their cable modems up to the task of pounding spammers accounts (and possibly the hosting ISP) off the net. Beat them down until the account appears to be deleted. Maybe then ISPs would hold users accountable for being spammers. Web hosting contracts might start including fines ($500+) for abusing the service, rather than just the scary risk of a cancelled account. All we have to do is beat them down before the few clueless morons come buying and make it worth their while.

    Legal? Sure, I don't see why not. I can send 10 HTTP requests to the ISP in a second... I've never heard of a law that says I can't do that every second. As long as the computers involved are from willing users (sysadmins get permission in writing first), there is no 'hacking'. Every DDoS case I've ever heard of involved charges of 2k+ computers 'hacked', rather than the ensuing attack. Even if it is illegal, this is vigilantism that nobody (other than the hosting ISP) is going to complain about.

  83. Another use for this approach by kpharmer · · Score: 1

    In addition to looking for spam you could use this approach to initially filter your valid email into various topical inboxes.

    So, with some modifications it could move all the mail you get from work, from various hobby/educational/professional listserves and put them into separate folders for you to scan.

    I'd find this tremendously useful, and it would probably even enable me to subscribe to a few more listserves - without the worry of being buried in the resulting email.

  84. Incorrect statistics by SiliconEntity · · Score: 4, Insightful
    Based on my corpus, "sex" indicates a .97 probability of the containing email being a spam, whereas "sexy" indicates .99 probability. And Bayes' Rule, equally unambiguous, says that an email containing both words would, in the (unlikely) absence of any other evidence, have a 99.97% chance of being a spam.

    This reasoning is statistically invalid. It is only true if the chance of the word "sexy" appearing in a message is independent of the chance of the word "sex" appearing. In other words, only if knowing that the word "sex" appears tells you nothing about how likely the word "sexy" is to appear, can you reason as he is doing above. That's probably a very poor assumption in this case.

    He is doing:

    p(sex & sexy) = p(sex) * p(sexy)
    The correct formula is:
    p(sex & sexy) = p(sex) * p(sexy | sex)
    where the last term means the probability of "sexy" given that "sex" appears.

    Maybe his approach is good enough for his purposes, but the statistical foundations are not correct.

    1. Re:Incorrect statistics by anthony_dipierro · · Score: 2

      I wonder if you could use google to find out the expected correlation.

      Sex: 76,200,000 results

      Sexy: 15,900,000 results

      Sexy Sex: 2,010,000 results

      We know that google indexes 2,469,940,685 pages. So P(Sex)=3.1% (wow, 3% of web pages contain the word sex). P(Sexy)=0.64%. P(Sex & Sexy)=0.081%. P(Sexy|Sex)=2.6%. P(Sex|Sexy)=12.6%.
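
      A small sketch that recomputes these percentages from the hit counts quoted above. It assumes that the "Sexy Sex" result count means pages containing both words, and that the counts are exact; the variable names are mine.

```python
# Hit counts as quoted in the comment above.
TOTAL_PAGES = 2_469_940_685
SEX = 76_200_000
SEXY = 15_900_000
BOTH = 2_010_000  # assumed to be pages containing both words

p_sex = SEX / TOTAL_PAGES
p_sexy = SEXY / TOTAL_PAGES
p_both = BOTH / TOTAL_PAGES
p_sexy_given_sex = BOTH / SEX    # P(Sexy | Sex)
p_sex_given_sexy = BOTH / SEXY   # P(Sex | Sexy)

print(f"P(Sex)        = {p_sex:.2%}")
print(f"P(Sexy)       = {p_sexy:.2%}")
print(f"P(Sex & Sexy) = {p_both:.3%}")
print(f"P(Sexy|Sex)   = {p_sexy_given_sex:.1%}")
print(f"P(Sex|Sexy)   = {p_sex_given_sexy:.1%}")

# If the two words were independent, the joint probability would equal the product.
print(f"P(Sex) * P(Sexy) = {p_sex * p_sexy:.3%}")
```

      On these counts the joint probability comes out several times larger than the product of the two marginals, which is the dependence between the words that the parent comment is pointing at.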

    2. Re:Incorrect statistics by Broccolist · · Score: 4, Informative
      In other words, only if knowing that the word "sex" appears tells you nothing about how likely the word "sexy" is to appear, can you reason as he is doing above. That's probably a very poor assumption in this case.

      Graham is using a naive Bayes text classifier here, which is a pretty common approach. The naive classifier, as you perceptively point out, does rely on the obviously incorrect assumption that the appearance of any word is independent of all other words. But:

      1. It's computationally impossible to be as statistically rigorous as you would like. If we had to keep a probability table of every word given every other word, we'd have awful combinatorial explosion. Even today's most powerful supercomputers would be unable to classify spam :).
      2. The naive Bayes classifier, despite the incorrect assumption, has been empirically shown to be one of the best algorithms for dividing text documents into categories. Because of the variety of words and very small correlation between words in different sentences, the assumption seems to do very little harm.

      Your objection is one of the reasons why AI researchers shunned Bayesian methods for so long: in practice it's impossible to implement them rigorously. Unfortunately, building a completely rational system is not tractable without a planet-sized computer. The only viable solution is to make compromises: just like humans do, when they skip steps and make not-100%-warranted assumptions in their reasoning.

    3. Re:Incorrect statistics by Furry+Ice · · Score: 1

      Do the math for Christ's sake:

      p(sex) * p(sexy) = 0.97 * 0.99 = 0.9603 != 0.9997

      Paul Graham knows his shit.
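
      For anyone who wants to check the arithmetic: as I read the article, two per-word probabilities a and b are combined as ab / (ab + (1-a)(1-b)), not as a plain product. Here is a minimal sketch of that rule generalized to any number of words. It assumes the words are independent and that spam and non-spam are equally likely beforehand; the function name is hypothetical.

```python
from math import prod


def combine(probs):
    """Combine per-word spam probabilities under the naive independence assumption.

    Assumes equal prior odds of spam and non-spam.
    """
    spam = prod(probs)                  # evidence that the mail is spam
    ham = prod(1.0 - p for p in probs)  # evidence that it is not
    return spam / (spam + ham)


print(f"{0.97 * 0.99:.4f}")            # plain product, prints 0.9603
print(f"{combine([0.97, 0.99]):.4f}")  # combined probability, prints 0.9997
```

      The second number matches the 99.97% figure quoted from the article, so the formula being criticized upthread is not the one actually in use.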

    4. Re:Incorrect statistics by batsman · · Score: 1

      He should be doing this:

      p(SPAM|sex,sexy) = p(SPAM,sex,sexy)/p(sex,sexy)=
      = p(SPAM,sex,sexy)/(p(sex)*p(sexy|sex))=
      = p(SPAM,sex,sexy)/(p(sexy)*p(sex|sexy))

      Here, p(SPAM,sex,sexy), p(sex), p(sexy), p(sex|sexy) and p(sexy|sex) are known. The system finds p(SPAM|sex, sexy).

      Generalizing, this system is finding p(SPAM|X0,X1,...,Xn) = p(SPAM,X0,...,Xn)/(p(X0)p(X1|X0)...p(Xn|Xn-1,...,X0))
      The Xi are the "most interesting" words, that is, those for which (p(SPAM|Xi)-0.5)^2 is greatest.
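
      The selection step mentioned above, picking the words whose spam probability lies furthest from 0.5, can be sketched as follows. The sample probabilities are made up, and the default of 15 tokens is what I recall the article using, so treat both as assumptions.

```python
def most_interesting(word_probs, n=15):
    """Return the n (word, probability) pairs furthest from the neutral 0.5."""
    return sorted(
        word_probs.items(),
        key=lambda item: (item[1] - 0.5) ** 2,  # squared distance from neutral
        reverse=True,
    )[:n]


# Made-up per-word probabilities, for illustration only.
sample = {"sex": 0.97, "though": 0.02, "the": 0.5, "tonight": 0.1}
print(most_interesting(sample, n=2))
# prints: [('though', 0.02), ('sex', 0.97)]
```

      Note that a strongly innocent word such as "though" counts as just as interesting as a strongly spammy one, since only the distance from 0.5 matters.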

  85. Re:A weak point...statistics by Anonymous Coward · · Score: 0

    Well I'm wondering if spammers could bias the score by simply adding a list of "counter-words"

    Kind of the way web sites bias themselves with search engines.

    Remember statistics do have weak points. Think "Lies,damn lies,and..."

  86. An slightly better approach to this idea: by zaqattack911 · · Score: 1
    If each person has their own DB/hash of spam words and probabilities, then it's quite likely a spammer could still get through at some point.

    if it does, it might take a few emails of the exact same kind of spam until your filter starts ignoring it.

    Why not have a centralized free online database of the same kind, that way as soon as an email is sent to just a few people, the filter starts to recognize them right away.

    So basically your mail program would be contributing and borrowing the hashtable from this central DB online for each "mail session".

    If everyone used this, a spammer would be stopped dead in his tracks after the first few emails sent.

    --me

    1. Re:An slightly better approach to this idea: by Garridan · · Score: 1

      Consider this: If I sign up for 3 porn sites, and 2 shady etailers, assume 50 spammers will receive my email address. Each of these spammers will send me 5-20 emails in a week. My address will slowly leak into more spammers' hands, but generally ones in the same market; so the spams will all be similar according to the Bayesian probability. That way, if my filter is tuned exactly to the spam that I receive on a regular basis, that's all I need. But if I use a global hashtable, either the chances of false-positives will increase because the filter will get more and more paranoid, or the chances of false-negatives will increase because the filter will be looking at a broader set of words.

      And every user should have their own non-spam hashtable... my friends don't write like the average netizen... and as a result, I think that most people write email that looks like spam. In short, my hashtable would expect a much higher level of intelligence than the spammer/skriptkiddie/14-year-old-punk who I don't want to hear from.

  87. The problem is obvious by anthony_dipierro · · Score: 2

    Your filter's usefulness is inversely proportional to the number of people who use it, since it is trivial to bypass by a spammer who knows its details.

    1. Re:The problem is obvious by Jadsky · · Score: 1

      The article clearly explains why this is not a problem at all.

      The point is that each person's inbox would be uniquely filtered based on the general content of emails they receive.

      So.... your filter's usefulness would be inversely proportional to 1. That's also 1. I'll take a filter that's 100% useful, thanks.

    2. Re:The problem is obvious by anthony_dipierro · · Score: 2

      The point is that each person's inbox would be uniquely filtered based on the general content of emails they receive.

      Then either this man has discovered artificial intelligence or he is being overly optimistic about how well an automated system such as this is going to work. If you can separate spam from non-spam based on the content of the message against an active human attacker, you've just passed the turing test, as far as I'm concerned.

    3. Re:The problem is obvious by anthony_dipierro · · Score: 2

      So.... your filter's usefulness would be inversely proportional to 1. That's also 1. I'll take a filter that's 100% useful, thanks.

      By the way, 0.0000000000000000000000000001 is also inversely proportional to 1.

  88. don't delete email by Anonymous Coward · · Score: 0

    Space is cheap. It would be far better to NEVER DELETE YOUR EMAIL. Instead you should just toss it into different folders. A folder per mailing list, a folder for verified spam, a folder for filtered mail(suspected spam), and a general inbox works well.

    Michael

  89. Yet Another Permission-based solution by Jeff+Fohl · · Score: 1

    I agree - I think things have gotten so bad, that it might not be practical to use algorithms to detect spam. I am using a permission-based system like the Si20. It is called ChoiceMail and it is put out by DigiPortal. If a spammer wants to send you email, they must first ask your permission. If it is a friend, you just give them the OK, and they are forever on your whitelist. I have been using this for about a month, and I too, get ZERO spam.

  90. Mailing list hell by ajs · · Score: 3, Insightful

    Can you imagine the day everyone uses this. You send mail to a public list and get back 2000 messages asking you to "authenticate" yourself.

    This is a bad plan for working in the large.

  91. spam.NET by Lord+Omlette · · Score: 2

    Clippy/BOB/etc were based on Bayesian techniques, right? Does this mean M$ could soon build this into Exchange/Outlook?

    dundunDUNNNN

    --
    [o]_O
  92. But I think it could be easily circumvented .. by vinays · · Score: 0, Redundant

    As described, it would be very hard for legit spam to get through.. However, what I'm thinking is that they could have their normal 5 KB of email which is spam .. and at the bottom .. (or anywhere else) , just add 20 KB of words they know are "good words" .. throw html comment tags around it and it's never seen by the viewer ... but the large amount of "good words" outnumbers the "bad words" , causing a spam msg to be considered good...

    I don't know if that'll really work.. but it's a thought

    --

    "cogito, ergo sum"
  93. If spam reaches the filter, you have lost by Anonymous Coward · · Score: 0
    The spammer has already stolen network resources in getting the spam to you, and you end up paying for the transmission whether you read it or not.

    I think DNSBLs and legal actions can be effective, and perhaps additional approaches will arise, but filtering should only be a temporary tactic because the victim ends up paying for the spam anyway.

    1. Re:If spam reaches the filter, you have lost by wheany · · Score: 1

      You have lost the bandwidth, but you have not been annoyed yet. To many people that is more important.

  94. Missing the point by g8oz · · Score: 1
    Read the article, please. The point is trying to come up with filter words yourself is not scalable. Let the software do it for you.

    The beautiful thing about this approach that people seem to be missing is that it evolves as spam does.


    I don't know how it will work with images though

  95. Or micro-payments. by beanyk · · Score: 1

    How about *paying* for e-mails?

    It's been suggested before, but if all e-mails had a small (say US$0.02) charge associated with sending them, bulk e-mailers would have to be much more careful. They like it because it's virtually free, so a tiny tiny number of replies will pay off. If you change the economics, you change their business model.

    1. Re:Or micro-payments. by Anonymous Coward · · Score: 0

      I first read a suggestion like this, although I think "charge" may have been "tax" back then, several months ago.

      Each day, I come closer and closer to deciding that this is a good idea. 50-cents a week to be spam free?

      However, consider what your mailbox looks like. There are a lot of people out there willing to spend A LOT of money sending postal spam and 2-cents will still seem like a bargain.

    2. Re:Or micro-payments. by Anonymous Coward · · Score: 0
      Dorothy Denning suggested this about 15 years ago.
      I think she wrote about it in the CACM.


      It never caught on, personally, I think it is time.

  96. fighting spam by frovingslosh · · Score: 3, Interesting
    None of what I saw in the article is, in my mind, effective in fighting spam for the following reasons:

    By the time one can apply the filters, you have already received the spam. This is a load on your resources. In some cases your in-box may even fill up (yes, I've received 1000's of the same piece of spam in the same hour, exceeding the capacity of my allotted storage and effectively DOSing me from real e-mail) or you may exceed limitations from forwarding services.

    The spammers don't really care. Or notice. Their goal is to hit millions of victims, knowing that some of them will respond. The response is all they care about. Filter your e-mail all you want, you were not going to respond to them anyway. All they care about is reaching the mark that doesn't know any better, and this filter doesn't do anything to stop that (unless it is applied automatically by ISP's, unlikely due to the fear of false positives).

    What might help is a two fold attack on what they want: responses from marks. I suggest the following:

    A massive education campaign to educate the general Internet user to never respond to (or even read) strange messages that show up in your e-mail. Banner ads would seem a good place to start, it would be a public service if a good percentage of banners were replaced with ones that educated the Internet users who still make spam profitable. This might even have the long term effect of improving banner revenue: if banners compete with spam as a way to get out a message they have a lower value than if the public is taught to not buy from spam and even to aggressively resist doing business with a spammer. In the long run an antispam banner campaign could improve banner revenue for those who help fight spam. Ideally another great way to get the word out would be UCE, but that poses a moral dilemma....

    The other thing that could affect the spammer is if the ads are not getting the desired results with the advertisers. What needs to happen here isn't filtering, it's massive negative response to the advertiser. No response doesn't hurt them, but making them respond themselves to unwanted responses is a more suitable way to respond to those who originate unwanted messages to us in the first place. These people need to get responses that waste their time and resources like they are wasting ours. Obviously those who supply 800 numbers are a prime target for this, while those who supply only postal addresses make it too costly to respond. I think such negative response campaigns need to be coordinated from major popular sites to be truly effective (not just from a few geeks who spend their day on an anti-spam website. Their efforts are much better applied by getting the spam sources in black holes and getting ISP's to block or filter spam). It sure would be nice to see the slashdot effect applied to spammers rather than the poor schmuck who puts up a small but interesting website.

    Interested in other's thoughts in this area.

    --
    I'm an American. I love this country and the freedoms that we used to have.
  97. D.O.S. attack on spammer sites? by Tablizer · · Score: 2

    Just out of curiosity, what if a bunch of geeks set up servers to DOS-flood sites that spammed? (This would not be the return address, since those are usually phony, but the website that sells the goods being advertised.)

    If such was possible, then Viagra.com would think twice about starting another spam campaign.

    1. Re:D.O.S. attack on spammer sites? by wheany · · Score: 1

      Two wrongs don't make a right.

      Don't sink to their level.

      Slashdot requires me to wait 2 minutes between each successful posting of a comment to allow everyone a fair chance at posting a comment.

      It's been 1 minute since I last successfully posted a comment

  98. Bullshit! by www.sorehands.com · · Score: 5, Insightful
    Another spammer lie.

    Freedom of speech is not the freedom to trespass on my computer equipment, use my resources for me to listen to your advertising!

    This is not a prohibition on your paying your money to spread your advertising. This is a prohibition on you spending my money to spread your advertising.


    Commercial speech does have some constitutional protection, but not to the same level as non-commercial speech. But even with pure political speech, there is no requirement for me to pay for your speech.


    As for hitting the delete key, at that point, you have already tied up at least 2 of my computers used my disk storage, my time, my bandwidth without paying for it.


    If you want to spam, no problem, just pay me in advance.

    1. Re:Bullshit! by Anonymous Coward · · Score: 0

      Double Bullshit!

      The internet is an opt in system. When you hook up a system to receive email you opt in to the system. You do it by choice. No one is forcing you to log on.

    2. Re:Bullshit! by j7953 · · Score: 2
      Commercial speech does have some constitutional protection, but not to the same level as non-commercial speech. But even with pure political speech, there is no requirement for me to pay for your speech.

      In fact, I would claim that Graham's approach to spam provides a much better protection of freedom of (especially political) speech than any other method. If until now you never received political spam, then his filtering method will probably rate the mail average, maybe even slightly positive. If you decide to delete-as-spam, the filter will "learn" to recognize political mails as something you don't want. If you decide to read the mail, it will "learn" to let future political mails get through as well.

      --
      Sig (appended to the end of comments I post, 54 chars)
  99. World's Largest Corpus of Spam by MagnaMark · · Score: 1

    The article proposes that "one cooperative project that I think really would be a good idea would be to accumulate a giant corpus of spam."

    This brings to mind a huge, quivering, pink mass of luncheon meat sitting on the Midwestern prairie just down the road from the World's Largest Ball of Twine.

  100. Solution by Anonymous Coward · · Score: 0

    Nukeeeeeeeem!

  101. Great quote by fizban · · Score: 2, Funny

    This is the best paragraph of the whole article:

    So as spammers start using "c0ck" instead of "cock" to evade simple-minded spam filters based on individual words, Bayesian filters automatically notice. Indeed, "c0ck" is far more damning evidence than "cock", and Bayesian filters know precisely how much more.

    The Bayesian filter. You can run, but you can't hide!

    --

    +1 Insightful, -1 Troll. What can I say, I'm an Insightful Troll.

  102. Re:Circumvent by Anonymous Coward · · Score: 0

    Leave it to the Slashbots to try to destroy the rule with the exception.

    He can add them himself, dolt.

  103. important point, mod up by Anonymous Coward · · Score: 0

    funny, insightful, whatever.

    I mean really, most spammers are desperate,"be-your-own-boss" morons. Well, maybe just misguided, but yeah.

  104. to modders by oktaya · · Score: 0, Offtopic

    what kind of modding is this?

    (4 - Informative) ???

    it's just a paragraph from the original article. You modders would already be informed if you'd read the article.

    oktay

    --
    ---------------
    Founder of the The Free Linux CD Project
  105. He covers this. by Andy+Dodd · · Score: 2

    In addition to filters being individually tuned, the system allows for "whitelists" - Any mail address on the user's whitelist automagically bypasses the filter.

    The difference between this and other whitelist approaches is that "new" people who are sending you legit mail (Like Horny Teenager's latest BF/GF) will likely get through, as opposed to having to authenticate in some manner.

    --
    retrorocket.o not found, launch anyway?
    1. Re:He covers this. by SpamJunkie · · Score: 1

      Whitelists will soon be spam's entry point. All those "somebody likes you" messages are building a database of who-knows-who. Some of my stupid friends have even had some sent to me. One day all the mail will be spoofed from them and it'll get right through our spam blocking.

  106. Spam is an unnecessary evil. by plover · · Score: 2
    I'd kept my address clean for many years, but I got bit because I wrote a letter to the editor of a scientific journal, who reposted it with my email address including a mailto: URL. God that hurt.

    I had my first spam before I received my e-mailed copy of the journal. It was "related" to the topic of the journal, and said something like "I sure agree with what you wrote about in the journal. What's your opinion about http://my.url ?" But the To: line was the clue. It included not only me@myrealaddress.com but also that of smart.guy@nospam.address.com (another poster in the journal.) It was very apparent that the author had simply harvested the HTML and dropped it into his address database.

    It was only hours before I was getting offers for detoxifying myself, HGH, climax gel and all-free teen pr0n.

    --
    John
  107. Beware Statistics by DoctorNathaniel · · Score: 2, Redundant

    A few quick comments about this. Although powerful, such approaches suffer from being somewhat too 'black-box'. That is, you turn control over to the computer to make decisions based upon statistical recurrences. This leaves you very vulnerable to several problems.

    For instance, the author remarks that he believes a bigger corpus of spam would help train filters. That's true, but misleading: it would help train filters that distinguish between his 'nonspam' corpus and his 'spam' corpus. In this case, he is surely increasing his true positives, that is, his rejection of things that really are spam. But his false-positive rate is not helped at all, because his samples are so biased.

    (Example: 10 spams get the word 'blunderbuss' but he has no regular email with that word. Therefore, any future email may be rejected because of the word 'blunderbuss', even though there is no basis to know whether the word CAN be used legitimately.)

    If the system is done intelligently, this will simply mean that having a lopsided sample will do nothing (the resolving power will be dominated by the smaller of the two samples), but this may be counterintuitive to some.

    Another problem is that you don't know WHY choices are being made, and that's bad science. Ok, ok, so this isn't science, it's Spam prevention, but I like science.

    ---N

    1. Re:Beware Statistics by truthspirit · · Score: 1

      I don't think you read the article closely enough.

      Statistical analysis gets more accurate the more complete the model becomes. This is why he is analyzing BOTH the spam and the 'innocent' mail.

      The filter uses the most _interesting_ words not the most spam-likely words.

      Even if twenty spammers were selling blunderbusses (somewhat unlikely) and then my friend from New York, who is a history major, sent me an article he wrote about ancient firearms, this method would more than likely overlook the occurrence of 'blunderbuss', or even several occurrences, because of the much larger percentage of innocent words in the document.

      So, if you care to investigate the statistical principles behind such a system, you know exactly why the choices are being made. It IS science, actually... applied to SPAM. The statistical principles involved have been around for much longer than email.

      -truthspirit

    2. Re:Beware Statistics by drxenos · · Score: 1

      How come some people always think "random" when they hear "statistical"?

      --


      Anonymous Cowards suck.
    3. Re:Beware Statistics by bedessen · · Score: 2

      Example: 10 spams get the word 'blunderbuss' but he has no regular email with that word. Therefore, any future email may be rejected because of the word 'blunderbuss', even though there is no basis to know whether the word CAN be used legitimately.

      Not likely. Remember, the algorithm only considers the top 15 most interesting words of the whole email. Interesting means words that are close to either extreme in percentage. If only 10 spams contained 'blunderbuss' (out of however many thousands in the "spam corpus" used to establish the wordlist) then its percentage would be near the middle, since it was only present in 10 out of thousands of spams. So it will probably not be one of the 15 most interesting words -- if it is a legitimate email there will certainly be a lot of low-score words (near zero, i.e. common to many legit emails) and these are what are considered when judging the message.

      In order for 'blunderbuss' to cause a message to be marked as spam, 'blunderbuss' would have to have been present in thousands of previous emails known to be spam, and the message would have to have a near absence of any words common to thousands of legit emails. If this were the case then it probably was indeed spam, and the algorithm predicted correctly.
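
      The selection step described above can be sketched roughly in Python. This is only an illustration under stated assumptions: the `probs` table, its values, the function name, and the 0.4 default for words never seen before are all made up here, not taken from the article's code.

```python
def most_interesting(tokens, probs, n=15, unknown=0.4):
    """Return up to n (word, probability) pairs farthest from the neutral 0.5."""
    seen = {}
    for word in tokens:
        # Each distinct word counts once; unseen words get the assumed default.
        seen[word] = probs.get(word, unknown)
    # "Interesting" means far from 0.5 in either direction (very spammy or very innocent).
    ranked = sorted(seen.items(), key=lambda item: abs(item[1] - 0.5), reverse=True)
    return ranked[:n]


# Hypothetical per-word spam probabilities learned from the two corpora.
probs = {"blunderbuss": 0.60, "lisp": 0.01, "thesis": 0.02, "click": 0.98}
message = "my thesis on the blunderbuss mentions lisp".split()
top = most_interesting(message, probs, n=2)
```

      With these made-up numbers, the near-neutral 'blunderbuss' loses out to the strongly innocent words, which is the behaviour argued for above.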

  108. Paul Graham is a f'ing ignorant liar. by Anonymous Coward · · Score: 0

    who is a rabid lisp loser.
    Please stop posting crap from people like him.

    Thread on google
    http://makeashorterlink.com/?N2AC15981

  109. ASK! Re:Another way to stop Spam by kwerle · · Score: 2

    ASK is a system similar to yours with some tweaks:
    If you send someone email, and they reply to it, leaving in your .sig, they are automatically whitelisted.

    Mailing lists are handled automagically.

    Check it out:
    http://a-s-k.sf.net

  110. Why not reduced to practice? by balamw · · Score: 2, Interesting

    The built-in spam filters for Outlook and Hotmail are just so much less efficient than SpamAssassin or Razor/SpamNET.

    My recent experience shows about 90% of the spam I get can be detected by SpamAssassin, 70% by SpamNET and about the same for Hotmail. The Outlook/Outlook Express filters are basically blacklists and catch maybe 40% if properly maintained.

    It does sound very similar, so why haven't they been able to implement a Bayesian filter as successfully as the lisp guru?

  111. Hacktivism? by Andy+Dodd · · Score: 2

    What's the legality of a DDoS where each attacker is an individual person and not a "zombie"

    I recall during the RIAA DoS discussion there were some methods of DoSing that were rather legit. (Slow HTTP request for instance - G, sleep 5, E, sleep 5, T, sleep 5, etc etc. Not a huge bandwidth hog but wreaks havoc with HTTP servers if enough people do it.)

    --
    retrorocket.o not found, launch anyway?
  112. Just got some spam. by perlyking · · Score: 2

    Well, my spamfile did (thanks procmail) and I submitted it to spamcop and noticed they have a freephone number.
    Now I could ring up and it would cost THEM money, which is a little teensy bit of payback - but imagine posting that freephone number on a site somewhere where like minded people hang out. They could all ring up the number, cost the company money and tie up their staff by chatting to them about their product.
    It might make them rethink their spamming.

    --
    no sig.
  113. The design goals of SpamAssassin by belphegore · · Score: 4, Informative

    Paul is taking an interesting approach here, but he's not correct in saying that SpamAssassin doesn't use a statistical approach. He has a bit of a point in noting that his system will generate a prediction probability which is more intuitive than SpamAssassin's scoring system in terms of determining how likely a message is to be spam, but there is also an attractive element to the simplified, non-math way that SA uses scores, which allows them to be more understandable to non-math people.
    Seems like a number of the points which Paul makes in the article about spammers being defeatable, about the basic premise that they must get their message through in order to be successful, and that the war on spam is winnable are extensions from my interview with Salon a few months back, but his statistical approach fails to make use of one factor which I believe is critical (and which SpamAssassin attempts to exploit), which is that those commercial messages must convey a commercial message, in other words, they have to be a message, and have some sort of linguistic component which encourages the reader to do something. A purely statistical approach to spam filtering will lose the power of doing analysis of higher-order linguistic concepts.
    SpamAssassin's approach is to use the universe's best known natural language processors (humans) to build rules which they believe can differentiate linguistic elements of spam vs nonspam messages, and then use the best optimization and statistical tools we have (currently only using decent tools, not the best tools) to determine how to score those rules against individual messages. The scoring system is very simplistic today, just being a simple sum of the scores of the various rules (though it's slightly nonlinear because of the properties of some of the rules, like the auto-whitelist). Future SpamAssassin development directions include extending the scoring system to be much more non-linear, including examining statistically the frequency of occurrence of combinations of rule triggers.
    Automated rule-creation certainly has its place (for example, SpamAssassin's spam-phrase rule, or the auto-whitelist), but I truly believe that the ideal spam filtering system will always have to make the best use it can of human language processing skills. Using this combination of human/computer power, I believe that SpamAssassin can (and often does for many existing users) achieve better ROC performance than anything else.

  114. It depends on your definition of spam by MemeRot · · Score: 1

    He proposes you define it as unsolicited automated mail. But that's not it exactly. It's only automated unsolicited mail that you don't want. If he had been looking for that raleigh three speed and had happened to get unsolicited automated mail offering him one, he would have been delighted to get that piece of spam. So sometimes you don't know you wanted it until AFTER you've read it. I would rather avoid filtering on headers if possible.... if the above email came from the same open mail relay as 2 tons of porn email, that doesn't change the fact that I would want the above email anyway.

    Spam is mail you don't want. The automated feature is irrelevant. If an army of trained monkeys were copy and pasting the mail to you by hand, would this make it not spam? Of course not.

    Is it mail you want just because your friend sent it to you? Even though it's a forwarded chain letter? No, then it's junk.

    The goal of filters shouldn't be to filter out automated unsolicited mail, it should be to filter out mail you don't want. So if you are a horny teenager you might want to let all the sex mails through.... that doesn't mean they're not spam. But the spam status is really irrelevant. Very good article.... just replace 'delete as spam' with 'delete as unwanted'.

    1. Re:It depends on your definition of spam by blazin · · Score: 2

      Even if I did want the Raleigh 3-speed, and I received an email (unsolicited, automated, spam, whatever) from anyone who I haven't spoken to before about my desire for the bike, I would want it deleted.

      Even if I was looking for and really, really, really wanted that bike, if the opportunity to purchase it came to me through a spam email, I would still delete it. I don't want to give the spammer any revenue or to encourage spam in any way.

    2. Re:It depends on your definition of spam by iabervon · · Score: 2

      Even if some spam is actually offering you something you want, you probably don't want to receive it in your email. You may not want to send your spam to /dev/null, though, if you might get an offer for something you actually want. In this case, you'd want to set up a file that it gets sent to, and then search it for interesting stuff on occasion.

      Of course, anyone offering something interesting to a list of email addresses probably also has a web site, and you could find the information with Google when you want it.

    3. Re:It depends on your definition of spam by benedict · · Score: 1

      Spam is not "mail you don't want". Spam is
      unsolicited bulk (or, as Graham puts it, automated)
      email.

      Mail from an obnoxious person whom you don't like
      can be mail you don't want, but it's not spam.
      UBE advertising a product or service that you want
      to buy can be mail you want, but still spam.

      A theoretically perfect mail filter might
      distinguish between mail you want and mail you
      don't want, but that's hard. It's much harder
      than distinguishing between spam and non-spam.

      --
      Ben "You have your mind on computers, it seems."
  115. Re:Circumvent by mariube · · Score: 1
    This is a nice concept. His algorithm works because spam uses the same repetitive syntax. Because so many spam/emails are sent out - it can be flagged by pattern recognition... based on the assumption that it is written in English! It would also probably flag Spam parodies written by friends, or marketing information you were actually subscribed to...

    My primary concern comes from the fact that most of the spams I receive are either Korean or English, while most of the legitimate mails are in Norwegian. Sending me Korean mail is pointless anyhow, but I fear that simply the _use_ of English will make his scheme produce lots of false positives.

    Oh well, I'll probably make my own authentication scheme. It does seem like the way to go. Or, of course, I could subscribe to a few mailing lists just to give his algorithm more entropy to work with.

  116. mod up, we need this by moogla · · Score: 2

    I was going to post a message suggesting exactly this but you beat me to the punch.

    Why doesn't my email app have this already??? :-D

    --
    Black holes are where the Matrix raised SIGFPE
  117. Re:Too bad! Patented By Microsoft by kawika · · Score: 2, Interesting

    Wow. It's described down to a level of detail that would make you think they've already written the Outlook add-in for it. I wonder why we haven't seen it yet?

  118. We should just start killing spammers. by Anonymous Coward · · Score: 0

    The only way to end spam is to end human life.

    Extreme but 100% effective.

    We don't even have to commit murder, we can just chain them up and leave them to rot in their basements. They don't have any friends so no one would come to set them free.

    And even if they did have someone they owed money to or something and they came by and set them free. I bet the probability of them sending more spam would be greatly diminished.

  119. My favourite bit... by Anonymous Coward · · Score: 0

    ...is in the footnotes:

    [2] As a rule of thumb, the more qualifiers there are before the name of a country, the more corrupt the rulers. A country called The Socialist People's Democratic Republic of X is probably the last place in the world you'd want to live.

    So what does that make of United States of America?

  120. The CRM114 system uses this, plus more. by Anonymous Coward · · Score: 1, Interesting

    The CRM114 active filter uses the Bayesian
    technique described, but extends the probabilities
    to _phrases_ (including interrupted phrases) not just words.

    For example, the phrase

    Mary had a little lamb

    would insert hash marker entries on

    mary, had, a, little, lamb, mary had, mary a,
    mary little, mary lamb, mary had a, mary had
    little, mary had lamb, mary had a little

    and so on. My experiments say that you are
    just about out of significance at five words
    and it doesn't pay to go past that.

    The advantage of this is that it's often not
    words, but phrases that have the higher-level
    "meaning" (grammatical context?) that is even
    _more_ indicative of spam versus nonspam than
    the singular words taken alone.

    You can grab crm114 at:

    http://crm114.sourceforge.net

    -WSY
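
    A rough Python sketch of that phrase expansion, for illustration only. It is not CRM114's actual implementation: the function name, the anchoring of each feature on the first word of a five-word window, and the use of plain strings instead of hash entries are all assumptions made here.

```python
from itertools import combinations


def phrase_features(tokens, window=5):
    """Expand a token list into words plus in-order, possibly interrupted phrases."""
    features = []
    for i, first in enumerate(tokens):
        # Words that may follow `first` inside the same window.
        rest = tokens[i + 1 : i + window]
        for k in range(len(rest) + 1):
            # combinations() keeps the original word order, so phrases are
            # never reversed; skipping words gives the "interrupted" phrases.
            for combo in combinations(rest, k):
                features.append(" ".join((first,) + combo))
    return features


features = phrase_features("mary had a little lamb".split())
```

    For the five-word example this yields 31 features, including the single words and phrases such as "mary had lamb". A real filter would presumably hash each feature and count it per corpus, the same way single words are counted.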

  121. Uh...... by MemeRot · · Score: 1

    Isn't it better to worry about the 'evil' html up on web pages rather than in emails? Fscking warez sites use 10 times as much evil html tricks as spammers.... where's the outrage there?

    Fscking lop.com for example.... took so goddam long to clean that shit off my system.

    1. Re:Uh...... by pmz · · Score: 2

      Isn't it better to worry about the 'evil' html up on web pages rather than in emails?

      For most people, HTML e-mail is not "opt in". Just by browsing their inbox, they could be sending out requests to spammers' websites, thanks to e-mail clients that have preview features and HTML-rendering features.

      Actively browsing the WWW, such as going to a warez site, is "opt in" just like going to the shopping mall. Browsers, such as Mozilla, which allow user control over JavaScript and cookies can help mitigate the risk of browsing the WWW (just like hiding things under the seat can help prevent your car from being stolen).

  122. another (free) image-based human recognition by BACbKA · · Score: 1
    See TT-jump .

    (It's alpha version yet, and it's presently working on a very small subset of environments - requiring MS Outlook/CDO/.NET; but the author seems to solicit invitations to have this rewritten for a normal platform/language:

    "Depending on the interest, the future versions can be re-written to support more platforms and features."

    Before that, its being free is questionable as it's based on non-free tech...)

    --

    VKh

  123. Semi-Good Filter by Anonymous Coward · · Score: 0

    One filter that I found blocks about 50% of the spam I get is to filter by the To: field. Some (50%) spammers either don't include the To: field or have their list server address in it... Either way, it is not addressed to me, so it goes to /dev/null
    The bad part of this is that it will filter out all the mailing lists that you chose to subscribe to.

    1. Re:Semi-Good Filter by Anonymous Coward · · Score: 0

      Another effective filter is the From: field. If I receive an e-mail that claims to be From: me, then it's a pretty safe bet that it's spam. (apart from a couple of mailing lists I'm on, but I have "other measures" to deal with those).
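
      Both header checks in this thread can be sketched with Python's standard `email` package. This is a minimal illustration, not a recommended filter: the function name and the example addresses are hypothetical, and as noted above it would also flag mailing-list traffic unless those lists are exempted some other way.

```python
import email
from email.utils import getaddresses


def header_suspicious(raw_message, my_address):
    """True if the mail is not addressed to me, or claims to come from me."""
    msg = email.message_from_string(raw_message)
    me = my_address.lower()
    to_and_cc = msg.get_all("To", []) + msg.get_all("Cc", [])
    recipients = {addr.lower() for _, addr in getaddresses(to_and_cc)}
    senders = {addr.lower() for _, addr in getaddresses(msg.get_all("From", []))}
    return me not in recipients or me in senders


# Hypothetical messages for illustration.
not_for_me = "From: deals@example.net\nTo: list@example.net\nSubject: hi\n\nbody\n"
from_myself = "From: me@example.org\nTo: me@example.org\n\nbody\n"
from_friend = "From: Friend <friend@example.org>\nTo: Me <me@example.org>\n\nbody\n"
```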

  124. Body Filtering by bsd-mon · · Score: 1

    Of course, when filtering the bodies of messages, the easiest defeater is encoding the bodies of the messages. It's easy to block all messages which have "longer" within 1 or 2 words of "thicker" or "intense", but it's much harder to block SGkgVGhlcmUsDQogDQpUaG91Z2h0IHlvdSBtaWdodCB3YW50IH RvIHRha2Ug. Then you're back to blacklisting based on senders and domains and header information. Of course, this is for the ISP I work for, for personal mail I could just reject all encoded mails.
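
    One possible counter, sketched in Python with the standard `email` package: undo the transfer encoding before looking at the words. The function name and the token pattern are assumptions for illustration; the sample body reuses the base64 string quoted above.

```python
import email
import re


def body_tokens(raw_message):
    """Decode each text part (base64 or quoted-printable) and split it into words."""
    msg = email.message_from_string(raw_message)
    tokens = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True)  # bytes with the encoding undone
        if payload is None:
            continue
        charset = part.get_content_charset() or "latin-1"
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:
            # Unknown charset name: fall back to a byte-preserving decode.
            text = payload.decode("latin-1")
        tokens.extend(re.findall(r"[a-z0-9'$-]+", text.lower()))
    return tokens


RAW = (
    "From: someone@example.com\n"
    "Content-Type: text/plain; charset=us-ascii\n"
    "Content-Transfer-Encoding: base64\n"
    "\n"
    "SGkgVGhlcmUsDQogDQpUaG91Z2h0IHlvdSBtaWdodCB3YW50IHRvIHRha2Ug\n"
)
words = body_tokens(RAW)
```

    Once the body is decoded, the same word rules (or a statistical filter) can run on the plain text instead of on the encoded blob.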

    --
    To read makes our speaking English good. - X. Harris
  125. Statistics by ScroP · · Score: 1

    Isn't spam assassin also using some sort of statistical scheme? I've seen some simple perl script things based on averages. I think spam assassin does more than that, but I've never really checked it out. Does anyone know how this is different or comparable to other spam filters?

  126. needs to run on *outgoing* by MrRudeDude · · Score: 1

    Several people have pointed out that by the time spam reaches you to be filtered, it has already used resources.

    That's why the large ISPs such as AOL and the DSL/Cable providers need to put this on their _outgoing_ connections, just to be able to quickly identify a machine which suddenly begins to produce spam. This would, of course, presume that they are responsible enough to care.

    A lot of spam comes from open relays, hacked machines, unscrupulous ISPs here and in Asia, etc. Obviously all connections to the internet can't be filtered. But I think that an ISP can save itself time and money by eliminating its own occasional problems.

    1. Re:needs to run on *outgoing* by WebMasterJoe · · Score: 2
      That's why the large ISPs such as AOL and the DSL/Cable providers need to put this on their _outgoing_ connections, just to be able to quickly identify a machine which suddenly begins to produce spam. This would, of course, presume that they are responsible enough to care.
      I like the idea, but wouldn't it be easier for the SMTP server to have more obvious rules? Such as, "If host sends out more than 100 emails in a minute they get a warning, after two consecutive minutes of this activity (or two minutes in a 1-hour period) they get banned for 24 hours and the ISP techs get notified." Businesses who send out mail legitimately in bulk would have to make arrangements in advance and somehow satisfy the ISP that it isn't spam.
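
      A minimal Python sketch of such a rule, assuming the limits quoted above (100 mails per minute, two offending minutes within an hour, 24-hour ban). The class and method names are hypothetical, timestamps are plain seconds, and notifying the ISP techs is left out.

```python
from collections import defaultdict, deque

LIMIT_PER_MINUTE = 100
BAN_SECONDS = 24 * 3600


class OutgoingLimiter:
    def __init__(self):
        self.current = {}                   # host -> (minute, mails sent in that minute)
        self.strikes = defaultdict(deque)   # host -> minutes in which the limit was exceeded
        self.banned_until = {}              # host -> time the ban expires

    def record(self, host, now):
        """Record one outgoing mail at time `now`; return 'ok', 'warn' or 'banned'."""
        if self.banned_until.get(host, 0) > now:
            return "banned"
        minute = int(now // 60)
        last_minute, count = self.current.get(host, (minute, 0))
        if last_minute != minute:
            count = 0                       # a new minute starts a fresh count
        count += 1
        self.current[host] = (minute, count)
        if count <= LIMIT_PER_MINUTE:
            return "ok"
        strikes = self.strikes[host]
        if not strikes or strikes[-1] != minute:
            strikes.append(minute)
        while strikes and strikes[0] <= minute - 60:
            strikes.popleft()               # forget strikes older than one hour
        if len(strikes) >= 2:
            self.banned_until[host] = now + BAN_SECONDS
            return "banned"
        return "warn"
```

      Hosts with a legitimate bulk-mail arrangement would simply be skipped before `record` is called.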

      Then again, it doesn't matter how great of a system could be built if the ISPs don't use it - just look at the open relays that still exist!
      --
      I really hate signatures, but go to my website.
    2. Re:needs to run on *outgoing* by 40000 · · Score: 1

      Limit outgoing mail bandwidth to 1 MB per day and reject any message with more than 10 recipients.

  127. A Bayes Net would work better I think by sakul · · Score: 1

    And Bayes' Rule, equally unambiguous, says that an email containing both words would, in the (unlikely) absence of any other evidence, have a 99.97% chance of being a spam.

    This statement is wrong -- it would only be correct if the occurrences of "sex" and "sexy" in a spam were independent, which seems definitely wrong.

    If a person sent him an email containing both the words sex and sexy (and perhaps a few other related words), which seems very possible, the probability of being a spam will go way too high and it will be very hard for the system to classify it as non spam. This might seem inevitable, but it doesn't have to be the case.

    Of course generating the entire joint distribution over all possible words is impossible, but there are very good approximations, for example he could use a Bayes net.
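
    To make the arithmetic concrete, here is a small Python sketch of the combination rule the quoted figure appears to rest on, assuming equal priors and treating every word as independent evidence. The function name is illustrative, and the inputs are assumed to stay strictly between 0 and 1.

```python
from math import prod


def combine(probabilities):
    """Naive Bayes combination that treats each word as independent evidence."""
    spam = prod(probabilities)
    ham = prod(1.0 - p for p in probabilities)
    return spam / (spam + ham)


both = combine([0.97, 0.99])                   # "sex" and "sexy" from the quote
repeated = combine([0.97, 0.97, 0.99, 0.99])   # the same evidence counted twice
```

    `both` comes out at roughly 0.9997, matching the quoted 99.97%. `repeated` is higher still, which is the worry here: correlated words get counted as if each were fresh evidence.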

    --
    www.facestat.com - See how strangers judge you.
  128. ASK or (Re:Best anti-Spam method is TMDA) by kwerle · · Score: 2

    Or, if you're not willing to sacrifice (or mess with) your MDA, check out ASK. It does about the same thing and works with sendmail, procmail, qmail, etc.

    A-S-K

  129. Re:Read the First Ammendment much? by a+cappella+dave · · Score: 1

    There are limits to the First Amendment right to freedom of speech. Exceptions have been established by courts in instances of defamation, causing panic, incitement to crime, sedition, and obscenity.

    How much spam have you read with defaming remarks about Britney Spears's latest sex pics? I've seen so much spam "advertising" rape and molestation and child pornography- it might not be a literal persuasion to commit such crimes, but it certainly is obscene to most readers. The Communications Decency Act (which prohibits "obscenity" and "indecency" on the internet) was upheld by the Supreme Court in 1998.

    So there may be grounds to strike spammers on these exceptions to the First Amendment; however, prosecuting spammers would be a precedent-setting case, blazing new territory in the legal system.

    All of this however would have little effect for non US organizations. Besides... who would attempt to prosecute one spammer when there are so many more ready and willing to take that place.

  130. Filtering != Fighting by Andy+Dodd · · Score: 2

    In short, because the morons that support spammers are not likely ever to bother with filters.

    The one exception - Filtering with bounce messages. This will cause SOME spammers (not all) to take you off their lists. Since implementing fake bounce messages triggered for every identified spam (See spambouncer.org), my spam counts have halved, from 90+ spams/day to 30-35, and decreasing. Unfortunately, some spammers (azoogle.com) blatantly ignore bounces, and others have non-bounceable return paths. If more people bounce their spams back, those who DO have bounceable (but ignored) returns will have their bandwidth costs increase.

    I think the ultimate solution is that the spammers themselves have to be fought. Legislation is one - If 1 in 100,000 people respond positively to spam mail and only 1 in 100,000,000 sue for $500-1000, spam quickly stops being profitable. Also, some form of "voluntary" DDoS of spammers would be nice. Not voluntary for the spammers, but for all those who are attacking. For example, download a small app that each day presents you with an article, that basically states, "Today's target is xxxx - They are targeted because yyyy" and the evidence is presented against them. User can now decide if they want to participate. To minimize legal risks, trickery such as an absurdly slow HTTP GET would be useful. (G, sleep 5, E, sleep 5, T, etc etc) - Doesn't increase bandwidth costs, but the server will probably be brought to its knees rather quickly from having to serve too many simultaneous connections. A client could easily spool up 40-50 such connections with minimal use of local resources, but the server would have to open up hundreds of thousands of simultaneous connections, causing the server to fork like crazy.

    --
    retrorocket.o not found, launch anyway?
  131. Imagining future spam messages by Kaz+Riprock · · Score: 1
    From the article:
    For example, I think that if checksum-based spam filtering becomes a serious obstacle, the spammers will just switch to mad-lib techniques for generating message bodies.
    I might actually read that kind of spam in the future.

    Subject: __________ (noun) Enlargement in ____ (number) days!!!!!
    Hello _________(name),
    Would you like to __________(verb) for only the cost of _______(number) ___________ (hot beverage)?
    Just use this link ____________(website) to get started __________(date)!

    If you'd like to unsubscribe, ________(verb) _________ (place).

    --
    Mordor...a magical, mythical land where women are more rare than dragons--but where every man would rather find a dragon
  132. To Whomever Modded Me As A Troll... by Vengie · · Score: 1

    Whomever modded me as a troll, YOU try wading through Paul Hudak's courses. /growl/

    --
    When in doubt, parenthesize. At the very least it will let some poor schmuck bounce on the % key in vi. (Larry Wall)
  133. bad sample by rkanodia · · Score: 1

    Based on my corpus, "sex" indicates a .97 probability of the containing email being a spam, whereas "sexy" indicates .99 probability.

    Poor guy. Based on my corpus, "sex" indicates a .97 probability of the containing email being plans for Saturday night!

    1. Re:bad sample by Anonymous Coward · · Score: 0

      Based on my corpus, "sex" indicates a .97 probability of the containing email being plans for Saturday night!

      Based on my corpus, it indicates a .99 probability that the wife's on her way home from work (some of us do it more than once a week)

  134. How can spammers learn to avoid filters? by postman · · Score: 1

    One point I'm missing here - how would a spammer know that your filter had sent his message to /dev/null? I can see how they can adapt to measures that prevent their messages from being sent, as they get the bounce notice. But, as far as the spammer knows, a filtered message has reached its destination. Put another way, what "scoring function" could a spammer use to optimize against filters? How do they know that their messages are being read vs being automatically dumped? I ask this because I suspect they can't know, which means that once a good set of filters is in place the spammers will be unable to evade them.

  135. Fighting requires filtering; SpamBouncer by BACbKA · · Score: 1

    Well, any spam fighting must start with spam recognition, which has to involve some filtering. So this probabilistic technique is as good as any other for yet another approach to single out spam messages.

    Now when you are 100% sure that something you've received is spam, it's time to complain to the sender's providers to have his account closed ASAP (and the upstream providers, and spamcop etc.)

    The best approach is hand-written complaints. Being lazy, I use SpamBouncer to do the job for me (and I have actually received a couple of manual followups to these autocomplaints leading to reported spammers' account closures).

    --

    VKh

  136. Images in an email? by oktaya · · Score: 1

    For all I care you can consider any email with an image in it SPAM.. Even if it's not, I'm not interested.

    Also.. you suggest tailoring the regular text part of the message to look like a regular legitimate mail. However, since the person sending the email does not know you, or your interests, any word they use (except maybe 'the', 'a', 'you the man') will probably get flagged as high risk anyway.

    I think the method described in the article has its strong points, the best of which is that it's customized automatically for each user's own definition of spam mail and the mails he receives.

    Oktay

    --
    ---------------
    Founder of the The Free Linux CD Project
  137. HTML Email messages by GreenKiwi · · Score: 1

    As pointed out elsewhere, spammers can get information about whether or not you've viewed one of their messages when you view the HTML if it asks for any external data such as images.

    I use Tiny Personal Firewall to prevent programs from accessing the network in ways that I don't want them to. For example, I have told it that Outlook Express should only be allowed to talk to my servers, and even then, only on ports 25 and 143 (send mail and IMAP). This stops all images from being downloaded or other html calls from going off of my machine and letting spammers know that I've viewed their mail.

    The nice part of this is that if I decide that I want to view images in an html mail message (nytimes news stories for example), I just right click on the tiny personal firewall icon and disable the firewall, and then just enable it after.

  138. Better be careful... by rkanodia · · Score: 1

    ...or tehre will B lots of angary Korean DIABLO player out 2 get u! GIVE ME ITEM?!!? SOJ! ^_^

  139. Rationality by Baldrson · · Score: 1, Troll
    I don't know why I avoided trying the statistical approach for so long. I think it was because I got addicted to trying to identify spam features myself, as if I were playing some kind of competitive game with the spammers. (Nonhackers don't often realize this, but most hackers are very competitive.) When I did try statistical analysis, I found immediately that it was much cleverer than I had been.

    For the same reason artificial intelligence has been held back by reliance on "symbolist" languages such as LISP:

    Everyone wants to believe they are smart enough to tell the computer the rules of behavior rather than realizing they should be teaching the computer to think statistically which is to say rationally.

    Of course since the primary religion pushed by both government and media is the moral virtue of ignoring statistics (to the point that actuaries are now thought of as reactionaries) there should be no surprise that the high priesthood of "AI" has failed not only to produce artificially intelligent software but has done so through the theological bias of rules as commandments for the faithful computer.

  140. Now for retaliation by Anonymous Coward · · Score: 0

    All this needs now is a system to keep track of the email addresses of spammers and PIPE them into other OPT-IN sites. This creates a perpetual loop, because spam bots always reply, ALWAYS REPLY BACK, ad infinitum.

    I like to sow the seeds of destruction.

    Signed Anon Coward

  141. Can someone translate this? by Brant · · Score: 2

    Unfortunately, I don't grok LISP. Could someone please translate the code snippets into Perl or C so I can figure out what he's saying there?

    Thanks,

    Brant

    1. Re:Can someone translate this? by mla_anderson · · Score: 1
      This is pseudo perl and done after only a couple of minutes of learning how lisp operates, but I think I have the gist of it.

      I haven't figured out the second algorithm yet.

      # Procedure to determine the spam probability of one word
      use List::Util qw(min max);

      # $ngood, $nbad: number of good / bad emails
      # %good, %bad:   word => count of that word in good / bad emails
      my $g = 2 * ($good{$word} || 0);
      my $b = $bad{$word} || 0;
      if ($g + $b >= 5)    # the Lisp says: unless (< (+ g b) 5)
      {
          $prob{$word} = max(0.01, min(0.99,
              min(1, $b/$nbad) / (min(1, $g/$ngood) + min(1, $b/$nbad))));
      }

      Where the min and max functions would return the min/max of the arguments passed.

      --
      Sig is on vacation
    2. Re:Can someone translate this? by helixcode123 · · Score: 1

      >Unfortunately, I don't grok LISP. Could someone
      >please translate the code snippets into Perl or
      >C so I can figure out what he's saying there?

      Lisp2Perl (Please ignore typos :-):

      (let ((g (* 2 (or (gethash word good) 0)))
            (b (or (gethash word bad) 0)))
        (unless (< (+ g b) 5)
          (max .01 (min .99 (float (/ (min 1 (/ b nbad))
                                      (+ (min 1 (/ g ngood))
                                         (min 1 (/ b nbad)))))))))

      Perl:

      use List::Util qw(min max);

      # Body of a sub that takes $word.
      # $ngood and $nbad are the numbers of good and bad messages
      # (not the number of distinct words in %good and %bad).
      my $g = 2 * ($good{$word} || 0);
      my $b = 2 * ($bad{$word} || 0);

      unless (($g + $b) < 5)
      {
          my $scaled_bad = min(1, ($b / $nbad));
          my $scaled_good = min(1, ($g / $ngood));

          my $scaled_quotient
              = min(0.99, $scaled_bad / ($scaled_good + $scaled_bad));

          return max(0.01, $scaled_quotient);
      }

      ====
      (let ((prod (apply #'* probs)))
      (/ prod (+ prod (apply #'* (mapcar #'(lambda (x) (- 1 x)) probs)))))

      Perl:

      my $prod = 1;
      $prod *= $_ foreach (@probs);

      my $non_probs = 1;
      $non_probs *= (1 - $_) foreach (@probs);

      return ($prod / ($prod + $non_probs));
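
      For readers who follow neither Lisp nor Perl, here is the same second snippet as a minimal Python sketch. The function name is made up for illustration; the formula is the one in the Lisp above. It assumes the per-word probabilities have already been clamped to the .01 to .99 range, so the denominator cannot be zero.

```python
from math import prod


def combined_spam_probability(probs):
    """Combine per-word spam probabilities as the Lisp snippet does:
    P = prod(p) / (prod(p) + prod(1 - p))."""
    spam = prod(probs)                    # all words point to spam
    ham = prod(1.0 - p for p in probs)    # all words point to good mail
    return spam / (spam + ham)


# Worked example with the figures quoted from the article earlier in the
# thread: "sex" at 0.97 and "sexy" at 0.99.
print(round(combined_spam_probability([0.97, 0.99]), 4))  # 0.9997
```

      Two words at .97 and .99 combine to roughly .9997, which matches the 99.97% quoted from the article.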

      --

      In a band? Use WheresTheGig for free.

    3. Re:Can someone translate this? by helixcode123 · · Score: 1

      >Unfortunately, I don't grok LISP. Could someone
      >please translate the code snippets into Perl or
      >C so I can figure out what he's saying there?

      Sorry,
      my $b = 2 * ($bad{$word} || 0);

      should be:

      my $b = $bad{$word} || 0;

      --

      In a band? Use WheresTheGig for free.

  142. Spelling fight also by Tablizer · · Score: 2

    xIf xYou xCan xRead xThis xYou xHave xWon xA xFabulous xVacation! xClick xHere xTo xRecieve xYour xPrize!

    Spammers will start mispelling "hype words" to get them past. (They already do this in titles.)

    I can envision having a spelling check to find such, but then you could be filtering out legitamate bad spellers, such as me.

    1. Re:Spelling fight also by Spyky · · Score: 2

      I can envision having a spelling check to find such, but then you could be filtering out legitamate bad spellers, such as me.

      Okay, but then if 5% of the words are misspelled you can mark it as a bad speller, if 75% are, then it's a spam. Adjust numbers appropriately.

      -Spyky

    2. Re:Spelling fight also by Tablizer · · Score: 2

      (* Okay, but then if 5% of the words are misspelled you can mark it as a bad speller, if 75% are, then it's a spam. Adjust numbers appropriately. *)

      That is an exaggeration IMO. My error rate is probably around 25 percent on bad days and you don't need 75 percent of all words to be marketing words to get a message across.

    3. Re:Spelling fight also by flonker · · Score: 2

      Just filter on an automatically spell-corrected version of the text. This has a side benefit of increasing the amount of research done on spell-correction.
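
      A rough sketch of what filtering on spell-corrected text could look like, using Python's standard difflib for the fuzzy match. The vocabulary list, the function name and the 0.85 cutoff are all assumptions for illustration; a real corrector would need a proper dictionary and would sometimes merge words that ought to stay distinct.

```python
import difflib


def normalize_token(token, vocabulary, cutoff=0.85):
    """Map a token to its closest known spelling, or return it unchanged."""
    token = token.lower()
    matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token


# Hypothetical vocabulary of words the filter already has statistics for.
vocabulary = ["vacation", "prize", "click", "fabulous"]
words = "xFabulous xVacation Prize meeting".split()
print([normalize_token(w, vocabulary) for w in words])
```

      The normalized tokens, rather than the raw ones, would then be looked up in the filter's tables.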

  143. how about someone write an app that.. by c1pher · · Score: 1

    works with sendmail or qmail or whatever, that filters out the spam messages and autoforwards the message to a list of congressmen, etc with a message "want my vote? make this type of UCE illegal like in Europe".

    --
    The Adult Happy Meal - "I'm lovin' it!"
  144. Simple way to never get spam by Anonymous Coward · · Score: 0

    Step 1. Get your own domain name
    Step 2. use abuse@yourdomainnamehere.com as your email address
    Step 3. Enjoy spam free mail

    After 5 years and numerous public news group postings, I have yet to receive a single spam.

  145. What he's giving is great info, but... by wurp · · Score: 2

    I question his testing methods. If I read the article right (oops, slashdot faux pas, I admitted to reading the article) he built the Bayesian map from about 4000 messages, then tested the efficacy of his algorithm against those same 4000 messages! He waves his hands about why that's OK, but wouldn't it make more sense to take 10 minutes to build his map against the first 2000 messages and test it against the remaining 2000? I really don't trust algorithms that use the input data combined with the desired results to derive those same results against the same input data.
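
    The split described above might look something like this in Python. It is only a sketch: the function names are invented, `classify` stands in for whatever filter is under test, and messages are assumed to be (text, is_spam) pairs.

```python
import random


def holdout_split(messages, test_fraction=0.5, seed=0):
    """Shuffle labelled messages and split them into (train, test)."""
    shuffled = list(messages)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def error_rates(classify, test_set):
    """Return (false_positive_rate, false_negative_rate) on unseen mail.

    `classify(text)` returns True for spam; each test item is (text, is_spam).
    """
    ham = [text for text, is_spam in test_set if not is_spam]
    spam = [text for text, is_spam in test_set if is_spam]
    false_pos = sum(1 for text in ham if classify(text))
    false_neg = sum(1 for text in spam if not classify(text))
    return (false_pos / len(ham) if ham else 0.0,
            false_neg / len(spam) if spam else 0.0)
```

    The map would be built from the first half only, and the error rates measured on the second half, which the filter has never seen.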

    Secondly, over time, assuming that spammers put forth any effort into bypassing his filters, the filters will become much less useful. Spammers will intentionally misspell key words to lower their total spam rating. The easy solution to this is to make the map using a running total of only the messages from the last 3 months, or 6 months, or whatever period works best, but he should have at least mentioned that. Otherwise, over time the massive weight from the old emails will drown out any new spam identifying words.
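
    One way the running-total idea could be kept, sketched in Python. The class name and the choice of seconds as the time unit are assumptions, and messages are assumed to arrive in time order.

```python
from collections import Counter, deque


class RollingCounts:
    """Token counts over the most recent `window` seconds of filed mail."""

    def __init__(self, window):
        self.window = window
        self.events = deque()      # (timestamp, tokens), oldest first
        self.counts = Counter()

    def add(self, timestamp, tokens):
        tokens = list(tokens)
        self.events.append((timestamp, tokens))
        self.counts.update(tokens)
        self._expire(timestamp)

    def _expire(self, now):
        # Forget every message that has fallen out of the window.
        while self.events and self.events[0][0] < now - self.window:
            _, old_tokens = self.events.popleft()
            self.counts.subtract(old_tokens)
        self.counts += Counter()   # in-place add drops zero entries


ninety_days = 90 * 24 * 3600
spam_counts = RollingCounts(window=ninety_days)
spam_counts.add(0, ["mortgage", "free"])
spam_counts.add(ninety_days + 1, ["m0rtgage", "free"])
print(dict(spam_counts.counts))
```

    Keeping one such table for spam and one for good mail would let old vocabulary age out instead of drowning the new.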

    All in all, it sounds like a great system, though, pending the results of a real test against emails other than the ones you built the map from ;)

    1. Re:What he's giving is great info, but... by jaoswald · · Score: 2

      Misspelling won't work, because those misspellings are far more likely to mark a spam than a legitimate e-mail.

      PG makes the point himself with the c0ck example.

    2. Re:What he's giving is great info, but... by Anonymous Coward · · Score: 0
      Spammers will intentionally misspell key words to lower their total spam rating.

      Granted, that piece of spam will get through, but it is highly probable that the next spam (with the same misspellings) will not. I don't think the expiring idea is good, though. It's all statistics, and as you get more mail (and file it as good or spam), the percentages get changed automatically. Remember, he doesn't even use a word in a calculation if it hasn't appeared in at least five emails already.

      I love the idea that we can use math instead of law to solve this problem; we don't need more legislation, just more people using their heads to solve problems on their own.

      BTW, can anybody explain the second piece of LISP code to me? I'm interested in implementing this filter myself.

    3. Re:What he's giving is great info, but... by Elminst · · Score: 1
      Secondly, over time, assuming that spammers put forth any effort into bypassing his filters, the filters will become much less useful. Spammers will intentionally misspell key words to lower their total spam rating.


      I don't think you read all the way thru then. He talks about exactly this. This filter LEARNS, so the first time you delete-as-spam an email with intentionally misspelled words, the filter immediately knows that those words are "bad" and that any subsequent message containing those words is probably spam.

      I love this idea, and can't wait to see it implemented.
      --
      No unauthorized use. Trespassers will be shot. Survivors will be shot again.
    4. Re:What he's giving is great info, but... by wurp · · Score: 2

      But initially, the misspelling is only a word that the filter doesn't know, which is rated at .2, which is considered not a spam. After receiving a few spams with the misspelling and "delete-as-spam"ing them, the system will correct itself, until the spammers begin misspelling the word in a different way.

    5. Re:What he's giving is great info, but... by wurp · · Score: 2

      I did read all the way through. The system requires that it run into the word five times, spelled the same way, in a message marked as spam. That means that the first spams that you receive that use only misspelled "spam words" will bypass the filter, and that will happen again and again as long as the spammers can come up with new ways to misspell the words.
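
      To make the lag concrete, here is a small Python sketch of the learning step. The class and method names are hypothetical; the probability formula follows the Lisp posted earlier in the thread, which counts occurrences of a word rather than messages containing it, and returns nothing until the doubled good count plus the bad count reaches five.

```python
from collections import Counter


class WordStats:
    """Per-user token counts, updated each time a message is filed."""

    def __init__(self):
        self.good = Counter()   # token -> occurrences in legitimate mail
        self.bad = Counter()    # token -> occurrences in spam
        self.ngood = 0          # number of legitimate messages filed
        self.nbad = 0           # number of spam messages filed

    def train(self, tokens, is_spam):
        if is_spam:
            self.bad.update(tokens)
            self.nbad += 1
        else:
            self.good.update(tokens)
            self.ngood += 1

    def spam_probability(self, word):
        """Return the word's spam probability, or None while it is too rare."""
        g = 2 * self.good[word]   # good counts are doubled, as in the Lisp
        b = self.bad[word]
        if g + b < 5:
            return None
        bad_ratio = min(1.0, b / self.nbad) if self.nbad else 0.0
        good_ratio = min(1.0, g / self.ngood) if self.ngood else 0.0
        return max(0.01, min(0.99, bad_ratio / (good_ratio + bad_ratio)))


stats = WordStats()
stats.train(["hello", "meeting"], is_spam=False)
for n in range(1, 6):
    stats.train(["v1agra"], is_spam=True)     # a new misspelling each campaign
    print(n, stats.spam_probability("v1agra"))
```

      The word has no probability for the first four spams and jumps to .99 on the fifth, which is the window a fresh misspelling gets.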

      I do think it's a good system, but his testing methodology as listed in the article is atrocious, and there are techniques spammers can use to bypass the filter if it were to become very popular.

    6. Re:What he's giving is great info, but... by CerebusUS · · Score: 1

      Remember the bit about the headers though... I'm not sure about your spam, but most of mine comes with headers that are remarkably similar to each other, and all those elements are receiving values as well... So if Spamco.com sends you 80 messages, each with "visit my porn site" misspelled in a different way, eventually the domain itself is gonna get whacked, regardless of the misspellings.

    7. Re:What he's giving is great info, but... by mgessner · · Score: 1

      Your point is interesting, but there's not enough probability behind it.

      a) you claim that spammers will start changing how they misspell things; they'd have to know what was in your filter. Probability: 0.

      b) you imply that spammers have brains enough to figure out how to do this. Probability: 0.

      c) you imply that spammers have enough brains to know exactly WHICH words to keep misspelling. Probability: 0.

      See, it just doesn't add up!

      --
      "Sometimes the truth is stupid." - Lawrence, creator of Prime Intellect
  146. OOh..... by Anonymous Coward · · Score: 0

    you can multiply.

    can you sit up and walk too?

    fucktard.

  147. make sure you call their 1-800 from a public phone by BACbKA · · Score: 1

    ...otherwise they'll have all your personal data and your phone # for future direct marketing, and they'll know their spam had reached you so they'll have your interests more narrowly classified, making you a more valuable direct marketing target!

    --

    VKh

  148. make it distributed. by option8 · · Score: 2

    Make it Distributed and make it work with eudora, and i'll gladly use it.

    spamnet (see link above) promises to make it so that, if you add a filter to your email, and it (or you) shows promise as a good spam filterer, that filter gets added to those that all subscribers get. unfortunately, it's currently only for outlook, but i expect it will either add support for other clients, or someone will come up with an open source alternative...

  149. Can we use this on slash? by Fizyx · · Score: 2, Interesting

    Not to filter posts for spam, but for, you know, quality!

    1. Re:Can we use this on slash? by Hoi+Polloi · · Score: 2

      I think we'd stop seeing any posts then. ;)

      --
      It is by the juice of the coffee bean that thoughts acquire speed, the teeth acquire stains. The stains become a warning
  150. Re:Circumvent by kwerle · · Score: 2

    My primary concern comes from the fact that most of the spams I receive are either Korean or English, while most of the legitimate mails are in Norwegian. Sending me Korean mail is pointless anyhow, but I fear that simply the _use_ of English will make his scheme produce lots of false positives.

    I don't get it. Does Korean spam use something other than WORDS to communicate? Or do their mail headers look any different than Norwegian ones? What makes you think your deleting Korean spam, and thus marking those Korean words (heck, all of them) as spam will be a problem? The filter gets built up for the user, based on the user's email. How could this not work for you? Why would marking of a few English words as not being spam be a bad thing?

  151. Your eyes are brown. by www.sorehands.com · · Score: 3, Insightful
    You are so full of shit, your eyes are brown!


    If you have a driveway that connects to a public road, then people can park there. Your house is connected to a public road, I can walk in and watch TV. Your car is on a public road, I can use it without your permission.


    A spammer that I tracked down was very unhappy that I knocked on his door. He claimed I was trespassing. How could I, he opted in by having his house accessible by a public road.


    If spamming is legal and honorable, why don't you post your real name, address, and phone number with each spam and on each website that you spam about?

    1. Re:Your eyes are brown. by Dyolf+Knip · · Score: 2
      There's a difference. While you are very stringent about who you let onto your property, most people will happily let anyone mail something to them. An inbox without deny filters really is a public site.

      This is not the way things have to be; you could easily deny any mail except for what's on your whitelist. But to lessen the risk of false positives, you give mailers the benefit of the doubt. Similarly, unless you have barbed wire running around your property, you are pretty much giving permission to anyone to walk on and 'ask for permission to be there', as it were. They don't get to stay if you tell them to get lost, nor does the spam have any say in whether or not you trash it.

      --
      Dyolf Knip
    2. Re:Your eyes are brown. by www.sorehands.com · · Score: 2
      "most people will happily let anyone mail something to them."

      Actually, an advertiser pays money for the privilege to put something into my mailbox. An advertiser cannot legally walk up to a mailbox and put advertising into it without paying postage.
      "Similarly, unless you have barbed wire running around your property, you are pretty much giving permission to anyone to walk on"

      If you put your car onto my property, it may be towed or seized.


      On my site, it says,

      "You agree that any email you send which advertises or promotes any product, service or Internet destination, shall be subject to a $1,500.00 fee for reading and responding appropriately. THIS MEANS SPAM COSTS! Concealing, misrepresenting, or not fully disclosing, the sender's identity increases the fee by $3,000.00 to compensate for the effort to track down the sender."
      Which means that you have been told not to send spam, unless you want to pay for it.
    3. Re:Your eyes are brown. by Dyolf+Knip · · Score: 2
      Actually, an advertiser pays money for the privilege to put something into my mailbox. An advertiser cannot legally walk up to a mailbox and put advertising into it without paying postage.

      Sure they can (hang it on my doorknob, anyway). I get that particular brand of crap rather often. They just don't get to use the USPS to deliver it.

      If you put your car onto my property, it may be towed or seized.

      Exactly my point. Equating 'sending spam' to 'using my lawn as a parking lot' doesn't work. That I'm allowed to walk onto your property doesn't mean I have any freedom there. You send me spam, I delete it without hesitation. You park your car in my driveway and I'll test out my new chain saw. In both cases, the attempt to put something into my 'sphere of influence' essentially results in it being totally at my mercy. In both cases, you _could_ change the situation and give only 'authorized personnel' the ability to enter your property, but you risk barring people who would be authorized if only they could get past the not-very-intelligent spam filter/barbed wire fence. So we don't, and instead focus our efforts on trying to make very smart barbed wire that will let the meter readers, bug man, pizza delivery guy, gorgeous babes, etc in but keep the Jehovah's Witnesses and salesmen out.

      Basically, I'm saying that people who end up with a lot of spam (yes, this includes myself) are in that situation because they feel that the lack of false positives is worth the abundance of false negatives. Any one of us, at any time, could turn our filters on to the max and reduce spam to a bare trickle, but it would be at a high cost. A former girlfriend of mine did exactly this, and while she got nothing she didn't want, there was a lot she did risk missing. A smart, effective, adaptive piece of software that stands to have the best of both worlds is simply amazing. I want it.

      --
      Dyolf Knip
  152. Sliced bread move over! by gone.fishing · · Score: 1

    This is the greatest idea since sliced bread. Better even! I do like the idea of making the corpus distributed but think that keeping a personal corpus of data is also a very good idea.

    One added button can drive it all "Delete as spam" what a wonderful idea!

    I think the solution to spam has been found!

    1. Re:Sliced bread move over! by Oswald · · Score: 1
      One added button can drive it all "Delete as spam" what a wonderful idea!

      Yeah, Tivo for your email box. Except that if this catches on, Microsoft will jump on the bandwagon, and (just like Tivo) you'll soon be sharing your idea of what is and isn't spam with Bill.

  153. Utilities to test a spam filter by ai0524 · · Score: 1

    Are there any utilities to test the effectiveness of a spam filter? Suppose that I wanted to install this Bayesian filter but I don't have spam (already deleted) to created my hash tables. Is there some web site that will send me a bunch of known spam messages to create the weights against or to test an existing spam filter?

    --
    Share bicycle touring info worldwide: http://wheretocycle.com
    1. Re:Utilities to test a spam filter by aderusha · · Score: 1

      open a hotmail account, and post a couple of times to usenet with it in the reply-to field. voilà! instant "corpus" of spam!

  154. P.S. I forgot by www.sorehands.com · · Score: 1
    I forgot a couple of rules:
    1. Spammers lie.
    2. Spammers are cowards.
    3. Spammers are thieves.

    Prove me wrong by:
    1. Fully identifying yourself and the sites that you advertise
    2. identify the spam that you send
    3. Fully identify yourself on each spam you send
  155. Check out A-S-K by kwerle · · Score: 2

    Nice system for list matching:
    a-s-k.sf.net

    1. Re:Check out A-S-K by kisrael · · Score: 2

      I'm not crazy about the idea of sending out confirm e-mails...it makes me feel like a bit of a spammer myself, especially if the incoming spam contains a from: address of some poor unsuspecting sap. Also, I worry just a little bit about non-technical people 'not getting it'.

      Since I already do certain auto-whitelisting based on subject for mailing lists, mostly for [randomyahoogroup], [Slashdot], and [Stella], I'm toying with publicizing [Kirk] as well as a magic passphrase...

      --
      SO YOU'RE GOING TO DIE: The Comic for Dealing with Death
    2. Re:Check out A-S-K by kwerle · · Score: 2

      I'm not crazy about the idea of sending out confirm e-mails...it makes me feel like a bit of a spammer myself, especially if the incoming spam contains a from: address of some poor unsuspecting sap.

      In more than half a year of using ask, I have yet to have that problem.

      Also, I worry just a little bit about non-technical people 'not getting it'.

      I have dumbed down my confirmation message a little. I WAY dumbed down my mom's confirmation message...

  156. It's the content I want to block by Skapare · · Score: 2

    It's the content I want to block. I don't want the spam to be sent to me in the first place. I don't want it to use up my bandwidth, which is half the reason for refusing spam in the first place. Plus, when handling other people's mail, it's one thing to block suspected spam sources for them; it's another thing entirely to examine the content, even if it's just computer logic doing it. If I am able to deploy the ability to examine mail for unacceptable content, then what else will I have to test for later? What will the government expect me to be able to do?

    I'll stick with blocking dedicated spam houses, ISPs that harbor spammers, open relays, open proxies, dialup pools, and certain countries, by IP address and/or domain name. And I'll continue to block anything that can't get their reverse DNS right (this feature alone took out half the spam with very little collateral damage).
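
    For anyone wondering what a reverse DNS test involves, here is one possible reading of it as a Python sketch: look up the PTR name for the connecting IP, then check that the name resolves back to the same IP. The function names are made up, the resolvers are passed in so they can be swapped out, and this version only handles IPv4. It is not necessarily the exact test described above.

```python
import socket


def _reverse(ip):
    # socket.gethostbyaddr returns (hostname, aliases, addresses)
    return socket.gethostbyaddr(ip)[0]


def _forward(hostname):
    # socket.gethostbyname_ex returns (hostname, aliases, addresses)
    return socket.gethostbyname_ex(hostname)[2]


def reverse_dns_ok(ip, reverse=_reverse, forward=_forward):
    """True if the IP has a PTR name that resolves back to the same IP."""
    try:
        hostname = reverse(ip)
        return ip in forward(hostname)
    except OSError:   # socket.herror and socket.gaierror are OSError subclasses
        return False
```

    A mail server could call something like reverse_dns_ok(client_ip) at connection time and refuse the session before any message body is transferred, which is what saves the bandwidth.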

    --
    now we need to go OSS in diesel cars
  157. One perspective on the 'blunderbuss' statistic by doorbot.com · · Score: 2

    Example: 10 spams get the word 'blunderbuss' but he has no regular email with that word. Therefore, any future email may be rejected because of the word 'blunderbuss', even though there is no basis to know whether the word CAN be used legitimately.

    I don't claim to understand the article fully, but I'll take a stab at responding to your example...

    Let's say you have an email with 200 words in it (including the header, etc). Let's assume that your friend the history buff is sending you some pictures of a blunderbuss. Of course he's kind enough to provide a description of the pictures and a short history of the blunderbuss.

    Now, this goes through the filter which splits up the words into 200 tokens. Even if blunderbuss has a .99 probability of being a spam word, the message is not rejected simply for matching this word. This one word will influence the final probability calculation (eg, moving towards a "hit" for spam), but the other 199 words could (and will) push the email towards the "valid mail" threshold. Thus, the .99 is negated by the near-zero probabilities of other words in the email (sorry, I'm not a mathematician -- my apologies if that's confusing). There is an example in the article which explains this.

    If the email contained only the word blunderbuss (disregarding the words in the header), then it's probably spam. However, if the email contained only the word blunderbuss it's probably not very useful in the first place (unless you're an international spy ;)).
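
    A small Python sketch of the arithmetic, with made-up numbers. As I read the article, only the fifteen tokens whose probabilities lie farthest from .5 are combined, and mail scoring above .9 is treated as spam; both figures are treated as assumptions here, and the token probabilities are invented for the example.

```python
def most_interesting(probs, n=15):
    """Keep the n probabilities farthest from the neutral value 0.5."""
    return sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:n]


def combine(probs):
    """Combine per-token spam probabilities into one message score."""
    spam = ham = 1.0
    for p in probs:
        spam *= p
        ham *= 1.0 - p
    return spam / (spam + ham)


# Hypothetical 200-token message: "blunderbuss" scores 0.99, the friend's
# name, address and usual vocabulary give 20 strongly "good" tokens at 0.01,
# and the remaining 179 tokens are unremarkable.
token_probs = [0.99] + [0.01] * 20 + [0.4] * 179
score = combine(most_interesting(token_probs))
print(score < 0.9)   # True: nowhere near the spam threshold
```

    One .99 token is swamped once a dozen or so strongly innocent tokens are in the mix.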

  158. Re:make sure you call their 1-800 from a public ph by perlyking · · Score: 2

    Well in the UK (not sure about elsewhere) dialling 141 before the number withholds caller id. Worth remembering :-)

    --
    no sig.
  159. Re:This is wrong.advertising. by Anonymous Coward · · Score: 0

    Spammers could take the easy way out. Pay people to read it. If people read it and become customers, they make more money than it cost for the spam.

  160. I haven't gotten any spam in a month either! by Bozar · · Score: 1

    "I have not received *ANY* spam in over a month.

    Zero spam, period."

    I haven't checked my hotmail account since a month + 1 week ago...

    I think a lot of spammers have been getting "mailbox full: too much spam" messages from me ;-)

    --
    Free as in *BUUURP!*
  161. Re:Circumvent by Anonymous Coward · · Score: 0

    Yeah, since mailing list maintainers rarely use the email protocol properly and have little regard for their end users requests - it makes sense you would automatically remove people.

    Of course, people who sign up for mailing lists would just manually add you to their client. Problem solved.

  162. positively false? by avi33 · · Score: 1

    I'd unsubscribe if I could get my inbox to lose weight now. What a sexy way to get university diplomas removed from naked redheads!

  163. no it's not by MemeRot · · Score: 1

    When I navigate to a site and they have massive amounts of Javascripts triggering new browser windows to domains I never requested, the browser session is no longer under my control or 'opt in' in any way.

    In both cases you can go to ridiculous lengths like downloading the content locally first, turning off scripting, disconnecting from the internet, and then viewing the content. But that's not really relevant. If I go to slashdot.org with my standard browser settings and another user posts an innocent looking link in the middle of a discussion that I click on which goes to a site that spawns 1000 browsers worth of goatse and 3 installations of some kind of trojan horse, I did not opt into that. And god that really sucked.... my ceo walked by right when it happened. Much more negative consequences than some spammers getting demographic info on me.

    Spammers are generally lame, but don't put up much malicious script. Web sites, including ones linked to from this one, DO. Spammers want to sell you something, not install trojans on your machine.

    1. Re:no it's not by pmz · · Score: 2

      Javascripts

      I generally have feelings about JavaScript that mirror my feelings about HTML e-mail, and the underlying problem of bad default software configurations is the same. But I'll defer ranting about JavaScript, for now.

      I agree that browsing doesn't always seem "opt in", but I was trying to point out that browsing is actively going out and about, which is different than reading through e-mail.

      HTML and JavaScript in e-mail is more like a disease-loaded letter in a person's own mailbox (no one asked for it, no one needs it, it's just there, and it's dangerous). WWW browsing, on the other hand, really is more like shopping or traveling. There are real risks in going out into the world into unknown places, but we accept them as small and not worth sheltering ourselves because of them.

    2. Re:no it's not by MemeRot · · Score: 1

      There are real risks in using an email client that renders html email, but we accept them as small and not worth sheltering ourselves because of them.

      I'm really trying to see a difference, but I just don't.

    3. Re:no it's not by pmz · · Score: 2

      I'm really trying to see a difference, but I just don't.

      Another way of putting it that I just thought of:

      We have a right to read our mail privately in our homes with no one peering in. We generally dispense with a notion of privacy when going away from home.

      The gray area between e-mail and web browsing is due to them frequently occurring on the same computer over the same network connection; however, I just tend to view them in the same way as traditional activities. My habits tend to reflect this, as I use separate tools for e-mail and web browsing, and I ensure my web browser isn't configured to do e-mail-like activities. If I wanted to take things further, I would run my web browser in a separate and limited user account (which I already do for my Windows VM, but that's a different matter) or even configure a special browsing-only workstation that has specific firewall privileges. This isolation helps protect my computer from the fact that the browser automatically processes whatever data is sent to it.

      As long as my e-mail is never automatically rendered and is always displayed as text, it is generally not as risky as web browsing and doesn't require as much isolation to be safe.

  164. cid suppression won't quite work with 1-800 by BACbKA · · Score: 1
    1-800 will know your number at the point of resolving the 1-800 to an actual number (SS7 SDP/SCP node has your number as an input to tell which one you should go to - say, based on your geo. proximity). Big direct marketers can surely own the application on the SCP and hence collect your data.

    Of course, cid suppression might work on the final (non-signaling!) voice link to the guy you're dialing so, unless he is connected in real time to that SCP, he won't know instantly that it's you calling - such setups do happen in some callcenters.

    --

    VKh

    1. Re:cid suppression won't quite work with 1-800 by Anonymous Coward · · Score: 0

      Things are different in the UK, you can withhold your number.

    2. Re:cid suppression won't quite work with 1-800 by BACbKA · · Score: 1

      The principle of 1-800 resolution is the same in all SS7 variants (ANSI or ITU shouldn't matter here). If you disagree with the SS7-logic-based reasoning in my prev. post, I'd suggest you elaborate.

      --

      VKh

    3. Re:cid suppression won't quite work with 1-800 by Anonymous Coward · · Score: 0

      US != world. Sheesh. Jeez, give up already.

  165. Names of countries! by Anonymous Coward · · Score: 0

    [2] As a rule of thumb, the more qualifiers there are before the name of a country, the more corrupt the rulers. A country called The Socialist People's Democratic Republic of X is probably the last place in the world you'd want to live.



    Countries such as the Union of Soviet Socialist Republics (USSR) and the People's Republic of China, followed closely by the United States of America (USA). Scary.
    Good names: Canada wins!

    +1 Funny


    (too chicken not to post as Anon)

    I expect Iraq, Cuba, South Korea, Iran, Saudi Arabia and various others do have silly long official names, not that anyone uses them.

  166. so what? by MemeRot · · Score: 1

    So you mark it as 'delete as unwanted'.

    The point is that the filter list should be targeted to the individual user's desires, not conformance to a general idea of 'what is spam?'.

    Because that method is more universal.... and inclusive of the other goal. While the other goal cannot be expanded the other way around (the part about filtering on headers automatically for example).

    But by the way I don't believe you.... And the spammers don't either. They are sure that you want to extend your penis or increase your bust.... or both :)

    1. Re:so what? by blazin · · Score: 1

      That was the point of the article: this filter is much more effective if it is targeted at the individual's email. Otherwise it is just too easy to craft the spam to get around the filter.

      But by the way I don't believe you.... And the spammers don't either. They are sure that you want to extend your penis or increase your bust.... or both :)

      Also, believe it or not, I don't respond to the spam, and yes, I've had stuff that came as spam that I was interested in, but I didn't purchase anything from them, and I refuse to. They'll usually go in my mental blacklist for using spam as a marketing tool.

  167. Your Definition of Spam by mcguire · · Score: 1

    What I like best about this approach is that it lets you define what spam is, instead of having to rely on someone else's (possibly different) definition. For example, I hate receiving urban legend "forward me or die a slow death" emails. These generally pass through my spam filters. If I instead marked these as spam using the process described, before long they would be filtered out too. And, because of the statistical approach, future non-urban-legend emails from said "friend" would not be blocked. Neat.

  168. At the risk of sounding like a broken record... by Guppy06 · · Score: 5, Interesting

    Senator Mary Landrieu
    724 Hart Senate Office Building
    Washington, DC 20510-0001

    Dear Senator Landrieu:

    Earlier this month the Federal Communications Commission (FCC) issued a record fine of nearly $5.4 million to Fax.com for transmitting unsolicited advertisements via fax machine (ie. "junk faxing"). Coincidentally, the Associated Press published a series of three articles covering the state of unsolicited e-mail advertising ("spam"). I'm left wondering how the FCC can come down hard on junk faxers but how spammers (arguably of a lower moral class) are allowed to continue to operate nearly unmolested.

    The law Fax.com was found to be guilty of breaking is Section 227 of Title 47 of the United States Code. The relevant text follows:

    Restrictions on the use of automated telephone equipment:

    It shall be unlawful for any person in the United States (...) to use any telephone facsimile machine, computer, or other device to send an unsolicited advertisement to a telephone facsimile machine(.)

    It is my understanding that the reasoning behind this law is based on the ownership of resources. Fax machines are purchased and maintained at the owner's expense and only the owner's expense. An unsolicited advertisement sent to this fax machine amounts to nothing less than the use of these expensive resources without prior consent. In effect "junk faxing" is considered theft and as such the offenders are held accountable by law.

    What does this have to do with spam? In my opinion, everything.

    Receiving an e-mail is by all accounts more expensive than receiving a fax. While several companies are now offering stand-alone e-mail clients, I have yet to see one of those with a lower price tag than a fax machine. But even if their price tags were the same, an e-mail station requires that the owner not only pay a monthly fee for a telephone line but also a second monthly fee for the e-mail account itself.

    Of course not even an end client is enough to receive an e-mail. The e-mail account you would be paying for is maintained on a very large (and very expensive) e-mail server, complete with its dedicated (and pricey) connection to the internet. There is simply nothing comparable to an e-mail server in the faxing domain. While a bank of fax machines doesn't cost more than the price of the machines and their associated telephone lines, the price of a dedicated e-mail server and the associated connections can easily resemble that of a small car.

    So why is it that the FCC is given free rein to crack down on junk faxers but spammers are free to consume valuable equipment they do not own?

    If you are familiar with the AP articles I mentioned earlier you will know that spam is steadily eliminating the usefulness of e-mail itself. It has been estimated that spam accounts for up to 80% of the e-mail traffic to major e-mail domains such as Hotmail and Yahoo, a problem that their respective owners are all but powerless to fix. As more and more internet resources are tied up by these advertisements, the owners of these resources have had to resort to cutting off offending service providers from the rest of the internet entirely. Customers are finding themselves unable to use the internet access they have paid for simply because another customer of that same provider is abusing theirs.

    But even then the providers are powerless to drop spammers. Spammers in the recent AP articles have proudly boasted of the way they outright defraud unsuspecting internet service providers when signing up for an account. And when the provider threatens action, the spammer threatens the provider with legal action. In recent months a spammer was even successful in receiving a legal injunction against their service provider, preventing the provider from stopping the spammer from abusing their resources.

    I have little problem with receiving advertisements through the U. S. Postal Service. I know that the delivery cost for every article in my mailbox has been entirely paid by the sender. And while I am not happy with the current situation with telemarketers (I must pay for local telephone service before I have the "privilege" of being contacted by telemarketers), I must grudgingly admit that the state and federal laws designed to restrict telemarketing have been mostly successful. But I am not happy about paying several thousand dollars for a computer and $20.00 a month simply to have my e-mail account flooded to capacity with advertisements for products and services I have no interest in (and preventing legitimate e-mail from reaching me in the process). I am sure that you yourself have been bombarded with advertisements for websites featuring "nasty teens" or "penis enhancement." I have noticed that your office no longer maintains an e-mail address accessible to the public.

    The examples of spam I mentioned in the last paragraph bring me to another point: I have noticed on your website your stated commitment to enforcing decency laws on the internet, to protecting children from access to objectionable material on the internet. It should be obvious by now to even the most casual of internet users that the biggest offender in this area is the spammer. While a user must actively attempt to locate a website in order to find such material on the world wide web, the mere existence of an e-mail account all but guarantees that the owner will have such material delivered to them on a daily (if not hourly) basis.

    In my opinion the solution to this problem is very simple: expand 47 U.S.C. 227 to prohibit unsolicited e-mail advertisements in exactly the same way it prohibits unsolicited fax advertisements. Nothing more, and certainly nothing less.

    I have seen some ineffective bills drift through both houses of Congress that are written to allow unsolicited messages so long as they have an "opt-out" mechanism. Ignoring the fact that such legal loopholes would essentially negate the law entirely (can you prove that you tried to opt out?), it quite literally sickens me the way some of your fellow members of Congress feel that spam is somehow an issue dealing with the freedom of speech. The mere existence of the internet and the supposed changes it has on how business and the legal system work (even though such "changes" have been shown to be a lie) have helped to convince these poor fools that people should somehow have a right to use and abuse the property of others. Does my neighbor have the constitutional right to break my kneecap so long as they provide me with the ability to "opt out" of future kneecappings?

    The United States Constitution guarantees that all citizens are free to say what they want. It does not guarantee a soapbox upon which they can say it. Just as I am not guaranteed the right to have a billboard on Interstate 10, spammers should not have the "right" to use the resources of others simply because they're there.

    Expanding 47 U.S.C. 227 to include e-mail is an extremely important issue to me and I hope, given the interests stated on your website, that it is an important issue to you as well. I know that you are up for re-election this November and I intend to find out how your competitors feel on the issue as well.

    1. Re:At the risk of sounding like a broken record... by bugbear · · Score: 1

      This is very well put. Do you have this letter posted anywhere on the Web where I could link to it? It would serve as a good model for anyone who wants to write to politicians.

    2. Re:At the risk of sounding like a broken record... by Guppy06 · · Score: 1
    3. Re:At the risk of sounding like a broken record... by Guppy06 · · Score: 2
  169. Re:Microsoft already looked into this-innovate by Anonymous Coward · · Score: 0

    Wow! Microsoft really does innovate.

  170. patent by drxenos · · Score: 1

    You watch: now some company is going to implement his idea in their filter software and patent it as their own! They will then threaten to sue anyone else who uses it.

    --


    Anonymous Cowards suck.
  171. Paul Graham on Fighting Spam by digitalsushi · · Score: 2


    Paul Graham on Fighting Spam:

    Wham wham wham wham wham wham wham.


    (it's deeper than you think)

    --
    slashdot: where everyone yells sarcastic metaphors to themselves to understand the issue
  172. Oh no! by too_bad · · Score: 1

    I have some very good spam filtering based on the content (which is almost a giveaway!) The only way this keeps working is if the spammers do not know what I am filtering on. Please, let's keep it this way. If smart people like Paul start divulging their good techniques, the spammers will start changing their content too ... The only way we can win is if everyone comes up with their own filters, keeps really quiet about it, and the spammers continue to spam thinking everything is okay. And of course the SPAM-lovers could still continue to receive all the spam they like without realising that anything is different in their little world. Something to think about.

    --
    DO NOT PANIC
    1. Re:Oh no! by drxenos · · Score: 1

      That is the same logic used by Microsoft for ensuring the security of their software. How well does that work?

      --


      Anonymous Cowards suck.
    2. Re:Oh no! by too_bad · · Score: 1

      (chuckle) True! Not very well :P
      --
      Dont believe the SIG below. Its fake!

      --
      DO NOT PANIC
  173. Let ME be your spam filter by mikey573 · · Score: 1

    I'm glad this article author touched on what I consider the ultimate solution:

    If you hired someone to read your mail and discard the spam, they would have little trouble doing it.

    There are lots of unemployed people in the tech sector, why not hire them? Heck, let ME be your spam filter!

    Then again, there are privacy concerns. Oh well.

    For an interesting read, please see my paper: 'An Analytical Look at Spam'. I touch on the "Hire a secretary" solution along with an extensive analysis of the entire spam situation.

  174. Re:probalilties-"denial of service". by Anonymous Coward · · Score: 0

    I'm wondering if spammers could manipulate those probabilities, not to get their spam through, but to increase the "false positives"? Sort of like a "denial of service" attack.

  175. Yeah but.... by MemeRot · · Score: 1

    If you've never heard of a product you cannot know you want it. So you won't search Google for a combo USB drive/MP3 player/keychain fob. But if you get an email about it, you may realize you want it.

    I agree with your suggestion of a middle ground. Some email you know you want, some email you know you don't want, but some email you're unsure of. If I get unsolicited automated mail about an upcoming Tcl convention because of some forums I'm on, like I did recently, do I automatically want to trash it? I didn't know there were any, and I wouldn't have looked for one. And I'm not going to go.... but I did, unbeknownst to myself, want to get that information to then be able to make that decision. I generally am willing to read ANY unsolicited automated email that pertains to programming or software, and I don't care if the person sending it out subcontracts their bulk mailings to a company that also does bulk mailing for porn sites, which is why I would be very wary of header-based filters. Ultimately it is only the content of the email that decides whether I want it, regardless of whether it's commercial or non-commercial, automated or individually sent, from a person I know or from someone I've never met.

    I think that the Tcl convention announcement would probably get by this guy's filters, since he weights words about programming as non-spam. So ultimately what he has isn't a spam filter, but a content filter, which I think is more important.

    1. Re:Yeah but.... by plover · · Score: 1, Flamebait
      RTROTFA (Read The Rest Of The F'ing Article)

      Your filters are trained BY YOU to blacklist or whitelist words based on whether or not YOU decide if the message is spam. So if you mark Tcl-type messages as non-spam, then you'll get them.

      It sounds about halfway to the "user agent" concept that all the futurists say will be the Next Big Thing. Think about applying the same filtering characteristics to reading articles from a mailing list, or your perusing of Slashdot. If you never read "YRO" articles and always read "spam" articles, this could be the mechanism your agent would use to float the anti-spam articles to the top of your screen, or the head of your inbasket.

      Anyway, my point was they're YOUR filters, YOUR reading habits decide what's spam and what's not. And if you don't trust it, don't install it. You can spend all day reading all about organ-lengthening treatments, it's no skin off my nose (no pun intended.)

      --
      John
  176. Article in "Science" about how NSA does this. by Anonymous Coward · · Score: 0

    Marc Damashek, "Gauging Similarity with n-Grams: Language-Independent Categorization of Text", Science, 267, 843-848, 10 February 1995.

  177. Latest spammer trick by BigBadaboom · · Score: 1

    One recent trick that I've just started seeing in spams is a simple tactic that might do quite well w.r.t. defeating content filters.

    What this spammer does is insert HTML comments in the middle of every word, with a random word inside each comment, i.e.:

    MA<!-- fish -->KE MO<!-- now -->NEY FA<!-- account -->ST!

    Content filters may need to get a bit trickier (eg by parsing HTML).
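
    One way a content filter might cope, as a rough sketch only: strip HTML comments before splitting the text into tokens, so the broken words are rejoined. The function names and the tokenizer pattern below are assumptions made for illustration, not anything described in the article.

```python
import re

# Matches an HTML comment, including one that spans several lines.
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_html_comments(text):
    """Remove HTML comments so words split by them are rejoined."""
    return HTML_COMMENT.sub("", text)


def tokenize(text):
    """Hypothetical tokenizer: lowercase runs of letters, digits, ' and $."""
    return re.findall(r"[a-z0-9'$]+", strip_html_comments(text).lower())


sample = "MA<!-- fish -->KE MO<!-- now -->NEY FA<!-- account -->ST!"
tokens = tokenize(sample)  # the three rejoined words, lowercased
```

    A regular expression is only a stopgap here; a filter that wanted to be robust against malformed markup would need a real HTML parser.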

    1. Re:Latest spammer trick by metachimp · · Score: 1

      As the article states:
      I scan the entire text, including headers and embedded html and javascript, of each message in each corpus.

      So even if the spammer is using that tactic, this implementation would at least catch that. What effect doing this might have on the statistical analysis, I don't know.

      --
      The system has failed you, don't fail yourself. --Billy Bragg
    2. Re:Latest spammer trick by cpeterso · · Score: 2


      This would make spam detection EASIER! Sure, the first spam written like that would probably pass through undetected. But when the user hits the (mythical) delete-as-spam button, then the filter would recognize that of the emails containing phrases like "MAKE", 100% of them were spam and 0% were real emails. When the second "MAKE" email arrives, it will be immediately detected as "fishy". ;-)

    3. Re:Latest spammer trick by 40000 · · Score: 1

      Just class anything with HTML tags in it as junk mail :)
      The repeated use of ! would also give it a high "spam value" if it was put through a plain text filter.
      How far could we take content filtering? I only want to get nice, friendly e-mails. All the others should have the sender's address automatically added to a spammers database before the message even reaches my inbox.

    4. Re:Latest spammer trick by BigBadaboom · · Score: 1

      Yes, but does he parse the HTML and discard the comments?

    5. Re:Latest spammer trick by metachimp · · Score: 1

      Well, seeing as how comments in HTML *are* HTML, I would assume that those are parsed as well, since HTML comments are sent to the clients. He didn't say "I parse only HTML that is rendered." That would also leave out Hidden inputs, etc. The comments are, in fact, valid parts of the language.

      --
      The system has failed you, don't fail yourself. --Billy Bragg
  178. spam, delete old mail, mailinglists by gnugnugnu · · Score: 1

    Space is cheap

    Don't discount the fact that most people are working off hotmail/yahoo/webmail/company accounts with restricted quotas.
    It would not take very long to fill 50 megs with just email (not including attachments).

    If you are on various mailing lists you really will need to delete some mail occasionally.

    If you are on an open mailing list it is especially annoying when your spam filter lets mail through because the list is not a spammer but the real sender is.

  179. This can easily be circumvented! by Anonymous Coward · · Score: 0

    The easiest way to circumvent this is obviously to just put a lot of innocent words at the bottom of the email after the sales pitch. That way they can counterbalance the bad probability readings of all their market talk. Or even better, put all the market talk in a JPEG/GIF and then add a few innocent words at the end in white text.

    It is a good idea, but spam still has a way to get around it.

  180. Use a distributed intelligent network. by Moderation+abuser · · Score: 2

    Yup. Use the intelligence of hundreds of thousands of fellow spam haters across the internet.

    Vipul's Razor: http://razor.sourceforge.net/
    Pyzor: http://pyzor.sourceforge.net/
    DCC: http://www.rhyolite.com/anti-spam/dcc/

    Yes, they do work. No, spammers can't get round them just by changing formatting or including random characters.

    --
    Government of the people, by corporate executives, for corporate profits.
  181. Re:Defeatable-Randomness. by Anonymous Coward · · Score: 0

    Problem is that its "randomness" would give it away. The main message wouldn't be so random (a necessary part of all languages) AND legitimate messages wouldn't have a random and a non-random part (why should they?). The only flaw I can see is how much of a threshold does there need to be to prevent "false positives"?

  182. Re:Circumvent by mariube · · Score: 1

    It would not work because after a thousand spams in English, a legitimate English mail arrives and gets marked as spam. And as I said, "SENDING ME KOREAN MAIL IS POINTLESS ANYHOW". That means IT DOESN'T MATTER. gah.

  183. Re:Using the algo on Slashdot posts-breakout by Anonymous Coward · · Score: 0

    Yes it could. There's one disadvantage I see. Since the human is part of the loop, the "message" is going to hit an eyeball in order to be judged as spam. That represents a window for a spammer. That naturally will decrease as the filter becomes more efficient. However by manipulating the 'message'[1], a spammer can increase the chances that his 'message' will hit an eyeball. The proverbial 'cat & mouse' game will ensue. It's going to be an interesting battle.

    [1] Remember in language there's more than one way to say the same thing.

  184. You mean bernie. by www.sorehands.com · · Score: 2

    Bernie is a moron spammer.

  185. Spam should be expensive. by bbc22405 · · Score: 1

    Do not bother making spam illegal. Make spam cost money. Make all unwanted email cost money.

    How? Here's an idea. (Disclaimer: I haven't spent too much time thinking about this.) All email must come with an "electronic stamp", or some equivalent thing that costs the sender money, or computer time, or something. Make it possible for the recipient to "refund" the sender, or otherwise not charge the sender. Now, tie a spam filter into this, so that wanted email automatically gets sent refunds, and unwanted email automatically does not.

    Result? Mail to/from your friends is free. Mailing lists may cost slightly more, if list members sometimes fail to refund the list maintainer. In the unlikely event that you email a sociopath, they will earn the postage from a single email from you. In the unlikely event that you send email to a friend and it is eaten by the spam filter (i.e., a false positive) you will notice the lack of refunded postage, and surmise that your letter never got through and react accordingly (or surmise that your "friend" has turned into a cheapskate.) Otherwise, this change will cost you little or nothing. Spammers, on the other hand, will go out of business rapidly. "Opt in" lists really will be opt in, and the first or second or nth time you decline to refund their postage, perhaps they will count that as you wanting to be off their list. The amount of postage should I guess be something comparable to the rates the US Postal Service charges for bulk mail and presorted first class mail. Perhaps $.25 would be enough. Perhaps you'd require higher postage for people you'd never conversed with before. Mail that has insufficient postage can result in "insufficient postage" notification to the sender; the recipient is not shown the email until sufficient postage is provided.

    Of course, woe to you and your wallet if somebody hijacks your account and sends out 1,000,000 emails allegedly from you... But maybe, like an ATM machine, your "electronic stamp" vendor knows not to sell you more than $5 of unrefunded stamps per day, and automatically telephones you, or cuts you off, if you send more than your limit. (Still, that would make hacking profitable. Bad. Maybe the destination of the postage must be traceable, and the recipient must be liable for refunding if a crime was involved in the sending.)

    I suppose our spam filters still might get spam from politicians and corporations. For people using spam filters, it will just be money that we can take to the bank. For people without spam filters, but with the sense to press the "no refund" button on the mailer, they will still get to keep the postage, though they will have earned it.
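
    To make the flow of money concrete, here is a toy sketch of the postage-and-refund idea. Everything in it is hypothetical: the class, the method names, and the single trusted ledger are assumptions made for illustration, and only the $.25 figure comes from the post above.

```python
class StampLedger:
    """Toy ledger for the postage-and-refund idea; all names are hypothetical."""

    def __init__(self, postage_cents=25):
        self.postage = postage_cents
        self.balances = {}  # account -> balance in cents
        self.pending = {}   # message id -> (sender, recipient)

    def deposit(self, account, cents):
        self.balances[account] = self.balances.get(account, 0) + cents

    def send(self, msg_id, sender, recipient):
        """Debit the sender; the stamp is held until the recipient decides."""
        if self.balances.get(sender, 0) < self.postage:
            raise ValueError("insufficient postage")
        self.balances[sender] -= self.postage
        self.pending[msg_id] = (sender, recipient)

    def settle(self, msg_id, wanted):
        """Refund the sender for wanted mail, otherwise pay the recipient."""
        sender, recipient = self.pending.pop(msg_id)
        payee = sender if wanted else recipient
        self.balances[payee] = self.balances.get(payee, 0) + self.postage


ledger = StampLedger()
ledger.deposit("friend", 100)
ledger.deposit("spammer", 100)

ledger.send("m1", "friend", "me")
ledger.settle("m1", wanted=True)    # the spam filter refunds wanted mail

ledger.send("m2", "spammer", "me")
ledger.settle("m2", wanted=False)   # unwanted mail: recipient keeps the stamp
```

    After these two messages the friend is back at the starting balance, while the spammer is one stamp poorer and the recipient one stamp richer.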

    --- Ben Chase

    1. Re:Spam should be expensive. by Anonymous Coward · · Score: 0

      Users who subscribe to mailing lists should be able to give it some tagged stamps so that list will be able to mail them without incurring any "costs". I run a number of lists for various free software projects, and this is the only way that it would be able to work.

      The only trick would be that they would have to periodically "refill" the store of stamps on the list server, or the list server would have no other option but to drop them. I'm not going to spend MY stamps to deliver something that THEY want to see.

      Think of it as the SASE method applied to e-mail.

    2. Re:Spam should be expensive. by bbc22405 · · Score: 1
      Ah, yes, to subscribe to a list, the subscriber would need to send an email with a largish amount of postage (double the customary amount) to the list maintainer, who would not refund that postage. This postage is held as a deposit.

      Then, the list starts sending to the subscriber, and the subscriber starts automatically refunding the postage on each email that the list sends to the subscriber.

      If the subscriber fails to refund postage on a single email sent by the list, that consumes the first stamp.

      The second stamp is then immediately consumed by the list server sending an automatic email to the subscriber. This automatic email informs the subscriber that he has been dropped from the list's membership, and why, and how to join the list again.

      A normal, correct "unsubscribe" request would result in a "you have been dropped from the list" message with your unused deposit attached as another larger-than-normal postage. (Of course, the subscriber's spam filter is likely to automatically refund that postage... But hey, shouldn't a list maintainer get a small perk every now and then?)

      This mailing list scheme hinges on good email connectivity, and spam filtering agents that always promptly refund postage for legitimate email from the list. Otherwise, list subscribers would need to post a deposit ample enough to pay for the largest backlog of unread/undelivered mail that they would ever expect to be sent by the list. So poorly connected subscribers need spam filtering agents always runnable at their ISPs.

      If somebody breaks into your list, and sends spam to all your subscribers, the subscribers (actually, their spam filters) will fail to refund the postage on that spam email, and your list server will then boot them from the list. Oops. Don't let that happen, and have the subscribers post a large enough deposit to survive several of those glitches. If their deposits are dwindling, somehow warn them.

      This mailing list problem can get tricky. Not all list maintainers can just punt the subscribers who do not refund postage. For example, consider a corporate mailing list for all employees of a company? The employer doesn't really ever want to remove employees from the all-employees mailing list (otherwise, it would cease to be an all-employees mailing list). But sometimes spam does get sent through corporate lists. Does the employer let employees accumulate non-refunded postage when this happens? Does it automagically get deducted from the next paychecks?

  186. TMDA? by Bat_Masterson · · Score: 1

    An alternative approach is to automatically ask any unrecognized email addresses if they belong to a real person. TMDA does this for all non-whitelisted email addresses. The idea is that spammers do not put real email addresses on their spam, so will not be able to respond to a request for authentication. If the emailer doesn't respond to the authentication request, then TMDA blacklists the address for the future. Result -- no spam.
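
    The challenge-response flow described here could be sketched roughly as follows. This is not TMDA's code or interface; the class and method names are made up for illustration, and a real system would also have to send the challenge message and time out unanswered ones.

```python
class ChallengeFilter:
    """Hypothetical challenge-response filter; not TMDA's actual interface."""

    def __init__(self, whitelist=()):
        self.whitelist = set(whitelist)
        self.blacklist = set()
        self.held = {}  # sender -> messages waiting for confirmation

    def receive(self, sender, message):
        if sender in self.blacklist:
            return "rejected"
        if sender in self.whitelist:
            return "delivered"
        # Unknown sender: hold the mail and ask the sender to confirm.
        self.held.setdefault(sender, []).append(message)
        return "challenged"

    def confirm(self, sender):
        """The sender answered the challenge: whitelist and release the mail."""
        self.whitelist.add(sender)
        return self.held.pop(sender, [])

    def expire(self, sender):
        """No answer arrived: blacklist the address and drop the held mail."""
        self.blacklist.add(sender)
        self.held.pop(sender, None)


f = ChallengeFilter(whitelist={"friend@example.org"})
first = f.receive("stranger@example.org", "hello")
released = f.confirm("stranger@example.org")
second = f.receive("stranger@example.org", "hello again")

f.receive("bulk@example.net", "BUY NOW")
f.expire("bulk@example.net")
third = f.receive("bulk@example.net", "BUY MORE")
```

    The `expire` path treats any sender that never answers as a spammer, whether or not a human is behind the address.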

    1. Re:TMDA? by mla_anderson · · Score: 1

      And the auto reply from the last place you bought online will be rejected and the site blacklisted.

      --
      Sig is on vacation
  187. Re:Ok, that is hot....NOT! by Anonymous Coward · · Score: 0

    "the basic premise of the filter is that the spammer HAS to tell you what he's selling, and in the process of doing that, gives himself away as a spammer. "

    True, however there's more than one way to say "Hot, lusty babes at my site". Two, his filter has no concept of location. Is the biasing part of the spam before or after the message? Human 'pattern recognition' coupled with a deep dictionary allows us to spot such deceptions.

  188. Dealing with images by dacarr · · Score: 1

    The solution is simple. All images should be sent to a secondary address (a receptacle, for the purpose of this), and this address is NOT public to anybody but those who are authorized to send attachments; accordingly, any attachments sent to the primary address just get bounced.

    --
    This sig no verb.
  189. It would look at the header by TimFreeman · · Score: 1
    I think his algorithm would decide that all of the non-words like "xClick" were uninteresting. The most interesting words would probably be in the header. This would still give a decent chance of recognizing the spam, since spammers tend to use a host to send multiple spams.

    Hmm, the next step in the arms race would be to reject a mail that has too many words that have never been seen before.
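
    A minimal sketch of that next step, assuming the filter keeps a set of every token it has seen so far. The function names and the 0.5 cutoff are invented for illustration; nothing here is from the article.

```python
def unseen_fraction(tokens, known_tokens):
    """Fraction of a message's tokens that the filter has never seen."""
    if not tokens:
        return 0.0
    unseen = sum(1 for token in tokens if token not in known_tokens)
    return unseen / len(tokens)


def too_unfamiliar(tokens, known_tokens, threshold=0.5):
    """Flag a message when more than `threshold` of its tokens are new."""
    return unseen_fraction(tokens, known_tokens) > threshold


known = {"meeting", "tomorrow", "lunch"}
mostly_new = ["xclick", "qzv", "wkrp", "lunch"]
half_new = ["xclick", "qzv", "lunch", "meeting"]
```

    The cutoff would have to be tuned per user, since a first message from a new correspondent in an unfamiliar field would also score high.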

  190. Really cool by dh003i · · Score: 2

    It'd be great now if he offered an implementation which we could all use.

    I think a progressive, ever-going implementation is best. I also think it's best to filter based on headers first, and not download any spam (to save bandwidth) and then filter based on message content (for the messages downloaded) and move any spam to a spam folder.

    Then the user simply looks at the spam folder and looks for false-positives, and marks them as "legit". Then the Bayesian filter recalculates.

    Same thing for false negatives, and for the messages not downloaded. The user can look at the headers of the messages not downloaded and say if they're spam. Then the Bayesian filter recalculates.

    Another good thing to do is to give a "password" to your friends for them to put in headers sent to you. I.e., 13y4890dshfpljk2134y9073254y32p9ur. Any message with that in the header would be given a 0% probability of being spam, as only those you gave that to would know to put it in the header. Should it become compromised, you can change it (or just don't give it to people who might compromise it).

    Back to the Bayesian filter, another good thing might be to have varying levels of "spam". I.e., if something is almost certainly spam (i.e., 99.99999999% likely to be spam, as would a message with the header "Get fucked for free and make lots of $$$$$"), it would be placed in a DEFINITELY SPAM FOLDER. Other things would be placed in a "PROBABLY SPAM FOLDER". Etc.
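
    The tiered folders could be as simple as a pair of cutoffs on the filter's output probability. The folder names and the threshold values in this sketch are placeholders chosen for illustration, not recommended settings.

```python
def choose_folder(spam_probability, definite=0.999, probable=0.9):
    """Map a spam probability to a folder; thresholds are illustrative only."""
    if spam_probability >= definite:
        return "definitely-spam"
    if spam_probability >= probable:
        return "probably-spam"
    return "inbox"


folders = [choose_folder(p) for p in (0.9999, 0.95, 0.2)]
```

    Only the "probably" folder would then need regular checking for false positives.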

    Anyways, Bayesian Analysis is a really great method.

    If you're interested in Bayesian Analysis, there's a great phylogeny program which gives you (basically) a bootstrapped maximum likelihood tree (calculated from millions of trees) via Bayesian Analysis: MrBayes.

  191. Re:The problem is the existing email infrastructur by pngwen · · Score: 1

    This is NOT the fault of SMTP. (RFC 2821)

    SMTP is only a protocol for the transporting of messages. The format of the message is irrelevant. All that is required in the message is that the server knows who the message is going to. The from address given in SMTP is not the one that you see in your browser. It is simply used for logging purposes, and was originally intended as a way for sites to help debug each other's mail servers.

    The real culprit that allows the headers to be faked is the arpanet message formatting standard. (RFC 2822). SMTP messages are defined as a block of 7bit characters. It's the messages themselves that allow the exploits, not the SMTP portion itself.

    --
    I am the penguin that codes in the night.
  192. Why Bayesian Analysis isn't so hot by TWR · · Score: 2
    IIRC, there is one huge problem with Bayesian analysis: recalculation. Unlike a neural net, there is no "backprop" correction process. Once you walk your data set, you have fixed values for analysis. If you want to update the values (new spam words!), you need to re-process all of your mail again. You need to keep all of your spam around, as well as non-spam, just so you can constantly update your filters. Ick.

    Is there a shortcut that I'm missing?

    -jon

    --

    Remember Amalek.

    1. Re:Why Bayesian Analysis isn't so hot by mla_anderson · · Score: 2, Insightful
      If you keep the two original hashes along with the probability hash you can simply update the word count of the two originals and rebuild the probability hash. This could be fairly simple.

      1. Mail arrives
      2. Mail is scanned
      3. Good/Bad hash is updated
      4. Mail is delivered (if necessary)
      Then at the end of the day regenerate the probability hash.
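
      The steps above might look something like this sketch. The class and method names are hypothetical, and the per-token formula (a clamped ratio of spam frequency to total frequency) is a simplifying assumption rather than the exact calculation in the article.

```python
from collections import Counter


class TokenStats:
    """Running token counts, kept so old mail never has to be re-read."""

    def __init__(self):
        self.good = Counter()  # token -> occurrences in legitimate mail
        self.bad = Counter()   # token -> occurrences in spam
        self.n_good = 0        # legitimate messages seen
        self.n_bad = 0         # spam messages seen

    def record(self, tokens, is_spam):
        """Steps 2 and 3: fold one scanned message into the counts."""
        if is_spam:
            self.bad.update(tokens)
            self.n_bad += 1
        else:
            self.good.update(tokens)
            self.n_good += 1

    def rebuild(self):
        """End-of-day step: derive the probability hash from the counts alone."""
        probs = {}
        for token in set(self.good) | set(self.bad):
            g = self.good[token] / max(self.n_good, 1)
            b = self.bad[token] / max(self.n_bad, 1)
            # Clamp so that no single token counts as absolute proof.
            probs[token] = max(0.01, min(0.99, b / (g + b)))
        return probs


stats = TokenStats()
stats.record(["free", "money"], is_spam=True)
stats.record(["lisp", "money"], is_spam=False)
probs = stats.rebuild()
```

      Only the two count tables and the two message totals need to be stored; the mail itself can be deleted once it has been scanned.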
      --
      Sig is on vacation
    2. Re:Why Bayesian Analysis isn't so hot by TWR · · Score: 2
      I thought of that, but I'm not sure that you're right (hence my posting). Again, I'm a bit rusty on my Bayesian work, but I don't think it's quite that simple.

      If you are doing real Bayesian analysis, you need to keep track of which words (tokens, really) appear TOGETHER in a message, and, when those words all appear, what percent of the time the message is spam. You also need to evaluate sub-sets of the list of tokens. Since there are quite a lot of tokens per message, you hit a combinatorial explosion for even a short message. That's a lot of info to keep around.

      I think that Graham is using a short-cut, and is simply multiplying frequency analysis of single tokens (no combinations of tokens) together.
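
      For reference, combining single-token probabilities under an independence assumption can be sketched as below. The function name is made up; the two input values are the "sex" and "sexy" figures quoted from the article near the top of this thread, and the result should come out close to the 99.97% quoted there.

```python
from math import prod


def combine(probabilities):
    """Combine per-token spam probabilities, treating tokens as independent."""
    spam = prod(probabilities)
    ham = prod(1.0 - p for p in probabilities)
    return spam / (spam + ham)


combined = combine([0.97, 0.99])  # about 0.9997
```

      Because each token is treated on its own, nothing about co-occurrence has to be stored, which is what avoids the combinatorial explosion.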

      Or, as I said before, I could be completely forgetting my statistics.

      -jon

      --

      Remember Amalek.

    3. Re:Why Bayesian Analysis isn't so hot by mla_anderson · · Score: 1

      I'm not an expert on Bayesian analysis, however my procedure will work for his algorithms whether or not they are actually good Bayesian analysis.

      --
      Sig is on vacation
  193. Who needs advanced technology? by Dthoma · · Score: 2

    I just invented this great spam filter! It counts the number of people in the cc: field! Then I multiply it by 10, and that's the percentage chance it gets chucked! Only 14% as much spam gets through, with NO false positives!

    --

    Note to M1-ers: a curt but otherwise insightful message is not "Flamebait" or "Troll".

  194. Paul Graham on Fighting OOP by Tablizer · · Score: 2

    Paul's
    Why Arc is not especially object oriented

    I would personally like to see Paul Graham spend even more time fighting OOP than spam. The second one is a lost cause arms-race IMO.

    Here is something that rang true with me on his OOP musings:

    Object-oriented programming is like crack for these people: it lets you incorporate all this scaffolding right into your source code. Something that a Lisp hacker might handle by pushing a symbol onto a list becomes a whole file of classes and methods.

    I think using databases (properly) is the same way: a single relational formula does most of the work of a bunch of classes and "hand-indexing" these classes and methods together. (AKA GOF-math)

    OOP hard-wires the "noun structure model" into the code (what Paul calls "scaffolding"). LISP and relational techniques tend to use *formulas* to manage these instead of physical code structure. IOW, we don't build structures, we order the information to build *itself* into the needed structures. (OO has the concept of "self-handling nouns", but it lacks the concept of self-handling structures, or interlinks, between those nouns.)

    It is less disruptive to change a formula than to change the physical structure of the code.

    OOP fans spend too much time looking for "the proper pattern or model". If you do it right, there is no one proper model or structure: it is virtual views that you create on an as-needed basis and can change on an as-needed basis without a bunch of code rework. You can also have multiple different views without them stepping on each other.

    OOP creates code and work that is unnecessary and fragile.

    (oop.ismad.com)

  195. Re:Read the First Ammendment much? by Anonymous Coward · · Score: 0

    You have the right to speak. But you don't have the right to make me pay for the message, nor to listen to it.

    "The most important right is the right to be left alone"

  196. Re:Non-Boolean buckets by Tablizer · · Score: 2

    (* I'm not sure what the benefit would be to having a few words from the text. For me (and most likely other people as well), that is enough of an inconvenience that I may as well just scan through the entire email. *)

    The point is to make it easier to eye-scan if you are worried about false positives. It helps by: 1. Making it easier to review many messages, and 2. Ranking so as to not check the flagrant ones if desired.

  197. Re:Too bad! Patented By Microsoft by woyouwenti · · Score: 1

    If the MS patent/approach is so good, why did they give up on it and adopt Brightmail for MSN and Hotmail?

    Apple also has a similar, albeit more "theoretically correct" probabilistic anti-spam filter using latent-semantic indexing. Mossberg claims he's getting a 95% catch rate in the WSJ.

    A

  198. Re:Circumvent by bedessen · · Score: 3, Interesting

    His algorithm works because spam uses the same repetitive syntax. Because so many spam/emails are sent out - it can be flagged by pattern recognition... based on the assumption that it is written in English!

    Huh? Where do you get that? The algorithm has NO KNOWLEDGE of syntax or structure. It knows only the presence (or absence) of words in the message, nothing of how they are grouped, positioned, ordered, related, structured, etc. There is zero grammar / pattern recognition as far as I can tell. As long as your corpus or database of reference mail is in the same language as the emails you wish to test, then the algorithm would work just fine. Perhaps you were thinking it used Markov chains?

  199. Foolproof way of eliminating spam right here by Shamashmuddamiq · · Score: 1
    ...don't reply to it. If nobody bought spam-marketed wares, the spam would stop because it simply wouldn't be profitable.

    As long as we have people on this earth that are actually stupid enough to watch Jerry Springer, convert to Mormonism, or buy penis enlargement pills, there will be lots of lame talk shows, moron dook knockers, and spam. And you will receive some of it.

    For every genius that comes up with a cool new way to filter spam, there are thousands of idiots ordering up their first spam-marketed item. All you can do is try to ignore as much of it as possible. Filter, but don't expect to get rid of spammers and regain the resources they waste.

    --
    ...just my 2 gil.
    1. Re:Foolproof way of eliminating spam right here by Hoi+Polloi · · Score: 2

      Sorry but you can't legislate common sense. Believe me, it's been tried.

      --
      It is by the juice of the coffee bean that thoughts acquire speed, the teeth acquire stains. The stains become a warning
  200. Not a bad idea but... by Anonymous Coward · · Score: 0

    ...it won't stop spam hosted offsite (i.e. the spam loads the HTML from elsewhere) or spam consisting of graphic images hosted elsewhere. They don't contain any HTML that would trigger such a filter.

    I'm noticing a lot of spammers moving to this in order to evade keyword filters.

    Anyway, I like spam. At $500 an email, chasing spammers is a profitable pastime. Make money AND perform a social good! (and learn lots about the legal system!)

  201. Might not work for my spam... by T.E.D. · · Score: 2

    Most of the spam I seem to get is in non-alphabetic character sets (Korean/Japanese/Chinese, I'm not sure, I can't read it). I guess I hit the VIA support site in Taiwan one too many times or something.

    I don't know much about that character set, but I suspect they don't use the same separator characters that his filter is looking for to separate its tokens.
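
    The separator worry above can be made concrete with a rough sketch. The ASCII character class below is an assumption loosely modeled on a tokenizer that treats alphanumerics, dashes, apostrophes and dollar signs as token characters; it is not taken from any actual filter. The character-bigram fallback is a swapped-in, hypothetical workaround for text without word separators, not something the article proposes.

```python
import re

# Assumed token definition: runs of ASCII letters, digits, ', $ and -.
ASCII_TOKEN = re.compile(r"[A-Za-z0-9'$-]+")


def ascii_tokens(text):
    """Split text into tokens using the ASCII-only character class."""
    return ASCII_TOKEN.findall(text)


def char_bigrams(text):
    """Hypothetical fallback: overlapping two-character tokens, ignoring spaces."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(pair) for pair in zip(chars, chars[1:])]


if __name__ == "__main__":
    english = "Free money $$$ now"
    japanese = "無料で今すぐ登録"
    print(ascii_tokens(english))    # word-like tokens
    print(ascii_tokens(japanese))   # nothing for the filter to count
    print(char_bigrams(japanese))   # bigrams give it something to count
```

    With a tokenizer like the first one, a message written entirely in such a character set produces no body tokens at all, so only the headers would be left to classify on.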

  202. Re:The problem is the existing email infrastructur by dmelomed · · Score: 1

    Connect to any server on port 25 (the SMTP port), and fake envelope senders all you want. Cross-subscribe mailing lists all you want. SMTP wasn't designed with authentication and security in mind at all. Furthermore, it is darn slow. Granted it's not SMTP's fault only. It's the architecture's fault. I should have been more generic.

    Parsing email messages themselves is a pain in the ass too.

  203. OFF TOPIC by Anonymous Coward · · Score: 0

    Gee, a bit OFF TOPIC wouldn't you say?

    1. Re:OFF TOPIC by Tablizer · · Score: 2

      (* Gee, a bit OFF TOPIC wouldn't you say? *)

      The topic is partially about Paul Graham, and he holds the view that OOP is oversold.

      It is a "grey area" WRT topic relevancy, but not black.

      Usually a particular moderator that believes the superficial OOP cliches will knock it down a point or two. He/she/it must be on vacation today.

  204. I suppose DoS attacks are a "choice" too? by Sabu+mark · · Score: 1

    What's the difference between spam and denial-of-service attacks? A spammer does nothing but send many unsolicited packets, just like a DoS perpetrator. If receiving spam is a "choice," as you say, then getting DoSed is also a "choice," isn't it?

    Make up your mind: Either spam is illegal, or DoS attacks are legal. There is no basis for treating one differently from the other.

    --

    What Would Jesus Do
    (for a Klondike bar)?
  205. Re:Too bad! Patented By Microsoft by T.E.D. · · Score: 2

    They also have a paper from 1998 describing it here

  206. I have a real problem with this... by Hoi+Polloi · · Score: 2

    This wouldn't work for me anyway since my personal correspondence frequently contains the words "sex" and "sexy" not to mention "stud muffin".

    --
    It is by the juice of the coffee bean that thoughts acquire speed, the teeth acquire stains. The stains become a warning
  207. TDMA cannot handle mailing list by Anonymous Coward · · Score: 0

    You have to either allow the whole mailing list or block it. Either way, it is completely useless against spam sent through the list.

  208. What a pain! by Xesdeeni · · Score: 1

    So I have this big database on my machine based on my own e-mail? If my machine crashes, I have to start all over? And when the SPAMmers figure out they can send an innocent-looking e-mail with embedded SPAM images, then where are we?

    So I'll make my suggestion to eliminate spoofed-address SPAM again:

    1. Sending mail server generates a content key based on the contents of an e-mail being sent.
    2. Sending mail server uses this key with a private key to create a public key.
    3. Sending mail server sends the e-mail, along with the public key to the receiving server.
    4. Receiving mail server generates a content key from the e-mail contents.
    5. Receiving mail server sends the content key and the public key back to the sending mail server.
    6. Sending mail server uses its private key plus the content key to re-generate the public key.
    7. Sending mail server compares the public key to the one sent by the receiving mail server.
    8. If the keys match, the receiving mail server allows the mail to enter the recipient's mailbox.
    9. If the keys don't match, the mail is bounced.
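
    The nine steps above can be sketched roughly as follows. What the steps call a "public key" behaves more like a keyed message authentication tag, so this sketch swaps in HMAC-SHA256 for it; that choice, the class and function names, and the direct method call standing in for the network callback to the claimed sending server are all assumptions made for illustration.

```python
import hashlib
import hmac
import secrets


class SendingServer:
    """Stands in for the mail server of the domain named in the envelope."""

    def __init__(self):
        # Step 2 and 6: a secret known only to this server.
        self._private_key = secrets.token_bytes(32)

    def _tag_for(self, content_key: bytes) -> bytes:
        return hmac.new(self._private_key, content_key, hashlib.sha256).digest()

    def sign(self, body: bytes) -> bytes:
        # Steps 1-3: derive a content key and the tag that travels with the mail.
        return self._tag_for(hashlib.sha256(body).digest())

    def confirm(self, content_key: bytes, tag: bytes) -> bool:
        # Steps 6-7: regenerate the tag and compare in constant time.
        return hmac.compare_digest(self._tag_for(content_key), tag)


def receive(body: bytes, tag: bytes, claimed_sender: SendingServer) -> bool:
    """Steps 4-5 and 8-9: ask the claimed sender whether it really sent this."""
    content_key = hashlib.sha256(body).digest()
    return claimed_sender.confirm(content_key, tag)


if __name__ == "__main__":
    server = SendingServer()
    body = b"Lunch on Friday?"
    tag = server.sign(body)
    print(receive(body, tag, server))                      # genuine mail
    print(receive(body, secrets.token_bytes(32), server))  # spoofed tag
```

    Only the content key and the tag cross the wire on the callback, which matches the "minimum exchange of keys" goal; a spoofer who does not control the claimed server cannot produce a tag that it will confirm.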

    This should eliminate spoofed e-mail, which is the only type I get. This technique also keeps the second transaction to a minimum exchange of keys. The keys add traffic, but the eliminated SPAM traffic more than makes up for the penalty. As more and more mail servers are updated with this feature, spoofing is all but eliminated. The remaining "spoofable" domains can be explicitly severed from the net or blocked.

    Xesdeeni

    1. Re:What a pain! by DrVxD · · Score: 2

      > If my machine crashes, I have to start all over?
      No, you just restore from your backups. You do DO backups, don't you?

      --
      Not everything that can be measured matters; Not everything that matters can be measured.
  209. Re:Circumvent by Anonymous Coward · · Score: 0

    Two things:

    (a) Parody spam from your friends would probably make it through the filters, since the headers of the message indicating it was coming from a frequent non-spam sender would be too strong to make the contents of the message trip the filter.

    (b) Parody spam from your friends would no longer be funny if you never received spam, so it might as well get deleted anyway.

  210. Re:Too bad! Patented By Microsoft by Anonymous Coward · · Score: 0

    This kind of research has been going on since the early '90s. MS is not the only one; I think they started somewhere around '96-'98. There are many people doing the same all over the world. Is it legal to patent such a thing so broadly while someone else is releasing it as public information?

    The term "probabilistic classifier" covers just about every classification algorithm one way or another.

    One might wonder why no one has been using it on a large scale: according to the results from many different people, the highest accuracy is about 90%, and that is already tweaked with word/phrase weighting. Also, everyone will get different results, and in the beginning it isn't that good until it has had a lot of training. If you are seeing much higher accuracy, it just means your data set is smaller than you think.

  211. Great way to track spam and prevent it by kchris59 · · Score: 1

    The best solution in my opinion is using sneakemail.com.

    Sneakemail is a free service that you can use to generate disposable email addresses.

    These "sneak email" addresses are aliases of your real address, which is kept hidden.

    You can enter these Sneakemail addresses into web forms or use them to contact e-businesses without the risk of your real address being abused or bought and sold.

    Consider each Sneakemail address as an informal agreement between you and an online business or organization.

    You agree to allow them to contact you through this address, and they in turn, by accepting and using this address, agree not to abuse this privilege by sending you unwanted solicitations or to give or sell your address to others.

    If they abuse this privilege, by using Sneakemail, you have more control.

  212. Backs up what Declan McCullagh said by duras · · Score: 1

    This was an excellent article, and gives me great hope that through technological measures we can finally kill spam.

    I'm reminded of what Declan McCullagh said in his recent editorial. Through writing code, not necessarily lobbying for more perfect laws, we can overcome some of the obstacles we face online.

    Graham makes a bunch of excellent points about how more perfect spam filtering will eliminate spammers for economic reasons. As we've seen, political and legal methods don't work.

  213. P2P for fighting spam? by incog8723 · · Score: 2, Interesting

    Maybe the concept of a P2P network could be harnessed in order to fight spam. Each message tagged as actual spam by a real human could be identified by a ridiculously large CRC (1024 bit or something, to rule out possibly tagging innocent mail), and the CRC could be traded via the P2P network. Automatic updating, almost instantly. A client could be written in about 2k of code.
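
    A minimal sketch of the fingerprint-sharing part, with no networking at all. It swaps in SHA-256 for the large CRC mentioned above (an assumption; any wide digest would do for illustration), and the class name, the whitespace/case normalization, and the in-memory set standing in for the P2P network are hypothetical.

```python
import hashlib


def fingerprint(message: str) -> str:
    """Digest of the message after collapsing whitespace and case."""
    normalized = " ".join(message.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class SharedSpamList:
    """In-memory stand-in for the set of digests traded between peers."""

    def __init__(self):
        self._digests = set()

    def report(self, message: str) -> None:
        # Called when a human tags a message as spam.
        self._digests.add(fingerprint(message))

    def is_known_spam(self, message: str) -> bool:
        return fingerprint(message) in self._digests


if __name__ == "__main__":
    peers = SharedSpamList()
    peers.report("Buy   NOW and save!")
    print(peers.is_known_spam("buy now and save!"))    # same text, same digest
    print(peers.is_known_spam("buy now and save!!"))   # one extra character
```

    The second lookup shows a limit of exact digests: changing a single character in each copy gives every copy a different fingerprint.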

    Interacting with the email client would be another story, but just an idea.

    The only problem I can think of would be sabotage. Anyone could tag legitimate mass mailings as spam (such as a mailing list).

    Any comments on this idea?

  214. Possible attack by Anonymous Coward · · Score: 0

    Why can't a spammer just append a "normal" looking email to the spam message? Then all but the spam message -- which can be a small part of the email -- will look statistically "good." Perhaps just using words is not the solution, but other attributes of the message (say, structure or whitespace). Still, I think it's a good approach, and I think statistical analysis on the header is great!
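
    How much padding helps depends on how the filter picks its evidence. Here is a rough sketch, assuming the filter scores a message using only the N tokens whose spam probability is furthest from a neutral 0.5 (N = 15 here) and combines them as if they were independent. The per-token probabilities are made-up numbers, not measurements from any corpus.

```python
def most_interesting(token_probs, n=15):
    """Keep the n probabilities furthest from the neutral value 0.5."""
    return sorted(token_probs, key=lambda p: abs(p - 0.5), reverse=True)[:n]


def spam_probability(token_probs, n=15):
    """Combine the most interesting tokens, treating them as independent."""
    spam = ham = 1.0
    for p in most_interesting(token_probs, n):
        spam *= p
        ham *= 1.0 - p
    return spam / (spam + ham)


if __name__ == "__main__":
    spammy = [0.99, 0.98, 0.97, 0.99]        # hypothetical sales-pitch tokens
    bland_padding = [0.45, 0.5, 0.55] * 100  # hypothetical everyday words
    print(spam_probability(spammy))
    print(spam_probability(spammy + bland_padding))
```

    Under these assumptions, bland padding barely moves the score because near-neutral words carry almost no weight. Padding would only help the spammer if it consisted of words strongly tied to the recipient's own legitimate mail, which the spammer has no easy way of knowing.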

  215. Ahhhh, spam, the other lunch meat by Anonymous Coward · · Score: 0

    I saw some of the posts here on forged headers. I'm a newbie Linux user. Instead of using linux@local as my machine name, I decided to give it a name. So I named it after the computer in 2001, a space odossey (I'm sure I mangled that), and a year. But since it wasn't a fully qualified domain name, which I don't understand yet, my email headers say something to the effect of: not name of machine, not a fully qualified domain name, message may be forged.

    Now that I have more than one box running Linux, and serving web sites, it's a little difficult going back to linux@local, and getting rid of the "forged domain" or whatever message. I have one box serving one web site with Apache, but no email service yet because I haven't studied Sendmail or other email applications. I have another box serving another site, same situation. These boxes are only serving one site right now, so I can give them a fully qualified domain name of the site, but I will be switching them to virtual hosting as soon as I can get that to work. This will preclude me from using one of the domain names as the fully qualified domain name. So I am currently stuck with email message headers that identify my emails (kmail in Linux and Webmail (ISP-provided remote email login on Windows)) as using forged headers, which they are not.

    Some of my emails are being sent to /dev/null or whatever by people whose system or network uses tight filtering rules due to this FQDN issue. But that's something I'll have to live with.

    Not every email with a "forged" header or domain, or one that does not resolve (I'm behind a Linkie NAT firewall on one of my boxes, the workstation is invisible to the net) is spam.

  216. Re:But spammers evolve...Smart hits. by Anonymous Coward · · Score: 0

    Well, spammers' techniques will evolve. Maybe using "goal defining" software. For example, a spammer would simply tell his software what goal he's trying for (get people to see the luscious babes at my site). The software would then figure out what combination of words, sentence structure, etc. would be needed to maximize his hit rate on your mailbox.

  217. or you can just fool them by heavyd · · Score: 0

    is it considered good netiquette to go along with the spam and then at some point become irrational with them? and then post the email exchange to a website for others to enjoy?

    i would like to do this.

    --

    Software testers needed for

  218. World wide ban? by L-Train8 · · Score: 2

    Is the spam for Taiwanese products, or just routed through open mail relays in Taiwan? If it's the latter, we could certainly outlaw using spam as a marketing tool for US entrepreneurs. If your company or home business sends out spam from Taiwan to US computers, you would still be breaking the law.

    --

    Don't forget that Friday is Hawaiian shirt day.
    1. Re:World wide ban? by tomknight · · Score: 2
      It appears to be for Taiwanese products. The charset used seems to imply this. Otherwise you've a most valid point.

      Tom.

      --
      Oh arse
  219. No, YOU'RE wrong!!!! And here's why... by DoctorFrog · · Score: 2
    1) Spammers have to pay to produce their spam.

    2) If you don't read the spam, they have no revenue.

    3) You're gaining the valuable benefits of spam without paying for them.

    4) Therefore, not reading spam is STEALING!!!

    Oh, and

    5) ???

    6) Profit!!!

  220. There needs to be anti-porn legislation. by El+Camino+SS · · Score: 2

    We can all talk all day about Spam-busting techniques, but honestly, can we all get together and make sure that our nine year old doesn't get porn mail all the time? Stopping porn spam would really knock the wind out of the sails of all spammers everywhere. I mean, this thing seems like slam dunk legislation. I know that many of you will say that this is a slippery slope of legislation and scream "THINK ABOUT OUR FREEDOMS," but no one wants their children to see pornography.

    Really, all we need is some new-era Tipper Gore to scream the phrase we all hate at a Senate hearing... and no more porn spam:

    "Won't somebody please think about the children?!?"

    The chilling effects of this will be monumental. Why the current Right-Wing U.S. administration hasn't gone after this is totally beyond me. It's a cheap and easy target. It shows that they reinforce family values. I hardly agree with anything super-right wing, but this whole children-looking-at-steaming-hot-teens thing is ridiculous.

    Whether enforced or not, in the United States soliciting pornography to a minor is still very much illegal. I think that the /. crowd can really sell that tagline to our local legislator and put a real strike back in the spam wars.

    1. Re:There needs to be anti-porn legislation. by 40000 · · Score: 1

      Spam is spam is spam... it isn't the content which matters, it's all spam. HTML e-mail is the cause of most problems as there is a limit to how pornographic plain text can be.

    2. Re:There needs to be anti-porn legislation. by Anonymous Coward · · Score: 0

      Less than 25% of my spam is from porn sites--the rest is from casinos, lenders, and snake oil vendors. A content-based spam law would be even worse than the current fraud-based spam laws. It would accomplish nothing other than keeping the next generation as ignorant and screwed up about sex as we were, while reducing pressure that's needed to get a law that actually deters network abuse.

  221. OT: your sig by Anonymous Coward · · Score: 0
    > There's no "I" in Linux.. err..

    You work for VA Software by any chance?

  222. Other uses? by yardbird · · Score: 1
    Seems like you could use this for a lot more than just spam filtering. Couple of ideas: sorting into folders by type (personal/work), urgency, which project it pertains to... In each case, all you need is a corpus of messages with the given characteristic.

    Any other ideas?

    --
    Free, legal music for iTunes users.
  223. Adaboost algorithm much better then bayes by brw215 · · Score: 2, Interesting

    There are several classification techniques in the field of machine learning that are all more powerful than simple naive Bayes. In fact in graduate school I built one that outperformed N.B. by a significant margin.
    If people want to claim a "great new idea" they should research what has been done in the field first.

    1. Re:Adaboost algorithm much better then bayes by teledyn · · Score: 1

      Please consider this an innocent question, not an attack, because I truly want to know: If this method is so great and your method is so much better, why are none of these methods common in modern email or MTA software?

  224. Way around this by panxerox · · Score: 0

    What if spammers just put all their wonderful words of wisdom in a large picture or a flash file, thus "hiding" it? Of course a lot of people have HTML turned off, but the vast majority do not.

    --
    "It's so convenient to have a system where everyone is a criminal" - A. Hitler
  225. But if SPAM doesn't work, it will go away by Great_Jehovah · · Score: 1

    If filtering, as described, were widely implemented, SPAM would become ineffective to the point that it would no longer exist. The cost of "making each link in the chain liable" is much greater than the benefit which can be achieved by other means.

  226. The short cut is the independence assumption by pussyco · · Score: 1

    He is assuming that you can just multiply the word by word probabilities together. This is a standard assumption. If you don't do something like this you get a combinatorial explosion, just like you said. More to the point, if you don't do something like this, your data becomes sparse. In the limit of making no assumptions you are reduced to recognising only the spam you have already seen, you have no capacity for generalisation and all the new spam gets through. No statistical method is any use if it doesn't generalise. Any method that works in practice has some kind of assumption hidden inside to make it go.

    One reason I like the Bayesian approach is that it is pretty transparent. When an implementation is making the independence assumption, it is clearly apparent, and if you need to relax the assumption, for example by looking at word pairs, it is clear enough how to go about it. Graham does discuss this towards the end of his article.

    Often the main effect of the independence assumption in practice is to exaggerate the confidence with which the classifier classifies things. Since Graham is not using his probabilities as input to subsequent processing he gets away with this.
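
    To put numbers on the independence assumption, here is a small sketch of combining per-word probabilities as if the words were independent. The two-word case uses the .97 and .99 figures quoted from the article; the five-fold repetition is a made-up illustration of the over-confidence effect, not data from anyone's mailbox.

```python
from math import prod


def combine(probs):
    """Combine per-word spam probabilities, treating the words as independent."""
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)


if __name__ == "__main__":
    # "sex" at .97 and "sexy" at .99, with no other evidence.
    print(round(combine([0.97, 0.99]), 4))  # 0.9997

    # Five words that really carry the same single piece of evidence:
    # counting them as independent pushes a 0.9 hunch towards certainty.
    print(combine([0.9] * 5))
```

    The second call shows the exaggeration: correlated words get counted as separate confirmations, which is harmless when the result is only compared against a threshold.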

  227. Re:probalilties-"denial of service". by Sarin · · Score: 2

    Yeah they could, but it would take too much effort. They first have to make the probabilities factor change by sending you a whole lot of legitimate email (as if) and then later send you a spam message that can finally contain the words they made less likely to be spam.. wait a minute!

    Hmm, sounds like a great idea. How about this: an ELIZA-style bot starts a mail conversation with you, and after sending 10 mails back and forth the bot sends you a spam message. The bot has beaten your spam filters, because the filters don't think someone on your contact list would send you a spam message, right? And the spam message will actually be read, quite attentively, because you don't understand it and you think the bot is a person!

    well don't be surprised if you experience it one day, remember this message, it started it all!

  228. No, it's _NOT_ easy to defeat. by Christopher+B.+Brown · · Score: 2
    I did a lot of tuning on Ifile, which I've been using for this purpose for about five years now.

    Consider:

    • It's doing FULL TEXT word search. The HTML is looked at.
    • Are they really going to generate different "innocuous" messages?

      If they are REPEATED innocuous messages that match against PAST "innocuous" messages that I decided were spam, that is going to pick this up.

    • Fool me once, shame on you.

      Then your message goes into the corpus as "spam."

      And messages that are written as multipart/alternative with statistically similar "innocuous" messages will be matched as spam.

    • The only "Wealth Of Evidence" that you can provide in an email to me that you aren't sending spam is for you to send me messages that have similar statistical parameters to those messages that I did not consider to be spam.

      You don't know the parameters. The parameters essentially involve the subjects I discuss with my family, or with friends, or with business associates, or with technical associates.

      How can you possibly construct, as a "spam-meister," messages that resemble those without being someone that I regularly communicate with?

    No, this "defeat" represents nothing of the sort.

    --
    If you're not part of the solution, you're part of the precipitate.
  229. Fight Spam? The $15 solution! by Conesus · · Score: 3, Interesting
    Ok, so the subject line looks like spam. But what I did was buy a domain (conesus.com) and set up auto-forwarding on everything @ the conesus.com domain.

    Anytime someone asks for my e-mail address, it's their_business_name@conesus.com or their_personal_name@conesus.com.

    If I ever get spam from a certain address, I can block the address, and go to the site in question and change my address to something else.

    But the coolest part is if anybody sends a mass-email to me and my buds, they usually include a personal_message_to_me@conesus.com.
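
    The per-contact address trick could be sketched along these lines. The domain example.com, the class name and the method names are hypothetical; a real setup would do this in the mail server's alias or forwarding rules rather than in a script.

```python
def alias_tag(recipient: str, domain: str):
    """Return the local part if the address belongs to our catch-all domain."""
    local, _, host = recipient.partition("@")
    if host.lower() != domain.lower() or not local:
        return None
    return local.lower()


class AliasFilter:
    """Accept mail for any alias at the domain unless that alias was burned."""

    def __init__(self, domain: str):
        self.domain = domain
        self.blocked = set()

    def block(self, tag: str) -> None:
        # Called once an alias starts receiving spam.
        self.blocked.add(tag.lower())

    def accept(self, recipient: str) -> bool:
        tag = alias_tag(recipient, self.domain)
        return tag is not None and tag not in self.blocked


if __name__ == "__main__":
    mail = AliasFilter("example.com")
    print(mail.accept("some_shop@example.com"))   # forwarded as usual
    mail.block("some_shop")                       # that address leaked
    print(mail.accept("Some_Shop@example.com"))   # now rejected
    print(mail.accept("other_shop@example.com"))  # unaffected
```

    The alias also tells you who leaked or sold the address, since each one was handed to exactly one party.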

    --

    Don't eat your soul to fill your belly.
    conesus.com
  230. No bias; it's just an incomplete explanation by Christopher+B.+Brown · · Score: 2
    In my "corpus," consisting of many years of MH mail (quite a bit more than Graham's, I think), processed using Ifile, my stats for "sexy" pretty nicely pin down messages to either the Spam/Phonesex or Spam/Snakeoil folders. The word sex is rather less useful, as the word comes up in rather more common contexts than spam.

    But this isn't enough, by itself, to classify a message. Messages do not solely consist of one or two words; they consist of many. And collecting the statistics together requires calculating a "relevance factor," based on all the words.

    The one used for Naive Bayesian Inference is as follows: Rf calculation, and you'll notice it involves doing a logarithm-based weighting.

    The formula doesn't care what words are used, or that you think one folder contains "spam" and that another contains "gold."

    In my corpus, the word sex is used in 65 different mail folders, mostly probably pretty "innocently."

    Drawing conclusions based on one or two words is, unfortunately, pretty incomplete. It might well be that the one use of "sexy" in a particular message doesn't force it into the Spam/Phonesex folder because it makes even more extensive mention of Enlightenment and WindowMaker and GTK Themes and winds up being very strongly tied to the X/WindowManager folder because there are several other words not related to sexual activity that make it (correctly) appear relevant to a discussion of window managers.

    Graham is drawing an analogy based on two words (words likely to grip adolescent attention!); reality involves adding everything up, and those two words certainly don't tell the whole story of the whole corpus.
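
    The "adding everything up" can be sketched as a per-folder relevance score: each word's count in the message times the logarithm of that word's smoothed frequency in the folder. This is a generic multinomial naive Bayes sketch, not necessarily the exact formula Ifile uses; the add-one smoothing, the equal weighting of folders, the class name and the tiny training texts are all assumptions.

```python
import math
from collections import Counter


class FolderClassifier:
    """Scores a message against each folder's word statistics."""

    def __init__(self):
        self.word_counts = {}  # folder name -> Counter of words seen there
        self.vocab = set()

    def train(self, folder, text):
        words = text.lower().split()
        self.word_counts.setdefault(folder, Counter()).update(words)
        self.vocab.update(words)

    def relevance(self, folder, text):
        counts = self.word_counts[folder]
        total = sum(counts.values())
        score = 0.0
        for word, n in Counter(text.lower().split()).items():
            # Add-one smoothing so unseen words do not zero out the folder.
            freq = (counts[word] + 1) / (total + len(self.vocab))
            score += n * math.log(freq)
        return score

    def classify(self, text):
        return max(self.word_counts, key=lambda f: self.relevance(f, text))


if __name__ == "__main__":
    c = FolderClassifier()
    c.train("Spam/Phonesex", "sexy singles call now sexy hot call")
    c.train("X/WindowManager",
            "enlightenment windowmaker themes gtk window manager memory")
    print(c.classify("sexy enlightenment themes for windowmaker and gtk"))
    print(c.classify("sexy singles call now"))
```

    In the first message the single "sexy" is outvoted by four window-manager words, which is the behaviour described above.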

    --
    If you're not part of the solution, you're part of the precipitate.
  231. Foreign Word Circumvention by Christopher+B.+Brown · · Score: 3, Interesting
    No, the approach does not make any assumptions about words being constructed in English.

    The "foreign language" Spam that I get gets nicely refiled by Ifile into my Spam/Foreign folder.

    That folder has a corpus of messages variously written in Han, French, Kanji, Korean, Finnish, Spanish, and Russian, and Ifile nicely recognizes that words in those languages provide evidence that messages seem most relevant to go into that folder.

    Ultimately, it all involves human classification:

    • Initially, the corpus must be "primed" with an initial set of messages that I classify into the various categories I want to distinguish between.
    • Some messages are processed by Ifile into an appropriate mail folder.

      I go through them, and read them, perhaps just browsing titles when I see that spam seems appropriately filed.

      By leaving the messages in the folder, I indicate that they were correctly filed, and should become part of the corpus.

    • Ifile drops some messages in the wrong folder.

      That then involves human intervention as I move the messages to where they should have been.

    Note that Ifile is useful for filing good messages, not merely for throwing away spam.

    Indeed, the more that you use Bayesian filtering for, the more folders with distinctive kinds of message that you have, the better it gets at discriminating where messages should go. I don't have one "Spam" folder; I've got about 8 for different sorts of spam. I don't have one 'inbox' for all my "good" mail; the mail gets thrown into a veritable huge chasm of mail folders. The more there are, the better.

    --
    If you're not part of the solution, you're part of the precipitate.
  232. Not a problem, at least not technically by Christopher+B.+Brown · · Score: 2
    The results are not based on just one word, but rather on the combination of all the words in the message.

    The typical formula is
    Relevance - Rf

    There may be a bit of a "fight" between the words, but if all the messages containing the string my_wife@frobozz.org go in the Honey folder, and occasionally contain phrases like That dress was so sexy or the like, that will change the Ff(w) value for f = Honey, and the message will be appropriately routed, perhaps into the subfolder Honey/Rendezvous where you put the weekly messages of that sort from your wife.

    Of course, there's then the non-technical problem, namely locating a wife that would actually send that message.


    "Since oral sex is topologically equivalent to anal sex, converting one to the other is simply a matter of finding the right conformal map. Currently I only have solutions for a spherical girlfriend." -- Robert Bowler
    --
    If you're not part of the solution, you're part of the precipitate.
  233. IF that's true, there's definite prior art by Christopher+B.+Brown · · Score: 2

    As Ifile source code is available that dates back as far as about 1996.

    --
    If you're not part of the solution, you're part of the precipitate.
  234. Legitimize spam to fight spam by kazbah · · Score: 2, Interesting

    I've had this theory for a long time on a technique that could be used to defeat spam once and for all. Despite what the author of this article states, trying to fight spam by analyzing the content is not going to defeat it, and as has been pointed out, there are many ways to work around that solution.

    Targeting the sending addresses, and most other techniques like that, simply leads to wars of one-upmanship as the spammer and spam fighter struggle to find better techniques to hide and detect spam, respectively.

    So what's the theory? Fairly simple, really, and the technology is already available, but not widely implemented. Spam largely suffers from an identity problem. Consider that junk mail that arrives in the post box can easily be identified and/or blocked through legal means if necessary, largely because we know where it comes from. The reason spam has proliferated is because SMTP traffic is largely anonymous - mail servers basically trust the mail they receive and have no real way to verify the information being presented to them. Yes, they can check From: and To: headers to verify that the email is local / remote / relay attempt, whatever. But with the number of open relays on the net, it's easy to forge and bypass these checks.

    By using SSMTP (SMTP over SSL), all email can be logged with identifying information from the original sender. If enough servers on the net start to support SSMTP, and increasingly mandate its use, eventually I'd be able to block all regular SMTP traffic. This has the added advantage of making email more secure.

    But how does this stop spam? Well, it doesn't directly stop spam, but it means that we would legitimately be able to identify who originally sent the email. Once that happens, the spammer can no longer hide behind anonymous gateways. It probably wouldn't even matter too much if open relays were accidentally left open - so long as the open relay didn't support SMTP but only supported SSMTP.

    Ideally, every user would require their own secure certs to properly identify the sender, but this would probably add too much cost for the average user, and may be rejected for privacy reasons. But so long as the mail servers themselves were configured this way, we would always be able to identify very quickly where the email was originally sourced, thus giving a recipient an easy place to target (and hence sue if it comes to that).

    As this takes off, it may actually be a way to make spam legitimate. The secure cert attached to the email could have an incentive allowing users to opt-in or opt-out automatically. A user could set their mail to say "yes, I'm willing to put up with ads if you're willing to pay me for it" putting the cost back on the person responsible for the spam in the first place - the advertiser.

    Anyway, it seems to me like a fairly simple way to solve this - but it does take a lot of co-operation to get there. Something that hasn't happened yet for IPv6, another new protocol that doesn't really seem to be getting off the ground. So what am I missing?

  235. Incomplete Presentation of Formula by Christopher+B.+Brown · · Score: 2
    The problem isn't with the statistics.

    We're talking about Naive Bayesian Filtering, where we apply Bayes' formula as if the words in a message were independent of one another, even though we know they are not quite independent.

    What you're missing is that the real formula doesn't just involve two words; it involves all of the words in the message.

    The usual formula is Rf, and you'll notice that it involves multiplying the occurrences of words in the message with the logarithm of their frequencies in each folder.

    The word "sexy" may usually be enough to consign messages to the Spam/Websex folder, but if there are some occurrences of the term "sexy window manager" in a discussion of some window managers, the fact that the names Enlightenment, WindowMaker, stupid, memory-hungry and themes occur rather a lot in X/WM and never in the Spam folders means the relevance total will most likely favor the right folder.
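
    To make the folder-relevance idea concrete, here is a minimal sketch in Python. It is not the formula linked above or ifile's actual code; the function names, the whitespace tokenizer, and the add-one smoothing are all assumptions made for illustration. Each folder's score is the sum, over the words in the message, of the word's count times the log of its smoothed frequency in that folder.

```python
import math
from collections import Counter

def train(folders):
    """folders maps a folder name to a list of message strings (hypothetical layout)."""
    counts = {name: Counter(w for msg in msgs for w in msg.lower().split())
              for name, msgs in folders.items()}
    vocab = set().union(*counts.values())
    return counts, vocab

def relevance(message, counts, vocab):
    """Per-folder score: sum of (occurrences in message) * log(frequency in folder)."""
    words = Counter(message.lower().split())
    scores = {}
    for name, folder_counts in counts.items():
        total = sum(folder_counts.values())
        # Add-one smoothing (an assumption) so unseen words do not give log(0).
        scores[name] = sum(
            n * math.log((folder_counts[w] + 1) / (total + len(vocab)))
            for w, n in words.items())
    return scores

def classify(message, counts, vocab):
    scores = relevance(message, counts, vocab)
    return max(scores, key=scores.get)
```

    With a couple of toy folders, a message containing "sexy" alongside window-manager vocabulary should land in X/WM, because the other words outweigh the single spammy one. Folder priors are left out to keep the sketch short.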

    --
    If you're not part of the solution, you're part of the precipitate.
  236. Re:Circumvent by kwerle · · Score: 2

    If you get a new email message that has a bunch of "non spam" words in it, it seems likely that it will not be marked as spam. As the article said, spammer's vocabulary is really limited.

  237. LisP is cool by vga_init · · Score: 0

    Hurray for LisP! :)

  238. Looks good! SPAM button needed. by dwheeler · · Score: 2
    This looks great!

    I'd like to see mail browsers add a nice big "SPAM" button that can do a number of configurable actions and has a useful default. I suggest as the default that it forge and send back a "no such user" message, save the message in a "past spam" folder, and occasionally invoke a naive Bayesian statistical analysis program (as Graham describes) to create a filter for the future (then filter out email with a high probability of being spam). Perhaps it could optionally do other things, such as forwarding a copy to a list of email addresses (e.g., your local "abuse" account, the newsgroup news.admin.net-abuse.sightings, and email addresses of well-known spam killers), or calling on other spam killers like SpamAssassin to check it.

    Perhaps there could be checkbox beside each action like "don't do it when you press SPAM", "do it when you press SPAM", or "confirm before doing it when you press SPAM" - that way, you could get rid of chain letters without sending them to net-abuse.

    By building easily-invoked SPAM-handling capabilities right into the mail browsers, people will be able to fight back more easily.

    I know the Mozilla folks are considering anti-SPAM measures; I hope they're willing to build in this kind of functionality, so that it's enabled by default.

    --
    - David A. Wheeler (see my Secure Programming HOWTO)
  239. How I stopped spam. by Anonymous Coward · · Score: 0

    I use hotmail. I got lots of spam. I went into the filters menu. I set up a filter for every letter of the alphabet (yes, 26 filters) in the subject line. I then went into the tagline menu and inserted the tagline "All mail to me at insertnamehere@hotmail.com must have a totally blank subject line or it will be automatically deleted."

    I then went in and put accept on all my friends' email addresses. I have no more spam, and no problems receiving mail.

    Works like a charm.

  240. filtering don't work by www.sorehands.com · · Score: 1
    In California, the spammer is required to have "ADV:" in the subject line. This would make it easy to filter, except spammers ignore the law.

    Spammers intentionally hide their identity and try to make their spam hard to filter.

  241. Effective Web Porn Site Filter? by detler · · Score: 1

    This technique, if I understand it well enough (IANA genius), would work pretty well as a porn site filter--scan the site before it's displayed, decide if it's porn, filter based on the probabilities.

    As far as building a corpus of porn sites versus a corpus of non-porn sites, I'm not sure of the best way--perhaps it'd be enough to pre-create the probabilities for porn sites through careful and excruciatingly thorough research (good work for some enterprising /. readers), and compare it to a corpus of sites the user usually visits. "Accidentally" get a porn site? Add it to the correct corpus using a -Porn- button.

    If the technique is as good as Mr. Graham says it is, it might put to rest the concerns of those who fear that other kinds of filters exclude innocent content and therefore restrict speech.

  242. An experiment. by Anonymous Coward · · Score: 0

    I realize the ingrates who run Slashdot will not see this (AC limits indeed, I remember when...).

    Since this technique is so effective in combating spam and is easy to set up, why not set up filtering on "First Post", "Goatcx", etc.? You get the picture. Your moderator workload will drop dramatically. And as an experiment, see how well it does at filtering and/or sorting submitted stories by their acceptability for posting on Slashdot.

  243. Re:Too bad! Patented By Microsoft by Anonymous Coward · · Score: 1, Informative

    I saw Eric Horvitz demo this (along with a lot of other impressive stuff) when I was at MSFT. The spam filtering works very well for him. And yes, he's already written an Outlook COM plugin that does it.

    The problem is that Eric works in MS Research, not on a product team. MSR does an excellent job developing cool new technology, and a very bad job working with the product groups to ship it out the door. (Likewise, the product teams do a poor job working with Research.)

    The ultimate example of that is the MSAgent technology... otherwise known as "Clippy". Horvitz was the brain behind the original (and very cool) concept. But the Office product team couldn't take the concept and ship it in a useful form, so it shipped in the painful form we all know and hate.

    Eventually, Microsoft will figure out how to do successful technology transfer from MSR to the product teams. Hopefully spam filtering will be the first one to get it right.

  244. Re:Misleading-Envelope please... by Anonymous Coward · · Score: 0

    "The entire success of spam depends on human eyes reading it."

    Which raises the question of how that "filter" database is going to be generated? Hold the envelope to one's head like that old Johnny Carson routine? Nope, looking at the spam then saying "this is spam, plonk". Opportunity there even if smaller than before.

  245. DCC vs Statistics by jwiegley · · Score: 1
    I'll always be wary of statistically filtered spam mail, especially if you're simply filtering on the probabilities of words. Plus I think this is something that spammers can figure a way around by altering their choice of words and phrases.

    The only "trait" that all spam mail has is that the same message is sent to hundreds or thousands of recipients. A trait which cannot be altered.

    The Distributed Checksum Clearinghouse (DCC) filters on exactly this aspect. You can find it here

    The mail server runs DCC on every incoming message and computes a fuzzy checksum for the message. This checksum is then reported to a central set of servers, which record the presence of this checksum and then report back to the mail server the number of times others have reported a similar message. If you get a high number back, it's spam and the mail server rejects the message.

    Similar messages generate identical checksums. So personalizations and random tokens do nothing to circumvent the filtering.
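
    For the curious, the report-and-count loop described above can be sketched in a few lines of Python. This is emphatically not DCC's real checksum algorithm or protocol: `fuzzy_checksum`, the `Clearinghouse` class (an in-memory stand-in for the central servers), and the threshold are hypothetical, and the "fuzziness" here is just lowercasing, stripping punctuation, and dropping any token that contains a digit.

```python
import hashlib
import re
from collections import Counter

def fuzzy_checksum(body):
    """Hash a normalized form of the body so near-identical messages collide."""
    tokens = body.lower().split()
    # Assumption: tokens containing digits are random tracking IDs, so drop them.
    kept = [re.sub(r"[^a-z]", "", t) for t in tokens
            if not any(ch.isdigit() for ch in t)]
    kept = [t for t in kept if t]
    return hashlib.sha1(" ".join(kept).encode("utf-8")).hexdigest()

class Clearinghouse:
    """In-memory stand-in for the central checksum servers."""
    def __init__(self):
        self.reports = Counter()

    def report(self, checksum):
        # Record one more sighting and return the running total.
        self.reports[checksum] += 1
        return self.reports[checksum]

def is_bulk(body, clearinghouse, threshold=100):
    """Report the message and flag it once enough sightings have accumulated."""
    return clearinghouse.report(fuzzy_checksum(body)) >= threshold
```

    A real fuzzy checksum has to survive much nastier variations (inserted names, reordered paragraphs), which this toy normalization does not attempt.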

    I think that if every existing sendmail/qmail server ran DCC then spam would simply cease to function instantly. Currently, though, I don't perceive there to be a sufficient number of mail servers computing and reporting checksums to make it 100% effective, but my server is currently filtering out about 95% of spam mail.

    This is not as good as the 99.95% reported by this article, but DCC will be more resistant to spammers getting clever and attempting to use statistically rare words or phrases to defeat the anti-spam filter.

    --
    I will never live for sake of another man, nor ask another man to live for mine.
  246. wrong by www.sorehands.com · · Score: 1
    That is saying that if I don't have a car alarm, it is ok to steal it.


    Or because a woman wears a skirt, it is ok to grab her. Just because you get slapped does not mean that it is right, legal, or proper to grab her.

    A spammer is a thief by definition.

    1. Re:wrong by Dyolf+Knip · · Score: 2
      That is saying that if I don't have a car alarm, it is ok to steal it.

      No, it's like saying that if you not only don't have a car alarm, but leave it unlocked and the keys inside and put a sign on it that says, "Drive me to your heart's content", you don't get to complain when people do so.

      --
      Dyolf Knip
  247. Re:Effective Web Porn Site Filter? by 40000 · · Score: 1

    If P2P is killing music then surely it will also kill porn in the end so why worry about spam? Of course this needs as much help as possible so get sharing pornography right away.

  248. Spammer Nailer by jukal · · Score: 2

    I did this some time ago: Unique Spam Invoicing System, USIS, aka "Spammer Nailer". And I am really planning to bill the spammers. The idea: spammers collect email addresses with harvesters, and this page contains a unique address and a service agreement, which says that by sending an e-mail to the address, you agree to the terms of service, which you can read at the URL. And as the address is unique and I have the weblogs, there is at least some chance of nailing the spammer.

  249. Other ideas: legislators' email, no-spam hash db by dwheeler · · Score: 2
    This idea is neat.

    Another idea is to start putting lists of legislators' email addresses (as well as email addresses of their major supporters) on web pages so that spammers start spamming them, too. Legislators hire others to read their emails, and they surely have filters (false positives aren't a problem here!), but it could eventually become obvious even to legislators. Especially if you get the personal email addresses (according to many legislatures, it's legal to share the email address with spammers - if they don't want it to be, they'll need to pass a law to make it illegal!).

    Another idea: a non-profit organization creates and maintains a database of HASHES of email addresses that do NOT want spam (say MD5 and SHA-1 of canonicalized email addresses, e.g., all lower case; an entire site could be represented by "@mycompany.com"). Anyone can download the database, for a small fee. Anyone can add or remove their email address from the list for FREE (and it must always be free); they just need to subscribe/unsubscribe, with a separate email to confirm (to show that they really did add their email address to the list; entire sites could require "root" or "postmaster" to represent them). Then legislation can be enacted that gives serious $$ penalties to any spam to the "no-spam" list. Capturing the database wouldn't do any good; it would only provide hashes and date/time stamps.
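
    A rough Python sketch of how such a no-spam hash list might be built and checked. Everything here is an assumption for illustration: the function names are made up, SHA-1 alone is used where the proposal mentions both MD5 and SHA-1, and canonicalization is just "strip and lowercase" as suggested above (which glosses over the fact that the local part of an address can technically be case sensitive).

```python
import hashlib

def canonicalize(address):
    """Canonical form as proposed above: trimmed and all lower case."""
    return address.strip().lower()

def address_hash(address):
    return hashlib.sha1(canonicalize(address).encode("utf-8")).hexdigest()

def build_database(entries):
    """Store only hashes, so capturing the database reveals no addresses."""
    return {address_hash(entry) for entry in entries}

def is_listed(address, database):
    """True if the address itself, or its whole site ("@domain"), opted out."""
    canonical = canonicalize(address)
    site = "@" + canonical.rsplit("@", 1)[-1]
    return address_hash(canonical) in database or address_hash(site) in database
```

    A sender would run `is_listed` against every address on a mailing list before sending; the subscribe/confirm step and the date/time stamps are left out.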

    Anyway, just an idea.

    --
    - David A. Wheeler (see my Secure Programming HOWTO)
  250. email is not mail by MemeRot · · Score: 1

    Sorry, it just ain't.

    You can't encrypt real mail. Real mail takes days to weeks to arrive somewhere. Real mail can also be dangerous (anthrax anyone?). If I read my real mail out at the mall - do I have privacy rights to it? What about if I read my email there? You have a lot of fuzziness about privacy rights being dependent on where you are and what you're doing..... if you have privacy rights at home, then you have privacy rights at home to read your real mail, read your email, and surf the web.

    My primary email addresses now are hotmail addresses. Reading my email obviously means going out on the web. So I've already left my 'house' in your terms, and I read my email with a web browser. Are you going to say that email that goes to a real email server is a different animal than email that goes to hotmail/yahoo/etc.? That's a pretty flimsy distinction.... they're SMTP packages in either case, and that's what defines if it's email. Old style email clients work the way they do because of the state of technology at the time... NOT because of specific design decisions.

    I also take exception to all these metaphors that rely on the physical universe. In both cases you're sitting in front of your computer, looking at your monitor. You're not 'going out' in one instance and not in the other. Applying old paradigms inappropriately is why legislation on the net is so fscked up.

    1. Re:email is not mail by cburley · · Score: 1
      I agree with him, because while he gives physical-universe explanations for his expectations, the fact is that email is conceptually different, in the sense that it is mail that is received, from web sites, in the sense that they're places that are visited.

      He highlights these differences using real-world examples. That doesn't mean flaws in his analogy indicate flaws in his reasoning.

      (For example, you ask "If I read my real mail out at the mall - do I have privacy rights to it?" -- yes, you still do, except to the extent it is obvious to anyone that you are reading mail, that you are reading N pieces of mail, that some of them might have big bold print readable from a distance...but the fact that you aren't at home does not automatically grant someone permission to "borrow" your mail and read it, or even read it over your shoulder, on the grounds that you've "given up privacy" by doing it in a public place. That's true regardless of whether it's mail or email. None of this means the rest of your argument is invalid, of course.)

      Some of your arguments are just trivial technological twists, though, and do little to illuminate the issues.

      So let me try this out on you. The main difference between reading mail/email and going shopping/browsing is that the former is assumed to be capable of being done by the client in his own home, on his own schedule, without necessarily being "connected" to the rest of the world. The latter is assumed to require all those things -- to leave the home (physically, or to leave the home computer virtually by reaching out to an external web site and giving it some degree of control over your computer); to do so on a schedule convenient to the provider of services; and to do so while some kind of connection (Internet, roads, walkways, etc.) to the provider via outside world is maintained.

      Therefore, given that HTML is specifically designed as a hypertext markup language -- a language making the browsing activity easier and broader for the client -- it is reasonable to conclude that it is not automatically suitable for a communications medium, email/mail, that one must assume will be read when hypertext links (and similar paraphernalia) might not be operative, or welcome, by the client.

      (How would you feel if you fetched all your email onto your computer, disconnected, walked to a park to read it, and found that every single message was simply a link to someplace on the 'net -- to which you were no longer connected -- containing the full text of the message? That's certainly something HTML supports, so I would conclude from your support of HTML as a natural technology for email that you wouldn't mind. I certainly would mind, since I think it should be clear to everyone that, extreme cases aside, emails, like regular mails, should be readable "offline", whatever that means.)

      That's not to say HTML as a technology cannot be embedded in email, since, obviously, it can. But what email readers should default to is what would amount to a subset of HTML that allows only browsing within the email itself (plus, in certain cases such as corporate LANs, within the known-safe LAN, or network), disallows arbitrary code being run on behalf of the provider (since the client must be assumed to wish to control the entire experience), and so on. And email authoring software, in recognition of this general convention, would help the user writing an email to either obey the conventions or understand that, in not doing so, the reader might choose to ignore part of the message (perhaps by default -- not even seeing that there's a choice to be made in an individual case).

      In short, HTML email is, as practiced today by email-sending software, closer to sending raw executables and expecting email clients to simply run them to "serve" the user who wishes to read them, than it is to sending plain text and other markup that obeys the concept that the client is in complete control of the mail-reading experience, including where, when, and with what degree of connectivity to the outside world he is reading his mail.

      (And, yes, I've bitten the heads off a few friends who insist on sending snail-mail with chunks of glitter that "explode" on you when you open it. If I wanted glitter all over my clothes, I'd go party with some 12-year-old girls or something.)

      Remember, the question isn't under what circumstances someone might read mail, shop, or browse -- it's which circumstances someone designing an infrastructure supporting these activities must take into account as the most pertinent, in terms of what users of that infrastructure expect, how their rights are best protected, and so on.

      So, no, it doesn't matter that you go to a web site to read your email. That doesn't change the nature of email, until the day when everyone (pretty much) does the same thing -- not likely anytime soon, I think. Ditto for reading regular mail in a public place -- that doesn't change the nature of mail until it's very rare to read it anyplace that is not quite public (at which point things like credit cards can no longer be sent via mail).

      --
      Practice random senselessness and act kind of beautiful.
    2. Re:email is not mail by pmz · · Score: 1

      Thank you for expressing the issue better than I could. The aspect of e-mail, where it should work regardless of a network connection, is a good point. Once a message is delivered, the recipient should be able to read it on-line, off-line, or even printed out. An e-mail with links to the WWW is, essentially, an e-mail that was never completely delivered.

    3. Re:email is not mail by cburley · · Score: 1
      An e-mail with links to the WWW is, essentially, an e-mail that was never completely delivered.

      Generally, yes.

      Of course, when the sender knows the recipient reads email in a fully-connected environment, he can take advantage of that. I do all the time with my wife, sister, and her husband (at least I assume he's fully-connected when reading email), e.g. by emailing them a link with a very short description, leaving it to them to click on the link.

      I wouldn't do this to someone who might read email offline. Though the kind of audience that does that is probably changing; I'm "trailing-edge", so I've only recently gone from offline reading to online-via-Broadband reading, whereas my wife is "leading-edge" -- she had five computers on her desk at home the other day, from an old Mac SE to one of them Blackberry thingies -- so she's probably going beyond always-connected to sometimes-offline-reading, due to her frequent travels and increased reliance on wireless communication (which must cope with being unconnected while still allowing reading and composing of email).

      Another practical distinction between email and web pages is that people have historically chosen whether to keep or trash their mail. Email too.

      This is distinct from shopping/browsing in that one expects to be able to locate useful nuggets of info in years-old mails -- something that is impossible if mails/emails contain, primarily, pointers to external data that might have been "live" near the time of delivery but has since died, or at least changed substantially.

      I guess it's important, yet sometimes difficult for techies like us, to distinguish between the concept of email and the technical utility. When I email links to my family and friends, I'm not really mailing them so much as using email (SMTP, you could say) as a delivery mechanism, one that's somewhat different -- more "laid-back", queued, formal, etc. -- than instant messaging, which I've yet to use.

      In that context -- using email as simply a data-delivery mechanism -- it doesn't really matter much what's in the email, because both the sender and recipient are more "tightly bound together" in terms of what kinds of communications are expected and tolerated.

      But non-techies, such as newcomers to the field of computing-enabled communication, are, IMO, best served by software that helps them use the technologies in ways that are most consistent with their target audience (other newcomers). In this case, that'd be software that helps them avoid delivering email in the form of web pages (HTML), among other things.

      --
      Practice random senselessness and act kind of beautiful.
  251. I taught my 9 year old to program in Haskell by x1048576 · · Score: 1
    I don't know why you think Haskell is not suitable for intro programming. It's easy to understand and you can start writing interesting stuff straight away.

    My nine year old had no trouble learning to program in Haskell and really enjoyed it.

  252. that's my point by MemeRot · · Score: 1

    I'm saying what he has is NOT a spam filter, it's a trainable content filter. THERE IS A DIFFERENCE! He wouldn't WANT what he says he wants, something that blocks unsolicited automated email. The filter has nothing, ultimately, to do with the 'spamness' of the message, only with whether you like the content and headers - and I think the inclusion of header filtering is a mistake, because it's only there because he thinks he's making a spam filter, when in reality he's making a content filter. My concern is that he defines spam as 'automated unsolicited email' when he should define it as 'any email I don't want'. Sorry Aunt Bertha, I've had it with your forwarded joke of the day emails; even though I continue to want personal emails from you, I don't want the forwarded jokes of the day - see how header filtering doesn't work there, but pure content filtering does?

    Like a user agent, yes. Excite used to have a news clipper feature, you could make up a category of keywords to look for "category:cyborg = cyborg, implant, neurochemical, mind-machine interface" and it would grab stories that matched the guidelines. At first it would be a rough fit, but with each story you could mark 'I like it' or 'I don't like it' and it would make some behind the scenes list of other keywords to use or avoid, and over a couple weeks it would start surprising you with all kinds of things you didn't know you wanted, but that you loved as soon as you read. And yes, that's why I'm so insistent that the judgement be just on the content and not on the 'spamness' of the headers, because I had a really positive experience with this kind of technology, and the articles were not rated at all according to their source. Just because I like one story from AMA doesn't mean I'll like others, just because I hated one story from Newsweek didn't mean I'd hate another one, and I think the same is equally true of the mailing source of an email.

    1. Re:that's my point by plover · · Score: 2
      I apologize, I didn't see your point this clearly in your first email.

      And I understand your second point about the utility of having an agent learning your preferences. But that is of maximum value in the situation you describe: reading news. That's a situation where you don't care if you miss 1% of the valuable stories. That's not true of email; or even if it is for some specific people it probably isn't acceptable performance for someone trying to distribute something called a "spam filter".

      I also understand your point about including headers vs. just the content, but I think it was a good choice on his part. I assume you've done some work on spam-fighting software. I have, and I can assure you that (at least a few years ago) some headers are truly spam-only-markers. He'd be passing up a great filtering chance if he didn't look at them.

      Perhaps the software needs to go that extra step: rather than have a "this is spam button", it probably already has a "move to this folder" option. Why not create a probability array rather than a single spam probability assigned to each dictionary word? Tie the folder names to probabilities in the array.

      Word         Spam  Inbox  ChainLetters
      angels       .2    .0     .8
      opportunity  .8    .05    .15
      bertha       .01   .01    .98
      FW:          .05   .40    .45

      It could then transparently learn to move all my email, sending my Pilotgear mailings to my Pilotgear folder, etc. It would also reduce the "value" of automated senders as being only spam-related.
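
      Something like the following Python sketch is what I have in mind. The numbers are the made-up ones from the table above; the `pick_folder` name, the small floor that stands in for a probability of zero, and the "default to Inbox when no known words appear" rule are assumptions added so the example runs.

```python
import math

# Per-word probability array, one entry per folder (values from the table above).
FOLDER_PROBS = {
    "angels":      {"Spam": 0.2,  "Inbox": 0.0,  "ChainLetters": 0.8},
    "opportunity": {"Spam": 0.8,  "Inbox": 0.05, "ChainLetters": 0.15},
    "bertha":      {"Spam": 0.01, "Inbox": 0.01, "ChainLetters": 0.98},
    "fw:":         {"Spam": 0.05, "Inbox": 0.40, "ChainLetters": 0.45},
}
FOLDERS = ("Spam", "Inbox", "ChainLetters")
FLOOR = 0.001  # assumption: avoid log(0) for entries like angels/Inbox

def pick_folder(message):
    """Sum log-probabilities per folder over known words; highest total wins."""
    totals = dict.fromkeys(FOLDERS, 0.0)
    seen_known_word = False
    for word in message.lower().split():
        probs = FOLDER_PROBS.get(word)
        if probs is None:
            continue
        seen_known_word = True
        for folder in FOLDERS:
            totals[folder] += math.log(max(probs[folder], FLOOR))
    if not seen_known_word:
        return "Inbox"  # assumption: with no evidence, leave mail in the Inbox
    return max(FOLDERS, key=totals.get)
```

      The training side (updating the array whenever the user moves a message to a folder) is left out; this only shows the lookup and the per-folder comparison.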

      I still wonder how an agent will be able to discern the difference between a chain letter from Aunt Bertha and a "Hi, this is Aunt Bertha, meet my plane tomorrow please?" Every message from her gets a "Love, Aunt Bertha/get your free Juno account" tagline at the bottom. So, "love" "aunt" "bertha" all become words that are very strongly associated with ChainLetter-taint, when the real fingerprint probably should be the "Fw:" at the head of the subject line (as well as every line beginning with '>'). Without at least one good letter from Bertha, her email will always end up in the ChainLetters bin. But with an array of sorting options, at least they wouldn't end up heading straight to /dev/null.

      Perhaps other message-cumulative characteristics should be used in conjunction with word counts, such as message length, total count of exclamation points or dollar signs (or even of all individual ASCII characters), grammar checker score, spell check score, etc. I think the overall concept of using a probability-based mechanism rather than a score/threshold mechanism is sound. I think we both agree that his approach needs more refinement.

      --
      John
  253. that's because he's wrong about what he's got by MemeRot · · Score: 1

    He wouldn't WANT what he says he wants, something that blocks unsolicited automated email. The filter has nothing, ultimately, to do with the 'spamness' of the message, only with whether you like the content and headers - and I think the inclusion of header filtering is a mistake, because it's only there because he thinks he's making a spam filter, when in reality he's making a content filter. My concern is that he defines spam as 'automated unsolicited email' when he should define it as 'any email I don't want'. Sorry Aunt Bertha, I've had it with your forwarded joke of the day emails; even though I continue to want personal emails from you, I don't want the forwarded jokes of the day - see how header filtering doesn't work there, but pure content filtering does? Or do you think that you should still have to delete the forwarded joke of the day emails manually, every day, for the rest of your life? And why would you want to do that when this could filter them without filtering the personal emails?

  254. Re:Other ideas: legislators' email, no-spam hash d by Anonymous Coward · · Score: 0
    1. Sure, canonicalize the domain name, but the local-part of an addr-spec is case sensitive.
    2. A legal definition of unsolicited bulk email must be very carefully designed. The new spam trend is to claim you have opted in with some unidentified "marketing partner".

    But legislators are some of the least efficient communicators on the planet. They won't even think twice about ignoring email, since they still think dead tree connotes authority.

  255. another spammer justification by www.sorehands.com · · Score: 2

    No, it's like saying that if you not only don't have a car alarm, but leave it unlocked and the keys inside and put a sign on it that says, "Drive me to your heart's content", you don't get to complain when people do so.


    No, it is if you say, move the car, if you are blocked. Then you decide to take the car for a long drive.


    Spamming is stealing.


  256. Security thru obscurity again? by Tablizer · · Score: 2

    (* Yes, but how many spammers are going to reply to your challenge? Zero! And that alone will make the challenge an effective tool. *)

    If confirmation requests become a wide-spread practice, they *will* take advantage of that.

    Too many techniques here assume that what works in obscurity works en masse also.

    Not the case. Spammers target the most widely used techniques. When something comes into wide use, kazaam!

    1. Re:Security thru obscurity again? by Anonymous Coward · · Score: 0

      True regarding the most widely used techniques.

      But there are ways to devise a challenge-response system that are difficult to resolve without a human being on the other end actually reading it. And if there are several varieties, then it's even more difficult to automate responses to such challenge emails.

      The only way to get around it is to use a human being to read the challenge and reply. I think it's a safe bet that the spammers don't use that kind of manpower, because it's too expensive.

  257. Hey, by my reckoning it works by Hugh+Beyer · · Score: 1

    Well I thought this sounded cool so I spent an hour or two coding it up in a VB macro for Outlook.

    There's now a "Delete Spam" button on my toolbar that moves the selected message to a "Spam" folder. There's an event handler that runs whenever a new message comes in, analyzes it, and if it looks like spam puts it in a "Probable Spam" folder. There's a macro which analyzes all the messages in the "Spam" folder and all the messages in my Inbox to generate the word probabilities hash table.

    I did a quick run through my deleted mail folder, used the "Delete Spam" button to move a representative sample of spam (250 messages) to the Spam folder (I didn't do them all just to save time). I then ran the analyzer to get an initial hash. Then I analyzed the messages in my deleted mail folder, wrote the scores and subject lines to a text file, and moved most of the spam that didn't get flagged as spam to the spam folder, and re-ran the analyzer.

    Bingo. That simple technique has caught every spam I've gotten since. From time to time I can check the "Probable Spam" folder and move those messages to the "Spam" folder and re-run the analyzer to improve it. We'll see how it weathers over time, but it's already doing better than I have any right to expect.
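
    For anyone who wants to try the same thing outside Outlook, here is a stripped-down Python sketch of the analyzer. It is not my VB macro and not Graham's exact algorithm; the function names, the whitespace tokenizer, and the clamping to 0.01/0.99 are assumptions, and his refinements (such as looking only at the most telling words) are left out. The combining step is the Bayes' rule calculation quoted earlier in the thread, where .97 and .99 combine to about 99.97%.

```python
from collections import Counter

def build_table(spam_msgs, good_msgs):
    """Word -> probability that a message containing it is spam (simplified)."""
    spam = Counter(w for m in spam_msgs for w in m.lower().split())
    good = Counter(w for m in good_msgs for w in m.lower().split())
    n_spam = max(len(spam_msgs), 1)
    n_good = max(len(good_msgs), 1)
    table = {}
    for word in set(spam) | set(good):
        s = min(1.0, spam[word] / n_spam)
        g = min(1.0, good[word] / n_good)
        # Clamp so no single word can be treated as absolute proof either way.
        table[word] = min(0.99, max(0.01, s / (s + g)))
    return table

def spam_probability(message, table):
    """Combine per-word probabilities: prod(p) / (prod(p) + prod(1 - p))."""
    prod_p = prod_q = 1.0
    for word in set(message.lower().split()):
        if word in table:
            prod_p *= table[word]
            prod_q *= 1.0 - table[word]
    return prod_p / (prod_p + prod_q)
```

    Messages scoring above some cutoff would go to "Probable Spam". Very long messages could underflow the products; summing logarithms would be the sturdier choice.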

  258. The Fatal Flaw in the Plan? by teledyn · · Score: 1

    Actually, there's two, and both are easily found by simply entering "bayesian spam" into Google:

    • There already exists a generic method (ifile) to fold this technique into a procmail script, which means you don't need any special-purpose email program; any company that thinks it's going to replace the email-browser is dreaming. I also downloaded some experimental code for Emacs GNUS, but it was too clunky for anything more than a demonstration of the rating method. Since ifile works with procmail, any ISP could use it to tag suspect email, so it wouldn't matter if there are both geeks and sex-starved teens (not the same thing?) in the audience; each can do with the extra tags as they wish.

    • The Google results also show this is not a new problem; research, heavy research, has been applied to Bayesian network classification of spam emails since 1994 ... so it is the first approach, or close to it. My first question is then, "if it works so well, with 0% false positives, then why did everyone, even Microsoft, abandon it?" ifile has been around for a long time, yet not even the Linux distros include it by default. That's a little suspicious, don't you think? If the method is so foolproof, why are there no fools using it?

      Excuse my greying cynicism, but there's no mention in Paul's paper of how he's accounting for the mass failure of the corpus of work that goes before him, and I get a little dubious when one lone programmer claims they can out-think large numbers of trained professionals and academics. Yes, it does happen, but when you hear hoofbeats in the street, it's usually not a zebra.

    Please feel free to enlighten me about the above two; I'm not investing in Paul's employer, so the first issue is not nearly as important as the second, but as a spam-victim, I truly do so want to believe there's a magic anti-spam bullet, I just have trouble believing this particular story based on the data at hand.
  259. Proposed by waldoj · · Score: 2

    Would this also work with email viruses? I think it would, since the virus would also have a defined pattern to it and the program would pick it up after the first one.

    I actually proposed this on Advogato many moons ago, in February of 2001.

    -Waldo Jaquith

  260. Multilingual possibilities by Anonymous Coward · · Score: 0

    I really like the elegance of this approach, but Mr. Graham neglected to brag about one important capability: It transcends the language of the text it filters. A good database of spam and non-spam messages in French, Italian, Greek, Russian, Arabic, Thai, Korean, Japanese, or whatever (you'll notice I put some double-byte languages in there) will generate good filtering of any sort of messages. This will continue to increase in importance as the Far East gets to be a larger portion of the Internet (for both legitimate and spam users.)

  261. yahoo.com should do this [actual text] by Flamesplash · · Score: 1

    mail.yahoo.com or some other big free mail service should implement this. Yahoo has a "report as spam" button on every message you read, so that would be an easy way to build the spam group. As for the "good" group, dunno.

    -shane

    --
    "Not knowing when the dawn will come, I open every door." - Emily Dickinson
  262. Re:make sure you call their 1-800 from a public ph by Anonymous Coward · · Score: 0

    And in the US, calls to an 800 (888/877/866/855) number from a payphone result in a 28-cent charge to the RECIPIENT (i.e., the spammer), which is paid to the operator of the payphone.

    Something to think about next time you have time to kill at a shopping mall or airport with rows of unused payphones....