Slashdot Mirror


Fighting Spam with DNA Sequencing Algorithms

Christopher Cashell writes "According to this article from NewScientist, IBM's Anti-Spam Filtering Research Project has started testing a new spam filtering algorithm, an algorithm originally designed for DNA sequence analysis. The algorithm has been named Chung-Kwei (after a feng-shui talisman that protects the home against evil spirits). Justin Mason, of SpamAssassin, is quoted as saying that it looks promising. A paper is available on the algorithm, too (PDF)."

142 comments

  1. hm by Anonymous Coward · · Score: 0, Interesting

    wonder what the spammers will come up with to get around this...

    1. Re:hm by Pigbot · · Score: 5, Insightful

      wonder what the spammers will come up with to get around this...

      Of course. Spam is a moving target. Given that it is cheaper to create spam than to block spam, it will always be an uphill battle.

      Lately, much of the spam I have been getting in my Inbox (squirrelmail/spamassassin) has been email that has no typos, no random text, no blatant "click here" lines and looks like normal mail. Except they are trying to sell me something.

      --
      print "Oink!\n" if ( $tail =~ "pull" );
    2. Re:hm by Proud+like+a+god · · Score: 2, Informative
      Lately, much of the spam I have been getting in my Inbox (squirrelmail/spamassassin) has been email that has no typos, no random text, no blatant "click here" lines and looks like normal mail. Except they are trying to sell me something.

      You lucky g*t! :-P
    3. Re:hm by great_snoopy · · Score: 3, Informative

      In fact, they did. The latest spams I receive are composed of two parts: the spammy part, and a longer part that is usually a news paragraph from a public news site like news.google.com or cnn. The second part usually has a very small spammy fingerprint or none at all, cloaking the first spammy part.

    4. Re:hm by Anonymous Coward · · Score: 0

      Lately the only spam that has been getting through is just headers with no message body. There isn't much I think I can do to stop it either, because the headers seem to be unique every time. I don't understand why they bother sending a mail with no content... It's not even a virus.

    5. Re:hm by ca1v1n · · Score: 2, Interesting

      The great thing about the similarity matching algorithms is that they read with noise filtering the same way that humans do. They also allow for like-character matching without any added computational overhead. This means that you can make a table of Unicode characters that are similar to certain ASCII characters and incorporate it into the similarity matrix. By the power of these properties combined, your spam filter can recognize that c;al_is is intended to look like cialis, without a lot of expensive extra computations.

      Now that we've neutralized that form of message garbling, we're left to dealing with bayes filter poisoning. This is something that entropy-based filtering deals with quite well.

      All spam filtering techniques have weaknesses, but if you use a few different methods in concert, preferably within the same package to spare the poor user from having to set up a whole lot, you can get just about all of it.

      Even using a few of these different methods together, I still get a few ads from companies I've done business with that have screwed up my communication preferences. This sucks, but most of these companies are clueless rather than malicious. Threatening to take my business elsewhere has never failed to correct these problems.
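
A rough sketch of the like-character idea described above, assuming a tiny hand-made lookalike table (the `LOOKALIKES` and `NOISE` entries are illustrative, not taken from the Chung-Kwei paper). Unlike the approach in the comment, which folds the table into the similarity matrix itself, this sketch normalizes the text first and then uses Python's standard `difflib.SequenceMatcher` as a stand-in similarity measure.

```python
from difflib import SequenceMatcher

# Hypothetical lookalike table: maps visually similar characters to the
# ASCII letter they imitate. A real table would be far larger.
LOOKALIKES = {
    ";": "i", "1": "i", "!": "i", "|": "l",
    "0": "o", "@": "a", "$": "s",
    "\u0430": "a",  # Cyrillic small a
    "\u0435": "e",  # Cyrillic small ie
    "\u043e": "o",  # Cyrillic small o
}

# Characters treated as pure noise and dropped before comparison.
NOISE = set("_-.*' ")


def normalize(text):
    """Lowercase, drop noise characters, and fold lookalikes to ASCII."""
    out = []
    for ch in text.lower():
        if ch in NOISE:
            continue
        out.append(LOOKALIKES.get(ch, ch))
    return "".join(out)


def similarity(candidate, keyword):
    """Return a 0..1 similarity ratio between the normalized strings."""
    return SequenceMatcher(None, normalize(candidate), normalize(keyword)).ratio()
```

With this table, `similarity("c;al_is", "cialis")` evaluates to 1.0, since both strings normalize to the same letters.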

  2. Feng Shui hardware by simp · · Score: 5, Funny

    Excellent! This will go well with my Feng Shui compliant wall of rocks that I use as a firewall.

    1. Re:Feng Shui hardware by Anonymous Coward · · Score: 1, Funny
      Excellent! This will go well with my Feng Shui compliant wall of rocks that I use as a firewall.
      Make sure you have some moss or other greenery to balance its hardness, and ideally some water too. For a fully integrated experience, use a themed wallpaper like Stonehenge on your desktop.
    2. Re:Feng Shui hardware by Pigbot · · Score: 4, Funny

      Considering how much spam I get trying to sell me Viagra or porn, I have reservations about using someone's DNA to fight spam. It just sounds dirty. And sticky. Like someone should at least buy me dinner first.

      --
      print "Oink!\n" if ( $tail =~ "pull" );
    3. Re:Feng Shui hardware by BJH · · Score: 5, Informative

      If I'm not mistaken, Chung Kwei is the figure known as Shouki in Japanese. He's usually described in English as the "Demon Queller", which seems a suitable-enough symbol for an anti-spam program.

      I mean, come on - don't anti-spam programs have the coolest names? SpamAssassin, Vipul's Razor...

    4. Re:Feng Shui hardware by DNS-and-BIND · · Score: 1, Insightful

      It's hardly appropriate that such superstition should be given encouragement in this day and age. Penn & Teller did a great bit on "feng shui" on their show, "Bullshit!". They had 3 different feng shui consultants come in to a house, and each one recommended different changes for different reasons. Some discipline.

      --
      Shutting down free speech with violence isn't fighting fascism. It IS fascism!
    5. Re:Feng Shui hardware by Anonymous Coward · · Score: 0

      Hmmm...That proves nothing. What if Penn and Teller do a similar bit on the alleged profession of computer programming?

      "They had 3 different software engineers come in to a business, and each one recommended different changes for different reasons. Some discipline."

    6. Re:Feng Shui hardware by Xyde · · Score: 1

      That's especially hilarious (and probably apt) coming from somebody with the username pigbot ;)

  3. Wordfilter by bert.cl · · Score: 3, Insightful
    While the numbers are impressive, this just looks like a filter that does combined wordsearches?

    Even with training, isn't this just some regexps searching for particular strings?

    And what about short messages that don't use as many words: is the spam score relative or absolute? The article is a little low on details; can anybody point to some more informative articles?

    1. Re:Wordfilter by rokzy · · Score: 2, Interesting

      91% detection is far from impressive. AFAIK the better filters today are 99.9% successful. the benefit of this one is its low false-positive rate.

      personally I'd prefer a much better set of filter tools e.g. being able to say "I only speak English, I NEVER use this account for commerce, and the people I email are professionals so score spelling mistakes much higher as probable spam".

      can someone point me in the direction of such a filter?

    2. Re:Wordfilter by Anonymous Coward · · Score: 0

      typo: parent should say 97% not 91%

    3. Re:Wordfilter by FooAtWFU · · Score: 1
      My sentiment: Regex schmegex, so long as it works, and keeps working.

      But really- have a new algorithm that's not perfect? Work on it. More algorithms to choose from cannot mean anything but better antispam solutions.

      --
      The World Wide Web is dying. Soon, we shall have only the Internet.
    4. Re:Wordfilter by Incadenza · · Score: 4, Informative

      personally I'd prefer a much better set of filter tools e.g. being able to say "I only speak English, I NEVER use this account for commerce, and the people I email are professionals so score spelling mistakes much higher as probable spam".

      can someone point me in the direction of such a filter?

      How about spamassassin?
      Just add the following to /etc/mail/spamassassin/local.cf:

      ok_languages en

      And increase the score for BIZ_TLD and other tests you find more important than others. Scoring per test is fully configurable, complete list of tests here.
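
To make the per-test scoring idea concrete, here is a minimal sketch of an additive rule scorer. It is not SpamAssassin's code: the rule name `MISSPELLED_WORDS` and all the numeric scores are made up for illustration (only `BIZ_TLD` comes from the comment above), and the threshold of 5.0 is used here simply as an assumed cut-off.

```python
# Hypothetical default scores per test; the values are illustrative only.
DEFAULT_SCORES = {
    "BIZ_TLD": 1.0,
    "MISSPELLED_WORDS": 0.5,  # made-up rule name for this sketch
}


def total_score(hits, overrides=None):
    """Sum the scores of all tests that fired, applying user overrides."""
    scores = dict(DEFAULT_SCORES)
    if overrides:
        scores.update(overrides)
    return sum(scores.get(name, 0.0) for name in hits)


def is_spam(hits, overrides=None, threshold=5.0):
    """A message is tagged as spam once its total reaches the threshold."""
    return total_score(hits, overrides) >= threshold
```

Raising one test's score, for example `is_spam(["BIZ_TLD", "MISSPELLED_WORDS"], overrides={"BIZ_TLD": 5.0})`, is enough to push a message over the threshold that the default scores would have let through.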

  4. The one... by Anonymous Coward · · Score: 0

    ... in Thunderbird works for me.

  5. Mozilla Firefox by nycsubway · · Score: 2, Insightful

    I have to say the adaptive spam filter in Firefox works pretty darn well. I have tried other adaptive spam filters as plugins in Outlook and they work pretty darn well too.

    With the nature of new spam messages that look like real emails, the only person who can really tell if something is spam is the recipient.

    1. Re:Mozilla Firefox by rokzy · · Score: 2, Insightful

      I've had mixed results with Thunderbird. in the beginning it seemed to work great, then I noticed it was junking all my legitimate email too. then I fixed that but it started letting through blatantly obvious stuff.

      the newest version has been doing better so far.

      I think my problem is my rate of email is quite low so it's difficult to train. I'd like it if there could be a database where if a subject header is reported as spam by one user it affects other users' scoring.

    2. Re:Mozilla Firefox by danharan · · Score: 2, Interesting

      I think you mean Thunderbird.

      My experience with it has been rather disappointing. Why I need to tag as spam two messages from the same sender or with the exact same subject is a mystery to me. After the 10th "Make $\d+ in XX days" type message one has to wonder just how effective this thing is.

      This method is promising because it uses spell-checking and a better way to identify spammy string sequences, something neither of the two main camps of spam filters has seemed keen to do until now.

      --
      Information: "I want to be anthropomorphized"
    3. Re:Mozilla Firefox by littlem · · Score: 3, Interesting
      My experience with it has been rather disappointing. Why I need to tag as spam two messages from the same sender or with the exact same subject is a mystery to me. After the 10th "Make $\d+ in XX days" type message one has to wonder just how effective this thing is.

      This shouldn't be all that surprising - Bayesian filtering is all based on probabilities. The reason "Outlook message rules" is so bad is because a friend of mine might send me a joke about Viagra, which I don't want to have deleted indiscriminately as spam. False positives are infinitely more annoying than false negatives, so I'd much rather have conservative filtering that let a bit of spam through.

      I'm not saying Bayesian algorithms are perfect yet (though they'll improve) - my personal experience has been SpamAssassin, which got 97% of spam, and I've been experimenting with Thunderbird for a week, which gets 85%-90% and will no doubt get much much better as I train it in the next couple of weeks - but ultimately Bayesian filtering is enough to beat enough spam to make spamming not worthwhile (if everyone did it...)
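
For readers who have not seen one, a minimal sketch of a word-based naive Bayes filter follows. It is a toy with add-one smoothing, not what SpamAssassin or Thunderbird actually ship; the class name `TinyBayes` and the 0.9 threshold are assumptions chosen to illustrate the "conservative filtering" point above.

```python
import math
from collections import Counter


class TinyBayes:
    """Toy naive Bayes classifier over the set of words in each message."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, text, label):
        # Count each word at most once per message.
        self.counts[label].update(set(text.lower().split()))
        self.docs[label] += 1

    def spam_probability(self, text):
        # Start from the smoothed prior odds, then add per-word evidence.
        log_odds = math.log((self.docs["spam"] + 1) / (self.docs["ham"] + 1))
        for word in set(text.lower().split()):
            p_spam = (self.counts["spam"][word] + 1) / (self.docs["spam"] + 2)
            p_ham = (self.counts["ham"][word] + 1) / (self.docs["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))

    def is_spam(self, text, threshold=0.9):
        # A high threshold trades missed spam for fewer false positives.
        return self.spam_probability(text) >= threshold
```

Because the output is a probability rather than a yes/no rule, a single word like "Viagra" in a friend's joke only nudges the score; the rest of the message can still pull it back below the threshold.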

    4. Re:Mozilla Firefox by aussie_a · · Score: 3, Funny

      I agree. The Mozilla Firefox spam filter works great for me. I no longer go to all those goatse sites that people link to thanks to the plugin :) But I have to keep uninstalling and reinstalling it, because after 2 days it says slashdot is spam.

    5. Re:Mozilla Firefox by toxic666 · · Score: 2, Interesting

      "I" being the key word in your assessment. Fine for the home user, not so good for a business.

      Maintaining an enterprise mail system based upon user-controlled spam filtering software is not practical. That small percentage of users with consistent ID 10T errors adds up fast. Try correcting false positives for a user-configured filter. It's time-consuming.

      The better approach from an administrative standpoint is controlling spam at the MTA and MDA levels of the mail server. I use postfix checks with MDA-level Bayesian filtering with reasonable success. The spam mbox is composed of user-submitted and administratively approved mail. The user submits it, and the admin checks for things like filter poisoning text before moving it to the real spam mbox.

      Most importantly, my false-positive rate is extremely low -- probably tens of thousandths of a percent.

    6. Re:Mozilla Firefox by Technonotice_Dom · · Score: 3, Informative

      I'd like it if there could be a database where if a subject header is reported as spam by one user it affects other users' scoring.

      There are a few databases out there that take hashes of spam e-mails (either sent to spam traps or reported) and use them for spam tagging. SpamAssassin can use their client programs to help tag messages also - I don't know if there's an extension or anything for Thunderbird, I don't use it.

      The three that come to mind are DCC, Razor and Pyzor.

      All have their advantages or disadvantages, but you have to remember that you're relying on somebody else's judgement. I think it's DCC that you can easily configure to say that you need x reports of the message before you class the message as spam, which gives you more control. But you only need one person who doesn't use it correctly to ruin the system and introduce lots of false positives.

      You could always set up SpamAssassin on your local machine and proxy messages through that.
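
A minimal sketch of the shared-report idea, under loud assumptions: this is not the DCC, Razor or Pyzor protocol or API, the class `ReportDatabase` is hypothetical, and an exact SHA-256 of whitespace-normalized text stands in for the fuzzy checksums real services use. It only shows why requiring several independent reports limits the damage one careless user can do.

```python
import hashlib
from collections import defaultdict


class ReportDatabase:
    """In-memory stand-in for a shared spam-report service."""

    def __init__(self, min_reports=3):
        self.min_reports = min_reports
        # fingerprint -> set of distinct reporters
        self.reporters = defaultdict(set)

    @staticmethod
    def fingerprint(body):
        # Collapse case and whitespace so trivial edits hash the same.
        normalized = " ".join(body.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def report(self, body, reporter):
        self.reporters[self.fingerprint(body)].add(reporter)

    def is_spam(self, body):
        seen = self.reporters.get(self.fingerprint(body), ())
        return len(seen) >= self.min_reports
```

Counting distinct reporters rather than raw reports means a single user reporting the same message repeatedly cannot trip the threshold alone.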

    7. Re:Mozilla Firefox by It'sYerMam · · Score: 1
      The problem I've been having is that spammers have stopped, well, spamming. They have a subject reading "Get new Vi'agra" or whatever, and the body is filled with those random words - I couldn't find any advertising whatsoever.

      I mean, how are these twats going to get even the most floppy, lazy, frustrated 99 year old to buy their product by telling him "rankin decisionmake portraiture approval slothful clamber teutonic activism alcoa tofu wakeful polonaise burt afghan lad sedimentary pennyroyal aristotelean restaurant catherwood veridic cottonseed circumference rupee automorphism lachesis homesick?!"

      --
      im in ur .sig, writin ur memes.
  6. High tech for what ? by Ozh · · Score: 3, Interesting

    Funny how some people develop more and more sophisticated stuff to fight something that is as simple as sending out emails to random addresses... and so simple that it will never stop :/

  7. Thunderbird by bert.cl · · Score: 2, Informative

    I think you mean Mozilla Thunderbird?

    1. Re:Thunderbird by Anonymous Coward · · Score: 0

      No, he's accessing his gmail account using firefox ;-)

    2. Re:Thunderbird by nycsubway · · Score: 1

      Yes, Thunderbird! It's early in the morning...

  8. Misnomer, it's not "fighting spam"... by argent · · Score: 1, Insightful

    This isn't "fighting spam", it's "adapting to spam".

    1. Re:Misnomer, it's not "fighting spam"... by avalys · · Score: 1

      Not really. As more and more people begin to use spam filtering (especially on the server level), spam's effectiveness will decrease.

      --
      This space intentionally left blank.
    2. Re:Misnomer, it's not "fighting spam"... by argent · · Score: 5, Insightful

      As more and more people begin to use spam filtering (especially on the server level), spam's effectiveness will decrease.

      People have been improving filtering, and the spammers just pump up the volume. As filtering improves, the delivery rate goes down, but so does the complaint rate so they end up being able to pump more spam before they're detected.

      I've been watching this arms race for almost a decade, and the advantage is still on the spammer's side. At the moment I'm blocking between 10,000 and 20,000 connections a day just on the basis of their IP address (including blocks against entire countries), another 3-5,000 using a greylist/honeypot app I'm working on, and I'm still getting one or two hundred messages per day hitting my procmailrc. A few years back, when I was getting a few hundred spams a day without all those RBLs and personal blacklists, people were all excited about how bayesian filters were gonna make spam uneconomical... and I made the same comment back then. Now I'm filtering a couple of hundred times more efficiently and effectively and I'm still getting almost the same volume.

      I don't see anything different this time. You can't fight spam with filters, all you can do is adapt to it.

    3. Re:Misnomer, it's not "fighting spam"... by shubert1966 · · Score: 1
      I totally agree. Allowing unwanted files onto your system just because 'they' know the address is USER ERROR. These FILTERS are ASKING FOR SPAM!!!

      This middle-market-merchandising-madness has to stop. Bill Gates and attendant remora-ware are getting richer and richer each and every day.

      I guess if politicians can't figure out that their own computers aren't safe, or how to tax internet transactions, then we can't bloody rely on them to stop consumer gouging either can we?
      1. Acquire domain and setup your site.
      2. Set up a list of your known contacts, accepted email addresses.
      3. Create email page with CAPTCHA image - to filter out 'bots. Once they message you the first time you can decide for yourself if they get on your acceptable list. Most spammers won't take the time to defeat this - not cost effective.
      4. Give the URL of this page out AS your email address.
      5. They have to hack your site instead of just hitting a keystroke in order to send you SPAM.
      This allows anonymity and allows new, valid users to contact you.
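
The steps above can be sketched as a default-deny inbox. Everything here is hypothetical: the class `WhitelistInbox`, its method names, and the `came_via_form` flag (which stands for "this message arrived through the CAPTCHA-protected web page") are illustration only, and no real CAPTCHA or mail handling is implemented.

```python
class WhitelistInbox:
    """Default-deny mail policy with a web form for first contact."""

    def __init__(self, allowed):
        self.allowed = {addr.lower() for addr in allowed}
        self.pending = []  # first-contact messages awaiting a decision

    def deliver(self, sender, message, came_via_form=False):
        sender = sender.lower()
        if sender in self.allowed:
            return "inbox"
        if came_via_form:
            # Passed the CAPTCHA page: hold for the owner to review.
            self.pending.append((sender, message))
            return "pending"
        return "rejected"

    def approve(self, sender):
        """Owner decides the first-contact sender is acceptable."""
        self.allowed.add(sender.lower())
```

The owner stays in control of the list: a stranger can reach the pending queue only through the form, and reaches the inbox only after `approve` is called.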

      --
      Stuff that matters.
    4. Re:Misnomer, it's not "fighting spam"... by zogger · · Score: 1

      yours is closest to the best idea, IMO. All email-in should be blocked by default, and only whitelist allowed in through the filter. You can use a form on a web page for a first contact.

      I'd also like to see email addys be treated exactly the same as a snail mail street address addy or a telephone number, ie, make them cost to get, so they are treated correctly. We register domains, why not email addys? If it cost 10$ a year (something like that) to register an email addy, there would be no incentive for the spammers to throw the dictionary at domains, and conversely, the spammers couldn't/wouldn't want to create thousands of email addys to spam from.

    5. Re:Misnomer, it's not "fighting spam"... by Proc6 · · Score: 1
      Wrong.

      The effectiveness of the spam that's blocked decreases, but the potency of the spam that gets through skyrockets since it stands alone. This alone is motivation to triple the efforts of spammers. I'm sure the more talented spammers out there nearly jizz themselves as they run their latest crafted email through their local "test servers", see it pass through all the filters with ease, and hit the SEND button.

      Until there are new methodologies to prevent the "ability" to spam, period, everything else is just throwing effort into an unbeatable problem. Not saying things like SpamAssassin aren't nice, they are, but the bulk of the effort should be going to some universal email method that can be moved to easily and prevents spamming.

      --

      I'm Rick James with mod points biatch!

  9. Bayesian Still Works by Admiral+Justin · · Score: 4, Funny

    For now, Bayesian filtering still gets the job done most of the time, so I think we shouldn't get too excited.

    Besides, you have to ask yourself some questions...

    "What happens if you try to filter spam with RNA?"

    "Just how good can ACT and G manage spam?"

    and, most important of all...

    "Are you sure this spam filter uses no portion of Keanu Reeves' genetic code?"

    --
    You will be baked, and there will be cake.
  10. Re:Mozilla Thunderbird by nycsubway · · Score: 0

    I really like the programs, but I get their names confused... I meant Mozilla Thunderbird in the above post.

  11. Love SA... by ajs · · Score: 5, Informative

    You have to love SpamAssassin for its very Perlish approach to spam filtering... "hey, there's a cool new way to filter spam... throw it in!"

    I love this mostly because it means that SA is a moving target. Spammers can figure out how to defeat pieces of it, but it deploys a wide range of static, dynamic, network-based and user-driven tests that change so much that spammers simply can't afford to keep up.

  12. The biggest problem I see, at the moment.... by Rahga · · Score: 3, Interesting

    It looks like much of the spam I'm receiving today consists of either nearly blank e-mails or e-mails containing news articles that seem to be designed to pass through content filters just so users can send them back to their admins as spam, essentially making it easier for bayesian filters and such to mark legitimate e-mail as spam.... though honestly, it's more of an annoyance for me, as it makes it easier for users to say "The spam filter isn't working, what are you doing wrong?"

  13. Wrong title, I guess by stm2 · · Score: 5, Interesting

    According to the /. title, it seems they used an algorithm for DNA sequencing, when in fact they used an algorithm for DNA analysis (or DNA sequence analysis, which is the same), more specifically, gene-finding techniques. As you may know, most DNA in a genome is not translated into protein (some people still call it junk, but most of it is not junk at all). So there are programs to sort genes out from the rest of the DNA.
    I think we will see more and more applications like this with the growing cross-pollination between Biology and CS.

    --
    DNA in your Linux: DNALinux
  14. Fighting spam with BioRythm testing? by jmcmunn · · Score: 0

    It's too early in the morning I guess. When I read the title of this article, I immediately thought it was indicating that we should test the 'Dna' of incoming emails.

    And then I wondered what the BioRythm of an email would be. I need to go back to bed.

    1. Re:Fighting spam with BioRythm testing? by jmcmunn · · Score: 0

      (at least after seeing the next article's title of 'Biometric E-Passports' I know where my subconscious was getting the second mistake from.)

  15. Re:What I want to know is... by samael · · Score: 1

    I don't care about the cost of spam. With my 1MBit connection it doesn't compare to my other downloads.

    I just don't want to read it - and now I don't have to.

  16. What could we do... by d3ity · · Score: 2, Funny

    I'd love to meet the scientist that thought this up. It probably went something like this:

    Boss: Well, we've made promising gains in the DNA research project. Now what applications could this be used for?
    Engineer: The possibilities are limitless! We could cure cancer! We could invent a super puppy that combines the abilities of a lovable puppy and Tux, the friendly Linux penguin! We could use it to regenerate limbs for amputees!
    Marketing: Let's use it to get rid of spam emails!
    Boss: Great idea! Let's go with that one.

    1. Re:What could we do... by bhima · · Score: 1

      Actually, given the current climate it would be more profitable to cause spammers cancer or remove their limbs automatically for each spam sent or kill their puppies or something like that.

      --
      Nothing in the world is more dangerous than sincere ignorance and conscientious stupidity.
  17. Re:What I want to know is... by MadFarmAnimalz · · Score: 1

    You are confused.

    To block spam at the transport level is one thing; an algorithm for identifying spam without human intervention is another entirely.

    I suggest you RTFA. Their method is actually pretty interesting. Lackluster is not the appropriate word for the novel idea they have come up with.

    --
    Blearf. Blearf, I say.
  18. Works until the Spammers get a copy of it by G4from128k · · Score: 4, Insightful

    This is interesting and promising technology. But like all antispam techniques, spammers will find a way around it. Once spammers get a copy of the software, they can create and test countermeasures in the comfort of their own sleazy lairs.

    For example, the article mentions the software accepts a message that is long but has a few "spammy" sequences. This suggests an immediate countermeasure of adding bulk to spam -- appending a copy of some news article to the spammy payload (some already do this).

    Personally, I've always thought that a simple spell check would do a good job as another layer filtering. It would place spammers in a no-win situation -- either the keyword filter or the spell check filter would get them.
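
A minimal sketch of such a spell-check layer, assuming a tiny stand-in word list (`KNOWN_WORDS` is hypothetical; a real filter would load a full dictionary and would still have to cope with jargon, names and URLs). It only measures the fraction of unknown words, leaving the decision threshold to the caller.

```python
# Hypothetical miniature dictionary, for illustration only.
KNOWN_WORDS = {
    "the", "meeting", "is", "tomorrow", "buy", "today",
    "cheap", "pills", "please", "see", "attached", "report",
}

PUNCTUATION = ".,!?;:\"'()"


def misspelled_fraction(text, dictionary=KNOWN_WORDS):
    """Return the share of words in text that are not in the dictionary."""
    words = [w.strip(PUNCTUATION) for w in text.lower().split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    unknown = sum(1 for w in words if w not in dictionary)
    return unknown / len(words)
```

Combined with a keyword filter, the intent is that an obfuscated word such as "ch3ap" counts as a misspelling, while the clean spelling "cheap" remains visible to the keyword layer.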

    --
    Two wrongs don't make a right, but three lefts do.
    1. Re:Works until the Spammers get a copy of it by Donny+Smith · · Score: 2, Interesting

      Good point - that's why, in theory, closed-source software that isn't available for free download or in an open-source version should be more effective against spam.

      Spell checker as anti-spam filter - that would create huge problems for most Americans :-)
      Otherwise it's a good idea.

    2. Re:Works until the Spammers get a copy of it by Anonymous Coward · · Score: 0

      Sleazy? These people are making so much money...
      http://www.pteam.net/

      It's fucking obscene

    3. Re:Works until the Spammers get a copy of it by Tim+C · · Score: 2, Insightful

      in theory, closed-source software that isn't available for free download or in an open-source version should be more effective against spam.

      How so?

      1) install software
      2) treat as black box
      3) spam spam spam
      4) see what gets through
      5) study, enhance
      6) goto 3)

      Just because you can't see how it works, doesn't mean you can't teach yourself how to get around it.

    4. Re:Works until the Spammers get a copy of it by Anonymous Coward · · Score: 0

      Personally, I've always thought that a simple spell check would do a good job as another layer filtering. It would place spammers in a no-win situation -- either the keyword filter or the spell check filter would get them.

      As terrible as it sounds, I doubt many people will have significantly better spelling than spam, making a test like this worthless. Even if you ignore all the jargon that doesn't appear in a dictionary, web addresses, etc, I find a hell of a lot of my friends send me email in "txt spk" - saying things like "wot u up 2 2nite?" It bugs the hell out of me, because they aren't as moronic as their language makes out - it's just that they are used to sending text messages through mobile phones, where there is a character limit on the messages you can send.

    5. Re:Works until the Spammers get a copy of it by Anonymous Coward · · Score: 0

      How so?

      Trial and error takes longer than reading determineSpamProbability(). Sending thousands of emails with subtle variations and analyzing the results is certainly more painstaking than just reading the source code. That makes it easier to defeat an open-source spam filter. Since you asked.

    6. Re:Works until the Spammers get a copy of it by Tablizer · · Score: 2, Insightful

      Personally, I've always thought that a simple spell check would do a good job as another layer filtering.

      Then 3/4 of slashdotters wouldn't be able to get their messages through to anybody :-)

    7. Re:Works until the Spammers get a copy of it by Donny+Smith · · Score: 1

      >How so?

      Some closed-source anti-spam software isn't available to just anyone and there is no evaluation version - it is available only to corporate users, and I think the installation and maintenance processes are designed to make it hard, even for a sysadmin, to perform actions 1-6.

  19. ....feng-shui... and WAKE up ppl. by danalien · · Score: 1, Troll
    OT, anyone half decent knows 'feng-shui' is a fake thing /* like astrology, tarot-card reading, ... */. ... it's a belief system, and as long as you believe in it your mind will make it real for you ... no *real* scientific studies back them up *AFAIK*.

    and btw, WAKE up ppl. 'Filtering' won't make SPAM *ever* go away. As long as you keep on filtering, I guess, it'll act as a cure/remedy that 'relieves pain', but it isn't a cure/remedy that'll kill 'cancer' for good.

    And on a different side note, 'Filtering' costs us, the consumers, more money in the long run, as it's we who pay for the SPAM, whether we look at it or keep filtering it away (shouldn't such activities be HIGHLY illegal? in any justice system? ...). Because it's we who pay for the Broadband the ISPs deliver to us, and they have to charge us according to how much it costs them to sustain it (+ some profit margin). SPAM eats like *what was it* 60-80% of the total broadband (world wide) now?! And yes siree, you and I are the ones paying for it, if all we do is keep on 'filtering' it...

    ... I guess the recursive point I wanted to make is: as long as you believe this new 'Chung-Kwei' filtering will STOP SPAM, your mind will make it real for you ...

    --
    I don't claim I know more than I know, and if you know you know more than I know, then by all means, let me know.
    1. Re:....feng-shui... and WAKE up ppl. by Anonymous Coward · · Score: 0

      OT, anyone half decent knows 'feng-shui' is a fake thing /* like astrology, tarot-card reading, ... */. ... it's a belief system, and as long as you believe in it your mind will make it real for you ... no *real* scientific studies back them up *AFAIK*.

      Oh, that's okay then. I'd love to comment further, but it's time for my hourly placebo tablet.

      'Filtering' won't make SPAM *ever* go away. As long as you keep on filtering, I guess, it'll act as a cure/remedy that 'relieves pain', but it isn't a cure/remedy that'll kill 'cancer' for good.

      Spam isn't an illness, it is a business. Filtering reduces spammers' profit margins. Reduce them enough and they'll stop doing it.

    2. Re:....feng-shui... and WAKE up ppl. by Rahga · · Score: 1

      ...no *real* scientific studies back them up *AFAIK*.

      Let me get this straight... You are claiming that feng-shui is fake because science doesn't back it up, then you disclaim that claim by saying you don't really know if science backs it up or not. Ok.

      SPAM eats like *what was it* 60-80% of the total broadband (world wide) now?!

      This recent article says that about 80% of the e-mail in the US is SPAM... but e-mail is just a small portion of all internet traffic, less than 5% in many locations such as universities and major corporations. In other words, you probably should have placed a disclaimer saying "I really have no proof or statistics to cite, so I made them up, thanks for reading my post!"
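
The arithmetic behind that objection, as a small sketch. The 80% and 5% figures are the ones quoted in the comment; treating the 5% as an upper bound on e-mail's share of traffic is an assumption made for illustration, and the function name is hypothetical.

```python
def spam_share_of_traffic(spam_share_of_email, email_share_of_traffic):
    """Spam as a fraction of all traffic = (spam / email) * (email / traffic)."""
    return spam_share_of_email * email_share_of_traffic


# 80% of e-mail is spam, and e-mail is at most 5% of all traffic.
upper_bound = spam_share_of_traffic(0.80, 0.05)
print(f"spam is at most about {upper_bound:.0%} of total traffic")
```

Under those figures spam comes to roughly 4% of total traffic, far below the 60-80% claimed in the parent post.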

    3. Re:....feng-shui... and WAKE up ppl. by danalien · · Score: 1
      >Spam isn't an illness, it is a business. Filtering reduces spammers' profit margins. Reduce them enough and they'll stop doing it.

      I concur with you on the fine point you make: "Reduce them (profit margins) enough and they'll stop doing it". But I have a hard time even hypothetically conceiving that "filtering techniques" will ever bog spammers down enough to make them stop reverse-engineering those techniques.

      I've used a couple of the (at the time) best "filtering techniques". These days, their accuracy rate is sinking like the Titanic: in the beginning it was 99%, now it's down to 80% (or lower). Some proof? I get like ~200 SPAMs a day, and ~50 of them pass right through. (1 - (50/200)) = 0.75 ==> ~75% accuracy. But I haven't done a full analysis, so this is a rough estimate.

      --
      I don't claim I know more than I know, and if you know you know more than I know, then by all means, let me know.
    4. Re:....feng-shui... and WAKE up ppl. by danalien · · Score: 1
      >Let me get this straight... You are claiming that feng-shui is fake because science doesn't back it up, then you disclaim that claim by saying you don't really know if science backs it up or not. Ok.

      No, I think you misinterpreted me. I claim (from what I've read (scientific or otherwise)) that feng-shui is a fake thing. Hence the "AFAIK".

      I don't claim I know more than I know, and if you know you know more than I know, then by all means, let me know. I sure would like to know as much as you know, you know?!

      >In other words, you probably should have placed a disclaimer saying "I really have no proof or statistics to cite, so I made them up, thanks for reading my post!"

      ok - it's a totally fair counter-argument you make, let me reply by quoting - "Since shattering that 50 percent mark the level of global spam e-mail has continued to skyrocket. By most measures that figure is now somewhere around 75 percent". And that was July 15, 2004.

      --
      I don't claim I know more than I know, and if you know you know more than I know, then by all means, let me know.
  20. Re:chinks by Anonymous Coward · · Score: 0

    I won't use anything on my computer with a chink name or written by a chink.

    Try looking at your computer and see where 99% of the parts were made, you racist asshole. Or your TV. Phone. Answering machine. And everything else in your home (or more probably, your parents' basement.)

  21. This is all bull -- Change the law by Nice2Cats · · Score: 1
    This isn't going to work -- you simply can't solve a social / legal problem with technology. The only way we are going to get rid of spam is if the U.S. makes it a crime, but there is no sign of that: The new law in fact has done nothing less than legalize it. Don't get your hopes up for a new one: Congress gets too much money from industry, and too few Americans care to vote, so it is a no-brainer for it to support the spam-makers over the citizens -- I'm sorry, the correct word these days is "consumers", isn't it?

    So, yeah, nice technology, but nothing the bad guys can't get around. If you are serious about stopping spam, stop playing with your computer and start bugging your congressperson.

    1. Re:This is all bull -- Change the law by David+M.+Andersen · · Score: 1

      No offense, but there are plenty of examples of (at least partial) technological solutions to social problems. For instance, the ignition lock on my car prevents people from casually stealing it.

      This might not solve the social problem of people wanting to steal cars, but is a decent try at solving the technological problem of people being able to easily do it.

    2. Re:This is all bull -- Change the law by koreth · · Score: 2, Insightful
      This isn't going to work -- you simply can't solve a social / legal problem with technology.

      You'll be buying all your doors without locks from now on, I take it, since burglary is a social/legal problem and the government has passed laws against it. Let us know how that goes.

    3. Re:This is all bull -- Change the law by grumbel · · Score: 1

      And the locks alone without the laws would have solved the problem of burglary? I kind of doubt that...

      The law alone will of course not make the spam magically go completely away, but it will make sure that sending spam becomes a pretty risky business, instead of a completely risk-free one, so people might think twice before sending out a million spam mails. Sure this won't stop people from other countries, however reducing spam from the USA would be a pretty good start.

  22. Interesting... Electronic evolution... by dnaboy · · Score: 5, Insightful
    I think it's really interesting to watch the literal evolution of spam and spam filters. There are really amazing parallels to biological evolution.

    First, there's a constant tuning of both predator and prey (Anti-spam tools and spam).

    Second, there seems to be some sort of equilibrium which is inevitably achieved, and

    Third, there are occasional discreet major developments which change the game. This would be an example. Now, spam is going to be forced to majorly adapt.

    I could see the 'Quality' of spam improving a lot as a result of tools like this. No more letters from my long-lost benefactors in Nigeria, and no one-liners about 'Gushing like a firehose' (My coworkers and I got a good chuckle out of that one), but, as the story said, if you have keywords in a long email, it gets far less penalized. OK. Attach verses from Dante's Inferno, or Joyce's Dubliners to the email. Problem solved. You can't block words like viagra altogether or Pfizer researchers are going to have a hell of a time getting anything through.

    Another concern is that if this forces spammers to make up new and compelling spam, people will be more likely to check it out. While my parents are probably pretty confident they didn't win a secret lottery 3 or 4 times last week, they might possibly believe new and creative stories.

    Perhaps evolution of email readers is just plain going to be a necessary part of the solution...

    1. Re:Interesting... Electronic evolution... by devphil · · Score: 2, Insightful


      First, there's a constant tuning of both predator and prey

      Absolutely. Unfortunately, as most predator-prey models will tell you, neither population ever goes to zero unless something catastrophic happens. And in this case, catastrophe is precisely what we want to happen to the prey.

      (If they'd simply implement my proposed scheme of a bullet to the head of every spammer, no mercy, no appeal, it'd be easy. But noooo, "spammers are human beings no matter how useless and harmful they are," waaaaah.)

      there are occasional discreet major developments

      Um. "Discrete" is the word you want. Spammers are anything but discreet. :-)

      --
      You cannot apply a technological solution to a sociological problem. (Edwards' Law)
    2. Re:Interesting... Electronic evolution... by TimmyDee · · Score: 1

      "Gushing like a firehose." That's good, but can it compare to "Scientists find new black hole!" I thought I was getting the weekly mailing from Nature.

      --
      Per Square Mile, a blog about density
    3. Re:Interesting... Electronic evolution... by Anonymous Coward · · Score: 0

      You can't block words like viagra altogether or Pfizer researchers are going to have a hell of a time getting anything through.

      You assume I want email from Pfizer researchers.
    4. Re:Interesting... Electronic evolution... by cas2000 · · Score: 1

      > (If they'd simply implement my proposed scheme of
      > a bullet to the head of every spammer, no mercy,

      and two bullets for everyone who buys something from a spammer. just to be sure. think of it as assisting evolution by helping to remove stupidity from the human gene pool.

      (in other words, if you can't eliminate the predator, then eliminate the prey and starve the predator :-)

    5. Re:Interesting... Electronic evolution... by Vadim+Makarov · · Score: 1
      Why bullets... just put the economic forces to work.

      In the animal world analogy, if the economic solution is implemented the users who employ it become species without natural enemies in the habitat... like some large animals. In respect to spam, that is.

      --
      17779 eligible voters in a district, 17779 'vote' as one. This is Russia.
  23. Re:What I want to know is... by azaris · · Score: 1

    You are confused.

    Rather more confused are the slashbots who tout client-side content filtering as the end-all be-all "solution" to spam.

    To block spam at the transport level is one thing; an algorithm for identifying spam without human intervention is another entirely.

    The only catch: it's not possible to identify spam (unsolicited bulk e-mail) based on the content alone. Why? Because of the two words in the definition, 'unsolicited' and 'bulk'. How can the existence of the word 'viagra' possibly tell me the message was unsolicited? Even if I'm not interested in buying Viagra, I can still receive important e-mail containing spammy words that's neither bulk nor unsolicited (like spam complaints about my users). The bulk criterion is even more difficult to predict using content filtering alone. About the only solution I know of that addresses this point is the Distributed Checksum Clearinghouse.

    I suggest you RTFA. Their method is actually pretty interesting. Lackluster is not the appropriate word for the novel idea they have come up with.

    The method might be novel. The purpose (content filtering spam that's already been delivered) is not. Such methods simply don't address the costs of receiving and storing spam, only the perceived user inconvenience.

  24. Re:Mozilla Firefox^WThunderbird by FooAtWFU · · Score: 1

    My biggest issue with Thunderbird is the bounce messages. A fair amount of people forge addresses which bounce to me (I'll be putting up SPF Real Soon Now, but that doesn't even mean everyone will read it). As a result, I get some legit bounce messages and some with spam in 'em. If I mark the ones with spam as Junk, I risk throwing away the ones without spam. If I mark the ones with spam as not-junk, I get spam which is similar to them thrown into my Inbox.

    --
    The World Wide Web is dying. Soon, we shall have only the Internet.
  25. It is difficult to beat statistical spam filters by gvc · · Score: 2, Informative
    Notwithstanding accepted wisdom espoused above, random words cannot defeat current statistical spam filters, and it is difficult to defeat such filters even if you have access to the algorithm and the recipient's mailbox.

    John Graham-Cumming presented a talk Beating Bayesian Filters at the 2004 Spam Conference detailing these results. A video recording is available; alas, no paper.

    In conducting a recent spam filter evaluation I observed (but did not report) that the statistical filter attacks were not particularly effective. The only attack that worked sometimes was to make the entire body of the message a current news item or joke, with only a URL linking to the spam payload.

  26. Stop This B\/llsh!t Filtering Crap by Anonymous Coward · · Score: 0
    Even if you achieve true human-level understanding of natural languages in these filters (and most AI specialists agree natural language processing is an AI-hard problem) you still will not get complete reliability. Hell, spam has gotten so sophisticated that sometimes even after reading the whole message I still don't know if the e-mail is a legitimate one from my bank, stock broker, etc.

    Stop devoting resources to dead-end technological solutions! The problem of spam is the problem of unauthenticated e-mail. Add authentication to the mail delivery protocol and the problem of spam goes away.

    1. Re:Stop This B\/llsh!t Filtering Crap by mikael · · Score: 2, Insightful

      Hell, spam has gotten so sophisticated that sometimes even after reading the whole message I still don't know if the e-mail is a legitimate one from my bank, stock broker, etc.

      If after reading the E-mail, you still don't know what product the spam is advertising, then the spammers are losing, since those E-mails will not lead to a sale, and the spammers are simply wasting their own bandwidth.

      --
      Vintage computer adverts: http://www.vintageadbrowser.com/computers-and-software-ads
    2. Re:Stop This B\/llsh!t Filtering Crap by Anonymous Coward · · Score: 0
      If after reading the E-mail, you still don't know what product the spam is advertising, then the spammers are losing, since those E-mails will not lead to a sale, and the spammers are simply wasting their own bandwidth.
      What about phishing schemes? Plus as someone who does all of his banking and financials management online I cannot afford to let any legitimate e-mail pass through and so always read the headers of everything that shows up in my bulk mail folder. Yeah, I'm not going to open any of the "HOT Y0UNG T33NS" messages, but I recently applied for a mortgage and I would probably have to double-check all those "HISTORIC LOW RATES!" ones.
  27. They'll.. by aussie_a · · Score: 2, Interesting

    To get around this spammers will use DNA algorithms to create spam that gets around the blockers ;)

  28. Corrections... by littlewild · · Score: 3, Insightful

    Chung-Kwei is a Chinese semi-deity that wards off evil. He isn't some kind of talisman.

  29. Uh oh - there goes the patent now.... by syrinje · · Score: 1, Interesting

    Congratulations /.

    By now, all the patent-trollster-lurkers who passively phish in the /. pool must be rushing with suitably edited claims to their friendly neighborhood USPTO.

    Can anyone who works in IP (intellectual property NOT Internet Protocol) post a list of known trollster companies that are full of lawyers who acquire patents (by any means) and make patent litigation their primary business model?

    --
    See that long UID - that's what you get for lurking too long
  30. Get the Feng Shui Motherboard by Kozz · · Score: 2, Funny

    "We put the CPU in the center, because that is the chi, or life force for the entire board. A centered chi provides better performance." Now don't you want one?

    --
    I only post comments when someone on the internet is wrong.
  31. Nice tool but greylisting does more right now! by slashname3 · · Score: 2, Interesting

    This will make another nice tool to identify spam. But why not use greylisting at all the ISPs' MTAs to simply refuse 99% of the spam that is being sent right now?

    Seriously, greylisting implemented on all the ISPs' MTAs would overnight block 99% of the spam being sent. Most spam at the moment is being sent from armies of bots run on unsuspecting users' systems connected to cable and DSL service. The programs used are unsophisticated, they churn through a list of addresses spewing messages out by the thousands. They do not queue messages or retry them if they get an error. Greylisting uses this to great effect and blocks spam while letting legitimate MTAs deliver messages.
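
    A minimal sketch of the greylisting idea described above, assuming the common approach of keying on a (client IP, sender, recipient) triplet. The class name, the 300-second delay, and the reply strings are illustrative assumptions, not taken from any particular MTA:

```python
import time


class Greylist:
    """Toy greylist keyed on (client IP, sender, recipient) triplets."""

    def __init__(self, min_delay=300, now=time.time):
        self.min_delay = min_delay  # seconds a new triplet must wait (assumed value)
        self.now = now              # injectable clock, handy for testing
        self.first_seen = {}        # triplet -> time of first delivery attempt

    def check(self, client_ip, sender, recipient):
        """Return an SMTP-style reply: 451 means retry later, 250 means accept."""
        triplet = (client_ip, sender, recipient)
        t = self.now()
        # Remember the first attempt; later attempts reuse that timestamp.
        first = self.first_seen.setdefault(triplet, t)
        if t - first < self.min_delay:
            # A fire-and-forget spam bot never comes back after this.
            return "451 4.7.1 Greylisted, please retry later"
        # A real MTA queued the message and retried after the delay.
        return "250 OK"
```

    A sender that queues and retries gets through on a later attempt; a bot that never retries is refused without the message body ever being examined.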

    True, it is not 100% effective, some small number of spam messages get through since some spam goes through legitimate MTAs and the message is retried. But once you remove the bulk of spam those can be tracked down and shutdown or blocked at the firewalls.

    If the ISPs would implement this, spam would become a non-issue overnight. Email would once again become a mostly useful tool. But I guess the problem is that the ISPs have no vested interest in solving this problem. None of them will listen or implement this simple solution which does not block any legitimate email. With 70% of the email on the network being spam (number may be higher than that at this time) I would think they would jump at a solution that would reduce the loads on their servers. But I guess they make too much money from spammers to implement such a simple solution.

    1. Re:Nice tool but greylisting does more right now! by Incadenza · · Score: 1

      Seriously, greylisting implemented on all the ISPs' MTAs would overnight block 99% of the spam being sent. Most spam at the moment is being sent from armies of bots run on unsuspecting users' systems connected to cable and DSL service. The programs used are unsophisticated, they churn through a list of addresses spewing messages out by the thousands. They do not queue messages or retry them if they get an error. Greylisting uses this to great effect and blocks spam while letting legitimate MTAs deliver messages.

      And how much time will it take for the spammers to write their programs around greylisting? A matter of weeks, if not days.


      Do you ever look at your server logs? I have seen coordinated spam-attacks from different servers for well over a year now. When a spam gets rejected because of an IP block, it is a matter of seconds before I see the same or similar spam submitted from an entirely different IP address, which could get blocked again. Sometimes this technique uses ten different servers before they give up or hit a non-blacklisted machine (which means that the mail does reach my server).

      I don't see any reason why the people that wrote this spam software can't write something overnight to bypass greylisting.

    2. Re:Nice tool but greylisting does more right now! by Anonymous Coward · · Score: 0

      Greylisting is great; I'm using it now. But it's not without its drawbacks. Legitimate, possibly urgent email can be delayed if the information isn't in the local database. Some ISPs and some MTAs don't properly deal with temporary failures. You can certainly argue that they're the ones in the wrong, but some of these ISPs are the 800-pound gorillas, and a site with lots of typical users will never be able to simply lose lots of mail from these sites -- not if you want to keep your paying users, communicate with your clients, etc. So you have to either actively monitor your greylisting database for lost mail from legitimate sites that need to be added to your whitelist, or decide that you don't care.

      Now that I'm using greylisting for my personal email site, the vast majority of the spam I get is through mailing lists and personal email forwarding addresses, all of which are at legitimate sites and which greylisting will do nothing about. So once you've implemented a greylist filter, you still may need another spam filtering or blocking technique.

    3. Re:Nice tool but greylisting does more right now! by slashname3 · · Score: 1

      But as soon as they write their code to queue the message and retry it after the delay period we pull another little trick out. During that delay period the spammer is sending out spam to lots of other sites, including a few spam traps. The spam traps add the spammer address to an RBL. When they get back to your system after the delay period you check the RBL list and drop the message now that it is showing up there.
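
      The combination described above can be sketched roughly as follows. A plain Python set stands in for the RBL (real blocklists are usually queried over DNS), and the function names and the 300-second delay are assumptions made for illustration:

```python
def spam_trap_hit(client_ip, blocklist):
    """A spam trap received mail from this address: add it to the blocklist."""
    blocklist.add(client_ip)


def handle_retry(client_ip, waited_seconds, blocklist, min_delay=300):
    """Decide what to do with a delivery attempt from a greylisted sender."""
    if client_ip in blocklist:
        # The sender hit a spam trap while waiting out the greylist delay.
        return "550 Rejected: sender is listed in the blocklist"
    if waited_seconds < min_delay:
        return "451 Greylisted, please retry later"
    return "250 OK"
```

      The greylist delay is what gives the spam traps time to catch the sender before the retry arrives.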

      In over a year the spammers have not done anything different but dump and spew. You still keep spamassassin running as a second line of defense which this new tool could be used for as well. So the ones that manage to get past greylisting are stopped by spamassassin.

      The whole idea is to make it as costly and troublesome as possible for the spammers to keep up their crap. By using greylisting along with other tools you increase the cost since they have to resend the message multiple times and keep track of all the addresses and sites they send it to. Makes it more likely that they would run out of resources or at least reduce the number of messages they can successfully pump out over a given period of time.

      By reducing the number of messages they can get delivered it has to start reducing their income. Hit them in the cash flow and maybe they will be convinced to cheat and scam some other way.

  32. Spam == Terrorism by Anonymous Coward · · Score: 0

    We must stop these terrorist spammers.
    Now watch this drive!

  33. For those who don't want to RTFA by Frankie70 · · Score: 2, Funny

    Summary
    1) Make your PC face the North, whenever you are checking Email.
    2) Hang a metal windchime above your workstation.
    It is important that the rods of the windchime be hollow, so that the auspicious Chi can rise up the chimes.
    3) Add a user account for the Dragon Turtle & make him the admin.

    1. Re:For those who don't want to RTFA by Anonymous Coward · · Score: 0

      4) Profit! when you charge your consulting fee.

  34. More correct than you know by Hao+Wu · · Score: 2, Interesting
    Funny how some people develop more and more sophisticated stuffs to fight against something that is just as simple as sending out emails to random address...

    This is just like your own immune system, which uses such things as "V-D-J" recombination (and other tricks) to create billions of somewhat random, different antigen receptors to attack potential unknown pathogens. These cells must then be further educated not to attack "self" in your own body.

    If only computer geeks took some lessons from biologists, perhaps they could get a grip on principles to stop SPAM.

    --
    I suggest you read Slashdot
    1. Re:More correct than you know by misleb · · Score: 1
      If only computer geeks took some lessons from biologists, perhaps they could get a grip on principles to stop SPAM.

      Doesn't Bayesian filtering work somewhat like the immune system? After being exposed to the "environment" it learns what is "self" and what is "pathogen" and starts distinguishing one from the other pretty reliably. I currently use a server-side Bayes filter on my email and I get 99.5% accuracy with very little manual intervention. And it gets more and more accurate the longer you use it.. unlike things like SPAMAssassin which require manual updating to adapt to new SPAM. -matthew
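
      For readers who haven't seen one, here is a minimal sketch of a naive Bayes filter that learns "self" (ham) and "pathogen" (spam) from labelled examples. It is a simplification under stated assumptions: whitespace tokenization, add-one smoothing, and only the tokens present in a message are scored. Real filters differ in all three respects.

```python
import math
from collections import Counter


class NaiveBayesFilter:
    """Toy naive Bayes spam filter trained on labelled messages."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}  # token -> messages containing it
        self.messages = {"spam": 0, "ham": 0}                # messages seen per class

    def train(self, text, label):
        self.messages[label] += 1
        self.counts[label].update(set(text.lower().split()))

    def spam_log_odds(self, text):
        # Start from the smoothed prior odds of spam versus ham.
        score = math.log((self.messages["spam"] + 1) / (self.messages["ham"] + 1))
        for tok in set(text.lower().split()):
            # Smoothed probability that a message of each class contains the token.
            p_spam = (self.counts["spam"][tok] + 1) / (self.messages["spam"] + 2)
            p_ham = (self.counts["ham"][tok] + 1) / (self.messages["ham"] + 2)
            score += math.log(p_spam / p_ham)
        return score

    def is_spam(self, text):
        return self.spam_log_odds(text) > 0
```

      Every call to train() shifts the token statistics, which is why such a filter keeps adapting as long as someone keeps correcting it.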

      --
      "THERE IS NO JUSTICE, THERE IS ONLY ME." -Death
  35. Giving birth to Artificial Intelligence... by mcrbids · · Score: 3, Interesting

    It's my belief that the most likely source of the birth of Artificial Intelligence will be the SPAM filter.

    Think about it - we now have software that "learns" what you like.

    Sorry, but anything that "learns" fits a definition of intelligence - using past results to predict future outcomes. Note that I'm not saying "self aware" or "conscious", simply "intelligence".

    As we move forward, we'll see more and more intelligence on the part of the spammers, and the warring factions of intelligence will likely provide massive financial and political impetus to build ever more intelligent solutions - thus AI is born.

    The problem with other vehicles for developing AI is simply the budget. With SPAM, everybody has a direct, financial incentive to develop it, so development will definitely happen!

    --
    I have no problem with your religion until you decide it's reason to deprive others of the truth.
    1. Re:Giving birth to Artificial Intelligence... by Anonymous Coward · · Score: 1, Interesting

      You are 40 years behind the times. While it's chic to filter your spam using naive Bayesian text classifiers, don't kid yourself. Machine learning and text classification have been around since the 1960s.

    2. Re:Giving birth to Artificial Intelligence... by cas2000 · · Score: 1

      > It's my belief that the most likely source of
      > the birth of Artificial Intelligence will be
      > the SPAM filter.

      so AIs will spend all their time reading spam?

      great, just what we need - a psychopathic HAL with an obsession for penis enlargement.

  36. Biology is information technology by mcrbids · · Score: 1

    I think over the next 2 decades, we'll come to a greater understanding of life - and I think that we'll discover a unique aspect of life - that life is truly information technology.

    Each cell in your body contains approximately 20 GB of data. Consider the redundancy and sheer massive size of information storage capacity your body consists of! Compare THAT to an Oracle cluster...

    So, given the incredible need to process information in order to understand life itself (which could be considered a form of self-replicating information) I think that not only is it likely, but it's all but guaranteed that the lion's share of Information Technology advances will come from biological research.

    PS: nanotechnology == microbiotics. Why re-invent the wheel when nature has spent billions of years perfecting nanotechnology? I think the "nanotechnology revolution" will be largely biological, with technological extensions.

    When we speak of "the singularity", I think that's the point where our (currently abiological) technology fuses with biology to where they aren't clearly defined any longer.

    Man or machine? Who can tell? How would you define either one?

    --
    I have no problem with your religion until you decide it's reason to deprive others of the truth.
  37. Best Spam Software by cstream_chris · · Score: 1


    I have tried just about every single anti-spam software out there, so I have some experience. After being fed up with getting false positives and having to deal with tons of spam getting past the spam filters I tried out Cloudmark's Spamnet - a community based approach to fighting spam. So far it has been 95-99% effective with 0 false positives which is the most important factor for me.

    In the past couple of months it has blocked 19,221 spam messages. I don't even bother to send spam to a Spam folder anymore it just goes straight to the deleted items.

    For those of you getting a lot of e-mail, the price of the subscription is definitely worth it.

    URL: http://cloudmark.com/products/spamnet/

    1. Re:Best Spam Software by Anonymous Coward · · Score: 0

      Maybe they've changed recently, but my experience with Razor2 (the free OS clients to the same signature network) is that while the catch rate is very high, there are a lot of false positives that come with it.

  38. Everybody's doing this now by K-Man · · Score: 1

    I just went through a couple of rounds of interviews with a spam filtering company about doing something similar. The problem these days is that spammers have figured out that "V1AGRA" can be spelled in a number of ways which fool word-based spam filters. There is also a lot of hidden information, such as html and urls, which may be significant, but is difficult to identify with exact string matching.

    The approach used to be:

    1. Find features (usually well-delimited words) in the message.
    2. Look up the features in a database of precalculated scoring information.
    3. Add up the scores for all the features found, using some buzzwordy algorithm.
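
    The three steps above can be sketched as follows. The score table is made up for illustration, and a plain sum stands in for the "buzzwordy algorithm":

```python
import re

# Hypothetical precalculated scores: positive leans spam, negative leans ham.
FEATURE_SCORES = {"viagra": 3.0, "lottery": 2.5, "nigeria": 2.0, "meeting": -1.5}


def extract_features(message):
    """Step 1: split the message into well-delimited, lowercased words."""
    return set(re.findall(r"[a-z0-9]+", message.lower()))


def score_message(message, scores=FEATURE_SCORES):
    """Steps 2 and 3: look up each feature and add up the scores found."""
    return sum(scores.get(feature, 0.0) for feature in extract_features(message))


def is_spam(message, threshold=2.0):
    return score_message(message) >= threshold
```

    Note that "V1AGRA" tokenizes to "v1agra", misses the table entirely and scores zero, which is exactly the weakness of exact matching.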

    Nowadays the features may not be so obvious. For instance "V1AGRA" may not be present in the feature database, but if "VIAGRA" is, we should be able to link to it via some sort of approximate match, or substring match. Here we can see that both strings have "AGRA" in common, and score accordingly. Longer strings, like "Former Dictator of Nigeria", provide more material to match on.
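
    One rough way to sketch that kind of approximate link is to find the longest substring a token shares with each known feature, using difflib from the Python standard library. Weighting the score by the shared fraction of the feature is an assumption made for illustration, not something the filters discussed here are known to do:

```python
from difflib import SequenceMatcher


def longest_shared_substring(a, b):
    """Return the longest contiguous run of characters common to a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]


def approximate_score(token, feature_db):
    """Score a token by its best partial match against known features."""
    best = 0.0
    for feature, score in feature_db.items():
        shared = longest_shared_substring(token.upper(), feature.upper())
        weight = len(shared) / len(feature)  # fraction of the feature matched
        best = max(best, weight * score)
    return best
```

    For "V1AGRA" against "VIAGRA" the shared run is "AGRA", so the token inherits four sixths of the feature's score instead of none.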

    One problem with substring matching is that substrings can overlap, yielding multiple matches for the same piece of text. A string of length n has about n^2/2 different substrings, so our feature space is enormous. Adding up all the feature scores from multiple overlapping hits in a useful way is also much more difficult.

    One way out of this mess is to pick a really simple scoring method. Gzip "scores" (in compression amount) messages on how many characters match, in substrings beyond a certain length (4?), using a greedy algorithm. It's a simple tool for gauging the similarity of two files.
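
    To make the size of that feature space concrete: counting substrings by start and end position gives exactly n(n+1)/2, which approaches n^2/2 for long strings. A small sketch (the function names are illustrative):

```python
def count_substring_positions(n):
    """Number of non-empty (start, end) substring positions in a string of length n."""
    return n * (n + 1) // 2


def distinct_substrings(s):
    """Set of distinct non-empty substrings of s (brute force, small inputs only)."""
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}
```

    The number of distinct substrings can be lower than the positional count when characters repeat, but the quadratic growth stays: a 1,000-character message already has 500,500 substring positions.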
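
    A sketch of compression as a similarity measure, using zlib (the DEFLATE algorithm that gzip is built on, which encodes repeated substrings of at least 3 bytes). The normalized compression distance below is one well-known way to turn compressed sizes into a score; it is offered as an illustration, not as what any particular filter does:

```python
import zlib


def compressed_size(data):
    return len(zlib.compress(data, 9))


def compression_distance(x, y):
    """Normalized compression distance: lower means the inputs share more substrings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Made-up sample messages for illustration.
spam_a = b"Dear friend, I am the former dictator of Nigeria and I need your help moving 25 million dollars out of the country."
spam_b = b"Dear fr1end, I am the f0rmer dictator of Niger1a and I need your help moving 25 milli0n dollars out of the country."
ham = b"Hi Bob, the quarterly budget review is rescheduled to Thursday at ten; please bring the updated spreadsheet with you."
```

    With these samples, compression_distance(spam_a, spam_b) should come out well below compression_distance(spam_a, ham), because the obfuscated copy can mostly be encoded as back-references into the original.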

    The IBM method seems a bit more sophisticated. I've looked up similar methods in bioinformatics textbooks. They handle overlapping, and appear to choose their features with a substring-counting approach.

    --
    ---- "If we have to go on with these damned quantum jumps, then I'm sorry that I ever got involved" - Erwin Schrodinger
    1. Re:Everybody's doing this now by Anonymous Coward · · Score: 0

      For instance "V1AGRA" may not be present in the feature database, but if "VIAGRA" is, we should be able to link to it via some sort of approximate match, or substring match.

      set allow_leet_speak=0
    2. Re:Everybody's doing this now by ZeekWatson · · Score: 1
      I just went through a couple of rounds of interviews with a spam filtering company about doing something similar. The problem these days is that spammers have figured out that "V1AGRA" can be spelled in a number of ways which fool word-based spam filters. There is also a lot of hidden information, such as html and urls, which may be significant, but is difficult to identify with exact string matching.

      I don't think there are any real "word-based-spam filters" other than some useless but well intentioned procmail scripts and such.

      Bayesian (any adaptive filter really) will handle cases like v1agra quite well because it's unlikely that anyone would ever spell viagra like that. So it will be considered a very likely spam token. Of course it needs to see a known spam message using the v1agra spelling first, which is the big shortcoming of all adaptive filters.

      I know of at least 1 company that has already worked around the n^2/2 problem with efficient algorithms. These filters know that v1agra or v.agra or v i a gra is actually spelling viagra and score the message appropriately.

      Finally, the spammers are not keeping up at all in the arms race. Anyone with a good antispam solution doesn't really get any spam.

    3. Re:Everybody's doing this now by Mikeydude750 · · Score: 0

      What about using an OCR system to capture each word as a whole and analyze it? Tell it to read each character as the letter...or even better...tell it to just treat all "l33ted" words as spam.

      No person you actually want to talk to uses l33t-speak, anyway. At least, I hope not...

  39. Nothing new here, move along... by po8 · · Score: 4, Informative

    As someone who's done some research on machine learning for spam filtering, this sure looks to me from their 8-page paper like yet another simplistic ML algorithm advocated by folks who don't know the field and tested using techniques of questionable sensitivity. Their "novel" method sounds an awful lot like feature set construction by clustering, a method that is widely used in the spam filtering literature, but with a somewhat novel clustering technique from biology.

    Message filtering starts by throwing away line breaks for no obvious reason, then optionally removing the known ham from the training set for no obvious reason. Message headers are then thrown away, for no obvious reason.

    No general method is given for corpus allocation. In the experiment reported later, the original corpus appears to have been split roughly in half. (For unreported reasons, none of these splits are exact. No rationale is given for the various corpus allocations.) The training corpus is then split into ham and spam, and the ham portion is split in half. The spam training corpus is used for "positive training": determining a complex feature set as described below. One half of the ham training corpus is then used for "negative training": filtering out complex features that are common in ham. The remainder of the ham corpus is used as a validation set to select thresholds described below. No justification is given as to the failure of the validation set to include spam messages, and the procedure is vague on this point.

    The description of the key "positive training" phase is difficult to follow: it seems to assume the pre-existence of the "SPAM vocabulary" [sic] being constructed. The key idea seems to be to use positional index of words within the body as base features, and construct complex features by using a pattern recognition algorithm to find correspondences between sets of base features across spam messages. Patterns that appear across many spam messages are treated as indicating spam.

    The final training step is to set thresholds for (1) minimum number of complex features in the spam message and (2) fraction of the message text covered by the complex features. One would expect these two criteria to be highly correlated: no effort appears to have been made to enforce or explore their orthogonality.

    The classification phase proceeds by simply counting the number of patterns in a given test message and the percent coverage of the message by the patterns. If the result exceeds both thresholds, the message is classified as spam.
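
    A rough sketch of that two-threshold decision as I read it. Plain substrings stand in for the learned patterns here, which is a simplifying assumption (the paper's patterns come out of its pattern-discovery step), and the threshold values are arbitrary:

```python
def classify(message, patterns, min_hits, min_coverage):
    """Return (is_spam, hits, coverage) for a message and a list of spam patterns."""
    covered = [False] * len(message)
    hits = 0
    for p in patterns:
        start = message.find(p)
        while p and start != -1:
            hits += 1
            # Mark every character position this occurrence covers.
            for i in range(start, start + len(p)):
                covered[i] = True
            start = message.find(p, start + 1)
    coverage = sum(covered) / len(message) if message else 0.0
    # Spam only if BOTH the hit count and the coverage reach their thresholds.
    return hits >= min_hits and coverage >= min_coverage, hits, coverage
```

    Both numbers grow together as more patterns match, which is why one would expect the two criteria to be highly correlated.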

    For the empirical evaluation, the corpus used seems to have consisted of approximately 130,000 messages, roughly 1/4 ham and 3/4 spam. No details of the construction or acquisition of this large corpus were given. Because of its volume, one would suspect a synthetic corpus from high volume sources. The details of this corpus construction are critical to the evaluation of the method, so no useful conclusions can really be drawn from the empirical evaluation other than that, like most machine learning methods, this method works well on some problem set.

    The claimed accuracies from the technique are at a level that is highly suspect from previous experience: there are fundamental bounds on how well any ML algorithm can do in real situations that don't appear to be met here. Indeed, messages found to be misclassified as spam in the test corpus were manually reclassified, but no effort seems to have been made to identify messages that were "correctly" classified by the algorithm but misclassified in the corpus. The accuracy before manual manipulation of the results (!) appears to be about 97%, which is well within the normal expected range. Computational efficiency appears to be good.

    The vocabulary used in the paper is not particularly consistent with the vocabulary normally used in the spam filtering or machine learning literature. A few spam filtering and machine learning papers are cited, but not many: citations are primarily from the

    1. Re:Nothing new here, move along... by danalien · · Score: 1
      >P.S.---I can't believe that the banner ad at the top of my browser window as I write this is actually blinking at me. Thanks, /. editors. Do me a favor, folks, and don't buy anything from Server Beach.

      why? *just curious, as from the post you seem like a bright person ...*

      --
      I don't claim I know more than I know, and if you know you know more than I know, then by all means, let me know.
    2. Re:Nothing new here, move along... by BuGless · · Score: 1

      Because blinking ads are a nuisance and a clear sign of bad taste.

      On the topic of knowing, just to let you know: I know more (about grammar) than you do, please check your signature and find the two occurrences of "then" that should have been replaced with "than".

    3. Re:Nothing new here, move along... by danalien · · Score: 1
      >bad taste.

      ah. so other than a bad PR judgement, they are ok.

      >about grammar

      all right'y then. think I found what you were saying.

      --
      I don't claim I know more than I know, and if you know you know more than I know, then by all means, let me know.
  40. Here's what I'm wondering... by myov · · Score: 1

    Why can't we start filtering based on the URLs in spam? There would need to be some verification process (otherwise valid URLs would be blocked), but wouldn't it increase the cost to spam since spammers would need to register even more domains? After a while, this should also give us a list of spam-friendly hosting providers who should be banned from the rest of the internet.

    --
    I use Macs to up my productivity, so up yours Microsoft!
    1. Re:Here's what I'm wondering... by ZeekWatson · · Score: 1
      Why can't we start filtering based on the URLs in spam?

      ActiveState PureMessage has been doing this for years.

      Also now available for free via SURBL

      Just when you thought you had a new idea, it turns out to be older than the hills...
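
      For anyone wondering what URL-based filtering looks like in practice, here is a rough sketch. It assumes a local set of listed domains; services like SURBL publish their lists over DNS instead, and that lookup is not shown. The function names and the example domain are made up.

```python
import re
from urllib.parse import urlsplit

# Deliberately simple URL matcher; real mail needs more careful parsing.
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def url_domains(body):
    """Collect the host names of all http/https URLs found in a message body."""
    domains = set()
    for url in URL_RE.findall(body):
        host = urlsplit(url).hostname
        if host:
            domains.add(host.lower())
    return domains


def listed_domains(body, blocklist):
    """Return the hosts in the body that match the blocklist, directly or via a parent domain."""
    hits = set()
    for host in url_domains(body):
        labels = host.split(".")
        # Check the host and each parent domain, e.g. www.pills.example -> pills.example.
        for i in range(len(labels) - 1):
            if ".".join(labels[i:]) in blocklist:
                hits.add(host)
                break
    return hits
```

      Matching on the parent domain is what forces the spammer to register a new domain rather than just inventing a new host name under the old one.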

    2. Re:Here's what I'm wondering... by Incadenza · · Score: 1

      Also available in Vipul's Razor:

      NAME Changes - razor-agents 2.61 (July 06, 2004) * Introduced the Whiplash signature scheme. Whiplash signatures are based on canonical domain names present in URLs embedded in spam messages. A Whiplash signature is also a function of the length of the spam message. It's important to note that not all whiplashes are used as classifiers. The Whiplash engine is augmented by sophisticated logic on the Razor2 backend to select the Whiplashes that are used to filter spam.
  41. Or... by sean.peters · · Score: 2, Funny

    1) Acquire software
    2) Decompile
    3) Study code
    4) Develop countermeasure
    5) spam spam spam

    It's not like spammers care about the EULA that says they can't look at the code. Oh, and before I forget...

    6) ???
    7) Profit!

    Sean

    1. Re:Or... by Mikeydude750 · · Score: 0

      Reading decompiled code? Man, that would suck...

  42. Virus and worm detection! by Ungrounded+Lightning · · Score: 2, Interesting

    That should work for virus and worm detection, too!

    Even more so, since viruses are much more a compilation of a set of previous constructions with a few mods than a new composition not necessarily based on the wording of old scams.

    And viruses and worms (especially worms) are more constrained by their environment, requiring an exploit of a vulnerability and the installation of work-doing code. Though gene-shuffling techniques might be able to bury much of the code, the basic exploit must continue to be some sort of match to the vulnerability's "receptor".

    --
    Bantam Dominique roosters crow a four-note song. Once you've heard it as "Happy BIRTHday" you can't NOT hear it that way
  43. Registering eMail addresses by shubert1966 · · Score: 1

    If it cost 10$ a year . . . to register an email addy, there would be no incentive for the spammers to throw the dictionary at domains, and conversely, the spammers couldn't/wouldn't want to create thousands of email addys to spam from.

    I had not heard that angle before. That rocks! You'd think it would be the sort of thing a politician could wield in court too.

    It's strange to me that there are a whole slew of laws concerning other modes of communication, but the internet is slow to be regulated. I know regulation won't stop people from doing stuff, but if the laws are defined then you can punish people in court when they transgress. I think a bevy of young lawyers, reared on IT, are gonna change that someday soon.

    --
    Stuff that matters.
    1. Re:Registering eMail addresses by qwijibo · · Score: 1

      This wouldn't have any impact on spam.

      On the one hand, you have the "email marketers" who use their own valid domains/addresses, who wouldn't need to pay anything extra. They pay for the domains, but that doesn't stop them from registering a bunch of them. Adding another tax for the user@ part of the name would have no impact.

      On the other hand, you have the ones who use cracked Windows boxes to send out their scams. They're already stealing the resources to do their bidding. What difference would it make to them if they had it send "from" the legitimate email account of the person whose machine is a spam zombie?

      I believe digital signatures are the only way to solve the email problems we have. I get spam claiming to come from me because there is nothing that prevents spammers from forging their addresses. The first step to solving this problem is accountability. If I know that an email comes from who it claims to come from, I can lump that source with a good or bad group. With a web of trust, I could easily let through friends of friends of friends while blocking friends of friends of spammers.
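
      To make the web-of-trust idea concrete, here is a minimal sketch. It assumes signatures have already been verified, so every name is who it claims to be, and that each person publishes a list of people they vouch for. The function name, the `vouches` mapping and the three-hop limit are all illustrative choices, not part of any existing system.

```python
from collections import deque


def trusted_senders(me, vouches, spammers, max_hops=3):
    """Return everyone reachable from `me` within `max_hops` vouching links.

    Known spammers are never accepted and never followed, so anyone who is
    only vouched for by a spammer stays outside the trusted set.
    """
    trusted = set()
    seen = {me}
    queue = deque([(me, 0)])
    while queue:
        person, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not look past the hop limit
        for friend in vouches.get(person, ()):
            if friend in seen or friend in spammers:
                continue
            seen.add(friend)
            trusted.add(friend)
            queue.append((friend, hops + 1))
    return trusted
```

      With `max_hops=3` this lets through friends, friends of friends, and friends of friends of friends, and nothing further out.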

    2. Re:Registering eMail addresses by shubert1966 · · Score: 1

      Did you read the parent thread about using a CAPTCHA image in a web form and handing out a URL instead of an email address? Don't you think that would have an impact?

      Saying that paying to register email addresses would have "no impact" isn't sitting with me too well. I would think that incurring a $10 fee per address would have a significant impact on legitimate spammers, since they use so many. The ROI would be cut drastically. Of course the negative political impact would be astronomical. Yet you're right: zombie machines would still be doing their thing until the villagers track them down one by one and burn those machines.

      . . . digital signatures . . . The first step to solving this problem is accountability . . .

      I stopped reading SSL and digital signature documentation because it was too complex for me to grasp and implement. I must admit my ignorance here. I tend to think that anything can be forged. If possible, however, the resulting Clean Internet would be, well, clean. Any occasional abuse would be immediately arrested. You'd think that once developed they could apply the concept to publicly held companies that transmit accounting data and catch them when they aren't liquid - no more Enron, WorldCom, et al.

      Ok, ok, ok it's pie in the sky . . .

      --
      Stuff that matters.
  44. Totally offtopic... by NoOneInParticular · · Score: 1
    ... but I think the combination of parent and grandparent let me finally see the light on the issue of what these three question-marks are supposed to be:

    1) Collect underpants
    2) Goto 1
    3) Profit

    See, it does make sense!

  45. better answer by Anonymous Coward · · Score: 0

    KILL ALL SPAMMERS

    Come on, you dumbass gun loving Dubya-ass-kissing americans can do it!

    Spam is a WMD! most spam comes from the USA!

    NUKE THE USA! !! ! Come on do it! I'm sure it will be good for Bush's oil friends, so you must be eager to try it!

    reasoning any Dubyafucker can understand and love.

  46. Why not just change? by Proc6 · · Score: 1
    I am no expert on mail or spam, but I recall reading various good ideas for how to re-implement email to make it nearly impossible for spammers. Like that "ticket" system, where the mail actually sits on the sender's mail server until you collect it. I'm sure there are dozens of good ideas that, just with simple logistics, make it nearly impossible / unfeasible to mass mail random people.

    Given that, why can't there just be a proposal, adopted (like a DVD format, etc) by some huge players (Microsoft, OpenSource, whatever), and then announce a sort of "Spam Doomsday", a ways out. Say January of 2006. Give people time to write in the new mail handling ability so it's side by side with POP in the next Outlook and Thunderbird and whatever else, as well as all the various mail-servers, and months for organizations to plan on roll out.

    Yes, upgrades are a bitch, especially on a large scale, but corporations have to go through them anyway, from the OSes to the software running on them. I don't know of too many people who'd prefer to throw never-ending money at spam blocking and the net traffic associated with it instead of just knowing "January 18th, 2006 - everyone on the net (who has a brain) is moving to nEw-Mail".

    I realize POP/SMTP is ancient and embedded, but unlike a physical "format" like CD-ROM or 4mm DAT, the bulk of existing email is transient: it's collected and forgotten about or "local" in a few minutes. So picking a transition day, agreeing on an open transition method, and just "DOING IT", can't possibly be that hard. POP/SMTP wouldn't even have to go away, it could run side-by-side for the hold-outs, but I think 99% of the people are so tired of spam they'd be understanding of "Sorry, if you want to email customer support at Citibank, you will need to use our new support#citibank.com nEw-Mail address." when it's so universally pushed.

    --

    I'm Rick James with mod points biatch!

  47. I solved the spam problem. Seriously. Interested? by iamcf13 · · Score: 1

    My fast, efficient, method is very light on system resources and attacks spam by detecting one or more common attributes of spam and taking the appropriate action.

    Complete details here.

    Bryan Taylor
    iamcf13@hotpop.com
    SpamByte code: 7
    (see http://www.cf13.com/game-over-spammers.htm )
    http://www.cf13.com/press-release.htm
    All email containing unwanted content will be summarily deleted or reported as spam.

  48. Greylisting works for me by muleboy · · Score: 1

    I just installed greylistd by Tor Slettnes about 24 hours ago, and haven't received a single spam yet (down from 20-30 per day before). I only have a 5 minute greylist delay, meaning there's almost no downside to this method. Assuming my correspondents don't use broken mail servers (and that's their problem if they do) there are no false positives and no maintenance with this system. I use no other spam filters of any kind. I guess they just aren't patient enough to wait 5 minutes :)

    And if they start doing retries, I will add SMTP delays or other techniques as suggested in Tor's excellent guide to mail filtering at the server level: Spam Filtering for Mail Exchangers.

    1. Re:Greylisting works for me by ZeekWatson · · Score: 1
      I only have a 5 minute greylist delay, meaning there's almost no downside to this method.

      Hmm, it sounds like you don't understand how greylisting works. That 5 minute delay is only the minimum delay before you'll accept a message. It has no bearing on how long the actual delay will be.

      This is because it is up to the remote MTA to decide how long to wait before retrying, so your 5 minute setting doesn't really affect the actual message delay. ie If the remote MTA wants to wait 4 hours before retrying, there is nothing you can do about it.
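
      A toy model shows the difference. This is only a sketch of the general greylisting idea, not how greylistd is implemented; the `Greylist` class, its method and the reply strings are made up, and timestamps are plain seconds passed in by the caller.

```python
class Greylist:
    """Temporarily reject mail from unknown (client IP, sender, recipient) triplets."""

    def __init__(self, min_delay=300):
        self.min_delay = min_delay  # seconds a new triplet must wait
        self.first_seen = {}        # triplet -> time of first attempt

    def check(self, client_ip, sender, recipient, now):
        triplet = (client_ip, sender, recipient)
        # Remember when this triplet was first seen; later attempts reuse that time.
        first = self.first_seen.setdefault(triplet, now)
        if now - first < self.min_delay:
            return "451 try again later"  # temporary failure, sender should retry
        return "250 accepted"
```

      If the remote MTA tries at 0 seconds, again after one minute, and then not until four hours later, the first two attempts get the temporary failure and the message arrives four hours late. The 5 minute setting only decided that the one-minute retry was too early.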

    2. Re:Greylisting works for me by muleboy · · Score: 1

      I understand that, but in practice it looks like most MTAs retry within 15 minutes, and usually several times within 15 minutes. I'm not really worried about a 4 hour delay, I don't have time-critical things to respond to, and if I do, it will be from people who are already whitelisted.

  49. Abandon AI crap. Need another approach by Tablizer · · Score: 1

    More clever filters and pattern matchers are not going to work. Just like encryption, the more something is used, the more likely it is to be hacked around. Maybe early adopters will benefit, as the spammers have not had the time or target size to catch up yet. But on a grander scale, it is a no-win cat and mouse game.

    The solution is the same one that reduces paper junk mail: postage fees. Charge 5 cents or so per message, and spam will greatly shrink.

  50. A pitfall of relying on others' classifications by ynotds · · Score: 1
    I'd like it if there could be a database where if a subject header is reported as spam by one user it affects other users' scoring.
    One of my accounts is a catch-all for a domain whose addresses have been misentered both into legitimate mailing lists and as the erroneous e-mail addresses of people who are copied on, and sometimes even directly addressed by, genuine personal e-mails. But to me they are all equivalent to spam, so if I were reporting spam to some authoritative list there would likely be an outbreak of false positives.
    --
    -- Our systemic servants do not good masters make.
  51. Blinking text by po8 · · Score: 1

    When blinking text was made part of HTML in the 90s, it was universally despised by users and competent webmasters alike. It's distracting, annoying, and can even be dangerous to those prone to epilepsy. Eventually, for the most part browsers quit supporting it, webmasters quit using it, and users quit visiting sites where it appeared.

    As far as I'm concerned, the fact that the Server Beach ad is a blinking flash animation doesn't make it substantially different. I'm getting it again as I type this, and finding it just as annoying and obnoxious as I did the first time. Professional web advertisers and site admins should know better.

  52. Masking the symptoms by ekhben · · Score: 1

    By the time the SPAM has reached your filters, you've already lost. It's already consumed your bandwidth, it's consuming your processing time and storage, and the process of updating, teaching, writing and managing your more and more complex filters is still consuming your time. The answer is to go for the root of the problem, which is the naive level of trust that SMTP implies. There are a number of attacks on this problem, with SPF looking like a strong contender. Encourage your ISP to enable SPF checking, and block the spam before it's even sent.
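
    For those who haven't looked at SPF: the domain owner publishes a record listing which hosts may send mail for the domain, and the receiving server compares the connecting IP against it. The sketch below is a heavily simplified evaluator that only understands the `ip4`, `ip6` and `all` mechanisms; real SPF also has `a`, `mx`, `include`, `redirect` and others that need DNS lookups. The function name and the example record are made up.

```python
import ipaddress

QUALIFIERS = {"+": "pass", "-": "fail", "~": "softfail", "?": "neutral"}


def spf_check(record, client_ip):
    """Evaluate a simplified SPF record against the connecting IP address."""
    terms = record.split()
    if not terms or terms[0].lower() != "v=spf1":
        return "none"  # not an SPF record at all
    ip = ipaddress.ip_address(client_ip)
    for term in terms[1:]:
        qualifier = "+"  # the default qualifier when none is written
        if term[0] in QUALIFIERS:
            qualifier, term = term[0], term[1:]
        name, _, value = term.partition(":")
        name = name.lower()
        if name == "all":
            return QUALIFIERS[qualifier]
        if name in ("ip4", "ip6") and value:
            network = ipaddress.ip_network(value, strict=False)
            if ip.version == network.version and ip in network:
                return QUALIFIERS[qualifier]
        # Mechanisms that need DNS (a, mx, include, ...) are skipped in this sketch.
    return "neutral"  # nothing matched
```

    With a record like `v=spf1 ip4:192.0.2.0/24 -all`, mail from 192.0.2.10 passes and mail from anywhere else fails, and the check can happen as soon as the envelope sender is known, before the message body is accepted.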

  53. Parse Carefully by timotten · · Score: 1

    They're talking about IBM's

    (((Anti-Spam) Filtering) Research) Project

    This is not the same as the

    ((Anti-(Spam Filtering)) Research) Project

    Nor is it the

    (Anti-((Spam Filtering) Research)) Project

    I'm not sure, but I think the last two are run by AT&T.

  54. Serious methodological flaws by YU+Nicks+NE+Way · · Score: 3, Insightful

    It sounds like a great paper until you get down into the guts of their materials and methods. They trained their system on half of their total data, and did not then test on separate data. That captures the two classic no-nos of data driven techniques: they inflate their results by including their training data in the results, and, worse, their training data comprises a larger sample of their total data than would be seen in the real world.

    The first of these calls their sensitivity result into question. If they classify their training data perfectly, then the 4.4% false negative rate they quote needs to be doubled to 8.8% -- almost one false negative in every eleven messages scanned.

    The second of these calls their false positive rate into question: training with an unrealistically thorough set leads to better categorization, ceteris paribus. They need to show the trend with a variety of different training set sizes to support any claims about performance.
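
    The doubling is just arithmetic, spelled out below. It assumes every training message is classified correctly and that the training half contains the same proportion of spam as the rest; the function name is made up.

```python
def unseen_false_negative_rate(reported_rate, train_fraction):
    """False-negative rate on unseen spam when the reported rate also counts
    training messages that were all classified correctly."""
    # All misses come from the unseen fraction, so rescale by its size.
    return reported_rate / (1.0 - train_fraction)


rate = unseen_false_negative_rate(0.044, 0.5)
print(f"{rate:.1%} on unseen spam, about one miss in {1 / rate:.1f} messages")
```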

    This sounds like a fully buzzword compliant non-result to me.

  55. Re:It is difficult to beat statistical spam filter by martin-boundary · · Score: 1
    That's slightly incorrect. It depends on the filter algorithm used.

    Some statistical algorithms only pick a small number of tokens according to some rationale or other (e.g. most extreme scores). For such algorithms, the padding attack is a very good idea, as with enough random words, one or more of these should have a sufficiently extreme score (so that it replaces a more legitimate token in the list of considered tokens), although whether an extreme score can be synthesised randomly would depend on the computation of token scores.

    Algorithms in such tools as popfile, ifile, dbacl, crm114 use all the tokens, which ought to have the advantage of making the (incorrect) token distribution of the extra padding words stand out when applying whatever likelihood function is used.
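
    To illustrate the first kind of filter, here is a sketch that keeps only the few tokens whose scores are furthest from 0.5 and combines them naive-Bayes style. The token probabilities, the default of 0.4 for unknown tokens, and the `keep=3` cutoff are invented for the example and are not taken from any particular tool.

```python
import math


def combine(probs):
    """Combine per-token spam probabilities into one score (naive Bayes style)."""
    log_spam = sum(math.log(p) for p in probs)
    log_ham = sum(math.log(1.0 - p) for p in probs)
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))


def most_extreme(tokens, token_prob, keep=3, unknown=0.4):
    """Return the `keep` tokens whose probabilities are furthest from 0.5."""
    ranked = sorted(tokens, key=lambda t: abs(token_prob.get(t, unknown) - 0.5), reverse=True)
    return ranked[:keep]


def score_extreme(tokens, token_prob, keep=3, unknown=0.4):
    """Score a message using only its most extreme tokens."""
    chosen = most_extreme(tokens, token_prob, keep, unknown)
    return combine([token_prob.get(t, unknown) for t in chosen])


# Invented per-token spam probabilities.
TOKEN_PROB = {"viagra": 0.99, "pills": 0.95, "cheap": 0.90,
              "meeting": 0.01, "agenda": 0.02, "minutes": 0.03}

spam = ["viagra", "pills", "cheap"]
padded = spam + ["meeting", "agenda", "minutes"]
```

    Here `score_extreme(spam, TOKEN_PROB)` is close to 1, while for `padded` two of the three selected tokens are padding words and the score collapses. That is the displacement effect described above; whether random padding actually contains words with scores that extreme depends on how the token scores were computed.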

  56. Use an ad blocker by Anonymous Coward · · Score: 0

    Use an ad blocker, like junkbuster or privoxy. Thanks to ad blockers I've never had to look at a Slashdot banner ad. Ever.

  57. Coming next... by tropavantgarde · · Score: 1

    "Fighting MS with human cloning technology."

    --

    --A witty sig proves nothing.--

  58. Penn and Computing by billstewart · · Score: 1

    Don't know about Teller, but Penn used to write columns for some computer rag. One I particularly remember was back when US airports were starting to freak out about laptops and insisting that people turn them on at the rent-a-cop checkpoints, and he was annoyed enough about the general harassment and interference with civil liberties to suggest that an appropriate startup screen would be one that goes "10" "9" "8" "7" etc. in big scary-looking letters.

    --

    Bill Stewart
    New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
  59. Change Economics, not laws by billstewart · · Score: 1
    The US now has Federal laws against spam, as well as a number of state laws. The CAN-SPAM law theoretically legalized some forms of spam, but in practice it had no effect - one well-reported study says that about 3% of spam made a pretense of compliance when it first came out, but it's now down to 1% or so, and I saw less effect from California's anti-spam laws. Scotty Richter's OptInRealBig made that pretense, and they're gone, but the pretense was really just to slow down the process of getting kicked off of more and more ISPs.

    CAN-SPAM was a great example of why legislation usually doesn't work - Politicians aren't technologists, and usually aren't competent economists, and even technologists have trouble coming up with solid definitions of what the problems are and what they want to do about them without having adverse side-effects. But politicians _are_ politicians, so if there are people clamoring for them to Do Something, they'll come up with Something to Do, and Do that, and at best it'll involve hiring some technologists who'll come up with something at least half-assed and not totally evil. But Politicians aren't technologists, so they can't tell if laws they make about technology are any good - the part they're good at is deciding whether the laws Look Aggressive, or Look Fair and Balanced, or Kick Asses and Take Names, or Kiss Asses and Take Campaign Contributions, or help their buddies in Homeland Security achieve other political goals, and those aren't the parts of the law that really matter much.

    Spam makes economic sense for the spammer, and until that changes, spam won't go away. You talk about Congress preferring spammers over consumers, but you're not correct - they don't care about spammers, and the reason there's spam is that there are enough consumers willing to buy "Fake Herbal Vi@gra" or "Great Mortgage Deals" that spammers can make money even though they need to send out billions of emails to people who don't want to be consumers of their products to find the few who do. Most laws don't have any effect, because they depend on police, and police are too busy fighting the War on Politically Incorrect Drugs or dealing with bad drivers or actually fighting real crime to waste their time on the hopeless and unprofitable job of catching spammers. Some proposed laws give bounties to spam recipients for successfully catching spammers, and allow them to use mechanisms like small claims court instead of criminal prosecution, and that's more likely to have some effect.

    But fundamentally, until you change the economics, you won't get rid of spam. The economics include the facts that

    • sending email is very low real cost,
    • finding addresses to send the mail to is very low cost due to automation,
    • operating from anywhere in the world is relatively low cost,
    • some countries have low-cost politicians who don't mind taking spammers' money even though it annoys a bunch of foreigners,
    • free or low-price email receiving is available from thousands of providers, generally funded by advertising,
    • spam is sufficiently profitable (or believed by gullible wannabe spammers to be sufficiently profitable) that spamware vendors are willing to violate Rule 2 and make products that will get around popular anti-spam defenses,
    • most people either run appallingly insecure email software on appallingly insecure operating systems or else use webmail or AOLmail from appallingly insecure operating systems, where the sources of the insecurity not only include the basic product (which is getting better), but the system administration (which is usually nonexistent), and the user behaviour (which is willing to click on random buttons to install cute screensavers with funny dancing pigs), making it easy for spam-support vendors to install spam-forwarders on millions of machines,
    • corporations can be formed for low enough prices, typically $100-500, that you can move the legal jurisdiction of your activities around and hide your money trail e
    --

    Bill Stewart
    New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
  60. If it were that easy, most ISPs would be using it by billstewart · · Score: 1
    If greylisting were such a magic bullet solution, lots more ISPs would be using it. While the most important cost of spam is the wasted time of the recipients, the most direct economic costs are to companies that provide mailboxes for users (i.e. ISPs and email outsourcers), and they'd not only love to avoid the direct costs, they'd love to have a big competitive advantage over other providers. So if it were easy to implement and worked really really well, they'd jump at it.

    That doesn't mean it's not a helpful tool - just that it's either harder to implement than it looks, or less effective than it looks, probably the former. So get to work writing greylisting tools :-)

    Of course, if greylisting were very common, spammers would try to find a way around it, but we knew this was an arms race when we started the discussion.

    --

    Bill Stewart
    New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
  61. Re:If it were that easy, most ISPs would be using by slashname3 · · Score: 1

    That is the point. It is very easy to implement. There are several versions available now that can be set up quickly. Probably the hardest problem to solve is for those with email server farms. One version, however, utilizes a database which could be accessed by multiple servers, so when the message is retried it would be able to match the entry and allow it through.

    I think one of the reasons ISPs have not jumped at this is that they do not perceive a cost benefit. It will cost them up front to get it configured and tested, it won't make any money for them. As such they are happy with the status quo. Besides they are most likely selling address lists to spammers and making a profit doing so. Implementing effective spam blocking tools is not really in their best interest for this quarter.

  62. e-postage by peter303 · · Score: 1

    A penny or tenth of a cent would be unnoticeable to the average email user, but would break the spammer's bank.
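
    Back-of-the-envelope version, at the tenth-of-a-cent price. The message volumes are guesses picked for illustration (20 a day for an ordinary user, a million a day for a spammer), and the function name is made up.

```python
def yearly_postage_dollars(messages_per_day, tenths_of_a_cent_per_message):
    """Yearly postage in dollars, with the price given in tenths of a cent."""
    # 1000 tenths of a cent make one dollar.
    return messages_per_day * 365 * tenths_of_a_cent_per_message / 1000


user_cost = yearly_postage_dollars(20, 1)            # ordinary user
spammer_cost = yearly_postage_dollars(1_000_000, 1)  # bulk sender
print(f"user: ${user_cost:.2f}/year, spammer: ${spammer_cost:,.2f}/year")
```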

  63. Way too complex by argent · · Score: 1

    Right now, just requiring a keyword on your subject line is more than enough protection to effectively block all spam that's not forged from your whitelisted addresses.

    Yes, spammers do successfully guess whitelisted addresses, by stealing people's address books and mailboxes through viruses and guessing that if you're in someone's address book or they've got mail from you then you're whitelisted from them.

    So, it's an effective filtering mechanism for now, but eventually you'll have to require something better than whitelisting your contacts and making it hard for everyone else... and almost any precaution that requires a human in the loop is enough to deter most spammers.
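
    The whole scheme fits in a few lines. This is only a sketch: the function name, the addresses and the keyword are placeholders, and it trusts the From address as given, which is exactly the weakness described above.

```python
def accept(sender, subject, whitelist, keyword):
    """Accept mail from whitelisted senders, or from anyone who put the keyword in the subject."""
    if sender.lower() in whitelist:
        return True  # note: a forged whitelisted address gets through here
    return keyword.lower() in subject.lower()


WHITELIST = {"friend@example.org"}  # placeholder address
KEYWORD = "wombat"                  # placeholder keyword published on a contact page
```

    So `accept("stranger@example.net", "hello [wombat]", WHITELIST, KEYWORD)` is accepted, the same message without the keyword is not, and anything claiming to be from friend@example.org is accepted regardless.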

    Bottom line is, filtering is just adapting to spam. To fight it, you have to cost them real money.

  64. Re:If it were that easy, most ISPs would be using by billstewart · · Score: 1
    For most email mailbox specialist companies, any spam protection that doesn't require them to triple their capital investment really does pay off this quarter, because it lets them save a lot of bandwidth and email server hardware, since 70-80% of their traffic is spam that their customers don't want, and being able to offer "less spam" is a strong selling point (if they're any good at it in practice). Greylisting and rejecting SMTP based on IP address or envelope let them avoid receiving email bodies, so that's a big win.

    For ISPs that are mainly bandwidth sellers, for whom email is just a small sideline, it's a different case, because they're carrying a lot more bandwidth from users doing web browsing than from real or spam email, so it may or may not be at the top of their priority list (unlike virus protection, which can prevent really big spikes in traffic depending on the worm; e.g. Slammer was really big, but most of the Outlook-hoax-of-the-week mails aren't that heavy traffic). Also, how high up the priority list a problem is at some ISPs depends on whether they're charging flat-rate or usage-based for traffic.

    --

    Bill Stewart
    New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks