Slashdot Mirror


DSPAM v2.10 Released

Nuclear Elephant writes "DSPAM v2.10 is finally available, after four months of development. This is the first stable release to include Bayesian Noise Reduction which was recently mentioned on Slashdot and in Wired News as an algorithm providing accuracy levels as high as 10x that of a human. Some other new features include Neural Networking - which finds nodes in a network that are contextually similar to form a decision matrix, Global Filtering - which provides SpamAssassin-like out-of-the-box type filtering for new users until they build up their own wordlist, Automatic Whitelisting - which automatically learns who your trusted senders are, and many other optimizations and enhancements. Head on over and download the latest tar ball."

16 of 234 comments

  1. Details. by Anonymous Coward · · Score: 5, Informative

    Introduction

    DSPAM (as in De-Spam) is an extremely scalable, open-source statistical-algorithmic hybrid anti-spam filter. A majority of users running v2.10+ achieve filtering rates ranging from 99.92% to 99.98+%. DSPAM is currently effective as both a server-side agent for UNIX email servers and a developer's library for mail clients, other anti-spam tools, and similar projects requiring drop-in spam filtering. DSPAM has been implemented on many large and small scale systems, with the largest systems being reported at about 125,000 mailboxes.

    What is a Statistical-Algorithmic Hybrid Filter?
    Present-day language classifiers bear the responsibility of maintaining accuracy in the midst of ever-increasing sample complexity. In the setting of spam filtering, many types of intentional attacks have been introduced, such as obfuscation, word list injection, sample flooding, and so on. As the complexity of classification text continues to multiply rapidly, many filter developers today are torn between increasing the complexity of their filter and wise teachings from CS class reminding them that computer science is about controlling complexity, not creating it. At the rate complexity is rising, filters will (and have already begun to) become so resource-intensive that they lose scalability, eventually leading to a second conflict of interests: where fighting spam becomes more expensive than managing it.

    DSPAM is the first Statistical-Algorithmic Hybrid filter, and as such it boldly suggests that there is a better alternative to increasing the feature set of filters to match the spams they are trying to fight. By employing algorithms designed to increase the quality of existing data rather than the quantity of data, with the goal of reducing the feature set rather than increasing it, DSPAM has managed to achieve levels of accuracy nearly equal to present-day Markovian-based filters and other types of filters that employ large feature sets, with the added benefit of using significantly fewer resources. DSPAM presently peaks at 99.984% accuracy, which is ten times more accurate than a human being [1], and is presently being used on implementations as large as 125,000+ mailboxes.

    DSPAM's Focus
    The DSPAM project attempts to go beyond "just another statistical filter" by focusing on the following areas:

    * DSPAM has a strong focus on providing better data to already existing algorithms (Bayesian, Chi-Square, etcetera). Combination algorithms work inherently well, but depend on the quality of data. Some of the approaches deployed in DSPAM towards this goal include Chained Tokens, Inoculation Groups, Classification Groups, advanced de-obfuscation techniques, and a new noise reduction algorithm called Bayesian Noise Reduction. The goal is to incorporate processing algorithms that can withstand the long haul of ever-increasing message complexity. So far we're doing a great job.
    * A strong focus on large-scale implementation support. The largest implementation of DSPAM we've heard about to-date involves 125,000 users. DSPAM has been designed to experience a very short execution time (0.03s - 0.10s on average hardware), and has been equipped with a storage driver API allowing several different storage mechanisms to be used. Depending on disk space constraints, accuracy can be traded off for additional disk space or vice-versa.
    * Empty Corpus Support and Global Dictionary Support. It is very important in a large-scale environment to allow users to build their own dictionaries starting from scratch. Why? Because system administrators haven't got the time to create 20,000 seeded dictionaries. On top of this, ISPs require out-of-the-box filtering, which DSPAM's global dictionary feature provides for end-users, with minimal centralized learning. DSPAM provides support for building corpuses from scratch without suffering many fatal training errors (false positives). When these two approaches are combined, we end up with instant-filtering for all u

  2. I wonder if this will catch what Mozilla misses by wmspringer · · Score: 4, Informative

    Right now the only spam getting through my Mozilla filter is stuff that starts with one or two unrelated sentences, then goes into the advertising with any spam-type words (viagra, etc) horribly misspelled.

    1. Re:I wonder if this will catch what Mozilla misses by reaper20 · · Score: 4, Informative

      Thunderbird's latest builds have an improved spam filter using some ideas from SpamBayes; it's substantially improved from the older filter.

  3. Re:What's DSPAM? by wintahmoot · · Score: 4, Informative

    From what I can tell, DSPAM plugs into your MTA as a local delivery agent, very much like SpamAssassin does.

    I couldn't see any platform requirements on their site, but here's what they say about MTA compatibility:

    DSPAM works great with Sendmail, Postfix, Qmail, Courier, and Exim, and should work well with any other MTA that supports an external local delivery agent.

    Hope that answers your questions :P

  4. Re:I still prefer tougher email security by Enahs · · Score: 3, Informative
    --
    Stating on Slashdot that I like cheese since 1997.
  5. Re:Cool! by Monx · · Score: 5, Informative

    IIRC, the "10x better" means 10x lower failure rate. The wording almost seems meant to deceive. The idea is that if you misidentify 10 messages out of 100, the filter would only misidentify 1. Since you made 10x as many mistakes, the filter was 10x as accurate as you were.

  6. Re:Cool! by Anonymous Coward · · Score: 1, Informative

    Whoops, moderated you incorrectly. Meant to mark it funny, but it came out flamebait. Hopefully that will get reversed by my posting here.

  7. Re:More accurate than a human? by asavage · · Score: 2, Informative

    Yes, it can. A human can be 100% accurate when dealing with only a few emails, but when you are dealing with tens or hundreds you will sometimes make mistakes.

  8. Here's where "10x as accurate as human" comes from by Gldm · · Score: 4, Informative
    If you check the footnotes on the DSPAM page, it says "According to a study by Bill Yerazunis of CRM114."

    If you then check the link to CRM114's project, you'll find this: "I measured my own accuracy to be around 99.84%, by classifying the same set of 3000ish messages twice over a period of about a week, reading each message from the top until I feel "confident" of the message status, (one message per screen unless I want more than one screen to decide on a message.) and doing the classification in small batches with plenty of breaks and other office tasks to avoid fatigue. Then I diff()ed the two passes to generate a result. Assuming I never duplicate the same mistake, I, as an unassisted human, under nearly optimal conditions, am 99.84% accurate.)."

    Given the number of people who even read the article on Slashdot, I doubt anyone else is going to check the tiny [1] footnote and find this.
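To make the arithmetic behind the "10x" claim concrete, here is a minimal sketch in Python. It only uses the two figures quoted in this thread (99.84% for the unassisted human, 99.984% as DSPAM's stated peak); the function names and the 3000-message batch size are illustrative assumptions, not anything from DSPAM or CRM114.

```python
def error_rate(accuracy_percent):
    """Convert an accuracy percentage into a fractional error rate."""
    return (100.0 - accuracy_percent) / 100.0


def expected_mistakes(accuracy_percent, messages):
    """Expected number of misclassified messages at a given accuracy."""
    return error_rate(accuracy_percent) * messages


# Figures quoted in the thread; everything else here is illustrative.
HUMAN_ACCURACY = 99.84    # Yerazunis's self-measured accuracy
DSPAM_ACCURACY = 99.984   # DSPAM's stated peak accuracy

# "10x as accurate" is a statement about error rates, not accuracy itself.
ratio = error_rate(HUMAN_ACCURACY) / error_rate(DSPAM_ACCURACY)

if __name__ == "__main__":
    batch = 3000  # assumed batch size, roughly the "3000ish" in the quote
    print(f"human mistakes per {batch}: {expected_mistakes(HUMAN_ACCURACY, batch):.2f}")
    print(f"DSPAM mistakes per {batch}: {expected_mistakes(DSPAM_ACCURACY, batch):.2f}")
    print(f"error-rate ratio: {ratio:.1f}")
```

The accuracies themselves differ by only 0.144 percentage points; the factor of ten appears only when you compare the error rates (0.16% against 0.016%).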

    --

    Introducing the new Occam Fusion! Now with sqrt(-1) fewer blades!

  9. Re:Works great with Qmail? Oh really now? by 7Ghent · · Score: 2, Informative

    Easy, just set up a .qmail file in each virtual account's home dir that contains

    |/usr/local/bin/dspam --user $EXT@$HOST -d $EXT@$HOST

  10. Re:Take it one step further; share what you filter by Anonymous Coward · · Score: 1, Informative

    According to the DSPAM website, there is another project called the SBL (Streamlined Blackhole List) which is similar to what you're talking about, only appears to be more real-time than the WPBL. DSPAM seems to explicitly support this.

  11. Re:Umm... what's the definition of spam? by Snowmit · · Score: 3, Informative

    Is this to say I can't tell when I'm being spammed?

    Leaving aside the part where you barely avoid the paranoid rantings of a madman, yes, there are times when you can't tell if you're being spammed. Like, how many times have you accidentally deleted an email that you thought was spam but was really from a long-lost friend? Or how many times have you opened Spam because you weren't sure that it was Spam or something from your ISP (or whatever).

    Say you've done it 10 times in 10 000 messages. If this program only did it once in 10 000 messages (false positive or false negative) then it was 10x as accurate as you.

    --
    I have a lot of opinions about Cyborgs and Architects
  12. Re:Umm... what's the definition of spam? by kryptkpr · · Score: 2, Informative

    Didn't look very hard did you?

    Tools, Options, Security, uncheck "Do not Allow attachments to be Opened that could potentially contain a virus".

    --
    DJ kRYPT's Free MP3s!
  13. Re:More accurate than a human? by jmv · · Score: 2, Informative

    Most likely, it'll make fewer errors than the number of mistakes you're going to make because you're flooded in spam. Given a mailbox with 1000 spam and 1000 ham, I'm pretty sure I'll mess up a couple times while trying to delete only the spam.

  14. Certified SMTP Hosts. by eluusive · · Score: 3, Informative

    What would work well is SSL certified SMTP relays. If every valid SMTP relay needed an SSL certificate, then a relay that sent spam could easily have its certificate rejected, and hosts that didn't have one at all could just be dropped.

    SSL certificates are costly, and that keeps many people from having one. However, there is no reason the Open Source community could not make up our own root certificate and run an SMTP SSL certificate signing organization, where we verify the authenticity of someone before we give them a cert, for a small fee to cover costs. It wouldn't be like we'd have to convince Netscape, Microsoft, Apple and whoever else makes a browser to include the cert. It'd just need to be available for people hosting servers to download.
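A rough sketch of how a receiving mail server might apply that policy, using Python's standard `ssl` module. This is an illustration under stated assumptions, not an existing system: the community root CA file, the revoked-serial set, and the names `build_relay_context` and `relay_decision` are all hypothetical. A real deployment would also need to load the server's own certificate and handle revocation distribution.

```python
import ssl


def build_relay_context(community_ca_file=None):
    """Server-side TLS context that demands a client certificate.

    community_ca_file is a hypothetical path to the community root
    certificate; when given, only certs chaining to it are trusted.
    """
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.verify_mode = ssl.CERT_REQUIRED  # relays without a cert fail the handshake
    if community_ca_file is not None:
        ctx.load_verify_locations(cafile=community_ca_file)
    return ctx


def relay_decision(peer_cert, revoked_serials):
    """Decide what to do with a connecting relay.

    peer_cert is the dict returned by SSLSocket.getpeercert(), or
    None/empty when the relay presented no certificate.
    revoked_serials is a hypothetical set of serial numbers pulled
    from relays that were caught sending spam.
    """
    if not peer_cert:
        return "drop"      # no certificate at all
    if peer_cert.get("serialNumber") in revoked_serials:
        return "reject"    # certificate revoked for spamming
    return "accept"


if __name__ == "__main__":
    revoked = {"0A1B2C"}  # made-up serial number
    print(relay_decision(None, revoked))
    print(relay_decision({"serialNumber": "0A1B2C"}, revoked))
    print(relay_decision({"serialNumber": "FFEE01"}, revoked))
```

The decision function is kept separate from the TLS setup so the reject/drop policy can be changed (or softened during a transition period) without touching the handshake configuration.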

    Yes, this would mean rejecting massive amounts of email to begin with. Maybe some interim solution could be thought of as people move over to it?

    Ideas? Comments?

  15. Re:Take it one step further; share what you filter by Anonymous Coward · · Score: 2, Informative

    AFAIK, both the SBL and the WPBL only allow list writes from trusted users with accounts.