DSPAM v2.10 Released


Posted by michael on Saturday March 13, 2004 @04:04PM from the self-promotion-is-the-best-kind dept.

Nuclear Elephant writes "DSPAM v2.10 is finally available, after four months of development. This is the first stable release to include Bayesian Noise Reduction, which was recently mentioned on Slashdot and in Wired News as an algorithm providing accuracy levels as high as 10x that of a human. Some other new features include Neural Networking - which finds nodes in a network that are contextually similar to form a decision matrix, Global Filtering - which provides SpamAssassin-like out-of-the-box filtering for new users until they build up their own wordlist, Automatic Whitelisting - which automatically learns who your trusted senders are, and many other optimizations and enhancements. Head on over and download the latest tarball."

Cool! by Anonymous Coward · 2004-03-13 16:05 · Score: 5, Funny

I've always wanted a spam filter with 1000% accuracy!
1. Re:Cool! by Monx · 2004-03-13 16:31 · Score: 5, Informative
  
  IIRC, the "10x better" means 10x lower failure rate. The wording almost seems meant to deceive. The idea is that if you misidentify 10 messages out of 100, the filter would only misidentify 1. Since you made 10x as many mistakes, the filter was 10x as accurate as you were.
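The arithmetic behind that reading can be sketched in a few lines of Python. The function names are made up for illustration, and the 10-in-100 and 1-in-100 figures are the hypothetical numbers from the comment above, not measurements.

```python
from fractions import Fraction


def error_rate(mistakes, messages):
    """Fraction of messages that were misidentified."""
    if messages <= 0:
        raise ValueError("messages must be positive")
    return Fraction(mistakes, messages)


def error_ratio(human_mistakes, filter_mistakes, messages):
    """How many times more mistakes the human makes than the filter."""
    if filter_mistakes == 0:
        raise ValueError("ratio is undefined when the filter makes no mistakes")
    return error_rate(human_mistakes, messages) / error_rate(filter_mistakes, messages)


# Hypothetical numbers from the comment: 10 mistakes vs 1 mistake per 100 messages.
failure_multiple = error_ratio(10, 1, 100)

# Comparing the accuracies themselves (99% vs 90%) gives a far less dramatic figure.
accuracy_multiple = (1 - error_rate(1, 100)) / (1 - error_rate(10, 100))

print(failure_multiple)   # 10
print(accuracy_multiple)  # 11/10
```

Comparing failure rates gives 10x, while comparing the accuracies directly gives only 1.1x, which is why the "10x as accurate" wording reads as misleading.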
The real problem by Anonymous Coward · 2004-03-13 16:08 · Score: 4, Insightful

The real problem is people who actually buy this stuff. If no one was buying things from spam, no one would send spam. We all know this.

I propose we start spamming. Anyone who responds gets a nice l'il pistol whipping and is returned to their computer. After the first news report, people will be afraid to respond to spam.
1. Re:The real problem by www.fuckingdie.com · 2004-03-13 16:20 · Score: 5, Funny
  
  Is there somewhere that I can sign up to be a pistol whipper?
  
  --
  That really is my homepage, no kidding.
2. Re:The real problem by kramer · 2004-03-13 16:37 · Score: 5, Insightful
  
  I think the best answer to the 'If nobody would buy this stuff...' argument was:
  
  Spam works on the level of 1 in 10,000. The general population contains a far higher rate of mental illness, senility, and retardation.
  
  You'll never cure spam by 'education' of any sort. There are some people who are just too crazy or too stupid to learn.
3. Re:The real problem by Anonymous Coward · 2004-03-13 17:36 · Score: 4, Insightful
  
  All these suggestions make the naive assumption that people in general learn from past mistakes.
4. Re:The real problem by r_glen · 2004-03-13 17:46 · Score: 5, Funny
  
  But I thought they were the spammers.
Details. by Anonymous Coward · 2004-03-13 16:09 · Score: 5, Informative

Introduction

DSPAM (as in De-Spam) is an extremely scalable, open-source statistical-algorithmic hybrid anti-spam filter. A majority of users running v2.10+ achieve filtering rates ranging from 99.92% - 99.98+%. DSPAM is currently effective as both a server-side agent for UNIX email servers and a developer's library for mail clients, other anti-spam tools, and similar projects requiring drop-in spam filtering. DSPAM has been implemented on many large and small scale systems, with the largest systems being reported at about 125,000 mailboxes.

What is a Statistical-Algorithmic Hybrid Filter?
Present-day language classifiers bear the responsibility of maintaining accuracy in the midst of ever-increasing sample complexity. In the setting of spam filtering, many types of intentional attacks have been introduced, such as obfuscation, word list injection, sample flooding, and so on. As the complexity of classification text continues to multiply rapidly, many filter developers today are left with conflicted feelings between increasing the complexity of their filter and wise teachings from CS class reminding them that computer science is about controlling complexity, not creating it. At the rate complexity is rising, filters will (and have already begun to) become so resource-intensive that they lose scalability, eventually leading to a second conflict of interest, where fighting spam becomes more expensive than managing it.

DSPAM is the first Statistical-Algorithmic Hybrid filter, and in being such boldly suggests that there is a better alternative to increasing the feature set of filters to match the spams they are trying to fight. By employing algorithms designed to increase the quality of existing data rather than the quantity of data, with the goal of reducing the feature set rather than increasing it, DSPAM has managed to achieve nearly equal levels of accuracy to present-day Markovian-based filters and other types of filters that employ large feature sets, with the added benefit of using significantly fewer resources. DSPAM presently peaks at 99.984% accuracy, which is ten times more accurate than a human being [1], and is presently being used on implementations as large as 125,000+ mailboxes.

DSPAM's Focus
The DSPAM project attempts to go beyond "just another statistical filter" by focusing on the following areas:

* DSPAM has a strong focus on providing better data to already existing algorithms (Bayesian, Chi-Square, etcetera). Combination algorithms work inherently well, but depend on the quality of data. Some of the approaches deployed in DSPAM towards this goal include Chained Tokens, Inoculation Groups, Classification Groups, advanced de-obfuscation techniques, and a new noise reduction algorithm called Bayesian Noise Reduction. The goal is to incorporate processing algorithms that can withstand the long haul of ever-increasing message complexity. So far we're doing a great job.
* A strong focus on large-scale implementation support. The largest implementation of DSPAM we've heard about to date involves 125,000 users. DSPAM has been designed to have a very short execution time (0.03s - 0.10s on average hardware), and has been equipped with a storage driver API allowing several different storage mechanisms to be used. Depending on disk space constraints, accuracy can be traded off for additional disk space or vice-versa.
* Empty Corpus Support and Global Dictionary Support. It is very important in a large-scale environment to allow users to build their own dictionaries starting from scratch. Why? Because system administrators haven't got the time to create 20,000 seeded dictionaries. On top of this, ISPs require out-of-the-box filtering, which DSPAM's global dictionary feature provides for end-users, with minimal centralized learning. DSPAM provides support for building corpuses from scratch without suffering many fatal training errors (false positives). When these two approaches are combined, we end up with instant-filtering for all u
I wonder if this will catch what Mozilla misses by wmspringer · 2004-03-13 16:11 · Score: 4, Informative

Right now the only spam getting through my Mozilla filter is stuff that starts with one or two unrelated sentences, then goes into the advertising with any spam-type words (viagra, etc) horribly misspelled.

--
Twenties Retirement
1. Re:I wonder if this will catch what Mozilla misses by reaper20 · 2004-03-13 16:38 · Score: 4, Informative
  
  Thunderbird's latest builds have an improved spam filter using some ideas from SpamBayes; it's substantially improved from the older filter.
Re:What's DSPAM? by wintahmoot · 2004-03-13 16:12 · Score: 4, Informative

From what I can tell, DSPAM plugs into your MTA as a local delivery agent, very much like SpamAssassin does.

I couldn't see any platform requirements on their site, but here's what they say about MTA compatibility:

DSPAM works great with Sendmail, Postfix, Qmail, Courier, and Exim, and should work well with any other MTA that supports an external local delivery agent.

Hope that answers your questions :P

--
Martin May
funny faq by adamruck · 2004-03-13 16:12 · Score: 4, Funny

this is from the faq...

In real-world scenarios, false positives have ranged anywhere from 0% (none) to 0.10% depending on both implementation and user's mail behavior. Users with relatively predictable mail behavior (such as geeks, dweebs, and freaks) have generally received very few false positives (less than 1 in 10,000 messages).

--
Selling software wont make you money, selling a service will.
1. Re:funny faq by Feztaa · 2004-03-13 19:30 · Score: 4, Funny
  
  Users with relatively predictable mail behavior (such as geeks, dweebs, and freaks) have generally received very few false positives
  
  What about losers, dorks, and morons? Are they cursed with a high rate of false positives?
I still prefer tougher email security by NanoGator · 2004-03-13 16:15 · Score: 4, Insightful

This may work for a little while, but the creative peeps will find a way around it.

I say forget the filtering shit and force email to evolve. Part of the reason that spam happens is that there is no real authentication going on. No requesting permission to be on your white list. No real strong way to block anybody you don't want to hear from. No real way to verify the sender is legit. etc.

I don't claim to have all the answers, but I do know that I've been using ICQ for years and haven't seen a Spam from there since I turned on the 'require authorization' feature.

--
"Derp de derp."
1. Re:I still prefer tougher email security by tftp · 2004-03-13 17:38 · Score: 4, Interesting
  
  Evolution of email is difficult even in theory.
  The authentication is useless even if implemented - you want to receive email from strangers, that's what all businesses are doing. If you are not one of them and only converse with your buddies, make a whitelist and be done - no spammer will guess your friends' emails.
  Permissions to send email are also troublesome. If they are automated, then spam robots will be written to ask for permission first. If they are not automated... but how would you know if some random "John X. Frisby" <jfrisby@big.provider.net> is really who he says he is, and that the matter he wants to discuss with you is a bug in your Loafizer 0.99 script for your bread making machine and not a placebo enlargement pill? Additionally, permissions delay the mail exchange, which is bad for business.
  There are ways to block anyone you don't want, and all other senders are legit (until they spam you, that is.)
  So the problem is quite different, as you can see. There is a free channel of marketing, and spammers will be using it as long as it remains a) free and b) a channel. Remove either one of those two, and they will close up shop.
CRM114 Discriminator works better for me by Anonymous Coward · 2004-03-13 16:15 · Score: 5, Interesting

I tried several incarnations of dspam over a period of about 6 months. It was a pain in the butt to install, required a massive amount of training, and required you run a web server in order to have the point and click training capability.

I eventually gave up and tried the CRM114 Discriminator:

http://crm114.sourceforge.net/

It was MUCH easier to install, MUCH easier to maintain, and has the same or better level of accuracy. I used to get 100+ spam messages a day and now I'll get maybe 1 or 2 a week that sneak through (after only a few weeks of training on errors only).
Preventing Victims of Spam by www.fuckingdie.com · 2004-03-13 16:30 · Score: 4, Funny

Computer manufacturers will begin including a Hammer type device into PCs beginning immediately. This device will, when its associated software detects a user attempting to sign up for free porn, hammer the user to death.
Computer manufacturers are also investigating whether this device will be able to deal with the so-called "Stupid User Problem" which plagues so many IT professionals world wide.

--
That really is my homepage, no kidding.
Bayesian Unsupervised Learning by VoidEngineer · 2004-03-13 16:31 · Score: 5, Interesting

FYI, modern MRI scanners use Bayesian noise reduction during image processing. I used to work in an MRI research laboratory, and our director had pioneered the application of Bayesian noise-filtering algorithms in post-processing of image data.

Oddly enough, our director of research was a notoriously difficult person to schedule a meeting with. Makes me wonder about 'unsupervised learning'...
More accurate than a human? by Percent+Man · 2004-03-13 16:42 · Score: 4, Funny

accuracy levels as high as 10x that of a human...

So, let me get this straight - my spam filter will know better than I do which emails I want to read, and which ones I don't?
"No, trust me man, you really want a bigger johnson. Read it!"
Umm... what's the definition of spam? by michaelmalak · 2004-03-13 16:43 · Score: 4, Interesting

algorithm providing accuracy levels as high as 10x that of a human
Is this to say I can't tell when I'm being spammed? I thought the ultimate definition of spam was mail unwanted by a person. How can a computer decide a piece of mail is bad for a person if that person really wanted it? One could digress way off with this on Asimov's Laws and the politics of Socialism/Fascism vs. Libertarianism (that e-mail is just no good for you, you oughtn't read it).
Take it one step further; share what you filter by bigberk · 2004-03-13 16:44 · Score: 5, Interesting

DSPAM is one of these statistical filters (like spamprobe and CRM114) that can perform virtually perfect filtering of spam/non-spam you receive.

Now that you are free of spam yourself, may I suggest that you take it one step further and share your data with the anti-spam community; the WPBL project lets many users report the IPs sending them spam and non-spam in realtime using a couple simple scripts installed in procmail.

Our central database then publishes a real-time list of spam sources (the IP blocklist). Unlike spamcop, WPBL is entirely based upon automatic decisions made by statistical filters, 24/7. The resulting blocklist is already used by many ISPs; and you can also use it to block spamming IPs at your own server.
Here's where "10x as accurate as human" comes from by Gldm · 2004-03-13 17:13 · Score: 4, Informative

If you check the footnotes on the DSPAM page, it says "According to a study by Bill Yerazunis of CRM114."
If you then check the link to CRM114's project, you'll find this: "I measured my own accuracy to be around 99.84%, by classifying the same set of 3000ish messages twice over a period of about a week, reading each message from the top until I feel "confident" of the message status, (one message per screen unless I want more than one screen to decide on a message.) and doing the classification in small batches with plenty of breaks and other office tasks to avoid fatigue. Then I diff()ed the two passes to generate a result. Assuming I never duplicate the same mistake, I, as an unassisted human, under nearly optimal conditions, am 99.84% accurate.)."
Given the number of people who even read the article on Slashdot, I doubt anyone else is going to check the tiny [1] footnote and find this.
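For anyone who wants to try the same self-measurement, here is a rough Python sketch of one reading of the method in the quote: label the same messages in two passes, count the disagreements, and assume each disagreement is exactly one error in one of the passes (the "never duplicate the same mistake" assumption). The function name and the sample labels are made up, and this is not claimed to be the exact calculation behind the 99.84% figure.

```python
def estimate_accuracy(pass_one, pass_two):
    """Estimate per-decision accuracy from two labelings of the same messages.

    Assumes the same mistake is never made in both passes, so every
    disagreement corresponds to exactly one wrong decision out of the
    2 * N decisions made in total.
    """
    if len(pass_one) != len(pass_two):
        raise ValueError("both passes must label the same messages")
    if not pass_one:
        raise ValueError("need at least one message")
    disagreements = sum(1 for a, b in zip(pass_one, pass_two) if a != b)
    total_decisions = 2 * len(pass_one)
    return 1.0 - disagreements / total_decisions


# Toy data: four messages, one of which was labeled differently the second time.
first = ["spam", "ham", "spam", "ham"]
second = ["spam", "ham", "ham", "ham"]
print(estimate_accuracy(first, second))  # 0.875
```

Note that any mistake repeated in both passes goes uncounted, so the estimate is an upper bound on the real accuracy.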

--
Introducing the new Occam Fusion! Now with sqrt(-1) fewer blades!
Put this into Slashcode? heh by dsanfte · 2004-03-13 17:22 · Score: 4, Insightful

By the looks of the Intel story below, Slashdot sure needs a good Bayesian spam filter. I recommend this. Or a baseball bat. Because you can go over to anti-slash and really pound some skulls with a baseball bat, and it would probably be more satisfying. But filters are good too, don't get me wrong.

--
occultae nullus est respectus musicae - originally a Greek proverb
Bah... by Pig+Hogger · 2004-03-13 17:24 · Score: 4, Interesting

It's STILL just an "automated press-deleter".
No matter what technology it uses, neural nets, b-trees, recursion, tinkertoy logic, smell-emitting diode, leaky junction zener transistor, steam-powered aeolipiles, it only automagically presses delete, which is a pretty lame way of fighting spam.
It's a lame way of fighting spam, because we STILL have to pay for the fucking spam bandwidth; we STILL have to pay for the goddammed disk space used by the spam; we STILL have to pay for the bloody time lost transmitting the spam; we STILL have to pay for the extra ISP infrastructure to carry those spams.
Naaah. Spammers should be eradicated from the Internet, and the best way to do so is to completely BLOCK networks who host spammers (no matter what service), in order to force the collateral damage to whine to the ISP or simply vote with their feet.
Explained in the last DSPAM /. story by devphil · 2004-03-13 17:34 · Score: 4, Insightful

except that my article history is truncated in a futile attempt to get me to subscribe. So I can't point to the writeup I did.

The increased accuracy comes from the emails that will slip under your mental radar. You are a human, and you make mistakes. You wouldn't deliberately choose to read the email, but one day the subject line looks plausible, and so you bring it up. Three-quarters of a second later, you're glaring at the monitor and hitting "delete", but DSPAM wouldn't have let that slip by in the first place.

--
You cannot apply a technological solution to a sociological problem. (Edwards' Law)
Combating SPAM is easy, if you have the technology by Avlimator · 2004-03-13 17:46 · Score: 5, Interesting

I don't get SPAM. I don't have SPAM filters. How is this possible? Simple. I create a different e-mail address for any new untrusted entity that I have to provide one for. In the beginning I took advantage of being able to alias all e-mail for non-existent mailboxes (basically, *) at my domain to my primary account. It seemed to me an obvious and simple approach. Whenever I needed to provide an e-mail address, I just made one up, and it was forwarded to my regular Inbox. In my opinion, at that time my ISP was more "sophisticated" than most.

Since then I have moved to hosting all of my domains on my own co-located server which runs Exchange 2000, thus complicating things. Now I have to actually add any new aliases that I want to use into my user account. I know of at least one product out there that can handle non-existent addresses and forward them to a specific account, but it is rather expensive for a feature that should have been built in from the beginning (although I'm not aware if the new Exchange can do this out of the box). Not to mention that someone with the proper knowledge and skills could make a similar add-on in relatively short order, but who ever has the time?

The point is that you have to consider when and where you give your e-mail address out, and the possible consequences therein. It's not altogether different from giving out your phone number (especially if you are unlisted) or even your SSN.
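The catch-all scheme described above can be sketched roughly as follows. This is only an illustration of the idea, not how any particular mail server or Exchange add-on implements it; the domain `example.com`, the primary mailbox, and both function names are hypothetical.

```python
import re


def alias_for(entity, domain="example.com"):
    """Make up a per-entity address, e.g. for a shop asking for an e-mail."""
    local = re.sub(r"[^a-z0-9]+", "-", entity.lower()).strip("-")
    return f"{local}@{domain}"


def route(recipient, primary="me@example.com", domain="example.com",
          blocked=frozenset()):
    """Catch-all delivery: any address at the domain goes to the primary
    mailbox, unless that alias has been burned and added to `blocked`.
    Returns the delivery mailbox, or None if the message is refused."""
    recipient = recipient.lower()
    _local, _, rcpt_domain = recipient.rpartition("@")
    if rcpt_domain != domain or recipient in blocked:
        return None
    return primary


shop_alias = alias_for("Some Shop, Inc.")
print(shop_alias)                               # some-shop-inc@example.com
print(route(shop_alias))                        # me@example.com
print(route(shop_alias, blocked={shop_alias}))  # None
```

If spam ever arrives at one alias, you know who leaked the address, and you can block that one alias without touching the rest.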