Spamassassin Beats CRM-114 In Anti-Spam Shootout
Simon Lyall writes "A new study of antispam software shows that SpamAssassin performed well in various configurations. SpamProbe, Bogofilter and SpamBayes also came out well, while CRM-114 failed to live up to its previous claims. The study shows: 'The best-performing filters reduced the volume of incoming spam from about 150 messages per day to about 2 messages per day.'"
CRM-114
The link in the article points to SpamBayes again.
"People that quote themselves in their signatures bother me" - athakur999
Rusty Wallace beats Earnhardt in a tricycle race.
Stupid fools. Content-based spam filtering is a waste of time. Why is Slashdot covering this crap? It's a never-ending battle of updating filters and formulas. There are fewer permutations in isolating and blacklisting every IP on the Internet than there would be to analyze e-mail content, waste server resources and masturbate.
RBLs WORK. This is why spammers are forced to use worms to invade users' machines to create proxies. As soon as the authorities wake from their slumber and start prosecuting these scumbags who break into others' machines, the whole spam thing will essentially be over. But don't tell that to the little content-based-filtering-fools. They obviously have money to burn.
the mozilla spam filter does a very good job too. when it learns enough it becomes over 95% accurate. i dropped evolution for it, and never looked back
Putting a windows cd backwards, plays evil messages, but it gets worse, putting it right, installs windows.
We need a good code of law that puts SPAM bastards in jail for a decade.
I suggest you read Slashdot
I must admit that I am not up to date on these new anti-spam software packages, which operate on the server side. However, what is the probability of real mail getting rejected by these things? It seems almost like an invasion of privacy to block my own email, even if it is from a "benevolent big brother" perspective.
I guess that is why there are privacy policies though.
aj
GroupShares Inc. - A Free and Interactive Stock Market community!
-------
artlu.net
...false positives?
Baysian, gaysian. Real men hit delete.
The best-performing filters reduced the volume of incoming spam from about 150 messages per day to about 2 messages per day.
How many false positives though?
John Kerry is a Joke!
I use Spamassassin. Surviving mail then goes through CRM-114. At least in my case, it works better than each of the filters on its own.
And a long document it is (funny placeholder images though). Here are the conclusions for the impatient but interested in a little more than the summary:
Supervised spam filters are effective tools for attenuating spam. The best-performing filters reduced the volume of incoming spam from about 150 messages per day to about 2 messages per day. The corresponding risk of mail loss, while minimal, is difficult to quantify. The best-performing filters misclassified a handful of spam messages early in the test suite; none within the second half (25,000 messages). A larger study will be necessary to distinguish the asymptotic probability of ham misclassification from zero.
Most misclassified ham messages are advertising, news digests, mailing list messages, or the results of electronic transactions. From this observation, and the fact that such messages represent a small fraction of incoming mail, we may conclude that the filters find them more difficult to classify. On the other hand, the small number of misclassifications suggests that the filter rapidly learns the characteristics of each advertiser, news service, mailing list, or on-line service from which the recipient wishes to receive messages. We might also conjecture that these misclassifications are more likely to occur soon after subscribing to the particular service (or soon after starting to use the filter), a time at which the user would be more likely to notice, should the message go astray, and retrieve it from the spam file. In contrast, the best filters misclassified no personal messages, and no delivery error messages, which comprise the largest and most critical fraction of ham.
A supervised filter contributes significantly to the effectiveness of Spamassassin's static component, as measured by both ham and spam misclassification probabilities. Two unsupervised configurations also improved the static component, but by a smaller margin. The supervised filter alone performed better than the static rules alone, but not as well as the combination of the two.
The choice of threshold parameters dominates the observed differences in performance among the four filters implementing methods derived from Graham's and Robinson's proposals. Each shows a different tradeoff between ham accuracy and spam accuracy. ROC analysis shows that the differences not accountable to threshold setting, if any, are small and observable only when the ham misclassification probability is low (i.e. hm
CRM-114 and DSPAM exhibit substantially inferior performance to the other filters, regardless of threshold setting. Both exhibit substantial learning throughout the email stream, leading us to conjecture that their performance might asymptotically approach that of the other filters. From a practical standpoint, this learning rate would be too slow for personal email filtering as it would take several years at the observed rate to achieve the same misclassification rates as the other systems. Both these systems were designed to be used in a train on error configuration, and do not self-train. This configuration could account for a slow learning rate as each system avails itself of the information in only about 1,000 of the 50,000 test messages. In an effort to ensure that we had not misinterpreted the installation instructions, we ran CRM-114 in a train-on-everything configuration and, as predicted by the author, the result was substantially worse.
Spam filter designers should incorporate interfaces making them amenable for testing and deployment in the supervised configuration (figure 4). We propose the three interface functions used in algorithm 1 - filterinit, filtereval, and filtertrain - as a standardized interface. Systems that self-train should provide an option to self-train on everything (subject to correction via filtertrain) as in algorithm 2.
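For anyone wondering what that three-function interface might look like in practice, here is a rough Python sketch. Only the names filterinit, filtereval, and filtertrain come from the study's conclusions; the `ToyFilter` class, its crude token-count scoring, and the `run_train_on_error` loop are assumptions made up for illustration, not the study's algorithm 1.

```python
from collections import Counter


class ToyFilter:
    """Hypothetical stand-in for a real filter.

    Only the three method names are taken from the study; the scoring
    below is a deliberately crude token count, not a real classifier.
    """

    def filterinit(self):
        # Start with no knowledge of either class.
        self.counts = {"ham": Counter(), "spam": Counter()}

    def filtereval(self, message):
        # Label the message by whichever class has seen its tokens more often.
        tokens = message.lower().split()
        spam_hits = sum(self.counts["spam"][t] for t in tokens)
        ham_hits = sum(self.counts["ham"][t] for t in tokens)
        return "spam" if spam_hits > ham_hits else "ham"

    def filtertrain(self, message, label):
        # Record the user's verdict for this message.
        self.counts[label].update(message.lower().split())


def run_train_on_error(filt, stream):
    """Feed (message, true_label) pairs in arrival order.

    Assumption: the user corrects the filter only when it gets a message
    wrong. Returns misclassification counts keyed by the true class, so
    ham and spam errors stay separate.
    """
    filt.filterinit()
    errors = {"ham": 0, "spam": 0}
    for message, truth in stream:
        if filt.filtereval(message) != truth:
            errors[truth] += 1
            filt.filtertrain(message, truth)
    return errors
```

Swapping in a different filter then only means providing the same three methods; the evaluation loop does not change.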
Ham and spam misclassification proportions should be reported separately. Accuracy, weighted accuracy, and precision should be avoided as primary evaluation measures as th
everything in moderation
I wonder how Mozilla Messenger/Thunderbird's spam filtering stacks up against these filters? I've heard some negative comments about the Mozilla filtering system, but it's worked wonders for me.
I have been using SpamAssassin in conjunction with Evolution and it has cut my spam to virtually nothing. I wish it were built right into Evolution so that it was a little faster, but it is worth the wait as I barely ever get any spam in my Inbox anymore. I set it up with Evolution very similarly to how it is shown here. I really like using it with Evolution, but I am curious whether anyone knows of anything that would work faster and as efficiently in conjunction with Evolution?
Is to do away with current email protocols and go with new ones with verification.
That should take care of the problems. The gov is now concentrating on this.
You don't want to face an assassin in a shootout. Maybe a pie eating contest, or a spelling bee... but not a shootout.
CRM114's best was about 80%, which lasted for a few weeks (weeks 3-5). Before and after that, it's doing well to catch 25% of the spam. I'm not sure why, but for the last month it's only been catching about 10%. When one gets through, I run it through mailfilter.crm with the learnspam switch. It'll say it's learned it, but if I have it check the spam again, it still lets it past.
I have been using spamprobe for some time, with the webfilt front-end, and I'm very pleased with the speedy spamprobe program (written in C++).
I receive approximately 10 legit emails/day and about 300 spam/day. I have only had 2 false positives overall (that's 2 out of about 100,000 total emails received) and on average only 2 spams/day slip past the filter. Now I'm testing Spambayes on one of my most spammed accounts, but it's definitely much slower than spamprobe and not more accurate as far as I can tell.
From page 24: Hidalgo suggests the use of ROC curves, originally from signal detection theory and used extensively in medical testing, as better capturing the important aspects of spam filter performance.
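To make the ROC idea concrete: sweep the spam threshold and record, at each setting, the fraction of ham misclassified and the fraction of spam misclassified. The Python below is only a sketch under assumptions of mine (the function name, the "higher score means more spam-like" convention, and the input format are not from the paper), and it assumes the input contains at least one ham and one spam message.

```python
def roc_points(scored):
    """Sweep the threshold over every observed score.

    scored: list of (score, label) pairs, label being "ham" or "spam",
    where a higher score means "more spam-like" (an assumption).
    Returns (threshold, ham_misclassified, spam_misclassified) tuples,
    with the two error rates expressed as fractions of their own class.
    """
    n_ham = sum(1 for _, label in scored if label == "ham")
    n_spam = len(scored) - n_ham
    points = []
    for threshold in sorted({score for score, _ in scored}):
        # A message is flagged as spam when its score reaches the threshold.
        ham_wrong = sum(1 for s, l in scored if l == "ham" and s >= threshold)
        spam_wrong = sum(1 for s, l in scored if l == "spam" and s < threshold)
        points.append((threshold, ham_wrong / n_ham, spam_wrong / n_spam))
    return points
```

Plotting the two error rates against each other gives one curve per filter, which lets you compare filters without first arguing about where each one's threshold should sit.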
Perhaps a distributed analysis system (similar to SETI@home) could be used to combat spam. Not only could the idle time of bazillions of CPUs be leveraged to improve "signal" analysis, but perhaps the clients could analyze local incoming mail to correlate new trends in spam originators and then share that information with all of the other clients. Then you could combine that with the genetic evolution improvements of the F1 sim-cars recently mentioned on /.
So there's the high-level idea, now you smart people go make it work. :-)
This one gang kept wanting me to join cause I'm pretty good with a bo staff.
Maybe I'm missing something, but isn't human accuracy always going to be 100%? I tell the computer what is spam, it learns. I may decide that regardless of what it thinks, this last message is OK. So aside from clicking too fast or changing your mind (which is a common thing to do), how can a filter ever be said to be better than people at deciding what people want to see?
Razor: Vipul's Razor is a collaborative spam-tracking database, which works by taking a signature of spam messages. Since spam typically operates by sending an identical message to hundreds of people, Razor short-circuits this by allowing the first person to receive a spam to add it to the database -- at which point everyone else will automatically block it.
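The general idea of a shared signature database can be sketched in a few lines of Python. To be clear, this is not Razor's actual signature scheme or protocol: the `SignatureDB` class, the whitespace and case normalization, and the use of SHA-256 are all assumptions chosen to keep the illustration short, and a real service would sit behind a network API rather than an in-memory set.

```python
import hashlib


class SignatureDB:
    """Hypothetical in-memory stand-in for a shared spam-signature service."""

    def __init__(self):
        self._known = set()

    @staticmethod
    def signature(body):
        # Collapse whitespace and case so trivial reformatting does not
        # change the signature (an illustrative choice, not Razor's).
        normalized = " ".join(body.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def report(self, body):
        # The first recipient to see the spam adds its signature.
        self._known.add(self.signature(body))

    def is_known_spam(self, body):
        # Everyone else only needs a lookup.
        return self.signature(body) in self._known
```

The obvious weakness of an exact hash like this one is that a spammer who randomizes a few words per recipient gets a different signature each time.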
This is really cool.
I did a *much* smaller test of spam filters earlier this year (which was published in hakin9 but not in English).
I also found that crm114 gave poor results in comparison to other filters - but figured I must have set something up incorrectly...
- I only use my main email address with friends and family, and never post it online.
- Whenever I post an email address or register for anything online, I put thatsite@mydomain.com as my email address.
- All email is received by one account, but each message can have a different "to:" header. I set my filters to filter mail to different boxes. Email sent to amazon@mydomain.com goes to the amazon folder. Same with ebay, slashdot, whatever.
- Any time I start receiving spam, I just set my mail server to disregard email sent to whatever email address is getting the spam, and I can stop doing business with the company that sold my email address.
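The per-site address scheme above could be sketched like this in Python. It is only an illustration: the `route` function, the `BLOCKED` set, the "leakysite" entry, and the fallback "inbox" folder are hypothetical names, and a real setup would normally do this in the mail server or a delivery agent rather than in a script.

```python
from email import message_from_string
from email.utils import getaddresses

# Hypothetical: local parts that started attracting spam and are now ignored.
BLOCKED = {"leakysite"}


def route(raw_message, domain="mydomain.com", blocked=BLOCKED):
    """Return a folder name for the message, or None to discard it.

    The folder is simply the local part of the first To: address at our
    domain, so mail for amazon@mydomain.com lands in "amazon".
    """
    msg = message_from_string(raw_message)
    for _, addr in getaddresses(msg.get_all("To", [])):
        local, _, host = addr.partition("@")
        if host.lower() != domain:
            continue
        local = local.lower()
        if local in blocked:
            return None  # this address was sold or leaked: drop the mail
        return local
    return "inbox"  # not addressed to a per-site alias
```

Adding a name to the blocked set is the whole "unsubscribe" step, and it also tells you which site leaked the address.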
I receive on average 0 spams per day.
Albuquerque PC
OSX's built in mail seems to be pretty close to the accuracy numbers listed in the above summary. I tend to have one to three pieces of spam slip through which are almost always entirely image based with some poetry or equivalent attached.
I must say I've been pleasantly surprised with the spam filtering it provides and it has been a lot easier than the hoops I used to utilize to clean out my inbox.
--- I do not moderate.
Anyone know that three letter prefix to get through the CRM-114?
In real world deploys of statistical filters, something like DSPAM's "global user" feature is necessary. The ability to begin with a relatively mature dictionary is critical to the user experience. Personally, DSPAM is filtering around 200 SPAMs per day for me, allowing one through every few days. It's 99.985% effective for me.
:w
It's unfortunate that DSPAM was left out of this very good quality report. I have personally used SpamAssassin, SpamProbe and DSPAM.
After using each for a couple months at a time, I found DSPAM to be by far the most effective (after it was properly trained).
DSPAM's claim, "DSPAM (as in De-Spam) is an extremely scalable, open-source statistical hybrid anti-spam filter. While most commercial solutions only provide a mere 95% accuracy (1 error in 20), a majority of DSPAM users frequently see between 99.95% (1 error in 2000) all the way up to 99.991% (2 errors in 22,786). DSPAM is currently effective as both a server-side agent for UNIX email servers and a developer's library for mail clients, other anti-spam tools, and similar projects requiring drop-in spam filtering. DSPAM has been implemented on many large and small scale systems with the largest systems being reported at about 125,000 mailboxes.", was quite accurate for me.
Also check out some priceless photos Priceless Photos
Gamblers Forum
Up to this past weekend I was using only bogofilter (which is a pure Bayesian filter). I seem to get about 200 spam a day on my main account. Until about a month or two ago bogofilter was amazing - I'd get maybe 1 or 2 spam a day, if that many. Then recently I suddenly started getting hit with 20 spam messages a day, and I noticed most of those were using lots of common words to bypass bogofilter. Most spam was still being removed by bogofilter, but enough got through to annoy me. This past weekend I also enabled spamassassin (without its Bayes filter though), and it's cut down the number of spam to maybe 5 a day, but it's still too much for me. I'm hoping we have the next breakthrough in spam filtering technology soon (akin to Bayesian filtering) because it seems that every new technique we use to filter the spam is eventually targeted by the spammers and bypassed.
Only 2 messages out of 150 normally get through that are spam? Good god, I normally get 5-10 spam messages a day that get through SpamAssassin. At that rate, that's 375-750 spam e-mails a day! I thought it was bad before I enabled SpamAssassin a few months ago... but Jesus, man am I glad I got SA!
users. those silly, silly users. i was in charge of spam for my company for the greater part of a year. using an outdated KEYWORD-based system, i was forced to read every.caught.message to look for false positives. ...
did you catch that? yeah...i had to go through EVERY 'spam' tagged e-mail that went through the company.
needless to say, after the first week i was ready to gouge my eyes out. but hey, at least i earned that 'i read your e-mail' sticker!
anyways, the point that i'm failing to make here is the cause of the spam...
the damn users.
whether it be responding to spam, putting their e-mail address in every single webform they encounter while surfing instead of working, signing up for spam voluntarily, or whatever the cause may be..
i ran some numbers on the logs, and came to an astounding find.
a few people were getting literally a thousand messages blocked, per month.
i, on the other hand, had maybe one or two a month.
and i'm not a nazi with my e-mail address....but i do take some care in what places i type it in.
an ounce of prevention goes a long way folks.
SpamAssassin used to be super-good for me, but recently it has become a nightmare... even with Bayes filters on and training it with almost 2000 spam messages that have escaped it before, I STILL get an enormous amount of spam every day... maybe I'm doing something wrong with the config, I admit that I haven't spent that much time on that, but it seems like it should be working better :-((.
Spam sucks. If everyone stopped buying the products advertised, it'd be over. But then again, people will always be too dumb for an easy solution like that (reminds me of the "Goobacks" South Park episode...)
I haven't read everything in detail yet, but one of the things that stands out is that their 'gold standard' representing the best result consists of 9,038 ham messages (18.4%) and 40,048 spams (81.6%). While large, the dataset is unbalanced. One of the things that is recommended by many of the filters is training on equal proportions of ham/spam in order to prevent biasing (overfitting).
Their train-on-errors approach may simulate what goes on with some filters, but it doesn't reflect the scenario where there is an initial dataset to be trained on _before_ new messages are processed. Instead, each message is in essence 'new'. So in their tests the machine learning filters start out knowing nothing, but SpamAssassin starts out with its inbuilt ruleset. Not exactly fair.
-Greg
I have tried a number of Bayesian type filters and none of them filter the spam when I send it...
BEGIN
The Library of Babel
By Jorge Luis Borges
Translated by James E. Irby
"By this art you may contemplate
the variation of the 23 letters..."
- The Anatomy of Melancholy, part 2, sect. II, mem. IV
The universe (which others call the Library) is composed of an indefinite and perhaps infinite number of hexagonal galleries, with vast air shafts between, surrounded by very low railings. From any of the hexagons one can see, interminably, the upper and lower floors. The distribution of the galleries is invariable. Twenty shelves, five long shelves per side, cover all the sides except two; their height, which is the distance from floor to ceiling, scarcely exceeds that of a normal bookcase. One of the free sides leads to a narrow hallway which opens onto another gallery, identical to the first and to all the rest. To the left and right of the hallway there are two very small closets. In the first, one may sleep standing up; in the other, satisfy one's fecal necessities. Also through here passes a spiral stairway, which sinks abysmally and soars upwards to remote distances. In the hallway there is a mirror which faithfully duplicates all appearances. Men usually infer from this mirror that the Library is not infinite (if it really were, why this illusory duplication?); I prefer to dream that its polished surfaces represent and promise the infinite... Light is provided by some spherical fruit which bear the name of lamps. There are two, transversally placed, in each hexagon. The light they emit is insufficient, incessant.
Like most men of the Library, I have travelled in my youth; I have wandered in search of a book, perhaps a catalogue of catalogues; now that my eyes can hardly decipher what I write, I am preparing to die just a few leagues from the hexagon in which I was born. Once I am dead, there will be no lack of pious hands to throw me over the railing; my grave will be the fathomless air; my body will sink endlessly and decay and dissolve in the wind generated by the fall, which is infinite. I say that the Library is unending. The idealists argue that the hexagonal rooms are a necessary form of absolute space or, at least, of our intuition of space. They reason that a triangular or pentagonal room is inconceivable. (The mystics claim that their ecstasy reveals to them a circular chamber containing a great circular book, whose spine is continuous and which follows the complete circle of the walls; but their testimony is suspect; their words, obscure. This cyclical book is God.) Let it suffice now for me to repeat the classic dictum: The library is a sphere whose exact center is any one of its hexagons and whose circumference is inaccessible.
There are five shelves for each of the hexagon's walls; each shelf contains thirty-five books of uniform format; each book is of four hundred and ten pages; each page, of forty lines, each line, some eighty letters which are black in color. There are also letters on the spine of each book; these letters do not indicate or prefigure what the pages will say. I know that this incoherence at one time seemed mysterious. Before summarizing the solution (whose discovery, in spite of its tragic proportions, is perhaps the capital fact of history) I wish to recall a few axioms.
First: The Library exists ab aeterno. This truth, whose immediate corollary is the future eternity of the world, cannot be placed in doubt by any reasonable mind. Man, the imperfect librarian, may be the product of chance or of malevolent demiurgi; the universe, with its elegant endowment of shelves, of enigmatical volumes, of inexhaustible stairways for the traveler and latrines for the seated librarian, can only be the work of a god. To perceive the distance between the divine and the human, it is enough to compare these crude wavering symbols which my fallible hand scrawls on the cover of a book, with the organic letters inside: punctual, delicate, perfectly black, inimitably symmetrical.
Second: The orthographical symbols are twenty-five in numb
It worked in Iraq! By God, the very act of the US being ballsy enough to invade made all the WMD's disappear! Imagine what it'd do to spammers!
just my humble opinion...
i use email for business and receive many letters from clients. i'm just afraid to lose any of these because of a spam filter. therefore even when i used one, i checked all the emails anyway.
I use Netscape's Bayesian filter as a second tier, and that removes about 60% of the remaining spam.
SpamCop was better, until IronPort bought them and they went black-hat, with Bonded Spammer and the Spam Engine.
Now I have gmail.
...hammer the spammer's ISP with complaints until the advertised website is DEAD, DEAD, DEAD.
STOP MISUSING APOSTROPHES, YOU MORONS!!!
The shift key is next to the Z on the left of the keyboard, and next to the / on the right.
It's often used on the first letter after a full stop - '.' character.
People LOVE it.
There are some false positives and some false negatives.
But I have it set to delete anything 12+. That gets rid of the worst of the worst spam. So far, not a single complaint of any email being deleted.
Everything else has the subject re-written so people can run their own rule set against it.
In the past 8 hours
1867 messages received
375 messages deleted
1266 messages flagged as spam
So, only a few hundred actual, good emails.
Of course, that's only 4 hours during the regular work day (and 4 hours after work). But you can see the proportions. It saves people a TON of time.
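The policy described above (delete at 12+, rewrite the subject on anything else flagged) boils down to a two-threshold decision. Here is a small Python sketch of that logic, not of SpamAssassin itself: the `handle` function and the "[SPAM] " tag are made up for illustration, and the flagging threshold of 5.0 is an assumption (it is SpamAssassin's usual default required score, but the poster does not say what they use).

```python
def handle(score, subject, delete_at=12.0, flag_at=5.0, tag="[SPAM] "):
    """Decide what to do with one scored message.

    delete_at comes from the comment above; flag_at and tag are assumptions.
    Returns (action, subject) where the subject may have been rewritten.
    """
    if score >= delete_at:
        return ("delete", None)  # worst of the worst: nobody ever sees it
    if score >= flag_at:
        return ("deliver", tag + subject)  # users can filter on the tag
    return ("deliver", subject)


# The eight-hour sample quoted above, to check the "few hundred" claim.
received, deleted, flagged = 1867, 375, 1266
clean = received - deleted - flagged
print(clean)  # messages that were neither deleted nor flagged
```

With the poster's numbers, that leaves 226 unflagged messages out of 1867, so roughly 88% of the incoming mail in that window was deleted or tagged.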
And it makes them happier when they don't have to constantly dig through crap to see if any real messages have arrived.
Now, those spam messages are NOT distributed evenly. Our HR manager had her email address posted on the website. So she gets about 20-25% of the spam.
It's not exactly Big Brother 'cause no human sees the deleted spam.
No false positives, disgusting amounts of spams killed. 'Tis a glorious thing.
Why do you hate GNAA?
And it has just now learned to filter out almost all the spam. IIRC, SpamAssassin said it would learn what to mark as spam after a couple hundred obvious spams and the same number of obvious non-spams. I still get the occasional false positive.
[Ripley] "I say we take off and nuke the entire planet
from orbit. That's the only way to be sure."
[Hudson] "F--kin' A..."
[Burke] "Ho-ho-hold on a second! The Earth has a
very substantial dollar value attached to it!"
[Ripley] "They can BILL me."
>;k
The first person who says gmail is getting shot. By me.
This article from the beeb puts human accuracy over machine accuracy...
What is GNAA's position on hair-fuckers? It's a large and dedicated subculture that demands recognition! Our sex symbols are even gaining respected publicity
Yahoo! allows you to have suspected spam automatically deleted or moved to a spam folder. It also allows you to disable the spam filter completely. (Mail Options -> Spam Protection)
As for SpamAssassin, I've been using it for about a week on my mail server. There have been about 500 filtered spams and one false positive - an AOL greeting card.
Why people don't use disposable accounts is beyond me. Once you start using Spamgourmet you'll never go back. I've been active with them over two years and here's my current stats:
Your message stats: 339 forwarded, 43,796 eaten. You have 155 disposable address(es).
Yeah, that's right, thanks to disposable addresses I *haven't* read 43,796 spam emails! When I do need (want) to use my real address, I use SpamSieve (with Entourage X) - a very good Bayesian filter (not sure if it's Mac only or not).
NIGGER coomunJity
For an INDIVIDUAL, Bayesian filter works far better than just the regular SpamAssassin rulesets.
That's because the Bayesian system will LEARN from you what you consider to be spam and ham.
I use SpamAssassin with Bayesian filtering turned on and it catches over 90% of the spam. But then I've fed it a decent sized corpus.
Thunderbird already has integrated significant improvements based on SpamBayes, I believe. See http://bugzilla.mozilla.org/show_bug.cgi?id=230093 , which was closed about a month ago. The test data from that patch is encouraging, although obviously results will be different for everyone since not everyone gets the same type of spam. If you want to keep tabs on upcoming refinements to junk mail filtering, take a look at the dependencies of this meta bug: http://bugzilla.mozilla.org/show_bug.cgi?id=228674 . Please don't "spam" up that bug with comments though; if you have something to say, put it in a specific bug or file a new one if something relevant doesn't exist.
Rock over London, Rock on Chicago. Wheaties: Breakfast of Champions.
That's funny. Evolution under MDK 10 uses Spamassassin.
It has to be said: did they set the CRM-114 discriminator to OPE or some other arrangement of P, O, E? 'Cause y'all know unless you specify the code prefix you can't recall the spam and the doomsday device will go off... 'cause that spam can get in there real low. I mean, if the spammer is _really_ good he can fly, er, send that e-mail right under their radar.
No time to read it, son, just email it to me.
Hey freaks: now you're ju
Would be interesting to see how that message sample reacts against more spam filtering technologies, or even webmails with spam protection integration.
Maybe I'm doing something wrong (wouldn't be the first time). I run the spamd command as root (tried it with the -d option too), pointed sa-learn at 3000+ spams and about 200 hams, and set up kmail filters to pipe everything less than 250k through spamc and move anything with X-Spam-Flag=Yes to junk. It's slow as heck and only filters about 60% of my spam. Bogofilter was doing about 80% (it's more trouble to set up though). But I keep reading posts from people with 98% filter rates.
Hi! I make Firefox Plug-ins. Check 'em out @ https://addons.mozilla.org/en-US/firefox/addon/youtube-mp3-podcaster/
I have used it sitewide (small site, about a dozen active mailboxes) for a few months. Currently it has an error rate of about 1 or 2 mistakes per week per mailbox (in mailboxes that get 100+ spams per day). I did have to do a lot of work to configure it properly though, which may be the reason the authors saw poor performance from it; the "forward to yourself to train" method didn't work at all because both my IMAP server and my mail reader would slightly reformat my headers, meaning that CRM114 was training on different text than it saw when it was filtering! So I put together my own system to save pristine copies of all inbound mail and train on them as needed. Maybe the reason CRM114 fared so poorly is the difficulty in setting it up properly?
There is a somewhat interesting article where they more or less explain how the Mac OS X Mail application works regarding Spam:
http://www.macdevcenter.com/pub/a/mac/2004/05/18/spam_pt2.html
I'd like to second SpamSieve. If more than one piece of spam gets through in a day (where each day I receive > 500 pieces of email), I am truly surprised. My stats for June are:
Works for me. Oh, the false positive was a list that I just signed up for. They sent a confirmation mail, I checked to see if it was caught (it was), and marked it as "good". Piece of cake.
I live ze unknown. I love ze unknown. I am ze unknown.
Are you sure you trained it on a proper corpus? You have to look at your mail with a real mail reader, eg. Mutt. You will probably find that your good mail corpus is full of spam that was marked for deletion but isn't really deleted. This will cause the filter to train badly.
I think 200 hams is way too small. Keep sorting and it should improve.
I lost a ton of emails in v2.63 of spamassassin. I use a chain of fetchmail -> postfix -> kmail get -> filter through spamc -> kmail inbox/spam.
I had to turn off spamc processing because I lost a bunch of email. Maybe it was a bad interaction with kmail, but it was disheartening nonetheless. After taking out the spamc filter, I did not run into the problem again.
Firstly, it should be remembered that the 'owned' part is a bit subjective, as most of the project could live on regardless of 'ownership' thanks to it being open source. But regardless of that... am I the only one that finds the prospect of Microsoft buying SpamAssassin a bit odd?
Microsoft to buy Network Associates?
At the very least they'd be buying the name and the tarted up version of SpamAssassin sold as SpamKiller.
0daymeme.com: Great stuff.
Some guy a few stories back mentioned he was getting 3000 ad impressions and 15 clicks a day or so with AdSense. Which is terrible. At first I assumed he was just oversaturating his visitors with ads. But his ad placement is also terrible. It's at the very bottom of the page where few are going to see it. But he is also oversaturating. His pages are very busy with information and the ads are on every single page.
What happens when you constantly shove something in someone's face is that they learn to ignore it. Either consciously or subconsciously. In the case of advertising if someone is shown an ad and they aren't interested and another ad is shown there's a very good chance they won't even notice it. Even if they would have been interested in what it was offering. This is because they were annoyed by the first ad so they just mentally block any additional ads.
This is why the response rate to spam is so terrible. People for the most part just subconsciously ignore it. It's just noise.
Advertisers like radio stations because it tends to be a captive audience. People are very unlikely to turn the station when ads come on. However there is one local station that I've learned to turn the channel on when the ads start because I know I'm going to get to my destination before another song comes on. There are other stations that I don't change the channel on because I know it's just a short break.
Just like the guy pumping out 2985 ads that no one clicks on, spammers would benefit immensely by pulling a large chunk of the ads. People are more likely to notice ads when they aren't bombarded by them, and the response percentage goes up.
It seems counterintuitive that less advertising means a greater response but that's actually the case.
I normally notice the ad banners on Slashdot because that's pretty much all the advertising there is. I rarely ever notice the text ads. Even though they're placed on the left side in the best position as anyone who scrolls the page is probably going to see them. Slashdot's problem is that the ads blend in with the web-site's color scheme too well so they're pretty much invisible to anyone with a scroll wheel.
On GameDev the site is so littered with advertising that I never notice it anymore. By the time I close the stupid popup ads that circumvent Google's pop up blocker using evil little tricks I'm too annoyed to even look at the other ads.
Web-sites get desperate and think more ads == more money. And the actual result is less valuable ad space because the click thru rate is so low and fewer clicks because users tune the ads out which results in less money than if they had focused on the click thru percentage rather than the number of impressions. If you have a web-site with a high click thru rate advertisers are more likely to pay more because they know that if they show an ad there's a very good chance they'll get a click thru.
But then I'm guessing spammers have never taken a course in marketing or bothered to think about things from their potential customers' perspective.
Keeping ineffective ads visible hurts the effectiveness of the better ads. Spammers are in effect destroying themselves in that area. As are ad happy web-sites.
Ben
Work Safe Porn
I've been getting spams lately that seem to be trying to get around the highly effective statistical solutions, such as SpamAssassin, that have been implemented. Spammers seem to be adding random, or possibly even carefully selected dictionary words to skew their statistical rating. Here is an example from the several I've received lately--has anyone seen information about this on /. or elsewhere?
[spammers irritating message snipped]
Thu, 17 Jun 2004 19:42:34 -0500
No Thanks
beatify
sacred atom drank deprecate cathodic thermionic sherman delinquent hanley swum wooster asteroidal bilayer haiti saudi wink bijective reserpine baronial gloss ambrose threadbare chianti predatory earmark bilingual angora palazzi chartres alveolar phosphate civet radish barricade diem laurie minutem! en crusty
camilla jade lineman bendix masonic dublin incontrovertible defecate generous buddhist yesterday endow bitten conley trunk pitchfork beret bloat gelatine dovetail gambia medea niggardly blackburn suey dialogue ilyushin anastigmatic berth abort bodied contractor of ridden embarcadero corset trademark
ID: W993gt72
carnation
constructor maltese bantam airfield pique douglas pungent criterion cloudburst illiterate sausage career stile pebble bonnie shim carbonium
magnesite pembroke abrade jogging dynast physiochemical stochastic sumac conference obtain villain midwinter incompetent eradicable madhouse airline antony household cursory instinctual gratuitous clown shaven des cornflower
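A rough sketch of why this word padding can work against a statistical filter. The per-token spam probabilities below are invented for illustration, and the plain log-odds combination is an assumption; real filters such as SpamAssassin's Bayes module weight and select tokens differently.

```python
import math

# Hypothetical per-token spam probabilities a trained filter might hold.
TOKEN_SPAM_PROB = {
    "viagra": 0.99,
    "cheap": 0.90,
    "radish": 0.30,
    "dublin": 0.25,
    "conference": 0.10,
    "household": 0.20,
}


def spam_probability(tokens, unknown=0.5):
    # Naive Bayes style combination: add up the log-odds of every token,
    # then convert the total back into a probability.
    log_odds = 0.0
    for token in tokens:
        p = TOKEN_SPAM_PROB.get(token, unknown)
        log_odds += math.log(p / (1 - p))
    return 1 / (1 + math.exp(-log_odds))


plain = ["viagra", "cheap"]
padded = plain + ["radish", "dublin", "conference", "household"]

print(spam_probability(plain))
print(spam_probability(padded))
```

With these made-up numbers the padded message scores lower than the plain one, because every innocent-looking dictionary word pulls the total toward "ham". Words the filter has never seen contribute nothing here, so the trick only helps the spammer when the padding overlaps with the recipient's real mail.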
I've been using DSPAM for nearly a year now, and it's just kept on getting better. I can't imagine life without it now.
I have 17 DNS-based blacklists in front of it, because I would rather block the messages at the network interface than filter them with my own resources, but those that slip through don't stand much of a chance of reaching my inbox. I have had my current email address out there on the web and in Usenet for six years, so I see a lot of junk -- DSPAM stops all but one or two per month. SpamAssassin can't even come close to that.
Warning: This signature may offend some viewers.
Can anyone suggest the best way of filtering spam received into a mail server running Exchange 2003?
I have no sig yet I must scream.
150 spams a *day*?????
Hell, that's what gets through my filters. On a bad day I get 1000/hour.
I run both spamprobe and bogofilter and find that the OR of the two is noticeably better than either alone. Haven't managed to spot why, but the moral is that if you have CPU to burn it can be worth getting a second opinion.
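A small sketch of what "the OR of the two" means in practice. The verdicts below are made up for illustration and are not real spamprobe or bogofilter output; the helper `rate` is likewise hypothetical.

```python
# Each row: (really spam?, filter A says spam?, filter B says spam?)
# All verdicts are invented for the example.
results = [
    (True, True, True),
    (True, True, False),
    (True, False, True),
    (True, False, False),
    (False, False, False),
    (False, False, True),
]


def rate(rows, pick):
    # Fraction of rows for which pick(row) is true.
    rows = list(rows)
    return sum(1 for r in rows if pick(r)) / len(rows)


spam = [r for r in results if r[0]]
ham = [r for r in results if not r[0]]

catch_a = rate(spam, lambda r: r[1])
catch_b = rate(spam, lambda r: r[2])
catch_or = rate(spam, lambda r: r[1] or r[2])   # flagged if either filter fires
fp_or = rate(ham, lambda r: r[1] or r[2])       # false positives also add up

print(catch_a, catch_b, catch_or, fp_or)
```

The OR catches anything either filter catches, so it helps most when the two filters miss different messages. The trade-off is that the false positives of both filters are combined as well.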
_O_
.|< The named which can be named is not the true named
I didn't see results for how much spam GPG and PGP block.
It's normally around 100% on my pc, but sometimes about 110%.
thank God the internet isn't a human right.
I've been using it since March and the stats speak for themselves:
My spamfilters stats.
It's worth mentioning that I don't get false positives with SA, and CRM114 gets 1 every now and then. On a daily basis I get 70 spams caught by CRM114 and not by SA.
EOF
I can empathise with that. The other problem with using disposable accounts as far as business contacts or clients is the potential fall-out from the LACK OF TRUST! What would your contact or client think if you give them a spamgourmet address and they know what spamgourmet does? Or, if you give them a sneakemail address... "Can you spell your sneak e-mail address to me again please? That's a-5-b-z-what?"
--- root@127.0.0.1
Even if Microsoft buys NAI, they would not get the SpamAssassin trademark.
SpamAssassin is in the process of becoming a project in the Apache Software Foundation. That process requires the trademark to be assigned to the ASF, which is already in progress as can be seen in this status report.
I'm also using CRM114. On a bad week, maybe 4-5 spam messages sneak by, and I probably get 200-300 messages a day (overwhelmingly spam). There was a bug in previous versions of CRM114 that would cause the filter to claim it had already learned a message when you trained on an error, giving you an "I already know that's spam, I don't need to relearn that" message upon training. You can fix that by getting the new version and using a training command like this:
command password spam force
That VASTLY improved performance for me.
Seriously. Exposing an Exchange server directly to the net is just asking for trouble. Your best bet is to put the Exchange server behind your firewall and relay all incoming mail through a hardened Unix machine running your favourite mail transport (like Postfix). Then you can use any of the well-documented spam sifters discussed here and offer your Exchange server more protection from the elements.
You really need to make sure SpamAssassin is using DNS RBLs and the like. I was seeing the same kind of thing: lots of spam getting through. Once I turned on the RBL checkers, spam levels dropped immediately. The other thing to do is make sure SA is the latest version, since the newest spam techniques beat old SA versions.
Make trash your inbox. 100% effective.
Ask me about my vow of silence!
I am the author of CRM114 and I corresponded with Professor Cormack for setup assistance during this study; he did have some problems with CRM114 that he brought to my attention and which were possibly never quite resolved.
I can also state that I *do* run CRM114 myself; I also run SpamAssassin (regularly maintained by the systems staff) on a parallel account. I find that SA gets about 90+ percent of what makes it past the firewall's immediate RBL lists (which matches Prof. Cormack's Figure 8 pretty closely); CRM114 nails 99.9% or more. This week, ending June 21, 2004, my CRM114 stats are 2528 nonspam and 1114 spam messages with just 1 error (a false reject), which is 99.972% accuracy.
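For anyone who wants to check the arithmetic behind that accuracy figure, a quick sketch using only the counts quoted above (the variable names are mine):

```python
nonspam = 2528   # nonspam messages this week
spam = 1114      # spam messages this week
errors = 1       # one false reject

total = nonspam + spam
accuracy = 1 - errors / total

print(f"{total} messages, accuracy {accuracy:.4%}")
```

One error in 3642 messages works out to roughly 99.97% accuracy, consistent with the figure quoted.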
I have gotten reports from some very happy users who are seeing similar accuracies; I've also gotten sad reports similar to Prof. Cormack's that show very weak accuracy.
I can conclude from this (and other reports) that filter performance varies _greatly_ with spam mix - that is to say, Your Mileage Will Vary.
Further, consider Fig 15, which compares CRM114's accuracy with respect to nonspam v. spam. Note that the two curves are displaced considerably, by a factor of accuracy between 3 and 5 times!
This is odd, because CRM114 is _entirely_ symmetrical; it does NOT have any predisposition toward (or against) erring on the side of caution; the only difference between nonspam and spam is the names of their files, which could be changed to "foo.css" and "bar.css" (or even interchanged) without affecting anything else.
Therefore, the two accuracy curves _should_ lie on top of each other; there is no difference in the processing. The fact that the nonspam v. spam curves seem to differ by a factor of 3 to 5 in magnitude gives me some reason to believe that the setup issues Prof. Cormack encountered never really were completely addressed.
-Bill Yerazunis
Personally, I look forward to Bayesian categorization. Not just Spam, but Personal, Work, Bills, etc. It would be splendid if I could have some more dynamic rules instead of doing this stuff manually.
I tried CRM-114 after the previous Slashdot article. I paid a lot of attention to my email and did all the required training. After getting over the initial hump of misclassified email it got to a steady low level. Once it made a mistake and I had to train it, though, I would get a run of false positives and negatives for a bit until it settled out again.
What sent me back to SA was that a number of CRM-114 misclassifications were marking ham as spam. Losing a real message in the sea of spam is much more of a concern for me than getting a bit of spam with the regular stuff. It is very rare that I get ham classified as spam in SA.
Turning on Postfix 2.1's "address verification" feature immediately eliminated 90% of the spam that my company was receiving! (SpamAssassin + ClamAV + CRM114 catch the rest). This feature confirms that the incoming email is coming from an account that also accepts email. (Spambots don't normally accept mail, of course...) The spam email never even makes it into your system this way, because the SMTP transaction is deferred until the address is verified. - Mike
But SpamAssassin is just getting better and better. Version 3.0 is coming up, and 3.0-pre1 was recently released. I do not have a test system available for it, but those who have may want to take it for a spin.
Especially for large sites, this is extremely interesting. It adds relational database support for the Bayes database, so it should be a lot easier to set up on a large site.
I find individual training the main reason why SA works so well for me, and the lack of it the main reason it did not work very well at my old university.
Employee of Inrupt, Project Release Manager and Community Manager for Solid
I evaluated SA as a possible filtering solution where I work, and it was a full order of magnitude slower than bogofilter even with every test disabled. And that *is* running spamc/spamd. Without the daemon it was even worse.
So while it may be a nice solution for people who are running it on a small scale, for large installations (e.g. we get over six million SMTP connections a day) it requires a lot more hardware thrown at it.
I have set a maximum spam score of 1.9 and have had only two false positives (on subscribed mailing lists) in a 12 PC network.
One thing I really like about SA is how they are very careful to make sure their error rate is on the right side. It's better to let some spam get through than to mark good mail as spam.
My ISP implemented postini the other day and it had collected 30000 messages before I realized that it was blocking my Mexican cousin's email- his trip to visit was almost fubar'd.
And the only way to get the messages back was via frickin scroll, click click several hundred times. (Or open the ssl client scripting can of worms)
I've also been using POPFile for about a year, and it's done an amazing job - 99.87% accuracy, very few false-positives, and great summary info with six email accounts collectively filtered through it.
I recently helped a few friends install it on their machines, and, rather than just having them start from scratch, I copied my Spam corpus for them. With the spam corpus already in place, all of them noticed spam drop to close to zero while they trained their other buckets.
- Jack
I (and the company I work for) use ASSP and have been very impressed with its results. Spam in my boss's inbox went from 100-200 messages per day down to a handful... I'd like to see it compared to the other anti-spam packages mentioned.
Read my keyboard review.
No comparison of Bayesian systems would be complete without some method to normalize the training of them. In other words, different Bayesian approaches to anti-spam will learn differently from a different training set. So ironically, this comparison is only as good as the completeness of the spam used to train the filters.
http://tinyurl.com/4ny52
We've been using ASSP for just over 200 days now. http://assp.sourceforge.net.
785621 messages processed
334565 messages rejected as spam
159278 viruses blocked (attachments of .exe, .pif, etc)
Major points in ASSP's favor include the fact that it blocks the email at the network interface (it takes over port 25 and forwards on to sendmail only the stuff that isn't spam), it's easy to install at the server-wide level, anyone on the whitelist can help train the spam filter by emailing it, and it rejects most viruses immediately, which keeps the machine running smoothly even during M$ virus blitzkriegs.
But what about the emails I'm missing, you say? They get a message telling them they were rejected and why. Better yet, they get the message even if they are a spammer using a fake return address, which gives you a chance to "opt out" in a fashion they can't legally ignore (yeah, like they care, but still...). We've gotten no complaints from valid users so far and the message tells them to use our phone number to get whitelisted.
What still gets through? Bayesian poisoned emails do occasionally make it past--usually about 3-5 a day, but the spam rate is quite low compared to the deluge we would be under without it.
Examples.
A mom reading an email from her daughter saying "help, i'm being sexually assaulted by a football team" is far less likely to go "gee, that contains the word rape so it's spam".
A CEO reading an email from his biggest customer saying "you're getting rich. we're placing the order you need to survive" won't dismiss it because of spam words.
Spam filters have a higher chance of deleting the important emails than these overall percentages suggest.
From the article "'The best-performing filters reduced the volume of incoming spam from about 150 messages per day to about 2 messages per day.'"
I don't give a damn if they reduce 150 spams to 2 or 3 or 4 or 5. I care that they do NOT delete the one important email hidden there. Spam filter writers -- start focusing on avoiding false positives, not on trying to delete everything.
The article admits that they didn't follow the training guidelines for CRM114. Its HOWTO and FAQ clearly indicate that training of the type used in the Shootout decreases accuracy significantly. I followed the author's recommendations carefully (having found his rationale for them very rational) and have had very good results.
"The more spam you get the less you read" is what somebody told me at a recent user group meeting.
The trick is to train the spamfilter against the spamtrap addrs so that when they hit the good addrs the spamfilter knows they're spam.
I use CRM114 train-on-error, so my .procmailrc includes something like this...
:0
* ^X-CRM114-Status: Good
{
  :0
  * ^TO_compromisedaddr@mydomain.org
  {
    :0 c
    | $HOME/my/etc/crm114/learnspam
    :0:
    $MAILDIR/checkspam-learnt
  }
}
Of the million or so emails I process per day, 80% are marked as spam. Of those approximately 75% are caught by the RBLs before it even reaches the spamassassin engine.
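Spelling those percentages out as message counts, a quick sketch that takes "a million" as exactly 1,000,000 and assumes the RBLs only ever reject mail that was going to be marked as spam:

```python
total = 1_000_000                    # emails per day (rounded)
spam = total * 80 // 100             # 80% are marked as spam
caught_by_rbl = spam * 75 // 100     # 75% of that spam is stopped by the RBLs
spam_left_for_sa = spam - caught_by_rbl
reaching_sa = total - caught_by_rbl  # everything the RBLs did not reject

print(spam, caught_by_rbl, spam_left_for_sa, reaching_sa)
```

Under those assumptions the RBLs turn away 600,000 messages a day, so the SpamAssassin engine only has to scan about 400,000, of which 200,000 are spam.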
:)
I highly recommend RBLs to anyone. Not only are they fast and usually pretty accurate, but they also tend to pick up new spam sources very quickly.
One of my favorites is the SURBL, which seems to catch a good chunk of it. Bayes filters are always gonna be thrown off by the dummy words thrown in there, but the minute they try to link the person to their site, BAM, the SURBL gets them.
I'm surprised greylisting hasn't become more widely used... I've not used it personally, but it sounds effective & fairly benign for non-spam mails.
"A great democracy must be progressive or it will soon cease to be a great democracy." --Theodore Roosevelt
And here is a very long and detailed response on the DSpam site by Jonathan A. Zdziarski himself..
I have exactly the same setup as you do. As some of the others said, you need to keep running sa-learn and it will eventually work.
I was doing this for about two weeks with no noticeable effect, and all of a sudden it started to catch well over 90% of all spam.
With Razor and other remote site checking in place it is very slow though.
Alas gallinaceas de urbe bovis volo