spamconference.org · Domains

Ending Spam

News · Spam · 2005-08-15 09:25 · posted by timothy · from the overdue dept. · 184 comments

Shalendra Chhabra writes "Jonathan Zdziarski has been fighting spam since before the first MIT spam conference in 2003, and has now released a full-on technical book, Ending Spam, on spam filtering. Ending Spam covers how the current and near-future crop of heuristic and statistical filters actually work under the hood, and how you can most effectively use such filters to protect your inbox." Read on for the rest of Chhabra's review.

Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification
author: Jonathan A. Zdziarski
pages: 312
publisher: No Starch Press
rating: 8
reviewer: Shalendra Chhabra
ISBN: 1593270526
summary: Very Good Book Covering Statistical Models and Techniques Implemented in Current Spam Filters

Spam (unsolicited commercial email) and phishing (fraudulent email) cost businesses billions of dollars. Many initiatives are underway to fight both. On the legal front, a Virginia court recently sentenced a prolific spammer, Jeremy Jaynes, to nine years in prison, and a Nigerian court sentenced a woman to two and a half years for phishing. In July/August 2005, Michigan and Utah both passed laws creating "do-not-contact" registries covering e-mail addresses, instant messaging addresses and telephone numbers. Technical initiatives to fight spam include server- or client-side spam filtering, lists (blacklists, whitelists, greylists), email authentication standards (IIM, DK, DKIM, SPF, SenderID), and emerging sender reputation and accreditation services.

Ending Spam is the first book explaining the fine details of the theoretical models and machine-learning algorithms implemented in these filters. The book is divided into three parts: introduction to spam filtering, fundamentals of statistical filtering, and advanced concepts of statistical filtering.

The first section of the book discusses the history of spam, the spam kings, and different approaches to fighting spam, such as blacklisting, whitelisting, heuristic filtering, challenge-response, throttling, collaborative filtering, Authenticated SMTP, Sender Policy Framework and SenderID, and spammer fingerprinting. However, the author omits any mention of locality-sensitive hash functions (such as the Nilsimsa hash) to counter spammers' random insertion of words, the use of CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart), greylisting, Identified Internet Mail, and DomainKeys (now DomainKeys Identified Mail).

In the next chapter, the author clearly explains the components of a Language Classifier Pipeline, including the Historical Dataset (aka wordlist, database, dictionary, filter memory), the Tokenizer, and the Analysis Engine with its feedback loop. However, the process flow of a language classifier could have been more generalized, e.g. by incorporating an initial text-to-text transformer. This chapter also covers the advantages and disadvantages of various training modes for filters, such as Train Everything (TEFT), Train-on-Error (TOE), and Train Until No Errors (TUNE). This part concludes with a description of Paul Graham's famous spam-filtering technique using Bayesian classification (as described in "A Plan for Spam"), Gary Robinson's Geometric Mean Test, Fisher-Robinson's Inverse Chi-Square (including the source code for the inversion function), and some other tricks for optimizing spam-filtering accuracy.
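To make the combining step concrete, here is a minimal Python sketch of two of the techniques named above: Graham's product-style combination and the Fisher-Robinson inverse chi-square combination. It is an illustration under stated assumptions, not the code from the book: the function names, the clamping constant, and the choice to return 0.5 for an empty input are all hypothetical, and the per-token spam probabilities are assumed to have been computed already.

```python
import math
from typing import Sequence


def _clamp(p: float, eps: float = 1e-6) -> float:
    """Keep a probability strictly inside (0, 1) so log() stays finite.
    The eps value is an arbitrary choice for this sketch."""
    return min(max(p, eps), 1.0 - eps)


def graham_combine(probs: Sequence[float]) -> float:
    """Combine per-token spam probabilities as prod(p) / (prod(p) + prod(1 - p))."""
    spam = math.prod(_clamp(p) for p in probs)
    ham = math.prod(1.0 - _clamp(p) for p in probs)
    return spam / (spam + ham)


def chi2q(x2: float, dof: int) -> float:
    """Probability that a chi-square variable with an even number of
    degrees of freedom `dof` is at least x2 (closed-form series)."""
    if dof <= 0 or dof % 2:
        raise ValueError("dof must be a positive even integer")
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)  # guard against rounding slightly above 1


def fisher_combine(probs: Sequence[float]) -> float:
    """Fisher-Robinson style indicator in [0, 1]; values near 0.5 mean 'unsure'."""
    n = len(probs)
    if n == 0:
        return 0.5  # no evidence either way (assumption of this sketch)
    ln_spam = sum(math.log(_clamp(p)) for p in probs)
    ln_ham = sum(math.log(1.0 - _clamp(p)) for p in probs)
    s = 1.0 - chi2q(-2.0 * ln_ham, 2 * n)   # evidence for spam
    h = 1.0 - chi2q(-2.0 * ln_spam, 2 * n)  # evidence for ham
    return (s - h + 1.0) / 2.0
```

With both functions, a message whose tokens all score near 0.99 combines to a value close to 1, while a message whose tokens all score 0.5 stays at 0.5. The chi-square variant tends to be less extreme when the evidence is mixed, which is one reason filters often prefer it.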

The second part of this book deals with the fundamentals of statistical filtering. The author explains HTML and Base64 encoding, followed by a detailed description of tokenization techniques (e.g. Sparse Binary Polynomial Hashing). Then there's a discussion of the various tricks that spammers use for penetrating filters. Although these tactics are mentioned in John Graham-Cumming's "Spammers Compendium," Jonathan has very elegantly explained why some tricks work for spammers and some don't. This part concludes by addressing some of the resource, storage and scaling concerns raised by the large number of features generated from tokenization techniques.

The third part of this book deals with advanced concepts of statistical filtering. This includes the testing criteria for measuring accuracy of an email filter, and some advanced tokenization concepts, e.g. chained tokens (taking word-pairs and phrases into account, instead of individual words) generated using a sliding 5-byte window as mentioned in Sparse Binary Polynomial Hashing. The next chapter describes the Markovian model implemented in the CRM114 Discriminator, but the author does not describe the different feature-weighting schemes implemented in the Markovian-based version of CRM114. The author then describes the Bayesian Noise Reduction technique for purging "out of context" data from the mail text. This chapter concludes with a very nice summary of collaborative algorithms and techniques, such as Message Inoculation, the Streamlined Blackhole List, fingerprinting, automatic whitelisting, URL blacklisting, and honeypot email addresses for snaring spammers' address-harvesting bots.
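The idea of chained tokens can be sketched in a few lines of Python. This is a simplified word-pair tokenizer, not Sparse Binary Polynomial Hashing and not DSPAM's actual tokenizer; the function name, the `+` joiner, and the word-splitting regular expression are assumptions made for the example.

```python
import re
from typing import List


def chained_tokens(text: str, max_chain: int = 2) -> List[str]:
    """Return single words plus chains of up to `max_chain` adjacent words.

    Chains are joined with '+', so "cheap pills" yields the extra
    token "cheap+pills" alongside "cheap" and "pills".
    """
    # Hypothetical word pattern: letters, digits and a few symbols spam likes.
    words = re.findall(r"[A-Za-z0-9$'!-]+", text.lower())
    tokens: List[str] = []
    for n in range(1, max_chain + 1):
        for i in range(len(words) - n + 1):
            tokens.append("+".join(words[i:i + n]))
    return tokens


print(chained_tokens("Buy cheap pills now"))
```

Each chain becomes a separate feature in the filter's dataset, which is why chaining raises the storage and scaling concerns the review mentions for the second part of the book.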

The most interesting part of this book is the appendix, where the author presents interviews with John Graham-Cumming of POPFile, Brian Burton of SpamProbe, Marty Lamb of TarProxy, Bill Yerazunis of CRM114 Discriminator, and Jonathan Zdziarski of DSPAM (himself). I loved this section.

The salient points of the book: it's very easy to read; each chapter begins with a very thought-provoking introduction, and concludes with a crisp "final thoughts" section. There are very few technical errors in this printing, and the illustrations are of good quality. Since the book is geared more toward the Bayesian and statistical generation of spam filters, the absence of certain spam-busting technologies is acceptable. However, a noticeable omission is the lack of discussion about measuring spam-filter accuracy, and what impact this has on setting filtration thresholds. A section on the economics of tradeoffs, and the use of a Receiver Operating Characteristic (ROC) curve, would have been very helpful.
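For readers unfamiliar with the ROC idea, the following Python sketch shows how threshold choice trades false positives against caught spam. It is illustrative only: the function names, the "flag when score >= threshold" convention, and the sample scores are all made up for this example.

```python
from typing import List, Sequence, Tuple

RocPoint = Tuple[float, float, float]  # (threshold, false positive rate, true positive rate)


def roc_points(scores: Sequence[float], is_spam: Sequence[bool]) -> List[RocPoint]:
    """One ROC point per distinct score, flagging a message when score >= threshold."""
    n_spam = sum(1 for y in is_spam if y)
    n_ham = len(is_spam) - n_spam
    if n_spam == 0 or n_ham == 0:
        raise ValueError("need at least one spam and one ham message")
    points: List[RocPoint] = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, is_spam) if y and s >= t)
        fp = sum(1 for s, y in zip(scores, is_spam) if not y and s >= t)
        points.append((t, fp / n_ham, tp / n_spam))
    return points


def pick_threshold(points: Sequence[RocPoint], max_fpr: float) -> float:
    """Lowest threshold whose false positive rate stays within max_fpr."""
    allowed = [t for t, fpr, _ in points if fpr <= max_fpr]
    if not allowed:
        raise ValueError("no threshold satisfies the false-positive budget")
    return min(allowed)


# Made-up filter scores and ground truth for four messages.
points = roc_points([0.9, 0.8, 0.4, 0.2], [True, False, True, False])
print(points)
print(pick_threshold(points, max_fpr=0.0))
```

Because a lost legitimate message usually costs far more than a delivered spam, a site would typically fix a small false-positive budget first and then accept whatever spam catch rate that threshold gives.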

Overall, by putting together Ending Spam, Jonathan Zdziarski has made another significant contribution (after DSPAM) to the anti-spam community. Whether you are a system administrator, anti-spam researcher, engineer or a newbie interested in fighting spam, this book is a great reference.

William S. Yerazunis and Richard Jowsey also contributed to this review. Shalendra Chhabra is a graduate student in the Department of Computer Science and Engineering at the University of California, Riverside. He is on the development team of the CRM114 Discriminator and has presented his work at the MIT Spam Conference 2005, Cisco Systems, and Stanford University. You can purchase Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification from bn.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page.

DSPAM 3.4 + SBL 1.0 Released

It · Spam · 2005-03-12 07:33 · posted by timothy · from the filtration-nation dept. · 5 comments

Nuclear Elephant writes "After a grueling five months of development, DSPAM Version 3.4 has been officially released. Major changes include full LMTP support, client/server support (dspamc), Bayesian Noise Reduction v2.0 technology (as previously introduced at this year's spam conference), many improvements to speed and accuracy, and support for the Streamlined Blackhole List, a machine-automated true-time collaborative blacklist that has piqued the interest of authors of other filtering projects, including Death2Spam and Bogofilter. Version 1.0 of the Streamlined Blackhole List client/server software was also released this weekend. If testing spam filters is your cup of tea, some testing tips for this new version (and statistical filters in general) have also been published on the website."

The Spam Conference 2005

It · Spam · 2005-01-22 12:41 · posted by timothy · from the it-seeps-in-the-pores-it-does dept. · 156 comments

dos_dude writes "This year's Spam Conference is over. As usual, MIT provides low- and high-bandwidth webcasts. The talks covered the full spectrum: from absurd to sound, from boring to entertaining, and from dead-horse-beating to brand-new. Highlights: John Graham-Cumming presented the results of the survey he did with the help of many Slashdot readers, Jon Praed gave the details of the trial against spammer Jeremy Jaynes and friends, Brian McWilliams asked what will happen when all spam is finally filtered, and Matthew Prince plugged Project Honeypot in a very entertaining way. Shameless but useful plug: here's the final schedule with links to the webcasts."

Lycos Declares War on Spam Servers

It · Spam · 2004-11-26 14:07 · posted by michael · from the spam-it-forward dept. · 567 comments

Psychotext writes "The Register have posted a story about a new screensaver from Lycos that targets known spam servers (taken from SpamCop and verified by hand) with traffic in order to raise their bandwidth costs and hopefully price them out of the game. Lycos state that this is not a DDoS, as Lycos monitors the site's responsiveness and throttles back when the site starts to falter. The screensaver is available here for Mac OS X, Mac OS 9 and Windows, though you might need to lie about what country you are from." Reader JohnGrahamCumming writes "As part of preparing for the MIT Spam Conference I've put together a survey on what people are experiencing out there with spam and what they are doing about it, and followed it up with a test of different views of an inbox filled with spam and ham. You can take the test and be part of the survey results in January."

Armoring Spam Against Anti-Spam Filters

Yro · Spam · 2004-02-04 03:18 · posted by timothy · from the take-two-viagra-and-call-nigeria-in-the-a.m. dept. · 511 comments

moggyf points to a BBC article about how spam can be successfully tweaked to slip past current filtering methods, excerpting "To find out how to beat the filters Mr Graham-Cumming sent himself the same message 10,000 times but to each one added a fixed number of random words. When a message got through he trained an 'evil' filter that helped to tune the perfect collection of additional words." iluvspam adds "It's an interview with POPFile author John Graham-Cumming that summarizes his talk at the recent MIT Spam Conference. You can still listen to the technical details here (choose the Afternoon 1 session, he starts about 75 minutes in)."

MIT Spam Conference Conclusions

Yro · Spam · 2003-01-18 16:53 · posted by timothy · from the about-what-you'd-expect dept. · 373 comments

RT Alec writes "The 2003 Spam Conference has concluded, reports InfoWorld (related read: abstracts of the conference discussions). I was unable to attend the conference, but it appears all that was discussed was filters (client and server). I think the key problem is ISPs that do not block egress traffic on port 25. If you need to send mail through a different SMTP server than the one provided by your ISP, the admin of that server ought to provide you with a means of using it with authentication on a port other than 25 (you do have permission to use that SMTP server, don't you?). It is not too tough to set up an SMTP server to require authentication, or at a minimum to run on a different port. I am surprised that this is never mentioned as a cure for spam. If just AOL blocked port 25, this could reduce spam by 50% (I base this figure on close examination of the headers of the spam I receive). I was pleased to see that Barry Shein, president of The World (a Boston-based ISP), was included in the talks. I cannot tell from the posted abstract (see link above) whether he mentioned blocking port 25. In a recent interview he did not mention it."
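From the client's side, the setup described above (authenticated sending on a port other than 25) could look roughly like the Python sketch below. Port 587, the conventional mail submission port, is this example's choice and is not named in the post; the host name, addresses and credentials in the usage note are placeholders, and the function names are hypothetical.

```python
import smtplib
import ssl
from email.message import EmailMessage


def build_message(sender: str, recipient: str, subject: str, body: str) -> EmailMessage:
    """Assemble a plain-text message with the usual headers."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def send_authenticated(host: str, user: str, password: str,
                       msg: EmailMessage, port: int = 587) -> None:
    """Send `msg` through `host` on a non-25 port, logging in first.

    STARTTLS is negotiated before login so the credentials are not
    sent in the clear.
    """
    context = ssl.create_default_context()
    with smtplib.SMTP(host, port, timeout=30) as smtp:
        smtp.starttls(context=context)
        smtp.login(user, password)
        smtp.send_message(msg)
```

A call such as `send_authenticated("mail.example.org", "alice", "secret", build_message("alice@example.org", "bob@example.org", "hello", "test body"))` would then work even where the ISP blocks outbound port 25, provided the remote server accepts authenticated submission on port 587.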

Spam Conference in Boston

It · Spam · 2002-12-27 11:56 · posted by michael · from the just-provide-your-email-address-to-register dept. · 229 comments

bpfinn writes "Are you working on your own anti-spam solution? Would you like to compare notes with other coders? You'll get your chance at the Spam Conference in Cambridge on January 17, 2003. Among the speakers are: Paul Graham (of "a plan for spam" fame), ESR, John Graham-Cumming (of "POPFile" fame), and Matt Sergeant from MessageLabs. According to the homepage, this conference will be very informal: "no fees, sponsorships, proceedings, luncheons, contests, etc. Just a series of quick, concentrated talks, and then we all go off and get Chinese food." Slashdotters who are peeved about spam can register here."

A Conference About Spam

Yro · Spam · 2002-12-16 14:56 · posted by timothy · from the universal-convergence dept. · 392 comments

zonker writes "January 17th will be the first (annual?) meeting of the Spam Conference held in Cambridge, Massachusetts. The informal meeting will feature Paul Graham, John Graham-Cumming, John "Cap'n Crunch" Draper among others (possibly including ESR though he hasn't yet confirmed). The free conference will consist of a number of talks about new ways to combat the growing spam problem, after which everyone's going out and getting some Chinese food. Should be an informative and fun meeting and a chance to meet some interesting people."

Slashdot Mirror

Domain: spamconference.org
