Domain: nuclearelephant.com
Stories and comments across the archive that link to nuclearelephant.com.
Stories · 55
-
DSPAM v3.6 Released
Nuclear Elephant writes "After six months of development, DSPAM v3.6 has been released. The most notable change is the series of new features added to make an anti-spam gateway appliance possible (Knoppix anyone?). Version 3.6 also includes a highly accurate alternative to Bayesian filtering known as Markovian discrimination, based on Bill Yerazunis' research. Other significant enhancements include trusted sender whitelisting, integrated Clam Antivirus and LDAP support, a centralized spam training alias, and a new dependency-free storage driver. Much of the documentation has also been rewritten to make installation easier. A change log and release notes are also available. Slashdot has recently featured a review of the author's book, Ending Spam and an interview as well." -
New Identity Theft Technology Fails to Protect
Nuclear Elephant writes "According to BBC News, identity thieves are quickly adapting to new technologies such as chip-and-pin credit cards using human nature tactics rather than cracking the technology. At least that's what Dr. Emily Finch (UEA), who interviews career criminals about their activities, claims. Finch swapped credit cards with a male coworker and performed a number of transactions without being challenged by cashiers. Finch also believes biometric identity cards will only exacerbate the problem. Regardless of which side of the fence you sit on, could this take us closer to embedded chips under the skin?" -
Jonathan Zdziarski Answers
Wednesday we requested questions for Jonathan Zdziarski, an open source contributor and author of the recently reviewed book "Ending Spam." Jonathan seems to have taken great care in answering your questions, which you will find published below. We have also invited Jonathan to take part in the discussion if he has time, so if your question didn't make the cut, perhaps there is still hope.
Winkydink asks:
How do you pronounce your name?
Jonathan Responds:
Hi. Well first off, I'm sticking to the pronunciation 'Jarsky', however many of my relatives still pronounce it 'Zarsky' or "Za-Jarsky". As far as I can tell, my last name was originally 'Dziarstac' when the first generation of my family came over, which would have been pronounced with a 'J'. It's of Polish descent, but I'm afraid I'm not very in tune with my ancestors on this side of the family. The other side of my family is mostly Italian, and they drink a lot, organize crime, and generally have more fun - so they are much more interesting to hang out with. For the past 29 years of my life, giving my last name to anyone has included the obligatory explanation of its pronunciation, history, and snickering at puns they think I'm hearing for the first time (-1, Redundant), so don't feel too bad for asking.
As far as who I am and why you should care - I guess that depends on what kind of geek you are. I've never appeared in a Star Trek series or anything (I've been too busy coding and being a real geek), so I guess that eliminates me as a candidate for public worship in some circles. I guess if you're into coding, open source, hacking all kinds of Verizon gear, or eradicating spam, then some of my recent projects may be of interest. If you at least hate me by the end of the interview, I'll have accomplished something.
An Anonymous Coward asks:
What do you think about the proposed change to the GPL with the upcoming GPL 3? Is it a welcomed breath of fresh air to the Open Source Community, or will it just be a reiteration of the previous GPL? What are your thoughts and comments on the GPL 3?
Jonathan Responds:
Based on the scattered information I've read about some potentially targeted areas in GPLv3 and the religious fervor with which some of these discussions have been reported, all I can say is I hope common sense prevails. Actually there's much more I can, and will, say about the subject below, but I think it's probably a good idea to summarize in advance as you may not make it through the list of details in one sitting. So in summary of all my points to come: I hope common sense prevails.
One of the things I've heard, which doesn't make much sense to me, is the idea of changing the GPL to deal with 'use' rather than 'distribution', which would affect companies like Google and Amazon. The argument seems to be that some people feel building your infrastructure on open source should demand a company release all of their proprietary source code which links to or builds on existing GPL projects. They argue that the open source community hasn't benefited from companies like Google and Amazon. Well, from a source code perspective that might be somewhat true - but if you take into consideration the fact that we all have a good quality, freely accessible search engine, cheap books, and employment for many local developers (many of whom write open source applications), the benefits seem to balance out the deficiency. Does anybody remember what the world was like before Google? None of us do, primarily because we couldn't find it - we couldn't find much of anything we were looking for on the Internet as a matter of fact, including other people's open source projects. You might not be getting "free as in beer" or "free as in freedom", but you are getting "free as in searches" and "free as in heavily discounted but not quite free books" in exchange. That's a pretty good trade. It's certainly better than having to look at pages of advertising before completing your order, or subscribing to a Google search membership. On top of this, you probably wouldn't want to see half of the source code that's out there being integrated (internally) into these projects. While I haven't seen Google or Amazon's mods specifically, I do heavily suspect that, if they are like any other large corporate environment, there are many disgusting and miserable hacks that should under all circumstances remain hidden from sight forever - many of which are probably helping ensure job security for the developers that performed the ugly hacks in the first place. 
Just how useful would they be to your project anyway? Probably little. And if you really believe in free software ("free as in freedom"), then the idea that someone should be required to contribute back to your project in order to use it is contradictory to that belief - you might just as well be developing under an EULA instead of the GPL.
With that said, there's a difference between freedom and stealing. I've heard that GPLv3 will attempt to address the mixing of GPL and non-GPL software. I think this clarification might be a good thing. For one, because I've seen far too many pseudo-open source tin cans and CDs being resold commercially out there, distributing many different F/OSS tools with painfully obvious closed commercial code, and finding ways to easily loophole around this part of the GPL, and secondly because it's based around implementation guidelines that really aren't any of the GPL's business. At the moment, mixing uses a very archaic guideline, which is - in its simplest terms - based on whether or not your code shares the same execution space as the GPL code. I think this needs to be reworked to give authors the flexibility to define "public" and "private" interfaces in a project manifest. We're already defining these anyway if we believe in secure coding practices. Closed source projects may then use whatever public interfaces the author has declared public (such as command line execution, protocols, etcetera) but private interfaces are off limits. One particular area where this would come in handy is in GPL kernel drivers, which need this ability to avoid tainted-kernel situations. If the author wants, they can declare dynamic linking to a library as a public interface and even make their code more widely useful without having to switch to the GPL's red-headed stepchild, the LGPL. It would also be nice to be able to restrict proprietary protocols (such as one between a client piece and a server piece, which may have originally been designed to function together) to only other GPL projects, which would essentially create GPL-bonded protocol interfaces. This won't restrict use in any way - only what closed-source projects are limited to interfacing with when redistributed.
I would also like to see the GPL's integration clause tightened down quite a bit. There are some companies out there abusing the GPL with "dual licensing". I've considered dual licensing myself in some commercial products, and I just don't believe it's being done in the right spirit much, if at all. Doing away with the possibility of integrating the GPL into a dual license could help strengthen the GPL.
Finally, I'd say mentioning a few words in the GPLv3 about submission practices to help keep problems like this whole SCO and Linux® fiasco from ever happening again would be a good thing. People generally don't want to limit usage, but if you're going to submit code, there should be at least some submission guidelines. I suspect much of this can (and should) be done outside of the GPL, but at least covering the basics might be appropriate. It should be understood that if you're going to contribute code to the GPL, it had better be unencumbered. It's definitely something every project should already be considering.
An Anonymous Coward asks:
Do you have any suggestions for the enthusiastic yet inexperienced? Perhaps a listing of projects in need of developers, with some indication of the level of experience suggested (as well as languages required).
Jonathan Responds:
The best projects I've seen were those started by someone with a passion for what it is they're coding. Open source development is the internship of the 21st century, and working on projects is tedious, frustrating, and likely to make you want to burn out if you haven't developed perseverance. I usually suggest to people to come up with ideas for some projects they feel passionately about and make those their first couple of goals. Even if it's completely useless to anyone else, you're still likely to benefit from it yourself. Just look at my Australian C programming macros. Who would have thought that people wouldn't want to use "int gday(int argc, char *argv[])" in their code? I'm sure I learned something from that project, though I still can't remember what.
Instead of spending idle time looking for other projects to jump on, I'd spend as much time as I could in man pages, books, and coding up my own little concoctions. Even if they're stupid ones, you're likely to learn something, or even better - come up with another neat little idea you can spin off of it. Necessity is the mother of invention, so I try and figure out what it is I need, and then do it myself. That usually works. If you still can't think of anything, see if you can catch a vision for something someone else needs. I wouldn't touch anything that you're not 100% bought into and excited about for your first projects.
RealisticCanadian asks:
I myself have had numerous interactions with less-than-technically-savvy management-types. Any time I bring up solutions that are quite obviously a better technical and financial choice over software-giant-type solutions, conversation seems to hit a brick wall. The ignorance of these people on such topics is astounding, and I find many approaches I have tried seem to yield no results in the short term. "Well, yes, your example proves that we would save $500,000 per year using that Open Source solution. But we've decided to go the Microsoft (or what-have-you) route." With your track record, I can only assume you have found some ways to overcome this closed-mindedness.
Jonathan Responds:
I'm not so sure that I have convinced anyone open source was better so much as I've convinced people that other people's projects were better than what Microsoft had to offer, and that's not hard for anyone to accomplish. I can strongly justify some open source projects to people because they are already superior to their commercial counterparts, but there are also a lot of crummy projects out there that should be shot and put out of my misery. I'm not one to advocate a terribly written project, even if it is open source. The good projects can usually speak for themselves with only a little bit of yelling and biting from me. So if you want to become a respected open source advocate at your place of business, I'd say the first rule of thumb is not to try and advocate crap projects for the mere reason that they're open source. Advocating the good ones will help you build a reputation. It also helps if you read a lot of Dilbert so you'll understand the intellectual challenges you'll be facing.
Some other things that I've found can help include what managers love to call a "decision matrix" which is a spreadsheet designed to make difficult decisions for them. For your benefit, this should consist of a series of features and considerations that the competitor doesn't have, with a big stream of checkboxes down the row corresponding to your favorite open source project. Nobody's interested in knowing what the projects have in common anyway, so tell them (with visual cues) what features your open source solution has over the competitor. And if you really want to get your point across clearly to your manager, do the spreadsheet in OpenOffice so they'll have to download and install an open source project to read it.
Once you've done that, and if you're still employed by now, the next thing to put together is an ROI (return on investment) comparison, which not only addresses the costs of the different solutions, but costs to support both solutions in the long run, cost of inaccuracy (if this is a spam solution for example), cost of training, customizations, and resources to manage each product. This is a great opportunity to size machines and manpower and include that in a budget forecast. Many managers are sensitive to knowing just how much extra dough it's going to cost to implement the commercial solution. At the very least, you ought to be able to prove many commercial solutions don't actually make the company much money in the long run. If speaking of cash isn't enough to convince your manager then a full analysis of low-level technical aspects will be necessary. This is simply a dreadful process, and where most open source attempts fail - because a lot of people are just too lazy to learn about the technical details of both projects and complete their due diligence. If you take the time, though, you're likely to either convince your boss or utterly confuse him - either one is very satisfying.
The biggest challenge in justifying many open source projects I've run into is finding solid support channels that your boss can rely on if you get hit by a bus (or in his mind, fired). Support is, in many cases, a requirement but not all good open source projects see the benefit in offering support. A lot of companies are willing to pay just to have someone they can call when they have a problem. So if you can find a project that's got a pool of support you can draw out of, you can not only use that to justify the project to your manager, but kick a few bucks back into the open source community. I started offering support contracts for dspam primarily because people needed them in order to get the filter approved as a solution. I think I do a good job supporting my clients that do need help, but at least half of them just pay for a contract and never use it. I certainly don't have a problem with that, and it supports the project as well as the people investing time in it.
Goo.cc asks a two parter:
1. In your new book, you basically state that Bogofilter is not a bayesian filter, which was news to some of the Bogofilter people I have spoken to. Can you explain why you feel that Bogofilter is not a bayesian filter?
Jonathan Responds:
Bogofilter uses an alternative algorithm known as Fisher-Robinson's Chi-Square. Gary Robinson (Transpose) basically built upon Fisher's Inverse Chi-Square algorithm for spam filtering, which provided some competition for the previously widely accepted Bayesian approach to this. Therefore, Bogofilter is not technically a Bayesian filter. The term, "Bayesian", however is commonly a buzzword known to most people to describe statistical content filtering in general (even if it isn't Bayesian), and so Bogofilter often gets thrown into the same bucket. CRM114 is another good example of this - many people throw it in the same bucket as a Bayesian filter, but it is configured (by default, at least) to be a Markovian-based filter which is "almost entirely nothing like Bayesian filtering". Technically, CRM114 isn't a filter at all, but a filtering-language JIT compiler (it can be any filter). I cover all of these mathematical approaches in Ending Spam, so grab a copy if you're interested in learning about their specific differences.
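To make the distinction concrete, here is a minimal Python sketch contrasting a naive Bayesian combination of per-token spam probabilities with a Fisher-style inverse chi-square combination of the kind Robinson proposed. It is an illustration under stated assumptions, not Bogofilter's actual code: the function names (`chi2q`, `bayes_combine`, `fisher_combine`) and the sample probabilities are hypothetical, and every probability is assumed to lie strictly between 0 and 1.

```python
import math


def chi2q(x2, df):
    """Upper-tail probability of a chi-square value, for even degrees of freedom."""
    if df % 2 != 0:
        raise ValueError("df must be even")
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def bayes_combine(probs):
    """Naive Bayesian combination of per-token spam probabilities."""
    spam = math.prod(probs)
    ham = math.prod(1.0 - p for p in probs)
    return spam / (spam + ham)


def fisher_combine(probs):
    """Fisher-style combination: test the spam and ham hypotheses separately."""
    n = len(probs)
    # Evidence for spam: small (1 - p) values yield a large chi-square statistic.
    s = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    # Evidence for ham: small p values yield a large chi-square statistic.
    h = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    # Map the two indicators onto a single 0..1 score.
    return (1.0 + s - h) / 2.0


if __name__ == "__main__":
    tokens = [0.99, 0.98, 0.97]  # hypothetical per-token spam probabilities
    print(bayes_combine(tokens), fisher_combine(tokens))
```

Both functions return a score between 0 and 1, with exactly balanced evidence landing on 0.5. The difference lies in how the per-token evidence is pooled, which is why the two approaches are not the same algorithm even though both are often called "Bayesian".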
2. Bayesian filters have been around for some time now but there still seems to be no standardized testing methods for determining how well filters work in comparison to one another. Do you think that comparative testing would be useful and if so, how should it be performed?
Jonathan Responds:
Part of the reason there's no standardized testing methodology is because there's no standardized filter interface. A few individuals have attempted to build spam "jigs" for testing filters, but the bigger problem is really lack of an interface. About a year ago, the ASRG was reportedly working on developing such a standard - but as things usually turn out, it's an extremely long and painful process to get anything done when you've got a committee building it (take the mule, for instance, which was a horse built by a committee). This is probably why filter authors have also been hesitant to try and accommodate their filters to a particular testing jig. Incidentally, this is how I surmise that SPF could not have possibly made it through the ASRG - the fact that it made it out at all suggests that it never went in.
I think it's of some interest to compare the different filters out there, but it's also somewhat of a pointless process too. Since these systems learn, and learn based on the environment around them, only a simulation and not a test, will really identify the true accuracy of these filters - and even if you can build a rock solid simulation, it will only tell you how well each filter compared for the test subject's email. If we are to have a bake-off of sorts, it definitely ought to include ten or more different corpora from different individuals, from different walks of life. Even the best test out there can't predict how a filter might react to your specific mail, and for all we know the test subjects may have been secretly into ASCII donkey porn (which will, in fact, complicate your filtering).
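The difference between a one-shot test and a simulation across several corpora can be sketched in a few lines of Python. Everything here is an assumption made for illustration: `ToyFilter` is a hypothetical stand-in for a learning filter (it is not DSPAM or any other real product), the train-on-error policy is just one possible training mode, and the tiny corpus in the usage example is made up.

```python
from collections import Counter


class ToyFilter:
    """Hypothetical learning filter that compares token counts seen in spam and ham."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def classify(self, text):
        tokens = text.lower().split()
        spam_score = sum(self.spam[t] for t in tokens)
        ham_score = sum(self.ham[t] for t in tokens)
        return spam_score > ham_score  # True means "spam"

    def train(self, text, is_spam):
        target = self.spam if is_spam else self.ham
        target.update(text.lower().split())


def simulate(messages):
    """Replay one user's mail in order, training only on errors; return accuracy."""
    flt = ToyFilter()  # each user starts with an untrained filter
    correct = 0
    for text, is_spam in messages:
        if flt.classify(text) == is_spam:
            correct += 1
        else:
            flt.train(text, is_spam)
    return correct / len(messages)


def bake_off(corpora):
    """Run the simulation separately for every user's corpus."""
    return {name: simulate(messages) for name, messages in corpora.items()}


if __name__ == "__main__":
    corpora = {
        "alice": [
            ("buy pills now", True),
            ("lunch at noon", False),
            ("buy pills cheap", True),
            ("lunch tomorrow", False),
        ],
    }
    print(bake_off(corpora))
```

Reporting one accuracy figure per corpus, rather than a single pooled number, keeps visible how much the result depends on whose mail was used.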
This is why some people misunderstand my explanations of dspam's accuracy. All I've said in the past is "this is the accuracy I get", and "this is the accuracy this dude got". Which is the equivalent of "our lab mice ate this and grew breasts". There's no guarantee anybody else is going to get those results, though I'm sure many would try (with the mice, that is). In general, though, I try to publish what I think are good "average" levels for users on my own system, and they are usually around 99.5% - 99.8%. In other words: your mileage may vary. So try it anyway. Incidentally, I've been working with Gordon Cormack to try and figure out what the heck went wrong with his first set of dspam tests. So far, we've made progress and ran a successful test with an overall accuracy of 99.23% (not bad for a simulation).
What would be far more interesting to me would be a well-put-together bake-off between commercial solutions and open source solutions. The open source community around spam filtering really has got the upper hand in this area of technology, and I'm quite confident F/OSS solutions can blow away most commercial solutions in terms of accuracy (and efficacy).
Mxmasster asks:
Most antispam software seems to be fairly reactionary - whether it is based on keyword patterns, URLs, sender, IP, or the checksum of the message, a certain amount of spam has to first be sent and identified before additional messages will be tagged and blocked. SPF, DomainKeys, etc. require a certain percentage of the Internet to adopt them before they will be truly effective. What do you see on the horizon as the next big technique to battle spam? How will this affect legitimate users on the Internet?
Jonathan Responds:
That's the problem with most spam solutions, and why I wrote Ending Spam. Bayesian content filtering, commonly thrown into this mix, has the unique ability to grow out of your typical reactive state and become a proactive tool in fighting spam. I get about one spam per month now at the most, and DSPAM is learning many new variants of spam as it catches them; I'd call that pretty proactive. Spam, phishing, viruses, and even intrusion detection are all areas that can benefit greatly from this approach to machine learning. They will likely never become perfect, but these filters have the ability to not only adapt to new kinds of spam, but to also learn them proactively before they make it into your inbox. Some of this is done through what is called "unsupervised learning" and not traditional training, while other tools, such as message inoculation and honey pots, can help automate the sharing of new spam and virus strains before anyone has to worry about seeing them. We haven't thoroughly explored statistical analysis enough yet for there to be a "next big technique" beyond this. The next big techniques seem to be trying to change email permanently, and I don't quite feel excited about that. Statistical tools are where I think the technology is at and it needs to become commonplace and easier to set up and run.
The problem seems to be in the myth that statistical filtering is ineffective or incomplete. Many commercial solutions pass themselves off as statistical(ish) and seem to be contributing to this myth by failing to do justice to the levels of accuracy many of the true (and open source) statistical filters are reflecting. Any commercial solution that claims to be an adaptive, content-based solution (like Bayesian filters are) really ought to deliver better than 95% or 99% accuracy. Part of the problem is just bad marketing - most of these tools are not true "Bayesian" devices; they just threw a Bayesian filter in there somewhere so they could use the buzzword. Another problem is design philosophy and the idea that you need an arsenal of other, less accurate tests, to be bolted in front of the statistical piece. If you're going to train a Bayesian filter with something other than a human being, whatever it is that's training it ought to be at least as smart as a human being. Blacklist-trained Bayesian filters are being fed with about 60% accurate data (whereas a human is about 99.8% accurate). So it's no surprise to me that blacklist-trained filters are severely crippled - what a dumb combination. If you really want to combine a bunch of tools for identifying spam, build a reputation system instead. They do a very good job of cutting spam off at the border, are generally more scalable than content-based filtering, and most large networks can justify their accuracy by their precision.
Not all commercial content-based filters are junk. Death2Spam is one exception to this, and delivers around 99.9% accuracy, which is in the right neighborhood for a statistical filter. Not all reputation systems are junk either. CipherTrust's TrustedSource is one example of what I call a well-thought out system. If you must have a commercial solution, either of these I suspect will make you quite happy. As for (most of) the rest, quit screwing around and build something original that actually works.
Jnaujok asks:
The SMTP standard that we use for mail transfer was developed in the late 70's - early 80's and has, for the most part, never been updated. In that time period, the idea of hordes of spam flowing through the net wasn't even considered. It has always been the most obvious solution to me that what we really need is SMTP 2.0. Isn't it about time we updated the SMTP standard?
Jonathan Responds:
You're talking about an authenticated approach to email, and there have been many different standards proposed to do this. First let me say that, even though SMTP was drafted a few decades ago, it's still successful in performing its function, which is a public message delivery system - key word being public. There exist many private message delivery systems already, which you could opt to use, including bonded sender and even rolling your own using PGP signatures and mailbox rules. I have reservations about forcing such a solution on everybody and breaking down anonymity for the sake of preventing junk mail. Until you can sell a company like Microsoft on absolute anonymity in bonded sender and sell ISPs into putting up initial bonds for their customers (so that a ten-year-old grade-school student can still use email), I see a very large threat (especially by the government) in globalizing this as a replacement for the 'public' system. With services like Gmail, where you can store an entire life's worth of email, and the idea that everything you've ever said could be sufficiently traced back to you and used against you, I would rather deal with the spam. Why? Let me pull out my tinfoil hat...
It's been advertised plenty of times on Slashdot that Google stores everything about all of its queries. It wouldn't surprise me if they already have government contracts in place to perform data mining on specific individuals. How would you like, in the future, all of your email to be mined and correlated with other personal data to determine whether or not you should be allowed to fly? Buy a firearm? Rent a car? We're not very far off from that, and even less so once this correlation is made possible.
So abstract some level of anonymity at the ISP-level you say? That's just not going to happen. For one, that makes it just as simple for a spammer to abuse somebody's network and then we've gone and redesigned SMTP for no good reason. Remember, business has to be able to set up shop online fairly easily and spammers are a type of shop. So we are always going to balance between free enterprise and letting spammers roam on the network. Should we employ a CA, how much would it cost to run your own email server? More importantly - does this perhaps open the door for per-email taxes? I'd much rather just deal with spam the way we are now. For another thing, abstracted identity architectures would only give you a level of anonymity parallel to the level of anonymity you have when you purchase a firearm (where the forms are stored by your dealer, rather than filed to a central government agency). See how long it takes for the feds to trace your handgun back to you if you leave it at the scene of a crime.
You can't leave it in the ISP's control anyway. The sad truth is that most ISPs still don't care about managing outgoing spam on their network; so new spammers are being given a nurturing environment to break into this new and exciting business. I had a recent bout with XO Communications about one such new spammer who had run a full-blown business on their network since 1997 and recently decided he'd like to start spamming under the "CAN-SPAM" act (which he was convinced defended his right to spam). He included his phone number, address, and web address in the spam - I called him up and verified he was who he said he was (the owner of this business, and spamming). Provided all of this information (over a phone call) to the XO abuse rep (let's call him "Ted"), even filed a police report, and XO still to this day has done nothing. His site is even still there, selling the same crap he spams for. This happens every day at ISPs out there.
The consequences outweigh the benefits. The people who drafted the SMTP protocol probably thought of most of these issues too. A public system can't exist without the freedom to remain anonymous, ambiguous, and the right to change your virtual identity whenever the heck you like.
Sheetrock asks a two parter:
1. In the past, I've heard it suggested that anti-spam techniques often go too far, culling good e-mail with the bad and perhaps even curtailing 1st Amendment rights. Clearly this depends on what end of the spectrum you're on, but recent developments have given me pause for thought on the matter. For example, certain spam blacklists would censor more than was strictly necessary (a subjective opinion, I realize) to block a spammer -- sometimes blocking a whole Class C to get one individual. This would cause other innocent users in that net space to have their e-mail to hosts using the blacklists silently dropped without any option of fixing the problem besides switching ISPs.
Jonathan Responds:
A lot of blacklists have started taking on a vigilante agenda, or at the very least rather questionable ethical practices. Spamhaus' recent blacklisting of all Yahoo! Store URLs (and Paul Graham's website) is a prime example of this. As long as you're subscribed to human-operated blacklists, you're going to suffer from someone's politics. That's one of the reasons I coded up the RABL, which is a machine-automated blacklist. There is also another called the WPBL (weighted private block list). As the politics of the organizations running human-maintained lists get worse, I think more of these automated lists will start to pop up. Machine-automated blacklists don't have an agenda - they have a sensitivity threshold. It's much easier to find the right list with the right threshold than it is to find the right politics (and then keep tabs on them to make sure they don't change). The RABL, for example, measures network spread rather than number of complaints. If a spammer has affected more than X networks, they are automatically added to the system, and removed after being clear for six hours (no messy cleanup). Another nice thing about machine-automated blacklists is that they are really real-time blacklists, and capable of catching zombies and other such evils with great precision.
NOTE: I haven't had time yet to bring the RABL into full production, but am interested in finding more participants to bring us out of testing.
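The network-spread rule described above can be sketched in a few lines. This is only a toy illustration, not the RABL's actual code: the class name, the `network_threshold` parameter (the "X" in the answer), and the injectable `clock` are all assumptions made for the example. Only the "more than X networks" rule and the six-hour clear period come from the answer itself.

```python
import time


class SpreadBlacklist:
    """Toy network-spread blacklist (hypothetical names, not the RABL code)."""

    def __init__(self, network_threshold, clear_seconds=6 * 3600, clock=time.time):
        self.network_threshold = network_threshold  # the "X" in the text
        self.clear_seconds = clear_seconds          # six hours by default
        self.clock = clock                          # injectable for testing
        self.networks = {}     # source address -> set of reporting networks
        self.last_report = {}  # source address -> time of most recent report

    def _expire(self, source):
        # Forget a source once it has been clear for the whole clear period.
        last = self.last_report.get(source)
        if last is not None and self.clock() - last >= self.clear_seconds:
            self.networks.pop(source, None)
            self.last_report.pop(source, None)

    def report(self, source, reporting_network):
        """Record that reporting_network received spam from source."""
        self._expire(source)
        self.networks.setdefault(source, set()).add(reporting_network)
        self.last_report[source] = self.clock()

    def is_listed(self, source):
        """Listed when more than network_threshold distinct networks reported it."""
        self._expire(source)
        return len(self.networks.get(source, ())) > self.network_threshold
```

Note that the decision depends on how many distinct networks were hit, not on how many complaints arrived, so one noisy reporter cannot get a sender listed on its own.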
2. This is an extreme example, but most anti-spam approaches have the following characteristics: They are implemented on a mail server without fully informing the users of the ramifications (or really informing them at all). They block messages without notification to the sender, causing things to be silently dropped. Even if the recipient becomes aware of the problem, few or no options are given for the recipient to alter this "service".
Jonathan Responds:
I've run into issues like this with my ISP (Alltel), and I agree with a lot of what you're saying. In the case of Alltel, not only are they filtering inbound messages using blacklisting techniques and other approaches they don't care to tell me about, but they are filtering outbound messages as well. I had to eventually give up using their mail server because I could not adequately train my own spam filter (Alltel would block messages I forwarded to it). To make matters worse, there is no way to opt out of this type of filtering on their network, even though I offered to give them the IP address of my remote mail server. This clearly does affect their customers, and I feel there are censorship, violation of privacy and denial of service issues all going on here. (Somebody please sue them by the way).
Fortunately, I don't think this issue is as widespread as you might think. Many of the ISPs and colleges I've worked with are, unlike Alltel, very dedicated to ensuring that their tools only provide a way for their users to censor themselves. I think this ought to be a requirement for any publicly used system. Specifically...
1. The user must be able to opt in or out of all aspects of filtering
2. All filtering components and their general function must be fully disclosed
3. The user must be able to review and recover messages the system filtered
Opting out of RBLs is as easy as having two separate mail servers and homing on the box you want. I would strongly advise ensuring that your solution is capable of receiving instruction from a user to improve its results, though it is still very difficult to scale this to millions of users. At the very least, filtering should be fully disclosed, recoverable, and removable.
An Anonymous Coward asks:
Without going into the truths of the beliefs in question, which I'm sure will be debated enough in the Slashdot thread anyway (and I hope you'll join in), what do you think the reason is that so many scientists, nerds and people otherwise rather similar to you think your beliefs are obviously incorrect? Do you think they are all deluded? Do you agree that there might be a possibility that your beliefs are not rational?
Jonathan Responds:
The beliefs I hold as a Christian aren't always the popular ones, but they're certainly valid arguments for anyone who cares to ask about them (not that that has happened). When you read about someone's beliefs, you have the option to engage in discussion, or to filter his or her beliefs through your own belief system. The former option involves cognitive thought; the latter, however, is how most people today respond to anything that even smells religious. And I say this coming from the position of someone who hasn't tried to shove my beliefs down anyone's throat - I merely documented them on my personal website. That tells me that some people don't believe I have the right to my own beliefs - how asinine is that?
But to address the question, my beliefs aren't based on some religious intellectual suicide. In fact, the Bible teaches that you should know what you believe and why, and that you should even be prepared to give a defense for your faith - so the Bible encourages sound thinking and not some pontificated ideal structure as many quickly dismiss it as. I didn't dumb down when I became a Christian. In fact, it felt more like I began to think more clearly. I was raised in the same public school system as everyone else and didn't even know who Jesus Christ was until around my junior or senior year of high school. I've read from my early days in Kindergarten how "billions of years ago, when dinosaurs roamed the earth" and I've been taught the theory of evolution like everyone else. The problem, though, is that no matter how credible or not a particular area of science is, much of what is out there is taught based on authority. I find it very ironic to be flamed by anyone who thinks I'm an idiot for not believing in a theory that's never been proven by scientific process. It's recently become a "religious act" to question science in any capacity, but isn't questioning science the only way we can tell the good science from the bad science? And there is a lot of great science out there - even in public schools. But there's no longer a way for students to evaluate the credibility of what they're being taught. That seems to be degrading the quality of the subject. Science should be a quest for the truth, with no presuppositions, and appropriate understanding between hypotheses vs. theories vs. laws. When a theory is presented in the classroom as law and it's not held accountable to method, it's degenerated into mere conditioning.
I've spent a considerable amount of time studying topics such as the age of the earth and the theory of evolution, and I could probably argue it quite well if so inclined to engage in a discussion. That's important if you're going to believe anything really - including whatever the mainstreamed secular agenda happens to be.
Just as an example, I've recently looked into Carbon-14 dating and found that in cross-referencing it to Egyptian history (which dates back as far as 3500 B.C. and is held to be in very high regard by archaeologists and scientists alike), there is evidence that Carbon dating may be inaccurate beyond around 1800 B.C. For someone not to consider that would be ignoring science. My point here is that my beliefs aren't merely unfounded, eccentric ideas. Just because microevolution is feasible, that doesn't mean I'm going to sweep macroevolution under the rug and not test it - the two are actually worlds apart, just cleverly bundled. The Bible has given me a perspective that seems to offer a reasonable and sensible way to put the different pieces of good science together. No matter what you believe, I strongly feel that you should have some factual foundation to support whatever it is, and if you don't, then be man enough to admit you only have a theory put together.
No matter what side of the camp you are on, your beliefs require a certain amount of faith, as neither side is at present proven scientifically. I don't have all the answers, but I don't think science in its present state does either. At the end of the day, you can't prove the existence of God factually, and so whatever you believe is still based on faith. But at least the Christians can admit that - I just wish the evolutionists would too. -
Ask Jonathan Zdziarski
You may recognize the name Jonathan Zdziarski from a recent Slashdot book review of his book Ending Spam. Aside from his DSPAM spam filter Jonathan has also contributed several other projects to the open source community under the GNU General Public License. These projects include Verizon-Compatible SMIL Multimedia Gateway, The Reactive Automated Blackhole List Server, Apache DoS Evasive Maneuvers Module, and several others. Want to know how to effectively contribute projects to the open source community? Curious to ask another programmer about his history? Now is the time to ask. Moderators will select the top few questions that we will forward on to Jonathan sometime tomorrow. The answers to the questions will be displayed next Tuesday when we will encourage Jonathan to participate in the discussion as time permits. -
Ending Spam
Shalendra Chhabra writes "Jonathan Zdziarski has been fighting spam since before the first MIT spam conference in 2003, and has now released a full-on technical book, Ending Spam, on spam filtering. Ending Spam covers how the current and near-future crop of heuristic and statistical filters actually work under the hood, and how you can most effectively use such filters to protect your inbox." Read on for the rest of Chhabra's review.
Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification
author: Jonathan A. Zdziarski
pages: 312
publisher: No Starch Press
rating: 8
reviewer: Shalendra Chhabra
ISBN: 1593270526
summary: Very Good Book Covering Statistical Models and Techniques Implemented in Current Spam Filters
Spam (unsolicited commercial email) and phishing (fraudulent emails) are causing losses of billions of dollars to businesses. Many initiatives are currently underway for fighting this challenge. On the legal front, a Virginia court recently sentenced a prolific spammer, Jeremy Jaynes, to nine years in prison, and a Nigerian court sentenced a woman to two and a half years for phishing. Michigan and Utah have both passed laws creating "do-not-contact" registries in July/August 2005, covering e-mail addresses, instant messaging addresses and telephone numbers. Technical initiatives to fight spam include server- or client-side spam filtering, using Lists (Blacklists, Whitelists, Greylists), Email Authentication Standards (IIM, DK, DKIM, SPF, SenderID), and emerging sender reputation and accreditation services.
Ending Spam is the first book explaining the fine details of the theoretical models and machine-learning algorithms implemented in these filters. The book is divided into three parts: introduction to spam filtering, fundamentals of statistical filtering, and advanced concepts of statistical filtering.
The first section of the book discusses the history of spam, spam kings, different approaches for fighting spam such as blacklisting, whitelisting, heuristic filtering, challenge response, throttling, collaborative filtering, Authenticated SMTP, Sender Policy Framework and SenderID, spammer fingerprinting, etc. However, the author omitted any mention of locality-sensitive hash functions (such as Nilsimsa Hash) to counter spammers' random insertion of words, the use of CAPTCHA (Completely Automated Public Turing Test to Tell Computers and Humans Apart), Greylisting, Identified Internet Mail, and Domain Keys (now Domain Keys Identified Mail).
In the next chapter, the author clearly explains various components of a Language Classifier Pipeline, including the Historical Dataset (aka wordlist, database, dictionary, filter memory), Tokenizer, and the Analysis Engine with its feedback loop. However, the process flow of a language classifier could have been more generalized, e.g. incorporating an initial text-to-text transformer. This chapter also covers the advantages and disadvantages of various training modes for filters, such as Train Everything (TEFT), Train-on-Error (TOE), and Train Until No Errors (TUNE). This part concludes with the description of Paul Graham's famous spam-filtering technique using Bayesian classification (as described in "A Plan for Spam"), Gary Robinson's Geometric Mean Test, Fisher-Robinson's Inverse Chi-Square (including the source code for the inversion function), and some other tricks for optimizing spam-filtering accuracy.
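For readers who have not met the three combining rules named above, here is a minimal sketch of each as they are commonly published. It is not the book's source code: the function names are made up for this example, and every per-token probability is assumed to lie strictly between 0 and 1 (real filters clamp values to guarantee this).

```python
import math


def graham_combine(probs):
    """Combination from "A Plan for Spam": prod(p) / (prod(p) + prod(1 - p))."""
    spam = math.prod(probs)
    ham = math.prod(1.0 - p for p in probs)
    return spam / (spam + ham)


def geometric_mean_combine(probs):
    """Robinson's geometric mean test, rescaled from [-1, 1] to [0, 1]."""
    n = len(probs)
    p = 1.0 - math.prod(1.0 - x for x in probs) ** (1.0 / n)  # "spamminess"
    q = 1.0 - math.prod(probs) ** (1.0 / n)                   # "hamminess"
    return (1.0 + (p - q) / (p + q)) / 2.0


def chi2q(x2, v):
    """Inverse chi-square: P(chi-square >= x2) for an even number v of degrees of freedom."""
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def fisher_robinson_combine(probs):
    """Fisher-Robinson inverse chi-square combination; 0.5 means "unsure"."""
    n = len(probs)
    spamminess = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    hamminess = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (1.0 + spamminess - hamminess) / 2.0
```

All three return a score near 1 for spam-like token sets and near 0 for ham-like ones. The chi-square variant tends to stay close to 0.5 when the evidence conflicts, which is what makes an explicit "unsure" band practical.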
The second part of this book deals with the fundamentals of statistical filtering. The author explains HTML and Base64 encoding, followed by a detailed description of tokenization techniques (e.g. Sparse Binary Polynomial Hashing). Then there's a discussion of the various tricks that spammers use for penetrating filters. Although these tactics are mentioned in John Graham-Cumming's "Spammers Compendium," Jonathan has very elegantly explained why some tricks work for spammers and some don't. This part concludes by addressing some of the resource, storage and scaling concerns raised by the large number of features generated from tokenization techniques.
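HTML and Base64 matter to a filter because the text has to be decoded before it can be tokenized. The following is a rough sketch using only the Python standard library, not anything taken from the book. The helper name `visible_text` is hypothetical, and the sketch assumes each part declares a charset that Python recognizes.

```python
from email import message_from_string
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the character data found between HTML tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_text(raw_message):
    """Return the decoded text of every text/* part of a raw message."""
    message = message_from_string(raw_message)
    texts = []
    for part in message.walk():
        if part.get_content_maintype() != "text":
            continue
        # decode=True undoes the Content-Transfer-Encoding (e.g. base64).
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        text = payload.decode(charset, errors="replace")
        if part.get_content_subtype() == "html":
            extractor = _TextExtractor()
            extractor.feed(text)
            extractor.close()
            text = " ".join(extractor.chunks)
        texts.append(text)
    return "\n".join(texts)
```

Dropping the markup entirely is a simplification. A production filter may well keep some tag or attribute information as tokens of its own.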
The third part of this book deals with advanced concepts of statistical filtering. This includes the testing criteria for measuring accuracy of an email filter, and some advanced tokenization concepts, e.g. chained tokens (taking word-pairs and phrases into account, instead of individual words) generated using a sliding 5-byte window as mentioned in Sparse Binary Polynomial Hashing. The next chapter describes the Markovian Model implemented in the CRM114 Discriminator, but the author fails to describe different weighting schemes for features implemented in the Markovian-based version of CRM114. The author then describes the Bayesian Noise Reduction Technique for purging "out of context" data from the mail text. This chapter concludes with a very nice summary of collaborative algorithms and techniques, such as Message Inoculation, Streamlined Blackhole List, Fingerprinting, Automatic Whitelisting, URL Blacklisting, and Honeypot email addresses for snaring spammers' address harvesting bots.
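To make the tokenization ideas concrete, here is a small sketch of chained tokens and of an SBPH-style feature generator. It is illustrative only. The function names, the `+` joiner and the `<skip>` placeholder are assumptions for the example, and the window slides over word tokens here, which is also an assumption rather than a description of DSPAM or CRM114 internals.

```python
def chained_tokens(words):
    """Return the single words followed by every adjacent word pair."""
    pairs = [f"{a}+{b}" for a, b in zip(words, words[1:])]
    return list(words) + pairs


def sbph_features(words, window=5):
    """SBPH-style features: for each position, keep the newest word and
    every subset of the (up to) window - 1 words that precede it."""
    features = []
    for end, newest in enumerate(words):
        history = words[max(0, end - window + 1):end]
        # Each bit of mask decides whether one earlier word is kept or skipped.
        for mask in range(1 << len(history)):
            kept = [
                word if mask & (1 << i) else "<skip>"
                for i, word in enumerate(history)
            ]
            features.append(" ".join(kept + [newest]))
    return features
```

A full five-word window yields 16 features per position, which is why storage and scaling become a concern with this style of tokenization.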
The most interesting part of this book is the appendix, where the author presents interviews with John Graham-Cumming of POPFile, Brian Burton of SpamProbe, Marty Lamb of TarProxy, Bill Yerazunis of CRM114 Discriminator, and Jonathan Zdziarski of DSPAM (himself). I loved this section.
The salient points of the book: it's very easy to read; each chapter begins with a very thought-provoking introduction, and concludes with a crisp "final thoughts" section. There are very few technical errors in this printing, and the illustrations are of good quality. Since the book is geared more toward the Bayesian and statistical generation of spam filters, the absence of certain spam-busting technologies is acceptable. However, a noticeable omission is the lack of discussion about measuring spam-filter accuracy, and what impact this has on setting filtration thresholds. A section on the economics of tradeoffs, and the use of a Receiver Operating Characteristic curve (ROC) would have been very helpful.
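As a rough illustration of the threshold tradeoff mentioned above, the sketch below computes the points of a ROC curve from filter scores. The function name and the convention that a message is flagged when its score is at or above the threshold are assumptions for this example. It also assumes the sample contains at least one spam and one ham message.

```python
def roc_points(scores, labels, thresholds):
    """Return (threshold, false_positive_rate, true_positive_rate) tuples.

    scores: filter output per message, higher meaning more spam-like.
    labels: True for spam, False for ham.
    A message is flagged as spam when its score is >= the threshold.
    """
    spam_total = sum(1 for is_spam in labels if is_spam)
    ham_total = len(labels) - spam_total
    points = []
    for threshold in thresholds:
        true_pos = sum(
            1 for s, is_spam in zip(scores, labels) if is_spam and s >= threshold
        )
        false_pos = sum(
            1 for s, is_spam in zip(scores, labels) if not is_spam and s >= threshold
        )
        points.append((threshold, false_pos / ham_total, true_pos / spam_total))
    return points
```

Raising the threshold lowers the false-positive rate (good mail lost) at the cost of the true-positive rate (spam caught). Plotting the second value against the third for many thresholds gives the curve.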
Overall, by putting together Ending Spam, Jonathan Zdziarski has made another significant contribution (after DSPAM) to the anti-spam community. Whether you are a system administrator, anti-spam researcher, engineer or a newbie interested in fighting spam, this book is a great reference.
William S Yerazunis and Richard Jowsey also contributed to this review. Shalendra Chhabra is a graduate student in the Department of Computer Science and Engineering at the University of California, Riverside. He is on the development team of CRM114 Discriminator and has presented his work at the MIT Spam Conference 2005, Cisco Systems, and Stanford University. You can purchase Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification from bn.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page. -
Ending Spam
Shalendra Chhabra writes "Jonathan Zdziarski has been fighting spam since before the first MIT spam conference in 2003, and has now released a full-on technical book, Ending Spam, on spam filtering. Ending Spam covers how the current and near-future crop of heuristic and statistical filters actually work under the hood, and how you can most effectively use such filters to protect your inbox." Read on for the rest of Chhabra's review.
Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification
author: Jonathan A. Zdziarski
pages: 312
publisher: No Starch Press
rating: 8
reviewer: Shalendra Chhabra
ISBN: 1593270526
summary: Very Good Book Covering Statistical Models and Techniques Implemented in Current Spam Filters
Spam (unsolicited commercial email) and phishing (fraudulent emails) are causing losses of billions of dollars to businesses. Many initiatives are currently underway for fighting this challenge. On the legal front, a Virginia court recently sentenced a prolific spammer, Jeremy Jaynes, to nine years in prison, and a Nigerian court sentenced a woman to two and a half years for phishing. Michigan and Utah have both passed laws creating "do-not-contact" registries in July/August 2005, covering e-mail addresses, instant messaging addresses and telephone numbers. Technical initiatives to fight spam include server- or client-side spam filtering, using Lists (Blacklists, Whitelists, Greylists), Email Authentication Standards (IIM, DK, DKIM, SPF, SenderID), and emerging sender reputation and accreditation services.
Ending Spam is the first book explaining the fine details of the theoretical models and machine-learning algorithms implemented in these filters. The book is divided into three parts: introduction to spam filtering, fundamentals of statistical filtering, and advanced concepts of statistical filtering.
The first section of the book discusses the history of spam, spam kings, different approaches for fighting spam such as blacklisting, whitelisting, heuristic filtering, challenge response, throttling, collaborative filtering, Authenticated SMTP, Sender Policy Framework and SenderID, spammer fingerprinting, etc. However, the author omitted any mention of locally-sensitive hash functions (such as Nilsimsa Hash) to counter spammers' random insertion of words, the use of CAPTCHA (Completely Automated Public Turing Test to Tell Computers and Humans Apart), Greylisting, Identified Internet Mail, and Domain Keys (now Domain Keys Identified Mail).
In the next chapter, the author clearly explains the components of a Language Classifier Pipeline, including the Historical Dataset (aka wordlist, database, dictionary, filter memory), the Tokenizer, and the Analysis Engine with its feedback loop. However, the process flow of a language classifier could have been more generalized, e.g. by incorporating an initial text-to-text transformer. This chapter also covers the advantages and disadvantages of various training modes for filters, such as Train Everything (TEFT), Train-on-Error (TOE), and Train Until No Errors (TUNE). This part concludes with a description of Paul Graham's famous spam-filtering technique using Bayesian classification (as described in "A Plan for Spam"), Gary Robinson's Geometric Mean Test, Fisher-Robinson's Inverse Chi-Square (including the source code for the inversion function), and some other tricks for optimizing spam-filtering accuracy.
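To make the combining techniques named above more concrete, here is a minimal Python sketch. It is not taken from the book or from any filter's source. The function names are hypothetical, and the inputs are assumed to be per-token spam probabilities already clamped strictly between 0 and 1. It shows Graham-style combination and a chi-square combination in the Fisher-Robinson style, using the usual even-degrees-of-freedom series for the inverse chi-square function.

```python
import math

def graham_combine(probs):
    """Combine per-token spam probabilities in the style of "A Plan for Spam"."""
    prod_spam = math.prod(probs)
    prod_ham = math.prod(1.0 - p for p in probs)
    return prod_spam / (prod_spam + prod_ham)

def chi2q(x2, dof):
    """Probability that a chi-square variable with an even dof exceeds x2."""
    if dof % 2 != 0:
        raise ValueError("this series only holds for even degrees of freedom")
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def fisher_robinson(probs):
    """Chi-square combining: returns an indicator near 1 for spam, near 0 for ham."""
    n = len(probs)
    spamminess = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    hamminess = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (1.0 + spamminess - hamminess) / 2.0

if __name__ == "__main__":
    tokens = [0.99, 0.99, 0.01]  # made-up token probabilities
    print(graham_combine(tokens))
    print(fisher_robinson(tokens))
```

One design difference worth noting: when the evidence is balanced, the chi-square indicator settles near 0.5, which lets a filter report "unsure" instead of forcing a verdict.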
The second part of this book deals with the fundamentals of statistical filtering. The author explains HTML and Base64 encoding, followed by a detailed description of tokenization techniques (e.g. Sparse Binary Polynomial Hashing). Then there's a discussion of the various tricks that spammers use for penetrating filters. Although these tactics are mentioned in John Graham-Cumming's "Spammers Compendium," Jonathan has very elegantly explained why some tricks work for spammers and some don't. This part concludes by addressing some of the resource, storage and scaling concerns raised by the large number of features generated from tokenization techniques.
The third part of this book deals with advanced concepts of statistical filtering. This includes the testing criteria for measuring accuracy of an email filter, and some advanced tokenization concepts, e.g. chained tokens (taking word-pairs and phrases into account, instead of individual words) generated using a sliding 5-byte window as mentioned in Sparse Binary Polynomial Hashing. The next chapter describes the Markovian Model implemented in the CRM114 Discriminator, but the author fails to describe the different weighting schemes for features implemented in the Markovian-based version of CRM114. The author then describes the Bayesian Noise Reduction Technique for purging "out of context" data from the mail text. This chapter concludes with a very nice summary of collaborative algorithms and techniques, such as Message Inoculation, Streamlined Blackhole List, Fingerprinting, Automatic Whitelisting, URL Blacklisting, and Honeypot email addresses for snaring spammers' address harvesting bots.
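The chained-token idea can be illustrated with a short Python sketch. This is a simplified illustration under stated assumptions, not SBPH itself and not DSPAM's or CRM114's tokenizer. The function name, the `+` joiner, the word pattern, and the `window` parameter are all hypothetical choices made for the example.

```python
import re

def chained_tokens(text, window=2):
    """Return single-word tokens plus word pairs found within a sliding window.

    window is the number of consecutive words a pair may span, so window=2
    yields adjacent pairs only.
    """
    words = re.findall(r"[A-Za-z0-9$'!-]+", text.lower())
    tokens = list(words)
    for i in range(len(words)):
        # Pair word i with each later word that still falls inside the window.
        for j in range(i + 1, min(i + window, len(words))):
            tokens.append(words[i] + "+" + words[j])
    return tokens

if __name__ == "__main__":
    print(chained_tokens("Buy cheap pills"))
    print(chained_tokens("Buy cheap pills", window=3))
```

Pairs such as `buy+cheap` give the classifier context that the individual words lack, at the cost of a much larger feature set, which is the storage and scaling concern the review raises for the second part of the book.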
The most interesting part of this book is the appendix, where the author presents interviews with John Graham-Cumming of POPFile, Brian Burton of SpamProbe, Marty Lamb of TarProxy, Bill Yerazunis of CRM114 Discriminator, and Jonathan Zdziarski of DSPAM (himself). I loved this section.
The salient points of the book: it's very easy to read; each chapter begins with a very thought-provoking introduction, and concludes with a crisp "final thoughts" section. There are very few technical errors in this printing, and the illustrations are of good quality. Since the book is geared more toward the Bayesian and statistical generation of spam filters, the absence of certain spam-busting technologies is acceptable. However, a noticeable omission is the lack of discussion about measuring spam-filter accuracy, and what impact this has on setting filtration thresholds. A section on the economics of tradeoffs, and the use of a Receiver Operating Characteristic (ROC) curve, would have been very helpful.
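As a rough illustration of the threshold tradeoff mentioned above, the following Python sketch computes points on a ROC curve from filter scores. The scores, labels, and thresholds are made-up example data, and the function name is hypothetical. It assumes the test set contains at least one spam and one legitimate message.

```python
def roc_points(scores, labels, thresholds):
    """Return (threshold, false_positive_rate, true_positive_rate) tuples.

    scores: spam scores in [0, 1]; labels: True for spam, False for legitimate mail.
    A message is flagged as spam when its score is at or above the threshold.
    """
    spam_total = sum(1 for is_spam in labels if is_spam)
    ham_total = len(labels) - spam_total
    points = []
    for t in thresholds:
        true_pos = sum(1 for s, is_spam in zip(scores, labels) if is_spam and s >= t)
        false_pos = sum(1 for s, is_spam in zip(scores, labels) if not is_spam and s >= t)
        points.append((t, false_pos / ham_total, true_pos / spam_total))
    return points

if __name__ == "__main__":
    scores = [0.95, 0.80, 0.60, 0.40, 0.10]    # hypothetical filter outputs
    labels = [True, True, False, True, False]  # hypothetical ground truth
    for point in roc_points(scores, labels, [0.5, 0.9]):
        print(point)
```

In this toy data, raising the threshold from 0.5 to 0.9 removes the false positive but also lets more spam through. Choosing between those outcomes is the economic tradeoff the review says the book leaves out.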
Overall, by putting together Ending Spam, Jonathan Zdziarski has made another significant contribution (after DSPAM) to the anti-spam community. Whether you are a system administrator, anti-spam researcher, engineer or a newbie interested in fighting spam, this book is a great reference.
William S Yerazunis and Richard Jowsey also contributed to this review. Shalendra Chhabra is a graduate student in the Department of Computer Science and Engineering at the University of California, Riverside. He is on the development team of the CRM114 Discriminator and has presented his work at the MIT Spam Conference 2005, Cisco Systems, and Stanford University. You can purchase Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification from bn.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page. -
Hacking the Motorola E815
Nuclear Elephant writes "With Verizon's release of the Motorola E815, an improved and EVDO-capable version of the former v710, many v710 enthusiasts have started hacking at it to see what it can do. Surprisingly, many of the hacks which previously did not work on the v710 (such as those from the failed OBEX Hacking Contest) now suddenly do work on the E815, and other hacks which did work on the v710 still work. New hacks include the ability to activate OBEX (Object Exchange) for transferring pictures and music to/from the handset via Bluetooth, activating the disabled DUN (Dialup Networking) profile which was previously available on its predecessor, and modding the web browser to allow using any homepage and proxy. One cool existing hack involves the ability to use an alternative PIX messaging gateway to send multimedia, made possible by the reverse engineering of the MMS protocol used. A complete list of modifications can be found here. An unofficial challenge to enable OPP (Object Push Profile) is underway with a prize of $500. Unlike the v710, the Motorola E815 was Bluetooth qualified for both OPP and OBEX." -
Hormel Back on The Spam Offensive
Anonymous Howard writes "After an apparent setback in litigation, Hormel Foods is again pursuing actions against entities and organizations over the 'spam' trademark. According to the web site of DSPAM, an open-source statistical anti-spam filter, "Anti-spam software manufacturers may be in for a rude awakening. Hormel Foods Corporation and Hormel Foods LLC have recently filed for extensions to oppose or to cancel many new and existing spam-related trademarks and are even filing a few technology trademarks of their own. The DSPAM project, a popular open source and freely available spam filtering application, has already received two such notices of opposition from the trademark trial and appeal board. The complete history can be viewed here. This came about a year after the software's user community scrounged up the fee to file for a trademark..."" -
DSPAM 3.4 + SBL 1.0 Released
Nuclear Elephant writes "After a grueling five months of development, DSPAM Version 3.4 has been officially released. Major changes include full LMTP support, client/server support (dspamc), Bayesian Noise Reduction v2.0 technology (as previously introduced at this year's spam conference), many improvements to speed and accuracy, and support for the Streamlined Blackhole List, a machine-automated true-time collaborative blacklist that has piqued the interest of filter authors from other filtering projects including Death2Spam and Bogofilter. Version 1.0 of the streamlined blackhole list client/server software was also released this weekend. If testing spam filters is your cup of tea, some testing tips for this new version (and statistical filters in general) have also been published on the website." -
V710 Hacker Reward Program Unsuccessful
maxofthewell points to the announcement at the top of ""Regretfully, the OBEX hacker's contest for the Motorola v710 was unsuccessful. As of the contest's deadline (January 3, 2005) nobody has stepped forward to claim the prize. Many useful inventions and modifications came out of this effort." Full report here." -
Class Action Filed Against Verizon Wireless
Nuclear Elephant writes "Kirtland & Packard has filed a California-based class action suit against Verizon Wireless alleging that some of its handsets were advertised as having certain features, only for customers to find later that those features had been crippled for profit. With the Motorola Bluetooth Hacker's Contest ending unsuccessfully, many have taken this opportunity as a last-ditch effort to change things at Verizon." We mentioned the Verizon/Bluetooth episode earlier. -
RCA / Thomson Modem Hack Discovered
An anonymous reader writes "Those unemployed modem hackers are at it again. The group known as TCNiSO has released a very interesting hardware modification for RCA / Thomson cable modems. The modification is done by grounding the bus clock on the serial EEPROM, which throws the device into a diagnostic panic mode. Then, by using the debug tools from the embedded console to reprogram the EEPROM, a user can permanently enable a developer's menu which gives complete control of the modem, such as modifying the hardware addresses or flashing new firmware. Now if only these guys can figure out how to enable the Bluetooth features on my v710 phone..." -
DSPAM v3.2 Released
Nuclear Elephant writes "After four months of development DSPAM v3.2 has been released, bringing many new enhancements and filtering technologies. These include distributed computing support, implementation of Bill Yerazunis' Sparse Binary Polynomial Hashing algorithm (from CRM114), and v1.2 of Bayesian Noise Reduction. Other enhancements include SQLite support and many significant performance enhancements for PostgreSQL. DSPAM's official release is next week, but you can download the preview release now. Users of the project have also contributed towards creating a new logo for this release." -
DSPAM v3.2 Beta-1 Released
Nuclear Elephant writes "After three months of development, the first public beta of DSPAM v3.2 has been released for testing. New features include SQLite support, a Win32 build supplement, an extensions API, and some advanced new processing functionality such as Bill Yerazunis' (CRM114) Sparse Binary Polynomial Hashing and v1.2 of the author's Bayesian Noise Reduction Logic. Accuracy in 3.x has reportedly peaked as high as 99.991% (2 errors in 22,786 messages). Grab the new copy and participate in the request for feedback." -
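The accuracy figure quoted above follows directly from the error count. A minimal Python check, with a hypothetical function name:

```python
def accuracy_percent(errors, total):
    """Percentage of messages classified correctly, given an error count."""
    return 100.0 * (1.0 - errors / total)

if __name__ == "__main__":
    # 2 errors in 22,786 messages, as reported in the announcement.
    print(round(accuracy_percent(2, 22786), 3))
```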
The File Sharing Report
An anonymous reader writes "In July, Slashdot posted an article about the file sharing experiment, which was a database where users could report items they've purchased as a result of file sharing. The author has completed the experiment and written a report outlining the results. He offers the philosophy that file sharing is a result of the industry's failure to meet the business models demanded by today's consumer, and provides many suggestions to the various industries on how to take advantage of the market emerging from file sharing to generate revenue." -
Motorola Hacker Rewards Program
Nuclear Elephant writes "Pen Computing Magazine recently ran an article about the Motorola v710, which has been crippled by Verizon. A hacking contest is now underway, and the pot is steadily growing. The first hacker to provide a hack (or instructions) to enable OBEX and OPP features on the handset before Jan 1 wins the pot. See the official site for more information." We mentioned this phone a few days ago. -
Verizon Crippled Bluetooth Features in Motorola V710
djdoubles writes "Apparently Verizon Wireless has put firmware with crippled Bluetooth features in the new Motorola v710 phone. A lot of people have been anticipating a Bluetooth phone from Verizon, only to be disappointed by lack of OBEX. Verizon says they have no plan to add OBEX because it doesn't fit their business model--greedy bastards. PC Magazine doesn't have very nice things to say either. More discussion here." -
The File Sharing Database
Nuclear Elephant writes "The File Sharing Database is an online record of things users wouldn't have bought if they hadn't downloaded them (in whole or in part) first, and therefore tracks sales as a direct result of file sharing. The RIAA and MPAA claim that file sharing hurts sales, but some recent figures show that file sharing works FOR the industry. This database sets out to prove it once and for all. So if you've ever bought something you downloaded, roll on over and add it to the database." -
Response to Gordon Cormack's Study of Spam Detection
Nuclear Elephant writes "In light of Gordon Cormack's Study of Spam Detection recently posted on Slashdot, I felt compelled to architect an appropriate response to Cormack's technical errors in testing which ultimately explain why one of the world's most accurate spam filters (CRM114) could possibly end up at the bottom of the list, underneath SpamAssassin. I spend some time explaining what is a correct test process and keep my grievances simplified about the shortcomings of Cormack's research." -
DSPAM v3.0 RC1 Spam Filter Released
Nuclear Elephant writes "DSPAM v3.0 RC1 is now available for download, with a stable release scheduled for June 13. DSPAM has appeared on Slashdot and in Wired News in the past for its high levels of accurate spam filtering. v3.0 is the product of three solid months of work. Some of the highlights include a very sleek redesigned interface, PostgreSQL support, many mathematical enhancements, and support for many of Gary Robinson's algorithms (such as Chi-Square, Geometric Mean Test, and Robinson's technique for combining P-Values)." -
AOL Blocking Spammers' Web Sites
Nuclear Elephant writes "According to this article, AOL has decided to take a fresh approach to fighting spam and is now blocking spammers' web addresses. The philosophy is that if customers can't visit spammers' sites, spammers will not be able to make any money. On a side note, I suggested this concept about six months ago, but nobody thought ISPs would adopt it. Now perhaps we can get a group like NANOG interested in sponsoring a blacklist for spammer addresses?" -
DSPAM v2.10 Released
Nuclear Elephant writes "DSPAM v2.10 is finally available, after four months of development. This is the first stable release to include Bayesian Noise Reduction, which was recently mentioned on Slashdot and in Wired News as an algorithm providing accuracy levels as high as 10x that of a human. Some other new features include Neural Networking - which finds nodes in a network that are contextually similar to form a decision matrix, Global Filtering - which provides SpamAssassin-like out-of-the-box filtering for new users until they build up their own wordlist, Automatic Whitelisting - which automatically learns who your trusted senders are, and many other optimizations and enhancements. Head on over and download the latest tarball." -
Getting Better Battery Life w/ Linux?
Nuclear Elephant asks: "After a little hacking, Linux has been running great on my Thinkpad T30 for about a year now. I can talk to my cellphone and Bluetooth devices, do all kinds of neat hacking on wireless, and do just about everything you'd expect to be able to do on a Windows machine, except make the battery last. Even after the standard optimizations (like cpufreq, laptop_mode, brightness, turning off useless processes, etc.) my battery still only lasts about an hour running under Linux as opposed to 2 1/2 hours in Windows. Has anybody come up with some innovative battery conservation ideas for Linux? It seems to be the only thing lacking in this fine operating system." What kernel options might one look into for saving laptop battery power? Also, what desktop settings (both for GNOME and KDE) would work best in this situation? -
Two Spam Filters 10 Times As Accurate As Humans
Nuclear Elephant writes "The authors of two spam filters, CRM114 and DSPAM, announced recently that their filters have achieved accuracy rates ten times better than a human is capable of. Based on a study by Bill Yerazunis of CRM114, the average human is only 99.84% accurate. Both filters are reported to have reached accuracy levels between 99.983% and 99.984% (1 misclassification in 6250 messages) using completely different approaches (CRM114 touts Markovian, while DSPAM implements a Dolby-type noise reduction algorithm called Dobly). If you're looking for a way to rid your inbox of spam, roll on over to one of these authors' websites."
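
The "ten times" figure in the summary above follows from comparing error rates rather than accuracy percentages. The sketch below is only an illustration of that arithmetic, using the percentages quoted in the summary; the helper names `error_rate` and `messages_per_error` are made up for this example and do not come from CRM114 or DSPAM.

```python
from fractions import Fraction


def error_rate(accuracy_percent: str) -> Fraction:
    """Misclassification rate for an accuracy given as a percent string.

    Hypothetical helper for illustration; exact fractions avoid
    floating-point rounding in the comparison.
    """
    return 1 - Fraction(accuracy_percent) / 100


def messages_per_error(accuracy_percent: str) -> Fraction:
    """Average number of messages handled per misclassification."""
    return 1 / error_rate(accuracy_percent)


# Figures quoted in the summary: human 99.84%, filters up to 99.984%.
human = messages_per_error("99.84")
best_filter = messages_per_error("99.984")

print(human)                # 625
print(best_filter)          # 6250
print(best_filter / human)  # 10
```

At the lower quoted bound of 99.983%, the same calculation gives roughly one error in 5,882 messages, so "ten times" describes the upper end of the reported range.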