Because a real competition involves 50-60 problems and leaves it up to the developer (or team of developers) to pick the ones they want to tackle. A true and fair competition actually has the goal of figuring out who the best coders are rather than using people to solve the organizer's problems. Google should have advertised it as "we're using you" rather than "let's see who the best coders are", as there is no way to have a competition based on 3 problems. I'm sure by this point you're probably thinking I'm just bitter - but quite the contrary; I've won several *real* competitions before and didn't bother entering this one because I saw how obvious it was that Google was just using people, and stayed out of it.
$20,000 - think about that...that's one fifth of the salary one good programmer makes...to solve a significant problem they would've otherwise had to pay a small team of developers full salaries for. 5,000 developers for a mere $20,000: a mere $4 per developer. You do the math.
They helped Google figure out how to do some things that apparently its own programmers could not do...free consulting. Don't have the cash on hand to hire 5,000 programmers? Hold a contest and generate IP for free!
And how the hell is this newsworthy? It's not...Slashdot has once again gone a little further downhill in the quality of its news; this time so that we'll look at their pretty new banners and earn them double the cash every time we look at an article. Now that Slashdot is obviously heading in the commercial direction, maybe someone ought to consider putting together a new geek news site?
Lenslet Optical Processor: $123801238
IBM Thinkpad to Put it into: $2000
Not having to worry about DVD frameskip because you have no room for a DVD player: Priceless
> I don't understand how "server side" bayesian filtering would work.
We use a web-based quarantine box that you can periodically skim over for false positives (you can even set up keywords that'll highlight potential false positives in yellow). You can change this behavior, though, if you want it to deliver the messages with a spam header instead or if you want some other way to manage them.
I've heard rumors that SA is heading more in the direction of probability-based filtering rather than score-based filtering; disabling all your negative rules seemed to be quite a step. I still think converting all the rules into a "Tokenized Ruleset" for probability-based filtering would be a better solution...you could have either a separate calculation for content and rules, or you could reserve N slots in your single calculation for rules-based calculations. If you go with the standard 15-token Bayesian calculation, open that up to 30 and allow up to 15 rules to get thrown in there. If there aren't enough interesting rules for a message, allow tokens to take the extra slots.
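To make the slot idea concrete, here's a rough sketch of how I picture it. Everything in it is an assumption for illustration: the function name, the `min_interest` cutoff for what counts as an "interesting" rule, and the idea that each rule hit has already been turned into a probability. It's not how SA or any other tool actually does it.

```python
def pick_slots(token_probs, rule_probs, token_slots=15, rule_slots=15,
               min_interest=0.1):
    """Choose which probabilities enter a single combined calculation.

    token_probs / rule_probs are lists of spam probabilities (0..1).
    Rules get up to `rule_slots` slots; any rule slots left unused are
    handed over to tokens. `min_interest` is a made-up threshold.
    """
    def interest(p):
        # How far a probability sits from the neutral 0.5 mark.
        return abs(p - 0.5)

    rules = sorted((p for p in rule_probs if interest(p) >= min_interest),
                   key=interest, reverse=True)[:rule_slots]
    spare = rule_slots - len(rules)
    tokens = sorted(token_probs, key=interest, reverse=True)[:token_slots + spare]
    return tokens + rules


# One interesting rule out of three: tokens inherit the 14 unused rule slots.
picked = pick_slots([0.99] * 40, [0.95, 0.5, 0.52])
```

With only one interesting rule, the tokens end up with 29 of the 30 slots; with 15 or more interesting rules, they stay at 15.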
Anyhow, I think the big problem most people have with Bayesian is that they don't do any research on it...they just look at the surface, maybe read a few flames about it...there is so much work going on in the bayesian field you can't just discount it without a fair examination...especially when you've got things like Chained Tokens (we discussed those before), inoculation, functional groups, merged dictionaries, etc.
Anyhow, we can pick this up via email... I'm sure we'll both get "Off topic" troll scores from this.
> The people who run bayesian filters are, most likely, not the people who respond to spam.
This is why more ISPs need to implement server-side solutions.
> In order to use them, you have to download the entire message. On dial up, receiving 200+ spams a day, that isn't worth it for me.
If your ISP ran one on the server, you wouldn't need to download them all except for initial training (which can be easily accelerated with a seeded dictionary).
> sites that are on a DNS blacklist and tags that mail.
Good luck with the whole DNS blacklisting. Just about every single spam our system receives comes from a different IP address.
> It took SA a long time to get down into that range,
Not to be picky, but unless this is a change in the new version of SA, most people have been reporting a 0.06% or worse FP rate. I'm very glad though if they finally did make it down to that rate.
> Still, how long it takes you to get there, and how much pain new users suffer to achieve that is important
There are two types of users: users who want out-of-the-box filtering, and users who are willing to sit and train. Merge tools that create seeded dictionaries and perform other types of "prep work" so new users have instant filtering certainly make the learning situation a lot easier. In either scenario, I believe false positives are unacceptable, and software should make every attempt to avoid these at all costs (including at the expense of filtering). DSPAM does a good job of this, but some people still run into some FPs during initial training - those folks should be running with a seeded dictionary.
> DSPAM's author (was that you? I don't know Slashdot IDs, sorry)
Yes that was you and me talking a while back. I lost your email address BTW so send me some more mail =)
> In the end, I'm glad there are at least two tools that are taking the "no one solution" approach. I don't buy the idea that any pure-word-analysis approach is going to work,
It's all about that last 1% of spam when it all comes down to it...regardless of your approach, you can achieve 99% filtering on a bad day (that's 1 in 100 getting through). To get to 99.999% though is the real trick, and I think a lot of different approaches can help. I am still experimenting with "Tokenized Rules", although I am much more interested in the inoculation thread and development of an inoculation standard amongst filters. I don't think the buzzword 'Bayesian' is going to solve anything (on a side note, we implemented three different algorithms into DSPAM lately, including Chi-Square).
We're working on some new technologies to help pre-train new spams. The inevitable fact is that new spams are going to come out and need to be learned. We're collectively implementing a new standard for inoculating users within a particular group. This standard will allow the different spam tools to talk to one another and share information. External inoculation is also a new feature I've implemented in DSPAM, which enables you to have the spammers inoculate you before you ever receive their message.
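To put numbers on that gap, here's a quick back-of-the-envelope calculation. The 100,000-message volume is just a round figure picked for illustration, and the catch rates are passed as strings so the fractions stay exact.

```python
from fractions import Fraction


def spam_delivered(volume, catch_rate):
    """Spams that slip through out of `volume`, given a catch rate like "0.999"."""
    return volume * (1 - Fraction(catch_rate))


# Compare the catch rates mentioned in this thread on the same volume.
for rate in ("0.99", "0.999", "0.99999"):
    print(rate, spam_delivered(100_000, rate))
```

Each extra nine cuts what lands in the inbox by a factor of ten, which is why that last 1% is where all the work is.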
This is a good thing; I would rather see future spams say "hi, please look at this web page: [link]" than have to see big porn banners and red text. Spammers don't make any money off of the innocent-looking spams because they don't captivate an audience...so by the time this happens the spam industry will have suffered severely.
The only thing that keeps spammers in business is their message getting to their recipients. In the long term, filtering _is_ the solution to shut down spammers. If we can prevent the messages from being delivered, spammers will not be able to make any money. As it is now, 100,000 addresses may only generate a few hundred hits. Cutting this supply chain off will no doubt shut spammers down.
Mebbe learn to write a bayesian filter? (on "Another Whack at Spam" · Score: 4, Interesting)
Tim fails to understand that he's still getting spam only because his Bayesian filter sucks. Most other Bayesian-style filters (and friends) are up to a 99.9% filter rate and working towards five-nines efficiency. Their learning potential continues to improve as well with new concepts such as inoculation. It's no longer a question of "can we filter spam"; it's a question of "how do we stop that one in a thousand spams that get through"...and that's soon going to be one in ten thousand. The problem is that only a small number of people have actually done any research in this area and tried Bayesian-style filtering. If they did, they would realize it works... very effectively. There are also server-side tools that make it easy for the 95% of non-tech people on the Internet. Bottom line, Tim needs to quit his bitching and go rewrite his spam filter - or install someone else's.
My original point was that spam filters are good enough and therefore we no longer need to worry about legislation, do-not-email lists, and other less effective forms of filtering. If everyone who complained on Slashdot about spam would install a filter at their ISP, I think you'd find there would hardly be any spam left in the world. Obviously, additional resources are going to be given to improving the effectiveness and learning capabilities of spam filters...but so far the effectiveness of even the most basic filters hasn't changed over the past few years that Bayesian has been hot. We should always be working on improving our software, but my point was that there are a million other "solutions" people are wasting their time with on this board.
Paul Graham's paper on Bayesian filtering, although incomplete, is a great start to understanding how it all works. http://www.paulgraham.org.
Several attempts have been made to attack the tokenizer, which is one area where DSPAM has a considerable lead on other tools. DSPAM performs several different deobfuscation techniques prior to tokenizing a message. From simple things such as removing embedded HTML comments to more complex issues such as j/u-n,k t,e*x$t, DSPAM makes every attempt to deobfuscate such messages - and is very successful. Misspellings are actually ideal ways to identify spam because they show up much more frequently in spams than in innocent messages - DSPAM treats them just like any other token.
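For anyone wondering what that kind of cleanup looks like, here's a deliberately crude sketch. It is NOT DSPAM's actual code: the two regexes and the function name are my own illustration of the two cases mentioned above (embedded HTML comments, and punctuation wedged between letters).

```python
import re

# An HTML comment, possibly spanning lines, e.g. "fr<!-- x -->ee".
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
# A single non-alphanumeric, non-space character sitting between two
# letters, e.g. the "/" in "j/u".
INTRA_WORD_JUNK = re.compile(r"(?<=[A-Za-z])[^A-Za-z0-9\s](?=[A-Za-z])")


def deobfuscate(text):
    """Strip HTML comments, then punctuation wedged inside words."""
    text = HTML_COMMENT.sub("", text)
    return INTRA_WORD_JUNK.sub("", text)


cleaned = deobfuscate("fr<!-- break -->ee j/u-n,k t,e*x$t")
```

A sketch this blunt also mangles legitimate text ("don't" becomes "dont", and email addresses and URLs lose their punctuation), so a real tool has to be a lot more careful about where it applies this.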
DSPAM tracks ordering to some degree - if a token shows up in a particular header, or a URL, etc., it makes note of that (for example, URL*[Email Address] is a LOT more guilty than just your email address). Even attaching ham messages doesn't quite do the trick, for the reasons I mentioned in my previous email.
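Roughly, the idea of context-tagged tokens plus pairs of adjacent words (the "chained tokens" that came up earlier) can be sketched like this. The `*` separator follows the URL*[Email Address] example above; the `+` separator for pairs, the word regex, and the function itself are assumptions for illustration, not DSPAM's real tokenizer.

```python
import re

# Deliberately simple notion of a "word"; case is preserved.
WORD = re.compile(r"[A-Za-z0-9$'!-]+")


def tokenize(text, context=""):
    """Return single tokens plus chained (adjacent-pair) tokens.

    If `context` is given (e.g. "Subject" or "URL"), every token is
    prefixed with it so the same word is tracked separately per location.
    """
    words = WORD.findall(text)
    prefix = f"{context}*" if context else ""
    singles = [prefix + w for w in words]
    chains = [f"{prefix}{a}+{b}" for a, b in zip(words, words[1:])]
    return singles + chains


subject_tokens = tokenize("Free offer", context="Subject")
body_tokens = tokenize("Free offer")
```

Because case is kept, "Free" and "free" are different tokens, and "Subject*Free" is tracked separately from a plain "Free" in the body.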
Frequency is tracked as running totals, not on a per-message basis. E.g. if the word 'offer' appears once or 20 times in a message it makes no difference to most filters...for obvious reasons.
I know I'm not the only one who has deployed DSPAM on my system, and judging by the number of people reporting to the lists I'd say it's a success for everyone else running it too. In response to your comments about an intelligent person who can think about circumventing the filter...this really isn't accurate. If you look at what spammers are doing today to _try_ and circumvent spam filters, they seem to only be succeeding with static tools like SpamAssassin. Although 'Bayesian' filtering is a very loose term, these filters usually have the following traits in common:
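One simple way to get that behavior (and this is just my reading of it, not any particular filter's code) is to collapse repeats before a message is scored:

```python
def tokens_for_scoring(message_tokens):
    """Drop repeated tokens, keeping first-seen order.

    After this, a word that appears 20 times in one message weighs
    exactly as much as a word that appears once.
    """
    return list(dict.fromkeys(message_tokens))


unique = tokens_for_scoring(["offer"] * 20 + ["free", "offer"])
```

The obvious reason being that otherwise a spammer (or an innocent sender) could swing the result just by repeating one word.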
1. Unknown tokens are assigned a moderately neutral value.
2. Only the most interesting tokens are used in the actual calculation.
3. Statistics are stored on a per-user basis.
With the above 3 mechanisms, it is very difficult to craft a spam that will make it through a majority of filters, and here's why: since each user has different email behavior, the innocent tokens that exist in their system are going to be very different, meaning that a spammer can't simply "run their spam through a filter" like they can with SpamAssassin. With a tool like DSPAM, where chained tokens are used, it is even more difficult to determine what the most commonly innocent tokens are.
Since only the _most interesting_ tokens are used (and not the most common), most of the common words a spammer might choose are never used in the calculation. Many spammers will flood emails with junk words that may or may not hit...such as "tomato" or what have you. These tokens, when they don't have any significant hits in the user's database, are given a fairly neutral value, which causes them to be ignored in the calculation. When it all hits the fan, a good spam filter will ultimately detect whatever spammy words a spammer has embedded (or even tried to hide) in the email and ignore any of the junk words that were unknown to the user's dictionary (or didn't have enough hits).
The only way to get a spam through is to provide more tokens that are not only innocent, but innocent enough to outweigh the spammy tokens (e.g. 0.01 in value), and these types of tokens are very different for each user. Like I said, since DSPAM uses case-sensitive chained tokens, the spammer would need to come up with two adjacent tokens, case sensitive, that a majority of users are likely to have as very innocent in their dictionary...not a very easy feat.
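Here's a minimal sketch of those three traits working together, in the style of Paul Graham's paper rather than any specific tool's implementation. The 0.4 value for unknown tokens and the 15-token window are the figures from that paper; the per-user dictionary here is a hand-made toy that maps tokens straight to probabilities (a real filter derives them from ham/spam counts), and the function name is hypothetical.

```python
from math import prod

UNKNOWN = 0.4     # fairly neutral value for tokens this user has never seen
MAX_TOKENS = 15   # only this many "most interesting" tokens are scored


def spam_probability(tokens, user_probs):
    """Combine the most interesting token probabilities for ONE user."""
    probs = [user_probs.get(t, UNKNOWN) for t in set(tokens)]
    # Most interesting = farthest from 0.5, in either direction.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    top = probs[:MAX_TOKENS]
    spammy = prod(top)
    innocent = prod(1 - p for p in top)
    return spammy / (spammy + innocent)


# Toy per-user dictionary (assumed values).
user_probs = {"viagra": 0.99, "refinance": 0.99, "meeting": 0.01}
spam_words = ["viagra", "refinance"]
small_flood = spam_words + [f"tomato{i}" for i in range(100)]
big_flood = spam_words + [f"tomato{i}" for i in range(1000)]

score_small = spam_probability(small_flood, user_probs)
score_big = spam_probability(big_flood, user_probs)
```

In this sketch the unknown junk words only ever fill whatever slots are left over after the interesting tokens, so flooding with 1,000 junk words does exactly as much as flooding with 100, and the two spammy words still carry the score. To actually drag it down, the spammer would need tokens that sit near 0.01 in this particular user's dictionary.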
I'm not blind enough to say it's impossible to do, just very difficult...and should some spams get through that are crafted to hit these tokens, the spam filter should quickly learn and adjust these tokens to a slightly more neutral value - meaning the NEXT time they spam, they'll have to find another set of very-innocent tokens.
While it may be somewhat feasible to craft an email that targets a small group of people, spammers don't make any money off of that - they only make money when a large mass of their emails can get through, so even though I could find some way of getting around YOUR bayesian filter, it's extremely difficult to find a way to get around a hundred thousand people's.
While I do realize that there are potential exploits involved, and have read several papers on such, I think many of them are overrated. Even in my own testing many of the exploits haven't significantly impacted filtering. Should a spammer find a way that really does beat the system, it's only a matter of a little time before the necessary development "tweaks" are made to fix the problem.
Because the money they save by paying their employees $2 an hour more than makes up for any expenses incurred from phone calls. Not to mention, call centers are primarily incoming, and so I imagine it can't be much more expensive than the standard 800-service to America.
Whine and insult me all you like... and you can throw all the papers you want my way, but the proof is in the fact that I DON'T GET SPAM (except for the mindless responses such as yours posted to Slashdot).
You guys can moan and groan all you want about how [insert tool] won't work, or you can shut up and install the thing. I personally don't care if you wanna whine for the rest of your life - some of us are whiners and some of us are born to a higher purpose.
What? This is a derivative work? But yours has clothes on!
A bunch of geeks whining about bandwidth...isn't that original
Wouldn't it be easier just to make your stuff not suck?
The real problem here is lack of adequate testing prior to an upgrade.
> 1) false positive rate
Most filters are down to below 0.03%. DSPAM is down to 0.01% and lower with some of our users.
> 2) false negative rate
That's the accuracy I was referring to; 99.9% catch rate.
Are there Linux drivers for this new technology yet? If not, it doesn't really exist.
Bruce,
Bottom line is you can complain about it all you want or you can actually try it and see that it works. I've got better things to do today - cheers.