(SORRY, REPOST FROM FURTHER ON, BUT I WANTED HENRY TO SEE IT)
Wow, I never thought this would generate this much discussion or attention! To everyone who's giving me good feedback (Henry Stern especially), thank you very much, it is appreciated. Now for the detractors and script kiddies.
Ok, first off, I was not deriding existing spam filters, so you guys who took it that I was personally attacking your spam filter, get over yourselves. If you don't have a problem with spam, then congratulations, you get a freakin' cookie. Since most users do, I was trying to create a forum for discussion of new techniques and ideas for addressing it, not lameass quotes like "this guy's an idiot". And most of those comments came from people who didn't read the whole article or didn't understand it. I guess that's what happens when you mix script kiddies with democracy.
Next, if other people have tried and tested this method, great; I wasn't trying to steal their work. This is basically a guy, a garage, and a computer type of project, and I don't have access to research done in deep academic circles. If this overlaps with other work, my deepest apologies to those individuals.
Now, for you people who thought I was trying to say this was the next best thing to canned bread, give me a break! If you read the conclusion, I basically state that no one simple algorithm can adequately address spam (this one included!), that it would take a multi-pronged attack, and that we should start treating spam the same way we treat evolving organisms. That was the point of the article. As for the algorithm, I was just trying to present a starting point, not an end.
Whew, now that I've got that off my chest, to the other problems with the article.
Henry, you were absolutely right: the training set does need to be randomly reshuffled at the start of every epoch. My fault on that one, guys.
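For anyone following along, here is roughly what that fix looks like. This is only a minimal sketch in Python, not the code from the article: `train_epochs`, `samples`, and `update` are made-up names, and `update` stands in for whatever single-sample backprop step your network uses.

```python
import random

def train_epochs(samples, update, epochs, seed=None):
    """Present every training sample once per epoch, in a fresh random order.

    samples: list of (features, label) pairs (hypothetical layout)
    update:  callable doing one weight update for one sample (assumed)
    """
    rng = random.Random(seed)
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)              # new presentation order each epoch
        for i in order:
            features, label = samples[i]
            update(features, label)
```

Shuffling an index list rather than the samples themselves leaves the caller's data untouched, and passing a `seed` makes a run repeatable when you are debugging.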
Also, I do apologize for the metaphors; I should have used the proper ANN terminology. I used the metaphors to reinforce my point about the similarities between this and biology.
Now, I am not quite sure why it takes so many training cycles to train the network in SpamAssassin, but since I'm using standard backpropagation (BP), training is quite fast, and it does not require an extensive training set because of the generalization traits of an MLP.
I do agree about the ability of spammers to exploit the error inherent within the hidden layer of an MLP. But I would think it would be extremely difficult to do this, especially if multiple distributions of the trained MLP are created, each with a different hidden-layer structure and different initial weights. Then they would only be able to take advantage of certain traits for a small percentage of filters. Also, I would argue that training should also happen on the user's PC, as a background task, say weekly. This allows the network to adapt to new types of spam, and when it is initialized with random weights, it should make the error within the hidden layer random among different installations, effectively killing that exploit, as each machine will converge differently.
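To make the "each machine will converge differently" point concrete, here is a small sketch, again in Python and again not the article's code. It trains a one-hidden-layer MLP with plain online backprop from a seeded random start. The name `train_mlp`, the toy feature layout, and the learning rate are all assumptions picked for illustration.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_mlp(samples, n_hidden, seed, epochs=200, lr=0.5):
    """Train a one-hidden-layer MLP with online backprop.

    Returns (hidden_weights, output_weights); the last entry of each
    weight row is the bias. The seed stands in for one installation.
    """
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    w_h = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_o = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)                      # reshuffle every epoch
        for i in order:
            x, target = samples[i]
            xb = list(x) + [1.0]                # inputs plus bias term
            h = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in w_h]
            hb = h + [1.0]                      # hidden outputs plus bias term
            y = sigmoid(sum(w * v for w, v in zip(w_o, hb)))
            # Squared-error gradient at the output, then pushed back to the
            # hidden units using the output weights from before the update.
            delta_o = (y - target) * y * (1.0 - y)
            delta_h = [delta_o * w_o[j] * h[j] * (1.0 - h[j]) for j in range(n_hidden)]
            for j in range(n_hidden + 1):
                w_o[j] -= lr * delta_o * hb[j]
            for j in range(n_hidden):
                for k in range(n_in + 1):
                    w_h[j][k] -= lr * delta_h[j] * xb[k]
    return w_h, w_o

# Hypothetical binary features: [has_link, shouting_subject, known_sender]
toy = [([1, 1, 0], 1), ([1, 0, 0], 1), ([0, 1, 0], 1),
       ([0, 0, 1], 0), ([1, 0, 1], 0), ([0, 0, 0], 0)]

install_a, _ = train_mlp(toy, n_hidden=3, seed=1)
install_b, _ = train_mlp(toy, n_hidden=3, seed=2)
print(install_a != install_b)   # compares the two hidden layers
```

Keep in mind this only shows that the hidden-layer weights differ between two seeds. Two networks with different weights can still make the same mistakes on the same messages, so whether this actually defeats an exploit would have to be tested against real spam.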
Now, regarding spam messages adapting, I am sure there are many more exploits they can employ, but at this point they are running out of options, at least "structural" options. If we can work together to identify the exploitable structures, then we should be able to design a filter that can adapt to the different features within these structures.
That's all I have. Thanks again to the folks who gave me constructive feedback on the article. To those whose remarks came out of complete ignorance, **** off.
Ummm, most readers of this article are going to plug it into a mail client, which is done primarily on a Windows box. And as a former Java Nazi myself (SCJD, SCJA), I can say with utmost confidence that .NET beats the living hell out of Java in just about every category. I hate MS just like the next guy, but Sun seriously effed up Java when they let every freakin' vendor under the sun (no pun intended) into the spec process and didn't open source it.
I shed a tear when I came to this conclusion, but then I realized religion doesn't pay my mortgage.
Nuff said.
Thanks SlashDot!
Shawn Evans