Slashdot Mirror

Paul Graham on Fighting Spam

Posted by CmdrTaco on Friday August 16, 2002 @04:08AM from the near-and-dear-to-my-heart dept.

Ramakrishnan M writes "Paul Graham, the Lisp guru, is back with a great technique to fight spam. It is based on a trust metric, and he claims only 5 out of 1000 spams leaked through the system, with 0 false positives. Worth looking at."

Beware Statistics by DoctorNathaniel · 2002-08-16 05:23 · Score: 2, Redundant

A few quick comments about this. Although powerful, such approaches suffer from being somewhat too 'black-box'. That is, you turn control over to the computer to make decisions based upon statistical recurrences. This leaves you very vulnerable to several problems.

For instance, the author remarks that he believes a bigger corpus of spam would help train filters. That's true, but misleading: it would help train filters that distinguish between his 'nonspam' corpus and his 'spam' corpus. In this case, he is surely increasing his true positives, that is, his rejection of things that really are spam. But his false-positive rate is not helped at all, because his samples are so biased.

(Example: 10 spams contain the word 'blunderbuss' but he has no regular email with that word. Therefore, any future email may be rejected because of the word 'blunderbuss', even though there is no basis to know whether the word CAN be used legitimately.)
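To make the 'blunderbuss' case concrete, here is a rough sketch in Python. It assumes the filter scores each word with a simple clamped frequency ratio; the function name, the clamp values, and the corpus sizes are all made up for illustration and are not taken from the article.

```python
def word_spam_probability(spam_hits, ham_hits, n_spam, n_ham, lo=0.01, hi=0.99):
    """Clamped estimate of P(spam | word) from raw corpus counts.

    Hypothetical scoring rule for illustration only: the share of the
    word's per-corpus frequency that comes from the spam corpus.
    """
    spam_rate = min(1.0, spam_hits / n_spam)
    ham_rate = min(1.0, ham_hits / n_ham)
    if spam_rate + ham_rate == 0:
        return 0.5  # word never seen anywhere: no evidence either way
    p = spam_rate / (spam_rate + ham_rate)
    return max(lo, min(hi, p))  # keep the estimate away from 0 and 1


# 'blunderbuss' in 10 of 1000 spams, in 0 of 1000 legitimate mails.
p_small = word_spam_probability(10, 0, 1000, 1000)

# Ten times more spam, same legitimate corpus: still no legitimate sightings.
p_large = word_spam_probability(100, 0, 10000, 1000)

# A single legitimate use is enough to move the estimate.
p_one_ham = word_spam_probability(10, 1, 1000, 1000)

print(p_small, p_large, p_one_ham)
```

Under these assumptions the word sits at the upper clamp however much spam is added, and only a legitimate sighting moves it.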

If the system is done intelligently, a lopsided sample will simply do nothing (the resolving power will be dominated by the smaller of the two samples), but this may be counterintuitive to some.
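One way to read 'resolving power' here, and this is my assumption rather than anything stated in the article, is the standard error on the difference between a word's rate in the two corpora. A small Python sketch with hypothetical corpus sizes shows that growing only the spam corpus cannot push that error below the floor set by the nonspam corpus:

```python
import math


def rate_difference_stderr(p, n_spam, n_ham):
    """Standard error of the difference between two observed word rates.

    Assumes (for illustration) the same true rate p in both corpora and
    independent binomial sampling of n_spam and n_ham messages.
    """
    return math.sqrt(p * (1 - p) / n_spam + p * (1 - p) / n_ham)


p = 0.01       # hypothetical true rate of the word
n_ham = 1000   # the smaller, fixed nonspam corpus

# The error contributed by the nonspam corpus alone: a hard lower bound.
floor = math.sqrt(p * (1 - p) / n_ham)

# Grow only the spam corpus and watch the error level off above the floor.
errors = [rate_difference_stderr(p, n_spam, n_ham)
          for n_spam in (1000, 10000, 100000)]

for n_spam, err in zip((1000, 10000, 100000), errors):
    print(n_spam, round(err / floor, 3))
```

The printed ratio approaches 1 but never drops below it, which is the sense in which the smaller sample dominates.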

Another problem is that you don't know WHY choices are being made, and that's bad science. OK, OK, so this isn't science, it's spam prevention, but I like science.

---N