Paul Graham on Fighting Spam


Posted by CmdrTaco on Friday August 16, 2002 @04:08AM from the near-and-dear-to-my-heart dept.

Ramakrishnan M writes "Paul Graham, the Lisp guru, is back with a great technique to fight spam. It is based on a trust metric, and he claims that only 5 out of 1000 spams leaked through the system, with 0 false positives. Worth looking at."


Spammers will just change tactics. by caluml · 2002-08-16 05:00 · Score: 0, Redundant

Of course, the problem now is that spammers won't use ff0000 as a colour, they won't start with Dear Sir or Madam, and we'll just have to start again.
I think the best way is to make a similar list of words you find in valid emails, rather than a list of things that occur in spam.

One idea that I use, and that I've never seen used anywhere else, is to change your email address to:
user.aug02@domain.co.uk. That way any spammer will only have a valid address for at most 31 days. Change your email address each month. Humans can work it out, bots can't.
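
A minimal sketch of the rotating-address idea, assuming the tag is always a lowercase three-letter English month abbreviation followed by a two-digit year, as in the example above. The function name monthly_address is hypothetical, and the mail server would still need to be configured separately to accept only the current month's address.

```python
from datetime import date

# English month abbreviations, hard-coded so the result does not depend on locale.
MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]

def monthly_address(user: str, domain: str, today: date) -> str:
    """Build an address such as user.aug02@domain.co.uk for the given date."""
    tag = f"{MONTHS[today.month - 1]}{today.year % 100:02d}"
    return f"{user}.{tag}@{domain}"

if __name__ == "__main__":
    print(monthly_address("user", "domain.co.uk", date(2002, 8, 16)))
```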

--
Get your own free personal location tracker
Just another hoop to jump through? by blink3478 · 2002-08-16 05:00 · Score: 0, Redundant

Because it is measuring probabilities, the Bayesian approach considers all the evidence in the email, both good and bad. Words that occur disproportionately rarely in spam (like "though" or "tonight" or "apparently") contribute as much to decreasing the probability as bad words like "unsubscribe" and "opt-in" do to increasing it. So an otherwise innocent email that happens to include the word "sex" is not going to get tagged as spam.
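
To make the quoted point concrete, here is a small sketch of the standard naive Bayes way of combining per-word spam probabilities, assuming the words are independent. The per-word values below are made up for illustration and are not taken from the article.

```python
from math import prod

def combine(probabilities):
    """Combine per-word spam probabilities under an independence assumption.

    Returns P / (P + Q), where P is the product of the probabilities
    and Q is the product of their complements.
    """
    p = prod(probabilities)
    q = prod(1.0 - x for x in probabilities)
    return p / (p + q)

# Hypothetical per-word spam probabilities (illustrative only).
innocent_mail = {"sex": 0.97, "though": 0.05, "tonight": 0.05, "apparently": 0.05}
obvious_spam = {"unsubscribe": 0.99, "opt-in": 0.99}

if __name__ == "__main__":
    print(combine(innocent_mail.values()))  # well below 0.5
    print(combine(obvious_spam.values()))   # close to 1
```

With these assumed numbers, the three "good" words outweigh the single "bad" word, so the message scores as non-spam.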

So what's to keep spammers from reading this article and tailoring their spam to stop using 'hos' and 'ladies' and start including words like 'tonight' and 'apparently'?

'This week only! All the hiz'oes and liz'adies you could want on our website. Sign up tonight and receive a free two month membership! Apparently we'd uh... like your business!'

D
But I think it could be easily circumvented .. by vinays · 2002-08-16 05:13 · Score: 0, Redundant

As described, it would be very hard for spam to get through. However, what I'm thinking is that they could have their normal 5 KB of email which is spam, and at the bottom (or anywhere else) just add 20 KB of words they know are "good words". Throw HTML comment tags around it and it's never seen by the viewer, but the large number of "good words" outnumbers the "bad words", causing a spam message to be considered good.

I don't know if that'll really work, but it's a thought.
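
A rough sketch of the padding idea, assuming a filter that naively combines every token in the message with equal weight. The probabilities and token counts are hypothetical, and a real filter might skip HTML comments or only look at a fixed number of tokens, in which case this would not apply.

```python
import math

def combine(probabilities):
    """Naive Bayes combination done in log-odds space to avoid underflow."""
    log_odds = sum(math.log(p) - math.log(1.0 - p) for p in probabilities)
    # Numerically stable logistic function.
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)

# Hypothetical values: 5 strongly spammy tokens, then 20 padded "good words".
spam_tokens = [0.99] * 5
padding = [0.01] * 20

if __name__ == "__main__":
    print(combine(spam_tokens))            # close to 1
    print(combine(spam_tokens + padding))  # close to 0
```

Under that naive assumption the padding wins simply by outnumbering the spammy tokens.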

--

"cogito, ergo sum"
Beware Statistics by DoctorNathaniel · 2002-08-16 05:23 · Score: 2, Redundant

A few quick comments about this. Although powerful, such approaches suffer from being somewhat too 'black-box'. That is, you turn control over to the computer to make decisions based upon statistical recurrences. This leaves you very vulnerable to several problems.

For instance, the author remarks that he believes a bigger corpus of spam would help train filters. That's true, but misleading: it would help train filters that distinguish between his 'nonspam' corpus and his 'spam' corpus. In this case, he is surely increasing his true positives, that is, his rejection of things that really are spam. But his false-positive rate is not helped at all, because his samples are so biased.

(Example: 10 spams contain the word 'blunderbuss' but he has no regular email with that word. Therefore, any future email may be rejected because of the word 'blunderbuss', even though there is no basis to know whether the word CAN be used legitimately.)

If the system is done intelligently, this will simply mean that having a lopsided sample will do nothing (the resolving power will be dominated by the smaller of the two samples), but this may be counterintuitive to some.
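
The 'blunderbuss' example can be put into numbers. This is only a sketch: the corpus sizes are invented, and the add-one smoothing is an assumption of mine standing in for "done intelligently", not something taken from the article.

```python
def raw_estimate(spam_hits, spam_total, ham_hits, ham_total):
    """Unsmoothed P(spam | word) from the two relative frequencies."""
    spam_rate = spam_hits / spam_total
    ham_rate = ham_hits / ham_total
    if spam_rate + ham_rate == 0:
        return 0.5  # word never seen: no evidence either way
    return spam_rate / (spam_rate + ham_rate)

def smoothed_estimate(spam_hits, spam_total, ham_hits, ham_total):
    """Same estimate with add-one (Laplace) smoothing on each corpus."""
    spam_rate = (spam_hits + 1) / (spam_total + 2)
    ham_rate = (ham_hits + 1) / (ham_total + 2)
    return spam_rate / (spam_rate + ham_rate)

if __name__ == "__main__":
    # 10 of 1000 spams contain the word; 0 of 1000 regular mails do.
    print(raw_estimate(10, 1000, 0, 1000))             # 1.0: always rejected
    print(smoothed_estimate(10, 1000, 0, 1000))        # about 0.917
    # 100x more spam at the same rate barely moves the estimate...
    print(smoothed_estimate(1000, 100000, 0, 1000))    # about 0.909
    # ...while 100x more regular mail (still no hits) does.
    print(smoothed_estimate(10, 1000, 0, 100000))      # about 0.999
```

With the smoothed version, piling on more spam leaves the estimate almost unchanged, which is the sense in which the smaller sample limits what the filter can conclude.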

Another problem is that you don't know WHY choices are being made, and that's bad science. Ok, ok, so this isn't science, it's Spam prevention, but I like science.

---N