Paul Graham on Fighting Spam

← Back to Stories (view on slashdot.org)

Posted by CmdrTaco on Friday August 16, 2002 @04:08AM from the near-and-dear-to-my-heart dept.

Ramakrishnan M writes "Paul Graham, the Lisp Guru is back with a great technique to fight spam. It is based on trust matric, and he claims, only 5 out of 1000 spams got leaked out of this system with 0 false positives. Worth looking at."

6 of 675 comments (clear)

Min score:

Reason:

Sort:

Ok, that is hot.... by Vengie · 2002-08-16 04:15 · Score: 4, Insightful

1) Lisp...ever since i ran into scheme, I have _loved_ the concept of lisp based languages. A nice Hoo-ha to anyone who says there are no practical applications of lisp based languages. (except haskell...which personally, i think sucks! if one of our own professors hadn't invented it, it would be dead by now) 2) _0_ false positives. I'm perfectly happy to settle with "some small number of spams getting through" given there are NO false positives. Early on in the article he states that he realizes this is a critical problem, and from the start keeps no false positives as a goal. It is far better to have no false positives then to have 100% no-spam rate with that in mind... 3) the statistical word analysis is really interesting..."describe" is innocent. unfortunately....what happens when a few smart spammers get their hands on this analysis *sigh*

--
When in doubt, parenthesize. At the very least it will let some poor schmuck bounce on the % key in vi. (Larry Wall)
1. Re:Ok, that is hot.... by Plutor · 2002-08-16 04:36 · Score: 5, Insightful
  
  1) [...] A nice Hoo-ha to anyone who says there are no practical applications of lisp based languages. (except haskell...which personally, i think sucks! [...])
  
  You ridicule people who dismiss the usefulness of your personal "favorite" language, and then you dismiss the usefulness of one particular language that you happen to dislike? That's a bit hypocritical.
  
  3) [...] what happens when a few smart spammers get their hands on this analysis[?]
  
  Paul covers this. First, he suggests that each user's filters should be personalized, so that any spammer would not be able to circumvent everyone's filters. Second, the filters would be continually learning, possibly dumping older words from the corpus in favor of newer ones. And third, even if a spammer put at the end of his spam "describe describe describe describe", this still wouldn't work; the basic premise of the filter is that the spammer HAS to tell you what he's selling, and in the process of doing that, gives himself away as a spammer.
Re:Misleading by sebi · 2002-08-16 04:39 · Score: 4, Insightful

In the long run filtering would eliminate the source as well. Spam has to be payed for by two sides: Both the spammer and the recipient have to pay for the bandwith. The spammer has to pay a lot more though. Spamming is a business that will continue to exist as long as its profitable. If the success rate of Spam drops dramatically due to refining filters than sooner or later Spammers will no longer be able to afford the bandwidth they need.

--
Hank! White!
Re:This approach is very easy to defeat by pmz · 2002-08-16 04:51 · Score: 5, Insightful

The spam message is entirely contained as an /image/ within the html.

Thankfully, my e-mail client is set up to not render any HTML in an e-mail. I have yet to send back any information to a spammer via specially-coded image tags and am proud of it.

HTML-based e-mail is fundamentally insecure and really should be used by no one (except those who simply don't care about privacy). Go here to learn just what a spammer--or anyone who sends you an HTML-based e-mail--can learn about you with just one "click" of your mouse.

Yes, the spammer can learn what browser version you use, what OS you use, and even what city you live in (via the traceroute). An unusually savvy spammer could use this information to install spyware via known exploits in certain browsers and operating systems.

In short, HTML e-mail is damn scary knowing that so many people us it not knowing just how much information they are giving away for free!

--
Healthcare article at Kuro5hin
Incorrect statistics by SiliconEntity · 2002-08-16 05:05 · Score: 4, Insightful

Based on my corpus, "sex" indicates a .97 probability of the containing email being a spam, whereas "sexy" indicates .99 probability. And Bayes' Rule, equally unambiguous, says that an email containing both words would, in the (unlikely) absence of any other evidence, have a 99.97% chance of being a spam.
This reasoning is statistically invalid. It is only true if the chance of the word "sexy" appearing in a message is independent of the chance of the word "sex" appearing. In other words, only if knowing that the word "sex" appears tells you nothing about how likely the word "sexy" is to appear, can you reason as he is doing above. That's probably a very poor assumption in this case.
He is doing:
p(sex & sexy) = p(sex) * p(sexy)
The correct formula is:
p(sex & sexy) = p(sex) * p(sexy | sex)
where the last term means the probably of "sexy" given that "sex" appears.
Maybe his approach is good enough for his purposes, but the statistical foundations are not correct.
Bullshit! by www.sorehands.com · 2002-08-16 05:18 · Score: 5, Insightful

Another spammer lie.
Freedom of speech is not the freedom to tresspass on my computer equiptment, use my resources for me to listen to your advertising!
This is not a prohibition on your paying your moneyto spread your advertising. This is a prohibition on you spending my money to spread your advertising.

Commercial speech does have some constitutional protection, but not to the same level as non-commercial speech. But even with pure political speech, there is no requirement for me to pay for your speech.

As for hitting the delete key, at that point, you have already tied up at least 2 of my computers used my disk storage, my time, my bandwidth without paying for it.

If you want to spam, no problem, just pay me in advance.

--
Fight Spammers!