Paul Graham on Fighting Spam
Ramakrishnan M writes "Paul Graham, the Lisp guru, is back with a great technique to fight spam. It is based on a trust metric, and he claims only 5 out of 1000 spams leaked through this system, with 0 false positives. Worth looking at."
Create an E-Mail address called, say, spam@example.net.
Put a link to it on your website, but tell people not to use it for anything, e.g.:
<a href="mailto:spam@example.net">Spam trap - don't use me</a>
Then, it'll get harvested along with all the others on your site. That mail box will fill up with spam, and nothing else.
What good is that? Well, you've got a ready-made list of messages to filter *out* of your other mail boxes!
So, just write a script that checks each inbound E-Mail against the spam list. If it matches, you *know* it's either:
1. Spam
or
2. An E-Mail that somebody has also sent to the "Don't use me" address.
In either case, you don't want to read it, so it gets auto-deleted. Nice.
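A minimal sketch of the matching script described above, in Python. The function names are made up for illustration, and "matches" is assumed to mean "same body after collapsing whitespace and case"; a real spam run that randomizes its text would need fuzzier matching than this.

```python
import hashlib
import re
from email import message_from_string


def fingerprint(raw_message: str) -> str:
    """Hash the normalized body so differing headers (To:, Received:) don't matter."""
    msg = message_from_string(raw_message)
    payload = msg.get_payload()
    if isinstance(payload, list):  # multipart: join the payloads of the parts
        payload = "\n".join(str(part.get_payload()) for part in payload)
    normalized = re.sub(r"\s+", " ", payload).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def build_trap_index(trap_messages):
    """Fingerprint everything that arrived in the 'don't use me' mailbox."""
    return {fingerprint(m) for m in trap_messages}


def should_delete(raw_message: str, trap_index) -> bool:
    """True if this inbound message was also sent to the trap address."""
    return fingerprint(raw_message) in trap_index


trap = "To: spam@example.net\nSubject: Hi\n\nBuy   cheap pills NOW\n"
inbound = "To: me@example.net\nSubject: Hi\n\nbuy cheap pills now\n"
legit = "To: me@example.net\nSubject: Lunch\n\nAre we still on for noon?\n"

index = build_trap_index([trap])
print(should_delete(inbound, index), should_delete(legit, index))
```

Hashing the body rather than the whole message is a design choice: the trap copy and your copy will always differ in their headers.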
Oh, I think I'll patent this, and not tell any of you about the royalty I'm going to charge in 15 years time. Hahahahahahaha!!!
Oh, by the way, first post, first post... NOT!
Here's how: the spam should be written as a 'multipart/alternative' with an html version of the spam as the primary alternate. The text version contains an innocuous message intended to pass the statistical spam filter. The spam message is entirely contained as an /image/ within the html. The text of the spam becomes invisible to the filter but not to the poor schmuck who gets the email.
I'm guessing here that the inclusion of a single image tag in the html is unlikely to trigger the spam filter, and supplying a wealth of evidence that the email is 'not' spam in the unseen alternate text will let the letter through.
This is the brilliant part, and crucial to the endeavour, and so easy to implement!
It appears all the nay-sayers here haven't even read the article (no surprise). Given how little code is needed to implement this, it should be a must in the next mozilla mail/pine etc. code base.
only infrmatn esentil to understandn mst b tranmitd
Having had the same email address since '93, I receive close to 1000 spams per day to my personal account (which is also aliased from root/postmaster/webmaster).
/dev/null.
I've tried everything under the sun to reduce the amount that I see in my mailbox, SpamAssassin being one of the best so far. But even that lets through quite a bit (around 10%).
So I decided to attack it from a different angle. I wrote a series of Perl scripts that I plunked into my procmail file.
The scripts work by checking the address of the sender each time a message is received. That address is looked up in a database. If it exists in the db, and it's marked as "authorized", it's just passed into my mailbox.
If it's marked as denied, the message is discarded.
If it's never been seen before, an authentication message is sent to the sender asking them to reply to it to authorize themselves. If that authmessage is bounced back, a db entry is made as "denied".
If it's replied to in a normal fashion, that email is marked as "authorized" and any queued up mail from that person is pushed out.
The concept is that spam will almost never have a valid reply-to; so it will bounce and be marked as denied.
Even if the email doesn't bounce, no spammer alive will reply to it; so after 30 days, that email is marked as "denied".
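For anyone who wants to play with the idea, here is a rough sketch of that authorize/deny flow in Python rather than Perl and procmail. Everything in it is an assumption for illustration: the class and method names are invented, the "database" is a few in-memory dicts, and denied mail is simply discarded. Actually sending the challenge and detecting the bounce is left to the caller.

```python
from datetime import datetime, timedelta

AUTHORIZED, DENIED, PENDING = "authorized", "denied", "pending"
CHALLENGE_TIMEOUT = timedelta(days=30)  # unanswered challenges become "denied"


class SenderGate:
    def __init__(self):
        self.status = {}         # sender address -> state
        self.challenged_at = {}  # sender address -> when the challenge went out
        self.queue = {}          # sender address -> messages held for approval

    def on_message(self, sender, message, now):
        """Return what to do with the message: deliver, discard, challenge or hold."""
        self._expire(now)
        state = self.status.get(sender)
        if state == AUTHORIZED:
            return "deliver"
        if state == DENIED:
            return "discard"
        if state is None:  # never seen before: hold it and challenge the sender
            self.status[sender] = PENDING
            self.challenged_at[sender] = now
            self.queue[sender] = [message]
            return "challenge"
        self.queue[sender].append(message)  # challenge already outstanding
        return "hold"

    def on_challenge_bounced(self, sender):
        self._deny(sender)

    def on_challenge_reply(self, sender):
        """Authorize the sender and hand back any queued mail for delivery."""
        self.status[sender] = AUTHORIZED
        self.challenged_at.pop(sender, None)
        return self.queue.pop(sender, [])

    def _deny(self, sender):
        self.status[sender] = DENIED
        self.challenged_at.pop(sender, None)
        self.queue.pop(sender, None)

    def _expire(self, now):
        for sender, sent in list(self.challenged_at.items()):
            if now - sent >= CHALLENGE_TIMEOUT:
                self._deny(sender)


gate = SenderGate()
start = datetime(2002, 8, 1)
print(gate.on_message("friend@example.org", "hello", start))
print(gate.on_challenge_reply("friend@example.org"))
```

Manually whitelisting a mailing list would just be a direct write of "authorized" into the status table.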
Since I've set this up (for myself and my 10-year-old son who receives porn in his box (grrr!!!!)), it has worked flawlessly. The "real" email is unharmed, while the spam is stopped.
Oh, and I have a web-based control page so that users can manually add email addresses (for lists and such).
This week, for the first time in YEARS, I don't have spam in my mailbox anymore.
Hurray!
Now if I can only stop that damned dictionary-based scanning of my servers, I'll be set. Thank the gods that I don't have metered service.
He isn't fighting spam, he is filtering it. There is a difference. Filtering still costs in bandwidth. Fighting it would eliminate the source and free up the gigabytes of bandwidth lost for this marketing purpose.
Filtering is fine for now, but ultimately it must be fought and defeated.
--------
It's OK to be social, just don't tell anyone about it.
I'm continually amazed at the people who are beating their heads up against a very simple problem. The answer is not statistics, it is not heuristics, it is not AI, it is not procmail.
The answer is verification...aka whitelists. Check out TMDA, tmda.sourceforge.net. This program assumes you don't want mail from anybody whom you haven't explicitly allowed, or who has verified that they are a real person and not a spammer.
Verification is simple, and some people will point out that it could be defeated by a spammer. But, the economics of spam do not make it feasible for a spammer to attempt to defeat TMDA.
TMDA is similar to making your phone number private. You only get phone calls from people you have given your number to, and you never get telemarketers.
TMDA user since December 2001. Spam messages that tried to get in, 12,133, spam messages that got in 3, false positives, 0. Time I've spent tweaking and modifying the program since installation, 0 minutes.
You could develop a corpus of spam over a long period of time, and look for shifts in the data. What this paper describes is distinguishing between a spam-corpus and a legit-corpus, but you could also compare a spam-1999 corpus to a spam-2002 corpus, and see if the spammers are up to anything new.
Not that it would be useful, but it might be kind of cool to try it out and see.
It's not wasting time, I'm educating myself.
Good method. I work with Bayesian techniques often and I had thought of the same thing but for a different purpose: automatic classification of emails. When you receive an email, your mail reader would propose a list of potential folders into which you might want to put your email after (or before) having read it. And the best thing is that it learns with time and gets better. And as this article shows, this method can also automatically filter emails. Now if I have time to get involved in the Evolution project or kmail, ...
I'll do it for cheesy poofs.
xIf xYou xCan xRead xThis xYou xHave xWon xA xFabulous xVacation! xClick xHere xTo xRecieve xYour xPrize!
Sometimes it's best to just let stupid people be stupid.
What I want to know is:
Would this also work with email viruses? I think it would, since a virus would also have a defined pattern to it and the program would pick it up after the first one.
Could this be made part of the SMTP protocol or built into the backbone layer of the network? Again, I see no major reason why it couldn't.
Problems that I have with it are:
Since each word is treated as a token and everything else is not, I'm sure that spammers would quickly figure out that a spam like this just might work:
<HTML>
<BODY>
Enlarge <!-- elephant --> penis [etc..]
</BODY>
</HTML>
which would show the message but hide the balancing words, so it could be possible to shift the delta in your favor.
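One possible counter, sketched in Python: throw away HTML comments and tags before tokenizing, so the filter scores only what the reader would actually see. This is my own assumption about a fix, not something the article describes, and the regexes are deliberately naive (a real HTML parser would be sturdier).

```python
import re

COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)  # hidden "balancing" words live here
TAG = re.compile(r"<[^>]+>")
TOKEN = re.compile(r"[A-Za-z0-9$'-]+")


def visible_tokens(html: str):
    """Tokenize only the text a mail reader would render."""
    text = COMMENT.sub(" ", html)
    text = TAG.sub(" ", text)
    return [t.lower() for t in TOKEN.findall(text)]


sample = "<HTML>\n<BODY>\nEnlarge <!-- elephant --> savings\n</BODY>\n</HTML>"
print(visible_tokens(sample))
```

The hidden word never reaches the scoring step, so it cannot pull the score toward "not spam".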
Does anyone else have thoughts on how this might be broken?
III.IIVIVIXIIVIVIIIVVIIIIXVIIIXIIIIIIIIVIIIIVVIII
By the time one can apply the filters, you have already received the spam. This is a load on your resources. In some cases your in-box may even fill up (yes, I've received 1000's of the same piece of spam in the same hour, exceeding the capacity of my allotted storage and effectively DOSing me from real e-mail) or you may exceed limitations from forwarding services.
The spammers don't really care. Or notice. Their goal is to hit millions of victims, knowing that some of them will respond. The response is all they care about. Filter your e-mail all you want, you were not going to respond to them anyway. All they care about is reaching the mark that doesn't know any better, and this filter doesn't do anything to stop that (unless it is applied automatically by ISPs, which is unlikely due to the fear of false positives).
What might help is a two-fold attack on what they want: responses from marks. I suggest the following:
A massive education campaign to teach the general Internet user never to respond to (or even read) strange messages that show up in their e-mail. Banner ads would seem a good place to start: it would be a public service if a good percentage of banners were replaced with ones that educated the Internet users who still make spam profitable. This might even have the long-term effect of improving banner revenue: if banners compete with spam as a way to get out a message, they have a lower value than if the public is taught not to buy from spam and even to aggressively resist doing business with a spammer. In the long run an antispam banner campaign could improve banner revenue for those who help fight spam. Ideally another great way to get the word out would be UCE, but that poses a moral dilemma....
The other thing that could affect the spammer is if the ads are not getting the desired results for the advertisers. What needs to happen here isn't filtering, it's a massive negative response to the advertiser. A lack of response doesn't hurt them, but making them deal with unwanted responses is a more suitable way to respond to those who originate unwanted messages to us in the first place. These people need to get responses that waste their time and resources like they are wasting ours. Obviously those who supply 800 numbers are a prime target for this, while those who supply only postal addresses make it too costly to respond. I think such negative response campaigns need to be coordinated from major popular sites to be truly effective (not just from a few geeks who spend their day on an anti-spam website; their efforts are much better applied to getting the spam sources into black holes and getting ISPs to block or filter spam). It sure would be nice to see the slashdot effect applied to spammers rather than to the poor schmuck who puts up a small but interesting website.
Interested in others' thoughts in this area.
I'm an American. I love this country and the freedoms that we used to have.
Senator Mary Landrieu
724 Hart Senate Office Building
Washington, DC 20510-0001
Dear Senator Landrieu:
Earlier this month the Federal Communications Commission (FCC) issued a record fine of nearly $5.4 million to Fax.com for transmitting unsolicited advertisements via fax machine (i.e., "junk faxing"). Coincidentally, the Associated Press published a series of three articles covering the state of unsolicited e-mail advertising ("spam"). I'm left wondering how the FCC can come down hard on junk faxers while spammers (arguably of a lower moral class) are allowed to continue to operate nearly unmolested.
The law Fax.com was found to be guilty of breaking is Section 227 of Title 47 of the United States Code. The relevant text follows:
Restrictions on the use of automated telephone equipment:
It shall be unlawful for any person in the United States (...) to use any telephone facsimile machine, computer, or other device to send an unsolicited advertisement to a telephone facsimile machine(.)
It is my understanding that the reasoning behind this law is based on the ownership of resources. Fax machines are purchased and maintained at the owner's expense and only the owner's expense. An unsolicited advertisement sent to this fax machine amounts to nothing less than the use of these expensive resources without prior consent. In effect "junk faxing" is considered theft and as such the offenders are held accountable by law.
What does this have to do with spam? In my opinion, everything.
Receiving an e-mail is by all accounts more expensive than receiving a fax. While several companies are now offering stand-alone e-mail clients, I have yet to see one of those with a lower price tag than a fax machine. But even if their price tags were the same, an e-mail station requires that the owner not only pay a monthly fee for a telephone line but also a second monthly fee for the e-mail account itself.
Of course not even an end client is enough to receive an e-mail. The e-mail account you would be paying for is maintained on a very large (and very expensive) e-mail server, complete with its dedicated (and pricey) connection to the internet. There is simply nothing comparable to an e-mail server in the faxing domain. While a bank of fax machines doesn't cost more than the price of the machines and their associated telephone lines, the price of a dedicated e-mail server and the associated connections can easily resemble that of a small car.
So why is it that the FCC is given free rein to crack down on junk faxers but spammers are free to consume valuable equipment they do not own?
If you are familiar with the AP articles I mentioned earlier you will know that spam is steadily eliminating the usefulness of e-mail itself. It has been estimated that spam accounts for up to 80% of the e-mail traffic to major e-mail domains such as Hotmail and Yahoo, a problem that their respective owners are all but powerless to fix. As more and more internet resources are tied up by these advertisements, the owners of these resources have had to resort to cutting off offending service providers from the rest of the internet entirely. Customers are finding themselves unable to use the internet access they have paid for simply because another customer of that same provider is abusing theirs.
But even then the providers are powerless to drop spammers. Spammers in the recent AP articles have proudly boasted of the way they outright defraud unsuspecting internet service providers when signing up for an account. And when the provider threatens action, the spammer threatens the provider with legal action. In recent months a spammer was even successful in receiving a legal injunction against their service provider, preventing the provider from stopping the spammer from abusing their resources.
I have little problem with receiving advertisements through the U.S. Postal Service. I know that the delivery cost for every article in my mailbox has been entirely paid by the sender. And while I am not happy with the current situation with telemarketers (I must pay for local telephone service before I have the "privilege" of being contacted by telemarketers), I must grudgingly admit that the state and federal laws designed to restrict telemarketing have been mostly successful. But I am not happy about paying several thousand dollars for a computer and $20.00 a month simply to have my e-mail account flooded to capacity with advertisements for products and services I have no interest in (and preventing legitimate e-mail from reaching me in the process). I am sure that you yourself have been bombarded with advertisements for websites featuring "nasty teens" or "penis enhancement." I have noticed that your office no longer maintains an e-mail address accessible to the public.
The examples of spam I mentioned in the last paragraph bring me to another point: I have noticed on your website your stated commitment to enforcing decency laws on the internet, to protecting children from access to objectionable material on the internet. It should be obvious by now to even the most casual of internet users that the biggest offender in this area is the spammer. While a user must actively attempt to locate a website in order to find such material on the world wide web, the mere existence of an e-mail account all but guarantees that the owner will have such material delivered to them on a daily (if not hourly) basis.
In my opinion the solution to this problem is very simple: expand 47 U.S.C. § 227 to prohibit unsolicited e-mail advertisements in exactly the same way it prohibits unsolicited fax advertisements. Nothing more, and certainly nothing less.
I have seen some ineffective bills drift through both houses of Congress that are written to allow unsolicited messages so long as they have an "opt-out" mechanism. Ignoring the fact that such legal loopholes would essentially negate the law entirely (can you prove that you tried to opt out?), it quite literally sickens me the way some of your fellow members of Congress feel that spam is somehow an issue dealing with the freedom of speech. The mere existence of the internet and the supposed changes it has on how business and the legal system work (even though such "changes" have been shown to be a lie) have helped to convince these poor fools that people should somehow have a right to use and abuse the property of others. Does my neighbor have the constitutional right to break my kneecap so long as they provide me with the ability to "opt out" of future kneecappings?
The United States Constitution guarantees that all citizens are free to say what they want. It does not guarantee a soapbox upon which they can say it. Just as I am not guaranteed the right to have a billboard on Interstate 10, spammers should not have the "right" to use the resources of others simply because they're there.
Expanding 47 U.S.C. § 227 to include e-mail is an extremely important issue to me, and given the interests stated on your website I hope it is an important issue to you as well. I know that you are up for re-election this November and I intend to find out how your competitors feel on the issue too.
His algorithm works because spam uses the same repetitive syntax. Because so many spam/emails are sent out - it can be flagged by pattern recognition... based on the assumption that it is written in English!
Huh? Where do you get that? The algorithm has NO KNOWLEDGE of syntax or structure. It knows only the presence (or absence) of words in the message, nothing of how they are grouped, positioned, ordered, related, structured, etc. There is zero grammar / pattern recognition as far as I can tell. As long as your corpus or database of reference mail is in the same language as the emails you wish to test, then the algorithm would work just fine. Perhaps you were thinking it used Markov chains?
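To make the "presence of words only" point concrete, here is a stripped-down sketch in Python. It is a simplification for illustration, not the article's exact recipe: the 0.01/0.99 clamping, the 0.5 fallback for a message with no known words, and the tiny training sets are all assumptions of mine. Each message is reduced to a set of words, so order and grammar cannot matter.

```python
import re
from math import prod


def words(text):
    """A message is reduced to the SET of words it contains: no order, no grammar."""
    return set(re.findall(r"[a-z0-9$'-]+", text.lower()))


def train(spam_msgs, ham_msgs):
    """Per-word spam probability from how many messages of each kind contain the word."""
    spam_df, ham_df = {}, {}
    for counts, msgs in ((spam_df, spam_msgs), (ham_df, ham_msgs)):
        for m in msgs:
            for w in words(m):
                counts[w] = counts.get(w, 0) + 1
    probs = {}
    for w in set(spam_df) | set(ham_df):
        s = spam_df.get(w, 0) / len(spam_msgs)
        h = ham_df.get(w, 0) / len(ham_msgs)
        probs[w] = min(0.99, max(0.01, s / (s + h)))  # clamp away from 0 and 1
    return probs


def spam_probability(text, probs):
    """Combine the per-word probabilities naive-Bayes style."""
    ps = [probs[w] for w in words(text) if w in probs]
    if not ps:
        return 0.5  # nothing known about any word in this message
    spammy = prod(ps)
    hammy = prod(1 - p for p in ps)
    return spammy / (spammy + hammy)


spam = ["click here for a free prize", "free prize inside click now"]
ham = ["lunch at noon tomorrow", "the meeting notes are attached"]
probs = train(spam, ham)
print(spam_probability("prize free click", probs) ==
      spam_probability("click free prize", probs))
```

Shuffling the words of a message gives exactly the same score, and nothing in the code cares which language the words are in, as long as the training mail is in that language too.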
I'm not sure if I'd characterize Haskell as an aborted brain child. Some people use Haskell. Some people like it. At a lot of schools in the US at least, they teach Scheme, when all the students/faculty have "accepted" C, C++, and Java as "superior" for teaching. Which is blatantly bullshit. Algol-kid languages suck, we all know that. (heh, couldn't help it) But the point still stands.
Working toward a usable PDA environment in the spirit of Newton OS: Dynapad
Anytime someone asks for my e-mail address, it's their_business_name@conesus.com or their_personal_name@conesus.com.
If I ever get spam at one of those addresses, I can block the address, and go to the site in question and change my address to something else.
But the coolest part is if anybody sends a mass-email to me and my buds, they usually include a personal_message_to_me@conesus.com.
Don't eat your soul to fill your belly.
conesus.com
The "foreign language" Spam that I get gets nicely refiled by Ifile into my Spam/Foreign folder.
That folder has a corpus of messages variously written in Han, French, Kanji, Korean, Finnish, Spanish, and Russian, and Ifile recognizes that words in those languages are evidence that a message belongs in that folder.
Ultimately, it all involves human classification:
I go through them, and read them, perhaps just browsing titles when I see that spam seems appropriately filed.
By leaving the messages in the folder, I indicate that they were correctly filed, and should become part of the corpus.
When a message is misfiled, that involves human intervention as I move it to where it should have been.
Note that Ifile is useful for filing good messages, not merely for throwing away spam.
Indeed, the more that you use Bayesian filtering for, the more folders with distinctive kinds of message that you have, the better it gets at discriminating where messages should go. I don't have one "Spam" folder; I've got about 8 for different sorts of spam. I don't have one 'inbox' for all my "good" mail; the mail gets thrown into a veritable huge chasm of mail folders. The more there are, the better.
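A toy version of the many-folders idea, in Python. To be clear, this is not Ifile's code or its exact algorithm, just a small multi-class naive Bayes sketch with add-one smoothing. The class name, folder names, and training text are invented for the example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z0-9']+", text.lower())


class FolderClassifier:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # folder -> word -> count
        self.msg_counts = Counter()              # folder -> messages seen
        self.vocab = set()

    def train(self, folder, text):
        """Call for every message left in (or moved into) a folder."""
        tokens = tokenize(text)
        self.word_counts[folder].update(tokens)
        self.msg_counts[folder] += 1
        self.vocab.update(tokens)

    def rank(self, text):
        """Folders ordered from most to least likely home for this message."""
        tokens = tokenize(text)
        total = sum(self.msg_counts.values())
        scores = {}
        for folder, n in self.msg_counts.items():
            counts = self.word_counts[folder]
            denom = sum(counts.values()) + len(self.vocab)  # add-one smoothing
            score = math.log(n / total)
            for t in tokens:
                score += math.log((counts[t] + 1) / denom)
            scores[folder] = score
        return sorted(scores, key=scores.get, reverse=True)


c = FolderClassifier()
c.train("Spam/Foreign", "gagnez un prix gratuit maintenant")
c.train("Spam/Prizes", "you have won a fabulous vacation click here")
c.train("Lists/Lisp", "macro expansion in common lisp closures")
print(c.rank("un prix gratuit")[0])
```

Leaving a message where it was filed, or moving it somewhere else, both amount to another `train` call on the final folder, which is where the human classification comes in.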
If you're not part of the solution, you're part of the precipitate.