A Timeline Of Spam And Antispam
Haak writes "American Scientist has a fine article by Brian Hayes summing up the history of spam and proposed measures to deal with it." A shorter article along the same lines is running at The Economist.
Not that we shouldn't pursue anti-spam countermeasures, but spam will never fully go away. It's like warez, it's like MP3s, it's like drugs, it's like this, that and everything. You can try, but you'll never really get a hold on it. Minimise it as much and as conveniently as you can, but as soon as you start spending ages trying to outlaw it, you will find you've wasted more time than it would have taken to delete the spam and move on with your life.
Anti-spam activists go to a lot of trouble to help locate and identify the people and groups responsible for flooding the net with spam (or who provide spamware to misinformed laypeople). These same do-gooders are often sought out by spammers, sued by groups of them, and have their privacy invaded (release of home phone number and address) in an effort to scare them into shutting up.
I am not kidding here. Take a look at some of the projects that scare the hell out of professional spammers:
Spamhaus keeps an exhaustive list of major spam operations.
SPEWS lists areas of the Internet that have frequently been used for spamming, including detailed evidence files and histories of ISPs that turn a blind eye to spam.
Spamware vendor list has a listing of sites that sell spamming software -- without which we would have little or no spam.
First I ignored it. This worked for a while, but my patience didn't grow nearly as fast as the spam volume (I've been on the net for years, so I remember when spam was a rare occurrence). These are only the major things. I've tried others here and there.
Next I started using MS Outlook's built-in spam catcher. This is basically a blacklist that you maintain and can easily add things to. This actually worked somewhat well, but as the use of forged addresses (and just plain random ones) grew, it became less effective.
Next I started to use SpamNet. I used this up until about last week. It used to be somewhat effective, and in the last month or so it has been almost completely effective. This is the most wonderful anti-spam device I've used. It was great near the end of the beta. But now it's out of beta, and I'm not going to pay $5 a month to stop something I shouldn't get in the first place. Sorry, Cloudmark.
When SpamNet started, it was pretty effective, but still left a decent amount to be desired. So I searched around and found SAProxy. This program lets you run SpamAssassin on Windows, and the combination of this and SpamNet worked wonders. As SpamNet got better, SAProxy became more or less useless.
Unfortunately, I had to get rid of SpamNet, due to the aforementioned monthly fee. So now all I have is SAProxy. It does work great, and it does get better with each new release. Now only about 3 messages a day get through, which is quite fantastic. Only 5% or so of the spam I get gets through. I could set the limit lower (to catch more spam), but right now I don't have to worry about it catching ham (it never has for me) and I don't want to have to start wading through my spam folder to check for ham. I thought I was using this stuff to not have to do that in the first place?
So in short, I'm now using SAProxy and quite happy. If there was a free version of Spamnet, I'd use it, but there isn't. If you're on Windows and have a supported e-mail client, get SAProxy, and save yourself a huge headache.
So what will I use next? I've been thinking of setting up a perl script to automatically find the home address of people who spam me and sending them a few ICBMs with notes attached like "HOW TO WIN AT EBAY WITH FREE CHEAP ICBMS THAT INCREASE YOUR SEXLIFE AND GROW HAIR."
Comment forecast: Bits of genius surrounded by a sea of mediocrity.
According to This Site, the earliest spam was sent by DEC in 1978.
Einar Stefferud, a longtime net hand, reports that DEC announced a new DEC-20 machine in 1978 by sending an invite to all ARPANET addresses on the west coast, using the ARPANET directory, inviting people to receptions in California. They were chastised for breaking the ARPANET appropriate use policy, and a notice was sent out reminding others of the rule.
Interestingly, a young Richard Stallman argued that spammers had every right to send spam.
I'm not Seth.
Simple.
Money.
Mitnick's foes' lawyers claimed billions of dollars (that's lawyer dollars, not real dollars, of course) of damage to the people padding the politicians' pockets.
When spam gets there, we could count on the jack-booted thugs raiding a place or two in the night. Unfortunately, the spammers are getting richer, and trying to make laws that favor them...
For us carnivores, "Sucking the marrow out of life" isn't a transcendentalist philosophy but a practical instruction.
First off, the article is WAY behind the times on anti-spam techniques. SpamAssassin's statistical techniques far outstrip the simplistic features discussed. For example, it mentions obfuscation techniques, and yet SA is known to detect almost all of them one way or another, and even when it doesn't it catches the mail because it's in Razor2, comes from a BLed site, has obviously forged bits, doesn't look like valid mail to Bayes, etc, etc, etc.
Second, the article is also a bit naive on several points regarding blacklists. Many blacklists are good and useful, many are not. But taken as a whole, they present a spectrum of data that can be interpreted through a number of classical techniques that are applied to noisy data sources. Trusting any one BL or a small list is almost always a mistake, you need to build a sample set and determine who you trust and how much. SA does this, but it would be easy enough to build a BL-only SA-like tool for high-speed analysis on high volume ISPs and pipe-providers.
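As a rough sketch of the "determine who you trust and how much" idea above, here is one way to weight blacklists from a hand-labelled sample and sum the weights of the lists that hit. This is not how SA does it internally; the list names, counts, and threshold are all made up for illustration.

```python
import math
from typing import Dict, Iterable

def estimate_weight(spam_hits: int, spam_total: int,
                    ham_hits: int, ham_total: int) -> float:
    """Log-likelihood ratio of a blacklist hit, with add-one smoothing.

    A list that flags spam far more often than ham gets a large positive
    weight; a list that flags both equally often gets a weight of zero.
    """
    p_hit_given_spam = (spam_hits + 1) / (spam_total + 2)
    p_hit_given_ham = (ham_hits + 1) / (ham_total + 2)
    return math.log(p_hit_given_spam / p_hit_given_ham)

def blacklist_score(hits: Iterable[str], weights: Dict[str, float]) -> float:
    """Sum the weights of every known list that flagged the sender."""
    return sum(weights.get(name, 0.0) for name in set(hits))

# Hypothetical sample: (spam hits, spam total, ham hits, ham total) per list.
SAMPLE = {
    "list_a": (90, 100, 1, 100),   # hits most spam, almost no ham
    "list_b": (50, 100, 40, 100),  # noisy: hits ham nearly as often
    "list_c": (10, 100, 10, 100),  # useless: no better than chance
}
WEIGHTS = {name: estimate_weight(*counts) for name, counts in SAMPLE.items()}
THRESHOLD = 3.0  # assumed cut-off; tune against your own mail

def is_probably_spam(hits: Iterable[str]) -> bool:
    return blacklist_score(hits, WEIGHTS) >= THRESHOLD
```

With these invented numbers, a hit on list_a alone is enough to cross the threshold, while hits on list_b and list_c together are not, which is the point of not trusting every list equally.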
I'm getting worried that the problem of spam eradication is starting to look like the most divisive problem the net has faced to date. There are an awful lot of angry people, and those pitchforks and torches are starting to point in some very "infrastructurish" directions. Articles like this one really don't help much....
Oh, I dunno. Fax SPAM was effectively stopped by law; is there any reason to believe that an effective Federal law won't work to at least reduce the volume?
Larry Lessig's proposal for a law, which is actually being introduced by my own Representative, Zoe Lofgren, may very well reduce the flow.
I would like to see that law include provisions for going after companies that hire spammers, rather than just the spammers themselves. I don't believe that there is such a provision in the current proposal, but it's been a few weeks since I read it, so I might be wrong. But that might be a helpful addition, if it's not already there.
Finally, I read recently that there are only about 180 major spammers responsible for most of the spam we get. 180 people is not an impossible number to arrest, charge, and shut down. The remaining bit players will probably dry up if the major guys and gals are gone...
I agree with you completely.
However, I did see one paper on this which was submitted to the IETF ASRG which was pretty neat on relatively new methodologies to eliminate spam.
You can find it here - Eliminating Spam: Protocol and Infrastructure Changes
I too have noticed that the vast majority of spammers now seem to forge the HELO/EHLO greeting. And as most non-spammers don't, this is actually a wonderful way to catch them. I've even seen them send the IP address of my secondary mail gateway in hopes that my primary mail server would fully trust it (obtained probably by looking up my MX records). I run a mail gateway for a corporate domain and get on average 30 to 40 thousand spams per day. Using sendmail with its milter programming interface, I put the HELO greeting through a very strict check. For those contemplating doing the same...
One last note about forged AOL spam, after talking to one of their postmasters: all their legitimate mail, by corporate policy, is always sent from within the *.aol.com or *.aol.net domains. This will show in both the HELO and a reverse DNS lookup of the connecting IP address. If you don't see this in the HELO and DNS but you see a MAIL FROM for aol.com, it's probably spam.
I wish more big ISPs would provide public information about how to better detect forged mail claiming to come from their sites. For instance, if I see a MAIL FROM *@yahoo.com, should the connecting IP address always be from a *.yahoo.com host? Some ISPs, like Hotmail, seemingly always add a known, predictable header whose absence indicates spam. But I can't reliably make these calls unless the ISPs provide that information. Also, beware that some semi-legitimate sites, like Monster.com, forge the sending address on purpose; so if you want to receive resumes you may need to whitelist them.
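For anyone who wants to try the two checks described above (a HELO that claims to be one of your own hosts, and the AOL rule), here is a minimal sketch in plain Python. It is not milter code and not the poster's actual filter; the function name, arguments, and example addresses are all hypothetical, and the reverse DNS name is assumed to have been looked up already.

```python
from typing import List, Optional, Set

AOL_SUFFIXES = (".aol.com", ".aol.net")

def _in_aol(host: Optional[str]) -> bool:
    """True if host looks like something.aol.com or something.aol.net."""
    if not host:
        return False
    return host.lower().rstrip(".").endswith(AOL_SUFFIXES)

def helo_problems(helo: str, rdns: Optional[str], mail_from: str,
                  own_addresses: Set[str]) -> List[str]:
    """Return reasons to distrust this SMTP session (empty list = none)."""
    problems = []

    # An outside host has no business greeting us with our own name or IP.
    # Address literals arrive as "[192.0.2.10]", so strip the brackets.
    if helo.strip("[]").lower() in own_addresses:
        problems.append("HELO claims one of our own addresses")

    # AOL rule from the post: genuine aol.com mail shows an AOL name in
    # both the HELO and the reverse DNS of the connecting IP.
    sender_domain = mail_from.strip("<>").rsplit("@", 1)[-1].lower()
    if sender_domain == "aol.com" and not (_in_aol(helo) and _in_aol(rdns)):
        problems.append("MAIL FROM aol.com without AOL HELO and rDNS")

    return problems
```

Called as helo_problems("[192.0.2.10]", "host.example.net", "a@example.org", {"192.0.2.10"}), it reports the forged-own-address problem; a session whose HELO and reverse DNS are both under aol.com with an aol.com sender comes back clean.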
I personally advocate Bayesian filtering along with some simple keyword filters that contain mostly known spamvertised domains or spam sources. If the list is kept up-to-date, that helps.
It's been a few months now, and it's gotten worse. Much of my spam seems to be one-liners like "Here's that URL we were looking for: ..." Others contain mis-spellings in common spam-related words, and slip by the filters.
First, with a sufficiently large corpus the mis-spellings shouldn't slip through. The fact that they slip through means your Bayesian filter is still "learning." At some point, "VIAGRA" might be a 98% chance of spam but V1AGRA will essentially be a 100% chance of spam. The mis-spellings often make it easier to detect spam with confidence and the rest of the email should generally be enough to let Bayesian calculate a good spam percentage.
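To make the VIAGRA vs. V1AGRA point concrete, here is a stripped-down sketch in the spirit of Paul Graham's token scoring. It leaves out his bias toward ham and his "most interesting tokens" cut, and the tiny corpus, the 0.01/0.99 clamps, and the 0.4 default for unknown tokens are assumptions chosen for illustration.

```python
from collections import Counter
from math import prod
from typing import Dict, List

def token_probs(spam_msgs: List[str], ham_msgs: List[str]) -> Dict[str, float]:
    """Per-token spam probability from the fraction of messages containing it."""
    spam_counts = Counter(t for m in spam_msgs for t in set(m.lower().split()))
    ham_counts = Counter(t for m in ham_msgs for t in set(m.lower().split()))
    n_spam = max(len(spam_msgs), 1)
    n_ham = max(len(ham_msgs), 1)
    probs = {}
    for tok in spam_counts.keys() | ham_counts.keys():
        s = spam_counts[tok] / n_spam
        h = ham_counts[tok] / n_ham
        # Clamp so that no single token is ever treated as absolute proof.
        probs[tok] = min(0.99, max(0.01, s / (s + h)))
    return probs

def spam_probability(message: str, probs: Dict[str, float],
                     unknown: float = 0.4) -> float:
    """Combine the token probabilities with the naive Bayes product rule."""
    ps = [probs.get(t, unknown) for t in set(message.lower().split())]
    spam = prod(ps)
    ham = prod(1.0 - p for p in ps)
    return spam / (spam + ham)

# Made-up corpus: "viagra" shows up in a friend's joke, "v1agra" never does.
SPAM = ["buy viagra now", "cheap v1agra now", "v1agra deals"]
HAM = ["lunch with fred now", "viagra joke from fred", "meeting notes"]
PROBS = token_probs(SPAM, HAM)
```

In this toy corpus the correctly spelled word is only weak evidence because it also appears in legitimate mail, while the mis-spelling has only ever appeared in spam and so sits at the upper clamp.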
The one-liners can be caught by improving the Bayesian filter itself: perhaps a new characteristic considered by Bayesian is "Is the message 1 line long?" or "Is the message 2 lines long?" or "Is more than 40% of the body of the message used to convey an HTTP address?" Things like this are valuable characteristics that will help Bayesian catch even 1-liners. Perhaps 90% of your 1-line messages that have an http reference in them are spam--that's something Bayesian can work with.
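One way to feed characteristics like these to a Bayesian filter is to turn them into pseudo-tokens that get counted alongside the ordinary words. A minimal sketch follows; the token names and the URL pattern are made up, and the 40% figure is just the one from the paragraph above.

```python
import re
from typing import List

# Rough pattern for an HTTP address; real mail would need something sturdier.
URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)

def shape_tokens(body: str) -> List[str]:
    """Describe the shape of a message body as extra filter tokens."""
    tokens = []

    # Count only non-blank lines so trailing newlines don't matter.
    lines = [ln for ln in body.splitlines() if ln.strip()]
    if len(lines) == 1:
        tokens.append("*shape:one-line*")
    elif len(lines) == 2:
        tokens.append("*shape:two-lines*")

    # Fraction of the body taken up by HTTP addresses.
    text = body.strip()
    url_chars = sum(len(u) for u in URL_RE.findall(text))
    if text and url_chars / len(text) > 0.4:
        tokens.append("*shape:mostly-url*")

    return tokens
```

A body consisting of nothing but a link yields both the one-line and mostly-url tokens, while an ordinary three-line note yields none. The filter then learns from your own mail how damning each shape really is.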
Marking the ones that slip through as Junk causes more problems with false-positives.
Really shouldn't.
Plus, it's fairly easy for a spammer to tweak his message against a relatively common corpus. I believe that most people would come to the same conclusions as to whether or not something was spam -- and thus an "average" corpus is trivial to create, and tweak your spam against.
While pretty much everyone will agree on what IS spam, not everyone will agree on what isn't--and that's what's great about Bayesian. Sure, they might avoid the word "Viagra" or "slut" but the headers themselves can be damning, the fact that they have 15 images being loaded off an external site is damning, and the fact that a message with a 60-character body consists of a 30-character HTTP address is also probably damning. They're not going to know I have a best-friend named Fred (which is something that will lower the spam score when it is found in my email). As Paul Graham said, if spammers have to stop using all the words (viagra, porn, slut, etc.) and techniques (images loaded from external servers) that they are using to make their pitch, they're going to be significantly limited in what they can say.
If it gets to a point where they totally mangle their emails with SMS-like substitutions to convey their message, you can also add new characteristics for Bayesian: "Are more than 40% of the tokens unknown?" "Are more than 50% of the tokens unknown?" You can assume that if you have a halfway-decent corpus and more than X% of the tokens in an incoming message are unknown, that may be a good indication of a spammer trying to use mangled words to get their message across.
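The unknown-token idea can be sketched the same way: measure how much of a message the corpus has never seen and emit a pseudo-token for each threshold crossed. The function names, token format, and sample vocabulary below are hypothetical; the 40% and 50% thresholds come from the paragraph above.

```python
from typing import Iterable, List, Set

def unknown_ratio(message: str, known_tokens: Set[str]) -> float:
    """Fraction of whitespace-separated tokens not present in the corpus."""
    tokens = message.lower().split()
    if not tokens:
        return 0.0
    unknown = sum(1 for t in tokens if t not in known_tokens)
    return unknown / len(tokens)

def unknown_token_features(message: str, known_tokens: Set[str],
                           thresholds: Iterable[int] = (40, 50)) -> List[str]:
    """One pseudo-token per percentage threshold that the message exceeds."""
    percent = unknown_ratio(message, known_tokens) * 100
    return [f"*unknown>{t}%*" for t in thresholds if percent > t]

# Hypothetical vocabulary learned from earlier mail.
KNOWN = {"the", "meeting", "is", "at", "noon"}
```

A normal message built from known words produces no features, while something like "ch34p m3dz 4 u at noon" has four unseen tokens out of six and trips both thresholds. As with any other token, whether that actually indicates spam is left for the filter to learn.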
Sure, Bayesian as proposed by Paul might not be the final solution. But the countermeasures that spammers use will end up being such that the simple use of those countermeasures will probably be something which can be considered a characteristic of the message which will be further used in identifying it as spam.
In my opinion, the trick will be keeping Bayesian "up to date" in terms of identifying new characteristics that can be used to identify spam. For now, tokens in the message are sufficient.