Spam Detection Using an Artificial Immune System
rangeva writes "As anti-spam solutions evolve to limit junk email, the senders quickly adapt to make sure their messages are seen. An interesting article describes the application of an artificial immune system model to effectively protect email users from unwanted messages. In particular, it tests a spam immune system against the publicly available SpamAssassin corpus of spam and non-spam. It does so by classifying email messages with the detectors produced by the immune system. The resulting system classifies the messages with accuracy similar to that of other spam filters, but it does so with fewer detectors."
I have to admit, I don't see the need for these recent whizbang additions to the spam-fighting repertoire. Sure, they might be ingenious, but on a practical level they don't do anything more than a properly-configured SpamAssassin system. I used to get a lot of spam coming through a default installation of SpamAssassin, but after spending some time with O'Reilly's book (the free docs may already be up to this level of reader-friendliness, it's been a couple of years) and tweaking my installation, I get spam once in a blue moon. There's just no need for anything more.
So now we can look forward to a spam filtering solution that actively searches for spammers and kills them?
I Am My Own Worst Enemy
I think this is a very useful new anti-spam tool, but as usual, it will have the possibility of false positives, which can be very damaging. And spammers will adapt to this technology as well, reducing its effectiveness.
Be who you are and say what you feel, because the people who mind don't matter, and the people who matter don't mind.
Not that I'm arguing that it's the same, rather I'd like to know:
What separates this from a Bayesian filter?
Business \Busi"ness\, n.;
A scam in which all people involved perceive as beneficial...
Ever heard of hay fever? Allergies? Think, people, think! charon
It looks fancy but when you get down to it, all it means is that there are a number of heuristics that are combined into filters (this happens by user training.) The filters are 'weighted' and filters that are not used often enough are 'culled' (killed off.) I don't think this will be significantly better than any other Bayesian-type spam systems.
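The weighting and culling described above could look roughly like the sketch below. It only illustrates this comment's summary, not the article's actual algorithm; the class names, the regex-based detectors and the `max_idle` cutoff are all assumptions made up for the example.

```python
import re


class Detector:
    """One heuristic pattern with a trained weight (hypothetical sketch)."""

    def __init__(self, pattern):
        self.regex = re.compile(pattern, re.IGNORECASE)
        self.weight = 0.0
        self.idle = 0  # training messages seen since this detector last matched


class DetectorPool:
    def __init__(self, patterns, max_idle=3):
        self.detectors = [Detector(p) for p in patterns]
        # Assumption: a detector is culled after `max_idle` misses in a row.
        self.max_idle = max_idle

    def train(self, message, is_spam):
        """User training: matching detectors are weighted up on spam, down on ham."""
        for d in self.detectors:
            if d.regex.search(message):
                d.weight += 1.0 if is_spam else -1.0
                d.idle = 0
            else:
                d.idle += 1
        # Cull detectors that are not used often enough.
        self.detectors = [d for d in self.detectors if d.idle < self.max_idle]

    def score(self, message):
        """Sum the weights of every surviving detector that matches."""
        return sum(d.weight for d in self.detectors if d.regex.search(message))
```

A message would then be flagged when `score()` crosses some threshold, which is the same shape as summing token evidence in a Bayesian filter.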
You can't handle the truth.
Ultimately, very little. At core, they're probably identical techniques, and if I were reviewing this as a scientific paper I'd ding them for not answering exactly that question. There are such strong parallels between the two (train them on known data, add up probabilities, cut stuff on a threshold) that I strongly suspect that they're identical.
There are useful things to be gained from a change of metaphor. For example, one difference between this and most Bayesian spam filter implementations is that this explicitly incorporates a decay function. That could be useful, if a word that used to be common in spam no longer is (e.g. if I actually decided to buy a Rolex, it's no longer a strong spam indicator, whereas right now any email mentioning "Rolex" is 99.9999% certain to be spam).
You could easily modify a Bayesian filter to have time-decaying weights, but if the change in metaphor leads somebody to come up with a good insight, then perhaps this is useful. Mathematically, though, the equations look very similar.
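For what it's worth, time-decaying weights might be sketched like this. It is a minimal illustration under stated assumptions, not the article's decay function or any real filter's implementation: the class name `DecayingTokenStats`, the 30-day half-life and the smoothing constants are all invented for the example.

```python
import math


class DecayingTokenStats:
    """Per-token spam/ham counts that fade exponentially with age (hypothetical)."""

    def __init__(self, half_life_days=30.0):
        # Assumption: a count loses half its weight every `half_life_days`.
        self.decay_rate = math.log(2) / half_life_days
        self.counts = {}  # token -> [spam_weight, ham_weight, last_update_day]

    def _decayed(self, token, day):
        spam, ham, last = self.counts.get(token, [0.0, 0.0, day])
        factor = math.exp(-self.decay_rate * (day - last))
        return spam * factor, ham * factor

    def train(self, token, is_spam, day):
        """Age the stored counts to `day`, then add one observation."""
        spam, ham = self._decayed(token, day)
        if is_spam:
            spam += 1.0
        else:
            ham += 1.0
        self.counts[token] = [spam, ham, day]

    def spam_probability(self, token, day):
        spam, ham = self._decayed(token, day)
        # Smoothing so unseen or fully decayed tokens drift back to 0.5.
        return (spam + 0.5) / (spam + ham + 1.0)
```

With this, a token like "rolex" that was trained as spam long ago slides back toward a neutral 0.5 unless fresh spam keeps reinforcing it.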
Spam and content filtering will always be a struggle for anybody who actually utilizes email. Simply adding more logic will not solve the problem. Reporting spammers to every RBL you can think of, and alerting forums and newsgroups to abusive IP blocks, on the other hand, is already doing quite nicely.
I recently gave up on tweaking filters for myself and a few dozen people whose accounts I administer. I wrote a little script that asks for confirmation from the sender...if the sender confirms, they are added to a whitelist and will go straight through after that. I can also add addresses manually to the whitelist, and will soon be able to have wildcard (domain-wide) approved addresses. I've gotten exactly two spam in 6 weeks...both were confirmed by either a person or an autoresponder. Five years ago I never would have wanted such a blunt system...nowadays it's just the ticket.
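A confirmation-plus-whitelist setup like the one described above might be sketched as follows. This is not the poster's script; the class name `ConfirmationGate`, its methods and the `*@domain` wildcard syntax are assumptions for illustration, and actually sending the confirmation request is left out.

```python
import fnmatch


class ConfirmationGate:
    """Hold mail from unknown senders until they confirm (hypothetical sketch)."""

    def __init__(self):
        self.whitelist = set()  # exact addresses, lowercased
        self.patterns = []      # wildcard entries such as "*@example.org"
        self.pending = {}       # sender -> messages held awaiting confirmation

    def allow(self, address):
        """Manually add a single address to the whitelist."""
        self.whitelist.add(address.lower())

    def allow_pattern(self, pattern):
        """Add a domain-wide (wildcard) approval."""
        self.patterns.append(pattern.lower())

    def is_known(self, sender):
        sender = sender.lower()
        return sender in self.whitelist or any(
            fnmatch.fnmatchcase(sender, p) for p in self.patterns
        )

    def receive(self, sender, message):
        """Return 'deliver', 'challenge' (first contact) or 'hold'."""
        if self.is_known(sender):
            return "deliver"
        key = sender.lower()
        first_contact = key not in self.pending
        self.pending.setdefault(key, []).append(message)
        return "challenge" if first_contact else "hold"

    def confirm(self, sender):
        """Sender answered the challenge: whitelist them and release held mail."""
        key = sender.lower()
        self.whitelist.add(key)
        return self.pending.pop(key, [])
```

Returning "hold" for repeat messages avoids sending a second challenge to the same unconfirmed address.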
Evil is the money of root.
The "immune system" solution is just another way to detect spam, but it is unlikely to be much more successful than existing methods. As someone else pointed out, SpamAssassin is pretty good already. So what if this new type of filter eventually improves the spam filtering accuracy from 98% to 99%? A more highly-polished rock is still a rock.
The real problem is the sending of spam itself, and that problem arises from an inability to correctly attribute the spam to the spammers. If we can do that, we can block it, or at least better convict the spammers who violate the law. Things that solve this problem, like Yahoo!'s "DomainKeys", are the future of anti-spam, not more highly-polished rocks.
oh snapz terminator coming soon D:
Now your spam filter can catch AIDS too. But don't ask how.
I'm waiting for the day when we see our first email 'virus'. Something not unlike what happens with real viruses. Then we'd need antibodies similar to this.
AccountKiller
They claim to be as accurate as a Bayesian process, but with fewer check items.
But from their paper, it seems that they're "tuning" their check items to the corpus of spam that they're testing against.
So of course they will use fewer check items. There are a finite number of characteristics of that corpus.
I did not see where they were using their system in a Real World environment (I may have missed it, the article was pretty painful to read). Now, if they can do as well as a fully tuned SpamAssassin system (comparable true positives, true negatives, false positives and false negatives), in a Real World environment, with fewer check items, then they MAY be on to something.
I have two major objections to this idea, and to the article that presents it.
1. The ONLY problem this solves is performance -- i.e., processing throughput. And that's not what's wrong with anti-spam systems today. They live and die on the precision/accuracy tradeoff, or maybe on UI.
2. The authors seem to assume that Bayesian systems work really, really well. While technically most or all current spam-filtering products are Bayesian in some sense, that still speaks of considerable naivete about real-world spam.
To err is human. To forgive is good system design.
I just had a thought about spelling while reading about the spam filters. So I went and looked in my spam folder and found that every piece of spam has many, many words that are not in a dictionary, i.e., not spelled correctly.
Why not run a script that filters messages based on spelling? If there are more than 'xx' many words that do not exist in the dictionary you choose to use, then the message gets sent to the spam folder. This would catch the odd e-mail from friends who don't know how to spell or what a spell checker is, but then when you clean out your spam folder you should notice it.
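The script suggested above could be sketched roughly like this. It is only an illustration: the function names are made up, the dictionary is whatever word set you choose to pass in, and the default threshold stands in for the unspecified 'xx' and is an arbitrary assumption.

```python
import re


def misspelled_count(text, dictionary):
    """Count words in `text` that are not in `dictionary` (a set of lowercase words)."""
    unknown = 0
    for word in re.findall(r"[a-z0-9']+", text.lower()):
        word = word.strip("'")
        if not word or word.isdigit():
            continue  # ignore stray apostrophes and plain numbers
        if word not in dictionary:
            unknown += 1
    return unknown


def looks_like_spam(text, dictionary, max_unknown=5):
    """Flag the message when it has more than `max_unknown` unknown words."""
    return misspelled_count(text, dictionary) > max_unknown
```

Digits are kept inside words so that obfuscations like "ch3ap" count as one unknown word rather than being split apart.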
Your post advocates a
(x) technical ( ) legislative ( ) market-based ( ) vigilante
approach to fighting spam. Your idea will not work. Here is why it won't work. (One or more of the following may apply to your particular idea, and it may have other flaws which used to vary from state to state before a bad federal law was passed.)
( ) Spammers can easily use it to harvest email addresses
( ) Mailing lists and other legitimate email uses would be affected
( ) No one will be able to find the guy or collect the money
( ) It is defenseless against brute force attacks
( ) It will stop spam for two weeks and then we'll be stuck with it
(x) An enormous amount of spam will initially go undetected before your idea is effective
( ) Users of email will not put up with it
( ) Microsoft will not put up with it
( ) The police will not put up with it
(x) Your idea proposes a solution that only large corporations could deploy
( ) Requires too much cooperation from spammers
( ) Requires immediate total cooperation from everybody at once
( ) Many email users cannot afford to lose business or alienate potential employers
( ) Spammers don't care about invalid addresses in their lists
( ) Anyone could anonymously destroy anyone else's career or business
Specifically, your plan fails to account for
( ) Laws expressly prohibiting it
( ) Lack of centrally controlling authority for email
( ) Open relays in foreign countries
( ) Ease of searching tiny alphanumeric address space of all email addresses
( ) Asshats
( ) Jurisdictional problems
( ) Unpopularity of weird new taxes
( ) Public reluctance to accept weird new forms of money
( ) Huge existing software investment in SMTP
( ) Susceptibility of protocols other than SMTP to attack
( ) Willingness of users to install OS patches received by email
( ) Armies of worm riddled broadband-connected Windows boxes
( ) Eternal arms race involved in all filtering approaches
( ) Extreme profitability of spam
( ) Joe jobs and/or identity theft
( ) Technically illiterate politicians
( ) Extreme stupidity on the part of people who do business with spammers
( ) Dishonesty on the part of spammers themselves
( ) Bandwidth costs that are unaffected by client filtering
(x) The large amount of resources needed for implementation of your idea that small companies don't have
( ) Outlook
and the following philosophical objections may also apply:
( ) Ideas similar to yours are easy to come up with, yet none have ever been shown practical
( ) Any scheme based on opt-out is unacceptable
( ) SMTP headers should not be the subject of legislation
( ) Blacklists suck
( ) Whitelists suck
( ) We should be able to talk about Viagra without being censored
(x) Your solution is nothing more than a conceptual remanifestation of a solution that already exists
( ) Countermeasures should not involve wire fraud or credit card fraud
( ) Countermeasures should not involve sabotage of public networks
( ) Countermeasures must work if phased in gradually
( ) Sending email should be free
( ) Why should we have to trust you and your servers?
( ) Incompatibility with open source or open source licenses
( ) Feel-good measures do nothing to solve the problem
( ) Temporary/one-time email addresses are cumbersome
( ) I don't want the government reading my email
( ) Killing them that way is not slow and painful enough
Furthermore, this is what I think about you:
(x) I think it is a creative concept, but there is no need to reinvent the wheel.
( ) Sorry dude, but I don't think it would work.
( ) This is a stupid idea, and you're a stupid person for suggesting it.
( ) Nice try, assh0le! I'm going to find out where you live and burn your house down!
Falun Dafa is good!
Inflict heavy fines on people buying spamvertised products and execute spammers. Only then can spam be stopped for good.
ELOI, ELOI, LAMA SABACHTHANI!?
Look up "bayes_expiry_max_db_size". If your database gets larger than the limit you set then the lesser used tokens are deleted.
More specifically, it correctly classifies 84% of spam and 98% of non-spam.
The authors used the SpamAssassin corpus. Holden shows that, on the SpamAssassin corpus, Bogofilter correctly classifies 90.3% of spam and 99.88% of non-spam. See http://sam.holden.id.au/writings/spam2/
This approach is nowhere near state of the art.
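To make the gap concrete, the figures quoted above can be turned into error rates. The helper below is just arithmetic on those four percentages; the function name is made up for the example.

```python
def error_rates(spam_caught_pct, ham_kept_pct):
    """Convert 'correctly classified' percentages into the two error rates."""
    return {
        "missed_spam_pct": round(100 - spam_caught_pct, 2),
        "false_positive_pct": round(100 - ham_kept_pct, 2),
    }


immune = error_rates(84, 98)           # figures quoted for the immune system
bogofilter = error_rates(90.3, 99.88)  # figures quoted from Holden's comparison

# How many times more legitimate mail the immune system misfiles.
fp_ratio = immune["false_positive_pct"] / bogofilter["false_positive_pct"]
```

On these numbers the immune system misses 16% of spam against 9.7%, and misfiles 2% of legitimate mail against 0.12%, so it produces roughly 17 times as many false positives.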
Any good programmer worth their salt would have programmed this to cut out their tongue, cut off their fingers one by one, slice off their eyelids and force them to watch "Biodome" 5 times in succession.
I want those fuckers to live painfully damnit, just like the rest of us do when we have too much spam.
"All great wisdom is contained in .signature files"
Sounds like a genetically modified clone of Bayes :-)
Has anybody stopped to think that the human immune system is a little less than perfect? It doesn't stop all diseases, not by a long shot. And sometimes it creates illness, as anybody with Hay Fever — or Multiple Sclerosis — will testify.
Take off and nuke them from orbit. It's the only way to be sure.
- None can love freedom heartily, but good men; the rest love not freedom, but license. -- John Milton
I'm seriously sick of people abusing biological methodologies. People seem very attracted to ideas simply because they are grounded in "how nature works" and ignore the mathematical benefits or weaknesses. Now this idea pretty much just sounds like statistical rules based on a corpus - pretty much how every successful solution out there now works. This solution simply prunes rules that aren't being used, but there are better ways to get a smaller spam detection database. Have you seen the stuff the CRM114 people are doing? This is nothing new.
Read your Russell and Norvig, people. Airplane research didn't get off the ground (ugh) until we stopped trying to mimic birds and study physical principles of flight.
Did you ever notice that *nix doesn't even cover Linux?
lol
an Amazon spammer talking about spam
if you want to paste links to help people, try them without sticking your stupid Amazon referrer code in there
Has anyone come across the newer spam ideas, where the spam message looks so much like a real message that I sometimes have to spend a good few minutes looking at it to see if it's genuine? They use your nickname, e.g. "Dear Bob", and end with the name of someone you know. They are usually about mundane things (e.g. "do you want to come to a party on Saturday?"), and the emails make good sense and have a suitable subject line. The only giveaway is that they all have a TinyURL link to the actual spam site, but how can I tell if a spammer is using TinyURL or if a friend of mine is? The annoying thing is each email has a unique TinyURL, so by clicking on the link they know it's an active address, and I made the mistake of clicking on the first one I got.
One thing that concerns me is how certain fields are filled in, for example my nickname and a friend's name at the bottom. Also, it seems to sometimes use my geographic location (nearest city, presumably from IP location), e.g. "Meet tomorrow in London, UK." I suspect the fields are filled in by some spyware on the PC reading previous emails and analysing them. All these emails appear on my VMware spyware/virus test machine. It's also possible the fields could be filled in by a hack of someone else's mailbox (mail server or PC), because as soon as they've got a mailbox full of email (including headers), they can auto-analyse it to find out nicknames etc. fairly reliably with a decent amount of mail.
How is this even close to news?!
The first paragraph of TFA, even above the abstract:
"This article was published in Crossroads Magazine, November 2004 edition. It was supposed to be on their website, but since it no longer seems to be available, I have provided this copy for reference."
No wonder it's not even near the "state of the art", maybe it was.. back then.
/ AC
from the lymp0cty3z-narf-poit!-claire-said-the-laundry-wheel dept.
Pinky, if I could reach you I would hurt you.
Come on, guys.
Mod parent up. That was an awesome post.
And kind of ironic that the author slipped in some unsolicited politically motivated PR on the Falun Gong as part of his/her message.
Are we still on the message-filtering bandwagon? I know it was all the rage when we talked about it in 2000, but now it's 2006, and we've all had experience with it. Pattern-matching has been defeated, and it was an embarrassing defeat. This is usually a sign to those who proposed it that they should consider a career change. With the exception of those patterns that correspond to firewall rules blocking domains run by companies with names like "Megaultra Webcram Holdings, Inc", it's a dead issue.
The real issue I have is with those researchers and businesses that continue to push this cyber snakeoil. It's getting to the point that e-mail is worthless, not because of the high volume of spam, but because easily confused pattern-matching blockers remove just enough messages to cause major problems for the rest of us. Here is why it's stupid, and should be stopped:
* While contaminated pattern-matching filters don't always block wanted messages, they remove just enough messages to cause doubt and frustration with my users, and those on the other end of the loop. This leads to the network administrator (me) having to individually resolve each problem by sifting through the logs.
* Because the matched-messages are removed on the far end of the transaction, i.e. on the "client side", there's no indication of trouble, or even an error message (to the user or in the logs). Neither party understands where the message has gone, and this reinforces superstition. For years, I whined, teased and scolded to get the attention of the morons who were going gung-ho with client-end filtering for spam and viruses, but they just wouldn't listen.
* ISPs and other service providers have deployed these infernal filters everywhere, making a huge mess which I cannot resolve. It is next to impossible to politely explain the problem is theirs, without having their attention tossed amid a sea of techie jargon. They usually come away with the message, "it is your fault, not ours". I'm fed up dealing with the hostile confrontations that result.
I have a sneaking suspicion that the same morons who thought spam/virus filtering based on pattern-matching the 'From' line was brilliant are the same idiots responsible for the current crop of "security" dud-ware. Do I sound hostile? I am, and these charlatans can go shove it. At this point, I think only the "homeopathic remedy" market has more frauds than the computer industry.
I'm sorry, no matter how graceful the descriptions or the analogies, I will no longer accept content-based pattern-matching filters on e-mail. They have been proven horribly ineffective. Spam-filtering isn't rocket science, okay? First you block any SMTP traffic without a zone pointer, then block large chunks of addresses from underdeveloped countries based on message header sampling. From there, build up a list of UK, US, and Canadian spam-pushers based on their domain registrations. You'll eliminate most of it, and unless you communicate extensively with people in China, Bolivia, Russia or Brazil, you won't have to do much tuning.
This is all incredibly stupid anyway. The solution to the spam problem is not a technological one, or a political one. It's an economic problem. The powers that be chose - in their infinite wisdom - to allocate huge blocks of addresses to largely underdeveloped nations based on populace, instead of demand. Most of these people don't have a network device, and won't have one in the foreseeable future. The value of these addresses is so ridiculously deflated, that they're worth close to nothing. Spammers have massive chunks of address space, and can cycle through millions of IPs before all of them are at risk of being blocked. Want it to stop? Charge a reasonable rate to pass the traffic through your country's network backbone.
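The first filtering step listed above, blocking SMTP traffic "without a zone pointer", reads like a reverse DNS (PTR record) check on the connecting IP. Taking that interpretation as an assumption, a minimal sketch could look like this; the function name and the injectable `resolver` parameter are made up for the example.

```python
import socket


def has_reverse_dns(ip, resolver=socket.gethostbyaddr):
    """Return True if `ip` resolves back to a hostname via a reverse lookup.

    `resolver` defaults to the standard library's reverse lookup and can be
    swapped out, e.g. for testing without network access.
    """
    try:
        hostname, _aliases, _addresses = resolver(ip)
    except OSError:
        # socket.herror and socket.gaierror are both subclasses of OSError.
        return False
    return bool(hostname)
```

A mail server would call this on the client address at connect time and refuse or temp-fail the session when it returns False.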
.45 caliber penicillin, applied directly to the spammer's kneecaps.
The higher the technology, the sharper that two-edged sword.
The idea of applying immune system models to spam and computer virus detection is old. Nobody has so far demonstrated that it is any better than a sound statistical approach, and this paper fails to do so as well. It's junk science.
Here is a better idea: Blue Security was attacked and shut down because the Internet is septic. The germs (spammers) have taken over. The best way to win this is to take the profit out of spamming. This can be done in a similar manner to the way the body's T cells alert the rest of the immune system on how to attack a pathogen. A cryptographically signed spammer complaint (attack) file should be distributed via a peer to peer network protocol. This file is sent amongst complaining programs that complain to a spammer's website each time a spam advertising said website is received.
Like an immune system, this network of spam attack programs will have T cells. The "T cells" will be a small group of people who draw up the complaint instruction file. Whenever the pathogen (spammer) releases enough toxins (spam) into the body (Internet), the T cells (people who write the complaint instruction file) alert the immune cells (spam complaint program) of the presence of the pathogen and how to attack (complain to website advertised) it. The pathogen is overwhelmed with a quick immune response (high bandwidth usage resulting from many, many complaints).
When the cost of running a website surpasses the revenue earned from said website, the website is shut down. When the costs of spamming or advertising via spam exceeds the income, spam stops. Blue Security was beginning to become successful. Too bad they bowed out.
How about a REAL IMMUNE SYSTEM anti-spam filter? I had a dream...
Here's how it works. I catch me a SPAMMER, and have it tested. IFF it is allergic to a common item (ragweed, peanuts, shellfish, etc.), I keep it in the sub-basement. Otherwise, I release it back to the wild and catch me another.
Once SPAMMER is acquired, I put it in a chair, and provide food and water. SPAMMER is given computer, internet access, and is also attached to an allergen device that delivers the substance SPAMMER is allergic to, in controllable quantities.
SPAMMER is given control of the COMPUTER INCOMING SPAM FILTER, and allowed to freely hack on the internet.
If SPAM is delivered, and identified by my userbase, the ALLERGEN DEVICE is activated, releasing a quantity of the ALLERGEN. If a period of time (settable) goes by WITHOUT identified SPAM, the ALLERGEN DEVICE is disabled, with a random delay in the system.
If the SPAMMER is able to capture two additional SPAMMERs, it is removed from service.
Ratboy
Just another "Cubible(sic) Joe" 2 17 3061
Someone should set up an organization where a panel reviews submitted spam emails, and when an email is identified as spam, a program is activated that sends massive quantities of replies, essentially a DoS, to the spammer's computer. After getting bombarded with thousands of requests (that is what they wanted, right?) the hosting server will eventually crash and shut down. How can they complain when you gave them what they wanted? ----- Sig Sauer
Knowing Google's lust for data collection, the Soviet Union is still alive and well inside the psyche of Sergey Brin....
This is a general question: how does a well configured spam filter compare to simple greylisting?
Haven't been able to find any nice graphs that show a direct comparison.
First of all, you can't stop spam. Filtering will always be an imperfect arms race--we build a better filter, the spammers come up with a better way of circumventing it. It's a never-ending battle.
Secondly, you can't end spam. Too many companies rely on its existence for their business model to work.
The only way to stop spam is to stop the spammers from SENDING the stuff. However if this happened, you would see a huge number of companies suffer and possibly go bankrupt. Sure, the organised crime groups behind it would suffer, but I'm thinking of the moderately legitimate companies: Symantec, Tumbleweed, Borderware, and the like make their money from spam and viruses. They cannot afford for these threats to go away! (Well, perhaps Symantec could survive now that they own Veritas.) Now consider the amount of network gear and bandwidth that has been sold and is being consumed by spam, and you realise that even the big gear vendors like Cisco and Nortel have a major stake in spam sticking around.
Zero tolerance of spam might have worked if we had started worldwide in 1996, but that chance is long gone. Furthermore, legislation won't work as long as there are 'safe haven' countries out there who will host spammers' gear.
The only potentially useable answer is true vigilantism--if spammers start consistently showing up dead, we might be able to reduce spam. Failing that, we can give up on email as a useful medium.
In other words; short of serial murders, the spammers have won.
"People who do stupid things with hazardous materials often die." -- Jim Davidson on alt.folklore.urban
A real solution would be end to end authentication and encryption. I wonder why none of the supreme innovators have thought of this yet. But then again the NoSuchAgency wouldn't be able to monitor our inboxes, nor product vendors spam them.
davecb5620@gmail.com
I have two major filtering layers (perimeter & inbox). If the recipient is not known, it's spam, and gets temp-failed. If the sender is not known, it is likely spam, and can only send 1 message per second, or get temp-failed (otherwise, I allow several messages per second). I allow only 2 recipients per envelope (temp-fail overage). Whatever makes it through my perimeter filters gets to the second major layer (inbox). At this layer, if the sender is known, it stays in the inbox, otherwise, it goes into a "new-contacts" folder. This inbox layer, of course, is fully at the discretion of the individual owner. The inbox-owner can scan through this folder for legitimates or spam, report the spam to me (for specific blacklisting), or reply to (and/or add to their addressbook -- making them "known") the legitimates.
Spammers tend to use botnets, and botnets tend to go elsewhere when presented with a temp-fail. Legitimate MTAs keep trying automatically until the message is relayed, or times-out. Spammers tend to have lots of bad addresses; legitimates tend to have very few. Spammers tend to send to more than 2 recipients per envelope. For my environment, legitimates tend to send to only 1 or 2 recipients at a time, but even when they send to more, they keep going (yes, this causes me some extra work for the extra data portions that must be virus scanned) until they're done.
To "know the sender", I evaluate my outgoing mail logs for recipients my customers send to. This is NOT challenge-response. If I don't know you, and you're legitimate, your mail will come through on the first try -- it may just take a bit longer than if I know you already.
Of course, the perimeter layer also does various other filtering (heuristics, content, virus) that may result in the message being quarantined as spam.
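The two layers described above could be sketched as a small policy object. This is not the poster's actual setup: the class and method names are invented, the rate limit for known senders is left out as a simplification, and the heuristic, content and virus checks are not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class PerimeterPolicy:
    """Temp-fail based perimeter rules plus inbox sorting (hypothetical sketch)."""

    known_recipients: set
    known_senders: set                 # e.g. built from outgoing mail logs
    max_recipients: int = 2            # recipients accepted per envelope
    unknown_min_interval: float = 1.0  # seconds between mails from unknown senders
    last_seen: dict = field(default_factory=dict)

    def check_envelope(self, sender, recipients, now):
        """Return a list of (recipient, 'accept' | 'tempfail') verdicts."""
        if sender not in self.known_senders:
            previous = self.last_seen.get(sender)
            self.last_seen[sender] = now
            if previous is not None and now - previous < self.unknown_min_interval:
                # Unknown sender going faster than 1 message per interval.
                return [(r, "tempfail") for r in recipients]
        verdicts = []
        accepted = 0
        for recipient in recipients:
            if recipient not in self.known_recipients:
                verdicts.append((recipient, "tempfail"))  # unknown recipient
            elif accepted >= self.max_recipients:
                verdicts.append((recipient, "tempfail"))  # overage, retry later
            else:
                verdicts.append((recipient, "accept"))
                accepted += 1
        return verdicts

    def inbox_folder(self, sender):
        """Second layer: unknown senders land in a separate folder."""
        return "inbox" if sender in self.known_senders else "new-contacts"
```

Because every rejection is a temporary failure, a legitimate MTA simply retries and gets through, which is what distinguishes this from challenge-response.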