Using AI for Spam Filtering (w/ Source Code)
jarhead4067 writes "Article snippet: "Up until recently, most researchers in the fight against spam have failed to classify it as an artificial living organism, hindering the development of effective tools and techniques to kill it. While this classification may sound strange, consider the following..." A novel approach to filtering spam, and hey, there's free source included."
... after we get an AI to counter the Slashdot effect.
"Enough of this wretched, whining monkey life." -- Marcus Aurelius, _Meditations_, Book 9, 37
Google cache
I won't believe spam is a living organism till I see Marty Stouffer do a special, complete with comedy 'boing' noises and 'aint that cute' music as we watch a mother Spam care for her young.
I dislike spam, in the same way only more than I dislike all the billboards along the highways. They get in the way of what I really want to see, and essentially make me feel inadequate. Billboards make me feel poor, because I can't afford a new home, or a meal at that expensive restaurant. Spam makes me worry that my penis is too small, my breasts are too small, I'm too fat, I don't send enough money to Nigeria. That said, it's illegal to saw down billboards, but it's not illegal to filter spam so I don't have to see it. The article is slashdotted, so I can't read it, but I think we already have good (free and open, no less) spam filtering available. I use Spam Assassin on my server, plus my mail client has a spam filter for double protection. Both have been learning more and more what constitutes spam, and it's rare that I even see spam anymore. If everyone would use these filters, spam would no longer be as profitable.
The emperor is naked.
And the AI says....
The page cannot be displayed
There are too many people accessing the Web site at this time.
Please try the following:
Click the Refresh button, or try again later.
Open the www.generation5.org home page, and then look for links to the information you want.
HTTP 403.9 - Access Forbidden: Too many users are connected
Internet Information Services
"Quoting famous computer scientists out of context is the root of all evil (or at least most of it) in programming." - K
"living organism ... and techniques to kill it"
Next thing we know, we will have Animal Rights Activists in Washington, D.C. protesting our "spam traps"
who | grep -i blond | date cd ~; unzip; touch; strip; finger; mount; gasp; yes; uptime; umount; sleep
> most researchers in the fight against spam have failed to classify it as an artificial living organism
Who would have thought Skynet has its origins in spam?
Isn't the Bayesian filtering system used in, e.g., Mozilla Mail classified as AI?
I mean - hello, humans create it.
We're not up against a new being - it's the same type of beings that create scripts for the hell of it that wreak havoc on computer networks because 1) "We can" or 2) "To show them their weaknesses".
It was a very interesting read for sure - the genetic marker bit was quite interesting. Admittedly though I got about 2/3rds the way through it and lost interest.
Blame the spammers I say. ^_^
Spam has become the first great plague of the 21st century. Over 60% of all e-mails are spam, costing U.S. corporations more than $10 billion annually, on top of the productivity lost from scanning through e-mail and deleting spam. Along with this, an estimated 5% of spam campaigns are a pure and outright scam, with the remaining majority pitching products that are dubious at best. It used to be parents had to worry about their kids surfing and finding pornographic websites, now we have to worry more about our kids opening an e-mail client and finding a pornographic spam message. Spam must be stopped before it cripples the infrastructure of the internet and drives users away from one of the greatest forms of communication, E-mail.
Can Laws Defeat Spam? No. This has to be one of the greatest misconceptions of users. The internet is just that, an "INTERnational NETwork" that cannot be governed by one country's laws. Spammers can exist anywhere on the internet, meaning they can sling their wares from anywhere in the world, making the laws of one country completely irrelevant. Also, the decentralized, self-organizing design of the internet makes it nearly impossible to regulate by external means. It would be easier to regulate the weather than to regulate the internet.
Spam as a Living Organism
Up until recently, most researchers in the fight against spam have failed to classify it as an artificial living organism, hindering the development of effective tools and techniques to kill it. While this classification may sound strange, consider the following:
Through the fight against spam, spam has demonstrated an uncanny ability to adapt to the conditions of its environment, namely the internet. When one barrier against a strain of spam is put up, another, resistant strain appears. This is similar to how bacteria builds immunity against antibiotics, the strains that are not immune will die, while the ones that are immune take over and become the dominant, drug resistant strain. This leads to the belief that spam will not die until the barriers of its environment evolve faster than it does.
The internet is a complex chain of systems that all rely on each for the other's survival. Without an internet protocol, a web browser couldn't exist. Without web servers, the web wouldn't exist. Without ... (you get the picture). This chain of systems can be likened to an eco-system, with spam existing at a parasitic level of species within this system. It consumes resources (bandwidth, servers, time) in its attempt to reach its primary host: us. Once spam reaches its target, its sole purpose is to solicit its "food" from us, primarily money. If it is effective, that strain of spam lives and continues to propagate, otherwise it will die. Can the internet eco-system be modified so spam can't feed?
Just like any organism, spam contains certain traits that uniquely identify it. This can be a combination of words, information inside the header of the e-mail, the format of the message (HTML, plain text, rtf), the message encoding (base64), does it contain image links, the number of links, does it contain hidden text, so on and so forth. Up until recently, spam filters have primarily focused on just one of these traits, the wording of the e-mail. Spam, being an organism, evolved so this marker was hidden within its code, making it difficult at best to filter. It did this by including random, non-spam words in hidden areas of the e-mail, by modifying words like Viagra with V1@gr@, sending spam as image links, and by encoding the message in a format that filters could not read. The good news is this "gene" is still present, and can be unlocked by identifying the defensive genes wi
wot no sig
This is really a testament of strength of yet another MS product.
No, more likely it's some guy trying to use Windows 2000 Pro as a webserver. It has a ten connection limit; you're supposed to use a server version of windows for live webservers. I've never seen that error from a server version of Windows.
Also, it ties up email servers meaning yours can take a little longer. I once got a spam message 2 weeks after it was sent, so what happened to legit email is a mystery.
I think for the damage it does both to servers (slowdown) and to people (moneydown), it could be called a plague
Get paid to search..It's geniune and
Your web server can also be classified as an artificial living organism. But I ain't so sure about that living part anymore...
Because we've realized that we don't have to read the article or understand the topic to post something here and get modded "Informative"?
The emperor is naked.
How exactly is this news? It seems that the author of the neural network idea didn't do his homework - e.g. DSPAM already includes a neural network as an experimental classifier. And compared to the proposed C# solution, DSPAM is a widely used and mature product.
Regards, Jan
We already saw a plagiarized article green-lighted, and now this? Cmdr Taco, Slashdot was a brilliant idea of yours, and I love your site -- but that's because I have reasonably high expectations for it.
First, the submitter of this article has the email address jarhead4067@hotmail.com -- and so does the article's author.
Second, what is presented is not a genetic algorithm. The characteristics of the email to be considered to discover if the email is spam are finite and hard-coded -- and even the thresholds some characteristics must reach to qualify as spam are hard-coded:
A genetic algorithm is one in which the goal is hard-coded, different means of reaching that goal are generated, and the characteristics of the most successful are used to generate the next "generation"; this is repeated until the goal is reached.
But in this model, each "chromosome" contains statistics about one email. The heart of this model is to train a neural network with known emails ("chromosomes") and then test unknown emails ("chromosomes") against the network.
Neural networks have a checkered history in Artificial Intelligence research. A (very much simplified) model of biologic neurons, neural networks were for a time seen as a great hope for Artificial Intelligence. A neural network basically starts out with an array of input nodes and an array of output nodes, with each input node connected to each output. Each input corresponds to some characteristic of the items the network is trained with: for classifying animals, the inputs would be characteristic of animals, e.g., "furry", "bipedal", "feathered"; each output a classification, e.g., "mammal", "bird", "human".
To train the network, the input nodes are set to the characteristics of an item, and then the strength of the connection of those inputs to the correct outputs is increased (or that of other connections is decreased -- it's the same thing). With enough training, it's possible to isolate the salient characteristics from the ambiguous ones in a mechanistic way.
This is useful, but it was soon discovered that these simple neural networks, for certain sets of inputs, failed, because of overlapping categories: both birds and humans are bipedal, but only humans are also mammals. In a single layer neural network, the connection strength between input "bipedal" and output "mammal" would fluctuate, unable to describe humans or birds well. These problems can be alleviated by adding additional "hidden" layers of nodes between input and outputs, and by allowing "back-propagation" from output or hidden nodes to layers "previous" to them.
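To make the hidden-layer point concrete, here is a minimal sketch using XOR, the textbook case a single layer of weights cannot represent. It stands in for the bipedal/mammal overlap above rather than modelling it directly, and the weights are hand-picked for illustration, not learned. The names `step` and `xor_net` are made up for this example.

```python
def step(x):
    """Threshold activation: output 1 when the weighted sum is positive."""
    return 1 if x > 0 else 0


def xor_net(a, b):
    """Two hidden units plus one output unit, with hand-picked weights."""
    # Hidden unit 1 fires when at least one input is on (OR).
    h_or = step(a + b - 0.5)
    # Hidden unit 2 fires only when both inputs are on (AND).
    h_and = step(a + b - 1.5)
    # Output fires for "OR but not AND", i.e. exactly one input on.
    return step(h_or - h_and - 0.5)


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, xor_net(a, b))
```

No single weighted sum of `a` and `b` followed by one threshold produces this truth table; the hidden units are what make it possible.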
But even with these enhancements, it's been conclusively shown that some problems are intractable for neural networks. In any case, neural networks are no new thing.
Of course I have no idea if classifying spam is intractable or not, but I have to question whether using a neural network reliably can outperform Bayesian (or quasi-Bayesian) filtering. My guess is that since Bayesian filtering can judge email by the occurrence of single tokens ("words"), and not just "chromosome" statistics, and given that this "new" method also uses Bayesian filtering to generate one of those "chromosome" statistics anyway (and for only the most difficult to characterize emails to boot), this method itself probably mostly relies on its Bayesian sub-component.
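For readers who haven't seen token-level Bayesian filtering, here is a rough sketch of the idea. It is a plain naive Bayes classifier with add-one smoothing, which is an assumption on my part: it is not the article's code, nor what any particular mail client ships. The function names and the tiny training set are invented for illustration.

```python
import math
from collections import Counter


def train(messages):
    """Count tokens per class. `messages` is a list of (text, is_spam) pairs."""
    counts = {True: Counter(), False: Counter()}
    totals = {True: 0, False: 0}
    for text, is_spam in messages:
        counts[is_spam].update(text.lower().split())
        totals[is_spam] += 1
    return counts, totals


def spam_probability(text, counts, totals):
    """Combine per-token evidence into one probability (naive Bayes)."""
    vocab_size = len(set(counts[True]) | set(counts[False]))
    spam_tokens = sum(counts[True].values())
    ham_tokens = sum(counts[False].values())
    # Start from the prior odds of spam versus ham in the training set.
    log_odds = math.log(totals[True] / totals[False])
    for token in text.lower().split():
        # Add-one smoothing keeps unseen tokens from zeroing out the product.
        p_spam = (counts[True][token] + 1) / (spam_tokens + vocab_size)
        p_ham = (counts[False][token] + 1) / (ham_tokens + vocab_size)
        log_odds += math.log(p_spam / p_ham)
    return 1.0 / (1.0 + math.exp(-log_odds))


if __name__ == "__main__":
    data = [
        ("cheap pills buy now", True),
        ("buy cheap watches", True),
        ("meeting notes attached", False),
        ("lunch meeting tomorrow", False),
    ]
    counts, totals = train(data)
    print(spam_probability("cheap pills", counts, totals))
    print(spam_probability("meeting tomorrow", counts, totals))
```

The point is that every individual token carries evidence here, whereas the "chromosome" only sees a handful of summary statistics.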
So I'm a bit at a loss to see why this method is in any way revolutionary or even particularly interesting, or why it was green-lighted for Slashdot. Of course, I only gave the linke
Opinions on the Twiddler2 hand-held keyboard?
from SpamAssassin? It takes a bunch of rules, applies them, and uses a neural net to classify the message. Seems to me SpamAssassin does the same thing, only is more mature and extensible and uses a genetic algorithm rather than a back-propagation neural net.
The entire concept is quite ridiculous.
The guy proposes picking nine well-known indicators of spam, ones that could be (and often are) implemented in rule-based spam checkers, then proposes we use a neural network to evaluate a message based on these metrics.
Problems:
1) If you detected spam indicators, this is indicative of spam, no? The whole "fancy" bit of this technique is thus needless.
2) These indicators are not inherent to spam, just represent most current bypassing / obfuscation techniques. If you filter them out, they'll evolve. There is nothing that makes his spam filter follow the arms race.
I think about the only good thing I can say about this article is, at least he's not out killing puppies.
Religion is a gateway psychosis. -- Dave Foley
I've given up on Spam filtering and concentrating my efforts on Ham filtering.
Basically the present thinking is based on attempting to filter spam out - I would argue that given the number of variables involved, it is a method doomed to failure. Current methods also assume that the incoming mail is mostly valid, and are attempting to remove the undesirable parts - spam.
What I am having success with is turning this on its head and assuming that the bulk of incoming mail is bad, and filtering in messages that I want.
The way I am doing this is to use my address book as a whitelist - if an incoming message originates from someone in my address book, then it's delivered into the inbox. If not, then it is moved into a "not in address book" sub folder. Anything my ISP's SpamAssassin-based filtering marks is sent into the "Spam" folder. Doing it this way means that I am only notified of incoming mail that is confirmed to be from someone in my address book. Periodically I check the other folders (obviously).
We have come to the point, I think, where the number of variables involved makes filtering in a less intensive process than attempting to deal with the myriad of underhanded techniques that spammers use. By limiting the mail I want to people in my address book, I make it so that spammers are the ones having to deal with the variables, as they would have to guess addresses in my address book. If lots of people started filtering like this then we would see spammers using known bulk mail addresses (such as the address iTunes receipts are mailed from); however, we can simply alter the filter to include the originating IP / mailer and so on.
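The routing rule described above can be sketched in a few lines. This is only an illustration: the function name, the folder names as strings, and the choice to check the ISP's spam mark before the address book are all assumptions, since the ordering isn't spelled out.

```python
def route(sender, address_book, isp_marked_spam):
    """Pick a folder for one incoming message.

    `address_book` is a set of lower-case addresses; `isp_marked_spam`
    is whatever flag the upstream filter attached to the message.
    """
    # Assumption: a message the ISP already flagged goes to Spam first.
    if isp_marked_spam:
        return "Spam"
    # Known correspondents are the only ones who reach the inbox.
    if sender.lower() in address_book:
        return "Inbox"
    # Everything else waits in a folder that gets checked periodically.
    return "Not in address book"


if __name__ == "__main__":
    book = {"alice@example.com"}
    print(route("Alice@Example.com", book, False))
    print(route("stranger@example.net", book, False))
    print(route("stranger@example.net", book, True))
```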
Think of it like fishing - you wouldn't attempt to control an entire ocean and remove the water to leave the fish - you accept that the water is there and develop techniques to get the fish out.
For starters, he thinks Internet is short for "INTERnational NETwork" as opposed to a NETwork between entities (vs. network within an entity: intranet).
Then, his criteria:
Is the format of the e-mail HTML?
This is not a bad criterion.
Is the e-mail formatted in valid HTML?
Have you ever seen a commercial program (esp. Word, used by Outlook) generate good, 100% valid HTML?
Is the e-mail encoding base64?
No argument here. Unless base64 could be confused with Unicode - don't think so, but not sure.
Does the e-mail contain image links?
Does the e-mail contain "hidden" text that the user cannot see?
Heck, yeah, block it.
Does this e-mail have a large number of recipients?
Most of the spam I get has less than 5 recipients, and a lot of my mail is from a listserv with more than 5 recips.
What's the ratio of links to words in this e-mail?
I generally see only one or two links in my spam. Although I do see zero links in most of my ham.
What's the ratio of misspelled words to words in this e-mail?
Dear lord, no. This is a worthless criterion. Maybe if you looked for a ratio of non-letters (@, |, etc) to letters, but not spelling.
What's the Bayesian spam probability of this e-mail?
WTF does this have to do with AI?
Basically, he's stated the obvious, then made some really idiotic assumptions. Plus a shitload of spelling and grammar errors.
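For what it's worth, two of these measurements are easy to sketch: the link-to-word ratio from the list, and the non-letter ratio suggested above in place of a spelling check. The regex, the function names, and the choice to count digits as non-letters are assumptions made for illustration, not anything taken from the article's source.

```python
import re

# Deliberately simple URL matcher; real mail needs proper MIME/HTML parsing.
URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)


def link_to_word_ratio(body):
    """Number of links divided by number of remaining words."""
    links = URL_RE.findall(body)
    words = URL_RE.sub(" ", body).split()
    return len(links) / len(words) if words else float(len(links))


def non_letter_ratio(body):
    """Non-letter, non-space characters divided by letters."""
    letters = sum(1 for c in body if c.isalpha())
    others = sum(1 for c in body if not c.isalpha() and not c.isspace())
    return others / letters if letters else float(others)


if __name__ == "__main__":
    print(link_to_word_ratio("Buy now http://example.com today"))
    print(non_letter_ratio("V1@gr@"))
    print(non_letter_ratio("Viagra"))
```

An obfuscated word like V1@gr@ scores 1.0 on the second measure while the plain spelling scores 0.0, which is the kind of signal a misspelling count would only catch indirectly.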
1. While the author proposes some marvelous cure based on treating spam as an organism, he just lists traits that any spam filter can use, and which most probably do, though he would suggest that most don't. I fail to see how the artificial-life observation improves spam non-spam determination from the list of traits he proposes filtering on.
2. The article reads like a sales pitch for the author's spam filter.
3. If 2 is true, and it is a sales pitch, then you have the irony of a very effective form of spam that makes it past the Slashdot editors.
It's ALIVE!!!!
Letter To Iran
If I were to sum up this approach, it would be SpamAssassin with a multi-layer neural network. I should mention that I maintain the tool that SpamAssassin is using to train its single-layer neural network for version 3.0, so I can honestly say that I have a fair amount of experience in this area.
I'm not too keen on Evans' use of the biological metaphors. I think that they only confuse the issue of what he is doing. I will use the standard terminology, features, from here on out.
What he is doing is finding a nonlinear decision surface between two classes using a universal function approximator. I will explain this in layman's terms.
Imagine a sheet of paper filled with multi-coloured dots where these dots are arranged in clusters and each cluster contains mostly the same colour of dots. Starting with a simple example, imagine two clusters of dots, one blue and one red. Assume that you can draw a line that separates the two clusters. That line is called the decision surface. You would say that any new dot that appears on one side of the line will be called red and on the other side blue. Any blue dot that appears on the red side of the line would be misclassified as red. This is referred to as a linearly separable problem.
Now, imagine a more complex arrangement of clusters where you can't draw a straight line to separate the red from the blue, but you can separate them using a curved line. This is called a nonlinearly separable problem.
Artificial neural networks are very good for representing these decision surfaces. They are constructed of one or more perceptrons. A perceptron uses an activation function and a transfer function to take a set of inputs and produce a single output. The most popular form of neuron uses a linear activation function and a sigmoid transfer function. The linear activation function is the sum of a set of weighted inputs, i.e. f(X) = sum_i w_i * x_i. The logarithmic sigmoid transfer function is g(x) = 1/(1+exp(-x)). The output of the perceptron for any given input X is O(X) = g(f(X)).
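The two functions above translate almost directly into code. A minimal sketch, with function names of my own choosing:

```python
import math


def activation(weights, inputs):
    """Linear activation: the weighted sum f(X) = sum_i w_i * x_i."""
    return sum(w * x for w, x in zip(weights, inputs))


def sigmoid(x):
    """Logarithmic sigmoid transfer function g(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def perceptron(weights, inputs):
    """Output of a single unit: O(X) = g(f(X))."""
    return sigmoid(activation(weights, inputs))


if __name__ == "__main__":
    # The weighted sum here is 0.5*1 + (-0.5)*1 = 0, so the output is g(0).
    print(perceptron([0.5, -0.5], [1.0, 1.0]))
```

A weighted sum of zero sits exactly on the decision surface, which is why the output there is 0.5: the unit is undecided between the two classes.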
These perceptrons can be chained together in many different ways. One popular method is the multi-layer perceptron, where a set of neurons in the hidden layer process the inputs and pass on their outputs to the output layer where the final output is formed. I don't have a source for you, but it has been proven that, given a large enough hidden layer, the multi-layer perceptron is a universal function approximator.
As long as all of the transfer functions are differentiable, you can train a neural network using error backpropagation by gradient descent. I will leave it as an exercise to the reader to learn how it works, but I assure you that it is very simple. Machine Learning by Tom Mitchell has a good section on the subject, as does Fundamentals of Computational Neuroscience by Thomas Trappenberg.
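To give a flavour of it without a full derivation, here is a sketch of gradient descent on a single sigmoid unit learning logical AND. With only one unit there is no hidden layer to propagate errors back through, so this is really the one-layer special case (the delta rule) rather than full backpropagation. The squared-error loss, learning rate, epoch count, and names are all assumptions chosen for the example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(weights, bias, inputs):
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train_unit(samples, epochs=5000, rate=0.5):
    """Fit one sigmoid unit by gradient descent on squared error."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            out = predict(weights, bias, inputs)
            # Negative gradient of 0.5 * (target - out)**2 with respect to
            # the weighted sum; out * (1 - out) is the sigmoid's derivative,
            # which is why the transfer function must be differentiable.
            delta = (target - out) * out * (1.0 - out)
            weights = [w + rate * delta * x for w, x in zip(weights, inputs)]
            bias += rate * delta
    return weights, bias


if __name__ == "__main__":
    and_samples = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
    weights, bias = train_unit(and_samples)
    for inputs, target in and_samples:
        print(inputs, target, round(predict(weights, bias, inputs), 3))
```

In a multi-layer network the same `delta` is computed at the output and then pushed back through each hidden unit's own derivative, which is where the "backpropagation" name comes from.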
Evans has identified a large set of features of e-mails, some of which on their own convey little or no information about whether an e-mail is spam. He trains the neural network to recognize the combinations of these features which can lead towards the conclusion that a message is or is not spam. While his approach is a good idea, I would hesitate to call it novel. Massey, Thomure, Budrevich and Long did a very similar experiment [3] where they used a multi-layer neural network with SpamAssassin.
While his approach is good, there are some downsides for widespread deployment that need to be addressed first. With a large feature set like he is using, you will probably need a lot of training data to find a good fit with a multi-layer perceptron. To train the single layer neural network for SpamAssassin 3.0, I'm using 160000 messages.
Also, as his own arguments show, spam adapts to spam filter technology. Most of the features that he presents in his whitepaper can be easily fooled by a spammer. They can deliberately manipulate these features to evade the spam filter b
actually spam is very analogous to bugs (bacteria)..
spam filters kill spam,
antibiotics kill bacteria.
we have spam filters,
we have antibiotics.
the selection pressure that spam filters put on spam makes it harder to filter.
the selection pressure that antibiotics put on bacteria makes them harder to kill.
we then have to develop other spam filters,
and likewise other antibiotics.
too much of a spam filter will result in adverse effect because you filter ham out.
too much of an antibiotic will result in adverse drug effect because of toxicity to human cells (e.g. nephrotoxicity, ototoxicity etc.)
Bayesian filters are bypassed just like any other. I'd bet most of us here have tried some form of adaptive filtering with varying results.
He's right in one key respect though -- spam is cheap to send, but spam DESTINATIONS (the links they try to get you to go to) are relatively expensive. You can't register a hundred thousand domains a day. While it's cheap to get one or two, massive domain registration is an expensive proposition. That's currently, IMO, the best way to catch spam once you've gone through the bonehead catch of faked headers.
Personally, I do two stages: First, I catch the obvious stuff -- it says it's from AOL.COM but didn't come from their published servers. duh.
Then, I take those "known spams" and search for the call to action link -- what url are they trying to send me to. Take the primary part of that (the domain, plus a little more) and make a list of "probable spam destinations".
I do the same thing with known good mail (mail from people I have sent mail to).
Now I have good Bayesian fodder -- actual destination lists both good and bad.
Making a Bayesian list out of those results in a fairly accurate secondary filter.
Email inbound to me now goes through three checks:
1) have I sent you mail before (whitelist)
2) is this obvious bonehead spam
3) how many links in the message are to the same place as the ones in the bonehead spam?
This works to stop 98% of the 400+ spams a day that get sent at me with a very very low false positive ratio.
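The third check can be sketched roughly as follows. This is a simplification, not my actual filter: it keys on the bare hostname of each link (rather than "the domain, plus a little more"), uses a plain set instead of a Bayesian list, and the regex and function names are made up for illustration.

```python
import re
from urllib.parse import urlparse

# Simple URL matcher; stops at whitespace, quotes, and angle brackets.
URL_RE = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def link_domains(body):
    """Return the hostname of every http(s) link found in the message body."""
    domains = []
    for url in URL_RE.findall(body):
        host = urlparse(url).hostname
        if host:
            domains.append(host)
    return domains


def spam_destination_fraction(body, spam_destinations):
    """Fraction of links that point at destinations seen in known spam."""
    domains = link_domains(body)
    if not domains:
        return 0.0
    hits = sum(1 for d in domains if d in spam_destinations)
    return hits / len(domains)


if __name__ == "__main__":
    known_bad = {"pills.example"}
    body = "Click http://pills.example/buy and http://news.example/story"
    print(link_domains(body))
    print(spam_destination_fraction(body, known_bad))
```

The `known_bad` set would be built from the call-to-action links in the "obvious bonehead" spam, and a matching set of good destinations from whitelisted mail.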
The problem with quotes on the internet, is that nobody bothers to check their veracity. -- Abraham Lincoln
Worse than being killed by the AI.. what if the AI decides to not filter spam anymore?
"I'm sorry Dave, but your wife thinks you SHOULD try this V@GR!A substance."
or
"This Nigerian seems very nice, and if it pays off you can get me more delicious RAM."
Learn something new.
"Of couse, this won't solve the bandwidth/ressource theft problem..."
No, it won't.
Obviously, to solve that problem you need to act earlier in the spam path.
Spammers abuse systems because they look for vulnerable systems and can find them, can distinguish them from secure systems. Think about that - it's true.
Securing systems (as a solution to spam) is based on the ridiculous notion that enough can be secured so that the spammers can't find them. Won't happen. But "distinguish them from secure systems" is still left. What can be done with that?
Well, if secure systems didn't look secure to the spammers they'd not be able to distinguish them and they'd try to abuse systems that can't be abused. That would mean they'd send the spam to traps and that the traps would not deliver any spam other than to what can be determined to be the spammers' own addresses, used to test whether the spam sent gets through (in other words, to re-test to see whether the system is or isn't vulnerable to abuse.)
That's easy to understand, isn't it? If you want to stop the bandwidth theft you're almost surely going to have to act against the bandwidth theft. What's described above is a way to make bandwidth theft not work as well. Break bandwidth theft sufficiently and the spammers won't get enough return on the spam to pay for sending it (or the ones paying the spammers won't get sufficient return - it's the same idea either way.)
With a single ancient Vaxstation and an obsolete MTA I stopped spam to millions of recipients elsewhere: AOL, Hotmail, a large number of destinations. To top it off that Vaxstation was a real email server, so it did two things (and it was slightly harder to stop the spam.) Set up a fake server and everything that comes to it is some form of abuse: none need be delivered as though it is valid email (it isn't valid email. Of course you'd want to deliver the spammers' own test messages: that's what lets them fool themselves into thinking they've found an open relay.) Nowadays this idea works better if you fake an open proxy: open relay abuse is finally on the decline.
If you're an ISP with IP addresses that the spammers check for abusability or with IP addresses that have been abused you can do more than shut off the IP address (and please, I beg of you, do more). Find out where the abuse packets originate that come into the abused system and do whatever you can to get that abuse stopped. If you, for instance, disconnected the abused system and set up something that accepted the incoming abuse packets but sent out no spam, that would be helpful. What you can do depends on the abuse and on the spammer - but the main point is that you don't have to only shut off access, you can do more. Why not do more? You are against spam, and doing more stops some spam. That's in the right direction.
NOTE: The sample code for this application is in C#. C# was chosen over C++ so beginners could better see the structures of the process, and C# was chosen over Java because of the inherent performance advantages of .NET.
What morons. what total losers.
Go after spammers' customers. If they have to pay $10,000 for every spam sent on their behalf, they'll soon stop.
Fuck the spammers. They are merely supplying in response to a demand.
Dry up the demand with an internationally backed law (I know of NO gov't who'd turn down money) making it illegal to have spam sent on your behalf.
The response to spam is NOT going to be technical.
MSBPodcast.com The opinions expressed here are my own. If you don't like 'em... Think up your own stuff.
"But even with these enhancements, it's been conclusively shown that some problems are intractable for neural networks. In any case, neural networks are no new thing."
Not so. Maybe you're still thinking about extremely simple neural nets, because no such proof of intractability exists for larger more complex networks.
Here's proof: Neural Networks can emulate a Universal Turing Machine. Since they can also be emulated by a UTM their limitations are no greater or less than those of any UTM. One citation if this isn't obviously true.
This is exactly why Marvin Minsky has been accused of slandering neural nets unfairly, and hindering AI research. In his book _Perceptrons_ he demonstrated a simple problem that a trivial (one or two layers with no feedback) NN can't solve. A lot of scientists wrote off Neural Nets just as you have, because a toy was the only tool used. Never mind the fact that an only slightly more complex NN can solve such a problem easily. I find it telling that for a human to solve the same problem, one has to construct a strategy to do it. Not the sort of thing I'd assume any extremely simple machine could do. These days Minsky complains that AI isn't trying to build human brains. He's a brilliant man, but in some cases (as with many famous people) his chutzpah occasionally outstrips his judgement. I only wish that great scientists were immune to this.
Lots of less qualified people complain that neural nets aren't useful because they have some unpleasant experience with them. They have no idea of the variety of neural nets. It's like using a Playstation and complaining that computers are not useful.
As for spam filtering with AI, unless you have the narrow definition of AI, the Bayesian techniques of SpamAssassin are AI, as is the Latent Semantic Analysis done by OSX mail.app for spam filtering. LSA, while computationally expensive on a PC, is regarded as equivalent to a particular type of 3 layer neural net, (see Kohonen self-organizing maps.)
One thing you have right. Neural nets are "no new thing." They're as old as biological brains. Novelty is not a criterion for usefulness.
Assembly is the reverse of disassembly.