These messages hold hundreds of non-words, together with creatively "uglified" versions of common spam words. The trait I'd like to check for is "ratio of words never seen in ham"; seems like a nice and sensible thing to look for.
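For what it's worth, the trait is cheap to compute. Here is a minimal sketch, assuming whitespace tokenisation and a ham vocabulary built from whatever ham you have on hand; the function names and the tiny corpus are made up for illustration.

```python
def tokenize(text):
    """Lowercase a message and split it on whitespace."""
    return text.lower().split()

def unseen_ratio(message, ham_vocabulary):
    """Fraction of tokens in `message` that never appeared in any ham message."""
    tokens = tokenize(message)
    if not tokens:
        return 0.0
    unseen = sum(1 for token in tokens if token not in ham_vocabulary)
    return unseen / len(tokens)

# Toy ham corpus (hypothetical); a real one would be thousands of messages.
ham_corpus = ["meeting moved to friday", "see you at lunch on friday"]
ham_vocabulary = {token for msg in ham_corpus for token in tokenize(msg)}

print(unseen_ratio("lunch meeting on friday", ham_vocabulary))
print(unseen_ratio("v1agra xqzt lunch pharmacy", ham_vocabulary))
```

The ratio would then be one more input to the classifier rather than a verdict on its own.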
That sounds like a very good idea. e-mail me and we can look at it further.
Neural networks probably represent a better way of combining probabilities gained from multiple techniques. Bayesian stuff works pretty damn well, but we may need to give it a little more "traction" into the problem...
If you're interested, the conceptual difference between naive bayes and neural networks is that the neural networks try to find the mutual information between features while naive bayes just pretends that there is no mutual information (everything is independent). In most cases, the naive bayes assumption is incorrect but it usually works well anyway.
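To illustrate the independence assumption, here is a rough sketch of how naive Bayes combines evidence. The per-feature probabilities are invented numbers, not measurements from any real filter.

```python
import math

def naive_bayes_spam_probability(feature_probs, prior_spam=0.5):
    """Combine (P(feature|spam), P(feature|ham)) pairs as if they were independent."""
    log_spam = math.log(prior_spam)
    log_ham = math.log(1.0 - prior_spam)
    for p_given_spam, p_given_ham in feature_probs:
        # Independence assumption: likelihoods simply multiply (add in log space).
        log_spam += math.log(p_given_spam)
        log_ham += math.log(p_given_ham)
    # P(spam | features) = e^log_spam / (e^log_spam + e^log_ham)
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))

evidence = [(0.8, 0.1), (0.6, 0.3)]   # made-up likelihoods for two features
once = naive_bayes_spam_probability(evidence)
# Two perfectly correlated copies of each feature carry no new information,
# yet naive Bayes counts them as fresh evidence and becomes more confident.
twice = naive_bayes_spam_probability(evidence + evidence)
print(once, twice)
```

The second number is higher than the first even though nothing new was learned, which is exactly the mutual information that naive Bayes pretends is not there.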
I think you make a very good point, but given a large enough[1] training corpus, and being very conservative on the weight to assign to error backpropagation, wouldn't it be interesting to see if the decision hyperplane would be able to reshape itself quickly enough to include freshly "evolved" forms of spam as they appear? (Provided, of course, that those consist of variants on previous forms).
I'm not aware of anyone doing online updating of their neural networks for spam classification. I've always been of the impression that error backpropagation and online updating don't mix with multi-layer neural networks because they tend to take O(n^2) time to converge. In addition, you have to deal with the stability/plasticity tradeoff where you want to give the network enough freedom to learn the new patterns while retaining accuracy on the previously-learned patterns.
I agree, however, that your concern about constructed attacks against detection of specific features is a killer, as it stands. But given a large enough set of features to look for in both form and content, the task becomes increasingly difficult (hence SpamAssassin's success); would that problem tend to eliminate itself?
The spammers are pretty smart. You'll have to trust me that they find and exploit any and every hole that is left open.
Older versions of SpamAssassin had rules with negative scores for recognized e-mail clients (such as pine and mutt). Spammers started putting those headers into their messages to get an extra boost.
I'm using SpamAssassin now, and I think its primary weakness is lack of combinatorial weighting. Feature X is worth n points independently of the presence of other features in the message (or not? I might just have never found how).
I agree with you 100%. I think that SpamAssassin would be much better off with combinatorial weighting. But, I haven't yet found a good way to do it.
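To make the limitation concrete, here is a minimal sketch of purely additive scoring. The rule names and scores are hypothetical and are not taken from any real rule set; the threshold of 5.0 is assumed here for illustration.

```python
# Hypothetical rules and scores, for illustration only.
SCORES = {
    "SUBJECT_ALL_CAPS": 1.5,
    "MENTIONS_PILLS": 2.5,
    "HTML_ONLY": 1.2,
    "KNOWN_MAIL_CLIENT": -1.0,   # a negative-scoring rule of the kind spammers forged
}
THRESHOLD = 5.0  # assumed spam threshold

def total_score(hits):
    """Each rule contributes its score regardless of which other rules hit."""
    return sum(SCORES[rule] for rule in hits)

def is_spam(hits):
    return total_score(hits) >= THRESHOLD

spam_hits = ["SUBJECT_ALL_CAPS", "MENTIONS_PILLS", "HTML_ONLY"]
print(is_spam(spam_hits))                          # over the threshold
print(is_spam(spam_hits + ["KNOWN_MAIL_CLIENT"]))  # forged header pulls it back under
```

Because every score is fixed, there is no way to say "this rule only matters when that one also hits", and a single forged negative-score feature is enough to slip under the threshold.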
One of the ideas that we've had uses what is called a "sigma pi" node, a perceptron with an activation function that looks like f(X) = sum w_ij * x_i * x_j. Aside from the obvious security holes, I'd expect that as a quadratic function, it would require quadratically more training data than the linear activation function.
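A sketch of that activation function next to the linear one, just to show where the quadratic growth comes from. The weights below are arbitrary illustrative values.

```python
def linear_activation(x, w):
    """f(X) = sum_i w_i * x_i, one weight per input."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x))

def sigma_pi_activation(x, w):
    """f(X) = sum_ij w_ij * x_i * x_j, one weight per pair of inputs."""
    n = len(x)
    return sum(w[i][j] * x[i] * x[j] for i in range(n) for j in range(n))

x = [1, 0, 1]                      # rules 0 and 2 hit, rule 1 did not
pair_weights = [[0.0, 0.0, 2.0],
                [0.0, 5.0, 0.0],
                [0.0, 0.0, 0.0]]
print(sigma_pi_activation(x, pair_weights))   # only the (0, 2) pair contributes
print(len(x), len(x) ** 2)                    # weights to learn: n versus n^2
```

With a few hundred rules, n^2 is tens of thousands of weights, which is why I would expect it to need so much more training data.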
We've had some other ideas for activation functions, one of which looks like f(X) = (prod m_i * x_i) * (sum w_i * x_i). This one would make the network more or less sensitive given that certain tests have hit. It is nonlinear and does not require the messy cross product stuff of the sigma pi node. I haven't done any experiments with this one yet, but my gut tells me that it probably won't work very well.
If any of you have better ideas, I'm all ears. Feel free to drop me an e-mail or catch me on irc.freenode.net in #spamassassin. My nick is henry and I'm usually around during business hours.
The difference between this and SpamAssassin is that he uses a multi-layer neural network where we use a single-layer neural network. His feature space is a bit more expansive: he uses a lot of features that don't indicate a message being spam on their own.
The first thing that I did when I became involved with SpamAssassin was to replace the old genetic algorithm-based score learning tool with one that uses error backpropagation. It only takes a few seconds to run as compared to a few days for the old GA and it consistently finds better solutions. Look at masses/perceptron.c and masses/README.perceptron in the SpamAssassin SVN repository if you're interested in more about what I've done.
I think that you're jumping the gun a bit on your accusations. Perhaps, as you admit in your last paragraph, you should have read the article a bit more carefully before writing your response.
I can understand your premature conclusion that he is talking about using genetic algorithms from his biological metaphors, but I didn't see any actual mention of them. He's just using a funny name for features.
I wouldn't dismiss neural networks in the way that you do. People did put a lot of hope in the perceptron when it was first invented and did lose faith for about 30 years when it was shown that it couldn't be used to separate XOR. However, multi-layer networks and kernel functions have helped them regain their utility.
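As a small illustration of the XOR point, here is a hand-wired two-layer network of threshold units that computes XOR. The weights are picked by hand, not learned; no single threshold unit can do this because XOR is not linearly separable.

```python
def step(a):
    """Threshold unit: fires when its weighted input is positive."""
    return 1 if a > 0 else 0

def xor_network(x1, x2):
    # Hidden layer: one unit computes OR, the other computes AND.
    h_or = step(x1 + x2 - 0.5)
    h_and = step(x1 + x2 - 1.5)
    # Output layer: fires for "OR but not AND", which is XOR.
    return step(h_or - h_and - 0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_network(a, b))
```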
Lastly, I wouldn't get too hung up on his use of the output of another classifier (naive bayes text classifier) as an input to his neural network. That is, after all, exactly the same thing as what happens between the hidden layer and the output layer. We do the same thing with SpamAssassin and it works out very well for us.
If I were to sum up this approach, it would be SpamAssassin with a multi-layer neural network. I should mention that I maintain the tool that SpamAssassin is using to train its single-layer neural network for version 3.0, so I can honestly say that I have a fair amount of experience in this area.
I'm not too keen on Evans' use of the biological metaphors. I think that they only confuse the issue of what he is doing. I will use the standard terminology, features, from here on out.
What he is doing is finding a nonlinear decision surface between two classes using a universal function approximator. I will explain this in layman's terms.
Imagine a sheet of paper filled with multi-coloured dots where these dots are arranged in clusters and each cluster contains mostly the same colour of dots. Starting with a simple example, imagine two clusters of dots, one blue and one red. Assume that you can draw a line that separates the two clusters. That line is called the decision surface. Any new dot that appears on one side of the line will be called red, and any that appears on the other side will be called blue. Any blue dot that appears on the red side of the line would be misclassified as red. This is referred to as a linearly separable problem.
Now, imagine a more complex arrangement of clusters where you can't draw a straight line to separate the red from the blue, but you can separate them using a curved line. This is called a nonlinearly separable problem.
Artificial neural networks are very good for representing these decision surfaces. They are constructed of one or more perceptrons. A perceptron uses an activation function and a transfer function to take a set of inputs and produce a single output. The most popular form of neuron uses a linear activation function and a sigmoid transfer function. The linear activation function is the sum of a set of weighted inputs, i.e. f(X) = sum w_i * x_i. The logarithmic sigmoid transfer function is g(x) = 1/(1+exp(-x)). The output of the perceptron for any given input is O(X) = g(f(X)).
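The three formulas above translate almost directly into code. This is only a sketch; the input and weight values are arbitrary.

```python
import math

def activation(x, w):
    """Linear activation: f(X) = sum_i w_i * x_i."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x))

def transfer(a):
    """Logarithmic sigmoid transfer: g(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))

def perceptron_output(x, w):
    """O(X) = g(f(X)), a value strictly between 0 and 1."""
    return transfer(activation(x, w))

x = [1.0, 0.0, 1.0]          # arbitrary inputs
w = [0.5, -1.0, 1.5]         # arbitrary weights
print(perceptron_output(x, w))
print(perceptron_output([0.0, 0.0, 0.0], w))   # zero activation gives exactly 0.5
```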
These perceptrons can be chained together in many different ways. One popular method is the multi-layer perceptron, where a set of neurons in the hidden layer process the inputs and pass on their outputs to the output layer where the final output is formed. I don't have a source for you, but it has been proven that, given a large enough hidden layer, the multi-layer perceptron is a universal function approximator.
As long as all of the transfer functions are differentiable, you can train a neural network using error backpropagation by gradient descent. I will leave it as an exercise to the reader to learn how it works, but I assure you that it is very simple. Machine Learning by Tom Mitchell has a good section on the subject, as does Fundamentals of Computational Neuroscience by Thomas Trappenberg.
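For the impatient, here is a minimal sketch of the single-layer case: gradient descent on squared error for one sigmoid unit. Full backpropagation applies the same chain rule through the hidden layers. The learning rate, epoch count and the AND training set are arbitrary choices for the example, and this is not the code in masses/perceptron.c.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def train(samples, n_inputs, rate=0.5, epochs=5000):
    """Fit one sigmoid unit by stochastic gradient descent on squared error."""
    w = [0.0] * (n_inputs + 1)             # last weight is the bias
    for _ in range(epochs):
        for x, target in samples:
            xb = list(x) + [1.0]           # constant input for the bias weight
            out = sigmoid(sum(w_i * x_i for w_i, x_i in zip(w, xb)))
            # d/dw_i of 0.5*(target - out)^2 is -(target - out)*out*(1 - out)*x_i,
            # so stepping against the gradient adds rate*delta*x_i to each weight.
            delta = (target - out) * out * (1.0 - out)
            w = [w_i + rate * delta * x_i for w_i, x_i in zip(w, xb)]
    return w

def predict(w, x):
    out = sigmoid(sum(w_i * x_i for w_i, x_i in zip(w, list(x) + [1.0])))
    return 1 if out > 0.5 else 0

# Logical AND is linearly separable, so a single unit can learn it.
and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights = train(and_samples, n_inputs=2)
print([predict(weights, x) for x, _ in and_samples])
```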
Evans has identified a large set of features of e-mails, some of which on their own convey little or no information about whether an e-mail is spam. He trains the neural network to recognize the combinations of these features which can lead towards the conclusion that a message is or is not spam. While his approach is a good idea, I would hesitate to call it novel. Massey, Thomure, Budrevich and Long did a very similar experiment [3] where they used a multi-layer neural network with SpamAssassin.
While his approach is good, there are some downsides for widespread deployment that need to be addressed first. With a large feature set like he is using, you will probably need a lot of training data to find a good fit with a multi-layer perceptron. To train the single layer neural network for SpamAssassin 3.0, I'm using 160000 messages.
Also, as his own arguments show, spam adapts to spam filter technology. Most of the features that he presents in his whitepaper can be easily fooled by a spammer. They can deliberately manipulate these features to evade the spam filter b
It's a shame that people still have to resort to the Google cache when there is a great caching service, FreeCache, provided by the Internet Archive. Just make your link like http://freecache.org/http://whatever...
I'm sorry to hear that you're having so much trouble with SpamAssassin. I've heard some rumblings from the Faculty of Law at my university that SA makes a lot of errors on their e-mails. The disclaimers that they put in their signatures must be tripping some of the rules.
I'm trying to find a reliable method of personalising your scores without requiring you to download the 300MB corpus that we use to optimise the scores. I hope that, in the future, it will make SpamAssassin more to your liking.
P.S. Your system administrator really shouldn't be discarding those e-mails. The SpamAssassin documentation recommends that you only tag them and let the users write a filter that detects the X-Spam-Status header. You should ask them to tag the e-mails instead so that you can use DSPAM or CRM114, since they work so well for you.
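A user-side filter on that header can be very small. The sketch below assumes the header value starts with "Yes" or "No"; check what your installation actually writes before relying on it. The sample message is made up.

```python
from email import message_from_string

def is_tagged_spam(raw_message):
    """Return True when the X-Spam-Status header marks the message as spam."""
    msg = message_from_string(raw_message)
    status = msg.get("X-Spam-Status", "")
    # Assumption: the value begins with "Yes" for spam and "No" otherwise.
    return status.strip().lower().startswith("yes")

tagged = (
    "From: someone@example.com\n"
    "X-Spam-Status: Yes, score=7.2 required=5.0\n"
    "Subject: hello\n"
    "\n"
    "body text\n"
)
print(is_tagged_spam(tagged))
```

A tagged message can then be filed into a folder instead of discarded, which leaves it available for training a personal classifier.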
1. Cormack is very inexperienced in the area of statistical filtering.
Disagreed. Gordon Cormack has been doing information retrieval for 20 years. He is fairly well known in the area. See his publication history at DBLP.
A far more likely conclusion about what's going on here is that Zdziarski's ego has been hurt. Both he and Dr. Yerazunis engage in some very sketchy statistics in their papers and I think that it has caught up to them.
1. Yerazunis' study of "human classification performance" is fundamentally flawed. He did a "user study" where he sat down and re-classified a few thousand of his personal e-mails and wrote down how many mistakes he made. He repeats this experiment once and calls his results "conclusive." There are several reasons why this is not a sound methodology:
a) He has only one test subject (himself). You cannot infer much about the population from a sample size of 1.
b) He has already seen the messages before. We have very good associative memory. You will also notice that he makes fewer mistakes on the second run which indicates that a human's classification accuracy (on the same messages) increases with experience. For this very reason, it is of the utmost importance to test classification performance on unseen data. After all, the problem tends towards "duplicate detection" when you've seen the data beforehand.
c) He evaluates his own performance. When someone's own ego is on the line, you would expect that it would be very difficult to remain objective.
2. Both Yerazunis and Zdziarski make use of "chained tokens" in their software. This is referred to in other circles as an "n-gram" model. As with many nonlinear models (the complexity of an n-gram model is exponential with n), it is very easy to over-fit the n-gram model to the training data. Natural language tends to follow Zipf's law (a power law closely related to the Pareto law, sometimes called the 80/20 rule), where the frequency of occurrence of a term is inversely proportional to its rank. The exponential complexity of the n-gram model contributes to the sparse distribution of text, leading to a database with noisy probability estimates.
3. Zdziarski uses a "noise reduction algorithm" called Dobly to smooth out probability estimates in the messages. Aside from his unsubstantiated claim of increased accuracy, I have never seen anything to suggest that it actually works as advertised.
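To put point 2 in concrete terms, here is a sketch of what "chained tokens" amount to, using a made-up six-word message. It is not the tokeniser from either package.

```python
def ngrams(tokens, n):
    """All runs of n consecutive tokens, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "buy cheap pills buy cheap watches".split()
vocabulary = set(tokens)

for n in (1, 2, 3):
    observed = set(ngrams(tokens, n))
    possible = len(vocabulary) ** n      # the feature space grows as V^n
    print(n, len(observed), possible)
```

The number of possible features grows exponentially with n while the number actually observed barely moves, so most counts in the database are zeros and ones. That is where the noisy probability estimates come from.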
Considering these points, I was not surprised at all by the results of Dr. Cormack's study. While one may argue that his experimental configuration can use some improvement, his evaluation methods are logically and statistically sound. What I personally saw in the results of this paper was that two classifiers that use unproven technology did not perform as advertised. After all, every other Bayes-based spam filter performed acceptably well.
Lastly, I won't really touch his flawed arguments about how using domain knowledge about spam (i.e. SpamAssassin's heuristic) somehow hinders the classifier over time when you are also using a personalised classifier. You'll notice that SpamAssassin still did acceptably well when all of the rules were disabled.
Go read some more of Zdziarski's work and draw your own conclusions about his work. Pay careful attention to his use of personal attacks when comparing his filter to that of others.
The above poster has obviously never ventured into the accounting department at a university and is merely saying the word "GPL" to karma whore. "Non-trivial" doesn't even begin to describe the complexity of what goes on, to the point that even humans can't get it straight. Just the other day, I had to simultaneously corral no less than 5 university employees to figure out exactly what was going on with my pay situation.
I beg to differ with you on the matter of it being only "an annoyance." I've had to delete comments on my own weblog that (supposedly) link to underage pornography sites. I'm not a lawyer, but I'm fairly certain that it is illegal to link to child pornography. Assuming that this is true, those SEOs are actually causing you, the innocent weblog/wiki owner, to unwillingly and unwittingly commit a criminal act.
After reading through the comments here, it is obvious that there are some misconceptions about what Apple is doing.
Latent Semantic Indexing (LSI) was invented by Deerwester et al. [1] as a method of reducing the dimensionality of a text corpus by finding a low-rank approximation of the term-document matrix.
The singular value decomposition (SVD) [2] factors a matrix A into the product of two orthogonal matrices and a diagonal matrix, A = U'SV. To find a rank-k approximation of A using this factorisation, create matrices U^, S^ and V^ where S^ contains the first k rows and columns of S, U^ contains the first k rows of U and likewise for V^. Then, let A^ = U^'S^V^. Among all rank-k approximations of A, A^ minimises the Frobenius norm [3] of the difference A - A^ (least squares).
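Written out, with the same convention as above (A = U'SV, so the singular vectors are the rows of U and V), the truncation and its least-squares property look like this. Here u_i and v_i denote the i-th rows of U and V written as column vectors, s_1 >= s_2 >= ... are the diagonal entries of S, and r is the rank of A; these symbols are my own shorthand.

```latex
\[
A = U^{\mathsf T} S V = \sum_{i=1}^{r} s_i \, u_i v_i^{\mathsf T},
\qquad
\hat{A} = \hat{U}^{\mathsf T} \hat{S} \hat{V} = \sum_{i=1}^{k} s_i \, u_i v_i^{\mathsf T},
\]
\[
\lVert A - \hat{A} \rVert_F
  = \sqrt{\sum_{i=k+1}^{r} s_i^{2}}
  = \min_{\operatorname{rank}(B) \le k} \lVert A - B \rVert_F .
\]
```

In words: dropping the smallest singular values throws away the least energy possible for a given rank.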
Rather than storing the full matrix, A^, in practice it is much more common to save U^ and S^ and project the columns and rows of A into a k-dimensional space. This allows both terms and documents to be clustered together and helps to associate keywords with documents.
You can do many things with these approximated document vectors: clustering, classification and document retrieval. Apple is probably using a k-nearest neighbour classifier [4] to determine how a message is to be filed.
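I have no idea what Apple actually ships, but a k-nearest neighbour classifier over the reduced document vectors could be as simple as the sketch below. The two-dimensional vectors and the mailbox labels are invented; cosine similarity is my assumption for the distance measure.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_classify(query, labelled, k=3):
    """Vote among the k labelled vectors most similar to the query."""
    ranked = sorted(labelled, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical messages already projected into a k=2 dimensional LSI space.
labelled = [
    ([1.0, 0.1], "work"), ([0.9, 0.2], "work"), ([0.8, 0.3], "work"),
    ([0.1, 1.0], "family"), ([0.2, 0.8], "family"),
]
print(knn_classify([1.0, 0.0], labelled))
print(knn_classify([0.0, 1.0], labelled))
```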
I would be most interested to see Apple's updating strategy. There are several algorithms that allow you to add new rows and columns to a matrix where you know the full SVD, but none that I know of for the truncated SVD.
For one of my graduate-level courses, I wrote a little search engine that uses LSI to cluster 1000 newspaper articles. You can play with it here. My favourite query is "Rowan Gorilla." The Rowan Gorilla is an oil rig that frequents Halifax harbour. The search engine returns articles on the oil and gas industry that contain neither the word "Rowan" nor "Gorilla" but are still topical.
[1] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, Richard Harshman. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 1990.
Apparently, the poster has never heard of a realtime blacklist. If a spammer were to use an A60 to send out their spam, their IP would be added to the RBL and none of their messages would get through.
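For those who have not seen one, a DNS-based blacklist is queried by reversing the octets of the sender's IP and looking the result up under the list's zone. A rough sketch follows; the zone name is a placeholder, and the lookup function is shown but not called since it needs a network connection.

```python
import socket

def dnsbl_query_name(ipv4_address, zone):
    """Build the DNS name to look up: reversed octets followed by the list's zone."""
    octets = ipv4_address.split(".")
    if len(octets) != 4:
        raise ValueError("expected a dotted-quad IPv4 address")
    return ".".join(reversed(octets)) + "." + zone

def is_listed(ipv4_address, zone):
    """True if the query name resolves, which is how a listing is signalled."""
    try:
        socket.gethostbyname(dnsbl_query_name(ipv4_address, zone))
        return True
    except socket.gaierror:
        return False

# "rbl.example.org" is a hypothetical zone, not a real blacklist.
print(dnsbl_query_name("192.0.2.10", "rbl.example.org"))
```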
Devices like that are useful for sending out legitimate messages, such as 'Technical Errata' or trying to run product support over E-mail.
Whenever I make a telephone call, for whatever purpose, it is associated with my telephone number. Thus, I am accountable for the use of my communications equipment.
Why should it be different for people using TTY services? Provisions for anonymity only allow people to abuse the telephone system.
One of the primary design constraints for home theatre PCs is that they need to be absolutely silent. Since hard drives can be noisy, keeping the number of drives in your system to a minimum should be important. Many people (myself included) use networked fileservers to serve media to their HTPCs.
These little boxes seem like just the ticket. Imagine a diskless HTPC. All that you would need to do is boot it over the network and mount the drive in the Asus enclosure as your root filesystem. If you were to use a Via C3-based motherboard and a power supply with passive cooling, you could then have an HTPC with no moving parts and thus, totally silent.
I don't think that it is stupid at all. In this particular situation, the line of where the crime was committed is very blurry. While he may have been sitting at his computer in Australia, he was accessing those in other countries and presumably the United States (think IRC servers, FTP servers, DCC connections).
P.S. Do you always feel that you need to make the fallacy of personal attack for your point to be considered valid?
While his actions were performed in Australia, many of his victims (the owners of said IP) reside in the United States. Without getting into an IP law debate, it's not that much of a stretch to prosecute someone under the laws of the country of the victim.
An analogue would be attempting to extradite a 419 scammer from Nigeria because they defrauded a North American.
I have analytically and experimentally proven that over time, those random words will break your spam filter. I hope to publish a paper on the subject this summer at the First Conference on Email and Anti-Spam (CEAS). If you are interested, contact me by e-mail and I can send you a pre-print once the paper is finished and submitted.
Gathering all of that information on all of those people is very difficult. However, in this situation, the user does all of the work and publishes it in a centralized location. If you're making the information publicly available, you can no longer reasonably expect it to be private.
I never thought that Slashdot would help me find papers relevant to my research!
I think that their idea is good from a technical point of view, but very bad from a privacy point of view. I am of the opinion that gathering social network information is extremely dangerous. A pertinent example: If your friend is branded a "terrorist," then "they" can exploit the information that you have voluntarily provided to then put you on a "terrorist" watch list.
Another example: Say that someone who knows someone that you know actually buys something from a spam. If the spammer can access the social network information, suddenly your little niche of the network is going to be aggressively spammed. After all, like minds congregate.
There is no doubt in my mind that the black hatters will infiltrate the social network communities and use that information to spy on potential viewers. See this bugzilla thread where the folks from Atriks Professional Email Deployment Service follow SpamAssassin's development and adapt their "ratware" tool accordingly.
The biggest problem with collecting social networks is that once the data has been gathered, it is very hard to control. Those of you using Orkut should think long and hard about it.
In conclusion, I think that this is technically a good idea but it opens a Pandora's box.
Considering your medical background and interest in computer science, you may find health informatics to be of interest. You will be able to leverage all of your experience as a physician while breaking into the field of computer science.
I work closely with the Departments of Health Informatics at Dalhousie University and University of Victoria and have met many of the other people in the field in Canada. They are a good mix of doctors, nurses and computer scientists and are doing some very interesting and relevant work.
You can find more information at http://www.hiww.org/ and by searching for Health Informatics on Google.
I have hypermail hooked up to the address ano.nymo.us. You can see all of the stuff that people sign up for (and spam they get because of it) here.
See my full response for more details.
I think that you're jumping the gun a bit on your accusations. Perhaps, as you admit in your last paragraph, you should have read the article a bit more carefully before writing your response.
I can understand your premature conclusion that he is talking about using genetic algorithms from his biological metaphors, but I didn't see any actual mention of them. He's just using a funny name for features.
I wouldn't dismiss neural networks in the way that you do. People did put a lot of hope in the perceptron when it was first invented and did lose faith for about 30 years when it was shown that it couldn't be used to separate XOR. However, multi-layer networks and kernel functions have helped them regain their utility.
Lastly, I wouldn't get too hung up on his use of the output of another classifier (naive bayes text classifier) as an input to his neural network. That is, after all, exactly the same thing as what happens between the hidden layer and the output layer. We do the same thing with SpamAssassin and it works out very well for us.
If I were to sum up this approach, it would be SpamAssassin with a multi-layer neural network. I should mention that I maintain the tool that SpamAssassin is useing to train its single-layer neural network for version 3.0, so I can honestly say that have a fair amount of experience in this area.
I'm not too keen on Evans' use of the biological metaphors. I think that they only confuse the issue of what he is doing. I will use the standard terminology, features, from here on out.
What he is doing is finding a nonlinear decision surface between two classes using a universal function approximator. I will explain this in layman's terms.
Imagine a sheet of paper filled with multi-coloured dots where these dots are arranged in clusters and each cluster contains mostly the same number of dots. Starting with a simple example, imagine two clusters of dots, one blue and one red. Assume that you can draw a line that separates the two clusters. That line is called the decision surface. You would say that any new dot that would appear on one side of the line will be called red and the other blue. Any blue dot that appears on the red side of the line would be misclassified as red. This is referred to as a linearly separable problem.
Now, imagine a more complex arrangement of clusters where you can't draw a straight line to separate the red from the blue, but you can separate them using a curved line. This is called a nonlinearly separable problem.
Artificial neural networks are very good for representing these decision surfaces. They are constructed of one or more perceptrons. A perceptron uses an activation function and a transfer function to take a set of inputs and produce a single output. The most popular form of neuron uses a linear activation function and a sigmoid transfer function. The linear activation function is the sum of a set of weighted inputs, i.e. f(X) = sum w_i *x_i. The logarithmic sigmoid transfer function is g(x) = 1/(1+exp(-x)). The output of the perceptron for any given input is O(X) = g(f(x)).
These perceptrons can be chained together in many different ways. One popular method is the multi-layer perceptron, where a set of neurons in the hidden layer process the inputs and pass on their outputs to the output layer where the final output is formed. I don't have a source for you, but it has been proven that, given a large enough hidden layer, the multi-layer perceptron is a universal function approximator.
As long as all of the transfer functions are differentiable, you can train a neural network using error backpropagation by gradient descent. I will leave it as an exercise to the reader to learn how it works, but I assure you that it is very simple. Machine Learning by Tom Mitchell has a good section on the subject, as does Fundamentals of Computational Neuroscience by Thomas Trappenberg.
Evans has identified a large set of features of e-mails, some of whom on their own convey little or no information about whether an e-mail is spam. He trains the neural network to recognize the combinations of these features which can lead towards the conclusion that a message is or is not spam. While his approach is a good idea, I would hesitate to call it novel. Massey, Thomure, Budrevich and Long did a very similar experiment [3] where they used a multi-layer neural network with SpamAssassin.
While his approach is good, there are some downsides for widespread deployment that need to be addressed first. With a large feature set like he is using, you will probably need a lot of training data to find a good fit with a multi-layer perceptron. To train the single layer neural network for SpamAssassin 3.0, I'm using 160000 messages.
Also, as his own arguments show, spam adapts to spam filter technology. Most of the features that he presents in his whitepaper can be easily fooled by a spammer. They can deliberately manipulate these features to evade the spam filter b
It's a shame that people still have to resort to the Google cache when there is a great caching service, FreeCache provided by the Internet Archive. Just make your link like http://freecache.org/http://whatever...
I'm sorry to hear that you're having so much trouble with SpamAssassin. I've heard some rumblings from the Faculty of Law at my university that SA makes a lot of errors on their e-mails. The disclaimers that they put in their signatures must be tripping some of the rules.
I'm trying to find a reliable method of personalising your scores without requring you to download the 300MB corpus that we use to optimise the scores. I hope that, in the future, it will make SpamAssassin more to your liking.
P.S. Your system administrator really shouldn't be discarding those e-mails. The SpamAssassin documentation reccomends that you only tag them and let the users write a filter that detects the X-Spam-Status header. You should ask them to tag the e-mails instead so that you can use DSPAM or CRM114, since they work so well for you.
1. Cormack is very inexperienced in the area of statistical filtering.
Disagreed. Gordon Cormack has been doing information retrieval for 20 years. He is fairly well known in the area. See his publication history at DBLP.
A far more likely conclusion about what's going on here is that Zdiarski's ego has been hurt. Both he and Dr. Yerazunis engage in some very sketchy statistics in their papers and I think that it has caught up to them.
1. Yerazunis' study of "human classification performance" is fundamentally flawed. He did a "user study" where he sat down and re-classified a few thousand of his personal e-mails and wrote down how many mistakes he made. He repeats this experiment once and calls his results "conclusive." There are several reasons why this is not a sound methodology:
a) He has only one test subject (himself). You cannot infer much about the population from a sample size of 1.
b) He has already seen the messages before. We have very good associative memory. You will also notice that he makes fewer mistakes on the second run which indicates that a human's classification accuracy (on the same messages) increases with experience. For this very reason, it is of the utmost importance to test classification performance on unseen data. After all, the problem tends towards "duplicate detection" when you've seen the data before hand.
c) He evaluates his own performance. When someone's own ego is on the line, you would expect that it would be very difficult to remain objective.
2. Both Yerazunis and Zdziarski make use of "chained tokens" in their software. This is referred to in other circles as an "n-gram" model. As with many nonlinear models (the number of possible n-grams grows exponentially with n), it is very easy to over-fit the n-gram model to the training data. Natural language tends to follow Zipf's law (a close cousin of the Pareto law, sometimes called the 80/20 rule), where the frequency of occurrence of a term is inversely proportional to its rank. Most n-grams are therefore seen only a handful of times, and the exponential complexity of the n-gram model makes the distribution of text sparser still, leading to a database with noisy probability estimates.
3. Zdziarski uses a "noise reduction algorithm" called Dobly to smooth out probability estimates in the messages. Aside from his unsubstantiated claim of increased accuracy, I have never seen anything to suggest that it actually works as advertised.
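The sparsity argument in point 2 is easy to see on a toy example. The sketch below is not how DSPAM or CRM114 build their tokens; it just counts plain n-grams over a made-up sentence and reports what fraction of the distinct n-grams were seen exactly once. The names chained_tokens and singleton_ratio are mine.

```python
from collections import Counter

def chained_tokens(words, n):
    """Return all runs of n consecutive words (n-grams) as tuples."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def singleton_ratio(words, n):
    """Fraction of distinct n-grams that occur exactly once in the text."""
    counts = Counter(chained_tokens(words, n))
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(counts)

# Made-up text, for illustration only.
text = ("buy cheap pills now buy cheap watches now "
        "cheap pills for you buy now and save now").split()

for n in (1, 2, 3):
    distinct = len(set(chained_tokens(text, n)))
    print(n, distinct, round(singleton_ratio(text, n), 2))
```

As n grows, more of the distinct tokens are backed by a single observation, and a probability estimated from one observation is mostly noise.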
Considering these points, I was not surprised at all by the results of Dr. Cormack's study. While one may argue that his experimental configuration could use some improvement, his evaluation methods are logically and statistically sound. What I personally saw in the results of this paper was that two classifiers that use unproven technology did not perform as advertised. After all, every other Bayes-based spam filter performed acceptably well.
Lastly, I won't really touch his flawed arguments about how using domain knowledge about spam (i.e. SpamAssassin's heuristics) somehow hinders the classifier over time when you are also using a personalised classifier. You'll notice that SpamAssassin still did acceptably well when all of the rules were disabled.
Go read some more of Zdziarski's work and draw your own conclusions about his work. Pay careful attention to his use of personal attacks when comparing his filter to that of others.
The above poster has obviously never ventured into the accounting department at a university and is merely saying the word "GPL" to karma whore. "Non-trivial" doesn't even begin to describe the complexity of what goes on, to the point that even humans can't get it straight. Just the other day, I had to simultaneously corral no fewer than five university employees to figure out exactly what was going on with my pay situation.
I beg to differ with you on the matter of it being only "an annoyance." I've had to delete comments on my own weblog that (supposedly) link to underage pornography sites. I'm not a lawyer, but I'm fairly certain that it is illegal to link to child pornography. Assuming that this is true, those SEOs are actually causing you, the innocent weblog/wiki owner, to unwillingly and unwittingly commit a criminal act.
Is it still just "annoying?"
After reading through the comments here, it is obvious that there are some misconceptions about what Apple is doing.
Latent Semantic Indexing (LSI) was invented by Deerwester et al. [1] as a method of reducing the dimensionality of a text corpus by finding a low-rank approximation of the term-document matrix.
The singular value decomposition (SVD) [2] factors a matrix A into the product of two orthogonal matrices and a diagonal matrix, A = U'SV, with the singular values on the diagonal of S sorted in decreasing order. To find a rank-k approximation of A using this factorisation, create matrices U^, S^ and V^ where S^ contains the first k rows and columns of S, U^ contains the first k rows of U and likewise for V^. Then, let A^ = U^'S^V^. The Frobenius norm [3] of the difference A - A^ is the smallest achievable by any rank-k approximation of A (least squares).
Rather than storing the full matrix, A^, in practice it is much more common to save U^ and S^ and project the columns and rows of A into a k-dimensional space. This allows both terms and documents to be clustered together and helps to associate keywords with documents.
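Written out, in the same A = U'SV convention (' is the transpose), the truncation and the projection look like this. I am assuming that the rows of A are terms and the columns are documents, and that the first k singular values are nonzero. Scaling by the inverse of S^ is one common convention, not the only one.

```latex
% Truncated SVD and its optimality among rank-k matrices
A = U^{\top} S V, \qquad
\hat{A} = \hat{U}^{\top} \hat{S} \hat{V}, \qquad
\|A - \hat{A}\|_F = \min_{\operatorname{rank}(B) \le k} \|A - B\|_F

% Projection of a document (a column d of A) and a term (a row t of A)
% into the k-dimensional space
\hat{d} = \hat{S}^{-1} \hat{U} d, \qquad
\hat{t} = \hat{S}^{-1} \hat{V} t^{\top}
```

A new document or query is folded in the same way as d, which is why only the truncated factors need to be kept around.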
You can do many things with these approximated document vectors: clustering, classification, document retrieval. Apple is probably using a k-nearest neighbour classifier [4] to determine how a message is to be filed.
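For anyone who hasn't met k-nearest neighbour before, here is a rough Python sketch of the idea. It is only my guess at the general shape, not Apple's implementation. The vectors stand in for projected documents, and the mailbox labels, the function names and the choice of cosine similarity are all assumptions for illustration.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def knn_classify(query, labelled, k=3):
    """Return the majority label among the k vectors most similar to query."""
    ranked = sorted(labelled, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 2-dimensional document vectors and mailbox labels.
filed = [
    ([0.9, 0.1], "work"),
    ([0.8, 0.2], "work"),
    ([0.7, 0.4], "work"),
    ([0.1, 0.9], "family"),
    ([0.2, 0.8], "family"),
]
print(knn_classify([0.85, 0.15], filed))
```

A new message is filed wherever most of its nearest already-filed neighbours live, so the classifier needs no training step beyond storing the projected vectors.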
I would be most interested to see Apple's updating strategy. There are several algorithms that allow you to add new rows and columns to a matrix where you know the full SVD, but none that I know of for the truncated SVD.
For one of my graduate-level courses, I wrote a little search engine that uses LSI to cluster 1000 newspaper articles. You can play with it here. My favourite query is "Rowan Gorilla." The Rowan Gorilla is an oil rig that frequents Halifax harbour. The search engine returns articles on the oil and gas industry that contain neither the word "Rowan" nor "Gorilla" but are still topical.
[1] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, Richard Harshman. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 1990.
[2] Singular Value Decomposition -- from MathWorld. http://mathworld.wolfram.com/SingularValueDecomposition.html
[3] Frobenius Norm -- from MathWorld. http://mathworld.wolfram.com/FrobeniusNorm.html
[4] Artificial Intelligence Wiki: NearestNeighbour. http://www.ifi.unizh.ch/ailab/aiwiki/aiw.cgi?NearestNeighbor
Apparently, the poster has never heard of a realtime blacklist. If a spammer were to use an A60 to send out their spam, their IP would be added to the RBL and none of their messages would get through.
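For the curious, a DNS-based blacklist check is tiny: the receiving mail server reverses the octets of the sender's IP, appends the list's zone and does an ordinary DNS lookup. If the name resolves, the address is listed. A minimal Python sketch follows. The zone dnsbl.example.org is a placeholder and not a real list, and both function names are mine.

```python
import socket

def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to look up: reversed IPv4 octets followed by the list's zone."""
    octets = ip.split(".")
    if len(octets) != 4:
        raise ValueError("expected a dotted-quad IPv4 address")
    return ".".join(reversed(octets)) + "." + zone

def is_listed(ip: str, zone: str) -> bool:
    """Return True if the blacklist answers for this address (needs network access)."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        # The name did not resolve, so the address is treated as not listed.
        return False

# "dnsbl.example.org" is a placeholder zone, not a real blacklist.
print(dnsbl_query_name("192.0.2.25", "dnsbl.example.org"))
```

Because the check happens per connection at the receiving end, it does not matter how fast the sending box is once its address is on the list.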
Devices like that are useful for sending out legitimate messages, such as 'Technical Errata' or trying to run product support over E-mail.
Whenever I make a telephone call, for whatever purpose, it is associated with my telephone number. Thus, I am accountable for the use of my communications equipment.
Why should it be different for people using TTY services? Provisions for anonymity only allow people to abuse the telephone system.
evanbd:
I did a parallel implementation of an Amazons solver last summer. Drop me an e-mail if you want to chit chat about Amazons.
One of the primary design constraints for home theatre PCs is that they need to be absolutely silent. Since hard drives can be noisy, keeping the number of drives in your system to a minimum should be important. Many people (myself included) use networked fileservers to serve media to their HTPCs.
These little boxes seem like just the ticket. Imagine a diskless HTPC. All that you would need to do is boot it over the network and mount the drive in the Asus enclosure as your root filesystem. If you were to use a Via C3-based motherboard and a power supply with passive cooling, you could then have an HTPC with no moving parts that is, as a result, totally silent.
I don't think that it is stupid at all. In this particular situation, the line of where the crime was committed is very blurry. While he may have been sitting at his computer in Australia, he was accessing computers in other countries, presumably including the United States (think IRC servers, FTP servers, DCC connections).
P.S. Do you always feel that you need to make the fallacy of personal attack for your point to be considered valid?
While his actions were performed in Australia, many of his victims (the owners of said IP) reside in the United States. Without getting into an IP law debate, it's not that much of a stretch to prosecute someone under the laws of the country of the victim.
An analogue would be attempting to extradite a 419 scammer from Nigeria because they defrauded a North American.
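Paste into MATLAB:
% Fetch the image and crop off its one-pixel border.
I = imread('http://si20.com/img/authimage?seed=63419&userid=1');
J = double(I(2:31,2:161));
% Tile column 2 across the width and row 1 down the height as background templates.
hl = repmat(J(:,2),1,size(J,2));
vl = repmat(J(1,:),size(J,1),1);
% Zero every pixel that matches the background, then binarise what is left.
J(J == hl) = 0;
J(J == vl) = 0;
J(J ~= 0) = 1;
% Plot the remaining pixels, which are the letters.
spy(J);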
Tada! Use a simple Hamming distance-based nearest neighbour matching algorithm to find the values of the individual letters.
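The matching step could look something like the Python sketch below. It assumes the cleaned image has already been cut into fixed-size glyphs and that you have one reference bitmap per letter. The 3x3 templates here are invented to keep the example short; real ones would come from a few labelled samples.

```python
def hamming(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("glyphs must have the same size")
    return sum(x != y for x, y in zip(a, b))

def nearest_letter(glyph, templates):
    """Return the template letter with the smallest Hamming distance to glyph."""
    return min(templates, key=lambda letter: hamming(glyph, templates[letter]))

# Hypothetical 3x3 templates, flattened row by row (1 = ink, 0 = background).
templates = {
    "I": "010010010",
    "L": "100100111",
    "T": "111010010",
}
noisy_t = "111010011"  # a "T" with one flipped pixel
print(nearest_letter(noisy_t, templates))
```

A few stray pixels left over from the background removal only add a small distance, so the right letter still wins.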
I have analytically and experimentally proven that over time, those random words will break your spam filter. I hope to publish a paper on the subject this summer at the First Conference on Email and Anti-Spam (CEAS). If you are interested, contact me by e-mail and I can send you a pre-print once the paper is finished and submitted.
If you are concerned about the source being too large, take a peek at mod_gzip.
Gathering all of that information on all of those people is very difficult. However, in this situation, the user does all of the work and publishes it in a centralized location. If you're making the information publicly available, you can no longer reasonably expect it to be private.
I never thought that Slashdot would help me find papers relevant to my research!
I think that their idea is good from a technical point of view, but very bad from a privacy point of view. I am of the opinion that gathering social network information is extremely dangerous. A pertinent example: If your friend is branded a "terrorist," then "they" can exploit the information that you have voluntarily provided to then put you on a "terrorist" watch list.
Another example: Say that someone who knows someone that you know actually buys something from a spam. If the spammer can access the social network information, suddenly your little niche of the network is going to be aggressively spammed. After all, like minds congregate.
There is no doubt in my mind that the black hatters will infiltrate the social network communities and use that information to spy on potential viewers. See this bugzilla thread where the folks from Atriks Professional Email Deployment Service follow SpamAssassin's development and adapt their "ratware" tool accordingly.
The biggest problem with collecting social networks is that once the data has been gathered, it is very hard to control. Those of you using Orkut should think long and hard about it.
In conclusion, I think that this is technically a good idea but it opens a Pandora's box.
In other news, predators kill billions of animals per year.
Considering your medical background and interest in computer science, you may find health informatics to be of interest. You will be able to leverage all of your experience as a physician while breaking into the field of computer science.
I work closely with the Departments of Health Informatics at Dalhousie University and University of Victoria and have met many of the other people in the field in Canada. They are a good mix of doctors, nurses and computer scientists and are doing some very interesting and relevant work.
You can find more information at http://www.hiww.org/ and by searching for Health Informatics on Google.