Poison Attacks Against Machine Learning
mikejuk writes "Support Vector Machines (SVMs) are fairly simple but powerful machine learning systems. They learn from data and are usually trained before being deployed. SVMs are used in security to detect abnormal behavior such as fraud and credit card use anomalies, and even to weed out spam. In many cases they need to continue to learn as they do the job, and this raises the possibility of feeding them data that causes them to make bad decisions. Three researchers have recently demonstrated how to do this with a minimum of poisoned data to maximum effect. They found that their method could have a surprisingly large impact on the performance of the SVMs tested. They also point out that it could be possible to direct the induced errors so as to produce particular types of error. For example, a spammer could send some poisoned data so as to evade detection for a while. AI based systems may be no more secure than dumb ones."
Why the hell is the only link in the summary to that rather useless "I Programmer" website? The summary here at Slashdot is basically the content of the entire linked "article"!
Here is a much more useful link for anyone interested in reading the actual paper: http://arxiv.org/abs/1206.6389v1
Universities should run a number of psychology experiments to see how this can be done to human intelligence to see how susceptible it is compared to AI. Or you could just study people who tune in to .
On this side of the human / AI line, we call this propaganda. It has historically proved very effective, especially if you can control all of the "training data."
The security implications aside, one problem I see is a possible arms race between the poisoners and the AI designers. The only way for the designers to win is to build tests that are less tolerant of the poisoned data. This is good if AI systems are built to interact only with other AI systems. But what if humans are the end users?
At some point, the increase in data precision will come up against the natural imprecision of human users. Fewer humans will be smart enough to pass the Turing test. A practical example: I've noticed how Google's recaptcha puzzles have become more difficult. I now need to magnify the page view in order to make out some of the letters.
it's called propaganda
see: Fox News
intellectual property law is philosophically incoherent. it is your moral duty to ignore it or sabotage it
Well, duh ... leave the learning on and the GIGO rule is active! Leave the learning off and people will figure out how to be ignored by it.
Nothing new here at all.
There's already a whole subfield of machine learning that concerns itself with these problems. It's called "adversarial machine learning".
The approaches are very different from usual software security. Instead of busying itself with patching holes in software or setting up firewalls, adversarial machine learning redesigns the algorithms completely, using game theory and other techniques. The premise is "How can we make an algorithm that works in an environment full of enemies that try to mislead it?" It's a refreshing change from the usual software-security paradigm, which is all about fencing the code into some supposedly 'safe' environment.
I wonder how long it will take for Machine Intelligence Sanding to be incorporated into a sci fi flick:
"What are you doing? I didn't even know you liked to fish."
"Whenever I order weapons I also order something from an unrelated site. Besides, a box of streamers might come in handy"
Thanks for your order! Bass Pro Shops
So if you know the algorithm and training data, and you can feed the system new data with manipulated labels then you can confuse it. It's a little early to panic about your spam filter. Hopefully everyone realizes that if you let the spammers tell your computer what is and is not spam, they can cause it to let their spam through.
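To make the point about manipulated labels concrete, here is a minimal sketch. It uses a tiny word-count naive Bayes filter rather than an SVM, purely because it fits in a few lines; the class name, the messages and the number of poisoned samples are all made up for illustration and are not taken from the paper.

```python
import math
from collections import Counter

class TinyNaiveBayes:
    """Word-count naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        # Whoever supplies `label` controls what the filter learns.
        self.word_counts[label].update(text.lower().split())
        self.doc_counts[label] += 1

    def spam_score(self, text):
        """Log-odds that `text` is spam; a positive value means 'spam'."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        score = 0.0
        for label, sign in (("spam", 1), ("ham", -1)):
            n_words = sum(self.word_counts[label].values())
            log_p = math.log(self.doc_counts[label] / total_docs)
            for word in text.lower().split():
                count = self.word_counts[label][word]
                log_p += math.log((count + 1) / (n_words + len(vocab)))
            score += sign * log_p
        return score

spam_filter = TinyNaiveBayes()
spam_filter.train("cheap pills buy now", "spam")
spam_filter.train("buy cheap watches now", "spam")
spam_filter.train("meeting notes attached", "ham")
spam_filter.train("lunch at noon tomorrow", "ham")

target = "buy cheap pills"
score_before = spam_filter.spam_score(target)

# Hypothetical attacker: injects spam-like text with the wrong label.
for _ in range(5):
    spam_filter.train("buy cheap pills", "ham")

score_after = spam_filter.spam_score(target)
print(score_before > 0, score_after > 0)
```

With these toy numbers the target message is scored as spam before the poisoned samples are added and as ham afterwards. The attack only works because the attacker gets to supply the labels, which is exactly the condition described above.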
I know that email spammers have been exploiting this against Bayesian filters for the past decade
"When life gives you lemons, don't make lemonade. Make life take the lemons back!" -- Cave Johnson
Support Vector Machines are just a way of performing supervised classification, i.e. you feed a bunch of labelled data vectors into the algorithm and it determines the boundaries that best separate the classes, leaving as wide a margin as possible between them.
e.g. you feed it (number of wheels, weight) pairs for a lot of vehicles, each labelled "bike", "car" or "truck", and it learns boundaries separating light 2-wheeled vehicles, heavy 4-wheeled ones, and very heavy 4-wheeled ones. You can then use those boundaries to determine the category a new data point falls into.
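For the wheels-and-weight example, a rough sketch of a linear SVM may help. Note that an SVM is normally trained on points that already carry labels. The code below is a bare-bones two-class version (bikes against cars) trained by batch subgradient descent on the regularised hinge loss; the data points, the function names and the hyperparameters `lam`, `lr` and `epochs` are invented for illustration, and a real library would use a proper solver.

```python
# Toy training set: (wheels, weight in tonnes) -> -1 for bike, +1 for car.
# The numbers are invented for illustration.
TRAIN = [
    ((2, 0.010), -1), ((2, 0.015), -1), ((2, 0.200), -1),
    ((4, 1.200), +1), ((4, 1.500), +1), ((4, 2.000), +1),
]

def train_linear_svm(data, lam=0.01, lr=0.01, epochs=20000):
    """Minimise lam/2 * |w|^2 + mean hinge loss by batch subgradient descent."""
    w = [0.0, 0.0]
    b = 0.0
    n = len(data)
    for _ in range(epochs):
        # The regulariser contributes lam * w to the gradient.
        grad_w = [lam * w[0], lam * w[1]]
        grad_b = 0.0
        for x, y in data:
            margin = y * (w[0] * x[0] + w[1] * x[1] + b)
            if margin < 1:  # inside the margin or misclassified
                grad_w[0] -= y * x[0] / n
                grad_w[1] -= y * x[1] / n
                grad_b -= y / n
        w[0] -= lr * grad_w[0]
        w[1] -= lr * grad_w[1]
        b -= lr * grad_b
    return w, b

def predict(model, x):
    """Return +1 (car) or -1 (bike) depending on the side of the boundary."""
    w, b = model
    return 1 if w[0] * x[0] + w[1] * x[1] + b >= 0 else -1

model = train_linear_svm(TRAIN)
print(predict(model, (4, 1.4)), predict(model, (2, 0.012)))
```

Only the points closest to the boundary (the "support vectors") end up shaping it, which is why a few well-placed poisoned points can move it a long way.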
This isn't Artificial Intelligence - it's just a data mining/classification technique.
From the article, if you have access to the training data and know the learning algorithm, you can game the machine learning (SVM, not AI) system. How is that anything but self-evident, non-news?!
"Consensus" in science is _always_ a political construct.
Stop talking about how easy it is to poison data collection efforts; you're going to kill the golden goose of those who insist that analyzing social data can allow you to pinpoint psychopaths and other "problematic" individuals before that goose ever takes to the air (on the wings of "black budget" funding, no doubt).
Orwell: "In a Time of Universal Deceit, telling the Truth is a Revolutionary Act"
A couple of commenters have noted that there is a branch of research related to defending against this - according to one it's called "adversarial machine learning". I've been casually wondering for some time about a related question, which is very relevant to using the various 'bottom up' AI systems like SVMs and neural nets as models of human intelligence and of various complex adaptive systems ('living systems') including economies and polities (and evolutionary biology for that matter). If we look at these systems (both the real world ones and the mathematical models) as decision convergence models, what is the effect of nodes that make errors once, occasionally, frequently, or continuously? And how does a successful neural network that is dealing with a continuously changing environment accommodate an element/node that provides, for example, randomly varying responses? What about a node that 'purposely' provides poisoned responses - like a secret agent putting false data into the news? In a machine, those things may be manageable by simply starting over, but in a continuous system like a real brain, that is not an option.
I learned a while back that in the human brain, a neuron whose output signals become ignored (the output from its axons becomes weighted so low that it has no influence on the 10,000 other neurons it is talking to) dies. The brain seems to act very much like a republic of cantankerous, disagreeable citizens arguing at many different levels (and with shifting alliances). But if one continuously shouts "We're all gonna dieeee!!!", pretty soon nobody listens any more.
It's easier to be a result of the past, but more fun to be a cause of the future! http://www.spacefinancegroup.com/
The Texas board of education has a pretty good handle on the minimum amount of poisoned data it takes to affect learning.
Have gnu, will travel.
Cat and Mouse 2.0. Nothing new here.
From the paper:
"...we assume that the attacker knows the learning algorithm and can draw data from the underlying data distribution. Further, we assume that our attacker knows the training data used by the learner;"
They characterize these assumptions as "unrealistic", which I think is about right in a real world setting.
I'm not sure why this would be surprising. ML algorithms work best if the future behaves like the past, i.e. if it has the same probability distribution as the training data. Some algorithms can handle slow changes if they can continually get new training data, but large changes are a problem.
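A toy illustration of that point, assuming a one-dimensional classifier that simply puts its threshold midway between the two class means. The data, the shift sizes and the function names are made up for the example.

```python
def fit_threshold(samples):
    """Place the decision threshold midway between the two class means."""
    zeros = [x for x, label in samples if label == 0]
    ones = [x for x, label in samples if label == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2

def accuracy(threshold, samples):
    """Fraction of samples on the correct side of the threshold."""
    hits = sum((x >= threshold) == bool(label) for x, label in samples)
    return hits / len(samples)

def shifted(samples, delta):
    """Same samples with every feature value moved by `delta`."""
    return [(x + delta, label) for x, label in samples]

# "Past" data the model was trained on: (feature value, class label).
past = [(0.5, 0), (1.0, 0), (1.5, 0), (3.5, 1), (4.0, 1), (4.5, 1)]
threshold = fit_threshold(past)

acc_same = accuracy(threshold, past)                  # future looks like the past
acc_small = accuracy(threshold, shifted(past, 0.5))   # slow drift
acc_large = accuracy(threshold, shifted(past, 3.0))   # large change
acc_retrained = accuracy(fit_threshold(shifted(past, 3.0)), shifted(past, 3.0))

print(acc_same, acc_small, acc_large, acc_retrained)
```

The stale threshold survives the small drift, drops to coin-flip accuracy after the large one, and recovers once it is refit on the new data. That retraining step is also the opening a poisoner needs.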
It's just an elaborate filter program, which is far away from real AI.
world was created 5 seconds before this post as it is.
In other words, artificial intelligence is just as limited and varied as regular ole human intelligence.
Jeez. Who'd a thunk it?