Would it be feasible for a TOR-like system to restrict traffic to text only (i.e. stop/limit Base64 or whatever encoding)?
Nope. I Can capitalize Each word Of A Sentence according to Whether i Want To send a 1 Bit or A 0 bit. now i Have Encoded Illicit information In a plain text Form.
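To make the capitalization trick concrete, here is a minimal sketch. The function names and the one-bit-per-word convention (capitalized first letter = 1, lowercase = 0) are my own assumptions for illustration, and it assumes every carrier word starts with a letter.

```python
def encode_bits(cover_text: str, bits: str) -> str:
    """Hide one bit per word: capitalized first letter = 1, lowercase = 0."""
    words = cover_text.split()  # note: this normalizes whitespace
    if len(bits) > len(words):
        raise ValueError("cover text has too few words for the payload")
    out = []
    for i, word in enumerate(words):
        if i < len(bits):
            if not word[0].isalpha():
                raise ValueError("carrier words must start with a letter")
            first = word[0].upper() if bits[i] == "1" else word[0].lower()
            word = first + word[1:]
        out.append(word)
    return " ".join(out)


def decode_bits(stego_text: str, n_bits: int) -> str:
    """Recover the first n_bits bits from the capitalization of each word."""
    words = stego_text.split()
    return "".join("1" if w[0].isupper() else "0" for w in words[:n_bits])
```

The receiver has to know how many bits to read, and the output is still ordinary text as far as any "text only" filter is concerned.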
Good ideas, but I'd nix the ivy. In most regions it is considered an invasive species. Its roots pulverize house siding over time, the cuttings take root and grow in unpredictable places, the vines wrap around and choke the growth of trees, etc. Ivy, particularly English ivy, is a hellish plant that should be ripped up and burned when found, not encouraged. (Unless of course it's in its native environment where the other members of the ecosystem have found a balance with it.)
I just said that paper recycling reduces demand for new carbon that began life as a CO2-eating tree.
What does demand for new carbon have to do with whether or not new trees grow? So grow the trees and DON'T make paper out of them -- hell, why not leave them standing? It's something called a "forest."
If you bury that paper, the carbon remains sequestered, and then a new tree can take new carbon from the air to make new paper.
This argument is ridiculous. Why bother making paper? Just take the trees themselves, bury them in such a manner that they will not decay, and allow new trees to grow where the old ones stood. Sounds pretty stupid, doesn't it?
Also, you are asserting without proof that trees can only grow where previous trees once stood, a premise easily refuted by the existence of a fine apple tree in my front lawn, planted last year, which takes the place of no previous tree.
That would presume the Asgard need to carry around a multitude of devices with them. How PRIMITIVE. All they need is the "hockey puck thingy," and one of those is available at any nearby console.
This actually raises a question to which I don't know the answer: if you take a fairly standard symmetric cypher, say DES, and two keys K1 and K2, does there always exist a key K3 such that E_K1(E_K2(Message)) == E_K3(Message) ?
No, this is not always the case. For DES it is certainly NOT the case. The terminology you are looking for is "idempotent cipher." A cipher is idempotent if it obeys the equation you stated -- i.e., for any key pair K1,K2, there is a K3 such that E_K2(E_K1(x))=E_K3(x). DES is NOT an idempotent cipher.
These questions only seem deep to non-crypto-experts (no offense to you intended) -- and this one was certainly accounted for in the design of DES (along with features which harden it against differential cryptanalysis, a technique that wasn't even publicly known outside the NSA at the time -- the people who work on these things are far from stupid).
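To see what the property means in practice, here is a brute-force check on two toy 4-bit ciphers. Both ciphers are invented for this sketch and have nothing to do with DES internals; the point is only to show one cipher where a K3 always exists and one where it does not.

```python
from itertools import product

BITS = 4
SIZE = 1 << BITS  # 16 possible blocks and 16 possible keys


def xor_cipher(key: int, block: int) -> int:
    """Toy cipher: XOR the block with the key."""
    return block ^ key


def rotl(value: int) -> int:
    """Rotate a 4-bit value left by one bit."""
    return ((value << 1) | (value >> (BITS - 1))) & (SIZE - 1)


def add_rotate_cipher(key: int, block: int) -> int:
    """Toy cipher: add the key modulo 16, then rotate left by one bit."""
    return rotl((block + key) % SIZE)


def is_closed(cipher) -> bool:
    """True if every double encryption equals some single encryption."""
    # Each key defines a permutation of the 16 blocks; store it as a tuple.
    single = {tuple(cipher(k, x) for x in range(SIZE)) for k in range(SIZE)}
    for k1, k2 in product(range(SIZE), repeat=2):
        double = tuple(cipher(k2, cipher(k1, x)) for x in range(SIZE))
        if double not in single:
            return False
    return True
```

Here `is_closed(xor_cipher)` is true (K3 is just K1 XOR K2), while `is_closed(add_rotate_cipher)` is false. When a cipher has the property, encrypting twice buys no extra security, which is why the answer matters for constructions like triple encryption.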
Ha ha funny and all, except that in your cherry-picked examples there is no reasonable overlap between the fields where the terms are used. However, in a field called planetary geology it would be quite REASONABLE to expect that a scientist might want to discuss the plutons (geological sense) which exist on a certain pluton being studied (astronomical sense).
I, for one, find it virtually impossible to imagine how overloading "pluton" could result in a lot of confusion. In context, it's going to be obvious to the most casual observer which meaning is actually intended. This whole objection is ridiculous.
Only a halfwit couldn't see the potential for confusion. How, exactly, are you supposed to clearly talk about plutons (geological sense) that happen to be on plutons (astronomical sense)? Your rejection of the objection displays lack of imagination.
This is one thing I never understood. Why do advanced races no longer need clothing in our shows?
Well, with the Asgard the explanation is simple. They reproduce through cloning, and can't reproduce sexually. Thus, they have no concept of "sexuality," and therefore no body-sexuality neurotic complex which compels them to cover their nakedness. Plus, why the hell should you waste time making clothing when you live in an advanced, totally climate-controlled city or spacecraft?
Yeah, it's all the creepiness of Google, but without the "do no evil" oversight. What could possibly be wrong with that?
Ever hear the phrase "hindsight is 20/20?" And do you really think some "do no evil" corporate mantra is going to protect you until the end of the universe? I could say you're just as stupid for using Google (after all, didn't you realize that in May 2009 they're going to release all your search queries?)
Here's a potential solution to your "industrywide problem": Stop treating us (your users) as nothing more than a market. We're individual human beings. Right now, we just look like sacks of money to you and your "research" consists of trying to extract that money from us.
If you want to be treated as a person, then limit your interactions to other people, not corporations. What the hell do you expect from a FACELESS ENTITY?
Why are people continually shocked at the behavior of corporations (which are entities conceived for the sole purpose of MAKING PROFIT)? Does it suck? Yeah, it sucks. Does bitching about it make any more sense than complaining about the lack of whiskers on a lizard? Nope.
For example, if you have looked for a job in the last four years, you were foolish if you didn't search for your own name to see if your friends' blogs had descriptions of your late-night drinking binges and drug use. (You are probably foolish if you used AOL search to do this, but that's a different discussion.)
Why would it be foolish? AOL search is just Google, anyway.
That, of course, is assuming that he really is as innocent in all of this as he claims to be.
I think "naive" is a better word in this case. Of all people, I'd expect a data mining academic to understand the potential ramifications of releasing search data. Maybe this guy's head was so far in the clouds above the ivory tower that it just didn't occur to him, but I somehow doubt it. More likely, the idea of getting a publication out was too attractive to worry about petty ethical considerations.
Why do they keep such logs, anyway? If it's to help tailor results better, or to help sell advertising, then why is it correlated with a user ID? My company, for example, saves a keyword search history, but there is no user-identifiable information correlated with it. And it's plenty of information for our needs.
First, the search database doesn't list AOL user IDs. It lists "unique IDs" for each user, but they are not correlated to whatever AOL's internal "User ID" is. But to assume that sanitizing the data by changing or completely removing user IDs will make people safe is boneheaded.
Let's start with a grep for social security numbers. I've blipped out the actual numbers themselves, but that's not much help for these poor folks, since anybody can get their hands on the database:
find robert williams akron oh 44306 XXX-XX-XXXX
birth certificate for debra ann collins 1-28-59 ss XXX-XX-XXXX
locate keith ivan thompson born 3 may 64 social security XXX-XX-XXXX last address was XXXXXX colorado
kristy nicole vega hammond la. social secruity number XXX-XX-XXXX birth date 03 08 81 drivers license number la. XXXXXXXXX address XXXXXXXX.
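For anyone wondering what a search like this involves, here is a rough sketch. The pattern only covers the common NNN-NN-NNNN layout, the function names are mine, and how the queries get loaded is left out because I am not assuming anything about the file format.

```python
import re

# Matches the common NNN-NN-NNNN layout only; other spellings would be missed.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def find_ssn_like(queries):
    """Return the queries that contain something shaped like an SSN."""
    return [q for q in queries if SSN_PATTERN.search(q)]


def redact(query: str) -> str:
    """Blank out SSN-shaped substrings, as was done for the excerpts above."""
    return SSN_PATTERN.sub("XXX-XX-XXXX", query)
```

Note that redacting the number does nothing about the name, city, and birth date sitting next to it in the same query.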
Moving on, check out this fascinating query:
all i can say is you looked amazing in that photo. i would love to get achanceto know you. expect a call from me soon. are you looking for a friend or a companian just for future reference
Looks like somebody accidentally copy-pasted a portion of their private communication (email or IM, perhaps) into the search query box and clicked "Submit." Now their private thoughts are available for all to see. You'd be AMAZED at the stuff you'll find in these logs. The idea that removing usernames/IDs from data is "instant sanitization" is naive and dangerous. There is more than enough information in many of these queries to identify specific individuals and examine EVERYTHING they have searched for in the past 6 months.
(I do question the sanity and intelligence of some of the people who submitted queries like the ones above, but ultimately this is not their fault.)
And if it is so great and reliable, why are they using hot dogs and not this guy's hand?
That's a little like asking why they don't use live people instead of dummies in automobile crash tests. Don't they have any faith in their products?
Anybody with a realistic sense of safety and security understands that even if your safety system is 99.9% reliable, you still don't press it into service UNLESS ABSOLUTELY NECESSARY. Why have a 99.9% chance of being okay when you can have a 100% chance of being okay? When using a table saw, your PRIMARY line of safety is not putting your fingers in the fricking saw, NOT some fancy electronic capacitance gizmo. It's great to have that around in case you decide to be an idiot one day, but relying totally on it as an excuse to be a dumbshit is stupid, and if you lose a finger you get what you deserve.
In rock climbing, great pains are taken to make sure the climbers have SOLID ANCHORS to the rock face -- attachments that you could hang a Chevy from. That still doesn't mean you're going to deliberately fall on your protection!
If the filter threshold is set to junk these chatty spams, then it is tough enough to eliminate the first email from any chatty person.
You can't expect a machine learning system to function adequately before it has learned anything. I'm not trying to say that Bayesian filtering is infallible, but the specific attacks being described in this article are anything but efficient.
You're missing the point. To confuse a spam filter, you only need to cause it to have a small number of false negatives. Even 5% of legitimate mail being classified as spam is too high for many people.
Did you mean false positives? Especially with the sheer volume of spam many people receive, I don't think many will give up spam filtration because of false positives. They might seek other solutions, such as combinations of statistical filtering and whitelisting. Or choose a more reliable method of exchanging critical communications. But the VERY small impact that a random text attack might possibly have would not, I think, ever overwhelm the desire to filter spam.
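Since the two terms keep getting swapped, here is a small sketch that pins them down. It assumes the usual convention for spam filters: a false positive is legitimate mail flagged as spam, and a false negative is spam that gets through. The function name and the input layout are made up for illustration.

```python
def error_rates(results):
    """Compute (false_positive_rate, false_negative_rate).

    results: list of (is_spam, flagged_as_spam) boolean pairs.
    """
    ham = [flagged for is_spam, flagged in results if not is_spam]
    spam = [flagged for is_spam, flagged in results if is_spam]
    # False positive: legitimate mail that the filter flagged as spam.
    fp_rate = sum(ham) / len(ham) if ham else 0.0
    # False negative: spam that the filter let through.
    fn_rate = sum(not flagged for flagged in spam) / len(spam) if spam else 0.0
    return fp_rate, fn_rate
```

With 100 legitimate messages of which 5 are flagged, the false positive rate is the 5% figure from the parent post, regardless of how much spam slipped past.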
You are correct, and this is exactly why shared statistical filters don't work well. The problem is not the lack of strong non-spam keywords (there are plenty) -- in this case, the problem is that a portion of users receive spam-like email which they consider perfectly legitimate, and this decreases the usefulness of the spammy keywords. Ten thousand users who purposefully sign up for Buy.com daily updates are going to wreak havoc with the system.
Most inboxes, not containing many "special" words like "IRQ" and "Johannsen", are filled with these common words. If a Bayesian filter were to assume that all emails in your inbox are to be learned as non-spam then spammers using the most common 5,000 words would get through most filters.
You are asserting, with no basis, something which is empirically proved to be untrue. Almost all legitimate email contains user-specific keywords. Over time the accumulation of these word counts will overwhelm any random attempts by spammers to corrupt the database. Even the simple technique of including the sender's email address as a token during Bayesian processing is enormously beneficial.
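Here is a rough sketch of the sender-as-token idea. This is a bare-bones naive Bayes scorer written for illustration, not the algorithm of any particular mail client; the class name, the "from:" token prefix, and the add-one smoothing are all my assumptions.

```python
import math
import re
from collections import Counter


def tokenize(sender: str, body: str):
    """Word tokens plus one extra token for the sender's address."""
    tokens = set(re.findall(r"[a-z0-9']+", body.lower()))
    tokens.add("from:" + sender.lower())
    return tokens


class TinyBayes:
    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, label: str, sender: str, body: str) -> None:
        self.messages[label] += 1
        self.counts[label].update(tokenize(sender, body))

    def spam_probability(self, sender: str, body: str) -> float:
        # Sum log-likelihood ratios with add-one smoothing, then squash.
        n_spam, n_ham = self.messages["spam"], self.messages["ham"]
        log_odds = math.log((n_spam + 1) / (n_ham + 1))
        for tok in tokenize(sender, body):
            p_spam = (self.counts["spam"][tok] + 1) / (n_spam + 2)
            p_ham = (self.counts["ham"][tok] + 1) / (n_ham + 2)
            log_odds += math.log(p_spam / p_ham)
        log_odds = max(min(log_odds, 700.0), -700.0)  # keep exp() in range
        return 1 / (1 + math.exp(-log_odds))
```

If both the ham and the spam in the training set contain "how are you", those words cancel out, and the sender token is what separates a known correspondent from a stranger using the same common words.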
Take my dad for instance; he isn't on any mailing list; 99% of his email is along the lines of "how are you" and "give my love" etc; pretty run of the mill stuff.
People who ask those sorts of things usually sign their name to their email. Those names will become strong non-spam keywords. ANYTHING your dad talks about specifically will help -- hobbies, places he usually goes, etc. You'd be surprised how much specific, intelligent content even the most "ordinary" of people will produce.
Having a Bayesian filter forget over time also helps shrink the database and helps it adapt as the contents of spam change.
Having the filter forget is the ONLY effective policy. In statistical filtering, it is certainly NOT true that more data == better results. You want a sample of data that most accurately represents the sort of content you are receiving RIGHT NOW. I completely purge my Firefox Bayesian database every couple of months and retrain on recent emails only. The result is ALWAYS an increase in accuracy, particularly a reduction in false positives.
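The purge-and-retrain routine can be sketched roughly like this. The 60-day window is my stand-in for "every couple of months", and the (timestamp, label, tokens) layout is hypothetical rather than how any real mail client stores its training data.

```python
from collections import Counter
from datetime import datetime, timedelta


def retrain(messages, now: datetime, max_age: timedelta = timedelta(days=60)):
    """Rebuild token counts from scratch using only recent messages.

    messages: iterable of (timestamp, label, tokens) tuples,
    where label is "spam" or "ham".
    """
    counts = {"spam": Counter(), "ham": Counter()}
    for timestamp, label, tokens in messages:
        # Anything older than the window is simply forgotten.
        if now - timestamp <= max_age:
            counts[label].update(set(tokens))
    return counts
```

Old spam vocabulary drops out of the counts entirely instead of lingering and skewing the scores for the mail arriving right now.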
How, exactly, are they absolved of any responsibility?
The same way a rape victim is absolved of responsibility, even if they were wearing a provocative outfit, you fucking sociopath.
So much for "Do No Evil"
So much for understanding of basic trademark law. If Google does not act to defend its trademarks, they will lose their legal status as trademarks. How well off do you think Google would be if anybody could put any software on the web and call it Google software?
Just because the HP calc used RPN doesn't mean the CPU itself was stack based. Does anybody know the specific processors used in those calculators?
Thing is, the spam detection already catches it ... so I'm not sure how this will "train" the filters.
It won't. The technique is ineffective, as you've already seen. It's the "brainchild" of a mind that doesn't understand how statistical filters work.