The Evil in E-Mail
Frenchy in Ontario writes "An Ontario university researcher is devising ways to help law enforcement agencies better pinpoint likely criminal behavior in e-mails. His theory is that people who are "up to something" write differently from people who aren't. One approach looks for writers who avoid certain words that could be flagged for possible criminal context (like "bombed"); another examines patterns that might indicate criminal activity, like several people e-mailing one person but not each other, which is how some criminal networks operate. There's also an interesting paragraph on why Enron's emails aren't as valuable as you might think for this sort of work."
From TFA:
Super. I'm predicting a whole lot of false positives...especially during the initial phase of this operation...
Also from TFA:
Great...so words like 'bombed' get the email flagged...as well as an absence of the word 'bombed'? So far, Skillicorn's test appears 100% sensitive...too bad it's 0% specific.
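For anyone who wants the "100% sensitive, 0% specific" jab spelled out, here is a minimal sketch. The ten-message toy data and the flag-everything "test" are made up for illustration; nothing here comes from TFA.

```python
def sensitivity_specificity(labels, flags):
    """labels[i] is True if message i really is criminal;
    flags[i] is True if the test flagged message i."""
    pairs = list(zip(labels, flags))
    tp = sum(1 for actual, flagged in pairs if actual and flagged)
    fn = sum(1 for actual, flagged in pairs if actual and not flagged)
    tn = sum(1 for actual, flagged in pairs if not actual and not flagged)
    fp = sum(1 for actual, flagged in pairs if not actual and flagged)
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity

# Hypothetical data: 2 criminal messages out of 10, and a test that
# flags a message both when 'bombed' is present and when it is absent,
# i.e. it flags everything.
labels = [True, True] + [False] * 8
flag_everything = [True] * 10

print(sensitivity_specificity(labels, flag_everything))  # (1.0, 0.0)
```

Every real criminal is caught, and so is every innocent sender, which is the whole complaint.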
Some more from TFA:
OMG! This is the pattern of emails in my company! My whole company is a giant terrorist organization! I had no idea!
But here's the kicker...again with the quoting:
So let me get this straight...if criminals are okay with their criminal activity (like...say...terrorists), they'll 'slip under the radar'??? Great test, Skillicorn...sounds a lot like a standard polygraph test, which experienced criminals can fool at will, while innocent people fail them 50% of the time. That's what the War on Terror really needs...another inaccurate 'test' that does nothing but throw false positives.
I'm just glad that this method is so obviously stupid that it will never be implemented by our government...
Oh, wait...one more from TFA:
Crap.
____
~ |rip/\/\aster /\/\onkey
This may work well for English, etc., but may not work with other languages.
The emails you send would be encrypted instead of plaintext.
Real criminals aren't dumb, only the bad ones who get caught are.
There are no atheists when recovering from tape backup.
This line in the lead jumped out at me: We have an address, "techsupport@internaldomain", which matches this pattern to a T.
--MarkusQ
P.S. Back when we were on MS-Windows, it would have been OK, because the people asking for TechSupport were often sending each other worms at the same time.
Pattern recognition has been around a long time - from analyzing the causes of infection to finding likely cheats on expense reports (and the latter uses the frequencies of certain digits, rather than looking for the text entries).
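The expense-report trick is usually a first-digit test in the spirit of Benford's law, where a leading digit d is expected with frequency log10(1 + 1/d). Assuming that is what is meant here, a rough sketch looks like this. The function name and the sample amounts are invented for illustration.

```python
import math
from collections import Counter

def first_digit_deviation(amounts):
    """Return {digit: observed share - Benford expected share} for the
    leading digits 1..9 of the non-zero amounts."""
    digits = [int(str(abs(a)).lstrip("0.")[0]) for a in amounts if a]
    if not digits:
        raise ValueError("need at least one non-zero amount")
    counts = Counter(digits)
    total = len(digits)
    return {
        d: counts[d] / total - math.log10(1 + 1 / d)
        for d in range(1, 10)
    }

# Hypothetical claims that cluster just under a 50-dollar approval limit.
claims = [49.5, 48.0, 47.25, 49.99, 45.0, 12.3, 130.0, 48.75]
deviation = first_digit_deviation(claims)
print(max(deviation, key=deviation.get))  # digit with the largest excess
```

A sample this small proves nothing on its own; the point is only that the check looks at digits, not at the text of the claim.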
I do disagree with his statement about this not being useful for fighting spam. Recognizing patterns in spam is already in use: the same words, or many occurrences of the same words, sent by one person to multiple users at the same site and at different sites, or the same basic message sent to multiple users from different mailers or return addresses, might be a good indicator of spam. The challenge is how you monitor all the traffic.
I'm a consultant - I convert gibberish into cash-flow.
This will be a total BOMB. Honestly, this is not a new field of science at all. Letters and writing have been examined for years, and criminals writing e-mails will be writing the same things they always write.
The only things certain in war are Propaganda and Death. You can never be sure which is which though
Ah, my alma mater Queen's makes it onto Slashdot!
I don't know if using the Enron e-mails as his test material is such a good idea. Corporate malfeasance is probably not conducted the same way that every other criminal (or terrorist) network runs. At least their communication might be different due not to a "lack of guilt" but due to the fact that it's probably so easy to make a naughty memo sound like an innocent one without being obvious. After all these memos would be mixed in with a lot of legitimate company business the conspirators are also conducting.
How does automated analysis separate a memo saying "I think we should go ahead and promote Price out of the mailroom" - which means "Have Price-Waterhouse cook those spreadsheets I sent you", from one which just leads to some dude getting promoted out of the mailroom? Of course if they are not bothering to use code words then the system might work very well.
A related trick, he says, is to examine patterns in who e-mails whom. As an example, in criminal networks it is common to find several people communicating regularly with the same person, but never with each other. This is meant to ensure that if one lawbreaker is caught, he or she is unlikely to lead authorities to too many others. But it can also be a clue to suspicious activity.
Traffic analysis is probably more promising, since you can reconstruct relationships between players with it. The traffic pattern could look like a terrorist cell, or it could look like a bunch of guys who know each other - as he says, there's a difference. But this is old news, though automating it would make snoops' lives easier.
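To make the "several people e-mail one person but never each other" pattern concrete, here is a small sketch of how it might be checked on a list of (sender, recipient) pairs. The function name, the `min_spokes` threshold, and the sample addresses are all hypothetical; the article does not say how the researcher actually does it.

```python
from collections import defaultdict
from itertools import combinations

def find_hubs(messages, min_spokes=3):
    """Return people with at least min_spokes contacts where no two of
    those contacts ever exchanged mail with each other."""
    contacts = defaultdict(set)
    for sender, recipient in messages:
        contacts[sender].add(recipient)
        contacts[recipient].add(sender)

    hubs = []
    for person, spokes in contacts.items():
        if len(spokes) < min_spokes:
            continue
        # A hub's contacts must be pairwise unconnected.
        if all(b not in contacts[a] for a, b in combinations(spokes, 2)):
            hubs.append(person)
    return sorted(hubs)

# Hypothetical traffic: three users mail a help desk, three friends mail each other.
messages = [
    ("alice", "techsupport"), ("bob", "techsupport"), ("carol", "techsupport"),
    ("dave", "erin"), ("erin", "frank"), ("frank", "dave"),
]
print(find_hubs(messages))  # ['techsupport']
```

Note that the only "cell leader" found in this toy data is a help desk address, so the pattern alone cannot tell a cell from an ordinary hub.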
At any rate I find this line of inquiry disturbing for civil rights reasons, but I don't believe we should attack the researcher for working on it. Academic freedom is a very useful concept and ultimately does us more good than harm, IMO.
Freedom: "I won't!"
I am sure this will prove to be as productive as searching eBay images for hidden Al-Qaeda messages.
This is not the sig you are looking for...
That should keep me safe for a few days.
--
Registered .sig quotient : 1337
Personally, I can't see how this would ever work. It is typical of the attitude that "all terrorists are bad, they are all the same and we just have to deal with them all in the same way".
Isn't it obvious that different terrorists will have different styles, different levels of literacy, different levels of security awareness, different languages, different aims, different approaches - the list goes on and on. Normal emails all have these traits too. I can't imagine there is any way of applying Bayesian filtering to help with this task.
I'm going to go out on a limb here and say that Al Qaeda probably uses GPG or some other form of strong encryption in their e-mails.
The Yasashii Syndicate ||
He's just using statistics to detect emails that are "different". So, anyone who isn't conforming is flagged up. Organising an anti-war protest? There you are, flagged. Say goodbye to freedom, if you hadn't already. Or encrypt all your emails, and try and persuade everyone you know to. Maybe we can make encryption widespread enough these things are useless.
I am trolling
...or to examine patterns that might indicate criminal activity - like several people e-mailing one person but not each other, which is how some criminal networks operate.
Not to mention most social networks. Or is everyone you know equally popular?
This reminds me of a Perl module, Text::Gender or something, which I tried out in a few experiments last year. It is supposed to analyse writing and determine whether its author is female or male.
It works rather well provided that the author is also American, white and middle class. Any samples outside that field and it fails spectacularly, actually getting more wrong than right (worse than chance).
These sorts of ideas are cute in their ambitions but not science of any kind at all. The tests given in the email analysis article are even more woolly still. It sort of annoys me as a scientist that standards have sunk so low and funding is available for harebrained capers like the one described in the article.
Dr Skillicorn has obviously never done any work with or for a law enforcement or intelligence agency. After spending three years in this area working on data mining of electronic communication, I can say this fella has not done his research properly. He has failed to note that grammatical and spelling mistakes, let alone "missing" words, have become so frequent in the SMS TXT generation that they will cause a major problem when scanning messages on this scale. I really can't be bothered to pick any more holes in this because it is time for a bacon and ketchup sandwich.
Every time I hear one of these stories about how they can catch criminals from their email messages, I'm like, "OMG! They made a fast factoring algorithm!" But then I read the article and discover it only works for unencrypted messages. Gee.
Just remember a not-so-old story in which the presence of e-mail encryption software was reportedly considered evidence in some child porn case.
First they start using some very un-smart word-scanning piece of crap filtering system [and god help you if you write foreign language letters, or have a different style than the average], then they will punish the use of mail signing and encryption software [which is something I regularly do], then if the filtering still has a false positive rate above 99% they will ban e-mailing. Then they will find out other forms of efficient communication exist.
I am putting myself to the fullest possible use, which is all I can think that any conscious entity can ever hope to do.
Everyone knows that you just have to check the evil bit. (Some terrorists may be sophisticated enough to tamper with the evil bit but if they use Windows, the lack of the bit will stick out like a sore thumb.)
One line blog. I hear that they're called Twitters now.
So it is no longer good enough to just have the ability to subpoena your records if you're arrested? Now the government wants to actively sort/monitor the emails of an entire nation. Hmm, I smell more violations of the rights of the people. How much more of this are we willing to accept? How much longer until dissidents start a revolution? That's right, I said it: a revolution. This sounds like a combo of search/packet sniffing software. Last I heard, PGP and RSA encryption were still unbreakable. This will NOT be effective for the worst thieves or terrorists.
Graduate students, take notice. This research is a wonderful example of ... going where the wind is blowing; that gives you media coverage and funding from people who know even less than you ... not doing your background research; doing your background research would just discourage you, and it takes time that isn't required for convincing people who know less than you that your sexy proposal is worth funding
So if you don't talk about things which a terrorist would talk about, you are a terrorist?
like several people e-mailing one person but not each other, which is how some criminal networks operate.
Yes, it's also how every other nuclear network of friends operates. Not all my friends know each other. Not all of a bank's customers know each other, and not all of a mailing list's users know each other.
My 3D Texturing Skinning work (under construction)
-Tacitus
Government is already too invasive. I'm already forced to seek a building permit before I can erect a structure on my own property. The fines for ignoring this, (and say, having the gall to build a solar powered house which is not connected to the AC power grid, or (horrors!) a straw-bale house), are huge and the government's reasons for these laws are utterly ridiculous.
Any professor who suggests that we should be looking to monitor email content is not thinking clearly. The Government already has their nose in everything, and telling us that, "It's For Our Own Good," is NOT a valid excuse.
It's MUCH more important that people be able to make mistakes -and even die through their own faults- than live ensnared in the safe-keeping of a bunch of ignorant civil servants who are trying to build a Starfleet future where everybody dresses the same, and nobody is allowed to think or act outside a bunch of pre-set 'safe' boundaries designed for middle-class suburbanites who exist in eternal ignorance of the real world, who actually believe in the Discovery Channel, who drink milk, and live in absolute terror of anything you can't experience beyond the confines of a nice, respectable department store.
-FL
Letter from College:
Hi Mom,
I blew it and bombed the final exam. The physics
prof put the gun on my head and told me to work harder.
I could kill him. I feel like having a knife
at my throat. The anger feels like poison in my
blood but I know it is my fault and the all is
blamed to that virus, I had been laboring with
for quite a while. I'm working on it mom! I promise
to make you proud. I can not wait to be on the subway
home to work on my final project on weapons of
mass destruction in my political science class. It's
mental terror.
Love
Your son
P.S. The powder you sent me works well for my
skin infection. Strong agent.
This email doesn't contain the words r0lexx, v!/\gr4 or c14ll4s. It sticks out like a sore thumb from 99% of the email traffic we've intercepted, he must be up to no good!!!!
*Splort*
Statistical analysis of word (token) frequency works great in a closed domain set, such as the Enron corpus. But once you scale up to the ISP level it falls down horribly.
Why? The size of the token database increases massively, to the point where it becomes unmaintainable. Every spelling mistake and word variant, not to mention foreign language, gets included. Eventually you are unable to separate the wood from the trees, let alone make statistically significant assertions about a single message.
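A toy illustration of that blow-up, with made-up messages and a deliberately naive tokenizer (lowercase, split on whitespace). Real systems tokenize more carefully, but the effect is the same in kind.

```python
from collections import Counter

def token_counts(messages):
    """Count lowercase, whitespace-separated tokens across all messages."""
    counts = Counter()
    for text in messages:
        counts.update(text.lower().split())
    return counts

# Four hypothetical messages that all say the same thing.
messages = [
    "Meet me tomorrow",
    "meet me tommorow",
    "meat me 2morrow",
    "Meet me tomorow",
]
counts = token_counts(messages)
seen_once = [tok for tok, n in counts.items() if n == 1]

print(len(counts))     # 7 distinct tokens for one three-word sentence
print(len(seen_once))  # 5 of them occur only once
```

Tokens that occur once carry almost no statistical weight, and at ISP scale they are most of the database.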
And let's not mention the fact that all the work on detecting deception in correspondence has been done on English language text. Those pesky al-Qaeda types tend to speak Arabic. So before you can even begin to detect dodgy emails written by al-Qaeda, you need to construct a written Arabic parser. Then you need access to a large corpus of Arabic emails (if you have one I'd be very interested too). Then you need to research the lexical rules that tend to signify deceptive Arabic.
It's an interesting problem, but not even trained and experienced intelligence operatives are able to routinely detect deceptive correspondence, so coding that algorithm is quite tricky.
This is a good place to start
http://doi.ieeecomputersociety.org/10.1109/HICSS.
How many criminals are going to send plain text emails discussing criminal activities?
This is clearly just designed to appeal to the government of Police State America, probably to get more funding.
This whole obsession with 'terrorists' is just becoming tiring. There are very few 'terrorists' in the world that the Americans didn't create through their own acts of terror. If America would stop its interference in the affairs of other countries, there would probably be almost none at all outside of the White House.
- Many languages are conjunctive/agglutinating in nature (e.g. Turkish, Finnish, Swahili). This means that the words of a sentence aren't isolated (as in most European languages) but are in fact formed from 'parts' that change depending on the surrounding words. Moreover, modifying prefixes and suffixes are used as inflections, e.g. for verb paradigms. This results in languages that effectively have billions, or even an infinite number, of possible "words". It is impossible to do keyword-based analysis on such languages without a full morphological parser for each language to break a word into its 'parts', and such a parser is a massive task.
- Chinese is the opposite: it is a totally "isolating" language, meaning each word is distinct with no inflections, and because different characters are used for different words there are NO SPACES between words. So you cannot begin to analyse Chinese data at all unless you have a full "Chinese segmenter" to locate word boundaries.
The need for further disambiguation complicates all of this analysis even more.
There is pretty much no way for this type of analysis to be really accurate under the current level of written language analysis technologies.
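Both points can be seen with nothing more than a whitespace tokenizer and an exact keyword match. This is only a sketch: the sample words and sentence are chosen for illustration, and no real morphological parser or segmenter is used.

```python
def whitespace_tokens(text):
    """Naive tokenizer: split on whitespace only."""
    return text.split()

def keyword_hits(keyword, tokens):
    """Count tokens that match the keyword exactly."""
    return sum(1 for tok in tokens if tok == keyword)

# Finnish: "talo" (house) with case and possessive endings attached.
finnish = whitespace_tokens("talo talossa taloissani")
print(keyword_hits("talo", finnish))  # 1 of the 3 forms is found

# Chinese: roughly "the meeting is held at noon", written without spaces.
chinese = whitespace_tokens("会议在中午举行")
print(len(chinese))                   # 1 token: the whole sentence
print(keyword_hits("中午", chinese))  # 0, although "noon" is in the sentence
```

Exact matching misses the inflected Finnish forms, and the unsegmented Chinese sentence never yields a token to match against at all.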
terrorist bomb al qaeda bin laden firebomb death destruction chaos terror plane WMD nuclear weapons
Yes, the people who are "up to something" actually write differently. Most of the time they use phrases like "validate your bank account", "please verify your credit card information", etc.
--- Eat my sig.
Then all that will be left is futile, self-destructive petty rebellion.
Get your teeth into a small slice: the cake of liberty