The Evil in E-Mail
Frenchy in Ontario writes "An Ontario university researcher is devising ways to help law enforcement agencies better pinpoint likely criminal behavior in e-mails. His theory is that people who are "up to something" write differently from people who aren't. One approach looks for senders who avoid words that could be flagged for possible criminal context (like "bombed"); another examines patterns that might indicate criminal activity, such as several people e-mailing one person but not each other, which is how some criminal networks operate. There's also an interesting paragraph on why Enron's emails aren't as valuable as you might think for this sort of work."
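For the curious, the "several people e-mailing one person but not each other" pattern from the summary can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the researcher's method: the function name `find_hubs`, the `min_senders` threshold and the sample traffic are all made up here.

```python
from collections import defaultdict

def find_hubs(messages, min_senders=3):
    """Return {recipient: senders} for recipients who get mail from at least
    min_senders people, none of whom mail each other directly.
    (Hypothetical helper; the threshold is an arbitrary assumption.)"""
    senders_to = defaultdict(set)   # recipient -> set of people who mailed them
    edges = set()                   # every (sender, recipient) pair seen
    for sender, recipient in messages:
        senders_to[recipient].add(sender)
        edges.add((sender, recipient))

    hubs = {}
    for recipient, senders in senders_to.items():
        senders = senders - {recipient}
        if len(senders) < min_senders:
            continue
        # Any direct message between two of the senders breaks the pattern.
        talk_to_each_other = any(
            (a, b) in edges for a in senders for b in senders if a != b
        )
        if not talk_to_each_other:
            hubs[recipient] = sorted(senders)
    return hubs

# Made-up traffic: three people write only to "hub"; the people writing
# to "eve" also write to each other, so "eve" is not flagged.
messages = [
    ("ann", "hub"), ("bob", "hub"), ("cat", "hub"),
    ("dan", "eve"), ("fay", "eve"), ("gus", "eve"), ("dan", "fay"),
]
print(find_hubs(messages))
```

Note that any mailing list, help desk or newsletter reply address would match the same shape, so a hit says nothing by itself.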
I especially liked the part about:
Another, Skillicorn says, is that research shows
people speak and write differently when they feel guilt about a
subject, for instance using fewer first-person pronouns, like I and we.
Because people always use first-person pronouns in messages. That's just what's done. And a lot of them should be used.
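The pronoun measure in the quote is easy to sketch. The code below is a guess at the simplest possible version, not anything from the article: the word list `FIRST_PERSON` and the function `first_person_rate` are assumptions, and the article gives no threshold for what counts as "fewer".

```python
import re

# Assumed list of English first-person pronouns (not taken from the article).
FIRST_PERSON = {"i", "me", "my", "mine", "myself",
                "we", "us", "our", "ours", "ourselves"}

def first_person_rate(text):
    """Fraction of words in text that are first-person pronouns."""
    # Splitting on anything that is not a letter turns "I'm" into "i" and "m",
    # so contractions still count, at the cost of a slightly larger word count.
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in FIRST_PERSON)
    return hits / len(words)

print(first_person_rate("I think we should meet"))        # 2 of 5 words
print(first_person_rate("The meeting happens tomorrow"))  # no pronouns at all
```

A rate like this only means something relative to a baseline for the same writer or the same kind of message, which is exactly the part that is hard to get.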
Sounds like a way to track messages with "substance" rather than the "hai h u r? heer are the pictures of my vacation." messages.
Think about that. This man has just come up with a way to measure the relative interest of what the sender has to say to people in the government.
Yet another way to cut down on the messages that the government has to read and be bored with. Yet another way to enable the government to read our communications more effectively.
Yet another reason to look into using real encryption.
The previous has been a secret message to my comrades.
So very, very true. I'd support the guy just because he's a fellow Ontarian, but there is nothing in this article of any substance or worth, and it sounds like a giant heap of grant-sucking bullshit. I think the "researcher" caught the season premiere of "Numbers", one in which they caught the criminal based on exclusion of activity (e.g. he committed crimes in the area around his stomping grounds, excluding where he lived and worked), and thought he could rationalize some nonsense about email analysis.
This reminds me of a Perl module, Text::Gender or something, which I tried out in a few experiments last year. It is supposed to analyse writing and determine whether its author is female or male.
It works rather well, provided the author is also American, white and middle class. On any samples outside that field it fails spectacularly, actually getting more wrong than right (worse than chance).
These sorts of ideas are cute in their ambitions but not science of any kind at all. The tests given in the email analysis article are even more woolly still. It sort of annoys me as a scientist that standards have sunk so low and funding is available for harebrained capers like the one described in the article.
Just remember a not-so-old story in which the mere presence of e-mail encryption software was reportedly considered evidence in a child porn case.
First they start using some very un-smart word-scanning piece of crap filtering system [and god help you if you write foreign language letters, or have a different style than the average], then they will punish the use of mail signing and encryption software [which is something I regularly do], then if the filtering still has a false positive rate above 99% they will ban e-mailing. Then they will find out other forms of efficient communication exist.
I am putting myself to the fullest possible use, which is all I can think that any conscious entity can ever hope to do.
That may not work either. There's that fine Polish s-f novel "Paradyzja" by Janusz Zajdel about a closed society in a space colony. The population was under constant surveillance and anyone questioning the government was immediately punished. Due to the amount of gathered data, the government had to use automatic systems to find such people. So what the unhappy residents did was develop a language based on metaphors and associations. To automated systems it looked like spoken poetry, while an intelligent listener easily got the point.
It was written during the Cold War and of course referred to the socialist governments of the time, but I see new parallels now.
-Tacitus
Government is already too invasive. I'm already forced to seek a building permit before I can erect a structure on my own property. The fines for ignoring this (and, say, having the gall to build a solar powered house which is not connected to the AC power grid, or (horrors!) a straw-bale house) are huge, and the government's reasons for these laws are utterly ridiculous.
Any professor who suggests that we should be looking to monitor email content is not thinking clearly. The Government already has their nose in everything, and telling us that, "It's For Our Own Good," is NOT a valid excuse.
It's MUCH more important that people be able to make mistakes -and even die through their own faults- than live ensnared in the safe-keeping of a bunch of ignorant civil servants who are trying to build a Starfleet future where everybody dresses the same, and nobody is allowed to think or act outside a bunch of pre-set 'safe' boundaries designed for middle-class suburbanites who exist in eternal ignorance of the real world, who actually believe in the Discovery Channel, who drink milk, and live in absolute terror of anything you can't experience beyond the confines of a nice, respectable department store.
-FL
Statistical analysis of word (token) frequency works great in a closed domain set, such as the Enron corpus. But once you scale up to the ISP level it falls down horribly.
Why? The size of the token database increases massively, to the point where it becomes unmaintainable. Every spelling mistake and word variant, not to mention foreign language, gets included. Eventually you are unable to see the wood for the trees, let alone make statistically significant assertions about a single message.
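To put a toy number on that, here is a minimal sketch of how misspellings inflate a token table. The sample messages and the `vocabulary` helper are invented for illustration, and splitting on whitespace is the crudest tokenizer there is.

```python
from collections import Counter

def vocabulary(messages):
    """Count every distinct lowercase, whitespace-separated token."""
    counts = Counter()
    for message in messages:
        counts.update(message.lower().split())
    return counts

# Made-up messages: the same two sentences, then two sloppy variants.
clean = ["the shipment arrives tomorrow", "the shipment arrives friday"]
noisy = clean + ["teh shipmnt arives tomorow", "the shippment arrivs 2morrow"]

print(len(vocabulary(clean)))   # distinct tokens in the clean messages
print(len(vocabulary(noisy)))   # distinct tokens once the variants are added

# Tokens seen exactly once carry almost no statistical weight.
singletons = [t for t, n in vocabulary(noisy).items() if n == 1]
print(len(singletons))
```

Two extra messages more than double the table here, and nearly all of the new entries are one-off tokens. Scale that to an ISP's worth of mail and the table is mostly noise.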
And let's not mention the fact that all the work on detecting deception in correspondence has been done on English language text. Those pesky al-Qaeda types tend to speak Arabic. So before you can even begin to detect dodgy emails written by al-Qaeda, you need to construct a written Arabic parser. Then you need access to a large corpus of Arabic emails (if you have one I'd be very interested too). Then you need to research the lexical rules that tend to signify deceptive Arabic.
It's an interesting problem, but not even trained and experienced intelligence operatives are able to routinely detect deceptive correspondence, so coding that algorithm is quite tricky.
This is a good place to start
http://doi.ieeecomputersociety.org/10.1109/HICSS.
- Many languages are conjunctive/agglutinating in nature (e.g. Turkish, Finnish, Swahili). This means that the words of a sentence aren't isolated units (as in most European languages) but are in fact formed from 'parts' that change depending on the surrounding words. Moreover, modifying prefixes and suffixes are used as inflections, e.g. for verb paradigms. The result is languages that effectively have billions or even an infinite number of possible "words". It is impossible to do keyword-based analysis on such languages without a full morphological parser for each language to break a word into its 'parts', and writing such a parser is a massive task.
- Chinese is the opposite: it is totally "isolating", meaning each word is distinct with no inflections, and because different characters are used for different words there are NO SPACES between words. So you cannot begin to analyse Chinese data at all unless you have a full "Chinese segmenter" to locate word boundaries.
The need for further disambiguation complicates all of this analysis even more.
There is pretty much no way for this type of analysis to be really accurate under the current level of written language analysis technologies.
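As a rough illustration of the Chinese segmentation point, here is a sketch of greedy forward maximum matching, one of the simplest dictionary-based approaches. The three-entry dictionary and the `max_match` function are toy assumptions; a real segmenter needs a large lexicon and has to deal with ambiguous splits, which this does not.

```python
def max_match(text, dictionary, max_len=4):
    """Greedy forward maximum matching: at each position take the longest
    dictionary entry that fits, falling back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens

# Toy dictionary (an assumption for this example).
dictionary = {"我", "喜欢", "北京"}
sentence = "我喜欢北京"   # "I like Beijing"

print(sentence.split())                 # whitespace splitting finds one "word"
print(max_match(sentence, dictionary))  # dictionary matching finds three
```

Everything hinges on the dictionary: any name, slang term or code word missing from it falls apart into single characters, and those are precisely the tokens this kind of analysis would care about.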
Then all that will be left is futile, self-destructive petty rebellion.
Get your teeth into a small slice: the cake of liberty