The Evil in E-Mail
Frenchy in Ontario writes "An Ontario university researcher is devising ways to help law enforcement agencies better pinpoint likely criminal behavior in e-mails. His theory is that people who are "up to something" write differently from people who aren't, for instance by avoiding words that could be flagged for possible criminal context (like "bombed"). He also examines patterns that might indicate criminal activity, like several people e-mailing one person but not each other, which is how some criminal networks operate. There's also an interesting paragraph on why Enron's emails aren't as valuable as you might think for this sort of work."
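To make the "several people e-mailing one person but not each other" pattern concrete, here is a minimal sketch of how it could be flagged. It assumes the mail log has already been reduced to (sender, recipient) pairs; the function name find_hubs, the min_senders threshold and the sample data are all made up for illustration and are not taken from the researcher's work.

```python
from collections import defaultdict

def find_hubs(messages, min_senders=3):
    """Return {hub: senders} for recipients written to by at least
    min_senders people who have no contact with one another."""
    contacts = defaultdict(set)  # undirected: who has exchanged mail with whom
    inbound = defaultdict(set)   # directed: who has written to this recipient
    for sender, recipient in messages:
        contacts[sender].add(recipient)
        contacts[recipient].add(sender)
        inbound[recipient].add(sender)

    hubs = {}
    for hub, senders in inbound.items():
        # Keep senders who have never exchanged mail with any other sender.
        isolated = {s for s in senders if not (contacts[s] & (senders - {s}))}
        if len(isolated) >= min_senders:
            hubs[hub] = isolated
    return hubs

# Hypothetical log: three people write only to "hub"; the others chat freely.
messages = [
    ("alice", "hub"), ("bob", "hub"), ("carol", "hub"),
    ("dave", "erin"), ("erin", "dave"), ("dave", "frank"), ("erin", "frank"),
]
print(find_hubs(messages))
```

On this toy log only "hub" is reported. A plain mailing list or a help desk address has the same shape, so the pattern alone says nothing about intent.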
I especially liked the part about:
Another, Skillicorn says, is that research shows
people speak and write differently when they feel guilt about a
subject, for instance using fewer first-person pronouns, like I and we.
Because people always use first person pronouns in messages. That's just what's done. And a lot of them should be used.
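For what it's worth, the measure in that quote is trivial to compute, which is part of why it is so easy to game. A rough sketch, assuming English text and a pronoun list of my own choosing (the article only names "I" and "we"); first_person_rate is a hypothetical helper, not anything from the research:

```python
import re

# Assumed word list; only "I" and "we" are named in the article.
FIRST_PERSON = {"i", "me", "my", "mine", "myself",
                "we", "us", "our", "ours", "ourselves"}

def first_person_rate(text):
    """Share of word tokens that are first-person pronouns."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    # Strip contractions so that "I'm" and "we'll" count as "i" and "we".
    hits = sum(1 for w in words if w.split("'")[0] in FIRST_PERSON)
    return hits / len(words)

print(first_person_rate("I think we should meet"))
print(first_person_rate("The package was sent yesterday"))
```

A single message gives a very noisy number; anything meaningful would need a per-author baseline over many messages.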
Sounds like a way to track messages with "substance" rather than the "hai h u r? heer are the pictures of my vacation." messages.
Think about that. This man has just come up with a way to measure the relative interest of what the sender has to say to people in the government.
Yet another way to cut down on the messages that the government has to read and be bored with. Yet another way to enable the government to read our communications more effectively.
Yet another reason to look into using real encryption.
The previous has been a secret message to my comrades.
So very, very true. I'd support the guy just because he's a fellow Ontarian, but there is nothing in this article of any substance or worth, and it sounds like a giant heap of grant-sucking bullshit. I think the "researcher" caught the season premiere of "Numbers", one in which they caught the criminal based on exclusion of activity (e.g. he committed crimes in the area around his stomping grounds, excluding where he lived and worked), and thought he could rationalize some nonsense about email analysis.
This reminds me of a Perl module Text::Gender
or something which I tried out in a few experiments last year. It is supposed to analyse writing and determine whether its author is female or male.
It works rather well provided that the author is also American, white and middle class. Give it any samples outside that field and it fails spectacularly, actually getting more wrong than right (worse than chance).
These sorts of ideas are cute in their ambitions
but not science of any kind at all. The tests given in the email analysis article are even more woolly still. It sort of annoys me as a scientist that standards have sunk so low and funding is available for harebrained capers like the one described in the article.
Just remember the not-so-old story reporting that the mere presence of e-mail encryption software was considered evidence in a child porn case.
First they start using some very un-smart word-scanning piece of crap filtering system [and god help you if you write foreign language letters, or have a different style than the average], then they will punish the use of mail signing and encryption software [which is something I regularly do], then if the filtering still has a false positive rate above 99% they will ban e-mailing. Then they will find out other forms of efficient communication exist.
I am putting myself to the fullest possible use, which is all I can think that any conscious entity can ever hope to do.
That may not work either. There's that fine Polish s-f novel "Paradyzja" by Janusz Zajdel about a closed society in a space colony. The population was under constant surveillance and anyone questioning the government was immediately punished. Due to the amount of gathered data, the government had to use automatic systems to find such people. So what the unhappy residents did was to develop a language based on metaphors and associations. To the automated systems it looked like spoken poetry, while an intelligent listener easily got the point.
It was written during the Cold War and of course referred to the socialist governments of the time, but I see new parallels now.
As someone that has developed commercial systems that do latent semantic processing (and other sorts of text analysis), I'm soooooooo glad that you can tell if it works by a single article written for the layman.
I know when I describe the ones I work on to even academics, I rarely explain in depth except in sound biteable concepts because unless you are willing to invest a few hours reading articles and white papers, it's going to sound like bullshit.
But all in all, I'm glad your psychological training and years in computer science have made it possible for you to discern if a technology like this can work or not from a few oversimplified statements. Maybe I can get you to work on my next grant because I am soooooo sick of those fucking validation studies -- you know, the ones that sort out the false positives far before it ever gets to a state that can be used in a real life setting (Human Subjects is soooo f'n anal about not allowing university property to be used in a way that might come back and bite them in the ass...your instant validation would make them so much more amenable...trust me, we have no problem: TripMasterMonkey has given it the seal of approval and he even RTFA!!!).
As an aside, a friend who was developing a product to compete with my organization's did a quick study on analyzing personality defects and the like. During the final validation, it was able to analyze writings by several well-known psychopaths. There were not a lot of false positives, again because of the nature of this industry, but it did pick up 90% of the papers by these known psychopaths (most were obtained only after the initial validity had been proven, because the families and other administrators of this data are overly protective, not wanting their sons used as the sole basis of an expert system describing crazies). None of the final scoring rested on a single paper; it was the end result of a dozen or more. I don't know what the lower limit on number of texts / time span was in this case. But all in all, the application was never used for this purpose because there was no way they could use it without taking on unacceptable liability -- even though it was always said to be a diagnostic tool and not an acceptable replacement for a trained person's analysis of these texts.
Posted as an AC because I'd hope even an anonymous source that has worked in this field is more informed than an idiot with the name TripMasterMonkey that seems to be 'informed' by reading a single article.
-Tacitus
Government is already too invasive. I'm already forced to seek a building permit before I can erect a structure on my own property. The fines for ignoring this, (and say, having the gall to build a solar powered house which is not connected to the AC power grid, or (horrors!) a straw-bale house), are huge and the government's reasons for these laws are utterly ridiculous.
Any professor who suggests that we should be looking to monitor email content is not thinking clearly. The Government already has their nose in everything, and telling us that, "It's For Our Own Good," is NOT a valid excuse.
It's MUCH more important that people be able to make mistakes -and even die through their own faults- than live ensnared in the safe-keeping of a bunch of ignorant civil servants who are trying to build a Starfleet future where everybody dresses the same, and nobody is allowed to think or act outside a bunch of pre-set 'safe' boundaries designed for middle-class suburbanites who exist in eternal ignorance of the real world, who actually believe in the Discovery Channel, who drink milk, and live in absolute terror of anything you can't experience beyond the confines of a nice, respectable department store.
-FL
Statistical analysis of word (token) frequency works great in a closed domain set, such as the Enron corpus. But once you scale up to the ISP level it falls down horribly.
Why? The size of the token database increases massively, to the point where it becomes unmaintainable. Every spelling mistake and word variant, not to mention every foreign language, gets included. Eventually you are unable to separate the wood from the trees, let alone make statistically significant assertions about a single message.
And let's not mention the fact that all the work on detecting deception in correspondence has been done on English-language text. Those pesky al-Qaeda types tend to speak Arabic. So before you can even begin to detect dodgy emails written by al-Qaeda, you need to construct a written Arabic parser. Then you need access to a large corpus of Arabic emails (if you have one I'd be very interested too). Then you need to research the lexical rules that tend to signify deceptive Arabic.
It's an interesting problem, but not even trained and experienced intelligence operatives are able to routinely detect deceptive correspondence, so coding that algorithm is quite tricky.
This is a good place to start:
http://doi.ieeecomputersociety.org/10.1109/HICSS.2004.1265082
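The token-table blow-up described above shows up even at toy scale. A minimal sketch, assuming naive lowercase word tokenization; token_counts, singleton_share and the four sample messages are invented for illustration:

```python
import re
from collections import Counter

def token_counts(messages):
    """Count lowercase word tokens across all messages."""
    counts = Counter()
    for text in messages:
        counts.update(re.findall(r"\w+", text.lower()))
    return counts

def singleton_share(counts):
    """Fraction of distinct tokens that were seen exactly once."""
    if not counts:
        return 0.0
    return sum(1 for c in counts.values() if c == 1) / len(counts)

# Hypothetical data: a tidy closed-domain corpus, then the same corpus
# plus one misspelled message and one message in another language.
clean = ["the meeting is tomorrow", "the meeting is cancelled"]
noisy = clean + ["teh meating is tommorow", "la réunion est demain"]

for name, corpus in (("clean", clean), ("noisy", noisy)):
    counts = token_counts(corpus)
    print(name, len(counts), singleton_share(counts))
```

Two extra messages take the table from 5 distinct tokens to 12, and the share of tokens seen only once goes from 0.4 to 0.75. Tokens seen once carry no usable frequency statistics, and at ISP scale that tail is most of the table.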
- Many languages are conjunctive/agglutinating in nature (e.g. Turkish, Finnish, Swahili). This means that the words of a sentence aren't isolated units (as in most European languages) but are in fact formed from 'parts' that change depending on the surrounding words. Moreover, modifying prefixes and suffixes are used as inflections, e.g. for verb paradigms. This results in languages that effectively have billions or even an infinite number of possible "words". It is impossible to do keyword-based analysis on such languages without a full morphological parser for each language to break a word into its 'parts', and such a parser is a massive task.
- Chinese is the opposite: it is a totally "isolating" language, meaning each word is distinct with no inflections, and because different characters are used for different words there are NO SPACES between words. So you cannot begin to analyse Chinese data at all unless you have a full "Chinese segmenter" to locate word boundaries.
The need to do further disambiguation further complicates all of this analysis.
There is pretty much no way for this type of analysis to be really accurate under the current level of written language analysis technologies.
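To give a feel for the segmentation problem, here is a sketch of about the simplest segmenter there is, greedy forward maximum matching against a word list. The tiny lexicons are made up for illustration, and this is a textbook baseline, not what a production segmenter does:

```python
def max_match(text, lexicon, max_len=4):
    """Greedy forward maximum matching: at each position take the longest
    lexicon entry, falling back to a single character if nothing matches."""
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words

# Toy lexicon (assumed): "we", "like", "electronic", "mail", "e-mail".
lexicon = {"我们", "喜欢", "电子", "邮件", "电子邮件"}
print(max_match("我们喜欢电子邮件", lexicon))
# → ['我们', '喜欢', '电子邮件']

# Greedy matching goes wrong as soon as entries overlap.
ambiguous = {"研究", "研究生", "生命", "起源"}
print(max_match("研究生命起源", ambiguous))
# → ['研究生', '命', '起源']
```

The second call shows the disambiguation problem: the intended reading is 研究 / 生命 / 起源 ("study the origin of life"), but the greedy pass grabs 研究生 ("graduate student") first and strands 命. Any keyword list applied after that sees the wrong words.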
> Which is why America developed our "due process"
A French development of an idea that is traced to a doctrine practiced by Egyptian kings, as evidenced by the oldest court records known.
Madison was really late to this buffet. He was carrying forward the English ideal of due process of law, that stems from one of the most important things that distinguishes English law from Roman law: The Magna Carta.
But even the writers of Magna Carta were not the ones who invented the concept. It existed in various forms, under various incarnations of the Roman government -- you can even find the doctrine being applied in the second most famous trial in history (OJ Simpson being the first?)
Then all that will be left is futile, self-destructive petty rebellion.
Get your teeth into a small slice: the cake of liberty