Your 'Clickprint' Gives Away Your Identity Online
Krishna Dagli writes to mention an article at the Guardian site about an increasing interest in the possibility of identifying users by their 'clickprint', or online access habits. The article discusses a new paper on online identification written by two American professors. The piece posits that not only is nailing down individual users by their habits useful for advertisers looking to sell products, it may be possible to use this information to flag stolen identities. From the article: "'Our main finding is that even trivial features in an internet session can distinguish users,' Padmanabhan told the Wharton Review. 'People do seem to have individual browsing behaviors.' The duo found that anywhere from three to 16 sessions are needed to identify an individual's clickprint ... In one example, they found that from just seven aggregated sessions they could distinguish between two different surfers with a confidence of 86.7%. Given 51 sessions, the confidence level rose to 99.4%."
You don't have to worry about this, however, as it is easy to distinguish two different users but probably difficult to pick you out of a crowd. Furthermore, if they're tracking your clicks, they probably already know your IP address. The number of sessions probably rises to a problematic number if you are trying to identify one user out of one thousand. Therefore, this will only be useful in identifying different behavior between two users -- or specifically identifying when it is highly likely that someone who is logged in is significantly different from the click profile associated with that account (as the article states).
There's a lot of discussion about this in the paper, which mentions that the priors are set at 50% for 2 users but at 1% for 100 users (obviously). They go on to say that the method they suggest for detecting a fraudulent user "do not require that users have truly unique profiles."
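To make the prior point concrete, here is a minimal Python sketch. It is not the paper's method: the uniform prior just restates the 50% and 1% figures above, and the likelihood ratio of 10 is a made-up number chosen only for illustration.

```python
def uniform_prior(n_users: int) -> float:
    """Chance that a random guess picks the right user out of n_users."""
    return 1.0 / n_users


def posterior(prior: float, likelihood_ratio: float) -> float:
    """Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio."""
    odds = prior / (1.0 - prior) * likelihood_ratio
    return odds / (1.0 + odds)


if __name__ == "__main__":
    # Hypothetical evidence strength: the session is 10x more likely
    # under "this is user X" than under "this is someone else".
    ASSUMED_LIKELIHOOD_RATIO = 10.0
    for n in (2, 100):
        p = uniform_prior(n)
        print(n, "users: prior", p, "posterior", posterior(p, ASSUMED_LIKELIHOOD_RATIO))
```

With the same evidence, the posterior is much lower when the starting prior is 1% than when it is 50%, which is why the two-user numbers do not carry over to a crowd.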
I read a bit of the paper and I identified Weka's decision tree method being used to classify the users (if you've ever used the ID3 algorithm or its brethren C4.5 in classification, imagine exploring methods of developing different decision trees).
Indeed the paper states: I'll take this opportunity to recommend two open source projects. Torpark for those of you concerned about your identity and also Weka -- the easy to use collection of data mining software in Java! Also something to note is that Weka has recently become part of Pentaho, a project of open source business intelligence products. Explore the valuable tools that are out there and enjoy!
My work here is dung.
How about a program that sits in the background and randomly hits sites while you are browsing?
Technoli
Great! Finally we'll be able to distinguish between the two guys who use the Internets... most of the time.
I'm the guy who can read; I get the "slow down cowboy" message constantly.
But I'm used to living among dyslexics, illiterates, and dumbasses. Sigh.
mcgrew's razor: Never attribute to stupidity that which can be explained by greedy self-interest
In one example, they found that from just seven aggregated sessions they could distinguish between two different surfers with a confidence of 86.7%.
Well, I know I'm one of the websurfers. Who's the other one?
Install AdBlock + NoScript and do not allow cookies unless you need them and you will reduce the chances of someone on the web identifying you significantly.
I haven't read the full paper, but the article makes this sound extremely preliminary as a useful tool. It says they can distinguish between two users with 99% accuracy. That's all well and good when you only need to distinguish between two people, but what about when you need to distinguish between a million people?
I can distinguish between a person with blond hair and a person with brown hair given only the hair color 100% of the time. But that doesn't mean hair color is a very useful tool for positively identifying people. The key is how different people's "click profiles" are. If there are only 1000 different possibilities (evenly distributed), that's not terribly good for identification. If there are 10^10 possible profiles, evenly distributed among the populace, that would certainly be useful. Also, what's the false positive rate? If you try to use this to identify fraud and you have a 1% false positive rate, you'll end up pissing off 1% of your customers. That's probably not acceptable.
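A rough back-of-the-envelope version of that argument in Python, assuming (as above) that profiles are evenly distributed and that users are independent. The one-million population and customer counts are illustrative numbers, not figures from the paper.

```python
import math


def prob_profile_unique(n_profiles: int, n_users: int) -> float:
    """Probability that nobody else among n_users shares one given user's
    profile, with n_profiles equally likely profiles: (1 - 1/K)^(M - 1)."""
    return math.exp((n_users - 1) * math.log1p(-1.0 / n_profiles))


def expected_false_alarms(n_customers: int, false_positive_rate: float) -> float:
    """Expected number of legitimate customers wrongly flagged."""
    return n_customers * false_positive_rate


if __name__ == "__main__":
    population = 1_000_000  # illustrative population size
    print(prob_profile_unique(1_000, population))    # effectively zero
    print(prob_profile_unique(10**10, population))   # very close to one
    print(expected_false_alarms(population, 0.01))   # customers annoyed at a 1% rate
```

With 1000 profiles almost nobody is unique in a population of a million, while with 10^10 profiles nearly everybody is.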
AccountKiller
How about a Firefox extension that, at random time intervals, randomly requests one of the page links? It wouldn't have to even load the page in a tab. That might introduce enough noise to cover a "clickprint." (Implementation is left as an exercise for the reader.)
Even if you run something in the background that submits random search queries or random spidering, the instant you open up a bookmark full of tabs, you've identified yourself.
User 12345: the clickstream consists of completely random clicks on flickr, delicio.us, and Digg links, except that (at least) once a day, someone initiates a series of TCP/IP connections to Slashdot, Digg, Google, two mainstream news sources, three blogs, and a brokerage firm, Delicio.us, and Flickr, all within five seconds of each other.
Back in the old days, that didn't mean anything, because Slashdot didn't have any way of finding out that you were reading Digg, and Doubleclick and its ilk - the only things likely to track your visits to multiple sites - were prevented from doing so by virtue of being firewalled at the router. And No Such Agency would refuse to be involved in such a project.
Today, of course, No Such Agency is involved in such a project :)
This is similar to the SSH exploit reported here on Slashdot a few weeks back where data could be determined via statistical/timing analysis done on the packets sent during an SSH session.
It sounds like if these types of timing and statistical analysis attacks become common, a simple solution would be a firefox extension that would randomize the timing of the input from the mouse and the keyboard. I suspect that randomly delaying a keystroke or a mouse click anywhere between (0-100ms) would be enough to defeat this type of analysis as well as short enough as to not adversely affect the browsing experience.
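A minimal sketch of the delay idea in Python, operating on a list of event timestamps rather than inside a real browser extension. The function name, the millisecond units, and the ordering rule are assumptions made for illustration, and whether 0-100ms of jitter would actually defeat this kind of analysis is untested.

```python
import random


def jitter_timestamps(timestamps_ms, max_delay_ms=100.0, rng=None):
    """Delay each input event by a random 0..max_delay_ms, keeping event order.

    timestamps_ms must be sorted ascending. An event is never released
    before an earlier one, so keystrokes cannot be reordered.
    """
    rng = rng or random.Random()
    released = []
    last = float("-inf")
    for t in timestamps_ms:
        delayed = t + rng.uniform(0.0, max_delay_ms)
        delayed = max(delayed, last)  # preserve the original ordering
        released.append(delayed)
        last = delayed
    return released


if __name__ == "__main__":
    clicks = [0.0, 35.0, 40.0, 500.0, 1200.0]  # hypothetical click times in ms
    print(jitter_timestamps(clicks, rng=random.Random(1)))
```

Because the input is sorted, clamping to the previous release time never pushes an event more than max_delay_ms past its original time.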
Of course, browsing the web through a good anonymous web proxy will probably do a lot more to hide your identity than any type of randomizing of your input strokes... but then, utilizing both methods as well as encryption would make things all the harder for any attacker.
Yahma
I'm sure that recognizing return anonymous users wouldn't be that important to the marketing people behind the scenes.
Isn't this a graduate research paper by two individuals at different Business schools? Hmmmmm.
The odds are low and this is a variable to be tweaked. But the assumption is that you will still visit your old sites and exhibit your behaviors on them. If you found say one new site a week, it would actually slowly be incorporated into your routine (if they used regression properly and allowed the model to train on your data -- old and new). But if you suddenly stopped going to your old sites and started visiting new ones, you would probably be flagged. And that's the trade off of trying to repress fraud.
I should point out that there's a lot of play with the variables here and that actual implementation of this theoretical paper could be either well done or badly done.
Excellent point, though. Sometimes these new technologies turn out to be more cumbersome than helpful and we need to watch out for that!
That's the only pattern apart from Slashdot most users here will have!
It would probably be possible to distinguish between users, depending on the part of the link they click. Top, bottom, left, right, edge, center. Something must be fairly common.
Oh You POS
Who doesn't like clicking on Tiny Urls?
f
Tiny Urls just don't compute as part of my safe surfing habits.
Example:
Tiny Url --> my redirect --> paper
After it hits the front page
Tiny Url --> my redirect --> 0-day exploit
There really is no need for them in Slashdot Submissions.
Here's the direct link to the paper
http://knowledge.wharton.upenn.edu/papers/1323.pd
[Fuck Beta]
o0t!
thiss it tsht wruostt thingti everurheard of assoson ii sober up ima gonanagjoigewhtesdqwhiu yerrsmy bests frenns u nme ginst worlds
If you were blocking sigs, you wouldn't have to read this.
Follow them to their myspace page.
Perhaps this will help spark more interest in anonymous web browsing.
I want to see the data. I'll bet the distinguishing "clickprints" of users are along the lines of what type of porn they search for, other minor obsessions (guns, cars, free MP3s, MMORG websites, etc.), and specific accounts they use and/or repeatedly visit online (MySpace, Hotmail, Yahoo, etc.).
Watching someone for even 4 online sessions could easily give you an "imprint" of a user's consistent habits.
Where the heck are the system requirements?
Obviously, this has other more dangerous uses other than ads.
For example, if you visit this site too much,
http://palestinechronicle.com/
They may start pulling you aside more at the airport.
Mod parent up. Add in factors such as Gmail account auto-checkers and other extensions that log in automagically and it's a trivial exercise.
Please please please, read TFA and the paper :-)
:-)
Directly from the paper, especially for you: "It is important to note that the research presented in this paper discuss the possibility of identifying users based on their online behavior. However this 'identification' is still anonymous, and even perfect methods will only be able to indicate that some current session belongs to the 'same user' as some previous session. These methods cannot identify users by 'name'."
What you said makes no sense at all with regard to this story, this article, or this paper.
ilex paraguariensis for all
One thing I've noticed about my family's computer use (they all use XP) is the way that they launch their browser. My mom clicks the desktop icon. I like the quicklaunch button. My sister uses the recent items menu, and my dad likes to open a folder and type an address in the address bar (despite my attempts to get him to use Firefox). One possible way to make clickprinting much more effective would be to monitor the methods people use to get from one page to another. Some people like to click a button to submit a form. Others prefer the enter key. Stuff like that could probably be far more effective in telling if a different person is using something they shouldn't. So in short, if someone has multiple ways to launch a program or go somewhere on the web, the path they take could be more telling than the speed at which they navigate.
Yeah, this is all fine and good if the account is single user access.
It would be interesting to see what a "clickprint" analysis of an account shared with bugmenot.com would look like.
The idea of using this sort of technology as a security feature sounds absolutely horrible.
I mean, a change in your browsing habits on a site gets you locked out? That's not Orwellian, it's just plain stupid.
That is good news. Just visit some site N-teen times and you do not need a password!
Ooops!
And how is "click pattern" new? Morse code operators (telegraph and CW (radio)) have been able to identify each other by their "fist" for around 160 years.
I remember talking to a vendor 20 years ago. His company had a way of identifying people by their typing habits. Time between keys, spelling, etc. So you've added the mouse to it, and are tying it in to surfing habits.. big deal. Why did it take 20 years?
It'll be tied to cookies, bluetooth, and that proximity chip in your head pretty soon. This isn't really news, it's the logical progression of technology. Tech works best when you know who it's aimed at, especially advertising and remote controlled guns. (Same effect, really. )
In Neal Stephenson's Cryptonomicon, he introduced the idea that the operators sending the encrypted messages were distinguishable by their "hand" (the subtleties in how they transmitted their messages). Stephenson even went on to say that they used professional pianists for their adroitness in mimicking various enemy operators to avoid detection. I don't know how much of that is rooted in actual history, but it was an intriguing idea that bears a resemblance to the method these guys are using.
However...
If a website implements this secretly, then gets information about your usage while having some sort of login information with which to associate this information, then they would be able to connect future sessions to that session, which they could then connect back to a user profile.
Although this seems to be focused on usage within a single website, it seems a reasonable extrapolation to think that someone could develop a less effective but more general algorithm that would help to identify users on different sites as well.
Famous Last Words: "hmm...wikipedia says it's edible"
What is that, a five-sided prostitute??
Soylent Green is peoplicious!
mod parent up
The article describes another form of clickstream analysis. However, I wonder whether user behavior couldn't also, and perhaps better, be identified by content interaction. There are a number of products that show Web page heatcharts ostensibly to identify layout problems. But there are not many products that show what a person actually did on a page. The article used sample data for a year, but I wonder how much of that data was skewed by changes in content layout and promotion. For example, I monitored the behavior on several Web pages with a consistent layout for three years and can show clear behavior patterns by content type (Identifying Behavior by Content Type). I think individual behavior on a given page with a particular type of content might also be useful in identifying particular users.
thinking of an analogy: the birthday paradox:
Just because it is easy to distinguish between 2 users does not mean that this has much practical use:
In most applications (without the user's consent) this is going to be used remotely (server-side), which means that it is going to be totally useless at tracking users if there are more than say fifty users (someone do the math - assuming that the users follow enough links on that same site).
You can safely stow away the tin-foil hat^W^W browsing pattern disguise Firefox extension combined with the anonymous proxy / anonymous browsing.
TODO: 753) write sig.
Yes, yes, we've all read Cryptonomicon too.
http://www.mysecureisp.com/
Sure, the math to track more than 50 users would be difficult... unless you came up with names for different browsing styles instead of trying to match every movement. Like they did with driving automobiles in the '80s. Hmm, no one seems to use names for driving styles anymore... Now say attached to your name were the 3 most common browsing styles you used, and 15 styles you used occasionally?
(the only movie I know that talks about this is David Byrne's True Stories)
I see this was described as a "working paper" on the 20th. It doesn't show up anywhere as being "under review". I wonder if they've just blown their publication chances given that it is already "pre-published" at this point?
It'll be interesting to see how this shakes out.
The AOL Search database which was released didn't identify the users by name either; that didn't mean they weren't identifiable.
I don't think the goal is an ability to map anonymous clickprints onto a domain of known users---guessing right 99% of the time (optimistically speaking) with a population of only 2 users does not seem very good for that application. However, if gmail or yahoo or whoever alerted me when my access habits suddenly changed dramatically, or prompted for identification confirmation more often when my usage patterns changed, that might be somewhat useful.
--TheOrangeSquid Is it any wonder things seem so awry? We swim in a sea of confusion and don't have to think to survive