New Method of Tracking UIP Hits?
smurray writes "iMediaConnection has an interesting article on a new approach to web analysis. The author claims that he is describing 'new, cutting edge methodologies for identifying people, methodologies that -- at this point -- no web analytics product supports.' What's more interesting, the new technology doesn't seem to be privacy intrusive." Many companies seem unhappy with the accepted norms of tracking UIP results. Another approach to solving this problem was also previously covered on Slashdot.
new, cutting edge methodologies for identifying people....the new technology doesn't seem to be privacy intrusive
The Wookie defense in action!
time is a perception of a being's consciousness
time is your 6th sense, the weird ones are 7+
Sending your PCs unique CPUID along with every HTTP request would be ideal for this. You could also group up websites and use this to track people across websites. It would be great for marketing and for law enforcement.
Oh, you all disabled your nice Intel CPUID? Why ever would you want to do that?
International Union of Private Wagons
Quimper, France - Pluguffan (Airport Code)
Ultimate Irrigation Potential
Uncovered Interest Parity
Undegraded Intake Protein
United International Pictures
Universidad Interamericana de Panamá
Unusual Interstitial Pneumonitis
Upgrade Improvement Program
Urinating In Public
User Interface Program
USIGS Interoperability Profile
Usual Interstitial Pneumonia of Liebow
Utilities Infrastructure Plan
We know some IP addresses cannot be shared by one person. These are the ones that would require a person to move faster than possible. If we have one IP address in New York, then one in Tokyo 60 minutes later, we know it can't be the same person because you can't get from New York to Tokyo in one hour.
If my company had computers in New York and Tokyo, I could ssh between them in much less than 60 minutes. . .
iMediaConnection starts a huge field test of tracking unique slashdot readers with their cutting edge technologies.
So you can use probabilistic means to identify unique visitors. That's not a paradigm shift, except for those whose paradigms are already very small.
Somehow I don't think this research is worthy of an NDA.
I'm not sure what the Flash is, but to me, scanning all the cookies your computer has had IS privacy intrusive.
No single test is perfectly reliable, so we have to apply multiple tests.
No kidding. This guy probably needs a wake up call.
We know some IP addresses cannot be shared by one person. These are the ones that would require a person to move faster than possible. If we have one IP address in New York, then one in Tokyo 60 minutes later, we know it can't be the same person because you can't get from New York to Tokyo in one hour.
OK, so this is what is normally called a really stupid argument. I'm not saying it can't be accounted for, but stating such a thing is nothing more than plain stupidity. Has this guy ever heard of that Internet thing?
Flash can report to the system all the cookies a machine has held.
Uhmm, not a great argument to make people use it.
No one wants to know.
I don't think they don't want to know. They just don't want to see a sudden drop of ~50% of their user count from one day to the next. And it really doesn't matter if it's the truth or not. A drop is a drop.
I am putting myself to the fullest possible use, which is all I can think that any conscious entity can ever hope to do.
What's more interesting, the new technology doesn't seem to be privacy intrusive
The only mention of the word "privacy" on the linked web page is the term "Privacy Policy" at the bottom of the page.
John.
From the article:
" We know some IP addresses cannot be shared by one person. These are the ones that would require a person to move faster than possible. If we have one IP address in New York, then one in Tokyo 60 minutes later, we know it can't be the same person because you can't get from New York to Tokyo in one hour."
Ever heard of ssh and similar tools to make that travel?
And they put this on slashdot. Ignorance, just pure ignorance...
Evolution of Language Through The Ages: 6000 BC : ungh, grrf, booga 2000 AD : grep, awk, sed
They make some silly assumptions that I don't think work with users using proxy agents, but in the end it still boils down to the existence of cookies. Which would be ok, if the problem they are trying to solve wasn't that users are deleting and not storing cookies at all. They do mention using Flash to store cookies, which I suspect will have to be the next area users will have to start cleaning up. But just because cookies don't overlap in time and the IP address is the same doesn't mean it's the same person. A bunch of users that use the same browser and share an IP address that always delete their cookies with this system will look like one user. Vastly under counting. Which I don't think web sites are interested in. Vast over counting is profitable. Under counting, not so much.
In the end there is no way they can even mostly recognize repeat web site visitors if the VISITOR DOESN'T WANT THEM TO.
The big problem is stated at the top of the article:
"We need to identify unique users on the web. It's fundamental. We need to know how many people visit, what they read, for how long, how often they return, and at what frequency. These are the 'atoms' of our metrics. Without this knowledge we really can't do much."
If knowing who unique users are is that important they need to create a reason for the user to correctly identify themselves. Some form of incentive that makes it worth giving up an identification for.
The article's "Sky is Falling" tone rests on a single factoid. "30 to 55% of users delete cookies" therefore current analytics products are out by "at least 30 percent, maybe more".
That is of course complete nonsense. Let's say we accept the author's assertion that different studies have given cookie deletion rates across that range. I can accept that a significant number of users might delete cookies at some point, but what percentage of normal, non-geek, non-tinfoil-hat-wearing users are deleting cookies between page requests to a single site in a single session? If it is 30%, then I will eat my hat.
Most cookie deletion among the general populace is done automatically by anti-spyware software, and is not done in real time.
The author clearly knows that even the most primitive of tools also use other metrics to group page requests into sessions, so even if 30% of users were deleting cookies, it would not result in a 30% inaccuracy.
Of course "researchers propose more complex heuristic that looks to be slightly more accurate than current practice" does not make as good a story as "paradigm shift" blah blah "blows out of the water" blah blah "We've been off by at least 30 percent, maybe more." blah blah.
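For what it's worth, the kind of session grouping meant above can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the article: requests arrive as (timestamp, IP, User-Agent) tuples, and a 30-minute inactivity timeout separates sessions. The names `SESSION_TIMEOUT` and `count_sessions` are made up for this example. No cookies are consulted at all.

```python
SESSION_TIMEOUT = 30 * 60  # seconds of inactivity; an assumed, commonly used value


def count_sessions(requests, timeout=SESSION_TIMEOUT):
    """Count sessions in an iterable of (timestamp, ip, user_agent) tuples.

    Requests sharing the same (ip, user_agent) pair belong to one session
    until the gap between consecutive requests exceeds `timeout`.
    """
    last_seen = {}  # (ip, user_agent) -> timestamp of the latest request
    sessions = 0
    for ts, ip, ua in sorted(requests):
        key = (ip, ua)
        if key not in last_seen or ts - last_seen[key] > timeout:
            sessions += 1  # first request for this key, or the old session expired
        last_seen[key] = ts
    return sessions
```

A user who wipes cookies between visits still lands in the same session here as long as the IP and User-Agent stay put, which is why a 30% deletion rate would not translate directly into a 30% counting error.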
I develop web analytic software for a living.
There's only so much you can do to track users.
IP address, user agent, some javascript stuff for cookieless tracking.. the only real "unique" identifiers for any one visitor. It stops there.
Of course, using exploits in flash doesn't count, but supposedly this new method is "not intrusive."
I call BS because it simply can't happen.
If a user doesn't wanna be tracked, they won't be tracked. This story is just press, free advertisement, and hype for this particular company.
We have secretly replaced these Slashdot mods' sense of humor with a rusty nail. Let's see if they notice!!
When I read "paradigm shift" in the very first paragraph, my bullshit sensor sounded such a loud alarm that it was hard to continue reading...
The article spends a lot of time establishing that this is a paradigm shift, when it's actually not. I do believe their idea is good, but basically it's just applying a lot of "possible" user identifiers and merging them together to form a unified result.
One identifier they haven't used is linkage on the site. If one page links to another, it might be the same user, if the pages are requested in sequence.
On top of links, "time" might be applied. Some links are expected to be clicked fast, others after some reading on the page.
Some may argue that linkage is what you want to determine in the subsequent analysis, and therefore can't be used to determine the user in advance, but this is not true. The determination of user uniqueness looks at whether it's possible for the user to get from one page to another, while the analysis wants to determine whether they did.
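A rough sketch of that linkage test in Python, purely as an illustration: the `LINKS` map, the `plausible_same_user` name and the 30-minute `max_gap` are all invented here, and a real site would build the link map from its own pages.

```python
# Hypothetical site link map: page -> set of pages it links to.
LINKS = {
    "/": {"/news", "/about"},
    "/news": {"/news/1"},
}


def plausible_same_user(prev_page, next_page, gap_seconds, links, max_gap=1800.0):
    """Return True if two consecutive requests could be one user following a link.

    It only checks that a link from prev_page to next_page exists and that
    the time between the requests is within an assumed plausible window.
    """
    linked = next_page in links.get(prev_page, set())
    return linked and 0 <= gap_seconds <= max_gap
```

This only says the hop was possible, not that it happened, which is the distinction made above.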
-:) Oh no - not again.
www.rednebula.com
ROI is mentioned, along with the 'atoms' of their metrics: page hit count, popular URL count, URL dwell time, and returning visitors. When these metrics are used to produce reports, how valuable are these reports in ascertaining how ROI is affected by said metrics? For example, getting a neat funnel report of the path people take through a site and where the traffic drops off offers insight into popular paths and locations where people bail out, but apart from listening for errors, there is no further insight into why a person bailed.
What seems to be missing is gathering insightful information into what transpires while someone is on a particular page. I'd like to know the general trends in behavior, not just the server requests. I've found it more useful to be able to see the interactions with the content than reporting where people enter, traverse, and exit a site.
"If the same cookie is present on multiple visits, it's the same person. We next sort our visits by cookie ID"
Only after that do they seem to continue the analysis ("We know some IP addresses cannot be shared by one person. These are the ones that would require a person to move faster than possible", etc.)
So turning off or regularly removing cookies will render their bleeding, cutting edge technology useless? And how are cookies a 'breakthrough'? Their only alternative to this seems to be:
You can also throw Flash Shared Objects (FSO) into the mix. FSOs can't replace cookies, but if someone does support FSO you can use FSOs to record cookie IDs.
I don't know what the fuss is about.
This is just basic logic, which any decent programmer should be able to come up with, even the M$ certified ones.
I think we can keep recursing like this until someone returns 1
For those who can't be bothered to read through all the buzzwords, here's the actual method used:
Each of these steps is applied in order:
1. If the same cookie is present on multiple visits, it's the same person.
2. We next sort our visits by cookie ID and look at the cookie life spans. Different cookies that overlap in time are different users. In other words, one person can't have two cookies at the same time.
3. This leaves us with sets of cookie IDs that could belong to the same person because they occur at different times, so we now look at IP addresses.
4. We know some IP addresses cannot be shared by one person. These are the ones that would require a person to move faster than possible. If we have one IP address in New York, then one in Tokyo 60 minutes later, we know it can't be the same person because you can't get from New York to Tokyo in one hour.
5. This leaves us with those IP addresses that can't be eliminated on the basis of geography. We now switch emphasis. Instead of looking for proof of difference, we now look for combinations which indicate it's the same person. These are IP addresses we know to be owned by the same ISP or company.
6. We can refine this test by going back over the IP address/Cookie combination. We can look at all the IP addresses that a cookie had. Do we see one of those addresses used on a new cookie? Do both cookies have the same User Agent? If we get the same pool of IP addresses showing up on multiple cookies over time, with the same User Agent, this probably indicates the same person.
7. You can also throw Flash Shared Objects (FSO) into the mix. FSOs can't replace cookies, but if someone does support FSO you can use FSOs to record cookie IDs. This way Flash can report to the system all the cookies a machine has held. In addition to identifying users, you can use this information to understand the cookie behavior of your Flash users and extrapolate to the rest of your visitor population.
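To make the elimination steps concrete, here is a minimal Python sketch of steps 2 and 4 only. It is not the article's implementation. It assumes each visit already carries a latitude and longitude geolocated from its IP address, and the `Visit` record, the function names and the `MAX_SPEED_KMH` limit are all invented for illustration.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt

MAX_SPEED_KMH = 1000.0  # assumed upper bound on how fast a person can travel


@dataclass(frozen=True)
class Visit:
    cookie_id: str
    timestamp: float  # seconds since some epoch
    lat: float        # assumed to come from IP geolocation
    lon: float


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def lifespan(visits, cookie_id):
    """(first seen, last seen) timestamps for one cookie."""
    times = [v.timestamp for v in visits if v.cookie_id == cookie_id]
    return min(times), max(times)


def impossible_travel(v1, v2):
    """Step 4: True if getting between the two visits would exceed MAX_SPEED_KMH."""
    hours = abs(v2.timestamp - v1.timestamp) / 3600.0
    return haversine_km(v1.lat, v1.lon, v2.lat, v2.lon) > MAX_SPEED_KMH * hours


def could_be_same_person(cookie_a, cookie_b, visits):
    """Return False if step 2 or step 4 rules the pair out, True otherwise."""
    a_start, a_end = lifespan(visits, cookie_a)
    b_start, b_end = lifespan(visits, cookie_b)
    if a_start <= b_end and b_start <= a_end:
        return False  # step 2: overlapping cookie life spans mean different users
    va = [v for v in visits if v.cookie_id == cookie_a]
    vb = [v for v in visits if v.cookie_id == cookie_b]
    return not any(impossible_travel(x, y) for x in va for y in vb)
```

A True result only means the pair was not eliminated; steps 5 to 7 would still have to supply positive evidence. The geography test also treats the location of an IP address as the location of the person, which a proxy or remote login breaks.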
Please correct me if I got my facts wrong.
About 20% of my time on my last job was spent doing web analysis. It drove me insane.
The problem is with the word "accurate". To management, "accurate statistics" means knowing exactly how many conscious human beings looked at the site during a given period. However, the computer cannot measure this. What it can measure, accurately, is the number of HTML requests during a given period.
You can use the latter number to estimate the former number. But because this estimate is affected by a multitude of factors like spiders, proxies, bugs, etc., management will say "these stats are clearly not accurate!". You can try to filter out the various "undesirable" requests, but the results you'll get will vary chaotically with the filters you use. The closer you get to "accurate" stats from the point of view of management, the further you'll be from "accurate" stats from a technical point of view.
Makers of web analysis software and services address these problems by the simple technique of "lying". In fact, a whole industry has built up based on the shared delusion that we can accurately measure distinct users.
Which is where this article comes in. The author has discovered the shocking, shocking fact that the standard means of measuring distinct users are total bollocks. He's discovered that another technique produces dramatically different results. He's shocked, shocked, appalled in fact, that the makers of web analysis software are not interested in this new, highly computationally-intensive technique that spits out lower numbers.
My advice? Instead of doing costly probability analysis on your log files, just multiply your existing user counts by 0.7. The results will be just as meaningful and you can go home earlier.
fish and pipes
Somebody please explain to me: why would you go to all this trouble to get a close estimate of how many unique visitors your site draws?
I'm personally always more interested in how many pages get requested, and which ones. The first gives me an impression of how popular the site is*, the second tells me which pages people particularly like, so I can add more like that.
The only reason I see for really wanting to track people is if your site is actually an app that has state. In those cases, you have to use a more bullet-proof system than the one presented in TFA.
* Some people object that it counts many people who visit once, then never again; but I consider it a success that they got there in the first place - they were probably referred by someone, followed a link that someone made, or the page ranks high in a search engine.
Please correct me if I got my facts wrong.
Macromedia have a page that allows you to modify what sites can do on your computer in regards to Flash:
http://www.macromedia.com/support/documentation/en/flashplayer/help/settings_manager02.html#118539
"I highly doubt anyone is THAT stupid to put THAT big of a security flaw into a system."
Read the article, and the guy is proposing to build exactly that kind of a security flaw into the system.
Flash can use, basically, some local shared storage on your hard drive. This isn't really designed as cookie storage, and doesn't have even the meager safeguards that cookies have. (E.g., being tied only to a domain.) It's really a space that _any_ flash applet can read and write, and currently no one (with half a clue) puts any important data there.
This guy's idea? Basically, "I know, let's store cookies there, precisely _because_ any other flash applet, e.g., our own again from a different page, can read that back again."
Caveat: so can everyone else. I could make a simple flash game that grabs everything stored there, just as you described, and sends it back to me. Including, yes, your session id (so, yes, I can take over your session in any site you were logged in, including any e-commerce sites or your bank) and anything else they stored there.
Since it's used to track your movements through sites, depending on how cluelessly that's programmed, I may (or may not) also be able to gather all sorts of other information about you.
So in a nutshell his miracle solution is to build _exactly_ that kind of a vulnerability (not to mention privacy leak) into the system.
So, well, that's the problem with assuming that "no one could be THAT stupid". Invariably when I say that, someone kindly offers himself as living proof that I'm wrong. Someone CAN be that stupid.
A polar bear is a cartesian bear after a coordinate transform.