Calculating Number of Users Based on Amount of Unique IPs?
pjdepasq asks: "I run a small but growing web site. Currently the site has optional registration (for the message boards), though we know we have a larger number of anonymous users. Is there an industry standard for calculating number of unique users based on the unique IP addresses over a period of time (1 week? 1 month?) We'd like to get a handle on the number of users we have. Sure, I know about dynamic IP addresses and ISPs like AOL which can dilute or confuse the numbers, but surely there's some benchmark calculation we can use."
But I'm told by people in the magazine business that the industry standard there is to assume that the number of readers is 5x the number of issues sold. Of course that will vary widely by magazine, but that's the ratio they all use when making readership claims in their rate cards.
This is exactly the question the original poster was asking, but for the web: everybody knows that getting an exact answer is impossible; he's just looking for a rule of thumb.
Q: "How do you map IPs to users?" A: "Use cookies!"
They mean well, but they don't live in the real world. I on the other hand mean well AND live in the real world, so here are two reasonable ways to handle it. They should give similar -- but not identical -- numbers, and either one is good enough for anyone with reasonable expectations.
1) Count the number of unique IP addresses you see every half hour. Simple, fast, easy. And reasonably accurate.
2) A series of hits from a single IP can be considered a single user if there are no gaps more than five minutes between hits. Count up the number of these bunches of hits you see and you get the number of people. Hard and slow, but reasonably accurate.
Neither of these will really give you the people that came to your site, but they definitely give you a good guess. They can't see stuff hidden behind proxies (but neither can anyone else) and they don't deal with IP addresses that change during a single session. But compare even THIS data to what TV advertisers get from Nielsen and you will NEVER feel like you need to drop a cookie on every one of your users.
I use the first version to report traffic for a 4-million pageview a day website and it works just fine. And if your boss doesn't like it, beat some reasonable expectations into him or her.
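The two methods above can be sketched roughly as follows. This is only an illustration, not the poster's actual code: it assumes hits are already parsed into (unix_timestamp, ip) pairs, it reads "every half hour" as fixed 30-minute buckets, and the function names are made up for the example.

```python
HALF_HOUR = 30 * 60    # bucket size for method 1, in seconds
SESSION_GAP = 5 * 60   # maximum gap between hits for method 2, in seconds


def count_by_half_hour(hits):
    """Method 1: sum of distinct IPs seen in each 30-minute bucket.

    hits is an iterable of (unix_timestamp, ip) pairs (assumed format).
    """
    seen = set()
    for ts, ip in hits:
        # An IP counts once per bucket, however many hits it makes there.
        seen.add((int(ts) // HALF_HOUR, ip))
    return len(seen)


def count_by_sessions(hits):
    """Method 2: a run of hits from one IP with no gap over five minutes
    counts as one visitor."""
    last_seen = {}
    sessions = 0
    for ts, ip in sorted(hits):  # process hits in time order
        prev = last_seen.get(ip)
        if prev is None or ts - prev > SESSION_GAP:
            sessions += 1        # first hit, or the gap was too long
        last_seen[ip] = ts
    return sessions


hits = [(0, "a"), (60, "a"), (1000, "a"), (100, "b"), (1900, "b")]
print(count_by_half_hour(hits))  # 3
print(count_by_sessions(hits))   # 4
```

As the post says, the two numbers come out similar but not identical: the second method splits an IP's hits wherever there is a pause of more than five minutes, while the first only splits on bucket boundaries.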
That's because I wasn't trying to say IP counting is better than cookies. I was saying that for most sites, counting IPs is more than good enough. Cookie tracking and IP counting are both reasonably accurate.
> I'll gladly submit to better ideas, if only you can show me the flaws in my own arguments and convince me of yours.
I wasn't trying to change anyone's mind. I was just trying to answer the original question. Cookies are a great way to track (and count) users, but they have nothing to do with the original question.
Cache-control: private is probably the best solution, as it lets the browser cache the page but tells the proxy not to. Not sure if this always works or not, though.
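For what it's worth, here is a minimal sketch of sending that header from a bare WSGI app in Python. The app name and page body are made up for the example, and (as noted above) whether every proxy honors the header is not something this code can guarantee.

```python
def private_cache_app(environ, start_response):
    """Hypothetical WSGI app that marks its response as privately cacheable."""
    body = b"<html><body>hello</body></html>"
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
        # "private": the browser may cache the page, shared proxies should not.
        ("Cache-Control", "private"),
    ]
    start_response("200 OK", headers)
    return [body]
```

It can be served for a quick local test with wsgiref.simple_server.make_server from the standard library.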
// mlc, user 16290
--
11.0010010000111111011010101000100010000101101000
One problem is that it would depend very much on the type of website and thus the type of users you had. If you have a B2B website, and most of your visitors are from companies, your (unique user):(unique IP) ratio will look very different to a site with mostly home visitors coming through large ISPs.
The industry seems to be more concerned with developing more and more reliable versions of the half-hour timeout metric. Of course, they're chasing the wind. (And furthermore, all the different versions of their metric are then not comparable -- see this study from Xerox PARC (PDF, 228kb).)
I leave you with this thought from my essay How the Web Works:
Your first statement was valid and made a lot of sense.
THEN you said:
"Second, there are man[y] servers which have a ton of virtual hosts on them, each with their own IP. A server could have 20 or 30 or more IP's assigned to it, there's no way to know. Furthermore, a server could have multiple NIC's, assigning different virtual hosts to different NIC's, making it even harder to figure out."
How is this at all relevant to the question?
He's trying to count visitors to his site, not sites on the web; and most visitors don't surf from their vhost accounts (most visitors don't HAVE vhost accounts to surf from.)
To pick a nit, load balancers may move IP's on Vhosts around, but I don't believe having multiple NIC's would affect the IP's that a server sits on - they would remain static.
http://www.bullnet.com
> That's what cookies are for. If you know the problems with IP numbers, why try to use them for something that's clearly inappropriate and fraught with error?
There are circumstances where this is impossible. Like, say, for last year's logs.
I'll grant that this method is fraught with error and that using IP addresses to count noses is the work of the devil. Setting that aside, could a few folks who are running unique cookies on a large site count 'em and count IP addresses in the same period and give us their ratio?
My rough rule of thumb is that the ratio is around 1:1 but it has been several years since I verified this.
> could a few folks who are running unique
> cookies on a large site count 'em and count IP
> addresses in the same period and give us their
> ratio?
Certainly. We were counting users by IP address/browser type combination for about a year on three sites getting between 5,000 and 50,000 users per month based on that calculation. We then decided to use cookies (expiring at 5 years). We saw an increase in user numbers of about 20% for each site.
Which was nice, seeing as we then went for ABCe audits and got lots of advertising money!
See - cookies DO work!
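For anyone who wants to run the same comparison on their own logs, here is a rough sketch of counting both ways over one period. It assumes the log has already been parsed into (ip, user_agent, cookie_id) tuples; that layout and the function name are illustrative, not anything from the sites described above.

```python
def compare_counts(hits):
    """Return (distinct IP/browser pairs, distinct cookie IDs) for one period.

    hits is an iterable of (ip, user_agent, cookie_id) tuples (assumed
    format); cookie_id may be None or empty when no cookie was sent.
    """
    hits = list(hits)
    by_ip_agent = {(ip, agent) for ip, agent, _ in hits}
    by_cookie = {cookie for _, _, cookie in hits if cookie}
    return len(by_ip_agent), len(by_cookie)


hits = [
    ("10.0.0.1", "Mozilla/4.0", "c1"),
    ("10.0.0.1", "Mozilla/4.0", "c2"),  # second user behind the same proxy
    ("10.0.0.2", "Opera/5", "c3"),
]
ip_agent_count, cookie_count = compare_counts(hits)
print(ip_agent_count, cookie_count)  # 2 3
```

Dividing the second number by the first gives the cookie-to-IP ratio asked for earlier in the thread.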
G
"And the meaning of words; when they cease to function; when will it start worrying you?"
Exactly. All the people posting here going "cookies don't work because people turn them off" are on planet Slashdot.
HOWEVER - one thing that's not been mentioned yet is that if you use mod_usertrack to cookie your logs and you get a user who does not accept a cookie, it creates a stream of unique IDs - one for every request that user makes.
So - those people who turn their cookies off and go to Apache servers could be looking like hundreds or even thousands of users! Hooray!
G
You can't do it.
First, there are too many corporations using NAT, and it's impossible to know how many people are NAT'ed. A company may have 100 employees, but only have 3 static IPs.
Second, there are many servers which have a ton of virtual hosts on them, each with their own IP. A server could have 20 or 30 or more IP's assigned to it, there's no way to know. Furthermore, a server could have multiple NIC's, assigning different virtual hosts to different NIC's, making it even harder to figure out.
-Cire
You can beat the caching by placing a Pragma: no-cache header in the HTTP response and/or setting the Last-Modified date to now.
It's Linux, damnit! Pay no attention to renaming attempts by self-aggrandizing blowhards.
IP-based ratios won't work because AOL will fuck you over. I've seen the same users come from different IP's in their proxy space, plus their caching means some users will never make it to your page.
Some users have plug-ins that request the page from a second IP (NBCI's quick click anyone?) that will skew your numbers. Exact numbers will not happen, and ratios will vary widely based on your clientele.
Set cookies and go on your way.
I didn't say I was using them, I was wondering if there was some industry calculation based upon the IPs.
I've not dealt with cookies before and had not considered using them.
Good luck getting your rough estimate based on IP addresses. I think it's great that you're eschewing cookies; go for cache friendliness.
The process would be something like this (same as your suggestion):
Have some page (/ ?) set a cookie and automatically redirect to a "test" page. This page simply verifies whether a cookie is returned by the user. If not, he is rejecting them.
You can always check whether the user has accepted the cookie or not: redirect from one page to another, and if the user presents the cookie on the second page, it was accepted. You can then estimate the proportion of non-cookie users.
And if you base your stats on a short period of time, say 2 or 3 weeks, users clearing cookies will be a minority.
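The set-a-cookie-then-redirect check described above could look something like this as a small WSGI app. It is a sketch only: the cookie name "probe", the "/check" path and the response text are all invented for the example.

```python
from http.cookies import SimpleCookie

COOKIE_NAME = "probe"  # hypothetical cookie name


def cookie_probe_app(environ, start_response):
    """Set a cookie on any page, redirect to /check, and test for it there."""
    path = environ.get("PATH_INFO", "/")
    if path == "/check":
        # The "test" page: did the browser send the cookie back?
        cookies = SimpleCookie(environ.get("HTTP_COOKIE", ""))
        accepted = COOKIE_NAME in cookies
        body = b"cookies accepted" if accepted else b"cookies rejected"
        start_response("200 OK", [
            ("Content-Type", "text/plain"),
            ("Content-Length", str(len(body))),
        ])
        return [body]
    # Any other page: set the cookie and bounce to the test page.
    start_response("302 Found", [
        ("Set-Cookie", f"{COOKIE_NAME}=1; Path=/"),
        ("Location", "/check"),
        ("Content-Length", "0"),
    ])
    return [b""]
```

Logging the accepted/rejected outcome on the test page is what would let you estimate the proportion of non-cookie users.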
IP's are misleading... every user behind a proxy server shows up as the same IP address. There could be thousands of users behind the same proxy.
MadCow.
I used to have a sig, but I set it free and it never came back.
I have no problem with people refuting my claims/advice, but you offer no reasoning whatsoever as to why your "method" is better than using cookies. I certainly DO live in the "real world", and the methods suggested using cookies would definitely provide simple, accurate, and objective measurements of unique users.
Do you not have experience using cookies? Do you not know what they do or how they work? Do you understand the problems with using IP addressing, as indicated in the cookie discussions above? Do you have any actual "substance" with which to argue those points?
I'll gladly submit to better ideas, if only you can show me the flaws in my own arguments and convince me of yours. Unfortunately, your post lacked any (possibly quite correct) details to support your claims.
MadCow.
That's what cookies are for. If you know the problems with IP numbers, why try to use them for something that's clearly inappropriate and fraught with error?
"Patriotism is your conviction that this country is superior to all other countries because you were born in it." -- GBS