On Counting Website Traffic
Logic Bomb writes: "The San Francisco Chronicle has an interesting article about measuring website traffic. This is kind of an obnoxious issue, but it means everything to commercial websites seeking investors. Apparently the figures reported by the sites themselves through analysis of server logs are often much higher than the ones given by firms like Media Metrix (whose numbers I see all the time in articles from Cnet and the like). The basic dispute is over whether sampling, a la Nielsen, is appropriate for the web. It seems counterproductive to purposely use an inaccurate statistical measure when exact counts are readily available, but I can't imagine many things easier to fake than a server log. Anyone have a good idea about how to approach this?"
On a serious note, I work at a fairly busy web site in the data warehousing/business reporting section. We simply don't have _time_ to fake enough server log entries to make it worth our while, we're too busy processing the _legit_ stuff and filtering out such non-reportables as crawler hits. As to the discrepancies between inside and outside numbers, well, I can _guess_ about what Mars looks like from some telescopic photos, but nothing beats going and looking. Understand I'm not shooting at those outside companies for the differences in said numbers, but one must be aware that different methods can produce different numbers, and that using statistical methods to arrive at metrics.... well, that's how the US Census works.... and if you live in the right neighborhood you're going to find an awful lot of _dark_-skinned 'Caucasians' ;>
Radar detectors aren't reliable, because many troopers use only stopwatches and a known distance. The radar is off.
The police radios, however, are always on.
********* sig: If you don't like the law, get filthy stinking rich, and buy a better one.
You could probably set up a Perl script in a few minutes to make up the numbers for you.
In this country (I needn't specify which), we elect representatives. We don't directly vote on most issues, even though it would be technologically feasible for us to do so (especially with the advent of the internet). Why? Because we're not just looking for an accurate measure of what people want. We want what they ought to want, and we hope that representatives will better reflect that than their actual choices.
Advertisers don't just want to know what the most visible piece of real estate is in the world so they can erect a billboard on it. They want to know what the next upcoming innovation is so they can be the first to ride the upsurging wave of popularity. It doesn't help that altavista is the most popular search engine in the world today if placing a big banner ad on google tomorrow will catch the as-yet unseen mobs.
Take Netcraft and server operating systems. You don't just want to know what people are actually running. You want to know what they dare to tell you they're running. This is why it's ok for Netcraft to base its statistics on what servers tell each other they're running, rather than on some complicated fingerprint of their tcp/ip stacks.
It comes down to this: Adam Smith had it wrong with his theory of the invisible hand of market forces. It's not just what the markets do that's interesting, for that tells you nothing more than what, empirically, they do. If you pretend otherwise, then you're behaving no differently from all the Linux bandwagoners or Microsoft bandwagoners who base their decisions only on the herd. Herd mentalities are antithetical to proper advertising, and advertisers are finally waking up to this fact.
Cheers,
Froid
Silly po boy! Of course I didn't let them have my dl number. I put down some random number which matched mine for about four digits. Just in case they asked.
steven
-- I have marked myself unwilling to moderate-- I don't have other accounts to artificially inflate the karma of
Well, falsifying server logs in order to get better rates for banner ads would probably count as fraud, which happens to be a criminal offense in the US. A couple of show trials followed by public hangings should solve this little problem.
Besides, banner ads are typically served from a server NOT controlled by the company which owns the page. So people like DoubleClick know for sure how many times their ad was ignor^H^H^H^H^Hseen.
Kaa
Kaa's Law: In any sufficiently large group of people most are idiots.
The idea of hiring a company to generate web statistics to test for commercial viability seems impractical.
If a company truly wanted to, they could easily obtain numerous IPs to forge log entries. And think about a script kiddie exploiting java, perl, or whatever-- that would certainly make a website's statistics look better. The list of ways to increase a website's usage goes on.
I think the only way to get this done fairly is to post a raw log, and let the investors (or whoever the target is) decide for themselves. Apache logfiles are fairly straightforward, and it takes little to no effort to decide what is an actual hit and what is not. Of course, this would require honesty on the part of the company, which seems to be the real issue.
Are you kidding? When I worked at my last internship the boss would take the server stats from WebTrends, plop it in a Word file (to look good for investors) and then sometimes "moderately improve" some of the stats before printing the document.
Fact is, most investors don't get a verbatim server log with all the technical "mumbo-jumbo". They get a simplified version with only the information the CEO wants them to hear.
- I don't care if they globalize against free speech. All my best free thoughts are done in my head.
It was just an example of the size of 800 million downloads. Still I think it was quite inaccurate. You must also remember, most people on earth don't have and don't care to have internet access, and of those who do, who would have heard of this woman's site? People in North America.
-- iCEBaLM
The problem is that when people invest $10+ million in a web company, not only do they want numbers, but they want EVERYONE to know those numbers. I work for a website that gets ~8 million hits/day and has many regulations to conform to. The accuracy of our logs is what keeps our company alive. I've seen Nielsen and Media Metrix report numbers for us, and they're all off from what we get. That's to be expected from sampling. What *really* matters from a marketing perspective is how much granularity you can get from these numbers. If your logs show 20% less than what Nielsen shows, but you can drill down and get demographic/session/referrer/etc. data, then you're in a much better position. Raw hit counts are useless nowadays, but being able to break up this number into geographic location, time of day, site path, avg. session length, etc. is what makes logs useful.
Now if a company is interested in gathering web statistics in order to steer corporate decision making, then they should really look at collaborative filtering as a means to do this. No matter what else you have to say about Amazon.com, their implementation of the Net Perceptions collaborative filtering engine is incredibly accurate at analyzing and predicting their customers' needs/desires.
If someone is willing to take the hosting site's word at face value with regard to eyeball real-estate, then I've got some banner ads (and a bridge) to sell them.
And this is the really sad part. The information age has created a new type of cyber-criminal: the false information broker. Society is moving away from products and building multi-purpose machines. As a whole we're more service oriented than we used to be. This means all our assets and business transactions are on paper. Nothing tangible is being exchanged. And typically we have such a high volume of data being transferred that it can't be checked for 100% accuracy. I signed up for one of those "saver" cards at a local grocery store (part of a national chain) and totally faked the information on the signup sheet (I get enough spam as it is, thank you very much). No one caught it, even though an application with an address of 1600 Penn Ave in Ft. Worth, Utah with a completely made up Zip code and a Texas DL number showing up at a store in Tennessee _should_ have raised an eyebrow or two.
So now we have the buyers and the sellers. A buyer can't always trust a seller and a seller can't always trust a buyer. Enter the middleman who keeps both parties honest. Am I the only one saddened by the necessity of a service like this?
Steven
-- I have marked myself unwilling to moderate-- I don't have other accounts to artificially inflate the karma of
I would like to get one of those detectors. Police frequencies aren't hard to determine.
Wonder what kind of range it has?
********* sig: If you don't like the law, get filthy stinking rich, and buy a better one.
If you're turning your logs over to someone else for analysis, you might as well post your savings account number, PIN, and SSN to the Internet. The information contained in your logs is, IMHO, some of the most proprietary data an Internet company owns.
DarkSparks
You let them have your DL number? Seems kind of pointless to lie about the rest of the stuff when your DL number is on there.
I guess I am assuming that you didn't lie on your driver's license (about more than your height and weight).
The server logs don't tell you who is coming to the site. Sure, you know that 201.189.67.109 (completely made up) stopped here and you can even do a reverse DNS on it, but the advertisers that pay for banner ads and the corporate marketing types want to know how much disposable income is behind that IP and what they might spend it on. That is why DoubleClick and all want to track you and even correlate you with a name and address: that info lets them classify you and sell your eyeballs to the advertisers. Have you seen the higher prices that they get for targeted ads? Nearly double their normal rate last time I looked.
Bleh!
You could configure George Schlossnagle's mod_log_spread to multicast Apache log entries to a third party audit host. That would be realtime, very hard to fake, and transparent to your config.
The real Paul Vallee is slashdot userid 2192, and, what do you mean it's not cool to point out your low userid?
Yes, you're right, but it seems nobody has mentioned what is painfully obvious to me from dealing with my own clients. "What's a server log? Do we have one of those? How much is that?" Most businesspeople (suits) need somebody to translate the tech for them, and whenever there's a translation, there's the opportunity for deceit.
The Divine Creatrix in a Mortal Shell that stays Crunchy in Milk
The House Between - Original Sci-Fi Series
It is virtually impossible for a device to detect what radio station you are tuned to. There's no way for anyone to tell what station you're listening to, short of getting into your car and looking at your radio. If you still have an all analog radio, then maybe you could detect harmonics caused by the filters at the local oscillator, but I don't see that as being reliable from any distance. If you've got a recent stereo, then it's probably DSP driven anyway. So how could you possibly tell then what station the person is tuned to? Telnet to the proc and do a ps aux | grep LO-RF?
I'm sorry... I just don't buy it... a device that could detect what radio station you are listening to? Nope. Don't buy it.
---
I was thinking about this when I heard on Entertainment Tonight about Guinness crowning the "Most downloaded woman on the internet". And when I heard her astronomical number of 800 million downloads I thought it was incredibly inaccurate. Every man, woman and child in the US would have to download 4 of her pictures. How does Guinness come up with the final numbers? Do they even check the logs themselves? Are thumbnails viewed on a page included in the final numbers?
When I eventually went to her site (I can't even remember her name, for God's sake) she had almost no pictures of herself on it, though lots of other girls. I tried in vain to find some of her, and I was thinking to myself that the numbers were severely inflated.
While this might be an "obnoxious" question, I think a standard way of evaluating just how many hits and downloads a site gets needs to be determined, especially for awards like the Guinness Book.
-- iCEBaLM
Please, send her over! I'll gladly give her triple what she received for her last album gratis, in the name of continuing art.
[technos begins scrawling in the checkbook.. Pay to the order of: Courtney Love, Date: September 25, 2000, Amount: $3,000 and no cents]
.sig: Now legally binding!
NOTE: By reading this post, you have agreed to run around the room which you are currently in, flapping your arms, and squawking like a chicken.
Okay, I did it. Unfortunately, I was reading your post at the same moment my boss was entering the cube, and I've been fired. Under the terms of the 'technos' AUP (As amended September 12, 2000), and UCITA, you are hereby notified that you owe me $28,941,285.42.
Referencing clause two of the AUP, this number reflects the sum of my maximum earnings potential until retirement age, as well as the cost of obtaining said employment (six years of college at a major University), as well as an additional 34% transgressive penalty and a 9% compounded cost-of-living increase.
You have ten business days to remit the sum, in whole, or I will be forced to submit a class B lien request against both your holdings and those of your employer in the State of Maryland.
Clause six clearly states you indemnify me against any legal malfeasance or action, so don't even try to get cutesy with a countersuit. It has a binding compensation clause of $2,000,000.
.sig: Now legally binding!
- Yesterday, 1308 people visited your site. Of those:
- 183 weren't paying attention at all anyway.
- 22 were your competitors.
- 318 were poor college students drooling over, rather than contemplating buying, your products.
- 139 were actually looking for pornography and left your site immediately.
- 38 were webdesigners stealing your HTML code.
- 133 were here to compare your prices with the competitors. Of those, 29 decided to buy your product.
- 84 were in your target demographic, but were so stoned at the time that they didn't read your sales pitch.
- 12 people actually bought something online.
- 18 people liked your product and went out and bought some offline.
- Of those 30 people who bought something, 28 sent the URL to a total of 56 friends to show off what they had just bought. Of those friends, 3 subsequently bought something.
Ok, so where's the software which can get that data out of your server logs?
Any technology which is distinguishable from magic is not sufficiently advanced.
This is one reason why a genuine 'audience' is going to be lower than the raw logs. Local traffic and robots aren't real traffic. I could increase the raw hits on a site to almost any level, simply by throwing a few htdig processes at it. Wouldn't mean anything though.
The point isn't to prove it to yourself, it's to prove it to the advertisers who might want to put an ad on your site. You dolt.
--
I work for a major ad agency that produces the full spectrum of work, online banners and applications, broadcast and print spots, etc., so really from our perspective it's about comparable measurability. We deal in a world where the media mix can contain any number of mediums, and right now the online space is the most difficult to measure and justify to our clients. This isn't so much about what
I come from a good (read: more than five years :) ) background in the interactive territory, and I've gotten pretty used to the issues of measurability on the internet. The reality is that, for those of us creating work online, we've gotten overly accustomed to the nuances of online and forget too often to explain it all over again. There's also no major player that will admit that measurability across sites and users is nothing more than a statistical crap-shoot. I don't know why none of them will admit this -- certainly the polling that's done by Nielsen and the like is nothing more than statistical projections, and really it's a lot better to have something imperfect rather than nothing at all.
In reality, our clients still really don't understand why these numbers are so different and then question our recommendations based on what they read. It challenges our reputation and affects the trust the clients typically feel in our creative or media teams. Broadcast and print, as well as the other "offline" mediums, really then have one big advantage: those mediums have been in use long enough that our clients no longer ask the questions of "how can we justify those reach numbers" or "sure I see what you're saying, but my other consultant says that you're only reaching half that audience with that commercial."
So, maybe the challenge really lies with each of these "measurement" firms not admitting that they could be wrong. Maybe it's that the sites that are polled are financially incented to inflate their numbers to justify acquisition or second-round financing. Maybe it's that the technology exists to perfectly track a user's path anywhere, anytime but one of the first "features" in the browser was anonymity. Maybe it's the convergence of all of these different pieces at the same time (which is most likely the case).
Sad. The interactive space has such opportunity to get around lofty advertising and blink-tag style direct marketing. But unless we can justify the funds, apportioned largely based on reach to the market, we won't end up with the type of experience marketing that actually adds value to those of us online.
---- Please be nice in case my Slashdot karma ~= my real life karma.
- Have them pull the banner from your server, not theirs - never ever let them put your ad banner on their server. Do a test ad run with them, then analyze your own server logs. You'll be able to see if your banner was really pulled, say, 10K times or if they quit showing it after far fewer impressions. I've caught several places shorting me. You can expect some discrepancies due to caching and other issues, but if you're supposed to get 10K impressions and the image only gets served 2K times, consider it a lesson learned and advertise somewhere else.
- If you want proof of their traffic claims, ask them to embed a 1x1 GIF from your server (or one of those little FastCounters set to 1x1 size) on their page. Check your own logs, or view the FastCounter in full size, to see if they're really getting the traffic they say they are. Most one-man websites will be happy to do this when faced with the chance to gain you as an advertising customer; but don't expect Excite et al to bend over for you like this.
- Whenever possible, purchase ads by click-through, not CPM. Click-throughs will cost you more, but I'd rather get 1K guaranteed clicks than 10K ignored impressions.
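For the first tip, here is a rough sketch of what "analyze your own server logs" could look like. It assumes your server writes Apache-style combined logs and that the banner lives at `/ads/banner.gif`; the path and the function name are made up for illustration, so adjust them to whatever you actually serve.

```python
import re
from collections import Counter

# Combined Log Format: host ident user [time] "request" status bytes "referer" "agent"
LINE_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)


def banner_impressions(log_lines, banner_path="/ads/banner.gif"):
    """Count banner fetches that were served, grouped by the referring page.

    banner_path is a hypothetical location, not taken from the post.
    """
    counts = Counter()
    for line in log_lines:
        m = LINE_RE.match(line)
        if not m or m.group("path") != banner_path:
            continue
        # 200 = image sent, 304 = browser revalidated its cached copy.
        if m.group("status") not in ("200", "304"):
            continue
        counts[m.group("referer") or "-"] += 1
    return counts
```

Feed it the lines of your access log and compare the total against the impressions you were billed for. As the post says, caching will hide some views, so expect the count to run a bit low rather than match exactly.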
Shaun
Thanks to the War on Drugs, it's easier to buy meth than it is to buy cold medicine!
- Make web application code open and available for audit in order to prevent invalid/illegal logging.
- Cryptographically sign the logs at periodic intervals and/or when the applications are stopped and started. This will help prevent tampering. Even encrypting the logs so that only particular individuals can access them might be suitable.
- Use W3C standard log file formats.
- Hire a reputable, independent auditor to validate your metrics at regular intervals.
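The periodic-signing item could be sketched roughly like this. It is one possible approach, not an established scheme: each batch of log lines gets an HMAC that also covers the previous batch's signature, so editing an old batch breaks every signature after it. The key name and both function names are invented for the example, and it only helps if the key is held by someone other than the site (the auditor, say).

```python
import hashlib
import hmac

# Hypothetical secret; it would need to be held by the auditor, not the site.
AUDIT_KEY = b"example-key-not-for-production"


def sign_batch(lines, prev_sig, key=AUDIT_KEY):
    """Return a hex HMAC-SHA256 over the previous signature plus this batch."""
    mac = hmac.new(key, prev_sig.encode("ascii"), hashlib.sha256)
    for line in lines:
        mac.update(line.encode("utf-8"))
        mac.update(b"\n")
    return mac.hexdigest()


def verify_chain(batches, sigs, key=AUDIT_KEY):
    """Recompute every signature in order; an edited batch fails the check."""
    if len(batches) != len(sigs):
        return False
    prev = ""
    for lines, sig in zip(batches, sigs):
        expected = sign_batch(lines, prev, key)
        if not hmac.compare_digest(expected, sig):
            return False
        prev = sig
    return True
```

Signing at each rotation or restart, as suggested above, would mean calling `sign_batch` with the lines written since the last signature and storing the result somewhere the site cannot rewrite.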
What's all the fuss about?
Rob.
See the "..for smart people" banners Wired runs here? Look elsewhere guys.
Server logs can tell you a variety of things, but I don't think they're useful for marketing purposes except to the owner of the server, and not so much for advertising. I run a small site that gets about 100 unique visitors a day and about 25 regulars. Using the logs and parsing out the data, I can determine that almost all of the people who visit my site stick around for a little while, but don't come back later. At least, that's what the logs say. I can also see the referring site, which tells me where any advertising should be focused, as well as whether someone actually clicked on a link or entered the URL directly (or from bookmarks), which would indicate that the user has visited before.
Of course, any user on a dialup connection will probably have a different host/IP the next time they visit, and the logs will show them as a separate user. AOL's proxy is especially bad, as the host will change EVERY TIME the user makes another hit on the site, which makes it very difficult to track. Cookies and user accounts would be much more useful for determining exactly how many visitors you have and how many of them visit frequently. However, I still believe that this information is really only useful to the server operator and not to someone looking to advertise on the site.
Marketing as it stands should probably be a trial and error operation. Spend some money and see what happens. When I ran a business several years ago I tried advertising in a variety of different places. Ads for computer sales got practically no response from a computer magazine but got a LOT of response from a simple 4 line classified ad in the newspaper. Sometimes you just have to throw some money around and see what you get back. Yes, there is some risk, and yes, you will probably lose some money before finding a medium that works well for you, but that's the name of the game.
-Restil
Play with my webcams and lights here
What if a third-party company, independent of both the advertiser and the web hoster, were to set up a box through which all internet traffic to the server was transparently passed? The third party logs the traffic to determine if his logs match what the hoster is claiming. The advertiser can trust the third party because he hires one he trusts to provide this service for him.
The hoster's people can't access the box because it is literally black-boxed (locked up, no physical access, and no knowledge of the logins/passwords).
The third party logger can remotely access his box, download logs or whatever and provide that info to the advertiser. The advertiser can then check the logs of the hoster and compare them to the third party's (aka the verifier's). If the verifier's logs match the hoster's, you know the data is somewhat accurate (at least as accurate as these things can be).
I mean, Nielsen does this with those boxes they give to their test families, so why can't some enterprising third-party verification company (hmmmmm?) do the same with web hosts?
This looks like a nice little niche market for exploitation and mucho money to be made off of. I mean you write a few scripts to keep control over your logs and to send the logs back to a central server that formats this stuff into nice pretty print outs for the suits to drool over at their next board meeting.
Just a thought...
The only effective measurement of web traffic is to have volunteers use a special proxy that reports which sites the user visits back to a server, and to generate the numbers from there. That is exactly how the Nielsen boxes do it for television, which unfortunately means the same problems will crop up (Nielsen families tend to be concentrated around the east/west coasts, thus making shows that appeal to midwest or plains state viewers appear less popular). Additionally, getting volunteers might be a problem, as you'll most likely create a biased set by whom you select. And probably most importantly, privacy issues are more apparent for net ratings.
"Pinky, you've left the lens cap of your mind on again." - P&TB
"I can see my house from here!" - ST:
In the end, the only way to gauge how many people have read your site is to place unique or unusual information on it, and then find out who knows it.
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
I've got a problem with that right now. A site that I operate, nancies.org, serves up about 600,000 pageviews each month. But we're regularly credited by 24/7 Media (aka ContentZone) for just over 400,000. They don't give two shakes for our logs, and say that we just have to trust them. That's like the U.S. government saying, regarding Carnivore, "trust us."
BS. So I applied to Engage (formerly Flycast) last night to get our ads through them. Are they any better? I have no idea. But I do know that ContentZone is screwing us over, and that's incentive enough for me.
-Waldo
I disagree. When 300 UNIQUE visitors view my page using the same proxy, they look like one visitor. The only thing web logs can tell you is how many requests your webserver received. I call those "hits". Your definition sounds different.
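To make the proxy point concrete, here is a minimal sketch that counts both numbers from an access log. It only assumes the requesting host is the first whitespace-separated field on each line, as in Apache's common log format; the function name is made up.

```python
def hits_and_hosts(log_lines):
    """Return (number of requests, number of distinct requesting hosts)."""
    hosts = set()
    hits = 0
    for line in log_lines:
        fields = line.split(None, 1)
        if not fields:
            continue  # skip blank lines
        hits += 1
        hosts.add(fields[0])
    return hits, len(hosts)
```

Three hundred people behind one proxy come out as 300 hits from a single host, so neither number is a head count.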
Gee, I guess somebody finally figured out what the third kind of lie is!
I think that if you're investing in a web company, you should IGNORE the statistics. Go to the site. If it's lame, don't give them your money. If it rocks, go for it. How hard could that be?
These are my friends, See how they glisten. See this one shine, how he smiles in the light.
My admin has that too. In fact, we just (read: lazy) turned off any LOCAL referrers (which come from people surfing around different pages in 1 session). It's amazing what you'll sometimes find linked to your page.
For some bizarre reason, there were 15 counts of a referrer from osdn regarding the Slashdot cruiser. I checked the Slashdot cruiser web page only to find nothing linked to my site. Strange (but then again a page doesn't have to be linked. It could be that 15 people were at that page first, and then went to something on my site).
Other than that, I found the usual google returns, and plenty from articles I commented on from here.
And when I hosted a real wacky e-zine (the boulder news frenzy) which had tons of vulgar language, every pervert with keyword searches of "toilet sex", "rape", etc went to the zines on my pages, only to be disappointed to find an ASCII rag.
So take a look at those referrers. You'll be amazed what you find. Often you'll see someone on a webboard post a link to one of your pages with a positive/negative comment.
But when you advertise on the web, you can look at your web logs to gauge the audience - you don't need to trust their logs, or Media Metrix', or anyone else's.
In fact, by looking at your own logs, you can say, "Well, Yahoo sends 10,000 people a day to my site, but only 10 of those people buy anything.. Meanwhile, Slashdot sends 1,000 people, but 500 of them end up buying stuff."
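Working that example through as code, with the figures from the post (the function name is made up, and pulling the visit and purchase counts out of the logs is assumed to be done already):

```python
def conversion_rates(visits, purchases):
    """Map each referring site to purchases / visits (0.0 if it sent no visits)."""
    return {
        site: (purchases.get(site, 0) / n if n else 0.0)
        for site, n in visits.items()
    }


# Figures taken from the example above.
visits = {"Yahoo": 10_000, "Slashdot": 1_000}
purchases = {"Yahoo": 10, "Slashdot": 500}
rates = conversion_rates(visits, purchases)
# Yahoo converts 0.1% of its visitors, Slashdot 50%.
```

Per visitor sent, the smaller referrer is worth 500 times as much in this example, which is something no outside ratings service can tell you.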
So why are such ratings needed?
--
Mod up a post Rob doesn't like and you'll never mod again
Carnivore is the answer. Let the feds provide accurate and unbiased information!
As for the ease of faking server logs, not a problem (inserting standard I-am-not-a-lawyer disclaimer here): if you're using them as proof of traffic to your advertisers, write that into the contract -- then faking the server log becomes fraud, with the appropriate legal remedies available. This is not my favorite solution (especially not with anything to do with the Internet), but displaying advertisements for money is a business relationship, and can be managed as such.
Quotes from A Man for All Seasons
The web is a popularity contest because in the "new economy", it's all about marketshare. That's it. Nothing else matters. Revenue doesn't matter. Profitability doesn't matter. A business plan doesn't matter.
The premise behind this "marketshare is everything" is that, since the internet is a "new thing", the guy who takes over the most marketshare first is going to be the dominant player. People think this way because they saw what happened when Microsoft entered a new market and got the most marketshare. They dominate. They're damn near owning the whole freakin world. If they had played it more laid back, and done more honest hard work up front, they probably would have avoided this whole DOJ mess, and ten years from now, *would* 0wn the whole world. But no, the execs got lazy and greedy, and when it became apparent early on that Microsoft was only interested in putting out "good enough" products and killing off competition (instead of allowing competition to exist, albeit in a weakened state), the threat was so obvious, they had to be stopped. Act like a bunch of gangsters, get treated like gangsters.
Anyway, the investment and business community is expecting SOMEONE to take over, and they want a piece of the action, of course, so that's why people are willing to risk a few investment bucks on who they perceive will be the Genghis Khan of the Internet.
That's the "new economy" in a nutshell. And frankly, AOL/TW is "it".
These are my friends, See how they glisten. See this one shine, how he smiles in the light.
What I'm blocking out so far is:
- our company's internal IP traffic
- images
- funky robots like Keynote-Perspective that the old webmaster had let loose on our sites.
This gives us some numbers I have confidence in (even though they're 10x less than the numbers the old guy was producing through Webtrends), but I'd like to find out what others are doing for making their own web stats.
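For comparison, here is a rough sketch of a filter along those lines. Everything configurable in it is a placeholder: the internal network range, the image extensions, and the user-agent substrings are guesses for illustration, not what this poster actually uses, and it assumes Apache-style combined logs.

```python
import ipaddress
import re

# All three values below are placeholders, not anyone's real configuration.
INTERNAL_NETS = [ipaddress.ip_network("10.0.0.0/8")]
IMAGE_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png")
ROBOT_AGENT_HINTS = ("keynote", "crawler", "spider", "bot")

LINE_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]+\] "\S+ (?P<path>\S+)[^"]*" \d{3} \S+'
    r'(?: "[^"]*" "(?P<agent>[^"]*)")?'
)


def is_reportable(line):
    """True if the line is not internal traffic, an image fetch, or a known robot."""
    m = LINE_RE.match(line)
    if not m:
        return False
    try:
        addr = ipaddress.ip_address(m.group("host"))
    except ValueError:
        addr = None  # host was logged as a name, so the IP test cannot apply
    if addr is not None and any(addr in net for net in INTERNAL_NETS):
        return False
    path = m.group("path").split("?", 1)[0].lower()
    if path.endswith(IMAGE_EXTENSIONS):
        return False
    agent = (m.group("agent") or "").lower()
    return not any(hint in agent for hint in ROBOT_AGENT_HINTS)
```

Matching robots by user-agent substring is crude, since anything that lies about its agent sails through, so the resulting numbers are still an upper bound on real visitors.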
Thanks,
Steve
Why must the web be a popularity contest? At most the website itself should only be concerned about how many people visit their website so they can keep their servers up to speed. They can get this from their own logs.
Seriously, who really cares if NewsTrolls is visited more than Slashdot (just an example). The important thing is that they're getting visitors and the owners are enjoying their job.
--
Excuse me while I go "Grumpy old man". This is an old, old problem. It goes back to the days when I first started using the web. See "Why web statistics are (worse than) meaningless." It's an old article. That's the point.
In short, spiders, proxies and caches make it impossible to be accurate in measuring traffic. But everyone else is affected the same way. So your relative stats are relevant-- they just aren't hit-for-hit accurate.
What your server logs are really for is resource planning. They'll help you find out how much traffic your server is serving, which should help you plan bandwidth and hardware upgrades as needed.
the one thing I can count on is that my site doesn't (and won't) get any hits
In Soviet Russia...michael would be rotting in Siberia!
First, suppose I am using a number of web sites to promote my online store. In this case, I may be most interested in the amount of sales each site produces from click through users. For this purpose, I can simply assign a sale to a certain site. For the purposes of this discussion, I will assume that all sales can be assigned to a certain web site. At certain intervals, I can find the percent profit attributable to each site, and create a statistic with the ratio of the % profit from a site to the cost of advertising on that site. This statistic will create a valid comparison between sites.
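A minimal sketch of that statistic, assuming every sale has already been assigned to a site as described. The site names and dollar figures are invented purely to show the calculation, and the function name is made up.

```python
def profit_share_per_cost(profit_by_site, ad_cost_by_site):
    """For each site: (percent of total profit from that site) / (ad cost on that site)."""
    total = sum(profit_by_site.values())
    if total == 0:
        raise ValueError("no profit to apportion")
    return {
        site: (profit / total * 100) / ad_cost_by_site[site]
        for site, profit in profit_by_site.items()
    }


# Invented figures: site A brings 75% of profit for $150, site B 25% for $25.
profit = {"A": 300.0, "B": 100.0}
cost = {"A": 150.0, "B": 25.0}
stat = profit_share_per_cost(profit, cost)
```

Here site B scores higher per advertising dollar even though it brings in less profit overall, which is the kind of comparison the statistic is meant to expose.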
Second, suppose I am most interested in branding, as Verizon is of late. In this case, I might want to pay an external agency to monitor the sites on which I advertise. Such an agency would presumably use a consistent and statistically sound method to determine the number of eyes that have seen my brand. I can then set up a statistic with the ratio of # of eyes to the cost of advertising for each site. Again, this will create a valid comparison.
It is notable that in either case the web logs for particular sites are not clearly useful. Even if the information itself was not suspect, web logs would not be comparable between sites. It would be difficult to set up a useful statistic to compare the value of each site with respect to my product. To put it another way, the web logs for a particular site are useful to that site for generating a number of site specific statistics, but few if any of those are going to be of interest to me as a paying advertiser.
The FBI could get into the business of counting hits. I mean, they'd be reading through all the traffic anyway; they might as well do something useful with it...
-------------
The truth is out th- oh, wait, here it is...