Writing Software to Collect Click Stream Stats?
AntiPasto asks: "I am working with a small business that wants to evaluate their 'click streams'. I've investigated openstats and even commercial products like Funnel Web as being turn-key solutions, but they don't offer the sort of authenticated-user page-view detail that we're looking for. I've since decided to start a mod_usertrack implementation, but it looks like we need to write our own stuff to process this. Anyone have any experiences with tracking a user's visit?"
I just completed a semester project for a database class where we parsed the Apache log from a web server into a database and then wrote some embedded SQL in some C code to spit out statistics. One of our key things was to define a "session" for a user rather than just track "hits" on specific pages. By creating a session we can then look at the relation of pages on the site occurring in a session. In other words we can ask, out of all sessions, what percentage contained pages x and y (obviously expandable to x, y, z, etc... or x and z but not y, but you get the point). We implemented the session like this:
Read a line in.
Look up the IP address and time in the session information.
IF that IP does not exist, make a new session.
IF that IP does exist:
Is the time we just read within (you define) minutes of the current end time in the database for the last session?
Yes: Set end time of this session to the time just read in.
No: create a new session (same IP, but session ID is just one greater than the last)
For example, if someone loads just one page from your site, then they would have a row in the session information table with their IP and a session ID of 0, and their start and end time would be the same. If they load a page again within the time you set (there would be another line in the Apache log for this), then the end time of that session is just set to the time of the most recent page load in that session. We have another table called PageSession where we list the IP, session ID, and page ID for all pages accessed in the session. Note: we split html, htm, txt, php, etc... content and other content (jpg, gif, mpg, etc...) into different tables so we can query just html info or just picture info.
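The steps above can be sketched in a few lines of Python. This is only a rough sketch, not our actual project code (that was embedded SQL in C): the function name `sessionize`, the in-memory dictionaries standing in for the session and PageSession tables, and the 30 minute default timeout are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

def sessionize(hits, timeout=timedelta(minutes=30)):
    """Group log hits into sessions keyed by (ip, session_id).

    hits: iterable of (ip, timestamp, page) tuples in log order.
    timeout: the "(you define) minutes" gap; 30 is an assumed default.
    Returns (sessions, page_sessions), where sessions maps
    (ip, session_id) -> [start_time, end_time] and page_sessions is a
    list of (ip, session_id, page) rows, like the PageSession table.
    """
    sessions = {}
    last_id = {}        # ip -> most recent session id for that ip
    page_sessions = []
    for ip, ts, page in hits:
        sid = last_id.get(ip)
        if sid is None:
            # IP not seen before: first session gets ID 0
            sid = 0
            sessions[(ip, sid)] = [ts, ts]
        elif ts - sessions[(ip, sid)][1] <= timeout:
            # Within the timeout: extend the end time of the session
            sessions[(ip, sid)][1] = ts
        else:
            # Too late: new session, ID one greater than the last
            sid += 1
            sessions[(ip, sid)] = [ts, ts]
        last_id[ip] = sid
        page_sessions.append((ip, sid, page))
    return sessions, page_sessions
```

A single page load yields a session whose start and end times are equal, which matches the example above.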
Other than IP we don't "authenticate" the user. We put in place the means to try to weed out dial-up users vs. static IP users, but this is by no means well implemented as of now, since it relies on knowing the domain names of dial-up ISPs or looking for keywords like "dial" in the hostname when the host belongs to a known ISP that offers several types of connection. With that aside, I don't know what means you use for your authentication, but I don't see why your authentication couldn't be tied into (or used in place of) our use of IP to denote a specific user.
Our initial goal of the project was just to look at date and time info about the sessions, with the extension of page x, y, and not z (hehe, "not z") reports available from our design. Part of the problem with looking for a correlation between pages visited on your site during a session (and referral URL stuff, too) is that simple data mining algorithms usually have a threshold in there to look for "interesting" relationships. You would supply an expected percentage for the number of sessions involving certain items. For example, if your main page links only to a sub page then you'd expect a high degree of relationship between those, but if a page buried deep down links to no pages it might be interesting to see how a person got to it. A tricky thing this threshold is, but the info you get about unexpected things is amazing.
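To make the "x and y but not z" report and the threshold idea a bit more concrete, here is a small Python sketch. It assumes rows shaped like (ip, session_id, page), as in a PageSession table; the names `share_of_sessions` and `is_interesting` are made up for this example, and the simple "differs from expected by at least the threshold" test is just one possible reading of what such a threshold could mean.

```python
from collections import defaultdict

def share_of_sessions(page_sessions, include, exclude=()):
    """Percentage of sessions containing every page in `include`
    and none of the pages in `exclude`.

    page_sessions: iterable of (ip, session_id, page) rows.
    """
    pages_by_session = defaultdict(set)
    for ip, sid, page in page_sessions:
        pages_by_session[(ip, sid)].add(page)
    if not pages_by_session:
        return 0.0
    include, exclude = set(include), set(exclude)
    matches = sum(
        1 for pages in pages_by_session.values()
        if include <= pages and not (exclude & pages)
    )
    return 100.0 * matches / len(pages_by_session)

def is_interesting(observed_pct, expected_pct, threshold_pct):
    """Flag a relationship whose observed share differs from the
    share you expected by at least the threshold (assumed rule)."""
    return abs(observed_pct - expected_pct) >= threshold_pct
```

You would call `share_of_sessions(rows, ["x", "y"], exclude=["z"])` and then compare the result against whatever percentage you expected for those pages.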
I recall an anecdote about supermarkets looking for unexpected item sales: they found out that there was a higher than expected percentage of people buying beer and diapers on a Friday. It is suspected that men were doing the shopping at this time and had to get some things for the family, and they had their own priorities... Supermarkets have it easy since people tend to buy all of their items in one transaction. The web is a gimme-now-one-at-a-time type thing, so defining a session is also an art, and I'd suspect it would vary greatly depending on your site's content, target audience, bandwidth, etc...
Well, I guess I should end my rambling as tomorrow evening I have to give a presentation on this project and end my semester (finals are next week). Hope there is some nugget of info in there that helps!
Wheeeee
Otoh, if you use a single driver script/servlet/jsp, and dynamically produce content based on form variables, then your driver must handle the reporting, because the server log isn't going to report anything except the base URL for every request. In this case, your driver needs to log what is happening before it serves up the appropriate content.
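As a rough illustration of a driver doing its own logging, here is a Python sketch that builds one tab-delimited record per request from the form variables. The function name `format_view_record`, the field order, and the idea of passing in the authenticated user name are all assumptions for the example; a real driver would write the returned line to its own log file before serving the content.

```python
from datetime import datetime
from urllib.parse import urlencode

def format_view_record(user, script, params, when):
    """Build one log line describing what the driver is about to serve.

    user:   authenticated user name (or "-" if unknown)
    script: the base URL the server log would show for every request
    params: dict of form variables that select the actual content
    when:   datetime of the request
    """
    # Sort the parameters so identical views always produce identical text
    query = urlencode(sorted(params.items()))
    return "\t".join([when.isoformat(), user, script, query])
```

Because the form variables end up in the record, two requests for the same base URL but different content can be told apart later.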
go have a look at phpopentracker (http://www.phpopentracker.de) - it's very useful - I've used it on my own site.
There's an article about phpopentracker at trafficmanager (http://www.trafficmanager.co.uk/reviews/article.php3?articleID=7&show=1).
There are also discussion forums for this kind of thing at http://forums.trafficmanager.co.uk
Take a look at Sawmill. It is a first rate log analyzer that includes path analysis. It runs on almost any platform and out of the box understands dozens of log formats. It is very flexible, so you should be able to tweak it for any specific fields you want in your reporting.
check out www.rkimball.com and also stuff on data warehousing if you are going to really get into all this.
He's done massive amounts on data warehousing and has a book on looking at clickstreams with data warehouse techniques.
You should take proxies more into account. Many large companies shove tens of thousands of users behind a handful of IPs (or one!). Some colleges do as well. And, oh yeah - AOL. 10 million people or whatever, and they all use a handful of IPs. Don't forget the cable modem companies, the DSL companies, and all the little ISPs that encourage (or force) users to use their proxies.
If you rely simply on IP, not only would your sessions not make any sense in any kind of "he went here, then there, then there" kind of sense, but you'd vastly underestimate the number of users/sessions.
One issue that I have noticed is that unless you are serving scripts, hostnames in the logs are becoming less relevant due to caching and proxies. However, if you do track host names then getting a 100% (ok, 99%) accurate list is hard -- if you wait a few days, weeks or months to analyse your logs, then some of the IPs may have changed owners. If you try to let the HTTP daemon do the lookup, then you suffer a drop in performance and the very first lookup or two often fails anyway.
If you use Apache (I have no experience with the others), then it is easy to pipe the log output directly to a script.
You can even make it tab-delimited or whatever it takes to be easier to parse. Myself, I'm about to experiment with logrotate, rotatelogs, cronolog and mod_mylog. mod_mylog puts the log output straight into your RDBMS and even claims to cache records if the RDBMS is temporarily unavailable.
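For what it's worth, a script on the receiving end of such a pipe could look something like the Python sketch below. The field list, the script path, and the function names are assumptions for illustration; the Apache directive shown in the comment is only an example of a tab-delimited format, so adjust both sides to match whatever you actually log.

```python
import sys
from datetime import datetime

# Assumed Apache configuration (example only, path is hypothetical):
#   CustomLog "|/usr/local/bin/clicklog.py" "%h\t%t\t%r\t%{Referer}i\t%{User-Agent}i"
FIELDS = ("host", "time", "request", "referer", "agent")

def parse_line(line):
    """Split one tab-delimited log line into a dict, or return None
    if the line does not have the expected number of fields."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != len(FIELDS):
        return None
    record = dict(zip(FIELDS, parts))
    # %t is written like [10/Oct/2000:13:55:36 -0700]
    record["time"] = datetime.strptime(record["time"], "[%d/%b/%Y:%H:%M:%S %z]")
    return record

def read_records(stream):
    """Yield parsed records from a stream, skipping malformed lines."""
    for line in stream:
        record = parse_line(line)
        if record is not None:
            yield record

# When run as the piped program you would loop over the server's output:
#   for record in read_records(sys.stdin): ...store or print the record...
```

Skipping malformed lines instead of crashing matters here, since a piped log program that dies has to be restarted by the server.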
granted, i don't get that many (2-3,000 a month) and a substantial amount are people i know, but it sure is fun to do.
Setting a cookie will let you pinpoint that a given instance of netscape is viewing your site. with lack of cookies you can tell that a given ip is viewing your site, but then proxies will get you in trouble and may get you some nice email (that one deals with those pesky nipr.mil people). Using ip, timestamps and useragent can get you a more accurate pinpointing, but still not exact.
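Here is one way that fallback could be sketched in Python: use the cookie when there is one, otherwise combine IP and user agent into a key. The function name `visitor_key`, the key prefixes, and the use of a SHA-1 digest to keep the key short are all assumptions for the example, and as said above the fallback is still not exact (two people behind one proxy with the same browser will collide).

```python
import hashlib

def visitor_key(ip, user_agent, cookie_id=None):
    """Return a key that identifies a visitor as well as the data allows.

    A cookie pins down one browser instance, so it wins when present.
    Without one, IP plus user agent is a better guess than IP alone.
    """
    if cookie_id:
        return "cookie:" + cookie_id
    raw = (ip + "\n" + user_agent).encode("utf-8")
    return "ipua:" + hashlib.sha1(raw).hexdigest()[:16]
```

The prefix makes it easy to report later how many visitors were identified by cookie and how many only by the weaker guess.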
I like to set an invisible image at the bottom of my website with a special id on it. This image gets changed at onunload which, for those who use javascript and unload the page, will tell me how long they viewed any given page. (some of this info is also presented at the bottom of every page).
If you really wanted to and could afford a fully dynamic site you could have every single link called with a ?sessionid at the end of it (like http://www.example.com/?1234) and have this reset if the referer wasn't from your site (this could cover people copying and pasting that link to someone else) and then parse through your logs afterwards. But that could get annoying.
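A minimal Python sketch of that reset rule might look like this. The function names, the exact host comparison, and the random 16 character IDs are assumptions for illustration; a real site would also have to rewrite every link in every page it serves.

```python
import secrets
from urllib.parse import urlparse

def session_for_request(query_id, referer, own_host):
    """Decide which session ID a request belongs to.

    query_id: the ID found after "?" in the requested URL, or None
    referer:  the Referer header value, or None
    own_host: this site's host name, e.g. "www.example.com"
    Returns (session_id, is_new).
    """
    came_from_us = bool(referer) and urlparse(referer).hostname == own_host
    if query_id and came_from_us:
        return query_id, False
    # No ID, or the link was followed from elsewhere (for example a
    # pasted link): hand out a fresh ID instead of reusing the old one.
    return secrets.token_hex(8), True

def tag_link(path, session_id):
    """Append the session ID to a link, as in /?1234."""
    return f"{path}?{session_id}"
```

Since the ID travels in the URL, it shows up in the ordinary server log and can be picked out when the logs are parsed afterwards.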
as far as the logs go, a friend of mine has apache log directly to mysql which facilitates his parsing. as another poster mentioned, sniffing traffic can help alleviate strain on your webserver - a nice openbsd bridging firewall will do the trick. (checking your firewall logs is handy in other ways - i have some hidden "easter eggs" on my site which appear to be exploits on my box - i check who gets to those pages and then who tries to connect to port XX and see what ip's match up - nice little stats)
One time i wondered what would happen if i matched the ips from my mail headers to the ips from my weblogs. It turns out that around 1% of the unique ips in my mail headers also appeared in the weblogs, which means that with pretty fair certainty i knew who was browsing my site. But this mainly works due to the personal nature of my site.
On the other hand, if you've access to lastlogs, query logs, and weblogs you can really start identifying local users of your website. i work at a local college and can learn a lot about a particular viewer by seeing if the same ip is logged in to a given server, or by looking at the query log and seeing what else they've done dns lookups for. Add in a messaging system and you can freak people out. (i also use this to freak out people who search images.google.com for 'breast' and get to my site (i'm a photographer too)).
one other thing i find useful is to keep track of who searches for robots.txt. this can be an indication that someone is a robot or proxy. It also helps me present special information to search engines, allowing them to know of (and thus index) a new page the next time they get to my site (i put a couple special links if you access robots.txt)
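Something like the following Python sketch would do the robots.txt bookkeeping. It assumes the log has already been reduced to a list of (ip, path) pairs; the function names are made up, and treating every robots.txt requester as a robot or proxy is only a hint, not proof.

```python
def robots_requesters(hits):
    """Return the set of IPs that asked for /robots.txt.

    hits: list of (ip, path) pairs taken from the web log.
    """
    return {ip for ip, path in hits if path.split("?", 1)[0] == "/robots.txt"}

def split_hits(hits):
    """Separate hits from likely robots/proxies from everything else."""
    bots = robots_requesters(hits)
    likely_people = [hit for hit in hits if hit[0] not in bots]
    likely_robots = [hit for hit in hits if hit[0] in bots]
    return likely_people, likely_robots
```

Running the rest of your stats on `likely_people` only keeps crawler traffic from inflating the session counts.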
a friend of mine runs a journal/bbs website and was wondering about tracking his users when they create different accounts. We are thinking about implementing something similar to my spam identification to identify similar writing styles and possibly the same people in different accounts.
Once you've gathered up some data you'll want to look at it in a nice way. you could use excel or you could create some really nice webmaps (that site also has links to similar mapping projects).
finally a word of advice, if you put up a page of your refer logs, include that as disallowed in your robots.txt or you will get a lot of strangely referred people.
oh, and keep in mind no one method will be 100% accurate, but a combination of methods can get you close.
and there was an article not too long ago about MIT (i think) doing studies into how people view webpages (that is, if the mouse is over to the side of the screen then the person is most likely reading the page, if the mouse is in the middle of the page then probably not, etc).
maybe forcing every viewer into a frameset and then tracking changes in the subframes is a viable option. associate a frame change to a hidden image change with an encoded identifier.
ok, that's it for now.
-f
www.blackant.net