Writing Software to Collect Click Stream Stats?
AntiPasto asks: "I am working with a small business that wants to evaluate their 'click streams'. I've investigated openstats and even commercial products like Funnel Web as being turn-key solutions, but they don't offer the sort of authenticated-user page-view detail that we're looking for. I've since decided to start a mod_usertrack implementation, but it looks like we need to write our own stuff to process this. Anyone have any experiences with tracking a user's visit?"
I just completed a semester project for a database class in which we parsed the Apache log from a web server into a database and then wrote some embedded SQL in C code to spit out statistics. One of our key goals was to define a "session" for a user rather than just track "hits" on specific pages. By creating a session we can then look at the relation between pages on the site that occur in the same session. In other words we can ask: out of all sessions, what percentage contained pages x and y (obviously expandable to x, y, z, etc., or x and z but not y, but you get the point)? We implemented the session like this:
Read a line in.
Look up the IP address and time in the session information.
IF that IP does not exist, make a new session.
IF that IP does exist:
  Is the time we just read within N minutes (you define N) of the current end time in the database for that IP's last session?
  Yes: Set the end time of that session to the time just read in.
  No: Create a new session (same IP, but the session ID is one greater than the last).
For example, if someone loads just one page from your site, they would have a row in the session information table with their IP and a session ID of 0, and their start and end times would be the same. If they load a page again within the time you set (there would be another line in the Apache log for this), the end time of that session is just set to the time of the most recent page load in that session. We have another table called PageSession where we list the IP, session ID, and page ID for all pages accessed in the session. Note: we split page content (html, htm, txt, php, etc.) and other content (jpg, gif, mpg, etc.) into different tables so we can query just html info or just picture info.
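Here is a rough Python sketch of the session steps described above. It is not our project code (that was embedded SQL in C); the function name `sessionize`, the in-memory dictionaries standing in for the database tables, and the 30-minute default are all assumptions made for illustration. It expects hits as (ip, timestamp, page) tuples already in log order.

```python
from datetime import datetime, timedelta

# Assumed timeout; the right value depends on your site.
SESSION_GAP = timedelta(minutes=30)


def sessionize(hits, gap=SESSION_GAP):
    """Group (ip, timestamp, page) hits, given in log order, into sessions.

    Returns (sessions, page_sessions):
      sessions      maps (ip, session_id) -> [start_time, end_time]
      page_sessions lists (ip, session_id, page), like the PageSession table
    """
    sessions = {}
    latest = {}          # ip -> most recent session_id for that ip
    page_sessions = []
    for ip, ts, page in hits:
        sid = latest.get(ip)
        if sid is None:
            # IP not seen before: first session gets ID 0.
            sid = 0
            sessions[(ip, sid)] = [ts, ts]
        elif ts - sessions[(ip, sid)][1] <= gap:
            # Within the gap: extend the end time of the current session.
            sessions[(ip, sid)][1] = ts
        else:
            # Too long since the last hit: same IP, next session ID.
            sid += 1
            sessions[(ip, sid)] = [ts, ts]
        latest[ip] = sid
        page_sessions.append((ip, sid, page))
    return sessions, page_sessions
```

A single page load produces a session whose start and end times are equal, which matches the example above. If your log lines can arrive out of order, sort them by timestamp first.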
Other than IP we don't "authenticate" the user. We put in place the means to try to weed out dial-up users vs. static IP users, but this is by no means well implemented as of now, since it relies on knowing the domain names of dial-up ISPs or looking for keywords like "dial" in the hostname of a request from a known ISP that offers several types of connection. With that aside, I don't know what means of authentication you have, but I don't see why your authentication couldn't be tied into (or used in place of) our use of IP to denote a specific user.
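To make the "authentication in place of IP" idea concrete, here is a small Python sketch. It assumes your log is in Apache's Common Log Format, where the third field holds the HTTP-authenticated user name (or "-" when there is none); if you log a mod_usertrack cookie or use some other scheme, the pattern would need changing. The names `visitor_key` and `looks_dialup` are made up for this example, and the dial-up check is only the crude keyword test mentioned above.

```python
import re

# Common Log Format: host ident authuser [time] "request" status size
CLF = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)


def visitor_key(line):
    """Return ("user", name) if the hit was authenticated, else ("ip", host).

    Returns None when the line does not look like Common Log Format.
    """
    m = CLF.match(line)
    if m is None:
        return None
    user = m.group("user")
    if user != "-":
        return ("user", user)
    return ("ip", m.group("host"))


def looks_dialup(hostname, keywords=("dial",)):
    """Crude guess: does the resolved hostname contain a dial-up keyword?"""
    name = hostname.lower()
    return any(word in name for word in keywords)
```

The key returned by `visitor_key` could then be used wherever the session logic uses the IP address.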
Our initial goal for the project was just to look at date and time info about the sessions, with the "page x, y, and not z" (hehe, "not z") reports available as an extension of our design. Part of the problem with looking for a correlation between pages visited on your site during a session (and referral URL stuff, too) is that simple data mining algorithms usually have a threshold in there to look for "interesting" relationships. You would supply an expected percentage for the number of sessions involving certain items. For example, if your main page links only to a sub page then you'd expect a high degree of relationship between those, but if a page buried deep down links to no pages it might be interesting to see how a person got to it. A tricky thing this threshold is, but the info you get about unexpected things is amazing.
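As a sketch of the "x and y but not z" report and the threshold idea, something like the following Python would do. It takes (ip, session_id, page) rows like our PageSession table. The function names and the 0.10 default threshold are assumptions for illustration, and comparing observed against expected share by a simple absolute difference is just one possible notion of "interesting", not what a real data mining package does.

```python
from collections import defaultdict


def pages_by_session(page_sessions):
    """Collapse (ip, session_id, page) rows into a set of pages per session."""
    pages = defaultdict(set)
    for ip, sid, page in page_sessions:
        pages[(ip, sid)].add(page)
    return pages


def share_of_sessions(pages, include, exclude=()):
    """Fraction of sessions containing every page in include and none in exclude."""
    include, exclude = set(include), set(exclude)
    if not pages:
        return 0.0
    matches = sum(
        1 for seen in pages.values()
        if include <= seen and not (exclude & seen)
    )
    return matches / len(pages)


def is_interesting(observed, expected, threshold=0.10):
    """Flag a relationship whose observed share is far from what you expected."""
    return abs(observed - expected) > threshold
```

For example, `share_of_sessions(pages, ["/x", "/y"], exclude=["/z"])` gives the fraction of sessions that hit x and y but never z, which you can then compare with the percentage you expected.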
I recall an anecdote about supermarkets looking for unexpected item sales: they found that a higher than expected percentage of people were buying beer and diapers on a Friday. It is suspected that men were doing the shopping at this time and had to get some things for the family, and they had their own priorities... Supermarkets have it easy since people tend to buy all of their items in one transaction. The web is a gimme-now, one-at-a-time type of thing, so defining a session is also an art, and I'd suspect it would vary greatly depending on your site's content, target audience, bandwidth, etc.
Well, I guess I should end my rambling as tomorrow evening I have to give a presentation on this project and end my semester (finals are next week). Hope there is some nugget of info in there that helps!
Wheeeee
Go have a look at phpopentracker (http://www.phpopentracker.de). It's very useful; I've used it on my own site.
There's an article about phpopentracker at trafficmanager (http://www.trafficmanager.co.uk/reviews/article.php3?articleID=7&show=1).
There are also discussion forums for this kind of thing at http://forums.trafficmanager.co.uk
Take a look at Sawmill. It is a first-rate log analyzer that includes path analysis. It runs on almost any platform and understands dozens of log formats out of the box. It is very flexible, so you should be able to tweak it for any specific fields you want in your reporting.
Check out www.rkimball.com, and also other material on data warehousing, if you are going to really get into all this.
Ralph Kimball has done massive amounts of work on data warehousing and has a book on looking at clickstreams with data warehouse techniques.