Writing Software to Collect Click Stream Stats?

Semester Project... by heliocentric · 2001-12-05 20:59 · Score: 4, Informative

I just completed a semester project for a database class in which we parsed the Apache log from a web server into a database and then wrote some embedded SQL in C code to spit out statistics. One of our key goals was to define a "session" for a user rather than just track "hits" on specific pages. With sessions we can look at how pages on the site relate to one another within a session. In other words, we can ask what percentage of all sessions contained pages x and y (obviously expandable to x, y, z, etc., or x and z but not y, but you get the point). We implemented the session like this:

Read a line in.
Look up the IP address and time in the session information.
IF that IP does not exist, make a new session.
IF that IP does exist:
Is the time we just read within (you define) minutes of the current end time in the database for the last session?
Yes: Set end time of this session to the time just read in.
No: create a new session (same IP, but session ID is just one greater than the last)
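
The steps above could be sketched roughly as follows in Python. This is not the project's code (that was embedded SQL in C). The 30-minute timeout, the in-memory dict standing in for the session table, and the names `Session` and `assign_session` are all assumptions for illustration, and the sketch assumes log lines arrive in time order.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumed value; the post leaves the timeout up to you.
SESSION_TIMEOUT = timedelta(minutes=30)


@dataclass
class Session:
    ip: str
    session_id: int
    start: datetime
    end: datetime


def assign_session(sessions, ip, when, timeout=SESSION_TIMEOUT):
    """Return the session a hit belongs to, extending or creating it.

    `sessions` maps each IP to its list of sessions, oldest first.
    """
    history = sessions.setdefault(ip, [])
    if history:
        last = history[-1]
        if when - last.end <= timeout:
            # Hit falls inside the window: move the end time forward.
            last.end = max(last.end, when)
            return last
    # Unknown IP, or the last session timed out: start a new session
    # whose ID is one greater than the previous one for this IP.
    new = Session(ip, len(history), when, when)
    history.append(new)
    return new
```

A single page load gives a session with ID 0 and identical start and end times; a second hit ten minutes later only moves the end time.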

For example, if someone loads just one page from your site, they would have a row in the session information table with their IP and a session ID of 0, and their start and end time would be the same. If they load a page again within the time you set (there would be another line in the Apache log for this), the end time of that session is just set to the time of the most recent page load in that session. We have another table called PageSession where we list the IP, session ID, and page ID for all pages accessed in the session. Note: we split page content (html, htm, txt, php, etc.) and other content (jpg, gif, mpg, etc.) into different tables so we can query just html info or just picture info.

Other than IP we don't "authenticate" the user. We put in place the means to try to weed out dial-up users vs. static IP users, but this is by no means well implemented as of now, since it relies on knowing the domain names of dial-up ISPs or looking for keywords like "dial" in the hostname of a hit from a known ISP that has several types of connection. With that aside, I don't know what means you use for your authentication, but I don't see why it couldn't be tied into (or used in place of) our use of IP to denote a specific user.

Our initial goal for the project was just to look at date and time info about the sessions, with the "page x, y, and not z" (hehe, "not z") reports available as an extension of our design. Part of the problem with looking for a correlation between pages visited on your site during a session (and referral URL stuff, too) is that simple data mining algorithms usually have a threshold in there to look for "interesting" relationships. You would supply an expected percentage for the number of sessions involving certain items. For example, if your main page links only to a sub page then you'd expect a high degree of relationship between those, but if a page buried deep down links to no pages it might be interesting to see how a person got to it. A tricky thing this threshold is, but the info you get about unexpected things is amazing.
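
A minimal sketch of the "x and y but not z" report and the threshold idea, in Python rather than the project's SQL. Each session is modelled as a set of page names; the function names and the 10-point cutoff are assumptions, not values from the project.

```python
def share_of_sessions(sessions, include, exclude=()):
    """Fraction of sessions containing every page in `include`
    and no page in `exclude`. `sessions` is an iterable of page sets."""
    sessions = [set(pages) for pages in sessions]
    if not sessions:
        return 0.0
    include, exclude = set(include), set(exclude)
    hits = sum(
        1 for pages in sessions
        if include <= pages and not (exclude & pages)
    )
    return hits / len(sessions)


def is_interesting(observed, expected, threshold=0.10):
    """Flag a relationship whose observed share differs from the share
    you expected by more than `threshold` (an assumed cutoff)."""
    return abs(observed - expected) > threshold
```

With four sessions `{x, y}`, `{x, y, z}`, `{x}` and `{z}`, half contain both x and y, and a quarter contain x and y but not z. Picking the expected share and the cutoff is the tricky part the post describes.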

I recall an anecdote about supermarkets looking for unexpected item sales: they found that a higher than expected percentage of people bought beer and diapers on a Friday. It is suspected that men were doing the shopping at this time and had to get some things for the family, and they had their own priorities... Supermarkets have it easy since people tend to buy all of their items in one transaction. The web is a gimme-now, one-at-a-time type of thing, so defining a session is also an art, and I'd suspect it would vary greatly depending on your site's content, target audience, bandwidth, etc.

Well, I guess I should end my rambling as tomorrow evening I have to give a presentation on this project and end my semester (finals are next week). Hope there is some nugget of info in there that helps!

--
Wheeeee

Re:Semester Project... by GregWebb · 2001-12-06 07:00 · Score: 2

(Disclaimer - I write in ASP for a living, so storing a session is easy, just declare an ID in your global.asa file and reference that... No, I don't like much about it but it keeps me solvent)

One thing jumps out at me from that - proxy servers. We've got 10 or so users sharing an IP behind a proxy here. Your technique won't differentiate between us...

What about just setting an ID cookie with a short timeout but making every page bump the expiry up? OK, you've got a job identifying images that way, but it handles proxy servers and you can still use IP addresses for them.
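
A rough sketch of the sliding-expiry cookie idea, written in Python with the standard `http.cookies` module rather than ASP. The cookie name `sid`, the 30-minute lifetime and the function name are assumptions.

```python
import uuid
from http.cookies import SimpleCookie


def session_cookie_header(cookie_header, max_age=1800):
    """Reuse the visitor's ID if the request carried one, otherwise mint
    a new one, and build a Set-Cookie value that pushes the expiry out
    again. Returns (session_id, set_cookie_value)."""
    jar = SimpleCookie(cookie_header or "")
    sid = jar["sid"].value if "sid" in jar else uuid.uuid4().hex
    out = SimpleCookie()
    out["sid"] = sid
    out["sid"]["max-age"] = max_age  # short timeout, renewed on every page
    out["sid"]["path"] = "/"
    return sid, out["sid"].OutputString()
```

Every page response would send the returned value in a `Set-Cookie` header, so the ID only expires once the visitor has been idle for longer than `max_age` seconds.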

--
Greg
(Inside a nuclear plant)
Aaaarrrggh! Run! The canary has mutated!

Re:Semester Project... by heliocentric · 2001-12-06 07:25 · Score: 2

Our goal was to not use cookies. Using the Apache log is just a start; our goal is to drop this and use a packet sniffer. The idea is that you can run the program on your web server, but if you are in an environment where you don't want your webserver tied up stuffing things into MySQL and running some nasty queries, you could have another machine on the same hub as the server do the job. Cookies might get around your particular proxy because maybe you and your colleagues are nice enough to allow cookies, but an increasing number of people are turning them off (at least in a limited capacity). You're right, our technique won't distinguish between you behind your firewall, but if you all had distinct IPs and had cookies blocked, our way would get you and cookies wouldn't.

Here's another way we looked at it. You have 10 people sharing the proxy, suppose 2 or 3 hit my site at the same time, you would be lumped together into a single session. Fine, then sessions do not mean a specific user hit the site, just a specific IP - that's all we really want. We know that IP could mean 10 people seeing things at a presentation, 10 behind a proxy, 10 who dial in, click, hang up, someone else dials into same IP, click, hang up... We know there's going to be overlap and error, but in the grand scheme of things I'd think it'd all work out in the wash.

We just wanted a simple and as non-invasive as possible means to gather information. We see cookies as invasive (not that I have a problem with them, but since some do, we have to treat them as such) and as relying on the user to be as honest as possible.

This "problem" of proxies and such is actually eliminated entirely under the conditions the "Ask /." submitter described - "authenticated-user", which I presume to be some sort of login that can be linked to the session transactions. If you log in to a site and I tie that login to my database updates, then my UserSession table would have IP/ID/StartDate/StartTime/EndTime AND LoginID. Now you and your buddies using the same IP would (presumably) be using different logins. The people who use the dial-in only to access the site, hang up, and have someone quickly jump on and hit the same site using the same IP would, again I presume, authenticate themselves.

--
Wheeeee

Depends on the implementation by imrdkl · 2001-12-05 21:30 · Score: 2

If you've got a site that uses only GET requests for specific static files, or if you use POST to individual CGI/Servlet/JSP, then things are pretty easy using any number of logparsers or database tools (if you're logging to a DB). Many implementations of this type also pass arguments appended to the request string, which puts more information in the log.

Otoh, if you use a single driver script/servlet/jsp, and dynamically produce content based on form variables, then your driver must handle the reporting, because the server log isn't going to report anything except the base URL for every request. In this case, your driver needs to log what is happening before it serves up the appropriate content.

Try this... by Anonymous Coward · 2001-12-05 22:03 · Score: 2, Informative

go have a look at phpopentracker (http://www.phpopentracker.de) - it's very useful - I've used it on my own site.

There's an article about phpopentracker at trafficmanager (http://www.trafficmanager.co.uk/reviews/article.php3?articleID=7&show=1).

There are also discussion forums for this kind of thing at http://forums.trafficmanager.co.uk

Don't re-invent the wheel by tdyson · 2001-12-06 00:56 · Score: 2, Informative

Take a look at Sawmill. It is a first rate log analyzer that includes path analysis. It runs on almost any platform and out of the box understands dozens of log formats. It is very flexible, so you should be able to tweak it for any specific fields you want in your reporting.

Ralph Kimball's the guy here by martin · 2001-12-06 05:00 · Score: 3, Informative

He's done massive amounts on data warehousing and has a book on looking at clickstream data with data warehouse techniques.

check out www.rkimball.com and also stuff on data warehousing if you are going to really get into all this

Proxies drastically change number of IPs by hab136 · 2001-12-06 07:51 · Score: 1

Here's another way we looked at it. You have 10 people sharing the proxy, suppose 2 or 3 hit my site at the same time, you would be lumped together into a single session. Fine, then sessions do not mean a specific user hit the site, just a specific IP - that's all we really want. We know that IP could mean 10 people seeing things at a presentation, 10 behind a proxy, 10 who dial in, click, hang up, someone else dials into same IP, click, hang up... We know there's going to be overlap and error, but in the grand scheme of things I'd think it'd all work out in the wash.

You should take proxies more into account. Many large companies shove tens of thousands of users behind a handful of IPs (or one!). Some colleges do as well. And, oh yeah - AOL. 10 million people or whatever, and they all use a handful of IPs. Don't forget the cable modem companies, the DSL companies, and all the little ISPs that encourage (or force) users to use their proxies.

If you rely simply on IP, not only would your sessions not make sense in any "he went here, then there, then there" kind of way, but you'd vastly underestimate the number of users/sessions.

Re:Proxies drastically change number of IPs by heliocentric · 2001-12-06 09:30 · Score: 2

Yes, proxies are common. Yes, NATs are common, too (don't forget them). But I think you may have a different goal or at the least a misconception. There is no means that will guarantee that you can distinguish every single web user from every other web user. Proxies are one obvious impediment to IP-only monitoring. People disabling cookies sure shoots down a heavy reliance on cookies. But even if you had a magic way of telling every client computer apart, you still wouldn't be able to tell when dad surfs the web and then turns it over to his daughter. Even logins to a site are meaningless, since I can log in to a page and someone else using my workstation can surf.

Expecting there to be a means to get everything, and debating the known shortcomings of certain solutions, is fruitless. Yes, my approach does not account for proxies in a way that may satisfy you (and me), cookies can be turned off (which doesn't completely satisfy me), and people can publish their login to a site on /. (think NY Times). There is no magic bullet, but I hope to offer another solution to the mix to use in conjunction with other solutions. The key is to understand the shortcomings and not be an idiot and say "well, my dbase shows that this one person did this, and did that, and did this." One should not look at that fine a granularity.

Yes, some people behind some proxies will be able to skew certain things in certain ways, but won't others behind other proxies skew it in the opposite way? That is why we need to look at overall trends. Relying on one week's worth of data, or even a single month, is meaningless. Relying on one means to judge pages used in a session is just as naive. If your site has a means of cookie'ing I'm sure you realize the shortcomings; maybe running something that does not rely at all on cookies (IP based) along with it and comparing the data will help you to better understand things.

But realize this: if you run an IP based program, a cookie based program, and have authentication of your user sessions, you will still miss some things. The key is to be aware of this and judge accordingly.

--
Wheeeee

Re:Proxies drastically change number of IPs by heliocentric · 2001-12-06 09:40 · Score: 2

Oh, I also forgot to add: you mentioned the AOL proxy. Yup, we know all about that. When we do a host lookup we don't just look at the .com .uk .whatever part, we also look for keywords; having both "aol" and "proxy" in the name usually implies that it's from an AOL proxy and should be judged accordingly. As I said, one must know the shortcomings, and we do realize you, sir, can go get hab136.com and name your computer proxy.aol.hab136.com and it would fool that lookup (ok, it would do it since I didn't code it to make sure the aol part was right before the .com part), but how many people are going to go to those lengths in an attempt to skew web statistical data?

Another job of the data miner is to look for these types of anomalous trends and account for them. As you see odd jumps around from things that look like proxies, add a rule that you feel it's from a proxy. Will this catch them all? Heck no. Even if you miss some, you can still get some bearings on the popularity of some pages with respect to others on your site.
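
The keyword lookup might look something like this in Python. Both function names are made up for illustration, and the stricter variant (requiring the name to end in `.aol.com`) is an assumed refinement that the post says was not coded.

```python
def looks_like_aol_proxy(hostname):
    """Naive check as described: both keywords appear anywhere in the name."""
    name = hostname.lower()
    return "aol" in name and "proxy" in name


def looks_like_aol_proxy_strict(hostname):
    """Stricter variant: the aol part must sit right before the .com part."""
    name = hostname.lower().rstrip(".")
    return name.endswith(".aol.com") and "proxy" in name
```

The naive check accepts `proxy.aol.hab136.com`; the strict one rejects it. More rules of the same shape can be added as odd patterns show up in the data.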

Underestimating the session count is, I think, a good thing. The goal of session identification (at least for me) was to reduce the reliance on these big "hit count" numbers. But to each his own...

--
Wheeeee

Re:Proxies drastically change number of IPs by heliocentric · 2001-12-06 10:06 · Score: 2

Sorry, but I just took a shower and thought up something else. Perhaps my original explanation of defining a session was a tad obtuse (I didn't expect such good questions about a simple toss out of an idea). I also look at OS and browser information. If there is an IP and a "current" session going on from that IP, but the new entry has a different OS, I deem that to be a different session. Since sessions are never really locked in the dbase, if there is another entry from that IP with the original OS within the time limit for that session, then that session is updated. This would handle people like me with a NAT who have a win box, a linux box, and a few sun boxen. Will this solve the problem? Well, consider a corporation with a proxy and 200 drone clients all running a standardized OS and browser, all surfing my site with no particular rhythm. Then the answer is no: they would all be lumped into one session, as they are indistinguishable from the log's point of view. However, extend this worst-case scenario to them all also (since it's standard) not accepting any cookies from anyone, and all getting their login to the nytimes site from a /. post. Now, how in the world do you expect to do any better? You can't rely on some sort of "well, this page is linked to that one so users must follow some sort of linear progression" since we all see how that is flawed.
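
One way to picture the IP-plus-OS refinement is to key open sessions on the pair instead of on the IP alone. The sketch below is Python, not the project's code; the keyword list in `guess_os`, the 30-minute timeout and both function names are assumptions, and it only tracks the most recent session per key.

```python
from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=30)  # assumed; pick your own limit


def guess_os(user_agent):
    """Very rough OS guess from a User-Agent string (assumed keyword list)."""
    ua = user_agent.lower()
    for keyword, name in (("windows", "Windows"), ("linux", "Linux"),
                          ("sunos", "SunOS"), ("mac", "Mac")):
        if keyword in ua:
            return name
    return "unknown"


def touch_session(open_sessions, ip, user_agent, when, timeout=TIMEOUT):
    """Extend or create the session for this (IP, OS) pair.

    `open_sessions` maps (ip, os) -> [start, end].
    Returns True when a new session was started.
    """
    key = (ip, guess_os(user_agent))
    span = open_sessions.get(key)
    if span is not None and when - span[1] <= timeout:
        span[1] = when  # same IP and OS inside the window: update it
        return False
    open_sessions[key] = [when, when]
    return True
```

A Windows box and a Linux box behind the same NAT get separate sessions, and a later hit from the Windows box still updates the Windows session. The 200 identical corporate clients still collapse into one session, as the post says.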

Is my way an end all - no. Is it better than cookies - oh the fun on a /. debate, but I don't think so in an amortized sense. Is relying on one and only one way to get your data ever a good idea - well, that's up to you.

I see cookies as pass/fail type things. Either you get good data or you don't. My way I think has a little grey area where there are the obvious few clicks here and there from someone with a dedicated IP and there are those that are more tricky, but with some good coding I feel there are means to clean up the data and make some better judgements over the first impression of "he's stupid to be just relying on IPs, there's nothing one can get from that nonsense." I think a more appropriate statement covering many means of web traffic analysis would be "He's silly to be relying on a single means of data interpretation and even sillier if he thinks there exists a solution that does not have a scenario circumventing it."

--
Wheeeee

Name lookups can be tricky by SgtChaireBourne · 2001-12-07 02:03 · Score: 1

I've either had to customize other people's statistics generators or write my own. Most of the services I've had to work with either have no session tracking or else use GET and include an identifier in the path. This makes for long bookmarks / URLs but makes it easy to follow one session in the logs.

One issue that I have noticed is that unless you are serving scripts, hostnames in the logs are becoming less relevant due to caching and proxies. However, if you do track host names then getting a 100% (ok, 99%) accurate list is hard. If you wait a few days, weeks or months to analyse your logs, then some of the IPs may have changed owners. If you try to let the HTTP daemon do the lookup, then you suffer a drop in performance and the very first lookup or two often fails anyway.

If you use Apache (I have no experience with the others), then it is easy to pipe the log output directly to a script.

CustomLog "|/usr/local/apache/bin/rotatelogs /var/log/access_log 86400" common

You can even make it tab-delimited or what ever it takes to be easier to parse.

LogFormat "%h\t%l\t%u\t%t\t%r\t%>s\t%b\t%{Referer}i\t%{User-Agent}i" tab
CustomLog /var/log/httpd/access_log combined
CustomLog /var/log/httpd/funky_new_log tab
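
As a sketch of why the tab-delimited format is easier to parse, here is one way to read a line of the `tab` log in Python. The field names and the function name are assumptions chosen to mirror the nine format directives; only standard library calls are used.

```python
from datetime import datetime

# One name per directive in the "tab" LogFormat, in order (names assumed).
FIELDS = ("host", "ident", "user", "time", "request",
          "status", "bytes", "referer", "user_agent")


def parse_tab_line(line):
    """Split one tab-delimited log line into a dict of typed fields."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    rec = dict(zip(FIELDS, parts))
    # %t is written like [05/Dec/2001:20:59:00 -0500]
    rec["time"] = datetime.strptime(rec["time"], "[%d/%b/%Y:%H:%M:%S %z]")
    rec["status"] = int(rec["status"])
    # %b logs "-" when no bytes were sent
    rec["bytes"] = 0 if rec["bytes"] == "-" else int(rec["bytes"])
    return rec
```

Because the request line and the User-Agent contain spaces but not tabs, a plain `split("\t")` is enough and no quoting rules are needed.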

Myself, I'm about to experiment with logrotate, rotatelogs, cronolog and mod_mylog. mod_mylog puts the log output straight into your RDBMS and even claims to cache records if the RDBMS is temporarily unavailable.

--
Beta is broken and the link to classic doesn't work. Stop wasting our time or there won't be anybody left here.

who's who & what are they doing. by friscolr · 2001-12-07 12:35 · Score: 2

using phpsessions/cookies, javascript (onload() and onunload()), webbugs (hidden images), apache log files, a webspider, and my e-mail i try to discover as much as i can about my website visitors and develop nice ways to view it all.
granted, i don't get that many (2-3,000 a month) and a substantial amount are people i know, but it sure is fun to do.

Setting a cookie will let you pinpoint that a given instance of netscape is viewing your site. with lack of cookies you can tell that a given ip is viewing your site, but then proxies will get you in trouble and may get you some nice email (that one deals with those pesky nipr.mil people). Using ip, timestamps and useragent can get you a more accurate pinpointing, but still not exact.

I like to set an invisible image at the bottom of my website with a special id on it. This image gets changed at onunload which, for those who use javascript and unload the page, will tell me how long they viewed any given page. (some of this info is also presented at the bottom of every page).

If you really wanted to and could afford a fully dynamic site you could have every single link called with a ?sessionid at the end of it (like http://www.example.com/?1234) and have this reset if the referer wasn't from your site (this could cover people copying and pasting that link to someone else) and then parse through your logs afterwards. but that could get annoying.
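
A small sketch of the reset-on-foreign-referer rule, in Python. The function name is hypothetical, the host comes from the example URL above, and the new ID is just a random hex string; a real site would also have to rewrite every link to carry the ID.

```python
import uuid
from urllib.parse import urlsplit

SITE_HOST = "www.example.com"  # host from the example URL; adjust to taste


def session_for_request(query_string, referer, site_host=SITE_HOST):
    """Keep the ID from the query string only when the request was
    referred from our own site; otherwise hand out a fresh ID."""
    ref_host = urlsplit(referer).hostname if referer else None
    if query_string and ref_host == site_host:
        return query_string
    return uuid.uuid4().hex
```

A request for `/?1234` referred from a page on `www.example.com` keeps session `1234`; the same URL pasted into an email or followed from another site gets a new ID.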

as far as the logs go, a friend of mine has apache log directly to mysql which facilitates his parsing. as another poster mentioned, sniffing traffic can help alleviate strain on your webserver - a nice openbsd bridging firewall will do the trick. (checking your firewall logs is handy in other ways - i have some hidden "easter eggs" on my site which appear to be exploits on my box - i check who gets to those pages and then who tries to connect to port XX and see what ip's match up - nice little stats)

One time i wondered what would happen if i matched the ips from my mail headers to the ips from my weblogs. It turns out that around 1% of the unique ips in my mail headers also appeared in the weblogs, which means that with pretty fair certainty i knew who was browsing my site. But this mainly works due to the personal nature of my site.
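
The matching itself is a set intersection. A minimal Python sketch, assuming the mail headers are available as text lines and the weblog has the client address as the first field of each line (both assumptions, as are the function names):

```python
import re

IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def mail_header_ips(headers):
    """Collect every dotted-quad that appears in the given header lines."""
    found = set()
    for line in headers:
        found.update(IPV4.findall(line))
    return found


def weblog_ips(lines):
    """Collect the client address, assumed to be the first field."""
    return {line.split()[0] for line in lines if line.strip()}


def overlap(headers, log_lines):
    """Return the shared addresses and their share of the mail-header IPs."""
    mail, web = mail_header_ips(headers), weblog_ips(log_lines)
    shared = mail & web
    share = len(shared) / len(mail) if mail else 0.0
    return shared, share
```

If the web server logs hostnames instead of addresses, the weblog side would need a lookup step first.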

On the other hand, if you've access to lastlogs, query logs, and weblogs you can really start identifying local users of your website. i work at a local college and can learn a lot about a particular viewer by seeing if the same ip is logged in to a given server, or by looking at the query log and seeing what else they've done dns lookups for. Add in a messaging system and you can freak people out. (i also use this to freak out people who search images.google.com for 'breast' and get to my site (i'm a photographer too)).

one other thing i find useful is to keep track of who searches for robots.txt. this can be an indication that someone is a robot or proxy. It also helps me present special information to search engines, allowing them to know of (and thus index) a new page the next time they get to my site (i put a couple special links if you access robots.txt)

a friend of mine runs a journal/bbs website and was wondering about tracking his users when they create different accounts. We are thinking about implementing something similar to my spam identification to identify similar writing styles and possibly the same people in different accounts.

Once you've gathered up some data you'll want to look at it in a nice way. you could use excel or you could create some really nice webmaps (that site also has links to similar mapping projects).

finally a word of advice, if you put up a page of your refer logs, include that as disallowed in your robots.txt or you will get a lot of strangely referred people.

oh, and keep in mind no one method will be 100% accurate, but a combination of methods can get you close.

and there was an article not too long ago about MIT (i think) doing studies into how people view webpages - that is, if mouse is over to side of screen then person is most likely reading page, if mouse in middle of page then probably not, etc.

maybe forcing every viewer into a frameset and then tracking changes in the subframes is a viable option. associate a frame change to a hidden image change with an encoded identifier.

ok, that's it for now.

--

-f
www.blackant.net

Slashdot Mirror

16 comments