Domain: analog.cx
Stories and comments across the archive that link to analog.cx.
Comments · 36
-
I'm Working On This Same Problem All By Myself.
October 28, 2005
I am working on a project with three young programmers and a manager. The oldest of the programmers is more than ten years younger than I. The manager is older, but does not know much about programming beyond how to check our code out of Subversion and type "make" to check our progress.
I was talking to my buddy Leo Baschy yesterday about it. Leo's around the same age I am. He is a Rocket Scientist: he wrote MacsBug 6.2 when he worked for Apple, and spent several years writing an access control application that he is just now bringing to market. Leo Does Things Right.
I told Leo I really enjoyed talking shop with someone who had a clue. But I said: "When I talk to those guys about how to write better code I have the sense that their experience of me is like going to church.
"Many go to church. How many are without Sin?"
"But I didn't learn to preach because I studied at the seminary. It's because I was a derelict on Skid Row until I was saved by..."
Many - but definitely not all - security flaws that leave one's code or one's box vulnerable to exploits are nothing other than simple bugs.
Most insidious is storing an input buffer on the runtime stack, then reading an element from a file or network connection without validating that the size of that element complies strictly with the file format or network protocol specification. That enables one to prepare a "Specially Crafted Document" or Network Packet that overflows the buffer and overwrites the proper return address on the stack with one of its own. Then, when that subroutine returns, control is passed not to the caller but to malware code that is included with that too-big file or network element.
It's a little tricky to actually craft these documents, but I assure you that if you read Learn Java in 21 Days, you will learn absolutely all you require to enable Stack Smashing Buffer Overflows for your users.
Don't think that Plain Text File Formats like HTML, or Text Network Protocols like HTTP or POP are any manner of protection!
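The validation step described above, checking an element's declared size against what the format allows before reading it, can be sketched roughly as follows. This is only an illustration in Python, assuming a hypothetical record layout (a 2-byte big-endian length followed by the payload) and a hypothetical limit MAX_PAYLOAD. Python cannot smash the stack, so the sketch shows only the check that the equivalent C code would need, not the exploit.

```python
import struct

MAX_PAYLOAD = 256  # hypothetical limit, standing in for a real format spec


def read_record(data: bytes) -> bytes:
    """Return the payload of one length-prefixed record, or raise ValueError."""
    if len(data) < 2:
        raise ValueError("truncated header")
    (length,) = struct.unpack(">H", data[:2])  # size the sender claims
    if length > MAX_PAYLOAD:
        raise ValueError("declared length exceeds the format's limit")
    if length > len(data) - 2:
        raise ValueError("declared length exceeds the data actually supplied")
    return data[2:2 + length]
```

Both checks matter: the first rejects sizes the specification forbids, the second rejects records that promise more bytes than they deliver.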
Just now I had a look through the Failure Report produced by the Analog Web Server Log File Analyzer. Here are some choice tidbits:
$ grep ajax_create_folder access.log | sed 's+^.*"GET+GET+' | sed 's+HTTP/1\.1".*$+HTTP/1.1+' | wc
184 552 16980
$ grep ajax_create_folder access.log | sed 's+^.*"GET+GET+' | sed 's+HTTP/1\.1".*$+HTTP/1.1+' | sort | uniq | wc
102 306 9466
$ grep ajax_create_folder access.log | sed 's+^.*"GET+GET+' | sed 's+HTTP/1\.1".*$+HTTP/1.1+' | sort | uniq | head
GET //ajax_create_folder.php HTTP/1.1
GET /Gallery/zenphoto/zp-core/zp-extensions/tiny_mce/plugins/ajaxfilemanager/ajax_create_folder.php HTTP/1.1
GET /Gallery/zp-core/zp-extensions/tiny_mce/plugins/ajaxfilemanager/ajax_create_folder.php HTTP/1.1
GET /Photo/zp-core/zp-extensions/tiny_mce/plugins/ajaxfilemanager/ajax_create_folder.php HTTP/1.1
GET /Photos/zp-core/zp-extensions/tiny_mce/plugins/ajaxfilemanager/ajax_create_folder.php HTTP/1.1
GET /Pics/zp-core/plugins/tiny_mce/plugins/ajaxfilemanager/ajax_create_folder.php HTTP/1.1
GET /Scripts/tiny_mce/plugins/ajaxfilemanager/ajax_create_folder.php HTTP/1.1
GET /ZenPhoto/zp-core/zp-extensions/tiny_mce/plugins/ajaxfilemanager/ajax_create_folder.php HTTP/1.1
GET /Zenphoto/zp-core/zp-extensions/tiny_mce/plugins/ajaxfilemanager/ajax_creat -
The Answer Lies In Your Web Server Log Files
If you analyze your logs with a tool such as Analog, you'll find that a significant number of your web sites' visitors are still running Explorer or Netscape versions 3 or 4. At least that's what I find for my sites - and it's been that way for a long time.
There are lots of reasons for this. Some people cannot afford the new hardware required for Mac OS X. Some of those who could buy the hardware have a big investment in software that uses Apple Desktop Bus (ADB) dongles that wouldn't work on OS X even if the newer Macs were equipped with ADB - they haven't been for years.
Some software has been discontinued, with the vendors out of business, and so will never be ported to OS X-native. If the software is useful enough to the end user, then they'll keep running Mac OS 9.
Finally, some people simply don't know how to upgrade. Until very recently a relative of mine was running Internet Explorer 5.0 on Mac OS X 10.2 - no doubt riddled with well-known security holes, but she simply didn't know better. I bought her Mac OS X Tiger for Christmas (Leopard won't run on her G3), then visited soon after and installed it for her, then downloaded and installed all the updates.
All of these are reasons that I plan for Ogg Frog to support the Classic Mac OS.
(And there are many Macs out there that are too old to run Mac OS 9; they'll be running 8.6 or some such.)
-
Re:Count users, not hits. -- you can't
The parent post said: "I only count "browsers per known user per day". So users that come in more than once per day are only counted once; anonymous users (and robots/crawlers without a credit card in hand) are excluded."
And *how* do you count users/day accurately? With proxy servers, you *can't* always know that kind of information from server logs, though many logfile analyzer s/w packages will try to make you think you can...
See the Analog logfile analyser docs: What the results mean, and particularly How the web works.
"This section discusses what happens when somebody connects to your web site, and what you can and can't find out about them. If you think that you can get statistics on how many people have visited your web site (or want to know why you can't), then this section is for you." -
Re:from the readme for Analog 6.0: How the web wor
I was about to post that same analog doc.
That whole page is well worth reading.
Many of the web stats packages other than analog really try to make you think they can get more data out than they really can.
That page and the one above it (What the results mean) should be required reading for anyone about to read a web stats report. I certainly send it to all my customers whenever I set them up with a report.
-
from the readme for Analog 6.0: How the web works
It's actually much more complicated than most people think. The best write up I've seen is on Analog's site:
This section is about what happens when somebody connects to your web site, and what statistics you can and can't calculate. There is a lot of confusion about this. It's not helped by statistics programs which claim to calculate things which cannot really be calculated, only estimated. The simple fact is that certain data which we would like to know and which we expect to know are simply not available. And the estimates used by other programs are not just a bit off, but can be very, very wrong. For example (you'll see why below), if your home page has 10 graphics on, and an AOL user visits it, most programs will count that as 11 different visitors! -
Re:one to avoid
I'd also recommend avoiding WebTrends. Where it would take 12-18 hours to process log files, Analog runs in a fraction of the time (under an hour).
My issues with Analog are that I haven't discovered how to make it parse each log file only once, and I haven't discovered any way to have it display stats for different time periods (i.e. daily/weekly/monthly/quarterly/annually) all on one page. I'm not sure if these are real faults with the program, or if I just haven't figured out how to do it yet, so YMMV. -
Analog
http://www.analog.cx/ works well, at least for my servers.
-
New Discussion System
For those subscribers using Slashdot's new discussion system, this link will work better.
From the posting, though, I don't understand why you think your (Javascript-based) stats would be inaccurate, since only about 1.34% of users disabled or did not support Javascript.
That said -- I personally use Analog, and although it does give some fairly useful statistics such as search engine terms, most popular directories, referers, etc., I don't find it gives me a very high level of insight into surfing habits. A log analysis tool such as that may be a good starting point for you, though, if you don't currently do analysis of that sort.
-
Re:Use cookies
http://www.analog.cx/docs/logfmt.html
Look for %u in defining a custom log format for analog, which can be used with the user report capability to give you session ID information (easily paired with Apache's mod_usertrack http://httpd.apache.org/docs/2.0/mod/mod_usertrack.html).
http://www.serverwatch.com/tutorials/article.php/3504311 is a good read on the module as well.
You have to get your log formats set right, but I believe this is what you're looking for (I don't use awstats, but it's most likely possible in a similar fashion). -
Re:I have your answer
If you want to collect your own stats on this, the javascript looks something like:
// Record the colour depth and the size of the browser window (not the monitor).
var colors = window.screen.colorDepth;
var width, height;
if (navigator.appName == "Netscape")
{
  width = window.innerWidth;
  height = window.innerHeight;
}
else
{
  width = document.body.clientWidth;
  height = document.body.clientHeight;
}
// The original post elided the rest of the query string ("blah blah");
// the h and c parameter names below are hypothetical.
document.write('<img style="display:none" src="/cookies/?w=' + width + '&h=' + height + '&c=' + colors + '">');
That gets us a hit to a script which logs the sizes. Then you need to check out what your logs say with something like awstats or analog. We're more interested in sizes of browser windows than sizes of monitors, because we aren't the kind of thugs to go around resizing people's windows without asking them.
-
Re:Finally....
Opera lets users set the user agent string to spoof various browsers,
Not perfectly ;-)
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 8.50
Okay, it's the current version that produces the above, but as far back as I recall, Opera always appended the "Opera" string at the end. The better log analysers aren't fooled by this.
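A rough sketch of how an analyser might avoid being fooled, assuming only what is said above: that Opera appends an "Opera" token even when it claims to be MSIE. This is illustrative Python, not the code of any particular log analyser, and the function name identify_browser is made up for the example.

```python
import re

# Look for an Opera token first, because a spoofing Opera also contains "MSIE".
OPERA_RE = re.compile(r"\bOpera[ /](\d+(?:\.\d+)*)")
MSIE_RE = re.compile(r"MSIE (\d+(?:\.\d+)*)")


def identify_browser(user_agent: str) -> str:
    """Classify a User-Agent string, preferring Opera over the MSIE it imitates."""
    match = OPERA_RE.search(user_agent)
    if match:
        return "Opera " + match.group(1)
    match = MSIE_RE.search(user_agent)
    if match:
        return "MSIE " + match.group(1)
    return "unknown"
```

The order of the two checks is the whole trick: test for MSIE first and the spoofed string above is counted as Internet Explorer.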
If you are talking about the modifications to the ua.ini file for types 4 and 5, I don't think there's a lot of people doing that.
-
Nice work - some minor suggestions for 'ya
Nice analysis David. I'd personally love to see Analog (an oldie but goodie) added to your table, and my guess is that Urchin would be another popular one.
As you know, one can easily spoof the User Agent (and Firefox makes this totally trivial) - any idea on what percentage of folks are doing this type of stuff? Too bad that Slashdot didn't put this on the front page, because then you could analyze that inbound traffic.
P.S. FYI FWIW: using Analog, here's my browser percentages for Christmas/2004. I also have a Browser Info Page for those folks interested in seeing real-time what their browser is reporting.
-
Re:Obligatory web stats notice
I would have thought that a large enough sample would provide useful information
Useful information on what? Examining your own logs to determine how best to tune your web servers is useful. Thinking you can determine who is using what software is not.
Clearly we can't draw conclusions about precise market share, but surely trends might be identified?
Not really. All it takes is a large ISP like AOL to tweak their caching parameters, or for Microsoft to push out a service pack, and you can see a gigantic change in traffic with zero difference in market share.
For example, current surveys hint at a trend away from Internet Explorer; should we disregard this as a statistical hiccup?
If you can back up those surveys with evidence, no. But httpd logs are very flimsy evidence. Traffic analysis alone is not enough; there are two separate aspects - the design of HTTP, which can't support analysis on this level; and the assumption that traffic corresponds to market share.
Firstly, the design of HTTP. Unless you switch off caching of your HTML, you aren't going to get anywhere near the right figures. If you switch it off, and you have a website that has enough traffic to count for anything in these statistics, it will cost you real money and will slow down your website. Chances are, if somebody does that, it's because they are clueless rather than because they value the stats more than the cash - and if they are clueless, can you expect them to gather statistics reliably?
Even then, there are biases. There are biases towards Internet Explorer (spoofers generally emulate Internet Explorer rather than otherwise, etc) and biases against Internet Explorer (users of Windows are more likely to be both Internet Explorer users and "firewall" users that block things like the User-Agent header, etc). All sorts of random odds and ends that alone are likely to be swept under the carpet because they are small and these analysts can't account for them.
But probably more importantly, these analysts are equating traffic with market share, which is a mistake. The best example I can think of to illustrate this is Google. Given identical market share, a browser that has a Google search field built in will send, on average, about half as much traffic to Google as a browser that doesn't have it built in (assuming the user prefers Google, of course), simply because the toolbar users won't be loading the Google front page first. Identical market share, half the traffic. The same sort of thing can happen across a wide range of websites; browsers that fail to cache things when they should, for instance, artificially inflate the numbers for those browsers.
NB. I'm not trolling, or even particularly disagreeing, but I would like more evidence/citations to support your viewpoint.
Well half of it's just common sense, but I really wish people would ask these analysts how they have accounted for these things, because the surveys I've seen in the past simply ignored the issues, and the popular web statistics packages that everyone likes to quote from are all pretty flawed, e.g. the commercial ones assume cookies are always present, always written to, etc.
If you want second opinions, the person who wrote the most popular logfile analyser in the world agrees with me, and even links to a study done at Xerox. Another decent introduction to the issues is Why web usage statistics are (worse than) meaningless.
-
Some people will never upgrade
I just did the Analog BROWSERSUM browser summary report for last month's GoingWare's Bag of Programming Tricks traffic. My findings showed some surprises:
- Requests - Browser
- 402 - MSIE 4
- 42 - MSIE 3
- 2 - Mozilla M18
- 15 - Netscape 3
- 2 - Netscape 2
- 12 - Opera 5
It looks appalling in Netscape 4. One reason we haven't posted her new design yet is that, until her 1996-era Mac died a month or so ago, Netscape 4 was what my mom used. I convinced her to buy an iMac. She likes the stylish design.
Something people need to realize is that there are still many people who cannot upgrade. Some people aren't permitted to by their IT departments, but more likely many are people like my Mom using ancient hardware where Mozilla won't run.
-
Re:Give me reporting tools!
-
You can't: live with it
You can't measure the exact number of human visitors to a website, any more than you can measure the exact number of people who read a magazine. With a magazine, two people may read one magazine. With a website, one person may come from two computers, or two people from one computer. The problem is only that people, especially advertisers, seem to expect that exact numbers are somehow possible. But they really need to match their metrics to the medium, and not try and force the medium to fit print-media analogies.
Anyway, the exact numbers don't really tell you anything. You really need to know the differences between two sub-populations (are visitors from pay-per-click ads or visitors from standard search results more likely to buy?). A program which makes this sort of comparison easy will give you far more insight than one which tries to get the total number of visitors closer to some mythical "true" number.
(I am the author of analog and CTO of ClickTracks, but I'm writing in a personal capacity).
-
Re:Good, I think
I use Analog to check my logs. Even when Opera is identified as IE it still sends Opera in the user agent. Analog checks for this.
-
Re:"Official market share" - how big is your site?
Hm. What stats package do you use, Tom?
Just wondering because there's a difference between hits (requests) and "visits", which is a euphemism for whatever the stats software maker wants it to be.
This page explains the difference in more detail:
http://www.analog.cx/docs/webworks.html
BTW, Analog rocks, it's the fastest stats analyzer I've ever seen, takes a 600MB log file and spits back your report in less than a minute. (And it's open source! *grin*) -
Re:Analog
Yes, Analog, AWStats, and Webalizer (here's Webalizer Win32) are the three packages that my web host installs for all its users.
-
Analog
You should look at Analog. It is free, and open source. While it probably doesn't export straight to Excel, you would likely have two choices there:
First, since it's open source, you could add support to export to CSV fairly easily.
Second, Analog can export to what it calls "COMPUTER" output, which is designed for easy parsing. Couple the COMPUTER output with a little Python or Perl, and you'll have a CSV file fairly quickly.
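As a rough idea of how little Python that might take: the sketch below assumes the COMPUTER output is plain text with one record per line and tab-separated fields. That layout is an assumption, so check the Analog documentation for your version and adjust sep to match; the function name computer_to_csv is made up for this example.

```python
import csv
import io


def computer_to_csv(lines, sep="\t"):
    """Convert separator-delimited report lines to CSV text.

    `sep` is an assumption about the report's field separator.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        writer.writerow(line.split(sep))
    return out.getvalue()
```

Using the csv module rather than a plain join means fields that themselves contain commas or quotes are escaped properly, which matters for things like referrer URLs and search terms.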
When you're finished looking at Analog, make sure you also consider Report Magic for Analog to make things look pretty in a browser. -
Re:Browser Spoofing.
I specifically said the string " ) Gecko/ ". Safari contains the string " like Gecko) ", which sets it apart.
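In code, that distinction comes down to two substring tests. A minimal Python sketch follows; the function name rendering_engine and the returned labels are made up for illustration, and real User-Agent strings vary more than this allows for.

```python
def rendering_engine(user_agent: str) -> str:
    """Tell real Gecko browsers apart from those that only claim 'like Gecko'."""
    if "like Gecko)" in user_agent:
        return "KHTML/WebKit"  # e.g. Safari, which merely mentions Gecko
    if ") Gecko/" in user_agent:
        return "Gecko"         # Mozilla, Netscape 6+, Camino, Firefox
    return "other"
```

With strings of roughly the shape these browsers send, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0" comes back as Gecko, while "Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en) AppleWebKit/125.5.5 (KHTML, like Gecko) Safari/125.12" does not.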
As for Netscape (6+) and Camino, they ARE Gecko, so why would you need to sub-detect them? Any browser-specific code you've got for Mozilla/Netscape/Camino/etc should be identical.
But if you're really that obsessive, you should grab the source code for Analog and see how a group of web experts do it.
-
More technical, but very useful..
Wget - wget.sunsite.dk I use it almost every day.
Analog - www.analog.cx Web server log parser. An absolutely essential tool for a webmaster. -
Siege
I have had good luck with the neat little program called siege. It can stress a single URL or multiple URLs, follow links from a root URL (simulating an actual user), and keep many concurrent connections active. At the end, Siege can tell you all about the server performance, latency, etc.
I really like one of the other posters' idea of having a load tester read actual log files from Apache, then simulate real user activity. The only problem I can see with this method is that if you changed the layout of your site, all the program would get is a bunch of 404s. However, if one were so motivated, one could hack up such a thing relatively easily, I think. analog can parse Apache/httpd log files, so it couldn't be all that hard. Siege works well for me, though, so I'll stick with it. -
Plenty of People to Sue
Well, it looks like they'll have plenty of people to sue if this is possible. Analog also extracts the server's likely country of origin by parsing a resolved IP address, as do hundreds of other applications.
It's hardly a big deal to equate a TLD to a country, and whilst it may take a little longer to map IP addresses to geographic locations, this data is already in the public domain!
-
Re:Opera vs IE, no, Opera vs Mozilla.
definitely 20% using some version of Netscape 4.x or earlier
You aren't interpreting your website logs appropriately if you come to this conclusion.
Many of the web crawlers advertise themselves as being early Netscape/Mozilla clients in the HTTP request; if you are including these in your figures as "real people using a browser" you're going to come up with horribly skewed figures like your own. Most decent server log analysis tools (such as the ever-present Analog) do a pretty good job of removing bots from the "real browsers" totals. See the Analog ROBOTINCLUDE option documentation for starters.
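To make the point concrete, here is a small Python sketch of separating robots from browsers before totalling. The pattern list is hypothetical and far from complete, and this is not how Analog implements it; it only shows the kind of filtering meant.

```python
import re
from collections import Counter

# Hypothetical, incomplete list of substrings that suggest a crawler.
ROBOT_RE = re.compile(r"bot|crawl|spider|slurp", re.IGNORECASE)


def browser_totals(user_agents):
    """Return (Counter of non-robot User-Agent strings, number of robot requests)."""
    humans = Counter()
    robots = 0
    for ua in user_agents:
        if ROBOT_RE.search(ua):
            robots += 1
        else:
            humans[ua] += 1
    return humans, robots
```

A crawler that sends a plain old Netscape string with no telltale token still lands in the browser totals, which is exactly how figures like the 20% above get inflated.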
-
Analog
Analog is a nice log analyzer, and fills all of my requirements, but a 100% accurate flow analysis of usage patterns is not possible. Since HTTP is a stateless protocol, it is difficult to know exactly what the client is doing.
-
Re:Karma whore link
Of that bunch, I must say that I really like Webalizer. It produces really nice-looking reports with pie and bar charts, and the level of detail can be customized to almost any need. It's also nice that it'll work on web server logs as well as squid logs....
Analog may be the most popular, but I also found it rather difficult to set up and get useful data into and out of.
Balam
-
Re:Sounds Good
Ah, the joys of analog. I regularly look through my log files for interesting stuff. Things people have been searching for when they found my web site (not as perverted as indicated) include:
- "Long fingernail and long toenail fetish"
- "mime nude photos"
- "16 year old boys whith arm pit hair"
- "easy and fast directions to make crack cocaine in the microwave"
- "but she was my student why did i have impure thoughs"
- "nude cartoons inspector gadget"
- "secrets on how to suntan through your computer"
-
Re:Analog
I use Analog exclusively (well, after DNSTran for name lookups and Perl to sort out sub-logs) and I have found little reason to complain. As Stephen mentioned, you can use ReportMagic to prettify the output. I don't bother.
My only complaint is Stephen's dogmatic insistence on not performing any form of speculative analysis. For example, he refuses to even attempt visitor counting, path tracking, etc. The sort of stuff that bosses like to see, whether or not it's strictly accurate.
Stephen could put WebTrends out of business with a couple hours of coding, but he has his principles. -
Re:What's it matter what server generates the logs
Last I checked, both IIS and Apache generate (or can be set to generate) W3C standard format logfiles. Part of the reason for having/using that standard is so that you don't get locked into a proprietary tool.
You might think so, but IIS breaks the standard in several ways. And it's not even really a standard, just an early working draft that was never finished.
In my opinion, a good logfile analysis tool should be able to recognise and analyse all commonly-used formats, and provide a means to specify custom formats. In other words, it should work with what the server has already produced, rather than force the server administrator to reconfigure the server and ignore old logfiles. My program analog does all this, but most programs don't.
-
Analog
I'd like to plug analog. I'm the author, so read my comments in that light. :-)
First, as others have commented, the commercial programs suck, especially Webtrends.
Analog is over six years old, but it's still actively developed, and I think it's still the leading free log analyser. The main contender is the Webalizer. To some extent it depends what you want (why not try out both?). The Webalizer's biggest advantage is that it produces prettier pictures. Some of analog's advantages are that it is more configurable; that it runs on any OS (the Webalizer is Unix only); and that it can analyse logfiles from any web server.
Besides, analog's author reads Slashdot.
-
AWstats rocks!
I have been running AWStats since July, and I absolutely love it. It does not provide the fine-grained detail that many people need, and which can be provided by Analog. But it does provide exactly what 90% of us need, in an easy-to-view package. It creates an easy-to-understand page about many aspects of your site, including users, page hits, countries, languages, OS, browser, spiders/robots, and access times; it's great! It is also a GPLed Perl script! The development team is over at SourceForge and is actively releasing new code all the time. It also has the added benefit of allowing CGI updating through a web page; simply putting the script in your /www/cgi-bin/ directory and adding appropriate permissions allows you to get up-to-the-second information about your site without having to dig up a terminal! Definitely check this package out!
-OctaneZ -
No!
I am almost certain that there is no industry standard.
One problem is that it would depend very much on the type of website and thus the type of users you had. If you have a B2B website, and most of your visitors are from companies, your (unique user):(unique IP) ratio will look very different to a site with mostly home visitors coming through large ISPs.
The industry seems to be more concerned with developing more and more reliable versions of the half-hour timeout metric. Of course, they're chasing the wind. (And furthermore, all the different versions of their metric are then not comparable -- see this study from Xerox PARC (PDF, 228kb).)
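For anyone who hasn't met it, the half-hour timeout metric amounts to something like the following Python sketch. It is only an illustration of the estimate being criticised, under the assumption that each requesting host is one person, which is precisely the assumption that proxies and shared machines break; the function name estimate_visits and the input format are made up for the example.

```python
from collections import defaultdict

HALF_HOUR = 30 * 60  # seconds


def estimate_visits(requests, timeout=HALF_HOUR):
    """Estimate 'visits' from (host, unix_timestamp) pairs.

    A new visit is counted whenever a host has been silent for longer
    than `timeout`. Hosts are treated as people, which they are not.
    """
    by_host = defaultdict(list)
    for host, timestamp in requests:
        by_host[host].append(timestamp)

    visits = 0
    for times in by_host.values():
        times.sort()
        visits += 1  # the first request from a host opens a visit
        for previous, current in zip(times, times[1:]):
            if current - previous > timeout:
                visits += 1
    return visits
```

Feed it the same requests with a 30-minute timeout and then a 60-minute one and you can get different totals, so figures from tools that pick different timeouts cannot be compared.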
I leave you with this thought from my essay How the Web Works:
"These problems are not really new to the web -- they are present just as much in print media too. For example, you only know how many magazines you've sold, not how many people have read them. In print media we have learnt to live with these issues, using the data which are available, and it would be better if we did on the web too, rather than making up spurious numbers."