On Maintaining httpd Logs...
A nameless submitter dropped this in my in-bin: "I help run a site that's rapidly gaining popularity. However, I wonder how other people out there handle the large amount of logs that are generated on a busy system. How long do you keep all those Apache logs? What about the messages logs, etc? They take a lot of space and quite often I just don't think they're worth keeping around. Thoughts?" My thoughts on this are simple: If you are serious about your site, then the logs are worth keeping. You don't have to keep them online (tape backups work well here), but the statistics within can give you valuable information on the future handling of your site. Any other thoughts?
I'm looking for a good, open source web log parser. It would be nice
if it can do pretty graphs, although it's not necessary. Clear,
well-presented information is important, however. Thanks.
If your web site is doing anything commercial, then it is imperative that you keep your logs... I have friends who have had trouble with advertisers disputing their click-through totals or unhappy customers claiming they never used the site... you never know. :)
If your site is non-commercial then your logs are still worth keeping for statistical analysis (e.g. what parts of your site are the most popular, which ones are gaining in popularity) but probably not vital.
Nonetheless, the amount of space being taken by your logs should be fairly minimal if you compress them. If the log sizes are large then backing them up to tape or zip drive or other device should be sufficient.
Also, a good log to watch is your referrer log. If you have copyrighted material on your site you may not want people linking to the material in case it gets misrepresented, and the referrer log is often the first sign of this type of activity.
Logfiles - especially httpd or firewall logs - are extremely compressible - expect them to shrink by a factor of 20 or better.
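If you want to check the compression claim against your own data, here is a minimal Python sketch. The sample log is synthetic (the paths, IP range and timestamps are made up for illustration), so the ratio it prints will not match what real logs give you; point compression_ratio at a chunk of your own access log to get a meaningful number.

```python
import gzip
import random


def make_sample_log(n_lines, seed=0):
    """Build a synthetic access log (hypothetical paths and IPs) as bytes."""
    rng = random.Random(seed)
    paths = ["/", "/index.html", "/images/logo.gif", "/about.html", "/news/today.html"]
    lines = []
    for i in range(n_lines):
        ip = "10.0.%d.%d" % (rng.randrange(4), rng.randrange(256))
        path = rng.choice(paths)
        size = rng.randrange(200, 20000)
        lines.append(
            '%s - - [10/Oct/1999:13:%02d:%02d -0700] "GET %s HTTP/1.0" 200 %d\n'
            % (ip, (i // 60) % 60, i % 60, path, size)
        )
    return "".join(lines).encode("ascii")


def compression_ratio(data):
    """Return original size divided by gzip-compressed size."""
    return len(data) / len(gzip.compress(data))


if __name__ == "__main__":
    raw = make_sample_log(20000)
    print("gzip ratio on synthetic sample: %.1f" % compression_ratio(raw))
```

To try it on real data, read a slice of your log in binary mode and pass the bytes to compression_ratio.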
We build weekly statistics from the (new) logs before we archive them on CD-Rs (2 CDs full of compressed logs per week, *sigh*). The weekly statistics are published on our intranet server for reference.
The stats are built with analog and some highly optimized, specialized programs (dumb but fast: ~10 MB/second throughput). I could publish them if you are interested.
Cronolog can be used on the end of a pipe from Apache (or presumably anything that generates logs similarly) and will automatically write logs to paths keyed on date. E.g. if you want to collect each month's logs in separate dirs, cronolog will write to 1999/Oct, 1999/Nov, etc. It's an extremely useful way of splitting up your log files chronologically without writing scripts to restart Apache and move the old logs.
Do a search on Google or somewhere for it.
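For anyone curious how the idea works, here is a rough Python sketch of a date-keyed log writer that sits on the end of a pipe. This is not cronolog's code, just an illustration of the same technique: the path template is expanded with strftime for every line, and a new file is opened whenever the expanded path changes. The function names and the template are my own assumptions.

```python
import os
import sys
import time


def path_for(template, when):
    """Expand strftime fields (e.g. %Y/%m) in template for a timestamp."""
    return time.strftime(template, time.localtime(when))


def split_by_date(lines, template, now=time.time):
    """Append each line to the file named by the template at arrival time."""
    current_path, out = None, None
    try:
        for line in lines:
            path = path_for(template, now())
            if path != current_path:
                # The date rolled over (or this is the first line): switch files.
                if out:
                    out.close()
                os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
                out = open(path, "a")
                current_path = path
            out.write(line)
            out.flush()
    finally:
        if out:
            out.close()


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical): some_log_source | python splitlog.py 'logs/%Y/%m/access.log'
    split_by_date(sys.stdin, sys.argv[1])
```

Apache can write a log to a program by starting the CustomLog target with a pipe character, so a script like this (or cronolog itself) can take the place of the log file name. The script name and log path above are placeholders.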
Ade_
/
Big Bubbles (no troubles) - what sucks, who sucks and you suck
In my view, the logs themselves aren't as important as the information they contain. Therefore, use a comprehensive analysis tool, whether one of the commercial tools, a free one written in Perl, or write your own, and extract the relevant information, and then remove your logs.
Tape backups do indeed work well here, but not all log entries are created equal. If your site is very image-heavy, you probably don't want to keep the thousands of entries for each inline JPEG; you want the records of the page views.
Sites running Apache/mod_perl (or sites where the administrator is not afraid of Apache and their C compiler) can modify Apache so that it logs only what you want. A PerlLogHandler under mod_perl with return DONE if $r->content_type =~ /image/ at the top will save you hundreds, if not thousands, of (possibly useless) entries in your log files. On the other hand, a 30 Gig tape will hold years' worth of bzipped logfiles...
darren
(darren)
I'm kinda surprised that nobody's mentioned it yet. Building a log statistics and storage system is an excellent way for somebody to pick up Perl or Python knowledge, or to enhance what they have. Use Apache's CustomLog directive to get referer and user-agent info in your logfiles, and let your imagination run wild as to what kind of data can be mined out of them. User tracking from page to page doesn't require cookies. As compressible as log data is, there's no real excuse not to save it. If you've got enough traffic that logs are taking up disk space you want, you've already got a tape drive or something (right? you'd better...)
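As a starting point for that kind of project, here is a minimal Python sketch that parses lines in Apache's combined log format and counts requested pages and referers. The regular expression is my own simplification, not anything shipped with Apache: it assumes the standard combined field order and will skip lines whose quoted fields contain escaped quotes.

```python
import re
from collections import Counter

# Simplified pattern for the combined format:
# host ident user [time] "request" status size "referer" "user-agent"
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    m = COMBINED.match(line)
    return m.groupdict() if m else None


def summarize(lines):
    """Count requested paths and non-empty referers across log lines."""
    pages, referers = Counter(), Counter()
    for line in lines:
        rec = parse_line(line)
        if rec is None:
            continue  # malformed or differently formatted line
        parts = rec["request"].split()
        if len(parts) >= 2:
            pages[parts[1]] += 1  # parts is [method, path, protocol]
        if rec["referer"] not in ("", "-"):
            referers[rec["referer"]] += 1
    return pages, referers
```

Feed summarize an open log file and call most_common(10) on either counter to see the top pages or top referring URLs.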
The way we do it here is that we basically want to save the httpd logs forever. You never know when those logs will come in handy, particularly if you run any commerce sites -- it's nice to be able to track down IP subnets to add more empirical evidence. Plus, new statistical techniques may be developed down the road. I'm not as big a fan of archiving the messages log; however, those get backed up daily themselves with the rest of the servers.
We used to archive separate logs for access, error, and referer, but now Apache's combined logs have made life much easier. (Analog also is a nice bonus -- talk about quick stats!)
We typically download all the httpd logs for a quarter, burn them onto a CD-R, and store them. This on top of daily incremental backup with weekly full. That way, if we want to analyze the data later, we have the logs ready to go, rather than having to track them off a tape. (I myself have used old logs in this manner several times.)
I've heard that you can gzip the logs on the fly directly from Apache, but thought that might lead to unwanted CPU overhead due to its constant utilization on busy web sites; anyone got any anecdotal evidence on this one?
As has been mentioned, log files do tend to compress well; at least that's my experience with gzip.
Another way to chop log files down to size is to remove image requests. There may be circumstances where you wouldn't want to do this, but for the average web site it cuts log size dramatically. My experience is at _least_ by 2/3. And that's just for sites that use small numbers of graphics per page... if you've got more, you'll see further shrinkage.
You can do this after the fact with some kind of script/program (I've used Perl, and also once suffered through doing it in C), or you could change your site so that it simply accesses the images from another domain/server so the logs are kept separately.
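The after-the-fact approach can be sketched in a few lines of Python. This is only an illustration under my own assumptions: it treats any GET or HEAD request whose path ends in .gif, .jpg, .jpeg or .png as an image request, so adjust the extension list to match what your site actually serves.

```python
import re

# Matches the quoted request field when the path ends in an image extension,
# optionally followed by a query string.
IMAGE_REQUEST = re.compile(
    r'"(?:GET|HEAD) [^" ?]*\.(?:gif|jpe?g|png)(?:\?[^" ]*)? HTTP/[\d.]+"',
    re.IGNORECASE,
)


def drop_image_requests(lines):
    """Yield only the log lines that are not requests for image files."""
    for line in lines:
        if not IMAGE_REQUEST.search(line):
            yield line
```

To use it as a filter, pass it an open log file (or sys.stdin) and write the surviving lines out with writelines on the destination file.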
Tweet, tweet.