Slashdot Mirror


Software Logging Schemes?

MySkippy writes "I've been a software engineer for just over 10 years, and I've seen a lot of different styles of logging in the applications I've worked on. Some were extremely verbose — about 1 logging line for every 2 lines of code. Others were very lacking, with maybe 1 line in 200 devoted to logging. I personally find that writing debug and informational messages about every 2 to 5 lines works well for debugging an issue, but can become cumbersome when reading through a log for analysis. I like to write warning messages when thresholds or limits are being approached — these tend to be infrequent. I log errors whenever I catch one (but I've never put a 'fatal' message in my code, because if it's truly a fatal error I probably didn't catch it). Recently I came across log4j and log4net and have begun using them both. That brings me to my question: how do the coders on Slashdot handle logging in their code?"

6 of 225 comments (clear)

  1. What do you want to achieve... by AaronLawrence · · Score: 5, Informative

    As usual, "it depends" on what you are trying to achieve. Nobody can give you a blanket recommendation. But I guess in general: the log files need to give you enough information, that can't be got in other ways, to solve any problem that comes up.

    We have a realtime product that goes all over the world and talks to hardware that we can't always get access to ourselves. Therefore, we sometimes must debug our code remotely. Obviously, logging is critical to this. We keep sometimes hundreds of MB of logs and have archiving rules and a tool for users to collect them. Every layer of the system keeps it's own logs, and all logs have timestamps to milliseconds.

    In our case we log all the data back and forth, and then every important decision the code makes. For example if it decides there is something wrong with incoming data, it must log that. Any action it decides to take must be logged. Any data that will be passed on to other layers/the outside world must be shown. Generally, whenever we forget to log some of this data we will later regret it ("why the hell is it ignoring that device state..."). We also log at startup, basically the whole system configuration so that we can reproduce it.

    Callstacks when there is an exception can be very useful. However, a lot of "errors" (at least in our case) are not exceptions but rather unexpected data or behaviour. We rarely have a crash and in state-based systems a callstack doesn't tell you much about what's going on. So a callstack is not useful for all situations.

    Other times, you just want logging to give you a clue where in the code it was so you can run up the debugger and step through it (you do know how to step through code in the debugger, right?). In that case, too much logging can just get in the way. It might be sufficient in a GUI or web app to say which screen/page and which button was clicked.

    You'd hope users could report this kind of details, but not always: if the user is working in another language, in another country, with two layers of helpdesk between you and them, and they are busy doing other things when the problem occurs and only call in the issue an hour later, and the helpdesk takes a week to report it to you - you may just get a vague or even misleading report that no-one can remember when you ask questions. In those cases log files may be all you have to go on.

    There is also a tradeoff between log detail and manageability. Besides the difficulty in reading long log files, having a lot of detail means maintaining a lot of extra code. It also means that log files can become unmanagably large. In our case those hundreds of MB of logs can be a huge problem for customers to send to us because they have low quality internet connections (small companies in Mexico for example).

    --
    For every expert, there is an equal and opposite expert. - Arthur C. Clarke
    1. Re:What do you want to achieve... by ericfitz · · Score: 5, Informative

      +1 for parent.

      If you want good logging, then define requirements for it, just as you would for any other feature of the program. You also need to define the audience for the log. The comments thread has focused on debug logging for developers (Linus "no debuggers" Torvalds would be proud) but there are a number of reasons why the users who are stuck^h^h^h blessed with your software might want logging. For instance:

      - audit trails (often required by organizational security requirements or regulatory requirements)
      - accounting/billing (you'd be amazed at the odd ways people come up with to bill for things)
      - health monitoring (the admin might not want to watch your program 24x7 to see if it is running; they might want to program automation to be alerted when it is not working properly)
      - troubleshooting (believe it or not, your software might actually break when running in the wild)

      Anyway, think about your use cases, and then think about what to instrument for each use case, and what to put in the events.

      For instance, if you want to make your daemon monitorable for health, then think about all its dependencies. Does it read config from a file? The file is a dependency. What happens if a value is invalid? Does it fail or use a default? If it fails, reading the value is a dependency. Need a network socket? Dependency. Connection to remote machine? Dependency (actually multiple- name resolution, network connectivity, authentication, app-level connectivity, etc.). After you've enumerated all your dependencies, then add instrumentation in your code to log events when the dependency is unsatisfied (==unhealthy/broken), and when it is satisfied (==healthy). Make sure to log BOTH states, so that the monitoring app can decide which state you are in. Make sure to log only once per state transition. In each event, try to put as much information about the situation as you can- why you are in the state ("the value foo from daemon-config was invalid"), status codes, etc.- give your user a fighting chance of being able to use your log to diagnose and resolve the issue.

      If you want to instrument for audit, then I suggest reading the Orange Book or the Common Criteria documents for suggestions on what needs to be audited and what information to put in the events.

      For accounting, examine the RADIUS RFCs.

      Hope this helps.

  2. Re:As little as practically possible by 0123456 · · Score: 5, Insightful

    "Otherwise don't do much logging because it will hurt application performance, sometimes drastically."

    You're assuming that performance -- or, more precisely, CPU usage -- is important; in many cases, reliability (and being able to track down bugs after a crash) are far more important than CPU usage. With quad-core CPUs so cheap these days, we can easily afford to spend another thousand dollars to throw more processing power into a system which has cost a couple of hundred thousand dollars of programmer time to develop and will cost thousands of dollars an hour for any downtime.

  3. not cpu bound... disk bound by JonTurner · · Score: 5, Insightful

    It's not CPU that's at a premium, it's disk IO. And on virtualized machines (such as is extremely popular in corporations and hosting farms) where there might be four different OSs running on the same physical hardware, disk becomes a scarce resource very, very quickly. And not only does your virtualized server go to shit, it takes the others down with it since they can't get timely disk access, either.

  4. As much as practically possible by Anonymous Coward · · Score: 5, Insightful

    On the other hand, broken code hurts application performance, sometimes drastically.

    I'm an SQA engineer with years of experience working with large scale enterprise systems. Generally speaking the cost of unexpected outages or data corruption outweighs the cost of hardware. In such systems the costs of deployment activity itself can be such that you'd rather pay for more hardware to support extremely verbose logging.

    Sure, boneheaded logging can cause unnecessary performance hits, the obvious example being logging in a loop when logging at entry and exit would have sufficed. But that's not what we are really talking about here. You posited that you should do as little logging as practically possible, and I believe that you are wrong.

    Log lots and log often. Just do so intelligently. Use a logging framework (log4j, log4net, log4perl etc) and set the priority appropriately. Only use ERROR for real errors (unexpected code paths or data), use WARN when a performance metric is hitting a soft limit (to warn you before you hit that hard limit), and use DEBUG to verbosely log anything else of general interest. Rarely you might also want to log in an extremely verbose manner data that wouldn't ordinarily be interesting, and this should be logged at a TRACE level. Generally speaking if this is the case then the code itself is due for a refactor. FATAL should normally be reserved for errors that prevent correct startup - generally if an application runs correctly at startup then any potential faults that you see and handle now become ERRORs because there is nearly always something better an application can do than log FATAL and exit. As the OP observed, if you have a potential fault that kills your application and you don't see and handle it then you don't have the opportunity to log FATAL anyway.

    By using a logging framework, many logging pitfalls can be avoided because the framework itself provides well tested facilities. eg, time-stamping, log rotation, file-handle management etc. In addition, using a framework allows the operator to tune the logging on a very granular level. This allows for a trade off to be made where if a performance impact is noted in a well used class then its logging can be reduced at runtime. Sure, there is still a small performance impact because the logging framework has to do a "if (logMessage.logLevel >= loggingClass.logLevel ) then {...}" comparision, but in the scale of things that impact is tiny.

    My profession is not about finding and fixing bugs. It is about understanding the software that is being delivered and deployed. It is about understanding what defects exist (or may exist) and the possible implications of those defects. It is about reducing the risk of defects through analysis. Analysis of the software's functionality, analysis of the software's performance, analysis of the processes used to produce the software itself. You will never be yelled at for releasing software with a well understood and documented defect, but the shit will hit the fan when you release major defects that are not understood.

    Logging is an _invaluable_ tool in this analysis. You'd be a fool to not use it effectively.

  5. Re:As little as practically possible by darkpixel2k · · Score: 5, Funny

    ...nothing's as frustrating as trying to analyze a log that came in from the field where there's just no information about what went wrong.

    [ 56.529336] WARNING: A-fib detected
    [ 56.568802] INFO: charging defib
    [ 56.741096] INFO: charging ccomplete
    [ 57.218803] ALERT: shocking!
    [ 58.061815] Buffer I/O error on /dev/paddles
    [ 58.163210] zapper[22402]: segfault at 000000c4 eip b321bf5f esp b320a870 error 6
    BUG: unable to handle kernel NULL pointer dereference at virtual adress 00000000
    printing eip: c013186b *pde = 00000000
    Oops: 0000 [#1] SMP
    Modules linked in: battery ipv6 paddles ac button battery cardiac_monitor thermal processor zapper fan

    Man, I'd feel bad after that output...

    --
    There's no place like ::1 (I've completed my transition to IPv6)