About the Author: Martin Etmajer

Martin has several years of experience as a Software Architect as well as in monitoring and managing performance in highly available cluster environments. In his current role as a Technology Strategist at the Compuware APM Center of Excellence, he contributes to the strategic development of Compuware’s dynaTrace APM solution, where he focuses on performance monitoring in Cloud technologies and along the Continuous Delivery deployment pipeline. Reach him at @metmajer

Software Quality Metrics for your Continuous Delivery Pipeline – Part III – Logging

Let me ask you a question: would you say that you have implemented logging correctly for your application? Correct in the sense that it will provide you with all the insights you require to keep your business going once your users are struck by errors? And in a way that does not adversely impact your application performance? Honestly, I bet you have not. Today I will explain why you should turn off logging completely in production because of its limitations:

  • Relies on Developers
  • Lacks Context
  • Impacts Performance

Intrigued? Bear with me and I will show you how you can still establish and maintain a healthy and useful logging strategy for your deployment pipeline, from development to production, guided by metrics.

What Logging Can Do for You

Developers, including myself, often write log messages because they are lazy. Why should I set a breakpoint and fire up a debugger if it is so much more convenient to dump something to my console via a simple println()? This simple, yet effective mechanism also works on headless machines where no IDE is installed, such as staging or production environments:

System.out.println("Been here, done that.");

Careful coders would use a logger to prevent random debug messages from appearing in production logs and additionally use guard statements to prevent unnecessary parameter construction:

if (logger.isDebugEnabled()) {
  logger.debug("Entry number: " + i + " is " + String.valueOf(entry[i]));
}
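If you happen to use a logging facade such as SLF4J, parameterized messages are a further option: the message string is only assembled when the level is actually enabled, so the explicit guard is often unnecessary. A minimal sketch, assuming an SLF4J logger:

// With parameterized messages, the string is only built if DEBUG is enabled:
logger.debug("Entry number: {} is {}", i, entry[i]);

A guard is still worthwhile when computing the arguments themselves is expensive.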

Anyway, the point about logging is that traces of log messages allow developers to better understand what their program is doing in execution. Does my program take this branch or that branch? Which statements were executed before that exception was thrown? I have done this at least a million times, and most likely so have you:

if (condition) {
  logger.debug("7: yeah!");
} else {
  logger.debug("8: DAMN!!!");
}

Test Automation Engineers, usually developers by trade, likewise use logging to better understand whether the code under test complies with their test scenarios:

class AgentSpec extends spock.lang.Specification {
  def "Agent.compute()"() {
    given:
    def agent = AgentPool.getAgent()
    def expected = TestFixtures.getExpectedResult()  // illustrative fixture for the expected value

    when:
    def result = agent.compute(TestFixtures.getData())

    then:
    logger.debug("result: " + result)
    result == expected
  }
}

Logging is, undoubtedly, a helpful tool during development and I would argue that developers should use it as freely as possible if it helps them to understand and troubleshoot their code.

In production, application logging is useful for tracking certain events, such as the occurrence of a particular exception, but it usually fails to deliver what it is so often mistakenly used for: as a mechanism for analyzing application failures in production. Why?

Because approaches to achieving this goal with logging are inherently brittle: their usefulness depends heavily on developers, messages lack context, and, if not designed carefully, logging may severely slow down your application.

Secretly, what you are really hoping to get from your application logs, in one form or another, is something like this:

A logging strategy that delivers out-of-the-box using dynaTrace: user context, all relevant data in place, zero config

The Limits of Logging

Logging Relies on Developers

Let's face it: logging is, inherently, a developer-centric mechanism. The usefulness of your application logs stands and falls with your developers! A best practice for logging in production says: "don't log too much" (see Optimal Logging @ Google testing blog). This sounds sensible, but what does it actually mean? If we recall the basic motivation behind logging in production, we could rephrase this as "log just enough information about a failure to enable adequate action". So, what would it take for your developers to provide such actionable insights? They would need to correctly anticipate where in the code errors will occur in production. They would also need to collect the relevant bits of information along an execution path that carry these insights and, last but not least, present them in a meaningful way so that others can understand them, too. Developers are, no doubt, a critical factor in the usefulness of your application logs.

Logging Lacks Context

Logging during development is so helpful because developers and testers usually examine smaller, co-located units of code that are executed in a single thread. It is fairly easy to maintain an overview under such controlled conditions, for example in a test scenario:

13:49:59 INFO com.company.product.users.UserManager - Registered user 'foo'.
13:49:59 INFO com.company.product.users.UserManager - User 'foo' has logged in.
13:49:59 INFO com.company.product.users.UserManager - User 'foo' has logged out.

But how can you reliably identify an entire failing user transaction in a real-life scenario, that is, in a heavily multi-threaded environment with multiple tiers that serve piles of distributed log files? I say, hardly at all. Sure, you can go mine for certain isolated events, but you cannot easily extract causal relationships from an incoherent, distributed set of log messages:

13:49:59 INFO com.company.product.users.UserManager - User 'foo' has logged in.
13:49:59 INFO com.company.product.users.UserManager - User 'bar' has logged in.

13:50:00 SEVERE org.hibernate.exception.JDBCConnectionException: could not execute query
  at org.hibernate.exception.SQLStateConverter.convert(SQLStateConverter.java:99)
  …

After all, the ability to identify such contexts is key to deciding why a particular user action failed.

Logging Impacts Performance

What is a thorough logging strategy worth if your users cannot use your application because it is terribly slow? In case you did not know, logging, especially during peak load times, may severely slow down your application. Let’s have a quick look at some of the reasons:

Writing log messages from the application’s memory to persistent storage, usually to the file system, demands substantial I/O (see Top Performance Mistakes when moving from Test to Production: Excessive Logging). Traditional logger implementations wrote files by issuing synchronous I/O requests, which put the calling thread into a wait state until the log message was fully written to disk.

In some cases, the logger itself can become a significant bottleneck: in the Log4j library (up to version 1.2), every single log activity results in a call to an internal template method, Appender.doAppend(), which is synchronized for thread safety (see Multithreading issues – doAppend is synchronised?). The practical implication is that all threads logging to the same Appender, for example a FileAppender, must queue up behind each other. Consequently, the application spends valuable time waiting on synchronization instead of doing whatever it was actually designed to do. This hurts performance, especially in heavily multi-threaded environments such as web application servers.
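One common mitigation in Log4j 1.2 is to wrap the file appender in an AsyncAppender, which buffers log events and writes them from a background dispatcher thread, so that calling threads only pay for enqueueing an event rather than for the disk I/O. A minimal programmatic sketch (file name, pattern and buffer size are illustrative; in practice this is usually configured in log4j.xml or log4j.properties, and exception handling is omitted):

import org.apache.log4j.AsyncAppender;
import org.apache.log4j.FileAppender;
import org.apache.log4j.Logger;
import org.apache.log4j.PatternLayout;

// Synchronous file appender: every append writes to disk while other threads wait.
FileAppender file = new FileAppender(new PatternLayout("%d %-5p [%t] %c - %m%n"), "application.log");

// AsyncAppender buffers events and flushes them from a background thread.
AsyncAppender async = new AsyncAppender();
async.setBufferSize(512); // number of events buffered before callers block
async.addAppender(file);
Logger.getRootLogger().addAppender(async);

Note that AsyncAppender's own doAppend() is still synchronized, so contention is reduced rather than eliminated, and a full buffer will still make callers wait.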

These performance effects can be vastly amplified when exception logging comes into play: exception data, such as the error message, the stack trace and any piggy-backed exceptions ("initial cause" exceptions), greatly increases the amount of data that needs to be logged. Additionally, once a system is in a faulty state, the same exceptions tend to appear over and over again, further hurting application performance. We once measured a 30% hit on CPU resources caused by more than 180,000 exceptions thrown within only 5 minutes on one of our application servers (see Performance Impact of Exceptions: Why Ops, Test and Dev need to care). Had we written these exceptions to the file system, they would have thrashed I/O, filled up our disk space in no time and considerably increased our response times.
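To make the amplification tangible, here is a sketch of what a single catch block writes once things go wrong (the repository call, logger and variable names are hypothetical):

try {
    orderRepository.loadOrders(userId); // hypothetical call that starts failing under load
} catch (RuntimeException e) {
    // One statement, but the appender serializes the message, the full stack trace
    // and every chained "caused by" exception - easily several kilobytes per event,
    // repeated for each failing request while the system stays in its faulty state.
    logger.error("could not load orders for user " + userId, e);
}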

Consequently, it is safe to say that insightful logging and performance are opposing goals: if you want more of the one, you have to compromise on the other.

Logging in Production

It is a widely adopted recommendation to restrict the logger to the most urgent messages only (for example, Level.SEVERE and possibly Level.WARNING in Java’s Logging API parlance) in production. Messages of Level.SEVERE indicate that something has definitely gone wrong and needs fixing. The next lower level, Level.WARNING, indicates the occurrence of an unexpected event from which the application was able to recover, but which would need further investigation. Now, if we follow this practice, which kind of information do we still get out of our production logs?


2014-04-15 07:43:30 SEVERE [SocketHandler] Connection closed - reason: 'error while reading first byte from socket', error: java.net.SocketException: Connection reset
2014-04-15 07:43:37 WARNING [EmailSender] Sending email failed: could not connect to SMTP host: 192.168.0.2, port: 2255

What does that tell you? Does it tell you that something has gone wrong? Indeed. Does it tell you whether these failures are related? No. Does it tell you if a single or multiple users were affected? No. Does it give you enough insights that allow you to take concrete actions? No. So, what would you actually lose by turning logging off in production?

A friend of mine was convinced he had set up logging properly, until his application started failing around Christmas time, when it hit a previously unseen number of users on the platform. The production logs proved unhelpful, and it took the team two weeks and several downtimes to stabilize the application – by mere guessing.

This is what a production log should look like:

2014-05-20 07:45:08 INFO Starting com.company.product.MyApplication
2014-05-20 07:45:08 INFO Version: 1.0, Built: 2014-05-19 02:21:06
2014-05-20 07:45:08 INFO Platform: Windows Server 2008 R2 6.1, amd64 64bit
2014-05-20 07:45:08 INFO VM: Java HotSpot(TM) 64-Bit Server VM 1.7.0_45, Oracle Corporation, mixed mode
2014-05-20 07:45:08 INFO Memory: Heap: 1024MB, PermGenSpace: 128MB
2014-05-20 07:45:08 INFO Ports: HTTP: 80, SSL: 8080
2014-05-20 07:45:08 INFO Started successfully.

Period. Not yet sold on the idea? Our own quality assurance department, for example, has established a quality gate which assures that a software update is only released to customers if there have been no errors or warnings for at least one week in our Compuware APM Community, which acts as a staging environment that is used by more than 20,000 real users per day on average. This is what one of their information radiators looks like:

QA Department Dashboard focusing on the metric: number of log messages

To be clear, I am not proposing that you leave your application unattended in production. What I claim is that logging cannot reliably deliver what it is usually intended for: an understanding of why a failure has happened in production, so that adequate actions can be taken.

What does this mean to you?

By staying alert to the number of log messages produced by your application along the stages of the deployment pipeline, you can assess whether you have implemented a healthy logging strategy for your application. Here are the key takeaways for your particular role:

  • Developers & Test Automation Engineers: Use logging as much as possible if it helps you to understand and troubleshoot your code. Be careful though: utilize the various log levels appropriately, so that your random log messages will not make it into production. Use guard statements to prevent performance penalties arising from unnecessary parameter construction when a particular log level is disabled. Test Automation Engineers should be aware of any occurrences of urgent messages (for example, Level.SEVERE and possibly Level.WARNING in Java’s Logging API parlance). These are serious indicators that your code is not yet ready to be released.
  • Performance Engineers: Stay alert of any urgent log messages during load tests. They indicate that the application is either not ready for production or that the application does not scale with the applied load.  Keep an eye on the total number of log messages: heavy logging activity when the system is exercised may falsify the observed performance metrics.
  • Operations: Prohibit log messages, other than those containing basic startup and configuration information and make sure that these are permanently turned off by monitoring the number of log messages.
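For the Operations takeaway, restricting a production system to urgent messages could look roughly like the following sketch, using Java's Logging API (logger names are illustrative; the same effect is typically achieved through a logging.properties file with a line such as com.company.product.level=WARNING):

import java.util.logging.Level;
import java.util.logging.Logger;

// Only SEVERE and WARNING messages from the application's packages pass through.
Logger appLogger = Logger.getLogger("com.company.product");
appLogger.setLevel(Level.WARNING);

// A dedicated logger for startup and configuration information stays at INFO.
Logger startupLogger = Logger.getLogger("com.company.product.startup");
startupLogger.setLevel(Level.INFO);
startupLogger.info("Starting com.company.product.MyApplication");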

Logging is not the right tool for analyzing application failures in production. Not only are developers a critical factor in the usefulness of your logs; the logs themselves also lack an essential prerequisite: context, which is key to deciding why a particular user action failed. Additionally, insightful logging and performance are opposing goals – you just can’t fully have both. However, if your tooling does not provide the necessary insights in case of a failure, you will lose the opportunity to serve your customers. If you think that I have a point here, stay tuned for my next article where I will show you how you can use dynaTrace (available as a 15 Days Free Trial) to approach failures in production both effectively and efficiently.

By the way, log management and analytics tools, such as Splunk and Sumo Logic, do a great job of processing, searching and analyzing log data, but they can only build upon what is already in your application log files. In another upcoming blog post I will show how dynaTrace and Splunk can be used together effectively to provide actionable insights that go well beyond log analytics.

To learn more about software quality metrics, be sure to check out Part I: Number of Requests per End-User Action and Part II: Number of SQL Statements per Transaction of this series.

Why not take a quick look at your production log files right now? How much value are you really getting from there? Feel free to share your thoughts with me.

In response to the heated debates on reddit.com/r/devops and reddit.com/r/programming, I am pleased to announce an upcoming follow-up article in which I will show you how application failures in production can be approached both effectively and efficiently.

Comments

  1. Steve Caron says:

    Excellent and highly relevant article. Can’t wait to read your next two posts.
