About the Author: Martin Etmajer

Martin has several years of experience as a Software Architect as well as in monitoring and managing performance in highly available cluster environments. In his current role as a Technology Strategist at the Compuware APM Center of Excellence, he contributes to the strategic development of Compuware’s dynaTrace APM solution, where he focuses on performance monitoring in Cloud technologies and along the Continuous Delivery deployment pipeline. Reach him at @metmajer.

How to Approach Application Failures in Production

In my recent article, “Software Quality Metrics for your Continuous Delivery Pipeline – Part III – Logging”, I wrote about the good parts and the not-so-good parts of logging and concluded that logging usually fails to deliver what it is so often mistakenly used for: a mechanism for analyzing application failures in production. In response to the heated debates on reddit.com/r/devops and reddit.com/r/programming, I want to demonstrate the wealth of out-of-the-box insights you can obtain from a single urgent, albeit unspecific, log message if only you are equipped with the magic ingredient: full transaction context:

Examples of insights you could obtain from full transaction context on a single log message


Bear with me while I explain what this actually means and how it helps you get almost immediate answers to the most urgent questions when your users are struck by an application failure:
  • “How many users are affected and who are they?”
  • “Which tiers are affected by which errors and what is the root cause?”
Operator: I’m here because you broke something. (courtesy of ThinkGeek.com)


When All You Have is a Lousy Log Message

Does this story sound familiar to you? It’s a Friday afternoon, and the development team has just belatedly handed over the release artifacts, which need to go live by Monday morning. After spending the night and another day in operations getting the release into production on time, all you have to show for it the following Monday is some lousy log message:

08:55:26 SEVERE com.company.product.login.LoginLogic – LoginException occurred when processing Login transaction

While this scenario hopefully does not reflect the common case for you, it illustrates an important aspect of the life of development and operations: working as an operator involves monitoring the production environment and helping troubleshoot application failures mainly with the help of log messages – things that developers have baked into their code. While certainly not all log messages are as poor as this one, getting to the bottom of a production failure is often a tedious endeavor (see this comment on reddit by RecklessKelly, who sometimes needs weeks to reach his “Eureka moment”) – if it is possible at all.
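What would a less lousy version of that message look like? A minimal sketch, assuming hypothetical class, method and field names (the article’s application code is not shown anywhere), of how attaching the failing user, a transaction identifier and the root cause to the log call turns the same event into something an operator can act on:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Illustrative sketch only: LoginLogic's internals, the transaction id and the
// describeFailure helper are assumptions, not code from the article's application.
public class LoginLogic {
    private static final Logger LOG = Logger.getLogger(LoginLogic.class.getName());

    // Builds a log message that carries enough context to be actionable on its own.
    static String describeFailure(String username, String transactionId, Throwable cause) {
        return String.format("LoginException for user=%s txId=%s rootCause=%s",
                username, transactionId, cause.getMessage());
    }

    public static void main(String[] args) {
        Throwable cause = new IllegalStateException("account locked");
        // Instead of "LoginException occurred when processing Login transaction":
        LOG.log(Level.SEVERE, describeFailure("dominik", "tx-4711", cause), cause);
    }
}
```

Even this small improvement answers “who?” and “why?” without any tooling; the rest of this article shows what becomes possible when the tooling provides such context for you.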

Why There is No Such Thing as 100% Error-Free Code

Production failures can become a major pain for your business, with long-term effects: they will not only make your visitors buy elsewhere, but, depending on their level of frustration, your customers may choose to stay with your competitors instead of giving you another chance.

As we all know, we simply cannot get rid of application failures in production entirely. Agile methodologies, such as Extreme Programming or Scrum, aim to build quality into our processes; still, there is no such thing as a 100% error-free application. “We need to write more tests!” you may argue, and I would agree: disciplines such as TDD and ATDD should be an integral part of your software development process since, applied correctly, they help you produce better code with fewer bugs. Still, it is simply impossible to test each and every corner of your application for all possible combinations of input parameters and application state. Essentially, we can run only a limited subset of all possible test scenarios. The common goal of developers and test automation engineers must therefore be a testing strategy that allows them to deliver code of sufficient quality. Consequently, there is always a chance that something goes wrong, and, as a serious business, you will want to be prepared for the unpredictable and have as much control over it as possible:

Why you cannot get rid of application failures in production: remaining failure probability


Without further ado, let’s examine some of the precious out-of-the-box insights you can obtain if you are equipped with full transaction context and are able to capture all transactions.

Why is this important? Because it enables you to see the contributions of input parameters, processes, infrastructure and users whenever a failure occurs, to solve problems faster, and to use the captured information, such as unexpected input parameters, to further improve your testing strategy.

Initial Situation: Aggregated Log Messages

Instead of crawling through a bunch of possibly distributed log files to determine the count of a particular log message, we may, first of all, want to have this done automatically for us as the messages happen. This gives a good overview of the respective message frequencies and facilitates prioritization:

Aggregated log events: severity, logger name, message and count


What we see here (an analysis view based on our PurePath technology) is that there have been 104 occurrences of the same log message in the application. We can also observe other captured event data, such as the severity level and the name of the logger instance (usually the name of the class that created the logger).

Question #1: How many users are affected and who are they?

Failed Business Transactions: “Logins” and “Logins by Username”


Having the full transaction context, and not just the log message, allows us to figure out which critical Business Transactions of our application are impacted. From the dashboard above we can observe that “Logins” and “Logins by Username” have failed: we see that 61 users attempted the 104 logins, and who these users were, identified by their usernames.
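To make the idea of “full transaction context” concrete: every event recorded while a transaction runs is tagged with business context such as the username. APM tools like dynaTrace attach this context transparently through instrumentation; the hand-rolled, purely illustrative sketch below (all names are my own) only demonstrates the principle with a thread-local context map:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a thread-local context so that any event recorded during a
// transaction automatically carries the business context (user, transaction name).
public class TransactionContext {
    private static final ThreadLocal<Map<String, String>> CONTEXT =
            ThreadLocal.withInitial(HashMap::new);

    static void put(String key, String value) { CONTEXT.get().put(key, value); }
    static void clear() { CONTEXT.remove(); }

    // Enriches any captured log/event message with the current transaction's context.
    static String enrich(String message) {
        return message + " " + CONTEXT.get();
    }

    public static void main(String[] args) {
        put("user", "dominik");
        put("transaction", "Login");
        // The unspecific message now identifies the affected user and transaction:
        System.out.println(enrich("LoginException occurred"));
        clear();
    }
}
```

With context like this attached to all 104 occurrences, “how many users are affected and who are they?” becomes a simple query instead of a log-file archaeology session.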

Question #2: Which tiers are affected by which errors?

Errors along those transaction paths that contain our template log message


The next step is to analyze what other types of errors happened. In this Errors view we see all errors, such as HTTP 500 responses or unhandled exceptions, that occurred in the application along those execution paths that contained our template log message. It is also interesting to observe which application tiers were affected by errors:

Health status and performance metrics of all involved application tiers

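A per-tier error breakdown like the one shown above falls out naturally once each captured error event knows which tier raised it. A minimal sketch, with made-up tier and error names, of that grouping step:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class ErrorsByTier {
    // A captured error event: which tier raised it and what kind of error it was.
    // Tier and error names are illustrative, not taken from the article's screenshots.
    record ErrorEvent(String tier, String error) {}

    // Breaks errors down per tier, as the Errors view above does.
    static Map<String, Map<String, Long>> breakdown(List<ErrorEvent> events) {
        return events.stream().collect(Collectors.groupingBy(
                ErrorEvent::tier, TreeMap::new,
                Collectors.groupingBy(ErrorEvent::error, TreeMap::new, Collectors.counting())));
    }

    public static void main(String[] args) {
        List<ErrorEvent> events = List.of(
                new ErrorEvent("Web Server", "HTTP 500"),
                new ErrorEvent("App Server", "LoginException"),
                new ErrorEvent("App Server", "LoginException"));
        breakdown(events).forEach((tier, errors) -> System.out.println(tier + ": " + errors));
    }
}
```

The hard part in practice is not this grouping but reliably capturing the tier attribution in the first place, which is exactly what transaction tracing provides.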

If we were dealing with a performance-related issue, we might also want to know about the health of an affected host belonging to one of our application tiers:

Health status and information on an involved host: all green


Question #3: What is the root cause of this failure?

To get to know more about the root cause, it helps to understand which actions a particular user or a group of users took that led to a failure:

Visits, associated user actions and performance metrics


This visits-centric representation allows us to better understand what caused a single user to experience the login problem, like user “dominik”, who visited our site from Surrey, British Columbia, Canada from 10:34 until 10:35 on June 16, 2014. During his stay he performed a couple of user actions, of which the action “click on ‘Login’” on the page “orange.jsf” failed.

By drilling further down, we can investigate the method call graph that was executed by the servers on behalf of the failed user action:

Call tree including method arguments, SQL statements, exceptions, log messages and performance metrics


Here we can observe that there were failed method invocations in the “/services/AuthenticationService” servlet. We also see any relevant method arguments, executed SQL statements, exceptions and log messages, as well as performance metrics for each call.

Conclusions

In this article, I have demonstrated what a useful strategy for analyzing application failures in production could look like: one that delivers insights out of the box and that comes with the magic ingredient: full transaction context for all transactions.

However, there’s even more to it: the right tool can help your teams thrive at DevOps. dynaTrace (available as a Free Trial), for example, allows you to record the monitored session to a file, which you can share among your teams. This mechanism not only secures evidence, it also enables collaboration: production failures are tracked by operations and communicated to and tackled by development teams. In parallel, test automation engineers can refine their testing and deployment strategies to prevent the same failure from happening again.

Feel free to share your thoughts with me.
