About the Author: Andreas Grabner

Andreas Grabner has been helping companies improve their application performance for 15+ years. He is a regular contributor to the Web Performance and DevOps communities and a prolific speaker at user groups and conferences around the world. Reach him at @grabnerandi.

Field Report: 5 Minutes to Identify a Production Problem and its Impact

It is time to share another story that most readers who own or operate web applications will be able to relate to. We recently changed the authentication services for most of our production websites. The website I am responsible for is the Compuware APM Community. A change in authentication is critical, which is why we tested it on our staging system before rolling it out to production. Everything looked good. Once the change was deployed to our production site, we wanted to make sure that we hadn’t missed anything. It turns out we had: users in specific user groups could no longer access certain content.

Let me walk you through the 5 minutes it took me to identify the problem, assess its impact, and give our operations department enough information to fix it.

Question 1: Are there any problems we haven’t seen in staging?

Opening the Application Overview shows that we have a very high failure rate on a certain transaction of our Community Portal:

Application Overview shows a High Failure Rate on one of our Transactions

To answer the first question: YES – we have a problem!
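For readers without such a dashboard, the same kind of number can be approximated from plain request records. The following is a minimal sketch, not how our monitoring product computes it: the transaction names and the sample data are made up, and a request is counted as failed whenever its HTTP status is 400 or higher.

```python
from collections import defaultdict

# Hypothetical request records: (transaction name, HTTP status code).
requests = [
    ("View Community Page", 200),
    ("View Community Page", 403),
    ("View Community Page", 403),
    ("View Community Page", 200),
    ("Search", 200),
    ("Search", 200),
]


def failure_rates(records):
    """Return {transaction: fraction of requests with status >= 400}."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for transaction, status in records:
        totals[transaction] += 1
        if status >= 400:  # assumption: every 4xx/5xx response counts as a failure
            failures[transaction] += 1
    return {name: failures[name] / totals[name] for name in totals}


rates = failure_rates(requests)
# List the transactions with the highest failure rate first.
for name, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {rate:.0%} failed")
```

Sorting by rate rather than by absolute error count keeps a low-traffic but badly broken transaction from hiding behind a busy, healthy one.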

Question 2: What is the problem?

The next step is to look at the automatically detected errors. They show that the failures are HTTP 4xx responses, meaning that many users are being denied access to several pages:

Access Denied Problems are the root cause of the high failure rate

We now know that access to these pages is being denied. What we do not know yet is whether this is a real problem or just users trying to access restricted content.
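To illustrate what such an error breakdown involves, here is a small sketch that groups failed requests by status class and by page. It assumes the failures are available as (page, status) pairs; the paths are invented for the example. An access denied normally shows up as status 403 (Forbidden), or 401 when the user is not authenticated at all.

```python
from collections import Counter

# Hypothetical failed requests: (page path, HTTP status code).
failed_requests = [
    ("/docs/private-guide", 403),
    ("/docs/private-guide", 403),
    ("/forum/partners", 403),
    ("/downloads/old-link", 404),
]


def status_class(status):
    """Map a status code to its class, e.g. 403 -> '4xx'."""
    return f"{status // 100}xx"


def summarize(records):
    """Count failures per status class and per (page, status) pair."""
    by_class = Counter(status_class(status) for _, status in records)
    by_page = Counter(records)
    return by_class, by_page


by_class, by_page = summarize(failed_requests)
print(by_class)
# Show the pages that fail most often, together with the status they return.
for (page, status), count in by_page.most_common():
    print(f"{count} x {status} on {page}")
```

A breakdown like this only tells us that access is denied and where. It cannot tell us whether the denial is legitimate.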

Question 3: Is this a real problem and if so – what information can I provide to Operations to fix this problem?

As mentioned above, it could be that many users are simply trying to access restricted content. In that case we would be OK with these errors, as they would be expected. Looking into the underlying error information, such as exceptions, we can see that the problem is actually related to our authentication service. It seems that not all security groups were migrated when we switched to the new authentication system:

Exception Details reveal that we have a problem with our security groups

This is enough information for Operations to go ahead and look into why these security groups have not been migrated.

Question 4: Which users are impacted? Can we reach out to them proactively to apologize?

Now that we know the problem is on our side, we want to know which users are impacted. As an Application Owner I want to proactively reach out to these users, explain that they seem to have run into a problem (even though they haven’t reported it yet), and let them know that we are actively working on a solution. With our User Experience Solution we get the full user context of every Visit that experienced these exceptions:

Visitors that were impacted by the Authentication Problem
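The idea behind this list can be sketched in a few lines: keep only the denied requests that carry the authentication exception, and group them by user. This is an illustration under assumptions, not the product’s implementation. The record layout, the user names, and the exception name `SecurityGroupNotFound` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical visit records: (user, page, HTTP status, exception text or None).
visits = [
    ("alice", "/docs/private-guide", 403, "SecurityGroupNotFound: partners"),
    ("bob", "/forum/partners", 403, "SecurityGroupNotFound: partners"),
    ("alice", "/forum/partners", 403, "SecurityGroupNotFound: partners"),
    ("carol", "/admin", 403, None),  # expected denial: no exception attached
    ("dave", "/home", 200, None),    # successful request
]


def impacted_users(records, marker="SecurityGroupNotFound"):
    """Map each user to the sorted pages where a 403 carried the marker exception."""
    pages = defaultdict(set)
    for user, page, status, exception in records:
        if status == 403 and exception and marker in exception:
            pages[user].add(page)
    return {user: sorted(found) for user, found in pages.items()}


for user, pages in impacted_users(visits).items():
    print(f"{user}: {', '.join(pages)}")
```

Filtering on the exception rather than on the status code alone is what separates users hit by the migration problem (alice and bob in this sample) from users who were correctly denied access (carol).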

Conclusion

It’s good that we tested this in staging, as we did solve other problems there. But it is even better to see what’s really going on in production, because it is not always possible to test every single scenario.

Comments

  1. Too bad I spent 10 minutes reading this article just to understand, that it is an advertising for your product… It would be nice if next time you use “dynatrace” keyword in the topic name or somewhere in the top of the article.

    • Hi D

      When I wrote this article I wanted to share the lesson we learned – which is: “Don’t only trust your testing efforts in staging”.
      Of course I use dynaTrace because that is the product we use internally to monitor our applications.

      Andi
