Field Report: 5 Minutes to Identify a Production Problem and its Impact
It is time to share another story that I am sure many of you who own or operate web applications can relate to. We recently changed the authentication service for most of our production websites. The website that I am responsible for is the Compuware APM Community. A change in authentication is critical, which is why we tested it on our staging system before rolling it out into production. Everything looked good. Once it was deployed to our production site we wanted to make sure that we hadn’t missed anything. It turns out we had: several users in specific user groups could no longer access certain content.
Let me walk you through the 5 minutes it took me to identify the problem, identify the impact and provide enough information to our operations department to fix the problem.
Question 1: Are there any problems we haven’t seen in staging?
Opening the Application Overview shows that we have a very high failure rate on a certain transaction of our Community Portal:
To answer the first question: YES – we have a problem!
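For readers who want to reproduce this kind of check without a dashboard, here is a minimal Python sketch of a per-transaction failure rate. The record format, the transaction names and the sample data are assumptions made up for illustration. They are not the data model of the tool used above.

```python
from collections import defaultdict


def failure_rates(requests):
    """Return {transaction: failed / total} for (transaction, failed) pairs."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for transaction, failed in requests:
        totals[transaction] += 1
        if failed:
            failures[transaction] += 1
    return {name: failures[name] / totals[name] for name in totals}


# Hypothetical sample data, not taken from the real system.
sample = [
    ("view_document", True),
    ("view_document", True),
    ("view_document", False),
    ("search", False),
    ("search", False),
]

rates = failure_rates(sample)
# The transaction with the highest failure rate is the one to look at first.
worst = max(rates, key=rates.get)
```

A transaction whose failure rate jumps right after a deployment is a good candidate for a problem that staging did not reveal.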
Question 2: What is the problem?
The next step is to look at the automatically detected Errors, which show that these problems are related to HTTP 4xx responses, meaning that many users are being denied access to several pages:
We now know that users are being denied access to these pages. What we do not know yet is whether this is a real problem or just users trying to access restricted content.
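To illustrate what "access denied" means within the 4xx range, the following sketch counts only 401 (Unauthorized) and 403 (Forbidden) responses per page and ignores other client errors such as 404. The log format and the URL paths are hypothetical and exist only for this example.

```python
from collections import Counter

# Status codes that indicate an access problem rather than a missing page.
ACCESS_DENIED = {401, 403}  # Unauthorized, Forbidden


def denied_by_page(entries):
    """Count access-denied responses per URL path from (path, status) pairs."""
    return Counter(path for path, status in entries if status in ACCESS_DENIED)


# Hypothetical log entries, not taken from the real system.
log = [
    ("/docs/release-notes", 403),
    ("/docs/release-notes", 403),
    ("/forum/beta", 401),
    ("/docs/missing", 404),
    ("/home", 200),
]

denied = denied_by_page(log)
```

Grouping by page like this shows where access is denied, but not yet why. That still requires the underlying error details.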
Question 3: Is this a real problem and if so – what information can I provide to Operations to fix this problem?
As mentioned before, it could be that many users are simply trying to access restricted content. In that case we would be OK with these errors, as they would be expected. Looking into the underlying error information, such as exceptions, we can see that the problem is actually related to our authentication service. It seems we did not migrate all security groups when switching to the new authentication system:
This is enough information for Operations to go ahead and look into why these security groups have not been migrated.
Question 4: Which users are impacted? Can we reach out to them pro-actively to apologize?
Now that we know this is a problem on our side, we want to know which users are impacted. As an Application Owner I want to pro-actively reach out to these users, explaining that they seem to have run into a problem (even though they haven’t reported it yet) and letting them know that we are actively working on a solution. With our User Experience Solution we get the full user context of every Visit that experienced these exceptions:
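Outside of any specific tool, the step from "visits with exceptions" to "users to contact" can be sketched in a few lines of Python. Everything here is an assumption for illustration: the visit record layout, the user names and the exception name `GroupNotFoundException` are invented and do not come from the real system.

```python
def impacted_users(visits, exception_name):
    """Return the sorted, distinct users with a visit that hit exception_name."""
    return sorted(
        {visit["user"] for visit in visits if exception_name in visit["exceptions"]}
    )


# Hypothetical visit records; the exception names are invented for this example.
visits = [
    {"user": "alice", "exceptions": ["GroupNotFoundException"]},
    {"user": "bob", "exceptions": []},
    {"user": "alice", "exceptions": ["GroupNotFoundException"]},
    {"user": "carol", "exceptions": ["GroupNotFoundException", "TimeoutException"]},
]

to_contact = impacted_users(visits, "GroupNotFoundException")
```

Using a set removes duplicates, so a user who ran into the problem on several visits appears only once on the contact list.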
It’s good that we tested this in staging as we have also solved problems there. But it is even better to really see what’s going on in the real production world as it is not always possible to test every single scenario.