Automatic Error Detection in Production – Contact your Users before they Contact You
In my role I am responsible for our Community and our Community Portal. For the portal to be accepted by our users, I need to ensure that they find the content they are interested in. In a recent upgrade we added lots of new multimedia content that makes it easier for our community members to get educated on Best Practices, First Steps, …
Error in Production: 3rd Party Plugin prevents users from accessing content
Here is what happened today when I figured out that some of our users actually had a problem accessing some of the new content. I was able to directly contact these individual users before they reported the issue. We identified the root cause of the problem and are currently working on a permanent fix preventing these problems for other users. Let me walk you through my steps.
Step 1: Verify and Ensure Functional Health
One dashboard I look at to check whether there are any errors on our Community Portal is the Functional Health Dashboard. dynaTrace comes with several Out-of-the-Box Error Detection Rules. These rules check, for example, for HTTP 500s, Exceptions thrown between Application Tiers (e.g. from our Authentication Web Service back to our Frontend System), Severe Log Messages, or Exceptions when accessing the Database.
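To make the idea of rule-based error detection concrete, here is a minimal sketch in Java. It is not the dynaTrace API: the `Transaction` and `ErrorRule` types, the field names, and the sample URLs are all hypothetical, and the rules only loosely mirror the categories listed above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch: none of these types belong to dynaTrace.
class ErrorRuleSketch {

    // Simplified view of one server-side transaction (assumed shape).
    static final class Transaction {
        final String url;
        final int httpStatus;
        final String exceptionClass; // null if no exception was captured
        final String logSeverity;    // e.g. "INFO" or "SEVERE"; null if nothing was logged

        Transaction(String url, int httpStatus, String exceptionClass, String logSeverity) {
            this.url = url;
            this.httpStatus = httpStatus;
            this.exceptionClass = exceptionClass;
            this.logSeverity = logSeverity;
        }
    }

    // A named rule: a transaction matching the predicate counts as failed.
    static final class ErrorRule {
        final String name;
        final Predicate<Transaction> matches;

        ErrorRule(String name, Predicate<Transaction> matches) {
            this.name = name;
            this.matches = matches;
        }
    }

    // Illustrative defaults, loosely following the rule categories in the text.
    static List<ErrorRule> defaultRules() {
        return List.of(
            new ErrorRule("HTTP 5xx response", t -> t.httpStatus >= 500 && t.httpStatus <= 599),
            new ErrorRule("Exception captured", t -> t.exceptionClass != null),
            new ErrorRule("Severe log message", t -> "SEVERE".equals(t.logSeverity)));
    }

    // Returns the names of all rules the transaction violates, in rule order.
    static List<String> violatedRules(Transaction t, List<ErrorRule> rules) {
        List<String> names = new ArrayList<>();
        for (ErrorRule rule : rules) {
            if (rule.matches.test(t)) {
                names.add(rule.name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        Transaction ok = new Transaction("/display/home", 200, null, "INFO");
        Transaction bad = new Transaction("/download/slides.pptx", 500,
                "java.lang.NullPointerException", null);
        System.out.println(violatedRules(ok, defaultRules()));
        System.out.println(violatedRules(bad, defaultRules()));
    }
}
```

A transaction is counted as failed as soon as it violates at least one rule, which is how a single failed-transaction count can be derived from several independent rules.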
The following screenshot shows the Functional Health Dashboard. As we monitor more than just our Community Portal with dynaTrace, I filter on this application. I see that we had 14 failed transactions in the last hour. It seems we also had several unhandled exceptions and several HTTP 400s between Transaction Tiers:
My First Step tells me that we have users that experience a problem.
Step 2: Analyze Errors
A click on the Error on the bottom right brings me to the error details allowing me to analyze what these errors are. The following screenshot shows the Error Dashboard with an overview of all detected Errors based on the configured Error Rules. A click on one Error Rule shows me the actual errors on the bottom. Seems we have a problem with some of our new PowerPoint slides we make available on our community portal:
Now I know what these errors are. Next is to identify the impacted users.
Step 3: Identify Impacted Users
A drill into our Business Transactions tells me which users were impacted by this problem. It turns out that 5 internal users (those with the short user names) and 2 actual customers had problems.
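The grouping behind such a per-user view can be sketched as follows. This is only an illustration, not how dynaTrace computes Business Transactions: the `FailedRequest` type, the sample user names, and the name-length threshold used to guess at internal accounts are all assumptions.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: types, names and the length threshold are assumptions.
class ImpactedUsersSketch {

    // One failed request, tagged with the user who issued it.
    static final class FailedRequest {
        final String user;
        final String url;

        FailedRequest(String user, String url) {
            this.user = user;
            this.url = url;
        }
    }

    // Counts failed requests per user, keeping users in first-seen order.
    static Map<String, Integer> failuresPerUser(List<FailedRequest> failures) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (FailedRequest f : failures) {
            counts.merge(f.user, 1, Integer::sum);
        }
        return counts;
    }

    // Heuristic only: treats user names up to maxInternalLength characters as internal.
    static boolean looksInternal(String user, int maxInternalLength) {
        return user.length() <= maxInternalLength;
    }

    public static void main(String[] args) {
        List<FailedRequest> failures = List.of(
            new FailedRequest("agr", "/download/slides-a.pptx"),
            new FailedRequest("agr", "/download/slides-b.pptx"),
            new FailedRequest("customer.one@example.com", "/download/slides-a.pptx"));
        failuresPerUser(failures).forEach((user, count) ->
            System.out.println(user + ": " + count + " failure(s), internal="
                    + looksInternal(user, 4)));
    }
}
```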
What is also interesting for me is to understand what these users were doing on our Community Portal. dynaTrace gives me the information about every Visit including all Page Actions with detailed Performance and Context Information. The following shows the activities of one of the users that experienced the problem. I can see how they got to the problematic page and whether they continued browsing for other material or whether they stopped because of this frustrating experience:
I now know exactly which users were impacted by the errors. I also know that, even though they had a frustrating experience, these users continued browsing other content. Just to be safe, I contacted them to let them know we are working on the problem and also sent them the content they couldn’t retrieve through the portal.
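The question "did the user keep browsing after the error?" can be answered from the ordered page actions of a visit. The sketch below assumes a hypothetical `PageAction` type with a simple failed flag; the real visit data contains far more detail.

```java
import java.util.List;

// Hypothetical sketch: PageAction is an assumed, heavily simplified type.
class VisitSketch {

    static final class PageAction {
        final String name;
        final boolean failed;

        PageAction(String name, boolean failed) {
            this.name = name;
            this.failed = failed;
        }
    }

    // Returns how many actions followed the first failed action in the visit,
    // or -1 if the visit contains no failed action at all.
    static int actionsAfterFirstFailure(List<PageAction> visit) {
        for (int i = 0; i < visit.size(); i++) {
            if (visit.get(i).failed) {
                return visit.size() - i - 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<PageAction> visit = List.of(
            new PageAction("Home", false),
            new PageAction("View slides", true),
            new PageAction("Best Practices", false),
            new PageAction("First Steps", false));
        System.out.println(actionsAfterFirstFailure(visit));
    }
}
```

A result of 0 would mean the failed page was the last thing the user did in that visit, which is the case most worth a follow-up.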
Step 4: Identify Root Cause and Fix Problem
My last step is to identify the actual root cause of these errors, because I want them fixed as soon as possible to prevent more users from being impacted. A drill into our PurePaths shows me that the error is caused by a NullPointerException thrown by the Confluence Plugin we use to display PowerPoint slides embedded in a page.
dynaTrace also captures the actual exception including the stack trace giving me just the information I was looking for.
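Reading such a stack trace usually comes down to finding the top-most frame that belongs to the suspect component. A minimal sketch using the standard `Throwable.getStackTrace()` API follows; the `renderSlide` method is a made-up stand-in for the failing plugin code, and in a real trace the prefix would be the plugin's package name, which is not given here.

```java
import java.util.Optional;

// Hypothetical sketch: renderSlide only simulates a failing plugin method.
class StackFrameSketch {

    // Returns the top-most stack frame whose class name starts with classPrefix.
    static Optional<StackTraceElement> firstFrameIn(Throwable t, String classPrefix) {
        for (StackTraceElement frame : t.getStackTrace()) {
            if (frame.getClassName().startsWith(classPrefix)) {
                return Optional.of(frame);
            }
        }
        return Optional.empty();
    }

    // Stand-in for plugin code: throws NullPointerException when animationPath is null.
    static int renderSlide(String animationPath) {
        return animationPath.length();
    }

    public static void main(String[] args) {
        try {
            renderSlide(null);
        } catch (NullPointerException e) {
            // Use this class's own name as the prefix, since it plays the plugin here.
            firstFrameIn(e, StackFrameSketch.class.getName())
                .ifPresent(frame -> System.out.println(
                    "Failing frame: " + frame.getClassName() + "." + frame.getMethodName()));
        }
    }
}
```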
Automatic Error Detection helped me to proactively work on problems and to contact my users before they reported the problem. In this particular case we identified a problem with the viewfile Confluence Plugin. If you use it, make sure you do not have path-based animations in your slides. It seems this is the root cause of the NullPointerException.
For our dynaTrace Users: If you are interested in more details on how to use dynaTrace, Best Practices or Self-Guided Walkthroughs then check out our updated dynaLearn Section on our Community Portal.
For those who want more information on how to become more proactive in application performance management, check out What’s New in dynaTrace 4.