Tracing Intermittent Errors – Guest blog by Lucy Monahan from Novell
Lucy Monahan is a Principal Performance QA Engineer at Novell, and helps to manage their distributed Agile process.
Intermittent problems can be difficult to solve:
- a value may be corrupted at any point before it is used
- the transaction may involve multiple enterprise servers using remote calls and the root cause could be on any server
- to identify the point of corruption, one may need to monitor return values and arguments
- many test runs may be needed to reproduce the problem
- heavyweight tools may impede the ability to reproduce the problem, particularly if the problem is a race condition in a multi-threaded application
dynaTrace addresses these concerns with features that provide the relevant debug data with low impact on performance. Here we look at an example in which two servers are involved in a transaction: server1 runs a Java application that exposes REST endpoints and calls server2 remotely via SOAP. Occasionally a bad value is returned to the client along with a NumberFormatException. But where is the value being corrupted?
In this example, the error message accompanying the exception appears in several variants, such as
- java.lang.NumberFormatException: For input string: “”
- java.lang.NumberFormatException: multiple points
so hints about the root cause are not obvious.
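Both kinds of message can be produced by the JDK's own number-parsing code, which is part of why they say so little about the application logic. The following is a minimal sketch, not code from the application in this scenario: the class name `NumberFormatMessages` and the sample inputs are invented for illustration, and the exact message text is a JDK implementation detail that may vary between versions.

```java
// Illustrative only: two ordinary JDK parsing calls that fail with
// different NumberFormatException messages for different malformed inputs.
final class NumberFormatMessages {

    // Returns the exception message from Long.parseLong, or null if parsing succeeds.
    static String messageFromLong(String input) {
        try {
            Long.parseLong(input);
            return null;
        } catch (NumberFormatException e) {
            return e.getMessage();
        }
    }

    // Returns the exception message from Double.parseDouble, or null if parsing succeeds.
    static String messageFromDouble(String input) {
        try {
            Double.parseDouble(input);
            return null;
        } catch (NumberFormatException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // An empty digit string.
        System.out.println(messageFromLong(""));
        // A digit string containing two decimal points.
        System.out.println(messageFromDouble("1.2.3"));
    }
}
```

Seeing messages like these suggests that the text handed to the parser was already malformed, so the interesting question is where that text was assembled.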
The exception is observed by the client and is also written to the server1 log, but where does it originate? Is the exception occurring on server2 and being suppressed in the logging? Since dynaTrace is already running on your servers, simply open the Exceptions view on server2 and check whether any NumberFormatExceptions occurred. The Exceptions view shows all exceptions, not just the ones that have been logged.
In this case, the Exceptions view shows that no NumberFormatException occurred on server2. This does not necessarily rule out server2 as the root cause, since the value could be corrupted at any point, but it is an important data point.
Next, to learn more about the NumberFormatException on server1, choose View Details on the exception to see the full stack trace. If your application does not print the stack trace for exceptions, the Details view is extremely valuable because it shows the calling code.
Now we are ready to focus on server1, where the error was observed. We start by comparing a PurePath containing a successful transaction to a PurePath that throws the exception. If the value is being corrupted on server1, this comparison should reveal the root cause. If not, it will be time to look at server2. Here are the steps:
Step 1: Define your instrumentation (Sensor Packs)
Create a sensor pack made up of the packages observed in the exception stack trace. You can define the sensor pack manually, but if your environment has a high load of users it is strongly recommended to use the Include Callee Methods feature to create it.
One of the issues with debugging is that introducing too much debug output can affect the outcome and reduce the chances of reproducing the problem. Include Callee Methods addresses this by creating a highly refined set of rules that traces only the classes known to be of interest, and only the methods whose signatures match your calling code.
The Include Callee Methods feature uses PurePaths as its basis, similar to the steps here. If you have not used Include Callee Methods previously, the scenario in this blog post provides an example of how and why to use it.
Here is the sensor pack used in this scenario. The first two items were added manually and you’ll see that they did not match any of the tracing whereas the third item, added via Include Callee Methods and expanded to display the method signatures, will be observed in the tracing.
Step 2: Capture Context Information
Ensure that capturing of arguments and return values is enabled for the instrumented methods. This can be defined either globally for all methods or more specifically for individual methods in the Sensor Pack. This additional contextual information is key to analyzing problems in individual transactions.
Step 3: Run Tests
Run the test until the exception is observed and then open the Exceptions view.
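Because the failure is intermittent, "run the test until the exception is observed" usually means driving the same request concurrently and repeatedly. The following is a minimal sketch of such a loop, assuming the transaction can be wrapped in a `Callable`. The class `ReproLoop`, the method `runUntilFailure`, and the simulated task in `main` are hypothetical names invented for this illustration and are not part of dynaTrace or of the application described here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

final class ReproLoop {

    // Submits `task` on `threads` threads per round, for at most `maxRounds` rounds.
    // Returns the first exception thrown by the task, or empty if none was observed.
    static <T> Optional<Throwable> runUntilFailure(Callable<T> task, int threads, int maxRounds) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            for (int round = 0; round < maxRounds; round++) {
                List<Future<T>> pending = new ArrayList<>();
                for (int i = 0; i < threads; i++) {
                    pending.add(pool.submit(task));
                }
                for (Future<T> future : pending) {
                    try {
                        future.get();
                    } catch (ExecutionException e) {
                        // The task itself failed; report its original exception.
                        return Optional.of(e.getCause());
                    }
                }
            }
            return Optional.empty();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Optional.of(e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Stand-in for the real client request; fails on one specific call
        // to simulate an intermittent error.
        AtomicInteger calls = new AtomicInteger();
        Callable<Integer> simulated = () -> {
            if (calls.incrementAndGet() == 50) {
                throw new NumberFormatException("simulated intermittent failure");
            }
            return 0;
        };
        System.out.println(runUntilFailure(simulated, 4, 100));
    }
}
```

Stopping at the first failure keeps the captured data small, which makes the failing PurePath easier to find in the following steps.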
Step 4: Identify the transaction (PurePath) that caused the exception
Right-click the exception and Drill Down to its PurePaths to find the one that contains the exception. Note the PurePath ID (264) shown in the PurePath column.
Step 5: Identify all other transactions of the same type
Open Web Requests and Drill Down to PurePaths for the transaction web request. #264 is in the list. Notice also that its Duration is very short, only 10% of the typical duration, because the exception was thrown before the transaction was fully processed. Even without the PurePath ID, this duration anomaly is a hint as to which PurePath contains the exception.
Step 6: Compare the transactions to identify the difference
Using the PurePath Comparison plugin, the PurePath containing the exception can be compared to a PurePath with successful return values. The plugin compares return values and arguments as well as the performance of the PurePaths, and it can be downloaded from the Community Downloads: https://community.dynatrace.com/community/display/DL/PurePath+Comparison
Comparing PurePath #264 with #265, the plugin directs us via red highlighting to the second call to deserializeCalendar(). In #264 that call results in the exception, whereas in #265 it returns a valid value. We have identified the method containing the root cause.
Via the red shading in the upper right-hand pane for #265, the plugin also shows that #265 executed for longer than #264. The reason is that #264 exited early without completing its processing and thus took less time.
The conclusion is that the value is being corrupted within the deserializeCalendar() method.
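The idea behind the comparison in Step 6 can be pictured as walking two recorded call sequences side by side and stopping at the first call whose captured result differs. The sketch below is a conceptual illustration only, not the plugin's implementation: the `TraceDiff` and `Call` classes and the sample values are invented for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

final class TraceDiff {

    // One recorded method call: the method name and its captured result as text.
    static final class Call {
        final String method;
        final String result;

        Call(String method, String result) {
            this.method = method;
            this.result = result;
        }
    }

    // Returns the index of the first call at which the two traces differ,
    // or -1 if both traces are identical.
    static int firstDivergence(List<Call> good, List<Call> bad) {
        int common = Math.min(good.size(), bad.size());
        for (int i = 0; i < common; i++) {
            Call g = good.get(i);
            Call b = bad.get(i);
            if (!g.method.equals(b.method) || !Objects.equals(g.result, b.result)) {
                return i;
            }
        }
        // One trace ended early: the divergence is where the shorter one stops.
        return good.size() == bad.size() ? -1 : common;
    }

    public static void main(String[] args) {
        // Hypothetical traces; the captured values are invented for illustration.
        List<Call> successful = Arrays.asList(
                new Call("deserializeCalendar", "2011-03-01"),
                new Call("deserializeCalendar", "2011-03-02"));
        List<Call> failing = Arrays.asList(
                new Call("deserializeCalendar", "2011-03-01"),
                new Call("deserializeCalendar", "NumberFormatException"));
        System.out.println(firstDivergence(successful, failing)); // prints 1
    }
}
```

Index 1 here corresponds to the second call to deserializeCalendar(), the same call the red highlighting points to in the scenario above.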
The next step is to find the root cause within deserializeCalendar(), which is likely a classic Java issue relating to multi-threaded applications. In the best case the method is not long and will have few root cause candidates. Look for static objects or static declarations that are accessed by multiple threads and consider using synchronization or allocating objects instead. Your solution will take into consideration how much of the calling code has to be modified as well as whether any allocated objects would be large and long-lived.
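The body of deserializeCalendar() is not shown in this post, so the following is only a sketch of one common shape this problem takes. It assumes, purely for illustration, that the method parses dates with a static `java.text.SimpleDateFormat`, which the JDK documents as not synchronized. The class `CalendarParsing`, its method names, the pattern, and the sample value are all hypothetical. The sketch shows the unsafe shared instance next to the two remedies named above: synchronization and per-call allocation.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.concurrent.atomic.AtomicInteger;

final class CalendarParsing {

    private static final String PATTERN = "yyyy-MM-dd'T'HH:mm:ss";
    private static final String SAMPLE = "2011-03-01T12:30:45";

    // Shared without protection: the formatter keeps mutable state while parsing.
    private static final SimpleDateFormat SHARED_UNSAFE = new SimpleDateFormat(PATTERN);
    // Shared, but only ever used while holding its lock.
    private static final SimpleDateFormat SHARED_LOCKED = new SimpleDateFormat(PATTERN);

    interface Parser {
        Date parse(String text) throws ParseException;
    }

    // Problem pattern: concurrent callers can corrupt each other's parse state.
    static Date parseUnsafe(String text) throws ParseException {
        return SHARED_UNSAFE.parse(text);
    }

    // Remedy 1: synchronize access to the shared formatter.
    static Date parseSynchronized(String text) throws ParseException {
        synchronized (SHARED_LOCKED) {
            return SHARED_LOCKED.parse(text);
        }
    }

    // Remedy 2: allocate a formatter per call, so nothing is shared.
    static Date parsePerCall(String text) throws ParseException {
        return new SimpleDateFormat(PATTERN).parse(text);
    }

    // Parses SAMPLE from several threads and counts exceptions and wrong results.
    static int concurrentFailures(Parser parser, int threads, int parsesPerThread) {
        final long expected = referenceMillis();
        AtomicInteger failures = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < parsesPerThread; i++) {
                    try {
                        if (parser.parse(SAMPLE).getTime() != expected) {
                            failures.incrementAndGet();
                        }
                    } catch (ParseException | RuntimeException e) {
                        failures.incrementAndGet();
                    }
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return failures.get();
    }

    // Single-threaded reference result for SAMPLE.
    private static long referenceMillis() {
        try {
            return new SimpleDateFormat(PATTERN).parse(SAMPLE).getTime();
        } catch (ParseException e) {
            throw new IllegalStateException("SAMPLE must match PATTERN", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("synchronized: " + concurrentFailures(CalendarParsing::parseSynchronized, 8, 5000));
        System.out.println("per call:     " + concurrentFailures(CalendarParsing::parsePerCall, 8, 5000));
        // The unsafe variant may or may not fail on a given run.
        System.out.println("unsafe:       " + concurrentFailures(CalendarParsing::parseUnsafe, 8, 5000));
    }
}
```

The trade-off mirrors the one described above: synchronization leaves the calling code almost unchanged but serializes callers, while per-call allocation avoids contention at the cost of creating a short-lived object on every call.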
After modifying your code to resolve the exception, rerun your test and compare the execution time of this Web Request to the previous execution time to ensure that no new performance problems have been introduced.
About the Author
Lucy Monahan is a principal performance QA engineer at Novell working on secure identity management products and helps to manage her group’s distributed Agile process. With many years of experience in QA engineering she has worked on both automated functional and performance testing for application servers and server-side applications. Lucy is a graduate of UMass Boston.