Let’s Not Play Blame Games
When the Operations team gets an alert about potential performance problems that users might be experiencing, it is usually either the infrastructure or the actual application that is causing those problems. Things get interesting when neither the ISP nor the application provider is willing to admit fault. Can we tell who is to blame? Could it be that it is neither the ISP nor the application provider?
The IT department of our customer, SerciaFood, a food production company from Sercia (names changed for commercial reasons), received complaints about the performance of one of its applications. The IT department suspected network problems while the local ISP stood firmly behind its infrastructure and blamed the solution provider.
It Is Not Our Infrastructure
The SerciaFood IT team recently tested a new application before rolling it into production. During the tests the team complained about the performance of that application; network problems were named as the most likely cause.
SerciaNet, a big name not only in Sercia but also worldwide, was delivering the network infrastructure for SerciaFood. The ISP began to monitor the network with manual traces and other techniques; it could not, however, provide any strong evidence that the problem was not on its side.
The Operations team at SerciaFood was appointed to look into the problem using a real user monitoring tool.
Its first observation was that Network Performance, i.e., the percentage of traffic that did not encounter network-related issues (Server Loss Rate, Client RTT and Errors), varied between the regions where SerciaFood services were used.
Figure 1 shows a report where both Network and Application Performance metrics for EMEA are good. EMEA is the most active region on the report since it is where the core business operations of SerciaFood are focused. Other, more distant regions reported performance problems; the second most active region, “3rd Party”, reported high Client RTT and Server Loss Rate. Client RTT is the time it takes the SYN packet (sent by a server) to travel from the APM probe to the client and back again. Server Loss Rate is the percentage of total packets sent from a server that were lost and needed to be retransmitted.
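To make the two definitions above concrete, here is a minimal Python sketch of how such metrics could be computed from per-session counters. It is not how the monitoring tool calculates them: the Session record, its field names, and the RTT and loss thresholds are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Session:
    """Hypothetical per-session counters; field names are illustrative."""
    server_packets_sent: int
    server_packets_retransmitted: int
    client_rtt_ms: float
    had_network_error: bool


def server_loss_rate(sessions: Sequence[Session]) -> float:
    """Percentage of packets sent by the server that had to be retransmitted."""
    sent = sum(s.server_packets_sent for s in sessions)
    lost = sum(s.server_packets_retransmitted for s in sessions)
    return 100.0 * lost / sent if sent else 0.0


def is_network_affected(s: Session, max_rtt_ms: float, max_loss_pct: float) -> bool:
    """A session counts as affected if it breaches any (assumed) threshold."""
    loss_pct = (
        100.0 * s.server_packets_retransmitted / s.server_packets_sent
        if s.server_packets_sent
        else 0.0
    )
    return s.had_network_error or s.client_rtt_ms > max_rtt_ms or loss_pct > max_loss_pct


def network_performance(
    sessions: Sequence[Session],
    max_rtt_ms: float = 200.0,   # assumed threshold, not a product default
    max_loss_pct: float = 1.0,   # assumed threshold, not a product default
) -> float:
    """Percentage of sessions that did not encounter network-related issues."""
    if not sessions:
        return 100.0
    ok = sum(1 for s in sessions if not is_network_affected(s, max_rtt_ms, max_loss_pct))
    return 100.0 * ok / len(sessions)
```

With one clean session and one session showing 5% retransmissions and a 350 ms round trip, this sketch reports a Server Loss Rate of 2.5% and a Network Performance of 50%. The values are made up and are not taken from the SerciaFood reports.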
How Is the Network Performance in EMEA?
The Operations team decided to first confirm what was indicated in Figure 1: the key business region, EMEA, was not affected by network problems.
Figure 2 shows a report with all areas monitored within the EMEA region. According to this report the performance is consistently good, with about 2.5 sec of Operation Time and no network-related problems (100% Network Performance) for all areas within the EMEA region.
After a drill-down to one of the user sites (Switzerland), the report shows that the operation time is spent almost entirely on the server side and that the network performance is good too (see Figure 3).
Another drill-down to the report with transactions executed at that site (see Figure 4) shows that although Server Time varies between transactions, the Network Time remains consistently below 400 ms. The differences in Server Time result from the different computational complexity of these transactions. For example, responding to a Query is likely to be more demanding than responding to a Get File request.
The Operations team decided to further investigate two transactions: one that should be heavy on the network (Get File) and one that might be heavy on the server side (Query). The former mostly just delivers files to the client application, while the latter requires more computational power on the server to execute the query. The performance of the former is good, with an almost even split between server and network time (see Figure 5), which does not indicate any network-related problems. The operation time for the latter is almost exclusively spent on the server, with negligible network impact (see Figure 6).
The Operations team concluded, based on the analyzed traffic in the EMEA region, that at least in that region the performance was good and that it was not affected by the network infrastructure delivered by SerciaNet.
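The server/network comparison described for these two transactions can be sketched in a few lines of Python. The sketch assumes that operation time consists only of a server part and a network part, which is a simplification; the function names and the tolerance value are hypothetical, and the sample numbers are illustrative rather than read from the figures.

```python
def time_breakdown(server_time_s: float, network_time_s: float) -> dict:
    """Split operation time into server and network shares (in percent)."""
    total = server_time_s + network_time_s
    if total <= 0:
        raise ValueError("operation time must be positive")
    return {
        "operation_time_s": total,
        "server_pct": 100.0 * server_time_s / total,
        "network_pct": 100.0 * network_time_s / total,
    }


def dominant_component(
    server_time_s: float,
    network_time_s: float,
    tolerance_pct: float = 10.0,  # assumed band for calling the split "balanced"
) -> str:
    """Return 'server', 'network', or 'balanced' for one transaction."""
    shares = time_breakdown(server_time_s, network_time_s)
    diff = shares["server_pct"] - shares["network_pct"]
    if abs(diff) <= tolerance_pct:
        return "balanced"
    return "server" if diff > 0 else "network"


if __name__ == "__main__":
    # Illustrative values only: an even split and a server-bound transaction.
    print(dominant_component(0.3, 0.3))
    print(dominant_component(2.4, 0.1))
```

A file transfer with an even split is not a warning sign on its own, since moving data is the whole point of the operation; what matters is whether the absolute network time stays low and stable.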
Who Is Really Affected?
The question remained: why were some users reporting performance problems? From the overview report (see Figure 1) the Operations team decided to drill down through 3rd Party, the region with the lowest Network Performance and the highest Server Loss Rate.
This region reported poor Network Performance, below 50%, and a significant contribution of the network component to the Operation Time (see Figure 7).
Figure 8 shows the report with a list of transactions for the affected user site. Although Server and Network Time vary between transactions, the Application Performance for all transactions is low, down to 0% for the Get File and Query transactions.
Further analysis of the Get File operation across different users shows a significant contribution of the Network Time (see Figure 9). The Network Time for both operations is inconsistent; it took four times longer to deliver the results of the Query operation to the second user than to the first one (see Figure 10). This might indicate that the users represented in this report connect to the SerciaFood applications through different ISPs.
Based on that analysis the Operations team could determine that some users did in fact experience performance problems caused by network issues. Further investigation revealed that the users who were experiencing poor performance were not connecting to the SerciaFood application over the SerciaNet infrastructure but were instead working remotely through a VPN, using various ISPs.
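The kind of inconsistency described here can be expressed as a simple spread ratio between the slowest and the fastest user for the same operation. The following Python sketch is only an illustration: the function names, the user labels, and the threshold of 2.0 are assumptions, and the sample times are invented rather than taken from Figure 10.

```python
from typing import Mapping


def network_time_spread(network_times_s: Mapping[str, float]) -> float:
    """Ratio of the slowest to the fastest network time among users."""
    values = list(network_times_s.values())
    if not values or min(values) <= 0:
        raise ValueError("need at least one positive network time")
    return max(values) / min(values)


def is_inconsistent(
    network_times_s: Mapping[str, float],
    max_ratio: float = 2.0,  # assumed cut-off for "inconsistent"
) -> bool:
    """Flag an operation whose network time differs widely between users."""
    return network_time_spread(network_times_s) > max_ratio


if __name__ == "__main__":
    # Invented per-user network times (seconds) for one operation.
    query_times = {"user_a": 0.5, "user_b": 2.0}
    print(network_time_spread(query_times), is_inconsistent(query_times))
```

A large spread for the same operation on the same server suggests that the difference lies in the path between the user and the server, not in the application.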
When operating a service accessed by users from various locations it is important to remember that the end user experience may vary, sometimes significantly. In the case of SerciaFood its most active users were coming from the EMEA region, which was served by the SerciaNet infrastructure. However, the second most active group of users was connecting to the SerciaFood services via VPN. Since these users relied on general internet connections, their experience was affected by poor network quality. Different users were connected through different ISPs; as a result the Network Performance in the 3rd Party region was inconsistent.
Using Compuware dynaTrace Data Center Real User Monitoring (DCRUM) the Operations team was able to show evidence, which SerciaNet could not gather otherwise, that the problems were caused neither by the SerciaNet infrastructure nor by the application itself. They were, in fact, experienced only by remote users connecting via VPN, who were negatively impacted by their ISPs' network performance problems.
(This article was based on materials contributed by James Neal based on original customer data. Screens presented may differ in most recent releases of the product while delivering the same value.)