About the Author

Sebastian is a Product Manager with Compuware APM. Since 2009 he has been actively involved in the Data Center Real User Monitoring (DC RUM) project developed by Compuware APM. Contact him at @skruk

Balancing the Load

A question that every online application provider will eventually face is: does my application scale? Can I add an extra 100 users and still ensure the same user experience? If the application architecture is properly designed, the easiest way is to put an additional server behind the load balancer to handle more traffic.

In this article we recount an incident at one of our clients, in which the cause of poor application performance was eventually traced to problems with load balancing of the application servers.

HTTP Server (500) Errors Go Through the Roof

Around 8am the Operations team at Rendoosia Inc. (name changed for commercial reasons) got an alert from the APM tool that one of three SharePoint servers was generating many HTTP Server (500) errors. All three servers were behind a load balancer, so the team decided to analyze the overall performance of all three servers using the report presented in Figure 1.

Figure 1. Overview of the three SharePoint servers behind one load balancer with some KPIs: usage, response time and number of errors; two servers show performance problems

The Operations team noticed the following issues:

  1. The x.x.x.155 server (row marked with the blue box) was under significantly lower load than the other two (7k operations compared to almost 30k on each of the others). Both the load and the number of users were shared equally between the other two servers: x.x.x.154 and x.x.x.156.
  2. Although server x.x.x.155 had the lowest user count, it was reporting the longest processing time.
  3. Server x.x.x.156 was reporting a high number of HTTP 5xx errors (marked with the red box).
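
The first observation amounts to comparing each server's share of operations against an even split. Below is a minimal Python sketch of such a check. The function name `find_underloaded`, the 50% tolerance, and the operation counts (rounded from the figures quoted above) are assumptions for illustration, not output or logic from the APM tool.

```python
def find_underloaded(ops_by_server, tolerance=0.5):
    """Return servers whose share of operations is below tolerance * fair share.

    ops_by_server maps a server name to its operation count.
    tolerance is a hypothetical threshold: 0.5 flags servers that receive
    less than half of what an even split would give them.
    """
    total = sum(ops_by_server.values())
    if total == 0:
        return []
    fair_share = 1.0 / len(ops_by_server)
    return sorted(
        server
        for server, ops in ops_by_server.items()
        if ops / total < tolerance * fair_share
    )


# Illustrative counts, rounded from the report described above.
ops = {"x.x.x.154": 30_000, "x.x.x.155": 7_000, "x.x.x.156": 30_000}
underloaded = find_underloaded(ops)
print(underloaded)
```

With these rounded counts, only x.x.x.155 falls below the threshold, which matches what the team saw in the report.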

The team charted HTTP server errors and the load, counted as the number of transactions, for all three servers over time (see Figure 2) to get a better understanding of the situation.

Figure 2. Distribution of the number of server errors and transactions count over time for all three servers; one server shows lower load

The team’s first observation, based on the above-mentioned reports, was that the x.x.x.155 server, which had the lowest number of users, was most likely not properly connected to the load balancer. To determine the cause of the high response time on this server, the team analyzed two reports:

  • A breakdown of the response time for x.x.x.155 into network, server and redirect times indicated that almost all of the time was spent on the server (see Figure 3).
  • A drill-down to the operations report, used to analyze the load on the server (see Figure 4), showed that one particular transaction took a long time to complete, resulting in low application performance and poor user experience.

Figure 3. Response time breakdown for x.x.x.155: most of the time is spent on the server

Figure 4. Drill-down in the context of the x.x.x.155 server shows the main KPIs per transaction executed on this server; one transaction is affected by performance problems

Next, the team analyzed the 5xx errors produced by the x.x.x.156 server. They drilled down to a PurePath of one of the transactions reporting these errors and learned that the problem was caused by a malfunctioning database connection pool (see Figure 5).

Figure 5. Drilldown through PurePaths to Error details reveals that the 5xx errors are caused by the database connection pool usage

The Operations team was also curious how the 5xx errors produced by the x.x.x.156 server were affecting the actual user experience. The team wondered whether user operations were equally distributed between the two servers connected to the load balancer, or whether users who were unlucky enough to be served by the x.x.x.156 server were stuck on that server. This kind of question was hard to answer by looking at a single SharePoint server, so the Operations team used the APM tool to answer it.

Figure 6. Users remain on the server on which they started their session

The report in Figure 6 shows that users were usually served by the same application server. Therefore, those who started their session on the x.x.x.156 server remained there, resulting in a consistently poor experience due to the bad performance of that server.
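
The same question can be asked of any request log that records which server handled each user's request. The Python sketch below computes the fraction of users who were only ever served by one server. The function name `sticky_user_ratio`, the user names and the sample log are all hypothetical; this is an illustration of the idea, not the way the APM report is built.

```python
from collections import defaultdict


def sticky_user_ratio(requests):
    """Fraction of users whose requests all went to a single server.

    requests is an iterable of (user_id, server) pairs.
    """
    servers_by_user = defaultdict(set)
    for user_id, server in requests:
        servers_by_user[user_id].add(server)
    if not servers_by_user:
        return 0.0
    sticky = sum(1 for servers in servers_by_user.values() if len(servers) == 1)
    return sticky / len(servers_by_user)


# Hypothetical request log: (user, server that handled the request).
sample = [
    ("alice", "x.x.x.154"), ("alice", "x.x.x.154"),
    ("bob", "x.x.x.156"), ("bob", "x.x.x.156"),
    ("carol", "x.x.x.154"), ("carol", "x.x.x.156"),
    ("dave", "x.x.x.156"),
]
ratio = sticky_user_ratio(sample)
print(f"{ratio:.2f}")
```

A ratio close to 1 suggests that sessions are pinned to a server, so a user who lands on a failing server is likely to stay there.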

Conclusions

Modern application performance management is not only about making sure that the application and database servers are operating without problems. We also need to set up the load balancer correctly and monitor the network infrastructure for potential problems affecting overall application performance.

The Operations team at Rendoosia Inc., using Compuware dynaTrace Data Center Real User Monitoring (DCRUM), could get in just a few clicks from the alert about HTTP Server (500) errors, through a holistic overview of application server KPIs, to the root cause of the problem.

Based on the unequal load among the three application servers, shown by the Requests breakdown in Figure 1 and the number of transactions in Figure 2, the team quickly determined that the x.x.x.155 server was not properly connected to the load balancer. Additional analysis showed that this server was also affected by the low performance of one of the operations.

This story shows us that even when only one server is experiencing performance problems, caused by many HTTP Server errors, the load balancer will not offload that server because it is not aware of those errors. That is why Operations teams need to monitor constantly, with properly configured alerts, for such outliers in application performance, even in load-balanced setups.
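
As an illustration of the kind of alert this calls for, the Python sketch below tracks the share of 5xx responses over a sliding window of recent requests for one server. The class name `ErrorRateMonitor`, the window size and the threshold are assumptions chosen for the example; they do not describe how DC RUM or any particular load balancer implements alerting.

```python
from collections import deque


class ErrorRateMonitor:
    """Track the most recent response codes for one server.

    window and threshold are hypothetical defaults for illustration.
    """

    def __init__(self, window=100, threshold=0.05):
        self.codes = deque(maxlen=window)  # oldest codes drop off automatically
        self.threshold = threshold

    def record(self, status_code):
        self.codes.append(status_code)

    def error_rate(self):
        """Share of 5xx responses among the codes currently in the window."""
        if not self.codes:
            return 0.0
        errors = sum(1 for code in self.codes if 500 <= code <= 599)
        return errors / len(self.codes)

    def should_alert(self):
        return self.error_rate() > self.threshold


# Hypothetical sequence of status codes observed for one server.
monitor = ErrorRateMonitor(window=10, threshold=0.2)
for code in [200, 200, 500, 200, 503, 500, 200, 200, 500, 200]:
    monitor.record(code)
print(monitor.error_rate(), monitor.should_alert())
```

Keeping one such monitor per server makes an outlier visible even when the aggregate error rate across all load-balanced servers still looks acceptable.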


(This article is based on materials contributed by Pieter Jan Switten and Pieter Van Heck, based on original customer data. Some of the screens presented are customized but deliver the same value as the out-of-the-box reports.)

Comments

  1. This is why it is important to find out where the website’s weakest point is and where the performance bottlenecks lie. This is also why it’s important to have a team like yours to make sure testing is done and to work towards building a crash-proof website.
