About the Author

Sebastian is a Product Manager with Compuware APM. Since 2009, he has been actively involved in the Data Center Real User Monitoring (DC RUM) project developed by Compuware APM. Contact him at @skruk.

Are We Under A DDoS Attack?

Ensuring the reliability and security of the IT infrastructure at a bank is not an easy task. It gets harder when the bank is very popular, with many branches spread across the country. The potential sources of problems affecting the end user experience (EUE) range from the internal infrastructure and the internal banking application to the various ISPs providing access for bank branches and for end users of the e-banking solution.

We recently worked with a customer in the banking and trading business who had to deal with users complaining about being unable to make trades due to bad performance of our customer’s e-banking system. The first look at the available data made them assume they were under a Distributed Denial of Service (DDoS) attack; it almost triggered their emergency procedures including restarts of their platform, which would result in significant downtime. A closer look at the data captured by their application performance management (APM) solution revealed a different story: the business impact was caused by a technical implementation issue triggered by a single power user who executed a large number of transactions in a very short interval. This caused a performance problem with the entire platform, impacting all users.

In this blog post we walk you through the steps this customer took and the data they needed to analyze the problem. We recount how they identified that the real root cause of the business impact was not a malicious DDoS attack but a technical one.

Are We Under Attack?

It was just before noon when all the phones at the help desk of NBA, the National Bank of Abari (name changed for commercial reasons), went off. Clients from all over the country were reporting serious problems when accessing the e-banking service offered by NBA. The IT team was dispatched to analyze the situation and reported a large number of TCP/IP sessions opened on the bank's servers. The first thing that came to mind was:

We are under a DDoS attack!

The IT manager began to consider the security procedures, which among other things involved the tedious process of restarting all back-end servers, eventually resulting in an effective downtime of several minutes.

Not Necessarily …

In the meantime, the Operations team started to analyze the traffic using Compuware dynaTrace Data Center Real User Monitoring (DC RUM), which had recently been deployed throughout the National Bank of Abari. They noticed that although about 50% more pages were loaded than usual, this alone did not indicate a typical DDoS attack. The figure below shows only a slight increase in the total number of pages loaded, alongside an increased number of slow pages.

Figure 1. Increased Load was not the reason for the slowdown as there was only a slight increase in Page Volume

However, what was more important was that the time to load a page increased significantly. This, of course, resulted in growing frustration among the end users, which was reflected in the DC RUM reports showing an increased number of stopped pages. Both metrics are depicted in Figure 2: note the sudden increase in page load time and the spike in the number of stopped pages.

Figure 2 shows how DC RUM enables system operators to differentiate between patient and impatient users: long stopped pages reveal the number of pages that were stopped or reloaded after users had already waited a significantly longer time. The percentage of long stopped pages indicated that many users were patiently waiting to use the e-banking solution, eventually refreshing the web pages or simply aborting them by closing the web browser before those pages fully loaded. We can imagine the frustration growing among users when the system still did not work even after such a long wait.

Figure 2. Increased Page Load Time resulted in more users abandoning the site (Stopped Pages metrics)
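
To make this line of reasoning concrete, here is a minimal sketch in Python of how one might tell a volume-driven anomaly apart from a latency-driven slowdown using per-interval metrics similar to those in Figures 1 and 2. This is not DC RUM's API; the record layout and thresholds are illustrative assumptions only.

from dataclasses import dataclass

@dataclass
class IntervalStats:
    page_volume: int        # pages loaded in the interval
    avg_load_time_s: float  # average page load time in seconds
    stopped_pages: int      # pages aborted or reloaded by users

def classify(current: IntervalStats, baseline: IntervalStats) -> str:
    volume_ratio = current.page_volume / max(baseline.page_volume, 1)
    load_ratio = current.avg_load_time_s / max(baseline.avg_load_time_s, 0.001)
    stopped_ratio = current.stopped_pages / max(baseline.stopped_pages, 1)
    if volume_ratio > 5 and load_ratio > 2:
        return "suspected volumetric attack: investigate traffic sources"
    if load_ratio > 2 and stopped_ratio > 2:
        return "application or back-end slowdown: users are abandoning pages"
    return "within normal variation"

# Roughly 1.5x the usual volume, but 4x the load time and many stopped pages:
baseline = IntervalStats(page_volume=10_000, avg_load_time_s=1.2, stopped_pages=50)
incident = IntervalStats(page_volume=15_000, avg_load_time_s=4.8, stopped_pages=600)
print(classify(incident, baseline))  # -> application or back-end slowdown ...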

Based on this data, which showed high HTTP response times, the Operations team isolated the fault domain to the application servers implementing the e-banking solution rather than to excess network traffic of the type that could indicate a DDoS attack. Next, they used DC RUM to analyze the connections between the application servers and other services, including the database, in order to identify the root cause of the slow HTTP processing time.

The report in the figure below indicates that the database had only 20% availability, i.e., only 20% of TCP connection attempts succeeded, pointing to the database tier as the one impacting the system and its performance.

Figure 3. Only 20% availability of the database in the timeframe of the incident is most likely one of the root causes of this problem.
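
As an aside, the "availability" figure used here is simply the share of TCP connection attempts that succeed in a given window. A small hypothetical sketch of that calculation (the record format is ours, not DC RUM's):

def tcp_availability(attempts: list[dict]) -> float:
    """Percentage of successful TCP connection attempts, e.g. [{"succeeded": True}, ...]."""
    if not attempts:
        return 100.0
    ok = sum(1 for a in attempts if a["succeeded"])
    return 100.0 * ok / len(attempts)

# In a window where only every fifth connection attempt succeeds:
window = [{"succeeded": i % 5 == 0} for i in range(100)]
print(f"{tcp_availability(window):.0f}% availability")  # -> 20% availability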

Further analysis of the performance of the database application (iNBA – DB, see Figure 4) showed a sudden, 8x increase in SQL query processing time.

Figure 4. Sudden increase in query processing time (bottom charts) was not caused by more SQL executions (no significant increase in slow queries in the top charts)

Thanks to a very quick fault domain isolation analysis enabled by DC RUM, the Operations team was able to determine the root cause of the problem within seconds instead of hours of log-file analysis. The figure below shows a report with a list of database operations executed during the incident; the operation with the lowest performance has been highlighted.

Figure 5. Slowest DB operations revealing the root cause of the problem.
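
The report itself is essentially a ranking of database operations by how slowly they execute. A minimal sketch of that idea (the field names and example values are ours, not DC RUM's):

def slowest_operations(ops: list[dict], n: int = 5) -> list[dict]:
    # Rank operations by average processing time, worst first.
    return sorted(ops, key=lambda o: o["avg_time_ms"], reverse=True)[:n]

ops = [
    {"sql": "INSERT INTO transfers ...", "executions": 42_000, "avg_time_ms": 950.0},
    {"sql": "SELECT balance ...", "executions": 8_000, "avg_time_ms": 12.0},
    {"sql": "SELECT history ...", "executions": 2_500, "avg_time_ms": 85.0},
]
for op in slowest_operations(ops):
    print(op["sql"], op["avg_time_ms"], "ms")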

With the drilldown to the list of clients affected by the highlighted operation, the team discovered that the unexpected behavior of the system was caused by only one client (see Figure 6), who within a few minutes ordered tens of thousands of money transfers.

Figure 6. Drill down shows which client was affected by (and was the root cause of) the described overload
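
In essence, the drilldown groups the slow operation's executions by client and checks whether a single client dominates. A small Python sketch under the same caveat as above (hypothetical field names, not DC RUM's data model):

from collections import Counter

def top_clients(executions: list[dict], n: int = 3) -> list[tuple[str, int]]:
    # Count executions per client and return the heaviest ones.
    return Counter(e["client_id"] for e in executions).most_common(n)

# One client issuing tens of thousands of transfer inserts, hundreds of others issuing one each:
executions = (
    [{"client_id": "client-4711"}] * 42_000
    + [{"client_id": f"client-{i}"} for i in range(500)]
)
print(top_clients(executions))  # -> [('client-4711', 42000), ...]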

The problem was later assigned to the maintenance team that used Compuware dynaTrace Deep Application Transaction Management data gathered during the incident to analyze how such unwelcome application behavior could be avoided in the future.

The figure below visualizes the actual end-to-end transaction flow representing the aforementioned operation, with its large number of database queries presented in the context of a single transaction flow: the dynaTrace PurePath technology helped to discover flaws in the architecture design of the e-banking solution of the National Bank of Abari.

Figure 7. Analyzing the Transaction Flow reveals that calling the same DB query multiple times for one end user transaction is the root cause of the performance problems

The PurePath data also contains detailed context information, such as the actual money transfers that this customer executed. With that information – down to the method implementation level – developers have an easy job fixing the implementation details that caused so many SQL executions per single end user transaction (see Figure 7).
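
What such a fix can look like is easy to sketch. The pattern in Figure 7 is the classic one of issuing the same statement once per item inside a single end-user transaction; batching the items into one execution removes most of the round trips. The sketch below uses Python and sqlite3 purely for illustration; the bank's actual schema, data access layer, and eventual fix are not known to us.

import sqlite3

def post_transfers_batched(conn: sqlite3.Connection, transfers: list[tuple]) -> None:
    # One SQL execution for the whole batch instead of one per transfer.
    conn.executemany(
        "INSERT INTO transfers (src_account, dst_account, amount) VALUES (?, ?, ?)",
        transfers,
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transfers (src_account TEXT, dst_account TEXT, amount REAL)")
post_transfers_batched(conn, [("A-1", f"B-{i}", 10.0) for i in range(10_000)])
print(conn.execute("SELECT COUNT(*) FROM transfers").fetchone()[0])  # -> 10000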

 

When Restarting is Your Only Option

What if the Operations team had not had access to DC RUM reports and could not differentiate between a real DDoS attack and an overload caused by the innocuous operations of one of the clients? One of the contingency procedures required restarting certain servers, resulting in downtime of the e-banking solution.

Downtime of a service, especially an unscheduled one, impacts the business, among other things through the frustration of end users. When such an unscheduled restart is inevitable, however, we should assess its impact on the end user experience.

The figure below shows the availability and connection volume statistics gathered on one of the days when a restart was inevitable. Looking at the availability of the service during the restarts, we get only half of the story, and a pretty scary one, telling us that 50% of connections did not succeed. However, if we look at the total number of connections, we can see that this particular downtime had only a marginal impact on the general end user experience.

Figure 8. TCP Availability and Connection Attempts during the restarts

In this case it is 'safe' to schedule the downtime, as you can be sure that you impact the end users as little as possible.
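
A back-of-the-envelope way to express that "marginal impact" is to weight the failures during the restart window by the share of the day's traffic that actually falls into that window. The numbers below are made up purely for illustration:

def restart_impact(window_failed: int, daily_attempts: int) -> float:
    # Fraction of the whole day's connection attempts lost to the restart window.
    return window_failed / daily_attempts if daily_attempts else 0.0

# Half of the attempts in the restart window failed, but the window carried
# only a tiny share of the day's traffic.
impact = restart_impact(window_failed=200, daily_attempts=500_000)
print(f"{impact:.4%} of the day's connection attempts were affected")  # -> 0.0400% ...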

 

It’s in Your Data

Malcolm Gladwell, in his book “Blink”, writes about the human ability to make snap, subconscious decisions based on our experience. However, he also warns that sometimes it is better to first analyze the data we have and make the decision consciously.

This is what the Operations team of the National Bank of Abari did. With the right solution in hand – DC RUM – they were able to very quickly zero in on the cause of the problem and change the way they were going to handle the ongoing situation. Further analysis using Compuware APM Deep Application Transaction Management helped the maintenance team minimize the impact of similar incidents in the future.

However, not all downtimes can be avoided; in such cases DC RUM can also help assess the business impact of the downtime.

 


(This article is based on materials contributed by Przemysław Zabludowski, drawing on original customer data. The screens presented may differ in the most recent releases of the product while delivering the same value.)

Comments

  1. Wojtek Aleksander says:

    Another lesson can be drawn from this particular case. The highly reactive approach of the bank team didn’t pay off. They started to act only after people started phoning them – at noon! The charts show that the overload started about two hours earlier (Fig. 2).

    If the bank team had made use of the alerting engine, they would have realized something was wrong no later than 11 o’clock! With smart definitions they would know a lot about the potential root cause from the alert messages themselves (volume increase, anomalous traffic pattern, application performance drop – to name the less fancy conditions).

  2. Wojtek, thanks for this observation.
    Guess what – there is a post focusing on proactiveness in APM coming soon.
