The reason I don’t monitor connection pool usage
I have been working with performance sensitive applications for a long time now. As can be expected most of them have to use the database at one point or the other. So you inevitably end up having a connection pool. Now to make sure that your application is not suffering from waiting on connections you monitor the pool usage, but is that really helping? Too be honest not really…
How an application uses the connection pool
Most applications these days use connection pools implicitly. They get a connection, execute some statements and close it. The close call does not destroy the connection but put it back into a pool. The goal is to minimize the so called busy time. Under the hood most application servers refrain from putting a connection back into the pool until the transaction has been committed. For this reason it is a good practice to get the database connection as late as possible during a transaction. Again the goal is to minimize usage time, so that many application threads can share a limited number of connections.
All connection pools have a usage measure to determine if enough connections are available, or in other words to see if the lack of connections has a negative effect. However as a connection is occupied only for very short amounts of time, often fractions of a second, we would need to check the usage equally often to have a statistical significant chance of seeing the pool being maxed out under normal conditions.
In reality this is not done, as checking the pool to often (say several times a second) would lead to a lot of monitoring overhead. Most solutions check every couple of seconds and as a result we only see pool usage reaching 100% if it is constantly maxed out. If we were to track the usage on a continuous basis the result would look different:
This means that by the time we see 100% pool usage with regular monitoring solutions we would already have a significant negative performance impact, or would we?
What does 100% pool usage really mean?
Actually it does not mean much. It means that all connections in the pool are in use, but not that any transactions are suffering performance problems due to this. In a continuous load scenario we could easily tune our setup to have 100% pool usage all the time and not have a single transaction suffering; it would be perfect.
However many use cases do not have a steady continuous load pattern and we would notice performance degradation long before that. Pool usage alone does not tell us anything; Acquisition Time does!
Most application servers and connection pools have a wait or acquisition metric that is far more interesting than pool usage. Acquisition time represents the time that a transaction has to wait for a connection from the pool. It therefore represents real actionable information. If it increases we do know for a fact that we do not have enough connections in the pool all the time (or that the connection pool itself is badly written). This measure can show significant wait time long before the average pool usage is anywhere close to 100%. But there is still a slight problem. The measure is still an aggregated average across the whole pool or more specifically all transactions. Thus while it allows us to understand whether or not there are enough connections overall, it does not enable us to identify which business transactions are impacted and by how much.
Measuring Acquisition Time properly
Acquisition time is simply the time it takes for the getConnection call to return. We can easily measure that inside our transaction and if we do that we can account it on a per business transaction basis and not just as an aggregate of the whole pool. This means we can determine exactly how much time we spend waiting for each transaction type. After all I might not care if I wait 10ms in a transaction that has an average response time of a second, but at the same time this would be unacceptable in a transaction type with 100ms response time.
We could even determine which transaction types are concurrently fighting over limited connections and understand outliers, meaning the occasional case when a transactions waits a relative long time for a connection, which would otherwise be hidden by the averaging affect.
Configuring the optimal Pool Size
Knowing how big to configure a pool upfront is not always is. In reality most people simply set it to an arbitrary number that they assume is big enough. In some high volume cases it might not be possible to avoid wait time all the times, but we can understand and optimize it.
There is a very easy and practical way to do this. Simply monitor the connection acquisition time during peak load hours. It is best if you do that on a per business transaction basis as described above. You want to pay special attention to how much it contributes to the response time. Make sure that you exclude those transactions from your analysis that do not wait at all, they would just skew your calculation.
If the average response time contribution to your specific business transaction is very low (say below 1%) than you can reasonably say that your connection pool is big enough. It is important to note that I am not talking about an absolute value in terms of milliseconds but contribution time! If that contribution time is too high (e.g. 5% or higher) you will want to increase your connection pool until you reach an acceptable value. The resulting average pool usage might be very low on average or close to 100%, it does not really matter!
The usefulness of a pool measure depends on the frequency of polling it. The more we poll it, the more overhead we add and in the end it is still only a guess. Impact measures like acquisition time are far more useful and actionable. It allows us to tune the connection pool to a point where it has no or or at least acceptable overhead when compared to response time. Like all impact measures it is best not to use the overall average, but to understand it in terms of contribution to the end user response time.