About the Author: Gary Kaiser

Gary is a Subject Matter Expert in Network Performance Analysis at Compuware APM. He has global field enablement responsibilities for performance monitoring and analysis solutions embracing emerging and strategic technologies, including WAN optimization, thin client infrastructures, network forensics, and a unique performance management maturity methodology. He is also a co-inventor of multiple analysis features, and continues to champion the value of software-enabled expert network analysis.

Understanding Application Performance on the Network – Part VII: TCP Window Size

In Part VI, we dove into the Nagle algorithm – perhaps (or hopefully) something you’ll never see. In Part VII, we get back to “pure” network and TCP roots as we examine how the TCP receive window interacts with WAN links.

TCP Window Size

Each node participating in a TCP connection advertises its available buffer space using the TCP window size field. This value identifies the maximum amount of data a sender can transmit without receiving a window update via a TCP acknowledgement; in other words, it is the maximum number of “bytes in flight” – bytes that have been sent and are traversing the network but remain unacknowledged. Once the sender has exhausted the receive window, it must stop and wait for a window update.

The sender transmits a full window, then waits for window updates before continuing. As these window updates arrive, the sender advances the window and may transmit more data.

Long Fat Networks

High-speed, high-latency networks, sometimes referred to as Long Fat Networks (LFNs), can carry a lot of data. On these networks, small receive window sizes can limit throughput to a fraction of the available bandwidth. These two factors – bandwidth and latency – combine to determine the potential impact of a given TCP window size. LFNs make it possible – common, even – for a sender to transmit an entire TCP window’s worth of data very quickly (high bandwidth) and then have to wait while the packets reach the distant remote site (high latency) and acknowledgements return, informing the sender of successful data delivery and available receive buffer space.

The math (and physics) concepts are straightforward. As the network speed increases, data can be clocked out onto the network medium more quickly; the bits are literally closer together. As latency increases, these bits take longer to traverse the network from sender to receiver. As a result, more bits can fit on the wire. As LFNs become more common, exhausting a receiver’s TCP window becomes increasingly problematic for some types of applications.

Bandwidth Delay Product

The Bandwidth Delay Product (BDP) is a simple formula used to calculate the maximum amount of data that can exist on the network (referred to as bits or bytes in flight) based on a link’s characteristics:

  • Bandwidth (bps) x RTT (seconds) = bits in flight
  • Divide the result by 8 for bytes in flight

If the BDP (in bytes) for a given network link exceeds the value of a session’s TCP window, then the TCP session will not be able to use all of the available bandwidth; instead, throughput will be limited by the receive window (assuming no other constraints, of course).

The BDP can also be used to calculate the maximum throughput (“bandwidth”) of a TCP connection given a fixed receive window size:

  • Bandwidth (bps) = (window size in bytes x 8) / RTT (seconds)
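
To make the arithmetic concrete, here is a minimal Python sketch of both calculations (the function names are illustrative, not from any particular tool):

```python
def bdp_bytes(bandwidth_bps: float, rtt_seconds: float) -> float:
    """Bandwidth Delay Product: the maximum number of bytes in flight on a link."""
    bits_in_flight = bandwidth_bps * rtt_seconds
    return bits_in_flight / 8                        # divide by 8 for bytes

def window_limited_throughput_bps(window_bytes: float, rtt_seconds: float) -> float:
    """Maximum throughput achievable with a fixed receive window."""
    return (window_bytes * 8) / rtt_seconds

# Example: a 10 Mbps link with 50 ms of round-trip delay
print(bdp_bytes(10_000_000, 0.050))                  # 62500.0 bytes in flight
print(window_limited_throughput_bps(65535, 0.050))   # ~10.5 Mbps
```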

In the not-too-distant past, the TCP window had a maximum value of 65535 bytes. While today’s TCP implementations now generally include a TCP window scaling option that allows for negotiated window sizes to reach 1GB, many factors limit its practical utility. For example, firewalls, load balancers and server configurations may purposely disable the feature. So the reality is that we often still need to pay attention to the TCP window size when considering the performance of applications that transfer large amounts of data, particularly on enterprise LFNs.

As an example, consider a company with offices in New York and San Francisco; they need to replicate a large database each night, and have secured a 20Mbps network connection with 85 milliseconds of round-trip delay. Our BDP calculation tells us that the BDP is 212,500 bytes (20,000,000 x 0.085 / 8); in other words, a single TCP connection would require a 212KB window in order to take advantage of all of the bandwidth. The BDP calculation also tells us that the configured TCP window size of 65535 will permit approximately 6Mbps of throughput (65535 x 8 / 0.085), less than 1/3 of the link’s capacity.
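
As a quick check of these numbers, the sketch below tabulates a few candidate window sizes against the throughput they would permit on this link (the window values beyond 65535 are chosen purely for illustration):

```python
rtt = 0.085               # 85 ms round-trip time
link_bps = 20_000_000     # 20 Mbps link

for window in (65_535, 131_070, 262_144):     # receive window sizes in bytes
    throughput = window * 8 / rtt             # window-limited throughput (bps)
    usable = min(throughput, link_bps)        # can never exceed the link rate
    print(f"window {window:>7} bytes -> {usable / 1e6:4.1f} Mbps usable")
```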

A link’s BDP and a receiver’s TCP window size are two factors that help us to identify the potential throughput of an operation. The remaining factor is the operation itself, specifically the size of individual request or reply flows. Only flows that exceed the receiver’s TCP window size will benefit from, or be impacted by, these TCP window size constraints. Two common scenarios help illustrate this. Let’s say a user needs to transfer a 1GB file:

  • Using FTP (in stream mode) will cause the entire file to be sent in a single flow; this operation could be severely limited by the receive window.
  • Using SMB (at least older versions of the protocol) will cause the file to be sent in many smaller write commands, as SMB used to limit write messages to under 64KB; this operation would not be able to take advantage of a TCP receive window of greater than 64K. (Instead, the operation would more likely be limited by application turns and link latency; we discuss chattiness in Part VIII.)

Transaction Trace Illustration

To evaluate a trace for this window size constraint, use the Time Plot view. For Series 1, graph the sender’s payload in transit (i.e., bytes in flight); for Series 2, graph the receiver’s advertised TCP window, using a single y-axis scale for reference. If the payload in transit reaches (or closely approaches) the receive window size, then it is likely that an increase in the window size will allow for improved throughput.
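
If you would rather check this outside of a GUI, a rough approximation can be pulled straight from a packet capture. The sketch below uses the Scapy packet library and assumes a trace file named transfer.pcap and a sender address of 10.1.1.10 (both placeholders); it tracks the highest sequence number sent against the highest acknowledgement returned to estimate bytes in flight, and prints the receiver’s raw advertised window alongside:

```python
from scapy.all import rdpcap, IP, TCP   # pip install scapy

SENDER = "10.1.1.10"                    # assumed sender address - adjust for your trace
packets = rdpcap("transfer.pcap")       # assumed capture file name

highest_seq = 0   # highest byte the sender has put on the wire (seq + payload length)
highest_ack = 0   # highest byte the receiver has acknowledged

for pkt in packets:
    if not (pkt.haslayer(IP) and pkt.haslayer(TCP)):
        continue
    tcp = pkt[TCP]
    if pkt[IP].src == SENDER and len(tcp.payload) > 0:
        highest_seq = max(highest_seq, tcp.seq + len(tcp.payload))
    elif pkt[IP].src != SENDER and highest_seq:
        highest_ack = max(highest_ack, tcp.ack)
        in_flight = highest_seq - highest_ack
        # tcp.window is the raw advertised window; multiply by the scale factor
        # from the handshake (if captured) to get the true window
        print(f"t={float(pkt.time):.3f}s  in_flight={in_flight:7d}  advertised_window={tcp.window}")
```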

This Time Plot view shows the sender’s TCP Payload in Transit (blue) reaching the receiver’s advertised TCP window (brown); the window size is limiting throughput.

The Bounce Diagram can also be used to illustrate the impact of a TCP window constraint, emphasizing the impact of latency on data delivery and subsequent TCP acknowledgements.

Illustration of a TCP window constraint; each cluster of blue frames represents a complete window’s worth of payload, and the sender must then wait for window updates.

Note that the TCP window scaling option is negotiated in the TCP three-way handshake as the connection is set up; without the SYN and SYN/ACK handshake packets in the trace file, there is no way of knowing whether window scaling is active – or, more accurately, what the scaling factor might be. (Hint: if you observe window sizes in a trace file that appear abnormally small – such as 500 bytes – then it is likely that window scaling is active; you may not know the actual window size, but it will be greater than 64KB.)
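
The scaling arithmetic itself is trivial: the advertised value is simply shifted left by the scale factor exchanged in the handshake. A tiny illustration with hypothetical values:

```python
def effective_window(raw_window: int, scale_shift: int) -> int:
    """The true window is the raw 16-bit field shifted left by the negotiated scale factor."""
    return raw_window << scale_shift

# A "tiny" raw value of 501 with a scale shift of 9 actually advertises ~256 KB
print(effective_window(501, 9))   # 256512 bytes
```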

Corrective actions

For a TCP window constraint on an LFN, assuming adequate available bandwidth, the primary solution options focus on increasing the receiver’s TCP window or enabling TCP window scaling. Reducing latency – which in turn reduces the BDP – will also allow greater throughput for a given TCP window; relocating a server or optimizing path selection are examples of how this reduction in latency might be accomplished.
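
On the host side, one common lever is the socket receive buffer, which caps the window the operating system will advertise for that connection. A minimal Python sketch of the idea (behavior is OS-dependent; on Linux, for example, setting SO_RCVBUF explicitly typically disables the kernel’s receive-buffer auto-tuning, so treat this as illustrative rather than a recommendation):

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Ask for a ~2 MB receive buffer *before* connecting, so the window scale
# factor negotiated in the handshake can accommodate it; the OS may round,
# double, or cap the value you request.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 2 * 1024 * 1024)

granted = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
print(f"Receive buffer granted by the OS: {granted} bytes")
```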

Is TCP window scaling enabled for your key applications – especially those that serve users over LFNs? Are your file transfers and replications performing in harmony with the network they traverse?

In Part VIII, the final entry in this series, we’ll talk about application chattiness – the more common app turns kind, but also a behavior I call application windowing. Stay tuned and feel free to comment below.

Comments

  1. Nick Fiekowsky says:

    Optimal real world TCP window size can be far larger than bandwidth-delay product (BDP) – latency doesn’t end at the RJ-45. We had a 20 Mb/sec MPLS link between Japan and US east coast with 192 msec RTT. BDP would be just under 512 KBytes. Optimal sustained throughput achieved when receiving host advertised 2.7 MByte receive window.

    Back in the Windows XP & NetWare era we discovered that TCP tuning for a larger window size significantly reduced boot time for a PC one flight up from the data center.

    Unless you’re running iPerf, additional latency stems from time taken for:
    - Application on the receiving host to be dispatched by OS
    - Time for receiving application to move data from receive buffer to disk or screen
    - Time for sending application to be dispatched by OS when send buffer empties
    - Time for sending application to marshal data to move into send buffer

    We found that a larger TCP window size can measurably reduce host processing time while shrinking transfer time.

  2. Gary Kaiser says:

    Hi Nick,
    Thanks for your comment, and sharing your experience; often, theory and experience conflict, in which case the latter wins out.
    I like your point that latency may not end at the RJ-45; we (meaning I) often abstract the definition of end-to-end delay, even incorrectly referring to it as “NIC-to-NIC.” I think a better definition – pertinent to this discussion, at least – would be “TCP stack to TCP stack.”
    If we think about the BDP and a large TCP flow, then we really are concerned with the timings of TCP ACKs; application-specific delays at the sender (which I’ve previously referred to as “starved for data” conditions) and at the receiver (which reduce the advertised TCP window size due to delays reading from the buffer) shouldn’t affect the calculation itself. But – especially on the systems you mention – the TCP stacks themselves could be OS-bound (since they would run entirely in the OS), delaying the acknowledgement of data by the receiver and/or reading the window update at the sender. So the net effect would be a delay value (for the BDP) that could be significantly greater than the physical NIC-to-NIC RTT.
    Having said that, I struggled for a while trying to explain the faster user-perceived performance. A slow app is a slow app; increasing the receive buffer can allow the data to traverse the network faster (shrinking transfer time), but if the app were slow, delays reading the data from the buffer would still remain. What if the TCP stack were slow (delaying the ACKs and increasing the BDP), and the app fast? Then the theory would seem to match the experience.
    But I defer to the real world….

    • Nick Fiekowsky says:

      Hello Gary,

      My view is that big TCP buffers provide slack that allows many components to operate in efficient “stream” mode most of the time rather than “start and stop.”

      - The disk drive can stream big chunks of data into the transmitting app since there’s a big TCP buffer ready to catch the bytes.

      - The transmitting TCP stack has a big pile of bytes on hand to sustain a max speed stream, avoiding pauses and subsequent slow start.

      - The receiving TCP stack has lots of buffer space to hold the arriving bytes, likely eliminating window freezes.

      - The receiving application stays active for long stretches since it can work through MBytes of data at a time.

      - The receiving application can feed long streams of data to its storage. The writing disk head thus stays in one track, or quickly moves to an adjacent track, for rapid data storage.

      Confirming experience – some years back my ancient single-core, small-memory laptop with mild TCP tuning could outperform a colleague’s shiny new dual-core, 4 GByte memory laptop in downloads. My colleague didn’t believe it at first, but disk defragmentation made the difference. I regularly defragmented my hard drive; his had never been defragmented. His laptop outperformed mine once the disk was adequately defragmented.

      Conclusion – the network is one component of an end-to-end system. Poor tuning can cripple end-to-end performance, good tuning helps. Strong network tuning allows other components to deliver their best performance, too.

  3. Anton says:

    Hello Gary, could you please advise on the following problem:

    We have a low-bitrate link with B = 32 kbps (this is power line carrier communication) and an RTT of approximately 200 ms. The TCP WINDOW is 800 bytes, and IEC 60870-5-104 transmits data with a very small payload – 46 bytes. It means that in one TCP WINDOW we have 17 packets.

    Are there some ways to reduce the number of ACKs, because it is bad to wait for all 17 ACKs before sending again…
    Is it possible to use the Nagle algorithm to collect all of the ACKs into one packet?

    Also, could you please clarify: I found some articles where, after receiving a few TCP segments, the receiver sends an ACK only for the last one with the highest sequence number. Is it possible to use such an approach, for example in a delayed-ACK (TCP_DELAY) mode – wait for 500 ms and send an ACK for only one segment?

    Best regards,

    Anton

  4. Gary Kaiser says:

    Hi Anton,

    If your interest is to improve data transfer throughput, then it would appear that TCP is tuned quite well to your environment. The BDP = 32000*0.200 = 6400 bits in flight; divide by 8 = 800 bytes in flight. This is the maximum carrying capacity of the network, so in theory, a TCP window size of 800 bytes (or greater) would allow the link to be fully utilized.
    As the ACKs for earlier packets are received, the sender should be able to stream more data – without waiting until the ACKs are received for the remaining data; however, this is true only if the application is streaming data. The behavior you describe – send a block of data in 17 packets, then wait until all of these packets have been acknowledged, then send the next block of data – would appear to be what I call application windowing. In this case, the application (or perhaps the power line protocol you are using) is ensuring that each block of data has been successfully received before sending the next block.
    I discuss this behavior in more detail in Part VIII of this blog series – http://apmblog.compuware.com/2014/08/21/understanding-application-performance-on-the-network-chattiness-application-windowing/
    Hope this helps you narrow down the issue.
    Gary
