Best Practices from Zappos to deliver WOW Performance
Zappos – the leading eCommerce site for shoes and apparel – recently talked about their best practices for delivering WOW performance to their customers. Zappos re-architected their website, moving from Perl to Enterprise Java, as explosive business growth exposed scalability and performance problems in the old architecture.
Performance is key to business success for every eCommerce site. Zappos picked an Application Performance Management (APM) solution that enabled them to deliver their #1 Core Value to their customers: “Deliver WOW through Service”
Why Zappos needed to re-architect
The Zappos eCommerce site has exploded in popularity over the years. The website serves millions of visitors daily and processes between 60,000 and 65,000 purchases every day. Their #1 Core Value is “Deliver WOW through Service” – as their success is clearly powered by their customers, and great service is what keeps customers shopping.
The original platform was built with Perl and showed performance problems as the site's success and the number of online transactions grew. Due to the lack of analysis tools and of structured performance testing, it was not clear where the performance problems actually were or why the implementation didn’t scale as required.
The decision was made to re-architect the application on Enterprise Java in order to cope with the demands of scale and high performance.
Zappos Performance Environment
The application was re-architected to run on 3 tiers, each hosted by 2 JVMs – making it 6 JVMs in total. Once a build passes the functional requirements it moves on to the performance lab. In-house load testing tools and the load-testing services from SOASTA allow them to test thousands of transactions per second on a build-to-build basis.
The load tests delivered response times for individual pages as well as transaction throughput. These results were recorded and compared from build to build. The following image shows the results per individual transaction:
It was assumed that having response times was enough to manage application performance. However, when a build suddenly showed a performance or throughput decrease, everybody scratched their heads, because these numbers alone gave no indication of the actual root cause. Capturing CPU, memory, network and I/O activity in the system helped to identify a problematic JVM – but it didn’t help to identify the problematic code or the code change that led to the issue.
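To illustrate the kind of build-to-build comparison described above, here is a minimal, hypothetical Java sketch that flags transactions whose response time regressed beyond a tolerance. The transaction names and timings are invented; this is not Zappos' or dynaTrace's actual tooling.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: flag transactions whose response time regressed
// between two builds by more than a relative tolerance.
public class BuildComparison {

    // Returns transactions that got slower by more than 'tolerance'
    // (e.g. 0.10 = 10%), mapped to their relative slowdown.
    public static Map<String, Double> findRegressions(
            Map<String, Double> previousBuildMs,
            Map<String, Double> currentBuildMs,
            double tolerance) {
        Map<String, Double> regressions = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : currentBuildMs.entrySet()) {
            Double before = previousBuildMs.get(e.getKey());
            if (before == null) continue; // new transaction, nothing to compare
            double change = (e.getValue() - before) / before;
            if (change > tolerance) {
                regressions.put(e.getKey(), change);
            }
        }
        return regressions;
    }

    public static void main(String[] args) {
        Map<String, Double> build41 = Map.of("Search", 120.0, "Checkout", 300.0);
        Map<String, Double> build42 = Map.of("Search", 125.0, "Checkout", 390.0);
        // Checkout is 30% slower and gets flagged; Search stays within 10%
        System.out.println(findRegressions(build41, build42, 0.10));
    }
}
```

As the post notes, a report like this tells you *which* transaction regressed, but not *why* – that is exactly the gap the APM tooling fills.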
The lack of visibility into the system and the amount of time spent to find the actual problem caused Zappos to look for an application performance management solution to get insight into the application while under heavy load.
Requirements on an Application Performance Management Solution
Zappos’ requirements for an APM solution were to
- get insight into the application down to the method level
- follow each distributed transaction across all 3 tiers
- run under heavy load with less than 5% CPU overhead
- report values per individual transaction rather than only min/max/avg
- include contextual data like method arguments, database access, remoting calls and exceptions
- integrate with their internal and external load testing services
- support easy hand-off to developers and offline analysis
Continuous APM in Practice @ Zappos
During the initial PoC phase – which was done in their performance environment – several performance issues were identified and fixed in the first test runs. For example, identifying a wrong caching strategy boosted the performance of the cache by 12x.
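The post does not detail what the flawed caching strategy was, so as a hedged illustration only, here is one common caching anti-pattern alongside its fix: a cache key that never repeats, so every lookup misses and the expensive load runs each time. All class, method, and product names are invented.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only -- not Zappos' actual code or actual caching flaw.
public class ProductCache {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private int misses = 0;

    // Broken: the key embeds the current time, so it is never found again
    // and every call "misses" and reloads.
    public String getBroken(String productId) {
        String key = productId + "@" + System.nanoTime();
        return cache.computeIfAbsent(key, k -> load(productId));
    }

    // Fixed: key on the stable product id, so repeated lookups hit the cache.
    public String getFixed(String productId) {
        return cache.computeIfAbsent(productId, k -> load(productId));
    }

    private String load(String productId) {
        misses++;                        // stands in for an expensive DB call
        return "details-for-" + productId;
    }

    public int getMisses() { return misses; }
}
```

A method-level view, as described above, makes this kind of flaw obvious: the trace shows the supposedly cached load method executing on every single transaction.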
Today – with dynaTrace deployed in their performance lab – every build that enters the lab and is tested with the internal tools or with the testing services from SOASTA is performance-managed by dynaTrace. Every build needs to be “dynaTrace Certified” before it is passed on to the next stage. In case of performance regressions from one build to the next, the guesswork of the past is over: dynaTrace identifies the problematic transactions (PurePaths) and compares them to the transactions from the previous build to highlight the differences. Zappos makes heavy use of the PurePath Comparison feature:
The comparison shows the structural differences (which methods are new/removed or called more/less frequently) as well as the timing differences (which methods take longer/shorter to execute). Not only does this difference data give great input to developers – every single captured transaction also includes additional contextual information like SQL statements, bind variables, method arguments, return values and exceptions. As Zappos runs on a multi-tier environment, it is essential for them to see the full transaction that spans all their tiers. dynaTrace’s PurePath technology is able to follow transactions across runtime boundaries, following remoting calls via Web Services, RMI, .NET Remoting, WCF or Messaging from one runtime to the next. The developer can then look at the full PurePath:
The above screenshot shows a PurePath that made a synchronous HTTP call from machine zeta01 to zeta02, where the request was handled by a servlet. The HotSpot visualization on the top right, as well as the colour coding of the methods in the PurePath tree, indicates which methods contribute the most to this individual transaction. Additional context information like servlet attributes and execution times can be analyzed to help with problem diagnosis and solution.
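As a simplified stand-in for such a cross-tier call, the following runnable sketch uses the JDK’s built-in HTTP server in place of a full servlet container. The endpoint path and payload are invented; the point is only the shape of the interaction – one JVM making a synchronous HTTP call that another JVM handles, which is the boundary a PurePath follows.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Minimal stand-in for the zeta01 -> zeta02 call described above.
public class TierToTierCall {

    public static String callInventory() {
        try {
            // "zeta02": handles the incoming request, like the servlet in the PurePath
            HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
            server.createContext("/inventory", exchange -> {
                byte[] body = "in-stock".getBytes();
                exchange.sendResponseHeaders(200, body.length);
                try (OutputStream os = exchange.getResponseBody()) {
                    os.write(body);
                }
            });
            server.start();
            int port = server.getAddress().getPort();

            // "zeta01": makes the synchronous HTTP call across the tier boundary
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder(
                    URI.create("http://localhost:" + port + "/inventory")).build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            server.stop(0);
            return response.body();
        } catch (IOException | InterruptedException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(callInventory()); // prints "in-stock"
    }
}
```

In a real deployment, tracing such a call end to end requires correlating the client-side call on one JVM with the server-side handler on another – which is what the PurePath technology does across the remoting protocols listed above.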
Full End-to-End Tracing – The App is more than what happens on the server
The following image shows the browser-side analysis with the dynaTrace AJAX Edition and its integration with dynaTrace on the server side, which allows Zappos to drill into the actual server-side transaction for each individual network request/XmlHttpRequest (XHR):
Best Practices for WOW Performance
Zappos derived several best practices while implementing Continuous Application Performance Management and running it in their performance lab on a build-to-build basis:
- You can’t start testing too soon
- Stop the guesswork -> Give the developers the actionable evidence they need
- Averages and Aggregates aren’t enough -> You need to see all transactions to find the outliers
- Don’t settle for simple response time metrics -> Get as granular as possible – down to the method level
- Find the exceptions!
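To illustrate the point about averages and aggregates, here is a small hypothetical sketch: two sets of response times with the identical 100 ms average, where only a per-transaction view (here, a simple nearest-rank percentile) exposes the outlier. The numbers are invented for illustration.

```java
import java.util.Arrays;

// Sketch of why averages hide outliers.
public class Outliers {

    public static double average(double[] ms) {
        return Arrays.stream(ms).average().orElse(0);
    }

    // Nearest-rank percentile (p in (0,1]) over a sorted copy of the samples.
    public static double percentile(double[] ms, double p) {
        double[] sorted = ms.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length) - 1;
        return sorted[Math.max(rank, 0)];
    }

    public static void main(String[] args) {
        double[] healthy     = {100, 100, 100, 100, 100, 100, 100, 100, 100, 100};
        double[] withOutlier = { 55,  55,  55,  55,  55,  55,  55,  55,  55, 505};
        // Both data sets have an average of exactly 100 ms...
        System.out.println(average(healthy) + " vs " + average(withOutlier));
        // ...but only looking past the aggregate exposes the 505 ms outlier.
        System.out.println(percentile(healthy, 0.99) + " vs " + percentile(withOutlier, 0.99));
    }
}
```

This is why the best practice above insists on capturing every transaction rather than min/max/avg rollups: the one transaction a customer actually complained about may sit entirely inside the tail the average smooths away.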
Check out the full webinar where Kevin and Ryan talk about all their challenges and successes in detail.