This post will be about what we did, what we learned, and what worked.
A little background: our stack is nginx, Ruby on Rails, and MySQL. The web app is a custom travel guide. The user comes into our app and either imports a TripIt itinerary, or selects a destination and enters their travel details. They can express preferences in dozens of different categories (think about rating cuisine types of restaurants, categories of attractions, etc.). They can preview their travel guide live, in the browser, and order a high-quality paperback version.
Our site was in beta, and we had been getting a few hundred visitors a day. At 2pm on Monday we learned that we would be featured on a very popular TV show on Thursday morning (9:30am EST), and based on the show's previous experiences, we could expect anywhere from 20,000 to 150,000 unique visitors that day. This would be a 100x to 1,000x increase in the number of unique visitors that we would normally serve. We estimated that our peak volume would be 10,000 visitors at once, which would create a load of ~40,000 page requests per minute (rpm).
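To make the arithmetic behind that estimate explicit, here is a small sketch. The two input figures come from the estimate above; the pages-per-visitor rate is only what those two figures imply, not something we measured.

```ruby
# Figures taken from the estimate above.
PEAK_SIMULTANEOUS_VISITORS = 10_000
TARGET_RPM = 40_000

# Page requests per visitor per minute implied by those two figures.
# This is derived from the estimate, not measured.
def implied_pages_per_visitor_per_minute
  TARGET_RPM.to_f / PEAK_SIMULTANEOUS_VISITORS
end

# The same target expressed in requests per second.
def target_requests_per_second
  TARGET_RPM / 60.0
end

puts "Pages per visitor per minute: #{implied_pages_per_visitor_per_minute}"
puts "Requests per second: #{target_requests_per_second.round}"
```

In other words, the estimate assumes each simultaneous visitor loads about four pages a minute, which works out to roughly 667 requests per second at peak.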
What we had going for us:
- We were already running on Amazon Web Services (AWS), using EC2 behind a load balancer.
- We already had auto-scale rules in place to detect when CPU load exceeded a certain percentage on our app servers, and to double the number of EC2 instances we had.
- We had a reliable process for deploying (although this became a constraint later).
- A month earlier, we had spent a week optimizing the performance of our app, and we had already cut page load time by 30%, largely by reducing database calls.
After some experimentation, we determined that we couldn't run more than 125 threads on a single JMeter instance. So if we wanted to simulate truly high loads, we needed to run JMeter on several computers. We also needed to make sure not to saturate any part of our local network. At a minimum, we made sure we didn't have all the machines going through the same router. We also VPNed into an entirely different site, and ran another client from a geographically distant location.
We quickly learned that:
- we needed to change our initial auto-scale group from two servers to four.
- we needed to reduce our CPU load threshold from 80% to 50%.
- we needed to reduce the time that load had to be sustained before scaling kicked in from 5 minutes to 1 minute.
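To see why the threshold and the sustain time matter so much for a sudden spike, here is a toy model of a "double the servers when CPU stays high" rule. It is a sketch only: `ScalingRule`, its fields, and the CPU samples are hypothetical stand-ins, not AWS API calls or real measurements.

```ruby
# Hypothetical model of a scaling rule: scale up once average CPU has
# exceeded the threshold for `sustain_minutes` consecutive minutes.
ScalingRule = Struct.new(:initial_servers, :cpu_threshold, :sustain_minutes) do
  # cpu_samples: one average CPU percentage per minute, oldest first.
  # Returns the 1-based minute at which the first scale-up fires, or nil.
  def first_trigger_minute(cpu_samples)
    run = 0
    cpu_samples.each_with_index do |cpu, i|
      run = cpu > cpu_threshold ? run + 1 : 0
      return i + 1 if run >= sustain_minutes
    end
    nil
  end

  # The rule described above doubles the number of instances.
  def servers_after_trigger
    initial_servers * 2
  end
end

OLD_RULE = ScalingRule.new(2, 80, 5) # 2 servers, 80% CPU, 5 minutes
NEW_RULE = ScalingRule.new(4, 50, 1) # 4 servers, 50% CPU, 1 minute

# Made-up CPU readings for a sudden traffic spike, one per minute.
SPIKE = [30, 60, 85, 90, 95, 95, 95, 95].freeze

puts "Old rule fires at minute #{OLD_RULE.first_trigger_minute(SPIKE)}"
puts "New rule fires at minute #{NEW_RULE.first_trigger_minute(SPIKE)}"
```

With these made-up readings, the old rule does not react until minute 7 and then only grows to 4 servers, while the new rule reacts at minute 2 and grows to 8.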
After making these changes, and letting the app servers scale to 16 servers, we saw that we were now DB bound. The DB server would quickly go to 100% load. We increased to the maximum AWS DB size (extra extra large, with extra CPUs and memory) but we still swamped the DB server. We watched the JMeter tests run, sending the average response times into 30-second territory. We knew we were sunk if we didn't solve this problem. It was mid-day Tuesday.
I said "we knew we were sunk", but to be honest, this actually took some convincing. Some members of the team felt like any changes so close to the big day would be too risky to make. We could introduce a bug. But a 100% correct application experience that will definitely get buried under load will result in delivering no pages to anyone. That's 100% risk. After a few hours, we had everyone convinced that we had to solve the scalability issues.
We briefly explored several large-scale changes we could have made, but ultimately judged them to be too risky:
- duplicating our entire stack, and using DynDNS or Amazon's Route 53 to route between them. Because we didn't have user accounts or other shared data, we could have done this in theory. If we had, we would later have needed to synchronize a bunch of databases through some scripts to collect all the customer data together. If we had a week or an Ops expert, we probably would have done this.
- offloading part of the DB load by creating a read-only replica, and shunting the database calls for certain mostly-static data (e.g. list of destinations and attractions at those destinations) to the read-only database. We didn't do this mostly because we didn't know how we'd point some database queries at one db server, and others at a different db server.
- using memcache or other mechanisms to hit only memory instead of the database: this too would have required extensive code changes. Given the short timeframe, it felt too risky.
The action we did decide to take:
- We had a few easily identified DB queries that ran with every page request, that loaded the same data each time. We rewrote the code to cache these in memory in global variables that stayed resident between calls. So they would get loaded once when the servers rebooted, and never again.
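The pattern looks roughly like the sketch below. This is not the code from our app: `DestinationCache` is a hypothetical name, the block stands in for the real database query, and a module-level variable is used here in place of a global variable.

```ruby
# Hypothetical sketch: keep a mostly static query result in process
# memory so it is loaded once per server process, not once per request.
module DestinationCache
  @destinations = nil
  @mutex = Mutex.new

  # Runs the block (a stand-in for the real DB query) only when the
  # cache is empty, and returns the cached value on every later call.
  # Note: a nil or false result would not be cached by `||=`.
  def self.fetch
    @mutex.synchronize do
      @destinations ||= yield
    end
  end

  # Clears the cache; in this sketch the equivalent of a server restart.
  def self.reset!
    @mutex.synchronize { @destinations = nil }
  end
end

# Usage: the "query" below runs once, however many requests are served.
query_count = 0
3.times do
  DestinationCache.fetch do
    query_count += 1
    ["Paris", "Rome"] # placeholder data
  end
end
puts "Queries executed: #{query_count}"
```

The trade-off is that the cached data only refreshes when the process restarts (or the cache is explicitly reset), which is acceptable only for data that rarely changes.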
Around this time, which was Wednesday morning, we started to get smarter about how we used JMeter. Instead of slamming the servers with many hundreds of simultaneous threads from many clients and watching request times spiral out of control, we spun up threads slowly, let the servers stabilize, counted our total throughput in requests per second, and checked CPU load at that level, then increased the threads, and kept rechecking. We grabbed a flipchart, and started recording data where everyone could see it. Our goal was to determine the maximum RPM we could sustain without increasing response times.
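The flipchart analysis can be sketched in a few lines. Everything here is illustrative: the `Step` struct, the 20% tolerance, and the sample readings are assumptions made up for the example, not the numbers we recorded.

```ruby
# One row of the flipchart: a load level and what was observed at it.
Step = Struct.new(:threads, :rpm, :avg_response_ms)

# Walks the ramp in order and stops at the first step whose response
# time exceeds the lightly loaded baseline by more than `tolerance`.
# Returns the highest-throughput step before that point, or nil.
def max_sustainable_step(steps, tolerance: 1.2)
  return nil if steps.empty?

  baseline = steps.first.avg_response_ms
  steps.take_while { |s| s.avg_response_ms <= baseline * tolerance }
       .max_by(&:rpm)
end

# Made-up readings from a slow ramp-up, lowest thread count first.
SAMPLE_STEPS = [
  Step.new(25, 1_500, 500),
  Step.new(50, 3_000, 510),
  Step.new(75, 4_400, 560),
  Step.new(100, 4_600, 1_900),
].freeze

best = max_sustainable_step(SAMPLE_STEPS)
puts "Sustainable: #{best.rpm} rpm at #{best.threads} threads"
```

With these sample readings the answer is the 75-thread step: throughput barely improves at 100 threads, while response time more than triples.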
As the day went on, we incrementally added code changes to cache some database requests, as well as other late-breaking requests we needed before we were live on Thursday. This is where our deployment process started to break down.
We have the following environments: local machines, dev, demo, staging, production. Normally, we have a test script that runs (takes about 40 minutes) before pushing to dev. The deployment script takes 15-30 minutes (80% of which is spent waiting for git clone. Why???). And we have to repeat that process for each environment to move closer and closer to production.
The only environment that is a mirror of production is staging. If we had a code optimization we needed to performance test, it would take hours to deploy the change to staging. Alternatively, we could (and did) log into the staging servers and apply the changes manually (ssh into server, vi to edit files, restart mongrel), but we would need to repeat across four servers, bring down the auto-scale servers, rebuild the auto-scale AMI, then bring up the auto-scale servers. Even the short-cut process took 20 to 30 minutes.
We already had New Relic installed. I have always loved New Relic since I used it several years ago when I developed a Facebook app. It's a performance monitoring solution for Rails and other web app environments. I had really only used it as a development tool, as it lets you see where your app is spending its time.
It's also intended for production monitoring, and what I didn't know up until that point is that if we paid for the professional plan, it would identify slow-running page requests and database queries, and provide quite a lot of detail about exactly what had executed.
So, desperate to get more data, we upgraded to the New Relic Pro plan. Instantly (ok, within 60 seconds), we had a wealth of data about exactly what DB requests were slow and where our app was spending its time.
One DB query was taking 95% of all of our DB time. As soon as I saw that, and remembered that our performance had been getting worse during the day, it all snapped into place: we had to have an unindexed field we were querying on. Because the JMeter tests were creating entries in the database tables, and because they had been running for hours, we had hundreds of thousands of rows in the database, so we were doing full table scans.
I checked the DB schema.rb file, and saw that someone had incorrectly created a multi-column index. (In general, I would say to avoid multi-column indexes unless you know specifically that you need one.) If a multi-column index is defined on columns A, B, and C, you can use it to query against A, A and B, or A and B and C. But the columns are order dependent, and you cannot leave out column A.
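The rule can be modeled in a few lines. This is a deliberately simplified sketch of the leftmost-prefix behavior for equality lookups; it ignores range conditions and anything else a real query optimizer considers, and the function names are hypothetical.

```ruby
# Simplified model of the leftmost-prefix rule for a multi-column index.
# Returns the leading index columns that a query filtering on
# `query_columns` can actually use.
def usable_index_prefix(index_columns, query_columns)
  index_columns.take_while { |col| query_columns.include?(col) }
end

# The index helps only if at least its first column is in the query.
def index_usable?(index_columns, query_columns)
  !usable_index_prefix(index_columns, query_columns).empty?
end

INDEX = [:a, :b, :c].freeze

[[:a], [:a, :b], [:a, :b, :c], [:a, :c], [:b, :c], [:c]].each do |query|
  prefix = usable_index_prefix(INDEX, query)
  puts "query on #{query.inspect}: uses #{prefix.inspect}"
end
```

A query on B and C, or on C alone, gets nothing from this index and falls back to scanning the table. A query on A and C can use only the A part.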
That's exactly what had happened in our case. So I quickly created a migration to create the index we needed, we ran the migration on our server, let the server have five minutes to build the index, and then we reran the tests. We scaled to 120 requests per second (7,200 rpm), and the DB load was only 16%.
This was Wednesday afternoon, about 3pm. At this point we figured the DB server could handle the load from about 36,000 page requests per minute. We were bumping up against the app server CPU load again.
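One way to arrive at a figure like that is a straight-line extrapolation from the measurement above (7,200 rpm at 16% DB load). The sketch assumes DB load grows linearly with request rate and that 80% load is the ceiling we are willing to run at; both are assumptions made for this example.

```ruby
# Linear extrapolation: if `measured_rpm` produces `measured_load_pct`,
# how many rpm would produce `target_load_pct`?
# Assumes load scales linearly with request rate (a simplification).
def projected_rpm(measured_rpm:, measured_load_pct:, target_load_pct:)
  measured_rpm * target_load_pct / measured_load_pct.to_f
end

measured_rpm = 120 * 60 # 120 requests per second = 7,200 rpm

at_80 = projected_rpm(measured_rpm: measured_rpm,
                      measured_load_pct: 16,
                      target_load_pct: 80)
at_100 = projected_rpm(measured_rpm: measured_rpm,
                       measured_load_pct: 16,
                       target_load_pct: 100)

puts "Projected rpm at 80% DB load: #{at_80.round}"
puts "Projected rpm at 100% DB load: #{at_100.round}"
```

Under those assumptions, 80% DB load corresponds to 36,000 rpm and full saturation to 45,000 rpm. Real databases rarely scale this cleanly, so a projection like this is a planning number, not a guarantee.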
At some point during the day we had maxed out the number of EC2 instances we could get. We had another team doing scalability testing that day, and we were hitting some upper limit. We killed unnecessary running instances, we asked everyone else to stop testing, and we requested an increase in our upper limit via AWS, then went to our account rep and asked them to escalate the request.
By the time we left late Wednesday night, we felt confident we could handle the load on Thursday morning.
Thursday came, and we all had Google Analytics, New Relic, and our backdoor interfaces covering all screens. Because it was all happening at 5:30am PST, we were spread across hotel rooms and homes. We were using Campfire for real-time chat.
The first load came at 5:38am, and the spike happened quickly: from 0 visitors we went to over a thousand visitors in just a minute. Our page response times stayed rock steady at 500ms, and we never experienced any delays in serving customer pages.
We had a pricing bug come up, and had to hot patch the servers and reboot them while under load (one at a time, of course), and still were able to keep serving up pages. Our peak load was well under the limits we had load tested, but was well over what we could have sustained on Monday afternoon prior to the changes we made. So the two and a half days of grueling effort paid off.
- Had we paid for the Pro plan for New Relic earlier, we probably would have saved ourselves a lot of effort, but either way, I do believe New Relic saved our butts: if we hadn't gotten the data we needed on Wednesday afternoon, we would not have survived Thursday morning.
- Load testing needs to be done regularly, not just when you are expecting a big event. At any point someone might make a database change, forget an index, or introduce some other weird performance anomaly that isn't spotted until we have the system under load.
- There needed to be a relatively safe, automated way to deliver expedited deployments to staging and production for times when you must shortcut the regular process. That shouldn't be dependent on one person using vi to edit files and not making any mistakes.
- We couldn't have scaled app instances and database servers so quickly if we weren't using a cloud service.
- It really helped to have the team all in one place for this. Normally we're distributed, and that works fine when it's not crisis mode, but I wouldn't have wanted to do this and be 1,000 miles away.
*As usual, I'm speaking as and for myself, not my employer, even when I'm writing about my day job.