May 18th Downtime Retrospective
May 19th, 2011
Both of Spreedly’s products, Subscriptions and Core, were down for a little over four hours yesterday (5/18). We know this had a significant impact on our customers’ businesses, and we want to give you a rundown of what happened, what we learned, and what we plan to do to keep this from ever happening again.
Many of you may have noticed over the past few weeks that we’ve been suffering from “blips” in availability: basically we would lose connectivity for ~3 minutes. These were due to the router provided by our host “going away” for that period of time. All our internal services were up and available, but they had no connectivity.
This router was a big problem: while it was dedicated to us, it was owned by our host. That meant we had very little visibility into what was happening on it, so our first priority was to prep a replacement that we actually owned. We’ve been working on that for the past two weeks, getting hardware in and configuring it to take over. This new hardware was going to take over two functions, not just the routing, so it was a good bit of work to get it set up.
Lesson learned: Make sure there is a clear owner of each piece in your stack. Our host had control over the router but wasn’t monitoring it well or taking ownership of its downtime.
Two days ago (5/17) the new hardware was just about ready to go, so John took it to the data center, racked it up, and started configuring it. Some extra parts were needed, so he cleaned up and headed home, planning to pick it up again the next day. Yesterday (5/18) he bought the parts and headed back to the data center. Just as he was arriving, the site went down again. We assumed it was the router again, so John rushed inside to get into the system from the inside and poke at it to confirm our diagnosis.
Something that went right: I (Nathaniel) started Twittering about the issue right away, which made it much easier to keep everyone updated throughout the ordeal.
It wasn’t the router – when he arrived at our cabinet, the whole thing was completely dark. It turned out the power system to our cabinet had failed. It was at this point we discovered our host wasn’t doing a good job of monitoring power: John had to run into the office portion of the data center to get help, and no one there was yet aware that the power to our cabinet had failed.
Something that went right: John was already at the colo and able to quickly get our host to fix the power.
Lesson learned: don’t assume everything is monitored like you think it should be.
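One way to act on that lesson is to run a small, independent check from somewhere outside the data center, so an outage is noticed even when the host's own monitoring misses it. The following is only a minimal sketch in Python, not a description of our actual setup: the function names, the target list, and the timeout are all assumptions made for illustration.

```python
import socket


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def check_all(targets, probe=is_reachable):
    """Return the (host, port) pairs from targets that fail the probe."""
    return [target for target in targets if not probe(*target)]


if __name__ == "__main__":
    # Hypothetical target list; replace with the public endpoints to watch.
    targets = [("example.com", 443)]
    for host, port in check_all(targets):
        print(f"ALERT: {host}:{port} is unreachable")
```

Run on a schedule from a machine in a different facility, a check like this only confirms inbound reachability. It would not have said the cause was a power failure, but it would have raised the alarm without relying on anyone else's monitoring.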
We’re now about an hour into the downtime, and our servers are coming back online. That said, the replacement hardware for the router was already racked, and we made the decision to bring it up right then rather than take separate downtime for it later – the switch was going to happen within the next few days anyhow, and it didn’t seem to make sense to spend time on the old router, which we knew had issues. There were multiple pieces to the replacement, so John started bringing them up.
This is where things started to go really sideways. Nothing seemed to be working with the new hardware, even though it had been well tested before the switch. After fighting with it for a while, we realized that occasional three-minute downtimes were better than extending this outage any longer, and started rolling back to what we had had before. The problem was, it wouldn’t work either! In particular, while we could now make outbound connections, no inbound connections were getting through.
Lesson learned: tempting as it might be, it’s not a good idea to piggyback changes onto downtime, since you don’t know for sure that you have a solid fallback position.
Our host began prepping a new router, and that was up about three hours into the outage. It took us another 45 minutes to figure out that they’d configured it for the wrong IP range – this is yet another router we don’t own – and for them to then fix the configuration.
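A misconfigured range like this is the kind of thing a quick sanity check can catch before traffic is cut over. As a hedged illustration only, the Python sketch below uses the standard `ipaddress` module to list any service addresses that fall outside the range a router is configured for. The function name and the addresses (taken from the reserved documentation ranges) are hypothetical, not our real network.

```python
import ipaddress


def addresses_outside_range(addresses, configured_cidr):
    """Return the addresses that are NOT inside the configured network."""
    network = ipaddress.ip_network(configured_cidr)
    return [a for a in addresses if ipaddress.ip_address(a) not in network]


if __name__ == "__main__":
    # Hypothetical values: the addresses our services answer on, and the
    # range the router has been configured to route.
    service_addresses = ["192.0.2.10", "192.0.2.11"]
    router_range = "198.51.100.0/24"

    for address in addresses_outside_range(service_addresses, router_range):
        print(f"{address} is not covered by {router_range}")
```

An empty result means every address is covered. Anything printed points at a mismatch worth resolving before the router goes live.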
Once that final issue was resolved, it was a matter of minutes to get Subscriptions back up and running. Core followed shortly thereafter once we realized the power outage had affected one of our redundant pieces of hardware.
So where do we go from here? First of all, we will announce a short period of planned downtime in the near future to bring this new hardware up. It’s critical that we own the whole stack, and we did learn some key things during this whole fiasco that will make it that much easier to bring up the new hardware in a solid, planned fashion.
Second, one of the things we’re doing with this new hardware is getting closer to our “ideal cabinet” setup. We have been planning for a while to expand to multiple locations – part of the reason we picked Riak to underlie Core – and this is accelerating that significantly. Running in multiple geographic locations is architecturally challenging, but well worth the investment based on our future plans for Subscriptions and Core.
Third, we’re offering credit for a week’s worth of Spreedly fees to any customer who requests it. Just drop us a line and let us know you’d like the credit, and we’ll calculate it and put it on your account. It’s a token amount against the business you may have lost, but hopefully it helps show we’re serious about making this right.
As always, we’d love to hear any and all questions and comments you may have, so feel free to drop us a line at support@spreedly.com. Thanks for your patience with us through these growing pains, and here’s to us all continuing to grow our businesses in the coming months!
