January 11

What happened today?

On January 11, 2012, CheddarGetter was unavailable beginning at around 17:03 UTC. This unprecedented and unexpected downtime lasted for approximately 2.25 hours.

No data was lost, and no payments will be lost. Now that we are back up, our system automatically processes the payments that were due to be processed during that period.

But during that time no data was able to be added. This is a big deal, because it means that new customers were not able to sign up during that time (among other things). We know how frustrating this was for you because your businesses rely on CheddarGetter just like ours does.

What’s next: This problem will not happen again; it has been fixed. We are working to ensure that other things like this will not happen, and are committed to keeping events like today’s black swan notable for their rarity. This means working to eliminate this single point of failure as soon as possible.

I’m not sure how much it helps, but we are very sorry.

The gory technical details:

An unexpected first responder, Papertrail, landed a notification in the team’s inboxes at approximately 17:03 UTC. The message contained an alert explaining that 112 icky log messages occurred on our web servers in the 10 seconds prior to the notification being sent. In other words, thanks to Papertrail, we were aware of the problem almost immediately. Moments later, an annoying number of beeps, buzzers, and bells began sounding off ad nauseum.

The really annoying part was that the only information we had to go on was desperately vague. Those of you with PHP experience might be familiar with the error message: “PHP Fatal error: Exception thrown without a stack frame in Unknown on line 0”. Lovely. That almost always means that we threw an exception from an object constructor. You can’t do that. Now, on to tracking it down with that as a starting point. We had another error that came along with the infamous “Unknown on line 0”: “PHP Warning: Unknown: Failed to write session data (user). Please verify that the current setting of session.save_path is correct (/var/lib/php5) in Unknown on line 0”. Ok, so that actually helps a little, right? Wrong. More on that later.

First thought: any boneheaded deployments recently? Nope, no deployments at all within 23 hours.

Second thought: problems saving session? Hmm… that’d be odd. Ok, we’re using Memcached for session storage so let’s check that out. Using telnet try a simple set. Success. Try a simple get. Success. Damn. One of the most frustrating parts of troubleshooting a problem like this is desiring failure and not getting it. Memcached is really solid so no surprise there.

Third through fifty-seventh thought: grasping at straws around the idea of “Failed to write session data”. This included some line-by-line “Hello World” nonsense. Grr…

What we found here was that the problem was actually related to a DB connection failure during auth.

In walks an intern who says: “Is there anything I can do that wont cause you to punch me in the face?”

CEO: “Hmm, probably not but we’re getting close to finding the cause. It appears to be database connection related.”

Intern: “Did Rackspace reboot anything recently?”

CEO: “Not that we’re aware of.” [Checks uptime on db server]. “Damn, this server’s been rebooted.” [Checks DB connection from web head]. “DB connection is hanging from web1”. [sudo iptables -F]. Ding, ding, ding!!!

So, not quite a perfect storm but, as usual, multiple actions worked together to cause and sustain this downtime. A quick check of the inbox showed a notification from Rackspace regarding a server reboot. Since RS had to force power cycle the offending server, our live iptables rules were not saved (i.e., “post-down iptables-save -c > /etc/iptables.rules” did not fire). Unfortunately the most recent saved ruleset did not include an allow for the web heads. Upon reboot, no connectivity.

Also very common, a well meaning dev tried to be clever about detecting db connection failure and report a nice error message. Unfortunately that included throwing an exception “without a stack frame”. So instead of the nice, well-thought-out error message, we had what we had here today.