January 15

System Performance: Database Server Upgrade

As of Sunday 2013-01-13, we’re living on a new, higher-powered master database server.

We’ve been planning to make this move for some time. Our database had grown too large to hold in RAM on our previous server, and some queries were taking upwards of a minute to complete. That was unacceptable, so we began the process of building out a bigger server, with architecture improvements to allow us to scale more easily. The new server had been set up for about six days and was happily replicating our production data in real time. We had run almost all of our pre-migration tests, and we were planning to make the switch this week.

In the wee hours of Sunday morning EST, our hand was forced. Our old database server failed, and we were compelled to make the switch to the new server early. There were a couple of glitches during the switchover that caused the database to be down longer than it should have been. For that we’re very sorry. I mentioned that we had run *almost* all of our tests. It turns out that the ones we hadn’t gotten to would have shown us the problem earlier.

In the end, though, the cutover went smoothly, and we’re now up and running full steam on a vastly improved master server.

On the agenda this year is a project to improve our database layer. Today the database is a single point of failure, which is obviously not ideal. In the worst case, we could again experience unplanned downtime of a couple of hours or more, and that potential downtime grows as the size of the database grows. To be ready for such a catastrophe, we maintain verified, real-time replicated data on a passive slave at all times. But our current architecture requires human intervention to fail over to that slave: we run tests to determine whether there has been data loss, then point the app at the new database host. That process can take some time.
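For the curious, the manual failover described above boils down to two steps: verify the slave has everything the master wrote, then repoint the app. Here’s a minimal sketch of that logic; the function names, hostnames, and replication-position checks are hypothetical illustrations, not our actual tooling.

```python
# Hypothetical sketch of a manual database failover.
# Names and hosts are illustrative, not our real infrastructure.

def replication_caught_up(master_pos: int, slave_pos: int) -> bool:
    """Check that the passive slave has applied every write the master logged."""
    return slave_pos >= master_pos


def fail_over(master_pos: int, slave_pos: int, config: dict) -> dict:
    """Promote the slave, refusing to proceed if data would be lost."""
    if not replication_caught_up(master_pos, slave_pos):
        # A human has to investigate the gap before promoting.
        raise RuntimeError("slave is behind master; possible data loss")
    # Point the application at the newly promoted database host.
    return dict(config, db_host="slave.db.internal")


config = {"db_host": "master.db.internal"}
promoted = fail_over(master_pos=1024, slave_pos=1024, config=config)
```

Every step here is simple, but each one needs a person watching it, which is why the process takes as long as it does.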

With the database layer improvement project, we’ll gain increased hardware resources and an architecture that ideally eliminates downtime caused by the failure of a single database node. In other words, the database will no longer be a single point of failure. We’ll have more on this later.