Last night, at 6:08pm EDT, the Zencoder service went offline due to a database failure. We began working on the problem immediately, but unfortunately our primary approach was unsuccessful, and the secondary approach took an extended period to implement. In total, the service was unavailable for 6 hours and 18 minutes.
Here is a detailed description of what happened, why it happened, and what we are doing to make sure it does not happen again.
What happened
Among other components, the Zencoder stack relies on a PostgreSQL database. We believe PostgreSQL to be an excellent database: fast, reliable, scalable, and well-designed. Many services rely on PostgreSQL, and we have run it successfully since Zencoder’s inception.
For reasons we are still investigating, an internal PostgreSQL database process (autovacuum) stalled while running on a single table in our database. This ultimately caused the database to stop accepting new transactions, effectively putting it in “read-only” mode. At this time, we believe the underlying problem may have been a bug in the version of PostgreSQL we were using, but we are still verifying this.
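For the technically curious: the behavior described above is consistent with PostgreSQL’s transaction ID wraparound safeguard, in which a stalled autovacuum lets the oldest unfrozen transaction ID age toward a hard limit, at which point the server refuses new write transactions. The sketch below is illustrative only and is not our actual tooling; it shows one way to check how close a cluster is getting to that limit using Python and psycopg2, with a placeholder connection string.

```python
import psycopg2

# Illustrative connection string; adjust for your own environment.
conn = psycopg2.connect("dbname=postgres")

with conn.cursor() as cur:
    # How far each database's oldest unfrozen transaction ID has aged.
    # When this value approaches PostgreSQL's hard wraparound limit,
    # the server stops accepting new write transactions.
    cur.execute("""
        SELECT datname, age(datfrozenxid) AS xid_age
        FROM pg_database
        ORDER BY xid_age DESC
    """)
    for datname, xid_age in cur.fetchall():
        print(f"{datname}: oldest unfrozen xid is {xid_age:,} transactions old")

    # The individual tables holding the freeze horizon back; a stalled
    # autovacuum on any one of them keeps the database-wide age climbing.
    cur.execute("""
        SELECT relname, age(relfrozenxid) AS xid_age
        FROM pg_class
        WHERE relkind = 'r'
        ORDER BY xid_age DESC
        LIMIT 5
    """)
    for relname, xid_age in cur.fetchall():
        print(f"  {relname}: {xid_age:,}")

conn.close()
```

In normal operation these ages hover around autovacuum_freeze_max_age, because autovacuum freezes old rows whenever a table crosses that threshold; a single table whose anti-wraparound vacuum never finishes is enough to drag the whole database toward the hard limit.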
Within minutes, engineers were alerted to the problem and started an operation that should have unfrozen the database. This process takes time, especially on a large database, but we expected that it would finish in relatively short order. Unfortunately, this process stalled, possibly due to the same issue that caused the problem in the first place.
In parallel, we considered failing over to a standby server. We have redundant database servers (along with redundancy across the rest of our stack), and had this been a hardware failure, we could have failed over to our secondary server within minutes. Unfortunately, the issue had been replicated to the secondary database as well, so failover was not an option.
Eventually, we determined that the operation itself was not working. We decided to take a more drastic step and stood up a new stack. Jobs started processing again by 12:26am EDT.
What we will do
In response to this incident, we have already begun work on multiple layers of improvements.
First, we have upgraded to a newer version of PostgreSQL that is not susceptible to the particular bug we identified.
Second, we are improving our database configuration and monitoring to ensure that the conditions that led to this problem do not recur (a sketch of what such a check might look like appears below).
Beyond that, we are working on broader improvements that will minimize the impact of future problems, including faster recovery in the face of a catastrophe and additional layers of redundancy.
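To make the monitoring piece concrete, the sketch below shows one illustrative example of such a check; it is not a description of our production alerting, and the five-times-autovacuum_freeze_max_age threshold is simply an example of a margin that leaves ample time to react before PostgreSQL’s hard cutoff.

```python
import psycopg2

# Example alert threshold: page someone once any database's transaction ID
# age exceeds a multiple of autovacuum_freeze_max_age, long before the hard
# wraparound cutoff that forces the server to stop accepting writes.
ALERT_MULTIPLIER = 5

conn = psycopg2.connect("dbname=postgres")  # illustrative connection string
with conn, conn.cursor() as cur:
    cur.execute("SELECT current_setting('autovacuum_freeze_max_age')::bigint")
    freeze_max_age = cur.fetchone()[0]

    cur.execute("SELECT datname, age(datfrozenxid) FROM pg_database")
    for datname, xid_age in cur.fetchall():
        if xid_age > ALERT_MULTIPLIER * freeze_max_age:
            print(f"ALERT: {datname} xid age is {xid_age:,} "
                  f"(threshold {ALERT_MULTIPLIER * freeze_max_age:,})")
conn.close()
```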
(Our full response to the incident is still being determined, as we are verifying the root cause of the operation’s failure. We are happy to share more information with interested customers as we make progress in the coming days.)
We sincerely apologize for this problem and for its impact on our customers. We pride ourselves on operating reliably at very high scale, and we will work hard to make sure this never happens again.