We just had two hardware failures occur in quick succession leading to intermittent errors on the Dashboard and an outage on a subset of blogs.
Hardware failures are the norm when operating at our scale, and we’re confident this cluster should have had enough capacity to tolerate these issues. However, the failover appeared to hang and trigger a cascading failure.
Our operations team was able to quickly identify and begin recovering from the issue. We’re re-examining this specific method of failover and are thoroughly reviewing our procedures related to this incident.
Our deepest apologies to those affected. We owe you better.