This morning, our primary datacenter lost its connection to the Internet after our upstream provider (ISP) suffered a major outage caused by cut fiber lines.
It took nearly five hours before service was fully restored.
Our failover procedures, which are designed to immediately rebalance traffic onto our backup connections, did not work as expected.
The major failing here was a gap in how we had been performing ongoing testing of these backup links: our checks missed an obscure but critical issue that prevented production traffic from being served over them.
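To illustrate the class of gap involved (this is a hypothetical sketch, not our actual monitoring code; the function names and check logic are illustrative only): a health check that merely confirms a backup link reports "up" can pass even when the link cannot carry real production traffic, whereas a check that pushes a realistic probe through the link would catch it.

```python
def shallow_check(link):
    """Only verifies the link reports 'up'.

    This is the kind of check that can pass while the link is
    still unable to carry production traffic (e.g. due to a
    routing, MTU, or filtering issue somewhere on the path).
    """
    return link.get("status") == "up"


def deep_check(link, probe):
    """Also sends a realistic probe over the link and verifies
    the response, the way production traffic would exercise it."""
    if link.get("status") != "up":
        return False
    try:
        return probe(link) == "OK"
    except OSError:
        return False


# A backup link that looks healthy but silently drops traffic:
backup = {"status": "up"}

def broken_probe(link):
    raise OSError("probe traffic never completed")

print(shallow_check(backup))               # passes despite the fault
print(deep_check(backup, broken_probe))    # correctly fails
```

The design point is simply that the probe must resemble the traffic the link is expected to serve; testing link state alone is not enough.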
We have already begun re-evaluating and overhauling these testing procedures, and we'll be working through the weekend to make sure we're not caught off guard again.
Thank you for your patience today.