We deeply regret that, starting on April 25th, our Scheduler service experienced a series of outages over a 24 to 48 hour period. We understand the disruption this may have caused to your operations, and we sincerely apologize.
The root cause was an unexpected change in the behavior of a third-party package that we use to redirect web traffic from HTTP to HTTPS. Previously, this package consistently returned HTTP 200 status codes when redirecting traffic. It suddenly began returning 302s instead. Our auto-scaling mechanism did not treat these responses as healthy and began removing hosts from the cluster. With fewer hosts in service, the availability of our Scheduler service dropped.
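To illustrate this failure mode in general terms, here is a simplified sketch rather than our production code. It assumes a health policy in which only a 200 response counts as healthy. The function names, host names, and the policy itself are hypothetical and chosen for illustration.

```python
from typing import Dict, List


def is_healthy(status_code: int) -> bool:
    """Assumed strict policy: only HTTP 200 counts as a healthy response."""
    return status_code == 200


def hosts_to_remove(probe_results: Dict[str, int]) -> List[str]:
    """Return the hosts that a scaler following this policy would take out of service."""
    return sorted(
        host for host, code in probe_results.items() if not is_healthy(code)
    )


# Hypothetical probe results before and after the redirect behavior changed.
before_change = {"host-a": 200, "host-b": 200, "host-c": 200}
after_change = {"host-a": 302, "host-b": 302, "host-c": 200}

if __name__ == "__main__":
    print(hosts_to_remove(before_change))
    print(hosts_to_remove(after_change))
```

Under this assumed policy, hosts that are serving traffic normally are still flagged for removal as soon as their probe response changes from 200 to 302, even though nothing is wrong with the hosts themselves.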
We believe the change in behavior may have originated at the load balancer or Kubernetes cluster level, but the exact trigger and timing are still under investigation.
As a result, the Scheduler service was unavailable for prolonged periods, and you were unable to load the application. Because the outage recurred several times during this period, we are still calculating the precise total downtime.
In response to this incident, we took immediate action. We removed the third-party package from our product and changed how we check the health status of the hosts that power the Scheduler service.
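One common way to keep health checks independent of redirect behavior is to serve a dedicated health endpoint that answers directly with a 200. The sketch below shows that general pattern under stated assumptions and is not a description of our actual implementation. The `/healthz` path, the `example.invalid` redirect target, and the function names are hypothetical, and the example uses only the Python standard library.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

HEALTH_PATH = "/healthz"  # hypothetical path, chosen for illustration


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == HEALTH_PATH:
            # The health endpoint answers directly and is never redirected.
            body = b"ok"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            # Stand-in for an HTTP-to-HTTPS redirect on ordinary traffic.
            self.send_response(302)
            self.send_header("Location", "https://example.invalid" + self.path)
            self.send_header("Content-Length", "0")
            self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def probe(host: str, port: int, path: str) -> int:
    """Return the raw status code; http.client does not follow redirects."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", path)
        return conn.getresponse().status
    finally:
        conn.close()


def run_demo():
    """Start a local server on a free port and probe both kinds of path."""
    server = HTTPServer(("127.0.0.1", 0), Handler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        port = server.server_address[1]
        return probe("127.0.0.1", port, HEALTH_PATH), probe("127.0.0.1", port, "/")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(run_demo())
```

With this arrangement, a change in how ordinary traffic is redirected does not alter the response that the health probe sees.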
We did not stop there. We also completely rearchitected our underlying server infrastructure. This work was time-consuming and complex, but it was necessary to give you both improved reliability and better performance. Our commitment to your experience with our service drives us to keep building better, more efficient systems.
Once again, we regret any inconvenience this may have caused. We have taken the measures described above to resolve this issue and to guard against a similar occurrence in the future. We recognize the trust you place in our services, and we do not take it lightly. Moving forward, we will continue to monitor the Scheduler service closely and take the precautions needed to keep your services uninterrupted.
Thank you for your patience and understanding during this time. We are committed to serving you better and appreciate your continued support. Please do not hesitate to reach out if you have any questions or concerns.