Nylas Scheduler currently unreachable
Incident Report for Nylas
Postmortem

We deeply regret that, starting on April 25th, our Scheduler service experienced a series of downtime incidents over a period of roughly 24 to 48 hours. We understand the inconvenience and disruption this may have caused to your operations, and we sincerely apologize for it.

The root cause of these incidents was an unexpected change in the behavior of a third-party package that we use to redirect web traffic from HTTP to HTTPS. Previously, this package consistently returned HTTP 200 status codes when redirecting traffic. However, it suddenly began returning 302s, which unintentionally triggered our auto-scaling mechanism to start removing hosts from the cluster. As a result, the availability of our Scheduler service was reduced.
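
The failure mode described above can be illustrated with a minimal sketch. It assumes a health check that treats only a 200 response as healthy. The host names, function names, and list of accepted codes are hypothetical and do not reflect the actual Scheduler configuration.

```python
from typing import Dict, Iterable, List


def is_healthy(status_code: int, accepted: Iterable[int] = (200,)) -> bool:
    """Return True if the probe's status code is in the accepted set."""
    return status_code in set(accepted)


def hosts_to_remove(probe_results: Dict[str, int],
                    accepted: Iterable[int] = (200,)) -> List[str]:
    """List the hosts a strict health check would flag for removal."""
    return sorted(
        host for host, code in probe_results.items()
        if not is_healthy(code, accepted)
    )


# Hypothetical probe results before and after the redirect behavior changed.
before = {"scheduler-host-a": 200, "scheduler-host-b": 200}
after = {"scheduler-host-a": 302, "scheduler-host-b": 302}

print(hosts_to_remove(before))
print(hosts_to_remove(after))
# Accepting redirect codes as well is one possible mitigation (an assumption,
# not necessarily the fix that was applied).
print(hosts_to_remove(after, accepted=(200, 301, 302)))
```

In this sketch, every host that answers with a 302 is flagged for removal even though it is still serving traffic, which is the kind of mismatch between a probe's expectations and a redirect layer's responses that can shrink a cluster.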

We believe the change may have originated at the load balancer or Kubernetes cluster level, though the exact trigger and timing are still under investigation.

The outcome of this unforeseen issue was prolonged unavailability of the Scheduler service, leaving you, our valued users, unable to load the application. We are still calculating the precise duration of the downtime, because the outage recurred several times over this period.

In response to this incident, we took immediate and decisive action. We removed the third-party package from our product and changed how we fetch the health status of the hosts powering the Scheduler service.
But we didn't stop there. We've completely rearchitected our underlying server infrastructure. This step, while time-consuming and complex, was necessary to ensure we provide you with not only improved reliability, but also elevated performance. Our commitment to your experience with our service drives us to keep pushing for better, more efficient systems.

We deeply regret any inconvenience this may have caused and want to assure you that we've taken all necessary measures to resolve this issue and safeguard against a similar occurrence in the future. We recognize the trust you place in our services, and we don't take this lightly. Moving forward, we promise to continue monitoring the situation closely and to take all necessary precautions to ensure your services remain uninterrupted.

Thank you for your patience and understanding during this time. We are committed to serving you better and appreciate your continued support. Please do not hesitate to reach out if you have any questions or concerns.

Posted May 16, 2023 - 11:12 PDT

Resolved
We continued to monitor the service overnight and have not seen any new instances of this issue. We have created new alerts to identify similar trends in the future. We are marking this incident as resolved.
Posted Apr 27, 2023 - 06:45 PDT
Monitoring
We've implemented some infrastructure changes. We'll continue to closely monitor the results and make adjustments as necessary.
Posted Apr 26, 2023 - 14:26 PDT
Update
We are currently working to stabilize our scheduler to provide you with a better experience. During this process, you may intermittently receive 504 Gateway Timeout errors while accessing the scheduler.
Posted Apr 26, 2023 - 08:30 PDT
Update
The Nylas Scheduler is stable again, though we are still working with our infrastructure provider to identify the underlying root cause so that we can apply an appropriate fix if one is needed. We are keeping this incident in the “identified” state for the time being.
Posted Apr 25, 2023 - 16:45 PDT
Update
We are continuing to work on a fix for this issue.
Posted Apr 25, 2023 - 15:15 PDT
Update
Our team is continuing to investigate the matter in order to determine the next steps towards resolution.
Posted Apr 25, 2023 - 14:12 PDT
Identified
We have determined that the issue is related to our infrastructure provider. Our team is actively looking into next steps to resolve this issue.
Posted Apr 25, 2023 - 13:21 PDT
Investigating
The Nylas Scheduler is currently unreachable and returning 504 Gateway Timeout errors. Our team is aware of the issue and is actively investigating it.
Posted Apr 25, 2023 - 12:32 PDT
This incident affected: Nylas Application Services (Scheduler).