US API Endpoints are unreachable
Incident Report for Nylas
Postmortem

On 31st January and 1st February we posted this incident on our status page (https://status.nylas.com/). At first it looked like our endpoints were unreachable, but they were actually experiencing increased latency and timeouts. On 31st January we believed we had solved the issue, but it returned once that night's "quiet period" was over and normal business-day traffic resumed. Over the multiple hours of the incident, our minute-by-minute logs show that there were only a handful of periods, each lasting several minutes, where latency reached the level of completely unreachable (meaning greater than 30 seconds at the P90 level). We are creating new internal guidelines around how we name each incident. We fully realize, however, that any period of "unreachable" is longer than ideal for our customers and their end users.
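
For illustration, below is a minimal sketch of the kind of per-minute P90 check described above. It is not our actual log pipeline; the input shape and helper name are assumptions, and only the 30-second threshold is taken from the text.

# A minimal sketch of the per-minute P90 latency check described above.
# Not the actual Nylas log pipeline; the input shape (minute bucket,
# latency in seconds) and the helper name are assumptions. Only the
# 30-second "unreachable" threshold comes from the postmortem.
from collections import defaultdict
from statistics import quantiles

def unreachable_minutes(samples, threshold_s: float = 30.0):
    """Return (minute, p90_latency) pairs for minutes whose P90 latency exceeds the threshold.

    samples: iterable of (minute_bucket, latency_seconds) pairs.
    """
    by_minute = defaultdict(list)
    for minute, latency in samples:
        by_minute[minute].append(latency)

    flagged = []
    for minute, latencies in sorted(by_minute.items()):
        if len(latencies) > 1:
            # quantiles(..., n=10) returns 9 cut points; index 8 is the 90th percentile.
            p90 = quantiles(latencies, n=10)[8]
        else:
            p90 = latencies[0]
        if p90 > threshold_s:
            flagged.append((minute, p90))
    return flagged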

Root Cause Analysis (RCA):
The root cause of the high API latencies was a misconfiguration of CoreDNS on our AWS Elastic Kubernetes Service [EKS] clusters. CoreDNS is responsible for resolving domain names to IP addresses within our infrastructure. Unfortunately, although our clusters were properly configured to autoscale based on demand, the configuration recommended by AWS, and pre-configured by default, caused CoreDNS to hit the upper limits that the AWS Virtual Private Cloud [VPC] imposes on its virtual network card, which produced the latencies. Due to the lack of sufficient documentation from Amazon regarding EKS CoreDNS configuration limits, there was almost no way Nylas could have foreseen and avoided this limitation. Although we found evidence of other enterprise AWS customers being "stung" by this limit, and even a blog post by AWS themselves, AWS still has not updated its default configuration or its documentation with any warning about it. The fact that AWS knows this can happen, and has still not made changes, highlights just how rarely these particular limits are hit in production, even among their largest customers.
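
As an illustration of what hitting that network-card limit looks like from a node, the sketch below reads the "allowance exceeded" counters that the Elastic Network Adapter driver exposes on Nitro-based EC2 instances; non-zero, growing values mean packets are being dropped at the interface. This is a minimal example for context, not our production tooling, and the interface name is an assumption.

# Illustrative sketch only, not production tooling. Assumptions: the node is
# a Nitro-based EC2 instance whose ENA driver exposes "allowance exceeded"
# counters via `ethtool -S`, and the primary interface is named eth0.
# These counters increment when traffic is dropped for exceeding
# per-network-interface limits (e.g. packets per second toward link-local
# services such as the VPC DNS resolver).
import subprocess

WATCHED = (
    "linklocal_allowance_exceeded",  # drops on link-local traffic (e.g. VPC DNS resolver)
    "pps_allowance_exceeded",        # drops for exceeding the aggregate PPS allowance
    "conntrack_allowance_exceeded",  # drops for exceeding connection-tracking limits
)

def ena_allowance_counters(interface: str = "eth0") -> dict:
    """Return the watched ENA 'allowance exceeded' counters for an interface."""
    out = subprocess.run(
        ["ethtool", "-S", interface],
        capture_output=True, text=True, check=True,
    ).stdout
    counters = {}
    for line in out.splitlines():
        name, _, value = line.partition(":")
        name, value = name.strip(), value.strip()
        if name in WATCHED and value.isdigit():
            counters[name] = int(value)
    return counters

if __name__ == "__main__":
    for name, value in ena_allowance_counters().items():
        print(f"{name}: {value}")  # non-zero, growing values indicate drops at the interface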

Lesson Learned/Mitigation:
We have added additional monitoring and alerting to the CoreDNS services, primarily on the Virtual Network Interface (VNIC). To ensure we had covered all of our bases, we spoke with a Senior Enterprise Technical Account Manager [TAM] at AWS, who confirmed that our previous configuration was indeed the default setting and that we were "not the first large enterprise customer he has seen eventually get hit by it, primarily customers that were much larger than Nylas". During the incident, once we identified the cause, our team took immediate steps to resolve it by modifying the CoreDNS configuration so that we avoid hitting these limits in the future, even as traffic scales. In addition, the TAM at AWS reviewed our newly revised configuration and confirmed it matched his recommended path forward for avoiding future scaling issues at the CoreDNS level. He also provided guidance on optimal alerting thresholds, giving us additional reassurance from AWS themselves that we are taking the right approach to prevention, mitigation, and monitoring.
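
For context, the sketch below shows the general shape of such a change using the Kubernetes Python client: raising the CoreDNS replica count above the EKS default of two and spreading replicas across nodes so DNS traffic is not funneled through a single node's network interface. It is an illustrative sketch under those assumptions, not the exact configuration we applied; the deployment name, namespace, and label are the EKS defaults, while the replica count and spread policy are placeholders.

# Illustrative sketch only, not the exact configuration applied during the
# incident. Assumptions: kubeconfig-based access to the cluster, the default
# EKS "coredns" Deployment in "kube-system" with the label k8s-app=kube-dns,
# and placeholder values for the replica count and spread policy.
from kubernetes import client, config

def scale_and_spread_coredns(replicas: int = 6):
    config.load_kube_config()  # or config.load_incluster_config() when run in-cluster
    apps = client.AppsV1Api()
    patch = {
        "spec": {
            "replicas": replicas,  # above the EKS default of 2
            "template": {
                "spec": {
                    # Spread replicas across nodes so DNS traffic is not
                    # funneled through a single node's network interface.
                    "topologySpreadConstraints": [{
                        "maxSkew": 1,
                        "topologyKey": "kubernetes.io/hostname",
                        "whenUnsatisfiable": "ScheduleAnyway",
                        "labelSelector": {"matchLabels": {"k8s-app": "kube-dns"}},
                    }]
                }
            },
        }
    }
    apps.patch_namespaced_deployment(name="coredns", namespace="kube-system", body=patch)

if __name__ == "__main__":
    scale_and_spread_coredns()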

Our entire team is sincerely sorry for the interruption and challenges incurred from this incident and we strive to learn from this event to improve our processes, our product, and our communication. Thank you for your patience and trust in Nylas.

Posted Feb 10, 2023 - 13:48 PST

Resolved
This incident has been resolved.
Posted Feb 01, 2023 - 13:02 PST
Update
We have removed the last mitigation measure in place and are seeing API latency and error rate back to normal levels. We will continue to monitor.
Posted Feb 01, 2023 - 12:35 PST
Update
We are seeing elevated API latency and error rate again after removing the mitigation measures in place. We are removing them one at a time and monitoring closely.
Posted Feb 01, 2023 - 11:58 PST
Update
We are seeing API latency and error rate back to normal. We will now proceed with reverting some mitigation measures still in place and keep you posted on how that goes.
Posted Feb 01, 2023 - 11:30 PST
Monitoring
The rollback is complete. We are monitoring.
Posted Feb 01, 2023 - 10:59 PST
Update
Our rollback is almost complete and we are seeing a positive trend in the latency and error rate.
Posted Feb 01, 2023 - 10:52 PST
Update
We are halfway through rolling back our previous deployment. We will keep you posted when it's done. Thank you for your continued patience.
Posted Feb 01, 2023 - 10:10 PST
Identified
We are currently rolling back a change and hope to see positive results shortly.
Posted Feb 01, 2023 - 09:17 PST
Update
We are reverting the previous deployment that we suspect is related to this issue.
Posted Feb 01, 2023 - 08:57 PST
Update
We are continuing to investigate the source of the problem.
Posted Feb 01, 2023 - 08:57 PST
Update
We are seeing an improvement in response times, and error rates are dropping.
Posted Feb 01, 2023 - 08:20 PST
Update
We believe we have determined the cause and are looking to block the source of the traffic.
Posted Feb 01, 2023 - 08:00 PST
Update
We are continuing to see an increase in 499 and 502 responses and we are still investigating the root cause.
Posted Feb 01, 2023 - 07:30 PST
Investigating
We are still receiving reports about issues with API endpoints and are investigating.
Posted Feb 01, 2023 - 06:46 PST
Monitoring
Services are back to normal now. We are monitoring the endpoints.
Posted Feb 01, 2023 - 06:22 PST
Identified
A fix has been implemented and the services are coming back to normal.
Posted Feb 01, 2023 - 06:21 PST
Investigating
We are investigating this issue at the moment and will provide further updates here.
Posted Feb 01, 2023 - 06:19 PST
This incident affected: Nylas Application Services (API).