On 31st January and 1st February we posted this incident on our status page (https://status.nylas.com/). At first it looked like our endpoints were unreachable, but they were actually experiencing increased latency and timeouts. On 31st January we believed we had solved the issue, but it returned once our "quiet period" that night was over and normal business-day traffic resumed. Over the multiple hours of the incident, our minute-by-minute logs show only a handful of periods, lasting several minutes each, where latency reached the level of completely unreachable (meaning greater than 30 seconds at the P90 level). We are creating new internal guidelines around how we name each incident so that the title more accurately reflects the actual impact. We fully realize, however, that any period of "unreachable" is longer than ideal for our customers and their end users.
Root Cause Analysis (RCA):
The root cause of the high API latencies was a misconfiguration of CoreDNS on our Amazon Elastic Kubernetes Service (EKS) clusters. CoreDNS is responsible for resolving domain names to IP addresses within our infrastructure. Unfortunately, although our clusters were properly configured to autoscale based on demand, the configuration recommended by AWS, and pre-configured by default, resulted in CoreDNS hitting an upper limit that AWS places on its virtual network card within the Virtual Private Cloud (VPC), causing the latencies. Because Amazon provides little documentation about EKS CoreDNS configuration limits, there was almost no way Nylas could have foreseen this limitation and avoided it. Although we found evidence of other enterprise AWS customers being "stung" by this limitation, and even a blog post by AWS themselves, they still have not updated their default configuration or their documentation with any warning about it. The fact that AWS knows this can happen, and has still not made changes, highlights just how rare it is to hit these particular limits in production, even among their own largest customers.
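To make the failure mode concrete, here is a rough back-of-the-envelope sketch (illustrative numbers only, not our production traffic figures). AWS throttles DNS traffic sent to the VPC resolver at roughly 1,024 packets per second per network interface, so a CoreDNS deployment left at the default replica count puts a hard ceiling on the cluster's upstream lookup throughput; once normal business-day traffic pushes past that ceiling, queries queue up and latency climbs.

```python
# Rough illustration of the per-interface DNS ceiling.
# Numbers are illustrative assumptions, not Nylas production data.

PACKETS_PER_SEC_PER_ENI = 1024   # AWS-documented limit on packets sent to the VPC DNS resolver per interface
DEFAULT_COREDNS_REPLICAS = 2     # replica count EKS provisions by default

def upstream_dns_ceiling(replicas: int) -> int:
    """Optimistic upper bound on upstream DNS queries/sec before throttling.
    Every query needs at least one packet, so the real ceiling is lower."""
    return replicas * PACKETS_PER_SEC_PER_ENI

if __name__ == "__main__":
    for replicas in (DEFAULT_COREDNS_REPLICAS, 4, 8):
        print(f"{replicas} CoreDNS replicas -> roughly {upstream_dns_ceiling(replicas)} "
              f"upstream queries/sec before packets start being dropped")
```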
Lesson Learned/Mitigation:
We have added additional monitoring and alerting to the CoreDNS services, primarily on the virtual network interface (VNIC). To ensure we had covered all of our bases, we spoke with a Senior Enterprise Technical Account Manager (TAM) at AWS, who confirmed that our previous configuration was indeed the default setting, and that we were "not the first large enterprise customer he has seen eventually get hit by it, primarily customers that were much larger than Nylas". During the incident, once we identified the cause, our team took immediate steps to resolve it by modifying the CoreDNS configuration so that we avoid hitting these limits in the future, even as traffic scales. In addition, the TAM at AWS reviewed our revised configuration and confirmed it matches his recommended path forward for avoiding future scaling issues at the CoreDNS level. He also provided guidance on optimal alerting thresholds, giving us additional reassurance from AWS themselves that we are taking the right approach to prevention, mitigation, and monitoring.
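As one illustration of the kind of alerting this enables (a minimal sketch, not our exact production setup; the metric name, namespace, and SNS topic below are assumptions, based on publishing the network driver's "allowance exceeded" counter from each node via the CloudWatch agent), an alarm on that counter warns us the moment any interface starts dropping packets bound for the VPC resolver, well before customers see timeouts.

```python
import boto3

# Sketch: alarm when a node's network interface starts dropping DNS packets to the VPC resolver.
# Assumes the CloudWatch agent publishes the ethtool counter "linklocal_allowance_exceeded"
# under the "CWAgent" namespace; metric name, namespace, region, and SNS topic are placeholders.
# A production setup would also scope the alarm per node/interface via Dimensions.
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="coredns-eni-linklocal-allowance-exceeded",
    Namespace="CWAgent",                                  # assumed CloudWatch agent namespace
    MetricName="ethtool_linklocal_allowance_exceeded",    # assumed agent-published ethtool counter
    Statistic="Sum",
    Period=60,                                            # evaluate per-minute sums
    EvaluationPeriods=3,
    Threshold=0,                                          # any dropped packet is worth investigating
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-alerts"],  # hypothetical SNS topic
    AlarmDescription="Interface is dropping packets destined for the VPC DNS resolver",
)
```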
Our entire team is sincerely sorry for the interruption and challenges caused by this incident, and we will strive to learn from this event to improve our processes, our product, and our communication. Thank you for your patience and trust in Nylas.