US | Elevated API latencies and elevated rate of 520 errors
Incident Report for Nylas
Postmortem

On June 6th, from approximately 8:00 AM until 3:30 PM Pacific time, an attempt to improve our security and encryption measures inadvertently led to a significant disruption of our API service. This resulted in a sharp increase in API error rates and steadily growing response times. At the height of the incident, there were brief periods where as many as 50% of calls were timing out on the client side. Although we actively investigated potential causes and deployed mitigations throughout the incident, leading to periods of improved performance, we recognize that the impact on our customers was significant and unacceptably disruptive from start to finish.

We deeply regret the negative repercussions this incident had on your operations. Our mistake caused substantial inconvenience to you, our valued Nylas users, for which we sincerely apologize. In the aftermath of this unfortunate event, we aim to provide reassurance by offering complete transparency. In this postmortem, we will detail the lessons learned from this incident and outline the changes we have implemented to prevent a recurrence. We are committed to learning from our mistakes and to continually improving our service to you. We want to assure you that we regard any outage as a severe issue, and we will always prioritize minimizing disruptions to your operations.

The underlying cause was traced to a change to the connection security settings of one of our Redis caching instances, which we mistakenly believed was used by only a single API endpoint. In fact, it was also serving as a secondary authentication system supporting the /calendars and /events endpoints, and this additional use had never been recorded in our system documentation. This attempt to improve backend security measures led to an unforeseen degradation of our API service, resulting in an elevated rate of HTTP errors and increased API latency for close to 8 hours.
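For illustration only, the sketch below shows this failure mode from a client's point of view, assuming the security change enabled TLS (in-transit encryption) on the cache endpoint. The hostname, the specific setting, and the use of the redis-py client are assumptions for the example, not details from the incident.

    # Illustrative only: the hostname and the exact security setting (assumed
    # here to be TLS, i.e. in-transit encryption) are hypothetical and not
    # taken from the actual incident.
    import redis

    # A client that is still configured for plaintext connections will stall
    # or fail once the server side starts requiring TLS:
    legacy_client = redis.Redis(
        host="auth-cache.internal.example.com",  # hypothetical endpoint
        port=6379,
        socket_connect_timeout=2,  # without a timeout, calls can hang
    )

    # Restoring service amounts to making client and server settings agree,
    # for example by enabling TLS on the client as well:
    updated_client = redis.Redis(
        host="auth-cache.internal.example.com",
        port=6379,
        ssl=True,  # connect over TLS to match the server
        socket_connect_timeout=2,
    )

    try:
        updated_client.ping()
        print("cache reachable with matching security settings")
    except redis.ConnectionError as exc:
        print(f"cache unreachable: {exc}")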

The initial signs of trouble were deceptive. Our API slowed down because DNS queries were taking longer than usual. Given previous incidents tied to CoreDNS on AWS EKS clusters, we initially treated this as the root cause. It turned out to be a red herring that led us away from the real issue. The broken connection to the Redis instance became evident only when we rebuilt one of our clusters from scratch in an attempt to resolve the issue and were alerted that it could not connect to the Redis cache. Upon discovery, we promptly updated the Redis instance settings, which brought immediate improvement: API success rates recovered and latencies returned to normal levels.
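For illustration, a hidden dependency like this one can be surfaced much earlier with a fail-fast connectivity check at service startup. The sketch below is a hypothetical example, not our actual tooling; the dependency names and endpoints are invented.

    # Hypothetical fail-fast startup check; the dependency names and
    # endpoints below are invented for illustration and are not Nylas
    # internals.
    import socket
    import sys

    DEPENDENCIES = {
        # name: (host, port). An undocumented cache like the one in this
        # incident is exactly the kind of entry this map should capture.
        "auth-redis-cache": ("auth-cache.internal.example.com", 6379),
        "primary-datastore": ("db.internal.example.com", 5432),
    }

    def check_dependencies(timeout_seconds: float = 2.0) -> bool:
        """Return True only if every declared dependency accepts a TCP connection."""
        ok = True
        for name, (host, port) in DEPENDENCIES.items():
            try:
                with socket.create_connection((host, port), timeout=timeout_seconds):
                    print(f"[ok]   {name} reachable at {host}:{port}")
            except OSError as exc:
                print(f"[fail] {name} unreachable at {host}:{port}: {exc}")
                ok = False
        return ok

    if __name__ == "__main__":
        # Refuse to take traffic if a dependency is down, so a broken
        # connection surfaces in minutes instead of hours.
        sys.exit(0 if check_dependencies() else 1)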

Recognizing our communication shortfalls during this incident, we've taken decisive steps to improve. We've established a full-time "Incident Commander" role, an individual tasked with refining our incident communication processes and delivering consistent, timely, and comprehensive updates during disruptions. Their primary duty during incidents is to ensure clear and up-to-date information is gathered from all parties and shared on our status page at least every thirty minutes. This role, identified as crucial even before this incident, has been filled, with the new hire starting today. Consequently, you can anticipate significantly enhanced communication in any future incident, reinforcing our commitment to transparency and minimizing any inconvenience for you.

Reflecting on this incident has also highlighted the need to refine our maintenance and deployment protocols. While our existing process is regimented and semi-automated, and includes Deployment Calendars, Scheduled Maintenance windows, and dedicated cross-team communication, we've identified areas for improvement. One significant enhancement we've implemented is requiring all engineering leads to review and approve the "potential impact" assessment of any infrastructure change before it is enacted. This step will deepen our understanding of the blast radius of each change and enable faster, more effective responses to unforeseen situations in the future.

Once again, we deeply regret the inconvenience this incident has caused you and your customers. As we continue to learn from this event, we are committed to enhancing our communication and system processes to prevent similar situations in the future.

We appreciate your understanding, patience, and ongoing support. We value your trust and will strive tirelessly to uphold and exceed the high-quality service you expect from Nylas. Thank you for your partnership.

Posted Jun 12, 2023 - 16:04 PDT

Resolved
We are seeing our service restored to normal levels and maintaining performance. We will be marking this incident as resolved. If you are still experiencing any issues, please reach out to support@nylas.com.

We are aiming to publish the postmortem for this incident before the end of the day on Friday, June 9th, 2023, or on Monday, June 12th, 2023 at the latest.
Posted Jun 06, 2023 - 16:04 PDT
Monitoring
We are seeing our API traffic and success rates stabilizing and holding consistently, with latency reduced to normal levels. We will continue to monitor.
Posted Jun 06, 2023 - 15:26 PDT
Update
We are seeing API traffic and success rates back at normal levels, but you might still see elevated latencies. We are looking into it.
Posted Jun 06, 2023 - 14:41 PDT
Update
API traffic is picking back up. We are monitoring things closely.
Posted Jun 06, 2023 - 14:08 PDT
Update
We have rebuilt a key component of our infrastructure to try to mitigate the current issue. We have slowly routed traffic to it and we see that it’s helping. We will keep you posted as we ramp up the traffic.
Posted Jun 06, 2023 - 13:55 PDT
Update
We are continuing to investigate this issue.
Posted Jun 06, 2023 - 13:11 PDT
Update
The team is still investigating and working on bringing the API service back ASAP.
Posted Jun 06, 2023 - 12:27 PDT
Update
We are continuing to work to bring the API service back up. Thank you for hanging in there with us.
Posted Jun 06, 2023 - 11:38 PDT
Update
We are continuing to investigate this issue.
Posted Jun 06, 2023 - 11:04 PDT
Investigating
The API service is unreachable again. The team is looking into it.
Posted Jun 06, 2023 - 10:17 PDT
Update
We are seeing the API traffic going back to normal levels. We will monitor for the next 15 minutes to ensure it remains stable.
Posted Jun 06, 2023 - 09:47 PDT
Monitoring
We are seeing API traffic picking back up. We are monitoring.
Posted Jun 06, 2023 - 09:35 PDT
Update
The team is still working as fast as possible to get the API service back. Thank you for your continued patience.
Posted Jun 06, 2023 - 09:33 PDT
Update
We are continuing to investigate this issue.
Posted Jun 06, 2023 - 08:46 PDT
Update
Because of a failure of one of our clusters, the Nylas API is currently unreachable. The team is investigating and working to get the API service back ASAP.
Posted Jun 06, 2023 - 08:11 PDT
Investigating
We are currently investigating this issue.
Posted Jun 06, 2023 - 07:57 PDT
This incident affected: Nylas Application Services (API, Dashboard, Scheduler).