On June 6th, from approximately 8:00 AM until 3:30 PM Pacific time, an attempt to improve our security and encryption measures inadvertently caused a significant disruption to our API service. The result was a sharp increase in API error rates and steadily growing response times. At the height of the incident, there were brief periods during which as many as 50% of calls were timing out on the client side. Although we actively investigated potential causes and deployed mitigations throughout, producing periods of improved performance, we recognize that the impact on our customers remained significant and unacceptably disruptive for the full duration of the incident.
We deeply regret the negative repercussions this incident had on your operations. Our mistake caused substantial inconvenience to you, our valued Nylas users, and we sincerely apologize. In the aftermath of this event, we want to reassure you by being completely transparent. In this postmortem, we detail the lessons learned from the incident and outline the changes we have implemented to prevent a recurrence. We are committed to learning from our mistakes and to continually improving our service to you. We want to assure you that we regard any outage as a severe issue, and we will always prioritize minimizing disruptions to your operations.
The underlying cause was traced to a change to the connection security settings on one of our Redis caching instances, which we mistakenly believed was used by only a single API endpoint. In fact, it was also serving as a secondary authentication system supporting both the /calendars and /events endpoints, and this additional use had never been reflected in our system documentation. The attempt to improve backend security measures therefore led to an unforeseen degradation of our API service, resulting in an elevated rate of HTTP errors and increased API latency for close to 8 hours.
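To make the failure mode concrete, here is a minimal sketch in Python using the redis-py client. The host name is hypothetical, and the assumption that the change enforced encrypted (TLS) connections is illustrative only; the point is that a server-side security change breaks any client whose connection settings are not updated along with it.

```python
import redis  # redis-py

# Hypothetical endpoint for illustration only; not a real Nylas host.
CACHE_HOST = "auth-cache.internal.example.com"

# A client still configured for a plaintext connection. Once the server
# begins requiring encrypted connections, this client can no longer connect
# (typically surfacing as a ConnectionError or a timeout).
legacy_client = redis.Redis(host=CACHE_HOST, port=6379, socket_timeout=1.0)

# A client updated alongside the server-side change, connecting over TLS.
updated_client = redis.Redis(host=CACHE_HOST, port=6379, ssl=True, socket_timeout=1.0)


def check(client: redis.Redis, label: str) -> None:
    """Probe the cache and report whether the connection succeeds."""
    try:
        client.ping()
        print(f"{label}: connected")
    except redis.exceptions.RedisError as exc:
        print(f"{label}: failed ({exc})")


check(legacy_client, "legacy (plaintext) client")
check(updated_client, "updated (TLS) client")
```

In our case, the services behind the /calendars and /events endpoints were effectively in the "legacy" position: because their dependence on this instance was undocumented, they were never updated when its connection security settings changed.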
The initial signs of trouble were deceptive. Our API slowed because DNS queries were taking longer than usual, and given previous incidents tied to CoreDNS on AWS' EKS clusters, we initially suspected this was the root cause. It turned out to be a red herring that steered us away from the real issue. The broken connection to the Redis instance became evident only while we were rebuilding one of our clusters from scratch in an effort to resolve the problem, when we were alerted that the cluster could not connect to the Redis cache. Upon discovering this, we promptly updated the Redis instance settings, which brought immediate improvement: API success rates recovered and latencies returned to normal levels.
Recognizing our communication shortfalls during this incident, we have taken decisive steps to improve. We have established a full-time "Incident Commander" role: an individual responsible for refining our incident communication processes and delivering consistent, timely, and comprehensive updates during disruptions. Their primary duty during an incident is to gather clear, up-to-date information from all parties and share it on our status page at least every thirty minutes. This role was identified as crucial even before this incident, and it has now been filled, with the new hire starting today. You can therefore expect significantly better communication during any future incident, reinforcing our commitment to transparency and to minimizing inconvenience for you.
Reflecting on this incident has also underscored the need to refine our maintenance and deployment protocols. While our existing process is regimented and semi-automated, and includes Deployment Calendars, Scheduled Maintenance windows, and dedicated cross-team communication, we have identified areas for improvement. One significant change we have already implemented is requiring all engineering leads to review and approve the "potential impact" of any infrastructure change before it is enacted. This step will deepen our understanding and management of each change's potential impact, enabling swift and effective responses to unforeseen situations in the future.
Once again, we deeply regret the inconvenience this incident has caused you and your customers. As we continue to learn from this event, we are committed to enhancing our communication and system processes to prevent similar situations in the future.
We appreciate your understanding, patience, and ongoing support. We value your trust and will work tirelessly to uphold and exceed the high standard of service you expect from Nylas. Thank you for your partnership.