Elevated 504 responses from Exchange Server
Incident Report for Nylas
Postmortem

On April 12th, some of our customers reported that their end users were encountering 504 errors when using our "/send" API endpoint. Our investigation found that the root cause was a Microsoft update released on April 11th, as part of the monthly release commonly known as “Patch Tuesday”, that prevented third-party API users from accessing their user folders if they had at least one public folder. This issue affected a small percentage of our accounts, estimated at less than 2%.

Microsoft acknowledged the incident on April 13th and started rolling back the update. Because Microsoft rolls out patches to its online servers in stages, and because few of its customers use public folders, the impact was not immediately apparent. We identified the issue and developed a mitigation: an optimization in our middleware component responsible for Exchange Server connectivity that avoids listing folders in the scenarios that were contributing to the problem. We deployed this mitigation promptly rather than waiting for Microsoft to complete its rollback, which took almost nine days. The complete timeline is available on the Microsoft status page (Office 365 Admin account required) in the advisory "Some users may be unable to access, view, add, or select public folders and calendars in Outlook on the web."
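
This report does not describe how the mitigation was implemented. As a rough illustration only, the sketch below shows one way a send path could avoid a folder-listing call when it is not strictly needed. Every name here (`SendRequest`, `resolve_sent_folder`, `list_folders`, the `role` field) is hypothetical and is not taken from the Nylas codebase or API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class SendRequest:
    """Hypothetical send request; not an actual Nylas type."""
    account_id: str
    # Folder for the sent copy, if the caller already knows it.
    sent_folder_id: Optional[str] = None


def resolve_sent_folder(
    request: SendRequest,
    cached_folder_ids: Dict[str, str],
    list_folders: Callable[[str], List[dict]],
) -> Optional[str]:
    """Return the sent-folder id, listing folders only as a last resort.

    `list_folders` stands in for the expensive upstream call that was
    timing out for accounts with public folders (assumption).
    """
    # 1. The caller supplied the folder: no upstream call needed.
    if request.sent_folder_id is not None:
        return request.sent_folder_id

    # 2. A previous lookup is cached: no upstream call needed.
    cached = cached_folder_ids.get(request.account_id)
    if cached is not None:
        return cached

    # 3. Fall back to listing folders, then remember the answer.
    for folder in list_folders(request.account_id):
        if folder.get("role") == "sent":
            cached_folder_ids[request.account_id] = folder["id"]
            return folder["id"]
    return None
```

With this shape, the listing call is reached only when neither the request nor the cache can answer, which narrows the set of requests exposed to a slow or failing upstream listing.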

We apologize for any inconvenience this may have caused and assure you that we have taken all necessary steps to resolve the issue and prevent a similar incident from occurring in the future. We will continue monitoring the situation closely and take all necessary precautions to ensure our customers' services are not disrupted.

Posted May 01, 2023 - 13:58 PDT

Resolved
This incident has been resolved.
Posted Apr 13, 2023 - 17:40 PDT
Update
We are observing a significant decrease in 504 error rates on /send and will be marking this incident as resolved.

If you are still experiencing this issue, please report it to support@nylas.com
Posted Apr 13, 2023 - 17:40 PDT
Update
We have finished the deployment of the fix to our Sync component in our production environment. We are closely monitoring the results.
Posted Apr 13, 2023 - 17:09 PDT
Monitoring
We have finished the deployment of the fix to our API component in our production environment. We are closely monitoring the results.
We are now starting the deployment of the fix to the Sync component in our production environment.
Posted Apr 13, 2023 - 16:14 PDT
Update
The deployment is 75% done and we are already seeing an improvement in the error rate on /send. We will confirm when the deployment is completed.
Posted Apr 13, 2023 - 15:53 PDT
Update
The smoke tests on our staging environment were successful. We are now starting the deployment of the aforementioned fix to our production environment.
Posted Apr 13, 2023 - 14:36 PDT
Update
The deployment has made it to our staging environment. We are running our smoke tests there before deploying to production.
Posted Apr 13, 2023 - 14:25 PDT
Update
We want to share an advisory that Microsoft posted earlier today, which we believe is related to the current incident: https://admin.microsoft.com/adminportal/home#/healthoverview/:/alerts/EX540990 (Microsoft 365 Admin account required). We have seen error levels decreasing since 1PM EST, confirming that Microsoft is rolling back that change. Please note that we are still proceeding with the deployment on our side in an effort to speed up resolution for all users.
Posted Apr 13, 2023 - 14:05 PDT
Update
We are still preparing our deployment by running our end-to-end test suite. We'll start the deployment once that's done. Thank you for your continued patience.
Posted Apr 13, 2023 - 13:12 PDT
Update
The fix is implemented and tested but not deployed yet. We are preparing the deployment. We will keep you posted.
Posted Apr 13, 2023 - 11:43 PDT
Identified
We believe we have identified the issue and are implementing a fix.
Posted Apr 13, 2023 - 11:25 PDT
Investigating
We are following a lead on what the root cause might be, but we are still investigating.
Posted Apr 13, 2023 - 10:03 PDT
Update
Please ask your customers’ mail administrators to raise tickets with Microsoft to help identify the cause of the increase in Fanout errors. The Fanout error suggests that Microsoft has an issue with its storage. This is likely a transitory error on Microsoft's side, ongoing since April 11, 18:00 UTC. The error message is:

An internal server error occurred. The operation failed., Fanout timed out

This is impacting a small percentage of Office 365 accounts. Based on past experience with this error, we have produced the following KB article: https://support.nylas.com/hc/en-us/articles/4505725124253
Posted Apr 13, 2023 - 06:28 PDT
Identified
Please ask your customers’ mail administrators to raise tickets with Microsoft, to put pressure on Microsoft to identify the cause of the increase in Fanout errors. The Fanout error suggests that Microsoft has an issue with its storage. This is likely a transitory error on Microsoft's side, ongoing since April 11, 18:00 UTC. The error message is:

An internal server error occurred. The operation failed., Fanout timed out

This is impacting a small percentage of Office 365 accounts. Based on past experience with this error, we have produced the following KB article: https://support.nylas.com/hc/en-us/articles/4505725124253
Posted Apr 13, 2023 - 04:59 PDT
Update
We have confirmed that this is impacting a small percentage of our customers' users on EWS Exchange.

We're still investigating this issue.

Thank you for your continued patience.
Posted Apr 12, 2023 - 17:55 PDT
Investigating
We are aware of elevated 504 error responses when sending from the Exchange server and are currently investigating this issue.
Posted Apr 12, 2023 - 16:24 PDT
This incident affected: Nylas Application Services (API).