
Slack status outage

January 4th 2021 was the first working day of the year for many around the globe, and for most of us at Slack too (except of course for our on-callers and our customer experience team, who never sleep). The day in APAC and the morning in EMEA went by quietly. During the Americas' morning we got paged by an external monitoring service: error rates were creeping up. As initial triage showed the errors getting worse, we started our incident process (see Ryan Katkov's article All Hands on Deck for more about how we manage incidents).

As if this was not already an inauspicious start to the New Year, while we were in the early stages of investigating, our dashboarding and alerting service became unavailable. We immediately paged in our monitoring team to try and get our dashboard and alerting service back up. To narrow down the list of possible causes, we quickly rolled back some changes that had been pushed out that day (it turned out they weren't the issue). We pulled in several more people from our infrastructure teams, because all debugging and investigation was now hampered by the lack of our usual dashboards and alerts. We still had various internal consoles and status pages available, some command-line tools, and our logging infrastructure. Our metrics backends were also still up, meaning we could query them directly - however, this is nowhere near as efficient as using our dashboards with their pre-built queries. While our infrastructure seemed to generally be up and running, we observed signs of widespread network degradation, which we escalated to AWS, our main cloud provider. At this point Slack itself was still up - at 6:57am PST, 99% of Slack messages were being sent successfully (but our success rate for message sending is usually over 99.999%, so this was not normal).

Slack has a traffic pattern of mini-peaks at the top of each hour and half hour, as reminders and other kinds of automation trigger and send messages (much of this is external - cron jobs from all over the world). We manage the scaling of our web tier and backends to accommodate these mini-peaks. However, the mini-peak at 7am PST - combined with the underlying network problems - led to saturation of our web tier. As load increased, so did the widespread packet loss. The increased packet loss led to much higher latency for calls from the web tier to its backends, which saturated system resources in our web tier. Slack became unavailable.
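To see why a latency increase alone can saturate a web tier, it helps to work a rough example. The following is a back-of-the-envelope sketch with made-up numbers (not Slack's actual traffic figures): by Little's law, the average number of in-flight requests equals the arrival rate multiplied by the average latency, so a fixed pool of worker threads fills up quickly once calls to backends slow down.

def concurrent_requests(requests_per_second, latency_seconds):
    # Little's law: average in-flight requests L = arrival rate * time in system.
    return requests_per_second * latency_seconds

# Hypothetical numbers for a single web host, purely for illustration.
normal = concurrent_requests(requests_per_second=1000, latency_seconds=0.05)   # ~50 in flight
degraded = concurrent_requests(requests_per_second=1000, latency_seconds=2.0)  # ~2000 in flight

print(f"normal: ~{normal:.0f} in flight, degraded: ~{degraded:.0f} in flight")
# With a worker pool sized for tens of concurrent requests, the degraded case
# exhausts it: new requests queue or time out even though traffic hasn't grown.

The same effect plays out fleet-wide, and the hourly mini-peak adds extra load at exactly the moment each request is holding resources for longer.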

Around this time two things happened independently. Firstly, some of our instances were marked unhealthy by our automation because they couldn't reach the backends that they depended on, and our systems attempted to replace these unhealthy instances with new instances. Secondly, our autoscaling system downscaled our web tier. Because we were working without our monitoring dashboards, several engineers were logged into production instances investigating problems at this point. Many of the incident responders on our call had their SSH sessions ended abruptly as the instances they were working on were deprovisioned. This made investigating the widespread production issues even more difficult.
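As an illustration of how that first failure mode can arise, here is a minimal sketch of a dependency-based health check - an endpoint that reports an instance unhealthy whenever it cannot reach a backend it depends on. The hostnames and probe are hypothetical and this is not Slack's actual automation; it only shows why a network-wide degradation can make large numbers of otherwise healthy instances fail their checks at the same time.

import socket

# Hypothetical backend dependencies; not Slack's real hosts.
BACKENDS = [("backend-a.internal", 443), ("backend-b.internal", 443)]

def backend_reachable(host, port, timeout=1.0):
    # A plain TCP connect stands in for whatever probe a real check would use.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def health_check():
    # Returns the status a load balancer or autoscaler health probe would see.
    if all(backend_reachable(host, port) for host, port in BACKENDS):
        return 200, "ok"
    return 503, "unhealthy: backend unreachable"

When a check like this fails across the fleet because of a shared network problem, replacement automation churns through instances without fixing the underlying cause - and, as happened here, terminates the very instances engineers are logged into.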