Monitor the speed of the service from the user’s perspective
The most efficient way to provide visibility into infrastructure and application-level processes
Aggregating all sources of a metric
Easy to see API-level failures and latency spikes
Developers now instrument their code
Visibility into problems that impact users, such as a 500 error or site slowness
Slack is a platform for team communication, with a mission to keep it all in one place, instantly searchable and available wherever you go. An estimated 500,000 people per day were using Slack when we last chatted, and the user base was doubling every three months. With fewer than 150 employees, Slack must work as efficiently as possible in order to keep pace with massive growth.
The five-member Ops team at Slack is fundamentally in charge of managing security and availability of the site. To get the job done, the team’s mandate is to build the best tools and automation to provide visibility at the infrastructure and application levels. The realities of a fast-growing service are a constant challenge: the team adds capacity on an almost weekly basis in order to scale effectively. Richard Crowley, director of operations at Slack, decided to implement Librato after previous monitoring solutions failed to meet the challenge.
Slack runs a traditional LAMP stack (PHP, Apache and MySQL running on Linux) on several hundred servers. Java powers the real-time services, and Chef orchestrates AWS through its APIs. All of Slack’s systems emit baseline metrics to Ganglia.
Crowley has always been a big proponent of open-source software like Ganglia, but he told us that he “has run enough big instances to know the pain and time sink that can come with it.” After finding an impedance mismatch between Ganglia and Slack’s application metrics, Crowley and his team decided to try Librato.
“Implementing Librato was very liberating,” Crowley told us. He used StatsD from PHP applications to send business- and app-health metrics into Librato. “Ganglia worked naturally for host-centric metrics, but Librato was far better at expressing app metrics and cluster metrics,” he said.
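For readers unfamiliar with StatsD, the sketch below shows roughly what emitting such metrics involves. It is written in Python rather than PHP purely for illustration, and it is not Slack’s code: the metric names, the helper functions `format_statsd` and `send_statsd`, and the local agent address are all assumptions. It relies only on the widely used StatsD line format (`name:value|type`, with an optional `|@rate` suffix) sent as a UDP datagram, conventionally to port 8125.

```python
import socket


def format_statsd(name, value, metric_type, sample_rate=None):
    """Build one StatsD line, e.g. 'api.requests:1|c'.

    metric_type is 'c' (counter), 'ms' (timer) or 'g' (gauge).
    """
    line = f"{name}:{value}|{metric_type}"
    # A sample rate below 1 tells the StatsD daemon to scale the count up.
    if sample_rate is not None and sample_rate < 1:
        line += f"|@{sample_rate}"
    return line


def send_statsd(line, host="127.0.0.1", port=8125):
    """Fire-and-forget one metric line over UDP (host and port are assumed)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(line.encode("ascii"), (host, port))
    finally:
        sock.close()


# Hypothetical metric names: one business metric, one app-health metric.
signup_line = format_statsd("signups.completed", 1, "c")
latency_line = format_statsd("api.chat_post.latency", 320, "ms")
```

Because UDP is connectionless, a call such as `send_statsd(signup_line)` does not block the request that emits it, which is one reason this pattern is popular for instrumenting application code.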
“Librato gave us an easy way to measure when an event happened and why it took as long as it did,” said Crowley. “The fact that Librato can aggregate sources is a first-class feature. It’s the reason that we needed a tool beyond Ganglia to begin with, and it remains the most important thing. Librato doesn’t treat two different metrics from the same host as the same thing, and that’s huge.”
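To make the idea of aggregating sources concrete, here is a minimal sketch of the concept, assuming each sample is a `(metric, source, value)` tuple. It is not how Librato implements the feature; the function name `aggregate_sources`, the host names, and the metric names are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def aggregate_sources(samples, how="sum"):
    """Collapse per-source samples into one value per metric.

    samples: iterable of (metric, source, value) tuples.
    how: 'sum', 'avg' or 'max'.
    """
    by_metric = defaultdict(list)
    for metric, _source, value in samples:
        # Group by metric name only: two different metrics from the same
        # host stay separate, while one metric from many hosts is combined.
        by_metric[metric].append(value)
    reducer = {"sum": sum, "avg": mean, "max": max}[how]
    return {metric: reducer(values) for metric, values in by_metric.items()}


# Hypothetical samples from two web hosts.
samples = [
    ("api.errors", "web1", 2),
    ("api.errors", "web2", 3),
    ("api.latency_ms", "web1", 100),
    ("api.latency_ms", "web2", 300),
]
totals = aggregate_sources(samples, how="sum")
averages = aggregate_sources(samples, how="avg")
```

A sum suits counters such as error counts across a cluster, while an average or maximum is usually more meaningful for latency.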
Slack has a service that uses Go to emit internal metrics about that system’s health. They run equivalent Librato metrics from the app side to time the full request and response cycle. “Having those graphs side-by-side lets us clearly see and compare the effects of network latency on one of our new services, and that’s very cool,” said Crowley.
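The comparison Crowley describes can be sketched as follows, assuming the app side wraps each call in a timer and the service reports its own processing time. The names `timed` and `network_overhead_ms`, and the metric name in the usage note, are hypothetical, and the example is in Python for illustration rather than the languages Slack uses.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(record, name, clock=time.perf_counter):
    """Time the enclosed block and pass (name, elapsed_ms) to `record`."""
    start = clock()
    try:
        yield
    finally:
        # Runs even if the wrapped call raises, so failures are timed too.
        record(name, (clock() - start) * 1000.0)


def network_overhead_ms(client_ms, service_ms):
    """Rough estimate: time seen by the caller minus time spent in the service."""
    return max(client_ms - service_ms, 0.0)
```

On the app side, a call would be wrapped as `with timed(record, "newservice.request"): ...`, where `record` forwards the value to the metrics pipeline. Graphing that timer next to the service’s own internal timer makes the gap between them, which is largely network latency, visible at a glance.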
“I love the speed at which you can throw together something enlightening.”
Librato solved more than one headache for the Ops team. It increased visibility into the problems that actually impact users, such as the underlying cause of a 500 response or the cause of perceived or actual slowness. This has been immensely helpful in understanding the performance of Slack’s site.
“I love the speed at which you can throw together something enlightening,” said Crowley. “Composing dashboards and metrics is a breeze, and it’s super easy for us to see API-level failures and latency spikes. The toolkits, particularly the vertical bar that follows along everywhere, are an excellent touch.”
“Librato let us gain insights more quickly than if we had set something up on our own, like Graphite. It is fantastic to be able to deal with all of our metrics from all of our different sources in Librato,” Crowley told us.