The Importance of Observability

Site reliability engineers, in the most general sense, are charged with a clear mission: efficiently keep the sites reliable. Reliability can be broken down into two main facets: availability and performance. This is about where it stops being straightforward and everything becomes nuanced, because you have to start defining what availability and performance mean for your systems (which is generally driven by the mission of your organization and how your systems fit into that). Even more complexity comes into play when you consider all the activities an SRE team engages in to achieve these things: configuration management, capacity planning, restores, fault tolerance, and security, to name a few.

How you define availability and performance in your organization is a topic worthy of its own set of posts, and the details of all the activities an SRE team participates in could fill a library. An SRE team needs to start somewhere and have a strategy to tackle all of this. There is no one answer, but achieving a high level of observability needs to be a key strategic component for any SRE team.

Observability is the Foundation

Observability is the degree to which, and the ease with which, your team can gain insight into the behavior of your systems. It is worth noting that the scope of your systems is likely quite broad; it includes the obvious things like your applications and hosts, but also includes things like processes, workflows, and team dynamics. Having insight into your systems means:

  • Questions operators have about their systems can be quantifiably answered with minimal effort
  • Operators have rich mental models of how their systems function

When you have to decide something you can either guess or use “the science.” Without a set of systems for observability in place you will end up guessing (not the educated kind) or be terribly inefficient. A good understanding of how systems work is what allows operators to be effective and avoid disastrous mistakes: observability can drive that.

Decision Making and Incident Preparedness

Observability is key to the strategy for an SRE team because it informs and impacts nearly every other activity that team engages in. I’ve written before about the OODA loop, which stands for Observe, Orient, Decide, Act (you can think of Orient as “Analyze”). It is a military strategy that suggests you can be successful when you can iterate through this loop rapidly and well. It is also a useful tool for thinking about site reliability operations.

OODA is carried out at both the macro and micro levels (planning and incidents) by SRE teams. As an example, we can imagine what making system design decisions as a team is like without good observability (and since we have likely all been there, you can probably just remember). The observation phase will be based on people’s memory and is frequently skipped. Orienting or analyzing that information as a group will lead to conflicts because people don’t agree on what the facts are. This can result in arguments about each person’s recollection of the facts instead of the issue at hand. Decisions end up being prolonged and half-hearted because of the uncertainty of their basis. Lastly, action will be hindered: a strong consensus hasn’t been reached because people don’t trust the baseless decision. Even worse, people question if this is even the system they should be working on at all.

Many have also probably been through outages where observability is lacking. Lots of time is lost trying to figure out what is even going on. Orienting is difficult because operators lack the internal model of the system that observability provides over time. As a result, decisions and actions are chaotic. Or more simply put, it’s amateur hour.

In contrast, the picture is entirely different with a solid foundation in observability because everything becomes data informed. This is different from “data driven” because you can trust people’s intuition. Due to good observability they have developed keen instincts about systems over time. When it comes to system design decisions you are in a much better position because chances are you are designing the right thing in the first place. Team members will bring their observations to the discussion. If there are questions about the facts, you can just look them up instead of arguing. Decisions will be made faster and with more confidence because they are based on evidence. Lastly, action will have more consensus behind it: even if people didn’t agree, they at least know the choice was based on something.

You never know what the next incident will be, but if you have good observability then your operators will have a deeper understanding of the system and will be far more prepared for the unknown.

Other Benefits

Observability positions a team to do more capacity planning by enabling them to see constrained resources and forecast growth. This can help reduce the vicious cycle of fire fighting that many SRE teams are locked into.

Since observability leads to insight, team members learn more about their systems, which is a common source of fulfillment for engineering types.

Convinced? 5 Steps to Achieving Good Observability:

In order to achieve good observability, an SRE team (often in conjunction with the rest of the organization) needs to take the following steps.

  1. Instrument your systems by publishing metrics and events
  2. Gather those metrics and events in a queryable data store(s)
  3. Make that data readily accessible
  4. Highlight metrics that are, or are trending towards, abnormal or out of bounds behavior
  5. Establish the resources to drill down into abnormal or out of bounds behavior

Each of these steps largely depends on the previous step to be successful.

1. Instrument your Systems

Brainstorm what key and useful metrics exist for your system. Make those metrics easily accessible (e.g. through standard APIs like JSON via REST, or by providing a destination to push to) and document what they are and what their implications are. This largely falls on the developers of systems, and DevOps culture can go a long way toward encouraging application developers to empower the operations side of things by doing this. At the highest level you can break metrics and events into two categories:

  1. Objective Oriented: These metrics reflect the mission of your organization. For example they include client facing measurements like response time, availability, error codes, items sold, number of users, number of active users and rate of content created.
  2. Diagnostic Oriented: These measure aspects of the system that allow you to achieve your objectives. These include system measures such as OS, network, hardware, middleware, cluster, and application metrics. These also include response time and availability metrics, but they measure components and parts of the pipeline that contribute to your objectives.
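
As a rough illustration of publishing metrics in an easily consumed form, here is a minimal sketch of an in-process counter registry that renders its contents as JSON. The class name, the metric names, and the two category labels are made up for this example; a real service would more likely use an existing metrics library and serve this document over HTTP.

```python
import json
import threading
import time


class MetricsRegistry:
    """Tiny in-process registry of monotonically increasing counters (hypothetical)."""

    CATEGORIES = ("objective", "diagnostic")

    def __init__(self):
        self._lock = threading.Lock()
        self._counters = {}

    def incr(self, category, name, amount=1):
        """Add `amount` to the named counter in the given category."""
        if category not in self.CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        with self._lock:
            key = (category, name)
            self._counters[key] = self._counters.get(key, 0) + amount

    def snapshot(self, now=None):
        """Return a JSON-serializable view of all counters plus a timestamp."""
        doc = {"timestamp": time.time() if now is None else now}
        for category in self.CATEGORIES:
            doc[category] = {}
        with self._lock:
            for (category, name), value in self._counters.items():
                doc[category][name] = value
        return doc

    def to_json(self, now=None):
        """Serialize the snapshot; this is what an HTTP endpoint could return."""
        return json.dumps(self.snapshot(now), sort_keys=True)


registry = MetricsRegistry()
registry.incr("objective", "http.requests")
registry.incr("objective", "http.responses.5xx")
registry.incr("diagnostic", "db.pool.waits", 3)
print(registry.to_json())
```

Note that the registry only publishes raw counters and a timestamp; turning them into rates or graphs is left to whatever collects the data.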

Good metrics also tend to have these properties:

  • High Resolution: “High” is qualitative, but a higher frequency of data collection means you will have more insight into the shape of your data (e.g. whether it is bursty)
  • Lossless: This means that there isn’t missing information from your metric. This can often be achieved by publishing counters instead of rates and letting the display side of things calculate a rate from that information. Not pre-aggregating things into averages can also be useful (or, if you are going to pre-aggregate, also aggregate the data into multiple percentiles)
  • Specific: More specific metrics can often be more useful for understanding a system and drilling down into a problem. For example, with CPU utilization it is better to report breakdowns such as %user and %system CPU time and let something later in the pipeline aggregate them.
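
To make the "publish counters, calculate rates later" idea concrete, here is a small sketch of what the display side might do with raw counter samples. The function name and the reset handling are assumptions for illustration: a drop in the counter is treated as a restart that began counting again from zero.

```python
def rates_from_counter(samples):
    """Turn (timestamp_seconds, counter_value) samples into per-second rates.

    Returns one (timestamp, rate) pair per adjacent pair of samples.
    Assumption for this sketch: if the counter goes down, the process
    restarted and the new value has been counted from zero.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0:
            continue  # skip duplicate or out-of-order samples
        delta = v1 - v0 if v1 >= v0 else v1
        rates.append((t1, delta / elapsed))
    return rates


samples = [(0, 100), (10, 150), (20, 400), (30, 20)]
print(rates_from_counter(samples))
# [(10, 5.0), (20, 25.0), (30, 2.0)]
```

Because the raw counter is kept, the same samples can later be re-read at a coarser or finer interval, which a pre-computed rate would not allow.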

It is also worth making a point to instrument your own internal “meta” systems such as bug tracking and documentation.

2. Gather those metrics in a queryable data store(s)

This is a key intermediate step to making this data accessible. Data generally needs to be stored over time in order to give it context (although the time of each datapoint isn’t always important for things like histograms when it is processed later). Having this step enables things like:

  • Building dashboards
  • Enabling capacity planning
  • Allowing operators to explore the data and learn
  • Allowing people to invent cool stuff you didn’t anticipate
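
As a toy example of what "queryable" buys you, the sketch below stores a few datapoints in an in-memory SQLite database and asks a question nobody planned for in advance. The table layout, metric name, and host names are invented for this example, and SQLite simply stands in for whatever store you actually use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE datapoints (metric TEXT, host TEXT, ts INTEGER, value REAL)"
)
conn.execute("CREATE INDEX idx_metric_ts ON datapoints (metric, ts)")

rows = [
    ("http.response_ms", "web1", 0, 120.0),
    ("http.response_ms", "web2", 0, 80.0),
    ("http.response_ms", "web1", 60, 300.0),
    ("http.response_ms", "web2", 60, 100.0),
]
conn.executemany("INSERT INTO datapoints VALUES (?, ?, ?, ?)", rows)


def max_per_minute(conn, metric):
    """Worst value per one-minute bucket across all hosts."""
    query = """
        SELECT (ts / 60) * 60 AS bucket, MAX(value)
        FROM datapoints
        WHERE metric = ?
        GROUP BY bucket
        ORDER BY bucket
    """
    return conn.execute(query, (metric,)).fetchall()


print(max_per_minute(conn, "http.response_ms"))
```

The same table can feed a dashboard, a capacity report, or an ad hoc question during an incident without changing how the data is collected.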

As a rule of thumb, fewer data stores are better because that makes it easier to work with the data (although specialized databases for things like time series might be worth the tradeoff because of features and scalability). For time series data in particular, a couple of useful qualities are:

  • Scalability: This enables one to collect a lot of metrics at high resolution and with long retention
  • Aggregation: This encourages a shift from host/process oriented views to cluster and service oriented views
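
Here is a minimal sketch of that shift from host views to service views: per-host latency samples are collapsed into per-service percentiles. The function names and the sample data are hypothetical, and the percentile uses the simple nearest-rank definition rather than whatever interpolation your time series store may apply.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def service_view(samples, percentiles=(50, 95, 99)):
    """Collapse (service, host, latency_ms) samples into per-service percentiles."""
    by_service = defaultdict(list)
    for service, _host, latency_ms in samples:
        by_service[service].append(latency_ms)
    return {
        service: {f"p{p}": percentile(values, p) for p in percentiles}
        for service, values in by_service.items()
    }


samples = [
    ("web", "web1", 10),
    ("web", "web1", 20),
    ("web", "web1", 30),
    ("web", "web2", 40),
    ("web", "web2", 500),
]
print(service_view(samples))
```

For this data the mean latency is 120 ms, which describes none of the actual requests; the p50 of 30 ms and p95 of 500 ms tell a more useful story about the service as a whole.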

3. Make that data Readily Accessible

If there is a lot of friction to view the data then people won’t have time or energy to do it. This is why it is important to have good dashboards and APIs to allow easy access for your operators. Good dashboards tend to have the following attributes:

  • A fast, responsive UI that lets operators drill down and explore easily
  • The ability for operators to create their own dashboards and graphs
  • Highlighting of problems

4. Highlight metrics that are, or are trending towards, abnormal or out of bounds behavior

Ideally a team ends up collecting a lot of data. This means humans can’t process it all and therefore your systems need to ask for operator attention. Essentially this is alerting. However it is important to understand that alerting doesn’t always mean “emailing”. It can also mean things like publishing something to a dashboard or logging it.

Traditionally alerting has been done on current values, but anomaly detection and forecasting are becoming a reality thanks to some work done at Etsy.
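
To show the difference between alerting on a current value and alerting on a trend, here is a deliberately simple sketch that fits a straight line to recent samples and estimates when it will cross a limit. This is plain least-squares extrapolation, not the anomaly detection work mentioned above, and the function names, the sample data, and the upper-bound assumption are all mine.

```python
def linear_fit(points):
    """Ordinary least-squares line through (t, value) points; returns (slope, intercept)."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in points)
    if var_t == 0:
        raise ValueError("all timestamps are identical")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in points)
    slope = cov / var_t
    return slope, mean_v - slope * mean_t


def seconds_until(points, limit):
    """Estimated seconds after the last sample until the trend reaches `limit`.

    Assumes `limit` is an upper bound. Returns None when the trend is flat
    or falling, and 0.0 when the fitted line is already at or past the limit.
    """
    slope, intercept = linear_fit(points)
    current = slope * points[-1][0] + intercept
    if current >= limit:
        return 0.0
    if slope <= 0:
        return None
    return (limit - current) / slope


# Hypothetical disk usage in percent, sampled hourly.
disk_used = [(0, 70.0), (3600, 72.0), (7200, 74.0)]
eta = seconds_until(disk_used, limit=90.0)
if eta is not None and eta < 24 * 3600:
    print(f"disk projected to reach 90% in about {eta / 3600:.1f} hours")
```

A threshold alert at 90% would stay silent on this data, while the trend-based check asks for attention hours earlier. Real workloads are rarely this linear, so treat the estimate as a prompt to look rather than a prediction.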

Alert noise and desensitization are a plague in our field. My belief is that future systems will allow for more carefully crafted and adjustable rules to reduce the noise. Keeping this under control is also largely about discipline and remembering that every alert requires action.

5. Establish the resources to drill down into abnormal or out of bounds behavior

The above steps are a gateway to observability, because collecting metrics is inherently resource constrained: you can only collect so much information without noticeably impacting what you are trying to observe. Eventually you are going to need to drill down into problems or explore further why metrics are behaving in a certain way. There are three common activities for this:

  1. Log analysis: Digging into your system logs for information. System logs can also be a powerful source of metrics (especially things like web logs) if you parse them and feed the results into your monitoring systems
  2. Profiling: This is the activity of sampling programs to figure out what they are doing, generally at a much higher resolution than collecting metrics (computer time, i.e. sub-1ms, instead of human time)
  3. Tracing: Collecting every single thing a system is doing (e.g. with strace or DTrace)
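
As an example of turning logs into metrics, the sketch below counts responses by status class from web server access logs. It assumes lines in something close to Common Log Format; the regular expression, the function name, and the sample lines are illustrative and will need adjusting for your own log format.

```python
import re
from collections import Counter

# Matches the quoted request and the three-digit status that follows it.
LOG_PATTERN = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) '
)


def status_class_counts(lines):
    """Count responses per status class (2xx, 4xx, ...); bad lines go under 'unparsed'."""
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.search(line)
        if match is None:
            counts["unparsed"] += 1
            continue
        counts[match.group("status")[0] + "xx"] += 1
    return counts


log_lines = [
    '10.0.0.1 - - [26/Nov/2013:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 5120',
    '10.0.0.2 - - [26/Nov/2013:10:00:01 +0000] "POST /login HTTP/1.1" 500 312',
    '10.0.0.3 - - [26/Nov/2013:10:00:02 +0000] "GET /missing HTTP/1.1" 404 128',
    "not a log line",
]
print(status_class_counts(log_lines))
```

Counting unparsed lines instead of dropping them silently is a small design choice that keeps the metric honest: a sudden jump in "unparsed" usually means the log format changed.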

Although my path to observability puts an emphasis on collecting metrics and events, this step is also crucial to observability.

Use the science, Luke

If observability is one of the key components of the strategy for your team, then it sets the tone and foundation for everything else. It can create a culture of constant learning, as it provides a medium for learning about your systems and a source of information for productive analytical arguments. Whatever your strategy is, you need to consider what role observability plays in your team. And remember: Use The Science.

Posted by Kyle Brandt (@kylembrandt) on November 26th, 2013
Filed under Monitoring and Graphing, Performance, sysadmins




