Golden Signals
Hello and welcome to Small Batches with me, Adam Hawkins. In each episode, I share a small batch of the theory and practices behind software delivery excellence.
Topics include DevOps, lean, continuous delivery, and conversations with industry leaders. Now, let’s begin today’s episode.
I’ve said for a while that Test Driven Development is skill zero for professional software engineers. It’s skill zero because it unlocks everything else. This isn’t an episode about TDD though. Undoubtedly that will come, so let’s move on to skill one for now.
Skill one is continuous delivery. Continuous delivery will put your changes into production as quickly as possible so you can learn and iterate forward. The learning requires skill two: production operations.
The discipline of production operations revolves around understanding the current condition of the system and comparing that against expected targets.
I must pull in a bit of Deming’s System of Profound Knowledge now because it’s relevant to the discipline.
The first point is Theory of Knowledge: how do we know what we know? The second point is understanding variation: what’s the range of acceptable outcomes? The third point is: what’s the aim of this system?
So where do we start knowing the current condition? It starts with the golden signals. If you can only measure four metrics of the system, then focus on these four. They’ll point you in the right direction.
Remember this phrase: LETS. It stands for Latency, Errors, Traffic, and Saturation.
Latency is how long it takes for the system to service a request, ideally tracked in a statistical distribution.
Errors are the problems, ideally tracked as sums over time. They indicate the system failed to produce the intended outcome. Think of HTTP 500s.
Traffic is a measure of the flow through your system. This may be the total requests in a time interval or something like requests per second.
Saturation is how full something is, ideally tracked as a percentage. Think of a connection pool, disk usage, or queue.
Odds are that any telemetry tool worth its salt will provide three of the four out of the box. The best ones will provide you all four.
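To make the four definitions concrete, here is a minimal sketch that derives all four signals from one window of request records. Everything in it is an assumption for illustration: the `Request` record, the `golden_signals` function, the nearest-rank percentile method, and the use of a connection pool as the saturated resource. A real telemetry tool will compute these for you.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # server-side time to service the request
    status: int         # HTTP status code of the response


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def golden_signals(requests, window_seconds, pool_in_use, pool_size):
    """Summarize one time window as Latency, Errors, Traffic, and Saturation."""
    durations = [r.duration_ms for r in requests]
    return {
        # Latency as a distribution, not a single average
        "latency_ms": {p: percentile(durations, p) for p in (50, 90, 95)},
        # Errors as a sum over the window (HTTP 5xx)
        "errors": sum(1 for r in requests if r.status >= 500),
        # Traffic as requests per second
        "traffic_rps": len(requests) / window_seconds,
        # Saturation as a percentage of a finite resource (hypothetical pool)
        "saturation_pct": 100 * pool_in_use / pool_size,
    }
```

Calling `golden_signals(requests, window_seconds=60, pool_in_use=5, pool_size=25)` returns one dictionary per window, which is enough to chart each signal over time.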
Operations can correlate these signals to answer the question: how do we know what we know? Here’s what a conversation may sound like in the operations review for a web service.
We know the system is operating correctly because the traffic levels are within established levels as measured by the total HTTP requests. We know the system is operating correctly because the number of errors, as measured by HTTP 5xx responses, is statistically small. We know the system is operating correctly because the latency, as measured by the server side response, is under 100 milliseconds. We know the system is operating correctly because saturation, as measured by the server’s incoming connection queue, averages twenty percent.
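One way to keep a review honest is to make every assertion carry its measurement source and its acceptable range. The sketch below is an assumption, not a prescribed tool: the `Check` class, the `review` function, and every bound are hypothetical. Replace the bounds with your own established targets.

```python
from dataclasses import dataclass


@dataclass
class Check:
    signal: str       # which golden signal this is
    measured_by: str  # where the number comes from
    value: float      # currently observed value
    low: float        # lower bound of the acceptable range (hypothetical)
    high: float       # upper bound of the acceptable range (hypothetical)

    def ok(self) -> bool:
        return self.low <= self.value <= self.high

    def statement(self) -> str:
        verdict = "within" if self.ok() else "OUTSIDE"
        return (f"{self.signal} is {self.value:g}, as measured by "
                f"{self.measured_by}: {verdict} {self.low:g}-{self.high:g}")


def review(checks):
    """Return (healthy, statements): healthy only if every check is in range."""
    return all(c.ok() for c in checks), [c.statement() for c in checks]
```

A check that cannot name its `measured_by` source cannot be constructed, which is the point: no measurement, no assertion.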
I repeated the phrase “as measured by” to emphasize the focus on empirical facts. Operations is numbers driven. If you can’t include “as measured by” into what you’re seeing, then you don’t understand it enough or don’t have enough certainty to make any assertions about the current condition.
Everything’s great when there are no problems. Something will go wrong. Time for a story.
It’s 2:58 AM. Your phone shocks you awake with a page from your PM. It reads “the website is down”. No time for grogginess. It’s go time.
You put your empirical thinker hat on: “down” as measured by what? Well, you remember the handy phrase: LETS for the golden signals: Latency, Errors, Traffic, Saturation.
Traffic is usually a good starting point for these “down” scenarios.
You follow your mental block diagram of the whole system. First, checking traffic as measured by HTTP requests and responses at the load balancer. Everything looks good. Traffic is flowing through the load balancer. Next stop: traffic as measured at the app servers. HTTP request and response counts look good. No problem with traffic. Next stop: errors as measured at the load balancer and app servers.
Now it starts to become a bit of a head scratcher. There are no red bars on the charts of 5xx’s from the load balancers or application servers. So there’s no change in error counts. What’s the next golden signal? How about latency?
You pull up a chart of the p50, p90, and p95 latency on HTTP responses across all the app servers. There is a slight uptick in the p95 latency.
The next question is: which endpoints have gotten slower? Time to drill down into the metrics, so you split the chart by API and user-facing responses. No delta on the user-facing latency charts. Then you spot something: p95 latency for API responses has really gone up, though there are no errors. So where is the failure?
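Why does a big API slowdown show up as only a slight uptick overall? When one group of endpoints is a small share of traffic, the aggregate percentile is dominated by everything else. Here is a minimal sketch of the drill-down, assuming latency samples arrive as hypothetical `(group, duration_ms)` pairs and using a nearest-rank p95.

```python
import math
from collections import defaultdict


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list of samples."""
    ordered = sorted(values)
    return ordered[max(1, math.ceil(95 * len(ordered) / 100)) - 1]


def p95_by_group(samples):
    """Split (group, duration_ms) samples by group and compute p95 for each."""
    groups = defaultdict(list)
    for group, duration_ms in samples:
        groups[group].append(duration_ms)
    return {group: p95(durations) for group, durations in groups.items()}


# Made-up data: user-facing pages dominate traffic, a few API calls are very slow.
samples = [("user-facing", 40.0)] * 90 + [("api", 60.0)] * 8 + [("api", 900.0)] * 2
overall = p95([duration_ms for _, duration_ms in samples])
by_group = p95_by_group(samples)
```

With this made-up data the overall p95 is 60 ms, while the API p95 on its own is 900 ms. The aggregate chart barely moves, and the split chart makes the problem obvious.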
Time to go up a level of abstraction. You think to yourself: what uses this API? Oh right. The fancy single-page JavaScript application.
So you pull up a dashboard of the golden signals coming from the Real User Monitoring application. You discover a red bar chart of errors. The red bars are consistent. They also correlate with the increase in API latency. Now a theory forms in your head. Something is probably timing out somewhere, then something something JavaScript error. Next stop: error logs.
And finally, there it is. The JavaScript app uses the API to fetch all the data to render the initial home page. The increase in latency causes a timeout in the client, which creates an uncaught error, which leads to a blank screen. Not really “down” but certainly broken.
All right, that’s all for this batch. Head over to https://SmallBatches.fm/88 for links to recommended self-study on production operations and ways to support the show.
I hope to have you back again for the next episode. So until then, happy shipping!