Understanding Production Operations
Hello and welcome to Small Batches with me, Adam Hawkins. I’m your guide to software delivery excellence. In each episode, I share a small batch of the theory and practices along the path. Topics include DevOps, lean, continuous delivery, and conversations with industry leaders. Now, let’s begin today’s episode.
Software delivery excellence requires a “you build it, you run it” mandate. Running it is crucial because software only provides value in production. Just building it is not enough.
Running software in production provides ample feedback opportunities to learn from design choices, expected behavior, and systemic problems. All that learning may be used to improve future development activities.
That right there is the Three Ways of DevOps: flow, feedback, and learning. First, find flow of software to production. Second, collect feedback about the process. Third, experiment to improve both of those activities.
Engineers and teams often struggle with the “run it” portion of the mandate. Why? Because running, also known as operating, is a different skill than building software. However, it is a skill like any other. Anyone can learn it with coaching and practice on the gemba.
Deming tells us that practice means nothing without theory.
Speaking of Deming, this is the last week to enter my giveaway for a free copy of John Willis’ new book Deming’s Journey to Profound Knowledge. The giveaway ends February 29th. Go to SmallBatches.fm/103 for details on how to enter.
OK, let me channel Deming here. In this episode I’ll share the theory and practices behind understanding production operations.
Let’s begin by creating a common mental model for “production operations”. The aim is to provide working software to consumers in production. We can achieve that aim by working through a series of questions. The answers create a method for bringing systems under some level of operational control.
Every system must have an aim. Our software has a simple aim: provide value to consumers in production. This is the first question: What is the system supposed to do? Operators must be able to state how the software provides value to consumers and the intended behavior.
If this question cannot be answered, then it’s like playing darts without the dart board. You need the target.
After establishing what the system is supposed to do, ask the second question: How do we know the system is doing what it is supposed to be doing? Complex systems have multiple answers to this question.
Answering this question requires empirical thinking. Your answers should include “as measured by”. Here’s an example. Consider a travel booking system. One answer to this question may be: “The number of completed bookings as measured by the total bookings with confirmed payments”.
Your answers should use language the consumer understands. Remember, the system’s aim is to provide value to consumers in production. Frame your answers accordingly.
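For readers following along in the transcript, here is a minimal Python sketch of that “as measured by” answer. The Booking type and its payment_confirmed field are assumptions made up for illustration; a real booking system would have its own data model.

```python
from dataclasses import dataclass


@dataclass
class Booking:
    # Hypothetical record shape, assumed for this sketch only.
    booking_id: str
    payment_confirmed: bool


def completed_bookings(bookings):
    """Count completed bookings, as measured by confirmed payments."""
    return sum(1 for b in bookings if b.payment_confirmed)


bookings = [
    Booking("b-1", True),
    Booking("b-2", False),
    Booking("b-3", True),
]
print(completed_bookings(bookings))  # prints 2
```

The point is that the measurement is stated in the consumer’s terms (bookings and payments), not in terms of servers or processes.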
The first two questions build a mental model of system operation. The next step is bringing that mental model into the world. Here’s the third question: How do we instrument the system to measure what it’s supposed to be doing?
Now we’re getting closer to the redwork.
Instrumenting means adding telemetry. Telemetry is typically logs, metrics, traces, and events. We need telemetry to understand what the system is doing at any point in time.
Go and see if the system produces the telemetry to measure what you came up with in question two. If the telemetry is missing, then close the gap.
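Closing a telemetry gap could look something like the sketch below. It assumes the travel booking example from earlier and uses only the Python standard library. The metric name, the event fields, and the in-memory Counter standing in for a metrics client are all hypothetical.

```python
import json
import logging
from collections import Counter

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("booking")

# Stand-in for a real metrics client (assumption for this sketch).
metrics = Counter()


def record_booking_confirmed(booking_id, amount_cents):
    """Emit a metric and a structured log event when a payment is confirmed."""
    metrics["bookings.confirmed"] += 1
    event = {
        "event": "booking_confirmed",
        "booking_id": booking_id,
        "amount_cents": amount_cents,
    }
    # One JSON object per line keeps the log machine-readable.
    logger.info(json.dumps(event, sort_keys=True))
    return event


record_booking_confirmed("b-1", 12900)
record_booking_confirmed("b-3", 8400)
```

With this in place, the “completed bookings” measurement from question two has a source: the counter for charts and the log events for investigation.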
Now you have a model of system operation and the signals to reconcile it. So here’s the next question: How do we visualize the telemetry to know the system is doing what it’s supposed to be doing?
Answering this question requires visual management with charts and other indicators. The visual management must clearly communicate the intended behavior so there is a call to action when that’s not happening. This may be a line chart with a colored horizontal marker for a threshold. If the measurement falls below the threshold, then there’s a problem.
Teams can use their visual management on an ad-hoc, daily, or weekly cadence to go and see if the system is doing what it is supposed to be doing.
The crucial bit here is that system behavior must be visualized. There’s a popular quote, often attributed to Peter Drucker: “What gets measured gets managed”. We can send gigabytes of telemetry to our monitoring system but never look at it. That’s measured, not managed. It’s simply waste.
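As a rough illustration of a chart with a threshold, here is a text-only Python sketch. A real dashboard tool would draw a line chart with a colored marker; this version only flags samples that fall below the threshold. The sample values and the threshold of 50 are invented for the example.

```python
def render_chart(samples, threshold, width=20):
    """Return one text row per sample, flagging values below the threshold."""
    if not samples:
        return []
    # Scale bars against the largest value (or the threshold, if larger).
    peak = max(max(value for _, value in samples), threshold) or 1
    rows = []
    for label, value in samples:
        bar = "#" * round(width * value / peak)
        flag = "" if value >= threshold else "  <-- below threshold"
        rows.append(f"{label} {bar:<{width}} {value:>4}{flag}")
    return rows


# Hypothetical hourly counts of completed bookings.
samples = [("09:00", 120), ("10:00", 95), ("11:00", 30)]
for row in render_chart(samples, threshold=50):
    print(row)
```

Anyone glancing at the output can see which hour needs attention, which is the call to action the chart is supposed to provide.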
So, I prefer my version: “What gets visualized gets managed.”
We’re four questions into the cascade. So far, these questions have produced a mental model of system operations, the telemetry to reconcile it, and a manual visual management process to go and see if the system is doing what it’s supposed to be doing.
Time for the last question: How do we make it so we’re told when the system stops doing what it’s supposed to be doing?
Control theory states that a control system must operate twice as fast as the underlying system. The production environment is changing multiple times a day. Relying on a manual visual management process on a daily cadence is insufficient. Forget weekly or biweekly.
Answering this question requires creating a 24x7 monitoring system that can page engineers when things stop working.
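The core of such a monitor can be sketched in a few lines of Python. This is an assumption-laden illustration, not how any particular monitoring product works: the rule (page when the last three samples are all below the threshold) and the notify callable standing in for a paging service are both hypothetical.

```python
def should_page(values, threshold, consecutive=3):
    """True when the most recent `consecutive` samples are all below threshold."""
    if len(values) < consecutive:
        return False
    return all(v < threshold for v in values[-consecutive:])


def evaluate(values, threshold, notify, consecutive=3):
    """Run the check and call `notify` with a message when it fires."""
    if should_page(values, threshold, consecutive):
        recent = values[-consecutive:]
        notify(f"completed bookings below {threshold}: {recent}")
        return True
    return False


pages = []  # stand-in for a paging service (assumption)
evaluate([120, 40, 30, 20], threshold=50, notify=pages.append)
print(pages)
```

Requiring several consecutive bad samples is one simple way to avoid paging on a single noisy data point. The right window depends on how fast the underlying system changes.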
Now I’m going to give you an acronym to internalize these questions and practice working through them. It’s M-M-I-V-M.
“M” for “Model”. Create a simplified visual model of the system (such as a block diagram with communication paths). Incorporate how consumers use the system.
Check your work: Can I quickly explain the diagram to another engineer, covering what the system is supposed to be doing and how it’s designed?
“M” for “Measure”. Ask yourself the question: “How do I measure what the system is supposed to be doing?”. Consider the aggregate system and components in the diagram.
Check your work: I can state the system is working as measured by blank. The blanks are typically the golden signals (latency, errors, traffic, and saturation).
“I” for “Instrument”. Determine how the system produces the telemetry for your measurements. Typical sources are application logs, APM libraries, cloud provider telemetry, and custom metrics. This requires a “go and see” attitude to assess what’s already instrumented in the system and what’s not.
Check your work: I have a link to the source telemetry for each of my measurements or a plan to add it.
“V” for “Visualize”. Visualize the telemetry from the previous step as time series charts. Proper visual management is a whole separate topic, so here are some quick tips.
Leverage color. Use blue for traffic and red for errors. Use bar charts for counters. Use line charts for latency. Design the charts to clearly communicate the presence or absence of expected behavior. Use text widgets for reading instructions.
Check your work by evaluating each chart with this sentence: The system is working as measured by the behavior on the blank chart. Notice the expected behavior of blank. How well can you fill in those two blanks?
“M” is for “Monitor”. Use the charts from the visualize step to create 24x7 monitors. Monitors will tell you when the system stops working. However, not all monitors are created equal.
Check your work before turning on those monitors: do I want to wake up at 3AM to fix this problem or can it wait until tomorrow? This is urgency. There are only two answers. Proceed accordingly.
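One way to make that urgency decision explicit is to record it next to each monitor. The Python sketch below assumes two made-up monitor names and two routing outcomes. It only illustrates the “two answers” idea and is not a real alerting configuration.

```python
from enum import Enum


class Urgency(Enum):
    WAKE_ME_UP = "wake me up at 3AM"
    CAN_WAIT = "can wait until tomorrow"


# Hypothetical monitors, each tagged with exactly one of the two answers.
MONITORS = {
    "completed_bookings_below_threshold": Urgency.WAKE_ME_UP,
    "disk_usage_above_70_percent": Urgency.CAN_WAIT,
}


def route(monitor_name):
    """Decide where an alert goes based on its declared urgency."""
    urgency = MONITORS[monitor_name]
    if urgency is Urgency.WAKE_ME_UP:
        return "page on-call"
    return "open ticket"


print(route("completed_bookings_below_threshold"))  # prints page on-call
print(route("disk_usage_above_70_percent"))  # prints open ticket
```

Forcing every monitor to pick one of the two values keeps the 3AM question from being skipped.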
I’ve covered a lot in this episode, so let’s stop here to recap. I shared an exercise for understanding production operations. It’s MMIVM for Model, Measure, Instrument, Visualize, Monitor.
This exercise acts as a visual management system for the work itself. I’ll explain.
I’ve seen engineers struggle because they’re in “monitor” without doing the work in “instrument”. The same goes for engineers eagerly jumping into “visualize” without any understanding of the telemetry or the model behind it. The exercise acts as a way to move the work back to the appropriate step.
Once they’re in the appropriate step, then the path forward is clear: get to monitor. First, get to the state of knowing the system is working. Next, be told when the system stops working. Then you can start doing real continuous delivery.
I challenge you to bring this exercise to your teams, especially those with little ops experience or fuzzy ownership. When they get stuck working through a step, ask this question: “What’s the real challenge here for you?”. Start to develop their capabilities from there.
Remember: MMIVM; Model, Measure, Instrument, Visualize, Monitor.
All right, that’s all for this batch.
I’ve purposely used the phrase “understanding production operations” in this episode. This is one of the four pillars in my Small Batches Way study guide. Get the guide to develop your capabilities in modeling, instrumenting, visualizing, and monitoring systems.
Get the guide and other helpful production operations links at SmallBatches.fm/103.
I hope to have you back again for the next episode. So until then, happy shipping.