T1 and T2 Signals
Hello and welcome to Small Batches with me Adam Hawkins. I’m your guide to software delivery excellence. In each episode, I share a small batch of the theory and practices along the path. Topics include DevOps, lean, continuous delivery, and conversations with industry leaders. Now, let’s begin today’s episode.
One of my focuses over the last two years has been directing my teams to answer this question: How do we know the system is working?
This is a powerful question because it forces the other person to create a mental model about what the system is supposed to be doing and how the system is designed to do that. You cannot answer this question without it. Once you have that model you can answer the question: We know the system is working _as measured by_ fill in the blank.
Though what is “fill in the blank”? What do you call “the blank”? Repeated iterations and conversations around what makes a good “fill in the blank” and what does not built a mental model and lexicon for discussing this. We came up with “Tier 1” and “Tier 2” signals, A.K.A. T1 and T2 signals.
So in this episode, I’ll share the model so you can more confidently answer the question “How do we know the system is working?”
But first, a reminder about the FREE May giveaway.
This month I’m giving away a copy of Flow Engineering by Steve Pereira and Andrew Davis. This is one of my top books on flow, value stream mapping, and how to actually improve things.
Here’s my blurb for the book:
> The best management systems teach us to continually improve the systems we work in. By what method? The method is Flow Engineering. Flow Engineering offers a practical guide to beginning the process of continuous improvement.
Just a little teaser of the book to remix it with the topic of today’s episode. How do you know your process is working? Start with value, clarity, and focus. Next map it. Then, improve it!
Anyways, go to SmallBatches.fm/109 for entry details. Entry runs until the end of the month.
Now onto the good stuff. Let’s talk systems.
First, I’m sure this model is similar to something else out there in the world with a different name. So I doubt this is an original creation. In fact, I hope that someone can point me to something that’s more easily google-able than my bespoke T1 and T2 signal model. SREs and Ops people, please let me know. Disclaimer out of the way, now onto the lexicon.
There are multiple ways to explain the difference between the T1 and T2 signals. Here’s a simple explainer, then I’ll explain with examples.
T1 signals measure what the system is doing. T2 signals measure why the system is doing that. I refer to these as tier one and tier two because there is a relationship and different use cases.
The T1 signal depends on the T2 signals. In other words, the T2 signals influence the T1 signal. T1 signals are candidates for SLIs and other first-tier visual management dashboards. T2 signals are best suited to second-tier, more detailed diagnostic visual management dashboards.
I know that is abstract, so here are concrete examples.
This example comes straight from the daily work.
One of my teams is responsible for Datadog cost governance. The aim is to deliver Datadog on budget each year. So how do we know the system is working? We know the system is working as measured by the current monthly bill.
That “current monthly bill” is the T1 signal. It measures what the current monthly spending is. So, if we had to pay the bill _today_, then this is what we would pay. We know the expected cost, we know how much we’ve already spent, and we know the budget amount.
We can synthesize that into a blue, yellow, black grade for on-budget delivery. We know the system is working. All this is captured in a single chart on the team’s weekly ops review dashboard. Grading it black creates a clear call to action: go and see _why_ the spending is at X.
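To make that concrete, here is a minimal sketch in Python of what that grading could look like. The thresholds, the function name, and the dollar amounts are assumptions for illustration, not the team’s actual rules.

```python
# A minimal sketch of grading the T1 signal (current monthly spend)
# against the budget. Thresholds and numbers are illustrative assumptions.

def grade_spend(current_monthly_spend: float, monthly_budget: float) -> str:
    """Collapse the T1 signal into a simple grade for the weekly ops review."""
    ratio = current_monthly_spend / monthly_budget
    if ratio <= 0.95:
        return "blue"    # comfortably on budget
    if ratio <= 1.0:
        return "yellow"  # close to budget, keep an eye on the T2 signals
    return "black"       # over budget: go and see why

# Example: $48,000 spent against a $50,000 budget grades "yellow".
print(grade_spend(48_000, 50_000))
```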
Enter the T2 signals. The T2 signals measure the spending on different Datadog features. Each feature has its own billing model and units, so each T2 signal is different. The team created a deeper diagnosis dashboard full of different visual management controls for the different T2 signals. Some are tree maps, some are time series charts, some are top lists. The point is the T2 signals are more complex and require a deeper mental model to truly interrogate them.
So the T1 signal, the current spend, may have changed because of a change in the ingested log bytes. That ingested log bytes is a T2 signal. Just looking at that metric cannot tell you _why_ ingested log bytes changed. However, it may tell you what service or environment started sending more logs. That’s just _enough_ to go and see that service or environment.
The other point in this example is that focusing exclusively on T2 signals is unlikely to demonstrate a problem. They lack the context provided by the T1 signal. This is a trait of complex systems. You need multiple measurements to understand what’s happening, why it’s happening, and what the causes may be.
Now that we have more understanding, here’s another example.
Say you’re making a simple HTTP API. Let’s assume that HTTP responses are entirely sufficient to measure whether the system is working.
I mention this caveat because HTTP response codes are rarely sufficient to measure the system’s true purpose, so just roll with me for now.
A T1 signal for this service is the success rate or “availability”. This is typically measured by dividing successful requests by total requests. This creates a percentage, hopefully something close to 99.9. This defines one SLI, or service level indicator. That SLI can be used to set an SLO, or service level objective. The objective may be 99% in an hour or 99.9% in a month. Now we know the system is working by tracking the system against the SLO.
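Here is a minimal sketch of that calculation, assuming you already have the two counts. The numbers and the 99.9% target are examples, not a recommendation.

```python
# A minimal sketch of the availability SLI: successful requests divided by
# total requests, tracked against an example SLO target.

def availability(successful_requests: int, total_requests: int) -> float:
    """T1 signal: success rate as a percentage."""
    if total_requests == 0:
        return 100.0  # no traffic means nothing failed
    return 100.0 * successful_requests / total_requests

SLO_TARGET = 99.9  # example objective, e.g. 99.9% over a month

sli = availability(successful_requests=999_412, total_requests=1_000_000)
print(f"SLI: {sli:.2f}%, meets SLO: {sli >= SLO_TARGET}")
```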
This example demonstrates how to compose a T1 signal from different T2 signals. The T2 signals are the total requests and the successful requests. These are necessary, but not entirely useful in isolation. They are immensely more useful when combined to contextualize each other.
Say you’re tracking the counts of 500s. Alright, the count goes from zero to one? One to ten? Ten to a thousand? Five hundred to ten? Are these good or bad? Is this special or common cause variation? That cannot be answered without knowing the total requests.
Moreover, the T2 telemetry should be deeper than the T1 telemetry. Say this example measures the T1 signal across all endpoints. You notice the T1 signal drops from 99% to 80%. Ok, but which endpoint? Is it POST to foo or GET bar? Doing a quick breakout of successes and errors by route or endpoint will answer that question.
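A rough sketch of that breakout could look like the following. The routes, status codes, and request records are invented for illustration.

```python
# A minimal sketch of breaking out successes and totals per route so the
# T2 view points at *where* the T1 signal dropped. Data is made up.
from collections import defaultdict

requests = [
    {"route": "POST /foo", "status": 200},
    {"route": "POST /foo", "status": 500},
    {"route": "GET /bar",  "status": 200},
    {"route": "GET /bar",  "status": 200},
]

totals = defaultdict(int)
successes = defaultdict(int)

for r in requests:
    totals[r["route"]] += 1
    if r["status"] < 500:
        successes[r["route"]] += 1

for route in totals:
    rate = 100.0 * successes[route] / totals[route]
    print(f"{route}: {rate:.1f}% ({successes[route]}/{totals[route]})")
```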
One last example before some tips.
This example comes directly from the daily work. One of my teams is responsible for video transcription. This is a long running asynchronous process that can take up to 48 hours. Users upload videos, then eventually the video should have a transcript. We know the system is working as measured by the total videos that received transcripts within 48 hours.
That measurement tells us if the system is working, but nothing about why it may stop working. This is a complex process, so the team created multiple T2 signals. They measure how many videos are in different states of the larger finite state machine, different API calls from our systems to other vendors, latencies, and when things happened. These T2 signals are sufficient to go and see _why_ videos are not being transcribed within 48 hours.
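Here is a small sketch of what a couple of those T2 signals could look like in code. The states, timestamps, and the shape of the video records are assumptions for illustration, not the team’s actual pipeline.

```python
# A minimal sketch of two T2 signals for the transcription pipeline:
# videos per state, and videos still untranscribed past the 48-hour mark.
# States and records are illustrative assumptions.
from collections import Counter
from datetime import datetime, timedelta, timezone

DEADLINE = timedelta(hours=48)
now = datetime.now(timezone.utc)

videos = [
    {"id": "a", "state": "uploaded",     "uploaded_at": now - timedelta(hours=2)},
    {"id": "b", "state": "transcribing", "uploaded_at": now - timedelta(hours=50)},
    {"id": "c", "state": "transcribed",  "uploaded_at": now - timedelta(hours=30)},
]

# T2 signal: how many videos sit in each state of the larger state machine.
print(Counter(v["state"] for v in videos))

# T2 signal: videos not yet transcribed and already past the 48-hour deadline.
overdue = [v["id"] for v in videos
           if v["state"] != "transcribed" and now - v["uploaded_at"] > DEADLINE]
print("overdue:", overdue)
```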
I can share so many lessons from this example, so that’s my segue into tips.
Tip one: T1 signals capture consumer behavior.
T1 signals are expressed in a domain consumers understand. Bring it back to the Datadog spending example. The consumer is the finance department. The measurement is current monthly spending measured in dollars. All parties can understand that. Not all parties can understand the monthly log ingestion measured in bytes. That’s the T2 signal.
Tip two: replace your developer hat with a consumer hat.
I’ve seen engineers struggle to identify T1 signals because they’re stuck in the white-box thinking of T2 signals.
Instead, they need to step outside the system and see it like the black box that consumers do. Consumers simply do not care what AI you use for transcription or how many queues there are between the video and the transcript. They only care about the video and getting the transcript on time.
Tip three: good T1 signals come with clear calls to action.
This is a strong indicator of a great T1 signal. The point of this tip is that if the T1 signal changes, then you know _something_ needs to be done.
That something may be a straightforward runbook execution. On the other hand, it may require someone to go and see before doing anything. If the action is clear, then it can be connected to automated monitoring. Trust me on this. Monitors connected to clear signals of actionable problems are the heart and soul of production operations.
Tip four: you can have multiple T1 signals.
You may be able to compress an entire feature or value stream to a single number. If so, great! Sometimes you need just a little more.
Reconsider the simple web service. Say the company uses it in a real time trading product. Performance is critical. So there may be an additional T1 signal, and thus an additional SLO: one for availability and one for the percentage of requests that complete in under 50 milliseconds.
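A minimal sketch of that second T1 signal, with made-up latency samples:

```python
# A minimal sketch of the latency T1 signal: the share of requests that
# complete in under 50 milliseconds. Latencies are made-up sample data.

LATENCY_THRESHOLD_MS = 50

latencies_ms = [12, 31, 48, 52, 9, 44, 75, 18, 23, 49]

fast = sum(1 for ms in latencies_ms if ms < LATENCY_THRESHOLD_MS)
fast_pct = 100.0 * fast / len(latencies_ms)

print(f"{fast_pct:.1f}% of requests under {LATENCY_THRESHOLD_MS}ms")
```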
Tip five: T1 and T2 signals are not the same as the golden signals.
The golden signals are Latency, Errors, Traffic, and Saturation. The golden signals capture facets of operations. These may help identify which facets matter when answering the question: how do we know the system is working?
Let’s return to the Datadog example. The T1 signal, the current monthly spending, is closest to traffic. That’s “traffic” in the sense of dollars out of the company bank account. A T2 signal such as ingested log bytes is traffic measured in bytes. A T2 signal such as the percentage of active APM hosts relative to the committed quota is saturation. Changes in traffic of log bytes or saturation of APM hosts can change that T1 signal.
Tip six: T1 and T2 signals are measurements, not instrumentation instructions.
Let’s return to the transcription example. We know the system is working if the video receives a transcript in under 48 hours. That short description says _nothing_ about how the system will instrument that number. It’s up to you, the engineer, to build the reading of that measurement into the system. That’s “instrumentation”. Writing the T1 and T2 signals in English defines what to instrument.
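Here is a rough sketch of what that instrumentation could look like. The metrics client, metric name, and function names are stand-ins I made up, not a real library API; in practice you would use whatever StatsD, Datadog, or OpenTelemetry client the system already has.

```python
# A minimal sketch of instrumenting the time-to-transcript measurement.
# MetricsClient and the metric name are hypothetical stand-ins.
from datetime import datetime, timedelta, timezone

class MetricsClient:
    """Stand-in for whatever metrics library the system already uses."""
    def histogram(self, name: str, value: float, tags: list[str]) -> None:
        print(f"metric {name}={value:.2f} tags={tags}")

metrics = MetricsClient()

def on_transcript_completed(video_id: str, uploaded_at: datetime) -> None:
    # Instrumentation: record how long this video took, in hours.
    hours = (datetime.now(timezone.utc) - uploaded_at).total_seconds() / 3600
    metrics.histogram("transcription.time_to_transcript_hours", hours,
                      tags=[f"video_id:{video_id}"])

# Example call when the pipeline marks a video as transcribed.
on_transcript_completed("video-123", datetime.now(timezone.utc) - timedelta(hours=30))
```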
One last tip. Put on your Deming hat for a moment. Deming tells us every system must have an aim. The aim of this episode (itself a system) is to help you answer whether your systems are working. Deming then asks a follow-up question: by what method will you achieve this aim?
This is tip seven: use the MMIVM exercise to build your mental model of the system, then derive your T1 and T2 signals.
Let’s go all the way back to the beginning of the episode. You must have a mental model of what the system is supposed to be doing and how the system is designed to do that. That informs the T1 and T2 signals, how to instrument them, then how to visualize them, and ultimately monitor them.
This is the MMIVM exercise: Model, Measure, Instrument, Visualize, Monitor.
I covered this exercise in detail in a previous episode. Find a link in the show notes.
That’s all for the tips.
Now, dear listener, I have some announcements for you.
The first announcement is about the Small Batches Way study guide.
The guide outlines three courses of reading with supporting self-study questions. It begins with Deming and goes all the way through to building your capabilities in four areas: understanding of continuous delivery, understanding of test driven development, understanding of software architecture, and understanding of production operations.
Right now the study guide is FREE. That’s changing in a week. It’s moving behind the paywall. So pick it up for free at TheSmallBatchesWay.com before it’s too late.
This leads me into the next announcement on paid perks.
The tl;dr here is that my best stuff is moving behind the paywall and more of my best stuff is coming. Going behind the paywall gives me a written and visual medium to share software delivery education.
The Small Batches Way study guide is a perfect example. It’s a visual book map with twenty pages of supporting materials. Plus, it needs updating as I learn more and iterate on the path itself.
Here’s another example of wonderful material that doesn’t fit the Small Batches podcast. Patreon and Substack subscribers received a detailed guide on how I design and run my one-on-one meetings as a Director of Engineering. The guide covers the mental model behind the system, plenty of visual samples, and a deep FAQ. Plus, template Miro boards for use in the daily work.
I have plenty more stuff on the shelf itching to see the light of day. Here’s what’s top of mind:
* Deeper book reviews and analysis
* Coding and deeper technical examples
* Gemba walks through facets of my daily work
* A3 reviews and retrospectives
* MMIVM worksheets examples
* Dives into adjacent topics for the most curious
* Dedicated software like Joseph Carlson does for his supporters
* Deeper meditations from the Small Batches library.
Another perk is episode requests and ask-me-anything episodes.
I’ve been receiving more of these than usual, so preference goes to my paid audience.
I saved the best perk for last: direct coaching from me. Bring your challenges to me. You’ll find I am an authentic, curious, and engaged collaborator. Let’s talk.
Look, the point is this: lots of perks are coming to paid supporters.
All these perks won’t stay around forever, so subscribe now.
I am cross posting content to Patreon and Substack, so you can subscribe on whichever platform you prefer. Find links to my Patreon and the Software Kaizen Substack at SmallBatches.fm.
Alright, that’s all for this batch.
Go to SmallBatches.fm/109 for a link to my Patreon, the free May giveaway, and more on T1 and T2 signals.
I hope to have you back again for the next episode. Until then, happy shipping.