The 5S Method

Applying the 5S's (Sort, Straighten, Shine, Standardize, Sustain) to the telemetry needed for stable production operations

Hello and welcome to Small Batches. I'm your host, Adam Hawkins. In each episode I share a small batch of software delivery education. Topics include DevOps, lean, continuous delivery, and conversations with industry leaders. Now, let's begin today's episode.

I’ve been working with systems that are not well understood in production. It’s unclear what telemetry is available. Is the available telemetry even accurate? Does the telemetry capture customer-facing value? What about monitoring? Are those monitors accurate or are they ignored because of false positives?

These experiences frustrate me because I care deeply about operational excellence. Access to telemetry is priority zero for maintaining any level of production operations. Without it, there’s no feedback or learning.

The frustration is amplified by telemetry tools that provide overwhelming amounts of data. (P.S. I’m talking to you directly, Datadog.) This provides an illusion of telemetry, while at the same time forcing engineers to sort through volumes of telemetry.

Contrast this with the other end of the spectrum. Consider a system that has a document listing all available telemetry: what it is, why it’s important, what to use it for, what’s monitored, what the paging conditions are, and how to respond to pages. Newcomers can consume this information and quickly get up to speed. Experienced engineers can navigate it with ease.

Continuous delivery requires stable production operations, where normal operating conditions are understood and abnormal conditions are addressed through standardized processes. The typical challenge is that systems have limited to no telemetry and have not been designed for stable production operations. So where do we begin?

Toyota’s so-called "5S’s" help here. The 5S’s originated from Toyota’s approach to individual workstations. I think the same approach applies to the telemetry needed for stable production operations.

The first S is "sort". Sort through items and keep only what is needed while relocating or disposing of what is not. Imagine a workbench with tools all strewn about. It’s better to only have the needed tools present and the others moved aside. This will make it easier to pick the correct tool when the moment comes.

The second S is "straighten" or orderliness. Think of this as "a place for everything and everything in its place". Imagine that same workbench. Once the tools are sorted by those that belong and do not belong, then the ones that belong are hung on a board with an outline for each tool. The outline is a visual control mechanism. The visual presence immediately indicates which tools are and are not in place.

The third S is "shine" or cleanliness. Cleaning or shining up your tools acts as an inspection mechanism. As you clean each tool you will notice problems like cracks or broken handles. This helps you prevent problems before they happen.

The fourth is "standardize". Create rules and systems that maintain and monitor the first three S’s. This could mean setting standards so that all the workbenches use the same organization, with declared setup and teardown procedures.

The fifth is "sustain" or self-discipline. Maintaining these principles requires personal commitment. It’s a process of ongoing improvement as you adapt to changing conditions. For example, if you encounter a workbench that is disorganized, then you commit to the needed 5S work to bring it up to standard.

Lastly, there is the related concept of "PFEP" or "Plan for Every Part". Returning to our workbench example, this means creating an inventory of all the required tools and how they will be used and procured. The PFEP should be as detailed as possible since it represents your understanding of everything required to construct the system or complete the process.

Let’s bring this back to telemetry and production operations now that we’ve equipped ourselves with the 5S’s and PFEP. These ideas help navigate the uncertain gray zone between the current condition of production operations and the target condition of stable production operations.

I like to begin by creating a PFEP for monitors. Starting with monitors requires answering questions about urgency and about normal, abnormal, and prefailure conditions. This exercise tests your mental model of the system and your understanding of the business requirements. Also, it bootstraps the PDCA process needed to test your understanding of the system against real world production operations.
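To make this concrete, here is a minimal sketch of what one PFEP entry for a monitor might look like. Everything in it is an assumption for illustration: the `MonitorPlan` fields, the `unanswered` helper, the metric names, and the thresholds are all hypothetical, not a prescribed format.

```python
from dataclasses import dataclass, field
from enum import Enum


class Urgency(Enum):
    PAGE = "page"      # someone must respond right away
    TICKET = "ticket"  # can wait for working hours


@dataclass
class MonitorPlan:
    """One hypothetical PFEP entry: what we believe about a single monitor."""
    name: str
    urgency: Urgency
    normal: str        # what normal operations look like
    prefailure: str    # early warning condition
    abnormal: str      # condition that needs a response
    telemetry: list[str] = field(default_factory=list)  # metrics the monitor needs
    runbook: str = ""  # how to respond


def unanswered(plan: MonitorPlan) -> list[str]:
    """Return the fields left blank, i.e. gaps in the mental model."""
    gaps = [name for name in ("normal", "prefailure", "abnormal", "runbook")
            if not getattr(plan, name).strip()]
    if not plan.telemetry:
        gaps.append("telemetry")
    return gaps


# Illustrative entry; metric names and thresholds are made up.
plan = MonitorPlan(
    name="db-connection-exhaustion",
    urgency=Urgency.PAGE,
    normal="fewer than 60% of pool connections in use",
    prefailure="more than 80% in use for 5 minutes",
    abnormal="pool exhausted; requests waiting on a connection",
    telemetry=["db.pool.in_use", "db.pool.size"],
)
print(unanswered(plan))  # ['runbook']
```

The blank fields are the useful part. Each one is a question about the system that still needs an answer.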

Armed with the PFEP, it’s time to put in the work.

First, sort the telemetry by data required to create the monitors. Telemetry not connected to monitors is secondary. It’s "disposed" for now in the sense it’s unimportant. For example, if you have a monitor on connection exhaustion then sort the telemetry related to networking and connections before worrying about memory utilization.
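The sorting step can be sketched in a few lines, assuming you can list the metric names your system emits and the metric names each monitor needs. The `sort_telemetry` function and all metric names here are hypothetical.

```python
def sort_telemetry(available, monitors):
    """Split metric names into those a monitor needs and those set aside.

    `available` is a list of metric names the system emits.
    `monitors` maps a monitor name to the metric names it requires.
    Also reports metrics that a monitor needs but nothing emits yet.
    """
    required = set().union(*monitors.values())
    emitted = set(available)
    keep = sorted(required & emitted)
    set_aside = sorted(emitted - required)
    missing = sorted(required - emitted)
    return keep, set_aside, missing


# Illustrative data only; these metric names are made up.
available = ["db.pool.in_use", "db.pool.size",
             "host.memory.used", "net.tcp.retransmits"]
monitors = {
    "db-connection-exhaustion": ["db.pool.in_use", "db.pool.size",
                                 "db.pool.wait_time"],
}

keep, set_aside, missing = sort_telemetry(available, monitors)
print(keep)       # ['db.pool.in_use', 'db.pool.size']
print(set_aside)  # ['host.memory.used', 'net.tcp.retransmits']
print(missing)    # ['db.pool.wait_time']
```

The `missing` list is a bonus: it shows telemetry the monitors need that the system does not emit yet.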

Now it’s time to put the telemetry in its place. This is the second S: straighten. That place is visualized on a dashboard. Graphing or creating other visual representations of the data is also an act of shining or cleanliness, the third S. This forces you to _use_ the telemetry for its intended purpose. You will identify gaps that prevent using the telemetry for its intended purpose.

The visualization process bootstraps the feedback loop between production operations and future decisions. _Using_ the telemetry forces you to identify false positives and other inaccuracies. You’ll learn that some telemetry is not useful, so it may be discarded. This is wonderful news because it’s better to have a small set of highly informative and actionable telemetry than a large set of barely usable telemetry.
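One way to put a number on false positives is to review past alerts and record whether each one needed a response. This is a rough sketch under that assumption; the `false_positive_rates` function, the alert log format, and the monitor names are hypothetical.

```python
from collections import Counter


def false_positive_rates(alerts):
    """Compute the share of non-actionable alerts per monitor.

    `alerts` is an iterable of (monitor_name, was_actionable) pairs,
    for example taken from a manual review of past pages.
    """
    fired = Counter()
    false_positives = Counter()
    for monitor, was_actionable in alerts:
        fired[monitor] += 1
        if not was_actionable:
            false_positives[monitor] += 1
    return {m: false_positives[m] / fired[m] for m in fired}


# Illustrative review log; the data is made up.
alerts = [
    ("db-connection-exhaustion", True),
    ("db-connection-exhaustion", True),
    ("disk-latency", False),
    ("disk-latency", False),
    ("disk-latency", True),
    ("disk-latency", False),
]
rates = false_positive_rates(alerts)
print(rates["db-connection-exhaustion"])  # 0.0
print(rates["disk-latency"])              # 0.75
```

A monitor with a high rate is a candidate for tuning or removal, along with any telemetry that only it uses.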

This process may be completed for a single monitor. New monitors will undoubtedly be added to the system. This is where the final two S’s come in.

You need standardized processes to ensure the introduction of _new_ telemetry meets the current standard. You also need the self-discipline to sustain this over time, so that adding new telemetry is a net positive: it improves your understanding instead of distracting you with false positives or useless information.
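A standard is easier to sustain when it can be checked automatically. Here is a minimal sketch, assuming each piece of telemetry is described by a small record; the `REQUIRED_FIELDS` list, the `violations` function, and the example entry are hypothetical and would be replaced by whatever your own standard requires.

```python
# Hypothetical standard: every telemetry entry must document these fields.
REQUIRED_FIELDS = ("description", "purpose", "dashboard", "monitor")


def violations(entry):
    """Return the required fields that are missing or blank in `entry`."""
    return [name for name in REQUIRED_FIELDS
            if not str(entry.get(name, "")).strip()]


# Illustrative entry for a new metric; names and text are made up.
new_entry = {
    "name": "db.pool.wait_time",
    "description": "Time a request waits for a pooled connection",
    "purpose": "Detect prefailure before the pool is exhausted",
    "dashboard": "",
}
print(violations(new_entry))  # ['dashboard', 'monitor']
```

A check like this could run in code review or CI so that new telemetry without a place on a dashboard or a connection to a monitor gets flagged before it ships.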

Remember that reaching stable production operations is a process of ongoing improvement. As your understanding improves, you will be able to detect more prefailure and abnormal conditions. You will learn to design telemetry into systems since it makes future work easier. You will learn to design systems that prevent problems in production. The system will also continually change, which tests your assumptions about normal operations. You must use that information to drive future decisions.

Alright, that’s all for this batch. I encourage you to think about how you can apply the 5S’s (sort, straighten, shine, standardize, sustain) and the PFEP (plan for every part) in your daily work.

I have two resources for you if you’d like to delve deeper.

The first is a previous podcast episode on Jeffrey Liker’s book "The Toyota Way". It’s a wonderful look into the Toyota philosophy that informs modern lean software delivery.

Second is the Small Batches Slack app. I’ve loaded the app with tips and info like 5S to improve your daily work. The app is currently FREE in beta, so sign up today and get small batches of software delivery education sent to your team’s Slack.

Find links to The Toyota Way and the Small Batches Slack app at smallbatches.fm/66.

OK, well I hope to have you back again for the next episode. Until then, may your systems be stable, pagers quiet, and telemetry accurate. Happy shipping.
