Service Level Objectives with Alex Hidalgo

Alex Hidalgo and Adam Hawkins discuss the theory, practices, and hard problems of service level objectives (SLOs).

[00:00:00] Hello and welcome. I'm your host, Adam Hawkins. In each episode I present a small batch of the theory and practices behind building a high velocity software organization. Topics include DevOps, lean, software architecture, continuous delivery, and conversations with industry leaders. Now let's begin today's episode.

[00:00:25] Hello again, everyone. It's that time of the week again: a new Small Batches episode. Today I'm speaking with Alex Hidalgo. Let me read off his official bio. Alex is the principal site reliability engineer at Nobl9 and author of Implementing Service Level Objectives. During his career, he has developed a deep love for sustainable operations, incident response management, proper observability, and using SLO data to drive discussions and make decisions.

[00:00:58] Alex's previous jobs include IT support, network security, restaurant work, t-shirt design, and hosting game shows at bars. When not sharing his passion for technology with others, you can find him scuba diving or watching college basketball. He lives in Brooklyn with his partner, Jen, and a rescue dog named Taco.

[00:01:19] Alex has a BA in philosophy from Virginia Commonwealth University. I found Alex on Twitter when someone tweeted a picture of a book on SLOs. I had never heard of this book before, and, you know, I try to keep up on these types of things, so I was surprised by it. Well, I went on the internet to learn more. It turns out that there is, in fact, a book on just SLOs. Even better, after reading it, I can say that it is the book on SLOs. So I immediately reached out to Alex about coming on the show. He and I are two birds of a feather. We're both SREs, although he cut his teeth working as an SRE at Google, so he's definitely got more street cred than I do. That puts him in a wonderful position to write about and discuss the theory and practice of SLOs, because I don't think that there's anyone as good at SLOs as Google. Alex and I discussed the theory and practices behind SLOs, why even write a book about them in the first place, and the shortcomings of all the current telemetry tools.

[00:02:22] I was dying to ask him what his preferred telemetry stack is for SLOs. Well, the TL;DR is that it doesn't exist right now, so he's building it at Nobl9. We assume that listeners are already familiar with SLIs and SLOs, so for this conversation we didn't focus on the basics; we tackled more of the day-two problems with SLOs. I've already done a short solo episode on SLOs to prime the pump for this conversation. You can find that one at smallbatches.fm/14. Now, here's my conversation with Alex Hidalgo.

[00:03:02] Adam Hawkins: Alex, welcome to the show.

[00:03:04] Alex Hidalgo: Happy to be here. All right.

[00:03:06] Adam Hawkins: So let's start from the beginning. I found you because you wrote what appears to be the book on SLOs. I don't think there was one before, and I thought to myself, okay, I've got to get this guy on the show. He knows a lot about SLOs. I talk about SLOs. I'm also an SRE; he's an SRE. And I started to think, why write a book on this? There's the SRE workbook, which I think you actually wrote a chapter in, right? About SLOs. And then there's, of course, the SRE book. So why write this book? What was your original motivation for that?

[00:03:39] Alex Hidalgo: So there's a pretty lengthy story there, really. I was at Google for a long time, just about seven years. And there, of course, as an SRE, that's kind of where SLOs were developed, and how they're used was really first explored. So it was just part of what I did. It was part of my job. Our services had SLOs, and we used them to make decisions, and we measured our error budgets. And it was pretty wonderful. Once I was on certain teams, we were able to get rid of all of our threshold-based alerting, and we only alerted off of error budget burn rate. It was great. The number of false positive pages that we got went way down, and things like that. So I just fell in love with the concept. And towards the end of my tenure at Google, I was asked to join CRE, the Customer Reliability Engineering team. That was, and still is, a group of veteran site reliability engineers tasked with basically helping Google's largest cloud customers learn how to SRE: how do we build reliable systems, and things along those lines. And the CRE group decided very early: look, for us to be able to help you, we need a common vernacular. We need a way to communicate with each other. And for SRE at Google, that's SLOs. That's how it works. So before we would engage anyone further, helping them actually look at their architecture or even their code, like how can we help you make your whole system more resilient and robust and therefore reliable, the goal was essentially: you need SLOs first, and we'll help you with those. We developed workshops. We'd go on site. I sometimes spent up to a week traveling all over the country, helping people learn what these concepts are and how they work and why they work. And then we'd even sit down with you and pair with you and ensure you actually got these things set up.

[00:05:31] Alex Hidalgo: What is an SLI for your system? What is a good SLO target for your system? So I was really engrossed in that for quite a long time. And then it became time for me to leave Google, for plenty of reasons, and I ended up at Squarespace. And at Squarespace, I don't know if it was actually on day one, but on what felt like day one, people approached me and were basically like, hi, we want to do SLOs, you know SLOs, can you help us do SLOs? And I was like, yeah, sure, of course, this'll be no problem at all. What I reasonably assumed was going to be, you know, a few months of work, or a quarter or so, and everyone's going to be happy. I realized very quickly how much you need to do if you're trying to do this from the ground up. I had been introduced to SLOs at Google when other people had already formulated them, when the tooling existed, the culture existed, even the mandates from leadership, right? That's part of it.

[00:06:29] Alex Hidalgo: Leadership didn't just like them, they wanted them. And suddenly I was running into so many hurdles. I realized, you know, we were using Prometheus primarily at Squarespace, and Prometheus doesn't do SLO calculations. You can't do that via PromQL. So we had to build a system just to do the measurements and to calculate your error budgets.
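To give a sense of what such a system has to compute, here is a minimal sketch, not Alex's actual implementation, of the basic error budget arithmetic over a window of raw good/total event counts. All values are invented for illustration:

```python
# A minimal sketch of the error budget math a team has to build when
# the metrics backend only stores raw counts. Values are invented.

SLO_TARGET = 0.999        # 99.9% of events should be good
TOTAL_EVENTS = 4_200_000  # events observed in the SLO window
GOOD_EVENTS = 4_196_500   # events that met the SLI definition

sli = GOOD_EVENTS / TOTAL_EVENTS                 # observed reliability
allowed_bad = TOTAL_EVENTS * (1 - SLO_TARGET)    # error budget, in events
actual_bad = TOTAL_EVENTS - GOOD_EVENTS
budget_remaining = 1 - actual_bad / allowed_bad  # fraction of budget left

print(f"SLI: {sli:.4%}")                            # SLI: 99.9167%
print(f"budget remaining: {budget_remaining:.1%}")  # budget remaining: 16.7%
```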

[00:06:49] Alex Hidalgo: And then I did develop a workshop that was based very much off of the one I helped develop for CRE. These were four-hour-long sessions where first I'd teach people, then we'd do a risk assessment, then we'd sit down and, again, I'd look at your code, I'd look at your service: let's actually leave this room with one defined SLI and one defined SLO.

[00:07:08] Alex Hidalgo: It was a lot. We had to create, you know, document templates, document repositories. SLOs aren't very useful, even for internal tools, if other people can't discover them and how they're defined. And I could go on and on, but the point is, just building this entirely from the ground up was so much work.

[00:07:26] Alex Hidalgo: So one day, in I believe June 2019, I had just finished one of these workshops. I'd been running these like once a week for months, and I was just getting tired of repeating myself. Not quite burnt out, but close. You know, I'm going in and I'm saying the same thing for hours on end once a week. And people were very receptive and we were getting traction, but it was just a little bit tiring for me. So I'm speaking with my coworker Gabe, and I told him, I wish there was a book about this, so I could point people to it, so I wouldn't have to keep repeating myself. And he said, you should write the book. And I said, no, no, no, no, no. We need a real expert to write it. And Gabe said, you are the expert. And I responded with an expletive, because I immediately realized: I am now writing a book. And I've heard from plenty of people who have done that how difficult it is. That's really how it got started. In a way, the book is the story of my two years at Squarespace. It's the story of what I learned and the knowledge I was able to impart on others over the course of that.

[00:08:40] Adam Hawkins: I see. So really you were kind of like an internal consultant, actually, you know, going around doing these things, trying to help these different organizations achieve these specific outcomes. And then you did that enough that you're like, okay, I can almost standardize this. Right?

[00:08:58] Adam Hawkins: I can encapsulate all of my knowledge about this topic in a book and share that with people so that they can start from there. And then maybe, you know, they can kind of level up on their own. And then, like you said, there's now a standard language to approach this topic.

[00:09:15] Alex Hidalgo: Yeah, that's exactly it. I had great success getting all this set up, but it took longer and was more challenging than I realized it was going to be.

[00:09:24] Alex Hidalgo: But yeah, your comment about upleveling is a great one. For example, I ended up writing about 60% of the book, but about 40% was written by various contributing authors. And one is my friend Harold Treen. He was someone at Squarespace who had never used SLOs before, but his team was one of the first to come to our workshops, and they really internalized it.

[00:09:47] Alex Hidalgo: And he led his team. He was like, here's how we're going to do this. Here are the SLIs we need. Here are the correct SLO targets. Here's what we should be alerting on or not. And then he transferred teams, and he did it again. So he was traveling around Squarespace imparting this knowledge that I'd only helped him get started with, right?

[00:10:05] Alex Hidalgo: Most of this he kind of intuited himself or learned himself. And that's why he wrote the SLO culture chapter. You know, I come at it from a totally different standpoint than he does. So he was the best option, in my opinion, to say: here's how you convince your team to go on this journey with you.

[00:10:25] Alex Hidalgo: Here's how you, on a single team, as an engineer, as an individual contributor, can build an SLO culture on your team. So that's why I asked him to write that chapter, exactly because of what you're talking about. It's all about knowledge sharing and having it grow organically and people leading from the bottom.

[00:10:44] Adam Hawkins: So you spoke about writing this book, and part of the realization was, hey, there's a lot more here than I thought. And this also speaks to my experience with SLOs, because I think that experience bifurcates in two different directions. Like, you, working at Google, were kind of already in the promised land. They had this stuff established.

[00:10:56] Adam Hawkins: You had buy-in, you had all this structure in place to just say, yeah, here's my number, we're going to use this, and this is going to drive all these other things. These are the outcomes at the very highest level of the pyramid, where we as SREs want to operate, because they're good for us, but they're also good for the business. It's a win-win for everybody involved. And then...

[00:11:30] Adam Hawkins: There's the other end, where maybe you don't have telemetry, you don't have buy-in, you don't have product thinking about error budgets. It's like this iceberg that you just go down and down and down, and you realize, hey, this is a huge area, and for us to actually make a real difference here, we're going to have to adopt A, B, C, and D, or change X, Y, and Z. We're going to have to learn all this stuff. So I think for a lot of people, their experience starts there: they haven't gotten to the promised land yet, they're in a really introductory state, a beginner state, and they think if they just create, say, a number, they're going to get all these other things. And then that kind of sours their experience with this.

[00:12:20] Adam Hawkins: And it leaves a negative taste, right? So what's the biggest pitfall for a team adopting this way of thinking? Because it's not just a way of working; it's really a whole philosophy. So when a team that's not there yet wants to adopt this, what pitfalls do they hit early on in this journey?

[00:12:40] Alex Hidalgo: There are a few that I've seen commonly, many times, and the first and most prominent is actually exactly what you just mentioned. Too many people think that SLOs are a thing you can do, a checkbox on a list that you can check off, and then you're like, oh cool, now we've got SLOs. That's not how it works.

[00:12:58] Alex Hidalgo: It's a different approach to thinking about reliability. I would say it's closer to something like using agile to help you plan your sprints than it is to some kind of technological solution. It's a different way of thinking about your services, and it's a different way of gathering data about them, which you can then use to make decisions.

[00:13:18] Alex Hidalgo: What should we be focused on right now? Are our users happy or not? Things like that. And that's the part that people often miss. They don't understand it's an ongoing conversation. It doesn't end. It's a different approach. In fact, that's why I don't love just saying "doing SLOs," right? That's how many people refer to it.

[00:13:38] Alex Hidalgo: I try very often to say "SLO-based approaches to reliability," because that's what it really is. It's a different way of thinking, and it's not a thing that ends; there's no start and end date here. And that's the biggest problem I see: teams don't realize that. Someone in leadership, someone who's very excited about the concept of site reliability engineering, latches onto SLOs as some kind of new, fancy, hot lingo, and they just want them because they're supposed to have them, or because they read some blog post somewhere. And that doesn't prime people correctly to understand that this is a culture change, a process change that is going to last forever. And if you don't realize that, then you're not going to be able to adopt these approaches in the most efficient manner.

[00:14:26] Adam Hawkins: Yeah, so true. It's almost like when you talk about SLOs, you really have to talk about SLO culture, because it's way bigger than just, you know, a number. I've been on teams where it's like, hey, we're going to do an SLO. They do all the work to create a number, you report this number, and then it ends up in some spreadsheet, which just gets occasionally looked at and eventually relegated to the dustbin of the pile of spreadsheets with useless numbers. So without the culture behind them, those numbers are not going to be useful. I'd like to get your take on this. One problem that I've seen is that if the push comes from engineering, the numbers are going to be engineering-focused, something low-level about, say, how this process is running or how the web server is running, like raw counts of HTTP requests, which is useful to some people, but not to people at the highest level, right? So that speaks to the problem of getting product- and business-driven SLOs compared to engineering-driven ones. In your experience, what has been helpful in getting product and business stakeholders on board with setting good SLOs?

[00:15:38] Alex Hidalgo: The number one thing I do there is go to the product team or the business side of things, and I just explain: what we're going to call an SLI, a service level indicator, over on the engineering side is what you call a user journey, right? That document that you have called "user journeys for service X," where you outlined the 15 different things that you believe that service needs to do for your customers.

[00:16:03] Alex Hidalgo: That's all an SLI is: it's a measurement of one of those user journeys. And then you go to the business side of things and you say, all right, we're going to be doing this new thing, and it's going to help us tremendously; it'll probably save us money if we do it right. But what we're going to be calling SLIs is what you call KPIs, key performance indicators.

[00:16:21] Alex Hidalgo: They're all the same thing. You know, you can go to your testing team and your QA team, and they'll tell you, oh, that's a transactional test. It turns out everyone at your business, everyone in your organization, probably already cares about the same things. We just don't have a unifying language; we all speak in slightly different lingo. And that's always going to be the case. I don't think you can totally fix that, but that's always how I start. I say, look, we have this concept that we want to measure things from the user perspective, and you probably agree with that. The slightly more difficult thing is to convince people that a hundred percent is not possible.

[00:16:56] Alex Hidalgo: Aiming that high is too difficult, and picking a reasonable target can often be difficult too. I've noticed that founders who are still in technical positions especially often have difficulty getting over this. But you know, you can very easily tell stories there. Tell them: what does it look like when you're watching some random streaming service, we don't have to pick one, and most of the time it takes five seconds to buffer, but sometimes it takes twenty? You're not super stoked when it takes 20, but if it only takes 20 seconds to buffer before your movie starts one time in a hundred, you're probably cool with that. You're not going to switch your subscription, right? And they'll very quickly say, yeah, sure, of course, because that's a lived human experience that everyone can associate with. And then they say, okay, cool, so why does our service necessarily have to be better than 99% reliable? You can tell stories, and I think that's the best way to convince people.

[00:17:54] Alex Hidalgo: It turns out humans don't expect perfection. They're actually fine with occasional failure. I think tech uniquely tries to demand perfection out of its services and out of its users. But you can pretty quickly appeal to people at a base level if you tell them a story that they've experienced. If you can say, here's an example of failure: when you went to a restaurant and it took too long for your dish to come out, did you never go back to that restaurant again? Probably not, unless it happened every time you went. And that's how you can explain to people: look, a hundred percent is not possible, so let's aim for something more reasonable.

[00:18:31] Adam Hawkins: Yeah. And that also speaks to the other aspect of this, which is really a different philosophy for thinking about reliability. One book that really changed my perspective on this was the first edition of Release It!. When I first read it, I was moving away from just focusing on building user-facing applications and toward platforms, kind of moving backward in that direction, right? And one of the things mentioned in that book was something like, well, there are all these different outcomes for what can happen when you make, say, a request to an external network service. It can succeed, it can fail with a known error condition, it can time out. There are all these different things. You have to think about that as the engineer who's writing the code, but you also have to communicate that up to the product owner, to ask: hey, how should the product behave in these particular failure modes? So that you can create a product that fails in a way users are okay with, instead of just, you know, exploding or falling over. Right?

[00:19:36] Adam Hawkins: And I think this also speaks to your point about not having a hundred percent. We both agree that a hundred percent is not possible, but you have to accept that and then plan for all the ways failure can happen, so that you can recover and act accordingly. I think this also speaks to one of the other hard problems, and I'd like to get your take on this. You can start with an SLO; you pick a number, right? We both agree that it's a percentage, so, you know, you pick say 99%. But then what do you do? The threshold by itself is not enough; this is where error budgets and all this stuff come in. So let's assume, for the sake of conversation, that we have them, they're defined, and then something goes wrong. You have this error budget and, you know, whatever. How do you alert somebody, like, hey, this SLO is not behaving as we expect, look at this? What is so hard about that?

[00:20:34] Alex Hidalgo: I think there are two primary things at play there. The first is that it can be difficult to get to an SLO with an error budget that is accurate and trustworthy enough that you can just say, this is literally how we page now. That is one way that you can alert people: literally alert them if it looks like you're burning through your error budget at a rate quicker than you could possibly recover from...

[00:21:01] Alex Hidalgo: ...without human intervention. That is part of the story, but getting there is very, very, very difficult. And what I really like to constantly drive home to people is that, at the end of the day, these are just different measurements to allow you to make different decisions. They should never be mandates. Sometimes you burn through your error budget ten times over because of what you know is kind of a black swan event. I'm not saying you shouldn't take action if someone drops the production database, but if someone drops the production database and it takes three hours to get it back up, yeah...

[00:21:35] Alex Hidalgo: ...like, okay, that's fine. You're going to learn lessons from that. But that doesn't mean your entire organization now has to go, oh, we're not shipping features, and blah, blah, blah. No. Okay, so you deal with a month of really bad error budget status. That's fine. I think one of the best ways to use error budgets is really just as a conversational tool. If you have a weekly sync, make it part of your agenda. Let's look at all of our SLOs; let's look at our error budget status. We're good, we're good, we're good. Okay, that one looks kind of weird; maybe we should assign someone this week to at least look into it. Maybe it's just an afternoon. Maybe it turns into, oh crap, we've got to bring three people in, we really have a problem. But I think that's how this data works best. One of the problems is that error budgets were presented pretty simplistically in the first few SRE books. They were presented as: measure your service with SLIs, set the target with SLOs, and when you're out of error budget, stop shipping features and focus on reliability. That was the only story that was told, and that's not how I've experienced this whole process working best. The way I've experienced it working best is that it's better data you can use to have better conversations.

[00:22:50] Alex Hidalgo: What should we be focused on? Is this okay? Is this not? Maybe if you're burning your error budget, it's actually because your SLI measurements are bad, you know? It doesn't mean you have to take action. That's one of the big hurdles for a lot of people: getting over that kind of initial definition that took place in the first two Google SRE books.

[00:23:12] Alex Hidalgo: And just realize that the important part is the SLIs. The important part is: are you measuring what your users actually care about? That target, that error budget window, how much you've expended, all of that is useful data that you don't have to compute later, because you're continuously doing the math.

[00:23:32] Alex Hidalgo: And again, sure, once you get to a very mature point, you can also alert off of it and all that kind of stuff, but really it's just some pre-done math. The important part is the SLIs. Are you thinking about your users? Are you measuring the things that actually matter? SLOs and error budgets just give you better data to reason about that.
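To put rough numbers on the burn-rate paging Alex described earlier: a 99.9% target over a 30-day window allows about 43 minutes of total badness, and a burn-rate alert fires when the budget is being consumed many times faster than the on-pace rate. A hypothetical sketch, where the 14.4x threshold is one of the example values from the Google SRE Workbook's alerting chapter, not a universal constant:

```python
# Hypothetical burn-rate math; all numbers are illustrative only.

WINDOW_DAYS = 30
SLO_TARGET = 0.999

# Total allowed "badness" in the window, expressed in minutes.
budget_minutes = WINDOW_DAYS * 24 * 60 * (1 - SLO_TARGET)
print(f"{budget_minutes:.1f} bad minutes allowed per {WINDOW_DAYS} days")
# -> 43.2 bad minutes allowed per 30 days

def burn_rate(recent_error_ratio: float, slo_target: float) -> float:
    """Multiples of the 'exactly on budget' pace.

    1.0 spends the budget precisely over the full window;
    14.4 spends a 30-day budget in roughly two days.
    """
    return recent_error_ratio / (1 - slo_target)

# Suppose 1.44% of requests failed over the last hour:
rate = burn_rate(recent_error_ratio=0.0144, slo_target=SLO_TARGET)
if rate >= 14.4:  # example threshold from the SRE Workbook
    print(f"page a human: burning {rate:.1f}x faster than budget")
```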

[00:23:51] Adam Hawkins: Hmm, okay. So one thing that's continually come up for me in my conversations about SLOs and implementing them is the hierarchy between SLIs, SLOs, and SLAs. We're not really concerned about SLAs for this conversation, but you know, you may have, say, 10 SLIs, and each of those SLIs may not necessarily have an SLO. And I think that's okay. What do you think?

[00:24:18] Alex Hidalgo: Yeah, 100%. Well, I say nothing's ever a hundred percent, so maybe I shouldn't use that exact phrasing, but yes, I believe that having SLIs, service level indicators that link to user journeys, that link to KPIs, that link to all the things that actually matter about your service, about your business...

[00:24:37] Alex Hidalgo: ...that's what's important. You can use these target percentages and time windows and all that to think about that data in a slightly different way. But yeah, if people take anything away from the book, I hope it's just: think about the user first, and you do that by measuring the correct things. Who cares about your error rate, who cares about your latency, if your users don't care about it? One of my favorite examples, and this is a thing I've seen repeatedly over the course of my career, is that the database team will get paged at 3:00 AM if there are too many errors between the primary web app and the database, but it turns out this isn't being exposed to users, because the web app was written with decent retry logic or something along those lines. The point is, none of this is ever exposed to the actual users. So should you be paging on that? Probably not. Should the database team eventually look into why there are so many errors? Yes, probably. And the web team should probably be brought into it as well, because maybe their queries are bad, blah, blah, blah.

[00:25:42] Alex Hidalgo: But that shouldn't be a page. It shouldn't be a priority if the user experience is fine. That's the only thing that constitutes an incident. That's the only thing that constitutes something worthy of waking someone up on a weekend. Think about things from that perspective. And yeah, that's a very long-winded way of saying, yes, I do agree. I love SLOs, I love error budgets, I think they can provide you with great data, but at the end of the day, it's the SLIs that matter. Are you measuring what the users care about?

[00:26:13] Adam Hawkins: Yeah, that's so true. I want to circle this back to the high-level idea of DevOps, which is: focus on the customers and the value you provide to them, and find a way to measure that at the beginning and at the end. That's ideally where your SLIs should be. Measuring somewhere inside of that whole value stream, focusing on, say, an individual service or some small infrastructure component like a job queue or some web service, okay, I hear that it's useful, but that should come after you have the high-level SLI across the whole thing. Inner measurements will allow you to further analyze what goes wrong when the top-level number goes wrong. So the way I think about this, and I'm curious what your take is, I kind of see it as a pyramid. I think you're probably familiar with the idea of a testing pyramid, with more unit tests at the bottom.

[00:27:04] Adam Hawkins: Those are the fastest ones, and you have the most of them. As you go up the pyramid, you have fewer tests; they cover less stuff, but they take longer to run, that type of thing. So the way that I think about this is: you should have a lot of SLIs. You can use those for everything. You'll have some of those with SLOs. And some of those might, to use the language from the SRE book, and I think I mean this right, have alerts or have tickets. Alerts mean do something immediately; tickets mean do something eventually, you know, look into this. And then, if you need it, you might have SLAs with other teams or other people, you know, whatever. But focus on getting the bottom layers right, and then scale up the pyramid as you gain confidence in the lower levels. Does that sort of map to how you think about it?

[00:27:51] Alex Hidalgo: Yeah, that's exactly how I always explain it to people. Start with SLIs. Start with thinking about what your users need, and figure out how to measure that, because often that's the most complicated part. To be honest, picking some target percentage, one that often involves the number nine and things like that...

[00:28:07] Alex Hidalgo: ...like, you can do that. But getting to the point where you can actually measure what your users need is often complicated. You may need new telemetry. You may need to edit your code a bunch. You may need to spin up entirely new services just to watch your services. That takes the most effort, but it's the most meaningful part of all of it.

[00:28:27] Alex Hidalgo: So yeah, develop your SLIs first. And in fact, I generally tell people, unless you're on some kind of incredible deadline, where again leadership is demanding we do SLOs this quarter, the best way to pick your SLO target, what your percentage will be, is to measure your SLI for a very long time: at least 30 days, or a quarter if you can. Measure what your service actually looks like. Once you have that telemetry, once you have that data available to you, measure it for a while, and then look back on it and say, okay, here's where we knew people were happy; here's where we knew people were unhappy. Let's do some basic statistics. The math is not that difficult to figure out what our threshold should be.

[00:29:11] Alex Hidalgo: Then you can set some thresholds, and that gives you some pre-computed math, so next time around you don't have to go back and look at 90 days of data and say, where were we good, where were we bad? Your target gives you some extra data to understand how things are right now, and then your error budget window gives you a little bit more data about how you performed over X time, and whether you're okay with how you performed over X time. But yeah, those are just abstractions on top of measuring the correct stuff.
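As a toy illustration of that "measure for a while, then do basic statistics" step, here is a hedged sketch. The daily SLI values are invented; in practice they would come from your telemetry:

```python
# Toy sketch of picking a first SLO target from observed SLI history.
# These daily values are invented; real ones come from your telemetry.
import statistics

daily_sli = [0.9991, 0.9987, 0.9995, 0.9978, 0.9993, 0.9989]  # ...30+ days

print(f"median: {statistics.median(daily_sli):.4%}")
print(f"worst:  {min(daily_sli):.4%}")

# If users were happy across this whole period, a defensible first
# target sits at or just below the worst observed day, so the
# objective describes reality rather than aspiration.
candidate_target = min(daily_sli)
print(f"candidate SLO target: {candidate_target:.2%}")
```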

[00:29:40] Adam Hawkins: Yeah, definitely. Maybe you have a different take on this, but I think that's actually where the quote-unquote hard problems of SLOs are: being able to measure something at the beginning and at the end, especially as that spans across other systems, other smaller technical systems. You need to have some high-level, end-to-end measurement of a particular flow, or whatever we're going to use as the SLI. So, you know, you'll need something like real user monitoring in the browser. If you have mobile apps, you have to actually report data from the mobile apps back to somewhere, to be able to correlate that with everything else, so you can say, hey, user signup on platform X is not working, and then trace backward from there. The reason I say that's the hard problem is: A, as a technical problem, making sure the proper tools are in place; B, the people problem, getting everybody aligned to say, hey, we need to actually do this; and C, doing all the technical work to actually implement it correctly. And once you have that, then you can actually just start the journey.

[00:30:37] Adam Hawkins: Like you said: hey, we have this number, let's watch it for at least a month before we even do anything with it. And you know, people want stuff now. They're like, oh, we just added it, let's look at it, let's use it, let's ping somebody, let's alert on this. I think it's such good advice to measure, wait, and analyze. So that's my take on where the real hard problem is. What do you think is the hard problem with SLOs?

[00:31:03] Alex Hidalgo: No, I think I agree. I would frame it slightly differently, in the sense that, yes, getting the SLIs properly configured and measuring the right things is hard, especially if you want to, yeah...

[00:31:18] Alex Hidalgo: ...let's talk about a web app again. You've got to somehow trace all the way back to the user's browser, and you've got to take into account the quality of consumer internet connections and things like that. Yes, that's hard. But the harder part is getting people to understand that you don't have to get there. SLO-based approaches...

[00:31:39] Alex Hidalgo: ...again, they're a different way of thinking about things. In a perfect world, you have complete tracing: spans that stretch from the mobile app on someone's phone, that additionally report back whether they're on a 3G connection or a 4G connection or LTE, and how this traverses the internet and hits your load balancers and goes into your service, all the way down to the exact database call required. But, you know, can you get there? Is anyone really there?

[00:32:10] Adam Hawkins: Does it even matter? That's another question.

[00:32:13] Alex Hidalgo: Right. The point is to get a good enough approximation. Nothing's ever perfect is a general tenet; there's a paragraph about that in the book, actually, right? Nothing's ever perfect. That means your SLIs and SLOs are not going to be perfect either. Your measurements are never going to be perfect, and that's fine. They just need to be good enough. So while we can sit here and hypothesize about what a perfect measurement, what a perfect set of telemetry, could be to help you measure the user experience in the best possible way...

[00:32:50] Alex Hidalgo: ...you know, the alternative I'll give you: when I was at Squarespace, I was on the team that was responsible for our general ELK stack. And it was the busiest service, because every other service, anything it did, sent data to ELK, right? Everything was logs, and it was incredibly busy, and we wanted a meaningful SLO, or I should say a meaningful SLI. And we realized it was too expensive for us to build something that would insert log lines and then retrieve them to get some kind of end-to-end indexing time. So, in sum, we realized that we could just divide the Kafka queue size by the indexing rate, and that gives you a close enough approximation of what the end-to-end latency actually is, right? How many messages are in the queue divided by how many messages are being indexed per second. Does that give you an exact measurement? No. But was it good enough? Absolutely. We used it to bring our service to great levels of reliability that ended up making people really happy, until things broke again.

[00:33:57] Alex Hidalgo: But that's a separate story. The point is that you can approximate. As long as you're thinking about your users, you don't have to capture the entire journey, because that's almost impossible, right? If anyone listening to this is actually using tracing data from the user's client browser all the way back to the database and all the way back to render time, and has SLOs set off of that, and is calculating error budgets off of that, if anyone's actually gotten to that, let me know, because it is a dream and I've tried to get there. But again, you don't have to. You just have to be good enough.
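The ELK approximation Alex describes is, in effect, Little's law applied to the indexing pipeline: items waiting divided by throughput estimates the wait time. A sketch with invented numbers:

```python
# Sketch of the "good enough" SLI Alex describes for the ELK pipeline:
# estimated indexing delay = Kafka queue depth / indexing rate.
# (Essentially Little's law: wait time = items queued / throughput.)
# Both inputs are invented here; in practice they come from metrics.

queue_depth_msgs = 1_800_000    # messages waiting to be indexed
indexing_rate_per_sec = 25_000  # messages indexed per second

estimated_lag_sec = queue_depth_msgs / indexing_rate_per_sec
print(f"~{estimated_lag_sec:.0f}s behind")  # ~72s behind

# Not an exact per-message measurement, but close enough to serve as
# an SLI without building an expensive write-then-read-back prober.
```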

[00:34:40] Adam Hawkins: Yeah, that's true. So I've got two questions for you before we go. We've been talking a lot about, hey, you already have SLOs and you're operating in this place, and the ways it can go wrong along the way.

[00:34:53] Adam Hawkins: And you mentioned earlier conveying the message through stories. So I was wondering if you could share with the listeners a story from your experience, or what you think the ideal path would be. Let's say you don't have any SLIs and you want to start; you need to do A, B, and C.

[00:35:13] Adam Hawkins: What does that journey look like, through to your first useful SLO?

[00:35:20] Alex Hidalgo: So I think the starting point, the way to do this best, is first to realize that chances are the current metrics, the current data that you have about your service, aren't telling you as much as you think they are. Sure, availability metrics or latency metrics or even error rates can tell you something, but they're not telling the full story, because you could be serving requests very quickly with zero errors, but if the data isn't the data that's being requested, then you're not doing what you're supposed to be doing. So step one is understanding what reliability actually means. And in tech, even in 2020, reliability is unfortunately too often conflated with availability. Are you there? Does this service respond at all? But that's not what reliability means. Reliability means: is this thing doing what it's supposed to be doing? That's true outside of tech, too. It's not like SRE at Google came up with the concept of reliability.

[00:36:19] Alex Hidalgo: It's a word that's been in the English language for centuries, and reliability engineering itself has been around since the forties as a formalized concept. So I think step one is learning what reliability actually means. And once you understand what reliability means, then you can go look at your service and say, okay, cool, I have these availability metrics or these error rates or, you know, latency of incoming requests, but what do those mean in terms of whether we're being reliable or not? And that's often where you get to the point of actually checking for data correctness and, you know, considering a multi-step journey. With multiple components often at play, you often can't measure things from just one spot. That doesn't necessarily mean you have to have black box monitoring, but at the very least, it often means you need to combine white box metrics from multiple different sources and synthesize those. So that's where that gets you, right? Step one is understand what reliability means in the first place.

[00:37:23] Alex Hidalgo: Step two is actually start measuring your service in those terms, because you probably are not. Step three is, once you're measuring your service correctly, watch it for a while, as we were talking about earlier. And then step four is pick the right target, one that keeps you and your engineers happy. You'll never be able to hit five nines, right? Like, no one actually does that. I know some Google and Amazon services promise nines like that in their SLAs; they generally either don't hit it, or it's a kind of unquantifiable target. Not to go on too much of a tangent, but I think both S3 and GCS promise 13 nines of data durability.

[00:38:03] Adam Hawkins: How do you even know, as the consumer of that thing? Are you checksumming your data when you read it back? Are you measuring that as the consumer of this? I don't think so.

[00:38:15] Alex Hidalgo: That's exactly my point. People don't actually hit those numbers. So pick one that doesn't cause so much toil on your team, or so many pages, or such a quick response time, that they get burnt out. And make sure that target also keeps the user in mind. Sure, it's great if your engineers are happy because they're never getting paged, but it's not great if your users are unhappy because maybe they should be getting paged. You know, that's how you go on this journey. You take it one step at a time, and it's not something you're ever finished with. You can always iterate. I think what people often forget, especially when they're first getting started, is that SLOs are explicitly not SLAs. They're not agreements; they're objectives. They can change. Change them whenever you need to. As long as whatever you're setting them to isn't suddenly making one side or the other unhappy, the users or the engineers, or even product or business, or whoever, then feel free to change it to fit the realities of the world.

[00:39:18] Adam Hawkins: All right, yeah, great advice there. Just to tack on one thing at the end about changing them: in the beginning, if they're too painful, make them less painful, because if they're too painful, you won't stick with them. So you need to back off the pain, get used to it, and then ramp it up. It's like lifting weights, like training, any type of exercise. You can't go from zero to a hundred. You have to work in increments to get to the point where, yeah...

[00:39:46] Adam Hawkins: ...hey, I'm at this sort of promised land, this ideal place. So that leads me to my next question, which is: okay, we've gone through this journey, we have these numbers. This is me coming from the SRE side, or kind of the opposite end, which is: okay, we need telemetry from all these different sources, and of all the tools I've used, none of them are sufficient in my mind for collecting, analyzing, and summarizing all of it; heaven forbid you want to query and relate different data streams or do something like that.

[00:40:21] Adam Hawkins: So given that you have far more experience with this, what is your ideal stack or preferred approach for collecting telemetry and being able to turn that telemetry into SLIs, and then into SLOs, and ultimately, at the top level of the pyramid, alerting somebody when things are going really wrong?

[00:40:41] Alex Hidalgo: I think there are a few different answers there. I do miss Monarch. Google's Monarch is an incredibly powerful, mostly metrics, collection system, and its query language is something I miss a lot. And, I mean, I was on the sister team that was responsible for it, and I know it was often a headache to actually maintain, so it's definitely not a perfect service in that sense, but I will say, I do miss Monarch. It's a phenomenal product. It backs Stackdriver now, though perhaps not in every way yet; I don't know, I've been gone from Google for two and a half years now. But it is a brilliant product that I miss. I will say there are a few observability companies out there that are actually getting close to what you were describing: being able to ingest telemetry from many different sources and kind of slice and dice it. Lightstep and Honeycomb are both phenomenal tools, and I would highly recommend people check them out. But at the end of the day, I do agree; I don't think anyone out there is doing the right analysis portion of it yet. There are lots of tools you can use to help you get better insight into your services, better observability into your services, and there are many great choices there. And there are many great choices for helping you troubleshoot problems in real time. That's, for example, where I really love Lightstep and Honeycomb.

[00:42:12] Alex Hidalgo: And while they're both working on it a bit, they don't quite yet have the ability to really measure and analyze your SLOs over time. You know, it's all well and good if you can build an error budget status based on a web service that does tens of thousands of requests per second...

[00:42:38] Alex Hidalgo: ...or even per hour. But what if your service only has four data points per hour? There are ways to solve for this. There are statistical techniques you can use, distributions you can fit; there are all sorts of things you can use to account for the probability. And that's what's not really out there yet. And, if I can plug it, that's what we're trying to do at Nobl9. We're trying to build a platform that you can send data to from literally any data source. I currently have a spreadsheet where I think I have 50 listed. I don't know how the dev team feels about that, but I would love for us to eventually be able to ingest data from anywhere.

[00:43:18] Alex Hidalgo: And, you know, there are lots of tools that give you good insight in terms of troubleshooting a problem right now. I don't see anyone who's properly allowing you to report on your reliability over time, to really understand how your users have thought about you over time, and to understand the business impact of that reliability.

[00:43:39] Adam Hawkins: That's so true. I can't agree with that more, because a lot of the tools are good at exactly what you said, which is real-time analysis of problems, you know, correlating time series data, and you can infer or decide whatever you want from that. But then you have this number, and you need to look at the summary of that number over time.

[00:43:59] Adam Hawkins: And this is where, I don't know what your experience has been, but I've been the guy responsible for copying and pasting that number from one tool into a spreadsheet so it can be piped into some sort of biz-ops tool, you know, so they can compare this month over that month and this year over that year.

[00:44:13] Adam Hawkins: So it would be wonderful if there was a tool that could actually, you know, summarize the SLOs over time and handle all of the alerting and monitoring of that stuff, because that's the second half of the promised land. Once you get there, you want these things to really have primacy in your team: if they're going wrong, the business is going wrong, and therefore somebody should take action. That's the top level. I think that's the promised land of SLOs.

[00:44:41] Alex Hidalgo: Yeah, that's exactly it. You need to be able to tie this data directly to business decisions. What did our business look like when our reliability was X versus Y? Were we losing customers? Did we have more signups? Did everything stay steady? And maybe steady is fine; maybe steady is bad. Maybe we should be growing very quickly. Maybe this is actually a deprecated product and we're trying to get people to switch to the other one, so actually we want usage to be going down, right? There are so many potential avenues for how to look at this data and use it and tie it back to KPIs, tie it back to your business performance. And that's the part that I think is currently still kind of missing.

[00:45:20] Adam Hawkins: Yeah, well, Alex, thank you so much for coming on the show and sharing a small batch of SLO education. I appreciate it. Is there anything you would like to leave the listeners with before we go?

[00:45:33] Alex Hidalgo: Well, I'd be remiss if I didn't mention: go check out my book, Implementing Service Level Objectives, published by O'Reilly Media in early September. It was a lot of work, especially with a global pandemic hitting right in the middle of it, and there were times when I got sick of the words because I looked at them so much. But with a new perspective, a few months out from the real end push of getting everything done, I can say I've reread them, and I'm incredibly proud of them.

[00:46:04] Alex Hidalgo: I think it's a good book, and I think it can make a lot of people happier. I think it can make a lot of people's lives easier. And that's really my only goal with this.

[00:46:12] Adam Hawkins: And you mentioned Nobl9. Is there anything online people can go to, like a demo, to see what this is about?

[00:46:19] Alex Hidalgo: Yeah, there's no public demo yet. The company is still pretty early in the process, but if you go to nobl9.com, there's a contact form. We're happy to demo things for anyone at any point in time, but it's kind of ad hoc at this point. We're hoping to have something more accessible and public in just the next few months. But for now, just go to nobl9.com, or feel free to find me on Twitter, @ahidalgosre. My DMs are open for all; feel free to send me a message if you have any questions about SLOs.

[00:46:53] Adam Hawkins: Yeah, and all of this, of course, will be linked in the show notes. And I would also be remiss for not offering sincere congratulations on finishing the book, because it really is an amazing book that covers all of this from ground zero all the way up to the very top.

[00:47:07] Adam Hawkins: I was honestly surprised by how well done it was and how much it covers. I was surprised by how long it is, but that's because that's how much there is to communicate about this topic, right? So, for the listeners, if you're thinking about reading this book, do not be discouraged by the length. It is an amazing book that covers everything you need to know, and if you want to get started with SLOs, this is an amazing place to start. I wish I would have had this book four or five years ago.

[00:47:32] Adam Hawkins: So Alex, thank you so much for writing the book, thank you so much for coming on the show, and to the listeners, thanks for listening.

Alex Hidalgo: Thanks so much. I had a great time.

[00:47:42] Adam Hawkins: That wraps up this batch. Visit smallbatches.fm for the show notes. Also find Small Batches on Twitter and leave your comments in the thread for this episode. More importantly, subscribe to this podcast for more episodes just like this one.

[00:47:56] Adam Hawkins: If you enjoyed this episode, then tweet it, post it to your team Slack, or rate this show on iTunes. It all supports the show and helps me produce more Small Batches. Well, I hope to have you back again for the next episode. So until then, happy shipping.

[00:48:15] Adam Hawkins: Are you feeling stuck trying to level up your skills at shipping software? Then apply for my software delivery dojo. My dojo is a four-week program designed to level up your skills building, deploying, and operating production systems. Each week, participants go through theoretical and practical exercises, led by me, designed to hone the skills needed for continuous delivery.

[00:48:36] Adam Hawkins: I'm offering this dojo at an amazingly affordable price to Small Batches listeners. Spots are limited, so apply now at softwaredeliverydojo.com. Like the sound of Small Batches? This episode was produced by Pods Worth Media. That's podsworth.com.
