Saltside Chronicles #5: Retrospective

The fifth episode in a five-part series on the big bang rewrite completed at Saltside in 2014/15. This episode is a retrospective through the lens of the flow framework, with proposed mitigations.

[00:00:00] Hello and welcome. I'm your host, Adam Hawkins. In each episode, I present a small batch of theory and practices behind building a high-velocity software organization. Topics include DevOps, lean, software architecture, continuous delivery, and conversations with industry leaders. Now let's begin today's episode.

[00:00:26] Hello again, Adam here for the final episode of the Saltside Chronicles. This episode is a retrospective on the rewrite and all the events leading up to it. I won't retell the story in this episode, so don't listen to this one until you've heard the whole story in the previous episodes.

[00:00:49] This was the most difficult episode to write because I thought I had more to say about it. I initially planned to probe different areas of the organizational processes and the engineering department, but that felt like waffling on. So I went back to the drawing board. You're now listening to the fifth iteration of this script.

[00:01:09] It was frustrating to write the drafts, cut them up, and continually remix them, but more iterations made it clear that wandering through that analysis just diluted the key takeaway. The key takeaway is that the rewrite was avoidable, and I never, ever, under any circumstances want to go through anything like it again. It took an incredible human toll on all of those involved and put the business on pause until it was completed.

[00:01:39] So listener, if you're thinking of a big bang rewrite, then you are going the wrong way. Find another way. The big bang rewrite is a nuclear option that should be avoided at all costs. There's something enticing about a rewrite, though. It's an opportunity to right the wrongs and just do it right.

[00:02:02] Well, I learned the hard way that that's a bigger project than you ever expect. That was especially true at Saltside. The challenge for us was that it didn't start off like that, you know, but it snowballed into something much bigger and left us with no escape hatch. Then the long-project negative feedback loop kicked in.

[00:02:25] And if you've worked in the industry long enough, you know what I'm talking about. This is a situation where requirements change, or staffing changes, or priorities change, and that just makes it harder to complete the project. These changes make the project go longer, which then increases exposure to more kinds of changes.

[00:02:44] So it's impossible to break out of this feedback loop without cutting scope. Unfortunately, we fell for the siren song of trying to solve all the problems. That is, until we ruthlessly cut scope. That enabled us to decide on something that we could ship and then iterate on. That was the only way we were able to complete the rewrite. Without that, we would have endlessly discovered new product requirements, which would increase the scope, thus activating the negative feedback loop.

[00:03:19] I remember how optimistic I was about solving all the problems. That optimism eventually faded as the grind set in. Days turned to weeks and then months, and then eventually heavy over. It was amazingly stressful and exhausting. I cannot overstate that or exclude that aspect of the story. The initial joy of untangling the technical knot was replaced with the solace of simply not working on it.

[00:03:51] Everyone felt immense relief when we launched the first market on the new product: the web app, the mobile app, the iOS app, the Android app, the admin app, and of course all of the backend APIs to power it all. But more importantly, finally, there was just some space for something else in the day after all this time.

[00:04:12] I think that Saltside's story is unfortunately a common one. It may play out differently each time, but the setup is the same: early and continued focus on features without strong technical leadership creates a technical debt trap. The absence of strong technical leadership allows the tech debt to fly under the radar until it's too late. Teams may not have enough experienced engineers, which exacerbates the problem.

[00:04:40] If the company survives long enough, then it will fall into the trap. And Saltside fell face first into this trap. Now, early decisions did not help either. Building config into a database set one landmine. Relying on an untenable multi-tenant infrastructure set another landmine. The cardinal sin of coupling services to the database set even more. Continued use of half measures never addressed the underlying technical architecture problems.

[00:05:12] Engineers could not identify these fire signals or communicate them to management. Management could not identify why delivery slowed and defects increased. And the house of cards just collapsed with the introduction of a single new requirement that challenged all the assumptions baked into the system. Let me turn to the flow framework: the four types of work and the flow distribution metric.

[00:05:37] Recall the four types of work: features, defects, risks, and debts. They are mutually exclusive and exhaustive, so any work done by an organization fits into exactly one of these categories. The flow distribution metric represents the amount of each type of work in flight at a given time. Organizations need to modulate this metric

[00:05:58] according to the situation at hand. I recently had a conversation with Carmen DeArdo from Tasktop. He explained the four types of work like this: features and defects are "revenue generation"; risks and debts are "revenue protection." An organization can't forgo one or the other without negative consequences.
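As a rough illustration of the metric described above, here is a minimal Python sketch that computes a flow distribution from a snapshot of in-flight work items and groups it into the revenue generation and revenue protection buckets. The names `WorkType`, `flow_distribution`, and `revenue_split`, and the sample snapshot, are assumptions made for illustration; the episode does not describe any particular tooling.

```python
from collections import Counter
from enum import Enum


class WorkType(Enum):
    """The four mutually exclusive types of work from the flow framework."""
    FEATURE = "feature"
    DEFECT = "defect"
    RISK = "risk"
    DEBT = "debt"


def flow_distribution(in_flight):
    """Return the fraction of in-flight work items belonging to each work type."""
    counts = Counter(in_flight)
    total = sum(counts.values())
    if total == 0:
        # No work in flight: report zero for every type instead of dividing by zero.
        return {work_type: 0.0 for work_type in WorkType}
    return {work_type: counts[work_type] / total for work_type in WorkType}


def revenue_split(distribution):
    """Group a distribution into (revenue generation, revenue protection) shares."""
    generation = distribution[WorkType.FEATURE] + distribution[WorkType.DEFECT]
    protection = distribution[WorkType.RISK] + distribution[WorkType.DEBT]
    return generation, protection


# Hypothetical snapshot: a team working almost entirely on features and defects.
snapshot = [WorkType.FEATURE] * 6 + [WorkType.DEFECT] + [WorkType.DEBT]
distribution = flow_distribution(snapshot)
for work_type, share in distribution.items():
    print(f"{work_type.value:>8}: {share:.1%}")

generation, protection = revenue_split(distribution)
print(f"revenue generation: {generation:.1%}, revenue protection: {protection:.1%}")
```

Tracking this snapshot over time, rather than reading it once, is what lets a team see whether the split still matches the business situation.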

[00:06:20] Now consider the early stage startup flush with seed money. The prime directive is to just ship. Here, flow distribution is almost entirely on features and defects. Risks and debts are just not relevant at this point in time. This is appropriate because the risks and debts only matter if the company survives long enough to have to deal with them.

[00:06:43] Now, project farther into the future. Eventually the risks and debts must be dealt with. Mik Kersten explains these two items in Project to Product as work that protects the value stream's ability to continually deliver with fast flow. Carmen DeArdo called them revenue protection. Saltside never dealt with the debts and risks, which prevented them from developing the product in the long term and from protecting the revenue they were able to generate before the mobile app caused an existential crisis.

[00:07:19] When the time came, the business was on the line because management knew they could not compete without a mobile app. Instead of slowly and consistently paying back debts and dealing with the risks, the company completely stopped working on features and shifted flow distribution the other way

[00:07:37] completely. Imagine your company putting a stop-work order on new features for nine months. Could that even happen? What would it even take to get that to happen? Well, that's what happened at Saltside: compound interest on debts and risks, cashed in by an existential crisis. Systemic issues created this scenario.

[00:08:02] Now, a root cause analysis is moot in this case. There are too many interwoven issues to point to exactly one. Instead, I'll leave you with some practices that would have mitigated the various contributing factors. First, measure and track your flow distribution. This metric helps all participants in the value stream assess how resources are used and really just be cognizant of how that relates to the current business strategy.

[00:08:30] Second, prefer delivery and iteration over completing a 100% perfect project in one go. This was a big problem for Saltside. The culture emphasized shipping the best product to users. Who could disagree with that? Well, the problem was that Saltside shipped features as big bang releases. There were no feature flags or A/B tests.

[00:08:55] So 100% of product requirements had to be met once a developer deployed that code to production. This created a negative feedback loop where shipping something incremental was seen as incomplete and thus not worth it. As a result, projects took longer and became harder to complete, kick-starting the long-project negative feedback loop. Frankly, it's just better to do the opposite, or really just follow agile principles. Focus on delivering something to production as priority one. Control access to it with feature flags or really any other mechanism. Then measure it against whatever your success criteria are.
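To make the feature flag idea concrete, here is a minimal Python sketch of a percentage rollout: code ships to production, but only a stable slice of users sees it. Everything here is a hypothetical illustration, not something Saltside built; the flag name `new-listing-page` and the functions `bucket`, `is_enabled`, and `render_listing_page` are invented for the example.

```python
import hashlib


def bucket(flag_name: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag_name: str, user_id: str, rollout_percent: int) -> bool:
    """Return True when the user's bucket falls inside the rollout percentage."""
    return bucket(flag_name, user_id) < rollout_percent


def render_listing_page(user_id: str, rollout_percent: int) -> str:
    """Pick the new or old code path; both are deployed, only one is exposed."""
    if is_enabled("new-listing-page", user_id, rollout_percent):
        return "new"
    return "old"


# Hypothetical usage: expose the new page to roughly 10% of users.
for user_id in ["user-1", "user-2", "user-3"]:
    print(user_id, render_listing_page(user_id, rollout_percent=10))
```

Because the bucket is derived from a hash rather than a random draw, a given user stays on the same side of the flag between requests, and raising the percentage only ever adds users. That keeps any measurement against the success criteria comparable as the rollout grows.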

[00:09:37] Continue to iterate on this until you find something that may be released to the general public, or learn that your idea doesn't work and try something else. This is also known as hypothesis-driven development. Third, ensure that technical leadership maintains architecture and engineering quality standards.

[00:09:55] There is an inflection point where a single developer or even a single team cannot properly manage a shared code base. This is especially key in an early stage startup because the architecture is likely monolithic, and for good reason. One person, at a minimum, must be reviewing code, pairing, planning, and enforcing boundaries throughout the system.

[00:10:16] This person could be anything from a lead developer to a principal engineer. My point is that early investment in this capacity will pay dividends. Give these tasks to your most experienced engineer. I think this would have helped Saltside avoid the architecture decisions that ultimately trapped them. Fourth, just never skimp on building automated tests.

[00:10:40] Automated tests are the lowest layer of the software delivery pipeline. Software delivery success is predicated on the ability to transparently pass or fail a change with automated tests. I guarantee you that a product without automated tests will be lower quality than one covered by an automated test suite.

[00:10:59] Say a developer introduces a key feature without automated tests. In my experience, adding them later is orders of magnitude more difficult than adding them at the start, which in practice means that they never get added. This is because the code will change or the priorities will change. And this can bootstrap the broken windows model, where subsequent changes will also ignore tests.

[00:11:23] This virus can spread to the wider system if unchecked, and this connects back to the previous point about technical leadership. This person must be reviewing PRs to ensure that they are properly written and tested. If someone on the team does not know how to do that, then this person can teach them.

[00:11:42] Fifth, and perhaps most important in my opinion, was learning that technology is not the problem. I was naive when I joined Saltside. I sincerely thought that if we were just better programmers, then that would solve our problems. Oh, you know, I would say we really need to do Rails like this, or we need to do Ruby like this, you know, and forget JavaScript.

[00:12:07] That's just gonna be a pain for us. Boy, was I wrong. Now I think that technology is not the hard problem. People and processes are the hard problem, because people are the ones using the technology. Technology can't prevent technical debt from accumulating. That's people choosing to work around it or indefinitely punting on it.

[00:12:31] Technology can't prevent developers from coupling applications at the database. That's people choosing to work at a low quality level. Technology can't ensure sound architecture decisions. It's the people and processes that apply the technology, so look there before trying a technical solution. Now, to shift gears a bit: I produced the Saltside Chronicles to share my experience and knowledge with you.

[00:13:00] I also produced these episodes to give everyone who participated in the rewrite something out there in the world that they could point to and say, "I was a part of this." All the engineers who participated in the rewrite deserve credit and certainly a call-out on their resumes. So here's a shout-out to all of you.

[00:13:18] And I know that some of you do listen to this podcast. To all of you: I wish we could work together again, just never under the same circumstances. I am immensely proud of our technical achievements and what they meant for the business. I'm grateful to have been a part of the team. Now I want to spotlight some key people from my perspective: Terje Larson, Johannes Martinson, Anton Lindstrom, Valentin Gotzsche, Daniel Oscarsson, and Peter. Terje was my right-hand man in building the core service and many of the supporting services.

[00:13:57] Pairing with Terje was the most productive programming I've ever done to date. I jokingly referred to Terje as my linter. He kept code consistent and honest. He had a knack for keeping the entire code base in his head. So when the code changed in one area, he knew how to sync it with all the other areas.

[00:14:17] That ability made him a fantastic pair and an amazing code reviewer. On another note, I think that Terje is the god of dotfiles and shell configuration. Terje helped me see how I can improve the quality of my work by improving my workflow and sharpening my tools. So thank you so much for everything, Terje.

[00:14:37] Maybe one day we can write some fish and Ruby programs together again. Johannes was an amazingly productive engineer. He could take on a vague blob of requirements and turn it into working software. He took on the entire admin loop and just, and I mean just, sorted it all out. He's a true 10x engineer.

[00:14:59] Plus he wrote amazingly useful shell utilities for the platform team. He wrote a utility to find the IPs of all the instances in a given market, environment, or service, then open up an SSH connection in a tmux pane for each instance. Combine that with the synchronized panes feature in tmux, and we had a way to complete all kinds of maintenance.
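For readers who have not seen this pattern, here is a minimal Python sketch of the tmux half of such a utility: given a list of hosts, it builds the tmux commands that open one SSH pane per host and turn on synchronized input. This is not a reconstruction of the original shell utility. The function name `tmux_fanout_commands`, the session name, and the sample addresses are assumptions for illustration, and the instance discovery step is deliberately left out because the episode does not describe how it worked.

```python
import shlex


def tmux_fanout_commands(session: str, hosts: list[str]) -> list[list[str]]:
    """Build tmux commands that open one SSH pane per host with synchronized input."""
    if not hosts:
        raise ValueError("need at least one host")
    # The first host gets the initial pane of a new detached session.
    commands = [["tmux", "new-session", "-d", "-s", session, f"ssh {shlex.quote(hosts[0])}"]]
    for host in hosts[1:]:
        # Every additional host gets its own pane in the same window.
        commands.append(["tmux", "split-window", "-t", session, f"ssh {shlex.quote(host)}"])
        # Re-tile after each split so tmux does not run out of room for new panes.
        commands.append(["tmux", "select-layout", "-t", session, "tiled"])
    # Mirror keystrokes to every pane in the window.
    commands.append(["tmux", "set-window-option", "-t", session, "synchronize-panes", "on"])
    commands.append(["tmux", "attach-session", "-t", session])
    return commands


# Hypothetical host list; a real utility would look these up per market or service.
for command in tmux_fanout_commands("maintenance", ["10.0.0.1", "10.0.0.2", "10.0.0.3"]):
    print(shlex.join(command))
```

The sketch only prints the commands. Running them would mean passing each list to `subprocess.run` on a machine with tmux installed and SSH access to the hosts.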

[00:15:21] That program alone saved our team hours and undoubtedly unlocked capabilities we didn't even know we needed. Johannes inspired me to level up my shell programming game. So thank you, Johannes, for showing me the ways of the Force. Anton single-handedly saved the rewrite by forking Thrift and adding union support for Go.

[00:15:43] We hit a roadblock about a third of the way through the rewrite when we learned that a common Thrift IDL feature we used was not supported in Go. This really could have sunk the entire project. All of us thought that there was no way we could actually solve the problem inside Thrift. It just really seemed out of our reach.

[00:16:02] So we came up with a workaround idea to use a proxy instead. Well, in comes Anton, raising his hand and saying, "I think I can fix it. Give me a day." Well, turns out he did. Anton was just that kind of developer. He would contribute to any part of the stack that needed it and go about it humbly and confidently.

[00:16:25] Thank you so much, Anton. Valentin led the web team and Daniel led the mobile team. The three of us, plus the head of product and the CTO, formed the round table group that more or less steered the technical direction. The three of us collaborated so well on the core API that powered all of these applications.

[00:16:47] We trusted each other to design the API such that it was consumable by all clients and met all the product requirements. We were really focused on what was best for the business. The round table group's discussions often included productive conflict. Each of us had our own opinions and our preferred technical solution.

[00:17:09] This often led to amazingly productive debates. Sometimes we couldn't decide between the three of us and the decision went to the CTO. Well, whatever the decision, the three of us always got behind it and did whatever was best for the product and the user. It was my pleasure to work with both of you. Last,

[00:17:31] and certainly not least, I would be remiss if I did not mention Peter. Peter single-handedly solved the infrastructure and deployment problem. This was a massive problem. In hindsight, given that Peter was really the only person working on it, the results are truly impressive. Peter's contributions in other areas had a larger impact.

[00:17:52] And that's what I want to talk about. He turned me on to DevOps before I really even knew what it was. You see, Peter and I discussed the problems at Saltside for hours and hours. Peter brought the operations perspective. He approached the problem using lean principles, the theory of constraints, and critical paths.

[00:18:11] I brought the development side, with the focus on testing, automation, software architecture, and quality. Talking with Peter made me understand that we weren't really solving technical problems; we were really addressing business problems implemented with technology. These conversations informed our goals for the rewrite: one, create a system that protected business value over time; two, remove engineering from the critical path as much as possible; three, isolate systems for team and developer autonomy; and four, create streamlined teams supported by a compelling internal platform.

[00:18:46] Now, all of these things are discussed on the podcast, but we did them back then without a vocabulary to discuss them. It was just intuition for us. Little did I know that these conversations were the edge of a software delivery rabbit hole that I've been tumbling through to this day, now culminating with this podcast.

[00:19:07] Unfortunately, I have not been able to work with Peter in the same capacity since the rewrite. However, it is one of my long-term professional goals that we do work together again. And you should too, if you ever get the chance. I certainly hope so. Peter, if you're listening, great work and thank you for everything.

[00:19:26] I know I could not have done it without you. Now it's time to put a pin in this story. Let me recap. The Saltside Chronicles covers the story of a company that needed to launch a mobile app but couldn't because of technical debt, and then eventually turned to a complete ground-up rewrite to solve that problem.

[00:19:46] It was a solution that nobody wanted, but it ultimately worked out. It's a case study on software delivery and business. The results are simple. Avoid a complete ground-up rewrite at all costs. It will be harder than you think and take longer than expected. Instead, focus on incremental delivery and consistent technical investment.

[00:20:08] Remember Carmen DeArdo's framing: revenue generation and revenue protection. Pouring all resources into revenue generation means nothing if it's not sustainable. That wraps up this batch. Visit smallbatches.fm for the show notes. Also find Small Batches FM on Twitter and leave your comments in the thread for this episode.

[00:20:30] More importantly, subscribe to this podcast for more episodes just like this one. If you enjoyed this episode, then tweet it, post it to your team's Slack, or rate this show on iTunes. It all supports the show and helps me produce more Small Batches. Well, I hope to have you back again for the next episode.

[00:20:48] So until then, happy shipping.

[00:20:54] Are you feeling stuck trying to level up your skills deploying software? Then apply for my software delivery dojo. My dojo is a four-week program designed to level up your skills building, deploying, and operating production systems. Each week, participants will go through theoretical and practical exercises, led by me, designed to hone the skills needed for continuous delivery.

[00:21:16] I'm offering this dojo at an amazingly affordable price to Small Batches listeners. Spots are limited, so apply now at softwaredeliverydojo.com.

[00:21:28] Like the sound of Small Batches? This episode was produced by Podsworth Media. That's podsworth.com.
