Risks and Decisions in the Wilderness

I recently went on an adventure with NOLS. While most of the time was spent avoiding cactus spines and admiring the wonders of the desert, we did talk about leadership.

More importantly, we talked about decisions. And risk. When you're off the grid and a few days' walk from civilization, small choices can matter a lot.

On one of the more exciting days of hiking, the instructors presented a quick framework for thinking about those choices.

Consider the following scenario: you encounter a 1.5 meter drop over a boulder. You need to descend. Some options:

  • jump

  • climb down

  • sit down and slide until your feet reach the bottom. A "technical butt scoot" (tm)

We make a graph to analyze this choice. The X axis is the likelihood of a given outcome, from 0 to 100%. The Y axis is the worst consequence you can imagine from that tactic. Find a pointy stick and draw in the dirt; there’s no Google Sheets out here.

Scooting off the rock is pretty low consequence. You’ll get dirty and maybe tear your pants. If you jump you’re going pretty fast, and a bit out of control, so breaking an ankle seems possible. You could break an ankle while climbing down, but you're moving more slowly and carefully.

In general, the consequences of an action don’t change much with circumstances. The likelihood varies with the skill level of the team, whether it is dark, and whether people are tired. If you’re a comfortable climber, you might put the likelihood of a broken ankle even lower than this graph. On my trip we had an even mix of people who’d climb or scoot down these boulders. No one jumped, especially not wearing a heavy backpack.

Most people like to operate pretty close to the axes, either with a small likelihood of severe consequences or any likelihood of mild consequences. You’d choose to rip your pants with 100% certainty if that was the way past the obstacle.

You probably want to stay under this curve.
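
If you'd rather not draw in the dirt, here is a rough sketch of the same graph in Python. Everything numeric in it is my own assumption for illustration: the 0 to 10 severity scale, the likelihood guesses for each tactic, and the shape of the curve, which I've modeled as likelihood times consequence staying under a made-up `RISK_BUDGET`. The instructors never gave us a formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    name: str
    likelihood: float   # chance of the worst outcome, 0.0 to 1.0 (a guess)
    consequence: float  # severity of the worst outcome, 0 to 10 (made-up scale)


# Hypothetical curve: likelihood * consequence == RISK_BUDGET.
# Points on or below the curve count as acceptable.
RISK_BUDGET = 1.0


def under_curve(option: Option, budget: float = RISK_BUDGET) -> bool:
    """Return True if the option sits on or below the assumed risk curve."""
    return option.likelihood * option.consequence <= budget


# Illustrative numbers only; adjust for your team, the light, and how tired you are.
options = [
    Option("jump", likelihood=0.30, consequence=7.0),        # broken ankle, fairly likely
    Option("climb down", likelihood=0.05, consequence=7.0),  # broken ankle, unlikely
    Option("technical butt scoot", likelihood=1.00, consequence=1.0),  # ripped pants, certain
]

acceptable = [o.name for o in options if under_curve(o)]
print(acceptable)
```

With these guesses, climbing and scooting land under the curve and jumping does not, which matches how my group actually got down. Nudge the jump likelihood low enough and it becomes acceptable too, which is the point: the consequence stays put while the likelihood moves with circumstances.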

Back at my desk, the same framework applies to managing risk in software. SRE and resilience engineering are about moving complex systems so that they live under the safe part of the consequences / likelihood curve. We can make changes to systems and processes that make the bad outcomes less probable.

Suppose you’re trying to launch a whole new set of features for your cat picture sharing application. You could:

  • Take the site down, wipe all your servers and deploy the new application.

  • Build a second copy of the application and switch onto it all at once. aka “blue/green deploy”

  • Launch to a small fraction of your users and expand slowly.

Launching to a percentage of users is safe and low consequence. If this is a popular application, you’ll almost certainly see some media attention about the “leaked” new features. That’s as bad as ripped pants, maybe a bit embarrassing and uncomfortable. Doing a full site rebuild during downtime might cause a multi-day outage, and is probably not something your team has practice at. Switching onto a new system all at once might expose new bugs, but that is low consequence since you can easily flip back. Teams using blue/green deployments get very skilled at reverts.
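
For the curious, one common way to launch to a fraction of users is to hash each user ID into a stable bucket and turn the feature on for buckets below a threshold. This is a minimal sketch, not how any particular feature flag product works; the function name `in_rollout`, the feature label, and the 10,000-bucket granularity are all assumptions of mine.

```python
import hashlib


def in_rollout(user_id: str, percent: float, feature: str = "new-cat-features") -> bool:
    """Return True if this user falls inside the first `percent` of buckets.

    Hashing (feature, user_id) gives each user a stable bucket, so a user who
    sees the feature at 5% still sees it when the rollout expands to 20%.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999, i.e. 0.01% steps
    return bucket < percent * 100


# Expand slowly: the set of enabled users only ever grows.
users = [f"user-{i}" for i in range(10_000)]
for percent in (1, 5, 25, 100):
    enabled = sum(in_rollout(u, percent) for u in users)
    print(f"{percent:>3}% target, {enabled} of {len(users)} users enabled")
```

Rolling back is just lowering `percent`, which is part of why this option sits so close to the low-consequence axis.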

SRE and DevOps professionals generally agree that fractional user releases and blue/green deployments are acceptable risk management for this scenario. No one would recommend a release process that included a full site rebuild without a fallback plan.

Whether you’re in the bottom of a canyon or behind your desk, you have choices. The likely outcomes of those choices depend on your team and the systems around you. Stop and think before you jump.

Carla G