Book Review: The Unicorn Project

These are my notes from Gene Kim’s book The Unicorn Project: A Novel about Digital Disruption, Developers, and Overthrowing the Ancient Powerful Order. I enjoyed this book and recommend it.

I delayed reading this book for a while. From the title, I thought it would be about rapidly growing startups, and I wondered how that would be relevant to my work at a megacorp. It turns out the “Unicorn” name is firmly tongue-in-cheek: this is about a mature organisation struggling in a new world.

Summary (spoilers)

A 100-year-old auto parts retailer is being disrupted by new digital competitors and threatened with a private-equity takeover. Lead times for new initiatives are astronomical because of organisational silos and mismatched incentives between Dev/QA/Security/Ops, huge handover times, lack of CI / build docs, integrating changes only at the end of projects, large coordination costs across many teams to get anything done, and conservative approval processes that reject initiatives to move faster.

A scrappy team of developers turns around the fortunes of the company with a mix of:

  • Skunkworks projects that outrun other teams, building confidence that the company can move quickly
  • Asking forgiveness for breaking some rules
  • Focussing heavily on the problems at the highest level of the company
  • Which gets executive support for this scrappy team to ignore some of the bureaucracy
  • Working with QA, embedding QA on the team (with the assistance of a helpful QA manager), training QA to write automated tests and freeing up time for exploratory QA
  • The dev team running their own cloud infrastructure and holding their own pager, to route around an Ops team that isn’t interested in helping with fast deploys
  • The dev team taking on responsibility for security and compliance to route around a broken Security org that was unresponsive to business needs

In a way, the book is about pulling a cross-functional team together, focussed on a business need, so that team can move quickly. A review on the back calls it an “organisational civil-war novel” and I think that’s fair: the people running the separate Ops/QA/Security empires put up a lot of resistance, which is overcome with a mix of winning them over, out-executing them, and strong exec support.

In the end, the developers succeed: spinning up new data platforms, deploying with only 10-minute lead times, spinning down legacy infra to save money, and empowering data & sales teams to execute independently on company goals, which grows revenue and beats back the private equity firm.

The Business Novel

I’m not usually a fan of the “business novel” as a concept: most I’ve read fail as novels, and convey business ideas only with great verbosity. The Unicorn Project, though, was executed very well. The characters were relatable, talking like real engineers. There was an empathy for people across business functions and their pressures and incentives: people doing their best in often-crappy systems. For example, they don’t just blow up the QA org and automate away everyone’s jobs; they partner with QA, teach them how to automate tests, and get them doing more valuable exploratory work.

I particularly appreciated how the book toured almost the entire business: the C-Suite, Retail Staff, Engineering, QA, Ops, Security, and Sales. Few authors could pull this off, but Gene Kim’s made a career out of trying to see the entire business as a holistic system.

The Civil War

Why did the scrappy group succeed, and why did people dismantle their empires? The book shows a few ways to dismantle ‘old orders’ that prevent organisational agility:

  • The CEO is always repeating the company’s mission, that they exist to help their customers with their problems.
  • People appeal to this wider goal. But it only works if the manager is already bought-in and doesn’t care much about empire-building, or if the team has exec support to route around that manager’s team.
  • And they only had exec support because they showed they could move quickly and there was an organisational crisis. Perhaps this illustrates the importance of ‘not wasting a crisis’ and being ready for a crisis with the ideas for the change you want to implement.

Self-Teaming

People formed their own teams once they knew what the problem was. This is an interesting idea, though I’ve only infrequently seen it happen at my work. People needed freedom and management support to do this, and needed not to be strangled by big up-front roadmaps: you need slack time to explore new opportunities.

Core vs Context

There’s a bit where the business has to free up some money, fast. The lens they use is to separate the core of the business from the non-core bits (“Context”).

They outsource or reduce the non-core bits: HR systems, ERP, custom CI, custom onsite hardware. I’m wondering which bits of our development are non-core, and whether we could give them away or open-source them.

“The Five Ideals”

It’s a rule: Every business book has to have a numbered list of principles. I thought these principles were pretty good.

  • Locality and Simplicity – Changes shouldn’t need to coordinate many teams.
  • Focus, Flow & Joy – Work should flow, you shouldn’t have big delays.
  • Improvement of Daily Work – More important than the work itself, is improving how we do the work.
  • Psychological Safety – of course.
  • Customer Focus – this is mostly contrasted with empire-building.

The Innovation Lab

Towards the end of the book, the CEO sets up an incubator for new ideas that might help the business grow in the future.

They have an internal pitch battle, and funding for incubation that sits outside the normal organisational structure. There’s a recognition that new ideas need different organisational techniques from charted spaces. Oversimplifying, they call this Horizon 1 (bread and butter of business, well understood), Horizon 2 (adjacent new markets), Horizon 3 (totally new ideas).

They shut down projects that don’t show growth, and a project that works graduates from the innovation lab and gets embedded with the business. I find this an interesting recognition of the need for different strategies for teams at different levels of maturity.

Technical nuts & bolts

I really liked how they had live graphs of funnels and could debug rollouts live, even on a mobile app, all server-triggered. They could also trigger behaviour changes in their mobile app in real time. These are capabilities I could benefit from in my work.

Picking Nits: I’m Skeptical about Event Sourcing

The team turns a central “Data Hub” application, which interfaces with every database in the company, into a decentralised Event-Sourcing system that teams can integrate with, without bottlenecking on the Data Hub team implementing features.

I don’t have much experience with Event Sourcing, but it seems to be the practice of storing immutable events (usually in persistent-ish message-bus/queues like Kafka) and allowing systems to read/write the queue to roll-up/build their own state (cache) of the state of the world.
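
To make that description concrete, here is a minimal in-memory sketch of the idea, not the book's design or any real broker's API. The `EventLog` class stands in for a persistent log such as a Kafka topic, and the event kinds (`stock_added`, `stock_removed`), the `StockView` reader, and the `brake-pad` SKU are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: events are immutable once written
class Event:
    seq: int   # position in the log
    kind: str  # hypothetical kinds: "stock_added" or "stock_removed"
    sku: str
    qty: int


class EventLog:
    """In-memory stand-in for a persistent, append-only log."""

    def __init__(self):
        self._events = []

    def append(self, kind, sku, qty):
        event = Event(len(self._events), kind, sku, qty)
        self._events.append(event)
        return event

    def read_from(self, offset=0):
        # Readers pull events starting at their own offset.
        return list(self._events[offset:])


class StockView:
    """One reader's private roll-up (cache) of the log."""

    def __init__(self):
        self.levels = {}
        self.offset = 0  # next log position this reader has not yet seen

    def catch_up(self, log):
        for event in log.read_from(self.offset):
            if event.kind == "stock_added":
                delta = event.qty
            elif event.kind == "stock_removed":
                delta = -event.qty
            else:
                delta = 0  # unknown kinds are ignored in this sketch
            self.levels[event.sku] = self.levels.get(event.sku, 0) + delta
            self.offset = event.seq + 1


log = EventLog()
log.append("stock_added", "brake-pad", 10)
log.append("stock_removed", "brake-pad", 3)

view = StockView()
view.catch_up(log)

log.append("stock_added", "brake-pad", 5)
stale = view.levels["brake-pad"]  # the view has not seen the latest event yet
view.catch_up(log)
fresh = view.levels["brake-pad"]
```

Between the third `append` and the second `catch_up`, the reader's cache is behind the log, which is the read-after-write gap the concerns below are about.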

I’m pretty skeptical of Event Sourcing:

  • Every reader has to maintain their own cache, and they’ll get it wrong.
  • If many systems can write, how do you coordinate them to use a coherent model?
  • The message bus becomes a single point of failure, and message buses don’t seem to have great uptime. Seasoned SRE lizthegrey@ notes:
The event stream becomes our communication, integration, and replication fabric to use for availability and consistency [ed: nononononno oh god no, SREs are crying]
This is outsourcing all your risk onto your event bus. Make note that Google Cloud Platform’s PubSub is 99.95% available, for instance.
  • What about read-after-write?
  • If you shard your queues, do you give up consistency/ordering?
  • If you’re storing all the events immutably forever, what’s your privacy plan for deleting data after a retention period?
  • What’s your GDPR plan for deleting users’ data?
  • People explain that when a service goes down, you can just re-start it later and re-read the messages, you don’t drop things. But what if those messages needed to update the system quickly? You’re trading off consistency (read-after-write) for the ability to buffer reliability failures. Eventual consistency can require a high price: handling inconsistent databases is a real pain.
  • Restarting a service after a bug and replaying the log through it works OK with stateless services, but what about services with side effects? Most services have side effects, and it’s hard to make requests idempotent even when they can be replayed hours or days later.
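
The replay concern in the last bullet can be sketched as follows. This is a minimal illustration under stated assumptions, not how the book's team or any particular broker handles it: `EmailSender`, the event IDs, and the addresses are hypothetical, and the consumer deduplicates by remembering which event IDs it has already handled.

```python
class EmailSender:
    """Hypothetical consumer whose side effect is sending an email."""

    def __init__(self):
        self.sent = []              # stands in for the real side effect
        self.processed_ids = set()  # assumption: a real system must store this durably

    def handle(self, event_id, recipient):
        """Apply the side effect once per event ID; return True if it ran."""
        if event_id in self.processed_ids:
            return False  # already handled, so a replay is a no-op
        self.sent.append(recipient)
        self.processed_ids.add(event_id)
        return True


events = [("evt-1", "a@example.com"), ("evt-2", "b@example.com")]

sender = EmailSender()
for event_id, recipient in events:
    sender.handle(event_id, recipient)

# Simulate replaying the whole log through the same consumer after a restart.
replayed = [sender.handle(event_id, recipient) for event_id, recipient in events]
```

Even this toy version shows where the difficulty sits: the side effect and the record of having done it are two separate steps, so a crash between them still produces a duplicate or a dropped email unless the two can be made atomic.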

I was heavily influenced by tef’s blogposts, from an Ops perspective, railing against message brokers and their lack of back pressure & read-after-write: How do you cut a monolith in half?:

Using a message broker to distribute work is like a cross between a load balancer with a database, with the disadvantages of both and the advantages of neither.
In practice, a message broker is a service that transforms network errors and machine failures into filled disks.

And tef’s Scaling in the presence of errors—don’t ignore them:

A message broker does not block the producer until the consumer can catch up. In theory, this means transient errors or network issues between components don’t bring the entire system down. In practice, the more queues you have in a pipeline, the longer it takes to find out if there’s a problem.
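
A small sketch of the back-pressure point, using Python's standard `queue` module as a stand-in for a broker (an assumption made for illustration; real brokers behave differently in detail). With no consumer running, an unbounded queue quietly accepts everything, while a bounded queue refuses new work once it is full, so the producer finds out about the problem immediately.

```python
import queue


def produce(q, n):
    """Try to enqueue n messages with no consumer running; return how many were accepted."""
    accepted = 0
    for i in range(n):
        try:
            q.put_nowait(i)  # raises queue.Full instead of blocking
            accepted += 1
        except queue.Full:
            break  # back pressure: the producer learns the consumer is behind
    return accepted


unbounded = queue.Queue()         # maxsize=0 means no limit
bounded = queue.Queue(maxsize=3)  # the limit of 3 is arbitrary

accepted_unbounded = produce(unbounded, 1000)
accepted_bounded = produce(bounded, 1000)
```

The unbounded case is the "filled disks" failure mode in miniature: nothing errors, the backlog just grows.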

To learn more, I watched Scott Havens’s DevOps Enterprise Summit 2019 talk, “Fabulous Fortunes, Fewer Failures, and Faster Fixes from Functional Fundamentals”, which this section of the book was based on. I get that data storage is cheap, but sometimes you really want to delete data, and I wonder how his immutable ledger deals with that. The talk only seemed to discuss the advantages, and didn’t touch on the problems I’ve listed above. Scott explains in the talk how at one point the queue went down and they could replay it, but I wonder what the price of this latency is.

I don’t want to dismiss Event Sourcing — some people are having success with it, and maybe it helps particularly to decentralise some pathological team structures. Gene Kim isn’t some architecture astronaut, and if he’s on board this probably isn’t completely wrong. Perhaps I should listen to Gene’s podcast with Scott Havens where they dive into the above talk more.

But for now, I see a lot of hype around Event Sourcing without discussion of the downsides, and that makes me wary.

Overall

I enjoyed this book, and I’d recommend it to anyone involved in organisational structure decisions: tech leads, senior engineers, managers, and particularly middle managers.

I’d read, and enjoyed, Gene Kim’s previous book The Phoenix Project, and I think this book was much better. There’s no need to have read The Phoenix Project first.

Mark Hansen

Sydney, Australia