Story points measure complexity — except complexity isn't one thing

The most common explanation you’ll get for why story points aren’t hours: “Story points measure complexity, not time.”

It’s technically correct. It misses the whole problem.

What “complexity” is hiding

Ask ten developers to define complexity and you get ten different answers — because the word is doing the work of four distinct concepts that teams routinely conflate.

Complexity — how tangled is the code? How many systems does this touch? How hard is it to hold the whole thing in your head and make a change without breaking something else?

Effort — setting complexity aside entirely, how much work needs to actually happen? The amount of code, the number of changes, the hours at the keyboard.

Risk — what external things could go wrong? Upstream APIs that behave unexpectedly. Another team’s branch sitting in a merge conflict with yours. A deployment that depends on a migration running cleanly in production.

Uncertainty — how much do we not know yet? Unclear requirements. A library nobody’s used before. A feature touching a part of the codebase that hasn’t been touched in two years and has no tests.

These don’t move together. A ticket can be technically simple and high-risk — a one-line change in the payment service. High-effort and low uncertainty — a lot of straightforward work on a well-understood component. Low-complexity and high-uncertainty — a simple feature no one has fully specced out.

The planning meeting where it breaks

Someone puts up a ticket and calls it a 5. Two people disagree — one votes 3, one votes 13.

The usual response is to average, or talk it out until convergence, or defer to the tech lead. But that’s the wrong move, because the disagreement almost always means two people were answering different questions.

The person who voted 3 was thinking about effort. The code change itself is small, contained, well-understood. Maybe forty lines.

The person who voted 13 was thinking about uncertainty. Nobody’s touched this part of the system since the original author left. The requirements changed twice in the last sprint. There’s an integration point with a third-party API that has a history of surprising the team.

They’re both right. They were just answering different questions — questions that got collapsed into the word “complexity” and then into a single number on a card.

The meeting ends with an 8. The sprint starts with a ticket everyone thinks they understand. The sprint ends with an explanation about why this one took longer than expected.

This is why teams can’t agree on story point estimates — not because people are estimating badly, but because the question has multiple valid answers depending on which dimension you’re thinking about.

What changes when you split the questions

When you vote on effort, risk, complexity, and uncertainty separately — the same scenario plays out differently.

Effort: 2. Everyone agrees on this. It’s a small change. Risk: 4. There’s an integration dependency and a deployment risk. Complexity: 2. The code is simple, just an unfamiliar corner of the codebase. Uncertainty: 5. Requirements aren’t stable and nobody’s worked in this area recently.

Now the conversation changes. It’s not “why do you think this is a 13?” — which puts the high voter on the defensive and usually ends with them caving. It’s “we’re aligned on effort and complexity, but you flagged high uncertainty — what specifically are you unsure about?” That question has a useful answer. A name, a dependency, a conversation that needs to happen before the sprint starts.

The vote spread stopped being a problem to resolve and became information.

Why “complexity” became the catch-all

Story points came from a useful intuition: software estimation based on absolute time is unreliable because developers have different speeds and the same developer is faster on familiar ground than unfamiliar. Relative sizing — this ticket is bigger than that one — is more stable. Switching to hours doesn’t fix this — you’re still compressing the same four dimensions into one number, just labelled differently.

The mistake was collapsing all the dimensions that make a ticket hard into one word. “Complexity” became shorthand for “hard for whatever reason” — and as a result, planning meetings became exercises in averaging different answers to different questions.

That’s not a calibration problem. You can’t fix it by running more retrospectives or asking the team to estimate more carefully. The structure of the question is wrong.

If you want to try this with your team — Estimate Well runs structured multi-dimensional estimation sessions where the team votes separately on effort, risk, complexity, and uncertainty before arriving at a final number. Free, no account needed. Share a link and you’re in.