Story points vs t-shirt sizing: you're still estimating one thing

The team switches from story points to t-shirt sizing. The argument is that numbers create false precision, while S, M, and L are at least honest about being rough. The first planning session goes smoothly. Everyone picks a size, there’s less debate, the meeting ends on time.

Three sprints later, the estimates are still wrong in the same ways. The large turned out to be a quick afternoon of work. The medium sat in QA for two weeks because nobody had flagged that it required a round-trip through a third-party API. The small was done in an hour but blocked for four days waiting on another team.

The format changed. The failures didn’t.

The problem isn’t the label

When teams switch from story points to t-shirt sizes, they’re solving a real annoyance: the fake precision of “this ticket is a 5 and that one is an 8.” Numbers imply measurement. T-shirt sizes feel more like gut-feel buckets, which is what estimates actually are.

But the failure mode isn’t the number. It’s the singular noun. One number, one size — both ask the team to compress what makes a ticket hard into a single output. And the things that make a ticket hard are not one thing.

A ticket can be a small in terms of implementation: the code change is trivial, you’ve written this pattern a dozen times. It can simultaneously be a large in terms of risk, because it touches the payment service in a way that cascades to three other systems if something breaks. A t-shirt size has to pick one of those. Whichever it picks, the other dimension is now invisible.

What disappears in the compression

The four things that don’t compress into each other:

Effort — how much engineering work this actually requires.

Risk — what external forces could make this go sideways. Dependencies, deployment sensitivity, proximity to live production data.

Complexity — how hard the code is to understand and navigate safely. The difference between a 50-line change in a well-documented module and a 50-line change in the authentication layer nobody has touched since the original author left.

Uncertainty — how much the team doesn’t know yet. Unclear requirements, an integration nobody has fully scoped, a definition of done that keeps shifting.

A ticket can be low-effort and high-risk. High-complexity and low-uncertainty. High-uncertainty and everything else low. Story points can’t distinguish between these profiles. T-shirt sizes can’t either. The label format doesn’t change what the compression loses.
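To make the compression concrete, here is a minimal sketch. It assumes a hypothetical 1-to-5 score per dimension and an averaging rule for picking the size; neither comes from any particular tool, and any single-label rule would lose information in a similar way. Three tickets with very different profiles all come out as the same size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Estimate:
    """Four independent scores for one ticket, each on an assumed 1-5 scale."""
    effort: int
    risk: int
    complexity: int
    uncertainty: int


def single_label(e: Estimate) -> str:
    """Collapse the four scores into one t-shirt size.

    Averaging with these cutoffs is an illustrative assumption,
    not a standard rule.
    """
    mean = (e.effort + e.risk + e.complexity + e.uncertainty) / 4
    if mean < 2:
        return "S"
    if mean < 3.5:
        return "M"
    return "L"


# Three different profiles (hypothetical tickets).
low_effort_high_risk = Estimate(effort=1, risk=5, complexity=2, uncertainty=2)
high_uncertainty_only = Estimate(effort=2, risk=2, complexity=1, uncertainty=5)
uniformly_moderate = Estimate(effort=3, risk=2, complexity=3, uncertainty=2)

tickets = (low_effort_high_risk, high_uncertainty_only, uniformly_moderate)
labels = [single_label(t) for t in tickets]
print(labels)  # every ticket gets the same size
```

The three `Estimate` values are distinct, but the labels are not. Once only the label is recorded, the payment-adjacent ticket and the vaguely specified one look identical on the board.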

This is also why switching from story points to hours produces the same estimation failures. Hours change the unit without changing the dimension count. T-shirt sizes change the label without changing the dimension count. Either way, the same information goes missing.

Where you see it

The t-shirt sizing failure shows up most visibly in sprint retrospectives. The team is reviewing what went wrong with a ticket that was estimated as medium but blew past the sprint. Someone says “the estimate wasn’t wrong, we just didn’t account for the integration testing.” Someone else says “we knew there was an external dependency, we just didn’t flag it.”

Both of those are dimension problems. The external dependency is a risk question. The integration testing belongs under risk or uncertainty, not effort. Both were visible before the sprint started; they just had no place to live in an estimation format that asks for one label.

The medium size answered the effort question accurately. It left the other questions unasked.

What actually changes

When a team votes on effort, risk, complexity, and uncertainty as four separate questions, the conversation that happens in estimation is different from the one that happens around a t-shirt size or a story point.

“This is a small effort but a high-risk deploy — it touches the payment flow, we should schedule it for the start of the sprint and give it a testing buffer” is not a conversation you have after picking a size. It’s the conversation that happens when risk is a first-class question with its own vote.

The final output can still be a story point or a t-shirt size. You can aggregate the four dimensions into a summary number if your sprint board needs it. But the conversation before that number contains the information that actually determines whether the sprint holds.
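One way this aggregation could look, as a sketch under stated assumptions: each team member votes 1 to 5 on every dimension, the per-dimension consensus is the median vote, the summary number is the sum of the four medians, and any dimension at or above a cutoff of 4 is carried along as a flag. The scale, the cutoff, and the names `summarize` and `FLAG_THRESHOLD` are all hypothetical.

```python
from statistics import median_high

DIMENSIONS = ("effort", "risk", "complexity", "uncertainty")
FLAG_THRESHOLD = 4  # assumed cutoff on an assumed 1-5 scale


def summarize(votes: dict[str, list[int]]) -> dict:
    """Reduce per-dimension votes to one summary number plus flags.

    median_high returns an actual vote rather than an interpolated value,
    so the consensus is always a score somebody gave.
    """
    consensus = {d: median_high(votes[d]) for d in DIMENSIONS}
    summary = sum(consensus.values())
    flags = [d for d in DIMENSIONS if consensus[d] >= FLAG_THRESHOLD]
    return {"summary": summary, "consensus": consensus, "flags": flags}


# Hypothetical votes from three people on a small change to the payment flow.
payment_ticket = {
    "effort": [1, 2, 1],
    "risk": [5, 4, 5],
    "complexity": [2, 2, 3],
    "uncertainty": [1, 2, 2],
}

result = summarize(payment_ticket)
print(result["summary"])  # the single number for the sprint board
print(result["flags"])    # the dimensions worth discussing before scheduling
```

The summary number still fits on a sprint board, but the flag list keeps the high-risk vote visible instead of averaging it away.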

T-shirt sizing is a reasonable response to a real problem with story points: false precision. It just doesn’t solve the problem that makes estimates miss, which is asking for one answer to four questions.

If you want to try this with your team, Estimate Well runs structured multi-dimensional estimation sessions. Free, no account needed. Share a link and you’re in.