The sprint went fine on paper. The junior developer shipped three tickets in four days. All of them were written with Copilot — he described the intent, iterated on the output, cleaned up the edges. Fast. Productive. Story point velocity at a record high.
Then the tech lead reviewed the code.
Two of the three tickets touched the authentication layer in ways that weren’t wrong, exactly, but weren’t right either. The abstractions were off. A pattern that made sense locally created a precedent the team would have to work around for the next year. Review took two days. Three back-and-forth cycles. Two more developers pulled in.
The sprint was a story point success. In practice, it was a net negative.
What AI tools broke
Story points worked as an approximation when “how much code needs to be written” and “how hard this ticket is” correlated reasonably well. That correlation was never perfect — but it was stable enough that teams could develop calibration over time and trust the number.
AI tools broke that correlation completely.
With Copilot or Cursor, implementation effort — the thing story points are actually measuring — has become the smallest variable in the room. A developer with strong domain knowledge and good prompting instincts ships the right code fast. A developer without those things ships a lot of code quickly, and the team pays for it in review, in architectural drift, in decisions made without context.
The four things that actually determine how hard an AI-era ticket is don’t move together. They never moved in perfect lockstep, but they used to track each other closely enough. Now the gap is too wide to ignore.
The four dimensions story points can’t see
Steering effort — how much skill and iteration does it take to get useful output from the AI? A well-scoped change in a familiar codebase: low. An ambiguous requirement in a domain the developer doesn’t know deeply: hours of back-and-forth that produce plausible but subtly wrong code.
Architectural impact — does the AI-generated code fit the patterns the team has established, or does it introduce abstractions that conflict with existing ones? LLMs optimise locally. They don’t know why the codebase is structured the way it is. A choice that looks sensible in isolation can carry significant downstream weight.
Domain knowledge — how much does the developer need to understand about the problem space to know whether the AI’s output is actually correct? In a payment flow or a permission model, the code can be syntactically perfect and semantically wrong. Domain knowledge is what catches it. If it’s absent, review burden absorbs the gap.
Review burden — how much careful human attention does the output require before it’s safe to merge? AI-generated code is often thorough and plausible. That’s what makes review harder, not easier. A reviewer needs to understand intent, not just correctness.
What changes when you vote on these separately
When a team asks these four questions in planning, the risk in the junior-with-Copilot scenario becomes visible before any code is written.
Steering effort: low. The ticket is well-scoped, and the codebase in this area is clean.
Architectural impact: high. This change is in the auth layer, and the patterns here affect every service that touches it.
Domain knowledge: high. You need to understand the permission model to know whether the AI’s output is semantically correct.
Review burden: high. Whoever reviews this needs to trace the implications through three services.
That’s not a 3-point ticket. More usefully: it’s a ticket that shouldn’t go to the junior developer alone, and that probably needs a senior to review the architectural direction before the code is written rather than after.
Story points would have called it a 3. A senior would have handed it over as an easy win. The sprint would have looked fast and been slow.
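For teams that want to record the four votes rather than just discuss them, here is a minimal Python sketch of the auth-layer ticket above. Everything in it is an assumption made for illustration: the three-level scale, the names Level, TicketEstimate and planning_flags, and the wording of each flag. It is not how Estimate Well or any other tool models the vote.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    # Hypothetical three-step scale; a team could just as well vote 1-5.
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class TicketEstimate:
    # One vote per dimension instead of a single story point number.
    steering_effort: Level
    architectural_impact: Level
    domain_knowledge: Level
    review_burden: Level


def planning_flags(estimate: TicketEstimate) -> list[str]:
    """Turn high votes into planning actions (illustrative rules only)."""
    flags = []
    if estimate.steering_effort == Level.HIGH:
        flags.append("tighten the ticket scope before starting")
    if estimate.architectural_impact == Level.HIGH:
        flags.append("agree the architectural direction with a senior before coding")
    if estimate.domain_knowledge == Level.HIGH:
        flags.append("pair with someone who knows the domain")
    if estimate.review_burden == Level.HIGH:
        flags.append("reserve reviewer time in the sprint")
    return flags


# The auth-layer ticket as voted in the example above.
auth_ticket = TicketEstimate(
    steering_effort=Level.LOW,
    architectural_impact=Level.HIGH,
    domain_knowledge=Level.HIGH,
    review_burden=Level.HIGH,
)

for flag in planning_flags(auth_ticket):
    print(flag)
```

The point of keeping the four votes as separate fields is that a low steering effort cannot average away three high votes, which is exactly what a single number does.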
The argument that keeps missing the point
The case against story points isn’t new. It’s been made on engineering blogs for years — that they’re gameable, that they don’t map to time, that velocity becomes a target instead of a measure. All true. But none of it is the core failure.
The core failure is that one number was always trying to carry multiple signals. When implementation effort, review burden, domain knowledge requirements, and architectural impact moved roughly together, teams could get away with it. A hard ticket was usually hard on all dimensions. An easy ticket was easy on all of them.
AI development broke that relationship. The four signals now decouple visibly and frequently. A ticket with near-zero implementation effort can carry extreme review burden and architectural risk. Story points have no way to see it.
If you want to try this with your team, Estimate Well ships an AI Era config that breaks the vote into steering effort, architectural impact, domain knowledge, and review burden. You can see all estimation configs at estimate-well.com/profiles. Free, no account needed. Share a link and you’re in.