Sprint review. The team finished all the feature work on time. Two tickets are still open. Both were 3s. Both were straightforward changes to a service that talks to a third-party API. Both spent most of the sprint waiting for a vendor to respond to a support ticket about a rate limit that wasn’t documented anywhere.
Nobody estimated wrong. Three points of implementation work was exactly right. The sprint broke somewhere the estimate never looked.
What story points actually measure
When a team votes story points, they’re mostly voting on implementation. How much code? How complex is it? How long will it take to write? These are reasonable questions, and with experience, teams develop real calibration for them.
What story points can’t carry is everything that happens around the implementation. The integration depth — how many systems does this touch, and how tightly? The testing surface — does this ticket require a regression sweep, or is it isolated? The external dependencies — are there third parties, other teams, or upstream APIs between “code written” and “ticket done”?
A ticket can be a 2 for implementation and an 8 for everything else. Simple database write, but it feeds into a reporting pipeline owned by another team, needs coordinated deployment, and the downstream consumers haven’t tested against the new schema yet. Story points will call this a 2. Your sprint will not treat it like a 2.
The four failure modes that don’t compress
Delivery risk for integration-heavy work has four distinct shapes:
Implementation — the code itself. How complex is the logic? How navigable is the codebase in this area? This is what story points are good at.
Integration — the surface area between systems. How many services does this connect? Are those interfaces stable? Does a change here require coordination with other teams?
Testing & QA — the verification burden. Can this be covered by unit tests, or does it require end-to-end validation? Is there a manual testing step that can only happen in staging?
External dependencies — what sits outside your team’s control. Vendor APIs. Third-party services. Another squad’s release schedule. A change freeze you can’t predict.
These don’t move together. A technically complex ticket can have zero external dependencies — a hard algorithm entirely inside your own codebase. A trivially simple ticket can be blocked for three days waiting on a vendor. Story points can’t distinguish between them, and for integration-heavy work, it’s almost never the implementation dimension that breaks the sprint.
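One way to picture the point above is to keep the four dimensions as separate fields instead of one number. This is only a sketch: the Estimate class, its field names, and the sample scores are assumptions made for illustration, not part of any tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Estimate:
    """Hypothetical four-dimension estimate; names are illustrative."""
    implementation: int
    integration: int
    testing_qa: int
    external_dependencies: int

    def story_points(self) -> int:
        # Assumption: a conventional vote mostly reflects implementation,
        # so the single number is modelled as that dimension alone.
        return self.implementation


# Two made-up tickets with the same implementation effort.
isolated_change = Estimate(implementation=3, integration=1,
                           testing_qa=2, external_dependencies=0)
vendor_change = Estimate(implementation=3, integration=5,
                         testing_qa=4, external_dependencies=8)

# Same story points, very different delivery risk.
print(isolated_change.story_points(), vendor_change.story_points())
print(isolated_change.external_dependencies, vendor_change.external_dependencies)
```

Both tickets collapse to the same single number, while the separate fields show where they differ.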
What changes when you ask the right questions
When your team votes these four dimensions separately, a 3-point implementation ticket looks different in the planning session:
Implementation: 2. Integration: 6. Testing & QA: 5. External dependencies: 8.
That’s not a 3-point ticket anymore. More importantly, now you know what to do with it. Coordinate with the downstream team before the sprint starts. Schedule time for integration testing. Decide whether an external dependency this uncertain belongs in week one or on the backlog until the vendor confirms the fix.
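The step from scores to planning actions can be sketched as a small lookup. The threshold of 5, the dictionary keys, and the function name below are assumptions chosen for illustration; a team would pick its own cut-off and wording.

```python
from typing import Dict, List

# Hypothetical cut-off: a dimension at or above this gets a planning action.
RISK_THRESHOLD = 5

# Actions paraphrased from the text, keyed by an illustrative dimension name.
ACTIONS: Dict[str, str] = {
    "integration": "Coordinate with the downstream team before the sprint starts",
    "testing_qa": "Schedule time for integration testing",
    "external_dependencies": "Decide: week one, or backlog until the vendor confirms",
}


def planning_actions(scores: Dict[str, int],
                     threshold: int = RISK_THRESHOLD) -> List[str]:
    """Return one action per non-implementation dimension at or above the threshold."""
    return [action for dimension, action in ACTIONS.items()
            if scores.get(dimension, 0) >= threshold]


# The four-dimension vote from the example above.
ticket = {
    "implementation": 2,
    "integration": 6,
    "testing_qa": 5,
    "external_dependencies": 8,
}

for action in planning_actions(ticket):
    print("-", action)
```

Implementation is left out of the lookup on purpose, since that is the one dimension the ordinary vote already covers.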
If you run standard planning poker, the team votes 3 and moves on. On Friday you find out which assumptions were wrong.
The unit of measurement isn’t the issue — hours would have the same problem. A “4-hour” estimate for an integration ticket is still a 4-hour estimate for the implementation. It still says nothing about what’s waiting on the other side.
The retro that changes
There’s a version of the retrospective that goes: “We consistently underestimate integration work.” The action item is usually “estimate more carefully.” Which changes nothing, because the team isn’t estimating integration work — they’re estimating implementation work and treating the rest as implied.
When a team votes dimensions separately and a sprint misses, the retro has a different starting point. External dependencies were rated 8 and the ticket was scheduled in week one. Next time: either move it later, or make the vendor dependency explicit as a pre-sprint action item. That’s a specific thing to change. “We underestimated” is not.
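The retro finding above can also be written as a check to run at planning time. This is a minimal sketch under stated assumptions: the PlannedTicket fields, the ticket keys, and the threshold of 8 are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PlannedTicket:
    """Hypothetical planning record; field names are illustrative."""
    key: str
    external_dependencies: int   # score from the separate vote
    scheduled_week: int          # 1 = first week of the sprint
    vendor_action_done: bool     # was the dependency handled before the sprint?


def risky_week_one(tickets: List[PlannedTicket], threshold: int = 8) -> List[str]:
    """Keys of tickets with a high external-dependency score scheduled in
    week one without a pre-sprint action covering the dependency."""
    return [
        t.key
        for t in tickets
        if t.external_dependencies >= threshold
        and t.scheduled_week == 1
        and not t.vendor_action_done
    ]


# Made-up sprint plan for illustration.
plan = [
    PlannedTicket("EX-1", external_dependencies=8, scheduled_week=1, vendor_action_done=False),
    PlannedTicket("EX-2", external_dependencies=8, scheduled_week=2, vendor_action_done=False),
    PlannedTicket("EX-3", external_dependencies=9, scheduled_week=1, vendor_action_done=True),
    PlannedTicket("EX-4", external_dependencies=1, scheduled_week=1, vendor_action_done=False),
]

print(risky_week_one(plan))
```

Only the first ticket is flagged; the others were either moved later, covered by a pre-sprint action, or never depended on anyone outside the team.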
If you want to try this with your team, Estimate Well ships an Integration Focus config that breaks the vote into implementation, integration, testing & QA, and external dependencies. You can see all estimation configs at estimate-well.com/profiles. Free, no account needed. Share a link and you’re in.