

Somewhere around the third connector we built for Xceptor, something changed in how I worked. I stopped reading every line of AI-generated code hunting for bugs to fix. I started reading it to confirm the approach was right, occasionally adjusting, mostly approving. The job had quietly shifted from production to verification. That shift is, in miniature, the whole story of what we built.
Xceptor runs a data automation platform that financial institutions use for trade workflows, reconciliations, tax processing, and regulatory reporting. It spans 170+ SaaS instances and is supported by a 39-person engineering team. By the time we got involved, they weren't behind on AI adoption. GitHub Copilot was rolled out. Engineers were experimenting with Claude. Individual developers were visibly faster.
But none of it was showing up in team-level delivery numbers. Tool usage was scattered and unmeasured. People used AI to autocomplete code and basically nothing else: not requirements, not architecture, not test strategy, not release docs. After six months of usage data and over 18,000 AI events, nobody could say whether any of it was making delivery better or just busier. Rework sat around 30%, with requirements gaps surfacing during build or QA, the most expensive place for them to surface, instead of during planning.
The diagnosis was simple to state and hard to fix: this was a tool-first rollout, with AI dropped into a process that hadn't been redesigned to use it. You don't get compounding gains from a faster typewriter. You get them from rethinking the unit of work.
So we didn't start with more tooling. We started by collapsing the pipeline. Instead of features moving through product, architecture, development, QA, and DevOps as a relay, with dead time between every handoff, we rebuilt the lifecycle around two roles: Product, which works with AI to shape and validate requirements before anything gets built, and Builder, which orchestrates AI through design, implementation, testing, and release, and is accountable for everything that ships.
The programme itself moved through three stages, each one a deliberate escalation of how much we trusted the system to act on its own: Augment, where AI is a tool an individual reaches for; Automate, where AI gets embedded directly into the process; and Agent, where AI executes pipeline stages with a human approving at every gate. The first two are done. The agentic stage is live in production today as a governed plugin running custom slash commands across all nine stages of the delivery lifecycle, from requirements through release.
The principle that made this workable wasn't "trust the AI more." It was the opposite: keep AI proposing and humans approving, at every single gate, with no artifact moving forward without sign-off. Speed came from removing the wait time between stages, not from removing the checkpoints.
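The "AI proposes, human approves" rule can be sketched as a loop in which no stage starts until the previous artifact carries a sign-off. This is a minimal illustration, not the actual plugin: the names `Artifact`, `GateRejected`, `run_pipeline`, `propose`, and `review` are all hypothetical, and the stage names in the usage note are an illustrative subset rather than the real nine stages.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Artifact:
    stage: str
    content: str
    approved_by: Optional[str] = None  # set only after a human signs off


class GateRejected(Exception):
    """Raised when a reviewer declines to approve a stage's artifact."""


def run_pipeline(
    stages: List[str],
    propose: Callable[[str, List[Artifact]], str],
    review: Callable[[Artifact], Optional[str]],
) -> List[Artifact]:
    """Run stages in order; each artifact must be approved before the next stage.

    `propose` stands in for the AI step: it receives the stage name and the
    previously approved artifacts and returns draft content.
    `review` stands in for the human gate: it returns the reviewer's name on
    approval, or None on rejection.
    """
    approved: List[Artifact] = []
    for stage in stages:
        artifact = Artifact(stage, propose(stage, approved))
        reviewer = review(artifact)
        if reviewer is None:
            # Nothing moves forward without sign-off.
            raise GateRejected(stage)
        artifact.approved_by = reviewer
        approved.append(artifact)
    return approved
```

Calling `run_pipeline(["requirements", "design", "release"], propose, review)` either returns three approved artifacts or stops at the first rejected gate. The stages still run back to back, so the saving in this sketch comes from removing the wait between them, while every checkpoint stays in place.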
We picked connectors as the first proof point because they're the most self-contained unit of work in the codebase: clear API boundaries, independently deployable. Two connectors that the team had estimated at two weeks through the normal process shipped in two days. That's not a typo, and it wasn't cherry-picked: it's an 83% reduction in delivery time.
The second test pushed further, beyond code into platform configuration. We built an AI Configuration Builder that turns a plain-language description of a business process into a structured platform config. The first real use case, Loan Notices, was estimated at 26 days and delivered in 6, a reduction of roughly 77%. It went in front of an actual customer at the proof-of-concept stage. Their feedback on quality and experience: excellent. That was the first time anyone at Xceptor had shown a live customer an AI-configured, end-to-end solution, and it proved the same pipeline worked for code and for configuration.
Rework, the metric that started this whole conversation, dropped from roughly 30% to under 10%.
"We were trying to get AI to build a process on our platform. When we gave it too much context, it did a terrible job. When we gave it the bare minimum, a codebase link, the documentation, one good working example, within a day it was building an end-to-end process the way our business user would." — Michael Kinloch, SVP Engineering, Xceptor
None of this worked on the first try, and the parts that broke are more useful than the parts that didn't.
We sized stories the way humans do, and it was wrong. The agent's first pass broke features into small, independently testable stories, the textbook way a human team would do it. For an agent, every story means reloading full context, so a small story costs nearly as much to execute as a large one. A single larger story per connector worked better. The right call for a human team turned out to be the wrong call for an agentic one, and we only found that out by shipping the wrong version first.
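A toy cost model shows why splitting hurts here. It assumes, for illustration only, that every story pays a fixed context-loading cost in tokens on top of the actual work. The function name and all the numbers are made up and are not measurements from this project.

```python
def execution_cost(num_stories: int, context_tokens: int, work_tokens_total: int) -> int:
    """Total tokens when the same work is split into `num_stories` stories.

    Assumption: each story reloads the full context once, and the total
    amount of real work does not change with how it is sliced.
    """
    return num_stories * context_tokens + work_tokens_total


# Hypothetical connector: 50k tokens of context, 20k tokens of real work.
one_large_story = execution_cost(1, 50_000, 20_000)     # 70_000
eight_small_stories = execution_cost(8, 50_000, 20_000)  # 420_000
```

Under these assumptions the eight-story split costs six times as much for the same output, because the repeated context reload dominates the total.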
Detail is a cost, not a virtue. Left alone, the agent generates far more documentation than a human reviewer actually needs, and early on that thoroughness became the bottleneck: nobody wants to review a ten-page solution design for a two-day connector. We spent real iteration cycles, two or three passes per skill in most cases, calibrating prompts down to the level of detail people would actually read. Simple, reviewable artifacts beat comprehensive ones every time a human has to sign off.
Not every model needs to be the best model. Token costs add up fast when every stage defaults to your most capable model. We're now routing work: the strongest reasoning model where code quality is on the line, lighter and cheaper models for stages like documentation or routine test generation, where the quality bar is lower.
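Routing of this kind can be as simple as a lookup table from stage to model tier. The sketch below is an assumption about shape, not the team's configuration: the tier labels are placeholders rather than real model names, and the stage keys are hypothetical.

```python
from enum import Enum


class Tier(Enum):
    STRONG = "strong-reasoning-model"  # placeholder label, not a real model ID
    LIGHT = "light-cheap-model"        # placeholder label, not a real model ID


# Hypothetical stage-to-tier table.
ROUTES = {
    "implementation": Tier.STRONG,
    "documentation": Tier.LIGHT,
    "test_generation": Tier.LIGHT,
}


def pick_model(stage: str, default: Tier = Tier.STRONG) -> Tier:
    """Return the tier for a stage, falling back to the strong tier if unmapped."""
    return ROUTES.get(stage, default)
```

Defaulting unknown stages to the strong tier is a deliberate choice in this sketch: an unmapped stage costs more than it needs to, which is easier to notice and fix than an unmapped stage that fails quietly on quality.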
Regression testing an AI pipeline is still an open problem. This is the one I don't have a clean answer for yet. The skills and agents we built are, underneath, probabilistic systems. Change an instruction and you can't guarantee the next run is better, or even that it isn't worse. Comparing two model outputs side by side is a solved problem; validating that an entire multi-stage pipeline still behaves correctly after a change is not, at least not with tooling that exists today. We're actively working on it, and I'd genuinely like to hear from anyone who's solved this better than we have.
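Because each run is probabilistic, a single before-and-after comparison says little. One partial approach is to run both versions many times and compare pass rates. The sketch below shows only that idea, under strong assumptions: each run can be reduced to a pass/fail check, and the run count and tolerance are arbitrary. The helper names are hypothetical, and it catches only gross regressions, not the full multi-stage problem described above.

```python
from typing import Callable


def pass_rate(run_once: Callable[[], bool], runs: int) -> float:
    """Fraction of `runs` executions for which `run_once` reports a pass."""
    passes = sum(1 for _ in range(runs) if run_once())
    return passes / runs


def looks_regressed(
    baseline: Callable[[], bool],
    candidate: Callable[[], bool],
    runs: int = 50,
    tolerance: float = 0.05,
) -> bool:
    """Flag the candidate if its pass rate trails the baseline's by more than `tolerance`.

    `baseline` and `candidate` are stand-ins for "execute the pipeline once
    with the old / new instructions and check the result".
    """
    return pass_rate(candidate, runs) < pass_rate(baseline, runs) - tolerance
```

The hard part is what this hides inside `run_once`: deciding what counts as a pass for a multi-stage pipeline is the part that remains unsolved.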
Find one contained, well-bounded piece of work and one team that actually wants to try it before you touch the org chart. We proved this end to end on connectors with a small, willing team before we asked anyone else to change how they worked. That sequencing mattered more than any tooling decision we made.
Instrument from day one. We tracked time-to-market, cost, and quality from the first feature, not after adoption became a question someone was asking. Without that data, "is this actually working" stays a matter of opinion.
And expect the job to change shape, not disappear. Engineers on this team aren't doing less work because AI writes the code. They're spending their time differently: confirming plans instead of producing first drafts, reviewing pull requests instead of writing every line, defining what good looks like instead of manually checking for it. That's a real shift in how the work feels day to day, and it's the part no tooling rollout prepares you for on its own.
We're still mid-journey. The agentic stage is live, but it isn't where it's going to end up. If you're earlier in this than we were a year ago, the honest version of where it gets hard is more useful than the highlight reel, so that's what I tried to write here.