What Xceptor’s 170+ Instance Production Monitoring Tells Us About the Future of SaaS Operations

Xceptor runs a data automation platform for capital markets. Over 170 SaaS instances, each serving a different financial institution, each with its own configuration, data profile, and exception patterns. When something breaks in one of those instances at 2 AM, somebody has to figure out what happened, how bad it is, and who needs to know.

For years that somebody was a human on-call. Engineers reviewed dashboards, triaged alerts, diagnosed root causes, and escalated when needed. That worked when the instance count was lower. By the time the platform passed 170 instances, the maths stopped working. Every new customer added surface area. Headcount couldn’t scale linearly with it. Resolution times stretched. Days were spent on triage instead of engineering.

Today an AI agent handles that triage. Exception detection, classification, root cause suggestion, and routing all run autonomously across every instance. Resolution time dropped from days to hours. Zero missed P1 incidents since deployment.

Getting there was harder than that summary suggests.


Nobody trusted it at first

An AI agent that makes routing decisions across 170+ production instances serving financial institutions is not something you turn on and hope for the best. A bad routing decision, a missed critical exception, or a false positive that pages the wrong team at 3 AM all have real and expensive consequences.

We started in read-only mode. The agent observed everything, classified everything, and suggested routing for everything, but took no action. Engineers could see what it would have done and compare against what they actually did. This ran for weeks. Slow and unproductive-feeling, but it was the only way to build trust in the system before giving it execution rights.
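A read-only ("shadow") phase like this can be pictured as logging the agent's suggested routing next to the human's actual decision and tracking agreement over time. The sketch below is purely illustrative, not Xceptor's implementation; all names and values are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class ShadowLog:
    """Records agent suggestions alongside human decisions; the agent takes no action."""
    records: list = field(default_factory=list)

    def record(self, exception_id: str, agent_route: str, human_route: str) -> None:
        self.records.append((exception_id, agent_route, human_route))

    def agreement_rate(self) -> float:
        """Fraction of cases where the agent would have routed as the human did."""
        if not self.records:
            return 0.0
        agreed = sum(1 for _, agent, human in self.records if agent == human)
        return agreed / len(self.records)

# Three exceptions observed during the shadow phase (invented examples)
log = ShadowLog()
log.record("EXC-101", agent_route="integrations-team", human_route="integrations-team")
log.record("EXC-102", agent_route="data-team", human_route="data-team")
log.record("EXC-103", agent_route="platform-team", human_route="data-team")
print(log.agreement_rate())  # 2 of 3 suggestions matched
```

A sustained agreement rate over such a log is one plausible way to decide when the shadow phase has run long enough to grant execution rights.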

After read-only came confidence thresholds. The agent could only route exceptions where its classification confidence exceeded a set level. Anything below that went to a human. Novel exception types the agent hadn’t seen before always went to a human, regardless of confidence score. Human decisions on those escalated cases fed back into the model, widening what the agent could handle over time.
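The gating described above, novel types always escalate, low confidence always escalates, and human-handled cases widen the agent's scope, could be sketched as follows. The threshold value, exception type names, and function names are all illustrative assumptions.

```python
CONFIDENCE_THRESHOLD = 0.9  # illustrative value, not Xceptor's actual setting
KNOWN_TYPES = {"format_mismatch", "connection_timeout", "integration_failure"}

def dispatch(exc_type: str, confidence: float) -> str:
    """Decide whether the agent may route an exception on its own."""
    if exc_type not in KNOWN_TYPES:
        return "escalate_to_human"      # novel types always go to a human,
                                        # regardless of confidence score
    if confidence < CONFIDENCE_THRESHOLD:
        return "escalate_to_human"      # below-threshold confidence goes to a human
    return "route_automatically"

def learn_from_human(exc_type: str) -> None:
    """A human decision on an escalated case widens what the agent can handle."""
    KNOWN_TYPES.add(exc_type)

print(dispatch("schema_drift", 0.99))   # novel type: escalate_to_human
learn_from_human("schema_drift")
print(dispatch("schema_drift", 0.99))   # now known: route_automatically
```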

Rollout was instance by instance. Not a flag flip across 170 instances on a Tuesday. Each new instance brought different data patterns and configuration quirks. The agent had to prove itself in each environment before it was trusted in the next one.

Read-only first, confidence thresholds second, instance-by-instance rollout third. That sequence is why it works in production today.


How the human/AI boundary actually works

  • Tier 1 exceptions are well-classified patterns the agent has seen many times across multiple instances: data format mismatches, connection timeouts, known integration failures. The agent detects these, classifies them, suggests a root cause, and routes them to the right team with context attached. No human touches them unless the receiving engineer disagrees with the routing.
  • Tier 2 covers exceptions the agent can classify but where the root cause has multiple plausible explanations. Here the agent provides its analysis and ranking of likely causes, but a human reviews before the diagnosis is sent downstream. Agent does the research, engineer signs off.
  • Tier 3 and above stays fully human. Customer-impacting incidents, configuration changes, anything that falls outside established patterns, and anything where the business context matters more than the technical signal. At this tier the agent’s job is to surface the exception fast and provide whatever context it can, but decision-making is entirely human.
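The tier boundary above can be read as an explicit decision function. A minimal sketch, assuming invented signals and thresholds (the `seen_count >= 10` cutoff in particular is an assumption, not a documented Xceptor rule):

```python
def triage_tier(seen_count: int, plausible_causes: int,
                customer_impacting: bool) -> int:
    """Map an exception to a handling tier, per the boundary described above.

    seen_count: how often this pattern has appeared across instances.
    plausible_causes: how many root-cause explanations rank as likely.
    """
    if customer_impacting or seen_count == 0:
        return 3    # fully human: incidents, novel patterns, business context
    if plausible_causes > 1:
        return 2    # agent does the research, an engineer signs off
    if seen_count >= 10:
        return 1    # well-classified pattern: agent routes autonomously
    return 2        # pattern not yet well-established: keep a human in the loop

# A familiar timeout with one clear cause lands in Tier 1
print(triage_tier(seen_count=50, plausible_causes=1, customer_impacting=False))
```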


What had to exist before the agent could work

Most of the work that made this possible had nothing to do with AI. Operational plumbing, all of it.

  • Exception taxonomy. Before an agent can classify exceptions, you need a classification system that humans agree on. At Xceptor, that meant defining and systematising every exception type across 170+ instances into a shared taxonomy. Tedious, unglamorous work that took weeks. Without it, the agent has nothing to classify against.
  • Consistent observability. An agent can only act on data it can see. Instrumentation had to be uniform across every instance: same logging format, same metric export, same alerting hooks. Instances that were instrumented differently produced unreliable signals, and the agent performed poorly on them until the instrumentation was standardised.
  • Explicit escalation logic. In most ops teams, escalation is implicit knowledge: senior engineers know which exceptions are urgent based on experience, context, and gut feel. That works with humans. An agent can’t operate on gut feel. Every escalation path, every severity threshold, every team routing rule had to be written down and formalised before the agent could execute against it.
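"Written down and formalised" might look something like a declarative rule table the agent executes against, with an explicit fallback to a human for anything unmatched. The categories, severities, and team names below are invented for illustration.

```python
# Hypothetical escalation rules: every path explicit, nothing left to gut feel.
ESCALATION_RULES = [
    # (exception category, minimum severity to auto-route, team to notify)
    ("data_format_mismatch", "P3", "data-team"),
    ("connection_timeout",   "P2", "platform-team"),
    ("integration_failure",  "P2", "integrations-team"),
]
SEVERITY_ORDER = ["P4", "P3", "P2", "P1"]  # P1 is most severe

def route(category: str, severity: str) -> str:
    """Return the team to notify, or hand off to on-call for anything unmatched."""
    for rule_category, threshold, team in ESCALATION_RULES:
        if category == rule_category and (
            SEVERITY_ORDER.index(severity) >= SEVERITY_ORDER.index(threshold)
        ):
            return team
    return "human-on-call"  # unknown category or below threshold: a human decides

print(route("connection_timeout", "P1"))  # matched rule: platform-team
print(route("unknown_category", "P1"))    # no rule: human-on-call
```

A table like this doubles as documentation: the taxonomy categories it references are exactly the ones the agent classifies against.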

We’ve since seen this pattern repeat across other engagements. Teams that struggle with AI in production are almost never blocked by model capability. They’re blocked by these operational prerequisites. If your human operators rely on tribal knowledge to triage, an agent can’t replicate what was never documented.


What changed for the ops team

Before the agent, the team’s days were dominated by triage. Reviewing dashboards, classifying alerts, diagnosing root causes, deciding who to page. Skilled work, but repetitive, and it consumed capacity that should have been spent on harder problems.

After the agent took over Tier 1 and most of Tier 2, engineers shifted to reviewing the agent’s output on complex cases, investigating root causes the agent can’t resolve, and working on infrastructure improvements that prevent exceptions from recurring. Less firefighting, more engineering.

On-call changed materially. When the agent handles initial detection, classification, and routing, the human on-call doesn’t get woken up for every exception. They get woken up for exceptions that actually need a human decision. That’s a quality-of-life change that shows up in recruitment conversations and retention numbers.

There’s also a commercial effect. When exception triage runs in hours instead of days, SLA commitments that were previously conservative become credible. Xceptor can offer response time guarantees that would have required a much larger ops team to deliver manually. Unit economics of monitoring change when headcount doesn’t need to track linearly with instance count.


What we’d do differently

We underestimated the instrumentation work. Getting consistent observability across 170+ instances with different configurations took longer than expected, and the agent’s performance was directly tied to instrumentation quality. → Next time we’d run the instrumentation audit before the agent build, not in parallel.

We also learned that the exception taxonomy needs to be a living document. Our initial version was too rigid. New exception types appeared as Xceptor onboarded new clients and new data sources. We had to build a process for proposing, reviewing, and adding new categories on an ongoing basis. → Treating the taxonomy as a one-time deliverable would have degraded accuracy over time.

And we’d allocate more time for the read-only phase. Felt slow while we were in it. In retrospect, that was the single most important step in building trust. Rushing it would have produced an agent that the ops team worked around rather than worked with.


Why production gets overlooked

Most AI-in-delivery conversations focus on the build side: prototyping, requirements, coding, testing. Those phases get the conference talks and the blog posts. But production monitoring is where AI changes the economics of running a SaaS business. Where you stop scaling headcount linearly with customer growth. Where ops engineers move from triage to engineering. And the payoff is measured in SLA performance and incident response, not just developer productivity.

Xceptor’s monitoring agent didn’t start as an ambitious AI project. It started as a practical response to a scaling problem that headcount couldn’t solve. The agent works because the operational foundations were built first, because trust was earned incrementally, and because the human/AI boundary was designed to evolve as the system learned.

Any SaaS team running at scale can follow the same sequence. Start with the taxonomy, fix the instrumentation, make the escalation logic explicit, then let the agent earn its way into production one instance at a time.

For the full Xceptor engagement covering all five delivery lifecycle phases, read: How Xceptor Moved AI Out of the Pilot Phase and Into Every Stage of Delivery.
