Financial Planning: The Agents That Drifted After a Model Update
Vertical: Financial Services / Agentic Platform
Stage: Growth-stage, 20+ AI agent instances across customer operations
What we analysed: Composite scenario drawn from multiple financial services prospects - describes the drift pattern we see repeatedly and the monitoring approach we deploy
A financial planning platform had 20+ AI agent instances across customer operations. Each agent handled different aspects of data analysis, reporting, and customer interaction. One of their people described the core anxiety: “How do you know it’s doing what you’re expecting it to do? At the point where you set it up and test it and go, that looks good. Versus - we’ve seen even just very simple processes, interactions where we use AI, got steps that have ChatGPT integrated into making decisions. Suddenly, changes in the model and it starts handling those decisions differently than you were expecting.”
Then a model update shipped.
What they thought was happening
They tested the model update in staging. Evals passed. The new model performed better on their benchmark suite. They deployed to production. Within two days, the team was confident everything was fine.
What was actually happening
The model update changed how the agents interpreted ambiguous financial data. Not dramatically - subtly. A calculation that previously rounded conservatively now rounded aggressively. A categorisation that previously defaulted to “needs review” now defaulted to “approved.” No single decision crossed a severity threshold.
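That failure mode - each individual decision within tolerance, the aggregate consistently biased in one direction - can be sketched with a toy example. The fee values below are hypothetical; the point is only that flipping a rounding mode moves every affected output by a sub-threshold amount, in the same direction, on every decision.

```python
# Hypothetical illustration: a model update effectively flips rounding
# from conservative (round half down) to aggressive (round half up).
# No single change exceeds one cent, but every change has the same sign.
from decimal import Decimal, ROUND_HALF_DOWN, ROUND_HALF_UP

fees = [Decimal("10.125"), Decimal("7.875"), Decimal("3.445")]
CENT = Decimal("0.01")

before = [f.quantize(CENT, rounding=ROUND_HALF_DOWN) for f in fees]
after = [f.quantize(CENT, rounding=ROUND_HALF_UP) for f in fees]

for b, a in zip(before, after):
    print(b, "->", a, "(delta:", a - b, ")")
# Each delta is exactly 0.01 - invisible to a per-decision severity
# check, but across hundreds of decisions the bias compounds.
```

A per-decision alert threshold of even half a cent would never fire here; only comparing the population of outputs against a baseline reveals the shift.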
Their team’s quality process was reactive. As one person put it: “When someone on my team is doing a peer assessment, they’ll say, oh, and by the way, I spotted that the AI did something weird. And you go, oh, like, that’s really a byproduct rather than the deliberate process.”
The first sign of trouble came three weeks later. A customer noticed their quarterly report looked different. When the team investigated, they found 200+ decisions had drifted from baseline since the deployment. The agent wasn’t failing - it was confidently producing outputs that were slightly, consistently off.
“So that’s when everything does go wrong. Normally it’s only discovered later on and then it’s much more of a drama. Because - oh, how long has this been going on for? Good question.”
What monitoring catches
We set up continuous comparison across agent instances with time-series tracking. Not just “is this output correct?” but “has the distribution of outputs changed since last week?” The system detects when a model update, data change, or any upstream shift causes the agent population to behave differently.
When drift is detected, it surfaces the specific decisions that changed, the direction of the change, and the likely cause. The team can see - within hours, not weeks - whether a deployment changed agent behaviour in ways their eval suite didn’t catch.
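A minimal sketch of the "has the distribution of outputs changed?" check, using total variation distance between two windows of categorical decisions. The window contents, labels, and threshold are illustrative assumptions, not the deployed system's actual method - production drift detection typically adds statistical significance testing and per-instance baselines on top of a comparison like this.

```python
# Minimal distributional drift check over categorical agent decisions.
# Labels, windows, and the 0.1 threshold are illustrative only.
from collections import Counter


def decision_distribution(decisions):
    """Normalise a list of decision labels into label -> probability."""
    counts = Counter(decisions)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def total_variation(p, q):
    """Total variation distance between two categorical distributions."""
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(l, 0.0) - q.get(l, 0.0)) for l in labels)


def drifted(baseline, current, threshold=0.1):
    """Flag drift when the decision mix moves more than `threshold`."""
    return total_variation(
        decision_distribution(baseline),
        decision_distribution(current),
    ) > threshold


# Last week: 70% approved, 30% needs_review. This week: 90% / 10%.
last_week = ["approved"] * 70 + ["needs_review"] * 30
this_week = ["approved"] * 90 + ["needs_review"] * 10

print(drifted(last_week, this_week))  # True
```

Every output in the second window is individually plausible - "approved" is a valid answer - yet the mix has moved by a total variation distance of 0.2, which is what a distributional comparison catches and a correctness-only eval does not.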
“I would like to have something that is constantly monitoring any AI within our process. It’s the over time thing that I think is important. What works today doesn’t work tomorrow. Like, why? What’s changed? That feels like the stressful part of never being able to get on top of it.”
What they own
An always-on monitoring layer comparing agent behaviour across 20+ instances with time-series baselines. Drift detection that catches distributional shifts - not “is this output wrong?” but “has the pattern of outputs changed since last deployment?” Alerting that surfaces the specific decisions that changed, the direction of the change, and the likely upstream cause. The team deploys model updates on Monday and has drift reports by Wednesday - instead of finding out three weeks later from a customer.
The financial exposure from three weeks of undetected drift was significant. Not because any single decision was wrong enough to flag - but because hundreds of slightly-off decisions compound. MIT research examining 32 datasets across four industries found 91% of ML models experience degradation over time. The question is whether you detect it in hours or weeks.