Building AI agents people will use: six implementation lessons

In my last article, I introduced seven AI analyst personas that run against our Waltz enterprise architecture repository.

Each has a distinct lens. Among them are a CTO persona watching for infrastructure obsolescence, a Migration Architect hunting dependency cycles and shared-database cohorts, a TOGAF analyst spotting capability redundancy and standards drift, and an Auditor reviewing the other six for agreement and contradiction. Each runs on a schedule, producing structured findings and recommending actions an architecture team can take.

A question I received afterwards stuck with me: how do you stop this becoming another report nobody reads?

The honest answer is that the first version came surprisingly close to doing exactly that.

The findings were often accurate. But accuracy alone did not make them useful. The system still produced too much repetition, inconsistent prioritisation and occasional moments that made users question whether they could trust it.

Here are six lessons from improving it. Although the examples come from an enterprise architecture repository, most apply to any system in which AI continuously watches data and asks people to act on what it finds.

 

1. Let the model rediscover problems. Let the application remember them.

My first instinct was to tell each agent: “Do not report anything you already reported last week.”

That sounds reasonable, but it makes the model responsible for lifecycle state that the surrounding application should own, which is an anti-pattern.

A better approach is to let the agent report what it sees on every run, and to have the application assign each finding a fingerprint based on its structural identity:

·      which persona reported it

·      the finding category

·      the exact repository entities involved

When the same problem is observed again, it produces the same fingerprint. The system increments an observation count and can refresh the explanation, rather than creating another finding.

The wording can change without changing the identity of the problem.

This is also what makes dismissal meaningful. A user dismisses the underlying finding, not one particular version of the prose. When the agent notices the same issue again, it remains dismissed.
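
One way this could look in Python, as a minimal sketch rather than the actual implementation. The `FindingStore` class, the field names, the `(kind, id)` entity tuples and the choice of SHA-256 are all illustrative assumptions:

```python
import hashlib
from dataclasses import dataclass, field


def fingerprint(persona: str, category: str, entities: list[tuple[str, int]]) -> str:
    """Hash the structural identity of a finding; the prose is deliberately excluded."""
    # Sort the entities so the order the model lists them in does not change the identity.
    parts = [persona, category] + [f"{kind}:{ident}" for kind, ident in sorted(entities)]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


@dataclass
class StoredFinding:
    explanation: str
    observations: int = 1
    dismissed: bool = False


@dataclass
class FindingStore:
    """Hypothetical in-memory store; a real system would persist this state."""
    findings: dict[str, StoredFinding] = field(default_factory=dict)

    def observe(self, persona: str, category: str,
                entities: list[tuple[str, int]], explanation: str) -> str:
        key = fingerprint(persona, category, entities)
        existing = self.findings.get(key)
        if existing is None:
            self.findings[key] = StoredFinding(explanation)
        else:
            # Same problem seen again: count it and refresh the wording.
            # The dismissed flag is left untouched, so a dismissal survives re-detection.
            existing.observations += 1
            existing.explanation = explanation
        return key

    def dismiss(self, key: str) -> None:
        self.findings[key].dismissed = True
```

In this sketch the agent never sees the store. It reports every run, and the application decides whether a report is new, repeated or already dismissed.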

The practical lesson: models should detect; applications should remember.

 

2. Calibrate severity with contrasts, not adjectives.

Initially, the agents were asked to classify findings as low, medium, high or critical.

Nearly everything became critical.

The model treated any breached threshold as critical by default. It was not consistently considering who needed to act, how quickly they needed to act or whether an immediate response was practical.

Longer definitions did not solve this. Contrasting examples did.

For example:

“A single end-of-life production database supporting six named applications may require urgent intervention.”

Compared with:

“Two hundred applications missing an annual attestation represent a substantial programme of work, but not necessarily 200 separate emergencies.”

The important distinction is not simply the number of affected entities. It is the combination of impact, urgency, concentration of risk and the type of response available.

Once the personas were given contrasting cases, their severity ratings became much more useful.
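
As a rough illustration, contrasting cases can be kept as data and rendered into the persona prompt. The sketch below reuses the two examples above. The severity labels attached to them, the `CalibrationCase` type and the output wording are assumptions made for illustration, not the prompts actually used:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CalibrationCase:
    situation: str
    severity: str   # assumed label, chosen here for illustration only
    reasoning: str


CASES = [
    CalibrationCase(
        situation="A single end-of-life production database supporting six named applications.",
        severity="critical",
        reasoning="High impact and concentrated risk; urgent intervention may be required.",
    ),
    CalibrationCase(
        situation="Two hundred applications missing an annual attestation.",
        severity="medium",
        reasoning="A substantial programme of work, not 200 separate emergencies.",
    ),
]


def calibration_section(cases: list[CalibrationCase]) -> str:
    """Render contrasting cases as a block of text to include in a persona prompt."""
    lines = ["Calibrate severity against these contrasting cases:"]
    for number, case in enumerate(cases, start=1):
        lines.append(f"{number}. {case.situation} Severity: {case.severity}. Why: {case.reasoning}")
    return "\n".join(lines)
```

Keeping the cases as data means a new contrast can be added or reworded without touching the rest of the prompt.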

The practical lesson: calibration examples often communicate judgement better than abstract definitions.

 

3. Never use generated prose as your navigation model.

The agents include entity references in their explanations. An early version of the interface turned references such as id=875 into application links.

Unfortunately, 875 was not always an application. It might have been a database, a capability or another type of repository entity.

The result was links to the wrong page, or to pages that did not exist.

That is worse than showing no link at all. A missing link is inconvenient. A confidently wrong link makes the entire finding feel unreliable.

Navigation is now driven exclusively by the structured entity references attached to the finding. The entity type and identifier are explicit, and the display name is resolved against the live repository when the finding is viewed.

References that appear only in generated prose are treated as text, not as ground truth.
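
A minimal sketch of that separation, assuming a hypothetical `EntityRef` type and made-up route templates (the real routes belong to the host application):

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical route templates per entity kind, for illustration only.
ROUTES = {
    "APPLICATION": "/application/{id}",
    "DATABASE": "/database/{id}",
    "CAPABILITY": "/capability/{id}",
}


@dataclass(frozen=True)
class EntityRef:
    kind: str
    id: int


def link_for(ref: EntityRef) -> Optional[str]:
    """Return a link only when the entity kind is explicitly known; otherwise no link."""
    template = ROUTES.get(ref.kind)
    return template.format(id=ref.id) if template else None


def links_for_finding(prose: str, refs: list[EntityRef]) -> list[str]:
    """Build links from the structured references only."""
    # The prose is never parsed for identifiers: "id=875" inside it stays plain text.
    return [url for url in (link_for(ref) for ref in refs) if url is not None]
```

With this shape, a reference to 875 links to a database page only when the finding carries an explicit `EntityRef("DATABASE", 875)`. An unknown kind produces no link rather than a wrong one.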

The practical lesson: use language for explanation, use structured data for identity and navigation.

 

4. Reject malformed output without abandoning the run.

AI-generated structured output will occasionally be wrong.

A persona may return an unsupported category, omit a required field or provide an entity reference in the wrong format.

The first implementation treated this as a failed run. 

The better approach was to make validation part of the conversation with the model.

When a finding fails schema validation, the validation error is returned to the persona as feedback. The persona can then correct the finding and submit it again during the same run.

For example, a persona might return a severity of urgent when the schema only permits low, medium, high or critical. Rather than failing the run, we hand the error back — “urgent is not a valid severity; choose one of low, medium, high, critical” — and the persona resubmits with a severity of high.

This gives us both sides of the contract:

·      strict validation at the system boundary

·      a forgiving correction loop inside the run

We do not silently accept malformed data, but we also do not throw away an otherwise useful analysis because of one repairable mistake.
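
The loop can be sketched as follows. This is an assumption-laden illustration: the field names, the `ask_model_to_fix` callback (standing in for a further turn of the conversation with the persona) and the retry limit are hypothetical:

```python
from typing import Callable, Optional

SEVERITIES = ("low", "medium", "high", "critical")
REQUIRED_FIELDS = ("category", "severity", "entities")  # assumed schema


def validate(finding: dict) -> Optional[str]:
    """Return an error message the model can act on, or None if the finding is valid."""
    for name in REQUIRED_FIELDS:
        if name not in finding:
            return f"missing required field: {name}"
    if finding["severity"] not in SEVERITIES:
        allowed = ", ".join(SEVERITIES)
        return f"{finding['severity']} is not a valid severity; choose one of {allowed}"
    return None


def submit_with_repair(
    finding: dict,
    ask_model_to_fix: Callable[[dict, str], dict],
    max_repairs: int = 2,
) -> Optional[dict]:
    """Accept a finding only once it validates; otherwise feed the error back to the model."""
    error = validate(finding)
    repairs = 0
    while error is not None and repairs < max_repairs:
        finding = ask_model_to_fix(finding, error)
        repairs += 1
        error = validate(finding)
    # If it still fails, drop this one finding; the rest of the run is unaffected.
    return finding if error is None else None
```

The validator stays strict throughout. Only the handling of a failure changes, from aborting the run to giving the persona a bounded number of chances to repair one finding.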

The practical lesson: be strict about what enters the system and helpful about how the model can correct it.

 

5. Measure consistency before tuning prompts.

Prompt tuning is very easy to do by instinct.

A wording change appears to produce a better result, so it is declared an improvement. But the same change may also make the agent less predictable or cause previously reliable findings to disappear.

Before further prompt tuning, I introduced a simple consistency test.

Each persona runs twice, back-to-back, against unchanged repository data. The two runs are then compared using the fingerprints of the findings they produced.

They do not need to use identical prose. They do need to identify essentially the same underlying issues.

Our current regression set is at 100% fingerprint consistency across the agent fleet.

In practice it catches the failures that matter: a persona that finds eight issues on one run and five on the next, or one that silently drops a category after a prompt edit. A wording change that quietly destabilises the fleet shows up immediately as a consistency drop.

That does not prove that every finding is correct. It does give us a stable baseline and a measurable degree of determinism. When a prompt, model or tool changes, we can see whether the change has quietly made the system less deterministic.

It is the closest equivalent I have found to a regression test for this kind of analyst.
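
A minimal sketch of such a comparison, assuming each run is reduced to the set of fingerprints it produced. Using Jaccard overlap as the consistency score is an assumption made here for illustration; the exact metric is a design choice:

```python
def fingerprint_consistency(run_a: set[str], run_b: set[str]) -> float:
    """Jaccard overlap of two runs' fingerprints: 1.0 means both runs found the same issues."""
    if not run_a and not run_b:
        return 1.0  # two empty runs agree with each other
    return len(run_a & run_b) / len(run_a | run_b)


def unstable_fingerprints(run_a: set[str], run_b: set[str]) -> set[str]:
    """Fingerprints that appeared in only one of the two runs."""
    return run_a ^ run_b
```

A persona that finds four issues on one run and only two of them on the next would score 0.5, and `unstable_fingerprints` would name the two findings that went missing.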

The practical lesson: for non-deterministic agents, you need a stability signal, not just an accuracy signal, before you start tuning prompts.

 

6. AI reviewing AI is useful—but not always useful to the operator.

One of the seven personas is an Auditor.

It reviews findings from the other six analyst agents and looks for gaps, agreement and disagreement. For example, it might identify that two personas assessed the same shared service differently, or that a significant repository area was overlooked entirely.

This is genuinely useful to me as the platform owner.

It is much less useful to someone who has opened an application page to decide what action to take. From their perspective, a discussion about disagreement between two AI personas is mostly system-level noise.

We therefore treat these findings differently in the interface. They appear in a demoted tier and are folded away by default.

Readers of the first article will recognise this as the third reverse-lookup bucket: the inter-persona quality-assurance tier that sits alongside findings which directly affect an application and those that mention it as part of a wider pattern.

The information remains available, but it does not compete with findings that directly affect the application or decision in front of the user.
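
The tiering might be sketched like this. The `Finding` shape, the `"auditor"` persona label and the `collapsed` flag are hypothetical names used for illustration:

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Tier(IntEnum):
    DIRECT = 1     # the finding directly affects this application
    MENTIONED = 2  # the application appears as part of a wider pattern
    AUDIT = 3      # inter-persona quality assurance


@dataclass(frozen=True)
class Finding:
    title: str
    persona: str
    primary_ids: frozenset
    mentioned_ids: frozenset = frozenset()


def tier_for(finding: Finding, app_id: int) -> Optional[Tier]:
    """Decide which bucket a finding belongs to on one application's page."""
    is_primary = app_id in finding.primary_ids
    if not is_primary and app_id not in finding.mentioned_ids:
        return None  # the finding does not concern this application at all
    if finding.persona == "auditor":
        return Tier.AUDIT
    return Tier.DIRECT if is_primary else Tier.MENTIONED


def page_sections(findings: list[Finding], app_id: int) -> list[dict]:
    """Group findings by tier, in display order, folding the audit tier away by default."""
    sections = {tier: [] for tier in Tier}
    for finding in findings:
        tier = tier_for(finding, app_id)
        if tier is not None:
            sections[tier].append(finding.title)
    return [
        {"tier": tier.name, "collapsed": tier is Tier.AUDIT, "titles": titles}
        for tier, titles in sections.items()
        if titles
    ]
```

Nothing is deleted in this sketch. The audit tier is still rendered, but last and collapsed, so it cannot crowd out the findings an operator came to act on.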

The practical lesson: AI-about-AI output is often platform telemetry, not primary user content.

 

The model was not the difficult part

The common thread through all six lessons is that none of them required a more capable model.

The difficult work was in the layer between a capable model and the person expected to act on its output:

·      identity and lifecycle management;

·      severity calibration;

·      structured grounding;

·      correction loops;

·      regression testing;

·      information hierarchy.

Model capability can produce a compelling finding.

The surrounding system determines whether anyone trusts it, understands it and does something about it.

What broke first in your AI-output-to-human-action pipeline?

 

#AIEnablement #EnterpriseArchitecture #EnterpriseAI #AIEngineering #AgenticAI #AIGovernance #DataQuality #ArchitectureManagement #Waltz