Your data has no opinions: the judgment layer AI reporting actually needs

An agent wired to your semantic model can answer 'what is the value?' perfectly and still be useless on 'is that good?' The missing piece isn't model quality - it's a judgment layer nobody's schema contains.

Published

June 2026

Length

7 min read

Topics

AI Orchestration · Delivery Intelligence

Wire a capable model to a well-built semantic model and ask it for a number. It will get the number right. Ask it whether the number is good and you'll get fluent hedging - because nothing in the data says which direction is healthy, what threshold matters, or which of two disagreeing systems to believe.

That was the wall I hit making an analytics agent useful to non-technical users: a media client's executive report, a Copilot Studio agent in front of it, and an MCP tool that could already query the live semantic model under the user's own row-level security. The plumbing worked. The failures weren't retrieval failures or reasoning failures. The agent was missing what every analyst in the building carries in their head and nobody had written down.

What the schema can't tell you

The scale of the gap was measurable. The report's semantic model had 843 measures across 99 table files, feeding 57 pages and 963 visuals — and zero descriptions. Not sparse descriptions: literally none, on any measure or column. The upstream financial model it wrapped was the same: hundreds of real DAX definitions, not one formal description property among them. Rich in structure — tables, relationships, lineage, display folders — and empty of judgment.

And the judgment questions are the ones users actually ask. Working through the plan, I kept a list of what no amount of metadata could answer:

Systems. What is each source system, in one plain-English sentence? Which one is authoritative for financials, which for client perception, which for employee engagement — and when the warehouse and an operational system disagree, who wins?
Terms. What's the difference between billings, net sales, revenue, and client income? What does "contribution" mean here? Every business uses these words; no two mean the same thing by them.
Direction and thresholds. Is a higher score always better, or only inside a band? At what value does anyone actually act?
Routing. When a user says "sales," do they mean net sales or billings? (Here: net sales, unless they say otherwise. Nothing in any schema says so.)
What absence means. Under row-level security, a zero-row result is ambiguous: no activity for that client, or no access to that client? The agent needs a rule — check visibility separately before declaring "there were no sales" — and that rule is judgment, not metadata.
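The zero-row rule can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: `visible_clients` is a hypothetical callback that returns the clients the current user can see under row-level security, and the `net_sales` column name and the response wording are assumptions.

```python
from enum import Enum
from typing import Callable, Dict, Sequence


class EmptyResult(Enum):
    NO_ACCESS = "no_access"
    NO_ACTIVITY = "no_activity"


def interpret_zero_rows(
    client: str,
    visible_clients: Callable[[], Sequence[str]],
) -> EmptyResult:
    """Decide what an empty result means for `client`.

    `visible_clients` is a hypothetical second lookup, run under the same
    user identity, that lists the clients this user is allowed to see.
    """
    if client not in set(visible_clients()):
        return EmptyResult.NO_ACCESS
    return EmptyResult.NO_ACTIVITY


def answer_sales(
    client: str,
    rows: Sequence[Dict[str, float]],
    visible_clients: Callable[[], Sequence[str]],
) -> str:
    """Turn query rows into a sentence without guessing what absence means."""
    if rows:
        total = sum(row["net_sales"] for row in rows)
        return f"Net sales for {client}: {total:,.0f}"
    # Zero rows: check visibility before claiming there was no activity.
    verdict = interpret_zero_rows(client, visible_clients)
    if verdict is EmptyResult.NO_ACCESS:
        return f"I can't see data for {client} with your current access."
    return f"There were no net sales recorded for {client} in this period."
```

The design choice is that the "no sales" sentence is only reachable after the visibility check has passed, so the agent cannot state absence of activity when the real cause is absence of access.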

The failure had a face, too. In early testing a user asked the agent for billings for a household consumer brand. The agent — knowledge sources attached, live query tool connected and working — explained what billings means, described which report page covers it, and never ran the query. The number was one tool call away. Nothing was broken: the agent simply had no written rule that a value question means call the tool, and its instructions had been authored entirely in help-desk terms. Judgment isn't only definitions; it's routing and behavior, and it has to be written down with the same care.
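Written down, that missing rule is small. The sketch below is one hypothetical way to express it in Python: a phrase table that encodes the "sales means net sales" default, plus a rule that any question naming a measure is a value question unless it is clearly asking for a definition. The phrase list, the regular expressions, and the action names are all illustrative assumptions, not the agent's actual instructions.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical routing table: phrase users say -> measure to query.
TERM_ROUTES = {
    "billings": "Billings",
    "net sales": "Net Sales",
    "sales": "Net Sales",  # default: "sales" means net sales unless stated
}

# Crude cues that the user wants an explanation rather than a number.
DEFINITION_PATTERNS = [
    re.compile(r"\bwhat (does|do) .+ mean\b"),
    re.compile(r"\b(define|definition of|difference between)\b"),
]


@dataclass(frozen=True)
class Route:
    action: str  # "query", "explain", or "ask_user"
    measure: Optional[str]


def route(question: str) -> Route:
    """Pick an action for a user question using the written rules above."""
    q = question.lower()
    measure = None
    # Try longer phrases first so "net sales" wins over "sales".
    for phrase in sorted(TERM_ROUTES, key=len, reverse=True):
        if re.search(rf"\b{re.escape(phrase)}\b", q):
            measure = TERM_ROUTES[phrase]
            break
    if any(pattern.search(q) for pattern in DEFINITION_PATTERNS):
        return Route("explain", measure)
    if measure is not None:
        # A named measure with no definition cue is a value question:
        # call the query tool instead of describing the report.
        return Route("query", measure)
    return Route("ask_user", None)
```

In an agent these rules would more likely live in its instructions than in code, but the content is the same: the default has to be stated somewhere the agent can read it.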

The moment that named the problem came while reviewing the project's open questions: every one was an implementation question - where should the dictionary file live, which parser, how much SME time. It took stepping back to ask where the questions were about what the systems are, what the names mean, whether a high score is good. That was a different list entirely, and the one the whole project actually depended on.

No amount of model capability recovers that, because it isn't latent in the data. It lives in people. An agent without it can describe your business numerically while understanding none of it.

The glossary nobody owns

The fix is unglamorous: a metric glossary that states, for every measure, what it means in business terms, which direction is good, what thresholds or benchmarks apply, which source is authoritative, and what caveats travel with it. Add a short description of each source system. The system-description template that survived contact with a real stakeholder (their review of the first draft: "spot on") has sections worth stealing:

Plain-English purpose - one paragraph, no acronyms
What it is authoritative for, and - just as important - what it is not
Score directionality and any thresholds that trigger action
Time-comparison rules (what "versus last year" means for this system)
Common phrases users say that should route here
Required caveats (for a survey system: always mention response rate)
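To make the shape of a glossary entry concrete, here is a minimal Python sketch. The field names, the three direction values, and the verdict wording are assumptions chosen for illustration; the article does not prescribe a schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class Direction(Enum):
    HIGHER_IS_BETTER = "higher"
    LOWER_IS_BETTER = "lower"
    WITHIN_BAND = "band"  # only healthy inside a range


@dataclass(frozen=True)
class GlossaryEntry:
    measure: str
    meaning: str                       # plain-English business meaning
    direction: Direction
    authoritative_source: str
    act_below: Optional[float] = None  # someone acts if the value drops below
    act_above: Optional[float] = None  # someone acts if the value rises above
    caveats: Tuple[str, ...] = ()


def judge(entry: GlossaryEntry, value: float) -> str:
    """Answer "is that good?" from written thresholds, never from the data."""
    if entry.act_below is None and entry.act_above is None:
        verdict = "no threshold recorded"
    elif entry.act_below is not None and value < entry.act_below:
        verdict = "needs attention (below threshold)"
    elif entry.act_above is not None and value > entry.act_above:
        verdict = "needs attention (above threshold)"
    else:
        verdict = "within the expected range"
    notes = "".join(f" Caveat: {caveat}" for caveat in entry.caveats)
    return f"{entry.measure} = {value:g}: {verdict}.{notes}"


def change_is_good(
    entry: GlossaryEntry, current: float, prior: float
) -> Optional[bool]:
    """Say whether a move from `prior` to `current` is an improvement."""
    if entry.direction is Direction.HIGHER_IS_BETTER:
        return current > prior
    if entry.direction is Direction.LOWER_IS_BETTER:
        return current < prior
    return None  # inside a band, direction alone cannot decide
```

Note that `judge` refuses to guess when no threshold has been recorded. That is deliberate: an explicit "no threshold recorded" tells the reviewer which entries still need a human decision.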

This document rarely exists, because it has no natural owner. It's beneath the data team's tooling and beside the business's day job. But it is the single highest-leverage artifact in an AI reporting project - more than model choice, more than orchestration.

Draft mechanically, confirm humanly

The encouraging discovery: you don't need weeks of stakeholder workshops. When I actually scanned the upstream model's source files, 812 of its 834 measures already had informal /// comment blocks — purpose notes developers had left for each other, never promoted to real descriptions. That, plus the DAX itself, the display-folder placement, and which visuals used each measure, meant eighty to ninety percent of a glossary could be drafted mechanically. A model can infer from structure that a measure is a prior-year comparison, returns a percentage, or is a wrapper around another measure — and for obvious names (churn, error rate) it will even guess direction correctly.
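A mechanical first pass of this kind might look like the Python sketch below. It assumes a plain-text model file in which `///` comment lines sit directly above a `measure Name = <DAX>` declaration; the exact file layout, the structural hints, and the name-based direction guesses are illustrative assumptions rather than the scanner actually used.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Assumed declaration shape: measure 'Quoted Name' = <DAX>  or  measure Name = <DAX>
MEASURE_RE = re.compile(r"^\s*measure\s+(?:'([^']+)'|([^\s=]+))\s*=\s*(.*)$")

# Names whose direction is obvious enough to guess (still unconfirmed).
LOWER_IS_BETTER_HINTS = ("churn", "error rate")


@dataclass
class DraftEntry:
    measure: str
    expression: str
    draft_meaning: Optional[str]           # taken from the /// comment, if any
    hints: List[str] = field(default_factory=list)
    guessed_direction: Optional[str] = None
    confirmed: bool = False                # flipped only by a human reviewer


def draft_glossary(source: str) -> List[DraftEntry]:
    """Draft glossary entries from comments and DAX; confirm nothing."""
    drafts: List[DraftEntry] = []
    pending: List[str] = []  # /// lines seen since the last non-comment line
    for line in source.splitlines():
        stripped = line.strip()
        if stripped.startswith("///"):
            pending.append(stripped[3:].strip())
            continue
        match = MEASURE_RE.match(line)
        if match:
            name = match.group(1) or match.group(2)
            expression = match.group(3).strip()
            entry = DraftEntry(name, expression, " ".join(pending) or None)
            upper = expression.upper()
            if "SAMEPERIODLASTYEAR" in upper:
                entry.hints.append("prior-year comparison")
            if "DIVIDE(" in upper or "%" in name:
                entry.hints.append("ratio or percentage")
            if any(hint in name.lower() for hint in LOWER_IS_BETTER_HINTS):
                entry.guessed_direction = "lower is better"
            drafts.append(entry)
        pending = []  # comments only attach to the line directly below them
    return drafts
```

Every draft starts with `confirmed=False`, which keeps the mechanical guesses visibly separate from what a subject-matter expert has signed off.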

What's left for humans is the judgment itself: the "is higher always better here?" pass, the thresholds, the authority calls when systems disagree. Scoped that way, the ask shrank to about an hour a week per subject-matter expert — enough to confirm the fifty or so measures that matter most within two or three weeks, using a workbook of prefilled questions rather than a blank page.

A model can read every number you have and still not know which direction is good. That knowledge is yours - and it has to be written down.

Draft the eighty percent, schedule one review for the twenty, and version the result next to the semantic model it describes.

The evidence says this is the bottleneck

I want to be careful not to oversell a war story, so here is the disconfirming check I ran on my own claim - and the claim held up.

The BIRD text-to-SQL benchmark attaches a human-written "external knowledge" sentence to each question - exactly the kind of fact a metric glossary holds. GPT-4's execution accuracy on BIRD was roughly 20 points higher with that knowledge than without it (about 55% versus 35%), and even with it the model stayed roughly 38 points below human performance. One written sentence of business context was worth more than most model upgrades. And on Spider 2.0, which tests real enterprise warehouses instead of clean academic schemas, the best model evaluated solved about 21% of tasks - while a model scoring 86.6% on the older academic Spider benchmark managed just 10.1% here. The missing ingredient in enterprise settings isn't SQL fluency; it's context that lives outside the schema.

The vendors have quietly reached the same conclusion, because the products now ship judgment-layer containers. Power BI's Prep data for AI adds AI instructions (model-level business context, exactly the "what does this mean and how should you read it" layer), verified answers, and a curated AI data schema. Fabric data agents take instructions and example query pairs for the same reason. Microsoft's own guidance for Copilot-ready models leads with three steps: write descriptions, fix ambiguous names, hide what doesn't matter. The judgment layer now has first-party places to live. What no product can do is write it - the directionality, the thresholds, the authority calls still have to come out of your people's heads.

Which is the real point: if you write the glossary once, as its own versioned artifact, you can feed it to whatever container this year's platform offers - a vector index, an instructions field, a data agent config - instead of re-authoring it inside each one.
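As a rough sketch of "write once, feed many containers", the Python below renders a single glossary into two hypothetical targets: a block of instruction text and a list of documents for an index. The dictionary keys and both output shapes are assumptions for illustration only and do not correspond to any product's actual schema.

```python
import json
from typing import Dict, List

# One glossary entry per measure; keys are illustrative, not a standard.
Glossary = List[Dict[str, str]]


def to_instructions(glossary: Glossary) -> str:
    """Render the glossary as plain text for an instructions-style field."""
    lines = []
    for entry in glossary:
        lines.append(
            f"- {entry['measure']}: {entry['meaning']} "
            f"Direction: {entry['direction']}. "
            f"Authoritative source: {entry['source']}."
        )
    return "\n".join(lines)


def to_index_documents(glossary: Glossary) -> List[Dict[str, object]]:
    """Render the same glossary as one document per measure for an index."""
    return [
        {
            "id": entry["measure"].lower().replace(" ", "-"),
            "text": f"{entry['measure']}: {entry['meaning']}",
            "metadata": {
                "direction": entry["direction"],
                "source": entry["source"],
            },
        }
        for entry in glossary
    ]


def load_glossary(raw_json: str) -> Glossary:
    """Load the versioned glossary artifact from its JSON text."""
    return json.loads(raw_json)
```

Because both renderers read the same versioned artifact, a correction to a meaning or a direction is made once and reaches every container on the next render.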

The short version

AI reporting projects stall on a missing document, not a missing capability. The judgment layer - meaning, direction, thresholds, authority - exists only in your people until someone writes it down, and the benchmark data says that context is worth more than model upgrades. Draft it mostly mechanically from the comments, DAX, and usage your model already contains; spend the scarce human hours confirming the judgment calls; version it next to the semantic model it describes. Skip it, and the smartest model you can buy will confidently narrate numbers it doesn't understand.