research

Your Semantic Layer Alone Is Not Ready for Agentic Analytics

Classic analytics needs definitions. Conversational analytics needs a human in the loop. Agentic analytics needs decision fitness: the definitions, assumptions, owners, constraints, provenance, and stop conditions that tell an agent when a number is safe to act on.

Olivier Dupuis — 10 Jun 2026 — 9 min read min read

Your Semantic Layer Alone Is Not Ready for Agentic Analytics

Classic analytics needs definitions. Conversational analytics needs a human in the loop. Agentic analytics needs lineage.

Data gets consumed in more and more ways, and the older ones aren't going anywhere. The interesting question isn't which mode wins. It's how much semantic context each one needs before you can trust it.

The same data foundation, consumed three ways. The human recedes and the agent steps in: from the dashboards and notebooks you read, to the conversations you have, to agents that act on the data on their own.

One foundation on the left, three ways to consume it on the right. The modern data stack ingests, transforms, and stores the data. The semantic engine sits on top and gives it meaning. Then the same governed data flows into three lanes: classic (dashboards, notebooks, apps, ML features, where people read), conversational (people ask, an agent interrogates), and agentic (an agent acts, with no one watching). Move down the lanes and the human recedes while the agent takes over.

I'm using "agentic" in the stricter sense here: not just an agent writing SQL or answering a question, but an agent using the answer to make or trigger a business decision without a human approving every assumption.

The claim I want to test is simple: each lane needs a different depth of semantics before it's safe to rely on. So I took one dataset and watched how it behaves as you climb.

The engine has depth

The left side is the part most teams already have. The modern data stack is mature, and so is the first layer of meaning on top of it. But the semantic engine isn't one thing, and that turns out to be the whole story.

There's a thin layer, the semantic model: entities, joins, and the formula for revenue or customer lifetime value. This is what LookML, Cube, and dbt metrics give you, and modern ones are good. And there's a deeper layer, semantic lineage: not how the number is computed, but what it rests on. That the lifespan inside it is a moving assumption owned by another team, not a fact. That "CLV" means one thing to the bidding system and another to the board, so there's more than one right answer. That here's what would silently change it. A model can hold a very sophisticated formula. It can't hold any of that, and that gap is what separates the three lanes.

Classic: the lane the semantic model was built for

Classic analytics is still most of how data gets used, and it isn't going anywhere: dashboards, notebooks, the internal app that calls a metrics endpoint, the feature feeding a churn model. What they all need is one consistent definition. Define revenue once, and the finance dashboard, an analyst's notebook, the billing app, and the model all agree.

That's exactly what the semantic model does, and it's a solved problem. The depth this lane needs is the thin layer. The decision about what revenue means was made once, by whoever built the metric, and every surface just reads it. Nobody is choosing anything at runtime.

Conversational: the first probe

The middle lane is newer: someone asks the data a question in plain language, an agent answers. The fear was accuracy, an agent inventing a formula and stating it with two decimals of confidence. So I went looking for the floor. How little semantics can this lane get away with?

I took Google's public theLook eCommerce dataset (bigquery-public-data.thelook_ecommerce, reproduce it yourself), pointed a Hex agent at it warehouse-only, no semantic model, no definitions, nothing, and asked for the most definition-loaded metric I could think of. Customer lifetime value. The formula is standard, but every input is a company decision the schema can't answer: revenue or margin, which customers count, and a lifespan that lives nowhere in the data. Three skeptical questions, each aimed at one of those decisions.

Here's how our agent performed.

Q1. "What's our average customer lifetime value?"

Average customer LTV is $122.01 across ~65.9K customers (excluding cancelled/returned items). The median is notably lower at $79.99 — typical right-skew where a tail of high-spend customers pulls the mean up. [...] If you'd like, I can break LTV down by traffic source, country, signup cohort, or gender, or compute a true predictive LTV using order frequency and tenure.

Q2. "Is that on revenue or gross margin?"

That's revenue (sum of sale_price), not gross margin. I didn't factor in product cost. theLook has a products table with a cost field, I can recompute LTV as gross margin (sale_price − cost) if you want.

Q3. "Over what lifespan, the time we've observed or an assumed one?"

Observed, not assumed. [...] A customer who signed up last week and one who's been around for 3 years are both averaged in at face value, which biases the mean downward (newer customers haven't had time to accumulate spend). [...] build a predictive LTV using order frequency, AOV, and an assumed customer lifespan (e.g. 24 or 36 months) [...] Which direction is useful?

Three for three, with no semantics at all. Everything I asked lives in the data or in standard practice, and a 2026 agent that plans, runs SQL, and reasons over the result clears that bar on its own.

A capable agent can surface the right ambiguity before a semantic model exists. But that is not the same as production self-service. At scale, conversational analytics still needs routing, maintained context, evals, and correction loops. The difference is that the human can still close the loop when the agent surfaces uncertainty.

Notice why the lane works here: every answer ended with a question. "Want me to recompute as margin?" "Which direction is useful?" The agent never committed. It surfaced the choice and handed it back, and I was there to make it. The lane works because a human closes every loop the agent opens.

Agentic: the decision experiment

The conversational probe asked a lookup, and a lookup is exactly where a good agent looks fine. The agentic test has to ask a decision, because the gap between a model and its lineage only opens when a definition has to fit a purpose. So here's the prompt I put to the same agent, the kind a workflow would carry with no human attached:

"Set next quarter's acquisition budget from customer lifetime value. How much can we afford to spend to acquire a customer?"

CLV answers one question: what is a customer worth? Acquisition budgeting asks another: how much are we allowed to spend to get one? To move from the first to the second, the agent needs more than a formula. It needs the business policy: gross margin or contribution margin, observed or projected value, 6-month or 12-month payback, blended or channel-specific CAC, conservative or growth-mode economics. If that policy is not provided, the agent has to supply it.

That is the real test.

Tier 1: warehouse-only. No definitions. I expected the agent to reach for revenue per buyer, $122, and size spend against it. It did start there, but the result was more interesting than the expected failure. When pressed, it separated revenue CLV from gross-margin CLV, identified that the number was observed rather than projected, and then built a tenure curve before making a recommendation:

Tenure	Cumulative gross margin
First order (M0)	$45.47
3 months	$48.78
6 months	$50.84
12 months	$54.30
24 months	$59.81

It concluded that most margin came from the first order, used the 12-month margin of $54.30, applied standard LTV:CAC ratios, and recommended an ~$18 blended CAC ceiling.

That is not a dumb answer. It is probably the kind of answer a good analyst would sketch in a first pass. But notice what happened: the agent invented the decision frame. It chose a 12-month basis. It chose blended CAC. It chose a generic 3:1 LTV:CAC benchmark. None of those choices came from governed business context.

Hex noticed this too. Its thread inspector raised warnings that the thread had not surfaced an endorsed CLV definition, and that the approved time horizon was unclear: historical-to-date, fixed-horizon observed value, or projected lifetime value. That is the Tier 1 finding in one sentence: the agent can reason, but the business decision frame is under-specified.

Tier 2: plus a Cube semantic model. Next I defined the company's reported CLV once in clv.cube.yml:

Now the agent did what a semantic layer is supposed to make it do. It returned the official CLV: $63.71, gross margin per acquired customer, net of cancellations and returns, lifetime to date. When I asked whether that was revenue or margin, it answered directly from the semantic definition. When I asked whether the lifespan was observed or assumed, it answered directly from the model description: historical and realized, no forward projection.

That is a real improvement. The semantic model made the metric consistent.

Then came the decision prompt again. The agent took the reported CLV of $63.71 and converted it into CAC ceilings:

Approach	Max CAC	Rationale
Breakeven	$63.71	Recover acquisition cost over observed lifetime
LTV:CAC = 3:1	$21.24	Healthy unit economics
LTV:CAC = 4:1	$15.93	Conservative
LTV:CAC = 2:1	$31.86	Aggressive growth

Its recommendation: target about $20 blended CAC, capped at $30.

Again, the answer is plausible. It even caveats the metric correctly: the $63.71 is realized margin, not projected; it is gross margin, not contribution margin. But it still turns the metric into policy. The model governed the number. It did not govern the decision.

That is the Tier 2 finding: the semantic model makes the agent consistent, but it doesn't tell the agent whether the reported CLV is the right CLV for acquisition budgeting. It still has to invent the payback rule, the LTV:CAC target, and the channel policy.

Tier 3: plus semantic lineage. Same model, plus a guide (clv-lineage-guide.md) that carries what the model structurally can't:

CLV is not one number. The reported clv is for the QBR. Acquisition bidding uses the forward CLV, margin projected over the modeled customer lifespan, ideally per channel. Different figures, different decisions.
The lifespan is an assumption. 36 months, modeled by retention, re-estimated yearly.
To act: size acquisition spend on the forward CLV, never the reported one and never revenue. If the forward figure isn't available, surface the assumption instead of substituting the historical number.

This time the agent behaved differently from the start. It still returned the reported CLV, about $63.89, but immediately framed it as the historical, realized QBR number. For acquisition bidding or budget decisions, it said, this is the wrong number. The right number would be forward, per-channel CLV using the retention team's 36-month lifespan assumption.

Then came the decision prompt again. This time the agent stopped:

Stopping before I give you a number — this is exactly the decision the guide flags as the most common way CLV goes wrong.

It rejected the reported $63.89 CLV as the input for acquisition budgeting because it is historical, censored, and not segmented by channel. It named the right shape of the missing input:

forward_CLV_channel = (monthly gross margin per customer in channel) x 36-month lifespan
max CPA_channel    = forward_CLV_channel, capped by finance's 12-month payback window

And it named what was missing: confirmation of the 36-month lifespan assumption, monthly margin run-rate by channel, and the current finance payback window. It offered to produce a per-channel realized margin table, but explicitly framed that as a historical input, not a bid ceiling.

That is the difference. The lineage guide didn't make the agent smarter at arithmetic. It changed the action. It stopped the reported metric from becoming an unauthorized budget policy.

Here is the experiment in one table:

Tier	Context	What the agent did	Finding
Warehouse only	Tables and SQL	Reasoned well, built a tenure curve, recommended ~$18 CAC	Capable, but it invented the decision frame
Semantic model	Official reported CLV	Used the governed gross-margin CLV, then recommended ~$20 CAC	Metric governed, decision still inferred
Semantic lineage	Model plus decision context	Stopped, rejected reported CLV for bidding, surfaced missing assumptions	Decision fitness governed

The throughline

It's one line, left to right across that diagram: ingest, interrogate, act. The data stack handles ingest. The semantic model plus a capable agent already handles interrogate. Act is the open one, and it's open for a structural reason, not a model-quality one. Acting without a human means there's no one to answer the second question, so the answer has to exist before the question is asked.

This is also why Anthropic's recent self-service analytics write-up is an interesting signal. The production pattern they describe is not just giving Claude access to a warehouse and hoping for the best. It is semantic-layer-first routing, curated skills, business context, evals, provenance, and correction loops. Different use case, same direction of travel: the model can calculate, but the system around the model has to reduce ambiguity and tell it what the calculation is allowed to mean.

The depth of semantics each lane needs climbs as you go. Classic runs on a semantic model. Conversational gets by on a good agent in a narrow setting because a human can close the loop. At production scale, it still needs skills, evals, maintained context, and routing. Agentic needs the full engine, lineage included, because the agent has to know whether the metric is fit for the action before it acts.

A semantic layer makes metrics consistent. Agentic analytics needs more than consistency. It needs decision fitness: the definitions, assumptions, owners, constraints, provenance, and stop conditions that tell an agent what a number is allowed to do.

research