PMs Should Define What 'Good' Means for AI Features—Not Measure It
There's a version of the "empowered PM" that's quietly breaking AI teams: the PM who decides they need to own evaluations because they own the product spec. The logic feels sound. You own the outcome, so you should own the measurement. But in practice, this collapses the distinction between two very different jobs—and it makes PMs worse at both.
The argument here is specific: PMs should define what success looks like for an AI feature, including failure modes, acceptable error rates, and business constraints. Engineers and data scientists should own the eval framework, the methodology, and the iteration loop. Conflating these two responsibilities doesn't make the product better. It makes the PM a bottleneck and the evals less rigorous.
The Confusion at the Root of This Problem
When a PM says "I'm going to own evals," they usually mean one of two things:
- "I'll define what good looks like." This is PM work. It requires understanding users, business context, and what failure costs.
- "I'll design and run the evaluation framework." This is engineering and data science work. It requires statistical literacy, knowledge of eval methodologies (LLM-as-judge, human annotation pipelines, embedding-based similarity, regression test suites), and the ability to reason about measurement validity.
These are not the same job. The first requires deep product intuition. The second requires technical depth. Most PMs are strong at the first and underprepared for the second—not because they're incapable of learning, but because it's not where their leverage is.
The problem is that "evals" as a term has become a catch-all. In ML, an evaluation is a structured, statistically valid measurement of model behavior across a defined test set. It has methodology—how examples are selected, how outputs are scored, how results are aggregated. When PMs "run evals," what they typically do is look at 15-20 outputs, note the ones that feel wrong, and form an opinion. That's not an evaluation. That's a vibe check. And vibe checks create a specific, dangerous kind of false confidence.
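The difference is easiest to see in code. Here is a minimal sketch of what "structured, statistically valid measurement" means in practice: a defined test set, an explicit scoring rule, and aggregation over a sample large enough to mean something. `run_model` and `score` are hypothetical stand-ins, not any particular library's API.

```python
import random

def run_model(prompt: str) -> str:
    # Hypothetical stand-in for the model under test.
    return f"answer to: {prompt}"

def score(prompt: str, output: str) -> bool:
    # Hypothetical scoring rule: an automated check or a human label.
    # The point is that the rule is defined *before* looking at outputs.
    return len(output) > 0

def structured_eval(test_set: list[str], n_samples: int = 200) -> float:
    """Score a representative random sample of inputs and aggregate."""
    sample = random.sample(test_set, min(n_samples, len(test_set)))
    passed = sum(score(p, run_model(p)) for p in sample)
    return passed / len(sample)  # pass rate over a defined sample

# A vibe check is the same loop run over ~15 hand-picked examples,
# with no scoring rule and no aggregation: an opinion, not a measurement.
test_set = [f"user query {i}" for i in range(1000)]
pass_rate = structured_eval(test_set)
```

The methodology lives in the three choices the sketch makes explicit: how `test_set` is sampled, what `score` counts as success, and how results are aggregated.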

How PM-Owned Evals Create False Confidence
Here's what actually happens when a PM takes ownership of running evals on an AI feature:
They look at outputs. Some are impressive. Some are bad. They focus on the impressive ones because those validate the direction they've already committed to. They flag the bad ones, but since they don't have a framework for understanding why the model failed or how frequently it fails at that type of input, the feedback they give engineers is anecdotal: "This output felt off" or "The tone here was wrong."
Engineers then fix the specific examples the PM flagged. The PM re-reviews. The fixed examples look better. The PM approves. The feature ships with failure modes that were never systematically tested because the eval process was driven by examples that happened to be visible, not a representative sample of real inputs.
This is cherry-picking with extra steps. The PM feels like they did rigorous quality work. The engineer feels like they satisfied the feedback. And the user encounters failure modes that neither of them saw coming.
There's also a speed problem. When PMs own eval execution, they become the critical path. Engineers finish a model update, then wait for the PM to review outputs. The PM has other priorities—roadmap work, stakeholder updates, user research. Reviews slip by days. Iteration slows. In a domain where model quality improves through rapid, tight feedback loops, this is a real cost.
What PMs Should Actually Specify
The PM's job in an AI feature isn't to measure quality—it's to define what quality means in terms that can be measured. That's a distinct skill, and it's genuinely hard to do well.
Concretely, this means PMs should produce:
- Success thresholds: Not "the output should be accurate" but "the feature should produce outputs that users rate as helpful ≥80% of the time in usability testing, and factual errors should occur in fewer than 5% of responses on our core use case inputs."
- User-facing failure modes, ranked by severity: What are the worst things the model can do? Hallucinating a competitor's product name is a different severity than hallucinating a medical dosage. PMs should define this hierarchy, because it determines how the engineer prioritizes eval coverage.
- Business constraints: Latency requirements, cost-per-query limits, compliance boundaries. These constrain what "good enough" means technically.
- Edge cases that matter most to the user: Not a comprehensive test suite—that's engineering work—but the scenarios that represent the highest-stakes or highest-frequency user situations. These become the inputs that the engineer's eval suite must cover.

Notice what's absent from this list: eval methodology, test set construction, scoring rubrics, inter-annotator agreement, LLM-as-judge prompt design. Those are engineering decisions. The PM's spec should be the input to those decisions, not a replacement for them.
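One way to keep that boundary crisp is for the PM's spec to be structured data that the engineer's eval framework reads directly. This is a sketch only; the metric names and numbers are illustrative, not a standard schema:

```python
# Hypothetical acceptance-criteria document. The PM defines the targets;
# how each metric is actually measured is the engineer's decision.
acceptance_criteria = {
    "success_thresholds": {
        "helpfulness_rating": {"min": 0.80, "measured_in": "usability testing"},
        "factual_error_rate": {"max": 0.05, "measured_in": "core use case inputs"},
    },
    "failure_modes": [  # ranked by severity, worst first
        {"name": "hallucinated medical dosage", "severity": "critical"},
        {"name": "hallucinated competitor product name", "severity": "high"},
        {"name": "off-brand tone", "severity": "low"},
    ],
    "constraints": {
        "p95_latency_ms": 1500,
        "max_cost_per_query_usd": 0.02,
    },
    "must_cover_inputs": [
        "highest-frequency core use case queries",
        "high-stakes edge cases named by the PM",
    ],
}
```

Everything in this document is a product decision; nothing in it dictates whether the engineer uses LLM-as-judge, human annotation, or a regression suite to measure it.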
The Collaboration Model That Actually Works
The right split looks like this:
PM owns:
- Writing acceptance criteria before engineering starts (not after)
- Defining the failure modes that matter to users and the business
- Setting the threshold for "good enough to ship"
- Reviewing eval results against acceptance criteria—not the eval methodology itself
- Deciding whether the product meets the bar, given what the evals show
Engineer/data scientist owns:
- Designing the eval framework and test set
- Selecting eval methodology (automated scoring, human annotation, regression suites)
- Running evals and iterating on model behavior
- Surfacing results in a format the PM can reason about
- Flagging when the acceptance criteria are technically unmeasurable or need refinement
Both iterate on:
- Whether the acceptance criteria are the right ones (sometimes engineering finds failure modes the PM didn't anticipate)
- Whether the eval results actually answer the product question the PM cares about
- The go/no-go decision before launch
| PM Defines, Engineer Measures | PM Owns End-to-End Evals |
| --- | --- |
| Evals are statistically valid and representative | Review cycles slow iteration to the PM's availability |
| Engineers can iterate quickly without the PM as a bottleneck | Cherry-picked examples create false confidence |
| PM focuses on user impact, not methodology | PM spends time on technical work outside their leverage |
| Failure modes are caught systematically, not anecdotally | Engineers optimize for what the PM notices, not what matters |
| Clear ownership means clear accountability | Methodology gaps go undetected until production |
The key mechanism here is the acceptance criteria document. If the PM writes clear, specific, measurable criteria before the engineer builds the eval framework, the engineer has a target to design toward. If the PM writes criteria after seeing the engineer's eval results, they're rationalizing whatever was built—which is exactly the rubber-stamping problem this split is meant to prevent.
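That ordering can be enforced mechanically: the criteria exist first, and the ship decision is a check of measured results against them. A hedged sketch, with hypothetical metric names and illustrative numbers:

```python
# Hypothetical ship gate. The PM wrote CRITERIA before the eval framework
# existed; the engineer's pipeline produces `results`; the gate compares.
CRITERIA = {
    "helpfulness_rating": {"min": 0.80},
    "factual_error_rate": {"max": 0.05},
}

def meets_bar(results: dict[str, float]) -> tuple[bool, list[str]]:
    """Return (go/no-go, list of criteria the results failed)."""
    failures = []
    for metric, bounds in CRITERIA.items():
        value = results[metric]
        if "min" in bounds and value < bounds["min"]:
            failures.append(f"{metric}={value:.2f} below min {bounds['min']}")
        if "max" in bounds and value > bounds["max"]:
            failures.append(f"{metric}={value:.2f} above max {bounds['max']}")
    return (not failures, failures)

# Helpfulness clears the bar, but the error rate does not: no-go.
ok, failures = meets_bar({"helpfulness_rating": 0.84, "factual_error_rate": 0.07})
```

The PM's review is of `failures`, not of the pipeline that produced `results`; that is the whole division of labor in one function.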

Why This Distinction Is Harder Than It Sounds
There's a real tension in this model: PMs who don't understand eval methodology at all can't write acceptance criteria that are actually measurable. If you specify "the model should never hallucinate" as a success threshold, you've written a criterion that's technically unmeasurable (hallucination detection is itself an unsolved problem) and practically useless.
So the right answer isn't "PMs should stay ignorant of how evals work." It's "PMs should understand eval methodology well enough to write measurable criteria, without trying to own the execution."
That's a calibration question. A PM building AI features needs enough technical literacy to know:
- What kinds of outputs can be automatically scored vs. require human judgment
- Why a test set needs to be representative of real user inputs, not cherry-picked examples
- What a precision/recall tradeoff means for their specific feature
- Why "the model got this right" in a demo doesn't tell you anything about production performance
This is similar to how a PM who ships a data dashboard doesn't need to write SQL, but does need to understand whether the metric being tracked is the right one. The judgment layer is PM work. The execution layer is engineering work.
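The precision/recall item on that literacy list can be grounded in a few lines. Suppose, as a hypothetical, the feature flags risky responses: precision is how often a flag is correct, recall is how many risky responses get flagged, and tuning one up typically pushes the other down. The labels below are invented for illustration:

```python
# truth[i]: whether response i was actually risky (illustrative labels).
truth = [True, True, True, False, False, False, False, False]

def precision_recall(truth: list[bool], flagged: list[bool]) -> tuple[float, float]:
    tp = sum(t and f for t, f in zip(truth, flagged))        # correctly flagged
    fp = sum((not t) and f for t, f in zip(truth, flagged))  # flagged but safe
    fn = sum(t and (not f) for t, f in zip(truth, flagged))  # risky but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Aggressive flagger: catches every risky response, but over-flags.
aggressive = [True, True, True, True, False, False, False, False]
# Conservative flagger: never wrong when it flags, but misses most.
conservative = [True, False, False, False, False, False, False, False]

p_a, r_a = precision_recall(truth, aggressive)    # precision 0.75, recall 1.0
p_c, r_c = precision_recall(truth, conservative)  # precision 1.0, recall ~0.33
```

Which point on that tradeoff is acceptable is exactly the kind of judgment the PM's acceptance criteria should encode.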
The Real Cost of Getting This Wrong
When PMs try to own evals without the technical chops to do it rigorously, two failure modes emerge—and they're both bad:
Rubber-stamping: The PM doesn't really understand what the engineer built, so they approve it because the outputs look reasonable on the examples they checked. Problems surface in production. The PM is blamed for approving something they couldn't actually evaluate.
Bottlenecking: The PM takes their quality ownership seriously and reviews every model update carefully. Iteration slows to the PM's review cycle. The engineering team loses momentum. The feature takes twice as long to reach acceptable quality because the feedback loop is broken.
Either way, the product suffers. The fix isn't to make PMs better at running evals. It's to redirect PM energy toward the thing that actually requires their specific expertise: defining what the product needs to do for users, in terms precise enough that engineers can build a framework to test it.
Owning the definition of success is harder than it sounds. Most PMs underinvest in writing specific, measurable acceptance criteria because it's easier to review outputs and give impressionistic feedback. But that's exactly where PM leverage is highest in AI development—not in the eval pipeline, but in the clarity of the target the pipeline is designed to hit.