Split AI Work by Decision Authority, Not by Artifact Type
The most common fight on an AI project team sounds like this: the PM wants to tweak a prompt, the engineer thinks the PM is slowing everything down, and nobody is sure who actually owns whether the output is good enough to ship. Both people are doing their jobs as they understand them. The problem is that nobody defined the jobs correctly.
The default split—PM writes prompts, engineer writes code—feels intuitive because it maps AI work onto familiar role boundaries. But it's a categorization of artifacts, not decisions. And in AI projects, artifact ownership without decision authority is what creates both the bottlenecks and the accountability gaps that kill projects.
Why the Artifact-Based Split Fails
When a PM owns prompts, they become a required checkpoint on every iteration cycle. Engineers can't run experiments without waiting for PM-approved prompt changes. The PM, meanwhile, is doing something they're often not best positioned to do—making low-level technical choices about phrasing, structure, and token efficiency—while still not having formal authority over the thing that actually matters: whether the system is working.
This creates a specific failure mode. The PM is accountable for the prompt but not for the eval results. The engineer is accountable for the model infrastructure but not for whether the outputs are good. When the system underperforms, both parties have a defensible position and nobody has clear ownership of the outcome.
The engineer-owned alternative isn't much better. When engineers own both the code and the prompts, they end up making subjective quality calls that should belong to the PM—decisions like "is this output good enough for users?" or "does this tone match what we promised?" Engineers are not well-positioned to make those calls, and they know it. The result is either paralysis (engineers asking for PM sign-off on every output judgment) or drift (engineers shipping based on their own intuitions about quality, which may not match user or business needs).
The Decision-Authority Model
The better split isn't about who touches which file. It's about who has final say on which class of decision.
PMs own:
- What "good" looks like (the evaluation criteria)
- The thresholds required to ship (the gate conditions)
- Which failure modes are acceptable and which are blockers
- When to change the goal based on user or business feedback
Engineers own:
- How to reach the defined threshold (model choice, architecture, retrieval strategy)
- Whether a specific prompt formulation serves the technical goal
- When to run experiments and what to vary
- How to instrument the system to surface the data the PM needs
Under this model, the PM isn't reviewing prompts—they're reviewing eval results against criteria they defined. If the eval passes, the engineer's implementation choices don't require PM approval. If it fails, the PM decides whether to adjust the criteria, adjust the timeline, or block the ship. The engineer decides how to fix the underlying problem.
This model also clarifies prompt ownership in a principled way. Prompts are implementation details—they're one lever an engineer pulls to hit the eval threshold the PM defined. The PM doesn't need to approve prompt changes any more than they need to approve a refactor. What they do need to approve is any change to the eval criteria or ship gate, because those are business and user decisions.
Responsibility Matrices for Three Common AI Project Types
The decision-authority model plays out differently depending on what you're building. Here's how it maps to three common AI project types.
Classification Systems (e.g., content moderation, intent detection)
Classification is the most structured AI problem type, which makes accountability clearest.
| Decision | Owner | Rationale |
|---|---|---|
| Precision/recall threshold | PM | This is a business tradeoff—false positives vs. false negatives have different costs to users |
| Which classes to detect | PM | Driven by product requirements and user needs |
| Model architecture and training approach | Engineer | Technical implementation choice |
| Prompt design for few-shot classification | Engineer | Implementation lever toward PM-defined threshold |
| Labeling guidelines for ambiguous cases | Joint | Requires PM's knowledge of user intent + engineer's knowledge of model behavior |
| Ship/no-ship decision | PM | Owns the gate condition |
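The ship gate the PM owns in this matrix can be encoded directly, so engineers iterate against it without per-experiment sign-off. A minimal sketch, assuming per-class precision/recall targets the PM has set; the class names and numbers here are hypothetical, not from any real project:

```python
# PM-owned thresholds live as data; the engineer owns how the
# system reaches them. Values are illustrative.
PM_THRESHOLDS = {
    "spam":  {"precision": 0.95, "recall": 0.80},
    "abuse": {"precision": 0.90, "recall": 0.85},
}

def ship_gate(metrics: dict) -> tuple[bool, list[str]]:
    """Return (passes, blockers) against PM-defined per-class targets."""
    blockers = []
    for cls, targets in PM_THRESHOLDS.items():
        for metric, floor in targets.items():
            observed = metrics.get(cls, {}).get(metric, 0.0)
            if observed < floor:
                blockers.append(f"{cls}.{metric}: {observed:.2f} < {floor:.2f}")
    return (not blockers, blockers)

# Example eval run: one class misses its recall target, so the gate fails
ok, blockers = ship_gate({
    "spam":  {"precision": 0.97, "recall": 0.82},
    "abuse": {"precision": 0.93, "recall": 0.78},
})
```

The point of the shape: a failed gate returns the specific blockers, which is exactly the conversation the PM and engineer need to have next.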
Generation Systems (e.g., AI writing assistants, summarization, chatbots)
Generation is where the artifact-based split causes the most damage, because output quality feels subjective and everyone has opinions about prose.
| Decision | Owner | Rationale |
|---|---|---|
| Evaluation rubric (what makes a good output) | PM | This is a product definition question, not a technical one |
| Tone and voice guidelines | PM | Brand and UX decisions |
| Eval methodology (human rating, LLM-as-judge, automated metrics) | Engineer | Technical implementation of the measurement system |
| Prompt structure and iteration | Engineer | Implementation lever; PM reviews results, not prompts |
| Which outputs are "good enough" to ship | PM | Gate condition based on rubric |
| How to handle edge cases the rubric doesn't cover | Joint | Requires rubric expansion (PM) and technical triage (engineer) |
The key discipline for generation systems: the PM must write the eval rubric before any prompts are written. If the rubric comes after, the prompts will implicitly define quality—and the PM will have already ceded the decision that matters most.
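One way to make the rubric exist as a first-class artifact rather than an implicit property of the prompts is to encode it as data the eval harness consumes. A hypothetical sketch; the dimension names, weights, and floors are invented for illustration:

```python
# PM-owned rubric as data: each dimension carries a weight toward the
# overall score and a per-dimension floor that no output may fall below.
# The eval harness (engineer-owned) produces 1-5 ratings per dimension.
# All names and numbers are illustrative.
RUBRIC = {
    "accuracy":    {"weight": 0.4, "floor": 4.0},
    "tone":        {"weight": 0.3, "floor": 3.5},
    "conciseness": {"weight": 0.3, "floor": 3.0},
}

def score_output(ratings: dict) -> dict:
    """Apply the PM's rubric to one output's per-dimension ratings."""
    weighted = sum(RUBRIC[d]["weight"] * ratings[d] for d in RUBRIC)
    floor_failures = [d for d in RUBRIC if ratings[d] < RUBRIC[d]["floor"]]
    return {"score": round(weighted, 2), "floor_failures": floor_failures}

# A decent weighted score can still fail: tone is below its floor
result = score_output({"accuracy": 4.5, "tone": 3.0, "conciseness": 4.0})
```

Because the rubric is data rather than folklore, changing it is a visible PM decision, while prompt changes that move the score are invisible to the PM by design.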
Retrieval-Augmented Generation (RAG) Systems (e.g., enterprise search, knowledge bases)
RAG introduces a retrieval layer that creates a new class of decisions about relevance—and a new potential accountability gap.
| Decision | Owner | Rationale |
|---|---|---|
| What "relevant" means for a given query | PM | Requires understanding of user intent and task context |
| Retrieval evaluation criteria (precision@k, MRR, etc.) | Joint | PM defines user-facing relevance; engineer translates to measurable metric |
| Chunking strategy and embedding approach | Engineer | Technical implementation |
| Reranking logic | Engineer | Implementation lever toward PM-defined relevance criteria |
| How to handle queries with no good retrieval result | PM | Failure mode decision with UX implications |
| Index update frequency and staleness tolerance | Joint | Business requirement (PM) with infrastructure cost implications (engineer) |
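The joint decision on retrieval metrics becomes concrete once the PM's relevance judgments exist as labels. The two metrics named in the table, precision@k and MRR, are simple to compute from those labels; a sketch, with hypothetical document ids:

```python
# Standard retrieval metrics over PM-labeled relevance judgments.
# `ranked` is the engineer's retrieval output (doc ids in rank order);
# `relevant` is the set of ids the PM judged relevant for the query.

def precision_at_k(ranked: list, relevant: set, k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def reciprocal_rank(ranked: list, relevant: set) -> float:
    """1/rank of the first relevant result, 0.0 if none retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1 / rank
    return 0.0

# MRR is the mean of reciprocal_rank across a query set.
rr = reciprocal_rank(["d7", "d2", "d9"], {"d2", "d4"})  # first hit at rank 2
```

The division of labor holds: the engineer implements and reports these numbers; the PM decides whether "precision@5 of 0.6" is acceptable for the use case, because only the PM knows what a missed result costs the user.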
How the Two Models Compare
Decision-Authority Model
- PM accountability is tied to outcomes, not artifacts
- Engineers can iterate without waiting for PM approval on implementation details
- Clear ownership of ship/no-ship decisions prevents stalemates
- Eval criteria are defined before implementation begins, not reverse-engineered from outputs
- Scales across project types without constant renegotiation
Artifact-Based Split
- PM becomes a bottleneck on every prompt iteration cycle
- Engineers make implicit quality calls without authority or context
- Nobody owns the outcome when the system underperforms
- Eval criteria emerge late and are often shaped by what the system already does
- Requires constant renegotiation as the project evolves
The Two Failure Modes This Model Is Designed to Prevent
The decision-authority model is specifically constructed to prevent two failure modes that the artifact-based split makes almost inevitable.
Failure Mode 1: The PM Bottleneck. When PMs own prompts, every experiment requires their involvement. In a fast-moving AI project, this means engineers are idle while waiting for PM input on technical implementation details. The fix isn't to exclude PMs—it's to move their involvement upstream. If the PM has already defined the eval criteria and thresholds, the engineer can run dozens of prompt experiments without PM involvement, because the PM's judgment is already encoded in the eval. The PM reviews results on a cadence, not every individual change.
Failure Mode 2: Engineer Drift. When engineers own both implementation and quality judgment, they gradually make product decisions without realizing it. A generation system's tone shifts. A classifier's threshold gets adjusted because the engineer thought the false positive rate looked too high. A RAG system starts filtering out sources the engineer found noisy. These are all PM decisions made by engineers in the absence of clear criteria. The fix is for PMs to define those criteria explicitly and in advance—not to audit engineer decisions after the fact.
Making This Work in Practice
The model requires two disciplines from PMs that non-AI projects rarely demanded.
First, PMs must write eval criteria before engineers write a single prompt. This is the most important discipline change. The criteria don't need to be perfect—they'll evolve—but they need to exist. A generation system needs a rubric. A classifier needs defined precision/recall targets per class. A RAG system needs a definition of relevance for the top use cases. Without this, engineers will fill the vacuum with their own judgments.
Second, PMs must understand what is and isn't measurable. Not every quality dimension can be captured in an automated eval. When something can't be measured automatically, the PM needs to own the human evaluation process—defining the rating scale, the sample size, the frequency. Delegating this to engineers means delegating the quality definition.
For engineers, the model requires one key discipline: surface the eval results, not just the outputs. The PM can't exercise decision authority on data they don't have. Engineers who build good eval instrumentation give PMs what they need to make ship decisions confidently. Engineers who show the PM individual outputs and ask for a thumbs up are forcing the PM back into the artifact review loop.
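Surfacing results rather than outputs can be as simple as rolling per-example eval records up into the view the PM actually decides on. A hypothetical sketch; the record shape and field names are invented for illustration:

```python
from collections import Counter

def eval_report(results: list) -> dict:
    """Aggregate per-example eval results into a PM-facing summary.
    Each result is {"passed": bool, "failure_mode": str | None}.
    Shape and field names are illustrative."""
    n = len(results)
    passed = sum(1 for r in results if r["passed"])
    failures = Counter(r["failure_mode"] for r in results if not r["passed"])
    return {
        "pass_rate": round(passed / n, 3),
        "n": n,
        "top_failure_modes": failures.most_common(3),
    }

report = eval_report([
    {"passed": True,  "failure_mode": None},
    {"passed": False, "failure_mode": "hallucination"},
    {"passed": False, "failure_mode": "hallucination"},
    {"passed": True,  "failure_mode": None},
])
```

A pass rate and a ranked list of failure modes is a ship decision waiting to be made; a folder of individual outputs is a request for the PM to do the engineer's triage.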
The Practical Starting Point
If you're setting up an AI project now, start with a single document before any code is written: the eval spec. It should define what success looks like, what the minimum threshold is to ship, and what failure modes are blockers. The PM writes this. The engineer reviews it for measurability and flags anything that can't be instrumented.
Everything downstream—model choice, prompt design, retrieval architecture—is the engineer's call, constrained by the eval spec. The PM's ongoing job is to keep the eval spec current as user feedback and business requirements evolve, and to make the ship decision when the system hits the threshold.
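What the eval spec contains can be sketched as a single structure the engineer can review for measurability and instrument against. Every field name and example value below is hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class EvalSpec:
    """The pre-implementation document, as data. The PM authors it;
    the engineer reviews each criterion for measurability.
    Fields and values are illustrative, not a prescribed schema."""
    success_definition: str       # what "good" means, in prose
    ship_thresholds: dict         # metric name -> minimum value to ship
    blocker_failure_modes: list  # any occurrence blocks the ship
    measured_by: dict = field(default_factory=dict)  # metric -> method

SPEC = EvalSpec(
    success_definition="Summaries are accurate and match the product voice",
    ship_thresholds={"rubric_score": 4.0, "factuality_pass_rate": 0.98},
    blocker_failure_modes=["fabricated citation", "leaked system prompt"],
    measured_by={
        "rubric_score": "human rating, weekly sample of 50 outputs",
        "factuality_pass_rate": "automated claim check",
    },
)
```

The `measured_by` field is where the engineer's review bites: a threshold with no feasible measurement method gets flagged before anyone has written a prompt to hit it.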
That's not a smaller PM role. It's a more precisely located one—and it's the location where PM judgment actually has leverage.