Split AI Work by Decision Authority, Not by Artifact Type
The most common fight on an AI project team sounds like this: the PM wants to tweak a prompt, the engineer thinks the PM is slowing everything down, and nobody is sure who actually owns whether the output is good enough to ship. Both people are doing their jobs as they understand them. The problem is that nobody defined the jobs correctly.
The default split—PM writes prompts, engineer writes code—feels intuitive because it maps AI work onto familiar role boundaries. But it's a categorization of artifacts, not decisions. And in AI projects, artifact ownership without decision authority is what creates both the bottlenecks and the accountability gaps that kill projects.
Why the Artifact-Based Split Fails
When a PM owns prompts, they become a required checkpoint on every iteration cycle. Engineers can't run experiments without waiting for PM-approved prompt changes. The PM, meanwhile, is doing something they're often not best positioned to do—making low-level technical choices about phrasing, structure, and token efficiency—while still not having formal authority over the thing that actually matters: whether the system is working.
This creates a specific failure mode. The PM is accountable for the prompt but not for the eval results. The engineer is accountable for the model infrastructure but not for whether the outputs are good. When the system underperforms, both parties have a defensible position and nobody has clear ownership of the outcome.
The engineer-owned alternative isn't much better. When engineers own both the code and the prompts, they end up making subjective quality calls that should belong to the PM—decisions like "is this output good enough for users?" or "does this tone match what we promised?" Engineers are not well-positioned to make those calls, and they know it. The result is either paralysis (engineers asking for PM sign-off on every output judgment) or drift (engineers shipping based on their own intuitions about quality, which may not match user or business needs).
The Decision-Authority Model
The better split isn't about who touches which file. It's about who has final say on which class of decision.
PMs own:
- What "good" looks like (the evaluation criteria)
- The thresholds required to ship (the gate conditions)
- Which failure modes are acceptable and which are blockers
- When to change the goal based on user or business feedback
Engineers own:
- How to reach the defined threshold (model choice, architecture, retrieval strategy)
- Whether a specific prompt formulation serves the technical goal
- When to run experiments and what to vary
- How to instrument the system to surface the data the PM needs
Under this model, the PM isn't reviewing prompts—they're reviewing eval results against criteria they defined. If the eval passes, the engineer's implementation choices don't require PM approval. If it fails, the PM decides whether to adjust the criteria, adjust the timeline, or block the ship. The engineer decides how to fix the underlying problem.
This model also clarifies prompt ownership in a principled way. Prompts are implementation details—they're one lever an engineer pulls to hit the eval threshold the PM defined. The PM doesn't need to approve prompt changes any more than they need to approve a refactor. What they do need to approve is any change to the eval criteria or ship gate, because those are business and user decisions.
Responsibility Matrices for Three Common AI Project Types
The decision-authority model plays out differently depending on what you're building. Here's how it maps to three common AI project types.
Classification Systems (e.g., content moderation, intent detection)
Classification is the most structured AI problem type, which makes accountability clearest.
| Decision | Owner | Rationale |
|---|---|---|
| Precision/recall threshold | PM | This is a business tradeoff—false positives vs. false negatives have different costs to users |
| Which classes to detect | PM | Driven by product requirements and user needs |
| Model architecture and training approach | Engineer | Technical implementation choice |
| Prompt design for few-shot classification | Engineer | Implementation lever toward PM-defined threshold |
| Labeling guidelines for ambiguous cases | Joint | Requires PM's knowledge of user intent + engineer's knowledge of model behavior |
| Ship/no-ship decision | PM | Owns the gate condition |
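The ship gate the PM owns in this matrix can be encoded directly, so engineers iterate against it without per-experiment sign-off. A minimal sketch, assuming per-class precision/recall targets the PM has set; the class names and numbers here are hypothetical, not from any real project:

```python
# PM-owned thresholds live as data; the engineer owns how the
# system reaches them. Values are illustrative.
PM_THRESHOLDS = {
    "spam":  {"precision": 0.95, "recall": 0.80},
    "abuse": {"precision": 0.90, "recall": 0.85},
}

def ship_gate(metrics: dict) -> tuple[bool, list[str]]:
    """Return (passes, blockers) against PM-defined per-class targets."""
    blockers = []
    for cls, targets in PM_THRESHOLDS.items():
        for metric, floor in targets.items():
            observed = metrics.get(cls, {}).get(metric, 0.0)
            if observed < floor:
                blockers.append(f"{cls}.{metric}: {observed:.2f} < {floor:.2f}")
    return (not blockers, blockers)

# Example eval run: one class misses its recall target, so the gate fails
ok, blockers = ship_gate({
    "spam":  {"precision": 0.97, "recall": 0.82},
    "abuse": {"precision": 0.93, "recall": 0.78},
})
```

The point of the shape: a failed gate returns the specific blockers, which is exactly the conversation the PM and engineer need to have next.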
Generation Systems (e.g., AI writing assistants, summarization, chatbots)
Generation is where the artifact-based split causes the most damage, because output quality feels subjective and everyone has opinions about prose.
| Decision | Owner | Rationale |
|---|---|---|
| Evaluation rubric (what makes a good output) | PM | This is a product definition question, not a technical one |
| Tone and voice guidelines | PM | Brand and UX decisions |
| Eval methodology (human rating, LLM-as-judge, automated metrics) | Engineer | Technical implementation of the measurement system |
| Prompt structure and iteration | Engineer | Implementation lever; PM reviews results, not prompts |
| Which outputs are "good enough" to ship | PM | Gate condition based on rubric |
| How to handle edge cases the rubric doesn't cover | Joint | Requires rubric expansion (PM) and technical triage (engineer) |
The key discipline for generation systems: the PM must write the eval rubric before any prompts are written. If the rubric comes after, the prompts will implicitly define quality—and the PM will have already ceded the decision that matters most.
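One way to make the rubric exist as a first-class artifact rather than an implicit property of the prompts is to encode it as data the eval harness consumes. A hypothetical sketch; the dimension names, weights, and floors are invented for illustration:

```python
# PM-owned rubric as data: each dimension carries a weight toward the
# overall score and a per-dimension floor that no output may fall below.
# The eval harness (engineer-owned) produces 1-5 ratings per dimension.
# All names and numbers are illustrative.
RUBRIC = {
    "accuracy":    {"weight": 0.4, "floor": 4.0},
    "tone":        {"weight": 0.3, "floor": 3.5},
    "conciseness": {"weight": 0.3, "floor": 3.0},
}

def score_output(ratings: dict) -> dict:
    """Apply the PM's rubric to one output's per-dimension ratings."""
    weighted = sum(RUBRIC[d]["weight"] * ratings[d] for d in RUBRIC)
    floor_failures = [d for d in RUBRIC if ratings[d] < RUBRIC[d]["floor"]]
    return {"score": round(weighted, 2), "floor_failures": floor_failures}

# A decent weighted score can still fail: tone is below its floor
result = score_output({"accuracy": 4.5, "tone": 3.0, "conciseness": 4.0})
```

Because the rubric is data rather than folklore, changing it is a visible PM decision, while prompt changes that move the score are invisible to the PM by design.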
Retrieval-Augmented Generation (RAG) Systems (e.g., enterprise search, knowledge bases)
RAG introduces a retrieval layer that creates a new class of decisions about relevance—and a new potential accountability gap.
| Decision | Owner | Rationale |
|---|---|---|
| What "relevant" means for a given query | PM | Requires understanding of user intent and task context |
| Retrieval evaluation criteria (precision@k, MRR, etc.) | Joint | PM defines user-facing relevance; engineer translates to measurable metric |
| Chunking strategy and embedding approach | Engineer | Technical implementation |
| Reranking logic | Engineer | Implementation lever toward PM-defined relevance criteria |
| How to handle queries with no good retrieval result | PM | Failure mode decision with UX implications |
| Index update frequency and staleness tolerance | Joint | Business requirement (PM) with infrastructure cost implications (engineer) |
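The joint decision on retrieval metrics becomes concrete once the PM's relevance judgments exist as labels. The two metrics named in the table, precision@k and MRR, are simple to compute from those labels; a sketch, with hypothetical document ids:

```python
# Standard retrieval metrics over PM-labeled relevance judgments.
# `ranked` is the engineer's retrieval output (doc ids in rank order);
# `relevant` is the set of ids the PM judged relevant for the query.

def precision_at_k(ranked: list, relevant: set, k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def reciprocal_rank(ranked: list, relevant: set) -> float:
    """1/rank of the first relevant result, 0.0 if none retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1 / rank
    return 0.0

# MRR is the mean of reciprocal_rank across a query set.
rr = reciprocal_rank(["d7", "d2", "d9"], {"d2", "d4"})  # first hit at rank 2
```

The division of labor holds: the engineer implements and reports these numbers; the PM decides whether "precision@5 of 0.6" is acceptable for the use case, because only the PM knows what a missed result costs the user.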
How the Two Models Compare
Decision-Authority Model
- PM accountability is tied to outcomes, not artifacts
- Engineers can iterate without waiting for PM approval on implementation details
- Clear ownership of ship/no-ship decisions prevents stalemates
- Eval criteria are defined before implementation begins, not reverse-engineered from outputs
- Scales across project types without constant renegotiation
Artifact-Based Split
- PM becomes a bottleneck on every prompt iteration cycle
- Engineers make implicit quality calls without authority or context
- Nobody owns the outcome when the system underperforms
- Eval criteria emerge late and are often shaped by what the system already does
- Requires constant renegotiation as the project evolves
The Two Failure Modes This Model Is Designed to Prevent
The decision-authority model is specifically constructed to prevent two failure modes that the artifact-based split makes almost inevitable.
Failure Mode 1: The PM Bottleneck. When PMs own prompts, every experiment requires their involvement. In a fast-moving AI project, this means engineers are idle while waiting for PM input on technical implementation details. The fix isn't to exclude PMs—it's to move their involvement upstream. If the PM has already defined the eval criteria and thresholds, the engineer can run dozens of prompt experiments without PM involvement, because the PM's judgment is already encoded in the eval. The PM reviews results on a cadence, not every individual change.
Failure Mode 2: Engineer Drift. When engineers own both implementation and quality judgment, they gradually make product decisions without realizing it. A generation system's tone shifts. A classifier's threshold gets adjusted because the engineer thought the false positive rate looked too high. A RAG system starts filtering out sources the engineer found noisy. These are all PM decisions made by engineers in the absence of clear criteria. The fix is for PMs to define those criteria explicitly and in advance—not to audit engineer decisions after the fact.
Making This Work in Practice
The model requires two disciplines from PMs that non-AI projects rarely demanded.
First, PMs must write eval criteria before engineers write a single prompt. This is the most important discipline change. The criteria don't need to be perfect—they'll evolve—but they need to exist. A generation system needs a rubric. A classifier needs defined precision/recall targets per class. A RAG system needs a definition of relevance for the top use cases. Without this, engineers will fill the vacuum with their own judgments.
Second, PMs must understand what is and isn't measurable. Not every quality dimension can be captured in an automated eval. When something can't be measured automatically, the PM needs to own the human evaluation process—defining the rating scale, the sample size, the frequency. Delegating this to engineers means delegating the quality definition.
For engineers, the model requires one key discipline: surface the eval results, not just the outputs. The PM can't exercise decision authority on data they don't have. Engineers who build good eval instrumentation give PMs what they need to make ship decisions confidently. Engineers who show the PM individual outputs and ask for a thumbs up are forcing the PM back into the artifact review loop.
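Surfacing results rather than outputs can be as simple as rolling per-example eval records up into the view the PM actually decides on. A hypothetical sketch; the record shape and field names are invented for illustration:

```python
from collections import Counter

def eval_report(results: list) -> dict:
    """Aggregate per-example eval results into a PM-facing summary.
    Each result is {"passed": bool, "failure_mode": str | None}.
    Shape and field names are illustrative."""
    n = len(results)
    passed = sum(1 for r in results if r["passed"])
    failures = Counter(r["failure_mode"] for r in results if not r["passed"])
    return {
        "pass_rate": round(passed / n, 3),
        "n": n,
        "top_failure_modes": failures.most_common(3),
    }

report = eval_report([
    {"passed": True,  "failure_mode": None},
    {"passed": False, "failure_mode": "hallucination"},
    {"passed": False, "failure_mode": "hallucination"},
    {"passed": True,  "failure_mode": None},
])
```

A pass rate and a ranked list of failure modes is a ship decision waiting to be made; a folder of individual outputs is a request for the PM to do the engineer's triage.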
The Practical Starting Point
If you're setting up an AI project now, start with a single document before any code is written: the eval spec. It should define what success looks like, what the minimum threshold is to ship, and what failure modes are blockers. The PM writes this. The engineer reviews it for measurability and flags anything that can't be instrumented.
Everything downstream—model choice, prompt design, retrieval architecture—is the engineer's call, constrained by the eval spec. The PM's ongoing job is to keep the eval spec current as user feedback and business requirements evolve, and to make the ship decision when the system hits the threshold.
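What the eval spec contains can be sketched as a single structure the engineer can review for measurability and instrument against. Every field name and example value below is hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class EvalSpec:
    """The pre-implementation document, as data. The PM authors it;
    the engineer reviews each criterion for measurability.
    Fields and values are illustrative, not a prescribed schema."""
    success_definition: str       # what "good" means, in prose
    ship_thresholds: dict         # metric name -> minimum value to ship
    blocker_failure_modes: list  # any occurrence blocks the ship
    measured_by: dict = field(default_factory=dict)  # metric -> method

SPEC = EvalSpec(
    success_definition="Summaries are accurate and match the product voice",
    ship_thresholds={"rubric_score": 4.0, "factuality_pass_rate": 0.98},
    blocker_failure_modes=["fabricated citation", "leaked system prompt"],
    measured_by={
        "rubric_score": "human rating, weekly sample of 50 outputs",
        "factuality_pass_rate": "automated claim check",
    },
)
```

The `measured_by` field is where the engineer's review bites: a threshold with no feasible measurement method gets flagged before anyone has written a prompt to hit it.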
That's not a smaller PM role. It's a more precisely located one—and it's the location where PM judgment actually has leverage.