My Profile Photo

Paul Brodner's Blog

Opinions are my own and not the views of my employer


Tips & Tricks


    Writing AI Agent Specs: Less Art, more checklists

    I recently built a project I called 5level (inspired by this video) — a framework of 8 specialized AI agents covering a full Agile SDLC: product manager, product lead, architect, tech lead, scrum master, developer, QA tester, and release lead.

    Level 4 — Developer as Product Manager — is the idea that you write strong specifications and outcome criteria, AI implements and tests, and you evaluate results rather than every line. The primary gain is a shift from coding effort to product and system thinking. 5level covers roughly 80% of that in practice (so far March 2026). The risk — as I learned firsthand — is exactly what this post is about: weak specs produce wrong-but-plausible results.

    Each agent is defined in a single Markdown file called SKILL.md. Each one has a specific role, a set of inputs and outputs, a numbered workflow, a Definition of Done checklist, and a set of rules.

    Writing those 8 spec files taught me something I didn’t expect: the quality of an AI agent is almost entirely determined by the quality of its spec. Not the model. Not the tools. The spec.

    And most specs I’ve seen — including early versions of mine — fail for the same reason: they’re written like prose. Like a job description. Like a README no one will read.

    Let me show you what actually works.


    The Challenge

    A GitHub study of over 2,500 agent configuration files found the most common failure mode: too vague. Things like “write clean code” or “follow best practices” or “make sure tests pass.” These aren’t instructions — they’re hopes.

    When you hand an AI agent a vague spec, you don’t get a bad result. You get a plausible-looking result. The agent fills in the gaps with its own assumptions, and those assumptions compound across multiple steps until you have something that works in demos and falls apart on edge cases.

    The antidote isn’t more prose. It’s structure.


    The Anatomy of a SKILL.md

    Each agent in the 5level project follows the same template. Here’s the skeleton:

    ---
    name: sdlc-qa-tester
    description: '...trigger keywords...'
    compatibility: Designed for Claude Code (claude-sonnet-4-6)
    metadata:
      author: pbrodner
      version: "1.2"
      sdlc-level: "4"
      role: qa-tester
    ---
    
    ## Role Summary       ← what this agent owns
    ## Inputs             ← what it receives (from whom)
    ## Outputs            ← what it produces (artifacts, not conversations)
    ## Workflow           ← numbered steps with commands and templates
    ## Definition of Done ← checkbox list (not prose)
    ## Rules              ← what it always/never does
    ## Collaboration Map  ← who to contact for what
    

    Every section is mandatory. Remove any one of them and the agent starts guessing.

    The section easiest to underspecify is Outputs. Not “write a test plan” — but what the test plan is, what file it goes to, and what downstream agent consumes it. Here’s the QA agent’s output list:

    - Test Strategy Document (per Feature — requires human approval before execution)
    - Test Plan (per story)
    - Test Cases (per story AC scenario)
    - Test Execution results
    - Defect reports (Jira Bug issues)
    - Test Report (PASS/FAIL summary for DoD gate)
    - Traceability matrix (AC → test case → result)
    

    Artifacts, not conversations. This distinction matters — an agent that “helps with QA” and an agent that “produces a traceability matrix consumed by sdlc-release-lead for release validation” are two very different things.


    The Workflow: 8 Agents, 3 Human Gates

    The 8 agents in 5level form a pipeline. Here’s how they connect:

    Human Idea
        │
        ▼
    [Product Manager] ──► Epic + Features
        │
        ▼
    [Product Lead] ──────► User Stories + AC
        │
        ▼
    [Architect] ─────────► SDD / ADRs
        │
        ▼
    [Tech Lead] ─────────► Estimates + Technical Tasks
        │
        ▼
    [QA Tester] ─────────► Test Strategy ◄── HUMAN GATE #1
        │                                    (approval required before dev)
        ▼
    [Scrum Master] ──────► Sprint Entry Gate
        │
        ├─────────────────────────┐
        ▼                         ▼
    [Developer]             [QA Tester]
    Feature Branch          Test Cases (parallel)
        │                         │
        └───────────┬─────────────┘
                    ▼
                PR Review
             [Tech Lead] ◄── HUMAN GATE #2 (high-complexity or security changes)
                    │
                    ▼
               QA Execution
                    │
                    ▼
          [Scrum Master] DoD Gate
                    │
                    ▼
         [Release Lead] ◄── HUMAN GATE #3 (Sprint Review + Release Gate)
                    │
                    ▼
                Deploy
    

    Three human touchpoints. Everything else is agent-driven. This only works if each agent’s spec is precise enough that it doesn’t need supervision on every step.


    The Six Non-Negotiables

    Addy Osmani writes that good agent specs must cover six areas. I’ll show you how 5level handles each one.

    Area What It Means How 5level Does It
    Commands Exact executable commands with flags Each workflow step includes runnable commands: uv run pytest -v, git checkout main && git pull origin main
    Testing Framework, file locations, coverage threshold Developer agent: “Achieve the coverage threshold set by sdlc-architect (default: 80% line coverage)”
    Project structure Where artifacts live All agents write to explicit paths: docs/backlog.md, docs/sdd.md, docs/test-strategy-[FEATURE-ID].md
    Code style Real examples beat descriptions Commit format is defined by example, not description (see below)
    Git workflow Branch naming, commit format, PR requirements Developer agent has a dedicated step for each: create branch, commit format, self-review, open PR
    Boundaries What the agent must never do Rules section in every SKILL.md — hard stops, not suggestions

    For code style, here’s how the developer agent defines commit format — with an actual example, not a description:

    [STORY-ID] imperative verb phrase describing what changed
    
    - What was changed and why (1–3 bullets)
    - AC: [AC scenario name this satisfies]
    
    PROJ-42 Generate cryptographically secure password reset token
    
    - Add generate_reset_token() to AuthService with 64-byte entropy
    - Token expires after 1 hour per security standard ADR-007
    - AC: "User requests password reset — happy path"
    

    An AI agent reading this knows exactly what a commit message should look like. No interpretation needed.


    Three-Tier Guardrails

    Every SKILL.md in 5level has a Rules section. But not all rules are equal. Effective rules follow a three-tier structure:

    ✅ Always do — actions the agent can take autonomously without asking:

    • “Test planning starts at task definition, not code completion.”
    • “Every test case must trace to an AC. No AC traceability = invalid test case.”
    • “File bugs immediately. Do not batch defects until the end.”

    ⚠️ Ask first — situations that require a human or another agent before proceeding:

    • “If anything is ambiguous: post a question on the Jira story, tag sdlc-tech-lead or sdlc-product-lead, and wait for a response. Do not implement on assumptions.
    • “Do not execute tests until [Test Strategy] approval is recorded.”

    🚫 Never do — hard stops:

    • “Never push directly to main.”
    • “Do not commit: credentials, API keys, or .env files.”
    • “Never disable, skip, or comment out a failing test.”
    • “Do not merge before approval.”

    The difference between a good spec and a dangerous one often comes down to how explicit the “never” list is. Agents are very good at finding technically-compliant ways to get things done — including ways you didn’t intend. A hard stop is the only reliable prevention.


    Definition of Done as a Checklist

    The single highest-value addition you can make to any agent spec is a Definition of Done with checkboxes. Not a paragraph about quality. A list the agent can verify against before declaring work complete.

    Here’s the QA agent’s DoD:

    - [ ] All AC scenarios have at least one executed test case
    - [ ] All test cases marked PASS, or defects are filed for FAILs
    - [ ] All Critical and High severity bugs are resolved and re-tested with PASS
    - [ ] E2E browser tests PASS (or N/A documented — story touches no UI)
    - [ ] Traceability matrix is complete (every AC → test case → result)
    - [ ] Test Report issued to `sdlc-scrum-master`
    

    And the developer’s:

    - [ ] Feature branch merged to `main` via approved PR
    - [ ] All AC scenarios have passing unit tests
    - [ ] Unit test coverage meets threshold
    - [ ] No TODOs, debug code, or commented-out blocks in merged code
    - [ ] `sdlc-scrum-master` notified
    

    A checklist does two things: it prevents the agent from stopping too early (missing items), and it makes the completion criteria unambiguous for you when you review the output. “Is this done?” becomes answerable in 30 seconds.


    The Curse of Instructions

    One pattern I learned to avoid: the monolithic spec. One giant prompt that covers every possible scenario an agent might encounter. The temptation is to make it comprehensive — but there’s a hard limit where adding more requirements actually degrades performance.

    The reason is context. When a prompt contains 50 rules, an agent’s attention to each individual rule goes down. Research has shown this effect is real — it’s called the “curse of instructions”. Tested across GPT-4o, Claude 3.5, Gemini 1.5, and others: success rates drop from ~90% on a single instruction to as low as 15% when multiple constraints are combined.

    The solution is separation of concerns. Instead of one agent with 50 rules, use 8 agents with 6-8 rules each. The developer agent doesn’t need to know how to write a test strategy. The QA agent doesn’t need the commit format rules. Each spec covers exactly what that agent needs — nothing more.

    Concretely: when writing a SKILL.md, if a rule applies to this agent’s outputs and to another agent’s inputs, it belongs in both. Don’t try to centralize it in one place for consistency. Let each spec be complete for its role.


    The Checklist

    Before handing any agent spec to an AI, run through this:

    Structure

    • Role summary: one paragraph, no ambiguity about what this agent owns
    • Inputs: listed with source (which agent or human provides each)
    • Outputs: listed as concrete artifacts with file paths or destinations
    • Workflow: numbered steps, not bullet soup (a flat list with no order or dependencies)
    • Definition of Done: checkboxes, not prose
    • Rules section with ✅ always / ⚠️ ask / 🚫 never

    The Six Areas

    • Exact commands included (not “run the tests” — uv run pytest -v)
    • Testing framework and coverage threshold specified
    • File/directory structure explicit
    • Code style shown by example
    • Git workflow defined (branch format, commit format, PR target)
    • Boundaries enumerated — what this agent must never touch

    Quality

    • No vague phrases (“best practices”, “clean code”, “make sure it works”)
    • Templates referenced for all output artifacts (not inline prose)
    • Collaboration map: who this agent contacts for what
    • Scope is tightly bounded — one role, not multiple

    Conclusion

    At Level 4 SDLC — where humans write specs and AI implements and tests — the spec is the product. Code is a side effect.

    A vague spec produces a wrong-but-plausible result. A precise spec with checkboxes, hard stops, and real examples produces something you can trust to gate, ship, and audit.

    Less art. More checklists in the end.