25 Jun 2026 • on AI testing

The QA Layer Your AI Toolchain Won't Build For You

In Part 1, I made the case that AI has created a dangerous gap between testing and quality assurance — and that many engineering organizations are filling that gap with confidence rather than rigor.

The response I usually get to that argument is: “Okay, so what do we actually do?”

Fair. Let’s get specific.

There are five instruments worth putting in place. Some are technical. One is organizational. All of them are grounded in recent research and real production experience. None of them require you to slow down or abandon AI-assisted development — they require you to be smarter about what you trust.

Instrument 1: Mutation Testing — Stop Measuring Coverage, Start Measuring Effectiveness

This is the most important one, and the most underused.

Here’s the problem with coverage metrics: they tell you whether your tests ran a line of code. They don’t tell you whether your tests would catch a bug in that line of code. A test can execute a function without ever asserting anything meaningful about it. Coverage goes green. The bug ships.

Mutation testing flips this. Instead of asking “did the test run?”, it asks “if I deliberately break this code, does the test catch it?” It injects controlled defects — mutations — and measures your test suite’s kill rate. If your tests don’t catch the mutation, they’re not actually testing what you think they are.
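To make the idea concrete, here is a minimal hand-rolled sketch. The shipping rule, the single mutant, and both tests are hypothetical and written by hand for illustration; a real mutation testing tool generates many mutants automatically.

```python
# Hypothetical function under test: free shipping for orders of 50 or more.
def qualifies_for_free_shipping(total):
    return total >= 50


# A hand-written "mutant": the boundary operator is changed from >= to >.
def mutant_free_shipping(total):
    return total > 50


def weak_test(fn):
    """Executes every line of fn (full coverage) but never probes the boundary."""
    return fn(100) is True and fn(10) is False


def strong_test(fn):
    """Adds the boundary case, which is exactly where the mutant differs."""
    return weak_test(fn) and fn(50) is True


def is_killed(test, mutant):
    """A mutant counts as killed when the test fails against it."""
    return not test(mutant)


# Both tests pass against the original code.
original_passes = (weak_test(qualifies_for_free_shipping)
                   and strong_test(qualifies_for_free_shipping))
weak_kills = is_killed(weak_test, mutant_free_shipping)
strong_kills = is_killed(strong_test, mutant_free_shipping)

print(original_passes, weak_kills, strong_kills)  # True False True
```

Both tests give the same coverage number. Only the one that checks the boundary notices when the code is broken.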

This is a well-established technique, but it’s had a resurgence for a specific reason: AI makes it practical at scale. One engineering team paired AI test generation with mutation testing and grew their suite from 12 to 33 high-quality tests while raising their kill rate from 57% to 80%. That is the difference between “tests that run” and “tests that catch real defects.”

Meta has put this into production. Their Automated Compliance Hardening (ACH) tool applies mutation-guided test generation across Facebook, Instagram, WhatsApp, and their wearables platforms. Privacy engineers accepted 73% of the AI-generated tests, with 36% judged privacy-relevant. That is a meaningful signal in a domain where false confidence is genuinely dangerous.

Gartner is now actively advising teams to integrate mutation-guided test hardening directly into pull request workflows.

What to do: Integrate a mutation testing framework (Pitest for Java, mutmut or Cosmic Ray for Python, Stryker for JS/TS) into your CI pipeline. Set a minimum kill rate threshold as a quality gate — not a coverage threshold. Start with your highest-risk modules.
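A kill-rate gate can be sketched in a few lines. This is an illustration under assumptions, not any tool's built-in feature: the 0.80 threshold and the killed/total counts are hypothetical, and in practice you would read the counts from whatever report your mutation testing tool produces.

```python
MIN_KILL_RATE = 0.80  # assumed threshold; pick one that fits your risk profile


def kill_rate(killed, total):
    """Fraction of generated mutants that the test suite detected."""
    if total <= 0:
        raise ValueError("no mutants were generated")
    return killed / total


def gate(killed, total, minimum=MIN_KILL_RATE):
    """Return (passed, rate) so a CI step can decide whether to fail the build."""
    rate = kill_rate(killed, total)
    return rate >= minimum, rate


# Hypothetical counts for two runs of the same module.
for killed, total in [(57, 100), (80, 100)]:
    passed, rate = gate(killed, total)
    print(f"kill rate {rate:.0%}: gate {'passed' if passed else 'failed'}")
```

In a real pipeline the script would exit non-zero when the gate fails, so the pull request is blocked the same way a failing test would block it.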

Instrument 2: Property-Based Testing — Test Invariants, Not Just Examples

Most AI-generated tests are example-based: give the function this input, expect that output. The problem is that AI models generate tests by predicting what tests typically look like for code like this. They confirm the code does what it does — they don’t challenge whether it handles what it should.

Property-based testing is a different approach. Instead of testing specific examples, you define invariants — rules that must hold true for any valid input. You then let the framework generate thousands of random inputs and verify the invariant holds across all of them.
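Here is a minimal sketch of the idea using only the Python standard library. The bill-splitting functions and the input ranges are hypothetical, and the random loop is a hand-rolled stand-in for what a property-based framework does; a real framework such as Hypothesis also shrinks a failing input down to a small counterexample.

```python
import random


def split_cents(total, parts):
    """Split `total` cents into `parts` shares that differ by at most one cent."""
    base, remainder = divmod(total, parts)
    return [base + 1 if i < remainder else base for i in range(parts)]


def naive_split(total, parts):
    """Deliberately buggy variant: silently drops the remainder."""
    return [total // parts] * parts


def holds(splitter, total, parts):
    """The invariants: right number of shares, nothing lost, shares are fair."""
    shares = splitter(total, parts)
    return (len(shares) == parts
            and sum(shares) == total
            and max(shares) - min(shares) <= 1)


def find_counterexample(splitter, trials=1000, seed=0):
    """Try many random inputs; return the first one that breaks an invariant."""
    rng = random.Random(seed)
    for _ in range(trials):
        total = rng.randint(0, 10**6)
        parts = rng.randint(1, 50)
        if not holds(splitter, total, parts):
            return total, parts
    return None


# A typical example-based test passes for BOTH implementations.
print(split_cents(100, 4) == naive_split(100, 4) == [25, 25, 25, 25])

# The invariant check separates them.
print(find_counterexample(split_cents))               # None
print(find_counterexample(naive_split) is not None)   # True
```

The example-based check cannot tell the two implementations apart, because 100 divides evenly by 4. The invariant "no cents are lost" exposes the buggy one as soon as a random input leaves a remainder.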

Recent research using property-based testing as a validation layer over AI-generated code showed 23–37% improvements in correctness over standard test-driven approaches — specifically because it breaks what researchers called the “cycle of self-deception,” where AI-generated tests share the same blind spots as the code they’re meant to validate.

This matters for a simple reason: the bugs AI misses aren’t random. They cluster around edge cases, boundary conditions, and combinations of inputs that no typical example would surface. Property-based testing systematically attacks those blind spots.

What to do: Adopt Hypothesis (Python), fast-check (JavaScript), or QuickCheck (Haskell/Erlang) for your highest-risk business logic. Task your senior engineers with defining the invariants — this is the judgment work that AI genuinely can’t do well. The framework handles the rest.

Instrument 3: Mandate Negative Testing — Explicitly

This one sounds obvious. It’s rarely practiced.

Research shows that AI testing agents tend to avoid negative test scenarios: they “correct” the flow toward a positive outcome, masking potential failures. The result is test suites full of happy-path coverage and almost no adversarial coverage.

This isn’t a subtle bias. It’s a structural one. LLMs are trained to produce helpful, correct-looking outputs. A test that “fails” by catching a bug looks like a failure to the model, not a success.

The practical consequence: if you don’t explicitly instruct AI to generate tests that expect failure, it largely won’t. And if you don’t review your test suite for the ratio of positive to negative scenarios, you won’t notice until production teaches you.

What to do: Make negative test coverage a code review criterion, not just a suggestion. When using AI to generate tests, explicitly prompt for failure scenarios, boundary violations, invalid inputs, and concurrent edge cases as separate passes. Require a minimum ratio of failure-path tests for any module handling user input, authentication, or financial logic.
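As a rough illustration of what failure-path tests look like next to a happy-path one, here is a sketch using Python’s standard unittest module. The withdraw function and its validation rules are hypothetical, chosen only to give the tests something to reject.

```python
import unittest


def withdraw(balance, amount):
    """Hypothetical function under test; amounts are integer cents."""
    if isinstance(amount, bool) or not isinstance(amount, int):
        raise TypeError("amount must be an integer number of cents")
    if amount <= 0:
        raise ValueError("amount must be positive")
    if amount > balance:
        raise ValueError("insufficient funds")
    return balance - amount


class WithdrawTests(unittest.TestCase):
    # Happy path: the kind of test generated tooling tends to produce unprompted.
    def test_happy_path(self):
        self.assertEqual(withdraw(100, 30), 70)

    # Boundary: withdrawing the exact balance is allowed.
    def test_allows_exact_balance(self):
        self.assertEqual(withdraw(100, 100), 0)

    # Negative tests: each one passes only when the code refuses bad input.
    def test_rejects_zero_and_negative(self):
        for bad in (0, -1):
            with self.subTest(amount=bad):
                with self.assertRaises(ValueError):
                    withdraw(100, bad)

    def test_rejects_overdraft(self):
        with self.assertRaises(ValueError):
            withdraw(100, 101)

    def test_rejects_non_integer(self):
        for bad in ("30", 30.5, None, True):
            with self.subTest(amount=bad):
                with self.assertRaises(TypeError):
                    withdraw(100, bad)


def run_suite():
    """Run the tests programmatically and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(WithdrawTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)


result = run_suite()
print(result.testsRun, result.wasSuccessful())
```

Three of the five tests here expect an exception. That ratio is the kind of thing a reviewer can check at a glance.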

Instrument 4: LLM-as-Judge for AI-Powered Products

If you’re building on top of LLMs — not just using AI to write tests for traditional code — you have an additional problem. The outputs are non-deterministic. You can’t assert that a specific string was returned. Coverage metrics mean nothing.

The emerging solution is LLM-as-Judge: use a separate model to evaluate the quality of your system’s outputs against explicit rubrics, and wire this into your CI pipeline as an automated quality gate.

This works, with a critical caveat. A recent engineering guide found that an LLM judge is reliable as a CI gate only if it reaches at least 80% agreement with human judgments on your specific task type before you deploy it. Below that threshold, you’re automating noise.

What to do: For your highest-stakes prompts, write explicit evaluation rubrics. Calibrate your LLM judge against 15–20 human-labeled examples for each rubric. Only promote it to an automated gate once you’ve validated the agreement rate. Treat the judge as a product that needs its own QA — because it does.
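The calibration step can be sketched as a simple agreement check. The labels below are hypothetical placeholders; in practice the human list comes from your labeled examples and the judge list from running your judge prompt over the same items. The 0.80 threshold is the figure quoted above.

```python
MIN_AGREEMENT = 0.80  # agreement threshold quoted above


def agreement_rate(human_labels, judge_labels):
    """Fraction of examples where the judge's verdict matches the human's."""
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must be the same length")
    if not human_labels:
        raise ValueError("need at least one labeled example")
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)


def judge_is_gate_ready(human_labels, judge_labels, minimum=MIN_AGREEMENT):
    """Only promote the judge to an automated gate above the threshold."""
    return agreement_rate(human_labels, judge_labels) >= minimum


# Hypothetical calibration set of 20 examples for one rubric.
human = ["pass"] * 12 + ["fail"] * 8
judge = ["pass"] * 11 + ["fail"] * 1 + ["fail"] * 4 + ["pass"] * 4

print(agreement_rate(human, judge))       # 0.75
print(judge_is_gate_ready(human, judge))  # False
```

Raw agreement is the simplest measure, and it can look better than it is when almost every example has the same label. With a set this small, it is worth reading the disagreements one by one rather than trusting the single number.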

Instrument 5: Don’t Hollow Out Your QA Function

All four of the above are technical instruments. This one is organizational, and it’s the one most likely to be ignored.

The pattern I’ve watched play out in org after org: AI testing tooling lands, velocity goes up, test volume goes up, and QA headcount quietly gets redeployed or not backfilled. The logic seems sound — the machines are covering it.

What walks out with those people isn’t test execution. It’s domain knowledge, institutional memory, and the judgment about what actually matters to test. The ability to ask “what are we not testing, and why?” That’s not a function AI has taken over. It’s a function that’s just going unfilled.

The fix isn’t to resist AI in your quality practice. It’s to be honest about what you’re actually replacing. AI scales execution. It doesn’t replace the engineer who knows that the edge case in the payment retry logic caused three incidents last year, or the one who reads the spec and notices the requirement nobody thought to test.

What to do: Redefine the QA role in your org around test strategy, not test execution. Measure your quality engineers on the quality of failure mode analysis, on the clarity of acceptance criteria, on the signal-to-noise ratio of your test suite — not on the number of tests written. Make quality engineering a senior discipline, not a task to be automated away.

The Framework in One Sentence Per Layer

Mutation testing: stop measuring whether tests ran; measure whether they catch bugs
Property-based testing: stop testing examples; test invariants
Negative testing mandates: stop confirming the happy path; challenge it
LLM-as-Judge: for AI products, automate evaluation — but calibrate it first
QA as strategy: stop treating quality engineering as execution; treat it as judgment

None of this requires slowing down. It requires being honest about what AI is actually doing for you — and what it isn’t.

A Final Note for Engineering Leaders

The teams I’ve seen navigate this well share one thing: they treat “can this ship?” and “should this ship?” as separate questions, owned by different people with different mandates.

AI is very good at answering the first one. The second one still needs you.

If Part 1 resonated, the question worth taking to your next engineering review is simple: when did someone last ask what you’re not testing, and why?