
Paul Brodner's Blog

Opinions are my own and not the views of my employer


Drawing the lines AI must follow


    Can AI Test Thoroughly? Why the Industry Is Sleepwalking Into a QA Crisis

    Lately, I’ve been seeing a familiar pattern play out across engineering teams.

    A developer ships an AI-powered feature. It works. It demos beautifully. The team moves on.

    Three months later, production is on fire — and nobody saw it coming. Not because they skipped testing. Because they confused testing with quality assurance.

    That distinction used to be a footnote. Now it’s the fault line everything cracks along.

    Testing and QA Are Not the Same Thing

    This is worth saying plainly, because the industry conflates them constantly.

    Testing asks: does this work? It’s execution. You run cases, you check outputs, you confirm behavior matches expectation. It’s necessary. It’s not sufficient.

    Quality assurance asks: is this good? It’s a systems-level question — about requirements, about risk, about what “correct” even means in context. It requires judgment, domain knowledge, and an honest understanding of failure modes you haven’t imagined yet.

    AI is genuinely impressive at the first one. It can generate test cases at scale, cover surface area no human team could, run regression suites relentlessly, and catch obvious breakage fast.

    But the second one? That’s where the gap is. And that gap is widening.

    Why AI Makes This Gap Harder to See

    Here’s the uncomfortable part: AI doesn’t just struggle with QA — it actively makes the problem less visible.

    When you use AI to generate tests, you get volume. Hundreds of cases, fast. That feels like coverage. It looks rigorous. The dashboard goes green. Confidence goes up.

    But AI-generated tests are bounded by what the model thinks is worth testing — which is largely a reflection of what’s common, expected, and well-documented. The scenarios that actually break production systems are usually none of those things. They’re edge cases that emerge from real user behavior, implicit business rules that were never written down, and interactions between systems that no single model has full context over.

    You end up with a test suite that’s wide but shallow. It passes. The system ships. And somewhere in the parts nobody thought to question, quality is quietly eroding.
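
    A minimal sketch of what "wide but shallow" can look like in practice. Everything here is hypothetical: the `discounted_cents` function, the test cases, and the unwritten rule that a discount should never exceed 100% are assumptions made up for illustration, not taken from any real system. The generated-style tests execute every line and pass, yet none of them asks the question a person who knows the pricing rules would ask.

```python
def discounted_cents(price_cents: int, percent: int) -> int:
    """Return the price in cents after a percentage discount (hypothetical)."""
    return price_cents * (100 - percent) // 100


def test_common_cases() -> None:
    # The common, documented cases: every line of discounted_cents runs,
    # so a line-coverage report would show 100%.
    assert discounted_cents(10_000, 10) == 9_000
    assert discounted_cents(20_000, 50) == 10_000
    assert discounted_cents(8_000, 0) == 8_000


def probe_unwritten_rule() -> int:
    # The question nobody wrote down: what if promotions stack past 100%?
    # (Assumed business rule: a price must never go negative.)
    return discounted_cents(10_000, 150)


test_common_cases()  # passes: the dashboard goes green
result = probe_unwritten_rule()
print(f"150% discount on 100.00 yields {result} cents")
```

    The suite is green and coverage is complete, but the function happily returns a negative price. Nothing in the code or its documentation says that is wrong; only someone who knows the business rule would think to test it.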

    In regulated environments — medical devices, clinical trial software, anything under FDA scrutiny — this is not an academic concern. A shallow test suite that passes CI is not a validation artifact. Auditors ask what you tested and why. “AI generated it” is not an answer to the second question.

    The Signals Engineering Leaders Are Missing

    If you’re managing an engineering org right now, here are the questions worth asking honestly:

    Who owns the definition of “good”? Not “does it pass CI” — but what does quality actually mean for this product, for this user, in this context? If the answer is fuzzy, your QA is fuzzy, regardless of what tools you’re using.

    Are your test suites telling you what you want to hear? AI-generated tests tend to optimize for coverage metrics, not for the failure modes that matter. If nobody is actively challenging what’s being tested, you have a yes-machine, not a quality signal.

    What happens when requirements are ambiguous? AI executes well against clear specs. Real-world software is built on incomplete, evolving, sometimes contradictory requirements. QA is partly the discipline of surfacing those ambiguities before they become incidents. That work still requires humans.

    Is your team moving from demo to production — or just to deployment? Demo-quality and production-quality are not the same thing. The gap between them is where QA lives. Shipping fast with AI tooling can close the distance to deployment while doing nothing to close the distance to production-readiness.

    What This Looks Like at Scale

    The pattern I’m worried about isn’t a team that skips testing. It’s teams that use AI to test more, ship more confidently, and gradually hollow out the QA function — because it feels redundant next to all that automated coverage.

    QA engineers get redeployed or not backfilled. The institutional knowledge about what matters, what breaks, and why, starts to walk out the door. The test suite grows. The judgment behind it atrophies.

    This is the sleepwalk. Not a dramatic failure. A slow, quiet degradation of the thing that makes software trustworthy — dressed up in green CI pipelines and velocity metrics.

    What Good Looks Like

    AI absolutely belongs in your quality engineering stack. Used well, it frees your best QA minds from repetitive execution so they can focus on what machines genuinely can’t do: understanding risk in context, defining what quality means for this product, and asking the uncomfortable questions nobody else is asking.

    That means using AI to scale execution, not to replace judgment. Keeping humans accountable for test strategy, not just test generation. Treating QA as a systems-thinking discipline, not a checklist discipline. And asking regularly: what are we not testing, and why?
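
    One way to make "what are we not testing, and why?" a routine question is to keep a small, human-owned risk register next to the test suite. The sketch below is only an illustration under assumptions: the `Risk` fields, the example entries, and the test names are all hypothetical, and a real register would be shaped by your own product and process.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Risk:
    """One entry in a hypothetical, human-maintained risk register."""
    description: str
    owner: str
    covering_tests: List[str] = field(default_factory=list)
    rationale_if_untested: str = ""


def untested(risks: List[Risk]) -> List[Risk]:
    """Risks that no test claims to cover."""
    return [r for r in risks if not r.covering_tests]


def unexplained_gaps(risks: List[Risk]) -> List[Risk]:
    """Untested risks where nobody has recorded why that is acceptable."""
    return [r for r in untested(risks) if not r.rationale_if_untested.strip()]


# Example entries (made up for illustration).
register = [
    Risk("Discount exceeds 100%", "pricing QA", ["test_discount_cap"]),
    Risk("Duplicate submission on retry", "payments QA"),
    Risk("Legacy export format", "platform",
         rationale_if_untested="Feature retired next release"),
]

for risk in unexplained_gaps(register):
    print(f"UNTESTED, NO RATIONALE: {risk.description} (owner: {risk.owner})")
```

    The code does nothing clever, and that is the point: the list of risks and the reasons for leaving some untested are judgment calls that a named person owns. Tooling can only report where that judgment has not been written down.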

    The teams that get this right will use AI to build faster and build better. The teams that don’t will build faster — and discover the difference in production.


    Three things I’d take away from this:

    AI scales execution; it cannot own strategy. Test generation at volume is real value, but strategy means deciding what to test, why, and what counts as good. Those decisions require a human who understands the product, the risk, and the users. No model has that context by default.

    Green metrics are not a safety signal. A CI pipeline that always passes is a yes-machine until someone asks the right question. The job of QA isn’t to produce green — it’s to surface the things that shouldn’t be green. That job doesn’t go away when AI generates the tests; it becomes more important.

    The institutional knowledge that walks out the door doesn’t come back. When QA engineers are redeployed because automated coverage looks sufficient, you lose more than headcount. You lose the accumulated judgment about why certain things break, what matters in this system, and which edge cases are actually dangerous. That knowledge took years to build, and it often takes a production incident before anyone notices it’s gone.

    AI can test. It cannot assure. The question isn’t whether to use it — of course you should. The question is whether you’re still asking the hard, human questions that no tool can answer for you.
