Formal specification vs. iterative debugging: why proving beats testing for regulated AI

Testing tells you what happened in the cases you thought to test. Formal specification tells you what can happen across every possible input. In regulated AI, where the cost of an edge case is a regulatory breach, the difference is not academic.

The default mode: generate and test

Most engineering teams using AI today operate on a simple loop: prompt the model, read the output, run the tests, fix what breaks. This works well enough at human-code-writing speed. A senior engineer writes two hundred lines a day. They carry the system's invariants in their head. When something goes wrong, they have the context to know why.

AI code generation breaks that equilibrium. A single engineer can now produce thousands of lines per session. The tests still cover what you thought to test. But the reasoning capacity required to understand what the new code does—every edge case, every interaction with existing state—has not scaled with the output rate. You have more code, the same human attention, and the same test suite written to cover yesterday's assumptions.

The bugs that surface are not the obvious ones. Obvious bugs fail fast. The dangerous ones are silent invariant violations: code that passes every test you wrote because your tests never asked the right question.

What changes at scale

Now multiply by ten. Ten engineers, each generating ten times as much code as before, produce a hundred times the code a single engineer used to write by hand. The integration points alone create a combinatorial problem that no test suite can exhaustively cover, not because your QA team is bad, but because the space of possible states is too large to enumerate by hand.
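To get a rough sense of that scale, here is a small sketch. It assumes an idealised case in which several components each perform a fixed number of ordered steps and those steps can interleave freely. The function name and the sizes are illustrative assumptions, not measurements from a real system.

```python
from math import factorial

def interleavings(components: int, steps_each: int) -> int:
    """Count orderings of all steps that keep each component's own steps in order."""
    total_steps = components * steps_each
    # Multinomial coefficient: (k*n)! / (n!)^k
    return factorial(total_steps) // factorial(steps_each) ** components

print(interleavings(2, 2))    # 6
print(interleavings(3, 4))    # 34650
print(interleavings(10, 10))  # ten components, ten steps each
```

Three components of four steps each already allow 34,650 orderings. A hand-written test suite samples a handful of them.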

Consider a financial system where accounts can hold positions across multiple funds. A correct implementation must ensure that no single account exceeds a 10 percent ownership stake in any one fund—a regulatory requirement, not a preference. A test suite will check the cases you wrote: account at 9%, account at 10%, account at 11%. It will not check the interaction between a partial redemption, a same-day subscription in a different share class, and a NAV recalculation that temporarily produces a state your code never anticipated.

A specification checker will. Testing covers the cases you thought to test. A checked specification covers every state the model can reach.

The specification-first approach

Specification-driven development (SDD) inverts the order of work. Before any code exists, you write down what must be true: the invariants the system can never violate and the state transitions it is permitted to make. These are not comments. They are checkable claims.

The fund ownership example becomes a single line:

for all funds f and accounts a: ownership_share(f, a) <= 0.10

That line is either provably satisfied by the system or provably violated. There is no middle ground, no "it passed in staging." A model checker evaluates it against all reachable states. If the property holds, you know it holds everywhere. If it does not, the checker produces a counterexample—an exact sequence of transitions that breaks it—before a single line of production code is written.
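What "evaluates it against all reachable states" means can be sketched with a toy explicit-state search. This is not TLA+, Alloy, or Spec++. It is a deliberately small model built on assumptions: one tracked account A, all other holders collapsed into a single unit count, moves of one unit at a time, and a bound of 20 units for the other holders. The names `invariant`, `transitions`, and `check` are hypothetical.

```python
from collections import deque

CAP_DEN = 10       # cap of 1/10, kept in integers to avoid float rounding
MAX_OTHERS = 20    # bound on the toy state space (assumption)

def invariant(state):
    """ownership_share(f, a) <= 0.10, written as 10 * a <= total units."""
    a, others = state
    return CAP_DEN * a <= a + others

def transitions(state, guard_all):
    """Yield (label, next_state) for every move the design permits."""
    a, others = state
    # A's own subscription is always checked against the cap.
    if invariant((a + 1, others)):
        yield "A subscribes 1", (a + 1, others)
    # Redemptions by others shrink the fund and so raise A's share.
    # The naive design (guard_all=False) never re-checks the cap here.
    if others > 0 and (not guard_all or invariant((a, others - 1))):
        yield "others redeem 1", (a, others - 1)
    if others < MAX_OTHERS:
        yield "others subscribe 1", (a, others + 1)

def check(initial, guard_all=False):
    """Breadth-first search; return the shortest violating trace, or None."""
    parent = {initial: None}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            trace = []
            while parent[state] is not None:
                previous, label = parent[state]
                trace.append((label, state))
                state = previous
            return trace[::-1]
        for label, nxt in transitions(state, guard_all):
            if nxt not in parent:
                parent[nxt] = (state, label)
                queue.append(nxt)
    return None

for label, state in check((0, 20)):
    print(label, "->", state)
print(check((0, 20), guard_all=True))  # None: no reachable state breaks the cap
```

In the naive design every individual subscription passes its cap check, yet the search finds a five-step trace ending with A holding 2 of 19 units, because a redemption by someone else pushed A over the limit. With the guard applied to every transition, the same search exhausts the bounded state space without finding a violation.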

This is qualitatively different from testing. Tests are witnesses to behaviour. Specifications are constraints on behaviour. The distinction matters most at the boundaries of what you imagined when you wrote the code.

What Spec++ adds

Writing formal specifications has historically required deep expertise in tools like TLA+, Alloy, or Coq. The barrier was not the idea—engineers understood why it mattered—it was the tooling gap between the specification and the running system. Specs rotted. Code drifted. The two became disconnected.

Spec++ is a methodology for writing specifications precisely enough that tools can act on them end to end: prove properties before any code exists, generate implementation that satisfies the spec, and enforce the same rules at runtime through the Compliance Operator.
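The text does not describe the Compliance Operator's interface, so the following is only a generic sketch of what runtime enforcement of a specified invariant can look like. Every name in it (`enforce`, `cap_holds`, `Fund`, `InvariantViolation`) is hypothetical, and the policy of rolling back and rejecting the operation is an assumption.

```python
import copy
import functools

class InvariantViolation(Exception):
    """Raised when an operation would leave the object in an invalid state."""

def enforce(invariant):
    """Decorator: run the method, then roll back and raise if the invariant fails."""
    def decorator(method):
        @functools.wraps(method)
        def wrapper(self, *args, **kwargs):
            snapshot = copy.deepcopy(self.__dict__)
            result = method(self, *args, **kwargs)
            if not invariant(self):
                self.__dict__ = snapshot  # restore the last valid state
                raise InvariantViolation(f"{method.__name__}{args} breaks the ownership cap")
            return result
        return wrapper
    return decorator

def cap_holds(fund):
    """for all accounts a: ownership_share(f, a) <= 0.10, in integer arithmetic."""
    total = sum(fund.holdings.values())
    return all(10 * units <= total for units in fund.holdings.values())

class Fund:
    def __init__(self, holdings):
        self.holdings = dict(holdings)

    @enforce(cap_holds)
    def subscribe(self, account, units):
        self.holdings[account] = self.holdings.get(account, 0) + units

    @enforce(cap_holds)
    def redeem(self, account, units):
        if units > self.holdings.get(account, 0):
            raise ValueError("cannot redeem more than is held")
        self.holdings[account] -= units

fund = Fund({f"acct-{i}": 10 for i in range(12)})  # 12 holders, about 8.3% each
fund.redeem("acct-0", 10)      # total 110, largest share about 9.1%: accepted
fund.redeem("acct-1", 10)      # total 100, largest share exactly 10%: accepted
try:
    fund.redeem("acct-2", 10)  # total would be 90, remaining holders above 10%
except InvariantViolation as err:
    print("rejected:", err)
```

The check is the same predicate as the specification line, applied after every state-changing call, so the rule is not re-implemented separately inside each operation.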

In one pilot, 11 pages of Spec++ produced 3,829 lines of production code, roughly 350 lines of code per page of specification. The resulting system contained zero invalid states. This is not "we didn't find any": the property was proven, not tested. The specification was the source of truth, and the code was a mechanically verified consequence of it.

That number matters because it quantifies what engineers actually spend time on when AI generates code from specification: they maintain intent, not implementation detail.

The "AI generates from spec" difference

When AI generates code from a proven specification, the generation target is intent that has already passed checking. The model is not improvising—it is implementing constraints that were verified before it was asked to produce anything.

This changes the error model. Without a specification, AI code generation can introduce invariant violations that are syntactically correct, semantically plausible, and wrong in a way that only surfaces under a specific combination of inputs at production load. With a specification, any implementation that violates a proven property fails verification before it ships.

The approach is also language-agnostic. One specification can produce Python, Java, Rust, or Go implementations. The invariants do not change. The language is a deployment decision, not an architectural one. Teams that need to target multiple runtimes—a common constraint in regulated industries with heterogeneous infrastructure—write the specification once and generate for each target.

Specification stewardship as a role

SDD introduces a role that does not yet have a standard title: the specification steward. In organisations where AI writes most of the code, someone must own the specification the way a product owner owns requirements. Not as a gatekeeper, but as the person responsible for keeping expressed intent accurate, complete, and checkable.

This role does not require a formal methods PhD. A specification steward needs to understand the domain well enough to write down what must be true (the business rules, the regulatory constraints, the safety properties) and to recognise when a proposed change to the specification would weaken a guarantee the system depends on.

In practice this role often emerges from a senior engineer or architect who was already doing informal versions of it: maintaining the mental model of the system's invariants, reviewing PRs for correctness rather than style, being the person others consult when something subtle goes wrong. SDD gives that person a formal artefact to maintain and tooling to enforce it.

The specification steward sits at the intersection of product, engineering, and compliance. In regulated industries, this is not an optional role—it is the person who can answer an auditor's question about system behaviour with a proof rather than a test report.

Where to start

Most teams do not need to replace their entire development process on day one. A practical entry point is to identify one system property that is currently tested but not proven—typically a constraint that would cause a regulatory or financial exposure if violated—and write a specification for it. Run the checker. See what the proof reveals.

The value is often immediate. In most cases, the first formal specification surfaces an edge case the test suite missed—not because the tests were poorly written, but because the specification asks a harder question.

An SDD adoption programme with Riley Betts helps engineering organisations implement this across product, engineering, and QA: establishing the specification layer, integrating proof steps into CI gates, introducing provenance tracking so every deployed artefact links back to the specification it satisfies, and developing the specification steward capability in-house.
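As a minimal sketch of the provenance idea, assume a build step records a hash of the specification it checked alongside a hash of the artefact it produced, and a CI gate refuses to deploy unless both still match. The record layout and the function names are hypothetical. Only `hashlib` from the Python standard library is used.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def make_record(spec_text: str, artefact: bytes, proof_passed: bool) -> dict:
    """Link a built artefact to the exact specification text it was checked against."""
    return {
        "spec_sha256": sha256_hex(spec_text.encode("utf-8")),
        "artefact_sha256": sha256_hex(artefact),
        "proof_passed": proof_passed,
    }

def gate(record: dict, current_spec: str, artefact: bytes) -> bool:
    """Allow deployment only if the proof passed for this spec and this artefact."""
    return (
        record["proof_passed"]
        and record["spec_sha256"] == sha256_hex(current_spec.encode("utf-8"))
        and record["artefact_sha256"] == sha256_hex(artefact)
    )

spec = "for all funds f and accounts a: ownership_share(f, a) <= 0.10"
artefact = b"compiled-service-bytes"  # stand-in for a real build output
record = make_record(spec, artefact, proof_passed=True)

print(gate(record, spec, artefact))                          # True
print(gate(record, spec.replace("0.10", "0.15"), artefact))  # False: spec changed after the proof
```

If the specification is edited after the proof ran, the recorded hash no longer matches and the gate fails, which forces the proof to be re-run against the new text.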

Not every engineer needs to become a formal methods specialist. The goal is an organisation that can state what its systems must do, prove that they do it, and maintain that proof as the systems evolve. AI-assisted development makes the code generation problem easier. SDD makes the correctness problem tractable.