Verification — knowing what to trust

1. The new long step

Generation is fast. Specification is medium. Verification is the new long step. It's where most of your day goes, once you've practiced enough to write good specs.

Verification is the skill of looking at a piece of code (yours or the model's) and deciding how much you trust it for which purpose. You're not certifying it's perfect. You're answering: is this good enough to ship to this audience under these conditions?

That's a judgment call, and the judgment is yours. The tools below are the inputs to the judgment.

2. The five tools, in increasing cost

Verification isn't one thing. It's five tools, each with its own cost and its own coverage. Use the right one for the situation, not the most expensive one available.

Type checks. Cost: free, in seconds. Coverage: the shapes match. Doesn't catch logic errors, but catches most stupid mistakes.

Tests. Cost: minutes to write, seconds to run. Coverage: the specific cases you wrote tests for. Misses cases you didn't think of.

Manual review. Cost: minutes to read carefully. Coverage: everything your eyes catch. Variable — depends on the reader's taste and energy.

Eval sets. Cost: hours to build, free to run. Coverage: a large representative sample, scored automatically. The way you verify AI-facing code (where the output is non-deterministic).

Production observation. Cost: deferred, plus risk. Coverage: real users on real inputs at real scale. The only verification that catches the case nobody imagined.

3. Each is a tool, not a religion

Some engineers turn one of these into a religion. Everything must have 100% test coverage. Tests are obsolete, just check it in prod. Types prove correctness. Each position is right in a narrow band and wrong everywhere else.

A type check doesn't catch the logic where you returned 0 instead of 1. A test doesn't catch the input you didn't think to test. A manual review at 11 PM after eight hours of coding doesn't catch the subtle off-by-one. An eval set is meaningless if the inputs aren't representative. Production observation catches everything, but the user is the QA, which is sometimes acceptable and sometimes not.

Match the tool to the risk. A migration script you'll run once on production data needs careful manual review plus a dry-run plus a rollback plan. A unit conversion utility needs type checks and three tests and you're done. An LLM-facing prompt needs an eval set with 500 examples and a threshold for regressions.

4. The trade between speed and certainty

Verification has a built-in trade. More verification, more certainty, more time. Less verification, less certainty, more speed. The trade is real and you don't get to opt out.

The mistake juniors make: they want full certainty on everything, so they spend three hours verifying a one-line change to a logging utility. The other mistake juniors make: they ship anything that compiles, then wonder why production is on fire.

The skill is calibration. How much verification this change needs, given the blast radius if it's wrong. Reversible change to an internal tool: less. Irreversible change to user-facing money: more. Knowing where on that spectrum each change sits is part of taste.

5. Verifying AI code is different

A few things are different when the code came from the model instead of from you.

You didn't form the mental model while writing. You have to form it while reading. That's slower and easier to miss things.

The model is consistent in style across the codebase, which makes diffs harder to spot. Your eye has nothing to catch on.

The model can produce confident-looking code that's confidently wrong. With a human author, weird code is usually intentional and you can ask them. With AI code, weird code might be a bug.

The fix for all three: read the spec, then read the code, then re-read the spec. If the code does what the spec says, you can move on. If you find the code doing something the spec didn't ask for, that's a flag — it's either a creative interpretation (sometimes useful, often wrong) or a bug.

6. The verification habit

A small habit that pays off: after the model writes a function, before you accept it, type out one line stating what you've verified.

"Reviewed for empty input, null input, and the boundary at length === windowSize. Type-checked. No tests yet — will add when the caller is built."

That sentence does three things. It forces you to actually verify. It records what you verified for later. And it catches the case where you were about to accept code without verifying at all — because the sentence would be embarrassing to write if you hadn't done it.

You don't need a process for this. You need the habit. Build it early.