A growing share of the code running in production today was not typed by a person. It was generated by Cursor, GitHub Copilot, Claude, Lovable, Bolt.new, or a conversational prompt session, then committed by whoever was steering. Sometimes that share is a feature or two. Increasingly it is most of the codebase.
The review practices protecting that code have not changed. Pull requests, peer approval, a CI pipeline. All of it was designed to catch the mistakes humans make. AI-generated code fails differently, and it fails in ways those practices were never built to see.
This page explains what an AI code review is, what it checks, how it differs from a traditional code review, and when you should commission one. It is written for the people who usually ask us: CTOs and heads of engineering who have inherited an AI-assisted codebase, and founders whose product was built at speed and now has to survive due diligence or production traffic.
What is an AI code review?
An AI code review is an independent, diagnostic audit of a codebase built partly or wholly with AI assistance. It examines the structural patterns AI coding tools produce: fragmented logic, schema drift between layers, silently skipped security primitives, duplicated solutions to the same problem, and tests that pass without verifying anything. The output is a verdict on whether the code is sound, repairable, or cheaper to rebuild.
One terminology point, because the phrase is used two ways. Some vendors use “AI code review” to mean review performed by an AI tool: a bot that comments on your pull requests. That is a useful layer, and we use automated analysis inside our own process. But this page, and our service, means review of AI-generated code, anchored by senior human judgement. A bot commenting on AI output is one model checking another. Neither carries liability for the verdict, and neither can tell you whether the architecture underneath is sound.
The distinction matters because the two products answer different questions. A review bot answers “is this pull request acceptable?” An AI code review answers “is this codebase fit for what you are about to ask of it?”, whether that is a fundraise, a launch, a scale-up, or a handover.
Why does AI-generated code fail differently from human code?
The defining property of AI-generated code is plausibility. It reads cleanly, compiles, and demos well. Variable names are sensible. Comments are tidy. Every individual file looks like the work of a careful engineer. That is precisely what makes the failure modes hard to spot.
Human technical debt has a smell. Rushed code looks rushed: dead ends, commented-out blocks, inconsistent naming, the visible scar tissue of deadline pressure. Experienced reviewers navigate by that smell. AI-generated code removes it. The signal experienced engineers use to decide where to look harder is gone, while the underlying faults remain.
The faults themselves cluster into patterns we see repeatedly:
- Fragmented logic. The same business rule implemented three different ways in three different modules, because each generation session solved the problem fresh. All three work today. The first schema change breaks one of them silently.
- Schema drift. The database, the API layer, and the front end each hold a slightly different idea of what the data looks like. Conversions paper over the gaps until an edge case falls through one.
- Missing security primitives. Authentication checks present on the routes the prompt mentioned and absent on the ones it did not. Secrets hardcoded because that made the demo work. Input validation that exists in the UI but not at the API boundary.
- Hollow tests. A test suite with respectable coverage numbers that asserts almost nothing, generated to satisfy an instruction to “write tests” rather than to interrogate behaviour.
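To make the first pattern concrete, here is a minimal sketch of fragmented logic. Everything in it is hypothetical: the discount rule, the two function names, and the sample values are ours for illustration, not taken from any client codebase. It shows a simpler cousin of the schema-change case: two implementations of one rule that agree on ordinary inputs and diverge at a boundary value.

```python
from decimal import Decimal

# Hypothetical business rule: orders of 100.00 or more get 10% off.


def checkout_total(subtotal):
    """Checkout module: the discount applies at 100.00 and above."""
    if subtotal >= Decimal("100.00"):
        return subtotal * Decimal("0.90")
    return subtotal


def invoice_total(subtotal):
    """Invoicing module: the same rule, written in a later session with '>'."""
    if subtotal > Decimal("100.00"):
        return subtotal * Decimal("0.90")
    return subtotal


def find_disagreements(samples):
    """Return the inputs on which the two implementations differ."""
    return [s for s in samples if checkout_total(s) != invoice_total(s)]


samples = [Decimal("50.00"), Decimal("100.00"), Decimal("150.00")]
disagreements = find_disagreements(samples)
print(disagreements)
```

Both functions read as correct in isolation, which is why a reviewer looking at either pull request alone would likely approve it. The disagreement only appears when the two are run side by side on the boundary value.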
We have written in more detail about the problems with vibe coding in production. The short version: every one of these faults passes a demo, and most of them pass a casual pull-request review.
If you already recognise your codebase in that list, the practical next step is an independent code review and audit: fixed fee, diagnostic only, no remediation upsell. Or keep reading to see exactly what the review covers.
What does an AI code review actually check?
A serious review is a structured pass over seven areas. The weighting shifts with the codebase, but the checklist does not.
- Architectural coherence. Does the system have one architecture, or several overlapping ones from different generation sessions? Are responsibilities separated on purpose, or distributed by accident? This check alone usually determines whether the verdict is repair or rebuild.
- Schema and contract integrity. Do the database schema, the API contracts, and the client models agree? Where they disagree, is the conversion deliberate and tested, or incidental and fragile?
- Security posture. Authentication and authorisation on every route, not just the prominent ones. Secrets management. Input validation at trust boundaries. Injection surfaces. Dependency vulnerabilities. This is the area where AI-generated code most often fails quietly, and the one investors’ technical due diligence probes first.
- Logic duplication and fragmentation. How many independent implementations exist for each core business rule? Which one is canonical? What breaks when one of them changes?
- Dependency health. Packages that are outdated, abandoned, or occasionally hallucinated outright. Licence terms incompatible with commercial use. Transitive dependencies nobody chose.
- Test integrity. Not coverage percentage: assertion quality. Do the tests encode the actual business rules, or do they merely execute the code and pass? Would a meaningful regression fail the suite?
- Operational readiness. Error handling, logging, observability, configuration management, and the behaviour of the system under load it has not seen yet. A codebase can be functionally correct and still be unrunnable as a business.
Together these seven areas amount to a codebase health check tuned for the way AI builds software. The same structure serves as a technical debt assessment for codebases where AI wrote a significant share of the code, because that is where the debt hides.
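As an illustration of the schema and contract check, the sketch below compares how three layers declare one entity. The layer names, field names, and types are invented for the example, and a real review reads them from migrations, API specifications, and client models rather than hand-written dictionaries. Treat it as one way to surface candidates for a closer look, not as our tooling.

```python
# Hypothetical declarations of the same "user" entity in each layer.
layers = {
    "database": {"id": "int", "email": "str", "created_at": "datetime"},
    "api": {"id": "int", "email": "str", "createdAt": "str"},
    "client": {"id": "str", "email": "str", "createdAt": "str"},
}


def drift_report(layers):
    """Map each field that is not declared identically in every layer
    to what each layer says about it."""
    all_fields = set()
    for fields in layers.values():
        all_fields.update(fields)

    report = {}
    for field in sorted(all_fields):
        declared = {
            layer: fields.get(field, "<missing>")
            for layer, fields in layers.items()
        }
        # More than one distinct declaration means the layers disagree.
        if len(set(declared.values())) > 1:
            report[field] = declared
    return report


report = drift_report(layers)
for field, declared in report.items():
    print(field, declared)
```

A flagged field is not automatically a fault. A rename between `created_at` and `createdAt` may be a deliberate, tested conversion. The report only tells the reviewer where to ask whether it is.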
Can AI code quality be measured?
Partly, and the partial measures are where teams get misled. Static analysis, linting, and coverage tooling all produce numbers, and AI-generated code tends to score well on them. The code is syntactically clean, consistently formatted, and accompanied by tests. On the dashboard, quality looks high.
The faults that matter sit in what those tools cannot see. No linter detects that two modules implement the same pricing rule differently. No coverage metric distinguishes a test that interrogates behaviour from one that merely executes code. No static analyser knows that the authorisation check missing from one route is present on its siblings, which is exactly what makes the omission invisible.
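The coverage point can be shown in a few lines. In this sketch the function under test, its deliberate bug, and both tests are hypothetical. Each test executes every line of the function, so a line-coverage tool would count them the same, yet only one of them notices that the discount is being added rather than subtracted.

```python
def apply_discount(price, percent):
    """Hypothetical function under test.
    Deliberate bug: it adds the discount instead of subtracting it."""
    return price + price * percent / 100


def hollow_test():
    """Runs the code but checks nothing about the business rule."""
    result = apply_discount(200, 10)
    assert result is not None


def behavioural_test():
    """Encodes the rule: 10% off 200 is 180."""
    assert apply_discount(200, 10) == 180


def passes(test):
    """Return True if the test runs without an assertion failure."""
    try:
        test()
        return True
    except AssertionError:
        return False


results = {
    "hollow": passes(hollow_test),
    "behavioural": passes(behavioural_test),
}
print(results)
```

The hollow test passes against the broken function and the behavioural test fails, which is the distinction a coverage percentage cannot express.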
So AI code quality is measurable, but only by combining tooling with structural judgement: does the system hold together as one design, and would it survive contact with the load, the attacker, and the schema change it has not met yet? That combination is the review. The dashboards are inputs to it, not substitutes for it.
How is an AI code review different from a traditional code review?
Both have their place. They answer different questions at different altitudes.
| | Traditional code review | AI code review |
|---|---|---|
| Unit of review | One pull request at a time | The whole codebase as a system |
| Core assumption | A competent human wrote this and may have made a mistake | A plausible generator produced this and nobody owns its assumptions |
| Catches | Local bugs, style drift, obvious design issues | Cross-module fragmentation, schema drift, structural security gaps |
| Blind spot | Faults spread across many “individually fine” changes | Day-to-day regressions (it is a point-in-time audit, not a gate) |
| Performed by | Peers inside the team | An independent senior practitioner with no stake in the verdict |
| Output | Approve, comment, request changes | A written verdict: sound, repairable, or rebuild, with evidence |
The structural point sits in the “Core assumption” row. Peer review assumes the author had a mental model of the system and made local errors against it. With AI-generated code there may be no unified mental model at all. Each prompt session held a partial picture, and nobody has examined the sum. Reviewing such a codebase one pull request at a time is how fragmented logic gets approved: every individual change looks fine, and the contradiction lives between them.
So a traditional review process is necessary but not sufficient. Keep the PR gate. Add a whole-system audit at the moments that matter, which brings us to timing.
When should you commission an AI code review?
Five triggers account for nearly every review we run.
Before a fundraise. Investor technical due diligence will examine your architecture, your security posture, and your dependency on the people who built the code. Finding the problems yourself, months earlier, costs a fixed fee. Having a diligence team find them mid-raise costs negotiating leverage, and sometimes the round.
Before a vibe-coded product takes production traffic. If the product was built primarily through conversational prompting, the gap between “works in the demo” and “survives real users” is exactly the set of faults listed above. The time to find out is before the launch, while fixes are cheap and private.
When you inherit a codebase. An agency handover, an acquisition, a departed technical co-founder. You are about to own the consequences of decisions you did not witness. An independent read establishes what you actually received, separate from what the invoice said.
When a build has stalled. Velocity was strong for the first two months, then collapsed. That curve is the classic signature of compounding AI technical debt: generation speed up front, integration cost later. A review tells you whether the foundation can carry the remaining roadmap or whether you are paying rebuild prices for repair work.
When the answer matters more than the reassurance. Some leaders simply want to know what they own, with no transaction attached. That is a legitimate trigger on its own. You should know what you own; we can tell you.
If one of those five describes you, book a strategy session and we will scope it in thirty minutes, or read on for how the engagement actually works.
Why does independence matter in a code review?
A code review is only as good as the incentives behind it. A review from the agency that wrote the code grades its own homework. A free audit from an agency that wants to sell you the rebuild has a strong pull toward a verdict that generates the next invoice. Neither has to involve bad faith. Incentives lean on judgement quietly, and the leaning shows up in the verdict.
An independent review carries no such weight. We price the diagnostic as a fixed fee, we have no remediation contract waiting behind the report, and the verdict is yours to act on with any team you choose. We have set out the full argument in why a free code audit from your dev shop is not independent, including the four characteristics worth checking before you commission any review.
The test is simple and worth applying to any reviewer, including us: does the reviewer earn more if the verdict is bad? If yes, the verdict is marketing.
What do you get at the end of an AI code review?
Three things, none of them a slide deck.
A written verdict in plain English. Sound, repairable, or cheaper to rebuild, stated directly, with the evidence that supports it. It is written to be read by a founder or a board, not just an engineer, because the decisions it informs are commercial ones.
A prioritised fault map. Every significant finding, located in the code, ranked by business risk rather than by theoretical severity. A hardcoded secret on a public route outranks an inelegant abstraction, every time.
A clear set of next moves. Where the verdict is repair, the report sequences the repairs so your existing team, or any competent team, can execute them. Where the verdict is rebuild, the report becomes the brief: what to keep, what to remanufacture, and what the rebuild must not repeat.
The report stands on its own. Some clients take it back to their incumbent team. Some hand it to a new vendor. Some convert it into a rebuild brief for our Factory, where agentic development makes a clean remanufacture dramatically cheaper than it used to be. All three are good outcomes, and the report does not lean toward any of them.
How much does an AI code review cost?
We do not publish a price list, because a useful number depends on three things we establish in one conversation: the size and surface area of the codebase, the deadline driving the review (a fundraise date changes the shape of the work), and the depth of verdict you need.
What we can say publicly: every review is a fixed fee agreed before we start, scoped in a single session, with no day-rate drift and no remediation contract attached. The comparison that matters is not the fee against zero. It is the fee against the cost of the thing the review prevents: a failed diligence, a production incident in week one, or six months of paying repair prices for a foundation that needed rebuilding.
Where do you start?
One conversation. Thirty minutes, direct with the practitioner who runs the review, no sales layer. We will tell you honestly whether your situation needs a full audit, a narrower review, or nothing at all yet.
Book a strategy session, request an assessment, or if you are not ready for either, ask a question and we will answer it by email. The report, the verdict, and the decision stay yours.