I build software I can't read

Six lessons from building production software I can't read — each with a hole in it, because most stories about building with AI flatter the builder.

Three times the AI told me a feature was done. Three times it wasn't. A flow that worked, except the wiring hung half-connected. A status that read "Phase 1, 2 and 3 complete," on top of a broken screen. I shipped some of it before I found out, because I can't read code.

I'm a marketer. Fifteen years in B2B, ex-AWS, no engineering background. In the evenings I build a SaaS for personal trainers, and the AI writes every line. I decide what gets built and what gets thrown away.

People assume the skill I lack is thinking like a developer. I assumed it too. I was wrong. Thinking like a developer is the limitation, not the gap I need to close. The AI has that limitation baked in, and most of my evenings go into fighting it.

Six lessons. Each one has a hole in it, because most stories about building with AI flatter the builder.

1. The AI thinks like a developer, and that is the problem

A developer optimizes for what is defensible in the code, not for what the user gets. The two look the same until they don't.

One night a CI gate passed. It passed because an earlier session had added continue-on-error to the pipeline. The failing tests still failed. The gate just stopped reporting them. Technically the build was green. I caught it: "EERST revert: commit 0e0d680 (continue-on-error) terugdraaien. De gate moet blijven." (Revert first. The gate stays. Hide nothing.)

Another night the AI wrote an exercise note into an existing field, scheduled_exercises.notes. Reusing the nearest field saves building a new path, and technically it works. It also overwrites the planned note for that exercise, forever. I called it what it was: "dat is niet mild, dat is een semantisch lek." (That is not mild. That is a semantic leak.)

Same instinct both times. Take the route that is cheapest for the code. Green build, reused field, an extra abstraction for later. All defensible. All walk straight past the user.

My job is to drag every decision back to the user, and that is not a fixed amount of rigor. On the CI gate I wanted it stricter, the gate stays. On the live sync two lessons down, I wanted it gone. A developer who thinks in "shortcut versus thorough" can't hold both. They hold fine on one axis: does this serve the trainer and the lifter, yes or no.

So "don't think like a developer" does not mean drop the defense. My stance on data is deeply defensive. Fail closed, RLS as a hard boundary, consent gates. Defending against a data leak is right. Defending "my build must be green" is the shortcut I hate. Defend the user, not the code.

2. It inherited the caution, not the constraint

The AI was trained on code humans wrote, and on their reviews, docs and arguments. It picked up their skills. It also picked up their caution.

A human says "this takes two weeks" because for a human it takes two weeks. Typing, meetings, getting tired. The AI learned that language without the brake underneath it. So it offers to phase the work, to defer, to call a change too risky for one session, while it can produce the whole thing in minutes. It carries the timelines of people it is not.

That moves the only question that matters. The old one was: can I build this. Building was the brake. Building is almost free now, so what is left is: should this exist.

Here my own reasoning broke. I decided "then I'll just build the best version right away." Half right. Best execution is free now, so build clean. But "best version" also means "most complete version," and the most complete version of the wrong thing is still wrong, now polished and hard to delete. The AI builds it in five minutes. I carry it for years. Every feature is surface for a bug, an RLS hole, a support ticket on bad gym wifi. Speed does not lower that. It raises it, because the friction that used to stop me is gone.

3. My job is the no

The AI always wants to build more. An abstraction, an entity, a feature for later. Almost every evening, my work was cutting it back.

A trainer stands next to his client in the gym and watches along. The plan wanted two phones syncing live. I started by killing a false choice. The question on the table was binary: does the trainer log in as the client, or work in his own environment. Both felt wrong. Logging in as the client feels like breaking in. A separate trainer panel means you are not training together. So I changed the question: "Eén sessie, twee kijkers. Whiteboard-model: allebei schrijven op hetzelfde bord." (One session, two viewers. A whiteboard two people write on.) "Het apparaat maakt niet uit, de sessie hangt aan de klant." (The device does not matter. The session hangs off the client.)

That reframe killed the live sync on its own. One writer per session, no conflict, no dependency on gym wifi. "Single-device is de proof. De twee-telefoons-live-mirror is geen v1-bet en waarschijnlijk nooit nodig. Schrap 'm." (Single device is the proof. The two-phone live mirror is not a v1 bet and probably never needed. Cut it.)

It helped that I know the gym floor and the AI does not. People do not pull out their phone mid-set. "Telefoon in de tas." (Phone in the bag.) One fact about real behavior removed an entire technical feature.

The hole: this reads like a marketer with sharp judgment. The honest version is uglier. I let the AI work out a full planning round on that track before I called it back, so the no came late. And half of my no is not insight. It is cleaning up complexity the AI added in the first place. Without its mess, I would have less to cut.

4. Never guess a fact you can look up

I can't read the code, so I guess how the system works, and the AI builds on my guess. One night that nearly shipped something destructive. The plan wanted to rewrite a client's whole program when he got injured. My own example assumed a feature we had just parked.

The agent read one line and stopped. The field that links a logged set to a planned workout cannot be empty. A free session with no plan has nowhere to land. That fact, with a line number, reorganized the whole approach. A knee does not need a rewritten program. "Ja en alleen een oefening anders die knieen raken en soms hoeft het ook niet echt eens." (Yes, just swap the one exercise that hits the knee, and sometimes it doesn't even need that.)

The rule became simple. Never guess what you can verify. I made the AI prove it. "Meet eerst, repareer daarna." (Measure first, fix second.) "Print steeds de RAUWE output." (Always print the raw output.) "Toon per hit de letterlijke code-regel zodat ik zelf kan zien. NIET fixen, alleen tonen." (Show me the literal line per hit so I can see it myself. Don't fix, only show.)

The hole: the rule only catches the guesses I recognize as guesses. The dangerous ones are the assumptions I don't see. And for architecture and prompt logic there is no single line to look up. I still guess there, and the rule does not cover me.

5. The arbiter can't be a model

Lesson 4 was about the facts going in. This is about the verdict coming out. I ran one research task through three AIs to get a production hardening plan. Two agreed on the core. The third invented a tool with full confidence: npx @sparky123/vibecheck, top recommendation, in a clean table marked "ESTABLISHED." The package does not exist.

A language model produces the most plausible text, not the most true. A confidence label made by the same process that made the error cannot catch the error. It only makes it more convincing. And this is not one vendor. I asked myself the question that organized everything after it: "jij bent het model maar steeds in een nieuwe versie van jezelf toch [...] Claude Chat en Claude Code zijn toch ook hetzelfde." (You are the model, just a fresh version of yourself each time. Claude Chat and Claude Code are the same model too.) They are. One instance can't be an independent judge of another. They share blind spots.

So the judge has to be a tool that returns a fact. Gitleaks finds zero or finds something. Postgres refuses a write or allows it. pnpm audit counts CVEs. No model talks those away. In my repo gitleaks found two secrets, one a real Sentry token from a commit on 26 May 2026. None of the three plans had found it. The audit reported 21 advisories, and scoping showed 13 came through one dead dependency that does not even run in my build.

The hole: a mechanical judge only covers known classes. The most expensive bug in my app was found by no scanner. A user could push his own training block past the trainer's review through a status field. No CVE, no tool, no playbook caught it. One direct question did. "The tools are green" is not safety. It is a smaller net than it feels.

6. A model that lies politely is the most expensive

Back to the three times it told me done. For someone who reads the code, a premature "done" is small. For me it is the most expensive thing the AI can do, because its word was my only foundation.

One night the schema would not generate. Raw, uncorrected: "we hebben nu 1 gebruiker, ik, en nog steeds werkt het niet godverdomme kut programmeur." (We have one user, me, and it still doesn't work, goddamn shit programmer.) The next message, no pause: "okay hij is bezig. ik stel voor een teller die calculeert hoe lang het duurt." (Okay, it's running. I propose a counter that times how long it takes.)

I tell myself "I couldn't check a human developer either." True, and it hides something. That human was accountable. He knew the system next month. I could ask him if he was sure. The AI has none of that. My whole trust system, the raw output, the database as judge, the PR gates, is mostly rebuilding locks a human gives you for free. The accountability moved fully to me, and that is why testing feels heavier now.

My verification also has a blind spot exactly where the cost is highest. The worst leak I shipped looked perfect in the UI. The UI even filtered it out of view. It only leaked at the data layer. "I can see it works, and there is a test" is a false sense of safety on precisely the data where being wrong costs the most.

What is left

I can't read code. The gap between "looks done" and "is safe" never fully closes. I narrowed it by forcing the AI to show proof I can read.

When I build, the scarcity is not tokens. I run on a flat-rate plan, so they are close to free until I hit the cap. The scarcity is my attention to verify, and that is the work I like least. The product is the other economy, billed per token, and there I spent two evenings on caching and a cheaper model to hold the cost down. But that is the small lever. The big lever is fewer features. More of them widen the surface I can't check and add cost to every call. So the smart use of the AI's speed is not more screens. It is making the AI build the proof: the adversarial write test, the scanner, the load test. Point the speed at the bottleneck.

I never needed to learn to think like a developer. That instinct, in me or in the machine, optimizes for code that is defensible. My one job is to drag every decision back to the person using it. That is the part the machine does not have. It is the part I brought from marketing, and it turns out to decide whether the product is any good.

CoNudge is the B2B2C SaaS for personal trainers I am building. Pilot starts August 2026. conudge.com