The 30-second recap, then the part the video skipped
If you saw the short: Karpathy's framing is that the difference between old software and AI is verifiability — "traditional computers automate what you can specify in code; LLMs automate what you can verify." The car-wash question (drive or walk 50m to a car wash? — every model says walk) is just a clean way to show models breaking where the context lives only in your head. Three layers fix that: write a spec, add a verifier, set up the environment. That's the idea. The problem is that "write a detailed spec" and "verify the output" are the kind of advice that sounds obvious and changes nothing. So this guide does one thing the 60 seconds couldn't: it runs one real task through all three layers and shows you the actual artifacts — the spec text, the verifier prompts, the files you commit. Steal them and adapt.
WORKED EXAMPLEThe task we'll run through all three layers
Pick something small enough to finish in a sitting but real enough to have edge cases. We'll use this one for the whole guide: > "Add an endpoint to our API that exports a customer's invoices as a CSV." That sentence is a vague ask. Handed to Claude as-is, you'll get a CSV export — probably the wrong columns, no auth check, and a memory blow-up the first time someone with 40,000 invoices hits it. The three layers are how you close that gap before a line of code is written. Watch the same task get sharper at each step.
- Vague ask: "export invoices as CSV" — Claude fills the gaps with guesses
- After Layer 1 (Spec): exact columns, auth rule, date format, the 40k-row case named out loud
- After Layer 2 (Verifier): a definition of "correct" Claude can check itself against
- After Layer 3 (Environment): the rules that made it behave without you re-typing them
LAYER 1 · SPECLayer 1 — The Spec: turn the vague ask into a contract
A spec is just your understanding written down in a form Claude can build from. Karpathy is lukewarm on plain plan mode — useful, but he'd rather work with the agent to design a genuinely detailed spec. Don't write it solo and don't accept Claude's first draft. Make it interview you, then read what it produces like an engineer, not a customer. Keep specs small and compartmentalized (this one endpoint, not "the whole billing module").
- Open plan/spec mode and hand it the one-liner plus this: "Don't build yet. Interview me. Ask the 5–8 questions whose answers you'd otherwise have to guess — auth, data shape, edge cases, format, scale — one at a time."
- Answer honestly, including the things you'd normally leave implicit ("only the account owner can export", "some customers have 40k+ invoices").
- Then force the decision check — paste: "Restate the spec as a numbered contract. For each key decision, show me the option you chose and the one you rejected, so I can veto."
- Read every line. If a line is vague to you, it's vaguer to the model — rewrite it before you continue.
GET /customers/:id/invoices.csv. (2) Auth: caller must be the account owner or admin; otherwise 403. (3) Columns, in order: invoice_id, issued_date (ISO-8601), amount_cents, currency, status. (4) Stream the response — do not build the whole file in memory; assume up to 50k rows. (5) Empty result returns a header-only CSV, not a 404. Five lines. That's the whole point — five specified lines beat five paragraphs of prose.LAYER 2 · VERIFIERLayer 2 — The Verifier: give 'correct' a definition the model can check
This is the layer to actually get good at. The most painful part of working with AI is checking its output, so most people eyeball it and hope. Karpathy's point is that verification is the real lever — and the move is to make the model help verify itself against something concrete, not against your vibe. Three moves do most of the work.
- Set the bar up front, before building: "List the evaluation criteria you'll judge this endpoint against — correctness, security, performance, edge cases — then build to them." This kills vague asks like "make it good."
- Feed it a real reference so 'correct' has a shape: paste one row of real (or realistic) expected CSV output and the exact header line. Now "matches the format" is checkable, not aspirational.
- Wire it to reality where you can: "Write a test that calls the endpoint as a non-owner and asserts 403, and one that exports 50,000 generated invoices and asserts memory stays flat (streamed, not buffered). Run them." A passing test beats "looks right."
- Bundle it into the build prompt: end with "Before you call this done, run the verification step: check the output against the criteria above and the sample row, and report each check as pass/fail with the evidence."

LAYER 3 · ENVIRONMENTLayer 3 — The Environment: make the first two automatic
Layers 1 and 2 work once. The environment is what stops you re-typing them every session. Two pieces do almost all of it: a CLAUDE.md that states the rules in plain language, and — for the lines that must never be crossed — a hook that enforces them at the tool level. A prompt rule is a polite request the model can rationalize past. A hook is a wall it physically cannot walk through. The table sorts every action into three buckets; put the buckets in CLAUDE.md, and back the NEVER row with a hook.
- Write the three buckets above into
CLAUDE.mdat the repo root. It loads every session with no script to run — the right home for conventions that don't change. - For the NEVER row, add a
PreToolUsehook in.claude/settings.jsonwithmatcher: "Edit|Write"pointing at a script. CLAUDE.md asks; the hook blocks. - In the script, read
tool_input.file_pathfrom the JSON on stdin (jq -r '.tool_input.file_path'), and if it matches.env,/infra/or/migrations/*, print a reason to stderr andexit 2— exit 2 is a blocking error and the stderr text goes back to Claude as the reason. - Prefer
exit 2for the simple case; if you want richer control, exit 0 with a JSONpermissionDecision: "deny"on stdout instead. - Audit it now and then — paste your
CLAUDE.mdback and ask: "Which of these rules are vague, unenforceable, or never actually checked? Rewrite those."
| Bucket | For our CSV task | How you enforce it |
|---|---|---|
| ALWAYS (autopilot) | Before a multi-step change, write the spec + eval criteria and add a failable test | A line in CLAUDE.md — loads every session |
| ASK FIRST (pause) | Schema/migration changes, new deps, anything touching auth or billing | A line in CLAUDE.md — Claude checks with you |
| NEVER (hard line) | Edit .env, files under /infra, or existing DB migrations | A PreToolUse hook — Claude physically can't |
hooks.PreToolUse[0].matcher = "Edit|Write" and hooks[0].command pointing at protect.sh. The script does f=$(jq -r '.tool_input.file_path' < /dev/stdin), then a case "$f" on the protected globs that echoes a message to >&2 and runs exit 2. That's the whole wall — a few lines, and Claude can no longer touch those paths no matter what the prompt says.THE DEEPER SKILLContext engineering: what actually goes in the window
Karpathy's term for the real job is context engineering, which he defines as "the delicate art and science of filling the context window with just the right information for the next step." The three layers above are how you do it on purpose. His warning is the part people miss: "too little or of the wrong form and the LLM doesn't have the right context; too much or too irrelevant, and cost goes up and performance comes down." More context is not better context. For the CSV task, the table below is the difference between a window full of noise and a window aimed at the next step.
| Goes in the window | For our CSV task | Why it earns its place |
|---|---|---|
| The spec/contract | The 5-line numbered contract from Layer 1 | The model's actual target — without it, it guesses |
| A concrete reference | One sample CSV row + the exact header line | Turns 'correct format' into something checkable |
| The real schema/types | The Invoice model and the auth helper signature | Stops it inventing column names and authz it can't call |
| Evaluation criteria | Correctness / security / perf / edge-case list | Defines 'done' before the build, not after |
| Leave OUT | The whole repo, unrelated modules, old chat | Irrelevant tokens raise cost and lower performance |
CHEAT SHEETRun sheet: the whole loop on one page
Once it clicks, the method is a short loop you can run on any task. The labels are a scaffold; the discipline underneath — understand, specify, define correct, enforce — is the bit that compounds.
- Spec — "Interview me, then restate as a numbered contract; show the option you rejected per decision."
- Verify — "List evaluation criteria first; build to them; add a test that can fail; report each check pass/fail with evidence."
- Environment — put rules in CLAUDE.md; enforce the hard lines with a PreToolUse hook, not a prompt.
- Curate context — give the minimum right files/examples for the next step; leave the rest out.
- The skill that lasts — "You can outsource your thinking, but you can't outsource your understanding." Stay the person who understands the problem well enough to steer.
Get the next drop
New AI build guides + the occasional bonus template. No spam, unsubscribe anytime.
By submitting you agree to our Privacy Policy & Terms. Unsubscribe anytime.
Frequently asked questions
What does Karpathy's method actually change about how I prompt Claude?
How do I write a good spec without over-engineering it?
What's the single highest-leverage move in the whole method?
Why use a hook instead of just writing the rule in CLAUDE.md?
PreToolUse hook is enforcement it can't. The hook matches Edit/Write, reads tool_input.file_path from stdin, and exits with code 2 (stderr becomes Claude's error) or returns a permissionDecision: deny on exit 0. Use CLAUDE.md for conventions, the hook for genuine never-cross lines like .env or migrations.