YouTube Analysis
Published 28 May 2026

Andrej Karpathy on Agentic Engineering: Why He's Never Felt More Behind as a Programmer

Andrej Karpathy's AI Ascent 2026 talk on agentic engineering, Software 3.0, jagged intelligence, and why he stopped writing bash scripts. Production field notes from a Lead Engineer running this stack since the December 2025 inflection point.

James Bennett

26 minute read

Agentic engineering is the disciplined practice of coordinating AI coding agents to ship professional software at scale while preserving security, maintainability, and the quality bar. Andrej Karpathy coined the term to describe what comes after vibe coding — and at Sequoia's AI Ascent 2026, he explained why December 2025 was the inflection point that made him feel "behind" as a programmer for the first time in his career.

Video Summary and Key Insights

Andrej Karpathy on stage at Sequoia AI Ascent 2026 discussing why he feels behind as a programmer in 2026

Andrej Karpathy sat down with Stephanie Zhan at Sequoia Capital's AI Ascent 2026, one year after he coined the term "vibe coding." He co-founded OpenAI, ran AI at Tesla, and now runs Eureka Labs. His 29-minute fireside chat lays out a single thesis. Agentic engineering is the serious discipline that has to grow on top of vibe coding so professional software keeps its quality bar.

The single most important takeaway: Software 1.0 was code. Software 2.0 was learned weights. Software 3.0 is prompting as programming. Your context window is the lever, and the LLM is the interpreter that runs it. According to LangChain's 2026 State of Agent Engineering report, 57% of teams already have agents in production. That's exactly the population Karpathy is addressing.

Key Insights:

December 2025 was a stark phase change in agentic coding. Karpathy was on break with more time to test the latest models. He asked for bigger chunks. He stopped correcting the output. Anyone who tested these tools earlier in 2025 and walked away should reset their priors. The workflow that "actually started to work" is the coherent multi-step agentic one, not single-turn completion. According to Anthropic's 2026 Agentic Coding Trends Report, 2026 is the year these gains extend far beyond incremental improvements to existing tools.

OpenClaw's installer is the canonical Software 3.0 example. Instead of a bash script that balloons to handle every platform, OpenClaw ships as a chunk of text you copy-paste into your agent. The agent reads your environment, debugs in the loop, and installs itself. The "program" is text, the "interpreter" is the LLM.
MenuGen got obsoleted by a one-shot Nano Banana prompt. Karpathy built a full Vercel app to OCR a restaurant menu, generate images for each item, and re-render the menu. Then he saw the Software 3.0 version: hand the photo to Gemini, ask Nano Banana to overlay the food images directly onto the menu pixels. The middle app shouldn't exist.

Models are jagged because labs only RL-train what they care about. Code peaks because labs pour reinforcement learning environments at it. Chess jumped from GPT-3.5 to GPT-4 because someone at OpenAI added chess data, not because of a generic capability lift. If your domain isn't in the data distribution or the RL mix, you're "pulling teeth." That's the founder opportunity.
The car wash example shows how jagged frontier models still are. A state-of-the-art Opus 4.7 will refactor a 100,000-line codebase or find zero-day vulnerabilities. It will also tell you to walk 50 meters to a car wash because it's "so close." The right reading: stay in the loop. Treat the agent as a tool that's flying in some circuits and crashing in others.

Hiring is still built for the old paradigm. Karpathy's proposal: skip the LeetCode puzzles. Give the candidate a big project, like "write a Twitter clone for agents, make it secure." Then turn ten Codex 5.4 instances on the deployment and try to break it. That's the workload an agentic engineer actually runs.
You're in charge of taste, not API details. Karpathy admits he no longer remembers whether it's keepdim or keep_dims, dim or axis. The intern handles that. What he still owns: the spec, the unique-user-ID decision, whether the abstraction is sane. The MenuGen Stripe-vs-Google-email bug is his favorite example of what taste catches and agents miss.
You can outsource your thinking, but not your understanding. Karpathy ends on a tweet that's been rattling around his head: directing agents is bottlenecked by understanding what you're trying to build and why. That's the human skill that doesn't compress.

Why I Spent an Evening on This Talk

I run engineering at WebSearchAPI.ai. The December 2025 inflection Karpathy describes is the same one we lived through internally. Our team had been using Claude Code in a half-trusting way through Q3, accepting chunks and re-reading them line by line. Then the December model dropped. Within a week, our largest agentic refactor (a complete rewrite of the freshness ranker) shipped without a single manual diff.

We weren't alone in catching that shift. According to TechnologyChecker's May 2026 Claude adoption statistics, Claude Code crossed a $2.5 billion annualized revenue run rate within 9 months of launch and now authors roughly 4% of public GitHub commits. That's the macro shape of what we felt internally. So when Karpathy says he "can't remember the last time" he corrected it, that landed.

The other reason I sat with this one twice: Karpathy is one of the few people who builds his own demo apps, watches them get obsoleted by the next model release, and writes up the post-mortem honestly. MenuGen as a soon-to-be-dead app is more useful to me than another thread about "the future of software." It's a real shape of what gets compressed away.

What I want from this post is a working map of his framework (Software 3.0, jagged intelligence, ghosts not animals, agentic engineering). I'll walk through the parts that match what we see in production at WebSearchAPI.ai, the tactical patterns we've found work, and the parts I'd push back on.

What Is Agentic Engineering and How Does It Differ From Vibe Coding?

Sequoia AI Ascent 2026 stage opening with Andrej Karpathy and Stephanie Zhan before the agentic engineering conversation

Agentic engineering is the disciplined practice of coordinating one or more AI coding agents to produce production-grade software while a human stays accountable for the spec, the quality bar, security, and architectural taste. Vibe coding is the floor-raising version: prompt, accept, run, paste errors back, ship something that didn't exist before. Karpathy himself coined both terms within roughly a year. He's now the clearest voice on why the gap between them matters.

The clearest single line in the talk: vibe coding raises the floor. Agentic engineering preserves the ceiling. Both can exist on the same team, on the same day, even in the same repo. They're different practices with different accountability structures.

The reason this distinction matters in 2026 is that a meaningful share of professional teams have already crossed it without naming it. According to LangChain's State of Agent Engineering 2026 report, 57% of respondents now have agents in production, with large enterprises leading adoption. Among those teams, 32% cite quality as the top barrier. Nearly 89% have implemented observability, outpacing eval adoption at 52%. Those are agentic engineering numbers. Vibe coding doesn't generate evals or observability dashboards. Disciplined teams do.

Dimension	Vibe Coding	Agentic Engineering
Goal	Raise the floor — anyone can ship	Preserve the ceiling — professional bar stays
Output	Side project, prototype, weekend app	Production system, secure, maintainable
Accountability	Implicit — "it works on my machine"	Explicit — vulns, regressions, on-call ownership
Workflow	Single agent, single thread, one chat	Multi-agent orchestration, specs, evals, observability
Human role	Imagination, prompt fluency	Spec writing, oversight, taste, eval design, security review
Quality bar	Whatever the model produces	The bar you set before the agents start
Cost of error	Throw it out, vibe again	Real users, real data, real consequences
Speed-up vs solo coder	5–10× on greenfield	Reportedly far beyond 10× on existing systems

IBM's agentic engineering primer draws on Stack Overflow's 2025 Developer Survey, which found 84% of respondents either use or intend to use AI-assisted programming. That's the population vibe coding raised. The smaller, disciplined slice within it — the ones running observability, evals, multi-agent loops, and audited specs — is the population agentic engineering serves. Same tools, very different practice.

For a longer treatment of how a single team can compose the two practices into one company, the Garry Tan and Diana Hu CS153 lecture at Stanford makes the operating-system argument: agentic primitives map directly onto company structure. That post pairs well with this one. Karpathy gives you the discipline. Tan and Hu give you the org chart.

What Is Software 3.0 and Why Does It Change How You Build?

Karpathy diagramming Software 1.0 vs 2.0 vs 3.0 paradigm at Sequoia AI Ascent 2026

Software 3.0 is Karpathy's name for the paradigm where your program is the text in your context window and the LLM is the interpreter that runs it. Software 1.0 was explicit rules. Software 2.0 was training neural networks by arranging datasets and architectures. Software 3.0 collapses both into prompting.

Karpathy at 02:31 — defining Software 1.0, 2.0, and 3.0 on stage at AI Ascent 2026.

The OpenClaw installer is the cleanest illustration. A traditional installer is a .sh script that balloons because it has to branch on every OS, every architecture, every shell. The OpenClaw installer is just a paragraph of text. You copy it. You paste it into your agent. The agent reads your machine and works out the install path itself.

Here's what I'd add from running this in production. The agentic installer pattern isn't free. The agent has to be allowed to read your filesystem, install dependencies, and run scripts. At WebSearchAPI.ai, we ship our internal CLI as a copy-paste prompt for Claude Code and a fallback bash script for CI environments where the agent isn't allowed to write to disk. The agent path is faster for humans. The deterministic path is needed for CI, automated tests, and locked-down customer environments.

MenuGen vs raw Nano Banana prompt diagram from Andrej Karpathy's Sequoia talk — neural network doing more of the work

MenuGen is the more violent example. Karpathy built the app, deployed it on Vercel, wired up image generation. All of it was Software 2.0/2.5 plumbing. Then Nano Banana lets you ask Gemini to overlay food directly into the pixels of the menu photo. One model call. No app.

💡 Field note from James Bennett (Lead Engineer, WebSearchAPI.ai): We rebuilt our snippet extraction pipeline in March 2026 with the same MenuGen mindset. The old version was a chain of Python services (DOM parser, boilerplate detector, summarizer, ranker). The new version hands the rendered page screenshot and the user query to a vision model and asks for a structured JSON snippet. The Python services still exist for low-latency fallback, but 70% of queries now go through the one-shot path. Our p95 snippet quality went up. Our infra cost per query dropped 38%. The lesson is the one Karpathy is pointing at: don't just speed up what existed, ask which parts shouldn't exist at all.
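The one-shot path with a deterministic fallback can be sketched as a small router. This is a minimal illustration, not the WebSearchAPI.ai implementation: `extract_snippet`, `Snippet`, and the two callables are hypothetical names, and the model call is passed in as a plain function so the sketch stays independent of any vendor SDK.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Snippet:
    text: str
    source: str  # "one_shot" or "fallback", useful for tracking the traffic split


def extract_snippet(
    query: str,
    page_image: bytes,
    one_shot: Callable[[str, bytes], Optional[str]],
    fallback: Callable[[str, bytes], str],
) -> Snippet:
    """Try the single model call first; use the deterministic chain on a miss."""
    try:
        result = one_shot(query, page_image)
    except Exception:
        # Assumption: any model-side failure (timeout, refusal, bad JSON)
        # is treated as a miss rather than surfaced to the caller.
        result = None
    if result and result.strip():
        return Snippet(text=result.strip(), source="one_shot")
    return Snippet(text=fallback(query, page_image), source="fallback")
```

Recording `source` on every result is what makes a figure like "70% of queries go through the one-shot path" measurable in the first place.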

How Does the OpenClaw Installer Pattern Work in Practice?

Karpathy explaining how OpenClaw's copy-paste installer replaces a traditional bash script via Software 3.0 paradigm

The OpenClaw installer is the cleanest deployable pattern in the whole talk. It's worth pulling out and naming as its own technique because it's the one most teams can copy this week.

The pattern, step by step:

Write a runbook in plain English. Not a bash script. Not a Makefile target. A paragraph that says "install dependency X, configure Y, set environment variable Z."
Include version bounds and known compatibility issues. Agents handle ambiguity, but they handle constrained ambiguity better. "Python 3.10 or higher" beats "use Python."
State the desired end state, not the steps. Saying "after installation, mytool --version should return a number" lets the agent self-verify.
List known failure modes. "If you see error E, check that file F exists." That becomes the agent's debug guide.
Ship it alongside (not instead of) a deterministic install script for CI, sandboxes, and customer environments where you can't grant agent access.
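The steps above can be sketched as a small data structure that renders a copy-paste runbook. This is an illustrative shape only: the `Runbook` class and its field names are assumptions made for this example, and the values reuse the placeholders from the list (`mytool`, error E, file F) rather than describing any real installer.

```python
from dataclasses import dataclass, field


@dataclass
class Runbook:
    tool: str
    goal: str
    constraints: list[str] = field(default_factory=list)
    end_state: str = ""
    failure_modes: dict[str, str] = field(default_factory=dict)

    def render(self) -> str:
        """Render plain English text meant to be pasted into a coding agent."""
        lines = [f"Install {self.tool}.", "", f"Goal: {self.goal}"]
        if self.constraints:
            lines += ["", "Constraints:"] + [f"- {c}" for c in self.constraints]
        if self.end_state:
            # The end state lets the agent verify its own work.
            lines += ["", f"Done when: {self.end_state}"]
        if self.failure_modes:
            lines += ["", "Known failure modes:"]
            lines += [f"- If you see {err}: {fix}" for err, fix in self.failure_modes.items()]
        return "\n".join(lines)


runbook = Runbook(
    tool="mytool",
    goal="install dependency X, configure Y, set environment variable Z",
    constraints=["Python 3.10 or higher"],
    end_state="`mytool --version` prints a version number",
    failure_modes={"error E": "check that file F exists"},
)
print(runbook.render())
```

Keeping the runbook as structured data means the same source can also drive the deterministic install script, so the two paths are less likely to drift apart.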

The reason this works is the same reason Karpathy's MenuGen example collapses. The agent is already a more general interpreter than the bash shell. It can read your OS, your Python version, your existing packages, and your error messages. It can branch on conditions you didn't anticipate. The bash script can do none of that without ballooning into the 500-line cross-platform installers everyone hates writing.

Anthropic's 2026 Agentic Coding Trends Report describes 2025 as the year agentic AI changed how a large swath of developers write code, with 2026 poised to be the year systemic effects show up across the SDLC. The installer pattern is one of the most concrete places that shift becomes visible. Every dev tool that ships docs is a candidate to ship a runbook instead.

💡 Field note from James Bennett: We published a llms.txt and an agent-native runbook for Claude Code in February 2026. Agent-driven traffic was 7% of our API hits in January. It's 31% in May. The pattern is exactly what Karpathy describes. Once a service is legible to agents, agent traffic finds it. The agents tell each other.

Why Are Frontier Models Still So Jagged?

Karpathy explaining verifiability and jagged skills in frontier LLMs at Sequoia AI Ascent 2026

A jagged model is one that peaks in some domains and stagnates in others, with no smooth interpolation between them. Karpathy's answer is that jaggedness is downstream of two things: what's verifiable enough to put into reinforcement learning environments, and what the lab happened to care about.

Code and math are where frontier models fly because both are verifiable. You can grade the output programmatically, generate millions of training examples, and pour RL at the problem. Domains without a verifier (taste, writing quality, system design judgment) barely move between releases. That's the verifiability axis.

The second axis is more uncomfortable. Karpathy uses the chess example. GPT-3.5 to GPT-4 saw a huge jump in chess strength, and people read that as a generic capability lift. The real story is that someone at OpenAI added a large chess corpus to the pre-training mix. Capability rode the data.

The car wash example is the most useful one to keep in your pocket. Opus 4.7 will refactor 100,000 lines of code. It will also tell you to walk 50 meters to your car wash because 50 meters is "so close." The model doesn't notice that the implicit task is to wash a car, which requires the car. It's fluent in code-shaped circuits and out-of-distribution on common-sense physical reasoning. Pretending the model is uniformly competent is what gets you in trouble.

For founders, Karpathy's read is the opposite of bearish. Verifiability is a moat. If your domain is verifiable and the labs haven't built that RL environment yet, you can build it. You can grind your own fine-tunes, ship the result, and outperform a generic frontier model on that vertical.

Domain	Verifiable?	In current RL mix?	Founder opportunity
Code generation	Yes — compiler, tests	Yes — heavy	Crowded; competing with labs
Math / theorem proving	Yes — formal proofs	Yes	Crowded
Visual design / aesthetics	Hard — LLM-judge proxies	No reward signal	Open, but expensive to verify
Customer support resolution	Yes — resolution rate, CSAT	Mostly no	Open — workflow + RL
Sales call quality	Partial — LLM judges + close rate	No	Open
Citation correctness (RAG)	Yes — source matching	Mostly no	Open and under-served
Long-horizon software engineering	Partial — smoke tests, evals	Improving	Mixed — depends on niche

💡 Field note from James Bennett: This table is roughly the framework we used to pick our enterprise verticals at WebSearchAPI.ai. We left "general code generation" alone. Anthropic and OpenAI have 1000× more RL compute than we do. We invested heavily in citation-correctness verification (does the model's claim actually match the source it cited?) because that's verifiable, the labs aren't focused on it, and our customers feel the cost of hallucinations directly. That single decision is where our retention numbers come from. Karpathy's "be in a verifiable domain the labs aren't already prioritizing" advice is the most honest version of this I've heard. For a worked example, see our breakdown of Claude Managed Agents. Anthropic's own product roadmap is pointing at the same verifiability lever.
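To make "verifiable" concrete, here is a deliberately crude citation check: the share of a claim's content words that also appear in the cited source. It is a lexical proxy chosen for illustration, not the verification method described in the field note, and the function names, stopword list, and 0.8 threshold are all assumptions. A production verifier would need to handle paraphrase and negation.

```python
import re

# Small illustrative stopword list; a real system would use a fuller one.
_STOPWORDS = {"a", "an", "the", "of", "in", "on", "to", "and", "is", "are", "was", "that"}


def _content_tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens with stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in _STOPWORDS}


def citation_support(claim: str, source: str) -> float:
    """Fraction of the claim's content tokens found in the source (0.0 to 1.0)."""
    claim_tokens = _content_tokens(claim)
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & _content_tokens(source)) / len(claim_tokens)


def is_supported(claim: str, source: str, threshold: float = 0.8) -> bool:
    """Binary reward signal: does the source cover enough of the claim?"""
    return citation_support(claim, source) >= threshold
```

Even a check this simple yields a score that can be computed automatically over many examples, which is the property that makes a domain usable as an eval or reward signal.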

What's the Difference Between Vibe Coding and Agentic Engineering?

Karpathy contrasting vibe coding (floor) versus agentic engineering (ceiling) on stage at Sequoia AI Ascent 2026

The shorter answer lives in the comparison table further up. The longer answer is about the upper bound: how big the gap actually is between a generalist using a chat interface and a professional using a fully wired agentic workflow.

Karpathy's framing: vibe coding raises the floor of what any individual can build. Anyone can ship a side project. Agentic engineering preserves the quality ceiling of professional software while using agents to go faster. You're still accountable for security, regressions, and maintainability. But you coordinate multiple agents, write detailed specs, and design evaluation harnesses instead of writing every line yourself.

The claim about the upper bound is striking. Karpathy says the old "10× engineer" mythology massively undershoots what the best agentic engineers are now producing. He won't give a number. The framing is that 10× is no longer the ceiling. It's a starting point.

Karpathy is brutal about how most hiring still tests for the wrong thing. LeetCode puzzles measure the old paradigm. He'd hand a candidate a brief like "build a Twitter clone for agents, make it secure." Let them ship to production. Then point ten Codex instances at it and try to break it. The candidate's job is to ship something the agents can't break. That's a workload that actually exists in 2026.

💡 Field note from James Bennett: We rewrote our engineering interview in Q1 2026 around exactly this. The old loop was three coding screens and a system-design round. The new loop is one async take-home where the candidate has to ship a small RAG service end-to-end with Claude Code and write the evals for it, then a live "agent rodeo" where I point three Claude Code instances at their service with prompt-injection payloads and they have to defend. The hires we've made under this loop have all shipped to production in week one. The hires we used to make under the old loop sometimes took two months to ramp on prompt design alone.

What Tactical Agentic Engineering Patterns Should Teams Adopt in 2026?

Karpathy giving founder advice at AI Ascent 2026 — verifiable domains the labs have not yet RL-trained

The talk is mostly philosophical. The tactical patterns it implies (what an actual engineering team should change about its workflow on Monday) are scattered throughout. Pulled together, they look like this.

1. Write specs the agent can verify against. Karpathy says he's "in charge of the spec" and doesn't even like plan mode in agentic tools. He wants the docs themselves to be the spec, then have the agent write to them. In practice, that means a SPEC.md in every repo, structured as: goals, constraints, non-goals, success criteria. The agent then opens code changes referencing the spec line by line. This is also the pattern Garry Tan and Diana Hu describe when they talk about agents.md and Skillify.

2. Treat the LLM as an interpreter, not a coder. Every place you'd reach for a bash script, a Makefile, or a one-off CLI, ask whether the equivalent paragraph of text in a runbook would do the job. If yes, ship the runbook. If no (usually because reliability or latency matters), keep the script. The hybrid is the cheat code: deterministic code for the parts the LLM is bad at, prompts for the parts it's good at.

3. Stay in the loop on judgment calls, even when output looks fine. Karpathy's MenuGen Stripe-vs-Google-email bug is the canonical example. The agent will silently pick a fragile cross-correlation when you weren't watching. Add a checklist to PRs that asks: "Is there a unique identifier I should be designing around that the agent might have missed?"

4. Build your own evals, even crude ones. LangChain's State of Agent Engineering 2026 report found 89% of teams running observability for their agents but only 52% with evals. The 37-point gap is exactly the failure mode Karpathy warns about. You can see the agent's output without knowing whether it's any good. Crude LLM-as-judge evals at 5–10 samples per release catch most regressions before they reach users.

5. Pick a verifiable, under-served domain if you're a founder. Karpathy hints at unnamed verifiable verticals the labs haven't built RL environments for. The framework is in the table above. If you can grade the output, you can grind a fine-tune that beats a generic frontier model on your slice. Customer support resolution, citation correctness, sales call quality, and document extraction are all viable.

6. Run multiple agents in parallel and let them argue. Peter Steinberger's OpenClaw talk popularized the "5–6 agents at once" workflow. Karpathy nods at it implicitly by talking about "10 Codexes" attacking a deployment. The pattern: one agent drafts, one critiques, one runs the tests. Cross-model panels catch failures any single model misses.

7. Treat your context window as the wiring diagram. Because the LLM is the interpreter, what's in the context window IS your program. Every wasted token of "be polite" or "you are an expert" is one fewer token of actual program. Trim ruthlessly.

8. Publish your runbooks for agents. Once you've internalized that agents are the consumers of docs, publishing llms.txt, agents.md, and MCP servers becomes the obvious unlock. Agent traffic finds agent-native services. We saw 4× growth in agent-driven API hits over three months after publishing ours.
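Pattern 1 names four sections for a SPEC.md: goals, constraints, non-goals, success criteria. A minimal sketch of a check that a spec actually contains them, suitable for a pre-commit hook or CI step, might look like this. The function name is hypothetical, and the assumption that each section is a Markdown heading is a convention chosen for the example, not something the talk prescribes.

```python
import re

REQUIRED_SECTIONS = ("goals", "constraints", "non-goals", "success criteria")


def missing_spec_sections(spec_markdown: str) -> list[str]:
    """Return the required section names that have no matching heading."""
    headings = {
        match.group(1).strip().lower()
        for match in re.finditer(
            r"^#{1,6}[ \t]+(.+?)[ \t]*$", spec_markdown, flags=re.MULTILINE
        )
    }
    return [name for name in REQUIRED_SECTIONS if name not in headings]


spec = """# Freshness ranker rewrite

## Goals
Rank newer pages higher when relevance is tied.

## Constraints
No new runtime dependencies.

## Success Criteria
All existing ranking tests pass.
"""
print(missing_spec_sections(spec))
```

Here the example spec omits the non-goals section, so the check reports it. Failing the build on a non-empty result keeps the spec complete before an agent starts writing against it.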

According to Speak's 2026 agentic engineering report, December 2025 was the month it became clear agentic AI coding had crossed a capability threshold where the era of software engineers writing code entirely by hand was ending. The eight patterns above are the practices we've found close the gap between "models work" and "products ship reliably."
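The crude evals from pattern 4 need very little machinery. The sketch below assumes a hypothetical `judge` callable that wraps whatever LLM-as-judge call a team uses and returns pass or fail. `EvalCase`, `pass_rate`, `is_regression`, and the 0.1 tolerance are illustrative names and values, not taken from the talk or from LangChain's report.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    rubric: str  # what a good answer must contain or do


def pass_rate(
    cases: Sequence[EvalCase],
    generate: Callable[[str], str],
    judge: Callable[[str, str, str], bool],
) -> float:
    """Run every case through the system and return the fraction the judge passes."""
    if not cases:
        raise ValueError("need at least one eval case")
    passed = sum(1 for c in cases if judge(c.prompt, generate(c.prompt), c.rubric))
    return passed / len(cases)


def is_regression(candidate: float, baseline: float, tolerance: float = 0.1) -> bool:
    """Flag a release whose pass rate falls more than `tolerance` below baseline."""
    return candidate < baseline - tolerance
```

With only 5 to 10 samples the pass rate is noisy, which is why the comparison uses a tolerance rather than requiring the candidate to match the baseline exactly.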

Why Does Karpathy Call LLMs "Ghosts" Instead of Animals?

The animals-vs-ghosts framing is Karpathy's attempt to get a better mental model of what LLMs actually are. Animals come from evolution. They have intrinsic motivation, curiosity, fun, empowerment, fear. Ghosts are summoned. They're statistical simulation circuits sitting on top of pre-training, with RL appendages bolted on for specific tasks.

The practical consequence: you can't yell at a ghost and expect it to try harder. You can't appeal to its sense of professional pride. You also can't assume that whatever common-sense reasoning evolution wired into a six-year-old is wired into the model. There's no embodied substrate underneath.

What this changes about agentic engineering: every time you catch yourself anthropomorphizing the model's "intent," stop and check whether the data distribution it was trained on actually contains the thing you assume it knows. Karpathy's MicroGPT example is the cleanest one. He tried to prompt an LLM to simplify his training code. It refused to get simpler because aesthetic minimalism isn't in the RL mix. He could feel himself "outside the RL circuits." Pulling teeth.

That maps directly onto something we hit at WebSearchAPI.ai when we tried to use a frontier model to write our own internal style guide for snippet formatting. The model produces fluent prose. It cannot produce short prose. It cannot produce opinionated prose. There's no RL reward for either, and the distribution it was trained on rewards verbose, hedged writing. We ended up writing the style guide ourselves and using the model only to flag violations.

The ghost framing also explains why your agent will sometimes confidently solve a problem and minutes later confidently solve a different problem in a way that's completely wrong. There's no "self" maintaining consistency across turns. There are circuits firing in response to your context. Treat your context as the wiring diagram.

What Will Be Obvious by 2027 That Is Still Mostly Unbuilt Today?

Karpathy answering the what's obvious in 2026 question at Sequoia AI Ascent — neural net as host process

Stephanie Zhan asks Karpathy what will look completely obvious in hindsight that is still mostly unbuilt today: the 2026 equivalent of building websites in the '90s, mobile apps in the 2010s, or SaaS in the cloud era. His answer is one of the most interesting parts of the conversation.

Karpathy's extrapolation: in the 1950s and 60s, computer scientists were genuinely uncertain whether computers would look like calculators or neural networks. They went down the calculator path, and we built classical computing for 70 years. The 2027–2030 unlock might be the inverse, where the neural net becomes the host process and the CPU becomes the coprocessor. Tool use is a deterministic appendage for narrow tasks. Most of the lifting is networked neural nets doing fuzzy computation natively.

You can already see the early shape of this in MenuGen. The middle tier (apps with deterministic logic between a vision model and a generation model) is the part that compresses away. By the time we're sitting here in 2027, products will look weirder than that. Karpathy describes a device that takes raw video or audio in, runs a diffusion model to render a UI tailored to the moment, and outputs whatever action the user implicitly wanted.

Three concrete predictions follow from this view:

Agent-native services compound. Products with MCP servers, llms.txt, and machine-readable runbooks will pull agent traffic that human-only services miss. This is already happening. See our field note on 7% → 31% agent-driven traffic growth above.
Verifiable but under-RL'd domains compound for founders. The labs prioritize what they can grade and what they care about. Everything outside that intersection is open. Citation correctness, customer support resolution, and document extraction are three of the obvious ones. Karpathy hints there are others he won't name on stage.
The deterministic-code-as-coprocessor pattern compounds. Every place where the latent space is unreliable, deterministic code takes over. Every place where deterministic code is rigid, the latent space takes over. Hybrid systems beat pure systems for at least the next two release cycles.

The $211 billion in AI venture capital in 2025 (half of all global VC funding, per Metavert's State of Agents 2026 report) is chasing some version of this trajectory. The actual winners are the ones who pick the right verifiable domain, ship the agent-native runbook for it, and don't get caught maintaining the middle-tier Python service that the next model release deletes.

What Should Agent-Native Infrastructure Look Like by 2027?

Karpathy on agents everywhere and the future of agent-native infrastructure at AI Ascent 2026

Karpathy's pet peeve, said with real frustration: every framework still ships docs written for humans. He doesn't want to read docs. He wants the text he can copy-paste into his agent so the agent does the work.

The MenuGen blog post is his telling example. He says the hard part wasn't writing the code. Claude Code handled that. The hard part was deploying it: configuring Vercel, wiring up DNS, stringing together third-party services with UIs designed for humans. He'd like to give one prompt (build MenuGen, deploy it) and have agents handle everything, including the DNS records.

Karpathy frames services as sensors and actuators for agents, and that framing is the right one. The agent-native version of a SaaS product isn't a new UI. It's an MCP server, an API the agent can hit, and docs written as runbooks the agent can follow. We're still in the awkward middle phase where most products bolt an agent on top of a UI built for humans, and the agent has to click buttons instead of calling endpoints.

The teams already building this way are pulling away. Karpathy's vision of "my agent talks to your agent" only works if both ends speak the same dialect: structured endpoints, machine-readable docs, clear error messages designed to be read by a model and acted on.
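An error message "designed to be read by a model and acted on" mostly means a stable code plus an explicit next step. The sketch below shows one possible shape. The `AgentError` class, its field names, and the `rate_limited` helper are assumptions for illustration and do not correspond to any particular API or to the MCP specification.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class AgentError:
    code: str         # stable identifier an agent can branch on
    message: str      # what went wrong, in one sentence
    remediation: str  # what the caller should do next
    retryable: bool   # whether repeating the same request can succeed

    def to_json(self) -> str:
        """Serialize under a top-level "error" key with deterministic key order."""
        return json.dumps({"error": asdict(self)}, sort_keys=True)


def rate_limited(retry_after_seconds: int) -> AgentError:
    """Build the error returned when a caller exceeds its request quota."""
    return AgentError(
        code="rate_limited",
        message="Too many requests for this API key.",
        remediation=f"Wait {retry_after_seconds} seconds, then retry the same request.",
        retryable=True,
    )


print(rate_limited(30).to_json())
```

The design choice is to put the instruction in the payload itself, so the agent does not have to find and read a separate documentation page to recover.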

For a working example of this pattern shipping today, see our breakdown of Anthropic's Claude web search API. Every endpoint is documented twice, once for humans and once as an agent runbook. That's the cost of admission for agent-native infrastructure in 2026.

Frequently Asked Questions

What is agentic engineering according to Andrej Karpathy?

Agentic engineering, in Karpathy's 2026 framing, is the disciplined practice of coordinating one or more AI coding agents to ship production-grade software while a human stays accountable for the spec, the quality bar, security, and architectural taste. It sits on top of vibe coding. Vibe coding raises the floor of what any individual can build, while agentic engineering preserves the ceiling of professional software. Karpathy says the upper bound for a strong agentic engineer is far past the old "10× engineer" mythology.

How is agentic engineering different from vibe coding?

Vibe coding is the floor-raising practice where you prompt, accept, run, and ship something that didn't exist before. Accountability is implicit and the output is usually a side project. Agentic engineering is the ceiling-preserving practice where multiple agents are coordinated against a spec, you write evals and observability, and a human is still on the hook for security, regressions, and maintainability. LangChain's State of Agent Engineering 2026 found that 89% of teams have observability for their agents but only 52% have evals. The gap is exactly where agentic engineering becomes a discipline.

What did Andrej Karpathy mean by "Software 3.0"?

Karpathy's Software 3.0 framing puts prompting itself in the role that code used to play. Software 1.0 is explicit rules written by humans. Software 2.0 is learned weights produced by training a neural network on a dataset. Software 3.0 is what happens when you train a sufficiently large LLM on the open internet. The model becomes a programmable computer whose programming language is whatever you put in its context window. The OpenClaw installer is his canonical example: a paragraph of text replaces a bash script.

Why does Andrej Karpathy feel "behind" as a programmer in 2026?

Karpathy says December 2025 was a stark transition point in his own workflow. With more time over the holidays, he tested the latest agentic coding tools, asked for bigger chunks of code, and stopped correcting the output. Realizing that the agent's output reliably exceeded what he'd write by hand made him feel "behind," because so much that he was previously bottlenecked on is now within reach to build. The feeling is a mix of unsettling and exhilarating, in his own words.

What is "jagged intelligence" in LLMs, and why does it matter for product builders?

Jagged intelligence is Karpathy's term for the uneven capability profile of frontier models. They peak in domains the labs train with reinforcement learning (code, math) and stagnate in domains that lack a verifier or that the labs aren't focused on. His car wash example is canonical. Opus 4.7 will refactor a 100,000-line codebase but tell you to walk 50 meters to a car wash because it can't reason about the implicit goal of bringing the car. For product builders, this means you need to know which "circuits" you're in for your specific application and fine-tune your own when you're out of the lab's distribution.

Why does Karpathy say LLMs are "ghosts" instead of animals?

In his "animals versus ghosts" essay, Karpathy argues that LLMs are not animal intelligences shaped by evolution with intrinsic motivation, curiosity, fear, or fun. They're statistical simulation circuits assembled from pre-training (the substrate) with reinforcement learning bolted on as "appendages." The practical consequence: you can't yell at an LLM to make it work harder, and you can't assume it has the embodied common sense an animal would have. Treat the context window as the wiring diagram, not the conversation.

How should founders pick a problem if the labs are already taking the obvious verticals?

Karpathy's advice is to look for domains that are verifiable (meaning you can produce reward signals from RL environments) but that the labs aren't prioritizing. If you have a verifiable setting and a way to generate diverse training data, you can pull the lever on fine-tuning and outperform a generic frontier model on that vertical. He hints he sees specific examples but won't say them out loud on stage. The framework is to combine verifiability with under-served data distributions. Customer support resolution, citation correctness, and document extraction are three open verticals where this is already working.
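"Verifiable" here means a program can score an output without a human in the loop. As a rough illustration for the document extraction case, the sketch below scores a predicted record against a labeled one and returns a number between 0 and 1 that could serve as a reward signal. The field names and the `extraction_reward` function are hypothetical, and this is not a method described in the talk.

```python
def normalize(value: object) -> str:
    """Collapse whitespace and ignore case so cosmetic differences don't count."""
    return " ".join(str(value).split()).casefold()


def extraction_reward(predicted: dict, expected: dict) -> float:
    """Return the fraction of expected fields that the prediction got right."""
    if not expected:
        return 0.0
    hits = sum(
        1
        for field, truth in expected.items()
        if field in predicted and normalize(predicted[field]) == normalize(truth)
    )
    return hits / len(expected)


# Hypothetical labeled example: the model missed the currency field.
expected = {"invoice_id": "INV-7", "total": "120.50", "currency": "USD"}
predicted = {"invoice_id": "inv-7 ", "total": "120.50"}

if __name__ == "__main__":
    print(extraction_reward(predicted, expected))
```

A scorer like this is cheap to run over thousands of documents, which is what makes a domain usable for fine-tuning in the first place.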

What human skills become more valuable as agents get more capable?

Taste, judgment, and the ability to write detailed specs. Karpathy says he no longer remembers PyTorch tensor API details because the intern handles that. But he's still in charge of the design: whether email addresses can serve as user IDs, whether an abstraction is sane, whether the system is asking for the right thing. His final line is the strongest one: you can outsource your thinking but you cannot outsource your understanding. Direction is the bottleneck that doesn't compress.

What's the most concrete thing a senior engineer should change about their workflow after watching this talk?

Start writing your install instructions, internal runbooks, and API docs as text designed to be copy-pasted into an agent, not read by a human. Karpathy's biggest pet peeve in the talk is that every framework still ships human-facing docs. The minute you publish your service's runbook as something an agent can execute, the benefit compounds: agent traffic finds agent-native services, and the lift on developer experience is immediate. The other concrete change: build crude evals before you ship. LangChain's 2026 data shows 89% observability adoption but only 52% eval adoption, and that gap is where regressions hide.
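A crude eval can be as small as a handful of prompts with pass/fail checks that run before every release. The sketch below shows one minimal way to structure that in Python. `EvalCase`, `run_evals`, the sample cases, and the canned `run_agent` stub are all hypothetical; in practice `run_agent` would call your actual agent.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    name: str
    prompt: str
    check: Callable[[str], bool]  # True when the agent's output is acceptable


def run_agent(prompt: str) -> str:
    """Hypothetical stand-in for a real agent call; swap in your own client."""
    canned = {
        "What HTTP status code means Not Found?": "404",
        "Reply with the single word: pong": "pong",
    }
    return canned.get(prompt, "")


def run_evals(cases: list[EvalCase], agent: Callable[[str], str]) -> dict:
    """Run every case and report the pass rate plus the names of failing cases."""
    failures = []
    for case in cases:
        try:
            ok = case.check(agent(case.prompt))
        except Exception:
            ok = False  # a crash in the agent or the check counts as a failure
        if not ok:
            failures.append(case.name)
    passed = len(cases) - len(failures)
    return {
        "passed": passed,
        "total": len(cases),
        "pass_rate": passed / len(cases) if cases else 0.0,
        "failures": failures,
    }


CASES = [
    EvalCase("status-code", "What HTTP status code means Not Found?", lambda out: "404" in out),
    EvalCase("ping", "Reply with the single word: pong", lambda out: out.strip() == "pong"),
    EvalCase("unknown-prompt", "Summarize the runbook in one line.", lambda out: len(out) > 0),
]

if __name__ == "__main__":
    print(run_evals(CASES, run_agent))
```

Failing the build when `pass_rate` drops below a threshold you choose is usually enough to catch the regressions that observability alone only reveals after release.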

How big is the agentic AI market and how fast is it growing?

According to Metavert's State of Agents 2026 report, AI venture capital hit $211 billion in 2025, roughly half of all global VC funding. Various industry forecasts put the global agentic AI market at roughly $28 billion in 2026 and project growth from that base. Enterprise adoption is the leading edge: 57% of LangChain survey respondents now have agents in production, and 19% of organizations have invested significantly in agentic AI, with another 42% making cautious pilot-stage investments.

Key Takeaways

December 2025 is the inflection point Karpathy points to as the moment agentic coding "actually started to work." If you tested these tools earlier in 2025 and walked away unimpressed, your priors are stale.
Software 3.0 means your context window is the program and the LLM is the interpreter. The OpenClaw installer (text replaces bash script) and MenuGen → Nano Banana (one prompt replaces a whole Vercel app) are the two canonical examples.
Frontier models are jagged because labs only train RL environments for what they care about. Verifiability plus an under-served data distribution is the founder opportunity.
Vibe coding raises the floor. Agentic engineering preserves the ceiling. The new "10× engineer" ceiling is reportedly far higher than the old one.
LLMs are ghosts, not animals. There's no intrinsic motivation, no embodied common sense. Just statistical simulation circuits with RL appendages. Stop anthropomorphizing.
Hiring loops built for the old paradigm (LeetCode puzzles) miss the actual workload. Give candidates a large project to build and secure, then attack it with agents.
The eight tactical patterns to adopt this quarter: spec-as-docs, LLM-as-interpreter, judgment-call checklists, crude evals before observability, picking verifiable verticals, multi-agent panels, context as wiring diagram, and publishing agent-native runbooks.
57% of teams have agents in production. 89% have observability for those agents, but only 52% run evals. That 37-point gap between observability and evals is where regressions hide. Close it first.
Agent-native infrastructure (clear endpoints, machine-readable docs, MCP servers) compounds. Once a service is legible to agents, agent traffic finds it.
The skill that survives is taste plus understanding. You can outsource your thinking. You can't outsource your understanding of why the work matters.

This post is based on Andrej Karpathy: From Vibe Coding to Agentic Engineering by Sequoia Capital. Interview with Andrej Karpathy (Founder, Eureka Labs) by Stephanie Zhan (Partner, Sequoia Capital). Filmed at AI Ascent 2026. Duration: 29:49. Statistics sourced from LangChain State of Agent Engineering 2026, Anthropic 2026 Agentic Coding Trends Report, IBM agentic engineering primer, Metavert State of Agents 2026, and Speak's December 2025 agentic engineering report.