Guest contributor: Ethan Wang is a founding member of the Google DeepMind Spark agent and of the Google DeepMind Mariner agent.

Most agent projects that miss the quality bar do not miss it because the model is too weak. They miss it because the tools are too sophisticated, the system instructions are too crowded, or the team tries to evaluate quality before the plumbing actually works. This note distills the principles and the build sequence. The insights come from developing several state-of-the-art agents, including an autonomous music-video director built in just one month through approximately eighty manual iterations.

1. The Mental Model

A capable agent is one that can solve problems we did not script. That power does not come from a long instruction list; it comes from the agent's freedom to call the tools we give it, in whatever order it decides, as many times as it needs. The clearest mental model is linguistic:

Tools are words. The agent writes the sentence. Capability grows by adding more words — never by scripting more sentences.

Figure 1 — Tools are the agent’s vocabulary; the workflow is a sentence the agent composes at runtime.

Once this frame clicks, almost every design decision gets easier. When you find yourself reaching for a switch-statement inside a tool, or a conditional branch inside the system prompt, you are scripting a sentence. Stop, and add a word instead.

2. Two Pillars of Tool Design

If tools are words, then the properties that make a language usable also describe what makes an agent's toolset usable. The first pillar concerns the words themselves: each word means one thing and is defined precisely. The second concerns the surrounding grammar (the system instructions), which should stay out of the way.

Pillar 1 — Single Responsibility

Every tool should be one verb. analyze_song. plan_story. generate_shot. critique_shot. cut_jump. assemble_section. The moment a tool sprouts a mode parameter, you have stapled several tools together and now the agent has to guess which mode is valid with which options. That guessing is where bad runs come from.

Figure 2 — A mega-tool with modes forces the agent to guess. A row of small, single-purpose tools lets it choose.
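As a minimal sketch of the difference, the snippet below contrasts a mode-driven mega-tool with two single-verb tools. The registry, the decorator, the lint helper, and all argument names are illustrative assumptions, not the project's actual code; only the tool names cut_jump and cut_match come from the text.

```python
import inspect
from typing import Callable, Dict

# Hypothetical registry: one entry per verb.
TOOLS: Dict[str, Callable[..., dict]] = {}


def tool(fn: Callable[..., dict]) -> Callable[..., dict]:
    """Register a function as a tool under its own name."""
    TOOLS[fn.__name__] = fn
    return fn


def has_mode_parameter(fn: Callable[..., dict]) -> bool:
    """Lint check: a 'mode' argument suggests several tools stapled together."""
    return "mode" in inspect.signature(fn).parameters


# Anti-pattern, deliberately not registered: the agent has to guess which
# arguments are valid for which mode.
def edit_video(mode: str, shot_a: str, shot_b: str = "", at_seconds: float = 0.0) -> dict:
    return {"op": mode, "shot_a": shot_a, "shot_b": shot_b, "at_seconds": at_seconds}


@tool
def cut_jump(shot_id: str, at_seconds: float) -> dict:
    """Jump cut inside one shot at the given timestamp."""
    return {"op": "cut_jump", "shot_id": shot_id, "at_seconds": at_seconds}


@tool
def cut_match(from_shot: str, to_shot: str) -> dict:
    """Match cut from one shot to another."""
    return {"op": "cut_match", "from_shot": from_shot, "to_shot": to_shot}
```

Each registered tool takes only the arguments its own verb needs, so there is no combination of mode and options for the agent to get wrong. A check like has_mode_parameter could run in CI over every registered tool.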

An agent reads tool descriptions the way a careful translator reads a dictionary. Each tool needs a one-line summary of what it does, an explicit statement of when to use it (and when not to), the exact input and output shape, and the failure modes it can return. Two tools that sound similar will get used interchangeably; if you cannot describe each in one sentence that distinguishes it from its neighbours, the design is not done.
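One way to make that description checklist concrete is a small record per tool plus a lint pass. This is a sketch under assumptions: the ToolSpec fields and the lint_specs helper are hypothetical names chosen to mirror the paragraph above, and the exact-match duplicate check is only a crude stand-in for "sounds similar".

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ToolSpec:
    name: str
    summary: str                    # one line: what the tool does
    use_when: str                   # when to reach for it
    avoid_when: str                 # when not to
    input_shape: str                # exact input shape, in words or schema form
    output_shape: str               # exact output shape
    failure_modes: Tuple[str, ...]  # errors the tool can return


def lint_specs(specs: List[ToolSpec]) -> List[str]:
    """Return human-readable problems; an empty list means the specs pass."""
    problems: List[str] = []
    seen: Dict[str, str] = {}
    for spec in specs:
        summary = spec.summary.strip()
        if not summary or "\n" in summary:
            problems.append(f"{spec.name}: summary must be exactly one line")
        if not spec.failure_modes:
            problems.append(f"{spec.name}: no failure modes listed")
        key = summary.lower()
        if key in seen:
            problems.append(f"{spec.name}: summary is identical to {seen[key]}")
        else:
            seen[key] = spec.name
    return problems
```

A spec set that passes this lint is not guaranteed to be well designed, but one that fails it is clearly unfinished.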

Pillar 2 — Simple, Stable System Instructions

The system prompt should hold only what is true on every single turn: the agent's role, the path it writes outputs to, hard ground rules. Doctrine that is conditional ("when the song intensifies, shorten the shots") belongs in the tool that owns that decision, not in the global prompt — once it lives at the top, it pollutes every unrelated decision the agent makes. A short, clean prompt plus a rich toolset beats a long prompt with a thin toolset, every time.

Figure 3 — Where the rules live. Keep each layer doing only its job.
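The split between layers can be sketched as follows. Everything here is an illustrative assumption: build_system_prompt, plan_pacing, the energy scale, and the shot-length formula are hypothetical. The point is only where the conditional doctrine lives: in the tool's own documentation, not in the global prompt.

```python
from typing import List


def build_system_prompt(role: str, output_dir: str, ground_rules: List[str]) -> str:
    """Global layer: only facts that hold on every single turn."""
    lines = [f"Role: {role}", f"Write all outputs under: {output_dir}", "Ground rules:"]
    lines += [f"- {rule}" for rule in ground_rules]
    return "\n".join(lines)


def plan_pacing(section_energy: float) -> dict:
    """Plan shot length for one song section.

    Doctrine (applies only to this tool): when the song intensifies,
    shorten the shots.
    """
    # Hypothetical mapping: energy is clamped to [0, 1]; shot length
    # shrinks linearly from 4 seconds to 1 second as energy rises.
    energy = min(1.0, max(0.0, section_energy))
    return {"shot_seconds": 4.0 - 3.0 * energy}
```

With this layout the pacing rule is read only when the agent is deciding about pacing, and the system prompt stays the same length no matter how many such rules accumulate.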

3. The Build Sequence: Infrastructure First, Then Hill-Climb

The single most common reason eval programs fail is that they start too early. If a tool sometimes silently returns the wrong field, or two tools collide on the same scratch path, no amount of clever scoring will tell you anything useful — you'll just measure your bugs.

Figure 4 — Two phases, in this order. You cannot hill-climb on a broken foundation.

Phase 1 — Get the Infrastructure Right

In this phase, the bar is correctness, not quality. Before you measure anything, confirm that:

  • Every tool does exactly what its doc says, with the input and output shapes documented.

  • Errors are structured and informative — {error, hint} rather than a stack trace — so the agent can recover.

  • Side-effects are explicit (the tool that says it writes a file is the only tool that writes a file).

  • Outputs are isolated per run — no two runs can ever overwrite each other.

  • There are unit tests at the tool boundary; if you cannot test a tool in isolation, it is doing too many things.

This phase is unglamorous and easy to skip. Skipping it produces evals you cannot trust and a hill-climb that wastes weeks.
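Several items on that list (structured errors, explicit side-effects, per-run isolation) can be illustrated with one small tool. This is a sketch, not the project's implementation: new_run_dir, write_shot_list, the file name, and the error code are hypothetical.

```python
import uuid
from pathlib import Path
from typing import List


def new_run_dir(root: Path) -> Path:
    """Create a fresh directory for one run; a random suffix keeps runs apart."""
    run_dir = root / f"run-{uuid.uuid4().hex}"
    # exist_ok=False: fail loudly rather than share a directory with another run.
    run_dir.mkdir(parents=True, exist_ok=False)
    return run_dir


def write_shot_list(run_dir: Path, shots: List[str]) -> dict:
    """Hypothetical tool: the only tool that writes shots.txt, and only inside run_dir."""
    if not shots:
        # Structured error the agent can act on, instead of a stack trace.
        return {
            "error": "empty_shot_list",
            "hint": "Call generate_shot at least once before writing the shot list.",
        }
    out = run_dir / "shots.txt"
    out.write_text("\n".join(shots), encoding="utf-8")
    return {"ok": True, "path": str(out)}
```

Because the tool takes its run directory as an argument and touches nothing else, it can be unit-tested in isolation against a temporary directory.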

Phase 2 — Eval and Hill-Climb

Only now does it pay to turn quality into a number. Pick a target the agent should match — a reference output, a rubric, a human score — and have a critic produce a single graded comparison per run. The number doesn't have to be perfect; it has to be consistent enough that you can tell whether a change made things better or worse.
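A minimal sketch of "one number per run, consistent enough to compare" might look like this. The rubric dimensions, the weights, the 0 to 10 scale, and the min_delta threshold are all assumptions for illustration; the critic that produces the per-dimension scores is out of scope here.

```python
from statistics import mean
from typing import Dict, List


def run_score(rubric_scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Collapse a critic's per-dimension rubric scores into one weighted number."""
    total_weight = sum(weights[dim] for dim in rubric_scores)
    return sum(rubric_scores[dim] * weights[dim] for dim in rubric_scores) / total_weight


def is_improvement(baseline_runs: List[float], candidate_runs: List[float],
                   min_delta: float = 0.5) -> bool:
    """Accept a change only if the mean score rises by at least min_delta.

    Averaging over several runs and requiring a margin is a crude guard
    against noise in the critic; it is not a statistical test.
    """
    return mean(candidate_runs) - mean(baseline_runs) >= min_delta
```

For example, run_score({"story": 8, "cuts": 6}, {"story": 3, "cuts": 1}) weights story three times as heavily as cuts and returns 7.5.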

4. Hill-Climbing in Practice

Figure 5 — The hill-climbing loop. Run it manually first; once it works, automate it.

Each turn of this loop has only one job: turn one specific observed defect into one specific addition to the agent. The addition takes one of three forms:

  • A new/better tool — when the agent is missing a word it needs. ("It cannot vary the cut rate, so it doesn't." Give it cut_jump, cut_match, assemble_section.)

  • A new critic or soft gate — when a rule must always hold. The critic refuses or rewrites with a structured hint the agent can act on.

  • A doctrine line — added to the specific tool's documentation where it applies, not bolted onto the global prompt.
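The second form, a critic or soft gate, can be sketched as a pure function over the agent's proposed plan. The plan format, the op names, and the error code below are hypothetical; the rule itself ("hard cuts only") is borrowed from the lessons this article describes.

```python
from typing import List


def hard_cut_gate(edit_plan: List[dict]) -> dict:
    """Soft gate: refuse any plan that contains a cross-fade, with an actionable hint."""
    for index, step in enumerate(edit_plan):
        if step.get("op") == "cross_fade":
            return {
                "ok": False,
                "error": "soft_transition_not_allowed",
                "hint": (
                    f"Step {index} uses cross_fade; replace it with a cut tool "
                    "such as cut_jump or cut_match."
                ),
            }
    return {"ok": True}
```

The gate does not fix the plan itself. It returns a structured refusal, and the agent decides which cut to use instead, which keeps the rule mechanical and the choice creative.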

Done patiently, the curve looks like Figure 6. Each labelled step is a real lesson encoded back into the agent: character identity anchored by spec; hard cuts only; the song analysis made the mandatory first step; the planning phase grown to ~70% of the run; and finally, the AI-feel score loop closing the gap unattended.

Figure 6 — Quality vs. iterations on the music-video director. Each annotation is one observed defect, encoded once.

Two notes on this curve. First, the early gains are large and the late gains are small. That is expected: you are running out of obvious defects. Second, the steepest jumps come from infrastructure-class fixes ("ground-truth the song first", "storyboard-first"), not micro-tweaks. When a graph plateaus, the next jump usually requires re-thinking which words the agent has, not nagging it harder in the prompt.

5. Distilling New Tools From Real-World Runs

The hill-climbing loop in §4 is human-driven: a person watches a run, names the defect, encodes a lesson. The next step — and the one that compounds — is letting the agent's own runs surface the next words it needs. The principle is simple, and it falls straight out of the linguistic frame:

When the agent keeps composing the same sub-sentence — that is the missing word.

Figure 7 — Recurring tool-call sub-sequences are evidence of a missing primitive. Distill them into new vocabulary.

How It Works

Log every run's full trajectory — the tool calls the agent made, in order, with arguments and reasoning. Across many real runs you start to see two signals:

  • Repeated sub-sequences. The same three or four tool calls in the same order, on different inputs. That sequence is the agent inventing a primitive that should have existed.

  • Workarounds and shims. The agent is gluing existing tools together with zero-effect parameters (cross_fade(0ms), assemble on a single shot, retry-with-tweaked-prompt). It is faking the verb it wishes you had given it.

Either signal is enough to propose a new tool. The agent itself can do the proposing: feed it its own trajectory log and ask it to name the recurring pattern, write a one-line description, sketch the signature, and list the failure modes. A human (or a second agent) accepts or rejects; accepted tools get added to the vocabulary, and future runs use them directly.
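The first signal, repeated sub-sequences, can be surfaced with a simple n-gram count over logged trajectories. This sketch assumes each trajectory has already been reduced to an ordered list of tool names (arguments and reasoning dropped); recurring_subsequences and its parameters are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def recurring_subsequences(trajectories: List[List[str]], n: int = 3,
                           min_runs: int = 2) -> List[Tuple[Tuple[str, ...], int]]:
    """Find length-n runs of tool calls that appear in at least min_runs trajectories.

    Each trajectory contributes at most once per sub-sequence, so one long
    repetitive run cannot qualify a pattern on its own.
    """
    runs_containing: Counter = Counter()
    for calls in trajectories:
        grams = {tuple(calls[i:i + n]) for i in range(len(calls) - n + 1)}
        runs_containing.update(grams)
    hits = [(gram, count) for gram, count in runs_containing.items() if count >= min_runs]
    # Most widespread first; ties broken alphabetically for stable output.
    return sorted(hits, key=lambda item: (-item[1], item[0]))
```

The output is a list of candidates, not a list of tools to build. Each candidate still goes through the propose, review, accept-or-reject step described above.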

Concrete Examples From the Music-Video Director

Two of the most useful pieces of the music-video agent emerged exactly this way, by watching what the agent kept trying to do with the tools it already had:

  • Scene-cutting primitives. Early on the agent had only generate_shot and assemble. To make a hard cut, it would generate two shots, splice them with a zero-length cross-fade, and hope. The same pattern showed up in run after run. The lesson was not "prompt it harder to use better cuts" — it was that the agent had no word for a cut. Distilling those repeated sub-sequences produced the seven cut primitives: cut_standard, cut_jump, cut_on_action, cut_cross, cut_montage, cut_match, assemble_section. Each one is a single verb the agent now reaches for directly.

  • Camera-movement taxonomy. The agent kept writing long natural-language motion clauses inside its shot prompts — "slow push-in while panning left, then settle." These free-form clauses were inconsistent and hard to critique. The repeated improvisation revealed a missing vocabulary: a named set of camera moves (push-in, dolly, whip-pan, crash-zoom, …) exposed as a tool with advisory hints. Once that vocabulary existed, both planning and critique sharpened.

The Self-Recursive Loop

Notice what has happened: the agent's runs become training data for the next version of the agent's toolset. Better tools produce different runs, which surface different missing words, which become the next tools. The loop runs on itself.

Two things keep this loop healthy. First, treat vocabulary growth as costly — every new tool the agent has to consider is a small tax on every future decision, so the bar to admit one should be "this sub-sentence shows up across many runs, on different inputs, and the distilled version genuinely simplifies them." Second, when a distilled tool ships, retire the workaround patterns that motivated it — leave them in the prompt and the agent will keep falling back to them. The vocabulary should grow, but it should also stay clean.

6. Hard-Won Lessons

Make Changes Surgically, and Protect a Known-Good Baseline

The most painful failure mode is the well-intentioned global edit. You add one new feature — say, an image-anchoring step — and quality on story, cuts, and pacing all silently regress, because that change touched the part of the prompt every other decision flows through. The rule that emerges: when a change degrades existing behavior, revert to the last known-good trunk and redo it surgically. Quality is a thing you protect, not just a thing you add to.
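A per-dimension comparison against the known-good baseline makes silent regressions visible before a change is kept. This is a sketch under assumptions: the dimension names, the score scale, and the helper names are hypothetical, and the actual revert would be done with whatever version control the project uses.

```python
from typing import Dict


def regressions(baseline: Dict[str, float], candidate: Dict[str, float],
                tolerance: float = 0.0) -> Dict[str, float]:
    """Return each dimension that dropped by more than tolerance, with its delta."""
    return {
        dim: candidate[dim] - base
        for dim, base in baseline.items()
        if candidate[dim] < base - tolerance
    }


def accept_change(baseline: Dict[str, float], candidate: Dict[str, float],
                  tolerance: float = 0.0) -> bool:
    """Keep a change only if no existing dimension regressed.

    A gain on one dimension does not buy back a loss on another.
    """
    return not regressions(baseline, candidate, tolerance)
```

Scoring each dimension separately matters here: a single blended number could rise because the new feature works while story, cuts, and pacing quietly get worse.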

Critics Encode Doctrine. Tools Encode Choices

Anything that must always be true belongs in a critic or a soft gate — a check that refuses or rewrites with a structured hint. Anything the agent could reasonably decide either way belongs in a tool. This split keeps the agent free to be creative inside the boundary, while the boundary itself is enforced mechanically.

Preserve the Agent’s Freedom

When a proposal would replace a decision with a lookup table, refuse it. Determinism is good for safety and bad for creativity. Give the agent a tool plus a strong recommendation; let it call the tool — or not — based on the situation it actually sees.

Debug by Reasoning Before Debugging by Coding

When the agent does something wrong, the most productive first move is not to patch the prompt. It is to ask: why did this happen? And — equally important — why did this never happen before? Tracing a regression to its root cause almost always reveals a missing word in the vocabulary, or a recently-added word that confuses two existing ones. Either way, a code fix follows from understanding, not the other way around.

7. A Short Checklist

Before you ship a new agent — or before you decide an existing one is stuck — walk through this list.

  • Can every tool be described in one sentence that distinguishes it from its neighbours?

  • Does any tool have a mode parameter? If yes, split it.

  • Are tool errors structured ({error, hint}) so the agent can recover without human help?

  • Is the system prompt short — only role, paths, and rules true on every turn?

  • Is there one number that grades the agent's output, computed by a critic and not by hand?

  • Is there a known-good baseline you can revert to in one command?

  • After each loss in quality, can you point at exactly which change caused it?

Appendix — How Quality Climbed on the Music-Video Director

The eras the project moved through, with the lesson encoded at each step. Every row started life as a watched video and a complaint that named the defect.

Era | What changed | What it unlocked
First render | Bare pipeline runs | A video exists at all
Identity | Style bible + spec anchoring | Character holds across cuts
Editing | Hard-cut doctrine, scoped cross-cuts | Cuts feel deliberate, not slideshowy
Ground truth | analyze_song made mandatory | Cuts land on beats; structure tracks the song
Story | Plot-seed + conceit libraries (as tools) | Real narrative arcs, not just vibes
Rhythm | Tensity tool ties pacing to energy | Pace follows the music
Storyboard-first | ~70% of run is planning + critique | Film-grade, richly-decomposed scenes
Measurable | AI-feel score + self-improvement loop | Quality becomes a number, not an opinion

Ethan Wang is a member of Google DeepMind, focused on building general-purpose agents on the path to AGI. He is a founding member of two of the org's flagship agent projects: Gemini Spark, the recently released general agent, and Mariner, the company's first computer-use agent.
