AI Agents vs Human-in-the-Loop: What I’ve Learned Shipping Both

TL;DR

It is not “AI vs no AI.” It is where you place the human’s judgement — and how much reasoning you let the model do without it.

  • Let agents run autonomously when the work is repetitive, low-risk, large-scale and tolerant of the occasional miss. The token bill is justified by the hours saved.
  • Keep a human in the loop when a mistake hits traffic, revenue, brand or compliance — anywhere one bad decision costs more than a thousand reviews.
  • Agents are expensive because context accumulates, not just because they loop. Prompt caching changes the slope of that curve, but output tokens and genuinely new context never cache away, so the cost gap narrows without ever closing.
  • The rule I run: the more uncertainty a task contains, the more valuable a human becomes. Agents execute; humans judge.

Everyone is talking about AI agents. Companies are building systems that browse the web, write code, analyse data, send emails and complete multi-step tasks with little or no human intervention.

But there is a question the hype tends to skip, and it is the one I keep running into in practice: should an AI agent make decisions on its own, or should a human stay in the loop?

Boris Cherny, the creator of Claude Code (and a guy with unlimited tokens), puts it this way: “I don’t write prompts anymore, I have loops running that prompt Claude… My job is to write loops. I uninstalled my IDE, I wasn’t using it.”

I build production SEO tooling for a large news publisher and for my own SaaS products, mostly by vibe-coding with Claude Code rather than building models from scratch. So my answer isn’t theoretical; it comes from watching token bills, debugging trajectories that went sideways, and deciding, task by task, where to put a person.

I’m an SEO specialist first, so my honest answer is “it depends”. From what I’ve seen, it depends on three things you can actually measure:

  • Cost (mostly token consumption, but also the time you spend cleaning up failures)
  • Reliability (how badly a single wrong step compounds)
  • Risk (what a mistake costs you in traffic, revenue or reputation)

Here is how each of those plays out, with examples from things I’ve actually built.

1. What is an AI agent?

An AI agent is a system that can:

  1. Receive a goal
  2. Decide what actions to take
  3. Use tools
  4. Observe the results
  5. Keep going until the goal is met

The loop looks like this:

Goal → Think → Use Tool → Observe → Think → Use Tool → Observe → … → Done

For example:

“Find every page with declining organic traffic and open a ticket on Notion for the content team.”

A fully autonomous agent queries the Google Search Console API, analyses the trend lines, categorises the cause, drafts recommendations, and connects to Notion (via MCP) to create a task for the content team, all without me touching it.
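The goal → think → tool → observe loop can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular agent framework works: `run_agent`, `scripted_decide` and the two tools are hypothetical names, and `scripted_decide` stands in for a real model call.

```python
from typing import Callable


def run_agent(goal: str,
              decide: Callable[[str, list[str]], tuple[str, str]],
              tools: dict[str, Callable[[str], str]],
              max_steps: int = 10) -> list[str]:
    """Run a think -> use tool -> observe loop until 'done' or max_steps."""
    observations: list[str] = []
    for _ in range(max_steps):
        tool_name, tool_input = decide(goal, observations)  # "think"
        if tool_name == "done":
            break
        # "use tool" and "observe": the result feeds the next decision
        observations.append(tools[tool_name](tool_input))
    return observations


# Hypothetical stand-in for the model: fetch data, open a ticket, then stop.
def scripted_decide(goal: str, observations: list[str]) -> tuple[str, str]:
    plan = [("fetch_traffic", "last_90_days"), ("open_ticket", "declining pages")]
    if len(observations) < len(plan):
        return plan[len(observations)]
    return ("done", "")


# Hypothetical tools; real ones would call Search Console and Notion.
tools = {
    "fetch_traffic": lambda arg: f"traffic data for {arg}",
    "open_ticket": lambda arg: f"ticket created: {arg}",
}

history = run_agent("Find declining pages and open a ticket", scripted_decide, tools)
print(history)
```

The `max_steps` cap is the only thing stopping a confused model from looping forever, which is why it is a parameter rather than an afterthought.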

2. What is human-in-the-loop?

A human-in-the-loop (HITL) system inserts approval points. The model proposes; I validate the steps that matter.

Goal → AI Analysis → Human Review → AI Action → Human Approval → Done

Same example, HITL version:

“Find pages with declining traffic and recommend actions.”

The AI finds the issues and drafts recommendations. I review the list, strip out the false positives, and approve. Only then does automation continue.

The distinction I’ve landed on is this: it’s not “AI vs no AI.” It’s where you place the human’s judgement — and how much reasoning you’re willing to make the model do without it.
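An approval point is easy to picture as a filter between the model’s proposals and the code that acts on them. The sketch below is illustrative only: `Recommendation`, `run_with_approval` and the sample URLs are hypothetical, and the `approve` callback stands in for a person clicking yes or no.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Recommendation:
    url: str
    action: str


def run_with_approval(recommendations: Iterable[Recommendation],
                      approve: Callable[[Recommendation], bool],
                      execute: Callable[[Recommendation], None]) -> list[Recommendation]:
    """Execute only the recommendations the reviewer approves."""
    executed = []
    for rec in recommendations:
        if approve(rec):  # the human review gate
            execute(rec)
            executed.append(rec)
    return executed


# Hypothetical drafts: the second "decline" is seasonal, i.e. a false positive.
drafts = [
    Recommendation("/news/budget-explainer", "refresh content"),
    Recommendation("/christmas-gift-guide", "refresh content"),
]
rejected_by_reviewer = {"/christmas-gift-guide"}  # stands in for a person's call
log: list[str] = []

done = run_with_approval(
    drafts,
    approve=lambda rec: rec.url not in rejected_by_reviewer,
    execute=lambda rec: log.append(f"{rec.action}: {rec.url}"),
)
print(log)
```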

3. Comparing the two approaches

Area                  | Autonomous agent                                          | Human-in-the-loop
----------------------|-----------------------------------------------------------|-----------------------------------------
Speed                 | Very fast                                                 | Slower (gated on a person)
Token cost            | Higher (loops + accumulating context)                     | Lower (one or two passes)
Reliability           | Errors compound across steps                              | A wrong step gets caught at review
Scalability           | Excellent                                                 | Capped by reviewer time
Risk                  | Higher                                                    | Lower
Improvement over time | None by default. Needs added memory, evals or fine-tuning | Built in: the human corrects every cycle

That last row is the one I see misstated everywhere. A base LLM agent does not “learn continuously.” It accumulates context within a single run, but nothing persists or improves between runs unless you explicitly add a memory layer, an eval harness, or fine-tuning. (In my own Claude Code setup that’s what Skills and a proper QA test suite are for — otherwise every run starts from zero.) HITL, by contrast, gets a free feedback signal every time I edit the output.

4. Why agents really use more tokens

Most people picture this:

One task = one prompt

Agents actually work like this:

Task → Reason → Tool → Observe → Reason → Tool → Observe → Reason → Output

Every cycle costs tokens. But the bigger driver isn’t the number of loops — it’s what each loop carries.

Context accumulates

At each step, the agent re-sends almost everything it has seen: the original goal, prior reasoning, previous tool outputs, and current state. The input grows every turn:

Turn 1:   2,000 input tokens
Turn 2:   4,000   (re-reads turn 1 + new tool output)
Turn 3:   7,000
Turn 4:  12,000

By turn four, it’s re-reading 12,000 tokens of mostly old context. Across the run, it has processed ~25,000 input tokens plus all the output, versus maybe 3,000 for a single HITL pass.
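The arithmetic behind those figures is just a running sum. A quick check, using the four illustrative turn sizes above and the rough 3,000-token HITL pass as the comparison:

```python
from itertools import accumulate

# Input tokens sent on each turn (the illustrative figures from the text).
turn_input_tokens = [2_000, 4_000, 7_000, 12_000]

# Cumulative input tokens processed after each turn.
running_total = list(accumulate(turn_input_tokens))
total = running_total[-1]

# Rough single-pass HITL figure from the text.
hitl_pass_tokens = 3_000
ratio = total / hitl_pass_tokens

print(running_total)  # [2000, 6000, 13000, 25000]
print(f"agent run is about {ratio:.1f}x the HITL pass on input alone")
```

Note that the final turn alone accounts for almost half of the total, which is the accumulation effect in miniature.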

This is the real cost engine, and it’s why the multipliers are large. According to Anthropic’s own engineering data, single agents use roughly 4× the tokens of a chat interaction, and multi-agent systems about 15×. That baseline also compounds: a subagent that recursively spawns more subagents, or a tool that returns an oversized payload, can multiply a run’s cost by another 10×, and there’s no circuit breaker unless you build one. I’ve watched a single misconfigured loop quietly become the most expensive thing I ran that week.
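A circuit breaker does not need to be sophisticated. One possible shape is a running token budget that halts the loop once it is exceeded. The class and exception names below are hypothetical, and the 20,000-token limit is an arbitrary number chosen so the example trips on the fourth turn.

```python
class TokenBudgetExceeded(RuntimeError):
    """Raised when a run has spent more tokens than it was allowed."""


class TokenBudget:
    def __init__(self, max_tokens: int) -> None:
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, tokens: int) -> None:
        """Record tokens spent on a turn and stop the run if over budget."""
        self.used += tokens
        if self.used > self.max_tokens:
            raise TokenBudgetExceeded(
                f"used {self.used} of {self.max_tokens} tokens"
            )


budget = TokenBudget(max_tokens=20_000)  # arbitrary illustrative limit
completed_turns = 0
stop_reason = None

try:
    for turn_tokens in [2_000, 4_000, 7_000, 12_000]:
        budget.charge(turn_tokens)  # in a real loop, call this after each model response
        completed_turns += 1
except TokenBudgetExceeded as exc:
    stop_reason = str(exc)

print(completed_turns, stop_reason)
```

Because usage is only known after a call returns, this version overshoots by one turn before stopping. A stricter variant could estimate the next turn’s size and refuse to start it.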

The piece most cost breakdowns miss: caching

Here is the correction I push back with whenever I read “agents are expensive because they re-read context”: you don’t pay full price to re-read.

Prompt caching lets you mark the stable prefix — system prompt, tool definitions, the original goal, earlier turns — so subsequent calls read it from cache instead of reprocessing it. Cache hits cost about 10% of the standard input rate (a ~90% saving on the cached portion), against a one-time write premium.

That changes the slope of the cost curve, not its existence:

  • The re-read penalty on stable context largely disappears.
  • What you still pay full rate for is the new tail each turn (fresh tool outputs, new reasoning) and, critically, the output tokens, priced several times higher than input and never cached.

In the generation-heavy work I run — mass title rewrites, hundreds of content briefs across sites — output tokens, not context re-reads, are usually the real bill. Cache aggressively and the agent-vs-HITL cost gap narrows, but it never closes, because output and genuinely new context can’t be cached away.
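To see how caching changes the slope, here is a rough model of input cost only, measured in “full-price token equivalents” so no actual prices are needed. The 10% read rate comes from the figure above. The 1.25× write premium is my assumption for illustration, as is the simplification that each turn’s new tail is written to cache and the whole previous turn is read back from it. Check your provider’s pricing before relying on either.

```python
CACHE_READ_RATE = 0.10   # from the text: cache hits cost ~10% of the input rate
CACHE_WRITE_RATE = 1.25  # ASSUMED write premium, for illustration only


def input_cost_units(turn_inputs: list[int], cached: bool) -> float:
    """Input cost across all turns, in full-price token equivalents.

    Output tokens are deliberately excluded: they are never cached.
    """
    total = 0.0
    previous = 0  # size of the prefix already seen on the prior turn
    for tokens in turn_inputs:
        if cached:
            new_tail = tokens - previous
            total += previous * CACHE_READ_RATE + new_tail * CACHE_WRITE_RATE
        else:
            total += tokens
        previous = tokens
    return total


turns = [2_000, 4_000, 7_000, 12_000]
uncached = input_cost_units(turns, cached=False)
with_cache = input_cost_units(turns, cached=True)

print(f"uncached: {uncached:,.0f}  cached: {with_cache:,.0f}")
print(f"saving on input: {1 - with_cache / uncached:.0%}")
```

Under these assumptions the input bill drops by roughly a third, not by 90%, because the new tail on every turn is still paid for in full (plus the write premium). That is the gap narrowing without closing.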

Context growth is also a reliability problem

Cost isn’t the only consequence of a swelling context window. As it grows, the agent eventually hits the hard window limit, and well before that, quality degrades — models lose track of detail buried in the middle of a long context. The long-running agents I’ve built fight this with compaction (summarising old turns), external memory, and splitting work across runs that each get a fresh window. If you’re operating agents at any scale, context management is the engineering problem.
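Compaction, in its simplest form, replaces older turns with a single summary and keeps the most recent turns verbatim. The sketch below assumes the history is a plain list of strings. `compact` is a hypothetical helper, and the `summarize` callback is a placeholder for what would normally be a model call.

```python
from typing import Callable


def compact(history: list[str],
            keep_recent: int,
            summarize: Callable[[list[str]], str]) -> list[str]:
    """Replace all but the last keep_recent turns with one summary entry."""
    split = len(history) - keep_recent
    if split <= 0:
        return list(history)  # nothing old enough to compact
    old, recent = history[:split], history[split:]
    return [summarize(old)] + recent


# Placeholder summariser; a real one would ask a model to condense the turns.
def placeholder_summary(turns: list[str]) -> str:
    return f"[summary of {len(turns)} earlier turns]"


history = ["goal", "tool output 1", "reasoning 1", "tool output 2", "reasoning 2"]
compacted = compact(history, keep_recent=2, summarize=placeholder_summary)
print(compacted)
```

The trade-off is that whatever the summary drops is gone for the rest of the run, so anything the agent must not forget (the goal, hard constraints) is better pinned outside the compacted region.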

5. Why human-in-the-loop often uses fewer tokens

A human is a compression layer for the decision space.

Instead of:

AI → AI → AI → AI → AI   (many reasoning loops and retries)

You get:

AI → Human decision → AI

I supply direction, prioritisation and validation in a few seconds — work that would otherwise cost the model several reasoning loops and a couple of failed attempts. Thirty seconds of review can save thousands of tokens and head off the wrong-trajectory cascade before it starts.

The trade is obvious: I’ve put myself back on the critical path. That’s fine when the decision is rare and high-stakes. It’s a bottleneck when the decision is frequent and cheap.

6. When I let agents run autonomously

Agents earn their token cost when the task is repetitive, low-risk, large-scale and tolerant of the occasional miss. In technical SEO that’s a lot of the daily grind, and it’s where most of my automation lives:

1. Internal link discovery and orphan detection. I built an automated internal-linking system for a news publisher — crawl the archive, find pages with no inbound internal links, and surface relevant source articles that could link to them with sensible anchor text. Low risk (it proposes links, it doesn’t delete pages), highly repetitive, and it scales to an archive no human could work through by hand. A wrong suggestion costs nothing — it just doesn’t get used.

2. Crawl and log analysis. Parsing logs and crawler behaviour to work out where crawl budget is being wasted — parameter URLs, redirect chains, soft-404s, bot access patterns — is pattern-matching over huge volumes where speed beats perfection. I’ve written plenty of throwaway Python for exactly this; a misclassified line in a diagnostic report does no damage, so I’m happy to let it run unsupervised.

3. Rank and SERP-feature tracking. This is the core of what I automate — from Raspberry Pi rank tracking on DataForSEO to StoryHawk, which monitors Google first-page features (Top Stories, Video Carousel, Web Stories, Image Packs) for publishers. High volume, deterministic, and the output is data, not a live change. Perfect autonomous territory.

4. Structured-data validation at scale. Crawl templates, validate NewsArticle/Article/Product markup against schema.org, flag what’s missing or malformed. The kind of custom-extraction job I’ve run through Screaming Frog for years — repetitive, deterministic, and it produces a flag for review rather than a change to production.

5. Multilingual publishing pipelines. My esim-publisher CLI automates article publishing across several sites via GitHub PRs and Coolify deploys. The agentic part — formatting, routing, opening the PR — runs autonomously, because the PR itself is the approval gate (see below).
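The orphan-detection step in item 1 reduces to a set difference once the crawl has produced a list of internal links. This is a minimal sketch with made-up URLs, not the production system: `find_orphans` is a hypothetical name, and the relevance matching and anchor-text suggestions are out of scope here.

```python
def find_orphans(pages: set[str], links: list[tuple[str, str]]) -> set[str]:
    """Return pages with no inbound internal links (self-links don't count)."""
    linked_to = {target for source, target in links if source != target}
    return pages - linked_to


# Hypothetical crawl output: (source page, target page) pairs.
pages = {"/", "/news/a", "/news/b", "/archive/old-story"}
links = [
    ("/", "/news/a"),
    ("/news/a", "/news/b"),
    ("/news/b", "/"),
    ("/archive/old-story", "/archive/old-story"),  # links only to itself
]

orphans = find_orphans(pages, links)
print(orphans)  # {'/archive/old-story'}
```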

In all of these, the cost of a mistake is a discarded suggestion or a PR I don’t merge. The token bill is justified by the hours saved.
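The structured-data check in item 4 has the same “flag, don’t change” shape. The sketch below assumes a single top-level JSON-LD object (no arrays or `@graph`), and the required-field lists are illustrative assumptions of mine, not schema.org’s or any search engine’s actual requirements.

```python
import json

# ASSUMED field lists for illustration; substitute your own requirements.
REQUIRED_FIELDS = {
    "NewsArticle": ("headline", "datePublished", "author", "image"),
    "Product": ("name", "offers"),
}


def missing_fields(jsonld_text: str) -> list[str]:
    """List required fields that are absent or empty in one JSON-LD object."""
    data = json.loads(jsonld_text)
    required = REQUIRED_FIELDS.get(data.get("@type", ""), ())
    return [field for field in required if not data.get(field)]


# Hypothetical markup extracted from a page template.
sample = json.dumps({
    "@context": "https://schema.org",
    "@type": "NewsArticle",
    "headline": "Example headline",
    "datePublished": "2024-03-06",
})

flags = missing_fields(sample)
print(flags)  # ['author', 'image']
```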

7. When I keep a human in the loop

I keep a person on the approval step when a mistake hits traffic, revenue, brand or compliance — anything where one bad decision is more expensive than a thousand reviews:

1. Migration redirect mapping. An agent can propose the old→new URL map by similarity matching. A human signs off before it ships, full stop. I’ve done enough migration analysis to know a botched redirect map quietly tanks organic traffic across the whole property — and you usually don’t notice until the rankings are already gone.

2. robots.txt, noindex, and canonical changes. The agent drafts the change and explains its reasoning; it never pushes to production unsupervised. On a site the size of a national newspaper, a single stray Disallow: or an errant noindex can deindex a section. This is the textbook mandatory-approval gate, and I treat it that way — in my own Claude Code projects I literally use Hooks to protect this class of change.

3. Keyword-gap analysis to editorial briefs. On client work — the Danish sites I support are a good example — the agent clusters keywords, finds the gaps, and drafts briefs. A human editor decides what’s actually worth commissioning. That’s a judgement call about audience, angle and resourcing that no amount of agent reasoning replaces.
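For the redirect mapping in item 1, similarity matching can be approximated with the standard library’s `difflib`. This is a rough sketch under stated assumptions, not a production matcher: the URLs are invented, `propose_redirects` is a hypothetical name, and string similarity on paths is a much cruder signal than content or title matching. Every proposal starts unapproved, and the list is sorted weakest match first so the reviewer sees the riskiest pairs at the top.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """String similarity between two URL paths, from 0.0 to 1.0."""
    return SequenceMatcher(None, a, b).ratio()


def propose_redirects(old_urls: list[str], new_urls: list[str]) -> list[dict]:
    """Pair each old URL with its most similar new URL, weakest matches first."""
    proposals = []
    for old in old_urls:
        best = max(new_urls, key=lambda new: similarity(old, new))
        proposals.append({
            "old": old,
            "new": best,
            "score": similarity(old, best),
            "approved": False,  # nothing ships until a human flips this
        })
    return sorted(proposals, key=lambda p: p["score"])


# Hypothetical migration: old paths on the left, new site structure on the right.
old_urls = ["/news/2023/budget-analysis", "/sport/football/results"]
new_urls = ["/news/budget-analysis", "/sport/football-results"]

for proposal in propose_redirects(old_urls, new_urls):
    print(f'{proposal["old"]} -> {proposal["new"]} ({proposal["score"]:.2f})')
```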

The pattern I keep coming back to: agents execute, humans judge. Where judgement is the scarce input, I stay in the loop.
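One cheap safeguard for the robots.txt case in item 2 is an automated pre-check that runs before the human even looks: parse the proposed file and confirm that a list of must-stay-crawlable URLs is still allowed. The sketch uses Python’s standard `urllib.robotparser`, which does simple prefix matching and does not implement Google-style `*` and `$` wildcards, so treat it as a first filter rather than a faithful Googlebot simulation. The function name, the URLs and the proposed file are all hypothetical.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt: str,
                 must_stay_crawlable: list[str],
                 user_agent: str = "Googlebot") -> list[str]:
    """Return the protected URLs that the proposed robots.txt would block."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in must_stay_crawlable
            if not parser.can_fetch(user_agent, url)]


# Hypothetical agent-drafted change with a stray rule that is too broad.
proposed = """\
User-agent: *
Disallow: /admin/
Disallow: /news
"""

protected = [
    "https://example.com/news/politics/story",
    "https://example.com/sport/",
]

problems = blocked_urls(proposed, protected)
print(problems)  # ['https://example.com/news/politics/story']
```

A non-empty result is a reason to reject the draft outright. An empty result is not approval, only a sign that the change is worth a person’s time to review.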

8. The hybrid model is what I actually run

In practice, almost nothing I build is fully autonomous or fully manual. I route the cheap, repetitive execution to agents and reserve my attention for the decisions that carry risk:

Data collection
      ↓
AI analysis
      ↓
AI recommendation
      ↓
Human approval   ←  the gate sits exactly where risk lives
      ↓
AI execution

That esim-publisher flow is the shape of it: the agent does everything up to opening the PR, and the PR is my approval gate before a Coolify deploy. I get lower token usage (fewer speculative loops), better accuracy (I catch the wrong trajectory early), fast execution on everything except the gated step, and a small blast radius when something goes wrong. For most of what I do, that’s the optimal balance — and it’s far cheaper than letting an unsupervised agent compound an error across fifty steps before I look.
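Placing the gate “where risk lives” can be expressed as a routing rule: classify each proposed action, send the risky classes to a review queue, and let the rest run. A minimal sketch follows. The action names and the contents of `HIGH_RISK_ACTIONS` are illustrative assumptions drawn from the examples in this article, not a complete policy.

```python
# ASSUMED risk classes, based on the examples discussed above.
HIGH_RISK_ACTIONS = {
    "update_robots_txt",
    "set_noindex",
    "change_canonical",
    "ship_redirect_map",
}


def route(actions: list[dict]) -> dict[str, list[dict]]:
    """Split proposed actions into an auto-run queue and a human-review queue."""
    queues: dict[str, list[dict]] = {"auto_execute": [], "human_approval": []}
    for action in actions:
        queue = "human_approval" if action["type"] in HIGH_RISK_ACTIONS else "auto_execute"
        queues[queue].append(action)
    return queues


# Hypothetical batch of agent proposals.
proposed_actions = [
    {"type": "suggest_internal_link", "target": "/archive/old-story"},
    {"type": "set_noindex", "target": "/tag/misc"},
    {"type": "flag_missing_schema", "target": "/news/a"},
]

queues = route(proposed_actions)
print([a["type"] for a in queues["auto_execute"]])
print([a["type"] for a in queues["human_approval"]])
```

Defaulting unknown action types to `auto_execute`, as this sketch does, is the permissive choice. For anything touching production, flipping the default so that unrecognised actions go to review is the safer design.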

9. The rule I use: uncertainty is where humans pay for themselves

The heuristic I’ve settled on:

The more uncertainty a task contains, the more valuable a human becomes.

Agents are exceptional at execution — high-volume, well-defined, repeatable work. Humans are exceptional at judgement under uncertainty, and we deliver it in one decision rather than five reasoning loops and two retries.

So when I catch an agent grinding through loop after loop — burning accumulating context the whole way — just to work out the right answer, that’s my signal to pull the decision back to a human. I can usually make the same call instantly and for a fraction of the tokens. That’s not an argument against agents. It’s an argument for putting the human exactly where the uncertainty is, and letting the agent run flat-out everywhere else.

Key takeaways

  • Decide agent vs HITL on three measurable axes: cost, reliability and risk — not hype.
  • Agents cost more because context accumulates every turn; single agents run ~4× and multi-agent ~15× the tokens of a chat.
  • Prompt caching flattens the re-read penalty, but output tokens and new context never cache, so the cost gap narrows without closing.
  • Let agents run autonomously on repetitive, low-risk, large-scale work; keep a human on anything touching traffic, revenue, brand or compliance.
  • The hybrid model — agent executes, human approves at the risky gate — is the only configuration that survives a real token bill.

From where I sit, the future isn’t autonomous agents replacing people, at least for now. It’s systems where AI does the heavy lifting and a human supplies direction, validation and oversight at the few points where it actually matters. That’s the way I build, and so far it’s the only configuration that’s survived contact with a real token bill.

Svet Petkov, Head of Technical SEO at The Telegraph

Svetoslav Petkov has over ten years of experience in technical SEO, working agency-side, in-house, and as a consultant across news publishing, e-commerce, and financial services. He writes about technical SEO, data engineering, and applied AI.