skip to content

LLM-driven systems that pursue a goal by interleaving reasoning, tool calls, and observations inside a loop — and that decide for themselves which step to take next.

AI Agents#

Definition#

An AI agent is an LLM placed inside a loop with tools, memory, and an explicit goal — the model decides which tool to call next, observes the result, updates its plan, and repeats until the goal is satisfied or a termination condition fires. Anthropic’s “Building effective agents” draws a sharp line between workflows (LLMs orchestrated through code-defined paths) and agents (LLMs that dynamically direct their own processes and tool usage); a system is only “agentic” when control over the next step lives with the model rather than the developer. The minimal recipe is unchanged across vendors: one chat model + a tool schema + a runner that feeds results back as tool_result messages + a stop condition.

Why it matters#

Agents are the unit of composition for any task that can’t be solved by a single prompt — multi-step research, codebase refactors, ticket triage, data extraction across messy sources, browser or computer automation, anything that needs to react to intermediate observations. They turn a stateless completion API into a goal-driven worker, which is why almost every “AI feature” shipped since 2024 (coding assistants, customer-support copilots, browser agents like Operator, deep-research products) is some flavour of agent loop under the hood. Picking the right abstraction matters: Anthropic’s research finding — repeated across LangChain, OpenAI, and crewAI post-mortems — is that simpler, composable patterns beat heavyweight frameworks for most production use cases, and that complexity should only be added when measurable evaluation says it pays off.

How it works#

An agent loop is a small state machine that the model drives.

  1. System prompt + goal. The developer seeds the conversation with a role, constraints, and the user request. Tool schemas (JSON Schema for OpenAI/Anthropic, function signatures for SDKs) are passed alongside the messages so the model knows what’s callable.
  2. Plan / act. The model emits either a final answer or a tool_use block — name + arguments. The classic ReAct pattern (Yao et al., 2022) interleaves a visible “Thought:” before each action so the trace is auditable; Toolformer (Schick et al., 2023) showed models can learn when to call tools without explicit scaffolding.
  3. Execute. The harness runs the tool — code interpreter, shell, HTTP call, vector search, sub-agent, MCP server — and returns a tool_result content block. MCP (Model Context Protocol) is the emerging interop standard: any MCP server is a drop-in tool surface for any MCP-aware agent.
  4. Observe / update. The result is appended to the message list; the model re-reads the trajectory, updates its plan, and emits the next action. Long traces get compacted, summarised, or offloaded to memory — a key-value store, a vector index, or a scratchpad file.
  5. Terminate. The loop ends when the model returns stop_reason: end_turn, when a max_turns budget is hit, when a guardrail trips, or when a human-in-the-loop step rejects an action.

Patterns layer on top: single-agent + tools (the default), router (one agent chooses among specialists), multi-agent debate / review (AutoGen’s signature pattern), role-based crews (crewAI’s planner → researcher → writer chain), graph-based stateful workflows (LangGraph’s directed graph with checkpoints and time-travel), and sub-agents (Claude Code’s Task tool, Codex’s /agent) for context-window isolation and parallelism.

Frameworks land on different points of the trade-off curve:

  • Claude Agent SDK — safety-first, MCP-native, ships computer use; locked to Claude models.
  • OpenAI Agents SDK — clean handoff model, built-in tracing and guardrails; locked to OpenAI models.
  • LangGraph — fully model-agnostic, stateful graphs with checkpointing and time-travel debugging via LangSmith.
  • AutoGen / crewAI / LlamaIndex / Haystack — opinionated higher-level surfaces for multi-agent, role-based, document-centric, or pipeline-DAG patterns respectively.

Evaluation has matured alongside the runtimes. SWE-bench Verified (500 real GitHub issues) is the de-facto coding-agent benchmark — Claude Sonnet 4.5 leads at ~77 % as of 2026, up from 4 % three years earlier. Adversarial variants like SWE-ABS show the headline numbers drop ~15 points under strengthened test suites, so always pair a public benchmark with task-specific evals before trusting an agent in production.

Common pitfalls#

  1. No termination condition. Multi-agent runs without max_turns or an explicit termination_condition will loop until token budgets explode. Always cap the loop and alarm on runaway cost.
  2. Reaching for a framework before the prompt works. A single well-scoped tool-use call often beats a 6-agent crew. Start with the model’s native tool-use API; promote to a framework only when evals justify it.
  3. Vague tool descriptions. Tools are selected by the model from their description field. “Get weather” is worse than “Get current weather for a city; call this whenever the user asks about temperature, rain, or forecasts.” Write descriptions from the model’s perspective.
  4. Overlapping agent roles. Two agents with near-identical role/goal produce contradictory output. Each agent in a crew needs a clearly differentiated responsibility, or collapse them into one.
  5. Context-window poisoning. Long tool traces, retries, and verbose errors crowd out the task. Spawn a sub-agent (Claude Code Task, Codex sub-agent, LangGraph sub-graph) for sub-problems, and keep the parent’s context lean.
  6. Skipping evaluation. Headline benchmark scores don’t predict your workload. Build a small, task-specific eval set early and re-run it on every prompt or tool change — agents regress silently.
  7. Conflating agents with workflows. If every step is pre-determined, you don’t have an agent — you have a chain. That’s often better (cheaper, more predictable). Only adopt agency where dynamic decision-making genuinely helps.
  8. Not tracking per-session cost. An agent loop’s cost is invisible until the monthly invoice. Tools like ccusage read the local JSONL transcripts of 15 agent CLIs (Claude Code, Codex, OpenCode, Amp, Droid, …) and roll up daily / per-session / per-5-hour-billing-block spend — wire one into a tmux pane or a statusline before your first long-running task.

Where to go next#

Sibling concepts, tool-specific cheat sheets, and external references for going deeper.

Sources#

References consulted while writing this concept page. Links open in a new tab.

See also

Used in (12)