Everybody Has an Agent Now: The Architecture, Economics, and Confusion of Agentic AI

 

“A word that names everything from a chatbot reply to a swarm of self-directing systems has stopped naming anything at all, and the cost of that vagueness lands on whoever signs the contract.”

When a Word Means Everything, It Means Nothing

Walk through a week of industry announcements and the word “agent” arrives wearing a different costume each time. A customer-support vendor uses it for a chatbot that pulls order history and issues refunds. A developer-tools company uses it for a system that edits a live codebase, runs the tests, and fixes what it broke. A search product uses it for something that crawls dozens of sources and returns a written report. Productivity suites bury an “agent” inside email and documents, consumer banking applications launch one that claims to buy products on your behalf, and a growing tier of platforms reserve the term for orchestrated fleets of cooperating systems. Each of these is real software. None of them is the same kind of thing, yet all of them share one noun.

The label travels furthest where the capability is thinnest. The Nielsen Norman Group catalogued the consumer end bluntly in 2026: agents in your phone, your Slack, your note-taking app, your bank. Many of these are assistants that suggest and wait, or scripts that follow a fixed path, dressed in the language of autonomy. The limitation is hidden by the branding. A banking “agent” that can supposedly purchase on your behalf is, in most deployments, a narrow flow with hard guardrails, and the moment a request falls outside its script it stalls or hands back to a human. Robotic process automation rebadged as an agent carries the same disadvantage in sharper form: it executes reliably until the underlying form changes by a field, and then it fails silently, because it never had the self-direction to notice that anything had moved.

At the capable end, among the genuine coding and search agents, the disadvantages are different but no less real. These systems do decide their own next steps, which is precisely what makes them useful and what makes them hard to trust. They are brittle when a task drifts past their tested range, they can compound a single confident error across many steps, and they still require a human to review consequential output. The same property that lets an agent act unsupervised is the property that lets it act wrongly unsupervised.

This is not pedantry about vocabulary. When one word stretches to cover a scripted refund bot, a repository-editing tool, and a coordinated research pipeline, the people who must buy, build, and govern these things lose their footing. A developer hears “agent” and pictures a reasoning loop with tool access. An executive hears the same word and pictures autonomous staff replacement. A vendor says it and means whatever closes the deal. 

AI Agent and Agentic AI are used interchangeably in that same marketing copy, yet a growing body of work treats them as distinct, and the people drawing the distinction do not agree on where the line falls. Clearing this up is not an academic exercise. It determines whether a team ships the right architecture and whether a budget buys the capability it was promised.

Two Words Doing Different Jobs

 

“The distinguishing variable is not how many models are in the room. It is autonomy: who decides the next step, the programmer in advance or the model at runtime.”

In their review of the field, Sapkota and colleagues define an AI Agent as a modular system driven by a language model and built for narrow, task-specific automation, using tool integration, prompt engineering, and reasoning enhancements to execute a bounded objective such as drafting a reply, classifying a ticket, or scheduling a meeting.

Agentic AI, in their framing, is something further again, marked by multi-agent collaboration, dynamic task decomposition, persistent memory, and coordinated autonomy. Where an AI agent executes an isolated task under fairly close instruction, an agentic system pursues a broad goal, retains context over time, adapts to feedback, and parcels sub-tasks out to subordinate agents. The boundary they draw is, in effect, the number of cooperating parts.

| Aspect | Generative AI | AI Agent | Agentic AI |
| --- | --- | --- | --- |
| Reasoning and execution | Single-step response, digital content only | Plans and executes a bounded task with tools | Iterative, multi-step pursuit of an overarching goal |
| Instruction dependency | Needs a detailed, specific prompt | Needs a defined task and tools | Navigates ambiguity from a high-level goal |
| Memory | None beyond the context window | Limited, mostly within a session | Persistent across tasks and sessions |
| Autonomy | Reactive, user-driven | Task-scoped, supervised | Self-directed, adjusts its own path |
| Coordination | Single model | Single agent, modular tools | Multiple agents under orchestration |

That is a clean, teachable boundary, and it has obvious appeal for anyone trying to bring order to a chaotic field. It also runs into trouble the moment you compare it with how practitioners building these systems actually talk.

Anthropic, in its widely read engineering note Building Effective Agents, draws the line somewhere else entirely. It groups everything under “agentic systems” and then separates workflows, where language models and tools are orchestrated through predefined code paths, from agents, where the model dynamically directs its own process and tool use. The distinguishing variable is not how many models are in the room. It is autonomy: who decides the next step, the programmer in advance or the model at runtime.

The Nielsen Norman Group reaches the same place from a design perspective. Its definition is compact: an AI agent is a system that pursues a goal by iteratively taking actions, evaluating progress, and deciding its own next steps. Crucially, the group notes that a single language model answering a question is not an agent at all. It is a reasoning engine. The agent is the system built around that engine, the loop of tools, actions, and self-evaluation that lets it act and reconsider. Their metaphor is exact: the engine alone is not a car. By this definition a lone coding tool that runs a test, reads the error, and tries a different fix is already a fully qualified agent, no second system required. It also follows that an agent need not contain a language model at all. A self-driving vehicle, sensing and planning and acting in a continuous loop, satisfies the same behavioural definition while running on computer vision rather than an LLM.

Where the definitions collide

Here the two camps openly disagree, and the disagreement is not trivial.

For Sapkota and colleagues, multi-agent coordination is the threshold that turns an agent into something categorically new. For Anthropic and Nielsen Norman, multi-agent design is merely one architecture among several, and the real threshold, self-direction, is crossed long before any second agent appears. A single self-directing system is, to the academics, an AI Agent; to the practitioners, it is already agentic. The same software earns two different labels depending on whose taxonomy you hold.

A second disagreement sits underneath the first: whether the distinction is a clean break or a sliding scale. The review treats agentic AI as a discrete leap. Anthropic treats the whole field as a continuum of autonomy, with workflows and agents as regions on it rather than separate species. 

So this analysis will not pretend the matter is closed. It will hold both boundaries in view, the academic one drawn at coordination and the practitioner one drawn at autonomy, and give you a way to navigate regardless of which convention a given document follows.

A Map, Not a Label

A more useful approach is to stop arguing over the label and place each system on a map with two dimensions. The first dimension is capability to act: how much a system can affect the world, from none, through contained digital actions like editing files or querying a database, to relatively unconstrained physical action such as steering a vehicle. The second is self-direction: who chooses the next step. A predetermined process follows a fixed sequence no matter what it encounters. A self-determined process picks its next action based on what just happened. A spam filter, however many rules it applies, never reconsiders its approach; it is predetermined. A tool that reads its own error output and changes tack is self-determined.

The terms cluttering the same conversation now find their places too. An assistant or copilot typically suggests and waits, high on usefulness but low on self-direction, a reasoning engine with a polite interface. A workflow automates multiple steps but along rails laid down in advance. An agent, on any of the practitioner definitions, is the first thing on the map that decides its own next step. Agentic AI, depending on the author, is either everything in that self-directed region or specifically its multi-agent corner. The map accommodates both claims. A label cannot.
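The two dimensions can be written down as a small data structure. The sketch below is illustrative only: the `Capability` levels, the `System` record, and the example placements are assumptions made for this example, not part of any published taxonomy.

```python
from dataclasses import dataclass
from enum import IntEnum


class Capability(IntEnum):
    """How much a system can affect the world."""
    NONE = 0      # produces output only
    DIGITAL = 1   # contained digital actions, e.g. editing files
    PHYSICAL = 2  # relatively unconstrained physical action


@dataclass(frozen=True)
class System:
    name: str
    capability: Capability
    self_determined: bool  # True when the system picks its own next step


def place(system: System) -> str:
    """Describe where a system sits on the two-dimensional map."""
    direction = "self-determined" if system.self_determined else "predetermined"
    return f"{system.name}: {system.capability.name.lower()} action, {direction}"


# Placements below are illustrative assumptions.
spam_filter = System("spam filter", Capability.DIGITAL, self_determined=False)
coding_tool = System("coding tool", Capability.DIGITAL, self_determined=True)
vehicle = System("self-driving vehicle", Capability.PHYSICAL, self_determined=True)
```

Two systems with the same capability to act, such as the spam filter and the coding tool here, land in different places purely because of who chooses the next step.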

Six Shapes of an Agent

The six topologies below move from least to most self-directed. For each, a flow diagram traces the execution path, and a table reports the approximate split between AI reasoning and traditional software at every phase. The proportions are illustrative rather than measured, but they make a point that marketing rarely does: even at the autonomous end, deterministic code carries much of the load, and at the lower rungs it carries almost all of it.

A useful way to read the six is as answers to one question, asked with rising stakes: how much of the next step is decided by the model rather than by the developer who wrote the harness around it?

Scope One: Prompt Chaining

Prompt chaining decomposes a task into a fixed sequence of steps, each model call consuming the output of the last. The pattern trades latency for accuracy by keeping every call narrow, and it suits work that splits cleanly into stages, such as drafting copy and then translating it. Programmatic gates between steps validate intermediate output and halt the chain when something fails.

flowchart TD
  A[Input payload] --> G1{Validate syntax}
  G1 -->|pass| L1[LLM 1: draft]
  L1 --> G2{Schema gate}
  G2 -->|pass| L2[LLM 2: transform]
  L2 --> F[Format and PII redaction]
  F --> O[Output]
  G1 -->|fail| H[Halt]
  G2 -->|fail| H

| Phase | Execution | Guardrails | AI vs traditional |
| --- | --- | --- | --- |
| Task decomposition | Code receives the payload and routes it to the first model | Input syntax validation | 10% AI / 90% traditional |
| Sequential processing | Model 1 drafts, model 2 transforms the result | Schema gates between steps; halt on failure | 40% AI / 60% traditional |
| Finalisation | The last model formats output for delivery | Rule-based filtering and PII redaction | 30% AI / 70% traditional |

The system exhibits no self-direction. The developer hardcodes the sequence and the checks between nodes, which is exactly why traditional software dominates. This is a workflow wearing none of the agentic costume, and most tasks labelled “agent” in consumer products live here.
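A minimal sketch of the pattern, assuming two stand-in functions in place of real model calls. The names `draft_model`, `transform_model`, and the gate checks are hypothetical; the point is only that the sequence and the gates live in ordinary code.

```python
from typing import Optional


def draft_model(text: str) -> str:
    """Stand-in for the first model call (hypothetical)."""
    return f"DRAFT: {text.strip()}"


def transform_model(text: str) -> str:
    """Stand-in for the second model call (hypothetical)."""
    return text.upper()


def validate_input(payload: str) -> bool:
    """Gate 1: reject empty or whitespace-only input."""
    return bool(payload.strip())


def schema_gate(draft: str) -> bool:
    """Gate 2: the draft must carry the expected prefix and some content."""
    prefix = "DRAFT: "
    return draft.startswith(prefix) and len(draft) > len(prefix)


def run_chain(payload: str) -> Optional[str]:
    """Fixed sequence: gate, model 1, gate, model 2. Returns None on halt."""
    if not validate_input(payload):
        return None
    draft = draft_model(payload)
    if not schema_gate(draft):
        return None
    return transform_model(draft)
```

Nothing in `run_chain` lets either model change the order of steps or skip a gate, which is why this counts as a workflow rather than an agent.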

Scope Two: Routing

Routing classifies an incoming request and sends it to a specialised downstream handler, which lets a team write focused prompts per intent and send cheap queries to small models while reserving expensive ones for hard cases. It is the backbone of competent customer-service automation, and it remains a rigid workflow with no genuine agency.

flowchart TD
  I[Input] --> S[Sanitise and rate limit]
  S --> C{Classifier: intent}
  C -->|low confidence| HF[Human fallback]
  C -->|high confidence| R[Route by intent]
  R --> L1[Specialist LLM A]
  R --> L2[Specialist LLM B]
  L1 --> V{Schema validation}
  L2 --> V
  V --> O[Output]

| Phase | Execution | Guardrails | AI vs traditional |
| --- | --- | --- | --- |
| Input reception | System ingests the query | Rate limiting and sanitisation | 0% AI / 100% traditional |
| Classification | A lightweight model or classifier infers intent | Confidence thresholds; low confidence routes to a human | 80% AI / 20% traditional |
| Routing | Code directs the payload to the right specialist | Execution time-outs | 20% AI / 80% traditional |
| Execution | The specialist model returns the result | Output schema validation | 90% AI / 10% traditional |

Intelligence concentrates at the two ends, classification and execution, while the connective tissue stays deterministic. The model decides what kind of thing the request is, but not what to do about it; the routing table does that.
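The split between model judgement and routing table can be sketched as follows. The keyword-based `classify` function stands in for a real classifier, and the intents, handlers, and 0.8 threshold are assumptions chosen for illustration.

```python
from typing import Callable, Dict, Tuple

CONFIDENCE_THRESHOLD = 0.8  # assumed cut-off; below it a human takes over


def classify(query: str) -> Tuple[str, float]:
    """Stand-in for the intent classifier; returns (intent, confidence)."""
    text = query.lower()
    if "refund" in text:
        return "billing", 0.95
    if "password" in text:
        return "account", 0.90
    return "unknown", 0.30


# The routing table: plain code decides which specialist sees the request.
HANDLERS: Dict[str, Callable[[str], str]] = {
    "billing": lambda q: "billing specialist handled: " + q,
    "account": lambda q: "account specialist handled: " + q,
}


def route(query: str) -> str:
    """Send the query to a specialist, or to a human when confidence is low."""
    intent, confidence = classify(query)
    handler = HANDLERS.get(intent)
    if confidence < CONFIDENCE_THRESHOLD or handler is None:
        return "escalated to human: " + query
    return handler(query)
```

The classifier only labels the request. What happens next is fixed by `HANDLERS` and the threshold, both of which a developer wrote in advance.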

Scope Three: Parallelisation

Parallelisation runs several model calls at once and aggregates their output programmatically. It comes in two forms: sectioning, where independent sub-tasks run concurrently, and voting, where the same task runs several times to reach consensus. It buys speed and reliability at the cost of more simultaneous calls, and it is common in automated evaluation and safety screening. Architecturally it is still deterministic.

flowchart TD
  I[Input] --> D[Distributor]
  D --> P1[LLM 1: task or section]
  D --> P2[LLM 2: task or section]
  D --> P3[LLM 3: task or section]
  P1 --> AG{Aggregate: vote or threshold}
  P2 --> AG
  P3 --> AG
  AG --> O[Output]

| Phase | Execution | Guardrails | AI vs traditional |
| --- | --- | --- | --- |
| Distribution | System broadcasts the input to several model instances | Concurrency limits and thread management | 5% AI / 95% traditional |
| Parallel execution | Instances process sections, or all judge the same item | Per-instance timeouts and error isolation | 95% AI / 5% traditional |
| Aggregation | Code collects and reconciles the branches | Threshold logic, such as two of three flags | 10% AI / 90% traditional |

The reasoning is concentrated and abundant in the middle phase, but the decision about what to do with the results is a hardcoded threshold. The system never chooses its own path.
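The voting form can be sketched with a thread pool and a hardcoded threshold. The three judge functions are trivial stand-ins for model calls, invented for this example; only the two-of-three aggregation rule comes from the table above.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

Judge = Callable[[str], bool]


def judge_keywords(text: str) -> bool:
    """Stand-in judge 1 (hypothetical): flags a specific keyword."""
    return "attack" in text.lower()


def judge_length(text: str) -> bool:
    """Stand-in judge 2 (hypothetical): flags unusually long input."""
    return len(text) > 40


def judge_shouting(text: str) -> bool:
    """Stand-in judge 3 (hypothetical): flags all-uppercase input."""
    return text.isupper()


JUDGES: List[Judge] = [judge_keywords, judge_length, judge_shouting]


def flagged_by_vote(text: str, judges: List[Judge], threshold: int = 2) -> bool:
    """Run every judge concurrently, then apply a fixed vote threshold."""
    with ThreadPoolExecutor(max_workers=len(judges)) as pool:
        votes = list(pool.map(lambda judge: judge(text), judges))
    return sum(votes) >= threshold
```

However sophisticated the judges become, the final decision is the comparison `sum(votes) >= threshold`, which is deterministic code.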

Scope Four: Evaluator-Optimiser

This pattern introduces a cyclical, self-refining loop. A generator model produces an artifact, an evaluator model scores it against a rubric, and the generator revises until it passes or a hard iteration cap is reached. The loop mimics a human drafting process and is effective wherever quality criteria can be stated explicitly.

flowchart TD
  I[Goal] --> G[Generator LLM]
  G --> E{Evaluator LLM: score vs rubric}
  E -->|below threshold| G
  E -->|pass or max iterations| O[Output]

| Phase | Execution | Guardrails | AI vs traditional |
| --- | --- | --- | --- |
| Initial generation | Generator produces a first artifact | Token limits and strict system prompts | 90% AI / 10% traditional |
| Critique and scoring | Evaluator judges the artifact against a rubric | Explicit criteria to limit subjective drift | 85% AI / 15% traditional |
| Iterative refinement | Generator revises from the critique | Hardcoded loop cap, enforced by code | 80% AI / 20% traditional |
| Termination | Loop ends on a passing score or the cap | Deterministic completeness check | 10% AI / 90% traditional |

Here self-direction appears for the first time, but only partly. The system iterates on its own judgement of quality, yet the path itself, generate then evaluate then repeat, is structurally fixed. The capability to act stays contained to producing and revising an artifact.
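A minimal sketch of the loop, assuming toy stand-ins for both models: the generator adds one sentence per revision and the evaluator scores by counting sentences. The pass score and iteration cap are arbitrary values picked for the example.

```python
from typing import Tuple


def generate(draft: str) -> str:
    """Stand-in generator (hypothetical): each revision adds one sentence."""
    return draft + " More detail."


def evaluate(draft: str) -> int:
    """Stand-in evaluator (hypothetical): scores by counting sentences."""
    return draft.count(".")


def refine(goal: str, pass_score: int = 3, max_iterations: int = 5) -> Tuple[str, int]:
    """Generate, score, and revise until the score passes or the cap is hit."""
    draft = generate(goal)
    iterations = 1
    while evaluate(draft) < pass_score and iterations < max_iterations:
        draft = generate(draft)
        iterations += 1
    return draft, iterations
```

The `max_iterations` check sits in the loop condition, outside either model, so an evaluator that never returns a passing score cannot keep the loop running.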

Scope Five: Orchestrator-Workers

This is the architecture most people mean when they say Agentic AI. A central orchestrator receives a broad goal, generates a plan, and delegates sub-tasks to specialised worker agents, then synthesises their results. Unlike parallelisation, the sub-tasks are not fixed in advance; the orchestrator decides what is needed from the specific input. Workers operate in sandboxes, write to a shared memory store, and pass through evaluator checks before synthesis.

flowchart TD
  Goal[Broad goal] --> ORC[Orchestrator: plan and decompose]
  ORC --> W1[Worker A: write code]
  ORC --> W2[Worker B: query data]
  ORC --> W3[Worker C: subtask]
  W1 --> MEM[(Shared memory)]
  W2 --> MEM
  W3 --> MEM
  MEM --> SYN[Orchestrator: synthesise and review]
  SYN --> EV{Evaluator loop}
  EV -->|reject| ORC
  EV -->|accept| O[Deliverable]

| Phase | Execution | Guardrails | AI vs traditional |
| --- | --- | --- | --- |
| Dynamic planning | Orchestrator generates a spec and assigns work | Planners barred from execution to stop scope errors cascading | 85% AI / 15% traditional |
| Autonomous execution | Workers complete sub-tasks independently | Isolated sandboxes, read-only data access, full logging | 90% AI / 10% traditional |
| State persistence | Agents read and write a shared memory store | Context resets between handoffs to clear stale state | 30% AI / 70% traditional |
| Synthesis and review | Orchestrator assembles and checks the deliverable | Evaluator loops verify worker output before synthesis | 70% AI / 30% traditional |

The reasoning load is high and now governs the workflow itself, deciding what tasks exist rather than just performing them. Traditional software recedes to an integration layer: the memory store, the sandboxes, the logging. That layer is unglamorous and indispensable, because it is the only thing standing between coordinated autonomy and coordinated failure.
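The difference from parallelisation is that the list of sub-tasks is produced at runtime. The sketch below assumes a keyword-based `plan` function in place of a planning model and two toy workers; all names are hypothetical, and the evaluator loop from the diagram is left out for brevity.

```python
from typing import Callable, Dict, List, Tuple

Worker = Callable[[str], str]

# Toy workers standing in for sandboxed worker agents (hypothetical).
WORKERS: Dict[str, Worker] = {
    "code": lambda task: f"[code] {task}",
    "data": lambda task: f"[data] {task}",
}


def plan(goal: str) -> List[Tuple[str, str]]:
    """Stand-in planner: a real orchestrator would ask a model for this list."""
    steps: List[Tuple[str, str]] = []
    if "report" in goal:
        steps.append(("data", "fetch figures"))
    if "script" in goal:
        steps.append(("code", "write script"))
    return steps


def orchestrate(goal: str) -> str:
    """Plan from the goal, delegate each sub-task, then synthesise."""
    memory: List[str] = []  # shared store that every worker writes to
    for worker_name, task in plan(goal):
        memory.append(WORKERS[worker_name](task))
    if not memory:
        raise ValueError("planner produced no sub-tasks")
    return " | ".join(memory)
```

Different goals yield different sets of delegated work, while the memory list and the dispatch table remain ordinary code.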

Scope Six: The Autonomous Loop

At the far end sits a single model in an unbounded reason-and-act loop, navigating a digital or physical environment with no predefined path. It perceives the environment, reasons about the gap between the current state and the goal, selects a tool, acts, observes the result, and loops, adjusting its trajectory from whatever ground truth comes back. This is maximum self-direction, and it is where the cautionary tales originate.

flowchart TD
  Goal[Goal] --> P[Perceive environment]
  P --> R[Reason and select tool]
  R --> DG{Destructive action?}
  DG -->|yes| HITL[Human approval]
  DG -->|no| ACT[Act: execute tool]
  HITL --> ACT
  ACT --> OB[Observe result]
  OB -->|failure| R
  OB -->|success| T[Terminate]
  R -.budget and loop caps.-> T

| Phase | Execution | Guardrails | AI vs traditional |
| --- | --- | --- | --- |
| Perception | Agent gathers context from the environment | Standardised interfaces, rate-limited perception APIs | 20% AI / 80% traditional |
| Reasoning and tool selection | Agent evaluates state and chooses a tool | Tool-level access control, misuse-resistant tool design | 95% AI / 5% traditional |
| Action and observation | Tool runs; agent observes the outcome | Blast-radius containment, human approval for destructive acts | 50% AI / 50% traditional |
| Dynamic adaptation | On failure, agent reformulates and loops | Loop-count limits and token-budget caps | 80% AI / 20% traditional |

The system reasons for itself at almost every turn, but the guardrails that make it safe are overwhelmingly traditional software living in the infrastructure: the budget caps that stop a runaway loop, the access controls that bound what a tool can touch, the human approval gate in front of anything destructive. The lesson across all six scopes is consistent. Autonomy is a property of the model; safety is a property of the scaffolding. The more of the former a system claims, the more of the latter it silently requires.
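A compact sketch of the loop and two of its guardrails, a step cap and an approval gate. The `choose_tool` policy stands in for the model, the tool names are invented, and feeding a denied approval back as an observation is an assumption of this example rather than something the diagram specifies.

```python
from typing import Callable, Dict, List

DESTRUCTIVE_TOOLS = {"delete_file"}  # hypothetical list of gated tools


def run_agent(
    choose_tool: Callable[[str], str],
    tools: Dict[str, Callable[[], str]],
    approve: Callable[[str], bool],
    max_steps: int = 5,
) -> List[str]:
    """Reason-act-observe loop bounded by a step cap and an approval gate."""
    log: List[str] = []
    observation = "start"
    for _ in range(max_steps):
        tool_name = choose_tool(observation)  # the model's decision
        if tool_name == "done":
            log.append("terminated: goal reached")
            return log
        if tool_name in DESTRUCTIVE_TOOLS and not approve(tool_name):
            observation = f"{tool_name} denied"
            log.append(observation)
            continue
        observation = tools[tool_name]()
        log.append(observation)
    log.append("terminated: step budget exhausted")
    return log


# Demo wiring: a scripted policy in place of a model, and a human who refuses.
def demo_policy(observation: str) -> str:
    return {"start": "read_file", "file read": "delete_file"}.get(observation, "done")


DEMO_TOOLS = {"read_file": lambda: "file read", "delete_file": lambda: "file deleted"}
demo_log = run_agent(demo_policy, DEMO_TOOLS, approve=lambda name: False)
```

Both guardrails are enforced by the harness. The policy can ask for `delete_file` as often as it likes, but the call only runs if `approve` returns true and the step budget has not run out.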

The Pilot-to-Production Chasm

The theory is tidy. The enterprise reality is not. The defining feature of the current market is a vast gap between organisations experimenting with agents and organisations running them in production, and the gap is not closing as fast as the spending implies.

McKinsey’s 2025 global survey, drawn from 1,993 respondents across roughly 105 countries, found that 62% of organisations are at least experimenting with AI agents, but only around 23% are scaling them anywhere in the enterprise. The financial picture is starker still: only 39% report any enterprise-level impact on earnings before interest and taxes from AI overall, and for the majority of those the contribution sits below 5%. Deloitte’s enterprise research adds a governance dimension to the same story, finding that only about one in five companies has a mature model for governing autonomous agents even as deployment accelerates. Experimentation is cheap and nearly universal. Value and control are neither.

The reasons pilots stall are structural, and they tend to stay hidden until a system meets real data. A model that performs beautifully in a clean sandbox degrades when it hits the varied document structures, inconsistent user behaviour, and edge cases of production. Three gaps recur. Integration gaps appear where legacy systems cannot accept an agent’s actions, because the APIs were never built for real-time writes or secure machine identity. Semantic gaps appear where an agent misreads the meaning of enterprise data and confidently produces a wrong answer. Reliability gaps are the most dangerous, because they manifest as silent failure: the agent continues despite missing information, emitting corrupted output that a human then acts on, trusting it.

These friction points have a measurable consequence. Gartner forecasts that more than 40% of agentic AI projects will be cancelled by the end of 2027, driven by escalating costs, unclear business value, and inadequate risk controls. The same analysis estimated that of the thousands of vendors claiming agentic capability, only around 130 were building something that genuinely qualified.

| Bottleneck | How it shows up | Production impact |
| --- | --- | --- |
| Legacy system friction | Old APIs lack real-time execution and machine identity | The agent cannot write back to systems of record, so execution halts |
| Data architecture constraints | Data locked in rigid pipelines cannot be read in context | Hallucinated metrics from misread schemas |
| Cost and compute drift | Recursive loops consume tokens without limit | Overruns that erase the workflow's return on investment |
| Irreversibility of actions | No visibility into how the agent reasoned to a decision | High-severity errors needing expensive manual rollback |

When the Meter Never Stops

Agentic systems introduce a financial risk that traditional software does not have: unbounded consumption of compute. A conventional script costs roughly the same to run every time. An agent does not. Because it reasons probabilistically and loops until it succeeds or gives up, an agent that hits a novel problem or a broken interface may burn through dozens of tool-calling iterations to find a way around it, consuming tokens with each pass.

The result can be technically successful and economically ruinous at once. An agent that needs forty model calls to solve a problem a simpler approach would handle in three has completed the task and destroyed its own business case. This cost drift compounds as workflows grow more complex, and it is worsened by the fact that autonomous systems optimise for completing the task, not for the cost of completing it. Without runtime budget controls, the bill can dwarf the projected return.

Deploying agents therefore demands a shift in how their economics are reckoned, from the fixed-cost amortisation of ordinary software to dynamic, token-based unit economics. The discipline is concrete: estimate the value of a task, estimate the computational cost for an agent to achieve it, and compare the agent platform, integration, and token costs against the current cost of doing the work, over a defined horizon such as twelve months. An agent worth deploying is one whose unit economics survive that comparison, not merely one that works.
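The comparison can be worked through with a few lines of arithmetic. Every figure below is hypothetical, chosen only to show how the same task can pass or fail the test depending on how many model calls it consumes; the function names are likewise invented for this sketch.

```python
def token_cost(calls: int, tokens_per_call: int, price_per_1k_tokens: float) -> float:
    """Cost of one task run, priced per thousand tokens."""
    return calls * tokens_per_call * price_per_1k_tokens / 1000


def net_saving(
    tasks: int,
    current_cost_per_task: float,
    agent_cost_per_task: float,
    platform_cost: float,
    integration_cost: float,
) -> float:
    """Current cost of the work minus the full agent cost over the horizon."""
    current_total = tasks * current_cost_per_task
    agent_total = platform_cost + integration_cost + tasks * agent_cost_per_task
    return current_total - agent_total


# Hypothetical inputs: 2,000 tokens per call at 0.01 per thousand tokens.
lean = token_cost(calls=3, tokens_per_call=2000, price_per_1k_tokens=0.01)
looping = token_cost(calls=40, tokens_per_call=2000, price_per_1k_tokens=0.01)

# Hypothetical twelve-month horizon: 12,000 tasks currently costing 1.00 each,
# with 3,000 in platform fees and 2,000 in integration work.
saving_lean = net_saving(12000, 1.0, lean, 3000, 2000)
saving_looping = net_saving(12000, 1.0, looping, 3000, 2000)
```

Under these assumed numbers the three-call version comes out ahead over the year and the forty-call version comes out behind, even though both complete every task.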

The Scaffolding That Makes Agents Survivable

If autonomy is the model and safety is the scaffolding, then the question of whether an enterprise can run agents at all comes down to the quality of that scaffolding. Four pieces of it have matured fast over the past year: a standard way for agents to reach tools and data, a way to govern what they do, a way to feed them trustworthy context, and a way to prove they work before they ship.

Standardising the connections

The first problem is integration sprawl. Connecting many models to many enterprise systems by hand produces a brittle web of custom connectors, an “N by M” problem that grows with every addition. The industry has converged on the Model Context Protocol to cut through it. Introduced by Anthropic in late 2024 and, in December 2025, donated to the newly formed Agentic AI Foundation under the Linux Foundation, MCP is an open protocol that gives AI applications a universal interface to tools and data, carried over JSON-RPC and structured around hosts, clients, and servers. Placing it under a neutral foundation, alongside founding projects contributed by Block and OpenAI, was a deliberate move to keep the standard from being owned by any single vendor.
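To make "carried over JSON-RPC" concrete, the sketch below builds a single tool-invocation request as a JSON-RPC 2.0 message. The `tools/call` method and its `name` and `arguments` parameters reflect the public MCP specification as best understood here and should be checked against the current version; the tool name `get_order_history` and its argument are hypothetical.

```python
import json
from typing import Any, Dict


def make_tool_call(request_id: int, tool: str, arguments: Dict[str, Any]) -> str:
    """Serialise a JSON-RPC 2.0 request asking a server to run one tool."""
    message = {
        "jsonrpc": "2.0",        # protocol version marker required by JSON-RPC
        "id": request_id,        # lets the client match the response
        "method": "tools/call",  # assumed MCP method name for invoking a tool
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(message)


# Hypothetical tool and argument, for illustration only.
request = make_tool_call(1, "get_order_history", {"order_id": "A-1001"})
```

Because every tool sits behind the same envelope, a client that can send this message to one server can send it to any other, which is what collapses the "N by M" connector problem.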

Adoption has outpaced the protocol’s security maturity, and the gap is now an enterprise risk rather than an academic one. In 2026 the United States National Security Agency published security design guidance for MCP, warning that its rapid uptake has run ahead of appropriate safeguards. The protocol inverts the usual client-server pattern, often expecting servers to act on behalf of connected clients, which opens attack paths that traditional controls do not cover. The NSA flags unverified task propagation, where tasks pass between servers without validation of origin, scope, or intent, leading to scope overreach and leaked context; weak input screening, which lets hidden instructions ride inside otherwise ordinary data; and the treatment of tool descriptions as trusted when they amount to arbitrary code execution. The guidance is blunt that these are environment-wide problems, not endpoint patches, and that least-privilege access, strict payload validation, and real authorisation checks must be enforced at the infrastructure layer because the protocol itself leaves them underspecified.
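What least-privilege access and strict payload validation might look like at the infrastructure layer can be sketched as a check that runs before any tool call is forwarded. This is not taken from the NSA guidance; the permission table, schema table, and client and tool names are all hypothetical.

```python
from typing import Any, Dict

# Hypothetical policy data: which client may call which tool,
# and the exact argument names and types each tool accepts.
PERMISSIONS: Dict[str, set] = {"support-bot": {"get_order_history"}}
SCHEMAS: Dict[str, Dict[str, type]] = {"get_order_history": {"order_id": str}}


def is_call_allowed(client_id: str, tool: str, arguments: Dict[str, Any]) -> bool:
    """Allow a call only if the client holds the permission and the payload fits."""
    if tool not in PERMISSIONS.get(client_id, set()):
        return False  # least privilege: no grant, no call
    schema = SCHEMAS.get(tool)
    if schema is None or set(arguments) != set(schema):
        return False  # reject unknown tools and unexpected or missing fields
    return all(isinstance(arguments[key], expected) for key, expected in schema.items())
```

Rejecting any field the schema does not name is a deliberately strict choice: it narrows the room for hidden instructions to travel inside otherwise ordinary data, though it does not inspect the content of permitted fields.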

Governing the autonomous

Because agents act on production databases, financial ledgers, and customer records, ordinary IT governance is not enough; an organisation needs to see what an agent is doing in real time and to stop it instantly when it goes wrong. A new category of control infrastructure has grown up to meet this. ServiceNow’s AI Control Tower, expanded in 2026, offers a central place to discover, observe, govern, and secure agents across the enterprise, mapping permissions across human, machine, and AI identities through access-graph technology from Veza. In ServiceNow’s own demonstration, the system detected a prompt-injection attack on a pricing agent, traced the malicious instructions hidden inside an order payload, mapped the blast radius across affected systems, and offered a real-time kill switch to shut the compromised agent down before it could do damage.

A parallel mechanism attacks agent-washing and unsafe deployment from the verification side. Workday’s Agent Passport, launched in 2026, tests and continuously monitors agents, whether built in-house or by third parties, against recognised public standards such as the OWASP Top Ten for large language models, the NIST AI Risk Management Framework, and MITRE ATLAS. Independent attestation, with Cisco joining as a launch partner to test agents using its AI Defense product, verifies resistance to prompt injection, goal hijacking, system-prompt extraction, and leakage of employee data before an agent reaches production, and leaves an auditable record of exactly what was tested and by whom. The shift is from trusting a vendor’s adjective to demanding a signed, benchmarked attestation.

Feeding the agent

Agents expose the limits of legacy data architecture with unusual cruelty. Reasoning engines need real-time, contextual access to business data, but most enterprise data sits in rigid pipelines and historical warehouses built for a different purpose, and Gartner attributes a large share of cancelled AI projects directly to data that was not ready. Worse, without a shared semantic understanding of what the data means, an agent will misread a schema and hallucinate a business metric with full confidence.

The response from data platform vendors is to build a governed semantic layer that sits between the agent and the raw tables. Snowflake’s agentic platform, through components it calls Horizon Context and Cortex Sense, embeds business definitions and lineage into the catalogue so that any agent or tool inherits the same consistent meaning. The reported effect is large: Snowflake’s internal testing put accuracy on complex enterprise queries at 83% with its semantic context active, against 47% without it. Giving an agent a governed context layer and a verifiable identity before it ever touches production data is what separates an autonomous actor that reasons from accurate facts from one that reasons confidently from misunderstood ones.

Proving it works

The same flexibility that makes agents useful makes them hard to test. A system that modifies state over many turns and adapts as it goes cannot be checked with a single input-output assertion. Anthropic’s 2026 guidance on agent evaluation argues for rigorous, deterministic testing built on the “task,” a single test with defined inputs and verifiable success criteria. The bar for a good task is social as much as technical: if two domain experts cannot independently agree on whether a result passed, the specification is too vague to test against. Evaluations also need to be balanced, including cases where the agent should act and cases where it should not, so that a web-search agent is measured on restraint as much as on retrieval, rather than being optimised into triggering on everything. A defensible deployment is one that has been measured this way before it ships, not after it fails.
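A balanced task set can be illustrated with a toy web-search decision. The sketch assumes each task records whether a search should be triggered; the `Task` record, the prompts, and both agent stand-ins are invented for this example and are not drawn from Anthropic's guidance.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Task:
    prompt: str
    should_search: bool  # verifiable success criterion for this task


def eager_agent(prompt: str) -> bool:
    """Stand-in agent that triggers a search on everything."""
    return True


def restrained_agent(prompt: str) -> bool:
    """Stand-in agent that searches only for time-sensitive prompts."""
    return "latest" in prompt or "today" in prompt


# Balanced set: two cases where the agent should act, two where it should not.
TASKS: List[Task] = [
    Task("latest exchange rate for EUR/USD", True),
    Task("news about the merger today", True),
    Task("what is 2 + 2", False),
    Task("rephrase this sentence politely", False),
]


def pass_rate(agent: Callable[[str], bool], tasks: List[Task]) -> float:
    """Fraction of tasks where the agent's decision matches the criterion."""
    passed = sum(agent(task.prompt) == task.should_search for task in tasks)
    return passed / len(tasks)
```

On a set containing only the first two tasks, the eager agent would score perfectly. Adding the cases where it should hold back is what exposes the difference.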

A final constraint is physical. In conversational and voice agents, latency governs whether the experience is usable at all, and the engineering trade-offs are stark. Cascade architectures, which chain separate speech-to-text, reasoning, and text-to-speech stages, are modular and easy to guard but accumulate delay, often landing well above the threshold where conversation starts to feel broken. Speech-to-speech architectures fold the stages into a single model and cut latency sharply, but the tight coupling makes it far harder to insert the explicit safety checks that enterprise compliance demands. The choice of architecture sets not only the agent’s responsiveness but the complexity of the scaffolding needed to keep it safe, which is the recurring trade of this entire field in miniature.
