ARCHITECTURE & CONCEPTS
1/20/26

Why Prompt Guardrails Are Not Security

Amjad Fatmi
Every time I talk to an engineering team about agent security, I hear the same four answers.
"We have guardrails." "We have observability." "We have IAM." "We use MCP."
These are not wrong answers. They are answers to a different question than the one they think they're answering.
This post goes through each layer and shows you exactly where it stops relative to the moment a tool actually executes. Not to make you feel bad about your stack. To make the gap visible enough to reason about.
First, what is the execution boundary
The execution boundary is the moment an agent's decision becomes a real-world effect.
Before it: the agent is reasoning, planning, generating text. None of that matters yet.
After it: something happened. A file was written. An API was called. Money moved. You can't take it back.
The question the execution boundary asks is not "is this a plausible thing for an agent to do?" It's not "did this come from a legitimate source?" It's: should this specific action, with these specific parameters, actually run right now?
That is a different question than any existing layer answers. Here's the proof.
Guardrails stop before the action exists
Guardrails, whether rule-based classifiers, embedding filters, or LLM judges, operate on inputs and outputs. They try to stop harmful tool calls from being generated in the first place.
Here's where they sit:
The guardrail sees text. It classifies it. If the score is below threshold, the action proceeds. That's it.
The problem is simple: a probabilistic pre-execution filter cannot give you execution-time authorization guarantees. These are categorically different things.
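To make the category difference concrete, here is a minimal sketch of what a pre-execution guardrail amounts to. The scorer is a hypothetical stand-in for any classifier, rule set, or LLM judge; the shape of the decision is what matters.

```python
def harm_score(text: str) -> float:
    # Stand-in for a learned classifier: it flags patterns it was
    # trained on. A crafted injection avoids them and scores near zero.
    known_bad = ("delete all", "ignore previous instructions", "exfiltrate")
    return 1.0 if any(k in text.lower() for k in known_bad) else 0.05

def guardrail_passes(action_text: str, threshold: float = 0.8) -> bool:
    # A probabilistic pre-execution filter. "Pass" means only
    # "didn't match patterns I know" -- not safe, correct, or authorized.
    return harm_score(action_text) < threshold
```

Note what the return value is: a bet about text, not a decision about an action.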
Guardrails were designed for the average case. Known harmful patterns. Obvious jailbreaks. They were not designed for adversarial inputs crafted to look normal, which is exactly what prompt injection attacks look like.
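As a concrete illustration, consider the kind of tool call an agent might propose after processing a poisoned document. Every name, address, and path here is hypothetical:

```python
# Hypothetical tool call proposed by an agent after reading a poisoned
# document. Every field looks routine to a text-level classifier.
proposed_action = {
    "tool": "send_email",
    "args": {
        "to": "finance-team@company.internal",   # internal address: passes
        "subject": "Q3 reconciliation report",   # professional: passes
        "body": "Attached is the reconciliation report you requested.",
        # Injected by the external document; not a reconciliation report.
        "attachment": "/data/exports/customer_db_full.csv",
    },
}
```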
Evaluating a proposed send_email call like that, a guardrail sees an internal email address, a professional subject, a reasonable body. It passes. What the guardrail doesn't know is that the attachment path was injected by an external document the agent processed. The file being attached is not a reconciliation report.
The guardrail gave it a clean score. The action executed. The damage is done.
The formal version: guardrail passing (G(action) = pass) means only that the action didn't match patterns the guardrail was trained to detect. It does not mean the action was safe, correct, or authorized. These are not the same claim.
Observability records what happened, not whether it should have
Observability tools are forensic instruments. They record the past. Datadog, Langfuse, Arize, Helicone — all of them operate after execution.
This is useful. Forensics matter. The problem is the assumption that observability equals governance.
What observability tells you:
This action ran at 14:32 UTC
These were the parameters
It returned this response
What observability cannot tell you:
Should this action have run given the policy in effect at 14:32?
Would it have been permitted under the policy we updated yesterday?
If we replay this with corrected state, does the decision change?
The last three questions are the ones that matter for compliance, incident investigation, and policy iteration. They require a record not of what happened but of why it was authorized — what policy applied, what state was evaluated, and whether anything actually had the ability to say no.
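The distinction can be made concrete by comparing the two kinds of record. The trace entry below is what observability captures; the authorization record is what those three questions require. All field names are hypothetical:

```python
# What observability captures: a record of what happened.
trace_entry = {
    "action": "issue_refund",
    "params": {"order_id": "A-1043", "amount": 1200},
    "timestamp": "2026-01-20T14:32:00Z",
    "response": {"status": "ok"},
}

# What answering "should it have run?" requires: a record of why
# it was authorized, replayable against a policy version and state.
authorization_record = {
    "action": "issue_refund",
    "params": {"order_id": "A-1043", "amount": 1200},
    "policy_version": "refunds-v7",                  # which policy applied
    "state_evaluated": {"daily_refund_total": 9400}, # what state was read
    "decision": "allow",                             # something could say no
}
```

The first record cannot be replayed against yesterday's policy. The second can.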
Your logs tell you the action ran. They cannot tell you whether the action was ever asked permission.
IAM governs identity, not action instances
Access control is the backbone of production security. It answers whether a given principal is allowed to access a given resource or call a given endpoint.
Necessary. Not sufficient.
IAM evaluates a policy of the form: principal P may perform action class A on resource R. That decision was made once, at role assignment or service account creation, and it applies uniformly to every request from that principal forever.
IAM never asked: should this specific refund of $1,200 for this specific order run right now, under the current state of the system, given who initiated the agent session and what it's been doing?
IAM cannot ask that question. It was not designed to. The permission class was granted when the role was created. The specific action instance at runtime is not evaluated.
If an agent is compromised and starts issuing fraudulent refunds, IAM will happily permit every single one, because the service account has the right role. IAM is working correctly. You just lost a lot of money.
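That failure mode can be sketched in a few lines. Role and permission names are hypothetical; the point is that the evaluated question never mentions the action instance:

```python
# IAM-style check (hypothetical roles): the decision is about the
# principal's role, was made once at role assignment, and applies
# uniformly to every request from that principal.
ROLE_GRANTS = {"refund-service": {"payments:IssueRefund"}}

def iam_allows(principal_role: str, permission: str) -> bool:
    return permission in ROLE_GRANTS.get(principal_role, set())

# A compromised agent issuing 500 fraudulent refunds: IAM permits
# every one, because every request carries the same role.
for _ in range(500):
    assert iam_allows("refund-service", "payments:IssueRefund")
# Never evaluated: this $1,200 refund, for this order, right now.
```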
Orchestration frameworks route actions, they don't authorize them
LangChain, CrewAI, AutoGen, the OpenAI Agents SDK — these frameworks are designed to make tool execution frictionless. The model proposes a tool call, the framework dispatches it, the result comes back. That's the value proposition.
Here's what the dispatch path looks like in LangChain, stripped to essentials:
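The following is a hedged reconstruction, not LangChain's actual source; the real code spans several layers of abstraction, but this is the essential shape of the path:

```python
from collections import namedtuple

# Simplified sketch of an orchestration dispatch path (illustrative;
# not the framework's real code). The model proposes a tool call,
# the framework dispatches it if the tool is registered. Nothing else.
Action = namedtuple("Action", ["tool", "args"])

class Tool:
    def __init__(self, fn):
        self.fn = fn
    def run(self, args):
        return self.fn(**args)

def dispatch(action: Action, tools: dict):
    # The only check is registration, not authorization.
    if action.tool not in tools:
        raise KeyError(f"unknown tool: {action.tool}")
    # If the model proposed it and the tool exists, it executes.
    return tools[action.tool].run(action.args)
```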
tools[action.tool].run() is where execution happens. There is no authorization step in this path. The framework assumes: if the model proposed it and the tool is registered, it runs.
This is the correct design for a framework. Frameworks should be composable. But it means the framework cannot be your execution boundary. You'd have to add that yourself, correctly, for every tool, in every agent, forever. Almost no team does.
The orchestration framework satisfies the invariant: if you registered a tool and the model calls it, it executes. That is not an authorization guarantee. It is its absence.
MCP terminates at message acceptance, not execution
MCP and A2A have brought real progress to agent communication: standardized schemas, authenticated messages, interoperability. This matters.
But there is a widespread misreading of what protocol-level security gives you.
When an MCP server validates a tool call, it checks: is this message well-formed? Is it from an authenticated sender? Is this tool registered?
That work ends at the server boundary. The MCP server accepts the message. What happens next is not part of the protocol.
A valid, authenticated, well-formed MCP message for a dangerous action is still a dangerous action.
Protocol validation (valid_message(m) = true) and execution authorization (should_execute(action, policy, state) = true) are not the same predicate. One evaluates the message's form and provenance. The other evaluates the action's semantics against current policy and runtime state. MCP does the first. Nobody in this stack does the second.
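The two predicates can be written side by side. The first is a plausible sketch of protocol-level validation; the second exists nowhere in the stack:

```python
# Two different predicates (illustrative). Protocol validation checks
# the message's form and provenance, not the action's semantics.
def valid_message(msg: dict, registered_tools: set, trusted_senders: set) -> bool:
    return (
        msg.get("tool") in registered_tools     # tool is registered
        and msg.get("sender") in trusted_senders  # sender authenticated
        and isinstance(msg.get("args"), dict)   # message is well-formed
    )

def should_execute(action: dict, policy, state) -> bool:
    # The predicate the execution boundary needs. Nobody in the
    # protocol stack evaluates it.
    raise NotImplementedError("not part of any protocol in this stack")
```

A message carrying a dangerous action satisfies the first predicate just as easily as a benign one.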
What the full picture looks like
Every layer in this stack is doing something real. The gap is not a failure of any individual tool. It is a structural absence in how the stack was assembled.
The gap has a specific shape. Closing it requires evaluating a specific action instance against current policy and current runtime state, at the moment of execution, with a default of denial when no matching policy exists.
That is not what any of these layers do. It cannot be derived from combining them. You can have every one of them and still have nothing that can say no at the execution moment.
Why this matters more now
A year ago this gap was mostly theoretical. Agents were doing low-stakes work. When they made mistakes, the consequences were annoying and recoverable.
That is not the current situation.
The agents in production now are issuing payments, modifying production databases, provisioning cloud resources, sending external communications on behalf of organizations. For these deployments, the question "should this action actually run?" is not academic. It is the central operational question.
The answer right now, for most teams: the framework decides. Silently. Without a policy. Without a record. With a default of execution.
What actually covers it
We won't pitch a product here. But the solution space has a specific shape, so you can clearly evaluate anything that claims to cover it.
Covering the execution boundary requires:
Interception of every effectful action before it executes. Not the suspicious-looking ones. Every one. Partial coverage is not coverage.
Normalization across frameworks. The same action expressed through LangChain, AutoGen, a raw API call, or an MCP message needs to be evaluated by the same policy engine. If you have framework-specific guardrails, attackers just use a different framework.
Evaluation against current policy and state at execution time. Not a static access control list. A policy that reflects current organizational intent applied to the specific action being proposed, at the moment it wants to run.
Fail-closed by default. If no policy matches, the action does not execute. This is the inversion almost nothing in the current stack provides. Everything above fails open. A real execution boundary fails closed.
A record of the authorization decision. Not just that the action ran. What policy version was applied. What state was evaluated. What the decision was. Enough to replay the decision later.
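Put together, the five requirements above can be sketched as a single authorizer. This is an illustrative shape, not a product; all names are hypothetical:

```python
from dataclasses import dataclass
from typing import Callable

# Minimal fail-closed authorizer (illustrative sketch). It evaluates
# the specific action instance against current policy and state,
# denies by default, and records the decision for replay.
@dataclass
class Policy:
    version: str
    matches: Callable[[dict, dict], bool]
    decide: Callable[[dict, dict], str]   # returns "allow" or "deny"

def authorize(action: dict, policies: list, state: dict, log: list) -> bool:
    for p in policies:
        if p.matches(action, state):
            decision = p.decide(action, state)
            log.append({"action": action, "policy_version": p.version,
                        "state": dict(state), "decision": decision})
            return decision == "allow"
    # Fail closed: no matching policy means the action does not run.
    log.append({"action": action, "policy_version": None,
                "state": dict(state), "decision": "deny"})
    return False

# Example policy: cap single refunds at $500.
refund_cap = Policy(
    version="refunds-v7",
    matches=lambda a, s: a["tool"] == "issue_refund",
    decide=lambda a, s: "allow" if a["args"]["amount"] <= 500 else "deny",
)
```

Note the inversion: an action with no matching policy is denied and logged, where everything above in the stack would have let it through.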
The current stack covers inputs, outputs, identity, routing, and post-execution forensics. The execution moment, the only moment where you can actually stop something, is empty.
That is the gap. It is not small. And it is not covered by anything you already have.
