[SECURITY RESEARCH]
[1/30/26]
The Lethal Trifecta of AI Agents & How Faramesh Closes the Three Attack Fronts
[Author]:
Amjad Fatmi
In June 2025, Simon Willison named the problem that had been breaking production AI systems for two years. He called it the lethal trifecta: an AI agent with access to private data, exposure to untrusted content, and the ability to communicate externally. When those three conditions combine, an attacker does not need a code vulnerability. They need a malicious instruction, a model that follows it, and a tool that executes it.
Since Willison named the concept, documented examples have appeared across ChatGPT, Google Bard, Amazon Q, Google NotebookLM, GitHub Copilot Chat, Microsoft Copilot, Slack, Mistral Le Chat, Grok, Claude's iOS app, and ChatGPT Operator. The list keeps growing.
The security industry's response has been to build better filters. Smarter input scanners, more sophisticated output classifiers, faster PII detectors, more accurate injection detectors. All of them operate at the same layer: the language layer, between the user and the model. None of them operate at the layer where the consequence actually occurs: the moment a tool executes.
Faramesh operates at that layer. This is not a better filter. It is a different layer entirely. And it is the only layer that closes all three fronts of the trifecta regardless of what the model was told, how it was manipulated, or what it decided to do.
This post explains exactly what that means, where every other solution stops, and why the execution layer is the one that matters.
The three fronts, precisely stated
Before explaining the defense, the attack surface needs to be stated precisely. Most writing on this topic conflates the three fronts. They are distinct problems with distinct mechanisms.
Front 1: Unintended side effects from legitimate user prompts. The user asks the agent to do something reasonable. The agent interprets it and takes an action that is technically responsive to the request but larger in scope, more destructive, or more consequential than the user intended. No malicious actor is involved. The user's own prompt triggers the damage.
Front 2: Prompt injection reaching execution. A malicious actor embeds instructions in content the agent processes: an email, a document, a web page, a database result. The model reads those instructions as if they were legitimate and generates a tool call that carries out the attacker's intent.
Front 3: Consequential actions from hallucinated reasoning or PII exposure. The model generates a tool call based on incorrect information: a hallucinated API endpoint, a confabulated parameter, a misremembered file path. Or the model's output embeds PII that then gets sent somewhere it should not go. The problem is not a malicious actor. It is the model being wrong in a way that has real consequences.
These three fronts have different causes. They share one property: they all terminate in a tool call. The agent proposes an action. A tool executes it. The consequence is real.
Where every existing solution operates
The security tooling market for AI agents is large and growing. To understand what Faramesh does that nobody else does, you need to understand exactly where each existing solution sits in the stack.

Every existing solution operates between layers 1 and 4. They try to catch bad inputs before the model sees them, or catch bad outputs before they become tool calls. They are doing real work. They are necessary. And they all share the same limitation: they can be bypassed.
Input scanning can be bypassed by encoding, obfuscation, or novel attack vectors the scanner has not seen. Output scanning can be bypassed by injection that successfully manipulates the model into producing output that looks clean to a classifier but carries malicious intent in the structured parameters of a tool call. Language-layer defenses are probabilistic. They are classifiers making likelihood estimates. Sufficiently sophisticated attacks push inputs into the tail of the distribution where the classifier is wrong.
Layer 6, between the tool call being proposed and tool.execute(), is empty in every commercial solution on the market. Faramesh fills it.

The crucial property: Faramesh at layer 6 does not care what happened at layers 1 through 4. It does not care whether input scanning caught the injection or missed it. It does not care whether the model was manipulated or not. It does not care whether the output scanner classified the output as safe or unsafe. It evaluates the proposed tool call itself, the specific action with its specific parameters, against a deterministic policy. The model's state is irrelevant. The attack that produced the tool call is irrelevant. The tool call either matches a permitted rule or it does not.
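That evaluation can be sketched as a pure function over the proposed call. The rule and call shapes below are illustrative assumptions, not Faramesh's actual policy format; the point is what the function does not take as input.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ToolCall:
    tool: str                       # e.g. "http_post", "bash", "send_email"
    params: dict = field(default_factory=dict)  # structured parameters

def evaluate(call: ToolCall, rules: list) -> str:
    """Return 'allow', 'approve', or 'deny' for this specific call.

    Note what is absent from the signature: the prompt, the model's
    reasoning, and any classifier score. Only the action is inspected.
    """
    for matches, decision in rules:
        if matches(call):
            return decision
    return "deny"  # no matching rule: fail closed

# Hypothetical rules: allow GETs, require approval for outbound email.
rules = [
    (lambda c: c.tool == "http_get", "allow"),
    (lambda c: c.tool == "send_email", "approve"),
]

assert evaluate(ToolCall("http_get", {"url": "https://example.com"}), rules) == "allow"
assert evaluate(ToolCall("send_email", {"to": "x@y.com"}), rules) == "approve"
assert evaluate(ToolCall("bash", {"cmd": "rm -rf /"}), rules) == "deny"
```

The same call always produces the same decision, which is what makes the layer deterministic rather than probabilistic.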
How Faramesh closes each front
Front 1: Unintended side effects
This is Faramesh's most direct application. The user's prompt produces a legitimate tool call that has consequences beyond the user's intent. No injection. No attack. Just the inevitable gap between what a user says and what a probabilistic model executes.
The execution boundary closes this gap by requiring that the proposed action match a permitted rule before it executes. If the user's prompt produces a tool call that would delete production configuration files, policy evaluates that specific action.
The user's "clean up my project" prompt produces rm -rf /project/prod-configs/. The action hits the boundary. The rule matches. The action requires approval. The human sees the exact command, sees the exact path, and decides. The user's intent was "clean up." The specific action proposed was something more consequential. The boundary surfaces that gap before anything executes.
The underlying mechanism is per-instance evaluation. The policy does not ask "is bash allowed?" It asks "is this specific bash command, with these specific parameters, on this specific path, permitted right now?" That distinction is the entire difference between class-level access control and execution-time authorization.
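The difference between the two questions can be made concrete. A minimal sketch, assuming a path-glob rule format (the pattern names and decision strings are illustrative, not Faramesh's real schema):

```python
import fnmatch
import shlex

def class_level(tool: str, allowed_tools: set) -> bool:
    # "Is bash allowed?" -- the coarse, class-level question.
    return tool in allowed_tools

def per_instance(command: str, protected_globs: list) -> str:
    # "Is THIS bash command, touching THESE paths, permitted right now?"
    for token in shlex.split(command):
        for pattern in protected_globs:
            if fnmatch.fnmatch(token, pattern):
                return "approve"  # touches a protected path: escalate to a human
    return "allow"

protected = ["/project/prod-configs*", "/etc/*"]

# Class-level access control waves the whole command through...
assert class_level("bash", {"bash", "http_get"}) is True
# ...per-instance evaluation surfaces the consequential path.
assert per_instance("rm -rf /project/prod-configs/", protected) == "approve"
assert per_instance("ls /project/src", protected) == "allow"
```

The same tool, invoked twice, gets two different decisions because the parameters differ. That is execution-time authorization.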
Front 2: Prompt injection reaching execution
This front is where the industry debate is most intense. The conventional view is that prompt injection must be solved at the input or model layer, detect the injection before it reaches the model, or make the model more resistant to following injected instructions.
Faramesh takes a different position. Not instead of input-layer defenses. In addition to them. The position is: even if the injection gets through every input scanner, even if the model successfully executes the attacker's instructions and generates a malicious tool call, that tool call must still pass through the execution boundary before anything happens.
Simon Willison himself suggested exactly this direction: "That suggests a very deterministic mitigation: taint tracking and policy gating. If the current state is tainted, block or require explicit human approval for any action with exfiltration potential: outbound HTTP, email/chat sends, PR creation." That is precisely what Faramesh implements, without requiring taint tracking, because the execution boundary evaluates every action regardless of how the agent arrived at proposing it.
The property that makes this work is determinism. The policy evaluates the proposed action based on what it is, not how the model was convinced to propose it. An http_post to an external domain not on an allowlist is policy-blocked regardless of whether the model proposed it from legitimate reasoning or from a successful injection. The path to the tool call is irrelevant. The tool call itself is evaluated.
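A sketch of that allowlist check, with a hypothetical domain list and decision strings (not Faramesh's actual configuration), shows why provenance never enters the decision:

```python
from urllib.parse import urlparse

ALLOWED_DOMAINS = {"api.internal.example.com"}  # illustrative allowlist

def check_http_post(url: str) -> str:
    """Decide on an http_post based solely on where it is going."""
    host = urlparse(url).hostname or ""
    return "allow" if host in ALLOWED_DOMAINS else "approve"

# The decision depends only on the call itself. An injected exfiltration
# attempt and a legitimately reasoned request to the same unknown domain
# receive the same verdict, because the path to the call is not an input.
assert check_http_post("https://api.internal.example.com/v1/jobs") == "allow"
assert check_http_post("https://attacker.example.net/collect") == "approve"
```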
This is the framing that changes how teams think about injection defense. Input scanning is important and worth doing. But it is probabilistic. It has false negatives. The execution boundary is the guarantee that even a successful injection produces no consequence, because the consequence requires a tool call, and the tool call requires authorization.
Front 3: Consequential actions from hallucination and PII leakage
This front requires the most precise language because Faramesh's protection here is real but specific.
Faramesh does not scan content for PII. It does not detect whether a model output contains personally identifiable information. It does not evaluate whether the model's reasoning is factually accurate or hallucinatory. Those are language-layer concerns.
What Faramesh does: it blocks the consequential actions that hallucination and PII exposure produce.
A hallucinated API endpoint, called with customer payment data, is an http_post to an unrecognized domain. Policy can require approval for any HTTP POST to a domain not on an explicit allowlist. The hallucination produced the call. The policy blocks the call.
A model that includes PII in an email body is proposing send_email(to=..., body=...containing PII...). Policy can require approval for all outbound emails. The PII exposure produced the call. The policy blocks the call until a human reviews it.
The protection is at the action layer, not the content layer. Faramesh does not catch the PII in the body. It catches the action of sending an email, which is the only moment where the PII leaves the system. It does not detect the hallucinated endpoint. It catches the action of posting to that endpoint, which is the only moment where the data is exposed.
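Both examples reduce to the same mechanism: an approval gate on exfiltration-capable actions. A minimal sketch, with assumed tool names and an injected human-prompt callback (none of this is Faramesh's real interface):

```python
def gate(call: dict, requires_approval: set, prompt_human) -> str:
    """Hold an exfiltration-capable action until a human approves it."""
    if call["tool"] in requires_approval:
        # The reviewer sees the exact action and parameters: the full
        # recipient list, body, or target URL that would leave the system.
        return "executed" if prompt_human(call) else "blocked"
    return "executed"

exfil_tools = {"send_email", "http_post"}
deny_all = lambda call: False  # simulate a reviewer who rejects the action

assert gate({"tool": "send_email", "params": {"to": "a@b.c"}},
            exfil_tools, deny_all) == "blocked"
assert gate({"tool": "read_file", "params": {"path": "notes.txt"}},
            exfil_tools, deny_all) == "executed"
```

The gate never inspects the email body for PII; it holds the send itself, which is the only moment the PII could leave.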
The existing literature on the lethal trifecta correctly identifies that the exfiltration vector is the easiest leg of the trifecta to cut: "exfiltration is the easiest vector to control... if it needs to communicate externally, such as send an email, dispatch a notification, or render a view, then it should do so by hitting an API request via an MCP tool call, not by arbitrarily exfiltrating data itself." Faramesh makes that control deterministic and non-bypassable, not a best practice that teams may or may not implement.
The comparison matrix
Here is the honest comparison of where each layer and tool in the stack actually operates.
The comparison is not "Faramesh is better than these tools." It is "Faramesh operates at a different layer than all of these tools." Language-layer tools and Faramesh are complementary. The language layer is the first several lines of defense. Faramesh is the last one.
What makes the last line of defense the critical one: everything that bypasses the earlier lines terminates in a tool call. If no tool executes, no consequence occurs. The execution boundary is the guarantee that can be made regardless of what happened earlier in the stack.
What "non-bypassable" actually means
The term is easy to misread as marketing. It has a precise technical meaning.
Faramesh's enforcement point is at tool.execute(), the function call that executes any tool. The hook that evaluates Faramesh's policy runs at this point, before execution, on every call. There is no code path in which a tool executes without passing through this hook.
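One way to realize a single enforcement point is to make the policy hook the only code path to the tool implementations. A minimal sketch, assuming a registry-style executor (the class and method names are illustrative, not Faramesh's actual API):

```python
class ExecutionBoundary:
    """Every tool runs through execute(); there is no other entry point."""

    def __init__(self, policy):
        self._policy = policy  # callable: (tool_name, params) -> decision
        self._tools = {}       # tool name -> implementation

    def register(self, name, fn):
        self._tools[name] = fn

    def execute(self, name, **params):
        # The hook runs before execution, on every call, unconditionally.
        decision = self._policy(name, params)
        if decision != "allow":
            raise PermissionError(f"{name} {decision} by policy")
        return self._tools[name](**params)

boundary = ExecutionBoundary(lambda tool, p: "allow" if tool == "echo" else "deny")
boundary.register("echo", lambda text: text)
boundary.register("rm", lambda path: f"deleted {path}")

assert boundary.execute("echo", text="hi") == "hi"
try:
    boundary.execute("rm", path="/tmp/x")
    assert False, "should have been denied"
except PermissionError:
    pass  # the denied tool never ran
```

Because the implementations live behind the registry, bypassing the hook requires adding a new code path, which is exactly the kind of defeat the next paragraph enumerates.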
This is different from saying Faramesh cannot be defeated. It can be defeated by:
Uninstalling the plugin
Writing a new code path that calls tools directly without the hook
Configuring a permissive policy that allows everything
Faramesh server being down with fail-open configured
What it cannot be defeated by is anything that happens at the language layer. A model that is perfectly manipulated, producing a perfectly crafted malicious tool call, still hits the boundary. The boundary does not evaluate how the model was convinced to propose the action. It evaluates the action itself.
This is the property that closes all three fronts. Front 1 produces a tool call. Front 2 produces a tool call. Front 3 produces a tool call. The boundary evaluates tool calls. Every consequential agent action is a tool call. The boundary covers every consequential agent action.
What Faramesh does not claim
Precision matters here. The claims above are true. These claims are not:
Faramesh does not prevent prompt injection at the model layer. An attacker can still manipulate an agent's reasoning. Faramesh prevents that manipulation from producing consequences. Those are different properties.
Faramesh does not scan content for PII. If an agent writes PII to a local file that policy permits writing to, that PII is in the file. Faramesh stopped the email that would have sent it out. It did not stop the local write.
Faramesh does not detect hallucinations. If an agent produces a hallucinated analysis that the user reads and acts on, that hallucination had consequences Faramesh did not prevent. Faramesh prevented the tool calls that would have had automated consequences, the API calls, the file writes, the emails.
Faramesh is not a replacement for language-layer defenses. Input scanning catches injections before the model processes them. That is faster and simpler than catching them at execution. Both layers matter. Faramesh is the guarantee that the layers above it have a backstop.
The precise claim: Faramesh is the only layer that provides deterministic, non-bypassable, per-instance evaluation of every consequential agent action before it executes. No language-layer tool provides this. No observability tool provides this. No IAM system provides this. The execution boundary is a distinct layer that does not exist anywhere else in the stack.
The defense-in-depth picture
The correct architecture is all layers operating together. Each layer catches what it can catch, fails gracefully when it cannot, and passes to the next.
The execution boundary is the only layer that fails closed. Every layer above it fails open when it cannot make a determination. Consider a novel injection that escapes input scanning, evades model resistance, and produces output that looks clean to the output scanner: that attack still terminates at the execution boundary. The tool call either matches a permitted rule or it does not execute.
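The fail-open versus fail-closed distinction is small in code but decisive in outcome. A sketch with a hypothetical classifier score and rule list (illustrative shapes only):

```python
def fail_open(score, threshold=0.9):
    # Language-layer classifier: uncertainty or a miss lets input through.
    return "block" if score is not None and score > threshold else "pass"

def fail_closed(call: dict, rules: list) -> str:
    # Execution boundary: no matching rule means no execution.
    for matches, decision in rules:
        if matches(call):
            return decision
    return "deny"

# A novel attack the classifier has never seen scores low (or not at all)...
assert fail_open(None) == "pass"
assert fail_open(0.2) == "pass"
# ...but still cannot execute unless a rule explicitly permits the action.
assert fail_closed({"tool": "http_post"}, []) == "deny"
```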
For teams building agents that touch real systems, this is the architecture that makes production deployment defensible. Not because language-layer tools are insufficient, but because the stack needs a layer that fails closed, operates deterministically, and covers every attack that terminates in a tool call, which is every attack that matters.
Faramesh sits at the execution boundary. Every tool call, every time, before anything executes.
