Guardrails: PII, Injection, and Content Safety

Guardrails are automatic safety layers that wrap every agent request — you don’t need to configure or enable them. They run as middleware in the request pipeline, operating before the LLM sees any user content and after the model generates its response.

How guardrails work

Agent Manager’s guardrails are implemented as ordered advisors in the execution pipeline. Input guardrails run first, before the prompt reaches the LLM. Output guardrails run after the model responds, before the content is returned to the caller.

User input
    │
    ▼
[PII Anonymization]     ← input guardrail
[Prompt Injection]      ← input guardrail
    │
    ▼
    LLM
    │
    ▼
[Content Safety]        ← output guardrail
[Hallucination Check]   ← output guardrail (RAG only)
    │
    ▼
Response returned

Input guardrails

PII anonymization

Every user message is scanned for personally identifiable information — including email addresses and phone numbers — before it is sent to the LLM. Detected values are replaced with anonymized placeholders so the agent can still process the message meaningfully without exposing raw user data to the model provider.What is detected: email addresses, phone numbers, and other common PII patterns.How it works: redacted values are substituted transparently. The agent’s response is coherent even though the LLM never saw the original values.

PII redaction runs before any other processing. Raw user data is never passed to downstream components, tools, or LLM providers.

Prompt injection detection

User input is scanned for jailbreak patterns — attempts to override system instructions, escape role constraints, or manipulate the agent into unsafe behavior. Requests that trigger injection detection are rejected before the LLM is called.What is detected: common jailbreak phrasing, instruction override attempts, and role-escape patterns.On detection: the run is rejected. The guardrail event is recorded in the audit log.

Output guardrails

Content safety

After the LLM generates a response, it passes through a content moderation check. Responses containing harmful, violent, or inappropriate content are blocked before being returned to the caller.Streaming note: for streaming responses, content safety applies a blocking collect-and-check pattern. This means the full streamed response is validated before any content reaches the client — the streaming user experience is preserved but safety is guaranteed.

Hallucination detection (RAG responses)

When an agent uses retrieval-augmented generation, its response is verified against the knowledge sources that were retrieved. If the agent’s answer is not grounded in the cited documents, the hallucination check flags or blocks the response.This guardrail only activates for RAG-backed responses. Standard conversational responses are not subject to this check.

Agent tiers

Guardrail strictness scales with the agent tier configured for your deployment:

Tier	PII on input	PII on streaming output	Content safety
`TIER_1_STANDARD`	Redacted	Pass-through	On
`TIER_2_STRICT`	Redacted	Redacted (sliding-window)	On

TIER_2_STRICT activates sliding-window output redaction for streaming responses, ensuring PII is caught even when it appears in generated content rather than user input. This is a stricter safety guarantee at the cost of some streaming throughput.

Secure code sandbox

When an agent executes Python code, it runs inside an ephemeral Docker container. Each execution gets a fresh, isolated environment:

No host access

The container has no access to the host filesystem. Files created during execution do not persist after the container exits.

No network access

Network access is disabled by default. The container cannot make outbound connections unless your administrator explicitly configures allowlist exceptions.

Ephemeral

Every code execution starts with a clean container. There is no state carried between separate executions.

Audited

Each sandbox execution is logged as a tool call in the run’s audit trail.

What gets logged

Every guardrail intervention is recorded in the audit log with:

The type of guardrail triggered (PII, injection, content safety, hallucination)
The run ID and timestamp
The action taken (redacted, blocked, or flagged)

If users are reporting unexpected redactions or blocked responses, navigate to the audit logs and filter by guardrail events for the affected run ID. The log entry will show exactly which guardrail triggered and why.

Get Started

Core Features

Knowledge & Memory

Security & Compliance

Platform

Guardrails: PII, Injection, and Content Safety

How guardrails work

Input guardrails

Output guardrails

Agent tiers

Secure code sandbox

No host access

No network access

Ephemeral

Audited

What gets logged

Get Started

Core Features

Knowledge & Memory

Security & Compliance

Platform

Documentation Index

​How guardrails work

​Input guardrails

​Output guardrails

​Agent tiers

​Secure code sandbox

No host access

No network access

Ephemeral

Audited

​What gets logged

How guardrails work

Input guardrails

Output guardrails

Agent tiers

Secure code sandbox

What gets logged