
# Guardrails: PII, Injection, and Content Safety

> Agent Manager's built-in safety layers: PII redaction, prompt injection blocking, output moderation, and sandboxed code execution, applied automatically to every request.

Guardrails are automatic safety layers that wrap every agent request. You don't need to configure or enable them. They run as middleware in the request pipeline: input guardrails act before the LLM sees any user content, and output guardrails act after the model generates its response.

## How guardrails work

Agent Manager's guardrails are implemented as ordered advisors in the execution pipeline. Input guardrails run first, before the prompt reaches the LLM. Output guardrails run after the model responds, before the content is returned to the caller.

```
User input
    │
    ▼
[PII Anonymization]     ← input guardrail
[Prompt Injection]      ← input guardrail
    │
    ▼
    LLM
    │
    ▼
[Content Safety]        ← output guardrail
[Hallucination Check]   ← output guardrail (RAG only)
    │
    ▼
Response returned
```
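
The ordering in the diagram can be sketched as a small pipeline. This is a minimal illustration, not Agent Manager's actual advisor interface: `Pipeline`, `Guardrail`, `GuardrailBlocked`, and `demo_trace` are hypothetical names introduced here only to show that input guardrails run in order before the model call and output guardrails run in order after it.

```python
from dataclasses import dataclass
from typing import Callable, List


class GuardrailBlocked(Exception):
    """Hypothetical error a guardrail raises to stop the request."""


# Assumption: a guardrail takes text and returns (possibly rewritten) text,
# or raises GuardrailBlocked to reject the request.
Guardrail = Callable[[str], str]


@dataclass
class Pipeline:
    input_guardrails: List[Guardrail]
    output_guardrails: List[Guardrail]
    llm: Callable[[str], str]

    def run(self, user_input: str) -> str:
        text = user_input
        for guard in self.input_guardrails:    # before the LLM sees anything
            text = guard(text)
        response = self.llm(text)
        for guard in self.output_guardrails:   # before the caller sees anything
            response = guard(response)
        return response


def demo_trace() -> List[str]:
    """Run a request through no-op guardrails and record the call order."""
    trace: List[str] = []

    def tag(name: str) -> Guardrail:
        def guard(text: str) -> str:
            trace.append(name)
            return text
        return guard

    def fake_llm(prompt: str) -> str:
        trace.append("llm")
        return "ok"

    pipeline = Pipeline(
        input_guardrails=[tag("pii"), tag("injection")],
        output_guardrails=[tag("content_safety"), tag("hallucination")],
        llm=fake_llm,
    )
    pipeline.run("hello")
    return trace
```

Calling `demo_trace()` records the same order as the diagram: both input guardrails, then the model, then both output guardrails.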

## Input guardrails

<AccordionGroup>
  <Accordion title="PII anonymization">
    Every user message is scanned for personally identifiable information, including email addresses and phone numbers, before it is sent to the LLM. Detected values are replaced with anonymized placeholders so the agent can still process the message meaningfully without exposing raw user data to the model provider.

    **What is detected:** email addresses, phone numbers, and other common PII patterns.

    **How it works:** redacted values are substituted transparently. The agent's response is coherent even though the LLM never saw the original values.

    <Note>
      PII redaction runs before any other processing. Raw user data is never passed to downstream components, tools, or LLM providers.
    </Note>
  </Accordion>

  <Accordion title="Prompt injection detection">
    User input is scanned for jailbreak patterns: attempts to override system instructions, escape role constraints, or manipulate the agent into unsafe behavior. Requests that trigger injection detection are rejected before the LLM is called.

    **What is detected:** common jailbreak phrasing, instruction override attempts, and role-escape patterns.

    **On detection:** the run is rejected. The guardrail event is recorded in the audit log.
  </Accordion>
</AccordionGroup>
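
To make the two input guardrails concrete, here is a rough sketch of placeholder substitution and pattern-based injection screening. Everything in it is an assumption for illustration: the regular expressions, the `[EMAIL_1]` placeholder format, and the function names `redact_pii` and `looks_like_injection` are hypothetical, and Agent Manager's actual detectors are not documented here and may work quite differently.

```python
import re
from typing import Dict, Tuple

# Assumption: simplified patterns for illustration only. Real PII detection
# covers many more formats than these two expressions.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

# Assumption: a tiny, hypothetical list of jailbreak phrasings.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior|above) instructions", re.I),
    re.compile(r"you are no longer", re.I),
    re.compile(r"pretend (to be|you are)", re.I),
]


def redact_pii(text: str) -> Tuple[str, Dict[str, str]]:
    """Replace detected values with numbered placeholders.

    Returns the redacted text and a placeholder-to-original mapping.
    """
    mapping: Dict[str, str] = {}

    def make_sub(kind: str):
        def sub(match: "re.Match[str]") -> str:
            placeholder = f"[{kind}_{len(mapping) + 1}]"
            mapping[placeholder] = match.group(0)
            return placeholder
        return sub

    text = EMAIL_RE.sub(make_sub("EMAIL"), text)
    text = PHONE_RE.sub(make_sub("PHONE"), text)
    return text, mapping


def looks_like_injection(text: str) -> bool:
    """Return True if any known override or role-escape phrasing appears."""
    return any(pattern.search(text) for pattern in INJECTION_PATTERNS)
```

In this sketch the model would receive only the redacted text. The mapping is kept purely to show what was replaced; whether Agent Manager retains or restores the original values is not stated above.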

## Output guardrails

<AccordionGroup>
  <Accordion title="Content safety">
    After the LLM generates a response, it passes through a content moderation check. Responses containing harmful, violent, or inappropriate content are blocked before being returned to the caller.

    **Streaming note:** for streaming responses, content safety applies a blocking collect-and-check pattern: the full response is collected and validated before any content reaches the client. The response is still delivered over the streaming interface, but the first chunk arrives only after the check completes, so unsafe content is never partially delivered.
  </Accordion>

  <Accordion title="Hallucination detection (RAG responses)">
    When an agent uses retrieval-augmented generation, its response is verified against the knowledge sources that were retrieved. If the agent's answer is not grounded in the cited documents, the hallucination check flags or blocks the response.

    This guardrail only activates for RAG-backed responses. Standard conversational responses are not subject to this check.
  </Accordion>
</AccordionGroup>
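
The collect-and-check pattern described for streaming responses can be sketched as a generator wrapper. This is an illustrative sketch under assumptions, not the product's implementation: `collect_and_check`, `ContentBlocked`, and the `is_safe` callback are hypothetical names, and the moderation check itself is left as a pluggable function.

```python
from typing import Callable, Iterable, Iterator


class ContentBlocked(Exception):
    """Hypothetical error raised when the full response fails moderation."""


def collect_and_check(
    chunks: Iterable[str],
    is_safe: Callable[[str], bool],
) -> Iterator[str]:
    """Buffer the whole model stream, validate it, then replay it.

    Nothing is yielded until the complete response has passed the check,
    so the client never receives a partial unsafe response.
    """
    buffered = list(chunks)                  # drain the model stream first
    if not is_safe("".join(buffered)):
        raise ContentBlocked("response failed content safety")
    yield from buffered                      # replay the original chunks
```

The trade-off is visible in the code: the caller still iterates over chunks, but the first chunk is only available after the entire response has been generated and checked.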

## Agent tiers

Guardrail strictness scales with the agent tier configured for your deployment:

| Tier              | PII on input | PII on streaming output   | Content safety |
| ----------------- | ------------ | ------------------------- | -------------- |
| `TIER_1_STANDARD` | Redacted     | Pass-through              | On             |
| `TIER_2_STRICT`   | Redacted     | Redacted (sliding-window) | On             |

<Info>
  `TIER_2_STRICT` activates sliding-window output redaction for streaming responses, ensuring PII is caught even when it appears in generated content rather than user input. This is a stricter safety guarantee at the cost of some streaming throughput.
</Info>
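
One way to picture sliding-window redaction is a buffer that holds back the tail of the stream until enough text has arrived to know a match is complete. The sketch below is an assumption about the general technique, not a description of how `TIER_2_STRICT` is implemented: `redact_stream`, the email-only pattern, and the `window` parameter are hypothetical, and the approach only works if `window` is at least as long as the longest value to be caught.

```python
import re
from typing import Iterable, Iterator

# Assumption: a simplified email pattern standing in for the full PII detector.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_stream(chunks: Iterable[str], window: int = 64) -> Iterator[str]:
    """Redact emails in a chunked stream, even when one spans two chunks.

    The last `window` characters are held back on every step, because a
    value near the end of the buffer may still be incomplete.
    """
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        cut = len(buffer) - window
        if cut <= 0:
            continue                         # not enough text to release yet
        # Never cut through a match: one that reaches into the held-back
        # tail might still grow when the next chunk arrives.
        for match in EMAIL_RE.finditer(buffer):
            if match.start() < cut < match.end():
                cut = match.start()
                break
        if cut > 0:
            yield EMAIL_RE.sub("[EMAIL]", buffer[:cut])
            buffer = buffer[cut:]
    if buffer:
        yield EMAIL_RE.sub("[EMAIL]", buffer)  # flush the held-back tail
```

Holding back the tail is where the throughput cost comes from: output trails the model by up to `window` characters.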

## Secure code sandbox

When an agent executes Python code, it runs inside an ephemeral Docker container. Each execution gets a fresh, isolated environment:

<CardGroup cols={2}>
  <Card title="No host access" icon="hard-drive">
    The container has no access to the host filesystem. Files created during execution do not persist after the container exits.
  </Card>

  <Card title="No network access" icon="network-wired">
    Network access is disabled by default. The container cannot make outbound connections unless your administrator explicitly configures allowlist exceptions.
  </Card>

  <Card title="Ephemeral" icon="trash">
    Every code execution starts with a clean container. There is no state carried between separate executions.
  </Card>

  <Card title="Audited" icon="file-lines">
    Each sandbox execution is logged as a tool call in the run's audit trail.
  </Card>
</CardGroup>
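
The isolation properties above map onto standard `docker run` flags. The following is a minimal sketch of how such a sandbox could be launched, not Agent Manager's actual launcher: the function names, the `python:3.12-slim` image, and the timeout are assumptions chosen for illustration.

```python
import subprocess
from typing import List


def build_sandbox_command(image: str = "python:3.12-slim") -> List[str]:
    """Build a docker command for a throwaway, isolated Python run.

    The code is passed on stdin (`python -`), so no host path is mounted.
    """
    return [
        "docker", "run",
        "--rm",                # delete the container when it exits (ephemeral)
        "-i",                  # keep stdin open so the code can be piped in
        "--network", "none",   # no network access
        "--read-only",         # read-only root filesystem
        "--tmpfs", "/tmp",     # in-memory scratch space, gone on exit
        image,
        "python", "-",         # read the program from stdin
    ]


def run_in_sandbox(code: str, timeout_s: float = 30.0) -> subprocess.CompletedProcess:
    """Run `code` in a fresh container and capture stdout and stderr.

    Requires a local Docker daemon; each call starts a new container.
    """
    return subprocess.run(
        build_sandbox_command(),
        input=code,
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
```

With Docker available, `run_in_sandbox("print(2 + 2)").stdout` would contain the script's output. Because every call starts a new container with `--rm`, nothing written during one execution is visible to the next.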

## What gets logged

Every guardrail intervention is recorded in the audit log with:

* The type of guardrail triggered (PII, injection, content safety, hallucination)
* The run ID and timestamp
* The action taken (redacted, blocked, or flagged)

<Tip>
  If users are reporting unexpected redactions or blocked responses, navigate to the audit logs and filter by guardrail events for the affected run ID. The log entry will show exactly which guardrail triggered and why.
</Tip>
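
The logged fields listed above can be modeled as a small record, which also makes the "filter by run ID" step concrete. This is a hypothetical data shape for illustration: `GuardrailEvent`, its field names, the enum values, and `events_for_run` are assumptions and do not reflect Agent Manager's actual audit log schema or API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Iterable, List


class GuardrailType(Enum):
    PII = "pii"
    INJECTION = "injection"
    CONTENT_SAFETY = "content_safety"
    HALLUCINATION = "hallucination"


class Action(Enum):
    REDACTED = "redacted"
    BLOCKED = "blocked"
    FLAGGED = "flagged"


@dataclass(frozen=True)
class GuardrailEvent:
    """Hypothetical audit record: which guardrail fired, when, and what it did."""
    guardrail: GuardrailType
    run_id: str
    timestamp: datetime
    action: Action


def events_for_run(events: Iterable[GuardrailEvent], run_id: str) -> List[GuardrailEvent]:
    """Return the guardrail events for one run, oldest first."""
    matching = [event for event in events if event.run_id == run_id]
    return sorted(matching, key=lambda event: event.timestamp)
```

Filtering a list of such records by the affected run ID yields the sequence of interventions for that run, which is the same lookup the tip above describes doing in the audit log view.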
