# Evaluating Agent Performance and Quality

> How to create evaluation suites, add test cases with expected outputs, run suites against agents, and track scores and pass rates over time.

Evaluations let you measure and track the quality of your agents' responses over time. Agent Manager uses LLM-as-a-Judge scoring to grade agent outputs against expected results, giving you a structured way to compare the impact of model changes, prompt updates, or knowledge base modifications.

## Concepts

* **Evaluation suite**: a named collection of test cases targeting a specific quality dimension (e.g., "Customer Support Accuracy" or "RAG Groundedness").
* **Test case**: a single `{input, expectedOutput}` pair. You can optionally override the system prompt per case.
* **Evaluation run**: an execution of all cases in a suite against a target agent. Each run records per-case scores and aggregate metrics.

## Creating a suite

```bash theme={null}
curl -X POST http://your-host/api/v1/evaluations/suites \
  -H "Authorization: Bearer {token}" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Customer Support Accuracy",
    "description": "Tests accuracy on common support queries",
    "createdBy": "your-username"
  }'
```

```json theme={null}
{
  "id": "suite-uuid",
  "name": "Customer Support Accuracy",
  "description": "Tests accuracy on common support queries"
}
```
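
Later calls use the `id` from this response as `{suiteId}`. Below is a minimal sketch of capturing it in a shell variable, assuming the response is a small, flat JSON object like the one above. `json_string_field`, `BASE_URL`, and `TOKEN` are hypothetical names introduced for illustration, not part of Agent Manager. The `sed` pattern is deliberately naive, so prefer a real JSON parser such as `jq` for anything more complex.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the string value of key $1 from a small,
# flat JSON object on stdin. Naive on purpose: it does not handle nested
# objects or values that contain escaped quotes.
json_string_field() {
  sed -n 's/.*"'"$1"'"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Usage sketch (BASE_URL and TOKEN are assumed environment variables):
#   SUITE_ID=$(curl -s -X POST "$BASE_URL/api/v1/evaluations/suites" \
#     -H "Authorization: Bearer $TOKEN" \
#     -H "Content-Type: application/json" \
#     -d '{"name": "Customer Support Accuracy", "description": "Tests accuracy on common support queries", "createdBy": "your-username"}' \
#     | json_string_field id)
```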

## Adding test cases

Add cases to a suite by providing an input and the expected output:

```bash theme={null}
curl -X POST http://your-host/api/v1/evaluations/suites/{suiteId}/cases \
  -H "Authorization: Bearer {token}" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Refund policy question",
    "input": "What is your refund policy?",
    "expectedOutput": "We offer a 30-day money-back guarantee on all plans.",
    "systemPromptOverride": null
  }'
```

<Note>
  `systemPromptOverride` is optional. When set, it replaces the agent's default system prompt for that case only — useful for testing how prompt changes affect specific query types.
</Note>
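
To exercise the override, send a string instead of `null`. The sketch below builds the request body in the shell. `json_escape` and `case_payload` are hypothetical helpers, and the escaping covers only backslashes and double quotes, so it assumes single-line values without control characters.

```bash
#!/usr/bin/env bash
# Hypothetical helper: escape backslashes and double quotes for use inside
# a JSON string. Does NOT handle newlines or other control characters.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Hypothetical helper: print a test-case body.
#   $1 name, $2 input, $3 expectedOutput, $4 optional systemPromptOverride
case_payload() {
  local override=null
  if [ -n "${4:-}" ]; then
    override="\"$(json_escape "$4")\""
  fi
  printf '{"name":"%s","input":"%s","expectedOutput":"%s","systemPromptOverride":%s}\n' \
    "$(json_escape "$1")" "$(json_escape "$2")" "$(json_escape "$3")" "$override"
}

# Usage sketch (BASE_URL, TOKEN, and SUITE_ID are assumed variables):
#   curl -X POST "$BASE_URL/api/v1/evaluations/suites/$SUITE_ID/cases" \
#     -H "Authorization: Bearer $TOKEN" \
#     -H "Content-Type: application/json" \
#     -d "$(case_payload "Refund policy question" "What is your refund policy?" \
#           "We offer a 30-day money-back guarantee on all plans." \
#           "You are a concise support agent. Answer in one sentence.")"
```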

## Running a suite

Trigger a suite run against a specific agent. The run is enqueued as a background job:

```bash theme={null}
curl -X POST "http://your-host/api/v1/evaluations/suites/{suiteId}/run?agentId=my-agent" \
  -H "Authorization: Bearer {token}"
```

```json theme={null}
{
  "jobId": "job-uuid"
}
```

The job executes all cases in the suite sequentially (or concurrently, depending on configuration), scores each response, and persists the results.

## Streaming run progress

To watch evaluation progress in real time as cases are scored:

```bash theme={null}
curl -N -H "Authorization: Bearer {token}" http://your-host/api/v1/evaluations/runs/{runId}/stream
```

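This is a Server-Sent Events (SSE) endpoint. Connect with an `EventSource` or a streaming HTTP client to receive live scoring updates. The browser `EventSource` API cannot set custom request headers, so if the endpoint requires the `Authorization` header, use a streaming HTTP client instead.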
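
From a terminal, `curl` with `-N` (no output buffering) can act as the streaming client. The sketch below reduces the stream to its event payloads. `sse_data` is a hypothetical helper, and because the shape of each payload is not specified here, it is printed as received.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read a Server-Sent Events stream on stdin and print
# the payload of each "data:" line. Event names, ids, comments, and the
# blank lines that separate events are ignored.
sse_data() {
  local line
  while IFS= read -r line || [ -n "$line" ]; do
    line=${line%$'\r'}              # tolerate CRLF line endings
    case "$line" in
      data:*)
        line=${line#data:}
        printf '%s\n' "${line# }"   # SSE allows one optional space after the colon
        ;;
    esac
  done
}

# Usage sketch (BASE_URL, TOKEN, and RUN_ID are assumed variables):
#   curl -N -s "$BASE_URL/api/v1/evaluations/runs/$RUN_ID/stream" \
#     -H "Authorization: Bearer $TOKEN" \
#     -H "Accept: text/event-stream" | sse_data
```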

## Viewing results

**List all runs for a suite** (newest first):

```bash theme={null}
curl -H "Authorization: Bearer {token}" http://your-host/api/v1/evaluations/suites/{suiteId}/runs
```

**Get detailed per-case results for a run:**

```bash theme={null}
curl -H "Authorization: Bearer {token}" http://your-host/api/v1/evaluations/runs/{runId}/results
```

Each result entry includes the case input, the agent's actual output, the expected output, the score, and whether the case passed or failed.

## Aggregate metrics

Get aggregate metrics across all runs:

```bash theme={null}
curl -H "Authorization: Bearer {token}" http://your-host/api/v1/evaluations/metrics
```

```json theme={null}
{
  "totalRuns": 42,
  "totalCases": 840,
  "passedCases": 756,
  "failedCases": 84,
  "passRate": 90.0,
  "averageScore": 0.87,
  "averageLatencyMs": 1240.5
}
```
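
In this sample, `passRate` is a percentage while `averageScore` is on a 0 to 1 scale. The numbers are consistent with `passRate = 100 * passedCases / totalCases` (100 * 756 / 840 = 90.0) and `failedCases = totalCases - passedCases` (840 - 756 = 84). That relationship is inferred from the sample values, not from a documented formula. A small sketch for recomputing it, where `pass_rate` is a hypothetical helper:

```bash
#!/usr/bin/env bash
# Hypothetical helper: pass rate as a percentage with one decimal place.
#   $1 passed cases, $2 total cases
pass_rate() {
  awk -v passed="$1" -v total="$2" 'BEGIN {
    if (total <= 0) { print "n/a"; exit }   # avoid division by zero
    printf "%.1f\n", 100 * passed / total
  }'
}

pass_rate 756 840   # prints 90.0
```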

## Submitting feedback

After reviewing results manually, you can attach a rating and a free-text comment to a run:

```bash theme={null}
curl -X POST http://your-host/api/v1/evaluations/feedback \
  -H "Authorization: Bearer {token}" \
  -H "Content-Type: application/json" \
  -d '{
    "runId": "run-uuid",
    "rating": 4,
    "comment": "Good accuracy on policy questions, struggles with edge cases"
  }'
```

## Managing suites and cases

| Action                | Endpoint                                         |
| --------------------- | ------------------------------------------------ |
| List all suites       | `GET /api/v1/evaluations/suites`                 |
| Get suite by ID       | `GET /api/v1/evaluations/suites/{suiteId}`       |
| Delete a suite        | `DELETE /api/v1/evaluations/suites/{suiteId}`    |
| List cases in a suite | `GET /api/v1/evaluations/suites/{suiteId}/cases` |
| Delete a case         | `DELETE /api/v1/evaluations/cases/{caseId}`      |

<Tip>
  Run evaluations before and after changing an agent's model, system prompt, or knowledge base. Comparing `averageScore` and `passRate` across runs gives you a quantitative signal of whether the change improved or regressed quality.
</Tip>
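
The before/after comparison can be reduced to a signed delta per metric. This is a minimal sketch, assuming you have already copied `averageScore` (or `passRate`) from the two runs. `metric_delta` is a hypothetical helper, not an Agent Manager command.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the signed change between two metric values
# and a one-word verdict.
#   $1 value before the change, $2 value after the change
metric_delta() {
  awk -v before="$1" -v after="$2" 'BEGIN {
    d = after - before
    verdict = (d > 0) ? "improved" : ((d < 0) ? "regressed" : "unchanged")
    printf "%+.2f %s\n", d, verdict
  }'
}

metric_delta 0.87 0.91   # averageScore: prints +0.04 improved
metric_delta 90.0 85.0   # passRate:     prints -5.00 regressed
```

A higher value counts as an improvement here, which fits both metrics; a latency comparison would need the opposite sign convention.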
