Evaluations let you measure and track the quality of your agents’ responses over time. Agent Manager uses LLM-as-a-Judge scoring to grade agent outputs against expected results, giving you a structured way to compare the impact of model changes, prompt updates, or knowledge base modifications.
Concepts
- Evaluation suite: a named collection of test cases targeting a specific quality dimension (e.g., “Customer Support Accuracy” or “RAG Groundedness”).
- Test case: a single {input, expectedOutput} pair. You can optionally override the system prompt per case.
- Evaluation run: an execution of all cases in a suite against a target agent. Each run records per-case scores and aggregate metrics.
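As a rough TypeScript sketch of how these concepts fit together (field names are inferred from the request and response examples on this page, not a published schema):

```typescript
// Rough sketch of the evaluation data model, inferred from the examples below.
// Field names not shown verbatim in the API examples are assumptions.

interface EvaluationSuite {
  id: string;          // "suite-uuid" in the creation response below
  name: string;        // e.g. "Customer Support Accuracy"
  description: string;
}

interface TestCase {
  name: string;
  input: string;                        // the query sent to the agent
  expectedOutput: string;               // the reference answer the judge grades against
  systemPromptOverride?: string | null; // optional per-case prompt override
}

interface EvaluationRun {
  id: string;      // assumption: the {runId} used by the streaming and results endpoints
  suiteId: string;
  agentId: string; // the target agent the cases were executed against
}
```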
Creating a suite
```bash
curl -X POST http://your-host/api/v1/evaluations/suites \
  -H "Authorization: Bearer {token}" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Customer Support Accuracy",
    "description": "Tests accuracy on common support queries",
    "createdBy": "your-username"
  }'
```
The response includes the new suite's id, which subsequent calls reference as {suiteId}:

```json
{
  "id": "suite-uuid",
  "name": "Customer Support Accuracy",
  "description": "Tests accuracy on common support queries"
}
```
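If you are calling the API from code rather than curl, a minimal TypeScript helper might look like this. The endpoint, body fields, and response id come from the example above; the base URL, token handling, and error check are placeholder choices:

```typescript
// Minimal sketch: create a suite and return its id for use in later calls.
// BASE and TOKEN are placeholders for your deployment's host and bearer token.
const BASE = "http://your-host/api/v1";
const TOKEN = "your-token";

async function createSuite(name: string, description: string, createdBy: string): Promise<string> {
  const res = await fetch(`${BASE}/evaluations/suites`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${TOKEN}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ name, description, createdBy }),
  });
  if (!res.ok) throw new Error(`Failed to create suite: ${res.status}`);
  const suite = await res.json();
  return suite.id; // "suite-uuid" in the example response
}
```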
Adding test cases
Add cases to a suite by providing an input and the expected output:
```bash
curl -X POST http://your-host/api/v1/evaluations/suites/{suiteId}/cases \
  -H "Authorization: Bearer {token}" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Refund policy question",
    "input": "What is your refund policy?",
    "expectedOutput": "We offer a 30-day money-back guarantee on all plans.",
    "systemPromptOverride": null
  }'
```
`systemPromptOverride` is optional. When set, it replaces the agent's default system prompt for that case only, which is useful for testing how prompt changes affect specific query types.
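For example, to test how a stricter prompt affects one query type, you could add a case with the override set. The endpoint and field names match the example above; the prompt text itself is purely illustrative:

```typescript
// Sketch: add a case that swaps in a stricter system prompt for this case only.
// BASE, TOKEN, and suiteId are placeholders; the override text is illustrative.
const BASE = "http://your-host/api/v1";
const TOKEN = "your-token";
const suiteId = "suite-uuid";

await fetch(`${BASE}/evaluations/suites/${suiteId}/cases`, {
  method: "POST",
  headers: { Authorization: `Bearer ${TOKEN}`, "Content-Type": "application/json" },
  body: JSON.stringify({
    name: "Refund policy question (terse prompt)",
    input: "What is your refund policy?",
    expectedOutput: "We offer a 30-day money-back guarantee on all plans.",
    systemPromptOverride: "Answer in one sentence. Cite only documented policy.",
  }),
});
```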
Running a suite
Trigger a suite run against a specific agent. The run is enqueued as a background job:
curl -X POST "http://your-host/api/v1/evaluations/suites/{suiteId}/run?agentId=my-agent" \
-H "Authorization: Bearer {token}"
The job executes all cases in the suite sequentially (or concurrently, depending on configuration), scores each response, and persists the results.
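The run endpoint's response body isn't shown on this page, so the sketch below assumes it returns JSON containing the new run's id (the {runId} used by the streaming and results endpoints). Adjust the field name to match what your deployment actually returns:

```typescript
// Sketch: enqueue a run and capture its id for the streaming/results endpoints.
// Assumption: the enqueue response is JSON containing the new run's id.
const BASE = "http://your-host/api/v1";
const TOKEN = "your-token";

async function runSuite(suiteId: string, agentId: string): Promise<string> {
  const res = await fetch(
    `${BASE}/evaluations/suites/${suiteId}/run?agentId=${encodeURIComponent(agentId)}`,
    { method: "POST", headers: { Authorization: `Bearer ${TOKEN}` } },
  );
  if (!res.ok) throw new Error(`Failed to start run: ${res.status}`);
  const body = await res.json();
  return body.id; // assumption: the field may be named differently (e.g. runId)
}
```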
Streaming run progress
To watch evaluation progress in real time as cases are scored:
```http
GET /api/v1/evaluations/runs/{runId}/stream
```
This is a Server-Sent Events endpoint. Connect with an EventSource or a streaming HTTP client to receive live scoring updates.
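Note that the browser EventSource API cannot attach an Authorization header, so with a bearer-token setup you can read the stream with fetch instead. A minimal sketch; the shape of each event's data payload isn't documented here, so this just prints the raw data lines:

```typescript
// Sketch: consume the SSE stream with fetch, since EventSource can't set headers.
// The event payload shape is an assumption; log it raw to inspect.
const BASE = "http://your-host/api/v1";
const TOKEN = "your-token";

async function streamRun(runId: string): Promise<void> {
  const res = await fetch(`${BASE}/evaluations/runs/${runId}/stream`, {
    headers: { Authorization: `Bearer ${TOKEN}`, Accept: "text/event-stream" },
  });
  if (!res.body) throw new Error("No response body");

  const reader = res.body.pipeThrough(new TextDecoderStream()).getReader();
  let buffer = "";
  for (;;) {
    const { value, done } = await reader.read();
    if (done) break;
    buffer += value ?? "";
    // SSE events are separated by a blank line; data lines start with "data:".
    const events = buffer.split("\n\n");
    buffer = events.pop() ?? "";
    for (const event of events) {
      for (const line of event.split("\n")) {
        if (line.startsWith("data:")) console.log(line.slice(5).trim());
      }
    }
  }
}
```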
Viewing results
List all runs for a suite (newest first):
```http
GET /api/v1/evaluations/suites/{suiteId}/runs
```
Get detailed per-case results for a run:
```http
GET /api/v1/evaluations/runs/{runId}/results
```
Each result entry includes the case input, the agent’s actual output, the expected output, the score, and whether the case passed or failed.
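To triage a run, you might pull its results and list the failing cases. The fields below follow the description above, but their exact names (passed, score, actualOutput, and so on) are assumptions; check your actual response payload:

```typescript
// Sketch: fetch per-case results and print the failures for review.
// Field names are assumptions based on the description above.
const BASE = "http://your-host/api/v1";
const TOKEN = "your-token";

async function listFailures(runId: string): Promise<void> {
  const res = await fetch(`${BASE}/evaluations/runs/${runId}/results`, {
    headers: { Authorization: `Bearer ${TOKEN}` },
  });
  const results: Array<{
    input: string;
    actualOutput: string;
    expectedOutput: string;
    score: number;
    passed: boolean;
  }> = await res.json();

  for (const r of results.filter((r) => !r.passed)) {
    console.log(`[${r.score.toFixed(2)}] ${r.input}`);
    console.log(`  expected: ${r.expectedOutput}`);
    console.log(`  actual:   ${r.actualOutput}`);
  }
}
```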
Aggregate metrics
Get aggregate metrics across all runs:
```http
GET /api/v1/evaluations/metrics
```
```json
{
  "totalRuns": 42,
  "totalCases": 840,
  "passedCases": 756,
  "failedCases": 84,
  "passRate": 90.0,
  "averageScore": 0.87,
  "averageLatencyMs": 1240.5
}
```
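These aggregate fields make a convenient CI gate. A minimal TypeScript sketch, assuming the field names in the payload above; the 90% threshold is an arbitrary choice:

```typescript
// Sketch: fail a pipeline step when the aggregate pass rate drops below a threshold.
// Field names match the example payload above; the threshold is arbitrary.
const BASE = "http://your-host/api/v1";
const TOKEN = "your-token";
const MIN_PASS_RATE = 90.0;

const res = await fetch(`${BASE}/evaluations/metrics`, {
  headers: { Authorization: `Bearer ${TOKEN}` },
});
const metrics = await res.json();

if (metrics.passRate < MIN_PASS_RATE) {
  throw new Error(`Pass rate ${metrics.passRate}% is below the ${MIN_PASS_RATE}% floor`);
}
```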
Submitting feedback
After reviewing results manually, you can attach qualitative feedback to a run:
```bash
curl -X POST http://your-host/api/v1/evaluations/feedback \
  -H "Authorization: Bearer {token}" \
  -H "Content-Type: application/json" \
  -d '{
    "runId": "run-uuid",
    "rating": 4,
    "comment": "Good accuracy on policy questions, struggles with edge cases"
  }'
```
Managing suites and cases
| Action | Endpoint |
|---|---|
| List all suites | `GET /api/v1/evaluations/suites` |
| Get suite by ID | `GET /api/v1/evaluations/suites/{id}` |
| Delete a suite | `DELETE /api/v1/evaluations/suites/{suiteId}` |
| List cases in a suite | `GET /api/v1/evaluations/suites/{suiteId}/cases` |
| Delete a case | `DELETE /api/v1/evaluations/cases/{caseId}` |
Run evaluations before and after changing an agent's model, system prompt, or knowledge base. Comparing `averageScore` and `passRate` across runs gives you an objective signal of whether the change improved or regressed quality.
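As a sketch of that before/after comparison, recomputing both numbers from each run's per-case results; the score and passed fields are assumptions following the Viewing results section, and the run ids are placeholders:

```typescript
// Sketch: compare per-case results of two runs (e.g. before/after a prompt change).
// Field names (score, passed) are assumptions based on the results description above.
const BASE = "http://your-host/api/v1";
const TOKEN = "your-token";

async function runStats(runId: string): Promise<{ passRate: number; averageScore: number }> {
  const res = await fetch(`${BASE}/evaluations/runs/${runId}/results`, {
    headers: { Authorization: `Bearer ${TOKEN}` },
  });
  const results: Array<{ score: number; passed: boolean }> = await res.json();
  const passRate = (100 * results.filter((r) => r.passed).length) / results.length;
  const averageScore = results.reduce((sum, r) => sum + r.score, 0) / results.length;
  return { passRate, averageScore };
}

const before = await runStats("run-before-uuid"); // placeholder run ids
const after = await runStats("run-after-uuid");
console.log(`passRate: ${before.passRate}% -> ${after.passRate}%`);
console.log(`averageScore: ${before.averageScore.toFixed(2)} -> ${after.averageScore.toFixed(2)}`);
```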