Evaluations let you measure and track the quality of your agents’ responses over time. Agent Manager uses LLM-as-a-Judge scoring to grade agent outputs against expected results, giving you a structured way to compare the impact of model changes, prompt updates, or knowledge base modifications.
Concepts
- Evaluation suite: a named collection of test cases targeting a specific quality dimension (e.g., “Customer Support Accuracy” or “RAG Groundedness”).
- Test case: a single {input, expectedOutput} pair. You can optionally override the system prompt per case.
- Evaluation run: an execution of all cases in a suite against a target agent. Each run records per-case scores and aggregate metrics.
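As a rough TypeScript sketch of how these concepts fit together (field names are inferred from the request and response examples on this page, not a published schema):

```typescript
// Rough sketch of the evaluation data model, inferred from the examples below.
// Field names not shown verbatim in the API examples are assumptions.

interface EvaluationSuite {
  id: string;          // "suite-uuid" in the creation response below
  name: string;        // e.g. "Customer Support Accuracy"
  description: string;
}

interface TestCase {
  name: string;
  input: string;                        // the query sent to the agent
  expectedOutput: string;               // the reference answer the judge grades against
  systemPromptOverride?: string | null; // optional per-case prompt override
}

interface EvaluationRun {
  id: string;      // assumption: the {runId} used by the streaming and results endpoints
  suiteId: string;
  agentId: string; // the target agent the cases were executed against
}
```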
Creating a suite
```bash
curl -X POST http://your-host/api/v1/evaluations/suites \
  -H "Authorization: Bearer {token}" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Customer Support Accuracy",
    "description": "Tests accuracy on common support queries",
    "createdBy": "your-username"
  }'
```
The response includes the new suite's id, which subsequent calls reference as {suiteId}:

```json
{
  "id": "suite-uuid",
  "name": "Customer Support Accuracy",
  "description": "Tests accuracy on common support queries"
}
```
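If you are calling the API from code rather than curl, a minimal TypeScript helper might look like this. The endpoint, body fields, and response id come from the example above; the base URL, token handling, and error check are placeholder choices:

```typescript
// Minimal sketch: create a suite and return its id for use in later calls.
// BASE and TOKEN are placeholders for your deployment's host and bearer token.
const BASE = "http://your-host/api/v1";
const TOKEN = "your-token";

async function createSuite(name: string, description: string, createdBy: string): Promise<string> {
  const res = await fetch(`${BASE}/evaluations/suites`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${TOKEN}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ name, description, createdBy }),
  });
  if (!res.ok) throw new Error(`Failed to create suite: ${res.status}`);
  const suite = await res.json();
  return suite.id; // "suite-uuid" in the example response
}
```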
Adding test cases
Add cases to a suite by providing an input and the expected output:
```bash
curl -X POST http://your-host/api/v1/evaluations/suites/{suiteId}/cases \
  -H "Authorization: Bearer {token}" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Refund policy question",
    "input": "What is your refund policy?",
    "expectedOutput": "We offer a 30-day money-back guarantee on all plans.",
    "systemPromptOverride": null
  }'
```
`systemPromptOverride` is optional. When set, it replaces the agent's default system prompt for that case only, which is useful for testing how prompt changes affect specific query types.
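For example, to test how a stricter prompt affects one query type, you could add a case with the override set. The endpoint and field names match the example above; the prompt text itself is purely illustrative:

```typescript
// Sketch: add a case that swaps in a stricter system prompt for this case only.
// BASE, TOKEN, and suiteId are placeholders; the override text is illustrative.
const BASE = "http://your-host/api/v1";
const TOKEN = "your-token";
const suiteId = "suite-uuid";

await fetch(`${BASE}/evaluations/suites/${suiteId}/cases`, {
  method: "POST",
  headers: { Authorization: `Bearer ${TOKEN}`, "Content-Type": "application/json" },
  body: JSON.stringify({
    name: "Refund policy question (terse prompt)",
    input: "What is your refund policy?",
    expectedOutput: "We offer a 30-day money-back guarantee on all plans.",
    systemPromptOverride: "Answer in one sentence. Cite only documented policy.",
  }),
});
```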
Running a suite
Trigger a suite run against a specific agent. The run is enqueued as a background job:
curl -X POST "http://your-host/api/v1/evaluations/suites/{suiteId}/run?agentId=my-agent" \
-H "Authorization: Bearer {token}"
The job executes all cases in the suite sequentially (or concurrently, depending on configuration), scores each response, and persists the results.
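The run endpoint's response body isn't shown on this page, so the sketch below assumes it returns JSON containing the new run's id (the {runId} used by the streaming and results endpoints). Adjust the field name to match what your deployment actually returns:

```typescript
// Sketch: enqueue a run and capture its id for the streaming/results endpoints.
// Assumption: the enqueue response is JSON containing the new run's id.
const BASE = "http://your-host/api/v1";
const TOKEN = "your-token";

async function runSuite(suiteId: string, agentId: string): Promise<string> {
  const res = await fetch(
    `${BASE}/evaluations/suites/${suiteId}/run?agentId=${encodeURIComponent(agentId)}`,
    { method: "POST", headers: { Authorization: `Bearer ${TOKEN}` } },
  );
  if (!res.ok) throw new Error(`Failed to start run: ${res.status}`);
  const body = await res.json();
  return body.id; // assumption: the field may be named differently (e.g. runId)
}
```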
Streaming run progress
To watch evaluation progress in real time as cases are scored:
```http
GET /api/v1/evaluations/runs/{runId}/stream
```
This is a Server-Sent Events endpoint. Connect with an EventSource or a streaming HTTP client to receive live scoring updates.
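Note that the browser EventSource API cannot attach an Authorization header, so with a bearer-token setup you can read the stream with fetch instead. A minimal sketch; the shape of each event's data payload isn't documented here, so this just prints the raw data lines:

```typescript
// Sketch: consume the SSE stream with fetch, since EventSource can't set headers.
// The event payload shape is an assumption; log it raw to inspect.
const BASE = "http://your-host/api/v1";
const TOKEN = "your-token";

async function streamRun(runId: string): Promise<void> {
  const res = await fetch(`${BASE}/evaluations/runs/${runId}/stream`, {
    headers: { Authorization: `Bearer ${TOKEN}`, Accept: "text/event-stream" },
  });
  if (!res.body) throw new Error("No response body");

  const reader = res.body.pipeThrough(new TextDecoderStream()).getReader();
  let buffer = "";
  for (;;) {
    const { value, done } = await reader.read();
    if (done) break;
    buffer += value ?? "";
    // SSE events are separated by a blank line; data lines start with "data:".
    const events = buffer.split("\n\n");
    buffer = events.pop() ?? "";
    for (const event of events) {
      for (const line of event.split("\n")) {
        if (line.startsWith("data:")) console.log(line.slice(5).trim());
      }
    }
  }
}
```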
Viewing results
List all runs for a suite (newest first):
```http
GET /api/v1/evaluations/suites/{suiteId}/runs
```
Get detailed per-case results for a run:
```http
GET /api/v1/evaluations/runs/{runId}/results
```
Each result entry includes the case input, the agent’s actual output, the expected output, the score, and whether the case passed or failed.
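To triage a run, you might pull its results and list the failing cases. The fields below follow the description above, but their exact names (passed, score, actualOutput, and so on) are assumptions; check your actual response payload:

```typescript
// Sketch: fetch per-case results and print the failures for review.
// Field names are assumptions based on the description above.
const BASE = "http://your-host/api/v1";
const TOKEN = "your-token";

async function listFailures(runId: string): Promise<void> {
  const res = await fetch(`${BASE}/evaluations/runs/${runId}/results`, {
    headers: { Authorization: `Bearer ${TOKEN}` },
  });
  const results: Array<{
    input: string;
    actualOutput: string;
    expectedOutput: string;
    score: number;
    passed: boolean;
  }> = await res.json();

  for (const r of results.filter((r) => !r.passed)) {
    console.log(`[${r.score.toFixed(2)}] ${r.input}`);
    console.log(`  expected: ${r.expectedOutput}`);
    console.log(`  actual:   ${r.actualOutput}`);
  }
}
```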
Aggregate metrics
Get aggregate metrics across all runs:
```http
GET /api/v1/evaluations/metrics
```
```json
{
  "totalRuns": 42,
  "totalCases": 840,
  "passedCases": 756,
  "failedCases": 84,
  "passRate": 90.0,
  "averageScore": 0.87,
  "averageLatencyMs": 1240.5
}
```
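These aggregate fields make a convenient CI gate. A minimal TypeScript sketch, assuming the field names in the payload above; the 90% threshold is an arbitrary choice:

```typescript
// Sketch: fail a pipeline step when the aggregate pass rate drops below a threshold.
// Field names match the example payload above; the threshold is arbitrary.
const BASE = "http://your-host/api/v1";
const TOKEN = "your-token";
const MIN_PASS_RATE = 90.0;

const res = await fetch(`${BASE}/evaluations/metrics`, {
  headers: { Authorization: `Bearer ${TOKEN}` },
});
const metrics = await res.json();

if (metrics.passRate < MIN_PASS_RATE) {
  throw new Error(`Pass rate ${metrics.passRate}% is below the ${MIN_PASS_RATE}% floor`);
}
```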
Submitting feedback
After reviewing results manually, you can attach qualitative feedback to a run:
```bash
curl -X POST http://your-host/api/v1/evaluations/feedback \
  -H "Authorization: Bearer {token}" \
  -H "Content-Type: application/json" \
  -d '{
    "runId": "run-uuid",
    "rating": 4,
    "comment": "Good accuracy on policy questions, struggles with edge cases"
  }'
```
Managing suites and cases
| Action | Endpoint |
|---|---|
| List all suites | `GET /api/v1/evaluations/suites` |
| Get suite by ID | `GET /api/v1/evaluations/suites/{id}` |
| Delete a suite | `DELETE /api/v1/evaluations/suites/{suiteId}` |
| List cases in a suite | `GET /api/v1/evaluations/suites/{suiteId}/cases` |
| Delete a case | `DELETE /api/v1/evaluations/cases/{caseId}` |
Run evaluations before and after changing an agent's model, system prompt, or knowledge base. Comparing `averageScore` and `passRate` across runs gives you an objective signal of whether the change improved or regressed quality.
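As a sketch of that before/after comparison, recomputing both numbers from each run's per-case results; the score and passed fields are assumptions following the Viewing results section, and the run ids are placeholders:

```typescript
// Sketch: compare per-case results of two runs (e.g. before/after a prompt change).
// Field names (score, passed) are assumptions based on the results description above.
const BASE = "http://your-host/api/v1";
const TOKEN = "your-token";

async function runStats(runId: string): Promise<{ passRate: number; averageScore: number }> {
  const res = await fetch(`${BASE}/evaluations/runs/${runId}/results`, {
    headers: { Authorization: `Bearer ${TOKEN}` },
  });
  const results: Array<{ score: number; passed: boolean }> = await res.json();
  const passRate = (100 * results.filter((r) => r.passed).length) / results.length;
  const averageScore = results.reduce((sum, r) => sum + r.score, 0) / results.length;
  return { passRate, averageScore };
}

const before = await runStats("run-before-uuid"); // placeholder run ids
const after = await runStats("run-after-uuid");
console.log(`passRate: ${before.passRate}% -> ${after.passRate}%`);
console.log(`averageScore: ${before.averageScore.toFixed(2)} -> ${after.averageScore.toFixed(2)}`);
```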