Evaluations Overview

Use evaluations to keep agent behavior from drifting. Start with a small batch eval, add more cases as bugs appear, then use live evaluation and red-team runs when you want signals from real traffic or adversarial prompts.

Pick A Workflow

Goal	Start here
Catch regressions in CI	Batch Evals
Choose the right evaluator	Evaluator Picker
Move test cases out of code	Datasets And Reports
Score real runs after users get a response	Live Evaluation
Grade subjective quality or safety	LLM Judges And Safety
Try adversarial prompts against an agent	Red Team

Batch evals are the fastest first win. You can run one without provider credentials by giving the agent a fixed or fake IChatClient.
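As a rough illustration of the "fixed or fake IChatClient" idea, here is a minimal sketch of a client that always returns the same canned reply. It assumes IChatClient is the interface from the Microsoft.Extensions.AI abstractions package, which this page does not confirm. FixedChatClient is a hypothetical name, and how the client is handed to your agent depends on your setup and is not shown.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.AI; // assumption: IChatClient comes from this package

// Quick check that the fake answers without any provider credentials.
IChatClient client = new FixedChatClient("Paris");
ChatResponse response = await client.GetResponseAsync(
    new List<ChatMessage> { new ChatMessage(ChatRole.User, "What is the capital of France?") });
Console.WriteLine(response.Text);

// Hypothetical test double: ignores the prompt and always replies with the same text.
public sealed class FixedChatClient : IChatClient
{
    private readonly string _cannedReply;

    public FixedChatClient(string cannedReply) => _cannedReply = cannedReply;

    public Task<ChatResponse> GetResponseAsync(
        IEnumerable<ChatMessage> messages,
        ChatOptions? options = null,
        CancellationToken cancellationToken = default)
        => Task.FromResult(new ChatResponse(new ChatMessage(ChatRole.Assistant, _cannedReply)));

    public async IAsyncEnumerable<ChatResponseUpdate> GetStreamingResponseAsync(
        IEnumerable<ChatMessage> messages,
        ChatOptions? options = null,
        [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        await Task.Yield(); // keeps the iterator asynchronous without doing real work
        yield return new ChatResponseUpdate(ChatRole.Assistant, _cannedReply);
    }

    // Returns this instance when asked for a type it implements, otherwise null.
    public object? GetService(Type serviceType, object? serviceKey = null)
        => serviceKey is null && serviceType.IsInstanceOfType(this) ? this : null;

    public void Dispose() { }
}
```

Because the reply never changes, any evaluator failure points at the agent or the eval setup rather than at model variability.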

Smallest Shape

csharp

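// Run every case in the dataset through the agent and score each result
// with the listed evaluators.
var report = await RunEvals.ExecuteAsync(
    agent,
    dataset,
    evaluators:
    [
        new EqualsGroundTruthEvaluator(),      // deterministic: exact ground-truth match
        new OutputContainsEvaluator("Paris"),  // deterministic: output must contain "Paris"
    ],
    experimentName: "capital-smoke");

// Print the pass rate for the metric named "Output Contains".
Console.WriteLine(report.PassRate("Output Contains"));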

That gives you one report with case results, metric values, pass-rate helpers, and JSON output for CI artifacts.
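To make the CI use concrete, the sketch below shows one way a pass rate could gate a build step. It does not use the library's report type, whose members are not shown on this page. CaseResult, PassRates, and the 90% bar are all hypothetical stand-ins chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical case results; a real run would take these from the eval report.
var results = new List<CaseResult>
{
    new CaseResult("france", "Output Contains", true),
    new CaseResult("japan", "Output Contains", true),
    new CaseResult("peru", "Output Contains", false),
    new CaseResult("france", "Equals Ground Truth", true),
};

double rate = PassRates.For(results, "Output Contains");
Console.WriteLine($"Output Contains pass rate: {rate:P0}");

// Fail the CI step when the rate drops below an assumed 90% bar.
const double minimumPassRate = 0.9;
Environment.ExitCode = rate >= minimumPassRate ? 0 : 1;

// Hypothetical shape: one row per case and metric.
public sealed record CaseResult(string CaseId, string Metric, bool Passed);

public static class PassRates
{
    // Fraction of results for one metric that passed; 0 when the metric has no results.
    public static double For(IEnumerable<CaseResult> results, string metric)
    {
        var forMetric = results.Where(r => r.Metric == metric).ToList();
        if (forMetric.Count == 0) return 0.0;
        return (double)forMetric.Count(r => r.Passed) / forMetric.Count;
    }
}
```

Returning 0 for a metric with no results makes a misspelled metric name fail the gate instead of passing silently.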

Evaluator Choice

If you are not sure which evaluator to start with, use the Evaluator Picker.

Use deterministic evaluators when you want a repeatable test:

exact ground-truth match
output contains a required phrase
JSON, XML, HTML, SQL, or schema checks
tool-call and performance checks
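The checks above are repeatable because they are plain functions of the output text. The sketch below illustrates two of them, a required-phrase check and a JSON validity check, using only the .NET base library. DeterministicChecks is a hypothetical helper, not one of the library's evaluators, and the case-insensitive match is an assumption of this sketch.

```csharp
using System;
using System.Text.Json;

Console.WriteLine(DeterministicChecks.ContainsPhrase("The capital of France is Paris.", "paris")); // True
Console.WriteLine(DeterministicChecks.IsValidJson("{\"capital\": \"Paris\"}"));                    // True
Console.WriteLine(DeterministicChecks.IsValidJson("{capital: Paris"));                             // False

public static class DeterministicChecks
{
    // Passes when the output contains the phrase, ignoring case (a choice made for this sketch).
    public static bool ContainsPhrase(string output, string phrase)
        => output.Contains(phrase, StringComparison.OrdinalIgnoreCase);

    // Passes when the output parses as a single well-formed JSON document.
    public static bool IsValidJson(string output)
    {
        try
        {
            using var document = JsonDocument.Parse(output);
            return true;
        }
        catch (JsonException)
        {
            return false;
        }
    }
}
```

The same input always produces the same verdict, which is what makes checks of this kind suitable for regression tests.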

Use judge evaluators when the question is subjective:

correctness
tone
refusal quality
policy fit
answer completeness

Judge evaluators call another model or agent, so their scores can vary from run to run. Treat them as quality signals rather than hard pass/fail gates, unless your own policy defines the threshold and review process.
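One possible shape for such a policy is sketched below: average several judge scores for the same case, pass above one threshold, and route the middle band to human review. The 0 to 1 score scale, the two thresholds, and the JudgePolicy and JudgeDecision names are all assumptions for illustration and are not part of the library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical judge scores on a 0-1 scale from repeated runs of the same case.
var scores = new List<double> { 0.82, 0.74, 0.90 };
Console.WriteLine(JudgePolicy.Decide(scores, passAt: 0.8, reviewAt: 0.6)); // Pass

public enum JudgeDecision { Pass, NeedsReview, Fail }

public static class JudgePolicy
{
    // Averages the scores, then maps the mean onto pass, review, or fail.
    // With no scores there is nothing to trust, so the case goes to review.
    public static JudgeDecision Decide(IReadOnlyCollection<double> scores, double passAt, double reviewAt)
    {
        if (scores.Count == 0) return JudgeDecision.NeedsReview;

        double mean = scores.Average();
        if (mean >= passAt) return JudgeDecision.Pass;
        return mean >= reviewAt ? JudgeDecision.NeedsReview : JudgeDecision.Fail;
    }
}
```

Averaging repeated runs dampens run-to-run variation, and the review band keeps borderline cases from being decided by the judge alone.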

Storage

Use InMemoryScoreStore for local runs, tests, and examples. It is not durable storage. For production score history, wire IScoreStore to application storage.
