Evaluator Picker
Draft boundary: this catalog is source-grounded against the evaluator classes under HPD-Agent.Evaluations/Evaluators and the corresponding evaluator tests. It is meant to help developers choose a first evaluator without reading source. It is not a complete API reference, and long-tail evaluators are grouped when that makes first use clearer.
Pick The Smallest Useful Evaluator
| Need | Start with | Judge required | Gate in CI? | Trend instead? |
|---|---|---|---|---|
| Exact expected answer | EqualsGroundTruthEvaluator, OutputEqualsEvaluator | No | Yes | Usually no |
| Required phrase, prefix, regex, length | OutputContainsEvaluator, ContainsAllEvaluator, StartsWithEvaluator, OutputMatchesRegexEvaluator, WordCountEvaluator | No | Yes | Sometimes |
| Valid JSON/XML/HTML/SQL shape | JsonValidityEvaluator, XmlValidityEvaluator, HtmlShapeEvaluator, SqlShapeEvaluator | No | Yes | Usually no |
| JSON fields and light schema checks | SchemaConformanceEvaluator, FieldCompletenessEvaluator, FieldAccuracyEvaluator | No | Yes for contract tests | Track field completeness over time |
| Tool behavior | ToolWasCalledEvaluator, ToolCallCountEvaluator, ToolArgumentMatchesEvaluator, ToolResultContainsEvaluator, NoToolsCalledEvaluator, ToolCallOrderEvaluator, ToolCallF1Evaluator | No | Yes when tool behavior is required | Use F1 as a trend when tool choice can vary |
| Latency, tokens, cost, iterations | MaxDurationEvaluator, MaxTokensEvaluator, MaxInputTokensEvaluator, MaxOutputTokensEvaluator, MaxCostEvaluator, MaxIterationsEvaluator, LatencyEvaluator | No | Gate only hard limits | Trend latency/cost |
| Semantic answer quality | AnswerSimilarityEvaluator, GoalAccuracyEvaluator, TaskSuccessEvaluator, AspectCriticEvaluator, CustomJudgeEvaluator | Yes | Only with owned thresholds/review | Yes |
| Retrieval quality and hallucination | ContextRelevanceEvaluator, HallucinationEvaluator, ContextRecallEvaluator, ContextPrecisionEvaluator | Yes | Rarely | Yes |
| Safety and policy review | Safety evaluators such as PromptInjectionEvaluator, SensitiveDataLeakEvaluator, CodeSecurityRiskEvaluator, PolicyComplianceEvaluator | Yes | Yes only after product policy defines thresholds | Yes for discovery and tuning |
| Conversation and memory | ConversationCoherenceEvaluator, GoalProgressionEvaluator, RepetitionDetectionEvaluator, MemoryAccuracyEvaluator | Yes | Rarely | Yes |
| Risk/autonomy monitoring | TurnRiskEvaluator plus TurnAutonomyEvaluator | Mixed: risk yes, autonomy no | Rarely | Yes |
Default policy in batch evals is source-defined: deterministic evaluators are MustAlwaysPass; other evaluators default to TrackTrend. Live AddEvaluator defaults to MustAlwaysPass, so pass policy: EvalPolicy.TrackTrend explicitly for judge metrics you want to monitor rather than fail.
CI Gate Or Trend Signal?
Use MustAlwaysPass when the answer is mechanically checkable and a failure means the agent is broken: exact output, output contract, required tool call, no tool call, maximum tokens, or a bounded duration.
Use TrackTrend when the score is probabilistic, subjective, or depends on a judge model: semantic quality, hallucination, safety discovery, reasoning quality, conversation memory, risk, autonomy, and aggregate quality scores.
There are two practical exceptions:
- Wrap a numeric deterministic score with ThresholdGate when you need a hard gate, for example new ThresholdGate(new ToolCallF1Evaluator("Search", "Fetch"), 0.8).
- Safety evaluators emit both a numeric score and a ... Passed boolean through SafetyPolicy. They can become gates after the application owns the policy, threshold, and review process.
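The gate-versus-trend split can be illustrated without the library. The following is a minimal sketch, not the HPD implementation: `ToGate`, the tuple shape, and the sample scores are hypothetical, and the use of `>=` for the threshold comparison is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in: a numeric score becomes a pass/fail gate once a
// threshold is chosen. Using >= here is an assumption.
bool ToGate(double score, double threshold) => score >= threshold;

// Each result: metric name, whether it passed, and whether it is a hard gate
// (MustAlwaysPass) or only tracked over time (TrackTrend).
var results = new List<(string Name, bool Passed, bool IsGate)>
{
    ("OutputEquals", true, true),
    ("ToolCallF1 >= 0.8", ToGate(0.67, 0.8), true),
    ("AnswerSimilarity", false, false),
};

// Only failed gates should break the build; failed trends are just reported.
var brokenGates = results.Where(r => r.IsGate && !r.Passed).Select(r => r.Name).ToList();
var trendFailures = results.Where(r => !r.IsGate && !r.Passed).Select(r => r.Name).ToList();

Console.WriteLine($"Broken gates: {string.Join(", ", brokenGates)}");
Console.WriteLine($"Trend-only failures: {string.Join(", ", trendFailures)}");
```

In this sketch the low similarity score never fails the run, while the thresholded F1 does, which is the behavior the two policies are meant to separate.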
First Evaluator Families
Output Assertions
Use these when the response text itself is the contract. They do not need a judge model.
- OutputContainsEvaluator, ContainsAnyEvaluator, ContainsAllEvaluator, CaseInsensitiveContainsEvaluator, and StartsWithEvaluator check simple string presence or prefix rules.
- OutputMatchesRegexEvaluator checks a regex against modelResponse.Text.
- OutputEqualsEvaluator checks exact text; EqualsGroundTruthEvaluator checks exact text against GroundTruthContext.
- WordCountEvaluator checks min, max, or exact word count and stores the observed count as metadata.
- RefusalEvaluator detects common refusal phrases. Use NotEvaluator(new RefusalEvaluator()) when refusal is a failure.
- KeywordCoverageEvaluator, ContentSimilarityEvaluator, and LevenshteinEvaluator return numeric similarity or coverage scores. Gate them only with an explicit threshold.
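To see why a numeric similarity score needs an explicit threshold before it can gate anything, here is a standalone sketch of an edit-distance similarity. It does not call LevenshteinEvaluator; the normalization (one minus distance over the longer length) and the 0.5 threshold are assumptions made for illustration.

```csharp
using System;

// Classic dynamic-programming edit distance (insert, delete, substitute).
int Levenshtein(string a, string b)
{
    var prev = new int[b.Length + 1];
    var curr = new int[b.Length + 1];
    for (int j = 0; j <= b.Length; j++) prev[j] = j;
    for (int i = 1; i <= a.Length; i++)
    {
        curr[0] = i;
        for (int j = 1; j <= b.Length; j++)
        {
            int cost = a[i - 1] == b[j - 1] ? 0 : 1;
            curr[j] = Math.Min(Math.Min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
        }
        (prev, curr) = (curr, prev); // reuse the two rows
    }
    return prev[b.Length];
}

// Assumed normalization: 1.0 means identical, 0.0 means every position differs.
double Similarity(string a, string b)
{
    int max = Math.Max(a.Length, b.Length);
    return max == 0 ? 1.0 : 1.0 - (double)Levenshtein(a, b) / max;
}

double score = Similarity("kitten", "sitting");
bool passed = score >= 0.5; // explicit, caller-owned threshold
Console.WriteLine($"{score:F3} passed={passed}");
```

The same score would fail under a 0.6 threshold, so the threshold is part of the contract rather than a property of the metric.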
evaluators:
[
    new EqualsGroundTruthEvaluator(),
    new ContainsAllEvaluator("Paris", "France"),
    new OutputMatchesRegexEvaluator(@"\b\d{4}\b"),
]
Structured Output
Use these when the answer must be machine-readable.
- JsonValidityEvaluator and XmlValidityEvaluator parse the response.
- HtmlShapeEvaluator checks for plausible HTML tags and optional required tags. It is not W3C validation.
- SqlShapeEvaluator checks plausible SQL statement shape, balanced parentheses, and balanced quotes. It is not dialect validation.
- SchemaConformanceEvaluator validates a JSON schema subset: type, required, properties, minLength, maxLength, minimum, maximum, and enum.
- FieldCompletenessEvaluator scores the fraction of named top-level JSON fields present and non-null.
- FieldAccuracyEvaluator checks a top-level JSON field value with case-insensitive string comparison.
- SemanticFieldEqualityEvaluator is the judge-backed escape hatch when a field can be semantically right without matching exactly.
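The "fraction of named top-level fields present and non-null" idea can be sketched with System.Text.Json alone. This is an illustration of the scoring rule described above, not the FieldCompletenessEvaluator source; the helper name and the choice to return 0 for a non-object root or an empty field list are assumptions.

```csharp
using System;
using System.Text.Json;

// Fraction of the named top-level fields that exist and are not JSON null.
double FieldCompleteness(string json, params string[] fields)
{
    using var doc = JsonDocument.Parse(json);
    // Assumption: a non-object root or an empty field list scores 0.
    if (fields.Length == 0 || doc.RootElement.ValueKind != JsonValueKind.Object) return 0.0;

    int present = 0;
    foreach (var name in fields)
    {
        if (doc.RootElement.TryGetProperty(name, out var value) &&
            value.ValueKind != JsonValueKind.Null)
        {
            present++;
        }
    }
    return (double)present / fields.Length;
}

// "city" is present, "country" is null, "population" is missing: 1 of 3.
var response = """{"city": "Paris", "country": null}""";
Console.WriteLine(FieldCompleteness(response, "city", "country", "population"));
```

Because the result is a fraction rather than a boolean, it is the kind of score the table suggests tracking over time.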
var schema = """
    {
      "type": "object",
      "required": ["city", "country"],
      "properties": {
        "city": { "type": "string" },
        "country": { "enum": ["France"] }
      }
    }
    """;
evaluators:
[
    new JsonValidityEvaluator(),
    new SchemaConformanceEvaluator(schema),
    new FieldAccuracyEvaluator("country", "France"),
]
Tool And Trace Behavior
Use these when success depends on how the agent acted, not only what it said. These evaluators require TurnEvaluationContext; live evaluation and batch runs through the HPD evaluation pipeline provide it.
- ToolWasCalledEvaluator checks that a named tool was called at least once.
- ToolCallCountEvaluator checks an exact count.
- ToolArgumentMatchesEvaluator parses the tool call JSON arguments and compares one argument value exactly.
- ToolResultContainsEvaluator checks a named tool result for expected text.
- NoToolsCalledEvaluator asserts that no tools were used.
- ToolCallOrderEvaluator checks that expected tools appear in order as a subsequence; extra calls may exist.
- ToolCallF1Evaluator returns unordered precision/recall/F1 over expected tool names.
- HasMatchingSpanEvaluator is the broader trace assertion for matching a SpanQuery against TurnTrace.
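The difference between the ordered subsequence check and the unordered F1 score is easier to see on a concrete call list. The sketch below reimplements both ideas in plain C#; it does not use the HPD evaluators, and comparing distinct tool names (rather than call counts) in the F1 helper is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// True when every expected tool appears in the actual calls in the same
// relative order; unrelated calls in between are allowed.
bool IsSubsequence(IReadOnlyList<string> expected, IReadOnlyList<string> actual)
{
    int next = 0;
    foreach (var call in actual)
    {
        if (next < expected.Count && call == expected[next]) next++;
    }
    return next == expected.Count;
}

// Unordered precision/recall/F1 over distinct tool names (assumed semantics).
(double Precision, double Recall, double F1) ToolF1(IEnumerable<string> expected, IEnumerable<string> actual)
{
    var exp = expected.ToHashSet();
    var act = actual.ToHashSet();
    int hits = exp.Intersect(act).Count();
    double p = act.Count == 0 ? 0 : (double)hits / act.Count;
    double r = exp.Count == 0 ? 0 : (double)hits / exp.Count;
    double f1 = p + r == 0 ? 0 : 2 * p * r / (p + r);
    return (p, r, f1);
}

var calls = new[] { "Search", "Log", "Fetch", "Summarize" };
Console.WriteLine(IsSubsequence(new[] { "Search", "Summarize" }, calls)); // expected order respected
Console.WriteLine(IsSubsequence(new[] { "Summarize", "Search" }, calls)); // expected order violated
Console.WriteLine(ToolF1(new[] { "Search", "Fetch" }, calls).F1);
```

Here both expected tools were called, so recall is perfect, but the two extra calls lower precision. That is why F1 suits trend tracking when tool choice can vary, while the order check is a natural hard gate.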
evaluators:
[
    new ToolWasCalledEvaluator("Search"),
    new ToolArgumentMatchesEvaluator("Search", "query", "capital of France"),
    new ToolCallOrderEvaluator(["Search", "Summarize"]),
]
Performance And Cost
Use hard limits for regressions that should fail immediately. Use numeric trends for everything else.
- MaxDurationEvaluator, MaxIterationsEvaluator, MaxTokensEvaluator, MaxInputTokensEvaluator, and MaxOutputTokensEvaluator return booleans.
- LatencyEvaluator reports duration in seconds and latency-ms metadata.
- MaxCostEvaluator looks for cost in turn metrics under cost_usd, turn_cost_usd, estimated_cost_usd, or cost. If no cost metric exists, it returns a warning diagnostic rather than inventing cost.
These also require TurnEvaluationContext.
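The cost lookup described for MaxCostEvaluator, with a warning instead of a made-up value, can be sketched as a key fallback. The helper names, the dictionary shape, and the order in which keys are tried are assumptions; only the four key names come from the text above.

```csharp
using System;
using System.Collections.Generic;

// Metric keys named above; trying them in this order is an assumption.
var costKeys = new[] { "cost_usd", "turn_cost_usd", "estimated_cost_usd", "cost" };

// Returns the first cost found, or null so the caller can warn instead of
// treating a missing metric as zero cost.
double? FindCost(IReadOnlyDictionary<string, double> metrics)
{
    foreach (var key in costKeys)
    {
        if (metrics.TryGetValue(key, out var value)) return value;
    }
    return null;
}

string CheckMaxCost(IReadOnlyDictionary<string, double> metrics, double maxUsd)
{
    var cost = FindCost(metrics);
    if (cost is null) return "warning: no cost metric found";
    return cost.Value <= maxUsd ? "pass" : "fail";
}

Console.WriteLine(CheckMaxCost(new Dictionary<string, double> { ["turn_cost_usd"] = 0.02 }, 0.05));
Console.WriteLine(CheckMaxCost(new Dictionary<string, double> { ["tokens"] = 900 }, 0.05));
```

Keeping "missing" distinct from "zero" matters for gating: a turn with no cost metric should not silently pass a budget check.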
NLP Overlap Metrics
Use these when you need deterministic overlap metrics against references. They are better for summaries, translations, or extraction-style tasks than open-ended assistant quality.
- BleuEvaluator, GleuEvaluator, and TextF1Evaluator wrap Microsoft NLP evaluators.
- RougeEvaluator supports Rouge1, Rouge2, RougeL, and RougeS.
- MeteorEvaluator is a lightweight dependency-free METEOR-style unigram alignment score with configurable MeteorEvaluatorOptions.
Treat these as trends unless your dataset and threshold are already calibrated.
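For readers unfamiliar with overlap metrics, the sketch below computes a simple unigram-overlap F1 between a reference and a candidate. It is a simplified illustration in the spirit of ROUGE-1, not the code behind RougeEvaluator or TextF1Evaluator; lowercasing and whitespace tokenization are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Word counts after lowercasing and splitting on spaces (assumed tokenizer).
Dictionary<string, int> Counts(string text) =>
    text.ToLowerInvariant()
        .Split(' ', StringSplitOptions.RemoveEmptyEntries)
        .GroupBy(w => w)
        .ToDictionary(g => g.Key, g => g.Count());

// Unigram overlap F1: a shared word counts at most as often as it appears
// in both texts (clipped counts).
double UnigramF1(string reference, string candidate)
{
    var refCounts = Counts(reference);
    var candCounts = Counts(candidate);
    int overlap = candCounts.Sum(kv => Math.Min(kv.Value, refCounts.GetValueOrDefault(kv.Key)));
    if (overlap == 0) return 0.0;
    double precision = (double)overlap / candCounts.Values.Sum();
    double recall = (double)overlap / refCounts.Values.Sum();
    return 2 * precision * recall / (precision + recall);
}

// Five of six words overlap in each direction.
Console.WriteLine(UnigramF1("the cat sat on the mat", "the cat lay on the mat"));
```

A paraphrase with the same meaning but different words would score low here, which is why these metrics fit summaries and extraction better than open-ended answers.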
LLM Judges
Use judge evaluators when deterministic checks cannot express the quality bar. They require judge configuration through EvalJudgeConfig, UseEvalJudgeConfig, or UseEvalJudgeAgent.
- AnswerSimilarityEvaluator compares the response to GroundTruthContext.
- GoalAccuracyEvaluator scores goal achievement from ground truth or turn context.
- TaskSuccessEvaluator returns a boolean for whether the user task appears complete.
- AspectCriticEvaluator is the fastest custom boolean rubric.
- CustomJudgeEvaluator is a custom 0-1 rubric.
- TopicAdherenceEvaluator scores whether the response stays within allowed topics.
var report = await RunEvals.ExecuteAsync(
    agent,
    dataset,
    evaluators:
    [
        new AspectCriticEvaluator("The answer is correct, concise, and cites the retrieved source."),
        new AnswerSimilarityEvaluator(),
    ],
    options: new RunEvalsOptions<string>
    {
        JudgeConfig = new EvalJudgeConfig
        {
            OverrideChatClient = judgeChatClient,
        },
    });
Retrieval And Hallucination
Use these when the agent is grounded in retrieved or supplied documents.
- ContextRelevanceEvaluator judges whether retrieved context helps answer the query.
- HallucinationEvaluator extracts factual claims from the output, verifies them against grounding documents, and scores contradicted claims divided by total claims. Unsupported claims are not counted as hallucinations.
- ContextRecallEvaluator extracts factual claims from ground truth and checks whether the grounding context supports them.
- ContextPrecisionEvaluator scores ranked grounding chunks with a mean-average-precision style aggregate.
These require a judge model and relevant contexts such as GroundingDocumentContext and, for recall, GroundTruthContext.
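The hallucination scoring rule described above (contradicted claims over total claims, with unsupported claims not counted as hallucinations) reduces to a small ratio once the judge has labeled each claim. In this sketch the verdict labels and the zero score for an empty claim list are assumptions; the judge step itself is not modeled.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Verdicts a judge might assign to each extracted claim (labels are hypothetical).
var verdicts = new List<string> { "supported", "contradicted", "unsupported", "supported" };

// Contradicted claims divided by total claims. Unsupported claims stay in
// the denominator but are not counted as hallucinations.
double HallucinationScore(IReadOnlyCollection<string> claimVerdicts) =>
    claimVerdicts.Count == 0
        ? 0.0 // assumption: no claims means nothing was contradicted
        : (double)claimVerdicts.Count(v => v == "contradicted") / claimVerdicts.Count;

// One contradicted claim out of four.
Console.WriteLine(HallucinationScore(verdicts));
```

Note the consequence of this definition: an answer full of unsupported but uncontradicted claims still scores 0, so pair this metric with a relevance or recall metric when unsupported content matters.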
Safety And Policy
Safety evaluators are JSON judges. Each returns a numeric score on a 0-7 scale plus a boolean ... Passed metric after applying SafetyPolicy. The judge prompt includes recent conversation, the assistant response, HPD tool calls/results, and reasoning text when present.
Use the specific class when the risk is known:
- Harm categories: ContentHarmEvaluator, HateHarassmentEvaluator, ViolenceSafetyEvaluator, SelfHarmSafetyEvaluator, SexualContentSafetyEvaluator.
- Prompt and instruction attacks: PromptInjectionEvaluator, JailbreakAttemptEvaluator.
- Data and IP risks: SensitiveDataLeakEvaluator, ProtectedMaterialEvaluator.
- Code and human-attribute risks: CodeSecurityRiskEvaluator, UngroundedSensitiveAttributeEvaluator.
- Product policy: PolicyComplianceEvaluator.
evaluators:
[
    new PromptInjectionEvaluator(),
    new SensitiveDataLeakEvaluator(),
    new CodeSecurityRiskEvaluator(new SafetyPolicy { BlockThreshold = 4.0 }),
]
Keep early safety runs as trend/review signals. Promote the ... Passed metric to a gate only after the product policy, block threshold, and escalation workflow are explicit.
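The sketch below shows why the block threshold has to be owned before the boolean becomes a gate: the same observed scores pass or fail depending on where the threshold sits. The rule that a score passes while it is strictly below the threshold is an assumption, not the documented SafetyPolicy behavior, and `PassesPolicy` is a hypothetical helper.

```csharp
using System;

// Assumed rule: a severity score on the 0-7 scale passes while it stays
// strictly below the block threshold. The real comparison may differ.
bool PassesPolicy(double score, double blockThreshold)
{
    if (score < 0 || score > 7) throw new ArgumentOutOfRangeException(nameof(score));
    return score < blockThreshold;
}

// Same scores, two candidate thresholds.
double[] observed = { 1.0, 3.5, 4.0, 6.0 };
foreach (var threshold in new[] { 4.0, 5.0 })
{
    int blocked = 0;
    foreach (var s in observed)
    {
        if (!PassesPolicy(s, threshold)) blocked++;
    }
    Console.WriteLine($"threshold {threshold}: {blocked} of {observed.Length} blocked");
}
```

Running early evaluations as trends gives the team the score distribution needed to pick a threshold like this deliberately.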
Conversation, Memory, Reasoning, Risk
These are useful signals for live evaluation and longer-running agents. Most should remain trend metrics.
- Conversation: ConversationCoherenceEvaluator, GoalProgressionEvaluator, RepetitionDetectionEvaluator, and MemoryAccuracyEvaluator.
- Reasoning: ReasoningCoherenceEvaluator, ReasoningGroundednessEvaluator, and ReasoningEfficiencyEvaluator. They depend on reasoning context or captured turn reasoning text.
- Risk/autonomy: TurnRiskEvaluator judges potential harm on a 1-10 scale; TurnAutonomyEvaluator deterministically scores autonomy from iteration count, permission-denied rate, stop kind, and duration.
- Specialty: SqlSemanticEquivalenceEvaluator judges whether generated SQL is equivalent to reference SQL for a schema; NoiseSensitivityEvaluator compares a noisy-response run against a baseline response.
Use these to compare prompts, models, tool permissions, or releases. Do not use them as hard gates until the team has calibrated scores against reviewed examples.
Composite Helpers
Use composition after individual metrics are already useful.
- ThresholdGate converts a numeric metric to a boolean gate.
- WeightedScoreEvaluator averages numeric sub-evaluator scores and excludes inconclusive or failed sub-evaluators from the weighted average.
- NotEvaluator inverts the primary metric of another evaluator.
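Excluding inconclusive sub-scores changes the arithmetic of a weighted average, because the dropped weight must leave the denominator too. The sketch below illustrates that under stated assumptions: a null score stands in for an inconclusive or failed sub-evaluator, and returning null when nothing is usable is a choice made here, not documented WeightedScoreEvaluator behavior.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Weighted mean over sub-scores. A null score marks an inconclusive or
// failed sub-evaluator, which is dropped together with its weight.
double? WeightedScore(IEnumerable<(double? Score, double Weight)> parts)
{
    var usable = parts.Where(p => p.Score.HasValue).ToList();
    double totalWeight = usable.Sum(p => p.Weight);
    if (totalWeight <= 0) return null; // nothing conclusive to average
    return usable.Sum(p => p.Score.GetValueOrDefault() * p.Weight) / totalWeight;
}

var parts = new List<(double? Score, double Weight)>
{
    (0.9, 2.0),
    (null, 5.0), // inconclusive: ignored, weight 5 does not dilute the result
    (0.6, 1.0),
};

// (0.9 * 2 + 0.6 * 1) / 3
Console.WriteLine(WeightedScore(parts));
```

If the inconclusive entry were treated as 0 instead, the heavy weight would drag the aggregate down and hide the fact that the sub-evaluator never produced a score.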
evaluators:
[
    new ThresholdGate(new RougeEvaluator(reference, RougeVariant.RougeL), 0.75),
    new NotEvaluator(new OutputContainsEvaluator("internal_secret")),
]
Reference-Only Notes
This page intentionally does not fully document base classes such as HpdEvaluatorBase, HpdLlmJudgeEvaluatorBase, HpdJsonJudgeEvaluatorBase, HpdDecomposeVerifyEvaluatorBase, or TaskOracleEvaluator. They matter for authors of new evaluators, but they are not first-use picker material.
The source includes TaskOracleEvaluator for custom task-specific oracle subclasses. Treat it as an extension point until there is a documented public oracle example.
Long-tail judge metrics such as reasoning, memory, noise sensitivity, and SQL semantic equivalence should be considered reference-only until calibrated with reviewed datasets. Their class names and prompts are source-grounded, but recommended thresholds are intentionally not supplied here.
Source Grounding
This page is based on:
- src/HPD-Agent.Evaluations/Evaluators/Deterministic/OutputEvaluators.cs
- src/HPD-Agent.Evaluations/Evaluators/Deterministic/StructuredOutputEvaluators.cs
- src/HPD-Agent.Evaluations/Evaluators/Deterministic/ToolCallEvaluators.cs
- src/HPD-Agent.Evaluations/Evaluators/Deterministic/PerformanceEvaluators.cs
- src/HPD-Agent.Evaluations/Evaluators/Deterministic/TurnAutonomyEvaluator.cs
- src/HPD-Agent.Evaluations/Evaluators/Deterministic/SpanQueryEvaluator.cs
- src/HPD-Agent.Evaluations/Evaluators/Nlp/NlpEvaluators.cs
- src/HPD-Agent.Evaluations/Evaluators/Composite/CompositeEvaluators.cs
- src/HPD-Agent.Evaluations/Evaluators/LlmJudge/*.cs
- src/HPD-Agent.Evaluations/Evaluators/Safety/*.cs
- evaluator tests under test/HPD-Agent.Evaluations.Tests/Evaluators