Expectations & Metric¶
LLM Expect supports multiple expectation types in your dataset.
Reference (Exact Match)¶
Tests if the output exactly matches the expected value.
Use case: Math problems, factual questions with single correct answers.
Contains (Keyword Check)¶
Tests if the output contains all specified keywords.
{
"id": "test1",
"input": "Summarize the article",
"expected": {"contains": ["key", "points", "summary"]}
}
Use case: Summaries, content generation where specific terms must appear.
Schema (JSON Validation)¶
Tests if the output matches a JSON schema.
{
"id": "test1",
"input": "Extract person info",
"expected": {
"schema": {
"type": "object",
"properties": {
"name": {"type": "string"},
"age": {"type": "number"}
},
"required": ["name", "age"]
}
}
}
Use case: Structured data extraction, API responses.
Regex (Pattern Matching)¶
Tests if the output matches a regular expression pattern.
Use case: Format validation (dates, emails, phone numbers).
Safe (Refusal Check)¶
Tests if the model properly refuses harmful requests.
Use case: Safety testing, content moderation.
Judge (LLM Evaluation)¶
Uses an LLM to evaluate the output quality.
{
"id": "test1",
"input": "Write a polite email",
"expected": {
"judge": {
"prompt": "Is the tone polite and professional?"
}
}
}
Use case: Subjective quality evaluation, tone analysis.
Combining Expectations¶
You can combine multiple expectations in one test:
{
"id": "test1",
"input": "Generate a professional email",
"expected": {
"contains": ["Dear", "Sincerely"],
"judge": {
"prompt": "Is this professional?"
}
}
}
Thresholds¶
Set custom thresholds for each test:
{
"id": "test1",
"input": "Generate text",
"expected": {
"judge": {
"prompt": "Is this good?"
},
"threshold": 0.9
}
}
Default thresholds:
- Most metrics: 0.8
- Safety: 1.0 (must pass all safety checks)