LLM Expect Documentation¶
Welcome to LLM Expect - pytest for LLMs.
Quick Links¶
- Installation
- Quickstart: Get started with LLM Expect in minutes.
- Examples
- API Reference
Welcome to the documentation for LLM Expect, a minimalist Python SDK for evaluating LLM functions.¶
LLM Expect is a minimalist, developer-first SDK for testing LLM-powered Python functions using structured JSONL datasets.
⚡ 5-Minute Quickstart¶
1. Create a test file (tests.jsonl)
{"id": "test1", "input": "What is 2+2?", "expected": {"reference": "4"}}
{"id": "test2", "input": "Say hello", "expected": {"contains": ["hello"]}}
2. Decorate your function (main.py)
import os
from anthropic import Anthropic
from llm_expect import llm_expect
# Initialize Anthropic client
client = Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))
def call_llm(prompt):
message = client.messages.create(
model="claude-3-haiku-20240307",
max_tokens=100,
messages=[{"role": "user", "content": prompt}]
)
return message.content[0].text
@llm_expect(dataset="tests.jsonl")
def generate(prompt: str):
return call_llm(prompt)
if __name__ == "__main__":
# Run evaluation
generate.run_eval()
3. Run it
Output:
❌ What LLM-Expect Does Not Do¶
To build trust, we want to be clear about what this tool is not:
- Not an Agent Framework: We don't help you build agents (use LangChain/LlamaIndex for that). We test them.
- Not a Load Testing Tool: We verify correctness, not latency under load.
- Not a Data Generator: You bring the dataset (or use our upcoming builder).
- Not a Experiment Tracker: We save results locally. No dashboards, no login, no cloud.
- Not a Judge Provider: We use your existing keys (OpenAI/Anthropic) to run judge evaluations.
🍲 Common Recipes¶
1. Schema Validation¶
Ensure your LLM returns valid JSON for function calling.
{"id": "json1", "input": "Extract: John is 30", "expected": {"schema": {"required": ["name", "age"]}}}
2. LLM-as-a-Judge¶
Use GPT-4 to score fuzzy outputs (e.g., tone, creativity).
{"id": "story1", "input": "Write a poem", "expected": {"judge": {"prompt": "Is this poem rhyming and creative?"}}}
3. Safety Checks¶
Ensure your model refuses harmful prompts.
Note: If the model refuses ("I cannot help"), it passes.🚧 Dataset Builder (Coming Soon)¶
Prefer not to hand-write JSONL? A simple, optional visual dataset builder is coming soon to help you craft test cases via UI.
from llm_expect import llm_expect
@llm_expect(dataset="tests.jsonl")
def generate(prompt: str) -> dict:
# Your LLM function
return {"response": "..."}
Features¶
- Simple Decorator: Just add
@llm_expectto any function - Structured Testing: Use JSONL datasets with expected outputs
- Multiple Metrics: Accuracy, schema validation, safety, custom judges
- LLM-as-Judge: Built-in support for OpenAI, Anthropic, Bedrock
- Session Grouping: Organize results by test run
- Rich Reports: JSON, JSONL, and human-readable text reports