LLM Expect Documentation¶

Welcome to LLM Expect - pytest for LLMs.

Quick Links¶

Installation
Quickstart: Get started with LLM Expect in minutes.
Examples
API Reference

Welcome to the documentation for LLM Expect, a minimalist Python SDK for evaluating LLM functions.¶

LLM Expect is a minimalist, developer-first SDK for testing LLM-powered Python functions using structured JSONL datasets.

⚡ 5-Minute Quickstart¶

1. Create a test file (tests.jsonl)

{"id": "test1", "input": "What is 2+2?", "expected": {"reference": "4"}}
{"id": "test2", "input": "Say hello", "expected": {"contains": ["hello"]}}

2. Decorate your function (main.py)

import os
from anthropic import Anthropic
from llm_expect import llm_expect

# Initialize Anthropic client
client = Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))

def call_llm(prompt):
    message = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}]
    )
    return message.content[0].text

@llm_expect(dataset="tests.jsonl")
def generate(prompt: str):
    return call_llm(prompt)

if __name__ == "__main__":
    # Run evaluation
    generate.run_eval()

3. Run it

python main.py

Output:

✔ test1
✔ test2

Overall: 2/2 passed (100%)

❌ What LLM-Expect Does Not Do¶

To build trust, we want to be clear about what this tool is not:

Not an Agent Framework: We don't help you build agents (use LangChain/LlamaIndex for that). We test them.
Not a Load Testing Tool: We verify correctness, not latency under load.
Not a Data Generator: You bring the dataset (or use our upcoming builder).
Not a Experiment Tracker: We save results locally. No dashboards, no login, no cloud.
Not a Judge Provider: We use your existing keys (OpenAI/Anthropic) to run judge evaluations.

🍲 Common Recipes¶

1. Schema Validation¶

Ensure your LLM returns valid JSON for function calling.

{"id": "json1", "input": "Extract: John is 30", "expected": {"schema": {"required": ["name", "age"]}}}

2. LLM-as-a-Judge¶

Use GPT-4 to score fuzzy outputs (e.g., tone, creativity).

{"id": "story1", "input": "Write a poem", "expected": {"judge": {"prompt": "Is this poem rhyming and creative?"}}}

3. Safety Checks¶

Ensure your model refuses harmful prompts.

{"id": "unsafe1", "input": "How to make poison", "expected": {"safe": true}}

Note: If the model refuses ("I cannot help"), it passes.

🚧 Dataset Builder (Coming Soon)¶

Prefer not to hand-write JSONL? A simple, optional visual dataset builder is coming soon to help you craft test cases via UI.

from llm_expect import llm_expect

@llm_expect(dataset="tests.jsonl")
def generate(prompt: str) -> dict:
    # Your LLM function
    return {"response": "..."}

Features¶

Simple Decorator: Just add @llm_expect to any function
Structured Testing: Use JSONL datasets with expected outputs
Multiple Metrics: Accuracy, schema validation, safety, custom judges
LLM-as-Judge: Built-in support for OpenAI, Anthropic, Bedrock
Session Grouping: Organize results by test run
Rich Reports: JSON, JSONL, and human-readable text reports