API Reference¶

This reference documents the core components of the llm-expect library.

Core Decorator¶

The main entry point is the @llm_expect decorator.

Usage Example¶

from llm_expect import llm_expect

@llm_expect(
    dataset="tests.jsonl",
    tests=["accuracy", "safety"],
    thresholds={"accuracy": 0.9}
)
def my_llm_function(prompt: str) -> str:
    # ... implementation ...
    return "response"

# Run the evaluation
results = my_llm_function.run_eval()

`llm_expect.decorator` ¶

Core @llm_expect decorator and Runner implementation.

Provides the main decorator interface and evaluation runner that coordinates all components to evaluate LLM functions.

Functions¶

`llm_expect(dataset, tests=None, thresholds=None, judge_provider=None, judge_model=None, sample_size=None, shuffle=False, cache=True, cache_dir=None, results_dir=None, fail_fast=False, timeout=60, **kwargs)` ¶

Decorator to add evaluation capabilities to LLM functions.

Parameters:

Name	Type	Description	Default
`dataset`	`str`	Path to JSONL dataset file	required
`tests`	`Optional[List[str]]`	List of metrics to evaluate (default: ["accuracy"])	`None`
`thresholds`	`Optional[Dict[str, float]]`	Threshold values for pass/fail (default: {"accuracy": 0.8})	`None`
`judge_provider`	`Optional[str]`	LLM judge provider ("openai", "anthropic", "bedrock")	`None`
`judge_model`	`Optional[str]`	Model name for judge evaluation	`None`
`sample_size`	`Optional[int]`	Number of examples to sample from dataset	`None`
`shuffle`	`bool`	Whether to shuffle examples before sampling	`False`
`cache`	`bool`	Whether to cache evaluation results	`True`
`cache_dir`	`Optional[str]`	Directory for cache files	`None`
`results_dir`	`Optional[str]`	Directory to save evaluation results	`None`
`fail_fast`	`bool`	Stop evaluation on first test failure	`False`
`timeout`	`int`	Function execution timeout in seconds	`60`
`**kwargs`		Additional configuration parameters	`{}`
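
How `sample_size` and `shuffle` interact can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's loader: `load_examples`, its `seed` parameter, and the `input`/`expected` field names in the sample rows are hypothetical, and taking the first `sample_size` rows after an optional shuffle is only one plausible reading of "shuffle examples before sampling".

```python
import json
import random
from typing import Any, Dict, List, Optional


def load_examples(
    lines: List[str],
    sample_size: Optional[int] = None,
    shuffle: bool = False,
    seed: Optional[int] = None,
) -> List[Dict[str, Any]]:
    """Parse JSONL lines, optionally shuffle, then keep the first sample_size."""
    # JSONL stores one JSON object per line; blank lines are skipped.
    examples = [json.loads(line) for line in lines if line.strip()]
    if shuffle:
        # A seeded Random keeps the order reproducible (hypothetical parameter).
        random.Random(seed).shuffle(examples)
    if sample_size is not None:
        examples = examples[:sample_size]
    return examples


# Hypothetical dataset rows; the real dataset schema is not documented here.
rows = [
    '{"input": "2+2", "expected": {"answer": "4"}}',
    '{"input": "3+3", "expected": {"answer": "6"}}',
    '{"input": "4+4", "expected": {"answer": "8"}}',
]
subset = load_examples(rows, sample_size=2)
print([row["input"] for row in subset])  # prints ['2+2', '3+3']
```

Without shuffling, a sample is always the head of the file, so enabling `shuffle` is what makes a small `sample_size` representative of the whole dataset.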

Returns:

Type	Description
`Callable[[F], LLMExpectFunction]`	Decorated function with evaluation capabilities

Example

@llm_expect(dataset="tests.jsonl")
def my_llm_function(prompt: str) -> dict:
    return call_llm(prompt)

# Normal usage
result = my_llm_function("Hello")

# Run evaluation
eval_results = my_llm_function.run_eval()
print(f"Passed: {eval_results['passed']}")

Source code in llm_expect/decorator.py

def llm_expect(
    dataset: str,
    tests: Optional[List[str]] = None,
    thresholds: Optional[Dict[str, float]] = None,
    judge_provider: Optional[str] = None,
    judge_model: Optional[str] = None,
    sample_size: Optional[int] = None,
    shuffle: bool = False,
    cache: bool = True,
    cache_dir: Optional[str] = None,
    results_dir: Optional[str] = None,
    fail_fast: bool = False,
    timeout: int = 60,
    **kwargs
) -> Callable[[F], LLMExpectFunction]:
    """
    Decorator to add evaluation capabilities to LLM functions.

    Args:
        dataset: Path to JSONL dataset file
        tests: List of metrics to evaluate (default: ["accuracy"])
        thresholds: Threshold values for pass/fail (default: {"accuracy": 0.8})
        judge_provider: LLM judge provider ("openai", "anthropic", "bedrock")
        judge_model: Model name for judge evaluation
        sample_size: Number of examples to sample from dataset
        shuffle: Whether to shuffle examples before sampling
        cache: Whether to cache evaluation results
        cache_dir: Directory for cache files
        results_dir: Directory to save evaluation results
        fail_fast: Stop evaluation on first test failure
        timeout: Function execution timeout in seconds
        **kwargs: Additional configuration parameters

    Returns:
        Decorated function with evaluation capabilities

    Example:
        @llm_expect(dataset="tests.jsonl")
        def my_llm_function(prompt: str) -> dict:
            return call_llm(prompt)

        # Normal usage
        result = my_llm_function("Hello")

        # Run evaluation
        eval_results = my_llm_function.run_eval()
        print(f"Passed: {eval_results['passed']}")
    """

    def decorator(func: F) -> LLMExpectFunction:
        # Create configuration
        config = config_manager.create_config(
            dataset=dataset,
            tests=tests,
            thresholds=thresholds,
            judge_provider=judge_provider,
            judge_model=judge_model,
            sample_size=sample_size,
            shuffle=shuffle,
            cache=cache,
            cache_dir=cache_dir,
            results_dir=results_dir,
            fail_fast=fail_fast,
            timeout=timeout,
            **kwargs
        )

        return LLMExpectFunction(func, config)

    return decorator
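
The source above follows the decorator-factory pattern: the outer call captures settings, and the inner `decorator` wraps the function in an object that stays callable while gaining a `run_eval()` method. A minimal stdlib sketch of that pattern follows. `EvalWrapper` and `expect` are hypothetical stand-ins for `LLMExpectFunction` and `llm_expect`, the exact-match scoring is an illustrative choice, and `run_eval` takes its cases as an argument here only because the sketch has no dataset file to read.

```python
from typing import Any, Callable, Dict, List


class EvalWrapper:
    """Hypothetical stand-in for LLMExpectFunction: callable, plus run_eval()."""

    def __init__(self, func: Callable[[str], Any], config: Dict[str, Any]) -> None:
        self.func = func
        self.config = config

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        # Normal calls pass straight through to the wrapped function.
        return self.func(*args, **kwargs)

    def run_eval(self, cases: List[Dict[str, str]]) -> Dict[str, Any]:
        # Exact-match scoring is an assumption made for this sketch only.
        hits = sum(self.func(case["input"]) == case["expected"] for case in cases)
        score = hits / len(cases) if cases else 0.0
        return {"score": score, "passed": score >= self.config["threshold"]}


def expect(threshold: float = 0.8) -> Callable[[Callable[[str], Any]], EvalWrapper]:
    """Decorator factory: capture settings now, wrap the function later."""

    def decorator(func: Callable[[str], Any]) -> EvalWrapper:
        return EvalWrapper(func, {"threshold": threshold})

    return decorator


@expect(threshold=0.5)
def shout(prompt: str) -> str:
    return prompt.upper()


print(shout("hi"))  # prints HI
cases = [{"input": "a", "expected": "A"}, {"input": "b", "expected": "x"}]
print(shout.run_eval(cases))  # prints {'score': 0.5, 'passed': True}
```

Returning a wrapper object rather than a plain function is what lets the decorated name serve both as the original function and as the handle for running evaluations.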

Data Models¶

These models define the structure of inputs and results.

`llm_expect.models` ¶

Pydantic models for LLM Expect configuration and data structures. Provides type-safe models with validation for all core LLM Expect data structures.

Classes¶

`EvaluationResult` ¶

Bases: BaseModel

Complete result of an evaluation run.

Source code in llm_expect/models.py

class EvaluationResult(BaseModel):
    """Complete result of an evaluation run."""

    # Basic info
    function_name: str = Field(..., description="Name of the evaluated function")
    dataset_path: str = Field(..., description="Path to the dataset file")
    timestamp: datetime = Field(default_factory=datetime.now, description="When evaluation ran")

    # Configuration used
    config: 'LLMExpectConfig' = Field(..., description="Configuration used for evaluation")

    # Results
    tests: List[TestResult] = Field(..., description="Individual test results")
    summary: EvaluationSummary = Field(..., description="Summary statistics")
    passed: bool = Field(..., description="Whether overall evaluation passed")

    # Metadata
    run_id: str = Field(..., description="Unique identifier for this run")
    run_dir: Optional[str] = Field(
        default=None, 
        description="Directory where results are saved"
    )

    model_config = ConfigDict(extra="forbid")

`EvaluationSummary` ¶

Bases: BaseModel

Summary statistics for an evaluation run.

Source code in llm_expect/models.py

class EvaluationSummary(BaseModel):
    """Summary statistics for an evaluation run."""

    total_tests: int = Field(..., ge=0, description="Total number of tests")
    passed_tests: int = Field(..., ge=0, description="Number of tests that passed")
    failed_tests: int = Field(..., ge=0, description="Number of tests that failed")
    error_tests: int = Field(..., ge=0, description="Number of tests with errors")
    success_rate: float = Field(..., ge=0.0, le=1.0, description="Overall success rate")

    # Metric aggregations
    metrics: Dict[str, Dict[str, float]] = Field(
        default_factory=dict,
        description="Aggregated metric statistics (mean, std, min, max)"
    )

    # Timing information
    total_time: float = Field(..., ge=0.0, description="Total execution time in seconds")
    avg_time_per_test: float = Field(..., ge=0.0, description="Average time per test")

    @model_validator(mode='after')
    def validate_test_totals(self):
        total = self.total_tests
        passed = self.passed_tests
        failed = self.failed_tests
        error = self.error_tests

        if passed > total or failed > total or error > total:
            raise ValueError("Individual test counts cannot exceed total tests")

        if passed + failed + error != total:
            raise ValueError("Sum of passed, failed, and error tests must equal total tests")

        return self

    model_config = ConfigDict(extra="forbid")
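
The count invariant enforced by `validate_test_totals` can be tried out without Pydantic. The sketch below mirrors it with a stdlib dataclass; `SummaryCounts` is a hypothetical name and covers only the four count fields, not the full model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SummaryCounts:
    """Stdlib stand-in for the count fields of EvaluationSummary."""

    total_tests: int
    passed_tests: int
    failed_tests: int
    error_tests: int

    def __post_init__(self) -> None:
        counts = (self.total_tests, self.passed_tests, self.failed_tests, self.error_tests)
        # Mirrors the ge=0 constraint on each field.
        if any(count < 0 for count in counts):
            raise ValueError("Counts must be non-negative")
        # Mirrors the model validator: the three outcomes partition the total.
        if self.passed_tests + self.failed_tests + self.error_tests != self.total_tests:
            raise ValueError("Sum of passed, failed, and error tests must equal total tests")


ok = SummaryCounts(total_tests=10, passed_tests=7, failed_tests=2, error_tests=1)

try:
    SummaryCounts(total_tests=10, passed_tests=7, failed_tests=2, error_tests=0)
except ValueError as exc:
    print(f"rejected: {exc}")
```

Because every count is non-negative, the sum check already implies that no single count exceeds the total; the separate check in the model mainly yields a more specific error message.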

`TestResult` ¶

Bases: BaseModel

Result of evaluating a single test case.

Source code in llm_expect/models.py

class TestResult(BaseModel):
    """Result of evaluating a single test case."""

    test_id: str = Field(..., description="ID of the test case")
    input: Union[str, int, float, bool, Dict[str, Any]] = Field(
        ..., 
        description="Input that was passed to the function"
    )
    expected: Dict[str, Any] = Field(..., description="Expected validation criteria")
    actual: Optional[Any] = Field(
        default=None, 
        description="Actual function output"
    )
    error: Optional[str] = Field(
        default=None, 
        description="Error message if function execution failed"
    )
    metrics: List[MetricResult] = Field(
        default_factory=list, 
        description="Metric results for this test"
    )
    passed: bool = Field(..., description="Whether all metrics passed")
    execution_time: float = Field(..., ge=0.0, description="Execution time in seconds")

    @field_validator('metrics')
    @classmethod
    def validate_metrics(cls, v):
        # Ensure metric names are unique
        names = [metric.name for metric in v]
        if len(names) != len(set(names)):
            raise ValueError("Metric names must be unique within a test result")
        return v

    model_config = ConfigDict(extra="forbid")
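
The `validate_metrics` check compares the length of the name list with the length of its set, which detects duplicates but does not say which names collided. If that detail is wanted in an error message, a small helper along these lines could supply it; `duplicate_names` is a hypothetical function, not part of the library.

```python
from collections import Counter
from typing import List


def duplicate_names(names: List[str]) -> List[str]:
    """Return the names that occur more than once, in first-seen order."""
    counts = Counter(names)
    return [name for name, occurrences in counts.items() if occurrences > 1]


print(duplicate_names(["accuracy", "safety", "accuracy"]))  # prints ['accuracy']
```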

`MetricResult` ¶

Bases: BaseModel

Result of a single metric calculation.

Source code in llm_expect/models.py

class MetricResult(BaseModel):
    """Result of a single metric calculation."""

    name: str = Field(..., description="Name of the metric")
    score: float = Field(..., ge=0.0, le=1.0, description="Metric score (0.0 to 1.0)")
    passed: bool = Field(..., description="Whether this metric passed the threshold")
    threshold: float = Field(..., ge=0.0, le=1.0, description="Threshold used")
    details: Optional[Dict[str, Any]] = Field(
        default=None, 
        description="Additional metric-specific details"
    )

    model_config = ConfigDict(extra="forbid")

`LLMExpectConfig` ¶

Bases: BaseModel

Main configuration for LLM Expect evaluation.

Source code in llm_expect/models.py

class LLMExpectConfig(BaseModel):
    """Main configuration for LLM Expect evaluation."""

    # Dataset configuration
    dataset: str = Field(..., description="Path to the dataset file")
    sample_size: Optional[int] = Field(
        default=None, 
        ge=1, 
        description="Number of examples to sample (None for all)"
    )
    shuffle: bool = Field(
        default=False, 
        description="Whether to shuffle examples before sampling"
    )

    # Metrics configuration
    tests: List[str] = Field(
        default_factory=list, 
        description="List of metrics to evaluate (empty for auto-selection)"
    )
    thresholds: Dict[str, float] = Field(
        default_factory=lambda: {"accuracy": 0.8},
        description="Threshold values for each metric"
    )

    # Judge configuration (for LLM-based metrics)
    judge: Optional[JudgeConfig] = Field(
        default=None,
        description="Judge configuration for LLM-based evaluation"
    )

    # Caching
    cache: bool = Field(
        default=True, 
        description="Whether to cache evaluation results"
    )
    cache_dir: str = Field(
        default=".llm_expect_cache", 
        description="Directory for cache files"
    )

    # Output configuration
    results_dir: str = Field(
        default="runs", 
        description="Directory to save evaluation results"
    )
    save_results: bool = Field(
        default=True, 
        description="Whether to save detailed results to disk"
    )

    # Execution configuration
    fail_fast: bool = Field(
        default=False, 
        description="Stop evaluation on first test failure"
    )
    parallel: bool = Field(
        default=False, 
        description="Run evaluations in parallel (future)"
    )
    timeout: int = Field(
        default=60, 
        ge=1, 
        le=3600, 
        description="Timeout per function call in seconds"
    )

    @field_validator('tests')
    @classmethod
    def validate_tests(cls, v):
        if not v:  # Changed from "if v is None"
            return v
        valid_tests = {
            "accuracy", "schema_fidelity", "instruction_adherence", 
            "safety", "semantic_similarity", "custom_judge"  # Added custom_judge
        }
        invalid = set(v) - valid_tests
        if invalid:
            raise ValueError(f"Invalid tests: {invalid}. Valid tests: {valid_tests}")
        return v

    @field_validator('dataset')
    @classmethod
    def validate_dataset(cls, v):
        if not v.endswith('.jsonl'):
            raise ValueError("Dataset file must have .jsonl extension")
        return v

    @model_validator(mode='after')
    def validate_thresholds(self):
        # Ensure thresholds exist for all tests that need them
        if self.tests:
            for test in self.tests:
                if test not in self.thresholds:
                    self.thresholds[test] = 0.8  # Default threshold

        # Validate threshold values
        for test, threshold in self.thresholds.items():
            if not 0.0 <= threshold <= 1.0:
                raise ValueError(f"Threshold for {test} must be between 0.0 and 1.0")

        return self

    model_config = ConfigDict(extra="forbid")
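
The effect of `validate_thresholds` is easiest to see on concrete input: any requested test without an explicit threshold receives 0.8, and every threshold must lie in [0.0, 1.0]. The following sketch reproduces that logic as a plain function. `resolve_thresholds` is a hypothetical name, and treating a missing `thresholds` argument as the model default `{"accuracy": 0.8}` is an assumption about how the decorator's `None` is handled.

```python
from typing import Dict, List, Optional

DEFAULT_THRESHOLD = 0.8


def resolve_thresholds(
    tests: List[str],
    thresholds: Optional[Dict[str, float]] = None,
) -> Dict[str, float]:
    """Fill in missing thresholds, then range-check all of them."""
    # Copy so the caller's dict is not mutated.
    resolved = dict(thresholds) if thresholds is not None else {"accuracy": DEFAULT_THRESHOLD}
    for test in tests:
        resolved.setdefault(test, DEFAULT_THRESHOLD)
    for test, value in resolved.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"Threshold for {test} must be between 0.0 and 1.0")
    return resolved


print(resolve_thresholds(["accuracy", "safety"], {"accuracy": 0.9}))
# prints {'accuracy': 0.9, 'safety': 0.8}
```

Applied to the usage example at the top of this page, `safety` would therefore be evaluated against 0.8 even though only `accuracy` was given an explicit threshold.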

Configuration¶

`llm_expect.config` ¶

Configuration management for LLM Expect.

Handles loading configuration from various sources (environment, files, parameters) with proper validation and defaults.

Classes¶

`LLMExpectConfig` ¶

Bases: BaseModel

Main configuration for LLM Expect evaluation.

Defined in llm_expect/models.py. This is the same class documented under `llm_expect.models` above, where the full source is listed.
