Runner Module

The runner module provides test suite functionality for executing and validating JSON extraction tests across multiple models and drivers.

Overview

The runner module enables:

  • Batch Testing: Execute extraction tests across multiple models in a single run

  • Validation: Automatic validation of extraction results against expected outputs

  • Comparison: Compare performance and accuracy across different LLM providers

  • Reporting: Generate detailed test results and performance metrics

Main Functions

run_suite_from_spec()

Execute a test suite specification against multiple drivers and collect comprehensive results.

Features:

  • Multi-model Testing: Run the same tests across different LLM providers

  • Automatic Validation: Compare extraction results against expected outputs

  • Error Handling: Graceful handling of driver failures and extraction errors

  • Performance Metrics: Collect timing and token usage data

  • Detailed Reporting: Generate comprehensive test result summaries

Test Suite Specification Format:

test_spec = {
    "name": "Person Extraction Test Suite",
    "description": "Test extraction of person information",
    "tests": [
        {
            "name": "Basic Person Info",
            "input_text": "John Smith is 25 years old",
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"}
                }
            },
            "expected": {
                "name": "John Smith",
                "age": 25
            }
        }
    ]
}

Example Usage:

from prompture.runner import run_suite_from_spec
from prompture.drivers import get_driver_for_model

# Prepare drivers for testing
drivers = {
    "gpt-4": get_driver_for_model("openai/gpt-4"),
    "claude": get_driver_for_model("anthropic/claude-3-sonnet-20240229"),
    "gemini": get_driver_for_model("google/gemini-pro")
}

# Run the test suite
results = run_suite_from_spec(test_spec, drivers)

# Analyze results
for driver_name, driver_results in results.items():
    print(f"Results for {driver_name}:")
    print(f"  Tests passed: {driver_results['passed']}/{driver_results['total']}")
    print(f"  Average accuracy: {driver_results['accuracy']:.2%}")

Return Value Structure:

{
    "driver_name": {
        "passed": 8,           # Number of tests passed
        "total": 10,           # Total number of tests
        "accuracy": 0.85,      # Overall accuracy score
        "avg_time": 1.23,      # Average response time (seconds)
        "total_tokens": 1500,  # Total tokens used
        "total_cost": 0.045,   # Total estimated cost (USD)
        "test_results": [      # Individual test results
            {
                "test_name": "Basic Person Info",
                "passed": True,
                "expected": {...},
                "actual": {...},
                "accuracy_score": 1.0,
                "response_time": 1.1,
                "tokens_used": 150,
                "cost": 0.0045,
                "error": None
            }
        ]
    }
}

Test Suite Components

Test Specification Structure

Each test suite specification should contain:

Suite Level Properties:

  • name: Human-readable name for the test suite

  • description: Detailed description of what the suite tests

  • tests: Array of individual test cases

Individual Test Properties:

  • name: Name of the specific test case

  • input_text: Text to extract information from

  • schema: JSON schema defining the expected output structure

  • expected: Expected extraction results for validation

  • options: Optional driver-specific options
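
Putting these properties together, a single test case might look like the sketch below. The options block is driver-specific; the temperature key shown is illustrative, not a fixed part of the specification:

test_case = {
    "name": "Person With Options",
    "input_text": "Maria Garcia, age 31, lives in Madrid.",
    "schema": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "age": {"type": "integer"},
            "city": {"type": "string"}
        },
        "required": ["name", "age"]
    },
    "expected": {"name": "Maria Garcia", "age": 31, "city": "Madrid"},
    # Driver-specific options; the key below is an example, not a documented default
    "options": {"temperature": 0.0}
}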

Validation and Scoring

The runner automatically validates extraction results:

Validation Methods:

  1. Schema Validation: Ensures output matches the JSON schema structure

  2. Value Comparison: Compares extracted values against expected results

  3. Type Checking: Validates that extracted data types are correct

  4. Completeness: Checks that all required fields are present

Accuracy Scoring:

  • Perfect Match: 1.0 score when extracted data exactly matches expected

  • Partial Match: Proportional score based on correct fields

  • Schema Valid: Minimum score for schema-compliant but incorrect data

  • Invalid: 0.0 score for schema violations or extraction failures
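
The proportional scoring described above can be approximated outside the runner for quick ad-hoc checks. The helper below is a simplified sketch, not the runner's internal implementation: it counts expected fields whose extracted values match exactly.

def field_accuracy(expected: dict, actual: dict) -> float:
    """Simplified per-field accuracy: fraction of expected fields matched exactly."""
    if not expected:
        return 1.0
    matched = sum(1 for key, value in expected.items() if actual.get(key) == value)
    return matched / len(expected)

# Example: 2 of 3 expected fields match -> 0.67
field_accuracy({"name": "John Smith", "age": 25, "city": "Boston"},
               {"name": "John Smith", "age": 25, "city": "Cambridge"})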

Performance Metrics

The runner collects comprehensive performance data:

Timing Metrics:

  • Response time per test

  • Average response time per driver

  • Total execution time

Usage Metrics:

  • Token usage per test and driver

  • Cost estimation based on model pricing

  • Request/response sizes

Quality Metrics:

  • Accuracy scores per test

  • Pass/fail rates

  • Error categorization
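
These metrics can be read directly from the return value structure shown earlier. A minimal sketch, assuming a results dictionary returned by run_suite_from_spec and the per-test fields response_time, test_name, and passed described above:

for driver_name, result in results.items():
    tests = result["test_results"]
    # Find the slowest test and any failures for this driver
    slowest = max(tests, key=lambda t: t["response_time"])
    failures = [t["test_name"] for t in tests if not t["passed"]]
    print(f"{driver_name}: slowest test {slowest['test_name']} "
          f"({slowest['response_time']:.2f}s), failures: {failures or 'none'}")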

Advanced Usage Patterns

Comparative Analysis

Run the same tests across multiple providers to compare performance:

# Test multiple providers
providers = {
    "openai-gpt4": get_driver_for_model("openai/gpt-4"),
    "openai-gpt35": get_driver_for_model("openai/gpt-3.5-turbo"),
    "anthropic-claude": get_driver_for_model("anthropic/claude-3-sonnet-20240229"),
    "google-gemini": get_driver_for_model("google/gemini-pro"),
    "groq-llama": get_driver_for_model("groq/llama2-70b-4096")
}

results = run_suite_from_spec(comprehensive_test_spec, providers)

# Analyze comparative performance
performance_summary = {}
for provider, result in results.items():
    performance_summary[provider] = {
        "accuracy": result["accuracy"],
        "avg_time": result["avg_time"],
        "cost_per_test": result["total_cost"] / result["total"],
        "tokens_per_test": result["total_tokens"] / result["total"]
    }
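
From here it is straightforward to rank providers, for example by accuracy and then by cost per test (plain Python over the summary built above):

# Highest accuracy first; ties broken by lower cost per test
ranked = sorted(
    performance_summary.items(),
    key=lambda item: (-item[1]["accuracy"], item[1]["cost_per_test"])
)
for provider, stats in ranked:
    print(f"{provider}: {stats['accuracy']:.2%} accuracy, "
          f"${stats['cost_per_test']:.4f}/test, {stats['avg_time']:.2f}s avg")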

Regression Testing

Use the runner for regression testing when updating Prompture:

# Load test suite from file
import json
with open("regression_tests.json") as f:
    regression_spec = json.load(f)

# Run against current production setup
production_results = run_suite_from_spec(regression_spec, production_drivers)

# Compare against baseline results
baseline_accuracy = 0.92
current_accuracy = production_results["main_driver"]["accuracy"]

if current_accuracy < baseline_accuracy * 0.95:  # 5% tolerance
    print("WARNING: Accuracy regression detected!")

Custom Test Generation

Generate test suites programmatically:

# Map Python types to JSON Schema type names
JSON_SCHEMA_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}

def generate_field_tests(field_definitions):
    """Generate test cases for field definitions."""
    tests = []

    for field_name, field_def in field_definitions.items():
        test = {
            "name": f"Extract {field_name}",
            "input_text": f"Sample text containing {field_name} information",
            "schema": {
                "type": "object",
                "properties": {
                    field_name: {"type": JSON_SCHEMA_TYPES[field_def["type"]]}
                }
            },
            "expected": {field_name: field_def.get("default")}
        }
        tests.append(test)

    return {"name": "Field Definition Tests", "tests": tests}

Integration with Testing Frameworks

The runner integrates well with standard testing frameworks:

pytest Integration:

import pytest
from prompture.runner import run_suite_from_spec

def test_extraction_accuracy():
    results = run_suite_from_spec(test_spec, test_drivers)

    for driver_name, result in results.items():
        assert result["accuracy"] >= 0.8, f"{driver_name} accuracy too low"
        assert result["passed"] == result["total"], f"{driver_name} had failures"

unittest Integration:

import unittest
from prompture.runner import run_suite_from_spec

class TestExtractionSuite(unittest.TestCase):
    def setUp(self):
        self.drivers = {...}  # Initialize test drivers

    def test_person_extraction(self):
        results = run_suite_from_spec(person_test_spec, self.drivers)
        self.assertGreaterEqual(results["main"]["accuracy"], 0.85)

Best Practices

  1. Comprehensive Test Coverage: Create tests covering various input formats and edge cases

  2. Multiple Provider Testing: Test against multiple LLM providers to ensure robustness

  3. Regular Regression Testing: Run test suites regularly to catch performance degradation

  4. Baseline Establishment: Establish accuracy baselines for different test scenarios

  5. Performance Monitoring: Track token usage and costs over time

  6. Error Analysis: Analyze failed tests to improve field definitions and prompts

  7. Incremental Testing: Start with simple tests and gradually increase complexity

Error Handling

The runner handles various error conditions gracefully:

Driver Errors:

  • Connection failures

  • Authentication issues

  • Rate limiting

  • Model unavailability

Extraction Errors:

  • Invalid JSON responses

  • Schema validation failures

  • Timeout errors

  • Unexpected response formats

Test Specification Errors:

  • Malformed test specifications

  • Missing required fields

  • Invalid schema definitions

All errors are logged and reported in the test results without stopping the entire suite execution.
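
Failed tests can be inspected afterwards through the error field in each test result. A minimal sketch for grouping failures by error message, given a results dictionary returned by run_suite_from_spec (the exact message text depends on the driver and failure mode):

from collections import defaultdict

errors_by_message = defaultdict(list)
for driver_name, result in results.items():
    for test in result["test_results"]:
        if test["error"] is not None:
            errors_by_message[test["error"]].append((driver_name, test["test_name"]))

for message, occurrences in errors_by_message.items():
    print(f"{message}: {len(occurrences)} occurrence(s)")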