Runner Module

The runner module provides test suite functionality for executing and validating JSON extraction tests across multiple models and drivers.

Overview

The runner module enables:

  • Batch Testing: Execute extraction tests across multiple models in a single run

  • Validation: Automatic validation of extraction results against expected outputs

  • Comparison: Compare performance and accuracy across different LLM providers

  • Reporting: Generate detailed test results and performance metrics

Main Functions

run_suite_from_spec()

Execute a test suite specification against multiple drivers and collect comprehensive results.

Features:

  • Multi-model Testing: Run the same tests across different LLM providers

  • Automatic Validation: Compare extraction results against expected outputs

  • Error Handling: Graceful handling of driver failures and extraction errors

  • Performance Metrics: Collect timing and token usage data

  • Detailed Reporting: Generate comprehensive test result summaries

Test Suite Specification Format:

test_spec = {
    "name": "Person Extraction Test Suite",
    "description": "Test extraction of person information",
    "tests": [
        {
            "name": "Basic Person Info",
            "input_text": "John Smith is 25 years old",
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"}
                }
            },
            "expected": {
                "name": "John Smith",
                "age": 25
            }
        }
    ]
}

Example Usage:

from prompture.runner import run_suite_from_spec
from prompture.drivers import get_driver_for_model

# Prepare drivers for testing
drivers = {
    "gpt-4": get_driver_for_model("openai/gpt-4"),
    "claude": get_driver_for_model("anthropic/claude-3-sonnet-20240229"),
    "gemini": get_driver_for_model("google/gemini-pro")
}

# Run the test suite
results = run_suite_from_spec(test_spec, drivers)

# Analyze results
for driver_name, driver_results in results.items():
    print(f"Results for {driver_name}:")
    print(f"  Tests passed: {driver_results['passed']}/{driver_results['total']}")
    print(f"  Average accuracy: {driver_results['accuracy']:.2%}")

Return Value Structure:

{
    "driver_name": {
        "passed": 8,           # Number of tests passed
        "total": 10,           # Total number of tests
        "accuracy": 0.85,      # Overall accuracy score
        "avg_time": 1.23,      # Average response time (seconds)
        "total_tokens": 1500,  # Total tokens used
        "total_cost": 0.045,   # Total estimated cost (USD)
        "test_results": [      # Individual test results
            {
                "test_name": "Basic Person Info",
                "passed": True,
                "expected": {...},
                "actual": {...},
                "accuracy_score": 1.0,
                "response_time": 1.1,
                "tokens_used": 150,
                "cost": 0.0045,
                "error": None
            }
        ]
    }
}

Test Suite Components

Test Specification Structure

Each test suite specification should contain:

Suite Level Properties:

  • name: Human-readable name for the test suite

  • description: Detailed description of what the suite tests

  • tests: Array of individual test cases

Individual Test Properties:

  • name: Name of the specific test case

  • input_text: Text to extract information from

  • schema: JSON schema defining the expected output structure

  • expected: Expected extraction results for validation

  • options: Optional driver-specific options
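
Putting these properties together, a single test case might look like the sketch below. The options block is driver-specific; the temperature key shown is illustrative, not a fixed part of the specification:

test_case = {
    "name": "Person With Options",
    "input_text": "Maria Garcia, age 31, lives in Madrid.",
    "schema": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "age": {"type": "integer"},
            "city": {"type": "string"}
        },
        "required": ["name", "age"]
    },
    "expected": {"name": "Maria Garcia", "age": 31, "city": "Madrid"},
    # Driver-specific options; the key below is an example, not a documented default
    "options": {"temperature": 0.0}
}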

Validation and Scoring

The runner automatically validates extraction results:

Validation Methods:

  1. Schema Validation: Ensures output matches the JSON schema structure

  2. Value Comparison: Compares extracted values against expected results

  3. Type Checking: Validates that extracted data types are correct

  4. Completeness: Checks that all required fields are present

Accuracy Scoring:

  • Perfect Match: 1.0 score when extracted data exactly matches expected

  • Partial Match: Proportional score based on correct fields

  • Schema Valid: Minimum score for schema-compliant but incorrect data

  • Invalid: 0.0 score for schema violations or extraction failures
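
The proportional scoring described above can be approximated outside the runner for quick ad-hoc checks. The helper below is a simplified sketch, not the runner's internal implementation: it counts expected fields whose extracted values match exactly.

def field_accuracy(expected: dict, actual: dict) -> float:
    """Simplified per-field accuracy: fraction of expected fields matched exactly."""
    if not expected:
        return 1.0
    matched = sum(1 for key, value in expected.items() if actual.get(key) == value)
    return matched / len(expected)

# Example: 2 of 3 expected fields match -> 0.67
field_accuracy({"name": "John Smith", "age": 25, "city": "Boston"},
               {"name": "John Smith", "age": 25, "city": "Cambridge"})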

Performance Metrics

The runner collects comprehensive performance data:

Timing Metrics:

  • Response time per test

  • Average response time per driver

  • Total execution time

Usage Metrics:

  • Token usage per test and driver

  • Cost estimation based on model pricing

  • Request/response sizes

Quality Metrics:

  • Accuracy scores per test

  • Pass/fail rates

  • Error categorization
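
These metrics can be read directly from the return value structure shown earlier. A minimal sketch, assuming a results dictionary returned by run_suite_from_spec and the per-test fields response_time, test_name, and passed described above:

for driver_name, result in results.items():
    tests = result["test_results"]
    # Find the slowest test and any failures for this driver
    slowest = max(tests, key=lambda t: t["response_time"])
    failures = [t["test_name"] for t in tests if not t["passed"]]
    print(f"{driver_name}: slowest test {slowest['test_name']} "
          f"({slowest['response_time']:.2f}s), failures: {failures or 'none'}")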

Advanced Usage Patterns

Comparative Analysis

Run the same tests across multiple providers to compare performance:

# Test multiple providers
providers = {
    "openai-gpt4": get_driver_for_model("openai/gpt-4"),
    "openai-gpt35": get_driver_for_model("openai/gpt-3.5-turbo"),
    "anthropic-claude": get_driver_for_model("anthropic/claude-3-sonnet-20240229"),
    "google-gemini": get_driver_for_model("google/gemini-pro"),
    "groq-llama": get_driver_for_model("groq/llama2-70b-4096")
}

results = run_suite_from_spec(comprehensive_test_spec, providers)

# Analyze comparative performance
performance_summary = {}
for provider, result in results.items():
    performance_summary[provider] = {
        "accuracy": result["accuracy"],
        "avg_time": result["avg_time"],
        "cost_per_test": result["total_cost"] / result["total"],
        "tokens_per_test": result["total_tokens"] / result["total"]
    }
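
From here it is straightforward to rank providers, for example by accuracy and then by cost per test (plain Python over the summary built above):

# Highest accuracy first; ties broken by lower cost per test
ranked = sorted(
    performance_summary.items(),
    key=lambda item: (-item[1]["accuracy"], item[1]["cost_per_test"])
)
for provider, stats in ranked:
    print(f"{provider}: {stats['accuracy']:.2%} accuracy, "
          f"${stats['cost_per_test']:.4f}/test, {stats['avg_time']:.2f}s avg")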

Regression Testing

Use the runner for regression testing when updating Prompture:

# Load test suite from file
import json
with open("regression_tests.json") as f:
    regression_spec = json.load(f)

# Run against current production setup
production_results = run_suite_from_spec(regression_spec, production_drivers)

# Compare against baseline results
baseline_accuracy = 0.92
current_accuracy = production_results["main_driver"]["accuracy"]

if current_accuracy < baseline_accuracy * 0.95:  # 5% tolerance
    print("WARNING: Accuracy regression detected!")

Custom Test Generation

Generate test suites programmatically:

# Map Python types to JSON Schema type names
JSON_SCHEMA_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}

def generate_field_tests(field_definitions):
    """Generate test cases for field definitions."""
    tests = []

    for field_name, field_def in field_definitions.items():
        test = {
            "name": f"Extract {field_name}",
            "input_text": f"Sample text containing {field_name} information",
            "schema": {
                "type": "object",
                "properties": {
                    field_name: {"type": JSON_SCHEMA_TYPES[field_def["type"]]}
                }
            },
            "expected": {field_name: field_def.get("default")}
        }
        tests.append(test)

    return {"name": "Field Definition Tests", "tests": tests}

Integration with Testing Frameworks

The runner integrates well with standard testing frameworks:

pytest Integration:

import pytest
from prompture.runner import run_suite_from_spec

def test_extraction_accuracy():
    results = run_suite_from_spec(test_spec, test_drivers)

    for driver_name, result in results.items():
        assert result["accuracy"] >= 0.8, f"{driver_name} accuracy too low"
        assert result["passed"] == result["total"], f"{driver_name} had failures"

unittest Integration:

import unittest
from prompture.runner import run_suite_from_spec

class TestExtractionSuite(unittest.TestCase):
    def setUp(self):
        self.drivers = {...}  # Initialize test drivers

    def test_person_extraction(self):
        results = run_suite_from_spec(person_test_spec, self.drivers)
        self.assertGreaterEqual(results["main"]["accuracy"], 0.85)

Best Practices

  1. Comprehensive Test Coverage: Create tests covering various input formats and edge cases

  2. Multiple Provider Testing: Test against multiple LLM providers to ensure robustness

  3. Regular Regression Testing: Run test suites regularly to catch performance degradation

  4. Baseline Establishment: Establish accuracy baselines for different test scenarios

  5. Performance Monitoring: Track token usage and costs over time

  6. Error Analysis: Analyze failed tests to improve field definitions and prompts

  7. Incremental Testing: Start with simple tests and gradually increase complexity

Error Handling

The runner handles various error conditions gracefully:

Driver Errors:

  • Connection failures

  • Authentication issues

  • Rate limiting

  • Model unavailability

Extraction Errors:

  • Invalid JSON responses

  • Schema validation failures

  • Timeout errors

  • Unexpected response formats

Test Specification Errors:

  • Malformed test specifications

  • Missing required fields

  • Invalid schema definitions

All errors are logged and reported in the test results without stopping the entire suite execution.
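
Failed tests can be inspected afterwards through the error field in each test result. A minimal sketch for grouping failures by error message, given a results dictionary returned by run_suite_from_spec (the exact message text depends on the driver and failure mode):

from collections import defaultdict

errors_by_message = defaultdict(list)
for driver_name, result in results.items():
    for test in result["test_results"]:
        if test["error"] is not None:
            errors_by_message[test["error"]].append((driver_name, test["test_name"]))

for message, occurrences in errors_by_message.items():
    print(f"{message}: {len(occurrences)} occurrence(s)")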