Runner Module
The runner module provides test suite functionality for executing and validating JSON extraction tests across multiple LLM models and drivers.
Overview
The runner module enables:
Batch Testing: Execute extraction tests across multiple models simultaneously
Validation: Automatic validation of extraction results against expected outputs
Comparison: Compare performance and accuracy across different LLM providers
Reporting: Generate detailed test results and performance metrics
Main Functions
run_suite_from_spec()
Execute a test suite specification against multiple drivers and collect comprehensive results.
Features:
Multi-model Testing: Run the same tests across different LLM providers
Automatic Validation: Compare extraction results against expected outputs
Error Handling: Graceful handling of driver failures and extraction errors
Performance Metrics: Collect timing and token usage data
Detailed Reporting: Generate comprehensive test result summaries
Test Suite Specification Format:
test_spec = {
    "name": "Person Extraction Test Suite",
    "description": "Test extraction of person information",
    "tests": [
        {
            "name": "Basic Person Info",
            "input_text": "John Smith is 25 years old",
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"}
                }
            },
            "expected": {
                "name": "John Smith",
                "age": 25
            }
        }
    ]
}
Example Usage:
from prompture.runner import run_suite_from_spec
from prompture.drivers import get_driver_for_model

# Prepare drivers for testing
drivers = {
    "gpt-4": get_driver_for_model("openai/gpt-4"),
    "claude": get_driver_for_model("anthropic/claude-3-sonnet-20240229"),
    "gemini": get_driver_for_model("google/gemini-pro")
}

# Run the test suite
results = run_suite_from_spec(test_spec, drivers)

# Analyze results
for driver_name, driver_results in results.items():
    print(f"Results for {driver_name}:")
    print(f"  Tests passed: {driver_results['passed']}/{driver_results['total']}")
    print(f"  Average accuracy: {driver_results['accuracy']:.2%}")
Return Value Structure:
{
    "driver_name": {
        "passed": 8,            # Number of tests passed
        "total": 10,            # Total number of tests
        "accuracy": 0.85,       # Overall accuracy score
        "avg_time": 1.23,       # Average response time (seconds)
        "total_tokens": 1500,   # Total tokens used
        "total_cost": 0.045,    # Total estimated cost (USD)
        "test_results": [       # Individual test results
            {
                "test_name": "Basic Person Info",
                "passed": True,
                "expected": {...},
                "actual": {...},
                "accuracy_score": 1.0,
                "response_time": 1.1,
                "tokens_used": 150,
                "cost": 0.0045,
                "error": None
            }
        ]
    }
}
Test Suite Components
Test Specification Structure
Each test suite specification should contain the following properties (a minimal pre-flight validation sketch follows these lists):
Suite Level Properties:
name: Human-readable name for the test suite
description: Detailed description of what the suite tests
tests: Array of individual test cases
Individual Test Properties:
name: Name of the specific test case
input_text: Text to extract information from
schema: JSON schema defining the expected output structure
expected: Expected extraction results for validation
options: Optional driver-specific options
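Before handing a specification to run_suite_from_spec(), it can be worth checking its shape. The helper below is a minimal sketch, not part of prompture's API; it assumes that name and tests are required at the suite level and that each test needs the four core properties listed above.
# Hypothetical pre-flight check for test suite specifications (not a prompture API).
REQUIRED_SUITE_KEYS = {"name", "tests"}
REQUIRED_TEST_KEYS = {"name", "input_text", "schema", "expected"}

def validate_spec(spec: dict) -> list:
    """Return a list of human-readable problems found in a suite specification."""
    problems = []
    missing = REQUIRED_SUITE_KEYS - set(spec)
    if missing:
        problems.append(f"suite is missing keys: {sorted(missing)}")
    for index, test in enumerate(spec.get("tests", [])):
        missing = REQUIRED_TEST_KEYS - set(test)
        if missing:
            problems.append(f"test {index} is missing keys: {sorted(missing)}")
    return problems

issues = validate_spec(test_spec)
if issues:
    raise ValueError("Invalid test suite specification: " + "; ".join(issues))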
Validation and Scoring
The runner automatically validates extraction results; the proportional scoring is illustrated with a short sketch after the lists below:
Validation Methods:
Schema Validation: Ensures output matches the JSON schema structure
Value Comparison: Compares extracted values against expected results
Type Checking: Validates that extracted data types are correct
Completeness: Checks that all required fields are present
Accuracy Scoring:
Perfect Match: 1.0 score when extracted data exactly matches expected
Partial Match: Proportional score based on correct fields
Schema Valid: Minimum score for schema-compliant but incorrect data
Invalid: 0.0 score for schema violations or extraction failures
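The scoring itself happens inside the runner, but the proportional idea is easy to illustrate. The field_score helper below is a hypothetical sketch, not prompture's actual implementation:
# Illustrative sketch of proportional field scoring; not the runner's internal code.
def field_score(expected: dict, actual: dict) -> float:
    """Fraction of expected fields whose extracted value matches exactly."""
    if not expected:
        return 1.0
    matches = sum(1 for key, value in expected.items() if actual.get(key) == value)
    return matches / len(expected)

# A perfect match scores 1.0; one wrong field out of two scores 0.5.
print(field_score({"name": "John Smith", "age": 25}, {"name": "John Smith", "age": 25}))  # 1.0
print(field_score({"name": "John Smith", "age": 25}, {"name": "John Smith", "age": 30}))  # 0.5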
Performance Metrics
The runner collects comprehensive performance data; these figures can be read straight out of the result structure, as shown after the lists below:
Timing Metrics:
Response time per test
Average response time per driver
Total execution time
Usage Metrics:
Token usage per test and driver
Cost estimation based on model pricing
Request/response sizes
Quality Metrics:
Accuracy scores per test
Pass/fail rates
Error categorization
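All of these metrics are available in the return value documented above, so no extra instrumentation is needed. A small sketch that pulls per-driver and per-test numbers from a run_suite_from_spec() results dictionary:
# Summarize timing, usage, and quality metrics from the documented result structure.
for driver_name, result in results.items():
    slowest = max(result["test_results"], key=lambda t: t["response_time"])
    print(
        f"{driver_name}: {result['passed']}/{result['total']} passed, "
        f"avg {result['avg_time']:.2f}s (slowest: {slowest['test_name']} "
        f"at {slowest['response_time']:.2f}s), "
        f"{result['total_tokens']} tokens, ${result['total_cost']:.4f}"
    )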
Advanced Usage Patterns
Comparative Analysis
Run the same tests across multiple providers to compare performance:
# Test multiple providers
providers = {
    "openai-gpt4": get_driver_for_model("openai/gpt-4"),
    "openai-gpt35": get_driver_for_model("openai/gpt-3.5-turbo"),
    "anthropic-claude": get_driver_for_model("anthropic/claude-3-sonnet-20240229"),
    "google-gemini": get_driver_for_model("google/gemini-pro"),
    "groq-llama": get_driver_for_model("groq/llama2-70b-4096")
}

results = run_suite_from_spec(comprehensive_test_spec, providers)

# Analyze comparative performance
performance_summary = {}
for provider, result in results.items():
    performance_summary[provider] = {
        "accuracy": result["accuracy"],
        "avg_time": result["avg_time"],
        "cost_per_test": result["total_cost"] / result["total"],
        "tokens_per_test": result["total_tokens"] / result["total"]
    }
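The summary dictionary can then be ranked on whichever dimension matters most, for example accuracy:
# Rank providers by accuracy (highest first)
ranked = sorted(performance_summary.items(), key=lambda item: item[1]["accuracy"], reverse=True)
for provider, stats in ranked:
    print(f"{provider}: {stats['accuracy']:.2%} accuracy, ${stats['cost_per_test']:.4f} per test")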
Regression Testing
Use the runner for regression testing when updating Prompture:
# Load test suite from file
import json

with open("regression_tests.json") as f:
    regression_spec = json.load(f)

# Run against current production setup
production_results = run_suite_from_spec(regression_spec, production_drivers)

# Compare against baseline results
baseline_accuracy = 0.92
current_accuracy = production_results["main_driver"]["accuracy"]

if current_accuracy < baseline_accuracy * 0.95:  # 5% tolerance
    print("WARNING: Accuracy regression detected!")
Custom Test Generation
Generate test suites programmatically:
def generate_field_tests(field_definitions):
    """Generate test cases for field definitions."""
    # Map Python types to their JSON Schema type names
    json_types = {str: "string", int: "integer", float: "number", bool: "boolean"}
    tests = []
    for field_name, field_def in field_definitions.items():
        test = {
            "name": f"Extract {field_name}",
            "input_text": f"Sample text containing {field_name} information",
            "schema": {
                "type": "object",
                "properties": {
                    field_name: {"type": json_types[field_def["type"]]}
                }
            },
            "expected": {field_name: field_def.get("default")}
        }
        tests.append(test)
    return {"name": "Field Definition Tests", "tests": tests}
Integration with Testing Frameworks
The runner integrates well with standard testing frameworks:
pytest Integration:
import pytest
from prompture.runner import run_suite_from_spec

def test_extraction_accuracy():
    results = run_suite_from_spec(test_spec, test_drivers)
    for driver_name, result in results.items():
        assert result["accuracy"] >= 0.8, f"{driver_name} accuracy too low"
        assert result["passed"] == result["total"], f"{driver_name} had failures"
unittest Integration:
import unittest
from prompture.runner import run_suite_from_spec

class TestExtractionSuite(unittest.TestCase):
    def setUp(self):
        self.drivers = {...}  # Initialize test drivers

    def test_person_extraction(self):
        results = run_suite_from_spec(person_test_spec, self.drivers)
        self.assertGreaterEqual(results["main"]["accuracy"], 0.85)
Best Practices
Comprehensive Test Coverage: Create tests covering various input formats and edge cases
Multiple Provider Testing: Test against multiple LLM providers to ensure robustness
Regular Regression Testing: Run test suites regularly to catch performance degradation
Baseline Establishment: Establish accuracy baselines for different test scenarios (one way to persist a baseline is sketched after this list)
Performance Monitoring: Track token usage and costs over time
Error Analysis: Analyze failed tests to improve field definitions and prompts
Incremental Testing: Start with simple tests and gradually increase complexity
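One lightweight way to establish baselines and monitor costs over time is to persist each run's per-driver summary and compare against it later. This is a sketch rather than a built-in feature; the baseline_results.json filename is illustrative:
import json
from datetime import datetime, timezone

# Persist per-driver summaries so future runs can be compared against a baseline.
snapshot = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "drivers": {
        name: {
            "accuracy": result["accuracy"],
            "avg_time": result["avg_time"],
            "total_tokens": result["total_tokens"],
            "total_cost": result["total_cost"],
        }
        for name, result in results.items()
    },
}

with open("baseline_results.json", "w") as f:
    json.dump(snapshot, f, indent=2)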
Error Handling
The runner handles various error conditions gracefully:
Driver Errors:
Connection failures
Authentication issues
Rate limiting
Model unavailability
Extraction Errors:
Invalid JSON responses
Schema validation failures
Timeout errors
Unexpected response formats
Test Specification Errors:
Malformed test specifications
Missing required fields
Invalid schema definitions
All errors are logged and reported in the test results without stopping the entire suite execution.
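Because errors are recorded per test rather than raised, failures can be collected and reviewed after the run using the documented error and passed fields:
# Group failed tests by driver for post-run analysis.
for driver_name, result in results.items():
    failures = [t for t in result["test_results"] if not t["passed"]]
    for test in failures:
        reason = test["error"] or "validation mismatch"
        print(f"{driver_name} / {test['test_name']}: {reason}")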