A/B Testing Framework

Overview

The OWL AI Platform A/B Testing Framework enables organizations to run controlled experiments on their AI governance configurations. Compare different policy settings, risk thresholds, or automation rules to measure their impact on security, efficiency, and user experience. The framework includes automatic test completion, statistical significance calculation, and winner determination.

Key Capabilities

  • Controlled Experiments: Compare two variants (A/B) under identical conditions
  • Statistical Analysis: Automated confidence level and significance calculation
  • Auto-Completion: Automatic test completion when duration expires
  • Sample Size Tracking: Monitor test progress and data sufficiency
  • Performance Metrics: Track and compare variant performance scores
  • Winner Determination: Algorithmic selection of better-performing variant
  • Notification System: Alerts when tests complete or require attention

How It Works

A/B Test Lifecycle

┌─────────────────────────────────────────────────────────────────────────────────┐
│                               A/B TEST LIFECYCLE                                │
└─────────────────────────────────────────────────────────────────────────────────┘

┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   CREATE    │────>│   RUNNING   │────>│ COMPLETING  │────>│  COMPLETED  │
│             │     │             │     │             │     │             │
│ Define test │     │ Collect     │     │ Calculate   │     │ Review      │
│ Set variants│     │ samples     │     │ statistics  │     │ results     │
│ Set duration│     │ Track perf  │     │ Determine   │     │ Apply       │
│ Start       │     │ Monitor     │     │ winner      │     │ winner      │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘
       │                   │                   │                   │
       │                   │                   │                   │
       v                   v                   v                   v
┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│ Test ID     │     │ Progress %  │     │ Winner:     │     │ Action:     │
│ assigned    │     │ tracked     │     │ A or B      │     │ Apply       │
│             │     │             │     │             │     │ winning     │
│ Variants    │     │ Alerts      │     │ Confidence  │     │ config      │
│ configured  │     │ routed      │     │ calculated  │     │             │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘

Test Architecture

┌─────────────────────────────────────────────────────────────────────────────────┐
│                              A/B TEST ARCHITECTURE                              │
└─────────────────────────────────────────────────────────────────────────────────┘

                 ┌──────────────────────────────┐
                 │       INCOMING ALERTS        │
                 │  (Random 50/50 Assignment)   │
                 └──────────────┬───────────────┘
                                │
           ┌────────────────────┴────────────────────┐
           │                                         │
           v                                         v
┌─────────────────────┐                   ┌─────────────────────┐
│      VARIANT A      │                   │      VARIANT B      │
│   (Control/Base)    │                   │    (Experiment)     │
│                     │                   │                     │
│ Policy: prod-v1     │                   │ Policy: prod-v2     │
│ Risk Threshold: 40  │                   │ Risk Threshold: 35  │
│ Auto-approve: Yes   │                   │ Auto-approve: Yes   │
│                     │                   │                     │
│ ┌─────────────────┐ │                   │ ┌─────────────────┐ │
│ │ Performance     │ │                   │ │ Performance     │ │
│ │ Score: 78.5     │ │                   │ │ Score: 82.3     │ │
│ │                 │ │                   │ │                 │ │
│ │ Samples: 156    │ │                   │ │ Samples: 152    │ │
│ │ Approvals: 142  │ │                   │ │ Approvals: 148  │ │
│ │ Denials: 14     │ │                   │ │ Denials: 4      │ │
│ └─────────────────┘ │                   │ └─────────────────┘ │
└─────────────────────┘                   └─────────────────────┘
           │                                         │
           └────────────────────┬────────────────────┘
                                │
                                v
                 ┌──────────────────────────────┐
                 │     STATISTICAL ANALYSIS     │
                 │                              │
                 │ Sample Size: 308             │
                 │ Confidence: 95%              │
                 │ Significance: HIGH           │
                 │                              │
                 │ WINNER: Variant B            │
                 │ (+3.8 performance points)    │
                 └──────────────────────────────┘
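In practice the random 50/50 split is usually made deterministic, so the same alert or subject always routes to the same variant for the life of the test. A minimal sketch of that idea, assuming a stable hash of the (test, subject) pair; this is illustrative, not the platform's actual routing code:

```python
import hashlib

def assign_variant(test_id: str, subject_id: str) -> str:
    """Deterministic 50/50 assignment: hash the (test, subject) pair so
    repeated requests for the same subject land in the same variant."""
    digest = hashlib.sha256(f"{test_id}:{subject_id}".encode()).digest()
    return "variant_a" if digest[0] % 2 == 0 else "variant_b"

# Stable: the same inputs always return the same variant
print(assign_variant("test_abc123", "alert_001"))
```

Keying the hash on the test ID as well as the subject ID re-shuffles the population for each new test, so the same subjects are not always stuck in the same arm.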

Scheduler Architecture

┌─────────────────────────────────────────────────────────────────────────────────┐
│                      A/B TEST SCHEDULER (Background Task)                       │
└─────────────────────────────────────────────────────────────────────────────────┘

┌──────────────────────────────┐
│        SCHEDULER LOOP        │
│      (Every 60 minutes)      │
└──────────────┬───────────────┘
               │
               v
┌──────────────────────────────┐
│     FIND EXPIRED TESTS       │
│                              │
│ SELECT * FROM ab_tests       │
│ WHERE status = 'running'     │
│ AND created_at + duration    │
│     <= NOW()                 │
└──────────────┬───────────────┘
               │
               │ For each expired test
               v
┌──────────────────────────────┐
│      AUTO-COMPLETE TEST      │
│                              │
│ 1. Calculate metrics         │
│ 2. Determine winner          │
│ 3. Calculate confidence      │
│ 4. Update database           │
│ 5. Send notification         │
└──────────────────────────────┘

Configuration

Create A/B Test

from owlai import OWLClient

client = OWLClient(api_key="your_api_key")

# Create a new A/B test
test = client.ab_tests.create(
    test_name="Risk Threshold Optimization",
    description="Compare 40 vs 35 risk threshold for auto-approval",
    duration_hours=168,  # 7 days

    # Variant A (Control)
    variant_a={
        "name": "Current Threshold",
        "config": {
            "auto_approve_threshold": 40,
            "policy_id": "prod-policy-v1"
        }
    },

    # Variant B (Experiment)
    variant_b={
        "name": "Lower Threshold",
        "config": {
            "auto_approve_threshold": 35,
            "policy_id": "prod-policy-v2"
        }
    },

    # Success metrics
    success_metrics=["approval_rate", "false_positive_rate", "response_time"]
)

print(f"Test created: {test.test_id}")
print(f"Ends at: {test.end_time}")

Create via API

curl -X POST https://api.owlai.io/v1/ab-tests \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "test_name": "Risk Threshold Optimization",
    "description": "Compare 40 vs 35 risk threshold for auto-approval",
    "duration_hours": 168,
    "variant_a": {
      "name": "Current Threshold",
      "config": {
        "auto_approve_threshold": 40
      }
    },
    "variant_b": {
      "name": "Lower Threshold",
      "config": {
        "auto_approve_threshold": 35
      }
    }
  }'

# Response:
# {
#   "test_id": "test_abc123",
#   "test_name": "Risk Threshold Optimization",
#   "status": "running",
#   "created_at": "2026-01-20T10:00:00Z",
#   "ends_at": "2026-01-27T10:00:00Z",
#   "progress_percentage": 0
# }

Test Configuration Options

Parameter           Type     Required  Description
------------------  -------  --------  --------------------------------
test_name           String   Yes       Human-readable test name
description         String   No        Detailed test description
duration_hours      Integer  Yes       Test duration (1-720 hours)
variant_a           Object   Yes       Control variant configuration
variant_b           Object   Yes       Experiment variant configuration
success_metrics     Array    No        Metrics to optimize for
target_sample_size  Integer  No        Minimum samples needed

Usage Examples

Monitor Test Progress

# Get test status
test = client.ab_tests.get(test_id="test_abc123")

print(f"Test: {test.test_name}")
print(f"Status: {test.status}")
print(f"Progress: {test.progress_percentage}%")
print(f"Sample Size: {test.sample_size}")

print(f"\nVariant A ({test.variant_a_name}):")
print(f"  Performance: {test.variant_a_performance}")
print(f"  Samples: {test.variant_a_samples}")

print(f"\nVariant B ({test.variant_b_name}):")
print(f"  Performance: {test.variant_b_performance}")
print(f"  Samples: {test.variant_b_samples}")

# Output:
# Test: Risk Threshold Optimization
# Status: running
# Progress: 45%
# Sample Size: 308
#
# Variant A (Current Threshold):
#   Performance: 78.5
#   Samples: 156
#
# Variant B (Lower Threshold):
#   Performance: 82.3
#   Samples: 152

Get Test Results

curl -X GET https://api.owlai.io/v1/ab-tests/test_abc123 \
  -H "Authorization: Bearer $TOKEN"

# Response (completed test):
# {
#   "test_id": "test_abc123",
#   "test_name": "Risk Threshold Optimization",
#   "status": "completed",
#   "completed_at": "2026-01-27T10:00:00Z",
#   "winner": "variant_b",
#   "confidence_level": 95,
#   "statistical_significance": "high",
#   "results": {
#     "variant_a": {
#       "name": "Current Threshold",
#       "performance_score": 78.5,
#       "samples": 312,
#       "metrics": {
#         "approval_rate": 0.91,
#         "false_positive_rate": 0.03,
#         "avg_response_time_ms": 245
#       }
#     },
#     "variant_b": {
#       "name": "Lower Threshold",
#       "performance_score": 82.3,
#       "samples": 304,
#       "metrics": {
#         "approval_rate": 0.97,
#         "false_positive_rate": 0.01,
#         "avg_response_time_ms": 198
#       }
#     }
#   },
#   "recommendation": "Deploy Variant B - 4.8% performance improvement at 95% confidence"
# }

List All Tests

# List all tests
tests = client.ab_tests.list(
    status="all",  # running, completed, or all
    limit=10
)

for test in tests:
    status_icon = "green" if test.status == "completed" else "blue"
    print(f"[{status_icon}] {test.test_name} ({test.status})")
    print(f"  Progress: {test.progress_percentage}%")
    if test.winner:
        print(f"  Winner: {test.winner} (confidence: {test.confidence_level}%)")

Stop Test Early

# Stop a running test (if sufficient data collected)
result = client.ab_tests.stop(
    test_id="test_abc123",
    reason="Sufficient data collected for decision"
)

print(f"Test stopped: {result.final_status}")
print(f"Winner: {result.winner}")
print(f"Confidence: {result.confidence_level}%")

Apply Winning Configuration

# Apply the winning variant configuration
client.ab_tests.apply_winner(
    test_id="test_abc123",
    apply_to="production",
    rollout_percentage=100
)

print("Winning configuration applied to production")

Statistical Analysis

Confidence Level Calculation

Confidence level is based on sample size:

Sample Size  Confidence Level  Significance
-----------  ----------------  ------------
>= 500       99%               Very High
>= 300       95%               High
>= 200       90%               High
>= 100       85%               Medium
>= 50        75%               Medium
>= 25        65%               Low
>= 10        55%               Low
< 10         40-50%            Insufficient
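The table above reduces to a simple threshold lookup. A hypothetical sketch of that mapping (confidence_level here is an illustrative helper, not a documented SDK function):

```python
def confidence_level(sample_size: int) -> tuple:
    """Map a sample size to (confidence %, significance), following the
    thresholds in the table above, checked from largest to smallest."""
    tiers = [
        (500, 99, "very_high"),
        (300, 95, "high"),
        (200, 90, "high"),
        (100, 85, "medium"),
        (50, 75, "medium"),
        (25, 65, "low"),
        (10, 55, "low"),
    ]
    for min_samples, confidence, significance in tiers:
        if sample_size >= min_samples:
            return confidence, significance
    return 45, "insufficient"  # < 10 samples: midpoint of the 40-50% band

print(confidence_level(308))  # → (95, 'high'), matching the example test
```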

Winner Determination

def determine_winner(test_data, metrics):
    """
    Determine winning variant based on performance metrics.

    Winner is the variant with higher performance score.
    Ties favor Variant A (control).
    """
    a_score = metrics.get("variant_a", {}).get("performance_score", 0)
    b_score = metrics.get("variant_b", {}).get("performance_score", 0)

    # Variant B wins if strictly better
    return "variant_b" if b_score > a_score else "variant_a"

Performance Score Calculation

Performance score is calculated from multiple metrics:

performance_score = (
    approval_accuracy * 0.40 +          # 40% weight
    (1 - false_positive_rate) * 0.30 +  # 30% weight (inverted: lower rate scores higher)
    response_time_score * 0.20 +        # 20% weight
    user_satisfaction * 0.10            # 10% weight
) * 100
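As a runnable illustration of the weighting (the input values here are made up, and response_time_score and user_satisfaction are assumed to be pre-normalized to the [0, 1] range):

```python
def performance_score(approval_accuracy, false_positive_rate,
                      response_time_score, user_satisfaction):
    """Weighted composite on a 0-100 scale; the false-positive rate is
    inverted so that lower rates contribute a higher score."""
    return (
        approval_accuracy * 0.40 +
        (1 - false_positive_rate) * 0.30 +
        response_time_score * 0.20 +
        user_satisfaction * 0.10
    ) * 100

# Made-up inputs loosely based on Variant B's metrics above
print(round(performance_score(0.97, 0.01, 0.80, 0.85), 1))  # → 93.0
```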

Auto-Completion

Scheduler Configuration

from services.ab_test_scheduler import start_scheduler, stop_scheduler

# Start the background scheduler
scheduler = start_scheduler(
    db_session_factory=get_db,
    check_interval_minutes=60  # Check every hour
)

# Stop scheduler (on shutdown)
stop_scheduler()

Auto-Completion Process

When a test expires, the scheduler:

  1. Finds Expired Tests: Queries tests where created_at + duration <= NOW()
  2. Calculates Metrics: Aggregates real metrics from alert data
  3. Determines Winner: Compares performance scores
  4. Calculates Confidence: Based on sample size
  5. Updates Database: Sets status to completed, records winner
  6. Sends Notification: Alerts test creator of results
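The expiry check in step 1 boils down to a timestamp comparison; a minimal sketch:

```python
import datetime as dt

def is_expired(created_at: dt.datetime, duration_hours: int,
               now: dt.datetime) -> bool:
    """The scheduler's expiry predicate: created_at + duration <= NOW()."""
    return created_at + dt.timedelta(hours=duration_hours) <= now

# A 168-hour (7-day) test created exactly 168 hours ago has just expired
created = dt.datetime(2026, 1, 20, 10, 0, tzinfo=dt.timezone.utc)
now = dt.datetime(2026, 1, 27, 10, 0, tzinfo=dt.timezone.utc)
print(is_expired(created, 168, now))  # → True
```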

Manual Completion

# Manually complete a test
curl -X POST https://api.owlai.io/v1/ab-tests/test_abc123/complete \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "reason": "Early termination - clear winner identified"
  }'

Notifications

Completion Notification

When a test completes, notifications are sent:

{
  "notification_type": "ab_test_completed",
  "test_id": "test_abc123",
  "test_name": "Risk Threshold Optimization",
  "winner": "variant_b",
  "confidence": 95,
  "summary": "Variant B (Lower Threshold) won with 82.3 performance score vs 78.5 for Variant A",
  "recommendation": "Consider deploying Variant B to production",
  "view_results_url": "https://app.owlai.io/ab-tests/test_abc123"
}

Notification Channels

  • Email to test creator
  • Slack integration (if configured)
  • In-app notification
  • Webhook (if configured)

Best Practices

Test Design

  1. Clear Hypothesis: Define what you're testing and expected outcome
  2. Single Variable: Change only one thing between variants
  3. Sufficient Duration: Run tests long enough for statistical significance
  4. Representative Traffic: Ensure both variants get similar traffic patterns

Sample Size Guidelines

Test Type         Minimum Samples  Recommended
----------------  ---------------  -----------
Policy threshold  100              300+
Risk scoring      200              500+
Automation rules  50               200+
UI changes        500              1000+

Duration Guidelines

Test Type            Minimum Duration  Recommended
-------------------  ----------------  -----------
Quick validation     24 hours          72 hours
Policy change        72 hours          168 hours
Major configuration  168 hours         336 hours

Interpreting Results

  1. High Confidence (>90%): Safe to deploy winner
  2. Medium Confidence (70-90%): Consider extending test
  3. Low Confidence (<70%): Extend test or increase traffic
  4. No Significant Difference: Either variant is acceptable
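These rules of thumb can be encoded directly; an illustrative helper (next_action is not part of the SDK):

```python
def next_action(confidence: float, significant: bool) -> str:
    """Suggest a follow-up action based on the guidance above."""
    if not significant:
        return "either variant acceptable"
    if confidence > 90:
        return "deploy winner"
    if confidence >= 70:
        return "consider extending test"
    return "extend test or increase traffic"

print(next_action(95, True))  # → deploy winner
```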

API Reference

Endpoints

Endpoint                          Method  Description
--------------------------------  ------  --------------------
/ab-tests                         POST    Create new test
/ab-tests                         GET     List all tests
/ab-tests/{test_id}               GET     Get test details
/ab-tests/{test_id}               DELETE  Delete test
/ab-tests/{test_id}/stop          POST    Stop running test
/ab-tests/{test_id}/complete      POST    Force completion
/ab-tests/{test_id}/apply-winner  POST    Apply winning config
/ab-tests/{test_id}/metrics       GET     Get detailed metrics

Test Status Values

Status      Description
----------  -----------------------------
draft       Test created but not started
running     Test actively collecting data
completing  Test being finalized
completed   Test finished with results
stopped     Test manually stopped
failed      Test encountered error
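The lifecycle implies a small set of legal status transitions. A sketch of a validation helper; the transition set below is an assumption inferred from the lifecycle diagram (for example, it assumes a running test always passes through completing before completed):

```python
# Assumed legal transitions, inferred from the lifecycle diagram above
ALLOWED_TRANSITIONS = {
    "draft": {"running"},
    "running": {"completing", "stopped", "failed"},
    "completing": {"completed", "failed"},
    "completed": set(),
    "stopped": set(),
    "failed": set(),
}

def can_transition(current: str, target: str) -> bool:
    """True if a test may move directly from `current` to `target`."""
    return target in ALLOWED_TRANSITIONS.get(current, set())

print(can_transition("running", "completed"))  # → False: must pass through 'completing'
```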