A/B Testing Framework
Overview
The OWL AI Platform A/B Testing Framework enables organizations to run controlled experiments on their AI governance configurations: compare two policy settings, risk thresholds, or automation rules and measure their impact on security, efficiency, and user experience. The framework includes automatic test completion, statistical significance calculation, and winner determination.
Key Capabilities
- Controlled Experiments: Compare two variants (A/B) under identical conditions
- Statistical Analysis: Automated confidence level and significance calculation
- Auto-Completion: Automatic test completion when duration expires
- Sample Size Tracking: Monitor test progress and data sufficiency
- Performance Metrics: Track and compare variant performance scores
- Winner Determination: Algorithmic selection of better-performing variant
- Notification System: Alerts when tests complete or require attention
How It Works
A/B Test Lifecycle
┌─────────────────────────────────────────────────────────────────────────────────┐
│                               A/B TEST LIFECYCLE                                │
└─────────────────────────────────────────────────────────────────────────────────┘

┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   CREATE    │────>│   RUNNING   │────>│ COMPLETING  │────>│  COMPLETED  │
│             │     │             │     │             │     │             │
│ Define test │     │ Collect     │     │ Calculate   │     │ Review      │
│ Set variants│     │ samples     │     │ statistics  │     │ results     │
│ Set duration│     │ Track perf  │     │ Determine   │     │ Apply       │
│ Start       │     │ Monitor     │     │ winner      │     │ winner      │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘
       │                   │                   │                   │
       v                   v                   v                   v
┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│ Test ID     │     │ Progress %  │     │ Winner:     │     │ Action:     │
│ assigned    │     │ tracked     │     │ A or B      │     │ Apply       │
│             │     │             │     │             │     │ winning     │
│ Variants    │     │ Alerts      │     │ Confidence  │     │ config      │
│ configured  │     │ routed      │     │ calculated  │     │             │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘
Test Architecture
┌─────────────────────────────────────────────────────────────────────────────────┐
│                              A/B TEST ARCHITECTURE                              │
└─────────────────────────────────────────────────────────────────────────────────┘

                        ┌──────────────────────────────┐
                        │       INCOMING ALERTS        │
                        │  (Random 50/50 Assignment)   │
                        └──────────────┬───────────────┘
                                       │
                  ┌────────────────────┴────────────────────┐
                  │                                         │
                  v                                         v
       ┌─────────────────────┐                   ┌─────────────────────┐
       │      VARIANT A      │                   │      VARIANT B      │
       │   (Control/Base)    │                   │    (Experiment)     │
       │                     │                   │                     │
       │ Policy: prod-v1     │                   │ Policy: prod-v2     │
       │ Risk Threshold: 40  │                   │ Risk Threshold: 35  │
       │ Auto-approve: Yes   │                   │ Auto-approve: Yes   │
       │                     │                   │                     │
       │ ┌─────────────────┐ │                   │ ┌─────────────────┐ │
       │ │ Performance     │ │                   │ │ Performance     │ │
       │ │ Score: 78.5     │ │                   │ │ Score: 82.3     │ │
       │ │                 │ │                   │ │                 │ │
       │ │ Samples: 156    │ │                   │ │ Samples: 152    │ │
       │ │ Approvals: 142  │ │                   │ │ Approvals: 148  │ │
       │ │ Denials: 14     │ │                   │ │ Denials: 4      │ │
       │ └─────────────────┘ │                   │ └─────────────────┘ │
       └─────────────────────┘                   └─────────────────────┘
                  │                                         │
                  └────────────────────┬────────────────────┘
                                       │
                                       v
                        ┌──────────────────────────────┐
                        │     STATISTICAL ANALYSIS     │
                        │                              │
                        │ Sample Size: 308             │
                        │ Confidence: 95%              │
                        │ Significance: HIGH           │
                        │                              │
                        │ WINNER: Variant B            │
                        │ (+3.8 performance points)    │
                        └──────────────────────────────┘
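The random 50/50 split shown above can be sketched as follows. This is an illustrative assignment function, not the platform's internal code; hashing the alert ID (an assumption) keeps assignment deterministic per alert while the split stays close to 50/50 across alerts.

```python
import hashlib

def assign_variant(alert_id: str) -> str:
    """Deterministically assign an alert to variant_a or variant_b.

    Hashing the alert ID means the same alert always lands in the
    same variant, while the first-byte parity splits traffic ~50/50.
    """
    digest = hashlib.sha256(alert_id.encode()).digest()
    return "variant_a" if digest[0] % 2 == 0 else "variant_b"
```

A hash-based split avoids a given alert flip-flopping between variants on retries, which a plain `random.random()` draw would allow.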
Scheduler Architecture
┌─────────────────────────────────────────────────────────────────────────────────┐
│                      A/B TEST SCHEDULER (Background Task)                       │
└─────────────────────────────────────────────────────────────────────────────────┘

┌──────────────────────────────┐
│        SCHEDULER LOOP        │
│      (Every 60 minutes)      │
└──────────────┬───────────────┘
               │
               v
┌──────────────────────────────┐
│      FIND EXPIRED TESTS      │
│                              │
│ SELECT * FROM ab_tests       │
│ WHERE status = 'running'     │
│   AND created_at + duration  │
│       <= NOW()               │
└──────────────┬───────────────┘
               │
               │ For each expired test
               v
┌──────────────────────────────┐
│      AUTO-COMPLETE TEST      │
│                              │
│ 1. Calculate metrics         │
│ 2. Determine winner          │
│ 3. Calculate confidence      │
│ 4. Update database           │
│ 5. Send notification         │
└──────────────────────────────┘
Configuration
Create A/B Test
from owlai import OWLClient
client = OWLClient(api_key="your_api_key")
# Create a new A/B test
test = client.ab_tests.create(
    test_name="Risk Threshold Optimization",
    description="Compare 40 vs 35 risk threshold for auto-approval",
    duration_hours=168,  # 7 days

    # Variant A (Control)
    variant_a={
        "name": "Current Threshold",
        "config": {
            "auto_approve_threshold": 40,
            "policy_id": "prod-policy-v1"
        }
    },

    # Variant B (Experiment)
    variant_b={
        "name": "Lower Threshold",
        "config": {
            "auto_approve_threshold": 35,
            "policy_id": "prod-policy-v2"
        }
    },

    # Success metrics
    success_metrics=["approval_rate", "false_positive_rate", "response_time"]
)
print(f"Test created: {test.test_id}")
print(f"Ends at: {test.end_time}")
Create via API
curl -X POST https://api.owlai.io/v1/ab-tests \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "test_name": "Risk Threshold Optimization",
    "description": "Compare 40 vs 35 risk threshold for auto-approval",
    "duration_hours": 168,
    "variant_a": {
      "name": "Current Threshold",
      "config": {
        "auto_approve_threshold": 40
      }
    },
    "variant_b": {
      "name": "Lower Threshold",
      "config": {
        "auto_approve_threshold": 35
      }
    }
  }'

# Response:
# {
#   "test_id": "test_abc123",
#   "test_name": "Risk Threshold Optimization",
#   "status": "running",
#   "created_at": "2026-01-20T10:00:00Z",
#   "ends_at": "2026-01-27T10:00:00Z",
#   "progress_percentage": 0
# }
Test Configuration Options
| Parameter | Type | Required | Description |
|---|---|---|---|
| test_name | String | Yes | Human-readable test name |
| description | String | No | Detailed test description |
| duration_hours | Integer | Yes | Test duration (1-720 hours) |
| variant_a | Object | Yes | Control variant configuration |
| variant_b | Object | Yes | Experiment variant configuration |
| success_metrics | Array | No | Metrics to optimize for |
| target_sample_size | Integer | No | Minimum samples needed |
Usage Examples
Monitor Test Progress
# Get test status
test = client.ab_tests.get(test_id="test_abc123")

print(f"Test: {test.test_name}")
print(f"Status: {test.status}")
print(f"Progress: {test.progress_percentage}%")
print(f"Sample Size: {test.sample_size}")

print(f"\nVariant A ({test.variant_a_name}):")
print(f"  Performance: {test.variant_a_performance}")
print(f"  Samples: {test.variant_a_samples}")

print(f"\nVariant B ({test.variant_b_name}):")
print(f"  Performance: {test.variant_b_performance}")
print(f"  Samples: {test.variant_b_samples}")

# Output:
# Test: Risk Threshold Optimization
# Status: running
# Progress: 45%
# Sample Size: 308
#
# Variant A (Current Threshold):
#   Performance: 78.5
#   Samples: 156
#
# Variant B (Lower Threshold):
#   Performance: 82.3
#   Samples: 152
Get Test Results
curl -X GET https://api.owlai.io/v1/ab-tests/test_abc123 \
  -H "Authorization: Bearer $TOKEN"

# Response (completed test):
# {
#   "test_id": "test_abc123",
#   "test_name": "Risk Threshold Optimization",
#   "status": "completed",
#   "completed_at": "2026-01-27T10:00:00Z",
#   "winner": "variant_b",
#   "confidence_level": 95,
#   "statistical_significance": "high",
#   "results": {
#     "variant_a": {
#       "name": "Current Threshold",
#       "performance_score": 78.5,
#       "samples": 312,
#       "metrics": {
#         "approval_rate": 0.91,
#         "false_positive_rate": 0.03,
#         "avg_response_time_ms": 245
#       }
#     },
#     "variant_b": {
#       "name": "Lower Threshold",
#       "performance_score": 82.3,
#       "samples": 304,
#       "metrics": {
#         "approval_rate": 0.97,
#         "false_positive_rate": 0.01,
#         "avg_response_time_ms": 198
#       }
#     }
#   },
#   "recommendation": "Deploy Variant B - 4.8% performance improvement with higher confidence"
# }
List All Tests
# List all tests
tests = client.ab_tests.list(
    status="all",  # running, completed, or all
    limit=10
)

for test in tests:
    status_icon = "green" if test.status == "completed" else "blue"
    print(f"[{status_icon}] {test.test_name} ({test.status})")
    print(f"  Progress: {test.progress_percentage}%")
    if test.winner:
        print(f"  Winner: {test.winner} (confidence: {test.confidence_level}%)")
Stop Test Early
# Stop a running test (if sufficient data collected)
result = client.ab_tests.stop(
    test_id="test_abc123",
    reason="Sufficient data collected for decision"
)
print(f"Test stopped: {result.final_status}")
print(f"Winner: {result.winner}")
print(f"Confidence: {result.confidence_level}%")
Apply Winning Configuration
# Apply the winning variant configuration
client.ab_tests.apply_winner(
    test_id="test_abc123",
    apply_to="production",
    rollout_percentage=100
)
print("Winning configuration applied to production")
Statistical Analysis
Confidence Level Calculation
Confidence level is based on sample size:
| Sample Size | Confidence Level | Significance |
|---|---|---|
| >= 500 | 99% | Very High |
| >= 300 | 95% | High |
| >= 200 | 90% | High |
| >= 100 | 85% | Medium |
| >= 50 | 75% | Medium |
| >= 25 | 65% | Low |
| >= 10 | 55% | Low |
| < 10 | 40-50% | Insufficient |
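The tiers above can be expressed as a simple lookup. A minimal sketch (the function name is illustrative; for the `< 10` row, which the table gives as a 40-50% range, a midpoint of 45 is assumed here):

```python
def confidence_for_sample_size(n: int) -> tuple[int, str]:
    """Map a sample size to (confidence %, significance) per the table."""
    tiers = [
        (500, 99, "very_high"),
        (300, 95, "high"),
        (200, 90, "high"),
        (100, 85, "medium"),
        (50, 75, "medium"),
        (25, 65, "low"),
        (10, 55, "low"),
    ]
    for threshold, confidence, significance in tiers:
        if n >= threshold:
            return confidence, significance
    # Table lists 40-50% for n < 10; midpoint assumed.
    return 45, "insufficient"
```

For example, the 308-sample test shown earlier lands in the 95% / high tier.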
Winner Determination
def determine_winner(test_data, metrics):
    """
    Determine winning variant based on performance metrics.

    Winner is the variant with higher performance score.
    Ties favor Variant A (control).
    """
    a_score = metrics.get("variant_a", {}).get("performance_score", 0)
    b_score = metrics.get("variant_b", {}).get("performance_score", 0)
    # Variant B wins only if strictly better
    return "variant_b" if b_score > a_score else "variant_a"
Performance Score Calculation
Performance score is calculated from multiple metrics:
performance_score = (
    approval_accuracy * 0.40 +          # 40% weight
    (1 - false_positive_rate) * 0.30 +  # 30% weight (inverted: lower rate scores higher)
    response_time_score * 0.20 +        # 20% weight
    user_satisfaction * 0.10            # 10% weight
) * 100
Auto-Completion
Scheduler Configuration
from services.ab_test_scheduler import start_scheduler, stop_scheduler
# Start the background scheduler
scheduler = start_scheduler(
    db_session_factory=get_db,
    check_interval_minutes=60  # Check every hour
)
# Stop scheduler (on shutdown)
stop_scheduler()
Auto-Completion Process
When a test expires, the scheduler:
- Finds Expired Tests: Queries tests where created_at + duration <= NOW()
- Calculates Metrics: Aggregates real metrics from alert data
- Determines Winner: Compares performance scores
- Calculates Confidence: Based on sample size
- Updates Database: Sets status to completed, records winner
- Sends Notification: Alerts test creator of results
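The decision-making middle of that process (winner, confidence, final record) can be sketched as one function. This is illustrative, not the scheduler's real code; the confidence branches are a condensed subset of the tiers in Statistical Analysis above.

```python
def complete_test(variant_a_score: float, variant_b_score: float,
                  sample_size: int) -> dict:
    """Sketch of the winner/confidence/update steps of auto-completion."""
    # Ties favor the control (Variant A).
    winner = "variant_b" if variant_b_score > variant_a_score else "variant_a"
    # Condensed subset of the sample-size confidence tiers.
    if sample_size >= 300:
        confidence = 95
    elif sample_size >= 100:
        confidence = 85
    else:
        confidence = 65
    # The record the scheduler would persist before notifying.
    return {"status": "completed", "winner": winner,
            "confidence_level": confidence}
```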
Manual Completion
# Manually complete a test
curl -X POST https://api.owlai.io/v1/ab-tests/test_abc123/complete \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "reason": "Early termination - clear winner identified"
  }'
Notifications
Completion Notification
When a test completes, notifications are sent:
{
  "notification_type": "ab_test_completed",
  "test_id": "test_abc123",
  "test_name": "Risk Threshold Optimization",
  "winner": "variant_b",
  "confidence": 95,
  "summary": "Variant B (Lower Threshold) won with 82.3 performance score vs 78.5 for Variant A",
  "recommendation": "Consider deploying Variant B to production",
  "view_results_url": "https://app.owlai.io/ab-tests/test_abc123"
}
Notification Channels
- Email to test creator
- Slack integration (if configured)
- In-app notification
- Webhook (if configured)
Best Practices
Test Design
- Clear Hypothesis: Define what you're testing and expected outcome
- Single Variable: Change only one thing between variants
- Sufficient Duration: Run tests long enough for statistical significance
- Representative Traffic: Ensure both variants get similar traffic patterns
Sample Size Guidelines
| Test Type | Minimum Samples | Recommended |
|---|---|---|
| Policy threshold | 100 | 300+ |
| Risk scoring | 200 | 500+ |
| Automation rules | 50 | 200+ |
| UI changes | 500 | 1000+ |
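A pre-decision check against the minimums above might look like this. The dictionary keys are hypothetical identifiers for the table's test types, not platform constants.

```python
# Minimum sample counts per test type (from the guidelines table).
MIN_SAMPLES = {
    "policy_threshold": 100,
    "risk_scoring": 200,
    "automation_rules": 50,
    "ui_changes": 500,
}

def has_sufficient_samples(test_type: str, sample_size: int) -> bool:
    """True if the test has reached the minimum for its type."""
    return sample_size >= MIN_SAMPLES[test_type]
```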
Duration Guidelines
| Test Type | Minimum Duration | Recommended |
|---|---|---|
| Quick validation | 24 hours | 72 hours |
| Policy change | 72 hours | 168 hours |
| Major configuration | 168 hours | 336 hours |
Interpreting Results
- High Confidence (>90%): Safe to deploy winner
- Medium Confidence (70-90%): Consider extending test
- Low Confidence (<70%): Extend test or increase traffic
- No Significant Difference: Either variant is acceptable
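The guidance above reduces to a short rule chain. An illustrative helper (name and return strings are ours, not a platform API):

```python
def recommend_action(confidence: int, significant_difference: bool) -> str:
    """Turn confidence level and significance into a next step."""
    if not significant_difference:
        return "either variant is acceptable"
    if confidence > 90:
        return "deploy winner"
    if confidence >= 70:
        return "consider extending test"
    return "extend test or increase traffic"
```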
API Reference
Endpoints
| Endpoint | Method | Description |
|---|---|---|
| /ab-tests | POST | Create new test |
| /ab-tests | GET | List all tests |
| /ab-tests/{test_id} | GET | Get test details |
| /ab-tests/{test_id} | DELETE | Delete test |
| /ab-tests/{test_id}/stop | POST | Stop running test |
| /ab-tests/{test_id}/complete | POST | Force completion |
| /ab-tests/{test_id}/apply-winner | POST | Apply winning config |
| /ab-tests/{test_id}/metrics | GET | Get detailed metrics |
Test Status Values
| Status | Description |
|---|---|
| draft | Test created but not started |
| running | Test actively collecting data |
| completing | Test being finalized |
| completed | Test finished with results |
| stopped | Test manually stopped |
| failed | Test encountered error |
Related
- Policies - Test different policy configurations
- Risk Scoring - Test risk threshold changes
- Automation - Test automation rules
- Alerts - Alert routing in A/B tests