A/B Testing Framework

Overview

The OWL AI Platform A/B Testing Framework enables organizations to run controlled experiments on their AI governance configurations. Compare different policy settings, risk thresholds, or automation rules to measure their impact on security, efficiency, and user experience. The framework includes automatic test completion, statistical significance calculation, and winner determination.

Key Capabilities

  • Controlled Experiments: Compare two variants (A/B) under identical conditions
  • Statistical Analysis: Automated confidence level and significance calculation
  • Auto-Completion: Automatic test completion when duration expires
  • Sample Size Tracking: Monitor test progress and data sufficiency
  • Performance Metrics: Track and compare variant performance scores
  • Winner Determination: Algorithmic selection of better-performing variant
  • Notification System: Alerts when tests complete or require attention

How It Works

A/B Test Lifecycle

┌─────────────────────────────────────────────────────────────────────────────────┐
│                               A/B TEST LIFECYCLE                                │
└─────────────────────────────────────────────────────────────────────────────────┘

┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   CREATE    │────>│   RUNNING   │────>│ COMPLETING  │────>│  COMPLETED  │
│             │     │             │     │             │     │             │
│ Define test │     │ Collect     │     │ Calculate   │     │ Review      │
│ Set variants│     │ samples     │     │ statistics  │     │ results     │
│ Set duration│     │ Track perf  │     │ Determine   │     │ Apply       │
│ Start       │     │ Monitor     │     │ winner      │     │ winner      │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘
       │                   │                   │                   │
       │                   │                   │                   │
       v                   v                   v                   v
┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│ Test ID     │     │ Progress %  │     │ Winner:     │     │ Action:     │
│ assigned    │     │ tracked     │     │ A or B      │     │ Apply       │
│             │     │             │     │             │     │ winning     │
│ Variants    │     │ Alerts      │     │ Confidence  │     │ config      │
│ configured  │     │ routed      │     │ calculated  │     │             │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘

Test Architecture

┌─────────────────────────────────────────────────────────────────────────────────┐
│                              A/B TEST ARCHITECTURE                              │
└─────────────────────────────────────────────────────────────────────────────────┘

                 ┌──────────────────────────────┐
                 │       INCOMING ALERTS        │
                 │  (Random 50/50 Assignment)   │
                 └──────────────┬───────────────┘
                                │
           ┌────────────────────┴────────────────────┐
           │                                         │
           v                                         v
┌─────────────────────┐                   ┌─────────────────────┐
│      VARIANT A      │                   │      VARIANT B      │
│   (Control/Base)    │                   │    (Experiment)     │
│                     │                   │                     │
│ Policy: prod-v1     │                   │ Policy: prod-v2     │
│ Risk Threshold: 40  │                   │ Risk Threshold: 35  │
│ Auto-approve: Yes   │                   │ Auto-approve: Yes   │
│                     │                   │                     │
│ ┌─────────────────┐ │                   │ ┌─────────────────┐ │
│ │ Performance     │ │                   │ │ Performance     │ │
│ │ Score: 78.5     │ │                   │ │ Score: 82.3     │ │
│ │                 │ │                   │ │                 │ │
│ │ Samples: 156    │ │                   │ │ Samples: 152    │ │
│ │ Approvals: 142  │ │                   │ │ Approvals: 148  │ │
│ │ Denials: 14     │ │                   │ │ Denials: 4      │ │
│ └─────────────────┘ │                   │ └─────────────────┘ │
└─────────────────────┘                   └─────────────────────┘
           │                                         │
           └────────────────────┬────────────────────┘
                                │
                                v
                 ┌──────────────────────────────┐
                 │     STATISTICAL ANALYSIS     │
                 │                              │
                 │ Sample Size: 308             │
                 │ Confidence: 95%              │
                 │ Significance: HIGH           │
                 │                              │
                 │ WINNER: Variant B            │
                 │ (+3.8 performance points)    │
                 └──────────────────────────────┘
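In practice the random 50/50 split is usually made deterministic, so the same alert or subject always routes to the same variant for the life of the test. A minimal sketch of that idea, assuming a stable hash of the (test, subject) pair; this is illustrative, not the platform's actual routing code:

```python
import hashlib

def assign_variant(test_id: str, subject_id: str) -> str:
    """Deterministic 50/50 assignment: hash the (test, subject) pair so
    repeated requests for the same subject land in the same variant."""
    digest = hashlib.sha256(f"{test_id}:{subject_id}".encode()).digest()
    return "variant_a" if digest[0] % 2 == 0 else "variant_b"

# Stable: the same inputs always return the same variant
print(assign_variant("test_abc123", "alert_001"))
```

Keying the hash on the test ID as well as the subject ID re-shuffles the population for each new test, so the same subjects are not always stuck in the same arm.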

Scheduler Architecture

┌─────────────────────────────────────────────────────────────────────────────────┐
│                      A/B TEST SCHEDULER (Background Task)                       │
└─────────────────────────────────────────────────────────────────────────────────┘

┌──────────────────────────────┐
│        SCHEDULER LOOP        │
│      (Every 60 minutes)      │
└──────────────┬───────────────┘
               │
               v
┌──────────────────────────────┐
│     FIND EXPIRED TESTS       │
│                              │
│ SELECT * FROM ab_tests       │
│ WHERE status = 'running'     │
│ AND created_at + duration    │
│     <= NOW()                 │
└──────────────┬───────────────┘
               │
               │ For each expired test
               v
┌──────────────────────────────┐
│      AUTO-COMPLETE TEST      │
│                              │
│ 1. Calculate metrics         │
│ 2. Determine winner          │
│ 3. Calculate confidence      │
│ 4. Update database           │
│ 5. Send notification         │
└──────────────────────────────┘

Configuration

Create A/B Test

from owlai import OWLClient

client = OWLClient(api_key="your_api_key")

# Create a new A/B test
test = client.ab_tests.create(
    test_name="Risk Threshold Optimization",
    description="Compare 40 vs 35 risk threshold for auto-approval",
    duration_hours=168,  # 7 days

    # Variant A (Control)
    variant_a={
        "name": "Current Threshold",
        "config": {
            "auto_approve_threshold": 40,
            "policy_id": "prod-policy-v1"
        }
    },

    # Variant B (Experiment)
    variant_b={
        "name": "Lower Threshold",
        "config": {
            "auto_approve_threshold": 35,
            "policy_id": "prod-policy-v2"
        }
    },

    # Success metrics
    success_metrics=["approval_rate", "false_positive_rate", "response_time"]
)

print(f"Test created: {test.test_id}")
print(f"Ends at: {test.end_time}")

Create via API

curl -X POST https://api.owlai.io/v1/ab-tests \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "test_name": "Risk Threshold Optimization",
    "description": "Compare 40 vs 35 risk threshold for auto-approval",
    "duration_hours": 168,
    "variant_a": {
      "name": "Current Threshold",
      "config": {
        "auto_approve_threshold": 40
      }
    },
    "variant_b": {
      "name": "Lower Threshold",
      "config": {
        "auto_approve_threshold": 35
      }
    }
  }'

# Response:
# {
#   "test_id": "test_abc123",
#   "test_name": "Risk Threshold Optimization",
#   "status": "running",
#   "created_at": "2026-01-20T10:00:00Z",
#   "ends_at": "2026-01-27T10:00:00Z",
#   "progress_percentage": 0
# }

Test Configuration Options

Parameter           Type     Required  Description
------------------  -------  --------  --------------------------------
test_name           String   Yes       Human-readable test name
description         String   No        Detailed test description
duration_hours      Integer  Yes       Test duration (1-720 hours)
variant_a           Object   Yes       Control variant configuration
variant_b           Object   Yes       Experiment variant configuration
success_metrics     Array    No        Metrics to optimize for
target_sample_size  Integer  No        Minimum samples needed

Usage Examples

Monitor Test Progress

# Get test status
test = client.ab_tests.get(test_id="test_abc123")

print(f"Test: {test.test_name}")
print(f"Status: {test.status}")
print(f"Progress: {test.progress_percentage}%")
print(f"Sample Size: {test.sample_size}")

print(f"\nVariant A ({test.variant_a_name}):")
print(f"  Performance: {test.variant_a_performance}")
print(f"  Samples: {test.variant_a_samples}")

print(f"\nVariant B ({test.variant_b_name}):")
print(f"  Performance: {test.variant_b_performance}")
print(f"  Samples: {test.variant_b_samples}")

# Output:
# Test: Risk Threshold Optimization
# Status: running
# Progress: 45%
# Sample Size: 308
#
# Variant A (Current Threshold):
#   Performance: 78.5
#   Samples: 156
#
# Variant B (Lower Threshold):
#   Performance: 82.3
#   Samples: 152

Get Test Results

curl -X GET https://api.owlai.io/v1/ab-tests/test_abc123 \
  -H "Authorization: Bearer $TOKEN"

# Response (completed test):
# {
#   "test_id": "test_abc123",
#   "test_name": "Risk Threshold Optimization",
#   "status": "completed",
#   "completed_at": "2026-01-27T10:00:00Z",
#   "winner": "variant_b",
#   "confidence_level": 95,
#   "statistical_significance": "high",
#   "results": {
#     "variant_a": {
#       "name": "Current Threshold",
#       "performance_score": 78.5,
#       "samples": 312,
#       "metrics": {
#         "approval_rate": 0.91,
#         "false_positive_rate": 0.03,
#         "avg_response_time_ms": 245
#       }
#     },
#     "variant_b": {
#       "name": "Lower Threshold",
#       "performance_score": 82.3,
#       "samples": 304,
#       "metrics": {
#         "approval_rate": 0.97,
#         "false_positive_rate": 0.01,
#         "avg_response_time_ms": 198
#       }
#     }
#   },
#   "recommendation": "Deploy Variant B - 4.8% performance improvement at 95% confidence"
# }

List All Tests

# List all tests
tests = client.ab_tests.list(
    status="all",  # running, completed, or all
    limit=10
)

for test in tests:
    status_icon = "green" if test.status == "completed" else "blue"
    print(f"[{status_icon}] {test.test_name} ({test.status})")
    print(f"  Progress: {test.progress_percentage}%")
    if test.winner:
        print(f"  Winner: {test.winner} (confidence: {test.confidence_level}%)")

Stop Test Early

# Stop a running test (if sufficient data collected)
result = client.ab_tests.stop(
    test_id="test_abc123",
    reason="Sufficient data collected for decision"
)

print(f"Test stopped: {result.final_status}")
print(f"Winner: {result.winner}")
print(f"Confidence: {result.confidence_level}%")

Apply Winning Configuration

# Apply the winning variant configuration
client.ab_tests.apply_winner(
    test_id="test_abc123",
    apply_to="production",
    rollout_percentage=100
)

print("Winning configuration applied to production")

Statistical Analysis

Confidence Level Calculation

Confidence level is based on sample size:

Sample Size  Confidence Level  Significance
-----------  ----------------  ------------
>= 500       99%               Very High
>= 300       95%               High
>= 200       90%               High
>= 100       85%               Medium
>= 50        75%               Medium
>= 25        65%               Low
>= 10        55%               Low
< 10         40-50%            Insufficient
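The table above reduces to a simple threshold lookup. A hypothetical sketch of that mapping (confidence_level here is an illustrative helper, not a documented SDK function):

```python
def confidence_level(sample_size: int) -> tuple:
    """Map a sample size to (confidence %, significance), following the
    thresholds in the table above, checked from largest to smallest."""
    tiers = [
        (500, 99, "very_high"),
        (300, 95, "high"),
        (200, 90, "high"),
        (100, 85, "medium"),
        (50, 75, "medium"),
        (25, 65, "low"),
        (10, 55, "low"),
    ]
    for min_samples, confidence, significance in tiers:
        if sample_size >= min_samples:
            return confidence, significance
    return 45, "insufficient"  # < 10 samples: midpoint of the 40-50% band

print(confidence_level(308))  # → (95, 'high'), matching the example test
```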

Winner Determination

def determine_winner(test_data, metrics):
    """
    Determine winning variant based on performance metrics.

    Winner is the variant with higher performance score.
    Ties favor Variant A (control).
    """
    a_score = metrics.get("variant_a", {}).get("performance_score", 0)
    b_score = metrics.get("variant_b", {}).get("performance_score", 0)

    # Variant B wins if strictly better
    return "variant_b" if b_score > a_score else "variant_a"

Performance Score Calculation

Performance score is calculated from multiple metrics:

performance_score = (
    approval_accuracy * 0.40 +          # 40% weight
    (1 - false_positive_rate) * 0.30 +  # 30% weight (inverted: lower rate scores higher)
    response_time_score * 0.20 +        # 20% weight
    user_satisfaction * 0.10            # 10% weight
) * 100
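As a runnable illustration of the weighting (the input values here are made up, and response_time_score and user_satisfaction are assumed to be pre-normalized to the [0, 1] range):

```python
def performance_score(approval_accuracy, false_positive_rate,
                      response_time_score, user_satisfaction):
    """Weighted composite on a 0-100 scale; the false-positive rate is
    inverted so that lower rates contribute a higher score."""
    return (
        approval_accuracy * 0.40 +
        (1 - false_positive_rate) * 0.30 +
        response_time_score * 0.20 +
        user_satisfaction * 0.10
    ) * 100

# Made-up inputs loosely based on Variant B's metrics above
print(round(performance_score(0.97, 0.01, 0.80, 0.85), 1))  # → 93.0
```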

Auto-Completion

Scheduler Configuration

from services.ab_test_scheduler import start_scheduler, stop_scheduler

# Start the background scheduler
scheduler = start_scheduler(
    db_session_factory=get_db,
    check_interval_minutes=60  # Check every hour
)

# Stop scheduler (on shutdown)
stop_scheduler()

Auto-Completion Process

When a test expires, the scheduler:

  1. Finds Expired Tests: Queries tests where created_at + duration <= NOW()
  2. Calculates Metrics: Aggregates real metrics from alert data
  3. Determines Winner: Compares performance scores
  4. Calculates Confidence: Based on sample size
  5. Updates Database: Sets status to completed, records winner
  6. Sends Notification: Alerts test creator of results
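The expiry check in step 1 boils down to a timestamp comparison; a minimal sketch:

```python
import datetime as dt

def is_expired(created_at: dt.datetime, duration_hours: int,
               now: dt.datetime) -> bool:
    """The scheduler's expiry predicate: created_at + duration <= NOW()."""
    return created_at + dt.timedelta(hours=duration_hours) <= now

# A 168-hour (7-day) test created exactly 168 hours ago has just expired
created = dt.datetime(2026, 1, 20, 10, 0, tzinfo=dt.timezone.utc)
now = dt.datetime(2026, 1, 27, 10, 0, tzinfo=dt.timezone.utc)
print(is_expired(created, 168, now))  # → True
```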

Manual Completion

# Manually complete a test
curl -X POST https://api.owlai.io/v1/ab-tests/test_abc123/complete \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "reason": "Early termination - clear winner identified"
  }'

Notifications

Completion Notification

When a test completes, notifications are sent:

{
  "notification_type": "ab_test_completed",
  "test_id": "test_abc123",
  "test_name": "Risk Threshold Optimization",
  "winner": "variant_b",
  "confidence": 95,
  "summary": "Variant B (Lower Threshold) won with 82.3 performance score vs 78.5 for Variant A",
  "recommendation": "Consider deploying Variant B to production",
  "view_results_url": "https://app.owlai.io/ab-tests/test_abc123"
}

Notification Channels

  • Email to test creator
  • Slack integration (if configured)
  • In-app notification
  • Webhook (if configured)

Best Practices

Test Design

  1. Clear Hypothesis: Define what you're testing and expected outcome
  2. Single Variable: Change only one thing between variants
  3. Sufficient Duration: Run tests long enough for statistical significance
  4. Representative Traffic: Ensure both variants get similar traffic patterns

Sample Size Guidelines

Test Type         Minimum Samples  Recommended
----------------  ---------------  -----------
Policy threshold  100              300+
Risk scoring      200              500+
Automation rules  50               200+
UI changes        500              1000+

Duration Guidelines

Test Type            Minimum Duration  Recommended
-------------------  ----------------  -----------
Quick validation     24 hours          72 hours
Policy change        72 hours          168 hours
Major configuration  168 hours         336 hours

Interpreting Results

  1. High Confidence (>90%): Safe to deploy winner
  2. Medium Confidence (70-90%): Consider extending test
  3. Low Confidence (<70%): Extend test or increase traffic
  4. No Significant Difference: Either variant is acceptable
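These rules of thumb can be encoded directly; an illustrative helper (next_action is not part of the SDK):

```python
def next_action(confidence: float, significant: bool) -> str:
    """Suggest a follow-up action based on the guidance above."""
    if not significant:
        return "either variant acceptable"
    if confidence > 90:
        return "deploy winner"
    if confidence >= 70:
        return "consider extending test"
    return "extend test or increase traffic"

print(next_action(95, True))  # → deploy winner
```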

API Reference

Endpoints

Endpoint                          Method  Description
--------------------------------  ------  --------------------
/ab-tests                         POST    Create new test
/ab-tests                         GET     List all tests
/ab-tests/{test_id}               GET     Get test details
/ab-tests/{test_id}               DELETE  Delete test
/ab-tests/{test_id}/stop          POST    Stop running test
/ab-tests/{test_id}/complete      POST    Force completion
/ab-tests/{test_id}/apply-winner  POST    Apply winning config
/ab-tests/{test_id}/metrics       GET     Get detailed metrics

Test Status Values

Status      Description
----------  -----------------------------
draft       Test created but not started
running     Test actively collecting data
completing  Test being finalized
completed   Test finished with results
stopped     Test manually stopped
failed      Test encountered error
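The lifecycle implies a small set of legal status transitions. A sketch of a validation helper; the transition set below is an assumption inferred from the lifecycle diagram (for example, it assumes a running test always passes through completing before completed):

```python
# Assumed legal transitions, inferred from the lifecycle diagram above
ALLOWED_TRANSITIONS = {
    "draft": {"running"},
    "running": {"completing", "stopped", "failed"},
    "completing": {"completed", "failed"},
    "completed": set(),
    "stopped": set(),
    "failed": set(),
}

def can_transition(current: str, target: str) -> bool:
    """True if a test may move directly from `current` to `target`."""
    return target in ALLOWED_TRANSITIONS.get(current, set())

print(can_transition("running", "completed"))  # → False: must pass through 'completing'
```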