
Agent Health Monitoring

Field            Value
Document ID      ASCEND-AGENT-001
Version          2026.04
Last Updated     April 2026
Author           Ascend Engineering Team
Publisher        OW-KAI Technologies Inc.
Classification   Enterprise Client Documentation
Compliance       SOC 2 CC6.1/CC6.2, PCI-DSS 7.1/8.3, HIPAA 164.312, NIST 800-53 AC-2/SI-4

Reading Time: 10 minutes | Skill Level: Intermediate

Overview

ASCEND provides Datadog-style health monitoring for all registered agents. Continuous monitoring enables early detection of issues and automatic incident response.

warning

Agents that miss three consecutive heartbeats are automatically marked as unhealthy and may trigger the kill-switch if anomaly detection is enabled. Configure heartbeat intervals appropriate to your agent's workload.

Architecture

                     HEALTH MONITORING ARCHITECTURE

  SDK Agent              ASCEND Platform             Dashboard

┌─────────────┐       ┌─────────────────┐       ┌─────────────┐
│ Heartbeat   │──────▶│ Health Service  │──────▶│ Health      │
│ Every 60s   │       │                 │       │ Summary     │
│             │       │ • Process HB    │       │             │
│ • agent_id  │       │ • Update status │       │ • Online    │
│ • metrics   │       │ • Check health  │       │ • Degraded  │
│ • sdk_ver   │       │ • Detect anom.  │       │ • Offline   │
└─────────────┘       └────────┬────────┘       └─────────────┘
                               │
                               ▼
                      ┌─────────────────┐
                      │  Auto-Actions   │
                      │                 │
                      │ • Auto-suspend  │
                      │ • Alert notify  │
                      │ • Webhook call  │
                      └─────────────────┘

Health Status

Status Definitions

Status     Description                Heartbeat   Action
online     Operating normally         Recent      Normal operation
degraded   Missed 1-2 heartbeats      Delayed     Warning alert
offline    Missed 3+ heartbeats       None        Critical alert
unknown    Never received heartbeat   Never       Check configuration

Status Calculation

# Source: services/agent_health_service.py
# Health status is calculated based on missed heartbeats

from datetime import datetime, UTC

def calculate_health_status(agent):
    """Calculate agent health status."""
    if not agent.last_heartbeat:
        return "unknown"

    now = datetime.now(UTC)
    expected_interval = agent.heartbeat_interval_seconds  # default: 60

    elapsed = (now - agent.last_heartbeat).total_seconds()
    missed = int(elapsed / expected_interval)

    if missed == 0:
        return "online"
    elif missed <= 2:
        return "degraded"
    else:
        return "offline"

Heartbeat API

Send Heartbeat

import requests
import time

def send_heartbeat(api_key: str, agent_id: str, metrics: dict = None):
    """Send heartbeat to ASCEND."""
    response = requests.post(
        "https://pilot.owkai.app/api/agents/health/heartbeat",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json"
        },
        json={
            "agent_id": agent_id,
            "metrics": metrics,
            "sdk_version": "1.0.0"
        }
    )
    return response.json()

# Usage
while True:
    result = send_heartbeat(
        api_key="owkai_...",
        agent_id="my-agent-001",
        metrics={
            "response_time_ms": 45.2,
            "error_rate": 0.5,
            "requests_count": 1247,
            "last_error": None
        }
    )
    print(f"Health status: {result.get('health_status')}")
    time.sleep(60)  # Every 60 seconds
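A bare loop like the one above dies on the first network error, which the platform then reads as missed heartbeats. A more resilient pattern retries with capped backoff; this is a hedged sketch, not SDK behavior, and the `base`/`cap` values are illustrative.

```python
import time

def next_backoff(current, base=60, cap=300, succeeded=False):
    """Next sleep interval: reset to base on success, double (capped) on failure."""
    return base if succeeded else min(current * 2, cap)

def heartbeat_loop(send_fn, base=60, cap=300):
    """Keep an agent's heartbeat alive through transient network errors."""
    delay = base
    while True:
        try:
            send_fn()  # e.g. the send_heartbeat() call above
            delay = next_backoff(delay, base, cap, succeeded=True)
        except Exception as exc:  # requests.RequestException in practice
            print(f"heartbeat failed: {exc}; next attempt in {delay}s")
            delay = next_backoff(delay, base, cap)
        time.sleep(delay)

# Failure doubling is capped so a long outage never stops retry attempts:
print(next_backoff(60))   # 120
print(next_backoff(240))  # 300 (capped)
```

Capping the backoff below the platform's offline threshold keeps the agent able to recover its status on its own once connectivity returns.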

Heartbeat Request

# Source: routes/agent_health_routes.py:36
from typing import Any, Dict, Optional

from pydantic import BaseModel, Field

class HeartbeatRequest(BaseModel):
    """Heartbeat payload from agent SDK."""
    agent_id: str = Field(..., description="Unique agent identifier")
    metrics: Optional[Dict[str, Any]] = Field(
        default=None,
        description="Optional performance metrics",
        example={
            "response_time_ms": 45.2,
            "error_rate": 0.5,
            "requests_count": 1247,
            "last_error": None
        }
    )
    sdk_version: Optional[str] = Field(
        default=None,
        description="SDK version for compatibility tracking"
    )

Heartbeat Response

{
  "success": true,
  "agent_id": "my-agent-001",
  "health_status": "online",
  "next_heartbeat_expected_at": "2025-12-15T10:31:00Z",
  "heartbeat_interval_seconds": 60
}

Batch Heartbeat

Send heartbeats for multiple agents:

curl -X POST "https://pilot.owkai.app/api/agents/health/heartbeat/batch" \
  -H "Authorization: Bearer owkai_..." \
  -H "Content-Type: application/json" \
  -d '[
    {
      "agent_id": "agent-001",
      "metrics": {"response_time_ms": 45.2}
    },
    {
      "agent_id": "agent-002",
      "metrics": {"response_time_ms": 32.1}
    }
  ]'
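The same call from Python, for fleets managed in one process. This is a sketch against the endpoint shown above; the `build_batch_payload` helper and its `{agent_id: metrics}` input shape are illustrative conveniences, not SDK API.

```python
import requests

def build_batch_payload(agents: dict) -> list:
    """Turn {agent_id: metrics} into the list the batch endpoint expects."""
    return [{"agent_id": aid, "metrics": m} for aid, m in agents.items()]

def send_batch_heartbeat(api_key: str, agents: dict) -> dict:
    """POST heartbeats for several agents in a single request."""
    response = requests.post(
        "https://pilot.owkai.app/api/agents/health/heartbeat/batch",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        json=build_batch_payload(agents),
    )
    response.raise_for_status()
    return response.json()

# Payloads match the curl example above:
print(build_batch_payload({"agent-001": {"response_time_ms": 45.2}}))
```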

Health Dashboard

Get Health Summary

curl "https://pilot.owkai.app/api/agents/health/summary" \
  -H "Authorization: Bearer owkai_..."

Response:

{
  "summary": {
    "total_agents": 15,
    "online": 12,
    "degraded": 2,
    "offline": 1,
    "unknown": 0,
    "health_score": 87
  },
  "metrics": {
    "avg_response_time_ms": 42.5,
    "total_requests_24h": 125847,
    "avg_error_rate": 0.3
  },
  "problem_agents": [
    {
      "agent_id": "data-processor-003",
      "status": "offline",
      "last_heartbeat": "2025-12-15T09:15:00Z",
      "minutes_offline": 45
    },
    {
      "agent_id": "api-gateway-002",
      "status": "degraded",
      "last_heartbeat": "2025-12-15T10:28:00Z",
      "error_rate": 5.2
    }
  ],
  "recent_changes": [
    {
      "agent_id": "finance-bot-001",
      "previous_status": "online",
      "new_status": "degraded",
      "changed_at": "2025-12-15T10:25:00Z"
    }
  ],
  "last_check": "2025-12-15T10:30:00Z"
}
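A monitoring job can consume this payload directly. The sketch below ranks `problem_agents` so offline agents surface first; the `triage` helper is illustrative, but the field names are those of the response shown above.

```python
def triage(summary_payload: dict) -> list:
    """Return (agent_id, status) pairs needing attention, offline first."""
    problems = summary_payload.get("problem_agents", [])
    order = {"offline": 0, "degraded": 1, "unknown": 2}
    ranked = sorted(problems, key=lambda a: order.get(a["status"], 3))
    return [(a["agent_id"], a["status"]) for a in ranked]

# Trimmed copy of the response above
example = {
    "problem_agents": [
        {"agent_id": "api-gateway-002", "status": "degraded"},
        {"agent_id": "data-processor-003", "status": "offline"},
    ]
}
print(triage(example))
# [('data-processor-003', 'offline'), ('api-gateway-002', 'degraded')]
```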

Get Agent Health Detail

curl "https://pilot.owkai.app/api/agents/health/my-agent-001" \
  -H "Authorization: Bearer owkai_..."

Response:

{
  "agent_id": "my-agent-001",
  "display_name": "Data Processing Agent",
  "agent_type": "supervised",
  "status": "online",
  "health": {
    "status": "online",
    "last_heartbeat": "2025-12-15T10:29:45Z",
    "next_expected": "2025-12-15T10:30:45Z",
    "heartbeat_interval_seconds": 60,
    "consecutive_missed": 0
  },
  "metrics": {
    "avg_response_time_ms": 45.2,
    "error_rate_percent": 0.5,
    "total_requests_24h": 8547,
    "sdk_version": "1.0.0"
  },
  "errors": {
    "last_error": null,
    "last_error_at": null,
    "error_count_24h": 42
  },
  "recent_history": [
    {
      "timestamp": "2025-12-15T10:29:45Z",
      "status": "online",
      "response_time_ms": 45.2
    },
    {
      "timestamp": "2025-12-15T10:28:45Z",
      "status": "online",
      "response_time_ms": 43.8
    }
  ]
}

Performance Metrics

Tracked Metrics

Metric                 Type       Description
avg_response_time_ms   float      Average action response time
error_rate_percent     float      Error rate over 24 hours
total_requests_24h     int        Total actions in last 24 hours
last_error             string     Most recent error message
last_error_at          datetime   Timestamp of last error
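Agents have to compute these numbers themselves before each heartbeat. A minimal in-process collector might look like the following; `MetricsWindow` and its method names are illustrative, not part of the SDK.

```python
from collections import deque

class MetricsWindow:
    """Track recent requests so a heartbeat can report the tracked metrics."""

    def __init__(self, max_samples=1000):
        # Each sample is a (duration_ms, ok) pair; old samples roll off.
        self.samples = deque(maxlen=max_samples)

    def record(self, duration_ms: float, ok: bool = True):
        self.samples.append((duration_ms, ok))

    def snapshot(self) -> dict:
        """Build the metrics dict to attach to the next heartbeat."""
        if not self.samples:
            return {"requests_count": 0}
        durations = [d for d, _ in self.samples]
        errors = sum(1 for _, ok in self.samples if not ok)
        return {
            "avg_response_time_ms": sum(durations) / len(durations),
            "error_rate": 100.0 * errors / len(self.samples),
            "requests_count": len(self.samples),
        }

w = MetricsWindow()
for ms, ok in [(40.0, True), (50.0, True), (60.0, False), (50.0, True)]:
    w.record(ms, ok)
print(w.snapshot())
# {'avg_response_time_ms': 50.0, 'error_rate': 25.0, 'requests_count': 4}
```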

Reporting Metrics

# Include metrics in heartbeat
client.heartbeat(
    metrics={
        "response_time_ms": measure_response_time(),
        "error_rate": calculate_error_rate(),
        "requests_count": get_request_count(),
        "memory_mb": get_memory_usage(),
        "cpu_percent": get_cpu_usage()
    }
)

Anomaly Detection

Configuration

# Source: models_agent_registry.py:173
# Anomaly detection settings
{
  "anomaly_detection_enabled": true,
  "baseline_actions_per_hour": 100.0,  # Normal action rate
  "baseline_error_rate": 0.5,          # Normal error rate (%)
  "baseline_avg_risk_score": 35.0,     # Normal risk score
  "anomaly_threshold_percent": 50.0    # Alert if 50% deviation
}

Anomaly Types

Anomaly       Detection                       Severity
Action Rate   Current rate > baseline + 50%   Medium to Critical
Error Rate    Current rate > baseline + 50%   High
Risk Score    Average risk > baseline + 50%   High

Detection Logic

# Source: services/agent_registry_service.py:396
def detect_anomalies(db, agent, current_action_rate, current_error_rate, current_risk_score):
    """Compare current behavior against baseline."""

    if not agent.anomaly_detection_enabled:
        return {"has_anomaly": False}

    anomalies = []
    threshold = agent.anomaly_threshold_percent or 50.0

    # Check action rate anomaly
    # (error-rate and risk-score checks follow the same pattern)
    if agent.baseline_actions_per_hour and current_action_rate:
        deviation = abs(current_action_rate - agent.baseline_actions_per_hour)
        deviation_percent = (deviation / agent.baseline_actions_per_hour) * 100

        if deviation_percent > threshold:
            anomalies.append({
                "type": "action_rate",
                "baseline": agent.baseline_actions_per_hour,
                "current": current_action_rate,
                "deviation_percent": deviation_percent
            })

    # Determine severity based on max deviation
    severity = None
    if anomalies:
        max_deviation = max(a["deviation_percent"] for a in anomalies)
        if max_deviation > threshold * 2:
            severity = "critical"
        elif max_deviation > threshold * 1.5:
            severity = "high"
        else:
            severity = "medium"

    return {
        "has_anomaly": len(anomalies) > 0,
        "anomalies": anomalies,
        "severity": severity
    }
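The severity tiers can be checked in isolation. Below is a standalone restatement of the same deviation math (helper names are illustrative); with the defaults, severity escalates at 75% and 100% deviation.

```python
def deviation_percent(baseline: float, current: float) -> float:
    """Percent deviation of current behavior from the configured baseline."""
    return abs(current - baseline) / baseline * 100

def severity_for(max_deviation: float, threshold: float = 50.0) -> str:
    """Same tiers as detect_anomalies above."""
    if max_deviation > threshold * 2:
        return "critical"
    if max_deviation > threshold * 1.5:
        return "high"
    return "medium"

# An agent baselined at 100 actions/hour that suddenly does 250:
dev = deviation_percent(100.0, 250.0)
print(dev, severity_for(dev))  # 150.0 critical
```

This matches the anomaly response example below the table: a 150% deviation against a 50% threshold is classified critical.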

Anomaly Response

{
  "has_anomaly": true,
  "anomalies": [
    {
      "type": "action_rate",
      "baseline": 100.0,
      "current": 250.0,
      "deviation_percent": 150.0,
      "threshold_percent": 50.0
    }
  ],
  "severity": "critical",
  "anomaly_count_24h": 3
}

Auto-Suspension

Trigger Configuration

# Source: models_agent_registry.py:163
{
  "auto_suspend_enabled": true,
  "auto_suspend_on_error_rate": 0.10,      # 10% error rate
  "auto_suspend_on_offline_minutes": 30,   # 30 minutes offline
  "auto_suspend_on_budget_exceeded": true,
  "auto_suspend_on_rate_exceeded": false
}

Auto-Suspend Check

# Source: services/agent_registry_service.py:522
from datetime import datetime, UTC

def check_auto_suspend_triggers(db, agent):
    """Check if any auto-suspend conditions are met."""

    if not agent.auto_suspend_enabled:
        return {"should_suspend": False}

    # Error rate trigger
    if agent.auto_suspend_on_error_rate:
        if agent.error_rate_percent >= agent.auto_suspend_on_error_rate * 100:
            return {
                "should_suspend": True,
                "trigger": "error_rate",
                "reason": f"Error rate {agent.error_rate_percent:.1f}% exceeds {agent.auto_suspend_on_error_rate * 100:.1f}%"
            }

    # Offline duration trigger
    if agent.auto_suspend_on_offline_minutes and agent.last_heartbeat:
        now = datetime.now(UTC)
        minutes_offline = (now - agent.last_heartbeat).total_seconds() / 60
        if minutes_offline > agent.auto_suspend_on_offline_minutes:
            return {
                "should_suspend": True,
                "trigger": "offline_duration",
                "reason": f"Agent offline for {minutes_offline:.0f} minutes"
            }

    # Budget exceeded trigger
    if agent.auto_suspend_on_budget_exceeded and agent.max_daily_budget_usd:
        if agent.current_daily_spend_usd >= agent.max_daily_budget_usd:
            return {
                "should_suspend": True,
                "trigger": "budget_exceeded",
                "reason": f"Budget exceeded: ${agent.current_daily_spend_usd:.2f}"
            }

    return {"should_suspend": False}
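Note the unit mismatch the error-rate check bridges: the trigger is configured as a fraction (0.10) while agents report a percentage, so the code multiplies the trigger by 100 before comparing. A minimal restatement of that comparison (the helper name is illustrative):

```python
def error_rate_trips(error_rate_percent: float, trigger_fraction: float) -> bool:
    """True when the reported percent error rate meets the configured fraction."""
    return error_rate_percent >= trigger_fraction * 100

print(error_rate_trips(12.5, 0.10))  # True: 12.5% >= 10%
print(error_rate_trips(5.0, 0.10))   # False
```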

Heartbeat Configuration

Update Interval

curl -X PUT "https://pilot.owkai.app/api/agents/health/my-agent-001/interval" \
  -H "Authorization: Bearer owkai_..." \
  -H "Content-Type: application/json" \
  -d '{
    "interval_seconds": 30
  }'

Interval Guidelines

Environment           Interval      Rationale
Production Critical   30 seconds    Fast issue detection
Production Standard   60 seconds    Balance monitoring/overhead
Staging               120 seconds   Less critical
Development           300 seconds   Minimal overhead
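The table above can be encoded as a lookup so deployment tooling picks intervals consistently. The environment keys and the helper below are illustrative; only the interval values come from the guidelines.

```python
# Seconds per environment, per the interval guidelines above
HEARTBEAT_INTERVALS = {
    "production-critical": 30,
    "production-standard": 60,
    "staging": 120,
    "development": 300,
}

def heartbeat_interval_for(environment: str) -> int:
    """Fall back to the conservative production-standard interval when unsure."""
    return HEARTBEAT_INTERVALS.get(environment, 60)

print(heartbeat_interval_for("staging"))      # 120
print(heartbeat_interval_for("unknown-env"))  # 60
```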

Manual Health Check

Trigger immediate health check:

curl -X POST "https://pilot.owkai.app/api/agents/health/check" \
  -H "Authorization: Bearer owkai_..."

Response:

{
  "checked_by": "admin@company.com",
  "status_changes": [
    {
      "agent_id": "api-gateway-001",
      "previous_status": "online",
      "new_status": "degraded",
      "reason": "Missed heartbeat"
    }
  ],
  "changes_count": 1
}

SDK Integration

Python SDK

from ascend import AscendClient

client = AscendClient(
    api_key="owkai_...",
    agent_id="my-agent-001",
    heartbeat_interval=60  # seconds
)

# Heartbeat runs automatically in background thread
# Or manually:
client.send_heartbeat(metrics={
    "response_time_ms": 45.2,
    "error_rate": 0.5
})

TypeScript SDK

import { AscendClient } from '@ascend-ai/sdk';

const client = new AscendClient({
  apiKey: process.env.ASCEND_API_KEY,
  agentId: 'my-agent-001',
  heartbeatInterval: 60000 // milliseconds
});

// Heartbeat runs automatically
// Or manually:
await client.sendHeartbeat({
  metrics: {
    responseTimeMs: 45.2,
    errorRate: 0.5
  }
});
```

Best Practices

1. Always Send Heartbeats

# Start heartbeat immediately after initialization
client = AscendClient(...)
client.start_heartbeat() # Background thread
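If your integration cannot use the SDK's built-in sender, the usual pattern is a daemon thread driven by an Event. This is a hedged sketch, not SDK internals; `send_fn` stands in for whatever heartbeat call you use.

```python
import threading
import time

class HeartbeatThread:
    """Call send_fn every interval_seconds on a daemon thread until stopped."""

    def __init__(self, send_fn, interval_seconds=60):
        self._send_fn = send_fn
        self._interval = interval_seconds
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Event.wait doubles as an interruptible sleep: it returns True
        # as soon as stop() sets the flag, ending the loop promptly.
        while not self._stop.wait(self._interval):
            self._send_fn()

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

# Demo with a fast interval and a counter in place of a real API call
beats = []
hb = HeartbeatThread(lambda: beats.append(time.monotonic()), interval_seconds=0.01)
hb.start()
time.sleep(0.1)
hb.stop()
print(len(beats) > 0)  # True: heartbeats fired in the background
```

Using `Event.wait` instead of `time.sleep` means shutdown is immediate rather than waiting out the remainder of the interval.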

2. Include Meaningful Metrics

# Good - actionable metrics
metrics={
    "response_time_ms": 45.2,
    "error_rate": 0.5,
    "queue_depth": 150,
    "memory_percent": 75
}

# Bad - no useful information
metrics={}

3. Set Appropriate Intervals

# Production: 60 seconds or less
# Development: Can be longer
heartbeat_interval = 60 if is_production else 300

4. Configure Auto-Suspend Carefully

# Enable for autonomous agents
{
    "auto_suspend_enabled": True,
    "auto_suspend_on_error_rate": 0.10,  # 10% - not too aggressive
    "auto_suspend_on_offline_minutes": 30
}

Document Version: 2026.04 | Last Updated: April 2026