Agent Health Monitoring
| Field | Value |
|---|---|
| Document ID | ASCEND-AGENT-001 |
| Version | 2026.04 |
| Last Updated | April 2026 |
| Author | Ascend Engineering Team |
| Publisher | OW-KAI Technologies Inc. |
| Classification | Enterprise Client Documentation |
| Compliance | SOC 2 CC6.1/CC6.2, PCI-DSS 7.1/8.3, HIPAA 164.312, NIST 800-53 AC-2/SI-4 |
Reading Time: 10 minutes | Skill Level: Intermediate
Overview
ASCEND provides Datadog-style health monitoring for all registered agents. Continuous monitoring enables early detection of issues and automatic incident response.
warning
Agents that miss three consecutive heartbeats are automatically marked as unhealthy and may trigger the kill-switch if anomaly detection is enabled. Configure heartbeat intervals appropriate to your agent's workload.
Architecture
                      HEALTH MONITORING ARCHITECTURE

  SDK Agent                  ASCEND Platform               Dashboard

┌─────────────┐            ┌─────────────────┐            ┌─────────────┐
│ Heartbeat   │───────────▶│ Health Service  │───────────▶│ Health      │
│ Every 60s   │            │                 │            │ Summary     │
│             │            │ • Process HB    │            │             │
│ • agent_id  │            │ • Update status │            │ • Online    │
│ • metrics   │            │ • Check health  │            │ • Degraded  │
│ • sdk_ver   │            │ • Detect anom.  │            │ • Offline   │
└─────────────┘            └────────┬────────┘            └─────────────┘
                                    │
                                    ▼
                           ┌─────────────────┐
                           │  Auto-Actions   │
                           │                 │
                           │ • Auto-suspend  │
                           │ • Alert notify  │
                           │ • Webhook call  │
                           └─────────────────┘
Health Status
Status Definitions
| Status | Description | Heartbeat | Action |
|---|---|---|---|
| online | Operating normally | Recent | Normal operation |
| degraded | Missed 1-2 heartbeats | Delayed | Warning alert |
| offline | Missed 3+ heartbeats | None | Critical alert |
| unknown | Never received heartbeat | Never | Check configuration |
Status Calculation
# Source: services/agent_health_service.py
# Health status is calculated from the number of missed heartbeats
from datetime import datetime, UTC

def calculate_health_status(agent):
    """Calculate agent health status from the time since its last heartbeat."""
    if not agent.last_heartbeat:
        return "unknown"
    now = datetime.now(UTC)
    expected_interval = agent.heartbeat_interval_seconds  # default: 60
    elapsed = (now - agent.last_heartbeat).total_seconds()
    missed = int(elapsed / expected_interval)
    if missed == 0:
        return "online"
    elif missed <= 2:
        return "degraded"
    else:
        return "offline"
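To sanity-check these thresholds, the mapping from elapsed time to status can be exercised in isolation. The sketch below is a simplified restatement of the logic above, not the service code itself:

```python
def status_for(elapsed_seconds: float, interval: int = 60) -> str:
    """Map seconds since the last heartbeat to a health status
    (restates calculate_health_status without the agent object)."""
    missed = int(elapsed_seconds / interval)
    if missed == 0:
        return "online"
    if missed <= 2:
        return "degraded"
    return "offline"

# With the default 60 s interval:
print(status_for(30))   # online   (0 missed)
print(status_for(90))   # degraded (1 missed)
print(status_for(150))  # degraded (2 missed)
print(status_for(200))  # offline  (3 missed)
```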
Heartbeat API
Send Heartbeat
import requests
import time

def send_heartbeat(api_key: str, agent_id: str, metrics: dict | None = None):
    """Send a heartbeat to ASCEND."""
    response = requests.post(
        "https://pilot.owkai.app/api/agents/health/heartbeat",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json"
        },
        json={
            "agent_id": agent_id,
            "metrics": metrics,
            "sdk_version": "1.0.0"
        },
        timeout=10
    )
    response.raise_for_status()
    return response.json()

# Usage
while True:
    result = send_heartbeat(
        api_key="owkai_...",
        agent_id="my-agent-001",
        metrics={
            "response_time_ms": 45.2,
            "error_rate": 0.5,
            "requests_count": 1247,
            "last_error": None
        }
    )
    print(f"Health status: {result.get('health_status')}")
    time.sleep(60)  # Every 60 seconds
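A bare `while True` loop dies on the first network error, silently stopping heartbeats until the agent is marked offline. One way to harden it is a small retry wrapper with exponential backoff; `send_with_retry` below is an illustrative helper, not part of the SDK:

```python
import time

def send_with_retry(send, attempts: int = 5, base_delay: float = 1.0, sleep=time.sleep):
    """Call a zero-argument heartbeat function, retrying transient
    failures with exponential backoff (1s, 2s, 4s, ...). Re-raises
    after the final attempt. `sleep` is injectable for testing."""
    for attempt in range(attempts):
        try:
            return send()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))

# In the loop above, replace the direct call with something like:
# result = send_with_retry(lambda: send_heartbeat(api_key, agent_id, metrics))
```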
Heartbeat Request
# Source: routes/agent_health_routes.py:36
from typing import Any, Dict, Optional
from pydantic import BaseModel, Field

class HeartbeatRequest(BaseModel):
    """Heartbeat payload from agent SDK."""
    agent_id: str = Field(..., description="Unique agent identifier")
    metrics: Optional[Dict[str, Any]] = Field(
        default=None,
        description="Optional performance metrics",
        example={
            "response_time_ms": 45.2,
            "error_rate": 0.5,
            "requests_count": 1247,
            "last_error": None
        }
    )
    sdk_version: Optional[str] = Field(
        default=None,
        description="SDK version for compatibility tracking"
    )
Heartbeat Response
{
  "success": true,
  "agent_id": "my-agent-001",
  "health_status": "online",
  "next_heartbeat_expected_at": "2025-12-15T10:31:00Z",
  "heartbeat_interval_seconds": 60
}
Batch Heartbeat
Send heartbeats for multiple agents:
curl -X POST "https://pilot.owkai.app/api/agents/health/heartbeat/batch" \
  -H "Authorization: Bearer owkai_..." \
  -H "Content-Type: application/json" \
  -d '[
    {
      "agent_id": "agent-001",
      "metrics": {"response_time_ms": 45.2}
    },
    {
      "agent_id": "agent-002",
      "metrics": {"response_time_ms": 32.1}
    }
  ]'
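In Python, the same request body can be built from a `{agent_id: metrics}` mapping. `build_batch_payload` is an illustrative helper, not an SDK function:

```python
import json

def build_batch_payload(agents: dict) -> str:
    """Serialize a {agent_id: metrics} mapping into the JSON array the
    batch endpoint expects (same shape as the curl example)."""
    return json.dumps(
        [{"agent_id": aid, "metrics": m} for aid, m in agents.items()]
    )

payload = build_batch_payload({
    "agent-001": {"response_time_ms": 45.2},
    "agent-002": {"response_time_ms": 32.1},
})
```

The resulting string can then be POSTed with `requests.post(url, data=payload, headers=...)` using the same headers as a single heartbeat.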
Health Dashboard
Get Health Summary
curl "https://pilot.owkai.app/api/agents/health/summary" \
-H "Authorization: Bearer owkai_..."
Response:
{
  "summary": {
    "total_agents": 15,
    "online": 12,
    "degraded": 2,
    "offline": 1,
    "unknown": 0,
    "health_score": 87
  },
  "metrics": {
    "avg_response_time_ms": 42.5,
    "total_requests_24h": 125847,
    "avg_error_rate": 0.3
  },
  "problem_agents": [
    {
      "agent_id": "data-processor-003",
      "status": "offline",
      "last_heartbeat": "2025-12-15T09:15:00Z",
      "minutes_offline": 45
    },
    {
      "agent_id": "api-gateway-002",
      "status": "degraded",
      "last_heartbeat": "2025-12-15T10:28:00Z",
      "error_rate": 5.2
    }
  ],
  "recent_changes": [
    {
      "agent_id": "finance-bot-001",
      "previous_status": "online",
      "new_status": "degraded",
      "changed_at": "2025-12-15T10:25:00Z"
    }
  ],
  "last_check": "2025-12-15T10:30:00Z"
}
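The exact `health_score` formula is not documented here. One weighting that reproduces the sample above (online = 1.0, degraded = 0.5, offline/unknown = 0) is sketched below; treat it as an assumption, not the platform's definition:

```python
def health_score(online: int, degraded: int, offline: int, unknown: int) -> int:
    """Weighted fleet score: full credit for online agents, half credit
    for degraded, none for offline/unknown. An assumption consistent
    with the sample summary, not the platform's documented formula."""
    total = online + degraded + offline + unknown
    if total == 0:
        return 0
    return round((online + 0.5 * degraded) / total * 100)

print(health_score(12, 2, 1, 0))  # 87, matching the sample summary
```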
Get Agent Health Detail
curl "https://pilot.owkai.app/api/agents/health/my-agent-001" \
-H "Authorization: Bearer owkai_..."
Response:
{
  "agent_id": "my-agent-001",
  "display_name": "Data Processing Agent",
  "agent_type": "supervised",
  "status": "online",
  "health": {
    "status": "online",
    "last_heartbeat": "2025-12-15T10:29:45Z",
    "next_expected": "2025-12-15T10:30:45Z",
    "heartbeat_interval_seconds": 60,
    "consecutive_missed": 0
  },
  "metrics": {
    "avg_response_time_ms": 45.2,
    "error_rate_percent": 0.5,
    "total_requests_24h": 8547,
    "sdk_version": "1.0.0"
  },
  "errors": {
    "last_error": null,
    "last_error_at": null,
    "error_count_24h": 0
  },
  "recent_history": [
    {
      "timestamp": "2025-12-15T10:29:45Z",
      "status": "online",
      "response_time_ms": 45.2
    },
    {
      "timestamp": "2025-12-15T10:28:45Z",
      "status": "online",
      "response_time_ms": 43.8
    }
  ]
}
Performance Metrics
Tracked Metrics
| Metric | Type | Description |
|---|---|---|
| avg_response_time_ms | float | Average action response time |
| error_rate_percent | float | Error rate over the last 24 hours |
| total_requests_24h | int | Total actions in the last 24 hours |
| last_error | string | Most recent error message |
| last_error_at | datetime | Timestamp of the last error |
Reporting Metrics
# Include metrics in heartbeat
client.heartbeat(
    metrics={
        "response_time_ms": measure_response_time(),
        "error_rate": calculate_error_rate(),
        "requests_count": get_request_count(),
        "memory_mb": get_memory_usage(),
        "cpu_percent": get_cpu_usage()
    }
)
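Helpers such as `measure_response_time()` and `calculate_error_rate()` are left to the integrator. A simple rolling window over recent requests is one way to back them; `MetricsWindow` below is illustrative and is not part of the SDK:

```python
from collections import deque

class MetricsWindow:
    """Rolling-window tracker that could back heartbeat metric helpers.
    Keeps the last `size` (response_ms, ok) samples."""

    def __init__(self, size: int = 1000):
        self.samples = deque(maxlen=size)  # (response_ms, ok) pairs

    def record(self, response_ms: float, ok: bool) -> None:
        self.samples.append((response_ms, ok))

    def avg_response_time_ms(self) -> float:
        if not self.samples:
            return 0.0
        return sum(ms for ms, _ in self.samples) / len(self.samples)

    def error_rate_percent(self) -> float:
        if not self.samples:
            return 0.0
        errors = sum(1 for _, ok in self.samples if not ok)
        return errors / len(self.samples) * 100

    def requests_count(self) -> int:
        return len(self.samples)
```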
Anomaly Detection
Configuration
# Source: models_agent_registry.py:173
# Anomaly detection settings
{
  "anomaly_detection_enabled": true,
  "baseline_actions_per_hour": 100.0,   # Normal action rate
  "baseline_error_rate": 0.5,           # Normal error rate (%)
  "baseline_avg_risk_score": 35.0,      # Normal risk score
  "anomaly_threshold_percent": 50.0     # Alert if deviation exceeds 50%
}
Anomaly Types
| Anomaly | Detection | Severity |
|---|---|---|
| Action Rate | Deviation from baseline exceeds threshold (default 50%) | Medium to Critical |
| Error Rate | Deviation from baseline exceeds threshold (default 50%) | High |
| Risk Score | Deviation from baseline exceeds threshold (default 50%) | High |
Detection Logic
# Source: services/agent_registry_service.py:396
def detect_anomalies(db, agent, current_action_rate, current_error_rate, current_risk_score):
    """Compare current behavior against baseline."""
    if not agent.anomaly_detection_enabled:
        return {"has_anomaly": False}
    anomalies = []
    severity = None
    threshold = agent.anomaly_threshold_percent or 50.0
    # Check action rate anomaly
    # (error-rate and risk-score checks follow the same pattern)
    if agent.baseline_actions_per_hour and current_action_rate:
        deviation = abs(current_action_rate - agent.baseline_actions_per_hour)
        deviation_percent = (deviation / agent.baseline_actions_per_hour) * 100
        if deviation_percent > threshold:
            anomalies.append({
                "type": "action_rate",
                "baseline": agent.baseline_actions_per_hour,
                "current": current_action_rate,
                "deviation_percent": deviation_percent
            })
    # Determine severity based on the largest deviation
    if anomalies:
        max_deviation = max(a["deviation_percent"] for a in anomalies)
        if max_deviation > threshold * 2:
            severity = "critical"
        elif max_deviation > threshold * 1.5:
            severity = "high"
        else:
            severity = "medium"
    return {
        "has_anomaly": len(anomalies) > 0,
        "anomalies": anomalies,
        "severity": severity
    }
Anomaly Response
{
  "has_anomaly": true,
  "anomalies": [
    {
      "type": "action_rate",
      "baseline": 100.0,
      "current": 250.0,
      "deviation_percent": 150.0,
      "threshold_percent": 50.0
    }
  ],
  "severity": "critical",
  "anomaly_count_24h": 3
}
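The severity tiers can be restated in isolation: deviations above twice the threshold are critical, above 1.5x are high, and anything else over the threshold is medium. A quick sketch paraphrasing the detection logic, not the service code:

```python
def severity_for(deviation_percent: float, threshold: float = 50.0) -> str:
    """Severity tiers restated from detect_anomalies: critical above
    2x the threshold, high above 1.5x, otherwise medium."""
    if deviation_percent > threshold * 2:
        return "critical"
    if deviation_percent > threshold * 1.5:
        return "high"
    return "medium"

print(severity_for(150.0))  # critical, as in the sample response above
```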
Auto-Suspension
Trigger Configuration
# Source: models_agent_registry.py:163
{
  "auto_suspend_enabled": true,
  "auto_suspend_on_error_rate": 0.10,       # 10% error rate
  "auto_suspend_on_offline_minutes": 30,    # 30 minutes offline
  "auto_suspend_on_budget_exceeded": true,
  "auto_suspend_on_rate_exceeded": false
}
Auto-Suspend Check
# Source: services/agent_registry_service.py:522
from datetime import datetime, UTC

def check_auto_suspend_triggers(db, agent):
    """Check if any auto-suspend conditions are met."""
    if not agent.auto_suspend_enabled:
        return {"should_suspend": False}
    now = datetime.now(UTC)
    # Error rate trigger (config stores a fraction, metrics a percentage)
    if agent.auto_suspend_on_error_rate:
        if agent.error_rate_percent >= agent.auto_suspend_on_error_rate * 100:
            return {
                "should_suspend": True,
                "trigger": "error_rate",
                "reason": f"Error rate {agent.error_rate_percent:.1f}% exceeds {agent.auto_suspend_on_error_rate * 100:.1f}%"
            }
    # Offline duration trigger
    if agent.auto_suspend_on_offline_minutes and agent.last_heartbeat:
        minutes_offline = (now - agent.last_heartbeat).total_seconds() / 60
        if minutes_offline > agent.auto_suspend_on_offline_minutes:
            return {
                "should_suspend": True,
                "trigger": "offline_duration",
                "reason": f"Agent offline for {minutes_offline:.0f} minutes"
            }
    # Budget exceeded trigger
    if agent.auto_suspend_on_budget_exceeded and agent.max_daily_budget_usd:
        if agent.current_daily_spend_usd >= agent.max_daily_budget_usd:
            return {
                "should_suspend": True,
                "trigger": "budget_exceeded",
                "reason": f"Budget exceeded: ${agent.current_daily_spend_usd:.2f}"
            }
    return {"should_suspend": False}
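Note the unit mismatch the error-rate check bridges: the trigger is configured as a fraction (`0.10`), while agent metrics report a percentage (`10.0`). A standalone restatement of that comparison:

```python
def error_rate_trigger(error_rate_percent: float, threshold_fraction: float) -> bool:
    """Restates the unit handling above: the configured trigger is a
    fraction (0.10 = 10%), the measured metric a percentage (10.0)."""
    return error_rate_percent >= threshold_fraction * 100

print(error_rate_trigger(12.5, 0.10))  # True  -> suspend
print(error_rate_trigger(8.0, 0.10))   # False
```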
Heartbeat Configuration
Update Interval
curl -X PUT "https://pilot.owkai.app/api/agents/health/my-agent-001/interval" \
  -H "Authorization: Bearer owkai_..." \
  -H "Content-Type: application/json" \
  -d '{
    "interval_seconds": 30
  }'
Interval Guidelines
| Environment | Interval | Rationale |
|---|---|---|
| Production Critical | 30 seconds | Fast issue detection |
| Production Standard | 60 seconds | Balance monitoring/overhead |
| Staging | 120 seconds | Less critical |
| Development | 300 seconds | Minimal overhead |
Manual Health Check
Trigger immediate health check:
curl -X POST "https://pilot.owkai.app/api/agents/health/check" \
-H "Authorization: Bearer owkai_..."
Response:
{
  "checked_by": "admin@company.com",
  "status_changes": [
    {
      "agent_id": "api-gateway-001",
      "previous_status": "online",
      "new_status": "degraded",
      "reason": "Missed heartbeat"
    }
  ],
  "changes_count": 1
}
SDK Integration
Python SDK
from ascend import AscendClient

client = AscendClient(
    api_key="owkai_...",
    agent_id="my-agent-001",
    heartbeat_interval=60  # seconds
)

# Heartbeat runs automatically in a background thread
# Or manually:
client.send_heartbeat(metrics={
    "response_time_ms": 45.2,
    "error_rate": 0.5
})
TypeScript SDK
import { AscendClient } from '@ascend-ai/sdk';

const client = new AscendClient({
  apiKey: process.env.ASCEND_API_KEY,
  agentId: 'my-agent-001',
  heartbeatInterval: 60000 // milliseconds
});

// Heartbeat runs automatically
// Or manually:
await client.sendHeartbeat({
  metrics: {
    responseTimeMs: 45.2,
    errorRate: 0.5
  }
});
Best Practices
1. Always Send Heartbeats
# Start heartbeat immediately after initialization
client = AscendClient(...)
client.start_heartbeat() # Background thread
2. Include Meaningful Metrics
# Good - actionable metrics
metrics = {
    "response_time_ms": 45.2,
    "error_rate": 0.5,
    "queue_depth": 150,
    "memory_percent": 75
}

# Bad - no useful information
metrics = {}
3. Set Appropriate Intervals
# Production: 60 seconds or less
# Development: Can be longer
heartbeat_interval = 60 if is_production else 300
4. Configure Auto-Suspend Carefully
# Enable for autonomous agents
{
    "auto_suspend_enabled": True,
    "auto_suspend_on_error_rate": 0.10,  # 10% - not too aggressive
    "auto_suspend_on_offline_minutes": 30
}
Next Steps
- Kill-Switch — Emergency procedures
- Smart Rules — Health-based rules
- Notifications — Alert configuration