
System Diagnostics

Enterprise-grade health monitoring with Splunk/Datadog-compatible audit trails, real-time component analysis, and automated remediation suggestions.

Overview

The System Diagnostics module provides comprehensive platform health monitoring aligned with industry leaders:

| Pattern | Industry Leader | ASCEND Implementation |
| --- | --- | --- |
| Common Information Model | Splunk CIM | Standardized event format |
| Metrics Registry | Datadog Metrics | Centralized metric definitions |
| Distributed Tracing | Wiz.io | Correlation IDs for request tracking |
| Audit Trail | SOC 2 AU-6 | Immutable diagnostic logs |

Compliance

| Standard | Requirement | Implementation |
| --- | --- | --- |
| SOC 2 CC7.2 | Security Monitoring | Real-time health checks |
| PCI-DSS 10.2 | Audit Trail Requirements | Immutable diagnostic_audit_logs table |
| HIPAA 164.312 | Audit Controls | SIEM-compatible export formats |
| NIST AU-6 | Audit Review & Analysis | Correlation ID tracking |

API Endpoints

All endpoints are rate-limited and require authentication.

Full Health Check

GET /api/diagnostics/health

Rate Limit: 10 requests/minute

Response:

{
  "status": "healthy",
  "health_score": 98.5,
  "severity": "INFO",
  "components": {
    "api": {"status": "healthy", "score": 100},
    "database": {"status": "healthy", "score": 98},
    "integrations": {"status": "healthy", "score": 95},
    "security": {"status": "healthy", "score": 100}
  },
  "timestamp": "2025-12-04T14:30:52Z",
  "correlation_id": "diag_4_20251204_143052_a1b2c3d4"
}
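
As a rough illustration of consuming this response, the sketch below parses the JSON body and lists the components whose score falls below a chosen cutoff. The helper name `degraded_components` and the default cutoff of 90 are assumptions made for the example, not part of the API.

```python
import json

# Sample body copied from the response shown above.
SAMPLE_RESPONSE = """
{
  "status": "healthy",
  "health_score": 98.5,
  "severity": "INFO",
  "components": {
    "api": {"status": "healthy", "score": 100},
    "database": {"status": "healthy", "score": 98},
    "integrations": {"status": "healthy", "score": 95},
    "security": {"status": "healthy", "score": 100}
  },
  "timestamp": "2025-12-04T14:30:52Z",
  "correlation_id": "diag_4_20251204_143052_a1b2c3d4"
}
"""


def degraded_components(body: str, cutoff: float = 90.0) -> list[str]:
    """Return the names of components scoring below `cutoff` (hypothetical helper)."""
    payload = json.loads(body)
    components = payload.get("components", {})
    return sorted(
        name for name, info in components.items() if info.get("score", 0) < cutoff
    )
```

With the sample body, every component is at or above 90, so the default cutoff yields an empty list; a stricter cutoff such as 99 would flag `database` and `integrations`.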

API Health

GET /api/diagnostics/api

Rate Limit: 20 requests/minute

Tests endpoint responsiveness, rate limiter status, and the authentication system.

Database Health

GET /api/diagnostics/database

Rate Limit: 20 requests/minute

Checks database connectivity, query performance, and connection pool utilization.

Integration Health

GET /api/diagnostics/integrations

Rate Limit: 20 requests/minute

Verifies SIEM connectivity, webhook endpoints, and notification channels.

Diagnostic History

GET /api/diagnostics/history?limit=50

Rate Limit: 30 requests/minute

Returns historical diagnostic records for trend analysis.
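
Trend analysis over these records could look like the following sketch. The response shape of this endpoint is not documented here, so the code assumes each record exposes `health_score` and `created_at` fields matching the diagnostic_audit_logs columns; `score_trend` is a hypothetical helper.

```python
from statistics import mean


def score_trend(records: list[dict], window: int = 3) -> float:
    """Mean of the latest `window` scores minus the mean of the `window` before them.

    A negative result suggests health is declining. Assumes ISO 8601 UTC
    timestamps in a uniform format, which sort correctly as plain strings.
    """
    ordered = sorted(records, key=lambda r: r["created_at"])
    scores = [r["health_score"] for r in ordered]
    if len(scores) < 2 * window:
        raise ValueError("need at least 2 * window records")
    recent = mean(scores[-window:])
    previous = mean(scores[-2 * window:-window])
    return recent - previous
```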

SIEM Export

POST /api/diagnostics/export

Rate Limit: 5 requests/minute

Request Body:

{
  "format": "splunk_cim",
  "start_date": "2025-12-01",
  "end_date": "2025-12-04"
}

Supported Formats:

  • splunk_cim - Splunk Common Information Model
  • datadog_metrics - Datadog metrics format
  • wiz_json - Wiz.io JSON format
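
A client might validate the export request before sending it. The sketch below builds the request body from the three supported formats listed above; the function name and the start-before-end check are assumptions, since the server-side validation rules are not described here.

```python
from datetime import date

SUPPORTED_FORMATS = {"splunk_cim", "datadog_metrics", "wiz_json"}


def build_export_request(fmt: str, start: date, end: date) -> dict:
    """Build a request body for POST /api/diagnostics/export (hypothetical helper)."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported export format: {fmt!r}")
    if start > end:
        raise ValueError("start_date must not be after end_date")
    return {
        "format": fmt,
        "start_date": start.isoformat(),
        "end_date": end.isoformat(),
    }
```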

Health Score Calculation

The composite health score uses a weighted formula:

Health Score = (API × 0.30) + (Database × 0.40) + (Integrations × 0.20) + (Security × 0.10)

| Component | Weight | Description |
| --- | --- | --- |
| API | 30% | Endpoint responsiveness, rate limiter status |
| Database | 40% | Query latency, connection pool utilization |
| Integrations | 20% | SIEM connectivity, webhook health |
| Security | 10% | Authentication status, certificate validity |
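
The weighted formula can be sketched in a few lines. The weights come from the table above; rounding to one decimal place and the function name are assumptions made for the example.

```python
WEIGHTS = {"api": 0.30, "database": 0.40, "integrations": 0.20, "security": 0.10}


def composite_health_score(scores: dict[str, float]) -> float:
    """Weighted sum of per-component scores, each expected in the range 0-100."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    total = sum(scores[name] * weight for name, weight in WEIGHTS.items())
    return round(total, 1)  # one-decimal rounding is an assumption
```

For illustrative component scores of API 100, Database 90, Integrations 80, and Security 100, the terms are 30 + 36 + 16 + 10, giving a composite score of 92.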

Severity Classification

| Severity | Health Score | Splunk Level |
| --- | --- | --- |
| INFO | ≥ 90 | informational |
| WARNING | ≥ 80, < 90 | warning |
| ERROR | ≥ 60, < 80 | error |
| CRITICAL | < 60 | critical |
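
The bands in the table translate directly into a small classifier. This is a minimal sketch; the function name is hypothetical.

```python
def classify_severity(health_score: float) -> str:
    """Map a composite health score to the severity bands in the table above."""
    if health_score >= 90:
        return "INFO"
    if health_score >= 80:
        return "WARNING"
    if health_score >= 60:
        return "ERROR"
    return "CRITICAL"
```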

Correlation IDs

Every diagnostic operation generates a traceable correlation ID:

Format: diag_{org_id}_{YYYYMMDD}_{HHMMSS}_{uuid4_short}

Example: diag_4_20251204_143052_a1b2c3d4

Use correlation IDs to trace diagnostic events across:

  • Application logs
  • SIEM systems
  • Support tickets
  • Audit reports
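
To make the format concrete, the sketch below generates and parses correlation IDs. It assumes that the timestamp is in UTC and that `uuid4_short` is the first 8 hexadecimal characters of a UUID4, as the example ID suggests; neither detail is confirmed by the format string, and both function names are hypothetical.

```python
import re
import uuid
from datetime import datetime, timezone
from typing import Optional

_CORRELATION_RE = re.compile(
    r"^diag_(?P<org_id>\d+)_(?P<date>\d{8})_(?P<time>\d{6})_(?P<suffix>[0-9a-f]{8})$"
)


def new_correlation_id(org_id: int, now: Optional[datetime] = None) -> str:
    """Build an ID of the form diag_{org_id}_{YYYYMMDD}_{HHMMSS}_{uuid4_short}."""
    now = now or datetime.now(timezone.utc)  # UTC is an assumption
    return f"diag_{org_id}_{now:%Y%m%d}_{now:%H%M%S}_{uuid.uuid4().hex[:8]}"


def parse_correlation_id(value: str) -> dict:
    """Split a correlation ID into organization ID, timestamp, and suffix."""
    match = _CORRELATION_RE.match(value)
    if match is None:
        raise ValueError(f"not a diagnostic correlation ID: {value!r}")
    stamp = datetime.strptime(match["date"] + match["time"], "%Y%m%d%H%M%S")
    return {
        "org_id": int(match["org_id"]),
        "timestamp": stamp.replace(tzinfo=timezone.utc),
        "suffix": match["suffix"],
    }
```

Parsing the example ID `diag_4_20251204_143052_a1b2c3d4` yields organization 4 and a timestamp of 2025-12-04 14:30:52, which is handy when filtering logs by tenant or time window.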

Database Schema

diagnostic_audit_logs

Immutable audit trail for all diagnostic operations.

| Column | Type | Description |
| --- | --- | --- |
| id | Integer | Primary key |
| correlation_id | String(64) | Unique tracing identifier |
| organization_id | Integer | Multi-tenant isolation (FK) |
| diagnostic_type | String(50) | api_health, database_status, integration_test, full_diagnostic, security_scan |
| status | String(20) | success, warning, failure, critical, timeout |
| health_score | Float | Composite score 0-100 |
| severity | String(20) | INFO, WARNING, ERROR, CRITICAL, EMERGENCY |
| results | JSONB | Full diagnostic results with component breakdown |
| component_details | JSONB | Individual component statuses |
| remediation_suggestions | JSONB | Actionable remediation steps |
| initiated_by | Integer | User ID who triggered diagnostic (FK) |
| duration_ms | Integer | Execution duration in milliseconds |
| siem_export_format | String(30) | splunk_cim, datadog_metrics, wiz_json, generic_syslog |
| siem_exported_at | DateTime | When record was exported to SIEM |
| request_context | JSONB | Source IP, user agent, request ID |
| created_at | DateTime | Immutable creation timestamp |

Indexes:

  • ix_diagnostic_audit_org_created - Organization + timestamp queries
  • ix_diagnostic_audit_correlation - Correlation ID lookups
  • ix_diagnostic_audit_severity - Severity-based filtering
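
How immutability is enforced is not described here. One common approach is to reject updates and deletes with database triggers, sketched below. The JSONB columns suggest the real table lives in PostgreSQL; this example swaps in SQLite from the Python standard library purely so it can run anywhere, and it uses a reduced set of columns. The trigger names are hypothetical.

```python
import sqlite3

# Reduced, SQLite-flavoured version of the table; not the production DDL.
SCHEMA = """
CREATE TABLE diagnostic_audit_logs (
    id INTEGER PRIMARY KEY,
    correlation_id TEXT NOT NULL UNIQUE,
    organization_id INTEGER NOT NULL,
    health_score REAL,
    severity TEXT,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX ix_diagnostic_audit_org_created
    ON diagnostic_audit_logs (organization_id, created_at);
CREATE TRIGGER diagnostic_audit_no_update
BEFORE UPDATE ON diagnostic_audit_logs
BEGIN
    SELECT RAISE(ABORT, 'diagnostic_audit_logs is append-only');
END;
CREATE TRIGGER diagnostic_audit_no_delete
BEFORE DELETE ON diagnostic_audit_logs
BEGIN
    SELECT RAISE(ABORT, 'diagnostic_audit_logs is append-only');
END;
"""


def open_audit_db() -> sqlite3.Connection:
    """Create an in-memory database with an append-only audit table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn
```

Inserts succeed as usual, while any `UPDATE` or `DELETE` against the table raises a `sqlite3.DatabaseError`, leaving existing rows untouched.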

diagnostic_thresholds

Organization-specific alerting thresholds.

| Column | Type | Default | Description |
| --- | --- | --- | --- |
| api_response_time_warning_ms | Integer | 500 | Warning threshold for API latency |
| api_response_time_critical_ms | Integer | 2000 | Critical threshold for API latency |
| api_error_rate_warning_pct | Float | 1.0% | Warning threshold for error rate |
| api_error_rate_critical_pct | Float | 5.0% | Critical threshold for error rate |
| db_query_time_warning_ms | Integer | 100 | Warning threshold for DB queries |
| db_query_time_critical_ms | Integer | 500 | Critical threshold for DB queries |
| db_connection_pool_warning_pct | Float | 70% | Warning for pool utilization |
| db_connection_pool_critical_pct | Float | 90% | Critical for pool utilization |
| health_score_warning | Float | 80.0 | Warning threshold for health score |
| health_score_critical | Float | 60.0 | Critical threshold for health score |
| auto_alert_on_critical | Boolean | true | Automatic alerting on critical status |
| alert_cooldown_minutes | Integer | 15 | Cooldown between alerts |
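
A measurement can be checked against a warning/critical pair as sketched below. The defaults mirror the table; treating the thresholds as inclusive and the names `Thresholds` and `evaluate` are assumptions. The sketch covers "higher is worse" metrics such as latency, whereas the health score thresholds work in the opposite direction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Subset of diagnostic_thresholds with the documented defaults."""
    api_response_time_warning_ms: int = 500
    api_response_time_critical_ms: int = 2000
    db_query_time_warning_ms: int = 100
    db_query_time_critical_ms: int = 500


def evaluate(value: float, warning: float, critical: float) -> str:
    """Classify a 'higher is worse' metric as ok, warning, or critical."""
    if value >= critical:  # inclusive comparison is an assumption
        return "critical"
    if value >= warning:
        return "warning"
    return "ok"
```

For example, with the default thresholds an API latency of 750 ms evaluates to `warning`, and a database query time of 500 ms evaluates to `critical`.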

SIEM Integration

Splunk Common Information Model

Export diagnostic records in Splunk CIM format:

{
  "event_id": "diag_4_20251204_143052_a1b2c3d4",
  "timestamp": "2025-12-04T14:30:52Z",
  "source": "owkai_diagnostics",
  "sourcetype": "owkai:diagnostic:full_diagnostic",
  "severity": "info",
  "status": "success",
  "health_score": 98.5,
  "organization_id": 4,
  "duration_ms": 245,
  "component_count": 4,
  "remediation_count": 0,
  "details": { ... }
}
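
The mapping from an audit record to this event shape is not spelled out, but it can be inferred from the field names. The sketch below shows one plausible mapping from the diagnostic_audit_logs columns; the function name and the choice of which column feeds each field are assumptions.

```python
def to_splunk_cim(record: dict) -> dict:
    """Convert a diagnostic_audit_logs row (as a dict) to the event shape above."""
    return {
        "event_id": record["correlation_id"],
        "timestamp": record["created_at"],
        "source": "owkai_diagnostics",
        "sourcetype": f"owkai:diagnostic:{record['diagnostic_type']}",
        "severity": record["severity"].lower(),
        "status": record["status"],
        "health_score": record["health_score"],
        "organization_id": record["organization_id"],
        "duration_ms": record["duration_ms"],
        # Counts are derived from the JSONB columns (assumed mapping).
        "component_count": len(record.get("component_details") or {}),
        "remediation_count": len(record.get("remediation_suggestions") or []),
        "details": record.get("results") or {},
    }
```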

Splunk Query Example:

index=ascend sourcetype="owkai:diagnostic:*"
| where health_score < 80
| stats count by organization_id, severity

Datadog Metrics

Export as Datadog-compatible metric points:

[
  {
    "metric": "owkai.diagnostics.health_score",
    "type": "gauge",
    "points": [[1764858652, 98.5]],
    "tags": ["org_id:4", "diagnostic_type:full_diagnostic", "status:success", "severity:info"]
  },
  {
    "metric": "owkai.diagnostics.duration_ms",
    "type": "gauge",
    "points": [[1764858652, 245]],
    "tags": ["org_id:4", "diagnostic_type:full_diagnostic", "status:success", "severity:info"]
  },
  {
    "metric": "owkai.diagnostics.execution",
    "type": "count",
    "points": [[1764858652, 1]],
    "tags": ["org_id:4", "diagnostic_type:full_diagnostic", "status:success", "severity:info"]
  }
]
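
One diagnostic record fans out into three metric points that share a timestamp and tag set. The sketch below shows how that conversion might look; the function name is hypothetical, and it assumes `created_at` is an ISO 8601 UTC string that becomes the Unix timestamp of each point.

```python
from datetime import datetime


def to_datadog_metrics(record: dict) -> list[dict]:
    """Convert one diagnostic record into the three metric points shown above."""
    # Older Python versions reject a trailing "Z" in fromisoformat(), so normalise it.
    created = datetime.fromisoformat(record["created_at"].replace("Z", "+00:00"))
    ts = int(created.timestamp())
    tags = [
        f"org_id:{record['organization_id']}",
        f"diagnostic_type:{record['diagnostic_type']}",
        f"status:{record['status']}",
        f"severity:{record['severity'].lower()}",
    ]
    series = [
        ("owkai.diagnostics.health_score", "gauge", record["health_score"]),
        ("owkai.diagnostics.duration_ms", "gauge", record["duration_ms"]),
        ("owkai.diagnostics.execution", "count", 1),
    ]
    return [
        {"metric": name, "type": kind, "points": [[ts, value]], "tags": list(tags)}
        for name, kind, value in series
    ]
```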

Datadog Dashboard Query:

avg:owkai.diagnostics.health_score{*} by {org_id}

Wiz.io JSON

Export for Wiz.io security posture management:

{
  "event_id": "diag_4_20251204_143052_a1b2c3d4",
  "platform": "ascend",
  "resource_type": "diagnostic_check",
  "severity": "INFO",
  "health_score": 98.5,
  "findings": [],
  "remediation": []
}

UI Integration

Access System Diagnostics from the Admin Console:

Navigation: Admin Console > Admin Tools > System Diagnostics

Dashboard Features

  1. Run Diagnostic - Execute full health check with audit logging
  2. Health History - View historical health scores and trends
  3. Export to SIEM - Download records in Splunk/Datadog/Wiz format
  4. Threshold Configuration - Customize alerting thresholds per organization

Component Status Display

| Status | Color | Description |
| --- | --- | --- |
| Healthy | Green | Score ≥ 90, all systems operational |
| Degraded | Yellow | Score ≥ 80 and < 90, minor issues detected |
| Unhealthy | Orange | Score ≥ 60 and < 80, significant issues |
| Critical | Red | Score < 60, immediate attention required |

Best Practices

Monitoring Recommendations

  1. Schedule Regular Checks - Run diagnostics hourly for proactive monitoring
  2. Set Appropriate Thresholds - Configure thresholds based on your SLA requirements
  3. Enable SIEM Export - Forward diagnostic logs to your SIEM for centralized monitoring
  4. Review Remediation Suggestions - Act on AI-generated remediation suggestions promptly

Alert Configuration

{
  "api_response_time_warning_ms": 500,
  "api_response_time_critical_ms": 2000,
  "health_score_warning": 80.0,
  "health_score_critical": 60.0,
  "auto_alert_on_critical": true,
  "alert_cooldown_minutes": 15
}
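
The interaction between `auto_alert_on_critical` and `alert_cooldown_minutes` can be sketched as follows. The exact alerting logic is not documented here, so this is one plausible reading: alert only when the score is below the critical threshold, and suppress repeats until the cooldown has elapsed. The function name and its parameters other than the three configuration keys are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def should_alert(
    health_score: float,
    last_alert_at: Optional[datetime],
    now: datetime,
    *,
    health_score_critical: float = 60.0,
    auto_alert_on_critical: bool = True,
    alert_cooldown_minutes: int = 15,
) -> bool:
    """Decide whether a critical-status alert should fire right now."""
    if not auto_alert_on_critical or health_score >= health_score_critical:
        return False
    if last_alert_at is None:
        return True  # no earlier alert, so nothing to cool down from
    return now - last_alert_at >= timedelta(minutes=alert_cooldown_minutes)
```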

Correlation ID Usage

  1. Include correlation ID in support tickets
  2. Use for cross-system log correlation in your SIEM
  3. Reference in incident response documentation
  4. Store in audit reports for compliance

Troubleshooting

Common Issues

| Issue | Cause | Resolution |
| --- | --- | --- |
| 429 Too Many Requests | Rate limit exceeded | Wait for cooldown, check rate limits |
| Health Score Fluctuation | Database connection pool saturation | Review pool configuration |
| SIEM Export Timeout | Large date range | Reduce export range, use pagination |
| Missing Correlation ID | Legacy diagnostic records | Upgrade to latest API version |
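
For the 429 case, a client can retry with exponential backoff instead of failing outright. The sketch below is transport-agnostic: `send` is a hypothetical callable that performs the request and returns a `(status_code, body)` pair, and the delay schedule is an assumption rather than a documented recommendation.

```python
import time
from typing import Callable, Tuple


def call_with_backoff(
    send: Callable[[], Tuple[int, str]],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call `send` until it returns something other than 429 or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = send()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            # Delays grow as base_delay * 1, 2, 4, ... between attempts.
            sleep(base_delay * 2 ** attempt)
    return status, body  # still rate-limited after the final attempt
```

Passing `sleep` as a parameter keeps the helper easy to test, since a test can record the requested delays instead of actually waiting.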

Debug Mode

Enable detailed logging for troubleshooting:

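import logging
# Log diagnostic execution details ("diagnostics" is a placeholder logger name)
logger = logging.getLogger("diagnostics")
logger.setLevel(logging.DEBUG)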
