Status Reporting

Citadel uses a hybrid push model for node telemetry: HTTP for reliable delivery to the AceTeam API, Redis Pub/Sub for real-time dashboard updates, and Redis Streams for reliable processing by backend workers. This page covers the collection, publishing, and consumption of node status data.

Architecture

Why three channels?

| Channel | Guarantee | Latency | Consumer | Purpose |
|---|---|---|---|---|
| HTTP API | Reliable (retries) | ~100ms | AceTeam API | Persistent storage, historical data |
| Redis Pub/Sub | Fire-and-forget | ~1ms | Web dashboard | Real-time UI updates |
| Redis Streams | At-least-once | ~1ms | Python worker | Device config flow, alerting |

The HTTP channel ensures status data is persisted even if Redis is temporarily unavailable. Redis Pub/Sub provides the low-latency path for live dashboards. Redis Streams provides reliable delivery for backend processing that cannot tolerate missed messages (like the device configuration flow).

What Gets Collected

The status collector in internal/status/ gathers system metrics using a combination of gopsutil, nvidia-smi, and Docker APIs.

| Category | Metrics | Source |
|---|---|---|
| CPU | Usage percentage, core count, model | gopsutil |
| Memory | Total, used, available, percentage | gopsutil |
| Disk | Total, used, available for key mount points | gopsutil |
| GPU | Model, VRAM total/used, temperature, utilization, driver version | nvidia-smi CSV output |
| Network | Mesh connection status, assigned IP | internal/network |
| Services | Running/stopped status for each configured service | docker compose ps |
| System | Hostname, OS, architecture, uptime | runtime + gopsutil |

GPU detection follows a platform-specific fallback chain:

  • Linux: nvidia-smi --query-gpu=... --format=csv (primary), /proc/driver/nvidia/gpus/ (fallback)
  • macOS: system_profiler SPDisplaysDataType for Metal GPU info
  • Windows: nvidia-smi.exe from standard path, PATH lookup, then WMI query fallback

Redis Key Patterns

| Key Pattern | Type | TTL | Purpose |
|---|---|---|---|
| node:status:{nodeId} | Pub/Sub channel | N/A | Real-time status updates for dashboard subscriptions |
| node:status:stream | Stream | Trimmed by length | Reliable status processing by Python worker |
| jobs:v1:config | Stream | Trimmed by length | Config jobs pushed by Python worker back to Citadel |

The nodeId in Pub/Sub channels corresponds to the node's mesh network identity, ensuring status updates are routable to the correct dashboard view.

HTTP Status Server

When running in worker mode (citadel work), Citadel exposes local HTTP endpoints for health checking and monitoring integration.

| Endpoint | Method | Response | Purpose |
|---|---|---|---|
| /health | GET | 200 OK or 503 | Liveness probe (Kubernetes, load balancers) |
| /ping | GET | 200 pong | Simple connectivity test |
| /status | GET | JSON status payload | Full node status (same data as published to Redis) |

These endpoints are bound to the mesh network interface, not the public network. Only nodes on the same AceTeam Network mesh can reach them.

The AceTeam web dashboard is the primary consumer of this status data, displaying real-time node health, GPU utilization, and service status across the fleet. The TUI-based status display (citadel status) provides a local view of the same data. A future enhancement could expose the TUI as a web-based terminal UI, giving operators a rich status view directly in their browser without SSH access.

Device Configuration Flow

The status reporting system enables a zero-touch provisioning flow where users configure nodes from the web UI without further CLI interaction.

Flow breakdown:

  1. User runs citadel init with device authorization. The CLI publishes its initial status to Redis Streams, including the deviceCode from the auth flow.
  2. The Python worker consumes the status message, correlates the deviceCode with the user's onboarding wizard session, and retrieves the desired configuration.
  3. The user completes the configuration wizard in the web UI (selecting services, naming the node, setting monitoring preferences).
  4. The Python worker pushes an APPLY_DEVICE_CONFIG job to the config stream.
  5. Citadel picks up the config job, updates its local citadel.yaml manifest, starts the configured services, and publishes a confirmation status.

This entire flow happens without the user needing to return to the CLI after the initial citadel init.

Status Publisher Implementation

The internal/heartbeat/ package provides two publisher implementations:

HTTP Publisher (heartbeat/http.go):

  • Posts JSON status to the AceTeam API every 30 seconds.
  • Includes retry logic with exponential backoff on transient failures.
  • Falls back gracefully if the API is unreachable (logs warning, continues operating).

Redis Publisher (heartbeat/redis.go):

  • Publishes to both Pub/Sub and Streams in a single tick.
  • Pub/Sub messages are fire-and-forget (no persistence, no retry).
  • Streams messages use XADD with MAXLEN ~1000 to prevent unbounded growth.
  • Includes the deviceCode field when available (populated from CITADEL_DEVICE_CODE environment variable or the device auth flow).

Both publishers run as background goroutines and are started by citadel work and citadel up.

Direction: Agentic Execution Protocol (AEP). The Python worker that processes status streams is evolving toward a protocol-based system -- the Agentic Execution Protocol -- which abstracts worker interactions into a vendor-neutral protocol for cross-organization agent invocation. This would allow any compliant worker (not just the current Python implementation) to consume status events and trigger actions. See the Agentic Execution Protocol design documentation (in progress).

Monitoring Integration

The HTTP endpoints and status data format are designed for integration with standard monitoring tools:

  • Kubernetes: Use /health as a liveness probe and /ping as a readiness probe.
  • Prometheus: The /status endpoint returns structured JSON that can be scraped by a custom exporter.
  • Alerting: The Python worker processes status stream messages and can trigger alerts based on a node going offline, GPU temperature thresholds, or service failures.