Observability

Orion provides structured logging, Prometheus metrics, distributed tracing, and health monitoring out of the box. No sidecars, no agents. Everything runs inside the single binary.

Structured Logging

Orion emits structured logs in JSON or pretty-printed format, configurable at runtime:

[logging]
level = "info"        # trace, debug, info, warn, error
format = "pretty"     # pretty or json

JSON format is recommended for production. It integrates directly with log aggregators like Loki, Datadog, or CloudWatch:

ORION_LOGGING__FORMAT=json
ORION_LOGGING__LEVEL=info

Per-crate filtering with RUST_LOG gives fine-grained control:

RUST_LOG=orion=debug,tower_http=warn,sqlx=warn

Level   Usage
error   Failures that need attention
warn    Degraded behavior (circuit breakers, retries)
info    Request lifecycle, engine reloads, startup/shutdown
debug   Detailed processing, SQL queries, connector calls
trace   Fine-grained internal state

Every request carries an x-request-id header containing a UUID. Pass your own or let Orion generate one. The ID propagates through logs and responses for end-to-end correlation.
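A minimal sketch of supplying your own request ID, using Python's standard library. The listen address localhost:8080 is an assumption; substitute your deployment's address.

```python
import uuid
from urllib import request

# Supply your own request ID; Orion tags logs and the response with it.
# localhost:8080 is an assumed listen address.
rid = str(uuid.uuid4())
req = request.Request("http://localhost:8080/health",
                      headers={"x-request-id": rid})

# To actually send it:
# resp = request.urlopen(req)
# resp.headers["x-request-id"] should then echo back the same ID.
print(req.headers["X-request-id"] == rid)
```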

Prometheus Metrics

Enable metrics and scrape at GET /metrics (Prometheus text format):

[metrics]
enabled = true

Metric                              Type       Labels                Description
messages_total                      Counter    channel, status       Total messages processed
message_duration_seconds            Histogram  channel               Processing latency
active_workflows                    Gauge      —                     Workflows loaded in engine
errors_total                        Counter    type                  Errors encountered
http_requests_total                 Counter    method, path, status  HTTP requests served
http_request_duration_seconds       Histogram  method, path, status  HTTP request latency
db_query_duration_seconds           Histogram  operation             Database query latency
engine_reloads_total                Counter    status                Engine reload events
engine_reload_duration_seconds      Histogram  —                     Engine reload latency
circuit_breaker_trips_total         Counter    connector, channel    Circuit breaker trip events
circuit_breaker_rejections_total    Counter    connector, channel    Requests rejected by open breakers
channel_executions_total            Counter    channel               Channel invocations
rate_limit_rejections_total         Counter    client                Rate-limited requests
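A minimal Prometheus scrape configuration for the endpoint above. The job name and target address are assumptions; point the target at wherever Orion listens.

```yaml
# Example Prometheus scrape config (illustrative target address).
scrape_configs:
  - job_name: orion
    static_configs:
      - targets: ["localhost:8080"]
```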

Distributed Tracing

Enable OpenTelemetry trace export with OTLP gRPC:

[tracing]
enabled = true
otlp_endpoint = "http://localhost:4317"
service_name = "orion"
sample_rate = 1.0    # 0.0 (none) to 1.0 (all)
  • W3C Trace Context extraction and propagation: incoming traceparent headers are respected
  • Per-request spans with channel, workflow, and task attributes
  • OTLP gRPC export to Jaeger, Tempo, or any compatible collector
  • Configurable sampling rate for production use
  • Trace context injected into outbound http_call requests for full distributed traces
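The propagation behavior above can be sketched against the W3C Trace Context format. The trace-id and span-id values here are illustrative; in practice the new span-id is freshly generated.

```python
# W3C traceparent: version "-" trace-id(32 hex) "-" parent-id(16 hex) "-" flags(2 hex)
incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
version, trace_id, span_id, flags = incoming.split("-")
assert len(trace_id) == 32 and len(span_id) == 16

# Outbound http_call requests reuse the trace-id with a new span-id,
# so every hop in the request chain shares one trace.
new_span_id = "b7ad6b7169203331"  # would be freshly generated in practice
outgoing = f"{version}-{trace_id}-{new_span_id}-{flags}"
print(outgoing)
```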

Health Monitoring

Orion exposes three health endpoints for different operational needs.

Component health: GET /health returns component-level status with automatic degradation detection:

{
  "status": "ok",
  "version": "0.1.0",
  "uptime_seconds": 3600,
  "workflows_loaded": 42,
  "components": {
    "database": "ok",
    "engine": "ok"
  }
}

The health check tests the database with SELECT 1 and verifies engine availability with a configurable lock timeout. If either check fails, the endpoint returns 503 Service Unavailable with "status": "degraded".
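The degradation contract can be sketched as follows. The payload here is a hypothetical failure case, not real output:

```python
import json

# Hypothetical /health payload with one failing component (values illustrative).
payload = json.loads("""{
  "status": "degraded",
  "components": {"database": "error", "engine": "ok"}
}""")

# Mirrors the endpoint's contract: any non-"ok" component
# yields 503 Service Unavailable with "status": "degraded".
degraded = any(v != "ok" for v in payload["components"].values())
status_code = 503 if degraded else 200
print(status_code, payload["status"])
```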

Kubernetes probes:

Endpoint      Purpose          Behavior
GET /healthz  Liveness probe   Always returns 200; if the process is running, it's alive
GET /readyz   Readiness probe  Returns 200 only when DB is reachable, engine is loaded, and startup is complete; 503 otherwise

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /readyz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5

Engine status: GET /api/v1/admin/engine/status returns a detailed breakdown:

{
  "version": "0.1.0",
  "uptime_seconds": 3600,
  "workflows_count": 42,
  "active_workflows": 38,
  "channels": ["orders", "events", "alerts"]
}
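One practical use of this payload: the gap between loaded and active workflows is a quick signal that some workflows are disabled. A small sketch using the example values above:

```python
# Values taken from the docs example payload above.
status = {"workflows_count": 42, "active_workflows": 38,
          "channels": ["orders", "events", "alerts"]}

# Loaded-but-inactive workflows worth investigating.
inactive = status["workflows_count"] - status["active_workflows"]
print(inactive)  # 4
```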