21 — Observability (OpenTelemetry)

Tier 3.6 — DONE

Forge emits traces and metrics via the OpenTelemetry SDK so operators can plug it into any OTLP-compatible backend (Grafana Tempo + Prometheus, Jaeger, Datadog, Honeycomb, New Relic, ...). There is zero vendor coupling: the platform talks only to an OTel Collector, and the Collector fans out.

This feature is additive and fully gated by OTEL_ENABLED. When it is disabled, the SDK is never imported, so the only runtime cost is the flag check itself.


Architecture

              forge-web (Django)            forge-task (Celery)
                   │                               │
                   └──────── OTLP gRPC ────────────┘
                              (4317)
                                │
                                ▼
                      forge-otel-collector
                  (otel/opentelemetry-collector-contrib)
                                │
                   ┌────────────┼────────────┐
                   ▼            ▼            ▼
               Grafana       Jaeger       Datadog  ...
               Tempo /       (traces)     (OTLP)
               Prometheus

Bootstrap flow

Entry point: forge.main.observability.init_observability().

  1. Short-circuits immediately if OTEL_ENABLED is false (no SDK import).
  2. Lazily imports the OTel SDK + exporter + instrumentation packages.
  3. Builds a Resource from OTEL_RESOURCE_ATTRIBUTES (plus service.name).
  4. Picks a sampler from OTEL_TRACES_SAMPLER (always_on | always_off | traceidratio | parentbased_traceidratio). The default is parentbased_traceidratio, with the ratio taken from OTEL_TRACES_SAMPLER_ARG.
  5. Installs a TracerProvider with a BatchSpanProcessor + OTLP gRPC exporter pointed at OTEL_EXPORTER_ENDPOINT.
  6. Installs a MeterProvider with a PeriodicExportingMetricReader + OTLP metric exporter.
  7. Registers auto-instrumentations for Django, Celery, Requests, Psycopg2.
  8. Catches and logs every failure raised by the steps above, because a misconfigured Collector must not prevent Django from booting.

init_observability() is called from forge/asgi.py, forge/wsgi.py, and the Celery worker boot hook (forge/settings/defaults/celery_conf.py).

Environment variables / settings registry

All keys are registered in forge/main/conf.py under the System category. Environment variables take precedence on first boot; later changes can be made in Settings → System.

Key                       Default                           Meaning
------------------------  --------------------------------  -------------------------------
OTEL_ENABLED              false                             Master switch
OTEL_EXPORTER_ENDPOINT    http://forge-otel-collector:4317  OTLP gRPC endpoint
OTEL_SERVICE_NAME         forge                             service.name resource attribute
OTEL_RESOURCE_ATTRIBUTES  ""                                Comma-separated k=v pairs
OTEL_TRACES_SAMPLER       parentbased_traceidratio          Standard OTel sampler names
OTEL_TRACES_SAMPLER_ARG   0.1                               Ratio in [0,1] (validated)
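
To make the defaults and validation rules concrete, here is a small sketch of how these keys might be read and checked. It is not the registry code in forge/main/conf.py: the OtelSettings dataclass and the function names are hypothetical, and the sketch only reads a mapping such as os.environ, ignoring values later changed in Settings → System.

```python
import os
from dataclasses import dataclass
from typing import Dict, Mapping

VALID_SAMPLERS = ("always_on", "always_off", "traceidratio", "parentbased_traceidratio")


def parse_resource_attributes(raw: str) -> Dict[str, str]:
    """Parse comma-separated k=v pairs, skipping empty or malformed entries."""
    attrs: Dict[str, str] = {}
    for pair in raw.split(","):
        key, sep, value = pair.partition("=")
        if sep and key.strip():
            attrs[key.strip()] = value.strip()
    return attrs


def validate_sampler_arg(raw: str) -> float:
    """Return the ratio as a float, or raise ValueError if it is not in [0, 1]."""
    ratio = float(raw)  # raises ValueError for non-numeric input
    if not 0.0 <= ratio <= 1.0:  # also rejects NaN and infinities
        raise ValueError(f"OTEL_TRACES_SAMPLER_ARG must be in [0, 1], got {raw!r}")
    return ratio


@dataclass(frozen=True)
class OtelSettings:  # hypothetical container, for illustration only
    enabled: bool
    exporter_endpoint: str
    service_name: str
    resource_attributes: Dict[str, str]
    sampler: str
    sampler_arg: float


def load_otel_settings(env: Mapping[str, str] = os.environ) -> OtelSettings:
    """Build settings from a mapping, applying the defaults from the table."""
    sampler = env.get("OTEL_TRACES_SAMPLER", "parentbased_traceidratio")
    if sampler not in VALID_SAMPLERS:
        raise ValueError(f"unknown sampler {sampler!r}")
    return OtelSettings(
        enabled=env.get("OTEL_ENABLED", "false").strip().lower() in {"1", "true", "yes", "on"},
        exporter_endpoint=env.get("OTEL_EXPORTER_ENDPOINT", "http://forge-otel-collector:4317"),
        service_name=env.get("OTEL_SERVICE_NAME", "forge"),
        resource_attributes=parse_resource_attributes(env.get("OTEL_RESOURCE_ATTRIBUTES", "")),
        sampler=sampler,
        sampler_arg=validate_sampler_arg(env.get("OTEL_TRACES_SAMPLER_ARG", "0.1")),
    )
```

Passing the mapping as a parameter keeps the function easy to exercise with a plain dict, for example load_otel_settings({"OTEL_ENABLED": "true"}).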

Span instrumentation seams

Manual root/child spans wrap high-value code paths; see forge/main/observability/tracing.py for the wrapped paths.

Everything else (HTTP views, DB queries, outgoing requests, Celery tasks) is covered by the auto-instrumentations.
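
As an illustration of what a manual span seam could look like, the sketch below wraps a code path in a span only when OTEL_ENABLED is set. It is not the code in tracing.py: the traced helper, the launch_job function, and the span and attribute names are all hypothetical. The only OTel call used is the tracer's start_as_current_span from the opentelemetry-api package.

```python
import os
from contextlib import contextmanager
from typing import Any, Dict, Iterator, Optional


def _otel_enabled() -> bool:
    """Hypothetical helper: read the master switch from the environment."""
    return os.environ.get("OTEL_ENABLED", "false").strip().lower() in {"1", "true", "yes", "on"}


@contextmanager
def traced(name: str, attributes: Optional[Dict[str, Any]] = None) -> Iterator[Optional[Any]]:
    """Wrap a block in a span when OTel is enabled; otherwise yield None."""
    if not _otel_enabled():
        yield None  # disabled: no OTel import and no span object
        return
    from opentelemetry import trace  # lazy import, only on the enabled path

    tracer = trace.get_tracer("forge")
    with tracer.start_as_current_span(name, attributes=attributes or {}) as span:
        yield span


def launch_job(job_id: int) -> str:
    """Hypothetical high-value code path wrapped in a manual span."""
    with traced("forge.job.launch", {"forge.job.id": job_id}) as span:
        status = "pending"
        if span is not None:
            span.set_attribute("forge.job.status", status)
        return status
```

Callers have to tolerate span being None, which keeps the disabled path free of any OTel import.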

Metric handles

Exposed by forge/main/observability/metrics.py:

Metric                          Type       Labels                 Emitted from
------------------------------  ---------  ---------------------  ----------------------
forge_jobs_launched_total       counter    status, template_type  launch hook
forge_jobs_blocked_total        counter    gate (policy|scanner)  launch hook
forge_job_duration_seconds      histogram  -                      UnifiedJob finish hook
forge_policy_evaluations_total  counter    decision               policy evaluator
forge_scan_runs_total           counter    status                 scanner runner
forge_active_jobs               gauge      -                      Celery beat every 30 s

All handles are cheap no-ops when OTEL_ENABLED=false.

REST API

GET /api/v2/observability/

Admin-only. Returns the current configuration plus the result of a best-effort TCP probe against the Collector endpoint (500 ms timeout, result cached for 30 s).

{
  "enabled": true,
  "service_name": "forge",
  "exporter_endpoint": "http://forge-otel-collector:4317",
  "sampler": "parentbased_traceidratio",
  "sampler_arg": "0.1",
  "collector_healthy": true,
  "collector_last_check": "2026-04-09T15:42:00Z"
}

Verification

Future work