Kubernetes Deployment
Forge Platform deploys natively on Kubernetes via the official Helm chart at forgeplatform/forge-helm, with an optional operator at forgeplatform/forge-operator (v1.0.0) that lets you manage nine Forge resource kinds — organizations, teams, projects, inventories, credentials, job templates, schedules, workflows (DAG), and remote Forge instances — declaratively as Kubernetes Custom Resources, with support for routing different CRs to different Forge backends.
This page is the long-form companion to the Docker Compose Deployment guide. If you already run a Kubernetes cluster, the Helm path is the recommended deployment for production: it manages secrets natively, supports rolling upgrades, and integrates with Ingress controllers, External Secrets, and the wider ecosystem.
When to Pick Kubernetes
| | Docker Compose | Kubernetes (Helm + Operator) |
|---|---|---|
| Best for | Single-host deployments, evaluation, small teams | Multi-node clusters, GitOps, declarative job-template management |
| HA | No (single node) | Yes (multiple replicas, multi-master cluster) |
| Storage | Bind mounts | PVCs (any CSI driver) |
| Secrets | .env file | k8s Secrets / External Secrets / Sealed Secrets |
| Ingress | nginx + Let’s Encrypt scripts | Any IngressController + cert-manager |
| Job templates | Created via UI / API only | UI / API or kubectl apply -f jt.yaml |
| Upgrade | docker compose pull && up -d | helm upgrade (rolling) |
Architecture on Kubernetes
The chart deploys six application workloads plus the Forge backend itself. Every workload runs as its own Deployment (or StatefulSet for stateful data), and is fronted by a ClusterIP Service for in-cluster traffic. Ingress traffic enters through a single Ingress resource that fans out by URL path.
                  ┌──────────────────┐
                  │ Ingress (TLS)    │ forge.lan
                  │ Traefik / NGINX  │
                  └─────────┬────────┘
        ┌───────────────────┼─────────────────────┐
        │ /api/admin/static/sso/websocket         │ /
        ▼                                         ▼
┌───────────────┐                         ┌────────────────┐
│ forge-web     │ Django + uwsgi + daphne │ forge-frontend │ React SPA
│ (Deployment)  │ ports: 8013 8015        │ (Deployment)   │ port: 80
└──┬──────┬──┬──┘                         └────────────────┘
   │      │  │
   │      │  └──── OPA (sidecar via service forge-opa:8181) ──── policy decisions
   │      │
   │      └──── OTel Collector (forge-otel-collector:4317) ──── traces / metrics
   │
   ├──── Postgres (StatefulSet, forge-postgres:5432, PVC 8 Gi)
   │
   ├──── Redis (Deployment, forge-redis:6379, PVC 2 Gi)
   │
   └──── forge-task (Deployment, no Service)
           ├── ansible-runner + podman in privileged container
           ├── Receptor mesh (control socket /var/run/awx-receptor/receptor.sock)
           └── shared PVC forge-projects (mounted in both web + task)
Two extra resources run as one-shots / CRDs:
- forge-init Job: runs once per release revision. It applies migrations, creates the admin user, provisions the instance, registers the controlplane + default queues, seeds preload data, and writes CSRF_TRUSTED_ORIGINS into the DB. Its name is suffixed with {{ .Release.Revision }} so each upgrade gets a fresh Job (the Job spec is immutable, so re-applying the same name would fail).
- forge-operator: runs in its own namespace, watches Organization / Team / Project / Inventory / Credential / JobTemplate / Schedule / Workflow / ForgeInstance CRs cluster-wide, and reconciles each against the Forge REST API. It authenticates with an OAuth2 personal access token issued by Forge; the optional ForgeInstance CR lets one operator deployment fan out to multiple Forge backends.
Namespaces
The chart and operator each install into their own namespace. This separation lets you grant the operator only the permissions it needs and lets you upgrade Forge without touching the operator (or vice versa).
| Namespace | Contents | Created by |
|---|---|---|
| forge | Postgres, Redis, OPA, OTel Collector, forge-web, forge-task, forge-frontend, forge-init Job, Ingress, all chart Secrets and ConfigMaps | helm install forge ./forge-helm -n forge --create-namespace (or pre-created if you need to seed pull-secrets first) |
| forge-operator | The operator Deployment, its ServiceAccount + ClusterRole + ClusterRoleBinding, and one Secret holding the Forge OAuth2 token | helm install forge-operator ./forge-operator/helm -n forge-operator --create-namespace |
| any | The nine CRD instances (Organization, Team, Project, Inventory, Credential, JobTemplate, Schedule, Workflow, ForgeInstance). Each CR is namespace-scoped; the operator watches all namespaces and reconciles them centrally against Forge. | You — kubectl apply -f cr.yaml |
The chart’s top-level namespace.create value defaults to false because pre-creating the namespace is the common path: you usually want to kubectl create it first so you can drop in the Harbor pull-secret and a TLS secret before the Pods start trying to mount them.
Networking
Services and ports
Every workload that other workloads (or the Ingress) need to reach gets a ClusterIP Service. forge-task is the only one without a Service — nothing reaches into it; it pulls work off Redis and pushes it through Receptor.
| Service | Ports | Selector | Reached by |
|---|---|---|---|
| forge-web | 8013 (HTTP API), 8015 (websocket) | component=web | Ingress, forge-task callback, forge-operator |
| forge-frontend | 80 (HTTP) | component=frontend | Ingress (path /) |
| forge-postgres | 5432 | component=postgres | forge-web, forge-task, forge-init |
| forge-redis | 6379 | component=redis | forge-web, forge-task |
| forge-opa | 8181 (REST policy decision API) | component=opa | forge-web (policy checks) |
| forge-otel-collector | 4317 (gRPC), 4318 (HTTP) | component=otel-collector | forge-web, forge-task (traces + metrics) |
Cluster DNS
Workloads reach each other through the cluster DNS at <service>.<namespace>.svc.cluster.local. Inside the forge namespace the bare service name resolves too, so forge.otel.endpoint defaults to http://forge-otel-collector:4317. The operator reaches the API at the FQDN http://forge-web.forge.svc.cluster.local:8013 because it lives in a different namespace.
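The naming rule above can be sketched as a small shell helper. The function name svc_fqdn is hypothetical and introduced only for illustration, and the cluster.local suffix assumes the default cluster domain, which some clusters override.

```shell
# Build the in-cluster DNS name for a Service.
# Assumption: the cluster uses the default domain "cluster.local".
svc_fqdn() {
  # $1 = service name, $2 = namespace
  printf '%s.%s.svc.cluster.local\n' "$1" "$2"
}

# URL the operator uses to reach the API from another namespace.
operator_url="http://$(svc_fqdn forge-web forge):8013"
echo "$operator_url"
```

From inside the forge namespace the short form (http://forge-web:8013) resolves as well, so the helper is only needed for cross-namespace callers such as the operator.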
Ingress and URL routing
A single Ingress resource named forge routes forge.lan by URL path. The chart’s default uses Traefik (the chart’s ingress.className: traefik), but any IngressController works — change className to nginx, contour, etc. Path order matters because Forge owns five distinct path prefixes that must take precedence over the SPA catch-all.
| Path | Target Service | Why |
|---|---|---|
| /api | forge-web:8013 | REST API (/api/v2/*) |
| /admin | forge-web:8013 | Django admin |
| /sso | forge-web:8013 | SAML / OIDC / social-auth callbacks |
| /websocket | forge-web:8015 | Job event streaming over WS |
| /static/forge | forge-frontend:80 | SPA assets; must come before /static below |
| /static | forge-web:8013 | Django staticfiles (admin CSS, DRF browser) |
| / | forge-frontend:80 | SPA catch-all |
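To see why the order of the rows matters, the following sketch walks the table top to bottom and returns the first prefix that matches. It is an illustration of first-match routing only, not the matching code of Traefik or any other IngressController, and the function name route_for is hypothetical.

```shell
# Return the backend for a request path by first-match over the route table.
# Prefixes match on whole path segments: /api matches /api and /api/v2,
# but not /apiary.
route_for() {
  req_path=$1
  while read -r prefix target; do
    case "$req_path" in
      "$prefix"|"$prefix"/*) echo "$target"; return 0 ;;
    esac
    # "/" is the catch-all and matches every remaining path.
    if [ "$prefix" = "/" ]; then echo "$target"; return 0; fi
  done <<'EOF'
/api forge-web:8013
/admin forge-web:8013
/sso forge-web:8013
/websocket forge-web:8015
/static/forge forge-frontend:80
/static forge-web:8013
/ forge-frontend:80
EOF
  return 1
}

route_for /static/forge/main.js    # SPA asset, served by the frontend
route_for /static/admin/css/base.css   # Django staticfiles, served by forge-web
```

If the /static row were listed before /static/forge in this model, every SPA asset request would land on forge-web instead of the frontend.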
Hostname TLD pitfall (.local vs .lan)
The chart’s ingress.host default is forge.lan, not forge.local. The reason: most desktop Linux distros ship nsswitch.conf with mdns_minimal [NOTFOUND=return] ahead of files, which intercepts every .local lookup, fails it, and bypasses /etc/hosts entirely. The browser then shows Server Not Found even though curl --resolve forge.local:30080:<node-ip> ... works fine. The .lan TLD has no such hijack.
For host-based access from a developer laptop, add to /etc/hosts:
192.168.56.32 forge.lan
Exposing the cluster
How Ingress traffic actually reaches your nodes depends on the IngressController service type:
- NodePort (test clusters): Traefik’s service type defaults to NodePort on 30080 (HTTP) and 30443 (HTTPS). The browser hits http://forge.lan:30080/ and kube-proxy forwards the request from any node to the Traefik pod.
- LoadBalancer (cloud / MetalLB): set the IngressController service type to LoadBalancer and use the assigned external IP / DNS name.
- hostNetwork (bare metal, single-node): runs the IngressController on the host network so 80/443 are reachable directly.
CNI quirk on VirtualBox dev clusters
If you build a multi-VM cluster on VirtualBox host-only networks, Flannel’s default --iface auto-detect picks eth0 (the NAT adapter). Pod-to-pod traffic across nodes then dies, with symptoms like connect: no route to host from any cross-node service VIP. Patch the Flannel DaemonSet to bind to the host-only interface explicitly:
kubectl -n kube-flannel patch ds kube-flannel-ds --type=json \
-p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--iface=eth1"}]'
The forge-dev-cluster Vagrant repo bakes this patch into master-init.sh.
Storage
Five PVCs are created. Their sizes can be overridden in values; start with the defaults and grow the underlying storage when you outgrow them.
| PVC | Default size | Mode | Mounted in | Purpose |
|---|---|---|---|---|
| forge-postgres | 8 Gi | RWO | StatefulSet forge-postgres | Database files |
| forge-redis | 2 Gi | RWO | Deployment forge-redis | RDB snapshots / AOF |
| forge-projects | 4 Gi | RWX | forge-web + forge-task | Synced project repos (_<id>__<slug>) |
| forge-receptor | 2 Gi | RWO | forge-task | Receptor work units (/tmp/receptor) |
| forge-backups | 10 Gi | RWO | forge-web (mount-only) | Where backup.sh writes .sql.gz archives |
RWX requirement: forge-projects is mounted in both forge-web and forge-task. If those Pods land on different nodes, RWX is mandatory. If you’re forced to ReadWriteOnce (e.g. local-path-provisioner on bare metal), pin both Deployments to the same node with a nodeSelector or podAffinity; otherwise the second Pod will sit Pending forever.
Override the StorageClass per-PVC with values like postgres.storage.storageClass: ceph-rbd. Leaving it empty falls back to the cluster default StorageClass.
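As a minimal sketch, the override can be kept in a small values file instead of a long --set string. Only the postgres.storage.* keys named on this page are used; the file name storage-overrides.yaml and the function name install_with_storage_overrides are assumptions made for this example.

```shell
# Write an overrides file. Helm maps the dotted path
# postgres.storage.storageClass onto this nested YAML structure.
cat > storage-overrides.yaml <<'EOF'
postgres:
  storage:
    size: 8Gi               # chart default, repeated here for clarity
    storageClass: ceph-rbd  # leave empty to use the cluster default
EOF

# Pass the file at install time; extra flags (for example the --set
# secrets.* options from the install step) are forwarded unchanged.
install_with_storage_overrides() {
  helm install forge ./forge-helm -n forge -f storage-overrides.yaml "$@"
}
```

The file is applied at install time because Kubernetes does not allow the StorageClass of an existing PVC to be changed afterwards.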
Prerequisites
- Kubernetes 1.27+ (tested on 1.30)
- An IngressController (Traefik / NGINX / Contour). forge-dev-cluster’s post-cluster-setup.sh installs Traefik via Helm
- A default StorageClass (or per-PVC overrides); local-path-provisioner is fine for dev
- helm 3.12+
- Optional but recommended: cert-manager (real TLS) or a hand-rolled kubernetes.io/tls Secret named forge-tls
Container images
Forge images are published to the public GitHub Container Registry: ghcr.io/forgeplatform/*. No pull secret is required.
kubectl create namespace forge
If you mirror the images to your own registry, override images.backend.repository + images.frontend.repository in values.yaml and add an imagePullSecrets entry pointing at your in-cluster docker-registry Secret.
TLS secret
Pre-create a TLS Secret if you want HTTPS on first boot. Self-signed is fine for dev:
openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
-keyout tls.key -out tls.crt -subj '/CN=forge.lan' \
-addext 'subjectAltName=DNS:forge.lan,DNS:*.forge.lan'
kubectl -n forge create secret tls forge-tls --cert=tls.crt --key=tls.key
Install Forge core
git clone git@github.com:forgeplatform/forge-helm.git
cd forge-helm
helm install forge . -n forge \
--set forge.admin.user=admin \
--set secrets.forgeAdminPassword='<strong-password>' \
--set secrets.postgresPassword='<random-32-bytes>' \
--set secrets.forgeSecretKey="$(openssl rand -hex 32)" \
--set secrets.forgeBroadcastWebsocketSecret="$(openssl rand -hex 32)"
The install runs synchronously through the forge-init Job. Watch progress with:
kubectl -n forge get pods -w
Expected end state — every Pod Running 1/1, forge-init-1 Completed:
NAME READY STATUS RESTARTS AGE
forge-frontend-7f8c5d9c8b-abcde 1/1 Running 0 3m
forge-init-1-xxxxx 0/1 Completed 0 3m
forge-opa-6cdfb9d79c-fghij 1/1 Running 0 3m
forge-otel-collector-... 1/1 Running 0 3m
forge-postgres-0 1/1 Running 0 3m
forge-redis-... 1/1 Running 0 3m
forge-task-... 1/1 Running 0 3m
forge-web-... 1/1 Running 0 3m
Optional: AI Assistant
Chart 0.3.0 introduces an optional forge-assistant deployment that wraps the all-in-one image (Ollama + ChromaDB + RAG API in one container). It is disabled by default — enable it on a fresh install or on an upgrade:
helm upgrade forge . -n forge \
--reuse-values \
--set assistant.enabled=true
The chart provisions a PersistentVolumeClaim (forge-assistant-data, default 20 Gi) so the LLM model cache and Chroma vector store survive pod restarts. On first boot the entrypoint pulls the configured model (gemma3:1b by default, ≈800 MB) and the embedding model (nomic-embed-text), then indexes the bundled documentation — allow ~5 minutes before the pod is Ready. The startupProbe is sized for this (failureThreshold: 30 at periodSeconds: 10) so the liveness probe will not kill the pod mid-bootstrap.
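The probe sizing quoted above works out as follows. This is plain arithmetic on the two numbers from this page and ignores any initial delay the chart may also configure.

```shell
# Startup budget = failureThreshold x periodSeconds.
failure_threshold=30
period_seconds=10
budget_seconds=$((failure_threshold * period_seconds))
echo "startup budget: ${budget_seconds}s ($((budget_seconds / 60)) min)"
```

The result is 300 seconds, which matches the roughly 5 minutes the first boot needs. On a slow link, where pulling the models takes longer, the budget would have to grow accordingly.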
Key values you may want to override:
| Value | Default | Notes |
|---|---|---|
| assistant.model | gemma3:1b | Switch to llama3.1:8b or mistral:7b for higher answer quality at the cost of memory and latency. |
| assistant.storage.size | 20Gi | Bump to 30 Gi+ if you change models or grow the indexed corpus. |
| assistant.resources.limits.memory | 4Gi | gemma3:1b inference peaks ~1.5 Gi; larger models need 8 Gi+. |
The Service is reachable cluster-internally at forge-assistant.forge.svc.cluster.local:8100. Ingress routing is intentionally not wired up by this release — expose it with a follow-up middleware (Traefik stripPrefix on /assistant) once you decide on a URL contract.
Install the Operator
The operator is shipped separately because not everyone wants declarative-via-CRD management — some teams prefer the UI. Install only after Forge core is healthy (the operator needs to authenticate to it on startup).
Generate a Forge OAuth2 PAT
The operator authenticates with a personal access token issued by Forge. Generate one inside the running web Pod:
TOKEN=$(kubectl -n forge exec deploy/forge-web -- \
forge-manage create_oauth2_token --user admin | tail -1)
echo "$TOKEN"
The token is shown once, so copy it now. To revoke it later, run forge-manage revoke_oauth2_tokens --user admin.
Install with Helm
git clone git@github.com:forgeplatform/forge-operator.git
cd forge-operator
helm install forge-operator helm/ \
-n forge-operator --create-namespace \
--set forge.token="$TOKEN" \
--set forge.url=http://forge-web.forge.svc.cluster.local:8013
Verify:
kubectl -n forge-operator logs deploy/forge-operator -f
You should see Starting workers for each of the nine reconcilers (organization, team, project, inventory, credential, jobtemplate, schedule, workflow, forgeinstance).
For full coverage of the v1.0.0 release — multi-cluster routing via ForgeInstance, the OLM bundle, and the declarative Workflow DAG model — see the dedicated Operator v1.0.0 guide.
Custom Resources
Each CR maps to a Forge primary-key after first reconcile (status.forgeId). The operator owns the resource via a finalizer — deleting the CR deletes the Forge resource too. Re-applying a CR after manual edits in the Forge UI overwrites the UI changes (drift is detected on a 60s requeue and reconciled toward the CR).
Sample CRs live in forge-operator/config/samples/:
| Kind | Maps to | Sensitive fields |
|---|---|---|
| Organization | /api/v2/organizations | None; top-level tenant container with max-host quota |
| Team | /api/v2/teams + member sync at /teams/{id}/users/ | None; spec.users[] resolves usernames at reconcile |
| Project | /api/v2/projects | SCM credential resolved by name from an existing Credential CR |
| Inventory | /api/v2/inventories + nested hosts & groups | None; pure spec |
| Credential | /api/v2/credentials | spec.inputsFrom[] reads sensitive values from k8s Secrets in the same namespace; the operator watches those Secrets and re-syncs on rotation |
| JobTemplate | /api/v2/job_templates with credential / project / inventory references resolved by name | None |
| Schedule | /api/v2/schedules | None; RFC 5545 RRULE drives recurrence |
| Workflow | /api/v2/workflow_job_templates + node DAG at /workflow_job_templates/{id}/workflow_nodes/ + edges | Declarative spec.nodes[] keyed by identifier with successNodes / failureNodes / alwaysNodes graph references |
| ForgeInstance | Control-plane only; describes a Forge backend that other CRs target via spec.forgeInstance | Bearer token sourced from a k8s Secret via spec.tokenSecretRef |
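A quick way to check whether a CR has been reconciled is to read the status.forgeId field mentioned above. The helper below is a sketch: the function name forge_id is hypothetical, and the namespace default assumes the CR was applied without an explicit namespace.

```shell
# Print the Forge primary key recorded on a CR after its first reconcile.
# $1 = kind (for example inventory), $2 = CR name,
# $3 = namespace (optional, assumed to be "default" when omitted)
forge_id() {
  kubectl get "$1" "$2" -n "${3:-default}" -o jsonpath='{.status.forgeId}'
}
```

For example, forge_id inventory production prints the ID once the operator has created the inventory in Forge; empty output suggests the first reconcile has not completed yet.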
Verification
After both Helm releases land:
# Forge API healthy
kubectl -n forge port-forward svc/forge-web 8013:8013 &
curl http://localhost:8013/api/v2/ping/ | jq
# Browser access (assuming /etc/hosts is set)
open http://forge.lan:30080/
# Operator wired up
kubectl get crd | grep forgeplatform.io
kubectl apply -f https://raw.githubusercontent.com/forgeplatform/forge-operator/main/config/samples/inventory-sample.yaml
kubectl get inventory production -o jsonpath='{.status}'
Day-2 Operations
Upgrade
git -C forge-helm pull
helm upgrade forge ./forge-helm -n forge --reuse-values
The chart re-runs forge-init on each upgrade (its name embeds {{ .Release.Revision }}). The init script is idempotent. If you upgrade against a populated DB, the migrate step finishes in milliseconds; the heavier preload-data and execution-environment registration steps short-circuit when records already exist.
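To confirm that the init Job for a given revision has finished, a sketch along these lines may help. It assumes the Job is named forge-init-<revision>, following the forge-init-1 example on this page; both function names are hypothetical.

```shell
# Derive the init Job name for a release revision.
# Assumption: the name pattern is forge-init-<revision>.
init_job_name() {
  printf 'forge-init-%s\n' "$1"
}

# Block until that Job reports the Complete condition, or time out.
# $1 = release revision, $2 = timeout (optional, defaults to 10m)
wait_for_init() {
  kubectl -n forge wait --for=condition=complete --timeout="${2:-10m}" \
    "job/$(init_job_name "$1")"
}
```

The current revision number appears in the output of helm -n forge history forge --max 1; after the second upgrade, for instance, wait_for_init 2 would wait on job/forge-init-2.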
Backup
kubectl -n forge exec deploy/forge-web -- /usr/local/bin/backup.sh
kubectl -n forge cp forge-web-xxx:/var/backups/forge/forge-<ts>.sql.gz ./forge.sql.gz
Scale forge-task
Forge dispatches jobs to any forge-task replica via the Receptor mesh. Increase task.replicas in values.yaml for more concurrent jobs. Keep in mind:
- forge-task runs privileged: true with podman, so restrict it to a node pool you trust
- The chart bumps the task memory limit to 4 Gi by default, because supervisord + dispatcher + Receptor + ansible-runner + an EE container together blow past 2 Gi during the first project sync. Lower this only if you’ve profiled your specific workload.
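As a sketch, the replica count can also be changed from the command line rather than by editing values.yaml. The function name scale_forge_task is hypothetical; task.replicas and the forge-task Deployment are the names used on this page.

```shell
# Scale forge-task through the chart value and wait for the rollout.
# $1 = desired replica count
scale_forge_task() {
  helm upgrade forge ./forge-helm -n forge --reuse-values \
    --set task.replicas="$1" &&
    kubectl -n forge rollout status deploy/forge-task
}
```

Running scale_forge_task 3 would upgrade the release with three task replicas and return once the Deployment has rolled out.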
Customizing values.yaml
The shipped values.yaml is heavily commented. The most commonly tuned values:
| Path | Default | Why you’d change it |
|---|---|---|
| images.backend.repository | ghcr.io/forgeplatform/forge-backend | Mirror to your own registry |
| images.backend.tag | latest | Pin to a CalVer release |
| secrets.forgeAdminPassword | changeme-admin | Always; it’s a placeholder |
| ingress.host | forge.lan | Your real DNS name |
| ingress.tls.enabled | true | Disable for plain-HTTP test clusters |
| postgres.storage.size | 8Gi | Production sizing |
| postgres.enabled | true | Set false + override DB env to point at an external Postgres (RDS, Cloud SQL) |
| otelCollector.enabled | true | Disable if you ship an OTel Collector at the cluster level (DaemonSet) |
| web.replicas / task.replicas | 1 / 1 | HA / throughput |
Troubleshooting
Browser shows “Server Not Found” for forge.local
That’s the mDNS hijack described above. Use the chart default (forge.lan) and update /etc/hosts accordingly. If you’re locked into a .local hostname, edit /etc/nsswitch.conf on every developer laptop to put files before mdns_minimal — but this is system-wide and may break Avahi-discovered devices on the LAN.
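Before editing anything, it can help to check which source comes first on the hosts line of a laptop. The sketch below only inspects the file; the function name hosts_lookup_order is hypothetical.

```shell
# Report whether an mdns* source or "files" comes first on the hosts: line.
# $1 = path to an nsswitch.conf-style file (defaults to /etc/nsswitch.conf)
hosts_lookup_order() {
  awk '$1 == "hosts:" {
         for (i = 2; i <= NF; i++) {
           if ($i ~ /^mdns/)  { print "mdns-first";  exit }
           if ($i == "files") { print "files-first"; exit }
         }
       }' "${1:-/etc/nsswitch.conf}"
}
```

A result of mdns-first matches the failure mode described here; files-first means /etc/hosts is consulted before mDNS and the cause is likely elsewhere.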
Jobs fail with “unknown work type kubernetes-incluster-auth”
The Forge instance is registered as node_type=control instead of hybrid. Control nodes only orchestrate; they refuse to execute jobs locally and try to dispatch them to a ContainerGroup. The chart’s init.sh runs an explicit ORM update after provision_instance to fix this — if you wrote a custom init, replicate that step:
forge-manage shell -c "
from forge.main.models import Instance
i = Instance.objects.get(hostname='<node>')
i.node_type = 'hybrid'; i.save(update_fields=['node_type'])
"
The same script also flips the auto-created default InstanceGroup off is_container_group=True, which is what the post-migrate signal sets when Forge detects it’s running in k8s.
forge-task pod restarts with exit code 137
OOMKilled — the default memory limit was 2 Gi in earlier chart versions. Bump task.resources.limits.memory to 4Gi (the current default). Memory is tight because supervisord runs uwsgi + dispatcher + Receptor + ansible-runner + a podman EE container concurrently.
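The exit code itself encodes the signal, and the kubelet records the reason on the Pod. The following sketch shows both; the function name last_termination_reason is hypothetical, and it reads the first container in the Pod, which is assumed to be the task container.

```shell
# Exit codes above 128 mean "terminated by signal (code - 128)".
exit_code=137
signal=$((exit_code - 128))
echo "exit $exit_code => signal $signal"   # signal 9 is SIGKILL

# Print the reason the kubelet recorded for the previous termination.
# $1 = pod name; OOMKilled indicates the memory limit was hit.
last_termination_reason() {
  kubectl -n forge get pod "$1" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
}
```

If the reason is something other than OOMKilled, raising the memory limit is unlikely to help and the Pod events are the next place to look.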
Job error “Error updating status file /tmp/receptor/.../status.lock: no such file or directory”
The Receptor work-unit directory disappeared. This used to happen when forge-task was OOMKilled mid-job and the pod restart wiped /tmp. Mitigation:
- Bump task memory (above)
- The chart now mounts the forge-receptor PVC at /tmp/receptor so work units survive Pod restarts
Pods on different nodes can’t reach each other (VirtualBox only)
See the CNI quirk note — Flannel must be patched to bind to the host-only interface (eth1), not the NAT-mode eth0. forge-dev-cluster bakes this into provisioning.
Companion Repositories
| Repo | What it ships |
|---|---|
| forge-helm | The Helm chart described on this page |
| forge-operator | The Kubernetes operator with nine CRDs |
| forge-dev-cluster | 4-VM Vagrant + VirtualBox cluster (2 master + 2 worker, k8s 1.30) for chart and operator integration testing; includes the Flannel eth1 patch and a post-cluster-setup.sh that installs Traefik, local-path-provisioner, and the forge namespace prerequisites |
| forge-devops | The Docker Compose deployment described on the Deployment page; same image set, single-host topology |