Kubernetes Deployment

Forge Platform deploys natively on Kubernetes via the official Helm chart at forgeplatform/forge-helm. An optional operator at forgeplatform/forge-operator (v1.0.0) lets you manage nine Forge resource kinds (organizations, teams, projects, inventories, credentials, job templates, schedules, workflows (DAG), and remote Forge instances) declaratively as Kubernetes Custom Resources, and can route different CRs to different Forge backends.

This page is the long-form companion to the Docker Compose Deployment guide. If you already run a Kubernetes cluster, the Helm path is the recommended deployment for production: it manages secrets natively, supports rolling upgrades, and integrates with Ingress controllers, External Secrets, and the wider ecosystem.


When to Pick Kubernetes

| | Docker Compose | Kubernetes (Helm + Operator) |
|---|---|---|
| Best for | Single-host deployments, evaluation, small teams | Multi-node clusters, GitOps, declarative job-template management |
| HA | No (single node) | Yes (multiple replicas, multi-master cluster) |
| Storage | Bind mounts | PVCs (any CSI driver) |
| Secrets | .env file | k8s Secrets / External Secrets / Sealed Secrets |
| Ingress | nginx + Let’s Encrypt scripts | Any IngressController + cert-manager |
| Job templates | Created via UI / API only | UI / API or kubectl apply -f jt.yaml |
| Upgrade | docker compose pull && up -d | helm upgrade (rolling) |

Architecture on Kubernetes

The chart deploys six application workloads plus the Forge backend itself. Every workload runs as its own Deployment (or a StatefulSet for Postgres), and every workload except forge-task is fronted by a ClusterIP Service for in-cluster traffic. Ingress traffic enters through a single Ingress resource that fans out by URL path.

                        ┌──────────────────┐
                        │   Ingress (TLS)  │  forge.lan
                        │  Traefik / NGINX │
                        └─────────┬────────┘
              ┌───────────────────┼─────────────────────┐
              │   /api /admin /static /sso /websocket   │  /
              ▼                                          ▼
      ┌───────────────┐                          ┌────────────────┐
      │   forge-web   │  Django + uwsgi + daphne │ forge-frontend │  React SPA
      │  (Deployment) │  ports: 8013 8015        │  (Deployment)  │  port: 80
      └──┬──────┬──┬──┘                          └────────────────┘
         │      │  │
         │      │  └──── OPA (sidecar via service forge-opa:8181) ──── policy decisions
         │      │
         │      └──── OTel Collector (forge-otel-collector:4317) ──── traces / metrics
         │
         ├──── Postgres (StatefulSet, forge-postgres:5432, PVC 8 Gi)
         │
         ├──── Redis (Deployment, forge-redis:6379, PVC 2 Gi)
         │
         └──── forge-task (Deployment, no Service)
                ├── ansible-runner + podman in privileged container
                ├── Receptor mesh (control socket /var/run/awx-receptor/receptor.sock)
                └── shared PVC forge-projects (mounted in both web + task)

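Two extra resources run as one-shots / CRDs: the forge-init Job (a one-shot that runs on install and again on each upgrade) and the operator's Custom Resource Definitions.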


Namespaces

The chart and operator each install into their own namespace. This separation lets you grant the operator only the permissions it needs and lets you upgrade Forge without touching the operator (or vice versa).

| Namespace | Contents | Created by |
|---|---|---|
| forge | Postgres, Redis, OPA, OTel Collector, forge-web, forge-task, forge-frontend, forge-init Job, Ingress, all chart Secrets and ConfigMaps | helm install forge ./forge-helm -n forge --create-namespace (or pre-created if you need to seed pull-secrets first) |
| forge-operator | The operator Deployment, its ServiceAccount + ClusterRole + ClusterRoleBinding, and one Secret holding the Forge OAuth2 token | helm install forge-operator ./forge-operator/helm -n forge-operator --create-namespace |
| any | Instances of the nine CRDs (Organization, Team, Project, Inventory, Credential, JobTemplate, Schedule, Workflow, ForgeInstance). Each CR is namespace-scoped; the operator watches all namespaces and reconciles them centrally against Forge. | You, via kubectl apply -f cr.yaml |

The chart’s top-level namespace.create value defaults to false because pre-creating the namespace is the common path: you usually kubectl create it first so you can add a registry pull-secret (for example, for a Harbor mirror) and a TLS secret before the Pods start and need them.


Networking

Services and ports

Every workload that other workloads (or the Ingress) need to reach gets a ClusterIP Service. forge-task is the only one without a Service — nothing reaches into it; it pulls work off Redis and pushes it through Receptor.

| Service | Ports | Selector | Reached by |
|---|---|---|---|
| forge-web | 8013 (HTTP API), 8015 (websocket) | component=web | Ingress, forge-task callback, forge-operator |
| forge-frontend | 80 (HTTP) | component=frontend | Ingress (path /) |
| forge-postgres | 5432 | component=postgres | forge-web, forge-task, forge-init |
| forge-redis | 6379 | component=redis | forge-web, forge-task |
| forge-opa | 8181 (REST policy decision API) | component=opa | forge-web (policy checks) |
| forge-otel-collector | 4317 (gRPC), 4318 (HTTP) | component=otel-collector | forge-web, forge-task (traces + metrics) |

Cluster DNS

Workloads reach each other through the cluster DNS at <service>.<namespace>.svc.cluster.local. Inside the forge namespace the bare service name resolves too, so forge.otel.endpoint defaults to http://forge-otel-collector:4317. The operator reaches the API at the FQDN http://forge-web.forge.svc.cluster.local:8013 because it lives in a different namespace.
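
The naming rule above can be sketched as a small shell helper. The function name svc_fqdn is hypothetical, and cluster.local is only the Kubernetes default cluster domain, which a given cluster may override.

```shell
# Build the in-cluster DNS name for a Service.
# svc_fqdn is a hypothetical helper; cluster.local is the default cluster domain.
svc_fqdn() {
    # $1 = service name, $2 = namespace, $3 = optional cluster domain
    printf '%s.%s.svc.%s\n' "$1" "$2" "${3:-cluster.local}"
}

# The URL the operator uses to reach the API from another namespace:
echo "http://$(svc_fqdn forge-web forge):8013"   # http://forge-web.forge.svc.cluster.local:8013
```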

Ingress and URL routing

A single Ingress resource named forge routes forge.lan by URL path. The chart defaults to Traefik (ingress.className: traefik), but any IngressController works: change className to nginx, contour, etc. Path order matters because Forge owns five distinct path prefixes that must take precedence over the SPA catch-all.

| Path | Target Service | Why |
|---|---|---|
| /api | forge-web:8013 | REST API (/api/v2/*) |
| /admin | forge-web:8013 | Django admin |
| /sso | forge-web:8013 | SAML / OIDC / social-auth callbacks |
| /websocket | forge-web:8015 | Job event streaming over WS |
| /static/forge | forge-frontend:80 | SPA assets; must come before /static below |
| /static | forge-web:8013 | Django staticfiles (admin CSS, DRF browser) |
| / | forge-frontend:80 | SPA catch-all |
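
To see why the ordering matters, the table can be modeled as a first-match lookup. This is only an illustration of the routing table, not the chart's Ingress template: route_path is a hypothetical helper, and real controllers may rank rules by prefix length rather than by listing order.

```shell
# Illustrative first-match routing over the path table, most specific prefix first.
# route_path is a hypothetical helper, not part of the chart.
route_path() {
    case "$1" in
        /api|/api/*)                   echo "forge-web:8013" ;;
        /admin|/admin/*)               echo "forge-web:8013" ;;
        /sso|/sso/*)                   echo "forge-web:8013" ;;
        /websocket|/websocket/*)       echo "forge-web:8015" ;;
        /static/forge|/static/forge/*) echo "forge-frontend:80" ;;  # must precede /static
        /static|/static/*)             echo "forge-web:8013" ;;
        *)                             echo "forge-frontend:80" ;;  # SPA catch-all
    esac
}

route_path /static/forge/main.js    # forge-frontend:80
route_path /static/admin/base.css   # forge-web:8013
route_path /api/v2/ping/            # forge-web:8013
```

Swapping the two /static branches would send SPA assets to forge-web, which is the failure the ordering rule guards against.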

Hostname TLD pitfall (.local vs .lan)

The chart’s ingress.host default is forge.lan, not forge.local. The reason: most desktop Linux distros ship nsswitch.conf with mdns_minimal [NOTFOUND=return] ahead of files, which intercepts every .local lookup, fails it, and bypasses /etc/hosts entirely. The browser then shows Server Not Found even though curl --resolve forge.local:30080:<node-ip> ... works fine. The .lan TLD has no such hijack.

For host-based access from a developer laptop, add to /etc/hosts:

192.168.56.32  forge.lan

Exposing the cluster

How Ingress traffic actually reaches your nodes depends on the IngressController's Service type. The examples on this page reach the controller on port 30080 of a node IP.

CNI quirk on VirtualBox dev clusters

If you build a multi-VM cluster on VirtualBox host-only networks, Flannel’s default --iface auto-detect picks eth0 (the NAT adapter). Pod-to-pod traffic across nodes then dies, with symptoms like connect: no route to host from any cross-node service VIP. Patch the Flannel DaemonSet to bind to the host-only interface explicitly:

kubectl -n kube-flannel patch ds kube-flannel-ds --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--iface=eth1"}]'

The forge-dev-cluster Vagrant repo bakes this patch into master-init.sh.


Storage

Five PVCs are created. Sizes are set through values overrides; start with the defaults and grow the underlying storage when you outgrow them.

| PVC | Default size | Mode | Mounted in | Purpose |
|---|---|---|---|---|
| forge-postgres | 8 Gi | RWO | StatefulSet forge-postgres | Database files |
| forge-redis | 2 Gi | RWO | Deployment forge-redis | RDB snapshots / AOF |
| forge-projects | 4 Gi | RWX | forge-web + forge-task | Synced project repos (_<id>__<slug>) |
| forge-receptor | 2 Gi | RWO | forge-task | Receptor work units (/tmp/receptor) |
| forge-backups | 10 Gi | RWO | forge-web (mount-only) | Where backup.sh writes .sql.gz archives |

RWX requirement: forge-projects is mounted in both forge-web and forge-task. If those Pods land on different nodes, RWX is mandatory. If you’re forced to ReadWriteOnce (e.g. local-path-provisioner on bare metal), pin both Deployments to the same node with a nodeSelector or podAffinity; otherwise the second Pod will sit Pending forever.

Override the StorageClass per-PVC with values like postgres.storage.storageClass: ceph-rbd. Leaving it empty falls back to the cluster default StorageClass.


Prerequisites

Container images

Forge images are published to the public GitHub Container Registry: ghcr.io/forgeplatform/*. No pull secret is required, so the only prerequisite here is the namespace itself:

kubectl create namespace forge

If you mirror the images to your own registry, override images.backend.repository + images.frontend.repository in values.yaml and add an imagePullSecrets entry pointing at your in-cluster docker-registry Secret.

TLS secret

Pre-create a TLS Secret if you want HTTPS on first boot. Self-signed is fine for dev:

openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
    -keyout tls.key -out tls.crt -subj '/CN=forge.lan' \
    -addext 'subjectAltName=DNS:forge.lan,DNS:*.forge.lan'

kubectl -n forge create secret tls forge-tls --cert=tls.crt --key=tls.key

Install Forge core

git clone git@github.com:forgeplatform/forge-helm.git
cd forge-helm

helm install forge . -n forge \
    --set forge.admin.user=admin \
    --set secrets.forgeAdminPassword='<strong-password>' \
    --set secrets.postgresPassword='<random-32-bytes>' \
    --set secrets.forgeSecretKey="$(openssl rand -hex 32)" \
    --set secrets.forgeBroadcastWebsocketSecret="$(openssl rand -hex 32)"
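
As an alternative to passing secrets with --set (which leaves them in shell history), the same four keys could be written to a values file. This is a sketch: write_secret_values, rand_hex32 and the file name are hypothetical, and the admin password must not contain double quotes because it is written into a quoted YAML string.

```shell
# Write generated secrets to a values file that is created readable only by the current user.
# write_secret_values and rand_hex32 are hypothetical helpers; the four keys are the
# ones passed with --set in the install command.
rand_hex32() {
    # 32 random bytes as 64 hex characters
    if command -v openssl >/dev/null 2>&1; then
        openssl rand -hex 32
    else
        head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n'
        echo
    fi
}

write_secret_values() {
    # $1 = output file, $2 = admin password
    ( umask 077
      cat > "$1" <<EOF
secrets:
  forgeAdminPassword: "$2"
  postgresPassword: "$(rand_hex32)"
  forgeSecretKey: "$(rand_hex32)"
  forgeBroadcastWebsocketSecret: "$(rand_hex32)"
EOF
    )
}

# Usage:
#   write_secret_values forge-secrets.yaml '<strong-password>'
#   helm install forge . -n forge --set forge.admin.user=admin -f forge-secrets.yaml
```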

The install runs synchronously through the forge-init Job. Watch progress with:

kubectl -n forge get pods -w

Expected end state — every Pod Running 1/1, forge-init-1 Completed:

NAME                              READY   STATUS      RESTARTS   AGE
forge-frontend-7f8c5d9c8b-abcde   1/1     Running     0          3m
forge-init-1-xxxxx                0/1     Completed   0          3m
forge-opa-6cdfb9d79c-fghij        1/1     Running     0          3m
forge-otel-collector-...          1/1     Running     0          3m
forge-postgres-0                  1/1     Running     0          3m
forge-redis-...                   1/1     Running     0          3m
forge-task-...                    1/1     Running     0          3m
forge-web-...                     1/1     Running     0          3m

Optional: AI Assistant

Chart 0.3.0 introduces an optional forge-assistant deployment that wraps the all-in-one image (Ollama + ChromaDB + RAG API in one container). It is disabled by default — enable it on a fresh install or on an upgrade:

helm upgrade forge . -n forge \
    --reuse-values \
    --set assistant.enabled=true

The chart provisions a PersistentVolumeClaim (forge-assistant-data, default 20 Gi) so the LLM model cache and Chroma vector store survive pod restarts. On first boot the entrypoint pulls the configured model (gemma3:1b by default, ≈800 MB) and the embedding model (nomic-embed-text), then indexes the bundled documentation — allow ~5 minutes before the pod is Ready. The startupProbe is sized for this (failureThreshold: 30 at periodSeconds: 10) so the liveness probe will not kill the pod mid-bootstrap.
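
The probe budget works out as follows. The two numbers are the ones quoted above; any initial delay configured on the probe is ignored here, which is an assumption.

```shell
# Startup budget = failureThreshold * periodSeconds (values quoted above).
failure_threshold=30
period_seconds=10
budget=$((failure_threshold * period_seconds))
echo "startup budget: ${budget}s ($((budget / 60)) min)"   # startup budget: 300s (5 min)
```

If a larger model takes longer than that to download, the startup probe's thresholds would need to grow with it.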

Key values you may want to override:

| Value | Default | Notes |
|---|---|---|
| assistant.model | gemma3:1b | Switch to llama3.1:8b or mistral:7b for higher answer quality at the cost of memory and latency. |
| assistant.storage.size | 20Gi | Bump to 30 Gi+ if you change models or grow the indexed corpus. |
| assistant.resources.limits.memory | 4Gi | gemma3:1b inference peaks ~1.5 Gi; larger models need 8 Gi+. |

The Service is reachable cluster-internally at forge-assistant.forge.svc.cluster.local:8100. Ingress routing is intentionally not wired up by this release — expose it with a follow-up middleware (Traefik stripPrefix on /assistant) once you decide on a URL contract.


Install the Operator

The operator is shipped separately because not everyone wants declarative-via-CRD management — some teams prefer the UI. Install only after Forge core is healthy (the operator needs to authenticate to it on startup).

Generate a Forge OAuth2 PAT

The operator authenticates with a personal access token issued by Forge. Generate one inside the running web Pod:

TOKEN=$(kubectl -n forge exec deploy/forge-web -- \
    forge-manage create_oauth2_token --user admin | tail -1)
echo "$TOKEN"

The token is shown only once, so copy it now. To revoke it later, run forge-manage revoke_oauth2_tokens --user admin.

Install with Helm

git clone git@github.com:forgeplatform/forge-operator.git
cd forge-operator

helm install forge-operator helm/ \
    -n forge-operator --create-namespace \
    --set forge.token="$TOKEN" \
    --set forge.url=http://forge-web.forge.svc.cluster.local:8013

Verify:

kubectl -n forge-operator logs deploy/forge-operator -f

You should see Starting workers for each of the nine reconcilers (organization, team, project, inventory, credential, jobtemplate, schedule, workflow, forgeinstance).

For full coverage of the v1.0.0 release — multi-cluster routing via ForgeInstance, the OLM bundle, and the declarative Workflow DAG model — see the dedicated Operator v1.0.0 guide.

Custom Resources

Each CR maps to a Forge primary-key after first reconcile (status.forgeId). The operator owns the resource via a finalizer — deleting the CR deletes the Forge resource too. Re-applying a CR after manual edits in the Forge UI overwrites the UI changes (drift is detected on a 60s requeue and reconciled toward the CR).
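
A script that applies a CR can wait for the first reconcile by polling status.forgeId. The helper below is a sketch: wait_for_forge_id is a hypothetical name, and the 5-second poll interval and 120-second default timeout are arbitrary choices.

```shell
# Poll a CR until the operator has written status.forgeId, or give up.
# wait_for_forge_id is a hypothetical helper; arguments are kind, name, namespace,
# and an optional timeout in seconds (default 120).
wait_for_forge_id() {
    kind=$1; name=$2; ns=$3; deadline=${4:-120}; waited=0
    while [ "$waited" -lt "$deadline" ]; do
        id=$(kubectl -n "$ns" get "$kind" "$name" -o jsonpath='{.status.forgeId}' 2>/dev/null)
        if [ -n "$id" ]; then
            echo "$id"
            return 0
        fi
        sleep 5
        waited=$((waited + 5))
    done
    echo "timed out waiting for $kind/$name" >&2
    return 1
}

# Usage: wait_for_forge_id inventory production default
```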

Sample CRs live in forge-operator/config/samples/:

| Kind | Maps to | Sensitive fields |
|---|---|---|
| Organization | /api/v2/organizations | None (top-level tenant container with max-host quota) |
| Team | /api/v2/teams + member sync at /teams/{id}/users/ | None (spec.users[] resolves usernames at reconcile) |
| Project | /api/v2/projects | SCM credential resolved by name from an existing Credential CR |
| Inventory | /api/v2/inventories + nested hosts & groups | None (pure spec) |
| Credential | /api/v2/credentials | spec.inputsFrom[] reads sensitive values from k8s Secrets in the same namespace; the operator watches those Secrets and re-syncs on rotation |
| JobTemplate | /api/v2/job_templates with credential / project / inventory references resolved by name | None |
| Schedule | /api/v2/schedules | None (RFC 5545 RRULE drives recurrence) |
| Workflow | /api/v2/workflow_job_templates + node DAG at /workflow_job_templates/{id}/workflow_nodes/ + edges | Declarative spec.nodes[] keyed by identifier with successNodes / failureNodes / alwaysNodes graph references |
| ForgeInstance | Control-plane only; describes a Forge backend that other CRs target via spec.forgeInstance | Bearer token sourced from a k8s Secret via spec.tokenSecretRef |

Verification

After both Helm releases land:

# Forge API healthy
kubectl -n forge port-forward svc/forge-web 8013:8013 &
sleep 2  # give the port-forward a moment to bind before querying it
curl -s http://localhost:8013/api/v2/ping/ | jq

# Browser access (assuming /etc/hosts is set)
open http://forge.lan:30080/

# Operator wired up
kubectl get crd | grep forgeplatform.io
kubectl apply -f https://raw.githubusercontent.com/forgeplatform/forge-operator/main/config/samples/inventory-sample.yaml
kubectl get inventory production -o jsonpath='{.status}'

Day-2 Operations

Upgrade

git -C forge-helm pull
helm upgrade forge ./forge-helm -n forge --reuse-values

The chart re-runs forge-init on each upgrade (its name embeds {{ .Release.Revision }}). The init script is idempotent. If you upgrade against a populated DB, the migrate step finishes in milliseconds; the heavier preload-data and execution-environment registration steps short-circuit when records already exist.
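
Because the Job name changes with every revision, a script that waits for it has to look the revision up first. The sketch below assumes the name follows the forge-init-<revision> pattern seen in the install output (forge-init-1); init_job_name and wait_for_init are hypothetical helpers, and jq is assumed to be installed.

```shell
# forge-init's Job name embeds the Helm release revision, e.g. forge-init-1.
# init_job_name and wait_for_init are hypothetical helpers.
init_job_name() {
    # $1 = Helm release revision
    echo "forge-init-$1"
}

wait_for_init() {
    # The last entry of helm history is the most recent revision of the release.
    rev=$(helm history forge -n forge -o json | jq -r '.[-1].revision')
    kubectl -n forge wait --for=condition=complete \
        "job/$(init_job_name "$rev")" --timeout=600s
}

# Usage (right after helm upgrade): wait_for_init
```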

Backup

kubectl -n forge exec deploy/forge-web -- /usr/local/bin/backup.sh
kubectl -n forge cp forge-web-xxx:/var/backups/forge/forge-<ts>.sql.gz ./forge.sql.gz
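
The two commands above use placeholders for the Pod name and the archive timestamp. A sketch that resolves both at run time is shown below; fetch_latest_backup is a hypothetical helper, the component=web label is taken from the Services table, and picking the newest file by modification time is an assumption about how backup.sh names its output.

```shell
# Run backup.sh in the web Pod and copy the newest archive to the current directory.
# fetch_latest_backup is a hypothetical helper.
fetch_latest_backup() {
    pod=$(kubectl -n forge get pods -l component=web \
        -o jsonpath='{.items[0].metadata.name}') || return 1
    kubectl -n forge exec "$pod" -- /usr/local/bin/backup.sh || return 1
    # Newest archive by modification time
    latest=$(kubectl -n forge exec "$pod" -- \
        sh -c 'ls -t /var/backups/forge/*.sql.gz | head -n 1') || return 1
    kubectl -n forge cp "$pod:$latest" "./$(basename "$latest")"
}

# Usage: fetch_latest_backup
```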

Scale forge-task

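Forge dispatches jobs to any forge-task replica via the Receptor mesh. Increase task.replicas in values.yaml for more concurrent jobs. Keep in mind that forge-projects is mounted in every forge-task Pod, so replicas spread across nodes need the RWX storage described under Storage.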


Customizing values.yaml

The shipped values.yaml is heavily commented. The most commonly tuned values:

| Path | Default | Why you’d change it |
|---|---|---|
| images.backend.repository | ghcr.io/forgeplatform/forge-backend | Mirror to your own registry |
| images.backend.tag | latest | Pin to a CalVer release |
| secrets.forgeAdminPassword | changeme-admin | Always; it’s a placeholder |
| ingress.host | forge.lan | Your real DNS name |
| ingress.tls.enabled | true | Disable for plain-HTTP test clusters |
| postgres.storage.size | 8Gi | Production sizing |
| postgres.enabled | true | Set false + override DB env to point at an external Postgres (RDS, Cloud SQL) |
| otelCollector.enabled | true | Disable if you ship an OTel Collector at the cluster level (DaemonSet) |
| web.replicas / task.replicas | 1 / 1 | HA / throughput |
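
A small overrides file built only from keys in the table above might look like the sketch below. The file name, the hostname, the tag placeholder and the sizes are illustrative assumptions, and the nesting is inferred from the dotted paths.

```shell
# Write an overrides file using only keys listed in the table above.
# The file name, tag placeholder, hostname and sizes are illustrative assumptions.
cat > my-values.yaml <<'EOF'
images:
  backend:
    tag: "<calver-release>"
ingress:
  host: forge.example.com
  tls:
    enabled: true
postgres:
  storage:
    size: 50Gi
web:
  replicas: 2
task:
  replicas: 2
EOF

# Apply with: helm upgrade forge ./forge-helm -n forge --reuse-values -f my-values.yaml
```

Keeping overrides in a file rather than a chain of --set flags makes them reviewable and repeatable across upgrades.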

Troubleshooting

Browser shows “Server Not Found” for forge.local

That’s the mDNS hijack described above. Use the chart default (forge.lan) and update /etc/hosts accordingly. If you’re locked into a .local hostname, edit /etc/nsswitch.conf on every developer laptop to put files before mdns_minimal — but this is system-wide and may break Avahi-discovered devices on the LAN.

Jobs fail with “unknown work type kubernetes-incluster-auth”

The Forge instance is registered as node_type=control instead of hybrid. Control nodes only orchestrate; they refuse to execute jobs locally and try to dispatch them to a ContainerGroup. The chart’s init.sh runs an explicit ORM update after provision_instance to fix this — if you wrote a custom init, replicate that step:

forge-manage shell -c "
from forge.main.models import Instance
i = Instance.objects.get(hostname='<node>')
i.node_type = 'hybrid'; i.save(update_fields=['node_type'])
"

The same script also sets is_container_group=False on the auto-created default InstanceGroup, undoing the is_container_group=True that the post-migrate signal sets when Forge detects it’s running in k8s.

forge-task pod restarts with exit code 137

OOMKilled — the default memory limit was 2 Gi in earlier chart versions. Bump task.resources.limits.memory to 4Gi (the current default). Memory is tight because supervisord runs uwsgi + dispatcher + Receptor + ansible-runner + a podman EE container concurrently.
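
To confirm the diagnosis before changing limits, you can read the last termination reason from the Pod status. This is a sketch: both helper names are hypothetical, and the component=task label is an assumption modeled on the component=<name> selectors in the Services table (forge-task has no Service, so its label is not listed there).

```shell
# Print each forge-task Pod with the reason its containers last terminated.
# last_termination_reasons and task_was_oomkilled are hypothetical helpers;
# the component=task label is an assumption.
last_termination_reasons() {
    kubectl -n forge get pods -l component=task \
        -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'
}

# Succeeds when any forge-task container was last killed for exceeding its memory limit.
task_was_oomkilled() {
    last_termination_reasons | grep -q 'OOMKilled'
}

# Usage: task_was_oomkilled && echo "raise task.resources.limits.memory"
```

If it reports OOMKilled, raising the limit can be done with helm upgrade forge ./forge-helm -n forge --reuse-values --set task.resources.limits.memory=4Gi.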

Job error “Error updating status file /tmp/receptor/.../status.lock: no such file or directory”

The Receptor work-unit directory disappeared. This used to happen when forge-task was OOMKilled mid-job and the pod restart wiped /tmp. Mitigation:

  1. Bump task memory (above)
  2. The chart now mounts forge-receptor PVC at /tmp/receptor so work units survive Pod restarts

Pods on different nodes can’t reach each other (VirtualBox only)

See the CNI quirk note — Flannel must be patched to bind to the host-only interface (eth1), not the NAT-mode eth0. forge-dev-cluster bakes this into provisioning.


Companion Repositories

| Repo | What it ships |
|---|---|
| forge-helm | The Helm chart described on this page |
| forge-operator | The Kubernetes operator with nine CRDs |
| forge-dev-cluster | 4-VM Vagrant + VirtualBox cluster (2 master + 2 worker, k8s 1.30) for chart and operator integration testing; includes the Flannel eth1 patch and a post-cluster-setup.sh that installs Traefik, local-path-provisioner, and the forge namespace prerequisites |
| forge-devops | The Docker Compose deployment described on the Deployment page (same image set, single-host topology) |