Kubernetes Deployment

Forge Platform deploys natively on Kubernetes via the official Helm chart at forgeplatform/forge-helm. An optional operator at forgeplatform/forge-operator (v1.0.0) lets you manage nine Forge resource kinds declaratively as Kubernetes Custom Resources: organizations, teams, projects, inventories, credentials, job templates, schedules, workflows (DAG), and remote Forge instances. It can also route different CRs to different Forge backends.

This page is the long-form companion to the Docker Compose Deployment guide. If you already run a Kubernetes cluster, the Helm path is the recommended deployment for production: it manages secrets natively, supports rolling upgrades, and integrates with Ingress controllers, External Secrets, and the wider ecosystem.

When to Pick Kubernetes

	Docker Compose	Kubernetes (Helm + Operator)
Best for	Single-host deployments, evaluation, small teams	Multi-node clusters, GitOps, declarative job-template management
HA	No (single node)	Yes (multiple replicas, multi-master cluster)
Storage	Bind mounts	PVCs (any CSI driver)
Secrets	`.env` file	k8s Secrets / External Secrets / Sealed Secrets
Ingress	nginx + Let’s Encrypt scripts	Any IngressController + cert-manager
Job templates	Created via UI / API only	UI / API or `kubectl apply -f jt.yaml`
Upgrade	`docker compose pull && up -d`	`helm upgrade` (rolling)

Architecture on Kubernetes

The chart deploys six application workloads plus the Forge backend itself. Every workload runs as its own Deployment (or StatefulSet for stateful data) and, with the exception of forge-task, is fronted by a ClusterIP Service for in-cluster traffic. External traffic enters through a single Ingress resource that fans out by URL path.

                        ┌──────────────────┐
                        │   Ingress (TLS)  │  forge.lan
                        │  Traefik / NGINX │
                        └─────────┬────────┘
              ┌───────────────────┼─────────────────────┐
              │       /api/admin/static/sso/websocket   │  /
              ▼                                          ▼
      ┌───────────────┐                          ┌────────────────┐
      │   forge-web   │  Django + uwsgi + daphne │ forge-frontend │  React SPA
      │  (Deployment) │  ports: 8013 8015        │  (Deployment)  │  port: 80
      └──┬──────┬──┬──┘                          └────────────────┘
         │      │  │
         │      │  └──── OPA (separate workload, service forge-opa:8181) ──── policy decisions
         │      │
         │      └──── OTel Collector (forge-otel-collector:4317) ──── traces / metrics
         │
         ├──── Postgres (StatefulSet, forge-postgres:5432, PVC 8 Gi)
         │
         ├──── Redis (Deployment, forge-redis:6379, PVC 2 Gi)
         │
         └──── forge-task (Deployment, no Service)
                ├── ansible-runner + podman in privileged container
                ├── Receptor mesh (control socket /var/run/awx-receptor/receptor.sock)
                └── shared PVC forge-projects (mounted in both web + task)

Two further components run outside the main workloads, one as a one-shot Job and one as a CRD controller:

forge-init Job: runs once per release revision. It applies migrations, creates the admin user, provisions the instance, registers the controlplane and default queues, seeds preload data, and writes CSRF_TRUSTED_ORIGINS into the DB. Its name is suffixed with {{ .Release.Revision }} so each upgrade gets a fresh Job (the Job spec is immutable, so re-applying the same name would fail).
forge-operator: runs in its own namespace, watches Organization / Team / Project / Inventory / Credential / JobTemplate / Schedule / Workflow / ForgeInstance CRs cluster-wide, and reconciles each against the Forge REST API. It authenticates with an OAuth2 personal access token issued by Forge; the optional ForgeInstance CR lets one operator deployment fan out to multiple Forge backends.

Namespaces

The chart and operator each install into their own namespace. This separation lets you grant the operator only the permissions it needs and lets you upgrade Forge without touching the operator (or vice versa).

Namespace	Contents	Created by
`forge`	Postgres, Redis, OPA, OTel Collector, forge-web, forge-task, forge-frontend, forge-init Job, Ingress, all chart Secrets and ConfigMaps	`helm install forge ./forge-helm -n forge --create-namespace` (or pre-created if you need to seed pull-secrets first)
`forge-operator`	The operator `Deployment`, its `ServiceAccount` + `ClusterRole` + `ClusterRoleBinding`, and one `Secret` holding the Forge OAuth2 token	`helm install forge-operator ./forge-operator/helm -n forge-operator --create-namespace`
any	The nine CRD instances (`Organization`, `Team`, `Project`, `Inventory`, `Credential`, `JobTemplate`, `Schedule`, `Workflow`, `ForgeInstance`). Each CR is namespace-scoped; the operator watches all namespaces and reconciles them centrally against Forge.	You — `kubectl apply -f cr.yaml`

The chart’s top-level namespace.create value defaults to false because pre-creating the namespace is the common path: you usually want to kubectl create it first so you can add a TLS secret, and a registry pull-secret if you mirror the images to a private registry, before the Pods start.

Networking

Services and ports

Every workload that other workloads (or the Ingress) need to reach gets a ClusterIP Service. forge-task is the only one without a Service — nothing reaches into it; it pulls work off Redis and pushes it through Receptor.

Service	Ports	Selector	Reached by
`forge-web`	8013 (HTTP API), 8015 (websocket)	`component=web`	Ingress, forge-task callback, forge-operator
`forge-frontend`	80 (HTTP)	`component=frontend`	Ingress (path `/`)
`forge-postgres`	5432	`component=postgres`	forge-web, forge-task, forge-init
`forge-redis`	6379	`component=redis`	forge-web, forge-task
`forge-opa`	8181 (REST policy decision API)	`component=opa`	forge-web (policy checks)
`forge-otel-collector`	4317 (gRPC), 4318 (HTTP)	`component=otel-collector`	forge-web, forge-task (traces + metrics)

Cluster DNS

Workloads reach each other through the cluster DNS at <service>.<namespace>.svc.cluster.local. Inside the forge namespace the bare service name resolves too, so forge.otel.endpoint defaults to http://forge-otel-collector:4317. The operator reaches the API at the FQDN http://forge-web.forge.svc.cluster.local:8013 because it lives in a different namespace.
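The naming rule can be sketched as a small shell helper. The function name svc_fqdn is hypothetical, and the sketch assumes the default cluster domain cluster.local, which some clusters override.

```shell
#!/usr/bin/env bash
# Build the in-cluster DNS name for a Service.
# Assumes the default cluster domain "cluster.local".
svc_fqdn() {
  local service="$1" namespace="$2"
  printf '%s.%s.svc.cluster.local\n' "$service" "$namespace"
}

# URL the operator uses to reach the API from another namespace.
echo "http://$(svc_fqdn forge-web forge):8013"
```

Inside the forge namespace the short name (forge-web) is enough; the helper only matters for cross-namespace callers such as the operator.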

Ingress and URL routing

A single Ingress resource named forge routes forge.lan by URL path. The chart defaults to Traefik (ingress.className: traefik), but any IngressController works: change className to nginx, contour, etc. Path order matters because Forge owns five distinct path prefixes that must take precedence over the SPA catch-all.

Path	Target Service	Why
`/api`	`forge-web:8013`	REST API (`/api/v2/*`)
`/admin`	`forge-web:8013`	Django admin
`/sso`	`forge-web:8013`	SAML / OIDC / social-auth callbacks
`/websocket`	`forge-web:8015`	Job event streaming over WS
`/static/forge`	`forge-frontend:80`	SPA assets — must come before `/static` below
`/static`	`forge-web:8013`	Django staticfiles (admin CSS, DRF browser)
`/`	`forge-frontend:80`	SPA catch-all
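To see why the ordering matters, the table can be modeled as a first-match lookup. This is a simplified illustration only: the route_for function is hypothetical, and real IngressControllers apply their own precedence rules rather than this loop.

```shell
#!/usr/bin/env bash
# Simplified model of the routing table above: prefixes are checked in the
# listed order and the first match wins.
ROUTES=(
  "/api=forge-web:8013"
  "/admin=forge-web:8013"
  "/sso=forge-web:8013"
  "/websocket=forge-web:8015"
  "/static/forge=forge-frontend:80"
  "/static=forge-web:8013"
  "/=forge-frontend:80"
)

route_for() {
  local path="$1" entry prefix target
  for entry in "${ROUTES[@]}"; do
    prefix="${entry%%=*}"
    target="${entry#*=}"
    # Match "/" (catch-all), the exact prefix, or a path-segment boundary.
    if [[ "$prefix" == "/" || "$path" == "$prefix" || "$path" == "$prefix"/* ]]; then
      echo "$target"
      return 0
    fi
  done
  return 1
}

route_for /api/v2/ping/           # forge-web:8013
route_for /static/forge/main.js   # forge-frontend:80
route_for /static/admin/base.css  # forge-web:8013
route_for /jobs/42                # forge-frontend:80
```

If the /static entry were listed before /static/forge in this model, SPA assets would be sent to forge-web instead of forge-frontend.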

Hostname TLD pitfall (`.local` vs `.lan`)

The chart’s ingress.host default is forge.lan, not forge.local. The reason: most desktop Linux distros ship nsswitch.conf with mdns_minimal [NOTFOUND=return] ahead of files, which intercepts every .local lookup, fails it, and bypasses /etc/hosts entirely. The browser then shows Server Not Found even though curl --resolve forge.local:30080:<node-ip> ... works fine. The .lan TLD has no such hijack.
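A quick guard for this pitfall might look like the sketch below. The function name uses_mdns_tld is hypothetical; the check only inspects the hostname suffix and does not read the machine's resolver configuration.

```shell
#!/usr/bin/env bash
# Return 0 when a hostname falls under the .local TLD used by mDNS.
uses_mdns_tld() {
  case "${1%.}" in            # tolerate a trailing dot
    *.local) return 0 ;;
    *)       return 1 ;;
  esac
}

for h in forge.local forge.lan; do
  if uses_mdns_tld "$h"; then
    echo "$h: may be intercepted by mDNS before /etc/hosts is consulted"
  else
    echo "$h: resolved through the normal hosts/DNS path"
  fi
done

# To see the lookup order on a given machine:
#   grep '^hosts:' /etc/nsswitch.conf
```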

For host-based access from a developer laptop, add to /etc/hosts:

192.168.56.32  forge.lan

Exposing the cluster

How Ingress traffic actually reaches your nodes depends on the IngressController service type:

NodePort (test clusters): Traefik’s service type defaults to NodePort on 30080 (HTTP) and 30443 (HTTPS). The browser hits http://forge.lan:30080/ and kube-proxy forwards the connection from any node to the Traefik pod.
LoadBalancer (cloud / MetalLB): set the IngressController service type to LoadBalancer and use the assigned external IP / DNS name.
hostNetwork (bare metal, single-node): runs the IngressController on the host network so 80/443 are reachable directly.

CNI quirk on VirtualBox dev clusters

If you build a multi-VM cluster on VirtualBox host-only networks, Flannel’s default --iface auto-detect picks eth0 (the NAT adapter). Pod-to-pod traffic across nodes then dies, with symptoms like connect: no route to host from any cross-node service VIP. Patch the Flannel DaemonSet to bind to the host-only interface explicitly:

kubectl -n kube-flannel patch ds kube-flannel-ds --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--iface=eth1"}]'

The forge-dev-cluster Vagrant repo bakes this patch into master-init.sh.

Storage

Five PVCs are created. Each size can be overridden in values.yaml; start with the defaults and grow the underlying storage when you outgrow them.

PVC	Default size	Mode	Mounted in	Purpose
`forge-postgres`	8 Gi	RWO	StatefulSet `forge-postgres`	Database files
`forge-redis`	2 Gi	RWO	Deployment `forge-redis`	RDB snapshots / AOF
`forge-projects`	4 Gi	RWX	forge-web + forge-task	Synced project repos (`_<id>__<slug>`)
`forge-receptor`	2 Gi	RWO	forge-task	Receptor work units (`/tmp/receptor`)
`forge-backups`	10 Gi	RWO	forge-web (mount-only)	Where `backup.sh` writes `.sql.gz` archives

RWX requirement: forge-projects is mounted in both forge-web and forge-task. If those Pods land on different nodes, RWX is mandatory. If you’re forced to use ReadWriteOnce (e.g. local-path-provisioner on bare metal), pin both Deployments to the same node with a nodeSelector or podAffinity; otherwise the second Pod will sit Pending forever.

Override the StorageClass per-PVC with values like postgres.storage.storageClass: ceph-rbd. Leaving it empty falls back to the cluster default StorageClass.
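As a sketch, the dotted value path maps onto nested YAML keys in an override file. The file name values-storage.yaml is arbitrary, and ceph-rbd is only the example class named above. Because a PVC's StorageClass generally cannot be changed after creation, an override like this is best supplied on the first install.

```shell
#!/usr/bin/env bash
# Write a small override file (the file name is arbitrary).
cat > values-storage.yaml <<'EOF'
postgres:
  storage:
    storageClass: ceph-rbd   # example class name; use one your cluster offers
    size: 8Gi                # chart default, repeated here for clarity
EOF

# Pass it at install time alongside the --set flags for secrets, e.g.:
#   helm install forge . -n forge -f values-storage.yaml ...
cat values-storage.yaml
```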

Prerequisites

Kubernetes 1.27+ (tested on 1.30)
An IngressController (Traefik / NGINX / Contour). forge-dev-cluster’s post-cluster-setup.sh installs Traefik via Helm
A default StorageClass (or per-PVC overrides) — local-path-provisioner is fine for dev
helm 3.12+
Optional but recommended: cert-manager (real TLS) or a hand-rolled kubernetes.io/tls Secret named forge-tls

Container images

Forge images are published to the public GitHub Container Registry: ghcr.io/forgeplatform/*. No pull secret is required, so creating the namespace is the only preparation step here:

kubectl create namespace forge

If you mirror the images to your own registry, override images.backend.repository + images.frontend.repository in values.yaml and add an imagePullSecrets entry pointing at your in-cluster docker-registry Secret.

TLS secret

Pre-create a TLS Secret if you want HTTPS on first boot. Self-signed is fine for dev:

openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
    -keyout tls.key -out tls.crt -subj '/CN=forge.lan' \
    -addext 'subjectAltName=DNS:forge.lan,DNS:*.forge.lan'

kubectl -n forge create secret tls forge-tls --cert=tls.crt --key=tls.key

Install Forge core

git clone git@github.com:forgeplatform/forge-helm.git
cd forge-helm

helm install forge . -n forge \
    --set forge.admin.user=admin \
    --set secrets.forgeAdminPassword='<strong-password>' \
    --set secrets.postgresPassword='<random-32-bytes>' \
    --set secrets.forgeSecretKey="$(openssl rand -hex 32)" \
    --set secrets.forgeBroadcastWebsocketSecret="$(openssl rand -hex 32)"

The install runs synchronously through the forge-init Job. Watch progress with:

kubectl -n forge get pods -w

Expected end state — every Pod Running 1/1, forge-init-1 Completed:

NAME                              READY   STATUS      RESTARTS   AGE
forge-frontend-7f8c5d9c8b-abcde   1/1     Running     0          3m
forge-init-1-xxxxx                0/1     Completed   0          3m
forge-opa-6cdfb9d79c-fghij        1/1     Running     0          3m
forge-otel-collector-...          1/1     Running     0          3m
forge-postgres-0                  1/1     Running     0          3m
forge-redis-...                   1/1     Running     0          3m
forge-task-...                    1/1     Running     0          3m
forge-web-...                     1/1     Running     0          3m

Optional: AI Assistant

Chart 0.3.0 introduces an optional forge-assistant deployment that wraps the all-in-one image (Ollama + ChromaDB + RAG API in one container). It is disabled by default — enable it on a fresh install or on an upgrade:

helm upgrade forge . -n forge \
    --reuse-values \
    --set assistant.enabled=true

The chart provisions a PersistentVolumeClaim (forge-assistant-data, default 20 Gi) so the LLM model cache and Chroma vector store survive pod restarts. On first boot the entrypoint pulls the configured model (gemma3:1b by default, ≈800 MB) and the embedding model (nomic-embed-text), then indexes the bundled documentation — allow ~5 minutes before the pod is Ready. The startupProbe is sized for this (failureThreshold: 30 at periodSeconds: 10) so the liveness probe will not kill the pod mid-bootstrap.
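The probe budget follows from simple arithmetic, sketched here with the two values quoted above. The sketch assumes no initialDelaySeconds is set on the probe; if one is, it adds to the total.

```shell
#!/usr/bin/env bash
# Startup budget = failureThreshold x periodSeconds.
failure_threshold=30
period_seconds=10
budget_seconds=$(( failure_threshold * period_seconds ))
echo "startup budget: ${budget_seconds}s ($(( budget_seconds / 60 )) min)"
# prints: startup budget: 300s (5 min)
```

If you switch to a larger model that takes longer to pull, raising failureThreshold is the usual way to widen this window.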

Key values you may want to override:

Value	Default	Notes
`assistant.model`	`gemma3:1b`	Switch to `llama3.1:8b` or `mistral:7b` for higher answer quality at the cost of memory and latency.
`assistant.storage.size`	`20Gi`	Bump to 30 Gi+ if you change models or grow the indexed corpus.
`assistant.resources.limits.memory`	`4Gi`	gemma3:1b inference peaks ~1.5 Gi; larger models need 8 Gi+.

The Service is reachable cluster-internally at forge-assistant.forge.svc.cluster.local:8100. Ingress routing is intentionally not wired up by this release — expose it with a follow-up middleware (Traefik stripPrefix on /assistant) once you decide on a URL contract.

Install the Operator

The operator is shipped separately because not everyone wants declarative-via-CRD management — some teams prefer the UI. Install only after Forge core is healthy (the operator needs to authenticate to it on startup).

Generate a Forge OAuth2 PAT

The operator authenticates with a personal access token issued by Forge. Generate one inside the running web Pod:

TOKEN=$(kubectl -n forge exec deploy/forge-web -- \
    forge-manage create_oauth2_token --user admin | tail -1)
echo "$TOKEN"

The token is shown only once, so copy it now. To revoke it later, run forge-manage revoke_oauth2_tokens --user admin.

Install with Helm

git clone git@github.com:forgeplatform/forge-operator.git
cd forge-operator

helm install forge-operator helm/ \
    -n forge-operator --create-namespace \
    --set forge.token="$TOKEN" \
    --set forge.url=http://forge-web.forge.svc.cluster.local:8013

Verify:

kubectl -n forge-operator logs deploy/forge-operator -f

You should see Starting workers for each of the nine reconcilers (organization, team, project, inventory, credential, jobtemplate, schedule, workflow, forgeinstance).

For full coverage of the v1.0.0 release — multi-cluster routing via ForgeInstance, the OLM bundle, and the declarative Workflow DAG model — see the dedicated Operator v1.0.0 guide.

Custom Resources

After the first reconcile, each CR records the primary key of its Forge resource in status.forgeId. The operator owns the resource via a finalizer, so deleting the CR deletes the Forge resource too. Manual edits made in the Forge UI do not persist: drift is detected on a 60 s requeue and reconciled back toward the CR.

Sample CRs live in forge-operator/config/samples/:

Kind	Maps to	Sensitive fields and notes
`Organization`	`/api/v2/organizations`	None — top-level tenant container with max-host quota
`Team`	`/api/v2/teams` + member sync at `/teams/{id}/users/`	None — `spec.users[]` resolves usernames at reconcile
`Project`	`/api/v2/projects`	SCM credential resolved by name from an existing `Credential` CR
`Inventory`	`/api/v2/inventories` + nested hosts & groups	None — pure spec
`Credential`	`/api/v2/credentials`	`spec.inputsFrom[]` reads sensitive values from k8s Secrets in the same namespace; the operator watches those Secrets and re-syncs on rotation
`JobTemplate`	`/api/v2/job_templates` with credential / project / inventory references resolved by name	None
`Schedule`	`/api/v2/schedules`	None — RFC 5545 RRULE drives recurrence
`Workflow`	`/api/v2/workflow_job_templates` + node DAG at `/workflow_job_templates/{id}/workflow_nodes/` + edges	Declarative `spec.nodes[]` keyed by `identifier` with `successNodes` / `failureNodes` / `alwaysNodes` graph references
`ForgeInstance`	Control-plane only — describes a Forge backend that other CRs target via `spec.forgeInstance`	Bearer token sourced from a k8s `Secret` via `spec.tokenSecretRef`

Verification

After both Helm releases land:

# Forge API healthy
kubectl -n forge port-forward svc/forge-web 8013:8013 &
sleep 2   # give the port-forward a moment to bind
curl -s http://localhost:8013/api/v2/ping/ | jq .

# Browser access (assuming /etc/hosts is set)
open http://forge.lan:30080/   # macOS; use xdg-open on Linux

# Operator wired up
kubectl get crd | grep forgeplatform.io
kubectl apply -f https://raw.githubusercontent.com/forgeplatform/forge-operator/main/config/samples/inventory-sample.yaml
kubectl get inventory production -o jsonpath='{.status}'

Day-2 Operations

Upgrade

git -C forge-helm pull
helm upgrade forge ./forge-helm -n forge --reuse-values

The chart re-runs forge-init on each upgrade (its name embeds {{ .Release.Revision }}). The init script is idempotent: against a populated DB with no pending migrations, the migrate step finishes almost immediately, and the heavier preload-data and execution-environment registration steps short-circuit when records already exist.
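To wait for the init Job of a specific revision, a helper along these lines could be used. It assumes the Job is named forge-init-<revision>, as the pod listing earlier on this page suggests; both function names are hypothetical, and the 10-minute timeout is an arbitrary choice.

```shell
#!/usr/bin/env bash
# Derive the init Job name for a given Helm release revision.
# Assumes the naming pattern forge-init-<revision>.
init_job_name() {
  printf 'forge-init-%s\n' "$1"
}

# Block until the init Job for that revision completes (needs cluster access).
wait_for_init() {
  local revision="$1"
  kubectl -n forge wait --for=condition=complete \
    "job/$(init_job_name "$revision")" --timeout=10m
}

init_job_name 2   # forge-init-2
# After an upgrade to revision 2 you might run: wait_for_init 2
# The current revision is listed by: helm history forge -n forge --max 1
```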

Backup

kubectl -n forge exec deploy/forge-web -- /usr/local/bin/backup.sh
kubectl -n forge cp forge-web-xxx:/var/backups/forge/forge-<ts>.sql.gz ./forge.sql.gz

Scale forge-task

Forge dispatches jobs to any forge-task replica via the Receptor mesh. Increase task.replicas in values.yaml for more concurrent jobs. Keep in mind:

forge-task runs privileged: true with podman — restrict it to a node pool you trust
The chart bumps the task memory limit to 4 Gi by default, because supervisord + dispatcher + Receptor + ansible-runner + an EE container together blow past 2 Gi during the first project sync. Lower this only if you’ve profiled your specific workload.

Customizing values.yaml

The shipped values.yaml is heavily commented. The most commonly tuned values:

Path	Default	Why you’d change it
`images.backend.repository`	`ghcr.io/forgeplatform/forge-backend`	Mirror to your own registry
`images.backend.tag`	`latest`	Pin to a CalVer release
`secrets.forgeAdminPassword`	`changeme-admin`	Always — it’s a placeholder
`ingress.host`	`forge.lan`	Your real DNS name
`ingress.tls.enabled`	`true`	Disable for plain-HTTP test clusters
`postgres.storage.size`	`8Gi`	Production sizing
`postgres.enabled`	`true`	Set `false` + override DB env to point at an external Postgres (RDS, Cloud SQL)
`otelCollector.enabled`	`true`	Disable if you ship an OTel Collector at the cluster level (DaemonSet)
`web.replicas` / `task.replicas`	`1` / `1`	HA / throughput
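Since the placeholder password must always be replaced, a pre-deploy guard can catch a forgotten override. The sketch below is one possible check, not part of the chart; check_values is a hypothetical name and the two demo files are throwaway examples.

```shell
#!/usr/bin/env bash
# Fail fast if a values file still contains the placeholder admin password.
check_values() {
  local file="$1"
  if grep -q 'changeme-admin' "$file"; then
    echo "refusing to deploy: $file still uses the placeholder admin password" >&2
    return 1
  fi
  echo "$file: no placeholder password found"
}

# Demo with two throwaway files.
tmp=$(mktemp -d)
printf 'secrets:\n  forgeAdminPassword: changeme-admin\n' > "$tmp/bad.yaml"
printf 'ingress:\n  host: forge.example.com\nweb:\n  replicas: 2\n' > "$tmp/good.yaml"

check_values "$tmp/bad.yaml"  || true   # rejected
check_values "$tmp/good.yaml"           # accepted
```

A check like this fits naturally in front of helm install or helm upgrade in a CI pipeline.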

Troubleshooting

Browser shows “Server Not Found” for `forge.local`

That’s the mDNS hijack described above. Use the chart default (forge.lan) and update /etc/hosts accordingly. If you’re locked into a .local hostname, edit /etc/nsswitch.conf on every developer laptop to put files before mdns_minimal — but this is system-wide and may break Avahi-discovered devices on the LAN.

Jobs fail with “unknown work type kubernetes-incluster-auth”

The Forge instance is registered as node_type=control instead of hybrid. Control nodes only orchestrate; they refuse to execute jobs locally and try to dispatch them to a ContainerGroup. The chart’s init.sh runs an explicit ORM update after provision_instance to fix this — if you wrote a custom init, replicate that step:

forge-manage shell -c "
from forge.main.models import Instance
i = Instance.objects.get(hostname='<node>')
i.node_type = 'hybrid'; i.save(update_fields=['node_type'])
"

The same script also sets is_container_group=False on the auto-created default InstanceGroup; the post-migrate signal sets it to True when Forge detects it’s running in k8s.

forge-task pod restarts with exit code 137

OOMKilled — the default memory limit was 2 Gi in earlier chart versions. Bump task.resources.limits.memory to 4Gi (the current default). Memory is tight because supervisord runs uwsgi + dispatcher + Receptor + ansible-runner + a podman EE container concurrently.
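The exit code itself can be decoded in the shell: codes above 128 mean the process was terminated by signal (code minus 128). The sketch below shows that 137 corresponds to SIGKILL. A SIGKILL alone does not prove an OOM kill, so the commented kubectl query (with a placeholder pod name) is the way to confirm the recorded reason.

```shell
#!/usr/bin/env bash
# Exit codes above 128 encode "terminated by signal (code - 128)".
exit_code=137
signal_number=$(( exit_code - 128 ))
signal_name=$(kill -l "$signal_number")
echo "exit ${exit_code} = signal ${signal_number} (SIG${signal_name})"
# prints: exit 137 = signal 9 (SIGKILL)

# To confirm the kill reason on a live pod (pod name is a placeholder):
#   kubectl -n forge get pod <forge-task-pod> \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
```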

Job error “Error updating status file /tmp/receptor/.../status.lock: no such file or directory”

The Receptor work-unit directory disappeared. This used to happen when forge-task was OOMKilled mid-job and the pod restart wiped /tmp. Mitigation:

Bump task memory (above)
The chart now mounts forge-receptor PVC at /tmp/receptor so work units survive Pod restarts

Pods on different nodes can’t reach each other (VirtualBox only)

See the CNI quirk note — Flannel must be patched to bind to the host-only interface (eth1), not the NAT-mode eth0. forge-dev-cluster bakes this into provisioning.

Companion Repositories

Repo	What it ships
`forge-helm`	The Helm chart described on this page
`forge-operator`	The Kubernetes operator with nine CRDs
`forge-dev-cluster`	4-VM Vagrant + VirtualBox cluster (2 master + 2 worker, k8s 1.30) for chart and operator integration testing — includes the Flannel `eth1` patch and a `post-cluster-setup.sh` that installs Traefik, local-path-provisioner, and the `forge` namespace prerequisites
`forge-devops`	The Docker Compose deployment described in the Docker Compose Deployment guide (same image set, single-host topology)