04 — Task Engine
The task engine manages the lifecycle of every job from launch to completion. This is the most important part of the system — if you don't understand this, you won't be able to diagnose job issues.
Job Lifecycle
Every job goes through a defined state machine:
Launch
│
▼
pending ──► waiting ──► running ──► successful
├── failed
├── error
└── canceled
| Status | Meaning | Typical cause if stuck |
|---|---|---|
pending |
In Celery queue, waiting for dispatcher | Redis is down or dispatcher crashed |
waiting |
Dispatcher has it, waiting for capacity | No free capacity on nodes |
running |
Ansible Runner is executing the playbook | Playbook takes a long time (normal) |
successful |
Playbook finished with exit code 0 | — |
failed |
Playbook finished with errors | Error in playbook or unreachable host |
error |
System error (not a playbook issue) | Problem with container, network, or receptor |
canceled |
User or system canceled it | Manual action or timeout |
What Happens Step by Step
Jobs can be launched from multiple sources:
- Manual: User clicks "Launch" in the UI or calls POST /api/v2/job_templates/{id}/launch/
- Schedule: iCal recurrence rule triggers at the configured time
- Webhook (legacy): Direct webhook on a JobTemplate (GitHub/GitLab push triggers a specific template)
- EDA Rule: Event-Driven Automation rule fires after evaluating conditions on an incoming webhook (see docs/15-event-driven-automation.md)
- Workflow: A workflow node launches the job as part of a DAG execution
- Launch: API creates a
Jobrecord in the database (status:pending), places a Celery task in Redis - Dispatch: Dispatcher reads the task from Redis, checks capacity, selects a node
- Execution: Ansible Runner starts
ansible-playbookwith all parameters - Events: Every play/task/host result generates an event → Callback Receiver → database
- Real-time: WSRelay sends events to WebSocket → browser updates UI
- Completion: Runner finishes → job status is updated → notifications are sent
Capacity — Why a Job Stays in "waiting"
Each node has a capacity calculated from CPU and memory:
CPU capacity = CPU cores × 4 forks per core
RAM capacity = RAM (MB) / 100 MB per fork
Effective = whichever is smaller
Example: 4-core machine with 8GB RAM - CPU: 4 × 4 = 16 forks - RAM: 8192 / 100 = 81 forks - Effective: 16 forks - If 3 jobs are running with 5 forks each = 15 consumed, 1 remaining — next job waits
Watch out
-
Forks: Each job uses as many forks as defined on the template (default 5). Reducing forks on the template allows more concurrent jobs.
-
list_instances: Useforge-manage list_instancesto see capacity. Ifremaining= 0, jobs are waiting. -
Instance Groups: Organizations are mapped to instance groups. If an organization has only one group with one node, and that node is full — all jobs for that organization wait.
Receptor Mesh
Receptor enables distributed job execution across multiple nodes.
Node Types
| Type | Serves API | Runs jobs | When to use |
|---|---|---|---|
hybrid |
Yes | Yes | Single-node deployment (default) |
control |
Yes | No | Dedicated API server |
execution |
No | Yes | Dedicated job runner |
hop |
No | No | Relay for air-gapped networks |
Watch out
- In a single-node deployment (default), Receptor runs locally and doesn't need configuration.
- Port 2222 (TCP) is used for inter-node communication. Must be open between nodes.
- The Receptor control socket is at
/var/run/awx-receptor/receptor.sock— must be shared between the web and task containers.
Event Processing
Why job events matter
Job output is NOT one large text file. Every line/task/host result is a separate
JobEvent record in the database. This enables:
- Streaming output in real-time
- Filtering by host
- Searching within the output
- Efficient storage of large outputs
Partitioned table
The main_jobevent table is partitioned by job_id. Each job gets its own partition.
Without this, a query for one job's events would scan millions of rows.
Watch out:
- Partitions are created automatically when a job starts
- cleanup_jobs --days=90 drops old partitions
- WITHOUT CLEANUP, THE DATABASE GROWS WITHOUT LIMIT — this is the most common cause of full disk in production
Task Container Processes
The task container runs 4 processes through supervisord:
| Process | What it does | If it goes down... |
|---|---|---|
| Dispatcher | Receives tasks from Redis, starts jobs | Jobs stay in pending |
| Callback Receiver | Receives events from Ansible Runner | Job output is empty |
| WSRelay | Sends events to WebSocket | UI doesn't update in real-time |
| Receptor | Mesh networking for remote execution | Remote jobs fail |
Checking status
docker compose exec forge-task supervisorctl status
# All 4 processes must be RUNNING
Execution Environments
An Execution Environment (EE) is a container image that contains everything needed to run an Ansible playbook: Python, collections, modules, dependencies.
Why EE?
Without an EE, the playbook uses Python and collections installed on the host. With an EE, the playbook runs inside an isolated container with a precisely defined environment. This guarantees reproducibility — the same playbook produces the same result regardless of the host.
Watch out
- Every job template can reference a specific EE
- The
pulloption controls when the image is pulled:always,missing,never - If the EE image doesn't exist locally and
pull=never, the job will fail
Troubleshooting
Job stuck in "pending"
# 1. Is the dispatcher running?
docker compose exec forge-task supervisorctl status dispatcher
# 2. Is Redis running?
docker compose exec forge-task redis-cli -h redis ping
# 3. How many tasks are waiting in the queue?
docker compose exec redis redis-cli -n 0 LLEN celery
Job stuck in "waiting"
# Check node capacity
docker compose exec forge-web awx-manage list_instances
# If remaining = 0, there's no free capacity
Empty job output (no events)
# Check callback receiver
docker compose exec forge-task supervisorctl status callback_receiver
docker compose exec forge-task tail -f /var/log/supervisor/callback_receiver.log
WebSocket not working (UI not updating)
# Check wsrelay and daphne processes
docker compose exec forge-task supervisorctl status wsrelay
docker compose exec forge-web supervisorctl status daphne
# Verify both containers have the same FORGE_BROADCAST_WEBSOCKET_SECRET
Job finishes with "error" instead of "failed"
error means a system problem, not a playbook error. Check:
- Can Ansible Runner access the project (SCM sync)?
- Does the EE image exist?
- Does the Receptor socket exist (/var/run/awx-receptor/receptor.sock)?
- Logs: docker compose logs forge-task | grep ERROR