04 — Task Engine

The task engine manages the lifecycle of every job from launch to completion. This is the most important part of the system — if you don't understand this, you won't be able to diagnose job issues.


Job Lifecycle

Every job goes through a defined state machine:

  Launch
    │
    ▼
 pending ──► waiting ──► running ──► successful
                                  ├── failed
                                  ├── error
                                  └── canceled
Status Meaning Typical cause if stuck
pending In Celery queue, waiting for dispatcher Redis is down or dispatcher crashed
waiting Dispatcher has it, waiting for capacity No free capacity on nodes
running Ansible Runner is executing the playbook Playbook takes a long time (normal)
successful Playbook finished with exit code 0
failed Playbook finished with errors Error in playbook or unreachable host
error System error (not a playbook issue) Problem with container, network, or receptor
canceled User or system canceled it Manual action or timeout

What Happens Step by Step

Jobs can be launched from multiple sources: - Manual: User clicks "Launch" in the UI or calls POST /api/v2/job_templates/{id}/launch/ - Schedule: iCal recurrence rule triggers at the configured time - Webhook (legacy): Direct webhook on a JobTemplate (GitHub/GitLab push triggers a specific template) - EDA Rule: Event-Driven Automation rule fires after evaluating conditions on an incoming webhook (see docs/15-event-driven-automation.md) - Workflow: A workflow node launches the job as part of a DAG execution

  1. Launch: API creates a Job record in the database (status: pending), places a Celery task in Redis
  2. Dispatch: Dispatcher reads the task from Redis, checks capacity, selects a node
  3. Execution: Ansible Runner starts ansible-playbook with all parameters
  4. Events: Every play/task/host result generates an event → Callback Receiver → database
  5. Real-time: WSRelay sends events to WebSocket → browser updates UI
  6. Completion: Runner finishes → job status is updated → notifications are sent

Capacity — Why a Job Stays in "waiting"

Each node has a capacity calculated from CPU and memory:

CPU capacity  = CPU cores × 4 forks per core
RAM capacity  = RAM (MB) / 100 MB per fork
Effective     = whichever is smaller

Example: 4-core machine with 8GB RAM - CPU: 4 × 4 = 16 forks - RAM: 8192 / 100 = 81 forks - Effective: 16 forks - If 3 jobs are running with 5 forks each = 15 consumed, 1 remaining — next job waits

Watch out


Receptor Mesh

Receptor enables distributed job execution across multiple nodes.

Node Types

Type Serves API Runs jobs When to use
hybrid Yes Yes Single-node deployment (default)
control Yes No Dedicated API server
execution No Yes Dedicated job runner
hop No No Relay for air-gapped networks

Watch out


Event Processing

Why job events matter

Job output is NOT one large text file. Every line/task/host result is a separate JobEvent record in the database. This enables: - Streaming output in real-time - Filtering by host - Searching within the output - Efficient storage of large outputs

Partitioned table

The main_jobevent table is partitioned by job_id. Each job gets its own partition. Without this, a query for one job's events would scan millions of rows.

Watch out: - Partitions are created automatically when a job starts - cleanup_jobs --days=90 drops old partitions - WITHOUT CLEANUP, THE DATABASE GROWS WITHOUT LIMIT — this is the most common cause of full disk in production


Task Container Processes

The task container runs 4 processes through supervisord:

Process What it does If it goes down...
Dispatcher Receives tasks from Redis, starts jobs Jobs stay in pending
Callback Receiver Receives events from Ansible Runner Job output is empty
WSRelay Sends events to WebSocket UI doesn't update in real-time
Receptor Mesh networking for remote execution Remote jobs fail

Checking status

docker compose exec forge-task supervisorctl status
# All 4 processes must be RUNNING

Execution Environments

An Execution Environment (EE) is a container image that contains everything needed to run an Ansible playbook: Python, collections, modules, dependencies.

Why EE?

Without an EE, the playbook uses Python and collections installed on the host. With an EE, the playbook runs inside an isolated container with a precisely defined environment. This guarantees reproducibility — the same playbook produces the same result regardless of the host.

Watch out


Troubleshooting

Job stuck in "pending"

# 1. Is the dispatcher running?
docker compose exec forge-task supervisorctl status dispatcher

# 2. Is Redis running?
docker compose exec forge-task redis-cli -h redis ping

# 3. How many tasks are waiting in the queue?
docker compose exec redis redis-cli -n 0 LLEN celery

Job stuck in "waiting"

# Check node capacity
docker compose exec forge-web awx-manage list_instances
# If remaining = 0, there's no free capacity

Empty job output (no events)

# Check callback receiver
docker compose exec forge-task supervisorctl status callback_receiver
docker compose exec forge-task tail -f /var/log/supervisor/callback_receiver.log

WebSocket not working (UI not updating)

# Check wsrelay and daphne processes
docker compose exec forge-task supervisorctl status wsrelay
docker compose exec forge-web supervisorctl status daphne

# Verify both containers have the same FORGE_BROADCAST_WEBSOCKET_SECRET

Job finishes with "error" instead of "failed"

error means a system problem, not a playbook error. Check: - Can Ansible Runner access the project (SCM sync)? - Does the EE image exist? - Does the Receptor socket exist (/var/run/awx-receptor/receptor.sock)? - Logs: docker compose logs forge-task | grep ERROR