Architecture

Overview

Forge Assistant is a standalone microservice that provides AI-powered help for the Forge platform. It uses Retrieval-Augmented Generation (RAG) to answer questions based on indexed documentation.

                    ┌─────────────────────────────────┐
                    │         Forge Frontend           │
                    │  (React chat panel component)    │
                    └───────────────┬──────────────────┘
                                    │ SSE (Server-Sent Events)
                                    ▼
                    ┌─────────────────────────────────┐
                    │   Forge Assistant (all-in-one)   │
                    │                                  │
                    │  FastAPI  /api/v1/chat  → SSE    │
                    │           /api/v1/health→ JSON   │
                    │           /api/v1/index → reindex│
                    │                                  │
                    │  Ollama   gemma3:1b (LLM)        │
                    │           nomic-embed-text (emb)  │
                    │                                  │
                    │  ChromaDB (embedded vector store) │
                    └─────────────────────────────────┘

Note: All three components (FastAPI, Ollama, ChromaDB) run inside a single Docker container. The entrypoint script starts Ollama and ChromaDB as background processes before launching the FastAPI server.

Components

FastAPI Application (`app/`)

The core service, responsible for:

Receiving chat requests with optional page context and history
Querying ChromaDB for relevant documentation chunks
Building a context-enriched prompt for the LLM
Streaming the response token-by-token via SSE

Key files: - app/main.py — FastAPI app, endpoints, CORS - app/rag.py — RAG pipeline (embed, retrieve, generate, stream) - app/indexer.py — Document loading, chunking, indexing - app/config.py — Pydantic settings from environment

Ollama (LLM Server)

Runs the language model locally inside the same container. Two models are used: - gemma3:1b (default, or configured model) — for chat generation - nomic-embed-text — for generating document/query embeddings

The Ollama binary is copied from the official ollama/ollama image at build time. GPU acceleration is optional but recommended for larger models.

ChromaDB (Vector Store)

Stores document chunks as vectors for similarity search. When a user asks a question: 1. The question is embedded using nomic-embed-text 2. ChromaDB finds the top-K most similar document chunks 3. These chunks become the context for the LLM

Data Flow

Chat Request

1. User types "How do I create a scheduled job?"
2. Frontend sends POST /api/v1/chat with message + page context
3. Assistant embeds the question via Ollama /api/embeddings
4. Assistant queries ChromaDB for top 5 relevant doc chunks
5. Assistant builds system prompt with doc context
6. Assistant streams response from Ollama /api/chat
7. Each token is sent to frontend as SSE data event
8. Frontend renders tokens in real-time

Document Indexing

1. Admin calls POST /api/v1/index (or runs indexer script)
2. Indexer reads all .md files from docs_to_index/
3. Each file is split into overlapping chunks (500 chars, 50 overlap)
4. Each chunk is embedded via Ollama nomic-embed-text
5. Embeddings + text stored in ChromaDB collection "forge_docs"

Design Decisions

Why Standalone Service (Not Django App)?

No Django dependency — FastAPI is lighter, async-native, no ORM needed
Optional — can be added/removed without touching core Forge
Independent scaling — can run on a separate GPU server
Independent release cycle — update models without redeploying Forge

Why Ollama (Not OpenAI/Claude API)?

Privacy — all data stays on your infrastructure
Cost — no API fees, no usage limits
Offline — works without internet
Control — choose and swap models freely

Why FastAPI (Not Flask/Django)?

Native async — SSE streaming without workarounds
Fast — ASGI, built on Starlette
Auto docs — OpenAPI schema generated automatically
Minimal — no ORM, no admin, no migrations needed

Why ChromaDB (Not Pinecone/Weaviate)?

Local — no cloud dependency
Python-native — simple API, no external clients
Lightweight — single container, ~500MB
Good enough — for <100K document chunks, ChromaDB performs well