Architecture

Overview

Forge Assistant is a standalone microservice that provides AI-powered help for the Forge platform. It uses Retrieval-Augmented Generation (RAG) to answer questions based on indexed documentation.

                    ┌─────────────────────────────────┐
                    │         Forge Frontend           │
                    │  (React chat panel component)    │
                    └───────────────┬──────────────────┘
                                    │ SSE (Server-Sent Events)
                                    ▼
                    ┌─────────────────────────────────┐
                    │   Forge Assistant (all-in-one)   │
                    │                                  │
                    │  FastAPI  /api/v1/chat  → SSE    │
                    │           /api/v1/health→ JSON   │
                    │           /api/v1/index → reindex│
                    │                                  │
                    │  Ollama   gemma3:1b (LLM)        │
                    │           nomic-embed-text (emb)  │
                    │                                  │
                    │  ChromaDB (embedded vector store) │
                    └─────────────────────────────────┘

Note: All three components (FastAPI, Ollama, ChromaDB) run inside a single Docker container. The entrypoint script starts Ollama and ChromaDB as background processes before launching the FastAPI server.

Components

FastAPI Application (app/)

The core service, responsible for:

Key files: - app/main.py — FastAPI app, endpoints, CORS - app/rag.py — RAG pipeline (embed, retrieve, generate, stream) - app/indexer.py — Document loading, chunking, indexing - app/config.py — Pydantic settings from environment

Ollama (LLM Server)

Runs the language model locally inside the same container. Two models are used: - gemma3:1b (default, or configured model) — for chat generation - nomic-embed-text — for generating document/query embeddings

The Ollama binary is copied from the official ollama/ollama image at build time. GPU acceleration is optional but recommended for larger models.

ChromaDB (Vector Store)

Stores document chunks as vectors for similarity search. When a user asks a question: 1. The question is embedded using nomic-embed-text 2. ChromaDB finds the top-K most similar document chunks 3. These chunks become the context for the LLM

Data Flow

Chat Request

1. User types "How do I create a scheduled job?"
2. Frontend sends POST /api/v1/chat with message + page context
3. Assistant embeds the question via Ollama /api/embeddings
4. Assistant queries ChromaDB for top 5 relevant doc chunks
5. Assistant builds system prompt with doc context
6. Assistant streams response from Ollama /api/chat
7. Each token is sent to frontend as SSE data event
8. Frontend renders tokens in real-time

Document Indexing

1. Admin calls POST /api/v1/index (or runs indexer script)
2. Indexer reads all .md files from docs_to_index/
3. Each file is split into overlapping chunks (500 chars, 50 overlap)
4. Each chunk is embedded via Ollama nomic-embed-text
5. Embeddings + text stored in ChromaDB collection "forge_docs"

Design Decisions

Why Standalone Service (Not Django App)?

Why Ollama (Not OpenAI/Claude API)?

Why FastAPI (Not Flask/Django)?

Why ChromaDB (Not Pinecone/Weaviate)?