Tutorial

Build a ChatGPT for Your Team Docs: RAG System Tutorial

Step-by-step guide to building your own documentation chatbot. Works with OpenAI, Ollama, or OpenRouter. Complete code included - no AI experience needed!

AI Literacy Team
2025-03-05
15 min read

Want to build a smart assistant that answers questions about your team's documentation? This step-by-step tutorial shows you how - no AI experience needed.

What You'll Build

A production-ready RAG (Retrieval-Augmented Generation) system with:

  • πŸ” Smart document search using ChromaDB vector database
  • πŸ€– Multiple LLM options: Ollama (local/free), OpenAI, or OpenRouter
  • πŸ’¬ Beautiful chat interface using Gradio
  • 🐳 Docker deployment - just docker-compose up
  • πŸ“š Sample documentation to get started
  • πŸ”‘ Flexible: Works with or without API keys

Live Demo:

You: "How do I deploy to production?"

RAG System: "Based on your deployment documentation:

1. Run tests: `npm test`
2. Build production: `npm run build`
3. Deploy: `./scripts/deploy.sh production`
4. Monitor at: https://status.example.com

Source: docs/deployment-guide.md"

Prerequisites

  • Docker & Docker Compose installed
  • Choose one:
    • Option A: 8GB+ RAM for local Ollama (free, private)
    • Option B: OpenAI or OpenRouter API key (cloud, fast)
  • 30 minutes of your time
  • No AI experience required!

Architecture Overview

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Gradio    β”‚ ← User asks question
β”‚     UI      β”‚
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
       β”‚
       β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  RAG App    β”‚ ← Orchestrates everything
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
       β”‚
   β”Œβ”€β”€β”€β”΄β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
   β”‚        β”‚          β”‚          β”‚
   β–Ό        β–Ό          β–Ό          β–Ό
β”Œβ”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚Docs β”‚ β”‚Chromaβ”‚ β”‚ Ollama  β”‚ β”‚Embeddingsβ”‚
β”‚ DB  β”‚ β”‚Vectorβ”‚ β”‚  LLM    β”‚ β”‚  Model  β”‚
β””β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Project Structure

rag-system/
β”œβ”€β”€ docker-compose.yml
β”œβ”€β”€ Dockerfile
β”œβ”€β”€ requirements.txt
β”œβ”€β”€ app.py                 # Main RAG application
β”œβ”€β”€ ingest.py             # Document ingestion
β”œβ”€β”€ docs/                 # Your documentation
β”‚   β”œβ”€β”€ deployment.md
β”‚   β”œβ”€β”€ architecture.md
β”‚   └── troubleshooting.md
└── chroma_db/            # Vector database (auto-created)

Step 1: Create Project Structure

mkdir rag-system
cd rag-system
mkdir docs

Step 2: Create Sample Documentation

Create docs/deployment.md:

# Deployment Guide

## Production Deployment

To deploy to production:

1. Run tests: `npm test`
2. Build: `npm run build`
3. Deploy: `./scripts/deploy.sh production`
4. Monitor: https://status.example.com

## Staging Deployment

For staging environment:

1. Push to staging branch: `git push origin staging`
2. Auto-deploys via GitHub Actions
3. Check: https://staging.example.com

Create docs/troubleshooting.md:

# Troubleshooting Guide

## Service Won't Start

If the service fails to start:

1. Check logs: `docker logs app-name`
2. Verify environment variables in `.env`
3. Ensure port 3000 is not in use
4. Restart: `docker-compose restart`

## Database Connection Issues

Common database errors:

- "Connection refused" β†’ Check if DB container is running
- "Auth failed" β†’ Verify credentials in `.env`
- "Timeout" β†’ Check network settings

Step 3: Create Requirements File

Create requirements.txt:

langchain==0.1.0
langchain-community==0.0.20
langchain-openai==0.0.5
chromadb==0.4.22
ollama==0.1.6
gradio==4.19.0
sentence-transformers==2.3.1
openai==1.12.0

Step 4: Create Ingestion Script

Create ingest.py:

"""
Document Ingestion Script
Supports: Ollama (local) or OpenAI/OpenRouter (cloud)
"""
import os
from langchain_community.document_loaders import TextLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import OllamaEmbeddings
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

def ingest_documents():
    print("πŸ“š Loading documents from ./docs...")

    # Load all markdown files
    loader = DirectoryLoader('./docs', glob="**/*.md", loader_cls=TextLoader)
    documents = loader.load()
    print(f"βœ… Loaded {len(documents)} documents")

    # Split into chunks
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    chunks = text_splitter.split_documents(documents)
    print(f"βœ… Created {len(chunks)} chunks")

    # Choose embeddings based on provider
    LLM_PROVIDER = os.getenv("LLM_PROVIDER", "ollama").lower()

    if LLM_PROVIDER == "ollama":
        embeddings = OllamaEmbeddings(
            model="nomic-embed-text",
            base_url=os.getenv("OLLAMA_BASE_URL", "http://ollama:11434")
        )
    else:
        # Use OpenAI for embeddings (works for OpenAI & OpenRouter)
        embeddings = OpenAIEmbeddings(
            model="text-embedding-3-small",
            openai_api_key=os.getenv("OPENAI_API_KEY")
        )

    # Store in ChromaDB
    print("πŸ’Ύ Storing in vector database...")
    vectorstore = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory="./chroma_db",
        collection_name="docs"
    )

    print(f"✨ Ingestion complete! {len(chunks)} chunks stored")
    return vectorstore

if __name__ == "__main__":
    ingest_documents()

Step 5: Create Main RAG Application

Create app.py:

"""
RAG System with Multi-Provider Support
Supports: Ollama (local), OpenAI, or OpenRouter
"""
import os
from langchain_community.embeddings import OllamaEmbeddings
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain_community.llms import Ollama
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
import gradio as gr

print("πŸš€ Initializing RAG system...")

# Get provider from environment
LLM_PROVIDER = os.getenv("LLM_PROVIDER", "ollama").lower()

# Initialize embeddings
if LLM_PROVIDER == "ollama":
    embeddings = OllamaEmbeddings(model="nomic-embed-text", base_url=os.getenv("OLLAMA_BASE_URL", "http://ollama:11434"))
else:
    embeddings = OpenAIEmbeddings(
        model="text-embedding-3-small",
        openai_api_key=os.getenv("OPENAI_API_KEY")
    )

# Load vector database
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings,
    collection_name="docs"
)

retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 3})

# Initialize LLM based on provider
if LLM_PROVIDER == "ollama":
    llm = Ollama(model="llama3.2", base_url="http://ollama:11434", temperature=0.3)
    model_name = "llama3.2 (Ollama)"
elif LLM_PROVIDER == "openai":
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.3, openai_api_key=os.getenv("OPENAI_API_KEY"))
    model_name = "gpt-4o-mini (OpenAI)"
elif LLM_PROVIDER == "openrouter":
    llm = ChatOpenAI(
        model="anthropic/claude-3.5-sonnet",
        temperature=0.3,
        openai_api_key=os.getenv("OPENROUTER_API_KEY"),
        openai_api_base="https://openrouter.ai/api/v1"
    )
    model_name = "claude-3.5-sonnet (OpenRouter)"

# Create RAG chain
prompt = PromptTemplate(
    template="""Answer based on the documentation. Cite sources.

Context: {context}
Question: {question}

Answer:""",
    input_variables=["context", "question"]
)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt}
)

print(f"βœ… RAG ready! Using {model_name}")

def chat(message, history):
    try:
        result = qa_chain.invoke({"query": message})
        answer = result.get('result', str(result))
        sources = result.get('source_documents', [])

        if sources:
            answer += "\n\n**Sources:**\n" + "\n".join([f"{i+1}. {doc.metadata.get('source')}" for i, doc in enumerate(sources[:3])])

        return answer
    except Exception as e:
        return f"❌ Error: {str(e)}"

demo = gr.ChatInterface(
    chat,
    title=f"πŸ€– Documentation Assistant",
    description=f"Powered by RAG with {model_name}",
    examples=["How do I deploy to production?", "How do I fix database errors?"],
    theme=gr.themes.Soft()
)

if __name__ == "__main__":
    demo.launch(server_name="0.0.0.0", server_port=7860)

Step 6: Create Dockerfile

Create Dockerfile:

FROM python:3.11-slim

WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application files
COPY . .

# Expose Gradio port
EXPOSE 7860

# Default command (can be overridden)
CMD ["python", "app.py"]

Step 7: Create Docker Compose & Environment

Create docker-compose.yml:

version: '3.8'

services:
  # Ollama (optional - only for local LLM)
  ollama:
    image: ollama/ollama:latest
    container_name: rag-ollama
    volumes:
      - ollama_data:/root/.ollama
    ports:
      - "11434:11434"
    networks:
      - rag-network
    profiles:
      - ollama  # Only start with --profile ollama

  # RAG Application
  rag-app:
    build: .
    container_name: rag-app
    volumes:
      - ./docs:/app/docs
      - ./chroma_db:/app/chroma_db
    ports:
      - "7860:7860"
    networks:
      - rag-network
    environment:
      # Choose provider: ollama, openai, or openrouter
      - LLM_PROVIDER=${LLM_PROVIDER:-ollama}
      - OLLAMA_BASE_URL=${OLLAMA_BASE_URL:-http://ollama:11434}
      - OPENAI_API_KEY=${OPENAI_API_KEY:-}
      - OPENROUTER_API_KEY=${OPENROUTER_API_KEY:-}

networks:
  rag-network:
    driver: bridge

volumes:
  ollama_data:

Create .env file:

# Choose: ollama, openai, or openrouter
LLM_PROVIDER=openai

# OpenAI key (required for openai or openrouter)
OPENAI_API_KEY=your-key-here

# OpenRouter key (only if using OpenRouter for LLM)
# OPENROUTER_API_KEY=your-key-here

Step 8: Choose Your Provider & Run

Option 1: Using OpenAI (Recommended for laptops)

# 1. Set your API key in .env
echo "LLM_PROVIDER=openai" > .env
echo "OPENAI_API_KEY=your-key-here" >> .env

# 2. Build and ingest
docker-compose build rag-app
docker-compose run --rm rag-app python ingest.py

# 3. Start
docker-compose up

Option 2: Using Ollama (Free, requires 8GB+ RAM)

# 1. Start Ollama
docker-compose --profile ollama up -d ollama
sleep 10

# 2. Pull models
docker exec rag-ollama ollama pull llama3.2
docker exec rag-ollama ollama pull nomic-embed-text

# 3. Build and ingest
docker-compose build rag-app
docker-compose run --rm rag-app python ingest.py

# 4. Start
docker-compose up

Option 3: Using OpenRouter (Access to Claude, Gemini, etc.)

# 1. Set keys in .env (needs both!)
echo "LLM_PROVIDER=openrouter" > .env
echo "OPENAI_API_KEY=sk-..." >> .env  # For embeddings
echo "OPENROUTER_API_KEY=sk-or-..." >> .env  # For LLM

# 2. Build and ingest
docker-compose build rag-app
docker-compose run --rm rag-app python ingest.py

# 3. Start
docker-compose up

Access at: http://localhost:7860
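
You can also query the assistant programmatically instead of through the browser. Here's a minimal sketch using gradio_client (not in requirements.txt, so pip install gradio-client first). The "/chat" endpoint name is what gr.ChatInterface normally exposes; confirm it via the "Use via API" link at the bottom of the Gradio page.

# query the running Gradio app from a script (a sketch, assuming the app is up on port 7860)
from gradio_client import Client

client = Client("http://localhost:7860")
answer = client.predict("How do I deploy to production?", api_name="/chat")
print(answer)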

Testing Your RAG System

Try these questions in the Gradio interface:

  1. "How do I deploy to production?"

    • Should return deployment steps from deployment.md
  2. "What should I do if the service won't start?"

    • Should return troubleshooting steps
  3. "How do I fix database errors?"

    • Should cite troubleshooting.md

How It Works

1. Document Ingestion

docs β†’ split into chunks β†’ create embeddings β†’ store in ChromaDB (see the sanity check below)
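
To confirm ingestion worked, a quick sanity check (a sketch, assuming ingest.py has already created ./chroma_db) is to count the stored chunks with the chromadb client, e.g. inside the container via docker-compose run --rm rag-app python:

# count the chunks that ingest.py stored in the "docs" collection
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_collection("docs")
print(f"Stored chunks: {collection.count()}")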

2. Question Answering

question β†’ create embedding β†’ search similar chunks β†’
add to LLM context β†’ generate answer
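
Here is that loop done by hand, as a minimal sketch assuming the Ollama setup from app.py (the ask_docs helper is just for illustration). It shows exactly what the RetrievalQA chain automates: retrieve similar chunks, stuff them into the prompt, generate.

# manual question answering: retrieve β†’ build prompt β†’ generate
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.llms import Ollama
from langchain_community.vectorstores import Chroma

embeddings = OllamaEmbeddings(model="nomic-embed-text", base_url="http://ollama:11434")
db = Chroma(persist_directory="./chroma_db", collection_name="docs", embedding_function=embeddings)
llm = Ollama(model="llama3.2", base_url="http://ollama:11434")

def ask_docs(question: str) -> str:
    # 1. embed the question and fetch the 3 most similar chunks
    chunks = db.similarity_search(question, k=3)
    # 2. stuff the chunks into the prompt as context
    context = "\n\n".join(c.page_content for c in chunks)
    prompt = f"Answer based on the documentation.\n\nContext: {context}\nQuestion: {question}\n\nAnswer:"
    # 3. let the LLM generate the grounded answer
    return llm.invoke(prompt)

print(ask_docs("How do I deploy to production?"))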

3. Key Components

  β€’ ChromaDB: Stores document embeddings for fast similarity search
  β€’ Ollama: Runs the LLM locally (no API keys!)
  β€’ LangChain: Orchestrates the RAG pipeline
  β€’ Gradio: Provides the chat interface

LLM Provider Comparison

| Provider   | Cost            | Speed  | Privacy          | Best For                  |
|------------|-----------------|--------|------------------|---------------------------|
| Ollama     | Free            | Medium | 100% private     | Desktops, privacy-focused |
| OpenAI     | $0.15/1M tokens | Fast   | Sent to OpenAI   | Most use cases            |
| OpenRouter | Varies          | Fast   | Sent to provider | Multi-model access        |

When to use each:

  • Ollama: Desktop with 8GB+ RAM, need privacy, no ongoing costs
  • OpenAI: Laptops, fastest setup, best performance/cost
  • OpenRouter: Want Claude, Gemini, or other models

Customization Tips

Use Different Models

Pull a different Ollama model:

# For faster responses (smaller model)
docker exec rag-ollama ollama pull llama3.2:1b

# For better quality (larger model)
docker exec rag-ollama ollama pull llama3.1:8b

Update app.py:

llm = Ollama(
    model="llama3.1:8b",  # Change model here
    base_url="http://ollama:11434"
)

Add More Documents

Just drop markdown files in ./docs:

cp ~/my-team-docs/*.md ./docs/
docker-compose run --rm rag-app python ingest.py

Adjust Retrieval Settings

In app.py:

retriever = vectorstore.as_retriever(
    search_kwargs={"k": 5}  # Return more chunks
)

Change Chunk Size

In ingest.py:

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,   # Smaller chunks = more precise
    chunk_overlap=100
)

Production Deployment

Add Authentication

demo.launch(
    server_name="0.0.0.0",
    auth=("username", "password")  # Basic auth
)

Add Monitoring

import time

def chat(message, history):
    start = time.time()
    result = qa_chain.invoke({"query": message})
    duration = time.time() - start

    print(f"Query: {message}")
    print(f"Duration: {duration:.2f}s")

    return result['result']

Persist Data

Volumes in docker-compose.yml already persist:

  • ollama_data: Downloaded models
  • ./chroma_db: Vector database

Troubleshooting

"ModuleNotFoundError"

# Rebuild with updated requirements
docker-compose build --no-cache

"Ollama connection refused"

# Check Ollama is running
docker ps | grep ollama

# Check logs
docker logs rag-ollama

"No documents found"

# Re-run ingestion
docker-compose run --rm rag-app python ingest.py

Slow responses

# Use smaller model
docker exec rag-ollama ollama pull llama3.2:1b

# Or lower the temperature in app.py for more focused, deterministic answers
llm = Ollama(model="llama3.2", temperature=0.1)

What You Learned

βœ… How RAG systems work (Retrieval + Augmentation + Generation)
βœ… Document embedding and vector search
βœ… Running local LLMs with Ollama
βœ… Building chat interfaces with Gradio
βœ… Containerizing AI applications with Docker

Next Steps

  1. Add conversation memory - Remember chat history (see the sketch after this list)
  2. Multi-modal RAG - Add images, PDFs, videos
  3. Hybrid search - Combine semantic + keyword search
  4. Agent-based retrieval - Let AI decide when to search
  5. Deploy to production - Add auth, monitoring, scaling
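
As a starting point for item 1, here is a minimal sketch of conversation memory using LangChain's ConversationalRetrievalChain, reusing the llm and retriever already built in app.py (the names memory_chain and chat_with_memory are just for illustration).

# conversation memory sketch: the chain rewrites follow-up questions using the chat history
from langchain.chains import ConversationalRetrievalChain

memory_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=retriever,
    return_source_documents=True,
)

chat_history = []  # list of (question, answer) tuples

def chat_with_memory(message, history):
    result = memory_chain.invoke({"question": message, "chat_history": chat_history})
    chat_history.append((message, result["answer"]))
    return result["answer"]

Pass chat_with_memory to gr.ChatInterface in place of chat to get follow-up questions like "and for staging?" answered in context.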

Complete Code Repository

Get the full working code: πŸ‘‰ Download Complete RAG System on GitHub

Clone and run:

git clone https://github.com/alwaysnix/ai-literacy.git
cd ai-literacy/rag-system
docker-compose build
docker-compose run --rm rag-app python ingest.py
docker-compose up

Open http://localhost:7860 and start asking questions!

Want to master advanced RAG techniques? Our training covers:

  • Multi-modal RAG (images, videos, code)
  • Production deployment at scale
  • Cost optimization strategies
  • Evaluation and monitoring

Join the Next Cohort β†’


Quick Reference

# Setup (build image and ingest docs)
docker-compose build
docker-compose run --rm rag-app python ingest.py

# Start
docker-compose up

# Stop
docker-compose down

# Re-ingest docs
docker-compose run --rm rag-app python ingest.py

# View logs
docker-compose logs -f

# Reset everything
docker-compose down -v
rm -rf chroma_db

Questions? The RAG system can answer questions about itself! Just ask it "How does this RAG system work?" πŸ˜‰

Ready to Become AI-Literate?

Join our 2-week hands-on training and go from curious to confident with AI.