Build a Personal RAG Knowledge Base for Your AI Assistant

Your AI assistant is brilliant but amnesiac. Every new session starts from zero. It doesn't know your meeting notes from last Tuesday, the architecture decision you documented three months ago, or the research you compiled last year.
Retrieval-Augmented Generation (RAG) solves this. Instead of stuffing everything into the context window, RAG builds a searchable knowledge base from your documents. When the assistant needs information, it retrieves the relevant chunks and uses them to answer, giving your AI a form of long-term memory that scales to millions of documents.
This tutorial walks through building a personal RAG system, from document ingestion to query-time retrieval, and integrating it with your AI assistant workflow.
What RAG Is and Why It Matters
RAG combines a vector database for retrieval with a language model for generation, getting the best of both
RAG stands for Retrieval-Augmented Generation. The core idea:
- Index: convert your documents to vector embeddings and store them in a searchable database
- Retrieve: when a question comes in, find the most semantically similar chunks from your knowledge base
- Generate: feed those relevant chunks to the LLM as context, along with the question
- Answer: the LLM generates an answer grounded in your actual documents
The critical advantage over pure context-stuffing: your knowledge base can be enormous (millions of words) without impacting response speed. The retrieval step finds only the relevant slices and feeds them to the model, keeping context usage small and focused.
This is how enterprise systems like Notion AI, GitHub Copilot's codebase indexing, and most production AI assistants work at scale.
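The retrieve step boils down to ranking stored chunks by similarity to the query embedding. Here is a toy sketch using plain cosine similarity; the 3-dimensional "embeddings" are made-up numbers for illustration (real systems use a learned embedding model with hundreds of dimensions and a vector database):

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Hypothetical embeddings of three stored chunks
chunks = {
    "meeting notes: chose Postgres": [0.9, 0.1, 0.2],
    "recipe: banana bread":          [0.1, 0.8, 0.3],
    "architecture decision record":  [0.6, 0.4, 0.5],
}
query = [0.85, 0.15, 0.3]  # hypothetical embedding of "what database did we pick?"

# Retrieve: rank chunks by similarity to the query and keep the top k
top = sorted(chunks, key=lambda c: cosine(query, chunks[c]), reverse=True)[:2]
print(top)  # the two database-related chunks outrank the recipe
```

Only those top chunks are passed to the model, which is why the knowledge base can grow without growing the prompt.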
What You'll Need
Before starting, gather the following:
- Python 3.10+ with pip
- OpenAI API key (for text-embedding-3-small, the most cost-effective embedding model)
- Or Anthropic API key with a local embedding alternative
- Your documents β PDFs, markdown files, text files, Notion exports
- A vector database β ChromaDB (local, free) or Pinecone (managed, free tier)
For this tutorial, we'll use ChromaDB locally (no API key needed, runs on your server) and OpenAI embeddings (costs fractions of a cent per megabyte of text).
Step 1: Set Up the Embedding Pipeline
The embedding pipeline converts raw documents into searchable vector representations
First, install the required packages:
pip install chromadb langchain langchain-community langchain-openai pypdf python-dotenv
Create a project directory and set up your environment:
mkdir ~/my-rag && cd ~/my-rag
echo "OPENAI_API_KEY=sk-your-key-here" > .env
Now create the ingestion script (ingest.py):
from dotenv import load_dotenv
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

load_dotenv()

# Load documents from your knowledge directory. DirectoryLoader's glob
# does not expand brace patterns, so load each file type separately.
loaders = [
    DirectoryLoader("./knowledge", glob="**/*.txt", loader_cls=TextLoader, show_progress=True),
    DirectoryLoader("./knowledge", glob="**/*.md", loader_cls=TextLoader, show_progress=True),
    DirectoryLoader("./knowledge", glob="**/*.pdf", loader_cls=PyPDFLoader, show_progress=True),
]
documents = [doc for loader in loaders for doc in loader.load()]

# Split into overlapping chunks, preferring paragraph and sentence boundaries
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " "]
)
chunks = splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks from {len(documents)} documents")

# Create embeddings and store in ChromaDB; with a persist_directory set,
# the collection is written to disk automatically
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
db = Chroma.from_documents(
    chunks,
    embeddings,
    persist_directory="./chroma_db"
)
print("Knowledge base created successfully")
Drop your documents into a ./knowledge/ folder and run python ingest.py. Your knowledge base is built.
Step 2: Build the Query Interface
The query interface accepts natural language questions and returns answers grounded in your documents
Now build the query layer (query.py):
import sys
from dotenv import load_dotenv
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

load_dotenv()

# Load the existing knowledge base
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
db = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings
)

# Create retrieval chain
retriever = db.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5}  # return the top 5 most relevant chunks
)

# Custom prompt that grounds answers in retrieved context
prompt = PromptTemplate(
    template="""Use the following context from your knowledge base to answer the question.
If the answer is not in the context, say "I don't have that in my knowledge base."

Context: {context}

Question: {question}

Answer:""",
    input_variables=["context", "question"]
)

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt},
    return_source_documents=True
)

# Query
question = sys.argv[1] if len(sys.argv) > 1 else "What is in my knowledge base?"
result = qa_chain.invoke({"query": question})

print("\nAnswer:", result["result"])
print("\nSources:")
for doc in result["source_documents"]:
    print(f"  - {doc.metadata.get('source', 'unknown')}")
Test it:
python query.py "What did we decide about the database architecture?"
Step 3: Optimize Your Chunking Strategy
Chunking strategy dramatically affects retrieval quality: context-aware splitting beats naive fixed-size splitting
The biggest factor in RAG quality after the initial setup is your chunking strategy. Naive fixed-size chunks break documents in the middle of ideas, reducing retrieval quality.
Best practices for chunking:
- Use RecursiveCharacterTextSplitter: it tries to split on paragraph breaks first, then sentences, then words, keeping semantically coherent units together
- Chunk size of 500-1500 tokens: too small misses context, too large dilutes relevance
- 20% overlap: prevents losing information at chunk boundaries
- Include metadata: store the filename, section heading, and date in chunk metadata so you know where retrieved content came from
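The overlap rule is easiest to see with a toy character-level splitter. This is only an illustration of the sliding-window idea; RecursiveCharacterTextSplitter does the same thing more carefully, on paragraph and sentence boundaries:

```python
def chunk_with_overlap(text, size, overlap):
    # Slide a window of `size` characters, stepping size - overlap each time,
    # so every boundary region appears in two consecutive chunks
    # (the final chunk may be shorter than `size`)
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

chunks = chunk_with_overlap("abcdefghij", size=4, overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Because "cd", "ef", and "gh" each appear in two chunks, an idea that straddles a boundary is still retrievable from at least one chunk.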
For structured documents (meeting notes, project docs), add a header to each chunk:
# Add document title as context to each chunk
for chunk in chunks:
    title = chunk.metadata.get("source", "").split("/")[-1]
    chunk.page_content = f"Document: {title}\n\n{chunk.page_content}"
This helps the retriever distinguish between chunks from different sources on similar topics.
Step 4: Connect to OpenClaw
Integrating RAG with OpenClaw gives your AI assistant grounded, document-backed answers
To make your RAG knowledge base available to OpenClaw, expose it as a simple HTTP API:
from flask import Flask, request, jsonify

# Build qa_chain exactly as in query.py (Step 2) before this point,
# so the server reuses the same retrieval chain
app = Flask(__name__)

@app.route("/query", methods=["POST"])
def query_kb():
    question = request.json.get("question")
    result = qa_chain.invoke({"query": question})
    return jsonify({
        "answer": result["result"],
        "sources": [d.metadata.get("source") for d in result["source_documents"]]
    })

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8765)
Start the server: python server.py
Now OpenClaw can query your knowledge base via web_fetch or a custom tool call:
"Query my knowledge base at localhost:8765 for everything related to our Q1 planning decisions."
For a persistent setup, run the Flask server as a systemd service so it starts on boot. This gives you always-on RAG access from any OpenClaw session.
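A minimal unit file sketch for that setup follows; the paths, the user, and the service name rag-kb are placeholders you should adapt to your system:

```ini
[Unit]
Description=Personal RAG knowledge base API
After=network.target

[Service]
WorkingDirectory=/home/you/my-rag
ExecStart=/usr/bin/python3 /home/you/my-rag/server.py
Restart=on-failure
User=you

[Install]
WantedBy=multi-user.target
```

Save it as /etc/systemd/system/rag-kb.service, then run systemctl enable --now rag-kb to start it and have it come back after reboots.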
For more on integrating external APIs with OpenClaw, see the Model Context Protocol (MCP) tutorial and the OpenClaw automation guide.
Keeping Your Knowledge Base Updated
Documents change. Meeting notes get added. Research evolves. Your RAG system needs to stay current.
Options for keeping it fresh:
- Scheduled re-ingestion: a cron job that re-runs ingest.py nightly, picking up new files
- Incremental updates: track file modification timestamps and only re-embed changed files
- Watch mode: use watchdog to trigger re-ingestion whenever files in your knowledge directory change
For most personal use cases, a nightly cron is sufficient. For team knowledge bases that update frequently, implement incremental updates to avoid re-embedding unchanged documents (saves API costs).
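The incremental approach can be sketched with a small manifest of modification times. This is a standalone helper, not part of the scripts above; the caller would pass the returned files to the embedding pipeline:

```python
import json
from pathlib import Path

def changed_files(knowledge_dir, manifest_path):
    # Compare each file's mtime against the value recorded on the last run;
    # return only files that are new or modified, and update the manifest
    manifest_file = Path(manifest_path)
    manifest = json.loads(manifest_file.read_text()) if manifest_file.exists() else {}
    changed = []
    for f in sorted(Path(knowledge_dir).rglob("*")):
        if not f.is_file():
            continue
        mtime = f.stat().st_mtime
        if manifest.get(str(f)) != mtime:
            changed.append(f)
            manifest[str(f)] = mtime
    manifest_file.write_text(json.dumps(manifest))
    return changed
```

Run it from your nightly job and re-embed only what it returns; unchanged documents cost nothing.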
Frequently Asked Questions
Common questions about building and maintaining a RAG system
How much does it cost to embed a large document collection?
OpenAI's text-embedding-3-small costs $0.02 per million tokens. A 1MB text file is roughly 250,000 tokens. Embedding 100MB of documents costs about $0.50. Very affordable.
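The arithmetic behind that estimate, as a quick sanity check (the price is the published rate at the time of writing, and 4 characters per token is only a rule of thumb for English text):

```python
PRICE_PER_MTOK = 0.02   # text-embedding-3-small, USD per million tokens
CHARS_PER_TOKEN = 4     # rough rule of thumb for English prose

def embedding_cost_usd(total_bytes):
    # One byte is roughly one character for plain ASCII text
    tokens = total_bytes / CHARS_PER_TOKEN
    return tokens / 1_000_000 * PRICE_PER_MTOK

print(f"${embedding_cost_usd(100 * 1024 * 1024):.2f}")  # ~100 MB of text -> $0.52
```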
Can I use a local embedding model to avoid API costs entirely? Yes. Nomic Embed and sentence-transformers provide free, locally-run embedding models that work with ChromaDB. Quality is slightly lower than OpenAI's models but sufficient for most personal knowledge bases.
What file types are supported? Out of the box: PDF, TXT, MD, CSV, HTML. With additional loaders: DOCX, PPTX, Notion exports, Obsidian vaults, and more. LangChain has loaders for nearly every document format.
How do I handle documents in multiple languages?
Use a multilingual embedding model like multilingual-e5-large or OpenAI's embeddings (which handle multiple languages natively). Query in any language and retrieval works across language boundaries.
Is ChromaDB production-ready? For personal use: absolutely. For team-scale production: consider Pinecone, Weaviate, or Qdrant which offer better performance, filtering, and managed infrastructure.
Conclusion
A personal RAG system transforms your AI assistant from amnesiac to encyclopedic, grounded in your own knowledge
A personal RAG knowledge base is one of the highest-leverage upgrades you can make to your AI assistant setup. It turns an amnesiac model into one that knows your history, your documents, and your decisions.
The system described here, ChromaDB plus OpenAI embeddings plus a LangChain retrieval chain, is battle-tested, cheap to run, and can handle millions of documents on modest hardware. Start with your most-referenced documents, test retrieval quality, then expand.
For deeper reading, see LangChain's RAG tutorial, the ChromaDB documentation, and Anthropic's guide on using Claude with retrieved context.