Build a Personal RAG Knowledge Base for Your AI Assistant

Your AI assistant is brilliant but amnesiac. Every new session starts from zero. It doesn't know your meeting notes from last Tuesday, the architecture decision you documented three months ago, or the research you compiled last year.
Retrieval-Augmented Generation (RAG) solves this. Instead of stuffing everything into the context window, RAG builds a searchable knowledge base from your documents. When the assistant needs information, it retrieves the relevant chunks and uses them to answer, giving your AI a form of long-term memory that scales to millions of documents.
This tutorial walks through building a personal RAG system, from document ingestion to query-time retrieval, and integrating it with your AI assistant workflow.
What RAG Is and Why It Matters
RAG combines a vector database for retrieval with a language model for generation, getting the best of both
RAG stands for Retrieval-Augmented Generation. The core idea:
- Index: convert your documents to vector embeddings and store them in a searchable database
- Retrieve: when a question comes in, find the most semantically similar chunks from your knowledge base
- Generate: feed those relevant chunks to the LLM as context, along with the question
- Answer: the LLM generates an answer grounded in your actual documents
The critical advantage over pure context-stuffing: your knowledge base can be enormous (millions of words) without impacting response speed. The retrieval step finds only the relevant slices and feeds them to the model, keeping context usage small and focused.
This is how enterprise systems like Notion AI, GitHub Copilot's codebase indexing, and most production AI assistants work at scale.
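The retrieve step boils down to ranking stored chunks by similarity to the query embedding. Here is a toy sketch using plain cosine similarity; the 3-dimensional "embeddings" are made-up numbers for illustration (real systems use a learned embedding model with hundreds of dimensions and a vector database):

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Hypothetical embeddings of three stored chunks
chunks = {
    "meeting notes: chose Postgres": [0.9, 0.1, 0.2],
    "recipe: banana bread":          [0.1, 0.8, 0.3],
    "architecture decision record":  [0.6, 0.4, 0.5],
}
query = [0.85, 0.15, 0.3]  # hypothetical embedding of "what database did we pick?"

# Retrieve: rank chunks by similarity to the query and keep the top k
top = sorted(chunks, key=lambda c: cosine(query, chunks[c]), reverse=True)[:2]
print(top)  # the two database-related chunks outrank the recipe
```

Only those top chunks are passed to the model, which is why the knowledge base can grow without growing the prompt.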
What You'll Need
Before starting, gather the following:
- Python 3.10+ with pip
- OpenAI API key (for text-embedding-3-small, the most cost-effective embedding model)
- Or Anthropic API key with a local embedding alternative
- Your documents β PDFs, markdown files, text files, Notion exports
- A vector database β ChromaDB (local, free) or Pinecone (managed, free tier)
For this tutorial, we'll use ChromaDB locally (no API key needed, runs on your server) and OpenAI embeddings (costs fractions of a cent per megabyte of text).
Step 1: Set Up the Embedding Pipeline
The embedding pipeline converts raw documents into searchable vector representations
First, install the required packages:
pip install chromadb langchain langchain-community langchain-openai pypdf python-dotenv
Create a project directory and set up your environment:
mkdir ~/my-rag && cd ~/my-rag
echo "OPENAI_API_KEY=sk-your-key-here" > .env
Now create the ingestion script (ingest.py):
from dotenv import load_dotenv
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

load_dotenv()

# Load documents from your knowledge directory. DirectoryLoader's glob
# does not expand brace patterns, so load each file type separately.
loaders = [
    DirectoryLoader("./knowledge", glob="**/*.txt", loader_cls=TextLoader, show_progress=True),
    DirectoryLoader("./knowledge", glob="**/*.md", loader_cls=TextLoader, show_progress=True),
    DirectoryLoader("./knowledge", glob="**/*.pdf", loader_cls=PyPDFLoader, show_progress=True),
]
documents = [doc for loader in loaders for doc in loader.load()]

# Split into overlapping chunks, preferring paragraph and sentence boundaries
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " "]
)
chunks = splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks from {len(documents)} documents")

# Create embeddings and store in ChromaDB; with a persist_directory set,
# the collection is written to disk automatically
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
db = Chroma.from_documents(
    chunks,
    embeddings,
    persist_directory="./chroma_db"
)
print("Knowledge base created successfully")
Drop your documents into a ./knowledge/ folder and run python ingest.py. Your knowledge base is built.
Step 2: Build the Query Interface
The query interface accepts natural language questions and returns answers grounded in your documents
Now build the query layer (query.py):
import sys
from dotenv import load_dotenv
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

load_dotenv()

# Load the existing knowledge base
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
db = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings
)

# Create retrieval chain
retriever = db.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5}  # return the top 5 most relevant chunks
)

# Custom prompt that grounds answers in retrieved context
prompt = PromptTemplate(
    template="""Use the following context from your knowledge base to answer the question.
If the answer is not in the context, say "I don't have that in my knowledge base."

Context: {context}

Question: {question}

Answer:""",
    input_variables=["context", "question"]
)

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt},
    return_source_documents=True
)

# Query
question = sys.argv[1] if len(sys.argv) > 1 else "What is in my knowledge base?"
result = qa_chain.invoke({"query": question})

print("\nAnswer:", result["result"])
print("\nSources:")
for doc in result["source_documents"]:
    print(f"  - {doc.metadata.get('source', 'unknown')}")
Test it:
python query.py "What did we decide about the database architecture?"
Step 3: Optimize Your Chunking Strategy
Chunking strategy dramatically affects retrieval quality: context-aware splitting beats naive fixed-size splitting
The biggest factor in RAG quality after the initial setup is your chunking strategy. Naive fixed-size chunks break documents in the middle of ideas, reducing retrieval quality.
Best practices for chunking:
- Use RecursiveCharacterTextSplitter: it tries to split on paragraph breaks first, then sentences, then words, keeping semantically coherent units together
- Chunk size of 500-1500 tokens: too small misses context, too large dilutes relevance
- 20% overlap: prevents losing information at chunk boundaries
- Include metadata: store the filename, section heading, and date in chunk metadata so you know where retrieved content came from
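The overlap rule is easiest to see with a toy character-level splitter. This is only an illustration of the sliding-window idea; RecursiveCharacterTextSplitter does the same thing more carefully, on paragraph and sentence boundaries:

```python
def chunk_with_overlap(text, size, overlap):
    # Slide a window of `size` characters, stepping size - overlap each time,
    # so every boundary region appears in two consecutive chunks
    # (the final chunk may be shorter than `size`)
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

chunks = chunk_with_overlap("abcdefghij", size=4, overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Because "cd", "ef", and "gh" each appear in two chunks, an idea that straddles a boundary is still retrievable from at least one chunk.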
For structured documents (meeting notes, project docs), add a header to each chunk:
# Add document title as context to each chunk
for chunk in chunks:
    title = chunk.metadata.get("source", "").split("/")[-1]
    chunk.page_content = f"Document: {title}\n\n{chunk.page_content}"
This helps the retriever distinguish between chunks from different sources on similar topics.
Step 4: Connect to OpenClaw
Integrating RAG with OpenClaw gives your AI assistant grounded, document-backed answers
To make your RAG knowledge base available to OpenClaw, expose it as a simple HTTP API:
from flask import Flask, request, jsonify

# Build qa_chain exactly as in query.py (Step 2) before this point,
# so the server reuses the same retrieval chain
app = Flask(__name__)

@app.route("/query", methods=["POST"])
def query_kb():
    question = request.json.get("question")
    result = qa_chain.invoke({"query": question})
    return jsonify({
        "answer": result["result"],
        "sources": [d.metadata.get("source") for d in result["source_documents"]]
    })

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8765)
Start the server: python server.py
Now OpenClaw can query your knowledge base via web_fetch or a custom tool call:
"Query my knowledge base at localhost:8765 for everything related to our Q1 planning decisions."
For a persistent setup, run the Flask server as a systemd service so it starts on boot. This gives you always-on RAG access from any OpenClaw session.
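A minimal unit file sketch for that setup follows; the paths, the user, and the service name rag-kb are placeholders you should adapt to your system:

```ini
[Unit]
Description=Personal RAG knowledge base API
After=network.target

[Service]
WorkingDirectory=/home/you/my-rag
ExecStart=/usr/bin/python3 /home/you/my-rag/server.py
Restart=on-failure
User=you

[Install]
WantedBy=multi-user.target
```

Save it as /etc/systemd/system/rag-kb.service, then run systemctl enable --now rag-kb to start it and have it come back after reboots.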
For more on integrating external APIs with OpenClaw, see the Model Context Protocol (MCP) tutorial and the OpenClaw automation guide.
Keeping Your Knowledge Base Updated
Documents change. Meeting notes get added. Research evolves. Your RAG system needs to stay current.
Options for keeping it fresh:
- Scheduled re-ingestion: a cron job that re-runs ingest.py nightly, picking up new files
- Incremental updates: track file modification timestamps and only re-embed changed files
- Watch mode: use watchdog to trigger re-ingestion whenever files in your knowledge directory change
For most personal use cases, a nightly cron is sufficient. For team knowledge bases that update frequently, implement incremental updates to avoid re-embedding unchanged documents (saves API costs).
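The incremental approach can be sketched with a small manifest of modification times. This is a standalone helper, not part of the scripts above; the caller would pass the returned files to the embedding pipeline:

```python
import json
from pathlib import Path

def changed_files(knowledge_dir, manifest_path):
    # Compare each file's mtime against the value recorded on the last run;
    # return only files that are new or modified, and update the manifest
    manifest_file = Path(manifest_path)
    manifest = json.loads(manifest_file.read_text()) if manifest_file.exists() else {}
    changed = []
    for f in sorted(Path(knowledge_dir).rglob("*")):
        if not f.is_file():
            continue
        mtime = f.stat().st_mtime
        if manifest.get(str(f)) != mtime:
            changed.append(f)
            manifest[str(f)] = mtime
    manifest_file.write_text(json.dumps(manifest))
    return changed
```

Run it from your nightly job and re-embed only what it returns; unchanged documents cost nothing.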
Frequently Asked Questions
Common questions about building and maintaining a RAG system
How much does it cost to embed a large document collection?
OpenAI's text-embedding-3-small costs $0.02 per million tokens. A 1MB text file is roughly 250,000 tokens. Embedding 100MB of documents costs about $0.50. Very affordable.
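The arithmetic behind that estimate, as a quick sanity check (the price is the published rate at the time of writing, and 4 characters per token is only a rule of thumb for English text):

```python
PRICE_PER_MTOK = 0.02   # text-embedding-3-small, USD per million tokens
CHARS_PER_TOKEN = 4     # rough rule of thumb for English prose

def embedding_cost_usd(total_bytes):
    # One byte is roughly one character for plain ASCII text
    tokens = total_bytes / CHARS_PER_TOKEN
    return tokens / 1_000_000 * PRICE_PER_MTOK

print(f"${embedding_cost_usd(100 * 1024 * 1024):.2f}")  # ~100 MB of text -> $0.52
```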
Can I use a local embedding model to avoid API costs entirely? Yes. Nomic Embed and sentence-transformers provide free, locally-run embedding models that work with ChromaDB. Quality is slightly lower than OpenAI's models but sufficient for most personal knowledge bases.
What file types are supported? Out of the box: PDF, TXT, MD, CSV, HTML. With additional loaders: DOCX, PPTX, Notion exports, Obsidian vaults, and more. LangChain has loaders for nearly every document format.
How do I handle documents in multiple languages?
Use a multilingual embedding model like multilingual-e5-large or OpenAI's embeddings (which handle multiple languages natively). Query in any language and retrieval works across language boundaries.
Is ChromaDB production-ready? For personal use: absolutely. For team-scale production: consider Pinecone, Weaviate, or Qdrant which offer better performance, filtering, and managed infrastructure.
Conclusion
A personal RAG system transforms your AI assistant from amnesiac to encyclopedic, grounded in your own knowledge
A personal RAG knowledge base is one of the highest-leverage upgrades you can make to your AI assistant setup. It turns an amnesiac model into one that knows your history, your documents, and your decisions.
The system described here, ChromaDB plus OpenAI embeddings plus a LangChain retrieval chain, is battle-tested, cheap to run, and can handle millions of documents on modest hardware. Start with your most-referenced documents, test retrieval quality, then expand.
For deeper reading, see LangChain's RAG tutorial, the ChromaDB documentation, and Anthropic's guide on using Claude with retrieved context.