Cost Optimization Guide for Self-Hosted AI Assistants: Run Claude on a Budget
Running a 24/7 AI assistant powered by Claude or GPT-4 is incredible—until you get the API bill. At $15 per million input tokens and $75 per million output tokens (Claude Opus 4 pricing), costs add up fast. A busy assistant processing 50 requests per day can easily rack up $200-500/month.
But with smart optimizations, you can cut costs by 60-80% without sacrificing quality. This guide covers practical strategies for running OpenClaw and other self-hosted AI assistants on a budget.
Understanding AI API Costs
Before optimizing, understand where money goes:
Claude Pricing (as of Feb 2026)
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Speed | Use Case |
|---|---|---|---|---|
| Opus 4 | $15 | $75 | Slow | Complex reasoning, coding |
| Sonnet 4 | $3 | $15 | Fast | General tasks, chat |
| Haiku 3.5 | $0.25 | $1.25 | Very fast | Simple tasks, classification |
Cost Example: Daily Usage
Let's say your AI assistant:
- Receives 50 messages per day
- Average 500 tokens input per message (context + prompt)
- Average 200 tokens output per message
Monthly cost with Opus 4:
Input: 50 msg/day × 30 days × 500 tokens = 750,000 tokens = $11.25
Output: 50 msg/day × 30 days × 200 tokens = 300,000 tokens = $22.50
Total: $33.75/month
Seems reasonable, right? But this assumes short conversations.
With longer context (loading files, memory, tools):
Input: 50 msg × 30 days × 15,000 tokens = 22.5M tokens = $337.50
Output: Same as above = $22.50
Total: $360/month
Now it's expensive. Let's fix that.
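The arithmetic above is easy to wrap in a small helper so you can plug in your own numbers (rates are the Opus 4 figures from the table; per-message token counts are whatever you estimate for your workload):

```javascript
// Estimate monthly API cost from per-message token counts.
// Rates are dollars per million tokens (defaults: Opus 4 from the table above).
function monthlyCost({ messagesPerDay, inputTokens, outputTokens,
                       inputRate = 15, outputRate = 75 }) {
  const messagesPerMonth = messagesPerDay * 30;
  const inputCost = (messagesPerMonth * inputTokens / 1e6) * inputRate;
  const outputCost = (messagesPerMonth * outputTokens / 1e6) * outputRate;
  return inputCost + outputCost;
}

// Short conversations: 50 msg/day, 500 tokens in / 200 out
console.log(monthlyCost({ messagesPerDay: 50, inputTokens: 500, outputTokens: 200 }));   // → 33.75

// Heavy context: 15,000 input tokens per message
console.log(monthlyCost({ messagesPerDay: 50, inputTokens: 15000, outputTokens: 200 })); // → 360
```

Both figures match the hand calculations above, which is a useful sanity check before you start optimizing.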
Strategy 1: Smart Model Routing
Use the cheapest model that can handle the task.
Routing Logic
function selectModel(message) {
// Simple questions → Haiku (cheapest)
if (isSimpleQuery(message)) {
return "claude-haiku-3.5";
}
// Coding, analysis, complex reasoning → Opus
// (requiresDeepReasoning: your own heuristic, e.g. keyword match, message length, presence of code)
if (requiresDeepReasoning(message)) {
return "claude-opus-4";
}
// Everything else → Sonnet
return "claude-sonnet-4";
}
function isSimpleQuery(message) {
const simplePatterns = [
/^(hi|hello|hey)/i,
/^(what is|what's)/i,
/^(status|ping)/i,
/^(thanks|thank you)/i
];
return simplePatterns.some(pattern => pattern.test(message));
}
OpenClaw Model Configuration
In ~/.openclaw/config/models.json:
{
"default": "claude-sonnet-4",
"routes": {
"coding": "claude-opus-4",
"simple": "claude-haiku-3.5",
"creative": "claude-sonnet-4"
}
}
Savings potential: 50-70% by routing 60% of traffic to Sonnet/Haiku
Strategy 2: Context Optimization
The #1 cost driver is large context windows. Every token in context costs money on every request.
Minimize Context Bloat
// Bad: Load everything every time
const context = [
loadFile("SOUL.md"), // 5,000 tokens
loadFile("USER.md"), // 3,000 tokens
loadFile("MEMORY.md"), // 20,000 tokens
loadFile("AGENTS.md"), // 8,000 tokens
loadFile("TOOLS.md"), // 2,000 tokens
loadHistory(30) // 15,000 tokens
];
// Total: 53,000 tokens × $15/M = $0.80 per request
// Good: Load selectively
const context = [
loadFile("SOUL.md"), // Always needed
loadFileIfRelevant("MEMORY.md", message), // Only if memory required
loadHistory(5) // Last 5 messages, not 30
];
// Total: 8,000 tokens × $15/M = $0.12 per request
// 85% cost reduction!
OpenClaw Context Management
OpenClaw has built-in context awareness. Configure in AGENTS.md:
## Context Loading Rules
**Main session (direct chat):**
- Load SOUL.md (always)
- Load USER.md (always)
- Load MEMORY.md (always)
- Load today + yesterday memory files
**Heartbeat sessions:**
- Load SOUL.md only
- Load HEARTBEAT.md (keep this file under 500 tokens)
- NO memory files
**Subagent sessions:**
- Load SOUL.md only
- Load task-specific context passed from main agent
- NO memory files
Dynamic Context Pruning
function buildContext(message, conversationHistory) {
  let budget = 100000; // 100k token budget
  const context = [];

  // Always include system prompt (~5k tokens)
  context.push(systemPrompt);
  budget -= 5000;

  // Include relevant memory (up to 30% of the remaining budget, most recent first)
  const relevantMemories = findRelevant(message, budget * 0.3);
  context.push(...relevantMemories);
  budget -= tokenCount(relevantMemories);

  // Include conversation history: walk backwards from the newest message
  // and stop once half the remaining token budget is spent
  let historyBudget = budget * 0.5;
  const history = [];
  for (let i = conversationHistory.length - 1; i >= 0; i--) {
    const cost = tokenCount(conversationHistory[i]);
    if (cost > historyBudget) break;
    history.unshift(conversationHistory[i]);
    historyBudget -= cost;
  }
  context.push(...history);

  return context;
}
Savings potential: 60-80% by cutting unnecessary context
Strategy 3: Caching & Memoization
Don't recompute identical requests.
Response Caching
const crypto = require("crypto");

const responseCache = new Map();
const CACHE_TTL_MS = 60 * 60 * 1000; // 1 hour

function hashMessage(message) {
  return crypto.createHash("sha256").update(message).digest("hex");
}

async function getCachedResponse(message) {
  const cacheKey = hashMessage(message);
  const cached = responseCache.get(cacheKey);
  if (cached && Date.now() - cached.timestamp < CACHE_TTL_MS) {
    console.log("Cache hit, saved API call");
    return cached.response;
  }
  const response = await callClaude(message);
  responseCache.set(cacheKey, { response, timestamp: Date.now() });
  return response;
}
Prompt Caching (Claude-Specific)
Claude supports prompt caching for repeated context:
// First request: writes the cache (billed at 1.25× the base input rate)
const response1 = await claude.complete({
  system: largeSystemPrompt, // 50,000 tokens ≈ $0.94 including the cache-write surcharge
  messages: [userMessage]
});

// Subsequent requests within the cache TTL (~5 minutes, refreshed on each hit):
const response2 = await claude.complete({
  system: largeSystemPrompt, // Cache read: billed at ~10% of the base input rate (~$0.08)
  messages: [userMessage2]
});

Savings: up to 90% on repeated context within the cache window. Note that cached reads are discounted, not free, and caching is opt-in: the API requires a cache_control breakpoint marking the end of the cacheable prefix.
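In Anthropic's Messages API, the opt-in mechanism is a `cache_control: { type: "ephemeral" }` marker on a system content block; everything up to that breakpoint is cached. A sketch of building such a request body (the helper name is ours; the field names are the API's):

```javascript
// Build a Messages API request body whose system prompt is marked cacheable.
// The cache_control breakpoint tells the API to cache everything up to and
// including this block (default TTL ~5 minutes, refreshed on each cache hit).
function buildCachedRequest(largeSystemPrompt, userMessage) {
  return {
    model: "claude-sonnet-4", // model name as used elsewhere in this guide
    max_tokens: 1024,
    system: [
      {
        type: "text",
        text: largeSystemPrompt,
        cache_control: { type: "ephemeral" }
      }
    ],
    messages: [{ role: "user", content: userMessage }]
  };
}
```

Keep the cacheable prefix stable between requests: any change to the text before the breakpoint invalidates the cache and triggers a fresh cache write.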
Strategy 4: Batching
Group similar requests to reduce overhead.
Batch Processing Example
// Bad: Process one at a time
for (const item of items) {
const result = await processWithAI(item);
results.push(result);
}
// 100 items = 100 API calls
// Good: Batch together
const batchSize = 10;
const results = [];
for (let i = 0; i < items.length; i += batchSize) {
  const batch = items.slice(i, i + batchSize);
  const prompt = `Process these ${batch.length} items:\n\n${batch.join('\n')}`;
  const batchResult = await processWithAI(prompt);
  results.push(batchResult); // split/parse the combined response as needed
}
// 100 items = 10 API calls
Caveat: Batching increases per-request token count but reduces number of requests. Test to find optimal batch size.
Strategy 5: Task-Specific Optimization
Text Classification: Use Embeddings
Instead of calling Claude to classify every message:
// Expensive: Claude classification
const category = await claude.complete({
prompt: `Classify this message: "${message}"\n\nCategories: support, sales, general`
});
// Cost: ~$0.002 per message
// Cheap: Embedding similarity
const embedding = await getEmbedding(message); // $0.0001
const category = findClosestCategory(embedding, categoryEmbeddings);
// Cost: ~$0.0001 per message
// 95% cheaper!
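A minimal sketch of what `findClosestCategory` can look like, assuming embeddings are plain number arrays and you precompute one embedding per category (the function names follow the snippet above):

```javascript
// Cosine similarity between two equal-length vectors
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Pick the category whose precomputed embedding is closest to the message's
function findClosestCategory(embedding, categoryEmbeddings) {
  let best = null, bestScore = -Infinity;
  for (const [category, catEmbedding] of Object.entries(categoryEmbeddings)) {
    const score = cosineSimilarity(embedding, catEmbedding);
    if (score > bestScore) {
      bestScore = score;
      best = category;
    }
  }
  return best;
}
```

The category embeddings only need to be computed once (e.g. from a short description of each category), so the per-message cost is just one embedding call plus a few dot products.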
Simple Q&A: Use RAG Instead of Full Context
Retrieval-Augmented Generation (RAG):
- Store knowledge in vector database
- Retrieve only relevant chunks
- Send minimal context to Claude
// Bad: Send entire documentation (50k tokens)
const context = loadEntireDocs();
const answer = await claude.complete({
system: context,
messages: [question]
});
// Good: Retrieve relevant sections only (2k tokens)
const relevant = await vectorDB.search(question, { limit: 3 });
const answer = await claude.complete({
  system: relevant.join('\n'),
  messages: [question]
});
Savings: 90%+ on knowledge-base queries
Strategy 6: Hybrid Local + Cloud
Run simple tasks locally, send complex ones to Claude.
Local Model for Pre-Processing
// Use local Llama model for intent detection
const intent = await localModel.classify(message);
// Only send complex intents to Claude
if (intent === "complex_reasoning") {
return await claude.complete(message);
} else if (intent === "simple_lookup") {
return lookupInDatabase(message);
}
Tools:
- Ollama: Run Llama 3 locally
- llama.cpp: Fast local inference
- GPT4All: Easy local setup
Savings: 40-60% by offloading simple tasks
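Ollama exposes a local HTTP API (default `http://localhost:11434`), so the intent-detection step can be a plain fetch. A sketch, with the request body built separately so it's easy to inspect; the prompt wording and intent labels are ours:

```javascript
// Build a request body for Ollama's /api/generate endpoint.
// format: "json" asks the model to emit valid JSON we can parse.
function buildIntentRequest(message) {
  return {
    model: "llama3",
    prompt: `Classify the intent of this message as "complex_reasoning" or ` +
            `"simple_lookup". Reply with JSON like {"intent": "..."}.\n\n${message}`,
    format: "json",
    stream: false
  };
}

// Call the local model; assumes Ollama is running with the llama3 model pulled
async function detectIntent(message) {
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildIntentRequest(message))
  });
  const data = await res.json();
  return JSON.parse(data.response).intent;
}
```

Local inference costs nothing per token, so even a mediocre local classifier pays for itself if it keeps simple lookups away from the paid API.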
Strategy 7: OpenClaw-Specific Optimizations
Heartbeat Efficiency
OpenClaw's heartbeat feature polls your agent periodically. Optimize this:
# HEARTBEAT.md
## Heartbeat Rules
- Check email: Once every 4 hours (not every 30 min)
- Check calendar: Once per day (morning only)
- GitHub notifications: Once every 2 hours
- Weather: Only if user requested it recently
If nothing needs attention: HEARTBEAT_OK (don't load memory files)
Savings: 80%+ on heartbeat costs
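The interval rules above can also be enforced mechanically: record when each check last ran and skip any that isn't due. A sketch with an in-memory map (the intervals mirror the HEARTBEAT.md rules; a real agent would persist `lastRun` across restarts):

```javascript
// Minimum interval between runs for each heartbeat check, in milliseconds
const HOUR = 60 * 60 * 1000;
const checkIntervals = {
  email: 4 * HOUR,
  calendar: 24 * HOUR,
  github: 2 * HOUR
};
const lastRun = {}; // check name -> timestamp of last run

// Return the checks that are due now, and mark them as run
function dueChecks(now = Date.now()) {
  const due = [];
  for (const [name, interval] of Object.entries(checkIntervals)) {
    if (!(name in lastRun) || now - lastRun[name] >= interval) {
      due.push(name);
      lastRun[name] = now;
    }
  }
  return due;
}
```

If `dueChecks()` returns an empty array, the heartbeat can reply `HEARTBEAT_OK` immediately without loading any memory files or making a model call at all.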
Subagent Isolation
Use subagents for heavy tasks to prevent context pollution:
// Main session: Lightweight coordination
// Subagent: Heavy lifting (browser automation, content generation)
// Main session context: 10k tokens
// Subagent context: 50k tokens (isolated, doesn't pollute main)
Model Selection Per Task
Configure different models for different tasks:
# In .openclaw/config/defaults.json
{
"models": {
"heartbeat": "claude-haiku-3.5",
"main": "claude-sonnet-4",
"coding": "claude-opus-4",
"subagents": "claude-sonnet-4"
}
}
Strategy 8: Rate Limiting & Quotas
Prevent runaway costs with hard limits.
Daily Budget Enforcement
const DAILY_BUDGET = 10.00; // $10/day max
let todaySpend = 0;
let spendDate = new Date().toDateString();

async function callWithBudget(prompt) {
  // Reset the counter when the date rolls over
  const today = new Date().toDateString();
  if (today !== spendDate) {
    spendDate = today;
    todaySpend = 0;
  }
  const estimatedCost = estimateTokenCost(prompt);
  if (todaySpend + estimatedCost > DAILY_BUDGET) {
    throw new Error("Daily budget exceeded. Try again tomorrow.");
  }
  const response = await claude.complete(prompt);
  todaySpend += actualCost(response);
  return response;
}
User Quotas
const userQuotas = {
"user123": { limit: 100, used: 45 },
"user456": { limit: 50, used: 12 }
};
function checkQuota(userId) {
  const quota = userQuotas[userId];
  if (!quota || quota.used >= quota.limit) {
    return { allowed: false, message: "Monthly quota exceeded. Upgrade for more requests." };
  }
  quota.used++;
  return { allowed: true };
}
Strategy 9: Monitoring & Analytics
Track costs to identify waste.
Log Every Request
function logAPICall(model, inputTokens, outputTokens, cost) {
const entry = {
timestamp: new Date().toISOString(),
model,
inputTokens,
outputTokens,
cost
};
appendToFile("api-usage.jsonl", JSON.stringify(entry) + "\n");
}
Daily Cost Reports
#!/bin/bash
# Generate daily cost report
cat api-usage.jsonl | \
jq -s 'group_by(.model) | map({model: .[0].model, cost: (map(.cost) | add)})' \
> daily-cost-report.json
Alerts for Anomalies
if (dailyCost > averageDailyCost * 2) {
sendAlert("⚠️ API costs doubled today. Check for issues.");
}
Real-World Cost Comparison
Before Optimization
Model: Claude Opus 4 (everything)
Context: 50k tokens (all memory files loaded)
Requests: 100/day
Input: 100 × 30 × 50,000 = 150M tokens × $15/M = $2,250
Output: 100 × 30 × 200 = 600k tokens × $75/M = $45
Cost: ~$2,295/month
After Optimization
Models:
- 60% Haiku (simple queries)
- 30% Sonnet (general tasks)
- 10% Opus (complex reasoning)
Context: 8k tokens average (selective loading)
Requests: 100/day
Cost: ~$69/month ($61 input + $8 output)
Savings: ~97%
Cost Optimization Checklist
- Implement smart model routing (Haiku → Sonnet → Opus)
- Reduce context size (load only what's needed)
- Enable prompt caching for repeated context
- Cache responses for common queries
- Batch similar requests
- Use embeddings for classification tasks
- Implement RAG for knowledge-base queries
- Set up daily budget limits
- Log all API calls for analysis
- Configure heartbeat for minimal context
- Use subagents to isolate heavy tasks
- Consider local models for simple tasks
Tools & Resources
- Token counters: tiktoken (Python), js-tiktoken (JavaScript)
- Cost calculators: OpenAI Pricing Calculator
- Monitoring: Langfuse, Helicone, LangSmith
- Local models: Ollama, llama.cpp, GPT4All
- Vector databases: Chroma, Pinecone, Weaviate
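When you don't want to pull in a tokenizer library, a common rule of thumb for English text is roughly 4 characters per token. A rough estimator for budget checks (use tiktoken or js-tiktoken when you need accurate, billing-grade counts):

```javascript
// Rough token estimate: ~4 characters per token for English prose.
// Good enough for budget gating; not accurate enough for exact billing.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Estimated dollar cost for a text at a given rate per million tokens
function estimateCost(text, ratePerMillion) {
  return estimateTokens(text) / 1e6 * ratePerMillion;
}
```

This pairs naturally with the daily-budget check in Strategy 8: estimate before you send, then record the provider's reported usage afterward.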
Conclusion
Running a self-hosted AI assistant doesn't have to break the bank. With smart model routing, context optimization, caching, and task-specific strategies, you can cut costs by 60-85% while maintaining quality.
Key takeaways:
- Use the cheapest model that works (Haiku → Sonnet → Opus)
- Minimize context size—every token costs money
- Cache responses and enable prompt caching
- Batch requests when possible
- Monitor costs and set hard limits
- Use subagents to isolate expensive operations
Start with the low-hanging fruit (model routing, context reduction), measure impact, then optimize further. Your API bill will thank you.
Next steps: Read OpenClaw Context Window Management and Building LLM Automation Workflows to build efficient AI systems.