Clawist
🟡 Intermediate · 13 min read · By Lin6

Cost Optimization Guide for Self-Hosted AI Assistants: Run Claude on a Budget

Running a 24/7 AI assistant powered by Claude or GPT-4 is incredible—until you get the API bill. At $15 per million input tokens and $75 per million output tokens (Claude Opus 4 pricing), costs add up fast. A busy assistant processing 50 requests per day can easily rack up $200-500/month.

But with smart optimizations, you can cut costs by 60-80% without sacrificing quality. This guide covers practical strategies for running OpenClaw and other self-hosted AI assistants on a budget.

Understanding AI API Costs

Before optimizing, understand where money goes:

Claude Pricing (as of Feb 2026)

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Speed | Use Case |
|---|---|---|---|---|
| Opus 4 | $15 | $75 | Slow | Complex reasoning, coding |
| Sonnet 4 | $3 | $15 | Fast | General tasks, chat |
| Haiku 3.5 | $0.25 | $1.25 | Very fast | Simple tasks, classification |

Cost Example: Daily Usage

Let's say your AI assistant:

  • Receives 50 messages per day
  • Average 500 tokens input per message (context + prompt)
  • Average 200 tokens output per message

Monthly cost with Opus 4:

Input:  50 msg/day × 30 days × 500 tokens = 750,000 tokens = $11.25
Output: 50 msg/day × 30 days × 200 tokens = 300,000 tokens = $22.50
Total: $33.75/month

Seems reasonable, right? But this assumes short conversations.

With longer context (loading files, memory, tools):

Input:  50 msg × 30 days × 15,000 tokens = 22.5M tokens = $337.50
Output: Same as above = $22.50
Total: $360/month

Now it's expensive. Let's fix that.
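The arithmetic above can be sanity-checked with a few lines. `monthlyCost` is a throwaway helper (not part of OpenClaw); the rates are the Opus 4 prices from the table.

```javascript
const OPUS = { input: 15, output: 75 }; // USD per 1M tokens

// Monthly API cost for a fixed per-message token profile
function monthlyCost(msgsPerDay, inputTokens, outputTokens, prices, days = 30) {
  const inTok = msgsPerDay * days * inputTokens;
  const outTok = msgsPerDay * days * outputTokens;
  return (inTok / 1e6) * prices.input + (outTok / 1e6) * prices.output;
}

console.log(monthlyCost(50, 500, 200, OPUS));    // 33.75
console.log(monthlyCost(50, 15000, 200, OPUS));  // 360
```

Swap in the Sonnet or Haiku rates to see how far model routing alone moves the bill.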

Strategy 1: Smart Model Routing

Use the cheapest model that can handle the task.

Routing Logic

function selectModel(message) {
  // Simple questions → Haiku (cheapest)
  if (isSimpleQuery(message)) {
    return "claude-haiku-3.5";
  }
  
  // Coding, analysis, complex reasoning → Opus
  if (requiresDeepReasoning(message)) {
    return "claude-opus-4";
  }
  
  // Everything else → Sonnet
  return "claude-sonnet-4";
}

function isSimpleQuery(message) {
  const simplePatterns = [
    /^(hi|hello|hey)/i,
    /^(what is|what's)/i,
    /^(status|ping)/i,
    /^(thanks|thank you)/i
  ];
  
  return simplePatterns.some(pattern => pattern.test(message));
}

OpenClaw Model Configuration

In ~/.openclaw/config/models.json:

{
  "default": "claude-sonnet-4",
  "routes": {
    "coding": "claude-opus-4",
    "simple": "claude-haiku-3.5",
    "creative": "claude-sonnet-4"
  }
}

Savings potential: 50-70% by routing 60% of traffic to Sonnet/Haiku

Strategy 2: Context Optimization

The #1 cost driver is large context windows. Every token in context costs money on every request.

Minimize Context Bloat

// Bad: Load everything every time
const context = [
  loadFile("SOUL.md"),      // 5,000 tokens
  loadFile("USER.md"),      // 3,000 tokens
  loadFile("MEMORY.md"),    // 20,000 tokens
  loadFile("AGENTS.md"),    // 8,000 tokens
  loadFile("TOOLS.md"),     // 2,000 tokens
  loadHistory(30)           // 15,000 tokens
];
// Total: 53,000 tokens × $15/M = $0.80 per request

// Good: Load selectively
const context = [
  loadFile("SOUL.md"),      // Always needed
  loadFileIfRelevant("MEMORY.md", message),  // Only if memory required
  loadHistory(5)            // Last 5 messages, not 30
];
// Total: 8,000 tokens × $15/M = $0.12 per request
// 85% cost reduction!

OpenClaw Context Management

OpenClaw has built-in context awareness. Configure in AGENTS.md:

## Context Loading Rules

**Main session (direct chat):**
- Load SOUL.md (always)
- Load USER.md (always)
- Load MEMORY.md (always)
- Load today + yesterday memory files

**Heartbeat sessions:**
- Load SOUL.md only
- Load HEARTBEAT.md (keep this file under 500 tokens)
- NO memory files

**Subagent sessions:**
- Load SOUL.md only
- Load task-specific context passed from main agent
- NO memory files

Dynamic Context Pruning

function buildContext(message, conversationHistory) {
  let budget = 100000; // 100k token budget
  const context = [];
  
  // Always include system prompt (~5k tokens)
  context.push(systemPrompt);
  budget -= 5000;
  
  // Include relevant memory (prioritize recent), capped at 30% of budget
  const relevantMemories = findRelevant(message, budget * 0.3);
  context.push(...relevantMemories);
  budget -= tokenCount(relevantMemories);
  
  // Fill half the remaining budget with history, measured in tokens
  // (not message count), walking back from the newest message
  let historyBudget = budget * 0.5;
  const history = [];
  for (let i = conversationHistory.length - 1; i >= 0; i--) {
    const tokens = tokenCount(conversationHistory[i]);
    if (tokens > historyBudget) break;
    history.unshift(conversationHistory[i]);
    historyBudget -= tokens;
  }
  context.push(...history);
  
  return context;
}

Savings potential: 60-80% by cutting unnecessary context

Strategy 3: Caching & Memoization

Don't recompute identical requests.

Response Caching

const responseCache = new Map();

async function getCachedResponse(message) {
  const cacheKey = hashMessage(message);
  
  if (responseCache.has(cacheKey)) {
    const cached = responseCache.get(cacheKey);
    if (Date.now() - cached.timestamp < 3600000) { // 1 hour TTL
      console.log("Cache hit, saved API call");
      return cached.response;
    }
  }
  
  const response = await callClaude(message);
  responseCache.set(cacheKey, {
    response,
    timestamp: Date.now()
  });
  
  return response;
}
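The snippet above leaves `hashMessage` undefined; any stable string hash works. One possible djb2-style implementation (illustrative, not a security-grade hash):

```javascript
// Deterministic 32-bit string hash for cache keys (djb2 xor variant)
function hashMessage(message) {
  let h = 5381;
  for (let i = 0; i < message.length; i++) {
    h = ((h * 33) ^ message.charCodeAt(i)) >>> 0; // keep it in uint32 range
  }
  return h.toString(16);
}
```

Note this caches only exact-duplicate messages; semantically similar queries still miss.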

Prompt Caching (Claude-Specific)

Claude supports prompt caching for repeated context:

// First request: full price, plus a small cache-write surcharge
const response1 = await claude.complete({
  system: largeSystemPrompt,  // 50,000 tokens at $15/M ≈ $0.75
  messages: [userMessage]
});

// Subsequent requests within the cache TTL (5 minutes by default)
const response2 = await claude.complete({
  system: largeSystemPrompt,  // Cache hit: billed at ~10% of the base input rate
  messages: [userMessage2]
});

Savings: Up to 90% on repeated context within the cache window. Note that cached reads are discounted, not free, and cache writes cost slightly more than a normal request.
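To estimate what caching buys you, model effective input cost as a function of hit rate. `effectiveInputCost` is a hypothetical helper; the 10% cache-read multiplier matches Anthropic's published discount at the time of writing, but verify against current pricing.

```javascript
// Effective input cost given a cache hit rate; cached reads are assumed
// to bill at ~10% of the base input rate (cache-write surcharge ignored)
function effectiveInputCost(tokens, requests, hitRate, pricePerM) {
  const perRequest = (tokens / 1e6) * pricePerM;
  const hits = requests * hitRate;
  const misses = requests - hits;
  return misses * perRequest + hits * perRequest * 0.1;
}

// 50k-token system prompt, 10 requests, 90% of them cache hits:
effectiveInputCost(50000, 10, 0.9, 15); // ≈ $1.43 instead of $7.50 uncached
```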

Strategy 4: Batching

Group similar requests to reduce overhead.

Batch Processing Example

// Bad: Process one at a time
for (const item of items) {
  const result = await processWithAI(item);
  results.push(result);
}
// 100 items = 100 API calls

// Good: Batch together
const batchSize = 10;
for (let i = 0; i < items.length; i += batchSize) {
  const batch = items.slice(i, i + batchSize);
  const prompt = `Process these ${batch.length} items:\n\n${batch.join('\n')}`;
  const batchResults = await processWithAI(prompt);
  results.push(...batchResults);
}
// 100 items = 10 API calls

Caveat: Batching increases per-request token count but reduces number of requests. Test to find optimal batch size.
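The trade-off can be quantified: batching amortizes the fixed prompt overhead across items. The overhead numbers below are hypothetical; measure your own.

```javascript
// Total input tokens for nItems processed at a given batch size,
// assuming a fixed prompt overhead per request (hypothetical figures)
function totalInputTokens(nItems, batchSize, tokensPerItem, overheadPerRequest) {
  const requests = Math.ceil(nItems / batchSize);
  return requests * overheadPerRequest + nItems * tokensPerItem;
}

totalInputTokens(100, 1, 50, 400);  // 45000 tokens across 100 requests
totalInputTokens(100, 10, 50, 400); //  9000 tokens across 10 requests
```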

Strategy 5: Task-Specific Optimization

Text Classification: Use Embeddings

Instead of calling Claude to classify every message:

// Expensive: Claude classification
const category = await claude.complete({
  prompt: `Classify this message: "${message}"\n\nCategories: support, sales, general`
});
// Cost: ~$0.002 per message

// Cheap: Embedding similarity
const embedding = await getEmbedding(message);  // $0.0001
const category = findClosestCategory(embedding, categoryEmbeddings);
// Cost: ~$0.0001 per message
// 95% cheaper!
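`findClosestCategory` can be as simple as cosine similarity over plain number arrays. A minimal sketch, assuming all embeddings share the same dimension:

```javascript
// Cosine similarity between two equal-length vectors
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Pick the category whose embedding is closest to the message embedding
function findClosestCategory(embedding, categoryEmbeddings) {
  let best = null, bestScore = -Infinity;
  for (const [category, catEmbedding] of Object.entries(categoryEmbeddings)) {
    const score = cosine(embedding, catEmbedding);
    if (score > bestScore) { bestScore = score; best = category; }
  }
  return best;
}
```

Precompute one embedding per category once; after that, each classification is a local dot product, not an API call.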

Simple Q&A: Use RAG Instead of Full Context

Retrieval-Augmented Generation (RAG):

  1. Store knowledge in vector database
  2. Retrieve only relevant chunks
  3. Send minimal context to Claude

// Bad: Send entire documentation (50k tokens)
const context = loadEntireDocs();
const answer = await claude.complete({
  system: context,
  messages: [question]
});

// Good: Retrieve relevant sections only (2k tokens)
const relevant = await vectorDB.search(question, { limit: 3 });
const answer = await claude.complete({
  system: relevant.join('\n'),
  messages: [question]
});

Savings: 90%+ on knowledge-base queries
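To see the retrieval step without standing up a vector database, here is a toy `search` that scores chunks by keyword overlap with the question (real systems use embedding similarity, but the interface is the same):

```javascript
// Return the top-`limit` chunks by naive keyword overlap with the question
function search(chunks, question, limit = 3) {
  const qWords = new Set(question.toLowerCase().split(/\W+/).filter(Boolean));
  return chunks
    .map(chunk => ({
      chunk,
      score: chunk.toLowerCase().split(/\W+/).filter(w => qWords.has(w)).length
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, limit)
    .map(r => r.chunk);
}

search(["billing help", "reset your password", "weather today"],
       "how do I reset my password", 1); // ["reset your password"]
```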

Strategy 6: Hybrid Local + Cloud

Run simple tasks locally, send complex ones to Claude.

Local Model for Pre-Processing

// Use local Llama model for intent detection
const intent = await localModel.classify(message);

// Only send complex intents to Claude
if (intent === "complex_reasoning") {
  return await claude.complete(message);
} else if (intent === "simple_lookup") {
  return lookupInDatabase(message);
}

Tools:

  • Ollama: Run Llama 3 locally
  • llama.cpp: Fast local inference
  • GPT4All: Easy local setup

Savings: 40-60% by offloading simple tasks

Strategy 7: OpenClaw-Specific Optimizations

Heartbeat Efficiency

OpenClaw's heartbeat feature polls your agent periodically. Optimize this:

# HEARTBEAT.md

## Heartbeat Rules

- Check email: Once every 4 hours (not every 30 min)
- Check calendar: Once per day (morning only)
- GitHub notifications: Once every 2 hours
- Weather: Only if user requested it recently

If nothing needs attention: HEARTBEAT_OK (don't load memory files)

Savings: 80%+ on heartbeat costs

Subagent Isolation

Use subagents for heavy tasks to prevent context pollution:

// Main session: Lightweight coordination
// Subagent: Heavy lifting (browser automation, content generation)

// Main session context: 10k tokens
// Subagent context: 50k tokens (isolated, doesn't pollute main)

Model Selection Per Task

Configure different models for different tasks:

In .openclaw/config/defaults.json:

{
  "models": {
    "heartbeat": "claude-haiku-3.5",
    "main": "claude-sonnet-4",
    "coding": "claude-opus-4",
    "subagents": "claude-sonnet-4"
  }
}

Strategy 8: Rate Limiting & Quotas

Prevent runaway costs with hard limits.

Daily Budget Enforcement

const DAILY_BUDGET = 10.00; // $10/day max
let todaySpend = 0;
let currentDay = new Date().toDateString();

async function callWithBudget(prompt) {
  // Reset the counter when the day rolls over
  const today = new Date().toDateString();
  if (today !== currentDay) {
    currentDay = today;
    todaySpend = 0;
  }
  
  const estimatedCost = estimateTokenCost(prompt);
  
  if (todaySpend + estimatedCost > DAILY_BUDGET) {
    throw new Error("Daily budget exceeded. Try again tomorrow.");
  }
  
  const response = await claude.complete(prompt);
  todaySpend += actualCost(response);
  
  return response;
}

User Quotas

const userQuotas = {
  "user123": { limit: 100, used: 45 },
  "user456": { limit: 50, used: 12 }
};

function checkQuota(userId) {
  const quota = userQuotas[userId];
  if (!quota || quota.used >= quota.limit) {
    return { allowed: false, message: "Monthly quota exceeded. Upgrade for more requests." };
  }
  quota.used++;
  return { allowed: true };
}

Strategy 9: Monitoring & Analytics

Track costs to identify waste.

Log Every Request

function logAPICall(model, inputTokens, outputTokens, cost) {
  const entry = {
    timestamp: new Date().toISOString(),
    model,
    inputTokens,
    outputTokens,
    cost
  };
  
  appendToFile("api-usage.jsonl", JSON.stringify(entry) + "\n");
}
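The `cost` field can be derived from token counts using the pricing table from earlier. `PRICES` and `requestCost` are illustrative names:

```javascript
// USD per 1M tokens, from the pricing table above
const PRICES = {
  "claude-opus-4":    { input: 15,   output: 75 },
  "claude-sonnet-4":  { input: 3,    output: 15 },
  "claude-haiku-3.5": { input: 0.25, output: 1.25 }
};

// Dollar cost of a single request
function requestCost(model, inputTokens, outputTokens) {
  const p = PRICES[model];
  return (inputTokens / 1e6) * p.input + (outputTokens / 1e6) * p.output;
}
```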

Daily Cost Reports

#!/bin/bash
# Generate daily cost report

cat api-usage.jsonl | \
  jq -s 'group_by(.model) | map({model: .[0].model, cost: (map(.cost) | add)})' \
  > daily-cost-report.json

Alerts for Anomalies

if (dailyCost > averageDailyCost * 2) {
  sendAlert("⚠️ API costs doubled today. Check for issues.");
}
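A trailing-average check over the JSONL log might look like this (field names match the `logAPICall` entries above; `averageDailyCost` becomes the mean of past days):

```javascript
// Sum cost per calendar day from log entries with ISO timestamps
function dailyTotals(entries) {
  const byDay = {};
  for (const e of entries) {
    const day = e.timestamp.slice(0, 10); // "YYYY-MM-DD"
    byDay[day] = (byDay[day] || 0) + e.cost;
  }
  return byDay;
}

// True when today's spend is more than double the average of past days
function isAnomalous(byDay, today) {
  const past = Object.entries(byDay)
    .filter(([day]) => day !== today)
    .map(([, cost]) => cost);
  if (past.length === 0) return false;
  const avg = past.reduce((a, b) => a + b, 0) / past.length;
  return (byDay[today] || 0) > avg * 2;
}
```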

Real-World Cost Comparison

Before Optimization

Model: Claude Opus 4 (everything)
Context: 50k tokens (all memory files loaded)
Requests: 100/day
Cost: $450/month

After Optimization

Models:
  - 60% Haiku (simple queries)
  - 30% Sonnet (general tasks)
  - 10% Opus (complex reasoning)

Context: 8k tokens average (selective loading)
Requests: 100/day
Cost: $65/month

Savings: 85%
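The "after" figure can be roughly reconstructed from the model mix, counting input tokens only (output tokens, caching, and heartbeats are ignored, so treat this as a sketch):

```javascript
const RATES = { haiku: 0.25, sonnet: 3, opus: 15 };   // USD per 1M input tokens
const MIX   = { haiku: 0.6,  sonnet: 0.3, opus: 0.1 }; // share of traffic

// Traffic-weighted input price: ≈ $2.55 per 1M tokens
const blendedRate = Object.keys(MIX)
  .reduce((sum, m) => sum + MIX[m] * RATES[m], 0);

// 100 requests/day × 30 days × 8k input tokens = 24M tokens
const inputCost = (100 * 30 * 8000 / 1e6) * blendedRate; // ≈ $61/month
```

which lands in the same ballpark as the $65 figure once output tokens are added.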

Cost Optimization Checklist

  • Implement smart model routing (Haiku → Sonnet → Opus)
  • Reduce context size (load only what's needed)
  • Enable prompt caching for repeated context
  • Cache responses for common queries
  • Batch similar requests
  • Use embeddings for classification tasks
  • Implement RAG for knowledge-base queries
  • Set up daily budget limits
  • Log all API calls for analysis
  • Configure heartbeat for minimal context
  • Use subagents to isolate heavy tasks
  • Consider local models for simple tasks

Tools & Resources

  • Token counters: tiktoken (Python), js-tiktoken (JavaScript)
  • Cost calculators: OpenAI Pricing Calculator
  • Monitoring: Langfuse, Helicone, LangSmith
  • Local models: Ollama, llama.cpp, GPT4All
  • Vector databases: Chroma, Pinecone, Weaviate

Conclusion

Running a self-hosted AI assistant doesn't have to break the bank. With smart model routing, context optimization, caching, and task-specific strategies, you can cut costs by 60-85% while maintaining quality.

Key takeaways:

  • Use the cheapest model that works (Haiku → Sonnet → Opus)
  • Minimize context size—every token costs money
  • Cache responses and enable prompt caching
  • Batch requests when possible
  • Monitor costs and set hard limits
  • Use subagents to isolate expensive operations

Start with the low-hanging fruit (model routing, context reduction), measure impact, then optimize further. Your API bill will thank you.


Next steps: Read OpenClaw Context Window Management and Building LLM Automation Workflows to build efficient AI systems.