Clawist
🟡 Intermediate · 13 min read · By Lin6

Cost Optimization Guide for Self-Hosted AI Assistants: Run Claude on a Budget

Running a 24/7 AI assistant powered by Claude or GPT-4 is incredible—until you get the API bill. At $15 per million input tokens and $75 per million output tokens (Claude Opus 4 pricing), costs add up fast. A busy assistant processing 50 requests per day can easily rack up $200-500/month.

But with smart optimizations, you can cut costs by 60-80% without sacrificing quality. This guide covers practical strategies for running OpenClaw and other self-hosted AI assistants on a budget.

Understanding AI API Costs

Before optimizing, understand where money goes:

Claude Pricing (as of Feb 2026)

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Speed | Use Case |
|---|---|---|---|---|
| Opus 4 | $15 | $75 | Slow | Complex reasoning, coding |
| Sonnet 4 | $3 | $15 | Fast | General tasks, chat |
| Haiku 3.5 | $0.25 | $1.25 | Very fast | Simple tasks, classification |

Cost Example: Daily Usage

Let's say your AI assistant:

  • Receives 50 messages per day
  • Average 500 tokens input per message (context + prompt)
  • Average 200 tokens output per message

Monthly cost with Opus 4:

Input:  50 msg/day × 30 days × 500 tokens = 750,000 tokens = $11.25
Output: 50 msg/day × 30 days × 200 tokens = 300,000 tokens = $22.50
Total: $33.75/month

Seems reasonable, right? But this assumes short conversations.

With longer context (loading files, memory, tools):

Input:  50 msg × 30 days × 15,000 tokens = 22.5M tokens = $337.50
Output: Same as above = $22.50
Total: $360/month

Now it's expensive. Let's fix that.
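The arithmetic above can be sanity-checked with a few lines. `monthlyCost` is a throwaway helper (not part of OpenClaw); the rates are the Opus 4 prices from the table.

```javascript
const OPUS = { input: 15, output: 75 }; // USD per 1M tokens

// Monthly API cost for a fixed per-message token profile
function monthlyCost(msgsPerDay, inputTokens, outputTokens, prices, days = 30) {
  const inTok = msgsPerDay * days * inputTokens;
  const outTok = msgsPerDay * days * outputTokens;
  return (inTok / 1e6) * prices.input + (outTok / 1e6) * prices.output;
}

console.log(monthlyCost(50, 500, 200, OPUS));    // 33.75
console.log(monthlyCost(50, 15000, 200, OPUS));  // 360
```

Swap in the Sonnet or Haiku rates to see how far model routing alone moves the bill.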

Strategy 1: Smart Model Routing

Use the cheapest model that can handle the task.

Routing Logic

function selectModel(message) {
  // Simple questions → Haiku (cheapest)
  if (isSimpleQuery(message)) {
    return "claude-haiku-3.5";
  }
  
  // Coding, analysis, complex reasoning → Opus
  if (requiresDeepReasoning(message)) {
    return "claude-opus-4";
  }
  
  // Everything else → Sonnet
  return "claude-sonnet-4";
}

function isSimpleQuery(message) {
  const simplePatterns = [
    /^(hi|hello|hey)/i,
    /^(what is|what's)/i,
    /^(status|ping)/i,
    /^(thanks|thank you)/i
  ];
  
  return simplePatterns.some(pattern => pattern.test(message));
}

OpenClaw Model Configuration

In ~/.openclaw/config/models.json:

{
  "default": "claude-sonnet-4",
  "routes": {
    "coding": "claude-opus-4",
    "simple": "claude-haiku-3.5",
    "creative": "claude-sonnet-4"
  }
}

Savings potential: 50-70% by routing 60% of traffic to Sonnet/Haiku

Strategy 2: Context Optimization

The #1 cost driver is large context windows. Every token in context costs money on every request.

Minimize Context Bloat

// Bad: Load everything every time
const context = [
  loadFile("SOUL.md"),      // 5,000 tokens
  loadFile("USER.md"),      // 3,000 tokens
  loadFile("MEMORY.md"),    // 20,000 tokens
  loadFile("AGENTS.md"),    // 8,000 tokens
  loadFile("TOOLS.md"),     // 2,000 tokens
  loadHistory(30)           // 15,000 tokens
];
// Total: 53,000 tokens × $15/M = $0.80 per request

// Good: Load selectively
const context = [
  loadFile("SOUL.md"),      // Always needed
  loadFileIfRelevant("MEMORY.md", message),  // Only if memory required
  loadHistory(5)            // Last 5 messages, not 30
];
// Total: 8,000 tokens × $15/M = $0.12 per request
// 85% cost reduction!

OpenClaw Context Management

OpenClaw has built-in context awareness. Configure in AGENTS.md:

## Context Loading Rules

**Main session (direct chat):**
- Load SOUL.md (always)
- Load USER.md (always)
- Load MEMORY.md (always)
- Load today + yesterday memory files

**Heartbeat sessions:**
- Load SOUL.md only
- Load HEARTBEAT.md (keep this file under 500 tokens)
- NO memory files

**Subagent sessions:**
- Load SOUL.md only
- Load task-specific context passed from main agent
- NO memory files

Dynamic Context Pruning

function buildContext(message, conversationHistory) {
  let budget = 100000; // 100k token budget
  const context = [];
  
  // Always include system prompt (~5k tokens)
  context.push(systemPrompt);
  budget -= 5000;
  
  // Include relevant memory (prioritize recent), capped at 30% of budget
  const relevantMemories = findRelevant(message, budget * 0.3);
  context.push(...relevantMemories);
  budget -= tokenCount(relevantMemories);
  
  // Fill half the remaining budget with history, measured in tokens
  // (not message count), walking back from the newest message
  let historyBudget = budget * 0.5;
  const history = [];
  for (let i = conversationHistory.length - 1; i >= 0; i--) {
    const tokens = tokenCount(conversationHistory[i]);
    if (tokens > historyBudget) break;
    history.unshift(conversationHistory[i]);
    historyBudget -= tokens;
  }
  context.push(...history);
  
  return context;
}

Savings potential: 60-80% by cutting unnecessary context

Strategy 3: Caching & Memoization

Don't recompute identical requests.

Response Caching

const responseCache = new Map();

async function getCachedResponse(message) {
  const cacheKey = hashMessage(message);
  
  if (responseCache.has(cacheKey)) {
    const cached = responseCache.get(cacheKey);
    if (Date.now() - cached.timestamp < 3600000) { // 1 hour TTL
      console.log("Cache hit, saved API call");
      return cached.response;
    }
  }
  
  const response = await callClaude(message);
  responseCache.set(cacheKey, {
    response,
    timestamp: Date.now()
  });
  
  return response;
}
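The snippet above leaves `hashMessage` undefined; any stable string hash works. One possible djb2-style implementation (illustrative, not a security-grade hash):

```javascript
// Deterministic 32-bit string hash for cache keys (djb2 xor variant)
function hashMessage(message) {
  let h = 5381;
  for (let i = 0; i < message.length; i++) {
    h = ((h * 33) ^ message.charCodeAt(i)) >>> 0; // keep it in uint32 range
  }
  return h.toString(16);
}
```

Note this caches only exact-duplicate messages; semantically similar queries still miss.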

Prompt Caching (Claude-Specific)

Claude supports prompt caching for repeated context:

// First request: full price, plus a small cache-write surcharge
const response1 = await claude.complete({
  system: largeSystemPrompt,  // 50,000 tokens at $15/M ≈ $0.75
  messages: [userMessage]
});

// Subsequent requests within the cache TTL (5 minutes by default)
const response2 = await claude.complete({
  system: largeSystemPrompt,  // Cache hit: billed at ~10% of the base input rate
  messages: [userMessage2]
});

Savings: Up to 90% on repeated context within the cache window. Note that cached reads are discounted, not free, and cache writes cost slightly more than a normal request.
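To estimate what caching buys you, model effective input cost as a function of hit rate. `effectiveInputCost` is a hypothetical helper; the 10% cache-read multiplier matches Anthropic's published discount at the time of writing, but verify against current pricing.

```javascript
// Effective input cost given a cache hit rate; cached reads are assumed
// to bill at ~10% of the base input rate (cache-write surcharge ignored)
function effectiveInputCost(tokens, requests, hitRate, pricePerM) {
  const perRequest = (tokens / 1e6) * pricePerM;
  const hits = requests * hitRate;
  const misses = requests - hits;
  return misses * perRequest + hits * perRequest * 0.1;
}

// 50k-token system prompt, 10 requests, 90% of them cache hits:
effectiveInputCost(50000, 10, 0.9, 15); // ≈ $1.43 instead of $7.50 uncached
```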

Strategy 4: Batching

Group similar requests to reduce overhead.

Batch Processing Example

// Bad: Process one at a time
for (const item of items) {
  const result = await processWithAI(item);
  results.push(result);
}
// 100 items = 100 API calls

// Good: Batch together
const batchSize = 10;
for (let i = 0; i < items.length; i += batchSize) {
  const batch = items.slice(i, i + batchSize);
  const prompt = `Process these ${batch.length} items:\n\n${batch.join('\n')}`;
  const batchResults = await processWithAI(prompt);
  results.push(...batchResults);
}
// 100 items = 10 API calls

Caveat: Batching increases per-request token count but reduces number of requests. Test to find optimal batch size.
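The trade-off can be quantified: batching amortizes the fixed prompt overhead across items. The overhead numbers below are hypothetical; measure your own.

```javascript
// Total input tokens for nItems processed at a given batch size,
// assuming a fixed prompt overhead per request (hypothetical figures)
function totalInputTokens(nItems, batchSize, tokensPerItem, overheadPerRequest) {
  const requests = Math.ceil(nItems / batchSize);
  return requests * overheadPerRequest + nItems * tokensPerItem;
}

totalInputTokens(100, 1, 50, 400);  // 45000 tokens across 100 requests
totalInputTokens(100, 10, 50, 400); //  9000 tokens across 10 requests
```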

Strategy 5: Task-Specific Optimization

Text Classification: Use Embeddings

Instead of calling Claude to classify every message:

// Expensive: Claude classification
const category = await claude.complete({
  prompt: `Classify this message: "${message}"\n\nCategories: support, sales, general`
});
// Cost: ~$0.002 per message

// Cheap: Embedding similarity
const embedding = await getEmbedding(message);  // $0.0001
const category = findClosestCategory(embedding, categoryEmbeddings);
// Cost: ~$0.0001 per message
// 95% cheaper!
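`findClosestCategory` can be as simple as cosine similarity over plain number arrays. A minimal sketch, assuming all embeddings share the same dimension:

```javascript
// Cosine similarity between two equal-length vectors
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Pick the category whose embedding is closest to the message embedding
function findClosestCategory(embedding, categoryEmbeddings) {
  let best = null, bestScore = -Infinity;
  for (const [category, catEmbedding] of Object.entries(categoryEmbeddings)) {
    const score = cosine(embedding, catEmbedding);
    if (score > bestScore) { bestScore = score; best = category; }
  }
  return best;
}
```

Precompute one embedding per category once; after that, each classification is a local dot product, not an API call.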

Simple Q&A: Use RAG Instead of Full Context

Retrieval-Augmented Generation (RAG):

  1. Store knowledge in vector database
  2. Retrieve only relevant chunks
  3. Send minimal context to Claude

// Bad: Send entire documentation (50k tokens)
const context = loadEntireDocs();
const answer = await claude.complete({
  system: context,
  messages: [question]
});

// Good: Retrieve relevant sections only (2k tokens)
const relevant = await vectorDB.search(question, { limit: 3 });
const answer = await claude.complete({
  system: relevant.join('\n'),
  messages: [question]
});

Savings: 90%+ on knowledge-base queries
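To see the retrieval step without standing up a vector database, here is a toy `search` that scores chunks by keyword overlap with the question (real systems use embedding similarity, but the interface is the same):

```javascript
// Return the top-`limit` chunks by naive keyword overlap with the question
function search(chunks, question, limit = 3) {
  const qWords = new Set(question.toLowerCase().split(/\W+/).filter(Boolean));
  return chunks
    .map(chunk => ({
      chunk,
      score: chunk.toLowerCase().split(/\W+/).filter(w => qWords.has(w)).length
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, limit)
    .map(r => r.chunk);
}

search(["billing help", "reset your password", "weather today"],
       "how do I reset my password", 1); // ["reset your password"]
```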

Strategy 6: Hybrid Local + Cloud

Run simple tasks locally, send complex ones to Claude.

Local Model for Pre-Processing

// Use local Llama model for intent detection
const intent = await localModel.classify(message);

// Only send complex intents to Claude
if (intent === "complex_reasoning") {
  return await claude.complete(message);
} else if (intent === "simple_lookup") {
  return lookupInDatabase(message);
}

Tools:

  • Ollama: Run Llama 3 locally
  • llama.cpp: Fast local inference
  • GPT4All: Easy local setup

Savings: 40-60% by offloading simple tasks

Strategy 7: OpenClaw-Specific Optimizations

Heartbeat Efficiency

OpenClaw's heartbeat feature polls your agent periodically. Optimize this:

# HEARTBEAT.md

## Heartbeat Rules

- Check email: Once every 4 hours (not every 30 min)
- Check calendar: Once per day (morning only)
- GitHub notifications: Once every 2 hours
- Weather: Only if user requested it recently

If nothing needs attention: HEARTBEAT_OK (don't load memory files)

Savings: 80%+ on heartbeat costs

Subagent Isolation

Use subagents for heavy tasks to prevent context pollution:

// Main session: Lightweight coordination
// Subagent: Heavy lifting (browser automation, content generation)

// Main session context: 10k tokens
// Subagent context: 50k tokens (isolated, doesn't pollute main)

Model Selection Per Task

Configure different models for different tasks:

In .openclaw/config/defaults.json:

{
  "models": {
    "heartbeat": "claude-haiku-3.5",
    "main": "claude-sonnet-4",
    "coding": "claude-opus-4",
    "subagents": "claude-sonnet-4"
  }
}

Strategy 8: Rate Limiting & Quotas

Prevent runaway costs with hard limits.

Daily Budget Enforcement

const DAILY_BUDGET = 10.00; // $10/day max
let todaySpend = 0;
let currentDay = new Date().toDateString();

async function callWithBudget(prompt) {
  // Reset the counter when the day rolls over
  const today = new Date().toDateString();
  if (today !== currentDay) {
    currentDay = today;
    todaySpend = 0;
  }
  
  const estimatedCost = estimateTokenCost(prompt);
  
  if (todaySpend + estimatedCost > DAILY_BUDGET) {
    throw new Error("Daily budget exceeded. Try again tomorrow.");
  }
  
  const response = await claude.complete(prompt);
  todaySpend += actualCost(response);
  
  return response;
}

User Quotas

const userQuotas = {
  "user123": { limit: 100, used: 45 },
  "user456": { limit: 50, used: 12 }
};

function checkQuota(userId) {
  const quota = userQuotas[userId];
  if (!quota || quota.used >= quota.limit) {
    return { allowed: false, message: "Monthly quota exceeded. Upgrade for more requests." };
  }
  quota.used++;
  return { allowed: true };
}

Strategy 9: Monitoring & Analytics

Track costs to identify waste.

Log Every Request

function logAPICall(model, inputTokens, outputTokens, cost) {
  const entry = {
    timestamp: new Date().toISOString(),
    model,
    inputTokens,
    outputTokens,
    cost
  };
  
  appendToFile("api-usage.jsonl", JSON.stringify(entry) + "\n");
}
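The `cost` field can be derived from token counts using the pricing table from earlier. `PRICES` and `requestCost` are illustrative names:

```javascript
// USD per 1M tokens, from the pricing table above
const PRICES = {
  "claude-opus-4":    { input: 15,   output: 75 },
  "claude-sonnet-4":  { input: 3,    output: 15 },
  "claude-haiku-3.5": { input: 0.25, output: 1.25 }
};

// Dollar cost of a single request
function requestCost(model, inputTokens, outputTokens) {
  const p = PRICES[model];
  return (inputTokens / 1e6) * p.input + (outputTokens / 1e6) * p.output;
}
```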

Daily Cost Reports

#!/bin/bash
# Generate daily cost report

cat api-usage.jsonl | \
  jq -s 'group_by(.model) | map({model: .[0].model, cost: (map(.cost) | add)})' \
  > daily-cost-report.json

Alerts for Anomalies

if (dailyCost > averageDailyCost * 2) {
  sendAlert("⚠️ API costs doubled today. Check for issues.");
}
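A trailing-average check over the JSONL log might look like this (field names match the `logAPICall` entries above; `averageDailyCost` becomes the mean of past days):

```javascript
// Sum cost per calendar day from log entries with ISO timestamps
function dailyTotals(entries) {
  const byDay = {};
  for (const e of entries) {
    const day = e.timestamp.slice(0, 10); // "YYYY-MM-DD"
    byDay[day] = (byDay[day] || 0) + e.cost;
  }
  return byDay;
}

// True when today's spend is more than double the average of past days
function isAnomalous(byDay, today) {
  const past = Object.entries(byDay)
    .filter(([day]) => day !== today)
    .map(([, cost]) => cost);
  if (past.length === 0) return false;
  const avg = past.reduce((a, b) => a + b, 0) / past.length;
  return (byDay[today] || 0) > avg * 2;
}
```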

Real-World Cost Comparison

Before Optimization

Model: Claude Opus 4 (everything)
Context: 50k tokens (all memory files loaded)
Requests: 100/day
Cost: $450/month

After Optimization

Models:
  - 60% Haiku (simple queries)
  - 30% Sonnet (general tasks)
  - 10% Opus (complex reasoning)

Context: 8k tokens average (selective loading)
Requests: 100/day
Cost: $65/month

Savings: 85%
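The "after" figure can be roughly reconstructed from the model mix, counting input tokens only (output tokens, caching, and heartbeats are ignored, so treat this as a sketch):

```javascript
const RATES = { haiku: 0.25, sonnet: 3, opus: 15 };   // USD per 1M input tokens
const MIX   = { haiku: 0.6,  sonnet: 0.3, opus: 0.1 }; // share of traffic

// Traffic-weighted input price: ≈ $2.55 per 1M tokens
const blendedRate = Object.keys(MIX)
  .reduce((sum, m) => sum + MIX[m] * RATES[m], 0);

// 100 requests/day × 30 days × 8k input tokens = 24M tokens
const inputCost = (100 * 30 * 8000 / 1e6) * blendedRate; // ≈ $61/month
```

which lands in the same ballpark as the $65 figure once output tokens are added.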

Cost Optimization Checklist

  • Implement smart model routing (Haiku → Sonnet → Opus)
  • Reduce context size (load only what's needed)
  • Enable prompt caching for repeated context
  • Cache responses for common queries
  • Batch similar requests
  • Use embeddings for classification tasks
  • Implement RAG for knowledge-base queries
  • Set up daily budget limits
  • Log all API calls for analysis
  • Configure heartbeat for minimal context
  • Use subagents to isolate heavy tasks
  • Consider local models for simple tasks

Tools & Resources

  • Token counters: tiktoken (Python), js-tiktoken (JavaScript)
  • Cost calculators: OpenAI Pricing Calculator
  • Monitoring: Langfuse, Helicone, LangSmith
  • Local models: Ollama, llama.cpp, GPT4All
  • Vector databases: Chroma, Pinecone, Weaviate

Conclusion

Running a self-hosted AI assistant doesn't have to break the bank. With smart model routing, context optimization, caching, and task-specific strategies, you can cut costs by 60-85% while maintaining quality.

Key takeaways:

  • Use the cheapest model that works (Haiku → Sonnet → Opus)
  • Minimize context size—every token costs money
  • Cache responses and enable prompt caching
  • Batch requests when possible
  • Monitor costs and set hard limits
  • Use subagents to isolate expensive operations

Start with the low-hanging fruit (model routing, context reduction), measure impact, then optimize further. Your API bill will thank you.


Next steps: Read OpenClaw Context Window Management and Building LLM Automation Workflows to build efficient AI systems.