AI Voice Assistant: Build Your Own with Whisper and Claude

Voice control adds a new dimension to AI assistants. Instead of typing, you speak naturally and hear responses—hands-free, eyes-free, perfect for when you're cooking, driving, or just want a more natural interaction.
Building a custom voice assistant using Whisper (speech-to-text) and Claude (intelligence) gives you the best of both worlds: state-of-the-art voice recognition with sophisticated AI understanding.
This guide shows you how to build it, whether running locally or in the cloud.
Why Build Your Own Voice AI?
Custom voice assistants offer capabilities commercial products can't match
Compared to Alexa/Siri/Google:
- More intelligent responses (Claude is smarter)
- No wake word required (if you prefer)
- Complete privacy option (run entirely locally)
- Custom personality and capabilities
- Integration with your specific systems
Trade-offs:
- More setup than commercial products
- May have higher latency
- No built-in hardware ecosystem
- Requires maintenance
For users who value intelligence over convenience, a custom voice assistant excels.
Architecture Overview
How voice AI components work together
Core components:
- Microphone input - Captures your voice
- Whisper (STT) - Converts speech to text
- Claude - Processes request and generates response
- Piper (TTS) - Converts text to speech
- Speaker output - Plays the response
Flow:
You speak → Microphone → Whisper → Text
Text → Claude → Response text
Response → Piper → Audio → Speaker
Each component can run locally or in the cloud depending on your preferences.
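In code, the flow above is just three function calls in sequence. A minimal sketch with pluggable stages (the `stt`, `llm`, and `tts` callables are placeholders for the implementations built in Steps 1–3):

```python
from typing import Callable

def run_pipeline(
    audio_path: str,
    stt: Callable[[str], str],    # speech-to-text: audio file -> transcript
    llm: Callable[[str], str],    # language model: transcript -> response text
    tts: Callable[[str], None],   # text-to-speech: plays the response
) -> str:
    """One turn of the voice loop: transcribe, think, speak."""
    transcript = stt(audio_path)
    response = llm(transcript)
    tts(response)
    return response
```

Because each stage is just a function, you can swap local and cloud implementations independently without touching the rest of the pipeline.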
Step 1: Set Up Speech-to-Text with Whisper
OpenAI's Whisper provides state-of-the-art speech recognition
Option A: Local Whisper (privacy-focused)

```shell
pip install openai-whisper   # reference implementation (optional)
pip install faster-whisper   # faster CTranslate2 port, used below
```

Create a transcription script:

```python
from faster_whisper import WhisperModel

model = WhisperModel("base", device="cpu")  # or "cuda" for GPU

def transcribe(audio_file):
    segments, info = model.transcribe(audio_file)
    return " ".join([segment.text for segment in segments])
```
Option B: Cloud Whisper (easier setup)

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def transcribe(audio_file):
    with open(audio_file, "rb") as f:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=f
        )
    return transcript.text
```
Model size trade-offs:
| Model | Size | Speed | Accuracy |
|---|---|---|---|
| tiny | 75MB | Fast | Good |
| base | 140MB | Medium | Better |
| small | 460MB | Slower | Great |
| medium | 1.5GB | Slow | Excellent |
| large | 2.9GB | Slowest | Best |
For real-time use, base or small offers a good balance of speed and accuracy.
Step 2: Add Claude for Intelligence
Claude's API provides the intelligence layer
Connect Whisper output to Claude:

```python
from anthropic import Anthropic

client = Anthropic()
conversation = []

def process_voice_input(transcript):
    conversation.append({
        "role": "user",
        "content": transcript
    })
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=500,  # keep responses concise for voice
        system="""You are a voice assistant. Keep responses brief
        and conversational. Avoid markdown formatting.""",
        messages=conversation
    )
    assistant_message = response.content[0].text
    conversation.append({
        "role": "assistant",
        "content": assistant_message
    })
    return assistant_message
```
Voice-optimized prompting:
- Request short, spoken-friendly responses
- Avoid bullet points and formatting
- Use natural conversational language
- Acknowledge before long responses
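As an illustration of those guidelines in one place, here is a sketch of a voice-optimized system prompt; the exact wording is just an example, tune it to your assistant's personality:

```python
# An illustrative system prompt encoding the voice-friendly guidelines above.
VOICE_SYSTEM_PROMPT = (
    "You are a voice assistant. Your replies are read aloud, so:\n"
    "- Keep answers to one or two short sentences unless asked for detail.\n"
    "- Never use markdown, bullet points, or headings.\n"
    "- Use natural spoken language; contractions are fine.\n"
    "- Before a long answer, give a brief acknowledgement first, "
    "like 'Sure, here's a quick rundown.'"
)
```

Pass this as the `system` parameter in the Claude call instead of the shorter prompt shown earlier.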
Step 3: Set Up Text-to-Speech with Piper
Piper provides high-quality local text-to-speech
Install Piper:

```shell
pip install piper-tts
```

Download a voice:

```shell
wget https://huggingface.co/rhasspy/piper-voices/resolve/v1.0.0/en/en_US/lessac/medium/en_US-lessac-medium.onnx
wget https://huggingface.co/rhasspy/piper-voices/resolve/v1.0.0/en/en_US/lessac/medium/en_US-lessac-medium.onnx.json
```

Generate speech:

```python
import subprocess

def speak(text, output_file="response.wav"):
    subprocess.run([
        "piper",
        "--model", "en_US-lessac-medium.onnx",
        "--output_file", output_file
    ], input=text.encode())
    # Play the audio (aplay on Linux; use afplay on macOS)
    subprocess.run(["aplay", output_file])
```
Voice options: Piper has many voices across languages and styles. Preview them at the Piper samples page.
Step 4: Build the Complete Pipeline
Combining all components into a working assistant
Complete voice assistant:

```python
import sounddevice as sd
import numpy as np
from faster_whisper import WhisperModel
from anthropic import Anthropic
import subprocess
import tempfile
import wave

# process_voice_input() and speak() are the functions from Steps 2 and 3.

whisper = WhisperModel("base", device="cpu")
claude = Anthropic()
conversation = []

def record_audio(duration=5, sample_rate=16000):
    """Record audio from microphone."""
    print("Listening...")
    audio = sd.rec(
        int(duration * sample_rate),
        samplerate=sample_rate,
        channels=1,
        dtype=np.int16
    )
    sd.wait()
    return audio, sample_rate

def save_audio(audio, sample_rate, filename):
    """Save audio to WAV file."""
    with wave.open(filename, 'wb') as f:
        f.setnchannels(1)
        f.setsampwidth(2)  # 2 bytes per sample for int16
        f.setframerate(sample_rate)
        f.writeframes(audio.tobytes())

def voice_assistant_loop():
    """Main voice assistant loop."""
    print("Voice assistant ready. Press Ctrl+C to exit.")
    while True:
        # Record
        audio, sr = record_audio()
        # Transcribe
        with tempfile.NamedTemporaryFile(suffix=".wav") as f:
            save_audio(audio, sr, f.name)
            segments, _ = whisper.transcribe(f.name)
            text = " ".join([s.text for s in segments])
        if not text.strip():
            continue
        print(f"You: {text}")
        # Process with Claude
        response = process_voice_input(text)
        print(f"Assistant: {response}")
        # Speak response
        speak(response)

if __name__ == "__main__":
    voice_assistant_loop()
```
Wake Word Detection (Optional)
Enable hands-free activation with wake word detection
For always-on listening, add wake word detection using libraries like Porcupine or OpenWakeWord:
Using Porcupine (commercial, high quality):

```python
import pvporcupine

porcupine = pvporcupine.create(
    access_key="YOUR_ACCESS_KEY",
    keywords=["jarvis", "hey siri", "alexa"]  # or custom
)

keyword_index = porcupine.process(audio_frame)
if keyword_index >= 0:
    # Wake word detected, start recording
    record_and_process()
```
Using OpenWakeWord (open source):

```python
from openwakeword import Model

model = Model()
prediction = model.predict(audio_frame)
if prediction["hey_jarvis"] > 0.5:
    record_and_process()
```
Conclusion
Your custom voice assistant delivers intelligence that commercial products can't match
A custom voice assistant combining Whisper and Claude delivers intelligent, private voice control that commercial assistants can't match.
The setup requires more effort than buying an Echo, but you get complete control over privacy, personality, and capabilities.
Start simple:
- Get basic STT → Claude → TTS working
- Add wake word when comfortable
- Integrate with your other systems
- Optimize for latency
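On that last point, the biggest perceived-latency win is to stream Claude's reply (the Messages API supports streaming) and hand each completed sentence to Piper as it arrives, rather than waiting for the full response. A minimal sentence-chunker sketch; the streaming API call itself is omitted:

```python
import re

def sentences_from_stream(chunks):
    """Yield complete sentences as text chunks stream in,
    so TTS can start speaking before the full reply arrives."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # Emit a sentence whenever we see end punctuation plus whitespace.
        while True:
            match = re.search(r"[.!?]\s", buffer)
            if not match:
                break
            sentence = buffer[:match.end()].strip()
            buffer = buffer[match.end():]
            yield sentence
    if buffer.strip():
        yield buffer.strip()  # whatever remains at end of stream
```

Feed each yielded sentence to `speak()` as it comes out, and the assistant starts talking after the first sentence instead of after the whole reply.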
Continue exploring:
- Home automation guide for voice-controlled smart home
- Self-hosted AI for fully local operation
- Personal assistant setup for messaging integration
Voice is just another interface to your AI—build it your way.
FAQ
Common questions about voice AI assistants
What is the latency like?
Local Whisper (base): ~1-2 seconds to transcribe. Claude: ~1-2 seconds for response. TTS: ~0.5 seconds. Total: 3-5 seconds end-to-end, which is acceptable for most uses.
Can I run this entirely locally?
Yes. Use local Whisper, Ollama with Llama 3, and Piper. Quality will be lower than Claude but completely private.
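For the fully local route, the Claude call in Step 2 can be swapped for Ollama's local HTTP API. A sketch assuming Ollama is running at its default address (`http://localhost:11434`; the model name `llama3` is whatever you have pulled):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default chat endpoint

def build_ollama_request(conversation):
    """Build the JSON payload Ollama's /api/chat endpoint expects."""
    return {
        "model": "llama3",          # any model fetched with `ollama pull`
        "messages": conversation,   # same role/content format as the Claude code
        "stream": False,
    }

def ask_ollama(conversation):
    """POST the conversation to a local Ollama server and return the reply text."""
    payload = json.dumps(build_ollama_request(conversation)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]
```

Because the message format matches, `ask_ollama(conversation)` can drop in where the `client.messages.create` call sits in `process_voice_input`.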
What hardware do I need?
For local processing: 8GB+ RAM, modern CPU. GPU helps significantly for Whisper. For cloud-based: any computer with a microphone.
Can I use this with smart home devices?
Yes, integrate with Home Assistant. Voice commands trigger actions through the HA API.
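As a sketch of what that integration looks like: Home Assistant exposes a REST endpoint for calling services, authenticated with a long-lived access token. The base URL, token, and entity ID below are placeholders:

```python
import json
import urllib.request

def build_ha_call(base_url, token, domain, service, entity_id):
    """Build URL, headers, and body for a Home Assistant service call."""
    url = f"{base_url}/api/services/{domain}/{service}"
    headers = {
        "Authorization": f"Bearer {token}",  # long-lived token from your HA profile
        "Content-Type": "application/json",
    }
    body = json.dumps({"entity_id": entity_id}).encode()
    return url, headers, body

def call_ha_service(base_url, token, domain, service, entity_id):
    """Fire the service call, e.g. ('light', 'turn_on', 'light.kitchen')."""
    url, headers, body = build_ha_call(base_url, token, domain, service, entity_id)
    req = urllib.request.Request(url, data=body, headers=headers)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Have Claude map a voice command like "turn off the kitchen light" to a domain, service, and entity, then fire the call.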
Why not just use Alexa/Siri?
If they meet your needs, use them. Custom assistants offer: better AI (Claude), complete privacy, and custom integration. Trade-off is setup complexity.