Clawist · Guide · 8 min read · By Lin

AI Voice Assistant: Build Your Own with Whisper and Claude

Voice control adds a new dimension to AI assistants. Instead of typing, you speak naturally and hear responses—hands-free, eyes-free, perfect for when you're cooking, driving, or just want a more natural interaction.

Building a custom voice assistant using Whisper (speech-to-text) and Claude (intelligence) gives you the best of both worlds: state-of-the-art voice recognition with sophisticated AI understanding.

This guide shows you how to build it, whether running locally or in the cloud.

Why Build Your Own Voice AI?

Compared to Alexa/Siri/Google:

  • More intelligent responses (Claude is smarter)
  • No wake word required (if you prefer)
  • Complete privacy option (run entirely locally)
  • Custom personality and capabilities
  • Integration with your specific systems

Trade-offs:

  • More setup than commercial products
  • May have higher latency
  • No built-in hardware ecosystem
  • Requires maintenance

For users who value intelligence over convenience, a custom voice assistant excels.

Architecture Overview

Core components:

  1. Microphone input - Captures your voice
  2. Whisper (STT) - Converts speech to text
  3. Claude - Processes request and generates response
  4. Piper (TTS) - Converts text to speech
  5. Speaker output - Plays the response

Flow:

You speak → Microphone → Whisper → Text
Text → Claude → Response text
Response → Piper → Audio → Speaker

Each component can run locally or in the cloud depending on your preferences.
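The flow above can be sketched as a small pipeline of pluggable stages. The stub functions here are just placeholders showing the contract each stage must satisfy; the real Whisper, Claude, and Piper implementations slot in during the following steps.

```python
def run_pipeline(stt, llm, tts, audio):
    """Run one voice turn: audio in, audio out."""
    text = stt(audio)    # speech-to-text (Whisper)
    reply = llm(text)    # intelligence (Claude)
    return tts(reply)    # text-to-speech (Piper)

# Stub stages standing in for the real components
audio_out = run_pipeline(
    stt=lambda a: "what time is it",
    llm=lambda t: f"You asked: {t}",
    tts=lambda r: r.encode(),   # a real TTS stage returns audio bytes
    audio=b"\x00\x00",
)
print(audio_out)  # b'You asked: what time is it'
```

Because each stage is just a function, you can swap a local component for a cloud one without touching the rest of the loop.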

Step 1: Set Up Speech-to-Text with Whisper

Option A: Local Whisper (privacy-focused)

pip install faster-whisper

The reference openai-whisper package (pip install openai-whisper) also works, but faster-whisper is considerably faster on CPU, and the examples below use it.
Create a transcription script:

from faster_whisper import WhisperModel

model = WhisperModel("base", device="cpu")  # or "cuda" for GPU

def transcribe(audio_file):
    segments, info = model.transcribe(audio_file)
    return " ".join([segment.text for segment in segments])

Option B: Cloud Whisper (easier setup)

from openai import OpenAI

client = OpenAI()

def transcribe(audio_file):
    with open(audio_file, "rb") as f:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=f
        )
    return transcript.text

Model size trade-offs:

Model    Size     Speed    Accuracy
tiny     75 MB    Fast     Good
base     140 MB   Medium   Better
small    460 MB   Slower   Great
medium   1.5 GB   Slow     Excellent
large    2.9 GB   Slowest  Best

For real-time use, base or small offers a good balance of speed and accuracy.

Step 2: Add Claude for Intelligence

Connect Whisper output to Claude:

from anthropic import Anthropic

client = Anthropic()
conversation = []

def process_voice_input(transcript):
    conversation.append({
        "role": "user",
        "content": transcript
    })
    
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=500,  # Keep responses concise for voice
        system="""You are a voice assistant. Keep responses brief 
                  and conversational. Avoid markdown formatting.""",
        messages=conversation
    )
    
    assistant_message = response.content[0].text
    conversation.append({
        "role": "assistant",
        "content": assistant_message
    })
    
    return assistant_message
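One caveat with the conversation list above: it grows with every turn, and eventually every request carries the entire session. A simple trimming helper keeps the context bounded; the max_turns cutoff here is an assumption to tune, not anything the Anthropic API requires.

```python
def trim_history(conversation, max_turns=10):
    """Keep only the last `max_turns` messages, always starting on a
    user message so the API sees a valid user/assistant sequence."""
    trimmed = conversation[-max_turns:]
    # Drop a leading assistant message left over from a cut pair
    while trimmed and trimmed[0]["role"] != "user":
        trimmed = trimmed[1:]
    return trimmed

# Twelve alternating messages, trimmed to the last four
history = [{"role": r, "content": str(i)}
           for i, r in enumerate(["user", "assistant"] * 6)]
print(len(trim_history(history, max_turns=4)))  # 4
```

Call trim_history on the list right before passing it as messages, so long sessions stay fast and cheap.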

Voice-optimized prompting:

  • Request short, spoken-friendly responses
  • Avoid bullet points and formatting
  • Use natural conversational language
  • Acknowledge before long responses

Step 3: Set Up Text-to-Speech with Piper

Install Piper:

pip install piper-tts

Download a voice:

wget https://huggingface.co/rhasspy/piper-voices/resolve/v1.0.0/en/en_US/lessac/medium/en_US-lessac-medium.onnx
wget https://huggingface.co/rhasspy/piper-voices/resolve/v1.0.0/en/en_US/lessac/medium/en_US-lessac-medium.onnx.json

Generate speech:

import subprocess

def speak(text, output_file="response.wav"):
    subprocess.run([
        "piper",
        "--model", "en_US-lessac-medium.onnx",
        "--output_file", output_file
    ], input=text.encode())
    
    # Play the audio (aplay is Linux-specific; use afplay on macOS)
    subprocess.run(["aplay", output_file])

Voice options: Piper has many voices across languages and styles. Preview them at the Piper samples page.
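A long Claude reply delays the first sound out of the speaker, since Piper only starts after it receives the full text. One way around this is to split the response into sentences and feed them to speak() one at a time, so playback starts on the first sentence. A minimal regex splitter (an illustration, not anything Piper requires):

```python
import re

def split_sentences(text):
    """Split text after sentence-ending punctuation, keeping the punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

for chunk in split_sentences("Sure. The weather is sunny today! Anything else?"):
    print(chunk)
# Sure.
# The weather is sunny today!
# Anything else?
```

In the assistant loop, replace speak(response) with a loop over split_sentences(response).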

Step 4: Build the Complete Pipeline

Complete voice assistant (this reuses process_voice_input from Step 2 and speak from Step 3):

import sounddevice as sd
import numpy as np
from faster_whisper import WhisperModel
from anthropic import Anthropic
import subprocess
import tempfile
import wave

whisper = WhisperModel("base", device="cpu")
claude = Anthropic()
conversation = []

def record_audio(duration=5, sample_rate=16000):
    """Record audio from microphone."""
    print("Listening...")
    audio = sd.rec(
        int(duration * sample_rate),
        samplerate=sample_rate,
        channels=1,
        dtype=np.int16
    )
    sd.wait()
    return audio, sample_rate

def save_audio(audio, sample_rate, filename):
    """Save audio to WAV file."""
    with wave.open(filename, 'wb') as f:
        f.setnchannels(1)
        f.setsampwidth(2)
        f.setframerate(sample_rate)
        f.writeframes(audio.tobytes())

def voice_assistant_loop():
    """Main voice assistant loop."""
    print("Voice assistant ready. Press Ctrl+C to exit.")
    
    while True:
        # Record
        audio, sr = record_audio()
        
        # Transcribe
        with tempfile.NamedTemporaryFile(suffix=".wav") as f:
            save_audio(audio, sr, f.name)
            segments, _ = whisper.transcribe(f.name)
            text = " ".join([s.text for s in segments])
        
        if not text.strip():
            continue
            
        print(f"You: {text}")
        
        # Process with Claude
        response = process_voice_input(text)
        print(f"Assistant: {response}")
        
        # Speak response
        speak(response)

if __name__ == "__main__":
    voice_assistant_loop()
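The fixed five-second recording window is the clumsiest part of the loop: it cuts off long questions and pads short ones. A simple improvement is an energy-based end-of-speech check, recording in short frames and stopping after a run of silent ones. This is a sketch, and the threshold of 500 is a guess you will need to tune for your microphone and room.

```python
import numpy as np

def is_silence(frame, threshold=500):
    """True if an int16 audio frame's RMS energy falls below threshold."""
    rms = np.sqrt(np.mean(frame.astype(np.float64) ** 2))
    return bool(rms < threshold)

# A silent frame versus a loud synthetic tone
quiet = np.zeros(1600, dtype=np.int16)
loud = (np.sin(np.linspace(0, 100, 1600)) * 8000).astype(np.int16)
print(is_silence(quiet), is_silence(loud))  # True False
```

In record_audio, read 100 ms frames in a loop and stop once, say, ten consecutive frames are silent; the concatenated frames then go to Whisper as before.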

Wake Word Detection (Optional)

For always-on listening, add wake word detection using libraries like Porcupine or OpenWakeWord:

Using Porcupine (commercial, high quality):

import pvporcupine

porcupine = pvporcupine.create(
    access_key="YOUR_ACCESS_KEY",
    keywords=["jarvis", "hey siri", "alexa"]  # Or custom
)

# Feed 16-bit, 16 kHz audio frames of length porcupine.frame_length
keyword_index = porcupine.process(audio_frame)
if keyword_index >= 0:
    # Wake word detected, start recording
    record_and_process()
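Porcupine only accepts fixed-size frames (porcupine.frame_length samples, typically 512 at 16 kHz), so the microphone stream has to be sliced to match. A small helper for that bookkeeping; the 512 default is the typical value, check porcupine.frame_length on your instance.

```python
def frames(samples, frame_length=512):
    """Yield consecutive fixed-size frames from an audio buffer,
    dropping any trailing partial frame, as frame-based detectors require."""
    for start in range(0, len(samples) - frame_length + 1, frame_length):
        yield samples[start:start + frame_length]

# A 1300-sample buffer yields two full 512-sample frames
chunks = list(frames(list(range(1300)), frame_length=512))
print(len(chunks), len(chunks[0]))  # 2 512
```

Each yielded frame then goes straight into porcupine.process.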

Using OpenWakeWord (open source):

from openwakeword import Model

model = Model()

prediction = model.predict(audio_frame)
# Score keys correspond to the loaded wake word models
if prediction["hey_jarvis"] > 0.5:
    record_and_process()

Conclusion

A custom voice assistant combining Whisper and Claude delivers intelligent, private voice control that commercial assistants can't match.

The setup requires more effort than buying an Echo, but you get complete control over privacy, personality, and capabilities.

Start simple:

  1. Get basic STT → Claude → TTS working
  2. Add wake word when comfortable
  3. Integrate with your other systems
  4. Optimize for latency

Voice is just another interface to your AI—build it your way.

FAQ

What about latency?

Local Whisper (base): ~1-2 seconds to transcribe. Claude: ~1-2 seconds for response. TTS: ~0.5 seconds. Total: 3-5 seconds end-to-end, which is acceptable for most uses.

Can I run this entirely locally?

Yes. Use local Whisper, Ollama with Llama 3, and Piper. Quality will be lower than Claude but completely private.

What hardware do I need?

For local processing: 8GB+ RAM, modern CPU. GPU helps significantly for Whisper. For cloud-based: any computer with a microphone.

Can I use this with smart home devices?

Yes, integrate with Home Assistant. Voice commands trigger actions through the HA API.
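To make the Home Assistant integration concrete, a voice command can map to a service call against the HA REST API (POST /api/services/<domain>/<service> with a bearer token). The sketch below only builds the request; the base URL and entity ID are placeholders for your own setup, and sending it is a one-line requests.post.

```python
def ha_service_request(base_url, domain, service, entity_id):
    """Build the URL and JSON payload for a Home Assistant service call."""
    url = f"{base_url}/api/services/{domain}/{service}"
    payload = {"entity_id": entity_id}
    return url, payload

url, payload = ha_service_request(
    "http://homeassistant.local:8123", "light", "turn_on", "light.kitchen")
print(url)      # http://homeassistant.local:8123/api/services/light/turn_on
print(payload)  # {'entity_id': 'light.kitchen'}
```

Wire this into the assistant by having Claude emit a structured intent (domain, service, entity) that your loop translates into one of these calls.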

Why not just use Alexa/Siri?

If they meet your needs, use them. Custom assistants offer: better AI (Claude), complete privacy, and custom integration. Trade-off is setup complexity.