Offline Speech Translator
Real-Time On-Device Speech-to-Speech
A fully offline, CPU- and Metal-only speech translation toolkit that offers both cascaded STT-to-TTS and direct speech-to-speech, running in real time on an 8GB MacBook Air with no audio leaving the device.
System architecture
Build spec
- Cascaded best: 0.38s round-trip · 0.62 GB RSS
- S2S en to es: 1.9s · 1.39 GB RSS
- Speedup: NLLB int8, 124s to 1.9s, about 60x
- Target: 8GB M1 Air, fanless, CPU/Metal only
- Models: Whisper · NLLB-200 · SeamlessM4T · Piper
Problem
Speech translation normally depends on cloud APIs that send your audio off-device and require network access, and self-hosted alternatives often assume CUDA GPUs. There was no turnkey pipeline that runs in real time, fully on-device, on commodity 8GB Apple Silicon or an x86 CPU, with no audio ever leaving the machine.
Approach
There are two configurable pipelines. The cascaded path chains four stages: voice-activity detection, Whisper STT (faster-whisper int8 on CPU, or mlx-whisper on the Apple GPU via Metal), an optional local LLM transform, and Kokoro or Piper TTS. The speech-to-speech path runs either end to end via SeamlessM4T or as a lower-RAM cascade: Whisper, then NLLB-200 (converted to CTranslate2 int8), then Piper. Shared audio I/O handles mic streaming at 16 kHz, YAML configs select each variant, and a download tool pre-caches model snapshots for offline use.
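The cascaded path described above can be sketched as a chain of swappable stages. This is a minimal illustration, not the toolkit's actual interface: the class name `CascadedPipeline`, its field names, and the stage signatures are all hypothetical, and the stages are plain callables so that any STT or TTS backend could be plugged in.

```python
from dataclasses import dataclass
from typing import Callable, Optional


# Hypothetical sketch: names and signatures are illustrative assumptions,
# not the toolkit's real API.
@dataclass
class CascadedPipeline:
    vad: Callable[[bytes], bool]   # True if the chunk contains speech
    stt: Callable[[bytes], str]    # audio chunk -> transcript
    tts: Callable[[str], bytes]    # text -> synthesized audio
    transform: Optional[Callable[[str], str]] = None  # optional local LLM step

    def run(self, chunk: bytes) -> Optional[bytes]:
        """Push one audio chunk through the cascade; None if no speech."""
        if not self.vad(chunk):
            return None
        text = self.stt(chunk)
        if self.transform is not None:
            text = self.transform(text)
        return self.tts(text)
```

Keeping the transform optional mirrors the text: when it is left unset, the transcript goes straight from STT to TTS.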
Impact
It hits a measured 0.38s round-trip for cascaded STT-to-TTS and 1.9s for English-to-Spanish speech-to-speech on a fanless M1 Air. Converting NLLB to CTranslate2 int8 cuts S2S latency from 124s to 1.9s, roughly 60 times faster. Together with graceful fallbacks for low-RAM machines, this shows that real-time speech translation is feasible entirely on-device.
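The source does not say how these latency and RSS figures were collected. As one possible approach, the sketch below times a single call and reads the process's peak resident set size using only the standard library. The helper name `measure` is hypothetical, and the `resource` module is POSIX-only.

```python
import resource
import sys
import time


def measure(fn, *args):
    """Return (result, elapsed_seconds, peak_rss_gb) for one call of fn.

    Hypothetical helper for illustration. Peak RSS covers the whole
    process lifetime so far, not just this call.
    """
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    bytes_per_unit = 1 if sys.platform == "darwin" else 1024
    return result, elapsed, peak * bytes_per_unit / 1e9
```

Because peak RSS never decreases, comparing two pipeline variants fairly would require running each in a fresh process.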
Decisions & tradeoffs
CTranslate2 int8 over torch fp32 for NLLB
NLLB in torch fp32 took 124s, which is unusable for real-time translation; the int8 conversion brought it down to 1.9s with negligible quality loss. That single conversion is what makes on-device speech-to-speech viable.
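For reference, CTranslate2 ships a `ct2-transformers-converter` command that performs this kind of conversion. The sketch below only builds the argument list so it can be inspected before running. The function name is hypothetical, and the checkpoint `facebook/nllb-200-distilled-600M` and the output directory are assumptions, since the source does not say which NLLB-200 size was used.

```python
from typing import List


def ct2_convert_command(model: str, output_dir: str,
                        quantization: str = "int8") -> List[str]:
    """Build the ct2-transformers-converter invocation (hypothetical helper)."""
    return [
        "ct2-transformers-converter",
        "--model", model,                # Hugging Face model ID or local path
        "--output_dir", output_dir,      # where the converted model is written
        "--quantization", quantization,  # int8 weights, as described above
    ]


# Assumed checkpoint and directory names, for illustration only.
cmd = ct2_convert_command("facebook/nllb-200-distilled-600M", "nllb-200-ct2-int8")
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would perform the conversion, provided the `ctranslate2` and `transformers` packages are installed and the checkpoint is already cached locally or the network is reachable for this one-time step.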
mlx-whisper alongside faster-whisper
CTranslate2 does not support Apple Metal, so with faster-whisper alone, STT on Apple Silicon would be CPU-only. mlx-whisper fills the gap by running on the Apple GPU via Metal, and keeping both backends lets one codebase serve x86 CPUs and Apple Silicon well.
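A minimal sketch of how a single codebase might choose between the two STT backends at startup. The function name and the returned labels are hypothetical, and the real toolkit selects variants through YAML configs, so treat this as an illustration of the platform check only.

```python
import platform
from typing import Optional


def pick_stt_backend(system: Optional[str] = None,
                     machine: Optional[str] = None) -> str:
    """Return a backend label for this host (hypothetical helper).

    Arguments default to the current platform and exist mainly so the
    decision can be exercised without the matching hardware.
    """
    system = system or platform.system()
    machine = machine or platform.machine()
    # Apple Silicon Macs report Darwin / arm64.
    if system == "Darwin" and machine == "arm64":
        return "mlx-whisper"
    # Everything else, including Intel Macs, falls back to the CPU path.
    return "faster-whisper"
```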
Cascaded speech-to-speech as the 8GB default
SeamlessM4T needs about 4.5 GB of RAM and thrashes on 8GB machines. The Whisper-to-NLLB-to-Piper cascade stays under 1.4 GB, trading the single end-to-end model for a predictable memory footprint in real-time use.
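The default described above could be expressed as a simple RAM check. This is a sketch under stated assumptions: the 16 GB cutoff is a guess chosen only because the text says 8GB machines thrash, the function names and path labels are hypothetical, and `os.sysconf` is POSIX-only.

```python
import os

# Assumed cutoff, not taken from the source: the text only says that
# SeamlessM4T (about 4.5 GB) thrashes on 8GB machines.
MIN_RAM_FOR_SEAMLESS_GB = 16.0


def total_ram_gb() -> float:
    """Physical memory in GB (POSIX only)."""
    return os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE") / 1e9


def choose_s2s_path(ram_gb: float,
                    min_ram_gb: float = MIN_RAM_FOR_SEAMLESS_GB) -> str:
    """Pick the end-to-end model only when RAM is comfortably large."""
    return "seamless" if ram_gb >= min_ram_gb else "cascade"
```

Calling `choose_s2s_path(total_ram_gb())` at startup would keep 8GB machines on the cascade without user configuration.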
System notes
- Dual STT backends: faster-whisper on CPU and mlx-whisper on the Apple GPU via Metal, for roughly 2x speedup
- CTranslate2 int8 conversion of NLLB-200 yields about 60x speedup, 124s down to 1.9s
- Tiered graceful fallbacks: Kokoro to Piper to pyttsx3, and SeamlessM4T to a cascaded path for low-RAM machines
- Strict non-goals: no CUDA, no training, no cloud calls, inference-only and air-gapped
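The tiered fallbacks listed above follow a common pattern: try each backend in preference order and keep the first one that loads. The sketch below is a generic version of that pattern, not the toolkit's code; `first_available` and the loader callables are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def first_available(loaders: Sequence[Tuple[str, Callable[[], T]]]) -> Tuple[str, T]:
    """Return (name, engine) for the first loader that succeeds.

    Hypothetical helper: each loader is a zero-argument callable that
    imports and initializes one backend, raising on failure.
    """
    errors: List[str] = []
    for name, load in loaders:
        try:
            return name, load()
        except Exception as exc:  # e.g. missing package or model files
            errors.append(f"{name}: {exc}")
    raise RuntimeError("no backend available: " + "; ".join(errors))
```

A TTS chain would pass loaders for Kokoro, Piper, and pyttsx3 in that order; collecting the per-backend errors keeps a total failure easy to diagnose.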
Stack
faster-whisper · mlx-whisper · CTranslate2 · NLLB-200 · SeamlessM4T · Piper