2025 · On-device inference · Benchmarked

Offline Speech Translator

Real-Time On-Device Speech-to-Speech

A fully offline, CPU- and Metal-only speech translation toolkit offering both a cascaded STT-to-TTS pipeline and direct speech-to-speech translation. It runs in real time on an 8GB MacBook Air, and no audio leaves the device.

System architecture

Build spec

Cascaded best: 0.38s round-trip · 0.62 GB RSS
S2S (English to Spanish): 1.9s · 1.39 GB RSS
Speedup: NLLB int8 conversion, 124s to 1.9s (about 60x)
Target: 8GB M1 Air, fanless, CPU/Metal only
Models: Whisper · NLLB-200 · SeamlessM4T · Piper

Problem

Speech translation normally depends on cloud APIs that send audio off-device, require network access, and often assume CUDA GPUs. There was no turnkey pipeline that runs in real time, fully on-device, on commodity 8GB Apple Silicon or x86 CPUs.

Approach

Two configurable pipelines. The cascaded path runs voice-activity detection, then Whisper STT (faster-whisper int8 on CPU, or mlx-whisper on the Apple Silicon GPU via Metal), then an optional local LLM transform, then Kokoro or Piper TTS. The speech-to-speech path runs either end to end via SeamlessM4T or as a lower-RAM cascade of Whisper, NLLB-200 (converted to CTranslate2 int8), and Piper. Shared audio I/O handles mic streaming at 16 kHz, YAML configs select each variant, and a download tool pre-caches model snapshots for offline use.
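As a rough illustration of how a config-selected variant could map onto a chain of stages, here is a minimal sketch. `CascadeConfig`, `build_cascade`, the field names, and the stage registry are all hypothetical, and the stub stages stand in for the real models; the project's actual wiring may differ.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# A stage takes the previous stage's output and returns its own.
Stage = Callable[[object], object]


@dataclass
class CascadeConfig:
    """Hypothetical mirror of one YAML variant (field names are assumptions)."""
    stt: str = "faster-whisper"
    transform: Optional[str] = None  # optional local LLM step
    tts: str = "piper"


def build_cascade(cfg: CascadeConfig, registry: Dict[str, Stage]) -> Stage:
    """Resolve stage names to callables and chain them in order."""
    names = [cfg.stt]
    if cfg.transform:
        names.append(cfg.transform)
    names.append(cfg.tts)
    stages = [registry[name] for name in names]  # KeyError on an unknown name

    def run(audio: object) -> object:
        value = audio
        for stage in stages:
            value = stage(value)
        return value

    return run


# Stub stages so the sketch runs without any model installed.
registry: Dict[str, Stage] = {
    "faster-whisper": lambda audio: "hello world",   # audio -> text
    "uppercase": lambda text: str(text).upper(),     # text -> text
    "piper": lambda text: f"<wav:{text}>",           # text -> audio
}

pipeline = build_cascade(CascadeConfig(transform="uppercase"), registry)
print(pipeline(b"\x00\x00"))
```

Keeping each stage behind a plain callable means a variant is just a list of names, which is one way a config file could swap backends without touching the pipeline code.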

Impact

It achieves a measured 0.38s round-trip for cascaded STT-to-TTS and 1.9s for English-to-Spanish speech-to-speech on a fanless M1 Air. Converting NLLB to CTranslate2 int8 cuts S2S latency from 124s to 1.9s, roughly 60 times faster. The results show that real-time speech translation is feasible entirely on-device, with graceful fallbacks for low-RAM machines.
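Figures like these could be collected with a small harness along the following lines. This is a sketch only: `measure_latency` and `peak_rss_gb` are hypothetical helpers, not the project's benchmark code, and the number of runs and the use of the median are assumptions.

```python
import statistics
import sys
import time


def measure_latency(fn, *args, runs=5, warmup=1):
    """Return the median wall-clock seconds of fn(*args) over several runs."""
    for _ in range(warmup):  # warm caches and lazy model loads first
        fn(*args)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def peak_rss_gb():
    """Peak resident set size of this process in GB (Unix only)."""
    import resource  # not available on Windows

    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    scale = 1 if sys.platform == "darwin" else 1024
    return peak * scale / 1e9


# Stand-in workload; a real run would pass the pipeline's entry point.
latency = measure_latency(time.sleep, 0.01, runs=3)
print(f"median latency: {latency:.3f}s, peak RSS: {peak_rss_gb():.2f} GB")
```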

Decisions & tradeoffs

CTranslate2 int8 over torch fp32 for NLLB

torch fp32 NLLB took 124s, which is unusable for real-time translation; the int8 conversion brought it down to 1.9s with negligible quality loss. That single conversion is what makes on-device speech-to-speech viable.
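For reference, a conversion and translation step of this kind can be sketched with CTranslate2's Transformers converter. The checkpoint name (`facebook/nllb-200-distilled-600M`), the output directory, and the helper names are assumptions, since the source does not say which NLLB-200 size or paths were used. Running it requires `ctranslate2` and `transformers` (plus `torch` for the conversion), and the tokenizer must already be cached for offline use.

```python
from pathlib import Path

# Assumed checkpoint and hypothetical output path; adjust to the real setup.
NLLB_MODEL = "facebook/nllb-200-distilled-600M"
OUT_DIR = Path("models/nllb-200-ct2-int8")


def convert_nllb_int8(model=NLLB_MODEL, out_dir=OUT_DIR):
    """One-time conversion of the Hugging Face checkpoint to CTranslate2 int8."""
    from ctranslate2.converters import TransformersConverter

    converter = TransformersConverter(model)
    return converter.convert(str(out_dir), quantization="int8", force=True)


def translate(text, model_dir=OUT_DIR, src="eng_Latn", tgt="spa_Latn"):
    """Translate one sentence on CPU with the converted int8 model."""
    import ctranslate2
    import transformers

    tokenizer = transformers.AutoTokenizer.from_pretrained(NLLB_MODEL, src_lang=src)
    translator = ctranslate2.Translator(
        str(model_dir), device="cpu", compute_type="int8"
    )
    tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(text))
    # NLLB expects the target language code as the first decoded token.
    results = translator.translate_batch([tokens], target_prefix=[[tgt]])
    out_tokens = results[0].hypotheses[0][1:]  # drop the language code
    return tokenizer.decode(tokenizer.convert_tokens_to_ids(out_tokens))
```

A typical flow would call `convert_nllb_int8()` once while online, then use `translate("Hello world!")` at inference time.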

mlx-whisper alongside faster-whisper

CTranslate2 does not support Apple Metal, so faster-whisper STT is CPU-only on Apple Silicon. mlx-whisper fills the gap by running on the GPU through MLX's Metal backend, and keeping both lets one codebase serve x86 CPUs and Apple Silicon well.
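One way to keep both backends behind a single entry point is to pick by platform and import lazily. The sketch below is an assumption about how that could look: `pick_stt_backend` and `transcribe` are hypothetical names, and the model size (`small`) and the `mlx-community/whisper-small-mlx` repo are illustrative choices, not taken from the project.

```python
import platform


def pick_stt_backend(system=None, machine=None):
    """Choose mlx-whisper on Apple Silicon, faster-whisper everywhere else."""
    system = system or platform.system()
    machine = machine or platform.machine()
    if system == "Darwin" and machine == "arm64":
        return "mlx-whisper"
    return "faster-whisper"


def transcribe(path, backend=None):
    """Transcribe an audio file with whichever backend fits this machine."""
    backend = backend or pick_stt_backend()
    if backend == "mlx-whisper":
        import mlx_whisper  # imported lazily: only installable on Apple Silicon

        result = mlx_whisper.transcribe(
            path, path_or_hf_repo="mlx-community/whisper-small-mlx"
        )
        return result["text"]

    from faster_whisper import WhisperModel

    model = WhisperModel("small", device="cpu", compute_type="int8")
    segments, _info = model.transcribe(path, vad_filter=True)
    return "".join(segment.text for segment in segments)


print(pick_stt_backend())
```

The lazy imports mean an x86 machine never needs `mlx_whisper` installed, and vice versa.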

Cascaded speech-to-speech as the 8GB default

SeamlessM4T needs about 4.5 GB of RAM and thrashes on 8GB machines. The Whisper-to-NLLB-to-Piper cascade stays under 1.4 GB, trading a single end-to-end model for a predictable memory footprint in real-time use.
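A default like this could be chosen automatically from installed RAM. The following is a minimal sketch, not the project's logic: `pick_s2s_path` is a hypothetical helper, and the 12 GB cutoff is an assumed threshold chosen only so that 8GB machines land on the cascade.

```python
import os


def total_ram_gb():
    """Installed physical memory in GiB (Unix only, via sysconf)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024**3


def pick_s2s_path(total_gb=None, min_gb_for_seamless=12.0):
    """Return 'seamless-m4t' only when the machine has enough RAM to spare.

    The 12 GB default is an assumption, not a measured threshold.
    """
    if total_gb is None:
        total_gb = total_ram_gb()
    return "seamless-m4t" if total_gb >= min_gb_for_seamless else "cascade"


print(pick_s2s_path(8.0))   # an 8GB machine gets the low-RAM cascade
print(pick_s2s_path(16.0))  # a 16GB machine can afford the end-to-end model
```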

System notes

Dual STT backends: faster-whisper on CPU and mlx-whisper on the Apple Silicon GPU via Metal, for roughly 2x speedup
CTranslate2 int8 conversion of NLLB-200 yields about 60x speedup, 124s down to 1.9s
Tiered graceful fallbacks: Kokoro to Piper to pyttsx3, and SeamlessM4T to a cascaded path for low-RAM machines
Strict non-goals: no CUDA, no training, no cloud calls, inference-only and air-gapped
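The tiered TTS fallback can be sketched as a first-available search over an ordered list. `first_available` is a hypothetical helper, and the importable module names (`kokoro`, `piper`, `pyttsx3`) are assumptions about how each engine is packaged; a real check would likely also confirm that the voice model files are present on disk.

```python
import importlib.util

# Preferred order from the notes above: Kokoro, then Piper, then pyttsx3.
TTS_ORDER = ["kokoro", "piper", "pyttsx3"]


def first_available(candidates, is_available=None):
    """Return the first candidate whose availability probe succeeds."""
    if is_available is None:
        # Default probe: is a top-level module with this name importable?
        def is_available(name):
            return importlib.util.find_spec(name) is not None

    for name in candidates:
        if is_available(name):
            return name
    raise RuntimeError(f"none of {candidates} is available")


# Simulate a machine where Kokoro is missing but Piper is installed.
print(first_available(TTS_ORDER, lambda name: name != "kokoro"))
```

Passing the probe in as a parameter keeps the ordering logic testable without installing any of the engines.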

Stack

faster-whisper · mlx-whisper · CTranslate2 · NLLB-200 · SeamlessM4T · Piper
