everytongue: A Translator for Any Language
A pip-installable tool that fine-tunes a working neural translator for any language from a spreadsheet of sentence pairs, with a low-resource recipe that keeps tiny datasets from collapsing.
System architecture
Build spec
- Base model: facebook/nllb-200-distilled-600M
- Flagship: Q'eqchi', 656 Spanish-Q'eqchi' pairs
- Metric: chrF++ (character-level)
- Inference: Beam search, no-repeat-ngram 3
- Distribution: pip install everytongue · Gradio UI
Problem
Google Translate covers about 130 languages while humans speak roughly 7,000, leaving thousands of often endangered languages with zero tooling even when communities already have dictionaries. Naive fine-tuning on tiny datasets produces degenerate output, so non-experts cannot bootstrap a translator from the data they already have.
Approach
everytongue starts from NLLB-200-distilled-600M, a 200-language multilingual prior, rather than a bilingual model. For a language NLLB never saw, a registration step invents a language code, adds it as a special token, resizes the embedding matrix, and warm-starts the new embedding by cloning a similar language's row. Training uses a held-out train/val/test split with early stopping and best-checkpoint restore. Inference uses beam search with a no-repeat-ngram constraint to suppress repetition. Evaluation reports chrF++, a character-level metric, instead of BLEU.
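The split and early-stopping logic described above can be sketched in plain Python. This is a minimal illustration, not everytongue's actual code: `split_pairs`, `EarlyStopper`, the 10% split fractions, the patience value, and the per-epoch scores are all assumptions made up for the example.

```python
import random


def split_pairs(pairs, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle sentence pairs once and cut them into train/val/test lists."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_frac))
    n_test = max(1, int(len(shuffled) * test_frac))
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test


class EarlyStopper:
    """Remember the best validation score and signal when to stop."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best_score = float("-inf")
        self.best_state = None  # stand-in for a saved checkpoint
        self.bad_epochs = 0

    def update(self, score, state):
        """Record one epoch; return True once patience is exhausted."""
        if score > self.best_score:
            self.best_score = score
            self.best_state = state
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Placeholder pairs standing in for a 656-row spreadsheet.
pairs = [(f"source sentence {i}", f"target sentence {i}") for i in range(656)]
train, val, test = split_pairs(pairs)

stopper = EarlyStopper(patience=2)
# Hypothetical validation scores, one per epoch, purely for illustration.
for epoch, score in enumerate([20.0, 27.5, 31.0, 30.2, 29.8, 33.0]):
    if stopper.update(score, state={"epoch": epoch}):
        break

print(len(train), len(val), len(test), stopper.best_state)
```

With these placeholder scores the loop stops after two epochs without improvement and keeps the epoch-2 state, which is what "best-checkpoint restore" would reload. A larger patience would have reached the later, higher score, so the value is a tradeoff between training time and the risk of stopping early.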
Impact
everytongue turns a spreadsheet of sentence pairs into a deployable translator with a Gradio UI in minutes on a free Colab T4. The flagship example is Q'eqchi', a Mayan language with about 800k speakers that is absent from both Google Translate and NLLB-200. It trains from 656 Spanish-Q'eqchi' pairs, showing that the recipe works on a language the base model has never seen.
Decisions & tradeoffs
NLLB-200 multilingual base over a bilingual model
A 200-language prior already encodes cross-lingual structure, so adapting it to an unseen language beats training a bilingual model from scratch on tiny data. That is what makes useful translation possible from only hundreds of pairs.
chrF++ instead of BLEU
BLEU's word n-gram matching is near-meaningless for morphologically rich, low-resource languages with sparse references. Character-level chrF++ gives partial credit for correct morphology and stays stable on small test sets.
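To make the partial-credit argument concrete, here is a simplified character n-gram F-score in the spirit of chrF. It is a sketch under stated assumptions: `chrf_sketch` is a hypothetical helper, it averages precision and recall over character n-gram orders 1 to 6 with beta = 2, and it omits the word n-grams that chrF++ adds. Its numbers will not match a reference implementation such as sacreBLEU exactly.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams after removing whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf_sketch(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF: F-beta over averaged char n-gram precision and recall."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one string is too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


reference = "the translators worked"
near_miss = "the translator works"   # right stems, wrong endings
unrelated = "zzz qqq"

print(chrf_sketch(near_miss, reference))
print(chrf_sketch(unrelated, reference))
```

The near-miss hypothesis shares only one exact word with the reference, yet most of its character n-grams overlap, so it scores far above the unrelated string. That is the behavior that matters when a single inflected word carries most of a sentence's meaning.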
Warm-start the new-language embedding
Initializing a new token's embedding from a related language instead of random noise starts training from a linguistically sensible point. This warm start is the step that avoids the degenerate output of naive fine-tuning.
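The warm start can be illustrated without any framework by treating the embedding matrix as a list of rows. This is a toy sketch, not the project's implementation: `register_language` is a hypothetical helper, `kek_Latn` is an invented code for the new language, and using `spa_Latn` as the donor is only a placeholder choice. With Hugging Face Transformers, the equivalent steps would typically go through the tokenizer's `add_special_tokens` and the model's `resize_token_embeddings` before copying the donor row.

```python
import random


def register_language(vocab, embeddings, new_token, donor_token):
    """Append new_token to the vocabulary and clone the donor's embedding row."""
    if new_token in vocab:
        raise ValueError(f"{new_token} is already registered")
    new_id = len(embeddings)          # the matrix grows by exactly one row
    vocab[new_token] = new_id
    donor_row = embeddings[vocab[donor_token]]
    embeddings.append(list(donor_row))  # independent copy, not a shared reference
    return new_id


rng = random.Random(0)
# Toy vocabulary of language tokens with 4-dimensional embeddings.
vocab = {"spa_Latn": 0, "eng_Latn": 1, "fra_Latn": 2}
embeddings = [[rng.gauss(0, 1) for _ in range(4)] for _ in vocab]

# "kek_Latn" is an invented code; the donor here is only a placeholder.
new_id = register_language(vocab, embeddings, "kek_Latn", donor_token="spa_Latn")

print(new_id, embeddings[new_id] == embeddings[vocab["spa_Latn"]])
```

Because the new row is a copy, fine-tuning can move it away from the donor without changing the donor language's own embedding.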
System notes
- Registers brand-new language tokens and warm-starts their embeddings from the closest known NLLB language
- chrF++ chosen over BLEU as the headline metric for morphologically rich languages
- Beam search plus no-repeat-ngram prevents the degenerate repetition naive fine-tuning produces
- Runs on CPU, Apple Silicon, and CUDA, auto-selecting the best device; trains from as few as 20 pairs
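The no-repeat-ngram constraint mentioned in the notes above can be sketched as a function that, at each decoding step, lists the tokens that would complete an n-gram already present in the output. `banned_next_tokens` is a hypothetical helper written for illustration. In Hugging Face Transformers the same behavior is normally requested through the `no_repeat_ngram_size` and `num_beams` arguments of `generate`, and whether everytongue sets them exactly that way is an assumption.

```python
def banned_next_tokens(generated, ngram_size=3):
    """Return tokens that would repeat an n-gram already in `generated`."""
    if ngram_size < 2:
        raise ValueError("ngram_size must be at least 2 in this sketch")
    if len(generated) < ngram_size:
        return set()  # no complete n-gram exists yet
    # The last (n - 1) tokens form the prefix of the next n-gram.
    prefix = tuple(generated[-(ngram_size - 1):])
    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        if tuple(generated[i:i + ngram_size - 1]) == prefix:
            # This earlier n-gram starts with the same prefix, so block its last token.
            banned.add(generated[i + ngram_size - 1])
    return banned


# Token strings are placeholders; a real decoder works on token ids.
history = ["a", "b", "c", "a", "b"]
print(banned_next_tokens(history, ngram_size=3))
```

Here the output already contains the trigram `a b c` and currently ends in `a b`, so `c` is blocked as the next token. A decoder would set the score of each banned token to negative infinity before picking the next beam candidates, which is what breaks the repetition loops that small fine-tuning sets tend to produce.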
Stack
NLLB-200 · Transformers · PyTorch · chrF++ · Gradio · PyPI