2025 · Low-resource NLP · Pip-installable

everytongue · A Translator for Any Language

A pip-installable tool that fine-tunes a working neural translator for any language from a spreadsheet of sentence pairs, with a low-resource recipe that keeps tiny datasets from collapsing.

Build spec

Base model: facebook/nllb-200-distilled-600M
Flagship: Q'eqchi', 656 Spanish-Q'eqchi' pairs
Metric: chrF++ (character n-gram F-score plus word unigrams and bigrams)
Inference: Beam search, no-repeat n-gram size 3
Distribution: pip install everytongue · Gradio UI

Problem

Google Translate covers about 130 languages, while humans speak roughly 7,000. That leaves thousands of languages, many of them endangered, with no translation tooling at all, even when their communities already have dictionaries. Naive fine-tuning on tiny datasets produces degenerate output, so non-experts cannot bootstrap a translator from the data they already have.

Approach

everytongue starts from NLLB-200-distilled-600M, a 200-language multilingual prior, rather than a bilingual model. For a language NLLB never saw, a registration step adds a new language code as a special token, resizes the embedding matrix, and warm-starts the new embedding by cloning the row of a similar language. Training uses a proper train/val/test split with early stopping and best-checkpoint restore. Inference uses beam search with a no-repeat n-gram constraint to suppress repetition. Evaluation reports chrF++ instead of BLEU.
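A minimal sketch of the split and early-stopping logic described above, in plain Python. The 80/10/10 proportions, the seed, the patience of 3, and the names split_pairs and EarlyStopper are illustrative assumptions, not everytongue's actual settings or API. The validation score is assumed to be higher-is-better, such as chrF++.

```python
import copy
import random


def split_pairs(pairs, val_frac=0.1, test_frac=0.1, seed=13):
    """Shuffle sentence pairs once and cut them into train/val/test lists."""
    items = list(pairs)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split reproducible
    n_val = max(1, int(len(items) * val_frac))
    n_test = max(1, int(len(items) * test_frac))
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test


class EarlyStopper:
    """Track the best validation score and keep a copy of the best state."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best_score = float("-inf")
        self.best_state = None
        self.bad_epochs = 0

    def update(self, score, state):
        """Record one validation result; return True when training should stop."""
        if score > self.best_score:
            self.best_score = score
            self.best_state = copy.deepcopy(state)  # snapshot to restore later
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

After training stops, the model would be reset to best_state before the test set is scored, so the reported number comes from the best checkpoint rather than the last one. When training through the Transformers Trainer, a similar effect is usually obtained with EarlyStoppingCallback together with load_best_model_at_end=True.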

Impact

It turns a spreadsheet of sentence pairs into a deployable translator with a Gradio UI in minutes on a free Colab T4. The flagship example is Q'eqchi', a Mayan language with about 800k speakers that is absent from both Google Translate and NLLB-200. It trains from 656 Spanish-Q'eqchi' pairs, showing that the recipe works on a language the base model has never seen.

Decisions & tradeoffs

NLLB-200 multilingual base over a bilingual model

A 200-language prior already encodes cross-lingual structure, so adapting it to an unseen language beats training a bilingual model from scratch on tiny data. That is what makes useful translation possible from only hundreds of pairs.

chrF++ instead of BLEU

BLEU's word n-gram matching is near-meaningless for morphologically rich, low-resource languages with sparse references. chrF++ scores mostly character n-grams, so it gives partial credit for correct morphology and stays more stable on small test sets.
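To make the partial-credit point concrete, here is a simplified sentence-level character n-gram F-score in plain Python. It is a teaching sketch, not chrF++ itself: it leaves out the word unigram and bigram component and the corpus-level aggregation of the real metric, and simple_chrf is a hypothetical name. The Spanish verb forms are only an illustration of a near-miss inflection.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams after removing spaces."""
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def simple_chrf(hyp, ref, max_n=6, beta=2.0):
    """Average character n-gram precision and recall, combined as F-beta."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
        if not h or not r:
            continue  # one side is too short to have n-grams of this order
        overlap = sum((h & r).values())  # clipped n-gram matches
        precisions.append(overlap / sum(h.values()))
        recalls.append(overlap / sum(r.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


# A wrong inflection of the right verb: word-level matching sees no overlap,
# while the character-level score still rewards the shared stem.
hyp, ref = "caminaron", "caminaban"
word_match = hyp.split() == ref.split()
char_score = simple_chrf(hyp, ref)
```

For actual evaluation, an established implementation is the safer choice. In sacrebleu, for example, the CHRF metric with word_order=2 corresponds to chrF++.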

Warm-start the new-language embedding

Initializing the new token's embedding from a related language, instead of random noise, starts training from a linguistically sensible point. It is a key part of the recipe that avoids the degenerate output of naive fine-tuning.
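The registration and warm-start steps might look roughly like the sketch below, written against the Hugging Face tokenizer and model interfaces. The function name register_language is hypothetical and this is not everytongue's actual code. The choice of donor language is left to the caller, and the codes in the usage comment ("kek_Latn" for Q'eqchi', "spa_Latn" as donor) are placeholder assumptions.

```python
def register_language(tokenizer, model, new_code, donor_code):
    """Add a language-code token and copy a donor language's embedding into it.

    Assumes `tokenizer` and `model` follow the Hugging Face interfaces
    (add_tokens, convert_tokens_to_ids, resize_token_embeddings,
    get_input_embeddings). Returns the id of the new token.
    """
    donor_id = tokenizer.convert_tokens_to_ids(donor_code)
    if donor_id == tokenizer.unk_token_id:
        # A mistyped donor code would otherwise silently clone the <unk> row.
        raise ValueError(f"donor code {donor_code!r} is not in the vocabulary")

    # Register the new code as a special token so it is never split.
    tokenizer.add_tokens([new_code], special_tokens=True)
    new_id = tokenizer.convert_tokens_to_ids(new_code)

    # Grow the embedding matrix to cover the enlarged vocabulary.
    model.resize_token_embeddings(len(tokenizer))

    # Warm start: overwrite the freshly added row with a copy of the donor row.
    # Writing through .data keeps the copy out of the autograd graph.
    weights = model.get_input_embeddings().weight.data
    weights[new_id] = weights[donor_id].clone()
    return new_id


# Usage sketch (placeholder codes, not confirmed by the project):
# new_id = register_language(tokenizer, model, "kek_Latn", "spa_Latn")
```

The guard on the donor code matters because an unknown token resolves to the unknown-token id, which would defeat the point of starting from a related language.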

System notes

Registers brand-new language tokens and warm-starts their embeddings from the closest known NLLB language
chrF++ chosen over BLEU as the headline metric for morphologically rich languages
Beam search plus no-repeat-ngram prevents the degenerate repetition naive fine-tuning produces
Runs on CPU, Apple Silicon, and CUDA, auto-selecting the best device; trains from as few as 20 pairs
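As an illustration of the decoding note above, the sketch below shows what a no-repeat n-gram constraint does at each step, plus a set of generation settings of the kind that could be passed to a Transformers generate call. The helper names are hypothetical, and the beam width of 4 and the 128-token limit are assumptions; only the n-gram size of 3 comes from the build spec.

```python
def banned_next_tokens(generated, n=3):
    """Return tokens that would complete an n-gram already present in `generated`."""
    if n < 2:
        raise ValueError("n must be at least 2")
    if len(generated) < n - 1:
        return set()
    # The last n-1 tokens form the prefix of the next potential n-gram.
    prefix = tuple(generated[len(generated) - (n - 1):])
    banned = set()
    for i in range(len(generated) - n + 1):
        if tuple(generated[i:i + n - 1]) == prefix:
            banned.add(generated[i + n - 1])  # this token would repeat an n-gram
    return banned


def generation_kwargs(target_lang_id, num_beams=4, max_new_tokens=128):
    """Keyword arguments for a Transformers generate call (assumed values)."""
    return {
        "forced_bos_token_id": target_lang_id,  # start decoding in the target language
        "num_beams": num_beams,                 # beam search instead of greedy decoding
        "no_repeat_ngram_size": 3,              # from the build spec
        "max_new_tokens": max_new_tokens,
        "early_stopping": True,
    }
```

With token ids [5, 6, 7, 5, 6] and n=3, the sequence already contains the trigram (5, 6, 7), so 7 is banned as the next token. During beam search the banned tokens have their scores set to negative infinity, which is what stops the loops that tiny fine-tuning sets tend to produce.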

Stack

NLLB-200 · Transformers · PyTorch · chrF++ · Gradio · PyPI

View source on GitHub
