WordLlama Detect: The Language of the Token
GitHub | PyPI | HF
If you could ask a token what language it spoke, what do you think it would say? Perhaps it would tell you which languages it is most often used with. Or maybe its embedding would tell you about the semantics it can represent.
Does a token know what language it's speaking? That's what I wanted to find out.
Spoiler: The answer is yes. In the next sections I'll show how WordLlama Detect learned to extract this information from the static token embeddings of Gemma 3.
Language identification is the NLP task of predicting the language of a text. Static embedding models like WordLlama are computationally efficient (CPU-friendly, minimal, vectorized computations), produce space-efficient artifacts (a few MB), and have minimal dependencies. This package brings language identification to real-time and/or low-resource environments, as well as to labeling and filtering large datasets.
WordLlama Detect is 13MB and can predict 70k-100k texts/s on a single thread for 148 languages:
from wldetect import WLDetect
# Load bundled model (no path needed)
wld = WLDetect.load()
# Detect language for single text
lang, confidence = wld.predict("Hello, how are you today?")
# ('eng_Latn', 0.9564036726951599)
The World According to WordLlama
Before we explain how it works, let's take a peek at how the WordLlama Detect model sees tokens.
This figure is a UMAP of token logit vectors from the lookup table. Colors represent top predicted language for each token.
Each point in the plot represents a token: its xy-position comes from a UMAP of that token's logit vector in the WordLlama Detect lookup table, and its color shows the language the token most strongly predicts. Logit vectors that predict similar languages fall closer together, so we see good separation into language clusters, along with macro-structure separated by ISO 15924 script (e.g., Latn, Arab).
To create a plot like this, we need more than just the tokenizer. Tokenizers are trained to represent character sequences from a multilingual vocabulary efficiently, so they do carry some language-discriminating power. But that efficiency comes from finding commonality in the lexicon: more often than not, a token is not exclusive to a single language.
To get the fine structure and language separation, more detailed information is needed: each token must indicate how strongly its presence in a text implies one language versus another. In other words, we need a per-token lookup table.
Intuitively, the lookup table covers the ordinary token vocabulary of a tokenizer, with an "importance score" (logit) for each language. A token more or less represents a sequence of characters, so character sequences that strongly identify a language get high scores, while sequences that never (or rarely) occur in a language get low scores.
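To make this concrete, here is a toy illustration of the idea with made-up logits and a handful of languages (not the real table or the package's internals):

import numpy as np

# A toy per-token lookup table: one logit per language for each token.
languages = ["eng_Latn", "deu_Latn", "fra_Latn", "arb_Arab"]
lookup = {
    "the": np.array([ 4.1, -1.0, -0.5, -6.0]),  # strongly suggests English
    "sch": np.array([-0.5,  3.2, -0.8, -6.0]),  # common German character sequence
    "eau": np.array([-0.8, -1.5,  3.6, -6.0]),  # common French character sequence
}

for token, logits in lookup.items():
    print(token, "→", languages[int(logits.argmax())])
# the → eng_Latn
# sch → deu_Latn
# eau → fra_Latn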
We hypothesize that because the LLM has learned to identify languages, it stores much of this information in its token embeddings. If this is true, it leads to an interesting conclusion: an LLM's token embeddings already know the language of the input text (even before any transformer layers are applied!).
And indeed, this turns out to be the case: learning a projection matrix to extract that information from the Gemma 3 token embeddings is exactly what produces the plot above.
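For the curious, here is a minimal sketch of how a plot like this can be made, assuming the per-token logit lookup table is available as a NumPy array of shape (vocab_size, n_languages). The file name and variable names are hypothetical, not the package's API.

import numpy as np
import umap
import matplotlib.pyplot as plt

logits = np.load("token_logits.npy")           # (vocab_size, n_languages); hypothetical file
keep = np.abs(logits).sum(axis=1) > 0          # drop tokens that carry no signal
X = logits[keep]

xy = umap.UMAP(n_components=2, metric="cosine").fit_transform(X)
top_lang = X.argmax(axis=1)                    # top predicted language per token

plt.scatter(xy[:, 0], xy[:, 1], c=top_lang, s=1, cmap="tab20")
plt.title("UMAP of token logit vectors, colored by top predicted language")
plt.show()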
Unlocking the Embedded Knowledge
To distill the language detection power of the static embeddings, we project each embedding into an L-dimensional vector of language logits (one per language). Additionally, we assume that some tokens are naturally weaker language indicators than others, and we model this separately as a learned per-token weight. By learning a single projection shared across all token embeddings, the model can leverage the generalized language knowledge stored in them.
The model is very simple. Here's how it works:
Input Text
│
▼
┌────────────────┐
│ Tokenize │
└────────────────┘
│
▼
Token IDs: [t₁, t₂, ..., tₙ]
│
▼
┌──────────────────────────┐
│ Lookup Embeddings │ ← Gemma 3 frozen embeddings
└──────────────────────────┘
│
▼
Embeddings: [e₁, e₂, ..., eₙ] (each eᵢ ∈ ℝᵈ)
│
▼
┌──────────────────────────┐
│ Project + Weight │ ← Learned: W, b, {wᵢ}
└──────────────────────────┘
│ ℓᵢ = wᵢ · (W·eᵢ + b)
▼
Logits: [ℓ₁, ℓ₂, ..., ℓₙ] (each ℓᵢ ∈ ℝᴸ)
│
▼
┌──────────────────────────┐
│ Log-Sum-Exp Pool │
└──────────────────────────┘
│ z = log[Σᵢ exp(ℓᵢ)]
▼
Aggregated Logits: z ∈ ℝᴸ
│
▼
┌──────────────────────────┐
│ Softmax │
└──────────────────────────┘
│
▼
Language Probabilities
For each token with embedding eᵢ ∈ ℝᵈ, compute language logits:

ℓᵢ = wᵢ · (W·eᵢ + b)

where W is the learned projection matrix (mapping ℝᵈ → ℝᴸ), b ∈ ℝᴸ is a bias vector, and wᵢ is the learned importance weight of token i.
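A minimal NumPy sketch of this step, with randomly initialized stand-ins for the learned parameters and illustrative sizes (not the actual Gemma 3 dimensions or trained weights):

import numpy as np

vocab, d, L = 50_000, 1024, 148                    # illustrative sizes
E = np.random.randn(vocab, d).astype(np.float32)   # frozen token embeddings
W = np.random.randn(L, d).astype(np.float32)       # learned projection
b = np.zeros(L, dtype=np.float32)                  # learned bias
w = np.ones(vocab, dtype=np.float32)               # learned per-token importance weights

token_ids = np.array([17, 523, 9001])              # hypothetical token ids for one text
e = E[token_ids]                                   # (n_tokens, d)
token_logits = w[token_ids][:, None] * (e @ W.T + b)  # ℓᵢ = wᵢ · (W·eᵢ + b), shape (n_tokens, L)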
The final component is pooling over the tokens. For this, we use log-sum-exp pooling:

z = log[Σᵢ exp(ℓᵢ)]

followed by a softmax over z to produce the language probabilities.
The idea is to let some tokens strongly influence the overall decision while staying softer than simple max pooling, which could easily be tripped up by a single word from one language embedded in a sentence of a different language.
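A short sketch of the pooling and softmax steps in NumPy, using the standard max-shift trick for numerical stability (continuing from the token_logits computed above):

import numpy as np

def logsumexp_pool(token_logits):
    # (n_tokens, n_languages) -> (n_languages,)
    m = token_logits.max(axis=0)
    return m + np.log(np.exp(token_logits - m).sum(axis=0))

def softmax(z):
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

probs = softmax(logsumexp_pool(token_logits))   # language probabilities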
The laurievb/OpenLID-v2 dataset is a good, large-scale dataset for training: it contains over 100M labelled texts in 200 languages. After projection and pooling, a high-gamma focal loss is applied and the model is trained until validation accuracy peaks.
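The training code is not reproduced here, but as a rough sketch of what a high-gamma focal loss looks like, assuming a PyTorch training loop (the gamma value is illustrative):

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=5.0):
    # logits: (batch, n_languages); targets: (batch,) class indices
    log_p = F.log_softmax(logits, dim=-1)
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)  # log prob of the true class
    pt = log_pt.exp()
    # (1 - pt)^gamma down-weights easy examples; a high gamma focuses training on hard ones
    return (-(1.0 - pt) ** gamma * log_pt).mean()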
Why is this Useful?
This is a very fast, low-resource way of doing language detection. The lookup table can be precomputed, so very little computation is needed at inference time (essentially just the log-sum-exp pooling). The lookup table is also sparse, making the resulting artifact very small (13 MB), even at fp32 precision: it can be thresholded to 97% sparsity with almost no loss in performance. Finally, the computations are trivial, and NumPy can perform them quickly on CPU while maintaining good detection performance.
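As a sketch of that idea (illustrative sizes, threshold, and names, not the package internals): zero out small-magnitude entries of the precomputed table, then inference is just a row lookup plus log-sum-exp.

import numpy as np

table = np.random.randn(50_000, 148).astype(np.float32)   # precomputed token→language logits

# Threshold small-magnitude entries to zero (here ~97% sparsity).
cutoff = np.quantile(np.abs(table), 0.97)
table[np.abs(table) < cutoff] = 0.0

def detect(token_ids):
    logits = table[token_ids]                              # (n_tokens, n_languages)
    m = logits.max(axis=0)
    z = m + np.log(np.exp(logits - m).sum(axis=0))         # log-sum-exp pooling
    return int(z.argmax())                                 # index of the most likely language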
Limitations and Future Work
Part of the reason WordLlama Detect can be so compact is that tokenizers are trained to reduce vocabulary size by identifying common elements, so character sequences that overlap between languages share tokens. This makes for an efficient representation of the lexicon but, unlike words, increases the ambiguity of which language a token represents. WordLlama Detect is statistical in nature: detection is more stable over token sequences (e.g., a sentence or paragraph) and becomes poorly defined as the number of tokens approaches 1. This is an inherent limitation of the model, and it is not recommended for single-word detection.
Second, about a third of the tokens in the Gemma 3 vocabulary are simply not good indicators of language. The easiest way to see this is to think of tokens that decode to individual characters. Such tokens are not suited to bias the decision one way or another and, as a result, are effectively garbage-collected by the thresholding step that sparsifies the lookup table.
Finally, there seems to be a bias toward English (false positives), likely stemming from its heavier representation in the training data. I think it will be interesting to compare models likely to have different language distributions (e.g., Qwen, Mistral) and to test multi-tokenizer training.
Conclusions
I hope you've found this work interesting, and that you feel inspired to try your own experiments on LLM token embedding codebooks. If you enjoyed it, throw some GitHub stars my way or upvote this article.