AyutthayaAlpha 2.0 Early Beta
AyutthayaAlpha 2.0 is a Thai–Latin script transliteration Transformer for personal and organization names.
Transliteration — converting text from one writing system to another while preserving how it sounds and respecting cultural conventions — is a core challenge in multilingual NLP. Unlike translation, which focuses on meaning, transliteration aims to maintain pronunciation as well as historical and cultural preferences (Knight & Graehl, 1998; Rosca & Breuel, 2016; Merhav & Ash, 2018; Prabhakar & Pal, 2018).
Thai name romanization is particularly difficult: Thai orthography under-specifies pronunciation, tones are usually dropped in Latin script, and multiple romanization standards and idiosyncratic spellings coexist in real data (Aroonmanakun, 2004; Leung, 2007; Suchato, 2012). The same Thai name can appear under many Latin spellings, depending on domain, period, and personal preference.
AyutthayaAlpha 2.0 treats transliteration as a preference-aware ranking problem over multiple plausible variants. It builds on the original AyutthayaAlpha system (Lauc, 2024) and combines large-scale data curation, a Thai Transliteration Distance metric, and quality-weighted objectives to produce ranked lists of transliterations optimized for both orthographic plausibility and phonetic proximity.
Model description
- Architecture: custom T5-style encoder–decoder (Raffel et al., 2020); see the config sketch after this list
  - 6 encoder layers, 6 decoder layers
  - Hidden size 512, FFN size 2048
  - 8 attention heads
  - Tied encoder/decoder/output embeddings
- Tokenizer & vocabulary:
  - 32k SentencePiece BPE trained on Thai–Latin names (Sennrich et al., 2016; Kudo & Richardson, 2018)
  - Thai-biased vocabulary, but with good coverage of Latin name patterns
- Special task tokens:
  - `<latn>`: Thai → Latin transliteration
  - `<thai>`: Latin → Thai back-transliteration
- Objectives:
  - Quality-weighted cross-entropy over multiple variants per Thai name
  - Per-example weights in [0, 1] from a supervised scoring model
  - Soft-minimum formulation over all valid variants for the same source
- Directionality:
  - Both directions are trained:
    - Thai → Latin (direction = 0, `<latn>` prefix)
    - Latin → Thai (direction = 1, `<thai>` prefix)
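For readers who want to map these hyperparameters onto a standard `transformers` configuration, a rough sketch follows. It is an approximation based on the bullets above, not the exact configuration shipped with the checkpoint; values such as the precise vocabulary size, dropout, and relative-attention settings may differ.

```python
from transformers import T5Config

# Approximate configuration implied by the hyperparameters listed above.
# Not the exact config of the released checkpoint.
config = T5Config(
    vocab_size=32_000,          # ~32k SentencePiece BPE vocabulary
    d_model=512,                # hidden size
    d_ff=2048,                  # feed-forward size
    num_layers=6,               # encoder layers
    num_decoder_layers=6,       # decoder layers
    num_heads=8,                # attention heads
    tie_word_embeddings=True,   # tied encoder/decoder/output embeddings
)
```

In practice, load the real configuration from the Hub with `AutoConfig.from_pretrained("davor/ayutthayaalpha-2.0-early-beta")` rather than reconstructing it by hand.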
For full details, see the accompanying AyutthayaAlpha 2.0 paper (Lauc, Rutherford, & Wongwarawipatr, 2025, in preparation).
Intended use
Primary use cases
- Thai → Latin romanization of personal and organization names for:
  - cross-lingual IR and entity linking,
  - digital humanities and historical corpora,
  - academic and governmental data pipelines.
- Latin → Thai reconstruction for:
  - back-transliteration of romanized names,
  - deduplication and record linkage across systems.
What the model is not for
- General-purpose Thai–English translation
- Full-sentence translation or transliteration of arbitrary text
- High-stakes legal decisions without human review (e.g., passports)
Input and output format
Very important: you must prepend a special token that indicates the direction.
Thai → Latin (romanization)
- Input: `<latn>` followed by a Thai name
- Example input string: `"<latn>สมชาย"`
- Output (example): `"Somchai"`
Latin → Thai (back-transliteration)
- Input: `<thai>` followed by a Latin-script name
- Example input string: `"<thai>somchai"`
- Output (example): `"สมชาย"`
Maximum total sequence length used in training was 128 subword tokens (source + target); typical names are much shorter.
Quickstart
```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_id = "davor/ayutthayaalpha-2.0-early-beta"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)


def translit_thai_to_latin(name_thai: str) -> str:
    text = "<latn>" + name_thai
    inputs = tokenizer(text, return_tensors="pt")
    inputs.pop("token_type_ids", None)  # some tokenizers add this key
    output = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=32,
        min_length=2,
        num_beams=5,
        do_sample=False,
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)


def translit_latin_to_thai(name_latin: str) -> str:
    text = "<thai>" + name_latin
    inputs = tokenizer(text, return_tensors="pt")
    inputs.pop("token_type_ids", None)
    output = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=32,
        min_length=2,
        num_beams=5,
        do_sample=False,
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)


print(translit_thai_to_latin("สมชาย"))
print(translit_latin_to_thai("khlai khampraphan"))
```
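The helpers above handle one name at a time. For larger lists, the same calls can be batched; the function below is a sketch (its name and structure are not part of the released API) and assumes the tokenizer defines a padding token:

```python
def translit_batch_thai_to_latin(names: list[str]) -> list[str]:
    # Prefix every name with the direction token, then pad to a common length.
    texts = ["<latn>" + n for n in names]
    inputs = tokenizer(texts, return_tensors="pt", padding=True)
    inputs.pop("token_type_ids", None)
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=32,
        num_beams=5,
        do_sample=False,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)


print(translit_batch_thai_to_latin(["สมชาย", "ทิพามณี"]))
```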
Getting ranked alternatives
Since AyutthayaAlpha 2.0 is trained with multiple variants and a ranking objective, it is natural to ask for several candidates:
```python
def ranked_transliterations_thai_to_latin(name_thai: str, k: int = 5):
    text = "<latn>" + name_thai
    inputs = tokenizer(text, return_tensors="pt")
    inputs.pop("token_type_ids", None)
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=32,
        num_beams=k,
        num_return_sequences=k,
        do_sample=False,
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]


print(ranked_transliterations_thai_to_latin("ทิพามณี", k=5))
```
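If you also want a numeric score per candidate (for thresholding or downstream re-ranking), beam search in `generate` can return per-sequence scores. The wrapper below is illustrative and uses standard `transformers` options:

```python
def scored_transliterations_thai_to_latin(name_thai: str, k: int = 5):
    # Same beam search as above, but also return each beam's score
    # (length-penalized log-probability; higher is better).
    inputs = tokenizer("<latn>" + name_thai, return_tensors="pt")
    inputs.pop("token_type_ids", None)
    result = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=32,
        num_beams=k,
        num_return_sequences=k,
        do_sample=False,
        return_dict_in_generate=True,
        output_scores=True,
    )
    texts = tokenizer.batch_decode(result.sequences, skip_special_tokens=True)
    scores = result.sequences_scores.tolist()
    return list(zip(texts, scores))


print(scored_transliterations_thai_to_latin("ทิพามณี", k=5))
```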
Data and quality scoring (summary)
Training data comes from a large aggregation of Thai–Latin name pairs:
- Commercial APIs (Microsoft Azure Translator; Google Cloud Translation)
- Co-occurrence statistics from multilingual name databases and registries (Lauc, 2024)
- Structured knowledge bases (Wikidata, Wikipedia, ORCID, VIAF)
- Thai business registries and Kaggle corpora
- Community dictionaries and GeoNames place names
- Rule-based baselines using RTGS-style mappings via PyThaiNLP (Phatthiyaphaibun et al., 2023)
Each Thai name is associated with multiple candidate Latin spellings. A supervised quality scoring model (LightGBM; Ke et al., 2017) assigns each pair a continuous quality score in [0, 1] using:
- Thai Transliteration Distance (TTD): a Thai-specific weighted edit distance over phonetic units (an illustrative sketch follows this list),
- Frequency and co-occurrence features: how often a pair appears across corpora,
- Syllable alignment and diversity features: how well Thai and Latin syllabification line up, and how many sources support the pair.
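TTD itself is defined over Thai phonetic units with weights described in the accompanying paper. As a toy illustration of the general idea only (a weighted Levenshtein distance in which phonetically close units are cheaper to substitute), here is a sketch; the unit segmentation and the cost function are placeholders, not the actual TTD definition:

```python
def weighted_edit_distance(a: list[str], b: list[str], sub_cost) -> float:
    """Toy weighted Levenshtein over pre-segmented phonetic units.

    sub_cost(u, v) should return 0 for identical units, a small cost for
    phonetically close units (e.g. aspirated vs. unaspirated consonants),
    and a larger cost otherwise.
    """
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = float(i)
    for j in range(1, n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + 1.0,                               # deletion
                d[i][j - 1] + 1.0,                               # insertion
                d[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]),  # substitution
            )
    return d[m][n]
```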
The resulting quality scores enter the loss as per-example weights in a quality-weighted soft-minimum cross-entropy objective (sketched below), so that the model:
- learns from diverse but noisy supervision,
- emphasizes high-quality variants,
- and treats transliteration as a multi-modal, ranking-aware problem.
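The exact objective is specified in the accompanying paper. As a rough illustration of the two ingredients named above (per-variant quality weights and a soft minimum over valid variants), one plausible PyTorch-style reading is shown below; the function name, the way the weights enter, and the temperature are assumptions, not the released training code:

```python
import torch

def quality_weighted_softmin_ce(variant_ce: torch.Tensor,
                                quality: torch.Tensor,
                                tau: float = 1.0) -> torch.Tensor:
    """Illustrative only; not the exact objective from the paper.

    variant_ce: (num_variants,) sequence cross-entropy of each valid Latin
                variant of the same Thai source name.
    quality:    (num_variants,) quality weights in [0, 1] from the scoring model.
    tau:        temperature; smaller values approach a hard minimum over the
                variants the quality model trusts most.
    """
    # Treat quality weights as unnormalized priors over variants and take a
    # soft minimum of the cross-entropies via a weighted log-sum-exp.
    log_w = torch.log(quality.clamp_min(1e-6))
    return -tau * torch.logsumexp(log_w - variant_ce / tau, dim=0)
```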
Limitations
- Early beta checkpoint: hyperparameters, filtering, and data coverage are still being tuned.
- Domain bias: the model is optimized for names; performance may drop on common nouns or arbitrary phrases.
- Romanization standards: the model reflects the mixture of conventions in the training data (RTGS, real-world usage, aesthetics). It does not enforce strict RTGS by default.
- Bias & coverage: like any data-driven model, outputs may encode biases and gaps present in the underlying datasets.
Use human review for high-stakes, irreversible decisions.
Citation
If you use AyutthayaAlpha 2.0 Early Beta in research, please cite the accompanying paper (update once the preprint is public). A placeholder citation:
Lauc, D., Rutherford, A., & Wongwarawipatr, W. (2025). AyutthayaAlpha 2.0: A Thai–Latin Script Transliteration Transformer. Manuscript in preparation.
You may also cite the Transformers library (Wolf et al., 2020) and SentencePiece (Kudo & Richardson, 2018).