AyutthayaAlpha 2.0 Early Beta

AyutthayaAlpha 2.0 is a Thai–Latin script transliteration Transformer for personal and organization names.

Transliteration — converting text from one writing system to another while preserving how it sounds and respecting cultural conventions — is a core challenge in multilingual NLP. Unlike translation, which focuses on meaning, transliteration aims to maintain pronunciation as well as historical and cultural preferences (Knight & Graehl, 1998; Rosca & Breuel, 2016; Merhav & Ash, 2018; Prabhakar & Pal, 2018).

Thai name romanization is particularly difficult: Thai orthography under-specifies pronunciation, tones are usually dropped in Latin script, and multiple romanization standards and idiosyncratic spellings coexist in real data (Aroonmanakun, 2004; Leung, 2007; Suchato, 2012). The same Thai name can appear under many Latin spellings, depending on domain, period, and personal preference.

AyutthayaAlpha 2.0 treats transliteration as a preference-aware ranking problem over multiple plausible variants. It builds on the original AyutthayaAlpha system (Lauc, 2024) and combines large-scale data curation, a Thai Transliteration Distance metric, and quality-weighted objectives to produce ranked lists of transliterations optimized for both orthographic plausibility and phonetic proximity.


Model description

  • Architecture: custom T5-style encoder–decoder (Raffel et al., 2020)
    • 6 encoder layers, 6 decoder layers
    • Hidden size 512, FFN size 2048
    • 8 attention heads
    • Tied encoder/decoder/output embeddings
  • Tokenizer & vocabulary:
    • 32k SentencePiece BPE trained on Thai–Latin names (Sennrich et al., 2016; Kudo & Richardson, 2018)
    • Thai-biased vocabulary, but with good coverage of Latin name patterns
    • Special task tokens:
      • <latn>: Thai → Latin transliteration
      • <thai>: Latin → Thai back-transliteration
  • Objectives:
    • Quality-weighted cross-entropy over multiple variants per Thai name
    • Per-example weights in [0, 1] from a supervised scoring model
    • Soft-minimum formulation over all valid variants for the same source
  • Directionality:
    • Both directions are trained:
      • Thai → Latin (direction = 0, <latn> prefix)
      • Latin → Thai (direction = 1, <thai> prefix)
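
For orientation, the architecture and tokenizer hyperparameters above correspond to a standard Hugging Face T5 configuration. The snippet below is an illustrative sketch only; the released checkpoint ships with its own config, so you normally never need to construct one by hand.

from transformers import T5Config

# Illustrative config mirroring the hyperparameters listed above; the actual
# checkpoint config may differ in details such as dropout or relative-position buckets.
config = T5Config(
    vocab_size=32_000,         # 32k SentencePiece BPE vocabulary
    d_model=512,               # hidden size
    d_ff=2048,                 # feed-forward size
    num_layers=6,              # encoder layers
    num_decoder_layers=6,      # decoder layers
    num_heads=8,               # attention heads
    tie_word_embeddings=True,  # tied input/output embeddings
)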

For full details, see the accompanying AyutthayaAlpha 2.0 paper (Lauc, Rutherford, & Wongwarawipatr, 2025, in preparation).


Intended use

Primary use cases

  • Thai → Latin romanization of personal and organization names for:
    • cross-lingual IR and entity linking,
    • digital humanities and historical corpora,
    • academic and governmental data pipelines.
  • Latin → Thai reconstruction for:
    • back-transliteration of romanized names,
    • deduplication and record linkage across systems.

What the model is not for

  • General-purpose Thai–English translation
  • Full-sentence translation or transliteration of arbitrary text
  • High-stakes legal decisions without human review (e.g., passports)

Input and output format

Important: every input must begin with a special token that tells the model the transliteration direction.

  • Thai → Latin (romanization)
    • Input: <latn> followed by a Thai name
    • Example input string: "<latn>สมชาย"
    • Output (example): "Somchai"
  • Latin → Thai (back-transliteration)
    • Input: <thai> followed by a Latin-script name
    • Example input string: "<thai>somchai"
    • Output (example): "สมชาย"

Maximum total sequence length used in training was 128 subword tokens (source + target); typical names are much shorter.
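
As an optional sanity check, you can inspect how the tokenizer splits an input carrying a direction token; this is purely diagnostic and not required for inference.

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("davor/ayutthayaalpha-2.0-early-beta")
for prefix in ("<latn>", "<thai>"):
    # Show the subword pieces produced for a prefixed example name.
    print(prefix, "->", tok.tokenize(prefix + "สมชาย"))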


Quickstart

from transformers import AutoTokenizer, T5ForConditionalGeneration

model_id = "davor/ayutthayaalpha-2.0-early-beta"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)


def translit_thai_to_latin(name_thai: str) -> str:
    text = "<latn>" + name_thai
    inputs = tokenizer(text, return_tensors="pt")
    inputs.pop("token_type_ids", None)  # some tokenizers add this key
    output = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=32,
        min_length=2,
        num_beams=5,
        do_sample=False,
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)


def translit_latin_to_thai(name_latin: str) -> str:
    text = "<thai>" + name_latin
    inputs = tokenizer(text, return_tensors="pt")
    inputs.pop("token_type_ids", None)
    output = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=32,
        min_length=2,
        num_beams=5,
        do_sample=False,
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)


print(translit_thai_to_latin("สมชาย"))
print(translit_latin_to_thai("khlai khampraphan"))

Getting ranked alternatives

Since AyutthayaAlpha 2.0 is trained with multiple variants and a ranking objective, it is natural to ask for several candidates:

def ranked_transliterations_thai_to_latin(name_thai: str, k: int = 5):
    text = "<latn>" + name_thai
    inputs = tokenizer(text, return_tensors="pt")
    inputs.pop("token_type_ids", None)
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=32,
        num_beams=k,
        num_return_sequences=k,
        do_sample=False,
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]


print(ranked_transliterations_thai_to_latin("ทิพามณี", k=5))
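
If you also want a score per candidate, generate() can return length-normalized beam scores via its standard return_dict_in_generate and output_scores options. The helper below is a sketch along those lines; the function name is just an example.

def scored_transliterations_thai_to_latin(name_thai: str, k: int = 5):
    text = "<latn>" + name_thai
    inputs = tokenizer(text, return_tensors="pt")
    inputs.pop("token_type_ids", None)
    out = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=32,
        num_beams=k,
        num_return_sequences=k,
        do_sample=False,
        return_dict_in_generate=True,
        output_scores=True,
    )
    texts = [tokenizer.decode(s, skip_special_tokens=True) for s in out.sequences]
    # sequences_scores holds the length-normalized log-probability of each beam.
    return list(zip(texts, out.sequences_scores.tolist()))


print(scored_transliterations_thai_to_latin("ทิพามณี", k=5))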

Data and quality scoring (summary)

Training data comes from a large aggregation of Thai–Latin name pairs:

  • Commercial APIs (Microsoft Azure Translator; Google Cloud Translation)
  • Co-occurrence statistics from multilingual name databases and registries (Lauc, 2024)
  • Structured knowledge bases (Wikidata, Wikipedia, ORCID, VIAF)
  • Thai business registries and Kaggle corpora
  • Community dictionaries and GeoNames place names
  • Rule-based baselines using Royal Thai General System of Transcription (RTGS)-style mappings via PyThaiNLP (Phatthiyaphaibun et al., 2023)

Each Thai name is associated with multiple candidate Latin spellings. A supervised quality scoring model (LightGBM; Ke et al., 2017) assigns each pair a continuous quality score in [0, 1] using:

  • Thai Transliteration Distance (TTD): a Thai-specific weighted edit distance over phonetic units,
  • Frequency and co-occurrence features: how often a pair appears across corpora,
  • Syllable alignment and diversity features: how well Thai and Latin syllabification line up, and how many sources support the pair.
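
To make the scoring step concrete, a scorer of this kind can be set up as an ordinary gradient-boosted regressor. The sketch below uses toy data and hypothetical feature columns; it illustrates the general recipe, not the actual scoring model.

import lightgbm as lgb
import numpy as np

# Toy stand-ins for per-pair features: [TTD, log frequency,
# syllable-alignment score, number of supporting sources].
rng = np.random.default_rng(0)
X = rng.random((500, 4))
y = rng.random(500)  # stand-in quality targets in [0, 1]

scorer = lgb.LGBMRegressor(n_estimators=200, learning_rate=0.05)
scorer.fit(X, y)

# Predicted scores would then serve as per-example weights in the loss.
weights = np.clip(scorer.predict(X), 0.0, 1.0)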

These scores enter the loss as per-example weights in a quality-weighted soft-minimum cross-entropy objective, so that the model:

  • learns from diverse but noisy supervision,
  • emphasizes high-quality variants,
  • and treats transliteration as a multi-modal, ranking-aware problem.
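
For intuition, one way to write a quality-weighted soft minimum over per-variant losses is sketched below; the exact formulation used in training is described in the paper, and the temperature tau here is a hypothetical knob.

import torch

def quality_weighted_soft_min(variant_nll: torch.Tensor,
                              quality: torch.Tensor,
                              tau: float = 1.0) -> torch.Tensor:
    # variant_nll: cross-entropy of each valid variant for one source name, shape [V]
    # quality:     per-variant weights in [0, 1] from the scoring model, shape [V]
    # Weighted soft minimum: -tau * log(sum_i q_i * exp(-nll_i / tau)).
    # As tau -> 0 the loss approaches that of the best-supported variant, so the
    # model only needs to get at least one high-quality spelling right.
    log_q = torch.log(quality.clamp_min(1e-8))
    return -tau * torch.logsumexp(log_q - variant_nll / tau, dim=-1)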

Limitations

  • Early beta checkpoint: hyperparameters, filtering, and data coverage are still being tuned.
  • Domain bias: the model is optimized for names; performance may drop on common nouns or arbitrary phrases.
  • Romanization standards: the model reflects the mixture of conventions in the training data (RTGS, real-world usage, aesthetics). It does not enforce strict RTGS by default.
  • Bias & coverage: like any data-driven model, outputs may encode biases and gaps present in the underlying datasets.

Use human review for high-stakes, irreversible decisions.


Citation

If you use AyutthayaAlpha 2.0 Early Beta in research, please cite the accompanying paper; the entry below will be updated once the preprint is public. Placeholder citation:

Lauc, D., Rutherford, A., & Wongwarawipatr, W. (2025). AyutthayaAlpha 2.0: A Thai–Latin Script Transliteration Transformer. Manuscript in preparation.

You may also cite the Transformers library (Wolf et al., 2020) and SentencePiece (Kudo & Richardson, 2018).

Model size: 73.5M parameters (Safetensors, F32)