AyutthayaAlpha 2.0 Early Beta

AyutthayaAlpha 2.0 is a Thai–Latin script transliteration Transformer for personal and organization names.

Transliteration — converting text from one writing system to another while preserving how it sounds and respecting cultural conventions — is a core challenge in multilingual NLP. Unlike translation, which focuses on meaning, transliteration aims to maintain pronunciation as well as historical and cultural preferences (Knight & Graehl, 1998; Rosca & Breuel, 2016; Merhav & Ash, 2018; Prabhakar & Pal, 2018).

Thai name romanization is particularly difficult: Thai orthography under-specifies pronunciation, tones are usually dropped in Latin script, and multiple romanization standards and idiosyncratic spellings coexist in real data (Aroonmanakun, 2004; Leung, 2007; Suchato, 2012). The same Thai name can appear under many Latin spellings, depending on domain, period, and personal preference.

AyutthayaAlpha 2.0 treats transliteration as a preference-aware ranking problem over multiple plausible variants. It builds on the original AyutthayaAlpha system (Lauc, 2024) and combines large-scale data curation, a Thai Transliteration Distance metric, and quality-weighted objectives to produce ranked lists of transliterations optimized for both orthographic plausibility and phonetic proximity.


Model description

  • Architecture: custom T5-style encoder–decoder (Raffel et al., 2020)
    • 6 encoder layers, 6 decoder layers
    • Hidden size 512, FFN size 2048
    • 8 attention heads
    • Tied encoder/decoder/output embeddings
  • Tokenizer & vocabulary:
    • 32k SentencePiece BPE trained on Thai–Latin names (Sennrich et al., 2016; Kudo & Richardson, 2018)
    • Thai-biased vocabulary, but with good coverage of Latin name patterns
    • Special task tokens:
      • <latn>: Thai → Latin transliteration
      • <thai>: Latin → Thai back-transliteration
  • Objectives:
    • Quality-weighted cross-entropy over multiple variants per Thai name
    • Per-example weights in [0, 1] from a supervised scoring model
    • Soft-minimum formulation over all valid variants for the same source
  • Directionality:
    • Both directions are trained:
      • Thai → Latin (direction = 0, <latn> prefix)
      • Latin → Thai (direction = 1, <thai> prefix)
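
For orientation, the architecture and tokenizer hyperparameters above correspond to a standard Hugging Face T5 configuration. The snippet below is an illustrative sketch only; the released checkpoint ships with its own config, so you normally never need to construct one by hand.

from transformers import T5Config

# Illustrative config mirroring the hyperparameters listed above; the actual
# checkpoint config may differ in details such as dropout or relative-position buckets.
config = T5Config(
    vocab_size=32_000,         # 32k SentencePiece BPE vocabulary
    d_model=512,               # hidden size
    d_ff=2048,                 # feed-forward size
    num_layers=6,              # encoder layers
    num_decoder_layers=6,      # decoder layers
    num_heads=8,               # attention heads
    tie_word_embeddings=True,  # tied input/output embeddings
)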

For full details, see the accompanying AyutthayaAlpha 2.0 paper (Lauc, Rutherford, & Wongwarawipatr, 2025, in preparation).


Intended use

Primary use cases

  • Thai → Latin romanization of personal and organization names for:
    • cross-lingual IR and entity linking,
    • digital humanities and historical corpora,
    • academic and governmental data pipelines.
  • Latin → Thai reconstruction for:
    • back-transliteration of romanized names,
    • deduplication and record linkage across systems.

What the model is not for

  • General-purpose Thai–English translation
  • Full-sentence translation or transliteration of arbitrary text
  • High-stakes legal decisions without human review (e.g., passports)

Input and output format

Important: every input must begin with a special token that tells the model the transliteration direction.

  • Thai → Latin (romanization)
    • Input: <latn> followed by a Thai name
    • Example input string: "<latn>สมชาย"
    • Output (example): "Somchai"
  • Latin → Thai (back-transliteration)
    • Input: <thai> followed by a Latin-script name
    • Example input string: "<thai>somchai"
    • Output (example): "สมชาย"

Maximum total sequence length used in training was 128 subword tokens (source + target); typical names are much shorter.
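
As an optional sanity check, you can inspect how the tokenizer splits an input carrying a direction token; this is purely diagnostic and not required for inference.

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("davor/ayutthayaalpha-2.0-early-beta")
for prefix in ("<latn>", "<thai>"):
    # Show the subword pieces produced for a prefixed example name.
    print(prefix, "->", tok.tokenize(prefix + "สมชาย"))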


Quickstart

from transformers import AutoTokenizer, T5ForConditionalGeneration

model_id = "davor/ayutthayaalpha-2.0-early-beta"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)


def translit_thai_to_latin(name_thai: str) -> str:
    text = "<latn>" + name_thai
    inputs = tokenizer(text, return_tensors="pt")
    inputs.pop("token_type_ids", None)  # some tokenizers add this key
    output = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=32,
        min_length=2,
        num_beams=5,
        do_sample=False,
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)


def translit_latin_to_thai(name_latin: str) -> str:
    text = "<thai>" + name_latin
    inputs = tokenizer(text, return_tensors="pt")
    inputs.pop("token_type_ids", None)
    output = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=32,
        min_length=2,
        num_beams=5,
        do_sample=False,
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)


print(translit_thai_to_latin("สมชาย"))
print(translit_latin_to_thai("khlai khampraphan"))

Getting ranked alternatives

Since AyutthayaAlpha 2.0 is trained with multiple variants and a ranking objective, it is natural to ask for several candidates:

def ranked_transliterations_thai_to_latin(name_thai: str, k: int = 5):
    text = "<latn>" + name_thai
    inputs = tokenizer(text, return_tensors="pt")
    inputs.pop("token_type_ids", None)
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=32,
        num_beams=k,
        num_return_sequences=k,
        do_sample=False,
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]


print(ranked_transliterations_thai_to_latin("ทิพามณี", k=5))
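
If you also want a score per candidate, generate() can return length-normalized beam scores via its standard return_dict_in_generate and output_scores options. The helper below is a sketch along those lines; the function name is just an example.

def scored_transliterations_thai_to_latin(name_thai: str, k: int = 5):
    text = "<latn>" + name_thai
    inputs = tokenizer(text, return_tensors="pt")
    inputs.pop("token_type_ids", None)
    out = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=32,
        num_beams=k,
        num_return_sequences=k,
        do_sample=False,
        return_dict_in_generate=True,
        output_scores=True,
    )
    texts = [tokenizer.decode(s, skip_special_tokens=True) for s in out.sequences]
    # sequences_scores holds the length-normalized log-probability of each beam.
    return list(zip(texts, out.sequences_scores.tolist()))


print(scored_transliterations_thai_to_latin("ทิพามณี", k=5))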

Data and quality scoring (summary)

Training data comes from a large aggregation of Thai–Latin name pairs:

  • Commercial APIs (Microsoft Azure Translator; Google Cloud Translation)
  • Co-occurrence statistics from multilingual name databases and registries (Lauc, 2024)
  • Structured knowledge bases (Wikidata, Wikipedia, ORCID, VIAF)
  • Thai business registries and Kaggle corpora
  • Community dictionaries and GeoNames place names
  • Rule-based baselines using Royal Thai General System of Transcription (RTGS)-style mappings via PyThaiNLP (Phatthiyaphaibun et al., 2023)

Each Thai name is associated with multiple candidate Latin spellings. A supervised quality scoring model (LightGBM; Ke et al., 2017) assigns each pair a continuous quality score in [0, 1] using:

  • Thai Transliteration Distance (TTD): a Thai-specific weighted edit distance over phonetic units,
  • Frequency and co-occurrence features: how often a pair appears across corpora,
  • Syllable alignment and diversity features: how well Thai and Latin syllabification line up, and how many sources support the pair.
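
To make the scoring step concrete, a scorer of this kind can be set up as an ordinary gradient-boosted regressor. The sketch below uses toy data and hypothetical feature columns; it illustrates the general recipe, not the actual scoring model.

import lightgbm as lgb
import numpy as np

# Toy stand-ins for per-pair features: [TTD, log frequency,
# syllable-alignment score, number of supporting sources].
rng = np.random.default_rng(0)
X = rng.random((500, 4))
y = rng.random(500)  # stand-in quality targets in [0, 1]

scorer = lgb.LGBMRegressor(n_estimators=200, learning_rate=0.05)
scorer.fit(X, y)

# Predicted scores would then serve as per-example weights in the loss.
weights = np.clip(scorer.predict(X), 0.0, 1.0)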

These scores enter the loss as per-example weights in a quality-weighted soft-minimum cross-entropy objective, so that the model:

  • learns from diverse but noisy supervision,
  • emphasizes high-quality variants,
  • and treats transliteration as a multi-modal, ranking-aware problem.
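
For intuition, one way to write a quality-weighted soft minimum over per-variant losses is sketched below; the exact formulation used in training is described in the paper, and the temperature tau here is a hypothetical knob.

import torch

def quality_weighted_soft_min(variant_nll: torch.Tensor,
                              quality: torch.Tensor,
                              tau: float = 1.0) -> torch.Tensor:
    # variant_nll: cross-entropy of each valid variant for one source name, shape [V]
    # quality:     per-variant weights in [0, 1] from the scoring model, shape [V]
    # Weighted soft minimum: -tau * log(sum_i q_i * exp(-nll_i / tau)).
    # As tau -> 0 the loss approaches that of the best-supported variant, so the
    # model only needs to get at least one high-quality spelling right.
    log_q = torch.log(quality.clamp_min(1e-8))
    return -tau * torch.logsumexp(log_q - variant_nll / tau, dim=-1)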

Limitations

  • Early beta checkpoint: hyperparameters, filtering, and data coverage are still being tuned.
  • Domain bias: the model is optimized for names; performance may drop on common nouns or arbitrary phrases.
  • Romanization standards: the model reflects the mixture of conventions in the training data (RTGS, real-world usage, aesthetics). It does not enforce strict RTGS by default.
  • Bias & coverage: like any data-driven model, outputs may encode biases and gaps present in the underlying datasets.

Use human review for high-stakes, irreversible decisions.


Citation

If you use AyutthayaAlpha 2.0 Early Beta in research, please cite the accompanying paper; the entry below will be updated once the preprint is public. Placeholder citation:

Lauc, D., Rutherford, A., & Wongwarawipatr, W. (2025). AyutthayaAlpha 2.0: A Thai–Latin Script Transliteration Transformer. Manuscript in preparation.

You may also cite the Transformers library (Wolf et al., 2020) and SentencePiece (Kudo & Richardson, 2018).

Model size: 73.5M parameters (Safetensors, F32)