---
license: mit
library_name: transformers
base_model: jknafou/TransBERT-bio-fr
language:
- fr
tags:
- life-sciences
- clinical
- biomedical
- bio
- medical
- biology
pipeline_tag: fill-mask
---

# TransBERT-bio-fr

TransBERT-bio-fr is a French biomedical language model pretrained exclusively on synthetically translated PubMed abstracts using the TransCorpus framework. This model demonstrates that high-quality domain-specific language models can be built for low-resource languages using only machine-translated data.

# Model Details

- **Architecture**: BERT-base (12 layers, 768 hidden units, 12 heads, 110M parameters)
- **Tokenizer**: SentencePiece unigram, 32k vocabulary, trained on synthetic biomedical French
- **Training Data**: 36.4GB corpus of 22M PubMed abstracts translated from English to French, available here: [TransCorpus-bio-fr 🤗](https://huggingface.co/datasets/jknafou/TransCorpus-bio-fr)
- **Translation Model**: M2M-100 (1.2B) using the [TransCorpus Toolkit](https://github.com/jknafou/TransCorpus)
- **Domain**: Biomedical, clinical, life sciences (French)

# Motivation

The lack of large-scale, high-quality biomedical corpora in French has historically limited the development of domain-specific language models. TransBERT-bio-fr addresses this gap by leveraging recent advances in neural machine translation to generate a massive, high-quality synthetic corpus, making robust French biomedical NLP possible.
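The tokenizer described above can be inspected directly to see how it segments French biomedical text. A minimal sketch (the example sentence is illustrative, not from the corpus):

```python
from transformers import AutoTokenizer

# Load the SentencePiece unigram tokenizer (32k vocabulary) shipped with the model.
tokenizer = AutoTokenizer.from_pretrained("jknafou/TransBERT-bio-fr")

# A domain-trained vocabulary should split biomedical terms
# into relatively few subword pieces.
tokens = tokenizer.tokenize("La glycémie est régulée par l'insuline.")
print(tokens)
print(tokenizer.vocab_size)
```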
# How to Use

Load the model and tokenizer:

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jknafou/TransBERT-bio-fr")
model = AutoModel.from_pretrained("jknafou/TransBERT-bio-fr")
```

Perform the fill-mask task:

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="jknafou/TransBERT-bio-fr", tokenizer="jknafou/TransBERT-bio-fr")
results = fill_mask("L'insuline est une hormone produite par le <mask> et régule la glycémie.")
# [{'score': 0.6606941223144531,
#   'token': 486,
#   'token_str': 'foie',
#   'sequence': 'L'insuline est une hormone produite par le foie et régule la glycémie.'},
#  {'score': 0.172934889793396,
#   'token': 2642,
#   'token_str': 'pancréas',
#   'sequence': 'L'insuline est une hormone produite par le pancréas et régule la glycémie.'},
#  {'score': 0.08486421406269073,
#   'token': 488,
#   'token_str': 'cerveau',
#   'sequence': 'L'insuline est une hormone produite par le cerveau et régule la glycémie.'},
#  {'score': 0.017183693125844002,
#   'token': 2092,
#   'token_str': 'cœur',
#   'sequence': 'L'insuline est une hormone produite par le cœur et régule la glycémie.'},
#  {'score': 0.009480085223913193,
#   'token': 712,
#   'token_str': 'corps',
#   'sequence': 'L'insuline est une hormone produite par le corps et régule la glycémie.'}]
```

# Key Results

TransBERT-bio-fr sets a new state-of-the-art (SOTA) on the French biomedical benchmark DrBenchmark, outperforming both general-domain (CamemBERT) and previous domain-specific (DrBERT) models on classification, NER, POS, and STS tasks.

| Task                | CamemBERT | DrBERT | TransBERT  |
| ------------------- | --------- | ------ | ---------- |
| Classification (F1) | 74.17     | 73.73  | **75.71*** |
| NER (F1)            | 81.55     | 80.88  | **83.15*** |
| POS (F1)            | 98.29     | 98.18* | **98.31**  |
| STS (R²)            | **83.38** | 73.56* | 83.04      |

*Statistically significant difference (Friedman & Nemenyi test, p<0.01).
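Beyond mask filling, the model can also serve as a feature extractor for downstream tasks such as those in DrBenchmark. A minimal sketch (the sentence is illustrative; pooling strategy is left to the downstream task):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jknafou/TransBERT-bio-fr")
model = AutoModel.from_pretrained("jknafou/TransBERT-bio-fr")

inputs = tokenizer("Le pancréas sécrète l'insuline.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# BERT-base hidden size: each token is represented by a 768-dimensional vector.
embeddings = outputs.last_hidden_state
print(embeddings.shape)  # (1, sequence_length, 768)
```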
## Paper published at EMNLP 2025

TransCorpus enables the training of state-of-the-art language models through synthetic translation. For example, TransBERT achieved superior performance by leveraging corpus translation with this toolkit. The paper detailing these results was published in the Findings of EMNLP 2025.

📝 [Paper PDF](https://transbert.s3.text-analytics.ch/TransBERT.pdf)

# Why Synthetic Translation?

- **Scalable**: Enables pretraining on gigabytes of text for any language with a strong MT system.
- **Effective**: Outperforms models trained on native data in key biomedical tasks.
- **Accessible**: Makes high-quality domain-specific PLMs possible for low-resource languages.

# 🔗 Related Resources

This model was pretrained on large-scale synthetic French biomedical data generated using [TransCorpus](https://github.com/jknafou/TransCorpus), an open-source toolkit for scalable, parallel translation and preprocessing. For source code, data recipes, and reproducible pipelines, visit the [TransCorpus GitHub repository](https://github.com/jknafou/TransCorpus).

If you use this model, please cite:

```text
@inproceedings{knafou-etal-2025-transbert,
    title = "{T}rans{BERT}: A Framework for Synthetic Translation in Domain-Specific Language Modeling",
    author = {Knafou, Julien and Mottin, Luc and Mottaz, Ana{\"i}s and Flament, Alexandre and Ruch, Patrick},
    editor = "Christodoulopoulos, Christos and Chakraborty, Tanmoy and Rose, Carolyn and Peng, Violet",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-emnlp.1053/",
    doi = "10.18653/v1/2025.findings-emnlp.1053",
    pages = "19338--19354",
    ISBN = "979-8-89176-335-7",
    abstract = "The scarcity of non-English language data in specialized domains significantly limits the development of effective Natural Language Processing (NLP) tools.
We present TransBERT, a novel framework for pre-training language models using exclusively synthetically translated text, and introduce TransCorpus, a scalable translation toolkit. Focusing on the life sciences domain in French, our approach demonstrates that state-of-the-art performance on various downstream tasks can be achieved solely by leveraging synthetically translated data. We release the TransCorpus toolkit, the TransCorpus-bio-fr corpus (36.4GB of French life sciences text), TransBERT-bio-fr, its associated pre-trained language model, and reproducible code for both pre-training and fine-tuning. Our results highlight the viability of synthetic translation in a high-resource translation direction for building high-quality NLP resources in low-resource language/domain pairs."
}
```