---
license: cc-by-nc-4.0
language:
- fa
tags:
- masked-language-modeling
- feature-extraction
- large-scale-dataset
- Persian
- dataset_size:72.9B
- no-next-sentence-prediction
pipeline_tag: fill-mask

extra_gated_description: >-
  You agree to not use the model to conduct experiments that cause harm to
  human subjects.
extra_gated_fields:
  Full Name: text
  Organization (University): text
  Email address: text
  Country: country
  Could you briefly explain the purpose of using the model?: text
  I agree to use this model for non-commercial use ONLY: checkbox
---

# Persian Masked Language Model (MLM)

This model is a **Masked Language Model (MLM)** built on XLM-RoBERTa Large and trained on a **72.9-billion-token corpus** of Persian text, making it one of the largest models trained specifically for the Persian language. It is designed for **language understanding tasks** and provides contextual embeddings for Persian NLP applications.

- **Our Paper:** [Advancing Retrieval-Augmented Generation for Persian: Development of Language Models, Comprehensive Benchmarks, and Best Practices for Optimization](https://arxiv.org/abs/2501.04858)

## Model Details

### Model Description
- **Model Type:** Masked Language Model (MLM)
- **Base Model:** XLM-RoBERTa Large
- **Objective:** Predicting randomly masked tokens within sequences
- **Training Corpus Size:** 72.9 billion tokens
- **Maximum Sequence Length:** 512 tokens
- **Special Feature:** No Next Sentence Prediction (NSP) task
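
To illustrate the objective listed above, the following is a minimal sketch of BERT-style dynamic masking in plain Python. The 15% masking rate and the 80/10/10 split (mask token / random token / unchanged) are the conventional BERT and RoBERTa defaults and are assumed here; this card does not state the values used for this model. `mask_tokens` and the toy vocabulary are hypothetical names used only for illustration.

```python
import random

MASK = "<mask>"  # mask token string used by XLM-RoBERTa tokenizers

def mask_tokens(tokens, vocab, mlm_probability=0.15, seed=None):
    """Return (corrupted_tokens, labels) for one sequence.

    labels[i] holds the original token where a prediction is required and
    None elsewhere. The 15% rate and 80/10/10 split are assumed defaults.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() >= mlm_probability:
            corrupted.append(token)   # position not selected: no loss here
            labels.append(None)
            continue
        labels.append(token)          # the model must recover this token
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(MASK)                # 80%: replace with the mask token
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))   # 10%: replace with a random token
        else:
            corrupted.append(token)               # 10%: keep the original token
    return corrupted, labels

# Toy example; a high masking rate is used only so the effect is visible
tokens = "این یک کتاب جدید است".split()
corrupted, labels = mask_tokens(tokens, vocab=tokens, mlm_probability=0.5, seed=0)
print(corrupted)
print(labels)
```

Because the masked positions are re-sampled on every call, the same sentence yields different training targets across passes, which is what "dynamic" masking refers to.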

## Training Details

### Training Configuration
- **Hardware:** 8 NVIDIA A800 GPUs
- **Duration:** One week
- **Optimization Framework:** DeepSpeed (ZeRO Stage 0)
- **Training Parameters:**
  - **Learning Rate:** 5e-5
  - **Maximum Sequence Length:** 512 tokens
  - **Precision:** FP16 (Mixed Precision)
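
As a rough illustration, the settings above map onto a DeepSpeed configuration along the following lines. The keys are standard DeepSpeed options, but the actual configuration file used for training is not published in this card, so the structure below is an assumption. The micro-batch size and accumulation steps are taken from the hyperparameter list later in this card.

```python
import json

NUM_GPUS = 8  # from the hardware line above

ds_config = {
    "train_micro_batch_size_per_gpu": 30,
    "gradient_accumulation_steps": 2,
    "fp16": {"enabled": True},           # FP16 mixed precision
    "zero_optimization": {"stage": 0},   # stage 0: ZeRO partitioning disabled
}

# DeepSpeed expects train_batch_size == micro batch * accumulation steps * GPUs
ds_config["train_batch_size"] = (
    ds_config["train_micro_batch_size_per_gpu"]
    * ds_config["gradient_accumulation_steps"]
    * NUM_GPUS
)

print(json.dumps(ds_config, indent=2))
```

The printed JSON can be saved to a file and passed to the 🤗 `Trainer` through the `deepspeed` argument of `TrainingArguments`.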

### Corpus
The model was pre-trained on a large-scale corpus of Persian text collected from diverse sources, ensuring broad language coverage and contextual diversity:
- Web-crawled data
- Academic articles and books
- Persian Wikipedia
- Religious texts
- Social media platforms

The data underwent extensive preprocessing, including deduplication and noise removal, to ensure high-quality training data.
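
The preprocessing pipeline itself is not published with this card. As a minimal sketch of what exact deduplication and simple noise removal can look like, the example below assumes hash-based duplicate detection after light normalization. The function names, the character mapping, and the `min_chars` threshold are illustrative choices, not the authors' method.

```python
import hashlib
import re
import unicodedata

# Arabic yeh/kaf -> Persian yeh/keheh, a common Persian normalization (assumed rule)
ARABIC_TO_PERSIAN = str.maketrans({"\u064a": "\u06cc", "\u0643": "\u06a9"})

def normalize(text):
    """Unify Unicode forms, Arabic/Persian letter variants, and whitespace."""
    text = unicodedata.normalize("NFC", text).translate(ARABIC_TO_PERSIAN)
    return re.sub(r"\s+", " ", text).strip()

def clean_and_deduplicate(documents, min_chars=10):
    """Drop very short documents and exact duplicates (after normalization)."""
    seen, kept = set(), []
    for doc in documents:
        norm = normalize(doc)
        if len(norm) < min_chars:      # crude noise filter: too short to be useful
            continue
        digest = hashlib.sha1(norm.encode("utf-8")).hexdigest()
        if digest in seen:             # identical normalized document already kept
            continue
        seen.add(digest)
        kept.append(norm)
    return kept

docs = ["این يك متن نمونه است.", "این  یک متن نمونه است. ", "سلام"]
print(clean_and_deduplicate(docs))
```

Large-scale pipelines typically add near-duplicate detection (for example MinHash) on top of exact matching; that step is omitted here for brevity.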

## Usage

The model can be used for various **downstream NLP tasks** in Persian, including:
- Text classification
- Named entity recognition
- Question answering
- Semantic search
- Contextual embedding generation

### Example Usage
This model can be loaded and used with the 🤗 Transformers library:

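```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load tokenizer and model (replace "your_model_id" with this model's Hub ID)
tokenizer = AutoTokenizer.from_pretrained("your_model_id")
model = AutoModelForMaskedLM.from_pretrained("your_model_id")
model.eval()

# Example text; XLM-RoBERTa tokenizers use "<mask>" rather than "[MASK]",
# so the mask token is taken from the tokenizer itself
text = f"این یک {tokenizer.mask_token} جدید است."
inputs = tokenizer(text, return_tensors="pt")

# Predict the masked token
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the mask position and show the five most likely tokens
mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
top_ids = logits[0, mask_index[0]].topk(5).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top_ids))
```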
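
For the embedding and semantic-search use cases listed above, sentence vectors are often derived by mean-pooling the encoder's last hidden states over non-padding tokens. The sketch below assumes that pooling strategy (this card does not prescribe one) and reuses the `your_model_id` placeholder; `mean_pool` and `embed` are hypothetical helper names.

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors over real (non-padding) positions."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)  # guard against division by zero
    return summed / counts

def embed(sentences, model_id="your_model_id"):
    """Encode sentences into one vector each (downloads the model on first use)."""
    from transformers import AutoModel, AutoTokenizer  # imported lazily

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id).eval()
    batch = tokenizer(sentences, padding=True, truncation=True,
                      max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state
    return mean_pool(hidden, batch["attention_mask"])
```

Calling `embed([...])` with a list of sentences returns a tensor with one row per sentence; cosine similarity between rows can then serve as a simple semantic-search score.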

## Training Procedure

### Training Hyperparameters

The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 30
- eval_batch_size: 8
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- gradient_accumulation_steps: 2
- total_train_batch_size: 480
- total_eval_batch_size: 64
- optimizer: AdamW (PyTorch implementation, `OptimizerNames.ADAMW_TORCH`) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: linear
- num_epochs: 1.0
- mixed_precision_training: Native AMP
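
The batch-size figures above are related by simple arithmetic: the total training batch size is the per-device batch size times the number of devices times the gradient accumulation steps. The short check below reproduces the listed totals. The tokens-per-step figure is an upper bound that assumes every sequence is filled to the 512-token maximum, which this card does not confirm.

```python
per_device_train_batch_size = 30
per_device_eval_batch_size = 8
num_devices = 8
gradient_accumulation_steps = 2
max_seq_length = 512

# Sequences consumed per optimizer step across all GPUs
total_train_batch_size = (
    per_device_train_batch_size * num_devices * gradient_accumulation_steps
)
# Evaluation uses no gradient accumulation
total_eval_batch_size = per_device_eval_batch_size * num_devices

# Upper bound on tokens per optimizer step (assumes full-length sequences)
max_tokens_per_step = total_train_batch_size * max_seq_length

print(total_train_batch_size)  # 480
print(total_eval_batch_size)   # 64
print(max_tokens_per_step)     # 245760
```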

### Framework Versions

- Transformers 4.47.0.dev0
- Pytorch 2.4.1+cu121
- Datasets 3.0.2
- Tokenizers 0.20.1

## Citation
If you find this model helpful, please cite the following paper.

**BibTeX:**
```bibtex
@misc{hosseinbeigi2025advancingretrievalaugmentedgenerationpersian,
      title={Advancing Retrieval-Augmented Generation for Persian: Development of Language Models, Comprehensive Benchmarks, and Best Practices for Optimization}, 
      author={Sara Bourbour Hosseinbeigi and Sina Asghari and Mohammad Ali Seif Kashani and Mohammad Hossein Shalchian and Mohammad Amin Abbasi},
      year={2025},
      eprint={2501.04858},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.04858}, 
}
```