---
license: cc-by-nc-4.0
language:
- fa
tags:
- masked-language-modeling
- feature-extraction
- large-scale-dataset
- Persian
- dataset_size:72.9B
- no-next-sentence-prediction
pipeline_tag: fill-mask
extra_gated_description: >-
  You agree to not use the model to conduct experiments that cause harm to human subjects.
extra_gated_fields:
  Full Name: text
  Organization (University): text
  Email address: text
  Country: country
  Could you briefly explain the purpose of using the dataset?: text
  I agree to use this dataset for non-commercial use ONLY: checkbox
---

# Persian Masked Language Model (MLM)

This model is a **Masked Language Model (MLM)** trained on a **72.9-billion-token corpus** of Persian text, making it one of the largest and most comprehensive models pre-trained exclusively for the Persian language. It is designed to improve **language understanding tasks** and to provide high-quality contextual embeddings for Persian NLP applications.

- **Our Paper:** [Advancing Retrieval-Augmented Generation for Persian: Development of Language Models, Comprehensive Benchmarks, and Best Practices for Optimization](https://arxiv.org/abs/2501.04858)

## Model Details

### Model Description

- **Model Type:** Masked Language Model (MLM)
- **Base Model:** XLM-RoBERTa Large
- **Objective:** Predicting randomly masked tokens within sequences
- **Training Corpus Size:** 72.9 billion tokens
- **Maximum Sequence Length:** 512 tokens
- **Special Feature:** No Next Sentence Prediction (NSP) task

## Training Details

### Training Configuration

- **Hardware:** 8 NVIDIA A800 GPUs
- **Duration:** One week
- **Optimization Framework:** DeepSpeed (Stage 0)
- **Training Parameters:**
  - **Learning Rate:** 5e-5
  - **Maximum Sequence Length:** 512 tokens
  - **Precision:** FP16 (Mixed Precision)

### Corpus

The model was pre-trained on a large-scale corpus of Persian text collected from diverse sources, chosen for broad language coverage and contextual diversity:

- Web-crawled data
- Academic articles and books
- Persian Wikipedia
- Religious texts
- Social media platforms

The data was preprocessed extensively, including deduplication and noise removal, to improve the quality of the training data.

## Usage

The model can be used for various **downstream NLP tasks** in Persian, including:

- Text classification
- Named entity recognition
- Question answering
- Semantic search
- Contextual embedding generation

### Example Usage

This model can be loaded and used with the 🤗 Transformers library:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("your_model_id")
model = AutoModelForMaskedLM.from_pretrained("your_model_id")

# Example text; use the tokenizer's own mask token
# (XLM-RoBERTa tokenizers use "<mask>", not "[MASK]")
text = f"این یک {tokenizer.mask_token} جدید است."
inputs = tokenizer(text, return_tensors="pt")

# Predict the masked token
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits

# Decode the highest-scoring token at the masked position
mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:

- learning_rate: 5e-05
- train_batch_size: 30
- eval_batch_size: 8
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- gradient_accumulation_steps: 2
- total_train_batch_size: 480
- total_eval_batch_size: 64
- optimizer: AdamW (OptimizerNames.ADAMW_TORCH) with betas=(0.9,0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: linear
- num_epochs: 1.0
- mixed_precision_training: Native AMP

### Framework versions

- Transformers 4.47.0.dev0
- Pytorch 2.4.1+cu121
- Datasets 3.0.2
- Tokenizers 0.20.1

## Citation

If you find this model helpful, please cite the following paper.

**BibTeX:**

```
@misc{hosseinbeigi2025advancingretrievalaugmentedgenerationpersian,
      title={Advancing Retrieval-Augmented Generation for Persian: Development of Language Models, Comprehensive Benchmarks, and Best Practices for Optimization},
      author={Sara Bourbour Hosseinbeigi and Sina Asghari and Mohammad Ali Seif Kashani and Mohammad Hossein Shalchian and Mohammad Amin Abbasi},
      year={2025},
      eprint={2501.04858},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.04858},
}
```
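
## Appendix: Checking the Effective Batch Size

The batch-size figures reported under "Training hyperparameters" can be cross-checked with a little arithmetic. The sketch below is illustrative only: the helper names `effective_batch_size` and `max_tokens_per_step` are hypothetical and not part of the training code. The tokens-per-step figure is an upper bound that assumes every sequence is filled to the 512-token maximum, which the card does not state.

```python
# Values copied from the "Training hyperparameters" list in this card.
PER_DEVICE_TRAIN_BATCH_SIZE = 30
NUM_DEVICES = 8
GRADIENT_ACCUMULATION_STEPS = 2
MAX_SEQUENCE_LENGTH = 512


def effective_batch_size(per_device: int, devices: int, accumulation: int) -> int:
    """Sequences consumed per optimizer step across all devices (hypothetical helper)."""
    return per_device * devices * accumulation


def max_tokens_per_step(batch_size: int, max_length: int) -> int:
    """Upper bound on tokens per optimizer step.

    Assumption: every sequence is exactly max_length tokens long.
    """
    return batch_size * max_length


total = effective_batch_size(
    PER_DEVICE_TRAIN_BATCH_SIZE, NUM_DEVICES, GRADIENT_ACCUMULATION_STEPS
)
print(total)  # 480, matching total_train_batch_size
print(max_tokens_per_step(total, MAX_SEQUENCE_LENGTH))  # 245760
```

The product 30 × 8 × 2 = 480 agrees with the reported `total_train_batch_size`, and likewise 8 × 8 = 64 agrees with `total_eval_batch_size`.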