--- license: mit language: - ar base_model: - Qwen/Qwen2.5-1.5B-Instruct pipeline_tag: text2text-generation library_name: transformers tags: - Text-To-SQL - Arabic - Spider - SQL --- # Model Card for Arabic Text-To-SQL (OsamaMo) ## Model Details ### Model Description This model is fine-tuned on the Spider dataset with Arabic-translated questions for the Text-To-SQL task. It is based on **Qwen/Qwen2.5-1.5B-Instruct** and trained using LoRA on Kaggle for 15 hours on a **P100 8GB GPU**. - **Developed by:** Osama Mohamed ([OsamaMo](https://huggingface.co/OsamaMo)) - **Funded by:** Self-funded - **Shared by:** Osama Mohamed - **Model type:** Text-to-SQL fine-tuned model - **Language(s):** Arabic (ar) - **License:** MIT - **Finetuned from:** Qwen/Qwen2.5-1.5B-Instruct ### Model Sources - **Repository:** [Hugging Face Model Hub](https://huggingface.co/OsamaMo/Arabic_Text-To-SQL) - **Dataset:** Spider (translated to Arabic) - **Training Script:** [LLaMA-Factory](https://github.com/huggingface/transformers/tree/main/src/transformers/models/llama_factory) ## Uses ### Direct Use This model is intended for converting **Arabic natural language questions** into SQL queries. It can be used for database querying in Arabic-speaking applications. ### Downstream Use Can be fine-tuned further for specific databases or Arabic dialect adaptations. ### Out-of-Scope Use - The model is **not** intended for direct execution of SQL queries. - Not recommended for non-database-related NLP tasks. ## Bias, Risks, and Limitations - The model might generate incorrect or non-optimized SQL queries. - Bias may exist due to dataset translations and model pretraining data. ### Recommendations - Validate generated SQL queries before execution. - Ensure compatibility with specific database schemas. ## How to Get Started with the Model ```python from transformers import AutoModelForCausalLM, AutoTokenizer import torch device = "cuda" if torch.cuda.is_available() else "cpu" base_model_id = "Qwen/Qwen2.5-1.5B-Instruct" finetuned_model_id = "OsamaMo/Arabic_Text-To-SQL_using_Qwen2.5-1.5B" model = AutoModelForCausalLM.from_pretrained( base_model_id, device_map="auto", torch_dtype=torch.bfloat16 ) model.load_adapter(finetuned_model_id) tokenizer = AutoTokenizer.from_pretrained(base_model_id) def generate_resp(messages): text = tokenizer.apply_chat_template( messages, tokenize=False, add_generation_prompt=True ) model_inputs = tokenizer([text], return_tensors="pt").to(device) generated_ids = model.generate( model_inputs.input_ids, max_new_tokens=1024, do_sample=False, top_k=None, temperature=None, top_p=None, ) generated_ids = [ output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids) ] response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] return response ``` ## Training Details ### Training Data - Dataset: **Spider (translated into Arabic)** - Preprocessing: Questions converted to Arabic while keeping SQL queries unchanged. - Training format: - System instruction guiding Arabic-to-SQL conversion. - Database schema provided for context. - Arabic user queries mapped to correct SQL output. - Output is strictly formatted SQL queries enclosed in markdown code blocks. ### Training Procedure #### Training Hyperparameters - **Batch size:** 1 (per device) - **Gradient accumulation:** 4 steps - **Learning rate:** 1.0e-4 - **Epochs:** 3 - **Scheduler:** Cosine - **Warmup ratio:** 0.1 - **Precision:** bf16 #### Speeds, Sizes, Times - **Training time:** 15 hours on **NVIDIA P100 8GB** - **Checkpointing every:** 500 steps ## Evaluation ### Testing Data - Validation dataset: Spider validation set (translated to Arabic) ### Metrics - Exact Match (EM) for SQL correctness - Execution Accuracy (EX) on databases ### Results - Model achieved **competitive SQL generation accuracy** for Arabic queries. - Further testing required for robustness. ## Environmental Impact - **Hardware Type:** NVIDIA Tesla P100 8GB - **Hours used:** 15 - **Cloud Provider:** Kaggle - **Carbon Emitted:** Estimated using [ML Impact Calculator](https://mlco2.github.io/impact#compute) ## Technical Specifications ### Model Architecture and Objective - Transformer-based **Qwen2.5-1.5B** architecture. - Fine-tuned for Text-to-SQL task using LoRA. ### Compute Infrastructure - **Hardware:** Kaggle P100 GPU (8GB VRAM) - **Software:** Python, Transformers, LLaMA-Factory, Hugging Face Hub ## Citation If you use this model, please cite: ```bibtex @misc{OsamaMo_ArabicSQL, author = {Osama Mohamed}, title = {Arabic Text-To-SQL Model}, year = {2024}, howpublished = {\url{https://huggingface.co/OsamaMo/Arabic_Text-To-SQL}} } ``` ## Model Card Contact For questions, contact **Osama Mohamed** via Hugging Face ([OsamaMo](https://huggingface.co/OsamaMo)).