---
license: apache-2.0
base_model:
- Qwen/Qwen2.5-Math-7B
language:
- en
pipeline_tag: text-generation
library_name: transformers
tags:
- large-language-models
- DPO
- direct-preference-optimization
- reasoning
- long-CoT
---

# 🤗 Model Card: InfiX-ai/InfiAlign-Qwen-7B-DPO
**InfiAlign** is a scalable, data-efficient post-training framework that combines supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) with a high-quality data selection pipeline to enhance reasoning in large language models.

At the core of **InfiAlign** is a **robust data selection pipeline** that automatically curates high-quality alignment data from open-source reasoning datasets using multidimensional quality metrics. This pipeline enables significant performance gains while drastically reducing data requirements, and it remains extensible to new data sources.

When applied to the [Qwen2.5-Math-7B base model](https://huggingface.co/Qwen/Qwen2.5-Math-7B), our SFT model matches the performance of [DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B) while using only about **12%** of its training data, and it generalizes well across diverse reasoning tasks. Applying **DPO** on top of SFT yields further improvements, most notably in mathematical reasoning: the model gains an average of **3.89%** on the AIME 24/25 benchmarks.

## 📊 InfiAlign Model Series

The InfiAlign framework offers multiple variants tailored for different alignment strategies:

* **[InfiAlign-Qwen-7B-SFT](https://huggingface.co/InfiX-ai/InfiAlign-Qwen-7B-SFT)**: Fine-tuned using curriculum-style instruction data.
* **[InfiAlign-Qwen-7B-DPO](https://huggingface.co/InfiX-ai/InfiAlign-Qwen-7B-DPO)**: Trained with Direct Preference Optimization (DPO) to improve reasoning alignment. **\[You are here!]**
* **[InfiAlign-Qwen-7B-R1](# "Stay tuned")**: Reinforcement learning variant (GRPO) for further refinement.
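For reference, the sigmoid preference loss that DPO optimizes can be written in a few lines. This is an illustrative pure-Python sketch for a single preference pair (the function name and per-pair framing are ours, not the actual training code):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Sigmoid preference (DPO) loss for one chosen/rejected pair.

    Each argument is the summed log-probability of a full response under
    the policy or the frozen reference model; `beta` scales the implicit
    reward (0.1 in our training configuration).
    """
    # Implicit rewards: beta * log-ratio between policy and reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # -log sigmoid(margin): shrinks as the chosen response is preferred
    # over the rejected one by a wider margin.
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With a zero margin the loss is log 2; it decreases monotonically as the policy assigns relatively more probability to the chosen response.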
## 📘 Model Description

* **Model Name:** InfiAlign-Qwen-7B-DPO
* **Developed by:** InfiX-ai
* **Fine-tuned from:** [InfiAlign-Qwen-7B-SFT](https://huggingface.co/InfiX-ai/InfiAlign-Qwen-7B-SFT)
* **Model Type:** 7B-parameter decoder-only Transformer
* **Context Length:** 32K tokens
* **License:** Apache 2.0
* **Status:** Static checkpoint (offline training)

## 🏋️ Training Details

### 📚 Dataset Overview

**10K curated samples in total**, spanning three core reasoning domains:

| Domain | Curated Samples |
| :---: | :---: |
| Mathematics | 3.5K |
| Code | 3.5K |
| Science | 3K |

Each sample includes preference-ranked completions distilled from stronger teacher models, selected for difficulty and diversity.

**Data Sources:** [OpenMathReasoning](https://huggingface.co/datasets/nvidia/OpenMathReasoning), [Mixture-of-Thoughts](https://huggingface.co/datasets/open-r1/Mixture-of-Thoughts), [OpenScience](https://huggingface.co/datasets/nvidia/OpenScience)

### 🔄 Data Pipeline

* **Decontamination and Deduplication:** We decontaminate the data against our evaluation benchmarks and deduplicate it against the SFT training dataset.
* **Data Selection:** We first use [Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct) to annotate each sample with domain-specific labels. For each category, we select the problems with the longest solutions, which we treat as the most challenging. Our SFT model then generates responses to these problems for the subsequent rejection sampling step.
* **Rejection Sampling:** We use Qwen2.5-32B-Instruct to judge the SFT model's responses to math and science questions, and an internal sandbox service to verify the correctness of code-related answers. For each domain, we select the incorrect responses with the longest solutions from each category, keeping the number of samples balanced across categories.
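The selection step of the pipeline can be sketched as follows. Field names and the helper function are illustrative assumptions, not the pipeline's actual schema or code:

```python
from collections import defaultdict

def select_rejected(samples, per_category):
    """Pick the longest incorrect responses per category, balanced.

    `samples` is a list of dicts with keys "category", "correct", and
    "solution" (the SFT model's response text); these names are
    illustrative only.
    """
    by_category = defaultdict(list)
    for s in samples:
        if not s["correct"]:          # keep only incorrect responses
            by_category[s["category"]].append(s)
    selected = []
    for group in by_category.values():
        # Longest solutions first, treated as the hardest problems;
        # taking the same count per category keeps the set balanced.
        group.sort(key=lambda s: len(s["solution"]), reverse=True)
        selected.extend(group[:per_category])
    return selected
```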
We then use the solutions generated by DeepSeek-R1 as the positive (chosen) samples and pair each with a selected incorrect response to construct the DPO training pairs.

### 🏗️ Training Procedure

🧠 **Alignment Algorithm:** Direct Preference Optimization (DPO)

⚙️ **Training Hyperparameters:**

| Hyperparameter | Value |
| :---: | :---: |
| Batch Size | 16 |
| Learning Rate | 5e-7 |
| LR Scheduler | cosine |
| Warmup Ratio | 0.1 |
| Epochs | 3 |
| Sequence Parallelism | 4 |
| Loss | sigmoid preference loss |
| Preference Beta | 0.1 |

## 📈 Evaluation

We evaluate **InfiAlign-Qwen-7B-DPO** on a range of benchmarks covering mathematical reasoning, scientific problem-solving, and code generation. All metrics are reported as Pass@1 under a consistent regex-based answer extraction pipeline, adapted from [LIMO](https://github.com/open-compass/opencompass).

### 🧪 Benchmark Overview

* **AIME24 / AIME25**: American Invitational Mathematics Examination problems (Olympiad-level high school math).
* **MATH500**: Subset of the MATH dataset focused on complex mathematical reasoning.
* **GPQA**: Graduate-level, "Google-proof" multiple-choice questions in biology, physics, and chemistry.
* **MMLU-Pro**: Professional-level subset of the Massive Multitask Language Understanding benchmark.
* **LiveCodeBench**: Code reasoning benchmark using real-world coding problems.

### 📊 Performance Comparison (Pass@1)

| Model | Initial CKPT | Data Size | AIME 2025 (avg@64) | AIME 2024 (avg@64) | MATH500 (avg@4) | GPQA Diamond (avg@8) | MMLU-Pro (pass@1) | LiveCodeBench-v5 (avg@8) | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-7B-Instruct | Qwen2.5-7B-Base | 1M | 8.80 | 11.93 | 76.15 | 38.70 | 57.49 | 15.77 | 34.80 |
| Qwen2.5-Math-7B-Instruct | Qwen2.5-7B-Math-Base | 2.5M | 6.72 | 6.67 | 82.40 | 31.12 | 43.06 | 2.68 | 28.78 |
| DeepSeek-Distill-Qwen-7B | Qwen2.5-7B-Math-Base | 800K | 37.97 | 55.50* | 92.80* | 49.10* | 54.16 | 37.60* | 54.43 |
| OpenThinker2-7B | Qwen2.5-7B-Instruct | 1M | 38.70* | 60.70* | 87.60* | 47.00* | 40.60* | 37.50 | 52.01 |
| Light-R1-7B-DS | DeepSeek-Distill-Qwen-7B | 3K | 44.30* | 59.10* | 91.35 | 49.40* | 54.95 | 38.40 | 56.25 |
| InfiAlign-Qwen-7B-SFT-92K (ours) | Qwen2.5-7B-Math-Base | 92K | 43.39 | 56.46 | 92.35 | 48.48 | 53.51 | 34.05 | 54.70 |
| InfiAlign-Qwen-7B-DPO-9K (ours) | InfiAlign-Qwen-7B-SFT-92K | 9K | 44.06 | 61.04 | 91.95 | 48.17 | 49.90 | 34.54 | 54.94 |
| InfiAlign-Qwen-7B-SFT-165K (ours) 🤗 | Qwen2.5-7B-Math-Base | 165K | 42.19 | 63.75 | 92.70 | 53.60 | 56.68 | 36.20 | 57.52 |
| InfiAlign-Qwen-7B-DPO-10K (ours) 🤗 | InfiAlign-Qwen-7B-SFT-165K | 10K | 47.45 | 61.25 | 93.45 | 51.77 | 53.95 | 35.30 | 57.20 |
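The avg@k entries in the table report per-problem accuracy averaged over k sampled generations. A minimal sketch of this aggregation, assuming the standard definition (the exact evaluation harness may differ):

```python
def avg_at_k(results):
    """avg@k: mean per-problem accuracy over k sampled generations.

    `results` has one entry per problem; each entry is a list of k
    booleans marking whether each generation was judged correct.
    """
    per_problem = [sum(gens) / len(gens) for gens in results]
    # Report as a percentage, matching the table above.
    return 100.0 * sum(per_problem) / len(results)
```

For example, two problems with per-generation correctness `[True, False]` and `[True, True]` yield an avg@2 of 75.0.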