---
license: apache-2.0
base_model:
- Qwen/Qwen2.5-Math-7B
language:
- en
pipeline_tag: text-generation
library_name: transformers
tags:
- large-language-models
- DPO
- direct-preference-optimization
- reasoning
- long-CoT
---

# 🤗 Model Card: InfiX-ai/InfiAlign-Qwen-7B-DPO
**InfiAlign** is a scalable, data-efficient post-training framework that combines supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) with a high-quality data selection pipeline to enhance reasoning in large language models.

At the core of **InfiAlign** is a **robust data selection pipeline** that automatically curates high-quality alignment data from open-source reasoning datasets using multidimensional quality metrics. This pipeline enables significant performance gains while drastically reducing data requirements, and it remains extensible to new data sources.

When applied to the [Qwen2.5-Math-7B base model](https://huggingface.co/Qwen/Qwen2.5-Math-7B), our SFT model matches the performance of [DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B) while using only about **12%** of its training data, and it generalizes well across diverse reasoning tasks. Applying **DPO** on top of SFT yields further improvements, most notably in mathematical reasoning: the model gains an average of **3.89%** on the AIME 24/25 benchmarks.

## 📊 InfiAlign Model Series

The InfiAlign framework offers multiple variants tailored for different alignment strategies:

* **[InfiAlign-Qwen-7B-SFT](https://huggingface.co/InfiX-ai/InfiAlign-Qwen-7B-SFT)**: Fine-tuned using curriculum-style instruction data.
* **[InfiAlign-Qwen-7B-DPO](https://huggingface.co/InfiX-ai/InfiAlign-Qwen-7B-DPO)**: Trained with Direct Preference Optimization (DPO) to improve reasoning alignment. **\[You are here!]**
* **[InfiAlign-Qwen-7B-R1](# "Stay tuned")**: Reinforcement learning variant (GRPO) for further refinement.
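For reference, the sigmoid preference loss that DPO optimizes can be written in a few lines. This is an illustrative pure-Python sketch for a single preference pair (the function name and per-pair framing are ours, not the actual training code):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Sigmoid preference (DPO) loss for one chosen/rejected pair.

    Each argument is the summed log-probability of a full response under
    the policy or the frozen reference model; `beta` scales the implicit
    reward (0.1 in our training configuration).
    """
    # Implicit rewards: beta * log-ratio between policy and reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # -log sigmoid(margin): shrinks as the chosen response is preferred
    # over the rejected one by a wider margin.
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With a zero margin the loss is log 2; it decreases monotonically as the policy assigns relatively more probability to the chosen response.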
## 📘 Model Description

* **Model Name:** InfiAlign-Qwen-7B-DPO
* **Developed by:** InfiX-ai
* **Fine-tuned from:** [InfiAlign-Qwen-7B-SFT](https://huggingface.co/InfiX-ai/InfiAlign-Qwen-7B-SFT)
* **Model Type:** 7B-parameter decoder-only Transformer
* **Context Length:** 32K tokens
* **License:** Apache 2.0
* **Status:** Static checkpoint (offline training)

## 🏋️ Training Details

### 📚 Dataset Overview

**10K curated samples in total**, spanning three core reasoning domains:

| Domain | Curated Samples |
| :---: | :---: |
| Mathematics | 3.5K |
| Code | 3.5K |
| Science | 3K |

Each sample includes preference-ranked completions distilled from stronger teacher models, selected for difficulty and diversity.

**Data Sources:** [OpenMathReasoning](https://huggingface.co/datasets/nvidia/OpenMathReasoning), [Mixture-of-Thoughts](https://huggingface.co/datasets/open-r1/Mixture-of-Thoughts), [OpenScience](https://huggingface.co/datasets/nvidia/OpenScience)

### 🔄 Data Pipeline

* **Decontamination and Deduplication:** We decontaminate the data against our evaluation benchmarks and deduplicate it against the SFT training dataset.
* **Data Selection:** We first use [Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct) to annotate each sample with domain-specific labels. For each category, we select the problems with the longest solutions, which we treat as the most challenging. Our SFT model then generates responses to these problems for the subsequent rejection sampling step.
* **Rejection Sampling:** We use Qwen2.5-32B-Instruct to judge the SFT model's responses to math and science questions, and an internal sandbox service to verify the correctness of code-related answers. For each domain, we select the incorrect responses with the longest solutions from each category, keeping the number of samples balanced across categories.
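The selection step of the pipeline can be sketched as follows. Field names and the helper function are illustrative assumptions, not the pipeline's actual schema or code:

```python
from collections import defaultdict

def select_rejected(samples, per_category):
    """Pick the longest incorrect responses per category, balanced.

    `samples` is a list of dicts with keys "category", "correct", and
    "solution" (the SFT model's response text); these names are
    illustrative only.
    """
    by_category = defaultdict(list)
    for s in samples:
        if not s["correct"]:          # keep only incorrect responses
            by_category[s["category"]].append(s)
    selected = []
    for group in by_category.values():
        # Longest solutions first, treated as the hardest problems;
        # taking the same count per category keeps the set balanced.
        group.sort(key=lambda s: len(s["solution"]), reverse=True)
        selected.extend(group[:per_category])
    return selected
```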
We then use the solutions generated by DeepSeek-R1 as the positive (chosen) samples and pair each with a selected incorrect response to construct the DPO training pairs.

### 🏗️ Training Procedure

🧠 **Alignment Algorithm:** Direct Preference Optimization (DPO)

⚙️ **Training Hyperparameters:**

| Hyperparameter | Value |
| :---: | :---: |
| Batch Size | 16 |
| Learning Rate | 5e-7 |
| LR Scheduler | cosine |
| Warmup Ratio | 0.1 |
| Epochs | 3 |
| Sequence Parallelism | 4 |
| Loss | sigmoid preference loss |
| Preference Beta | 0.1 |

## 📈 Evaluation

We evaluate **InfiAlign-Qwen-7B-DPO** on a range of benchmarks covering mathematical reasoning, scientific problem-solving, and code generation. All metrics are reported as Pass@1 under a consistent regex-based answer extraction pipeline, adapted from [LIMO](https://github.com/open-compass/opencompass).

### 🧪 Benchmark Overview

* **AIME24 / AIME25**: American Invitational Mathematics Examination problems (Olympiad-level high school math).
* **MATH500**: Subset of the MATH dataset focused on complex mathematical reasoning.
* **GPQA**: Graduate-level, "Google-proof" multiple-choice questions in biology, physics, and chemistry.
* **MMLU-Pro**: Professional-level subset of the Massive Multitask Language Understanding benchmark.
* **LiveCodeBench**: Code reasoning benchmark using real-world coding problems.

### 📊 Performance Comparison (Pass@1)

| Model | Initial CKPT | Data Size | AIME 2025 (avg@64) | AIME 2024 (avg@64) | MATH500 (avg@4) | GPQA Diamond (avg@8) | MMLU-Pro (pass@1) | LiveCodeBench-v5 (avg@8) | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-7B-Instruct | Qwen2.5-7B-Base | 1M | 8.80 | 11.93 | 76.15 | 38.70 | 57.49 | 15.77 | 34.80 |
| Qwen2.5-Math-7B-Instruct | Qwen2.5-7B-Math-Base | 2.5M | 6.72 | 6.67 | 82.40 | 31.12 | 43.06 | 2.68 | 28.78 |
| DeepSeek-Distill-Qwen-7B | Qwen2.5-7B-Math-Base | 800K | 37.97 | 55.50* | 92.80* | 49.10* | 54.16 | 37.60* | 54.43 |
| OpenThinker2-7B | Qwen2.5-7B-Instruct | 1M | 38.70* | 60.70* | 87.60* | 47.00* | 40.60* | 37.50 | 52.01 |
| Light-R1-7B-DS | DeepSeek-Distill-Qwen-7B | 3K | 44.30* | 59.10* | 91.35 | 49.40* | 54.95 | 38.40 | 56.25 |
| InfiAlign-Qwen-7B-SFT-92K (ours) | Qwen2.5-7B-Math-Base | 92K | 43.39 | 56.46 | 92.35 | 48.48 | 53.51 | 34.05 | 54.70 |
| InfiAlign-Qwen-7B-DPO-9K (ours) | InfiAlign-Qwen-7B-SFT-92K | 9K | 44.06 | 61.04 | 91.95 | 48.17 | 49.90 | 34.54 | 54.94 |
| InfiAlign-Qwen-7B-SFT-165K (ours) 🤗 | Qwen2.5-7B-Math-Base | 165K | 42.19 | 63.75 | 92.70 | 53.60 | 56.68 | 36.20 | 57.52 |
| InfiAlign-Qwen-7B-DPO-10K (ours) 🤗 | InfiAlign-Qwen-7B-SFT-165K | 10K | 47.45 | 61.25 | 93.45 | 51.77 | 53.95 | 35.30 | 57.20 |
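The avg@k entries in the table report per-problem accuracy averaged over k sampled generations. A minimal sketch of this aggregation, assuming the standard definition (the exact evaluation harness may differ):

```python
def avg_at_k(results):
    """avg@k: mean per-problem accuracy over k sampled generations.

    `results` has one entry per problem; each entry is a list of k
    booleans marking whether each generation was judged correct.
    """
    per_problem = [sum(gens) / len(gens) for gens in results]
    # Report as a percentage, matching the table above.
    return 100.0 * sum(per_problem) / len(results)
```

For example, two problems with per-generation correctness `[True, False]` and `[True, True]` yield an avg@2 of 75.0.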