---
title: CodeReviewBench
emoji: 😎
colorFrom: gray
colorTo: indigo
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: true
short_description: A comprehensive benchmark for code review.
models:
- openai/gpt-4o-mini
- openai/gpt-4o
- claude-3-7-sonnet
- deepseek/deepseek-r1
---

# CodeReview Bench Leaderboard

A comprehensive benchmark and leaderboard for code review generation models, inspired by [CodeReviewBench](https://huggingface.co/spaces/your-org/CodeReviewBench).

## Features

- **Multi-Language Support**: Evaluates models across 17+ programming languages, including Python, JavaScript, Java, C++, TypeScript, Go, Rust, and more
- **Dual Comment Languages**: Supports review comments in both Russian and English
- **Comprehensive Metrics**:
  - LLM-based multimetric evaluation (readability, relevance, explanation clarity, problem identification, actionability, completeness, specificity, contextual adequacy, consistency, brevity)
  - Exact-match metrics (pass@1, pass@5, pass@10, BLEU@10)
- **Interactive Visualization**: Compare model performance across categories with radar plots
- **Easy Submission**: Submit your model results via the web interface

## Metrics

### LLM-based Multimetric

- **Readability**: How easy the review is to understand
- **Relevance**: How relevant the review is to the code
- **Explanation Clarity**: How clear the explanations are
- **Problem Identification**: How well problems are identified
- **Actionability**: How actionable the suggestions are
- **Completeness**: How complete the review is
- **Specificity**: How specific the feedback is
- **Contextual Adequacy**: How well the review fits the context
- **Consistency**: How consistent the review style is
- **Brevity**: How concise the review is

### Exact-Match Metrics

- **Pass@1**: Percentage of correct reviews on the first attempt
- **Pass@5**: Percentage of correct reviews in the top 5 attempts
- **Pass@10**: Percentage of correct reviews in the top 10 attempts
- **BLEU@10**: BLEU score for the top 10 review candidates

## Programming Languages Supported

- Python
- JavaScript
- Java
- C++
- C#
- TypeScript
- Go
- Rust
- Swift
- Kotlin
- Ruby
- PHP
- C
- Scala
- R
- Dart
- Other

## Comment Languages

- Russian (ru)
- English (en)

## Example Categories

- Bug Fix
- Code Style
- Performance
- Security
- Refactoring
- Documentation
- Testing
- Architecture
- Other

## Installation

```bash
pip install -r requirements.txt
```

## Usage

```bash
python app.py
```

## Submission Format

Submit your results as a JSONL file where each line contains:

```json
{
  "model_name": "your-model-name",
  "programming_language": "python",
  "comment_language": "en",
  "readability": 8.5,
  "relevance": 9.0,
  "explanation_clarity": 7.8,
  "problem_identification": 8.2,
  "actionability": 8.7,
  "completeness": 8.0,
  "specificity": 7.5,
  "contextual_adequacy": 8.3,
  "consistency": 8.8,
  "brevity": 7.2,
  "pass_at_1": 0.75,
  "pass_at_5": 0.88,
  "pass_at_10": 0.92,
  "bleu_at_10": 0.65,
  "total_evaluations": 100
}
```
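As a rough illustration, a small Python sketch like the one below can assemble a record with these fields, run a few sanity checks, and append it to a JSONL file. The helper names (`validate_record`, `append_record`) and the file name `submission.jsonl` are illustrative, not part of the leaderboard code; the range and Pass@k ordering checks mirror the rules listed under Data Validation below.

```python
import json

# Quality scores use a 0-10 scale; performance metrics use 0.0-1.0 (see Data Validation below).
QUALITY_FIELDS = [
    "readability", "relevance", "explanation_clarity", "problem_identification",
    "actionability", "completeness", "specificity", "contextual_adequacy",
    "consistency", "brevity",
]
PERFORMANCE_FIELDS = ["pass_at_1", "pass_at_5", "pass_at_10", "bleu_at_10"]


def validate_record(record: dict) -> None:
    """Basic sanity checks before submitting (illustrative, not the official validator)."""
    for field in QUALITY_FIELDS:
        if not 0.0 <= record[field] <= 10.0:
            raise ValueError(f"{field} must be in [0, 10]")
    for field in PERFORMANCE_FIELDS:
        if not 0.0 <= record[field] <= 1.0:
            raise ValueError(f"{field} must be in [0.0, 1.0]")
    # Pass@k should be non-decreasing in k.
    if not record["pass_at_1"] <= record["pass_at_5"] <= record["pass_at_10"]:
        raise ValueError("expected pass_at_1 <= pass_at_5 <= pass_at_10")


def append_record(record: dict, path: str = "submission.jsonl") -> None:
    """Validate one result record and append it as a line to a JSONL file."""
    validate_record(record)
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


example = {
    "model_name": "your-model-name",
    "programming_language": "python",
    "comment_language": "en",
    "readability": 8.5, "relevance": 9.0, "explanation_clarity": 7.8,
    "problem_identification": 8.2, "actionability": 8.7, "completeness": 8.0,
    "specificity": 7.5, "contextual_adequacy": 8.3, "consistency": 8.8, "brevity": 7.2,
    "pass_at_1": 0.75, "pass_at_5": 0.88, "pass_at_10": 0.92, "bleu_at_10": 0.65,
    "total_evaluations": 100,
}
append_record(example)
```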
## Environment Variables

Set the following environment variables:

## Citation

## ✨ Additional Features

- **Multi-tab Interface**: Organized navigation with dedicated sections
- **Advanced Filtering**: Real-time filtering by multiple criteria
- **Dark Theme**: Modern, GitHub-inspired dark interface
- **IP-based Submissions**: Secure submission tracking
- **Comprehensive Analytics**: Detailed performance insights
- **Data Export**: Multiple export formats
- **Rate Limiting**: Anti-spam protection

### 🔧 Technical Improvements

- **Modular Architecture**: Clean separation of concerns
- **Type Safety**: Full type annotations throughout
- **Error Handling**: Comprehensive error handling and logging
- **Data Validation**: Multi-layer validation with Pydantic
- **Performance**: Optimized data processing and display

## 📈 Metrics & Evaluation

### Performance Metrics

- **BLEU**: Text similarity score (0.0-1.0)
- **Pass@1**: Success rate in a single attempt (0.0-1.0)
- **Pass@5**: Success rate in 5 attempts (0.0-1.0)
- **Pass@10**: Success rate in 10 attempts (0.0-1.0)

### Quality Dimensions

1. **Readability**: How clear and readable are the reviews?
2. **Relevance**: How relevant are they to the code changes?
3. **Explanation Clarity**: How well does the review explain issues?
4. **Problem Identification**: How effectively does it identify problems?
5. **Actionability**: How actionable are the suggestions?
6. **Completeness**: How thorough are the reviews?
7. **Specificity**: How specific are the comments?
8. **Contextual Adequacy**: How well does the review understand context?
9. **Consistency**: How consistent is the model across different reviews?
10. **Brevity**: How concise is the review without losing important information?

## 🔒 Security Features

### Rate Limiting

- **5 submissions per IP per 24 hours**
- **Automatic IP tracking and logging**
- **Graceful error handling for rate limits**

### Data Validation

- **Model name format validation**
- **Score range validation (0.0-1.0 for performance, 0-10 for quality)**
- **Logical consistency checks (Pass@1 ≤ Pass@5 ≤ Pass@10)**
- **Required field validation**

### Audit Trail

- **Complete submission logging**
- **IP address tracking (partially masked for privacy)**
- **Timestamp recording**
- **Data integrity checks**

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request

## 📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

## 🙏 Acknowledgments

- Inspired by [CodeReviewBench](https://huggingface.co/spaces/your-org/CodeReviewBench)
- Built with [Gradio](https://gradio.app/) for the web interface
- Thanks to the open-source community for tools and inspiration

## 📞 Support

For questions, issues, or contributions:

- Open an issue on GitHub
- Check the documentation
- Contact the maintainers

---

**Built with ❤️ for the code review research community**