---
title: CodeReviewBench
emoji: 😎
colorFrom: gray
colorTo: indigo
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: true
short_description: A comprehensive benchmark for code review.
models:
- openai/gpt-4o-mini
- openai/gpt-4o
- claude-3-7-sonnet
- deepseek/deepseek-r1
---

# CodeReview Bench Leaderboard

A comprehensive benchmark and leaderboard for code review generation models, inspired by [CodeReviewBench](https://huggingface.co/spaces/your-org/CodeReviewBench).

## Features

- **Multi-Language Support**: Evaluates models across 17+ programming languages, including Python, JavaScript, Java, C++, TypeScript, Go, Rust, and more
- **Dual Comment Languages**: Supports review comments in both Russian and English
- **Comprehensive Metrics**:
  - LLM-based multimetric evaluation (readability, relevance, explanation clarity, problem identification, actionability, completeness, specificity, contextual adequacy, consistency, brevity)
  - Exact-match metrics (pass@1, pass@5, pass@10, BLEU@10)
- **Interactive Visualization**: Compare model performance across categories with radar plots
- **Easy Submission**: Submit your model results via the web interface

## Metrics

### LLM-based Multimetric

- **Readability**: How easy the review is to understand
- **Relevance**: How relevant the review is to the code
- **Explanation Clarity**: How clear the explanations are
- **Problem Identification**: How well problems are identified
- **Actionability**: How actionable the suggestions are
- **Completeness**: How complete the review is
- **Specificity**: How specific the feedback is
- **Contextual Adequacy**: How well the review fits the context
- **Consistency**: How consistent the review style is
- **Brevity**: How concise the review is

### Exact-Match Metrics

- **Pass@1**: Percentage of correct reviews on the first attempt
- **Pass@5**: Percentage of correct reviews in the top 5 attempts
- **Pass@10**: Percentage of correct reviews in the top 10 attempts
- **BLEU@10**: BLEU score for the top 10 review candidates

## Programming Languages Supported

- Python
- JavaScript
- Java
- C++
- C#
- TypeScript
- Go
- Rust
- Swift
- Kotlin
- Ruby
- PHP
- C
- Scala
- R
- Dart
- Other

## Comment Languages

- Russian (ru)
- English (en)

## Example Categories

- Bug Fix
- Code Style
- Performance
- Security
- Refactoring
- Documentation
- Testing
- Architecture
- Other

## Installation

```bash
pip install -r requirements.txt
```

## Usage

```bash
python app.py
```

## Submission Format

Submit your results as a JSONL file where each line contains:

```json
{
  "model_name": "your-model-name",
  "programming_language": "python",
  "comment_language": "en",
  "readability": 8.5,
  "relevance": 9.0,
  "explanation_clarity": 7.8,
  "problem_identification": 8.2,
  "actionability": 8.7,
  "completeness": 8.0,
  "specificity": 7.5,
  "contextual_adequacy": 8.3,
  "consistency": 8.8,
  "brevity": 7.2,
  "pass_at_1": 0.75,
  "pass_at_5": 0.88,
  "pass_at_10": 0.92,
  "bleu_at_10": 0.65,
  "total_evaluations": 100
}
```
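As a rough illustration, a small Python sketch like the one below can assemble a record with these fields, run a few sanity checks, and append it to a JSONL file. The helper names (`validate_record`, `append_record`) and the file name `submission.jsonl` are illustrative, not part of the leaderboard code; the range and Pass@k ordering checks mirror the rules listed under Data Validation below.

```python
import json

# Quality scores use a 0-10 scale; performance metrics use 0.0-1.0 (see Data Validation below).
QUALITY_FIELDS = [
    "readability", "relevance", "explanation_clarity", "problem_identification",
    "actionability", "completeness", "specificity", "contextual_adequacy",
    "consistency", "brevity",
]
PERFORMANCE_FIELDS = ["pass_at_1", "pass_at_5", "pass_at_10", "bleu_at_10"]


def validate_record(record: dict) -> None:
    """Basic sanity checks before submitting (illustrative, not the official validator)."""
    for field in QUALITY_FIELDS:
        if not 0.0 <= record[field] <= 10.0:
            raise ValueError(f"{field} must be in [0, 10]")
    for field in PERFORMANCE_FIELDS:
        if not 0.0 <= record[field] <= 1.0:
            raise ValueError(f"{field} must be in [0.0, 1.0]")
    # Pass@k should be non-decreasing in k.
    if not record["pass_at_1"] <= record["pass_at_5"] <= record["pass_at_10"]:
        raise ValueError("expected pass_at_1 <= pass_at_5 <= pass_at_10")


def append_record(record: dict, path: str = "submission.jsonl") -> None:
    """Validate one result record and append it as a line to a JSONL file."""
    validate_record(record)
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


example = {
    "model_name": "your-model-name",
    "programming_language": "python",
    "comment_language": "en",
    "readability": 8.5, "relevance": 9.0, "explanation_clarity": 7.8,
    "problem_identification": 8.2, "actionability": 8.7, "completeness": 8.0,
    "specificity": 7.5, "contextual_adequacy": 8.3, "consistency": 8.8, "brevity": 7.2,
    "pass_at_1": 0.75, "pass_at_5": 0.88, "pass_at_10": 0.92, "bleu_at_10": 0.65,
    "total_evaluations": 100,
}
append_record(example)
```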
## Environment Variables

Set the following environment variables:

## Citation

## ✨ Additional Features

- **Multi-tab Interface**: Organized navigation with dedicated sections
- **Advanced Filtering**: Real-time filtering by multiple criteria
- **Dark Theme**: Modern, GitHub-inspired dark interface
- **IP-based Submissions**: Secure submission tracking
- **Comprehensive Analytics**: Detailed performance insights
- **Data Export**: Multiple export formats
- **Rate Limiting**: Anti-spam protection

### 🔧 Technical Improvements

- **Modular Architecture**: Clean separation of concerns
- **Type Safety**: Full type annotations throughout
- **Error Handling**: Comprehensive error handling and logging
- **Data Validation**: Multi-layer validation with Pydantic
- **Performance**: Optimized data processing and display

## 📈 Metrics & Evaluation

### Performance Metrics

- **BLEU**: Text similarity score (0.0-1.0)
- **Pass@1**: Success rate in a single attempt (0.0-1.0)
- **Pass@5**: Success rate in 5 attempts (0.0-1.0)
- **Pass@10**: Success rate in 10 attempts (0.0-1.0)

### Quality Dimensions

1. **Readability**: How clear and readable are the reviews?
2. **Relevance**: How relevant are they to the code changes?
3. **Explanation Clarity**: How well does the review explain issues?
4. **Problem Identification**: How effectively does it identify problems?
5. **Actionability**: How actionable are the suggestions?
6. **Completeness**: How thorough are the reviews?
7. **Specificity**: How specific are the comments?
8. **Contextual Adequacy**: How well does the review understand context?
9. **Consistency**: How consistent is the model across different reviews?
10. **Brevity**: How concise is the review without losing important information?

## 🔒 Security Features

### Rate Limiting

- **5 submissions per IP per 24 hours**
- **Automatic IP tracking and logging**
- **Graceful error handling for rate limits**

### Data Validation

- **Model name format validation**
- **Score range validation (0.0-1.0 for performance, 0-10 for quality)**
- **Logical consistency checks (Pass@1 ≤ Pass@5 ≤ Pass@10)**
- **Required field validation**

### Audit Trail

- **Complete submission logging**
- **IP address tracking (partially masked for privacy)**
- **Timestamp recording**
- **Data integrity checks**

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request

## 📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

## 🙏 Acknowledgments

- Inspired by [CodeReviewBench](https://huggingface.co/spaces/your-org/CodeReviewBench)
- Built with [Gradio](https://gradio.app/) for the web interface
- Thanks to the open-source community for tools and inspiration

## 📞 Support

For questions, issues, or contributions:

- Open an issue on GitHub
- Check the documentation
- Contact the maintainers

---

**Built with ❤️ for the code review research community**