README / 03_publish-hub-dataset.md
burtenshaw's picture
burtenshaw HF Staff
Upload folder using huggingface_hub
7acc317 verified
# Week 2: Publish a Hub Dataset
Create and share high-quality datasets on the Hub. Good data is the foundation of good models—help the community by contributing datasets others can train on.
## Why This Matters
The best open source models are built on openly available datasets. By publishing well-documented, properly structured datasets, you're directly enabling the next generation of model development. Quality matters more than quantity.
## The Skill
Use `hf_dataset_creator/` for this quest. Key capabilities:
- Initialize dataset repos with proper structure
- Multi-format support: chat, classification, QA, completion, tabular
- Template-based validation for data quality
- Streaming uploads without downloading entire datasets
```bash
# Quick setup with a template
python hf_dataset_creator/scripts/dataset_manager.py quick_setup \
--repo_id "your-username/dataset-name" --template chat
```
## XP Tiers
### 🐢 Starter — 50 XP
**Upload a small, clean dataset with a complete dataset card.**
1. Create a dataset with ≤1,000 rows
2. Write a dataset card covering: license, splits, and data provenance
3. Upload to the Hub under the hackathon organization (or your own account)
**What counts:** Clean data, clear documentation, proper licensing.
```bash
python hf_dataset_creator/scripts/dataset_manager.py init \
--repo_id "hf-skills/your-dataset-name"
python hf_dataset_creator/scripts/dataset_manager.py add_rows \
--repo_id "hf-skills/your-dataset-name" \
--template classification \
--rows_json "$(cat your_data.json)"
```
### 🐕 Standard — 100 XP
**Publish a conversational dataset with a complete dataset card.**
1. Create a dataset with ≤1,000 rows
2. Write a dataset card covering: license and splits.
3. Upload to the Hub under the hackathon organization.
**What counts:** Clean data, clear documentation, proper licensing.
### 🦁 Advanced — 200 XP
**Translate a dataset into multiple languages and publish it on the Hub.**
1. Find a dataset on the Hub
2. Translate the dataset into multiple languages
3. Publish the translated datasets on the Hub under the hackathon organization
**What counts:** Translated datasets and merged PRs.
## Resources
- [SKILL.md](../hf_dataset_creator/SKILL.md) — Full skill documentation
- [Templates](../hf_dataset_creator/templates/) — JSON templates for each format
- [Examples](../hf_dataset_creator/examples/) — Sample data and system prompts
---
**Next Quest:** [Supervised Fine-Tuning](04_sft-finetune-hub.md)