Week 2: Publish a Hub Dataset
Create and share high-quality datasets on the Hub. Good data is the foundation of good modelsβhelp the community by contributing datasets others can train on.
Why This Matters
The best open source models are built on openly available datasets. By publishing well-documented, properly structured datasets, you're directly enabling the next generation of model development. Quality matters more than quantity.
The Skill
Use hf_dataset_creator/ for this quest. Key capabilities:
- Initialize dataset repos with proper structure
- Multi-format support: chat, classification, QA, completion, tabular
- Template-based validation for data quality
- Streaming uploads without downloading entire datasets
# Quick setup with a template
python hf_dataset_creator/scripts/dataset_manager.py quick_setup \
--repo_id "your-username/dataset-name" --template chat
XP Tiers
π’ Starter β 50 XP
Upload a small, clean dataset with a complete dataset card.
- Create a dataset with β€1,000 rows
- Write a dataset card covering: license, splits, and data provenance
- Upload to the Hub under the hackathon organization (or your own account)
What counts: Clean data, clear documentation, proper licensing.
python hf_dataset_creator/scripts/dataset_manager.py init \
--repo_id "hf-skills/your-dataset-name"
python hf_dataset_creator/scripts/dataset_manager.py add_rows \
--repo_id "hf-skills/your-dataset-name" \
--template classification \
--rows_json "$(cat your_data.json)"
π Standard β 100 XP
Publish a conversational dataset with a complete dataset card.
- Create a dataset with β€1,000 rows
- Write a dataset card covering: license and splits.
- Upload to the Hub under the hackathon organization.
What counts: Clean data, clear documentation, proper licensing.
π¦ Advanced β 200 XP
Translate a dataset into multiple languages and publish it on the Hub.
- Find a dataset on the Hub
- Translate the dataset into multiple languages
- Publish the translated datasets on the Hub under the hackathon organization
What counts: Translated datasets and merged PRs.
Resources
- SKILL.md β Full skill documentation
- Templates β JSON templates for each format
- Examples β Sample data and system prompts
Next Quest: Supervised Fine-Tuning