Update README.md
README.md
CHANGED
@@ -13,8 +13,8 @@ The tokenizer was trained on a comprehensive dataset, including:
 - English and Dutch Wikipedia (278M and 356M, respectively)
 - Dutch and English book datasets (211M and 355M, respectively)
 - Dutch news articles (256M)
-- CodeParrot GitHub code (158M)
-- CodeSearchNet
+- CodeParrot GitHub Python code (158M)
+- CodeSearchNet Python code (126M)
 - Markdown files with math markup (5.8M)
 - Arxiv scientific papers (169M)
 