# SuperBPE

This 11B model was trained from scratch with a SuperBPE tokenizer. [SuperBPE](https://arxiv.org/abs/2503.13423) extends the BPE algorithm to include both traditional subword tokens (contained within word boundaries) and new **superword** tokens (containing parts of multiple words)! It matches the [8B BPE model](https://huggingface.co/UW/OLMo2-8B-BPE) in both training and inference FLOPs.

The model uses a scaled-up version of the OLMo2 7B architecture and was trained on the OLMo2 7B pretraining data. It has a context length of 3,000 tokens (matching the effective context size in bytes of a BPE model with a 4,096-token context length) and was trained on 238B tokens. The tokenizer has a vocabulary size of 200k and transitions from learning subword tokens to learning superword tokens at a vocabulary size of 180k.
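
To see what this means in practice, here is a minimal tokenization sketch with Hugging Face Transformers. The repo id `UW/OLMo2-11B-SuperBPE` is a placeholder for this model's actual path on the Hub, and the exact token pieces produced will depend on the trained vocabulary.

```python
from transformers import AutoTokenizer

# Placeholder repo id; substitute this model's actual path on the Hub.
tokenizer = AutoTokenizer.from_pretrained("UW/OLMo2-11B-SuperBPE")

text = "By the way, I am a big fan of the Milky Way."
ids = tokenizer.encode(text)
pieces = tokenizer.convert_ids_to_tokens(ids)

# Unlike plain BPE, some pieces may span several words at once
# (for example, a single token covering a phrase like "by the way").
print(pieces)
print(f"{len(ids)} tokens for {len(text)} characters")
```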
## Example Usage
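A minimal generation sketch, again assuming the placeholder repo id `UW/OLMo2-11B-SuperBPE` and standard causal-LM loading through Transformers:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo id; substitute this model's actual path on the Hub.
model_name = "UW/OLMo2-11B-SuperBPE"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

prompt = "The Milky Way is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Keep the prompt plus generated text within the 3,000-token context window.
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```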