# SuperBPE

This 11B model was trained from scratch with a SuperBPE tokenizer. [SuperBPE](https://arxiv.org/abs/2503.13423) extends the BPE algorithm to include both traditional subword tokens (contained within word boundaries) and new **superword** tokens (containing parts of multiple words)! It matches the [8B BPE model](https://huggingface.co/UW/OLMo2-8B-BPE) in both training and inference FLOPs.

The model uses a scaled-up version of the OLMo2 7B architecture and was trained on the OLMo2 7B pretraining data. It has a context length of 3,000 tokens (matching the effective context size in bytes of a BPE model with a 4,096-token context length) and was trained on 238B tokens. The tokenizer has a vocabulary size of 200k and transitions from learning subword tokens to learning superword tokens at a vocabulary size of 180k.
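
To see what this means in practice, here is a minimal tokenization sketch with Hugging Face Transformers. The repo id `UW/OLMo2-11B-SuperBPE` is a placeholder for this model's actual path on the Hub, and the exact token pieces produced will depend on the trained vocabulary.

```python
from transformers import AutoTokenizer

# Placeholder repo id; substitute this model's actual path on the Hub.
tokenizer = AutoTokenizer.from_pretrained("UW/OLMo2-11B-SuperBPE")

text = "By the way, I am a big fan of the Milky Way."
ids = tokenizer.encode(text)
pieces = tokenizer.convert_ids_to_tokens(ids)

# Unlike plain BPE, some pieces may span several words at once
# (for example, a single token covering a phrase like "by the way").
print(pieces)
print(f"{len(ids)} tokens for {len(text)} characters")
```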
## Example Usage
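A minimal generation sketch, again assuming the placeholder repo id `UW/OLMo2-11B-SuperBPE` and standard causal-LM loading through Transformers:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo id; substitute this model's actual path on the Hub.
model_name = "UW/OLMo2-11B-SuperBPE"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

prompt = "The Milky Way is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Keep the prompt plus generated text within the 3,000-token context window.
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```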