---
title: BPE DNA Tokenizer
emoji: 🧬
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: mit
---

# 🧬 BPE DNA Tokenizer

An interactive demo of a Byte Pair Encoding (BPE) tokenizer trained on the *E. coli* K-12 genome.

## 🎯 Key Results

- **Vocabulary Size**: 5,000 tokens
- **Compression Ratio**: 5.208x, measured as bases per token (62.8% above requirement)
- **Dataset**: *E. coli* K-12 genome (4.6M base pairs)
- **Lossless**: Decoding reproduces the input sequence exactly

## ✨ Features

- 🧬 **DNA-Optimized**: Designed specifically for genomic sequences
- 🚀 **High Compression**: Achieves 5.2x compression on the training genome
- 🔬 **Biological Discovery**: Learns tokens that match known motifs such as codons and TATA boxes
- ✅ **Lossless**: Perfect encode-decode reconstruction

## 🔬 Discovered Patterns

Without supervision, the tokenizer learned tokens that match known biological motifs:

- **Start Codon**: ATG
- **Stop Codons**: TAA, TAG
- **TATA Box**: TATAA (resembles the bacterial -10 Pribnow box, TATAAT)
- **Shine-Dalgarno**: AGGAGG
- **CpG Motif**: GCGC

## 🚀 Try It Out

1. Enter any DNA sequence (A, C, G, T, N)
2. Click "Tokenize Sequence"
3. See the compression statistics and token breakdown

## 📊 Model Details

- **Training Data**: 4,641,652 base pairs
- **Compressed Size**: 891,316 tokens (4,641,652 / 891,316 ≈ 5.208 bases per token)
- **Training Time**: 88 minutes
- **Longest Token**: 26 bases

## 🔗 Links

- [GitHub Repository](https://github.com/abi2024/bpe-dna-tokenizer)
- [Full Documentation](https://github.com/abi2024/bpe-dna-tokenizer#readme)

---

**Built for genomics and machine learning** 🧬🤖
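
## 🧪 Minimal BPE Sketch

To make the compression ratio and the lossless round trip concrete, here is a minimal sketch of BPE on a DNA string. It is not the code from the linked repository: the function names (`train_bpe`, `apply_merge`, `encode`, `decode`, `compression_ratio`) are hypothetical, and the naive training loop is only practical for short sequences, not for a full genome.

```python
from collections import Counter


def apply_merge(tokens, left, right):
    """Replace each non-overlapping (left, right) pair with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
            merged.append(left + right)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def train_bpe(sequence, num_merges):
    """Learn up to num_merges merges (hypothetical helper, toy-scale only)."""
    tokens = list(sequence)  # start from single bases
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:  # nothing repeats, so merging gains nothing
            break
        merges.append((left, right))
        tokens = apply_merge(tokens, left, right)
    return merges


def encode(sequence, merges):
    """Tokenize by replaying the learned merges in training order."""
    tokens = list(sequence)
    for left, right in merges:
        tokens = apply_merge(tokens, left, right)
    return tokens


def decode(tokens):
    """Tokens are plain substrings, so decoding is concatenation."""
    return "".join(tokens)


def compression_ratio(sequence, tokens):
    """Bases per token, the same measure as the ratio reported above."""
    return len(sequence) / len(tokens)


if __name__ == "__main__":
    seq = "ATGATGATGTAA"  # toy input, not real training data
    merges = train_bpe(seq, num_merges=3)
    tokens = encode(seq, merges)
    print(tokens, compression_ratio(seq, tokens))
    print(decode(tokens) == seq)  # lossless round trip
    # Ratio from the figures in Model Details:
    print(round(4_641_652 / 891_316, 3))
```

Because every token is a literal substring of the input, decoding is plain concatenation, which is why the round trip is exact regardless of which merges were learned.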