khopilot/khmer-tokenizer-v7

+# Changelog
+## [v2.2-RC] - 2025-10-07
+### Added
+- Graph-regularized lexeme embeddings (12,654 × 768 dimensions)
+- Morpho/synonymy graph with 4,245 curated edges (pruned from 15K)
+- Coherence@10 metric evaluation framework
+- Lexeme → subword mappings with max_pieces=24 (12,579 lexemes)
+- Complete metrics report (metrics_corrected.yaml)
+### Improved
+- **Semantic coherence: 43.25% Coherence@10** (+110% relative to the v1.0 baseline of 20.62%)
+- Administrative term clusters: 93% cosine similarity
+- Geographic/national clusters: 90%+ similarity
+- Kinship term clusters: 88-92% similarity
+- Grammar particle clusters: 85-88% similarity
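As a rough illustration of how figures like these could be computed, the sketch below builds a cosine-similarity matrix and implements one plausible reading of Coherence@k: the mean fraction of each item's k nearest neighbors that share its cluster label. The changelog does not define the metric, so `coherence_at_k`, its `labels` argument, and the label-agreement definition are assumptions for illustration, not the project's evaluation code. The last lines check the relative gain quoted above.

```python
import numpy as np

def cosine_matrix(emb):
    """Pairwise cosine similarity between the rows of emb."""
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.maximum(norms, 1e-12)  # guard against zero vectors
    return unit @ unit.T

def coherence_at_k(emb, labels, k=10):
    """ASSUMED definition: mean fraction of each item's k nearest
    neighbors (by cosine similarity) that carry the same label."""
    sim = cosine_matrix(emb)
    np.fill_diagonal(sim, -np.inf)          # an item is not its own neighbor
    k = min(k, len(emb) - 1)
    nearest = np.argsort(-sim, axis=1)[:, :k]
    same_label = labels[nearest] == labels[:, None]
    return float(same_label.mean())

# Relative improvement reported in the changelog: 20.62% -> 43.25%.
baseline, current = 20.62, 43.25
relative_gain = (current - baseline) / baseline  # about 1.0975, i.e. roughly +110%
```

With real data, `emb` would be the lexeme embedding matrix and `labels` a hypothetical array of cluster ids (administrative, kinship, and so on); neither is shipped with this sketch.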
+### Technical Details
+- Lambda ratio optimized: lap:lex = 2.5:1 (emphasizes graph signal)
+- Graph quality over quantity: 4.2K clean edges outperform 10K noisy edges
+- Minimal edge dropout: 0.01 (clean graph requires less regularization)
+- Gradient clipping: max_grad_norm=1.0 for training stability
+- Training duration: 6h 21min (7,761 steps)
+- Final Laplacian loss: 0.26 (down from 0.86 at the start, a ~70% reduction)
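The training knobs listed above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the project's training code: the Laplacian term is written in its symmetric-normalized edge form, the graph is treated as unweighted and undirected, and the lexeme-level term (`lex_loss`) is taken as an opaque scalar because the changelog does not define it. All function names are hypothetical.

```python
import numpy as np

LAMBDA_LAP, LAMBDA_LEX = 2.5, 1.0   # lap:lex = 2.5:1, from the changelog
EDGE_DROPOUT = 0.01                 # fraction of edges dropped per step
MAX_GRAD_NORM = 1.0                 # gradient clipping threshold

def drop_edges(edges, p, rng):
    """Edge dropout: keep each edge independently with probability 1 - p."""
    keep = rng.random(len(edges)) >= p
    return edges[keep]

def laplacian_loss(emb, edges):
    """Mean over edges of ||e_i/sqrt(d_i) - e_j/sqrt(d_j)||^2.
    ASSUMPTION: unweighted, undirected edges given as an (m, 2) int array."""
    if len(edges) == 0:
        return 0.0
    degree = np.bincount(edges.ravel(), minlength=emb.shape[0]).astype(float)
    scaled = emb / np.sqrt(np.maximum(degree, 1.0))[:, None]
    diff = scaled[edges[:, 0]] - scaled[edges[:, 1]]
    return float(np.mean(np.sum(diff ** 2, axis=1)))

def combined_loss(lap_loss, lex_loss):
    """Weighted sum of the graph term and the (undefined here) lexeme term."""
    return LAMBDA_LAP * lap_loss + LAMBDA_LEX * lex_loss

def clip_by_norm(grad, max_norm=MAX_GRAD_NORM):
    """Rescale grad so that its L2 norm does not exceed max_norm."""
    norm = np.linalg.norm(grad)
    return grad * (max_norm / norm) if norm > max_norm else grad
```

In a PyTorch training loop, the clipping step would normally be `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)` rather than a hand-written helper.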
+### Configuration
+- Base tokenizer: Production SentencePiece 8K (unchanged, compatible)
+- Graph: morpho/synonymy only (pruned, quality-controlled)
+- Schedule: warmup 2K → plateau 4K → anneal 1.76K steps (7.76K total)
+- Symmetric Laplacian normalization (anti-hub bias)
+- Mean initialization: each lexeme embedding starts as the mean of its subword embeddings
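Two of the configuration items above lend themselves to a short sketch: the mean-of-subwords initialization and the three-phase schedule. The code below is illustrative only. It assumes the schedule is a multiplier in [0, 1] with linear warmup and linear anneal (the changelog names the phases but not their shape or what they scale), and it takes the anneal length as 1,761 steps so that the phases sum to the reported 7,761. The names `init_lexeme_embeddings`, `lexeme_to_pieces`, and `schedule_multiplier` are hypothetical.

```python
import numpy as np

WARMUP, PLATEAU, ANNEAL = 2000, 4000, 1761  # 7,761 steps in total

def init_lexeme_embeddings(subword_emb, lexeme_to_pieces):
    """Initialize each lexeme vector as the mean of its subword vectors.
    lexeme_to_pieces: one list of subword ids per lexeme (the max_pieces
    cap is assumed to have been applied upstream)."""
    dim = subword_emb.shape[1]
    out = np.zeros((len(lexeme_to_pieces), dim), dtype=subword_emb.dtype)
    for i, pieces in enumerate(lexeme_to_pieces):
        out[i] = subword_emb[pieces].mean(axis=0)
    return out

def schedule_multiplier(step):
    """ASSUMED shape: linear ramp 0 -> 1, hold at 1, linear decay 1 -> 0."""
    if step < WARMUP:
        return step / WARMUP
    if step < WARMUP + PLATEAU:
        return 1.0
    remaining = WARMUP + PLATEAU + ANNEAL - step
    return max(0.0, remaining / ANNEAL)
```

For example, a lexeme that splits into subword ids `[0, 1]` starts at the midpoint of rows 0 and 1 of the subword table, which places it near its pieces before the graph term begins pulling related lexemes together.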
+### Fixed
+- v2.1 failure: wrong lambda ratio (lap:lex = 1:2) combined with a noisy distributional graph → 12.92% Coherence@10 ❌
+- v2.2 fix: ratio flipped to favor the graph term (lap:lex = 2.5:1) plus a clean morpho/synonymy graph → 43.25% Coherence@10 ✅
+### Dependencies
+- sentencepiece
+- torch
+- numpy
+- transformers (optional, for compatibility)
+## [v1.0] - 2025-10-04
+### Added
+- Initial graph regularization implementation
+- 12K SPM tokenizer
+- Mixed graph (4.2K edges)
+- Coherence@10: 20.62%
+### Technical
+- Lambda ratio: lap:lex = 1:1
+- Edge dropout: 0.2
+- Training: 5h 59min production run (1.2M lines)
+## Future Roadmap
+### v2.3 (Planned)
+- Consistency loss weighted by nb_pieces (more weight on fragmented lexemes)
+- Relation-type weights (morphology > distributional)
+- Extended lexeme coverage (max_pieces=32)
+### Evaluations (In Progress)
+- Downstream NER benchmark (target +5-10 F1)
+- QA benchmark (target +5 EM)
+- Semantic search MRR (target +10%)
+### Visualizations (Planned)
+- t-SNE/UMAP 2D semantic space plots
+- Cosine similarity heatmaps
+- Cluster quality analysis dashboards