khopilot committed on
Commit 7f966f9 · verified · 1 parent: 895b563

Upload CHANGELOG.md with huggingface_hub

Files changed (1):
  CHANGELOG.md +72 -0
CHANGELOG.md ADDED
# Changelog

## [v2.2-RC] - 2025-10-07

### Added
- Graph-regularized lexeme embeddings (12,654 × 768 dimensions)
- Morpho/synonymy graph with 4,245 curated edges (pruned from 15K)
- Coherence@10 metric evaluation framework
- Lexeme → subword mappings with max_pieces=24 (12,579 lexemes)
- Complete metrics report (metrics_corrected.yaml)

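The changelog reports Coherence@10 but does not define it. One common reading, sketched below under that assumption, is: for each lexeme, the fraction of its 10 nearest neighbours (by cosine similarity) that belong to the same semantic cluster, averaged over all lexemes. The function and cluster labels here are hypothetical illustrations, not the repository's evaluation code:

```python
import numpy as np

def coherence_at_k(embeddings: np.ndarray, labels: np.ndarray, k: int = 10) -> float:
    """Mean fraction of each item's k nearest neighbours (cosine)
    that share its cluster label. Hypothetical reading of Coherence@10."""
    # L2-normalise rows so the dot product equals cosine similarity
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)           # exclude self-matches
    topk = np.argsort(-sims, axis=1)[:, :k]   # indices of k nearest neighbours
    matches = labels[topk] == labels[:, None]
    return float(matches.mean())

# Toy example: two tight 2-D clusters, labels 0 and 1
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
lab = np.array([0, 0, 1, 1])
print(coherence_at_k(emb, lab, k=1))  # 1.0: each nearest neighbour shares its label
```

On the real 12,654 × 768 matrix the same computation would use k=10 and curated cluster labels.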
### Improved
- **Semantic coherence: 43.25% Coherence@10** (+110% vs v1.0 baseline 20.62%)
- Administrative term clusters: 93% cosine similarity
- Geographic/national clusters: 90%+ similarity
- Kinship term clusters: 88-92% similarity
- Grammar particle clusters: 85-88% similarity

### Technical Details
- Lambda ratio optimized: lap:lex = 2.5:1 (emphasizes the graph signal)
- Graph quality over quantity: 4,245 clean edges outperform ~10K noisy ones
- Minimal edge dropout: 0.01 (a clean graph needs less regularization)
- Gradient clipping: max_grad_norm=1.0 for training stability
- Training duration: 6h 21min (7,761 steps)
- Final Laplacian loss: 0.26 (down from 0.86 at initialization)

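The hyperparameters above can be tied together in a short sketch. This is an illustrative reconstruction, not the repository's training code: `laplacian_loss`, `lexical_loss`, and all shapes are hypothetical, but the 2.5:1 lap:lex ratio, edge dropout 0.01, and max_grad_norm=1.0 come directly from the entries above:

```python
import torch

def laplacian_loss(emb, edge_index, edge_weight, dropout=0.01):
    """Graph smoothness penalty: weighted ||e_i - e_j||^2 over edges,
    with the minimal 0.01 edge dropout described in the changelog."""
    src, dst = edge_index
    keep = torch.rand(src.shape[0]) > dropout   # drop ~1% of edges per step
    diff = emb[src[keep]] - emb[dst[keep]]
    return (edge_weight[keep] * diff.pow(2).sum(dim=1)).mean()

def lexical_loss(emb):
    """Placeholder for the task/lexical objective (hypothetical)."""
    return emb.pow(2).mean()

emb = torch.nn.Parameter(torch.randn(100, 768))  # toy size; real run: 12,654 x 768
edge_index = torch.randint(0, 100, (2, 4245))    # 4,245 curated edges
edge_weight = torch.ones(4245)

lam_lap, lam_lex = 2.5, 1.0                      # lap:lex = 2.5:1
loss = lam_lap * laplacian_loss(emb, edge_index, edge_weight) \
       + lam_lex * lexical_loss(emb)
loss.backward()
torch.nn.utils.clip_grad_norm_([emb], max_norm=1.0)  # max_grad_norm=1.0
```

The key design point from the "Fixed" section below holds here: weighting the Laplacian term above the lexical term (2.5:1 rather than 1:2) lets the clean graph dominate the geometry of the embedding space.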
### Configuration
- Base tokenizer: Production SentencePiece 8K (unchanged, compatible)
- Graph: morpho/synonymy only (pruned, quality-controlled)
- Schedule: warmup 2K → plateau 4K → anneal 1.76K steps
- Symmetric Laplacian normalization (anti-hub bias)
- Smart initialization: lexeme embeddings = mean(subword embeddings)

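Two of the configuration items above have standard formulations worth sketching. Symmetric Laplacian normalization, L_sym = I − D^(−1/2) A D^(−1/2), downweights high-degree hub nodes (the "anti-hub bias"); the mean-of-subwords initialization gives each lexeme a starting vector consistent with its pieces. Function names and shapes here are assumptions for illustration:

```python
import torch

def sym_normalized_laplacian(adj: torch.Tensor) -> torch.Tensor:
    """L_sym = I - D^{-1/2} A D^{-1/2}; symmetric normalisation
    so high-degree hubs do not dominate the smoothness penalty."""
    deg = adj.sum(dim=1)
    d_inv_sqrt = torch.where(deg > 0, deg.pow(-0.5), torch.zeros_like(deg))
    d = torch.diag(d_inv_sqrt)
    return torch.eye(adj.shape[0]) - d @ adj @ d

def init_lexeme_embeddings(subword_emb, pieces_per_lexeme):
    """Smart init: each lexeme vector = mean of its subword-piece vectors."""
    return torch.stack([subword_emb[p].mean(dim=0) for p in pieces_per_lexeme])

# Toy shapes (the real run maps 12,579 lexemes over an 8K subword vocabulary)
subword_emb = torch.randn(8, 4)
lexemes = [torch.tensor([0, 1]), torch.tensor([2, 3, 4])]
lex_emb = init_lexeme_embeddings(subword_emb, lexemes)
print(lex_emb.shape)  # torch.Size([2, 4])
```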
### Fixed
- v2.1 failure: wrong lambda ratio (1:2) plus a noisy distributional graph → 12.92% coherence ❌
- v2.2 solution: inverted ratio (2.5:1) plus a clean morpho/synonymy graph → 43.25% coherence ✅

### Dependencies
- sentencepiece
- torch
- numpy
- transformers (optional, for compatibility)

## [v1.0] - 2025-10-04

### Added
- Initial graph-regularization implementation
- 12K SPM tokenizer
- Mixed graph (4.2K edges)
- Coherence@10: 20.62%

### Technical
- Lambda ratio: lap:lex = 1:1
- Edge dropout: 0.2
- Training: 5h 59min production run (1.2M lines)

## Future Roadmap

### v2.3 (Planned)
- Consistency loss weighted by nb_pieces (more weight to heavily fragmented lexemes)
- Relation-type weights (morphology > distributional)
- Extended lexeme coverage (max_pieces=32)

### Evaluations (In Progress)
- Downstream NER benchmark (target: +5-10 F1)
- QA benchmark (target: +5 EM)
- Semantic search MRR (target: +10%)

### Visualizations (Planned)
- t-SNE/UMAP 2D semantic space plots
- Cosine similarity heatmaps
- Cluster quality analysis dashboards