# Jina Affiliation Reranker

A cross-encoder reranker fine-tuned for affiliation string matching. Given a pair of affiliation strings, it predicts how likely they are to refer to the same institution.
## Use Case

This model is designed for matching and disambiguating messy, real-world affiliation strings against canonical institution records such as those in the Research Organization Registry (ROR).
Examples of what it handles (demonstrated in the sketch after this list):
- Abbreviations: "MIT" → "Massachusetts Institute of Technology"
- Word reordering: "University of Oxford" → "Oxford University"
- Partial matches: "Dept. of Physics, Stanford" → "Stanford University"
- International variants: "東京大学" → "University of Tokyo"
- OCR noise: "Univ ersity of Cal ifornia" → "University of California"
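The snippet below is a minimal sketch of scoring these kinds of variant pairs with `predict`. The pairs are the examples from the list above; actual scores depend on the model, so none are hard-coded here.

```python
from sentence_transformers import CrossEncoder

# Minimal sketch: score the variant pairs listed above.
# Assumes the published checkpoint name from this card.
model = CrossEncoder(
    "cometadata/jina-reranker-v2-multilingual-affiliations",
    trust_remote_code=True,
)

variant_pairs = [
    ["MIT", "Massachusetts Institute of Technology"],            # abbreviation
    ["University of Oxford", "Oxford University"],               # word reordering
    ["Dept. of Physics, Stanford", "Stanford University"],       # partial match
    ["東京大学", "University of Tokyo"],                          # international variant
    ["Univ ersity of Cal ifornia", "University of California"],  # OCR noise
]

scores = model.predict(variant_pairs)
for (left, right), score in zip(variant_pairs, scores):
    print(f"{score:.2f}  {left} <-> {right}")
```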
## Usage
```python
from sentence_transformers import CrossEncoder

model = CrossEncoder(
    "cometadata/jina-reranker-v2-multilingual-affiliations",
    trust_remote_code=True,
)

# Score affiliation pairs (higher = more likely the same institution)
pairs = [
    ["University of California, Berkeley", "UC Berkeley"],
    ["University of California, Berkeley", "Berkeley College"],
]
scores = model.predict(pairs)
# [0.82, 0.15] - first pair matches, second doesn't

# Rank candidates for an affiliation string
results = model.rank(
    "MIT, Cambridge, MA",
    [
        "Massachusetts Institute of Technology",
        "MIT University (India)",
        "University of Cambridge",
    ],
)
# Returns candidates ranked by relevance
```
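In recent sentence-transformers releases, `rank()` returns a list of dicts sorted by descending score, each carrying a `corpus_id` index into the candidate list and a `score`; pass `return_documents=True` if you also want the candidate text included in each entry.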
## Training
Base Model: jinaai/jina-reranker-v2-base-multilingual
Dataset: cometadata/triplet-loss-for-embedding-affiliations-sample-1
- ~8K triplets (anchor, positive, negative)
- 80% hard negatives (similar but different institutions)
- 20% easy negatives (clearly different institutions)
Configuration:
| Parameter | Value |
|---|---|
| Epochs | 3 |
| Batch size | 16 |
| Learning rate | 2e-5 |
| Loss | BinaryCrossEntropyLoss |
| Validation split | 15% |
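The configuration above can be approximated with the classic `CrossEncoder.fit` API by expanding each triplet into a positive and a negative pair with binary labels. The sketch below assumes the triplet dataset exposes `anchor`, `positive`, and `negative` columns, omits the validation split, and is not the exact training script used for this model.

```python
from datasets import load_dataset
from torch.utils.data import DataLoader
from sentence_transformers import CrossEncoder, InputExample

# Assumed column names; adjust to the actual dataset schema.
triplets = load_dataset(
    "cometadata/triplet-loss-for-embedding-affiliations-sample-1", split="train"
)

# Expand each (anchor, positive, negative) triplet into two binary pairs:
# label 1.0 for the matching institution, 0.0 for the negative.
examples = []
for row in triplets:
    examples.append(InputExample(texts=[row["anchor"], row["positive"]], label=1.0))
    examples.append(InputExample(texts=[row["anchor"], row["negative"]], label=0.0))

model = CrossEncoder(
    "jinaai/jina-reranker-v2-base-multilingual",
    num_labels=1,
    trust_remote_code=True,
)

train_dataloader = DataLoader(examples, shuffle=True, batch_size=16)

# With num_labels=1 the default loss is binary cross-entropy (BCEWithLogitsLoss),
# matching the loss listed in the table above.
model.fit(
    train_dataloader=train_dataloader,
    epochs=3,
    optimizer_params={"lr": 2e-5},
    warmup_steps=100,  # assumption: warmup schedule not specified in the card
)
```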
## Evaluation
Evaluated on 300 test cases across 10 difficulty tiers:
| Tier | Cases | Base Model | Fine-tuned | Δ |
|---|---|---|---|---|
| Baseline | 30 | 100.0% | 100.0% | +0.0% |
| OCR/Noise | 30 | 100.0% | 100.0% | +0.0% |
| Abbreviations | 40 | 60.0% | 80.0% | +20.0% |
| Hierarchical | 35 | 71.4% | 77.1% | +5.7% |
| Medical/Hospital | 25 | 64.0% | 68.0% | +4.0% |
| Research Labs | 25 | 80.0% | 84.0% | +4.0% |
| International | 35 | 82.9% | 91.4% | +8.6% |
| Disambiguation | 31 | 45.2% | 51.6% | +6.5% |
| Negative Controls | 19 | 100.0% | 100.0% | +0.0% |
| Ultra-Hard | 30 | 93.3% | 96.7% | +3.3% |
Overall: accuracy 78.3% → 84.3% (+6.0%), MRR 0.873 → 0.913
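For reference, here is a sketch of how the accuracy and MRR figures above can be computed: each test case is treated as an affiliation string, a candidate list, and the index of the correct institution, with accuracy counting top-1 hits and MRR averaging reciprocal ranks. The test case shown is illustrative, not part of the actual evaluation set.

```python
from sentence_transformers import CrossEncoder

model = CrossEncoder(
    "cometadata/jina-reranker-v2-multilingual-affiliations",
    trust_remote_code=True,
)

# (query, candidates, index of the correct candidate) -- illustrative only
test_cases = [
    (
        "MIT, Cambridge, MA",
        [
            "Massachusetts Institute of Technology",
            "MIT University (India)",
            "University of Cambridge",
        ],
        0,
    ),
]

hits = 0
reciprocal_ranks = []
for query, candidates, gold in test_cases:
    ranked = model.rank(query, candidates)                 # sorted by descending score
    order = [hit["corpus_id"] for hit in ranked]
    hits += int(order[0] == gold)                          # top-1 accuracy
    reciprocal_ranks.append(1.0 / (order.index(gold) + 1))

print(f"accuracy={hits / len(test_cases):.3f}")
print(f"MRR={sum(reciprocal_ranks) / len(reciprocal_ranks):.3f}")
```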
## Model Details
- Parameters: 278M
- Max sequence length: 1024 tokens
- Output: Single relevance score (0-1)
- Languages: Multilingual (inherits from base model)
## License
CC-BY-NC-4.0 (inherited from base model - non-commercial use only)
## Citation
```bibtex
@misc{jina-affiliation-reranker,
  title={Jina Affiliation Reranker},
  author={cometadata},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/cometadata/jina-reranker-v2-multilingual-affiliations}
}
```