Jina Affiliation Reranker

A cross-encoder reranker fine-tuned for affiliation string matching. Given a pair of affiliation strings, it predicts how likely they are to refer to the same institution.

Use Case

This model is designed for matching and disambiguating messy real-world affiliation strings against canonical institution records, such as those in the Research Organization Registry (ROR).

Examples of what it handles:

  • Abbreviations: "MIT" ↔ "Massachusetts Institute of Technology"
  • Word reordering: "University of Oxford" ↔ "Oxford University"
  • Partial matches: "Dept. of Physics, Stanford" ↔ "Stanford University"
  • International variants: "東京倧学" ↔ "University of Tokyo"
  • OCR noise: "Univ ersity of Cal ifornia" ↔ "University of California"

Usage

from sentence_transformers import CrossEncoder

model = CrossEncoder(
    "cometadata/jina-reranker-v2-multilingual-affiliations",
    trust_remote_code=True,
)

# Score affiliation pairs (higher = more likely same institution)
pairs = [
    ["University of California, Berkeley", "UC Berkeley"],
    ["University of California, Berkeley", "Berkeley College"],
]
scores = model.predict(pairs)
# [0.82, 0.15] - first pair matches, second doesn't

# Rank candidates for an affiliation string
results = model.rank(
    "MIT, Cambridge, MA",
    [
        "Massachusetts Institute of Technology",
        "MIT University (India)",
        "University of Cambridge",
    ]
)
# Returns candidates ranked by relevance
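
For end-to-end matching against a registry such as ROR, a common pattern is to pull a short list of candidate records with a cheap first-stage retriever and let the reranker pick the winner, accepting it only above a score threshold. The sketch below reuses the model object loaded above; the candidate list and the 0.5 cut-off are illustrative assumptions, not values recommended by this card.

# Minimal sketch: rerank candidate ROR names for a messy affiliation string
# and accept the top hit only if it clears a (hypothetical) threshold.
raw_affiliation = "Dept. of Physics, Stanford, CA 94305"

# Candidates would normally come from a first-stage retriever
# (lexical search, embeddings, ...); hard-coded here for illustration.
candidates = [
    "Stanford University",
    "Stanford Health Care",
    "SLAC National Accelerator Laboratory",
]

ranked = model.rank(raw_affiliation, candidates, return_documents=True)
best = ranked[0]
if best["score"] >= 0.5:  # illustrative cut-off; tune on held-out pairs
    print("Matched:", best["text"], round(float(best["score"]), 3))
else:
    print("No confident match")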

Training

Base Model: jinaai/jina-reranker-v2-base-multilingual

Dataset: cometadata/triplet-loss-for-embedding-affiliations-sample-1

  • ~8K triplets (anchor, positive, negative)
  • 80% hard negatives (similar but different institutions)
  • 20% easy negatives (clearly different institutions)

Configuration:

Parameter          Value
Epochs             3
Batch size         16
Learning rate      2e-5
Loss               BinaryCrossEntropyLoss
Validation split   15%
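
The training script itself is not included in this card. As a rough illustration of how the triplets and the configuration above fit together, the sketch below expands each triplet into two labeled pairs and trains with the classic CrossEncoder.fit interface from sentence-transformers; the inline triplet list is a stand-in for the real dataset, and the actual script may differ.

from torch.utils.data import DataLoader
from sentence_transformers import CrossEncoder, InputExample

# Stand-in rows for the dataset above: (anchor, positive, negative).
triplets = [
    (
        "Dept. of Physics, MIT, Cambridge MA",
        "Massachusetts Institute of Technology",
        "University of Cambridge",
    ),
]

# Each triplet becomes two labeled pairs:
# (anchor, positive) -> 1 and (anchor, negative) -> 0.
train_examples = []
for anchor, positive, negative in triplets:
    train_examples.append(InputExample(texts=[anchor, positive], label=1))
    train_examples.append(InputExample(texts=[anchor, negative], label=0))

model = CrossEncoder(
    "jinaai/jina-reranker-v2-base-multilingual",
    num_labels=1,  # single score head; binary cross-entropy loss by default
    trust_remote_code=True,
)
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
model.fit(
    train_dataloader=train_dataloader,
    epochs=3,
    warmup_steps=100,  # warmup schedule not stated in the card
    optimizer_params={"lr": 2e-5},
)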

Evaluation

Evaluated on 300 test cases across 10 difficulty tiers:

Tier                Cases   Base Model   Fine-tuned   Ξ”
Baseline            30      100.0%       100.0%       β€”
OCR/Noise           30      100.0%       100.0%       β€”
Abbreviations       40      60.0%        80.0%        +20.0%
Hierarchical        35      71.4%        77.1%        +5.7%
Medical/Hospital    25      64.0%        68.0%        +4.0%
Research Labs       25      80.0%        84.0%        +4.0%
International       35      82.9%        91.4%        +8.6%
Disambiguation      31      45.2%        51.6%        +6.5%
Negative Controls   19      100.0%       100.0%       β€”
Ultra-Hard          30      93.3%        96.7%        +3.3%

Overall: 78.3% β†’ 84.3% accuracy (+6.0%), MRR 0.873 β†’ 0.913
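
The evaluation harness is not shown here. For reference, top-1 accuracy and MRR over the reranked candidate lists can be computed as below, where gold_ranks is an assumed list holding the 1-based position of the correct institution for each test case.

# gold_ranks[i] = 1-based rank of the correct institution for test case i
# (toy values; the real evaluation uses the 300 cases described above).
gold_ranks = [1, 1, 2, 1, 4]

accuracy = sum(1 for r in gold_ranks if r == 1) / len(gold_ranks)
mrr = sum(1.0 / r for r in gold_ranks) / len(gold_ranks)
print(f"Top-1 accuracy: {accuracy:.3f}  MRR: {mrr:.3f}")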

Model Details

  • Parameters: 278M
  • Max sequence length: 1024 tokens
  • Output: Single relevance score (0-1)
  • Languages: Multilingual (inherits from base model)
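
As a quick check of the output format (reusing the model object from the Usage section), predict returns one score per input pair; the pair below is only an illustrative example.

# One float per pair; values fall in the 0-1 range described above.
scores = model.predict([["ETH Zürich", "Swiss Federal Institute of Technology"]])
print(len(scores), float(scores[0]))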

License

CC-BY-NC-4.0, inherited from the base model (non-commercial use only).

Citation

@misc{jina-affiliation-reranker,
  title={Jina Affiliation Reranker},
  author={cometadata},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/cometadata/jina-reranker-v2-multilingual-affiliations}
}