---
base_model:
- Qwen/Qwen3-4B
datasets:
- codefuse-ai/F2LLM
language:
- en
tags:
- transformers
license: apache-2.0
pipeline_tag: feature-extraction
library_name: sentence-transformers
---

# F2LLM-4B: Matching SOTA Embedding Performance with 6 Million Open-Source Data

This model is a part of the F2LLM family, presented in the paper [F2LLM Technical Report: Matching SOTA Embedding Performance with 6 Million Open-Source Data](https://huggingface.co/papers/2510.02294).

**Code**: [https://github.com/codefuse-ai/CodeFuse-Embeddings/tree/main/F2LLM](https://github.com/codefuse-ai/CodeFuse-Embeddings/tree/main/F2LLM)

F2LLMs (Foundation to Feature Large Language Models) are foundation models directly finetuned on 6 million high-quality query-document pairs (available in [codefuse-ai/F2LLM](https://huggingface.co/datasets/codefuse-ai/F2LLM)) covering a diverse range of retrieval, classification, and clustering data, curated solely from open-source datasets without any synthetic data. These models are trained with homogeneous macro batches in a single stage, without sophisticated multi-stage pipelines.
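
For readers curious what finetuning on query-document pairs looks like in practice: embedding LLMs of this kind are commonly trained with a contrastive (InfoNCE-style) objective that treats each query's paired document as the positive and the other in-batch documents as negatives. The sketch below is only an illustrative, minimal version of such an objective, not the released training code (see the Training section for that); the temperature value and batching details are assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(query_emb: torch.Tensor, doc_emb: torch.Tensor, temperature: float = 0.05):
    """Minimal InfoNCE with in-batch negatives (illustrative only).

    query_emb, doc_emb: (batch, dim) L2-normalized embeddings, where doc_emb[i]
    is the positive document for query_emb[i]. The temperature is an arbitrary
    illustrative choice, not the value used to train F2LLM.
    """
    logits = query_emb @ doc_emb.T / temperature                       # (batch, batch) similarity matrix
    labels = torch.arange(query_emb.size(0), device=query_emb.device)  # positives lie on the diagonal
    return F.cross_entropy(logits, labels)
```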

## Usage

### With Sentence Transformers

To encode text using F2LLM with the [Sentence Transformers](https://www.sbert.net/) library:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("codefuse-ai/F2LLM-4B", model_kwargs={"torch_dtype": "bfloat16"})

# Some sample query and documents
query = "What is F2LLM used for?"
documents = [
    'We present F2LLM, a family of fully open embedding LLMs that achieve a strong balance between model size, training data, and embedding performance.',
    'Model checkpoints, training datasets, and training code are released, positioning F2LLM as a strong, reproducible, and budget-friendly baseline for future research in text embedding models.',
    'F2LLM is a model for computing text embeddings that can be used for various NLP tasks such as information retrieval, semantic search, and text classification.'
]

# Encode the query and documents separately; encode_query applies the model's query prompt
query_embedding = model.encode_query(query)
document_embeddings = model.encode_document(documents)
print(query_embedding.shape, document_embeddings.shape)
# (2560,) (3, 2560)

# Compute cosine similarity between the query and documents
similarity = model.similarity(query_embedding, document_embeddings)
print(similarity)
# tensor([[0.5209, 0.5680, 0.7818]])
```
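
For larger document sets, the `semantic_search` utility from Sentence Transformers can rank a corpus for each query. The snippet below is a hedged extension of the example above (not part of the original card); it reuses `query_embedding` and `document_embeddings` from that snippet, and the `top_k` value is an arbitrary choice.

```python
from sentence_transformers import util

# Rank the documents for the query by cosine similarity and keep the best two.
# Reuses query_embedding and document_embeddings from the snippet above.
hits = util.semantic_search(query_embedding, document_embeddings, top_k=2)
print(hits)
# [[{'corpus_id': 2, 'score': 0.78...}, {'corpus_id': 1, 'score': 0.56...}]]
```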

### With Transformers

Or directly with the [Transformers](https://huggingface.co/docs/transformers/index) library:

```python
from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn.functional as F


model_path = "codefuse-ai/F2LLM-4B"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map={'': 0})

query = "What is F2LLM used for?"
query_prompt = "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery:"
documents = [
    'We present F2LLM, a family of fully open embedding LLMs that achieve a strong balance between model size, training data, and embedding performance.',
    'Model checkpoints, training datasets, and training code are released, positioning F2LLM as a strong, reproducible, and budget-friendly baseline for future research in text embedding models.',
    'F2LLM is a model for computing text embeddings that can be used for various NLP tasks such as information retrieval, semantic search, and text classification.'
]

def encode(sentences):
    batch_size = len(sentences)
    tokenized_inputs = tokenizer(sentences, padding=True, return_tensors='pt').to(model.device)
    last_hidden_state = model(**tokenized_inputs).last_hidden_state
    # Pool the hidden state at the last non-padding token (EOS position) of each sequence
    eos_positions = tokenized_inputs.attention_mask.sum(dim=1) - 1
    embeddings = last_hidden_state[torch.arange(batch_size, device=model.device), eos_positions]
    # L2-normalize so dot products between embeddings are cosine similarities
    embeddings = F.normalize(embeddings, p=2, dim=1)
    return embeddings

# Encode the query and documents
query_embedding = encode([query_prompt + query])
document_embeddings = encode(documents)
print(query_embedding.shape, document_embeddings.shape)
# torch.Size([1, 2560]) torch.Size([3, 2560])

# Compute cosine similarity between the query and documents
similarity = query_embedding @ document_embeddings.T
print(similarity)
# tensor([[0.5156, 0.5664, 0.7773]], device='cuda:0', dtype=torch.bfloat16,
#        grad_fn=<MmBackward0>)
```
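
As a small follow-up (not part of the original card), the similarity row above can be turned into a ranked document list; the snippet reuses `similarity` and `documents` from the example.

```python
# Rank documents by their similarity to the query (reuses variables from above).
scores = similarity[0].detach().float().tolist()
for score, doc in sorted(zip(scores, documents), reverse=True):
    print(f"{score:.4f}  {doc[:60]}...")
```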

## Evaluation

To evaluate F2LLMs on MTEB (currently requires installing MTEB from source):

```python
import mteb
import logging
logging.basicConfig(level=logging.INFO)

task_names = ['AmazonCounterfactualClassification', 'ArXivHierarchicalClusteringP2P', 'ArXivHierarchicalClusteringS2S', 'ArguAna', 'AskUbuntuDupQuestions', 'BIOSSES', 'Banking77Classification', 'BiorxivClusteringP2P.v2', 'CQADupstackGamingRetrieval', 'CQADupstackUnixRetrieval', 'ClimateFEVERHardNegatives', 'FEVERHardNegatives', 'FiQA2018', 'HotpotQAHardNegatives', 'ImdbClassification', 'MTOPDomainClassification', 'MassiveIntentClassification', 'MassiveScenarioClassification', 'MedrxivClusteringP2P.v2', 'MedrxivClusteringS2S.v2', 'SCIDOCS', 'SICK-R', 'STS12', 'STS13', 'STS14', 'STS15', 'STS17', 'STS22.v2', 'STSBenchmark', 'SprintDuplicateQuestions', 'StackExchangeClustering.v2', 'StackExchangeClusteringP2P.v2', 'SummEvalSummarization.v2', 'TRECCOVID', 'Touche2020Retrieval.v3', 'ToxicConversationsClassification', 'TweetSentimentExtractionClassification', 'TwentyNewsgroupsClustering.v2', 'TwitterSemEval2015', 'TwitterURLCorpus', 'MindSmallReranking']

tasks = [
    mteb.get_task(task_name, languages=["eng"], eval_splits=["test"], exclusive_language_filter=True)
    for task_name in task_names
]


model = mteb.get_model("codefuse-ai/F2LLM-4B", device="cuda:0")
evaluation = mteb.MTEB(tasks=tasks)
evaluation.run(model, encode_kwargs={"batch_size": 16})
```
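
A hedged note on collecting the scores: `evaluation.run` also accepts an `output_folder` argument and returns per-task result objects. The snippet below assumes a recent mteb version where results expose `task_name` and `get_score()`; attribute and method names may differ across versions.

```python
# Write per-task JSON results to a folder and print a quick summary.
# (Assumes a recent mteb version; task_name / get_score() may differ in older releases.)
results = evaluation.run(
    model,
    encode_kwargs={"batch_size": 16},
    output_folder="results/F2LLM-4B",
)
for task_result in results:
    print(task_result.task_name, round(task_result.get_score(), 4))
```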

## Training

Training code is available in our [Github repo](https://github.com/codefuse-ai/CodeFuse-Embeddings/tree/main/F2LLM).

## Citation

If you use the F2LLM models, data, or code, please cite the following technical report.

```
@article{2025F2LLM,
  title      = {F2LLM Technical Report: Matching SOTA Embedding Performance with 6 Million Open-Source Data},
  author     = {Ziyin Zhang and Zihan Liao and Hang Yu and Peng Di and Rui Wang},
  journal    = {CoRR},
  volume     = {abs/2510.02294},
  year       = {2025},
  url        = {https://doi.org/10.48550/arXiv.2510.02294},
  doi        = {10.48550/ARXIV.2510.02294},
  eprinttype = {arXiv},
  eprint     = {2510.02294}
}
```