- Bobo: Embedding Model (derived from mixedbread-ai/mxbai-embed-large-v1)
Bobo is a text-embedding model derived from mixedbread-ai/mxbai-embed-large-v1, packaged for drop-in use in semantic search, RAG (Retrieval-Augmented Generation), clustering, deduplication, and zero-shot classification. It produces dense vector representations suitable for ANN indexes (FAISS, ScaNN, Milvus, pgvector) and common retrieval stacks.
Key Features
- Derived from the proven mixedbread-ai/mxbai-embed-large-v1 embedding family.
- Strong performance on retrieval-style tasks (semantic search, RAG, clustering).
- Sentence-Transformers-compatible API for fast adoption.
- CLS (default) or mean pooling, with optional L2 normalization for cosine-similarity workflows.
- Production-friendly guidance on chunking, batching, and indexing.
Technical Specifications
| Property | Value / Guidance |
|---|---|
| Base model | mixedbread-ai/mxbai-embed-large-v1 (encoder-style transformer) |
| Architecture | Transformer encoder (Sentence-Transformers compatible) |
| Embedding dimension | Query programmatically at runtime; common builds use 1024 |
| Tokenization | Provided by upstream model tokenizer |
| Max input length | Depends on upstream config; chunk long docs (e.g., 256–512 tokens) |
| Pooling | CLS pooling (default) or mean pooling, then optional L2 normalization |
| Output | Dense float vectors (often normalized if using cosine similarity) |
| Intended backends | FAISS, Milvus, pgvector, Qdrant, Chroma, Weaviate |
Tip: Always detect the dimension in code (see Quickstart) and configure your index accordingly.
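For example, here is a minimal sketch of runtime dimension detection using the Sentence-Transformers API (the model id matches the Quickstart below; swap in your own build if it differs):
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("leeroy-jankins/bobo-embed-large-v1")
dim = model.get_sentence_embedding_dimension()  # e.g., 1024, or 512 if truncated
print(f"embedding dimension: {dim}")  # size your FAISS/pgvector/Qdrant index with this value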
Quickstart
- Below, we show several ways to produce sentence embeddings. Note that for retrieval you must prefix each query with the prompt "Represent this sentence for searching relevant passages: "; documents do not need any prompt.
Vectorized Datasets
Vectorization is the process of converting textual data into numerical vectors, usually applied once the text has been cleaned; it can improve execution speed and reduce training time. BudgetPy provides the following vector stores on the OpenAI platform to support environmental data analysis with machine learning:
- Appropriations - Enacted appropriations from 1996-2024 available for fine-tuning learning models
- Regulations - Collection of federal regulations on the use of appropriated funds
- SF-133 - The Report on Budget Execution and Budgetary Resources
- Balances - U.S. federal agency Account Balances (File A) submitted as part of the DATA Act 2014.
- Outlays - The actual disbursements of funds by the U.S. federal government from 1962 to 2025
- Circular A11 - Guidance from OMB on the preparation, submission, and execution of the federal budget
- Fastbook - Treasury guidance on federal ledger accounts
- Title 31 CFR - Money & Finance
- Redbook - The Principles of Appropriations Law (Volumes I & II).
- US Standard General Ledger - Account Definitions
- Treasury Appropriation Fund Symbols (TAFSs) Dataset - Collection of TAFSs used by federal agencies
Sentence Transformers
python -m pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
from sentence_transformers.quantization import quantize_embeddings
# 1. Specify preferred dimensions
dimensions = 512
# 2. Load the model, truncating embeddings to the chosen dimension (MRL)
model = SentenceTransformer("leeroy-jankins/bobo-embed-large-v1", truncate_dim=dimensions)
# The prompt used for query retrieval tasks:
# query_prompt = 'Represent this sentence for searching relevant passages: '
query = "A man is eating a piece of bread"
docs = [
"A man is eating food.",
"A man is eating pasta.",
"The girl is carrying a baby.",
"A man is riding a horse.",
]
# 3. Encode
query_embedding = model.encode(query, prompt_name="query")
# Equivalent Alternatives:
# query_embedding = model.encode(query_prompt + query)
# query_embedding = model.encode(query, prompt=query_prompt)
docs_embeddings = model.encode(docs)
# Optional: Quantize the embeddings
binary_query_embedding = quantize_embeddings(query_embedding, precision="ubinary")
binary_docs_embeddings = quantize_embeddings(docs_embeddings, precision="ubinary")
similarities = cos_sim(query_embedding, docs_embeddings)
print('similarities:', similarities)
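To index these embeddings in one of the backends listed above, here is a minimal FAISS sketch (assumes faiss-cpu is installed via pip; the flat inner-product index and k=2 are illustrative choices, not requirements):
import faiss

# With L2-normalized float32 vectors, inner product equals cosine similarity
dim = docs_embeddings.shape[1]
index = faiss.IndexFlatIP(dim)
faiss.normalize_L2(docs_embeddings)  # normalizes in place
index.add(docs_embeddings)

# Normalize the query the same way, then retrieve the top-2 documents
query_vec = query_embedding.reshape(1, -1)
faiss.normalize_L2(query_vec)
scores, ids = index.search(query_vec, k=2)
print('top docs:', ids, 'scores:', scores)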
Transformers
from typing import Dict
import torch
import numpy as np
from transformers import AutoModel, AutoTokenizer
from sentence_transformers.util import cos_sim
# For retrieval you need to add this prompt to queries (see the blog post for details).
def transform_query(query: str) -> str:
    """For retrieval, add the prompt to the query (not to documents)."""
    return f'Represent this sentence for searching relevant passages: {query}'

# The model works well with CLS pooling (default) but also with mean pooling.
def pooling(outputs: torch.Tensor, inputs: Dict, strategy: str = 'cls') -> np.ndarray:
    if strategy == 'cls':
        # Take the [CLS] token representation
        outputs = outputs[:, 0]
    elif strategy == 'mean':
        # Average token states, masking out padding
        outputs = torch.sum(
            outputs * inputs["attention_mask"][:, :, None], dim=1
        ) / torch.sum(inputs["attention_mask"], dim=1, keepdim=True)
    else:
        raise NotImplementedError
    return outputs.detach().cpu().numpy()
# 1. Load model and tokenizer
model_id = 'leeroy-jankins/bobo-embed-large-v1'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).cuda()  # assumes a CUDA-capable GPU
docs = [
transform_query('A man is eating a piece of bread'),
"A man is eating food.",
"A man is eating pasta.",
"The girl is carrying a baby.",
"A man is riding a horse.",
]
# 2. encode
inputs = tokenizer(docs, padding=True, truncation=True, return_tensors='pt')
for k, v in inputs.items():
    inputs[k] = v.cuda()
outputs = model(**inputs).last_hidden_state
embeddings = pooling(outputs, inputs, 'cls')
similarities = cos_sim(embeddings[0], embeddings[1:])
print('similarities:', similarities)
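Switching the snippet above to mean pooling is a one-argument change, and you can L2-normalize the result for cosine or inner-product indexes (the NumPy normalization below is an illustrative sketch, not part of the model API):
# Mean pooling variant of the same forward pass
embeddings_mean = pooling(outputs, inputs, 'mean')

# Optional: L2-normalize so that a plain dot product equals cosine similarity
embeddings_mean = embeddings_mean / np.linalg.norm(embeddings_mean, axis=1, keepdims=True)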
Evaluation
As of March 2024, the model achieves SOTA performance for BERT-large-sized models on the MTEB benchmark. It outperforms commercial models like OpenAI's text-embedding-3-large and matches the performance of models 20x its size, such as echo-mistral-7b. The model was trained with no overlap with the MTEB data, which indicates that it generalizes well across domains, tasks, and text lengths. We know there are some limitations with this model, which will be fixed in v2.
| Model | Avg (56 datasets) | Classification (12 datasets) | Clustering (11 datasets) | PairClassification (3 datasets) | Reranking (4 datasets) | Retrieval (15 datasets) | STS (10 datasets) | Summarization (1 dataset) |
|---|---|---|---|---|---|---|---|---|
| bobo-embed-large-v1 | 64.68 | 75.64 | 46.71 | 87.2 | 60.11 | 54.39 | 85.00 | 32.71 |
| bge-large-en-v1.5 | 64.23 | 75.97 | 46.08 | 87.12 | 60.03 | 54.29 | 83.11 | 31.61 |
| bobo-embed-2d-large-v1 | 63.25 | 74.14 | 46.07 | 85.89 | 58.94 | 51.42 | 84.9 | 31.55 |
| nomic-embed-text-v1 | 62.39 | 74.12 | 43.91 | 85.15 | 55.69 | 52.81 | 82.06 | 30.08 |
| jina-embeddings-v2-base-en | 60.38 | 73.45 | 41.73 | 85.38 | 56.98 | 47.87 | 80.7 | 31.6 |
| *Proprietary Models* | | | | | | | | |
| OpenAI text-embedding-3-large | 64.58 | 75.45 | 49.01 | 85.72 | 59.16 | 55.44 | 81.73 | 29.92 |
| Cohere embed-english-v3.0 | 64.47 | 76.49 | 47.43 | 85.84 | 58.01 | 55.00 | 82.62 | 30.18 |
| OpenAI text-embedding-ada-002 | 60.99 | 70.93 | 45.90 | 84.89 | 56.32 | 49.25 | 80.97 | 30.80 |
๐ป Matryoshka and Binary Quantization
Embeddings in their commonly used form (float arrays) have a high memory footprint when used at scale. Two approaches to solve this problem are Matryoshka Representation Learning (MRL) and (Binary) Quantization. While MRL reduces the number of dimensions of an embedding, binary quantization transforms the value of each dimension from a float32 into a lower precision (int8 or even binary). The model supports both approaches!
You can also take it one step further, and combine both MRL and quantization. This combination of binary quantization and MRL allows you to reduce the memory usage of your embeddings significantly. This leads to much lower costs when using a vector database in particular.
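A minimal sketch of the combination, reusing the model id and the 512-dimension truncation from the Quickstart (the sample sentences and memory arithmetic are illustrative):
from sentence_transformers import SentenceTransformer
from sentence_transformers.quantization import quantize_embeddings
import numpy as np

# MRL: keep only the first 512 dimensions at load time
model = SentenceTransformer("leeroy-jankins/bobo-embed-large-v1", truncate_dim=512)
emb = model.encode(["A man is eating food.", "A man is riding a horse."])  # float32, shape (2, 512)

# Binary quantization: one sign bit per dimension, packed into bytes (512 bits -> 64 bytes/vector)
binary = quantize_embeddings(emb, precision="ubinary")  # uint8, shape (2, 64)
print(emb.nbytes, '->', binary.nbytes)  # 4096 -> 128 bytes in total, a 32x reduction

# Compare packed vectors with Hamming distance (XOR + popcount)
hamming = int(np.unpackbits(np.bitwise_xor(binary[0], binary[1])).sum())
print('hamming distance:', hamming)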
License
- Bobo is published under the MIT License