Bobo: Embedding Model (derived from mixedbread-ai/mxbai-embed-large-v1)

Bobo is a text-embedding model derived from mixedbread-ai/mxbai-embed-large-v1, packaged for drop-in use in semantic search, Retrieval-Augmented Generation (RAG), clustering, deduplication, and zero-shot classification. It produces dense vector representations suitable for approximate-nearest-neighbor (ANN) indexes (FAISS, ScaNN, Milvus, pgvector) and common retrieval stacks.

Key Features

  • Derived from the proven mixedbread-ai/mxbai-embed-large-v1 embedding family.
  • Strong performance on retrieval-style tasks (semantic search, RAG, clustering).
  • Sentence-Transformers-compatible API for fast adoption.
  • CLS or mean pooling with optional L2 normalization for cosine-similarity workflows.
  • Production-friendly guidance on chunking, batching, and indexing.
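
On chunking specifically, a token-window splitter is usually enough. Below is a minimal sketch, assuming the leeroy-jankins/bobo-embed-large-v1 repo id from the Quickstart; the 512-token window and 64-token overlap are illustrative defaults, not model requirements.

from transformers import AutoTokenizer

# Assumption: repo id taken from the Quickstart section below.
tokenizer = AutoTokenizer.from_pretrained("leeroy-jankins/bobo-embed-large-v1")

def chunk_text(text: str, max_tokens: int = 512, overlap: int = 64) -> list:
    """Split text into overlapping token windows that fit the encoder."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    step = max_tokens - overlap
    chunks = []
    for start in range(0, max(len(ids), 1), step):
        window = ids[start:start + max_tokens]
        chunks.append(tokenizer.decode(window))
        if start + max_tokens >= len(ids):
            break
    return chunks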

Technical Specifications

  • Base model: mixedbread-ai/mxbai-embed-large-v1 (encoder-style transformer)
  • Architecture: Transformer encoder, BERT family (Sentence-Transformers compatible)
  • Parameters: ~0.3B
  • Embedding dimension: query programmatically at runtime; common builds use 1024
  • Tokenization: provided by the upstream model's tokenizer
  • Max input length: depends on the upstream config; chunk long documents (e.g., 256-512 tokens)
  • Pooling: CLS (upstream default) or mean pooling, then optional L2 normalization
  • Output: dense float vectors (normalize when using cosine similarity)
  • Intended backends: FAISS, Milvus, pgvector, Qdrant, Chroma, Weaviate

Tip: Always detect the dimension in code (see Quickstart) and configure your index accordingly.
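
For example, a minimal sketch of detecting the dimension at runtime (repo id as in the Quickstart below):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("leeroy-jankins/bobo-embed-large-v1")
dim = model.get_sentence_embedding_dimension()  # e.g., 1024 on common builds
print("embedding dimension:", dim)  # size your ANN index with this value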

🎯 Quickstart

  • Here, we provide several ways to produce sentence embeddings. Note that if you use the model for retrieval, each query must be prefixed with the prompt Represent this sentence for searching relevant passages: ; no prompt is needed for documents or other tasks.

⚙️ Vectorized Datasets

Vectorization is the process of converting textual data into numerical vectors, usually applied after the text has been cleaned. It can improve execution speed and reduce the training time of your code. BudgetPy provides the following vector stores on the OpenAI platform to support environmental data analysis with machine learning (a minimal indexing sketch follows the list):

  • Appropriations - Enacted appropriations from 1996 to 2024, available for fine-tuning machine-learning models
  • Regulations - Collection of federal regulations on the use of appropriated funds
  • SF-133 - The Report on Budget Execution and Budgetary Resources
  • Balances - U.S. federal agency Account Balances (File A) submitted as part of the DATA Act 2014.
  • Outlays - The actual disbursements of funds by the U.S. federal government from 1962 to 2025
  • Circular A11 - Guidance from OMB on the preparation, submission, and execution of the federal budget
  • Fastbook - Treasury guidance on federal ledger accounts
  • Title 31 CFR - Money & Finance
  • Redbook - The Principles of Appropriations Law (Volumes I & II).
  • US Standard General Ledger - Account Definitions
  • Treasury Appropriation Fund Symbols (TAFSs) Dataset - Collection of TAFSs used by federal agencies
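
As referenced above, here is a minimal sketch of embedding one of these datasets and loading it into a FAISS index. The file name appropriations.txt and the one-record-per-line layout are illustrative assumptions, not the actual distribution format:

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("leeroy-jankins/bobo-embed-large-v1")

# Hypothetical export: one appropriations record per line.
with open("appropriations.txt", encoding="utf-8") as f:
    records = [line.strip() for line in f if line.strip()]

# Normalize so that inner product equals cosine similarity.
vectors = model.encode(records, batch_size=64, normalize_embeddings=True)
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(np.asarray(vectors, dtype="float32"))

# Queries take the retrieval prompt described in the Quickstart.
query = model.encode(
    "Represent this sentence for searching relevant passages: enacted defense appropriations",
    normalize_embeddings=True,
)
scores, ids = index.search(np.asarray([query], dtype="float32"), 5)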

๐Ÿ—๏ธ Sentence Transformers

python -m pip install -U sentence-transformers

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
from sentence_transformers.quantization import quantize_embeddings

# 1. Specify preferred dimensions
dimensions = 512

# 2. Load the model
model = SentenceTransformer("leeroy-jankins/bobo-embed-large-v1", truncate_dim=dimensions)

# The prompt used for query retrieval tasks:
# query_prompt = 'Represent this sentence for searching relevant passages: '

query = "A man is eating a piece of bread"
docs = [
    "A man is eating food.",
    "A man is eating pasta.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
]

# 3. Encode
query_embedding = model.encode(query, prompt_name="query")
# Equivalent Alternatives:
# query_embedding = model.encode(query_prompt + query)
# query_embedding = model.encode(query, prompt=query_prompt)

docs_embeddings = model.encode(docs)

# Optional: Quantize the embeddings
binary_query_embedding = quantize_embeddings(query_embedding, precision="ubinary")
binary_docs_embeddings = quantize_embeddings(docs_embeddings, precision="ubinary")

similarities = cos_sim(query_embedding, docs_embeddings)
print('similarities:', similarities)
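
A note on comparing the quantized embeddings: ubinary vectors are bit-packed uint8 arrays, so cosine similarity no longer applies directly; ranking by Hamming distance is the usual substitute. A minimal sketch using only NumPy, reusing the variables above:

import numpy as np

# quantize_embeddings(..., precision="ubinary") packs 8 dimensions per byte.
query_bits = np.unpackbits(binary_query_embedding, axis=-1)
doc_bits = np.unpackbits(binary_docs_embeddings, axis=-1)

# Hamming distance: the number of differing bits (lower = more similar).
hamming = (query_bits ^ doc_bits).sum(axis=-1)
print("hamming distances:", hamming)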

🧠 Transformers

from typing import Dict

import torch
import numpy as np
from transformers import AutoModel, AutoTokenizer
from sentence_transformers.util import cos_sim

# For retrieval you need to pass this prompt. Please find out more in the upstream blog post.
def transform_query(query: str) -> str:
    """ For retrieval, add the prompt for query (not for documents).
    """
    return f'Represent this sentence for searching relevant passages: {query}'

# The model works really well with cls pooling (default) but also with mean pooling.
def pooling(outputs: torch.Tensor, inputs: Dict, strategy: str = 'cls') -> np.ndarray:
    if strategy == 'cls':
        outputs = outputs[:, 0]
    elif strategy == 'mean':
        outputs = torch.sum(
            outputs * inputs["attention_mask"][:, :, None], dim=1
        ) / torch.sum(inputs["attention_mask"], dim=1, keepdim=True)
    else:
        raise NotImplementedError
    return outputs.detach().cpu().numpy()

# 1. Load the model
model_id = 'leeroy-jankins/bobo-embed-large-v1'
tokenizer = AutoTokenizer.from_pretrained(model_id)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = AutoModel.from_pretrained(model_id).to(device)


docs = [
    transform_query('A man is eating a piece of bread'),
    "A man is eating food.",
    "A man is eating pasta.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
]

# 2. encode
inputs = tokenizer(docs, padding=True, truncation=True, return_tensors='pt')
inputs = {k: v.to(device) for k, v in inputs.items()}
outputs = model(**inputs).last_hidden_state
embeddings = pooling(outputs, inputs, 'cls')

similarities = cos_sim(embeddings[0], embeddings[1:])
print('similarities:', similarities)

🏁 Evaluation

As of March 2024, our model achieves SOTA performance for BERT-large-sized models on the MTEB. It outperforms commercial models such as OpenAI's text-embedding-3-large and matches the performance of models 20x its size, such as echo-mistral-7b. Our model was trained with no overlap with the MTEB data, which indicates that it generalizes well across several domains, tasks, and text lengths. We know there are some limitations with this model, which will be fixed in v2.

| Model | Avg (56 datasets) | Classification (12 datasets) | Clustering (11 datasets) | PairClassification (3 datasets) | Reranking (4 datasets) | Retrieval (15 datasets) | STS (10 datasets) | Summarization (1 dataset) |
|---|---|---|---|---|---|---|---|---|
| bobo-embed-large-v1 | 64.68 | 75.64 | 46.71 | 87.2 | 60.11 | 54.39 | 85.00 | 32.71 |
| bge-large-en-v1.5 | 64.23 | 75.97 | 46.08 | 87.12 | 60.03 | 54.29 | 83.11 | 31.61 |
| bobo-embed-2d-large-v1 | 63.25 | 74.14 | 46.07 | 85.89 | 58.94 | 51.42 | 84.9 | 31.55 |
| nomic-embed-text-v1 | 62.39 | 74.12 | 43.91 | 85.15 | 55.69 | 52.81 | 82.06 | 30.08 |
| jina-embeddings-v2-base-en | 60.38 | 73.45 | 41.73 | 85.38 | 56.98 | 47.87 | 80.7 | 31.6 |
| Proprietary models | | | | | | | | |
| OpenAI text-embedding-3-large | 64.58 | 75.45 | 49.01 | 85.72 | 59.16 | 55.44 | 81.73 | 29.92 |
| Cohere embed-english-v3.0 | 64.47 | 76.49 | 47.43 | 85.84 | 58.01 | 55.00 | 82.62 | 30.18 |
| OpenAI text-embedding-ada-002 | 60.99 | 70.93 | 45.90 | 84.89 | 56.32 | 49.25 | 80.97 | 30.80 |

💻 Matryoshka and Binary Quantization

Embeddings in their commonly used form (float arrays) have a high memory footprint when used at scale. Two approaches to solve this problem are Matryoshka Representation Learning (MRL) and (Binary) Quantization. While MRL reduces the number of dimensions of an embedding, binary quantization transforms the value of each dimension from a float32 into a lower precision (int8 or even binary). The model supports both approaches!

You can also take it one step further and combine MRL with quantization. Together they reduce the memory usage of your embeddings significantly, which in particular leads to much lower costs when using a vector database (see the sketch below).
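
A minimal sketch of combining the two, reusing the Sentence-Transformers setup from the Quickstart: truncate to a 512-dimension Matryoshka prefix at load time, then binary-quantize the result. If the full dimension is 1024, memory per vector drops from 4096 bytes (1024 float32 values) to 64 bytes.

from sentence_transformers import SentenceTransformer
from sentence_transformers.quantization import quantize_embeddings

# MRL: keep only the first 512 dimensions of each embedding.
model = SentenceTransformer("leeroy-jankins/bobo-embed-large-v1", truncate_dim=512)

embeddings = model.encode(["A man is eating food.", "A man is riding a horse."])

# Binary quantization: 1 bit per remaining dimension, packed into uint8.
binary = quantize_embeddings(embeddings, precision="ubinary")
print(binary.shape)  # (2, 64) -> 512 bits = 64 bytes per vector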

๐Ÿ“License
