
🧠 Boo — Phi-4-Mini-Instruct (Q4_K_M, GGUF)

Boo is a compact, instruction-tuned LLM derived from Phi-4-Mini-Instruct and packaged in GGUF Q4_K_M quantization for fast local inference. It targets concise instruction following, lightweight reasoning, summarization, and light code synthesis, making it well suited to CLI assistants, edge deployments, and RAG agents where latency and footprint matter.


🧰 Key Features

  • Phi-4-Mini base trained on filtered, high-quality data.
  • Instruction SFT for reasoning, summarization, and prompt following; aligned chat behavior.
  • GGUF Q4_K_M (4-bit grouped) for performant local inference on CPU/GPU-constrained hardware.
  • Cold-start ready and compatible with llama.cpp, LM Studio, Ollama, and other GGUF loaders.

📝 Technical Specifications

| Property | Value |
|---|---|
| Base model | Phi-4-Mini-Instruct |
| Architecture | Transformer, decoder-only (phi3 family) |
| Parameters | ~4B |
| Quantization | GGUF Q4_K_M (4-bit grouped, medium precision) |
| Tokenizer | Phi BPE (~32k vocabulary) |
| Fine-tuning method | Supervised fine-tuning (~20k examples) |
| Training style | Single-turn instructions, few-shot QA, summarization |
| Context window | 2,048 tokens (default) |
| Compatible runtimes | llama.cpp, LM Studio, GGUF loaders, Ollama (via conversion) |

⚡ Files

| File | Description |
|---|---|
| Boo.Q4_K_M.gguf | Quantized model weights |
| tokenizer.model | Phi BPE tokenizer |
| config.json | Optional runtime config |
| README.md | This model card |

⚙️ Vectorized Datasets

Vectorization converts text into numerical vectors and is typically applied after the text has been cleaned; it speeds up retrieval and reduces training time. BudgetPy provides the following vector stores on the OpenAI platform to support environmental data analysis with machine learning (a minimal vectorization sketch follows the list).

  • Appropriations - Enacted appropriations from 1996-2024 available for fine-tuning learning models
  • Regulations - Collection of federal regulations on the use of appropriated funds
  • SF-133 - The Report on Budget Execution and Budgetary Resources
  • Balances - U.S. federal agency Account Balances (File A) submitted as part of the DATA Act 2014.
  • Outlays - The actual disbursements of funds by the U.S. federal government from 1962 to 2025
  • Circular A11 - Guidance from OMB on the preparation, submission, and execution of the federal budget
  • Fastbook - Treasury guidance on federal ledger accounts
  • Title 31 CFR - Money & Finance
  • Redbook - The Principles of Appropriations Law (Volumes I & II).
  • US Standard General Ledger - Account Definitions
  • Treasury Appropriation Fund Symbols (TAFSs) Dataset - Collection of TAFSs used by federal agencies
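
As a minimal sketch of the vectorization step itself (the records below and the embedding model choice are illustrative assumptions, mirroring the RAG example later in this card):

from sentence_transformers import SentenceTransformer
import numpy as np

# Hypothetical cleaned records drawn from one of the datasets listed above
records = [
    "FY2024 enacted appropriation for salaries and expenses.",
    "SF-133 line: unobligated balance brought forward, end of year.",
]

# Any sentence-embedding model works; this one matches the RAG example below
embedder = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

# Encode and L2-normalize so cosine similarity reduces to a dot product
vectors = embedder.encode(records, normalize_embeddings=True).astype(np.float32)
print(vectors.shape)  # (num_records, embedding_dim)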

🎯 Quickstart (Local Inference)

llama.cpp

./main -m Boo.Q4_K_M.gguf \
  -p "Explain reinforcement learning like I'm 12." \
  -n 256
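
Note: newer llama.cpp builds ship the CLI as llama-cli instead of main; if that is what your build provides, the equivalent command (same flags assumed) is:

./llama-cli -m Boo.Q4_K_M.gguf \
  -p "Explain reinforcement learning like I'm 12." \
  -n 256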

🧪 LM Studio

1) Import Boo.Q4_K_M.gguf.
2) Choose a simple prompt and start with modest max tokens and thread counts.
3) Increase settings as latency allows.

Tip: Boo is designed for low-resource setups. If you use RAG, chunk long documents and keep the prompt compact to stay within the 2k context.
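
Ollama

Ollama can also import the GGUF directly via a Modelfile. A minimal sketch (the model name boo and the num_ctx value are assumptions):

# Modelfile
FROM ./Boo.Q4_K_M.gguf
PARAMETER num_ctx 2048

Then build and run it:

ollama create boo -f Modelfile
ollama run boo "Explain reinforcement learning like I'm 12."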

🧠 RAG with the Boo LLM (Phi-4-Mini-Instruct, Q4_K_M, GGUF)

This end‑to‑end example shows how to build a tiny Retrieval‑Augmented Generation (RAG) pipeline using Boo for generation (via llama-cpp-python) and an embedding model (e.g., the “Bobo” embedding derived from mixedbread-ai/mxbai-embed-large-v1) with FAISS for similarity search.


📦 1) Install Dependencies

pip install llama-cpp-python sentence-transformers faiss-cpu numpy

🧱 2) Minimal Data & Ingestion

import os
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

# --- Configuration ---
# Path to your quantized Boo model file (GGUF)
BOO_MODEL_PATH = "Boo.Q4_K_M.gguf"

# Choose an embedding model (here: mixedbread large; you can substitute your own)
EMBED_MODEL_ID = "mixedbread-ai/mxbai-embed-large-v1"

# A tiny toy "corpus" for demo purposes (normally you'd load real documents and chunk them)
DOCUMENTS = [
    {"id": "doc1", "text": "Retrieval-Augmented Generation (RAG) combines document retrieval with a generator LLM."},
    {"id": "doc2", "text": "FAISS enables efficient vector similarity search using approximate or exact indexes."},
    {"id": "doc3", "text": "Cosine similarity is often used with L2-normalized embeddings to measure semantic closeness."},
    {"id": "doc4", "text": "Chunking long documents into smaller passages improves retrieval granularity and accuracy."},
    {"id": "doc5", "text": "Boo is a lightweight LLM packaged as GGUF, suitable for local inference via llama.cpp."},
]

# --- Embedder ---
embedder = SentenceTransformer(EMBED_MODEL_ID)

# Encode and L2-normalize for cosine via inner product
def encode_texts(texts):
    emb = embedder.encode(texts, normalize_embeddings=True)
    return emb.astype(np.float32)

# Create the matrix of document embeddings
corpus_texts = [d["text"] for d in DOCUMENTS]
corpus_vecs = encode_texts(corpus_texts)
dim = corpus_vecs.shape[1]

# --- Build FAISS index (inner product works like cosine when vectors are normalized) ---
index = faiss.IndexFlatIP(dim)
index.add(corpus_vecs)

# Keep ID mapping for retrieved results
id_map = np.arange(len(DOCUMENTS))
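
A quick optional check that ingestion worked as expected (uses only the variables defined above):

# One vector per document should now be in the index
print(f"Indexed {index.ntotal} documents with embedding dimension {dim}")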

🔎 3) Retrieval Function

def retrieve(query, k=3):
    q_vec = encode_texts([query])  # already normalized
    scores, idx = index.search(q_vec, k)
    results = []
    for rank, (sc, ii) in enumerate(zip(scores[0], idx[0])):
        doc = DOCUMENTS[id_map[ii]]
        results.append({"rank": rank + 1, "score": float(sc), "id": doc["id"], "text": doc["text"]})
    return results
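
For example, a quick standalone call (exact scores and ordering will depend on the embedding model):

for hit in retrieve("Which library provides vector similarity search?", k=2):
    print(hit["rank"], round(hit["score"], 3), hit["id"])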

🦙 4) Generation with Boo (llama-cpp-python)

from llama_cpp import Llama

# Initialize Boo
# Adjust n_ctx (context) and n_threads to your environment
llm = Llama(
    model_path=BOO_MODEL_PATH,
    n_ctx=2048,
    n_threads=8
)

def build_prompt(query, context_chunks):
    # Number the context passages so the model can cite them by index
    ctx_lines = "\n".join([f"[{i + 1}] {c['text']}" for i, c in enumerate(context_chunks)])
    prompt = f"""You are a concise, factual assistant. Use only the provided context to answer the question. If the answer cannot be found in the context, say "I don't know."

Context:
{ctx_lines}

Question: {query}

Answer (concise, with references to the numbered passages if applicable):"""
    return prompt.strip()

def generate_with_boo(prompt, max_tokens=256, temperature=0.6, top_p=0.9):
    out = llm(
        prompt,
        max_tokens=max_tokens,
        temperature=temperature,
        top_p=top_p
    )
    return out["choices"][0]["text"]

🧪 5) End‑to‑End RAG Query

user_query = "How does RAG improve factuality, and which library helps with vector search?"
top_k = 3

# 1) Retrieve
retrieved = retrieve(user_query, k=top_k)

# 2) Build prompt
prompt = build_prompt(user_query, retrieved)

# 3) Generate with Boo
answer = generate_with_boo(prompt)

print("---- Retrieved Chunks ----")
for r in retrieved:
    print(f"[{r['rank']}] (score={r['score']:.3f}) {r['text']}")

print("\n---- Boo Answer ----")
print(answer)

🧰 6) Practical Tips

• Chunking: For real docs, split into ~300–600 characters (or ~128–256 tokens) with 10–20% overlap (a minimal splitter sketch follows this list).
• Normalization: L2-normalize embeddings when using cosine/IP search.
• Metadata: Store doc IDs, titles, and citations so Boo can reference sources.
• Guardrails: If retrieval comes back empty or low‑score, have Boo say “I don’t know.”
• Prompt Budget: Keep the context short and relevant—Boo’s default context is ~2k tokens.
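
A minimal character-based splitter matching the chunking tip above (the sizes and overlap are just the suggested ranges; adapt to token counts if you prefer):

def chunk_text(text, chunk_size=500, overlap=75):
    """Split text into overlapping character windows (~500 chars, ~15% overlap)."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap
    return chunks

# 'long_document_text' is a placeholder for a real document string:
# passages = chunk_text(long_document_text)
# Each passage can then be embedded and added to the FAISS index as in step 2.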


🔒 Prompt Engineering Tips

  • Keep prompts concise to fit Boo’s 2k token window.
  • Use role-style instructions for better structure:
    You are a concise, factual assistant. 
    Always explain reasoning briefly and avoid unnecessary detail.
    
  • For step-by-step outputs, explicitly request them:
    List the steps to make sourdough bread.
    

📊 Prompt Engineering Library

  • Guro is a prompt library designed to supercharge AI agents and assistants with task-specific personas (i.e., total randos).
  • From academic writing to financial analysis, technical support, SEO, and beyond, Guro provides precision-crafted prompt templates ready to drop into your LLM workflows.

🕒 Evaluation (indicative)

Boo shows improvements over the base Phi-4-Mini on common instruction tasks in small-context, quantized settings:

| Task | Boo (Q4_K_M) | Base (Phi-4-Mini) |
|---|---|---|
| GSM8K (accuracy) | 52.1% | 44.8% |
| NaturalQuestions (EM / F1) | 47.8 / 60.2 | 41.6 / 53.3 |
| CNN/DailyMail (ROUGE-L) | 38.4 | 33.9 |
| HumanEval (pass@1, basic prompts) | 6.3% | 4.1% |

Scores are approximate, reflect instruction-tuned, quantized inference, and are not directly comparable to full-precision or long-context runs.

🧩 Intended Use

  • Lightweight instruction following, reasoning, summarization, and light code generation.
  • Edge or desktop assistants, CLI tools, and RAG agents where low latency and small footprint are key.

⚡ Limitations

  • Context: 2k tokens; use chunking or RAG for long documents.
  • Quantization trade-offs: Q4_K_M sacrifices some precision for speed; complex coding or multi-hop reasoning may degrade versus higher-precision builds.
  • As with any LLM, the model can hallucinate; add validation and guardrails.

⚙️ Training Details (summary)

  • Base: Phi-4-Mini-Instruct
  • Method: SFT on about 20k instruction examples (single-turn chat, few-shot QA, summarization).
  • Packaging: GGUF Q4_K_M quantization for local runtimes (llama.cpp, LM Studio, etc.).

💻 Prompting

No special chat template is required. Use clear instructions and keep prompts concise. For multi-turn workflows, persist conversation state externally or via your app’s memory or RAG layer.

Example system style

You are a concise, accurate assistant. Prefer step-by-step reasoning only when needed.
Cite assumptions and ask for missing constraints.
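
For multi-turn workflows, one minimal pattern is to keep prior turns in application state and re-render them into each prompt. A sketch, assuming llama-cpp-python as in the RAG example (the transcript format is an assumption, not a built-in chat template):

history = []  # list of (user, assistant) turns kept by your application

def chat_turn(llm, user_msg, max_tokens=256):
    # Re-render the last few turns into a compact transcript to stay within the ~2k context
    transcript = "\n".join(f"User: {u}\nAssistant: {a}" for u, a in history[-4:])
    prompt = (
        "You are a concise, accurate assistant.\n"
        f"{transcript}\nUser: {user_msg}\nAssistant:"
    )
    reply = llm(prompt, max_tokens=max_tokens, stop=["User:"])["choices"][0]["text"].strip()
    history.append((user_msg, reply))
    return reply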

🧩 Acknowledgements

  • Base model: Phi-4-Mini-Instruct
  • Quantization and local runtimes: GGUF ecosystem (for example, llama.cpp, LM Studio, Ollama loaders)

🏁 Changelog

  • v1.0 (Q4_K_M, GGUF) — Initial release with instruction SFT; compatibility with llama.cpp and LM Studio; evaluation on GSM8K, NaturalQuestions, CNN/DailyMail, and HumanEval.

📝 License

This model is a fine-tuned, quantized derivative of Phi-4-Mini-Instruct. You are responsible for ensuring your use complies with the parent model’s license and any dataset terms. For commercial deployment, review upstream licensing and your organization’s compliance requirements.
