
🧠 Boo — Phi-4-Mini-Instruct (Q4_K_M, GGUF)

Boo is a compact, instruction-tuned LLM derived from Phi-4-Mini-Instruct and packaged in GGUF Q4_K_M quantization for fast local inference. It targets concise instruction following, lightweight reasoning, summarization, and light code synthesis, making it well suited to CLI assistants, edge deployments, and RAG agents where latency and footprint matter.


🧰 Key Features

  • Phi-4-Mini base trained on filtered, high-quality data.
  • Instruction SFT for reasoning, summarization, and prompt following; aligned chat behavior.
  • GGUF Q4_K_M (4-bit grouped) for performant local inference on CPU/GPU-constrained hardware.
  • Cold-start ready and compatible with llama.cpp, LM Studio, Ollama, and other GGUF loaders.

📝 Technical Specifications

| Property | Value |
|---|---|
| Base model | Phi-4-Mini-Instruct |
| Architecture | Transformer, decoder-only (phi3 family) |
| Parameters | ~4B |
| Quantization | GGUF Q4_K_M (4-bit grouped, medium precision) |
| Tokenizer | Phi BPE (~32k vocabulary) |
| Fine-tuning method | Supervised fine-tuning (~20k examples) |
| Training style | Single-turn instructions, few-shot QA, summarization |
| Context window | 2,048 tokens (default) |
| Compatible runtimes | llama.cpp, LM Studio, GGUF loaders, Ollama (via conversion) |

⚡ Files

| File | Description |
|---|---|
| Boo.Q4_K_M.gguf | Quantized model weights |
| tokenizer.model | Phi BPE tokenizer |
| config.json | Optional runtime config |
| README.md | This model card |

⚙️ Vectorized Datasets

Vectorization converts text into numerical vectors and is typically applied after the text has been cleaned; it speeds up retrieval and reduces training time. BudgetPy provides the following vector stores on the OpenAI platform to support environmental data analysis with machine learning (a minimal vectorization sketch follows the list).

  • Appropriations - Enacted appropriations from 1996-2024 available for fine-tuning learning models
  • Regulations - Collection of federal regulations on the use of appropriated funds
  • SF-133 - The Report on Budget Execution and Budgetary Resources
  • Balances - U.S. federal agency Account Balances (File A) submitted as part of the DATA Act 2014.
  • Outlays - The actual disbursements of funds by the U.S. federal government from 1962 to 2025
  • Circular A11 - Guidance from OMB on the preparation, submission, and execution of the federal budget
  • Fastbook - Treasury guidance on federal ledger accounts
  • Title 31 CFR - Money & Finance
  • Redbook - The Principles of Appropriations Law (Volumes I & II).
  • US Standard General Ledger - Account Definitions
  • Treasury Appropriation Fund Symbols (TAFSs) Dataset - Collection of TAFSs used by federal agencies
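
As a minimal sketch of the vectorization step itself (the records below and the embedding model choice are illustrative assumptions, mirroring the RAG example later in this card):

from sentence_transformers import SentenceTransformer
import numpy as np

# Hypothetical cleaned records drawn from one of the datasets listed above
records = [
    "FY2024 enacted appropriation for salaries and expenses.",
    "SF-133 line: unobligated balance brought forward, end of year.",
]

# Any sentence-embedding model works; this one matches the RAG example below
embedder = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

# Encode and L2-normalize so cosine similarity reduces to a dot product
vectors = embedder.encode(records, normalize_embeddings=True).astype(np.float32)
print(vectors.shape)  # (num_records, embedding_dim)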

🎯 Quickstart (Local Inference)

llama.cpp

./main -m Boo.Q4_K_M.gguf \
  -p "Explain reinforcement learning like I'm 12." \
  -n 256
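
Note: newer llama.cpp builds ship the CLI as llama-cli instead of main; if that is what your build provides, the equivalent command (same flags assumed) is:

./llama-cli -m Boo.Q4_K_M.gguf \
  -p "Explain reinforcement learning like I'm 12." \
  -n 256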

🧪 LM Studio

1) Import Boo.Q4_K_M.gguf.
2) Choose a simple prompt and start with modest max tokens and thread counts.
3) Increase settings as latency allows.

Tip: Boo is designed for low-resource setups. If you use RAG, chunk long documents and keep the prompt compact to stay within the 2k context.
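
Ollama

Ollama can also import the GGUF directly via a Modelfile. A minimal sketch (the model name boo and the num_ctx value are assumptions):

# Modelfile
FROM ./Boo.Q4_K_M.gguf
PARAMETER num_ctx 2048

Then build and run it:

ollama create boo -f Modelfile
ollama run boo "Explain reinforcement learning like I'm 12."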

🧠 RAG with the Boo LLM (Phi-4-Mini-Instruct, Q4_K_M, GGUF)

This end‑to‑end example shows how to build a tiny Retrieval‑Augmented Generation (RAG) pipeline using Boo for generation (via llama-cpp-python) and an embedding model (e.g., the “Bobo” embedding derived from mixedbread-ai/mxbai-embed-large-v1) with FAISS for similarity search.


📦 1) Install Dependencies

pip install llama-cpp-python sentence-transformers faiss-cpu numpy

🧱 2) Minimal Data & Ingestion

import os
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

# --- Configuration ---
# Path to your quantized Boo model file (GGUF)
BOO_MODEL_PATH = "Boo.Q4_K_M.gguf"

# Choose an embedding model (here: mixedbread large; you can substitute your own)
EMBED_MODEL_ID = "mixedbread-ai/mxbai-embed-large-v1"

# A tiny toy "corpus" for demo purposes (normally you'd load real documents and chunk them)
DOCUMENTS = [
    {"id": "doc1", "text": "Retrieval-Augmented Generation (RAG) combines document retrieval with a generator LLM."},
    {"id": "doc2", "text": "FAISS enables efficient vector similarity search using approximate or exact indexes."},
    {"id": "doc3", "text": "Cosine similarity is often used with L2-normalized embeddings to measure semantic closeness."},
    {"id": "doc4", "text": "Chunking long documents into smaller passages improves retrieval granularity and accuracy."},
    {"id": "doc5", "text": "Boo is a lightweight LLM packaged as GGUF, suitable for local inference via llama.cpp."},
]

# --- Embedder ---
embedder = SentenceTransformer(EMBED_MODEL_ID)

# Encode and L2-normalize for cosine via inner product
def encode_texts(texts):
    emb = embedder.encode(texts, normalize_embeddings=True)
    return emb.astype(np.float32)

# Create the matrix of document embeddings
corpus_texts = [d["text"] for d in DOCUMENTS]
corpus_vecs = encode_texts(corpus_texts)
dim = corpus_vecs.shape[1]

# --- Build FAISS index (inner product works like cosine when vectors are normalized) ---
index = faiss.IndexFlatIP(dim)
index.add(corpus_vecs)

# Keep ID mapping for retrieved results
id_map = np.arange(len(DOCUMENTS))
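
A quick optional check that ingestion worked as expected (uses only the variables defined above):

# One vector per document should now be in the index
print(f"Indexed {index.ntotal} documents with embedding dimension {dim}")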

🔎 3) Retrieval Function

def retrieve(query, k=3):
    q_vec = encode_texts([query])  # already normalized
    scores, idx = index.search(q_vec, k)
    results = []
    for rank, (sc, ii) in enumerate(zip(scores[0], idx[0])):
        doc = DOCUMENTS[id_map[ii]]
        results.append({"rank": rank + 1, "score": float(sc), "id": doc["id"], "text": doc["text"]})
    return results
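
For example, a quick standalone call (exact scores and ordering will depend on the embedding model):

for hit in retrieve("Which library provides vector similarity search?", k=2):
    print(hit["rank"], round(hit["score"], 3), hit["id"])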

🦙 4) Generation with Boo (llama-cpp-python)

from llama_cpp import Llama

# Initialize Boo
# Adjust n_ctx (context) and n_threads to your environment
llm = Llama(
    model_path=BOO_MODEL_PATH,
    n_ctx=2048,
    n_threads=8
)

def build_prompt(query, context_chunks):
    # Number the context passages so the model can cite them by index
    ctx_lines = "\n".join([f"[{i + 1}] {c['text']}" for i, c in enumerate(context_chunks)])
    prompt = f"""You are a concise, factual assistant. Use only the provided context to answer the question. If the answer cannot be found in the context, say "I don't know."

Context:
{ctx_lines}

Question: {query}

Answer (concise, with references to the numbered passages if applicable):"""
    return prompt.strip()

def generate_with_boo(prompt, max_tokens=256, temperature=0.6, top_p=0.9):
    out = llm(
        prompt,
        max_tokens=max_tokens,
        temperature=temperature,
        top_p=top_p
    )
    return out["choices"][0]["text"]

🧪 5) End‑to‑End RAG Query

user_query = "How does RAG improve factuality, and which library helps with vector search?"
top_k = 3

# 1) Retrieve
retrieved = retrieve(user_query, k=top_k)

# 2) Build prompt
prompt = build_prompt(user_query, retrieved)

# 3) Generate with Boo
answer = generate_with_boo(prompt)

print("---- Retrieved Chunks ----")
for r in retrieved:
    print(f"[{r['rank']}] (score={r['score']:.3f}) {r['text']}")

print("\n---- Boo Answer ----")
print(answer)

🧰 6) Practical Tips

• Chunking: For real docs, split into ~300–600 characters (or ~128–256 tokens) with 10–20% overlap (a minimal splitter sketch follows this list).
• Normalization: L2-normalize embeddings when using cosine/IP search.
• Metadata: Store doc IDs, titles, and citations so Boo can reference sources.
• Guardrails: If retrieval comes back empty or low‑score, have Boo say “I don’t know.”
• Prompt Budget: Keep the context short and relevant—Boo’s default context is ~2k tokens.
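
A minimal character-based splitter matching the chunking tip above (the sizes and overlap are just the suggested ranges; adapt to token counts if you prefer):

def chunk_text(text, chunk_size=500, overlap=75):
    """Split text into overlapping character windows (~500 chars, ~15% overlap)."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap
    return chunks

# 'long_document_text' is a placeholder for a real document string:
# passages = chunk_text(long_document_text)
# Each passage can then be embedded and added to the FAISS index as in step 2.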


🔒 Prompt Engineering Tips

  • Keep prompts concise to fit Boo’s 2k token window.
  • Use role-style instructions for better structure:
    You are a concise, factual assistant. 
    Always explain reasoning briefly and avoid unnecessary detail.
    
  • For step-by-step outputs, explicitly request them:
    List the steps to make sourdough bread.
    

📊 Prompt Engineering Library

  • Guro is a prompt library designed to supercharge AI agents and assistants with task-specific personas (i.e., total randos).
  • From academic writing to financial analysis, technical support, SEO, and beyond, Guro provides precision-crafted prompt templates ready to drop into your LLM workflows.

🕒 Evaluation (indicative)

Boo shows improvements over the base Phi-4-Mini on common instruction tasks in small-context, quantized settings:

| Task | Boo (Q4_K_M) | Base (Phi-4-Mini) |
|---|---|---|
| GSM8K (accuracy) | 52.1% | 44.8% |
| NaturalQuestions (EM / F1) | 47.8 / 60.2 | 41.6 / 53.3 |
| CNN/DailyMail (ROUGE-L) | 38.4 | 33.9 |
| HumanEval (pass@1, basic prompts) | 6.3% | 4.1% |

Scores are approximate, reflect instruction-tuned, quantized inference, and are not directly comparable to full-precision or long-context runs.

🧩 Intended Use

  • Lightweight instruction following, reasoning, summarization, and light code generation.
  • Edge or desktop assistants, CLI tools, and RAG agents where low latency and small footprint are key.

⚡ Limitations

  • Context: 2k tokens; use chunking or RAG for long documents.
  • Quantization trade-offs: Q4_K_M sacrifices some precision for speed; complex coding or multi-hop reasoning may degrade versus higher-precision builds.
  • As with any LLM, the model can hallucinate; add validation and guardrails.

⚙️ Training Details (summary)

  • Base: Phi-4-Mini-Instruct
  • Method: SFT on about 20k instruction examples (single-turn chat, few-shot QA, summarization).
  • Packaging: GGUF Q4_K_M quantization for local runtimes (llama.cpp, LM Studio, etc.).

💻 Prompting

No special chat template is required. Use clear instructions and keep prompts concise. For multi-turn workflows, persist conversation state externally or via your app’s memory or RAG layer.

Example system style

You are a concise, accurate assistant. Prefer step-by-step reasoning only when needed.
Cite assumptions and ask for missing constraints.
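
For multi-turn workflows, one minimal pattern is to keep prior turns in application state and re-render them into each prompt. A sketch, assuming llama-cpp-python as in the RAG example (the transcript format is an assumption, not a built-in chat template):

history = []  # list of (user, assistant) turns kept by your application

def chat_turn(llm, user_msg, max_tokens=256):
    # Re-render the last few turns into a compact transcript to stay within the ~2k context
    transcript = "\n".join(f"User: {u}\nAssistant: {a}" for u, a in history[-4:])
    prompt = (
        "You are a concise, accurate assistant.\n"
        f"{transcript}\nUser: {user_msg}\nAssistant:"
    )
    reply = llm(prompt, max_tokens=max_tokens, stop=["User:"])["choices"][0]["text"].strip()
    history.append((user_msg, reply))
    return reply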

🧩 Acknowledgements

  • Base model: Phi-4-Mini-Instruct
  • Quantization and local runtimes: GGUF ecosystem (for example, llama.cpp, LM Studio, Ollama loaders)

🏁 Changelog

  • v1.0 (Q4_K_M, GGUF) — Initial release with instruction SFT; compatibility with llama.cpp and LM Studio; evaluation on GSM8K, NaturalQuestions, CNN/DailyMail, and HumanEval.

📝 License

This model is a fine-tuned, quantized derivative of Phi-4-Mini-Instruct. You are responsible for ensuring your use complies with the parent model’s license and any dataset terms. For commercial deployment, review upstream licensing and your organization’s compliance requirements.
