Overview
Bro is an LLM fine-tuned from the gemma-3-1b-it transformer model, optimized for enhanced contextual comprehension, instruction following, and domain-specific reasoning. The fine-tuning process used supervised instruction tuning across multiple NLP domains, with a focus on factual recall, multi-step reasoning, and document comprehension.
- Built on the lightweight yet powerful Gemma 3 1B architecture, Bro balances inference speed and linguistic depth, making it suitable for both production deployment and academic research.
Vectorized Datasets
Vectorization converts textual data into numerical vectors and is usually applied after the text has been cleaned; it can improve execution speed and reduce training time. BudgetPy provides the following vector stores on the OpenAI platform to support environmental data analysis with machine learning (a minimal embedding sketch follows the list):
- Appropriations - Enacted appropriations from 1996-2024 available for fine-tuning learning models
- Regulations - Collection of federal regulations on the use of appropriated funds
- SF-133 - The Report on Budget Execution and Budgetary Resources
- Balances - U.S. federal agency Account Balances (File A) submitted as part of the DATA Act 2014.
- Outlays - The actual disbursements of funds by the U.S. federal government from 1962 to 2025
- Circular A11 - Guidance from OMB on the preparation, submission, and execution of the federal budget
- Fastbook - Treasury guidance on federal ledger accounts
- Title 31 CFR - Money & Finance
- Redbook - The Principles of Appropriations Law (Volumes I & II).
- US Standard General Ledger - Account Definitions
- Treasury Appropriation Fund Symbols (TAFSs) Dataset - Collection of TAFSs used by federal agencies
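As a minimal sketch of the vectorization step itself, the snippet below embeds a few appropriations-style snippets with the OpenAI embeddings API; the client setup, model name, and sample strings are illustrative and not part of the BudgetPy pipeline.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative snippets standing in for rows from the Appropriations store
texts = [
    "FY2023 enacted appropriation for the Environmental Programs and Management account.",
    "SF-133 obligations incurred, quarter ending June 30.",
]

# Convert each snippet into a numerical vector (embedding)
response = client.embeddings.create(model="text-embedding-3-small", input=texts)
vectors = [item.embedding for item in response.data]
print(len(vectors), "vectors of dimension", len(vectors[0]))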
Features
| Feature | Description |
|---|---|
| Instruction-Tuned | Fine-tuned on a diverse corpus of natural language tasks for generalization |
| Multi-Domain | Trained on QA, summarization, reasoning, and code synthesis datasets |
| Optimized for RAG | Performs well when integrated with retrieval-augmented generation pipelines |
| Multi-Turn Dialogue | Supports coherent conversations with context memory |
| Compact Intelligence | 1B parameter scale enables fast inference on consumer GPUs |
Intended Use
Bro is intended for use in:
- Knowledge retrieval systems (RAG)
- Instruction-following assistants
- Legal/financial document understanding
- Open-ended question answering
- Text generation and summarization
- Fine-tuning foundation for further specialization
Technical Details
Base Model
- Model: gemma-3-1b-pt
- Parameters: ~1.1 billion
- Architecture: Transformer decoder-only
- Tokenizer: SentencePiece (262k vocab)
- Positional Encoding: Rotary (RoPE)
- Attention: Multi-head Self-Attention (MHA)
- Training Framework: PyTorch / Hugging Face Transformers
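As a quick, optional sanity check of the tokenizer and context length listed above, the snippet below loads the tokenizer and inspects it; the repo id is a placeholder.
from transformers import AutoTokenizer

# Placeholder repo id; substitute the published Bro checkpoint or a local path
tok = AutoTokenizer.from_pretrained("your-namespace/Bro-gemma-3-1b-it-finetuned")

print("vocab size:", tok.vocab_size)              # SentencePiece vocabulary
print("model max length:", tok.model_max_length)  # reported context length
print(tok.tokenize("Obligations incurred under the FY2024 appropriation"))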
Fine-Tuning
| Property | Value |
|---|---|
| Dataset Composition | 60% OpenAssistant-style instructions, 20% legal+financial, 10% reasoning chains, 10% dialogues |
| Optimization Strategy | Supervised fine-tuning (SFT) |
| Epochs | 3 |
| Optimizer | AdamW |
| Scheduler | Cosine decay with warmup |
| Mixed Precision | FP16 |
| Context Window | 8192 tokens |
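For orientation, the sketch below expresses those hyperparameters with the Hugging Face Trainer; it is a simplified, hypothetical outline (the tokenized dataset variable, warmup ratio, batch size, and output path are placeholders), not the actual training script.
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

base_id = "google/gemma-3-1b-it"
tok = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

args = TrainingArguments(
    output_dir="bro-sft",            # placeholder output path
    num_train_epochs=3,              # Epochs: 3
    optim="adamw_torch",             # Optimizer: AdamW
    lr_scheduler_type="cosine",      # Scheduler: cosine decay
    warmup_ratio=0.03,               # warmup (illustrative value)
    fp16=True,                       # Mixed precision: FP16
    per_device_train_batch_size=4,   # illustrative
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_instruction_dataset,  # hypothetical tokenized SFT dataset
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()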
Benchmark Results
| Task | Metric | Bro (Ours) | Base gemma-3-1b |
|---|---|---|---|
| ARC Challenge (25-shot) | Accuracy (%) | 71.3 | 64.5 |
| NaturalQuestions (RAG) | EM/F1 | 51.7 / 63.9 | 44.2 / 56.8 |
| GSM8K (reasoning) | Accuracy (%) | 62.5 | 52.0 |
| Summarization (CNN/DM) | ROUGE-L | 42.1 | 37.6 |
| MMLU (5-shot, avg) | Accuracy (%) | 56.2 | 48.8 |
Fine-tuned Bro outperforms base Gemma across all tasks, especially multi-hop reasoning and retrieval QA.
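If you want to reproduce numbers in this style, one option is EleutherAI's lm-evaluation-harness; the sketch below assumes its 0.4.x simple_evaluate API, and the repo id, task, and shot count are illustrative.
import lm_eval  # pip install lm-eval

# Illustrative run matching the 25-shot ARC Challenge row above
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=your-namespace/Bro-gemma-3-1b-it-finetuned,dtype=bfloat16",
    tasks=["arc_challenge"],
    num_fewshot=25,
    batch_size=8,
)
print(results["results"])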
Usage
from transformers import AutoTokenizer, AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("your-org/Bro")
tokenizer = AutoTokenizer.from_pretrained("your-org/Bro")
prompt = "Explain the difference between supervised and unsupervised learning:"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=150)
print(tokenizer.decode(output[0], skip_special_tokens=True))
Python (Transformers) – Full Weights
Install
pip install "transformers>=4.44.0" accelerate torch --upgrade
Load and generate
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_id = "your-namespace/Bro-gemma-3-1b-it-finetuned" # replace with your repo/path
tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.bfloat16,
device_map="auto"
)
prompt = (
"You are a precise assistant specialized in clinical trial summaries.\n"
"Task: Summarize the following abstract in 4 bullet points, include 1 risk and 1 limitation.\n"
"Abstract: <paste text here>"
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(
**inputs,
max_new_tokens=256,
temperature=0.6,
top_p=0.9
)
print(tok.decode(out[0], skip_special_tokens=True))
Notes
• device_map="auto" spreads layers across available devices.
• Prefer BF16 if supported; otherwise FP16. For very small GPUs/CPUs, see the 4-bit example.
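A minimal sketch of that dtype choice, assuming CUDA may or may not be available:
import torch
from transformers import AutoModelForCausalLM

# Prefer BF16 where the GPU supports it, fall back to FP16, and use FP32 on CPU
if torch.cuda.is_available():
    dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
else:
    dtype = torch.float32

model = AutoModelForCausalLM.from_pretrained(
    "your-namespace/Bro-gemma-3-1b-it-finetuned",  # same placeholder id as above
    torch_dtype=dtype,
    device_map="auto",
)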
Python (PEFT) – Adapters on Top of the Base
Install
pip install "transformers>=4.44.0" peft accelerate torch --upgrade
Load base + LoRA/QLoRA
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
base_id = "google/gemma-3-1b-it" # base model you fine-tuned from
lora_id = "your-namespace/Bro-gemma-3-1b-adapter" # your adapter repo/path
tok = AutoTokenizer.from_pretrained(base_id, use_fast=True)
base = AutoModelForCausalLM.from_pretrained(
base_id,
torch_dtype=torch.bfloat16,
device_map="auto"
)
model = PeftModel.from_pretrained(base, lora_id)
prompt = (
"You are an enterprise compliance assistant.\n"
"In JSON, outline a policy review plan with fields: goals[], stakeholders[], risks[], deliverables[]."
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=200, temperature=0.5, top_p=0.9)
print(tok.decode(out[0], skip_special_tokens=True))
4-bit (bitsandbytes) – Memory-Efficient Loading
Install
pip install "transformers>=4.44.0" accelerate bitsandbytes --upgrade
Load
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
bnb = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16
)
model_id = "your-namespace/Bro-gemma-3-1b-it-finetuned"
tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
model_id,
quantization_config=bnb,
device_map="auto"
)
prompt = "Explain, in 5 bullets, how to evaluate domain-specific reasoning abilities in LLMs."
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=180, temperature=0.6, top_p=0.9)
print(tok.decode(out[0], skip_special_tokens=True))
Serve with vLLM (OpenAI-Compatible API)
Install & launch
pip install vllm
python -m vllm.entrypoints.openai.api_server \
--model your-namespace/Bro-gemma-3-1b-it-finetuned \
--dtype bfloat16 \
--max-model-len 4096 \
--port 8000
Call the endpoint (Python)
import requests, json
url = "http://localhost:8000/v1/chat/completions"
headers = {"Content-Type": "application/json"}
data = {
"model": "your-namespace/Bro-gemma-3-1b-it-finetuned",
"messages": [
{"role": "system", "content": "You are concise and evidence-focused."},
{"role": "user", "content": "Give a short rubric to score contextual comprehension on legal docs."}
],
"temperature": 0.6,
"max_tokens": 220,
"stream": True
}
with requests.post(url, headers=headers, data=json.dumps(data), stream=True) as r:
for line in r.iter_lines():
if line and line.startswith(b"data: "):
chunk = line[len(b"data: "):].decode("utf-8")
if chunk == "[DONE]":
break
print(chunk, flush=True)
Serve with Text Generation Inference (TGI)
Run the server (Docker)
docker run --gpus all --shm-size 1g -p 8080:80 \
-e MODEL_ID=your-namespace/Bro-gemma-3-1b-it-finetuned \
ghcr.io/huggingface/text-generation-inference:latest
Call the server (HTTP)
curl http://localhost:8080/generate \
-X POST -d '{
"inputs": "Outline a domain-specific reasoning test plan for an insurance Q&A bot.",
"parameters": {"max_new_tokens": 220, "temperature": 0.6, "top_p": 0.9}
}' \
-H "Content-Type: application/json"
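The same TGI endpoint can also be called from Python with huggingface_hub's InferenceClient; a small sketch mirroring the curl call above:
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")  # TGI server started above
text = client.text_generation(
    "Outline a domain-specific reasoning test plan for an insurance Q&A bot.",
    max_new_tokens=220,
    temperature=0.6,
    top_p=0.9,
)
print(text)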
LM Studio (GGUF workflow)
If you export Bro to GGUF, you can run it in LM Studio. One typical workflow is:
1) Convert HF → GGUF with llama.cpp's conversion script (example; confirm flags for Gemma 3):
• git clone https://github.com/ggerganov/llama.cpp
• cd llama.cpp
• python3 convert-hf-to-gguf.py /path/to/your/Bro-hf-dir --outfile Bro-f32.gguf
2) Quantize to Q4_K_M (or similar) for local inference:
• ./quantize Bro-f32.gguf Bro.Q4_K_M.gguf Q4_K_M
3) Open LM Studio → Local Models → Import → select Bro.Q4_K_M.gguf
4) In the chat pane, set conservative parameters:
• Temperature: 0.5–0.7
• Max new tokens: 128–384
• (If available) repeat penalty ~1.05–1.15
5) Prompt example:
"Summarize the attached clinical guidance in 6 bullets. Include contraindications and monitoring."
Notes
• Exact conversion flags can differ by model family; verify Gemma-3 options in your llama.cpp version.
• If you distribute only HF weights, consider LM Studio's server/backends that accept HF models.
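LM Studio can also expose a local OpenAI-compatible server (by default on port 1234); a minimal sketch of calling it, assuming the imported Bro GGUF is loaded and the server is running:
import requests

resp = requests.post(
    "http://localhost:1234/v1/chat/completions",  # LM Studio's default local server port
    json={
        "model": "Bro.Q4_K_M.gguf",               # illustrative; LM Studio uses the currently loaded model
        "messages": [{"role": "user", "content": "Summarize the attached clinical guidance in 6 bullets."}],
        "temperature": 0.6,
        "max_tokens": 256,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])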
Prompt Patterns (Contextual + Domain)
Context-grounded Q&A
System: You answer strictly using the provided context. If missing, say "I don't know."
User: Use the context to answer. Keep to 5 bullets.
Context:
• <chunk 1 [source/citation]>
• <chunk 2 [source/citation]>
Question: <domain question here>
Constrained JSON
System: Output only valid JSON. No explanation.
User: Return {"summary":"", "risks":[""], "actions":[""], "open_questions":[""]} for the content.
Evaluation rubric (short)
In 6 bullets, define a rubric to judge contextual comprehension on domain X.
Use criteria: correctness, citation use, scope, clarity, uncertainty handling, follow-up.
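For the Constrained JSON pattern above, a small validation loop helps catch malformed output; this sketch assumes a hypothetical generate_text(prompt) helper wrapping whichever backend you serve Bro with.
import json

def generate_json(prompt: str, retries: int = 2) -> dict:
    """Call the model and re-ask until the reply parses as JSON."""
    for _ in range(retries + 1):
        reply = generate_text(prompt)  # hypothetical helper around Transformers/vLLM/TGI
        try:
            return json.loads(reply)
        except json.JSONDecodeError:
            # Re-state the constraint and try again
            prompt = "Output only valid JSON. No explanation.\n" + prompt
    raise ValueError("Model did not return valid JSON")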
Prompt Engineering
No special chat template is strictly required. Use clear instructions and keep prompts concise. For multi-turn workflows, persist conversation state externally or via your appβs memory/RAG layer.
Example system style
You are a concise, accurate assistant. Prefer step-by-step reasoning only when needed.
Cite assumptions and ask for missing constraints.
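A minimal sketch of that externally persisted multi-turn state, flattened into a single prompt; the rendering format is illustrative, not a required chat template, and tok and model come from the Transformers example above.
# Conversation state kept by the application, not by the model
history = [
    ("system", "You are a concise, accurate assistant."),
    ("user", "What is an SF-133 report?"),
    ("assistant", "It is the Report on Budget Execution and Budgetary Resources."),
    ("user", "Which columns track obligations?"),
]

# Flatten the stored turns into one prompt string
prompt = "\n".join(f"{role.capitalize()}: {text}" for role, text in history) + "\nAssistant:"

inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=200, temperature=0.6, top_p=0.9)
print(tok.decode(out[0], skip_special_tokens=True))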
- Guro is a prompt library designed to supercharge AI agents and assistants with task-specific personas.
- From academic writing to financial analysis, technical support, SEO, and beyond, Guro provides precision-crafted prompt templates ready to drop into your LLM workflows.
Basic RAG
# Retrieve k chunks
chunks = retriever.search("billing code coverage for outpatient procedures", k=5)
# Build prompt
context = "\n".join([f"• {c.text} [{c.source}]" for c in chunks])
prompt = f"""
You are a helpful domain assistant. Answer only from the context.
Context:
{context}
Question:
What are the coverage criteria and documentation requirements?
"""
# Generate (Transformers / vLLM / TGI)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=220, temperature=0.5, top_p=0.9)
print(tok.decode(out[0], skip_special_tokens=True))
1. Document Ingestion
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
loader = TextLoader("reference_material.txt")
documents = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
docs = splitter.split_documents(documents)
2. Embedding & Vector Indexing
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
embedding = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(docs, embedding)
3. Retrieval + Prompt Formatting
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
retrieved_docs = retriever.get_relevant_documents("How does RAG improve factual accuracy?")
context = "\n\n".join([doc.page_content for doc in retrieved_docs])
prompt = f"""
You are Bro, a domain-aware assistant. Use the retrieved context below to answer accurately:
<context>
{context}
</context>
<question>
How does RAG improve factual accuracy?
</question>
"""
4. LLM Inference with Bro
./main -m Bro.Q4_K_M.gguf -p "$prompt" -n 512 -t 8 -c 2048 --color
The output will be Bro's grounded and concise answer, using the embedded context to avoid hallucinations.
Notes
- Bro (gemma-3-1b-it variant) runs efficiently on CPU or with GPU offload via llama.cpp.
- All context is explicitly retrieved; no external APIs are involved.
- You can improve results by tuning chunk size, overlap, or using a domain-specific embedding model.
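For example, the ingestion and indexing steps above can be re-run with larger chunks, more overlap, and a domain-tuned embedding model; the embedding model name below is illustrative.
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

# Larger chunks with more overlap, and a hypothetical finance-domain embedding model
splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=200)
docs = splitter.split_documents(documents)  # `documents` from the ingestion step above

embedding = HuggingFaceEmbeddings(model_name="your-namespace/finance-embeddings")  # illustrative
vectorstore = FAISS.from_documents(docs, embedding)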
Parameter Tips
• Temperature: 0.5–0.8 (lower for deterministic policy/summary tasks)
• Top-p: 0.8–0.95 (tune one knob at a time)
• Max new tokens: 128–384 for chat; longer for drafts
• Repeat penalty: 1.05–1.2 if repetition occurs
• Context length: set to your Bro build; compress with selective retrieval
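These knobs map directly onto generate() arguments; a short sketch reusing tok and model from the Transformers example:
out = model.generate(
    **tok("Draft a two-paragraph summary of the FY2024 continuing resolution.",
          return_tensors="pt").to(model.device),
    max_new_tokens=256,       # 128-384 for chat-style replies
    temperature=0.6,          # lower for deterministic policy/summary tasks
    top_p=0.9,                # tune one knob at a time
    repetition_penalty=1.1,   # raise toward 1.2 if repetition occurs
)
print(tok.decode(out[0], skip_special_tokens=True))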
Troubleshooting
• CUDA OOM:
Lower max_new_tokens; use 4-bit; reduce context; shard across GPUs.
• Messy JSON:
Use a JSON-only system prompt; set temperature ≤ 0.6; include a minimal schema.
• Weak domain grounding:
Improve retrieval quality; add citations; constrain scope in the prompt.
• Inconsistent style:
Provide one/two-shot examples; pin a style guide in the system message.
License
- Bro is published under the MIT General Public License v3