CMMC Expert 7B v2.0

Notice: These models are provided for proof-of-concept and testing purposes only. Production-grade models are not publicly shared. For inquiries regarding production models or commercial licensing, please contact the maintainer: Nathan Maine.

A locally-hosted, fine-tuned language model specialized in CMMC 2.0, NIST 800-171, NIST 800-53, NIST CSF, HIPAA, DFARS, and cybersecurity compliance frameworks.

This is the 7B variant — optimized for fast responses on consumer hardware. Part of a four-model suite (7B, 14B, 32B, 72B) sharing the same compliance knowledge base.

What's New in v2.0

  • 11% more training data — 18,747 total examples (up from 16,906 in v1.0)
  • 6 new authoritative sources — NIST SP 800-53 Rev. 5 full catalog, NIST SP 800-171 Rev. 3, NIST CSF 2.0, eCFR regulations (CMMC/DFARS/HIPAA), Federal Register documents, DoD PDFs
  • Expanded LoRA coverage — All 7 transformer modules targeted (v1.0 used only 4)
  • Improved eval loss — 1.142 (down from 1.241 in v1.0)
  • Automated data pipeline — Reproducible scraping, filtering, and deduplication via cmmc-data-pipeline

Quick Start (Ollama)

```shell
# Download and run
ollama pull Nathan-Maine/cmmc-expert-7b-v2.0

# Ask a compliance question
ollama run cmmc-expert-7b-v2.0 "What access controls are required for CMMC Level 2?"

# Or call the Ollama HTTP API directly
curl http://localhost:11434/api/generate -d '{
  "model": "cmmc-expert-7b-v2.0",
  "prompt": "What are the key differences between CMMC Level 1 and Level 2?",
  "stream": false
}'
```
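The same endpoint can be called from Python. This is a minimal sketch using only the standard library; it assumes a stock Ollama install listening on the default port 11434 (the `build_payload` and `ask` helper names are illustrative, not part of any published SDK):

```python
import json
import urllib.request

# Default local Ollama endpoint (assumption: stock install on port 11434).
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(prompt: str, model: str = "cmmc-expert-7b-v2.0",
                  stream: bool = False) -> bytes:
    """Assemble the JSON body that Ollama's /api/generate route expects."""
    return json.dumps({"model": model, "prompt": prompt, "stream": stream}).encode()

def ask(prompt: str) -> str:
    """POST a compliance question and return the model's full response text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

With `stream: false` the server returns one JSON object whose `response` field holds the complete answer; set it to true to receive incremental chunks instead.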

Model Details

| Property | Value |
|---|---|
| Base Model | Qwen2.5-7B-Instruct |
| Parameters | 7.6 billion |
| Fine-Tuning Method | QLoRA (4-bit NF4 base, LoRA rank 64, alpha 128) |
| Quantization | q5_k_m (GGUF) |
| File Size | 5.1 GB |
| Context Length | 32,768 tokens |
| Training Hardware | NVIDIA A100-SXM4-80GB |
| Training Time | ~3.1 hours |
| Training Framework | HuggingFace TRL + PEFT + bitsandbytes |

Security Domain Coverage

Models are fine-tuned for complete security domain coverage, including vulnerability analysis, incident response scenarios, and access control failure modes required for professional SSP and POA&M generation. Behavioral guardrails and policy enforcement are handled at the governed-llm-gateway layer.

Base model migration to Meta Llama 3.1/3.3 (US-origin, open weights) is in progress.

Compliance Framework Coverage

Trained across eight overlapping frameworks to support cross-framework mapping:

| Framework | Coverage |
|---|---|
| CMMC 2.0 (32 CFR Part 170) | All three levels — 17 L1 practices, 110 L2, 134 L3, assessment methodology |
| NIST SP 800-171 Rev. 2 & 3 | 110 security requirements across 14 families |
| NIST SP 800-172 | Enhanced security requirements for critical CUI programs |
| NIST SP 800-53 Rev. 5 | Full catalog of 1,189 controls across 20 families |
| NIST SP 800-37 | Risk Management Framework (RMF) steps and authorization |
| NIST CSF 2.0 | Govern, Identify, Protect, Detect, Respond, Recover functions |
| HIPAA Security Rule | Administrative, physical, and technical safeguards |
| DFARS Clauses | 252.204-7008/7009/7012/7019/7020/7021/7024/7025, 252.239-7009/7010 |

Training Data

14,906 training + 3,841 validation examples (~4.5M tokens) assembled from 11 curated sources:

v1.0 Legacy Sources (13,434 examples)

| Source | Examples | Share |
|---|---|---|
| NIST Cybersecurity (filtered from 424K) | 6,372 | 33.9% |
| CMMC Full | 4,787 | 25.5% |
| CMMC Balanced | 994 | 5.3% |
| HIPAA Compliance | 961 | 5.1% |
| CMMC Core | 320 | 1.7% |

v2.0 New Sources (1,841 examples via automated pipeline)

| Source | Examples | Share |
|---|---|---|
| NIST CSRC (SP 800-53 Rev. 5 controls) | 773 | 4.1% |
| DoD Documents (PDFs) | 519 | 2.8% |
| Federal Register | 350 | 1.9% |
| eCFR Regulations (CMMC/DFARS/HIPAA) | 75 | 0.4% |
| NIST SP 800-171 Rev. 3 | 63 | 0.3% |
| NIST CSF 2.0 | 61 | 0.3% |

v2.0 Data Processing Pipeline:

  1. Automated scraping — 6 authoritative sources scraped via dedicated modules
  2. Relevance filtering — eCFR filtered to only CMMC-relevant DFARS clauses (252.204-70xx, 252.239-70xx), CMMC (32 CFR 170), and HIPAA (45 CFR 164)
  3. Format conversion — Raw records converted to chat-style instruction/response pairs
  4. Quality filtering — Removed entries <100 chars, entries >8,000 chars, OCR artifacts
  5. Deduplication — Exact dedup (xxhash) + near-dedup (MinHash LSH, 128 permutations, Jaccard 0.8 threshold, 5-gram shingles)
  6. Cross-version dedup — v2.0 records deduplicated against v1.0 corpus to prevent overlap
  7. Validation split — 80/20 stratified split maintaining source distribution

Pipeline source code: github.com/NathanMaine/cmmc-data-pipeline
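Step 5 above combines exact and near-duplicate detection. The sketch below shows the same two-stage semantics with standard-library stand-ins — `hashlib` in place of xxhash and brute-force Jaccard over 5-gram shingles in place of MinHash LSH (which trades exactness for speed on large corpora); the function names are illustrative, not taken from the actual pipeline:

```python
import hashlib

def shingles(text: str, n: int = 5) -> set:
    """Word-level n-gram shingles (the pipeline uses 5-gram shingles)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two shingle sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def dedup(records: list, threshold: float = 0.8) -> list:
    """Exact dedup via content hash, then near-dedup at Jaccard >= 0.8.
    The real pipeline uses xxhash + MinHash LSH (128 permutations);
    sha256 + pairwise Jaccard give the same result on small corpora."""
    seen_hashes, kept, kept_shingles = set(), [], []
    for rec in records:
        h = hashlib.sha256(rec.encode()).hexdigest()
        if h in seen_hashes:          # exact duplicate
            continue
        seen_hashes.add(h)
        sh = shingles(rec)
        if any(jaccard(sh, prev) >= threshold for prev in kept_shingles):
            continue                  # near duplicate
        kept.append(rec)
        kept_shingles.append(sh)
    return kept
```

Pairwise comparison is O(n²); MinHash LSH buckets similar records so each new record is only compared against likely matches, which is what makes the 0.8-threshold near-dedup tractable at corpus scale.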

Training Configuration

| Parameter | Value |
|---|---|
| Epochs | 3 |
| Learning Rate | 2e-4 (cosine decay) |
| Warmup | 5% of steps |
| Optimizer | 8-bit AdamW |
| Batch Size | 4 (effective 32 with gradient accumulation x8) |
| LoRA Rank | 64 |
| LoRA Alpha | 128 |
| LoRA Dropout | 0.05 |
| LoRA Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Max Sequence Length | 2048 |
| Packing | Enabled |
| Base Quantization | 4-bit NF4 with double quantization |
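The schedule implied by the table (peak 2e-4, 5% linear warmup, cosine decay over 282 steps) can be sketched as below. This is one common reading of "cosine decay with 5% warmup"; the exact trainer implementation may differ in minor details:

```python
import math

# Values taken from the Training Configuration and Evaluation tables.
TOTAL_STEPS = 282
WARMUP_FRAC = 0.05
PEAK_LR = 2e-4

def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step: linear warmup over the
    first 5% of steps, then cosine decay toward zero."""
    warmup = max(1, int(TOTAL_STEPS * WARMUP_FRAC))  # 14 steps here
    if step < warmup:
        return PEAK_LR * (step + 1) / warmup
    progress = (step - warmup) / max(1, TOTAL_STEPS - warmup)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))
```

The cosine shape keeps the rate near its peak through the early epochs and flattens out near zero by step 282, which matches the slowing loss improvement visible in the training curve below.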

Evaluation Results

Training Metrics

| Metric | Value |
|---|---|
| Final Train Loss | 1.030 |
| Average Train Loss | 1.209 |
| Final Eval Loss | 1.142 |
| Mean Token Accuracy | 76.5% |
| Total Training Steps | 282 |
| Tokens Processed | ~18M |

Training Curve (Selected Steps)

| Step | Epoch | Train Loss | Token Accuracy |
|---|---|---|---|
| 100 | ~1.1 | 1.297 | 71.4% |
| 150 | ~1.6 | 1.144 | 74.4% |
| 200 | ~2.0 | 1.101 | 75.4% |
| 250 | ~2.7 | 1.022 | 76.7% |
| 282 | 3.0 | 1.030 | 76.5% |

v1.0 vs v2.0 Comparison

| Metric | v1.0 | v2.0 | Change |
|---|---|---|---|
| Training Examples | 13,434 | 14,906 | +11% |
| Validation Examples | 3,472 | 3,841 | +11% |
| Eval Loss | 1.241 | 1.142 | -8% (better) |
| LoRA Target Modules | 4 | 7 | +75% coverage |
| Data Sources | 5 | 11 | +6 new sources |

Intended Uses

  • SSP Generation — Draft System Security Plan control descriptions with NIST/CMMC citations
  • Gap Analysis — Identify controls required for specific CMMC levels and contract requirements
  • Assessment Prep — Generate evidence checklists and assessment objective narratives
  • Cross-Framework Mapping — Map controls between CMMC, NIST 800-53, HIPAA, and DFARS
  • Policy Drafting — Create policies aligned to specific CMMC practices
  • DFARS Clause Analysis — Identify requirements from contract language
  • Regulatory Research — Understand eCFR regulations and Federal Register guidance
  • Training & Education — Always-available compliance reference for teams

Limitations

  • Not a substitute for qualified compliance professionals. This model is a tool to accelerate compliance work, not replace human judgment.
  • Knowledge cutoff. The model's knowledge is based on training data available at the time of fine-tuning (February 2026). Always verify against current published frameworks.
  • 7B reasoning depth. For complex multi-framework analysis or detailed gap assessments, consider the 14B, 32B, or 72B variants which provide deeper reasoning capabilities.
  • No retrieval augmentation. The model generates responses from trained knowledge only — it does not search or retrieve external documents at inference time.
  • Citation accuracy. While the model generally cites correct control numbers and framework sections, always verify specific citations against authoritative sources.

Out-of-Scope Uses

  • Legal advice. This model does not provide legal opinions on compliance status.
  • Automated compliance certification. CMMC certification requires human assessors (C3PAOs).
  • Processing actual CUI/ITAR data. The model itself does not process or store sensitive data, but users should follow their organization's data handling policies.

Hardware Requirements

| Mode | GPU (VRAM) | CPU-Only (RAM) | Storage |
|---|---|---|---|
| Inference | 8 GB | 16 GB | 10 GB |
| Training | 16 GB+ | N/A | 30 GB |

Supported OS: Linux, macOS, Windows (WSL2)

The Model Suite

This is the 7B model — the fastest option for day-to-day compliance queries. The full suite includes:

| Model | Parameters | GGUF Size | Best For |
|---|---|---|---|
| cmmc-expert-7b-v2.0 | 7.6B | 5.1 GB | Quick lookups, day-to-day queries |
| cmmc-expert-14b-v2.0 | 14.7B | ~10 GB | Detailed analysis, multi-control reasoning |
| cmmc-expert-32b-v2.0 | 32.5B | ~19 GB | Deep gap assessments, SSP drafting |
| cmmc-expert-72b-v2.0 | 72.7B | ~42 GB | Complex multi-framework analysis |

Source Code

Model repository: github.com/NathanMaine/cmmc-compliance-ai-model
Data pipeline: github.com/NathanMaine/cmmc-data-pipeline

Known Issues

  • Repetition bug — The model may repeat content, lists, or entire sections multiple times within a single response. This is a known training artifact being addressed in future versions.
  • Verbose responses — Tends to over-explain in some contexts where a concise answer would be more appropriate.

Citation

@misc{maine2026cmmcexpert,
  title={CMMC Expert v2.0: Fine-Tuned Language Models for Cybersecurity Compliance},
  author={Nathan Maine},
  year={2026},
  url={https://github.com/NathanMaine/cmmc-compliance-ai-model}
}

Contact

Maintainer: Nathan Maine
