Step3p5 NVFP4
NVIDIA FP4 (NVFP4) quantized version of the Step3p5 Mixture-of-Experts model, with MoE router/gate weights dequantized to bfloat16 for vLLM compatibility.
Quantization Details
- Quantization method: NVIDIA ModelOpt 0.41.0, NVFP4 (W4A4)
- Weight format: FP4 E2M1, packed 2 values per uint8 byte
- Group size: 16
- Excluded from quantization:
lm_head,*.moe.gate*(router/gate)
The MoE router/gate weights are stored in bfloat16 (not quantized) following NVIDIA ModelOpt best practices — quantizing the router degrades routing quality with negligible memory savings.
Serving with vLLM
VLLM_USE_FLASHINFER_MOE_FP4=0 vllm serve apandacoding/step3p5-nvfp4 \
--quantization modelopt_fp4 \
--trust-remote-code \
--host 0.0.0.0 --port 8000
Note:
VLLM_USE_FLASHINFER_MOE_FP4=0is required to use the VLLM_CUTLASS MoE backend. The FlashInfer TRTLLM monolithic MoE kernel has a known issue with 288-expert models.
Model Architecture
- Type: Mixture of Experts (MoE) with shared experts
- Experts: 288 routed + shared expert per layer
- Top-K: 8 experts per token
- Hidden size: 4096
- MoE intermediate size: 1280
- MoE layers: 42 (layers 3–44)
- Attention: GQA with 96 heads, 8 KV heads
- Context length: 262,144 tokens
- Downloads last month
- 38