Step-Audio-R1-nvfp4
Format: NVFP4 (weights & activations quantized to FP4 with dual scaling).
Base model: stepfun-ai/Step-Audio-R1
How it was made: One-shot quantization with LLM Compressor (NVFP4 recipe), calibrated on long sequences from Rombo-Org/Optimized_Reasoning.
Notes: Keep lm_head in high precision; calibrate on long, domain-relevant sequences.
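As a rough illustration of the recipe above, here is a minimal LLM Compressor one-shot sketch. The dataset split, sample count, sequence length, and preprocessing are assumptions for illustration, not the exact settings used for this release:

# Hypothetical sketch of an NVFP4 one-shot pass with LLM Compressor.
# Dataset handling, sample count, and sequence length are illustrative
# assumptions, not the published recipe for this model.
from datasets import load_dataset
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Long, reasoning-heavy calibration data (split name assumed;
# tokenization/formatting omitted for brevity)
ds = load_dataset("Rombo-Org/Optimized_Reasoning", split="train")
ds = ds.shuffle(seed=42).select(range(512))

recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",       # FP4 weights & activations with dual (per-tensor + per-block) scaling
    ignore=["lm_head"],   # keep the output head in high precision
)

oneshot(
    model="stepfun-ai/Step-Audio-R1",
    dataset=ds,
    recipe=recipe,
    max_seq_length=8192,
    num_calibration_samples=512,
)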
About This Model
This is a quantized NVFP4 (W4A4) version of Step-Audio-R1, an open-weights audio-based multimodal model for audio understanding and reasoning. The original BF16 model requires ~67 GB VRAM.
Step-Audio-R1 combines:
- A high-capacity audio encoder
- A projection layer that maps audio features into the transformer
- A language backbone for reasoning and text generation
The model is designed for:
- Speech transcription and interpretation
- Emotional / tonal analysis
- Speaker characteristics
- Music and sound-scene understanding
- High-quality step-by-step reasoning about audio inputs
It does not generate audio; it produces text based on audio input.
What This Quantized Version Enables
This NVFP4 quantized version reduces memory requirements significantly:
- Size: ~22 GB (down from ~67 GB)
- Should fit comfortably on a single RTX 5090
- Preserves most of the reasoning performance of the BF16 release

Because of this, anyone with a high-end consumer GPU can experiment with advanced audio reasoning locally.
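As a rough sanity check on the ~22 GB figure, here is a back-of-the-envelope estimate; the parameter count is inferred from the BF16 checkpoint size and is an assumption, not an official number:

# Back-of-the-envelope NVFP4 size estimate; the parameter count is an
# assumption inferred from the ~67 GB BF16 checkpoint, not an official figure.
bf16_gb = 67
params_billion = bf16_gb / 2      # BF16 uses 2 bytes per parameter -> ~33.5B params

bits_per_weight = 4 + 8 / 16      # FP4 value plus one FP8 scale shared by 16 values
fp4_gb = params_billion * bits_per_weight / 8
print(f"~{fp4_gb:.1f} GB for the quantized linear weights")   # ~18.8 GB

# The remaining headroom up to ~22 GB covers the high-precision lm_head,
# embeddings, the audio encoder, and other tensors left unquantized.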
Check the original model card for more information about this model.
Running the model with vLLM in Docker
It requires the specific vLLM container image released by the model authors.
docker run --rm -ti --gpus all \
-v $(pwd)/Step-Audio-R1:/Step-Audio-R1 \
-p 9999:9999 \
stepfun2025/vllm:step-audio-2-v20250909 \
vllm serve /Step-Audio-R1 \
--served-model-name Step-Audio-R1 \
--port 9999 \
--max-model-len 16384 \
--max-num-seqs 32 \
--chat-template '{%- macro render_content(content) -%}{%- if content is string -%}{{- content.replace("<audio_patch>\n", "<audio_patch>") -}}{%- elif content is mapping -%}{{- content['"'"'value'"'"'] if '"'"'value'"'"' in content else content['"'"'text'"'"'] -}}{%- elif content is iterable -%}{%- for item in content -%}{%- if item.type == '"'"'text'"'"' -%}{{- item['"'"'value'"'"'] if '"'"'value'"'"' in item else item['"'"'text'"'"'] -}}{%- elif item.type == '"'"'audio'"'"' -%}<audio_patch>{%- endif -%}{%- endfor -%}{%- endif -%}{%- endmacro -%}{%- if tools -%}{{- '"'"'<|BOT|>system\n'"'"' -}}{%- if messages[0]['"'"'role'"'"'] == '"'"'system'"'"' -%}{{- render_content(messages[0]['"'"'content'"'"']) + '"'"'<|EOT|>'"'"' -}}{%- endif -%}{{- '"'"'<|BOT|>tool_json_schemas\n'"'"' + tools|tojson + '"'"'<|EOT|>'"'"' -}}{%- else -%}{%- if messages[0]['"'"'role'"'"'] == '"'"'system'"'"' -%}{{- '"'"'<|BOT|>system\n'"'"' + render_content(messages[0]['"'"'content'"'"']) + '"'"'<|EOT|>'"'"' -}}{%- endif -%}{%- endif -%}{%- for message in messages -%}{%- if message["role"] == "user" -%}{{- '"'"'<|BOT|>human\n'"'"' + render_content(message["content"]) + '"'"'<|EOT|>'"'"' -}}{%- elif message["role"] == "assistant" -%}{{- '"'"'<|BOT|>assistant\n'"'"' + (render_content(message["content"]) if message["content"] else '"'"''"'"') -}}{%- set is_last_assistant = true -%}{%- for m in messages[loop.index:] -%}{%- if m["role"] == "assistant" -%}{%- set is_last_assistant = false -%}{%- endif -%}{%- endfor -%}{%- if not is_last_assistant -%}{{- '"'"'<|EOT|>'"'"' -}}{%- endif -%}{%- elif message["role"] == "function_output" -%}{%- else -%}{%- if not (loop.first and message["role"] == "system") -%}{{- '"'"'<|BOT|>'"'"' + message["role"] + '"'"'\n'"'"' + render_content(message["content"]) + '"'"'<|EOT|>'"'"' -}}{%- endif -%}{%- endif -%}{%- endfor -%}{%- if add_generation_prompt -%}{{- '"'"'<|BOT|>assistant\n<think>\n'"'"' -}}{%- endif -%}' \
--enable-log-requests \
--interleave-mm-strings \
--trust-remote-code
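Once the container is up, a quick way to confirm the server is reachable and serving the model is to hit the standard OpenAI-compatible model listing route (assuming the port mapping from the command above):

import requests

# Sanity check: list the models served by the vLLM OpenAI-compatible endpoint.
resp = requests.get("http://localhost:9999/v1/models")
resp.raise_for_status()
print(resp.json())   # should show "Step-Audio-R1" under the "data" entries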
The example script below streams an audio WAV file to the model and returns a response based on the prompt.
import requests
import base64

with open("audio.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

payload = {
    "model": "Step-Audio-R1",
    "stream": True,
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "input_audio", "audio_data": audio_b64, "mime_type": "audio/wav"},
                {"type": "text", "text": "Transcribe this and describe the speaker."}
            ]
        }
    ]
}

with requests.post(
    "http://localhost:9999/v1/chat/completions",
    json=payload,
    stream=True,
) as r:
    for line in r.iter_lines():
        if line:
            print(line.decode())
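The response comes back as server-sent events, one "data: {...}" line per chunk. If you only want the generated text, the printing loop above can be swapped for something like the following; this assumes the standard OpenAI-compatible streaming format that vLLM emits, with a [DONE] sentinel at the end of the stream:

import json

# Variant of the loop above that collects only the generated text.
# Assumes the standard OpenAI-compatible SSE chunk format emitted by vLLM.
with requests.post(
    "http://localhost:9999/v1/chat/completions",
    json=payload,
    stream=True,
) as r:
    for raw in r.iter_lines():
        if not raw:
            continue
        line = raw.decode()
        if not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data == "[DONE]":        # end-of-stream sentinel
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"].get("content") or ""
        print(delta, end="", flush=True)
    print()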
This was tested on an RTX Pro 6000 Blackwell cloud instance.
If there are other models you're interested in seeing quantized to NVFP4 for use on the DGX Spark or other modern Blackwell (or newer) cards, let me know. I'm trying to make more NVFP4 models available so more people can try them out.