Step-Audio-R1-nvfp4

Format: NVFP4 β€” weights & activations quantized to FP4 with dual scaling.
Base model: stepfun-ai/Step-Audio-R1
How it was made: One-shot calibration with LLM Compressor (NVFP4 recipe), using long-sequence calibration data from Rombo-Org/Optimized_Reasoning.

Notes: Keep lm_head in high precision; calibrate on long, domain-relevant sequences.
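
For reference, a minimal sketch of this kind of one-shot NVFP4 flow with LLM Compressor is shown below. The loading class, calibration sample count, sequence length, and dataset preprocessing are assumptions for illustration, not the exact script used to produce this checkpoint.

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "stepfun-ai/Step-Audio-R1"

# Loading class is an assumption; the multimodal checkpoint ships custom code.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# Long, reasoning-heavy calibration text. The dataset's column names are not
# assumed here: simply concatenate whatever string fields each sample has.
NUM_SAMPLES = 256
MAX_SEQ_LEN = 4096
ds = load_dataset("Rombo-Org/Optimized_Reasoning", split="train")
ds = ds.shuffle(seed=42).select(range(NUM_SAMPLES))

def tokenize(sample):
    text = " ".join(str(v) for v in sample.values() if isinstance(v, str))
    return tokenizer(text, truncation=True, max_length=MAX_SEQ_LEN, add_special_tokens=False)

ds = ds.map(tokenize, remove_columns=ds.column_names)

# NVFP4 (W4A4) on all Linear layers, keeping lm_head in high precision.
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQ_LEN,
    num_calibration_samples=NUM_SAMPLES,
)

model.save_pretrained("Step-Audio-R1-nvfp4", save_compressed=True)
tokenizer.save_pretrained("Step-Audio-R1-nvfp4")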

πŸ“˜ About This Model

This is a quantized NVFP4 (W4A4) version of Step-Audio-R1, an open-weights audio-based multimodal model for audio understanding and reasoning. The original BF16 model requires ~67 GB of VRAM.
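
To make "W4A4 with dual scaling" concrete, the toy sketch below round-trips a list of floats through an NVFP4-like representation: each value is rounded to an FP4 (E2M1) code, every block of 16 elements shares a local scale, and a single per-tensor scale sits on top. It is only an illustration; the real format also stores the block scales in FP8 (E4M3) and quantizes activations on the fly.

FP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # magnitudes representable in FP4 (E2M1)
FP4_MAX = 6.0
BLOCK = 16                                              # NVFP4 block size

def fake_nvfp4_roundtrip(xs):
    # Global (per-tensor) scale, then a finer per-block scale: "dual scaling".
    tensor_scale = max(abs(x) for x in xs) or 1.0
    normalized = [x / tensor_scale for x in xs]
    out = []
    for i in range(0, len(normalized), BLOCK):
        block = normalized[i:i + BLOCK]
        block_scale = (max(abs(x) for x in block) or 1.0) / FP4_MAX
        for x in block:
            # Round the scaled value to the nearest FP4 code, then dequantize.
            code = min(FP4_VALUES, key=lambda v: abs(v - abs(x) / block_scale))
            out.append((code if x >= 0 else -code) * block_scale * tensor_scale)
    return out

original = [0.02 * i for i in range(32)]
roundtrip = fake_nvfp4_roundtrip(original)
print(max(abs(a - b) for a, b in zip(original, roundtrip)))  # worst-case round-trip error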

Step-Audio-R1 combines:
- A high-capacity audio encoder
- A projection layer that maps audio features into the transformer
- A language backbone for reasoning and text generation

The model is designed for:
- Speech transcription and interpretation
- Emotional / tonal analysis
- Speaker characteristics
- Music and sound-scene understanding
- High-quality step-by-step reasoning about audio inputs

It does not generate audio; it produces text based on audio input.

πŸ“¦ What This Quantized Version Enables

This NVFP4 quantized version reduces memory requirements significantly:
- Size: ~22 GB (down from ~67 GB)
- Should fit comfortably on a single RTX 5090
- Preserves most reasoning performance from the BF16 release

Because of this, anyone with a high-end consumer GPU can experiment with advanced audio reasoning locally.

Check the original model card for more information about this model.

Running the model with vLLM in Docker

Serving requires the specific vLLM container released by the model authors.

docker run --rm -ti --gpus all \
    -v $(pwd)/Step-Audio-R1:/Step-Audio-R1 \
    -p 9999:9999 \
    stepfun2025/vllm:step-audio-2-v20250909 \
    vllm serve /Step-Audio-R1 \
    --served-model-name Step-Audio-R1 \
    --port 9999 \
    --max-model-len 16384 \
    --max-num-seqs 32 \
    --chat-template '{%- macro render_content(content) -%}{%- if content is string -%}{{- content.replace("<audio_patch>\n", "<audio_patch>") -}}{%- elif content is mapping -%}{{- content['"'"'value'"'"'] if '"'"'value'"'"' in content else content['"'"'text'"'"'] -}}{%- elif content is iterable -%}{%- for item in content -%}{%- if item.type == '"'"'text'"'"' -%}{{- item['"'"'value'"'"'] if '"'"'value'"'"' in item else item['"'"'text'"'"'] -}}{%- elif item.type == '"'"'audio'"'"' -%}<audio_patch>{%- endif -%}{%- endfor -%}{%- endif -%}{%- endmacro -%}{%- if tools -%}{{- '"'"'<|BOT|>system\n'"'"' -}}{%- if messages[0]['"'"'role'"'"'] == '"'"'system'"'"' -%}{{- render_content(messages[0]['"'"'content'"'"']) + '"'"'<|EOT|>'"'"' -}}{%- endif -%}{{- '"'"'<|BOT|>tool_json_schemas\n'"'"' + tools|tojson + '"'"'<|EOT|>'"'"' -}}{%- else -%}{%- if messages[0]['"'"'role'"'"'] == '"'"'system'"'"' -%}{{- '"'"'<|BOT|>system\n'"'"' + render_content(messages[0]['"'"'content'"'"']) + '"'"'<|EOT|>'"'"' -}}{%- endif -%}{%- endif -%}{%- for message in messages -%}{%- if message["role"] == "user" -%}{{- '"'"'<|BOT|>human\n'"'"' + render_content(message["content"]) + '"'"'<|EOT|>'"'"' -}}{%- elif message["role"] == "assistant" -%}{{- '"'"'<|BOT|>assistant\n'"'"' + (render_content(message["content"]) if message["content"] else '"'"''"'"') -}}{%- set is_last_assistant = true -%}{%- for m in messages[loop.index:] -%}{%- if m["role"] == "assistant" -%}{%- set is_last_assistant = false -%}{%- endif -%}{%- endfor -%}{%- if not is_last_assistant -%}{{- '"'"'<|EOT|>'"'"' -}}{%- endif -%}{%- elif message["role"] == "function_output" -%}{%- else -%}{%- if not (loop.first and message["role"] == "system") -%}{{- '"'"'<|BOT|>'"'"' + message["role"] + '"'"'\n'"'"' + render_content(message["content"]) + '"'"'<|EOT|>'"'"' -}}{%- endif -%}{%- endif -%}{%- endfor -%}{%- if add_generation_prompt -%}{{- '"'"'<|BOT|>assistant\n<think>\n'"'"' -}}{%- endif -%}' \
    --enable-log-requests \
    --interleave-mm-strings \
    --trust-remote-code
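
Once the container is up, it is worth confirming that the OpenAI-compatible endpoint is serving the model under the expected name before sending audio. A quick check against the standard vLLM /v1/models route:

import requests

# List the models served by the container; expect ["Step-Audio-R1"].
resp = requests.get("http://localhost:9999/v1/models", timeout=10)
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])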

This example script streams a WAV audio file to the server and prints the model's response to the prompt.

import requests
import base64

# Read the audio file and base64-encode it for the JSON payload.
with open("audio.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

# Chat request: one user turn containing the audio plus a text prompt.
payload = {
    "model": "Step-Audio-R1",
    "stream": True,
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "input_audio", "audio_data": audio_b64, "mime_type": "audio/wav"},
                {"type": "text", "text": "Transcribe this and describe the speaker."}
            ]
        }
    ]
}

# Stream the response and print each raw server-sent-event line as it arrives.
with requests.post(
    "http://localhost:9999/v1/chat/completions",
    json=payload,
    stream=True,
) as r:
    for line in r.iter_lines():
        if line:
            print(line.decode())
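
The stream consists of server-sent-event lines of the form data: {...}. To print only the generated text, the request loop can be replaced with a variant like the one below (reusing the payload from above); the choices[0].delta chunk layout assumes the usual OpenAI-style streaming format:

import json

# Same request as above, but decode each streamed chunk and print only the text.
with requests.post(
    "http://localhost:9999/v1/chat/completions",
    json=payload,
    stream=True,
) as r:
    r.raise_for_status()
    for line in r.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        chunk = line[len(b"data: "):]
        if chunk == b"[DONE]":  # end-of-stream sentinel
            break
        delta = json.loads(chunk)["choices"][0]["delta"]
        print(delta.get("content", ""), end="", flush=True)
print()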

This was tested on an RTX Pro 6000 Blackwell cloud instance.

If there are other models you're interested in seeing quantized to NVFP4 for use on the DGX Spark or other modern Blackwell (or newer) cards, let me know. I'm trying to make more NVFP4 models available so that more people can try them out.
