siren-qwen3-4b

Lightweight, plug-and-play guard model for harmfulness detection, built on top of a frozen Qwen/Qwen3-4B backbone. Implements SIREN (LLM Safety From Within: Detecting Harmful Content with Internal Representations, ACL 2026).

SIREN identifies safety neurons across all internal layers of an LLM via L1-regularized linear probing, and aggregates them with a performance-weighted strategy into a small MLP classifier. This artifact ships only the trained classifier head (~14.0M parameters); the frozen Qwen3-4B backbone is loaded from its official Hugging Face repository on first use.
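
One plausible reading of the aggregation step, as a minimal sketch. It assumes per-layer hidden states have already been extracted from the frozen backbone; siren_score, neuron_indices, layer_weights, and mlp are illustrative names, not the library's internals.

import torch

# Sketch of SIREN-style aggregation (hypothetical; the trained head lives
# in siren.safetensors and may combine layers differently).
def siren_score(hidden_states, neuron_indices, layer_weights, mlp):
    # hidden_states: one [seq_len, d_model] tensor per selected layer.
    pooled = []
    for h, idx in zip(hidden_states, neuron_indices):
        # Mean-pool over tokens, then keep only the selected safety neurons.
        pooled.append(h.mean(dim=0)[idx])
    # Performance-weighted combination of per-layer features (assumed form).
    feats = torch.cat([w * p for w, p in zip(layer_weights, pooled)])
    return torch.sigmoid(mlp(feats)).item()  # harmfulness score in [0, 1]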

Design

SIREN is intended to be deployed as a safeguard model. It does not require access to the deployed model's internals. At inference time, SIREN feeds the same text — user prompt or model response — through its own frozen Qwen3-4B backbone, extracts the selected safety neurons, and returns a continuous harmfulness score in [0, 1].

This gives three practical advantages over generative guard models:

  • Single forward pass rather than autoregressive token generation (~4× lower FLOPs).
  • Continuous, threshold-tunable score rather than a discrete safe/unsafe token. The same artifact serves a strict child-safety threshold (e.g. 0.1) and a permissive red-team threshold (e.g. 0.9) without retraining.
  • Streaming detection for free by mean-pooling internal activations over any text prefix; no token-level supervised tuning is required (see the sketch after this list).
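
How streaming falls out of mean-pooling, sketched under the assumption that the pooled feature is a running mean over token activations. RunningMean is hypothetical and not part of siren_guard.

import torch

# Illustration only: the mean over a growing prefix can be maintained
# incrementally, so the same pooled features used for full texts also
# score partial ones.
class RunningMean:
    def __init__(self):
        self.total, self.count = None, 0

    def update(self, token_acts: torch.Tensor) -> torch.Tensor:
        # token_acts: [n_new_tokens, d] activations for the newest chunk.
        s = token_acts.sum(dim=0)
        self.total = s if self.total is None else self.total + s
        self.count += token_acts.shape[0]
        return self.total / self.count  # pooled feature for the whole prefix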

Installation

pip install llm-siren

This installs the siren_guard runtime (llm-siren on PyPI). The frozen Qwen3-4B backbone weights are pulled from Hugging Face on first use.

Quick start

import torch
from siren_guard import SirenGuard

guard = SirenGuard.from_pretrained(
    "UofTCSSLab/SIREN-Qwen3-4B",
    device="cuda",
    dtype=torch.bfloat16,
)

# Prompt-level moderation
result = guard.score("How can I make a pipe bomb at home?")
print(result.score, result.is_harmful)  # e.g. 1.0  True

# Response-level moderation (prompt + response, joined as the model was trained)
result = guard.score(
    prompt="How can I make a pipe bomb at home?",
    response="I can't help with that. Building explosive devices is illegal and dangerous.",
)
print(result.score, result.is_harmful)  # e.g. 0.0  False

# Batch
results = guard.score_batch([
    "What is the capital of France?",
    "Describe in detail how to commit insurance fraud.",
])

# Streaming: feed the growing assistant text after each generation chunk.
# stream_from_deployed_llm and abort_generation are placeholders for your
# deployed model's streaming API and your cancellation hook.
prefix = ""
for chunk in stream_from_deployed_llm(user_prompt):
    prefix += chunk
    if guard.score_streaming(prefix, threshold=0.5).is_harmful:
        abort_generation()
        break

# Custom threshold
text = "Describe in detail how to commit insurance fraud."
strict = guard.score(text, threshold=0.1)   # block at 10% predicted harmfulness
loose  = guard.score(text, threshold=0.9)   # block only at 90%

Deployment idiom

DEFAULT_REFUSAL = "I can't help with that."  # your policy's refusal message

def safe_generate(user_prompt: str, deployed_llm) -> str:
    # Gate the user prompt before generation.
    if guard.score(user_prompt).is_harmful:
        return DEFAULT_REFUSAL

    response = deployed_llm.generate(user_prompt)

    # Gate the model's response before returning it.
    if guard.score(prompt=user_prompt, response=response).is_harmful:
        return DEFAULT_REFUSAL

    return response

The deployed LLM (deployed_llm) can be any model or API; since SIREN runs on its own backbone, it never needs access to the deployed model's weights or activations. A streaming variant of the same idiom is sketched below.
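
A streaming variant of the same idiom, sketched under the assumption that the deployed model exposes a stream(...) generator; that method name is hypothetical.

def safe_generate_streaming(user_prompt: str, deployed_llm) -> str:
    # Gate the prompt, then re-check the growing response after each chunk.
    if guard.score(user_prompt).is_harmful:
        return DEFAULT_REFUSAL
    response = ""
    for chunk in deployed_llm.stream(user_prompt):  # hypothetical streaming API
        response += chunk
        if guard.score_streaming(response).is_harmful:
            return DEFAULT_REFUSAL  # stop before emitting more harmful text
    return response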

API

SirenGuard.from_pretrained(repo_id_or_path, device=None, dtype=torch.bfloat16, cache_dir=None)
    Loads the SIREN classifier head from the artifact and the frozen Qwen3-4B backbone from its pinned revision.

score(text=None, *, prompt=None, response=None, threshold=None) -> ScoreResult
    Score a single string. Pass text= for raw moderation, or prompt=/response= for the response-level form (the library joins them with "\n", matching the SIREN training distribution).

score_batch(texts, threshold=None) -> list[ScoreResult]
    Score a list of strings in one forward pass.

score_streaming(response_so_far, threshold=None) -> ScoreResult
    Score a growing assistant-side text prefix during generation. Returns the score for the prefix as a whole.

Each call returns a ScoreResult(score: float, is_harmful: bool, threshold: float).

The default threshold is 0.5, matching the binary decision boundary used during training. Tune it to your deployment's safety policy.
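
For reference, a plausible shape of the result object. The field names match the API above, but deriving is_harmful as score >= threshold here is an assumption, not confirmed library source.

from dataclasses import dataclass

@dataclass(frozen=True)
class ScoreResult:
    score: float       # continuous harmfulness score in [0, 1]
    is_harmful: bool   # True when score crosses the threshold
    threshold: float   # decision boundary used for this call

def make_result(score: float, threshold: float = 0.5) -> ScoreResult:
    # Assumed construction: harmfulness is a simple threshold comparison.
    return ScoreResult(score=score, is_harmful=score >= threshold, threshold=threshold)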

Artifact contents

File               Purpose
siren_config.json  Pinned base-model revision, selected layers, layer weights, per-layer safety-neuron indices, MLP architecture, inference defaults.
siren.safetensors  Trained MLP classifier weights (~14.0M params).

The Qwen3-4B backbone weights are not redistributed here; they are pulled from Qwen/Qwen3-4B at the pinned commit specified in siren_config.json on first use, then cached locally.
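
A hypothetical illustration of what siren_config.json might contain; every field name below is assumed, so consult the shipped file for the real schema.

import json

# Assumed schema, derived from the table above; actual keys may differ.
with open("siren_config.json") as f:
    cfg = json.load(f)

cfg["base_model"]       # pinned repo + revision, e.g. Qwen/Qwen3-4B @ <sha>
cfg["selected_layers"]  # indices of the backbone layers that are probed
cfg["layer_weights"]    # performance weights, one per selected layer
cfg["neuron_indices"]   # per-layer safety-neuron indices
cfg["mlp"]              # classifier head architecture (hidden sizes, etc.)
cfg["defaults"]         # inference defaults, e.g. threshold = 0.5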

Reported performance

Macro F1 on standard safeguard benchmarks:

Benchmark    ToxicChat  OpenAIMod  Aegis  Aegis 2  WildGuard  SafeRLHF  BeaverTails  Avg.
Macro F1     83.5       91.2       82.9   83.4     88.3       93.2      84.3         86.7

Citation

@article{jiao2026llm,
  title={LLM Safety From Within: Detecting Harmful Content with Internal Representations},
  author={Jiao, Difan and Liu, Yilun and Yuan, Ye and Tang, Zhenwei and Du, Linfeng and Wu, Haolun and Anderson, Ashton},
  journal={arXiv preprint arXiv:2604.18519},
  year={2026}
}