Qwen3 0.6B PII Detector

This model is a fine-tuned version of Qwen/Qwen3-0.6B for detecting Personally Identifiable Information (PII) and Protected Health Information (PHI) in text using an inline tagging format.

Model Description

  • Base Model: Qwen/Qwen3-0.6B
  • Training Data: nvidia/Nemotron-PII (47,500 training samples, 2,375 evaluation samples)
  • Training Method: LoRA fine-tuning with Unsloth
  • Task: Named Entity Recognition (NER) for PII/PHI detection
  • Output Format: text_tagged (inline tags: [entity]label)
  • Context-Aware: Trained with domain, document type, and locale information

Key Features

  • Inline Tag Format: Outputs text with entities tagged as [entity]label
  • Domain-Aware: Optionally accepts domain context (finance, healthcare, legal, etc.) for improved accuracy
  • Locale Support: Handles both US and international text
  • Natural Language: Works on conversations, documents, forms, and unstructured text
  • 55+ Entity Types: Comprehensive PII/PHI coverage across 50+ industries

Performance

  • Epochs Completed: 2.096
  • Total Training Steps: 6,224
  • Initial Training Loss: 1.8435
  • Final Training Loss: 0.4155
  • Initial Evaluation Loss: 1.9682
  • Final Evaluation Loss: 0.4572
  • Best Evaluation Loss: 0.4551

Training Configuration

  • Epochs Planned: 5
  • Epochs Completed: 2.096
  • Batch Size: 4
  • Gradient Accumulation: 4
  • Effective Batch Size: 16
  • Learning Rate: 0.0002
  • LR Scheduler: cosine
  • Warmup Ratio: 0.1
  • Weight Decay: 0.01
  • Max Sequence Length: 2048
  • LoRA r: 16
  • LoRA alpha: 16
  • GPU: Nvidia L4
  • Evaluation Strategy: Every 100 steps
  • Output Format: text_tagged with domain context
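
For reference, here is a minimal sketch of how the LoRA settings above map onto a PEFT LoraConfig (illustrative only; the actual run used Unsloth's adapter wrapper, and the target modules are not listed in this card):

from peft import LoraConfig

# Hypothetical PEFT equivalent of the LoRA hyperparameters above
lora_config = LoraConfig(
    r=16,            # LoRA rank
    lora_alpha=16,   # LoRA scaling factor
    task_type="CAUSAL_LM",
)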

Framework Versions

  • Transformers: 4.56.2
  • PyTorch: 2.x
  • Unsloth: Latest
  • Datasets: 3.6.0
  • TRL: Latest

Usage

Installation

pip install transformers torch

Basic Usage (Without Domain/Locale)

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model_name = "naazimsnh02/qwen3-0.6b-pii-detector"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Prepare input
text = "My name is John Smith and my SSN is 123-45-6789. Contact me at [email protected]"

messages = [
    {
        "role": "user",
        "content": f"""Analyze the following text and identify all PII (Personally Identifiable Information) and PHI (Protected Health Information) entities.

Text: {text}

Provide the output with inline tags in the format: [entity]label"""
    }
]

# Generate
input_ids = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
)

outputs = model.generate(
    input_ids,
    max_new_tokens=512,
    do_sample=True,   # required for temperature/top_p to take effect
    temperature=0.1,  # low temperature for consistent tagging
    top_p=0.9,
)

# Decode only the newly generated tokens (outputs[0] also contains the prompt)
response = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
# Output: My name is [John Smith]name and my SSN is [123-45-6789]ssn. Contact me at [[email protected]]email

Enhanced Usage (With Domain/Locale Context)

For better accuracy, provide domain and locale information:

# Prepare input with context
text = "Patient John Doe, MRN 12345, diagnosed with hypertension."
domain = "healthcare"
doc_type = "medical_record"
locale = "us"

messages = [
    {
        "role": "user",
        "content": f"""Analyze the following {doc_type} from the {domain} domain ({locale.upper()} locale) and identify all PII (Personally Identifiable Information) and PHI (Protected Health Information) entities.

Text: {text}

Provide the output with inline tags in the format: [entity]label"""
    }
]

# Generate as above
input_ids = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
)
outputs = model.generate(input_ids, max_new_tokens=512, do_sample=True, temperature=0.1)
response = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
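
If you switch between the plain and context-aware prompt styles often, a small helper keeps them consistent (build_prompt is a hypothetical convenience function, not part of the model's API):

def build_prompt(text, domain=None, doc_type=None, locale=None):
    """Build the detection prompt, optionally with domain/document-type/locale context."""
    if domain and doc_type and locale:
        header = (f"Analyze the following {doc_type} from the {domain} domain "
                  f"({locale.upper()} locale) and identify all PII (Personally "
                  f"Identifiable Information) and PHI (Protected Health Information) entities.")
    else:
        header = ("Analyze the following text and identify all PII (Personally "
                  "Identifiable Information) and PHI (Protected Health Information) entities.")
    content = f"{header}\n\nText: {text}\n\nProvide the output with inline tags in the format: [entity]label"
    return [{"role": "user", "content": content}]

messages = build_prompt("Patient John Doe, MRN 12345.", domain="healthcare",
                        doc_type="medical_record", locale="us")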

Extracting Entities from Model Output

Use this utility class to extract structured data from the model's tagged output:

import re
from typing import Any, Dict, List

class PIIExtractor:
    """
    Utility class to extract PII entities from text tagged by the model.
    
    The model outputs text in the format: [entity]label
    Example: "[John Smith]first_name works at [Acme Corp]organization"
    """
    
    def __init__(self):
        # Pattern to match [entity]label format
        self.pattern = re.compile(r'\[([^\]]+)\](\w+)')
    
    def extract_entities(self, tagged_text: str) -> List[Dict[str, str]]:
        """
        Extract all PII entities from tagged text.
        
        Returns:
            List of dictionaries with 'text' and 'label' keys
        """
        matches = self.pattern.findall(tagged_text)
        return [{"text": text, "label": label} for text, label in matches]
    
    def extract_with_positions(self, tagged_text: str) -> List[Dict[str, Any]]:
        """Extract entities with their positions in the original text."""
        entities = []
        offset = 0
        
        for match in self.pattern.finditer(tagged_text):
            entity_text = match.group(1)
            label = match.group(2)
            
            start = match.start() - offset
            end = start + len(entity_text)
            
            entities.append({
                "text": entity_text,
                "label": label,
                "start": start,
                "end": end
            })
            
            # Update offset: '[' and ']' (2 chars) plus the label text were removed
            offset += len(label) + 2
        
        return entities
    
    def get_clean_text(self, tagged_text: str) -> str:
        """Remove all tags from text, leaving only the original content."""
        # Replace each [entity]label tag with just the entity text, so stray
        # brackets elsewhere in the text are left untouched
        return self.pattern.sub(r'\1', tagged_text)
    
    def group_by_label(self, tagged_text: str) -> Dict[str, List[str]]:
        """Group extracted entities by their label type."""
        entities = self.extract_entities(tagged_text)
        grouped = {}
        for entity in entities:
            label = entity['label']
            if label not in grouped:
                grouped[label] = []
            grouped[label].append(entity['text'])
        return grouped

# Example usage
extractor = PIIExtractor()
tagged_output = "[John Smith]first_name lives in [New York]city"

entities = extractor.extract_entities(tagged_output)
print(entities)
# [{'text': 'John Smith', 'label': 'first_name'}, {'text': 'New York', 'label': 'city'}]

clean_text = extractor.get_clean_text(tagged_output)
print(clean_text)
# "John Smith lives in New York"

grouped = extractor.group_by_label(tagged_output)
print(grouped)
# {'first_name': ['John Smith'], 'city': ['New York']}
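
Continuing the example above, the offsets from extract_with_positions refer to the clean (untagged) text, so each recorded span slices back out exactly:

positions = extractor.extract_with_positions(tagged_output)
clean = extractor.get_clean_text(tagged_output)
for ent in positions:
    # Each span should match the entity text in the clean string
    assert clean[ent["start"]:ent["end"]] == ent["text"]
print(positions)
# [{'text': 'John Smith', 'label': 'first_name', 'start': 0, 'end': 10},
#  {'text': 'New York', 'label': 'city', 'start': 20, 'end': 28}]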

Output Format

The model outputs text with inline tags in the format: [entity]label

Example:

Input:

I am applying for student financial aid. My name is Peggy and my SSN is 250-38-8116.

Output:

I am applying for student financial aid. My name is [Peggy]first_name and my SSN is [250-38-8116]ssn.
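
Since every tag follows the same [entity]label pattern, a minimal redaction pass can swap each entity for a label placeholder (a sketch; adapt the placeholder format to your pipeline):

import re

TAG_PATTERN = re.compile(r'\[([^\]]+)\](\w+)')

def redact(tagged_text: str) -> str:
    """Replace each tagged entity with an uppercase [LABEL] placeholder."""
    return TAG_PATTERN.sub(lambda m: f"[{m.group(2).upper()}]", tagged_text)

print(redact("My name is [Peggy]first_name and my SSN is [250-38-8116]ssn."))
# My name is [FIRST_NAME] and my SSN is [SSN].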

Supported Entity Types (55+ categories)

The model can detect various types of sensitive information including:

Personal Identifiers

  • first_name, last_name, name, maiden_name, middle_name
  • ssn (Social Security Number)
  • date_of_birth, age
  • gender, race_ethnicity, religious_belief
  • driver_license, passport, national_id

Contact Information

  • email, phone_number, fax_number
  • street_address, address, city, state, county, zip_code, country
  • po_box

Medical/Health Information (PHI)

  • medical_record_number (mrn), patient_id
  • blood_type, diagnosis, medication, procedure
  • health_plan_id, insurance_number, policy_number
  • hospital_name, doctor_name

Financial Information

  • credit_card_number, bank_account_number
  • routing_number, account_number, iban
  • tax_id, employer_id (ein)
  • salary, income

Professional/Organizational

  • organization, company_name, employer
  • job_title, employee_id
  • username, user_id

Digital Identifiers

  • ip_address, mac_address
  • url, domain_name
  • device_id, imei

Educational

  • student_id, school_name, university
  • degree, gpa

Legal

  • case_number, court_name
  • license_plate, vin

And 20+ more categories...
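
For downstream policy checks, labels like these can be bucketed into coarse categories; for example (the grouping below is illustrative, drawn from the lists above, not an official taxonomy):

# Illustrative PHI bucket built from a few of the labels listed above
PHI_LABELS = {"medical_record_number", "patient_id", "diagnosis",
              "medication", "procedure", "health_plan_id"}

def contains_phi(entities):
    """True if any extracted entity carries a PHI label."""
    return any(e["label"] in PHI_LABELS for e in entities)

print(contains_phi([{"text": "12345", "label": "patient_id"}]))  # True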

Domain Coverage (50+ industries)

Trained on diverse domains including:

  • Healthcare & Medical
  • Finance & Banking
  • Legal & Law
  • Education & Academia
  • Government & Public Sector
  • Cybersecurity & IT
  • Human Resources
  • Insurance
  • Real Estate
  • Retail & E-commerce
  • And more...

Training Details

This model was trained using:

  • Unsloth for efficient training
  • TRL for supervised fine-tuning
  • Modal infrastructure with Nvidia L4 GPU
  • Evaluation during training with 5% validation split
  • Context-aware training with domain, document type, and locale information

Limitations

  • Context Dependency: Accuracy may vary based on context clarity
  • Domain Specificity: Works best when domain context is provided
  • Ambiguous Entities: May struggle with ambiguous text (e.g., is "Apple" a company or fruit?)
  • Language: Primarily trained on English text
  • Novel Entities: May not detect entity types not present in training data
  • Format Sensitivity: Best results with natural language; may vary with heavily formatted text

Ethical Considerations

⚠️ Important: This model is designed to DETECT PII/PHI, not to protect it.

  • Use Responsibly: Ensure compliance with privacy regulations (GDPR, HIPAA, CCPA, etc.)
  • Validation Required: Always validate model outputs in production environments
  • Not 100% Accurate: May miss some PII or incorrectly tag non-PII
  • Redaction: Use detected entities to redact/anonymize sensitive information
  • Audit Trail: Maintain logs of PII detection for compliance
  • Human Review: Critical applications should include human oversight

Intended Use

Recommended Use Cases:

  • PII detection in documents before sharing
  • Automated redaction pipelines
  • Compliance monitoring
  • Data anonymization workflows
  • Privacy-preserving data analysis
  • Document sanitization
  • GDPR/HIPAA compliance tools

Not Recommended:

  • As the sole PII detection method in high-stakes scenarios
  • Without human review in sensitive applications
  • For languages other than English
  • Real-time critical systems without validation

License

Apache 2.0

Citation

If you use this model, please cite:

@misc{qwen3-pii-detector,
  author = {Syed Naazim Hussain},
  title = {Qwen3 0.6B PII Detector},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/naazimsnh02/qwen3-0.6b-pii-detector}
}

Acknowledgments

  • Base Model: Qwen Team for Qwen3-0.6B
  • Dataset: NVIDIA for Nemotron-PII dataset
  • Training: Fine-tuned using Unsloth
  • Infrastructure: Trained on Modal with Nvidia L4 GPU