IrishCore-DiffMask-135M-v1-rc5

IrishCore-DiffMask-135M-v1-rc5 is a raw-only Irish PII masking model derived from OpenMed/OpenMed-PII-mLiteClinical-Base-135M-v1.

It is a small, scanner-free span extractor tuned for:

  • PPSN
  • ACCOUNT_NUMBER
  • BANK_ROUTING_NUMBER
  • CREDIT_DEBIT_CARD
  • PASSPORT_NUMBER
  • POSTCODE
  • PHONE_NUMBER
  • EMAIL
  • FIRST_NAME
  • LAST_NAME
  • SWIFT_BIC

The main target is English plus Irish Gaelic text in citizen-support, public-sector, and HSE-style flows. The repo ships both the full transformers checkpoint and a dynamic q8 ONNX artifact for CPU deployment.

What "DiffMask" Means Here

This release is not a generative diffusion language model. It is a compact discriminative token-span model trained with a diffusion-style denoising schedule.

The short version:

  • Base OpenMed: plain BIO token classification
  • DiffMask: token-span extraction with token-presence and boundary heads
  • DiffMask training: repeated masked denoising over the same sentence
  • DiffMask inference: one forward pass, no iterative refinement, no text generation

Concretely:

  • The encoder starts from the DistilBERT-family weights inside OpenMed/OpenMed-PII-mLiteClinical-Base-135M-v1.
  • The model adds three task heads over the encoder hidden states:
    • a per-label token-presence head
    • a typed start-boundary head
    • a typed end-boundary head
  • During training, each input sentence is corrupted multiple times by replacing a random fraction of visible tokens with [MASK].
  • The corruption level follows a short noise schedule from heavy masking to light masking.
  • The same gold spans are learned at every noise level, and the losses are averaged across the denoising passes.
  • At inference time there is no diffusion loop and no rewrite step: the model runs once and a score-only span decoder reconstructs spans from token scores plus typed boundaries.

So the "DLLM" (diffusion language model) aspect here is purely the training recipe: repeated masked denoising over text, not autoregressive generation.
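The corruption step above can be sketched in a few lines. This is an illustrative reconstruction, not the release training code: the function name, the concrete mask rates, and the schedule length are assumptions.

```python
import random

MASK = "[MASK]"
# Heavy -> light masking, mirroring the short noise schedule described above.
# The actual rates used in training are not published; these are illustrative.
NOISE_SCHEDULE = [0.6, 0.3, 0.1]

def noised_views(tokens, schedule=NOISE_SCHEDULE, seed=0):
    """Return one corrupted copy of `tokens` per noise level.

    Each copy replaces a random fraction of visible tokens with [MASK].
    The same gold spans are supervised on every copy, and the per-copy
    losses are averaged (not shown here).
    """
    rng = random.Random(seed)
    views = []
    for rate in schedule:
        n_mask = max(1, round(rate * len(tokens)))
        masked = set(rng.sample(range(len(tokens)), n_mask))
        views.append([MASK if i in masked else t for i, t in enumerate(tokens)])
    return views

views = noised_views("My PPSN is 1234567T".split())
for rate, view in zip(NOISE_SCHEDULE, views):
    print(rate, view)
```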

What It Is Not

This model is not a full discrete diffusion language model in the LLaDA sense.

A true DLLM would usually have:

  • timestep or noise conditioning inside the model
  • iterative denoising at inference time
  • multi-step sequence refinement at runtime
  • text generation or full-sequence reconstruction as a first-class objective

This release does not do that.

Instead, it uses the diffusion idea only as a training-time robustness trick:

  • corrupt the sentence with [MASK] at several noise levels
  • train on the same target spans each time
  • average those losses

At runtime, it behaves like a normal fast discriminative extractor.

Architecture

  • Encoder: DistilBERT-size encoder from the OpenMed mLiteClinical 135M base
  • Heads:
    • token presence per released label
    • typed start boundary per released label
    • typed end boundary per released label
  • Decoder:
    • score-only span decoding from offsets, token continuity, label-specific thresholds, and typed boundaries
    • email spans do not bridge across whitespace
    • structured labels can bridge over a simple hyphen token when needed
    • name spans containing digits are rejected during decoding
    • PPSN whitespace bridging is restricted to short suffix tokens only
    • structured IBAN and card spans use conservative start/tail rescue logic for fragmented tokenization
    • no regex candidate extractor
    • no checksum validator
    • no scanner layer

The release behavior is fully defined by the weights plus the bundled decoder in common.py.
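A minimal sketch of what a score-only span decoder of this shape does, with the digit-rejection rule for name spans included. Function names, thresholds, and data layout are assumptions; the authoritative implementation is the bundled common.py.

```python
import re

# Illustrative per-label score thresholds; the shipped decoder tunes these.
THRESHOLDS = {"PPSN": 0.5, "FIRST_NAME": 0.5}
NAME_LABELS = {"FIRST_NAME", "LAST_NAME"}

def decode_spans(text, offsets, scores):
    """Merge contiguous above-threshold tokens into typed character spans.

    offsets: (start, end) character offsets, one pair per token.
    scores:  {label: [per-token presence score]}.
    Name spans containing digits are rejected, mirroring one of the
    release decoder rules listed above.
    """
    spans = []
    for label, tok_scores in scores.items():
        thr = THRESHOLDS.get(label, 0.5)
        run_start = None
        for i, sc in enumerate(tok_scores + [0.0]):  # sentinel flushes the last run
            above = i < len(offsets) and sc >= thr
            if above and run_start is None:
                run_start = offsets[i][0]
            elif not above and run_start is not None:
                end = offsets[i - 1][1]
                piece = text[run_start:end]
                if not (label in NAME_LABELS and re.search(r"\d", piece)):
                    spans.append((run_start, end, label, piece))
                run_start = None
    return spans

text = "My PPSN is 1234567T"
offsets = [(0, 2), (3, 7), (8, 10), (11, 19)]
scores = {"PPSN": [0.1, 0.2, 0.1, 0.9],
          "FIRST_NAME": [0.1, 0.1, 0.1, 0.6]}  # digit token: rejected as a name
print(decode_spans(text, offsets, scores))
```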

Training And Inference Flow

Training:

  1. tokenize a sentence with gold BIO spans
  2. convert spans into:
    • token-presence targets
    • typed start targets
    • typed end targets
  3. create several noised copies of the same tokenized sentence by masking random visible tokens
  4. run the same encoder+heads on each noised copy
  5. average the losses across those denoising passes
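Step 2 of the training flow can be sketched as follows. The function name and target layout are illustrative, not the release code.

```python
def span_targets(offsets, gold_spans, label):
    """Convert gold character spans into the three per-label target vectors.

    offsets:    (start, end) character offsets, one pair per token.
    gold_spans: (char_start, char_end, label) triples.
    Returns (token-presence, start-boundary, end-boundary) 0/1 vectors
    for `label`, matching the three task heads described above.
    """
    n = len(offsets)
    presence, starts, ends = [0] * n, [0] * n, [0] * n
    for gs, ge, lab in gold_spans:
        if lab != label:
            continue
        # Tokens whose character range overlaps the gold span.
        toks = [i for i, (s, e) in enumerate(offsets) if s < ge and e > gs]
        for i in toks:
            presence[i] = 1
        if toks:
            starts[toks[0]] = 1
            ends[toks[-1]] = 1
    return presence, starts, ends

offsets = [(0, 2), (3, 7), (8, 10), (11, 19)]  # "My PPSN is 1234567T"
gold = [(11, 19, "PPSN")]
print(span_targets(offsets, gold, "PPSN"))
```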

Inference:

  1. tokenize the raw text once
  2. run a single forward pass
  3. predict:
    • which labels are present on each token
    • where each labeled span starts
    • where each labeled span ends
  4. decode spans with label-aware thresholds and boundary rules
  5. replace the detected spans with placeholders such as [PII:PPSN]

There is no multi-step refinement loop in deployment.
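Step 5 of the inference flow ends in a plain offset-based substitution. A minimal sketch, with a hypothetical helper name:

```python
def mask_text(text, spans):
    """Replace detected (start, end, label) character spans with placeholders.

    Spans are applied right-to-left so earlier offsets stay valid after
    each substitution. A sketch of the placeholder step, not the shipped code.
    """
    out = text
    for start, end, label in sorted(spans, reverse=True):
        out = out[:start] + f"[PII:{label}]" + out[end:]
    return out

text = "My PPSN is 1234567TW and my Eircode is D02 X285."
spans = [(11, 20, "PPSN"), (39, 47, "POSTCODE")]
print(mask_text(text, spans))
```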

How It Differs From The Original OpenMed Model

The original OpenMed/OpenMed-PII-mLiteClinical-Base-135M-v1 is a standard DistilBertForTokenClassification model:

  • one encoder
  • one token-classification head
  • BIO labels such as B-email, I-email, B-phone_number
  • generic token aggregation to recover spans

DiffMask changes two things:

  1. Different supervision

    • base OpenMed learns only BIO token labels
    • DiffMask learns token presence plus typed span boundaries
  2. Different training recipe

    • base OpenMed is trained as a standard token classifier
    • DiffMask is trained on multiple masked-noised views of the same sentence

That makes DiffMask better suited to structured Irish identifiers and mixed PII masking, while still keeping a small encoder and a fast CPU path.

How It Differs From The Earlier rc5 And rc8 Releases

| Model | Core idea | External scanner/validator | Runtime shape |
|---|---|---|---|
| rc5 | token classifier + repair logic | yes | heavier, decoder-assisted |
| rc8 | raw-only token-span model | no | one pass + span decoder |
| DiffMask | raw-only token-span model + denoising training | no | one pass + span decoder |

So DiffMask is closest to rc8 operationally, but it uses a stronger training recipe.

Why This Exists

The older rc5 release (the hybrid OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc5) still depended on a repair-oriented decoder stack. The public rc8 release of that line removed the external logic, but it regressed on several structured Irish identifiers. This release keeps the raw-only deployment shape while re-hardening the model on Irish numeric and mixed-PII cases.

References

Direct implementation references:

Conceptual diffusion-style training references:

These diffusion papers were used as architectural inspiration for the masked noising schedule. This release does not implement a generative text diffusion runtime.

Included Artifacts

  • Full transformers checkpoint in the repo root
  • Dynamic q8 ONNX export in onnx/model_quantized.onnx
  • Unquantized ONNX export in onnx/model.onnx
  • inference_mask.py for the full checkpoint
  • inference_mask_onnx.py for the ONNX q8 path
  • common.py, model.py, and multitask_model.py implementing the release decoder
  • benchmark files in eval/

Artifact sizes:

  • Full checkpoint: 514 MB (model.safetensors)
  • Dynamic q8 ONNX: 393 MB (onnx/model_quantized.onnx)

How To Use It

Full checkpoint:

uv run python inference_mask.py \
  --model temsa/IrishCore-DiffMask-135M-v1-rc5 \
  --min-score 0.5 \
  --text "My PPSN is 1234567TW, my Eircode is D02 X285, and my phone is 087 123 4567." \
  --json

Dynamic q8 ONNX:

uv run python inference_mask_onnx.py \
  --model temsa/IrishCore-DiffMask-135M-v1-rc5 \
  --min-score 0.5 \
  --text "Please provide your passport NN5123456 and call me on 0851234567." \
  --json

Both scripts emit explicit placeholders like [PII:PPSN] in masked_text.

Q8 Comparison

Deployment-relevant comparison on CPU:

| Model | Core F1 | Edge F1 | Finance F1 | Finance-boundary F1 | User PPSN F1 | GA weak PPSN F1 | Multilingual PPSN F1 | Hardening F1 |
|---|---|---|---|---|---|---|---|---|
| rc5 ONNX q8 | 0.9669 | 0.9744 | 0.9362 | 0.8750 | 1.0000 | 1.0000 | 0.9333 | - |
| rc8 ONNX q8 | 0.9737 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 0.9176 | 0.7059 |
| IrishCore-DiffMask-135M-v1-rc5 ONNX q8 | 0.9733 | 0.9500 | 1.0000 | 1.0000 | - | 1.0000 | 0.9379 | 1.0000 |

Additional targeted suites used during rc5 selection:

| Suite | Metric | IrishCore-DiffMask-135M-v1-rc5 |
|---|---|---|
| diffmask_rc3_feedback_exact_v1 | F1, IoU 0.5, ONNX q8 | 1.0000 |
| diffmask_gap_uat_exact_v1 | F1, IoU 0.5, ONNX q8 | 0.9032 |
| diffmask_rc3_feedback_exact_v1 | F1, exact boundary (IoU 1.0, full checkpoint) | 0.9444 |

CPU throughput references:

| Suite | rc5 q8 | rc8 q8 | IrishCore-DiffMask-135M-v1-rc5 q8 |
|---|---|---|---|
| Irish core short-text path | 33.6193 ex/s | 257.3756 ex/s | 249.2250 ex/s |
| Multilingual PPSN short-text path | 35.5561 ex/s | 230.5181 ex/s | 247.4313 ex/s |
| Runtime profile source | 23.8338 ex/s | 179.4708 ex/s | 157.4094 ex/s |

Notes:

  • The rc5 speed references come from its published q8 end-to-end inference stack, which includes its older repair decoder.
  • The rc8 and IrishCore-DiffMask-135M-v1-rc5 numbers use the same raw-only token-span ONNX path.
  • A weight-only q4 ONNX experiment was also tried during development, but it was slower than q8 on this CPU and is not shipped.

Limits

  • This is still a compact model. The hardest remaining errors are multilingual PPSN near-miss cases rather than Irish core numeric formats.
  • The release path is intentionally scanner-free. If you need deterministic validation of individual identifier types, add that in your application layer.
  • If you rely on release behavior, use the bundled inference scripts or import decode_token_presence_segments from common.py.
  • rc5 resolves the post-rc3 QA feedback suite, but it still has known misses on a few longer UAT-style messages:
    • the second phone number in a two-phone support sentence
    • one multiline address block with R93 EC57
    • EPStamp4@enterprise.gov.ie in the longer employment-permit example
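If you do add deterministic validation in your application layer as suggested above, a PPSN check is a small self-contained function. The sketch below implements the commonly documented modulus-23 scheme; treat the details (the second-letter handling in particular) as an assumption to verify against official guidance, not part of this release.

```python
import re

def is_valid_ppsn(ppsn: str) -> bool:
    """Illustrative modulus-23 PPSN checksum (application-layer validation).

    Assumed scheme: seven digits weighted 8..2, an optional second letter
    contributing 9 * its alphabet position (a trailing 'W' contributing 0),
    and a check letter where remainder 0 -> 'W', 1..22 -> 'A'..'V'.
    """
    m = re.fullmatch(r"(\d{7})([A-W])([A-W]?)", ppsn.upper())
    if not m:
        return False
    digits, check, second = m.groups()
    total = sum(int(d) * w for d, w in zip(digits, range(8, 1, -1)))
    if second and second != "W":
        total += 9 * (ord(second) - ord("A") + 1)
    rem = total % 23
    expected = "W" if rem == 0 else chr(ord("A") + rem - 1)
    return check == expected

print(is_valid_ppsn("1234567TW"))
```

Under this scheme the README's example PPSN 1234567TW validates, while a corrupted check letter is rejected.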

License And Attribution

  • Release license: Apache-2.0
  • Base model: OpenMed/OpenMed-PII-mLiteClinical-Base-135M-v1
  • The derivative release remains subject to the attribution terms of the upstream datasets listed in training_sources.json.
  • See NOTICE, training_sources.json, and eval/benchmark_summary.json for provenance and benchmark details.

Portfolio Comparison

Updated: 2026-03-15.

Use this section for the fastest public comparison across the temsa PII masking portfolio.

  • The first core table only includes public checkpoints that ship both comparable q8 accuracy and q8 CPU throughput.
  • The first PPSN table only includes public artifacts that ship comparable PPSN accuracy and CPU throughput.
  • Missing cells in the archive tables mean the older release did not ship that metric in its public bundle.
  • DiffMask rows use the reconciled clean_single_pass harness that matches the deployed runtime.
  • GlobalPointer rows use the public raw-only span-matrix release bundle and its packaged q8 ONNX artifact.
  • The same content is shipped as PORTFOLIO_COMPARISON.md inside each public model repo.

Irish Core PII: Comparable Public Checkpoints

| Repo | Stack | Full Core F1 | Q8 Core F1 | Q8 Multilingual PPSN F1 | Q8 Core ex/s |
|---|---|---|---|---|---|
| temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc23 | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 237.6 |
| temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc22 | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 106.8 |
| temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc21 | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 150.8 |
| temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc20 | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 181.9 |
| temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc19 | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 73.1 |
| temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc18 | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 126.2 |
| temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc17 | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 125.5 |
| temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc16 | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 125.5 |
| temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc15 | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 125.5 |
| temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc14 | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 119.2 |
| temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc13 | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 126.1 |
| temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc12 | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 73.6 |
| temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc11 | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 94.1 |
| temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc10 | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 125.8 |
| temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc9 | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 119.8 |
| temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc8 | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 128.9 |
| temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc7 | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 89.0 |
| temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc6 | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 89.0 |
| temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc5 | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 84.5 |
| temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc4 | GlobalPointer raw-only + context labels | 0.9935 | 0.9935 | 0.9333 | 61.5 |
| temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc3 | GlobalPointer raw-only + context labels | 0.9935 | 0.9935 | 0.9333 | 61.5 |
| temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc2 | GlobalPointer raw-only + context labels | 0.9935 | 0.9935 | 0.9222 | 61.5 |
| temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc1 | GlobalPointer raw-only + context labels | 0.9935 | 0.9935 | 0.9222 | 61.5 |
| temsa/IrishCore-GlobalPointer-135M-v1-rc4 | GlobalPointer raw-only span-matrix | 1.0000 | 1.0000 | 0.9333 | 221.6 |
| temsa/IrishCore-GlobalPointer-135M-v1-rc3 | GlobalPointer raw-only span-matrix | 1.0000 | 1.0000 | 0.9213 | 204.9 |
| temsa/IrishCore-GlobalPointer-135M-v1-rc2 | GlobalPointer raw-only span-matrix | 0.9934 | 0.9934 | 0.9326 | 231.2 |
| temsa/OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc8 | Raw-only token-span | 0.9737 | 0.9737 | 0.9176 | 46.1 |
| temsa/OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc7 | Hybrid classifier + generated scanner spec | 1.0000 | 0.9934 | 1.0000 | 30.0 |
| temsa/OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc6 | Hybrid classifier + repair decoders | 1.0000 | 0.9934 | 1.0000 | 29.5 |
| temsa/OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc5 | Hybrid classifier + repair decoders | 0.9737 | 0.9669 | 0.9333 | 34.4 |
| temsa/OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc4 | Hybrid classifier + repair decoders | 0.9870 | 0.9740 | 0.9600 | 114.2 |
| temsa/OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc3 | Hybrid classifier + repair decoders | 0.9806 | 0.9677 | 0.9333 | 44.9 |
| temsa/OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc2 | Hybrid classifier + repair decoders | 0.9554 | 0.9615 | 0.7887 | 119.1 |
| temsa/OpenMed-mLiteClinical-IrishCorePII-135M-v1 | Hybrid classifier baseline | 0.9530 | 0.9333 | 0.9882 | 103.3 |
| temsa/IrishCore-DiffMask-135M-v1-rc6 | DiffMask token-span, scanner-free | 0.9801 | 0.9733 | 0.9274 | 130.3 |
| temsa/IrishCore-DiffMask-135M-v1-rc5 | DiffMask token-span, scanner-free | 0.9733 | 0.9733 | 0.9379 | 249.2 |
| temsa/IrishCore-DiffMask-135M-v1-rc4 | DiffMask token-span, scanner-free | 0.9733 | 0.9733 | 0.9371 | 29.5 |
| temsa/IrishCore-DiffMask-135M-v1-rc3 | DiffMask token-span, scanner-free | 0.9664 | 0.9664 | 0.9591 | 30.0 |
| temsa/IrishCore-DiffMask-135M-v1-rc2 | DiffMask token-span, scanner-free | 0.9664 | 0.9664 | 0.9212 | 247.1 |
| temsa/IrishCore-DiffMask-135M-v1-rc1 | DiffMask token-span, scanner-free | 0.9801 | 0.9934 | 0.9412 | 251.2 |

Irish Core PII: Other Public Checkpoints

| Repo | Stack | Full Core F1 | Q8 Core F1 | Q8 Multilingual PPSN F1 | Notes |
|---|---|---|---|---|---|
| temsa/OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc1 | Hybrid classifier prototype | 0.9487 | - | - | Predates the public q8 artifact. |

Finance-boundary q8 F1 is 1.0000 for OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc6, OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc7, OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc8, and all public IrishCore-DiffMask releases from rc1 to rc6. OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc5 ships 0.8750 on that public q8 suite.

PPSN-Only: Comparable Public Artifacts

| Repo | Artifact | Irish Large F1 | Multilingual PPSN F1 | User Raw F1 | QA v8 F1 | CPU ex/s |
|---|---|---|---|---|---|---|
| temsa/OpenMed-mLiteClinical-IrishPPSN-135M-v1 | fp32 canonical checkpoint | 0.8979 | 0.9704 | 0.8000 | 0.7385 | 57.4 |
| temsa/OpenMed-mLiteClinical-IrishPPSN-135M-v1-fp16 | fp16 CPU/GPU artifact | - | 0.9704 | 0.8000 | 0.7385 | 45.8 |
| temsa/OpenMed-mLiteClinical-IrishPPSN-135M-v1-q8 | dynamic int8 CPU artifact | - | 0.9040 | - | - | 132.1 |

PPSN-Only: Historical Public Checkpoints

| Repo | Main Published Metrics | Notes |
|---|---|---|
| temsa/OpenMed-PPSN-mLiteClinical-v1 | same as canonical fp32 repo: multilingual 0.9704, user raw 0.8000 | Legacy alias; prefer temsa/OpenMed-mLiteClinical-IrishPPSN-135M-v1. |
| temsa/OpenMed-PPSN-v6-raw-rc2 | irish_reg_v5 0.8750; user_raw 0.8000; qa_v8 0.7385 | Raw PPSN-only research checkpoint; no packaged multilingual CPU benchmark row. |
| temsa/OpenMed-PPSN-v5_1 | irish_large_v2 raw 0.9285; qa_v6 hybrid strict 1.0000 | Hybrid PPSN-only checkpoint; predates the canonical multilingual suite packaging. |
| temsa/OpenMed-PPSN-v5 | irish_reg_v5 raw 0.8235; irish_reg_v5 hybrid strict 1.0000 | Hybrid PPSN-only checkpoint; predates the canonical multilingual suite packaging. |
| temsa/OpenMed-PPSN-v4 | synthetic non-PPSN drift check only | Predates the current PPSN eval suite; no packaged apples-to-apples multilingual CPU row. |

If you need the strongest current raw-only Irish core model, start with IrishCore-GlobalPointer-135M-v1-rc4. If you need the fastest CPU-first raw-only line, compare it against IrishCore-DiffMask-135M-v1-rc6. If you need a PPSN-only artifact, compare the canonical fp32, fp16, and q8 variants of OpenMed-mLiteClinical-IrishPPSN-135M-v1 directly in the table above.
