IrishCore-DiffMask-135M-v1-rc3

IrishCore-DiffMask-135M-v1-rc3 is a raw-only Irish PII masking model derived from OpenMed/OpenMed-PII-mLiteClinical-Base-135M-v1.

It is a small, scanner-free span extractor tuned for:

  • PPSN
  • ACCOUNT_NUMBER
  • BANK_ROUTING_NUMBER
  • CREDIT_DEBIT_CARD
  • PASSPORT_NUMBER
  • POSTCODE
  • PHONE_NUMBER
  • EMAIL
  • FIRST_NAME
  • LAST_NAME
  • SWIFT_BIC

The main target is English plus Irish Gaelic text in citizen-support, public-sector, and HSE-style flows. The repo ships both the full transformers checkpoint and a dynamic q8 ONNX artifact for CPU deployment.

What "DiffMask" Means Here

This release is not a generative diffusion language model. It is a compact discriminative token-span model trained with a diffusion-style denoising schedule.

The short version:

  • Base OpenMed: plain BIO token classification
  • DiffMask: token-span extraction with token-presence and boundary heads
  • DiffMask training: repeated masked denoising over the same sentence
  • DiffMask inference: one forward pass, no iterative refinement, no text generation

Concretely:

  • The encoder starts from the DistilBERT-family weights inside OpenMed/OpenMed-PII-mLiteClinical-Base-135M-v1.
  • The model adds three task heads over the encoder hidden states:
    • a per-label token-presence head
    • a typed start-boundary head
    • a typed end-boundary head
  • During training, each input sentence is corrupted multiple times by replacing a random fraction of visible tokens with [MASK].
  • The corruption level follows a short noise schedule from heavy masking to light masking.
  • The same gold spans are learned at every noise level, and the losses are averaged across the denoising passes.
  • At inference time there is no diffusion loop and no rewrite step: the model runs once and a score-only span decoder reconstructs spans from token scores plus typed boundaries.

So the "DLLM" aspect here is the training recipe: repeated masked denoising over text, not autoregressive generation.
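The corruption step of that recipe can be sketched as follows. This is an illustrative reconstruction, not the release training code: the function name, the `mask_ratios` schedule, and the seeding are assumptions; only the idea (several masked copies of one sentence, heavy to light, sharing the same gold spans) comes from the description above.

```python
import random

def noised_copies(token_ids, mask_token_id, mask_ratios=(0.5, 0.3, 0.1), seed=0):
    """Create one corrupted copy of a tokenized sentence per noise level.

    Each copy replaces a random fraction of visible tokens with [MASK],
    going from heavy to light masking. The gold spans are identical for
    every copy, so the per-copy losses can simply be averaged.
    """
    rng = random.Random(seed)
    copies = []
    for ratio in mask_ratios:
        copy = list(token_ids)
        n_mask = max(1, int(len(copy) * ratio))
        for i in rng.sample(range(len(copy)), n_mask):
            copy[i] = mask_token_id
        copies.append(copy)
    return copies
```

In training, each copy would be pushed through the same encoder and heads, with the losses averaged across the denoising passes.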

What It Is Not

This model is not a full discrete diffusion language model in the LLaDA sense.

A true DLLM would usually have:

  • timestep or noise conditioning inside the model
  • iterative denoising at inference time
  • multi-step sequence refinement at runtime
  • text generation or full-sequence reconstruction as a first-class objective

This release does not do that.

Instead, it uses the diffusion idea only as a training-time robustness trick:

  • corrupt the sentence with [MASK] at several noise levels
  • train on the same target spans each time
  • average those losses

At runtime, it behaves like a normal fast discriminative extractor.

Architecture

  • Encoder: DistilBERT-size encoder from the OpenMed mLiteClinical 135M base
  • Heads:
    • token presence per released label
    • typed start boundary per released label
    • typed end boundary per released label
  • Decoder:
    • score-only span decoding from offsets, token continuity, label-specific thresholds, and typed boundaries
    • no regex candidate extractor
    • no checksum validator
    • no scanner layer

The release behavior is fully defined by the weights plus the bundled decoder in common.py.
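The score-only decoding idea can be sketched like this for a single label. The real rules (label-specific thresholds, offset handling, boundary tie-breaking) live in the bundled common.py and differ from this toy version; the function below is only an assumption-laden illustration of "contiguous above-threshold presence run, snapped to the best typed boundaries".

```python
def decode_spans(presence, starts, ends, threshold=0.5):
    """Decode (start, end) token spans for one label from per-token scores.

    presence/starts/ends: per-token scores in [0, 1] for a single label.
    A span is a maximal run of tokens whose presence score clears the
    threshold; its boundaries snap to the highest start/end scores inside
    the run. No regex, checksum, or scanner logic is involved.
    """
    spans, i, n = [], 0, len(presence)
    while i < n:
        if presence[i] < threshold:
            i += 1
            continue
        j = i
        while j + 1 < n and presence[j + 1] >= threshold:
            j += 1
        best_start = max(range(i, j + 1), key=lambda t: starts[t])
        best_end = max(range(i, j + 1), key=lambda t: ends[t])
        if best_start <= best_end:
            spans.append((best_start, best_end))
        i = j + 1
    return spans
```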

Training And Inference Flow

Training:

  1. tokenize a sentence with gold BIO spans
  2. convert spans into:
    • token-presence targets
    • typed start targets
    • typed end targets
  3. create several noised copies of the same tokenized sentence by masking random visible tokens
  4. run the same encoder+heads on each noised copy
  5. average the losses across those denoising passes
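Step 2 can be sketched as follows, assuming gold spans arrive as (start_token, end_token, label) triples with inclusive ends; the target layout is illustrative, not the exact release encoding.

```python
def span_targets(num_tokens, labels, gold_spans):
    """Convert gold spans into token-presence, typed-start, and typed-end targets.

    gold_spans: iterable of (start_tok, end_tok, label), end inclusive.
    Returns three {label: [0/1 per token]} dicts, one per task head.
    """
    presence = {lab: [0] * num_tokens for lab in labels}
    starts = {lab: [0] * num_tokens for lab in labels}
    ends = {lab: [0] * num_tokens for lab in labels}
    for s, e, lab in gold_spans:
        for t in range(s, e + 1):
            presence[lab][t] = 1
        starts[lab][s] = 1
        ends[lab][e] = 1
    return presence, starts, ends
```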

Inference:

  1. tokenize the raw text once
  2. run a single forward pass
  3. predict:
    • which labels are present on each token
    • where each labeled span starts
    • where each labeled span ends
  4. decode spans with label-aware thresholds and boundary rules
  5. replace the detected spans with placeholders such as [PII:PPSN]

There is no multi-step refinement loop in deployment.
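Step 5 of the inference flow can be sketched as a right-to-left replacement over character offsets, so that earlier offsets stay valid as the text shrinks or grows. The placeholder format matches the release output; the function itself is an illustration, not the bundled script.

```python
def mask_text(text, spans):
    """Replace detected spans with [PII:LABEL] placeholders.

    spans: iterable of (char_start, char_end, label), end exclusive.
    Replacing from right to left keeps earlier offsets valid.
    """
    out = text
    for start, end, label in sorted(spans, reverse=True):
        out = out[:start] + f"[PII:{label}]" + out[end:]
    return out
```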

How It Differs From The Original OpenMed Model

The original OpenMed/OpenMed-PII-mLiteClinical-Base-135M-v1 is a standard DistilBertForTokenClassification model:

  • one encoder
  • one token-classification head
  • BIO labels such as B-email, I-email, B-phone_number
  • generic token aggregation to recover spans

DiffMask changes two things:

  1. Different supervision

    • base OpenMed learns only BIO token labels
    • DiffMask learns token presence plus typed span boundaries
  2. Different training recipe

    • base OpenMed is trained as a standard token classifier
    • DiffMask is trained on multiple masked-noised views of the same sentence

That makes DiffMask better suited to structured Irish identifiers and mixed PII masking, while still keeping a small encoder and a fast CPU path.

How It Differs From rc5 And rc8

| Model | Core idea | External scanner/validator | Runtime shape |
|---|---|---|---|
| rc5 | token classifier + repair logic | yes | heavier, decoder-assisted |
| rc8 | raw-only token-span model | no | one pass + span decoder |
| DiffMask | raw-only token-span model + denoising training | no | one pass + span decoder |

So DiffMask is closest to rc8 operationally, but it uses a stronger training recipe.

Why This Exists

The older rc5 release still depended on a repair-oriented decoder stack. The public rc8 release removed that external logic, but it regressed on several structured Irish identifiers. This release keeps the raw-only deployment shape while re-hardening the model on Irish numeric and mixed-PII cases.

rc3 is the next candidate after rc2. It keeps the stronger focusv3 checkpoint selected during local iteration, then applies a small decoder-profile retune for the published config:

  • lower EMAIL token extend threshold to keep contiguous mailbox fragments together
  • lower PASSPORT_NUMBER q8 threshold slightly to recover a mixed-message passport miss after dynamic quantization

The weights remain raw-only and scanner-free. The rc3 change is the focusv3 checkpoint plus a retuned release-time decoder profile shipped in config.json.

References

Direct implementation references:

Conceptual diffusion-style training references:

These diffusion papers were used as architectural inspiration for the masked noising schedule. This release does not implement a generative text diffusion runtime.

Included Artifacts

  • Full transformers checkpoint in the repo root
  • Dynamic q8 ONNX export in onnx/model_quantized.onnx
  • Unquantized ONNX export in onnx/model.onnx
  • inference_mask.py for the full checkpoint
  • inference_mask_onnx.py for the ONNX q8 path
  • common.py, model.py, and multitask_model.py implementing the release decoder
  • benchmark files in eval/

Artifact sizes:

  • Full checkpoint: 514 MB (model.safetensors)
  • Dynamic q8 ONNX: 393 MB (onnx/model_quantized.onnx)

How To Use It

Full checkpoint:

uv run python inference_mask.py \
  --model temsa/IrishCore-DiffMask-135M-v1-rc3 \
  --min-score 0.5 \
  --text "My PPSN is 1234567TW, my Eircode is D02 X285, and my phone is 087 123 4567." \
  --json

Dynamic q8 ONNX:

uv run python inference_mask_onnx.py \
  --model temsa/IrishCore-DiffMask-135M-v1-rc3 \
  --min-score 0.5 \
  --text "Please provide your passport NN5123456 and call me on 0851234567." \
  --json

Both scripts emit explicit placeholders like [PII:PPSN] in masked_text.

Q8 Comparison

Deployment-relevant comparison on CPU:

| Model | Core F1 | Edge F1 | Finance F1 | Finance-boundary F1 | User PPSN F1 | GA weak PPSN F1 | Multilingual PPSN F1 | Hardening F1 |
|---|---|---|---|---|---|---|---|---|
| rc5 ONNX q8 | 0.9669 | 0.9744 | 0.9362 | 0.8750 | 1.0000 | 1.0000 | 0.9333 | - |
| rc8 ONNX q8 | 0.9737 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 0.9176 | 0.7059 |
| IrishCore-DiffMask-135M-v1-rc3 ONNX q8 | 0.9664 | 1.0000 | 1.0000 | 1.0000 | 0.8571 | 1.0000 | 0.9591 | 1.0000 |

UAT replay exact suite used for the recent hardening pass:

| Model | UAT replay exact F1 | Precision | Recall |
|---|---|---|---|
| IrishCore-DiffMask-135M-v1-rc1 ONNX q8 | 0.4545 | 1.0000 | 0.2941 |
| IrishCore-DiffMask-135M-v1-rc2 ONNX q8 | 0.8276 | 1.0000 | 0.7059 |
| rc8 ONNX q8 | 0.3636 | 0.3750 | 0.3529 |
| IrishCore-DiffMask-135M-v1-rc3 ONNX q8 | 0.9032 | 1.0000 | 0.8235 |

CPU throughput references:

| Suite | rc5 q8 | rc8 q8 | IrishCore-DiffMask-135M-v1-rc3 q8 |
|---|---|---|---|
| Irish core short-text path | 33.6193 ex/s | 257.3756 ex/s | 29.9676 ex/s |
| Multilingual PPSN short-text path | 35.5561 ex/s | 230.5181 ex/s | 54.2219 ex/s |
| Runtime profile source | 23.8338 ex/s | 179.4708 ex/s | 46.1519 ex/s |

Notes:

  • The rc5 speed references come from its published q8 end-to-end inference stack, which includes its older repair decoder.
  • The rc8 and IrishCore-DiffMask-135M-v1-rc3 numbers use the same raw-only token-span ONNX path.
  • A weight-only q4 ONNX experiment was also tried during development, but it was slower than q8 on this CPU and is not shipped.
  • The user_raw_regression_cases_v1 suite is a legacy PPSN-only regression set. In rc3, the single counted false positive is 0871234567, which is now intentionally masked as PHONE_NUMBER rather than misread as PPSN.

Additional Training Data Used For This RC

Published training sources:

  • temsa/OpenMed-Irish-CorePII-TrainMix-v1
  • temsa/OpenMed-Irish-PPSN-Eircode-Spec-v1
  • joelniklaus/mapa
  • gretelai/synthetic_pii_finance_multilingual

Additional local synthetic hardening and replay sets used during checkpoint selection:

  • irish_core_diffmask_v5_mix
  • dllm_uat_replay_v1
  • dllm_gap_patch_v4
  • dllm_uat_patch_v3
  • irish_core_diffmask_focus_v3

rc3 is based on the locally selected focusv3 checkpoint and then retuned with a narrower decoder profile for the public config.

Limits

  • This is still a compact model. The hardest remaining errors are multilingual PPSN near-miss cases rather than Irish core numeric formats.
  • The release path is intentionally scanner-free. If you need deterministic validation of individual identifier types, add that in your application layer.
  • If you rely on release behavior, use the bundled inference scripts or import decode_token_presence_segments from common.py.
  • Known remaining misses on the current UAT replay suite are the second phone number in the long Client Identity Services sentence (071 967 2616), R93 EC57 inside the longer allocation-centre block, and EPStamp4@enterprise.gov.ie.
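For the application-layer validation mentioned above, even a simple deterministic format gate helps. The sketch below checks only the public PPSN shape (seven digits plus one or two letters, as in the 1234567TW example used in this card); it is an illustrative assumption and does NOT implement the official modulus-23 checksum.

```python
import re

# Public PPSN shape: 7 digits, a check letter, and an optional trailing
# letter. This is a format gate only, not the official checksum, so it
# accepts some strings that a real checksum validator would reject.
PPSN_SHAPE = re.compile(r"^\d{7}[A-Z][A-Z]?$")

def looks_like_ppsn(candidate: str) -> bool:
    return bool(PPSN_SHAPE.match(candidate.strip().upper()))
```

A gate like this can run on spans the model labels PPSN before you accept the mask, catching cases such as a ten-digit phone number leaking through.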

License And Attribution

  • Release license: Apache-2.0
  • Base model: OpenMed/OpenMed-PII-mLiteClinical-Base-135M-v1
  • The derivative release remains subject to the attribution terms of the upstream datasets listed above.
  • See NOTICE, training_sources.json, and eval/benchmark_summary.json for provenance and benchmark details.

Portfolio Comparison

Updated: 2026-03-16.

Use this section for the fastest public comparison across the temsa PII masking portfolio.

  • The first core table only includes public checkpoints that ship both comparable q8 accuracy and q8 CPU throughput.
  • The first PPSN table only includes public artifacts that ship comparable PPSN accuracy and CPU throughput.
  • Missing cells in the archive tables mean the older release did not ship that metric in its public bundle.
  • DiffMask rows use the reconciled clean_single_pass harness that matches the deployed runtime.
  • GlobalPointer rows use the public raw-only span-matrix release bundle and its packaged q8 ONNX artifact.
  • The same content is shipped as PORTFOLIO_COMPARISON.md inside each public model repo.

Irish Core PII: Comparable Public Checkpoints

| Repo | Stack | Full Core F1 | Q8 Core F1 | Q8 Multilingual PPSN F1 | Q8 Core ex/s |
|---|---|---|---|---|---|
| temsa/IrishCore-GlobalPointer-ContextPII-4L-122M-v1-rc4 | 4-layer GlobalPointer distilled fast student | 1.0000 | 1.0000 | 0.9333 | 299.0 |
| temsa/IrishCore-GlobalPointer-ContextPII-4L-122M-v1-rc3 | 4-layer GlobalPointer distilled fast student | 1.0000 | 1.0000 | 0.9333 | 317.9 |
| temsa/IrishCore-GlobalPointer-ContextPII-4L-122M-v1-rc2 | 4-layer GlobalPointer distilled fast student | 1.0000 | 1.0000 | 0.9333 | 292.5 |
| temsa/IrishCore-GlobalPointer-ContextPII-4L-122M-v1-rc1 | 4-layer GlobalPointer distilled fast student | 1.0000 | 1.0000 | 0.9333 | 337.3 |
| temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc27 | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 270.0 |
| temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc25 | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 212.1 |
| temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc24 | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 278.9 |
| temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc23 | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 237.6 |
| temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc22 | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 106.8 |
| temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc21 | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 150.8 |
| temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc20 | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 181.9 |
| temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc19 | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 73.1 |
| temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc18 | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 126.2 |
| temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc17 | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 125.5 |
| temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc16 | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 125.5 |
| temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc15 | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 125.5 |
| temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc14 | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 119.2 |
| temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc13 | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 126.1 |
| temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc12 | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 73.6 |
| temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc11 | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 94.1 |
| temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc10 | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 125.8 |
| temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc9 | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 119.8 |
| temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc8 | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 128.9 |
| temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc7 | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 89.0 |
| temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc6 | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 89.0 |
| temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc5 | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 84.5 |
| temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc4 | GlobalPointer raw-only + context labels | 0.9935 | 0.9935 | 0.9333 | 61.5 |
| temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc3 | GlobalPointer raw-only + context labels | 0.9935 | 0.9935 | 0.9333 | 61.5 |
| temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc2 | GlobalPointer raw-only + context labels | 0.9935 | 0.9935 | 0.9222 | 61.5 |
| temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc1 | GlobalPointer raw-only + context labels | 0.9935 | 0.9935 | 0.9222 | 61.5 |
| temsa/IrishCore-GlobalPointer-135M-v1-rc4 | GlobalPointer raw-only span-matrix | 1.0000 | 1.0000 | 0.9333 | 221.6 |
| temsa/IrishCore-GlobalPointer-135M-v1-rc3 | GlobalPointer raw-only span-matrix | 1.0000 | 1.0000 | 0.9213 | 204.9 |
| temsa/IrishCore-GlobalPointer-135M-v1-rc2 | GlobalPointer raw-only span-matrix | 0.9934 | 0.9934 | 0.9326 | 231.2 |
| temsa/OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc8 | Raw-only token-span | 0.9737 | 0.9737 | 0.9176 | 46.1 |
| temsa/OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc7 | Hybrid classifier + generated scanner spec | 1.0000 | 0.9934 | 1.0000 | 30.0 |
| temsa/OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc6 | Hybrid classifier + repair decoders | 1.0000 | 0.9934 | 1.0000 | 29.5 |
| temsa/OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc5 | Hybrid classifier + repair decoders | 0.9737 | 0.9669 | 0.9333 | 34.4 |
| temsa/OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc4 | Hybrid classifier + repair decoders | 0.9870 | 0.9740 | 0.9600 | 114.2 |
| temsa/OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc3 | Hybrid classifier + repair decoders | 0.9806 | 0.9677 | 0.9333 | 44.9 |
| temsa/OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc2 | Hybrid classifier + repair decoders | 0.9554 | 0.9615 | 0.7887 | 119.1 |
| temsa/OpenMed-mLiteClinical-IrishCorePII-135M-v1 | Hybrid classifier baseline | 0.9530 | 0.9333 | 0.9882 | 103.3 |
| temsa/IrishCore-DiffMask-135M-v1-rc6 | DiffMask token-span, scanner-free | 0.9801 | 0.9733 | 0.9274 | 130.3 |
| temsa/IrishCore-DiffMask-135M-v1-rc5 | DiffMask token-span, scanner-free | 0.9733 | 0.9733 | 0.9379 | 249.2 |
| temsa/IrishCore-DiffMask-135M-v1-rc4 | DiffMask token-span, scanner-free | 0.9733 | 0.9733 | 0.9371 | 29.5 |
| temsa/IrishCore-DiffMask-135M-v1-rc3 | DiffMask token-span, scanner-free | 0.9664 | 0.9664 | 0.9591 | 30.0 |
| temsa/IrishCore-DiffMask-135M-v1-rc2 | DiffMask token-span, scanner-free | 0.9664 | 0.9664 | 0.9212 | 247.1 |
| temsa/IrishCore-DiffMask-135M-v1-rc1 | DiffMask token-span, scanner-free | 0.9801 | 0.9934 | 0.9412 | 251.2 |

Irish Core PII: Other Public Checkpoints

| Repo | Stack | Full Core F1 | Q8 Core F1 | Q8 Multilingual PPSN F1 | Notes |
|---|---|---|---|---|---|
| temsa/OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc1 | Hybrid classifier prototype | 0.9487 | | | Predates the public q8 artifact. |

Finance-boundary q8 F1 is 1.0000 for OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc6, OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc7, OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc8, and all public IrishCore-DiffMask releases from rc1 to rc6. OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc5 ships 0.8750 on that public q8 suite.

PPSN-Only: Comparable Public Artifacts

| Repo | Artifact | Irish Large F1 | Multilingual PPSN F1 | User Raw F1 | QA v8 F1 | CPU ex/s |
|---|---|---|---|---|---|---|
| temsa/OpenMed-mLiteClinical-IrishPPSN-135M-v1 | fp32 canonical checkpoint | 0.8979 | 0.9704 | 0.8000 | 0.7385 | 57.4 |
| temsa/OpenMed-mLiteClinical-IrishPPSN-135M-v1-fp16 | fp16 CPU/GPU artifact | | 0.9704 | 0.8000 | 0.7385 | 45.8 |
| temsa/OpenMed-mLiteClinical-IrishPPSN-135M-v1-q8 | dynamic int8 CPU artifact | | 0.9040 | | | 132.1 |

PPSN-Only: Historical Public Checkpoints

| Repo | Main Published Metrics | Notes |
|---|---|---|
| temsa/OpenMed-PPSN-mLiteClinical-v1 | same as canonical fp32 repo: multilingual 0.9704, user raw 0.8000 | Legacy alias; prefer temsa/OpenMed-mLiteClinical-IrishPPSN-135M-v1. |
| temsa/OpenMed-PPSN-v6-raw-rc2 | irish_reg_v5 0.8750; user_raw 0.8000; qa_v8 0.7385 | Raw PPSN-only research checkpoint; no packaged multilingual CPU benchmark row. |
| temsa/OpenMed-PPSN-v5_1 | irish_large_v2 raw 0.9285; qa_v6 hybrid strict 1.0000 | Hybrid PPSN-only checkpoint; predates the canonical multilingual suite packaging. |
| temsa/OpenMed-PPSN-v5 | irish_reg_v5 raw 0.8235; irish_reg_v5 hybrid strict 1.0000 | Hybrid PPSN-only checkpoint; predates the canonical multilingual suite packaging. |
| temsa/OpenMed-PPSN-v4 | synthetic non-PPSN drift check only | Predates the current PPSN eval suite; no packaged apples-to-apples multilingual CPU row. |

If you need the strongest current raw-only Irish core model, start with IrishCore-GlobalPointer-135M-v1-rc4. If you need the fastest CPU-first raw-only line, compare it against IrishCore-DiffMask-135M-v1-rc6. If you need a PPSN-only artifact, compare the canonical fp32, fp16, and q8 variants of OpenMed-mLiteClinical-IrishPPSN-135M-v1 directly in the table above.
