IrishCore-DiffMask-135M-v1-rc5
IrishCore-DiffMask-135M-v1-rc5 is a raw-only Irish PII masking model derived from OpenMed/OpenMed-PII-mLiteClinical-Base-135M-v1.
It is a small, scanner-free span extractor tuned for:
- `PPSN`
- `ACCOUNT_NUMBER`
- `BANK_ROUTING_NUMBER`
- `CREDIT_DEBIT_CARD`
- `PASSPORT_NUMBER`
- `POSTCODE`
- `PHONE_NUMBER`
- `EMAIL`
- `FIRST_NAME`
- `LAST_NAME`
- `SWIFT_BIC`
The main target is English plus Irish Gaelic text in citizen-support, public-sector, and HSE-style flows. The repo ships both the full transformers checkpoint and a dynamic q8 ONNX artifact for CPU deployment.
What "DiffMask" Means Here
This release is not a generative diffusion language model. It is a compact discriminative token-span model trained with a diffusion-style denoising schedule.
The short version:
- Base OpenMed: plain BIO token classification
- DiffMask: token-span extraction with token-presence and boundary heads
- DiffMask training: repeated masked denoising over the same sentence
- DiffMask inference: one forward pass, no iterative refinement, no text generation
Concretely:
- The encoder starts from the DistilBERT-family weights inside `OpenMed/OpenMed-PII-mLiteClinical-Base-135M-v1`.
- The model adds three task heads over the encoder hidden states:
- a per-label token-presence head
- a typed start-boundary head
- a typed end-boundary head
- During training, each input sentence is corrupted multiple times by replacing a random fraction of visible tokens with `[MASK]`.
- The corruption level follows a short noise schedule from heavy masking to light masking.
- The same gold spans are learned at every noise level, and the losses are averaged across the denoising passes.
- At inference time there is no diffusion loop and no rewrite step: the model runs once and a score-only span decoder reconstructs spans from token scores plus typed boundaries.
So the "DLLM" (diffusion language model) aspect here is limited to the training recipe: repeated masked denoising over text. Nothing is generated at inference time, autoregressively or otherwise.
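As a rough illustration of the masked-noising step described above, the sketch below builds several corrupted views of one tokenized sentence and averages per-view losses. The schedule values `(0.5, 0.3, 0.1)`, the helper names, and the special-token set are assumptions made for this example; the release's actual schedule and training code are not reproduced here.

```python
import random

MASK = "[MASK]"
SPECIAL = {"[CLS]", "[SEP]", "[PAD]"}  # assumed special tokens, never masked


def mask_tokens(tokens, mask_rate, rng):
    """Replace a random fraction of the visible (non-special) tokens with [MASK]."""
    visible = [i for i, tok in enumerate(tokens) if tok not in SPECIAL]
    n_mask = round(mask_rate * len(visible))
    chosen = set(rng.sample(visible, n_mask))
    return [MASK if i in chosen else tok for i, tok in enumerate(tokens)]


def noised_views(tokens, schedule=(0.5, 0.3, 0.1), seed=0):
    """One corrupted copy per noise level, from heavy to light masking (hypothetical schedule)."""
    rng = random.Random(seed)
    return [mask_tokens(tokens, rate, rng) for rate in schedule]


def average_loss(per_view_losses):
    """The same gold spans supervise every view; the losses are simply averaged."""
    return sum(per_view_losses) / len(per_view_losses)


tokens = ["[CLS]", "my", "ppsn", "is", "1234567", "##tw", "and",
          "my", "phone", "is", "0871234567", "[SEP]"]
for view in noised_views(tokens):
    print(view)
```

Each view keeps the original length and token positions, so the span targets computed for the clean sentence can be reused unchanged at every noise level.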
What It Is Not
This model is not a full discrete diffusion language model in the LLaDA sense.
A true DLLM would usually have:
- timestep or noise conditioning inside the model
- iterative denoising at inference time
- multi-step sequence refinement at runtime
- text generation or full-sequence reconstruction as a first-class objective
This release does not do that.
Instead, it uses the diffusion idea only as a training-time robustness trick:
- corrupt the sentence with `[MASK]` at several noise levels
- train on the same target spans each time
- average those losses
At runtime, it behaves like a normal fast discriminative extractor.
Architecture
- Encoder: DistilBERT-size encoder from the OpenMed mLiteClinical 135M base
- Heads:
- token presence per released label
- typed start boundary per released label
- typed end boundary per released label
- Decoder:
- score-only span decoding from offsets, token continuity, label-specific thresholds, and typed boundaries
- email spans do not bridge across whitespace
- structured labels can bridge over a simple hyphen token when needed
- name spans containing digits are rejected during decoding
- `PPSN` whitespace bridging is restricted to short suffix tokens only
- structured `IBAN` and card spans use conservative start/tail rescue logic for fragmented tokenization
- no regex candidate extractor
- no checksum validator
- no scanner layer
The release behavior is fully defined by the weights plus the bundled decoder in common.py.
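To make the idea of score-only decoding concrete, here is a heavily simplified sketch for a single label. It is not the bundled decoder from `common.py`: the function name `decode_label`, the single shared `threshold`, and the `bridge_whitespace` flag are hypothetical, and the hyphen bridging, `PPSN` suffix handling, and `IBAN`/card rescue logic are left out.

```python
NAME_LABELS = {"FIRST_NAME", "LAST_NAME"}


def decode_label(text, offsets, presence, start, end, label,
                 threshold=0.5, bridge_whitespace=True):
    """Greedy score-only decoding for one label (simplified illustration)."""
    spans, run = [], []

    def flush():
        if not run:
            return
        # Prefer typed boundaries when they fire; otherwise fall back to the run edges.
        first = next((i for i in run if start[i] >= threshold), run[0])
        last = next((i for i in reversed(run)
                     if i >= first and end[i] >= threshold), run[-1])
        s, e = offsets[first][0], offsets[last][1]
        # Name spans that contain digits are rejected.
        if not (label in NAME_LABELS and any(c.isdigit() for c in text[s:e])):
            spans.append({"label": label, "start": s, "end": e})
        run.clear()

    for i, score in enumerate(presence):
        if score < threshold:
            flush()
            continue
        # Optionally refuse to bridge a whitespace gap (for example for EMAIL).
        if run and not bridge_whitespace and offsets[i][0] > offsets[run[-1]][1]:
            flush()
        run.append(i)
    flush()
    return spans


text = "a@b.ie c@d.ie"
offsets = [(0, 6), (7, 13)]          # character offsets per token
presence = [0.9, 0.8]                # token-presence scores for EMAIL
start, end = [0.9, 0.7], [0.8, 0.9]  # typed boundary scores for EMAIL
print(decode_label(text, offsets, presence, start, end, "EMAIL",
                   bridge_whitespace=False))
```

With `bridge_whitespace=False` the two addresses come out as separate spans; with the default they would merge into one, which is the failure the email rule above avoids.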
Training And Inference Flow
Training:
- tokenize a sentence with gold BIO spans
- convert spans into:
- token-presence targets
- typed start targets
- typed end targets
- create several noised copies of the same tokenized sentence by masking random visible tokens
- run the same encoder+heads on each noised copy
- average the losses across those denoising passes
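The span-to-target conversion in the training steps above can be sketched as follows. This is a minimal example assuming character-offset gold spans and tokenizer offsets; the function name `spans_to_targets` and the multi-hot list layout are illustrative and not taken from the release code.

```python
def spans_to_targets(offsets, spans, labels):
    """Turn gold character spans into per-token presence/start/end targets.

    offsets: list of (char_start, char_end) per token
    spans:   list of (char_start, char_end, label)
    labels:  ordered list of released label names
    """
    index = {label: k for k, label in enumerate(labels)}
    n, m = len(offsets), len(labels)
    presence = [[0] * m for _ in range(n)]
    start = [[0] * m for _ in range(n)]
    end = [[0] * m for _ in range(n)]
    for s, e, label in spans:
        k = index[label]
        # Tokens that overlap the gold span (zero-width tokens are skipped).
        covered = [i for i, (ts, te) in enumerate(offsets)
                   if te > ts and ts < e and te > s]
        if not covered:
            continue
        for i in covered:
            presence[i][k] = 1
        start[covered[0]][k] = 1   # typed start boundary
        end[covered[-1]][k] = 1    # typed end boundary
    return presence, start, end


# "PPSN 1234567TW now" split into four tokens; the PPSN covers tokens 1 and 2.
offsets = [(0, 4), (5, 12), (12, 14), (15, 18)]
presence, start, end = spans_to_targets(
    offsets, [(5, 14, "PPSN")], ["PPSN", "EMAIL"])
print(presence, start, end)
```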
Inference:
- tokenize the raw text once
- run a single forward pass
- predict:
- which labels are present on each token
- where each labeled span starts
- where each labeled span ends
- decode spans with label-aware thresholds and boundary rules
- replace the detected spans with placeholders such as `[PII:PPSN]`
There is no multi-step refinement loop in deployment.
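The final placeholder step can be illustrated with a small helper. This is a sketch that assumes decoded spans arrive as `(start, end, label)` character tuples; the helper name `mask_text` and the overlap handling are assumptions, not the bundled script's behavior.

```python
def mask_text(text, spans):
    """Replace each (start, end, label) character span with a [PII:LABEL] placeholder."""
    out, cursor = [], 0
    for s, e, label in sorted(spans):
        if s < cursor:
            # Skip spans that overlap one already masked (an assumed policy).
            continue
        out.append(text[cursor:s])
        out.append(f"[PII:{label}]")
        cursor = e
    out.append(text[cursor:])
    return "".join(out)


text = "My PPSN is 1234567TW, my Eircode is D02 X285."
ppsn_at = text.index("1234567TW")
eircode_at = text.index("D02 X285")
spans = [
    (eircode_at, eircode_at + len("D02 X285"), "POSTCODE"),
    (ppsn_at, ppsn_at + len("1234567TW"), "PPSN"),
]
print(mask_text(text, spans))
# My PPSN is [PII:PPSN], my Eircode is [PII:POSTCODE].
```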
How It Differs From The Original OpenMed Model
The original OpenMed/OpenMed-PII-mLiteClinical-Base-135M-v1 is a standard DistilBertForTokenClassification model:
- one encoder
- one token-classification head
- BIO labels such as `B-email`, `I-email`, `B-phone_number`
- generic token aggregation to recover spans
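For contrast, the generic BIO aggregation that a standard token classifier relies on can be sketched like this. It is a minimal illustration of the technique, not the aggregation code of `transformers` or of the base model; the function name `bio_to_spans` is hypothetical.

```python
def bio_to_spans(tags, offsets):
    """Merge per-token BIO tags into (label, char_start, char_end) spans."""
    spans, current = [], None
    for tag, (s, e) in zip(tags, offsets):
        prefix, _, label = tag.partition("-")
        starts_new = prefix == "B" or (
            prefix == "I" and (current is None or current[0] != label))
        if starts_new:
            # A B- tag, or an I- tag with no matching open span, opens a new span.
            if current:
                spans.append(tuple(current))
            current = [label, s, e]
        elif prefix == "I":
            current[2] = e  # extend the open span
        else:
            # "O" closes any open span.
            if current:
                spans.append(tuple(current))
            current = None
    if current:
        spans.append(tuple(current))
    return spans


tags = ["O", "B-email", "I-email", "O", "B-phone_number"]
offsets = [(0, 5), (6, 10), (10, 16), (17, 20), (21, 31)]
print(bio_to_spans(tags, offsets))
# [('email', 6, 16), ('phone_number', 21, 31)]
```

Here span boundaries exist only implicitly in the tag sequence, so a single mislabeled token splits or truncates a span.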
DiffMask changes two things:
Different supervision
- base OpenMed learns only BIO token labels
- DiffMask learns token presence plus typed span boundaries
Different training recipe
- base OpenMed is trained as a standard token classifier
- DiffMask is trained on multiple masked-noised views of the same sentence
That makes DiffMask better suited to structured Irish identifiers and mixed PII masking, while still keeping a small encoder and a fast CPU path.
How It Differs From rc5 And rc8
| Model | Core idea | External scanner/validator | Runtime shape |
|---|---|---|---|
| `rc5` | token classifier + repair logic | yes | heavier, decoder-assisted |
| `rc8` | raw-only token-span model | no | one pass + span decoder |
| `DiffMask` | raw-only token-span model + denoising training | no | one pass + span decoder |
So DiffMask is closest to rc8 operationally, but it uses a stronger training recipe.
Why This Exists
The older `OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc5` release (the `rc5` row in the tables here) still depended on a repair-oriented decoder stack. The public `v2-rc8` release of the same line removed that external logic, but it regressed on several structured Irish identifiers. This release keeps the raw-only deployment shape while re-hardening the model on Irish numeric and mixed-PII cases.
References
Direct implementation references:
- Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, https://arxiv.org/abs/1810.04805
- Sanh et al., DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, https://arxiv.org/abs/1910.01108
- Fu et al., Boundary Smoothing for Named Entity Recognition, https://aclanthology.org/2022.acl-long.490/
- Wang et al., SPANNER: Named Entity Re-/Recognition as Span Prediction, https://aclanthology.org/2021.acl-long.558/
Conceptual diffusion-style training references:
- Nie et al., LLaDA 2.0: Scaling Up Diffusion Language Models to 100B, https://arxiv.org/abs/2512.15745
- Gong et al., Scaling Diffusion Language Models via Adaptation from Autoregressive Models, https://arxiv.org/abs/2410.17891
These diffusion papers were used as conceptual inspiration for the masked noising schedule. This release does not implement a generative text diffusion runtime.
Included Artifacts
- Full `transformers` checkpoint in the repo root
- Dynamic q8 ONNX export in `onnx/model_quantized.onnx`
- Unquantized ONNX export in `onnx/model.onnx`
- `inference_mask.py` for the full checkpoint
- `inference_mask_onnx.py` for the ONNX q8 path
- `common.py`, `model.py`, and `multitask_model.py` implementing the release decoder
- benchmark files in `eval/`
Artifact sizes:
- Full checkpoint: `514 MB` (`model.safetensors`)
- Dynamic q8 ONNX: `393 MB` (`onnx/model_quantized.onnx`)
How To Use It
Full checkpoint:
uv run python inference_mask.py \
--model temsa/IrishCore-DiffMask-135M-v1-rc5 \
--min-score 0.5 \
--text "My PPSN is 1234567TW, my Eircode is D02 X285, and my phone is 087 123 4567." \
--json
Dynamic q8 ONNX:
uv run python inference_mask_onnx.py \
--model temsa/IrishCore-DiffMask-135M-v1-rc5 \
--min-score 0.5 \
--text "Please provide your passport NN5123456 and call me on 0851234567." \
--json
Both scripts emit explicit placeholders like [PII:PPSN] in masked_text.
Q8 Comparison
Deployment-relevant comparison on CPU:
| Model | Core F1 | Edge F1 | Finance F1 | Finance-boundary F1 | User PPSN F1 | GA weak PPSN F1 | Multilingual PPSN F1 | Hardening F1 |
|---|---|---|---|---|---|---|---|---|
| `rc5` ONNX q8 | 0.9669 | 0.9744 | 0.9362 | 0.8750 | 1.0000 | 1.0000 | 0.9333 | - |
| `rc8` ONNX q8 | 0.9737 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 0.9176 | 0.7059 |
| `IrishCore-DiffMask-135M-v1-rc5` ONNX q8 | 0.9733 | 0.9500 | 1.0000 | 1.0000 | - | 1.0000 | 0.9379 | 1.0000 |
Additional targeted suites used during rc5 selection:
| Suite | Metric | IrishCore-DiffMask-135M-v1-rc5 |
|---|---|---|
| `diffmask_rc3_feedback_exact_v1` | F1, IoU 0.5, ONNX q8 | 1.0000 |
| `diffmask_gap_uat_exact_v1` | F1, IoU 0.5, ONNX q8 | 0.9032 |
| `diffmask_rc3_feedback_exact_v1` | F1, exact boundary (IoU 1.0, full checkpoint) | 0.9444 |
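For readers unfamiliar with span-level F1 at an IoU threshold, the sketch below shows one common way to compute it over character spans. The greedy one-to-one matching and the function names are assumptions made for this example; the harness in `eval/` may match spans differently.

```python
def span_iou(a, b):
    """Intersection over union of two (start, end, ...) character spans."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0


def f1_at_iou(gold, pred, threshold=0.5):
    """Span F1 where a prediction counts if its label matches and IoU >= threshold."""
    matched, true_pos = set(), 0
    for p in pred:
        for gi, g in enumerate(gold):
            if gi in matched or g[2] != p[2]:
                continue
            if span_iou(p, g) >= threshold:
                matched.add(gi)  # each gold span can be matched only once
                true_pos += 1
                break
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


gold = [(11, 20, "PPSN"), (36, 44, "POSTCODE")]
pred = [(11, 20, "PPSN"), (36, 40, "POSTCODE")]  # second span is truncated
print(f1_at_iou(gold, pred, threshold=0.5))  # 1.0
print(f1_at_iou(gold, pred, threshold=1.0))  # 0.5
```

The truncated postcode still counts at IoU 0.5 but not under exact boundaries, which is why the same suite can score differently in the two metric rows.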
CPU throughput references:
| Suite | `rc5` q8 | `rc8` q8 | `IrishCore-DiffMask-135M-v1-rc5` q8 |
|---|---|---|---|
| Irish core short-text path | 33.6193 ex/s | 257.3756 ex/s | 249.2250 ex/s |
| Multilingual PPSN short-text path | 35.5561 ex/s | 230.5181 ex/s | 247.4313 ex/s |
| Runtime profile source | 23.8338 ex/s | 179.4708 ex/s | 157.4094 ex/s |
Notes:
- The `rc5` speed references come from its published q8 end-to-end inference stack, which includes its older repair decoder.
- The `rc8` and `IrishCore-DiffMask-135M-v1-rc5` numbers use the same raw-only token-span ONNX path.
- A weight-only q4 ONNX experiment was also tried during development, but it was slower than q8 on this CPU and is not shipped.
Limits
- This is still a compact model. The hardest remaining errors are multilingual PPSN near-miss cases rather than Irish core numeric formats.
- The release path is intentionally scanner-free. If you need deterministic validation of individual identifier types, add that in your application layer.
- If you rely on release behavior, use the bundled inference scripts or import `decode_token_presence_segments` from `common.py`.
- `rc5` resolves the post-`rc3` QA feedback suite, but it still has known misses on a few longer UAT-style messages:
  - the second phone number in a two-phone support sentence
  - one multiline address block with `R93 EC57`
  - `EPStamp4@enterprise.gov.ie` in the longer employment-permit example
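Because the release is scanner-free, deterministic checks belong in the application layer. As one example, the sketch below validates a detected PPSN with the commonly described modulus-23 check character. The weights, the treatment of the optional second letter (`W` counted as 0, other letters as `A`=1 and so on), and the function name are assumptions to verify against an authoritative specification before relying on them; legacy formats may not be covered.

```python
import re

# Seven digits, a check letter, and an optional second letter (assumed format).
PPSN_RE = re.compile(r"^(\d{7})([A-W])([A-W]?)$")


def ppsn_checksum_ok(value):
    """Return True if the check letter matches the assumed modulus-23 scheme."""
    match = PPSN_RE.match(value.strip().upper())
    if not match:
        return False
    digits, check, suffix = match.groups()
    # Digits are weighted 8, 7, ..., 2 from left to right.
    total = sum(int(d) * w for d, w in zip(digits, range(8, 1, -1)))
    # An optional second letter is weighted 9; "W" counts as zero (assumption).
    if suffix and suffix != "W":
        total += (ord(suffix) - ord("A") + 1) * 9
    # Remainder 0 maps to "W", 1 to "A", ..., 22 to "V".
    expected = "WABCDEFGHIJKLMNOPQRSTUV"[total % 23]
    return check == expected


print(ppsn_checksum_ok("1234567TW"))  # True
print(ppsn_checksum_ok("1234567AW"))  # False
```

A check like this can run on each `PPSN` span after masking decisions are made, for example to flag low-confidence detections for review rather than to override the model.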
License And Attribution
- Release license: Apache-2.0
- Base model: `OpenMed/OpenMed-PII-mLiteClinical-Base-135M-v1`
- The derivative release remains subject to the attribution terms of the upstream datasets listed above.
- See `NOTICE`, `training_sources.json`, and `eval/benchmark_summary.json` for provenance and benchmark details.
Portfolio Comparison
Updated: 2026-03-15.
Use this section for the fastest public comparison across the temsa PII masking portfolio.
- The first core table only includes public checkpoints that ship both comparable q8 accuracy and q8 CPU throughput.
- The first PPSN table only includes public artifacts that ship comparable PPSN accuracy and CPU throughput.
- Missing cells in the archive tables mean the older release did not ship that metric in its public bundle.
- DiffMask rows use the reconciled `clean_single_pass` harness that matches the deployed runtime.
- GlobalPointer rows use the public raw-only span-matrix release bundle and its packaged q8 ONNX artifact.
- The same content is shipped as `PORTFOLIO_COMPARISON.md` inside each public model repo.
Irish Core PII: Comparable Public Checkpoints
| Repo | Stack | Full Core F1 | Q8 Core F1 | Q8 Multilingual PPSN F1 | Q8 Core ex/s |
|---|---|---|---|---|---|
| `temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc23` | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 237.6 |
| `temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc22` | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 106.8 |
| `temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc21` | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 150.8 |
| `temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc20` | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 181.9 |
| `temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc19` | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 73.1 |
| `temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc18` | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 126.2 |
| `temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc17` | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 125.5 |
| `temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc16` | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 125.5 |
| `temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc15` | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 125.5 |
| `temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc14` | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 119.2 |
| `temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc13` | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 126.1 |
| `temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc12` | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 73.6 |
| `temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc11` | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 94.1 |
| `temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc10` | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 125.8 |
| `temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc9` | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 119.8 |
| `temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc8` | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 128.9 |
| `temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc7` | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 89.0 |
| `temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc6` | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 89.0 |
| `temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc5` | GlobalPointer raw-only + context labels | 1.0000 | 1.0000 | 0.9333 | 84.5 |
| `temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc4` | GlobalPointer raw-only + context labels | 0.9935 | 0.9935 | 0.9333 | 61.5 |
| `temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc3` | GlobalPointer raw-only + context labels | 0.9935 | 0.9935 | 0.9333 | 61.5 |
| `temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc2` | GlobalPointer raw-only + context labels | 0.9935 | 0.9935 | 0.9222 | 61.5 |
| `temsa/IrishCore-GlobalPointer-ContextPII-135M-v1-rc1` | GlobalPointer raw-only + context labels | 0.9935 | 0.9935 | 0.9222 | 61.5 |
| `temsa/IrishCore-GlobalPointer-135M-v1-rc4` | GlobalPointer raw-only span-matrix | 1.0000 | 1.0000 | 0.9333 | 221.6 |
| `temsa/IrishCore-GlobalPointer-135M-v1-rc3` | GlobalPointer raw-only span-matrix | 1.0000 | 1.0000 | 0.9213 | 204.9 |
| `temsa/IrishCore-GlobalPointer-135M-v1-rc2` | GlobalPointer raw-only span-matrix | 0.9934 | 0.9934 | 0.9326 | 231.2 |
| `temsa/OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc8` | Raw-only token-span | 0.9737 | 0.9737 | 0.9176 | 46.1 |
| `temsa/OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc7` | Hybrid classifier + generated scanner spec | 1.0000 | 0.9934 | 1.0000 | 30.0 |
| `temsa/OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc6` | Hybrid classifier + repair decoders | 1.0000 | 0.9934 | 1.0000 | 29.5 |
| `temsa/OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc5` | Hybrid classifier + repair decoders | 0.9737 | 0.9669 | 0.9333 | 34.4 |
| `temsa/OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc4` | Hybrid classifier + repair decoders | 0.9870 | 0.9740 | 0.9600 | 114.2 |
| `temsa/OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc3` | Hybrid classifier + repair decoders | 0.9806 | 0.9677 | 0.9333 | 44.9 |
| `temsa/OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc2` | Hybrid classifier + repair decoders | 0.9554 | 0.9615 | 0.7887 | 119.1 |
| `temsa/OpenMed-mLiteClinical-IrishCorePII-135M-v1` | Hybrid classifier baseline | 0.9530 | 0.9333 | 0.9882 | 103.3 |
| `temsa/IrishCore-DiffMask-135M-v1-rc6` | DiffMask token-span, scanner-free | 0.9801 | 0.9733 | 0.9274 | 130.3 |
| `temsa/IrishCore-DiffMask-135M-v1-rc5` | DiffMask token-span, scanner-free | 0.9733 | 0.9733 | 0.9379 | 249.2 |
| `temsa/IrishCore-DiffMask-135M-v1-rc4` | DiffMask token-span, scanner-free | 0.9733 | 0.9733 | 0.9371 | 29.5 |
| `temsa/IrishCore-DiffMask-135M-v1-rc3` | DiffMask token-span, scanner-free | 0.9664 | 0.9664 | 0.9591 | 30.0 |
| `temsa/IrishCore-DiffMask-135M-v1-rc2` | DiffMask token-span, scanner-free | 0.9664 | 0.9664 | 0.9212 | 247.1 |
| `temsa/IrishCore-DiffMask-135M-v1-rc1` | DiffMask token-span, scanner-free | 0.9801 | 0.9934 | 0.9412 | 251.2 |
Irish Core PII: Other Public Checkpoints
| Repo | Stack | Full Core F1 | Q8 Core F1 | Q8 Multilingual PPSN F1 | Notes |
|---|---|---|---|---|---|
| `temsa/OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc1` | Hybrid classifier prototype | 0.9487 | — | — | Predates the public q8 artifact. |
Finance-boundary q8 F1 is 1.0000 for OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc6, OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc7, OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc8, and all public IrishCore-DiffMask releases from rc1 to rc6. OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc5 ships 0.8750 on that public q8 suite.
PPSN-Only: Comparable Public Artifacts
| Repo | Artifact | Irish Large F1 | Multilingual PPSN F1 | User Raw F1 | QA v8 F1 | CPU ex/s |
|---|---|---|---|---|---|---|
| `temsa/OpenMed-mLiteClinical-IrishPPSN-135M-v1` | fp32 canonical checkpoint | 0.8979 | 0.9704 | 0.8000 | 0.7385 | 57.4 |
| `temsa/OpenMed-mLiteClinical-IrishPPSN-135M-v1-fp16` | fp16 CPU/GPU artifact | — | 0.9704 | 0.8000 | 0.7385 | 45.8 |
| `temsa/OpenMed-mLiteClinical-IrishPPSN-135M-v1-q8` | dynamic int8 CPU artifact | — | 0.9040 | — | — | 132.1 |
PPSN-Only: Historical Public Checkpoints
| Repo | Main Published Metrics | Notes |
|---|---|---|
| `temsa/OpenMed-PPSN-mLiteClinical-v1` | same as canonical fp32 repo: multilingual 0.9704, user raw 0.8000 | Legacy alias; prefer `temsa/OpenMed-mLiteClinical-IrishPPSN-135M-v1`. |
| `temsa/OpenMed-PPSN-v6-raw-rc2` | irish_reg_v5 0.8750; user_raw 0.8000; qa_v8 0.7385 | Raw PPSN-only research checkpoint; no packaged multilingual CPU benchmark row. |
| `temsa/OpenMed-PPSN-v5_1` | irish_large_v2 raw 0.9285; qa_v6 hybrid strict 1.0000 | Hybrid PPSN-only checkpoint; predates the canonical multilingual suite packaging. |
| `temsa/OpenMed-PPSN-v5` | irish_reg_v5 raw 0.8235; irish_reg_v5 hybrid strict 1.0000 | Hybrid PPSN-only checkpoint; predates the canonical multilingual suite packaging. |
| `temsa/OpenMed-PPSN-v4` | synthetic non-PPSN drift check only | Predates the current PPSN eval suite; no packaged apples-to-apples multilingual CPU row. |
If you need the strongest current raw-only Irish core model, start with IrishCore-GlobalPointer-135M-v1-rc4. If you need the fastest CPU-first raw-only line, compare it against IrishCore-DiffMask-135M-v1-rc6. If you need a PPSN-only artifact, compare the canonical fp32, fp16, and q8 variants of OpenMed-mLiteClinical-IrishPPSN-135M-v1 directly in the table above.
Evaluation results
- Overall F1 on `irish_core_pii_v1` (self-reported): 0.973
- Overall F1 on `multilingual_ppsn_v1_all` (self-reported): 0.938
- Overall F1 on `irish_dllm_hardening_exact_v1` (self-reported): 1.000