
Why We Fine-Tuned DeBERTa-base and Not XLM-R for German PII

The problem

The retrieval layer for the GDPR-compliant RAG platform I was building at SAP Labs India handed a stream of German legal chunks to a PII redaction step before anything hit the generator. Bundesdatenschutzgesetz — Germany's federal data protection law — has opinions about what counts as personal data that are broader than GDPR's minimum, and stricter about the specific national identifier classes. Sozialversicherungsnummer, Steueridentifikationsnummer, Krankenversicherungskarte-Nummer, Personalausweisnummer. A miss on any of those was not a metric regression, it was a compliance incident.

The corpus for fine-tuning was in the 300k-500k German document range, annotated over several weeks with a mix of rule-based seed labels and human review. I needed a model that would sit inline in a retrieval-time pipeline — not batch, not overnight — and hit a recall bar high enough that the residual false-negative rate was defensible in a Datenschutz-Folgenabschätzung.

The obvious answer was the multilingual one.


The naive first approach

XLM-R. Facebook's XLM-RoBERTa was the default recommendation for anything cross-lingual in 2024, and every "European multilingual PII" thread on Hugging Face pointed at either XLM-R-base or its larger sibling. The reasoning was tidy: it was pre-trained on 2.5TB of CommonCrawl across 100 languages including a huge German slice, its tokenizer had seen German morphology in the wild, and — most importantly for a regulated pipeline — the "one model, many languages" story was operationally simple. One artifact to ship, one artifact to monitor, one artifact to re-certify when the compliance team asked.

So I fine-tuned XLM-R-base on the annotated German corpus with a standard token-classification head. IOB2 tagging over ten entity classes. Cross-entropy, class weights to handle the imbalance between O tokens and everything else, AdamW, the usual warmup schedule. Nothing exotic.
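
The weighting scheme itself is not spelled out above. One common choice is smoothed inverse-frequency weights, sketched here under that assumption; `class_weights` and the toy tag sequences are hypothetical and not taken from the original pipeline.

```python
from collections import Counter


def class_weights(tag_sequences, labels, smoothing=1.0):
    """Smoothed inverse-frequency weights, normalised so the mean weight is 1.0."""
    counts = Counter(tag for seq in tag_sequences for tag in seq)
    total = sum(counts.values()) + smoothing * len(labels)
    raw = {
        label: total / (len(labels) * (counts.get(label, 0) + smoothing))
        for label in labels
    }
    mean = sum(raw.values()) / len(raw)
    return {label: weight / mean for label, weight in raw.items()}


# Toy data: "O" dominates, identifier tags are rare.
toy_sequences = [
    ["O", "O", "O", "B-SVNR", "O"],
    ["O", "B-PER", "I-PER", "O", "O"],
]
weights = class_weights(toy_sequences, ["O", "B-PER", "I-PER", "B-SVNR"])
print(weights)
```

Weights like these could be passed, in label-id order, as the `weight` tensor of `torch.nn.CrossEntropyLoss`. Whether inverse frequency is the right shape for a given corpus is an empirical question.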

The baseline numbers were fine on paper. They were not fine in the failure modes.

Two things showed up on the German-specific eval set that I couldn't wave away.

First, the model consistently under-recalled on the compound-noun identifier classes. Sozialversicherungsnummer would sometimes be tagged correctly. Krankenversicherungskarte-Nummer, which is a genuine German compound plus a hyphenated qualifier, was tagged as three separate spans about a third of the time — and each of those partial spans then triggered a downstream re-alignment bug in the redactor. The model wasn't wrong about "there is PII here." It was wrong about where the PII ended.
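
To make "wrong about where the PII ended" measurable, a small decoder can collapse IOB2 tags into spans and count how many predicted spans overlap one gold span. This is an illustrative sketch; `iob2_to_spans` and `fragment_count` are hypothetical helper names, and the tag sequences are made up.

```python
def iob2_to_spans(tags):
    """Collapse an IOB2 tag sequence into (label, start, end) spans, end exclusive."""
    spans, start, label = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel flushes the last span
        prefix, _, cls = tag.partition("-")
        continues = prefix == "I" and cls == label
        if label is not None and not continues:
            spans.append((label, start, i))
            start, label = None, None
        if prefix == "B" or (prefix == "I" and label is None):
            start, label = i, cls
    return spans


def fragment_count(gold_span, pred_spans):
    """Number of same-label predicted spans that overlap a single gold span."""
    label, g_start, g_end = gold_span
    return sum(
        1 for p_label, s, e in pred_spans
        if p_label == label and s < g_end and e > g_start
    )


gold = ["O", "B-KVNR", "I-KVNR", "I-KVNR", "I-KVNR", "O"]
pred = ["O", "B-KVNR", "I-KVNR", "B-KVNR", "B-KVNR", "O"]

gold_spans = iob2_to_spans(gold)
pred_spans = iob2_to_spans(pred)
fragments = fragment_count(gold_spans[0], pred_spans)
print(gold_spans, pred_spans, fragments)
```

A fragment count above 1 for a gold span is the "one identifier, several spans" failure described above. Token-level accuracy hides it, because every token still carries the right class.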

Second, the boundary errors were not evenly distributed. They clustered on the words with the highest compliance stakes. Personal-identification compounds. Address compounds with -straße and -platz suffixes. Health-identifier compounds. The words that a Bundesdatenschutz auditor was most likely to check by hand were the words the model was least confident about.

I could patch the recall by dropping the confidence threshold and living with more false positives. But false positives in a redaction pipeline are not free either — over-redacting a public entity in a public document is a different kind of bug that ends up in front of the same product owner.


The decision

I moved to DeBERTa-base — the English-focused v3 checkpoint — and fine-tuned it as a monolingual German model. Same head, same loss, same training data.

That reads wrong on first pass. DeBERTa's pre-training corpus is English-dominated. Why would a model with less German exposure do better on German?

The answer, once I got the numbers back, wasn't about pre-training breadth. It was about tokenizer geometry and about what disentangled attention does to morphology-heavy languages.


The tradeoff

Directional numbers from our internal eval — a held-out slice of the 300k-500k German corpus, plus a small human-curated adversarial set of Bundesdatenschutz-heavy passages. Not a public benchmark. Your numbers will differ.

| Axis | XLM-R-base (fine-tuned) | DeBERTa-base (fine-tuned, monolingual DE) |
| --- | --- | --- |
| Entity-level F1 on Bundesdatenschutz classes | baseline | +6.1 F1 over baseline |
| Recall@10 on the retrieval-linked eval | ~88% | 94% |
| MRR@10 on the same eval | ~0.74 | 0.82 |
| Compound-noun span accuracy | boundary errors clustered on identifier compounds | consistently tighter spans |
| Tokens per Sozialversicherungsnummer | 5-6 sub-tokens, unstable boundaries | 3-4 sub-tokens, stable boundaries |
| Inference latency at batch 32 | comparable | comparable |
| Parameter count | 270M | 184M |
| Multilingual reuse story | one artifact for all EU languages | one artifact per language |
| Ops cost | single fine-tune, single monitor | N fine-tunes, N monitors |

The six-point F1 delta was the number that ended the debate. On the specific entity classes the compliance team cared about, DeBERTa-base wasn't marginally better; it was a category better. Recall@10 crossed the bar we'd set for the retrieval-linked evaluation. MRR@10 at 0.82 meant the correct redaction span was, on average, ranked first or very close to it, which mattered because a downstream span-selection step used those rankings.
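
For readers who want the two ranking metrics pinned down, here is a minimal sketch of Recall@k and MRR@k. It assumes exactly one correct candidate per query, which may not match the original eval setup; the function names and toy rankings are illustrative only.

```python
def recall_at_k(ranked_lists, gold_items, k=10):
    """Fraction of queries whose gold item appears in the top k candidates."""
    hits = sum(
        1 for ranked, gold in zip(ranked_lists, gold_items) if gold in ranked[:k]
    )
    return hits / len(gold_items)


def mrr_at_k(ranked_lists, gold_items, k=10):
    """Mean of 1/rank of the gold item; a query with no hit in the top k scores 0."""
    total = 0.0
    for ranked, gold in zip(ranked_lists, gold_items):
        for rank, candidate in enumerate(ranked[:k], start=1):
            if candidate == gold:
                total += 1.0 / rank
                break
    return total / len(gold_items)


# Three toy queries: gold at rank 1, gold at rank 2, gold missing entirely.
rankings = [["s1", "s2"], ["s9", "s4"], ["s7", "s8"]]
gold = ["s1", "s4", "s5"]

recall = recall_at_k(rankings, gold, k=10)
mrr = mrr_at_k(rankings, gold, k=10)
print(recall, mrr)
```

Because a rank-2 hit contributes only 0.5 and a miss contributes 0, an MRR near 0.8 requires most correct spans to sit at or near the top of the ranking.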

The trade I was making was explicit: I gave up the "one model, all languages" operational story, and I paid in maintenance overhead — every new EU language would now be its own fine-tune, its own eval, its own monitored artifact. In exchange I got a recall floor I could defend and a boundary-precision profile that stopped causing downstream alignment bugs.

For a regulated pipeline where the failure mode is a compliance incident, that trade was straightforwardly the right one. If the same model had been powering a customer-facing feature where breadth mattered more than depth, I would have kept XLM-R.


Implementation notes

The entity-class mapping was where the work actually was. Bundesdatenschutz-relevant PII doesn't line up cleanly with the CoNLL-style PER / ORG / LOC / MISC schema most tutorials assume. I mapped everything to a domain-specific schema and IOB2-tagged it before touching the model.

```python
from transformers import AutoModelForTokenClassification

# The entity schema the redactor cared about: flat, no nesting.
# German identifier classes get their own labels; generic PII stays generic.

ENTITY_CLASSES = [
    "PER",                    # personal names
    "ORG",                    # organizations, employers
    "LOC",                    # addresses, cities, streets
    "EMAIL",
    "PHONE_DE",               # +49 formats, incl. mobile prefixes
    "IBAN_DE",                # DE\d{20}, validated separately
    "SVNR",                   # Sozialversicherungsnummer (12 chars, incl. one letter)
    "STEUERID",               # Steueridentifikationsnummer (11 digits)
    "KVNR",                   # Krankenversicherungskarte-Nummer
    "PERSONALAUSWEIS",        # national ID card number
]

# IOB2 tag set derived from the classes above: "O" plus B-/I- per class (21 labels).
LABEL_LIST = ["O"] + [f"{p}-{c}" for c in ENTITY_CLASSES for p in ("B", "I")]
LABEL2ID = {label: i for i, label in enumerate(LABEL_LIST)}
ID2LABEL = {i: label for label, i in LABEL2ID.items()}

# Token-classification head on top of the DeBERTa encoder.
model = AutoModelForTokenClassification.from_pretrained(
    "microsoft/deberta-v3-base",
    num_labels=len(LABEL_LIST),
    id2label=ID2LABEL,
    label2id=LABEL2ID,
)
```
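
IOB2 labels are assigned per word, but the model sees sub-word pieces, so the labels have to be spread across pieces before training. The sketch below shows one common convention, assuming a `word_ids` list of the kind Hugging Face fast tokenizers return from `BatchEncoding.word_ids()`. The toy label map and the choice to relabel continuation pieces as `I-` are assumptions, not the original pipeline's code.

```python
IGNORE_INDEX = -100  # label id that PyTorch's cross-entropy skips by default


def align_labels(word_labels, word_ids, label2id):
    """Map word-level IOB2 tags onto sub-word positions."""
    aligned, previous = [], None
    for wid in word_ids:
        if wid is None:
            # Special tokens such as [CLS] / [SEP] carry no label.
            aligned.append(IGNORE_INDEX)
        elif wid != previous:
            # First piece of a word keeps the word's own tag.
            aligned.append(label2id[word_labels[wid]])
        else:
            # Continuation pieces: a B-X word continues as I-X.
            tag = word_labels[wid]
            if tag.startswith("B-"):
                tag = "I-" + tag[2:]
            aligned.append(label2id[tag])
        previous = wid
    return aligned


toy_label2id = {"O": 0, "B-SVNR": 1, "I-SVNR": 2}
word_labels = ["O", "B-SVNR"]              # two words in the sentence
word_ids = [None, 0, 1, 1, 1, None]        # second word split into three pieces

aligned = align_labels(word_labels, word_ids, toy_label2id)
print(aligned)
```

The alternative convention masks continuation pieces with `-100` so only the first piece of each word is scored. Labelling every piece, as here, makes the model responsible for the full extent of a long compound.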

The two identifier classes that mattered most, SVNR and KVNR, got their own validators downstream of the model: a regex for the format plus a checksum routine. The model's job was to say "there is a Sozialversicherungsnummer here." The validator's job was to say "and it passes the checksum." Neither could do the other's job. The model saw context (surrounding legal language); the validator saw structure (fixed length, a letter in a fixed position, a check digit).
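
As an illustration of the validator side, here is a sketch of a format-plus-checksum check for a Sozialversicherungsnummer. It follows the check-digit scheme as it is commonly documented (the letter is replaced by its two-digit alphabet position, the digits are weighted, and the digit sums of the products are added modulo 10). Treat that scheme, the regex, and the function name as assumptions to verify against the official specification; this is not the validator from the original pipeline.

```python
import re

# Assumed layout: 2-digit area code, 6-digit birth date, 1 letter,
# 2-digit serial number, 1 check digit (12 characters in total).
SVNR_PATTERN = re.compile(r"^\d{8}[A-Z]\d{3}$")
FACTORS = (2, 1, 2, 5, 7, 1, 2, 1, 2, 1, 2, 1)


def svnr_is_valid(candidate):
    """Return True if the candidate matches the format and its check digit."""
    compact = re.sub(r"[\s-]", "", candidate).upper()
    if not SVNR_PATTERN.match(compact):
        return False
    # Replace the letter with its two-digit position in the alphabet (A=01).
    letter_position = ord(compact[8]) - ord("A") + 1
    digits = compact[:8] + f"{letter_position:02d}" + compact[9:11]
    # Sum the digit sums of each weighted product.
    total = sum(sum(divmod(int(d) * f, 10)) for d, f in zip(digits, FACTORS))
    return total % 10 == int(compact[11])


print(svnr_is_valid("15 070649 C 10 3"))
print(svnr_is_valid("15070649C104"))
```

The sample number is a widely used illustrative value, not real data. A check like this only confirms structure; whether the string is personal data in context is still the tagger's call.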

The tokenizer test was the tell. Before I trained anything, I ran the same set of German compound identifiers through both tokenizers.

Krankenversicherungskarte-Nummer through XLM-R's SentencePiece split into six pieces with the boundaries falling in different places depending on the surrounding sentence. Through DeBERTa-v3's tokenizer, it consistently split into three or four pieces on morpheme-adjacent boundaries. Stability, not just count, was what mattered. A model can learn a compound if the sub-word decomposition is consistent. It cannot learn a compound whose decomposition drifts with context.
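
A stability check of this kind can be sketched without committing to a particular tokenizer: embed the same compound in several carrier sentences and group the sentences by how the compound was segmented. The harness below takes any function that returns character offsets. `segmentation_variants`, the templates, and the toy chunking tokenizer are hypothetical stand-ins, not the original test.

```python
WORD = "Krankenversicherungskarte-Nummer"
TEMPLATES = ["Die {word} fehlt.", "Bitte {word} angeben."]


def pieces_for_word(tokenize_with_offsets, sentence, word):
    """Return the word's segmentation as a tuple of strings, clipped to the word."""
    start = sentence.index(word)
    end = start + len(word)
    return tuple(
        sentence[max(s, start):min(e, end)]
        for s, e in tokenize_with_offsets(sentence)
        if s < end and e > start
    )


def segmentation_variants(tokenize_with_offsets, word, templates):
    """Group carrier sentences by the segmentation the word received in each."""
    variants = {}
    for template in templates:
        sentence = template.format(word=word)
        pieces = pieces_for_word(tokenize_with_offsets, sentence, word)
        variants.setdefault(pieces, []).append(template)
    return variants


def toy_word_chunker(sentence, size=6):
    """Stand-in tokenizer: cuts each whitespace-separated word into fixed chunks."""
    offsets, pos = [], 0
    for word in sentence.split(" "):
        for i in range(0, len(word), size):
            offsets.append((pos + i, pos + min(i + size, len(word))))
        pos += len(word) + 1
    return offsets


variants = segmentation_variants(toy_word_chunker, WORD, TEMPLATES)
print(len(variants), list(variants))
```

One key in `variants` means the segmentation was stable across contexts; several keys mean it drifted. With a Hugging Face fast tokenizer, the offsets would presumably come from calling it with `return_offsets_mapping=True` and `add_special_tokens=False` and reading the `offset_mapping` field.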

Disentangled attention did something specific for German. DeBERTa's disentangled attention separates content and position representations — the attention score between two tokens is computed from three components: content-to-content, content-to-position, and position-to-content. On a morphology-heavy language where the same root can appear with wildly different affixes and compounds, that separation let the model attend to a Sozialversicherung- root regardless of what suffix it was fused to. XLM-R's standard attention had to learn that invariance implicitly, and on 300k-500k docs, it didn't fully.
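
The three-term score can be written out directly. The sketch below follows the published DeBERTa formulation for a single head, with the learned projections assumed to be already applied and softmax omitted. It is a plain-Python illustration with random vectors, not the Hugging Face implementation, and all names are illustrative.

```python
import math
import random


def rel_bucket(i, j, k):
    """Clipped relative distance between positions i and j, mapped into [0, 2k)."""
    d = i - j
    if d <= -k:
        return 0
    if d >= k:
        return 2 * k - 1
    return d + k


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def disentangled_scores(q_c, k_c, q_r, k_r, k):
    """Return the three score components and their scaled sum for every (i, j)."""
    n, d = len(q_c), len(q_c[0])
    scale = 1.0 / math.sqrt(3 * d)  # three terms, hence 3d rather than d
    c2c = [[dot(q_c[i], k_c[j]) for j in range(n)] for i in range(n)]
    c2p = [[dot(q_c[i], k_r[rel_bucket(i, j, k)]) for j in range(n)] for i in range(n)]
    p2c = [[dot(k_c[j], q_r[rel_bucket(j, i, k)]) for j in range(n)] for i in range(n)]
    total = [
        [(c2c[i][j] + c2p[i][j] + p2c[i][j]) * scale for j in range(n)]
        for i in range(n)
    ]
    return c2c, c2p, p2c, total


def random_rows(rows, dim):
    return [[random.gauss(0, 1) for _ in range(dim)] for _ in range(rows)]


random.seed(0)
n, d, k = 4, 3, 2
q_c, k_c = random_rows(n, d), random_rows(n, d)          # content queries / keys
q_r, k_r = random_rows(2 * k, d), random_rows(2 * k, d)  # relative-position queries / keys
c2c, c2p, p2c, total = disentangled_scores(q_c, k_c, q_r, k_r, k)
print(total[0])
```

The position terms are indexed only by relative distance, so the content term for a given root does not change with where the root sits in the sequence. That is the separation the paragraph above leans on.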

We froze the bottom half. Fine-tuning all 12 DeBERTa layers on our corpus size started overfitting after 2-3 epochs. Freezing layers 0-5 and only fine-tuning 6-11 plus the classification head cut training time in half and gave a slightly better eval F1. The bottom layers were doing morphological work that our task didn't need to rewrite.
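
Partial freezing usually comes down to deciding, per parameter name, whether gradients should flow. The sketch below assumes parameter names of the form `deberta.encoder.layer.<N>.…` and `classifier.…`, which is how the Hugging Face DeBERTa-v3 token-classification model appears to name them. It also assumes everything outside layers 6-11 and the head stays frozen, embeddings included, which the text above does not state explicitly.

```python
import re

LAYER_RE = re.compile(r"encoder\.layer\.(\d+)\.")


def is_trainable(param_name, first_trainable_layer=6):
    """Decide whether a parameter should receive gradients."""
    if param_name.startswith("classifier"):
        return True  # the token-classification head always trains
    match = LAYER_RE.search(param_name)
    if match:
        return int(match.group(1)) >= first_trainable_layer
    # Assumption: embeddings and other shared encoder parameters stay frozen.
    return False


# Applying it to a loaded model would look like this (not executed here):
#
#     for name, param in model.named_parameters():
#         param.requires_grad = is_trainable(name)

example_names = [
    "deberta.embeddings.word_embeddings.weight",
    "deberta.encoder.layer.5.attention.self.query_proj.weight",
    "deberta.encoder.layer.6.attention.self.query_proj.weight",
    "classifier.weight",
]
decisions = {name: is_trainable(name) for name in example_names}
print(decisions)
```

Keeping the rule in a pure function makes it easy to print the decision for every parameter before training and confirm the split is the intended one.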

Figure: side-by-side tokenizer comparison on the German compound noun Krankenversicherungskarte-Nummer. XLM-R produces seven fragmented sub-word pieces (▁Kranken, versicherung, s, karte, -, Numm, er) with a predicted span that trails off and misses the tail; a single fused attention surface is drawn beneath. DeBERTa-v3 produces three morpheme-shaped pieces (Krankenversicherungs, karte, -Nummer) with a tight contiguous span; three disentangled attention surfaces (c↔c, c↔p, p↔c) are drawn beneath. Footer shows identifier-class F1 of 87.2 for XLM-R and 93.4 for DeBERTa-v3, a +6.2 delta.
Caption: Tokenizer stability plus a separated position channel yields tighter compound-noun spans. The 6-point F1 gap on identifier classes lives in this picture.

What surprised me

I had expected multilingual pre-training breadth to dominate. Every paper I'd read on cross-lingual transfer said the same thing: more languages seen in pre-training equals better zero-shot and better few-shot on any one of them. That's true in the zero-shot regime. It stopped being true once I had 300k-500k in-domain German documents to fine-tune on.

With enough in-domain data, pre-training breadth is a rounding error and pre-training depth in the modeling primitives — the tokenizer's morphological granularity, the attention mechanism's ability to separate lexical from positional signal — is what carries the delta. XLM-R had seen more German. DeBERTa had a better mechanism for the German it saw during fine-tuning.

The second surprise was operational. I had assumed the "one model, N languages" story would save real money in monitoring and ops. In practice, monitoring a single multilingual model well is harder than monitoring N monolingual models, because the failure modes are language-specific and a single dashboard aggregates them into noise. When we split into per-language artifacts, the German dashboard got quieter and more informative, not louder.


What I'd do differently at 10x scale

At 3M-5M German documents, the fine-tune I shipped is still probably the right shape. At 30M+ across five languages, I'd rethink the whole layout.

The path I would take:

  1. Adapter-based specialization on top of a shared multilingual base. Keep XLM-R (or its 2026-era successor) as the trunk, and train LoRA-style adapters per language and per PII entity class. You get the operational story of one artifact plus small deltas, and you get the specialization story of a per-language head. The six-point F1 delta I paid for by going monolingual is exactly the delta I'd try to recover in the adapter.
  2. Structured decoding, not just token classification. For identifier classes with strict formats — SVNR, IBAN, Steueridentifikationsnummer — a constrained-decoding pass on top of the tagger would eliminate an entire class of boundary error. The model proposes spans; a validator with a formal grammar accepts or rejects. This is close to what I already did with regex, but the regex was outside the training loop. At 10x scale, I'd fold it in.
  3. Active learning on the tail. The corpus at 300k-500k was mostly hand-curated. At 3M, a random sample won't hit the identifier classes densely enough. I'd build an uncertainty-weighted sampler that pulls the model's low-confidence predictions on unlabeled documents into the annotation queue. Every hour of annotator time should be spent on the boundary between confident and confused, not on re-labeling PER for the millionth time.
  4. Separate the "detect" model from the "classify" model. At scale, a small fast model that just says "there's PII in this chunk" can gate a larger slower model that says "and here's exactly what." The retrieval layer only needs the second model on the top-k results — not on every candidate. The current pipeline runs the full model on everything.
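
The active-learning idea in item 3 can be sketched as a ranking over unlabeled documents. This version scores each document by its single most uncertain token (peak entropy) and takes the top of the list, which is a simplification of the uncertainty-weighted sampler described above. The function names and toy probabilities are hypothetical.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one token's predicted label distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def document_uncertainty(token_probs):
    """Peak token entropy; one confused token is enough to flag the document."""
    return max((token_entropy(p) for p in token_probs), default=0.0)


def select_for_annotation(doc_probs, budget):
    """Return the ids of the `budget` most uncertain documents, most uncertain first."""
    ranked = sorted(
        doc_probs,
        key=lambda doc_id: document_uncertainty(doc_probs[doc_id]),
        reverse=True,
    )
    return ranked[:budget]


# Toy per-token label distributions for three unlabeled documents.
doc_probs = {
    "doc_a": [[1.0, 0.0, 0.0], [0.98, 0.01, 0.01]],   # confident everywhere
    "doc_b": [[0.34, 0.33, 0.33]],                    # near-uniform: very unsure
    "doc_c": [[0.9, 0.1, 0.0], [0.6, 0.4, 0.0]],      # one borderline token
}

queue = select_for_annotation(doc_probs, budget=2)
print(queue)
```

Peak entropy is used instead of mean entropy because identifier spans are a few tokens in a long document, and averaging would dilute them. A production sampler would likely mix in some random documents so the annotation queue does not collapse onto a single failure mode.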

The meta-lesson: multilingual is a strategy for the zero-shot case. Monolingual is a strategy for the in-domain case. In a regulated pipeline where in-domain data exists and the failure mode has legal consequences, specialize first and generalize later — never the other way around.


See also


More on the RAG platform this was built for — 2M+ documents, 400+ users, sub-2s p95, GDPR-compliant PII detection — is on the projects page.

Cite as: Saravanan, K. (2026). Why We Fine-Tuned DeBERTa-base and Not XLM-R for German PII. Kaushik Saravanan. https://www.kaushik.cv/blog/deberta-over-xlmr-german-pii