Skip to content

Add dutch language normalization#19

Merged
Karamouche merged 4 commits intomainfrom
feat/dutch-normalizer-bis
Apr 21, 2026
Merged

Add dutch language normalization#19
Karamouche merged 4 commits intomainfrom
feat/dutch-normalizer-bis

Conversation

@egenthon-cmd
Copy link
Copy Markdown
Contributor

@egenthon-cmd egenthon-cmd commented Apr 17, 2026

What does this PR do?

This change adds Dutch (nl) normalization—operators, number parsing, Flemish clitic expansion, dialect/word and phrase replacements, plus unit and e2e tests—so transcripts normalize consistently for evaluation.

Type of change

  • New language
  • Edit existing language (fix a replacement, tweak config, …)
  • New normalization step
  • Edit existing step (bug fix, behaviour change)
  • New preset version
  • Bug fix (other)
  • Refactor / docs / CI

Checklist

Only fill in the section(s) that match your change — delete the rest.


New language

  • Created normalization/languages/{lang}/ with operators.py, replacements.py, __init__.py
  • Word substitutions are in replacements.py (not hardcoded in operators.py)
  • LanguageConfig is filled in with the language's data (separators, currency words, digit words, …)
  • Subclassed LanguageOperators — only override methods where the logic changes, not just the data
  • Class is decorated with @register_language and imported in normalization/languages/__init__.py
  • Unit tests added in tests/unit/languages/
  • E2e CSV added in tests/e2e/files/{preset}/{lang}.csv (e.g. tests/e2e/files/gladia-3/fr.csv)

Edit existing language

  • New/changed word substitutions go in replacements.py, not inline in operators.py
  • If you changed a config field that can be None: the step reading it still handles None gracefully
  • Unit tests updated or added
  • E2e CSV updated if the expected output changed

New step

  • Unique name class attribute set (this is the key used in YAML presets)
  • Decorated with @register_step and imported in steps/text/__init__.py or steps/word/__init__.py
  • No hardcoded language values — read data from operators.config.* instead
  • If placeholder-based: protect + restore are both in steps/text/placeholders.py and pipeline/base.py's validate() is updated
  • Unit tests added in tests/unit/steps/
  • Step name added to the relevant preset YAML — or a new preset file created if existing presets are affected
  • If the docstring changed: ran uv run scripts/generate_step_docs.py

Edit existing step

  • Step name is unchanged — if the output changes, create a new step name + new preset instead
  • No language-specific logic or string literals added inside the step
  • Unit tests updated or added
  • If the docstring changed: ran uv run scripts/generate_step_docs.py

Preset change

  • Existing preset files are not modified — new behaviour goes in a new preset file
  • pipeline.validate() passes (runs automatically via loader.py)

How was this tested?

uv run pytest tests/

Summary by CodeRabbit

  • New Features

    • Added Dutch language support with comprehensive text normalization.
    • Converts written and mixed Dutch numbers to numeric formats, including large-number multipliers.
    • Expands Dutch/Flemish contractions and clitics to standard Dutch.
    • Normalizes common Dutch words and multiword phrases and dialectal variants.
    • Handles Dutch currency symbols and trailing currency word forms.
  • Tests

    • Added unit tests validating Dutch normalization and number/currency behavior.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 17, 2026

📝 Walkthrough

Walkthrough

Adds Dutch language support: new normalization/languages/dutch package (operators, number normalizer, replacements, sentence replacements), registers nl in the language registry, updates package exports, and introduces unit tests covering operators and number normalization.

Changes

Cohort / File(s) Summary
Package Init & Exports
normalization/languages/__init__.py, normalization/languages/dutch/__init__.py
Imports and exposes the new dutch subpackage; dutch exports DutchOperators and DUTCH_REPLACEMENTS.
Dutch Operators & Config
normalization/languages/dutch/operators.py
Adds DutchOperators (registered as "nl"), DUTCH_CONFIG, contraction/clitic normalization, and delegates written-number expansion to DutchNumberNormalizer.
Number Normalizer
normalization/languages/dutch/number_normalizer.py
Adds DutchNumberNormalizer with preprocessing for currency symbols and mixed digit+multiplier forms, alpha2digit integration, and postprocessing for currency pluralization.
Replacements Data
normalization/languages/dutch/replacements.py, normalization/languages/dutch/sentence_replacements.py
Adds DUTCH_REPLACEMENTS and DUTCH_SENTENCE_REPLACEMENTS dictionaries for token and phrase-level normalization (Flemish/colloquial variants).
Tests & Fixtures
tests/unit/languages/dutch_operators_test.py, tests/unit/languages/dutch_number_normalizer_test.py, tests/unit/steps/text/conftest.py
Adds unit tests covering registration, contraction expansion, written-number and currency handling, and a dutch_operators fixture for test steps.

Sequence Diagram

sequenceDiagram
    participant Client as Client Code
    participant Operators as DutchOperators
    participant Normalizer as DutchNumberNormalizer
    participant Regex as Regex Engine

    Client->>Operators: expand_written_numbers(text)
    Operators->>Normalizer: __call__(text)
    Normalizer->>Normalizer: _normalize_currency_symbols / _normalize_mixed_numbers
    Normalizer->>Normalizer: alpha2digit(text, "nl")
    Normalizer->>Regex: _apply_currency_plural_fixes(result)
    Regex-->>Normalizer: fixed_text
    Normalizer-->>Operators: normalized_text
    Operators-->>Client: result
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • Karamouche

Poem

🐇 I hopped through woorden, found numbers to mend,
Euros and clitics — I twisted and penned.
From "drie miljard" to "€10" in cheer,
Dutch words now sparkle — hop, hop, hurray dear!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 18.75% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The PR title clearly and accurately describes the main change: adding Dutch language normalization support to the codebase.
Description check ✅ Passed The PR description is comprehensive and follows the template structure. All required checklist items for 'New language' are marked complete, and the 'How was this tested' section is provided.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/dutch-normalizer-bis

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (7)
tests/unit/languages/dutch_operators_test.py (2)

45-51: Consolidate repeated get_word_replacements() calls.

Each assertion calls operators.get_word_replacements() again. The function is cheap, but it's cleaner and faster to bind once:

Proposed change
 def test_word_replacements(operators):
-    assert operators.get_word_replacements()["ge"] == "je"
-    assert operators.get_word_replacements()["da"] == "dat"
-    assert operators.get_word_replacements()["u"] == "je"
-    assert operators.get_word_replacements()["uw"] == "je"
-    assert operators.get_word_replacements()["okee"] == "oke"
-    assert operators.get_word_replacements()["euro"] == "euros"
+    replacements = operators.get_word_replacements()
+    assert replacements["ge"] == "je"
+    assert replacements["da"] == "dat"
+    assert replacements["u"] == "je"
+    assert replacements["uw"] == "je"
+    assert replacements["okee"] == "oke"
+    assert replacements["euro"] == "euros"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/languages/dutch_operators_test.py` around lines 45 - 51, The test
repeatedly calls operators.get_word_replacements(); change
test_word_replacements to call operators.get_word_replacements() once, assign
the result to a local variable (e.g., replacements) and use replacements["ge"],
replacements["da"], replacements["u"], replacements["uw"], replacements["okee"],
replacements["euro"] in the assertions to avoid repeated calls and improve
clarity.

54-62: Consider adding a negative/edge test for the compound-number Dutch order.

The existing tests cover single-word amounts (tien euro) and numeric+symbol forms (€10, $3.50). Given the Dutch ones + "en" + tens composition is a non-trivial branch in DutchNumberNormalizer.process_words (the pending_ones state), please add at least one assertion exercising it end-to-end, e.g. "vijfentwintig euro" or "een en twintig euro""25 euros". Without it, regressions in the compound-order logic won't be caught by this file.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/languages/dutch_operators_test.py` around lines 54 - 62, Add a
negative/edge unit test exercising the Dutch compound-number branch by asserting
that operators.expand_written_numbers correctly normalizes compound words like
"vijfentwintig euro" (or the spaced variant "een en twintig euro") to "25
euros"; this targets the DutchNumberNormalizer.process_words logic and its
pending_ones handling, so add an assertion in
tests/unit/languages/dutch_operators_test.py alongside
test_expand_written_numbers_euro_after_amount_dutch_order that calls
operators.expand_written_numbers("vijfentwintig euro") and expects "25 euros"
(and optionally another for "een en twintig euro").
normalization/languages/dutch/operators.py (1)

108-111: Move the DUTCH_REPLACEMENTS import to module top-level.

There's no circular-import reason to defer this — replacements.py imports nothing from this module. Other language implementations import the replacements dict at module top. A local import here adds per-call overhead for no gain.

Proposed change
 from normalization.languages.dutch.number_normalizer import DutchNumberNormalizer
+from normalization.languages.dutch.replacements import DUTCH_REPLACEMENTS
 from normalization.languages.dutch.sentence_replacements import (
     DUTCH_SENTENCE_REPLACEMENTS,
 )
@@
     def get_word_replacements(self) -> dict[str, str]:
-        from normalization.languages.dutch.replacements import DUTCH_REPLACEMENTS
-
         return DUTCH_REPLACEMENTS
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@normalization/languages/dutch/operators.py` around lines 108 - 111, The
get_word_replacements method currently performs a local import of
DUTCH_REPLACEMENTS which adds per-call overhead; move the import of
DUTCH_REPLACEMENTS from normalization.languages.dutch.replacements to the top of
this module and update get_word_replacements (the function name) to simply
return the top-level DUTCH_REPLACEMENTS constant so other language modules
follow the same pattern and avoid repeated local imports.
normalization/languages/dutch/replacements.py (1)

26-27: euro → euros canonical form is non-standard Dutch.

The Dutch plural of euro is euro (official) or colloquially euro's; euros is the English form. Since the pipeline also maps the currency symbol to "euros" in DUTCH_CONFIG.currency_symbol_to_word, this is internally consistent as a canonical evaluation token, but it's worth a one-line comment here documenting that euros is an internal canonical form (not a Dutch word) so future contributors don't "fix" it.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@normalization/languages/dutch/replacements.py` around lines 26 - 27, The
mapping "euro": "euros" uses a non-standard Dutch canonical token; update
replacements.py to add a one-line comment next to the "euro": "euros" entry
clarifying that "euros" is an internal canonical form (not the Dutch plural) and
noting that DUTCH_CONFIG.currency_symbol_to_word also maps '€' to "euros" for
consistency, so future contributors don't "fix" it to a real Dutch word.
normalization/languages/dutch/number_normalizer.py (3)

5-13: Move module docstring to the top of the file.

The triple-quoted string at lines 5–13 appears after the imports, so Python does not treat it as the module docstring (__doc__ will be None). Place it above the import statements for it to be picked up by tooling (help(), IDE hover, Sphinx).

Proposed change
-import re
-from fractions import Fraction
-from typing import Iterator, Match
-
-"""
-Dutch number normalizer: spelled-out numbers to digits.
-
-- Dutch compound order: ones + "en" + tens (e.g. "een en twintig" -> 21).
-- Vocabulary: nul, een, twee, ..., tien, elf, twaalf, ..., twintig, dertig, ...
-- Multipliers: honderd, duizend, miljoen, miljard, biljoen.
-- Handles currency (euro, dollar, pond, cent), percent (procent), and decimal (komma).
-- Currency output follows Dutch word order: amount then unit (e.g. €10 and "tien euro" -> "10 euros").
-"""
+"""
+Dutch number normalizer: spelled-out numbers to digits.
+
+- Dutch compound order: ones + "en" + tens (e.g. "een en twintig" -> 21).
+- Vocabulary: nul, een, twee, ..., tien, elf, twaalf, ..., twintig, dertig, ...
+- Multipliers: honderd, duizend, miljoen, miljard, biljoen.
+- Handles currency (euro, dollar, pond, cent), percent (procent), and decimal (komma).
+- Currency output follows Dutch word order: amount then unit (e.g. €10 and "tien euro" -> "10 euros").
+"""
+
+import re
+from fractions import Fraction
+from typing import Iterator, Match
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@normalization/languages/dutch/number_normalizer.py` around lines 5 - 13, The
module docstring currently appears after the import statements so __doc__ is
None; move the triple-quoted Dutch number normalizer string to the very top of
the file before any import lines so it becomes the module docstring available to
help()/IDE/Sphinx, ensuring you preserve the exact text and spacing and leave
the imports unchanged below it.

210-212: has_prefix check can misfire on 1-char numeric tokens like "0".

self.prefixes contains the string "0" (from "nul": "0" in preceding_prefixers), so a token of literal "0" gives current[0] == "0" in self.prefixeshas_prefix = True, and current_without_prefix = "" fails the regex. Control falls through to the current_lower not in self.words branch (since "0" isn't a Dutch word) and the token is yielded via output(current) — so functionally it survives, but the logic is accidentally load-bearing.

Two small cleanups worth considering:

  1. Exclude numeric digits from self.prefixes, or
  2. Short-circuit has_prefix to only trigger on non-digit first characters:
-            has_prefix = current[0] in self.prefixes
+            has_prefix = not current[0].isdigit() and current[0] in self.prefixes

No user-visible bug today as far as I can tell; flagging for defensive clarity.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@normalization/languages/dutch/number_normalizer.py` around lines 210 - 212,
The has_prefix check can mis-detect single-digit tokens (e.g., "0") because
self.prefixes may contain digit strings; change the predicate so it only treats
a leading prefix when the first character is not a digit (e.g., set has_prefix =
(not current[0].isdigit()) and current[0] in self.prefixes) or alternatively
remove digit entries from self.prefixes; update the logic around
has_prefix/current_without_prefix and the regex match in the function that
processes tokens (refer to variables has_prefix, current,
current_without_prefix, and self.prefixes) so single-digit numeric tokens are
not stripped into empty strings before the numeric regex check.

79-85: tens_plural keys like twintigen are not standard Dutch plurals.

Building plural variants as "twintig" + "en""twintigen", "dertig" + "en""dertigen", etc. is not a real Dutch form (Dutch uses tientallen, in de twintig, or bare twintig for "twenties"). Same critique applies to multipliers_plural (honderden, duizenden, miljoenen, miljarden, biljoenen are actually correct and common — those are fine).

For tens_plural specifically, the generated keys are unlikely to appear in real transcripts but will still match if an ASR produces them, potentially yielding surprising output. If these were copy-pasted from the English/French normalizer template, consider dropping tens_plural entirely for Dutch, or verify at least one of these forms against corpus data before shipping.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@normalization/languages/dutch/number_normalizer.py` around lines 79 - 85, The
tens_plural dict is generating nonstandard Dutch keys (e.g., "twintigen") and
should be removed to avoid false matches: delete the creation of tens_plural and
change tens_suffixed to not include it (keep only tens_ordinal and any genuinely
valid suffixed forms), leaving multipliers_plural as-is; update any references
that expect tens_plural to instead rely on tens_ordinal or validated
corpus-backed forms so tens_suffixed = {**self.tens_ordinal} (or merge only
validated entries) and adjust downstream logic that used tens_plural
accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@normalization/languages/dutch/operators.py`:
- Around line 16-24: The current _RE_CLITIC_S unconditionally matches any
leading non-word then _APOST s and will incorrectly convert genitive/adverbial
constructions like "'s werelds" to "is werelds"; update the replacement regex
logic so we only convert reduced copula instances: replace _RE_CLITIC_S with a
pattern that requires a copula-like preceding token (e.g. use a positive
lookbehind such as (?<=\b(?:dat|hoe|wat|er|hier|daar)\s){_APOST}s\b) or
alternatively add exclusions by extending _TEMPORAL_S_AFTER to include known
genitive words like werelds|lands if you prefer the whitelist approach; adjust
the code that compiles _RE_CLITIC_S to use the chosen narrower pattern so only
true "'s" copula cases are replaced.

In `@normalization/languages/dutch/replacements.py`:
- Around line 21-24: The unconditional mappings "u": "je" and "uw": "je" in
normalization/languages/dutch/replacements.py aggressively conflate formal and
informal pronouns and should be either documented or made opt-in; update the
module/class docstring (e.g. DutchOperators) to clearly state these rules are
tuned for Flemish informal customer-service transcripts and their impact on WER,
or refactor the mappings into a guarded/parameterized option (e.g. a boolean
flag passed to the normalization routine that enables the "u"/"uw" → "je"
replacement) so callers can opt in when appropriate; ensure the change
references the exact mapping keys "u" and "uw" and the normalization entry point
(the Dutch normalization function or DutchOperators) so callers know how to
enable/disable it.

---

Nitpick comments:
In `@normalization/languages/dutch/number_normalizer.py`:
- Around line 5-13: The module docstring currently appears after the import
statements so __doc__ is None; move the triple-quoted Dutch number normalizer
string to the very top of the file before any import lines so it becomes the
module docstring available to help()/IDE/Sphinx, ensuring you preserve the exact
text and spacing and leave the imports unchanged below it.
- Around line 210-212: The has_prefix check can mis-detect single-digit tokens
(e.g., "0") because self.prefixes may contain digit strings; change the
predicate so it only treats a leading prefix when the first character is not a
digit (e.g., set has_prefix = (not current[0].isdigit()) and current[0] in
self.prefixes) or alternatively remove digit entries from self.prefixes; update
the logic around has_prefix/current_without_prefix and the regex match in the
function that processes tokens (refer to variables has_prefix, current,
current_without_prefix, and self.prefixes) so single-digit numeric tokens are
not stripped into empty strings before the numeric regex check.
- Around line 79-85: The tens_plural dict is generating nonstandard Dutch keys
(e.g., "twintigen") and should be removed to avoid false matches: delete the
creation of tens_plural and change tens_suffixed to not include it (keep only
tens_ordinal and any genuinely valid suffixed forms), leaving multipliers_plural
as-is; update any references that expect tens_plural to instead rely on
tens_ordinal or validated corpus-backed forms so tens_suffixed =
{**self.tens_ordinal} (or merge only validated entries) and adjust downstream
logic that used tens_plural accordingly.

In `@normalization/languages/dutch/operators.py`:
- Around line 108-111: The get_word_replacements method currently performs a
local import of DUTCH_REPLACEMENTS which adds per-call overhead; move the import
of DUTCH_REPLACEMENTS from normalization.languages.dutch.replacements to the top
of this module and update get_word_replacements (the function name) to simply
return the top-level DUTCH_REPLACEMENTS constant so other language modules
follow the same pattern and avoid repeated local imports.

In `@normalization/languages/dutch/replacements.py`:
- Around line 26-27: The mapping "euro": "euros" uses a non-standard Dutch
canonical token; update replacements.py to add a one-line comment next to the
"euro": "euros" entry clarifying that "euros" is an internal canonical form (not
the Dutch plural) and noting that DUTCH_CONFIG.currency_symbol_to_word also maps
'€' to "euros" for consistency, so future contributors don't "fix" it to a real
Dutch word.

In `@tests/unit/languages/dutch_operators_test.py`:
- Around line 45-51: The test repeatedly calls
operators.get_word_replacements(); change test_word_replacements to call
operators.get_word_replacements() once, assign the result to a local variable
(e.g., replacements) and use replacements["ge"], replacements["da"],
replacements["u"], replacements["uw"], replacements["okee"],
replacements["euro"] in the assertions to avoid repeated calls and improve
clarity.
- Around line 54-62: Add a negative/edge unit test exercising the Dutch
compound-number branch by asserting that operators.expand_written_numbers
correctly normalizes compound words like "vijfentwintig euro" (or the spaced
variant "een en twintig euro") to "25 euros"; this targets the
DutchNumberNormalizer.process_words logic and its pending_ones handling, so add
an assertion in tests/unit/languages/dutch_operators_test.py alongside
test_expand_written_numbers_euro_after_amount_dutch_order that calls
operators.expand_written_numbers("vijfentwintig euro") and expects "25 euros"
(and optionally another for "een en twintig euro").
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b667681a-4f7b-4a0d-b733-df033b4abe17

📥 Commits

Reviewing files that changed from the base of the PR and between baec872 and b458749.

⛔ Files ignored due to path filters (1)
  • tests/e2e/files/gladia-3/nl.csv is excluded by !**/*.csv
📒 Files selected for processing (7)
  • normalization/languages/__init__.py
  • normalization/languages/dutch/__init__.py
  • normalization/languages/dutch/number_normalizer.py
  • normalization/languages/dutch/operators.py
  • normalization/languages/dutch/replacements.py
  • normalization/languages/dutch/sentence_replacements.py
  • tests/unit/languages/dutch_operators_test.py

Comment on lines +16 to +24
_TEMPORAL_S_AFTER = (
r"ochtends|morgens|middags|namiddags|avonds|nachts|"
r"zaterdags|zondags|weekend|weekends"
)
_RE_TEMPORAL_S = re.compile(
rf"(?<!\w){_APOST}s(\s+)({_TEMPORAL_S_AFTER})\b",
re.IGNORECASE,
)
_RE_CLITIC_S = re.compile(rf"(?<!\w){_APOST}s\b", re.IGNORECASE)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

_RE_CLITIC_S may over-expand non-temporal 's to is.

After the temporal pass consumes 's ochtends/avonds/..., any remaining 's preceded by a non-word char (start of string, space, punctuation) is unconditionally replaced by is. That corrupts genuine Dutch constructions where standalone 's is a reduced genitive/adverbial article, e.g.:

  • 's werelds grootste (= "of the world's largest") → is werelds grootste
  • 's lands belangenis lands belangen

These are less frequent than the informal "dat 's goed" → "dat is goed" case you want to cover, but silently wrong when they do appear. Consider either (a) extending _TEMPORAL_S_AFTER to include werelds|lands|..., or (b) only replacing 's with is when the previous token ends with a vowel/word consistent with a reduced copula (e.g. require (?:dat|hoe|wat|er|hier|daar)\s+ before it).

Also applies to: 99-100

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@normalization/languages/dutch/operators.py` around lines 16 - 24, The current
_RE_CLITIC_S unconditionally matches any leading non-word then _APOST s and will
incorrectly convert genitive/adverbial constructions like "'s werelds" to "is
werelds"; update the replacement regex logic so we only convert reduced copula
instances: replace _RE_CLITIC_S with a pattern that requires a copula-like
preceding token (e.g. use a positive lookbehind such as
(?<=\b(?:dat|hoe|wat|er|hier|daar)\s){_APOST}s\b) or alternatively add
exclusions by extending _TEMPORAL_S_AFTER to include known genitive words like
werelds|lands if you prefer the whitelist approach; adjust the code that
compiles _RE_CLITIC_S to use the chosen narrower pattern so only true "'s"
copula cases are replaced.

Comment thread normalization/languages/dutch/replacements.py
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test for steps must be in corresponding test files in /tests/unit/step/text

Comment thread normalization/languages/dutch/number_normalizer.py
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (2)
tests/unit/languages/dutch_number_normalizer_test.py (1)

17-55: Consider adding coverage for decimal separator and edge cases.

The Dutch decimal separator is "," per DUTCH_CONFIG.decimal_separator, yet currency tests only exercise . decimals (e.g. "$3.50"). Worth adding cases like "€10,50""10,50 euros", "¢50" (the ¢→cent mapping where singular==trailing and the postprocess branch is skipped), and a multi-digit+multiplier case (once the 12 miljoen semantics are decided) to prevent regressions in the config→normalizer coupling.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/languages/dutch_number_normalizer_test.py` around lines 17 - 55,
Add unit tests covering Dutch decimal separator and edge cases: extend or add
cases in test_currency_symbols_and_plural_trailing_words to include a
comma-decimal currency like "€10,50" expecting "10,50 euros" (verify
DUTCH_CONFIG.decimal_separator handling), add a cent symbol case such as "¢50"
expecting the mapped "50 cent" behavior (covering the postprocess branch where
singular==trailing is skipped), and add an extra multi-digit+multiplier case
(e.g., similar to test_multi_digit_then_miljoen_not_fully_merged) to assert the
decided semantics for "12 miljoen" (ensure DutchNumberNormalizer preserves the
digit token plus multiplier as expected).
normalization/languages/dutch/number_normalizer.py (1)

52-63: Plural→singular and euros-specific fix are hardcoded; decouple from config.

_singular_spoken_unit hardcodes the exact plural forms currently in DUTCH_CONFIG.currency_symbol_to_word, and _currency_plural_fix_patterns special-cases "euros" to also match euro's. If the config is ever changed (e.g., adding a new currency, or using a different plural spelling), both preprocessing and postprocessing will silently misbehave while tests still pass against the fixed config. Consider either (a) co-locating the singular/plural mapping with DUTCH_CONFIG (e.g., a small {plural: singular} map on the config) and passing it in, or (b) deriving the singular heuristically (strip trailing s/en) with a small explicit-overrides map for irregulars (ponden→pond). The euro's apostrophe-s form is a Dutch plural variant and would then be handled generically by the singular derivation rather than a special euros branch.

Also applies to: 98-106

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@normalization/languages/dutch/number_normalizer.py` around lines 52 - 63,
_singular_spoken_unit currently hardcodes plural→singular mappings (and
_currency_plural_fix_patterns special-cases "euros") which will break if
DUTCH_CONFIG changes; update the logic to derive singulars from DUTCH_CONFIG
instead of hardcoding: either add a plural→singular map to DUTCH_CONFIG and use
that in _singular_spoken_unit (passed in or imported), or implement a generic
heuristic (strip trailing "s"/"en" and remove apostrophe-s) with a small
explicit override map for irregulars (e.g., "ponden"→"pond") so both
preprocessing and postprocessing use the same source of truth and handle forms
like "euro's" without bespoke branches.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@normalization/languages/dutch/number_normalizer.py`:
- Around line 127-137: The runtime ImportError check for "alpha2digit is None"
inside DutchNumberNormalizer.__init__ is dead because alpha2digit is imported at
module scope; remove the conditional and its raise so __init__ simply assigns
self._alpha2digit = alpha2digit and continues to set
self._currency_symbol_to_word and self._currency_plural_fixes; alternatively, if
you prefer lazy-friendly behavior, move the module import of alpha2digit into a
try/except ImportError at module scope (setting alpha2digit = None on failure)
and keep the existing check — but do not keep both patterns.
- Around line 34-49: The current _normalize_mixed_numbers leaves multi-digit
numerals like "12 miljoen" unchanged so later alpha2digit produces "12 1000000"
(wrong); change _normalize_mixed_numbers (matched via _RE_MIXED_NUMBER) to
detect when match.group(1) is a multi-digit numeric string and match.group(2) is
a large-number multiplier, parse the numeric part to an int, multiply by the
multiplier value (use the same mapping used for single-word multipliers, e.g.
the numeric value for "miljoen"/"miljard"), and replace the match with the
computed integer string (e.g. "12000000") instead of leaving it; keep the
existing single-digit branch that converts digits to words using
_DIGIT_TO_DUTCH, and update the failing test
(test_multi_digit_then_miljoen_not_fully_merged) to expect "12 miljoen" →
"12000000".

In `@tests/unit/languages/dutch_number_normalizer_test.py`:
- Around line 34-36: The test test_multi_digit_then_miljoen_not_fully_merged
currently asserts the old incorrect behavior ("12 miljoen" -> "12 1000000");
update the expectation to the corrected multiplication result ("12000000") and
rename the test to reflect correct multiplication semantics (e.g.,
test_multi_digit_times_miljoen_fully_merged) to match the fix made in
_normalize_mixed_numbers in normalization/languages/dutch/number_normalizer.py;
keep the test using the DutchNumberNormalizer fixture and ensure the assertion
uses the new expected string.

---

Nitpick comments:
In `@normalization/languages/dutch/number_normalizer.py`:
- Around line 52-63: _singular_spoken_unit currently hardcodes plural→singular
mappings (and _currency_plural_fix_patterns special-cases "euros") which will
break if DUTCH_CONFIG changes; update the logic to derive singulars from
DUTCH_CONFIG instead of hardcoding: either add a plural→singular map to
DUTCH_CONFIG and use that in _singular_spoken_unit (passed in or imported), or
implement a generic heuristic (strip trailing "s"/"en" and remove apostrophe-s)
with a small explicit override map for irregulars (e.g., "ponden"→"pond") so
both preprocessing and postprocessing use the same source of truth and handle
forms like "euro's" without bespoke branches.

In `@tests/unit/languages/dutch_number_normalizer_test.py`:
- Around line 17-55: Add unit tests covering Dutch decimal separator and edge
cases: extend or add cases in test_currency_symbols_and_plural_trailing_words to
include a comma-decimal currency like "€10,50" expecting "10,50 euros" (verify
DUTCH_CONFIG.decimal_separator handling), add a cent symbol case such as "¢50"
expecting the mapped "50 cent" behavior (covering the postprocess branch where
singular==trailing is skipped), and add an extra multi-digit+multiplier case
(e.g., similar to test_multi_digit_then_miljoen_not_fully_merged) to assert the
decided semantics for "12 miljoen" (ensure DutchNumberNormalizer preserves the
digit token plus multiplier as expected).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fc6ce718-a9de-47ff-a18c-7c7d32f4780f

📥 Commits

Reviewing files that changed from the base of the PR and between b458749 and eb6f0a4.

📒 Files selected for processing (5)
  • normalization/languages/dutch/number_normalizer.py
  • normalization/languages/dutch/operators.py
  • tests/unit/languages/dutch_number_normalizer_test.py
  • tests/unit/languages/dutch_operators_test.py
  • tests/unit/steps/text/conftest.py
✅ Files skipped from review due to trivial changes (2)
  • tests/unit/steps/text/conftest.py
  • tests/unit/languages/dutch_operators_test.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • normalization/languages/dutch/operators.py

Comment thread normalization/languages/dutch/number_normalizer.py
Comment thread normalization/languages/dutch/number_normalizer.py
Comment thread tests/unit/languages/dutch_number_normalizer_test.py
@Karamouche Karamouche self-requested a review April 21, 2026 15:15
Comment thread normalization/languages/dutch/number_normalizer.py
@Karamouche Karamouche merged commit 718527e into main Apr 21, 2026
10 checks passed
@Karamouche Karamouche deleted the feat/dutch-normalizer-bis branch April 21, 2026 15:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants