fix: implement number normalizers for German, Italian, enhance Dutch normalization with digit words by Karamouche · Pull Request #20 · gladiaio/normalization

Karamouche · 2026-04-21T16:59:51Z

What does this PR do?

Implements number normalizers for German and Italian, and enhances Dutch normalization with digit words.

Type of change

Tests

Tested different ways of saying numbers in many languages.

Summary by CodeRabbit

New Features
- Enhanced number normalization for Dutch, German, and Italian languages with comprehensive digit-word mappings
- Improved conversion of written-out number expressions to digits across German and Italian
- Better handling of mixed digit-word number patterns (e.g., "2 hundred")
- Added Italian sentence-level phrase replacements for improved text normalization


coderabbitai · 2026-04-21T17:00:04Z

📝 Walkthrough

Walkthrough

The changes introduce number normalization pipelines for Dutch, German, and Italian languages. Dutch operators gain digit-word mappings and expanded number configuration. German and Italian each receive new dedicated number normalizer modules that preprocess mixed digit-word patterns and apply alpha2digit conversion, with corresponding operator updates and configuration extensions.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Dutch Language Support `normalization/languages/dutch/operators.py` | Added `_DUTCH_DIGIT_WORDS` mapping ("nul"–"negen" to digits 0–9), extended `DUTCH_CONFIG` with `digit_words` and `number_words` covering digit words, teens, tens, and large scales, and set `plus_word="plus"`. |
| German Language Support `normalization/languages/german/number_normalizer.py`, `normalization/languages/german/operators.py` | Added `GermanNumberNormalizer` class with regex-based preprocessing to convert mixed digit-multiplier forms (e.g., `2 hundert` → `zwei hundert`), followed by alpha2digit conversion and targeted word fixes. Extended `GERMAN_CONFIG` with `digit_words`, `number_words`, and `plus_word`, and added `expand_written_numbers()` method to `GermanOperators`. |
| Italian Language Support `normalization/languages/italian/number_normalizer.py`, `normalization/languages/italian/operators.py`, `normalization/languages/italian/sentence_replacements.py` | Added `ItalianNumberNormalizer` class with similar mixed-number preprocessing and post-pass word fixing. Refactored `ItalianOperators` to use the normalizer via new `expand_written_numbers()` method, removed legacy `fix_one_word_in_numeric_contexts()` method, and extracted sentence replacements into a dedicated module. |

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant GermanNumberNormalizer
    participant RegexPreprocessor as Regex<br/>Preprocessor
    participant alpha2digit
    participant PostProcessor

    Client->>GermanNumberNormalizer: __call__(text)
    GermanNumberNormalizer->>RegexPreprocessor: Mixed number patterns<br/>(e.g., "2 hundert")
    RegexPreprocessor->>GermanNumberNormalizer: Expanded form<br/>(e.g., "zwei hundert")
    GermanNumberNormalizer->>alpha2digit: Normalized text
    alpha2digit->>GermanNumberNormalizer: Partially converted<br/>(some words remain)
    GermanNumberNormalizer->>PostProcessor: Post-pass fixes<br/>("zwei"→"2", "null"→"0")
    PostProcessor->>GermanNumberNormalizer: Final normalized text
    GermanNumberNormalizer->>Client: return normalized_text

sequenceDiagram
    participant Client
    participant ItalianNumberNormalizer
    participant RegexPreprocessor as Regex<br/>Preprocessor
    participant alpha2digit
    participant PostProcessor

    Client->>ItalianNumberNormalizer: __call__(text)
    ItalianNumberNormalizer->>RegexPreprocessor: Mixed number patterns<br/>(e.g., "2 cento", "3 mila")
    RegexPreprocessor->>ItalianNumberNormalizer: Expanded form<br/>(e.g., "due cento", "tre mila")
    ItalianNumberNormalizer->>alpha2digit: Normalized text
    alpha2digit->>ItalianNumberNormalizer: Partially converted<br/>(some words remain)
    ItalianNumberNormalizer->>PostProcessor: Post-pass fixes<br/>("uno"→"1", "due"→"2")
    PostProcessor->>ItalianNumberNormalizer: Final normalized text
    ItalianNumberNormalizer->>Client: return normalized_text
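
The three stages in the diagrams above (regex pre-pass, number-word conversion, post-pass fixes) can be sketched as follows. This is a minimal illustration, not the PR's code: the digit mapping, the restricted multiplier list, and the `normalize` function are assumptions, and the conversion stage is an injected callable standing in for alpha2digit so the sketch runs without external dependencies.

```python
import re
from typing import Callable

# Hypothetical digit-to-word table used by the pre-pass (assumed contents).
_DIGIT_TO_GERMAN = {
    "0": "null", "1": "ein", "2": "zwei", "3": "drei", "4": "vier",
    "5": "fünf", "6": "sechs", "7": "sieben", "8": "acht", "9": "neun",
}

# Pre-pass: a single digit followed by a multiplier word, e.g. "2 hundert".
_RE_MIXED = re.compile(r"\b(\d)\s+(hundert|tausend)\b", re.IGNORECASE)

# Post-pass: words the converter may leave behind ("zwei" -> "2", "null" -> "0").
_POST_FIXES = [
    (re.compile(r"\bzwei\b", re.IGNORECASE), "2"),
    (re.compile(r"\bnull\b", re.IGNORECASE), "0"),
]


def normalize(text: str, convert: Callable[[str], str]) -> str:
    """Run pre-pass, conversion, and post-pass in order."""
    # 1. Rewrite "2 hundert" as "zwei hundert" so the converter sees one number.
    text = _RE_MIXED.sub(
        lambda m: f"{_DIGIT_TO_GERMAN[m.group(1)]} {m.group(2)}", text
    )
    # 2. Convert number words to digits (stand-in for the real converter).
    text = convert(text)
    # 3. Fix any isolated digit words that survived conversion.
    for pattern, digit in _POST_FIXES:
        text = pattern.sub(digit, text)
    return text
```

Passing the converter as a parameter keeps the ordering of the stages visible and lets each stage be exercised in isolation, for example with an identity function to check only the pre-pass and post-pass.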

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

normalization#19: Directly modifies the Dutch operators module and DUTCH_CONFIG with identical changes (adding digit_words, expanding number_words, and setting plus_word).

Suggested reviewers

lrossillon-gladia

Poem

🐰 Numbers dance across tongues with grace,
Dutch, German, Italian in their place,
Digit words mapped with regex keen,
From "nul" to "null," normalization supreme,
A multilingual feast, a hoppy embrace! ✨

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The description is incomplete. It lacks required checklist details, specific test information, and doesn't match the template structure for 'Edit existing language' or other applicable categories. | Complete the description by selecting all applicable change types with checkmarks, providing specific test output, and adding E2E CSV updates if needed for German, Italian, and Dutch. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 26.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main changes: implementing number normalizers for German and Italian, and enhancing Dutch normalization with digit words. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@normalization/languages/dutch/operators.py`:
- Around line 35-46: Add unit tests to cover the Dutch digit-word "een" in
numeric contexts handled by fix_dot_adjacent_number_words and
protect_plus_word_before_digit_words: assert that "versie punt een" transforms
"een" → "1" (dot/version/IP context) and that "plus een" is treated as a
phone-country-code context by protect_plus_word_before_digit_words, and also add
an edge-case test like "Dit kost plus een euro" to ensure it does not
incorrectly convert in ordinary currency phrases; use the existing
_DUTCH_DIGIT_WORDS mapping and the same test harness used for other Dutch
normalization tests to locate and validate behavior.

In `@normalization/languages/german/number_normalizer.py`:
- Around line 31-47: The regex _RE_MIXED_NUMBER should accept singular forms of
the large-scale words and the replacer in _normalize_mixed_numbers must
special-case the digit "1" to use the feminine form "eine" for feminine scales
(million/millionen, milliarde/milliarden, billion/billionen) instead of the
default _DIGIT_TO_GERMAN value that yields "ein"; update the pattern for
_RE_MIXED_NUMBER to include explicit singular variants (e.g. million, milliarde,
billion as well as their plural forms) and modify the replace(match) in
_normalize_mixed_numbers to return "eine {multiplier}" when number == "1" and
multiplier is in the feminine set, otherwise fall back to the existing mapping
(so "2 million" is matched and becomes "zwei million" and "1 milliarde" becomes
"eine milliarde").

In `@normalization/languages/german/operators.py`:
- Around line 9-12: The digit-word mapping _GERMAN_DIGIT_WORDS incorrectly
includes the ambiguous token "ein", which causes
ProtectPlusWordBeforeDigitWordsStep (which consumes config.digit_words) to
misclassify normal phrases like "plus ein bisschen" as phone-plus context;
remove "ein" from _GERMAN_DIGIT_WORDS (leave "eins" if needed) and ensure any
configuration or references to config.digit_words no longer contain the
ambiguous "ein" token so plus-word protection only triggers on unambiguous digit
words.

In `@normalization/languages/italian/number_normalizer.py`:
- Around line 16-19: The regex _RE_MIXED_NUMBER currently uses "mila?" which
matches "mil" or "mila" and misses the correct Italian "mille"; update the
pattern used in _RE_MIXED_NUMBER so it explicitly matches "mille" and "mila"
(e.g., replace the "mila?" token with an explicit alternation like
"(mille|mila)") while preserving other alternatives (cento, milione/milioni,
miliardo/miliardi) and re.IGNORECASE to ensure inputs like "1 mille" are
correctly caught by the mixed-number pre-pass.

In `@normalization/languages/italian/operators.py`:
- Around line 108-109: The Italian operator needs a language-specific override
of fix_one_word_in_numeric_contexts to convert isolated "uno" to "1" when
adjacent to digits (mirroring English/Spanish implementations) because
expand_written_numbers only handles full number words; implement
operators.fix_one_word_in_numeric_contexts() to use a regex that matches
word-boundary "uno" when preceded or followed by digits (or digit sequences with
separators) and replace it with "1" while preserving surrounding
whitespace/punctuation, ensuring the method name
fix_one_word_in_numeric_contexts and the existing expand_written_numbers remain
unchanged.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0b8f7a32-1869-4f61-8f00-ce6c5d8af1d1

📥 Commits

Reviewing files that changed from the base of the PR and between 165bd95 and 4685f98.

⛔ Files ignored due to path filters (6)

tests/e2e/files/gladia-3/de.csv is excluded by !**/*.csv
tests/e2e/files/gladia-3/en.csv is excluded by !**/*.csv
tests/e2e/files/gladia-3/es.csv is excluded by !**/*.csv
tests/e2e/files/gladia-3/fr.csv is excluded by !**/*.csv
tests/e2e/files/gladia-3/it.csv is excluded by !**/*.csv
tests/e2e/files/gladia-3/nl.csv is excluded by !**/*.csv

📒 Files selected for processing (6)

normalization/languages/dutch/operators.py
normalization/languages/german/number_normalizer.py
normalization/languages/german/operators.py
normalization/languages/italian/number_normalizer.py
normalization/languages/italian/operators.py
normalization/languages/italian/sentence_replacements.py

coderabbitai · 2026-04-21T17:05:39Z

+_DUTCH_DIGIT_WORDS: dict[str, str] = {
+    "nul": "0",
+    "een": "1",
+    "twee": "2",
+    "drie": "3",
+    "vier": "4",
+    "vijf": "5",
+    "zes": "6",
+    "zeven": "7",
+    "acht": "8",
+    "negen": "9",
+}


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, understand the file structure and how digit_words are used
find . -type f -name "*.py" | grep -E "(dutch|operators)" | head -20

Repository: gladiaio/normalization

Length of output: 715

🏁 Script executed:

# Search for the functions mentioned in the review comment
rg "fix_dot_adjacent_number_words|protect_plus_word_before_digit_words" --type py -A 5

Repository: gladiaio/normalization

Length of output: 3696

🏁 Script executed:

# Check how digit_words are constructed and passed to these functions
rg "digit_words" --type py -B 2 -A 2 | head -100

Repository: gladiaio/normalization

Length of output: 8296

🏁 Script executed:

# Look for tests related to Dutch and "een"
find . -type f -name "*.py" | xargs grep -l "dutch\|Nederlandse" | grep test

Repository: gladiaio/normalization

Length of output: 203

🏁 Script executed:

# Read the full implementation of fix_dot_adjacent_number_words
cat -n normalization/steps/text/fix_dot_adjacent_number_words.py

Repository: gladiaio/normalization

Length of output: 1597

🏁 Script executed:

# Read the full implementation of protect_plus_word_before_digit_words
cat -n normalization/steps/text/protect_plus_word_before_digit_words.py

Repository: gladiaio/normalization

Length of output: 1606

🏁 Script executed:

# Check Dutch tests for "een" coverage
cat -n tests/unit/languages/dutch_operators_test.py | head -150

Repository: gladiaio/normalization

Length of output: 2352

🏁 Script executed:

# Check if there are any test cases with "een" as the indefinite article
rg "een" tests/unit/languages/dutch_operators_test.py -B 2 -A 2

Repository: gladiaio/normalization

Length of output: 424

🏁 Script executed:

# Check the dutch number normalizer tests for "een" coverage
cat -n tests/unit/languages/dutch_number_normalizer_test.py | head -100

Repository: gladiaio/normalization

Length of output: 2562

🏁 Script executed:

# Search for any integration tests or broader test scenarios
rg "een" tests/unit/languages/dutch_number_normalizer_test.py -B 2 -A 2

Repository: gladiaio/normalization

Length of output: 227

🏁 Script executed:

# Check if there are any other test files that might test the digit_words steps
find tests -type f -name "*.py" -exec grep -l "fix_dot_adjacent\|protect_plus_word" {} \;

Repository: gladiaio/normalization

Length of output: 48

🏁 Script executed:

# Check if there are step-level tests for these functions
find tests -type f -name "*.py" | xargs grep -l "FixDotAdjacentNumberWordsStep\|ProtectPlusWordBeforeDigitWordsStep" 2>/dev/null

Repository: gladiaio/normalization

Length of output: 50

🏁 Script executed:

# Search for any integration tests that might run these steps end-to-end
rg "fix_dot_adjacent\|protect_plus_word" tests/ -l

Repository: gladiaio/normalization

Length of output: 48

🏁 Script executed:

# Let's also check the Dutch config to understand the full context
cat -n normalization/languages/dutch/operators.py

Repository: gladiaio/normalization

Length of output: 5161

"een" matching in numeric contexts is intentional but warrants test coverage.

The functions fix_dot_adjacent_number_words and protect_plus_word_before_digit_words use word boundaries (\b) in their regex patterns, which provides protection against false positives in arbitrary text. However, "een" will still match when it appears in the specific numeric patterns these steps are designed for:

After "punt" (dot): "versie punt een" → "versie punt 1" ✓ (intended for IPs/versions)

After "plus" (plus): "plus een" → converts to phone context marker (intended for +1 country codes)

In these contexts, the behavior is correct. However, there's a narrow edge case risk: ambiguous sentences like "Dit kost plus een euro" could be misparsed. Adding explicit tests for "een" in these numeric-context patterns would confirm the behavior is safe for your typical input corpus.
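
A minimal sketch of the suggested tests follows. The two regexes are hypothetical approximations of what `fix_dot_adjacent_number_words` and `protect_plus_word_before_digit_words` do (word-boundary matching after "punt" and "plus"); the repository's actual step code is not reproduced here, and the helper names are assumptions for illustration.

```python
import re

DUTCH_DIGIT_WORDS = {
    "nul": "0", "een": "1", "twee": "2", "drie": "3", "vier": "4",
    "vijf": "5", "zes": "6", "zeven": "7", "acht": "8", "negen": "9",
}
_ALT = "|".join(re.escape(word) for word in DUTCH_DIGIT_WORDS)

# Assumed shape of the dot-adjacent rule: "punt <digit word>" -> "punt <digit>".
_RE_DOT_DIGIT_WORD = re.compile(rf"\b(punt)\s+({_ALT})\b", re.IGNORECASE)
# Assumed shape of the plus-protection trigger: "plus <digit word>".
_RE_PLUS_DIGIT_WORD = re.compile(rf"\bplus\s+(?:{_ALT})\b", re.IGNORECASE)


def fix_dot_adjacent(text: str) -> str:
    """Replace a digit word that directly follows 'punt' with its digit."""
    return _RE_DOT_DIGIT_WORD.sub(
        lambda m: f"{m.group(1)} {DUTCH_DIGIT_WORDS[m.group(2).lower()]}", text
    )


def has_plus_digit_context(text: str) -> bool:
    """Return True when 'plus' is directly followed by a digit word."""
    return _RE_PLUS_DIGIT_WORD.search(text) is not None


# Word boundaries protect longer words such as "eend" ...
assert fix_dot_adjacent("versie punt een") == "versie punt 1"
assert fix_dot_adjacent("punt eend") == "punt eend"
# ... but they cannot tell the numeral "een" from the article "een".
assert has_plus_digit_context("plus een twee")
assert has_plus_digit_context("Dit kost plus een euro")
```

Under these assumed patterns the last assertion passes, which is exactly the edge case described above: a pure word-boundary pattern treats the article "een" after "plus" the same way as the numeral.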


coderabbitai · 2026-04-21T17:05:39Z

+_RE_MIXED_NUMBER = re.compile(
+    r"\b(\d+)\s+(hundert|tausend|millionen?|milliarden?|billionen?)\b",
+    re.IGNORECASE,
+)
+
+_RE_ZWEI = re.compile(r"\bzwei\b", re.IGNORECASE)
+_RE_NULL = re.compile(r"\bnull\b", re.IGNORECASE)
+
+
+def _normalize_mixed_numbers(text: str) -> str:
+    """Convert '2 hundert' → 'zwei hundert' so alpha2digit yields 200, not '2 100'."""
+
+    def replace(match: re.Match) -> str:
+        number = match.group(1)
+        multiplier = match.group(2)
+        if len(number) == 1 and number in _DIGIT_TO_GERMAN:
+            return f"{_DIGIT_TO_GERMAN[number]} {multiplier}"


⚠️ Potential issue | 🟠 Major

Fix mixed digit + German scale preprocessing for singular scale words.

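The pattern on line 32 does not match the singular forms `million` and `billion` (`millionen?` only makes the final `n` optional, so it matches `millione` or `millionen`), and line 47 rewrites `1 milliarde` as `ein milliarde`. That leaves common inputs like `2 million` unhandled and can make `1 Million`/`Milliarde`/`Billion` less alpha2digit-friendly.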

🐛 Proposed fix

+_FEMININE_SINGULAR_SCALES = {"million", "milliarde", "billion"}
+
 _RE_MIXED_NUMBER = re.compile(
-    r"\b(\d+)\s+(hundert|tausend|millionen?|milliarden?|billionen?)\b",
+    r"\b(\d+)\s+(hundert|tausend|million(?:en)?|milliarde(?:n)?|billion(?:en)?)\b",
     re.IGNORECASE,
 )

         number = match.group(1)
         multiplier = match.group(2)
         if len(number) == 1 and number in _DIGIT_TO_GERMAN:
-            return f"{_DIGIT_TO_GERMAN[number]} {multiplier}"
+            digit_word = _DIGIT_TO_GERMAN[number]
+            if number == "1" and multiplier.lower() in _FEMININE_SINGULAR_SCALES:
+                digit_word = "eine"
+            return f"{digit_word} {multiplier}"
         return match.group(0)
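
For reference, the proposed fix can be assembled into a standalone function as sketched below. The contents of `_DIGIT_TO_GERMAN` are an assumption (the mapping is not shown in this review; only its `"1"` → `"ein"` entry is implied by the comment), and the function is renamed without the leading underscore to mark it as an illustration rather than the module's code.

```python
import re

# Assumed contents; the review only implies that "1" maps to "ein".
_DIGIT_TO_GERMAN = {
    "0": "null", "1": "ein", "2": "zwei", "3": "drei", "4": "vier",
    "5": "fünf", "6": "sechs", "7": "sieben", "8": "acht", "9": "neun",
}

# Singular scale words that take the feminine form "eine".
_FEMININE_SINGULAR_SCALES = {"million", "milliarde", "billion"}

# Explicit optional plural suffixes so the singular forms match too.
_RE_MIXED_NUMBER = re.compile(
    r"\b(\d+)\s+(hundert|tausend|million(?:en)?|milliarde(?:n)?|billion(?:en)?)\b",
    re.IGNORECASE,
)


def normalize_mixed_numbers(text: str) -> str:
    """Convert '2 hundert' to 'zwei hundert'; leave multi-digit numbers alone."""

    def replace(match: re.Match) -> str:
        number = match.group(1)
        multiplier = match.group(2)
        if len(number) == 1 and number in _DIGIT_TO_GERMAN:
            digit_word = _DIGIT_TO_GERMAN[number]
            # "1 Milliarde" should read "eine Milliarde", not "ein Milliarde".
            if number == "1" and multiplier.lower() in _FEMININE_SINGULAR_SCALES:
                digit_word = "eine"
            return f"{digit_word} {multiplier}"
        return match.group(0)

    return _RE_MIXED_NUMBER.sub(replace, text)
```

With this version, `normalize_mixed_numbers("2 million")` returns `"zwei million"` and `normalize_mixed_numbers("1 Milliarde")` returns `"eine Milliarde"`, while `"12 tausend"` is left unchanged because only single-digit prefixes are rewritten.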


coderabbitai · 2026-04-21T17:05:39Z

+_GERMAN_DIGIT_WORDS: dict[str, str] = {
+    "null": "0",
+    "ein": "1",
+    "eins": "1",


⚠️ Potential issue | 🟠 Major

Keep ambiguous ein out of digit_words used by plus protection.

Line 11 makes ein a digit token while Line 79 enables the plus-word protection path. Since ProtectPlusWordBeforeDigitWordsStep consumes config.digit_words, normal phrases like plus ein bisschen can be treated as phone-plus context and later become + ein bisschen.

🐛 Proposed fix

 _GERMAN_DIGIT_WORDS: dict[str, str] = {
     "null": "0",
-    "ein": "1",
     "eins": "1",
     "zwei": "2",

     digit_words=_GERMAN_DIGIT_WORDS,
     number_words=[
+        "ein",
         *_GERMAN_DIGIT_WORDS,
         "zehn",

Also applies to: 49-50, 79-79
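
The effect of the ambiguous token can be illustrated with a small sketch. `build_plus_pattern` is a hypothetical helper that approximates how a plus-protection step might build its trigger from `digit_words`; it is not the implementation of `ProtectPlusWordBeforeDigitWordsStep`, and the digit-word sets are shortened for brevity.

```python
import re


def build_plus_pattern(digit_words: dict[str, str], plus_word: str = "plus") -> re.Pattern:
    """Build a 'plus <digit word>' trigger from a digit-word mapping (assumed shape)."""
    # Longest words first so "eins" is tried before "ein".
    words = sorted(digit_words, key=len, reverse=True)
    alternation = "|".join(re.escape(word) for word in words)
    return re.compile(
        rf"\b{re.escape(plus_word)}\s+(?:{alternation})\b", re.IGNORECASE
    )


WITH_EIN = {"null": "0", "ein": "1", "eins": "1", "zwei": "2"}
WITHOUT_EIN = {"null": "0", "eins": "1", "zwei": "2"}

with_ein = build_plus_pattern(WITH_EIN)
without_ein = build_plus_pattern(WITHOUT_EIN)

# The article "ein" triggers the phone-plus context when it is a digit word.
assert with_ein.search("plus ein bisschen") is not None
# Without "ein", ordinary prose is left alone ...
assert without_ein.search("plus ein bisschen") is None
# ... while unambiguous digit words still trigger the protection.
assert without_ein.search("plus eins zwei") is not None
```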


coderabbitai · 2026-04-21T17:05:39Z

+_RE_MIXED_NUMBER = re.compile(
+    r"\b(\d+)\s+(cento|mila?|milioni?|miliardi?)\b",
+    re.IGNORECASE,
+)


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
python - <<'PY'
import re
pat = re.compile(r"\b(\d+)\s+(cento|mila?|milioni?|miliardi?)\b", re.IGNORECASE)
for text in ("2 mila", "1 mille", "2 mil"):
    print(f"{text!r}: {bool(pat.search(text))}")
PY

Repository: gladiaio/normalization

Length of output: 111

🏁 Script executed:

find . -type f -name "*.py" -o -name "*.json" | head -20

Repository: gladiaio/normalization

Length of output: 930

🏁 Script executed:

fd -e py -e json | grep -i italian | head -20

Repository: gladiaio/normalization

Length of output: 312

🏁 Script executed:

rg -i "mille|mila" --type py | head -30

Repository: gladiaio/normalization

Length of output: 355

🏁 Script executed:

cat -n normalization/languages/italian/operators.py | head -50

Repository: gladiaio/normalization

Length of output: 1745

🏁 Script executed:

cat -n normalization/languages/italian/number_normalizer.py

Repository: gladiaio/normalization

Length of output: 2358

Match mille explicitly instead of mila?.

Line 17 currently matches mil/mila but not the configured Italian word mille, so inputs like 1 mille skip the mixed-number pre-pass while invalid 2 mil is accepted by the regex.

Proposed fix

 _RE_MIXED_NUMBER = re.compile(
-    r"\b(\d+)\s+(cento|mila?|milioni?|miliardi?)\b",
+    r"\b(\d+)\s+(cento|mille|mila|milioni?|miliardi?)\b",
     re.IGNORECASE,
 )

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-_RE_MIXED_NUMBER = re.compile(
-    r"\b(\d+)\s+(cento|mila?|milioni?|miliardi?)\b",
-    re.IGNORECASE,
-)
+_RE_MIXED_NUMBER = re.compile(
+    r"\b(\d+)\s+(cento|mille|mila|milioni?|miliardi?)\b",
+    re.IGNORECASE,
+)
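
A quick side-by-side check of the submitted pattern and the suggested one is sketched below; the names `OLD`, `NEW`, and `matches` are illustrative only.

```python
import re

# Pattern as submitted in the PR: "mila?" matches "mil" or "mila".
OLD = re.compile(r"\b(\d+)\s+(cento|mila?|milioni?|miliardi?)\b", re.IGNORECASE)
# Pattern from the suggestion: "mille" and "mila" listed explicitly.
NEW = re.compile(r"\b(\d+)\s+(cento|mille|mila|milioni?|miliardi?)\b", re.IGNORECASE)


def matches(pattern: re.Pattern, text: str) -> bool:
    """Return True when the mixed-number pattern fires on the text."""
    return pattern.search(text) is not None


# The submitted pattern accepts the invalid "2 mil" and misses "1 mille".
assert matches(OLD, "2 mila")
assert matches(OLD, "2 mil")
assert not matches(OLD, "1 mille")

# The suggested pattern accepts both valid forms and rejects "2 mil".
assert matches(NEW, "2 mila")
assert matches(NEW, "1 mille")
assert not matches(NEW, "2 mil")
```

The trailing `?` in `milioni?` and `miliardi?` likewise makes only the final `i` optional, so neither pattern matches the singular `milione` or `miliardo`; whether that matters depends on the inputs the normalizer is expected to see.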


coderabbitai · 2026-04-21T17:05:39Z

+    def expand_written_numbers(self, text: str) -> str:
+        return self._number_normalizer(text)


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Inspect where the numeric-context step is included relative to n_to_digits.
# Expectation: If fix_one_word_in_numeric_contexts can run for Italian, ItalianOperators should still override it.
rg -n -C3 '\bfix_one_word_in_numeric_contexts\b|\bn_to_digits\b|ExpandWrittenNumbersToDigitsStep|FixOneWordInNumericContextsStep'

Repository: gladiaio/normalization

Length of output: 7509

🏁 Script executed:

grep -n "fix_one_word_in_numeric_contexts" normalization/languages/italian/operators.py

Repository: gladiaio/normalization

Length of output: 48

🏁 Script executed:

cat -n normalization/languages/italian/operators.py | head -130

Repository: gladiaio/normalization

Length of output: 3449

Implement fix_one_word_in_numeric_contexts() for Italian.

FixOneWordInNumericContextsStep is registered in the pipeline and calls operators.fix_one_word_in_numeric_contexts(). Without an override, Italian falls back to the base no-op, while English and Spanish both implement language-specific regex patterns to convert "one"/"uno" to "1" when adjacent to digits. The new expand_written_numbers() method handles full number words but does not cover isolated "uno" in numeric contexts, so inputs like "10 uno uno" will regress to unmodified output.
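
One possible shape for such an override is sketched below as a standalone function. The regexes are assumptions and do not reproduce the English or Spanish implementations in the repository; they only treat whitespace-separated digits as numeric context, and the loop exists so that runs like "uno uno" are converted one neighbour at a time.

```python
import re

# "uno" immediately after a digit (whitespace in between is preserved).
_RE_UNO_AFTER_DIGIT = re.compile(r"(?<=\d)(\s+)uno\b", re.IGNORECASE)
# "uno" immediately before a digit.
_RE_UNO_BEFORE_DIGIT = re.compile(r"\buno(\s+)(?=\d)", re.IGNORECASE)


def fix_one_word_in_numeric_contexts(text: str) -> str:
    """Replace isolated 'uno' with '1' when it sits next to digits."""
    previous = None
    # Repeat until stable: each pass can expose a new digit neighbour.
    while previous != text:
        previous = text
        text = _RE_UNO_AFTER_DIGIT.sub(r"\g<1>1", text)
        text = _RE_UNO_BEFORE_DIGIT.sub(r"1\g<1>", text)
    return text
```

With these assumed patterns, `"10 uno uno"` becomes `"10 1 1"`, while `"nessuno 5"` and `"ho uno zaino"` are left unchanged because the word boundary and the digit-adjacency requirement are not both satisfied.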


fix: implement number normalizers for German, Italian, enhance Dutch …

4685f98

…normalization with digit words

coderabbitai Bot reviewed Apr 21, 2026


Karamouche requested a review from lrossillon-gladia April 21, 2026 17:10

lrossillon-gladia approved these changes Apr 21, 2026


Karamouche merged commit c19e573 into main Apr 21, 2026
10 checks passed

Karamouche deleted the fix/number-normalization-de-du-it branch April 21, 2026 17:24
