fix: implement number normalizers for German, Italian, enhance Dutch normalization with digit words #20
📝 Walkthrough
The changes introduce number normalization pipelines for Dutch, German, and Italian. Dutch operators gain digit-word mappings and expanded number configuration. German and Italian each receive a new dedicated number normalizer module that preprocesses mixed digit-word patterns and applies alpha2digit conversion, with corresponding operator updates and configuration extensions.
Sequence Diagram(s)
sequenceDiagram
participant Client
participant GermanNumberNormalizer
participant RegexPreprocessor as Regex<br/>Preprocessor
participant alpha2digit
participant PostProcessor
Client->>GermanNumberNormalizer: __call__(text)
GermanNumberNormalizer->>RegexPreprocessor: Mixed number patterns<br/>(e.g., "2 hundert")
RegexPreprocessor->>GermanNumberNormalizer: Expanded form<br/>(e.g., "zwei hundert")
GermanNumberNormalizer->>alpha2digit: Normalized text
alpha2digit->>GermanNumberNormalizer: Partially converted<br/>(some words remain)
GermanNumberNormalizer->>PostProcessor: Post-pass fixes<br/>("zwei"→"2", "null"→"0")
PostProcessor->>GermanNumberNormalizer: Final normalized text
GermanNumberNormalizer->>Client: return normalized_text
sequenceDiagram
participant Client
participant ItalianNumberNormalizer
participant RegexPreprocessor as Regex<br/>Preprocessor
participant alpha2digit
participant PostProcessor
Client->>ItalianNumberNormalizer: __call__(text)
ItalianNumberNormalizer->>RegexPreprocessor: Mixed number patterns<br/>(e.g., "2 cento", "3 mila")
RegexPreprocessor->>ItalianNumberNormalizer: Expanded form<br/>(e.g., "due cento", "tre mila")
ItalianNumberNormalizer->>alpha2digit: Normalized text
alpha2digit->>ItalianNumberNormalizer: Partially converted<br/>(some words remain)
ItalianNumberNormalizer->>PostProcessor: Post-pass fixes<br/>("uno"→"1", "due"→"2")
PostProcessor->>ItalianNumberNormalizer: Final normalized text
ItalianNumberNormalizer->>Client: return normalized_text
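The flow both diagrams describe (regex pre-pass, word-to-digit conversion, post-pass) can be sketched as below. This is a minimal illustration under stated assumptions, not the PR's code: the digit table contents, `_RE_MIXED`, `_POST_FIXES`, `normalize`, and `fake_convert` are hypothetical names, and the converter is injected so the sketch runs without the real `alpha2digit` dependency.

```python
import re
from typing import Callable

# Hypothetical digit table; the real _DIGIT_TO_GERMAN is not shown in this review.
_DIGIT_TO_GERMAN = {
    "0": "null", "1": "ein", "2": "zwei", "3": "drei", "4": "vier",
    "5": "fünf", "6": "sechs", "7": "sieben", "8": "acht", "9": "neun",
}

# Pre-pass: a single digit followed by a scale word, e.g. "2 hundert".
_RE_MIXED = re.compile(r"\b(\d)\s+(hundert|tausend)\b", re.IGNORECASE)

# Post-pass: words the converter may leave behind, as named in the diagram.
_POST_FIXES = [
    (re.compile(r"\bzwei\b", re.IGNORECASE), "2"),
    (re.compile(r"\bnull\b", re.IGNORECASE), "0"),
]


def normalize(text: str, convert: Callable[[str], str]) -> str:
    """Run the pre-pass, the conversion, and the post-pass, in that order."""
    text = _RE_MIXED.sub(
        lambda m: f"{_DIGIT_TO_GERMAN[m.group(1)]} {m.group(2)}", text
    )
    text = convert(text)
    for pattern, replacement in _POST_FIXES:
        text = pattern.sub(replacement, text)
    return text


def fake_convert(text: str) -> str:
    """Stand-in for alpha2digit (assumption): it only knows one phrase."""
    return text.replace("zwei hundert", "200")


result = normalize("ich habe 2 hundert euro und null cent", fake_convert)
```

Injecting the converter keeps the ordering of the three stages testable on its own; in the real modules the middle stage would presumably be the `alpha2digit` call shown in the diagrams.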
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 3 | ❌ 2
❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
Actionable comments posted: 5
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 0b8f7a32-1869-4f61-8f00-ce6c5d8af1d1
⛔ Files ignored due to path filters (6)
- tests/e2e/files/gladia-3/de.csv is excluded by !**/*.csv
- tests/e2e/files/gladia-3/en.csv is excluded by !**/*.csv
- tests/e2e/files/gladia-3/es.csv is excluded by !**/*.csv
- tests/e2e/files/gladia-3/fr.csv is excluded by !**/*.csv
- tests/e2e/files/gladia-3/it.csv is excluded by !**/*.csv
- tests/e2e/files/gladia-3/nl.csv is excluded by !**/*.csv
📒 Files selected for processing (6)
- normalization/languages/dutch/operators.py
- normalization/languages/german/number_normalizer.py
- normalization/languages/german/operators.py
- normalization/languages/italian/number_normalizer.py
- normalization/languages/italian/operators.py
- normalization/languages/italian/sentence_replacements.py
_DUTCH_DIGIT_WORDS: dict[str, str] = {
    "nul": "0",
    "een": "1",
    "twee": "2",
    "drie": "3",
    "vier": "4",
    "vijf": "5",
    "zes": "6",
    "zeven": "7",
    "acht": "8",
    "negen": "9",
}
🧩 Analysis chain
🏁 Script executed:
# First, understand the file structure and how digit_words are used
find . -type f -name "*.py" | grep -E "(dutch|operators)" | head -20
Repository: gladiaio/normalization
Length of output: 715
🏁 Script executed:
# Search for the functions mentioned in the review comment
rg "fix_dot_adjacent_number_words|protect_plus_word_before_digit_words" --type py -A 5
Repository: gladiaio/normalization
Length of output: 3696
🏁 Script executed:
# Check how digit_words are constructed and passed to these functions
rg "digit_words" --type py -B 2 -A 2 | head -100
Repository: gladiaio/normalization
Length of output: 8296
🏁 Script executed:
# Look for tests related to Dutch and "een"
find . -type f -name "*.py" | xargs grep -l "dutch\|Nederlandse" | grep test
Repository: gladiaio/normalization
Length of output: 203
🏁 Script executed:
# Read the full implementation of fix_dot_adjacent_number_words
cat -n normalization/steps/text/fix_dot_adjacent_number_words.py
Repository: gladiaio/normalization
Length of output: 1597
🏁 Script executed:
# Read the full implementation of protect_plus_word_before_digit_words
cat -n normalization/steps/text/protect_plus_word_before_digit_words.py
Repository: gladiaio/normalization
Length of output: 1606
🏁 Script executed:
# Check Dutch tests for "een" coverage
cat -n tests/unit/languages/dutch_operators_test.py | head -150
Repository: gladiaio/normalization
Length of output: 2352
🏁 Script executed:
# Check if there are any test cases with "een" as the indefinite article
rg "een" tests/unit/languages/dutch_operators_test.py -B 2 -A 2
Repository: gladiaio/normalization
Length of output: 424
🏁 Script executed:
# Check the dutch number normalizer tests for "een" coverage
cat -n tests/unit/languages/dutch_number_normalizer_test.py | head -100
Repository: gladiaio/normalization
Length of output: 2562
🏁 Script executed:
# Search for any integration tests or broader test scenarios
rg "een" tests/unit/languages/dutch_number_normalizer_test.py -B 2 -A 2
Repository: gladiaio/normalization
Length of output: 227
🏁 Script executed:
# Check if there are any other test files that might test the digit_words steps
find tests -type f -name "*.py" -exec grep -l "fix_dot_adjacent\|protect_plus_word" {} \;
Repository: gladiaio/normalization
Length of output: 48
🏁 Script executed:
# Check if there are step-level tests for these functions
find tests -type f -name "*.py" | xargs grep -l "FixDotAdjacentNumberWordsStep\|ProtectPlusWordBeforeDigitWordsStep" 2>/dev/null
Repository: gladiaio/normalization
Length of output: 50
🏁 Script executed:
# Search for any integration tests that might run these steps end-to-end
rg "fix_dot_adjacent\|protect_plus_word" tests/ -l
Repository: gladiaio/normalization
Length of output: 48
🏁 Script executed:
# Let's also check the Dutch config to understand the full context
cat -n normalization/languages/dutch/operators.py
Repository: gladiaio/normalization
Length of output: 5161
"een" matching in numeric contexts is intentional but warrants test coverage.
The functions fix_dot_adjacent_number_words and protect_plus_word_before_digit_words use word boundaries (\b) in their regex patterns, which provides protection against false positives in arbitrary text. However, "een" will still match when it appears in the specific numeric patterns these steps are designed for:
- After "punt" (dot): "versie punt een" → "versie punt 1" ✓ (intended for IPs/versions)
- After "plus" (plus): "plus een" → converts to phone context marker (intended for +1 country codes)
In these contexts, the behavior is correct. However, there's a narrow edge case risk: ambiguous sentences like "Dit kost plus een euro" could be misparsed. Adding explicit tests for "een" in these numeric-context patterns would confirm the behavior is safe for your typical input corpus.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@normalization/languages/dutch/operators.py` around lines 35 - 46, Add unit
tests to cover the Dutch digit-word "een" in numeric contexts handled by
fix_dot_adjacent_number_words and protect_plus_word_before_digit_words: assert
that "versie punt een" transforms "een" → "1" (dot/version/IP context) and that
"plus een" is treated as a phone-country-code context by
protect_plus_word_before_digit_words, and also add an edge-case test like "Dit
kost plus een euro" to ensure it does not incorrectly convert in ordinary
currency phrases; use the existing _DUTCH_DIGIT_WORDS mapping and the same test
harness used for other Dutch normalization tests to locate and validate
behavior.
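The cases this comment asks to cover can be explored with a small stand-alone sketch. It does not use the repository's steps: `_RE_AFTER_DOT`, `_RE_PLUS_BEFORE_DIGIT_WORD`, `fix_after_dot`, and `triggers_plus_protection` are hypothetical stand-ins that only assume both steps build a word-boundary regex from the digit-word mapping, as the comment describes.

```python
import re

_DUTCH_DIGIT_WORDS = {
    "nul": "0", "een": "1", "twee": "2", "drie": "3", "vier": "4",
    "vijf": "5", "zes": "6", "zeven": "7", "acht": "8", "negen": "9",
}

_WORDS = "|".join(re.escape(word) for word in _DUTCH_DIGIT_WORDS)

# Hypothetical stand-in for the dot-adjacent step: "punt <digit word>".
_RE_AFTER_DOT = re.compile(rf"\b(punt)\s+({_WORDS})\b", re.IGNORECASE)

# Hypothetical stand-in for the plus-protection trigger: "plus <digit word>".
_RE_PLUS_BEFORE_DIGIT_WORD = re.compile(rf"\bplus\s+(?:{_WORDS})\b", re.IGNORECASE)


def fix_after_dot(text: str) -> str:
    """Replace a digit word that directly follows 'punt' with its digit."""
    return _RE_AFTER_DOT.sub(
        lambda m: f"{m.group(1)} {_DUTCH_DIGIT_WORDS[m.group(2).lower()]}", text
    )


def triggers_plus_protection(text: str) -> bool:
    """Report whether the naive plus pattern would fire on this text."""
    return _RE_PLUS_BEFORE_DIGIT_WORD.search(text) is not None
```

Under these assumptions the word boundary keeps "plus eenheid" from matching, but "Dit kost plus een euro" still fires the naive pattern, which is the edge case worth pinning down with a real test against the actual steps.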
_RE_MIXED_NUMBER = re.compile(
    r"\b(\d+)\s+(hundert|tausend|millionen?|milliarden?|billionen?)\b",
    re.IGNORECASE,
)


_RE_ZWEI = re.compile(r"\bzwei\b", re.IGNORECASE)
_RE_NULL = re.compile(r"\bnull\b", re.IGNORECASE)


def _normalize_mixed_numbers(text: str) -> str:
    """Convert '2 hundert' → 'zwei hundert' so alpha2digit yields 200, not '2 100'."""

    def replace(match: re.Match) -> str:
        number = match.group(1)
        multiplier = match.group(2)
        if len(number) == 1 and number in _DIGIT_TO_GERMAN:
            return f"{_DIGIT_TO_GERMAN[number]} {multiplier}"
Fix mixed digit + German scale preprocessing for singular scale words.
Line 32 misses valid singular million/billion and Line 47 rewrites 1 milliarde as ein milliarde. That leaves common inputs like 2 million unhandled and can make 1 Million/Milliarde/Billion less alpha2digit-friendly.
🐛 Proposed fix
+_FEMININE_SINGULAR_SCALES = {"million", "milliarde", "billion"}
+
 _RE_MIXED_NUMBER = re.compile(
-    r"\b(\d+)\s+(hundert|tausend|millionen?|milliarden?|billionen?)\b",
+    r"\b(\d+)\s+(hundert|tausend|million(?:en)?|milliarde(?:n)?|billion(?:en)?)\b",
     re.IGNORECASE,
 )

         number = match.group(1)
         multiplier = match.group(2)
         if len(number) == 1 and number in _DIGIT_TO_GERMAN:
-            return f"{_DIGIT_TO_GERMAN[number]} {multiplier}"
+            digit_word = _DIGIT_TO_GERMAN[number]
+            if number == "1" and multiplier.lower() in _FEMININE_SINGULAR_SCALES:
+                digit_word = "eine"
+            return f"{digit_word} {multiplier}"
         return match.group(0)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@normalization/languages/german/number_normalizer.py` around lines 31 - 47,
The regex _RE_MIXED_NUMBER should accept singular forms of the large-scale words
and the replacer in _normalize_mixed_numbers must special-case the digit "1" to
use the feminine form "eine" for feminine scales (million/millionen,
milliarde/milliarden, billion/billionen) instead of the default _DIGIT_TO_GERMAN
value that yields "ein"; update the pattern for _RE_MIXED_NUMBER to include
explicit singular variants (e.g. million, milliarde, billion as well as their
plural forms) and modify the replace(match) in _normalize_mixed_numbers to
return "eine {multiplier}" when number == "1" and multiplier is in the feminine
set, otherwise fall back to the existing mapping (so "2 million" is matched and
becomes "zwei million" and "1 milliarde" becomes "eine milliarde").
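The singular-scale gap in this comment can be reproduced in isolation. The sketch below copies the two patterns quoted in the review and wraps the proposed replacer in a hypothetical `expand_mixed` helper; the three-entry `_DIGIT_TO_GERMAN` is an assumed excerpt of the real table, based only on the review's statement that "1" maps to "ein".

```python
import re

# Pattern quoted in the review, before the fix.
OLD = re.compile(
    r"\b(\d+)\s+(hundert|tausend|millionen?|milliarden?|billionen?)\b",
    re.IGNORECASE,
)
# Pattern from the proposed fix.
NEW = re.compile(
    r"\b(\d+)\s+(hundert|tausend|million(?:en)?|milliarde(?:n)?|billion(?:en)?)\b",
    re.IGNORECASE,
)

_FEMININE_SINGULAR_SCALES = {"million", "milliarde", "billion"}
# Assumed excerpt of the digit table (hypothetical contents).
_DIGIT_TO_GERMAN = {"1": "ein", "2": "zwei", "3": "drei"}


def expand_mixed(text: str) -> str:
    """Apply the proposed replacer: single digit + scale word -> word + scale."""

    def replace(match: re.Match) -> str:
        number, multiplier = match.group(1), match.group(2)
        if len(number) == 1 and number in _DIGIT_TO_GERMAN:
            digit_word = _DIGIT_TO_GERMAN[number]
            if number == "1" and multiplier.lower() in _FEMININE_SINGULAR_SCALES:
                digit_word = "eine"
            return f"{digit_word} {multiplier}"
        # Multi-digit numbers are left untouched.
        return match.group(0)

    return NEW.sub(replace, text)
```

The old alternation `millionen?` requires the trailing "e", so it accepts "millione" and "millionen" but never "million"; `million(?:en)?` makes the whole plural suffix optional instead.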
_GERMAN_DIGIT_WORDS: dict[str, str] = {
    "null": "0",
    "ein": "1",
    "eins": "1",
Keep ambiguous ein out of digit_words used by plus protection.
Line 11 makes ein a digit token while Line 79 enables the plus-word protection path. Since ProtectPlusWordBeforeDigitWordsStep consumes config.digit_words, normal phrases like plus ein bisschen can be treated as phone-plus context and later become + ein bisschen.
🐛 Proposed fix
 _GERMAN_DIGIT_WORDS: dict[str, str] = {
     "null": "0",
-    "ein": "1",
     "eins": "1",
     "zwei": "2",

     digit_words=_GERMAN_DIGIT_WORDS,
     number_words=[
+        "ein",
         *_GERMAN_DIGIT_WORDS,
         "zehn",
Also applies to: 49-50, 79-79
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@normalization/languages/german/operators.py` around lines 9 - 12, The
digit-word mapping _GERMAN_DIGIT_WORDS incorrectly includes the ambiguous token
"ein", which causes ProtectPlusWordBeforeDigitWordsStep (which consumes
config.digit_words) to misclassify normal phrases like "plus ein bisschen" as
phone-plus context; remove "ein" from _GERMAN_DIGIT_WORDS (leave "eins" if
needed) and ensure any configuration or references to config.digit_words no
longer contain the ambiguous "ein" token so plus-word protection only triggers
on unambiguous digit words.
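The effect of dropping "ein" can be illustrated without the real step. In this sketch `build_plus_trigger` is a hypothetical helper that assumes the plus-protection step compiles a word-boundary alternation from `config.digit_words`; the two dictionaries are shortened excerpts for illustration.

```python
import re


def build_plus_trigger(digit_words: dict[str, str]) -> re.Pattern:
    """Compile 'plus <digit word>' from a digit-word mapping (assumed shape)."""
    alternation = "|".join(
        sorted((re.escape(word) for word in digit_words), key=len, reverse=True)
    )
    return re.compile(rf"\bplus\s+(?:{alternation})\b", re.IGNORECASE)


# Shortened excerpts of the mapping, with and without the ambiguous article.
with_ein = build_plus_trigger({"null": "0", "ein": "1", "eins": "1", "zwei": "2"})
without_ein = build_plus_trigger({"null": "0", "eins": "1", "zwei": "2"})

article_hit = with_ein.search("plus ein bisschen") is not None
article_hit_after_fix = without_ein.search("plus ein bisschen") is not None
digits_still_hit = without_ein.search("plus eins zwei") is not None
```

With "ein" in the mapping the ordinary phrase fires the trigger; without it, only the unambiguous "eins" does, which matches the intent of the proposed fix.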
_RE_MIXED_NUMBER = re.compile(
    r"\b(\d+)\s+(cento|mila?|milioni?|miliardi?)\b",
    re.IGNORECASE,
)
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
python - <<'PY'
import re
pat = re.compile(r"\b(\d+)\s+(cento|mila?|milioni?|miliardi?)\b", re.IGNORECASE)
for text in ("2 mila", "1 mille", "2 mil"):
    print(f"{text!r}: {bool(pat.search(text))}")
PY
Repository: gladiaio/normalization
Length of output: 111
🏁 Script executed:
find . -type f -name "*.py" -o -name "*.json" | head -20
Repository: gladiaio/normalization
Length of output: 930
🏁 Script executed:
fd -e py -e json | grep -i italian | head -20
Repository: gladiaio/normalization
Length of output: 312
🏁 Script executed:
rg -i "mille|mila" --type py | head -30
Repository: gladiaio/normalization
Length of output: 355
🏁 Script executed:
cat -n normalization/languages/italian/operators.py | head -50
Repository: gladiaio/normalization
Length of output: 1745
🏁 Script executed:
cat -n normalization/languages/italian/number_normalizer.py
Repository: gladiaio/normalization
Length of output: 2358
Match mille explicitly instead of mila?.
Line 17 currently matches mil/mila but not the configured Italian word mille, so inputs like 1 mille skip the mixed-number pre-pass while invalid 2 mil is accepted by the regex.
Proposed fix
 _RE_MIXED_NUMBER = re.compile(
-    r"\b(\d+)\s+(cento|mila?|milioni?|miliardi?)\b",
+    r"\b(\d+)\s+(cento|mille|mila|milioni?|miliardi?)\b",
     re.IGNORECASE,
 )
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
-_RE_MIXED_NUMBER = re.compile(
-    r"\b(\d+)\s+(cento|mila?|milioni?|miliardi?)\b",
-    re.IGNORECASE,
-)
+_RE_MIXED_NUMBER = re.compile(
+    r"\b(\d+)\s+(cento|mille|mila|milioni?|miliardi?)\b",
+    re.IGNORECASE,
+)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@normalization/languages/italian/number_normalizer.py` around lines 16 - 19,
The regex _RE_MIXED_NUMBER currently uses "mila?" which matches "mil" or "mila"
and misses the correct Italian "mille"; update the pattern used in
_RE_MIXED_NUMBER so it explicitly matches "mille" and "mila" (e.g., replace the
"mila?" token with an explicit alternation like "(mille|mila)") while preserving
other alternatives (cento, milione/milioni, miliardo/miliardi) and re.IGNORECASE
to ensure inputs like "1 mille" are correctly caught by the mixed-number
pre-pass.
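A side-by-side check of the current and proposed patterns makes the difference concrete. The two regexes are copied from this comment; `matches` and `comparison` are hypothetical helper names used only for this sketch.

```python
import re

# Pattern currently in the PR.
OLD = re.compile(r"\b(\d+)\s+(cento|mila?|milioni?|miliardi?)\b", re.IGNORECASE)
# Pattern from the proposed fix.
NEW = re.compile(r"\b(\d+)\s+(cento|mille|mila|milioni?|miliardi?)\b", re.IGNORECASE)


def matches(pattern: re.Pattern, text: str) -> bool:
    """Return True when the mixed-number pattern finds a match in the text."""
    return pattern.search(text) is not None


# Maps each sample to (matched by OLD, matched by NEW).
comparison = {
    text: (matches(OLD, text), matches(NEW, text))
    for text in ("2 mila", "1 mille", "2 mil")
}
```

The same optional-last-letter shape appears in `milioni?` and `miliardi?`: they accept "milion"/"milioni" and "miliard"/"miliardi" but not the singular "milione" or "miliardo". Whether that needs the same explicit treatment depends on the inputs the normalizer is expected to see.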
def expand_written_numbers(self, text: str) -> str:
    return self._number_normalizer(text)
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Description: Inspect where the numeric-context step is included relative to n_to_digits.
# Expectation: If fix_one_word_in_numeric_contexts can run for Italian, ItalianOperators should still override it.
rg -n -C3 '\bfix_one_word_in_numeric_contexts\b|\bn_to_digits\b|ExpandWrittenNumbersToDigitsStep|FixOneWordInNumericContextsStep'
Repository: gladiaio/normalization
Length of output: 7509
🏁 Script executed:
grep -n "fix_one_word_in_numeric_contexts" normalization/languages/italian/operators.py
Repository: gladiaio/normalization
Length of output: 48
🏁 Script executed:
cat -n normalization/languages/italian/operators.py | head -130
Repository: gladiaio/normalization
Length of output: 3449
Implement fix_one_word_in_numeric_contexts() for Italian.
FixOneWordInNumericContextsStep is registered in the pipeline and calls operators.fix_one_word_in_numeric_contexts(). Without an override, Italian falls back to the base no-op, while English and Spanish both implement language-specific regex patterns to convert "one"/"uno" to "1" when adjacent to digits. The new expand_written_numbers() method handles full number words but does not cover isolated "uno" in numeric contexts, so inputs like "10 uno uno" will regress to unmodified output.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@normalization/languages/italian/operators.py` around lines 108 - 109, The
Italian operator needs a language-specific override of
fix_one_word_in_numeric_contexts to convert isolated "uno" to "1" when adjacent
to digits (mirroring English/Spanish implementations) because
expand_written_numbers only handles full number words; implement
operators.fix_one_word_in_numeric_contexts() to use a regex that matches
word-boundary "uno" when preceded or followed by digits (or digit sequences with
separators) and replace it with "1" while preserving surrounding
whitespace/punctuation, ensuring the method name
fix_one_word_in_numeric_contexts and the existing expand_written_numbers remain
unchanged.
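One possible shape for the requested override is sketched below as a free function. It is not taken from the English or Spanish implementations, which are not shown in this review; the two regex names and the fixed-point loop are assumptions made for illustration.

```python
import re

# "uno" directly after a digit, keeping the whitespace between them.
_RE_UNO_AFTER_DIGIT = re.compile(r"(?<=\d)(\s+)uno\b", re.IGNORECASE)
# "uno" directly before a digit, keeping the whitespace between them.
_RE_UNO_BEFORE_DIGIT = re.compile(r"\buno(\s+)(?=\d)", re.IGNORECASE)


def fix_one_word_in_numeric_contexts(text: str) -> str:
    """Turn an isolated 'uno' into '1' when it sits next to digits."""
    previous = None
    # Repeat until stable so chains like "10 uno uno" convert fully:
    # the second "uno" only becomes digit-adjacent after the first is replaced.
    while previous != text:
        previous = text
        text = _RE_UNO_AFTER_DIGIT.sub(r"\g<1>1", text)
        text = _RE_UNO_BEFORE_DIGIT.sub(r"1\g<1>", text)
    return text
```

A single `re.sub` pass would leave "10 1 uno", because lookarounds are evaluated against the original string; the loop is one simple way around that. An "uno" with no digit neighbour, as in "ne voglio uno", is left alone.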
What does this PR do?
Implement number normalizers for German, Italian, enhance Dutch normalization with digit words
Type of change
- languages/{lang}/
- steps/text/ or steps/word/
- presets/
Tests
Tested different ways of saying numbers in many languages.