fix: implement number normalizers for German, Italian, enhance Dutch normalization with digit words #20

Merged
Karamouche merged 1 commit into main from fix/number-normalization-de-du-it
Apr 21, 2026

Conversation

Collaborator

@Karamouche Karamouche commented Apr 21, 2026

What does this PR do?

Implement number normalizers for German, Italian, enhance Dutch normalization with digit words

Type of change

  • New language (languages/{lang}/)
  • New step (steps/text/ or steps/word/)
  • New preset version (presets/)
  • Bug fix
  • Refactor / internal cleanup
  • Docs / CI

Tests

Tested different ways of saying numbers in many languages

Summary by CodeRabbit

  • New Features
    • Enhanced number normalization for Dutch, German, and Italian languages with comprehensive digit-word mappings
    • Improved conversion of written-out number expressions to digits across German and Italian
    • Better handling of mixed digit-word number patterns (e.g., "2 hundred")
    • Added Italian sentence-level phrase replacements for improved text normalization

coderabbitai Bot commented Apr 21, 2026

Walkthrough

The changes introduce number normalization pipelines for Dutch, German, and Italian languages. Dutch operators gain digit-word mappings and expanded number configuration. German and Italian each receive new dedicated number normalizer modules that preprocess mixed digit-word patterns and apply alpha2digit conversion, with corresponding operator updates and configuration extensions.
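The three stages described in the walkthrough (regex pre-pass, alpha2digit conversion, post-pass word fixes) can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `make_normalizer`, `DIGIT_TO_WORD`, `POST_FIXES`, and `stub_convert` are hypothetical names, the word lists are small subsets, and the conversion stage is injected as a plain callable so the sketch does not depend on any particular alpha2digit signature.

```python
import re
from typing import Callable

# Hypothetical subset; the PR's real digit-to-word mapping is not shown here.
DIGIT_TO_WORD = {"1": "ein", "2": "zwei", "3": "drei"}

# Pre-pass: a single digit followed by a scale word, e.g. "2 hundert".
RE_MIXED = re.compile(r"\b(\d)\s+(hundert|tausend)\b", re.IGNORECASE)

# Post-pass: words the converter may leave behind (the two named in the diagram).
POST_FIXES = {"zwei": "2", "null": "0"}


def make_normalizer(convert: Callable[[str], str]) -> Callable[[str], str]:
    """Build a normalizer around an injected word-to-digit converter."""

    def normalize(text: str) -> str:
        # 1. Rewrite "2 hundert" as "zwei hundert" so the converter sees one number.
        text = RE_MIXED.sub(
            lambda m: f"{DIGIT_TO_WORD.get(m.group(1), m.group(1))} {m.group(2)}",
            text,
        )
        # 2. Delegate the actual word-to-digit conversion.
        text = convert(text)
        # 3. Fix up isolated words that survived the conversion.
        for word, digit in POST_FIXES.items():
            text = re.sub(rf"\b{word}\b", digit, text, flags=re.IGNORECASE)
        return text

    return normalize


def stub_convert(text: str) -> str:
    """Stand-in for the real converter; only knows one phrase."""
    return text.replace("zwei hundert", "200")


normalize = make_normalizer(stub_convert)
print(normalize("2 hundert euro"))
print(normalize("null komma zwei"))
```

Injecting the converter keeps the pre-pass and post-pass testable on their own, which is one way to raise coverage for the regex stages without relying on the conversion library's behaviour.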

Changes

| Cohort / File(s) | Summary |
|---|---|
| Dutch Language Support: normalization/languages/dutch/operators.py | Added _DUTCH_DIGIT_WORDS mapping ("nul"–"negen" to digits 0–9), extended DUTCH_CONFIG with digit_words and number_words covering digit words, teens, tens, and large scales, and set plus_word="plus". |
| German Language Support: normalization/languages/german/number_normalizer.py, normalization/languages/german/operators.py | Added GermanNumberNormalizer class with regex-based preprocessing to convert mixed digit-multiplier forms (e.g., "2 hundert" → "zwei hundert"), followed by alpha2digit conversion and targeted word fixes. Extended GERMAN_CONFIG with digit_words, number_words, and plus_word, and added expand_written_numbers() method to GermanOperators. |
| Italian Language Support: normalization/languages/italian/number_normalizer.py, normalization/languages/italian/operators.py, normalization/languages/italian/sentence_replacements.py | Added ItalianNumberNormalizer class with similar mixed-number preprocessing and post-pass word fixing. Refactored ItalianOperators to use the normalizer via new expand_written_numbers() method, removed legacy fix_one_word_in_numeric_contexts() method, and extracted sentence replacements into a dedicated module. |

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant GermanNumberNormalizer
    participant RegexPreprocessor as Regex<br/>Preprocessor
    participant alpha2digit
    participant PostProcessor

    Client->>GermanNumberNormalizer: __call__(text)
    GermanNumberNormalizer->>RegexPreprocessor: Mixed number patterns<br/>(e.g., "2 hundert")
    RegexPreprocessor->>GermanNumberNormalizer: Expanded form<br/>(e.g., "zwei hundert")
    GermanNumberNormalizer->>alpha2digit: Normalized text
    alpha2digit->>GermanNumberNormalizer: Partially converted<br/>(some words remain)
    GermanNumberNormalizer->>PostProcessor: Post-pass fixes<br/>("zwei"→"2", "null"→"0")
    PostProcessor->>GermanNumberNormalizer: Final normalized text
    GermanNumberNormalizer->>Client: return normalized_text
sequenceDiagram
    participant Client
    participant ItalianNumberNormalizer
    participant RegexPreprocessor as Regex<br/>Preprocessor
    participant alpha2digit
    participant PostProcessor

    Client->>ItalianNumberNormalizer: __call__(text)
    ItalianNumberNormalizer->>RegexPreprocessor: Mixed number patterns<br/>(e.g., "2 cento", "3 mila")
    RegexPreprocessor->>ItalianNumberNormalizer: Expanded form<br/>(e.g., "due cento", "tre mila")
    ItalianNumberNormalizer->>alpha2digit: Normalized text
    alpha2digit->>ItalianNumberNormalizer: Partially converted<br/>(some words remain)
    ItalianNumberNormalizer->>PostProcessor: Post-pass fixes<br/>("uno"→"1", "due"→"2")
    PostProcessor->>ItalianNumberNormalizer: Final normalized text
    ItalianNumberNormalizer->>Client: return normalized_text

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • normalization#19: Directly modifies the Dutch operators module and DUTCH_CONFIG with identical changes (adding digit_words, expanding number_words, and setting plus_word).

Suggested reviewers

  • lrossillon-gladia

Poem

🐰 Numbers dance across tongues with grace,
Dutch, German, Italian in their place,
Digit words mapped with regex keen,
From "nul" to "null," normalization supreme,
A multilingual feast, a hoppy embrace! ✨

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ⚠️ Warning | The description is incomplete. It lacks required checklist details, specific test information, and doesn't match the template structure for 'Edit existing language' or other applicable categories. | Complete the description by selecting all applicable change types with checkmarks, providing specific test output, and adding E2E CSV updates if needed for German, Italian, and Dutch. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 26.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately summarizes the main changes: implementing number normalizers for German and Italian, and enhancing Dutch normalization with digit words. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@normalization/languages/dutch/operators.py`:
- Around line 35-46: Add unit tests to cover the Dutch digit-word "een" in
numeric contexts handled by fix_dot_adjacent_number_words and
protect_plus_word_before_digit_words: assert that "versie punt een" transforms
"een" → "1" (dot/version/IP context) and that "plus een" is treated as a
phone-country-code context by protect_plus_word_before_digit_words, and also add
an edge-case test like "Dit kost plus een euro" to ensure it does not
incorrectly convert in ordinary currency phrases; use the existing
_DUTCH_DIGIT_WORDS mapping and the same test harness used for other Dutch
normalization tests to locate and validate behavior.

In `@normalization/languages/german/number_normalizer.py`:
- Around line 31-47: The regex _RE_MIXED_NUMBER should accept singular forms of
the large-scale words and the replacer in _normalize_mixed_numbers must
special-case the digit "1" to use the feminine form "eine" for feminine scales
(million/millionen, milliarde/milliarden, billion/billionen) instead of the
default _DIGIT_TO_GERMAN value that yields "ein"; update the pattern for
_RE_MIXED_NUMBER to include explicit singular variants (e.g. million, milliarde,
billion as well as their plural forms) and modify the replace(match) in
_normalize_mixed_numbers to return "eine {multiplier}" when number == "1" and
multiplier is in the feminine set, otherwise fall back to the existing mapping
(so "2 million" is matched and becomes "zwei million" and "1 milliarde" becomes
"eine milliarde").

In `@normalization/languages/german/operators.py`:
- Around line 9-12: The digit-word mapping _GERMAN_DIGIT_WORDS incorrectly
includes the ambiguous token "ein", which causes
ProtectPlusWordBeforeDigitWordsStep (which consumes config.digit_words) to
misclassify normal phrases like "plus ein bisschen" as phone-plus context;
remove "ein" from _GERMAN_DIGIT_WORDS (leave "eins" if needed) and ensure any
configuration or references to config.digit_words no longer contain the
ambiguous "ein" token so plus-word protection only triggers on unambiguous digit
words.

In `@normalization/languages/italian/number_normalizer.py`:
- Around line 16-19: The regex _RE_MIXED_NUMBER currently uses "mila?" which
matches "mil" or "mila" and misses the correct Italian "mille"; update the
pattern used in _RE_MIXED_NUMBER so it explicitly matches "mille" and "mila"
(e.g., replace the "mila?" token with an explicit alternation like
"(mille|mila)") while preserving other alternatives (cento, milione/milioni,
miliardo/miliardi) and re.IGNORECASE to ensure inputs like "1 mille" are
correctly caught by the mixed-number pre-pass.

In `@normalization/languages/italian/operators.py`:
- Around line 108-109: The Italian operator needs a language-specific override
of fix_one_word_in_numeric_contexts to convert isolated "uno" to "1" when
adjacent to digits (mirroring English/Spanish implementations) because
expand_written_numbers only handles full number words; implement
operators.fix_one_word_in_numeric_contexts() to use a regex that matches
word-boundary "uno" when preceded or followed by digits (or digit sequences with
separators) and replace it with "1" while preserving surrounding
whitespace/punctuation, ensuring the method name
fix_one_word_in_numeric_contexts and the existing expand_written_numbers remain
unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0b8f7a32-1869-4f61-8f00-ce6c5d8af1d1

📥 Commits

Reviewing files that changed from the base of the PR and between 165bd95 and 4685f98.

⛔ Files ignored due to path filters (6)
  • tests/e2e/files/gladia-3/de.csv is excluded by !**/*.csv
  • tests/e2e/files/gladia-3/en.csv is excluded by !**/*.csv
  • tests/e2e/files/gladia-3/es.csv is excluded by !**/*.csv
  • tests/e2e/files/gladia-3/fr.csv is excluded by !**/*.csv
  • tests/e2e/files/gladia-3/it.csv is excluded by !**/*.csv
  • tests/e2e/files/gladia-3/nl.csv is excluded by !**/*.csv
📒 Files selected for processing (6)
  • normalization/languages/dutch/operators.py
  • normalization/languages/german/number_normalizer.py
  • normalization/languages/german/operators.py
  • normalization/languages/italian/number_normalizer.py
  • normalization/languages/italian/operators.py
  • normalization/languages/italian/sentence_replacements.py

Comment on lines +35 to +46
_DUTCH_DIGIT_WORDS: dict[str, str] = {
    "nul": "0",
    "een": "1",
    "twee": "2",
    "drie": "3",
    "vier": "4",
    "vijf": "5",
    "zes": "6",
    "zeven": "7",
    "acht": "8",
    "negen": "9",
}

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, understand the file structure and how digit_words are used
find . -type f -name "*.py" | grep -E "(dutch|operators)" | head -20

Repository: gladiaio/normalization

Length of output: 715


🏁 Script executed:

# Search for the functions mentioned in the review comment
rg "fix_dot_adjacent_number_words|protect_plus_word_before_digit_words" --type py -A 5

Repository: gladiaio/normalization

Length of output: 3696


🏁 Script executed:

# Check how digit_words are constructed and passed to these functions
rg "digit_words" --type py -B 2 -A 2 | head -100

Repository: gladiaio/normalization

Length of output: 8296


🏁 Script executed:

# Look for tests related to Dutch and "een"
find . -type f -name "*.py" | xargs grep -l "dutch\|Nederlandse" | grep test

Repository: gladiaio/normalization

Length of output: 203


🏁 Script executed:

# Read the full implementation of fix_dot_adjacent_number_words
cat -n normalization/steps/text/fix_dot_adjacent_number_words.py

Repository: gladiaio/normalization

Length of output: 1597


🏁 Script executed:

# Read the full implementation of protect_plus_word_before_digit_words
cat -n normalization/steps/text/protect_plus_word_before_digit_words.py

Repository: gladiaio/normalization

Length of output: 1606


🏁 Script executed:

# Check Dutch tests for "een" coverage
cat -n tests/unit/languages/dutch_operators_test.py | head -150

Repository: gladiaio/normalization

Length of output: 2352


🏁 Script executed:

# Check if there are any test cases with "een" as the indefinite article
rg "een" tests/unit/languages/dutch_operators_test.py -B 2 -A 2

Repository: gladiaio/normalization

Length of output: 424


🏁 Script executed:

# Check the dutch number normalizer tests for "een" coverage
cat -n tests/unit/languages/dutch_number_normalizer_test.py | head -100

Repository: gladiaio/normalization

Length of output: 2562


🏁 Script executed:

# Search for any integration tests or broader test scenarios
rg "een" tests/unit/languages/dutch_number_normalizer_test.py -B 2 -A 2

Repository: gladiaio/normalization

Length of output: 227


🏁 Script executed:

# Check if there are any other test files that might test the digit_words steps
find tests -type f -name "*.py" -exec grep -l "fix_dot_adjacent\|protect_plus_word" {} \;

Repository: gladiaio/normalization

Length of output: 48


🏁 Script executed:

# Check if there are step-level tests for these functions
find tests -type f -name "*.py" | xargs grep -l "FixDotAdjacentNumberWordsStep\|ProtectPlusWordBeforeDigitWordsStep" 2>/dev/null

Repository: gladiaio/normalization

Length of output: 50


🏁 Script executed:

# Search for any integration tests that might run these steps end-to-end
rg "fix_dot_adjacent\|protect_plus_word" tests/ -l

Repository: gladiaio/normalization

Length of output: 48


🏁 Script executed:

# Let's also check the Dutch config to understand the full context
cat -n normalization/languages/dutch/operators.py

Repository: gladiaio/normalization

Length of output: 5161


"een" matching in numeric contexts is intentional but warrants test coverage.

The functions fix_dot_adjacent_number_words and protect_plus_word_before_digit_words use word boundaries (\b) in their regex patterns, which provides protection against false positives in arbitrary text. However, "een" will still match when it appears in the specific numeric patterns these steps are designed for:

  • After "punt" (dot): "versie punt een" → "versie punt 1" ✓ (intended for IPs/versions)
  • After "plus" (plus): "plus een" → converts to phone context marker (intended for +1 country codes)

In these contexts, the behavior is correct. However, there's a narrow edge case risk: ambiguous sentences like "Dit kost plus een euro" could be misparsed. Adding explicit tests for "een" in these numeric-context patterns would confirm the behavior is safe for your typical input corpus.
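The kind of test coverage suggested above could look roughly like the sketch below. The two helper functions are hypothetical stand-ins written for this illustration (the real `fix_dot_adjacent_number_words` and `protect_plus_word_before_digit_words` steps are not reproduced in this review), and they assume word-boundary regexes built from the digit-word mapping, as the comment describes.

```python
import re

# Mapping as quoted in the review comment.
DUTCH_DIGIT_WORDS = {
    "nul": "0", "een": "1", "twee": "2", "drie": "3", "vier": "4",
    "vijf": "5", "zes": "6", "zeven": "7", "acht": "8", "negen": "9",
}

_WORDS = "|".join(map(re.escape, DUTCH_DIGIT_WORDS))
# Hypothetical patterns: a digit word right after "punt" or "plus".
_RE_AFTER_DOT = re.compile(rf"\b(punt)\s+({_WORDS})\b", re.IGNORECASE)
_RE_AFTER_PLUS = re.compile(rf"\bplus\s+(?:{_WORDS})\b", re.IGNORECASE)


def digits_after_dot(text: str) -> str:
    """Replace a digit word that directly follows 'punt' with its digit."""
    return _RE_AFTER_DOT.sub(
        lambda m: f"{m.group(1)} {DUTCH_DIGIT_WORDS[m.group(2).lower()]}", text
    )


def has_plus_digit_context(text: str) -> bool:
    """Report whether 'plus' is directly followed by a digit word."""
    return _RE_AFTER_PLUS.search(text) is not None


print(digits_after_dot("versie punt een"))               # dot context
print(digits_after_dot("een huis"))                      # article, untouched
print(has_plus_digit_context("Dit kost plus een euro"))  # the ambiguous case
```

With a purely lexical pattern like this one, the currency sentence is still flagged as a plus-digit context, which is exactly the edge case the comment asks to pin down with a test.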

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@normalization/languages/dutch/operators.py` around lines 35 - 46, Add unit
tests to cover the Dutch digit-word "een" in numeric contexts handled by
fix_dot_adjacent_number_words and protect_plus_word_before_digit_words: assert
that "versie punt een" transforms "een" → "1" (dot/version/IP context) and that
"plus een" is treated as a phone-country-code context by
protect_plus_word_before_digit_words, and also add an edge-case test like "Dit
kost plus een euro" to ensure it does not incorrectly convert in ordinary
currency phrases; use the existing _DUTCH_DIGIT_WORDS mapping and the same test
harness used for other Dutch normalization tests to locate and validate
behavior.

Comment on lines +31 to +47
_RE_MIXED_NUMBER = re.compile(
    r"\b(\d+)\s+(hundert|tausend|millionen?|milliarden?|billionen?)\b",
    re.IGNORECASE,
)

_RE_ZWEI = re.compile(r"\bzwei\b", re.IGNORECASE)
_RE_NULL = re.compile(r"\bnull\b", re.IGNORECASE)


def _normalize_mixed_numbers(text: str) -> str:
    """Convert '2 hundert' → 'zwei hundert' so alpha2digit yields 200, not '2 100'."""

    def replace(match: re.Match) -> str:
        number = match.group(1)
        multiplier = match.group(2)
        if len(number) == 1 and number in _DIGIT_TO_GERMAN:
            return f"{_DIGIT_TO_GERMAN[number]} {multiplier}"

⚠️ Potential issue | 🟠 Major

Fix mixed digit + German scale preprocessing for singular scale words.

Line 32 misses valid singular million/billion and Line 47 rewrites 1 milliarde as ein milliarde. That leaves common inputs like 2 million unhandled and can make 1 Million/Milliarde/Billion less alpha2digit-friendly.

🐛 Proposed fix
+_FEMININE_SINGULAR_SCALES = {"million", "milliarde", "billion"}
+
 _RE_MIXED_NUMBER = re.compile(
-    r"\b(\d+)\s+(hundert|tausend|millionen?|milliarden?|billionen?)\b",
+    r"\b(\d+)\s+(hundert|tausend|million(?:en)?|milliarde(?:n)?|billion(?:en)?)\b",
     re.IGNORECASE,
 )
         number = match.group(1)
         multiplier = match.group(2)
         if len(number) == 1 and number in _DIGIT_TO_GERMAN:
-            return f"{_DIGIT_TO_GERMAN[number]} {multiplier}"
+            digit_word = _DIGIT_TO_GERMAN[number]
+            if number == "1" and multiplier.lower() in _FEMININE_SINGULAR_SCALES:
+                digit_word = "eine"
+            return f"{digit_word} {multiplier}"
         return match.group(0)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@normalization/languages/german/number_normalizer.py` around lines 31 - 47,
The regex _RE_MIXED_NUMBER should accept singular forms of the large-scale words
and the replacer in _normalize_mixed_numbers must special-case the digit "1" to
use the feminine form "eine" for feminine scales (million/millionen,
milliarde/milliarden, billion/billionen) instead of the default _DIGIT_TO_GERMAN
value that yields "ein"; update the pattern for _RE_MIXED_NUMBER to include
explicit singular variants (e.g. million, milliarde, billion as well as their
plural forms) and modify the replace(match) in _normalize_mixed_numbers to
return "eine {multiplier}" when number == "1" and multiplier is in the feminine
set, otherwise fall back to the existing mapping (so "2 million" is matched and
becomes "zwei million" and "1 milliarde" becomes "eine milliarde").
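To check the finding, the original and proposed patterns can be compared side by side. The sketch below copies both regexes from the comment; `DIGIT_TO_GERMAN` is an assumed stand-in for the module's `_DIGIT_TO_GERMAN` (its contents are not shown in the review, beyond "1" mapping to "ein"), and `normalize_mixed_numbers` follows the proposed replacer.

```python
import re

ORIGINAL = re.compile(
    r"\b(\d+)\s+(hundert|tausend|millionen?|milliarden?|billionen?)\b",
    re.IGNORECASE,
)
PROPOSED = re.compile(
    r"\b(\d+)\s+(hundert|tausend|million(?:en)?|milliarde(?:n)?|billion(?:en)?)\b",
    re.IGNORECASE,
)

# Assumed contents; only the "1" -> "ein" entry is implied by the review.
DIGIT_TO_GERMAN = {
    "1": "ein", "2": "zwei", "3": "drei", "4": "vier", "5": "fünf",
    "6": "sechs", "7": "sieben", "8": "acht", "9": "neun",
}
FEMININE_SINGULAR_SCALES = {"million", "milliarde", "billion"}


def normalize_mixed_numbers(text: str, pattern: re.Pattern = PROPOSED) -> str:
    """Spell out a single leading digit before a scale word."""

    def replace(match: re.Match) -> str:
        number, multiplier = match.group(1), match.group(2)
        if len(number) == 1 and number in DIGIT_TO_GERMAN:
            word = DIGIT_TO_GERMAN[number]
            # "eine Million", not "ein Million": feminine singular scales.
            if number == "1" and multiplier.lower() in FEMININE_SINGULAR_SCALES:
                word = "eine"
            return f"{word} {multiplier}"
        return match.group(0)  # multi-digit numbers are left untouched

    return pattern.sub(replace, text)


print(normalize_mixed_numbers("2 million", ORIGINAL))  # original misses singular
print(normalize_mixed_numbers("2 million"))
print(normalize_mixed_numbers("1 Milliarde"))
```

The original `millionen?` only makes the final `n` optional, so it accepts "millione" and "millionen" but not "million"; grouping the plural suffix as `(?:en)?` is what lets the singular through.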

Comment on lines +9 to +12
_GERMAN_DIGIT_WORDS: dict[str, str] = {
    "null": "0",
    "ein": "1",
    "eins": "1",

⚠️ Potential issue | 🟠 Major

Keep ambiguous ein out of digit_words used by plus protection.

Line 11 makes ein a digit token while Line 79 enables the plus-word protection path. Since ProtectPlusWordBeforeDigitWordsStep consumes config.digit_words, normal phrases like plus ein bisschen can be treated as phone-plus context and later become + ein bisschen.

🐛 Proposed fix
 _GERMAN_DIGIT_WORDS: dict[str, str] = {
     "null": "0",
-    "ein": "1",
     "eins": "1",
     "zwei": "2",
     digit_words=_GERMAN_DIGIT_WORDS,
     number_words=[
+        "ein",
         *_GERMAN_DIGIT_WORDS,
         "zehn",

Also applies to: 49-50, 79-79

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@normalization/languages/german/operators.py` around lines 9 - 12, The
digit-word mapping _GERMAN_DIGIT_WORDS incorrectly includes the ambiguous token
"ein", which causes ProtectPlusWordBeforeDigitWordsStep (which consumes
config.digit_words) to misclassify normal phrases like "plus ein bisschen" as
phone-plus context; remove "ein" from _GERMAN_DIGIT_WORDS (leave "eins" if
needed) and ensure any configuration or references to config.digit_words no
longer contain the ambiguous "ein" token so plus-word protection only triggers
on unambiguous digit words.
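The effect of the ambiguous "ein" token can be illustrated with a small sketch. `protect_plus` is a hypothetical stand-in for ProtectPlusWordBeforeDigitWordsStep, whose implementation is not shown in this review; it assumes the step rewrites "plus" to "+" whenever a configured digit word follows, and the two dictionaries are truncated subsets of the German mapping.

```python
import re


def protect_plus(text: str, digit_words: dict[str, str]) -> str:
    """Rewrite 'plus' as '+' when a configured digit word follows it."""
    # Longest words first so "eins" is tried before its prefix "ein".
    words = sorted(map(re.escape, digit_words), key=len, reverse=True)
    pattern = re.compile(
        rf"\bplus(?=\s+(?:{'|'.join(words)})\b)", re.IGNORECASE
    )
    return pattern.sub("+", text)


# Truncated subsets of the mapping, with and without the ambiguous "ein".
WITH_EIN = {"null": "0", "ein": "1", "eins": "1", "zwei": "2"}
WITHOUT_EIN = {k: v for k, v in WITH_EIN.items() if k != "ein"}

print(protect_plus("plus ein bisschen", WITH_EIN))
print(protect_plus("plus ein bisschen", WITHOUT_EIN))
print(protect_plus("plus eins zwei", WITHOUT_EIN))
```

Under these assumptions, dropping "ein" from the digit words leaves ordinary phrases alone while "plus eins ..." sequences are still treated as a phone prefix.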

Comment on lines +16 to +19
_RE_MIXED_NUMBER = re.compile(
    r"\b(\d+)\s+(cento|mila?|milioni?|miliardi?)\b",
    re.IGNORECASE,
)

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
python - <<'PY'
import re

pat = re.compile(r"\b(\d+)\s+(cento|mila?|milioni?|miliardi?)\b", re.IGNORECASE)
for text in ("2 mila", "1 mille", "2 mil"):
    print(f"{text!r}: {bool(pat.search(text))}")
PY

Repository: gladiaio/normalization

Length of output: 111


🏁 Script executed:

find . -type f -name "*.py" -o -name "*.json" | head -20

Repository: gladiaio/normalization

Length of output: 930


🏁 Script executed:

fd -e py -e json | grep -i italian | head -20

Repository: gladiaio/normalization

Length of output: 312


🏁 Script executed:

rg -i "mille|mila" --type py | head -30

Repository: gladiaio/normalization

Length of output: 355


🏁 Script executed:

cat -n normalization/languages/italian/operators.py | head -50

Repository: gladiaio/normalization

Length of output: 1745


🏁 Script executed:

cat -n normalization/languages/italian/number_normalizer.py

Repository: gladiaio/normalization

Length of output: 2358


Match mille explicitly instead of mila?.

Line 17 currently matches mil/mila but not the configured Italian word mille, so inputs like 1 mille skip the mixed-number pre-pass while invalid 2 mil is accepted by the regex.

Proposed fix
 _RE_MIXED_NUMBER = re.compile(
-    r"\b(\d+)\s+(cento|mila?|milioni?|miliardi?)\b",
+    r"\b(\d+)\s+(cento|mille|mila|milioni?|miliardi?)\b",
     re.IGNORECASE,
 )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-_RE_MIXED_NUMBER = re.compile(
-    r"\b(\d+)\s+(cento|mila?|milioni?|miliardi?)\b",
-    re.IGNORECASE,
-)
+_RE_MIXED_NUMBER = re.compile(
+    r"\b(\d+)\s+(cento|mille|mila|milioni?|miliardi?)\b",
+    re.IGNORECASE,
+)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@normalization/languages/italian/number_normalizer.py` around lines 16 - 19,
The regex _RE_MIXED_NUMBER currently uses "mila?" which matches "mil" or "mila"
and misses the correct Italian "mille"; update the pattern used in
_RE_MIXED_NUMBER so it explicitly matches "mille" and "mila" (e.g., replace the
"mila?" token with an explicit alternation like "(mille|mila)") while preserving
other alternatives (cento, milione/milioni, miliardo/miliardi) and re.IGNORECASE
to ensure inputs like "1 mille" are correctly caught by the mixed-number
pre-pass.
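A quick side-by-side run of the two patterns makes the difference concrete. Both regexes are copied from the comment; `compare` and `SAMPLES` are names made up for this sketch.

```python
import re

ORIGINAL = re.compile(
    r"\b(\d+)\s+(cento|mila?|milioni?|miliardi?)\b", re.IGNORECASE
)
PROPOSED = re.compile(
    r"\b(\d+)\s+(cento|mille|mila|milioni?|miliardi?)\b", re.IGNORECASE
)

SAMPLES = ("2 mila", "1 mille", "2 mil", "1 milione", "3 milioni")


def compare(samples=SAMPLES) -> dict[str, tuple[bool, bool]]:
    """Map each sample to (matched by ORIGINAL, matched by PROPOSED)."""
    return {
        s: (ORIGINAL.search(s) is not None, PROPOSED.search(s) is not None)
        for s in samples
    }


for sample, (old, new) in compare().items():
    print(f"{sample!r}: original={old} proposed={new}")
```

One thing the comparison also surfaces: `milioni?` and `miliardi?` only make the final `i` optional, so the singular forms "milione" and "miliardo" are matched by neither pattern. Whether that matters depends on how alpha2digit handles inputs such as "1 milione" on its own, which is not verified here.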

Comment on lines +108 to +109
def expand_written_numbers(self, text: str) -> str:
    return self._number_normalizer(text)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Inspect where the numeric-context step is included relative to n_to_digits.
# Expectation: If fix_one_word_in_numeric_contexts can run for Italian, ItalianOperators should still override it.
rg -n -C3 '\bfix_one_word_in_numeric_contexts\b|\bn_to_digits\b|ExpandWrittenNumbersToDigitsStep|FixOneWordInNumericContextsStep'

Repository: gladiaio/normalization

Length of output: 7509


🏁 Script executed:

grep -n "fix_one_word_in_numeric_contexts" normalization/languages/italian/operators.py

Repository: gladiaio/normalization

Length of output: 48


🏁 Script executed:

cat -n normalization/languages/italian/operators.py | head -130

Repository: gladiaio/normalization

Length of output: 3449


Implement fix_one_word_in_numeric_contexts() for Italian.

FixOneWordInNumericContextsStep is registered in the pipeline and calls operators.fix_one_word_in_numeric_contexts(). Without an override, Italian falls back to the base no-op, while English and Spanish both implement language-specific regex patterns to convert "one"/"uno" to "1" when adjacent to digits. The new expand_written_numbers() method handles full number words but does not cover isolated "uno" in numeric contexts, so inputs like "10 uno uno" will regress to unmodified output.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@normalization/languages/italian/operators.py` around lines 108 - 109, The
Italian operator needs a language-specific override of
fix_one_word_in_numeric_contexts to convert isolated "uno" to "1" when adjacent
to digits (mirroring English/Spanish implementations) because
expand_written_numbers only handles full number words; implement
operators.fix_one_word_in_numeric_contexts() to use a regex that matches
word-boundary "uno" when preceded or followed by digits (or digit sequences with
separators) and replace it with "1" while preserving surrounding
whitespace/punctuation, ensuring the method name
fix_one_word_in_numeric_contexts and the existing expand_written_numbers remain
unchanged.
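One possible shape for such an override is sketched below as a standalone function. It is an assumption about what "adjacent to digits" should mean (a digit on either side, separated by one whitespace character) and does not mirror the English or Spanish implementations, which are not shown in this review; separators other than whitespace are not handled.

```python
import re

# "uno" directly after a digit, or directly before one.
_RE_UNO_NEAR_DIGIT = re.compile(
    r"(?<=\d\s)uno\b|\buno(?=\s\d)", re.IGNORECASE
)


def fix_one_word_in_numeric_contexts(text: str) -> str:
    """Convert isolated 'uno' to '1' when it sits next to a digit."""
    previous = None
    # Repeat until stable: each new "1" can make the next "uno" digit-adjacent.
    while previous != text:
        previous = text
        text = _RE_UNO_NEAR_DIGIT.sub("1", text)
    return text


print(fix_one_word_in_numeric_contexts("10 uno uno"))
print(fix_one_word_in_numeric_contexts("uno di noi"))
```

The loop is needed because a single `re.sub` pass evaluates the lookbehind against the original string, so in "10 uno uno" the second "uno" only becomes digit-adjacent after the first has been replaced.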

@Karamouche Karamouche merged commit c19e573 into main Apr 21, 2026
10 checks passed
@Karamouche Karamouche deleted the fix/number-normalization-de-du-it branch April 21, 2026 17:24