feat: init finnish language basic normalization #22
📝 Walkthrough

This pull request adds comprehensive Finnish language support to the normalization module. It introduces a Finnish language package with number normalization logic, word replacement mappings, language operators, and complete unit test coverage, and registers Finnish as a supported language in the module's language registry.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 6
🧹 Nitpick comments (4)
normalization/languages/finnish/operators.py (1)
**101-104: Move the `FINNISH_REPLACEMENTS` import to module scope.**

The function-local import pattern is usually reserved for breaking circular imports, but `replacements.py` is a leaf module that doesn't import from `operators.py`, so a top-level import is safe and matches what the other language packages (per the `__init__.py` exports) already do.

```diff
 from normalization.languages.base import LanguageConfig, LanguageOperators
 from normalization.languages.finnish.number_normalizer import FinnishNumberNormalizer
+from normalization.languages.finnish.replacements import FINNISH_REPLACEMENTS
 from normalization.languages.registry import register_language
@@
     def get_word_replacements(self) -> dict[str, str]:
-        from normalization.languages.finnish.replacements import FINNISH_REPLACEMENTS
-
         return FINNISH_REPLACEMENTS
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@normalization/languages/finnish/operators.py` around lines 101 - 104: Move the local import of FINNISH_REPLACEMENTS out of get_word_replacements and place it at module scope; in operators.py add a top-level `from normalization.languages.finnish.replacements import FINNISH_REPLACEMENTS` and then update get_word_replacements to simply return FINNISH_REPLACEMENTS (remove the function-local import). This keeps parity with other language modules and avoids unnecessary function-local imports when there is no circular dependency.

normalization/languages/finnish/number_normalizer.py (2)
**272-322: Redundant and inconsistent `"yksi"` + multiplier special cases.**

These four branches (`yksi tuhat/tuhatta`, `yksi miljoona`, `yksi miljardi(a)`, `yksi biljoona(a)`) are unreachable in practice: `_parse_0_999` already matches bare `yksi` via `_parse_0_99` (line 454) and returns `(i+1, 1)`, after which the chaining logic at lines 332–353 multiplies by the following `_BIG_MULT` entry. You can verify this by tracing the `"yksi tuhat"` and `"yksi miljoona"` tests: both paths reach the same result via the fallthrough.

They are also inconsistent with `_BIG_MULT`:

- Line 281 matches only `"miljoona"` but not `"miljoonaa"`/`"miljoonan"`.
- Line 294 matches `"miljardi"`/`"miljardia"` but not `"miljardin"`.
- Line 312 matches `"biljoona"`/`"biljoonaa"` but not `"biljoonan"`.

Either remove the special cases entirely (cleanest), or expand them to cover every inflection in `_BIG_MULT`; anything in between just confuses future readers into thinking there's a semantic distinction when there isn't.

```diff
-        if i + 1 < n and fw == "yksi" and _fold(words[i + 1]) in ("tuhat", "tuhatta"):
-            j = i + 2
-            tail = self._parse_number(words, j, n)
-            base = 1000
-            if tail is not None:
-                end, v2 = tail
-                return end, base + v2
-            return j, base
-
-        if i + 1 < n and fw == "yksi" and _fold(words[i + 1]) == "miljoona":
-            ...
-        if (... "miljardi", "miljardia" ...):
-            ...
-        if (... "biljoona", "biljoonaa" ...):
-            ...
+        # `yksi <multiplier>` is already handled by _parse_0_999 + the chaining
+        # path below, so no special-case branches are needed here.
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@normalization/languages/finnish/number_normalizer.py` around lines 272 - 322: The code contains redundant special-case branches handling "yksi" + multiplier (the blocks that check fw == "yksi" for tuh(a)t, miljoona, miljardi(a), biljoona(a)); these are unreachable because _parse_0_999 already parses bare "yksi" and the generic chaining logic in _parse_number/_BIG_MULT handles multiplication, and the special cases are also inconsistent with _BIG_MULT inflections. Remove these four "yksi" special-case blocks entirely (or if you prefer to keep them, make them mirror every inflection listed in _BIG_MULT), leaving the generic _parse_0_999 → _parse_number chaining to handle "yksi" multipliers; update or delete any related comments so the intent is clear.
**19-24: `_get` unnecessarily linear over dict keys.**

The lookup tables are built with already-lowercase ASCII/Unicode keys, so `_get` can be a direct `table.get(word.casefold())` instead of scanning every key and `_fold`-ing it on every call. `_parse_glued_kymmenta`, `_parse_0_99`, and `_continues_number` all hit this function in tight loops across the input.

```diff
 def _get(table: dict[str, int], word: str) -> int | None:
-    fw = _fold(word)
-    for k, v in table.items():
-        if _fold(k) == fw:
-            return v
-    return None
+    return table.get(_fold(word))
```

If there's a reason keys might contain mixed case in the future, a one-time lowercase normalization at module load is still cheaper than per-call scans.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@normalization/languages/finnish/number_normalizer.py` around lines 19 - 24: The _get function is doing an O(N) scan; replace it with a direct lookup by using table.get(_fold(word)) (or table.get(word.casefold())) to avoid per-call key iteration, and ensure the numeric lookup tables used by _parse_glued_kymmenta, _parse_0_99, and _continues_number are normalized once at module load (e.g., rebuild each table with keys passed through _fold/casefold when they are created) so mixed-case keys won't break the direct lookup.

tests/unit/languages/finnish_operators_test.py (1)
**25-29: Minor: call `get_word_replacements()` once.**

Repeated calls work but are wasteful and make the intent less clear. Consider binding once:

```diff
-def test_word_replacements(operators: FinnishOperators) -> None:
-    assert operators.get_word_replacements()["ma"] == "mina"
-    assert operators.get_word_replacements()["ok"] == "okei"
-    assert operators.get_word_replacements()["juu"] == "joo"
-    assert operators.get_word_replacements()["euro"] == "euros"
+def test_word_replacements(operators: FinnishOperators) -> None:
+    replacements = operators.get_word_replacements()
+    assert replacements["ma"] == "mina"
+    assert replacements["ok"] == "okei"
+    assert replacements["juu"] == "joo"
+    assert replacements["euro"] == "euros"
```

Also note: the `"euro" == "euros"` expectation here locks in the questionable `euro → euros` mapping flagged on `replacements.py`; if that entry is removed, this assertion needs to be updated.
Verify each finding against the current code and only fix it if needed. In `@tests/unit/languages/finnish_operators_test.py` around lines 25 - 29, In test_word_replacements, call operators.get_word_replacements() once and store the result in a local variable (e.g., replacements) and then use replacements[...] for each assertion to avoid repeated calls; locate the test function test_word_replacements and the FinnishOperators.get_word_replacements() usage to change the four assert lines accordingly, and update or remove the "euro" == "euros" assertion if the euro→euros entry is removed from replacements.py.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: c4e9f736-86bf-42d5-9f99-46b5b5a3177d
⛔ Files ignored due to path filters (1)

- `tests/e2e/files/gladia-3/fi.csv` is excluded by `!**/*.csv`
📒 Files selected for processing (7)

- normalization/languages/__init__.py
- normalization/languages/finnish/__init__.py
- normalization/languages/finnish/number_normalizer.py
- normalization/languages/finnish/operators.py
- normalization/languages/finnish/replacements.py
- tests/unit/languages/finnish_number_normalizer_test.py
- tests/unit/languages/finnish_operators_test.py
```python
def _singular_spoken_unit(trailing_word: str) -> str:
    t = trailing_word.lower()
    if t == "euros":
        return "euro"
    if t == "dollars":
        return "dollar"
    if t == "pounds":
        return "pound"
    if t == "yens":
        return "yen"
    return trailing_word
```
**Hardcoded English singular/plural table in a Finnish module.**

`_singular_spoken_unit` maps `euros`→`euro`, `dollars`→`dollar`, `pounds`→`pound`, `yens`→`yen`. These are English forms; there is nothing Finnish about them, and the function will silently return the input unchanged for any value the config actually should hold in Finnish (`euroa`, `dollaria`, `puntaa`, `jeniä`, `senttiä`). This strongly suggests the design copied the Dutch/Swedish normalizer verbatim and inherited their trailing-word scheme.

In Finnish, numerals take the partitive singular regardless of amount (1 euro / 5 euroa; "yksi euro" is also correct for 1, but "5 euros" is never correct). That means:

- If `currency_symbol_to_word` is set to the Finnish partitive (`euroa` etc.), `_singular_spoken_unit` and `_currency_plural_fix_patterns`/`_apply_currency_plural_fixes` become unnecessary: you can just substitute the trailing word directly in `_normalize_currency_symbols`.
- The current two-pass approach (convert to singular, then regex back to plural) is an unnecessarily lossy round-trip that also risks rewriting unrelated occurrences of `euro`/`dollar`/etc. elsewhere in the text.

Recommended to drop `_singular_spoken_unit`, `_currency_plural_fix_patterns`, and `_apply_currency_plural_fixes`, and simplify `_normalize_currency_symbols` to emit the configured trailing word directly. (See the companion comment on `operators.py` re: fixing the config itself.)
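A single-pass replacement along those lines might look like this. It is a standalone sketch, not the repo's actual `_normalize_currency_symbols`; the mapping and the digit pattern are illustrative assumptions:

```python
import re

# illustrative mapping, assuming the config is switched to Finnish partitive forms
CURRENCY_SYMBOL_TO_WORD = {"€": "euroa", "$": "dollaria", "£": "puntaa"}

_SYMBOLS = "".join(re.escape(s) for s in CURRENCY_SYMBOL_TO_WORD)
# symbol before an amount, e.g. "€50" or "€9,99" (Finnish decimal comma)
_PATTERN = re.compile(rf"([{_SYMBOLS}])\s*(\d+(?:,\d+)?)")


def normalize_currency_symbols(text: str) -> str:
    # single pass: move the amount first and append the configured trailing word
    return _PATTERN.sub(
        lambda m: f"{m.group(2)} {CURRENCY_SYMBOL_TO_WORD[m.group(1)]}", text
    )
```

With the trailing word already stored in the partitive, no second regex pass is needed to "pluralize" anything afterwards.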
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@normalization/languages/finnish/number_normalizer.py` around lines 155 - 165,
The Finnish normalizer contains an English-centric two‑pass currency fix: remove
the hardcoded English mapping in _singular_spoken_unit and eliminate
_currency_plural_fix_patterns and _apply_currency_plural_fixes, then change
_normalize_currency_symbols to directly insert the configured Finnish trailing
form from currency_symbol_to_word (e.g., partitive forms like "euroa") instead
of converting to English singular and regex‑replacing back to plural; update any
callers of those removed helpers to use the single-pass replacement so we no
longer do the lossy round‑trip or risk rewriting unrelated occurrences.
```python
if _fold(words[i]) == "nolla":
    if i + 1 < n and self._continues_number(words[i + 1]):
        return None
    return i + 1, 0
```
**`nolla` followed by a number word is silently left un-normalized.**

When `nolla` is followed by another number word, `_parse_0_999` returns `None` rather than producing 0. The caller then falls through to `out.append(words[i])`, leaving the literal `"nolla"` in place while the next word still gets converted. Result: `"nolla kaksi"` → `"nolla 2"`, which is neither the original spelled-out form nor a consistent digit form.

If the goal is to avoid consuming leading zeros in a compound (e.g. phone-number-like sequences), consider emitting `"0"` explicitly so at least the output is internally consistent:

```diff
 if _fold(words[i]) == "nolla":
-    if i + 1 < n and self._continues_number(words[i + 1]):
-        return None
     return i + 1, 0
```

or, if the "don't consume" behavior is intentional for digit-sequence preservation, document it with a comment and add a test case covering the intended downstream step that turns each `nolla` into `0`.
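A minimal sketch of the suggested behavior (hypothetical standalone code, not the repo's normalizer; `DIGITS` is an illustrative subset of the digit-word table):

```python
DIGITS = {"nolla": "0", "yksi": "1", "kaksi": "2"}


def normalize_digit_words(text: str) -> str:
    # with the fix, "nolla" converts like any other digit word,
    # so mixed outputs such as "nolla 2" can no longer occur
    return " ".join(DIGITS.get(word, word) for word in text.split())
```

A unit test over this behavior would pin down the `"nolla kaksi"` → `"0 2"` expectation so the regression can't reappear.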
📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
if _fold(words[i]) == "nolla":
    return i + 1, 0
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@normalization/languages/finnish/number_normalizer.py` around lines 361 - 364,
The branch in _parse_0_999 that checks _fold(words[i]) == "nolla" currently
returns None when the next token satisfies _continues_number, causing "nolla
kaksi" → "nolla 2"; change the behavior to return a consumed index and numeric 0
instead (i.e., return i+1, 0) so "nolla" is normalized to "0" even when followed
by another number word; update or add a unit test for _parse_0_999 covering
"nolla kaksi" and document the behavior with a brief comment referencing _fold,
_continues_number, and _parse_0_999.
```python
_FINNISH_DIGIT_WORDS: dict[str, str] = {
    "nolla": "0",
    "yksi": "1",
    "kaksi": "2",
    "kolme": "3",
    "neljä": "4",
    "viisi": "5",
    "kuusi": "6",
    "seitsemän": "7",
    "kahdeksan": "8",
    "yhdeksän": "9",
}
```
**Include both diacritic and ASCII-folded forms in `_FINNISH_DIGIT_WORDS`.**

The pipeline consumes `digit_words` in two stages with opposing diacritic states:

- `protect_plus_word_before_digit_words`, `normalize_punctuation_between_number_words`, and `convert_digit_word_sequences_to_digits` run before `casefold_text` and `remove_diacritics`, so they match diacritized forms in the original text.
- `fix_dot_adjacent_number_words` runs after diacritics are removed in `text_post`, so it matches ASCII-folded forms.

The current `_FINNISH_DIGIT_WORDS` contains only diacritized keys (`neljä`, `seitsemän`, `yhdeksän`), which will fail in the second stage. Follow the pattern already used in `FinnishNumberNormalizer` (lines 32–54): include both forms for each word:
```python
_FINNISH_DIGIT_WORDS: dict[str, str] = {
    "nolla": "0",
    "yksi": "1",
    "kaksi": "2",
    "kolme": "3",
    "neljä": "4",
    "nelja": "4",
    "viisi": "5",
    "kuusi": "6",
    "seitsemän": "7",
    "seitseman": "7",
    "kahdeksan": "8",
    "yhdeksän": "9",
    "yhdeksan": "9",
}
```

The same fix applies to `number_words` (lines 57–84).
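Rather than hand-duplicating every entry, the folded keys could also be generated once at module load. The helper below is a hypothetical sketch (its name and placement are assumptions, not part of the PR):

```python
import unicodedata


def with_ascii_folded(table: dict[str, str]) -> dict[str, str]:
    # add an ASCII-folded duplicate key for every diacritized entry,
    # keeping the original diacritized keys intact
    folded = {
        unicodedata.normalize("NFKD", key).encode("ascii", "ignore").decode(): value
        for key, value in table.items()
    }
    return {**folded, **table}
```

Applying it at the bottom of the module (e.g. `_FINNISH_DIGIT_WORDS = with_ascii_folded(_FINNISH_DIGIT_WORDS)`) keeps the human-maintained table free of mechanical duplicates.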
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@normalization/languages/finnish/operators.py` around lines 5 - 16, The digit
and number-word mappings (_FINNISH_DIGIT_WORDS and number_words) only include
diacritized keys and must also include ASCII-folded equivalents so later stages
that run after remove_diacritics can match; update _FINNISH_DIGIT_WORDS and
number_words to duplicate entries for each diacritized key with its ASCII-folded
form (e.g., add "nelja" alongside "neljä", "seitseman" alongside "seitsemän",
"yhdeksan" alongside "yhdeksän", etc.) following the pattern used in
FinnishNumberNormalizer (duplicate mapping keys to the same digit strings).
```python
currency_symbol_to_word={
    "€": "euros",
    "$": "dollars",
    "£": "pounds",
    "¢": "cent",
    "¥": "yens",
},
```
**Fix Finnish currency forms: use partitive case instead of English plurals.**

The `currency_symbol_to_word` mapping uses English plural forms (`euros`, `dollars`, `pounds`, `yens`) when Finnish requires the partitive case after numerals: `euroa`, `dollaria`, `puntaa`, `senttiä`, `jeniä`. For example, 5 € in Finnish speech is "viisi euroa", never "viisi euros". This causes:

- Non-Finnish output text that will degrade WER against correct Finnish transcripts
- Awkward redundancy: `_singular_spoken_unit` contains hardcoded English mappings (`euros`→`euro`, `dollars`→`dollar`, etc.) that must then be "fixed back" to plural via `_currency_plural_fix_patterns`, a clear sign of non-Finnish adaptation

Additionally, `"¢" → "cent"` is singular while all the others are plural, breaking internal consistency.

Replace with Finnish partitive forms:
Suggested diff:

```diff
 currency_symbol_to_word={
-    "€": "euros",
-    "$": "dollars",
-    "£": "pounds",
-    "¢": "cent",
-    "¥": "yens",
+    "€": "euroa",
+    "$": "dollaria",
+    "£": "puntaa",
+    "¢": "senttiä",
+    "¥": "jeniä",
 },
```

With this fix, both `_singular_spoken_unit` and `_currency_plural_fix_patterns` can be removed entirely (the partitive form is already correct after numerals). Update test expectations in `finnish_number_normalizer_test.py` accordingly (e.g., `"€50"` → `"50 euroa"`).
📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
currency_symbol_to_word={
    "€": "euroa",
    "$": "dollaria",
    "£": "puntaa",
    "¢": "senttiä",
    "¥": "jeniä",
},
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@normalization/languages/finnish/operators.py` around lines 35 - 41, The
mapping currency_symbol_to_word currently uses English plurals (e.g., "euros",
"dollars") which is incorrect for Finnish numerals; update
currency_symbol_to_word to use Finnish partitive forms ("euroa", "dollaria",
"puntaa", "senttiä", "jeniä") for the symbols "€", "$", "£", "¢", "¥". After
changing currency_symbol_to_word, remove the now-redundant helpers
`_singular_spoken_unit` and `_currency_plural_fix_patterns` (and any logic that
relies on them) and update the tests in finnish_number_normalizer_test.py to
expect partitive outputs (e.g., "€50" -> "50 euroa"). Ensure all references to
those removed symbols are cleaned up to avoid unused symbol errors.
```python
FINNISH_REPLACEMENTS: dict[str, str] = {
    "ma": "mina",
    "maa": "mina",
    "mulle": "minulle",
    "mulla": "minulla",
    "mua": "minua",
    "mun": "minun",
    "sa": "sina",
    "sulle": "sinulle",
    "sulla": "sinulla",
    "sua": "sinua",
    "sun": "sinun",
    "toi": "tuo",
    "ton": "tuon",
    "tossa": "tuossa",
    "tosta": "tuosta",
    "tohon": "tuohon",
    "taa": "tama",
    "naa": "nama",
    "olis": "olisi",
    "ois": "olisi",
    "oo": "ole",
    "ollu": "ollut",
    "onks": "onko",
    "oliks": "oliko",
    "oisko": "olisiko",
    "vois": "voisi",
    "katotaan": "katsotaan",
    "kattoa": "katsoa",
    "mut": "mutta",
    "sit": "sitten",
    "sitte": "sitten",
    "et": "etta",
    "sillon": "silloin",
    "viimeks": "viimeksi",
    "elikka": "eli",
    "juu": "joo",
    "jes": "joo",
    "ok": "okei",
    "bank": "pankki",
    "bankin": "pankin",
    "euro": "euros",
}
```
Several replacement entries look incorrect or unsafe for Finnish.
A few of these will actively corrupt otherwise-correct Finnish text rather than normalize colloquial forms:
"maa": "mina"—maais a common Finnish noun meaning "land/country". Mapping it tomina(minä, "I") will mangle any sentence mentioning a country or land. The intended colloquial form for minä ismä→ ASCIIma, which is already covered on line 8. Line 9 should be dropped."euro": "euros"—eurois the standard Finnish singular for the currency. Replacing it witheuros(which is an English plural, not Finnish) will both break correct Finnish and conflict with the currency-restore logic innumber_normalizer.py. In Finnish the form used after a numeral is the partitiveeuroa(already appearing in the tests as the expected output for"kymmenen euroa")."bank"/"bankin"→"pankki"/"pankin"— these are English, not Finnish colloquial variants. If the intent is to normalize ASR mis-transcriptions of loanwords, it should be documented; otherwise they don't belong in a Finnish colloquial→standard table."jes": "joo"—jesis an interjection ("yes!"), not a variant ofjoo. Collapsing it tojooloses semantic distinction; consider dropping.
Please have a native speaker sanity-check the rest of the table as well (e.g., "taa" only matches tää once diacritics are stripped, which seems to be the assumption per the module docstring, but it would also collide with the Finnish word taa = "behind" if the pipeline ever feeds un-folded input).
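The `maa` collision is easy to demonstrate with a toy word-level replacer — hypothetical code, assuming the table is applied per whitespace-split, diacritic-folded token:

```python
# Toy illustration only, not the project's replacement step: applying the
# table token-by-token mangles correct standard Finnish containing "maa".
REPLACEMENTS = {"ma": "mina", "maa": "mina", "mut": "mutta"}

def apply_replacements(text: str) -> str:
    # Replace each whitespace-delimited token if it has a table entry.
    return " ".join(REPLACEMENTS.get(tok, tok) for tok in text.split())

# "suomi on kaunis maa" = "Finland is a beautiful country"
print(apply_replacements("suomi on kaunis maa"))  # -> suomi on kaunis mina
```

The output "suomi on kaunis mina" ("Finland is a beautiful I") is exactly the kind of corruption that would inflate WER on otherwise-correct transcripts.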
🩹 Suggested removals

```diff
     "ma": "mina",
-    "maa": "mina",
     "mulle": "minulle",
@@
     "juu": "joo",
-    "jes": "joo",
     "ok": "okei",
-    "bank": "pankki",
-    "bankin": "pankin",
-    "euro": "euros",
+    "bank": "pankki",
+    "bankin": "pankin",
```

(Keep or drop the bank* entries depending on whether they are intentional ASR fixes — if kept, consider a comment explaining the rationale.)
```python
@pytest.mark.parametrize(
    ("text", "expected"),
    [
        ("kaksi kymmenta viisi", "25"),
        ("kaksi kymmentä viisi", "25"),
        ("sata", "100"),
        ("tuhat", "1000"),
        ("yksi tuhat", "1000"),
        ("kolme miljoonaa", "3000000"),
        ("yksi miljoona", "1000000"),
    ],
)
def test_finnish_spelled_numbers(
    normalizer: FinnishNumberNormalizer, text: str, expected: str
) -> None:
    assert normalizer(text) == expected


@pytest.mark.parametrize(
    ("text", "expected"),
    [
        ("kymmenen euroa", "10 euroa"),
        ("€50", "50 euros"),
        ("50 €", "50 euros"),
    ],
)
def test_currency_and_spoken_units(
    normalizer: FinnishNumberNormalizer, text: str, expected: str
) -> None:
    assert normalizer(text) == expected
```
Currency expectations encode the English-plural bug.
"€50" → "50 euros" and "50 € → 50 euros" hardcode the output of the English-plural currency mapping (see the comment on operators.py lines 35–41). Asymmetrically, "kymmenen euroa" → "10 euroa" already uses the correct Finnish partitive (because euroa is preserved verbatim as a trailing word, not reconstructed from the symbol map).
If the currency map is fixed to use euroa / dollaria / puntaa / senttiä / jeniä, these parametrized expectations should be updated accordingly:
- ("€50", "50 euros"),
- ("50 €", "50 euros"),
+ ("€50", "50 euroa"),
+ ("50 €", "50 euroa"),Also consider adding a negative/edge case: a bare currency symbol with no adjacent number (e.g. "€" alone) and a decimal amount (e.g. "€9,99" — note Finnish decimal comma), to lock down behavior around decimal_separator=",".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/unit/languages/finnish_number_normalizer_test.py` around lines 17 - 46,
The test expectations for symbol-mapped currencies in
test_currency_and_spoken_units are asserting English plurals; update the
expected outputs to use Finnish partitive forms (e.g., change "50 euros" to "50
euroa" and similarly for other currency tests) to match the corrected currency
mapping in operators.py, and add two new parametrized cases in the same test:
one for a bare symbol with no number (e.g., "€" -> expected behavior such as
unchanged "€" or a decided normalization) and one for a decimal amount using
Finnish decimal comma (e.g., "€9,99" -> "9,99 euroa") to verify
decimal_separator="," handling by FinnishNumberNormalizer.
What does this PR do?
Adds Finnish normalization (operators, replacements, number normalizer, registry wiring, unit and gladia-3 e2e tests).
Type of change
Checklist
Only fill in the section(s) that match your change — delete the rest.
New language
- `normalization/languages/{lang}/` with `operators.py`, `replacements.py`, `__init__.py`
- `replacements.py` (not hardcoded in `operators.py`)
- `LanguageConfig` is filled in with the language's data (separators, currency words, digit words, …)
- `LanguageOperators` — only override methods where the logic changes, not just the data
- `@register_language` and imported in `normalization/languages/__init__.py`
- `tests/unit/languages/`
- `tests/e2e/files/{preset}/{lang}.csv` (e.g. `tests/e2e/files/gladia-3/fr.csv`)

Edit existing language

- `replacements.py`, not inline in `operators.py`
- `None`: the step reading it still handles `None` gracefully

New step

- `name` class attribute set (this is the key used in YAML presets)
- `@register_step` and imported in `steps/text/__init__.py` or `steps/word/__init__.py`
- `operators.config.*` instead
- `steps/text/placeholders.py` and `pipeline/base.py`'s `validate()` is updated
- `tests/unit/steps/`
- `uv run scripts/generate_step_docs.py`

Edit existing step

- `name` is unchanged — if the output changes, create a new step name + new preset instead
- `uv run scripts/generate_step_docs.py`

Preset change

- `pipeline.validate()` passes (runs automatically via `loader.py`)

How was this tested?
Summary by CodeRabbit
New Features
Tests