
feat: init finnish language basic normalization#22

Merged
Karamouche merged 2 commits into main from feat/add-finnish-language
May 5, 2026

Conversation

@egenthon-cmd
Contributor

@egenthon-cmd egenthon-cmd commented Apr 23, 2026

What does this PR do?

Adds Finnish normalization: operators, replacements, number normalizer, registry wiring, and unit plus gladia-3 e2e tests.

Type of change

  • New language
  • Edit existing language (fix a replacement, tweak config, …)
  • New normalization step
  • Edit existing step (bug fix, behaviour change)
  • New preset version
  • Bug fix (other)
  • Refactor / docs / CI

Checklist

Only fill in the section(s) that match your change — delete the rest.


New language

  • Created normalization/languages/{lang}/ with operators.py, replacements.py, __init__.py
  • Word substitutions are in replacements.py (not hardcoded in operators.py)
  • LanguageConfig is filled in with the language's data (separators, currency words, digit words, …)
  • Subclassed LanguageOperators — only override methods where the logic changes, not just the data
  • Class is decorated with @register_language and imported in normalization/languages/__init__.py
  • Unit tests added in tests/unit/languages/
  • E2e CSV added in tests/e2e/files/{preset}/{lang}.csv (e.g. tests/e2e/files/gladia-3/fr.csv)

Edit existing language

  • New/changed word substitutions go in replacements.py, not inline in operators.py
  • If you changed a config field that can be None: the step reading it still handles None gracefully
  • Unit tests updated or added
  • E2e CSV updated if the expected output changed

New step

  • Unique name class attribute set (this is the key used in YAML presets)
  • Decorated with @register_step and imported in steps/text/__init__.py or steps/word/__init__.py
  • No hardcoded language values — read data from operators.config.* instead
  • If placeholder-based: protect + restore are both in steps/text/placeholders.py and pipeline/base.py's validate() is updated
  • Unit tests added in tests/unit/steps/
  • Step name added to the relevant preset YAML — or a new preset file created if existing presets are affected
  • If the docstring changed: ran uv run scripts/generate_step_docs.py

Edit existing step

  • Step name is unchanged — if the output changes, create a new step name + new preset instead
  • No language-specific logic or string literals added inside the step
  • Unit tests updated or added
  • If the docstring changed: ran uv run scripts/generate_step_docs.py

Preset change

  • Existing preset files are not modified — new behaviour goes in a new preset file
  • pipeline.validate() passes (runs automatically via loader.py)

How was this tested?

uv run pytest tests/

Summary by CodeRabbit

  • New Features

    • Finnish language support now available with automatic conversion of written number words to digits.
    • Finnish word replacement rules for normalizing colloquial to standard forms.
    • Currency symbol handling for Finnish language text processing.
  • Tests

    • Added comprehensive unit tests validating Finnish number normalization and language operations, including currency handling scenarios.

@coderabbitai

coderabbitai Bot commented Apr 23, 2026

Warning

Rate limit exceeded

@egenthon-cmd has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 52 minutes and 30 seconds before requesting another review.

To keep reviews running without waiting, you can enable the usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: eb68aff9-438c-46e1-82ec-25107b02a7cd

📥 Commits

Reviewing files that changed from the base of the PR and between f259b26 and 761ca87.

📒 Files selected for processing (1)
  • normalization/languages/finnish/operators.py
📝 Walkthrough

This pull request adds comprehensive Finnish language support to the normalization module. It introduces a Finnish language package with number normalization logic, word replacement mappings, language operators, and complete unit test coverage, registering Finnish as a supported language in the module's language registry.

Changes

Changes by cohort and file:

  • Module Registration (normalization/languages/__init__.py): Adds Finnish to the exported language modules list via __all__ and imports the finnish submodule.
  • Finnish Package Initialization (normalization/languages/finnish/__init__.py): Exposes FinnishOperators and FINNISH_REPLACEMENTS as public exports from the Finnish language package.
  • Finnish Number Normalization (normalization/languages/finnish/number_normalizer.py): Implements the FinnishNumberNormalizer class that converts Finnish spelled-out numbers (0–999, thousands, millions, etc.) to digit form, with support for various grammatical forms, large multipliers, and optional currency symbol normalization with plural restoration.
  • Finnish Language Operators (normalization/languages/finnish/operators.py): Defines the FinnishOperators class that integrates FinnishNumberNormalizer for expanding written numbers and provides word replacement mappings via get_word_replacements().
  • Finnish Word Replacements (normalization/languages/finnish/replacements.py): Defines the FINNISH_REPLACEMENTS dictionary mapping colloquial/spoken Finnish tokens to canonical standard forms.
  • Finnish Operator Tests (tests/unit/languages/finnish_operators_test.py): Unit tests verifying Finnish language registration, operator instantiation, configuration, and word replacement mappings.
  • Finnish Number Normalizer Tests (tests/unit/languages/finnish_number_normalizer_test.py): Unit tests validating conversion of Finnish number phrases to digits, currency handling, and edge cases with/without currency configuration.
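
The replacement table described above amounts to a token-level substitution map. A minimal hypothetical sketch (the entries below are the ones asserted in the PR's unit tests; the real FINNISH_REPLACEMENTS in replacements.py is larger, and apply_replacements is illustrative, not the module's actual API):

```python
# Hypothetical sketch of a colloquial -> standard replacement table.
FINNISH_REPLACEMENTS: dict[str, str] = {
    "ma": "mina",   # colloquial first-person pronoun -> standard form
    "ok": "okei",
    "juu": "joo",
}

def apply_replacements(words: list[str], table: dict[str, str]) -> list[str]:
    # Token-level substitution: unknown words pass through unchanged.
    return [table.get(w, w) for w in words]
```

Applied to `["ma", "sanoin", "ok"]`, this yields `["mina", "sanoin", "okei"]`.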

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • PR #20: Both add per-language number-normalizer implementations and register them within language operators for expanding written numbers.
  • PR #19: Both introduce new language packages under normalization/languages/ with parallel module structure (__init__.py, operators.py, number_normalizer.py, replacements.py) and comprehensive test coverage.

Suggested reviewers

  • Karamouche
  • lrossillon-gladia

Poem

🐰 Finnish numbers dance with glee,
"Kymmenen" becomes "10," you see!
From "tuhat" down to "nolla" small,
This normalizer handles them all! 🇫🇮

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 9.68%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Title check: ✅ Passed. The title clearly and specifically summarizes the main change: initialization of Finnish language basic normalization support.
  • Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Description check: ✅ Passed. The PR description covers the main objective and includes a completed checklist for new language requirements, though some minor formatting issues exist.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/add-finnish-language

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

Review rate limit: 0/1 reviews remaining, refill in 52 minutes and 30 seconds.

Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (4)
normalization/languages/finnish/operators.py (1)

101-104: Move the FINNISH_REPLACEMENTS import to module scope.

The function-local import pattern is usually reserved for breaking circular imports, but replacements.py is a leaf module that doesn’t import from operators.py, so a top-level import is safe and matches what the other language packages (per the __init__.py exports) already do.

 from normalization.languages.base import LanguageConfig, LanguageOperators
 from normalization.languages.finnish.number_normalizer import FinnishNumberNormalizer
+from normalization.languages.finnish.replacements import FINNISH_REPLACEMENTS
 from normalization.languages.registry import register_language
@@
     def get_word_replacements(self) -> dict[str, str]:
-        from normalization.languages.finnish.replacements import FINNISH_REPLACEMENTS
-
         return FINNISH_REPLACEMENTS
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@normalization/languages/finnish/operators.py` around lines 101 - 104, Move
the local import of FINNISH_REPLACEMENTS out of get_word_replacements and place
it at module scope; in operators.py add a top-level "from
normalization.languages.finnish.replacements import FINNISH_REPLACEMENTS" and
then update get_word_replacements to simply return FINNISH_REPLACEMENTS (remove
the function-local import). This keeps parity with other language modules and
avoids unnecessary function-local imports when there is no circular dependency.
normalization/languages/finnish/number_normalizer.py (2)

272-322: Redundant and inconsistent "yksi" + multiplier special cases.

These four branches (yksi tuhat/tuhatta, yksi miljoona, yksi miljardi(a), yksi biljoona(a)) are unreachable in practice: _parse_0_999 already matches bare yksi via _parse_0_99 (line 454) and returns (i+1, 1), after which the chaining logic at lines 332–353 multiplies by the following _BIG_MULT entry. You can verify this by tracing the "yksi tuhat" and "yksi miljoona" tests — both paths reach the same result via the fallthrough.

They are also inconsistent with _BIG_MULT:

  • Line 281 matches only "miljoona" but not "miljoonaa" / "miljoonan".
  • Line 294 matches "miljardi" / "miljardia" but not "miljardin".
  • Line 312 matches "biljoona" / "biljoonaa" but not "biljoonan".

Either remove the special cases entirely (cleanest), or expand them to cover every inflection in _BIG_MULT — anything in between just confuses future readers into thinking there’s a semantic distinction when there isn’t.

-        if i + 1 < n and fw == "yksi" and _fold(words[i + 1]) in ("tuhat", "tuhatta"):
-            j = i + 2
-            tail = self._parse_number(words, j, n)
-            base = 1000
-            if tail is not None:
-                end, v2 = tail
-                return end, base + v2
-            return j, base
-
-        if i + 1 < n and fw == "yksi" and _fold(words[i + 1]) == "miljoona":
-            ...
-        if (... "miljardi", "miljardia" ...):
-            ...
-        if (... "biljoona", "biljoonaa" ...):
-            ...
+        # `yksi <multiplier>` is already handled by _parse_0_999 + the chaining
+        # path below, so no special-case branches are needed here.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@normalization/languages/finnish/number_normalizer.py` around lines 272 - 322,
The code contains redundant special-case branches handling "yksi" + multiplier
(the blocks that check fw == "yksi" for tuh(a)t, miljoona, miljardi(a),
biljoona(a)); these are unreachable because _parse_0_999 already parses bare
"yksi" and the generic chaining logic in _parse_number/_BIG_MULT handles
multiplication, and the special cases are also inconsistent with _BIG_MULT
inflections. Remove these four "yksi" special-case blocks entirely (or if you
prefer to keep them, make them mirror every inflection listed in _BIG_MULT),
leaving the generic _parse_0_999 → _parse_number chaining to handle "yksi"
multipliers; update or delete any related comments so the intent is clear.

19-24: _get unnecessarily linear over dict keys.

The lookup tables are built with already-lowercase ASCII/Unicode keys, so _get can be a direct table.get(word.casefold()) instead of scanning every key and _fold-ing it on every call. _parse_glued_kymmenta, _parse_0_99, and _continues_number all hit this function in tight loops across the input.

 def _get(table: dict[str, int], word: str) -> int | None:
-    fw = _fold(word)
-    for k, v in table.items():
-        if _fold(k) == fw:
-            return v
-    return None
+    return table.get(_fold(word))

If there's a reason keys might contain mixed case in the future, a one-time lowercase normalization at module load is still cheaper than per-call scans.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@normalization/languages/finnish/number_normalizer.py` around lines 19 - 24,
The _get function is doing an O(N) scan; replace it with a direct lookup by
using table.get(_fold(word)) (or table.get(word.casefold())) to avoid per-call
key iteration, and ensure the numeric lookup tables used by
_parse_glued_kymmenta, _parse_0_99, and _continues_number are normalized once at
module load (e.g., rebuild each table with keys passed through _fold/casefold
when they are created) so mixed-case keys won’t break the direct lookup.
tests/unit/languages/finnish_operators_test.py (1)

25-29: Minor: call get_word_replacements() once.

Repeated calls work but are wasteful and make the intent less clear. Consider binding once:

-def test_word_replacements(operators: FinnishOperators) -> None:
-    assert operators.get_word_replacements()["ma"] == "mina"
-    assert operators.get_word_replacements()["ok"] == "okei"
-    assert operators.get_word_replacements()["juu"] == "joo"
-    assert operators.get_word_replacements()["euro"] == "euros"
+def test_word_replacements(operators: FinnishOperators) -> None:
+    replacements = operators.get_word_replacements()
+    assert replacements["ma"] == "mina"
+    assert replacements["ok"] == "okei"
+    assert replacements["juu"] == "joo"
+    assert replacements["euro"] == "euros"

Also note: the "euro" == "euros" expectation here locks in the questionable euro → euros mapping flagged on replacements.py — if that entry is removed, this assertion needs to be updated.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/languages/finnish_operators_test.py` around lines 25 - 29, In
test_word_replacements, call operators.get_word_replacements() once and store
the result in a local variable (e.g., replacements) and then use
replacements[...] for each assertion to avoid repeated calls; locate the test
function test_word_replacements and the FinnishOperators.get_word_replacements()
usage to change the four assert lines accordingly, and update or remove the
"euro" == "euros" assertion if the euro→euros entry is removed from
replacements.py.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@normalization/languages/finnish/number_normalizer.py`:
- Around line 155-165: The Finnish normalizer contains an English-centric
two‑pass currency fix: remove the hardcoded English mapping in
_singular_spoken_unit and eliminate _currency_plural_fix_patterns and
_apply_currency_plural_fixes, then change _normalize_currency_symbols to
directly insert the configured Finnish trailing form from
currency_symbol_to_word (e.g., partitive forms like "euroa") instead of
converting to English singular and regex‑replacing back to plural; update any
callers of those removed helpers to use the single-pass replacement so we no
longer do the lossy round‑trip or risk rewriting unrelated occurrences.
- Around line 361-364: The branch in _parse_0_999 that checks _fold(words[i]) ==
"nolla" currently returns None when the next token satisfies _continues_number,
causing "nolla kaksi" → "nolla 2"; change the behavior to return a consumed
index and numeric 0 instead (i.e., return i+1, 0) so "nolla" is normalized to
"0" even when followed by another number word; update or add a unit test for
_parse_0_999 covering "nolla kaksi" and document the behavior with a brief
comment referencing _fold, _continues_number, and _parse_0_999.

In `@normalization/languages/finnish/operators.py`:
- Around line 35-41: The mapping currency_symbol_to_word currently uses English
plurals (e.g., "euros", "dollars") which is incorrect for Finnish numerals;
update currency_symbol_to_word to use Finnish partitive forms ("euroa",
"dollaria", "puntaa", "senttiä", "jeniä") for the symbols "€", "$", "£", "¢",
"¥". After changing currency_symbol_to_word, remove the now-redundant helpers
`_singular_spoken_unit` and `_currency_plural_fix_patterns` (and any logic that
relies on them) and update the tests in finnish_number_normalizer_test.py to
expect partitive outputs (e.g., "€50" -> "50 euroa"). Ensure all references to
those removed symbols are cleaned up to avoid unused symbol errors.
- Around line 5-16: The digit and number-word mappings (_FINNISH_DIGIT_WORDS and
number_words) only include diacritized keys and must also include ASCII-folded
equivalents so later stages that run after remove_diacritics can match; update
_FINNISH_DIGIT_WORDS and number_words to duplicate entries for each diacritized
key with its ASCII-folded form (e.g., add "nelja" alongside "neljä", "seitseman"
alongside "seitsemän", "yhdeksan" alongside "yhdeksän", etc.) following the
pattern used in FinnishNumberNormalizer (duplicate mapping keys to the same
digit strings).

In `@tests/unit/languages/finnish_number_normalizer_test.py`:
- Around line 17-46: The test expectations for symbol-mapped currencies in
test_currency_and_spoken_units are asserting English plurals; update the
expected outputs to use Finnish partitive forms (e.g., change "50 euros" to "50
euroa" and similarly for other currency tests) to match the corrected currency
mapping in operators.py, and add two new parametrized cases in the same test:
one for a bare symbol with no number (e.g., "€" -> expected behavior such as
unchanged "€" or a decided normalization) and one for a decimal amount using
Finnish decimal comma (e.g., "€9,99" -> "9,99 euroa") to verify
decimal_separator="," handling by FinnishNumberNormalizer.

---

Nitpick comments:
In `@normalization/languages/finnish/number_normalizer.py`:
- Around line 272-322: The code contains redundant special-case branches
handling "yksi" + multiplier (the blocks that check fw == "yksi" for tuh(a)t,
miljoona, miljardi(a), biljoona(a)); these are unreachable because _parse_0_999
already parses bare "yksi" and the generic chaining logic in
_parse_number/_BIG_MULT handles multiplication, and the special cases are also
inconsistent with _BIG_MULT inflections. Remove these four "yksi" special-case
blocks entirely (or if you prefer to keep them, make them mirror every
inflection listed in _BIG_MULT), leaving the generic _parse_0_999 →
_parse_number chaining to handle "yksi" multipliers; update or delete any
related comments so the intent is clear.
- Around line 19-24: The _get function is doing an O(N) scan; replace it with a
direct lookup by using table.get(_fold(word)) (or table.get(word.casefold())) to
avoid per-call key iteration, and ensure the numeric lookup tables used by
_parse_glued_kymmenta, _parse_0_99, and _continues_number are normalized once at
module load (e.g., rebuild each table with keys passed through _fold/casefold
when they are created) so mixed-case keys won’t break the direct lookup.

In `@normalization/languages/finnish/operators.py`:
- Around line 101-104: Move the local import of FINNISH_REPLACEMENTS out of
get_word_replacements and place it at module scope; in operators.py add a
top-level "from normalization.languages.finnish.replacements import
FINNISH_REPLACEMENTS" and then update get_word_replacements to simply return
FINNISH_REPLACEMENTS (remove the function-local import). This keeps parity with
other language modules and avoids unnecessary function-local imports when there
is no circular dependency.

In `@tests/unit/languages/finnish_operators_test.py`:
- Around line 25-29: In test_word_replacements, call
operators.get_word_replacements() once and store the result in a local variable
(e.g., replacements) and then use replacements[...] for each assertion to avoid
repeated calls; locate the test function test_word_replacements and the
FinnishOperators.get_word_replacements() usage to change the four assert lines
accordingly, and update or remove the "euro" == "euros" assertion if the
euro→euros entry is removed from replacements.py.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c4e9f736-86bf-42d5-9f99-46b5b5a3177d

📥 Commits

Reviewing files that changed from the base of the PR and between 88b54a4 and f259b26.

⛔ Files ignored due to path filters (1)
  • tests/e2e/files/gladia-3/fi.csv is excluded by !**/*.csv
📒 Files selected for processing (7)
  • normalization/languages/__init__.py
  • normalization/languages/finnish/__init__.py
  • normalization/languages/finnish/number_normalizer.py
  • normalization/languages/finnish/operators.py
  • normalization/languages/finnish/replacements.py
  • tests/unit/languages/finnish_number_normalizer_test.py
  • tests/unit/languages/finnish_operators_test.py

Comment on lines +155 to +165
def _singular_spoken_unit(trailing_word: str) -> str:
    t = trailing_word.lower()
    if t == "euros":
        return "euro"
    if t == "dollars":
        return "dollar"
    if t == "pounds":
        return "pound"
    if t == "yens":
        return "yen"
    return trailing_word

⚠️ Potential issue | 🟠 Major

Hardcoded English singular/plural table in a Finnish module.

_singular_spoken_unit maps euros→euro, dollars→dollar, pounds→pound, yens→yen. These are English forms; there is nothing Finnish about them, and the function will silently return the input unchanged for any value the config actually should hold in Finnish (euroa, dollaria, puntaa, jeniä, senttiä). This strongly suggests the design copied the Dutch/Swedish normalizer verbatim and inherited their trailing-word scheme.

In Finnish, numerals take the partitive singular regardless of amount (1 euro / 5 euroa — actually yksi euro is also acceptable for 1, but 5 euros is never correct). That means:

  • If currency_symbol_to_word is set to the Finnish partitive (euroa etc.), _singular_spoken_unit and _currency_plural_fix_patterns / _apply_currency_plural_fixes become unnecessary — you can just substitute the trailing word directly in _normalize_currency_symbols.
  • The current two-pass approach (convert to singular, then regex back to plural) is an unnecessarily lossy round-trip that also risks rewriting unrelated occurrences of euro/dollar/etc. elsewhere in the text.

Recommended to drop _singular_spoken_unit, _currency_plural_fix_patterns, and _apply_currency_plural_fixes, and simplify _normalize_currency_symbols to emit the configured trailing word directly. (See the companion comment on operators.py re: fixing the config itself.)
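
A hypothetical sketch of the single-pass replacement this comment recommends: the symbol is substituted directly with the configured Finnish partitive form, with no English round-trip. The names here are illustrative, not the module's actual API:

```python
import re

# Illustrative config using Finnish partitive forms, per the review.
CURRENCY_SYMBOL_TO_WORD = {"€": "euroa", "$": "dollaria", "£": "puntaa"}

def normalize_currency_symbols(text: str) -> str:
    # "€50" or "50 €" -> "50 euroa"; one pass, no plural fix-up needed.
    for symbol, word in CURRENCY_SYMBOL_TO_WORD.items():
        sym = re.escape(symbol)
        # Symbol before the amount (accepts a decimal comma).
        text = re.sub(rf"{sym}\s*(\d+(?:,\d+)?)", rf"\1 {word}", text)
        # Symbol after the amount.
        text = re.sub(rf"(\d+(?:,\d+)?)\s*{sym}", rf"\1 {word}", text)
    return text
```

Because the partitive form is inserted directly, unrelated occurrences of "euro"/"dollar" elsewhere in the text are never touched.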

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@normalization/languages/finnish/number_normalizer.py` around lines 155 - 165,
The Finnish normalizer contains an English-centric two‑pass currency fix: remove
the hardcoded English mapping in _singular_spoken_unit and eliminate
_currency_plural_fix_patterns and _apply_currency_plural_fixes, then change
_normalize_currency_symbols to directly insert the configured Finnish trailing
form from currency_symbol_to_word (e.g., partitive forms like "euroa") instead
of converting to English singular and regex‑replacing back to plural; update any
callers of those removed helpers to use the single-pass replacement so we no
longer do the lossy round‑trip or risk rewriting unrelated occurrences.

Comment on lines +361 to +364
        if _fold(words[i]) == "nolla":
            if i + 1 < n and self._continues_number(words[i + 1]):
                return None
            return i + 1, 0

⚠️ Potential issue | 🟡 Minor

nolla followed by a number word is silently left un-normalized.

When nolla is followed by another number word, _parse_0_999 returns None rather than producing 0. The caller then falls through to out.append(words[i]), leaving the literal "nolla" in place while the next word still gets converted. Result: "nolla kaksi""nolla 2", which is neither the original spelled-out form nor a consistent digit form.

If the goal is to avoid consuming leading zeros in a compound (e.g. phone-number-like sequences), consider emitting "0" explicitly so at least the output is internally consistent:

         if _fold(words[i]) == "nolla":
-            if i + 1 < n and self._continues_number(words[i + 1]):
-                return None
             return i + 1, 0

or, if the "don't consume" behavior is intentional for digit-sequence preservation, document it with a comment and add a test case covering the intended downstream step that turns each nolla into 0.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-        if _fold(words[i]) == "nolla":
-            if i + 1 < n and self._continues_number(words[i + 1]):
-                return None
-            return i + 1, 0
+        if _fold(words[i]) == "nolla":
+            return i + 1, 0
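
For concreteness, a toy sketch of the output once the suggestion is applied (hypothetical simplification; the real path goes through _parse_0_999 and the surrounding word loop): every "nolla" is consumed and emitted as a digit, so leading zeros stay consistent with the rest of the sequence.

```python
_DIGITS = {"nolla": "0", "yksi": "1", "kaksi": "2"}

def normalize_with_fix(words: list[str]) -> str:
    # With the suggestion applied, "nolla" is always consumed as "0",
    # so "nolla kaksi" -> "0 2" rather than the mixed "nolla 2".
    return " ".join(_DIGITS.get(w, w) for w in words)
```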
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@normalization/languages/finnish/number_normalizer.py` around lines 361 - 364,
The branch in _parse_0_999 that checks _fold(words[i]) == "nolla" currently
returns None when the next token satisfies _continues_number, causing "nolla
kaksi" → "nolla 2"; change the behavior to return a consumed index and numeric 0
instead (i.e., return i+1, 0) so "nolla" is normalized to "0" even when followed
by another number word; update or add a unit test for _parse_0_999 covering
"nolla kaksi" and document the behavior with a brief comment referencing _fold,
_continues_number, and _parse_0_999.

Comment on lines +5 to +16
_FINNISH_DIGIT_WORDS: dict[str, str] = {
    "nolla": "0",
    "yksi": "1",
    "kaksi": "2",
    "kolme": "3",
    "neljä": "4",
    "viisi": "5",
    "kuusi": "6",
    "seitsemän": "7",
    "kahdeksan": "8",
    "yhdeksän": "9",
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Locate the pipeline steps that consume digit_words / number_words and check whether
# they operate on diacritic-folded text or raw text.
rg -nP --type=py -C3 '\b(digit_words|number_words)\b'
rg -nP --type=py -C2 '\b(casefold_text|remove_diacritics)\b'

Repository: gladiaio/normalization

Length of output: 21861


🏁 Script executed:

# Find pipeline definition and execution order
find . -type f -name "*.py" | xargs grep -l "pipeline\|Pipeline\|STEPS\|step.*order" | head -20

Repository: gladiaio/normalization

Length of output: 452


🏁 Script executed:

# Check if there's a pipeline configuration or orchestrator
fd -t f "pipeline|config" -e py | head -20

Repository: gladiaio/normalization

Length of output: 148


🏁 Script executed:

# Look for how steps are registered and executed
rg -l "TextStep|register.*step" --type=py | head -10

Repository: gladiaio/normalization

Length of output: 532


🏁 Script executed:

# Check the actual pipeline execution order by finding the main pipeline runner
rg -nP "convert_digit_word_sequences_to_digits|remove_diacritics|casefold_text" --type=py -B2 -A2 | grep -E "\.py.*:.*\(convert_digit|\.py.*:.*\(remove_|\.py.*:.*\(casefold"

Repository: gladiaio/normalization

Length of output: 48


🏁 Script executed:

# Examine the pipeline base definition
cat normalization/pipeline/base.py

Repository: gladiaio/normalization

Length of output: 4430


🏁 Script executed:

# Check the step registry to see how steps are ordered
cat normalization/steps/registry.py

Repository: gladiaio/normalization

Length of output: 856


🏁 Script executed:

# Look at the pipeline loader
head -100 normalization/pipeline/loader.py

Repository: gladiaio/normalization

Length of output: 2532


🏁 Script executed:

# Search for explicit step ordering or pipeline configuration
rg -A30 "DEFAULT.*STEPS|PIPELINE_STEPS|steps.*=.*\[" --type=py normalization/pipeline/ normalization/

Repository: gladiaio/normalization

Length of output: 48


🏁 Script executed:

# Find preset files
find . -name "*.yaml" -path "*/presets/*" -type f

Repository: gladiaio/normalization

Length of output: 103


🏁 Script executed:

# Look at the default preset
cat normalization/presets/gladia-3.yaml

Repository: gladiaio/normalization

Length of output: 5298


🏁 Script executed:

# Check if there are language-specific presets
ls -la normalization/presets/

Repository: gladiaio/normalization

Length of output: 239


🏁 Script executed:

# Verify when replacements are applied in the pipeline
rg -n "apply_sentence_level_replacements|replacements" normalization/presets/gladia-3.yaml

Repository: gladiaio/normalization

Length of output: 141


🏁 Script executed:

# Check the apply_sentence_level_replacements step to understand the context
cat normalization/steps/text/apply_sentence_level_replacements.py

Repository: gladiaio/normalization

Length of output: 1189


🏁 Script executed:

# Examine the Finnish replacements module to understand the context
cat normalization/languages/finnish/replacements.py

Repository: gladiaio/normalization

Length of output: 1199


🏁 Script executed:

# Check if German has diacritics in digit_words
rg -A15 "_GERMAN_DIGIT_WORDS" normalization/languages/german/operators.py

Repository: gladiaio/normalization

Length of output: 716


🏁 Script executed:

# Check French digit_words in detail
rg -A15 "_FRENCH_DIGIT_WORDS" normalization/languages/french/operators.py

Repository: gladiaio/normalization

Length of output: 675


🏁 Script executed:

# Verify the exact step order in the pipeline: when do digit_words steps run?
cat normalization/presets/gladia-3.yaml | grep -A3 -B3 "casefold_text\|remove_diacritics\|convert_digit_word"

Repository: gladiaio/normalization

Length of output: 666


🏁 Script executed:

# Verify all steps that consume digit_words/number_words and their positions in the pipeline
rg -l "digit_words|number_words" normalization/steps/text/*.py | while read f; do
  basename "$f"
done

Repository: gladiaio/normalization

Length of output: 238


🏁 Script executed:

# Check the exact position of each digit_words/number_words consuming step in the pipeline
cat normalization/presets/gladia-3.yaml | grep -E "protect_plus_word_before_digit_words|convert_digit_word_sequences_to_digits|normalize_punctuation_between_number_words|fix_dot_adjacent_number_words|casefold_text|remove_diacritics"

Repository: gladiaio/normalization

Length of output: 360


🏁 Script executed:

# Verify the order is consistent: all digit_words steps before casefold
echo "=== Steps consuming digit_words/number_words and their position ===" && \
cat normalization/presets/gladia-3.yaml | grep -n "protect_plus_word_before_digit_words\|convert_digit_word_sequences_to_digits\|normalize_punctuation_between_number_words\|fix_dot_adjacent_number_words\|casefold_text\|remove_diacritics"

Repository: gladiaio/normalization

Length of output: 449


🏁 Script executed:

# Check if FinnishNumberNormalizer includes both diacritics and non-diacritics forms
cat normalization/languages/finnish/number_normalizer.py | grep -A30 "_ONES_2_9\|_TEENS"

Repository: gladiaio/normalization

Length of output: 4872


🏁 Script executed:

# Verify the exact execution context of fix_dot_adjacent_number_words
cat normalization/steps/text/post_steps/fix_dot_adjacent_number_words.py 2>/dev/null || cat normalization/steps/text/fix_dot_adjacent_number_words.py

Repository: gladiaio/normalization

Length of output: 1338


🏁 Script executed:

# Check what text state fix_dot_adjacent_number_words receives
rg -B10 "class.*FixDotAdjacentNumberWords" normalization/steps/text/fix_dot_adjacent_number_words.py

Repository: gladiaio/normalization

Length of output: 300


Include both diacritic and ASCII-folded forms in _FINNISH_DIGIT_WORDS.

The pipeline consumes digit_words in two stages with opposing diacritic states:

  • protect_plus_word_before_digit_words, normalize_punctuation_between_number_words, and convert_digit_word_sequences_to_digits run before casefold_text and remove_diacritics, so they match diacritized forms in the original text.
  • fix_dot_adjacent_number_words runs after diacritics are removed in text_post, so it matches ASCII-folded forms.

The current _FINNISH_DIGIT_WORDS contains only diacritized keys (neljä, seitsemän, yhdeksän), which will fail in the second stage. Follow the pattern already used in FinnishNumberNormalizer (lines 32–54): include both forms for each word:

_FINNISH_DIGIT_WORDS: dict[str, str] = {
    "nolla": "0",
    "yksi": "1",
    "kaksi": "2",
    "kolme": "3",
    "neljä": "4",
    "nelja": "4",
    "viisi": "5",
    "kuusi": "6",
    "seitsemän": "7",
    "seitseman": "7",
    "kahdeksan": "8",
    "yhdeksän": "9",
    "yhdeksan": "9",
}

The same fix applies to number_words (lines 57–84).
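Rather than hand-maintaining both spellings, the ASCII-folded duplicates can be derived mechanically. A minimal sketch — the helper name `with_ascii_folded` is illustrative, not something from the repo:

```python
import unicodedata


def with_ascii_folded(words: dict[str, str]) -> dict[str, str]:
    """Return a copy of `words` that also maps each key's ASCII-folded form."""
    folded = dict(words)
    for word, value in words.items():
        # NFKD splits "ä" into "a" + combining diaeresis; encoding to ASCII
        # with errors="ignore" then drops the combining mark.
        ascii_form = (
            unicodedata.normalize("NFKD", word)
            .encode("ascii", "ignore")
            .decode("ascii")
        )
        folded.setdefault(ascii_form, value)
    return folded


digit_words = with_ascii_folded({"neljä": "4", "seitsemän": "7", "yhdeksän": "9"})
# Both "neljä" and "nelja" now map to "4".
```

`setdefault` keeps any explicitly written folded entry intact, so a hand-maintained override still wins over the derived form.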

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@normalization/languages/finnish/operators.py` around lines 5 - 16, The digit
and number-word mappings (_FINNISH_DIGIT_WORDS and number_words) only include
diacritized keys and must also include ASCII-folded equivalents so later stages
that run after remove_diacritics can match; update _FINNISH_DIGIT_WORDS and
number_words to duplicate entries for each diacritized key with its ASCII-folded
form (e.g., add "nelja" alongside "neljä", "seitseman" alongside "seitsemän",
"yhdeksan" alongside "yhdeksän", etc.) following the pattern used in
FinnishNumberNormalizer (duplicate mapping keys to the same digit strings).

Comment on lines +35 to +41
currency_symbol_to_word={
"€": "euros",
"$": "dollars",
"£": "pounds",
"¢": "cent",
"¥": "yens",
},

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

Finnish currency after numeral partitive form euroa dollaria

💡 Result:

In Finnish grammar, after cardinal numerals (other than yksi 'one'), currency names like euro and dollari take the partitive singular form. Thus, '2 euros' is kaksi euroa and '2 dollars' is kaksi dollaria. This follows the general rule where nouns after numerals appear in the partitive case (e.g., kaksi kuppia 'two cups'). The partitive forms are euroa (for euro) and dollaria (for dollari).

Citations:


🏁 Script executed:

# First, find and read the file in question
find . -name "operators.py" -path "*/finnish/*" -type f

Repository: gladiaio/normalization

Length of output: 112


🏁 Script executed:

# Read the operators.py file to verify the code snippet
cat -n normalization/languages/finnish/operators.py | head -50

Repository: gladiaio/normalization

Length of output: 1595


🏁 Script executed:

# Search for FinnishNumberNormalizer to find its location
find . -name "*.py" -type f | xargs grep -l "FinnishNumberNormalizer" | head -5

Repository: gladiaio/normalization

Length of output: 224


🏁 Script executed:

# Search for _singular_spoken_unit in the codebase
rg "_singular_spoken_unit" -A 10

Repository: gladiaio/normalization

Length of output: 5937


🏁 Script executed:

# Search for test files related to Finnish number normalizer
find . -name "*finnish*test*.py" -o -name "*test*finnish*.py" | head -10

Repository: gladiaio/normalization

Length of output: 171


🏁 Script executed:

# Read the test file to see what's expected
cat -n tests/unit/languages/finnish_number_normalizer_test.py

Repository: gladiaio/normalization

Length of output: 2077


🏁 Script executed:

# Read the full Finnish number normalizer implementation
cat -n normalization/languages/finnish/number_normalizer.py

Repository: gladiaio/normalization

Length of output: 15942


Fix Finnish currency forms: use partitive case instead of English plurals.

The currency_symbol_to_word mapping uses English plural forms (euros, dollars, pounds, yens) when Finnish requires the partitive case after numerals: euroa, dollaria, puntaa, senttiä, jeniä. For example, 5 € in Finnish speech is "viisi euroa", never "viisi euros". This causes:

  1. Non-Finnish output text that will degrade WER against correct Finnish transcripts
  2. Awkward redundancy: _singular_spoken_unit contains hardcoded English mappings (euros→euro, dollars→dollar, etc.) that must then be "fixed back" to plural via _currency_plural_fix_patterns — a clear sign of non-Finnish adaptation

Additionally, "¢" → "cent" is singular while all others are plural, breaking internal consistency.

Replace with Finnish partitive forms:

Suggested diff
     currency_symbol_to_word={
-        "€": "euros",
-        "$": "dollars",
-        "£": "pounds",
-        "¢": "cent",
-        "¥": "yens",
+        "€": "euroa",
+        "$": "dollaria",
+        "£": "puntaa",
+        "¢": "senttiä",
+        "¥": "jeniä",
     },

With this fix, both _singular_spoken_unit and _currency_plural_fix_patterns can be removed entirely (partitive form is already correct after numerals). Update test expectations in finnish_number_normalizer_test.py accordingly (e.g., "€50" → "50 euroa").
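To illustrate why the fix-up pass becomes unnecessary: with partitive values in the symbol map, the symbol rewrite alone already yields grammatical Finnish. A hedged sketch — the names below are illustrative, not the repo's API:

```python
# Finnish partitive forms, as required after a numeral (assumed mapping).
CURRENCY_SYMBOL_TO_WORD = {
    "€": "euroa",
    "$": "dollaria",
    "£": "puntaa",
    "¢": "senttiä",
    "¥": "jeniä",
}


def spell_amount(amount: str, symbol: str) -> str:
    # Finnish puts the unit after the amount, already in the partitive:
    # "50 euroa" — no singular/plural fix-up step is needed afterwards.
    return f"{amount} {CURRENCY_SYMBOL_TO_WORD[symbol]}"
```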

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@normalization/languages/finnish/operators.py` around lines 35 - 41, The
mapping currency_symbol_to_word currently uses English plurals (e.g., "euros",
"dollars") which is incorrect for Finnish numerals; update
currency_symbol_to_word to use Finnish partitive forms ("euroa", "dollaria",
"puntaa", "senttiä", "jeniä") for the symbols "€", "$", "£", "¢", "¥". After
changing currency_symbol_to_word, remove the now-redundant helpers
`_singular_spoken_unit` and `_currency_plural_fix_patterns` (and any logic that
relies on them) and update the tests in finnish_number_normalizer_test.py to
expect partitive outputs (e.g., "€50" -> "50 euroa"). Ensure all references to
those removed symbols are cleaned up to avoid unused symbol errors.

Comment on lines +7 to +49
FINNISH_REPLACEMENTS: dict[str, str] = {
"ma": "mina",
"maa": "mina",
"mulle": "minulle",
"mulla": "minulla",
"mua": "minua",
"mun": "minun",
"sa": "sina",
"sulle": "sinulle",
"sulla": "sinulla",
"sua": "sinua",
"sun": "sinun",
"toi": "tuo",
"ton": "tuon",
"tossa": "tuossa",
"tosta": "tuosta",
"tohon": "tuohon",
"taa": "tama",
"naa": "nama",
"olis": "olisi",
"ois": "olisi",
"oo": "ole",
"ollu": "ollut",
"onks": "onko",
"oliks": "oliko",
"oisko": "olisiko",
"vois": "voisi",
"katotaan": "katsotaan",
"kattoa": "katsoa",
"mut": "mutta",
"sit": "sitten",
"sitte": "sitten",
"et": "etta",
"sillon": "silloin",
"viimeks": "viimeksi",
"elikka": "eli",
"juu": "joo",
"jes": "joo",
"ok": "okei",
"bank": "pankki",
"bankin": "pankin",
"euro": "euros",
}

⚠️ Potential issue | 🟠 Major

Several replacement entries look incorrect or unsafe for Finnish.

A few of these will actively corrupt otherwise-correct Finnish text rather than normalize colloquial forms:

  • "maa": "mina" — maa is a common Finnish noun meaning "land/country". Mapping it to mina (minä, "I") will mangle any sentence mentioning a country or land. The intended colloquial form for minä is mä → ASCII ma, which is already covered on line 8. Line 9 should be dropped.
  • "euro": "euros" — euro is the standard Finnish singular for the currency. Replacing it with euros (which is an English plural, not Finnish) will both break correct Finnish and conflict with the currency-restore logic in number_normalizer.py. In Finnish the form used after a numeral is the partitive euroa (already appearing in the tests as the expected output for "kymmenen euroa").
  • "bank" / "bankin" → "pankki" / "pankin" — these are English, not Finnish colloquial variants. If the intent is to normalize ASR mis-transcriptions of loanwords, it should be documented; otherwise they don't belong in a Finnish colloquial→standard table.
  • "jes": "joo" — jes is an interjection ("yes!"), not a variant of joo. Collapsing it to joo loses semantic distinction; consider dropping.

Please have a native speaker sanity-check the rest of the table as well (e.g., "taa" only matches tää once diacritics are stripped, which seems to be the assumption per the module docstring, but it would also collide with the Finnish word taa = "behind" if the pipeline ever feeds un-folded input).

🩹 Suggested removals
     "ma": "mina",
-    "maa": "mina",
     "mulle": "minulle",
@@
     "juu": "joo",
-    "jes": "joo",
     "ok": "okei",
-    "bank": "pankki",
-    "bankin": "pankin",
-    "euro": "euros",
+    "bank": "pankki",
+    "bankin": "pankin",

(Keep or drop the bank* entries depending on whether they are intentional ASR fixes — if kept, consider a comment explaining the rationale.)


Comment on lines +17 to +46
@pytest.mark.parametrize(
("text", "expected"),
[
("kaksi kymmenta viisi", "25"),
("kaksi kymmentä viisi", "25"),
("sata", "100"),
("tuhat", "1000"),
("yksi tuhat", "1000"),
("kolme miljoonaa", "3000000"),
("yksi miljoona", "1000000"),
],
)
def test_finnish_spelled_numbers(
normalizer: FinnishNumberNormalizer, text: str, expected: str
) -> None:
assert normalizer(text) == expected


@pytest.mark.parametrize(
("text", "expected"),
[
("kymmenen euroa", "10 euroa"),
("€50", "50 euros"),
("50 €", "50 euros"),
],
)
def test_currency_and_spoken_units(
normalizer: FinnishNumberNormalizer, text: str, expected: str
) -> None:
assert normalizer(text) == expected

⚠️ Potential issue | 🟡 Minor

Currency expectations encode the English-plural bug.

"€50" → "50 euros" and "50 €" → "50 euros" hardcode the output of the English-plural currency mapping (see the comment on operators.py lines 35–41). By contrast, "kymmenen euroa" → "10 euroa" already uses the correct Finnish partitive (because euroa is preserved verbatim as a trailing word, not reconstructed from the symbol map).

If the currency map is fixed to use euroa / dollaria / puntaa / senttiä / jeniä, these parametrized expectations should be updated accordingly:

-        ("€50", "50 euros"),
-        ("50 €", "50 euros"),
+        ("€50", "50 euroa"),
+        ("50 €", "50 euroa"),

Also consider adding a negative/edge case: a bare currency symbol with no adjacent number (e.g. "€" alone) and a decimal amount (e.g. "€9,99" — note Finnish decimal comma), to lock down behavior around decimal_separator=",".
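A hedged sketch of the behavior those edge cases would pin down — the regexes and names here are illustrative, not the repo's implementation:

```python
import re

# Assumed partitive mapping; only two symbols shown for brevity.
CURRENCY_WORD = {"€": "euroa", "$": "dollaria"}


def normalize_currency(text: str) -> str:
    """Rewrite symbol-adjacent amounts; leave a bare symbol untouched."""
    for symbol, word in CURRENCY_WORD.items():
        sym = re.escape(symbol)
        amount = r"\d+(?:,\d+)?"  # allows a Finnish decimal comma
        # Symbol before the amount ("€9,99") or after it ("50 €").
        text = re.sub(rf"{sym}\s?({amount})", rf"\1 {word}", text)
        text = re.sub(rf"({amount})\s?{sym}", rf"\1 {word}", text)
    return text
```

A bare "€" with no adjacent digits matches neither pattern, so it passes through unchanged — exactly the negative case worth locking down in the parametrized tests.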

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/languages/finnish_number_normalizer_test.py` around lines 17 - 46,
The test expectations for symbol-mapped currencies in
test_currency_and_spoken_units are asserting English plurals; update the
expected outputs to use Finnish partitive forms (e.g., change "50 euros" to "50
euroa" and similarly for other currency tests) to match the corrected currency
mapping in operators.py, and add two new parametrized cases in the same test:
one for a bare symbol with no number (e.g., "€" -> expected behavior such as
unchanged "€" or a decided normalization) and one for a decimal amount using
Finnish decimal comma (e.g., "€9,99" -> "9,99 euroa") to verify
decimal_separator="," handling by FinnishNumberNormalizer.

@Karamouche Karamouche merged commit 0e06d06 into main May 5, 2026
10 checks passed
@Karamouche Karamouche deleted the feat/add-finnish-language branch May 5, 2026 16:04
