
feat: init finnish language basic normalization#22

Merged
Karamouche merged 2 commits into main from feat/add-finnish-language
May 5, 2026

Conversation

@egenthon-cmd
Contributor

@egenthon-cmd egenthon-cmd commented Apr 23, 2026

What does this PR do?

Adds Finnish normalization: operators, replacements, number normalizer, registry wiring, and unit plus gladia-3 e2e tests.

Type of change

  • New language
  • Edit existing language (fix a replacement, tweak config, …)
  • New normalization step
  • Edit existing step (bug fix, behaviour change)
  • New preset version
  • Bug fix (other)
  • Refactor / docs / CI

Checklist

Only fill in the section(s) that match your change — delete the rest.


New language

  • Created normalization/languages/{lang}/ with operators.py, replacements.py, __init__.py
  • Word substitutions are in replacements.py (not hardcoded in operators.py)
  • LanguageConfig is filled in with the language's data (separators, currency words, digit words, …)
  • Subclassed LanguageOperators — only override methods where the logic changes, not just the data
  • Class is decorated with @register_language and imported in normalization/languages/__init__.py
  • Unit tests added in tests/unit/languages/
  • E2e CSV added in tests/e2e/files/{preset}/{lang}.csv (e.g. tests/e2e/files/gladia-3/fr.csv)

Edit existing language

  • New/changed word substitutions go in replacements.py, not inline in operators.py
  • If you changed a config field that can be None: the step reading it still handles None gracefully
  • Unit tests updated or added
  • E2e CSV updated if the expected output changed

New step

  • Unique name class attribute set (this is the key used in YAML presets)
  • Decorated with @register_step and imported in steps/text/__init__.py or steps/word/__init__.py
  • No hardcoded language values — read data from operators.config.* instead
  • If placeholder-based: protect + restore are both in steps/text/placeholders.py and pipeline/base.py's validate() is updated
  • Unit tests added in tests/unit/steps/
  • Step name added to the relevant preset YAML — or a new preset file created if existing presets are affected
  • If the docstring changed: ran uv run scripts/generate_step_docs.py

Edit existing step

  • Step name is unchanged — if the output changes, create a new step name + new preset instead
  • No language-specific logic or string literals added inside the step
  • Unit tests updated or added
  • If the docstring changed: ran uv run scripts/generate_step_docs.py

Preset change

  • Existing preset files are not modified — new behaviour goes in a new preset file
  • pipeline.validate() passes (runs automatically via loader.py)

How was this tested?

uv run pytest tests/

Summary by CodeRabbit

  • New Features

    • Finnish language support now available with automatic conversion of written number words to digits.
    • Finnish word replacement rules for normalizing colloquial to standard forms.
    • Currency symbol handling for Finnish language text processing.
  • Tests

    • Added comprehensive unit tests validating Finnish number normalization and language operations, including currency handling scenarios.

@coderabbitai

coderabbitai Bot commented Apr 23, 2026

Warning

Rate limit exceeded

@egenthon-cmd has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 52 minutes and 30 seconds before requesting another review.

To keep reviews running without waiting, you can enable the usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: eb68aff9-438c-46e1-82ec-25107b02a7cd

📥 Commits

Reviewing files that changed from the base of the PR and between f259b26 and 761ca87.

📒 Files selected for processing (1)
  • normalization/languages/finnish/operators.py
📝 Walkthrough

This pull request adds comprehensive Finnish language support to the normalization module. It introduces a Finnish language package with number normalization logic, word replacement mappings, language operators, and complete unit test coverage, registering Finnish as a supported language in the module's language registry.

Changes

Changes by cohort and file:

  • Module Registration (normalization/languages/__init__.py): Adds Finnish to the exported language modules list via __all__ and imports the finnish submodule.
  • Finnish Package Initialization (normalization/languages/finnish/__init__.py): Exposes FinnishOperators and FINNISH_REPLACEMENTS as public exports from the Finnish language package.
  • Finnish Number Normalization (normalization/languages/finnish/number_normalizer.py): Implements the FinnishNumberNormalizer class that converts Finnish spelled-out numbers (0–999, thousands, millions, etc.) to digit form, with support for various grammatical forms, large multipliers, and optional currency symbol normalization with plural restoration.
  • Finnish Language Operators (normalization/languages/finnish/operators.py): Defines the FinnishOperators class that integrates FinnishNumberNormalizer for expanding written numbers and provides word replacement mappings via get_word_replacements().
  • Finnish Word Replacements (normalization/languages/finnish/replacements.py): Defines the FINNISH_REPLACEMENTS dictionary mapping colloquial/spoken Finnish tokens to canonical standard forms.
  • Finnish Operator Tests (tests/unit/languages/finnish_operators_test.py): Unit tests verifying Finnish language registration, operator instantiation, configuration, and word replacement mappings.
  • Finnish Number Normalizer Tests (tests/unit/languages/finnish_number_normalizer_test.py): Unit tests validating conversion of Finnish number phrases to digits, currency handling, and edge cases with/without currency configuration.
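
The replacement table described above amounts to a token-level substitution map. A minimal hypothetical sketch (the entries below are the ones asserted in the PR's unit tests; the real FINNISH_REPLACEMENTS in replacements.py is larger, and apply_replacements is illustrative, not the module's actual API):

```python
# Hypothetical sketch of a colloquial -> standard replacement table.
FINNISH_REPLACEMENTS: dict[str, str] = {
    "ma": "mina",   # colloquial first-person pronoun -> standard form
    "ok": "okei",
    "juu": "joo",
}

def apply_replacements(words: list[str], table: dict[str, str]) -> list[str]:
    # Token-level substitution: unknown words pass through unchanged.
    return [table.get(w, w) for w in words]
```

Applied to `["ma", "sanoin", "ok"]`, this yields `["mina", "sanoin", "okei"]`.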

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • PR #20: Both add per-language number-normalizer implementations and register them within language operators for expanding written numbers.
  • PR #19: Both introduce new language packages under normalization/languages/ with parallel module structure (__init__.py, operators.py, number_normalizer.py, replacements.py) and comprehensive test coverage.

Suggested reviewers

  • Karamouche
  • lrossillon-gladia

Poem

🐰 Finnish numbers dance with glee,
"Kymmenen" becomes "10," you see!
From "tuhat" down to "nolla" small,
This normalizer handles them all! 🇫🇮

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 9.68%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Title check: ✅ Passed. The title clearly and specifically summarizes the main change: initialization of Finnish language basic normalization support.
  • Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Description check: ✅ Passed. The PR description covers the main objective and includes a completed checklist for new language requirements, though some minor formatting issues exist.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/add-finnish-language

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

Review rate limit: 0/1 reviews remaining, refill in 52 minutes and 30 seconds.

Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (4)
normalization/languages/finnish/operators.py (1)

101-104: Move the FINNISH_REPLACEMENTS import to module scope.

The function-local import pattern is usually reserved for breaking circular imports, but replacements.py is a leaf module that doesn’t import from operators.py, so a top-level import is safe and matches what the other language packages (per the __init__.py exports) already do.

 from normalization.languages.base import LanguageConfig, LanguageOperators
 from normalization.languages.finnish.number_normalizer import FinnishNumberNormalizer
+from normalization.languages.finnish.replacements import FINNISH_REPLACEMENTS
 from normalization.languages.registry import register_language
@@
     def get_word_replacements(self) -> dict[str, str]:
-        from normalization.languages.finnish.replacements import FINNISH_REPLACEMENTS
-
         return FINNISH_REPLACEMENTS
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@normalization/languages/finnish/operators.py` around lines 101 - 104, Move
the local import of FINNISH_REPLACEMENTS out of get_word_replacements and place
it at module scope; in operators.py add a top-level "from
normalization.languages.finnish.replacements import FINNISH_REPLACEMENTS" and
then update get_word_replacements to simply return FINNISH_REPLACEMENTS (remove
the function-local import). This keeps parity with other language modules and
avoids unnecessary function-local imports when there is no circular dependency.
normalization/languages/finnish/number_normalizer.py (2)

272-322: Redundant and inconsistent "yksi" + multiplier special cases.

These four branches (yksi tuhat/tuhatta, yksi miljoona, yksi miljardi(a), yksi biljoona(a)) are unreachable in practice: _parse_0_999 already matches bare yksi via _parse_0_99 (line 454) and returns (i+1, 1), after which the chaining logic at lines 332–353 multiplies by the following _BIG_MULT entry. You can verify this by tracing the "yksi tuhat" and "yksi miljoona" tests — both paths reach the same result via the fallthrough.

They are also inconsistent with _BIG_MULT:

  • Line 281 matches only "miljoona" but not "miljoonaa" / "miljoonan".
  • Line 294 matches "miljardi" / "miljardia" but not "miljardin".
  • Line 312 matches "biljoona" / "biljoonaa" but not "biljoonan".

Either remove the special cases entirely (cleanest), or expand them to cover every inflection in _BIG_MULT — anything in between just confuses future readers into thinking there’s a semantic distinction when there isn’t.

-        if i + 1 < n and fw == "yksi" and _fold(words[i + 1]) in ("tuhat", "tuhatta"):
-            j = i + 2
-            tail = self._parse_number(words, j, n)
-            base = 1000
-            if tail is not None:
-                end, v2 = tail
-                return end, base + v2
-            return j, base
-
-        if i + 1 < n and fw == "yksi" and _fold(words[i + 1]) == "miljoona":
-            ...
-        if (... "miljardi", "miljardia" ...):
-            ...
-        if (... "biljoona", "biljoonaa" ...):
-            ...
+        # `yksi <multiplier>` is already handled by _parse_0_999 + the chaining
+        # path below, so no special-case branches are needed here.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@normalization/languages/finnish/number_normalizer.py` around lines 272 - 322,
The code contains redundant special-case branches handling "yksi" + multiplier
(the blocks that check fw == "yksi" for tuh(a)t, miljoona, miljardi(a),
biljoona(a)); these are unreachable because _parse_0_999 already parses bare
"yksi" and the generic chaining logic in _parse_number/_BIG_MULT handles
multiplication, and the special cases are also inconsistent with _BIG_MULT
inflections. Remove these four "yksi" special-case blocks entirely (or if you
prefer to keep them, make them mirror every inflection listed in _BIG_MULT),
leaving the generic _parse_0_999 → _parse_number chaining to handle "yksi"
multipliers; update or delete any related comments so the intent is clear.

19-24: _get unnecessarily linear over dict keys.

The lookup tables are built with already-lowercase ASCII/Unicode keys, so _get can be a direct table.get(word.casefold()) instead of scanning every key and _fold-ing it on every call. _parse_glued_kymmenta, _parse_0_99, and _continues_number all hit this function in tight loops across the input.

 def _get(table: dict[str, int], word: str) -> int | None:
-    fw = _fold(word)
-    for k, v in table.items():
-        if _fold(k) == fw:
-            return v
-    return None
+    return table.get(_fold(word))

If there's a reason keys might contain mixed case in the future, a one-time lowercase normalization at module load is still cheaper than per-call scans.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@normalization/languages/finnish/number_normalizer.py` around lines 19 - 24,
The _get function is doing an O(N) scan; replace it with a direct lookup by
using table.get(_fold(word)) (or table.get(word.casefold())) to avoid per-call
key iteration, and ensure the numeric lookup tables used by
_parse_glued_kymmenta, _parse_0_99, and _continues_number are normalized once at
module load (e.g., rebuild each table with keys passed through _fold/casefold
when they are created) so mixed-case keys won’t break the direct lookup.
tests/unit/languages/finnish_operators_test.py (1)

25-29: Minor: call get_word_replacements() once.

Repeated calls work but are wasteful and make the intent less clear. Consider binding once:

-def test_word_replacements(operators: FinnishOperators) -> None:
-    assert operators.get_word_replacements()["ma"] == "mina"
-    assert operators.get_word_replacements()["ok"] == "okei"
-    assert operators.get_word_replacements()["juu"] == "joo"
-    assert operators.get_word_replacements()["euro"] == "euros"
+def test_word_replacements(operators: FinnishOperators) -> None:
+    replacements = operators.get_word_replacements()
+    assert replacements["ma"] == "mina"
+    assert replacements["ok"] == "okei"
+    assert replacements["juu"] == "joo"
+    assert replacements["euro"] == "euros"

Also note: the "euro" == "euros" expectation here locks in the questionable euro → euros mapping flagged on replacements.py — if that entry is removed, this assertion needs to be updated.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/languages/finnish_operators_test.py` around lines 25 - 29, In
test_word_replacements, call operators.get_word_replacements() once and store
the result in a local variable (e.g., replacements) and then use
replacements[...] for each assertion to avoid repeated calls; locate the test
function test_word_replacements and the FinnishOperators.get_word_replacements()
usage to change the four assert lines accordingly, and update or remove the
"euro" == "euros" assertion if the euro→euros entry is removed from
replacements.py.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@normalization/languages/finnish/number_normalizer.py`:
- Around line 155-165: The Finnish normalizer contains an English-centric
two‑pass currency fix: remove the hardcoded English mapping in
_singular_spoken_unit and eliminate _currency_plural_fix_patterns and
_apply_currency_plural_fixes, then change _normalize_currency_symbols to
directly insert the configured Finnish trailing form from
currency_symbol_to_word (e.g., partitive forms like "euroa") instead of
converting to English singular and regex‑replacing back to plural; update any
callers of those removed helpers to use the single-pass replacement so we no
longer do the lossy round‑trip or risk rewriting unrelated occurrences.
- Around line 361-364: The branch in _parse_0_999 that checks _fold(words[i]) ==
"nolla" currently returns None when the next token satisfies _continues_number,
causing "nolla kaksi" → "nolla 2"; change the behavior to return a consumed
index and numeric 0 instead (i.e., return i+1, 0) so "nolla" is normalized to
"0" even when followed by another number word; update or add a unit test for
_parse_0_999 covering "nolla kaksi" and document the behavior with a brief
comment referencing _fold, _continues_number, and _parse_0_999.

In `@normalization/languages/finnish/operators.py`:
- Around line 35-41: The mapping currency_symbol_to_word currently uses English
plurals (e.g., "euros", "dollars") which is incorrect for Finnish numerals;
update currency_symbol_to_word to use Finnish partitive forms ("euroa",
"dollaria", "puntaa", "senttiä", "jeniä") for the symbols "€", "$", "£", "¢",
"¥". After changing currency_symbol_to_word, remove the now-redundant helpers
`_singular_spoken_unit` and `_currency_plural_fix_patterns` (and any logic that
relies on them) and update the tests in finnish_number_normalizer_test.py to
expect partitive outputs (e.g., "€50" -> "50 euroa"). Ensure all references to
those removed symbols are cleaned up to avoid unused symbol errors.
- Around line 5-16: The digit and number-word mappings (_FINNISH_DIGIT_WORDS and
number_words) only include diacritized keys and must also include ASCII-folded
equivalents so later stages that run after remove_diacritics can match; update
_FINNISH_DIGIT_WORDS and number_words to duplicate entries for each diacritized
key with its ASCII-folded form (e.g., add "nelja" alongside "neljä", "seitseman"
alongside "seitsemän", "yhdeksan" alongside "yhdeksän", etc.) following the
pattern used in FinnishNumberNormalizer (duplicate mapping keys to the same
digit strings).

In `@tests/unit/languages/finnish_number_normalizer_test.py`:
- Around line 17-46: The test expectations for symbol-mapped currencies in
test_currency_and_spoken_units are asserting English plurals; update the
expected outputs to use Finnish partitive forms (e.g., change "50 euros" to "50
euroa" and similarly for other currency tests) to match the corrected currency
mapping in operators.py, and add two new parametrized cases in the same test:
one for a bare symbol with no number (e.g., "€" -> expected behavior such as
unchanged "€" or a decided normalization) and one for a decimal amount using
Finnish decimal comma (e.g., "€9,99" -> "9,99 euroa") to verify
decimal_separator="," handling by FinnishNumberNormalizer.

---

Nitpick comments:
In `@normalization/languages/finnish/number_normalizer.py`:
- Around line 272-322: The code contains redundant special-case branches
handling "yksi" + multiplier (the blocks that check fw == "yksi" for tuh(a)t,
miljoona, miljardi(a), biljoona(a)); these are unreachable because _parse_0_999
already parses bare "yksi" and the generic chaining logic in
_parse_number/_BIG_MULT handles multiplication, and the special cases are also
inconsistent with _BIG_MULT inflections. Remove these four "yksi" special-case
blocks entirely (or if you prefer to keep them, make them mirror every
inflection listed in _BIG_MULT), leaving the generic _parse_0_999 →
_parse_number chaining to handle "yksi" multipliers; update or delete any
related comments so the intent is clear.
- Around line 19-24: The _get function is doing an O(N) scan; replace it with a
direct lookup by using table.get(_fold(word)) (or table.get(word.casefold())) to
avoid per-call key iteration, and ensure the numeric lookup tables used by
_parse_glued_kymmenta, _parse_0_99, and _continues_number are normalized once at
module load (e.g., rebuild each table with keys passed through _fold/casefold
when they are created) so mixed-case keys won’t break the direct lookup.

In `@normalization/languages/finnish/operators.py`:
- Around line 101-104: Move the local import of FINNISH_REPLACEMENTS out of
get_word_replacements and place it at module scope; in operators.py add a
top-level "from normalization.languages.finnish.replacements import
FINNISH_REPLACEMENTS" and then update get_word_replacements to simply return
FINNISH_REPLACEMENTS (remove the function-local import). This keeps parity with
other language modules and avoids unnecessary function-local imports when there
is no circular dependency.

In `@tests/unit/languages/finnish_operators_test.py`:
- Around line 25-29: In test_word_replacements, call
operators.get_word_replacements() once and store the result in a local variable
(e.g., replacements) and then use replacements[...] for each assertion to avoid
repeated calls; locate the test function test_word_replacements and the
FinnishOperators.get_word_replacements() usage to change the four assert lines
accordingly, and update or remove the "euro" == "euros" assertion if the
euro→euros entry is removed from replacements.py.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c4e9f736-86bf-42d5-9f99-46b5b5a3177d

📥 Commits

Reviewing files that changed from the base of the PR and between 88b54a4 and f259b26.

⛔ Files ignored due to path filters (1)
  • tests/e2e/files/gladia-3/fi.csv is excluded by !**/*.csv
📒 Files selected for processing (7)
  • normalization/languages/__init__.py
  • normalization/languages/finnish/__init__.py
  • normalization/languages/finnish/number_normalizer.py
  • normalization/languages/finnish/operators.py
  • normalization/languages/finnish/replacements.py
  • tests/unit/languages/finnish_number_normalizer_test.py
  • tests/unit/languages/finnish_operators_test.py

Comment on lines +155 to +165
def _singular_spoken_unit(trailing_word: str) -> str:
    t = trailing_word.lower()
    if t == "euros":
        return "euro"
    if t == "dollars":
        return "dollar"
    if t == "pounds":
        return "pound"
    if t == "yens":
        return "yen"
    return trailing_word

⚠️ Potential issue | 🟠 Major

Hardcoded English singular/plural table in a Finnish module.

_singular_spoken_unit maps euros→euro, dollars→dollar, pounds→pound, yens→yen. These are English forms; there is nothing Finnish about them, and the function will silently return the input unchanged for any value the config actually should hold in Finnish (euroa, dollaria, puntaa, jeniä, senttiä). This strongly suggests the design copied the Dutch/Swedish normalizer verbatim and inherited their trailing-word scheme.

In Finnish, numerals take the partitive singular regardless of amount (1 euro / 5 euroa — actually yksi euro is also acceptable for 1, but 5 euros is never correct). That means:

  • If currency_symbol_to_word is set to the Finnish partitive (euroa etc.), _singular_spoken_unit and _currency_plural_fix_patterns / _apply_currency_plural_fixes become unnecessary — you can just substitute the trailing word directly in _normalize_currency_symbols.
  • The current two-pass approach (convert to singular, then regex back to plural) is an unnecessarily lossy round-trip that also risks rewriting unrelated occurrences of euro/dollar/etc. elsewhere in the text.

Recommended to drop _singular_spoken_unit, _currency_plural_fix_patterns, and _apply_currency_plural_fixes, and simplify _normalize_currency_symbols to emit the configured trailing word directly. (See the companion comment on operators.py re: fixing the config itself.)
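
A hypothetical sketch of the single-pass replacement this comment recommends: the symbol is substituted directly with the configured Finnish partitive form, with no English round-trip. The names here are illustrative, not the module's actual API:

```python
import re

# Illustrative config using Finnish partitive forms, per the review.
CURRENCY_SYMBOL_TO_WORD = {"€": "euroa", "$": "dollaria", "£": "puntaa"}

def normalize_currency_symbols(text: str) -> str:
    # "€50" or "50 €" -> "50 euroa"; one pass, no plural fix-up needed.
    for symbol, word in CURRENCY_SYMBOL_TO_WORD.items():
        sym = re.escape(symbol)
        # Symbol before the amount (accepts a decimal comma).
        text = re.sub(rf"{sym}\s*(\d+(?:,\d+)?)", rf"\1 {word}", text)
        # Symbol after the amount.
        text = re.sub(rf"(\d+(?:,\d+)?)\s*{sym}", rf"\1 {word}", text)
    return text
```

Because the partitive form is inserted directly, unrelated occurrences of "euro"/"dollar" elsewhere in the text are never touched.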

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@normalization/languages/finnish/number_normalizer.py` around lines 155 - 165,
The Finnish normalizer contains an English-centric two‑pass currency fix: remove
the hardcoded English mapping in _singular_spoken_unit and eliminate
_currency_plural_fix_patterns and _apply_currency_plural_fixes, then change
_normalize_currency_symbols to directly insert the configured Finnish trailing
form from currency_symbol_to_word (e.g., partitive forms like "euroa") instead
of converting to English singular and regex‑replacing back to plural; update any
callers of those removed helpers to use the single-pass replacement so we no
longer do the lossy round‑trip or risk rewriting unrelated occurrences.

Comment on lines +361 to +364
        if _fold(words[i]) == "nolla":
            if i + 1 < n and self._continues_number(words[i + 1]):
                return None
            return i + 1, 0

⚠️ Potential issue | 🟡 Minor

nolla followed by a number word is silently left un-normalized.

When nolla is followed by another number word, _parse_0_999 returns None rather than producing 0. The caller then falls through to out.append(words[i]), leaving the literal "nolla" in place while the next word still gets converted. Result: "nolla kaksi""nolla 2", which is neither the original spelled-out form nor a consistent digit form.

If the goal is to avoid consuming leading zeros in a compound (e.g. phone-number-like sequences), consider emitting "0" explicitly so at least the output is internally consistent:

         if _fold(words[i]) == "nolla":
-            if i + 1 < n and self._continues_number(words[i + 1]):
-                return None
             return i + 1, 0

or, if the "don't consume" behavior is intentional for digit-sequence preservation, document it with a comment and add a test case covering the intended downstream step that turns each nolla into 0.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-        if _fold(words[i]) == "nolla":
-            if i + 1 < n and self._continues_number(words[i + 1]):
-                return None
-            return i + 1, 0
+        if _fold(words[i]) == "nolla":
+            return i + 1, 0
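
For concreteness, a toy sketch of the output once the suggestion is applied (hypothetical simplification; the real path goes through _parse_0_999 and the surrounding word loop): every "nolla" is consumed and emitted as a digit, so leading zeros stay consistent with the rest of the sequence.

```python
_DIGITS = {"nolla": "0", "yksi": "1", "kaksi": "2"}

def normalize_with_fix(words: list[str]) -> str:
    # With the suggestion applied, "nolla" is always consumed as "0",
    # so "nolla kaksi" -> "0 2" rather than the mixed "nolla 2".
    return " ".join(_DIGITS.get(w, w) for w in words)
```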
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@normalization/languages/finnish/number_normalizer.py` around lines 361 - 364,
The branch in _parse_0_999 that checks _fold(words[i]) == "nolla" currently
returns None when the next token satisfies _continues_number, causing "nolla
kaksi" → "nolla 2"; change the behavior to return a consumed index and numeric 0
instead (i.e., return i+1, 0) so "nolla" is normalized to "0" even when followed
by another number word; update or add a unit test for _parse_0_999 covering
"nolla kaksi" and document the behavior with a brief comment referencing _fold,
_continues_number, and _parse_0_999.

Comment on lines +5 to +16
_FINNISH_DIGIT_WORDS: dict[str, str] = {
    "nolla": "0",
    "yksi": "1",
    "kaksi": "2",
    "kolme": "3",
    "neljä": "4",
    "viisi": "5",
    "kuusi": "6",
    "seitsemän": "7",
    "kahdeksan": "8",
    "yhdeksän": "9",
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Locate the pipeline steps that consume digit_words / number_words and check whether
# they operate on diacritic-folded text or raw text.
rg -nP --type=py -C3 '\b(digit_words|number_words)\b'
rg -nP --type=py -C2 '\b(casefold_text|remove_diacritics)\b'

Repository: gladiaio/normalization

Length of output: 21861


🏁 Script executed:

# Find pipeline definition and execution order
find . -type f -name "*.py" | xargs grep -l "pipeline\|Pipeline\|STEPS\|step.*order" | head -20

Repository: gladiaio/normalization

Length of output: 452


🏁 Script executed:

# Check if there's a pipeline configuration or orchestrator
fd -t f "pipeline|config" -e py | head -20

Repository: gladiaio/normalization

Length of output: 148


🏁 Script executed:

# Look for how steps are registered and executed
rg -l "TextStep|register.*step" --type=py | head -10

Repository: gladiaio/normalization

Length of output: 532


🏁 Script executed:

# Check the actual pipeline execution order by finding the main pipeline runner
rg -nP "convert_digit_word_sequences_to_digits|remove_diacritics|casefold_text" --type=py -B2 -A2 | grep -E "\.py.*:.*\(convert_digit|\.py.*:.*\(remove_|\.py.*:.*\(casefold"

Repository: gladiaio/normalization

Length of output: 48


🏁 Script executed:

# Examine the pipeline base definition
cat normalization/pipeline/base.py

Repository: gladiaio/normalization

Length of output: 4430


🏁 Script executed:

# Check the step registry to see how steps are ordered
cat normalization/steps/registry.py

Repository: gladiaio/normalization

Length of output: 856


🏁 Script executed:

# Look at the pipeline loader
head -100 normalization/pipeline/loader.py

Repository: gladiaio/normalization

Length of output: 2532


🏁 Script executed:

# Search for explicit step ordering or pipeline configuration
rg -A30 "DEFAULT.*STEPS|PIPELINE_STEPS|steps.*=.*\[" --type=py normalization/pipeline/ normalization/

Repository: gladiaio/normalization

Length of output: 48


🏁 Script executed:

# Find preset files
find . -name "*.yaml" -path "*/presets/*" -type f

Repository: gladiaio/normalization

Length of output: 103


🏁 Script executed:

# Look at the default preset
cat normalization/presets/gladia-3.yaml

Repository: gladiaio/normalization

Length of output: 5298


🏁 Script executed:

# Check if there are language-specific presets
ls -la normalization/presets/

Repository: gladiaio/normalization

Length of output: 239


🏁 Script executed:

# Verify when replacements are applied in the pipeline
rg -n "apply_sentence_level_replacements|replacements" normalization/presets/gladia-3.yaml

Repository: gladiaio/normalization

Length of output: 141


🏁 Script executed:

# Check the apply_sentence_level_replacements step to understand the context
cat normalization/steps/text/apply_sentence_level_replacements.py

Repository: gladiaio/normalization

Length of output: 1189


🏁 Script executed:

# Examine the Finnish replacements module to understand the context
cat normalization/languages/finnish/replacements.py

Repository: gladiaio/normalization

Length of output: 1199


🏁 Script executed:

# Check if German has diacritics in digit_words
rg -A15 "_GERMAN_DIGIT_WORDS" normalization/languages/german/operators.py

Repository: gladiaio/normalization

Length of output: 716


🏁 Script executed:

# Check French digit_words in detail
rg -A15 "_FRENCH_DIGIT_WORDS" normalization/languages/french/operators.py

Repository: gladiaio/normalization

Length of output: 675


🏁 Script executed:

# Verify the exact step order in the pipeline: when do digit_words steps run?
cat normalization/presets/gladia-3.yaml | grep -A3 -B3 "casefold_text\|remove_diacritics\|convert_digit_word"

Repository: gladiaio/normalization

Length of output: 666


🏁 Script executed:

# Verify all steps that consume digit_words/number_words and their positions in the pipeline
rg -l "digit_words|number_words" normalization/steps/text/*.py | while read f; do
  basename "$f"
done

Repository: gladiaio/normalization

Length of output: 238


🏁 Script executed:

# Check the exact position of each digit_words/number_words consuming step in the pipeline
cat normalization/presets/gladia-3.yaml | grep -E "protect_plus_word_before_digit_words|convert_digit_word_sequences_to_digits|normalize_punctuation_between_number_words|fix_dot_adjacent_number_words|casefold_text|remove_diacritics"

Repository: gladiaio/normalization

Length of output: 360


🏁 Script executed:

# Verify the order is consistent: all digit_words steps before casefold
echo "=== Steps consuming digit_words/number_words and their position ===" && \
cat normalization/presets/gladia-3.yaml | grep -n "protect_plus_word_before_digit_words\|convert_digit_word_sequences_to_digits\|normalize_punctuation_between_number_words\|fix_dot_adjacent_number_words\|casefold_text\|remove_diacritics"

Repository: gladiaio/normalization

Length of output: 449


🏁 Script executed:

# Check if FinnishNumberNormalizer includes both diacritics and non-diacritics forms
cat normalization/languages/finnish/number_normalizer.py | grep -A30 "_ONES_2_9\|_TEENS"

Repository: gladiaio/normalization

Length of output: 4872


🏁 Script executed:

# Verify the exact execution context of fix_dot_adjacent_number_words
cat normalization/steps/text/post_steps/fix_dot_adjacent_number_words.py 2>/dev/null || cat normalization/steps/text/fix_dot_adjacent_number_words.py

Repository: gladiaio/normalization

Length of output: 1338


🏁 Script executed:

# Check what text state fix_dot_adjacent_number_words receives
rg -B10 "class.*FixDotAdjacentNumberWords" normalization/steps/text/fix_dot_adjacent_number_words.py

Repository: gladiaio/normalization

Length of output: 300


Include both diacritic and ASCII-folded forms in _FINNISH_DIGIT_WORDS.

The pipeline consumes digit_words in two stages with opposing diacritic states:

  • protect_plus_word_before_digit_words, normalize_punctuation_between_number_words, and convert_digit_word_sequences_to_digits run before casefold_text and remove_diacritics, so they match diacritized forms in the original text.
  • fix_dot_adjacent_number_words runs after diacritics are removed in text_post, so it matches ASCII-folded forms.

The current _FINNISH_DIGIT_WORDS contains only diacritized keys (neljä, seitsemän, yhdeksän), which will fail in the second stage. Follow the pattern already used in FinnishNumberNormalizer (lines 32–54): include both forms for each word:

_FINNISH_DIGIT_WORDS: dict[str, str] = {
    "nolla": "0",
    "yksi": "1",
    "kaksi": "2",
    "kolme": "3",
    "neljä": "4",
    "nelja": "4",
    "viisi": "5",
    "kuusi": "6",
    "seitsemän": "7",
    "seitseman": "7",
    "kahdeksan": "8",
    "yhdeksän": "9",
    "yhdeksan": "9",
}

The same fix applies to number_words (lines 57–84).
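Rather than hand-maintaining both spellings, the ASCII-folded duplicates can be derived mechanically. A minimal sketch — the helper name `with_ascii_folded` is illustrative, not something from the repo:

```python
import unicodedata


def with_ascii_folded(words: dict[str, str]) -> dict[str, str]:
    """Return a copy of `words` that also maps each key's ASCII-folded form."""
    folded = dict(words)
    for word, value in words.items():
        # NFKD splits "ä" into "a" + combining diaeresis; encoding to ASCII
        # with errors="ignore" then drops the combining mark.
        ascii_form = (
            unicodedata.normalize("NFKD", word)
            .encode("ascii", "ignore")
            .decode("ascii")
        )
        folded.setdefault(ascii_form, value)
    return folded


digit_words = with_ascii_folded({"neljä": "4", "seitsemän": "7", "yhdeksän": "9"})
# Both "neljä" and "nelja" now map to "4".
```

`setdefault` keeps any explicitly written folded entry intact, so a hand-maintained override still wins over the derived form.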

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@normalization/languages/finnish/operators.py` around lines 5 - 16, The digit
and number-word mappings (_FINNISH_DIGIT_WORDS and number_words) only include
diacritized keys and must also include ASCII-folded equivalents so later stages
that run after remove_diacritics can match; update _FINNISH_DIGIT_WORDS and
number_words to duplicate entries for each diacritized key with its ASCII-folded
form (e.g., add "nelja" alongside "neljä", "seitseman" alongside "seitsemän",
"yhdeksan" alongside "yhdeksän", etc.) following the pattern used in
FinnishNumberNormalizer (duplicate mapping keys to the same digit strings).

Comment on lines +35 to +41
currency_symbol_to_word={
"€": "euros",
"$": "dollars",
"£": "pounds",
"¢": "cent",
"¥": "yens",
},

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

Finnish currency after numeral partitive form euroa dollaria

💡 Result:

In Finnish grammar, after cardinal numerals (other than yksi 'one'), currency names like euro and dollari take the partitive singular form. Thus, '2 euros' is kaksi euroa and '2 dollars' is kaksi dollaria. This follows the general rule where nouns after numerals appear in the partitive case (e.g., kaksi kuppia 'two cups'). The partitive forms are euroa (for euro) and dollaria (for dollari).

Citations:


🏁 Script executed:

# First, find and read the file in question
find . -name "operators.py" -path "*/finnish/*" -type f

Repository: gladiaio/normalization

Length of output: 112


🏁 Script executed:

# Read the operators.py file to verify the code snippet
cat -n normalization/languages/finnish/operators.py | head -50

Repository: gladiaio/normalization

Length of output: 1595


🏁 Script executed:

# Search for FinnishNumberNormalizer to find its location
find . -name "*.py" -type f | xargs grep -l "FinnishNumberNormalizer" | head -5

Repository: gladiaio/normalization

Length of output: 224


🏁 Script executed:

# Search for _singular_spoken_unit in the codebase
rg "_singular_spoken_unit" -A 10

Repository: gladiaio/normalization

Length of output: 5937


🏁 Script executed:

# Search for test files related to Finnish number normalizer
find . -name "*finnish*test*.py" -o -name "*test*finnish*.py" | head -10

Repository: gladiaio/normalization

Length of output: 171


🏁 Script executed:

# Read the test file to see what's expected
cat -n tests/unit/languages/finnish_number_normalizer_test.py

Repository: gladiaio/normalization

Length of output: 2077


🏁 Script executed:

# Read the full Finnish number normalizer implementation
cat -n normalization/languages/finnish/number_normalizer.py

Repository: gladiaio/normalization

Length of output: 15942


Fix Finnish currency forms: use partitive case instead of English plurals.

The currency_symbol_to_word mapping uses English plural forms (euros, dollars, pounds, yens) when Finnish requires the partitive case after numerals: euroa, dollaria, puntaa, senttiä, jeniä. For example, 5 € in Finnish speech is "viisi euroa", never "viisi euros". This causes:

  1. Non-Finnish output text that will degrade WER against correct Finnish transcripts
  2. Awkward redundancy: _singular_spoken_unit contains hardcoded English mappings (euros→euro, dollars→dollar, etc.) that must then be "fixed back" to plural via _currency_plural_fix_patterns — a clear sign of non-Finnish adaptation

Additionally, "¢" → "cent" is singular while all others are plural, breaking internal consistency.

Replace with Finnish partitive forms:

Suggested diff
     currency_symbol_to_word={
-        "€": "euros",
-        "$": "dollars",
-        "£": "pounds",
-        "¢": "cent",
-        "¥": "yens",
+        "€": "euroa",
+        "$": "dollaria",
+        "£": "puntaa",
+        "¢": "senttiä",
+        "¥": "jeniä",
     },

With this fix, both _singular_spoken_unit and _currency_plural_fix_patterns can be removed entirely (partitive form is already correct after numerals). Update test expectations in finnish_number_normalizer_test.py accordingly (e.g., "€50" → "50 euroa").
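To illustrate why the fix-up pass becomes unnecessary: with partitive values in the symbol map, the symbol rewrite alone already yields grammatical Finnish. A hedged sketch — the names below are illustrative, not the repo's API:

```python
# Finnish partitive forms, as required after a numeral (assumed mapping).
CURRENCY_SYMBOL_TO_WORD = {
    "€": "euroa",
    "$": "dollaria",
    "£": "puntaa",
    "¢": "senttiä",
    "¥": "jeniä",
}


def spell_amount(amount: str, symbol: str) -> str:
    # Finnish puts the unit after the amount, already in the partitive:
    # "50 euroa" — no singular/plural fix-up step is needed afterwards.
    return f"{amount} {CURRENCY_SYMBOL_TO_WORD[symbol]}"
```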

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@normalization/languages/finnish/operators.py` around lines 35 - 41, The
mapping currency_symbol_to_word currently uses English plurals (e.g., "euros",
"dollars") which is incorrect for Finnish numerals; update
currency_symbol_to_word to use Finnish partitive forms ("euroa", "dollaria",
"puntaa", "senttiä", "jeniä") for the symbols "€", "$", "£", "¢", "¥". After
changing currency_symbol_to_word, remove the now-redundant helpers
`_singular_spoken_unit` and `_currency_plural_fix_patterns` (and any logic that
relies on them) and update the tests in finnish_number_normalizer_test.py to
expect partitive outputs (e.g., "€50" -> "50 euroa"). Ensure all references to
those removed symbols are cleaned up to avoid unused symbol errors.

Comment on lines +7 to +49
FINNISH_REPLACEMENTS: dict[str, str] = {
"ma": "mina",
"maa": "mina",
"mulle": "minulle",
"mulla": "minulla",
"mua": "minua",
"mun": "minun",
"sa": "sina",
"sulle": "sinulle",
"sulla": "sinulla",
"sua": "sinua",
"sun": "sinun",
"toi": "tuo",
"ton": "tuon",
"tossa": "tuossa",
"tosta": "tuosta",
"tohon": "tuohon",
"taa": "tama",
"naa": "nama",
"olis": "olisi",
"ois": "olisi",
"oo": "ole",
"ollu": "ollut",
"onks": "onko",
"oliks": "oliko",
"oisko": "olisiko",
"vois": "voisi",
"katotaan": "katsotaan",
"kattoa": "katsoa",
"mut": "mutta",
"sit": "sitten",
"sitte": "sitten",
"et": "etta",
"sillon": "silloin",
"viimeks": "viimeksi",
"elikka": "eli",
"juu": "joo",
"jes": "joo",
"ok": "okei",
"bank": "pankki",
"bankin": "pankin",
"euro": "euros",
}

⚠️ Potential issue | 🟠 Major

Several replacement entries look incorrect or unsafe for Finnish.

A few of these will actively corrupt otherwise-correct Finnish text rather than normalize colloquial forms:

  • "maa": "mina" — maa is a common Finnish noun meaning "land/country". Mapping it to mina (minä, "I") will mangle any sentence mentioning a country or land. The intended colloquial form for minä is mä → ASCII ma, which is already covered on line 8. Line 9 should be dropped.
  • "euro": "euros" — euro is the standard Finnish singular for the currency. Replacing it with euros (which is an English plural, not Finnish) will both break correct Finnish and conflict with the currency-restore logic in number_normalizer.py. In Finnish the form used after a numeral is the partitive euroa (already appearing in the tests as the expected output for "kymmenen euroa").
  • "bank" / "bankin" → "pankki" / "pankin" — these are English, not Finnish colloquial variants. If the intent is to normalize ASR mis-transcriptions of loanwords, it should be documented; otherwise they don't belong in a Finnish colloquial→standard table.
  • "jes": "joo" — jes is an interjection ("yes!"), not a variant of joo. Collapsing it to joo loses semantic distinction; consider dropping.

Please have a native speaker sanity-check the rest of the table as well (e.g., "taa" only matches tää once diacritics are stripped, which seems to be the assumption per the module docstring, but it would also collide with the Finnish word taa = "behind" if the pipeline ever feeds un-folded input).

🩹 Suggested removals
     "ma": "mina",
-    "maa": "mina",
     "mulle": "minulle",
@@
     "juu": "joo",
-    "jes": "joo",
     "ok": "okei",
-    "bank": "pankki",
-    "bankin": "pankin",
-    "euro": "euros",
+    "bank": "pankki",
+    "bankin": "pankin",

(Keep or drop the bank* entries depending on whether they are intentional ASR fixes — if kept, consider a comment explaining the rationale.)


Comment on lines +17 to +46
@pytest.mark.parametrize(
("text", "expected"),
[
("kaksi kymmenta viisi", "25"),
("kaksi kymmentä viisi", "25"),
("sata", "100"),
("tuhat", "1000"),
("yksi tuhat", "1000"),
("kolme miljoonaa", "3000000"),
("yksi miljoona", "1000000"),
],
)
def test_finnish_spelled_numbers(
normalizer: FinnishNumberNormalizer, text: str, expected: str
) -> None:
assert normalizer(text) == expected


@pytest.mark.parametrize(
("text", "expected"),
[
("kymmenen euroa", "10 euroa"),
("€50", "50 euros"),
("50 €", "50 euros"),
],
)
def test_currency_and_spoken_units(
normalizer: FinnishNumberNormalizer, text: str, expected: str
) -> None:
assert normalizer(text) == expected

⚠️ Potential issue | 🟡 Minor

Currency expectations encode the English-plural bug.

"€50" → "50 euros" and "50 €" → "50 euros" hardcode the output of the English-plural currency mapping (see the comment on operators.py lines 35–41). By contrast, "kymmenen euroa" → "10 euroa" already uses the correct Finnish partitive (because euroa is preserved verbatim as a trailing word, not reconstructed from the symbol map).

If the currency map is fixed to use euroa / dollaria / puntaa / senttiä / jeniä, these parametrized expectations should be updated accordingly:

-        ("€50", "50 euros"),
-        ("50 €", "50 euros"),
+        ("€50", "50 euroa"),
+        ("50 €", "50 euroa"),

Also consider adding a negative/edge case: a bare currency symbol with no adjacent number (e.g. "€" alone) and a decimal amount (e.g. "€9,99" — note Finnish decimal comma), to lock down behavior around decimal_separator=",".
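A hedged sketch of the behavior those edge cases would pin down — the regexes and names here are illustrative, not the repo's implementation:

```python
import re

# Assumed partitive mapping; only two symbols shown for brevity.
CURRENCY_WORD = {"€": "euroa", "$": "dollaria"}


def normalize_currency(text: str) -> str:
    """Rewrite symbol-adjacent amounts; leave a bare symbol untouched."""
    for symbol, word in CURRENCY_WORD.items():
        sym = re.escape(symbol)
        amount = r"\d+(?:,\d+)?"  # allows a Finnish decimal comma
        # Symbol before the amount ("€9,99") or after it ("50 €").
        text = re.sub(rf"{sym}\s?({amount})", rf"\1 {word}", text)
        text = re.sub(rf"({amount})\s?{sym}", rf"\1 {word}", text)
    return text
```

A bare "€" with no adjacent digits matches neither pattern, so it passes through unchanged — exactly the negative case worth locking down in the parametrized tests.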

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/languages/finnish_number_normalizer_test.py` around lines 17 - 46,
The test expectations for symbol-mapped currencies in
test_currency_and_spoken_units are asserting English plurals; update the
expected outputs to use Finnish partitive forms (e.g., change "50 euros" to "50
euroa" and similarly for other currency tests) to match the corrected currency
mapping in operators.py, and add two new parametrized cases in the same test:
one for a bare symbol with no number (e.g., "€" -> expected behavior such as
unchanged "€" or a decided normalization) and one for a decimal amount using
Finnish decimal comma (e.g., "€9,99" -> "9,99 euroa") to verify
decimal_separator="," handling by FinnishNumberNormalizer.

@Karamouche Karamouche merged commit 0e06d06 into main May 5, 2026
10 checks passed
@Karamouche Karamouche deleted the feat/add-finnish-language branch May 5, 2026 16:04
