feat: init swedish language basic normalization by egenthon-cmd · Pull Request #23 · gladiaio/normalization

egenthon-cmd · 2026-04-23T12:45:30Z

Adds Swedish (sv) normalization: operators, replacements, number normalizer, registry wiring, and unit and gladia-3 e2e tests.

Type of change

Checklist

Only fill in the section(s) that match your change — delete the rest.

New language

Created normalization/languages/{lang}/ with operators.py, replacements.py, __init__.py
Word substitutions are in replacements.py (not hardcoded in operators.py)
LanguageConfig is filled in with the language's data (separators, currency words, digit words, …)
Subclassed LanguageOperators — only override methods where the logic changes, not just the data
Class is decorated with @register_language and imported in normalization/languages/__init__.py
Unit tests added in tests/unit/languages/
E2e CSV added in tests/e2e/files/{preset}/{lang}.csv (e.g. tests/e2e/files/gladia-3/fr.csv)

Edit existing language

New/changed word substitutions go in replacements.py, not inline in operators.py
If you changed a config field that can be None: the step reading it still handles None gracefully
Unit tests updated or added
E2e CSV updated if the expected output changed

New step

Unique name class attribute set (this is the key used in YAML presets)
Decorated with @register_step and imported in steps/text/__init__.py or steps/word/__init__.py
No hardcoded language values — read data from operators.config.* instead
If placeholder-based: protect + restore are both in steps/text/placeholders.py and pipeline/base.py's validate() is updated
Unit tests added in tests/unit/steps/
Step name added to the relevant preset YAML — or a new preset file created if existing presets are affected
If the docstring changed: ran uv run scripts/generate_step_docs.py

Edit existing step

Step name is unchanged — if the output changes, create a new step name + new preset instead
No language-specific logic or string literals added inside the step
Unit tests updated or added
If the docstring changed: ran uv run scripts/generate_step_docs.py

Preset change

Existing preset files are not modified — new behaviour goes in a new preset file
pipeline.validate() passes (runs automatically via loader.py)

How was this tested?

uv run pytest tests/

Summary by CodeRabbit

New Features
- Added Swedish language support with number normalization and spelling variant mappings.
Bug Fixes
- Improved currency symbol handling to differentiate single-character and multi-character symbols correctly.
Documentation
- Updated language support table and contributing guide.
- Enhanced text processing steps documentation with implementation details.

Made-with: Cursor

coderabbitai · 2026-04-23T12:45:36Z

📝 Walkthrough

Walkthrough

This PR adds Swedish language support to a text normalization library by introducing a complete Swedish language package with number normalization, operators, and word replacements, while also improving multi-character currency symbol handling in existing text-processing steps to avoid partial matches.

Changes

Swedish Language Support

Layer / File(s)	Summary
Module Setup `normalization/languages/__init__.py`, `normalization/languages/swedish/__init__.py`	Swedish language module is registered and exported alongside other language packages, exposing `SwedishOperators` and `SWEDISH_REPLACEMENTS`.
Language Configuration & Replacements `normalization/languages/swedish/operators.py`, `normalization/languages/swedish/replacements.py`	`SWEDISH_CONFIG` defines Swedish number formatting (decimal/thousand separators), currency mappings, and filler words. `SWEDISH_REPLACEMENTS` maps colloquial variants (`mej`→`mig`, `dom`→`de`, etc.) to canonical forms.
Number Normalization Engine `normalization/languages/swedish/number_normalizer.py`	`SwedishNumberNormalizer` implements recursive parsing of Swedish spelled-out numbers (0–999, `tusen`, `miljon`, `miljard`, `biljon`), with optional currency symbol rewriting and plural currency suffix restoration.
Language Operators `normalization/languages/swedish/operators.py`	`SwedishOperators` class, registered via `@register_language`, instantiates `SwedishNumberNormalizer` and provides `expand_written_numbers()` and `get_word_replacements()` methods.
Currency Symbol Handling Integration `normalization/steps/text/remove_standalone_currency_symbols.py`, `normalization/steps/text/replace_currency.py`	Multi-character currency symbols (e.g., `kr`) are now matched with word boundaries to avoid partially stripping text like `kronor`; single-character symbols retain existing regex logic.
Tests `tests/unit/languages/swedish_number_normalizer_test.py`, `tests/unit/languages/swedish_operators_test.py`, `tests/unit/steps/text/remove_standalone_currency_symbols_test.py`, `tests/unit/steps/text/replace_currency_test.py`	Comprehensive unit tests verify Swedish number parsing, operator registration, word replacement mappings, and currency handling correctness (e.g., `25 kronor` remains unchanged while `25 kr` expands to `25 kronor`).
Documentation `README.md`, `docs/contributing-guide.md`, `docs/steps.md`	Swedish (`sv`) is added to supported languages table; E2E test fixtures list includes `sv.csv`; documentation clarifies caching behavior and multi-character symbol word-boundary matching.
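
The word-boundary behaviour described in the currency row above can be sketched as follows. This is a minimal illustration, not the repository's code: `build_symbol_pattern` and `replace_currency` are hypothetical names, and the handling of single-character symbols is an assumption (the PR only says they keep their existing regex logic).

```python
import re


def build_symbol_pattern(symbol: str) -> re.Pattern[str]:
    """Hypothetical helper: compile a matcher for one currency symbol."""
    escaped = re.escape(symbol)
    if len(symbol) > 1:
        # Multi-character symbols such as "kr" are made of letters, so they
        # must be anchored on word boundaries to avoid matching inside "kronor".
        return re.compile(rf"\b{escaped}\b", re.IGNORECASE)
    # Assumption: single-character symbols (€, $, ...) can be matched as-is.
    return re.compile(escaped)


def replace_currency(text: str, mapping: dict[str, str]) -> str:
    """Replace every symbol in `mapping` with its spoken word."""
    # Longest symbols first, so a longer symbol is never shadowed by a shorter one.
    for symbol in sorted(mapping, key=len, reverse=True):
        text = build_symbol_pattern(symbol).sub(mapping[symbol], text)
    return text


MAPPING = {"kr": "kronor", "€": "euros"}

print(replace_currency("25 kr", MAPPING))      # standalone symbol is expanded
print(replace_currency("25 kronor", MAPPING))  # already-expanded word is untouched
```

Without the `\b` anchors, the `kr` pattern would also match the first two letters of `kronor`, which is the partial match the PR sets out to avoid.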

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

gladiaio/normalization#22: Both PRs add a new language package following the same pattern (operators, number_normalizer, replacements, tests) and update normalization/languages/__init__.py exports.
gladiaio/normalization#15: Both PRs add new language packages and register them in normalization/languages/__init__.py module exports.
gladiaio/normalization#19: Both PRs add a new language package following the same structural pattern and update normalization/languages/__init__.py to export the new language.

Suggested reviewers

Karamouche

🐰 A Swedish tongue joins the fold with grace,
Numbers parsed at their own measured pace,
From "tjugo fem" to digits they spring,
While "kronor" remains a most treasured thing.
Multi-char symbols know their rightful place—
Word boundaries guide them with Nordic embrace. ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 17.50% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'feat: init swedish language basic normalization' clearly and specifically summarizes the main change—adding Swedish language support with basic normalization.
Description check	✅ Passed	The PR description follows the template structure, includes the required 'New language' checklist with all items marked as complete, and provides the testing command.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (4)

normalization/languages/swedish/number_normalizer.py (2)
113-125: ⚖️ Poor tradeoff

Hardcoded singular mapping couples normalizer to specific currency words.

_singular_spoken_unit enumerates euros/dollars/pounds/kronor/yens directly. Any new entry added to currency_symbol_to_word (e.g. a new locale) silently falls back to using the trailing word as both singular and plural, defeating the plural-fix patterns built later. Consider either deriving singular forms from a small data table colocated with LanguageConfig or, if the canonical-plural strategy is project-wide, lifting this map into a shared module so it can be reused by other language normalizers.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@normalization/languages/swedish/number_normalizer.py` around lines 113 - 125,
_singular_spoken_unit currently hardcodes a few currency plurals so new entries
in currency_symbol_to_word won't get correct singular forms; replace the ad-hoc
mapping by deriving singulars from a small colocated data table (or lift the map
into the shared config) and look up trailing_word there: update
_singular_spoken_unit to consult the new singulars map (or shared module)
instead of hardcoding euros/dollars/pounds/kronor/yens, and ensure the table is
kept next to LanguageConfig or exported from the global canonical-plural utility
so other language normalizers can reuse it.
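
One way to read the suggestion above is a plain lookup table plus a check that surfaces unmapped words. The sketch below is an assumption-laden illustration: `CURRENCY_SINGULAR`, `singular_spoken_unit` and `missing_singulars` are hypothetical names, and the singular forms listed are illustrative guesses, not values taken from the PR.

```python
# Hypothetical data table; the singular forms are illustrative assumptions.
CURRENCY_SINGULAR: dict[str, str] = {
    "euros": "euro",
    "dollars": "dollar",
    "pounds": "pound",
    "yens": "yen",
    "kronor": "krona",
}


def singular_spoken_unit(trailing_word: str) -> str:
    """Return the singular form, falling back to the word itself if unmapped."""
    return CURRENCY_SINGULAR.get(trailing_word.lower(), trailing_word)


def missing_singulars(currency_symbol_to_word: dict[str, str]) -> list[str]:
    """List currency words that would silently fall back to themselves."""
    return sorted(
        {
            word
            for word in currency_symbol_to_word.values()
            if word.lower() not in CURRENCY_SINGULAR
        }
    )


# The mapping quoted in this review, including the "cent" entry under discussion.
symbols = {
    "€": "euros",
    "$": "dollars",
    "£": "pounds",
    "¢": "cent",
    "¥": "yens",
    "kr": "kronor",
}
print(missing_singulars(symbols))
```

A check like `missing_singulars` could run in a unit test so that a new currency entry without a singular form fails loudly instead of falling back silently.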
234-280: 💤 Low value

Inconsistent en/ett + multiplier coverage.

The fast-path branches at lines 234, 242, 250 and 266 handle en/ett tusen, en/ett miljon (singular only), and en/ett miljard(er) / en/ett biljon(er) (singular and plural). miljon does not get its plural form miljoner listed alongside it as the others do. While en miljoner is not idiomatic Swedish and rarely produced by STT, the asymmetry is easy to miss in maintenance. Consider unifying the multipliers in a single tuple-driven branch.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@normalization/languages/swedish/number_normalizer.py` around lines 234 - 280,
The branch handling "en"/"ett" + multiplier is inconsistent: add the missing
plural "miljoner" or unify all multiplier checks into a single tuple-driven
branch to avoid asymmetry; locate the blocks that check fw in ("en", "ett") and
_fold(words[i + 1]) == "miljon" (and the other blocks using
"tusen"/"miljard"/"biljon") inside the _parse_number logic and either include
"miljoner" alongside "miljon" or replace the repeated if-blocks with one that
looks up _fold(words[i + 1]) in a mapping/tuple of multipliers (e.g.,
{"tusen":1000, "miljon":1_000_000, "miljoner":1_000_000,
"miljard":1_000_000_000, "miljarder":1_000_000_000, "biljon":1_000_000_000_000,
"biljoner":1_000_000_000_000}), then call self._parse_number(words, i+2, n) and
return the combined value as currently done.
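
The table-driven branch proposed above might look like the sketch below. It is standalone and makes several assumptions: `match_one_multiplier` is a hypothetical name, `str.lower` stands in for the PR's `_fold` helper, and the recursive `_parse_number` call on the remaining words is left out.

```python
from typing import Optional

# Multiplier table taken from the review comment above.
MULTIPLIERS: dict[str, int] = {
    "tusen": 1000,
    "miljon": 1_000_000,
    "miljoner": 1_000_000,
    "miljard": 1_000_000_000,
    "miljarder": 1_000_000_000,
    "biljon": 1_000_000_000_000,
    "biljoner": 1_000_000_000_000,
}


def match_one_multiplier(words: list[str], i: int) -> Optional[tuple[int, int]]:
    """Return (value, next_index) if words[i] is en/ett followed by a multiplier.

    Assumption: str.lower() approximates the folding done by `_fold` in the PR.
    """
    if i + 1 >= len(words):
        return None
    if words[i].lower() not in ("en", "ett"):
        return None
    value = MULTIPLIERS.get(words[i + 1].lower())
    if value is None:
        return None
    # The caller would continue parsing the remainder from index i + 2.
    return value, i + 2


print(match_one_multiplier(["en", "miljon"], 0))
print(match_one_multiplier(["ett", "tusen", "fem"], 0))
```

Because every multiplier goes through the same lookup, adding or removing a plural form is a one-line change to the table rather than a new `if` block.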
tests/unit/steps/text/remove_standalone_currency_symbols_test.py (1)
13-17: ⚡ Quick win

Add a standalone kr assertion to prevent no-op false positives.

Lines 13–17 verify boundary safety, but adding a companion case for standalone kr removal would ensure the core behavior still works (not just the “do not strip inside words” case).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/steps/text/remove_standalone_currency_symbols_test.py` around
lines 13 - 17, Add a new assertion to the test to ensure standalone "kr" is
handled correctly: update the test function
test_multi_char_kr_not_stripped_from_kronor (or add a new sibling test) to call
RemoveStandaloneCurrencySymbolsStep() with SwedishOperators() and assert that an
input like "25 kr" is transformed/removed according to expected behavior (e.g.,
becomes "25" or "25 " depending on trimming rules) so the suite covers both
multi-char "kronor" retention and standalone "kr" handling.
tests/unit/languages/swedish_operators_test.py (1)
4-4: ⚡ Quick win

Add a test that verifies "sv" becomes available via package import flow.

The direct import on line 4 triggers the @register_language decorator at import time, making the registry tests pass. However, this bypasses verification that the package-level wiring in normalization/languages/__init__.py actually exercises the registration. Consider adding a separate test that imports via from normalization.languages import swedish (or similar) to ensure the __init__.py import chain properly registers Swedish operators.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/languages/swedish_operators_test.py` at line 4, Add a unit test
that verifies package-level import triggers the `@register_language` registration
for Swedish: instead of directly importing SwedishOperators, import the package
module (e.g., "from normalization.languages import swedish" or "import
normalization.languages; import normalization.languages.swedish") and then
assert the registry contains the "sv" entry or that swedish.SwedishOperators is
available; this ensures the __init__.py import chain exercises the
register_language decorator rather than relying on a direct import of
SwedishOperators.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c1ea7130-8c85-44f0-a1df-7adf62e56f29

📥 Commits

Reviewing files that changed from the base of the PR and between 0e06d06 and b785e77.

⛔ Files ignored due to path filters (1)

tests/e2e/files/gladia-3/sv.csv is excluded by !**/*.csv

📒 Files selected for processing (14)

README.md
docs/contributing-guide.md
docs/steps.md
normalization/languages/__init__.py
normalization/languages/swedish/__init__.py
normalization/languages/swedish/number_normalizer.py
normalization/languages/swedish/operators.py
normalization/languages/swedish/replacements.py
normalization/steps/text/remove_standalone_currency_symbols.py
normalization/steps/text/replace_currency.py
tests/unit/languages/swedish_number_normalizer_test.py
tests/unit/languages/swedish_operators_test.py
tests/unit/steps/text/remove_standalone_currency_symbols_test.py
tests/unit/steps/text/replace_currency_test.py

coderabbitai · 2026-05-05T16:11:00Z

+_RE_MIXED_NUMBER = re.compile(
+    r"\b(\d+)\s+("
+    r"miljon|miljoner|miljard|miljarder|biljon|biljoner|tusen"
+    r")\b",
+    re.IGNORECASE,
+)
+
+_BIG_MULT: dict[str, int] = {
+    "tusen": 1000,
+    "miljon": 1_000_000,
+    "miljoner": 1_000_000,
+    "miljard": 1_000_000_000,
+    "miljarder": 1_000_000_000,
+    "biljon": 1_000_000_000_000,
+    "biljoner": 1_000_000_000_000,
+}
+
+
+def _normalize_mixed_numbers(text: str) -> str:
+    """Convert ``3 miljard`` → ``tre miljard`` so the word parser yields 3e9."""
+
+    def replace(match: re.Match[str]) -> str:
+        number = match.group(1)
+        multiplier = match.group(2)
+        if len(number) == 1 and number in _DIGIT_TO_SWEDISH:
+            return f"{_DIGIT_TO_SWEDISH[number]} {multiplier}"
+        return match.group(0)
+
+    return _RE_MIXED_NUMBER.sub(replace, text)


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Multi-digit mixed numbers are silently left unconverted.

_RE_MIXED_NUMBER matches \d+ (any digit count), but replace() only rewrites single-digit cases (len(number) == 1). As a result, "20 miljoner" stays unchanged and never gets composed by the spelled-out parser, so it is not converted to 20000000. Either narrow the regex to a single digit (so the intent is explicit) or extend the rewrite to handle multi-digit values.

♻️ Option A: narrow regex (current behavior, made explicit)

 _RE_MIXED_NUMBER = re.compile(
-    r"\b(\d+)\s+("
+    r"\b(\d)\s+("
     r"miljon|miljoner|miljard|miljarder|biljon|biljoner|tusen"
     r")\b",
     re.IGNORECASE,
 )

♻️ Option B: handle multi-digit numbers by multiplying directly

 def _normalize_mixed_numbers(text: str) -> str:
-    """Convert ``3 miljard`` → ``tre miljard`` so the word parser yields 3e9."""
+    """Convert ``3 miljard`` → ``tre miljard`` (single digit) or ``20 miljoner`` → digits."""

     def replace(match: re.Match[str]) -> str:
         number = match.group(1)
         multiplier = match.group(2)
         if len(number) == 1 and number in _DIGIT_TO_SWEDISH:
             return f"{_DIGIT_TO_SWEDISH[number]} {multiplier}"
-        return match.group(0)
+        mult = _BIG_MULT.get(multiplier.lower())
+        if mult is None:
+            return match.group(0)
+        return str(int(number) * mult)

     return _RE_MIXED_NUMBER.sub(replace, text)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@normalization/languages/swedish/number_normalizer.py` around lines 82 - 110, The regex _RE_MIXED_NUMBER currently matches multi-digit numerals but replace() only handles single digits (checking len(number) == 1), so strings like "20 miljoner" are left unchanged; update _normalize_mixed_numbers (replace) to handle multi-digit numbers by parsing number = int(match.group(1)) and, if multiplier = match.group(2) is in _BIG_MULT, compute value = number * _BIG_MULT[multiplier] and return str(value) (otherwise fall back to the existing single-digit logic that uses _DIGIT_TO_SWEDISH), or alternatively make _RE_MIXED_NUMBER only match a single digit if you prefer the original single-digit-only behavior.
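
For reference, Option B can be exercised in isolation with the sketch below. The regex and `_BIG_MULT` are copied from the diff in this comment; the contents of `_DIGIT_TO_SWEDISH` are not shown in the PR, so the table here is an assumption, and the function is renamed `normalize_mixed_numbers` to mark it as a standalone copy.

```python
import re

_RE_MIXED_NUMBER = re.compile(
    r"\b(\d+)\s+("
    r"miljon|miljoner|miljard|miljarder|biljon|biljoner|tusen"
    r")\b",
    re.IGNORECASE,
)

_BIG_MULT: dict[str, int] = {
    "tusen": 1000,
    "miljon": 1_000_000,
    "miljoner": 1_000_000,
    "miljard": 1_000_000_000,
    "miljarder": 1_000_000_000,
    "biljon": 1_000_000_000_000,
    "biljoner": 1_000_000_000_000,
}

# Assumption: the real table is not shown in the PR; these entries are illustrative.
_DIGIT_TO_SWEDISH: dict[str, str] = {
    "2": "två", "3": "tre", "4": "fyra", "5": "fem",
    "6": "sex", "7": "sju", "8": "åtta", "9": "nio",
}


def normalize_mixed_numbers(text: str) -> str:
    """Single digits become words; multi-digit values are multiplied directly."""

    def replace(match: re.Match[str]) -> str:
        number = match.group(1)
        multiplier = match.group(2)
        if len(number) == 1 and number in _DIGIT_TO_SWEDISH:
            return f"{_DIGIT_TO_SWEDISH[number]} {multiplier}"
        # Lower-case the multiplier because the regex is case-insensitive.
        mult = _BIG_MULT.get(multiplier.lower())
        if mult is None:
            return match.group(0)
        return str(int(number) * mult)

    return _RE_MIXED_NUMBER.sub(replace, text)


print(normalize_mixed_numbers("20 miljoner"))  # 20000000
print(normalize_mixed_numbers("3 miljard"))    # tre miljard
```

Note that a decimal such as `2,5 miljoner` still matches on the trailing `5`, with or without Option B; whether an earlier pipeline step protects decimals is not visible from this diff.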

coderabbitai · 2026-05-05T16:11:00Z

+    currency_symbol_to_word={
+        "€": "euros",
+        "$": "dollars",
+        "£": "pounds",
+        "¢": "cent",
+        "¥": "yens",
+        "kr": "kronor",
+    },


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

"¢": "cent" is inconsistent with the plural canonical used elsewhere.

All other entries map to the plural form (euros, dollars, pounds, yens, kronor), and the number normalizer's plural-fix logic relies on that convention: in _currency_plural_fix_patterns, the entry is skipped when singular.lower() == trailing.lower() (which is the case for cent). Net effect: 5 ¢ becomes 5 cent and 5 euro becomes 5 euros — different singular/plural canonicalization across currencies, which will hurt WER consistency. Likely intended to be "cents".

🩹 Suggested fix

     currency_symbol_to_word={
         "€": "euros",
         "$": "dollars",
         "£": "pounds",
-        "¢": "cent",
+        "¢": "cents",
         "¥": "yens",
         "kr": "kronor",
     },

Note: if cents is added, also extend _singular_spoken_unit in number_normalizer.py with if t == "cents": return "cent" so the plural fix actually matches.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@normalization/languages/swedish/operators.py` around lines 36 - 43, The currency mapping in currency_symbol_to_word in operators.py uses "¢": "cent" which breaks the pluralization convention; change that entry to "¢": "cents" so it matches the plural canonical form used for other currencies and the _currency_plural_fix_patterns logic, and also add a branch in number_normalizer.py's _singular_spoken_unit (e.g., if t == "cents": return "cent") so the plural-to-singular fix recognizes "cents".

feat: init swedish language

b60f4a3

Made-with: Cursor

Karamouche approved these changes May 5, 2026

Merge branch 'main' into feat/add-swedish-language

b785e77

coderabbitai Bot reviewed May 5, 2026

Karamouche merged commit bc509e4 into main May 5, 2026
10 checks passed

Karamouche deleted the feat/add-swedish-language branch May 5, 2026 16:21

coderabbitai Bot mentioned this pull request May 5, 2026

feat: init norwegian basic normalizer #24

Merged

31 tasks
