
feat: init swedish language basic normalization #23

Merged
Karamouche merged 2 commits into main from
feat/add-swedish-language
May 5, 2026

Conversation

@egenthon-cmd
Contributor

@egenthon-cmd egenthon-cmd commented Apr 23, 2026

Adds Swedish (sv) normalization: operators, replacements, number normalizer, registry wiring, unit tests, and gladia-3 e2e tests.

Type of change

  • New language
  • Edit existing language (fix a replacement, tweak config, …)
  • New normalization step
  • Edit existing step (bug fix, behaviour change)
  • New preset version
  • Bug fix (other)
  • Refactor / docs / CI

Checklist

Only fill in the section(s) that match your change — delete the rest.


New language

  • Created normalization/languages/{lang}/ with operators.py, replacements.py, __init__.py
  • Word substitutions are in replacements.py (not hardcoded in operators.py)
  • LanguageConfig is filled in with the language's data (separators, currency words, digit words, …)
  • Subclassed LanguageOperators — only override methods where the logic changes, not just the data
  • Class is decorated with @register_language and imported in normalization/languages/__init__.py
  • Unit tests added in tests/unit/languages/
  • E2e CSV added in tests/e2e/files/{preset}/{lang}.csv (e.g. tests/e2e/files/gladia-3/fr.csv)

Edit existing language

  • New/changed word substitutions go in replacements.py, not inline in operators.py
  • If you changed a config field that can be None: the step reading it still handles None gracefully
  • Unit tests updated or added
  • E2e CSV updated if the expected output changed

New step

  • Unique name class attribute set (this is the key used in YAML presets)
  • Decorated with @register_step and imported in steps/text/__init__.py or steps/word/__init__.py
  • No hardcoded language values — read data from operators.config.* instead
  • If placeholder-based: protect + restore are both in steps/text/placeholders.py and pipeline/base.py's validate() is updated
  • Unit tests added in tests/unit/steps/
  • Step name added to the relevant preset YAML — or a new preset file created if existing presets are affected
  • If the docstring changed: ran uv run scripts/generate_step_docs.py

Edit existing step

  • Step name is unchanged — if the output changes, create a new step name + new preset instead
  • No language-specific logic or string literals added inside the step
  • Unit tests updated or added
  • If the docstring changed: ran uv run scripts/generate_step_docs.py

Preset change

  • Existing preset files are not modified — new behaviour goes in a new preset file
  • pipeline.validate() passes (runs automatically via loader.py)

How was this tested?

uv run pytest tests/

Summary by CodeRabbit

  • New Features

    • Added Swedish language support with number normalization and spelling variant mappings.
  • Bug Fixes

    • Improved currency symbol handling to differentiate single-character and multi-character symbols correctly.
  • Documentation

    • Updated language support table and contributing guide.
    • Enhanced text processing steps documentation with implementation details.

Made-with: Cursor
@coderabbitai

coderabbitai Bot commented Apr 23, 2026

Walkthrough

This PR adds Swedish language support to a text normalization library by introducing a complete Swedish language package with number normalization, operators, and word replacements, while also improving multi-character currency symbol handling in existing text-processing steps to avoid partial matches.

Changes

Swedish Language Support

| Layer | File(s) | Summary |
|---|---|---|
| Module Setup | normalization/languages/__init__.py, normalization/languages/swedish/__init__.py | Swedish language module is registered and exported alongside other language packages, exposing SwedishOperators and SWEDISH_REPLACEMENTS. |
| Language Configuration & Replacements | normalization/languages/swedish/operators.py, normalization/languages/swedish/replacements.py | SWEDISH_CONFIG defines Swedish number formatting (decimal/thousand separators), currency mappings, and filler words. SWEDISH_REPLACEMENTS maps colloquial variants (mej → mig, dom → de, etc.) to canonical forms. |
| Number Normalization Engine | normalization/languages/swedish/number_normalizer.py | SwedishNumberNormalizer implements recursive parsing of Swedish spelled-out numbers (0–999, tusen, miljon, miljard, biljon), with optional currency symbol rewriting and plural currency suffix restoration. |
| Language Operators | normalization/languages/swedish/operators.py | SwedishOperators class, registered via @register_language, instantiates SwedishNumberNormalizer and provides expand_written_numbers() and get_word_replacements() methods. |
| Currency Symbol Handling Integration | normalization/steps/text/remove_standalone_currency_symbols.py, normalization/steps/text/replace_currency.py | Multi-character currency symbols (e.g., kr) are now matched with word boundaries to avoid partially stripping text like kronor; single-character symbols retain existing regex logic. |
| Tests | tests/unit/languages/swedish_number_normalizer_test.py, tests/unit/languages/swedish_operators_test.py, tests/unit/steps/text/remove_standalone_currency_symbols_test.py, tests/unit/steps/text/replace_currency_test.py | Comprehensive unit tests verify Swedish number parsing, operator registration, word replacement mappings, and currency handling correctness (e.g., 25 kronor remains unchanged while 25 kr expands to 25 kronor). |
| Documentation | README.md, docs/contributing-guide.md, docs/steps.md | Swedish (sv) is added to supported languages table; E2E test fixtures list includes sv.csv; documentation clarifies caching behavior and multi-character symbol word-boundary matching. |
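
The word-boundary rule described for multi-character currency symbols can be illustrated with a minimal sketch. The helper names `build_symbol_pattern` and `replace_symbol` are hypothetical and are not taken from the PR; the actual regex logic in replace_currency.py is not shown in this conversation, so treat this only as one plausible way to get the stated behaviour.

```python
import re


def build_symbol_pattern(symbol: str) -> re.Pattern[str]:
    """Hypothetical helper: compile a pattern for one currency symbol."""
    escaped = re.escape(symbol)
    if len(symbol) > 1:
        # Multi-character symbols such as "kr" are alphabetic, so anchor them
        # on word boundaries to avoid matching the "kr" inside "kronor".
        return re.compile(rf"\b{escaped}\b", re.IGNORECASE)
    # Single-character symbols such as "€" are matched wherever they occur.
    return re.compile(escaped)


def replace_symbol(text: str, symbol: str, word: str) -> str:
    """Replace every occurrence of `symbol` in `text` with `word`."""
    return build_symbol_pattern(symbol).sub(word, text)


print(replace_symbol("25 kr", "kr", "kronor"))      # 25 kronor
print(replace_symbol("25 kronor", "kr", "kronor"))  # 25 kronor (unchanged)
```

Without the `\b` anchors, the second call would rewrite the leading `kr` of `kronor` and produce `25 kronoronor`, which is the partial-match problem the walkthrough refers to.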

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • gladiaio/normalization#22: Both PRs add a new language package following the same pattern (operators, number_normalizer, replacements, tests) and update normalization/languages/__init__.py exports.
  • gladiaio/normalization#15: Both PRs add new language packages and register them in normalization/languages/__init__.py module exports.
  • gladiaio/normalization#19: Both PRs add a new language package following the same structural pattern and update normalization/languages/__init__.py to export the new language.

Suggested reviewers

  • Karamouche

🐰 A Swedish tongue joins the fold with grace,
Numbers parsed at their own measured pace,
From "tjugo fem" to digits they spring,
While "kronor" remains a most treasured thing.
Multi-char symbols know their rightful place—
Word boundaries guide them with Nordic embrace.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 17.50% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'feat: init swedish language basic normalization' clearly and specifically summarizes the main change: adding Swedish language support with basic normalization. |
| Description check | ✅ Passed | The PR description follows the template structure, includes the required 'New language' checklist with all items marked as complete, and provides the testing command. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (4)
normalization/languages/swedish/number_normalizer.py (2)

113-125: ⚖️ Poor tradeoff

Hardcoded singular mapping couples normalizer to specific currency words.

_singular_spoken_unit enumerates euros/dollars/pounds/kronor/yens directly. Any new entry added to currency_symbol_to_word (e.g. a new locale) silently falls back to using the trailing word as both singular and plural, defeating the plural-fix patterns built later. Consider either deriving singular forms from a small data table colocated with LanguageConfig or, if the canonical-plural strategy is project-wide, lifting this map into a shared module so it can be reused by other language normalizers.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@normalization/languages/swedish/number_normalizer.py` around lines 113 - 125,
_singular_spoken_unit currently hardcodes a few currency plurals so new entries
in currency_symbol_to_word won't get correct singular forms; replace the ad-hoc
mapping by deriving singulars from a small colocated data table (or lift the map
into the shared config) and look up trailing_word there: update
_singular_spoken_unit to consult the new singulars map (or shared module)
instead of hardcoding euros/dollars/pounds/kronor/yens, and ensure the table is
kept next to LanguageConfig or exported from the global canonical-plural utility
so other language normalizers can reuse it.
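
A data-driven version of the lookup suggested above might look like the following sketch. The table name `CURRENCY_PLURAL_TO_SINGULAR`, the function name, and the singular values are assumptions for illustration; the PR's actual `_singular_spoken_unit` body is not shown in this review.

```python
# Hypothetical data table. In the PR the mapping is described as hardcoded
# branches inside _singular_spoken_unit; the values here are assumed.
CURRENCY_PLURAL_TO_SINGULAR: dict[str, str] = {
    "euros": "euro",
    "dollars": "dollar",
    "pounds": "pound",
    "kronor": "krona",
    "yens": "yen",
}


def singular_spoken_unit(trailing_word: str) -> str:
    """Return the singular form, or the word itself when no entry exists."""
    return CURRENCY_PLURAL_TO_SINGULAR.get(trailing_word.lower(), trailing_word)


print(singular_spoken_unit("kronor"))  # krona
print(singular_spoken_unit("rubel"))   # rubel (no entry, falls back)
```

Keeping the table as plain data means a new currency needs one new dictionary entry rather than a new branch, and the fallback case stays visible in a single place.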

234-280: 💤 Low value

Inconsistent en/ett + multiplier coverage.

The fast-path branches at lines 234, 242, 250 and 266 handle en/ett tusen, en/ett miljon (singular only), and en/ett miljard(er) / en/ett biljon(er) (singular and plural). miljon does not get its plural form miljoner listed alongside it as the others do. While en miljoner is not idiomatic Swedish and rarely produced by STT, the asymmetry is easy to miss in maintenance. Consider unifying the multipliers in a single tuple-driven branch.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@normalization/languages/swedish/number_normalizer.py` around lines 234 - 280,
The branch handling "en"/"ett" + multiplier is inconsistent: add the missing
plural "miljoner" or unify all multiplier checks into a single tuple-driven
branch to avoid asymmetry; locate the blocks that check fw in ("en", "ett") and
_fold(words[i + 1]) == "miljon" (and the other blocks using
"tusen"/"miljard"/"biljon") inside the _parse_number logic and either include
"miljoner" alongside "miljon" or replace the repeated if-blocks with one that
looks up _fold(words[i + 1]) in a mapping/tuple of multipliers (e.g.,
{"tusen":1000, "miljon":1_000_000, "miljoner":1_000_000,
"miljard":1_000_000_000, "miljarder":1_000_000_000, "biljon":1_000_000_000_000,
"biljoner":1_000_000_000_000}), then call self._parse_number(words, i+2, n) and
return the combined value as currently done.
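
The single mapping-driven branch proposed above could be sketched as a standalone helper. This is an illustration only: `match_one_times_multiplier` is a hypothetical name, `.lower()` stands in for the PR's `_fold` (whose behaviour is not shown here), and the real code would still need to add the remainder returned by `_parse_number`.

```python
from typing import Optional

# Multiplier words and their values, as listed in the review comment.
BIG_MULTIPLIERS: dict[str, int] = {
    "tusen": 1_000,
    "miljon": 1_000_000,
    "miljoner": 1_000_000,
    "miljard": 1_000_000_000,
    "miljarder": 1_000_000_000,
    "biljon": 1_000_000_000_000,
    "biljoner": 1_000_000_000_000,
}


def match_one_times_multiplier(
    words: list[str], i: int
) -> Optional[tuple[int, int]]:
    """Return (value, next_index) if words[i:i+2] is 'en'/'ett' + multiplier."""
    if i + 1 >= len(words):
        return None
    if words[i].lower() not in ("en", "ett"):
        return None
    value = BIG_MULTIPLIERS.get(words[i + 1].lower())
    if value is None:
        return None
    return value, i + 2


print(match_one_times_multiplier(["en", "miljon"], 0))         # (1000000, 2)
print(match_one_times_multiplier(["ett", "tusen", "tre"], 0))  # (1000, 2)
print(match_one_times_multiplier(["en", "bil"], 0))            # None
```

With one lookup table, singular and plural forms are covered uniformly, so the `miljon`/`miljoner` asymmetry cannot reappear when a new multiplier is added.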
tests/unit/steps/text/remove_standalone_currency_symbols_test.py (1)

13-17: ⚡ Quick win

Add a standalone kr assertion to prevent no-op false positives.

Lines 13–17 verify boundary safety, but adding a companion case for standalone kr removal would ensure the core behavior still works (not just the “do not strip inside words” case).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/steps/text/remove_standalone_currency_symbols_test.py` around
lines 13 - 17, Add a new assertion to the test to ensure standalone "kr" is
handled correctly: update the test function
test_multi_char_kr_not_stripped_from_kronor (or add a new sibling test) to call
RemoveStandaloneCurrencySymbolsStep() with SwedishOperators() and assert that an
input like "25 kr" is transformed/removed according to expected behavior (e.g.,
becomes "25" or "25 " depending on trimming rules) so the suite covers both
multi-char "kronor" retention and standalone "kr" handling.
tests/unit/languages/swedish_operators_test.py (1)

4-4: ⚡ Quick win

Add a test that verifies "sv" becomes available via package import flow.

The direct import on line 4 triggers the @register_language decorator at import time, making the registry tests pass. However, this bypasses verification that the package-level wiring in normalization/languages/__init__.py actually exercises the registration. Consider adding a separate test that imports via from normalization.languages import swedish (or similar) to ensure the __init__.py import chain properly registers Swedish operators.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/languages/swedish_operators_test.py` at line 4, Add a unit test
that verifies package-level import triggers the `@register_language` registration
for Swedish: instead of directly importing SwedishOperators, import the package
module (e.g., "from normalization.languages import swedish" or "import
normalization.languages; import normalization.languages.swedish") and then
assert the registry contains the "sv" entry or that swedish.SwedishOperators is
available; this ensures the __init__.py import chain exercises the
register_language decorator rather than relying on a direct import of
SwedishOperators.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c1ea7130-8c85-44f0-a1df-7adf62e56f29

📥 Commits

Reviewing files that changed from the base of the PR and between 0e06d06 and b785e77.

⛔ Files ignored due to path filters (1)
  • tests/e2e/files/gladia-3/sv.csv is excluded by !**/*.csv
📒 Files selected for processing (14)
  • README.md
  • docs/contributing-guide.md
  • docs/steps.md
  • normalization/languages/__init__.py
  • normalization/languages/swedish/__init__.py
  • normalization/languages/swedish/number_normalizer.py
  • normalization/languages/swedish/operators.py
  • normalization/languages/swedish/replacements.py
  • normalization/steps/text/remove_standalone_currency_symbols.py
  • normalization/steps/text/replace_currency.py
  • tests/unit/languages/swedish_number_normalizer_test.py
  • tests/unit/languages/swedish_operators_test.py
  • tests/unit/steps/text/remove_standalone_currency_symbols_test.py
  • tests/unit/steps/text/replace_currency_test.py

Comment on lines +82 to +110
_RE_MIXED_NUMBER = re.compile(
    r"\b(\d+)\s+("
    r"miljon|miljoner|miljard|miljarder|biljon|biljoner|tusen"
    r")\b",
    re.IGNORECASE,
)

_BIG_MULT: dict[str, int] = {
    "tusen": 1000,
    "miljon": 1_000_000,
    "miljoner": 1_000_000,
    "miljard": 1_000_000_000,
    "miljarder": 1_000_000_000,
    "biljon": 1_000_000_000_000,
    "biljoner": 1_000_000_000_000,
}


def _normalize_mixed_numbers(text: str) -> str:
    """Convert ``3 miljard`` → ``tre miljard`` so the word parser yields 3e9."""

    def replace(match: re.Match[str]) -> str:
        number = match.group(1)
        multiplier = match.group(2)
        if len(number) == 1 and number in _DIGIT_TO_SWEDISH:
            return f"{_DIGIT_TO_SWEDISH[number]} {multiplier}"
        return match.group(0)

    return _RE_MIXED_NUMBER.sub(replace, text)

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Multi-digit mixed numbers are silently left unconverted.

_RE_MIXED_NUMBER matches \d+ (any digit count), but replace() only rewrites single-digit cases (len(number) == 1). As a result, "20 miljoner" stays unchanged and never gets composed by the spelled-out parser, so it is not converted to 20000000. Either narrow the regex to a single digit (so the intent is explicit) or extend the rewrite to handle multi-digit values.

♻️ Option A: narrow regex (current behavior, made explicit)
 _RE_MIXED_NUMBER = re.compile(
-    r"\b(\d+)\s+("
+    r"\b(\d)\s+("
     r"miljon|miljoner|miljard|miljarder|biljon|biljoner|tusen"
     r")\b",
     re.IGNORECASE,
 )
♻️ Option B: handle multi-digit numbers by multiplying directly
 def _normalize_mixed_numbers(text: str) -> str:
-    """Convert ``3 miljard`` → ``tre miljard`` so the word parser yields 3e9."""
+    """Convert ``3 miljard`` → ``tre miljard`` (single digit) or ``20 miljoner`` → digits."""

     def replace(match: re.Match[str]) -> str:
         number = match.group(1)
         multiplier = match.group(2)
         if len(number) == 1 and number in _DIGIT_TO_SWEDISH:
             return f"{_DIGIT_TO_SWEDISH[number]} {multiplier}"
-        return match.group(0)
+        mult = _BIG_MULT.get(multiplier.lower())
+        if mult is None:
+            return match.group(0)
+        return str(int(number) * mult)

     return _RE_MIXED_NUMBER.sub(replace, text)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@normalization/languages/swedish/number_normalizer.py` around lines 82 - 110,
The regex _RE_MIXED_NUMBER currently matches multi-digit numerals but replace()
only handles single digits (checking len(number) == 1), so strings like "20
miljoner" are left unchanged; update _normalize_mixed_numbers (replace) to
handle multi-digit numbers by parsing number = int(match.group(1)) and, if
multiplier = match.group(2) is in _BIG_MULT, compute value = number *
_BIG_MULT[multiplier] and return str(value) (otherwise fall back to the existing
single-digit logic that uses _DIGIT_TO_SWEDISH), or alternatively make
_RE_MIXED_NUMBER only match a single digit if you prefer the original
single-digit-only behavior.
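
For reference, Option B can be assembled into a self-contained sketch that runs on its own. The contents of `_DIGIT_TO_SWEDISH` are assumed here (the PR's actual table is not shown), and the function is renamed without the leading underscore to make clear it is an illustration rather than the merged code.

```python
import re

_RE_MIXED_NUMBER = re.compile(
    r"\b(\d+)\s+("
    r"miljon|miljoner|miljard|miljarder|biljon|biljoner|tusen"
    r")\b",
    re.IGNORECASE,
)

_BIG_MULT: dict[str, int] = {
    "tusen": 1000,
    "miljon": 1_000_000,
    "miljoner": 1_000_000,
    "miljard": 1_000_000_000,
    "miljarder": 1_000_000_000,
    "biljon": 1_000_000_000_000,
    "biljoner": 1_000_000_000_000,
}

# Assumed contents; the real table in number_normalizer.py may differ.
_DIGIT_TO_SWEDISH: dict[str, str] = {
    "2": "två", "3": "tre", "4": "fyra", "5": "fem",
    "6": "sex", "7": "sju", "8": "åtta", "9": "nio",
}


def normalize_mixed_numbers(text: str) -> str:
    """Spell out single digits; multiply multi-digit numbers directly."""

    def replace(match: re.Match[str]) -> str:
        number, multiplier = match.group(1), match.group(2)
        if len(number) == 1 and number in _DIGIT_TO_SWEDISH:
            return f"{_DIGIT_TO_SWEDISH[number]} {multiplier}"
        # The regex is case-insensitive, so fold before the dictionary lookup.
        factor = _BIG_MULT.get(multiplier.lower())
        if factor is None:
            return match.group(0)
        return str(int(number) * factor)

    return _RE_MIXED_NUMBER.sub(replace, text)


print(normalize_mixed_numbers("3 miljard"))    # tre miljard
print(normalize_mixed_numbers("20 miljoner"))  # 20000000
```

The `.lower()` call matters: because the pattern uses `re.IGNORECASE`, an input such as `20 Miljoner` would otherwise miss the dictionary and fall through unchanged.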

Comment on lines +36 to +43
currency_symbol_to_word={
    "€": "euros",
    "$": "dollars",
    "£": "pounds",
    "¢": "cent",
    "¥": "yens",
    "kr": "kronor",
},

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

"¢": "cent" is inconsistent with the plural canonical used elsewhere.

All other entries map to the plural form (euros, dollars, pounds, yens, kronor), and the number normalizer's plural-fix logic relies on that convention: in _currency_plural_fix_patterns, the entry is skipped when singular.lower() == trailing.lower() (which is the case for cent). Net effect: 5 ¢ becomes 5 cent and 5 euro becomes 5 euros — different singular/plural canonicalization across currencies, which will hurt WER consistency. Likely intended to be "cents".

🩹 Suggested fix
     currency_symbol_to_word={
         "€": "euros",
         "$": "dollars",
         "£": "pounds",
-        "¢": "cent",
+        "¢": "cents",
         "¥": "yens",
         "kr": "kronor",
     },

Note: if cents is added, also extend _singular_spoken_unit in number_normalizer.py with if t == "cents": return "cent" so the plural fix actually matches.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@normalization/languages/swedish/operators.py` around lines 36 - 43, The
currency mapping in currency_symbol_to_word in operators.py uses "¢": "cent"
which breaks the pluralization convention; change that entry to "¢": "cents" so
it matches the plural canonical form used for other currencies and the
_currency_plural_fix_patterns logic, and also add a branch in
number_normalizer.py's _singular_spoken_unit (e.g., if t == "cents": return
"cent") so the plural-to-singular fix recognizes "cents".

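The skip condition mentioned in the comment above can be made concrete with a small sketch. Everything here is a stand-in: the shape of the patterns, the helper names, and the singular table are assumptions, since `_currency_plural_fix_patterns` itself is not shown in this review. Only the `singular.lower() == trailing.lower()` check is taken from the comment.

```python
import re


def singular_of(trailing: str) -> str:
    """Stand-in for _singular_spoken_unit, without a "cents" branch."""
    return {"euros": "euro", "dollars": "dollar"}.get(trailing, trailing)


def plural_fix_patterns(
    symbol_to_word: dict[str, str],
) -> list[tuple[re.Pattern[str], str]]:
    """Build (pattern, replacement) pairs rewriting '<n> singular' to plural."""
    patterns = []
    for trailing in symbol_to_word.values():
        singular = singular_of(trailing)
        if singular.lower() == trailing.lower():
            # Singular and canonical form coincide, so no fix is generated.
            continue
        pattern = re.compile(rf"\b(\d+) {re.escape(singular)}\b")
        patterns.append((pattern, rf"\1 {trailing}"))
    return patterns


def apply_fixes(text: str, patterns: list[tuple[re.Pattern[str], str]]) -> str:
    """Apply every plural-fix pattern in order."""
    for pattern, replacement in patterns:
        text = pattern.sub(replacement, text)
    return text


patterns = plural_fix_patterns({"€": "euros", "¢": "cent"})
print(len(patterns))                     # 1 (the "cent" entry was skipped)
print(apply_fixes("5 euro", patterns))   # 5 euros
print(apply_fixes("5 cent", patterns))   # 5 cent
```

Under these assumptions, mapping `¢` to `cent` produces no fix pattern at all, which is why the two currencies end up canonicalized differently.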
@Karamouche Karamouche merged commit bc509e4 into main May 5, 2026
10 checks passed
@Karamouche Karamouche deleted the feat/add-swedish-language branch May 5, 2026 16:21
@coderabbitai coderabbitai Bot mentioned this pull request May 5, 2026
31 tasks

2 participants