feat: init norwegian basic normalizer #24

Merged
Karamouche merged 5 commits into main from feat/add-norwegian-language
May 5, 2026

Conversation

@egenthon-cmd
Contributor

@egenthon-cmd egenthon-cmd commented Apr 23, 2026

Made-with: Cursor

What does this PR do?

Adds Norwegian normalization (operators, replacements, number normalizer, registry wiring, unit and gladia-3 e2e tests).

Type of change

  • New language
  • Edit existing language (fix a replacement, tweak config, …)
  • New normalization step
  • Edit existing step (bug fix, behaviour change)
  • New preset version
  • Bug fix (other)
  • Refactor / docs / CI

Checklist

Only fill in the section(s) that match your change — delete the rest.


New language

  • Created normalization/languages/{lang}/ with operators.py, replacements.py, __init__.py
  • Word substitutions are in replacements.py (not hardcoded in operators.py)
  • LanguageConfig is filled in with the language's data (separators, currency words, digit words, …)
  • Subclassed LanguageOperators — only override methods where the logic changes, not just the data
  • Class is decorated with @register_language and imported in normalization/languages/__init__.py
  • Unit tests added in tests/unit/languages/
  • E2e CSV added in tests/e2e/files/{preset}/{lang}.csv (e.g. tests/e2e/files/gladia-3/fr.csv)

Edit existing language

  • New/changed word substitutions go in replacements.py, not inline in operators.py
  • If you changed a config field that can be None: the step reading it still handles None gracefully
  • Unit tests updated or added
  • E2e CSV updated if the expected output changed

New step

  • Unique name class attribute set (this is the key used in YAML presets)
  • Decorated with @register_step and imported in steps/text/__init__.py or steps/word/__init__.py
  • No hardcoded language values — read data from operators.config.* instead
  • If placeholder-based: protect + restore are both in steps/text/placeholders.py and pipeline/base.py's validate() is updated
  • Unit tests added in tests/unit/steps/
  • Step name added to the relevant preset YAML — or a new preset file created if existing presets are affected
  • If the docstring changed: ran uv run scripts/generate_step_docs.py

Edit existing step

  • Step name is unchanged — if the output changes, create a new step name + new preset instead
  • No language-specific logic or string literals added inside the step
  • Unit tests updated or added
  • If the docstring changed: ran uv run scripts/generate_step_docs.py

Preset change

  • Existing preset files are not modified — new behaviour goes in a new preset file
  • pipeline.validate() passes (runs automatically via loader.py)

How was this tested?

uv run pytest tests/

Summary by CodeRabbit

  • New Features

    • Added Norwegian language support with spelled-number expansion and currency normalization
  • Improvements

    • Roman numeral conversion now respects case sensitivity via configuration
    • All-caps token handling is now configurable
    • Multi-character currency symbols are handled more precisely based on digit proximity
    • Percent signs expand contextually only after numeric literals
  • Documentation

    • Enhanced documentation for text normalization steps with clarified configuration options and behaviors
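
The contextual percent rule listed above can be illustrated with a minimal sketch. It is not the project's RemoveSymbolsStep: the helper name `expand_percent`, the regex, and the choice to allow one optional space before `%` are all assumptions made for illustration.

```python
import re

# Hypothetical pattern (an assumption, not the PR's code): a numeric literal
# such as "8", "8.75" or "8,75", optionally followed by one space, then "%".
_PERCENT_AFTER_NUMBER = re.compile(r"(\d+(?:[.,]\d+)?)\s?%")


def expand_percent(text: str, percent_word: str) -> str:
    """Replace '<number>%' with '<number> <percent_word>'; leave any other '%' alone."""
    # A lambda is used so that percent_word is never interpreted as a regex template.
    return _PERCENT_AFTER_NUMBER.sub(lambda m: f"{m.group(1)} {percent_word}", text)


print(expand_percent("renten er 8.75%", "prosent"))  # renten er 8.75 prosent
print(expand_percent("tegnet % alene", "prosent"))   # tegnet % alene
```

In this sketch the word passed as `percent_word` would come from the language data (for example the `symbols_to_words["%"]` entry mentioned in the walkthrough), so the helper itself stays language-agnostic.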

@coderabbitai

coderabbitai Bot commented Apr 23, 2026

Warning

Rate limit exceeded

@Karamouche has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 56 minutes and 26 seconds before requesting another review.

To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 95fbbf0a-09ee-4bf4-874c-db47d4d973f9

📥 Commits

Reviewing files that changed from the base of the PR and between d07510c and a0204f6.

⛔ Files ignored due to path filters (2)
  • tests/e2e/files/gladia-3/fi.csv is excluded by !**/*.csv
  • tests/e2e/files/gladia-3/sv.csv is excluded by !**/*.csv
📒 Files selected for processing (1)
  • README.md
📝 Walkthrough

Walkthrough

Added Norwegian language support by implementing number normalization, word replacements, and language-specific operators. Extended configuration with two new boolean flags for Roman numeral casing and ALL-CAPS letter expansion, then modified existing text processing steps to respect these flags. Updated documentation and added comprehensive test coverage for both the Norwegian functionality and the modified step behaviors.

Changes

Norwegian Language Support & Step Enhancements

| Layer | File(s) | Summary |
|---|---|---|
| Configuration Data Shape | `normalization/languages/base/language_config.py` | Added `roman_numerals_uppercase_only` and `expand_all_caps_letter_by_letter` boolean flags to `LanguageConfig` to control casing-sensitive processing behavior. |
| Norwegian Language Implementation | `normalization/languages/norwegian/number_normalizer.py`, `normalization/languages/norwegian/operators.py`, `normalization/languages/norwegian/replacements.py`, `normalization/languages/norwegian/__init__.py` | Implemented `NorwegianNumberNormalizer` with complete parsing logic for Norwegian spelled-out numbers (0–999, scaled to billions), currency normalization with plural forms, and `NorwegianOperators` registered as language "no" with digit/number word mappings and word replacements dictionary. |
| Step Enhancements for Config Flags | `normalization/steps/text/convert_roman_numerals_to_digits.py`, `normalization/steps/text/expand_alphanumeric_codes.py`, `normalization/steps/text/remove_standalone_currency_symbols.py`, `normalization/steps/text/remove_symbols.py` | Updated steps to consult new config flags: Roman numeral step branches on `roman_numerals_uppercase_only`, alphanumeric step short-circuits pure ALL-CAPS tokens when flag is disabled, currency step distinguishes single- vs multi-character symbols with whole-token and digit-adjacency matching, percent symbol step conditionally expands only after numeric literals. |
| Language Registry Integration | `normalization/languages/__init__.py` | Added "norwegian" to module imports and `__all__` exports, enabling registration of Norwegian language support. |
| Tests & Documentation | `docs/steps.md`, `tests/unit/languages/norwegian_*_test.py`, `tests/unit/steps/text/*_test.py` | Added comprehensive test fixtures and parametrized tests for Norwegian number parsing, operators, and step behavior; updated step documentation to clarify new config flag behaviors and multi-character symbol matching rules. |
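
To make the spelled-number parsing mentioned in the summary more concrete, here is a small sketch limited to values below 100. It is not the PR's `NorwegianNumberNormalizer` (which is described as covering 0–999 scaled to billions); the word tables and the function `parse_under_100` are hypothetical and only illustrate the two word orders, the compound form ("tjuefem") and the older units-first form ("fem og tjue").

```python
from typing import Optional

# Hypothetical word tables for illustration only.
UNITS = {
    "null": 0, "en": 1, "ett": 1, "to": 2, "tre": 3, "fire": 4, "fem": 5,
    "seks": 6, "sju": 7, "syv": 7, "åtte": 8, "ni": 9,
}
TEENS = {
    "ti": 10, "elleve": 11, "tolv": 12, "tretten": 13, "fjorten": 14,
    "femten": 15, "seksten": 16, "sytten": 17, "atten": 18, "nitten": 19,
}
TENS = {
    "tjue": 20, "tretti": 30, "førti": 40, "femti": 50,
    "seksti": 60, "sytti": 70, "åtti": 80, "nitti": 90,
}


def parse_under_100(phrase: str) -> Optional[int]:
    """Return the value of a spelled-out number below 100, or None if not recognised."""
    words = phrase.lower().split()
    if len(words) == 1:
        word = words[0]
        for table in (UNITS, TEENS, TENS):
            if word in table:
                return table[word]
        # Compound form: a tens word directly followed by a non-zero unit ("tjuefem").
        for tens_word, tens_value in TENS.items():
            rest = word[len(tens_word):]
            if word.startswith(tens_word) and UNITS.get(rest, 0) > 0:
                return tens_value + UNITS[rest]
        return None
    # Units-first form: "<unit> og <tens>" ("fem og tjue").
    if len(words) == 3 and words[1] == "og":
        unit, tens = words[0], words[2]
        if UNITS.get(unit, 0) > 0 and tens in TENS:
            return TENS[tens] + UNITS[unit]
    return None


print(parse_under_100("tjuefem"))      # 25
print(parse_under_100("fem og tjue"))  # 25
```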

Sequence Diagram

sequenceDiagram
    actor User
    participant NormalizationPipeline
    participant RomanStep as ConvertRomanNumeralsStep
    participant AlphanumericStep as ExpandAlphanumericCodesStep
    participant CurrencyStep as RemoveStandaloneCurrencySymbolsStep
    participant PercentStep as RemoveSymbolsStep
    participant Config as LanguageConfig

    User->>NormalizationPipeline: text with Roman numerals,<br/>ALL-CAPS acronyms,<br/>currency symbols, %
    
    NormalizationPipeline->>RomanStep: text + operators
    RomanStep->>Config: check roman_numerals_uppercase_only
    alt Flag = True (Uppercase Only)
        RomanStep->>RomanStep: Match only ALL-CAPS "VI"<br/>(skip "vi", "Vi")
    else Flag = False (Case Insensitive)
        RomanStep->>RomanStep: Match "VI", "vi", "Vi"
    end
    RomanStep->>NormalizationPipeline: converted text

    NormalizationPipeline->>AlphanumericStep: text + operators
    AlphanumericStep->>Config: check expand_all_caps_letter_by_letter
    alt Flag = False (Preserve ALL-CAPS Letters)
        AlphanumericStep->>AlphanumericStep: Leave "SMS" unchanged<br/>but space "ABC123" → "A B C 1 2 3"
    else Flag = True (Expand All)
        AlphanumericStep->>AlphanumericStep: Space "SMS" → "S M S"
    end
    AlphanumericStep->>NormalizationPipeline: spaced text

    NormalizationPipeline->>CurrencyStep: text + operators
    CurrencyStep->>CurrencyStep: Remove multi-char symbols<br/>only as whole tokens (\b...\b)<br/>skip if digit adjacent
    CurrencyStep->>NormalizationPipeline: currency removed

    NormalizationPipeline->>PercentStep: text + operators
    PercentStep->>Config: check symbols_to_words["%"]
    alt % Follows Numeric Literal
        PercentStep->>PercentStep: Replace "8.75%" with<br/>"8.75 percent/prosent"
    else % Not After Number
        PercentStep->>PercentStep: Leave "%" unchanged
    end
    PercentStep->>NormalizationPipeline: normalized text

    NormalizationPipeline->>User: final normalized output
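
The two flag-controlled branches in the diagram above can be sketched as follows. This is an illustration under assumptions, not the project's step classes: `FlagsSketch`, `convert_roman`, `expand_codes`, and both regexes are hypothetical, and the Roman numeral conversion does no validity checking.

```python
import re
from dataclasses import dataclass


@dataclass
class FlagsSketch:
    """Hypothetical stand-in for the two new LanguageConfig flags."""
    roman_numerals_uppercase_only: bool
    expand_all_caps_letter_by_letter: bool


_ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(token: str) -> int:
    """Convert a Roman numeral using the usual subtractive rule (no validation)."""
    upper = token.upper()
    total = 0
    for current, following in zip(upper, upper[1:] + " "):
        value = _ROMAN_VALUES[current]
        # A smaller value before a larger one is subtracted (IV = 4).
        total += -value if _ROMAN_VALUES.get(following, 0) > value else value
    return total


def convert_roman(text: str, flags: FlagsSketch) -> str:
    """Convert Roman numeral tokens; match lowercase too only when the flag is off."""
    re_flags = 0 if flags.roman_numerals_uppercase_only else re.IGNORECASE
    pattern = re.compile(r"\b[IVXLCDM]+\b", re_flags)
    return pattern.sub(lambda m: str(roman_to_int(m.group(0))), text)


def expand_codes(text: str, flags: FlagsSketch) -> str:
    """Space out uppercase/digit codes; pure ALL-CAPS words depend on the flag."""
    def space_out(match: "re.Match[str]") -> str:
        token = match.group(0)
        pure_caps = token.isalpha() and token.isupper()
        if pure_caps and not flags.expand_all_caps_letter_by_letter:
            return token  # leave acronyms such as "SMS" untouched
        return " ".join(token)

    # Tokens of 2+ uppercase letters/digits that contain at least one letter.
    return re.sub(r"\b(?=[A-Z0-9]*[A-Z])[A-Z0-9]{2,}\b", space_out, text)


flags = FlagsSketch(roman_numerals_uppercase_only=True,
                    expand_all_caps_letter_by_letter=False)
print(convert_roman("vi leser kapittel VI", flags))  # vi leser kapittel 6
print(expand_codes("send SMS til ABC123", flags))    # send SMS til A B C 1 2 3
```

With `roman_numerals_uppercase_only=False`, the same input would also turn the pronoun "vi" into "6", which is the kind of false positive an uppercase-only mode avoids.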

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • gladiaio/normalization#19: Both add new language support and update the language registry with all exports and language-specific operators modules.
  • gladiaio/normalization#23: Both modify remove_standalone_currency_symbols.py to distinguish single- vs multi-character currency symbol handling with word-boundary and digit-adjacency matching.
  • gladiaio/normalization#15: Both expand the language registry in normalization/languages/__init__.py to add new language exports.

Suggested reviewers

  • Karamouche

🐰 A Norwegian hare hops with glee,
"Tall tales now told with accuracy!
From 'fem og tjue' blooms the digit five-and-twenty bright,
While ALL-CAPS acronyms dance upright—
No needless spacing steals their might!"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 8.20% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'feat: init norwegian basic normalizer' accurately and concisely describes the main change: initialization of a Norwegian language normalizer with core functionality. |
| Description check | ✅ Passed | The PR description follows the template structure, marks all relevant 'New language' checklist items as completed, and provides testing confirmation via 'uv run pytest tests/'. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai

coderabbitai Bot commented May 4, 2026

Caution

Failed to replace (edit) comment. This is likely due to insufficient permissions or the comment being deleted.

Error details
HttpError 502 ("Server Error") on PATCH https://api.github.com/repos/gladiaio/normalization/issues/comments/4304466396 (retryCount: 3, retryAfter: 16; response date: Mon, 04 May 2026 16:06:11 GMT; x-github-request-id: 4211:6B9BA:1BF4A20:6F65270:69F8C3E9). The request body was the auto-generated "review in progress" comment for run 05953dea-436e-4527-aa22-4b6db986864c.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@normalization/languages/norwegian/number_normalizer.py`:
- Around line 237-243: The code is prematurely consuming an optional "og" token
via _skip_optional_og before confirming a numeric tail, which can drop "og" when
the subsequent words are not numbers; change each branch (the block handling fw
== "tusen" and the other listed ranges) to first call self._parse_number on the
next token(s) without skipping "og", and only if that returns a numeric tail
(tail is not None) then call _skip_optional_og to consume "og" and parse the
tail (or call _skip_optional_og inside a path taken after successful parse), so
that _skip_optional_og is only applied when tail parsing succeeds (use the same
pattern for fw, _parse_number, and _skip_optional_og in the other branches).

In `@tests/unit/steps/text/remove_standalone_currency_symbols_test.py`:
- Around line 13-18: The test
test_multi_char_kr_does_not_match_letters_inside_words uses tokens ("punkt",
"euros") that don't exercise the boundary rule for the "kr" substring; update
the test to use a token that actually contains "kr" (for example "kroner" or
similar) so RemoveStandaloneCurrencySymbolsStep is exercised for multi-char "kr"
handling with NorwegianOperators; ensure the assertion still expects the
original word unchanged (e.g., assert step("kroner", ops) == "kroner").
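
The boundary rule that this test is meant to exercise can be sketched as below. This is not the project's RemoveStandaloneCurrencySymbolsStep: the helper `remove_standalone_multichar` is hypothetical, and treating "digit adjacent" as "a digit directly next to the symbol or separated from it by one space" is an assumption.

```python
import re


def remove_standalone_multichar(text: str, symbol: str) -> str:
    """Remove a multi-character currency symbol only when it is a whole token
    and no digit sits next to it (directly or across a single space)."""
    pattern = re.compile(
        rf"(?<!\d)(?<!\d )"         # no digit immediately before, with or without a space
        rf"\b{re.escape(symbol)}\b"  # the symbol as a whole token, not inside a word
        rf"(?! ?\d)"                # no digit immediately after, with or without a space
    )
    cleaned = pattern.sub("", text)
    # Collapse the double space left behind by a removal.
    return re.sub(r" {2,}", " ", cleaned).strip()


print(remove_standalone_multichar("kroner", "kr"))       # kroner
print(remove_standalone_multichar("100 kr", "kr"))       # 100 kr
print(remove_standalone_multichar("betal kr nå", "kr"))  # betal nå
```

A word such as "punkt" passes trivially because it contains no "kr" at all, which is why a token like "kroner" is the more meaningful assertion.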

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 05953dea-436e-4527-aa22-4b6db986864c

📥 Commits

Reviewing files that changed from the base of the PR and between 88b54a4 and 0a7332c.

⛔ Files ignored due to path filters (1)
  • tests/e2e/files/gladia-3/no.csv is excluded by !**/*.csv
📒 Files selected for processing (19)
  • docs/steps.md
  • normalization/languages/__init__.py
  • normalization/languages/base/language_config.py
  • normalization/languages/norwegian/__init__.py
  • normalization/languages/norwegian/number_normalizer.py
  • normalization/languages/norwegian/operators.py
  • normalization/languages/norwegian/replacements.py
  • normalization/steps/text/convert_roman_numerals_to_digits.py
  • normalization/steps/text/expand_alphanumeric_codes.py
  • normalization/steps/text/remove_standalone_currency_symbols.py
  • normalization/steps/text/remove_symbols.py
  • normalization/steps/text/replace_currency.py
  • tests/unit/languages/norwegian_number_normalizer_test.py
  • tests/unit/languages/norwegian_operators_test.py
  • tests/unit/steps/text/convert_roman_numerals_to_digits_test.py
  • tests/unit/steps/text/expand_alphanumeric_codes_test.py
  • tests/unit/steps/text/remove_standalone_currency_symbols_test.py
  • tests/unit/steps/text/remove_symbols_test.py
  • tests/unit/steps/text/replace_currency_kr_test.py

Comment on lines +237 to +243
        if fw == "tusen":
            j = _skip_optional_og(words, i + 1, n)
            tail = self._parse_number(words, j, n)
            if tail is not None:
                end, v2 = tail
                return end, 1000 + v2
            return j, 1000

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t consume optional og unless a numeric tail is actually parsed.

Right now og is skipped before validating that the tail is a number, and fallback returns can drop that token. This can silently change meaning (e.g., non-numeric continuations lose og).

Suggested fix pattern
-            j = _skip_optional_og(words, i + 1, n)
-            tail = self._parse_number(words, j, n)
+            j = i + 1
+            j_after_og = _skip_optional_og(words, j, n)
+            tail = self._parse_number(words, j_after_og, n)
             if tail is not None:
                 end, v2 = tail
                 return end, 1000 + v2
-            return j, 1000
+            return i + 1, 1000

Apply the same fallback rule to the other branches that currently call _skip_optional_og(...) before parsing tail: only consume og when tail parsing succeeds.

Also applies to: 245-253, 255-267, 269-285, 287-303, 313-324, 325-337
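
A self-contained sketch of the fallback rule described in this comment, assuming a much smaller parser than the real one: `SMALL`, `skip_optional_og`, `parse_small`, and `parse_tusen` are hypothetical names that only mirror the shape of the suggested fix. The point is that the returned end index moves past "og" only when a numeric tail was actually parsed.

```python
from typing import List, Optional, Tuple

# Tiny hypothetical number table, enough to demonstrate the rule.
SMALL = {"en": 1, "to": 2, "tre": 3, "fem": 5, "ti": 10}


def skip_optional_og(words: List[str], i: int) -> int:
    """Return the index after an optional 'og' at position i."""
    return i + 1 if i < len(words) and words[i] == "og" else i


def parse_small(words: List[str], i: int) -> Optional[Tuple[int, int]]:
    """Parse one small number word at i; return (end_index, value) or None."""
    if i < len(words) and words[i] in SMALL:
        return i + 1, SMALL[words[i]]
    return None


def parse_tusen(words: List[str], i: int) -> Optional[Tuple[int, int]]:
    """Parse 'tusen [og] <small number>' starting at i."""
    if i >= len(words) or words[i] != "tusen":
        return None
    # Look past an optional 'og', but do not commit to consuming it yet.
    tail = parse_small(words, skip_optional_og(words, i + 1))
    if tail is not None:
        end, value = tail
        return end, 1000 + value
    # No numeric tail: stop right after 'tusen' so that 'og' stays in the text.
    return i + 1, 1000


print(parse_tusen(["tusen", "og", "fem"], 0))  # (3, 1005)
print(parse_tusen(["tusen", "og", "mer"], 0))  # (1, 1000)
```

In the second call the end index is 1, so a caller that rebuilds the text from `words[end:]` still sees "og mer".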


Comment thread tests/unit/steps/text/remove_standalone_currency_symbols_test.py
@Karamouche Karamouche force-pushed the feat/add-norwegian-language branch from f92023f to 0a7332c on May 5, 2026 17:00
@Karamouche Karamouche merged commit 25a0c69 into main May 5, 2026
10 checks passed
@Karamouche Karamouche deleted the feat/add-norwegian-language branch May 5, 2026 17:25
2 participants