feat: init norwegian basic normalizer#24
Conversation
Made-with: Cursor
|
Warning Rate limit exceeded
To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: ⛔ Files ignored due to path filters (2)
📒 Files selected for processing (1)
📝 WalkthroughWalkthroughAdded Norwegian language support by implementing number normalization, word replacements, and language-specific operators. Extended configuration with two new boolean flags for Roman numeral casing and ALL-CAPS letter expansion, then modified existing text processing steps to respect these flags. Updated documentation and comprehensive test coverage for both Norwegian functionality and modified step behaviors. ChangesNorwegian Language Support & Step Enhancements
Sequence DiagramsequenceDiagram
actor User
participant NormalizationPipeline
participant RomanStep as ConvertRomanNumeralsStep
participant AlphanumericStep as ExpandAlphanumericCodesStep
participant CurrencyStep as RemoveStandaloneCurrencySymbolsStep
participant PercentStep as RemoveSymbolsStep
participant Config as LanguageConfig
User->>NormalizationPipeline: text with Roman numerals,<br/>ALL-CAPS acronyms,<br/>currency symbols, %
NormalizationPipeline->>RomanStep: text + operators
RomanStep->>Config: check roman_numerals_uppercase_only
alt Flag = True (Uppercase Only)
RomanStep->>RomanStep: Match only ALL-CAPS "VI"<br/>(skip "vi", "Vi")
else Flag = False (Case Insensitive)
RomanStep->>RomanStep: Match "VI", "vi", "Vi"
end
RomanStep->>NormalizationPipeline: converted text
NormalizationPipeline->>AlphanumericStep: text + operators
AlphanumericStep->>Config: check expand_all_caps_letter_by_letter
alt Flag = False (Preserve ALL-CAPS Letters)
AlphanumericStep->>AlphanumericStep: Leave "SMS" unchanged<br/>but space "ABC123" → "A B C 1 2 3"
else Flag = True (Expand All)
AlphanumericStep->>AlphanumericStep: Space "SMS" → "S M S"
end
AlphanumericStep->>NormalizationPipeline: spaced text
NormalizationPipeline->>CurrencyStep: text + operators
CurrencyStep->>CurrencyStep: Remove multi-char symbols<br/>only as whole tokens (\b...\b)<br/>skip if digit adjacent
CurrencyStep->>NormalizationPipeline: currency removed
NormalizationPipeline->>PercentStep: text + operators
PercentStep->>Config: check symbols_to_words["%"]
alt % Follows Numeric Literal
PercentStep->>PercentStep: Replace "8.75%" with<br/>"8.75 percent/prosent"
else % Not After Number
PercentStep->>PercentStep: Leave "%" unchanged
end
PercentStep->>NormalizationPipeline: normalized text
NormalizationPipeline->>User: final normalized output
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Caution Failed to replace (edit) comment. This is likely due to insufficient permissions or the comment being deleted. Error details |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@normalization/languages/norwegian/number_normalizer.py`:
- Around line 237-243: The code is prematurely consuming an optional "og" token
via _skip_optional_og before confirming a numeric tail, which can drop "og" when
the subsequent words are not numbers; change each branch (the block handling fw
== "tusen" and the other listed ranges) to first call self._parse_number on the
next token(s) without skipping "og", and only if that returns a numeric tail
(tail is not None) then call _skip_optional_og to consume "og" and parse the
tail (or call _skip_optional_og inside a path taken after successful parse), so
that _skip_optional_og is only applied when tail parsing succeeds (use the same
pattern for fw, _parse_number, and _skip_optional_og in the other branches).
In `@tests/unit/steps/text/remove_standalone_currency_symbols_test.py`:
- Around line 13-18: The test
test_multi_char_kr_does_not_match_letters_inside_words uses tokens ("punkt",
"euros") that don't exercise the boundary rule for the "kr" substring; update
the test to use a token that actually contains "kr" (for example "kroner" or
similar) so RemoveStandaloneCurrencySymbolsStep is exercised for multi-char "kr"
handling with NorwegianOperators; ensure the assertion still expects the
original word unchanged (e.g., assert step("kroner", ops) == "kroner").
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 05953dea-436e-4527-aa22-4b6db986864c
⛔ Files ignored due to path filters (1)
tests/e2e/files/gladia-3/no.csvis excluded by!**/*.csv
📒 Files selected for processing (19)
docs/steps.mdnormalization/languages/__init__.pynormalization/languages/base/language_config.pynormalization/languages/norwegian/__init__.pynormalization/languages/norwegian/number_normalizer.pynormalization/languages/norwegian/operators.pynormalization/languages/norwegian/replacements.pynormalization/steps/text/convert_roman_numerals_to_digits.pynormalization/steps/text/expand_alphanumeric_codes.pynormalization/steps/text/remove_standalone_currency_symbols.pynormalization/steps/text/remove_symbols.pynormalization/steps/text/replace_currency.pytests/unit/languages/norwegian_number_normalizer_test.pytests/unit/languages/norwegian_operators_test.pytests/unit/steps/text/convert_roman_numerals_to_digits_test.pytests/unit/steps/text/expand_alphanumeric_codes_test.pytests/unit/steps/text/remove_standalone_currency_symbols_test.pytests/unit/steps/text/remove_symbols_test.pytests/unit/steps/text/replace_currency_kr_test.py
| if fw == "tusen": | ||
| j = _skip_optional_og(words, i + 1, n) | ||
| tail = self._parse_number(words, j, n) | ||
| if tail is not None: | ||
| end, v2 = tail | ||
| return end, 1000 + v2 | ||
| return j, 1000 |
There was a problem hiding this comment.
Don’t consume optional og unless a numeric tail is actually parsed.
Right now og is skipped before validating that the tail is a number, and fallback returns can drop that token. This can silently change meaning (e.g., non-numeric continuations lose og).
Suggested fix pattern
- j = _skip_optional_og(words, i + 1, n)
- tail = self._parse_number(words, j, n)
+ j = i + 1
+ j_after_og = _skip_optional_og(words, j, n)
+ tail = self._parse_number(words, j_after_og, n)
if tail is not None:
end, v2 = tail
return end, 1000 + v2
- return j, 1000
+ return i + 1, 1000Apply the same fallback rule to the other branches that currently call _skip_optional_og(...) before parsing tail: only consume og when tail parsing succeeds.
Also applies to: 245-253, 255-267, 269-285, 287-303, 313-324, 325-337
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@normalization/languages/norwegian/number_normalizer.py` around lines 237 -
243, The code is prematurely consuming an optional "og" token via
_skip_optional_og before confirming a numeric tail, which can drop "og" when the
subsequent words are not numbers; change each branch (the block handling fw ==
"tusen" and the other listed ranges) to first call self._parse_number on the
next token(s) without skipping "og", and only if that returns a numeric tail
(tail is not None) then call _skip_optional_og to consume "og" and parse the
tail (or call _skip_optional_og inside a path taken after successful parse), so
that _skip_optional_og is only applied when tail parsing succeeds (use the same
pattern for fw, _parse_number, and _skip_optional_og in the other branches).
f92023f to
0a7332c
Compare
Made-with: Cursor
What does this PR do?
dds norvegian normalization (operators, replacements, number normalizer, registry wiring, unit and gladia-3 e2e tests
Type of change
Checklist
Only fill in the section(s) that match your change — delete the rest.
New language
normalization/languages/{lang}/withoperators.py,replacements.py,__init__.pyreplacements.py(not hardcoded inoperators.py)LanguageConfigis filled in with the language's data (separators, currency words, digit words, …)LanguageOperators— only override methods where the logic changes, not just the data@register_languageand imported innormalization/languages/__init__.pytests/unit/languages/tests/e2e/files/{preset}/{lang}.csv(e.g.tests/e2e/files/gladia-3/fr.csv)Edit existing language
replacements.py, not inline inoperators.pyNone: the step reading it still handlesNonegracefullyNew step
nameclass attribute set (this is the key used in YAML presets)@register_stepand imported insteps/text/__init__.pyorsteps/word/__init__.pyoperators.config.*insteadsteps/text/placeholders.pyandpipeline/base.py'svalidate()is updatedtests/unit/steps/uv run scripts/generate_step_docs.pyEdit existing step
nameis unchanged — if the output changes, create a new step name + new preset insteaduv run scripts/generate_step_docs.pyPreset change
pipeline.validate()passes (runs automatically vialoader.py)How was this tested?
Summary by CodeRabbit
New Features
Improvements
Documentation