Harden control character filtering with C1 support#2002
Conversation
📝 WalkthroughSummaryThis commit extends control character filtering to include UTF-8 C1 controls as a hardening follow-up to security advisory GHSA-q64m-h5hc-c4qq. Two inline functions are added to
Implementation qualityThe code is cleanly isolated and logically straightforward. The dual-branch logic in Known limitationsAs raised in review, the byte-range check (0xC2 0x80–0x9F) is only valid for UTF-8. In non-UTF-8 single-byte encodings (ISO-8859-1, Windows-1252, etc.), these byte sequences may represent printable glyphs rather than control characters, leading to incorrect character replacement. A locale-aware approach using standard character classification functions ( The trade-off appears intentional for a targeted security hardening measure, but the encoding assumption constrains portability. WalkthroughThis PR extends UTF-8 control character handling in XUtils.h. A new inline helper Possibly related PRs
Suggested reviewers
Poem
Comment |
|
|
|
|
| const char* p = &data_c[j]; | ||
| CharType* q = &this->chptr[i]; |
There was a problem hiding this comment.
| const char* p = &data_c[j]; | |
| CharType* q = &this->chptr[i]; | |
| const char* p = data_c + j; | |
| CharType* q = this->chptr + i; |
|
@BenBE @natoscott
And then...
|
Hardening follow-up: extend control character filtering to also catch C1 controls (U+0080-U+009F), which are encoded as two-byte UTF-8 sequences starting with 0xC2. Related-to: GHSA-q64m-h5hc-c4qq Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
4920493 to
76b2e01
Compare
|
I agree that the RichString non-wide path shouldn't apply UTF-8 decoding, dropped that. The change now only extends String_stripControlChars to catch UTF-8 encoded C1 controls (0xC2 0x80-0x9F), protecting against terminal escape injection via those sequences. This still looks like the most straightforward, practical approach here to me. |
0xC2 0x80-0x9F is not terminal escape injection, is it? Now I don't understand the threat model, and this PR looks like "fixing" stuff where the problem was invalid in the first place. Even if we consider 0xC2 0x80-0x9F sequences are malicious, in this branch of the function the best approach would to sanitize all possible lead bytes of a multibyte sequence, since if we leak any multibyte sequence, it won't display properly in this no-multibyte-support build. |
Yes, C1 controls absolutely include terminal escape sequences. |
While I can see the Control Sequence Introducer possible to be represented by a byte valye of 0x9B. It should apply to only the ISO 2022 kind of encodings. For an encoding like Big5, 0x9B may be a lead byte of valid, private use character. |
Summary
Hardening follow-up to PR #2001 and GHSA-q64m-h5hc-c4qq:
Char_isC1Control()helper to detect UTF-8 encoded C1 control characters (U+0080-U+009F, encoded as 0xC2 0x80-0x9F)String_stripControlChars()to replace C1 two-byte sequences with??