Skip to content

Harden control character filtering with C1 support#2002

Open
natoscott wants to merge 1 commit into
htop-dev:mainfrom
natoscott:security/c1-control-hardening
Open

Harden control character filtering with C1 support#2002
natoscott wants to merge 1 commit into
htop-dev:mainfrom
natoscott:security/c1-control-hardening

Conversation

@natoscott
Copy link
Copy Markdown
Member

Summary

Hardening follow-up to PR #2001 and GHSA-q64m-h5hc-c4qq:

  • Add Char_isC1Control() helper to detect UTF-8 encoded C1 control characters (U+0080-U+009F, encoded as 0xC2 0x80-0x9F)
  • Extend String_stripControlChars() to replace C1 two-byte sequences with ??
  • Apply C1 filtering in the non-wide RichString rendering path

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 18, 2026

Review Change Stack

📝 Walkthrough

Summary

This commit extends control character filtering to include UTF-8 C1 controls as a hardening follow-up to security advisory GHSA-q64m-h5hc-c4qq. Two inline functions are added to XUtils.h:

  1. New helper: Char_isC1Control(char c, char next) detects two-byte UTF-8 sequences where the first byte is 0xC2 and the second is 0x80–0x9F, representing Unicode C1 controls (U+0080–U+009F)
  2. Extended filtering: String_stripControlChars() now replaces both C0/DEL single-byte controls and C1 two-byte sequences with '?', advancing the string pointer by 2 when a C1 sequence is detected

Implementation quality

The code is cleanly isolated and logically straightforward. The dual-branch logic in String_stripControlChars() correctly handles single-byte and two-byte cases without affecting the existing C0/DEL filtering path. However, the implementation exhibits a notable limitation: it assumes UTF-8 encoding without verifying the active locale or character encoding.

Known limitations

As raised in review, the byte-range check (0xC2 0x80–0x9F) is only valid for UTF-8. In non-UTF-8 single-byte encodings (ISO-8859-1, Windows-1252, etc.), these byte sequences may represent printable glyphs rather than control characters, leading to incorrect character replacement. A locale-aware approach using standard character classification functions (isprint()/iswprint()) would be more robust across different encoding environments.

The trade-off appears intentional for a targeted security hardening measure, but the encoding assumption constrains portability.

Walkthrough

This PR extends UTF-8 control character handling in XUtils.h. A new inline helper Char_isC1Control(char c, char next) detects 2-byte C1 control sequences (0xC2 followed by 0x800x9F). The String_stripControlChars() function now uses this helper to replace both bytes of C1 sequences with ? and advances the loop to skip the second byte, complementing its existing single-byte C0/DEL replacement logic. The function's documentation comment is updated to reflect C0/DEL/C1 coverage.

Possibly related PRs

  • htop-dev/htop#2001: Directly related control-character sanitization work in the same file and function.

Suggested reviewers

  • fasterit
  • BenBE

Poem

Safe bytes stripped and tamed,
Two-byte sequences renamed,
Garbled text made clean—
Control characters unseen,
XUtils keeps the strings pristine.


Comment @coderabbitai help to get the list of available commands and usage tips.

@Explorer09
Copy link
Copy Markdown
Contributor

Explorer09 commented May 18, 2026

I'm not satisfied with this one. First it assumes a UTF-8 encoding without a check. Second there should be more control characters that we should filter, and that we should filter them with a whitelist approach.

The whitelisted set would be the set defined in iswprint(3).

@Explorer09
Copy link
Copy Markdown
Contributor

Explorer09 commented May 18, 2026

Wait. I want to carefully analyse the context of when this RichString_writeFromWide is called.

This variant RichString_writeFromWide function seems to apply when HAVE_LIBNCURSESW is not defined. For this particular case, it may be assumed that no multibyte string will be supported (right?). If so, then the 0xC2 0x80 to 0xC2 0x9F range check might not be applicable.

Copy link
Copy Markdown
Member

@BenBE BenBE left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed on IRC …

Comment thread RichString.c Outdated
Comment on lines +200 to +201
const char* p = &data_c[j];
CharType* q = &this->chptr[i];
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
const char* p = &data_c[j];
CharType* q = &this->chptr[i];
const char* p = data_c + j;
CharType* q = this->chptr + i;

Comment thread RichString.c Outdated
Comment thread XUtils.h Outdated
@Explorer09
Copy link
Copy Markdown
Contributor

@BenBE @natoscott
I think we're trying to fix the wrong thing. Perhaps a little explanation can help you understand.

  1. The RichString_writeFromWide function being patched is a variant when HAVE_LIBNCURSESW is not defined. In other words this happens in the build where the wide characters in ncurses is not supported.
  2. From ncurses man page:

ncursesw
is the library in its "wide" configuration, which handles
character encodings requiring a larger data type than char
(a byte-sized type) can represent.
ncurses
is the library in its "non-wide" configuration, handling
only eight-bit characters.

And then...

Application Structure
A curses application uses information from the system locale;
setlocale(3) prepares it for curses library calls.
setlocale(LC_ALL, "");
If the locale is not thus initialized, the library assumes that
characters are printable as in ISO 8859-1, to work with certain
legacy programs. You should initialize the locale; do not expect
consistent behavior from the library when the locale has not been
set up.

  1. Thus, the function we are editing is assumed to not support any multibyte encoding at all (UTF-8 is a multibyte encoding). When we're not in UTF-8 locale, the sequence of bytes 0xC2 0x80 to 0xC2 0x9F should not be interpreted as any C1 control character (or else it's a bug that's outside htop).
  2. There are many single-byte encodings in the world that ncurses can support. While we can tell that the byte range 0x80..0x9F is the C1 control character range in ISO 8859-1, it won't be the case for an encoding such as Windows-1252 (Windows-1252 takes some byte value in this range and use them as graphic characters).
  3. (In ctype.h, although iscntrl() and isprint() are mutually exclusive sets of characters, it is possible for a byte value to be neither control or printable character. I didn't have an example for a single byte character set, but I know Unicode noncharacters are in neither iswcntrl() or iswprint().)
  4. So, combined with the reasons above, we should not use the simple 0x80..0x9F range check for control characters either. The best we should do instead is really the !isprint() function.

Hardening follow-up: extend control character filtering to also
catch C1 controls (U+0080-U+009F), which are encoded as two-byte
UTF-8 sequences starting with 0xC2.

Related-to: GHSA-q64m-h5hc-c4qq
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@natoscott natoscott force-pushed the security/c1-control-hardening branch from 4920493 to 76b2e01 Compare May 19, 2026 07:12
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 825c5f5f-fa1c-445b-b369-12753df44f84

📥 Commits

Reviewing files that changed from the base of the PR and between 4920493 and 76b2e01.

📒 Files selected for processing (1)
  • XUtils.h

Comment thread XUtils.h
@natoscott
Copy link
Copy Markdown
Member Author

I agree that the RichString non-wide path shouldn't apply UTF-8 decoding, dropped that.

The change now only extends String_stripControlChars to catch UTF-8 encoded C1 controls (0xC2 0x80-0x9F), protecting against terminal escape injection via those sequences.

This still looks like the most straightforward, practical approach here to me.

@Explorer09
Copy link
Copy Markdown
Contributor

Explorer09 commented May 19, 2026

I agree that the RichString non-wide path shouldn't apply UTF-8 decoding, dropped that.

The change now only extends String_stripControlChars to catch UTF-8 encoded C1 controls (0xC2 0x80-0x9F), protecting against terminal escape injection via those sequences.

This still looks like the most straightforward, practical approach here to me.

0xC2 0x80-0x9F is not terminal escape injection, is it? Now I don't understand the threat model, and this PR looks like "fixing" stuff where the problem was invalid in the first place.

Even if we consider 0xC2 0x80-0x9F sequences are malicious, in this branch of the function the best approach would to sanitize all possible lead bytes of a multibyte sequence, since if we leak any multibyte sequence, it won't display properly in this no-multibyte-support build.

@natoscott
Copy link
Copy Markdown
Member Author

0xC2 0x80-0x9F is not terminal escape injection, is it? Now I don't understand [...]

Yes, C1 controls absolutely include terminal escape sequences.

@Explorer09
Copy link
Copy Markdown
Contributor

0xC2 0x80-0x9F is not terminal escape injection, is it? Now I don't understand [...]

Yes, C1 controls absolutely include terminal escape sequences.

While I can see the Control Sequence Introducer possible to be represented by a byte valye of 0x9B. It should apply to only the ISO 2022 kind of encodings.

For an encoding like Big5, 0x9B may be a lead byte of valid, private use character.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants