Harden control character filtering with C1 support by natoscott · Pull Request #2002 · htop-dev/htop

natoscott · 2026-05-18T06:06:18Z

Summary

Hardening follow-up to PR #2001 and GHSA-q64m-h5hc-c4qq:

Add Char_isC1Control() helper to detect UTF-8 encoded C1 control characters (U+0080-U+009F, encoded as 0xC2 0x80-0x9F)
Extend String_stripControlChars() to replace C1 two-byte sequences with ??
Apply C1 filtering in the non-wide RichString rendering path

coderabbitai · 2026-05-18T06:06:29Z

📝 Walkthrough

Summary

This commit extends control character filtering to include UTF-8 C1 controls as a hardening follow-up to security advisory GHSA-q64m-h5hc-c4qq. Two inline functions are added to XUtils.h:

New helper: Char_isC1Control(char c, char next) detects two-byte UTF-8 sequences where the first byte is 0xC2 and the second is 0x80–0x9F, representing Unicode C1 controls (U+0080–U+009F)
Extended filtering: String_stripControlChars() now replaces both C0/DEL single-byte controls and C1 two-byte sequences with '?', advancing the string pointer by 2 when a C1 sequence is detected

Implementation quality

The code is cleanly isolated and logically straightforward. The dual-branch logic in String_stripControlChars() correctly handles single-byte and two-byte cases without affecting the existing C0/DEL filtering path. However, the implementation exhibits a notable limitation: it assumes UTF-8 encoding without verifying the active locale or character encoding.

Known limitations

As raised in review, the byte-range check (0xC2 0x80–0x9F) is only valid for UTF-8. In non-UTF-8 single-byte encodings (ISO-8859-1, Windows-1252, etc.), these byte sequences may represent printable glyphs rather than control characters, leading to incorrect character replacement. A locale-aware approach using standard character classification functions (isprint()/iswprint()) would be more robust across different encoding environments.

The trade-off appears intentional for a targeted security hardening measure, but the encoding assumption constrains portability.

Walkthrough

This PR extends UTF-8 control character handling in XUtils.h. A new inline helper Char_isC1Control(char c, char next) detects 2-byte C1 control sequences (0xC2 followed by 0x80–0x9F). The String_stripControlChars() function now uses this helper to replace both bytes of C1 sequences with ? and advances the loop to skip the second byte, complementing its existing single-byte C0/DEL replacement logic. The function's documentation comment is updated to reflect C0/DEL/C1 coverage.

Possibly related PRs

htop-dev/htop#2001: Directly related control-character sanitization work in the same file and function.

Suggested reviewers

fasterit
BenBE

Poem

Safe bytes stripped and tamed,
Two-byte sequences renamed,
Garbled text made clean—
Control characters unseen,
XUtils keeps the strings pristine.

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Explorer09 · 2026-05-18T07:29:41Z

I'm not satisfied with this one. First it assumes a UTF-8 encoding without a check. Second there should be more control characters that we should filter, and that we should filter them with a whitelist approach.

~~The whitelisted set would be the set defined in iswprint(3).~~

Explorer09 · 2026-05-18T09:36:05Z

~~Wait. I want to carefully analyse the context of when this RichString_writeFromWide is called.~~

This variant RichString_writeFromWide function seems to apply when HAVE_LIBNCURSESW is not defined. For this particular case, it may be assumed that no multibyte string will be supported (right?). If so, then the 0xC2 0x80 to 0xC2 0x9F range check might not be applicable.

BenBE

As discussed on IRC …

BenBE · 2026-05-19T05:49:01Z

+      const char* p = &data_c[j];
+      CharType* q = &this->chptr[i];


Suggested change

const char* p = &data_c[j];

CharType* q = &this->chptr[i];

const char* p = data_c + j;

CharType* q = this->chptr + i;

Explorer09 · 2026-05-19T06:42:59Z

@BenBE @natoscott
I think we're trying to fix the wrong thing. Perhaps a little explanation can help you understand.

The RichString_writeFromWide function being patched is a variant when HAVE_LIBNCURSESW is not defined. In other words this happens in the build where the wide characters in ncurses is not supported.
From ncurses man page:

ncursesw
is the library in its "wide" configuration, which handles
character encodings requiring a larger data type than char
(a byte-sized type) can represent.
ncurses
is the library in its "non-wide" configuration, handling
only eight-bit characters.

And then...

Application Structure
A curses application uses information from the system locale;
setlocale(3) prepares it for curses library calls.
setlocale(LC_ALL, "");
If the locale is not thus initialized, the library assumes that
characters are printable as in ISO 8859-1, to work with certain
legacy programs. You should initialize the locale; do not expect
consistent behavior from the library when the locale has not been
set up.

Thus, the function we are editing is assumed to not support any multibyte encoding at all (UTF-8 is a multibyte encoding). When we're not in UTF-8 locale, the sequence of bytes 0xC2 0x80 to 0xC2 0x9F should not be interpreted as any C1 control character (or else it's a bug that's outside htop).
There are many single-byte encodings in the world that ncurses can support. While we can tell that the byte range 0x80..0x9F is the C1 control character range in ISO 8859-1, it won't be the case for an encoding such as Windows-1252 (Windows-1252 takes some byte value in this range and use them as graphic characters).
(In ctype.h, although iscntrl() and isprint() are mutually exclusive sets of characters, it is possible for a byte value to be neither control or printable character. I didn't have an example for a single byte character set, but I know Unicode noncharacters are in neither iswcntrl() or iswprint().)
So, combined with the reasons above, we should not use the simple 0x80..0x9F range check for control characters either. The best we should do instead is really the !isprint() function.

Hardening follow-up: extend control character filtering to also catch C1 controls (U+0080-U+009F), which are encoded as two-byte UTF-8 sequences starting with 0xC2. Related-to: GHSA-q64m-h5hc-c4qq Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 1

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 825c5f5f-fa1c-445b-b369-12753df44f84

📥 Commits

Reviewing files that changed from the base of the PR and between 4920493 and 76b2e01.

📒 Files selected for processing (1)

XUtils.h

natoscott · 2026-05-19T07:33:12Z

I agree that the RichString non-wide path shouldn't apply UTF-8 decoding, dropped that.

The change now only extends String_stripControlChars to catch UTF-8 encoded C1 controls (0xC2 0x80-0x9F), protecting against terminal escape injection via those sequences.

This still looks like the most straightforward, practical approach here to me.

Explorer09 · 2026-05-19T07:45:40Z

I agree that the RichString non-wide path shouldn't apply UTF-8 decoding, dropped that.

The change now only extends String_stripControlChars to catch UTF-8 encoded C1 controls (0xC2 0x80-0x9F), protecting against terminal escape injection via those sequences.

This still looks like the most straightforward, practical approach here to me.

0xC2 0x80-0x9F is not terminal escape injection, is it? Now I don't understand the threat model, and this PR looks like "fixing" stuff where the problem was invalid in the first place.

Even if we consider 0xC2 0x80-0x9F sequences are malicious, in this branch of the function the best approach would to sanitize all possible lead bytes of a multibyte sequence, since if we leak any multibyte sequence, it won't display properly in this no-multibyte-support build.

natoscott · 2026-05-19T07:57:40Z

0xC2 0x80-0x9F is not terminal escape injection, is it? Now I don't understand [...]

Yes, C1 controls absolutely include terminal escape sequences.

Explorer09 · 2026-05-19T09:24:39Z

0xC2 0x80-0x9F is not terminal escape injection, is it? Now I don't understand [...]

Yes, C1 controls absolutely include terminal escape sequences.

While I can see the Control Sequence Introducer possible to be represented by a byte valye of 0x9B. It should apply to only the ISO 2022 kind of encodings.

For an encoding like Big5, 0x9B may be a lead byte of valid, private use character.

BenBE reviewed May 19, 2026

View reviewed changes

natoscott force-pushed the security/c1-control-hardening branch from 4920493 to 76b2e01 Compare May 19, 2026 07:12

coderabbitai Bot reviewed May 19, 2026

View reviewed changes

Comment thread XUtils.h

Uh oh!

Conversation

natoscott commented May 18, 2026

Summary

Uh oh!

coderabbitai Bot commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Implementation quality

Known limitations

Walkthrough

Possibly related PRs

Suggested reviewers

Poem

Uh oh!

Explorer09 commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Explorer09 commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

BenBE left a comment

Choose a reason for hiding this comment

Uh oh!

BenBE May 19, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Explorer09 commented May 19, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

natoscott commented May 19, 2026

Uh oh!

Explorer09 commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

natoscott commented May 19, 2026

Uh oh!

Explorer09 commented May 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

coderabbitai Bot commented May 18, 2026 •

edited

Loading

Explorer09 commented May 18, 2026 •

edited

Loading

Explorer09 commented May 18, 2026 •

edited

Loading

Explorer09 commented May 19, 2026 •

edited

Loading