Reduce Planning Time for Large NOT IN Lists Containing NULL

zhjwpku · zhjwpku · commit 03f759f5ef9c · 2026-03-20T23:23:02.000+08:00
diff --git a/src/SUMMARY.md b/src/SUMMARY.md
@@ -3,6 +3,8 @@
 # 🇬🇧 English
 
 - [2026](./en/2026/README.md)
+  - [Week 12](./en/2026/12/README.md)
+    - [Reduce Planning Time for Large NOT IN Lists Containing NULL](./en/2026/12/not-in-null-planning-optimization.md)
   - [Week 11](./en/2026/11/README.md)
     - [Converting NOT IN Sublinks to Anti-Joins When Safe](./en/2026/11/not-in-sublinks-anti-joins.md)
   - [Week 10](./en/2026/10/README.md)
@@ -29,6 +31,8 @@
 # 🇨🇳 中文
 
 - [2026](./cn/2026/README.md)
+  - [第 12 周](./cn/2026/12/README.md)
+    - [缩短含 NULL 的大规模 NOT IN 列表的规划时间](./cn/2026/12/not-in-null-planning-optimization.md)
   - [第 11 周](./cn/2026/11/README.md)
     - [将 NOT IN 子链接安全地转换为 ANTI JOIN](./cn/2026/11/not-in-sublinks-anti-joins.md)
   - [第 10 周](./cn/2026/10/README.md)
diff --git a/src/cn/2026/12/README.md b/src/cn/2026/12/README.md
@@ -0,0 +1,9 @@
+# 第 12 周（2026）
+
+2026 年第 12 周 PostgreSQL 邮件列表讨论。
+
+🇬🇧 [English Version](../../../en/2026/12/index.html)
+
+## 文章
+
+- [缩短含 NULL 的大规模 NOT IN 列表的规划时间](./not-in-null-planning-optimization.md)
diff --git a/src/cn/2026/12/not-in-null-planning-optimization.md b/src/cn/2026/12/not-in-null-planning-optimization.md
@@ -0,0 +1,77 @@
+# 缩短含 NULL 的大规模 NOT IN 列表的规划时间
+
+## 引言
+
+当查询使用 `x NOT IN (NULL, ...)` 或 `x <> ALL (NULL, ...)` 时，结果恒为空——不会有任何行匹配。在 SQL 中，NOT IN 列表中只要存在一个 NULL，整个谓词对每一行都会得到 NULL 或 false，因此选择性为 0.0。然而 PostgreSQL 优化器此前仍会遍历列表中的每个元素，并为每个元素调用算子的选择性估计函数，在大列表上浪费规划时间。
+
+Ilia Evdokimov（Tantor Labs）提出了一项简单优化：在 `<> ALL` / NOT IN 情形下，检测数组是否包含 NULL，并提前短路选择性计算循环，直接返回 0.0。该补丁经过多轮评审，发现并修复了与「曾含 NULL 但现已不含」的数组相关的回归问题，最终由 David Rowley 于 2026 年 3 月提交。
+
+## 为何重要
+
+在报表和 ETL 场景中，带有大规模 `NOT IN` 或 `<> ALL` 列表的查询很常见。当这类列表包含 NULL 时（无论是有意还是来自子查询），优化器此前会做大量无用工作：
+
+- 对于常量数组：解构数组，遍历每个元素，并调用算子的选择性函数。
+- 对于非常量表达式：遍历列表元素以检查 NULL。
+
+在 Ilia 的测试中，当列具有详细统计信息时，`WHERE x NOT IN (NULL, ...)` 的规划时间从 **5–200 ms** 降至 **约 1–2 ms**，具体取决于列表大小。该优化保持语义不变，且不会影响常见场景。
+
+## 技术分析
+
+### 语义
+
+对于 `x NOT IN (a, b, c, ...)`（或 `x <> ALL (array)`）：
+
+- 若任一元素为 NULL，谓词对每一行都会得到 NULL 或 false，绝不会为 true（在 WHERE 子句中，只有谓词为 true 的行才会保留）。
+- 优化器将其建模为选择性 = 0.0：无行匹配。
+
+`src/backend/utils/adt/selfuncs.c` 中的 `scalararraysel()` 通过 `useOr` 标志同时处理 `= ANY`（IN）和 `<> ALL`（NOT IN）。对于 `<> ALL`，选择性通过遍历数组元素并组合各元素估计值来计算。当存在 NULL 时，最终结果恒为 0.0，因此该循环是多余的。
+
+### 补丁演进
+
+**v1** 在 `deconstruct_array()` 之后增加早期检查，使用 `memchr()` 在 `elem_nulls` 上检测任意 NULL。David Geier 提出疑问：`memchr()` 在每次调用时都会带来开销。他建议在 `ArrayType` 上增加标志位。
+
+**v2** 改为在逐元素循环内部短路：当元素为 `Const` 且为 NULL 时，立即返回 0.0。这避免了单独一轮遍历，但仍需进入循环。
+
+**v3** 将检查提前到 `DatumGetArrayTypeP()` 之后，使用 `ARR_HASNULL()` 在解构数组之前检测 NULL。这样在存在 NULL 时，既不需要解构数组，也不需要逐元素循环。Ilia 报告规划时间从 5–200 ms 降至约 1–2 ms。
+
+**v4** 修复了 Zsolt Parragi 发现的回归。宏 `ARR_HASNULL()` 仅检查 NULL 位图是否存在，而非是否有元素实际为 NULL。一个最初含有 NULL、但后来通过 `array_set_element()` 等将所有 NULL 替换掉的数组，仍可能保留 NULL 位图。仅使用 `ARR_HASNULL()` 会导致对此类数组错误地返回选择性 0.0。
+
+修复方案：使用 `array_contains_nulls()`，该函数会遍历 NULL 位图，仅在实际存在 NULL 元素时返回 true。v4 还增加了回归测试，通过 `replace_elem(ARRAY[1,NULL,3], 2, 99)` 构造数组，确保优化器对 `x <> ALL(replace_elem(ARRAY[1,NULL,3], 2, 99))` 估计 997 行（而非 0）。
+
+**v5–v9** 采纳 David Rowley 的反馈：将测试移至新的 `selectivity_est.sql` 文件（后更名为 `planner_est.sql`），使用 `test_setup.sql` 中的 `tenk1` 而非自定义表，增加关于假定算子为 strict 的注释（与 `var_eq_const()` 一致），并简化测试以断言不变量（存在 NULL 时选择性为 0.0），而非精确行估计。
+
+## 社区反馈
+
+- **David Geier** 质疑 `memchr()` 的代价，并建议在 `ArrayType` 上增加标志；Ilia 发现已有 `ARR_HASNULL()` / `array_contains_nulls()`。
+- **Zsolt Parragi** 发现了与「已将 NULL 替换掉」的数组相关的回归，并提出了 `replace_elem` 测试用例。
+- **David Geier** 澄清：`ARR_HASNULL()` 检查的是位图是否存在，而非实际 NULL 元素；`array_contains_nulls()` 才是正确的检查。
+- **David Rowley** 建议将测试移至专门的 planner 估计相关文件，使用现有测试表，并注明 strict 算子假设。他还提交了重构补丁（`planner_est.sql`）和主优化补丁。
+
+## 技术细节
+
+### 实现
+
+该优化在 `scalararraysel()` 中增加两条短路路径：
+
+1. **常量数组情形**：在 `DatumGetArrayTypeP()` 之后，若 `!useOr`（即 `<> ALL` / NOT IN）且 `array_contains_nulls(arrayval)`，则立即返回 0.0。这发生在 `deconstruct_array()` 之前。
+
+2. **非常量列表情形**：在逐元素循环中，若 `!useOr` 且元素为 `constisnull` 的 `Const`，则返回 0.0。这处理了当列表来自非常量表达式时的 `x NOT IN (1, 2, NULL, ...)`。
+
+代码假定算子为 strict（与 `var_eq_const()` 一致）：当常量为 NULL 时，算子永远不会返回 true，因此估计的选择性为零。这与现有优化器行为一致。
+
+### 边界情况
+
+- **有 NULL 位图但无实际 NULL 的数组**：通过使用 `array_contains_nulls()` 而非 `ARR_HASNULL()` 处理。
+- **非 strict 算子**：注释中说明，该短路与 `var_eq_const()` 采用相同假设。
+
+### 相关工作
+
+David Geier 指出，一旦[基于哈希的 NOT IN 代码](https://www.postgresql.org/message-id/flat/7db341e0-fbc6-4ec5-922c-11fdafe7be12%40tantorlabs.com)合并后，加速效果会有所减弱，但仍能在选择性估计阶段节省大量计算。
+
+## 当前状态
+
+该补丁由 David Rowley 于 2026 年 3 月 19 日提交。将优化器行估计测试迁移至 `planner_est.sql` 的重构先被提交，主优化随后跟进。该优化将出现在后续 PostgreSQL 版本中。
+
+## 结论
+
+一项小改动——在 NOT IN / `<> ALL` 列表中检测 NULL 并提前返回选择性 0.0——避免了规划阶段不必要的逐元素计算。修复需要正确处理 NULL 位图与实际 NULL 元素的区别，并得益于细致的评审和回归测试。该优化现已成为 PostgreSQL 的一部分，将惠及使用含 NULL 的大规模 NOT IN 列表的负载。
diff --git a/src/cn/2026/README.md b/src/cn/2026/README.md
@@ -4,6 +4,8 @@
 
 ## 各周
 
+- [第 12 周](/cn/2026/12/index.html)
+  - [缩短含 NULL 的大规模 NOT IN 列表的规划时间](/cn/2026/12/not-in-null-planning-optimization.html)
 - [第 11 周](/cn/2026/11/index.html)
   - [将 NOT IN 子链接安全地转换为 ANTI JOIN](/cn/2026/11/not-in-sublinks-anti-joins.html)
 - [第 10 周](/cn/2026/10/index.html)
diff --git a/src/en/2026/12/README.md b/src/en/2026/12/README.md
@@ -0,0 +1,9 @@
+# Week 12 (2026)
+
+PostgreSQL mailing list discussions for Week 12, 2026.
+
+🇨🇳 [中文版本](../../../cn/2026/12/index.html)
+
+## Articles
+
+- [Reduce Planning Time for Large NOT IN Lists Containing NULL](./not-in-null-planning-optimization.md)
diff --git a/src/en/2026/12/not-in-null-planning-optimization.md b/src/en/2026/12/not-in-null-planning-optimization.md
@@ -0,0 +1,77 @@
+# Reduce Planning Time for Large NOT IN Lists Containing NULL
+
+## Introduction
+
+When a query uses `x NOT IN (NULL, ...)` or `x <> ALL (NULL, ...)`, the result is always empty—no rows can match. In SQL, the presence of a single NULL in a NOT IN list makes the entire predicate evaluate to NULL or false for every row, so the selectivity is 0.0. Yet the PostgreSQL planner was still iterating over every element in the list and invoking the operator's selectivity estimator for each one, wasting planning time on large lists.
+
+Ilia Evdokimov (Tantor Labs) proposed a simple optimization: detect when the array contains NULL and short-circuit the selectivity loop for the `<> ALL` / NOT IN case, returning 0.0 immediately. The patch went through several rounds of review, uncovered a subtle regression involving arrays that once contained NULL but no longer do, and was committed by David Rowley in March 2026.
+
+## Why This Matters
+
+Queries with large `NOT IN` or `<> ALL` lists are common in reporting and ETL workloads. When such a list includes NULL—whether intentionally or from a subquery—the planner was doing unnecessary work:
+
+- For constant arrays: deconstructing the array, iterating over each element, and calling the operator's selectivity function.
+- For non-constant expressions: iterating over the list elements and estimating selectivity for each one, even though a NULL element already fixes the result.
+
+In Ilia's benchmarks, planning time for `WHERE x NOT IN (NULL, ...)` dropped from **5–200 ms** to **~1–2 ms** depending on list size, when the column had detailed statistics. The optimization preserves semantics and avoids regressions in the common case.
+
+## Technical Analysis
+
+### The Semantics
+
+For `x NOT IN (a, b, c, ...)` (or `x <> ALL (array)`):
+
+- If any element is NULL, the predicate yields NULL or false for every row, never true (in a WHERE clause, only rows for which the predicate is true are kept).
+- The planner models this as selectivity = 0.0: no rows match.
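+
+The reasoning behind these two points can be written out under SQL's three-valued logic. This is an illustrative derivation, not planner code; the symbols $a_i$ (list elements) and $a_k$ (the NULL element) are notation introduced here, and `<>` is assumed to be the usual strict operator.
+
+```latex
+\begin{align*}
+x \ \text{NOT IN}\ (a_1, \dots, a_n)
+  &\equiv (x \ne a_1) \land \dots \land (x \ne a_n) \\
+a_k = \text{NULL}
+  &\implies (x \ne a_k) = \text{NULL} \\
+\text{NULL} \land \text{TRUE} &= \text{NULL}, \qquad
+\text{NULL} \land \text{FALSE} = \text{FALSE}
+\end{align*}
+```
+
+With one conjunct fixed at NULL, the conjunction can only be NULL or FALSE, so no row passes the WHERE clause.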
+
+The current code in `scalararraysel()` in `src/backend/utils/adt/selfuncs.c` handles both `= ANY` (IN) and `<> ALL` (NOT IN) via a `useOr` flag. For `<> ALL`, the selectivity is computed by iterating over array elements and combining per-element estimates. When a NULL is present, the final result is always 0.0, so the loop is redundant.
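+
+A simplified way to see why the loop is redundant, assuming the per-element estimates are combined as a product of independent conditions (the actual function applies further refinements that are not modeled here). The symbols $s_i$ and $S$ are notation introduced for this sketch: $s_i$ is the estimated selectivity of `x <> a_i`, and $S$ is the combined estimate.
+
+```latex
+\[
+S = \prod_{i=1}^{n} s_i, \qquad 0 \le s_i \le 1
+\]
+\[
+a_k = \text{NULL} \;\Rightarrow\; s_k = 0 \;\Rightarrow\; S = 0
+\]
+```
+
+Under the strict-operator assumption, a NULL constant contributes a factor of zero, so the remaining factors cannot change the outcome and need not be computed.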
+
+### Patch Evolution
+
+**v1** added an early check after `deconstruct_array()` using `memchr()` on `elem_nulls` to detect any NULL. David Geier raised a concern: `memchr()` adds overhead on every call. He suggested a flag on `ArrayType` instead.
+
+**v2** switched to short-circuiting inside the per-element loop: when a `Const` element is NULL, return 0.0 immediately. This avoids a separate pass but still requires entering the loop.
+
+**v3** moved the check earlier, right after `DatumGetArrayTypeP()`, using `ARR_HASNULL()` to detect NULL before deconstructing the array. This avoids both the deconstruction and the per-element loop when NULL is present. Ilia reported planning time dropping from 5–200 ms to ~1–2 ms.
+
+**v4** addressed a regression found by Zsolt Parragi. The macro `ARR_HASNULL()` only checks whether the array has a NULL bitmap, not whether any element is actually NULL. An array that originally contained a NULL but later had every NULL replaced (e.g., via `array_set_element()`) can still carry a NULL bitmap. Using `ARR_HASNULL()` alone therefore produced an incorrect selectivity of 0.0 for such arrays.
+
+The fix: use `array_contains_nulls()`, which iterates the NULL bitmap and returns true only when an element is actually NULL. v4 also added a regression test that constructs an array from `ARRAY[1, NULL, 3]` with the NULL replaced by 99, ensuring the planner estimates 997 rows (not 0) for `x <> ALL(replace_elem(ARRAY[1,NULL,3], 2, 99))`.
+
+**v5–v9** incorporated feedback from David Rowley: move the test to a new `selectivity_est.sql` file (later renamed to `planner_est.sql`), use `tenk1` from `test_setup.sql` instead of a custom table, add a comment about assuming the operator is strict (like `var_eq_const()`), and simplify tests to assert the invariant (selectivity 0.0 when NULL is present) rather than exact row estimates.
+
+## Community Insights
+
+- **David Geier** questioned the cost of `memchr()` and suggested an `ArrayType` flag; Ilia found that `ARR_HASNULL()` / `array_contains_nulls()` already existed.
+- **Zsolt Parragi** found the regression with arrays that had NULLs replaced, and proposed the `replace_elem` test case.
+- **David Geier** clarified that `ARR_HASNULL()` checks the bitmap's existence, not actual NULL elements; `array_contains_nulls()` is the correct check.
+- **David Rowley** suggested moving tests to a dedicated planner-estimation file, using existing test tables, and documenting the strict-operator assumption. He also pushed the refactoring patch (`planner_est.sql`) and the main optimization.
+
+## Technical Details
+
+### Implementation
+
+The optimization adds two short-circuit paths in `scalararraysel()`:
+
+1. **Constant array case**: After `DatumGetArrayTypeP()`, if `!useOr` (i.e., `<> ALL` / NOT IN) and `array_contains_nulls(arrayval)`, return 0.0 immediately. This runs before `deconstruct_array()`.
+
+2. **Non-constant list case**: In the per-element loop, if `!useOr` and the element is a `Const` with `constisnull`, return 0.0. This handles `x NOT IN (1, 2, NULL, ...)` when the list is built from an expression that is not a single constant array, so its elements are examined one at a time.
+
+The code assumes the operator is strict (like `var_eq_const()`): when the constant is NULL, the operator can never return true, so the estimated selectivity is zero. This is consistent with existing planner behavior.
+
+### Edge Cases
+
+- **Arrays with a NULL bitmap but no actual NULLs**: Handled by calling `array_contains_nulls()` instead of `ARR_HASNULL()`.
+- **Non-strict operators**: The comment documents that the short-circuit follows the same assumption as `var_eq_const()`.
+
+### Related Work
+
+David Geier noted that the speedup will be less pronounced once the [hash-based NOT IN code](https://www.postgresql.org/message-id/flat/7db341e0-fbc6-4ec5-922c-11fdafe7be12%40tantorlabs.com) is merged, but the optimization still saves cycles during selectivity estimation.
+
+## Current Status
+
+The patch was committed by David Rowley on March 19, 2026. The refactoring that moved planner row-estimation tests to `planner_est.sql` was committed first; the main optimization followed. It will appear in a future PostgreSQL release.
+
+## Conclusion
+
+A small change—detecting NULL in NOT IN / `<> ALL` lists and returning selectivity 0.0 early—avoids unnecessary per-element work during planning. The fix required careful handling of the NULL bitmap vs. actual NULL elements, and benefited from thorough review and regression tests. The optimization is now part of PostgreSQL and will help workloads that use large NOT IN lists containing NULL.
diff --git a/src/en/2026/README.md b/src/en/2026/README.md
@@ -4,6 +4,8 @@ PostgreSQL Weekly posts for 2026.
 
 ## Weeks
 
+- [Week 12](/en/2026/12/index.html)
+  - [Reduce Planning Time for Large NOT IN Lists Containing NULL](/en/2026/12/not-in-null-planning-optimization.html)
 - [Week 11](/en/2026/11/index.html)
   - [Converting NOT IN Sublinks to Anti-Joins When Safe](/en/2026/11/not-in-sublinks-anti-joins.html)
 - [Week 10](/en/2026/10/index.html)