UUID and Base32hex Encoding

zhjwpku · zhjwpku · commit 00852a7495db · 2026-03-29T22:08:24.000+08:00
diff --git a/src/SUMMARY.md b/src/SUMMARY.md
@@ -4,6 +4,7 @@
 
 - [2026](./en/2026/README.md)
   - [Week 13](./en/2026/13/README.md)
+    - [UUID and Base32hex Encoding](./en/2026/13/uuid-base32hex-encode-decode.md)
     - [JSONPath String Methods: Cleaning JSON Inside the Path—and a Long Debate About Immutability](./en/2026/13/jsonpath-string-methods.md)
   - [Week 12](./en/2026/12/README.md)
     - [Reduce Planning Time for Large NOT IN Lists Containing NULL](./en/2026/12/not-in-null-planning-optimization.md)
@@ -34,6 +35,7 @@
 
 - [2026](./cn/2026/README.md)
   - [第 13 周](./cn/2026/13/README.md)
+    - [UUID 与 base32hex 编码](./cn/2026/13/uuid-base32hex-encode-decode.md)
     - [JSONPath 字符串方法：在路径里清洗 JSON，以及一场关于不可变性的长跑讨论](./cn/2026/13/jsonpath-string-methods.md)
   - [第 12 周](./cn/2026/12/README.md)
     - [缩短含 NULL 的大规模 NOT IN 列表的规划时间](./cn/2026/12/not-in-null-planning-optimization.md)
diff --git a/src/cn/2026/13/README.md b/src/cn/2026/13/README.md
@@ -4,4 +4,5 @@
 
 ## 文章
 
+- [UUID 与 base32hex 编码](./uuid-base32hex-encode-decode.md)
 - [JSONPath 字符串方法：在路径里清洗 JSON，以及一场关于不可变性的长跑讨论](./jsonpath-string-methods.md)
diff --git a/src/cn/2026/13/uuid-base32hex-encode-decode.md b/src/cn/2026/13/uuid-base32hex-encode-decode.md
@@ -0,0 +1,128 @@
+# UUID 与 base32hex 编码
+
+## 引言
+
+在 URL、日志、JSON 等场景里，经常需要一种更短、更易口述的 UUID 文本形式。2025 年 10 月起，在 [pgsql-hackers 讨论串](https://www.postgresql.org/message-id/flat/CAJ7c6TOramr1UTLcyB128LWMqita1Y7%3Darq3KHaU%3Dqikf5yKOQ%40mail.gmail.com) 中，有人倡议增加两个内置函数——`uuid_to_base32hex()` 与 `base32hex_to_uuid()`——把 UUID 编成 [RFC 4648 第 7 节](https://datatracker.ietf.org/doc/html/rfc4648#section-7) 的 **base32hex** 字符串（26 个字符）。后续讨论的焦点很快从「格式好不好」转向：**在 PostgreSQL 的 SQL 接口里，这类能力应该放在哪里**——是再增加一对 UUID 专用函数，还是走 `encode()` / `decode()` 与显式类型转换的组合。
+
+## 为何值得关注
+
+PostgreSQL 已有 `uuid` 类型与 `uuidv7()` 等函数，存储层面很高效；但在系统间交换、对外 API 和人工阅读时，**文本形态**仍然重要。**Base32hex** 在字节层面保持 **字典序与二进制一致**（与常见的 base64 不同），解码时 **大小写不敏感**，口述时也不必区分字母大小写。讨论中还把该格式与 DNSSEC 等现实用法、以及 **RFC 9562** 对「规范 hyphen 表示 + 库内二进制存储」的取向联系起来，强调**紧凑文本编码若碎片化**（各家用各家的短编码）会带来互操作噩梦。
+
+## 技术分析
+
+### 最初设想（概念）
+
+线程中保留的原始说明（`hi-hackers.txt`）大致描述：
+
+- **`uuid_to_base32hex(uuid) → text`**：26 个大写 base32hex 字符，无连字符、无填充；在 128 位与 base32 的 5 位对齐需求之间补两位零比特。
+- **`base32hex_to_uuid(text) → uuid`**：解码大小写不敏感；非法输入返回 NULL。
+
+作者同时对比了 base36（性能）、Crockford Base32（标准库支持弱）等，倾向 base32hex。
+
+### 社区更偏好的方向：组合而非重复
+
+**Aleksander Alekseev** 的第一回复既包含流程建议（避免大量 `Cc`；用 `git format-patch`；在 [Commitfest](https://commitfest.postgresql.org/) 登记），也包含接口设计：单独的 UUID 编解码函数 **可组合性差**，更稳妥的是：
+
+1. 提供显式的 **`uuid ↔ bytea`** 转换；
+2. 在 **`encode(bytea, ...)` / `decode(text, ...)`** 上增加 **base32hex** 格式，例如：
+
+```sql
+SELECT encode(uuidv7()::bytea, 'base32hex');
+```
+
+（早期邮件里写的是 `'base32'`；后续补丁采用的格式名是 **`base32hex`**，与 RFC 4648 的命名一致。）
+
+**Andrey Borodin** 与 **Jelte Fennema-Nio** 同意扩展 `encode()` 更符合既有习惯；Jelte 以 PostgreSQL 18 中 **base64url** 的引入为例（[提交 e1d917182](https://git.postgresql.org/cgit/postgresql.git/commit/?h=REL_18_0&id=e1d917182c1953b16b32a39ed2fe38e3d0823047)），说明在相同入口上增加编码格式是可行路径。
+
+### 补丁系列实际走向（概要）
+
+下载的补丁系列（讨论串中可见到 **v12** 等版本）逐步收敛为：
+
+- 在 `encode.c` 中实现 **`encode(bytea, 'base32hex')`** 与 **`decode(text, 'base32hex')`**（编码侧按 RFC 使用 `=` 填充；解码侧接受带填充或不带填充；大小写与空白处理以补丁说明为准）。
+- 在 `func-binarystring.sgml` 中记录该格式，并给出紧凑 UUID 的推荐写法：
+
+```sql
+rtrim(encode(uuid_value::bytea, 'base32hex'), '=')
+```
+
+即相对 36 字符的规范 UUID 文本，得到 **26 字符** 的短形式。
+
+- 与之配套：**`uuid` 与 `bytea` 的显式转换**，避免在核心中为每种编码再增加一对 UUID 专用函数。
+
+### SQL 示例（示意）
+
+下面语句对应讨论中形成的用法（`encode` / `decode` 使用 `'base32hex'`，UUID 经 `::bytea` 参与编码）。**需以包含该功能的 PostgreSQL 版本为准**——撰写本文时相关补丁仍在评审流程中。
+
+**1. 带 RFC 填充的原始输出** — `encode()` 会输出 `'='` 填充；对 16 字节的 UUID，在未 `rtrim` 前字符串长度会大于 26：
+
+```sql
+SELECT encode('019535d9-3df7-79fb-b466-fa907fa17f9e'::uuid::bytea, 'base32hex') AS padded;
+-- 32 个字符（按 RFC 4648 填充到 8 的倍数，末尾为 '='）
+```
+
+**2. 26 字符紧凑形式** — 去掉尾部 `=`，适合 URL、日志等（与线程中的示例 UUID 一致）：
+
+```sql
+SELECT rtrim(
+         encode('019535d9-3df7-79fb-b466-fa907fa17f9e'::uuid::bytea, 'base32hex'),
+         '='
+       ) AS short_id;
+-- 06AJBM9TUTSVND36VA87V8BVJO
+```
+
+**3. 往返** — `decode()` 返回 `bytea`，再 cast 回 `uuid`：
+
+```sql
+WITH x AS (
+  SELECT rtrim(
+           encode('019535d9-3df7-79fb-b466-fa907fa17f9e'::uuid::bytea, 'base32hex'),
+           '='
+         ) AS short_id
+)
+SELECT short_id,
+       decode(short_id, 'base32hex')::uuid AS back_to_uuid
+FROM x;
+-- back_to_uuid = 019535d9-3df7-79fb-b466-fa907fa17f9e
+```
+
+**4. 与 `uuidv7()` 组合**（需要时间有序 ID 且对外用短字符串时）：
+
+```sql
+SELECT rtrim(encode(uuidv7()::bytea, 'base32hex'), '=') AS short_new_id;
+```
+
+### 补丁演进
+
+早期版本曾出现 `uuid_encode` / `uuid_decode` 一类接口；后续版本将能力并入 **`encode`/`decode`**，并补充回归测试与文档，还涉及 **排序与排序规则（collation）** 的说明与小幅修正（附件列表中有单独的 doc 补丁）。
+
+## 社区观点
+
+- **列表与可见性**：Aleksander 指出 Sergey 若未订阅列表，首发邮件未必被所有人看到；附原文并建议订阅，有助于讨论在归档里自洽。
+
+- **为何选 base32hex**：Sergey 归纳了排序保持、体积、标准库支持、口述友好、JSON 场景下实现简单等理由，并强调**短格式若各自为政**会导致生态分裂。
+
+- **API 膨胀**：**Masahiko Sawada** 认为若再堆一批 `uuid_*` 编码函数，容易与 `encode()` 职责重叠；他支持 **`encode`/`decode` + UUID 与 bytea 转换**，并认为转换开销可忽略。
+
+- **多态 `encode` 与 `decode` 的签名限制**：Sergey 曾问能否让 `encode()` 直接吃 `uuid`，或让 `decode()` 直接产出 `uuid`。Masahiko 说明：在 PostgreSQL 里 **无法用同一组参数类型** 为 `decode(text, text)` 定义两种不同返回类型；**显式 cast + 可内联的 SQL 包装函数** 是务实做法，例如：
+
+```sql
+CREATE FUNCTION uuid_to_base32(u uuid) RETURNS text
+LANGUAGE SQL IMMUTABLE STRICT
+BEGIN ATOMIC
+  SELECT encode($1::bytea, 'base32hex');
+END;
+```
+
+并说明与手写 `encode($1::bytea, ...)` 相比，额外成本主要在类型转换。
+
+- **「只能有一种短格式」吗**：Sergey 担心多种短编码并存；Masahiko 指出在异构系统集成时，开发者仍可能因兼容性选择 **hex** 等。工程上的折中仍是在核心提供 **标准的 base32hex**，并在文档中写清 **26 字符 UUID** 的用法。
+
+## 技术细节
+
+- **填充**：按 RFC，`encode()` 会输出带 **`=`** 的填充；去掉尾部 `=` 即得到提案中的 26 字符形态。
+- **排序**：base32hex 保持字节序对应的字典序；若把编码结果当作 **text** 比较，需注意 **排序规则** 与二进制比较的差异。
+- **实现位置**：把编解码放在 `encode.c` 与现有 **base64url** 等并列，符合「二进制↔文本编码集中管理」的习惯。
+
+## 结语
+
+Base32hex 本身并不复杂，但这串讨论体现了 PostgreSQL 对 **接口正交性** 的偏好：`uuid` 存数据、`bytea` 表示字节、`encode`/`decode` 负责文本编码。若你需要紧凑且排序友好的 UUID 文本，正在形成的使用习惯是 **`encode(uuid::bytea, 'base32hex')`**，必要时 **`rtrim(..., '=')`**；若业务希望一行 SQL更短，可以用 **可内联的包装函数** 封装，而不必把每一种编码都塞进核心函数名列表。
diff --git a/src/cn/2026/README.md b/src/cn/2026/README.md
@@ -5,6 +5,7 @@
 ## 各周
 
 - [第 13 周](/cn/2026/13/index.html)
+  - [UUID 与 base32hex 编码](/cn/2026/13/uuid-base32hex-encode-decode.html)
   - [JSONPath 字符串方法：在路径里清洗 JSON，以及一场关于不可变性的长跑讨论](/cn/2026/13/jsonpath-string-methods.html)
 - [第 12 周](/cn/2026/12/index.html)
   - [缩短含 NULL 的大规模 NOT IN 列表的规划时间](/cn/2026/12/not-in-null-planning-optimization.html)
diff --git a/src/en/2026/13/README.md b/src/en/2026/13/README.md
@@ -4,4 +4,5 @@ PostgreSQL mailing list discussions for Week 13, 2026.
 
 ## Articles
 
+- [UUID and Base32hex Encoding](./uuid-base32hex-encode-decode.md)
 - [JSONPath String Methods: Cleaning JSON Inside the Path—and a Long Debate About Immutability](./jsonpath-string-methods.md)
diff --git a/src/en/2026/13/uuid-base32hex-encode-decode.md b/src/en/2026/13/uuid-base32hex-encode-decode.md
@@ -0,0 +1,128 @@
+# UUID and Base32hex Encoding
+
+## Introduction
+
+Compact, copy-paste-friendly representations of UUIDs come up in URLs, logs, JSON payloads, and anything humans have to read aloud. A [pgsql-hackers thread](https://www.postgresql.org/message-id/flat/CAJ7c6TOramr1UTLcyB128LWMqita1Y7%3Darq3KHaU%3Dqikf5yKOQ%40mail.gmail.com) that began in October 2025 opened with a proposal for two new builtins, `uuid_to_base32hex()` and `base32hex_to_uuid()`, that encode UUIDs as 26-character [RFC 4648 Section 7](https://datatracker.ietf.org/doc/html/rfc4648#section-7) **base32hex** strings. The discussion that followed was less about the encoding itself (everyone agreed it is standardized and useful) and more about **where that behavior should live** in PostgreSQL’s SQL surface: separate UUID-specific functions, or the existing `encode()` / `decode()` pipeline plus explicit type conversions.
+
+## Why This Matters
+
+PostgreSQL already stores UUIDs efficiently as a dedicated type and ships `uuidv7()` and friends. Text interchange formats still matter for APIs and cross-system contracts. **Base32hex** preserves the **lexicographic sort order** of the underlying bytes under byte-wise comparison (unlike typical base64), is **case-insensitive when decoded**, and can be dictated without spelling out upper and lower case. The thread connects the format to real-world precedent (DNSSEC encoders, JSON-heavy pipelines) and to the positioning of **RFC 9562**: canonical hyphenated UUID strings for compatibility, binary storage in databases, and a wish, which the RFC text itself does not fully standardize, for *one* compact text encoding to reduce ecosystem fragmentation.
+
+## Technical Analysis
+
+### Original proposal (conceptual)
+
+The opening pitch (preserved in the thread as `hi-hackers.txt`) described:
+
+- **`uuid_to_base32hex(uuid) → text`**: 26 uppercase base32hex characters, no hyphens, no padding; two zero bits are appended so that the 128 bits fill exactly 26 five-bit symbols (26 × 5 = 130).
+- **`base32hex_to_uuid(text) → uuid`**: case-insensitive decode; invalid input yields NULL.
+
+Sergey Prokhorenko also argued against alternatives such as base36 (performance) and Crockford Base32 (weaker presence in standard libraries), positioning base32hex as the practical compact choice.
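+
+The 26-character length and the two padding bits follow from simple arithmetic. As a quick sanity check (plain numeric math that runs on any PostgreSQL version and does not depend on the proposed functions; the column aliases are illustrative only):
+
+```sql
+-- 128 bits split into 5-bit symbols need 26 characters,
+-- which leaves 2 unused bits to fill with zeros.
+SELECT ceil(128 / 5.0)  AS chars_needed,   -- 26
+       26 * 5 - 128     AS zero_pad_bits;  -- 2
+```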
+
+### Community direction: compose, do not duplicate
+
+**Aleksander Alekseev**’s first reply mixed process advice (avoid mass `Cc:` lists; use `git format-patch`; register on [Commitfest](https://commitfest.postgresql.org/)) with API feedback: standalone UUID helpers are **not composable**. He suggested instead:
+
+1. Explicit **`uuid ↔ bytea`** casts (or conversions).
+2. Extending **`encode(bytea, ...)`** and **`decode(text, ...)`** with a **base32hex** format—so usage looks like:
+
+```sql
+SELECT encode(uuidv7()::bytea, 'base32hex');
+```
+
+(The early mail said `'base32'`; the implemented format name in later patches is **`base32hex`**, matching RFC 4648’s “base32hex” alphabet.)
+
+**Andrey Borodin** and **Jelte Fennema-Nio** agreed that extending `encode()` matches existing practice; Jelte pointed to **base64url** support added in PostgreSQL 18 ([commit e1d917182](https://git.postgresql.org/cgit/postgresql.git/commit/?h=REL_18_0&id=e1d917182c1953b16b32a39ed2fe38e3d0823047)) as a precedent for adding encodings to the same entry points.
+
+### What the patch series converged on (high level)
+
+The downloadable series (revisions through **v12** in the thread attachments) converges on:
+
+- **`encode(bytea, 'base32hex')`** and **`decode(text, 'base32hex')`** implemented in `encode.c`, with RFC 4648 padding on encode and a tolerant decode (padded or unpadded input, case-insensitive, whitespace ignored; see the patch headers for the exact rules).
+- Documentation under `func-binarystring.sgml` describing base32hex and explicitly recommending:
+
+```sql
+rtrim(encode(uuid_value::bytea, 'base32hex'), '=')
+```
+
+for a **26-character** compact UUID string versus 36 characters in canonical hex-with-hyphens form.
+
+- Companion work: **explicit casting between `uuid` and `bytea`**, so the `encode()` path does not require ad-hoc UUID-only builtins.
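+
+On releases that predate those casts, the same 16 bytes can already be reached through long-standing functions. A minimal sketch, assuming only `uuid_send()` and the `hex` format of `encode()`; this is a workaround shown for illustration, not what the patches implement:
+
+```sql
+-- uuid -> bytea via the type's binary send function
+SELECT uuid_send('019535d9-3df7-79fb-b466-fa907fa17f9e'::uuid) AS as_bytea;
+-- \x019535d93df779fbb466fa907fa17f9e  (with the default bytea_output = 'hex')
+
+-- bytea -> uuid by way of hex text; uuid input accepts 32 hex digits without hyphens
+SELECT encode('\x019535d93df779fbb466fa907fa17f9e'::bytea, 'hex')::uuid AS back_to_uuid;
+-- 019535d9-3df7-79fb-b466-fa907fa17f9e
+```
+
+The explicit casts proposed in the thread would make this detour unnecessary, which is part of why they accompany the `encode()` change.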
+
+### SQL examples (illustrative)
+
+The snippets below match the **API shape** discussed on the list (`encode` / `decode` with `'base32hex'` and `uuid::bytea`). They assume a PostgreSQL build that includes that support—the thread’s patches were still under review when this post was written.
+
+**1. Raw RFC output (with padding)** — `encode()` emits `'='` padding; for a 16-byte UUID the string is longer than 26 characters until trimmed:
+
+```sql
+SELECT encode('019535d9-3df7-79fb-b466-fa907fa17f9e'::uuid::bytea, 'base32hex') AS padded;
+-- 06AJBM9TUTSVND36VA87V8BVJO======  (32 characters: 26 data characters plus 6 '=' of RFC 4648 padding)
+```
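+
+The 32-character figure follows from RFC 4648's grouping: every 5 input bytes become 8 output characters, and a final partial group is padded with `=` up to 8. A small check in plain arithmetic (it does not depend on the proposed format; the aliases are illustrative only):
+
+```sql
+SELECT 16 / 5              AS full_groups,     -- 3 groups of 5 bytes -> 24 characters
+       16 % 5              AS leftover_bytes,  -- 1 byte -> 2 characters + 6 '='
+       ceil(16 / 5.0) * 8  AS padded_length;   -- 32
+```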
+
+**2. Compact 26-character form** — strip padding for URLs and logs (same example UUID as in the original thread):
+
+```sql
+SELECT rtrim(
+         encode('019535d9-3df7-79fb-b466-fa907fa17f9e'::uuid::bytea, 'base32hex'),
+         '='
+       ) AS short_id;
+-- 06AJBM9TUTSVND36VA87V8BVJO
+```
+
+**3. Round trip** — `decode()` yields `bytea`; cast back to `uuid`:
+
+```sql
+WITH x AS (
+  SELECT rtrim(
+           encode('019535d9-3df7-79fb-b466-fa907fa17f9e'::uuid::bytea, 'base32hex'),
+           '='
+         ) AS short_id
+)
+SELECT short_id,
+       decode(short_id, 'base32hex')::uuid AS back_to_uuid
+FROM x;
+-- back_to_uuid = 019535d9-3df7-79fb-b466-fa907fa17f9e
+```
+
+**4. New IDs with `uuidv7()`** (when you want time-ordered values and a short external form):
+
+```sql
+SELECT rtrim(encode(uuidv7()::bytea, 'base32hex'), '=') AS short_new_id;
+```
+
+### Patch evolution
+
+Early revisions explored UUID-specific `uuid_encode` / `uuid_decode` style APIs; later revisions fold behavior into **`encode`/`decode`**, rebase cast support, add regression tests, and iterate on documentation—including notes on **collation** and **sortability** of encoded text (additional small doc patches appear in the attachment list).
+
+## Community Insights
+
+- **Process and visibility**: Aleksander noted Sergey was not subscribed to the list, so the original message did not reach all readers; subscribing and attaching the full proposal helped restore context.
+
+- **Why base32hex**: Sergey summarized advantages—sort preservation, compactness, wide library support, easy dictation without case ambiguity, and simple codecs for JSON-centric systems. He also framed **standardization pressure**: many ad-hoc “short UUID” schemes risk incompatibility; RFC 9562 authors reportedly favored a single compact encoding, even if the full compact-text story did not ship inside that RFC’s timeline.
+
+- **API surface area**: **Masahiko Sawada** argued that adding another `uuid_*` encoding function alongside `encode()` invites a proliferation of one-off functions. He gave a `+1` to **`encode`/`decode` plus `uuid`/`bytea` conversion**, judging the cost of the cast negligible.
+
+- **Design tension—polymorphism vs. casts**: Sergey asked whether `encode()` could take `uuid` directly and whether `decode()` could return `uuid` without going through `bytea`. Masahiko explained PostgreSQL cannot overload `decode(text, text)` with two different result types for the same signature; **casts plus inline SQL wrappers** are the idiomatic compromise. He gave an **inlineable** example:
+
+```sql
+CREATE FUNCTION uuid_to_base32(u uuid) RETURNS text
+LANGUAGE SQL IMMUTABLE STRICT
+BEGIN ATOMIC
+  SELECT encode($1::bytea, 'base32hex');
+END;
+```
+
+and noted the runtime difference versus calling `encode(...)` with an explicit cast is essentially a small conversion cost.
+
+- **Sergey’s “one short format” concern**: He warned against a world where different systems pick Crockford base32, base36, unsorted base64, etc. Masahiko countered that **developers still choose** encodings when integrating heterogeneous stacks—e.g., hex when every component supports it. The thread’s engineering answer is still to expose **one standard base32hex** in core and document the **26-character** UUID recipe.
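+
+The inline-wrapper idea from the polymorphism discussion above extends naturally to the reverse direction. A hypothetical counterpart is sketched below: the name `base32hex_to_uuid_sketch` is made up for this post, and the body assumes a build that includes the proposed `'base32hex'` format and the `bytea` to `uuid` cast, and that `decode()` accepts unpadded input as the patch description states.
+
+```sql
+-- Hypothetical helper; requires the patched encode/decode and the bytea -> uuid cast.
+CREATE FUNCTION base32hex_to_uuid_sketch(s text) RETURNS uuid
+LANGUAGE SQL IMMUTABLE STRICT
+BEGIN ATOMIC
+  -- decode() yields bytea; the explicit cast turns the 16 bytes back into a uuid
+  SELECT decode($1, 'base32hex')::uuid;
+END;
+```
+
+Keeping such helpers in application schemas rather than in core is the trade-off the thread settled on: core stays small, and each project names its wrappers as it likes.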
+
+## Technical Details
+
+- **Padding**: RFC-style `encode()` output includes **`=`** padding to a multiple of 8 characters; trimming yields the compact 26-character UUID form used in the original proposal.
+- **Sort order**: Base32hex preserves byte order for lexicographic comparisons; documentation and follow-up patches discuss **collation** effects when comparing encoded **text** (binary UUID comparisons remain the authoritative ordering).
+- **Integration**: Placing the codec in `encode.c` keeps binary encoding formats in one place and mirrors how **base64url** was added.
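+
+The sort-order point can be seen with plain string literals, independent of the proposed format. In the sketch below, the two-character strings are the base32hex numerals for 9, 10, 31, and 32 (values chosen here for illustration), and `COLLATE "C"` forces byte-wise comparison:
+
+```sql
+SELECT s
+FROM (VALUES ('10'), ('0V'), ('09'), ('0A')) AS t(s)
+ORDER BY s COLLATE "C";
+-- 09, 0A, 0V, 10: string order matches numeric order
+```
+
+Because the base32hex alphabet (`0`-`9`, then `A`-`V`) is already in ASCII order, byte-wise comparison of equal-length encoded strings such as UUIDs agrees with comparison of the raw bytes. A linguistic collation does not guarantee this, so pinning the collation is the safer choice when sorting encoded text.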
+
+## Conclusion
+
+Base32hex is a small feature on paper, but the thread is a clear example of PostgreSQL’s preference for **orthogonal primitives**: typed storage (`uuid`), explicit binary views (`bytea`), and shared text encodings (`encode`/`decode`). If you need compact, sort-friendly UUID text, the emerging pattern is **`encode(uuid::bytea, 'base32hex')`** with optional **`rtrim(..., '=')`**, not a parallel family of UUID-only functions—unless you wrap that one-liner for your own schema’s ergonomics.
diff --git a/src/en/2026/README.md b/src/en/2026/README.md
@@ -5,6 +5,7 @@ PostgreSQL Weekly posts for 2026.
 ## Weeks
 
 - [Week 13](/en/2026/13/index.html)
+  - [UUID and Base32hex Encoding](/en/2026/13/uuid-base32hex-encode-decode.html)
   - [JSONPath String Methods: Cleaning JSON Inside the Path—and a Long Debate About Immutability](/en/2026/13/jsonpath-string-methods.html)
 - [Week 12](/en/2026/12/index.html)
   - [Reduce Planning Time for Large NOT IN Lists Containing NULL](/en/2026/12/not-in-null-planning-optimization.html)
