
Commit 00852a7

UUID and Base32hex Encoding
1 parent 2fb7325 commit 00852a7

7 files changed

Lines changed: 262 additions & 0 deletions


src/SUMMARY.md

Lines changed: 2 additions & 0 deletions
@@ -4,6 +4,7 @@

 - [2026](./en/2026/README.md)
 - [Week 13](./en/2026/13/README.md)
+- [UUID and Base32hex Encoding](./en/2026/13/uuid-base32hex-encode-decode.md)
 - [JSONPath String Methods: Cleaning JSON Inside the Path—and a Long Debate About Immutability](./en/2026/13/jsonpath-string-methods.md)
 - [Week 12](./en/2026/12/README.md)
 - [Reduce Planning Time for Large NOT IN Lists Containing NULL](./en/2026/12/not-in-null-planning-optimization.md)
@@ -34,6 +35,7 @@

 - [2026](./cn/2026/README.md)
 - [Week 13](./cn/2026/13/README.md)
+- [UUID and base32hex Encoding](./cn/2026/13/uuid-base32hex-encode-decode.md)
 - [JSONPath String Methods: Cleaning JSON Inside the Path, and a Marathon Discussion About Immutability](./cn/2026/13/jsonpath-string-methods.md)
 - [Week 12](./cn/2026/12/README.md)
 - [Reducing Planning Time for Large NOT IN Lists Containing NULL](./cn/2026/12/not-in-null-planning-optimization.md)

src/cn/2026/13/README.md

Lines changed: 1 addition & 0 deletions
@@ -4,4 +4,5 @@

 ## Articles

+- [UUID and base32hex Encoding](./uuid-base32hex-encode-decode.md)
 - [JSONPath String Methods: Cleaning JSON Inside the Path, and a Marathon Discussion About Immutability](./jsonpath-string-methods.md)
Lines changed: 128 additions & 0 deletions
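@@ -0,0 +1,128 @@
# UUID and base32hex Encoding

## Introduction

In URLs, logs, JSON, and similar settings, there is often a need for a shorter UUID text form that is easier to read aloud. Starting in October 2025, a proposal in a [pgsql-hackers thread](https://www.postgresql.org/message-id/flat/CAJ7c6TOramr1UTLcyB128LWMqita1Y7%3Darq3KHaU%3Dqikf5yKOQ%40mail.gmail.com) called for two new built-in functions, `uuid_to_base32hex()` and `base32hex_to_uuid()`, that encode a UUID as an [RFC 4648 Section 7](https://datatracker.ietf.org/doc/html/rfc4648#section-7) **base32hex** string (26 characters). The discussion soon shifted from “is the format any good” to **where this capability belongs in PostgreSQL’s SQL interface**: yet another pair of UUID-specific functions, or a combination of `encode()` / `decode()` and explicit type casts.

## Why It Is Worth Attention

PostgreSQL already has the `uuid` type and functions such as `uuidv7()`, and storage is efficient; the **text form** still matters for exchange between systems, external APIs, and human reading. **Base32hex** keeps the **lexicographic order consistent with the binary order** at the byte level (unlike common base64), is **case-insensitive** when decoding, and does not require distinguishing upper and lower case when read aloud. The discussion also tied the format to real-world uses such as DNSSEC and to **RFC 9562**’s orientation toward “canonical hyphenated representation plus binary storage in the database”, stressing that **fragmented compact text encodings** (every system using its own short encoding) would be an interoperability nightmare.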
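
## Technical Analysis

### Original idea (conceptual)

The original description preserved in the thread (`hi-hackers.txt`) roughly describes:

- **`uuid_to_base32hex(uuid) → text`**: 26 uppercase base32hex characters, no hyphens, no padding; two zero bits are appended to reconcile the 128 bits with base32’s 5-bit alignment.
- **`base32hex_to_uuid(text) → uuid`**: decoding is case-insensitive; invalid input returns NULL.

The author also compared base36 (performance) and Crockford Base32 (weak standard-library support), and favored base32hex.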
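
### The direction the community preferred: composition, not duplication

**Aleksander Alekseev**’s first reply contained both process advice (avoid large `Cc` lists; use `git format-patch`; register in the [Commitfest](https://commitfest.postgresql.org/)) and interface design feedback: standalone UUID encode/decode functions **compose poorly**, and the more robust approach is to:

1. Provide explicit **`uuid ↔ bytea`** conversion;
2. Add a **base32hex** format to **`encode(bytea, ...)` / `decode(text, ...)`**, for example:

```sql
SELECT encode(uuidv7()::bytea, 'base32hex');
```

(The early mail wrote `'base32'`; the format name adopted by later patches is **`base32hex`**, consistent with RFC 4648’s naming.)

**Andrey Borodin** and **Jelte Fennema-Nio** agreed that extending `encode()` fits existing practice better; Jelte cited the introduction of **base64url** in PostgreSQL 18 ([commit e1d917182](https://git.postgresql.org/cgit/postgresql.git/commit/?h=REL_18_0&id=e1d917182c1953b16b32a39ed2fe38e3d0823047)) to show that adding an encoding format at the same entry point is a viable path.

### Where the patch series actually went (overview)

The downloadable patch series (versions such as **v12** can be seen in the thread) gradually converged on:

- Implementing **`encode(bytea, 'base32hex')`** and **`decode(text, 'base32hex')`** in `encode.c` (the encoder uses `=` padding per the RFC; the decoder accepts input with or without padding; case and whitespace handling are as described in the patch notes).
- Documenting the format in `func-binarystring.sgml`, with a recommended way to write a compact UUID:

```sql
rtrim(encode(uuid_value::bytea, 'base32hex'), '=')
```

That is, a **26-character** short form, versus the 36-character canonical UUID text.

- Alongside this: **explicit conversion between `uuid` and `bytea`**, which avoids adding another pair of UUID-specific functions to core for every encoding.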

### SQL examples (illustrative)

The statements below correspond to the usage that emerged in the discussion (`encode` / `decode` with `'base32hex'`, with the UUID entering the encoder via `::bytea`). **They require a PostgreSQL version that includes this feature**; at the time of writing, the relevant patches were still under review.

**1. Raw output with RFC padding**: `encode()` emits `'='` padding; for a 16-byte UUID, the string is longer than 26 characters before `rtrim`:

```sql
SELECT encode('019535d9-3df7-79fb-b466-fa907fa17f9e'::uuid::bytea, 'base32hex') AS padded;
-- 32 characters (padded to a multiple of 8 per RFC 4648; ends with '=')
```

**2. Compact 26-character form**: strip the trailing `=`; suitable for URLs, logs, and so on (same example UUID as in the thread):

```sql
SELECT rtrim(
    encode('019535d9-3df7-79fb-b466-fa907fa17f9e'::uuid::bytea, 'base32hex'),
    '='
) AS short_id;
-- 06AJBM9TUTSVND36VA87V8BVJO
```

**3. Round trip**: `decode()` returns `bytea`, which is then cast back to `uuid`:

```sql
WITH x AS (
    SELECT rtrim(
        encode('019535d9-3df7-79fb-b466-fa907fa17f9e'::uuid::bytea, 'base32hex'),
        '='
    ) AS short_id
)
SELECT short_id,
       decode(short_id, 'base32hex')::uuid AS back_to_uuid
FROM x;
-- back_to_uuid = 019535d9-3df7-79fb-b466-fa907fa17f9e
```

**4. Combined with `uuidv7()`** (when you need time-ordered IDs and a short string for external use):

```sql
SELECT rtrim(encode(uuidv7()::bytea, 'base32hex'), '=') AS short_new_id;
```

### Patch evolution

Early versions featured interfaces along the lines of `uuid_encode` / `uuid_decode`; later versions folded the capability into **`encode`/`decode`**, added regression tests and documentation, and also included notes and small fixes concerning **sorting and collation** (the attachment list contains a separate doc patch).
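
## Community Views

- **The list and visibility**: Aleksander pointed out that if Sergey was not subscribed to the list, his first message might not have been seen by everyone; attaching the original text and recommending a subscription helps the discussion stay self-contained in the archive.

- **Why base32hex**: Sergey summarized the reasons (sort preservation, size, standard-library support, ease of dictation, and simple implementation in JSON settings) and stressed that **short formats going their separate ways** would split the ecosystem.

- **API bloat**: **Masahiko Sawada** felt that piling on another batch of `uuid_*` encoding functions would tend to overlap with the role of `encode()`; he supported **`encode`/`decode` plus conversion between UUID and bytea**, and considered the conversion overhead negligible.

- **Polymorphic `encode` and the signature limits of `decode`**: Sergey asked whether `encode()` could take a `uuid` directly, or whether `decode()` could produce a `uuid` directly. Masahiko explained that PostgreSQL **cannot define two different return types** for `decode(text, text)` **with the same set of argument types**; an **explicit cast plus an inlinable SQL wrapper function** is the pragmatic approach, for example:

```sql
CREATE FUNCTION uuid_to_base32(u uuid) RETURNS text
LANGUAGE SQL IMMUTABLE STRICT
BEGIN ATOMIC
    SELECT encode($1::bytea, 'base32hex');
END;
```

He added that, compared with a hand-written `encode($1::bytea, ...)`, the extra cost lies mainly in the type conversion.

- **Can there be “only one short format”?** Sergey worried about multiple short encodings coexisting; Masahiko pointed out that when integrating heterogeneous systems, developers may still choose **hex** and the like for compatibility. The engineering compromise remains to provide **standard base32hex** in core and to spell out the **26-character UUID** usage in the documentation.

## Technical Details

- **Padding**: per the RFC, `encode()` emits **`=`** padding; stripping the trailing `=` yields the 26-character form from the proposal.
- **Sorting**: base32hex preserves the lexicographic order that corresponds to byte order; if the encoded result is compared as **text**, mind the difference between **collation** and binary comparison.
- **Where it is implemented**: placing the codec in `encode.c` alongside existing formats such as **base64url** fits the habit of managing binary-to-text encodings in one place.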
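
## Closing Remarks

Base32hex itself is not complicated, but this thread shows PostgreSQL’s preference for **orthogonal interfaces**: `uuid` stores the data, `bytea` represents the bytes, and `encode`/`decode` handle text encoding. If you need compact, sort-friendly UUID text, the emerging idiom is **`encode(uuid::bytea, 'base32hex')`**, with **`rtrim(..., '=')`** when needed; if an application wants a shorter one-line call, it can use an **inlinable wrapper function** rather than cramming every encoding into the list of core function names.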

src/cn/2026/README.md

Lines changed: 1 addition & 0 deletions
@@ -5,6 +5,7 @@
 ## Weeks

 - [Week 13](/cn/2026/13/index.html)
+- [UUID and base32hex Encoding](/cn/2026/13/uuid-base32hex-encode-decode.html)
 - [JSONPath String Methods: Cleaning JSON Inside the Path, and a Marathon Discussion About Immutability](/cn/2026/13/jsonpath-string-methods.html)
 - [Week 12](/cn/2026/12/index.html)
 - [Reducing Planning Time for Large NOT IN Lists Containing NULL](/cn/2026/12/not-in-null-planning-optimization.html)

src/en/2026/13/README.md

Lines changed: 1 addition & 0 deletions
@@ -4,4 +4,5 @@ PostgreSQL mailing list discussions for Week 13, 2026.

 ## Articles

+- [UUID and Base32hex Encoding](./uuid-base32hex-encode-decode.md)
 - [JSONPath String Methods: Cleaning JSON Inside the Path—and a Long Debate About Immutability](./jsonpath-string-methods.md)
Lines changed: 128 additions & 0 deletions
@@ -0,0 +1,128 @@
# UUID and Base32hex Encoding

## Introduction

Compact, copy-paste-friendly representations of UUIDs come up in URLs, logs, JSON payloads, and anything humans have to read aloud. A [pgsql-hackers thread](https://www.postgresql.org/message-id/flat/CAJ7c6TOramr1UTLcyB128LWMqita1Y7%3Darq3KHaU%3Dqikf5yKOQ%40mail.gmail.com) that started in October 2025 opened by advocating two new builtins, `uuid_to_base32hex()` and `base32hex_to_uuid()`, which encode UUIDs as 26-character [RFC 4648 Section 7](https://datatracker.ietf.org/doc/html/rfc4648#section-7) **base32hex** strings. What followed was less about the format itself (everyone agreed it is standardized and useful) and more about **where that behavior should live** in PostgreSQL’s SQL surface: separate UUID-specific functions, or the existing `encode()` / `decode()` pipeline plus explicit type conversions.

## Why This Matters

PostgreSQL already stores UUIDs efficiently as a dedicated type and ships `uuidv7()` and friends. Text interchange formats still matter for APIs and cross-system contracts. **Base32hex** preserves the **lexicographic sort order** of the underlying bytes (unlike typical base64), is **case-insensitive for decoding**, and avoids the ambiguity of reading mixed-case strings aloud. The thread connects the format to real-world precedent (DNSSEC encoders, JSON-heavy pipelines) and to the positioning of **RFC 9562**: canonical hyphenated UUID strings for compatibility, binary storage in databases, and a desire (not fully standardized in the RFC text) for *one* compact text encoding to reduce ecosystem fragmentation.
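
The sort-order claim can be checked outside PostgreSQL. The following is a minimal Python sketch, not code from the thread: the helper names `b32hex` and `b64` are made up for illustration, and Python's code-point string comparison stands in for a binary (`C`) collation.

```python
import base64
import os

# Map the standard base32 alphabet onto the RFC 4648 "extended hex" alphabet.
_STD = b"ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"
_HEX = b"0123456789ABCDEFGHIJKLMNOPQRSTUV"
_TO_HEX = bytes.maketrans(_STD, _HEX)


def b32hex(data: bytes) -> str:
    """Unpadded base32hex text for data (hypothetical helper)."""
    return base64.b32encode(data).translate(_TO_HEX).rstrip(b"=").decode("ascii")


def b64(data: bytes) -> str:
    """Standard base64 text for data (hypothetical helper)."""
    return base64.b64encode(data).decode("ascii")


# Random 16-byte values standing in for UUIDs.
values = [os.urandom(16) for _ in range(1000)]

# base32hex: sorting by the encoded text reproduces the byte order.
assert sorted(values, key=b32hex) == sorted(values)

# base64: '/' sorts before 'A' in ASCII, so the order flips here.
lo, hi = b"\x00" * 16, b"\xff" * 16
assert lo < hi
assert b32hex(lo) < b32hex(hi)
assert b64(hi) < b64(lo)
```

The property holds because the base32hex alphabet (`0`-`9`, then `A`-`V`) is itself in ASCII order, so comparing equal-length encodings character by character is the same as comparing the underlying bits.
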

## Technical Analysis

### Original proposal (conceptual)

The opening pitch (preserved in the thread as `hi-hackers.txt`) described:

- **`uuid_to_base32hex(uuid) → text`**: 26 uppercase base32hex characters, no hyphens, no padding; two zero bits are appended so that the 128 bits become 130, a multiple of the 5 bits each base32 character carries.
- **`base32hex_to_uuid(text) → uuid`**: case-insensitive decode; invalid input yields NULL.

Sergey Prokhorenko also argued against alternatives such as base36 (performance) and Crockford Base32 (weaker presence in standard libraries), positioning base32hex as the practical compact choice.
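
The bit arithmetic behind the 26-character form can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the proposed C implementation: the function names merely mirror the proposal, returning `None` stands in for SQL NULL, and rejecting non-zero pad bits is an assumption the pitch does not spell out.

```python
import uuid

ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUV"  # RFC 4648 base32hex


def uuid_to_base32hex(u: uuid.UUID) -> str:
    # 128 bits plus two trailing zero bits give 130 bits = 26 groups of 5.
    n = u.int << 2
    return "".join(ALPHABET[(n >> (5 * i)) & 31] for i in range(25, -1, -1))


def base32hex_to_uuid(text: str):
    # Case-insensitive; None plays the role of NULL for invalid input.
    text = text.upper()
    if len(text) != 26 or any(c not in ALPHABET for c in text):
        return None
    n = 0
    for c in text:
        n = (n << 5) | ALPHABET.index(c)
    if n & 0b11:  # assumption: the two pad bits must be zero
        return None
    return uuid.UUID(int=n >> 2)


u = uuid.UUID("019535d9-3df7-79fb-b466-fa907fa17f9e")
print(uuid_to_base32hex(u))  # 06AJBM9TUTSVND36VA87V8BVJO
print(base32hex_to_uuid("06ajbm9tutsvnd36va87v8bvjo") == u)  # True
```
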

### Community direction: compose, do not duplicate

**Aleksander Alekseev**’s first reply mixed process advice (avoid mass `Cc:` lists; use `git format-patch`; register the patch on [Commitfest](https://commitfest.postgresql.org/)) with API feedback: standalone UUID helpers are **not composable**. He suggested instead:

1. Explicit **`uuid ↔ bytea`** casts (or conversions).
2. Extending **`encode(bytea, ...)`** and **`decode(text, ...)`** with a **base32hex** format, so usage looks like:

```sql
SELECT encode(uuidv7()::bytea, 'base32hex');
```

(The early mail said `'base32'`; the implemented format name in later patches is **`base32hex`**, matching RFC 4648’s “base32hex” alphabet.)

**Andrey Borodin** and **Jelte Fennema-Nio** agreed that extending `encode()` matches existing practice; Jelte pointed to **base64url** support added in PostgreSQL 18 ([commit e1d917182](https://git.postgresql.org/cgit/postgresql.git/commit/?h=REL_18_0&id=e1d917182c1953b16b32a39ed2fe38e3d0823047)) as a precedent for adding encodings to the same entry points.

### What the patch series converged on (high level)

The downloadable series (revisions through **v12** in the thread attachments) converges on:

- **`encode(data bytea, 'base32hex')`** and **`decode(text, 'base32hex')`** implemented in `encode.c`, with RFC 4648 padding on encode and a tolerant decode (padded or unpadded; case-insensitive; whitespace ignored; see the patch headers for the exact rules).
- Documentation under `func-binarystring.sgml` describing base32hex and explicitly recommending:

```sql
rtrim(encode(uuid_value::bytea, 'base32hex'), '=')
```

for a **26-character** compact UUID string versus 36 characters in canonical hex-with-hyphens form.

- Companion work: **explicit casting between `uuid` and `bytea`**, so the `encode()` path does not require ad-hoc UUID-only builtins.
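
The padded-encode, tolerant-decode behavior described above can be approximated with Python's standard library. This is a sketch only: `encode_base32hex` and `decode_base32hex` are hypothetical names, and the whitespace handling here (stripping only leading and trailing whitespace) is an assumption rather than the patch's exact rule.

```python
import base64
import uuid

_STD = b"ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"
_HEX = b"0123456789ABCDEFGHIJKLMNOPQRSTUV"


def encode_base32hex(data: bytes) -> str:
    """Padded base32hex text, as RFC 4648 describes."""
    table = bytes.maketrans(_STD, _HEX)
    return base64.b32encode(data).translate(table).decode("ascii")


def decode_base32hex(text: str) -> bytes:
    """Accept padded or unpadded, upper- or lowercase input."""
    body = text.strip().rstrip("=").upper().encode("ascii")
    if any(c not in _HEX for c in body):
        raise ValueError("invalid base32hex character")
    std = body.translate(bytes.maketrans(_HEX, _STD))
    std += b"=" * (-len(std) % 8)  # restore the padding b32decode expects
    return base64.b32decode(std)


u = uuid.UUID("019535d9-3df7-79fb-b466-fa907fa17f9e")
padded = encode_base32hex(u.bytes)
short = padded.rstrip("=")
print(len(padded), len(short))  # 32 26
print(uuid.UUID(bytes=decode_base32hex(short.lower())) == u)  # True
```

Stripping the `=` characters in the caller mirrors the documented `rtrim(...)` recipe: the encoder stays RFC-conformant and the compact form is an explicit, opt-in step.
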

### SQL examples (illustrative)

The snippets below match the **API shape** discussed on the list (`encode` / `decode` with `'base32hex'` and `uuid::bytea`). They assume a PostgreSQL build that includes that support; the thread’s patches were still under review when this post was written.

**1. Raw RFC output (with padding)**: `encode()` emits `'='` padding; for a 16-byte UUID the string is longer than 26 characters until trimmed:

```sql
SELECT encode('019535d9-3df7-79fb-b466-fa907fa17f9e'::uuid::bytea, 'base32hex') AS padded;
-- 32 characters: 26 data characters followed by 6 '=' (RFC 4648 pads to a multiple of 8)
```

**2. Compact 26-character form**: strip the padding for URLs and logs (same example UUID as in the original thread):

```sql
SELECT rtrim(
    encode('019535d9-3df7-79fb-b466-fa907fa17f9e'::uuid::bytea, 'base32hex'),
    '='
) AS short_id;
-- 06AJBM9TUTSVND36VA87V8BVJO
```

**3. Round trip**: `decode()` yields `bytea`; cast back to `uuid`:

```sql
WITH x AS (
    SELECT rtrim(
        encode('019535d9-3df7-79fb-b466-fa907fa17f9e'::uuid::bytea, 'base32hex'),
        '='
    ) AS short_id
)
SELECT short_id,
       decode(short_id, 'base32hex')::uuid AS back_to_uuid
FROM x;
-- back_to_uuid = 019535d9-3df7-79fb-b466-fa907fa17f9e
```

**4. New IDs with `uuidv7()`** (when you want time-ordered values and a short external form):

```sql
SELECT rtrim(encode(uuidv7()::bytea, 'base32hex'), '=') AS short_new_id;
```

### Patch evolution

Early revisions explored UUID-specific APIs in the style of `uuid_encode` / `uuid_decode`; later revisions fold the behavior into **`encode`/`decode`**, rebase the cast support, add regression tests, and iterate on the documentation, including notes on **collation** and **sortability** of encoded text (additional small doc patches appear in the attachment list).

## Community Insights

- **Process and visibility**: Aleksander noted that Sergey was not subscribed to the list, so the original message did not reach all readers; subscribing and attaching the full proposal helped restore context.

- **Why base32hex**: Sergey summarized the advantages: sort preservation, compactness, wide library support, easy dictation without case ambiguity, and simple codecs for JSON-centric systems. He also framed it as **standardization pressure**: many ad-hoc “short UUID” schemes risk incompatibility; RFC 9562 authors reportedly favored a single compact encoding, even if the full compact-text story did not ship inside that RFC’s timeline.

- **API surface area**: **Masahiko Sawada** argued that adding another `uuid_*` encoding function beside `encode()` invites a proliferation of one-off functions; he gave a **`+1` to `encode`/`decode` plus `uuid`/`bytea` conversion**, with negligible cost for the cast.

- **Design tension, polymorphism vs. casts**: Sergey asked whether `encode()` could take `uuid` directly and whether `decode()` could return `uuid` without going through `bytea`. Masahiko explained that PostgreSQL cannot overload `decode(text, text)` with two different result types for the same signature; **casts plus inline SQL wrappers** are the idiomatic compromise. He gave an **inlineable** example:

```sql
CREATE FUNCTION uuid_to_base32(u uuid) RETURNS text
LANGUAGE SQL IMMUTABLE STRICT
BEGIN ATOMIC
    SELECT encode($1::bytea, 'base32hex');
END;
```

and noted that the runtime difference versus calling `encode(...)` with an explicit cast is essentially a small conversion cost.

- **Sergey’s “one short format” concern**: He warned against a world where different systems pick Crockford base32, base36, unsorted base64, and so on. Masahiko countered that **developers still choose** encodings when integrating heterogeneous stacks, for example hex when every component supports it. The thread’s engineering answer is still to expose **one standard base32hex** in core and document the **26-character** UUID recipe.

## Technical Details

- **Padding**: RFC-style `encode()` output includes **`=`** padding to a multiple of 8 characters; trimming yields the compact 26-character UUID form used in the original proposal.
- **Sort order**: Base32hex preserves byte order for lexicographic comparisons; documentation and follow-up patches discuss **collation** effects when comparing encoded **text** (binary UUID comparisons remain the authoritative ordering).
- **Integration**: Placing the codec in `encode.c` keeps binary encoding formats in one place and mirrors how **base64url** was added.
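
The 32 and 26 character counts follow from simple arithmetic: each character carries 5 bits, and RFC 4648 pads the output to whole 8-character blocks (one block per 5 input bytes). A small Python sketch, with the hypothetical helper name `base32_text_lengths`:

```python
def base32_text_lengths(n_bytes: int) -> tuple:
    """Return (padded_length, unpadded_length) of base32 text for n_bytes."""
    unpadded = (n_bytes * 8 + 4) // 5   # ceil(bits / 5) data characters
    padded = (n_bytes + 4) // 5 * 8     # whole 8-character blocks
    return padded, unpadded


print(base32_text_lengths(16))  # (32, 26) for a 16-byte UUID
```
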

## Conclusion

Base32hex is a small feature on paper, but the thread is a clear example of PostgreSQL’s preference for **orthogonal primitives**: typed storage (`uuid`), explicit binary views (`bytea`), and shared text encodings (`encode`/`decode`). If you need compact, sort-friendly UUID text, the emerging pattern is **`encode(uuid::bytea, 'base32hex')`** with optional **`rtrim(..., '=')`**, not a parallel family of UUID-only functions, unless you wrap that one-liner for your own schema’s ergonomics.

src/en/2026/README.md

Lines changed: 1 addition & 0 deletions
@@ -5,6 +5,7 @@ PostgreSQL Weekly posts for 2026.
 ## Weeks

 - [Week 13](/en/2026/13/index.html)
+- [UUID and Base32hex Encoding](/en/2026/13/uuid-base32hex-encode-decode.html)
 - [JSONPath String Methods: Cleaning JSON Inside the Path—and a Long Debate About Immutability](/en/2026/13/jsonpath-string-methods.html)
 - [Week 12](/en/2026/12/index.html)
 - [Reduce Planning Time for Large NOT IN Lists Containing NULL](/en/2026/12/not-in-null-planning-optimization.html)
