
[spark] Add 8 aggregate functions (count_distinct, collect_set, count_if, max_by/min_by, bool_and/or, kurtosis)#450

Open
aaron-ang wants to merge 1 commit into
duckdb:mainfrom
aaron-ang:spark-aggregate-functions

Conversation


@aaron-ang aaron-ang commented May 11, 2026

Summary

Adds 8 aggregate functions from PySpark's pyspark.sql.functions (plus 4 aliases) tracked in #14525. All map directly to existing DuckDB SQL aggregates — pure expression wrappers, no new infrastructure.

| PySpark | DuckDB SQL |
| --- | --- |
| `count_distinct` / `countDistinct` | `array_length(array_distinct(list(...)))` (single col) or `array_length(array_distinct(list(CASE WHEN any IS NULL THEN NULL ELSE struct_pack(...) END)))` (multi col) |
| `collect_set` | `array_distinct(list(x))` |
| `count_if` | `count_if(x)` |
| `max_by(col, ord)` | `arg_max(arg, val)` |
| `min_by(col, ord)` | `arg_min(arg, val)` |
| `bool_and` / `every` | `bool_and(x)` |
| `bool_or` / `some` / `any` | `bool_or(x)` |
| `kurtosis` | `kurtosis(x)` |
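As a rough plain-Python model of the aggregate semantics the wrappers rely on (the helper names below are illustrative only — the PR itself just builds SQL expression strings over DuckDB's `arg_max` and `array_distinct(list(...))`):

```python
# Plain-Python sketch of two of the DuckDB aggregates in the table above.
# `None` models SQL NULL; names here are hypothetical, not the PR's code.

def arg_max(rows, arg, val):
    """max_by(col, ord): value of `arg` on the row with the largest `val`.
    Rows where `val` is NULL are ignored, matching DuckDB's arg_max."""
    candidates = [r for r in rows if r[val] is not None]
    return max(candidates, key=lambda r: r[val])[arg] if candidates else None

def collect_set(rows, col):
    """array_distinct(list(x)): distinct non-NULL values, order unspecified."""
    return list({r[col] for r in rows if r[col] is not None})

rows = [
    {"name": "a", "score": 3},
    {"name": "b", "score": 7},
    {"name": "c", "score": None},
    {"name": "a", "score": 3},
]
print(arg_max(rows, "name", "score"))     # -> b
print(sorted(collect_set(rows, "name")))  # -> ['a', 'b', 'c']
```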

Notes:

  • count_distinct(col, *cols) accepts one or more columns. For the multi-column form, tuples where any column is NULL are excluded from the count via a CASE WHEN that nulls out the struct before array_distinct strips it — matching Spark / standard SQL COUNT(DISTINCT col1, col2, ...) semantics.
  • collect_set ordering is non-deterministic and NULLs are excluded — both match Spark semantics. The test sorts results before comparing.
  • any matches Spark's alias of bool_or and shadows the Python builtin at module scope; users importing the namespaced module (from duckdb.experimental.spark.sql import functions as F) call F.any(...) and keep the builtin unaffected.
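The multi-column NULL rule described above can be modeled in a few lines of plain Python (a sketch of the semantics only, not the PR's implementation — the PR encodes this as `CASE WHEN any IS NULL THEN NULL ELSE struct_pack(...) END` in SQL):

```python
# COUNT(DISTINCT col1, col2, ...) NULL semantics: a tuple is excluded
# whenever ANY of its columns is NULL; remaining tuples are deduplicated.
# `None` models SQL NULL. Function name is hypothetical.

def count_distinct_multi(rows):
    """rows: iterable of tuples, one per input row."""
    kept = {t for t in rows if all(v is not None for v in t)}
    return len(kept)

rows = [("x", 1), ("x", 1), ("x", None), (None, 1), ("y", 2)]
print(count_distinct_multi(rows))  # -> 2, counting ('x', 1) and ('y', 2)
```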

Test plan

  • New tests/fast/spark/test_spark_functions_aggregate.py covers all 12 exported names (8 functions + 4 aliases) plus a dedicated multi-column count_distinct case. 13/13 pass.
  • Full tests/fast/spark/ suite on a fresh debug build of this branch: 300+ passed, 4 skipped, 1 xfailed, 0 failures.

@aaron-ang aaron-ang force-pushed the spark-aggregate-functions branch 4 times, most recently from 59b9118 to 3bb6b10 Compare May 11, 2026 03:09
…f, max_by, min_by, bool_and, bool_or, kurtosis

Adds 8 PySpark aggregate functions (plus 3 aliases: countDistinct, every, some)
tracked in duckdb/duckdb#14525:

- count_distinct / countDistinct: array_length(array_distinct(list(x)))
- collect_set: array_distinct(list(x)) (excludes NULL)
- count_if: count_if(x)
- max_by(col, ord): arg_max(arg, val)
- min_by(col, ord): arg_min(arg, val)
- bool_and / every: bool_and(x)
- bool_or / some: bool_or(x)
- kurtosis: kurtosis(x)

Single-column count_distinct only (matching existing approx_count_distinct).
Multi-column variant left for a follow-up due to Spark/SQL NULL-handling semantics.
@aaron-ang aaron-ang force-pushed the spark-aggregate-functions branch from 3bb6b10 to 95ef9f7 Compare May 11, 2026 03:11