
[spark] Add 8 aggregate functions (count_distinct, collect_set, count_if, max_by/min_by, bool_and/or, kurtosis)#450

Open
aaron-ang wants to merge 1 commit into
duckdb:mainfrom
aaron-ang:spark-aggregate-functions

Conversation


@aaron-ang aaron-ang commented May 11, 2026

Summary

Adds 8 aggregate functions from PySpark's pyspark.sql.functions (plus 4 aliases) tracked in #14525. All map directly to existing DuckDB SQL aggregates — pure expression wrappers, no new infrastructure.

| PySpark | DuckDB SQL |
| --- | --- |
| `count_distinct` / `countDistinct` | `array_length(array_distinct(list(...)))` (single col) or `array_length(array_distinct(list(CASE WHEN any IS NULL THEN NULL ELSE struct_pack(...) END)))` (multi col) |
| `collect_set` | `array_distinct(list(x))` |
| `count_if` | `count_if(x)` |
| `max_by(col, ord)` | `arg_max(arg, val)` |
| `min_by(col, ord)` | `arg_min(arg, val)` |
| `bool_and` / `every` | `bool_and(x)` |
| `bool_or` / `some` / `any` | `bool_or(x)` |
| `kurtosis` | `kurtosis(x)` |
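As a rough plain-Python model of the aggregate semantics the wrappers rely on (the helper names below are illustrative only — the PR itself just builds SQL expression strings over DuckDB's `arg_max` and `array_distinct(list(...))`):

```python
# Plain-Python sketch of two of the DuckDB aggregates in the table above.
# `None` models SQL NULL; names here are hypothetical, not the PR's code.

def arg_max(rows, arg, val):
    """max_by(col, ord): value of `arg` on the row with the largest `val`.
    Rows where `val` is NULL are ignored, matching DuckDB's arg_max."""
    candidates = [r for r in rows if r[val] is not None]
    return max(candidates, key=lambda r: r[val])[arg] if candidates else None

def collect_set(rows, col):
    """array_distinct(list(x)): distinct non-NULL values, order unspecified."""
    return list({r[col] for r in rows if r[col] is not None})

rows = [
    {"name": "a", "score": 3},
    {"name": "b", "score": 7},
    {"name": "c", "score": None},
    {"name": "a", "score": 3},
]
print(arg_max(rows, "name", "score"))     # -> b
print(sorted(collect_set(rows, "name")))  # -> ['a', 'b', 'c']
```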

Notes:

  • count_distinct(col, *cols) accepts one or more columns. For the multi-column form, tuples where any column is NULL are excluded from the count via a CASE WHEN that nulls out the struct before array_distinct strips it — matching Spark / standard SQL COUNT(DISTINCT col1, col2, ...) semantics.
  • collect_set ordering is non-deterministic and NULLs are excluded — both match Spark semantics. The test sorts results before comparing.
  • any matches Spark's alias of bool_or and shadows the Python builtin at module scope; users importing the namespaced module (from duckdb.experimental.spark.sql import functions as F) call F.any(...) and keep the builtin unaffected.
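The multi-column NULL rule described above can be modeled in a few lines of plain Python (a sketch of the semantics only, not the PR's implementation — the PR encodes this as `CASE WHEN any IS NULL THEN NULL ELSE struct_pack(...) END` in SQL):

```python
# COUNT(DISTINCT col1, col2, ...) NULL semantics: a tuple is excluded
# whenever ANY of its columns is NULL; remaining tuples are deduplicated.
# `None` models SQL NULL. Function name is hypothetical.

def count_distinct_multi(rows):
    """rows: iterable of tuples, one per input row."""
    kept = {t for t in rows if all(v is not None for v in t)}
    return len(kept)

rows = [("x", 1), ("x", 1), ("x", None), (None, 1), ("y", 2)]
print(count_distinct_multi(rows))  # -> 2, counting ('x', 1) and ('y', 2)
```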

Test plan

  • New tests/fast/spark/test_spark_functions_aggregate.py covers all 12 exported names (8 functions + 4 aliases) plus a dedicated multi-column count_distinct case. 13/13 pass.
  • Full tests/fast/spark/ suite on a fresh debug build of this branch: 300+ passed, 4 skipped, 1 xfailed, 0 failures.

@aaron-ang aaron-ang force-pushed the spark-aggregate-functions branch 4 times, most recently from 59b9118 to 3bb6b10 Compare May 11, 2026 03:09
…f, max_by, min_by, bool_and, bool_or, kurtosis

Adds 8 PySpark aggregate functions (plus 3 aliases: countDistinct, every, some)
tracked in duckdb/duckdb#14525:

- count_distinct / countDistinct: array_length(array_distinct(list(x)))
- collect_set: array_distinct(list(x)) (excludes NULL)
- count_if: count_if(x)
- max_by(col, ord): arg_max(arg, val)
- min_by(col, ord): arg_min(arg, val)
- bool_and / every: bool_and(x)
- bool_or / some: bool_or(x)
- kurtosis: kurtosis(x)

Single-column count_distinct only (matching existing approx_count_distinct).
Multi-column variant left for a follow-up due to Spark/SQL NULL-handling semantics.
@aaron-ang aaron-ang force-pushed the spark-aggregate-functions branch from 3bb6b10 to 95ef9f7 Compare May 11, 2026 03:11