[spark] Add 8 aggregate functions (count_distinct, collect_set, count_if, max_by/min_by, bool_and/or, kurtosis)#450
Open
aaron-ang wants to merge 1 commit into
…f, max_by, min_by, bool_and, bool_or, kurtosis

Adds 8 PySpark aggregate functions (plus 3 aliases: countDistinct, every, some) tracked in duckdb/duckdb#14525:

- count_distinct / countDistinct: array_length(array_distinct(list(x)))
- collect_set: array_distinct(list(x)) (excludes NULL)
- count_if: count_if(x)
- max_by(col, ord): arg_max(arg, val)
- min_by(col, ord): arg_min(arg, val)
- bool_and / every: bool_and(x)
- bool_or / some: bool_or(x)
- kurtosis: kurtosis(x)

Single-column count_distinct only (matching the existing approx_count_distinct). The multi-column variant is left for a follow-up due to Spark/SQL NULL-handling semantics.
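The commit message describes each function as a direct rewrite onto an existing DuckDB SQL aggregate. A minimal illustrative sketch of that "pure expression wrapper" pattern follows; the helper names and the string-building approach are assumptions for illustration, not the PR's actual implementation:

```python
# Hypothetical sketch of the expression-wrapper pattern: each Spark
# aggregate becomes a string rewrite onto an existing DuckDB aggregate.
# These helper names are illustrative, not the PR's real API.

def count_distinct_sql(col: str) -> str:
    # Spark count_distinct(x) -> array_length(array_distinct(list(x)))
    return f"array_length(array_distinct(list({col})))"

def collect_set_sql(col: str) -> str:
    # Spark collect_set(x) -> array_distinct(list(x));
    # per the commit message, NULLs are excluded, matching Spark.
    return f"array_distinct(list({col}))"

def max_by_sql(col: str, ord: str) -> str:
    # Spark max_by(col, ord) -> DuckDB arg_max(arg, val)
    return f"arg_max({col}, {ord})"

print(count_distinct_sql("x"))   # array_length(array_distinct(list(x)))
print(max_by_sql("price", "ts"))  # arg_max(price, ts)
```

Because the wrappers only compose existing aggregates, no new execution infrastructure is required, which is what the summary below means by "pure expression wrappers."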
Summary
Adds 8 PySpark `pyspark.sql.functions` aggregates (plus 3 common aliases) tracked in #14525. All map directly to existing DuckDB SQL aggregates — pure expression wrappers, no new infrastructure.

| Spark function | DuckDB mapping |
| --- | --- |
| `count_distinct` / `countDistinct` | `array_length(array_distinct(list(...)))` (single col) or `array_length(array_distinct(list(CASE WHEN any IS NULL THEN NULL ELSE struct_pack(...) END)))` (multi col) |
| `collect_set` | `array_distinct(list(x))` |
| `count_if` | `count_if(x)` |
| `max_by(col, ord)` | `arg_max(arg, val)` |
| `min_by(col, ord)` | `arg_min(arg, val)` |
| `bool_and` / `every` | `bool_and(x)` |
| `bool_or` / `some` / `any` | `bool_or(x)` |
| `kurtosis` | `kurtosis(x)` |

Notes:
- `count_distinct(col, *cols)` accepts one or more columns. For the multi-column form, tuples where any column is NULL are excluded from the count via a `CASE WHEN` that nulls out the struct before `array_distinct` strips it — matching Spark / standard SQL `COUNT(DISTINCT col1, col2, ...)` semantics.
- `collect_set` order is non-deterministic and NULL is excluded; both match Spark semantics. The test sorts before comparing.
- `any` matches Spark's alias of `bool_or` and shadows the Python builtin at module scope; users importing the namespaced module (`from duckdb.experimental.spark.sql import functions as F`) call `F.any(...)` and keep the builtin unaffected.

Test plan
`tests/fast/spark/test_spark_functions_aggregate.py` covers all 12 exported names (8 functions + 4 aliases) plus a dedicated multi-column `count_distinct` case; 13/13 pass. The full `tests/fast/spark/` suite on a fresh debug build of this branch: 300+ passed, 4 skipped, 1 xfailed, 0 failures.
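The multi-column NULL-exclusion rule described in the notes can be simulated in plain Python. This is only a model of the semantics (exclude a tuple whenever any of its columns is NULL, then count distinct survivors), not the SQL rewrite itself:

```python
# Plain-Python model of multi-column COUNT(DISTINCT col1, col2, ...):
# a row is dropped if ANY column is NULL (None), mirroring the
# CASE WHEN ... IS NULL THEN NULL ELSE struct_pack(...) END rewrite,
# after which array_distinct strips the NULLed-out structs.

def count_distinct_multi(rows):
    kept = {t for t in rows if not any(v is None for v in t)}
    return len(kept)

rows = [
    (1, "a"),
    (1, "a"),     # duplicate: counted once
    (1, None),    # NULL in one column: excluded entirely
    (None, "a"),  # excluded
    (2, "b"),
]
print(count_distinct_multi(rows))  # -> 2
```

This matches Spark / standard SQL behavior, where `COUNT(DISTINCT a, b)` ignores any tuple containing a NULL rather than treating NULL as a distinct value.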