12 changes: 10 additions & 2 deletions docs/language/reference/builders/aggregates.md
@@ -9,13 +9,15 @@ Current aggregate authoring is explicit and scalar-expression-based.
| `col` | `def col(name: str) -> ColumnExpr` | Column reference builder used by aggregates, filters, and projections. |
| `lit` | `def lit(value: int \| float \| str \| bool) -> ColumnExpr` | Canonical scalar literal helper. |
| `sum` | `def sum(expr: ColumnExpr) -> AggregateMeasure` | Sum one scalar expression. |
| `count` | `def count(*exprs: ColumnExpr) -> AggregateMeasure` | Count rows with no argument, or count non-null expression values with one argument. |
| `count` | `def count() -> AggregateMeasure`; `def count(expr: ColumnExpr) -> AggregateMeasure` | Count rows with no argument, or count non-null expression values with one argument. |
| `count_expr` | `def count_expr(expr: ColumnExpr) -> AggregateMeasure` | Compatibility spelling for `count(expr)`. |
| `count_distinct` | `def count_distinct(expr: ColumnExpr) -> AggregateMeasure` | Count distinct non-null expression values. |
| `count_if` | `def count_if(predicate: ColumnExpr) -> AggregateMeasure` | Count rows where the predicate is true. |
| `avg` | `def avg(expr: ColumnExpr) -> AggregateMeasure` | Average one numeric scalar expression. |
| `min` | `def min(expr: ColumnExpr) -> AggregateMeasure` | Return the minimum non-null value for one orderable scalar expression. |
| `max` | `def max(expr: ColumnExpr) -> AggregateMeasure` | Return the maximum non-null value for one orderable scalar expression. |
| `approx_count_distinct` | `def approx_count_distinct(expr: ColumnExpr) -> AggregateMeasure` | Estimate distinct non-null expression values. |
| `approx_percentile` | `def approx_percentile(expr: ColumnExpr, percentile: float, accuracy: int = 10000) -> AggregateMeasure` | Estimate one percentile over numeric non-null values. |

## Modifiers

@@ -30,7 +32,7 @@ Aggregate measures support method-style modifiers:
## Example

```incan
from pub::inql.functions import add, avg, col, count, count_distinct, count_if, eq, lit, max, min, str_lit, sum
from pub::inql.functions import add, approx_count_distinct, approx_percentile, avg, col, count, count_distinct, count_if, eq, lit, max, min, str_lit, sum

grouped = orders.group_by([col("customer_id")]).agg([
    sum(add(col("amount"), lit(5))),
@@ -42,6 +44,8 @@ grouped = orders.group_by([col("customer_id")]).agg([
    avg(col("amount")),
    min(col("created_at")),
    max(col("created_at")),
    approx_count_distinct(col("user_id")),
    approx_percentile(col("latency_ms"), 0.95),
])
```

@@ -55,5 +59,9 @@ grouped = orders.group_by([col("customer_id")]).agg([
- `count_if(predicate)` is compatibility sugar for `count().filter(predicate)`. Rows where the predicate is false or
null do not contribute to the aggregate.
- `sum`, `avg`, `min`, and `max` skip null values. They return backend-null results when no non-null input value exists.
- `approx_count_distinct` and `approx_percentile` are approximate aggregate choices. They allow aggregate-local filters
but reject extra `DISTINCT` and ordered input in the portable contract.
- `approx_percentile` output names include percentile and accuracy parameters so two percentile estimates over the same
expression do not collapse into the same output column name.
- Unsupported aggregate modifiers fail at lowering or backend planning; they are not ignored.
- Future `.column` sugar and scoped aggregate symbols should lower to this same surface rather than replacing its semantics.
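
The output-naming note above can be illustrated with a minimal Python sketch. The function `percentile_output_name` and its name format are hypothetical and only show why carrying the percentile and accuracy in the name keeps two estimates over the same expression apart; the names InQL actually generates may look different.

```python
def percentile_output_name(expr_name: str, percentile: float, accuracy: int = 10000) -> str:
    """Build an output column name for one percentile estimate.

    Hypothetical naming scheme for illustration only; it is not InQL's
    actual generated-name format.
    """
    # Every parameter that changes the result is part of the name.
    return f"approx_percentile({expr_name}, {percentile!r}, {accuracy})"


# Two estimates over the same expression get different names.
p50 = percentile_output_name("latency_ms", 0.5)
p95 = percentile_output_name("latency_ms", 0.95)
assert p50 != p95

# A name built from the expression alone would collapse both into one column.
assert "latency_ms" in p50 and "latency_ms" in p95
```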
43 changes: 43 additions & 0 deletions docs/language/reference/functions/approximate.md
@@ -0,0 +1,43 @@
# Approximate Functions (Reference)

Approximate helpers are explicit opt-in functions. InQL does not silently replace exact aggregates with approximate
execution merely because a backend supports it.

The portable RFC 023 aggregate surface is:

| Function | Meaning |
| --- | --- |
| `approx_count_distinct(expr)` | Estimate the number of distinct non-null values produced by one expression. |
| `approx_percentile(expr, percentile, accuracy=10000)` | Estimate one percentile over numeric non-null values. |

```incan
from pub::inql.functions import approx_count_distinct, approx_percentile, col

summary = (
    events
    .group_by([col("campaign_id")])
    .agg([
        approx_count_distinct(col("user_id")),
        approx_percentile(col("latency_ms"), 0.95),
    ])
)
```

`approx_count_distinct` is registered as an approximate aggregate with HyperLogLog-family metadata. The portable author
contract is an approximate non-null distinct-count estimate. It does not expose a user-tunable relative-error parameter
because the registered InQL Substrait extension mapping for this function is unary. Backend adapters must keep this
approximation visible in capability/error handling rather than redefining exact `count_distinct` semantics.

`approx_percentile` is registered as an approximate aggregate with t-digest-family metadata. `percentile` must be between
`0.0` and `1.0` inclusive. `accuracy` must be positive and is carried as an explicit aggregate argument so backend
capability handling can accept, emulate, or reject the requested approximation instead of silently changing semantics.
Generated aggregate output names include the percentile and accuracy arguments.
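
The argument constraints described above can be sketched as a small Python check. `check_percentile_args` is a hypothetical helper written for illustration; where and how InQL actually validates these arguments is not stated here, so treat this as a sketch of the rule and not of the implementation.

```python
def check_percentile_args(percentile: float, accuracy: int = 10000) -> None:
    """Raise ValueError when the arguments fall outside the documented ranges.

    Hypothetical helper: it mirrors the documented rules (percentile in
    [0.0, 1.0] inclusive, accuracy strictly positive) and nothing more.
    """
    # The chained comparison also rejects NaN, since every comparison with NaN is False.
    if not 0.0 <= percentile <= 1.0:
        raise ValueError(f"percentile must be within [0.0, 1.0], got {percentile!r}")
    if accuracy <= 0:
        raise ValueError(f"accuracy must be positive, got {accuracy!r}")


# Both boundary values are accepted because the range is inclusive.
check_percentile_args(0.0)
check_percentile_args(1.0)
check_percentile_args(0.95, accuracy=100)
```

Writing a percentile as `95` where `0.95` is meant is an easy mistake to make, and this check rejects it as out of range.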

Both helpers lower through registered InQL Substrait aggregate extension names. The DataFusion adapter maps
`approx_count_distinct` to DataFusion's `approx_distinct` implementation and maps `approx_percentile` to
`approx_percentile_cont` at the backend boundary.
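
The name rewrite at the backend boundary can be pictured as a lookup table owned by the adapter. The Python sketch below uses the two mappings named above; the table `DATAFUSION_NAMES` and the function `datafusion_name` are hypothetical and do not reflect the adapter's actual code.

```python
# Portable InQL aggregate name -> DataFusion implementation name.
# The two entries come from the paragraph above; the table itself is illustrative.
DATAFUSION_NAMES = {
    "approx_count_distinct": "approx_distinct",
    "approx_percentile": "approx_percentile_cont",
}


def datafusion_name(portable_name: str) -> str:
    """Return the backend function name for a portable aggregate name.

    Names without a rewrite entry pass through unchanged in this sketch.
    """
    return DATAFUSION_NAMES.get(portable_name, portable_name)


assert datafusion_name("approx_count_distinct") == "approx_distinct"
assert datafusion_name("approx_percentile") == "approx_percentile_cont"
```

Keeping the table inside the adapter means authors and the portable plan only ever see the InQL names.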

Sketch-state construction, merge, estimate, serialization, and deserialization helpers are not exposed as lowerable
portable functions in RFC 023. They are delegated to InQL RFC 025, which defines typed sketch logical values with sketch
family, value domain, merge compatibility, and serialized format identity. Exposing those helpers as strings or binary
payloads would violate the RFC 023 type-safety requirement.
2 changes: 2 additions & 0 deletions docs/language/reference/functions/index.md
@@ -11,6 +11,7 @@ Today the concrete shipped surfaces are documented here:
- [Nested data functions](nested.md)
- [Window functions](windows.md)
- [Format functions](format.md)
- [Approximate functions](approximate.md)

The canonical scalar literal helper is `lit(...)`. Typed literal helpers construct the same scalar-expression representation.

@@ -41,5 +42,6 @@ The registered helper surface currently includes:
| `md5(...)`, `sha1(...)`, `sha224(...)`, `sha256(...)`, `sha384(...)`, `sha512(...)`, `sha2(...)`, `crc32(...)`, `xxhash64(...)`, JSON helpers, CSV helpers, URL helpers | scalar | registered RFC 022 format helpers; concrete helpers lower through Substrait extension mappings, while `sha2(...)` rewrites to a supported concrete SHA-2 helper |
| `asc(...)`, `desc(...)`, `asc_nulls_first(...)`, `asc_nulls_last(...)`, `desc_nulls_first(...)`, `desc_nulls_last(...)` | ordering | structural sort-field helpers consumed by `order_by(...)` and lowered to Substrait `SortRel.sorts` |
| `sum(...)`, `count(...)`, `count_expr(...)`, `count_distinct(...)`, `count_if(...)`, `avg(...)`, `min(...)`, `max(...)` | aggregate | registered Substrait extension functions for core aggregates plus compatibility rewrites for `count_expr(...)`, `count_distinct(...)`, and `count_if(...)`; core aggregates allow `DISTINCT` and aggregate-local `FILTER` where the aggregate shape is valid |
| `approx_count_distinct(...)`, `approx_percentile(...)` | aggregate | registered approximate aggregate extension functions; both are explicit approximate choices and keep DataFusion implementation-name rewrites inside the backend adapter |

Future ANSI-style families should grow under this section instead of bloating `dataset_types` or `dataset_methods`.
3 changes: 3 additions & 0 deletions docs/release_notes/v0_1.md
@@ -18,6 +18,9 @@ Entries will be filled in as work lands (link RFCs and PRs when applicable).
- **Generator functions:** RFC 021 adds registry-backed generator applications for `explode(...)`, `explode_outer(...)`, `posexplode(...)`, `posexplode_outer(...)`, `inline(...)`, `inline_outer(...)`, portable `flatten(...)`, and `stack(...)`. Generators remain relation-shaping operations applied with `generate(...)`; they preserve input columns, require explicit output aliases, lower through the current Substrait extension-relation gap encoding, and execute through the DataFusion Session adapter with concrete output-column materialization.
- **Window functions:** RFC 019 adds `window()` specs, explicit row/range frame bounds, ranking and distribution helpers (`row_number`, `rank`, `dense_rank`, `percent_rank`, `cume_dist`, `ntile`), offset and value helpers (`lag`, `lead`, `first_value`, `last_value`, `nth_value`), and aggregate-over-window placement through `with_window_column(...)`. Portable window helpers require explicit ordering where appropriate, lower through Substrait `ConsistentPartitionWindowRel`, and execute through the DataFusion session adapter.
- **Format functions:** RFC 022 adds scalar payload helpers for deterministic hashes (`md5`, `sha1`, `sha224`, `sha256`, `sha384`, `sha512`, `sha2`, `crc32`, and `xxhash64`), URL parsing/encoding/decoding, JSON validation/path/schema helpers, and CSV row/schema helpers. Format helpers lower through registry-owned Substrait metadata; the DataFusion adapter executes the full helper set with native functions where available and Incan-authored adapter callbacks for non-native helpers.
- **Approximate functions:** RFC 023 adds explicit approximate aggregate helpers for `approx_count_distinct(...)` and
`approx_percentile(...)`. They carry approximation policy in registry metadata, lower through InQL-owned Substrait
extension names, and keep DataFusion implementation-name rewrites inside the backend adapter.
- **Function registry:** RFC 014 adds declaration-site registry decorators for the current public helper surface, including stable function references, checked signature projection, lifecycle metadata, behavior categories, alias policy, Substrait mapping categories, and checked API metadata drift validation.
- **Function extension policy:** InQL RFC 024 policy metadata now distinguishes portable core functions, namespaced extension-only functions, opt-in compatibility aliases, engine-specific functions, and rejected compatibility requests without adding an extension plugin system or backend-owned semantics.
- **Projection:** builder-based `with_column`, `add`, `mul`, and literal expression helpers now lower derived columns through Prism, Substrait, and Session execution.