RFC 025: Typed sketch logical values #51

@dannymeijer

Use this issue to track InQL RFC 025, proposed at docs/rfcs/025_typed_sketch_logical_values.md.

Area

  • Specification (RFCs)
  • Package & tests
  • Documentation

Summary

RFC 025 defines typed sketch logical values for InQL. Sketch helpers must not be modeled as ordinary strings or binary blobs; they must produce and consume logical sketch values that record sketch family, input value domain, parameterization, merge compatibility, and serialization format identity.

Motivation

RFC 023 covers portable approximate aggregates that directly return scalar estimates. Stored and mergeable sketch state is a separate problem: authors may want to materialize a sketch, merge sketches across partitions or files, estimate from a stored sketch later, or serialize sketch state for transport. Those operations require compatibility rules that bytes or str alone cannot represent.

Without a typed sketch value model, InQL cannot reject invalid operations such as merging a HyperLogLog sketch with a KLL sketch, merging sketches over incompatible value domains, or deserializing a payload with an incompatible format version. That would push semantic validation into backend-specific runtime failures and weaken the Substrait boundary.

Proposal sketch

Define sketch logical values as first-class typed values. A sketch value should carry at least:

  • sketch family identity, such as HyperLogLog, KLL, theta, count-min, or bitmap;
  • input value domain, such as string identifiers, integer identifiers, numeric values, or categorical values;
  • family parameters that affect merge compatibility, such as precision, accuracy, nominal entries, width/depth, seed, or ordering policy;
  • format identity and version when the value can be serialized;
  • nullability and ordinary column-position metadata needed by existing InQL expression and relation surfaces.
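As a rough illustration of the fields listed above, a sketch type descriptor could be modeled as an immutable record. This is a minimal sketch, not the RFC's design: the names `SketchFamily`, `SketchType`, `domain`, `params`, `format_id`, and `format_version` are hypothetical, and the public type spelling is still an open question in the RFC draft.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class SketchFamily(Enum):
    """Sketch families named in this issue (illustrative, not exhaustive)."""
    HYPERLOGLOG = "hyperloglog"
    KLL = "kll"
    THETA = "theta"
    COUNT_MIN = "count_min"
    BITMAP = "bitmap"


@dataclass(frozen=True)
class SketchType:
    """Hypothetical logical type descriptor for a sketch-valued column."""
    family: SketchFamily
    domain: str                           # input value domain, e.g. "string"
    params: Tuple[Tuple[str, int], ...]   # merge-relevant (name, value) pairs
    format_id: Optional[str] = None       # set only when serializable
    format_version: Optional[int] = None
    nullable: bool = True


# Two descriptors that differ in family, domain, and parameters.
hll_ids = SketchType(SketchFamily.HYPERLOGLOG, "string", (("precision", 14),))
kll_vals = SketchType(SketchFamily.KLL, "float64", (("k", 200),))
```

Because the record is frozen and built from hashable fields, two descriptors compare equal only when every field matches, which gives a simple starting point for compatibility checks.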

Sketch construction helpers that summarize rows should be aggregate measures. Merge helpers should validate compatibility of sketch family, value domain, and merge-relevant parameters before lowering. Estimate helpers should declare their scalar result type. Serialization and deserialization should be explicit and should require enough metadata to identify family, domain, parameters, and format version.
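The merge and estimate rules could be checked at typecheck time along these lines. This is a sketch under stated assumptions: `SketchType`, `check_merge`, `estimate_result_type`, and the result-type table are hypothetical names and values, and strict equality of parameters is a simplification, since which parameters are merge-compatible is still an open RFC question.

```python
from dataclasses import dataclass
from typing import Tuple


class SketchTypeError(TypeError):
    """Raised before lowering when sketch operands are incompatible."""


@dataclass(frozen=True)
class SketchType:
    family: str                          # e.g. "hyperloglog", "kll"
    domain: str                          # e.g. "string", "float64"
    params: Tuple[Tuple[str, int], ...]  # merge-relevant parameters


# Assumed scalar result types for estimate helpers (illustrative only).
ESTIMATE_RESULT_TYPES = {"hyperloglog": "int64", "kll": "float64"}


def check_merge(left: SketchType, right: SketchType) -> SketchType:
    """Return the merged sketch type, or raise if the operands cannot merge."""
    if left.family != right.family:
        raise SketchTypeError(f"cannot merge {left.family} with {right.family}")
    if left.domain != right.domain:
        raise SketchTypeError(f"domain mismatch: {left.domain} vs {right.domain}")
    if left.params != right.params:
        raise SketchTypeError(f"parameter mismatch: {left.params} vs {right.params}")
    return left


def estimate_result_type(sketch: SketchType) -> str:
    """Look up the declared scalar result type for an estimate helper."""
    if sketch.family not in ESTIMATE_RESULT_TYPES:
        raise SketchTypeError(f"no estimate helper declared for {sketch.family}")
    return ESTIMATE_RESULT_TYPES[sketch.family]
```

With this shape, merging a HyperLogLog sketch with a KLL sketch fails in `check_merge` as a type error rather than surfacing later as a backend-specific runtime failure.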

The Substrait boundary must preserve sketch logical type identity through extension type metadata or reject unsupported operations before execution. Backend adapters may implement, emulate, or reject sketch operations, but they must not redefine ordinary binary or string expressions as sketch states.
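The "preserve or reject" rule at the lowering boundary might look roughly like the following. This is an illustration only: the dictionary stands in for extension type metadata and is not the Substrait protobuf API, and `lower_sketch_type`, `UnsupportedSketchOperation`, the capability set, and the URN are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


class UnsupportedSketchOperation(Exception):
    """Raised before execution when a backend cannot handle a sketch type."""


@dataclass(frozen=True)
class SketchType:
    family: str
    domain: str
    params: Tuple[Tuple[str, int], ...]


# Placeholder identifier; the real extension naming is not defined here.
HYPOTHETICAL_EXTENSION_URN = "urn:example:inql:sketch"


def lower_sketch_type(sketch: SketchType, backend_families: FrozenSet[str]) -> dict:
    """Describe the sketch type as extension metadata, or reject it up front."""
    if sketch.family not in backend_families:
        raise UnsupportedSketchOperation(
            f"backend does not support sketch family {sketch.family!r}"
        )
    # Keep family, domain, and parameters so the logical identity survives
    # lowering instead of collapsing into a plain binary type.
    return {
        "extension": HYPOTHETICAL_EXTENSION_URN,
        "name": f"sketch_{sketch.family}",
        "parameters": {"domain": sketch.domain, **dict(sketch.params)},
    }
```

The point of the shape is that rejection happens during lowering, driven by reported capabilities, and that a successful lowering never emits an ordinary binary or string type in place of the sketch type.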

Alternatives considered

  • Treat sketches as bytes: rejected because it prevents typechecking merge compatibility and moves semantic errors into backend runtime failures.
  • Expose only scalar approximate aggregates: rejected as the sole long-term answer because storing and merging sketches is a legitimate analytics need.
  • Copy one backend's sketch catalog directly: rejected because InQL needs backend-neutral semantics and capability reporting.
  • Make sketch values ordinary structs: rejected unless the struct carries a distinct logical type; ordinary structs do not by themselves encode family-specific compatibility rules.

Impact / compatibility

This RFC is additive. RFC 023 approximate scalar-result aggregates remain valid. Existing string or binary columns should not be retroactively treated as sketch values. If a backend or existing dataset stores sketch bytes, authors should use explicit deserialization with the required sketch type metadata.
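Explicit deserialization of stored sketch bytes could be sketched as follows. The helper never infers a sketch type from the payload; the caller must declare it. All names here (`deserialize_sketch`, `SketchValue`, `SUPPORTED_FORMATS`, the `"example-hll"` format identifier) are hypothetical, and the payload is carried opaquely rather than parsed.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


class SketchFormatError(ValueError):
    """Raised when declared format metadata is missing or unsupported."""


@dataclass(frozen=True)
class SketchType:
    family: str
    domain: str
    params: Tuple[Tuple[str, int], ...]
    format_id: Optional[str] = None
    format_version: Optional[int] = None


@dataclass(frozen=True)
class SketchValue:
    type: SketchType
    payload: bytes


# Assumed set of (format_id, version) pairs this build can read.
SUPPORTED_FORMATS = {("example-hll", 1)}


def deserialize_sketch(payload: bytes, declared: SketchType) -> SketchValue:
    """Wrap raw bytes as a typed sketch value using caller-declared metadata."""
    if not isinstance(payload, (bytes, bytearray)):
        raise TypeError("sketch payload must be bytes")
    if declared.format_id is None or declared.format_version is None:
        raise SketchFormatError("format identity and version are required")
    if (declared.format_id, declared.format_version) not in SUPPORTED_FORMATS:
        raise SketchFormatError(
            f"unsupported format {declared.format_id} v{declared.format_version}"
        )
    return SketchValue(type=declared, payload=bytes(payload))
```

Under this shape an existing binary column stays a binary column until an author opts in with full type metadata, which matches the additive intent described above.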

This affects the InQL specification, public sketch helper metadata, typechecking/diagnostics for sketch-valued expressions, Substrait lowering or rejection, backend adapter capability handling, and documentation that distinguishes typed approximate state from backend-specific blobs.

Implementation notes (optional)

Resolve the RFC 025 Draft questions before implementation: public type spelling, first sketch family, merge-compatible parameters, serialization portability, and whether sketch values are legal in table schemas before InQL has a broader logical type model beyond primitive row columns.

This is intentionally split from RFC 023 so PRs implementing portable approximate aggregates do not smuggle in untyped sketch payload semantics.

Checklist

  • I checked for an existing RFC or issue covering this.
  • I can describe how this impacts existing code and how to migrate (if needed).

Metadata

Labels

  • RFC: RFC design and planning
  • documentation: Improvements or additions to documentation
  • package: Library source, tests, incan.toml
  • specification: docs/rfcs/ normative RFCs
