diff --git a/docs/source/user-guide/dataframe/arrow-interface.rst b/docs/source/user-guide/dataframe/arrow-interface.rst
new file mode 100644
index 000000000..e8ec0cc6b
--- /dev/null
+++ b/docs/source/user-guide/dataframe/arrow-interface.rst
@@ -0,0 +1,123 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+Arrow Interface
+===============
+
+DataFusion DataFrames can stream results to Arrow-based Python libraries
+without first materializing all batches in memory. This page focuses on the
+Arrow interfaces exposed by ``DataFrame`` and on when to use each one.
+
+Zero-copy Streaming
+-------------------
+
+DataFusion DataFrames implement the ``__arrow_c_stream__`` protocol, enabling
+zero-copy, lazy streaming into Arrow-based Python libraries. With the streaming
+protocol, batches are produced on demand.
+
+.. note::
+
+   The protocol is implementation-agnostic and works with any Python library
+   that understands the Arrow C stream interface (for example, PyArrow or
+   other Arrow-compatible implementations). The sections below provide a
+   short PyArrow-specific example and general guidance for other
+   implementations.
+
+PyArrow
+-------
+
+.. code-block:: python
+
+    import pyarrow as pa
+
+    # Create a PyArrow RecordBatchReader without materializing all batches
+    reader = pa.RecordBatchReader.from_stream(df)
+    for batch in reader:
+        ... # process each batch as it is produced
+
+DataFrames are also iterable, yielding :class:`datafusion.RecordBatch`
+objects lazily, so you can loop over results directly without importing
+PyArrow:
+
+.. code-block:: python
+
+    for batch in df:
+        ... # each batch is a ``datafusion.RecordBatch``
+
+Each batch exposes ``to_pyarrow()``, which converts that single batch to its
+PyArrow equivalent. To instead collect the entire DataFrame eagerly into one
+PyArrow table, use ``pa.table(df)``:
+
+.. code-block:: python
+
+    import pyarrow as pa
+
+    table = pa.table(df)
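+
+The two interfaces compose: if you want a single table but also need to touch
+each batch on the way, iterate lazily, convert each batch with
+``to_pyarrow()``, and assemble the results at the end. A minimal sketch (it
+assumes the query yields at least one batch; with zero batches,
+``pa.Table.from_batches`` requires an explicit schema):
+
+.. code-block:: python
+
+    import pyarrow as pa
+
+    # Convert each DataFusion RecordBatch to a PyArrow RecordBatch as it is
+    # produced, then combine the accumulated batches into a single table.
+    batches = [batch.to_pyarrow() for batch in df]
+    table = pa.Table.from_batches(batches)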
+
+Asynchronous iteration is supported as well, allowing integration with
+``asyncio`` event loops:
+
+.. code-block:: python
+
+    async for batch in df:
+        ... # process each batch as it is produced
+
+Execute as Stream
+-----------------
+
+For finer control over streaming execution, use
+:py:meth:`~datafusion.DataFrame.execute_stream` to obtain a
+:py:class:`datafusion.RecordBatchStream`:
+
+.. code-block:: python
+
+    stream = df.execute_stream()
+    for batch in stream:
+        ... # process each batch as it is produced
+
+.. tip::
+
+   To get a PyArrow reader instead, call
+   ``pa.RecordBatchReader.from_stream(df)``.
+
+When partition boundaries are important,
+:py:meth:`~datafusion.DataFrame.execute_stream_partitioned` returns an
+iterable of :py:class:`datafusion.RecordBatchStream` objects, one per
+partition:
+
+.. code-block:: python
+
+    for stream in df.execute_stream_partitioned():
+        for batch in stream:
+            ... # each stream yields RecordBatches
+
+To process partitions concurrently, first collect the streams into a list and
+then poll each one in a separate ``asyncio`` task:
+
+.. code-block:: python
+
+    import asyncio
+
+    async def consume(stream):
+        async for batch in stream:
+            ...
+
+    streams = list(df.execute_stream_partitioned())
+    await asyncio.gather(*(consume(s) for s in streams))
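+
+Note that the bare ``await`` above only works where an event loop is already
+running (for example, in a Jupyter notebook). In a plain script, wrap the same
+logic in a coroutine and hand it to ``asyncio.run``. A sketch, with the
+``main`` and ``consume`` names purely illustrative:
+
+.. code-block:: python
+
+    import asyncio
+
+    async def consume(stream):
+        async for batch in stream:
+            ...
+
+    async def main():
+        streams = list(df.execute_stream_partitioned())
+        # Poll every partition's stream concurrently on one event loop
+        await asyncio.gather(*(consume(s) for s in streams))
+
+    asyncio.run(main())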
+
+See :doc:`../io/arrow` for additional details on the Arrow interface.
diff --git a/docs/source/user-guide/dataframe/index.rst b/docs/source/user-guide/dataframe/index.rst
index 8475a7bd7..07a173f45 100644
--- a/docs/source/user-guide/dataframe/index.rst
+++ b/docs/source/user-guide/dataframe/index.rst
@@ -28,6 +28,18 @@ filtering, projection, aggregation, joining, and more.
 A DataFrame represents a logical plan that is lazily evaluated. The actual execution occurs
 only when terminal operations like ``collect()``, ``show()``, or ``to_pandas()`` are called.
 
+This page follows that lifecycle: first creating a DataFrame, then building a
+query plan with common operations, and finally executing the plan to inspect or
+export results. More specialized topics live on their own pages:
+
+* :doc:`../common-operations/index` covers filtering, joins, aggregations,
+  functions, windows, and UDFs in more detail.
+* :doc:`arrow-interface` explains Arrow streaming and the ``__arrow_c_stream__``
+  interface.
+* :doc:`rendering` covers display behavior in notebooks and terminals.
+* :doc:`execution-metrics` explains how to inspect runtime statistics after a
+  DataFrame executes.
+
 Creating DataFrames
 -------------------
 
@@ -203,120 +215,20 @@ To materialize the results of your DataFrame operations:
 
     # Collect a single column of data as a PyArrow Array
    arr = df.collect_column("age")
 
-Zero-copy streaming to Arrow-based Python libraries
----------------------------------------------------
-
-DataFusion DataFrames implement the ``__arrow_c_stream__`` protocol, enabling
-zero-copy, lazy streaming into Arrow-based Python libraries. With the streaming
-protocol, batches are produced on demand.
-
-.. note::
-
-   The protocol is implementation-agnostic and works with any Python library
-   that understands the Arrow C streaming interface (for example, PyArrow
-   or other Arrow-compatible implementations). The sections below provide a
-   short PyArrow-specific example and general guidance for other
-   implementations.
-
-PyArrow
--------
-
-.. code-block:: python
-
-    import pyarrow as pa
-
-    # Create a PyArrow RecordBatchReader without materializing all batches
-    reader = pa.RecordBatchReader.from_stream(df)
-    for batch in reader:
-        ... # process each batch as it is produced
-
-DataFrames are also iterable, yielding :class:`datafusion.RecordBatch`
-objects lazily so you can loop over results directly without importing
-PyArrow:
-
-.. code-block:: python
-
-    for batch in df:
-        ... # each batch is a ``datafusion.RecordBatch``
-
-Each batch exposes ``to_pyarrow()``, allowing conversion to a PyArrow
-table. ``pa.table(df)`` collects the entire DataFrame eagerly into a
-PyArrow table:
-
-.. code-block:: python
-
-    import pyarrow as pa
-    table = pa.table(df)
-
-Asynchronous iteration is supported as well, allowing integration with
-``asyncio`` event loops:
-
-.. code-block:: python
-
-    async for batch in df:
-        ... # process each batch as it is produced
-
-To work with the stream directly, use ``execute_stream()``, which returns a
-:class:`~datafusion.RecordBatchStream`.
-
-.. code-block:: python
-
-    stream = df.execute_stream()
-    for batch in stream:
-        ...
-
-Execute as Stream
-^^^^^^^^^^^^^^^^^
-
-For finer control over streaming execution, use
-:py:meth:`~datafusion.DataFrame.execute_stream` to obtain a
-:py:class:`datafusion.RecordBatchStream`:
-
-.. code-block:: python
-
-    stream = df.execute_stream()
-    for batch in stream:
-        ... # process each batch as it is produced
-
-.. tip::
-
-   To get a PyArrow reader instead, call
-
-   ``pa.RecordBatchReader.from_stream(df)``.
-
-When partition boundaries are important,
-:py:meth:`~datafusion.DataFrame.execute_stream_partitioned`
-returns an iterable of :py:class:`datafusion.RecordBatchStream` objects, one per
-partition:
-
-.. code-block:: python
-
-    for stream in df.execute_stream_partitioned():
-        for batch in stream:
-            ... # each stream yields RecordBatches
-
-To process partitions concurrently, first collect the streams into a list
-and then poll each one in a separate ``asyncio`` task:
-
-.. code-block:: python
-
-    import asyncio
-
-    async def consume(stream):
-        async for batch in stream:
-            ...
-
-    streams = list(df.execute_stream_partitioned())
-    await asyncio.gather(*(consume(s) for s in streams))
-
-See :doc:`../io/arrow` for additional details on the Arrow interface.
-
-HTML Rendering
+Related Topics
 --------------
 
-When working in Jupyter notebooks or other environments that support HTML rendering, DataFrames will
-automatically display as formatted HTML tables. For detailed information about customizing HTML
-rendering, formatting options, and advanced styling, see :doc:`rendering`.
+DataFusion DataFrames implement the ``__arrow_c_stream__`` protocol and can
+also be displayed in notebooks, streamed to Arrow-based libraries, and
+inspected after execution. These topics are covered separately so this
+overview can stay focused on the DataFrame lifecycle:
+
+* :doc:`arrow-interface` covers lazy PyArrow interop, direct
+  ``RecordBatchStream`` usage, and partitioned stream processing.
+* :doc:`rendering` covers notebook and terminal display behavior, formatting
+  options, and custom styling.
+* :doc:`execution-metrics` covers per-operator runtime statistics after
+  execution, including row counts and compute time.
 
 Core Classes
 ------------
@@ -365,16 +277,9 @@ DataFusion provides many built-in functions for data manipulation:
 For a complete list of available functions, see the
 :py:mod:`datafusion.functions` module documentation.
 
-Execution Metrics
------------------
-
-After executing a DataFrame (via ``collect()``, ``execute_stream()``, etc.),
-DataFusion populates per-operator runtime statistics such as row counts and
-compute time. See :doc:`execution-metrics` for a full explanation and
-worked example.
-
 .. toctree::
    :maxdepth: 1
 
+   arrow-interface
    rendering
    execution-metrics