Skip to content

Commit a46c383

Browse files
gauravsinghaniajesrypandawa
andauthored
doc: rfc for python udf (#129)
* doc: rfc for python udf * docs: update rfc Co-authored-by: jesrypandawa <jesry.p.pandawa@gmail.com>
1 parent 8eebcf6 commit a46c383

1 file changed

Lines changed: 71 additions & 0 deletions

File tree

Lines changed: 71 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,71 @@
1+
## Motivation
2+
Dagger users include developers, analysts, data scientists, etc. For users to use Dagger, they can add new capabilities by defining their own functions commonly referred to as UDFs. Currently, Dagger only supports java as the language for the UDFs. To democratize the process of creating and maintaining the UDFs we want to add support for python.
3+
4+
## Requirement
5+
Support for adding Python UDF on Dagger
6+
End-to-end flow on adding and using Python UDF on Dagger.
7+
8+
9+
## Python User Defined Function
10+
There are two kinds of Python UDF that can be registered on Dagger:
11+
* General Python UDF
12+
* Vectorized Python UDF
13+
14+
15+
It shares a similar way as the general user-defined functions on how to define vectorized user-defined functions. Users only need to add an extra parameter func_type="pandas" in the decorator udf or udaf to mark it as a vectorized user-defined function.
16+
17+
18+
Type | General Python UDF | Vectorized Python UDF
19+
--- | --- | --- |
20+
Data Processing Method | One piece of data is processed each time a UDF is called | Multiple pieces of data are processed each time a UDF is called
21+
Serialization/Deserialization | Serialization and Deserialization are required for each piece of data on the Java side and Python side| The data transmission format between Java and Python is based on Apache Arrow: <ul><li> Pandas supports Apache Arrow natively, so serialization and deserialization are not required on Python side</li><li>On the Java side, vectorized optimization is possible, and serialization/deserialization can be avoided as much as possible</li></ul>|
22+
|Exection Performance|Poor|Good<ul><li>Vectorized execution is of high efficiency</li><li>High-performance python UDF can be implemented based on high performance libraries such as pandas and numpy</li></ul>
23+
24+
Note:
25+
26+
When using vectorized udf, Flink will convert the messages to pandas.series, and the udf will use that as an input and the output also pandas.series. The pandas.series size for input and output should be the same.
27+
28+
## Configuration
29+
There are a few configurations that required for using python UDF, and also options we can adjust for optimization.
30+
31+
Configuration that will be added on Dagger codebase:
32+
| Key | Default | Type | Example
33+
| --- | --- | --- | ---- |
34+
|PYTHON_UDF_ENABLE|false|Boolean|false|
35+
|PYTHON_UDF_CONFIG|(none)|String|{"PYTHON_FILES":"/path/to/files.zip", "PYTHON_REQUIREMENTS": "requirements.txt", "PYTHON_FN_EXECUTION_BUNDLE_SIZE": "1000"}|
36+
37+
The following variables than can be configurable on `PYTHON_UDF_CONFIG`:
38+
| Key | Default | Type | Example
39+
| --- | --- | --- | ---- |
40+
|PYTHON_ARCHIVES|(none)|String|/path/to/data.zip|
41+
|PYTHON_FILES|(none)|String|/path/to/files.zip|
42+
|PYTHON_REQUIREMENTS|(none)|String|/path/to/requirements.txt|
43+
|PYTHON_FN_EXECUTION_ARROW_BATCH_SIZE|10000|Integer|10000|
44+
|PYTHON_FN_EXECUTION_BUNDLE_SIZE|100000|Integer|100000|
45+
|PYTHON_FN_EXECUTION_BUNDLE_TIME|1000|Long|1000|
46+
47+
48+
## Registering the Udf
49+
Dagger will automatically register the python udf as long as the files meets the following criteria:
50+
* Python file names should be the same with its function method
51+
Example:
52+
53+
sample.py
54+
```
55+
from pyflink.table import DataTypes
56+
from pyflink.table.udf import udf
57+
58+
59+
@udf(result_type=DataTypes.STRING())
60+
def sample(word: str):
61+
return word + "_test"
62+
```
63+
* Avoid adding duplicate `.py` filenames. e.g: `__init__.py`
64+
65+
66+
## Release the Udf
67+
List of udfs for dagger, will be added on directory `dagger-py-functions` include with its test, data files that are used on the udf, and the udf dependency(requirements.txt).
68+
All of these files will be bundled to single zip file and uploaded to assets on release.
69+
70+
## Reference
71+
[Flink Python User Defined Functions](https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/python/table/udfs/overview/)

0 commit comments

Comments
 (0)