
Commit 0d4d82d

Yicong-Huang authored and HyukjinKwon committed
[SPARK-56584][PYTHON] Generalize RESULT_TYPE_MISMATCH_FOR_ARROW_UDF error class and remove dead SCHEMA_MISMATCH_FOR_ARROW_PYTHON_UDF
### What changes were proposed in this pull request?

Three related cleanups in the PySpark result-verify path:

1. Rename error class `RESULT_TYPE_MISMATCH_FOR_ARROW_UDF` to the more general `RESULT_COLUMN_TYPES_MISMATCH` (parallel to `RESULT_COLUMN_NAMES_MISMATCH` / `RESULT_COLUMN_SCHEMA_MISMATCH`). The error is raised from the generic `verify_arrow_result` path in `python/pyspark/worker.py`, so the name shouldn't mention "ARROW_UDF".
2. Reword the message to align with its siblings:
   - Before: `Columns do not match in their data type: <mismatch>.`
   - After: `Column types of the returned data do not match specified schema. Mismatch: <mismatch>.`
3. Remove the dead error class `SCHEMA_MISMATCH_FOR_ARROW_PYTHON_UDF`. `git grep` confirms no code path raises it, and its message body is identical to `SCHEMA_MISMATCH_FOR_PANDAS_UDF`.

### Why are the changes needed?

Part of [SPARK-55388](https://issues.apache.org/jira/browse/SPARK-55388) (Refactor PythonEvalType processing logic). Cleanup to make error class names and messages consistent across the result-verify path, and to remove dead code.

### Does this PR introduce _any_ user-facing change?

Yes. The user-visible error class name and message for result column type mismatches in Arrow UDFs change. The unreleased `SCHEMA_MISMATCH_FOR_ARROW_PYTHON_UDF` class is removed (no code raised it).

### How was this patch tested?

Updated 4 existing asserts in `test_arrow_grouped_map.py` and `test_arrow_cogrouped_map.py` to match the new message.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #55494 from Yicong-Huang/SPARK-56584.

Authored-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
1 parent: 845a1b5

4 files changed: 13 additions & 14 deletions

python/pyspark/errors/error-conditions.json

Lines changed: 4 additions & 9 deletions
@@ -808,26 +808,21 @@
       "Number of columns of the returned data doesn't match specified schema. Expected: <expected> Actual: <actual>"
     ]
   },
-  "RESULT_ROWS_MISMATCH": {
+  "RESULT_COLUMN_TYPES_MISMATCH": {
     "message": [
-      "The number of output rows (<output_length>) must match the number of input rows (<input_length>)."
+      "Column types of the returned data do not match specified schema. Mismatch: <mismatch>."
     ]
   },
-  "RESULT_TYPE_MISMATCH_FOR_ARROW_UDF": {
+  "RESULT_ROWS_MISMATCH": {
     "message": [
-      "Columns do not match in their data type: <mismatch>."
+      "The number of output rows (<output_length>) must match the number of input rows (<input_length>)."
     ]
   },
   "REUSE_OBSERVATION": {
     "message": [
       "An Observation can be used with a DataFrame only once."
     ]
   },
-  "SCHEMA_MISMATCH_FOR_ARROW_PYTHON_UDF": {
-    "message": [
-      "Result vector from <udf_type> was not the required length: expected <expected>, got <actual>."
-    ]
-  },
   "SCHEMA_MISMATCH_FOR_PANDAS_UDF": {
     "message": [
       "Result vector from <udf_type> was not the required length: expected <expected>, got <actual>."
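The renamed entry uses the same `<placeholder>` template convention as its siblings. As a minimal sketch of how such a template can be filled with `messageParameters` (this is a stand-in, not PySpark's actual error framework, and the dictionary below holds only one hypothetical entry):

```python
import re

# Hypothetical one-entry subset of error-conditions.json; the real file
# under python/pyspark/errors/ contains many more error classes.
ERROR_CONDITIONS = {
    "RESULT_COLUMN_TYPES_MISMATCH": {
        "message": [
            "Column types of the returned data do not match specified schema. "
            "Mismatch: <mismatch>."
        ]
    },
}


def render_message(error_class: str, params: dict) -> str:
    """Substitute each <name> placeholder in the template with params[name]."""
    template = "".join(ERROR_CONDITIONS[error_class]["message"])
    return re.sub(r"<(\w+)>", lambda m: params[m.group(1)], template)


msg = render_message(
    "RESULT_COLUMN_TYPES_MISMATCH",
    {"mismatch": "column 'v' (expected int64, actual string)"},
)
print(msg)
```

Because the replacement is a function rather than a string, `re.sub` inserts the parameter value literally, so values containing backslashes or parentheses need no escaping.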

python/pyspark/sql/tests/arrow/test_arrow_cogrouped_map.py

Lines changed: 4 additions & 2 deletions
@@ -147,7 +147,8 @@ def test_apply_in_arrow_returning_wrong_types(self):
         with self.quiet():
             with self.assertRaisesRegex(
                 PythonException,
-                f"Columns do not match in their data type: {expected}",
+                "Column types of the returned data do not match specified schema. "
+                f"Mismatch: {expected}",
             ):
                 self.cogrouped.applyInArrow(
                     lambda left, right: left, schema=schema
@@ -171,7 +172,8 @@ def test_apply_in_arrow_returning_wrong_types_positional_assignment(self):
         with self.quiet():
             with self.assertRaisesRegex(
                 PythonException,
-                f"Columns do not match in their data type: {expected}",
+                "Column types of the returned data do not match specified schema. "
+                f"Mismatch: {expected}",
             ):
                 self.cogrouped.applyInArrow(
                     lambda left, right: left, schema=schema

python/pyspark/sql/tests/arrow/test_arrow_grouped_map.py

Lines changed: 4 additions & 2 deletions
@@ -171,7 +171,8 @@ def test_apply_in_arrow_returning_wrong_types(self):
         for func_variation in function_variations(lambda table: table):
             with self.assertRaisesRegex(
                 PythonException,
-                f"Columns do not match in their data type: {expected}",
+                "Column types of the returned data do not match specified schema. "
+                f"Mismatch: {expected}",
             ):
                 df.groupby("id").applyInArrow(func_variation, schema=schema).collect()
 
@@ -196,7 +197,8 @@ def test_apply_in_arrow_returning_wrong_types_positional_assignment(self):
         for func_variation in function_variations(lambda table: table):
             with self.assertRaisesRegex(
                 PythonException,
-                f"Columns do not match in their data type: {expected}",
+                "Column types of the returned data do not match specified schema. "
+                f"Mismatch: {expected}",
             ):
                 df.groupby("id").applyInArrow(
                     func_variation, schema=schema
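In the updated asserts, the pattern is split across two adjacent string literals, which Python concatenates into a single regex before `assertRaisesRegex` searches the exception text. A self-contained sketch of that pattern (using `RuntimeError` in place of PySpark's `PythonException`, with a hard-coded mismatch detail instead of one computed from a UDF):

```python
import unittest


class WrongTypesMessageTest(unittest.TestCase):
    def test_new_error_message(self):
        with self.assertRaisesRegex(
            RuntimeError,
            # Adjacent literals concatenate into one pattern; the
            # parentheses are escaped because this string is a regex.
            "Column types of the returned data do not match specified schema. "
            r"Mismatch: column 'v' \(expected int64, actual string\)",
        ):
            raise RuntimeError(
                "Column types of the returned data do not match specified "
                "schema. Mismatch: column 'v' (expected int64, actual string)."
            )


suite = unittest.TestLoader().loadTestsFromTestCase(WrongTypesMessageTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Note that `assertRaisesRegex` uses `re.search`, so the pattern only needs to match a substring of the exception message, not the whole string.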

python/pyspark/worker.py

Lines changed: 1 addition & 1 deletion
@@ -612,7 +612,7 @@ def verify_arrow_result(result, assign_cols_by_name, expected_cols_and_types):
 
     if type_mismatch:
         raise PySparkRuntimeError(
-            errorClass="RESULT_TYPE_MISMATCH_FOR_ARROW_UDF",
+            errorClass="RESULT_COLUMN_TYPES_MISMATCH",
             messageParameters={
                 "mismatch": ", ".join(
                     "column '{}' (expected {}, actual {})".format(name, expected, actual)
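The hunk above only changes the error class name; the `<mismatch>` parameter is still built by joining one formatted fragment per mismatched column. A standalone sketch of that join/format step (the column names, types, and the surrounding comparison are illustrative; in `worker.py` they come from the declared schema and the Arrow batch returned by the UDF):

```python
# Illustrative expected vs. actual (name, type) pairs for the returned columns.
expected_cols_and_types = [("id", "int64"), ("v", "int64")]
actual_cols_and_types = [("id", "int64"), ("v", "string")]

# Collect only the columns whose actual type differs from the expected one.
type_mismatch = [
    (name, expected, actual)
    for (name, expected), (_, actual) in zip(
        expected_cols_and_types, actual_cols_and_types
    )
    if expected != actual
]

# Same join/format step as in verify_arrow_result's messageParameters.
mismatch = ", ".join(
    "column '{}' (expected {}, actual {})".format(name, expected, actual)
    for name, expected, actual in type_mismatch
)
print(mismatch)
```

With multiple mismatched columns, the fragments are comma-separated, which is why the reworded template ends with a single `<mismatch>` placeholder rather than per-column parameters.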
