
Commit 0d4d82d

Yicong-Huang authored and HyukjinKwon committed
[SPARK-56584][PYTHON] Generalize RESULT_TYPE_MISMATCH_FOR_ARROW_UDF error class and remove dead SCHEMA_MISMATCH_FOR_ARROW_PYTHON_UDF
### What changes were proposed in this pull request?

Three related cleanups in the PySpark result-verify path:

1. Rename error class `RESULT_TYPE_MISMATCH_FOR_ARROW_UDF` to the more general `RESULT_COLUMN_TYPES_MISMATCH` (parallel to `RESULT_COLUMN_NAMES_MISMATCH` / `RESULT_COLUMN_SCHEMA_MISMATCH`). The error is raised from the generic `verify_arrow_result` path in `python/pyspark/worker.py`, so the name shouldn't mention "ARROW_UDF".
2. Reword the message to align with its siblings:
   - Before: `Columns do not match in their data type: <mismatch>.`
   - After: `Column types of the returned data do not match specified schema. Mismatch: <mismatch>.`
3. Remove the dead error class `SCHEMA_MISMATCH_FOR_ARROW_PYTHON_UDF`. `git grep` confirms no code path raises it, and its message body is identical to `SCHEMA_MISMATCH_FOR_PANDAS_UDF`.

### Why are the changes needed?

Part of [SPARK-55388](https://issues.apache.org/jira/browse/SPARK-55388) (Refactor PythonEvalType processing logic). Cleanup to make error class names and messages consistent across the result-verify path, and to remove dead code.

### Does this PR introduce _any_ user-facing change?

Yes. The user-visible error class name and message for result column type mismatches in Arrow UDFs change. The unreleased `SCHEMA_MISMATCH_FOR_ARROW_PYTHON_UDF` class is removed (no code raised it).

### How was this patch tested?

Updated 4 existing asserts in `test_arrow_grouped_map.py` and `test_arrow_cogrouped_map.py` to match the new message.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #55494 from Yicong-Huang/SPARK-56584.

Authored-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
1 parent: 845a1b5

4 files changed: 13 additions & 14 deletions

python/pyspark/errors/error-conditions.json

Lines changed: 4 additions & 9 deletions
@@ -808,26 +808,21 @@
       "Number of columns of the returned data doesn't match specified schema. Expected: <expected> Actual: <actual>"
     ]
   },
-  "RESULT_ROWS_MISMATCH": {
+  "RESULT_COLUMN_TYPES_MISMATCH": {
     "message": [
-      "The number of output rows (<output_length>) must match the number of input rows (<input_length>)."
+      "Column types of the returned data do not match specified schema. Mismatch: <mismatch>."
     ]
   },
-  "RESULT_TYPE_MISMATCH_FOR_ARROW_UDF": {
+  "RESULT_ROWS_MISMATCH": {
     "message": [
-      "Columns do not match in their data type: <mismatch>."
+      "The number of output rows (<output_length>) must match the number of input rows (<input_length>)."
     ]
   },
   "REUSE_OBSERVATION": {
     "message": [
       "An Observation can be used with a DataFrame only once."
     ]
   },
-  "SCHEMA_MISMATCH_FOR_ARROW_PYTHON_UDF": {
-    "message": [
-      "Result vector from <udf_type> was not the required length: expected <expected>, got <actual>."
-    ]
-  },
   "SCHEMA_MISMATCH_FOR_PANDAS_UDF": {
     "message": [
       "Result vector from <udf_type> was not the required length: expected <expected>, got <actual>."
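The renamed entry uses the same `<placeholder>` template convention as its siblings. As a minimal sketch of how such a template can be filled with `messageParameters` (this is a stand-in, not PySpark's actual error framework, and the dictionary below holds only one hypothetical entry):

```python
import re

# Hypothetical one-entry subset of error-conditions.json; the real file
# under python/pyspark/errors/ contains many more error classes.
ERROR_CONDITIONS = {
    "RESULT_COLUMN_TYPES_MISMATCH": {
        "message": [
            "Column types of the returned data do not match specified schema. "
            "Mismatch: <mismatch>."
        ]
    },
}


def render_message(error_class: str, params: dict) -> str:
    """Substitute each <name> placeholder in the template with params[name]."""
    template = "".join(ERROR_CONDITIONS[error_class]["message"])
    return re.sub(r"<(\w+)>", lambda m: params[m.group(1)], template)


msg = render_message(
    "RESULT_COLUMN_TYPES_MISMATCH",
    {"mismatch": "column 'v' (expected int64, actual string)"},
)
print(msg)
```

Because the replacement is a function rather than a string, `re.sub` inserts the parameter value literally, so values containing backslashes or parentheses need no escaping.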

python/pyspark/sql/tests/arrow/test_arrow_cogrouped_map.py

Lines changed: 4 additions & 2 deletions
@@ -147,7 +147,8 @@ def test_apply_in_arrow_returning_wrong_types(self):
         with self.quiet():
             with self.assertRaisesRegex(
                 PythonException,
-                f"Columns do not match in their data type: {expected}",
+                "Column types of the returned data do not match specified schema. "
+                f"Mismatch: {expected}",
             ):
                 self.cogrouped.applyInArrow(
                     lambda left, right: left, schema=schema
@@ -171,7 +172,8 @@ def test_apply_in_arrow_returning_wrong_types_positional_assignment(self):
         with self.quiet():
             with self.assertRaisesRegex(
                 PythonException,
-                f"Columns do not match in their data type: {expected}",
+                "Column types of the returned data do not match specified schema. "
+                f"Mismatch: {expected}",
             ):
                 self.cogrouped.applyInArrow(
                     lambda left, right: left, schema=schema

python/pyspark/sql/tests/arrow/test_arrow_grouped_map.py

Lines changed: 4 additions & 2 deletions
@@ -171,7 +171,8 @@ def test_apply_in_arrow_returning_wrong_types(self):
         for func_variation in function_variations(lambda table: table):
             with self.assertRaisesRegex(
                 PythonException,
-                f"Columns do not match in their data type: {expected}",
+                "Column types of the returned data do not match specified schema. "
+                f"Mismatch: {expected}",
             ):
                 df.groupby("id").applyInArrow(func_variation, schema=schema).collect()
 
@@ -196,7 +197,8 @@ def test_apply_in_arrow_returning_wrong_types_positional_assignment(self):
         for func_variation in function_variations(lambda table: table):
             with self.assertRaisesRegex(
                 PythonException,
-                f"Columns do not match in their data type: {expected}",
+                "Column types of the returned data do not match specified schema. "
+                f"Mismatch: {expected}",
             ):
                 df.groupby("id").applyInArrow(
                     func_variation, schema=schema
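In the updated asserts, the pattern is split across two adjacent string literals, which Python concatenates into a single regex before `assertRaisesRegex` searches the exception text. A self-contained sketch of that pattern (using `RuntimeError` in place of PySpark's `PythonException`, with a hard-coded mismatch detail instead of one computed from a UDF):

```python
import unittest


class WrongTypesMessageTest(unittest.TestCase):
    def test_new_error_message(self):
        with self.assertRaisesRegex(
            RuntimeError,
            # Adjacent literals concatenate into one pattern; the
            # parentheses are escaped because this string is a regex.
            "Column types of the returned data do not match specified schema. "
            r"Mismatch: column 'v' \(expected int64, actual string\)",
        ):
            raise RuntimeError(
                "Column types of the returned data do not match specified "
                "schema. Mismatch: column 'v' (expected int64, actual string)."
            )


suite = unittest.TestLoader().loadTestsFromTestCase(WrongTypesMessageTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Note that `assertRaisesRegex` uses `re.search`, so the pattern only needs to match a substring of the exception message, not the whole string.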

python/pyspark/worker.py

Lines changed: 1 addition & 1 deletion
@@ -612,7 +612,7 @@ def verify_arrow_result(result, assign_cols_by_name, expected_cols_and_types):
 
     if type_mismatch:
         raise PySparkRuntimeError(
-            errorClass="RESULT_TYPE_MISMATCH_FOR_ARROW_UDF",
+            errorClass="RESULT_COLUMN_TYPES_MISMATCH",
             messageParameters={
                 "mismatch": ", ".join(
                     "column '{}' (expected {}, actual {})".format(name, expected, actual)
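The hunk above only changes the error class name; the `<mismatch>` parameter is still built by joining one formatted fragment per mismatched column. A standalone sketch of that join/format step (the column names, types, and the surrounding comparison are illustrative; in `worker.py` they come from the declared schema and the Arrow batch returned by the UDF):

```python
# Illustrative expected vs. actual (name, type) pairs for the returned columns.
expected_cols_and_types = [("id", "int64"), ("v", "int64")]
actual_cols_and_types = [("id", "int64"), ("v", "string")]

# Collect only the columns whose actual type differs from the expected one.
type_mismatch = [
    (name, expected, actual)
    for (name, expected), (_, actual) in zip(
        expected_cols_and_types, actual_cols_and_types
    )
    if expected != actual
]

# Same join/format step as in verify_arrow_result's messageParameters.
mismatch = ", ".join(
    "column '{}' (expected {}, actual {})".format(name, expected, actual)
    for name, expected, actual in type_mismatch
)
print(mismatch)
```

With multiple mismatched columns, the fragments are comma-separated, which is why the reworded template ends with a single `<mismatch>` placeholder rather than per-column parameters.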
