Un-ignore test #452 (Open)

Rich-T-kid wants to merge 2 commits into datafusion-contrib:main from Rich-T-kid:rich-T-kid/un-ignore-test

Conversation

Contributor

@Rich-T-kid Rich-T-kid commented May 13, 2026

see #451
Color-coded view of the originally ignored tests (screenshots):
Image 5-13-26 at 2 28 PM
Image 5-13-26 at 2 28 PM (1)

Change log

  • Added PerTestConfig in src/test_utils/property_based.rs with two booleans:
    • uses_undeterministic_limit_operator — LIMIT without ORDER BY (q17).
    • non_deterministic_sort — ORDER BY with unbroken ties at the LIMIT boundary (q21, q22, q24, q31, q32).
  • compare_result_set takes &PerTestConfig and short-circuits to Ok(()) when either flag is set, with a comment explaining that a strict set comparison would false-positive on SQL-legal answer variation.
  • Each call site is now explicit; only flagged queries opt out.
  • Threaded the same param through test_tpcds_query for symmetry; all tpcds correctness call sites pass PerTestConfig::default().
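The config described above can be sketched roughly as follows. This is an illustrative reconstruction from the PR description, not the PR's exact code: the field and function names (`PerTestConfig`, `uses_undeterministic_limit_operator`, `non_deterministic_sort`, `compare_result_set`) come from the description, while the `Vec<Vec<String>>` result-set stand-in replaces the real RecordBatch-based signature.

```rust
/// Per-test opt-outs for queries whose correct answers can legally vary.
#[derive(Default)]
pub struct PerTestConfig {
    /// LIMIT without ORDER BY (e.g. q17): any N-row slice is a legal answer.
    pub uses_undeterministic_limit_operator: bool,
    /// ORDER BY with unbroken ties at the LIMIT boundary (q21, q22, q24, q31, q32).
    pub non_deterministic_sort: bool,
}

pub fn compare_result_set(
    config: &PerTestConfig,
    expected: &[Vec<String>],
    actual: &[Vec<String>],
) -> Result<(), String> {
    // A strict set comparison would false-positive on SQL-legal answer
    // variation, so flagged queries opt out of the detailed check.
    if config.uses_undeterministic_limit_operator || config.non_deterministic_sort {
        return Ok(());
    }
    if expected != actual {
        return Err("result sets differ".to_string());
    }
    Ok(())
}
```

Unflagged queries still get the full comparison; only the explicitly flagged ones short-circuit.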

ClickBench correctness q3: shifted precision back by one digit; see the inline comment for details.
q0 & q6 are still ignored because DataFusion's planner is smart enough to resolve the answer through statistics.

Still ignored (3)

  • ClickBench q42: compare_ordering reports a LexOrdering PartialEq inequality even though the Debug output is byte-identical on both sides. The fields needed to debug this are missing from the output, so the current diff shows identical structs even though the comparison fails.
  • TPC-DS q70 & q86: GROUP BY rollup(s_state, s_county) with two grouping() calls in the projection. The trailing two projection columns swap positions between runs because the planner iterates a HashMap to figure out which columns the rank window function needs. That swap propagates upward through SortExec / SortPreservingMergeExec as @5/@6 index flips. These tests are very flaky and flip between passing and failing.
  • TPC-DS q30: works locally but fails on CI. It might make sense to change the query to match what was done in upstream DataFusion, i.e. changing the query. Will wait for @gabotechs' thoughts on this.

@Rich-T-kid Rich-T-kid force-pushed the rich-T-kid/un-ignore-test branch from 54cae02 to 7ec7dd9 on May 13, 2026 at 04:19
Contributor Author

Rich-T-kid commented May 13, 2026

Seems like test_tpcds_shard03_q30 has a schema mismatch: the datasource we are using has c_last_review_date but the query expects c_last_review_date_sk. It seems DataFusion renamed the column.

@Rich-T-kid
Contributor Author

Seems like the #[ignore = "reason"] annotations are a bit outdated; updating them. It appears to be the exact same issue across multiple tests, and it has something to do with how snapshots are handled.

@Rich-T-kid
Contributor Author

datasource that we are using uses c_last_review_date but the query expects c_last_review_date_sk.
Taking a look at the q30.sql file, this seems to be true:

-- SQLBench-DS query 30 derived from TPC-DS query 30 under the terms of the TPC Fair Use Policy.
-- TPC-DS queries are Copyright 2021 Transaction Processing Performance Council.
-- This query was generated at scale factor 1.
with customer_total_return as
 (select wr_returning_customer_sk as ctr_customer_sk
        ,ca_state as ctr_state, 
 	sum(wr_return_amt) as ctr_total_return
 from web_returns
     ,date_dim
     ,customer_address
 where wr_returned_date_sk = d_date_sk 
   and d_year =2000
   and wr_returning_addr_sk = ca_address_sk 
 group by wr_returning_customer_sk
         ,ca_state)
  select  c_customer_id,c_salutation,c_first_name,c_last_name,c_preferred_cust_flag
       ,c_birth_day,c_birth_month,c_birth_year,c_birth_country,c_login,c_email_address
       ,c_last_review_date_sk,ctr_total_return
 from customer_total_return ctr1
     ,customer_address
     ,customer
 where ctr1.ctr_total_return > (select avg(ctr_total_return)*1.2
 			  from customer_total_return ctr2 
                  	  where ctr1.ctr_state = ctr2.ctr_state)
       and ca_address_sk = c_current_addr_sk
       and ca_state = 'KS'
       and ctr1.ctr_customer_sk = c_customer_sk
 order by c_customer_id,c_salutation,c_first_name,c_last_name,c_preferred_cust_flag
                  ,c_birth_day,c_birth_month,c_birth_year,c_birth_country,c_login,c_email_address
                  ,c_last_review_date_sk,ctr_total_return
 LIMIT 100;

Spoke to Claude about this and it might be a GitHub caching issue; I'll try invalidating the cache to see if it re-downloads the data. Besides q30, the rest of the issues are snapshot issues which I need to research further.

@gabotechs
Collaborator

I think q30 is failing for something else, see this PR that just got merged upstream:

Comment thread: testdata/clickbench/queries/q17.sql (Outdated)

SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", "SearchPhrase" LIMIT 10;
--SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", "SearchPhrase" LIMIT 10;
-- previous query did not have an ORDER BY, so the results were non-deterministic. Adding ORDER BY to make the test deterministic.
Contributor Author

@gabotechs what do you think of this approach? Certain queries can return correct results while not being an exact match of vanilla DataFusion. The most straightforward approach to me is to update the queries to have only one correct answer, which avoids updating the testing logic everywhere.
I commented out the old query as a reference; if this looks good I'll update the remaining breaking tests.

Collaborator

These queries are standard industry benchmarks; it might be better to stick to them, otherwise we would be testing a different thing.

@Rich-T-kid
Contributor Author

Rich-T-kid commented May 13, 2026

Changing the queries is not a valid solution since they are standardized. The main issue that comes to mind is that record batches aren't reaching the final coordinator node. Assuming the comparison logic itself is correct, if the rows aren't there to be re-ordered into a deterministic order, the comparison will always fail.
For example: take a query that returns 100 rows and run it twice, once on vanilla DF and once on DDF. Since there's no ORDER BY, the engine can decide how to apply the LIMIT, and either result is correct. The problem when we go to order the final results is that the data may have already been filtered before the final record batch was produced, which is likely what's causing the mismatches we're seeing.
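The 100-row example above can be made concrete with a toy model (illustrative Rust only, not DataFusion or DDF code; `limit_slice` is a made-up helper standing in for each engine's LIMIT behavior):

```rust
use std::collections::HashSet;

// Toy model of LIMIT without ORDER BY: each engine may legally return a
// different n-row slice of the same table, so a strict comparison of the
// two result sets can fail even though both answers are correct.
fn limit_slice(rows: &[i64], n: usize, from_end: bool) -> HashSet<i64> {
    if from_end {
        // One engine happens to surface the last rows first...
        rows.iter().rev().take(n).copied().collect()
    } else {
        // ...while the other takes the first rows it sees.
        rows.iter().take(n).copied().collect()
    }
}
```

Both slices satisfy the query (n rows drawn from the table), yet comparing them element-for-element reports a mismatch.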

Fix

  1. The initial approach I want to try is looking at the raw output of both queries, then looking at the raw output after applying a LIMIT 10.
  2. Dig into the comparison logic. This assumes the record batches produced by the queries contain the same rows as in step 1; if not, comparing batches that contain different rows will always be wrong.
  3. Double-check how vanilla DF handles LIMITs. Does it have free range to select any rows? Since DDF is backed by vanilla DF it should do the exact same processing, but if vanilla DF allows non-deterministic limits we may need to change the approach.

@Rich-T-kid
Contributor Author

Did steps 1 & 2, and it seems this comes down to the engines picking two different 10-row slices. Going to add a flag to the comparison functions that checks for the limit operator; if it exists, we cannot guarantee that the results are exactly the same. This boils down to skipping the detailed_batch_comparison() call while keeping the row count check & schema validation.
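A sketch of that ordering of checks, under the assumption described in this thread (schema and row-count validation always run, only the detailed comparison is skipped). Names and the `(column name, values)` stand-in for RecordBatch are illustrative, not the PR's exact code:

```rust
// Hypothetical shape of the comparison: validate schema, then row count,
// and only then decide whether a detailed value comparison is meaningful.
fn compare_batches(
    limit_without_order_by: bool,
    expected: &[(String, Vec<i64>)], // (column name, values) stand-in for RecordBatch
    actual: &[(String, Vec<i64>)],
) -> Result<(), String> {
    // 1. Schema validation: column names must match regardless of flags.
    let schema = |b: &[(String, Vec<i64>)]| {
        b.iter().map(|(name, _)| name.clone()).collect::<Vec<_>>()
    };
    if schema(expected) != schema(actual) {
        return Err("schema mismatch".to_string());
    }
    // 2. Row-count check: even a non-deterministic LIMIT must return N rows.
    let rows = |b: &[(String, Vec<i64>)]| b.first().map_or(0, |(_, v)| v.len());
    if rows(expected) != rows(actual) {
        return Err("row count mismatch".to_string());
    }
    // 3. Detailed comparison: skipped when both engines may legally pick
    //    different row slices.
    if limit_without_order_by {
        return Ok(());
    }
    if expected != actual {
        return Err("batches differ".to_string());
    }
    Ok(())
}
```

Running the cheap structural checks first means a flagged query still fails loudly on a wrong schema or row count, which is exactly the residual signal the skip preserves.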

@Rich-T-kid Rich-T-kid force-pushed the rich-T-kid/un-ignore-test branch 3 times, most recently from 83663c6 to 152d8b3 on May 13, 2026 at 18:51
Comment thread: src/test_utils/property_based.rs (Outdated)
// to correctly return different result sets (e.g. LIMIT without ORDER BY, or ORDER BY
// with unbroken ties at the LIMIT boundary). Running the full comparison on these
// queries would surface false positives even though both engines are right.
if config.uses_undeterministic_limit_operator || config.non_deterministic_sort {
Contributor Author

@Rich-T-kid Rich-T-kid May 13, 2026

This has the drawback of not actually checking whether the results are correct/identical, but I don't see another way of handling this. The record batches returned by run(ctx, query) have already been filtered/truncated.

@Rich-T-kid Rich-T-kid force-pushed the rich-T-kid/un-ignore-test branch from 152d8b3 to a20e110 on May 13, 2026 at 18:58
…gful checks for nondeterministic results"

Revert "Rename file to read-me"

fix remaining tpcds_correctness test

revised ignore reasons

fix 3, ignore 1

revert broken ci test

re-run 30 on CI

revised querys instead of test logic

fix limit tests

4 test remain

 resolves 30, 42,70,86 remain
@Rich-T-kid Rich-T-kid force-pushed the rich-T-kid/un-ignore-test branch from a20e110 to e3a2e84 on May 13, 2026 at 19:07
// to correctly return different result sets (e.g. LIMIT without ORDER BY, or ORDER BY
// with unbroken ties at the LIMIT boundary). Running the full comparison on these
// queries would surface false positives even though both engines are right.
if config.uses_undeterministic_limit_operator || config.non_deterministic_sort {
Contributor Author

Moved this function a little lower so that it runs after a schema check and a row count check.

@Rich-T-kid
Contributor Author

@gabotechs this is good to review now 😃
