@@ -7,9 +7,45 @@ Feature: Pipeline tests using the movies dataset
 Some validation of entity attributes is performed: SQL expressions and Python filter
 functions are used, and templatable business rules feature in the transformations.
 
+  Scenario: Validate and filter movies (spark)
+    Given I submit the movies file movies.json for processing
+    And A spark pipeline is configured
+    And I create the following reference data tables in the database movies_refdata
+      | table_name | parquet_path                                         |
+      | sequels    | tests/testdata/movies/refdata/movies_sequels.parquet |
+    And I add initial audit entries for the submission
+    Then the latest audit record for the submission is marked with processing status file_transformation
+    When I run the file transformation phase
+    Then the movies entity is stored as a parquet after the file_transformation phase
+    And the latest audit record for the submission is marked with processing status data_contract
+    When I run the data contract phase
+    Then there are 3 record rejections from the data_contract phase
+    And there are errors with the following details and associated error_count from the data_contract phase
+      | ErrorCode | ErrorMessage                              | error_count |
+      | BLANKYEAR | year not provided                         | 1           |
+      | DODGYYEAR | year value (NOT_A_NUMBER) is invalid      | 1           |
+      | DODGYDATE | date_joined value is not valid: daft_date | 1           |
+    And the movies entity is stored as a parquet after the data_contract phase
+    And the latest audit record for the submission is marked with processing status business_rules
+    When I run the business rules phase
+    Then The rules restrict "movies" to 4 qualifying records
+    And At least one row from "movies" has generated error code "LIMITED_RATINGS"
+    And At least one row from "derived" has generated error code "RUBBISH_SEQUEL"
+    And the latest audit record for the submission is marked with processing status error_report
+    When I run the error report phase
+    Then An error report is produced
+    And The statistics entry for the submission shows the following information
+      | parameter                | value |
+      | record_count             | 5     |
+      | number_record_rejections | 4     |
+      | number_warnings          | 1     |
+
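The scenario above drives a submission through four phases in a fixed order, stamping the latest audit record with the upcoming phase's name before each run. A minimal stdlib sketch of that ordering, assuming a hypothetical `run_pipeline` driver and `run_phase` callback (neither is part of the project under test):

```python
# Phase order the scenario exercises: each audit entry is written
# just before the corresponding phase runs.
PHASES = ["file_transformation", "data_contract", "business_rules", "error_report"]

def run_pipeline(submission, run_phase):
    """Advance the submission through each phase, auditing before each run."""
    audit_trail = []
    for phase in PHASES:
        audit_trail.append(phase)      # latest audit record now names the phase
        run_phase(submission, phase)   # e.g. apply the data contract checks
    return audit_trail

trail = run_pipeline({"file": "movies.json"}, lambda submission, phase: None)
print(trail[-1])  # error_report
```

This mirrors why the scenario interleaves `Then the latest audit record … is marked with processing status X` before each `When I run the X phase` step: the audit entry is a precondition of the phase, not a result of it.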
   Scenario: Validate and filter movies (duckdb)
     Given I submit the movies file movies.json for processing
-    And A duckdb pipeline is configured
+    And A duckdb pipeline is configured with schema file 'movies_ddb.dischema.json'
+    And I create the following reference data tables in the database "movies_refdata"
+      | table_name | parquet_path                                         |
+      | sequels    | tests/testdata/movies/refdata/movies_sequels.parquet |
     And I add initial audit entries for the submission
     Then the latest audit record for the submission is marked with processing status file_transformation
     When I run the file transformation phase
@@ -24,3 +60,16 @@ Feature: Pipeline tests using the movies dataset
       | DODGYDATE | date_joined value is not valid: daft_date | 1           |
     And the movies entity is stored as a parquet after the data_contract phase
     And the latest audit record for the submission is marked with processing status business_rules
+    When I run the business rules phase
+    Then The rules restrict "movies" to 4 qualifying records
+    And At least one row from "movies" has generated error code "LIMITED_RATINGS"
+    And At least one row from "derived" has generated error code "RUBBISH_SEQUEL"
+    And the latest audit record for the submission is marked with processing status error_report
+    When I run the error report phase
+    Then An error report is produced
+    And The statistics entry for the submission shows the following information
+      | parameter                | value |
+      | record_count             | 5     |
+      | number_record_rejections | 4     |
+      | number_warnings          | 1     |
+
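Both scenarios end by checking the same three counters in the statistics entry. As a rough stdlib sketch of what that entry amounts to, using the values from the scenario's expected table (`build_statistics` is a hypothetical helper, not the project's API; the rejection and warning identifiers below are illustrative placeholders):

```python
def build_statistics(record_count, rejections, warnings):
    """Summarise a submission in the shape the scenario's statistics table checks."""
    return {
        "record_count": record_count,
        "number_record_rejections": len(rejections),
        "number_warnings": len(warnings),
    }

# Mirrors the expected table: 5 records in movies.json, 4 rejections
# across the phases, and 1 warning-level error code.
stats = build_statistics(
    5,
    rejections=["rej_1", "rej_2", "rej_3", "rej_4"],
    warnings=["LIMITED_RATINGS"],
)
print(stats)  # {'record_count': 5, 'number_record_rejections': 4, 'number_warnings': 1}
```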