 Feature: Pipeline tests using the movies dataset
-  Tests for the processing framework which use the movies dataset.
+  Tests for the processing framework which use the movies dataset.
 
-  This tests submissions in JSON format, with configuration in JSON config files.
-  Complex types are tested (arrays, nested structs)
+  This tests submissions in JSON format, with configuration in JSON config files.
+  Complex types are tested (arrays, nested structs)
 
-  Some validation of entity attributes is performed: SQL expressions and Python filter
-  functions are used, and templatable business rules feature in the transformations.
+  Some validation of entity attributes is performed: SQL expressions and Python filter
+  functions are used, and templatable business rules feature in the transformations.
 
   Scenario: Validate and filter movies (spark)
-    Given I submit the movies file movies.json for processing
-    And A spark pipeline is configured
-    And I create the following reference data tables in the database movies_refdata
-      | table_name | parquet_path |
-      | sequels | tests/testdata/movies/refdata/movies_sequels.parquet |
-    And I add initial audit entries for the submission
-    Then the latest audit record for the submission is marked with processing status file_transformation
-    When I run the file transformation phase
-    Then the movies entity is stored as a parquet after the file_transformation phase
-    And the latest audit record for the submission is marked with processing status data_contract
-    When I run the data contract phase
-    Then there are 3 record rejections from the data_contract phase
-    And there are errors with the following details and associated error_count from the data_contract phase
-      | ErrorCode | ErrorMessage | error_count |
-      | BLANKYEAR | year not provided | 1 |
-      | DODGYYEAR | year value (NOT_A_NUMBER) is invalid | 1 |
-      | DODGYDATE | date_joined value is not valid: daft_date | 1 |
-    And the movies entity is stored as a parquet after the data_contract phase
-    And the latest audit record for the submission is marked with processing status business_rules
-    When I run the business rules phase
-    Then The rules restrict "movies" to 4 qualifying records
-    And At least one row from "movies" has generated error code "LIMITED_RATINGS"
-    And At least one row from "derived" has generated error code "RUBBISH_SEQUEL"
-    And the latest audit record for the submission is marked with processing status error_report
-    When I run the error report phase
-    Then An error report is produced
-    And The statistics entry for the submission shows the following information
-      | parameter | value |
-      | record_count | 5 |
-      | number_record_rejections | 4 |
-      | number_warnings | 1 |
+    Given I submit the movies file movies.json for processing
+    And A spark pipeline is configured
+    And I create the following reference data tables in the database movies_refdata
+      | table_name | parquet_path |
+      | sequels | tests/testdata/movies/refdata/movies_sequels.parquet |
+    And I add initial audit entries for the submission
+    Then the latest audit record for the submission is marked with processing status file_transformation
+    When I run the file transformation phase
+    Then the movies entity is stored as a parquet after the file_transformation phase
+    And the latest audit record for the submission is marked with processing status data_contract
+    When I run the data contract phase
+    Then there are 3 record rejections from the data_contract phase
+    And there are errors with the following details and associated error_count from the data_contract phase
+      | ErrorCode | ErrorMessage | error_count |
+      | BLANKYEAR | year not provided | 1 |
+      | DODGYYEAR | year value (NOT_A_NUMBER) is invalid | 1 |
+      | DODGYDATE | date_joined value is not valid: daft_date | 1 |
+    And the movies entity is stored as a parquet after the data_contract phase
+    And the latest audit record for the submission is marked with processing status business_rules
+    When I run the business rules phase
+    Then The rules restrict "movies" to 4 qualifying records
+    And there are errors with the following details and associated error_count from the business_rules phase
+      | ErrorCode | ErrorMessage | error_count |
+      | LIMITED_RATINGS | Movie has too few ratings ([6.1]) | 1 |
+      | RUBBISH_SEQUEL | The movie The Greatest Movie Ever has a rubbish sequel | 1 |
+    And the latest audit record for the submission is marked with processing status error_report
+    When I run the error report phase
+    Then An error report is produced
+    And The statistics entry for the submission shows the following information
+      | parameter | value |
+      | record_count | 5 |
+      | number_record_rejections | 4 |
+      | number_warnings | 1 |
 
   Scenario: Validate and filter movies (duckdb)
     Given I submit the movies file movies.json for processing
@@ -62,8 +64,10 @@ Feature: Pipeline tests using the movies dataset
     And the latest audit record for the submission is marked with processing status business_rules
     When I run the business rules phase
     Then The rules restrict "movies" to 4 qualifying records
-    And At least one row from "movies" has generated error code "LIMITED_RATINGS"
-    And At least one row from "derived" has generated error code "RUBBISH_SEQUEL"
+    And there are errors with the following details and associated error_count from the business_rules phase
+      | ErrorCode | ErrorMessage | error_count |
+      | LIMITED_RATINGS | Movie has too few ratings ([6.1]) | 1 |
+      | RUBBISH_SEQUEL | The movie The Greatest Movie Ever has a rubbish sequel | 1 |
     And the latest audit record for the submission is marked with processing status error_report
     When I run the error report phase
    Then An error report is produced
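The diff above replaces the looser "At least one row … has generated error code" assertions with an exact table of expected errors and counts. A minimal sketch of the comparison such a step could perform (the helper name and dict shapes are assumptions for illustration, not the repo's actual step definitions):

```python
from collections import Counter

def check_expected_errors(errors, expected_rows):
    """Compare actual error records against a Gherkin-style expectation table.

    `errors`: list of dicts with 'ErrorCode' and 'ErrorMessage' keys.
    `expected_rows`: list of dicts with 'ErrorCode', 'ErrorMessage' and
    'error_count' keys, as parsed from the feature-file table.
    Returns a list of mismatch descriptions (empty when everything matches).
    """
    actual = Counter((e["ErrorCode"], e["ErrorMessage"]) for e in errors)
    mismatches = []
    for row in expected_rows:
        key = (row["ErrorCode"], row["ErrorMessage"])
        want = int(row["error_count"])
        got = actual.get(key, 0)
        if got != want:
            mismatches.append(f"{key}: expected {want}, found {got}")
    return mismatches

# Example mirroring the business_rules expectation table from the scenario:
expected = [
    {"ErrorCode": "LIMITED_RATINGS",
     "ErrorMessage": "Movie has too few ratings ([6.1])", "error_count": "1"},
    {"ErrorCode": "RUBBISH_SEQUEL",
     "ErrorMessage": "The movie The Greatest Movie Ever has a rubbish sequel",
     "error_count": "1"},
]
actual_errors = [
    {"ErrorCode": "LIMITED_RATINGS",
     "ErrorMessage": "Movie has too few ratings ([6.1])"},
    {"ErrorCode": "RUBBISH_SEQUEL",
     "ErrorMessage": "The movie The Greatest Movie Ever has a rubbish sequel"},
]
print(check_expected_errors(actual_errors, expected))  # → []
```

Checking exact counts per (code, message) pair makes the scenarios stricter than the previous existence checks: a duplicate or missing error now fails the step rather than passing silently.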