data-validation-engine/tests/features/books.feature at 65b87e94431c231a88e771502608a75f024a9017 · NHSDigital/data-validation-engine · GitHub

Feature: Pipeline tests using the books dataset
    Tests for the processing framework which use the books dataset.

    This tests submissions using nested, complex XML datasets with arrays, and
    introduces more complex transformations that require aggregation.

    Scenario: Validate complex nested XML data (spark)
        Given I submit the books file nested_books.XML for processing
        And A spark pipeline is configured with schema file 'nested_books.dischema.json'
        And I add initial audit entries for the submission
        Then the latest audit record for the submission is marked with processing status file_transformation
        When I run the file transformation phase
        Then the header entity is stored as a parquet after the file_transformation phase
        And the nested_books entity is stored as a parquet after the file_transformation phase
        And the latest audit record for the submission is marked with processing status data_contract
        When I run the data contract phase
        Then there is 1 record rejection from the data_contract phase
        And the header entity is stored as a parquet after the data_contract phase
        And the nested_books entity is stored as a parquet after the data_contract phase
        And the latest audit record for the submission is marked with processing status business_rules
        When I run the business rules phase
        Then The rules restrict "nested_books" to 3 qualifying records
        And The entity "nested_books" contains an entry for "17.85" in column "total_value_of_books"
        And the nested_books entity is stored as a parquet after the business_rules phase
        And the latest audit record for the submission is marked with processing status error_report
        When I run the error report phase
        Then An error report is produced
        And The statistics entry for the submission shows the following information
            | parameter                | value |
            | record_count             | 4     |
            | number_record_rejections | 2     |
            | number_warnings          | 0     |

    Scenario: Validate complex nested XML data (duckdb)
        Given I submit the books file nested_books.XML for processing
        And A duckdb pipeline is configured with schema file 'nested_books_ddb.dischema.json'
        And I add initial audit entries for the submission
        Then the latest audit record for the submission is marked with processing status file_transformation
        When I run the file transformation phase
        Then the header entity is stored as a parquet after the file_transformation phase
        And the nested_books entity is stored as a parquet after the file_transformation phase
        And the latest audit record for the submission is marked with processing status data_contract
        When I run the data contract phase
        Then there is 1 record rejection from the data_contract phase
        And the header entity is stored as a parquet after the data_contract phase
        And the nested_books entity is stored as a parquet after the data_contract phase
        And the latest audit record for the submission is marked with processing status business_rules
        When I run the business rules phase
        Then The rules restrict "nested_books" to 3 qualifying records
        And The entity "nested_books" contains an entry for "17.85" in column "total_value_of_books"
        And the nested_books entity is stored as a parquet after the business_rules phase
        And the latest audit record for the submission is marked with processing status error_report
        When I run the error report phase
        Then An error report is produced
        And The statistics entry for the submission shows the following information
            | parameter                | value |
            | record_count             | 4     |
            | number_record_rejections | 2     |
            | number_warnings          | 0     |
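
    # Sketch only, not part of the suite: the spark and duckdb scenarios above
    # differ only in the backend name and the schema file, so they could be
    # expressed as a single Scenario Outline. This assumes the pipeline step
    # definition accepts the backend as a parameter, which this file does not
    # confirm. The remaining steps would be copied unchanged from the scenarios
    # above.
    #
    # Scenario Outline: Validate complex nested XML data (<backend>)
    #     Given I submit the books file nested_books.XML for processing
    #     And A <backend> pipeline is configured with schema file '<schema_file>'
    #     And I add initial audit entries for the submission
    #     Then the latest audit record for the submission is marked with processing status file_transformation
    #
    #     Examples:
    #         | backend | schema_file                    |
    #         | spark   | nested_books.dischema.json     |
    #         | duckdb  | nested_books_ddb.dischema.json |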

    Scenario: Handle a file with a malformed tag (duckdb)
        Given I submit the books file malformed_books.xml for processing
        And A duckdb pipeline is configured with schema file 'nested_books_ddb.dischema.json'
        And I add initial audit entries for the submission
        Then the latest audit record for the submission is marked with processing status file_transformation
        When I run the file transformation phase
        Then the latest audit record for the submission is marked with processing status failed
        # TODO - handle above within the stream xml reader - specific

    Scenario: Handle a file that fails XSD validation (duckdb)
        Given I submit the books file books_xsd_fail.xml for processing
        And A duckdb pipeline is configured with schema file 'nested_books_ddb.dischema.json'
        And I add initial audit entries for the submission
        Then the latest audit record for the submission is marked with processing status file_transformation
        When I run the file transformation phase
        Then the latest audit record for the submission is marked with processing status error_report
        When I run the error report phase
        Then An error report is produced