# Data Validation Engine

The Data Validation Engine (DVE) is a configuration-driven data validation library written in [Python](https://www.python.org/) and [Pydantic](https://docs.pydantic.dev/latest/), with a SQL backend currently consisting of [DuckDB](https://duckdb.org/) or [Spark](https://spark.apache.org/sql/). The configuration used to run validations against a dataset is defined in a JSON document, which we will refer to as the "dischema". The rules written within the dischema are designed to run against all incoming data in a given submission, as this allows the DVE to capture all possible issues with the data without the submitter having to resubmit the same data repeatedly, which is burdensome and time-consuming for both the submitter and the receiver of the data. Additionally, the rules can be configured to have the following behaviour:

- **File Rejection** - The entire submission will be rejected if the given rule triggers one or more times.
- **Row Rejection** - The row that triggered the rule will be rejected. Rows that pass the validation flow through into a validated entity.
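
Purely as an illustrative sketch of the two behaviours above: the keys shown below (`rules`, `expression`, `behaviour`) are hypothetical placeholders rather than the DVE's actual dischema syntax, which is covered on the [Business rules](user_guidance/business_rules.md) page.

```json
{
    "rules": [
        {
            "name": "patient_id_is_populated",
            "expression": "patient_id IS NOT NULL",
            "behaviour": "row_rejection"
        },
        {
            "name": "submission_contains_rows",
            "expression": "COUNT(*) > 0",
            "behaviour": "file_rejection"
        }
    ]
}
```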

The DVE has 3 core components:

3. [Business rules](user_guidance/business_rules.md) - Performs simple and complex validations such as comparisons between fields, entities and/or lookups against reference data.

For each component listed above, a [feedback message](user_guidance/feedback_messages.md) is generated whenever a rule is violated. These [feedback messages](user_guidance/feedback_messages.md) can be integrated directly into your system, provided you can consume `JSONL` files. Alternatively, we offer a fourth component called the [Error Reports](user_guidance/error_reports.md). This component will load the [feedback messages](user_guidance/feedback_messages.md) into an `.xlsx` (Excel) file which can be sent back to the submitter of the data. The Excel file is compatible with services that offer spreadsheet reading, such as [Microsoft Excel](https://www.microsoft.com/en/microsoft-365/excel), [Google Docs](https://docs.google.com/), [LibreOffice Calc](https://www.libreoffice.org/discover/calc/), etc.
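
As a rough illustration of consuming that `JSONL` output, the sketch below reads a feedback message file line by line using only the Python standard library. The file path is a placeholder and the exact fields on each message are described on the [feedback messages](user_guidance/feedback_messages.md) page; this is not the DVE's own API, just one way a downstream system might ingest the messages.

```python
import json
from pathlib import Path

# Placeholder path - point this at the feedback message file produced by your DVE run.
feedback_file = Path("output/feedback_messages.jsonl")

with feedback_file.open(encoding="utf-8") as handle:
    for line in handle:
        if not line.strip():
            continue  # skip any blank lines defensively
        message = json.loads(line)  # one feedback message per line
        # Route the message into your own tables, queues or dashboards here.
        print(message)
```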

To run the DVE out of the box, you will need to choose and install one of the supported Backend Implementations, such as [DuckDB](user_guidance/implementations/duckdb.md) or [Spark](user_guidance/implementations/spark.md). If you need to write a custom backend implementation, you may want to look at the [Advanced User Guidance](advanced_guidance/backends.md) section.

Feel free to use the Table of Contents on the left-hand side of the page to navigate to sections of interest, or use the "Next" and "Previous" buttons at the bottom of each page if you want to read through each page in sequential order.

---
tags:
- Auditing
---

The Auditing objects within the DVE are used to help control and store information about a given submission and the stage it is currently at. They are also used to store statistics about the submission, such as the number of validations it has triggered. Users who are not interested in using the Error Reports stage can therefore source this information directly from the audit tables.

## Audit Tables

Currently, these are the audit tables that can be accessed within the DVE:

| Table Name            | Purpose |
| --------------------- | ------- |
| processing_status     | Contains information about the submission and what the current processing status is. |
| submission_info       | Contains information about the submitted file. |
| submission_statistics | Contains validation statistics for each submission. |

## Audit Objects

You can use the following methods to help you interact with the tables above, or you can query the tables directly via SQL.
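
For example, if you are running the DuckDB backend, the audit tables can be queried with plain SQL once you have a connection to the database that the DVE writes to. The database path below is a placeholder, and the table names are assumed to be exactly those listed above; check your own configuration for where the audit tables actually live.

```python
import duckdb

# Placeholder path - point this at the database your DVE run writes its audit tables to.
con = duckdb.connect("dve_audit.duckdb")

# See where each submission currently sits in the pipeline.
processing = con.execute("SELECT * FROM processing_status").fetchall()

# See how many validations each submission triggered.
statistics = con.execute("SELECT * FROM submission_statistics").fetchall()

print(processing)
print(statistics)
```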

---
title: File Transformation
tags:
- Contract
- Data Contract
- File Transformation
- Readers
---

The File Transformation stage within the DVE is used to convert submitted files into a stringified parquet format. This is critical, as the rest of the stages within the DVE rely on the data being in parquet format. [Parquet was chosen as it is a very efficient column-oriented format](https://www.databricks.com/glossary/what-is-parquet). When specifying which formats you are expecting, you define them in your dischema like this:

=== "DuckDB"

    ```json
    {
        "contract": {
            "datasets": {
                "<entity_name>": {
                    "fields": {
                        ...
                    }
                },
                "reader_config": {
                    ".json": {
                        "reader": "DuckDBJSONReader",
                        "kwargs": {
                            ...
                        }
                    },
                    ".xml": {
                        "reader": "DuckDBXMLStreamReader",
                        "kwargs": {
                            ...
                        }
                    }
                }
            }
        }
    }
    ```

=== "Spark"

    ```json
    {
        "contract": {
            "datasets": {
                "<entity_name>": {
                    "fields": {
                        ...
                    }
                },
                "reader_config": {
                    ".csv": {
                        "reader": "SparkCSVReader",
                        "kwargs": {
                            ...
                        }
                    },
                    ".json": {
                        "reader": "SparkJSONReader",
                        "kwargs": {
                            ...
                        }
                    }
                }
            }
        }
    }
    ```

The secondary use of the File Transformation stage is the ability to normalise your data into multiple entities. Imagine you had something like Hospital and Patient data in a single submission. You could split this out into separate entities so that the validated outputs of the data can be loaded into separate tables. For example:

=== "DuckDB"

    ```json
    {
        "contract": {
            "datasets": {
                "hospital": {
                    "fields": {
                        "hospital_id": "int",
                        "hospital_name": "string"
                    },
                    "reader_config": {
                        ".json": {
                            "reader": "DuckDBJSONReader",
                            "kwargs": {
                                "encoding": "utf-8",
                                "multi_line": true
                            }
                        }
                    }
                },
                "patients": {
                    "fields": {
                        "patient_id": "int",
                        "patient_name": "string"
                    },
                    "reader_config": {
                        ".json": {
                            "reader": "DuckDBJSONReader",
                            "kwargs": {
                                "encoding": "utf-8",
                                "multi_line": true
                            }
                        }
                    }
                }
            }
        }
    }
    ```

=== "Spark"

    ```json
    {
        "contract": {
            "datasets": {
                "hospital": {
                    "fields": {
                        "hospital_id": "int",
                        "hospital_name": "string"
                    },
                    "reader_config": {
                        ".json": {
                            "reader": "SparkJSONReader",
                            "kwargs": {
                                "encoding": "utf-8",
                                "multi_line": true
                            }
                        }
                    }
                },
                "patients": {
                    "fields": {
                        "patient_id": "int",
                        "patient_name": "string"
                    },
                    "reader_config": {
                        ".json": {
                            "reader": "SparkJSONReader",
                            "kwargs": {
                                "encoding": "utf-8",
                                "multi_line": true
                            }
                        }
                    }
                }
            }
        }
    }
    ```

!!! abstract ""

    You can read more about the readers and kwargs [here](../advanced_guidance/package_documentation/readers.md).