
Commit a02c3bd

docs: further wip docs
1 parent 35bf828 commit a02c3bd

8 files changed

Lines changed: 306 additions & 19 deletions

Lines changed: 87 additions & 0 deletions
@@ -0,0 +1,87 @@
## CSV

=== "Base"

    ::: src.dve.core_engine.backends.readers.csv.CSVFileReader
        options:
            heading_level: 3
            merge_init_into_class: true
            members: false

=== "DuckDB"

    ::: src.dve.core_engine.backends.implementations.duckdb.readers.csv.DuckDBCSVReader
        options:
            heading_level: 3
            members:
                - __init__

    ::: src.dve.core_engine.backends.implementations.duckdb.readers.csv.PolarsToDuckDBCSVReader
        options:
            heading_level: 3
            members:
                - __init__

    ::: src.dve.core_engine.backends.implementations.duckdb.readers.csv.DuckDBCSVRepeatingHeaderReader
        options:
            heading_level: 3
            members:
                - __init__

=== "Spark"

    ::: src.dve.core_engine.backends.implementations.spark.readers.csv.SparkCSVReader
        options:
            heading_level: 3
            members:
                - __init__

## JSON

=== "DuckDB"

    ::: src.dve.core_engine.backends.implementations.duckdb.readers.json.DuckDBJSONReader
        options:
            heading_level: 3
            members:
                - __init__

=== "Spark"

    ::: src.dve.core_engine.backends.implementations.spark.readers.json.SparkJSONReader
        options:
            heading_level: 3
            members:
                - __init__

## XML

=== "Base"

    ::: src.dve.core_engine.backends.readers.xml.BasicXMLFileReader
        options:
            heading_level: 3
            merge_init_into_class: true
            members: false

=== "DuckDB"

    ::: src.dve.core_engine.backends.implementations.duckdb.readers.xml.DuckDBXMLStreamReader
        options:
            heading_level: 3
            members:
                - __init__

=== "Spark"

    ::: src.dve.core_engine.backends.implementations.spark.readers.xml.SparkXMLStreamReader
        options:
            heading_level: 3
            members:
                - __init__

    ::: src.dve.core_engine.backends.implementations.spark.readers.xml.SparkXMLReader
        options:
            heading_level: 3
            members:
                - __init__

docs/index.md

Lines changed: 3 additions & 3 deletions
@@ -11,7 +11,7 @@ tags:

# Data Validation Engine

The Data Validation Engine (DVE) is a configuration-driven data validation library written in [Python](https://www.python.org/) and [Pydantic](https://docs.pydantic.dev/latest/), with a SQL backend currently consisting of either [DuckDB](https://duckdb.org/) or [Spark](https://spark.apache.org/sql/). The configuration used to run validations against a dataset is defined in a JSON document, which we refer to as the "dischema". The rules written within the dischema are designed to run against all incoming data in a given submission, which allows the DVE to capture every possible issue with the data without the submitter having to resubmit the same data repeatedly - something that is burdensome and time consuming for both the submitter and the receiver of the data. Additionally, the rules can be configured to have the following behaviour:

- **File Rejection** - The entire submission will be rejected if the given rule triggers one or more times.
- **Row Rejection** - The row that triggered the rule will be rejected. Rows that pass the validation will flow through into a validated entity.

@@ -30,9 +30,9 @@ The DVE has 3 core components:

3. [Business rules](user_guidance/business_rules.md) - Performs simple and complex validations such as comparisons between fields, entities and/or lookups against reference data.

For each component listed above, a [feedback message](user_guidance/feedback_messages.md) is generated whenever a rule is violated. These [feedback messages](user_guidance/feedback_messages.md) can be integrated directly into your system, provided you can consume `JSONL` files. Alternatively, we offer a fourth component called the [Error Reports](user_guidance/error_reports.md). This component will load the [feedback messages](user_guidance/feedback_messages.md) into an `.xlsx` (Excel) file which can be sent back to the submitter of the data. The Excel file is compatible with spreadsheet applications such as [Microsoft Excel](https://www.microsoft.com/en/microsoft-365/excel), [Google Docs](https://docs.google.com/), [Libre Office Calc](https://www.libreoffice.org/discover/calc/) etc.
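
Since the feedback messages are plain `JSONL` (one JSON document per line), they can be consumed with standard tooling. Below is a minimal sketch using only the Python standard library; the file name is a hypothetical placeholder, and no assumptions are made about the fields each message contains:

```python
import json
from pathlib import Path

# Hypothetical path to the feedback messages produced by a DVE run.
messages_path = Path("feedback_messages.jsonl")

# Each line of a JSONL file is a standalone JSON document.
with messages_path.open(encoding="utf-8") as handle:
    for line in handle:
        message = json.loads(line)
        print(message)
```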

To be able to run the DVE out of the box, you will need to choose and install one of the supported Backend Implementations, such as [DuckDB](user_guidance/implementations/duckdb.md) or [Spark](user_guidance/implementations/spark.md). If you need to write a custom backend implementation, you may want to look at the [Advanced User Guidance](advanced_guidance/backends.md) section.

Feel free to use the Table of Contents on the left-hand side of the page to navigate to sections of interest, or use the "Next" and "Previous" buttons at the bottom of each page if you prefer to read through each page in sequential order.

docs/user_guidance/auditing.md

Lines changed: 34 additions & 2 deletions
@@ -1,2 +1,34 @@
---
tags:
    - Auditing
---

The Auditing objects within the DVE are used to control and store information about a given submission and the stage it is currently at. They also store statistics about the submission, such as the number of validations it has triggered. Users who are not interested in the Error Reports stage can therefore source this information directly from the audit tables.

## Audit Tables

Currently, these are the audit tables that can be accessed within the DVE:

| Table Name            | Purpose |
| --------------------- | ------- |
| processing_status     | Contains information about the submission and what the current processing status is. |
| submission_info       | Contains information about the submitted file. |
| submission_statistics | Contains validation statistics for each submission. |

## Audit Objects

You can use the following methods to help you interact with the tables above, or you can query the tables directly via SQL.
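
For example, here is a minimal sketch of pulling a submission's audit information through the auditing manager. The method names are taken from the reference below, but how the manager is constructed, the `submission_id` value, and the exact signatures and return shapes are assumptions for illustration:

```python
# A minimal sketch, assuming an already-constructed auditing manager
# (construction depends on your chosen backend implementation).
def summarise_submission(auditing_manager, submission_id: str) -> None:
    # These getters are documented on BaseAuditingManager below; their exact
    # signatures and return values are assumed here for illustration.
    info = auditing_manager.get_submission_info(submission_id)
    status = auditing_manager.get_submission_status(submission_id)
    stats = auditing_manager.get_submission_statistics(submission_id)

    print(f"Submission info: {info}")
    print(f"Processing status: {status}")
    print(f"Validation statistics: {stats}")
```
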
<hr>

::: src.dve.core_engine.backends.base.auditing.BaseAuditingManager
    options:
        heading_level: 3
        members:
            - get_submission_info
            - get_submission_statistics
            - get_submission_status
            - get_all_file_transformation_submissions
            - get_all_data_contract_submissions
            - get_all_business_rule_submissions
            - get_all_error_report_submissions
            - get_current_processing_info
Lines changed: 166 additions & 2 deletions
@@ -1,2 +1,166 @@
---
title: File Transformation
tags:
    - Contract
    - Data Contract
    - File Transformation
    - Readers
---

The File Transformation stage within the DVE is used to convert submitted files to a stringified parquet format. This is critical, as the remaining stages within the DVE rely on the data being in parquet format. [Parquet was chosen as it is a very efficient column-oriented format](https://www.databricks.com/glossary/what-is-parquet). When specifying which formats you are expecting, you define them in your dischema like this:

=== "DuckDB"
13+
14+
```json
15+
{
16+
"contract": {
17+
"datasets": {
18+
"<entity_name>": {
19+
"fields": {
20+
...
21+
},
22+
},
23+
"reader_config": {
24+
".json": {
25+
"reader": "DuckDBJSONReader",
26+
"kwargs": {
27+
...
28+
}
29+
},
30+
".xml": {
31+
"reader": "DuckDBXMLStreamReader",
32+
"kwargs": {
33+
...
34+
}
35+
}
36+
}
37+
}
38+
}
39+
}
40+
```
41+
42+
=== "Spark"
43+
44+
```json
45+
{
46+
"contract": {
47+
"datasets": {
48+
"<entity_name>": {
49+
"fields": {
50+
...
51+
},
52+
},
53+
"reader_config": {
54+
".csv": {
55+
"reader": "SparkCSVReader",
56+
"kwargs": {
57+
...
58+
}
59+
},
60+
".json": {
61+
"reader": "SparkJSONReader",
62+
"kwargs": {
63+
...
64+
}
65+
}
66+
}
67+
}
68+
}
69+
}
70+
```

The secondary use of the File Transformation stage is the ability to normalise your data into multiple entities. Imagine you had something like Hospital and Patient data in a single submission. You could split this out into separate entities so that the validated outputs of the data could be loaded into separate tables. For example:

=== "DuckDB"
75+
76+
```json
77+
{
78+
"contract": {
79+
"datasets": {
80+
"hospital": {
81+
"fields": {
82+
"hospital_id": "int",
83+
"hospital_name": "string"
84+
},
85+
"reader_config": {
86+
".json": {
87+
"reader": "DuckDBJSONReader",
88+
"kwargs": {
89+
"encoding": "utf-8",
90+
"multi_line": true,
91+
}
92+
}
93+
}
94+
},
95+
"patients": {
96+
"fields": {
97+
"patient_id": "int",
98+
"patient_name": "string"
99+
},
100+
"reader_config": {
101+
".json": {
102+
"reader": "DuckDBJSONReader",
103+
"kwargs": {
104+
"encoding": "utf-8",
105+
"multi_line": true,
106+
}
107+
}
108+
}
109+
}
110+
}
111+
}
112+
}
113+
```
114+
115+
116+
=== "Spark"
117+
118+
```json
119+
{
120+
"contract": {
121+
"datasets": {
122+
"hospital": {
123+
"fields": {
124+
"hospital_id": "int",
125+
"hospital_name": "string"
126+
},
127+
"reader_config": {
128+
".json": {
129+
"reader": "SparkJSONReader",
130+
"kwargs": {
131+
"encoding": "utf-8",
132+
"multi_line": true,
133+
}
134+
}
135+
}
136+
},
137+
"patients": {
138+
"fields": {
139+
"patient_id": "int",
140+
"patient_name": "string"
141+
},
142+
"reader_config": {
143+
".json": {
144+
"reader": "SparkJSONReader",
145+
"kwargs": {
146+
"encoding": "utf-8",
147+
"multi_line": true,
148+
}
149+
}
150+
}
151+
}
152+
}
153+
}
154+
}
155+
```

!!! abstract ""
    You can read more about the readers and their kwargs [here](../advanced_guidance/package_documentation/readers.md).

## Supported Formats

| Format  | DuckDB             | Spark              | Version Available |
| ------- | ------------------ | ------------------ | ----------------- |
| `.csv`  | :white_check_mark: | :white_check_mark: | >= 0.1.0          |
| `.json` | :white_check_mark: | :white_check_mark: | >= 0.1.0          |
| `.xml`  | :white_check_mark: | :white_check_mark: | >= 0.1.0          |
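
Because every supported format lands as stringified parquet, the output of this stage can be inspected with any parquet-aware tool. Below is a minimal sketch using [Polars](https://pola.rs/); the output path is a hypothetical placeholder, and note that every column comes back as a string type, with casting to richer types happening in the later stages:

```python
import polars as pl

# Hypothetical location of the stage's output; the real path depends on how
# your pipeline is configured.
frame = pl.read_parquet("output/hospital.parquet")

# "Stringified" parquet means every column is stored as a string type.
print(frame.schema)
```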
