Commit 1b25735

docs: spag and other fixes
1 parent ebd7ef5 commit 1b25735

10 files changed

Lines changed: 59 additions & 30 deletions

File tree

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
1+
2+
::: dve.core_engine.models
3+
    handler: python
4+
    options:
5+
      show_root_heading: true
6+
      heading_level: 2

docs/index.md

Lines changed: 3 additions & 6 deletions
@@ -11,7 +11,7 @@ tags:
1111

1212
# Data Validation Engine
1313

14-
The Data Validation Engine (DVE) is a configuration driven data validation library written in [Python](https://www.python.org/), [Pydantic](https://docs.pydantic.dev/latest/) and a SQL backend currently consisting of [DuckDB](https://duckdb.org/) or [Spark](https://spark.apache.org/sql/). The configuration to run validations against a dataset are defined and written in a json document, which we will be referring to as the "dischema". The rules written within the dischema are designed to be run against all incoming data in a given submission - as this allows the DVE to capture all possible issues with the data without the submitter having resubmit the same data repeatedly which is burdensome and time consuming for the submitter and receiver of the data. Additionally, the rules can be configured to have the following behaviour:
14+
The Data Validation Engine (DVE) is a configuration-driven data validation library written in [Python](https://www.python.org/), built on [Pydantic](https://docs.pydantic.dev/latest/) and a SQL backend, currently either [DuckDB](https://duckdb.org/) or [Spark](https://spark.apache.org/sql/). The configuration used to run validations against a dataset is defined in a JSON document, which we refer to as the "dischema". The rules written within the dischema are designed to be run against all incoming data in a given submission, as this allows the DVE to capture all possible issues with the data without the submitter having to resubmit the same data repeatedly, which is burdensome and time consuming for both the submitter and the receiver of the data. Additionally, the rules can be configured to have the following behaviour:
1515

1616
- **File Rejection** - The entire submission will be rejected if the given rule triggers one or more times.
1717
- **Row Rejection** - The row that triggered the rule will be rejected. Rows that pass the validation flow through into a validated entity.
@@ -23,16 +23,13 @@ The DVE has 3 core components:
2323

2424
1. [File Transformation](user_guidance/file_transformation.md) - Parses submitted files into a "stringified" (all fields cast to string) parquet format.
2525

26-
???+ tip
27-
If your files are already in a parquet format, you do not need to use the file transformation and you can move straight onto the Data Contract.
28-
29-
2. [Data Contract](user_guidance/data_contract.md) - Validates submitted data against a specified datatype and casts successful records to that type.
26+
2. [Data Contract](user_guidance/data_contract.md) - Validates submitted data against the specified datatypes and casts successful records to those types. It also provides modelling of your data.
3027

3128
3. [Business rules](user_guidance/business_rules.md) - Performs simple and complex validations, such as comparisons between fields or entities, and/or lookups against reference data.
3229

3330
For each component listed above, a [feedback message](user_guidance/feedback_messages.md) is generated whenever a rule is violated. These [feedback messages](user_guidance/feedback_messages.md) can be integrated directly into your system, provided you can consume `JSONL` files. Alternatively, we offer a fourth component called the [Error Reports](user_guidance/error_reports.md). This component loads the [feedback messages](user_guidance/feedback_messages.md) into an `.xlsx` (Excel) file which can be sent back to the submitter of the data. The Excel file is compatible with services that offer spreadsheet reading, such as [Microsoft Excel](https://www.microsoft.com/en/microsoft-365/excel), [Google Docs](https://docs.google.com/), [LibreOffice Calc](https://www.libreoffice.org/discover/calc/), etc.
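For illustration, consuming the `JSONL` feedback can be as simple as reading one JSON object per line. A minimal sketch follows; the file name is just an example, and the message fields come from [feedback messages](user_guidance/feedback_messages.md), so none are assumed here:

```py
import json

# Minimal sketch: read DVE feedback messages from a JSONL file,
# one JSON object per line. Field names depend on the feedback
# message schema, so we only print the raw records here.
with open("feedback_messages.jsonl", encoding="utf-8") as fh:
    for line in fh:
        if line.strip():
            message = json.loads(line)
            print(message)
```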
3431

35-
To be able to run the DVE out of the box, you will need to choose and install one of the supported Backend Implementations such as [DuckDB](user_guidance/implementations/duckdb.md) or [Spark](user_guidance/implementations/spark.md). If you to need a write a custom backend implementation, you may want to look at the [Advanced User Guidance](advanced_guidance/backends.md) section.
32+
DVE currently comes with two supported backend implementations: [DuckDB](user_guidance/implementations/duckdb.md) and [Spark](user_guidance/implementations/spark.md). If you need to write a custom backend implementation, you may want to look at the [Advanced User Guidance](advanced_guidance/new_backend.md) section.
3633

3734
Feel free to use the Table of Contents on the left hand side of the page to navigate to sections of interest or to use the "Next" and "Previous" buttons at the bottom of each page if you want to read through each page in sequential order.
3835

docs/user_guidance/auditing.md

Lines changed: 6 additions & 5 deletions
@@ -9,11 +9,12 @@ The Auditing objects within the DVE are used to help control and store informati
99

1010
Currently, these are the audit tables that can be accessed within the DVE:
1111

12-
| Table Name | Purpose |
13-
| --------------------- | ------- |
14-
| `processing_status` | Contains information about the submission and what the current processing status is. |
15-
| `submission_info` | Contains information about the submitted file. |
16-
| `submission_statistics` | Contains validation statistics for each submission. |
12+
| Table Name | Purpose | When Available |
13+
| ----------------------- | ------- | -------------- |
14+
| `processing_status` | Contains information about the submission and what the current processing status is. | >= File Transformation |
15+
| `submission_info` | Contains information about the submitted file. | >= File Transformation |
16+
| `submission_statistics` | Contains validation statistics for each submission. | >= Error Reports |
17+
| `aggregates`             | Contains aggregate counts of errors triggered for a submission. | >= Error Reports |
1718

1819
## Audit Objects
1920

docs/user_guidance/business_rules.md

Lines changed: 7 additions & 7 deletions
@@ -15,7 +15,7 @@ The Business Rules section contain the rules you want to apply to your dataset.
1515

1616
All rules are written in `SQL`. Depending on which [backend implementation](./implementations/) you have chosen, the syntax might differ between implementations.
1717

18-
When writing the rules, you need to be aware that the expressions are wrapped in `NOT` expression. So, you should write the rules as though you are looking for non problematic values.
18+
When writing the rules, you need to be aware that the expressions are negated (wrapped in a `NOT` expression), so you should write the rules as though you are looking for non-problematic values.
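As a rough sketch of that negation, using the movie filter example further down this page (the exact SQL the engine builds may differ):

```py
# Illustrative sketch only -- not the DVE's actual query builder.
# You write the expression describing *valid* rows; the engine wraps it
# in NOT, so the rows that remain are the ones that failed the rule.
rule_expression = "duration_minutes < 240"  # valid: movie is under 4 hours
failing_rows_query = f"SELECT * FROM movies WHERE NOT ({rule_expression})"
print(failing_rows_query)
# SELECT * FROM movies WHERE NOT (duration_minutes < 240)
```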
1919

2020
When rules are being applied, [Complex Rules](./business_rules.md#complex-rules) are always applied before [Rules](./business_rules.md#rules) and [Filters](./business_rules.md#filters).
2121

@@ -46,7 +46,7 @@ For the simplest rules, you can write them in the filters section. For example,
4646
{
4747
"entity": "movies",
4848
"name": "Ensure movie is less than 4 hours long",
49-
"expression": "duration_minutes > 240",
49+
"expression": "duration_minutes < 240",
5050
"failure_type": "record",
5151
"error_code": "MOVIE_TOO_LONG",
5252
"failure_message": "Movie must be less than 4 hours long.",
@@ -77,7 +77,7 @@ For the simplest rules, you can write them in the filters section. For example,
7777
{
7878
"entity": "movies",
7979
"name": "Ensure movie is less than 4 hours long",
80-
"expression": "duration_minutes > 240",
80+
"expression": "duration_minutes < 240",
8181
"failure_type": "submission",
8282
"error_code": "MOVIE_TOO_LONG",
8383
"failure_message": "Movie must be less than 4 hours long.",
@@ -108,7 +108,7 @@ For the simplest rules, you can write them in the filters section. For example,
108108
{
109109
"entity": "movies",
110110
"name": "Ensure movie is less than 4 hours long",
111-
"expression": "duration_minutes > 240",
111+
"expression": "duration_minutes < 240",
112112
"failure_type": "record",
113113
"is_informational": true,
114114
"error_code": "MOVIE_TOO_LONG",
@@ -205,7 +205,7 @@ The difference between modifiying the existing entity and adding a new one is si
205205

206206
!!! warning
207207

208-
If you add columns to an existing entity defined within the contract, that column will be written out with the projected entity. To get around this, you will either need to create new entities *or* you can see the [post rule logic](./business_rules.md#post-rule) section to remove the column.
208+
When adding new columns to an existing entity, these will be projected into the final entity. This might be something you want and have intended (derived fields), but if not, you will need to use the [post rule logic](./business_rules.md#post-rule) section to remove the column.
209209

210210
### Operations
211211

@@ -215,7 +215,7 @@ For a full list of operations that you can perform during the pre-steps see [Adv
215215

216216
When a Business Rule has finished, "post step rules" can be run. This is useful in situations where you've created lots of new entities *or* you have added lots of new columns to existing entities.
217217

218-
For new entities, it's advised that you always remove them. In instances where you have derived new columns for existing entities you may not want them to persist the columns in the projected assets. The code snippets below showcases how you can remove columns and new entities:
218+
For new entities, you may not want to persist these in the final outputs. If this is the case, you can add post rules to remove the entity entirely, or to remove a column from any existing entity (other than refdata entities). The code snippets below showcase how you can remove columns and new entities:
219219

220220
=== "New Column Removal"
221221

@@ -362,7 +362,7 @@ For latest supported reference data types, see [Advanced User Guidance: Referenc
362362

363363
## Complex Rules
364364

365-
Complex Rules are recommended when you need to perform a number of "pre-step" operations before you can apply a business rule (filter). For instance, if you needed to add a column, filter and then join you would need to add all these steps into your [Rules](./business_rules.md#rules) section. This might be ok, if you only need a small number of pre-steps or only have a couple of rules. However, when you have lots of rules and more than 1 have a number of operations required, it's best to place these into a [Rulestore](./business_rules.md#rule-stores) and reference them within the complex rules. Otherwise, you could start to make the dischema document completely unmaintainable.
365+
Complex Rules are recommended when you need to perform a number of "pre-step" operations before you can apply a business rule (filter). For instance, if you needed to add a column, filter and then join, you would need to add all these steps to your [Rules](./business_rules.md#rules) section. This might be fine if you only need a small number of pre-steps or only have a couple of rules. However, when you have lots of rules and more than one requires a number of operations, it's best to place these into a [Rule Store](./business_rules.md#rule-stores) and reference them within the complex rules. Rule Stores also have other benefits, which you can read about [here](./business_rules.md#rule-stores).
366366

367367
Here is an example of defining a complex rule:
368368

docs/user_guidance/data_contract.md

Lines changed: 2 additions & 2 deletions
@@ -71,7 +71,7 @@ The models within the Data Contract are written under the `datasets` key. For ex
7171
}
7272
```
7373

74-
From the example above, we've built two models from the source data which in turn will provide two seperated entities to work with in the business rules and how the data will be written out at the end of the process. Those models being `"movie"` and `"cast"` with `fields` specifying the name of the columns and the data type they should be casted to. We will look into [data types later in this page](data_contract.md#types).
74+
From the example above, we've built two models from the source data, which in turn provide two separate entities to work with in the business rules and determine how the data will be written out at the end of the process. Those models are `"movie"` and `"cast"`, with `fields` specifying the names of the columns and the data type they should be cast to. We will look into [data types later in this page](data_contract.md#types).
7575

7676

7777
### Mandatory Fields
@@ -213,7 +213,7 @@ Within the `fields` section of the contract you must define what data type a giv
213213

214214
### Constraints
215215

216-
Given the DVE supports Pydantic types, you can use any of the [constrained types available](https://docs.pydantic.dev/1.10/usage/types/#constrained-types). The docs will also show you what `kwarg` arguments are available for each constraint such as min/max length, regex patterns etc.
216+
Given that the DVE supports Pydantic types, you can use any of the [constrained types available](https://docs.pydantic.dev/1.10/usage/types/#constrained-types). The Pydantic docs will also show you what `kwarg` arguments are available for each constraint, such as min/max length, regex patterns, etc.
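As a rough illustration of the kind of `kwarg` arguments Pydantic 1.10 exposes on a constrained type (plain Pydantic shown here, not the dischema syntax):

```py
# Illustrative only: a Pydantic 1.10 constrained string and some of its kwargs.
from pydantic import BaseModel, constr


class Movie(BaseModel):
    # strip surrounding whitespace and require between 1 and 200 characters
    title: constr(strip_whitespace=True, min_length=1, max_length=200)


print(Movie(title="  Example Movie  ").title)  # -> "Example Movie"
```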
217217

218218
For example, if you wanted to use a `constr` type for a field, you would define it like this:
219219

docs/user_guidance/file_transformation.md

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@ tags:
77
- Readers
88
---
99

10-
The File Transformation stage within the DVE is used to convert submitted files to stringified parquet format. This is critical as the rest of the stages within the DVE are reliant on the data being in parquet format. [Parquet was choosen as it's a very efficient column oriented format](https://www.databricks.com/glossary/what-is-parquet). When specifying which formats you are expecting, you will define it in your dischema like this:
10+
The File Transformation stage within the DVE is used to convert submitted files to a stringified parquet format. This is critical, as the rest of the stages within the DVE rely on the data being in parquet format. [Parquet was chosen as it's a very efficient column-oriented format](https://www.databricks.com/glossary/what-is-parquet). When specifying which formats you are expecting, you define them in your dischema like this:
1111

1212
=== "DuckDB"
1313

docs/user_guidance/getting_started.md

Lines changed: 4 additions & 4 deletions
@@ -9,7 +9,7 @@ tags:
99

1010
## Rules Configuration Introduction
1111

12-
To use the DVE you will need to create a dischema document. The dischema document describes how the DVE should validate your data. It's divided into two primary parts. The first part is the `contract` (data contract) - this defines the structure of your data and determines how it is modeled and typecast. For example, here is a dischema document describing how the DVE may validate data about a movies:
12+
To use the DVE you will need to create a dischema document. The dischema document describes how the DVE should validate your data. It's divided into two primary parts. The first part is the `contract` (data contract) - this defines the structure of your data and determines how it is modeled and typecast. For example, here is a dischema document describing how the DVE may validate data about movies:
1313

1414
!!! example "Example `movies.dischema.json`"
1515

@@ -63,16 +63,16 @@ Within the example above, there are two parent keys - `schemas` and `datasets`.
6363

6464
`schemas` allow you to define custom complex data types. So, in the example above, the field `cast` would be expecting an array of structs containing the actor's name, role and the date they joined the movie.
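To make that concrete, a record matching the `cast` schema above would be shaped roughly like this (the field names are illustrative; the real ones come from your `schemas` definition):

```py
# Illustrative only: an array-of-structs `cast` field (actor name, role,
# date joined). Field names are assumed, not taken from the example schema.
example_record = {
    "title": "Example Movie",
    "cast": [
        {"name": "Jane Doe", "role": "Lead", "date_joined": "2021-03-01"},
        {"name": "John Smith", "role": "Supporting", "date_joined": "2021-03-15"},
    ],
}
```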
6565

66-
`datasets` describe the actual models for the entities you want to load. In the example above, we only want to load a single entity called `movies` which contains the fields `title, year, genre, duration_minutes, ratings and cast`. However, you could load the complex type `cast` into a seperate entity if you wanted to split your data into seperate entities. This can be useful in situations where a given entity has all the information you need to perform a given validation rule against, making the performance of rule faster & more efficient as there's less data to scan in a given entity.
66+
`datasets` describe the actual models for the entities you want to load. In the example above, we only want to load a single entity called `movies`, which contains the fields `title, year, genre, duration_minutes, ratings and cast`. However, you could load the complex type `cast` into a separate entity if you wanted. This can be useful in situations where a given entity has all the information you need to perform a given validation rule against, making the rule faster and more efficient to run, as there's less data to scan in a given entity.
6767

6868
!!! note
6969
The "splitting" of entities is considerably more useful in situtations where you want to normalise/de-normalise your data. If you're unfamiliar with this concept, you can read more about it [here](https://en.wikipedia.org/wiki/Database_normalization). However, you should keep in mind potential performance impacts of doing this. If you have rules that requires fields from different entities, you will have to perform a `join` between the split entities to be able to perform the rule.
7070

71-
For each dataset definition, you will need to provide a `reader_config` which describes how to load the data during the [File Transformation](file_transformation.md) stage. So, in the example above, we expect `movies` to come in as a `JSON` file. However, you can add more readers if you have the same data in different data formats (e.g. `csv`, `xml`, `json`). Regardless, of what they submit, the [File Transformation](file_transformation.md) stage will turn their submissions into a "stringified" parquet format which is a requirement for the subsequent stages.
71+
For each dataset definition, you will need to provide a `reader_config`, which describes how to load the data during the [File Transformation](file_transformation.md) stage. So, in the example above, we expect `movies` to come in as a `JSON` file. However, you can add more readers if you have the same data in different data formats (e.g. `csv`, `xml`, `json`). Regardless of the file format, the [File Transformation](file_transformation.md) stage will convert the submitted data into a "stringified" parquet format, which is a requirement for the subsequent stages.
7272

7373
To learn more about how you can construct your Data Contract, please read [here](data_contract.md).
7474

75-
The second part of the dischema are the `tranformations` (business_rules). This section describes the validation rules you want to apply to entities defined within the `contract`. For example, with our `movies` dataset above, we may want to check that movies in this dataset are less than 4 hours long. The expression to write this check is written in SQL and that syntax may change slightly depending on the SQL backend you've choosen (we currently support [DuckDB](implementations/duckdb.md) and [Spark SQL](implementations/spark.md)).
75+
The second part of the dischema is the `tranformations` section (Business Rules). This section describes the validation rules you want to apply to the entities defined within the `contract`. For example, with our `movies` dataset above, we may want to check that movies in this dataset are less than 4 hours long. The expression for this check is written in SQL, and the syntax may change slightly depending on the SQL backend you've chosen (we currently support [DuckDB](implementations/duckdb.md) and [Spark SQL](implementations/spark.md)).
7676
!!! example "Example `movies.dischema.json`"
7777

7878
```json

docs/user_guidance/implementations/duckdb.md

Lines changed: 1 addition & 2 deletions
@@ -36,8 +36,7 @@ Now you have the DuckDB connection object setup, you are ready to setup the requ
3636

3737
## Generating SubmissionInfo objects
3838

39-
Before we utilise the DVE, we need to generate an iterable object containing `SubmissionInfo` objects. These objects effectively contain the necessery metadata for the DVE to work with a given submission. Here is an example function used to generate SubmissionInfo objects from a given path:
40-
39+
Before we utilise the DVE, we need to generate an iterable object containing `SubmissionInfo` objects. These objects effectively contain the necessary metadata for the DVE to work with a given submission. Here is an example function used to generate [SubmissionInfo](../../advanced_guidance/package_documentation/models.md#dve.core_engine.models.SubmissionInfo) objects from a given path:
4140
```py
4241
import glob
4342
from datetime import date, datetime
