Commit 1b25735

docs: spag and other fixes
1 parent ebd7ef5 commit 1b25735

10 files changed

Lines changed: 59 additions & 30 deletions

File tree

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
1+
2+
::: dve.core_engine.models
3+
    handler: python
4+
    options:
5+
      show_root_heading: true
6+
      heading_level: 2

docs/index.md

Lines changed: 3 additions & 6 deletions
@@ -11,7 +11,7 @@ tags:
1111

1212
# Data Validation Engine
1313

14-
The Data Validation Engine (DVE) is a configuration driven data validation library written in [Python](https://www.python.org/), [Pydantic](https://docs.pydantic.dev/latest/) and a SQL backend currently consisting of [DuckDB](https://duckdb.org/) or [Spark](https://spark.apache.org/sql/). The configuration to run validations against a dataset are defined and written in a json document, which we will be referring to as the "dischema". The rules written within the dischema are designed to be run against all incoming data in a given submission - as this allows the DVE to capture all possible issues with the data without the submitter having resubmit the same data repeatedly which is burdensome and time consuming for the submitter and receiver of the data. Additionally, the rules can be configured to have the following behaviour:
14+
The Data Validation Engine (DVE) is a configuration-driven data validation library written in [Python](https://www.python.org/), built on [Pydantic](https://docs.pydantic.dev/latest/) and a SQL backend, currently either [DuckDB](https://duckdb.org/) or [Spark](https://spark.apache.org/sql/). The configuration used to run validations against a dataset is defined in a JSON document, which we refer to as the "dischema". The rules written within the dischema are designed to be run against all incoming data in a given submission, as this allows the DVE to capture all possible issues with the data without the submitter having to resubmit the same data repeatedly, which is burdensome and time consuming for both the submitter and the receiver of the data. Additionally, the rules can be configured to have the following behaviour:
1515

1616
- **File Rejection** - The entire submission will be rejected if the given rule triggers one or more times.
1717
- **Row Rejection** - The row that triggered the rule will be rejected. Rows that pass the validation flow through into a validated entity.
@@ -23,16 +23,13 @@ The DVE has 3 core components:
2323

2424
1. [File Transformation](user_guidance/file_transformation.md) - Parses submitted files into a "stringified" (all fields cast to string) parquet format.
2525

26-
???+ tip
27-
If your files are already in a parquet format, you do not need to use the file transformation and you can move straight onto the Data Contract.
28-
29-
2. [Data Contract](user_guidance/data_contract.md) - Validates submitted data against a specified datatype and casts successful records to that type.
26+
2. [Data Contract](user_guidance/data_contract.md) - Validates submitted data against the specified datatypes and casts successful records to those types. It also provides modelling of your data.
3027

3128
3. [Business rules](user_guidance/business_rules.md) - Performs simple and complex validations, such as comparisons between fields or entities, and/or lookups against reference data.
3229

3330
For each component listed above, a [feedback message](user_guidance/feedback_messages.md) is generated whenever a rule is violated. These [feedback messages](user_guidance/feedback_messages.md) can be integrated directly into your system, provided you can consume `JSONL` files. Alternatively, we offer a fourth component called the [Error Reports](user_guidance/error_reports.md). This component loads the [feedback messages](user_guidance/feedback_messages.md) into an `.xlsx` (Excel) file which can be sent back to the submitter of the data. The Excel file is compatible with services that offer spreadsheet reading, such as [Microsoft Excel](https://www.microsoft.com/en/microsoft-365/excel), [Google Docs](https://docs.google.com/), [LibreOffice Calc](https://www.libreoffice.org/discover/calc/), etc.
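For illustration, consuming the `JSONL` feedback can be as simple as reading one JSON object per line. A minimal sketch follows; the file name is just an example, and the message fields come from [feedback messages](user_guidance/feedback_messages.md), so none are assumed here:

```py
import json

# Minimal sketch: read DVE feedback messages from a JSONL file,
# one JSON object per line. Field names depend on the feedback
# message schema, so we only print the raw records here.
with open("feedback_messages.jsonl", encoding="utf-8") as fh:
    for line in fh:
        if line.strip():
            message = json.loads(line)
            print(message)
```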
3431

35-
To be able to run the DVE out of the box, you will need to choose and install one of the supported Backend Implementations such as [DuckDB](user_guidance/implementations/duckdb.md) or [Spark](user_guidance/implementations/spark.md). If you to need a write a custom backend implementation, you may want to look at the [Advanced User Guidance](advanced_guidance/backends.md) section.
32+
DVE currently comes with two supported backend implementations: [DuckDB](user_guidance/implementations/duckdb.md) and [Spark](user_guidance/implementations/spark.md). If you need to write a custom backend implementation, you may want to look at the [Advanced User Guidance](advanced_guidance/new_backend.md) section.
3633

3734
Feel free to use the Table of Contents on the left hand side of the page to navigate to sections of interest or to use the "Next" and "Previous" buttons at the bottom of each page if you want to read through each page in sequential order.
3835

docs/user_guidance/auditing.md

Lines changed: 6 additions & 5 deletions
@@ -9,11 +9,12 @@ The Auditing objects within the DVE are used to help control and store informati
99

1010
Currently, these are the audit tables that can be accessed within the DVE:
1111

12-
| Table Name | Purpose |
13-
| --------------------- | ------- |
14-
| `processing_status` | Contains information about the submission and what the current processing status is. |
15-
| `submission_info` | Contains information about the submitted file. |
16-
| `submission_statistics` | Contains validation statistics for each submission. |
12+
| Table Name | Purpose | When Available |
13+
| ----------------------- | ------- | -------------- |
14+
| `processing_status` | Contains information about the submission and what the current processing status is. | >= File Transformation |
15+
| `submission_info` | Contains information about the submitted file. | >= File Transformation |
16+
| `submission_statistics` | Contains validation statistics for each submission. | >= Error Reports |
17+
| `aggregates`             | Contains aggregate counts of errors triggered for a submission. | >= Error Reports |
1718

1819
## Audit Objects
1920

docs/user_guidance/business_rules.md

Lines changed: 7 additions & 7 deletions
@@ -15,7 +15,7 @@ The Business Rules section contain the rules you want to apply to your dataset.
1515

1616
All rules are written in `SQL`. Depending on which [backend implementation](./implementations/) you have chosen, the syntax might differ between implementations.
1717

18-
When writing the rules, you need to be aware that the expressions are wrapped in `NOT` expression. So, you should write the rules as though you are looking for non problematic values.
18+
When writing the rules, you need to be aware that the expressions are negated (wrapped in a `NOT` expression), so you should write the rules as though you are looking for non-problematic values.
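As a rough sketch of that negation, using the movie filter example further down this page (the exact SQL the engine builds may differ):

```py
# Illustrative sketch only -- not the DVE's actual query builder.
# You write the expression describing *valid* rows; the engine wraps it
# in NOT, so the rows that remain are the ones that failed the rule.
rule_expression = "duration_minutes < 240"  # valid: movie is under 4 hours
failing_rows_query = f"SELECT * FROM movies WHERE NOT ({rule_expression})"
print(failing_rows_query)
# SELECT * FROM movies WHERE NOT (duration_minutes < 240)
```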
1919

2020
When rules are being applied, [Complex Rules](./business_rules.md#complex-rules) are always applied before [Rules](./business_rules.md#rules) and [Filters](./business_rules.md#filters).
2121

@@ -46,7 +46,7 @@ For the simplest rules, you can write them in the filters section. For example,
4646
{
4747
"entity": "movies",
4848
"name": "Ensure movie is less than 4 hours long",
49-
"expression": "duration_minutes > 240",
49+
"expression": "duration_minutes < 240",
5050
"failure_type": "record",
5151
"error_code": "MOVIE_TOO_LONG",
5252
"failure_message": "Movie must be less than 4 hours long.",
@@ -77,7 +77,7 @@ For the simplest rules, you can write them in the filters section. For example,
7777
{
7878
"entity": "movies",
7979
"name": "Ensure movie is less than 4 hours long",
80-
"expression": "duration_minutes > 240",
80+
"expression": "duration_minutes < 240",
8181
"failure_type": "submission",
8282
"error_code": "MOVIE_TOO_LONG",
8383
"failure_message": "Movie must be less than 4 hours long.",
@@ -108,7 +108,7 @@ For the simplest rules, you can write them in the filters section. For example,
108108
{
109109
"entity": "movies",
110110
"name": "Ensure movie is less than 4 hours long",
111-
"expression": "duration_minutes > 240",
111+
"expression": "duration_minutes < 240",
112112
"failure_type": "record",
113113
"is_informational": true,
114114
"error_code": "MOVIE_TOO_LONG",
@@ -205,7 +205,7 @@ The difference between modifiying the existing entity and adding a new one is si
205205

206206
!!! warning
207207

208-
If you add columns to an existing entity defined within the contract, that column will be written out with the projected entity. To get around this, you will either need to create new entities *or* you can see the [post rule logic](./business_rules.md#post-rule) section to remove the column.
208+
When adding new columns to an existing entity, these will be projected into the final entity. This might be something you want and have intended (derived fields), but if not, you will need to use the [post rule logic](./business_rules.md#post-rule) section to remove the column.
209209

210210
### Operations
211211

@@ -215,7 +215,7 @@ For a full list of operations that you can perform during the pre-steps see [Adv
215215

216216
When a Business Rule has finished, "post step rules" can be run. This is useful in situations where you've created lots of new entities *or* you have added lots of new columns to existing entities.
217217

218-
For new entities, it's advised that you always remove them. In instances where you have derived new columns for existing entities you may not want them to persist the columns in the projected assets. The code snippets below showcases how you can remove columns and new entities:
218+
For new entities, you may not want to persist these in the final outputs. If this is the case, you can add post rules to remove the entity entirely, or to remove a column from any existing entity (other than refdata entities). The code snippets below showcase how you can remove columns and new entities:
219219

220220
=== "New Column Removal"
221221

@@ -362,7 +362,7 @@ For latest supported reference data types, see [Advanced User Guidance: Referenc
362362

363363
## Complex Rules
364364

365-
Complex Rules are recommended when you need to perform a number of "pre-step" operations before you can apply a business rule (filter). For instance, if you needed to add a column, filter and then join you would need to add all these steps into your [Rules](./business_rules.md#rules) section. This might be ok, if you only need a small number of pre-steps or only have a couple of rules. However, when you have lots of rules and more than 1 have a number of operations required, it's best to place these into a [Rulestore](./business_rules.md#rule-stores) and reference them within the complex rules. Otherwise, you could start to make the dischema document completely unmaintainable.
365+
Complex Rules are recommended when you need to perform a number of "pre-step" operations before you can apply a business rule (filter). For instance, if you needed to add a column, filter and then join, you would need to add all these steps to your [Rules](./business_rules.md#rules) section. This might be fine if you only need a small number of pre-steps or only have a couple of rules. However, when you have lots of rules and more than one requires a number of operations, it's best to place these into a [Rule Store](./business_rules.md#rule-stores) and reference them within the complex rules. Rule Stores also have other benefits, which you can read about [here](./business_rules.md#rule-stores).
366366

367367
Here is an example of defining a complex rule:
368368

docs/user_guidance/data_contract.md

Lines changed: 2 additions & 2 deletions
@@ -71,7 +71,7 @@ The models within the Data Contract are written under the `datasets` key. For ex
7171
}
7272
```
7373

74-
From the example above, we've built two models from the source data which in turn will provide two seperated entities to work with in the business rules and how the data will be written out at the end of the process. Those models being `"movie"` and `"cast"` with `fields` specifying the name of the columns and the data type they should be casted to. We will look into [data types later in this page](data_contract.md#types).
74+
From the example above, we've built two models from the source data, which in turn provide two separate entities to work with in the business rules and determine how the data will be written out at the end of the process. Those models are `"movie"` and `"cast"`, with `fields` specifying the names of the columns and the data type they should be cast to. We will look into [data types later in this page](data_contract.md#types).
7575

7676

7777
### Mandatory Fields
@@ -213,7 +213,7 @@ Within the `fields` section of the contract you must define what data type a giv
213213

214214
### Constraints
215215

216-
Given the DVE supports Pydantic types, you can use any of the [constrained types available](https://docs.pydantic.dev/1.10/usage/types/#constrained-types). The docs will also show you what `kwarg` arguments are available for each constraint such as min/max length, regex patterns etc.
216+
Given that the DVE supports Pydantic types, you can use any of the [constrained types available](https://docs.pydantic.dev/1.10/usage/types/#constrained-types). The Pydantic docs will also show you what `kwarg` arguments are available for each constraint, such as min/max length, regex patterns, etc.
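As a rough illustration of the kind of `kwarg` arguments Pydantic 1.10 exposes on a constrained type (plain Pydantic shown here, not the dischema syntax):

```py
# Illustrative only: a Pydantic 1.10 constrained string and some of its kwargs.
from pydantic import BaseModel, constr


class Movie(BaseModel):
    # strip surrounding whitespace and require between 1 and 200 characters
    title: constr(strip_whitespace=True, min_length=1, max_length=200)


print(Movie(title="  Example Movie  ").title)  # -> "Example Movie"
```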
217217

218218
For example, if you wanted to use a `constr` type for a field, you would define it like this:
219219

docs/user_guidance/file_transformation.md

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@ tags:
77
- Readers
88
---
99

10-
The File Transformation stage within the DVE is used to convert submitted files to stringified parquet format. This is critical as the rest of the stages within the DVE are reliant on the data being in parquet format. [Parquet was choosen as it's a very efficient column oriented format](https://www.databricks.com/glossary/what-is-parquet). When specifying which formats you are expecting, you will define it in your dischema like this:
10+
The File Transformation stage within the DVE is used to convert submitted files to a stringified parquet format. This is critical, as the rest of the stages within the DVE rely on the data being in parquet format. [Parquet was chosen as it's a very efficient column-oriented format](https://www.databricks.com/glossary/what-is-parquet). When specifying which formats you are expecting, you define them in your dischema like this:
1111

1212
=== "DuckDB"
1313

docs/user_guidance/getting_started.md

Lines changed: 4 additions & 4 deletions
@@ -9,7 +9,7 @@ tags:
99

1010
## Rules Configuration Introduction
1111

12-
To use the DVE you will need to create a dischema document. The dischema document describes how the DVE should validate your data. It's divided into two primary parts. The first part is the `contract` (data contract) - this defines the structure of your data and determines how it is modeled and typecast. For example, here is a dischema document describing how the DVE may validate data about a movies:
12+
To use the DVE you will need to create a dischema document. The dischema document describes how the DVE should validate your data. It's divided into two primary parts. The first part is the `contract` (data contract) - this defines the structure of your data and determines how it is modeled and typecast. For example, here is a dischema document describing how the DVE may validate data about movies:
1313

1414
!!! example "Example `movies.dischema.json`"
1515

@@ -63,16 +63,16 @@ Within the example above, there are two parent keys - `schemas` and `datasets`.
6363

6464
`schemas` allow you to define custom complex data types. So, in the example above, the field `cast` would be expecting an array of structs containing the actor's name, role and the date they joined the movie.
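To make that concrete, a record matching the `cast` schema above would be shaped roughly like this (the field names are illustrative; the real ones come from your `schemas` definition):

```py
# Illustrative only: an array-of-structs `cast` field (actor name, role,
# date joined). Field names are assumed, not taken from the example schema.
example_record = {
    "title": "Example Movie",
    "cast": [
        {"name": "Jane Doe", "role": "Lead", "date_joined": "2021-03-01"},
        {"name": "John Smith", "role": "Supporting", "date_joined": "2021-03-15"},
    ],
}
```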
6565

66-
`datasets` describe the actual models for the entities you want to load. In the example above, we only want to load a single entity called `movies` which contains the fields `title, year, genre, duration_minutes, ratings and cast`. However, you could load the complex type `cast` into a seperate entity if you wanted to split your data into seperate entities. This can be useful in situations where a given entity has all the information you need to perform a given validation rule against, making the performance of rule faster & more efficient as there's less data to scan in a given entity.
66+
`datasets` describe the actual models for the entities you want to load. In the example above, we only want to load a single entity called `movies`, which contains the fields `title, year, genre, duration_minutes, ratings and cast`. However, you could load the complex type `cast` into a separate entity if you wanted. This can be useful in situations where a given entity has all the information you need to perform a given validation rule against, making the rule faster and more efficient to run, as there's less data to scan in a given entity.
6767

6868
!!! note
6969
The "splitting" of entities is considerably more useful in situtations where you want to normalise/de-normalise your data. If you're unfamiliar with this concept, you can read more about it [here](https://en.wikipedia.org/wiki/Database_normalization). However, you should keep in mind potential performance impacts of doing this. If you have rules that requires fields from different entities, you will have to perform a `join` between the split entities to be able to perform the rule.
7070

71-
For each dataset definition, you will need to provide a `reader_config` which describes how to load the data during the [File Transformation](file_transformation.md) stage. So, in the example above, we expect `movies` to come in as a `JSON` file. However, you can add more readers if you have the same data in different data formats (e.g. `csv`, `xml`, `json`). Regardless, of what they submit, the [File Transformation](file_transformation.md) stage will turn their submissions into a "stringified" parquet format which is a requirement for the subsequent stages.
71+
For each dataset definition, you will need to provide a `reader_config`, which describes how to load the data during the [File Transformation](file_transformation.md) stage. So, in the example above, we expect `movies` to come in as a `JSON` file. However, you can add more readers if you have the same data in different data formats (e.g. `csv`, `xml`, `json`). Regardless of the file format, the [File Transformation](file_transformation.md) stage will convert the submitted data into a "stringified" parquet format, which is a requirement for the subsequent stages.
7272

7373
To learn more about how you can construct your Data Contract, please read [here](data_contract.md).
7474

75-
The second part of the dischema are the `tranformations` (business_rules). This section describes the validation rules you want to apply to entities defined within the `contract`. For example, with our `movies` dataset above, we may want to check that movies in this dataset are less than 4 hours long. The expression to write this check is written in SQL and that syntax may change slightly depending on the SQL backend you've choosen (we currently support [DuckDB](implementations/duckdb.md) and [Spark SQL](implementations/spark.md)).
75+
The second part of the dischema is the `tranformations` section (Business Rules). This section describes the validation rules you want to apply to the entities defined within the `contract`. For example, with our `movies` dataset above, we may want to check that movies in this dataset are less than 4 hours long. The expression for this check is written in SQL, and the syntax may change slightly depending on the SQL backend you've chosen (we currently support [DuckDB](implementations/duckdb.md) and [Spark SQL](implementations/spark.md)).
7676
!!! example "Example `movies.dischema.json`"
7777

7878
```json

docs/user_guidance/implementations/duckdb.md

Lines changed: 1 addition & 2 deletions
@@ -36,8 +36,7 @@ Now you have the DuckDB connection object setup, you are ready to setup the requ
3636

3737
## Generating SubmissionInfo objects
3838

39-
Before we utilise the DVE, we need to generate an iterable object containing `SubmissionInfo` objects. These objects effectively contain the necessery metadata for the DVE to work with a given submission. Here is an example function used to generate SubmissionInfo objects from a given path:
40-
39+
Before we utilise the DVE, we need to generate an iterable object containing `SubmissionInfo` objects. These objects effectively contain the necessary metadata for the DVE to work with a given submission. Here is an example function used to generate [SubmissionInfo](../../advanced_guidance/package_documentation/models.md#dve.core_engine.models.SubmissionInfo) objects from a given path:
4140
```py
4241
import glob
4342
from datetime import date, datetime
