This project demonstrates a data governance pipeline focused on compliance, featuring data cataloging, access control, lineage tracking, and automated quality validations.
The goal is to align data delivery with LGPD and ISO 27001 guidelines.
```
data_governance/
├── config/
│   └── policies.yaml        # Regulatory policies, access rules, quality rules
├── data/
│   ├── raw/                 # Mocked source data
│   └── processed/           # Governed outputs produced by the pipeline
├── logs/                    # Audit records
├── src/
│   └── data_governance/     # Framework implementation
│       ├── access_control.py
│       ├── catalog.py
│       ├── lineage.py
│       ├── pipeline.py
│       ├── policies.py
│       └── quality.py
└── README.md
```
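The README does not show the contents of `config/policies.yaml`; a minimal sketch of what such a file might hold, assuming it groups access rules and quality rules (every key and value below is illustrative, not the project's actual schema):

```yaml
# Illustrative only: the real keys are defined by src/data_governance/policies.py
regulations:
  - LGPD
  - ISO 27001
access_rules:
  - role: data_analyst
    allowed_columns: [customer_id, purchase_value]
quality_rules:
  - column: email
    check: regex
    pattern: "^[^@\\s]+@[^@\\s]+\\.[^@\\s]+$"
  - column: consent_date
    check: max_age_days
    value: 365
```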
- **Data Catalog**: Registers assets with metadata, sensitivity classification, tags, and regulatory compliance attributes.
- **Access Control**: Enforces role-based policies aligned with LGPD and ISO 27001 requirements.
- **Lineage Tracking**: Captures end-to-end transformations, mapping inputs and outputs for auditability.
- **Automated Data Quality**: Validates consent recency, email format, monetary values, and other business rules.
- **Auditing**: Generates JSON logs containing evidence of execution, loaded policies, and identified quality issues.
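The quality checks listed above (consent recency, email format, monetary values) can be sketched with pandas; the column names, sample data, and 365-day consent threshold are assumptions for illustration, not the project's actual schema:

```python
import re
from datetime import datetime, timedelta

import pandas as pd

# Hypothetical customer records; in the project, real data lives under data/raw/
df = pd.DataFrame({
    "email": ["ana@example.com", "invalid-email"],
    "purchase_value": [120.50, -10.0],
    "consent_date": [datetime.now() - timedelta(days=10),     # recent consent
                     datetime.now() - timedelta(days=1000)],  # stale consent
})

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def quality_issues(df: pd.DataFrame, consent_max_age_days: int = 365) -> list[dict]:
    """Return one issue record per violated rule, mirroring the audit-log idea."""
    issues = []
    cutoff = datetime.now() - timedelta(days=consent_max_age_days)
    for idx, row in df.iterrows():
        if not EMAIL_RE.match(row["email"]):
            issues.append({"row": idx, "rule": "email_format"})
        if row["purchase_value"] < 0:
            issues.append({"row": idx, "rule": "non_negative_value"})
        if pd.Timestamp(row["consent_date"]) < cutoff:
            issues.append({"row": idx, "rule": "consent_recency"})
    return issues

print(quality_issues(df))
```

Here row 0 passes every rule, while row 1 trips all three checks; the issue records are the kind of evidence the audit log would carry.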
- (Optional) Create a virtual environment and install dependencies:

  ```bash
  pip install pandas pyyaml
  ```

- Run the pipeline:

  ```bash
  python -m data_governance.src.data_governance.pipeline
  ```
- Or invoke it from Python:

  ```python
  from data_governance.src.data_governance.pipeline import run_pipeline

  audit_log = run_pipeline("data_governance")
  print(audit_log)
  ```

- The pipeline filters out customers without valid consent and tracks the latest update timestamp.
- Documented access policies and lineage records provide evidence for security and compliance audits.
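Role-based access enforcement of the kind described above can be sketched as a column filter driven by policy; the role names and policy structure here are assumptions for illustration (the real rules would come from `config/policies.yaml` via `access_control.py`):

```python
import pandas as pd

# Hypothetical policy: which columns each role may read (illustrative only)
ACCESS_POLICY = {
    "data_analyst": {"customer_id", "purchase_value"},
    "dpo": {"customer_id", "email", "purchase_value", "consent_date"},
}

def apply_access_policy(df: pd.DataFrame, role: str) -> pd.DataFrame:
    """Drop columns the role is not cleared to see; unknown roles get nothing."""
    allowed = ACCESS_POLICY.get(role, set())
    return df[[c for c in df.columns if c in allowed]]

df = pd.DataFrame({
    "customer_id": [1, 2],
    "email": ["a@x.com", "b@y.com"],
    "purchase_value": [10.0, 20.0],
})

print(list(apply_access_policy(df, "data_analyst").columns))
# → ['customer_id', 'purchase_value']
```

Denying by default (unknown roles see no columns) is the posture LGPD and ISO 27001 audits expect: access must be explicitly granted, never implied.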
- Integrate with a corporate catalog (e.g., Apache Atlas).
- Automate ingestion of policies from a GRC platform.
- Extend quality rules with statistical profiling and anomaly detection.