Commit 292891b
* fix: OOM issue when processing large CSV files by allowing a maximum of 5 chunks
* Revert "fix: OOM issue when processing large CSV files by allowing a maximum of 5 chunks"
This reverts commit aa8dda1.
* fix: use iterator to avoid OOM issues for large files
* fix: parquet and datalake tests
* fix: optimized the file reading code for all the support file types
* fix: s3fs version and json file reading
* test: added and fixed file reading tests
* fix: generate close on read first chunk
* refact: remove iterator instance check from tests
* refactor(datalake): Convert DataFrame API from list-based to generator-based
Refactors the pandas profiler and datalake readers to use lazy, generator-based DataFrame iteration instead of loading entire datasets into memory. This improves memory efficiency for large files and enables streaming processing.
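The difference between the two styles can be sketched with the standard library alone. This is a minimal illustration, not the code from this commit: `read_all` and `iter_chunks` are hypothetical names, and a list of row dicts stands in for one pandas DataFrame chunk.

```python
import csv
import io
from itertools import islice
from typing import Dict, Iterator, List, TextIO

Chunk = List[Dict[str, str]]  # stand-in for one DataFrame chunk (assumption)


def read_all(file_obj: TextIO) -> Chunk:
    """List-based style: every row is held in memory at once."""
    return list(csv.DictReader(file_obj))


def iter_chunks(file_obj: TextIO, chunk_size: int) -> Iterator[Chunk]:
    """Generator-based style: at most `chunk_size` rows are held at a time."""
    reader = csv.DictReader(file_obj)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


data = "id,name\n1,a\n2,b\n3,c\n"
sizes = [len(chunk) for chunk in iter_chunks(io.StringIO(data), chunk_size=2)]
```

With the generator, a consumer that only needs running totals never has to materialize the whole file.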
## Core API Changes
### DataFrame Readers (dataframes now returns a callable)
- base.py: Updated read_first_chunk() to handle callable dataframes
- parquet.py: All storage backends (S3, GCS, Azure, Local) return callables
- dsv.py: CSV/TSV readers return callables
- json.py: JSON/JSONL readers return callables
- avro.py: All storage backends return callables (fixed S3 reader)
- mf4.py: Fixed empty case to return callable
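Why a callable rather than a bare generator: a generator can be consumed only once, while a callable that builds a fresh generator allows several independent passes. The sketch below assumes that reading of the change. `make_factory` is a hypothetical name, the body of `read_first_chunk` is an illustration rather than the code in base.py, and lists of integers stand in for DataFrame chunks.

```python
from typing import Callable, Iterator, List, Optional, Union

Chunk = List[int]  # stand-in for one DataFrame chunk (assumption)
ChunkFactory = Callable[[], Iterator[Chunk]]


def make_factory(rows: List[int], chunk_size: int) -> ChunkFactory:
    """Return a callable; each call starts a fresh pass over the data."""

    def factory() -> Iterator[Chunk]:
        for start in range(0, len(rows), chunk_size):
            yield rows[start:start + chunk_size]

    return factory


def read_first_chunk(
    dataframes: Union[ChunkFactory, Iterator[Chunk]]
) -> Optional[Chunk]:
    """Accept a factory or an iterator and return only the first chunk."""
    chunks = iter(dataframes() if callable(dataframes) else dataframes)
    try:
        return next(chunks, None)
    finally:
        # Close the generator so any resources it holds are released early.
        close = getattr(chunks, "close", None)
        if close is not None:
            close()


factory = make_factory(list(range(5)), chunk_size=2)
first = read_first_chunk(factory)
first_pass = list(factory())
second_pass = list(factory())  # a second pass works because factory() restarts
```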
### Profiler Interface
- profiler_interface.py: Added _type_casted_dataset() wrapper that applies
type casting lazily; dataset is now a callable generator factory
- runner.py: Updated PandasRunner to work with callable dataset
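A lazy type-casting wrapper of this kind could look roughly as follows. This is a sketch under assumptions: `type_casted_dataset` and the `casters` mapping are hypothetical, and rows are plain dicts rather than DataFrame columns. The point is only that casting happens chunk by chunk, when the consumer iterates.

```python
from typing import Callable, Dict, Iterator, List

Row = Dict[str, object]
Chunk = List[Row]  # stand-in for one DataFrame chunk (assumption)
ChunkFactory = Callable[[], Iterator[Chunk]]


def type_casted_dataset(
    raw: ChunkFactory, casters: Dict[str, Callable[[object], object]]
) -> ChunkFactory:
    """Wrap a chunk factory so each chunk is cast only when it is consumed."""

    def factory() -> Iterator[Chunk]:
        for chunk in raw():
            yield [
                {col: casters.get(col, lambda v: v)(val) for col, val in row.items()}
                for row in chunk
            ]

    return factory


def raw_dataset() -> Iterator[Chunk]:
    yield [{"id": "1", "score": "2.5"}]
    yield [{"id": "2", "score": "4.0"}]


dataset = type_casted_dataset(raw_dataset, {"id": int, "score": float})
casted = list(dataset())
```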
### Metrics (updated to iterate over dataset)
- All static metrics (count, min, max, sum, mean, stddev, etc.)
- All window metrics (median, first_quartile, third_quartile)
- Hybrid metrics (histogram, cardinality_distribution)
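For the static metrics, iterating over the dataset means keeping running accumulators instead of concatenating chunks. The following is an illustrative sketch, not the profiler's implementation: `static_metrics` is a hypothetical function, chunks are lists of floats, and the standard deviation uses Welford's online update (population form).

```python
import math
from typing import Callable, Dict, Iterator, List

ChunkFactory = Callable[[], Iterator[List[float]]]


def static_metrics(dataset: ChunkFactory) -> Dict[str, float]:
    """Compute count, min, max, sum, mean and stddev in one streaming pass."""
    count = 0
    total = 0.0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford)
    minimum = math.inf
    maximum = -math.inf
    for chunk in dataset():
        for value in chunk:
            count += 1
            total += value
            minimum = min(minimum, value)
            maximum = max(maximum, value)
            delta = value - mean
            mean += delta / count
            m2 += delta * (value - mean)
    if count == 0:
        return {"count": 0}
    return {
        "count": count,
        "min": minimum,
        "max": maximum,
        "sum": total,
        "mean": mean,
        "stddev": math.sqrt(m2 / count),  # population standard deviation
    }


def dataset() -> Iterator[List[float]]:
    yield [1.0, 2.0]
    yield [3.0, 4.0]


metrics = static_metrics(dataset)
```

Memory use stays proportional to one chunk, regardless of how many chunks the file produces.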
### Data Quality Validators
- pandas_validator_mixin.py: Updated to iterate over dataframes
- base_test_handler.py: Handle callable dataframes
- Table validators: tableColumnCount*, tableColumnNameToExist,
tableColumnToMatchSet, tableCustomSQLQuery
### Utilities
- datalake_utils.py: fetch_dataframe() and fetch_dataframe_first_chunk()
now call the dataframes callable before iterating
- pandas_mixin.py: Updated partitioning logic
- sampler.py: Updated DatalakeSampler for generator-based dataset
## Test Updates
- test_profiler_interface.py, test_profiler.py, test_sample.py,
test_custom_metrics.py, test_datalake_metrics.py: Mock get_dataframes()
- test_*_reader.py: Use list(result.dataframes()) pattern
- test_parquet_azure_reader.py: Consume generator before mock assertions
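The reason a test has to consume the generator before asserting on mocks is that a lazy reader does no work until it is iterated. A small sketch with `unittest.mock` shows the effect; `make_dataframes` and `read_chunk` are hypothetical names, not the project's reader API.

```python
from unittest.mock import MagicMock


def make_dataframes(read_chunk, n_chunks):
    """Hypothetical reader result: a callable returning a lazy generator."""

    def dataframes():
        for index in range(n_chunks):
            yield read_chunk(index)

    return dataframes


read_chunk = MagicMock(side_effect=lambda index: [index])
dataframes = make_dataframes(read_chunk, n_chunks=3)

generator = dataframes()
calls_before = read_chunk.call_count  # nothing has been read yet

chunks = list(generator)  # consuming the generator triggers the reads
calls_after = read_chunk.call_count
```

Asserting on `read_chunk` before the `list(...)` call would see zero calls, which is why the tests use the `list(result.dataframes())` pattern first.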
* style: ran python linting
* refactor(datalake): addressed issues after initial review
* refactor: add logging when error
* fix: sdk validator
* style: ran java style check
* fix(observability): failing DQ tests
* fix: tests related to callable dataframes
* fix: failing tests
* fix: _stream_json_lines - use file_obj.readline()
* fix: pymssql version
* fix: ijson version
* fix: gitar typehinting comment
* fix: removed fetch_dataframe method
* fix: setup.py libraries to fix timeout
---------
Co-authored-by: TeddyCr <teddy.crepineau@gmail.com>
1 parent 1711b6d commit 292891b
82 files changed
Lines changed: 2830 additions & 827 deletions
File tree
- ingestion
- src/metadata
- data_quality
- builders
- validations
- column/pandas
- mixins
- table/pandas
- ingestion/source
- database/datalake
- storage
- mixins/pandas
- profiler
- interface/pandas
- metrics
- hybrid
- static
- window
- processor
- readers
- dataframe
- file
- sampler/pandas
- sdk/data_quality/dataframes
- utils/datalake
- tests/unit
- profiler/pandas
- readers
- test_suite
- topology
- database
- storage