Apply subjects filter to chunked reads in sqlite/import.py#1983
Open
Chessing234 wants to merge 1 commit into MIT-LCP:main from
Conversation
When the MIMIC-IV CSV being imported exceeds THRESHOLD_SIZE (50 MB), the loader switches to a chunked pd.read_csv loop, but the chunked path calls process_dataframe(chunk) without passing 'subjects'. The small-file path just above passes subjects=subjects, and process_dataframe only filters to the requested subject_ids when that argument is supplied, so --limit is silently ignored for any table large enough to be chunked (e.g. chartevents, labevents). The resulting database ends up containing every subject in those tables instead of the limited subset the user asked for. Pass subjects=subjects in the chunked call so --limit is honored for large tables as well.
Bug
mimic-iv/buildmimic/sqlite/import.py provides a --limit N flag that restricts the imported database to the first N subject_ids. It works for small tables but is silently ignored for any table whose CSV exceeds THRESHOLD_SIZE (50 MB), i.e. the big ones (chartevents, labevents, outputevents, emar, etc.). The resulting SQLite database ends up containing every row of those tables instead of the requested subset.
Root cause
The import loop has two paths:
```python
if os.path.getsize(f) < THRESHOLD_SIZE:
    df = pd.read_csv(f, dtype=mimic_dtypes)
    df = process_dataframe(df, subjects=subjects)  # <- filters
    ...
else:
    # If the file is too large, let's do the work in chunks
    for chunk in pd.read_csv(f, chunksize=CHUNKSIZE, low_memory=False, dtype=mimic_dtypes):
        chunk = process_dataframe(chunk)  # <- NO subjects=
        ...
```
process_dataframe only filters when subjects is passed:

```python
def process_dataframe(df, subjects=None):
    ...
    if subjects is not None and 'subject_id' in df:
        df = df.loc[df['subject_id'].isin(subjects)]
    return df
```
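The silent-skip behaviour is easy to demonstrate in isolation. The sketch below uses a simplified copy of the helper and a tiny in-memory DataFrame (hypothetical data; only the filtering clause is taken from the source):

```python
import pandas as pd

def process_dataframe(df, subjects=None):
    # Simplified copy of the loader's helper: it only filters
    # when a subjects collection is actually supplied.
    if subjects is not None and 'subject_id' in df:
        df = df.loc[df['subject_id'].isin(subjects)]
    return df

df = pd.DataFrame({'subject_id': [1, 2, 3, 4],
                   'value': [10, 20, 30, 40]})
subjects = {1, 2}

filtered = process_dataframe(df, subjects=subjects)  # small-file path
unfiltered = process_dataframe(df)                   # buggy chunked path

print(len(filtered))    # 2 -- only the requested subjects survive
print(len(unfiltered))  # 4 -- filter silently skipped, all rows kept
```

No error is raised in the second call, which is why the bug goes unnoticed until the database size is inspected.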
The chunked call omits subjects=subjects, so every chunk of a large table is inserted unfiltered.
Fix
Pass subjects=subjects in the chunked call, matching the small-file path:

```python
chunk = process_dataframe(chunk, subjects=subjects)
```
One-line change. With this fix, --limit N is honored for every table regardless of file size.
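End to end, the corrected chunked path can be sketched as follows. This is a minimal stand-in, not the real loader: the CSV is an in-memory StringIO, the table name 'events' is hypothetical, and process_dataframe is the simplified helper from above.

```python
import io
import sqlite3
import pandas as pd

def process_dataframe(df, subjects=None):
    # Simplified stand-in for the loader's helper.
    if subjects is not None and 'subject_id' in df:
        df = df.loc[df['subject_id'].isin(subjects)]
    return df

csv_data = io.StringIO("subject_id,value\n1,10\n2,20\n3,30\n4,40\n")
subjects = {1, 2}
conn = sqlite3.connect(':memory:')

# Corrected chunked path: each chunk is filtered before insertion,
# exactly as the small-file path does for whole files.
for chunk in pd.read_csv(csv_data, chunksize=2):
    chunk = process_dataframe(chunk, subjects=subjects)
    chunk.to_sql('events', conn, if_exists='append', index=False)

rows = conn.execute('SELECT COUNT(*) FROM events').fetchone()[0]
print(rows)  # 2 -- only subjects 1 and 2 were imported
```

Dropping subjects=subjects from the loop body reproduces the reported bug: all four rows land in the database.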