
Apply subjects filter to chunked reads in sqlite/import.py#1983

Open
Chessing234 wants to merge 1 commit into MIT-LCP:main from Chessing234:fix/sqlite-import-subjects-chunks

Conversation

@Chessing234

Bug

mimic-iv/buildmimic/sqlite/import.py provides a --limit N flag that restricts the imported database to the first N subject_ids. It works for small tables but is silently ignored for any table whose CSV exceeds THRESHOLD_SIZE (50 MB), i.e. the big ones (chartevents, labevents, outputevents, emar, etc.). The resulting SQLite database ends up containing every row of those tables instead of the requested subset.

Root cause

The import loop has two paths:

```python
if os.path.getsize(f) < THRESHOLD_SIZE:
    df = pd.read_csv(f, dtype=mimic_dtypes)
    df = process_dataframe(df, subjects=subjects)  # <- filters
    ...
else:
    # If the file is too large, let's do the work in chunks
    for chunk in pd.read_csv(f, chunksize=CHUNKSIZE, low_memory=False, dtype=mimic_dtypes):
        chunk = process_dataframe(chunk)  # <- NO subjects=
        ...
```

process_dataframe only filters when subjects is passed:

```python
def process_dataframe(df, subjects=None):
    ...
    if subjects is not None and 'subject_id' in df:
        df = df.loc[df['subject_id'].isin(subjects)]
    return df
```
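For illustration, here is a minimal, self-contained sketch of that filtering behavior on a toy DataFrame (the data and the simplified function body are assumptions for demonstration; the real function does more work, elided above):

```python
import pandas as pd

# Toy stand-in for a MIMIC-IV table; the column name matches the real schema.
df = pd.DataFrame({
    "subject_id": [1, 1, 2, 3, 3, 3],
    "value": [10, 11, 20, 30, 31, 32],
})

def process_dataframe(df, subjects=None):
    # Simplified version of the filter: rows survive only when their
    # subject_id is in the requested set, and only if subjects is given.
    if subjects is not None and "subject_id" in df:
        df = df.loc[df["subject_id"].isin(subjects)]
    return df

print(len(process_dataframe(df, subjects=[1, 2])))  # 3 rows kept
print(len(process_dataframe(df)))                   # 6 rows: no filter applied
```

With `subjects=None` (the default), the function is a no-op with respect to filtering, which is exactly why the chunked call silently passes everything through.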

The chunked call omits subjects=subjects, so every chunk of a large table is inserted unfiltered.

Fix

Pass subjects=subjects in the chunked call, matching the small-file path:

```python
chunk = process_dataframe(chunk, subjects=subjects)
```

One-line change. With this fix, --limit N is honored for every table regardless of file size.
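The effect of the fix can be checked end to end with a toy in-memory CSV (the data, chunk size, and simplified `process_dataframe` below are assumptions for demonstration, not the project's code):

```python
import io
import pandas as pd

csv_text = "subject_id,value\n1,10\n2,20\n1,11\n3,30\n2,21\n3,31\n"
subjects = [1, 2]

def process_dataframe(df, subjects=None):
    # Simplified stand-in for the real function.
    if subjects is not None and "subject_id" in df:
        df = df.loc[df["subject_id"].isin(subjects)]
    return df

# Buggy chunked path: subjects is never forwarded, so chunks pass through.
buggy = pd.concat(
    process_dataframe(chunk)
    for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2)
)

# Fixed chunked path: subjects=subjects is forwarded on every chunk.
fixed = pd.concat(
    process_dataframe(chunk, subjects=subjects)
    for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2)
)

# Small-file path for comparison: read whole, filter once.
small = process_dataframe(pd.read_csv(io.StringIO(csv_text)), subjects=subjects)

print(len(buggy))  # 6: every row, the limit was silently ignored
print(len(fixed))  # 4: only rows for subjects 1 and 2
print(fixed.reset_index(drop=True).equals(small.reset_index(drop=True)))  # True
```

After the fix, the chunked path produces exactly the same rows as the small-file path, just assembled incrementally.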

When the MIMIC-IV CSV being imported exceeds THRESHOLD_SIZE (50MB) the
loader switches to a chunked pd.read_csv loop, but the chunked path
calls process_dataframe(chunk) without passing 'subjects'. The small-
file path just above passes subjects=subjects, and process_dataframe
only filters to the requested subject_ids when that argument is
supplied, so --limit is silently ignored for any table large enough to
be chunked (e.g. chartevents, labevents). The resulting database ends
up containing every subject in those tables instead of the limited
subset the user asked for.

Pass subjects=subjects in the chunked call so --limit is honored for
large tables as well.
