
Apply subjects filter to chunked reads in sqlite/import.py#1983

Open
Chessing234 wants to merge 1 commit into MIT-LCP:main from Chessing234:fix/sqlite-import-subjects-chunks

Conversation

@Chessing234

Bug

mimic-iv/buildmimic/sqlite/import.py provides a --limit N flag that restricts the imported database to the first N subject_ids. It works for small tables but is silently ignored for any table whose CSV exceeds THRESHOLD_SIZE (50 MB), i.e. the big ones (chartevents, labevents, outputevents, emar, etc.). The resulting SQLite database ends up containing every row of those tables instead of the requested subset.

Root cause

The import loop has two paths:

```python
if os.path.getsize(f) < THRESHOLD_SIZE:
    df = pd.read_csv(f, dtype=mimic_dtypes)
    df = process_dataframe(df, subjects=subjects)  # <- filters
    ...
else:
    # If the file is too large, let's do the work in chunks
    for chunk in pd.read_csv(f, chunksize=CHUNKSIZE, low_memory=False, dtype=mimic_dtypes):
        chunk = process_dataframe(chunk)  # <- NO subjects=
        ...
```

process_dataframe only filters when subjects is passed:

```python
def process_dataframe(df, subjects=None):
    ...
    if subjects is not None and 'subject_id' in df:
        df = df.loc[df['subject_id'].isin(subjects)]
    return df
```
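For illustration, here is a minimal, self-contained sketch of that filtering behavior on a toy DataFrame (the data and the simplified function body are assumptions for demonstration; the real function does more work, elided above):

```python
import pandas as pd

# Toy stand-in for a MIMIC-IV table; the column name matches the real schema.
df = pd.DataFrame({
    "subject_id": [1, 1, 2, 3, 3, 3],
    "value": [10, 11, 20, 30, 31, 32],
})

def process_dataframe(df, subjects=None):
    # Simplified version of the filter: rows survive only when their
    # subject_id is in the requested set, and only if subjects is given.
    if subjects is not None and "subject_id" in df:
        df = df.loc[df["subject_id"].isin(subjects)]
    return df

print(len(process_dataframe(df, subjects=[1, 2])))  # 3 rows kept
print(len(process_dataframe(df)))                   # 6 rows: no filter applied
```

With `subjects=None` (the default), the function is a no-op with respect to filtering, which is exactly why the chunked call silently passes everything through.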

The chunked call omits subjects=subjects, so every chunk of a large table is inserted unfiltered.

Fix

Pass subjects=subjects in the chunked call, matching the small-file path:

```python
chunk = process_dataframe(chunk, subjects=subjects)
```

One-line change. With this fix, --limit N is honored for every table regardless of file size.
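The effect of the fix can be checked end to end with a toy in-memory CSV (the data, chunk size, and simplified `process_dataframe` below are assumptions for demonstration, not the project's code):

```python
import io
import pandas as pd

csv_text = "subject_id,value\n1,10\n2,20\n1,11\n3,30\n2,21\n3,31\n"
subjects = [1, 2]

def process_dataframe(df, subjects=None):
    # Simplified stand-in for the real function.
    if subjects is not None and "subject_id" in df:
        df = df.loc[df["subject_id"].isin(subjects)]
    return df

# Buggy chunked path: subjects is never forwarded, so chunks pass through.
buggy = pd.concat(
    process_dataframe(chunk)
    for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2)
)

# Fixed chunked path: subjects=subjects is forwarded on every chunk.
fixed = pd.concat(
    process_dataframe(chunk, subjects=subjects)
    for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2)
)

# Small-file path for comparison: read whole, filter once.
small = process_dataframe(pd.read_csv(io.StringIO(csv_text)), subjects=subjects)

print(len(buggy))  # 6: every row, the limit was silently ignored
print(len(fixed))  # 4: only rows for subjects 1 and 2
print(fixed.reset_index(drop=True).equals(small.reset_index(drop=True)))  # True
```

After the fix, the chunked path produces exactly the same rows as the small-file path, just assembled incrementally.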

When the MIMIC-IV CSV being imported exceeds THRESHOLD_SIZE (50MB) the
loader switches to a chunked pd.read_csv loop, but the chunked path
calls process_dataframe(chunk) without passing 'subjects'. The small-
file path just above passes subjects=subjects, and process_dataframe
only filters to the requested subject_ids when that argument is
supplied, so --limit is silently ignored for any table large enough to
be chunked (e.g. chartevents, labevents). The resulting database ends
up containing every subject in those tables instead of the limited
subset the user asked for.

Pass subjects=subjects in the chunked call so --limit is honored for
large tables as well.
