Skip to content

Commit 2b43c57

Browse files
committed
feat: Add multi-library S3 storage support (s3torchconnector, minio, s3dlio)
- Add URI-based storage handler with 3 library backends - Integrate s3dlio v0.9.40 native API (put_bytes, get_bytes, list) - Apply PR mlcommons#232 fix for empty data_dir handling - Add comprehensive test suite with 3 validated implementations - Organize project structure (tests/, docs/, patches/) - Document MLP vs dpsi architectural comparison Changes preserved in patches/ directory for flexible integration approach. Test results: All 3 libraries working (s3torch: 30s, minio: 15s, s3dlio: 31s)
1 parent b41fc05 commit 2b43c57

19 files changed

Lines changed: 2393 additions & 78 deletions

.gitignore

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -65,3 +65,7 @@ mlperf.history
6565
# OS
6666
.DS_Store
6767
Thumbs.db
68+
69+
# Test artifacts
70+
hydra_log/
71+
minio_test/

docs/IMPLEMENTATION_COMPARISON.md

Lines changed: 213 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,213 @@
1+
# MLP vs dpsi Implementation Comparison
2+
3+
## Critical Finding: DIFFERENT BASE CODE
4+
5+
### Repository Origins
6+
7+
**MLP Implementation (mlp-storage/dlio_benchmark):**
8+
- Repository: `https://github.com/russfellows/dlio_benchmark.git`
9+
- Branch: `main`
10+
- HEAD Commit: `ed7f476` "Add configurable dgen-py data generation support"
11+
12+
**dpsi Implementation (mlp-storage-dpsi):**
13+
- Wrapper Repository: `https://github.com/dpsi/storage.git` (branch: darien-TF_ObjectStorage)
14+
- Embedded DLIO: `https://github.com/dpsi/dlio_benchmark.git@darien-s3-refactor`
15+
- HEAD Commit: `7078286` "Refactor S3 pytorch implementation. Change code to use storage_root config option and namespace. Removes urlparsing for each I/O..."
16+
17+
### Common Ancestor
18+
19+
Both implementations **diverged from a common upstream** around commit `3c2be85`:
20+
```
21+
3c2be85 - Fix the first epoch AU calculation (#318) (#319)
22+
0207330 - feat(s3 checkpointing support): added pytorch s3 for checkpointing (#315)
23+
002424d - docs(profiling): fix dftracer broken link (#314)
24+
...
25+
```
26+
27+
**Divergence Point:**
28+
- **After 3c2be85**, russfellows added: `ed7f476` (dgen-py support)
29+
- **After 3c2be85**, dpsi added: `585f375` + `7078286` (S3 refactor)
30+
31+
## Implementation Differences
32+
33+
### File Sizes
34+
- **dpsi**: 145 lines (simple, focused)
35+
- **MLP**: 382 lines (complex, multi-library)
36+
37+
### Architecture Philosophy
38+
39+
**dpsi Approach:**
40+
```python
41+
# Bucket+key separation via config
42+
storage_root = "bucket-name" # The S3 bucket
43+
data_folder = "prefix/path" # Object key prefix
44+
namespace = "train" # Subdirectory
45+
46+
# Result: s3://bucket-name/prefix/path/train/file.npz
47+
```
48+
49+
**MLP Approach:**
50+
```python
51+
# URI-based with runtime parsing
52+
data_dir = "s3://bucket-name/prefix/path"
53+
namespace = "train"
54+
55+
# Runtime: urlparse(data_dir) → bucket="bucket-name", key="prefix/path"
56+
# Result: s3://bucket-name/prefix/path/train/file.npz
57+
```
58+
59+
### Library Support
60+
61+
**dpsi:**
62+
- **Single library**: s3torchconnector only
63+
- Simple, well-tested
64+
- 145-line implementation
65+
66+
**MLP:**
67+
- **Multi-library**: s3torchconnector, minio, s3dlio
68+
- Environment variable selector: `STORAGE_LIBRARY`
69+
- MinIOAdapter wrapper class (83 lines)
70+
- Dynamic library loading
71+
- 382-line implementation
72+
73+
### Modified Files Overlap (MERGE CONFLICTS EXPECTED)
74+
75+
Both implementations modified the SAME core files:
76+
77+
1. **dlio_benchmark/storage/s3_torch_storage.py**
78+
- dpsi: Simplified to 145 lines, removed URL parsing
79+
- MLP: Expanded to 382 lines, added multi-library support
80+
81+
2. **dlio_benchmark/storage/storage_handler.py**
82+
- dpsi: Added namespace handling
83+
- MLP: Added `self.logger` attribute
84+
85+
3. **dlio_benchmark/storage/storage_factory.py**
86+
- dpsi: No changes
87+
- MLP: Added DLIO_S3_IMPLEMENTATION env var selector
88+
89+
## Code Changes Breakdown
90+
91+
### dpsi Refactor (commit 7078286, 9 files changed)
92+
```
93+
dlio_benchmark/checkpointing/base_checkpointing.py | 4 +-
94+
dlio_benchmark/checkpointing/pytorch_s3_checkpointing.py | 49 ++---------
95+
dlio_benchmark/configs/workload/unet3d_a100_s3.yaml | 4 +-
96+
dlio_benchmark/configs/workload/unet3d_h100_s3.yaml | 4 +-
97+
dlio_benchmark/main.py | 3 +-
98+
dlio_benchmark/storage/s3_storage.py | 56 ++++---------
99+
dlio_benchmark/storage/s3_torch_storage.py | 98 +++++++---------------
100+
dlio_benchmark/storage/storage_handler.py | 1 +
101+
dlio_benchmark/utils/config.py | 7 +-
102+
```
103+
**Goal**: Simplify S3 implementation, eliminate per-I/O URL parsing overhead
104+
105+
### MLP Changes (custom modifications)
106+
```
107+
dlio_benchmark/storage/storage_factory.py | Added implementation selector
108+
dlio_benchmark/storage/s3_torch_storage.py | 383 lines (multi-library)
109+
dlio_benchmark/storage/s3_torch_storage_dpsi.py | 145 lines (dpsi copy)
110+
dlio_benchmark/storage/s3_storage_dpsi.py | dpsi base class copy
111+
dlio_benchmark/storage/storage_handler.py | Added self.logger
112+
```
113+
**Goal**: Enable runtime library selection (s3torchconnector/minio/s3dlio)
114+
115+
## Merge Implications
116+
117+
### Option 1: Keep Separate (Current State)
118+
**Pros:**
119+
- Clean comparison possible
120+
- No merge conflicts
121+
- Can benchmark both approaches independently
122+
123+
**Cons:**
124+
- Two codebases to maintain
125+
- Can't combine dpsi simplifications with MLP multi-library
126+
127+
### Option 2: Merge dpsi into MLP
128+
**Strategy**: Add dpsi as 4th library option
129+
```python
130+
STORAGE_LIBRARY options:
131+
- s3torchconnector (MLP URI-based)
132+
- minio (MLP URI-based)
133+
- s3dlio (MLP URI-based, currently broken)
134+
- s3torch-dpsi (dpsi bucket+key architecture)
135+
```
136+
137+
**Pros:**
138+
- Best of both worlds
139+
- Structured comparison
140+
- Single codebase
141+
142+
**Cons:**
143+
- Requires careful refactoring
144+
- Must preserve both URI and bucket+key approaches
145+
146+
### Option 3: Replace MLP with dpsi + Add Libraries
147+
**Strategy**: Use dpsi's 145-line base, add minio/s3dlio adapters
148+
149+
**Pros:**
150+
- Simpler base (145 lines)
151+
- Cleaner architecture
152+
- Less URL parsing overhead
153+
154+
**Cons:**
155+
- Lose MLP's URI convenience
156+
- Must adapt configs to bucket+key format
157+
158+
## Testing Status
159+
160+
### ✅ Completed Tests
161+
1. **dpsi + s3torchconnector** (BASELINE)
162+
- Bucket: dpsi-s3torch
163+
- Result: ✅ 3 NPZ files created in ~23 seconds
164+
165+
### ⏳ Pending Tests
166+
2. **MLP + s3torchconnector**
167+
- Bucket: mlp-s3torch
168+
- Expected: ✅ Should match baseline
169+
170+
3. **MLP + minio**
171+
- Bucket: mlp-minio
172+
- Expected: ✅ Should work
173+
174+
4. **MLP + s3dlio**
175+
- Bucket: mlp-s3dlio
176+
- Expected: ❌ Known bug at compat layer line 571
177+
178+
## Recommendations
179+
180+
### Immediate Actions (Phase 1)
181+
1. ✅ Run MLP + s3torchconnector test (validate MLP URI parsing works)
182+
2. ✅ Run MLP + minio test (validate multi-library switching)
183+
3. Fix s3dlio bug and test
184+
4. **Compare performance**: dpsi (145 lines, no URL parsing) vs MLP (382 lines, runtime parsing)
185+
186+
### Decision Point (Phase 2)
187+
Based on test results, decide:
188+
- **If dpsi is faster**: Adopt bucket+key architecture, add libraries to it
189+
- **If MLP matches dpsi**: Keep MLP approach, incorporate dpsi's simplifications
190+
- **If both equal**: Choose based on config convenience (URI vs bucket+key)
191+
192+
### Integration Strategy (Phase 3)
193+
Likely approach:
194+
```python
195+
# Hybrid: Support both config styles
196+
if config.storage_root and config.data_folder:
197+
# dpsi bucket+key mode
198+
bucket = config.storage_root
199+
prefix = config.data_folder
200+
else:
201+
# MLP URI mode (backward compatible)
202+
bucket, prefix = parse_s3_uri(config.data_dir)
203+
204+
# Then use selected library (s3torchconnector/minio/s3dlio)
205+
```
206+
207+
## Key Takeaway
208+
209+
**The implementations started from the SAME upstream DLIO codebase but diverged:**
210+
- dpsi focused on **simplification** (145 lines, bucket+key)
211+
- MLP focused on **flexibility** (382 lines, multi-library, URI-based)
212+
213+
Both are valid approaches. Testing will reveal which architecture performs better.

0 commit comments

Comments
 (0)