|
| 1 | +# MLP vs dpsi Implementation Comparison |
| 2 | + |
| 3 | +## Critical Finding: DIFFERENT BASE CODE |
| 4 | + |
| 5 | +### Repository Origins |
| 6 | + |
| 7 | +**MLP Implementation (mlp-storage/dlio_benchmark):** |
| 8 | +- Repository: `https://github.com/russfellows/dlio_benchmark.git` |
| 9 | +- Branch: `main` |
| 10 | +- HEAD Commit: `ed7f476` "Add configurable dgen-py data generation support" |
| 11 | + |
| 12 | +**dpsi Implementation (mlp-storage-dpsi):** |
| 13 | +- Wrapper Repository: `https://github.com/dpsi/storage.git` (branch: darien-TF_ObjectStorage) |
| 14 | +- Embedded DLIO: `https://github.com/dpsi/dlio_benchmark.git@darien-s3-refactor` |
| 15 | +- HEAD Commit: `7078286` "Refactor S3 pytorch implementation. Change code to use storage_root config option and namespace. Removes urlparsing for each I/O..." |
| 16 | + |
| 17 | +### Common Ancestor |
| 18 | + |
| 19 | +Both implementations **diverged from a common upstream** around commit `3c2be85`: |
| 20 | +``` |
| 21 | +3c2be85 - Fix the first epoch AU calculation (#318) (#319) |
| 22 | +0207330 - feat(s3 checkpointing support): added pytorch s3 for checkpointing (#315) |
| 23 | +002424d - docs(profiling): fix dftracer broken link (#314) |
| 24 | +... |
| 25 | +``` |
| 26 | + |
| 27 | +**Divergence Point:** |
| 28 | +- **After 3c2be85**, russfellows added: `ed7f476` (dgen-py support) |
| 29 | +- **After 3c2be85**, dpsi added: `585f375` + `7078286` (S3 refactor) |
| 30 | + |
| 31 | +## Implementation Differences |
| 32 | + |
| 33 | +### File Sizes |
| 34 | +- **dpsi**: 145 lines (simple, focused) |
| 35 | +- **MLP**: 382 lines (complex, multi-library) |
| 36 | + |
| 37 | +### Architecture Philosophy |
| 38 | + |
| 39 | +**dpsi Approach:** |
| 40 | +```python |
| 41 | +# Bucket+key separation via config |
| 42 | +storage_root = "bucket-name" # The S3 bucket |
| 43 | +data_folder = "prefix/path" # Object key prefix |
| 44 | +namespace = "train" # Subdirectory |
| 45 | + |
| 46 | +# Result: s3://bucket-name/prefix/path/train/file.npz |
| 47 | +``` |
| 48 | + |
| 49 | +**MLP Approach:** |
| 50 | +```python |
| 51 | +# URI-based with runtime parsing |
| 52 | +data_dir = "s3://bucket-name/prefix/path" |
| 53 | +namespace = "train" |
| 54 | + |
| 55 | +# Runtime: urlparse(data_dir) → bucket="bucket-name", key="prefix/path" |
| 56 | +# Result: s3://bucket-name/prefix/path/train/file.npz |
| 57 | +``` |
| 58 | + |
| 59 | +### Library Support |
| 60 | + |
| 61 | +**dpsi:** |
| 62 | +- **Single library**: s3torchconnector only |
| 63 | +- Simple, well-tested |
| 64 | +- 145-line implementation |
| 65 | + |
| 66 | +**MLP:** |
| 67 | +- **Multi-library**: s3torchconnector, minio, s3dlio |
| 68 | +- Environment variable selector: `STORAGE_LIBRARY` |
| 69 | +- MinIOAdapter wrapper class (83 lines) |
| 70 | +- Dynamic library loading |
| 71 | +- 382-line implementation |
| 72 | + |
| 73 | +### Modified Files Overlap (MERGE CONFLICTS EXPECTED) |
| 74 | + |
| 75 | +Both implementations modified the SAME core files: |
| 76 | + |
| 77 | +1. **dlio_benchmark/storage/s3_torch_storage.py** |
| 78 | + - dpsi: Simplified to 145 lines, removed URL parsing |
| 79 | + - MLP: Expanded to 382 lines, added multi-library support |
| 80 | + |
| 81 | +2. **dlio_benchmark/storage/storage_handler.py** |
| 82 | + - dpsi: Added namespace handling |
| 83 | + - MLP: Added `self.logger` attribute |
| 84 | + |
| 85 | +3. **dlio_benchmark/storage/storage_factory.py** |
| 86 | + - dpsi: No changes |
| 87 | + - MLP: Added DLIO_S3_IMPLEMENTATION env var selector |
| 88 | + |
| 89 | +## Code Changes Breakdown |
| 90 | + |
| 91 | +### dpsi Refactor (commit 7078286, 9 files changed) |
| 92 | +``` |
| 93 | +dlio_benchmark/checkpointing/base_checkpointing.py | 4 +- |
| 94 | +dlio_benchmark/checkpointing/pytorch_s3_checkpointing.py | 49 ++--------- |
| 95 | +dlio_benchmark/configs/workload/unet3d_a100_s3.yaml | 4 +- |
| 96 | +dlio_benchmark/configs/workload/unet3d_h100_s3.yaml | 4 +- |
| 97 | +dlio_benchmark/main.py | 3 +- |
| 98 | +dlio_benchmark/storage/s3_storage.py | 56 ++++--------- |
| 99 | +dlio_benchmark/storage/s3_torch_storage.py | 98 +++++++--------------- |
| 100 | +dlio_benchmark/storage/storage_handler.py | 1 + |
| 101 | +dlio_benchmark/utils/config.py | 7 +- |
| 102 | +``` |
| 103 | +**Goal**: Simplify S3 implementation, eliminate per-I/O URL parsing overhead |
| 104 | + |
| 105 | +### MLP Changes (custom modifications) |
| 106 | +``` |
| 107 | +dlio_benchmark/storage/storage_factory.py | Added implementation selector |
| 108 | +dlio_benchmark/storage/s3_torch_storage.py | 383 lines (multi-library) |
| 109 | +dlio_benchmark/storage/s3_torch_storage_dpsi.py | 145 lines (dpsi copy) |
| 110 | +dlio_benchmark/storage/s3_storage_dpsi.py | dpsi base class copy |
| 111 | +dlio_benchmark/storage/storage_handler.py | Added self.logger |
| 112 | +``` |
| 113 | +**Goal**: Enable runtime library selection (s3torchconnector/minio/s3dlio) |
| 114 | + |
| 115 | +## Merge Implications |
| 116 | + |
| 117 | +### Option 1: Keep Separate (Current State) |
| 118 | +✅ **Pros:** |
| 119 | +- Clean comparison possible |
| 120 | +- No merge conflicts |
| 121 | +- Can benchmark both approaches independently |
| 122 | + |
| 123 | +❌ **Cons:** |
| 124 | +- Two codebases to maintain |
| 125 | +- Can't combine dpsi simplifications with MLP multi-library |
| 126 | + |
| 127 | +### Option 2: Merge dpsi into MLP |
| 128 | +**Strategy**: Add dpsi as 4th library option |
| 129 | +```python |
| 130 | +STORAGE_LIBRARY options: |
| 131 | +- s3torchconnector (MLP URI-based) |
| 132 | +- minio (MLP URI-based) |
| 133 | +- s3dlio (MLP URI-based, currently broken) |
| 134 | +- s3torch-dpsi (dpsi bucket+key architecture) |
| 135 | +``` |
| 136 | + |
| 137 | +✅ **Pros:** |
| 138 | +- Best of both worlds |
| 139 | +- Structured comparison |
| 140 | +- Single codebase |
| 141 | + |
| 142 | +❌ **Cons:** |
| 143 | +- Requires careful refactoring |
| 144 | +- Must preserve both URI and bucket+key approaches |
| 145 | + |
| 146 | +### Option 3: Replace MLP with dpsi + Add Libraries |
| 147 | +**Strategy**: Use dpsi's 145-line base, add minio/s3dlio adapters |
| 148 | + |
| 149 | +✅ **Pros:** |
| 150 | +- Simpler base (145 lines) |
| 151 | +- Cleaner architecture |
| 152 | +- Less URL parsing overhead |
| 153 | + |
| 154 | +❌ **Cons:** |
| 155 | +- Lose MLP's URI convenience |
| 156 | +- Must adapt configs to bucket+key format |
| 157 | + |
| 158 | +## Testing Status |
| 159 | + |
| 160 | +### ✅ Completed Tests |
| 161 | +1. **dpsi + s3torchconnector** (BASELINE) |
| 162 | + - Bucket: dpsi-s3torch |
| 163 | + - Result: ✅ 3 NPZ files created in ~23 seconds |
| 164 | + |
| 165 | +### ⏳ Pending Tests |
| 166 | +2. **MLP + s3torchconnector** |
| 167 | + - Bucket: mlp-s3torch |
| 168 | + - Expected: ✅ Should match baseline |
| 169 | + |
| 170 | +3. **MLP + minio** |
| 171 | + - Bucket: mlp-minio |
| 172 | + - Expected: ✅ Should work |
| 173 | + |
| 174 | +4. **MLP + s3dlio** |
| 175 | + - Bucket: mlp-s3dlio |
| 176 | + - Expected: ❌ Known bug at compat layer line 571 |
| 177 | + |
| 178 | +## Recommendations |
| 179 | + |
| 180 | +### Immediate Actions (Phase 1) |
| 181 | +1. ✅ Run MLP + s3torchconnector test (validate MLP URI parsing works) |
| 182 | +2. ✅ Run MLP + minio test (validate multi-library switching) |
| 183 | +3. Fix s3dlio bug and test |
| 184 | +4. **Compare performance**: dpsi (145 lines, no URL parsing) vs MLP (382 lines, runtime parsing) |
| 185 | + |
| 186 | +### Decision Point (Phase 2) |
| 187 | +Based on test results, decide: |
| 188 | +- **If dpsi is faster**: Adopt bucket+key architecture, add libraries to it |
| 189 | +- **If MLP matches dpsi**: Keep MLP approach, incorporate dpsi's simplifications |
| 190 | +- **If both equal**: Choose based on config convenience (URI vs bucket+key) |
| 191 | + |
| 192 | +### Integration Strategy (Phase 3) |
| 193 | +Likely approach: |
| 194 | +```python |
| 195 | +# Hybrid: Support both config styles |
| 196 | +if config.storage_root and config.data_folder: |
| 197 | + # dpsi bucket+key mode |
| 198 | + bucket = config.storage_root |
| 199 | + prefix = config.data_folder |
| 200 | +else: |
| 201 | + # MLP URI mode (backward compatible) |
| 202 | + bucket, prefix = parse_s3_uri(config.data_dir) |
| 203 | + |
| 204 | +# Then use selected library (s3torchconnector/minio/s3dlio) |
| 205 | +``` |
| 206 | + |
| 207 | +## Key Takeaway |
| 208 | + |
| 209 | +**The implementations started from the SAME upstream DLIO codebase but diverged:** |
| 210 | +- dpsi focused on **simplification** (145 lines, bucket+key) |
| 211 | +- MLP focused on **flexibility** (382 lines, multi-library, URI-based) |
| 212 | + |
| 213 | +Both are valid approaches. Testing will reveal which architecture performs better. |
0 commit comments