|
| 1 | +# AGENTS.md — Guidance for AI agents running remote-module ingestion |
| 2 | + |
| 3 | +This file is written for AI coding agents (Claude Code, Cursor, GPT |
| 4 | +Codex, similar) that are asked to ingest a remote ITK module into |
| 5 | +the main source tree. It complements `README.md` (human-focused) |
| 6 | +and `INGESTION_STRATEGY.md` (policy-focused). Read this before |
| 7 | +running `ingest-remote-module.sh`. |
| 8 | + |
| 9 | +## When to invoke this workflow |
| 10 | + |
| 11 | +The user has asked you to: |
| 12 | + |
| 13 | +- "Ingest `<Module>` into `Modules/<Group>/`" |
| 14 | +- "Move `<Module>` from `Modules/Remote/` into the main tree" |
| 15 | +- "Bring `<Module>` in-tree, preserving history" |
| 16 | + |
| 17 | +If the user only says "build `<Module>`" or "enable `<Module>`", |
| 18 | +they want `-DModule_<Module>:BOOL=ON` — not an ingest. Don't start |
| 19 | +an ingest unless the destination is `Modules/<Group>/<Module>/`. |
| 20 | + |
| 21 | +## Pre-flight (mandatory) |
| 22 | + |
| 23 | +1. **Confirm `git-filter-repo` is available**: |
| 24 | + ``` |
| 25 | + command -v git-filter-repo |
| 26 | + ``` |
| 27 | + If missing: `pixi global install git-filter-repo`. |
| 28 | + |
| 29 | +2. **Confirm the working tree is clean** on an ITK checkout: |
| 30 | + ``` |
| 31 | + git status --porcelain # must be empty |
| 32 | + git rev-parse --abbrev-ref HEAD # current branch |
| 33 | + ``` |
| 34 | + |
| 35 | +3. **Confirm you are NOT on `main`**. Create a feature branch: |
| 36 | + ``` |
| 37 | + git checkout -b ingest-<Module> upstream/main |
| 38 | + ``` |
| 39 | + |
| 40 | +4. **Confirm the upstream URL**. The script reads it from |
| 41 | + `Modules/Remote/<Module>.remote.cmake` by default, but sanity- |
| 42 | + check by eyeball: |
| 43 | + ``` |
| 44 | + grep GIT_REPOSITORY Modules/Remote/<Module>.remote.cmake |
| 45 | + ``` |
| 46 | + |
| 47 | +5. **Always dry-run first** for any module you haven't ingested before: |
| 48 | + ``` |
| 49 | + Utilities/Maintenance/RemoteModuleIngest/ingest-remote-module.sh \ |
| 50 | + <Module> <Group> --dry-run --keep-tempdir |
| 51 | + ``` |
| 52 | + Inspect the tempdir output before the real run. In particular: |
| 53 | + - `ls Modules/<Group>/<Module>/` — does the file set look right? |
| 54 | + - `grep -c '\.md5$' \| grep -c '\.cid$'` — content-link inventory? |
| 55 | + - `git log --oneline | wc -l` — surviving commit count sane? |
| 56 | + |
| 57 | +## Decision points during an ingest |
| 58 | + |
| 59 | +### 1. Does the upstream have a `LICENSE` that is NOT Apache-2.0? |
| 60 | + |
| 61 | +Whitelist auto-excludes `LICENSE`. If the upstream's license is not |
| 62 | +Apache-2.0 (e.g., MIT, BSD-3-Clause, a dual license, or something |
| 63 | +unusual), **stop and escalate to the human**. ITK's root-level |
| 64 | +license applies in-tree; non-Apache modules need a per-module |
| 65 | +decision and possibly a `NOTICE` entry. Do not silently strip a |
| 66 | +non-Apache license. |
| 67 | + |
| 68 | +### 2. Does the upstream have files outside the whitelist that |
| 69 | +look necessary? |
| 70 | + |
| 71 | +The whitelist is `include/`, `src/`, `test/`, `wrapping/`, |
| 72 | +`CMakeLists.txt`, `itk-module.cmake`. If the upstream relies on |
| 73 | +e.g. `CMake/<Module>Targets.cmake.in` or a custom `<Module>Config.cmake` |
| 74 | +at root, and the module won't build without it, widen the whitelist |
| 75 | +for that invocation. Document the widening in the merge commit body: |
| 76 | + |
| 77 | +#### Important: directory-level whitelisting is not enough |
| 78 | + |
| 79 | +`--path test` admits **everything** under `test/` in every historical |
| 80 | +commit, including files with scaffolding-ish names that don't belong |
| 81 | +in ITK. Example caught on PR #6093: upstream |
| 82 | +`ITKAnisotropicDiffusionLBR` had `test/azure-pipelines.yml` and |
| 83 | +`test/Docker/Dockerfile` in several historical commits. Those leaked |
| 84 | +through the directory-level whitelist and were only caught by a |
| 85 | +follow-up history-wide scan. |
| 86 | + |
| 87 | +The driver handles this automatically via a **deny-pattern pass** |
| 88 | +(step 3 in the script) that strips scaffolding filenames anywhere in |
| 89 | +the filtered tree, followed by a **whitelist-verification scan** |
| 90 | +(step 3b) that aborts the ingest if any scaffolding pattern is still |
| 91 | +reachable in any commit. You don't normally need to do anything |
| 92 | +extra — but if the verification scan trips on a NEW pattern the |
| 93 | +driver doesn't know about yet, add it to the `--path-glob` list in |
| 94 | +step 3 and re-run. Do not push an ingest where the verification |
| 95 | +scan triggered a warning. |
| 96 | + |
| 97 | +``` |
| 98 | +Whitelist passes (git filter-repo): |
| 99 | + - --path include --path src --path test --path wrapping |
| 100 | + - --path CMakeLists.txt --path itk-module.cmake |
| 101 | + - --path CMake/<Module>Config.cmake.in # needed to resolve the |
| 102 | + # find_package() chain |
| 103 | +``` |
| 104 | + |
| 105 | +### 3. Does the `test/` tree reference raw binary files (not |
| 106 | +content-links)? |
| 107 | + |
| 108 | +If `test/` contains `.png`, `.nrrd`, `.mha`, `.vtk`, etc., directly |
| 109 | +(without a sibling `.md5` / `.cid`), those are raw binaries that |
| 110 | +should either (a) be moved to the ExternalData fetch path (uploaded |
| 111 | +to web3.storage, replaced with a `.cid` stub) or (b) be stripped |
| 112 | +entirely. |
| 113 | + |
| 114 | +**Do not merge an ingest that ships raw binary test assets.** |
| 115 | +Either upload them via `w3 up` per |
| 116 | +`Documentation/docs/contributing/upload_binary_data.md` and commit |
| 117 | +the resulting `.cid` stubs, or escalate to the human. |
| 118 | + |
| 119 | +### 4. Does the ingest produce any `.md5` or `.shaNNN` |
| 120 | +content-links? |
| 121 | + |
| 122 | +**CID conversion is mandatory. It MUST complete before the PR is |
| 123 | +pushed. It is not optional, and it cannot be deferred to a |
| 124 | +follow-up PR.** The ingestion is incomplete until every |
| 125 | +content-link is `.cid`. |
| 126 | + |
| 127 | +Timing: the conversion may happen either immediately after the |
| 128 | +merge commit (preferred: same session, same branch) or as a |
| 129 | +fixup folded into the merge commit. Either is fine. What is |
| 130 | +**not** fine is pushing the PR with `.md5` or `.shaNNN` stubs still |
| 131 | +in the tree. |
| 132 | + |
| 133 | +When the `ingest-remote-module.sh` driver finishes and |
| 134 | +non-`.cid` content-links remain, it exits with code `2` |
| 135 | +("action required") and prints the list of files. Don't treat that |
| 136 | +as a warning; treat it as a blocking pre-push gate. |
| 137 | + |
| 138 | +The conversion workflow: |
| 139 | + |
| 140 | +1. **If the agent has network access to web3.storage + the data |
| 141 | + mirrors:** run `Utilities/Maintenance/RemoteModuleIngest/cid-normalize.sh` |
| 142 | + on the ingested module path. It reads each `.md5` / `.shaNNN` |
| 143 | + file, resolves the content via the ExternalData mirror, uploads |
| 144 | + the bytes via `w3 up` if not already pinned, writes the resulting |
| 145 | + `.cid` stub, and deletes the old hash file. Commit as a |
| 146 | + single `STYLE: Convert content-links to .cid` commit. |
| 147 | + |
| 148 | +2. **If the agent lacks network access** (common in sandboxed |
| 149 | + environments): do not push. Instead, walk the human through the |
| 150 | + manual flow: |
| 151 | + - Install `@web3-storage/w3cli` per |
| 152 | + `Documentation/docs/contributing/upload_binary_data.md`. |
| 153 | + - Run `w3 login <email>` and `w3 up <file>` for each content |
| 154 | + referenced by the `.md5` / `.shaNNN` stubs. |
| 155 | + - Write each returned CID to the sibling `.cid` file; delete the |
| 156 | + old hash file. |
| 157 | + - Re-invoke the driver's post-merge verification to confirm all |
| 158 | + content-links are now `.cid`. |
| 159 | + |
| 160 | +3. **Before pushing the PR**, the agent must run |
| 161 | + `verify-cid-access.sh` (in this directory) which fetches each |
| 162 | + `.cid` through an IPFS gateway to confirm the content is |
| 163 | + actually retrievable. A local build + test cycle must also |
| 164 | + succeed — the ingested module's test targets must fetch and |
| 165 | + pass using the new `.cid` stubs. No shortcuts here; a passing |
| 166 | + CI run on an in-tree module that can't resolve its test data |
| 167 | + is worse than a red CI check. |
| 168 | + |
| 169 | +### 5. Does the upstream have an `examples/` directory? |
| 170 | + |
| 171 | +The whitelist auto-excludes `examples/`. Per @dzenanz on |
| 172 | +PR #6093, per-module examples do NOT ingest inline. Three options |
| 173 | +for the human to pick from: |
| 174 | + |
| 175 | +- **Archive only** (default): leave `examples/` in the archived |
| 176 | + upstream repo. |
| 177 | +- **Relocate**: open a separate follow-up PR that copies useful |
| 178 | + examples into `InsightSoftwareConsortium/ITK/Examples/<Module>/`, |
| 179 | + with their own CMake test registration. |
| 180 | +- **Ignore**: if the examples are clearly obsolete, do nothing. |
| 181 | + |
| 182 | +The AI agent should **never decide (b) unilaterally** — present the |
| 183 | +options to the human and get direction. The ingest PR itself does |
| 184 | +not relocate examples. |
| 185 | + |
| 186 | +### 6. Is the audit's recommended mode `squash`? |
| 187 | + |
| 188 | +Under the v3 whitelist, `squash` is rare. If the audit recommends |
| 189 | +it, double-check the audit output. Likely causes: |
| 190 | + |
| 191 | +- Whitelisted paths legitimately contain a big file (e.g., a |
| 192 | + `test/Baseline/huge_file.nrrd` without a content-link — escalate |
| 193 | + per decision 3). |
| 194 | +- Upstream has hundreds of automated commits in whitelisted paths |
| 195 | + (e.g., a dependabot loop that churned `test/CMakeLists.txt`). |
| 196 | + Escalate to human for a threshold discussion. |
| 197 | + |
| 198 | +Do not squash silently; always get human confirmation first. |
| 199 | + |
| 200 | +### 7. After the ITK ingest PR merges — upstream archival PR |
| 201 | + |
| 202 | +**The ingestion workflow is not complete when the ITK ingest PR |
| 203 | +merges.** The final step, which must happen on the ORIGINAL |
| 204 | +upstream repo, enforces the one-definition rule: any file that |
| 205 | +now lives at `Modules/<Group>/<Module>/` in ITK should not also |
| 206 | +live at the upstream's tree tip. |
| 207 | + |
| 208 | +Open a PR on the upstream repo that: |
| 209 | + |
| 210 | +1. **Deletes every whitelisted file** from the upstream tree tip |
| 211 | + (i.e., `include/`, `src/`, `test/`, `wrapping/`, `CMakeLists.txt`, |
| 212 | + `itk-module.cmake` — the same set the ingest transferred). |
| 213 | +2. **Adds a `MIGRATION_README.md`** at the upstream repo root that |
| 214 | + directs future readers to the new in-tree location. Template: |
| 215 | + |
| 216 | + ```markdown |
| 217 | + # Migrated to ITK main |
| 218 | + |
| 219 | + As of <date>, the `<Module>` module has been ingested into |
| 220 | + the main ITK source tree. The authoritative location is now: |
| 221 | + |
| 222 | + https://github.com/InsightSoftwareConsortium/ITK/tree/main/Modules/<Group>/<Module> |
| 223 | + |
| 224 | + See `Modules/<Group>/<Module>/README.md` in the ITK tree for |
| 225 | + details on what moved and what remains in this archived repo. |
| 226 | + |
| 227 | + This repository is retained read-only for historical reference |
| 228 | + (deep git history, paper material, example assets not migrated |
| 229 | + to ITK). It will be marked ARCHIVED after this PR merges. |
| 230 | + |
| 231 | + Related: |
| 232 | + - ITK ingest PR: InsightSoftwareConsortium/ITK#<NNNN> |
| 233 | + - Consolidation issue: InsightSoftwareConsortium/ITK#6060 |
| 234 | + ``` |
| 235 | +3. **Explicitly states the post-merge intent to archive** the |
| 236 | + repository via GitHub's repository-settings → Danger Zone → |
| 237 | + "Archive this repository". |
| 238 | + |
| 239 | +When the upstream maintainer merges that PR and archives the repo, |
| 240 | +the ingestion is complete: deep history remains reachable at a |
| 241 | +read-only URL, ITK carries the whitelisted authoritative copy, and |
| 242 | +users who clone either side see an unambiguous pointer to the |
| 243 | +other. |
| 244 | + |
| 245 | +AI agents **must prompt the human to open this upstream PR** as |
| 246 | +the final step of the workflow. The agent cannot open the PR |
| 247 | +itself (different repo, different permissions in most cases) but |
| 248 | +should draft the `MIGRATION_README.md` text and the removal diff |
| 249 | +for the human to push. |
| 250 | + |
| 251 | +## Post-ingest validation |
| 252 | + |
| 253 | +After the merge commit lands, run: |
| 254 | + |
| 255 | +```bash |
| 256 | +# 1. `git blame` walks across the merge boundary to upstream authors. |
| 257 | +git blame Modules/<Group>/<Module>/include/<first-header>.h | head |
| 258 | + |
| 259 | +# 2. Author set is preserved. |
| 260 | +git log --format='%an <%ae>' Modules/<Group>/<Module>/ | sort -u |
| 261 | + |
| 262 | +# 3. No upstream-only paths leaked in. |
| 263 | +find Modules/<Group>/<Module> -name '.github' -o -name 'pyproject.toml' \ |
| 264 | + -o -name 'paper' -o -name 'Old' -o -name 'CTestConfig.cmake' |
| 265 | +# Expect: no output. |
| 266 | + |
| 267 | +# 4. CI builds the module with the configure-ci opt-in. |
| 268 | +pixi run -e cxx configure-ci # must succeed |
| 269 | +pixi run -e cxx build # must succeed |
| 270 | +``` |
| 271 | + |
| 272 | +## What NOT to do |
| 273 | + |
| 274 | +- **Don't `git filter-repo --to-subdirectory-filter` without the |
| 275 | + whitelist first.** That reintroduces the v1 bloat. |
| 276 | +- **Don't run multiple ingests in one PR.** One module per PR. |
| 277 | +- **Don't edit the merge commit message after the fact to add |
| 278 | + authorship.** `Co-authored-by:` trailers are generated from |
| 279 | + upstream git log by the script; if an author is missing, they're |
| 280 | + missing from upstream too. |
| 281 | +- **Don't force-push an ingest PR.** Once the merge commit is in, |
| 282 | + amend via fixup commits or add follow-on commits instead. Rewrites |
| 283 | + break the `git blame` walk for anyone who has the old SHA cached. |
| 284 | +- **Don't silently widen the whitelist.** If you need to admit an |
| 285 | + extra path, document it in the merge commit body with a reason. |
| 286 | +- **Don't push the PR with `.md5` / `.shaNNN` content-links still |
| 287 | + in the tree.** CID conversion is mandatory pre-push. If you |
| 288 | + can't complete it because of network restrictions, stop and hand |
| 289 | + back to the human. |
| 290 | +- **Don't skip the local build + test gate.** The ingested module |
| 291 | + must configure, build, and run its tests locally — with the |
| 292 | + new `.cid` stubs resolving to actual content — before the PR is |
| 293 | + pushed. A green CI that can't resolve test data is a trap for |
| 294 | + the next reviewer, not a pass. |
| 295 | +- **Don't skip the upstream-archival PR.** The workflow is |
| 296 | + not complete when ITK's ingest PR merges; the upstream repo |
| 297 | + needs its own follow-up PR that deletes the migrated files and |
| 298 | + adds `MIGRATION_README.md` before being archived (decision 7). |
| 299 | + |
| 300 | +## Escalation triggers |
| 301 | + |
| 302 | +Hand back to the human immediately if: |
| 303 | + |
| 304 | +- The upstream license is not Apache-2.0. |
| 305 | +- `test/` contains raw binary assets (not behind content-links). |
| 306 | +- The audit recommends `squash` mode. |
| 307 | +- `git blame` fails to walk across the merge boundary after the |
| 308 | + merge commit (indicates a filter-repo misconfiguration). |
| 309 | +- Any `.pdf`, `.mp4`, or `> 1 MiB` image survives the whitelist. |
| 310 | +- The inferred upstream URL doesn't match what the human expects. |
| 311 | +- Network access is unavailable and `.md5` / `.shaNNN` content-links |
| 312 | + need CID conversion. The agent cannot push an ingest PR with |
| 313 | + non-`.cid` stubs present — stop and hand back. |
| 314 | +- Local build or test fails after the module is in-tree with the |
| 315 | + new `.cid` stubs. A broken ingest must not be pushed; diagnose |
| 316 | + (likely a `.cid` that never resolved, a missing dependency in |
| 317 | + `itk-module.cmake`, or a whitelist widening that's needed) before |
| 318 | + retrying. |
| 319 | +- The human has not been shown the draft upstream-archival PR |
| 320 | + content (`MIGRATION_README.md` + removal diff) — decision 7 |
| 321 | + requires explicit human action on a different repo, so the agent |
| 322 | + must surface the draft rather than silently complete. |
| 323 | + |
| 324 | +## References |
| 325 | + |
| 326 | +- `INGESTION_STRATEGY.md` in this directory — policy document. |
| 327 | +- `AUDIT_DESIGN.md` in this directory — what the audit reports. |
| 328 | +- `CLEANUP_CHECKLIST.md` in this directory — what to strip. |
| 329 | +- `Documentation/docs/contributing/upload_binary_data.md` — the |
| 330 | + `@web3-storage/w3cli` workflow for CID normalization. |
| 331 | +- ITK commit [`f3899ce8c6`](https://github.com/InsightSoftwareConsortium/ITK/commit/f3899ce8c6) |
| 332 | + — `.md5`/`.sha512` → `.cid` migration precedent. |
0 commit comments