Skip to content

Commit 92cd95b

Browse files
authored
Merge pull request InsightSoftwareConsortium#6098 from hjmjohnson/remote-module-ingest-tooling
ENH: Add Utilities/Maintenance/RemoteModuleIngest tooling
2 parents 7a3a506 + febe0a7 commit 92cd95b

9 files changed

Lines changed: 2474 additions & 0 deletions

File tree

Lines changed: 332 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,332 @@
1+
# AGENTS.md — Guidance for AI agents running remote-module ingestion
2+
3+
This file is written for AI coding agents (Claude Code, Cursor, GPT
4+
Codex, similar) that are asked to ingest a remote ITK module into
5+
the main source tree. It complements `README.md` (human-focused)
6+
and `INGESTION_STRATEGY.md` (policy-focused). Read this before
7+
running `ingest-remote-module.sh`.
8+
9+
## When to invoke this workflow
10+
11+
The user has asked you to:
12+
13+
- "Ingest `<Module>` into `Modules/<Group>/`"
14+
- "Move `<Module>` from `Modules/Remote/` into the main tree"
15+
- "Bring `<Module>` in-tree, preserving history"
16+
17+
If the user only says "build `<Module>`" or "enable `<Module>`",
18+
they want `-DModule_<Module>:BOOL=ON` — not an ingest. Don't start
19+
an ingest unless the destination is `Modules/<Group>/<Module>/`.
20+
21+
## Pre-flight (mandatory)
22+
23+
1. **Confirm `git-filter-repo` is available**:
24+
```
25+
command -v git-filter-repo
26+
```
27+
If missing: `pixi global install git-filter-repo`.
28+
29+
2. **Confirm the working tree is clean** on an ITK checkout:
30+
```
31+
git status --porcelain # must be empty
32+
git rev-parse --abbrev-ref HEAD # current branch
33+
```
34+
35+
3. **Confirm you are NOT on `main`**. Create a feature branch:
36+
```
37+
git checkout -b ingest-<Module> upstream/main
38+
```
39+
40+
4. **Confirm the upstream URL**. The script reads it from
41+
`Modules/Remote/<Module>.remote.cmake` by default, but sanity-
42+
check by eyeball:
43+
```
44+
grep GIT_REPOSITORY Modules/Remote/<Module>.remote.cmake
45+
```
46+
47+
5. **Always dry-run first** for any module you haven't ingested before:
48+
```
49+
Utilities/Maintenance/RemoteModuleIngest/ingest-remote-module.sh \
50+
<Module> <Group> --dry-run --keep-tempdir
51+
```
52+
Inspect the tempdir output before the real run. In particular:
53+
- `ls Modules/<Group>/<Module>/` — does the file set look right?
54+
- `grep -c '\.md5$' \| grep -c '\.cid$'` — content-link inventory?
55+
- `git log --oneline | wc -l` — surviving commit count sane?
56+
57+
## Decision points during an ingest
58+
59+
### 1. Does the upstream have a `LICENSE` that is NOT Apache-2.0?
60+
61+
Whitelist auto-excludes `LICENSE`. If the upstream's license is not
62+
Apache-2.0 (e.g., MIT, BSD-3-Clause, a dual license, or something
63+
unusual), **stop and escalate to the human**. ITK's root-level
64+
license applies in-tree; non-Apache modules need a per-module
65+
decision and possibly a `NOTICE` entry. Do not silently strip a
66+
non-Apache license.
67+
68+
### 2. Does the upstream have files outside the whitelist that
69+
look necessary?
70+
71+
The whitelist is `include/`, `src/`, `test/`, `wrapping/`,
72+
`CMakeLists.txt`, `itk-module.cmake`. If the upstream relies on
73+
e.g. `CMake/<Module>Targets.cmake.in` or a custom `<Module>Config.cmake`
74+
at root, and the module won't build without it, widen the whitelist
75+
for that invocation. Document the widening in the merge commit body:
76+
77+
#### Important: directory-level whitelisting is not enough
78+
79+
`--path test` admits **everything** under `test/` in every historical
80+
commit, including files with scaffolding-ish names that don't belong
81+
in ITK. Example caught on PR #6093: upstream
82+
`ITKAnisotropicDiffusionLBR` had `test/azure-pipelines.yml` and
83+
`test/Docker/Dockerfile` in several historical commits. Those leaked
84+
through the directory-level whitelist and were only caught by a
85+
follow-up history-wide scan.
86+
87+
The driver handles this automatically via a **deny-pattern pass**
88+
(step 3 in the script) that strips scaffolding filenames anywhere in
89+
the filtered tree, followed by a **whitelist-verification scan**
90+
(step 3b) that aborts the ingest if any scaffolding pattern is still
91+
reachable in any commit. You don't normally need to do anything
92+
extra — but if the verification scan trips on a NEW pattern the
93+
driver doesn't know about yet, add it to the `--path-glob` list in
94+
step 3 and re-run. Do not push an ingest where the verification
95+
scan triggered a warning.
96+
97+
```
98+
Whitelist passes (git filter-repo):
99+
- --path include --path src --path test --path wrapping
100+
- --path CMakeLists.txt --path itk-module.cmake
101+
- --path CMake/<Module>Config.cmake.in # needed to resolve the
102+
# find_package() chain
103+
```
104+
105+
### 3. Does the `test/` tree reference raw binary files (not
106+
content-links)?
107+
108+
If `test/` contains `.png`, `.nrrd`, `.mha`, `.vtk`, etc., directly
109+
(without a sibling `.md5` / `.cid`), those are raw binaries that
110+
should either (a) be moved to the ExternalData fetch path (uploaded
111+
to web3.storage, replaced with a `.cid` stub) or (b) be stripped
112+
entirely.
113+
114+
**Do not merge an ingest that ships raw binary test assets.**
115+
Either upload them via `w3 up` per
116+
`Documentation/docs/contributing/upload_binary_data.md` and commit
117+
the resulting `.cid` stubs, or escalate to the human.
118+
119+
### 4. Does the ingest produce any `.md5` or `.shaNNN`
120+
content-links?
121+
122+
**CID conversion is mandatory. It MUST complete before the PR is
123+
pushed. It is not optional, and it cannot be deferred to a
124+
follow-up PR.** The ingestion is incomplete until every
125+
content-link is `.cid`.
126+
127+
Timing: the conversion may happen either immediately after the
128+
merge commit (preferred: same session, same branch) or as a
129+
fixup folded into the merge commit. Either is fine. What is
130+
**not** fine is pushing the PR with `.md5` or `.shaNNN` stubs still
131+
in the tree.
132+
133+
When the `ingest-remote-module.sh` driver finishes and
134+
non-`.cid` content-links remain, it exits with code `2`
135+
("action required") and prints the list of files. Don't treat that
136+
as a warning; treat it as a blocking pre-push gate.
137+
138+
The conversion workflow:
139+
140+
1. **If the agent has network access to web3.storage + the data
141+
mirrors:** run `Utilities/Maintenance/RemoteModuleIngest/cid-normalize.sh`
142+
on the ingested module path. It reads each `.md5` / `.shaNNN`
143+
file, resolves the content via the ExternalData mirror, uploads
144+
the bytes via `w3 up` if not already pinned, writes the resulting
145+
`.cid` stub, and deletes the old hash file. Commit as a
146+
single `STYLE: Convert content-links to .cid` commit.
147+
148+
2. **If the agent lacks network access** (common in sandboxed
149+
environments): do not push. Instead, walk the human through the
150+
manual flow:
151+
- Install `@web3-storage/w3cli` per
152+
`Documentation/docs/contributing/upload_binary_data.md`.
153+
- Run `w3 login <email>` and `w3 up <file>` for each content
154+
referenced by the `.md5` / `.shaNNN` stubs.
155+
- Write each returned CID to the sibling `.cid` file; delete the
156+
old hash file.
157+
- Re-invoke the driver's post-merge verification to confirm all
158+
content-links are now `.cid`.
159+
160+
3. **Before pushing the PR**, the agent must run
161+
`verify-cid-access.sh` (in this directory) which fetches each
162+
`.cid` through an IPFS gateway to confirm the content is
163+
actually retrievable. A local build + test cycle must also
164+
succeed — the ingested module's test targets must fetch and
165+
pass using the new `.cid` stubs. No shortcuts here; a passing
166+
CI run on an in-tree module that can't resolve its test data
167+
is worse than a red CI check.
168+
169+
### 5. Does the upstream have an `examples/` directory?
170+
171+
The whitelist auto-excludes `examples/`. Per @dzenanz on
172+
PR #6093, per-module examples do NOT ingest inline. Three options
173+
for the human to pick from:
174+
175+
- **Archive only** (default): leave `examples/` in the archived
176+
upstream repo.
177+
- **Relocate**: open a separate follow-up PR that copies useful
178+
examples into `InsightSoftwareConsortium/ITK/Examples/<Module>/`,
179+
with their own CMake test registration.
180+
- **Ignore**: if the examples are clearly obsolete, do nothing.
181+
182+
The AI agent should **never decide (b) unilaterally** — present the
183+
options to the human and get direction. The ingest PR itself does
184+
not relocate examples.
185+
186+
### 6. Is the audit's recommended mode `squash`?
187+
188+
Under the v3 whitelist, `squash` is rare. If the audit recommends
189+
it, double-check the audit output. Likely causes:
190+
191+
- Whitelisted paths legitimately contain a big file (e.g., a
192+
`test/Baseline/huge_file.nrrd` without a content-link — escalate
193+
per decision 3).
194+
- Upstream has hundreds of automated commits in whitelisted paths
195+
(e.g., a dependabot loop that churned `test/CMakeLists.txt`).
196+
Escalate to human for a threshold discussion.
197+
198+
Do not squash silently; always get human confirmation first.
199+
200+
### 7. After the ITK ingest PR merges — upstream archival PR
201+
202+
**The ingestion workflow is not complete when the ITK ingest PR
203+
merges.** The final step, which must happen on the ORIGINAL
204+
upstream repo, enforces the one-definition rule: any file that
205+
now lives at `Modules/<Group>/<Module>/` in ITK should not also
206+
live at the upstream's tree tip.
207+
208+
Open a PR on the upstream repo that:
209+
210+
1. **Deletes every whitelisted file** from the upstream tree tip
211+
(i.e., `include/`, `src/`, `test/`, `wrapping/`, `CMakeLists.txt`,
212+
`itk-module.cmake` — the same set the ingest transferred).
213+
2. **Adds a `MIGRATION_README.md`** at the upstream repo root that
214+
directs future readers to the new in-tree location. Template:
215+
216+
```markdown
217+
# Migrated to ITK main
218+
219+
As of <date>, the `<Module>` module has been ingested into
220+
the main ITK source tree. The authoritative location is now:
221+
222+
https://github.com/InsightSoftwareConsortium/ITK/tree/main/Modules/<Group>/<Module>
223+
224+
See `Modules/<Group>/<Module>/README.md` in the ITK tree for
225+
details on what moved and what remains in this archived repo.
226+
227+
This repository is retained read-only for historical reference
228+
(deep git history, paper material, example assets not migrated
229+
to ITK). It will be marked ARCHIVED after this PR merges.
230+
231+
Related:
232+
- ITK ingest PR: InsightSoftwareConsortium/ITK#<NNNN>
233+
- Consolidation issue: InsightSoftwareConsortium/ITK#6060
234+
```
235+
3. **Explicitly states the post-merge intent to archive** the
236+
repository via GitHub's repository-settings → Danger Zone →
237+
"Archive this repository".
238+
239+
When the upstream maintainer merges that PR and archives the repo,
240+
the ingestion is complete: deep history remains reachable at a
241+
read-only URL, ITK carries the whitelisted authoritative copy, and
242+
users who clone either side see an unambiguous pointer to the
243+
other.
244+
245+
AI agents **must prompt the human to open this upstream PR** as
246+
the final step of the workflow. The agent cannot open the PR
247+
itself (different repo, different permissions in most cases) but
248+
should draft the `MIGRATION_README.md` text and the removal diff
249+
for the human to push.
250+
251+
## Post-ingest validation
252+
253+
After the merge commit lands, run:
254+
255+
```bash
256+
# 1. `git blame` walks across the merge boundary to upstream authors.
257+
git blame Modules/<Group>/<Module>/include/<first-header>.h | head
258+
259+
# 2. Author set is preserved.
260+
git log --format='%an <%ae>' Modules/<Group>/<Module>/ | sort -u
261+
262+
# 3. No upstream-only paths leaked in.
263+
find Modules/<Group>/<Module> -name '.github' -o -name 'pyproject.toml' \
264+
-o -name 'paper' -o -name 'Old' -o -name 'CTestConfig.cmake'
265+
# Expect: no output.
266+
267+
# 4. CI builds the module with the configure-ci opt-in.
268+
pixi run -e cxx configure-ci # must succeed
269+
pixi run -e cxx build # must succeed
270+
```
271+
272+
## What NOT to do
273+
274+
- **Don't `git filter-repo --to-subdirectory-filter` without the
275+
whitelist first.** That reintroduces the v1 bloat.
276+
- **Don't run multiple ingests in one PR.** One module per PR.
277+
- **Don't edit the merge commit message after the fact to add
278+
authorship.** `Co-authored-by:` trailers are generated from
279+
upstream git log by the script; if an author is missing, they're
280+
missing from upstream too.
281+
- **Don't force-push an ingest PR.** Once the merge commit is in,
282+
amend via fixup commits or add follow-on commits instead. Rewrites
283+
break the `git blame` walk for anyone who has the old SHA cached.
284+
- **Don't silently widen the whitelist.** If you need to admit an
285+
extra path, document it in the merge commit body with a reason.
286+
- **Don't push the PR with `.md5` / `.shaNNN` content-links still
287+
in the tree.** CID conversion is mandatory pre-push. If you
288+
can't complete it because of network restrictions, stop and hand
289+
back to the human.
290+
- **Don't skip the local build + test gate.** The ingested module
291+
must configure, build, and run its tests locally — with the
292+
new `.cid` stubs resolving to actual content — before the PR is
293+
pushed. A green CI that can't resolve test data is a trap for
294+
the next reviewer, not a pass.
295+
- **Don't skip the upstream-archival PR.** The workflow is
296+
not complete when ITK's ingest PR merges; the upstream repo
297+
needs its own follow-up PR that deletes the migrated files and
298+
adds `MIGRATION_README.md` before being archived (decision 7).
299+
300+
## Escalation triggers
301+
302+
Hand back to the human immediately if:
303+
304+
- The upstream license is not Apache-2.0.
305+
- `test/` contains raw binary assets (not behind content-links).
306+
- The audit recommends `squash` mode.
307+
- `git blame` fails to walk across the merge boundary after the
308+
merge commit (indicates a filter-repo misconfiguration).
309+
- Any `.pdf`, `.mp4`, or `> 1 MiB` image survives the whitelist.
310+
- The inferred upstream URL doesn't match what the human expects.
311+
- Network access is unavailable and `.md5` / `.shaNNN` content-links
312+
need CID conversion. The agent cannot push an ingest PR with
313+
non-`.cid` stubs present — stop and hand back.
314+
- Local build or test fails after the module is in-tree with the
315+
new `.cid` stubs. A broken ingest must not be pushed; diagnose
316+
(likely a `.cid` that never resolved, a missing dependency in
317+
`itk-module.cmake`, or a whitelist widening that's needed) before
318+
retrying.
319+
- The human has not been shown the draft upstream-archival PR
320+
content (`MIGRATION_README.md` + removal diff) — decision 7
321+
requires explicit human action on a different repo, so the agent
322+
must surface the draft rather than silently complete.
323+
324+
## References
325+
326+
- `INGESTION_STRATEGY.md` in this directory — policy document.
327+
- `AUDIT_DESIGN.md` in this directory — what the audit reports.
328+
- `CLEANUP_CHECKLIST.md` in this directory — what to strip.
329+
- `Documentation/docs/contributing/upload_binary_data.md` — the
330+
`@web3-storage/w3cli` workflow for CID normalization.
331+
- ITK commit [`f3899ce8c6`](https://github.com/InsightSoftwareConsortium/ITK/commit/f3899ce8c6)
332+
`.md5`/`.sha512``.cid` migration precedent.

0 commit comments

Comments
 (0)