in-context-learning: human↔robot pairing + RICL on EgoVerse pi0.5 by RyanPCo · Pull Request #491 · GaTech-RL2/EgoVerse

RyanPCo · 2026-06-08T14:45:57Z

This branch adds the in-context learning line of work on top of main.

1. Human↔robot episode pairing (`ed870df5`, `f2adfcd3`)

Dense-language-annotation pairing between human (aria) and robot (eva) episodes for side-by-side eval: egomimic/scripts/human_robot_pairs.json (Tier-1 co-located alignment sets + Tier-2 similar-task object-overlap pairs), the matcher, and a findings doc for future agents.

2. RICL — retrieval-based in-context learning on pi0.5 (`5b6a442a`)

kNN-retrieved in-context demonstrations injected into the pi0.5 prefix, reusing the existing zarr / Cartesian / trainHydra / PI-algo / DINOv3 stack. For each query observation we retrieve the k nearest demos and add their (image, state, action) to the prefix so the policy can imitate a task it never trained on. Cross-embodiment: retrieval bank = aria (human), query = eva (robot).
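
The retrieval step described above can be sketched as follows. This is a minimal illustration, not the code in egomimic/ricl/: it uses a brute-force nearest-neighbor search from the standard library in place of the cKDTree the PR mentions, random vectors in place of DINOv3 embeddings, and a hypothetical helper name `build_topk_cache`.

```python
import heapq
import math
import random


def build_topk_cache(bank, queries, k):
    """Return {query_idx: [bank_idx, ...]} holding the k nearest bank rows per query.

    Brute-force Euclidean search; a KD-tree would return the same neighbors faster.
    """
    cache = {}
    for q_idx, q in enumerate(queries):
        cache[q_idx] = heapq.nsmallest(
            k, range(len(bank)), key=lambda b_idx: math.dist(q, bank[b_idx])
        )
    return cache


random.seed(0)
dim = 8  # stand-in for the embedding width (assumption)

# Bank: one embedding per human (aria) frame. Random here, purely for illustration.
bank = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(50)]

# Queries: noisy copies of bank rows 3 and 17, standing in for robot (eva) frames.
queries = [[x + random.gauss(0, 0.01) for x in bank[i]] for i in (3, 17)]

topk_cache = build_topk_cache(bank, queries, k=4)
```

Each cache entry is ordered nearest-first, so the first index for each query here is the bank row it was derived from. Precomputing the cache once per query frame keeps the nearest-neighbor search out of the training loop.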

Key point — no PI0Pytorch surgery. The flow pi0.5 embed_prefix iterates over all images in the observation and embeds the full prompt, with the entire prefix attended bidirectionally (and EgoVerse already feeds it a variable image set). So PIRicl only: (1) appends the k retrieved base_0_rgb frames as extra entries in the obs images dict, and (2) splices each retrieved demo's discretized (state, action) into the prompt text (same binning as the pi0.5 State block). eva (14-D) and aria (12-D, no gripper) already share one 32-D action space via the converters.
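
As a rough sketch of step (2), the snippet below discretizes values into 256 bins and splices them into a prompt string. The bin count is inferred from the review's mention of bins saturating at 0 or 255. The prompt layout, the `Demo i:` label, and the function names `discretize` and `splice_demos` are assumptions made for illustration, not the PR's actual format.

```python
def discretize(values, n_bins=256):
    """Map values assumed to lie in [-1, 1] to integer bins 0..n_bins-1 (clipping outliers)."""
    bins = []
    for v in values:
        v = max(-1.0, min(1.0, v))
        bins.append(min(n_bins - 1, int((v + 1.0) / 2.0 * n_bins)))
    return bins


def splice_demos(task, query_state, demos):
    """Build a prompt with each retrieved (state, action) pair placed before the query state."""
    parts = [f"Task: {task}"]
    for i, (state, action) in enumerate(demos):
        s = " ".join(map(str, discretize(state)))
        a = " ".join(map(str, discretize(action)))
        parts.append(f"Demo {i}: State: {s}; Action: {a}")
    parts.append("State: " + " ".join(map(str, discretize(query_state))))
    return ", ".join(parts)


# One retrieved demo with a 2-D state and a 2-D action (toy sizes).
prompt = splice_demos(
    task="pick up the cup",
    query_state=[0.0, 1.0],
    demos=[([-1.0, 0.0], [0.5, -0.5])],
)
```

Because retrieved and query values go through the same `discretize` call, their bins are directly comparable only if both were normalized into [-1, 1] the same way beforehand.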

Added

egomimic/ricl/ — retrieval (DINOv3 → cKDTree → per-query top-k cache), conditioning (the prefix surgery), data (collate + frame-idx wrapper + bank provider), metrics (+ CPU unit tests for each)
egomimic/algo/pi_ricl.py (PIRicl(PI), three small overrides) and egomimic/eval/pi_ricl_eval.py (PIRiclEval: retrieval vs zero-context floor)
RiclDataModuleWrapper in pl_utils/pl_data_utils.py; configs model/pi0.5_ricl, data/cotrain_pi_ricl, evaluator/eval_pi_ricl
egomimic/ricl/README.md — design notes + cluster runbook

Status. All logic is CPU-verified locally (23 checks: retrieval smoke, conditioning, data collate, metrics; both Hydra configs validated; ruff-clean; imports resolve in the emimic env). The finetune (from the openpi pi0.5 base), DINOv3 embedding at scale, and the D0 (eva→eva sanity) → D1 (aria→eva) eval are GPU-cluster steps documented in the README — they can't run on the dev box (no openpi/CUDA).

🤖 Generated with Claude Code

github-actions · 2026-06-08T14:47:45Z

Claude Code Review

Review of PR #491: RICL on EgoVerse pi0.5

Summary

Adds retrieval-based in-context learning to pi0.5 via a thin PIRicl(PI) subclass that splices k-NN-retrieved (image, state, action) demos into the prefix without touching PI0Pytorch. Clean separation, good test coverage at the unit level, but several cluster-side correctness questions remain.

Key concerns

1. `frame_idx` semantics may misalign with retrieval cache

RiclQueryDataset derives frame_idx from MultiDataset.index_map[idx] -> (dataset_name, local_idx). The docstring notes this is "best-effort" with a clamp fallback. However:

The retrieval cache is built per-frame from DINOv3 embeddings indexed 0..T-1 over the raw episode. If local_idx is the chunk start frame but the embedding cache indexes raw frames, this works. If MultiDataset uses any subsampling/filtering (it often does), local_idx and the embedding row index will silently drift, and retrieval will return neighbors for the wrong observation.
This needs explicit verification against ZarrDataset.__getitem__ and the embedding pipeline before training. Recommend a runtime assertion that the cache's per-episode T matches the dataset's episode length.
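
A minimal sketch of the recommended runtime assertion, assuming per-episode lengths can be read from both sides as plain dictionaries. The function name and the dictionary layout are hypothetical.

```python
def check_cache_alignment(cache_lengths, dataset_lengths):
    """Raise if any episode's cached embedding count differs from the dataset's frame count.

    Both arguments map an episode identifier to a frame count T.
    """
    mismatched = {
        ep: (cache_lengths[ep], dataset_lengths.get(ep))
        for ep in cache_lengths
        if cache_lengths[ep] != dataset_lengths.get(ep)
    }
    if mismatched:
        raise ValueError(f"retrieval cache / dataset length mismatch: {mismatched}")


# Passes silently when the lengths agree.
check_cache_alignment({"ep0": 120, "ep1": 95}, {"ep0": 120, "ep1": 95})
```

A length check does not prove the indices line up frame by frame, but it catches the subsampling and filtering cases described above, where the two sides disagree on T.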

2. Bank-side normalization is unverified

ZarrBankFrameProvider reads raw observations.state.ee_pose and actions_cartesian from the bank zarr and applies only converter.to32(action). The retrieved state is never passed through any normalization, but the comment in data.py claims it's "normalized to the query's 32-D convention." This is a bug:

The query's state in the prompt is normalized by norm_stats before discretization (in PI._discretize_state_for_sample). Retrieved state must go through the same normalization to live in [-1, 1] and yield comparable bins.
Without this, retrieved state bins will saturate at 0 or 255, making the in-context "state" tokens meaningless or actively misleading.

Same concern for actions: the converter's to32 does layout mapping, but does it also produce values in the [-1, 1] normalized range that pi0.5's action discretizer expects? If not, the same saturation issue applies.
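
The saturation effect can be illustrated with a toy example. The sketch assumes a quantile-style normalization (q01/q99 mapped to [-1, 1]) and 256 bins. Whether norm_stats actually uses quantiles or mean/std is not stated here, and the raw values and statistics below are made up.

```python
def normalize_quantile(values, q01, q99):
    """Map each value so that q01 -> -1 and q99 -> +1 (assumed normalization scheme)."""
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v, lo, hi in zip(values, q01, q99)]


def discretize(values, n_bins=256):
    """Clip to [-1, 1], then map to integer bins 0..n_bins-1."""
    bins = []
    for v in values:
        v = max(-1.0, min(1.0, v))
        bins.append(min(n_bins - 1, int((v + 1.0) / 2.0 * n_bins)))
    return bins


# Hypothetical raw bank state and per-dimension statistics.
raw_state = [350.0, -120.0, 0.8]
q01 = [0.0, -200.0, 0.0]
q99 = [500.0, 200.0, 1.0]

raw_bins = discretize(raw_state)  # skips normalization
norm_bins = discretize(normalize_quantile(raw_state, q01, q99))
```

Here the first two raw values clip to the extreme bins (255 and 0), while the normalized values land in interior bins. Dimensions whose raw range happens to fall inside [-1, 1] will not saturate, but their bins still will not be comparable to the query's.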

3. Embedding key inconsistency

DINOV3_ZARR_KEY = "observations.embeddings.dinov3.front_1" in retrieval.py
README runbook uses --output-keys observations.embeddings.dinov3.front_1 ✓
But ZarrBankFrameProvider reads images.front_1 (not observations.images.front_1). This images.front_1 key path is unusual for EgoVerse zarr — please confirm against the actual zarr layout (typically observations.images.{cam}).

4. Coordinate frame at retrieval time

The codebase convention is poses stored in SLAM world frame, re-expressed to head frame at training time. Retrieved state comes raw from zarr (observations.state.ee_pose) without any frame transform. If the query state has been transformed to the head frame by the training pipeline but the retrieved state is in world frame, bin comparison is meaningless. This needs the same transform_lists applied as the query.

5. Image normalization mismatch

build_ricl_collate scales uint8 → [0, 1], then augment_images_with_retrieved checks whether img.dtype is a non-integer type and, if so, maps [0, 1] → [-1, 1]. This works, but it is fragile because the normalization is split across two modules. The _to_minus1_1 path already in augment_images_with_retrieved handles uint8 directly, so consider passing uint8 through and centralizing the normalization in one place.

6. Token budget estimation under-counts

estimate_prompt_tokens assumes ~1.3 tokens per number, but PaliGemma's tokenizer often splits 3-digit numbers into 2-3 tokens. At k=4, action_steps=1, state_dim=32, action_dim=32: actual usage is likely closer to 600+ tokens. The bump to max_token_len=512 may silently truncate. Run a tokenizer measurement on a real prompt before training.
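
The arithmetic behind this concern can be worked through directly. The sketch counts only the numbers contributed by the retrieved demos (not the task text or the query's own state block), and the function name is hypothetical.

```python
def estimate_tokens(k, state_dim, action_dim, action_steps, tokens_per_number):
    """Return (count of spliced numbers, estimated tokens) for the retrieved demos only."""
    numbers = k * (state_dim + action_steps * action_dim)
    return numbers, numbers * tokens_per_number


# Settings quoted in the review: k=4, action_steps=1, state_dim=32, action_dim=32.
numbers, optimistic = estimate_tokens(4, 32, 32, 1, tokens_per_number=1.3)
_, low = estimate_tokens(4, 32, 32, 1, tokens_per_number=2)
_, high = estimate_tokens(4, 32, 32, 1, tokens_per_number=3)
```

With these settings the demos contribute 4 × (32 + 32) = 256 numbers. At 1.3 tokens per number that is about 333 tokens, but at 2 to 3 tokens per number it is 512 to 768, which already meets or exceeds max_token_len=512 before the task text and query state are added.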

7. Missing test: retrieval correctness end-to-end

All tests use mocks. There's no test that:

A built cache + provider round-trips a known (state, action) into the prompt correctly.
The actual PIRicl._build_prompts output is parseable / token-count is within budget.

The smoke test in retrieval.py --smoke exercises only the index, not the integration.

Suggestions

Add a --validate mode to the retrieval CLI that loads the cache + opens one bank zarr and prints (state, action, image) for a sampled query, so cluster users can sanity-check before training.
Normalize retrieved state/action through norm_stats in ZarrBankFrameProvider (or in the collate). Pass the algo's norm_stats into the provider for the appropriate embodiment.
Apply the same coordinate-frame transforms to retrieved state as the query loader. Reuse the existing transform_lists from the embodiment config.
Assert cache/episode length consistency at RiclDataModuleWrapper.__init__ or first batch: for each cached query hash, confirm cache.T == dataset_episode_length.
Image key: confirm images.front_1 vs observations.images.front_1 against actual zarr; align with image_zarr_key default.
Measure tokenizer length on a sample prompt and either raise max_token_len or document the truncation behavior. Convert the warning into a hard error if estimated > max.
Add frame_idx to the base ZarrDataset return dict rather than wrapping; the current best-effort approach is brittle and hides the index-map dependency.
AGENTS.md: miss

Reviewed by Claude · Review workflow

… annotations

Tooling + data for pairing human (aria) and robot (eva) pick_place episodes for side-by-side eval.

- inspect_episode_metadata.py: read-only audit of app.episodes (scene/objects/task_description coverage per embodiment) to choose a pairing strategy.
- pair_episodes_by_language.py: pull per-episode dense-language annotations from R2, parse + normalize the manipulated-object set, and match aria scenes to eva demos by object-set containment. Emits two tiers: the co-located "alignment data set" true pairs (from the DB) and language-matched similar-task pairs.
- human_robot_pairs.json: generated pairs (both tiers).

The DB scene/objects columns are unusable for matching (objects="{None}", scene has 2 values), so matching uses the annotations. Bulk aria/eva episodes share a ~50-object vocabulary but no identical scenes; only the alignment sets are truly co-located.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…re agents

Add egomimic/scripts/human_robot_pairing.md capturing the non-obvious data facts (R2 vs AWS-S3 access gotcha, junk scene/objects columns, the alignment co-located captures, annotation format/coverage, the eva/aria segment-count asymmetry that motivates containment matching, the object-name synonym map, and next steps) plus how to run the two scripts. Pointer added from top-level AGENTS.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
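
The object-set containment matching and synonym map mentioned in these commit messages could look roughly like the sketch below. The synonym entries, episode identifiers, and function names are invented for illustration. The direction of containment (eva objects contained in the aria scene's objects) is also an assumption, since the messages do not state it.

```python
# Hypothetical synonym map: variant name -> canonical name.
SYNONYMS = {"mug": "cup", "coke can": "can"}


def normalize_objects(names):
    """Lowercase, strip, and canonicalize object names into a set."""
    cleaned = (n.strip().lower() for n in names)
    return {SYNONYMS.get(n, n) for n in cleaned}


def match_by_containment(aria_scenes, eva_demos):
    """Pair an aria scene with an eva demo when the demo's objects all appear in the scene."""
    pairs = []
    for aria_id, aria_objs in aria_scenes.items():
        scene = normalize_objects(aria_objs)
        for eva_id, eva_objs in eva_demos.items():
            demo = normalize_objects(eva_objs)
            if demo and demo <= scene:  # non-empty subset test
                pairs.append((aria_id, eva_id))
    return pairs


aria_scenes = {"aria_001": ["Mug", "bowl", "spoon"], "aria_002": ["plate"]}
eva_demos = {"eva_010": ["cup", "bowl"], "eva_011": ["plate", "fork"]}
pairs = match_by_containment(aria_scenes, eva_demos)
```

Containment rather than exact set equality tolerates one side annotating more segments, and therefore more objects, than the other.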

kNN-retrieved in-context demos injected into the pi0.5 prefix, reusing the existing zarr/Cartesian/trainHydra/PI stack. No PI0Pytorch surgery: the flow prefix is image-count-agnostic and fully bidirectional, so PIRicl just appends k retrieved base_0_rgb images to the obs dict and splices discretized retrieved state/action into the prompt (same binning as the pi0.5 State block).

Cross-embodiment: bank=aria, query=eva, scoped by human_robot_pairs.json. eva (14-D) and aria (12-D, no gripper) share one 32-D action space via the converters.

Adds egomimic/ricl/ (retrieval DINOv3->cKDTree->top-k cache, conditioning, data collate, metrics + CPU tests), PIRicl algo, PIRiclEval (retrieval vs zero-context floor), RiclDataModuleWrapper, and pi0.5_ricl / cotrain_pi_ricl / eval_pi_ricl configs.

Logic CPU-verified (23 checks); finetune + DINOv3-at-scale + D0/D1 eval are GPU-cluster steps (see egomimic/ricl/README.md).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RICL's configs resolve against the pi0.5/pi-train setup. Bring in only the dependency-closure pieces needed to run on this cluster (PR #491 stays main-based; does not pull pi-train's pi.py/human.py code or mecka-only data configs):

- paths/default.yaml dataset_dir -> /storage/project/r-dxu345-0/shared/egoverseS3ZarrDatasets
- model/pi0.5_base.yaml pi05 checkpoint path + training recipe (inherited via pi0.5_bc_eva -> pi0.5_ricl)
- train_zarr_cartesian.yaml cluster training settings (gpus, sample_frac, num_workers)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@override

The @override runtime check (overrides 7.7.0) rejects the subclass method because the parent PI._build_prompts is annotated `-> list[str]` while the override had no return annotation (treated as None). This blocked PIRicl instantiation in any current env. The override already returns list[str]; just annotate it to match the parent.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
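
The annotation mismatch can be reproduced with the standard library alone. The sketch below is not how the overrides package implements its check: it only compares return annotations with `inspect.signature`, and the class bodies are stand-ins for the real PI and PIRicl code.

```python
import inspect


class PI:
    def _build_prompts(self, batch) -> list[str]:
        return [str(item) for item in batch]


class PIRiclUnannotated(PI):
    # No return annotation: a strict signature check sees this as different from the parent.
    def _build_prompts(self, batch):
        return ["[ctx] " + p for p in super()._build_prompts(batch)]


class PIRiclAnnotated(PI):
    # Same body, with the return annotation matching the parent.
    def _build_prompts(self, batch) -> list[str]:
        return ["[ctx] " + p for p in super()._build_prompts(batch)]


def return_annotation_matches(parent, child, name):
    """Compare the declared return annotations of parent.name and child.name."""
    parent_ret = inspect.signature(getattr(parent, name)).return_annotation
    child_ret = inspect.signature(getattr(child, name)).return_annotation
    return parent_ret == child_ret


before_fix = return_annotation_matches(PI, PIRiclUnannotated, "_build_prompts")
after_fix = return_annotation_matches(PI, PIRiclAnnotated, "_build_prompts")
```

Both subclasses behave identically at call time. Only the declared signature differs, which is why adding the annotation is enough to fix it.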

First end-to-end run of `model=pi0.5_ricl data=cotrain_pi_ricl` through trainHydra (bank=aria -> query=eva). Fixes the integration gaps that blocked it; fast_dev_run now completes clean (train step with retrieved bank frames + PIRiclEval). See egomimic/ricl/SMOKE_TEST_BRINGUP.md.

- ZarrBankFrameProvider: build a ZarrDataset (bank keymap + transform_list) per episode instead of reading post-transform keys off raw zarr. observations.state.ee_pose / actions_cartesian / base_0_rgb are produced at load time, never stored, so the old direct-read raised KeyError. Wire bank_keymap/bank_transform_list through RiclDataModuleWrapper; set them (Aria cartesian_pi keymap + cartesian transform) in cotrain_pi_ricl.yaml.
- RiclQueryDataset: delegate unknown attrs to the wrapped dataset (set_norm_stats_from).
- MultiDataset._iter_leaves: unwrap dataset wrappers (.base) so key/shape inference reaches the real leaves (else embodiment never registers -> ac_keys[emb] KeyError).
- trainHydra: accept RiclDataModuleWrapper in the datamodule assertion.
- cotrain_pi_ricl: re-point valid_datasets with the data.-prefixed interpolation (eva_pi/cotrain_pi_base use a root-absolute ${train_datasets...} that doesn't resolve once nested under `data`).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
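
The attribute delegation and wrapper unwrapping described in this commit follow a common Python pattern, sketched below. The classes are simplified stand-ins: `FakeZarrDataset`, the `unwrap` helper, and the way `frame_idx` is filled in are hypothetical and do not reproduce the PR's implementation.

```python
class FakeZarrDataset:
    """Stand-in leaf dataset exposing one method the wrapper does not define."""

    def __init__(self, n):
        self.n = n
        self.norm_stats = None

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        return {"state": [float(idx)]}

    def set_norm_stats_from(self, stats):
        self.norm_stats = stats


class RiclQueryDataset:
    """Wrapper that adds frame_idx and forwards unknown attributes to the wrapped dataset."""

    def __init__(self, base):
        self.base = base

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        item = dict(self.base[idx])
        item["frame_idx"] = idx  # simplification: the real index mapping is more involved
        return item

    def __getattr__(self, name):
        # Called only when normal lookup fails; guard "base" to avoid infinite recursion.
        if name == "base":
            raise AttributeError(name)
        return getattr(self.base, name)


def unwrap(dataset):
    """Follow .base links until reaching a dataset that wraps nothing."""
    while hasattr(dataset, "base"):
        dataset = dataset.base
    return dataset


leaf = FakeZarrDataset(3)
wrapped = RiclQueryDataset(leaf)
wrapped.set_norm_stats_from({"mean": 0.0})  # forwarded to the leaf
```

Delegating through `__getattr__` keeps the wrapper thin, at the cost of hiding which attributes the wrapper itself owns. That is why code inspecting the dataset tree needs an explicit unwrap step.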

RyanPCo changed the title ~~in-context-learning: human<->robot episode pairing via dense-language annotations~~ in-context-learning: human↔robot pairing + RICL on EgoVerse pi0.5 Jun 8, 2026

RyanPCo marked this pull request as ready for review June 8, 2026 14:46

RyanPCo marked this pull request as draft June 8, 2026 15:11

RyanPCo and others added 5 commits June 8, 2026 13:08

RyanPCo force-pushed the ryanco/in-context-learning branch 5 times, most recently from 830d3fc to d92ed0e Compare June 10, 2026 22:34

RyanPCo force-pushed the ryanco/in-context-learning branch from d92ed0e to 7e11df7 Compare June 11, 2026 19:53

RyanPCo wants to merge 6 commits into main from ryanco/in-context-learning
