feat(addons/agenthands): hands + orb that gesture and point as the agent speaks by salmanmkc · Pull Request #416 · google/xrblocks

salmanmkc · 2026-06-26T13:37:02Z

adds a free-standing pair of agent hands and a glowing orb that gesture while the agent talks and physically point at real things in the room. inspired by the AgentHands paper (CHI 2026, https://www.duruofei.com/papers/Liu_AgentHands-GeneratingInteractiveHandsGesturesForSpatiallyGroundedAgentConversationsInXR_CHI2026.pdf).

Demo (youtube link: https://youtu.be/JgDE5crT3MQ)

agent.hands.demo.google.xrblocks.mov

the paper's nice insight is that an agent feels a lot more present when it can gesture and actually point at things in your space, not just talk at you. It lays out a clean way to drive that from an llm (inline gesture markup), grounds the gestures to real objects, and keeps the embodiment deliberately minimal: a calm orb as the locus of attention plus translucent hands, no face, so it doesn't tip into uncanny.

Without a key the demo plays a short scripted monologue so you can see the gestures. add ?key=... and it's the full loop: you talk (speech recognizer), gemini replies with inline gesture markup, the reply is spoken (TTS) and the hands gesture in sync with the actual spoken words, pointing at real detected objects when it refers to them. runs on desktop through the simulator, and in headset via a small spatial control panel (talk / scan).

new addon src/addons/agenthands:

AgentHand: one posable hand rig. loads the webxr generic-hand glb, poses the bones toward a SimulatorHandPose, and can aim its index finger at a world point (pivoting from the wrist so close-range pointing stays accurate). a layered motion offset lets gestures move the hand on top of the pose and aim.
AgentHands: the left + right pair. dispatches gestures and points, picks the pointing hand by which side the target is on, and has beat / wave / iconic size / count motions.
AgentGestures: parses inline markup from the reply, [gesture:thumbs_up], [point:the lamp], [wave], [beat], [size:big], [count:2], into poses, motions and point targets, anchored to the spoken text so each fires on the right word.
AgentHead: the agent's presence, a semi-transparent blue orb with a drifting particle shell. breathes while idle, pulses while speaking, gazes at whatever it's pointing at. matches the paper's embodiment (orb as the locus of attention, no face), and the hands are the same translucent blue.
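
To make the markup idea concrete, here is a minimal sketch of pulling inline tags out of a reply while recording where each one falls in the cleaned speech. It is not the addon's parseAgentGestures() implementation: the function name `parseMarkupSketch`, the regex, and the returned field names (`type`, `arg`, `charIndex`) are assumptions made for illustration.

```javascript
// Hypothetical sketch, not the addon's parser.
// Matches [name] or [name:argument], e.g. [wave] or [point:the lamp].
const TAG_PATTERN = /\[([a-z_]+)(?::([^\]]*))?\]/g;

function parseMarkupSketch(reply) {
  const gestures = [];
  let speech = '';
  let cursor = 0;
  for (const match of reply.matchAll(TAG_PATTERN)) {
    // Copy the plain text that precedes the tag into the spoken output.
    speech += reply.slice(cursor, match.index);
    gestures.push({
      type: match[1],
      arg: match[2] ?? null,
      // Offset into the cleaned speech where this gesture should fire.
      charIndex: speech.length,
    });
    cursor = match.index + match[0].length;
    // Skip one following space when the tag sat at the start or after a
    // space, so removing the tag does not leave a double space behind.
    if ((speech === '' || speech.endsWith(' ')) && reply[cursor] === ' ') {
      cursor += 1;
    }
  }
  speech += reply.slice(cursor);
  return {speech, gestures};
}
```

Because each `charIndex` is measured against the cleaned speech rather than the raw reply, it can be compared directly with the character index a speech synthesizer reports at a word boundary.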

one SDK change: SpeechSynthesizer.onBoundaryCallback, fired on each word boundary with the character index into the spoken text, so callers can sync visuals (here, gestures) to the actual spoken words. optional, off by default.

pointing is grounded with a lightweight raycast against the depth mesh, so the demo only needs gemini and no extra detection deps. one thing that ended up different from the paper: there, you register objects one at a time in a dedicated mode (look at a thing, say "register this", and it builds a rich oriented box with face and region labels that lands in a registry). Here the agent detects the whole room in a single pass and keeps re-grounding in the background as you move, so the grounding isn't a one-time observation, it stays current as you walk around without any manual step. the tradeoff is that it's coarser: a point per object rather than the paper's region-level boxes. the richer 3D object-detection addon that grew alongside this (objects3d) is going up as its own PR.
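
As a rough illustration of what grounding a detection with a raycast involves, the sketch below intersects a ray with a list of triangles and keeps the nearest hit. It is dependency-free on purpose and is not the addon's code; in a three.js app the same job would more likely go through the library's raycaster against the depth mesh. The names `rayTriangle` and `groundRaySketch`, and the plain `[x, y, z]` array representation, are assumptions.

```javascript
// Vector helpers on plain [x, y, z] arrays, to keep the sketch dependency-free.
const sub = (a, b) => [a[0] - b[0], a[1] - b[1], a[2] - b[2]];
const dot = (a, b) => a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
const cross = (a, b) => [
  a[1] * b[2] - a[2] * b[1],
  a[2] * b[0] - a[0] * b[2],
  a[0] * b[1] - a[1] * b[0],
];

// Moller-Trumbore ray/triangle test.
// Returns the hit distance along the ray, or null when there is no hit.
function rayTriangle(origin, dir, [v0, v1, v2]) {
  const e1 = sub(v1, v0);
  const e2 = sub(v2, v0);
  const p = cross(dir, e2);
  const det = dot(e1, p);
  if (Math.abs(det) < 1e-9) return null; // Ray is parallel to the triangle.
  const t0 = sub(origin, v0);
  const u = dot(t0, p) / det;
  if (u < 0 || u > 1) return null;
  const q = cross(t0, e1);
  const v = dot(dir, q) / det;
  if (v < 0 || u + v > 1) return null;
  const t = dot(e2, q) / det;
  return t > 1e-9 ? t : null; // Ignore hits behind the ray origin.
}

// Nearest hit point of the ray on any triangle, or null when nothing is hit.
// A caller could fall back to an estimated position in the null case.
function groundRaySketch(origin, dir, triangles) {
  let best = null;
  for (const tri of triangles) {
    const t = rayTriangle(origin, dir, tri);
    if (t !== null && (best === null || t < best)) best = t;
  }
  if (best === null) return null;
  return [
    origin[0] + dir[0] * best,
    origin[1] + dir[1] * best,
    origin[2] + dir[2] * best,
  ];
}
```

Returning a single point per ray matches the "a point per object" tradeoff described above: it is cheap, but it carries no extent or orientation for the object.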

Why bring it into xrblocks: the paper's system is a unity study app on a galaxy xr headset. this ports the idea to the open web on three.js / webxr, packaged as a reusable addon so it's something you can drop into an app rather than a one-off, and it runs in the desktop simulator so you can try the whole loop without a headset. the gesture taxonomy, the markup-driven control, and the minimal orb-plus-hands embodiment all follow the paper; the main new plumbing is syncing gestures to the spoken words through the small SpeechSynthesizer hook above.

Worth being upfront about where this falls short of the paper: the per-hand gesture state machine is simpler, there are fewer gesture types, and timing comes from tts word boundaries rather than the paper's per-word energy model. pointing leans on depth-mesh quality and 2D detection instead of a full scene mesh, and there's no user-gaze input yet (the orb gazes, but we don't read where you look). most of the tuning happened in the simulator, so the in-headset path is wired but less exercised. plenty of room to grow it (more gestures, gaze, better grounding), which is part of why it's an addon.

colocated vitest specs for the addon cover the hand rig, the pair, the gesture parser and the orb. lint, prettier and build are clean.

Load the WebXR generic-hand glb as a free-standing pair of hands (not the user's tracked input), pose it with the simulator pose library, and cycle through gestures. Proves the standalone rig + pose animation before building the AgentHands feature.

A free-standing pair of agent hands raised in front of the user that gesture as a scripted line is 'spoken': each line's [gesture:...] markup is parsed and played in sequence, then the hands relax. Runs without a key; the same pipeline is driven by Gemini Live next.

Moves the world understanding, gesture parsing/animation and TTS-synced playback out of main.js into the agenthands addon. The demo is now scene glue (lighting, head-anchored rig, pointer viz, spatial panel, mic) that drives AgentWorld, buildGestureSteps, AgentSpeechConductor and AgentGestureAnimator. No behavior change; main.js drops from 838 to 626 lines.

…ance Word-boundary sync fired a step early while the timed queue still fired it again, which double-played animated motions (wave/beat/size/count). Route both paths through a per-utterance fired set. Also clear the boundary callback if speak() throws synchronously, not only via the promise, so a failure never leaves a stale handler installed.
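
A minimal sketch of the single-fire idea, assuming a step list where each step carries both a scheduled time and a character index: the timed path and the word-boundary path both go through one `fire` method guarded by a per-utterance set. The class name `SingleFirePlayerSketch` and the step shape `{id, time, charIndex}` are hypothetical and are not the addon's AgentSpeechConductor API.

```javascript
// Hypothetical sketch of single-fire playback for one utterance.
class SingleFirePlayerSketch {
  constructor(steps, onStep) {
    // steps: [{id, time, charIndex}], with time in seconds from speech start.
    this.steps = steps;
    this.onStep = onStep;
    // Ids that already played. Create a new player for each utterance.
    this.fired = new Set();
  }

  // Shared gate: whichever path reaches a step first plays it, once.
  fire(step) {
    if (this.fired.has(step.id)) return;
    this.fired.add(step.id);
    this.onStep(step);
  }

  // Timed path: call every frame with the elapsed time in seconds.
  tick(elapsedSeconds) {
    for (const step of this.steps) {
      if (step.time <= elapsedSeconds) this.fire(step);
    }
  }

  // Boundary path: call with the character index of the word being spoken.
  onBoundary(charIndex) {
    for (const step of this.steps) {
      if (step.charIndex <= charIndex) this.fire(step);
    }
  }
}
```

Keeping the guard in one place means neither path needs to know the other exists, which is what prevents an animated motion from restarting halfway through.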

salmanmkc · 2026-07-01T07:34:21Z

Hi salmanmkc, this is an awesome demo!

One high-level suggestion regarding main.js:

I read that AgentHands consists of four major modules: (1) World Understanding: Returns 3D object positions and saves them to local storage. (2) Clip-Embedded Response Parser: Parses the AI output into an executable dictionary. (3) TTS Timestamp Matcher: Tracks which word is currently being spoken. (4) Gesture Animator: Controls the AgentHands to change gestures and move.

I see you have all the necessary functions for these inside main.js, but I highly recommend refactoring these key modules out into src/addons/agenthands/ to keep the code modular and easier to maintain.

Also, it would be great to add a quick README that credits the AgentHands paper and clarifies what this demo can and can't do right now.

Thanks so much @qxziuan! I've done as you suggested, pulled all four out of main.js into src/addons/agenthands/.

world understanding -> AgentWorld: detection + depth-mesh grounding + the object cache, and it persists to localStorage like you mentioned.
clip-embedded response parser -> buildGestureSteps in AgentGestures: markup into a timed step list, with each point target resolved to a world position.
tts timestamp matcher -> AgentSpeechConductor: drives the timeline and syncs it to the synth's word boundaries.
gesture animator -> AgentGestureAnimator: poses / motions / pointing, and tracks which hand is pointing for the pointer viz and the orb's gaze.

each has colocated vitest specs. while wiring it up i also caught a latent double-play: the word-boundary sync and the timed queue could both fire the same motion, so a step now plays once per utterance.

Yeah that's a good idea. I tried to credit the paper in this PR description, but putting it in a readme is better, so I've added demos/agent_hands/README.md. It credits the paper and is upfront about what the demo does and doesn't do yet compared with it: the gesture range is a smaller flat set (no two-hand iconic, affective vfx, measurement, or high-five), timing comes off the tts word boundaries rather than a per-word energy model, grounding is one depth-mesh point per object instead of the gaze-registration flow with oriented boxes, there's no user-gaze input, and the hands stay anchored in front of you rather than the agent walking out to the target.

There's also a short src/addons/agenthands/README.md for the addon api, and an xb-agenthands skill.md file that can aid agents, based on the pattern @ruofeidu made before.

Names the head-anchor offsets, follow smoothing, idle bob/sway, lean clamps, and scripted pacing, and uses estimateSpeechDuration instead of the inline formula. Cosmetic values (colors, panel and pointer geometry, light intensities) stay inline. No behavior change.

salmanmkc added 30 commits June 26, 2026 21:15

agenthands: add AgentHand, a standalone posable hand rig

cf5e203

Loads the WebXR generic-hand glb as a free-standing hand (not the user's tracked input) and animates its bones toward a SimulatorHandPose using the simulator pose library. The bone-lerp step is a pure, tested helper.

agenthands: add AgentHands pair manager + gesture API

1064f63

Owns a left/right AgentHand, loads both, and animates them toward their current poses each frame. gesture(pose, hand?) sets one or both hands; rest() relaxes them.

agenthands: parse gesture markup from agent replies

b7239f6

Add a gesture->pose vocabulary and parseAgentGestures(), which strips [gesture:point] style markup from the agent's text and returns the cleaned speech plus the ordered gestures anchored to where they occur. Pure + tested.

demos: add talk button and genai importmap to agent_hands

77d7c06

demos: wire interactive gemini mode into agent_hands

169cfa2

demos: add spatial control panel to agent_hands for immersive parity

bf09af3

demos: add gemini key-entry overlay to agent_hands

fdef3eb

agenthands: parse optional point target from gesture markup

f42b53b

agenthands: aim a hand's index finger at a world position

c16e133

agenthands: add pointAt to the hands pair

a41e2c9

demos: ground agent_hands pointing in detected objects

5defd34

agenthands: reach the hand toward the object it points at

a5adc1e

demos: load 3D detection deps in agent_hands importmap

ea595c3

demos: point agent_hands at oriented 3D boxes via Object3DDetector

3d30df6

demos: scan the room off the conversation path so replies stay fast

5ebe410

demos: ground agent_hands pointing with a lightweight raycast (drop objects3d dependency)

fc2152c

demos: trim agent_hands importmap back to gemini only

a991ee0

demos: auto-rescan agent_hands in the background as the view moves

402e74d

demos: serialize agent_hands scanning and chat to stop detection JSON leaking into replies

f609c55

agenthands: measure pointing direction before restoring the live rotation so re-aiming is absolute

13b852c

agenthands: pick the pointing hand in the pair's local frame

50fe417

agenthands: anchor gesture index to the normalized speech text

4245679

demos: clear stale objects on empty scans and guard overlapping TTS

3f046e2

agenthands: expose the index fingertip world position

2c16c1f

sound: let SpeechSynthesizer report word boundaries via onBoundaryCallback

3b10211

demos: head-anchor the agent hands to the user with idle sway

147991f

demos: draw a pointer ray and target ring while the agent points

a3a48db

demos: lean the agent hands toward the object they point at

734a1d9

salmanmkc added 19 commits July 1, 2026 14:18

feat(addons/agenthands): add AgentSpeechConductor

aded8b2

Syncs gesture playback with spoken text: a timed queue drives the timeline, and word boundaries from the synthesizer fire matching steps early for tighter timing.

test(addons/agenthands): cover AgentSpeechConductor

80b076a

Timed firing then rest, early firing on word boundaries, playback with no synthesizer, bare timelines with onNext, and empty-queue ticks.

feat(addons/agenthands): add AgentWorld

93c19e4

World understanding: runs object detection, grounds each detection to a 3D point against the depth mesh, caches and optionally persists the results to local storage, and re-scans as the camera moves.

test(addons/agenthands): cover AgentWorld

b5f3a7c

Depth-mesh grounding and position fallback, label matching, pointFor, local-storage persistence and restore, and movement-triggered rescan.

feat(addons/agenthands): export AgentGestureAnimator from barrel

0431342

feat(addons/agenthands): export AgentSpeechConductor from barrel

b99c947

feat(addons/agenthands): export AgentWorld from barrel

4e7f93c

docs(agent_hands): add demo README

5c58864

Credits the AgentHands paper, describes the four modules the demo drives, the gesture markup, and what the demo can and cannot do today.

docs(addons/agenthands): add addon README

4e2e4e2

Overview of the modules and a quick-start wiring example.

test(addons/agenthands): cover single-fire and sync-throw cleanup

4eb25c1

A step fired on a word boundary is not replayed by the timed queue, and a synchronous speak() failure clears the boundary callback.

feat(addons/agenthands): track whether a scan completed this session

8b21040

Adds a scanned flag, set only when a detection pass finishes. Objects loaded from local storage do not set it, so callers can tell a persisted cache apart from a confirmed, current view of the room.
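
The distinction this flag draws can be sketched as follows. This is an illustrative stand-in rather than AgentWorld itself: the class name `WorldCacheSketch`, its method names, and the storage key are assumptions, and `storage` is any object with `getItem`/`setItem` in the style of the Web Storage API.

```javascript
// Hypothetical sketch of a cache that separates "restored" from "scanned".
class WorldCacheSketch {
  constructor(storage, key = 'agent-world-sketch') {
    this.storage = storage;
    this.key = key;
    this.objects = [];
    this.scanned = false; // No detection pass has finished this session.
  }

  // Load whatever was persisted earlier, possibly from a different room.
  restore() {
    const raw = this.storage.getItem(this.key);
    if (raw) this.objects = JSON.parse(raw);
    // Deliberately leaves `scanned` false: restored data is unconfirmed.
  }

  // Record the result of a finished detection pass and persist it.
  completeScan(objects) {
    this.objects = objects;
    this.storage.setItem(this.key, JSON.stringify(objects));
    this.scanned = true;
  }

  // Only objects confirmed by a scan this session are offered to the model.
  objectsForPrompt() {
    return this.scanned ? this.objects : [];
  }
}
```

A caller that builds the model prompt from `objectsForPrompt()` gets an empty list until the first scan completes, so a first reply cannot point at objects that only exist in a stale cache.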

test(addons/agenthands): cover the scanned flag

a258497

False on construction and after a persisted-cache load, true only after a scan completes.

fix(agent_hands): only offer objects from a scan completed this session

f68067d

A restored local-storage cache can be from a different room, so the demo no longer lists it to the model until a fresh scan confirms what is actually present; otherwise the first reply could point at stale objects.

docs(agent_hands): expand what the demo can and cannot do

a1a5394

Grounds the comparison in the AgentHands paper: gesture range, timing, object grounding, user gaze, and maturity, so the differences from the paper are explicit.

docs(agenthands): add xb-agenthands skill

2a8bc19

Task-oriented guide for giving an agent gesturing hands and an orb that points at real objects, wiring the four modules together.

docs(skills): register xb-agenthands in the skill registry

c565e59

docs(agent_hands): note the agent does not locomote to targets

72f4fa7

The paper's agent travels to the object it is discussing; this demo keeps the hands anchored in front of the user and points from there.

salmanmkc added 10 commits July 1, 2026 16:46

refactor(addons/agenthands): name the iconic-size widths

b3baf44

Replaces the inline 0.18 / 0.55 / 0.35 and the 0.1..0.8 clamp in sizeWidth with named constants, and uses THREE.MathUtils.clamp instead of a manual Math.max/min.

refactor(addons/agenthands): name speech-timing constants

1b1d788

Names the post-speech rest delay (0.8s) and adds an estimateSpeechDuration helper for the length-based duration estimate the demo was computing inline, so the magic 1.2 floor and 0.06 per-char rate live in one place.
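
One plausible shape for such a helper, given the 1.2 floor and 0.06 per-char rate mentioned above, is a length-proportional estimate clamped from below. The constant names, the `Sketch` suffix, and the exact formula are assumptions; the addon's estimateSpeechDuration may differ.

```javascript
// Hypothetical names; the values come from the commit message above.
const MIN_SPEECH_SECONDS = 1.2; // Floor for very short lines.
const SECONDS_PER_CHAR = 0.06; // Rough speaking rate per character.

// Estimates how long a line takes to speak, in seconds.
function estimateSpeechDurationSketch(text) {
  return Math.max(MIN_SPEECH_SECONDS, text.length * SECONDS_PER_CHAR);
}
```

An estimate like this is only needed when the synthesizer gives no timing signal of its own, for example when driving a scripted timeline.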

test(addons/agenthands): cover estimateSpeechDuration

2653c78

Scales with length and never drops below the floor.

refactor(addons/agenthands): name AgentWorld's default rescan thresholds

fe330f9

The 0.5 m / 0.6 rad / 5000 ms constructor defaults become named constants.
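
To show how three thresholds like these might combine, here is a hedged sketch of a movement-triggered rescan check. How AgentWorld actually interprets the values is not stated here, so the sketch assumes 0.5 m is a travel distance, 0.6 rad is a change in yaw, and 5000 ms is a minimum interval between scans. All names are hypothetical.

```javascript
// Hypothetical names; the values come from the commit message above.
const RESCAN_MOVE_METERS = 0.5;
const RESCAN_TURN_RADIANS = 0.6;
const RESCAN_MIN_INTERVAL_MS = 5000; // Assumed to be a minimum gap between scans.

// last/current: {position: [x, y, z], yaw: radians, timeMs: number}
function shouldRescanSketch(last, current) {
  // Never rescan more often than the minimum interval allows.
  if (current.timeMs - last.timeMs < RESCAN_MIN_INTERVAL_MS) return false;
  const moved = Math.hypot(
    current.position[0] - last.position[0],
    current.position[1] - last.position[1],
    current.position[2] - last.position[2],
  );
  // Wrap the yaw difference into [-pi, pi] so 3.1 and -3.1 count as close.
  const delta = current.yaw - last.yaw;
  const turned = Math.abs(Math.atan2(Math.sin(delta), Math.cos(delta)));
  return moved > RESCAN_MOVE_METERS || turned > RESCAN_TURN_RADIANS;
}
```

Using yaw alone is a simplification; a fuller version would compare full camera orientations.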

docs: list agenthands in the AGENTS.md addons overview

3920b8d

docs: list agenthands in the src/SKILL.md addons map

d71656b

docs: list lipsync in the AGENTS.md addons overview

3c14c70

docs: list lipsync in the src/SKILL.md addons map

45c82db

Merge branch 'main' into agenthands

b495bc0

salmanmkc wants to merge 116 commits into google:main from salmanmkc:agenthands

3 participants
