Commit 190a448
Add TraceEncoder module for agent trace → VL-JEPA embedding conversion
Implements the trace encoding pipeline described in AGENT_TRACE_TRAINING.md:
converts structured ATN agent traces (JSON) into embedding sequences suitable
for VL-JEPA next-turn prediction training.
New file: nodes/common/trace_encoder.py
Architecture:
- TraceEncoderConfig: configuration dataclass (embed_dim=384 matches VLJEPAConfig)
- _SequenceEncoder: shared byte-level transformer backbone with mean-pooling
- TextEncoder: encodes turn.content to (embed_dim,) via _SequenceEncoder
- ActionEncoder: serialises tool calls to text, encodes to (embed_dim,)
- ResultEncoder: serialises tool results to text, encodes to (embed_dim,)
- TurnFuser: self-attention over (text, action, result) modality slots → single vector
- OutcomeEncoder: structured (success, task_completed) + error text → (embed_dim,)
- TraceEncoder: orchestrates encode_trace() → {embeddings, turn_mask, outcome_embedding}
- TraceDataset: torch Dataset with quality filtering, deduplication, JSONL/directory loaders
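To make the data flow above concrete, here is a minimal, dependency-free sketch of the shape contract that encode_trace() is described as returning. It is not the code in nodes/common/trace_encoder.py. Several things are assumptions made only for illustration: the trace schema (keys `turns`, `content`, `tool_calls`, `tool_results`, `outcome`), the `max_turns` padding parameter, the hash-based `byte_embedding` stand-in for a learned table, and the use of plain averaging in place of the transformer backbone and the TurnFuser self-attention.

```python
import hashlib
import json

EMBED_DIM = 384  # from the commit message (embed_dim=384 matches VLJEPAConfig)


def byte_embedding(b: int) -> list[float]:
    """Hypothetical stand-in for a learned byte embedding table."""
    digest = hashlib.sha256(bytes([b])).digest()  # 32 deterministic bytes
    # Tile the digest up to EMBED_DIM values scaled into [-1, 1].
    return [digest[i % 32] / 127.5 - 1.0 for i in range(EMBED_DIM)]


def encode_text(text: str) -> list[float]:
    """Byte-level encode, then mean-pool over the sequence to one vector."""
    data = text.encode("utf-8")
    if not data:
        return [0.0] * EMBED_DIM  # empty slot maps to a zero vector
    vecs = [byte_embedding(b) for b in data]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def fuse_turn(turn: dict) -> list[float]:
    """Average the (text, action, result) slots.

    Assumption: averaging replaces the self-attention fusion named in the commit.
    """
    slots = [
        encode_text(turn.get("content", "")),
        encode_text(json.dumps(turn.get("tool_calls", []), sort_keys=True)),
        encode_text(json.dumps(turn.get("tool_results", []), sort_keys=True)),
    ]
    return [sum(col) / len(slots) for col in zip(*slots)]


def encode_trace(trace: dict, max_turns: int = 4) -> dict:
    """Return padded per-turn embeddings, a validity mask, and an outcome vector."""
    turns = trace.get("turns", [])[:max_turns]
    embeddings = [fuse_turn(t) for t in turns]
    turn_mask = [True] * len(embeddings)
    while len(embeddings) < max_turns:  # pad short traces with zero vectors
        embeddings.append([0.0] * EMBED_DIM)
        turn_mask.append(False)
    outcome = json.dumps(trace.get("outcome", {}), sort_keys=True)
    return {
        "embeddings": embeddings,
        "turn_mask": turn_mask,
        "outcome_embedding": encode_text(outcome),
    }
```

The point of the sketch is the output layout: one embed_dim vector per turn, a mask marking which positions hold real turns, and a single outcome vector per trace.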
Quality filtering:
- Skip traces with fewer than min_turns (default: 2)
- Skip errored sessions unless include_errored=True
- Optional embedding-based deduplication via cosine similarity threshold
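The three filtering rules above could be sketched as follows. This is an illustration under stated assumptions, not the TraceDataset implementation: the `turns` and `errored` keys, the `embed` callable, the `dedup_threshold` parameter name, and the greedy first-seen-wins deduplication order are all hypothetical. Only `min_turns` (default 2) and `include_errored` come from the commit message.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filter_traces(traces, embed=None, min_turns=2,
                  include_errored=False, dedup_threshold=None):
    """Apply the min-turns, errored-session, and optional dedup filters in order."""
    kept, kept_vecs = [], []
    for trace in traces:
        # Rule 1: skip traces with fewer than min_turns turns.
        if len(trace.get("turns", [])) < min_turns:
            continue
        # Rule 2: skip errored sessions unless explicitly included.
        if trace.get("errored", False) and not include_errored:
            continue
        # Rule 3 (optional): drop near-duplicates of already-kept traces.
        if embed is not None and dedup_threshold is not None:
            vec = embed(trace)
            if any(cosine(vec, v) >= dedup_threshold for v in kept_vecs):
                continue
            kept_vecs.append(vec)
        kept.append(trace)
    return kept
```

Comparing each candidate against every kept vector is quadratic in the number of traces, which is acceptable for a sketch but would likely need batching or an approximate index on a large corpus.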
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent f03dc20 commit 190a448
1 file changed
Lines changed: 885 additions & 0 deletions