A short experiment video goes in. A self-contained interactive report comes out — playable simulation, regime-by-regime formulations, solved problems with error budgets. Between the two ends sits a multi-agent orchestrator that reasons about the physics step by step, tool by tool, with every artifact written to disk and tailed live to this UI over SSE.

The pipeline's agents, in order:

| agent | model | role | writes |
|---|---|---|---|
| VLM Observer | `bytedance-seed/seed-2.0-lite` | Inventories what is visually in the scene — entities, materials, geometry, evidence frames. Does not name phenomena. | `draft/scene.json` |
| Temporal Segmenter | `gemma-4-31b-it` (or any vision LLM) | Partitions the video into regimes with structured `kind` (static / free_fall / oscillation / impact / …). Hypothesis only — verified downstream. | `reviewer/regimes.json` |
| Point Tracker | TAPNext++ (DeepMind, vendored) | Per-entity 2D centroid trajectories from the tracker venv. Streaming online inference for Apple Silicon. A cluster-vote centroid handles phantom seed points. | `reviewer/trajectories/regime-N-*/trajectory.json` |
| Reviewer | `openai/gpt-5.4` (xhigh reasoning) | Drives the orchestrator tool-use loop: validates the segmenter against tracker kinematics, builds the physical model, refines regime boundaries with detected contact times, dispatches downstream stages. | `reviewer/physical-model.json`, `reviewer/regimes.json` (revised) |
| Sim Builder | `anthropic/claude-opus-4` | Compiles the physical model to a System spec (bodies, forces, constraints, events). A verifier loop catches anti-patterns before the spec ships. | `reviewer/system.json` |
| Formulation Agent | `openai/gpt-5.4` | Writes a rigorous per-regime problem statement (21-point canonical structure with inline SVG schematics). | `reviewer/formulations/regime-N.md` |
| Solver Agent | `openai/gpt-5.4` | Solves each regime problem analytically or numerically; produces an error budget and sensitivity analysis. | `reviewer/solutions/regime-N.md` |
| Sim Diff | (deterministic) | Compares the integrated simulation trajectory against the tracker. Verdict matrix: PASS / SCALE / FRAME / DEGENERATE on the dominant-motion axis. | `state.sim_self_diff` |

The reviewer agent drives a tool-use loop. Each tool call is a deterministic action — read a file, run a tracker, fit contacts — that returns structured JSON the agent reasons over. Every call is logged with full input/output to `orchestrator/calls/call-NNN/`.
| tool | action |
|---|---|
| `observe_scene` | VLM scene inventory — entities, geometry, evidence frames. |
| `detect_entity` | Moondream3 detection at a specific frame, returning pixel bboxes. |
| `extract_frames` | Pull individual frames at given times for evidence. |
| `extract_segment` | Cut a video sub-clip via ffmpeg. |
| `vlm_on_clip` | Targeted VLM re-call on a short window with a specific question. |
| `segment_video_temporally` | VLM-driven temporal partition into kind-tagged regimes. |
| `track_entity` | Dispatch the TAPNext++ point tracker on a seeded bbox. |
| `compute_contacts` | Per-axis sign-flip detector (legacy, axis-aligned only). |
| `compute_contacts_v2` | Frame-invariant impulse-norm detector with closed-form normal + restitution + friction. |
| `compute_micro_bounces` | Sub-impact rebounds via relaxed velocity + displacement local-minima passes. |
| `compute_rest_onset` | First frame where speed stays below threshold for N consecutive samples. |
| `verify_segmenter_regimes` | Kind-by-kind kinematic check on each regime; surfaces hallucinated holds, missing impacts, oscillation/sliding mismatches. |
| `build_physical_model` | Architect agent — emits the typed PhysicalModel JSON (bodies, fields, links, contacts, events). |
| `build_simulation` | Sim-builder + verifier loop — compiles the model to a System spec. |
| `simulate_trace` | Numerically integrate the system spec forward in time. |
| `diff_sim_to_tracker` | Compare the sim trajectory to the tracker; verdict matrix on the dominant axis. |
| `build_formulation` | Per-regime 21-point formulation (Formulation Agent). |
| `solve_regime` | Closed-form / numerical solution + error budget (Solver Agent). |
| `finalize` | Write per-regime clip.mp4 trims, summarize the run, mark the manifest done. |
| `write_state` | Apply a JSON patch to the run-state store; the only way to mutate state. |
| `read_file` | Read any artifact under the run folder by relative path. |
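The call-logging contract above can be sketched as a minimal dispatcher. This is a hedged illustration, not the project's code: the function and parameter names here are hypothetical; only the on-disk layout (`calls/call-NNN-tool/` with `inputs.json` and `output.json`) comes from the page.

```python
import json
from pathlib import Path

# Hypothetical sketch of the logging contract described above:
# every tool call gets its own calls/call-NNN-tool/ folder holding
# inputs.json and output.json, so a run can be replayed from disk.

def dispatch(tools: dict, calls_dir: Path, call_no: int, name: str, args: dict) -> dict:
    call_dir = calls_dir / f"call-{call_no:03d}-{name}"
    call_dir.mkdir(parents=True, exist_ok=True)
    # Log the inputs before running, so even a crashing call leaves evidence.
    (call_dir / "inputs.json").write_text(json.dumps(args, indent=2))
    result = tools[name](**args)  # deterministic action returning structured JSON
    (call_dir / "output.json").write_text(json.dumps(result, indent=2))
    return result
```

Because every call is a pure function of its logged inputs, the `calls/` folder doubles as a replay log for debugging a run after the fact.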
Every regime declares one of eleven canonical kinds. The verifier runs the matching kinematic signature against the tracker data — kind/signature mismatch is the most reliable way to surface segmenter mistakes.
| kind | signature |
|---|---|
| static | peak ‖v‖ < threshold across the window |
| free_fall | quadratic y(t) fit: a_y > 0 ∧ R² ≥ 0.85 |
| projectile | parabolic in y; same fit as free_fall |
| oscillation | autocorrelation peak > 0.3 at a non-trivial lag |
| impact | ≥ 1 contacts_v2 event |
| bounce_sequence | ≥ 2 contacts_v2 events with flight between |
| sliding | speed > threshold ∧ no impulses ∧ no autocorrelation peak |
| rolling | continuous contact, kinematic rolling constraint |
| release | ballistic phase reached (peak ‖v‖ > 2× threshold) |
| mixed | — (skip; segmenter could not split) |
| unknown | — (skip; verifier opts out) |
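Two of these signatures are simple enough to sketch. A minimal version, assuming pixel coordinates with y increasing downward (which is why free fall has a_y > 0); the threshold value and function names are hypothetical, not the project's API:

```python
import numpy as np

SPEED_THRESHOLD = 2.0  # px per frame (hypothetical value)

def is_static(v: np.ndarray) -> bool:
    """static: peak ||v|| stays below the threshold across the window."""
    return float(np.max(np.linalg.norm(v, axis=1))) < SPEED_THRESHOLD

def is_free_fall(t: np.ndarray, y: np.ndarray, r2_min: float = 0.85) -> bool:
    """free_fall: quadratic y(t) fit with a_y > 0 and R^2 >= r2_min."""
    c2, c1, c0 = np.polyfit(t, y, 2)              # y ~ c2*t^2 + c1*t + c0
    resid = y - np.polyval([c2, c1, c0], t)
    r2 = 1.0 - float(np.sum(resid**2)) / float(np.sum((y - y.mean()) ** 2))
    return (2.0 * c2) > 0 and r2 >= r2_min        # a_y = 2 * c2
```

A track that drifts linearly, or curves upward, fails the free-fall check — exactly the kind/signature mismatch the verifier reports against the segmenter.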
The VLM segmenter is a hypothesis generator — its narrative is checked against the tracker's kinematic data before anything downstream depends on it. The two paths reconcile through the orchestrator's state store; disagreements surface as explicit warnings the reviewer can act on.
One folder per experiment. Runs accumulate as numbered subfolders; nothing is overwritten. The UI tails `events.jsonl` live, so reasoning progress is visible as it happens.
```
experiments/<slug>/
├─ video.mp4              # source, timestamp burned-in
├─ run-config.json        # per-experiment overrides
├─ manifest.json          # status + cost ledger
└─ runs/run-NN/
   ├─ events.jsonl        # SSE-tailed live to the UI
   ├─ frames/orch_fNNNNN_tT.TTT.jpg   # per-call frame extracts
   ├─ draft/scene.json    # VLM observer inventory
   └─ reviewer/
      ├─ regimes.json           # narrative + kind tags
      ├─ physical-model.json    # bodies, forces, contacts
      ├─ system.json            # sim spec
      ├─ trajectories/regime-N-slug/
      │  ├─ trajectory.json     # tracker output (per body)
      │  ├─ clip.mp4            # auto-trimmed regime clip
      │  ├─ derived.json        # § 2.6 analytics
      │  └─ fit.json            # § 2.7 fit
      ├─ formulations/regime-N.md   # 21-point problem statement
      ├─ solutions/regime-N.md      # solved problem + error budget
      └─ orchestrator/
         ├─ run_state.json      # canonical state store
         ├─ messages.jsonl      # full agent transcript
         └─ calls/call-NNN-tool/
            ├─ inputs.json
            └─ output.json
```

- `compute_contacts_v2` flags impulses where ‖a(t)‖ rises above a robust MAD-based threshold; it recovers contact normal, restitution, and friction at any contact angle. Two-pass smoothing handles tracker noise without erasing small rebounds. (see notes)
- `initial_position_m` is now first-class on every body, so dual-ball and Galilean-cannon experiments don't collapse to a single vertical line.
- The TAPNext++ tracker is vendored under `trackers/tapnext_vendor/`. (see plan)
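The MAD-threshold idea behind `compute_contacts_v2` can be sketched as follows. This is a simplified single-pass illustration with hypothetical names and constants; the real detector also recovers the contact normal, restitution, and friction, and smooths in two passes:

```python
import numpy as np

def impulse_frames(pos: np.ndarray, k: float = 6.0) -> np.ndarray:
    """Flag samples where ||a(t)|| is an outlier against a robust
    median + k*MAD threshold. pos: (T, 2) pixel trajectory."""
    v = np.diff(pos, axis=0)                # per-frame velocity
    a = np.diff(v, axis=0)                  # per-frame acceleration
    a_norm = np.linalg.norm(a, axis=1)
    med = np.median(a_norm)
    mad = np.median(np.abs(a_norm - med)) + 1e-9   # guard degenerate MAD = 0
    return np.flatnonzero(a_norm > med + k * mad)  # acceleration-sample indices

# Synthetic bounce: steady descent, then steady ascent after frame 9.
y = np.concatenate([np.arange(0.0, 50.0, 5.0), np.arange(45.0, -5.0, -5.0)])
pos = np.stack([np.zeros_like(y), y], axis=1)
hits = impulse_frames(pos)  # the velocity sign flip shows up as outlier samples
```

Because the threshold is a median plus scaled MAD rather than a mean plus standard deviation, a single large impact spike does not inflate the threshold and mask smaller rebounds — which is what makes the detector frame-invariant and robust to tracker noise.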