A short experiment video goes in. A self-contained interactive report comes out — playable simulation, regime-by-regime formulations, solved problems with error budgets. Between the two ends sits a multi-agent orchestrator that reasons about the physics step by step, tool by tool, with every artifact written to disk and tailed live to this UI over SSE.

The pipeline's agents, in order:

| agent | model | role | writes |
|---|---|---|---|
| VLM Observer | `bytedance-seed/seed-2.0-lite` | Inventories what is visually in the scene — entities, materials, geometry, evidence frames. Does not name phenomena. | `draft/scene.json` |
| Temporal Segmenter | `gemma-4-31b-it` (or any vision LLM) | Partitions the video into regimes with structured `kind` (static / free_fall / oscillation / impact / …). Hypothesis only — verified downstream. | `reviewer/regimes.json` |
| Point Tracker | TAPNext++ (DeepMind, vendored) | Per-entity 2D centroid trajectories from the tracker venv. Streaming online inference for Apple Silicon. A cluster-vote centroid handles phantom seed points. | `reviewer/trajectories/regime-N-*/trajectory.json` |
| Reviewer | `openai/gpt-5.4` (xhigh reasoning) | Drives the orchestrator tool-use loop: validates the segmenter against tracker kinematics, builds the physical model, refines regime boundaries with detected contact times, dispatches downstream stages. | `reviewer/physical-model.json`, `reviewer/regimes.json` (revised) |
| Sim Builder | `anthropic/claude-opus-4` | Compiles the physical model to a System spec (bodies, forces, constraints, events). A verifier loop catches anti-patterns before the spec ships. | `reviewer/system.json` |
| Formulation Agent | `openai/gpt-5.4` | Writes a rigorous per-regime problem statement (21-point canonical structure with inline SVG schematics). | `reviewer/formulations/regime-N.md` |
| Solver Agent | `openai/gpt-5.4` | Solves each regime problem analytically or numerically; produces an error budget and sensitivity analysis. | `reviewer/solutions/regime-N.md` |
| Sim Diff | (deterministic) | Compares the integrated simulation trajectory against the tracker. Verdict matrix: PASS / SCALE / FRAME / DEGENERATE on the dominant-motion axis. | `state.sim_self_diff` |

The reviewer agent drives a tool-use loop. Each tool call is a deterministic action — read a file, run a tracker, fit contacts — that returns structured JSON the agent reasons over. Every call is logged with full input/output to `orchestrator/calls/call-NNN/`.
| tool | action |
|---|---|
| `observe_scene` | VLM scene inventory — entities, geometry, evidence frames. |
| `detect_entity` | Moondream3 detection at a specific frame, returning pixel bboxes. |
| `extract_frames` | Pull individual frames at given times for evidence. |
| `extract_segment` | Cut a video sub-clip via ffmpeg. |
| `vlm_on_clip` | Targeted VLM re-call on a short window with a specific question. |
| `segment_video_temporally` | VLM-driven temporal partition into kind-tagged regimes. |
| `track_entity` | Dispatch the TAPNext++ point tracker on a seeded bbox. |
| `compute_contacts` | Per-axis sign-flip detector (legacy, axis-aligned only). |
| `compute_contacts_v2` | Frame-invariant impulse-norm detector with closed-form normal + restitution + friction. |
| `compute_micro_bounces` | Sub-impact rebounds via relaxed velocity + displacement local-minima passes. |
| `compute_rest_onset` | First frame where speed stays below threshold for N consecutive samples. |
| `verify_segmenter_regimes` | Kind-by-kind kinematic check on each regime; surfaces hallucinated holds, missing impacts, oscillation/sliding mismatches. |
| `build_physical_model` | Architect agent — emits the typed PhysicalModel JSON (bodies, fields, links, contacts, events). |
| `build_simulation` | Sim-builder + verifier loop — compiles the model to a System spec. |
| `simulate_trace` | Numerically integrate the system spec forward in time. |
| `diff_sim_to_tracker` | Compare the sim trajectory to the tracker; verdict matrix on the dominant axis. |
| `build_formulation` | Per-regime 21-point formulation (Formulation Agent). |
| `solve_regime` | Closed-form / numerical solution + error budget (Solver Agent). |
| `finalize` | Write per-regime clip.mp4 trims, summarize the run, mark the manifest done. |
| `write_state` | Apply a JSON patch to the run-state store; the only way to mutate state. |
| `read_file` | Read any artifact under the run folder by relative path. |
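The call-logging contract above can be sketched as a minimal dispatcher. This is a hedged illustration, not the project's code: the function and parameter names here are hypothetical; only the on-disk layout (`calls/call-NNN-tool/` with `inputs.json` and `output.json`) comes from the page.

```python
import json
from pathlib import Path

# Hypothetical sketch of the logging contract described above:
# every tool call gets its own calls/call-NNN-tool/ folder holding
# inputs.json and output.json, so a run can be replayed from disk.

def dispatch(tools: dict, calls_dir: Path, call_no: int, name: str, args: dict) -> dict:
    call_dir = calls_dir / f"call-{call_no:03d}-{name}"
    call_dir.mkdir(parents=True, exist_ok=True)
    # Log the inputs before running, so even a crashing call leaves evidence.
    (call_dir / "inputs.json").write_text(json.dumps(args, indent=2))
    result = tools[name](**args)  # deterministic action returning structured JSON
    (call_dir / "output.json").write_text(json.dumps(result, indent=2))
    return result
```

Because every call is a pure function of its logged inputs, the `calls/` folder doubles as a replay log for debugging a run after the fact.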
Every regime declares one of eleven canonical kinds. The verifier runs the matching kinematic signature against the tracker data — kind/signature mismatch is the most reliable way to surface segmenter mistakes.
| kind | signature |
|---|---|
| static | peak ‖v‖ < threshold across the window |
| free_fall | quadratic y(t) fit: a_y > 0 ∧ R² ≥ 0.85 |
| projectile | parabolic in y; same fit as free_fall |
| oscillation | autocorrelation peak > 0.3 at a non-trivial lag |
| impact | ≥ 1 contacts_v2 event |
| bounce_sequence | ≥ 2 contacts_v2 events with flight between |
| sliding | speed > threshold ∧ no impulses ∧ no autocorrelation peak |
| rolling | continuous contact, kinematic rolling constraint |
| release | ballistic phase reached (peak ‖v‖ > 2× threshold) |
| mixed | — (skip; segmenter could not split) |
| unknown | — (skip; verifier opts out) |
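Two of these signatures are simple enough to sketch. A minimal version, assuming pixel coordinates with y increasing downward (which is why free fall has a_y > 0); the threshold value and function names are hypothetical, not the project's API:

```python
import numpy as np

SPEED_THRESHOLD = 2.0  # px per frame (hypothetical value)

def is_static(v: np.ndarray) -> bool:
    """static: peak ||v|| stays below the threshold across the window."""
    return float(np.max(np.linalg.norm(v, axis=1))) < SPEED_THRESHOLD

def is_free_fall(t: np.ndarray, y: np.ndarray, r2_min: float = 0.85) -> bool:
    """free_fall: quadratic y(t) fit with a_y > 0 and R^2 >= r2_min."""
    c2, c1, c0 = np.polyfit(t, y, 2)              # y ~ c2*t^2 + c1*t + c0
    resid = y - np.polyval([c2, c1, c0], t)
    r2 = 1.0 - float(np.sum(resid**2)) / float(np.sum((y - y.mean()) ** 2))
    return (2.0 * c2) > 0 and r2 >= r2_min        # a_y = 2 * c2
```

A track that drifts linearly, or curves upward, fails the free-fall check — exactly the kind/signature mismatch the verifier reports against the segmenter.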
The VLM segmenter is a hypothesis generator — its narrative is checked against the tracker's kinematic data before anything downstream depends on it. The two paths reconcile through the orchestrator's state store; disagreements surface as explicit warnings the reviewer can act on.
One folder per experiment. Runs accumulate as numbered subfolders; nothing is overwritten. The UI tails `events.jsonl` live, so reasoning progress is visible as it happens.
```
experiments/<slug>/
├─ video.mp4              # source, timestamp burned-in
├─ run-config.json        # per-experiment overrides
├─ manifest.json          # status + cost ledger
└─ runs/run-NN/
   ├─ events.jsonl        # SSE-tailed live to the UI
   ├─ frames/orch_fNNNNN_tT.TTT.jpg   # per-call frame extracts
   ├─ draft/scene.json    # VLM observer inventory
   └─ reviewer/
      ├─ regimes.json           # narrative + kind tags
      ├─ physical-model.json    # bodies, forces, contacts
      ├─ system.json            # sim spec
      ├─ trajectories/regime-N-slug/
      │  ├─ trajectory.json     # tracker output (per body)
      │  ├─ clip.mp4            # auto-trimmed regime clip
      │  ├─ derived.json        # § 2.6 analytics
      │  └─ fit.json            # § 2.7 fit
      ├─ formulations/regime-N.md   # 21-point problem statement
      ├─ solutions/regime-N.md      # solved problem + error budget
      └─ orchestrator/
         ├─ run_state.json      # canonical state store
         ├─ messages.jsonl      # full agent transcript
         └─ calls/call-NNN-tool/
            ├─ inputs.json
            └─ output.json
```

- `compute_contacts_v2` flags impulses where ‖a(t)‖ rises above a robust MAD-based threshold; it recovers contact normal, restitution, and friction at any contact angle. Two-pass smoothing handles tracker noise without erasing small rebounds. (see notes)
- `initial_position_m` is now first-class on every body, so dual-ball and Galilean-cannon experiments don't collapse to a single vertical line.
- The TAPNext++ tracker is vendored under `trackers/tapnext_vendor/`. (see plan)
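The MAD-threshold idea behind `compute_contacts_v2` can be sketched as follows. This is a simplified single-pass illustration with hypothetical names and constants; the real detector also recovers the contact normal, restitution, and friction, and smooths in two passes:

```python
import numpy as np

def impulse_frames(pos: np.ndarray, k: float = 6.0) -> np.ndarray:
    """Flag samples where ||a(t)|| is an outlier against a robust
    median + k*MAD threshold. pos: (T, 2) pixel trajectory."""
    v = np.diff(pos, axis=0)                # per-frame velocity
    a = np.diff(v, axis=0)                  # per-frame acceleration
    a_norm = np.linalg.norm(a, axis=1)
    med = np.median(a_norm)
    mad = np.median(np.abs(a_norm - med)) + 1e-9   # guard degenerate MAD = 0
    return np.flatnonzero(a_norm > med + k * mad)  # acceleration-sample indices

# Synthetic bounce: steady descent, then steady ascent after frame 9.
y = np.concatenate([np.arange(0.0, 50.0, 5.0), np.arange(45.0, -5.0, -5.0)])
pos = np.stack([np.zeros_like(y), y], axis=1)
hits = impulse_frames(pos)  # the velocity sign flip shows up as outlier samples
```

Because the threshold is a median plus scaled MAD rather than a mean plus standard deviation, a single large impact spike does not inflate the threshold and mask smaller rebounds — which is what makes the detector frame-invariant and robust to tracker noise.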