TAPNext++ is the latest DeepMind checkpoint (same TAPNext architecture, fine-tuned on 1024-frame synthetic sequences) — 40× longer stable tracking, tracking through occlusions, and strong re-detection. The architecture already isolates trackers behind a subprocess + JSON contract, so the integration shape is well understood: vendor the model, write a sibling CLI, verify it produces correct output in isolation, wire it into the pipeline, validate it against BootsTAPIR on our real experiments with explicit pass/fail gates, then pick a default. Six phases, ~6 clock hours.

The rubber2 validation report identified vis_ratio = 0.62 on the post-impact static phase as the dominant residual error — BootsTAPIR loses lock when the small green ball merges with the textured concrete floor at frame 49+. The contact detector and rest-onset detector both honour the visibility flag and do the right thing, but downstream sim-vs-tracker validation has no ground-truth comparison from t=1.567 s onward.
TAPNext (DeepMind) reformulates point tracking as next-token prediction — propagating information through a recurrent network rather than iteratively refining. TAPNext++ is the latest checkpoint of that same architecture, fine-tuned on 1024-frame synthetic sequences with novel training strategies. The repo claims 40× longer stable tracking, tracking through occlusions, and strong re-detection — exactly the failure modes that bite us on rubber2's static tail (low-contrast ball merging with floor for ~1 s after impact). Switching may push vis_ratio on rubber2 from 0.62 toward 0.85+, with the re-detection capability as upside if the ball is briefly lost and re-acquired.
Caveat: the rubber2 failure may be partly feature-level (small low-contrast ball blending with floor), not model-bound. Even a perfect tracker could plateau here. P5's A/B will tell us.
Last revision conflated two different questions into a single A/B benchmark. Splitting them:
| Step | Question | Where it lives | Gate |
|---|---|---|---|
| Verification | Does the integration do what it claims to do? — weights load, model runs, output schema matches, single-clip trajectory is plausible. | P3 (in trackers/, isolated from pipeline) | BLOCKS P4. If TAPNext++ outputs garbage on a known-good clip, no point wiring it in. |
| Validation | Does it actually improve our outcomes? — vis_ratio on rubber2, physics fit on pen, both-entity coverage on dual-ball-drop. | P5 (full pipeline, real experiments) | BLOCKS default-flip. If it doesn't beat BootsTAPIR on metrics that matter, TAPNext++ stays opt-in. |
The distinction matters because verification failures and validation failures point to different fixes. A schema mismatch is a CLI bug; a vis_ratio regression is a model-fit problem. Conflating them wastes triage time.
The pipeline already isolates trackers behind a subprocess interface — adding a third backend is mostly drop-in.
This is the same pattern that already works for TAPIR + CoTracker. New code is contained in trackers/ (a separate uv venv, gitignored weights), plus a tiny dispatcher branch in the pipeline.
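To make "drop-in" concrete, a minimal sketch of that contract as this plan describes it. The dispatcher name (`run_tracker`), the CoTracker CLI filename, and the exact argv are assumptions, not the pipeline's real code:

```python
import json
import subprocess
from pathlib import Path

CLI_BY_BACKEND = {
    "bootstapir": "tapir_cli.py",
    "cotracker3": "cotracker_cli.py",  # assumed filename for the existing CoTracker CLI
    "tapnext_pp": "tapnext_cli.py",    # the new sibling from P2
}

def run_tracker(backend: str, video: Path, bbox: str, out: Path) -> dict:
    """Invoke the backend CLI in the trackers venv and parse its JSON output."""
    cli = CLI_BY_BACKEND[backend]
    subprocess.run(
        ["uv", "run", "python", cli,
         "--video", str(video), "--bbox", bbox, "--out", str(out)],
        cwd="trackers", check=True,
    )
    with open(out) as f:
        result = json.load(f)
    # Shared schema: entities[].points[].samples[{t,frame,x,y,visible}]
    assert "entities" in result
    return result
```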
**P1 — Vendor the model**

- `tapnet/torch/tapnext_model.py` + supporting modules from github.com/google-deepmind/tapnet. Vendor only what's needed (not the full repo); copy under `trackers/tapnext_vendor/`. The model class is shared between TAPNext and TAPNext++ — the difference is the checkpoint.
- `trackers/download_checkpoints.sh`. Verify the URL from the upstream README / model card (Checkpoints section, fine-tuned 1024-frame variant).
- `trackers/pyproject.toml` + `uv.lock` for new deps (likely einops; PyTorch already there).
- `trackers/LICENSE-tapnext` + NOTICE entry for the vendored Apache-2.0 code.

Done when: `cd trackers && uv sync && ./download_checkpoints.sh` pulls TAPNext++ weights without error and `uv run python -c "from tapnext_vendor.tapnext_model import TAPNext"` imports cleanly.
**P2 — tapnext_cli.py (~2 hr)**

- Mirror the `tapir_cli.py` structure: same flags (`--video --bbox --out --start --end`), same output JSON schema (`entities[].points[].samples[{t,frame,x,y,visible}]`).
- Reuse the `tapnet` sys.modules stub trick — bypass the parent package's TF/JAX init (sketch after this list).
- Derive `visible` from occlusion logits → scale tracks back to full resolution.
- `--backend tapnext_pp` switch baked in (constant for now; harmless future-proofing if we ever ship plain TAPNext alongside).

Done when: `uv run python tapnext_cli.py --video pen.mp4 --bbox "x,y,w,h:bob" --out /tmp/x.json` writes a JSON file with the documented schema (no exceptions, no shape mismatches). Correctness of the trajectory is P3's job, not P2's.
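For reference, the stub trick as this plan describes it (the canonical version lives in CLAUDE.md; the vendor layout and module names below are assumptions): register a synthetic `tapnet` package whose search path is the vendor directory, so submodule imports inside the vendored files resolve locally and the real `tapnet/__init__.py`, which drags in TF + JAX, never executes.

```python
import sys
import types
from pathlib import Path

# Synthetic 'tapnet' package: submodule lookups use __path__, so pointing it
# at the vendor dir makes `import tapnet.<mod>` load the vendored copy while
# the heavyweight tapnet/__init__.py never runs.
VENDOR = Path(__file__).parent / "tapnext_vendor"  # assumed P1 layout
stub = types.ModuleType("tapnet")
stub.__path__ = [str(VENDOR)]
sys.modules["tapnet"] = stub

from tapnet.tapnext_model import TAPNext  # noqa: E402  (resolves to the vendored file)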
**P3 — Verify in isolation**

Before wiring into the pipeline, prove the CLI produces correct output in isolation. Three layered checks:
| # | Check | How | Pass |
|---|---|---|---|
| V1 | Schema — JSON round-trips through kinematic_regimes._load_samples + contacts_from_trajectory without a parse error. | Run on pen-19 clip; uv run python -c "import json, …; samples = _load_samples(json.load(open('/tmp/x.json'))); assert len(samples) > 0" | No exceptions; samples non-empty. |
| V2 | Sanity — pen-19 trajectory tracks the bob through full oscillation. Visual + numeric. | Plot x(t), y(t) from the JSON; eyeball against the existing BootsTAPIR pen-19 trajectory. Compute period via FFT — must agree with prior pen fit (period ≈ 0.85 s) within 10%. | Period within ±10%; visible flag = true on ≥ 95% of frames where the bob is in-frame. |
| V3 | Stability — re-running the CLI on the same input is deterministic (within float epsilon). | Run twice; numpy.allclose on the sample arrays. | Max per-sample Δ < 1 px. |
Optional V4 (recommended): run on the static-tail segment of rubber2 in isolation (frames 49–77) — the exact failure mode we're trying to fix. If TAPNext++ also loses the ball here, validation in P5 is going to fail and we can stop early.
Done when: V1 + V2 + V3 all green, written up as a small trackers/scripts/verify_tapnext.py harness so re-running is one command. Failure here blocks P4.
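A sketch of what that harness could look like, covering V1 (schema + non-empty), V2 (FFT period), and V3 (determinism). `samples_xy` stands in for the real `_load_samples` round-trip, and the paths, frame rate, and single-point entity layout are assumptions:

```python
import json
import subprocess
import numpy as np

EXPECTED_PERIOD_S = 0.85  # prior pen-19 fit
FPS = 30.0                # assumed clip frame rate

def run_cli(out_path: str) -> dict:
    subprocess.run(
        ["uv", "run", "python", "tapnext_cli.py",
         "--video", "pen.mp4", "--bbox", "x,y,w,h:bob", "--out", out_path],
        check=True,
    )
    with open(out_path) as f:
        return json.load(f)

def samples_xy(result: dict) -> np.ndarray:
    pts = result["entities"][0]["points"][0]["samples"]
    return np.array([[s["x"], s["y"]] for s in pts])

def fft_period(x: np.ndarray, fps: float) -> float:
    """Dominant-frequency estimate of the oscillation period from x(t)."""
    x = x - x.mean()
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fps)
    mags = np.abs(np.fft.rfft(x))
    f = freqs[1:][np.argmax(mags[1:])]  # skip the DC bin
    return 1.0 / f

a = run_cli("/tmp/a.json")
b = run_cli("/tmp/b.json")

xy = samples_xy(a)
assert len(xy) > 0                                                 # V1: non-empty
period = fft_period(xy[:, 0], FPS)
assert abs(period - EXPECTED_PERIOD_S) / EXPECTED_PERIOD_S < 0.10  # V2: ±10%
assert np.abs(xy - samples_xy(b)).max() < 1.0                      # V3: < 1 px
print(f"V1-V3 green (period = {period:.3f} s)")
```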
**P4 — Wire into the pipeline**

- `models.tracker` on the Models dataclass. Allowed: `"bootstapir" | "tapnext_pp" | "cotracker3"`. Default stays `bootstapir` until P5 validates the swap.
- `merge_run_config` + `config.toml` plumbing.
- `pipeline/src/pipeline/tracker.py` dispatcher chooses the CLI path from `settings.models.tracker`.
- `backend` in the track_entity tool result (small change in `_impl_track_entity`) so the orchestrator and proof-of-work pages can attribute results.

Done when: setting `tracker = "tapnext_pp"` in a per-experiment run-config.json routes track_entity through the new CLI, the result dict carries `backend: "tapnext_pp"`, and a smoke run on pen-19 completes through the full pipeline (no orchestrator-side breakage).
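A hedged sketch of the P4 change. `Models` and `_impl_track_entity` exist in the pipeline per this plan, but their shapes here are guesses; `run_tracker` is the dispatcher sketched earlier:

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Literal

TrackerBackend = Literal["bootstapir", "tapnext_pp", "cotracker3"]

@dataclass
class Models:
    # Default stays bootstapir until P5 validates the swap.
    tracker: TrackerBackend = "bootstapir"

@dataclass
class Settings:
    models: Models

def _impl_track_entity(settings: Settings, video: Path, bbox: str) -> dict:
    backend = settings.models.tracker
    # run_tracker is the dispatcher from the earlier sketch; /tmp path is illustrative.
    result = run_tracker(backend, video, bbox, out=Path("/tmp/track.json"))
    result["backend"] = backend  # additive field so results can be attributed
    return result
```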
**P5 — Validate on real experiments**

Re-run three experiments with `tracker = "tapnext_pp"` and compare against the existing BootsTAPIR runs. Each row has an explicit pass condition; the decision matrix below translates the pass/fail pattern into a default-flip decision.
| Experiment | Primary metric | Pass condition | Why this metric |
|---|---|---|---|
| rubber2 | vis_ratio on post-impact static tail (frames 49–77) | ≥ 0.80 (BootsTAPIR was 0.62) AND contact still at f=20 AND rest still at f=26 | This is the failure that motivated the swap. If it doesn't move, we got nothing. |
| pen run-19 | Fitted period + damping β; diff_sim_to_tracker RMSE | Period within ±5% and RMSE within ±5% of BootsTAPIR baseline | Guard against accuracy regression on a clean known-good case. |
| dual-ball-drop run-11 | Per-entity vis_ratio; basketball impact frame | Both entities ≥ 0.7 vis_ratio; basketball impact frame unchanged | Multi-object case — verifies we don't lose one tracker target while gaining the other. |
Secondary metrics (logged but not gating): per-frame inference time (TAPNext claims faster than TAPIR — verify), GPU memory, output JSON file size.
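The gating metric is cheap to recompute from the shared schema. A sketch, with hypothetical run-output paths and assuming one tracked point per entity:

```python
import json

def vis_ratio(result: dict, f_start: int, f_end: int, entity: int = 0) -> float:
    """Fraction of frames in [f_start, f_end] the tracker marks visible."""
    samples = result["entities"][entity]["points"][0]["samples"]
    window = [s for s in samples if f_start <= s["frame"] <= f_end]
    return sum(s["visible"] for s in window) / len(window)

# rubber2 post-impact static tail is frames 49-77; file paths are illustrative.
baseline = vis_ratio(json.load(open("runs/rubber2_bootstapir.json")), 49, 77)
candidate = vis_ratio(json.load(open("runs/rubber2_tapnext_pp.json")), 49, 77)
print(f"BootsTAPIR {baseline:.2f} -> TAPNext++ {candidate:.2f} (gate: >= 0.80)")
```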
Decision matrix:
| Outcome | Action |
|---|---|
| All 3 pass | Flip default to tapnext_pp in config.toml. Keep BootsTAPIR available as fallback. |
| 2 of 3 pass (rubber2 wins, no regression elsewhere) | Flip default. Document the failing case as one where BootsTAPIR remains the known-good tracker; allow per-experiment override. |
| rubber2 fails OR pen regresses | Keep BootsTAPIR default. TAPNext++ stays opt-in. Investigate root cause before next attempt. |
Done when: all three runs complete, metrics computed, decision recorded in the proof-of-work HTML.
**P6 — Documentation**

- `report-app/public/notes/2026-04-27-tapnext-validation.html` with V&V results: P3 verification table + P5 validation A/B numbers + decision verdict.
- `orchestrator_sys.md` only if the track_entity result shape changed materially (the additive `backend` field shouldn't trigger this).
- `CLAUDE.md` Architecture notes / Trackers section: list TAPNext++ alongside CoTracker3 + BootsTAPIR; note the sys.modules stub trick is reused.

**Risks**

| # | Risk | Mitigation |
|---|---|---|
| R1 | TAPNext PyTorch port may have a different forward signature than TAPIR (different point query format, different head outputs). | Match the repo's reference inference loop verbatim; don't try to port to our existing inference shape. |
| R2 | The tapnet parent package init drags in TF + JAX. Same problem we already hit with TAPIR. | Reuse the sys.modules['tapnet'] stub trick. Already documented in CLAUDE.md. |
| R3 | TAPNext is recurrent → may be slower per-frame than TAPIR's iterative refinement. The README claims it's faster end-to-end ("most capable, fastest, yet simplest"); we should still measure. | Measure during the P5 runs. If a clip takes 2× longer with no vis_ratio gain, that's a fail. |
| R4 | TAPNext++ checkpoint URL: the README references it but the explicit download link may live on the TAPNext++ project page or in the Checkpoints section. | If unclear, fall back to plain TAPNext checkpoint (still better than BootsTAPIR for our failure mode); convert from JAX as last resort. |
| R5 | License: tapnet is Apache-2.0 — compatible, but vendored code needs attribution. | Add LICENSE-tapnext + per-file header. |
| R6 | The `visible` flag might be derived differently in TAPNext vs TAPIR. Mismatched flag semantics could break downstream filters. | Calibrate on pen run-19: ground-truth visibility is "ball in frame, not occluded by hand"; TAPNext's flag should agree on at least 95% of frames there (sketch below). |
The orchestrator itself stays untouched: it already consumes tracker output through the same track_entity calls; no need to change that.
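A sketch of the R6 calibration. The plan's ground truth is manual ("ball in frame, not occluded by hand"); the BootsTAPIR flags used below are a cheap proxy for it on this known-good clip, and the file paths and per-frame alignment are assumptions:

```python
import json
import numpy as np

def flags(path: str) -> np.ndarray:
    with open(path) as f:
        samples = json.load(f)["entities"][0]["points"][0]["samples"]
    return np.array([s["visible"] for s in samples], dtype=bool)

ref = flags("runs/pen19_bootstapir.json")   # proxy for ground truth on this clip
new = flags("runs/pen19_tapnext_pp.json")   # assumes identical frame coverage
agreement = (ref == new).mean()
print(f"visible-flag agreement: {agreement:.1%} (gate: >= 95%)")
assert agreement >= 0.95, "flag semantics differ; recalibrate before trusting downstream filters"
```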