Cradle · plan · 2026-04-27 (rev 3)

TAPNext++ — integration plan

Plan to swap (or A/B) BootsTAPIR for TAPNext++ as the point-tracker behind track_entity. TAPNext++ is the latest DeepMind checkpoint (same TAPNext architecture, fine-tuned on 1024-frame synthetic sequences) — 40× longer stable tracking, tracking through occlusions, and strong re-detection. The architecture already isolates trackers behind a subprocess + JSON contract, so the integration shape is well-understood: vendor the model, write a sibling CLI, verify it produces correct output in isolation, wire it into the pipeline, validate it against BootsTAPIR on our real experiments with explicit pass/fail gates, then pick a default. Six phases, ~6 clock hours.

Contents  ·  Why TAPNext++  ·  V&V philosophy  ·  Architecture overview  ·  P1 vendor  ·  P2 cli  ·  P3 verification  ·  P4 backend switch  ·  P5 validation  ·  P6 docs  ·  Risks & unknowns  ·  Out of scope

Why TAPNext++

The rubber2 validation report identified vis_ratio = 0.62 on the post-impact static phase as the dominant residual error — BootsTAPIR loses lock when the small green ball merges with the textured concrete floor at frame 49+. The contact detector and rest-onset detector both honour the visibility flag and do the right thing, but downstream sim-vs-tracker validation has no ground-truth comparison from t=1.567 s onward.

TAPNext (DeepMind) reformulates point tracking as next-token prediction — propagating information through a recurrent network rather than iteratively refining. TAPNext++ is the latest checkpoint of that same architecture, fine-tuned on 1024-frame synthetic sequences with novel training strategies. The repo claims 40× longer stable tracking, tracking through occlusions, and strong re-detection — exactly the failure modes that bite us on rubber2's static tail (low-contrast ball merging with floor for ~1 s after impact). Switching may push vis_ratio on rubber2 from 0.62 toward 0.85+, with the re-detection capability as upside if the ball is briefly lost and re-acquired.

Caveat: the rubber2 failure may be partly feature-level (small low-contrast ball blending with floor), not model-bound. Even a perfect tracker could plateau here. P4's A/B will tell us.

Verification & validation philosophy

Last revision conflated two different questions into a single A/B benchmark. Splitting them:

Verification · Does the integration do what it claims to do? Weights load, model runs, output schema matches, single-clip trajectory is plausible.
  Where it lives: P3 (in trackers/, isolated from the pipeline).
  Gate: BLOCKS P4. If TAPNext++ outputs garbage on a known-good clip, there is no point wiring it in.

Validation · Does it actually improve our outcomes? vis_ratio on rubber2, physics fit on pen, both-entity coverage on dual-ball-drop.
  Where it lives: P5 (full pipeline, real experiments).
  Gate: BLOCKS default-flip. If it doesn't beat BootsTAPIR on metrics that matter, TAPNext++ stays opt-in.

The distinction matters because verification failures and validation failures point to different fixes. A schema mismatch is a CLI bug; a vis_ratio regression is a model-fit problem. Conflating them wastes triage time.

Architecture overview

The pipeline already isolates trackers behind a subprocess interface — adding a third backend is mostly drop-in:

┌──────────────────────────────────────────────────────────┐
│ pipeline/src/pipeline/tracker.py                         │
│ (dispatcher)                                             │
│                                                          │
│ settings.models.tracker == "bootstapir"                  │
│       │             │             │                      │
│       ▼             ▼             ▼                      │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐                │
│  │ tapir_   │  │ tapnext_ │  │ cotracker│                │
│  │ cli.py   │  │ cli.py   │  │ _cli.py  │                │
│  │          │  │ (NEW)    │  │          │                │
│  └──────────┘  └──────────┘  └──────────┘                │
│                                                          │
│  every CLI emits the same JSON shape:                    │
│  entities[].points[].samples[{t, frame, x, y, visible}]  │
└──────────────────────────────────────────────────────────┘

This is the same pattern that already works for TAPIR + CoTracker. New code is contained in trackers/ (a separate uv venv, gitignored weights), plus a tiny dispatcher branch in the pipeline.
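For concreteness, here is a minimal instance of that contract, written as the Python dict a CLI would json.dump into its --out file. Field names follow the diagram; the "name" label field and all values are illustrative assumptions, not taken from the existing CLIs:

    # Illustrative only: the smallest payload matching the shared contract.
    # Field names follow the diagram; the "name" label field is an assumption,
    # and the numbers are invented.
    payload = {
        "entities": [
            {
                "name": "bob",
                "points": [
                    {
                        "samples": [
                            {"t": 0.000, "frame": 0, "x": 412.3, "y": 188.0, "visible": True},
                            {"t": 0.033, "frame": 1, "x": 414.1, "y": 190.6, "visible": True},
                        ]
                    }
                ],
            }
        ]
    }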

Six phases

P1 · Vendor TAPNext + TAPNext++ weights · ~1 hr

Done when: cd trackers && uv sync && ./download_checkpoints.sh pulls TAPNext++ weights without error and uv run python -c "from tapnext_vendor.tapnext_model import TAPNext" imports cleanly.

P2 · Write tapnext_cli.py · ~2 hr

Done when: uv run python tapnext_cli.py --video pen.mp4 --bbox "x,y,w,h:bob" --out /tmp/x.json writes a JSON file with the documented schema (no exceptions, no shape mismatches). Correctness of the trajectory is P3's job, not P2's.
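A rough skeleton of the CLI this implies, using the flag names from the done-when command above. run_tapnext_pp is a hypothetical placeholder; the real loading and inference code comes from the vendored repo in P1:

    # tapnext_cli.py -- rough skeleton only.  run_tapnext_pp is a hypothetical
    # placeholder for the vendored TAPNext++ inference; everything else mirrors
    # the command in the done-when above.
    import argparse
    import json

    def parse_bbox(spec: str):
        """Split "x,y,w,h:label" into ((x, y, w, h), label)."""
        coords, label = spec.rsplit(":", 1)
        x, y, w, h = (float(v) for v in coords.split(","))
        return (x, y, w, h), label

    def run_tapnext_pp(video_path: str, bbox):
        """Placeholder: load the vendored model and return per-frame samples,
        each shaped {"t": float, "frame": int, "x": float, "y": float, "visible": bool}."""
        raise NotImplementedError("wire up the vendored TAPNext++ model here")

    def main() -> None:
        ap = argparse.ArgumentParser(description="TAPNext++ point-tracking CLI (sketch)")
        ap.add_argument("--video", required=True)
        ap.add_argument("--bbox", required=True, help='"x,y,w,h:label"')
        ap.add_argument("--out", required=True)
        args = ap.parse_args()

        bbox, label = parse_bbox(args.bbox)
        samples = run_tapnext_pp(args.video, bbox)

        # Same JSON shape every tracker CLI emits (see the architecture diagram).
        # The "name" label field is an assumption; the contract only pins
        # entities[].points[].samples.
        payload = {"entities": [{"name": label, "points": [{"samples": samples}]}]}
        with open(args.out, "w") as f:
            json.dump(payload, f)

    if __name__ == "__main__":
        main()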

P3 · Verification — does TAPNext++ actually work? · ~1 hr

Before wiring into the pipeline, prove the CLI produces correct output in isolation. Three layered checks:

V1 · Schema — JSON round-trips through kinematic_regimes._load_samples + contacts_from_trajectory without a parse error.
  How: Run on the pen-19 clip; uv run python -c "import json, …; samples = _load_samples(json.load(open('/tmp/x.json'))); assert len(samples) > 0"
  Pass: No exceptions; samples non-empty.

V2 · Sanity — the pen-19 trajectory tracks the bob through a full oscillation, checked visually and numerically.
  How: Plot x(t), y(t) from the JSON; eyeball against the existing BootsTAPIR pen-19 trajectory. Compute the period via FFT; it must agree with the prior pen fit (period ≈ 0.85 s) within 10%.
  Pass: Period within ±10%; visible flag true on ≥ 95% of frames where the bob is in-frame.

V3 · Stability — re-running the CLI on the same input is deterministic (within float epsilon).
  How: Run twice; numpy.allclose on the sample arrays.
  Pass: max |Δ| < 1 px.

Optional V4 (recommended): run on the static-tail segment of rubber2 in isolation (frames 49–77) — the exact failure mode we're trying to fix. If TAPNext++ also loses the ball here, validation in P5 is going to fail and we can stop early.

Done when: V1 + V2 + V3 all green, written up as a small trackers/scripts/verify_tapnext.py harness so re-running is one command. Failure here blocks P4.
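A sketch of what that harness could look like. The _load_samples round-trip required by V1 is replaced here by a simplified inline loader (the real import path isn't pinned in this plan), the 0.85 s reference period comes from V2 above, and the file names are placeholders:

    # trackers/scripts/verify_tapnext.py -- sketch of the V1-V3 harness.
    # File names, the bbox spec, and the 0.85 s reference period come from this
    # plan; the simplified loader below stands in for the real
    # kinematic_regimes._load_samples round-trip required by V1.
    import json
    import subprocess
    import numpy as np

    def run_cli(out_path: str) -> dict:
        subprocess.run(
            ["uv", "run", "python", "tapnext_cli.py",
             "--video", "pen.mp4", "--bbox", "x,y,w,h:bob", "--out", out_path],
            check=True,
        )
        with open(out_path) as f:
            return json.load(f)

    def to_arrays(payload: dict):
        samples = payload["entities"][0]["points"][0]["samples"]
        t = np.array([s["t"] for s in samples])
        x = np.array([s["x"] for s in samples])
        return t, x

    def dominant_period(t: np.ndarray, x: np.ndarray) -> float:
        """V2: period of the strongest FFT component of the bob's x(t)."""
        dt = float(np.mean(np.diff(t)))
        spec = np.abs(np.fft.rfft(x - x.mean()))
        freqs = np.fft.rfftfreq(len(x), d=dt)
        return 1.0 / freqs[1:][np.argmax(spec[1:])]   # skip the DC bin

    def main() -> None:
        a = run_cli("/tmp/tapnext_a.json")
        b = run_cli("/tmp/tapnext_b.json")

        # V1 -- schema: samples parse and are non-empty
        t, x = to_arrays(a)
        assert len(t) > 0, "V1 fail: no samples"

        # V2 -- sanity: period within +/-10% of the prior pen fit (~0.85 s);
        # the visible-flag coverage check needs the in-frame annotation and is
        # left out of this sketch.
        period = dominant_period(t, x)
        assert abs(period - 0.85) / 0.85 < 0.10, f"V2 fail: period {period:.3f} s"

        # V3 -- stability: two runs agree to better than 1 px
        _, x2 = to_arrays(b)
        assert np.max(np.abs(x - x2)) < 1.0, "V3 fail: non-deterministic output"

        print("V1-V3 green")

    if __name__ == "__main__":
        main()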

P4 · Pipeline backend switch · ~30 min

Done when: setting tracker = "tapnext_pp" in a per-experiment run-config.json routes track_entity through the new CLI, the result dict carries backend: "tapnext_pp", and a smoke run on pen-19 completes through the full pipeline (no orchestrator-side breakage).
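For scale, a hypothetical sketch of the dispatcher branch. The subprocess helper and the plain backend-string argument are stand-ins for whatever tracker.py actually does today; only the "tapnext_pp" key and the backend field in the result come from this plan:

    # Hypothetical sketch of the new branch in pipeline/src/pipeline/tracker.py.
    # The subprocess helper and backend-string argument are stand-ins for the
    # real dispatcher's internals.
    import json
    import subprocess
    import tempfile

    _CLI_BY_BACKEND = {
        "bootstapir": "tapir_cli.py",
        "tapnext_pp": "tapnext_cli.py",
        "cotracker": "cotracker_cli.py",
    }

    def track_entity(video_path: str, bbox_spec: str, backend: str = "bootstapir") -> dict:
        cli = _CLI_BY_BACKEND[backend]
        with tempfile.TemporaryDirectory() as tmp:
            out_path = f"{tmp}/track.json"
            subprocess.run(
                ["uv", "run", "python", cli,
                 "--video", video_path, "--bbox", bbox_spec, "--out", out_path],
                check=True,
            )
            with open(out_path) as f:
                result = json.load(f)
        result["backend"] = backend   # carried through per the P4 done-when
        return result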

P5 · Validation — does it actually beat BootsTAPIR? · ~1.5 hr clock + run time

Re-run three experiments with tracker = "tapnext_pp" and compare against the existing BootsTAPIR runs. Each experiment below has an explicit pass condition; the decision matrix after the table maps pass/fail combinations to an action.

rubber2
  Primary metric: vis_ratio on the post-impact static tail (frames 49–77).
  Pass condition: ≥ 0.80 (BootsTAPIR was 0.62) AND contact still at f=20 AND rest still at f=26.
  Why this metric: this is the failure that motivated the swap; if it doesn't move, we got nothing.

pen run-19
  Primary metric: fitted period + damping β; diff_sim_to_tracker RMSE.
  Pass condition: period within ±5% and RMSE within ±5% of the BootsTAPIR baseline.
  Why this metric: guard against an accuracy regression on a clean, known-good case.

dual-ball-drop run-11
  Primary metric: per-entity vis_ratio; basketball impact frame.
  Pass condition: both entities ≥ 0.7 vis_ratio; basketball impact frame unchanged.
  Why this metric: multi-object case; verifies we don't lose one tracked target while gaining the other.

Secondary metrics (logged but not gating): per-frame inference time (TAPNext claims faster than TAPIR — verify), GPU memory, output JSON file size.
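The rubber2 gate reduces to one number per run. A sketch of how vis_ratio over the static tail could be computed from the CLI JSON; the run paths are illustrative, the frame window and the 0.62 / 0.80 figures come from the table above, and the loader matches the simplified one in the P3 sketch:

    # Sketch: vis_ratio over the post-impact static tail (frames 49-77) for the
    # rubber2 A/B.  Run paths are illustrative.
    import json

    def vis_ratio(payload_path: str, frame_lo: int, frame_hi: int) -> float:
        with open(payload_path) as f:
            samples = json.load(f)["entities"][0]["points"][0]["samples"]
        window = [s for s in samples if frame_lo <= s["frame"] <= frame_hi]
        return sum(s["visible"] for s in window) / len(window)

    # baseline  = vis_ratio("runs/rubber2/bootstapir.json", 49, 77)   # was 0.62
    # candidate = vis_ratio("runs/rubber2/tapnext_pp.json", 49, 77)   # gate: >= 0.80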

Decision matrix:

All 3 pass
  Action: flip the default to tapnext_pp in config.toml; keep BootsTAPIR available as a fallback.

2 of 3 pass (rubber2 wins, no regression elsewhere)
  Action: flip the default; document the failing case as one where BootsTAPIR remains known-good and allow a per-experiment override.

rubber2 fails OR pen regresses
  Action: keep the BootsTAPIR default; TAPNext++ stays opt-in. Investigate the root cause before the next attempt.

Done when: all three runs complete, metrics computed, decision recorded in the proof-of-work HTML.

P6 · Docs + proof-of-work · ~30 min

Risks & unknowns

R1 · The TAPNext PyTorch port may have a different forward signature than TAPIR (different point-query format, different head outputs).
  Mitigation: match the repo's reference inference loop verbatim; don't try to port it onto our existing inference shape.

R2 · The tapnet parent package __init__ drags in TF + JAX; the same problem we already hit with TAPIR.
  Mitigation: reuse the sys.modules['tapnet'] stub trick, already documented in CLAUDE.md (a sketch follows after this list).

R3 · TAPNext is recurrent → may be slower per frame than TAPIR's iterative refinement. The README claims it's faster end-to-end ("most capable, fastest, yet simplest"); we should still measure.
  Mitigation: measure in P4. If a clip takes 2× longer with no vis_ratio gain, that's a fail.

R4 · TAPNext++ checkpoint URL: the README references it, but the explicit download link may live on the TAPNext++ project page or in the Checkpoints section.
  Mitigation: if unclear, fall back to the plain TAPNext checkpoint (still better than BootsTAPIR for our failure mode); convert from JAX as a last resort.

R5 · License: tapnet is Apache-2.0 — compatible, but vendored code needs attribution.
  Mitigation: add LICENSE-tapnext + a per-file attribution header.

R6 · The visible flag might be derived differently in TAPNext vs TAPIR; mismatched flag semantics could break downstream filters.
  Mitigation: calibrate on pen run-19: ground-truth visibility is "bob in frame, not occluded by hand"; TAPNext's flag should agree on at least 95% of frames there.
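Sketch of the R2 stub trick referenced above (the same idea CLAUDE.md documents for TAPIR): register a bare tapnet package so submodule imports never execute the real tapnet/__init__.py. The vendor path and submodule name are assumptions to be adjusted to the layout produced in P1:

    # Sketch of the R2 mitigation.  Vendor path and submodule name are
    # assumptions; adjust to the vendored layout from P1.
    import sys
    import types
    from pathlib import Path

    def stub_tapnet(vendor_root: Path) -> None:
        stub = types.ModuleType("tapnet")
        # Giving the stub a __path__ makes the import machinery treat it as a
        # package and search vendor_root/tapnet/ for submodules, without ever
        # executing the real tapnet/__init__.py (which pulls in TF + JAX).
        stub.__path__ = [str(vendor_root / "tapnet")]
        sys.modules["tapnet"] = stub

    # usage, before importing the vendored model:
    # stub_tapnet(Path("trackers/tapnext_vendor"))
    # from tapnet.tapnext import tapnext_model   # hypothetical submodule path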

Out of scope


plan · 2026-04-27 · cradle pipeline
six phases · ~6 hr clock · P3 verification gates P4; P5 validation gates default-flip
ready to start P1 on go.