Cradle · plan · 2026-04-27 (rev 3)

TAPNext++ — integration plan

Plan to swap (or A/B) BootsTAPIR for TAPNext++ as the point-tracker behind track_entity. TAPNext++ is the latest DeepMind checkpoint (same TAPNext architecture, fine-tuned on 1024-frame synthetic sequences) — 40× longer stable tracking, tracking through occlusions, and strong re-detection. The architecture already isolates trackers behind a subprocess + JSON contract, so the integration shape is well-understood: vendor the model, write a sibling CLI, verify it produces correct output in isolation, wire it into the pipeline, validate it against BootsTAPIR on our real experiments with explicit pass/fail gates, then pick a default. Six phases, ~6 clock hours.

Contents  ·  Why TAPNext++  ·  V&V philosophy  ·  Architecture overview  ·  P1 vendor  ·  P2 cli  ·  P3 verification  ·  P4 backend switch  ·  P5 validation  ·  P6 docs  ·  Risks & unknowns  ·  Out of scope

Why TAPNext++

The rubber2 validation report identified vis_ratio = 0.62 on the post-impact static phase as the dominant residual error — BootsTAPIR loses lock when the small green ball merges with the textured concrete floor at frame 49+. The contact detector and rest-onset detector both honour the visibility flag and do the right thing, but downstream sim-vs-tracker validation has no ground-truth comparison from t=1.567 s onward.

TAPNext (DeepMind) reformulates point tracking as next-token prediction — propagating information through a recurrent network rather than iteratively refining. TAPNext++ is the latest checkpoint of that same architecture, fine-tuned on 1024-frame synthetic sequences with novel training strategies. The repo claims 40× longer stable tracking, tracking through occlusions, and strong re-detection — exactly the failure modes that bite us on rubber2's static tail (low-contrast ball merging with floor for ~1 s after impact). Switching may push vis_ratio on rubber2 from 0.62 toward 0.85+, with the re-detection capability as upside if the ball is briefly lost and re-acquired.

Caveat: the rubber2 failure may be partly feature-level (small low-contrast ball blending with floor), not model-bound. Even a perfect tracker could plateau here. P4's A/B will tell us.

Verification & validation philosophy

Last revision conflated two different questions into a single A/B benchmark. Splitting them:

Verification · Does the integration do what it claims to do? Weights load, model runs, output schema matches, single-clip trajectory is plausible.
  Where it lives: P3 (in trackers/, isolated from the pipeline).
  Gate: BLOCKS P4. If TAPNext++ outputs garbage on a known-good clip, there is no point wiring it in.

Validation · Does it actually improve our outcomes? vis_ratio on rubber2, physics fit on pen, both-entity coverage on dual-ball-drop.
  Where it lives: P5 (full pipeline, real experiments).
  Gate: BLOCKS default-flip. If it doesn't beat BootsTAPIR on metrics that matter, TAPNext++ stays opt-in.

The distinction matters because verification failures and validation failures point to different fixes. A schema mismatch is a CLI bug; a vis_ratio regression is a model-fit problem. Conflating them wastes triage time.

Architecture overview

The pipeline already isolates trackers behind a subprocess interface — adding a third backend is mostly drop-in:

┌──────────────────────────────────────────────────────────┐
│ pipeline/src/pipeline/tracker.py                         │
│ (dispatcher)                                             │
│                                                          │
│ settings.models.tracker == "bootstapir"                  │
│       │             │             │                      │
│       ▼             ▼             ▼                      │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐                │
│  │ tapir_   │  │ tapnext_ │  │ cotracker│                │
│  │ cli.py   │  │ cli.py   │  │ _cli.py  │                │
│  │          │  │ (NEW)    │  │          │                │
│  └──────────┘  └──────────┘  └──────────┘                │
│                                                          │
│  every CLI emits the same JSON shape:                    │
│  entities[].points[].samples[{t, frame, x, y, visible}]  │
└──────────────────────────────────────────────────────────┘

This is the same pattern that already works for TAPIR + CoTracker. New code is contained in trackers/ (a separate uv venv, gitignored weights), plus a tiny dispatcher branch in the pipeline.
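For concreteness, here is a minimal instance of that contract, written as the Python dict a CLI would json.dump into its --out file. Field names follow the diagram; the "name" label field and all values are illustrative assumptions, not taken from the existing CLIs:

    # Illustrative only: the smallest payload matching the shared contract.
    # Field names follow the diagram; the "name" label field is an assumption,
    # and the numbers are invented.
    payload = {
        "entities": [
            {
                "name": "bob",
                "points": [
                    {
                        "samples": [
                            {"t": 0.000, "frame": 0, "x": 412.3, "y": 188.0, "visible": True},
                            {"t": 0.033, "frame": 1, "x": 414.1, "y": 190.6, "visible": True},
                        ]
                    }
                ],
            }
        ]
    }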

Six phases

P1 · Vendor TAPNext + TAPNext++ weights · ~1 hr

Done when: cd trackers && uv sync && ./download_checkpoints.sh pulls TAPNext++ weights without error and uv run python -c "from tapnext_vendor.tapnext_model import TAPNext" imports cleanly.

P2 · Write tapnext_cli.py · ~2 hr

Done when: uv run python tapnext_cli.py --video pen.mp4 --bbox "x,y,w,h:bob" --out /tmp/x.json writes a JSON file with the documented schema (no exceptions, no shape mismatches). Correctness of the trajectory is P3's job, not P2's.
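A rough skeleton of the CLI this implies, using the flag names from the done-when command above. run_tapnext_pp is a hypothetical placeholder; the real loading and inference code comes from the vendored repo in P1:

    # tapnext_cli.py -- rough skeleton only.  run_tapnext_pp is a hypothetical
    # placeholder for the vendored TAPNext++ inference; everything else mirrors
    # the command in the done-when above.
    import argparse
    import json

    def parse_bbox(spec: str):
        """Split "x,y,w,h:label" into ((x, y, w, h), label)."""
        coords, label = spec.rsplit(":", 1)
        x, y, w, h = (float(v) for v in coords.split(","))
        return (x, y, w, h), label

    def run_tapnext_pp(video_path: str, bbox):
        """Placeholder: load the vendored model and return per-frame samples,
        each shaped {"t": float, "frame": int, "x": float, "y": float, "visible": bool}."""
        raise NotImplementedError("wire up the vendored TAPNext++ model here")

    def main() -> None:
        ap = argparse.ArgumentParser(description="TAPNext++ point-tracking CLI (sketch)")
        ap.add_argument("--video", required=True)
        ap.add_argument("--bbox", required=True, help='"x,y,w,h:label"')
        ap.add_argument("--out", required=True)
        args = ap.parse_args()

        bbox, label = parse_bbox(args.bbox)
        samples = run_tapnext_pp(args.video, bbox)

        # Same JSON shape every tracker CLI emits (see the architecture diagram).
        # The "name" label field is an assumption; the contract only pins
        # entities[].points[].samples.
        payload = {"entities": [{"name": label, "points": [{"samples": samples}]}]}
        with open(args.out, "w") as f:
            json.dump(payload, f)

    if __name__ == "__main__":
        main()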

P3 · Verification — does TAPNext++ actually work? · ~1 hr

Before wiring into the pipeline, prove the CLI produces correct output in isolation. Three layered checks:

V1 · Schema — JSON round-trips through kinematic_regimes._load_samples + contacts_from_trajectory without a parse error.
  How: Run on the pen-19 clip; uv run python -c "import json, …; samples = _load_samples(json.load(open('/tmp/x.json'))); assert len(samples) > 0"
  Pass: No exceptions; samples non-empty.

V2 · Sanity — the pen-19 trajectory tracks the bob through a full oscillation, checked visually and numerically.
  How: Plot x(t), y(t) from the JSON; eyeball against the existing BootsTAPIR pen-19 trajectory. Compute the period via FFT; it must agree with the prior pen fit (period ≈ 0.85 s) within 10%.
  Pass: Period within ±10%; visible flag true on ≥ 95% of frames where the bob is in-frame.

V3 · Stability — re-running the CLI on the same input is deterministic (within float epsilon).
  How: Run twice; numpy.allclose on the sample arrays.
  Pass: max |Δ| < 1 px.

Optional V4 (recommended): run on the static-tail segment of rubber2 in isolation (frames 49–77) — the exact failure mode we're trying to fix. If TAPNext++ also loses the ball here, validation in P5 is going to fail and we can stop early.

Done when: V1 + V2 + V3 all green, written up as a small trackers/scripts/verify_tapnext.py harness so re-running is one command. Failure here blocks P4.
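A sketch of what that harness could look like. The _load_samples round-trip required by V1 is replaced here by a simplified inline loader (the real import path isn't pinned in this plan), the 0.85 s reference period comes from V2 above, and the file names are placeholders:

    # trackers/scripts/verify_tapnext.py -- sketch of the V1-V3 harness.
    # File names, the bbox spec, and the 0.85 s reference period come from this
    # plan; the simplified loader below stands in for the real
    # kinematic_regimes._load_samples round-trip required by V1.
    import json
    import subprocess
    import numpy as np

    def run_cli(out_path: str) -> dict:
        subprocess.run(
            ["uv", "run", "python", "tapnext_cli.py",
             "--video", "pen.mp4", "--bbox", "x,y,w,h:bob", "--out", out_path],
            check=True,
        )
        with open(out_path) as f:
            return json.load(f)

    def to_arrays(payload: dict):
        samples = payload["entities"][0]["points"][0]["samples"]
        t = np.array([s["t"] for s in samples])
        x = np.array([s["x"] for s in samples])
        return t, x

    def dominant_period(t: np.ndarray, x: np.ndarray) -> float:
        """V2: period of the strongest FFT component of the bob's x(t)."""
        dt = float(np.mean(np.diff(t)))
        spec = np.abs(np.fft.rfft(x - x.mean()))
        freqs = np.fft.rfftfreq(len(x), d=dt)
        return 1.0 / freqs[1:][np.argmax(spec[1:])]   # skip the DC bin

    def main() -> None:
        a = run_cli("/tmp/tapnext_a.json")
        b = run_cli("/tmp/tapnext_b.json")

        # V1 -- schema: samples parse and are non-empty
        t, x = to_arrays(a)
        assert len(t) > 0, "V1 fail: no samples"

        # V2 -- sanity: period within +/-10% of the prior pen fit (~0.85 s);
        # the visible-flag coverage check needs the in-frame annotation and is
        # left out of this sketch.
        period = dominant_period(t, x)
        assert abs(period - 0.85) / 0.85 < 0.10, f"V2 fail: period {period:.3f} s"

        # V3 -- stability: two runs agree to better than 1 px
        _, x2 = to_arrays(b)
        assert np.max(np.abs(x - x2)) < 1.0, "V3 fail: non-deterministic output"

        print("V1-V3 green")

    if __name__ == "__main__":
        main()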

P4 · Pipeline backend switch · ~30 min

Done when: setting tracker = "tapnext_pp" in a per-experiment run-config.json routes track_entity through the new CLI, the result dict carries backend: "tapnext_pp", and a smoke run on pen-19 completes through the full pipeline (no orchestrator-side breakage).
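For scale, a hypothetical sketch of the dispatcher branch. The subprocess helper and the plain backend-string argument are stand-ins for whatever tracker.py actually does today; only the "tapnext_pp" key and the backend field in the result come from this plan:

    # Hypothetical sketch of the new branch in pipeline/src/pipeline/tracker.py.
    # The subprocess helper and backend-string argument are stand-ins for the
    # real dispatcher's internals.
    import json
    import subprocess
    import tempfile

    _CLI_BY_BACKEND = {
        "bootstapir": "tapir_cli.py",
        "tapnext_pp": "tapnext_cli.py",
        "cotracker": "cotracker_cli.py",
    }

    def track_entity(video_path: str, bbox_spec: str, backend: str = "bootstapir") -> dict:
        cli = _CLI_BY_BACKEND[backend]
        with tempfile.TemporaryDirectory() as tmp:
            out_path = f"{tmp}/track.json"
            subprocess.run(
                ["uv", "run", "python", cli,
                 "--video", video_path, "--bbox", bbox_spec, "--out", out_path],
                check=True,
            )
            with open(out_path) as f:
                result = json.load(f)
        result["backend"] = backend   # carried through per the P4 done-when
        return result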

P5 · Validation — does it actually beat BootsTAPIR? · ~1.5 hr clock + run time

Re-run three experiments with tracker = "tapnext_pp" and compare against the existing BootsTAPIR runs. Each experiment below has an explicit pass condition; the decision matrix after the table maps pass/fail combinations to an action.

rubber2
  Primary metric: vis_ratio on the post-impact static tail (frames 49–77).
  Pass condition: ≥ 0.80 (BootsTAPIR was 0.62) AND contact still at f=20 AND rest still at f=26.
  Why this metric: this is the failure that motivated the swap; if it doesn't move, we got nothing.

pen run-19
  Primary metric: fitted period + damping β; diff_sim_to_tracker RMSE.
  Pass condition: period within ±5% and RMSE within ±5% of the BootsTAPIR baseline.
  Why this metric: guard against an accuracy regression on a clean, known-good case.

dual-ball-drop run-11
  Primary metric: per-entity vis_ratio; basketball impact frame.
  Pass condition: both entities ≥ 0.7 vis_ratio; basketball impact frame unchanged.
  Why this metric: multi-object case; verifies we don't lose one tracked target while gaining the other.

Secondary metrics (logged but not gating): per-frame inference time (TAPNext claims faster than TAPIR — verify), GPU memory, output JSON file size.
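The rubber2 gate reduces to one number per run. A sketch of how vis_ratio over the static tail could be computed from the CLI JSON; the run paths are illustrative, the frame window and the 0.62 / 0.80 figures come from the table above, and the loader matches the simplified one in the P3 sketch:

    # Sketch: vis_ratio over the post-impact static tail (frames 49-77) for the
    # rubber2 A/B.  Run paths are illustrative.
    import json

    def vis_ratio(payload_path: str, frame_lo: int, frame_hi: int) -> float:
        with open(payload_path) as f:
            samples = json.load(f)["entities"][0]["points"][0]["samples"]
        window = [s for s in samples if frame_lo <= s["frame"] <= frame_hi]
        return sum(s["visible"] for s in window) / len(window)

    # baseline  = vis_ratio("runs/rubber2/bootstapir.json", 49, 77)   # was 0.62
    # candidate = vis_ratio("runs/rubber2/tapnext_pp.json", 49, 77)   # gate: >= 0.80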

Decision matrix:

All 3 pass
  Action: flip the default to tapnext_pp in config.toml; keep BootsTAPIR available as a fallback.

2 of 3 pass (rubber2 wins, no regression elsewhere)
  Action: flip the default; document the failing case as one where BootsTAPIR remains known-good and allow a per-experiment override.

rubber2 fails OR pen regresses
  Action: keep the BootsTAPIR default; TAPNext++ stays opt-in. Investigate the root cause before the next attempt.

Done when: all three runs complete, metrics computed, decision recorded in the proof-of-work HTML.

P6 · Docs + proof-of-work · ~30 min

Risks & unknowns

R1 · The TAPNext PyTorch port may have a different forward signature than TAPIR (different point-query format, different head outputs).
  Mitigation: match the repo's reference inference loop verbatim; don't try to port it onto our existing inference shape.

R2 · The tapnet parent package __init__ drags in TF + JAX; the same problem we already hit with TAPIR.
  Mitigation: reuse the sys.modules['tapnet'] stub trick, already documented in CLAUDE.md (a sketch follows after this list).

R3 · TAPNext is recurrent → may be slower per frame than TAPIR's iterative refinement. The README claims it's faster end-to-end ("most capable, fastest, yet simplest"); we should still measure.
  Mitigation: measure in P4. If a clip takes 2× longer with no vis_ratio gain, that's a fail.

R4 · TAPNext++ checkpoint URL: the README references it, but the explicit download link may live on the TAPNext++ project page or in the Checkpoints section.
  Mitigation: if unclear, fall back to the plain TAPNext checkpoint (still better than BootsTAPIR for our failure mode); convert from JAX as a last resort.

R5 · License: tapnet is Apache-2.0 — compatible, but vendored code needs attribution.
  Mitigation: add LICENSE-tapnext + a per-file attribution header.

R6 · The visible flag might be derived differently in TAPNext vs TAPIR; mismatched flag semantics could break downstream filters.
  Mitigation: calibrate on pen run-19: ground-truth visibility is "bob in frame, not occluded by hand"; TAPNext's flag should agree on at least 95% of frames there.
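Sketch of the R2 stub trick referenced above (the same idea CLAUDE.md documents for TAPIR): register a bare tapnet package so submodule imports never execute the real tapnet/__init__.py. The vendor path and submodule name are assumptions to be adjusted to the layout produced in P1:

    # Sketch of the R2 mitigation.  Vendor path and submodule name are
    # assumptions; adjust to the vendored layout from P1.
    import sys
    import types
    from pathlib import Path

    def stub_tapnet(vendor_root: Path) -> None:
        stub = types.ModuleType("tapnet")
        # Giving the stub a __path__ makes the import machinery treat it as a
        # package and search vendor_root/tapnet/ for submodules, without ever
        # executing the real tapnet/__init__.py (which pulls in TF + JAX).
        stub.__path__ = [str(vendor_root / "tapnet")]
        sys.modules["tapnet"] = stub

    # usage, before importing the vendored model:
    # stub_tapnet(Path("trackers/tapnext_vendor"))
    # from tapnet.tapnext import tapnext_model   # hypothetical submodule path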

Out of scope


plan · 2026-04-27 · cradle pipeline
six phases · ~6 hr clock · P3 verification gates P4; P5 validation gates default-flip
ready to start P1 on go.