One algebra for batched tensor search: global graphs, local connectors, optimal transport, rollouts, and learned priors as a single Generate·Score·Reduce operator.




Parallel hardware is a design constraint, like dynamics, collision, or uncertainty.
That is the RoboARCH question: match each algorithm to the hardware it runs best on, from CPU SIMD and GPU to FPGA, MCU, edge, and cloud.
Ask: “what planner would I write if parallelism were the primitive?”
Make many similar subproblems visible at once, as one batched object.
Evaluate every candidate independently; keep geometry and logic, do not flatten into brute force.
End with one structured reduction: min, Sinkhorn, expectation, score, DP.
Three verbs, held for the whole talk: Generate · Score · Reduce.
Vectorize the kernels of sampling-based planning. 35 µs median, ~25 kHz on one CPU core (7-DoF Panda); FCIT* = first fully-connected a.s.-optimal.
Batched trajopt; GPU-parallel RRT-Connect. cuRobo: ~60× vs CPU, ~30 ms full pipeline. pRRTC (MotionBenchMaker Panda, mean planning time): 128× cuRobo, 3× slower than VAMP-RRTC, 1.4× slower than GTMP; cpRRTC up to 165× constrained.
Make the graph itself a batched object. GTMP: 1000 paths in 4.6 ms.
One design pressure across the field: expose more regular, batched, schedulable work. Benchmark of record: MotionBenchMaker.
$M$ layers, $N$ samples per layer, each a point in configuration space $\mathbb{R}^d$. Now turn this into a search.
The batched object is the Generate stage. Design a graph whose natural operations are batched kernels.
teal · tensor / batch axes sage · value / cost orchid · operator / reducer rose · this‑talk highlight
Three verbs. Each is a typed map; their composition is the planner.
$$\textbf{Planner}=\textcolor{#8a3f9c}{\textstyle\bigoplus}\,\circ\,\textcolor{#557a3a}{s}\,\circ\,\textcolor{#2f7d8a}{\mathcal{G}_\theta}.\qquad\text{\small Different planners are different semiring choices for one skeleton.}$$
Generate and Score are shared; a planner only chooses the reducer $\oplus$ over the semiring $(\mathbb{S},\oplus,\otimes)$. One compiled kernel, five planners, with a single temperature knob sliding between them.
Every $\oplus_\lambda$ is associative, so all five share one iteration space, one memory layout, and one $O(\log B)$ reduction tree. One compiled template instantiated by swapping the combine op.
The proof object and the compute object are the same: the semiring is both the algebra in the theorem and the reduction in the kernel. Invariant skeleton, method-specific arithmetic. Next: the per-method work and depth.
Embarrassingly-parallel Score gives $W=|B|\,\text{cost}(s)$ at fixed depth; an associative $\oplus$ folds it in $O(\log B)$. Available parallelism $W/D$ sets the ceiling; realized speed is then capped by bandwidth and occupancy.
| Method | Semiring $(\oplus,\otimes)$ | Work $W$ | Depth $D$ |
|---|---|---|---|
| GTMP | $(\min,+)$ | $\Theta(MN^2)$ | $\Theta(M\log N)$ |
| MPOT | soft-min, $+$ | $\Theta(T\,nm)$ | $\Theta(T\log n)$ |
| CLOT | OT $\otimes$ STL min/max | $\Theta(R\,(Tnm{+}|\Phi|T))$ | $\Theta(R\log nm)$ |
| MTP | expectation | $\Theta(BH)$ | $\Theta(H{+}\log B)$ |
| MPD | score | $\Theta(KB\,c_\theta)$ | $\Theta(K\,d_\theta)$ |
Every method shares the depth signature $D=(\text{sequential scan})+O(\log B)$. The scan length is the number an architect should optimize.

One graph contains several qualitatively different route families, so diversity is a property of the data structure, not a post-processing step.
Single start/goal, complete adjacent-layer connectivity. Every adjacent-layer pair is an independent local subproblem.
A batched min-reduction over the next-layer axis: $J_m = C_m\boxtimes J_{m+1}$, tropical GEMV: the segment_min the accelerator runs.
Reach $\Lambda$: at budget $s$, with confidence $1-\tau$, the connector succeeds on local hops up to length $\ell$.
GTMP consumes, doesn't compete: each connector is the per-edge local Generate and Score, a pluggable oracle, while the tropical Reduce stays fixed.
VAMP‑RRTC · pRRTC · AORRTC · FCIT* · cuRobo · CHOMP · MPD


geometric rate, so almost-sure coverage as $K\to\infty$ (2nd Borel–Cantelli)
optimality by contracting support
Two schedules of the same reducer: diversity repeats generation, optimality enriches it.


anytime-GTMP charts: results from a manuscript under preparation.
500 parallel Panda instances in ~0.3 ms via JAX vmap on RTX 3090. On MotionBenchMaker feasibility time, GTMP sits with VAMP-RRTC at the top, ~1.4× faster than pRRTC.
GTMP selects one vertex. Relax the selection and the same DP becomes differentiable transport.
$\lambda\to 0$ recovers GTMP: the transport plan concentrates on the min-cost vertices, soft selection becomes hard selection.

Sinkhorn is matrix-scaling: the densest GPU-friendly inner loop.

Choosing $D_P$ is choosing the Generate stage; Sinkhorn does the Reduce.


Same reducer as GTMP, one temperature warmer. The batch axis is the trajectory; the polytope is the per-waypoint candidate set.

vmap substrate.Same Generate·Score·Reduce as GTMP, now replanning every step instead of building one graph.

$B$ control sequences of horizon $H$, scored by a black-box simulator in parallel.
Each component is a Gibbs measure $\pi_\bullet\propto e^{-J/\eta_\bullet}$ (its weights $p_{\eta_\bullet}=\nabla_v R_{\eta_\bullet}$), so the convex mix is again a valid distribution: an explicit exploit (cold $\eta_{\text{loc}}$) to explore (warm $\eta_{\text{glb}}$) knob $\beta$.
Both reducers are $\nabla R_\lambda$ from the temperature family: expectation is just soft-min, differentiated.



Same batch budget, same hardware. The reducer's temperature decides how much of the space the rollouts actually see. Demonstrated across in-hand cube, G1 whole-body, and crane.

GPU-parallel, gradient-free Sinkhorn steps over batches of system-wide trajectories.
One robot, one transport cost. Now couple many robots under temporal logic.
Parse-tree decomposition: a batched min/max reduction over time slices, tropical again.
Task robustness + obstacle penalty + inter-robot penalty, all dense per timestep, summed over robots.
Algebraic connectivity $\lambda_2$ of the inter-robot Laplacian $L(t)$: a smooth, batchable surrogate for the Boolean connected predicate. $\lambda_2\ge\varepsilon\Rightarrow\varepsilon$-robustly connected.
The Sinkhorn step is untouched. The cost surface gets richer; the update stays a dense regular kernel.
CLOT is GSR nested three deep. Each level is its own Generate·Score·Reduce; the semirings compose without changing the kernel schedule.
$\nu$ indexes both the discrete planning sequence and the continuous candidate-trajectory set, so the search is over both at once.
Coverage minus accumulated cost plus back-propagated feedback $\psi(\nu)$; the weight $\eta\in(0,1)$ scales cost onto the coverage scale.
| Method | Success N=12 | Time N=12 |
|---|---|---|
| CLOT (ours) | 1.00 | 20.3 s |
| MPOT | 0.10 | 26.5 s |
| FSOT | 0.70 | 19.8 s |
| CONS | 0.00 | n/a |
| MICP / NLP / CBS | 0.00 | n/a |




STL goal: $\varphi = \textcolor{#8a3f9c}{G}_{\textcolor{#2f7d8a}{[8,15]}}\,\textcolor{#557a3a}{\mu_{B}}$. Always between $t=8$ and $t=15\,\text{s}$, maintain the linear formation.

The per-step $-\nabla_z J$ is the score $\nabla_z\log p(z)$ of the Boltzmann target $p\propto e^{-J}$ (the normalizer drops out), the score end of the temperature continuum.





One Generate·Score·Reduce skeleton; each planner exposes the parallel work its hardware runs best.
Generate many · Score in batch · Reduce with structure
It is a way to design algorithms whose bottlenecks are visible, measurable, and schedulable.
The same plan flows across three schedulers: the right allocation of algorithm to computing hardware.
Skip most edges while proving homotopy-critical edges survive.
If the graph contains many route families, enumerate them deliberately.
Adapt $(M,N,s)$ online from measured $\bigl(q_s,\Lambda_s\bigr)$ profiles.
Should planners ship as XLA / JAX programs? Benchmark homotopy coverage and compute efficiency, not only first-solution time.
These are algorithm, architecture, and systems questions, not only motion-planning questions.
The next planning algorithms will be judged by the workloads they expose, not only the paths they return.
the math is too beautiful to ignore.
| Method | Generate · batch axes | Reduce · semiring | Operator | $W$ | $D$ |
|---|---|---|---|---|---|
| GTMP | graph $(M,N)$ | min-plus | $J_m(i)=\min_j[C_m(i,j)+J_{m+1}(j)]$ | $\Theta(MN^2)$ | $\Theta(M\log N)$ |
| MPOT | waypoint $\times D_P$ | entropic | $W^\star_\lambda=\arg\min\langle W,C\rangle-\lambda H(W)$ | $\Theta(Tnm)$ | $\Theta(T\log n)$ |
| CLOT | robots $\times$ wp $\times D_P$ | OT $\otimes$ STL | Sinkhorn inner $\oplus$ tropical STL | $\Theta(R(Tnm{+}|\Phi|T))$ | $\Theta(R\log nm)$ |
| MTP | rollouts $\mathbb{R}^{B\times H\times m}$ | expectation | $\pi^{k+1}=(1{-}\beta)\pi_{\text{loc}}+\beta\pi_{\text{glb}}$ | $\Theta(BH)$ | $\Theta(H{+}\log B)$ |
| MPD | latents $(B,H)$ | score | $z_{k-1}=D_\theta(z_k,c)-\eta\nabla_z J(B_\psi z_k)$ | $\Theta(KBc_\theta)$ | $\Theta(Kd_\theta)$ |
All $\oplus$ are associative $\Rightarrow$ all share the $+O(\log B)$ reduction tree; the prefactor is the per-method sequential scan.
Legendre dual: $R_\lambda(v)=\min_{p\in\Delta}\langle p,v\rangle-\lambda H(p)$, so “Sinkhorn = soft-min” is an identity, and $\lambda\to0$ is item (i).
Limit = min (GTMP) · value = soft-min (MPOT/CLOT) · gradient = expectation (MTP) · log-partition gradient = score (MPD).
Benchmark of record: MotionBenchMaker (Chamzas et al., 2022).