Invited talk · RoboARCH @ ICRA 2026 · Vienna · 5 June 2026

Designing Planning Algorithms
for the Era of Parallelism

One idea: the vectorizable planning kernel. Build the planner so its core data structure is a graph that GPUs store and compute natively. Then one Generate·Score·Reduce algebra generalizes it across the whole family.

MPD diffusion motion planning, Panda in a warehouse

MPOT batched trajectory optimization on a TIAGo mobile manipulator

MTP in-hand cube rotation via batched rollouts

MPD diffusion sampling many 2D trajectories at once

One vectorizable kernel for five planners, generalized by Generate · Score · Reduce.

Speaker

An Thai Le

Assistant Professor · Director of Foundation AI

Affiliations

VinUniversity · TU Darmstadt (visiting) · VinRobotics

Contact

anindex.github.io

Positioning

The hardware stopped waiting for the algorithm.

50 years of microprocessor trend data: single-thread performance and clock frequency plateau around 2005 while transistor count and number of logical cores keep rising

50 years of microprocessor trend data. K. Rupp, CC BY 4.0; original data to 2010 by Horowitz, Labonte, Shacham, Olukotun, Hammond, and Batten. Single-thread performance plateaued around 2005; the gains since are parallel.

Parallel hardware is a design constraint, like dynamics, collision, or uncertainty.

In the spirit of Sutton, The Bitter Lesson (2019) & Hooker, The Hardware Lottery (CACM 2021): methods that ride general, scalable compute win.

What changed

Performance now comes from more cores, not faster ones. GPUs and SIMD CPUs (one instruction, many data lanes) are now commodity for robotics.
ML stacks (JAX · XLA · MuJoCo XLA · Triton) made dense, batched compute the default.
Classical sampling-based planners (OMPL, the MoveIt default) are serial pointer-chasing, so most of this hardware sits idle.

batch · expose many subproblems at once

dense · regular array math, not pointer chasing

local · independent work on each edge

A vectorizable planning kernel exposes all three. Its core data structure is a tensor: a dense, regular array of numbers that GPUs store and compute directly. That is this talk's answer to the RoboARCH question of matching the algorithm to the hardware.

Core thesis

The usual question is “how do I parallelize my planner?”

A more useful one: “what planner would I write if parallelism were the primitive?”

1 · Generate

Make many similar subproblems visible at once, as one batched object.

2 · Score

Evaluate every candidate independently. Keep the geometry and logic; do not flatten into brute force.

3 · Reduce

End with one structured reduction: min, Sinkhorn, expectation, score, DP.

The design move: make the algorithm itself vectorizable. The algebra that generalizes it, Generate · Score · Reduce, comes after we build it concretely.

Current landscape

Two ways to use parallel hardware: vectorize the kernels, or vectorize the algorithm.

CPU · SIMD

VAMP · FCIT*

Thomason, Kingston, Kavraki · Wilson et al. (FCIT*)
ICRA 2024 · 2025

Parallelize the kernels inside an existing planner. VAMP SIMD-vectorizes collision checking; FCIT* evaluates all pairwise edges cheaply to drop nearest-neighbor queries. VAMP: 35 µs median, ~25 kHz on one core (7-DoF Panda). FCIT*: first fully-connected a.s.-optimal planner.

GPU · SIMT

cuRobo · pRRTC · cpRRTC

Sundaralingam et al. · Huang, Jadhav, Plancher, Kingston · Hu et al. (cpRRTC)
2023–2025

Parallelize the kernels inside an existing planner. pRRTC / cpRRTC do SIMT collision checking and parallel tree expansion for RRT-Connect. cuRobo ~60× vs CPU; pRRTC ~1.4× slower than GTMP, ~3× slower than VAMP-RRTC; cpRRTC up to 165× constrained.

Vectorizable kernel · this talk

GTMP · MPOT · CLOT · MTP · MPD

Le · Carvalho · Zhang & Guo (PKU) · Nguyen · Peters
RA-L · NeurIPS · ICRA · ICLR · T-RO · 2023–2026

Make the planning algorithm itself the vectorizable object. The graph is a tensor, so search becomes native GPU math. GTMP: 1000 paths in 4.6 ms.

Family lineage · 2017–2026

OMPL · Kavraki & Moll · 2008 → 2.0 (SIMD/GPU)
VAMP · microseconds · Kavraki · 2024
FCIT* · fully-connected a.s.-optimal · 2025
cpRRTC · constrained · RTX 5090 · 2025
MPD · Carvalho, A. Le et al. · T-RO 2025

Both directions are winning, and they meet. Recent work vectorizes the kernels; this talk designs the algorithm to be vectorizable, then plugs those kernels in as connectors. Underneath sits batched, differentiable simulation (Isaac Gym, MuJoCo MJX/XLA, Brax); the benchmark of record is MotionBenchMaker.

RA-L 2025 · GTMP · concrete story (1/6)

Global search becomes a layered random multipartite graph.

The idea: the graph is a tensor

Sample N points in each of M layers.
Evaluate adjacent-layer edges in batch.
Run finite-horizon value iteration on the layered DAG.

One graph holds several different route families (paths you can't bend into one another without crossing an obstacle). So diversity is built into the data structure, not added in a later step. Swap the straight edge for a smooth Akima connector and the same kernel returns smooth plans, at no extra cost.

Le et al., Global Tensor Motion Planning, IEEE RA-L 2025 (arXiv:2411.19393).

Color legend: teal = tensor / batch axes · sage = value / cost · orchid = operator / reducer · rose = this-talk highlight

GTMP = tropical GSR · story (2/6)

The Bellman backward pass is a tropical (min-plus) matrix–vector product.

Edge count

$$|E|=2\textcolor{#2f7d8a}{N}+(\textcolor{#2f7d8a}{M}-1)\,\textcolor{#2f7d8a}{N}^{2}=\Theta(\textcolor{#2f7d8a}{MN^{2}})$$

Single start/goal, complete adjacent-layer connectivity. Every adjacent-layer pair is an independent local subproblem.

Backward DP · the Reduce step

$$J_m(u)=\textcolor{#8a3f9c}{\min_{v\in V_{m+1}}}\bigl[\,\textcolor{#557a3a}{c(u,v)}+\textcolor{#557a3a}{J_{m+1}(v)}\,\bigr]$$

A batched min-reduction over the next-layer axis: $J_m = C_m\boxtimes J_{m+1}$. In tropical (min-plus) algebra, “add” means take the minimum and “multiply” means add costs. So it runs like a dense matrix–vector product, the regular work a GPU is fastest at.
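A minimal NumPy sketch of this backward pass, checked against brute-force path enumeration on a tiny graph. The function names, the random costs, and the packing of start and goal edges into separate vectors are illustrative assumptions, not GTMP's actual code.

```python
import itertools
import numpy as np


def minplus_matvec(C, J):
    """Tropical (min-plus) product: out[i] = min_j (C[i, j] + J[j])."""
    return np.min(C + J[None, :], axis=1)


def backward_values(C_start, C_layers, C_goal):
    """Finite-horizon value iteration on a layered graph.

    C_start:  (N,)        cost from the start to each vertex of layer 0
    C_layers: (M-1, N, N) cost between adjacent layers m and m+1
    C_goal:   (N,)        cost from each vertex of the last layer to the goal
    Returns the optimal start-to-goal cost.
    """
    J = C_goal.copy()                      # values on the last layer
    for C_m in C_layers[::-1]:             # sequential scan over layers
        J = minplus_matvec(C_m, J)         # batched min over the next-layer axis
    return np.min(C_start + J)


rng = np.random.default_rng(0)
M, N = 4, 3                                # illustrative sizes
C_start = rng.uniform(1.0, 2.0, size=N)
C_layers = rng.uniform(1.0, 2.0, size=(M - 1, N, N))
C_goal = rng.uniform(1.0, 2.0, size=N)

best_dp = backward_values(C_start, C_layers, C_goal)

# Brute force over all N**M layer-by-layer paths, for checking only.
best_brute = min(
    C_start[p[0]]
    + sum(C_layers[m, p[m], p[m + 1]] for m in range(M - 1))
    + C_goal[p[-1]]
    for p in itertools.product(range(N), repeat=M)
)
```

The layer loop is the only sequential part; each `minplus_matvec` call is one dense broadcast plus a min over an axis.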

Connectorized GTMP · story (3/6)

The edge is a pluggable connector, including a smooth Akima edge that stays in-kernel.

$$\textcolor{#8a3f9c}{\mathcal{LC}}(x_a,x_b;\textcolor{#2f7d8a}{s})\;\to\;\{\,\Gamma_{a,b}\ \text{or fail}\,\}$$

$$\textcolor{#557a3a}{\Lambda_{\textcolor{#2f7d8a}{s}}(\tau)}=\sup\bigl\{\,\ell:\ \textcolor{#557a3a}{q_{\textcolor{#2f7d8a}{s}}(\ell)}\ge 1-\tau\,\bigr\}$$

Reach $\Lambda_s(\tau)$: the longest local hop length $\ell$ at which the connector, run at budget $s$, still succeeds with probability $q_s(\ell)\ge 1-\tau$.
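One way to read the reach definition is as a lookup on a measured success profile. The sketch below assumes the success rate has been tabulated on a grid of hop lengths for one fixed budget; the function name and all numbers are made up for illustration.

```python
import numpy as np


def reach(lengths, success_rate, tau):
    """Empirical reach: largest tabulated hop length with success rate >= 1 - tau.

    lengths:      (L,) hop lengths
    success_rate: (L,) estimated success probability at those lengths, fixed budget
    Returns 0.0 if no tabulated length meets the confidence level.
    """
    ok = success_rate >= 1.0 - tau
    return float(lengths[ok].max()) if ok.any() else 0.0


# Hypothetical profile for one connector at one budget (made-up numbers).
lengths = np.array([0.1, 0.2, 0.4, 0.8, 1.6])
q_s = np.array([1.00, 0.995, 0.97, 0.70, 0.20])

reach_95 = reach(lengths, q_s, tau=0.05)   # needs success rate >= 0.95
reach_99 = reach(lengths, q_s, tau=0.01)   # needs success rate >= 0.99
```

A stricter confidence level gives a shorter reach, which is the trade a layer spacing would be tuned against.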

This builds on OMPL; it does not replace it. Eighteen years of sampling-based engineering (OMPL since 2008, 40+ planners, the MoveIt default, Kavraki & Moll) becomes the per-edge Generate+Score, while the tropical Reduce stays fixed. GTMP consumes connectors; it does not compete with them.
VAMP‑RRTC · pRRTC · AORRTC · FCIT* · cuRobo · CHOMP · MPD

The field is converging: OMPL 2.0 plans to add SIMD/GPU acceleration (VAMP). The vectorizable graph is where the connectors plug in.

Connectors: VAMP-RRTC (Kavraki, ICRA 2024), FCIT* (ICRA 2025) · pRRTC (arXiv:2503.06757) · cuRobo (NVIDIA) · CHOMP · MPD. OMPL: Kavraki & Moll, Rice, since 2008.

batched box-sweep plans

tabletop, Akima splines

GTMP-Akima · the design payoff · story (4/6)

Smooth planning, no extra solve: swap the straight edge for an Akima spline.

GTMP layered multipartite graph with smooth Akima edges: start s, layers t1 to t3, goal g

The layered graph, now with smooth Akima edges (s → t₁…t₃ → g)

A batch of smooth GTMP-Akima paths, planned at once

Akima splines are local: each segment uses only its neighbors. So there is no global linear solve to serialize, unlike a natural cubic spline, which couples every segment. They stay C¹ and overshoot-free, so each edge stays an independent subproblem, and a smooth plan falls out of the same Reduce at no extra cost.
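The locality claim can be checked directly. This sketch computes the standard Akima tangent at interior knots only (end-point rules are omitted) and shows that perturbing one far-away knot leaves the other tangents untouched. It is a 1-D illustration under those assumptions, not the GTMP-Akima connector itself.

```python
import numpy as np


def akima_interior_slopes(x, y):
    """Akima tangent at interior knots i = 2 .. n-3 (end rules omitted).

    Each tangent uses only the four secant slopes around knot i, i.e. the
    five knots i-2 .. i+2, so no linear system couples the segments.
    """
    m = np.diff(y) / np.diff(x)            # secant slopes; m[i] spans knots i, i+1
    w1 = np.abs(m[3:] - m[2:-1])           # |m[i+1] - m[i]|
    w2 = np.abs(m[1:-2] - m[:-3])          # |m[i-1] - m[i-2]|
    num = w1 * m[1:-2] + w2 * m[2:-1]
    den = w1 + w2
    mean = 0.5 * (m[1:-2] + m[2:-1])       # fallback when both weights vanish
    safe = np.where(den > 0, den, 1.0)     # avoid dividing by zero
    return np.where(den > 0, num / safe, mean)


x = np.arange(10.0)
y = np.sin(x)
s1 = akima_interior_slopes(x, y)           # tangents at knots 2 .. 7

y2 = y.copy()
y2[-1] += 5.0                              # move only the last knot
s2 = akima_interior_slopes(x, y2)          # only the tangent at knot 7 can change
```

A natural cubic spline would instead need a tridiagonal solve over all knots, so the same perturbation would move every tangent.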

Anytime GTMP · story (5/6)

Anytime should not conflate diversity and optimality.

Mode RR · diversity

Fix $(M,\,N,\,s)$; draw independent graph realizations.
Top-$B$ class-indexed elite archive.
Diversity = coverage of distinct route families: paths you cannot bend into one another without hitting an obstacle (distinct homotopy classes).

$$\Pr[\,h\ \text{uncovered after}\ K\ \text{draws}\,]\le(1-p_h)^{K}\to 0,\ \ p_h>0$$

geometric rate, so almost-sure coverage as $K\to\infty$ (2nd Borel–Cantelli)
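The bound turns into a draw budget by solving $(1-p_h)^K\le\delta$ for $K$. A small sketch, where the per-draw hit probability of 5% and the miss tolerance of 1% are hypothetical numbers:

```python
import math


def draws_for_coverage(p_h, delta):
    """Smallest K with (1 - p_h)**K <= delta, for 0 < p_h < 1 and 0 < delta < 1."""
    return math.ceil(math.log(delta) / math.log(1.0 - p_h))


# Hypothetical: a route family that one graph realization hits 5% of the time.
K = draws_for_coverage(p_h=0.05, delta=0.01)
miss_bound = (1.0 - 0.05) ** K             # probability the family is still uncovered
```

Since the draws are independent, the $K$ realizations can be generated as one extra batch axis rather than a loop.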

Mode AO · cost

Grow $M_\nu,\,N_\nu \to \infty$; keep $s$ fixed.
Informed samples inside the ellipsoid that shrinks around the best path so far.
Reuse + prune edges; AO$^\star$-style convergence to $c^\star$.

optimality by contracting support

Two schedules of the same reducer: diversity repeats generation, optimality enriches it.

GTMP results · story (6/6)

Competitive on feasibility, with batch throughput few methods reach.

AOGTMP path-cost vs AIT*, EIT*, FCIT*, AORRTC

anytime path cost vs AIT* · EIT* · FCIT* · AORRTC

success rate vs wall-clock budget

anytime-GTMP charts: results from a manuscript under preparation.

4.6 ms

1000 paths in one batch (RTX 3090, warm JIT), ~250× less wall-clock than sequential baselines, $89\%$ collision-free, full value-iteration optimality.

500 parallel Panda instances in ~0.3 ms via JAX vmap on RTX 3090.

The object · Generate

The planner's data structure is a graph that parallel hardware stores and computes, not only a path generator.

$$\textcolor{#2f7d8a}{\mathbf{Q}}\in\mathbb{R}^{\textcolor{#2f7d8a}{M}\times \textcolor{#2f7d8a}{N}\times d}\qquad\text{a batched tensor of candidate vertices}$$

A 3-D array of candidate vertices: $M$ layers, $N$ samples per layer, each a point in $\mathbb{R}^d$. The graph is laid out as a tensor that GPUs read directly, so search becomes a native operation.

$$\textcolor{#2f7d8a}{\mathbf{Q}}\in\mathbb{R}^{\textcolor{#2f7d8a}{M}\times \textcolor{#2f7d8a}{N}\times d}\quad\text{vertex tensor}$$

$$\textcolor{#2f7d8a}{\mathbf{C}}\in\mathbb{R}^{\textcolor{#2f7d8a}{M}\times \textcolor{#2f7d8a}{N}\times \textcolor{#2f7d8a}{N}}\quad\text{edge-cost tensor}$$

$$\textcolor{#557a3a}{J_m(i)}=\textcolor{#8a3f9c}{\min_{j}}\bigl[\,C_m(i,j)+\textcolor{#557a3a}{J_{m+1}(j)}\bigr]\quad\text{Bellman}$$

What this buys you

Many candidate futures in one object, so diversity comes built in.
Local work distributed across SIMT lanes.
Global decisions recovered by reductions (DP, Sinkhorn, prune).
Theorems and kernels describe the same object.

This batched object is a vectorizable planning kernel: a graph whose natural operations are GPU kernels. Building it is the Generate stage.
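To make the layout concrete, here is a small NumPy sketch that builds the vertex and edge-cost tensors and extracts one path with an argmin-tracking backward pass. It assumes a free unit square, straight-line edges, and adjacent-layer costs stored as `(M-1, N, N)` with start and goal edges kept in separate vectors; those packing choices and all sizes are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, d = 5, 64, 2                         # illustrative sizes
start, goal = np.zeros(d), np.ones(d)

# Generate: one dense vertex tensor, M layers of N samples in [0, 1]^d.
Q = rng.uniform(0.0, 1.0, size=(M, N, d))

# Score: all adjacent-layer straight-edge lengths in one broadcast.
diff = Q[:-1, :, None, :] - Q[1:, None, :, :]        # (M-1, N, N, d)
C = np.linalg.norm(diff, axis=-1)                    # (M-1, N, N)
c_start = np.linalg.norm(Q[0] - start, axis=-1)      # (N,)
c_goal = np.linalg.norm(Q[-1] - goal, axis=-1)       # (N,)
# A real planner would set C[m, i, j] = np.inf for edges in collision;
# this sketch assumes free space.

# Reduce: backward min-plus pass, keeping the argmin for path extraction.
J = c_goal
succ = np.zeros((M - 1, N), dtype=int)
for m in range(M - 2, -1, -1):
    total = C[m] + J[None, :]                        # (N, N)
    succ[m] = np.argmin(total, axis=1)
    J = np.min(total, axis=1)

i = int(np.argmin(c_start + J))
best_cost = float(c_start[i] + J[i])
path_idx = [i]
for m in range(M - 1):
    path_idx.append(int(succ[m, path_idx[-1]]))
path = np.vstack([start, Q[np.arange(M), path_idx], goal])   # (M+2, d)
```

Everything before the layer loop is a single batched array expression, which is the part a GPU backend would fuse.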

The operator

Every planner here is Generate → Score → Reduce.

Generate

→

Score

→

Reduce

Three verbs. Each is a typed map; their composition is the planner.

Generate

$$\textcolor{#2f7d8a}{X}=\mathcal{G}_\theta(\xi)\in\mathbb{R}^{\textcolor{#2f7d8a}{\mathbf{B}}\times d}$$

Structured sampler. Batch axes $\mathbf{B}$ are the independent subproblems exposed at once.

→

Score

$$\textcolor{#557a3a}{S}=s(\textcolor{#2f7d8a}{X})\in\mathbb{R}^{\textcolor{#2f7d8a}{\mathbf{B}}}$$

Elementwise, communication-free: $\partial S_a/\partial X_b=0$ for $a\neq b$. Collision, dynamics, logic live here.

→

Reduce

$$\textcolor{#8a3f9c}{y}=\textcolor{#8a3f9c}{\textstyle\bigoplus_{a\in\mathcal{A}}}\,\textcolor{#557a3a}{S_a}$$

The only cross-candidate stage: a structured reduction over a semiring $(\mathbb{S},\oplus,\otimes)$.

$$\textbf{Planner}=\textcolor{#8a3f9c}{\textstyle\bigoplus}\,\circ\,\textcolor{#557a3a}{s}\,\circ\,\textcolor{#2f7d8a}{\mathcal{G}_\theta}.\qquad\small\text{Different planners are different semiring choices for one skeleton.}$$

One operator

Five planners, one kernel: they differ only in the reduction.

$$\textcolor{#8a3f9c}{y}=\textcolor{#8a3f9c}{\textstyle\bigoplus_{a\in\mathcal{A}}}\,\textcolor{#557a3a}{s(\textcolor{#2f7d8a}{X})},\qquad \textcolor{#8a3f9c}{\oplus}\in\Bigl\{\ \min,\ \ -\lambda\log\!\textstyle\sum e^{-(\cdot)/\lambda},\ \ \mathbb{E}_{p_\eta},\ \ \nabla\!\log\!\textstyle\int e^{-J}\ \Bigr\}$$

$\min$

tropical

GTMP

$-\lambda\log\sum e^{-\cdot/\lambda}$

entropic OT

MPOT · CLOT

$\mathbb{E}_{p_\eta}[\cdot]$

expectation

MTP

$\nabla\log\int e^{-J}$

score

MPD

Generate and Score are shared. A planner only picks the reducer $\oplus$ over the semiring $(\mathbb{S},\oplus,\otimes)$ (an algebra with one combine op $\oplus$ and one chain op $\otimes$, in effect a swappable arithmetic). One compiled kernel spans all five, with a single temperature knob $\lambda$ sliding between them.
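A toy sketch of the "swap the reducer" idea: one `plan` skeleton, with a hard min and a soft-min at two temperatures plugged in as the Reduce stage. The 1-D candidates and quadratic cost are hypothetical stand-ins, and the soft-min subtracts the minimum before exponentiating for numerical stability.

```python
import numpy as np


def soft_min(v, lam):
    """-lam * log(sum_a exp(-v_a / lam)), shifted by min(v) for stability."""
    v_min = np.min(v)
    return v_min - lam * np.log(np.sum(np.exp(-(v - v_min) / lam)))


def gibbs_weights(v, lam):
    """Weights proportional to exp(-v_a / lam); they sum to one."""
    w = np.exp(-(v - np.min(v)) / lam)
    return w / np.sum(w)


def plan(generate, score, reduce):
    """The skeleton: Reduce after Score after Generate."""
    return reduce(score(generate()))


def generate():
    """Hypothetical Generate: 11 candidates on a 1-D grid."""
    return np.linspace(0.0, 1.0, 11)


def score(X):
    """Hypothetical Score: squared distance to 0.3, elementwise."""
    return (X - 0.3) ** 2


hard = plan(generate, score, np.min)
cold = plan(generate, score, lambda S: soft_min(S, lam=1e-4))
warm = plan(generate, score, lambda S: soft_min(S, lam=1.0))
```

At low temperature the soft-min agrees with the hard min to many digits; at high temperature it drops below it by at most $\lambda\log B$.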

Why it matters

One kernel schedule for all five planners.

Reduction-invariance

$$\mathbf{B},\,s\ \text{fixed};\quad \text{only}\ \textcolor{#8a3f9c}{\bigoplus_\lambda}\ \text{varies with}\ \lambda.$$

Every $\oplus_\lambda$ is associative, so all five share one iteration space, one memory layout, and one $O(\log B)$ reduction tree. One compiled template instantiated by swapping the combine op.

What actually differs

min is one comparator; soft-min spends $\exp/\log$ on the SFU (with a max-shift for stability).
Register pressure and occupancy differ, so “one template” $\ne$ identical wall-clock.
$\min$ returns an index (argmin path); soft-min returns a distribution.

The proof object and the compute object are the same: the semiring is both the algebra in the theorem and the reduction in the kernel. Invariant skeleton, method-specific arithmetic. Next: the per-method work and depth.

Algebra becomes schedule

Associativity buys an $O(\log B)$ reduction tree. That is the architecture payload.

Work–depth · Brent's theorem

$$T_p\ \le\ \tfrac{W}{p}+D,\qquad D_{\text{GSR}}=\underbrace{\text{depth}(s)}_{\text{scan}}+\underbrace{O(\log B)}_{\text{tree}}$$

Embarrassingly-parallel Score gives $W=|B|\,\text{cost}(s)$ at fixed depth; an associative $\oplus$ folds it in $O(\log B)$. Available parallelism $W/D$ sets the ceiling; realized speed is then capped by bandwidth and occupancy.
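A plain-Python sketch of the reduction tree, counting parallel levels, plus Brent's bound as a one-liner. It runs sequentially here; the point is only that each level's pairs are independent. The scores and processor count are made up.

```python
import math
import operator


def tree_reduce(values, op):
    """Pairwise reduction; returns (result, number of parallel levels).

    Valid for any associative op. Every pair within a level is independent,
    so a level costs one parallel step given enough lanes.
    """
    level = list(values)
    depth = 0
    while len(level) > 1:
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:                 # odd element passes through unchanged
            nxt.append(level[-1])
        level = nxt
        depth += 1
    return level[0], depth


def brent_bound(work, depth, p):
    """Brent's upper bound on time with p processors: W / p + D."""
    return work / p + depth


B = 1000
costs = [((7 * a) % 101) + 0.5 for a in range(B)]    # made-up scores
best, levels = tree_reduce(costs, min)
total, _ = tree_reduce(costs, operator.add)
t_32 = brent_bound(work=B - 1, depth=levels, p=32)   # B - 1 combine ops in total
```

Swapping `min` for `operator.add` changes the result but not the number of levels, which is the schedule-invariance the slide refers to.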

Method	Semiring $(\oplus,\otimes)$	Work $W$	Depth $D$
GTMP	$(\min,+)$	$\Theta(MN^2)$	$\Theta(M\log N)$
MPOT	soft-min, $+$	$\Theta(T\,nm)$	$\Theta(T\log n)$
CLOT	OT $\otimes$ STL min/max	$\Theta(R\,(Tnm{+}\|\Phi\|T))$	$\Theta(R\log nm)$
MTP	expectation	$\Theta(BH)$	$\Theta(H{+}\log B)$
MPD	score	$\Theta(KB\,c_\theta)$	$\Theta(K\,d_\theta)$

$M$ layers · $N$ samples · $n$ waypoints · $m=|D_P|$ probes · $H$ horizon · $K$ steps · $T$ iters · $R$ robots

Every method shares the depth signature $D=(\text{sequential scan})+O(\log B)$. The scan length is the number an architect should optimize.

NeurIPS 2023 · MPOT · story (1/3)

Warm the temperature: swap GTMP's hard min for a soft-min.

$$\textcolor{#8a3f9c}{\min_{j}}\bigl[\,C(i,j)+J_j\,\bigr]$$

GTMP selects one vertex. Relax the selection and the same DP becomes differentiable transport.

Sinkhorn Step: the soft-min relaxation of the tropical DP.

$$\textcolor{#8a3f9c}{W^{\star}_{\lambda}}=\arg\min_{W\in\mathcal{U}(n,m)}\;\textcolor{#557a3a}{\langle W,C\rangle}-\lambda\,H(W)$$

$$\textcolor{#2f7d8a}{Z^{k+1}}=\textcolor{#2f7d8a}{Z^{k}}+\alpha_k\,\underbrace{\mathrm{diag}(W^{\star}_{\lambda}\mathbf{1})^{-1}}_{\text{barycentric avg}}\,\textcolor{#8a3f9c}{W^{\star}_{\lambda}}\,\textcolor{#2f7d8a}{D_{P}}$$

Optimal transport (OT) = the cheapest plan to move mass from one set of points to another; entropic OT = a soft, temperature-blurred version; Sinkhorn = repeated row/column rescaling of a matrix that solves it. $\lambda\to 0$ recovers GTMP: the transport plan concentrates on the min-cost vertices, so soft selection becomes hard selection.

batch smooth trajopt via entropic OT · Le et al. (NeurIPS 2023); Sinkhorn after Cuturi 2013

Sinkhorn is matrix-scaling: the densest GPU-friendly inner loop.

MPOT · the search set · story (2/3)

Gradient-free because the search set is a polytope.

each waypoint probes a local polytope; entropic OT routes transport mass toward low-cost step points

The inner loop: matrix scaling

$$\textcolor{#8a3f9c}{K}=\exp\!\bigl(-\textcolor{#557a3a}{C}/\lambda\bigr)$$

$$\textcolor{#2f7d8a}{u}\leftarrow\frac{\textcolor{#2f7d8a}{a}}{\textcolor{#8a3f9c}{K}\textcolor{#2f7d8a}{v}},\qquad \textcolor{#2f7d8a}{v}\leftarrow\frac{\textcolor{#2f7d8a}{b}}{\textcolor{#8a3f9c}{K}^{\!\top}\textcolor{#2f7d8a}{u}}$$

$$\textcolor{#8a3f9c}{W^{\star}_{\lambda}}=\mathrm{diag}(\textcolor{#2f7d8a}{u})\,\textcolor{#8a3f9c}{K}\,\mathrm{diag}(\textcolor{#2f7d8a}{v})$$

Two matrix–vector products (GEMV) and two elementwise divides per iteration, shared across the whole batch. No gradients, no factorizations.

Choosing the polytope probes ($D_P\subset\mathbb{S}^{d-1}$, $m$ directions) is the Generate stage; this scaling is the Reduce.
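A minimal NumPy sketch of the scaling loop followed by one barycentric Sinkhorn Step. The cost (squared distance to the origin after a probe step), the uniform marginals, the evenly spaced 2-D probe directions, and all sizes are assumptions for illustration; a production version would typically work in the log domain for small $\lambda$.

```python
import numpy as np


def sinkhorn(C, a, b, lam, iters=200):
    """Entropic OT plan between marginals a (n,) and b (m,) for cost C (n, m)."""
    K = np.exp(-C / lam)                   # Gibbs kernel
    v = np.ones_like(b)
    for _ in range(iters):
        u = a / (K @ v)                    # row scaling:    one GEMV + divide
        v = b / (K.T @ u)                  # column scaling: one GEMV + divide
    return u[:, None] * K * v[None, :]     # diag(u) K diag(v)


rng = np.random.default_rng(0)
n, m, d = 8, 6, 2                          # waypoints, probe directions, dimension
Z = rng.uniform(-1.0, 1.0, size=(n, d))    # current waypoints (made-up)
angles = np.linspace(0.0, 2.0 * np.pi, m, endpoint=False)
D_P = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # (m, d) unit directions

# Hypothetical cost: squared distance to the origin after a probe step.
step = 0.1
C = np.sum((Z[:, None, :] + step * D_P[None, :, :]) ** 2, axis=-1)   # (n, m)

a = np.full(n, 1.0 / n)
b = np.full(m, 1.0 / m)
W = sinkhorn(C, a, b, lam=0.1)

# Sinkhorn Step: barycentric average of the probe directions per waypoint.
Z_next = Z + step * (W @ D_P) / W.sum(axis=1, keepdims=True)
```

Because each row of weights is a convex combination of unit directions, no waypoint moves farther than `step` in one update.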

MPOT results · story (3/3)

From a 2-D probe to whole-body mobile manipulation.

MPOT on a simulated TIAGo mobile manipulator

TIAGo (sim) · batched OT whole-body trajopt

space-time particle dynamics

10K

waypoints optimized in 1–2 s on an RTX 3080 Ti after JIT: the whole trajectory batch in one Sinkhorn loop.

Same reducer as GTMP, one temperature warmer. The batch axis is the trajectory; the polytope is the per-waypoint candidate set.

ICRA 2026 · CLOT · story (1/8)

From single-robot transport to multi-robot, temporal-logic-coupled transport.

CLOT architecture: a hybrid planning-order tree feeds per-robot parallelized local optimal transport (Sinkhorn iter 0 to 100) to produce candidate system-wide trajectories

CLOT architecture: a hybrid sequence-search tree (left) feeds per-robot parallelized local optimal transport (middle · Sinkhorn iter 0→100) to emit candidate system-wide trajectories under each planning order (right)

Multi-robot STL tasks

Signal Temporal Logic (STL): timed rules like “hold the formation between 8 and 15 s.” Here, that means collision avoidance, dynamic feasibility, relative formation, connectivity, and bounded-time goals built from the interval operators $G_I$ (always), $F_I$ (eventually), and $U_I$ (until).

How

GPU-parallel, gradient-free zero-order Sinkhorn Steps over batches of system-wide smooth trajectories.

Scale

A few seconds for small teams; tractable for 100+ robots.

CLOT: Multi-robot Motion Planning via Collaborative Optimal Transport under STL · Y. Zhang, Y. Zhang, A. T. Le, M. Guo (Meng Guo's group, Peking University) · ICRA 2026.

CLOT scales MPOT · story (2/8)

One Sinkhorn kernel, reused per robot: CLOT scales 5 → 15 → 30 → 100 robots.

5 simulated robots, STL-coupled transport

5 robots · STL-coupled transport

15 robots · +constraints

30 simulated robots, more constraints; each color is a distinct robot path

30 robots · more constraints

100 robots · scaling

$$\underbrace{\mathbb{R}^{\textcolor{#ff217d}{B}\times\textcolor{#2f7d8a}{n}\times \textcolor{#2f7d8a}{m}}}_{B\approx10^{3}\ \text{trajectories batched / robot}}\ \times\ \underbrace{\textcolor{#ff217d}{R}\ \text{robots}}_{\text{planned sequentially, search-ordered}}\qquad\small\text{same kernel }K_{ij}=e^{-C_{ij}/\lambda}$$

CLOT reuses MPOT's per-robot Sinkhorn batch unchanged, then sequences robots in a dependency order via hybrid search. The kernel never changes; the parallel batch is the ~10³ trajectories per robot, not the robots themselves. All-simulation scaling study (each trajectory color is a distinct robot's path); the 3-UAV result (story 8/8) is the hardware validation.

CLOT core · what is new vs. MPOT · story (3/8)

Start from the MPOT cost …

$$\textcolor{#557a3a}{c}(\textcolor{#2f7d8a}{Z})=\textcolor{#557a3a}{\langle W,C\rangle}$$

One robot, one transport cost. Now couple many robots under temporal logic.

CLOT core · what is new vs. MPOT · story (4/8)

STL robustness, collective cost, connectivity: all batchable.

$$\textcolor{#557a3a}{c_{\text{tsk}}}(\textcolor{#2f7d8a}{Z(T)})=-\textcolor{#557a3a}{\rho^{\phi}(Z,0)}+\textcolor{#8a3f9c}{\textstyle\sum_{t,i}} \textcolor{#557a3a}{g_{\text{obs}}}+\textcolor{#8a3f9c}{\textstyle\sum_{t,\,i<j}} \textcolor{#557a3a}{g_{\text{int}}}$$

$$\underbrace{\textcolor{#2f7d8a}{Z_i}\in\mathbb{R}^{\textcolor{#2f7d8a}{n\times m}}}_{\text{per-robot solve (unchanged)}}\ \xrightarrow{\ \text{richer cost }+\ \text{sequence }\textcolor{#ff217d}{R}\text{ robots}\ }\ \underbrace{\Gamma=\{\textcolor{#2f7d8a}{Z_1,\dots,Z_R}\}}_{\text{CLOT collective trajectory}},\qquad K_{ij}=e^{-C_{ij}/\lambda}\ \text{unchanged}$$

STL robustness (how strongly the spec is satisfied)

$$\textcolor{#557a3a}{\rho^{G_{[a,b]}\phi}(x,t)}=\textcolor{#8a3f9c}{\min_{t'\in[t+a,\,t+b]}}\textcolor{#557a3a}{\rho^{\phi}(x,t')}$$

Parse-tree decomposition: a batched min/max reduction over time slices, tropical again.
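On a uniform time grid, the "always" robustness is a sliding-window min, and "eventually" is the same window with max. A sketch with leading batch axes, assuming window bounds are given in time steps; the signals are made-up numbers.

```python
import numpy as np


def always(rho, a, b):
    """Robustness of G_[a,b] phi on a uniform time grid.

    rho: (..., T) robustness of phi at each step; leading axes are batch axes.
    a, b: window bounds in steps, 0 <= a <= b.
    Returns (..., T - b): entry t is the min of rho over steps t+a .. t+b.
    """
    T = rho.shape[-1]
    windows = np.stack([rho[..., k : T - b + k] for k in range(a, b + 1)], axis=-1)
    return windows.min(axis=-1)


def eventually(rho, a, b):
    """Robustness of F_[a,b] phi: the same window with max instead of min."""
    return -always(-rho, a, b)


# Two made-up robustness signals, batched along the first axis.
rho = np.array([[3.0, 1.0, 2.0, 5.0, 4.0, 0.0],
                [-1.0, 2.0, 2.0, 2.0, 2.0, 2.0]])
g = always(rho, 1, 2)
f = eventually(rho, 1, 2)
```

Nested formulas compose these reductions along the parse tree, so the whole robustness evaluation stays a fixed sequence of batched min/max calls.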

Collective cost

Task robustness + obstacle penalty + inter-robot penalty, all dense per timestep, summed over robots.

Connectivity

$$\textcolor{#557a3a}{\lambda_{2}}\bigl(\textcolor{#8a3f9c}{L(t)}\bigr)\ >\ 0$$

Algebraic connectivity $\lambda_2$ (Fiedler value) of the inter-robot Laplacian $L(t)$: $\lambda_2>0\Leftrightarrow$ the team is connected — a smooth, batchable stand-in for the yes/no connectivity test, with $\lambda_2$ itself as the robustness margin.
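A batched sketch of the connectivity check over a trajectory of team positions. It assumes a disk communication model with hard 0/1 links for clarity; a smooth variant would use distance-weighted edges instead. The radius and positions are hypothetical.

```python
import numpy as np


def algebraic_connectivity(P, radius):
    """Fiedler value of the inter-robot graph at each time step.

    P: (..., R, d) robot positions; leading axes are batch axes (e.g. time).
    Two robots are linked when closer than `radius` (assumed disk model).
    Returns (...,): the second-smallest Laplacian eigenvalue.
    """
    R = P.shape[-2]
    dist = np.linalg.norm(P[..., :, None, :] - P[..., None, :, :], axis=-1)
    A = (dist < radius) * (1.0 - np.eye(R))          # 0/1 links, no self-loops
    L = A.sum(axis=-1)[..., None] * np.eye(R) - A    # L = D - A, per batch entry
    return np.linalg.eigvalsh(L)[..., 1]             # eigenvalues are ascending


# Two time steps of a 3-robot team on a line (made-up positions).
P = np.array([[[0.0], [1.0], [2.0]],      # chain 0-1-2: connected
              [[0.0], [1.0], [5.0]]])     # robot 2 has drifted away
lam2 = algebraic_connectivity(P, radius=1.5)
```

The first entry is the Fiedler value of a 3-node path graph; the second is zero because the team has split into two components.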

The Sinkhorn step is untouched. The cost surface gets richer; the update stays a dense regular kernel.

CLOT = nested GSR · story (5/8)

Three reducers, one stack: sequence · optimal transport · STL min/max.

Outer · sequence (best-first $\max$)

$$\textcolor{#8a3f9c}{\max_{\nu}}\ \textcolor{#557a3a}{\chi(\nu)}\quad\text{over planning order }\nu=(N_\nu,\Gamma_\nu)$$

Middle · transport (soft-min)

$$\textcolor{#8a3f9c}{-\lambda\log\textstyle\sum} e^{-C/\lambda}\quad\text{per robot, over polytope }D_P$$

Inner · logic (tropical)

$$\textcolor{#8a3f9c}{\min/\max}\ \textcolor{#557a3a}{\rho^{\phi}}\quad\text{over the STL parse tree, inside the score}$$

CLOT is GSR nested three deep. Each level is its own Generate·Score·Reduce; the semirings compose without changing the kernel schedule.

Sequence matters · story (6/8)

CLOT treats the robot planning order as part of the search.

Hybrid search node

$$\textcolor{#8a3f9c}{\nu}=(\textcolor{#8a3f9c}{N_{\nu}},\textcolor{#8a3f9c}{\Gamma_{\nu}})$$

$\nu$ indexes both the discrete planning sequence and the continuous candidate-trajectory set, so the search is over both at once.

Value function

$$\textcolor{#557a3a}{\chi(\nu)}=\frac{|\textcolor{#8a3f9c}{N_{\nu}}|}{\textcolor{#2f7d8a}{N}}-\eta\,\textcolor{#557a3a}{\xi(\Gamma_{\nu})}+\textcolor{#557a3a}{\psi(\nu)}$$

Coverage minus accumulated cost plus back-propagated feedback $\psi(\nu)$; the weight $\eta>0$ trades cost against coverage (the decay $\gamma\in(0,1)$ discounts back-prop by tree distance).

CLOT results · story (7/8)

Scalability and reliability on STL-coupled multi-robot tasks.

Method	Success N=12	Time N=12
CLOT (ours)	1.00	20.3 s
MPOT	0.10	26.5 s
FSOT	0.70	19.8 s
CONS	0.00	n/a
MICP / NLP / CBS	0.00	n/a

FSOT is the fixed-sequence ablation, CONS the consensus baseline; MICP, NLP, and CBS are infeasible at scale. Simplified from Table I, RTX 4080. CLOT, Meng Guo's group (PKU), ICRA 2026.

100

robots: first feasible solution for the hardest STL split-and-merge in 971.5 s on one RTX 4080. At $N=12$ only CLOT holds 1.00: MPOT drops to 0.10, FSOT to 0.70, and the rest to 0.00.

CLOT coordinated multi-robot trajectories realized over the STL horizon

CLOT realizes coordinated multi-robot trajectories across the STL horizon (1/15 to 15/15)

Real-world validation · story (8/8)

3-UAV hardware: planned in 2.91 s, formation tracked to ±0.1 m.

30-robot split-and-merge (top); 5-robot dynamic-obstacle + 3-UAV hardware (bottom)

OptiTrack quadcopters hold the formation envelope

3-UAV collision-free coordination in simulation

3-UAV coordination in simulation

hardware: 3 Crazyflies hold the line through the corridor

STL goal: $\varphi = \textcolor{#8a3f9c}{G}_{\textcolor{#2f7d8a}{[8,15]}}\,\textcolor{#557a3a}{\mu_{B}}$. Always between $t=8$ and $t=15\,\text{s}$, maintain the linear formation.

TMLR 2025 · ICLR 2026 · MTP · story (1/3)

Finite temperature turns the same kernel into an online control loop.

in-hand cube rotation: high-entropy batched MuJoCo / XLA rollouts close an online loop

The idea: a graph becomes a loop

The candidate object is a control rollout, not an edge.
Tensorized MuJoCo XLA (MJX) rollouts, batched with vmap.
The reducer is a finite-temperature Gibbs expectation (weight each rollout by $e^{-\text{cost}/\eta}$, normalize, then average), not a hard min.

Same Generate·Score·Reduce as GTMP, now replanning every step instead of building one graph.

Le, Nguyen, Vu, Carvalho, Peters. TMLR 2025 · ICLR 2026 Journal Track (arXiv:2505.01059).

MTP · the reducer · story (2/3)

Two Boltzmann reducers at two temperatures, convexly mixed.

structured rollout bases over the layered graph: linear, B-spline, Akima edges (the Generate stage)

Roll out

$$\textcolor{#2f7d8a}{\mathbf{U}}\in\mathbb{R}^{\textcolor{#2f7d8a}{B}\times \textcolor{#2f7d8a}{H}\times \textcolor{#2f7d8a}{m}}$$

$B$ control sequences of horizon $H$, scored by a black-box simulator in parallel.

Then mix

$$\textcolor{#8a3f9c}{\pi^{k+1}}=(1{-}\textcolor{#557a3a}{\beta})\,\textcolor{#8a3f9c}{\pi^{k}_{\text{loc}}}+\textcolor{#557a3a}{\beta}\,\textcolor{#8a3f9c}{\pi^{k}_{\text{glb}}},\qquad \pi\propto e^{-J/\eta}$$

Each component is a Gibbs measure $\pi_\bullet\propto e^{-J/\eta_\bullet}$ (its weights $p_{\eta_\bullet}=\nabla_v R_{\eta_\bullet}$), so the convex mix is again a valid distribution. $\beta$ is one knob from exploit (cold $\eta_{\text{loc}}$) to explore (warm $\eta_{\text{glb}}$).
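A sketch of the mix as weights over a batch of rollouts. It reads the two Gibbs measures as weightings of the same sampled control sequences, which is an assumption made for illustration, and uses a quadratic stand-in for the simulator's rollout cost.

```python
import numpy as np


def gibbs(J, eta):
    """Weights proportional to exp(-J / eta), normalized over the batch axis."""
    w = np.exp(-(J - J.min()) / eta)
    return w / w.sum()


def mixed_control(U, J, eta_loc, eta_glb, beta):
    """Convex mix of a cold and a warm Gibbs average of the rollouts.

    U: (B, H, m) control sequences, J: (B,) rollout costs.
    Returns the averaged control sequence (H, m) and the mixed weights (B,).
    """
    pi = (1.0 - beta) * gibbs(J, eta_loc) + beta * gibbs(J, eta_glb)
    return np.tensordot(pi, U, axes=1), pi


rng = np.random.default_rng(0)
B, H, m = 256, 10, 2                       # illustrative sizes
U = rng.normal(size=(B, H, m))
J = np.sum(U ** 2, axis=(1, 2))            # stand-in for a simulator's rollout cost

u_mix, pi = mixed_control(U, J, eta_loc=0.1, eta_glb=10.0, beta=0.3)
```

Setting `beta=0` with a very cold `eta_loc` collapses the average onto the single best rollout, which recovers the hard-min behavior.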

Both reducers are $\nabla R_\lambda$ from the temperature family: expectation is just soft-min, differentiated.

MTP · why high entropy wins · story (3/3)

The expectation reducer explores where MPPI and OpenAI-ES stall.

MTP (Akima) · high-entropy, covers the maze

MPPI · collapses to a narrow mode

OpenAI-ES · slow to spread

Same batch budget, same hardware. The reducer's temperature decides how much of the space the rollouts actually see. Demonstrated across in-hand cube, G1 whole-body, and crane.

Baselines: MPPI (Williams et al., 2017) · OpenAI-ES (Salimans et al., 2017).

ICRA Workshop 2026 · MTP · sim-to-real · deployment

The same expectation reducer closes a contact-rich loop on real hardware.

MTP Push-T deployment on a Franka Research 3 robot

Push-T on a Franka Research 3: batched MuJoCo-MJX rollouts run an online MPC loop through a complete real→sim→real pipeline

From XLA rollouts to a real robot

The Boltzmann expectation reducer runs unchanged on hardware, replanning every step.
JAX/XLA-parallel sampling-based MPC over a high-fidelity MuJoCo MJX model.
Structured global sampling beats unimodal CEM / MPPI / PS on multimodal, contact-rich manipulation.
Domain randomization runs online, inside the MPC sample budget; contact-initiation parameters give interpretable adaptation signals.

Real-world evidence that high-entropy batched rollouts survive the sim-to-real gap, not just simulation.

Dierking, Carvalho, Le, Chalvatzaki, Peters. Real-World Deployment of Massively Parallel Sampling-Based MPC for Contact-Rich Manipulation. ICRA Workshop on Frontiers of Optimization for Robotics, 2026.

T-RO 2025 · IROS 2025 · MPD / FlowMP · learned priors

Guidance is a score-function reducer, the warm end of the family.

FlowMP diffusion priors over trajectories

diffusion proposes many smooth candidates in parallel; cost-guided B-spline projection verifies (Carvalho, Nguyen, Le et al.)

Denoise, then guide

$$\textcolor{#2f7d8a}{z_{k-1}}=\textcolor{#8a3f9c}{D_{\theta}}(\textcolor{#2f7d8a}{z_k},c)\;-\;\eta\,\textcolor{#557a3a}{\nabla_z J}\!\bigl(\textcolor{#2f7d8a}{B_{\psi}}\textcolor{#2f7d8a}{z_k}\bigr)$$

$D_\theta$: learned prior, the Generate stage.
$\nabla_z J$: guidance, the score-function Reduce.
$B_\psi$: maps control latents to a smooth B-spline trajectory (learnable knots), so guidance acts on $C^2$, dynamically feasible curves.

The per-step $-\nabla_z J$ is the score $\nabla_z\log p(z)$ of the Boltzmann target $p\propto e^{-J}$ (the normalizer drops out). In plain terms, it points the way that most lowers cost. This is the warm, score end of the temperature continuum.
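A numerical check of that identity on a toy cost, followed by one guided update. The quadratic cost is hypothetical, and the learned denoiser and the B-spline map are both replaced by the identity, so this only illustrates the guidance term.

```python
import numpy as np

TARGET = np.array([1.0, -2.0])             # made-up low-cost point


def J(z):
    """Hypothetical smooth cost: half the squared distance to TARGET."""
    return 0.5 * np.sum((z - TARGET) ** 2)


def grad_J(z):
    """Analytic gradient of J."""
    return z - TARGET


def log_p_unnormalized(z):
    """log of exp(-J(z)); the normalizer is constant in z, so its gradient is zero."""
    return -J(z)


def finite_diff(f, z, eps=1e-6):
    """Central-difference gradient of a scalar function f at z."""
    g = np.zeros_like(z)
    for i in range(z.size):
        e = np.zeros_like(z)
        e[i] = eps
        g[i] = (f(z + e) - f(z - e)) / (2.0 * eps)
    return g


z = np.array([0.3, 0.7])
score = finite_diff(log_p_unnormalized, z)   # should match -grad_J(z)

# Guided update with the identity standing in for the learned denoiser.
eta = 0.1
z_next = z - eta * grad_J(z)
```

The score and the negative cost gradient agree to finite-difference accuracy, and the guided step lowers the cost.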

T-RO 2025 (arXiv:2412.19948) · FlowMP, IROS 2025 (arXiv:2503.06135).

The payoff

One vectorizable kernel, generalized: five planners on a single temperature axis.

GTMP

$\min$

tropical · $\lambda{\to}0$

MPOT

$-\lambda\log\sum e^{-\cdot/\lambda}$

entropic

CLOT

nested soft-min

OT $\otimes$ STL

MTP

$\mathbb{E}_{p_\eta}$

Boltzmann

MPD

$\nabla\log\int e^{-J}$

score · warm

The same design move every time: make the algorithm the vectorizable object. Generate·Score·Reduce is just the algebra that turns the knob.

Design principles

Four rules I keep returning to.

Generate many · Score in batch · Reduce with structure

Expose the independent work early.
Replace pointer-chasing control flow (classical OMPL-style search) with dense algebra.
Treat diversity and optimality as different compute allocations.
Make the proof object match the compute object.

What this is not

Not “GPU everywhere.”
Not abandoning geometry or proofs.
Not claiming one planner wins everywhere.

It is a way to design algorithms whose bottlenecks are visible, measurable, and schedulable.

Hardware-conscious algorithm sketch

A planner can be compiled into a heterogeneous robotics stack.

GPU lane · dense kernels · SIMT (threads in lockstep)

tensor graph construction
batched connector calls (cuRobo · pRRTC)
Sinkhorn / DP reductions
STL robustness reductions

CPU lane · orchestration · SIMD (one op, many lanes)

hybrid sequence search
class-indexed archives
VAMP / FCIT* edge validation on OMPL (Kavraki)
JAX/XLA driver

Edge loop · execution · replan trigger

MPC / reactive tracker (Scaramuzza-style)
execution monitor
replan on infeasibility
edge-cloud round-trip

The same plan flows across three schedulers: the right allocation of algorithm to computing hardware.

Open problems

Where I hope this community pushes next.

Lazy tensor graphs with guarantees

Skip most edges while proving homotopy-critical edges survive.

Class-aware extraction

If the graph contains many route families, enumerate them deliberately.

Connector-aware scheduling

Adapt $(M,N,s)$ online from measured $\bigl(q_s,\Lambda_s\bigr)$ profiles.

Compiler / runtime co-design

Should planners ship as XLA / JAX programs? Benchmark homotopy coverage and compute efficiency, not only first-solution time.

These are algorithm, architecture, and systems questions, not only motion-planning questions.

Take home

Parallelism is a design constraint, not an implementation detail.

01 · Representation

Lift graphs, automata, trajectories, and controls into batched tensor objects: design the planner as a vectorizable kernel whose data structure GPUs compute directly. Build on OMPL's connectors; don't replace them.

02 · Search

Replace one brittle serial loop with Generate · Score · Reduce: min, Sinkhorn, expectation, score, all structured reductions on one semiring continuum.

03 · Hardware

Memory layout, batch size, occupancy, and divergence are algorithm parameters, not deployment afterthoughts.

The next planning algorithms will be judged by the workloads they can handle, not only the paths they return.

Most results here are large-batch simulation studies; the 3-UAV flight is the hardware validation. Large-batch simulation is what makes batched planning measurable.

the structure was hiding in plain sight.

Thank you

Jan Peters

TU Darmstadt · IAS

Georgia Chalvatzaki

TU Darmstadt · PEARL

Zachary Kingston

Purdue

Meng Guo

Peking University

Armin Biess

Apple

Joe Watson

University of Oxford

João Carvalho

TU Darmstadt

Julen Urain

Amazon FAR

Kay Pompetzki

TU Darmstadt · IAS

Magnus Dierking

TU Darmstadt · PEARL

VinUniversity · TU Darmstadt · IAS Lab · VinRobotics

questions · collaborations

anindex.github.io

an@robot-learning.de

papers · code

Backup · for Q&A

The full map: object, semiring, work, depth.

Method	Generate · batch axes	Reduce · semiring	Operator	$W$	$D$
GTMP	graph $(M,N)$	min-plus	$J_m(i)=\min_j[C_m(i,j)+J_{m+1}(j)]$	$\Theta(MN^2)$	$\Theta(M\log N)$
MPOT	waypoint $\times D_P$	entropic	$W^\star_\lambda=\arg\min\langle W,C\rangle-\lambda H(W)$	$\Theta(Tnm)$	$\Theta(T\log n)$
CLOT	robots $\times$ wp $\times D_P$	OT $\otimes$ STL	Sinkhorn inner $\oplus$ tropical STL	$\Theta(R(Tnm{+}\|\Phi\|T))$	$\Theta(R\log nm)$
MTP	rollouts $\mathbb{R}^{B\times H\times m}$	expectation	$\pi^{k+1}=(1{-}\beta)\pi_{\text{loc}}+\beta\pi_{\text{glb}}$	$\Theta(BH)$	$\Theta(H{+}\log B)$
MPD	latents $(B,H)$	score	$z_{k-1}=D_\theta(z_k,c)-\eta\nabla_z J(B_\psi z_k)$	$\Theta(KBc_\theta)$	$\Theta(Kd_\theta)$

All $\oplus$ are associative $\Rightarrow$ all share the $+O(\log B)$ reduction tree; the prefactor is the per-method sequential scan.
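The GTMP row can be sketched in JAX as follows. This is a minimal illustration, not the GTMP implementation: the function name `minplus_backward`, the `(M, N, N)` cost layout, and the random test costs are assumptions made for this note. Each layer is one broadcast-add followed by a `min` over $j$ (the part a reduction tree parallelizes), while `lax.scan` over the $M$ layers is the sequential prefactor.

```python
import jax
import jax.numpy as jnp


def minplus_backward(C, J_terminal):
    """Backward pass J_m(i) = min_j [C_m(i, j) + J_{m+1}(j)].

    C:          (M, N, N) edge costs between consecutive layers (assumed layout).
    J_terminal: (N,) cost-to-go at the final layer.
    Returns J with shape (M + 1, N), where J[M] is J_terminal.
    """

    def step(J_next, C_m):
        # Tropical (min-plus) matrix-vector product:
        # add J_{m+1}(j) to every row, then reduce with min over j.
        J_m = jnp.min(C_m + J_next[None, :], axis=1)
        return J_m, J_m

    # reverse=True visits layers M-1, ..., 0 and stacks outputs in layer order.
    _, J_layers = jax.lax.scan(step, J_terminal, C, reverse=True)
    return jnp.concatenate([J_layers, J_terminal[None, :]], axis=0)


# Batch axis: solve B independent layered graphs in one call.
B, M, N = 8, 5, 16  # illustrative sizes only
costs = jax.random.uniform(jax.random.PRNGKey(0), (B, M, N, N))
J_batch = jax.vmap(minplus_backward)(costs, jnp.zeros((B, N)))  # (B, M + 1, N)
```

Swapping `jnp.min` for a soft-min over the same axis keeps the structure of `step` unchanged, which is the sense in which the rows of the table differ only in the reducer.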

Backup · the temperature family

One generating function. Its limit, value, gradient, and log-partition gradient are the four reducers.

Proposition · one family, four faces

$$R_\lambda(v)=-\lambda\,\log\!\textstyle\sum_{a} e^{-v_a/\lambda},\qquad p_\lambda(v)_a=\frac{e^{-v_a/\lambda}}{\sum_b e^{-v_b/\lambda}}$$

(i) $$\lim_{\lambda\to 0^+}R_\lambda(v)=\textcolor{#8a3f9c}{\min_a v_a},\quad \min_a v_a-\lambda\log B\le R_\lambda\le \min_a v_a$$

(ii) $$\nabla_v R_\lambda(v)=p_\lambda(v)\ \Rightarrow\ \langle \nabla R_\lambda,\Phi\rangle=\textcolor{#557a3a}{\mathbb{E}_{a\sim p_\lambda}[\Phi_a]}$$

(iii) $$\nabla_\theta\log\!\textstyle\int e^{-J_\theta}=\textcolor{#8a3f9c}{\mathbb{E}_{p}[-\nabla_\theta J_\theta]}\quad(\text{log-partition gradient}\to\text{score})$$

Legendre dual: $R_\lambda(v)=\min_{p\in\Delta}\bigl[\langle p,v\rangle-\lambda H(p)\bigr]$, with $\Delta$ the probability simplex and $H$ the Shannon entropy, so “Sinkhorn = soft-min” is an identity, and $\lambda\to0$ recovers item (i).
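A short derivation of the dual, filled in here as a sketch. The multiplier $\mu$ and the normalizer $Z$ are notation introduced for this note; $H(p)=-\sum_a p_a\log p_a$ is assumed to be the Shannon entropy.

$$\mathcal{L}(p,\mu)=\langle p,v\rangle+\lambda\textstyle\sum_a p_a\log p_a+\mu\bigl(\textstyle\sum_a p_a-1\bigr)$$

$$\partial_{p_a}\mathcal{L}=v_a+\lambda(\log p_a+1)+\mu=0\ \Rightarrow\ p_a=\frac{e^{-v_a/\lambda}}{Z}=p_\lambda(v)_a,\qquad Z=\textstyle\sum_b e^{-v_b/\lambda}$$

$$\langle p_\lambda,v\rangle-\lambda H(p_\lambda)=\langle p_\lambda,v\rangle+\lambda\textstyle\sum_a p_{\lambda,a}\bigl(-\tfrac{v_a}{\lambda}-\log Z\bigr)=-\lambda\log Z=R_\lambda(v)$$

The objective is strictly convex in $p$ and the minimizer has $p_a>0$, so the stationary point is the unique minimum over $\Delta$.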

soft-min collapses onto the hard min as the temperature cools

Limit = min (GTMP) · value = soft-min (MPOT/CLOT) · gradient = expectation (MTP) · log-partition gradient = score (MPD).
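The proposition can be checked numerically with a few lines of JAX. This is a sketch under stated assumptions: the helper names `soft_min`, `boltzmann`, and `entropic_dual_value`, and the test vector `v`, are made up for this note and do not come from any of the planners' codebases.

```python
import jax
import jax.numpy as jnp
from jax.scipy.special import logsumexp


def soft_min(v, lam):
    """R_lambda(v) = -lambda * log sum_a exp(-v_a / lambda)."""
    return -lam * logsumexp(-v / lam)


def boltzmann(v, lam):
    """p_lambda(v)_a = exp(-v_a / lambda) / sum_b exp(-v_b / lambda)."""
    return jax.nn.softmax(-v / lam)


def entropic_dual_value(v, lam):
    """<p, v> - lambda * H(p), evaluated at p = p_lambda(v)."""
    p = boltzmann(v, lam)
    neg_entropy = jnp.sum(p * jnp.log(p))  # equals -H(p)
    return jnp.vdot(p, v) + lam * neg_entropy


v = jnp.array([3.0, 1.0, 2.5, 1.2])  # illustrative scores only

# (i) cooling the temperature squeezes the soft-min onto the hard min.
values = [soft_min(v, lam) for lam in (1.0, 0.1, 0.01)]

# (ii) the gradient of the soft-min is the Boltzmann distribution.
grad = jax.grad(soft_min)(v, 0.5)
weights = boltzmann(v, 0.5)

# Legendre dual: the entropic objective at p_lambda equals the soft-min value.
dual = entropic_dual_value(v, 0.5)
primal = soft_min(v, 0.5)
```

Here `values` should approach `v.min()` from below within the $\lambda\log B$ gap of item (i), `grad` should match `weights`, and `dual` should match `primal`, all up to float32 tolerance.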

Backup · references

References.

GTMP · Le et al., Global Tensor Motion Planning. IEEE RA-L 2025. arXiv:2411.19393.
MPOT · Le et al., Accelerating Motion Planning via Optimal Transport. NeurIPS 2023. arXiv:2309.15970.
MTP · Le, Nguyen, Vu, Carvalho, Peters, Model Tensor Planning. TMLR 2025 (ICLR 2026 Journal Track). arXiv:2505.01059.
MTP deployment · Dierking, Carvalho, Le, Chalvatzaki, Peters, Real-World Deployment of Massively Parallel Sampling-Based MPC for Contact-Rich Manipulation. ICRA Workshop on Frontiers of Optimization for Robotics, 2026.
CLOT · Zhang, Zhang, Le, Guo (Peking University), collaborative optimal transport under STL. ICRA 2026.
MPD · Carvalho, Le, Kicki, Koert, Peters, Motion Planning Diffusion. IEEE T-RO 2025. arXiv:2412.19948.
FlowMP · Nguyen, Le et al., learning motion fields with conditional flow matching. IROS 2025. arXiv:2503.06135.
VAMP · Thomason, Kingston, Kavraki, motions in microseconds. ICRA 2024. arXiv:2309.14545.
FCIT* · Wilson et al., fully connected informed trees. ICRA 2025. arXiv:2411.17902.
pRRTC · Huang, Jadhav, Plancher, Kingston, GPU-parallel RRT-Connect. ICRA 2026. arXiv:2503.06757.
cpRRTC · Hu, Wang, Christensen (UCSD), constrained GPU-parallel RRT-Connect. RSS 2025. arXiv:2505.06791.
OMPL · Șucan, Moll, Kavraki, The Open Motion Planning Library. IEEE RAM 2012; library since 2008; OMPL 2.0 (2026) adds SIMD/GPU. ompl.kavrakilab.org.
Akima · H. Akima, A New Method of Interpolation Based on Local Procedures. JACM 1970.
cuRobo · Sundaralingam et al. (NVIDIA), CUDA-accelerated motion generation. curobo.org.
Trends · K. Rupp, 50 years of microprocessor trend data (CC BY 4.0).
Sinkhorn / entropic OT · Cuturi, Sinkhorn Distances: Lightspeed Computation of Optimal Transport. NeurIPS 2013. arXiv:1306.0895.
MPPI · Williams et al., Information-Theoretic Model Predictive Control. IEEE T-RO 2017.
OpenAI-ES · Salimans et al., Evolution Strategies as a Scalable Alternative to RL. 2017. arXiv:1703.03864.
Isaac Gym · Makoviychuk et al., GPU-based physics simulation for robot learning. NeurIPS 2021 Datasets. arXiv:2108.10470.
Brax · Freeman et al., differentiable physics for large-scale rigid-body simulation. 2021. arXiv:2106.13281.
MuJoCo MJX · Todorov et al., MuJoCo / MJX XLA backend (mujoco.readthedocs.io).
Diffuser · Janner et al., Planning with Diffusion for Flexible Behavior Synthesis. ICML 2022. arXiv:2205.09991.
The Bitter Lesson · R. Sutton, 2019 (incompleteideas.net).
The Hardware Lottery · S. Hooker, CACM 2021. arXiv:2009.06489.

Benchmark of record: MotionBenchMaker (Chamzas et al., 2022).
