Invited talk · RoboARCH @ ICRA 2026 · Vienna · 5 June 2026

Designing
Planning Algorithms
for the Era of Parallelism

One algebra for batched tensor search: global graphs, local connectors, optimal transport, rollouts, and learned priors as a single Generate·Score·Reduce operator.

Speaker
An Thai Le
Assistant Professor · Director of Foundation AI
Affiliations
VinUniversity
TU Darmstadt (visiting) · VinRobotics
Contact
anindex.github.io
MPD diffusion motion planning, Panda in a warehouse
MPOT batched trajectory optimization on a TIAGo mobile manipulator
MTP in-hand cube rotation via batched rollouts
MPD diffusion sampling many 2D trajectories at once
Five planners, one batched Generate · Score · Reduce kernel.
Positioning

The hardware stopped waiting for the algorithm. The algorithm has to meet the hardware.

50 years of microprocessor trend data: single-thread performance and clock frequency plateau around 2005 while transistor count and number of logical cores keep rising
50 years of microprocessor trend data. K. Rupp, CC BY 4.0; original data to 2010 by Horowitz, Labonte, Shacham, Olukotun, Hammond, and Batten. Single-thread performance plateaued around 2005; the gains since are parallel.

Parallel hardware is a design constraint, like dynamics, collision, or uncertainty.

What changed

  • GPUs, SIMD CPUs, FPGAs, and edge accelerators are now commodity for robotics.
  • ML stacks (JAX · XLA · MuJoCo XLA · Triton) made dense, batched compute the default kernel.
  • Classical planners, built for serial pointer chasing, leave most of this hardware idle.
batchsolve many problems at once
denseregular array math, not pointer chasing
localindependent work on each edge
reuserecycle archives, priors, informed sets

That is the RoboARCH question: match each algorithm to the hardware it runs best on, from CPU SIMD and GPU to FPGA, MCU, edge, and cloud.

Core thesis

Do not ask “how do I parallelize my planner?”

Ask: “what planner would I write if parallelism were the primitive?”

1 · Generate

Make many similar subproblems visible at once, as one batched object.

2 · Score

Evaluate every candidate independently; keep geometry and logic, do not flatten into brute force.

3 · Reduce

End with one structured reduction: min, Sinkhorn, expectation, score, DP.

Three verbs, held for the whole talk: Generate · Score · Reduce.

Current landscape

Three schools, one lesson: design for what batches.

CPU · SIMD

VAMP · FCIT*

Thomason · Kingston · Kavraki
ICRA 2024 · 2025

Vectorize the kernels of sampling-based planning. 35 µs median, ~25 kHz on one CPU core (7-DoF Panda); FCIT* = first fully-connected a.s.-optimal.

GPU · SIMT

cuRobo · pRRTC · cpRRTC

Sundaralingam et al. · Huang, Jadhav, Plancher, Kingston
2023–2025

Batched trajopt; GPU-parallel RRT-Connect. cuRobo: ~60× vs CPU, ~30 ms full pipeline. pRRTC (MotionBenchMaker Panda, mean planning time): 128× cuRobo, 3× slower than VAMP-RRTC, 1.4× slower than GTMP; cpRRTC up to 165× constrained.

Tensor search this talk

GTMP · MPOT · CLOT · MTP · MPD

Le · Carvalho · Zhang & Guo (PKU) · Nguyen · Peters
RA-L · NeurIPS · ICRA · ICLR · T-RO · 2023–2026

Make the graph itself a batched object. GTMP: 1000 paths in 4.6 ms.

Family lineage · 2017–2026
GMT* Stanford ASL · 2017 VAMP microseconds · Kavraki · 2024 FCIT* fully-connected a.s.-optimal · 2025 cpRRTC constrained · RTX 5090 · 2025 MPD Carvalho, A. Le et al. · T-RO 2025

One design pressure across the field: expose more regular, batched, schedulable work. Benchmark of record: MotionBenchMaker.

The object · Generate

A planner is a compute substrate, not only a path generator.

$$\textcolor{#2f7d8a}{\mathbf{Q}}\in\mathbb{R}^{\textcolor{#2f7d8a}{M}\times \textcolor{#2f7d8a}{N}\times d}\qquad\text{a batched tensor of candidate vertices}$$

$M$ layers, $N$ samples per layer, each a point in configuration space $\mathbb{R}^d$. Now turn this into a search.

The object · Generate

A planner is a compute substrate, not only a path generator.

$$\textcolor{#2f7d8a}{\mathbf{Q}}\in\mathbb{R}^{\textcolor{#2f7d8a}{M}\times \textcolor{#2f7d8a}{N}\times d}\quad\text{vertex tensor}$$
$$\textcolor{#2f7d8a}{\mathbf{C}}\in\mathbb{R}^{\textcolor{#2f7d8a}{M}\times \textcolor{#2f7d8a}{N}\times \textcolor{#2f7d8a}{N}}\quad\text{edge-cost tensor}$$
$$\textcolor{#557a3a}{J_m(i)}=\textcolor{#8a3f9c}{\min_{j}}\bigl[\,C_m(i,j)+\textcolor{#557a3a}{J_{m+1}(j)}\bigr]\quad\text{Bellman}$$

What this buys you

  • Many candidate futures in one object, diversity by construction.
  • Local work distributed across SIMT lanes.
  • Global decisions recovered by reductions (DP, Sinkhorn, prune).
  • Theorems and kernels describe the same object.

The batched object is the Generate stage. Design a graph whose natural operations are batched kernels.
teal · tensor / batch axes   sage · value / cost   orchid · operator / reducer   rose · this‑talk highlight

The operator

Every planner here is GenerateScoreReduce.

Generate
Score
Reduce

Three verbs. Each is a typed map; their composition is the planner.

The operator

Every planner here is GenerateScoreReduce.

Generate
$$\textcolor{#2f7d8a}{X}=\mathcal{G}_\theta(\xi)\in\mathbb{R}^{\textcolor{#2f7d8a}{\mathbf{B}}\times d}$$
Structured sampler. Batch axes $\mathbf{B}$ are the independent subproblems exposed at once.
Score
$$\textcolor{#557a3a}{S}=s(\textcolor{#2f7d8a}{X})\in\mathbb{R}^{\textcolor{#2f7d8a}{\mathbf{B}}}$$
Elementwise, communication-free: $\partial S_a/\partial X_b=0$. Collision, dynamics, logic live here.
Reduce
$$\textcolor{#8a3f9c}{y}=\textcolor{#8a3f9c}{\textstyle\bigoplus_{a\in\mathcal{A}}}\,\textcolor{#557a3a}{S_a}$$
The only cross-candidate stage: a structured reduction over a semiring $(\mathbb{S},\oplus,\otimes)$.

$$\textbf{Planner}=\textcolor{#8a3f9c}{\textstyle\bigoplus}\,\circ\,\textcolor{#557a3a}{s}\,\circ\,\textcolor{#2f7d8a}{\mathcal{G}_\theta}.\qquad\text{\small Different planners are different semiring choices for one skeleton.}$$

One operator

Five planners, one kernel: they differ only in the reduction.

$$\textcolor{#8a3f9c}{y}=\textcolor{#8a3f9c}{\textstyle\bigoplus_{a\in\mathcal{A}}}\,\textcolor{#557a3a}{s(\textcolor{#2f7d8a}{X})},\qquad \textcolor{#8a3f9c}{\oplus}\in\Bigl\{\ \min,\ \ -\lambda\log\!\textstyle\sum e^{-(\cdot)/\lambda},\ \ \mathbb{E}_{p_\eta},\ \ \nabla\!\log\!\textstyle\int e^{-J}\ \Bigr\}$$
$\min$
tropical
GTMP
$-\lambda\log\sum e^{-\cdot/\lambda}$
entropic OT
MPOT · CLOT
$\mathbb{E}_{p_\eta}[\cdot]$
expectation
MTP
$\nabla\log\int e^{-J}$
score
MPD

Generate and Score are shared; a planner only chooses the reducer $\oplus$ over the semiring $(\mathbb{S},\oplus,\otimes)$. One compiled kernel, five planners, with a single temperature knob sliding between them.

Why it matters

One kernel schedule for all five planners, and the honest caveat.

Reduction-invariance

$$\mathbf{B},\,s\ \text{fixed};\quad \text{only}\ \textcolor{#8a3f9c}{\bigoplus_\lambda}\ \text{varies with}\ \lambda.$$

Every $\oplus_\lambda$ is associative, so all five share one iteration space, one memory layout, and one $O(\log B)$ reduction tree. One compiled template instantiated by swapping the combine op.

What actually differs

  • min is one comparator; soft-min spends $\exp/\log$ on the SFU (with a max-shift for stability).
  • Register pressure and occupancy differ, so “one template” $\ne$ identical wall-clock.
  • $\min$ returns an index (argmin path); soft-min returns a distribution.

The proof object and the compute object are the same: the semiring is both the algebra in the theorem and the reduction in the kernel. Invariant skeleton, method-specific arithmetic. Next: the per-method work and depth.

Algebra becomes schedule

Associativity buys an $O(\log B)$ reduction tree. That is the architecture payload.

Work–depth · Brent's theorem
$$T_p\ \le\ \tfrac{W}{p}+D,\qquad D_{\text{GSR}}=\underbrace{\text{depth}(s)}_{\text{scan}}+\underbrace{O(\log B)}_{\text{tree}}$$

Embarrassingly-parallel Score gives $W=|B|\,\text{cost}(s)$ at fixed depth; an associative $\oplus$ folds it in $O(\log B)$. Available parallelism $W/D$ sets the ceiling; realized speed is then capped by bandwidth and occupancy.

MethodSemiring $(\oplus,\otimes)$Work $W$Depth $D$
GTMP$(\min,+)$$\Theta(MN^2)$$\Theta(M\log N)$
MPOTsoft-min, $+$$\Theta(T\,nm)$$\Theta(T\log n)$
CLOTOT $\otimes$ STL min/max$\Theta(R\,(Tnm{+}|\Phi|T))$$\Theta(R\log nm)$
MTPexpectation$\Theta(BH)$$\Theta(H{+}\log B)$
MPDscore$\Theta(KB\,c_\theta)$$\Theta(K\,d_\theta)$
$M$ layers · $N$ samples · $n$ waypoints · $m=|D_P|$ probes · $H$ horizon · $K$ steps · $T$ iters · $R$ robots

Every method shares the depth signature $D=(\text{sequential scan})+O(\log B)$. The scan length is the number an architect should optimize.

GTMP · concrete story (1/5)

Global search becomes a layered random multipartite graph.

GTMP layered graph

The aha: the graph IS a tensor

  • Sample N dream points in each of M layers.
  • Evaluate adjacent-layer edges in batch.
  • Run finite-horizon value iteration on the layered DAG.

One graph contains several qualitatively different route families, so diversity is a property of the data structure, not a post-processing step.

Le et al., Global Tensor Motion Planning, IEEE RA-L 2025 (arXiv:2411.19393).
GTMP = tropical GSR · story (2/5)

The Bellman backward pass is a tropical matrix–vector product.

Edge count

$$|E|=2\textcolor{#2f7d8a}{N}+(\textcolor{#2f7d8a}{M}-1)\,\textcolor{#2f7d8a}{N}^{2}=\Theta(\textcolor{#2f7d8a}{MN^{2}})$$

Single start/goal, complete adjacent-layer connectivity. Every adjacent-layer pair is an independent local subproblem.

Backward DP · the Reduce step

$$J_m(u)=\textcolor{#8a3f9c}{\min_{v\in V_{m+1}}}\bigl[\,\textcolor{#557a3a}{c(u,v)}+\textcolor{#557a3a}{J_{m+1}(v)}\,\bigr]$$

A batched min-reduction over the next-layer axis: $J_m = C_m\boxtimes J_{m+1}$, tropical GEMV: the segment_min the accelerator runs.

s g argmin path recovered by backpointers · depth Θ(M), width Θ(MN²) of independent work
Connectorized GTMP · story (3/5)

The edge is the interface between global combinatorics and local geometry.

$$\textcolor{#8a3f9c}{\mathcal{LC}}(x_a,x_b;\textcolor{#2f7d8a}{s})\;\to\;\{\,\Gamma_{a,b}\ \text{or fail}\,\}$$
$$\textcolor{#557a3a}{\Lambda_{\textcolor{#2f7d8a}{s}}(\tau)}=\sup\bigl\{\,\ell:\ \textcolor{#557a3a}{q_{\textcolor{#2f7d8a}{s}}(\ell)}\ge 1-\tau\,\bigr\}$$

Reach $\Lambda$: at budget $s$, with confidence $1-\tau$, the connector succeeds on local hops up to length $\ell$.


GTMP consumes, doesn't compete: each connector is the per-edge local Generate and Score, a pluggable oracle, while the tropical Reduce stays fixed.
VAMP‑RRTC · pRRTC · AORRTC · FCIT* · cuRobo · CHOMP · MPD

Connectors: VAMP-RRTC, FCIT* (Kavraki, ICRA 2024/2025) · pRRTC (arXiv:2503.06757) · cuRobo (NVIDIA) · CHOMP · MPD.
GTMP box-sweep demo
batched box-sweep plans
GTMP tabletop demo
tabletop, Akima splines
Anytime GTMP · story (4/5)

Anytime should not conflate diversity and optimality.

Mode RR · diversity

  • Fix $(M,\,N,\,s)$; draw independent graph realizations.
  • Top-$B$ class-indexed elite archive.
  • Diversity = coverage of distinct route families: $\delta$-robust homotopy classes, paths inequivalent under deformation in the $\delta$-inflated free space.
$$\Pr[\,h\ \text{uncovered after}\ K\ \text{draws}\,]\le(1-p_h)^{K}\to 0,\ \ p_h>0$$

geometric rate, so almost-sure coverage as $K\to\infty$ (2nd Borel–Cantelli)

Mode AO · cost

  • Grow $M_\nu,\,N_\nu \to \infty$; keep $s$ fixed.
  • Informed samples inside a shrinking prolate hyperspheroid.
  • Reuse + prune edges; AO$^\star$-style convergence to $c^\star$.

optimality by contracting support

Two schedules of the same reducer: diversity repeats generation, optimality enriches it.

GTMP results · story (5/5)

Top of the feasibility hierarchy, with a batch throughput nothing else matches.

AOGTMP path-cost vs AIT*, EIT*, FCIT*, AORRTC
anytime path cost vs AIT* · EIT* · FCIT* · AORRTC
AOGTMP success rate
success rate vs wall-clock budget

anytime-GTMP charts: results from a manuscript under preparation.

4.6ms
1000 paths in one batch (RTX 3090, warm JIT), ~250× less wall-clock than sequential baselines, $89\%$ collision-free, full value-iteration optimality.

500 parallel Panda instances in ~0.3 ms via JAX vmap on RTX 3090. On MotionBenchMaker feasibility time, GTMP sits with VAMP-RRTC at the top, ~1.4× faster than pRRTC.

MPOT · NeurIPS 2023 · story (1/3)

Warm the temperature: swap GTMP's hard min for a soft-min.

$$\textcolor{#8a3f9c}{\min_{j}}\bigl[\,C(i,j)+J_j\,\bigr]$$

GTMP selects one vertex. Relax the selection and the same DP becomes differentiable transport.

MPOT · NeurIPS 2023 · story (1/3)

Sinkhorn Step: the soft-min relaxation of the tropical DP.

$$\textcolor{#8a3f9c}{W^{\star}_{\lambda}}=\arg\min_{W\in\mathcal{U}(n,m)}\;\textcolor{#557a3a}{\langle W,C\rangle}-\htmlClass{fr-caution}{\lambda\,H(W)}$$
$$\textcolor{#2f7d8a}{Z^{k+1}}=\textcolor{#2f7d8a}{Z^{k}}+\alpha_k\,\underbrace{\mathrm{diag}(W^{\star}_{\lambda}\mathbf{1})^{-1}}_{\text{barycentric avg}}\,\textcolor{#8a3f9c}{W^{\star}_{\lambda}}\,\textcolor{#2f7d8a}{D_{P}}$$

$\lambda\to 0$ recovers GTMP: the transport plan concentrates on the min-cost vertices, soft selection becomes hard selection.

MPOT batch trajectory optimization
batch smooth trajopt via entropic OT · Le, Chalvatzaki, Biess, Peters (NeurIPS 2023)

Sinkhorn is matrix-scaling: the densest GPU-friendly inner loop.

MPOT · the search set · story (2/3)

Gradient-free because the search set is a polytope.

MPOT Sinkhorn Step: per-waypoint polytope probes and entropic-OT transport mass
each waypoint probes a local polytope; entropic OT routes transport mass (0.2 to 0.8) toward low-cost step points

The generator $\mathcal{G}_\theta$

$$\textcolor{#2f7d8a}{D_P}=\{\,d_1,\dots,d_{|D_P|}\,\}\subset\mathbb{S}^{d-1}$$
  • Each waypoint emits a batch of candidate moves along polytope vertices.
  • No gradients: the score is a black box evaluated per direction.
  • Rotations randomize the probe so coverage stays isotropic.

Choosing $D_P$ is choosing the Generate stage; Sinkhorn does the Reduce.

MPOT results · story (3/3)

From a 2-D probe to whole-body mobile manipulation.

MPOT on TIAGo mobile manipulator
TIAGo · batched OT trajopt on a real mobile manipulator
MPOT space-time particle dynamics
space-time particle dynamics
10K
waypoints optimized in 1–2 s on an RTX 3080 Ti after JIT: the whole trajectory batch in one Sinkhorn loop.

Same reducer as GTMP, one temperature warmer. The batch axis is the trajectory; the polytope is the per-waypoint candidate set.

MTP · ICLR 2026 · story (1/3)

Finite temperature turns the same substrate into an online control loop.

MTP in-hand cube rotation rollouts
in-hand cube rotation: high-entropy batched MuJoCo / XLA rollouts close an online loop

The aha: a graph becomes a loop

  • The candidate object is a control rollout, not an edge.
  • Tensorized MuJoCo-XLA rollouts on the vmap substrate.
  • The reducer is a finite-$\lambda$ Gibbs expectation, not a hard min.

Same Generate·Score·Reduce as GTMP, now replanning every step instead of building one graph.

Le, Nguyen, Carvalho, Peters. ICLR 2026 · TMLR 2025 (arXiv:2505.01059).
MTP · the reducer · story (2/3)

Two Boltzmann reducers at two temperatures, convexly mixed.

MTP structured rollout bases: linear, B-spline, and Akima edges over the layered graph
structured rollout bases over the layered graph: linear, B-spline, Akima edges (the Generate stage)

Roll out

$$\textcolor{#2f7d8a}{\mathbf{U}}\in\mathbb{R}^{\textcolor{#2f7d8a}{B}\times \textcolor{#2f7d8a}{H}\times \textcolor{#2f7d8a}{m}}$$

$B$ control sequences of horizon $H$, scored by a black-box simulator in parallel.

Then mix

$$\textcolor{#8a3f9c}{\pi^{k+1}}=(1{-}\textcolor{#557a3a}{\beta})\,\textcolor{#8a3f9c}{\pi^{k}_{\text{loc}}}+\textcolor{#557a3a}{\beta}\,\textcolor{#8a3f9c}{\pi^{k}_{\text{glb}}},\qquad \pi\propto e^{-J/\eta}$$

Each component is a Gibbs measure $\pi_\bullet\propto e^{-J/\eta_\bullet}$ (its weights $p_{\eta_\bullet}=\nabla_v R_{\eta_\bullet}$), so the convex mix is again a valid distribution: an explicit exploit (cold $\eta_{\text{loc}}$) to explore (warm $\eta_{\text{glb}}$) knob $\beta$.

Both reducers are $\nabla R_\lambda$ from the temperature family: expectation is just soft-min, differentiated.

MTP · why high entropy wins · story (3/3)

The expectation reducer explores where MPPI and OpenAI-ES stall.

MTP-Akima planar exploration
MTP (Akima) · high-entropy, covers the maze
MPPI planar exploration
MPPI · collapses to a narrow mode
OpenAI-ES planar exploration
OpenAI-ES · slow to spread

Same batch budget, same hardware. The reducer's temperature decides how much of the space the rollouts actually see. Demonstrated across in-hand cube, G1 whole-body, and crane.

Baselines: MPPI (Williams et al., 2017) · OpenAI-ES (Salimans et al., 2017).
CLOT · ICRA 2026 · story (1/7)

From single-robot transport to multi-robot, temporal-logic-coupled transport.

CLOT architecture diagram
CLOT architecture: hybrid sequence search plus per-robot Sinkhorn block

Multi-robot STL tasks

  • Collision avoidance + dynamic feasibility.
  • Relative formation + connectivity maintenance.
  • Bounded-time temporal goals: globally $G^I$, eventually $F^I$, until $U^I$.

GPU-parallel, gradient-free Sinkhorn steps over batches of system-wide trajectories.

Joint work led by Meng Guo's group, Peking University · Y. Zhang, Y. Zhang, A. T. Le, M. Guo · ICRA 2026.
CLOT core · what is new vs. MPOT · story (2/7)

Start from the MPOT cost …

$$\textcolor{#557a3a}{c}(\textcolor{#2f7d8a}{Z})=\textcolor{#557a3a}{\langle W,C\rangle}$$

One robot, one transport cost. Now couple many robots under temporal logic.

CLOT core · what is new vs. MPOT · story (3/7)

STL robustness, collective cost, connectivity: all batchable.

$$\textcolor{#557a3a}{c_{\text{tsk}}}(\textcolor{#2f7d8a}{Z(T)})=-\textcolor{#557a3a}{\rho^{\phi}(Z,0)}+\textcolor{#8a3f9c}{\textstyle\sum_{t,i}} \textcolor{#557a3a}{g_{\text{obs}}}+\textcolor{#8a3f9c}{\textstyle\sum_{t,\,i\lt j}} \textcolor{#557a3a}{g_{\text{int}}}$$

STL robustness

$$\textcolor{#557a3a}{\rho^{G_{[a,b]}\phi}(x,t)}=\textcolor{#8a3f9c}{\min_{t'\in[t+a,\,t+b]}}\textcolor{#557a3a}{\rho^{\phi}(x,t')}$$

Parse-tree decomposition: a batched min/max reduction over time slices, tropical again.

Collective cost

Task robustness + obstacle penalty + inter-robot penalty, all dense per timestep, summed over robots.

Connectivity

$$\textcolor{#557a3a}{\lambda_{2}}\bigl(\textcolor{#8a3f9c}{L(t)}\bigr)\ \ge\ \varepsilon$$

Algebraic connectivity $\lambda_2$ of the inter-robot Laplacian $L(t)$: a smooth, batchable surrogate for the Boolean connected predicate. $\lambda_2\ge\varepsilon\Rightarrow\varepsilon$-robustly connected.

The Sinkhorn step is untouched. The cost surface gets richer; the update stays a dense regular kernel.

CLOT = nested GSR · story (4/7)

Three reducers, one stack: sequence · optimal transport · STL min/max.

Outer · sequence (best-first $\max$)
$$\textcolor{#8a3f9c}{\max_{\nu}}\ \textcolor{#557a3a}{\chi(\nu)}\quad\text{over planning order }\nu=(N_\nu,\Gamma_\nu)$$
Middle · transport (soft-min)
$$\textcolor{#8a3f9c}{-\lambda\log\textstyle\sum} e^{-C/\lambda}\quad\text{per robot, over polytope }D_P$$
Inner · logic (tropical)
$$\textcolor{#8a3f9c}{\min/\max}\ \textcolor{#557a3a}{\rho^{\phi}}\quad\text{over the STL parse tree, inside the score}$$

CLOT is GSR nested three deep. Each level is its own Generate·Score·Reduce; the semirings compose without changing the kernel schedule.

Sequence matters · story (5/7)

CLOT treats the robot planning order as part of the search.

Hybrid search node

$$\textcolor{#8a3f9c}{\nu}=(\textcolor{#8a3f9c}{N_{\nu}},\textcolor{#8a3f9c}{\Gamma_{\nu}})$$

$\nu$ indexes both the discrete planning sequence and the continuous candidate-trajectory set, so the search is over both at once.

Value function

$$\textcolor{#557a3a}{\chi(\nu)}=\frac{|\textcolor{#8a3f9c}{N_{\nu}}|}{\textcolor{#2f7d8a}{N}}-\eta\,\textcolor{#557a3a}{\xi(\Gamma_{\nu})}+\textcolor{#557a3a}{\psi(\nu)}$$

Coverage minus accumulated cost plus back-propagated feedback $\psi(\nu)$; the weight $\eta\in(0,1)$ scales cost onto the coverage scale.

root no robots planned choose robot Sinkhorn batch over polytope D_P candidates verify STL + safety archive rank by χ(ν) back-prop ψ replan on infeasibility · expand promising sequences
CLOT results · story (6/7)

Scalability and reliability on STL-coupled multi-robot tasks.

MethodSuccess N=12Time N=12
CLOT (ours)1.0020.3 s
MPOT0.1026.5 s
FSOT0.7019.8 s
CONS0.00n/a
MICP / NLP / CBS0.00n/a
FSOT is the fixed-sequence ablation, CONS the consensus baseline; MICP, NLP, and CBS are infeasible at scale. Simplified from Table I, RTX 3090. CLOT, Meng Guo's group (PKU), ICRA 2026.
100
robots: first feasible solution for the hardest STL split-and-merge in 971.5 s on one RTX 3090. All baselines fail at $N\ge 12$.
CLOT coordinated multi-robot trajectories realized over the STL horizon
CLOT realizes coordinated multi-robot trajectories across the STL horizon (1/15 to 15/15)
Real-world validation · story (7/7)

3-UAV hardware: planned in 2.91 s, formation tracked to ±0.1 m.

CLOT overall scenarios
30-robot split-and-merge (top); rendezvous + 3-UAV hardware (bottom)
3 quadcopter hardware setup
OptiTrack quadcopters hold the formation envelope
CLOT multi-drone formation animation
batched OT plans a connected, collision-free swarm

STL goal: $\varphi = \textcolor{#8a3f9c}{G}_{\textcolor{#2f7d8a}{[8,15]}}\,\textcolor{#557a3a}{\mu_{B}}$. Always between $t=8$ and $t=15\,\text{s}$, maintain the linear formation.

MPD / FlowMP · learned priors

Guidance is a score-function reducer, the warm end of the family.

FlowMP diffusion priors over trajectories
diffusion proposes many smooth candidates in parallel; cost-guided B-spline projection verifies (Carvalho, Nguyen, Le et al.)

Denoise, then guide

$$\textcolor{#2f7d8a}{z_{k-1}}=\textcolor{#8a3f9c}{D_{\theta}}(\textcolor{#2f7d8a}{z_k},c)\;-\;\eta\,\textcolor{#557a3a}{\nabla_z J}\!\bigl(\textcolor{#2f7d8a}{B_{\psi}}\textcolor{#2f7d8a}{z_k}\bigr)$$
  • $D_\theta$: learned prior, the Generate stage.
  • $\nabla_z J$: guidance, the score-function Reduce.
  • $B_\psi$: maps control latents to a smooth B-spline trajectory (learnable knots), so guidance acts on $C^2$, dynamically feasible curves.

The per-step $-\nabla_z J$ is the score $\nabla_z\log p(z)$ of the Boltzmann target $p\propto e^{-J}$ (the normalizer drops out), the score end of the temperature continuum.

T-RO 2025 (arXiv:2412.19948) · FlowMP, IROS 2025 (arXiv:2503.06135).
The payoff

Five planners, one batched kernel.

GTMP
$\min$
GTMP
tropical · $\lambda{\to}0$
MPOT
$-\lambda\log\sum e^{-\cdot/\lambda}$
MPOT
entropic
CLOT
nested soft-min
CLOT
OT $\otimes$ STL
MTP
$\mathbb{E}_{p_\eta}$
MTP push-T
Boltzmann
MPD
$\nabla\log\int e^{-J}$
MPD
score · warm

One Generate·Score·Reduce skeleton; each planner exposes the parallel work its hardware runs best.

Design principles

Five rules I keep returning to.

Generate many · Score in batch · Reduce with structure

  1. Expose independent work early.
  2. Replace pointer-heavy control flow with dense algebra.
  3. Separate global exploration from local connection.
  4. Treat diversity and optimality as different compute allocations.
  5. Make the proof object match the compute object.

What this is not

  • Not “GPU everywhere.”
  • Not abandoning geometry or proofs.
  • Not claiming one planner wins everywhere.

It is a way to design algorithms whose bottlenecks are visible, measurable, and schedulable.

Hardware-conscious algorithm sketch

A planner can be compiled into a heterogeneous robotics stack.

G
GPU lanedense kernels · SIMT
tensor graph construction batched connector calls (cuRobo · pRRTC) Sinkhorn / DP reductions STL robustness reductions
C
CPU laneorchestration · SIMD
hybrid sequence search class-indexed archives VAMP / FCIT* edge validation (Kavraki) JAX/XLA driver
E
Edge loopexecution · replan trigger
MPC / reactive tracker (Scaramuzza-style) execution monitor replan on infeasibility edge-cloud round-trip

The same plan flows across three schedulers: the right allocation of algorithm to computing hardware.

Open problems

Where I hope this community pushes next.

Lazy tensor graphs with guarantees

Skip most edges while proving homotopy-critical edges survive.

Class-aware extraction

If the graph contains many route families, enumerate them deliberately.

Connector-aware scheduling

Adapt $(M,N,s)$ online from measured $\bigl(q_s,\Lambda_s\bigr)$ profiles.

Compiler / runtime co-design

Should planners ship as XLA / JAX programs? Benchmark homotopy coverage and compute efficiency, not only first-solution time.

These are algorithm, architecture, and systems questions, not only motion-planning questions.

Take home

Parallelism is a design constraint, not an implementation detail.

01Representation
Lift graphs, automata, trajectories, and controls into batched tensor objects. The planner's data structure should already look like a kernel.
02Search
Replace one brittle serial loop with Generate · Score · Reduce: min, Sinkhorn, expectation, score, all structured reductions on one semiring continuum.
03Hardware
Memory layout, batch size, occupancy, and divergence are algorithm parameters, not deployment afterthoughts.

The next planning algorithms will be judged by the workloads they expose, not only the paths they return.

the math is too beautiful to ignore.

Thank you

Jan Peters
TU Darmstadt · IAS
Georgia Chalvatzaki
TU Darmstadt
Zachary Kingston
Purdue
Meng Guo
Peking University
Joe Watson
TU Darmstadt
João Carvalho
TU Darmstadt
Julen Urain
Amazon FAR
Kay Pompetzki
TU Darmstadt · IAS
VinUniversity· TU Darmstadt· IAS Lab· VinRobotics
QR code for anindex.github.io
questions · collaborations
anindex.github.io
an@robot-learning.de
papers · code
Backup · for Q&A

The full map: object, semiring, work, depth.

MethodGenerate · batch axesReduce · semiringOperator$W$$D$
GTMPgraph $(M,N)$min-plus$J_m(i)=\min_j[C_m(i,j)+J_{m+1}(j)]$$\Theta(MN^2)$$\Theta(M\log N)$
MPOTwaypoint $\times D_P$entropic$W^\star_\lambda=\arg\min\langle W,C\rangle-\lambda H(W)$$\Theta(Tnm)$$\Theta(T\log n)$
CLOTrobots $\times$ wp $\times D_P$OT $\otimes$ STLSinkhorn inner $\oplus$ tropical STL$\Theta(R(Tnm{+}|\Phi|T))$$\Theta(R\log nm)$
MTProllouts $\mathbb{R}^{B\times H\times m}$expectation$\pi^{k+1}=(1{-}\beta)\pi_{\text{loc}}+\beta\pi_{\text{glb}}$$\Theta(BH)$$\Theta(H{+}\log B)$
MPDlatents $(B,H)$score$z_{k-1}=D_\theta(z_k,c)-\eta\nabla_z J(B_\psi z_k)$$\Theta(KBc_\theta)$$\Theta(Kd_\theta)$

All $\oplus$ are associative $\Rightarrow$ all share the $+O(\log B)$ reduction tree; the prefactor is the per-method sequential scan.

Backup · the temperature family

One generating function. Its limit, gradient, and log-partition are the four reducers.

Proposition · one family, four faces
$$R_\lambda(v)=-\lambda\,\log\!\textstyle\sum_{a} e^{-v_a/\lambda},\qquad p_\lambda(v)_a=\frac{e^{-v_a/\lambda}}{\sum_b e^{-v_b/\lambda}}$$
(i)$$\lim_{\lambda\to 0^+}R_\lambda(v)=\textcolor{#8a3f9c}{\min_a v_a},\quad \min_a v_a-\lambda\log B\le R_\lambda\le \min_a v_a$$
(ii)$$\nabla_v R_\lambda(v)=p_\lambda(v)\ \Rightarrow\ \langle \nabla R_\lambda,\Phi\rangle=\textcolor{#557a3a}{\mathbb{E}_{a\sim p_\lambda}[\Phi_a]}$$
(iii)$$\nabla_\theta\log\!\textstyle\int e^{-J_\theta}=\textcolor{#8a3f9c}{\mathbb{E}_{p}[-\nabla_\theta J_\theta]}\quad(\text{log-partition gradient}\to\text{score})$$

Legendre dual: $R_\lambda(v)=\min_{p\in\Delta}\langle p,v\rangle-\lambda H(p)$, so “Sinkhorn = soft-min” is an identity, and $\lambda\to0$ is item (i).

candidate cost landscape v λ = 1 λ = 0.3 λ → 0 : min
soft-min collapses onto the hard min as the temperature cools

Limit = min (GTMP) · value = soft-min (MPOT/CLOT) · gradient = expectation (MTP) · log-partition gradient = score (MPD).

Backup · references

References.

  • GTMP · Le et al., Global Tensor Motion Planning. IEEE RA-L 2025. arXiv:2411.19393.
  • MPOT · Le et al., Accelerating Motion Planning via Optimal Transport. NeurIPS 2023. arXiv:2309.15970.
  • MTP · Le et al., Model Tensor Planning. TMLR 2025 / ICLR 2026. arXiv:2505.01059.
  • CLOT · Zhang, Zhang, Le, Guo (Peking University), collaborative optimal transport under STL. ICRA 2026.
  • MPD · Carvalho, Le, Kicki, Koert, Peters, Motion Planning Diffusion. IEEE T-RO 2025. arXiv:2412.19948.
  • FlowMP · Nguyen, Le et al., learning motion fields with conditional flow matching. IROS 2025. arXiv:2503.06135.
  • VAMP · Thomason, Kingston, Kavraki, motions in microseconds. ICRA 2024. arXiv:2309.14545.
  • FCIT* · Wilson et al., fully connected informed trees. ICRA 2025. arXiv:2411.17902.
  • pRRTC · Huang, Jadhav, Plancher, Kingston, GPU-parallel RRT-Connect. arXiv:2503.06757.
  • cpRRTC · constrained GPU-parallel RRT-Connect. arXiv:2505.06791.
  • cuRobo · Sundaralingam et al. (NVIDIA), CUDA-accelerated motion generation. curobo.org.
  • Trends · K. Rupp, 50 years of microprocessor trend data (CC BY 4.0).

Benchmark of record: MotionBenchMaker (Chamzas et al., 2022).