
Conversation

@Kenny-Vilella (Collaborator) commented Dec 15, 2025

This PR changes the way we do tree traversal from a per-level approach to a per-branch traversal. (Originally, the PR switched between the two depending on tree topology, using per-branch traversal only when the kinematic tree is deep and narrow; the switch was later dropped in favor of always using per-branch traversal.)

Perf improvement on my RTX 4080 (average of 5 runs):

| Environment | Step time (main) | Step time (this PR) | Δ % |
| --- | --- | --- | --- |
| AlohaPot | 691.5 | 681 | 1.5 |
| ApptronikApolloFlat | 674.6 | 632 | 6.3 |
| ApptronikApolloHfield | 3057.5 | 3032 | 0.8 |
| ApptronikApolloTerrain | 1940.6 | 1889.4 | 2.6 |
| FrankaEmikaPanda | 82.0 | 72.1 | 12.0 |
| Humanoid | 474.7 | 462.7 | 2.5 |
| ThreeHumanoids | 5951.5 | 5883.7 | 1.1 |
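For context, per-branch traversal needs per-branch index arrays (`branch_start`, `branch_length`, `branch_bodies` come up later in the review). This is a minimal sketch of how such arrays could be precomputed from a parent array; the function name and representation are illustrative, not the PR's actual precomputation code:

```python
def build_branches(parent):
    """Flatten a kinematic tree into root-to-leaf branches.

    parent[i] is the parent body of body i; parent[0] == -1 is the world.
    Returns (branch_start, branch_length, branch_bodies) so that branch b
    covers branch_bodies[branch_start[b] : branch_start[b] + branch_length[b]].
    """
    nbody = len(parent)
    children = [[] for _ in range(nbody)]
    for b in range(1, nbody):
        children[parent[b]].append(b)
    # Each leaf defines one branch: its path back to the root.
    leaves = [b for b in range(nbody) if not children[b]]
    branch_start, branch_length, branch_bodies = [], [], []
    for leaf in leaves:
        path = []
        b = leaf
        while b > 0:  # stop at the world body
            path.append(b)
            b = parent[b]
        path.reverse()  # ancestors first, so a thread can walk root-to-leaf
        branch_start.append(len(branch_bodies))
        branch_length.append(len(path))
        branch_bodies.extend(path)
    return branch_start, branch_length, branch_bodies
```

Note that bodies shared by several branches (common ancestors) appear in `branch_bodies` once per branch, which is exactly the redundant-write trade-off discussed in the review below.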

@Kenny-Vilella (Collaborator, Author):

@erikfrey @thowell This optimization adds some complexity to the codebase (one more switch), so I just wonder if in principle you would be in favor of such a change.

One thing is that I am not yet totally convinced about the backward tree traversal using segments.
In theory it should work great, but in practice I do not see a significant speedup, maybe less than 1% on the Franka benchmark.

This PR should work well with Alain's PR about quat/mat33, so I will wait for his PR to be merged and then update this one.
I will also reevaluate the usefulness of the backward tree traversal after that change.

@thowell (Collaborator) commented Dec 15, 2025

nice optimizations @Kenny-Vilella!

this would primarily improve performance for environments that contain chains with many dofs?

if that is the case, it would be interesting to see comparison benchmark results for:

  • a world with 1 chain and many dofs
  • a world with many chains each with many dofs.

@erikfrey (Collaborator):

@Kenny-Vilella your PRs are always a treat!

You’re right that this adds complexity - do we need both approaches? Did you actually find meaningful benchmark scenes for which your approach is a lot slower?

Basically I’m trying to figure out whether we really need both approaches, or should fully switch to yours, or keep the current one.

@Kenny-Vilella (Collaborator, Author):

I made some tests using a script that creates an articulation with N chains, each chain having M DoFs. This type of very orderly system introduces some bias, but it's good enough for our purpose.
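The actual test script is not shown in the thread, but a generator of this kind could be sketched as follows (hypothetical names; it just builds an MJCF string with N serial chains of M hinge joints each, which `mujoco.MjModel.from_xml_string` could then load):

```python
def make_chains_mjcf(num_chains, dofs_per_chain):
    """Build an MJCF string with `num_chains` serial chains,
    each with `dofs_per_chain` hinge joints, for benchmarking
    tree traversal on controlled topologies."""
    bodies = []
    for c in range(num_chains):
        inner = ""
        # Build each chain inside-out so the deepest body is nested last.
        for d in reversed(range(dofs_per_chain)):
            inner = (
                f'<body name="c{c}_b{d}" pos="0 0 -0.1">'
                f'<joint name="c{c}_j{d}" type="hinge" axis="0 1 0"/>'
                f'<geom type="capsule" size="0.02" fromto="0 0 0 0 0 -0.1"/>'
                f"{inner}</body>"
            )
        bodies.append(inner)
    return "<mujoco><worldbody>" + "".join(bodies) + "</worldbody></mujoco>"
```

Varying N and M independently gives the "wide and shallow" vs. "deep and narrow" topologies compared below.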

The first thing to mention is that separating the calculation of xipos and ximat sometimes has a huge impact in both cases.
For example, with 2 chains of 15 DoFs, the per-level approach's total time spent on the kinematics level goes from 420ms on main to 255ms + 23ms on my branch.
Because of this, it only makes sense to compare the two approaches within this PR, not against main.

For a chain of 64 DoFs, I can only see the perf improvement in Nsight, because qLD_acc totally dominates the compute time (another optimization opportunity, but quite challenging...).

  • Chain of 2 DoFs => 0% perf improvement
  • Chain of 4 DoFs => 10% perf improvement
  • Chain of 8 DoFs => 4% perf improvement
  • Chain of 16 DoFs => 7% perf improvement
  • Chain of 30 DoFs => 6% perf improvement

Surprisingly, even for 16 chains of 1 DoF, I see a negligible 1% perf improvement.
The only cases where I can measure significantly worse perf are those with 1-2 branches and 1-2 DoFs.
For example, -12% for 2 branches of 1 DoF.

For X chains of X DoFs, these are the results:

  • 2 => 3% perf decrease
  • 3 => 3% perf improvement
  • 4 => 1% perf improvement
  • 5 => 2% perf improvement
  • 6 => 4% perf improvement
  • 7 => 4% perf improvement
  • 8 => 3% perf improvement

I think it could make sense to switch entirely to this new approach.
I didn't expect it to perform so well in most cases.

@erikfrey (Collaborator):

@Kenny-Vilella OK! If we aren't seeing any reason to keep both, can you take a stab at changing this PR to only do per-branch traversal? Please focus on simplicity and legibility, as smooth.py is the first place many curious newcomers land when they are trying to understand the code.

Once you've done that let us know and we'll do a proper review. Thanks for finding this speedup!

@Kenny-Vilella (Collaborator, Author):

@erikfrey Thanks!
I am waiting for this PR to be merged and will then update this one.

@Kenny-Vilella (Collaborator, Author):

@erikfrey @thowell Updated this PR, since the mat33 <-> quat PR is on pause.

Latest numbers on my RTX 4080, median of 5 runs of the benchmark suite:

| Environment | Step time (main) | Step time (this PR) | Δ % |
| --- | --- | --- | --- |
| aloha_cloth | 3784938 | 3802677 | -0.5 |
| aloha_pot | 684.8 | 663.9 | 3.2 |
| aloha_sdf | 4365 | 4371 | -0.1 |
| apollo_flat | 651.1 | 608.9 | 6.9 |
| apollo_hfield | 1092 | 1052 | 3.8 |
| apollo_terrain | 1931 | 1878 | 2.8 |
| cloth | 3209902 | 3192023 | 0.6 |
| franka_emika_panda | 82.7 | 72.0 | 14.8 |
| humanoid | 478.7 | 463.7 | 3.2 |
| three_humanoids | 5752 | 5795 | -0.7 |

Negative numbers are within the standard deviation.

@erikfrey (Collaborator) left a comment:

Apologies in advance for the nitpickiness - this is touching the very early parts of our code that people are likely to read first, so trying to keep it as simple as possible.

One high level question: this seems to be changing two functions, kinematics and crb. Did you verify that they both lead to the speedup or is one more important than the other?

Want to be 100% sure we are only adding complexity here if it's helpful.

@Kenny-Vilella (Collaborator, Author):

I do enjoy some nitpickiness!

Let's start with the meaty comment.
Here are some numbers from Nsight on the two benchmarks that significantly improved, Apollo flat and Franka.
The kernels changed are:

  • _kinematics_level => to a branch approach
  • _cacc => to a branch approach
  • _comvel_level => to a branch approach
  • _subtree_com_acc => to a segment approach
  • _crb_accumulate => to a segment approach
  • _cfrc_backward => to a segment approach
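To illustrate the difference between the old per-level traversal and the branch approach used for the forward-propagation kernels above, here is a toy scalar model of forward propagation (plain Python with illustrative names, not the actual Warp kernels): per-level processes all bodies at a given depth before moving deeper, while per-branch lets one thread walk a whole root-to-leaf path.

```python
parent = [-1, 0, 1, 2, 0, 4]             # two chains hanging off body 0
local = [0.0, 1.0, 1.0, 1.0, 2.0, 2.0]   # per-body local offset (stand-in for xpos math)

def levels(parent):
    """Group body ids by depth: the per-level traversal order."""
    depth = [0] * len(parent)
    for b in range(1, len(parent)):
        depth[b] = depth[parent[b]] + 1
    out = [[] for _ in range(max(depth) + 1)]
    for b, d in enumerate(depth):
        out[d].append(b)
    return out

def forward_by_level(parent, local):
    x = [0.0] * len(parent)
    for level in levels(parent):          # one kernel launch per level
        for b in level:
            if parent[b] >= 0:
                x[b] = x[parent[b]] + local[b]
    return x

def forward_by_branch(parent, local):
    x = [0.0] * len(parent)
    children = [[] for _ in parent]
    for b in range(1, len(parent)):
        children[parent[b]].append(b)
    for leaf in (b for b in range(len(parent)) if not children[b]):
        path = []
        b = leaf
        while b > 0:
            path.append(b)
            b = parent[b]
        for b in reversed(path):          # one thread walks root-to-leaf
            x[b] = x[parent[b]] + local[b]
    return x
```

Both orders compute the same result; the difference is launch overhead and parallelism: per-level needs one synchronized pass per tree depth, while per-branch needs one thread per leaf, which is why deep, narrow trees benefit most.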

On the Franka:

| kernel | main (%) | main (avg us) | HEAD (%) | HEAD (avg us) |
| --- | --- | --- | --- | --- |
| _kinematics_level | 16.3 | 413 | 8.7 | 196.2 |
| _comvel_level | 6.3 | 161.6 | 4.8 | 107.3 |
| _cacc | 3.9 | 100.4 | 2.7 | 61.4 |
| _crb_accumulate | 3.0 | 78.0 | 2.5 | 57.4 |
| _cfrc_backward | 2.2 | 57.1 | 1.8 | 41.6 |
| _subtree_com_acc | 1.4 | 35.3 | 1.0 | 22.3 |

On the Apollo flat:

| kernel | main (%) | main (avg us) | HEAD (%) | HEAD (avg us) |
| --- | --- | --- | --- | --- |
| _kinematics_level | 6.2 | 325.1 | 2.4 | 115.8 |
| _cacc | 3.6 | 190.4 | 2.1 | 103.7 |
| _comvel_level | 2.2 | 116.8 | 1.4 | 68.4 |
| _cfrc_backward | 1.9 | 97.8 | 1.7 | 82.8 |
| _crb_accumulate | 1.4 | 73.3 | 1.3 | 61.6 |
| _subtree_com_acc | 0.7 | 34.3 | 0.5 | 26.6 |

I think it is better for the user to limit the number of tree traversal methods we use as much as possible.
So let's group the kernels into two: forward propagation (branch based) and backward propagation (segment based).
In both benchmarks, the forward propagation group accounts for ~85% of the perf improvement.

So if you want to limit the amount of change, I can revert the changes to the segment kernels, i.e., _cfrc_backward, _crb_accumulate, and _subtree_com_acc.

@erikfrey (Collaborator) left a comment:

Next round of comments - I promise we are converging, thanks for your patience!

```python
length = branch_length[branchid]

for i in range(length):
    bodyid = branch_bodies[start + i]
```
Collaborator:

Dumb question just for my own education:

because branches can share bodies (e.g. if two branches have the same ancestor), does that mean we have multiple threads writing to the same array indices?

e.g. we have two threads both writing to xpos_out[1] if they both share bodyid=1 in their respective branches?

this seems like excessive array writing, no? did you play with a version where we record an "owner" thread for each body, so that information only gets written once?

Collaborator (Author):

You are totally right, we are writing the same value to the same array multiple times.
The benefit of parallelizing the work is much higher than the cost of these excessive array writes.

I did not try that, but I can do some experiments. We would basically trade more reads for fewer writes.
We would need to be smart about which thread writes what.

Collaborator:

Yes, agreed - more reads, but from an nbody-sized array, versus more writes to an nworld * nbody-sized array!

So I think it would win, but then again, my intuition hit rate is not super high for CUDA code. :-)

Collaborator (Author):

I spent an hour or two on this and didn't manage to get any perf out of it.
I only tried it on the kinematic_branch kernel though.

At the end of the day, we are only as fast as the slowest thread.
My guess is that the write was not the bottleneck because we were overlapping it with other operations, so the change only added thread divergence, which makes the code slightly slower.

I tried three load-balancing strategies:

  • The shortest branch has priority for writing
  • First come, first served, i.e. following the branch creation order
  • Trying to equalize writing between threads
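The first come, first served strategy could be sketched like this in plain Python (illustrative only; the real kernel is Warp code, and `assign_owners` is a hypothetical name): each body gets exactly one owner branch, and at runtime only the owning thread would write that body's results.

```python
def assign_owners(branch_start, branch_length, branch_bodies, nbody):
    """First come, first served: the first branch (in creation order)
    that visits a body becomes its owner; only the owner thread
    writes that body's results, others just read them."""
    owner = [-1] * nbody  # -1: not visited by any branch (e.g. the world body)
    for b in range(len(branch_start)):
        s = branch_start[b]
        for i in range(branch_length[b]):
            body = branch_bodies[s + i]
            if owner[body] == -1:
                owner[body] = b
    return owner
```

The catch, as noted above, is that non-owner threads must then wait for the owner's write to land before reading shared ancestors, which introduces the divergence that ate the gains.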

Collaborator:

Interesting, thanks for trying. Warp perf is still so surprising sometimes!

```python
xpos = xpos_out[worldid, pid]
xquat = xquat_out[worldid, pid]
xmat = xmat_out[worldid, pid]
_compute_body_kinematics(
```
Collaborator:

nit: unless we plan to use _compute_body_kinematics elsewhere (probably not) let's just inline the logic into the kernel, keep it all in one place - the user has to do less mental bookkeeping on the list of arguments.

Collaborator (Author):

Oops, I forgot to revert it when removing the legacy method.

Since we are not moving forward with the mat33 -> quat PR, I also split the xmat calculation out into its own kernel.
We can squeeze a 1-2% gain out of this. I will update the benchmark numbers later.

Similarly, I removed the _compute_body_comvel function.
Note that we could also be smart and split out the calculation of cdof_dot, but it would have no perf gain in most cases.

```
block_dim: block dim options
body_tree: list of body ids by tree level
num_branches: number of branches (leaf-to-root paths)
branch_bodies: flattened body ids for all branches
```
Collaborator:

nit: let's use the style where the first word communicates the element we are dealing with, so maybe something like body_branches?

Collaborator (Author):

Are you sure about this?
It feels a bit weird to have body_branches and then branch_start.
Or should we rename branch_start to body_branch_start?

Collaborator:

that's correct, in general we try to make the first word reflect the id type, e.g. dof or jnt or body - so body_branch_start works!

Collaborator (Author):

So I changed both!

@Kenny-Vilella (Collaborator, Author) left a comment:

I still need to do the experiment with removing unnecessary writes and rerun the benchmark.

I reverted the segment_chain optimization.
It increases complexity significantly for a minor perf gain, so let's focus this PR on the most obvious gain.

I will keep investigating the segment_chain optimization after this PR is merged, and see if I can get a significant enough perf gain to justify the change.


@Kenny-Vilella (Collaborator, Author):

The latest benchmark numbers:

On the Franka:

| kernel | main (%) | main (avg us) | HEAD (%) | HEAD (avg us) |
| --- | --- | --- | --- | --- |
| _kinematics_level | 16.3 | 413 | 7.8 | 171.3 |
| _comvel_level | 6.3 | 161.6 | 4.8 | 107.3 |
| _cacc | 3.9 | 100.4 | 2.8 | 61.4 |

On the Apollo flat:

| kernel | main (%) | main (avg us) | HEAD (%) | HEAD (avg us) |
| --- | --- | --- | --- | --- |
| _kinematics_level | 6.2 | 325.1 | 2.4 | 119.7 |
| _cacc | 3.6 | 190.4 | 2.1 | 103.8 |
| _comvel_level | 2.2 | 116.8 | 1.2 | 60.8 |

Note that _kinematics_level also includes _compute_body_inertial_frames and _compute_body_matrices.
My previous post did not include _compute_body_inertial_frames.

@erikfrey (Collaborator) left a comment:

OK one last nit and we are good to merge. Thanks for your patience @Kenny-Vilella - I think this PR landed in a great place.


@Kenny-Vilella (Collaborator, Author):

@erikfrey Is it ready to be merged?
I don't have the sacred power to merge ^^

@erikfrey (Collaborator):

Looks great, thanks Kenny!

@erikfrey erikfrey merged commit d112bfa into google-deepmind:main Jan 23, 2026
9 checks passed
@Kenny-Vilella Kenny-Vilella deleted the dev/kvilella/switching_tree_traversal_per_branch branch January 23, 2026 03:01