
Conversation

@Kenny-Vilella (Collaborator) commented Dec 15, 2025

This PR changes the way we do tree traversal from a per-level approach to a per-branch traversal. (Originally, the PR switched between the two depending on tree topology, using per-branch traversal only when the kinematic tree is deep and narrow; the switch was later dropped in favor of always using per-branch traversal.)

Perf improvement on my RTX 4080 (average of 5 runs):

| Environment | Step time (main) | Step time (this PR) | Δ % |
| --- | --- | --- | --- |
| AlohaPot | 691.5 | 681 | 1.5 |
| ApptronikApolloFlat | 674.6 | 632 | 6.3 |
| ApptronikApolloHfield | 3057.5 | 3032 | 0.8 |
| ApptronikApolloTerrain | 1940.6 | 1889.4 | 2.6 |
| FrankaEmikaPanda | 82.0 | 72.1 | 12.0 |
| Humanoid | 474.7 | 462.7 | 2.5 |
| ThreeHumanoids | 5951.5 | 5883.7 | 1.1 |
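For context, per-branch traversal needs per-branch index arrays (`branch_start`, `branch_length`, `branch_bodies` come up later in the review). This is a minimal sketch of how such arrays could be precomputed from a parent array; the function name and representation are illustrative, not the PR's actual precomputation code:

```python
def build_branches(parent):
    """Flatten a kinematic tree into root-to-leaf branches.

    parent[i] is the parent body of body i; parent[0] == -1 is the world.
    Returns (branch_start, branch_length, branch_bodies) so that branch b
    covers branch_bodies[branch_start[b] : branch_start[b] + branch_length[b]].
    """
    nbody = len(parent)
    children = [[] for _ in range(nbody)]
    for b in range(1, nbody):
        children[parent[b]].append(b)
    # Each leaf defines one branch: its path back to the root.
    leaves = [b for b in range(nbody) if not children[b]]
    branch_start, branch_length, branch_bodies = [], [], []
    for leaf in leaves:
        path = []
        b = leaf
        while b > 0:  # stop at the world body
            path.append(b)
            b = parent[b]
        path.reverse()  # ancestors first, so a thread can walk root-to-leaf
        branch_start.append(len(branch_bodies))
        branch_length.append(len(path))
        branch_bodies.extend(path)
    return branch_start, branch_length, branch_bodies
```

Note that bodies shared by several branches (common ancestors) appear in `branch_bodies` once per branch, which is exactly the redundant-write trade-off discussed in the review below.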

@Kenny-Vilella (Collaborator, Author):

@erikfrey @thowell This optimization adds some complexity to the codebase (one more switch), so I just wonder if in principle you would be in favor of such a change.

One thing is that I am not yet totally convinced about the backward tree traversal using segments.
In theory it should work great, but in practice I do not see a significant speedup, maybe less than 1% on the Franka benchmark.

This PR should work well with Alain's PR about quat/mat33, so I will wait for his PR to be merged and then update this one.
I will also reevaluate the usefulness of the backward tree traversal after that change.

@thowell (Collaborator) commented Dec 15, 2025

nice optimizations @Kenny-Vilella!

this would primarily improve performance for environments that contain chains with many dofs?

if that is the case, it would be interesting to see comparison benchmark results for:

  • a world with 1 chain and many dofs
  • a world with many chains each with many dofs.

@erikfrey (Collaborator):

@Kenny-Vilella your PRs are always a treat!

You’re right that this adds complexity - do we need both approaches? Did you actually find meaningful benchmark scenes for which your approach is a lot slower?

Basically I’m trying to figure out whether we really need both approaches, or should fully switch to yours, or keep the current one.

@Kenny-Vilella (Collaborator, Author):

I made some tests using a script that creates an articulation with N chains, each chain having M DoFs. This type of very orderly system introduces some bias, but it's good enough for our purpose.
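The actual test script is not shown in the thread, but a generator of this kind could be sketched as follows (hypothetical names; it just builds an MJCF string with N serial chains of M hinge joints each, which `mujoco.MjModel.from_xml_string` could then load):

```python
def make_chains_mjcf(num_chains, dofs_per_chain):
    """Build an MJCF string with `num_chains` serial chains,
    each with `dofs_per_chain` hinge joints, for benchmarking
    tree traversal on controlled topologies."""
    bodies = []
    for c in range(num_chains):
        inner = ""
        # Build each chain inside-out so the deepest body is nested last.
        for d in reversed(range(dofs_per_chain)):
            inner = (
                f'<body name="c{c}_b{d}" pos="0 0 -0.1">'
                f'<joint name="c{c}_j{d}" type="hinge" axis="0 1 0"/>'
                f'<geom type="capsule" size="0.02" fromto="0 0 0 0 0 -0.1"/>'
                f"{inner}</body>"
            )
        bodies.append(inner)
    return "<mujoco><worldbody>" + "".join(bodies) + "</worldbody></mujoco>"
```

Varying N and M independently gives the "wide and shallow" vs. "deep and narrow" topologies compared below.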

The first thing to mention is that separating the calculation of xipos and ximat sometimes has a huge impact in both cases.
For example, with 2 chains of 15 DoFs, the per-level approach's total time spent on the kinematics level goes from 420ms on main to 255ms + 23ms on my branch.
Because of this, it only makes sense to compare the two approaches within this PR, not against main.

For a chain of 64 DoFs, I can only see the perf improvement in Nsight, because qLD_acc totally dominates the compute time (another optimization opportunity, but quite challenging...).

  • Chain of 2 DoFs => 0% perf improvement
  • Chain of 4 DoFs => 10% perf improvement
  • Chain of 8 DoFs => 4% perf improvement
  • Chain of 16 DoFs => 7% perf improvement
  • Chain of 30 DoFs => 6% perf improvement

Surprisingly, even for 16 chains of 1 DoF, I see a negligible 1% perf improvement.
The only cases where I can measure significantly worse perf are those with 1-2 branches and 1-2 DoFs.
For example, -12% for 2 branches of 1 DoF.

For X chains of X DoFs, these are the results:

  • 2 => 3% perf decrease
  • 3 => 3% perf improvement
  • 4 => 1% perf improvement
  • 5 => 2% perf improvement
  • 6 => 4% perf improvement
  • 7 => 4% perf improvement
  • 8 => 3% perf improvement

I think it could make sense to switch entirely to this new approach.
I didn't expect it to perform so well in most cases.

@erikfrey (Collaborator):

@Kenny-Vilella OK! If we aren't seeing any reason to keep both, can you take a stab at changing this PR to only do per-branch traversal? Please focus on simplicity and legibility, as smooth.py is the first place many curious newcomers land when they are trying to understand the code.

Once you've done that let us know and we'll do a proper review. Thanks for finding this speedup!

@Kenny-Vilella (Collaborator, Author):

@erikfrey Thanks!
I am waiting for this PR to be merged and will then update this one.

@Kenny-Vilella (Collaborator, Author):

@erikfrey @thowell Updated this PR, since the mat33 <-> quat PR is on pause.

Latest numbers on my RTX 4080, median of 5 runs of the benchmark suite:

| Environment | Step time (main) | Step time (this PR) | Δ % |
| --- | --- | --- | --- |
| aloha_cloth | 3784938 | 3802677 | -0.5 |
| aloha_pot | 684.8 | 663.9 | 3.2 |
| aloha_sdf | 4365 | 4371 | -0.1 |
| apollo_flat | 651.1 | 608.9 | 6.9 |
| apollo_hfield | 1092 | 1052 | 3.8 |
| apollo_terrain | 1931 | 1878 | 2.8 |
| cloth | 3209902 | 3192023 | 0.6 |
| franka_emika_panda | 82.7 | 72.0 | 14.8 |
| humanoid | 478.7 | 463.7 | 3.2 |
| three_humanoids | 5752 | 5795 | -0.7 |

Negative numbers are within the standard deviation.

@erikfrey (Collaborator) left a comment:

Apologies in advance for the nitpickiness - this is touching the very early parts of our code that people are likely to read first, so trying to keep it as simple as possible.

One high level question: this seems to be changing two functions, kinematics and crb. Did you verify that they both lead to the speedup or is one more important than the other?

Want to be 100% sure we are only adding complexity here if it's helpful.

@Kenny-Vilella (Collaborator, Author):

I do enjoy some nitpickiness!

Let's start with the meaty comment.
Here are some numbers from Nsight on the two benchmarks that significantly improved, Apollo flat and Franka.
The kernels changed are:

  • _kinematics_level => to a branch approach
  • _cacc => to a branch approach
  • _comvel_level => to a branch approach
  • _subtree_com_acc => to a segment approach
  • _crb_accumulate => to a segment approach
  • _cfrc_backward => to a segment approach
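To illustrate the difference between the old per-level traversal and the branch approach used for the forward-propagation kernels above, here is a toy scalar model of forward propagation (plain Python with illustrative names, not the actual Warp kernels): per-level processes all bodies at a given depth before moving deeper, while per-branch lets one thread walk a whole root-to-leaf path.

```python
parent = [-1, 0, 1, 2, 0, 4]             # two chains hanging off body 0
local = [0.0, 1.0, 1.0, 1.0, 2.0, 2.0]   # per-body local offset (stand-in for xpos math)

def levels(parent):
    """Group body ids by depth: the per-level traversal order."""
    depth = [0] * len(parent)
    for b in range(1, len(parent)):
        depth[b] = depth[parent[b]] + 1
    out = [[] for _ in range(max(depth) + 1)]
    for b, d in enumerate(depth):
        out[d].append(b)
    return out

def forward_by_level(parent, local):
    x = [0.0] * len(parent)
    for level in levels(parent):          # one kernel launch per level
        for b in level:
            if parent[b] >= 0:
                x[b] = x[parent[b]] + local[b]
    return x

def forward_by_branch(parent, local):
    x = [0.0] * len(parent)
    children = [[] for _ in parent]
    for b in range(1, len(parent)):
        children[parent[b]].append(b)
    for leaf in (b for b in range(len(parent)) if not children[b]):
        path = []
        b = leaf
        while b > 0:
            path.append(b)
            b = parent[b]
        for b in reversed(path):          # one thread walks root-to-leaf
            x[b] = x[parent[b]] + local[b]
    return x
```

Both orders compute the same result; the difference is launch overhead and parallelism: per-level needs one synchronized pass per tree depth, while per-branch needs one thread per leaf, which is why deep, narrow trees benefit most.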

On the Franka:

| kernel | main (%) | main (avg us) | HEAD (%) | HEAD (avg us) |
| --- | --- | --- | --- | --- |
| _kinematics_level | 16.3 | 413 | 8.7 | 196.2 |
| _comvel_level | 6.3 | 161.6 | 4.8 | 107.3 |
| _cacc | 3.9 | 100.4 | 2.7 | 61.4 |
| _crb_accumulate | 3.0 | 78.0 | 2.5 | 57.4 |
| _cfrc_backward | 2.2 | 57.1 | 1.8 | 41.6 |
| _subtree_com_acc | 1.4 | 35.3 | 1.0 | 22.3 |

On the Apollo flat:

| kernel | main (%) | main (avg us) | HEAD (%) | HEAD (avg us) |
| --- | --- | --- | --- | --- |
| _kinematics_level | 6.2 | 325.1 | 2.4 | 115.8 |
| _cacc | 3.6 | 190.4 | 2.1 | 103.7 |
| _comvel_level | 2.2 | 116.8 | 1.4 | 68.4 |
| _cfrc_backward | 1.9 | 97.8 | 1.7 | 82.8 |
| _crb_accumulate | 1.4 | 73.3 | 1.3 | 61.6 |
| _subtree_com_acc | 0.7 | 34.3 | 0.5 | 26.6 |

I think it is better for the user to limit the number of tree traversal methods we use as much as possible.
So let's group the kernels into two: forward propagation (branch based) and backward propagation (segment based).
In both benchmarks, the forward propagation group accounts for ~85% of the perf improvement.

So if you want to limit the amount of change, I can revert the changes to the segment kernels, i.e., _cfrc_backward, _crb_accumulate, and _subtree_com_acc.

@erikfrey (Collaborator) left a comment:

Next round of comments - I promise we are converging, thanks for your patience!

```python
length = branch_length[branchid]

for i in range(length):
    bodyid = branch_bodies[start + i]
```
Collaborator:

Dumb question just for my own education:

because branches can share bodies (e.g. if two branches have the same ancestor), does that mean we have multiple threads writing to the same array indices?

e.g. we have two threads both writing to xpos_out[1] if they both share bodyid=1 in their respective branches?

this seems like excessive array writing, no? did you play with a version where we record an "owner" thread for each body, so that information only gets written once?

Collaborator (Author):

You are totally right, we are writing the same value to the same array multiple times.
The benefit of parallelizing the work is much higher than the cost of these excessive array writes.

I did not try that, but I can do some experiments. We would basically trade more reads for fewer writes.
We would need to be smart about which thread writes what.

Collaborator:

Yes, agreed - more reads, but from an nbody-sized array, versus more writes to an nworld * nbody-sized array!

So I think it would win, but then again, my intuition hit rate is not super high for CUDA code. :-)

Collaborator (Author):

I spent an hour or two on this and didn't manage to get any perf out of it.
I only tried it on the kinematic_branch kernel though.

At the end of the day, we are only as fast as the slowest thread.
My guess is that the write was not the bottleneck because we were overlapping it with other operations, so the change only added thread divergence, which makes the code slightly slower.

I tried three load-balancing strategies:

  • The shortest branch has priority for writing
  • First come, first served, i.e. following the branch creation order
  • Trying to equalize writing between threads
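The first come, first served strategy could be sketched like this in plain Python (illustrative only; the real kernel is Warp code, and `assign_owners` is a hypothetical name): each body gets exactly one owner branch, and at runtime only the owning thread would write that body's results.

```python
def assign_owners(branch_start, branch_length, branch_bodies, nbody):
    """First come, first served: the first branch (in creation order)
    that visits a body becomes its owner; only the owner thread
    writes that body's results, others just read them."""
    owner = [-1] * nbody  # -1: not visited by any branch (e.g. the world body)
    for b in range(len(branch_start)):
        s = branch_start[b]
        for i in range(branch_length[b]):
            body = branch_bodies[s + i]
            if owner[body] == -1:
                owner[body] = b
    return owner
```

The catch, as noted above, is that non-owner threads must then wait for the owner's write to land before reading shared ancestors, which introduces the divergence that ate the gains.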

Collaborator:

Interesting, thanks for trying. Warp perf is still so surprising sometimes!

```python
xpos = xpos_out[worldid, pid]
xquat = xquat_out[worldid, pid]
xmat = xmat_out[worldid, pid]
_compute_body_kinematics(
```
Collaborator:

nit: unless we plan to use _compute_body_kinematics elsewhere (probably not) let's just inline the logic into the kernel, keep it all in one place - the user has to do less mental bookkeeping on the list of arguments.

Collaborator (Author):

Oops, I forgot to revert it when removing the legacy method.

Since we are not moving forward with the mat33 -> quat PR, I also split the xmat calculation out into its own kernel.
We can squeeze a 1-2% gain out of this. I will update the benchmark numbers later.

Similarly, I removed the _compute_body_comvel function.
Note that we could also be smart and split out the calculation of cdof_dot, but it would have no perf gain in most cases.

```
block_dim: block dim options
body_tree: list of body ids by tree level
num_branches: number of branches (leaf-to-root paths)
branch_bodies: flattened body ids for all branches
```
Collaborator:

nit: let's use the style where the first word communicates the element we are dealing with, so maybe something like body_branches?

Collaborator (Author):

Are you sure about this?
It feels a bit weird to have body_branches and then branch_start.
Or should we rename branch_start to body_branch_start?

Collaborator:

that's correct, in general we try to make the first word reflect the id type, e.g. dof or jnt or body - so body_branch_start works!

Collaborator (Author):

So I changed both!

@Kenny-Vilella (Collaborator, Author) left a comment:

I still need to do the experiment with removing unnecessary writes and rerun the benchmark.

I reverted the segment_chain optimization.
It increases complexity significantly for a minor perf gain, so let's focus this PR on the most obvious gain.

I will keep investigating the segment_chain optimization after this PR is merged, and see if I can get a significant enough perf gain to justify the change.


@Kenny-Vilella (Collaborator, Author):

The latest benchmark numbers:

On the Franka:

| kernel | main (%) | main (avg us) | HEAD (%) | HEAD (avg us) |
| --- | --- | --- | --- | --- |
| _kinematics_level | 16.3 | 413 | 7.8 | 171.3 |
| _comvel_level | 6.3 | 161.6 | 4.8 | 107.3 |
| _cacc | 3.9 | 100.4 | 2.8 | 61.4 |

On the Apollo flat:

| kernel | main (%) | main (avg us) | HEAD (%) | HEAD (avg us) |
| --- | --- | --- | --- | --- |
| _kinematics_level | 6.2 | 325.1 | 2.4 | 119.7 |
| _cacc | 3.6 | 190.4 | 2.1 | 103.8 |
| _comvel_level | 2.2 | 116.8 | 1.2 | 60.8 |

Note that _kinematics_level also includes _compute_body_inertial_frames and _compute_body_matrices.
My previous post did not include _compute_body_inertial_frames.

@erikfrey (Collaborator) left a comment:

OK one last nit and we are good to merge. Thanks for your patience @Kenny-Vilella - I think this PR landed in a great place.


@Kenny-Vilella (Collaborator, Author):

@erikfrey Is it ready to be merged?
I don't have the sacred power to merge ^^

@erikfrey (Collaborator):

Looks great, thanks Kenny!

@erikfrey erikfrey merged commit d112bfa into google-deepmind:main Jan 23, 2026
9 checks passed
@Kenny-Vilella Kenny-Vilella deleted the dev/kvilella/switching_tree_traversal_per_branch branch January 23, 2026 03:01