Optimization: switching tree traversal per branch #923
Conversation
@erikfrey @thowell This optimization adds some complexity to the codebase (one more switch), so I just wonder if, in principle, you would be in favor of such a change. One thing is that I am not yet totally convinced about the backward tree traversal using segments. This PR should work well with Alain's PR about quat/mat33, so I will wait for his PR to be merged and then update this one.
nice optimizations @Kenny-Vilella! this would primarily improve performance for environments that contain chains with many dofs? if that is the case, it would be interesting to see comparison benchmark results for:
@Kenny-Vilella your PRs are always a treat! You’re right that this adds complexity - do we need both approaches? Did you actually find meaningful benchmark scenes for which your approach is a lot slower? Basically I’m trying to justify whether we really need both approaches, or just fully switch to yours, or keep the current one.
I made some tests using a script that creates an articulation with N chains, each chain having M DoFs. This type of very orderly system introduces some bias, but let's say it is good enough for our purpose. First thing to mention is that separating the calculation of xipos and ximat sometimes has a huge impact in both cases. For a chain of 64 DoFs, I can only see a perf improvement in Nsight because qLD_acc totally dominates the compute time (another potential optimization, but quite challenging...). Surprisingly, even for 16 chains of 1 DoF, I see a negligible 1% perf improvement. For X chains of X DoFs, these are the results:
I think it could make sense to totally switch to this new approach.
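For reference, here is a minimal sketch of how such a synthetic benchmark model could be generated. This is not the actual script from this PR; the helper name, geometry, and joint setup are illustrative assumptions, only the "N chains of M hinge DoFs" structure comes from the comment above.

```python
import mujoco


def chain_model_xml(n_chains: int, n_dofs: int) -> str:
  """Builds an MJCF string with n_chains independent serial chains,
  each chain a stack of n_dofs hinge-jointed capsule bodies."""
  chains = []
  for c in range(n_chains):
    body = ""
    for d in range(n_dofs):
      # first body of each chain is offset sideways, the rest hang below it
      pos = f"{0.3 * c} 0 1.0" if d == 0 else "0 0 -0.1"
      body += (
        f'<body name="c{c}_b{d}" pos="{pos}">'
        '<joint type="hinge" axis="1 0 0"/>'
        '<geom type="capsule" size="0.02" fromto="0 0 0 0 0 -0.1"/>'
      )
    chains.append(body + "</body>" * n_dofs)
  return "<mujoco><worldbody>" + "".join(chains) + "</worldbody></mujoco>"


# e.g. 16 chains of 1 DoF, or 1 chain of 64 DoFs
model = mujoco.MjModel.from_xml_string(chain_model_xml(16, 1))
```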
@Kenny-Vilella OK! If we aren't seeing any reason to keep both, can you take a stab at changing this PR to only do per-branch traversal? Please focus on simplicity and legibility. Once you've done that, let us know and we'll do a proper review. Thanks for finding this speedup!
@erikfrey @thowell Updated this PR, since the mat33 <-> quat PR is on pause. Latest numbers on my RTX4080, median of 5 runs of the benchmark suite:
Negative numbers are within the standard deviation.
erikfrey
left a comment
Apologies in advance for the nitpickiness - this is touching the very early parts of our code that people are likely to read first, so trying to keep it as simple as possible.
One high level question: this seems to be changing two functions, kinematics and crb. Did you verify that they both lead to the speedup or is one more important than the other?
Want to be 100% sure we are only adding complexity here if it's helpful.
I do enjoy some nitpickiness! Let's start with the meaty comment.
On the Franka:
On the apollo flat:
I think it is better for the user to limit as much as possible the number of tree traversal methods we are using. So if you want to limit the amount of change, I can revert the change to the segment kernels, i.e.,
erikfrey
left a comment
Next round of comments - I promise we are converging, thanks for your patience!
mujoco_warp/_src/smooth.py
Outdated
length = branch_length[branchid]

for i in range(length):
  bodyid = branch_bodies[start + i]
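For context, here is a rough sketch of the per-branch traversal pattern this loop belongs to, assuming one Warp thread per (world, branch), arrays named as in the snippet above, and that index 0 of the output arrays holds the world frame. The joint handling is omitted and the exact signature is a guess, not the PR's kernel.

```python
import warp as wp


@wp.kernel
def kinematics_branch_sketch(
  branch_start: wp.array(dtype=int),
  branch_length: wp.array(dtype=int),
  branch_bodies: wp.array(dtype=int),
  body_parentid: wp.array(dtype=int),
  body_pos: wp.array(dtype=wp.vec3),
  body_quat: wp.array(dtype=wp.quat),
  # outputs, sized (nworld, nbody); index 0 is assumed to hold the world frame
  xpos_out: wp.array2d(dtype=wp.vec3),
  xquat_out: wp.array2d(dtype=wp.quat),
):
  worldid, branchid = wp.tid()
  start = branch_start[branchid]
  length = branch_length[branchid]
  # walk the branch root -> leaf so the parent frame computed by this same
  # thread is always available; bodies shared by several branches are
  # recomputed (and rewritten) by every branch that contains them
  for i in range(length):
    bodyid = branch_bodies[start + i]
    pid = body_parentid[bodyid]
    xquat = wp.mul(xquat_out[worldid, pid], body_quat[bodyid])
    xpos = xpos_out[worldid, pid] + wp.quat_rotate(xquat_out[worldid, pid], body_pos[bodyid])
    xpos_out[worldid, bodyid] = xpos
    xquat_out[worldid, bodyid] = xquat
```

Launched with `wp.launch(..., dim=(nworld, num_branches), ...)`, every leaf-to-root path becomes one independent thread instead of one synchronized pass per tree level; joint rotations are skipped here purely to keep the sketch short.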
Dumb question just for my own education:
because branches can share bodies (e.g. if two branches have the same ancestor), does that mean we have multiple threads writing to the same array indices?
e.g. we have two threads both writing to xpos_out[1] if they both share bodyid=1 in their respective branches?
this seems like excessive array writing, no? did you play with a version where we record an "owner" thread for each body, so that information only gets written once?
You are totally right, we are writing the same value to the same array multiple times.
The benefit of parallelizing the work is much higher than the cost of these excessive array writes.
I did not try it, but I can do some experiments. We will basically trade more reads for fewer writes.
We will need to be smart about which thread writes what.
Yes, agreed - more reads, but from an nbody-sized array, vs. more writes to an nworld * nbody-sized array!
So I think it would win but then again, my intuition hit rate is not super high for cuda code. :-)
Spent an hour or two on this and didn't manage to get any perf out of it.
Only tried on the kinematic_branch kernel though.
At the end of the day we are as slow as the slowest threads.
My guess is that the write was not the bottleneck because we were overlapping it with other operations, so the change only added thread divergence, which makes the code slightly slower.
Tried three load-balancing strategies (a rough sketch of the first-come-first-served variant follows the list):
- Shortest branch has priority for writing
- First come, first served, i.e. following the branch creation order
- Tried to equalize writing between threads
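For illustration, here is a minimal sketch of the first-come-first-served variant as a host-side precompute. The function name and data layout are hypothetical, not code from this PR; it only shows the "owner thread per body" idea discussed above.

```python
import numpy as np


def branch_write_owner(branches: list[list[int]], nbody: int) -> np.ndarray:
  """Assigns each body to the first branch (in creation order) that contains it.
  A kernel could then skip the write whenever owner[bodyid] != branchid,
  trading a few extra reads of an nbody-sized array for fewer writes to the
  nworld * nbody-sized outputs."""
  owner = np.full(nbody, -1, dtype=np.int32)
  for branchid, bodies in enumerate(branches):
    for bodyid in bodies:
      if owner[bodyid] == -1:
        owner[bodyid] = branchid
  return owner
```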
Interesting, thanks for trying. Warp perf is still so surprising sometimes!
mujoco_warp/_src/smooth.py
Outdated
xpos = xpos_out[worldid, pid]
xquat = xquat_out[worldid, pid]
xmat = xmat_out[worldid, pid]
_compute_body_kinematics(
nit: unless we plan to use _compute_body_kinematics elsewhere (probably not) let's just inline the logic into the kernel, keep it all in one place - the user has to do less mental bookkeeping on the list of arguments.
Oops, forgot to revert it when removing the legacy method.
Since we are not moving forward with the mat33 -> quat PR, I also split out the xmat calculation into a separate kernel.
We can squeeze a 1-2% gain out of this. Will update the benchmark numbers later.
Similarly, removed the _compute_body_comvel function.
Note that we could also be smart and split out the calculation of cdof_dot.
But it would have no perf gain in most cases.
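As a rough illustration of splitting the xmat calculation into its own kernel (an assumed shape, not the actual PR code): the quat-to-matrix conversion is purely elementwise, so it can parallelize over every (world, body) pair with no tree dependency.

```python
import warp as wp


@wp.kernel
def xmat_from_xquat_sketch(
  xquat_in: wp.array2d(dtype=wp.quat),
  # output
  xmat_out: wp.array2d(dtype=wp.mat33),
):
  # one thread per (world, body); no traversal needed for this step
  worldid, bodyid = wp.tid()
  xmat_out[worldid, bodyid] = wp.quat_to_matrix(xquat_in[worldid, bodyid])
```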
mujoco_warp/_src/types.py
Outdated
block_dim: block dim options
body_tree: list of body ids by tree level
num_branches: number of branches (leaf-to-root paths)
branch_bodies: flattened body ids for all branches
nit: let's use the style where the first word communicates the element we are dealing with, so maybe something like body_branches?
Are you sure about this?
It feels a bit weird to have body_branches and then branch_start.
Or should we rename branch_start to body_branch_start?
that's correct, in general we try to make the first word reflect the id type, e.g. dof or jnt or body - so body_branch_start works!
So I changed both!
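For completeness, here is a sketch of how the branch arrays listed in the types.py snippet could be built from the body parent table. This is an illustrative reconstruction, not the PR's builder, and it uses the pre-rename names (branch_start etc.) rather than the final body_branch_* ones.

```python
import numpy as np


def build_branches(body_parentid: np.ndarray):
  """Builds flattened per-branch arrays from the body parent table.
  Returns (branch_start, branch_length, branch_bodies); bodies within a
  branch are stored root -> leaf so forward kinematics can walk them in order."""
  nbody = body_parentid.shape[0]
  has_child = np.zeros(nbody, dtype=bool)
  has_child[body_parentid[1:]] = True          # body 0 is the world body
  leaves = [b for b in range(1, nbody) if not has_child[b]]

  starts, lengths, bodies = [], [], []
  for leaf in leaves:                          # one branch per leaf-to-root path
    path = []
    b = leaf
    while b != 0:                              # stop before the world body
      path.append(b)
      b = body_parentid[b]
    starts.append(len(bodies))
    lengths.append(len(path))
    bodies.extend(reversed(path))              # store in root -> leaf order

  return np.asarray(starts), np.asarray(lengths), np.asarray(bodies)
```

Here num_branches is just the number of leaves; bodies shared by several branches (common ancestors) simply appear in each branch's segment, which is what the shared-write discussion above is about.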
Kenny-Vilella
left a comment
Still need to do the experiment with removing unnecessary writes and rerun the benchmark.
I reverted the segment_chain optimization.
It increases the complexity significantly for a minor perf gain, so let's focus this PR on the most obvious gain.
I will keep investigating the segment_chain optimization after this PR is merged and see if I can get a significant enough perf gain to justify the change.
The last benchmark results.
On the Franka:
On the apollo flat:
Note that
erikfrey
left a comment
OK one last nit and we are good to merge. Thanks for your patience @Kenny-Vilella - I think this PR landed in a great place.
@erikfrey Is it ready to be merged?
Looks great, thanks Kenny!
This PR changes the way we do tree traversal depending on the tree topology. In cases where the kinematic tree is deep and narrow, the tree traversal is switched to a per-branch traversal.
This PR changes the way we do tree traversal from a per-level approach to a per-branch traversal.
Perf improvement on my RTX4080 (average of 5 runs):