Skip to content

Optimization: replace mat33 with wp.quat, move to quaternion math#902

Open
adenzler-nvidia wants to merge 116 commits intogoogle-deepmind:mainfrom
adenzler-nvidia:dev/adenzler/matrices
Open

Optimization: replace mat33 with wp.quat, move to quaternion math#902
adenzler-nvidia wants to merge 116 commits intogoogle-deepmind:mainfrom
adenzler-nvidia:dev/adenzler/matrices

Conversation

@adenzler-nvidia
Copy link
Collaborator

Sorry for the big PR. I hope this does not have too many downstream changes.

This is giving us decent speedups in the early parts of the pipeline.

Numbers on an RTX Pro 6000 Blackwell:

Environment Steps/s (main) Steps/s (this PR) Δ steps/s Δ %
humanoid 3,590,377 3,658,177 67,800 1.9%
n_humanoids 524,477 529,471 4,994 1.0%
aloha_pot (lift pot) 2,382,858 2,427,237 44,379 1.9%
Aloha SDF 525,527 536,578 11,051 2.1%
Apollo flat 2,576,095 2,629,018 52,923 2.1%
Apollo terrain 1,005,948 1,012,418 6,470 0.6%
Franka 13,974,922 15,055,237 1,080,315 7.7%
Kitchen G1 27,167 27,450 283 1.0%
Allegro Hand 6,254,950 6,508,127 253,177 4.1%
G1 with hands 435,392 439,071 3,679 0.8%

Signed-off-by: Alain Denzler <[email protected]>
Signed-off-by: Alain Denzler <[email protected]>
Signed-off-by: Alain Denzler <[email protected]>
Signed-off-by: Alain Denzler <[email protected]>
Signed-off-by: Alain Denzler <[email protected]>
Signed-off-by: Alain Denzler <[email protected]>
Signed-off-by: Alain Denzler <[email protected]>
Signed-off-by: Alain Denzler <[email protected]>
Signed-off-by: Alain Denzler <[email protected]>
Signed-off-by: Alain Denzler <[email protected]>
Signed-off-by: Alain Denzler <[email protected]>
Signed-off-by: Alain Denzler <[email protected]>
Signed-off-by: Alain Denzler <[email protected]>
Signed-off-by: Alain Denzler <[email protected]>
Signed-off-by: Alain Denzler <[email protected]>
Signed-off-by: Alain Denzler <[email protected]>
Signed-off-by: Alain Denzler <[email protected]>
Signed-off-by: Alain Denzler <[email protected]>
Signed-off-by: Alain Denzler <[email protected]>
Signed-off-by: Alain Denzler <[email protected]>
Signed-off-by: Alain Denzler <[email protected]>
Signed-off-by: Alain Denzler <[email protected]>
Signed-off-by: Alain Denzler <[email protected]>
Signed-off-by: Alain Denzler <[email protected]>
Signed-off-by: Alain Denzler <[email protected]>
Signed-off-by: Alain Denzler <[email protected]>
Signed-off-by: Alain Denzler <[email protected]>
Signed-off-by: Alain Denzler <[email protected]>
Signed-off-by: Alain Denzler <[email protected]>
@erikfrey
Copy link
Collaborator

Given the intention is to remove the xmat fields and replace them with xquat, really want to hear from @kevinzakka and @StafaH - can you both let us know whether this API change will be a big hassle on your end or not too bad?

@StafaH
Copy link
Collaborator

StafaH commented Dec 26, 2025

Downstream we would need to update the code, but it will be a straightforward task, and the performance gains will be worth it. I suggest moving forward with this change.

Rendering is a memory heavy task so more compact representations + better memory load/store will have a large impact.

@kevinzakka
Copy link
Collaborator

Hi all, we only use geom_xmat and site_xmat in two places, and both immediately convert to quaternions anyway, so native quaternion fields would actually simplify our code. ximat isn't used at all.

@adenzler-nvidia
Copy link
Collaborator Author

Nice, thanks for the info.

@StafaH does the batch renderer use the mujoco camera system? Should I include the change from cam_xmat to cam_xquat as well in this change? Otherwise we can also make it a follow-up, but maybe it's a good idea to batch potential API changes in the same PR.

@erikfrey erikfrey changed the title Optimization: replace mat33 with wp.quat, move to quaternion math Optimization: replace mat33 with wp.quat, move to quaternion math. Jan 5, 2026
@erikfrey erikfrey changed the title Optimization: replace mat33 with wp.quat, move to quaternion math. Optimization: replace mat33 with wp.quat, move to quaternion math Jan 5, 2026
Copy link
Collaborator

@erikfrey erikfrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, just one question and one nit. Let's wait to hear from @StafaH and then I can help you get this merged.

collision_pairid: ids from broadphase (naconmax, 2)
collision_worldid: collision world ids from broadphase (naconmax,)
ncollision: collision count from broadphase (1,)
xiquat: Cartesian orientation of body inertia (nworld, nbody, 4)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: looks like shape info column is misaligned by one space

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, fixed.

@adenzler-nvidia
Copy link
Collaborator Author

Now I remember why I didn't change from mat to quat in that 1 place - box_edge failing due to numerical (rounding) issues because floating point math is different now. I adjusted 1 threshold and all the tests are ok now, but I'm not sure if I'm allowed to make that change.

@adenzler-nvidia
Copy link
Collaborator Author

@erikfrey should be good to go now!

@thowell
Copy link
Collaborator

thowell commented Jan 13, 2026

@adenzler-nvidia

following some discussion with @yuvaltassa and @kbayes, we are thinking to hold off on merging this pr for now. additional considerations we have in mind:

  • @kbayes has an idea to reformulate parts of the narrowphase that would probably make sense to compare to the changes in this pr
  • it probably makes sense to benchmark a scene that utilizes the renderer with the proposed changes @StafaH
  • there is also a question about how these changes would interact with potential future memory alignment features in warp

@adenzler-nvidia
Copy link
Collaborator Author

Thanks for the update. I do understand the concerns, however I would like to provide a rebuttal:

  1. The main performance advantages of this refactor are in kinematics, and it's unlikely that narrowphase changes would change the picture significantly. I did make some microbenchmarks and quats won every case I tested.
  2. We did make the change in the newton tiled renderer and saw a 6% speedup. Happy to help evaluating the MjWarp tiled renderer.
  3. Fair question. Quats are going to benefit much more from any alignment hints as they are going to be 16B aligned. The compiler can generate 16B loads per quat, and everything is going to be coalesced. Essentially, that means that 32 threads of a warp can load a quat using 1 memory transaction. This is going to be much worse for Mat33. The struct is 36B, but because consecutive Mat33s aren't on 16B boundaries, this will compile down to 9 individual 4B loads. These 4B loads then are also not contiguous but span multiple 128B cache-lines per request, and thus will be broken down into even more transactions. Obviously there are ways of adding padding such that we could potentially mitigate this issue, or have threads cooperate differently, but my point here is that nothing is as well-suited to a GPU as a quaternion.

Happy to discuss further and explain more.

@yuvaltassa
Copy link
Collaborator

Alain, two questions:

  1. Do you think this change would still make sense if we kept the xmats in data but added the quats rather than replacing, so downstream code could choose what to use?
  2. Related to the above, I am not surprised that a full frame transform wins with quats since 3x3 matmat is 27 muls while quatquat is 16. However for a single vector transform this is very surprising. 3x3 matvec is 9 muls while rotvecquat is 18. So even with the reduced memory access, is seems a bit surprising, no? Maybe the answer is that on GPU less memory access wins every time...

As Taylor said and you agreed, once we have the aligned memory this should be even more pronounced, so we should probably get back to it once we have that feature.

@adenzler-nvidia
Copy link
Collaborator Author

for 1 - I don't think this makes a lot of sense, as it will likely slow down things even more. It might be an option to have a separate kernel that does quat->mat conversion at the end, but in my opinion it's not a good idea to introduce multiple sources of truth into your API. We could make it optional though.

  1. It does not make a lot of sense when purely looking at arithmetic - but indeed on GPU with warp memory access and parallelism is what counts most. You need to keep in mind that registers are scarce and affect occupancy and thus parallelism heavily, you lose the ability to fully occupy your GPU once you go beyond 32 registers per thread.

I do see the dilemma of having to change your API, only to reverse that decision potentially once warp gains new features/memory alignment. I don't mind shelving this until we have a better indication of what's going to come. Maybe I'll do a few experiments myself. I stand by my opinion above that memory alignment is not going to significantly change these numbers, and is going to be in favor of quaternions. But I'm happy to be proven wrong.

@StafaH
Copy link
Collaborator

StafaH commented Jan 14, 2026

After testing this change downstream, look like for small scenes (<20 geoms) there won't be any perf gain, but for large scenes there is a big speedup.

mjwarp-testspeed benchmark/franka_emika_panda/scene.xml --function=render --nstep=20

~15 geoms

Branch:

Summary for 8192 parallel rollouts

Total JIT time: 0.01 s
Total simulation time: 1.13 s
Total steps per second: 145,172

Main:

Summary for 8192 parallel rollouts

Total JIT time: 0.01 s
Total simulation time: 1.11 s
Total steps per second: 147,900

Large scene > 100 geom:

Branch:

Summary for 8192 parallel rollouts

Total JIT time: 0.01 s
Total simulation time: 2.61 s
Total steps per second: 313,744

Main:

Summary for 8192 parallel rollouts

Total JIT time: 0.01 s
Total simulation time: 2.85 s
Total steps per second: 286,947

I imported the changes into my fork here: https://github.com/StafaH/mujoco_warp/tree/dev/adenzler/matrices

You can run the benchmark from that branch against the main render branch. Values might look different for others, this is from my 3080.

The change itself is pretty straightforward. I think we can add in cam_xmat and see if that makes a difference, thats the only xmat the renderer uses.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants