Optimization: replace mat33 with wp.quat, move to quaternion math by adenzler-nvidia · Pull Request #902 · google-deepmind/mujoco_warp

adenzler-nvidia · 2025-12-10T12:24:23Z

Sorry for the big PR. I hope this does not have too many downstream changes.

This is giving us decent speedups in the early parts of the pipeline.

Numbers on an RTX Pro 6000 Blackwell:

Environment	Steps/s (main)	Steps/s (this PR)	Δ steps/s	Δ %
humanoid	3,590,377	3,658,177	67,800	1.9%
n_humanoids	524,477	529,471	4,994	1.0%
aloha_pot (lift pot)	2,382,858	2,427,237	44,379	1.9%
Aloha SDF	525,527	536,578	11,051	2.1%
Apollo flat	2,576,095	2,629,018	52,923	2.1%
Apollo terrain	1,005,948	1,012,418	6,470	0.6%
Franka	13,974,922	15,055,237	1,080,315	7.7%
Kitchen G1	27,167	27,450	283	1.0%
Allegro Hand	6,254,950	6,508,127	253,177	4.1%
G1 with hands	435,392	439,071	3,679	0.8%

Signed-off-by: Alain Denzler <[email protected]>

erikfrey · 2025-12-23T19:06:08Z

Given the intention is to remove the xmat fields and replace them with xquat, I really want to hear from @kevinzakka and @StafaH - can you both let us know whether this API change will be a big hassle on your end or not too bad?

StafaH · 2025-12-26T22:16:35Z

Downstream we would need to update the code, but it will be a straightforward task, and the performance gains will be worth it. I suggest moving forward with this change.

Rendering is a memory-heavy task, so more compact representations and better memory load/store behavior will have a large impact.

kevinzakka · 2025-12-29T15:12:35Z

Hi all, we only use geom_xmat and site_xmat in two places, and both immediately convert to quaternions anyway, so native quaternion fields would actually simplify our code. ximat isn't used at all.

adenzler-nvidia · 2026-01-05T08:02:16Z

Nice, thanks for the info.

@StafaH does the batch renderer use the mujoco camera system? Should I include the change from cam_xmat to cam_xquat as well in this change? Otherwise we can also make it a follow-up, but maybe it's a good idea to batch potential API changes in the same PR.

Signed-off-by: Alain Denzler <[email protected]>

erikfrey

Looks great, just one question and one nit. Let's wait to hear from @StafaH and then I can help you get this merged.

erikfrey · 2026-01-05T23:20:16Z

mujoco_warp/_src/types.py

    collision_pairid: ids from broadphase                       (naconmax, 2)
    collision_worldid: collision world ids from broadphase      (naconmax,)
    ncollision: collision count from broadphase                 (1,)
+    xiquat: Cartesian orientation of body inertia                (nworld, nbody, 4)


nit: looks like shape info column is misaligned by one space

thanks, fixed.

mujoco_warp/_src/collision_gjk.py

Signed-off-by: Alain Denzler <[email protected]>

adenzler-nvidia · 2026-01-08T08:25:51Z

Now I remember why I didn't change from mat to quat in that one place: box_edge was failing due to numerical (rounding) issues, because the floating-point math is different now. I adjusted one threshold and all the tests pass now, but I'm not sure if I'm allowed to make that change.

mujoco_warp/_src/collision_gjk.py

Signed-off-by: Alain Denzler <[email protected]>

adenzler-nvidia · 2026-01-12T08:59:14Z

@erikfrey should be good to go now!

thowell · 2026-01-13T11:07:09Z

@adenzler-nvidia

following some discussion with @yuvaltassa and @kbayes, we are thinking to hold off on merging this pr for now. additional considerations we have in mind:

- @kbayes has an idea to reformulate parts of the narrowphase that would probably make sense to compare to the changes in this pr
- it probably makes sense to benchmark a scene that utilizes the renderer with the proposed changes @StafaH
- there is also a question about how these changes would interact with potential future memory alignment features in warp

adenzler-nvidia · 2026-01-13T12:13:26Z

Thanks for the update. I do understand the concerns, however I would like to provide a rebuttal:

1. The main performance advantages of this refactor are in kinematics, and it's unlikely that narrowphase changes would change the picture significantly. I did make some microbenchmarks and quats won every case I tested.
2. We did make the change in the newton tiled renderer and saw a 6% speedup. Happy to help evaluating the MjWarp tiled renderer.
3. Fair question. Quats are going to benefit much more from any alignment hints as they are going to be 16B aligned. The compiler can generate 16B loads per quat, and everything is going to be coalesced. Essentially, that means that 32 threads of a warp can load a quat using 1 memory transaction. This is going to be much worse for Mat33. The struct is 36B, but because consecutive Mat33s aren't on 16B boundaries, this will compile down to 9 individual 4B loads. These 4B loads then are also not contiguous but span multiple 128B cache-lines per request, and thus will be broken down into even more transactions. Obviously there are ways of adding padding such that we could potentially mitigate this issue, or have threads cooperate differently, but my point here is that nothing is as well-suited to a GPU as a quaternion.

Happy to discuss further and explain more.
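The access-pattern argument above can be made concrete with a rough count. The sketch below is only a simplified model, not a hardware-accurate transaction count: it assumes 32 threads per warp, 128B cache lines, tightly packed arrays starting at address 0, and that thread i reads element i. The helper `lines_touched` is hypothetical and written for this illustration. It counts how many distinct 128B lines a single warp-wide load instruction touches.

```python
# Simplified model: count the 128-byte cache lines touched by one warp-wide load.
# It ignores caching, sector granularity, and real hardware transaction rules.

WARP_SIZE = 32
LINE_BYTES = 128


def lines_touched(stride_bytes, load_bytes, offset_bytes=0):
    """Distinct 128B lines hit when thread i loads `load_bytes` at i * stride + offset."""
    lines = set()
    for thread in range(WARP_SIZE):
        first = thread * stride_bytes + offset_bytes
        last = first + load_bytes - 1
        lines.update(range(first // LINE_BYTES, last // LINE_BYTES + 1))
    return len(lines)


# Quaternion array: 16B elements, one 16B vector load per thread.
quat_loads = 1
quat_lines = lines_touched(stride_bytes=16, load_bytes=16)

# Mat33 array: 36B elements, read as nine scalar 4B loads per thread
# (assumption: one load per matrix component, at byte offset 4 * k).
mat_loads = 9
mat_lines_per_load = [lines_touched(36, 4, offset_bytes=4 * k) for k in range(9)]

print("quat :", quat_loads, "load instruction,", quat_lines, "lines touched")
print("mat33:", mat_loads, "load instructions,", sum(mat_lines_per_load), "line touches in total")
```

Under these assumptions the quaternion load uses every byte of the lines it touches (32 x 16B = 512B), while each scalar Mat33 load spreads 128 useful bytes over a span of more than 1 KB. Caches will absorb part of that cost in practice, so the counts indicate the trend rather than measured traffic.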

yuvaltassa · 2026-01-13T12:54:39Z

Alain, two questions:

1. Do you think this change would still make sense if we kept the xmats in data but added the quats rather than replacing, so downstream code could choose what to use?
2. Related to the above, I am not surprised that a full frame transform wins with quats since 3x3 matmat is 27 muls while quatquat is 16. However for a single vector transform this is very surprising. 3x3 matvec is 9 muls while rotvecquat is 18. So even with the reduced memory access, it seems a bit surprising, no? Maybe the answer is that on GPU less memory access wins every time...

As Taylor said and you agreed, once we have the aligned memory this should be even more pronounced, so we should probably get back to it once we have that feature.
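To make the multiplication counts in the second question concrete, here is a minimal pure-Python sketch. It is not the mujoco_warp implementation; the function names are hypothetical, and the quaternion layout (w, x, y, z) is an assumption. It uses one common formulation of vector rotation by a unit quaternion that costs 18 multiplications, next to the 9-multiplication matrix-vector product, and cross-checks the two.

```python
import math


def quat_rotate(q, v):
    """Rotate v by unit quaternion q = (w, x, y, z) using 18 multiplications."""
    w, x, y, z = q
    # t = 2 * cross(q.xyz, v): 6 muls for the cross product, 3 for the scaling.
    tx = 2.0 * (y * v[2] - z * v[1])
    ty = 2.0 * (z * v[0] - x * v[2])
    tz = 2.0 * (x * v[1] - y * v[0])
    # v' = v + w * t + cross(q.xyz, t): 3 muls plus 6 muls.
    return (
        v[0] + w * tx + (y * tz - z * ty),
        v[1] + w * ty + (z * tx - x * tz),
        v[2] + w * tz + (x * ty - y * tx),
    )


def mat_rotate(m, v):
    """Rotate v by a 3x3 matrix given as three rows: 9 multiplications."""
    return tuple(r[0] * v[0] + r[1] * v[1] + r[2] * v[2] for r in m)


def quat_to_mat(q):
    """Rotation matrix equivalent to unit quaternion q = (w, x, y, z)."""
    w, x, y, z = q
    return (
        (1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)),
        (2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)),
        (2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)),
    )


# 90 degree rotation about the z axis maps the x axis onto the y axis.
q = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
v = (1.0, 0.0, 0.0)
via_quat = quat_rotate(q, v)
via_mat = mat_rotate(quat_to_mat(q), v)
print(via_quat, via_mat)
```

The quaternion path does twice the multiplications but reads 4 floats instead of 9, which is the trade-off the question is probing.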

adenzler-nvidia · 2026-01-13T13:44:31Z

for 1 - I don't think this makes a lot of sense, as it will likely slow down things even more. It might be an option to have a separate kernel that does quat->mat conversion at the end, but in my opinion it's not a good idea to introduce multiple sources of truth into your API. We could make it optional though.

For 2 - it does not make a lot of sense when looking purely at arithmetic, but on GPU with Warp, memory access and parallelism are indeed what count most. Keep in mind that registers are scarce and heavily affect occupancy, and thus parallelism: you lose the ability to fully occupy your GPU once you go beyond 32 registers per thread.

I do see the dilemma of having to change your API, only to reverse that decision potentially once warp gains new features/memory alignment. I don't mind shelving this until we have a better indication of what's going to come. Maybe I'll do a few experiments myself. I stand by my opinion above that memory alignment is not going to significantly change these numbers, and is going to be in favor of quaternions. But I'm happy to be proven wrong.
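The 32-registers-per-thread figure mentioned above follows from a simple ratio. The sketch below assumes 65,536 32-bit registers and at most 2,048 resident threads per SM; both limits are assumptions that vary by GPU architecture, and the model ignores register allocation granularity, shared memory, and block-size limits. The function name is hypothetical.

```python
# Assumed per-SM limits; check the specification of the actual GPU.
REGISTERS_PER_SM = 65536
MAX_THREADS_PER_SM = 2048


def register_limited_occupancy(regs_per_thread):
    """Fraction of the maximum resident threads that fit in the register file."""
    resident = min(MAX_THREADS_PER_SM, REGISTERS_PER_SM // regs_per_thread)
    return resident / MAX_THREADS_PER_SM


for regs in (32, 40, 64, 128):
    print(regs, "registers/thread ->", f"{register_limited_occupancy(regs):.0%}")
```

With these limits, 65,536 / 2,048 = 32 registers per thread is the largest budget that still allows full occupancy; on a GPU with a lower resident-thread limit the threshold would be higher.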

StafaH · 2026-01-14T00:38:04Z

After testing this change downstream, it looks like for small scenes (<20 geoms) there won't be any perf gain, but for large scenes there is a big speedup.

mjwarp-testspeed benchmark/franka_emika_panda/scene.xml --function=render --nstep=20

~15 geoms

Branch:

Summary for 8192 parallel rollouts

Total JIT time: 0.01 s
Total simulation time: 1.13 s
Total steps per second: 145,172

Main:

Summary for 8192 parallel rollouts

Total JIT time: 0.01 s
Total simulation time: 1.11 s
Total steps per second: 147,900

Large scene > 100 geom:

Branch:

Summary for 8192 parallel rollouts

Total JIT time: 0.01 s
Total simulation time: 2.61 s
Total steps per second: 313,744

Main:

Summary for 8192 parallel rollouts

Total JIT time: 0.01 s
Total simulation time: 2.85 s
Total steps per second: 286,947

I imported the changes into my fork here: https://github.com/StafaH/mujoco_warp/tree/dev/adenzler/matrices

You can run the benchmark from that branch against the main render branch. Values might look different for others; this is from my 3080.

The change itself is pretty straightforward. I think we can add in cam_xmat and see if that makes a difference; that's the only xmat the renderer uses.
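For reference, the relative differences implied by the steps-per-second figures reported above can be computed directly. This is plain arithmetic on the quoted numbers, not an additional measurement; `relative_change` is a hypothetical helper.

```python
def relative_change(branch, main):
    """Percentage change of the branch throughput relative to main."""
    return 100.0 * (branch - main) / main


# Steps per second as reported above (8192 parallel rollouts, --function=render).
small_scene = relative_change(branch=145_172, main=147_900)
large_scene = relative_change(branch=313_744, main=286_947)

print(f"small scene (~15 geoms): {small_scene:+.1f}%")
print(f"large scene (>100 geoms): {large_scene:+.1f}%")
```

That works out to roughly a 1.8% slowdown on the small scene and a 9.3% speedup on the large one, on the single GPU reported.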

adenzler-nvidia added 30 commits December 5, 2025 17:40

no more xmat

25485ae

Signed-off-by: Alain Denzler <[email protected]>

fix tests

b43507f

Signed-off-by: Alain Denzler <[email protected]>

slow transition

577367d

Signed-off-by: Alain Denzler <[email protected]>

set xiquat

bc47238

Signed-off-by: Alain Denzler <[email protected]>

first conversions

fada45b

Signed-off-by: Alain Denzler <[email protected]>

remove prints

ccb951e

Signed-off-by: Alain Denzler <[email protected]>

conversions in sensor

1f4d7b0

Signed-off-by: Alain Denzler <[email protected]>

sensor_vel

31e52f8

Signed-off-by: Alain Denzler <[email protected]>

cinert

8e1e825

Signed-off-by: Alain Denzler <[email protected]>

subtree_vel_forward

37d9587

Signed-off-by: Alain Denzler <[email protected]>

no more ximat

10ac877

Signed-off-by: Alain Denzler <[email protected]>

properly test xiquat

2c14b6d

Signed-off-by: Alain Denzler <[email protected]>

proper put_data conversion

1db0e22

Signed-off-by: Alain Denzler <[email protected]>

better conversions

b1e359b

Signed-off-by: Alain Denzler <[email protected]>

initial infra

bb8eaf1

Signed-off-by: Alain Denzler <[email protected]>

kernel call needs it as well

b3768ef

Signed-off-by: Alain Denzler <[email protected]>

spatial_tendon_geom

243b8bf

Signed-off-by: Alain Denzler <[email protected]>

better put_data

7ab5218

Signed-off-by: Alain Denzler <[email protected]>

better smooth_test

de23411

Signed-off-by: Alain Denzler <[email protected]>

sensor_tactile

6f7df83

Signed-off-by: Alain Denzler <[email protected]>

sensor_vel

1c50e3f

Signed-off-by: Alain Denzler <[email protected]>

sensor_pos

11076fd

Signed-off-by: Alain Denzler <[email protected]>

crashing tests

3f47579

Signed-off-by: Alain Denzler <[email protected]>

no more geom_xmat in mjwarp

0835117

Signed-off-by: Alain Denzler <[email protected]>

maybe this helps

62068e1

Signed-off-by: Alain Denzler <[email protected]>

more quats instead of matrices

358f442

Signed-off-by: Alain Denzler <[email protected]>

remove unnecessary line

7ea3e42

Signed-off-by: Alain Denzler <[email protected]>

fix the GJK test

3a2469f

Signed-off-by: Alain Denzler <[email protected]>

Merge branch 'main' into dev/adenzler/matrices

f35c485

fix

4a89cbb

Signed-off-by: Alain Denzler <[email protected]>

Kenny-Vilella mentioned this pull request Dec 31, 2025

Optimization: switching tree traversal per branch #923

Merged

adenzler-nvidia added 5 commits January 5, 2026 10:24

Merge branch 'main' into dev/adenzler/matrices

ff97ba2

updates

5fef4e9

Signed-off-by: Alain Denzler <[email protected]>

missing upadte

4f7fb9d

Signed-off-by: Alain Denzler <[email protected]>

another missing change.

b41b334

Signed-off-by: Alain Denzler <[email protected]>

missing updates

c35e2d4

Signed-off-by: Alain Denzler <[email protected]>

erikfrey changed the title ~~Optimization: replace mat33 with wp.quat, move to quaternion math~~ Optimization: replace mat33 with wp.quat, move to quaternion math. Jan 5, 2026

erikfrey changed the title ~~Optimization: replace mat33 with wp.quat, move to quaternion math.~~ Optimization: replace mat33 with wp.quat, move to quaternion math Jan 5, 2026

erikfrey approved these changes Jan 5, 2026

adenzler-nvidia added 6 commits January 7, 2026 08:37

whitespace fix

75235a7

Signed-off-by: Alain Denzler <[email protected]>

missing mat->quat conversions

739eca0

Signed-off-by: Alain Denzler <[email protected]>

Merge branch 'main' into dev/adenzler/matrices

1484174

import cleanup

62ea9ac

Signed-off-by: Alain Denzler <[email protected]>

change epsilon in half-space calc

a9dd5cf

Signed-off-by: Alain Denzler <[email protected]>

Merge branch 'main' into dev/adenzler/matrices

9800214

adenzler-nvidia commented Jan 8, 2026

adenzler-nvidia mentioned this pull request Jan 9, 2026

update constants for float #1000

Open

3 tasks

adenzler-nvidia added 2 commits January 12, 2026 09:46

Merge branch 'main' into dev/adenzler/matrices

7c782e3

heightfield merge conflicts

c674bf7

Signed-off-by: Alain Denzler <[email protected]>
