Hi! I'm very interested in this repo and want to reproduce it, but I ran into some issues when trying to reproduce the end-to-end audio and action output of RoboOmni as shown in your demo video.
I have noticed that, in the `RoboOmni` class from `experiments/libero/roboomni_utils.py`, the VLA model seems to be loaded via `Qwen2_5OmniThinkerForConditionalGeneration.from_pretrained` (no audio output ability) instead of `Qwen2_5OmniForConditionalGeneration.from_pretrained` (which includes the talker and token2wav modules).
When I tried to load the model with `Qwen2_5OmniForConditionalGeneration.from_pretrained`, I got the following output:
Qwen2_5OmniForConditionalGeneration LOAD REPORT
from: ./RoboOmni-LIBERO-Spatial
Key | Status |
---------------------------------------------------------------------------------------------------------+------------+-
visual.blocks.{0...31}.attn.k.weight | UNEXPECTED |
visual.blocks.{0...31}.attn.k.bias | UNEXPECTED |
visual.blocks.{0...31}.attn.q.weight | UNEXPECTED |
model.layers.{0...35}.self_attn.q_proj.bias | UNEXPECTED |
visual.blocks.{0...31}.attn.v.bias | UNEXPECTED |
model.layers.{0...35}.self_attn.q_proj.weight | UNEXPECTED |
audio_tower.layers.{0...31}.self_attn.v_proj.bias | UNEXPECTED |
audio_tower.layers.{0...31}.self_attn_layer_norm.weight | UNEXPECTED |
audio_tower.layers.{0...31}.fc2.weight | UNEXPECTED |
audio_tower.layers.{0...31}.self_attn.k_proj.weight | UNEXPECTED |
model.layers.{0...35}.self_attn.o_proj.weight | UNEXPECTED |
audio_tower.layers.{0...31}.self_attn.q_proj.bias | UNEXPECTED |
visual.blocks.{0...31}.attn.q.bias | UNEXPECTED |
audio_tower.layers.{0...31}.fc2.bias | UNEXPECTED |
model.layers.{0...35}.post_attention_layernorm.weight | UNEXPECTED |
audio_tower.layers.{0...31}.fc1.weight | UNEXPECTED |
audio_tower.layers.{0...31}.fc1.bias | UNEXPECTED |
visual.blocks.{0...31}.norm1.weight | UNEXPECTED |
audio_tower.layers.{0...31}.self_attn_layer_norm.bias | UNEXPECTED |
audio_tower.layers.{0...31}.self_attn.out_proj.weight | UNEXPECTED |
model.layers.{0...35}.mlp.up_proj.weight | UNEXPECTED |
model.layers.{0...35}.mlp.down_proj.weight | UNEXPECTED |
visual.blocks.{0...31}.mlp.down_proj.bias | UNEXPECTED |
model.layers.{0...35}.self_attn.k_proj.weight | UNEXPECTED |
audio_tower.layers.{0...31}.self_attn.q_proj.weight | UNEXPECTED |
visual.blocks.{0...31}.mlp.gate_proj.weight | UNEXPECTED |
model.layers.{0...35}.input_layernorm.weight | UNEXPECTED |
audio_tower.layers.{0...31}.self_attn.out_proj.bias | UNEXPECTED |
audio_tower.layers.{0...31}.final_layer_norm.weight | UNEXPECTED |
visual.blocks.{0...31}.attn.proj.bias | UNEXPECTED |
visual.blocks.{0...31}.attn.proj.weight | UNEXPECTED |
model.layers.{0...35}.self_attn.v_proj.weight | UNEXPECTED |
model.layers.{0...35}.self_attn.k_proj.bias | UNEXPECTED |
visual.blocks.{0...31}.norm2.weight | UNEXPECTED |
visual.blocks.{0...31}.mlp.down_proj.weight | UNEXPECTED |
visual.blocks.{0...31}.attn.v.weight | UNEXPECTED |
model.layers.{0...35}.self_attn.v_proj.bias | UNEXPECTED |
audio_tower.layers.{0...31}.final_layer_norm.bias | UNEXPECTED |
audio_tower.layers.{0...31}.self_attn.v_proj.weight | UNEXPECTED |
audio_tower.proj.bias | UNEXPECTED |
lm_head.weight | UNEXPECTED |
visual.blocks.{0...31}.mlp.gate_proj.bias | UNEXPECTED |
model.layers.{0...35}.mlp.gate_proj.weight | UNEXPECTED |
visual.merger.mlp.{0, 2}.bias | UNEXPECTED |
visual.merger.mlp.{0, 2}.weight | UNEXPECTED |
visual.blocks.{0...31}.mlp.up_proj.bias | UNEXPECTED |
audio_tower.ln_post.weight | UNEXPECTED |
visual.blocks.{0...31}.mlp.up_proj.weight | UNEXPECTED |
model.norm.weight | UNEXPECTED |
audio_tower.conv2.bias | UNEXPECTED |
audio_tower.conv2.weight | UNEXPECTED |
audio_tower.conv1.bias | UNEXPECTED |
model.embed_tokens.weight | UNEXPECTED |
audio_tower.conv1.weight | UNEXPECTED |
audio_tower.audio_bos_eos_token.weight | UNEXPECTED |
audio_tower.ln_post.bias | UNEXPECTED |
audio_tower.proj.weight | UNEXPECTED |
visual.merger.ln_q.weight | UNEXPECTED |
visual.patch_embed.proj.weight | UNEXPECTED |
token2wav.code2wav_bigvgan_model.resblocks.{0...17}.convs1.{0, 1, 2}.weight | MISSING |
token2wav.code2wav_bigvgan_model.resblocks.{0...17}.convs1.{0, 1, 2}.bias | MISSING |
thinker.visual.blocks.{0...31}.mlp.down_proj.bias | MISSING |
talker.model.layers.{0...27}.mlp.up_proj.weight | MISSING |
talker.model.layers.{0...27}.post_attention_layernorm.weight | MISSING |
token2wav.code2wav_dit_model.transformer_blocks.{0...21}.attn.to_out.0.weight | MISSING |
thinker.audio_tower.layers.{0...31}.final_layer_norm.bias | MISSING |
token2wav.code2wav_dit_model.input_embed.spk_encoder.blocks.0.conv.weight | MISSING |
token2wav.code2wav_dit_model.transformer_blocks.{0...21}.ff.ff.{0, 3}.weight | MISSING |
thinker.audio_tower.layers.{0...31}.self_attn.k_proj.weight | MISSING |
thinker.model.layers.{0...27}.input_layernorm.weight | MISSING |
thinker.visual.blocks.{0...31}.attn.k.bias | MISSING |
thinker.visual.blocks.{0...31}.norm2.weight | MISSING |
thinker.model.layers.{0...27}.self_attn.v_proj.weight | MISSING |
token2wav.code2wav_bigvgan_model.resblocks.{0...17}.activations.{0, 1, 2, 3, 4, 5}.act.beta | MISSING |
token2wav.code2wav_dit_model.transformer_blocks.{0...21}.attn.to_v.bias | MISSING |
thinker.visual.blocks.{0...31}.attn.v.bias | MISSING |
thinker.audio_tower.layers.{0...31}.self_attn.v_proj.bias | MISSING |
token2wav.code2wav_dit_model.transformer_blocks.{0...21}.attn.to_out.0.bias | MISSING |
thinker.visual.blocks.{0...31}.mlp.up_proj.weight | MISSING |
thinker.visual.blocks.{0...31}.attn.q.bias | MISSING |
thinker.audio_tower.layers.{0...31}.self_attn_layer_norm.bias | MISSING |
token2wav.code2wav_bigvgan_model.resblocks.{0...17}.activations.{0, 1, 2, 3, 4, 5}.act.alpha | MISSING |
thinker.model.layers.{0...27}.self_attn.k_proj.bias | MISSING |
thinker.visual.blocks.{0...31}.norm1.weight | MISSING |
token2wav.code2wav_bigvgan_model.resblocks.{0...17}.convs2.{0, 1, 2}.weight | MISSING |
thinker.audio_tower.conv1.bias | MISSING |
talker.model.layers.{0...27}.self_attn.v_proj.bias | MISSING |
thinker.model.layers.{0...27}.self_attn.q_proj.bias | MISSING |
talker.model.layers.{0...27}.self_attn.q_proj.bias | MISSING |
token2wav.code2wav_dit_model.transformer_blocks.{0...21}.attn_norm.linear.bias | MISSING |
talker.model.layers.{0...27}.self_attn.o_proj.weight | MISSING |
token2wav.code2wav_dit_model.transformer_blocks.{0...21}.ff.ff.{0, 3}.bias | MISSING |
thinker.visual.blocks.{0...31}.attn.proj.weight | MISSING |
thinker.visual.blocks.{0...31}.attn.q.weight | MISSING |
thinker.model.layers.{0...27}.mlp.up_proj.weight | MISSING |
thinker.audio_tower.layers.{0...31}.self_attn.q_proj.bias | MISSING |
thinker.audio_tower.layers.{0...31}.fc1.bias | MISSING |
thinker.visual.blocks.{0...31}.attn.k.weight | MISSING |
talker.model.layers.{0...27}.self_attn.q_proj.weight | MISSING |
thinker.audio_tower.layers.{0...31}.final_layer_norm.weight | MISSING |
thinker.visual.blocks.{0...31}.mlp.down_proj.weight | MISSING |
thinker.visual.blocks.{0...31}.attn.v.weight | MISSING |
thinker.visual.blocks.{0...31}.mlp.gate_proj.bias | MISSING |
thinker.audio_tower.layers.{0...31}.self_attn.out_proj.bias | MISSING |
thinker.model.layers.{0...27}.post_attention_layernorm.weight | MISSING |
token2wav.code2wav_bigvgan_model.resblocks.{0...17}.convs2.{0, 1, 2}.bias | MISSING |
token2wav.code2wav_dit_model.input_embed.spk_encoder.blocks.{1, 2, 3}.tdnn1.conv.weight | MISSING |
thinker.audio_tower.layers.{0...31}.fc2.weight | MISSING |
talker.model.layers.{0...27}.mlp.down_proj.weight | MISSING |
token2wav.code2wav_dit_model.input_embed.spk_encoder.blocks.{1, 2, 3}.se_block.conv1.weight | MISSING |
thinker.model.layers.{0...27}.self_attn.k_proj.weight | MISSING |
talker.model.layers.{0...27}.self_attn.k_proj.weight | MISSING |
thinker.model.layers.{0...27}.mlp.down_proj.weight | MISSING |
talker.model.layers.{0...27}.input_layernorm.weight | MISSING |
token2wav.code2wav_dit_model.transformer_blocks.{0...21}.attn.to_q.bias | MISSING |
thinker.audio_tower.layers.{0...31}.fc2.bias | MISSING |
thinker.audio_tower.layers.{0...31}.fc1.weight | MISSING |
token2wav.code2wav_dit_model.transformer_blocks.{0...21}.attn.to_k.bias | MISSING |
thinker.model.layers.{0...27}.self_attn.q_proj.weight | MISSING |
thinker.model.layers.{0...27}.self_attn.v_proj.bias | MISSING |
token2wav.code2wav_dit_model.input_embed.spk_encoder.mfa.conv.bias | MISSING |
thinker.visual.blocks.{0...31}.mlp.gate_proj.weight | MISSING |
thinker.visual.blocks.{0...31}.mlp.up_proj.bias | MISSING |
token2wav.code2wav_dit_model.transformer_blocks.{0...21}.attn.to_v.weight | MISSING |
thinker.lm_head.weight | MISSING |
thinker.audio_tower.layers.{0...31}.self_attn.v_proj.weight | MISSING |
thinker.audio_tower.layers.{0...31}.self_attn_layer_norm.weight | MISSING |
thinker.model.layers.{0...27}.mlp.gate_proj.weight | MISSING |
talker.model.layers.{0...27}.self_attn.k_proj.bias | MISSING |
token2wav.code2wav_dit_model.transformer_blocks.{0...21}.attn.to_q.weight | MISSING |
thinker.audio_tower.layers.{0...31}.self_attn.out_proj.weight | MISSING |
token2wav.code2wav_dit_model.transformer_blocks.{0...21}.attn.to_k.weight | MISSING |
token2wav.code2wav_bigvgan_model.conv_pre.weight | MISSING |
token2wav.code2wav_dit_model.input_embed.spk_encoder.mfa.conv.weight | MISSING |
talker.model.layers.{0...27}.mlp.gate_proj.weight | MISSING |
thinker.model.layers.{0...27}.self_attn.o_proj.weight | MISSING |
talker.model.layers.{0...27}.self_attn.v_proj.weight | MISSING |
thinker.visual.blocks.{0...31}.attn.proj.bias | MISSING |
token2wav.code2wav_dit_model.input_embed.spk_encoder.blocks.{1, 2, 3}.se_block.conv2.bias | MISSING |
thinker.audio_tower.layers.{0...31}.self_attn.q_proj.weight | MISSING |
thinker.visual.merger.ln_q.weight | MISSING |
token2wav.code2wav_dit_model.transformer_blocks.{0...21}.attn_norm.linear.weight | MISSING |
token2wav.code2wav_bigvgan_model.conv_pre.bias | MISSING |
token2wav.code2wav_dit_model.input_embed.spk_encoder.blocks.{1, 2, 3}.se_block.conv2.weight | MISSING |
token2wav.code2wav_dit_model.input_embed.spk_encoder.blocks.{1, 2, 3}.res2net_block.blocks.0.conv.weight | MISSING |
thinker.visual.merger.mlp.{0, 2}.weight | MISSING |
token2wav.code2wav_bigvgan_model.ups.{0, 1, 2, 3, 4, 5}.0.bias | MISSING |
token2wav.code2wav_dit_model.input_embed.spk_encoder.blocks.{1, 2, 3}.res2net_block.blocks.0.conv.bias | MISSING |
token2wav.code2wav_dit_model.time_embed.time_mlp.{0, 2}.weight | MISSING |
token2wav.code2wav_dit_model.text_embed.codec_embed.weight | MISSING |
token2wav.code2wav_dit_model.proj_out.bias | MISSING |
token2wav.code2wav_dit_model.norm_out.linear.weight | MISSING |
token2wav.code2wav_dit_model.input_embed.spk_encoder.asp.tdnn.conv.weight | MISSING |
token2wav.code2wav_bigvgan_model.ups.{0, 1, 2, 3, 4, 5}.0.weight | MISSING |
token2wav.code2wav_dit_model.input_embed.spk_encoder.asp.conv.weight | MISSING |
token2wav.code2wav_dit_model.input_embed.spk_encoder.fc.weight | MISSING |
thinker.audio_tower.conv1.weight | MISSING |
thinker.visual.merger.mlp.{0, 2}.bias | MISSING |
token2wav.code2wav_dit_model.input_embed.spk_encoder.blocks.{1, 2, 3}.tdnn2.conv.bias | MISSING |
token2wav.code2wav_dit_model.time_embed.time_mlp.{0, 2}.bias | MISSING |
thinker.audio_tower.ln_post.bias | MISSING |
token2wav.code2wav_dit_model.input_embed.spk_encoder.blocks.{1, 2, 3}.tdnn2.conv.weight | MISSING |
token2wav.code2wav_dit_model.input_embed.spk_encoder.asp.tdnn.conv.bias | MISSING |
talker.model.norm.weight | MISSING |
token2wav.code2wav_dit_model.input_embed.proj.bias | MISSING |
token2wav.code2wav_dit_model.input_embed.spk_encoder.blocks.{1, 2, 3}.tdnn1.conv.bias | MISSING |
token2wav.code2wav_dit_model.input_embed.spk_encoder.fc.bias | MISSING |
token2wav.code2wav_dit_model.norm_out.linear.bias | MISSING |
talker.model.embed_tokens.weight | MISSING |
token2wav.code2wav_dit_model.input_embed.spk_encoder.blocks.{1, 2, 3}.se_block.conv1.bias | MISSING |
token2wav.code2wav_bigvgan_model.activation_post.act.beta | MISSING |
token2wav.code2wav_bigvgan_model.activation_post.act.alpha | MISSING |
thinker.model.norm.weight | MISSING |
talker.codec_head.weight | MISSING |
thinker.audio_tower.proj.bias | MISSING |
token2wav.code2wav_dit_model.input_embed.spk_encoder.blocks.0.conv.bias | MISSING |
thinker.audio_tower.conv2.weight | MISSING |
thinker.visual.patch_embed.proj.weight | MISSING |
talker.thinker_to_talker_proj.bias | MISSING |
token2wav.code2wav_dit_model.input_embed.spk_encoder.asp.conv.bias | MISSING |
thinker.audio_tower.proj.weight | MISSING |
token2wav.code2wav_bigvgan_model.conv_post.weight | MISSING |
token2wav.code2wav_dit_model.proj_out.weight | MISSING |
thinker.model.embed_tokens.weight | MISSING |
talker.thinker_to_talker_proj.weight | MISSING |
thinker.audio_tower.conv2.bias | MISSING |
thinker.audio_tower.ln_post.weight | MISSING |
token2wav.code2wav_dit_model.input_embed.proj.weight | MISSING |
thinker.audio_tower.audio_bos_eos_token.weight | MISSING |
Notes:
- UNEXPECTED: can be ignored when loading from a different task/architecture; not OK if you expect an identical arch.
- MISSING: those params were newly initialized because they are missing from the checkpoint. Consider training on your downstream task.
followed by an error:
OSError: ./RoboOmni-LIBERO-Spatial does not appear to have a file named spk_dict.pt. Checkout 'https://huggingface.co/./RoboOmni-LIBERO-Spatial/tree/main' for available files.
This likely means that `RoboOmni-LIBERO-Spatial` does not contain the talker and token2wav modules (hence all the MISSING entries). However, as shown in the official RoboOmni demo video, the model supports end-to-end audio and action output, and those missing modules likely cannot simply be loaded from the original Qwen2.5-Omni backbone.
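For what it's worth, the missing modules can be confirmed without loading any weights, by inspecting which top-level prefixes appear in the checkpoint's sharded-weights index. A minimal sketch (the helper names are mine, and I'm assuming the checkpoint directory contains a standard Hugging Face `model.safetensors.index.json`; single-file checkpoints without an index would need a different check):

```python
import json

# Hypothetical helper (not from the repo): list the top-level modules a
# sharded Hugging Face checkpoint contains, by reading the key names in
# model.safetensors.index.json instead of materializing any weights.
def top_level_modules(index_path):
    with open(index_path) as f:
        weight_map = json.load(f)["weight_map"]
    # "thinker.model.layers.0..." -> "thinker"; take the first dotted segment
    return sorted({key.split(".", 1)[0] for key in weight_map})

def has_audio_output_modules(index_path):
    # The full Qwen2_5OmniForConditionalGeneration expects talker and
    # token2wav weights; a thinker-only checkpoint lacks both prefixes.
    return {"talker", "token2wav"} <= set(top_level_modules(index_path))
```

If `has_audio_output_modules("./RoboOmni-LIBERO-Spatial/model.safetensors.index.json")` returns `False`, that would match the load report above: only thinker-side weights (`model`, `visual`, `audio_tower`, ...) are present, and the talker/token2wav parameters get freshly initialized.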
Beyond the possibly missing full model, the current repo also lacks a tutorial detailed enough for an exact reproduction (e.g., how to output audio together with actions as shown in the demo).
Therefore, to help the community build on this work, I would kindly like to request the full model and a tutorial, if possible ❤.