Hi! I'm very interested in this repo and want to reproduce it, but I ran into some issues when trying to reproduce the end-to-end audio and action output of RoboOmni as shown in your demo video.
I have noticed that, in the `RoboOmni` class from `experiments/libero/roboomni_utils.py`, the VLA model seems to be loaded via `Qwen2_5OmniThinkerForConditionalGeneration.from_pretrained` (no audio output ability) instead of `Qwen2_5OmniForConditionalGeneration.from_pretrained` (which includes the talker and token2wav modules).
When I tried to load the model with `Qwen2_5OmniForConditionalGeneration.from_pretrained`, I got the following output:
Qwen2_5OmniForConditionalGeneration LOAD REPORT
from: ./RoboOmni-LIBERO-Spatial
Key | Status |
---------------------------------------------------------------------------------------------------------+------------+-
visual.blocks.{0...31}.attn.k.weight | UNEXPECTED |
visual.blocks.{0...31}.attn.k.bias | UNEXPECTED |
visual.blocks.{0...31}.attn.q.weight | UNEXPECTED |
model.layers.{0...35}.self_attn.q_proj.bias | UNEXPECTED |
visual.blocks.{0...31}.attn.v.bias | UNEXPECTED |
model.layers.{0...35}.self_attn.q_proj.weight | UNEXPECTED |
audio_tower.layers.{0...31}.self_attn.v_proj.bias | UNEXPECTED |
audio_tower.layers.{0...31}.self_attn_layer_norm.weight | UNEXPECTED |
audio_tower.layers.{0...31}.fc2.weight | UNEXPECTED |
audio_tower.layers.{0...31}.self_attn.k_proj.weight | UNEXPECTED |
model.layers.{0...35}.self_attn.o_proj.weight | UNEXPECTED |
audio_tower.layers.{0...31}.self_attn.q_proj.bias | UNEXPECTED |
visual.blocks.{0...31}.attn.q.bias | UNEXPECTED |
audio_tower.layers.{0...31}.fc2.bias | UNEXPECTED |
model.layers.{0...35}.post_attention_layernorm.weight | UNEXPECTED |
audio_tower.layers.{0...31}.fc1.weight | UNEXPECTED |
audio_tower.layers.{0...31}.fc1.bias | UNEXPECTED |
visual.blocks.{0...31}.norm1.weight | UNEXPECTED |
audio_tower.layers.{0...31}.self_attn_layer_norm.bias | UNEXPECTED |
audio_tower.layers.{0...31}.self_attn.out_proj.weight | UNEXPECTED |
model.layers.{0...35}.mlp.up_proj.weight | UNEXPECTED |
model.layers.{0...35}.mlp.down_proj.weight | UNEXPECTED |
visual.blocks.{0...31}.mlp.down_proj.bias | UNEXPECTED |
model.layers.{0...35}.self_attn.k_proj.weight | UNEXPECTED |
audio_tower.layers.{0...31}.self_attn.q_proj.weight | UNEXPECTED |
visual.blocks.{0...31}.mlp.gate_proj.weight | UNEXPECTED |
model.layers.{0...35}.input_layernorm.weight | UNEXPECTED |
audio_tower.layers.{0...31}.self_attn.out_proj.bias | UNEXPECTED |
audio_tower.layers.{0...31}.final_layer_norm.weight | UNEXPECTED |
visual.blocks.{0...31}.attn.proj.bias | UNEXPECTED |
visual.blocks.{0...31}.attn.proj.weight | UNEXPECTED |
model.layers.{0...35}.self_attn.v_proj.weight | UNEXPECTED |
model.layers.{0...35}.self_attn.k_proj.bias | UNEXPECTED |
visual.blocks.{0...31}.norm2.weight | UNEXPECTED |
visual.blocks.{0...31}.mlp.down_proj.weight | UNEXPECTED |
visual.blocks.{0...31}.attn.v.weight | UNEXPECTED |
model.layers.{0...35}.self_attn.v_proj.bias | UNEXPECTED |
audio_tower.layers.{0...31}.final_layer_norm.bias | UNEXPECTED |
audio_tower.layers.{0...31}.self_attn.v_proj.weight | UNEXPECTED |
audio_tower.proj.bias | UNEXPECTED |
lm_head.weight | UNEXPECTED |
visual.blocks.{0...31}.mlp.gate_proj.bias | UNEXPECTED |
model.layers.{0...35}.mlp.gate_proj.weight | UNEXPECTED |
visual.merger.mlp.{0, 2}.bias | UNEXPECTED |
visual.merger.mlp.{0, 2}.weight | UNEXPECTED |
visual.blocks.{0...31}.mlp.up_proj.bias | UNEXPECTED |
audio_tower.ln_post.weight | UNEXPECTED |
visual.blocks.{0...31}.mlp.up_proj.weight | UNEXPECTED |
model.norm.weight | UNEXPECTED |
audio_tower.conv2.bias | UNEXPECTED |
audio_tower.conv2.weight | UNEXPECTED |
audio_tower.conv1.bias | UNEXPECTED |
model.embed_tokens.weight | UNEXPECTED |
audio_tower.conv1.weight | UNEXPECTED |
audio_tower.audio_bos_eos_token.weight | UNEXPECTED |
audio_tower.ln_post.bias | UNEXPECTED |
audio_tower.proj.weight | UNEXPECTED |
visual.merger.ln_q.weight | UNEXPECTED |
visual.patch_embed.proj.weight | UNEXPECTED |
token2wav.code2wav_bigvgan_model.resblocks.{0...17}.convs1.{0, 1, 2}.weight | MISSING |
token2wav.code2wav_bigvgan_model.resblocks.{0...17}.convs1.{0, 1, 2}.bias | MISSING |
thinker.visual.blocks.{0...31}.mlp.down_proj.bias | MISSING |
talker.model.layers.{0...27}.mlp.up_proj.weight | MISSING |
talker.model.layers.{0...27}.post_attention_layernorm.weight | MISSING |
token2wav.code2wav_dit_model.transformer_blocks.{0...21}.attn.to_out.0.weight | MISSING |
thinker.audio_tower.layers.{0...31}.final_layer_norm.bias | MISSING |
token2wav.code2wav_dit_model.input_embed.spk_encoder.blocks.0.conv.weight | MISSING |
token2wav.code2wav_dit_model.transformer_blocks.{0...21}.ff.ff.{0, 3}.weight | MISSING |
thinker.audio_tower.layers.{0...31}.self_attn.k_proj.weight | MISSING |
thinker.model.layers.{0...27}.input_layernorm.weight | MISSING |
thinker.visual.blocks.{0...31}.attn.k.bias | MISSING |
thinker.visual.blocks.{0...31}.norm2.weight | MISSING |
thinker.model.layers.{0...27}.self_attn.v_proj.weight | MISSING |
token2wav.code2wav_bigvgan_model.resblocks.{0...17}.activations.{0, 1, 2, 3, 4, 5}.act.beta | MISSING |
token2wav.code2wav_dit_model.transformer_blocks.{0...21}.attn.to_v.bias | MISSING |
thinker.visual.blocks.{0...31}.attn.v.bias | MISSING |
thinker.audio_tower.layers.{0...31}.self_attn.v_proj.bias | MISSING |
token2wav.code2wav_dit_model.transformer_blocks.{0...21}.attn.to_out.0.bias | MISSING |
thinker.visual.blocks.{0...31}.mlp.up_proj.weight | MISSING |
thinker.visual.blocks.{0...31}.attn.q.bias | MISSING |
thinker.audio_tower.layers.{0...31}.self_attn_layer_norm.bias | MISSING |
token2wav.code2wav_bigvgan_model.resblocks.{0...17}.activations.{0, 1, 2, 3, 4, 5}.act.alpha | MISSING |
thinker.model.layers.{0...27}.self_attn.k_proj.bias | MISSING |
thinker.visual.blocks.{0...31}.norm1.weight | MISSING |
token2wav.code2wav_bigvgan_model.resblocks.{0...17}.convs2.{0, 1, 2}.weight | MISSING |
thinker.audio_tower.conv1.bias | MISSING |
talker.model.layers.{0...27}.self_attn.v_proj.bias | MISSING |
thinker.model.layers.{0...27}.self_attn.q_proj.bias | MISSING |
talker.model.layers.{0...27}.self_attn.q_proj.bias | MISSING |
token2wav.code2wav_dit_model.transformer_blocks.{0...21}.attn_norm.linear.bias | MISSING |
talker.model.layers.{0...27}.self_attn.o_proj.weight | MISSING |
token2wav.code2wav_dit_model.transformer_blocks.{0...21}.ff.ff.{0, 3}.bias | MISSING |
thinker.visual.blocks.{0...31}.attn.proj.weight | MISSING |
thinker.visual.blocks.{0...31}.attn.q.weight | MISSING |
thinker.model.layers.{0...27}.mlp.up_proj.weight | MISSING |
thinker.audio_tower.layers.{0...31}.self_attn.q_proj.bias | MISSING |
thinker.audio_tower.layers.{0...31}.fc1.bias | MISSING |
thinker.visual.blocks.{0...31}.attn.k.weight | MISSING |
talker.model.layers.{0...27}.self_attn.q_proj.weight | MISSING |
thinker.audio_tower.layers.{0...31}.final_layer_norm.weight | MISSING |
thinker.visual.blocks.{0...31}.mlp.down_proj.weight | MISSING |
thinker.visual.blocks.{0...31}.attn.v.weight | MISSING |
thinker.visual.blocks.{0...31}.mlp.gate_proj.bias | MISSING |
thinker.audio_tower.layers.{0...31}.self_attn.out_proj.bias | MISSING |
thinker.model.layers.{0...27}.post_attention_layernorm.weight | MISSING |
token2wav.code2wav_bigvgan_model.resblocks.{0...17}.convs2.{0, 1, 2}.bias | MISSING |
token2wav.code2wav_dit_model.input_embed.spk_encoder.blocks.{1, 2, 3}.tdnn1.conv.weight | MISSING |
thinker.audio_tower.layers.{0...31}.fc2.weight | MISSING |
talker.model.layers.{0...27}.mlp.down_proj.weight | MISSING |
token2wav.code2wav_dit_model.input_embed.spk_encoder.blocks.{1, 2, 3}.se_block.conv1.weight | MISSING |
thinker.model.layers.{0...27}.self_attn.k_proj.weight | MISSING |
talker.model.layers.{0...27}.self_attn.k_proj.weight | MISSING |
thinker.model.layers.{0...27}.mlp.down_proj.weight | MISSING |
talker.model.layers.{0...27}.input_layernorm.weight | MISSING |
token2wav.code2wav_dit_model.transformer_blocks.{0...21}.attn.to_q.bias | MISSING |
thinker.audio_tower.layers.{0...31}.fc2.bias | MISSING |
thinker.audio_tower.layers.{0...31}.fc1.weight | MISSING |
token2wav.code2wav_dit_model.transformer_blocks.{0...21}.attn.to_k.bias | MISSING |
thinker.model.layers.{0...27}.self_attn.q_proj.weight | MISSING |
thinker.model.layers.{0...27}.self_attn.v_proj.bias | MISSING |
token2wav.code2wav_dit_model.input_embed.spk_encoder.mfa.conv.bias | MISSING |
thinker.visual.blocks.{0...31}.mlp.gate_proj.weight | MISSING |
thinker.visual.blocks.{0...31}.mlp.up_proj.bias | MISSING |
token2wav.code2wav_dit_model.transformer_blocks.{0...21}.attn.to_v.weight | MISSING |
thinker.lm_head.weight | MISSING |
thinker.audio_tower.layers.{0...31}.self_attn.v_proj.weight | MISSING |
thinker.audio_tower.layers.{0...31}.self_attn_layer_norm.weight | MISSING |
thinker.model.layers.{0...27}.mlp.gate_proj.weight | MISSING |
talker.model.layers.{0...27}.self_attn.k_proj.bias | MISSING |
token2wav.code2wav_dit_model.transformer_blocks.{0...21}.attn.to_q.weight | MISSING |
thinker.audio_tower.layers.{0...31}.self_attn.out_proj.weight | MISSING |
token2wav.code2wav_dit_model.transformer_blocks.{0...21}.attn.to_k.weight | MISSING |
token2wav.code2wav_bigvgan_model.conv_pre.weight | MISSING |
token2wav.code2wav_dit_model.input_embed.spk_encoder.mfa.conv.weight | MISSING |
talker.model.layers.{0...27}.mlp.gate_proj.weight | MISSING |
thinker.model.layers.{0...27}.self_attn.o_proj.weight | MISSING |
talker.model.layers.{0...27}.self_attn.v_proj.weight | MISSING |
thinker.visual.blocks.{0...31}.attn.proj.bias | MISSING |
token2wav.code2wav_dit_model.input_embed.spk_encoder.blocks.{1, 2, 3}.se_block.conv2.bias | MISSING |
thinker.audio_tower.layers.{0...31}.self_attn.q_proj.weight | MISSING |
thinker.visual.merger.ln_q.weight | MISSING |
token2wav.code2wav_dit_model.transformer_blocks.{0...21}.attn_norm.linear.weight | MISSING |
token2wav.code2wav_bigvgan_model.conv_pre.bias | MISSING |
token2wav.code2wav_dit_model.input_embed.spk_encoder.blocks.{1, 2, 3}.se_block.conv2.weight | MISSING |
token2wav.code2wav_dit_model.input_embed.spk_encoder.blocks.{1, 2, 3}.res2net_block.blocks.0.conv.weight | MISSING |
thinker.visual.merger.mlp.{0, 2}.weight | MISSING |
token2wav.code2wav_bigvgan_model.ups.{0, 1, 2, 3, 4, 5}.0.bias | MISSING |
token2wav.code2wav_dit_model.input_embed.spk_encoder.blocks.{1, 2, 3}.res2net_block.blocks.0.conv.bias | MISSING |
token2wav.code2wav_dit_model.time_embed.time_mlp.{0, 2}.weight | MISSING |
token2wav.code2wav_dit_model.text_embed.codec_embed.weight | MISSING |
token2wav.code2wav_dit_model.proj_out.bias | MISSING |
token2wav.code2wav_dit_model.norm_out.linear.weight | MISSING |
token2wav.code2wav_dit_model.input_embed.spk_encoder.asp.tdnn.conv.weight | MISSING |
token2wav.code2wav_bigvgan_model.ups.{0, 1, 2, 3, 4, 5}.0.weight | MISSING |
token2wav.code2wav_dit_model.input_embed.spk_encoder.asp.conv.weight | MISSING |
token2wav.code2wav_dit_model.input_embed.spk_encoder.fc.weight | MISSING |
thinker.audio_tower.conv1.weight | MISSING |
thinker.visual.merger.mlp.{0, 2}.bias | MISSING |
token2wav.code2wav_dit_model.input_embed.spk_encoder.blocks.{1, 2, 3}.tdnn2.conv.bias | MISSING |
token2wav.code2wav_dit_model.time_embed.time_mlp.{0, 2}.bias | MISSING |
thinker.audio_tower.ln_post.bias | MISSING |
token2wav.code2wav_dit_model.input_embed.spk_encoder.blocks.{1, 2, 3}.tdnn2.conv.weight | MISSING |
token2wav.code2wav_dit_model.input_embed.spk_encoder.asp.tdnn.conv.bias | MISSING |
talker.model.norm.weight | MISSING |
token2wav.code2wav_dit_model.input_embed.proj.bias | MISSING |
token2wav.code2wav_dit_model.input_embed.spk_encoder.blocks.{1, 2, 3}.tdnn1.conv.bias | MISSING |
token2wav.code2wav_dit_model.input_embed.spk_encoder.fc.bias | MISSING |
token2wav.code2wav_dit_model.norm_out.linear.bias | MISSING |
talker.model.embed_tokens.weight | MISSING |
token2wav.code2wav_dit_model.input_embed.spk_encoder.blocks.{1, 2, 3}.se_block.conv1.bias | MISSING |
token2wav.code2wav_bigvgan_model.activation_post.act.beta | MISSING |
token2wav.code2wav_bigvgan_model.activation_post.act.alpha | MISSING |
thinker.model.norm.weight | MISSING |
talker.codec_head.weight | MISSING |
thinker.audio_tower.proj.bias | MISSING |
token2wav.code2wav_dit_model.input_embed.spk_encoder.blocks.0.conv.bias | MISSING |
thinker.audio_tower.conv2.weight | MISSING |
thinker.visual.patch_embed.proj.weight | MISSING |
talker.thinker_to_talker_proj.bias | MISSING |
token2wav.code2wav_dit_model.input_embed.spk_encoder.asp.conv.bias | MISSING |
thinker.audio_tower.proj.weight | MISSING |
token2wav.code2wav_bigvgan_model.conv_post.weight | MISSING |
token2wav.code2wav_dit_model.proj_out.weight | MISSING |
thinker.model.embed_tokens.weight | MISSING |
talker.thinker_to_talker_proj.weight | MISSING |
thinker.audio_tower.conv2.bias | MISSING |
thinker.audio_tower.ln_post.weight | MISSING |
token2wav.code2wav_dit_model.input_embed.proj.weight | MISSING |
thinker.audio_tower.audio_bos_eos_token.weight | MISSING |
Notes:
- UNEXPECTED: can be ignored when loading from a different task/architecture; not OK if you expect an identical arch.
- MISSING: those params were newly initialized because they are missing from the checkpoint. Consider training on your downstream task.
followed by an error:
OSError: ./RoboOmni-LIBERO-Spatial does not appear to have a file named spk_dict.pt. Checkout 'https://huggingface.co/./RoboOmni-LIBERO-Spatial/tree/main' for available files.
This likely means that `RoboOmni-LIBERO-Spatial` does not contain the talker and token2wav modules (hence all the MISSING entries). However, as shown in the official RoboOmni demo video, the model supports end-to-end audio and action output, and those missing modules likely cannot simply be loaded from the original Qwen2.5-Omni backbone.
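For what it's worth, the missing modules can be confirmed without loading any weights, by inspecting which top-level prefixes appear in the checkpoint's sharded-weights index. A minimal sketch (the helper names are mine, and I'm assuming the checkpoint directory contains a standard Hugging Face `model.safetensors.index.json`; single-file checkpoints without an index would need a different check):

```python
import json

# Hypothetical helper (not from the repo): list the top-level modules a
# sharded Hugging Face checkpoint contains, by reading the key names in
# model.safetensors.index.json instead of materializing any weights.
def top_level_modules(index_path):
    with open(index_path) as f:
        weight_map = json.load(f)["weight_map"]
    # "thinker.model.layers.0..." -> "thinker"; take the first dotted segment
    return sorted({key.split(".", 1)[0] for key in weight_map})

def has_audio_output_modules(index_path):
    # The full Qwen2_5OmniForConditionalGeneration expects talker and
    # token2wav weights; a thinker-only checkpoint lacks both prefixes.
    return {"talker", "token2wav"} <= set(top_level_modules(index_path))
```

If `has_audio_output_modules("./RoboOmni-LIBERO-Spatial/model.safetensors.index.json")` returns `False`, that would match the load report above: only thinker-side weights (`model`, `visual`, `audio_tower`, ...) are present, and the talker/token2wav parameters get freshly initialized.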
Beyond the possibly missing full model, the current repo also lacks a tutorial detailed enough for an exact reproduction (e.g., how to output audio together with actions as shown in the demo).
Therefore, to help the community build on this work, I would kindly like to request the full model and a tutorial, if possible ❤.