Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device

arXiv  •  Demo  •  Project Page  •  Models  •  Datasets  •  App Store

Abdelrahman Shaker1,∗,†, Ahmed Heakl1,∗, Jaseel Muhammad1,
Ritesh Thawkar1, Omkar Thawakar1, Senmao Li1, Hisham Cholakkal1,
Ian Reid1, Eric P. Xing1,2, Salman Khan1,†, Fahad Shahbaz Khan1,3,†

1Mohamed bin Zayed University of Artificial Intelligence  2Carnegie Mellon University  3Linköping University

∗Equal Contributions  †Project Leaders


📣 Announcement

  • Mobile-O Live Demo: interactively explore the model's capabilities in the Mobile-O Online Demo.
  • Mobile-O is now fully released! This includes models, training and evaluation code, inference scripts, the paper, and the complete mobile app.

📌 Overview

Mobile-O is a compact, efficient unified vision–language–diffusion model that performs both multimodal understanding (VQA, OCR, reasoning) and image generation within a single architecture, while running entirely on-device. It is designed specifically for mobile and edge deployment, achieving real-time performance with a small memory footprint.

Mobile-O Overview

🧠 Model Capabilities

🖼️ Image Generation  •  👁️ Image Understanding  •  ✏️ Image Editing

๐Ÿ—๏ธ Architecture

Mobile-O Architecture
Overall architecture of Mobile-O: a unified visionโ€“languageโ€“diffusion model for on-device multimodal understanding and generation.

Mobile-O consists of three main components:

  • Vision-Language Model (VLM): A compact multimodal backbone based on FastVLM, combining a FastViT-based vision encoder with a lightweight autoregressive language model (Qwen2-0.5B) for efficient visual–text understanding.

  • Diffusion Decoder: A lightweight DiT-style diffusion transformer based on SANA, paired with a VAE encoder–decoder, designed for 512×512 text-to-image generation under mobile constraints.

  • Mobile Conditioning Projector (MCP): A novel lightweight connector (~2.4M params) that bridges the VLM and diffusion decoder using layerwise feature fusion with temperature-scaled learnable weights, depthwise-separable 1D convolutions, and efficient channel attention. Unlike query-token approaches, MCP directly conditions the diffusion model on weighted VLM hidden states with minimal overhead.
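The released code defines the exact MCP; as a rough illustration of its layerwise fusion idea, the NumPy sketch below (all names, shapes, and the temperature value are our own assumptions, not the paper's implementation) shows how hidden states from several VLM layers could be combined with temperature-scaled, softmax-normalized learnable weights:

```python
import numpy as np

def fuse_layerwise(hidden_states, layer_logits, temperature=2.0):
    """Fuse per-layer VLM hidden states with temperature-scaled softmax weights.

    hidden_states: list of L arrays, each (seq_len, dim)
    layer_logits:  L learnable scalars, one per layer
    temperature:   >1 softens, <1 sharpens the layer distribution
    """
    scaled = np.asarray(layer_logits, dtype=float) / temperature
    weights = np.exp(scaled - scaled.max())
    weights /= weights.sum()                       # softmax over layers
    stacked = np.stack(hidden_states, axis=0)      # (L, seq_len, dim)
    return np.tensordot(weights, stacked, axes=1)  # (seq_len, dim)

# Toy example: 4 layers, 6 tokens, 8-dim features
rng = np.random.default_rng(0)
states = [rng.standard_normal((6, 8)) for _ in range(4)]
fused = fuse_layerwise(states, layer_logits=[0.1, 0.5, 1.0, 0.2])
print(fused.shape)  # (6, 8)
```

The fused sequence would then pass through the convolution and channel-attention stages before conditioning the diffusion decoder; those stages are omitted here for brevity.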


🎯 Supported Tasks

| Task | Input | Output | Description |
|---|---|---|---|
| 💬 Text → Text | Text | Text | General conversational AI |
| 👁️ Image → Text | Image + Text | Text | Image understanding (VQA, OCR, reasoning) |
| 🖼️ Text → Image | Text | Image | High-quality image generation at 512×512 |
| ✏️ Text + Image → Image | Image + Text | Image | Instruction-based image editing |
| 🔄 Unified Training | Mixed | Mixed | Joint image generation and understanding |

📱 Mobile App

Mobile-O runs entirely on-device with no cloud dependency. We release the full source code of the iOS app along with optimized MLX and CoreML model components. The app runs smoothly on iPhone 15 Pro, iPhone 16 Pro, and iPhone 17 Pro ✅.

Download on the App Store

📱 iOS App Source Code: Mobile-O-App
🧩 MLX & CoreML Models: 🤗 HuggingFace

⚡ ~3–4s Image Generation  •  👁️ ~0.4s Visual Understanding  •  💾 < 2GB Memory Footprint


📊 Training Datasets

| Stage | Description | Download |
|---|---|---|
| Pre-training | 9M text–image pairs (JourneyDB + BLIP3o-Pretrain-Short-Caption) | 🤗 HuggingFace |
| SFT | ~105K curated prompt–image pairs | 🤗 HuggingFace |
| Post-training | ~105K unified quadruplet samples | 🤗 HuggingFace |

โš™๏ธ Setup

conda create -n mobileo python=3.12 -y
conda activate mobileo
pip install -r requirements.txt

🚀 Inference

Download Checkpoint

python -c "from huggingface_hub import snapshot_download; print(snapshot_download(repo_id='Amshaker/Mobile-O-0.5B', repo_type='model', local_dir='checkpoints', allow_patterns=['final_merged_model_23620/*']))"

1. Image Understanding

python infer_image_understanding.py \
    --model_path /HF_model/checkpoint/path/ \
    --image_path assets/cute_cat.png \
    --prompt "What is in the image?"

2. Image Generation

python infer_image_generation.py \
    --model_path /HF_model/checkpoint/path/ \
    --prompt "A vibrant tropical rainforest scene with a scarlet macaw perched on a moss-covered branch"

3. Image Editing

python infer_image_editing.py \
    --model_path /HF_model/checkpoint/path/ \
    --image_path assets/cute_cat.png \
    --prompt "Make the cat wear a hat"

๐Ÿ‹๏ธ Training

Stage 1: Pretraining (Cross-Modal Alignment)

We pretrain the DiT and Mobile Conditioning Projector (MCP) components on 9M text–image pairs from JourneyDB (4M) and BLIP3o-Short-Caption (5M). The visual encoders, LLM backbone, and VAE are frozen.

bash scripts/Mobile-O-0.5B/pretrain.sh

Stage 2: Supervised Fine-tuning (SFT)

We finetune the DiT and MCP components on ~105K curated prompt–image pairs (60K from BLIP3o + 45K from ShareGPT-4o-Image). The visual encoders, LLM backbone, and VAE remain frozen.

bash scripts/Mobile-O-0.5B/sft.sh

Stage 3: Unified Multimodal Post-Training

We post-train the DiT, MCP, LLM (via LoRA), and visual encoder components on ~105K quadruplet samples in the format (generation prompt, image, question, answer). Only the VAE remains frozen.

bash scripts/Mobile-O-0.5B/post_train.sh

Post-Training Pipeline
Unified multimodal post-training: jointly optimizing image generation and visual understanding via a multi-task objective.
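The multi-task objective can be pictured as a weighted sum of a denoising (generation) term and a token cross-entropy (understanding) term. The NumPy toy below is only a sketch: the loss weights, shapes, and function names are our own placeholders, not values from the paper or the released scripts.

```python
import numpy as np

def diffusion_mse(noise_pred, noise):
    # Standard denoising objective: regress the injected noise
    return float(np.mean((noise_pred - noise) ** 2))

def understanding_ce(logits, target_ids):
    # Token-level cross-entropy over answer tokens (numerically stable)
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(target_ids)), target_ids].mean())

def unified_loss(noise_pred, noise, logits, targets, lam_gen=1.0, lam_und=1.0):
    # Joint objective: generation + understanding; lam_* are placeholder
    # weights, not the values used by Mobile-O's training scripts.
    return (lam_gen * diffusion_mse(noise_pred, noise)
            + lam_und * understanding_ce(logits, targets))

rng = np.random.default_rng(0)
loss = unified_loss(rng.standard_normal((4, 8)), rng.standard_normal((4, 8)),
                    rng.standard_normal((5, 10)), np.array([1, 3, 2, 0, 7]))
print(round(loss, 3))
```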

Merging LoRA Weights

Since the output of post-training is LoRA adapter weights for the LLM, merge them with the base model using merge_lora.py to obtain the final merged checkpoint for inference.

python mobileo/merge_lora.py \
    --checkpoint_dir /path/to/lora_weights/ \
    --base_weights /path/to/sft_checkpoint/ \
    --output_dir /path/to/final_merged_model/

Example with actual paths:

python mobileo/merge_lora.py \
    --checkpoint_dir checkpoints/Mobile-O-0.5B-Post-Train/ \
    --base_weights checkpoints/Mobile-O-0.5B-SFT/ \
    --output_dir checkpoints/Mobile-O-0.5B-Post-Train/final_merged_model/
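merge_lora.py performs the actual merge; conceptually, merging folds each adapter's low-rank update back into the frozen base weight so the adapter matrices can be discarded at inference time. A minimal NumPy sketch of the standard LoRA merge (rank, alpha, and shapes are illustrative assumptions, not Mobile-O's configuration):

```python
import numpy as np

def merge_lora_weight(W_base, A, B, alpha, rank):
    """Fold a LoRA update into a base weight matrix.

    W_base: (out, in) frozen base weight
    A:      (rank, in) down-projection; B: (out, rank) up-projection
    Merged weight: W + (alpha / rank) * B @ A
    """
    return W_base + (alpha / rank) * (B @ A)

rng = np.random.default_rng(0)
out_dim, in_dim, rank, alpha = 8, 16, 4, 8
W = rng.standard_normal((out_dim, in_dim))
A = rng.standard_normal((rank, in_dim))
B = np.zeros((out_dim, rank))  # B starts at zero in LoRA training
merged = merge_lora_weight(W, A, B, alpha, rank)
print(np.allclose(merged, W))  # True: a zero update leaves the base unchanged
```

After training, B is no longer zero, so the merged checkpoint differs from the SFT base while keeping the same shapes.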

🎨 Qualitative Results

Generation Samples:

Generation Results

Qualitative Comparison:

Qualitative Results

More Generation Comparison:

Generation Comparison

More Understanding Comparison:

Understanding Comparison


📄 Citation

If you find Mobile-O useful in your research, please consider citing:

@article{shaker2026mobileo,
  title={Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device},
  author={Shaker, Abdelrahman and Heakl, Ahmed and Muhammad, Jaseel and Thawkar, Ritesh and Thawakar, Omkar and Li, Senmao and Cholakkal, Hisham and Reid, Ian and Xing, Eric P. and Khan, Salman and Khan, Fahad Shahbaz},
  journal={arXiv preprint arXiv:2602.20161},
  year={2026}
}

๐Ÿ™ Acknowledgements

This repo is partially built upon BLIP3o. Thanks to all the contributors for their great efforts.


📜 License

  • The Mobile-O models, source code, and mobile application are released exclusively for research and non-commercial use under the CC BY-NC-SA 4.0 license. Any commercial use is strictly prohibited without prior explicit written permission from the authors.

๐ŸŒ Project Page ย โ€ขย  ๐Ÿš€ Live Demo ย โ€ขย  ๐Ÿ“„ Paper
