
fix: apply post_layernorm in CLIPVisionTransformer forward#3456

Open
gururajkosuru wants to merge 1 commit into mlc-ai:main from gururajkosuru:fix/clip-post-layernorm

Conversation

@gururajkosuru
Contributor

The post_layernorm was defined in CLIPVisionTransformer.__init__ but never applied in the forward method. This fix applies it to the final encoder hidden state, replacing that element in the returned tuple. This matches the HuggingFace CLIPVisionTransformer behavior, where last_hidden_state is post-normalized.

The [-2] indexing in CLIPVisionModel (used by LLaVA and Phi3V to get penultimate layer features) continues to work correctly since only the last element of the tuple is replaced.

Fixes #3205
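The shape of the fix can be sketched in plain NumPy. This is an illustrative stand-in, not the actual mlc_llm code: the layer_norm helper, the forward function, and the tensor shapes are all assumptions made for the example; the real model uses its nn module counterparts.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Plain NumPy stand-in for the model's post_layernorm: normalize
    # over the last (hidden) dimension.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def forward(hidden_states):
    # hidden_states: tuple of per-layer activations from the encoder.
    # Before the fix the tuple was returned unchanged; the fix replaces
    # only the final element with its post-layernormed version.
    *earlier, last = hidden_states
    return (*earlier, layer_norm(last))

layers = tuple(np.random.rand(1, 4, 8) for _ in range(3))
out = forward(layers)

# Penultimate layer is untouched, so [-2] indexing still works.
assert np.array_equal(out[-2], layers[-2])
# The last hidden state is now normalized (zero mean per token).
assert np.allclose(out[-1].mean(axis=-1), 0.0, atol=1e-6)
```

Because only the last tuple element is rewritten, every earlier index keeps its pre-fix value, which is what preserves downstream consumers.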


Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a bug in the CLIPVisionTransformer where the post_layernorm was defined but not correctly applied in the forward method. The fix ensures that the final encoder hidden state is properly normalized, bringing the model's output in line with the expected behavior of the HuggingFace implementation. This correction addresses an inconsistency while maintaining compatibility with existing usages that access intermediate hidden states.

Highlights

  • CLIPVisionTransformer Forward Method: Applied the post_layernorm to the final encoder hidden state within the forward method of CLIPVisionTransformer. This change aligns the model's behavior with the HuggingFace CLIPVisionTransformer implementation, ensuring the last_hidden_state is correctly post-normalized.
  • Compatibility: Ensured that the modification preserves compatibility for downstream models like LLaVA and Phi3V, which rely on [-2] indexing to retrieve penultimate layer features from CLIPVisionModel.
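The compatibility point in the highlights can be made concrete with a small sketch of the LLaVA/Phi3V-style feature selection. The names here (layer_norm, the synthetic hidden-states tuple, and the shapes) are assumptions for illustration only, not the projects' actual code:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Stand-in for post_layernorm: zero-mean/unit-variance over hidden dim.
    mean = x.mean(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

# Hidden states as returned after the fix: only the last element normalized.
raw = [np.random.rand(2, 5, 16) for _ in range(4)]
hidden_states = (*raw[:-1], layer_norm(raw[-1]))

# Penultimate-layer feature selection, as LLaVA and Phi3V do via [-2]:
features = hidden_states[-2]

# The selected features are byte-for-byte the pre-fix values.
assert np.array_equal(features, raw[-2])
```

The [-2] slot was never normalized before the fix and still is not afterward, so downstream projection layers trained on unnormalized penultimate features see identical inputs.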


Changelog
  • python/mlc_llm/model/vision/clip_vision.py
    • Applied the post_layernorm to the final encoder hidden state.
    • Modified the return tuple of the forward method to include the post-normalized last hidden state.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses a bug in CLIPVisionTransformer where the post_layernorm layer was defined but not applied in the forward method. The fix applies this layer to the final hidden state from the encoder, which aligns the model's behavior with the reference implementation from HuggingFace. The change modifies the last element of the returned tuple of hidden states, ensuring that indexing for intermediate layers in downstream models remains correct. The implementation is correct and resolves the issue described.


Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Missing post layernorm in CLIP model

1 participant