Dear @haotian-liu and team, thanks for the wonderful work. I'm always excited to see new updates to the LLaVA family, including the latest LLaVA-Plus model. I would like to understand the team's perspective on the choice of the current visual projection strategy. As we all know, one of the critical components of effective visual understanding with LLMs is the multimodal interaction space. In the LLaVA papers, the authors argued that a simple MLP projection on top of a visual encoder is easy to train and effective at bridging the modality gap. Has the team run any ablation experiments comparing visual projection architectures, e.g. popular approaches such as Flamingo-style interleaved cross-attention, the BLIP-2 Q-Former, soft-prompt prefixing, or the recently introduced Fuyu-8B, which did away with the visual encoder entirely and feeds visual patch tokens directly into the transformer to scale resolution? In light of all these emerging strategies, it would be insightful to understand the team's considerations and arguments for the chosen strategy, and how you see VLMs evolving in the near future.
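For concreteness, the MLP-projection idea mentioned above can be sketched in a few lines. This is a minimal NumPy illustration of the concept, not the LLaVA implementation; the dimensions and the two-layer shape are assumptions for the example (LLaVA-1.5 uses a two-layer MLP with GELU, while the activation and widths here are placeholders):

```python
import numpy as np

# Sketch of a LLaVA-style projector: a small MLP maps each patch
# feature from a frozen visual encoder into the LLM's token-embedding
# space, so image patches enter the LLM as "soft" visual tokens.

VIT_DIM = 1024   # assumed visual-encoder feature width
LLM_DIM = 4096   # assumed LLM hidden width

rng = np.random.default_rng(0)
W1 = rng.standard_normal((VIT_DIM, LLM_DIM)) * 0.02
W2 = rng.standard_normal((LLM_DIM, LLM_DIM)) * 0.02

def project(patch_features):
    """Map (num_patches, VIT_DIM) encoder output to (num_patches, LLM_DIM)."""
    h = np.maximum(patch_features @ W1, 0.0)  # ReLU here for brevity; LLaVA-1.5 uses GELU
    return h @ W2

patches = rng.standard_normal((576, VIT_DIM))  # e.g. a 24x24 patch grid
visual_tokens = project(patches)
print(visual_tokens.shape)  # (576, 4096)
```

The appeal is that only these projection weights need training in the alignment stage, whereas a Q-Former or interleaved cross-attention adds a whole learned module between the encoder and the LLM.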
Thanks very much for the food for thought.