Dear @haotian-liu and team, thanks for the wonderful work. I'm always excited to see new updates to the LLaVA family, including the latest LLaVA-Plus model. I would like to understand the team's perspective on the choice of the current visual projection strategy. As we all know, one of the critical components of effective visual understanding with LLMs is the multimodal interaction space. In the LLaVA papers, the authors argued that a simple MLP projection on top of a visual encoder is easy to train and effective at bridging the modality gap. Has the team run any ablation experiments comparing visual projection architectures, e.g. popular approaches such as Flamingo-style interleaved cross-attention, the BLIP-2 Q-Former, soft-prompt prefixing, or the recently introduced Fuyu-8B, which did away with the visual encoder entirely and feeds visual patch tokens directly into the transformer to scale resolution? In light of all these emerging strategies, it would be insightful to understand the team's considerations and arguments for the chosen strategy, and how you see VLMs evolving in the near future.
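For concreteness, the MLP-projection idea mentioned above can be sketched in a few lines. This is a minimal NumPy illustration of the concept, not the LLaVA implementation; the dimensions and the two-layer shape are assumptions for the example (LLaVA-1.5 uses a two-layer MLP with GELU, while the activation and widths here are placeholders):

```python
import numpy as np

# Sketch of a LLaVA-style projector: a small MLP maps each patch
# feature from a frozen visual encoder into the LLM's token-embedding
# space, so image patches enter the LLM as "soft" visual tokens.

VIT_DIM = 1024   # assumed visual-encoder feature width
LLM_DIM = 4096   # assumed LLM hidden width

rng = np.random.default_rng(0)
W1 = rng.standard_normal((VIT_DIM, LLM_DIM)) * 0.02
W2 = rng.standard_normal((LLM_DIM, LLM_DIM)) * 0.02

def project(patch_features):
    """Map (num_patches, VIT_DIM) encoder output to (num_patches, LLM_DIM)."""
    h = np.maximum(patch_features @ W1, 0.0)  # ReLU here for brevity; LLaVA-1.5 uses GELU
    return h @ W2

patches = rng.standard_normal((576, VIT_DIM))  # e.g. a 24x24 patch grid
visual_tokens = project(patches)
print(visual_tokens.shape)  # (576, 4096)
```

The appeal is that only these projection weights need training in the alignment stage, whereas a Q-Former or interleaved cross-attention adds a whole learned module between the encoder and the LLM.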
Thanks very much for the food for thought.