Stable_diffusion: document embedding size from ViT-H into Unet #663

Closed
matthew-frank opened this issue Jun 22, 2023 · 2 comments · Fixed by #677

Comments

@matthew-frank
Contributor

The model description for the Stable Diffusion benchmark (https://github.com/mlcommons/training/tree/master/stable_diffusion#the-model) clearly states that the latent output of the autoencoder is 64x64x4, but it does not state the output embedding size of the OpenCLIP-ViT/H text encoder, which is also fed into the UNet backbone.

I am not sure, but the correct reference might be https://github.com/mlfoundations/open_clip/blob/main/src/open_clip/model_configs/ViT-H-14.json, in which case the embedding size would be 1024?
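
For anyone who wants to check the number, here is a minimal sketch of pulling the relevant dimensions out of a config of that shape. The JSON string is an assumption: it mirrors the fields I believe ViT-H-14.json contains (`embed_dim`, `text_cfg.width`, `text_cfg.context_length`) and should be verified against the file in the open_clip repository. The helper name `text_encoder_dims` is hypothetical and not part of open_clip.

```python
import json

# ASSUMPTION: illustrative copy of the text-related fields of ViT-H-14.json.
# Verify these values against the open_clip repository before relying on them.
VIT_H_14_CONFIG = """
{
  "embed_dim": 1024,
  "text_cfg": {
    "context_length": 77,
    "width": 1024
  }
}
"""


def text_encoder_dims(config_json: str) -> dict:
    """Hypothetical helper: extract text-encoder sizes from a model config."""
    cfg = json.loads(config_json)
    text_cfg = cfg["text_cfg"]
    return {
        # Size of the joint (projected) embedding space.
        "embed_dim": cfg["embed_dim"],
        # Hidden width of the text transformer, i.e. the per-token feature size.
        "text_width": text_cfg["width"],
        # Number of token positions in the text sequence.
        "context_length": text_cfg["context_length"],
    }


dims = text_encoder_dims(VIT_H_14_CONFIG)
print(dims)
```

If the assumed values hold, both `embed_dim` and the text transformer width are 1024, so the size to document would be 1024 per token whichever of the two the UNet consumes.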

@ahmadki
Contributor

ahmadki commented Jul 23, 2023

Added with #677

@nv-rborkar
Contributor

Closed with the PR above.
