In the model description for the stable diffusion benchmark (https://github.com/mlcommons/training/tree/master/stable_diffusion#the-model) we state clearly that the latent output of the autoencoder is 64x64x4, but we don't state the output embedding size of the OpenCLIP-ViT/H text encoder that is also fed into the UNet backbone.
I am not sure, but the correct reference might be https://github.com/mlfoundations/open_clip/blob/main/src/open_clip/model_configs/ViT-H-14.json, in which case the embedding size is 1024?
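For what it's worth, here is a quick sketch of how one could check the dimension directly with the `open_clip` package (this assumes `open_clip_torch` is installed; building the model with `pretrained=None` avoids downloading weights just to inspect shapes):

```python
# Sketch: check the output dimension of the OpenCLIP ViT-H/14 text encoder.
# Assumes `pip install open_clip_torch`; weights are not needed to verify shapes.
import torch
import open_clip

# Build the architecture only (no pretrained weights downloaded).
model, _, _ = open_clip.create_model_and_transforms("ViT-H-14", pretrained=None)
tokenizer = open_clip.get_tokenizer("ViT-H-14")

tokens = tokenizer(["a photograph of an astronaut riding a horse"])  # (1, 77)
with torch.no_grad():
    pooled = model.encode_text(tokens)

print(pooled.shape)  # torch.Size([1, 1024]) if the embedding size is indeed 1024
```

If I understand correctly, Stable Diffusion 2.x conditions the UNet on the per-token hidden states of the text encoder (shape 77x1024) rather than the pooled vector, but both share the same 1024 width, so either way the docs could state 1024 as the text-embedding size.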