In the model description for the stable diffusion benchmark (https://github.com/mlcommons/training/tree/master/stable_diffusion#the-model) we state clearly that the latent output of the autoencoder is 64x64x4, but we don't state the output embedding size of the OpenCLIP-ViT/H text encoder that is also fed into the UNet backbone.
I am not sure, but the correct reference might be https://github.com/mlfoundations/open_clip/blob/main/src/open_clip/model_configs/ViT-H-14.json, in which case the embedding size is 1024?
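For what it's worth, here is a quick sketch of how one could check the dimension directly with the `open_clip` package (this assumes `open_clip_torch` is installed; building the model with `pretrained=None` avoids downloading weights just to inspect shapes):

```python
# Sketch: check the output dimension of the OpenCLIP ViT-H/14 text encoder.
# Assumes `pip install open_clip_torch`; weights are not needed to verify shapes.
import torch
import open_clip

# Build the architecture only (no pretrained weights downloaded).
model, _, _ = open_clip.create_model_and_transforms("ViT-H-14", pretrained=None)
tokenizer = open_clip.get_tokenizer("ViT-H-14")

tokens = tokenizer(["a photograph of an astronaut riding a horse"])  # (1, 77)
with torch.no_grad():
    pooled = model.encode_text(tokens)

print(pooled.shape)  # torch.Size([1, 1024]) if the embedding size is indeed 1024
```

If I understand correctly, Stable Diffusion 2.x conditions the UNet on the per-token hidden states of the text encoder (shape 77x1024) rather than the pooled vector, but both share the same 1024 width, so either way the docs could state 1024 as the text-embedding size.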