
[tutorial] init images and semantic vs direct prompting #41

dmarx opened this issue May 18, 2022 · 2 comments

dmarx commented May 18, 2022

https://discord.com/channels/869630568818696202/899135695677968474/976590248387698748

dmarx commented May 18, 2022

1. The init image is just used to initialize the generation process instead of noise. All steering is achieved by the text prompt; the init image only directly impacts the first frame of the generation process.

   scenes: "a photograph of a cat"
   init_image: dog.png

2. A "direct" weight is added to the init image to discourage the individual pixels in the image from changing from their current values. The init image "directly" impacts the generation of all frames in the scene.

   scenes: "a photograph of a cat"
   init_image: dog.png
   direct_init_weight: 1

3. Using a semantic weight instead changes the behavior a lot. As in (1), the actual pixel values of the init image only directly inform the first frame, but the rest of the frames will still be informed by the information in the image, similarly to if someone were describing the image out loud but you never got to look at it: "this is a picture of a dog", "this is a photograph", "it is daytime", etc. But we lose all of the positional information, which we kept in (2).

   scenes: "a photograph of a cat"
   init_image: dog.png
   semantic_init_weight: 1

4. We can mix and match these effects however we want.

   scenes: "a photograph of a cat"
   init_image: dog.png
   semantic_init_weight: 1
   direct_init_weight: 1

dmarx commented May 18, 2022

so let's say you've got an init image and a text prompt
the generation will load up that init image the same as if it were like a "previous frame"
but then when it starts doing its thing, it's gonna basically ignore the content of the image and just start trying to manipulate the image towards the text prompt
so pytti achieves this by converting the text into a "semantic" representation, i.e. a bunch of numbers that carry a lot of the informational content of the text
pytti does the same thing to the image you're generating and tries to move the image's semantic representation close to the prompt's. that's how CLIP guidance works.
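
As a rough sketch of what that semantic loss looks like (this assumes OpenAI's `clip` package and a plain cosine distance; pytti's actual perceptor wrapper and distance measure differ in the details):

```python
import torch
import clip  # OpenAI's CLIP package; stands in here for pytti's perceptor wrapper

device = "cuda" if torch.cuda.is_available() else "cpu"
perceptor, preprocess = clip.load("ViT-B/32", device=device)

def semantic_loss(image_batch, prompt):
    # embed the text prompt and the generated image into CLIP's shared semantic space
    text_emb = perceptor.encode_text(clip.tokenize([prompt]).to(device)).float()
    image_emb = perceptor.encode_image(image_batch).float()  # image_batch: (N, 3, 224, 224), CLIP-preprocessed
    # normalize and measure how far apart they are (0 when they line up perfectly)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    return (1 - (image_emb * text_emb).sum(dim=-1)).mean()
```

Minimizing that value with respect to the image (or the parameters that generate it) is what nudges the picture toward the prompt.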

so CLIP can take either text or an image and represent those things in the same semantic space
which means that you can use the semantic content of images to steer the generation process exactly the same way as you do with text
so we consequently have two ways we can use images for the steering process
we can be very literal and old school with what is often called a "reconstruction loss"
which means we compare the image we generated with the steering image pixel-by-pixel, and try to nudge the generation to be close to the original picture at each pixel position
this is what pytti calls "direct" prompting or stabilization
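
A minimal sketch of that "direct" / reconstruction term, assuming a plain per-pixel MSE (pytti's exact formulation may differ):

```python
import torch.nn.functional as F

def direct_loss(generated, target):
    # generated, target: image tensors with the same shape and value range
    # compare them pixel-by-pixel; the gradient nudges each generated pixel
    # back toward its value in the steering image
    return F.mse_loss(generated, target)
```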
alternatively, we can use the image as a source of information content like we would with text
this is what pytti calls "semantic" prompting/stabilization

alright, so back to init images
we have direct_init_weight and semantic_init_weight
so when you give pytti an init image, you can tell it to use a reconstruction loss, or a semantic (CLIP) loss, or even both
or neither
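
Putting those pieces together, here is an illustrative sketch of how the two weights could scale their respective loss terms. The names and weighting scheme are assumptions for illustration, not pytti's actual internals; the CLIP usage again assumes OpenAI's `clip` package.

```python
import torch
import torch.nn.functional as F
import clip  # OpenAI's CLIP package; stand-in for pytti's perceptor wrapper

device = "cuda" if torch.cuda.is_available() else "cpu"
perceptor, preprocess = clip.load("ViT-B/32", device=device)

def clip_embed(image_batch):
    emb = perceptor.encode_image(image_batch).float()
    return emb / emb.norm(dim=-1, keepdim=True)

def init_losses(generated, init_image, direct_init_weight=0.0, semantic_init_weight=0.0):
    # generated / init_image: image tensors of the same shape, preprocessed for CLIP
    loss = torch.zeros((), device=generated.device)
    if direct_init_weight:
        # reconstruction term: holds the pixels near the init image in every frame
        loss = loss + direct_init_weight * F.mse_loss(generated, init_image)
    if semantic_init_weight:
        # CLIP term: keeps the *content* similar, with no positional information
        sim = (clip_embed(generated) * clip_embed(init_image)).sum(dim=-1)
        loss = loss + semantic_init_weight * (1 - sim).mean()
    return loss
```

With both weights at zero this contributes nothing, which is case (1) from the first comment: the init image only seeds the first frame.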
