
[tutorial] init images and semantic vs direct prompting #41

dmarx opened this issue May 18, 2022 · 2 comments

dmarx commented May 18, 2022

https://discord.com/channels/869630568818696202/899135695677968474/976590248387698748

dmarx commented May 18, 2022

1. The init image is just used to initialize the generation process instead of noise. All steering is achieved by the text prompt; the init image only directly impacts the first frame of the generation process.

   scenes: "a photograph of a cat"
   init_image: dog.png

2. A "direct" weight is added to the init image to discourage the individual pixels in the image from changing from their current values. The init image "directly" impacts the generation of all frames in the scene.

   scenes: "a photograph of a cat"
   init_image: dog.png
   direct_init_weight: 1

3. Using a semantic weight instead changes the behavior a lot. As in (1), the actual pixel values of the init image only directly inform the first frame, but the rest of the frames will still be informed by the information in the image, similarly to if someone were describing the image out loud but you never got to look at it: "this is a picture of a dog", "this is a photograph", "it is daytime", etc. But we lose all of the positional information, which we kept in (2).

   scenes: "a photograph of a cat"
   init_image: dog.png
   semantic_init_weight: 1

4. We can mix and match these effects however we want.

   scenes: "a photograph of a cat"
   init_image: dog.png
   semantic_init_weight: 1
   direct_init_weight: 1

dmarx commented May 18, 2022

so let's say you've got an init image and a text prompt
the generation will load up that init image the same as if it were like a "previous frame"
but then when it starts doing its thing, it's gonna basically ignore the content of the image and just start trying to manipulate the image towards the text prompt
so pytti achieves this by converting the text into a "semantic" representation, i.e. a bunch of numbers that carry a lot of the informational content of the text
pytti does the same thing to the image you're generating and tries to move the image's semantic representation close to the prompt's. that's how CLIP guidance works.
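
As a rough sketch of what that semantic loss looks like (this assumes OpenAI's `clip` package and a plain cosine distance; pytti's actual perceptor wrapper and distance measure differ in the details):

```python
import torch
import clip  # OpenAI's CLIP package; stands in here for pytti's perceptor wrapper

device = "cuda" if torch.cuda.is_available() else "cpu"
perceptor, preprocess = clip.load("ViT-B/32", device=device)

def semantic_loss(image_batch, prompt):
    # embed the text prompt and the generated image into CLIP's shared semantic space
    text_emb = perceptor.encode_text(clip.tokenize([prompt]).to(device)).float()
    image_emb = perceptor.encode_image(image_batch).float()  # image_batch: (N, 3, 224, 224), CLIP-preprocessed
    # normalize and measure how far apart they are (0 when they line up perfectly)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    return (1 - (image_emb * text_emb).sum(dim=-1)).mean()
```

Minimizing that value with respect to the image (or the parameters that generate it) is what nudges the picture toward the prompt.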

so CLIP can take either text or an image and represent those things in the same semantic space
which means that you can use the semantic content of images to steer the generation process exactly the same way as you do with text
so we consequently have two ways we can use images for the steering process
we can be very literal and old school with what is often called a "reconstruction loss"
which means we compare the image we generated with the steering image pixel-by-pixel, and try to nudge the generation to be close to the original picture at each pixel position
this is what pytti calls "direct" prompting or stabilization
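
A minimal sketch of that "direct" / reconstruction term, assuming a plain per-pixel MSE (pytti's exact formulation may differ):

```python
import torch.nn.functional as F

def direct_loss(generated, target):
    # generated, target: image tensors with the same shape and value range
    # compare them pixel-by-pixel; the gradient nudges each generated pixel
    # back toward its value in the steering image
    return F.mse_loss(generated, target)
```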
alternatively, we can use the image as a source of information content like we would with text
this is what pytti calls "semantic" prompting/stabilization

alright, so back to init images
we have direct_init_weight and semantic_init_weight
so when you give pytti an init image, you can tell it to use a reconstruction loss, or a semantic (CLIP) loss, or even both
or neither
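
Putting those pieces together, here is an illustrative sketch of how the two weights could scale their respective loss terms. The names and weighting scheme are assumptions for illustration, not pytti's actual internals; the CLIP usage again assumes OpenAI's `clip` package.

```python
import torch
import torch.nn.functional as F
import clip  # OpenAI's CLIP package; stand-in for pytti's perceptor wrapper

device = "cuda" if torch.cuda.is_available() else "cpu"
perceptor, preprocess = clip.load("ViT-B/32", device=device)

def clip_embed(image_batch):
    emb = perceptor.encode_image(image_batch).float()
    return emb / emb.norm(dim=-1, keepdim=True)

def init_losses(generated, init_image, direct_init_weight=0.0, semantic_init_weight=0.0):
    # generated / init_image: image tensors of the same shape, preprocessed for CLIP
    loss = torch.zeros((), device=generated.device)
    if direct_init_weight:
        # reconstruction term: holds the pixels near the init image in every frame
        loss = loss + direct_init_weight * F.mse_loss(generated, init_image)
    if semantic_init_weight:
        # CLIP term: keeps the *content* similar, with no positional information
        sim = (clip_embed(generated) * clip_embed(init_image)).sum(dim=-1)
        loss = loss + semantic_init_weight * (1 - sim).mean()
    return loss
```

With both weights at zero this contributes nothing, which is case (1) from the first comment: the init image only seeds the first frame.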
