[consult] some confusion about CustomVisionTransformer.forward_features #130
-
Thanks for your responses these days; I've learned a lot from your repo. When I read this part of the code, I can understand what the operation does at the API level, but I don't know why it is done or the theory behind it.
`x = torch.cat((cls_tokens, x), dim=1)` — `cls_tokens` is `(batch_size, 1, embed_dim)` and the returned `x` is `(batch_size, patch_nums + 1, embed_dim)`.
`pos_emb_ind = repeat(torch.arange(h)*(self.width//self.patch_size-w), 'h -> (h w)', w=w)+torch.arange(h*w)` — I worked through an example by hand: if you visualize the full positional grid as a table (one index per cell, each row `self.width//self.patch_size` cells wide), the expression selects the first `w` columns of the first `h` rows, puts those row chunks together, and finally squeezes them into a flat vector of indices.
-
Hi, I'll try to answer your question to your satisfaction.
It's short for classification token and basically marks the start of a sequence. I just adopted the system.
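For anyone else reading: a minimal sketch of what that concatenation does, with shapes I made up for illustration (not the repo's exact module code):

```python
import torch

# Assumed example shapes: 2 images, 16 patches each, 64-dim embeddings.
batch_size, patch_nums, embed_dim = 2, 16, 64
x = torch.randn(batch_size, patch_nums, embed_dim)            # patch embeddings

# A single learnable [CLS] token, shared across the batch; it marks
# the start of every sequence.
cls_token = torch.nn.Parameter(torch.zeros(1, 1, embed_dim))

cls_tokens = cls_token.expand(batch_size, -1, -1)  # one copy per sample
x = torch.cat((cls_tokens, x), dim=1)              # prepend along sequence dim
print(x.shape)  # torch.Size([2, 17, 64])
```

The sequence length grows from `patch_nums` to `patch_nums + 1`, matching the shapes discussed above.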
That's why I'm just using the smallest bounding box that encapsulates every relevant pixel. But this only creates a new problem: How…