[consult] some confusion about CustomVisionTransformer.forward_features #130
-
Thanks for your responses these days; I've learned a lot from your repo. When I read this part of the code, I can understand what the operation does at the API level, but I don't know why it is done or the theory behind it.
`x = torch.cat((cls_tokens, x), dim=1)` — `cls_tokens` is `(batch_size, 1, embed_dim)` and the returned `x` is `(batch_size, patch_nums + 1, embed_dim)`.
`pos_emb_ind = repeat(torch.arange(h)*(self.width//self.patch_size-w), 'h -> (h w)', w=w)+torch.arange(h*w)` — I worked through an example by hand: if you visualize the full positional grid as a table (one index per cell, each row `self.width//self.patch_size` cells wide), the expression selects the first `w` columns of the first `h` rows, puts those row chunks together, and finally squeezes them into a flat vector of indices.
-
Hi, I'll try to answer your question to your satisfaction.
It's short for classification token and basically marks the start of a sequence. I just adopted the system.
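For anyone else reading: a minimal sketch of what that concatenation does, with shapes I made up for illustration (not the repo's exact module code):

```python
import torch

# Assumed example shapes: 2 images, 16 patches each, 64-dim embeddings.
batch_size, patch_nums, embed_dim = 2, 16, 64
x = torch.randn(batch_size, patch_nums, embed_dim)            # patch embeddings

# A single learnable [CLS] token, shared across the batch; it marks
# the start of every sequence.
cls_token = torch.nn.Parameter(torch.zeros(1, 1, embed_dim))

cls_tokens = cls_token.expand(batch_size, -1, -1)  # one copy per sample
x = torch.cat((cls_tokens, x), dim=1)              # prepend along sequence dim
print(x.shape)  # torch.Size([2, 17, 64])
```

The sequence length grows from `patch_nums` to `patch_nums + 1`, matching the shapes discussed above.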
That's why I'm just using the smallest bounding box that encapsulates every relevant pixel. But this only creates a new problem: How…