How to Find Relationships Between Patches in Vision Transformer

   Jul 20, 2023     8 min read

This page collects papers related to ViT to help understand how it works.

Related project

Not yet

Related posts

About ViT (Vision Transformer)

Summary

(This might be wrong because I am organizing my own thoughts. If you find wrong information or have any ideas, please leave a comment below.)

  1. Pre-training data is used to determine the similarity between patches, but it is not essential. In NLP, pre-training data captures relationships between words that someone has learned in advance. In vision, there is likewise data about relationships between patches that can be learned in advance.
  2. Learning is possible even without pre-training data! According to the paper (DOSOVITSKIY, Alexey, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.), pre-trained data was used, but according to the Keras source, that is not strictly necessary.
  3. The similarity between words is not manually calculated by someone; it is obtained by CBOW and skip-gram in word2vec. Image data can likewise yield relationships between patches using the same principle.
  4. The reason for using patches is to reduce the amount of computation. It is possible to feed pixels in directly without patches, but the amount of computation is too large; see the Images to Patch Embeddings part and the sketch after this list.
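To make point 4 concrete, here is a rough back-of-the-envelope sketch in Python. The 224x224 image size and 16x16 patch size are the ViT-Base/16 settings from the paper; the quadratic attention-cost comparison is my own illustration.

# Rough arithmetic for point 4: why patches shrink the sequence length.
image_size = 224   # ViT-Base/16 input resolution
patch_size = 16

pixels = image_size * image_size           # 50176 tokens if every pixel were a token
patches = (image_size // patch_size) ** 2  # 196 tokens with 16x16 patches

# Self-attention cost grows with the square of the sequence length.
print(pixels, patches)          # 50176 196
print((pixels / patches) ** 2)  # ~65536x more attention work per layer with raw pixels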

Explanation

I recommend reading about how the embeddings work in the Vision Transformer from the paper. From that, you can learn how the embeddings in ViT work.

I will try to explain using the link above and my background knowledge. It may be wrong.

From a self-attention and embedding perspective, this could be the answer. First, I will start with the embedding. During the embedding process, the image patches are converted into vectors, and a word2vec-like idea is used in the process.

To use a word2vec-like embedding, the 3D input (x, y, channel) must be expressed as one vector. So we create patches of (16:x, 16:y, 3:channel). Since 16 × 16 × 3 = 768, each patch becomes a 768-dimensional embedding.
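As a minimal sketch of this flattening step (NumPy only, with a random array standing in for a real image):

import numpy as np

# Minimal sketch: cut a 224x224x3 image into 16x16x3 patches and flatten
# each patch into a single 768-dimensional vector (16 * 16 * 3 = 768).
image = np.random.rand(224, 224, 3)  # stand-in for a real RGB image
patch_size = 16

patches = []
for y in range(0, image.shape[0], patch_size):
    for x in range(0, image.shape[1], patch_size):
        patch = image[y:y + patch_size, x:x + patch_size, :]  # (16, 16, 3)
        patches.append(patch.reshape(-1))                     # flatten to (768,)

patches = np.stack(patches)
print(patches.shape)  # (196, 768): 196 patches, each a 768-dimensional vector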

With this embedding alone, the dot product has no meaning, so a trainable vector W is applied to it. W is obtained by training.

Then, in order to understand word2vec, you need to find out how it works. The main idea of word2vec is that words with similar distributions have similar meanings. The paper (MIKOLOV, Tomas, et al. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.) says that the values are found using CBOW and skip-gram.

Please see the image on Wikimedia Commons because of the license.

CBOW stands for continuous bag of words, which means predicting a given word from its surrounding words. If the size of the context window is c, we predict the given word from the c/2 words before it and the c/2 words after it. This model works by maximizing the conditional probability of the center word.

In the case of skip-gram, we predict multiple surrounding words from one word. The parameters are adjusted in a way that minimizes the loss function. For example, if there are 10 words and the hidden layer has 3 units, going from input to hidden is a [10, 3] matrix. After training these parameters, words whose row vectors are similar have similar meanings. If you compute the cosine similarity between the row vectors, you get the kind of graph that is commonly seen; the image below is a simplified one. (Red is for English and blue is for Estonian; the connected words mean the same thing. In the same way, you can also find similar words within a single language.)

Please see the first image in the Task paragraph because of the license.
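As a toy sketch of the row-vector similarity idea above (the [10, 3] matrix here is random, standing in for a trained input-to-hidden matrix):

import numpy as np

# Toy sketch: a made-up [10, 3] input-to-hidden matrix, where each row is
# the learned vector for one of the 10 words. In real word2vec this matrix
# comes out of CBOW or skip-gram training; here it is just random.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10, 3))

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Words whose row vectors have a high cosine similarity are treated as similar.
print(cosine_similarity(embeddings[0], embeddings[1]))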

In other words, the word2vec values are calculated through learning, not obtained by someone in advance.

That is the meaning of word2vec. Now let's consider ViT with the idea that CBOW and skip-gram find the similarity between words through the cosine similarity of row vectors, without prior knowledge. If so, once you think of patches as words, similarity can be measured without any prior human concept. So I think the fundamentals of word2vec are helpful in figuring out the correlation between patches in ViT.

What has been trained in advance using word2vec or the like is called a language model. When reading the paper, there is a concept that may be confusing: "someone has created similarities between words in advance." This most likely refers to a language model, not word2vec itself. It is said that learning with such a pre-trained language model can greatly reduce the cost of producing data.

The final or intermediate output of a pre-trained language model is called an embedding or representation. (I recommend reading about the transformer in NLP first and then reading these pages: pre-trained language model and transformer. They are Korean blogs, but you can use the translator in your browser.)

To summarize my thoughts: in NLP, language models are created with techniques like word2vec for efficiency, and those language models are used in transformers. So, let's think logically. The structure [word2vec-like technique] -> [language model] -> [transformer] involves the process of creating a language model. This has the advantage that language models can be reused, but the inconvenience of having to create them. If we have a lot of data and never plan to reuse a pre-trained model, I think we can apply the technique directly, as [word2vec-like technique] -> [transformer].

If so, let’s move on to ViT again.

The 3D input (x, y, channel) must be expressed as one vector, so we create patches of (16:x, 16:y, 3:channel). Since 16 × 16 × 3 = 768, we have a 768-dimensional embedding per patch.

Each patch thus has a 768-dimensional embedding, which is multiplied by the trained vector W. As mentioned above, we multiply by the trained vector so that the dot product becomes meaningful. (If it is not clear why multiplying by the trained vector matters for the dot product, you need to study the formula more.)

After the trained vector is multiplied, the output goes through position embedding for each patch, as shown in the image below, and then enters the encoder.

Please see the image on page 3 of the paper (DOSOVITSKIY, Alexey, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.); if you cannot access the paper, see the version on Wikimedia Commons because of the license.
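Before moving into the encoder, here is a minimal NumPy sketch of the embedding steps described above. The projection W, the [class] token, and the position embeddings are random placeholders here; in the real model they are learned.

import numpy as np

# Sketch: project each flattened patch, prepend a [class] token, and add
# position embeddings before the encoder. All weights here are random
# placeholders; in ViT they are trained.
num_patches, patch_dim, model_dim = 196, 768, 768

patches = np.random.rand(num_patches, patch_dim)               # flattened patches from before
W = np.random.rand(patch_dim, model_dim) * 0.02                # trainable projection
cls_token = np.random.rand(1, model_dim) * 0.02                # trainable [class] token
pos_embed = np.random.rand(num_patches + 1, model_dim) * 0.02  # trainable position embeddings

tokens = patches @ W                          # (196, 768) patch embeddings
tokens = np.concatenate([cls_token, tokens])  # (197, 768) with the [class] token
tokens = tokens + pos_embed                   # add position information
print(tokens.shape)                           # (197, 768) -> encoder input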

When we get to the encoder part, there is multi-head attention. One piece of advice I received from someone is that understanding self-attention will help us understand what we are curious about.

Self-attention is the multi-head attention in the picture. It is the core of the transformer (it is also said to be important for the NLP transformer). Self-attention takes three inputs: the query, key, and value vectors. Here, these Q, K, and V vectors come from the combined patch and position encoding data obtained earlier. It works as shown in the picture. First of all, think of it as an NLP transformer. The self-attention calculation is completed by multiplying the softmax probabilities computed from the query and key with the value vectors, using the pre-trained data. In NLP, when pre-trained data is used, a language model trained in advance with a technique such as word2vec is used, because it is more efficient. In the case of GPT-3, pre-training is performed by predicting the next word. Self-attention is then done with that pre-trained data.
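A minimal single-head sketch of that softmax(QK^T / sqrt(d)) V calculation over the (197, 768) token matrix from before (the projection matrices here are random placeholders; in the transformer they are learned):

import numpy as np

# Minimal single-head self-attention sketch: softmax(Q K^T / sqrt(d)) V.
# W_q, W_k, W_v are random placeholders; in the transformer they are trained.
def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

tokens = np.random.rand(197, 768)  # patch + position embeddings from before
d = 64                             # per-head dimension
W_q, W_k, W_v = (np.random.rand(768, d) * 0.02 for _ in range(3))

Q, K, V = tokens @ W_q, tokens @ W_k, tokens @ W_v
attn = softmax(Q @ K.T / np.sqrt(d))  # (197, 197) patch-to-patch weights
out = attn @ V                        # (197, 64) attended values
print(attn.shape, out.shape)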

So the key question for ViT, which we were curious about, is what kind of data is used and how the pre-training is done.

Remember the CIFAR-10 dataset? There is example code that uses a pre-trained vision transformer together with CIFAR-10.

Import the pretrained model…

# import pretrained model
from transformers import ViTForImageClassification
model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224')

and extract features with the pretrained model

# feature extractor part
from transformers import ViTFeatureExtractor
feature_extractor = ViTFeatureExtractor.from_pretrained('google/vit-base-patch16-224-in21k')
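As a sketch of how these pieces can be combined for a single prediction (the image path 'example.jpg' is a placeholder, and I use the 'google/vit-base-patch16-224' checkpoint for both parts so that the feature extractor and the classifier match):

from PIL import Image
from transformers import ViTFeatureExtractor, ViTForImageClassification

# Sketch: classify one image with a pretrained ViT checkpoint.
feature_extractor = ViTFeatureExtractor.from_pretrained('google/vit-base-patch16-224')
model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224')

image = Image.open('example.jpg').convert('RGB')              # placeholder path
inputs = feature_extractor(images=image, return_tensors='pt')
logits = model(**inputs).logits                               # (1, 1000) class scores
print(model.config.id2label[logits.argmax(-1).item()])        # predicted label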

Of course, learning can be done without pre-trained data.

The quote given below is taken from the bottom of the Keras blog.

Note that the state of the art results reported in the paper are achieved by pre-training the ViT model using the JFT-300M dataset, then fine-tuning it on the target dataset. To improve the model quality without pre-training, you can try to train the model for more epochs, use a larger number of Transformer layers, resize the input images, change the patch size, or increase the projection dimensions. Besides, as mentioned in the paper, the quality of the model is affected not only by architecture choices, but also by parameters such as the learning rate schedule, optimizer, weight decay, etc. In practice, it's recommended to fine-tune a ViT model that was pre-trained using a large, high-resolution dataset.

According to the original vision transformer paper (DOSOVITSKIY, Alexey, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.), pre-training data was used, but training without it is possible, as the Keras example shows. The key sentence is: "To improve the model quality without pre-training, you can try to train the model for more epochs, use a larger number of Transformer layers, resize the input images, change the patch size, or increase the projection dimensions."
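For completeness, here is a sketch of what "without pre-trained data" can look like with the same transformers library: build a randomly initialized ViT from a config instead of loading a checkpoint. The config values (32x32 images, 4x4 patches, 10 labels) are my own illustrative choices for CIFAR-10.

from transformers import ViTConfig, ViTForImageClassification

# Sketch: a randomly initialized (not pre-trained) ViT for CIFAR-10.
# Training it well from scratch still needs the extra epochs / architecture
# tuning mentioned in the Keras quote above.
config = ViTConfig(
    image_size=32,  # CIFAR-10 images are 32x32
    patch_size=4,   # (32 / 4)^2 = 64 patches per image
    num_labels=10,  # CIFAR-10 classes
)
model = ViTForImageClassification(config)  # random weights, no pre-training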

Which papers do I recommend reading? (written in ISO 690 format)

The concept paper of ViT

  • DOSOVITSKIY, Alexey, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

Papers that help understand how ViT finds relationships between patches

About word2vec
  • MIKOLOV, Tomas, et al. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
About visual attention
  • MNIH, Volodymyr, et al. Recurrent models of visual attention. Advances in neural information processing systems, 2014, 27.

See example source code