About ViT (Vision Transformer)

This page collects papers related to ViT to help readers understand it.

Not yet

What is ViT?

ViT stands for Vision Transformer, also known as Visual Transformer. See the Wikipedia article for an overview.

Which papers are recommended reading? (Citations are written in ISO 690 format.)

First concept paper

This paper applies the transformer used in NLP (natural language processing) to computer vision without using a CNN.

DOSOVITSKIY, Alexey, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
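The "16x16 words" in the title refers to cutting an image into fixed-size patches that play the role of word tokens. The following is a minimal sketch of that patch-splitting step only, assuming a toy 32x32 grayscale image stored as nested lists; the function name `split_into_patches` is hypothetical, and the later steps in the paper (linear projection of each flattened patch, position embeddings) are not shown.

```python
def split_into_patches(image, patch_size=16):
    """Split a 2D image (list of rows) into flattened, non-overlapping patches.

    Patches are returned in row-major order (left to right, top to bottom).
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch_size x patch_size block into a single list.
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            patches.append(patch)
    return patches


# Toy image (an assumption for illustration): pixel value is row * 32 + col.
image = [[r * 32 + c for c in range(32)] for r in range(32)]
patches = split_into_patches(image, patch_size=16)
print(len(patches), len(patches[0]))  # 4 patches, 256 values each
```

Each flattened patch then acts like one "word" in the input sequence of the transformer.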

Vision-language tasks such as image/video captioning and question answering.

GIT is a generative image-to-text transformer. The encoder takes an image as input. The encoder's output is combined with the text, and the combination is fed to the decoder as its input. For question answering, the text is the question and the decoder outputs the text of the answer.

WANG, Jianfeng, et al. Git: A generative image-to-text transformer for vision and language. arXiv preprint arXiv:2205.14100, 2022.
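The way the decoder input is assembled can be sketched as a simple concatenation. This is only an illustration under stated assumptions: the function name `build_decoder_input`, the toy token values, and the ordering (image tokens first, then text tokens) are chosen here for the example and are not taken from the paper's code.

```python
def build_decoder_input(image_tokens, text_tokens):
    """Concatenate image tokens and text tokens into one decoder input sequence."""
    dims = {len(tok) for tok in image_tokens + text_tokens}
    if len(dims) != 1:
        raise ValueError("all tokens must share the same embedding size")
    # Assumed ordering: image tokens come first, text tokens follow.
    return image_tokens + text_tokens


# Hypothetical toy values: 3 image tokens from the encoder and
# 2 question tokens, each with embedding size 4.
image_tokens = [[0.1] * 4, [0.2] * 4, [0.3] * 4]
question_tokens = [[1.0] * 4, [2.0] * 4]
decoder_input = build_decoder_input(image_tokens, question_tokens)
print(len(decoder_input))  # 5 tokens: 3 image + 2 text
```

The decoder would then generate the answer text token by token, conditioned on this combined sequence.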

Review paper

More background is needed to understand this paper.

LI, Xiangtai, et al. Transformer-based visual segmentation: A survey. arXiv preprint arXiv:2304.09854, 2023.

How to find relationships between patches in a transformer?

Learning about the transformer (NLP), attention, and word2vec will help answer this question.

How to Find Relationships Between Patches in Vision Transformer
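As a starting point for the question above, attention scores each pair of patches by the dot product of their vectors and turns the scores into weights with a softmax. The following is a minimal sketch of scaled dot-product attention weights in plain Python, assuming toy 2-dimensional patch embeddings and using the embeddings directly as queries and keys; a real transformer applies learned projections first, which are omitted here.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys):
    """Return softmax(q . k / sqrt(d)) weights, one row per query."""
    d = len(keys[0])
    weights = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights.append(softmax(scores))
    return weights


# Toy patch embeddings (assumed values): patches 0 and 1 are similar,
# patch 2 points in a different direction.
patch_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
weights = attention_weights(patch_embeddings, patch_embeddings)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row shows how strongly one patch attends to every patch, so similar patches (0 and 1 here) receive larger weights for each other than for the dissimilar patch 2.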