TinyViT: Fast Pretraining Distillation for Small Vision Transformers

📌 This is an official PyTorch implementation of [ECCV 2022] - TinyViT: Fast Pretraining Distillation for Small Vision Transformers.

TinyViT is a new family of tiny and efficient vision transformers pretrained on large-scale datasets with our proposed fast distillation framework. The central idea is to transfer knowledge from large pretrained models to small ones. The logits of the large teacher models are sparsified and stored on disk in advance, which saves memory and avoids the computational overhead of running the teacher during student training.
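The sparsification step can be illustrated with a small sketch. It assumes top-k sparsification of the teacher's softmax output, which is one plausible reading of "sparsified"; the function name `sparsify_logits` and the choice of `k` are illustrative and not taken from this repository. It uses plain Python so that it runs without PyTorch.

```python
import math


def sparsify_logits(logits, k):
    """Keep the k largest softmax probabilities and their class indices.

    Hypothetical helper for illustration; not the repository's API.
    """
    # Numerically stable softmax over the full class dimension.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sort class indices by probability (descending) and keep the first k.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [(i, probs[i]) for i in top]


teacher_logits = [2.0, -1.0, 0.5, 4.0, 0.0]
sparse = sparsify_logits(teacher_logits, k=2)
```

Only `k` (class index, probability) pairs are kept per sample instead of one value per class, which is where the storage saving would come from under this assumption.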

🚀 TinyViT with only 21M parameters achieves 84.8% top-1 accuracy on ImageNet-1k, and 86.5% accuracy under 512x512 resolutions.

☀️ Hiring research interns for neural architecture search, tiny transformer design, model compression projects: houwen.peng@microsoft.com.

Highlights

TinyViT-21M on IN-22k achieves 84.8% top-1 accuracy on IN-1k, and 86.5% accuracy under 512x512 resolutions.
TinyViT-21M trained from scratch on IN-1k without distillation achieves 83.1% top-1 accuracy, at 4.3 GFLOPs and a throughput of 1,571 images/s on a V100 GPU.
TinyViT-5M reaches 80.7% top-1 accuracy on IN-1k under 3,060 images/s throughput.
Save the teacher logits once, then reuse the saved sparse logits to distill arbitrary students without the overhead of running the teacher model. The saved logits take 16 GB of storage for IN-1k (300 epochs) and 481 GB for IN-22k (90 epochs).
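As a rough sanity check on the IN-1k storage figure, the sketch below estimates the size of saved sparse logits. The per-sample layout (10 kept entries, a 2-byte value and a 2-byte index each, plus a 4-byte augmentation seed) is an assumption made for illustration and is not stated above; 1,281,167 is the size of the standard IN-1k training set. Logits are counted once per image per epoch, matching how the figures above are quoted per epoch count.

```python
def sparse_logit_storage_bytes(num_images, epochs, k,
                               value_bytes=2, index_bytes=2, extra_bytes=4):
    """Estimate bytes needed to store top-k sparse logits per image per epoch.

    All byte sizes are illustrative assumptions, not the repository's format.
    """
    per_sample = k * (value_bytes + index_bytes) + extra_bytes
    return num_images * epochs * per_sample


in1k_bytes = sparse_logit_storage_bytes(num_images=1_281_167, epochs=300, k=10)
in1k_gb = in1k_bytes / 1e9
print(f"{in1k_gb:.1f} GB")
```

Under these assumptions the estimate comes out at roughly 17 GB, in the same range as the 16 GB quoted above.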

Features

Efficient Distillation. The teacher logits can be saved in parallel and reused for arbitrary student models, to avoid re-forwarding cost of the large teacher model.
Reproducibility. We provide the hyper-parameters for IN-1k training, IN-22k pre-training with distillation, IN-22k-to-1k fine-tuning, and higher-resolution fine-tuning. In addition, all training logs are public (see Model Zoo).

Ease of Use. A single file is enough to build a TinyViT model: models/tiny_vit.py defines the whole TinyViT model family.

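import torch
from tiny_vit import tiny_vit_21m_224  # defined in models/tiny_vit.py
model = tiny_vit_21m_224(pretrained=True).eval()  # switch to inference mode
image = torch.randn(1, 3, 224, 224)  # placeholder batch; use a preprocessed image in practice
output = model(image)  # class logits for the batch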

Extensibility. Add custom datasets, student models, and teacher models without modifying your code. The class DatasetWrapper wraps a general dataset to support saving and loading sparse logits. Knowledge distillation only needs the logits of the models.
Public teacher model. We provide CLIP-ViT-Large/14-22k, a powerful teacher model for pretraining distillation (Acc@1 85.894, Acc@5 97.566 on IN-1k). We finetuned the CLIP-ViT-Large/14 model released by OpenAI on IN-22k.
Online Logging. Supports wandb for checking results anytime, anywhere.
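To make the wrapping idea concrete, here is a minimal sketch of a dataset wrapper that attaches saved sparse logits to each sample. The class name `SparseLogitDataset`, the in-memory dictionary store, and the `(epoch, index)` key are all hypothetical; the repository's DatasetWrapper may be organized differently.

```python
class SparseLogitDataset:
    """Wrap any indexable dataset and pair each sample with saved sparse logits.

    Hypothetical sketch: a dict stands in for on-disk storage.
    """

    def __init__(self, dataset, store=None):
        self.dataset = dataset
        self.store = {} if store is None else store
        self.epoch = 0

    def set_epoch(self, epoch):
        self.epoch = epoch

    def save(self, index, sparse_logits):
        # Called while running the teacher: record its output for this sample.
        self.store[(self.epoch, index)] = sparse_logits

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        # Called while training the student: no teacher forward pass is needed.
        sample, label = self.dataset[index]
        return sample, label, self.store[(self.epoch, index)]


base = [("img0", 3), ("img1", 0)]        # stand-in for (image, label) pairs
wrapped = SparseLogitDataset(base)
wrapped.save(0, [(3, 0.9), (0, 0.05)])   # (class index, probability) pairs
wrapped.save(1, [(0, 0.8), (2, 0.1)])
sample, label, soft = wrapped[0]
```

Because the wrapper only relies on `__len__` and `__getitem__`, any dataset exposing those two methods could be plugged in unchanged, which is the kind of extensibility described above.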

Model Zoo

Model	Pretrain	Input	Acc@1	Acc@5	#Params	MACs	FPS	22k Model	1k Model
TinyViT-5M	IN-22k	224x224	80.7	95.6	5.4M	1.3G	3,060	link/config/log	link/config/log
TinyViT-11M	IN-22k	224x224	83.2	96.5	11M	2.0G	2,468	link/config/log	link/config/log
TinyViT-21M	IN-22k	224x224	84.8	97.3	21M	4.3G	1,571	link/config/log	link/config/log
TinyViT-21M-384	IN-22k	384x384	86.2	97.8	21M	13.8G	394	-	link/config/log
TinyViT-21M-512	IN-22k	512x512	86.5	97.9	21M	27.0G	167	-	link/config/log
TinyViT-5M	IN-1k	224x224	79.1	94.8	5.4M	1.3G	3,060	-	link/config/log
TinyViT-11M	IN-1k	224x224	81.5	95.8	11M	2.0G	2,468	-	link/config/log
TinyViT-21M	IN-1k	224x224	83.1	96.5	21M	4.3G	1,571	-	link/config/log

ImageNet-22k (IN-22k) is the same as ImageNet-21k (IN-21k), where the number of classes is 21,841.

The models listed with IN-22k in the Pretrain column are pretrained on ImageNet-22k with distillation from CLIP-ViT-L/14-22k, then finetuned on ImageNet-1k.

We finetune the 1k models on IN-1k to higher resolution progressively (224 -> 384 -> 512) [detail], without any IN-1k knowledge distillation.

Getting Started

🔰 Here are the setup tutorial and evaluation scripts.

Install dependencies and prepare datasets

Preparation

Evaluate it!

Evaluation

Pretrain a TinyViT model on ImageNet

🔰 The proposed fast pretraining distillation has two steps: first save the teacher's sparse logits, then pretrain a model with them.
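The second step can be sketched as follows: the student is trained against soft labels rebuilt from the saved sparse logits. This sketch assumes the saved data are top-k (class index, probability) pairs and that the leftover probability mass is spread uniformly over the remaining classes. The function names and that smoothing choice are illustrative assumptions, and the code is plain Python rather than PyTorch.

```python
import math


def densify(sparse, num_classes):
    """Rebuild a full probability vector from (index, prob) pairs.

    Assumption: probability mass not covered by the kept entries is
    shared equally among the classes that were dropped.
    """
    kept = sum(p for _, p in sparse)
    dropped = num_classes - len(sparse)
    rest = (1.0 - kept) / dropped if dropped > 0 else 0.0
    dense = [rest] * num_classes
    for i, p in sparse:
        dense[i] = p
    return dense


def soft_cross_entropy(student_logits, soft_target):
    """Cross-entropy between a soft target and the student's softmax output."""
    peak = max(student_logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in student_logits))
    # log-softmax of class c is (logit_c - log_z).
    return -sum(t * (x - log_z) for t, x in zip(soft_target, student_logits))


target = densify([(3, 0.9), (0, 0.05)], num_classes=5)
loss = soft_cross_entropy([1.0, 0.0, 0.0, 3.0, 0.0], target)
```

The teacher never appears in this step; only the stored pairs and the student's own logits are needed to compute the loss.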

Citation

If this repo is helpful to you, please consider citing it. 📣 Thank you! :)

@InProceedings{tiny_vit,
  title={TinyViT: Fast Pretraining Distillation for Small Vision Transformers},
  author={Wu, Kan and Zhang, Jinnian and Peng, Houwen and Liu, Mengchen and Xiao, Bin and Fu, Jianlong and Yuan, Lu},
  booktitle={European conference on computer vision (ECCV)},
  year={2022}
}

Acknowledgements

Our code is based on Swin Transformer, LeViT, pytorch-image-models, CLIP and PyTorch. Thanks to the contributors for their awesome contributions!

License

See the LICENSE file in this repository.
