



This is the official code release of VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text, presented at NeurIPS 2021.

This codebase has the following minimum framework requirements: Python 3.8, CUDA 10.1, NVIDIA Driver v440.100, cuDNN 7.6.5.

Make sure to install the following libraries by running pip install -r requirements.txt.

- tensorflow==2.7.0
- tensorflow_addons==0.15.0
- tensorflow_probability==0.15.0
- tensorflow_text==2.7.0
- keras==2.7.0
- scikit-image
- scikit-learn
- scipy
- six
- numpy
- yaml
- dmvr
- absl

Data

The data pipeline in this code is based on DMVR which supports TF Example and TF SequenceExample. The data loader assumes your dataset is stored as TF records similar to this example.

Make sure to put the correct constructor in vatt/data/datasets before launching the main script. For reference, there is a toy example in vatt/data/datasets/toy_dataset.py.

Embeddings and Vocabulary

Depending on the configuration, pre-trained text embeddings and vocabulary may be required. Download this file and unzip it under vatt/.

Pre-training

Assuming all datasets are stored and the data loader is working, pre-training can be launched with:

python -m vatt.main --task=pretrain --mode=train --model_dir=PATH/TO/RUN --model_arch=tx_fac --strategy_type=mirrored

--mode=train starts self-supervised training and --mode=eval starts exhaustive evaluation.

The evaluation pipeline constantly loops through the model_dir path looking for new checkpoints. This means that evaluation pipelines can be launched separately to benefit from continuous evaluation during pre-training.

Alternatively, you can set --override_checkpoint=PATH/TO/CHECKPOINT to base the evaluation on a specific checkpoint.
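The two evaluation modes described above (polling model_dir for new checkpoints, or pinning a single checkpoint) can be pictured with a small sketch. This is not the repository's evaluation loop. The `ckpt-<step>.index` file naming is an assumption based on the usual TensorFlow checkpoint layout, and `find_new_checkpoints` and `checkpoints_to_evaluate` are hypothetical helpers introduced only for illustration.

```python
import re
from pathlib import Path
from typing import List, Optional, Set

# Assumption: checkpoints appear as "ckpt-<step>.index" files in model_dir.
_CKPT_INDEX = re.compile(r"^ckpt-(\d+)\.index$")


def find_new_checkpoints(model_dir: str, seen: Set[str]) -> List[str]:
    """Return checkpoint prefixes in model_dir that are not in `seen`, lowest step first."""
    found = []
    for path in Path(model_dir).glob("*.index"):
        match = _CKPT_INDEX.match(path.name)
        if not match:
            continue
        prefix = str(path.with_suffix(""))  # drop the ".index" suffix
        if prefix not in seen:
            found.append((int(match.group(1)), prefix))
    return [prefix for _, prefix in sorted(found)]


def checkpoints_to_evaluate(
    model_dir: str,
    seen: Set[str],
    override_checkpoint: Optional[str] = None,
) -> List[str]:
    """Pick a pinned checkpoint if one is given, otherwise whatever is new in model_dir."""
    if override_checkpoint is not None:
        return [override_checkpoint]
    new = find_new_checkpoints(model_dir, seen)
    seen.update(new)  # remember them so the next poll skips them
    return new
```

A continuous evaluator would call `checkpoints_to_evaluate` in a loop with a sleep between polls, which is why a separate evaluation job can run alongside training against the same model_dir.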

If you are using TPUs, you can set --strategy_type=tpu --tpu=ADDRESS/OF/TPU.

Options for model_arch are:

- tx_fac: modality-specific VATT
- ut_fac: modality-independent VATT
- mmv_fac: CNN-based counterpart, as in MMV

Fine-tuning

Once a model is pre-trained, its vision or audio Transformer can be fine-tuned on a classification dataset.

Assuming all datasets are stored and the data loader is working, fine-tuning can be launched with:

python -m vatt.main --task=finetune --mode=train --model_dir=PATH/TO/RUN --model_arch=vit_medium --strategy_type=mirrored

Similarly, mode can take either train or eval, allowing continuous evaluation by running evaluation pipelines in parallel.

Options for model_arch are:

- vit_base: Vision Transformer with the Base configuration
- vit_medium: Vision Transformer with the Medium configuration
- vit_large: Vision Transformer with the Large configuration
- wat_base: Waveform Transformer with the Base configuration
- wat_medium: Waveform Transformer with the Medium configuration
- spt_base: Spectrogram Transformer with the Base configuration
- spt_medium: Spectrogram Transformer with the Medium configuration
- i3d: video model based on the I3D architecture
- resnet2d_50: audio model based on the ResNet-2D architecture (spectrogram only)

For any setting, make sure you have the correct configuration for data and optimization in vatt/configs.
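Because the pre-training and fine-tuning tasks accept different model_arch values, a thin wrapper that checks the combination before launching can catch typos early. The following is a minimal sketch, not part of the codebase: `build_command` is a hypothetical helper, the architecture names are copied from the option lists in this README and matched exactly as written, and the strategy_type value is passed through unchecked.

```python
import shlex
from typing import List, Optional

# Architecture names as listed in this README, grouped by task.
ARCHS_BY_TASK = {
    "pretrain": {"tx_fac", "ut_fac", "mmv_fac"},
    "finetune": {
        "vit_base", "vit_medium", "vit_large",
        "wat_base", "wat_medium",
        "spt_base", "spt_medium",
        "i3d", "resnet2d_50",
    },
}


def build_command(
    task: str,
    mode: str,
    model_dir: str,
    model_arch: str,
    strategy_type: str,
    tpu: Optional[str] = None,
) -> List[str]:
    """Build the argument list for `python -m vatt.main` after basic validation."""
    if task not in ARCHS_BY_TASK:
        raise ValueError(f"unknown task: {task!r}")
    if mode not in ("train", "eval"):
        raise ValueError(f"unknown mode: {mode!r}")
    if model_arch not in ARCHS_BY_TASK[task]:
        raise ValueError(f"model_arch {model_arch!r} is not listed for task {task!r}")
    cmd = [
        "python", "-m", "vatt.main",
        f"--task={task}",
        f"--mode={mode}",
        f"--model_dir={model_dir}",
        f"--model_arch={model_arch}",
        f"--strategy_type={strategy_type}",
    ]
    if tpu is not None:
        cmd.append(f"--tpu={tpu}")
    return cmd


if __name__ == "__main__":
    # Print a shell-ready command; pass the list to subprocess.run to launch it.
    print(shlex.join(build_command(
        "finetune", "train", "PATH/TO/RUN", "vit_medium",
        strategy_type="tpu", tpu="ADDRESS/OF/TPU",
    )))
```

The helper only assembles flags that this README mentions. Data and optimization settings still come from the files in vatt/configs.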

Checkpoints

Pre-trained checkpoints

| Backbone | Model Size (Video-Audio-Text) | Checkpoint |
| --- | --- | --- |
| Modality-specific | Base-Base-Small | data, index |
| Modality-specific | Medium-Base-Small | data, index |
| Modality-specific | Large-Base-Small | data, index |
| Modality-independent | Medium (single backbone) | data, index |

Fine-tuned for video action recognition

| Dataset | Model Type | Pre-trained Checkpoint | Top-1 | Top-5 | Checkpoint |
| --- | --- | --- | --- | --- | --- |
| Kinetics-400 | ViT Base | Base-Base-Small | 79.6 | 94.9 | data, index |
| Kinetics-400 | ViT Medium | Medium-Base-Small | 81.1 | 95.6 | data, index |
| Kinetics-400 | ViT Large | Large-Base-Small | 82.1 | 95.5 | data, index |
| Kinetics-400 | ViT Medium | Medium (single backbone) | 79.9 | 94.9 | data, index |
| Kinetics-600 | ViT Base | Base-Base-Small | 80.5 | 95.5 | data, index |
| Kinetics-600 | ViT Medium | Medium-Base-Small | 82.4 | 96.1 | data, index |
| Kinetics-600 | ViT Large | Large-Base-Small | 83.6 | 96.6 | data, index |
| Kinetics-600 | ViT Medium | Medium (single backbone) | 80.8 | 95.5 | data, index |
| Kinetics-700 | ViT Base | Base-Base-Small | - | - | TBD |
| Kinetics-700 | ViT Medium | Medium-Base-Small | - | - | TBD |
| Kinetics-700 | ViT Large | Large-Base-Small | 72.7 | 90.5 | data, index |
| Kinetics-700 | ViT Medium | Medium (single backbone) | - | - | TBD |
| Moments-in-Time | ViT Base | Base-Base-Small | 38.7 | 67.5 | data, index |
| Moments-in-Time | ViT Medium | Medium-Base-Small | 39.5 | 68.2 | data, index |
| Moments-in-Time | ViT Large | Large-Base-Small | 41.1 | 67.7 | data, index |
| Moments-in-Time | ViT Medium | Medium (single backbone) | 37.8 | 65.9 | data, index |

Fine-tuned for audio event classification

| Dataset | Model Type | Pre-trained Checkpoint | mAP | AUC | d-prime | Checkpoint |
| --- | --- | --- | --- | --- | --- | --- |
| AudioSet | WaT Base | Base-Base-Small | 39.4 | 97.1 | 2.895 | data, index |
| AudioSet | WaT Medium | Medium (single backbone) | 39.3 | 97.0 | 2.884 | data, index |

Citation

Akbari, Hassan and Yuan, Liangzhe and Qian, Rui and Chuang, Wei-Hong and Chang, Shih-Fu and Cui, Yin and Gong, Boqing. VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text. arXiv preprint arXiv:2104.11178, 2021.

Correspondence and Maintenance

We appreciate your feedback. If you find any issues, please contact us.

