
<div align="center"> <h1>Denoising Vision Transformers</h1>

Jiawei Yang<sup>1†*</sup> · Katie Z Luo<sup>2*</sup> · Jiefeng Li<sup>3</sup> · Congyue Deng<sup>4</sup> <br> Leonidas Guibas<sup>4</sup> · Dilip Krishnan<sup>5</sup> · Kilian Q. Weinberger<sup>2</sup><br> Yonglong Tian<sup>5</sup> · Yue Wang<sup>1</sup>

<sup>1</sup>University of Southern California   <sup>2</sup>Cornell University <br> <sup>3</sup>Shanghai Jiaotong University   <sup>4</sup>Stanford University <br> <sup>5</sup>Google Research <br> †project lead *equal technical contribution

<a href="https://arxiv.org/abs/2401.02957"><img src='https://img.shields.io/badge/arXiv-DVT-red' alt='Paper PDF'></a> <a href='https://jiawei-yang.github.io/DenoisingViT/'><img src='https://img.shields.io/badge/Project_Page-DVT-blue' alt='Project Page'></a> <a href='https://huggingface.co/jjiaweiyang/DVT'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Pretrained Model-green'></a>

</div> <div align="center"> <h2 style="color:#FF6347;"><strong>📢 ECCV 2024 Oral 📢</strong></h2> </div>

This work presents Denoising Vision Transformers (DVT), which removes the visually distracting artifacts commonly seen in ViT feature maps and improves downstream performance on dense prediction tasks.

(Teaser figure)

News

Citation

@inproceedings{yang2024dvt,
  author = {Yang, Jiawei and Luo, Katie Z and Li, Jiefeng and Deng, Congyue and Guibas, Leonidas J. and Krishnan, Dilip and Weinberger, Kilian Q and Tian, Yonglong and Wang, Yue},
  title = {DVT: Denoising Vision Transformers},
  booktitle = {ECCV},
  year = {2024},
}

This README file and codebase are legacy. We will update them soon.

Usage

Environment Setup

Per-Image Denoising and Denoiser Training

git clone https://github.com/Jiawei-Yang/Denoising-ViT.git
cd Denoising-ViT
conda create -n dvt python=3.10 -y
conda activate dvt
pip install -r requirements.txt

# Install `tiny-cuda-nn` manually:
pip install git+https://github.com/NVlabs/tiny-cuda-nn/#subdirectory=bindings/torch

If you want a single conda environment for different GPU architectures, install tiny-cuda-nn with a pre-defined architecture list:

# 7.0 for V100, 8.0 for A100, 8.6 for A40 or A6000
TORCH_CUDA_ARCH_LIST="7.0 8.0 8.6" pip install git+https://github.com/NVlabs/tiny-cuda-nn/#subdirectory=bindings/torch

If you encounter the error `parameter packs not expanded with '...'`, refer to this solution on GitHub.

Evaluation Environment

This section explains how to evaluate the denoised features on downstream tasks. We use mmsegmentation for dense prediction evaluations on the VOC, ADE20K, and NYU-Depth datasets. If you don't plan to evaluate on these tasks, you can skip this part.

Please note that mmsegmentation has dependencies that may conflict with those in the main environment. To avoid this, we use a separate environment and temporarily downgrade to CUDA 11.7 (with a matching PyTorch build) for installation.

conda create -n dvt_eval python=3.10 -y
conda activate dvt_eval

# Install CUDA 11.7 or soft link CUDA 11.7 to /usr/local/cuda-11.7
CUDA_VERSION=11.7
export PATH=/usr/local/cuda-${CUDA_VERSION}/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-${CUDA_VERSION}/lib64:$LD_LIBRARY_PATH
export CUDA_HOME=/usr/local/cuda-${CUDA_VERSION}

pip install -r requirements_eval.txt

# Fully uninstall any existing mmcv builds
pip uninstall mmcv-full -y && pip uninstall mmcv -y && pip cache purge

# Force a CUDA build of mmcv-full
MMCV_WITH_OPS=1 FORCE_CUDA=1 pip install mmcv-full==1.5.0 mmsegmentation==0.27.0

Pre-trained Models and Video Generation

Please refer to Hugging Face for the pre-trained models.
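For reference, here is a minimal sketch of pulling a checkpoint from that Hugging Face repository with `huggingface_hub`; the checkpoint filename below is a placeholder, so check the model card for the actual file names and the intended loading code.

```python
# Sketch only: download a DVT checkpoint from the Hugging Face release and
# inspect it. The filename is a placeholder -- see the model card at
# https://huggingface.co/jjiaweiyang/DVT for the real file names.
import torch
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="jjiaweiyang/DVT",
    filename="denoiser_checkpoint.pth",  # placeholder name
)
checkpoint = torch.load(ckpt_path, map_location="cpu")
# Assuming a state-dict style checkpoint, list a few parameter names:
print(list(checkpoint)[:10])
```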

To generate demo videos similar to those on our website, simply run `python make_video_demo.py`.

Data preparation

Our data folder should look like this:

data
├── ADEChallengeData2016
├── nyu
├── VOCdevkit
├── imagenet
└── voc_train.txt
  1. PASCAL-VOC 2007 and 2012: Please download the PASCAL VOC07 and PASCAL VOC12 datasets (link) and put the data in the data folder as shown above.

In our experiments reported in the paper, we used the first 10,000 examples from data/voc_train.txt for stage-1 denoising. This text file was generated by gathering all JPG images from data/VOC2007/JPEGImages and data/VOC2012/JPEGImages, excluding the validation images (see the sketch after this list).

  2. ADE20K: Please download the ADE20K dataset and put the data in data/ADEChallengeData2016.

  3. NYU-D: Please download the NYU-Depth dataset and put the data in data/nyu. Results are reported using the 2014 annotations, following previous works.

  4. ImageNet (Optional): Please download the ImageNet dataset and put the data in data/imagenet.
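For reference, a minimal sketch of how a `voc_train.txt`-style file list could be regenerated is shown below. The validation-split file locations are assumptions, and the paths may need a `VOCdevkit/` prefix depending on how you extracted the archives; treat this as illustrative rather than the exact script that produced the released file.

```python
# Illustrative sketch: build a voc_train.txt-style list of VOC JPEG images,
# excluding validation images. Adjust the paths (e.g. data/VOCdevkit/VOC2007)
# and the val-split file locations to match your layout.
from pathlib import Path

data_root = Path("data")
jpeg_dirs = [data_root / "VOC2007" / "JPEGImages", data_root / "VOC2012" / "JPEGImages"]
val_lists = [  # assumed locations of the validation splits
    data_root / "VOC2007" / "ImageSets" / "Segmentation" / "val.txt",
    data_root / "VOC2012" / "ImageSets" / "Segmentation" / "val.txt",
]

val_ids = set()
for val_list in val_lists:
    if val_list.exists():
        val_ids.update(line.strip() for line in val_list.read_text().splitlines() if line.strip())

train_paths = sorted(str(p) for d in jpeg_dirs for p in d.glob("*.jpg") if p.stem not in val_ids)
(data_root / "voc_train.txt").write_text("\n".join(train_paths) + "\n")
print(f"wrote {len(train_paths)} image paths")
```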

Run the code

See `sample_scripts/` for examples of running the code.

We provide some demo outputs in `demo/demo_outputs`. For example, our denoising results on a cat image are shown there. From left to right, the panels show:

1. the input crop
2. the raw DINOv2-base output
3. K-means clustering of the raw output
4. the L2 feature norm of the raw output
5. the similarity between the central patch and the other patches in the raw output
6. our denoised output
7. K-means clustering of the denoised output
8. the L2 feature norm of the denoised output
9. the similarity between the central patch and the other patches in the denoised output
10. the decomposed shared artifacts
11. the L2 norm of the shared artifacts
12. the ground-truth residual error
13. the predicted residual term
14. the composition of the shared artifacts and the predicted residual term
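For reference, a minimal sketch of how panels such as the K-means map, the L2 feature norm, and the central-patch similarity can be computed is shown below. It assumes you already have a patch-feature map reshaped to `(H, W, C)` as a torch tensor; it mirrors the panels conceptually and is not the repository's plotting code.

```python
# Sketch only: compute K-means, L2-norm, and central-patch similarity maps from
# a (H, W, C) patch-feature tensor, as in the visualization panels above.
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def feature_visualizations(feats: torch.Tensor, n_clusters: int = 8):
    H, W, C = feats.shape
    flat = feats.reshape(-1, C)

    # Per-patch L2 feature norm (panels 4 and 8)
    l2_norm = flat.norm(dim=-1).reshape(H, W)

    # Cosine similarity between the central patch and all patches (panels 5 and 9)
    center = feats[H // 2, W // 2]
    cos_sim = F.cosine_similarity(flat, center[None, :], dim=-1).reshape(H, W)

    # K-means clustering of the patch features (panels 3 and 7)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(flat.detach().cpu().numpy())
    cluster_map = torch.from_numpy(labels).reshape(H, W)

    return l2_norm, cos_sim, cluster_map
```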

Results and Pre-trained Models

Please refer to Hugging Face for the pre-trained models.

Model Summary

We include 4 versions of models in this release: `voc_denoised`, `voc_distilled`, `imgnet_denoised`, and `imgnet_distilled` (see the DINOv2-base summary table below). The tables in the Performance Summary report the baseline features and each of these settings.
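The model identifiers in the tables below are timm model names. As a point of reference, here is a minimal sketch of loading one of these backbones with timm and extracting its tokens; applying the DVT denoiser on top of these features is not shown here (see the Hugging Face page for the released weights).

```python
# Sketch only: load a timm backbone from the tables and extract its tokens.
import timm
import torch

model = timm.create_model("vit_base_patch14_dinov2.lvd142m", pretrained=True, num_classes=0)
model.eval()

x = torch.randn(1, 3, 518, 518)  # dummy input at DINOv2's default 518x518 resolution
with torch.no_grad():
    tokens = model.forward_features(x)  # (1, 1 + num_patches, C), incl. the cls token
print(tokens.shape)
```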

Performance Summary

Baseline (`baseline` in the summary table below):

| Model | VOC_mIoU | VOC_mAcc | ADE_mIoU | ADE_mAcc | NYU_RMSE | NYU_abs_rel | NYU_a1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| vit_small_patch14_dinov2.lvd142m | 81.78 | 88.44 | 44.05 | 55.53 | 0.4340 | 0.1331 | 84.49% |
| vit_base_patch14_dinov2.lvd142m | 83.52 | 90.60 | 47.02 | 58.45 | 0.3965 | 0.1197 | 87.59% |
| vit_large_patch14_dinov2.lvd142m | 83.43 | 90.38 | 47.53 | 59.64 | 0.3831 | 0.1145 | 88.89% |
| vit_small_patch14_reg4_dinov2.lvd142m | 80.88 | 88.69 | 44.36 | 55.90 | 0.4328 | 0.1303 | 85.00% |
| vit_base_patch14_reg4_dinov2.lvd142m | 83.48 | 90.95 | 47.73 | 60.17 | 0.3967 | 0.1177 | 87.92% |
| vit_large_patch14_reg4_dinov2.lvd142m | 83.21 | 90.67 | 48.44 | 61.28 | 0.3852 | 0.1139 | 88.53% |
| deit3_base_patch16_224.fb_in1k | 71.03 | 80.67 | 32.84 | 42.79 | 0.5837 | 0.1772 | 73.03% |
| vit_base_patch16_clip_384.laion2b_ft_in12k_in1k | 77.75 | 86.68 | 40.50 | 52.81 | 0.5585 | 0.1678 | 74.30% |
| vit_base_patch16_224.dino | 62.92 | 75.98 | 31.03 | 40.62 | 0.5742 | 0.1694 | 74.55% |
| vit_base_patch16_224.mae | 50.29 | 63.10 | 23.84 | 32.06 | 0.6629 | 0.2275 | 66.24% |
| eva02_base_patch16_clip_224.merged2b | 71.49 | 82.69 | 37.89 | 50.31 | - | - | - |
| vit_base_patch16_384.augreg_in21k_ft_in1k | 73.51 | 83.60 | 36.46 | 48.65 | 0.6360 | 0.1898 | 69.10% |

VOC-denoised (`voc_denoised`):

| Model | VOC_mIoU | VOC_mAcc | ADE_mIoU | ADE_mAcc | NYU_RMSE | NYU_abs_rel | NYU_a1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| vit_small_patch14_dinov2.lvd142m | 82.78 | 90.69 | 45.14 | 56.35 | 0.4368 | 0.1337 | 84.34% |
| vit_base_patch14_dinov2.lvd142m | 84.92 | 91.74 | 48.54 | 60.21 | 0.3811 | 0.1166 | 88.42% |
| vit_large_patch14_dinov2.lvd142m | 85.25 | 91.69 | 49.80 | 61.98 | 0.3826 | 0.1118 | 89.32% |
| vit_small_patch14_reg4_dinov2.lvd142m | 81.93 | 89.54 | 45.55 | 57.52 | 0.4251 | 0.1292 | 85.01% |
| vit_base_patch14_reg4_dinov2.lvd142m | 84.58 | 91.17 | 49.24 | 61.66 | 0.3898 | 0.1146 | 88.60% |
| vit_large_patch14_reg4_dinov2.lvd142m | 84.37 | 91.42 | 49.19 | 62.21 | 0.3852 | 0.1141 | 88.45% |
| deit3_base_patch16_224.fb_in1k | 73.52 | 83.65 | 33.57 | 43.56 | 0.5817 | 0.1774 | 73.05% |
| vit_base_patch16_clip_384.laion2b_ft_in12k_in1k | 79.50 | 88.43 | 41.33 | 53.54 | 0.5512 | 0.1639 | 75.30% |
| vit_base_patch16_224.dino | 66.41 | 77.75 | 32.45 | 42.42 | 0.5784 | 0.1738 | 73.75% |
| vit_base_patch16_224.mae | 50.65 | 62.90 | 23.25 | 31.03 | 0.6651 | 0.2271 | 65.44% |
| eva02_base_patch16_clip_224.merged2b | 73.76 | 84.50 | 37.99 | 50.40 | 0.6196 | 0.1904 | 69.86% |
| vit_base_patch16_384.augreg_in21k_ft_in1k | 74.82 | 84.40 | 36.75 | 48.82 | 0.6316 | 0.1921 | 69.37% |

VOC-distilled (`voc_distilled`):

| Model | VOC_mIoU | VOC_mAcc | ADE_mIoU | ADE_mAcc | NYU_RMSE | NYU_abs_rel | NYU_a1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| vit_base_patch14_dinov2.lvd142m | 85.10 | 91.41 | 48.57 | 60.35 | 0.3850 | 0.1207 | 88.25% |
| vit_base_patch14_reg4_dinov2.lvd142m | 84.36 | 90.80 | 49.20 | 61.56 | 0.3838 | 0.1143 | 88.97% |
| deit3_base_patch16_224.fb_in1k | 73.63 | 82.74 | 34.43 | 44.96 | 0.5712 | 0.1747 | 74.00% |
| vit_base_patch16_clip_384.laion2b_ft_in12k_in1k | 79.86 | 88.33 | 42.28 | 54.26 | 0.5253 | 0.1571 | 77.23% |
| vit_base_patch16_224.dino | 66.80 | 78.47 | 32.68 | 42.58 | 0.5750 | 0.1696 | 73.86% |
| vit_base_patch16_224.mae | 51.91 | 64.67 | 23.73 | 31.88 | 0.6733 | 0.2282 | 65.33% |
| eva02_base_patch16_clip_224.merged2b | 75.93 | 85.44 | 40.15 | 52.04 | - | - | - |
| vit_base_patch16_384.augreg_in21k_ft_in1k | 76.26 | 85.14 | 38.62 | 50.61 | 0.5825 | 0.1768 | 73.14% |

ImageNet-denoised and ImageNet-distilled (`imgnet_denoised`, `imgnet_distilled`):

| Model | VOC_mIoU | VOC_mAcc | ADE_mIoU | ADE_mAcc | NYU_RMSE | NYU_abs_rel | NYU_a1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| vit_base_patch14_dinov2.lvd142m (denoised) | 85.17 | 91.55 | 48.68 | 60.60 | 0.3832 | 0.1152 | 88.50% |
| vit_base_patch14_dinov2.lvd142m (distilled) | 85.33 | 91.48 | 48.85 | 60.47 | 0.3704 | 0.1115 | 89.74% |

A summary of the DINOv2-base model results is shown below:

| vit_base_patch14_dinov2.lvd142m | VOC_mIoU | VOC_mAcc | ADE_mIoU | ADE_mAcc | NYU_RMSE | NYU_abs_rel | NYU_a1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| baseline | 83.52 | 90.60 | 47.02 | 58.45 | 0.3965 | 0.1197 | 87.59% |
| voc_denoised | 84.92 | 91.74 | 48.54 | 60.21 | 0.3811 | 0.1166 | 88.42% |
| voc_distilled | 85.10 | 91.41 | 48.57 | 60.35 | 0.3850 | 0.1207 | 88.25% |
| imgnet_denoised | 85.17 | 91.55 | 48.68 | 60.60 | 0.3832 | 0.1152 | 88.50% |
| imgnet_distilled | 85.33 | 91.48 | 48.85 | 60.47 | 0.3704 | 0.1115 | 89.74% |

During our exploration, we found that the denoiser-training and distillation settings can slightly affect the performance of the final model. For example, whether the cls token is included in the denoiser's Transformer feedforward layer affects depth estimation performance (an illustrative sketch of this choice follows below). Our best model during this exploration achieved around 85.56 mIoU on VOC, 49.02 mIoU on ADE20K, and 89.98% a1 on NYU-Depth.

However, we do not include this model in the final release because the added complexity brings only a marginal improvement.
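For intuition only, the sketch below shows what that design toggle could look like in a generic single-block Transformer denoiser head; this is an illustrative assumption for the cls-token choice discussed above, not the architecture released in this repository.

```python
# Illustrative only: a generic single-block Transformer "denoiser" head where the
# cls token can optionally be processed together with the patch tokens. This is
# an assumption for intuition, not the released DVT implementation.
import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    def __init__(self, dim: int, include_cls: bool = True):
        super().__init__()
        self.include_cls = include_cls
        self.block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=8, dim_feedforward=4 * dim, batch_first=True
        )

    def forward(self, cls_token: torch.Tensor, patch_tokens: torch.Tensor) -> torch.Tensor:
        # cls_token: (B, 1, C); patch_tokens: (B, N, C)
        if self.include_cls:
            x = torch.cat([cls_token, patch_tokens], dim=1)
            return self.block(x)[:, 1:]  # return denoised patch tokens only
        return self.block(patch_tokens)
```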

Legacy Results

These are old results. We keep them here for reference.

VOC Evaluation Results

| Model | mIoU | aAcc | mAcc | Logfile |
| --- | --- | --- | --- | --- |
| MAE | 50.24 | 88.02 | 63.15 | log |
| MAE + DVT | 50.53 | 88.06 | 63.29 | log |
| DINO | 63.00 | 91.38 | 76.35 | log |
| DINO + DVT | 66.22 | 92.41 | 78.14 | log |
| Registers | 83.64 | 96.31 | 90.67 | log |
| Registers + DVT | 84.50 | 96.56 | 91.45 | log |
| DeiT3 | 70.62 | 92.69 | 81.23 | log |
| DeiT3 + DVT | 73.36 | 93.34 | 83.74 | log |
| EVA | 71.52 | 92.76 | 82.95 | log |
| EVA + DVT | 73.15 | 93.43 | 83.55 | log |
| CLIP | 77.78 | 94.74 | 86.57 | log |
| CLIP + DVT | 79.01 | 95.13 | 87.48 | log |
| DINOv2 | 83.60 | 96.30 | 90.82 | log |
| DINOv2 + DVT | 84.84 | 96.67 | 91.70 | log |

ADE20K Evaluation Results

| Model | mIoU | aAcc | mAcc | Logfile |
| --- | --- | --- | --- | --- |
| MAE | 23.60 | 68.54 | 31.49 | log |
| MAE + DVT | 23.62 | 68.58 | 31.25 | log |
| DINO | 31.03 | 73.56 | 40.33 | log |
| DINO + DVT | 32.40 | 74.53 | 42.01 | log |
| Registers | 48.22 | 81.11 | 60.52 | log |
| Registers + DVT | 49.34 | 81.94 | 61.70 | log |
| DeiT3 | 32.73 | 72.61 | 42.81 | log |
| DeiT3 + DVT | 36.57 | 74.44 | 49.01 | log |
| EVA | 37.45 | 72.78 | 49.74 | log |
| EVA + DVT | 37.87 | 75.02 | 49.81 | log |
| CLIP | 40.51 | 76.44 | 52.47 | log |
| CLIP + DVT | 41.10 | 77.41 | 53.07 | log |
| DINOv2 | 47.29 | 80.84 | 59.18 | log |
| DINOv2 + DVT | 48.66 | 81.89 | 60.24 | log |

NYU-D Evaluation Results

| Model | RMSE | Rel | Logfile |
| --- | --- | --- | --- |
| MAE | 0.6695 | 0.2334 | log |
| MAE + DVT | 0.7080 | 0.2560 | log |
| DINO | 0.5832 | 0.1701 | log |
| DINO + DVT | 0.5780 | 0.1731 | log |
| Registers | 0.3969 | 0.1190 | log |
| Registers + DVT | 0.3880 | 0.1157 | log |
| DeiT3 | 0.588 | 0.1788 | log |
| DeiT3 + DVT | 0.5891 | 0.1802 | log |
| EVA | 0.6446 | 0.1989 | log |
| EVA + DVT | 0.6243 | 0.1964 | log |
| CLIP | 0.5598 | 0.1679 | log |
| CLIP + DVT | 0.5591 | 0.1667 | log |
| DINOv2 | 0.4034 | 0.1238 | log |
| DINOv2 + DVT | 0.3943 | 0.1200 | log |