Multi-Domain Long-Tailed Recognition (MDLT)

This repository contains the implementation for the paper: On Multi-Domain Long-Tailed Recognition, Imbalanced Domain Generalization and Beyond (ECCV 2022).

It is also a (living) PyTorch suite containing benchmark datasets and algorithms for Multi-Domain Long-Tailed Recognition (MDLT). Currently we support 8 MDLT datasets (3 synthetic + 5 real), as well as ~20 algorithms that span different learning strategies. Feel free to send us a PR to add your algorithm / dataset for MDLT!


<div align="center"> <img src="mdlt/assets/teaser.gif" width="750"><br> <b>Multi-Domain Long-Tailed Recognition (MDLT)</b> aims to learn from multi-domain imbalanced data, address label imbalance, domain shift, and divergent label distributions across domains, and generalize to all domain-class pairs. </div>

MDLT: From Single- to Multi-Domain Imbalanced Learning

Existing studies on data imbalance focus on single-domain settings, i.e., samples are from the same data distribution. However, natural data can originate from distinct domains, where a minority class in one domain could have abundant instances from other domains. We systematically investigate Multi-Domain Long-Tailed Recognition (MDLT), which learns from multi-domain imbalanced data, addresses label imbalance, domain shift, and divergent label distributions across domains, and generalizes to all domain-class pairs.
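To make "divergent label distributions across domains" concrete, the sketch below computes per-domain label fractions on a toy dataset. The domain names, class names, and counts are hypothetical and are not taken from any MDLT benchmark; the point is only that a class can be a minority in one domain and abundant in another.

```python
from collections import Counter, defaultdict

# Hypothetical toy samples as (domain, label) pairs; not from any MDLT dataset.
samples = (
    [("photo", "dog")] * 90 + [("photo", "giraffe")] * 10
    + [("sketch", "dog")] * 5 + [("sketch", "giraffe")] * 95
)


def label_distribution_per_domain(pairs):
    """Return {domain: {label: fraction of that domain's samples}}."""
    counts = defaultdict(Counter)
    for domain, label in pairs:
        counts[domain][label] += 1
    return {
        domain: {label: n / sum(c.values()) for label, n in c.items()}
        for domain, c in counts.items()
    }


dist = label_distribution_per_domain(samples)
for domain, fractions in dist.items():
    # "giraffe" is rare in "photo" but dominant in "sketch".
    print(domain, fractions)
```

In this toy case a model trained on the "photo" domain alone would see few giraffes, while the pooled multi-domain data contains many, which is the situation MDLT is meant to exploit.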

We develop the domain-class transferability graph, and show that such transferability governs the success of learning in MDLT. We then propose BoDA, a theoretically grounded learning strategy that tracks the upper bound of transferability statistics, and ensures balanced alignment and calibration across imbalanced domain-class distributions. We curate MDLT benchmark datasets based on widely-used multi-domain datasets, and benchmark ~20 algorithms that span different learning strategies for MDLT.

Beyond MDLT: Domain Generalization under Data Imbalance

Further, as a byproduct, we demonstrate that BoDA strengthens Domain Generalization (DG) algorithms, and consistently improves the results on DG benchmarks. Note that all current standard DG benchmarks naturally exhibit heavy class imbalance within domains and label distribution shift across domains, confirming that data imbalance is an intrinsic problem in DG that has so far been overlooked by prior work.

The results shed light on how label imbalance can affect out-of-distribution generalization, and highlight the importance of integrating label imbalance into practical DG algorithm design.

Getting Started

Installation

Prerequisites

  1. Download the original datasets, and place them in your data_path
python -m mdlt.scripts.download --data_dir <data_path>
  2. Place the .csv files of train/val/test splits for each MDLT dataset (provided in mdlt/dataset/split/) in the corresponding dataset folder under your data_path

Dependencies

  1. PyTorch (>=1.4, tested on 1.4 / 1.9)
  2. pandas
  3. tensorboardX

Code Overview

Main Files

Main Arguments

Usage

Train a single model

python -m mdlt.train --algorithm <algo> --dataset <dset> --output_folder_name <output_folder_name> --data_dir <data_path> --output_dir <output_path>

Train a model in two stages (second-stage classifier learning)

python -m mdlt.train --algorithm CRT --dataset <dset> --output_folder_name <output_folder_name> --data_dir <data_path> --output_dir <output_path> --stage1_folder <stage1_model_folder> --stage1_algo <stage1_algo>

Note that for $\text{BoDA}_{r,c}$ the command is the same as above; only stage1_algo and stage1_folder change
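If you script your runs, the two training commands above can be assembled programmatically. The following is a minimal sketch, not part of the repository: `build_train_command` is a hypothetical helper, and the algorithm name, folder names, and paths passed to it are placeholder values. It only uses the flags documented above.

```python
import shlex


def build_train_command(algorithm, dataset, output_folder_name, data_dir,
                        output_dir, stage1_folder=None, stage1_algo=None):
    """Assemble the argv list for `python -m mdlt.train` from the documented flags."""
    cmd = [
        "python", "-m", "mdlt.train",
        "--algorithm", algorithm,
        "--dataset", dataset,
        "--output_folder_name", output_folder_name,
        "--data_dir", data_dir,
        "--output_dir", output_dir,
    ]
    # The stage-1 flags only apply to second-stage classifier learning,
    # and they are always passed as a pair.
    if (stage1_folder is None) != (stage1_algo is None):
        raise ValueError("stage1_folder and stage1_algo must be given together")
    if stage1_folder is not None:
        cmd += ["--stage1_folder", stage1_folder, "--stage1_algo", stage1_algo]
    return cmd


# Placeholder values; substitute your own algorithm, folders, and paths.
stage1 = build_train_command(
    "BoDA", "ImbalancedDigits", "stage1_run", "./data", "./output")
stage2 = build_train_command(
    "CRT", "ImbalancedDigits", "stage2_run", "./data", "./output",
    stage1_folder="stage1_run", stage1_algo="BoDA")

print(shlex.join(stage1))
print(shlex.join(stage2))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)`; keeping it as a list avoids shell-quoting problems when paths contain spaces.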

Train a model on Digits-MLT, with Forward-LT as the imbalance type for all domains and an imbalance ratio of 0.01

python -m mdlt.train --algorithm <algo> --dataset ImbalancedDigits \
       --imb_type eee \
       --imb_factor 0.01 \
       --selected_envs 1 2

Note that for Digits-MLT, we additionally provide MNIST as another domain. To keep the same setting as in the paper (2 domains), set selected_envs to 1 2 as above
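To get a feel for what an imbalance ratio of 0.01 means, the sketch below generates per-class sample counts under an exponential long-tailed profile, a convention common in long-tailed recognition benchmarks. This is an assumption for illustration only: `long_tailed_counts` is a hypothetical helper, the value of `n_max` is made up, and the subsampling actually used for Digits-MLT in this repository may differ.

```python
def long_tailed_counts(n_max, num_classes, imb_factor):
    """Per-class counts decaying exponentially from n_max down to n_max * imb_factor.

    Assumed profile: n_i = n_max * imb_factor ** (i / (num_classes - 1)).
    """
    if num_classes < 1:
        raise ValueError("num_classes must be positive")
    if num_classes == 1:
        return [n_max]
    return [
        round(n_max * imb_factor ** (i / (num_classes - 1)))
        for i in range(num_classes)
    ]


# n_max is a made-up head-class size; 10 classes matches a digits task.
counts = long_tailed_counts(n_max=5000, num_classes=10, imb_factor=0.01)
print(counts)
print("tail/head ratio:", counts[-1] / counts[0])
```

Under this profile the rarest class has 1% as many samples as the most frequent one, and the classes in between shrink geometrically.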

Launch a sweep with different hparams

python -m mdlt.sweep launch --algorithms <...> --dataset <...> --n_hparams <num_of_hparams> --n_trials 1

Launch a sweep with different seeds after fixing the hparams

python -m mdlt.sweep launch --algorithms <...> --dataset <...> --best_hp --input_folder <...> --n_trials <num_of_trials>

Collect the results of your sweep

python -m mdlt.scripts.collect_results --input_dir <...>

Evaluate the best hparam model for a <dataset, algo> pair

python -u -m mdlt.evaluate.eval_best_hparam --algorithm <...> --dataset <...> --data_dir <...> --output_dir <...> --folder_name <...>

Evaluate a trained checkpoint

python -u -m mdlt.evaluate.eval_checkpoint --algorithm <...> --dataset <...> --data_dir <...> --checkpoint <...>

Reproduced Benchmarks and Model Zoo

| Model | VLCS-MLT | PACS-MLT | OfficeHome-MLT | TerraInc-MLT | DomainNet-MLT |
| --- | --- | --- | --- | --- | --- |
| BoDA (r) | 76.9 / model | 97.0 / model | 81.5 / model | 78.6 / model | 60.1 / model |
| BoDA (r,c) | 77.3 / model | 97.2 / model | 82.3 / model | 82.3 / model | 61.7 / model |

Updates

Acknowledgements

This code is partly based on the open-source implementations from DomainBed.

Citation

If you find this code or idea useful, please cite our work:

@inproceedings{yang2022multi,
  title={On Multi-Domain Long-Tailed Recognition, Imbalanced Domain Generalization and Beyond},
  author={Yang, Yuzhe and Wang, Hao and Katabi, Dina},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2022}
}

Contact

If you have any questions, feel free to contact us by email (yuzhe@mit.edu) or through GitHub issues. Enjoy!