License: MIT

WhyShift: A Benchmark with Specified Distribution Shift Patterns

<a href="https://ljsthu.github.io">Jiashuo Liu*</a>, <a href="https://wangtianyu61.github.io">Tianyu Wang*</a>, <a href="https://pengcui.thumedialab.com">Peng Cui</a>, <a href="https://hsnamkoong.github.io">Hongseok Namkoong</a>

Tsinghua University, Columbia University

WhyShift is a Python package that provides a benchmark with various specified distribution shift patterns on real-world tabular data. Our testbed highlights the importance of future research that builds an understanding of how distributions differ. For more details, please refer to our <a href="https://openreview.net/pdf?id=PF0lxayYST">paper</a>.

If you find this repository useful in your research, please cite the following paper:

@inproceedings{liu2023need,
  title={On the Need for a Language Describing Distribution Shifts: Illustrations on Tabular Datasets},
  author={Jiashuo Liu and Tianyu Wang and Peng Cui and Hongseok Namkoong},
  booktitle={Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2023}
}

Table of Contents

  1. Dataset Access
  2. Python Package: whyshift
  3. Different Distribution Shift Patterns
  4. Implemented Algorithms
  5. Algorithm for Identifying Risk Region
  6. Degradation Decomposition (DISDE)
  7. License and terms of use
  8. References

Dataset Access

Here we provide the access links for the 5 datasets used in our benchmark.

ACS Income

ACS PubCov

ACS Mobility

Taxi Dataset

US Accident Dataset

Python Package: whyshift

Here we provide the scripts to get data in our proposed settings.

Install the package

pip3 install whyshift

For settings utilizing ACS Income, Public Coverage, Mobility datasets

For settings utilizing US Accident, Taxi datasets

Different Distribution Shift Patterns

Based on our whyshift package, one could design various source-target pairs with different distribution shift patterns. Here we list some of them for reference:

| #ID | Dataset | Type | #Features | Outcome | Source | #Train Samples | #Test Domains | Dom. Ratio |
|---|---|---|---|---|---|---|---|---|
| 1 | ACS Income | Spatial | 9 | Income≥50k | California | 195,665 | 50 | $Y \mid X$: 13/14 |
| 2 | ACS Income | Spatial | 9 | Income≥50k | Connecticut | 19,785 | 50 | $Y \mid X$: 24/24 |
| 3 | ACS Income | Spatial | 9 | Income≥50k | Massachusetts | 40,114 | 50 | $Y \mid X$: 21/22 |
| 4 | ACS Income | Spatial | 9 | Income≥50k | South Dakota | 4,899 | 50 | $Y \mid X$: 9/9 |
| 5 | ACS Mobility | Spatial | 21 | Residential Address | Mississippi | 5,318 | 50 | $Y \mid X$: 28/34 |
| 6 | ACS Mobility | Spatial | 21 | Residential Address | New York | 40,463 | 50 | $Y \mid X$: 30/31 |
| 7 | ACS Mobility | Spatial | 21 | Residential Address | California | 80,329 | 50 | $Y \mid X$: 9/17 |
| 8 | ACS Mobility | Spatial | 21 | Residential Address | Pennsylvania | 23,918 | 50 | $Y \mid X$: 17/17 |
| 9 | Taxi | Spatial | 7 | Duration time≥30 min | Bogotá | 3,063 | 3 | $Y \mid X$: 1/2 |
| 10 | Taxi | Spatial | 7 | Duration time≥30 min | New York City | 1,458,646 | 3 | $Y \mid X$: 3/3 |
| 11 | ACS Pub.Cov | Spatial | 18 | Public Ins. Coverage | Nebraska | 6,332 | 50 | $Y \mid X$: 32/39 |
| 12 | ACS Pub.Cov | Spatial | 18 | Public Ins. Coverage | Florida | 71,297 | 50 | $Y \mid X$: 28/29 |
| 13 | ACS Pub.Cov | Spatial | 18 | Public Ins. Coverage | Texas | 98,928 | 50 | $Y \mid X$: 33/34 |
| 14 | ACS Pub.Cov | Spatial | 18 | Public Ins. Coverage | Indiana | 24,330 | 50 | $Y \mid X$: 11/13 |
| 15 | US Accident | Spatial | 47 | Severity of Accident | Texas | 26,664 | 13 | $Y \mid X$: 7/7 |
| 16 | US Accident | Spatial | 47 | Severity of Accident | California | 64,909 | 13 | $X$: 22/31 |
| 17 | US Accident | Spatial | 47 | Severity of Accident | Florida | 32,278 | 13 | $X$: 5/7 |
| 18 | US Accident | Spatial | 47 | Severity of Accident | Minnesota | 8,927 | 13 | $X$: 8/11 |
| 19 | ACS Pub.Cov | Temporal | 18 | Public Ins. Coverage | Year 2010 (NY) | 73,208 | 3 | $X$: 2/2 |
| 20 | ACS Pub.Cov | Temporal | 18 | Public Ins. Coverage | Year 2010 (CA) | 149,441 | 3 | $X$: 2/2 |
| 21 | ACS Income | Synthetic | 9 | Income≥50k | Younger People (80%) | 20,000 | 1 | $X$: 1/1 |
| 22 | ACS Income | Synthetic | 9 | Income≥50k | Younger People (90%) | 20,000 | 1 | $X$: 1/1 |

In our benchmark, each setting except the last has multiple target domains; in the main body of the paper, we select only one target domain per setting. Dom. Ratio reports which shift type, $Y|X$ or $X$, dominates among the source-target pairs in each setting whose performance degradation exceeds 5 percentage points. For example, "$Y|X$: 13/14" means that Setting 1 has 14 source-target pairs with degradation larger than 5 percentage points, and in 13 of them more than 50% of the degradation is attributed to $Y|X$ shifts. These numbers are measured with XGBoost.
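The Dom. Ratio bookkeeping described above can be sketched in a few lines. This is only an illustration: the function name `dominant_ratio`, the input format (one pair of degradation and $Y|X$ share per source-target pair), and the example numbers are hypothetical, and treating a share of exactly 50% as $X$-dominated is an assumption.

```python
from typing import List, Tuple


def dominant_ratio(
    pairs: List[Tuple[float, float]],
    degradation_threshold: float = 5.0,
    share_threshold: float = 0.5,
) -> Tuple[str, int, int]:
    """Summarize which shift type dominates in one setting.

    Each element of `pairs` is (degradation in percentage points,
    fraction of that degradation attributed to Y|X shifts).
    Returns (dominant shift label, number of pairs it dominates,
    number of pairs with degradation above the threshold).
    """
    # Keep only source-target pairs whose degradation exceeds the threshold.
    degraded = [(d, s) for d, s in pairs if d > degradation_threshold]
    # A pair counts as Y|X-dominated when more than half of its
    # degradation is attributed to Y|X shifts (assumption: ties go to X).
    yx_count = sum(1 for _, s in degraded if s > share_threshold)
    x_count = len(degraded) - yx_count
    if yx_count >= x_count:
        return "Y|X", yx_count, len(degraded)
    return "X", x_count, len(degraded)


# Hypothetical numbers, not taken from the benchmark.
example_pairs = [(6.0, 0.8), (7.5, 0.6), (3.0, 0.9), (10.0, 0.2)]
label, count, total = dominant_ratio(example_pairs)
print(f"{label}: {count}/{total}")  # → Y|X: 2/3
```

The third pair is dropped because its degradation is below 5 percentage points, which is why the denominator is 3 rather than 4.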

Implemented Algorithms

In our whyshift package, we also implement several algorithms for tabular data classification, including Logistic Regression, MLP, SVM, Random Forest, XGBoost, LightGBM, GBM, $\chi^2$/CVaR-DRO/DORO, Group DRO, Simple-Reweighting, JTT, Fairness-In/Postprocess and DWR methods.

# use the implemented methods
from whyshift import fetch_model
method_name = 'xgb'  # any name from the supported list
algo = fetch_model(method_name)

Note that the supported method names are:

method_name_list = ['lr', 'svm', 'xgb', 'lightgbm', 'rf', 'dwr', 'jtt', 'suby', 'subg', 'rwy', 'rwg', 'FairPostprocess_exp', 'FairInprocess_dp', 'FairPostprocess_threshold', 'FairInprocess_eo', 'FairInprocess_error_parity', 'chi_dro', 'chi_doro', 'cvar_dro', 'cvar_doro', 'group_dro']
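Since it is not stated here how fetch_model reacts to an unsupported name, a small guard can catch typos before the call. The sketch below uses only the standard library; `check_method_name` is a hypothetical helper, not part of whyshift, and the list is copied from the names above.

```python
import difflib

# Copied from the supported method names listed above.
SUPPORTED_METHODS = [
    'lr', 'svm', 'xgb', 'lightgbm', 'rf', 'dwr', 'jtt',
    'suby', 'subg', 'rwy', 'rwg',
    'FairPostprocess_exp', 'FairInprocess_dp', 'FairPostprocess_threshold',
    'FairInprocess_eo', 'FairInprocess_error_parity',
    'chi_dro', 'chi_doro', 'cvar_dro', 'cvar_doro', 'group_dro',
]


def check_method_name(name: str) -> str:
    """Return `name` if it is supported, otherwise raise with suggestions."""
    if name in SUPPORTED_METHODS:
        return name
    # Offer up to three similar-looking supported names as a hint.
    suggestions = difflib.get_close_matches(name, SUPPORTED_METHODS, n=3)
    hint = f" Did you mean: {', '.join(suggestions)}?" if suggestions else ""
    raise ValueError(f"Unsupported method name {name!r}.{hint}")
```

With this in place, a call such as `fetch_model(check_method_name('cvar_dro'))` would fail early with a readable message if the name were misspelled.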

Identify Risk Region

In our whyshift package, we implement the risk region identification algorithm (Algorithm 1 in our paper) as the function risk_region. Here is an example of how to use it:

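import xgboost as xgb
from whyshift import risk_region

# one model for the source domain and one for the target domain
source_model = xgb.XGBClassifier()
target_model = xgb.XGBClassifier()
# 'income' is the task; 'CA' and 'PR' appear to be the source and target regions
risk_region('xgb', source_model, target_model, 'income', 'CA', 'PR', "./datasets/acs")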

DISDE Method

In our whyshift package, we implement the DIstribution Shift DEcomposition (DISDE) method to attribute the performance degradation to $Y|X$-shifts and $X$-shifts, respectively. The function is degradation_decomp.

The parameters include:

An example of using the whyshift package can be found in <a href="./Example.ipynb">Example.ipynb</a>.
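To make the idea of attributing degradation to $X$-shifts and $Y|X$-shifts concrete, here is a toy sketch on a discrete feature with known conditional risks. It is not the package's degradation_decomp: the three-term split through a shared distribution, the particular choice of shared distribution (proportional to $p(x)q(x)/(p(x)+q(x))$), and all names and numbers are assumptions made for illustration.

```python
from typing import Dict, List


def normalize(weights: List[float]) -> List[float]:
    """Rescale non-negative weights so they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]


def expectation(dist: List[float], values: List[float]) -> float:
    """Expected value of `values` under the discrete distribution `dist`."""
    return sum(d * v for d, v in zip(dist, values))


def decompose(
    p: List[float],       # source distribution over the values of X
    q: List[float],       # target distribution over the values of X
    risk_p: List[float],  # expected loss given X = x under the source Y|X
    risk_q: List[float],  # expected loss given X = x under the target Y|X
) -> Dict[str, float]:
    # Assumed shared distribution: mass where both source and target have support.
    shared = normalize(
        [pi * qi / (pi + qi) if pi + qi > 0 else 0.0 for pi, qi in zip(p, q)]
    )
    return {
        # Source X distribution moved to the shared one, Y|X held at source.
        "x_shift_source_to_shared": expectation(shared, risk_p) - expectation(p, risk_p),
        # Y|X moved from source to target, evaluated on the shared X distribution.
        "yx_shift": expectation(shared, risk_q) - expectation(shared, risk_p),
        # Shared X distribution moved to the target one, Y|X held at target.
        "x_shift_shared_to_target": expectation(q, risk_q) - expectation(shared, risk_q),
        # Total degradation: target risk minus source risk.
        "total": expectation(q, risk_q) - expectation(p, risk_p),
    }


# Hypothetical numbers for three values of X.
terms = decompose(
    p=[0.6, 0.3, 0.1],
    q=[0.2, 0.3, 0.5],
    risk_p=[0.1, 0.2, 0.3],
    risk_q=[0.15, 0.2, 0.5],
)
yx_share = terms["yx_shift"] / terms["total"]
print(terms, yx_share)
```

The three shift terms add up to the total by construction, and `yx_share` is the kind of quantity that would be compared against 50% when deciding whether $Y|X$ shifts dominate.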

License and terms of use

Our benchmark is built upon Folktables [1]. The license of Folktables is:

Folktables provides code to download data from the American Community Survey (ACS) Public Use Microdata Sample (PUMS) files managed by the US Census Bureau. The data itself is governed by the terms of use provided by the Census Bureau. For more information, see https://www.census.gov/data/developers/about/terms-of-service.html

The Adult reconstruction dataset is a subsample of the IPUMS CPS data available from https://cps.ipums.org/. The data are intended for replication purposes only. Individuals analyzing the data for other purposes must submit a separate data extract request directly via IPUMS CPS. Individuals are not to redistribute the data without permission. Contact ipums@umn.edu for redistribution requests.

In addition, for the US Accident and Taxi data from Kaggle, users should follow the respective licenses; see https://www.kaggle.com/datasets/sobhanmoosavi/us-accidents and https://www.kaggle.com/competitions/nyc-taxi-trip-duration/data.

References

[1] Ding, F., Hardt, M., Miller, J., & Schmidt, L. (2021). Retiring Adult: New Datasets for Fair Machine Learning. Advances in Neural Information Processing Systems, 34, 6478-6490.
