Awesome-Data-Centric-AI
A curated, but incomplete, list of data-centric AI resources. Covering every paper is infeasible, so we select papers that present a range of distinct ideas. We welcome contributions that enrich and refine this list.
:loudspeaker: News: Please check out our open-sourced Large Time Series Model (LTSM)!
If you want to contribute to this list, please feel free to send a pull request. Also, you can contact daochen.zha@rice.edu.
- Survey paper: Data-centric Artificial Intelligence: A Survey
- Perspective paper (SDM 2023): Data-centric AI: Perspectives and Challenges
- Data-centric AI: Techniques and Future Perspectives (KDD 2023 Tutorial): [Website] [Slides] [Video] [Paper]
- Blogs:
- Chinese-language explainers:
- Graph Structure Learning (GSL) is a data-centric AI direction in graph neural networks (GNNs):
- Check out DiscoverPath, our latest knowledge graph (KG)-based paper search engine: https://github.com/ynchuang/DiscoverPath
Want to discuss data-centric AI with others who share the interest? There are three options:
- Join our Slack channel
- Join our QQ group (183116457). Password: datacentric
- Join the WeChat group below (if the QR code is expired, please add WeChat ID zdcwhu and add a note indicating that you want to join the Data-centric AI group)!
What is Data-centric AI?
Data-centric AI is an emerging field that focuses on engineering data to improve AI systems with enhanced data quality and quantity.
Data-centric AI vs. Model-centric AI
<img width="500" src="./imgs/data-centric.png" alt="data-centric" />
In the conventional model-centric AI lifecycle, researchers and developers primarily focus on identifying more effective models to improve AI performance while keeping the data largely unchanged. However, this model-centric paradigm overlooks the potential quality issues and undesirable flaws of data, such as missing values, incorrect labels, and anomalies. Complementing the existing efforts in model advancement, data-centric AI emphasizes the systematic engineering of data to build AI systems, shifting our focus from model to data.
It is important to note that "data-centric" differs fundamentally from "data-driven", as the latter only emphasizes the use of data to guide AI development, which typically still centers on developing models rather than engineering data.
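As a rough illustration of the contrast above, the following minimal sketch keeps a toy model fixed and changes only the data. The nearest-centroid classifier, the `-999` missing-value sentinel, the toy dataset, and all function names are assumptions made for this example; they are not taken from the survey.

```python
from statistics import mean

MISSING = -999.0  # hypothetical sentinel an upstream system writes for "no reading"


def fit_centroids(rows):
    """Fixed 'model': one centroid (mean feature value) per class."""
    by_class = {}
    for x, y in rows:
        by_class.setdefault(y, []).append(x)
    return {label: mean(xs) for label, xs in by_class.items()}


def predict(centroids, x):
    """Assign x to the class whose centroid is closest."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


def accuracy(centroids, rows):
    """Fraction of rows whose predicted class matches the label."""
    return sum(predict(centroids, x) == y for x, y in rows) / len(rows)


def clean(rows):
    """Data-centric step: drop rows whose feature is the missing-value sentinel."""
    return [(x, y) for x, y in rows if x != MISSING]


# Toy data: class 0 clusters around 1-3, class 1 around 7-9.
# One class-1 row carries the sentinel instead of a real measurement.
train_raw = [(1.0, 0), (2.0, 0), (3.0, 0), (7.0, 1), (8.0, 1), (9.0, 1), (MISSING, 1)]
test = [(1.5, 0), (2.5, 0), (7.5, 1), (8.5, 1)]

# The model code is identical in both runs; only the training data differs.
acc_raw = accuracy(fit_centroids(train_raw), test)
acc_clean = accuracy(fit_centroids(clean(train_raw)), test)

print(f"same model, raw data:     {acc_raw:.2f}")
print(f"same model, cleaned data: {acc_clean:.2f}")
```

In this toy setup the sentinel row drags the class-1 centroid far away from the real class-1 points, so accuracy on the test set is 0.50 with the raw data and 1.00 after the cleaning step, without any change to the model.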
Why Data-centric AI?
<img width="800" src="./imgs/motivation.png" alt="motivation" />
Two motivating examples of GPT models highlight the central role of data in AI.
- On the left, large, high-quality training data drive the recent successes of GPT models, while the model architectures remain similar apart from a larger number of weights.
- On the right, when the model becomes sufficiently powerful, we only need to engineer prompts (inference data) to accomplish our objectives, with the model being fixed.
Another example is Segment Anything, a foundation model for computer vision. The core of training Segment Anything lies in the large amount of annotated data, containing more than 1 billion masks, which is 400 times larger than existing segmentation datasets.
What is the Data-centric AI Framework?
<img width="800" src="./imgs/framework.png" alt="framework" />
The data-centric AI framework consists of three goals: training data development, inference data development, and data maintenance, where each goal is associated with several sub-goals.
- The goal of training data development is to collect and produce rich and high-quality training data to support the training of machine learning models.
- The objective of inference data development is to create novel evaluation sets that can provide more granular insights into the model or trigger a specific capability of the model with engineered data inputs.
- The purpose of data maintenance is to ensure the quality and reliability of data in a dynamic environment.
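The three goals and their sub-goals can be written down as a small lookup table. This is only a sketch: the sub-goal names are copied from the section headings of this list, while the `FRAMEWORK` dictionary and the `goal_of` helper are hypothetical names introduced for illustration.

```python
# Goal -> sub-goals, copied from the section headings of this list.
FRAMEWORK = {
    "Training Data Development": [
        "Data Collection",
        "Data Labeling",
        "Data Preparation",
        "Data Reduction",
        "Data Augmentation",
        "Pipeline Search",
    ],
    "Inference Data Development": [
        "In-distribution Evaluation",
        "Out-of-distribution Evaluation",
        "Prompt Engineering",
    ],
    "Data Maintenance": [
        "Data Understanding",
        "Data Quality Assurance",
        "Data Storage and Retrieval",
    ],
}


def goal_of(sub_goal):
    """Return the goal that a sub-goal belongs to, or None if it is not listed."""
    for goal, sub_goals in FRAMEWORK.items():
        if sub_goal in sub_goals:
            return goal
    return None


print(goal_of("Prompt Engineering"))
```

A table like this is one way to tag a new paper with the section it belongs to before opening a pull request.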
Cite this Work
Zha, Daochen, et al. "Data-centric Artificial Intelligence: A Survey." arXiv preprint arXiv:2303.10158, 2023.
@article{zha2023data-centric-survey,
  title={Data-centric Artificial Intelligence: A Survey},
  author={Zha, Daochen and Bhat, Zaid Pervaiz and Lai, Kwei-Herng and Yang, Fan and Jiang, Zhimeng and Zhong, Shaochen and Hu, Xia},
  journal={arXiv preprint arXiv:2303.10158},
  year={2023}
}
Zha, Daochen, et al. "Data-centric AI: Perspectives and Challenges." SDM, 2023.
@inproceedings{zha2023data-centric-perspectives,
  title={Data-centric AI: Perspectives and Challenges},
  author={Zha, Daochen and Bhat, Zaid Pervaiz and Lai, Kwei-Herng and Yang, Fan and Hu, Xia},
  booktitle={SDM},
  year={2023}
}
Training Data Development
<img width="800" src="./imgs/training-data-development.png" alt="training-data-development" />
Data Collection
- Revisiting time series outlier detection: Definitions and benchmarks, NeurIPS 2021 [Paper] [Code]
- Dataset discovery in data lakes, ICDE 2020 [Paper]
- Aurum: A data discovery system, ICDE 2018 [Paper] [Code]
- Table union search on open data, VLDB 2018 [Paper]
- Data Integration: The Current Status and the Way Forward, IEEE Computer Society Technical Committee on Data Engineering 2018 [Paper]
- To join or not to join? thinking twice about joins before feature selection, SIGMOD 2016 [Paper]
- Data curation at scale: the data tamer system, CIDR 2013 [Paper]
- Data integration: A theoretical perspective, PODS 2002 [Paper]
Data Labeling
- Segment Anything [Paper] [Code]
- Active Ensemble Learning for Knowledge Graph Error Detection, WSDM 2023 [Paper]
- Active-Learning-as-a-Service: An Efficient MLOps System for Data-Centric AI, NeurIPS 2022 Workshop on Human in the Loop Learning [Paper] [Code]
- Training language models to follow instructions with human feedback, NeurIPS 2022 [Paper]
- Interactive Weak Supervision: Learning Useful Heuristics for Data Labeling, ICLR 2021 [Paper] [Code]
- A survey of deep active learning, ACM Computing Surveys 2021 [Paper]
- Adaptive rule discovery for labeling text data, SIGMOD 2021 [Paper]
- Cut out the annotator, keep the cutout: better segmentation with weak supervision, ICLR 2021 [Paper]
- Meta-AAD: Active anomaly detection with deep reinforcement learning, ICDM 2020 [Paper] [Code]
- Snorkel: Rapid training data creation with weak supervision, VLDB 2020 [Paper] [Code]
- Graph-based semi-supervised learning: A review, Neurocomputing 2020 [Paper]
- Annotator rationales for labeling tasks in crowdsourcing, JAIR 2020 [Paper]
- Rethinking pre-training and self-training, NeurIPS 2020 [Paper]
- Multi-label dataless text classification with topic modeling, KIS 2019 [Paper]
- Data programming: Creating large training sets, quickly, NeurIPS 2016 [Paper]
- Semi-supervised consensus labeling for crowdsourcing, SIGIR 2011 [Paper]
- Vox Populi: Collecting High-Quality Labels from a Crowd, COLT 2009 [Paper]
- Democratic co-learning, ICTAI 2004 [Paper]
- Active learning with statistical models, JAIR 1996 [Paper]
Data Preparation
- DataFix: Adversarial Learning for Feature Shift Detection and Correction, NeurIPS 2023 [Paper] [Code]
- OpenGSL: A Comprehensive Benchmark for Graph Structure Learning, arXiv 2023 [Paper] [Code]
- TSFEL: Time series feature extraction library, SoftwareX 2020 [Paper] [Code]
- Alphaclean: Automatic generation of data cleaning pipelines, arXiv 2019 [Paper] [Code]
- Introduction to Scikit-learn, Book 2019 [Paper] [Code]
- Feature extraction: a survey of the types, techniques, applications, ICSC 2019 [Paper]
- Feature engineering for predictive modeling using reinforcement learning, AAAI 2018 [Paper]
- Time series classification from scratch with deep neural networks: A strong baseline, IJCNN 2017 [Paper]
- Missing data imputation: focusing on single imputation, ATM 2016 [Paper]
- Estimating the number and sizes of fuzzy-duplicate clusters, CIKM 2014 [Paper]
- Data normalization and standardization: a technical report, MLTR 2014 [Paper]
- CrowdER: crowdsourcing entity resolution, VLDB 2012 [Paper]
- Imputation of Missing Data Using Machine Learning Techniques, KDD 1996 [Paper]
Data Reduction
- Active feature selection for the mutual information criterion, AAAI 2021 [Paper] [Code]
- Active incremental feature selection using a fuzzy-rough-set-based information entropy, IEEE Transactions on Fuzzy Systems 2020 [Paper]
- MESA: boost ensemble imbalanced learning with meta-sampler, NeurIPS 2020 [Paper] [Code]
- Autoencoders, arXiv 2020 [Paper]
- Feature selection: A data perspective, ACM Computing Surveys 2017 [Paper] [Code]
- Intrusion detection model using fusion of chi-square feature selection and multi class SVM, Journal of King Saud University-Computer and Information Sciences 2017 [Paper]
- Feature selection and analysis on correlated gas sensor data with recursive feature elimination, Sensors and Actuators B: Chemical 2015 [Paper]
- Embedded unsupervised feature selection, AAAI 2015 [Paper]
- Using random undersampling to alleviate class imbalance on tweet sentiment data, ICIRI 2015 [Paper]
- Feature selection based on information gain, IJITEE 2013 [Paper]
- Linear discriminant analysis, Book 2013 [Paper]
- Introduction to k nearest neighbour classification and condensed nearest neighbour data reduction, 2012 [Paper]
- Principal component analysis, Wiley Interdisciplinary Reviews 2010 [Paper] [Code]
- Finding representative patterns with ordered projections, Pattern Recognition 2003 [Paper]
Data Augmentation
- Towards automated imbalanced learning with deep hierarchical reinforcement learning, CIKM 2022 [Paper] [Code]
- G-Mixup: Graph Data Augmentation for Graph Classification, ICML 2022 [Paper] [Code]
- Cascaded Diffusion Models for High Fidelity Image Generation, JMLR 2022 [Paper]
- Time series data augmentation for deep learning: A survey, IJCAI 2021 [Paper]
- Text data augmentation for deep learning, JBD 2020 [Paper]
- Mixtext: Linguistically-informed interpolation of hidden space for semi-supervised text classification, ACL 2020 [Paper] [Code]
- Autoaugment: Learning augmentation policies from data, CVPR 2019 [Paper] [Code]
- Mixup: Beyond empirical risk minimization, ICLR 2018 [Paper] [Code]
- Synthetic data augmentation using GAN for improved liver lesion classification, ISBI 2018 [Paper] [Code]
- Unsupervised domain adaptation for robust speech recognition via variational autoencoder-based data augmentation, ASRU 2017 [Paper]
- Character-level convolutional networks for text classification, NeurIPS 2015 [Paper] [Code]
- ADASYN: Adaptive synthetic sampling approach for imbalanced learning, IJCNN 2008 [Paper] [Code]
- SMOTE: synthetic minority over-sampling technique, JAIR 2002 [Paper] [Code]
Pipeline Search
- Towards Personalized Preprocessing Pipeline Search, arXiv 2023 [Paper]
- AutoVideo: An Automated Video Action Recognition System, IJCAI 2022 [Paper] [Code]
- Tods: An automated time series outlier detection system, AAAI 2021 [Paper] [Code]
- Deepline: Automl tool for pipelines generation using deep reinforcement learning and hierarchical actions filtering, KDD 2020 [Paper]
- On evaluation of automl systems, ICML 2020 [Paper]
- AlphaD3M: Machine learning pipeline synthesis, ICML 2018 [Paper]
- Efficient and robust automated machine learning, NeurIPS 2015 [Paper] [Code]
- Tiny3D: A Data-Centric AI based 3D Object Detection Service Production System [Code]
- Learning From How Humans Correct [Paper]
- The Re-Label Method For Data-Centric Machine Learning [Paper]
- Automatic Label Error Correction [Paper] [Code]
Inference Data Development
<img width="800" src="./imgs/inference-data-development.png" alt="inference-data-development" />
In-distribution Evaluation
- FOCUS: Flexible optimizable counterfactual explanations for tree ensembles, AAAI 2022 [Paper] [Code]
- Sliceline: Fast, linear-algebra-based slice finding for ml model debugging, SIGMOD 2021 [Paper] [Code]
- Counterfactual explanations for oblique decision trees: Exact, efficient algorithms, AAAI 2021 [Paper]
- A Step Towards Global Counterfactual Explanations: Approximating the Feature Space Through Hierarchical Division and Graph Search, AAIML 2021 [Paper]
- An exact counterfactual-example-based approach to tree-ensemble models interpretability, arXiv 2021 [Paper] [Code]
- No subclass left behind: Fine-grained robustness in coarse-grained classification problems, NeurIPS 2020 [Paper] [Code]
- FACE: feasible and actionable counterfactual explanations, AIES 2020 [Paper] [Code]
- DACE: Distribution-Aware Counterfactual Explanation by Mixed-Integer Linear Optimization, IJCAI 2020 [Paper]
- Multi-objective counterfactual explanations, arXiv 2020 [Paper] [Code]
- Certifai: Counterfactual explanations for robustness, transparency, interpretability, and fairness of artificial intelligence models, AIES 2020 [Paper] [Code]
- Propublica's compas data revisited, arXiv 2019 [Paper]
- Slice finder: Automated data slicing for model validation, ICDE 2019 [Paper] [Code]
- Multiaccuracy: Black-box post-processing for fairness in classification, AIES 2019 [Paper] [Code]
- Model agnostic contrastive explanations for structured data, arXiv 2019 [Paper]
- Counterfactual explanations without opening the black box: Automated decisions and the GDPR, Harvard Journal of Law & Technology 2018 [Paper]
- Comparison-based inverse classification for interpretability in machine learning, IPMU 2018 [Paper]
- Quantitative program slicing: Separating statements by relevance, ICSE 2013 [Paper]
- Stratal slicing, Part II: Real 3-D seismic data, Geophysics 1998 [Paper]
Out-of-distribution Evaluation
- A brief review of domain adaptation, Transactions on Computational Science and Computational Intelligence 2021 [Paper]
- Domain adaptation for medical image analysis: a survey, IEEE Transactions on Biomedical Engineering 2021 [Paper]
- Retiring adult: New datasets for fair machine learning, NeurIPS 2021 [Paper] [Code]
- Wilds: A benchmark of in-the-wild distribution shifts, ICML 2021 [Paper] [Code]
- Do image classifiers generalize across time?, ICCV 2021 [Paper]
- Using videos to evaluate image model robustness, arXiv 2019 [Paper]
- Regularized learning for domain adaptation under label shifts, ICLR 2019 [Paper] [Code]
- Benchmarking neural network robustness to common corruptions and perturbations, ICLR 2019 [Paper] [Code]
- Towards deep learning models resistant to adversarial attacks, ICLR 2018 [Paper] [Code]
- Robust physical-world attacks on deep learning visual classification, CVPR 2018 [Paper]
- Detecting and correcting for label shift with black box predictors, ICML 2018 [Paper]
- Poison frogs! targeted clean-label poisoning attacks on neural networks, NeurIPS 2018 [Paper] [Code]
- Practical black-box attacks against machine learning, CCS 2017 [Paper]
- Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models, AISec 2017 [Paper] [Code]
- Deepfool: a simple and accurate method to fool deep neural networks, CVPR 2016 [Paper] [Code]
- Evasion attacks against machine learning at test time, ECML PKDD 2013 [Paper] [Code]
- Adapting visual category models to new domains, ECCV 2010 [Paper]
- Covariate shift by kernel mean matching, MIT Press 2009 [Paper]
- Covariate shift adaptation by importance weighted cross validation, JMLR 2007 [Paper]
Prompt Engineering
- SPeC: A Soft Prompt-Based Calibration on Mitigating Performance Variability in Clinical Notes Summarization, arXiv 2023 [Paper]
- Making Pre-trained Language Models Better Few-shot Learners, arXiv 2021 [Paper] [Code]
- Bartscore: Evaluating generated text as text generation, NeurIPS 2021 [Paper] [Code]
- BERTese: Learning to Speak to BERT, arXiv 2021 [Paper]
- Few-shot text generation with pattern-exploiting training, arXiv 2020 [Paper]
- Exploiting cloze questions for few shot text classification and natural language inference, arXiv 2020 [Paper] [Code]
- It's not just size that matters: Small language models are also few-shot learners, arXiv 2020 [Paper]
- How can we know what language models know?, TACL 2020 [Paper] [Code]
- Universal adversarial triggers for attacking and analyzing NLP, EMNLP 2019 [Paper] [Code]
Data Maintenance
<img width="800" src="./imgs/data-maintenance.png" alt="data-maintenance" />
Data Understanding
- The science of visual data communication: What works, Psychological Science in the Public Interest 2021 [Paper]
- Towards out-of-distribution generalization: A survey, arXiv 2021 [Paper]
- Snowy: Recommending utterances for conversational visual analysis, UIST 2021 [Paper]
- A distributional framework for data valuation, ICML 2020 [Paper]
- A comparison of radial and linear charts for visualizing daily patterns, TVCG 2020 [Paper]
- A marketplace for data: An algorithmic solution, EC 2019 [Paper]
- Data shapley: Equitable valuation of data for machine learning, PMLR 2019 [Paper] [Code]
- Deepeye: Towards automatic data visualization, ICDE 2018 [Paper] [Code]
- Voyager: Exploratory analysis via faceted browsing of visualization recommendations, TVCG 2016 [Paper]
- A survey of clustering algorithms for big data: Taxonomy and empirical analysis, TETC 2014 [Paper]
- On the benefits and drawbacks of radial diagrams, Handbook of Human Centric Visualization 2013 [Paper]
- What makes a visualization memorable?, TVCG 2013 [Paper]
- Toward a taxonomy of visuals in science communication, Technical Communication 2011 [Paper]
Data Quality Assurance
- Human-AI Collaboration for Improving the Identification of Cars for Autonomous Driving, CIKM Workshop 2022 [Paper]
- A Human-ML Collaboration Framework for Improving Video Content Reviews, arXiv 2022 [Paper]
- Knowledge graph quality management: a comprehensive survey, TKDE 2022 [Paper]
- A crowdsourcing open platform for literature curation in UniProt, PLoS Biol. 2021 [Paper] [Code]
- Building data curation processes with crowd intelligence, Advanced Information Systems Engineering 2020 [Paper]
- Data Curation with Deep Learning, EDBT 2020 [Paper]
- Automating large-scale data quality verification, VLDB 2018 [Paper]
- Data quality: The role of empiricism, SIGMOD 2017 [Paper]
- Tfx: A tensorflow-based production-scale machine learning platform, KDD 2017 [Paper] [Code]
- Discovering denial constraints, VLDB 2013 [Paper] [Code]
- Methodologies for data quality assessment and improvement, ACM Computing Surveys 2009 [Paper]
- Conditional functional dependencies for data cleaning, ICDE 2007 [Paper]
- Data quality assessment, Communications of the ACM 2002 [Paper]
Data Storage and Retrieval
- Dbmind: A self-driving platform in opengauss, PVLDB 2021 [Paper]
- Online index selection using deep reinforcement learning for a cluster database, ICDEW 2020 [Paper]
- Bridging the semantic gap with SQL query logs in natural language interfaces to databases, ICDE 2019 [Paper]
- An end-to-end learning-based cost estimator, VLDB 2019 [Paper] [Code]
- An adaptive approach for index tuning with learning classifier systems on hybrid storage environments, Hybrid Artificial Intelligent Systems 2018 [Paper]
- Automatic database management system tuning through large-scale machine learning, SIGMOD 2017 [Paper]
- Learning to rewrite queries, CIKM 2016 [Paper]
- DBridge: A program rewrite tool for set-oriented query execution, IEEE ICDE 2011 [Paper]
- Starfish: A Self-tuning System for Big Data Analytics, CIDR 2011 [Paper] [Code]
- DB2 advisor: An optimizer smart enough to recommend its own indexes, ICDE 2000 [Paper]
- An efficient, cost-driven index selection tool for Microsoft SQL server, VLDB 1997 [Paper]
Data Benchmark
Training Data Development Benchmark
- OpenGSL: A Comprehensive Benchmark for Graph Structure Learning, arXiv 2023 [Paper] [Code]
- REIN: A Comprehensive Benchmark Framework for Data Cleaning Methods in ML Pipelines, EDBT 2023 [Paper] [Code]
- Usb: A unified semi-supervised learning benchmark for classification, NeurIPS 2022 [Paper] [Code]
- A feature extraction & selection benchmark for structural health monitoring, Structural Health Monitoring 2022 [Paper]
- Data augmentation for deep graph learning: A survey, KDD 2022 [Paper]
- Blood-based transcriptomic signature panel identification for cancer diagnosis: benchmarking of feature extraction methods, Briefings in Bioinformatics 2022 [Paper] [Code]
- Amlb: an automl benchmark, arXiv 2022 [Paper]
- A benchmark for data imputation methods, Front. Big Data 2021 [Paper] [Code]
- Benchmark and survey of automated machine learning frameworks, JAIR 2021 [Paper]
- Benchmarking differentially private synthetic data generation algorithms, arXiv 2021 [Paper]
- A comprehensive benchmark framework for active learning methods in entity matching, SIGMOD 2020 [Paper]
- Rethinking data augmentation for image super-resolution: A comprehensive analysis and a new strategy, CVPR 2020 [Paper] [Code]
- Comparison of instance selection and construction methods with various classifiers, Applied Sciences 2020 [Paper]
- An empirical survey of data augmentation for time series classification with neural networks, arXiv 2020 [Paper] [Code]
- Toward a quantitative survey of dimension reduction techniques, IEEE Transactions on Visualization and Computer Graphics 2019 [Paper] [Code]
- Cleanml: A benchmark for joint data cleaning and machine learning experiments and analysis, arXiv 2019 [Paper] [Code]
- Comparison of different image data augmentation approaches, Journal of Big Data 2019 [Paper] [Code]
- A benchmark and comparison of active learning for logistic regression, Pattern Recognition 2018 [Paper]
- A publicly available benchmark for biomedical dataset retrieval: the reference standard for the 2016 bioCADDIE dataset retrieval challenge, Database (Oxford). 2017 [Paper] [Data]
- RODI: A benchmark for automatic mapping generation in relational-to-ontology data integration, ESWC 2015 [Paper] [Code]
- TPC-DI: the first industry benchmark for data integration, PVLDB 2014 [Paper]
- Comparison of instance selection algorithms II. Results and comments, ICAISC 2004 [Paper]
Inference Data Development Benchmark
- Beyond the imitation game: Quantifying and extrapolating the capabilities of language models, arXiv 2023 [Paper] [Code]
- Carla: a python library to benchmark algorithmic recourse and counterfactual explanation algorithms, arXiv 2021 [Paper] [Code]
- Benchmarking adversarial robustness on image classification, CVPR 2020 [Paper]
- Searching for a search method: Benchmarking search algorithms for generating nlp adversarial examples, ACL Workshop 2020 [Code]
- Benchmarking neural network robustness to common corruptions and perturbations, ICLR 2019 [Paper] [Code] [Code]
Data Maintenance Benchmark
- Chart-to-text: A large-scale benchmark for chart summarization, ACL 2022 [Paper] [Code]
- Scalability vs. utility: Do we have to sacrifice one for the other in data importance quantification?, CVPR 2021 [Paper] [Code]
- An evaluation-focused framework for visualization recommendation algorithms, IEEE Transactions on Visualization and Computer Graphics 2021 [Paper] [Code]
- Facilitating database tuning with hyper-parameter optimization: a comprehensive experimental evaluation, VLDB 2021 [Paper] [Code]
- Benchmarking Data Curation Systems, IEEE Data Eng. Bull. 2016 [Paper]
- Methodologies for data quality assessment and improvement, ACM Computing Surveys 2009 [Paper]
- Benchmark development for the evaluation of visualization for data mining, Information visualization in data mining and knowledge discovery 2001 [Paper]
Unified Benchmark
- Dataperf: Benchmarks for data-centric AI development, arXiv 2022 [Paper]