Awesome-Data-Centric-AI

A curated, but incomplete, list of data-centric AI resources. Covering every paper is infeasible, so we selectively choose papers that present a range of distinct ideas. We welcome contributions to further enrich and refine this list.

:loudspeaker: News: Please check out our open-sourced Large Time Series Model (LTSM)!

To contribute to this list, please send a pull request or contact daochen.zha@rice.edu.

Want to discuss with others who are also interested in data-centric AI? There are three options:

<img width="250" src="./imgs/group.jpeg" alt="group" />

What is Data-centric AI?

Data-centric AI is an emerging field that focuses on engineering data, improving both its quality and its quantity, to build better AI systems.

Data-centric AI vs. Model-centric AI

<img width="500" src="./imgs/data-centric.png" alt="data-centric" />

In the conventional model-centric AI lifecycle, researchers and developers primarily focus on identifying more effective models to improve AI performance while keeping the data largely unchanged. However, this model-centric paradigm overlooks the potential quality issues and undesirable flaws of data, such as missing values, incorrect labels, and anomalies. Complementing the existing efforts in model advancement, data-centric AI emphasizes the systematic engineering of data to build AI systems, shifting our focus from model to data.

It is important to note that "data-centric" differs fundamentally from "data-driven", as the latter only emphasizes the use of data to guide AI development, which typically still centers on developing models rather than engineering data.
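To make the data flaws named above concrete (missing values, incorrect labels, and anomalies), here is a minimal sketch of a data audit in plain Python. It is only an illustration, not a method from the survey: the function name `audit_rows`, the field names `weight` and `label`, the allowed label set, and the 3.5 cutoff on the modified z-score are all assumptions chosen for the example. Note that the label check only catches labels outside the allowed set; a valid but wrong label would need a different technique.

```python
from statistics import median


def audit_rows(rows, value_key="weight", label_key="label",
               valid_labels=("cat", "dog"), threshold=3.5):
    """Return row indices with missing values, unexpected labels, or anomalies.

    All field names and defaults are hypothetical, chosen for illustration.
    """
    # Rows where any field is None count as having a missing value.
    missing = [i for i, r in enumerate(rows)
               if any(v is None for v in r.values())]

    # Labels outside the allowed set are flagged as (possibly) incorrect.
    bad_labels = [i for i, r in enumerate(rows)
                  if r.get(label_key) is not None
                  and r[label_key] not in valid_labels]

    # Anomalies: modified z-score based on the median absolute deviation (MAD).
    observed = [(i, r[value_key]) for i, r in enumerate(rows)
                if r.get(value_key) is not None]
    anomalies = []
    if observed:
        med = median([v for _, v in observed])
        mad = median([abs(v - med) for _, v in observed])
        if mad > 0:  # with zero spread, nothing is flagged
            anomalies = [i for i, v in observed
                         if 0.6745 * abs(v - med) / mad > threshold]

    return {"missing": missing, "bad_labels": bad_labels,
            "anomalies": anomalies}


rows = [
    {"weight": 4.1, "label": "cat"},
    {"weight": 3.9, "label": "cat"},
    {"weight": None, "label": "dog"},   # missing value
    {"weight": 4.3, "label": "dgo"},    # label outside the allowed set
    {"weight": 250.0, "label": "dog"},  # anomalous value
    {"weight": 4.0, "label": "cat"},
]
report = audit_rows(rows)
print(report)
```

In a data-centric workflow, a report like this would feed back into fixing the dataset itself (re-labeling, imputing, or removing rows) rather than into changing the model.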

Why Data-centric AI?

<img width="800" src="./imgs/motivation.png" alt="motivation" />

Two motivating examples of GPT models highlight the central role of data in AI.

Another example is Segment Anything, a foundation model for computer vision. The core of training Segment Anything lies in its large amount of annotated data: more than 1 billion masks, 400 times more than in existing segmentation datasets.

What is the Data-centric AI Framework?

<img width="800" src="./imgs/framework.png" alt="framework" />

The data-centric AI framework consists of three goals: training data development, inference data development, and data maintenance. Each goal is associated with several sub-goals.

Cite this Work

Zha, Daochen, et al. "Data-centric Artificial Intelligence: A Survey." arXiv preprint arXiv:2303.10158, 2023.

@article{zha2023data-centric-survey,
  title={Data-centric Artificial Intelligence: A Survey},
  author={Zha, Daochen and Bhat, Zaid Pervaiz and Lai, Kwei-Herng and Yang, Fan and Jiang, Zhimeng and Zhong, Shaochen and Hu, Xia},
  journal={arXiv preprint arXiv:2303.10158},
  year={2023}
}

Zha, Daochen, et al. "Data-centric AI: Perspectives and Challenges." SDM, 2023.

@inproceedings{zha2023data-centric-perspectives,
  title={Data-centric AI: Perspectives and Challenges},
  author={Zha, Daochen and Bhat, Zaid Pervaiz and Lai, Kwei-Herng and Yang, Fan and Hu, Xia},
  booktitle={SDM},
  year={2023}
}

Table of Contents

Training Data Development

<img width="800" src="./imgs/training-data-development.png" alt="training-data-development" />

Data Collection

Data Labeling

Data Preparation

Data Reduction

Data Augmentation

Pipeline Search

Inference Data Development

<img width="800" src="./imgs/inference-data-development.png" alt="inference-data-development" />

In-distribution Evaluation

Out-of-distribution Evaluation

Prompt Engineering

Data Maintenance

<img width="800" src="./imgs/data-maintenance.png" alt="data-maintenance" />

Data Understanding

Data Quality Assurance

Data Storage and Retrieval

Data Benchmark

Training Data Development Benchmark

Inference Data Development Benchmark

Data Maintenance Benchmark

Unified Benchmark