Home


[Chinese Homepage] | [Docs] | [API] | [DJ-SORA] | [Awesome List]

Data-Juicer: A One-Stop Data Processing System for Large Language Models

<img src="https://img.alicdn.com/imgextra/i3/O1CN017Eq5kf27AlA2NUKef_!!6000000007757-0-tps-1280-720.jpg" width = "640" height = "360" alt="Data-Juicer"/>


Data-Juicer is a one-stop multimodal data processing system to make data higher-quality, juicier, and more digestible for LLMs.

We provide a playground with a managed JupyterLab. Try Data-Juicer straight away in your browser! If you find Data-Juicer useful for your research or development, please kindly cite our work.

Alibaba Cloud's Platform for AI (PAI) has cited our work and integrated Data-Juicer into its data processing products. PAI is an AI-native large model and AIGC engineering platform that provides dataset management, computing power management, model toolchain, model development, model training, model deployment, and AI asset management. For documentation on data processing, please refer to: PAI-Data Processing for Large Models.

Data-Juicer is being actively updated and maintained. We will periodically enhance it and add more features, data recipes, and datasets. We welcome you to join us (via issues, PRs, the Slack channel, the DingDing group, ...) in promoting data-model co-development along with research and applications of (multimodal) LLMs!


News

<details> <summary> History News: </summary> </details> <div id="table" align="center"></div>

Table of Contents

Features

Overview

Documentation Index <a name="documents"/>

Demos

Prerequisites

Installation

From Source

cd <path_to_data_juicer>
pip install -v -e .  # install minimal dependencies, which support the basic functions
pip install -v -e .[tools] # install a subset of tools dependencies

The dependency options are listed below:

| Tag | Description |
|-----|-------------|
| . or .[mini] | Install minimal dependencies for basic Data-Juicer. |
| .[all] | Install all dependencies except sandbox. |
| .[sci] | Install all dependencies for all OPs. |
| .[dist] | Install dependencies for distributed data processing. (Experimental) |
| .[dev] | Install dependencies for developing the package as contributors. |
| .[tools] | Install dependencies for dedicated tools, such as quality classifiers. |
| .[sandbox] | Install all dependencies for sandbox. |

As the number of OPs grows, the combined dependencies of all OPs become very heavy. Instead of using the command pip install -v -e .[sci] to install all dependencies, we provide two lighter alternatives:

Using pip

pip install py-data-juicer

Using Docker

Installation check

import data_juicer as dj
print(dj.__version__)

For Video-related Operators

Before using video-related operators, FFmpeg should be installed and accessible via the $PATH environment variable.

You can install FFmpeg using a package manager (e.g. sudo apt install ffmpeg on Debian/Ubuntu, brew install ffmpeg on macOS) or download it from the official FFmpeg website.

Check if your environment path is set correctly by running the ffmpeg command from the terminal.
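The same check can be scripted. The following is a minimal sketch that uses only the Python standard library; the helper names `find_ffmpeg` and `ffmpeg_version_line` are illustrative and are not part of Data-Juicer.

```python
import shutil
import subprocess


def find_ffmpeg(executable="ffmpeg"):
    """Return the absolute path of the executable if it is on PATH, else None."""
    return shutil.which(executable)


def ffmpeg_version_line(path):
    """Run `<path> -version` and return the first line of its output."""
    result = subprocess.run(
        [path, "-version"], capture_output=True, text=True, check=True
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else ""


if __name__ == "__main__":
    path = find_ffmpeg()
    if path is None:
        print("ffmpeg not found on PATH; install it before using video operators")
    else:
        print(f"ffmpeg found at {path}")
        print(ffmpeg_version_line(path))
```

If the script reports that ffmpeg is missing even though it is installed, the directory containing the binary is probably not listed in $PATH for the current shell.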

<p align="right"><a href="#table">🔼 back to index</a></p>

Quick Start

Data Processing

# only for installation from source
python tools/process_data.py --config configs/demo/process.yaml

# use command line tool
dj-process --config configs/demo/process.yaml
# cache home
export DATA_JUICER_CACHE_HOME="/path/to/another/directory"
# cache models
export DATA_JUICER_MODELS_CACHE="/path/to/another/directory/models"
# cache assets
export DATA_JUICER_ASSETS_CACHE="/path/to/another/directory/assets"
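When Data-Juicer is driven from a Python script rather than a shell, the same three variables can be set through `os.environ`. The sketch below assumes the variables are read when `data_juicer` is imported, so it should run before that import; the helper name `set_cache_dirs` is hypothetical, and the models/assets sub-directory layout simply mirrors the shell example above.

```python
import os
from pathlib import Path


def set_cache_dirs(cache_home):
    """Point the three cache variables at `cache_home` and its sub-directories.

    Assumption: Data-Juicer reads these variables at import time, so call
    this function before `import data_juicer`.
    """
    home = Path(cache_home).expanduser()
    values = {
        "DATA_JUICER_CACHE_HOME": str(home),
        "DATA_JUICER_MODELS_CACHE": str(home / "models"),
        "DATA_JUICER_ASSETS_CACHE": str(home / "assets"),
    }
    os.environ.update(values)
    return values


if __name__ == "__main__":
    for name, value in set_cache_dirs("/path/to/another/directory").items():
        print(f"{name}={value}")
```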

Flexible Programming Interface

We provide various simple interfaces for users to choose from as follows.

#... init op & dataset ...

# Chain call style, support single operator or operator list
dataset = dataset.process(op)
dataset = dataset.process([op1, op2])
# Functional programming style for quick integration or script prototype iteration
dataset = op(dataset)
dataset = op.run(dataset)
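To make the two calling styles concrete without depending on any real operator, here is a self-contained toy sketch. `ToyDataset` and `MinLengthFilter` are hypothetical stand-ins written for this example; they are not Data-Juicer classes and only mimic the call shapes shown above.

```python
class ToyDataset:
    """Hypothetical stand-in for a dataset; holds a list of dict samples."""

    def __init__(self, samples):
        self.samples = list(samples)

    def process(self, ops):
        """Chain-call style: accept a single operator or a list of operators."""
        if not isinstance(ops, (list, tuple)):
            ops = [ops]
        dataset = self
        for op in ops:
            dataset = op.run(dataset)
        return dataset


class MinLengthFilter:
    """Hypothetical operator that keeps samples whose text is long enough."""

    def __init__(self, min_len):
        self.min_len = min_len

    def run(self, dataset):
        kept = [s for s in dataset.samples if len(s["text"]) >= self.min_len]
        return ToyDataset(kept)

    # Functional style: calling the operator is the same as calling run().
    __call__ = run


if __name__ == "__main__":
    dataset = ToyDataset([{"text": "hi"}, {"text": "hello world"}])
    op = MinLengthFilter(min_len=5)
    print(len(dataset.process(op).samples))
    print(len(op(dataset).samples))
```

Both styles return a new dataset object, which is what allows the result to be reassigned to `dataset` and passed on to the next step.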

Distributed Data Processing

Data-Juicer now supports multi-machine distributed data processing based on Ray. The corresponding demos can be run using the following commands:

# Run text data processing
python tools/process_data.py --config ./demos/process_on_ray/configs/demo.yaml
# Run video data processing
python tools/process_data.py --config ./demos/process_video_on_ray/configs/demo.yaml

Users can also opt not to use Ray and instead split the dataset and process the parts on a cluster with Slurm. In this case, please use the default Data-Juicer without Ray. Aliyun PAI-DLC supports the Ray and Slurm frameworks, among others, so users can create Ray jobs and Slurm jobs directly on the DLC cluster.
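For the Slurm route, the dataset has to be split before the jobs are submitted. A minimal sketch of one way to do that for a JSONL file follows; the function name `split_jsonl`, the round-robin strategy, and the shard file naming are all assumptions made for this example, not something Data-Juicer prescribes.

```python
from pathlib import Path


def split_jsonl(src, out_dir, num_shards):
    """Distribute the non-empty lines of `src` round-robin over `num_shards` files.

    Returns the list of shard paths, in shard order.
    """
    if num_shards < 1:
        raise ValueError("num_shards must be >= 1")
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = [out / f"shard_{i:04d}.jsonl" for i in range(num_shards)]
    handles = [p.open("w", encoding="utf-8") for p in paths]
    try:
        written = 0
        with open(src, encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue  # skip blank lines
                if not line.endswith("\n"):
                    line += "\n"  # the last line may lack a newline
                handles[written % num_shards].write(line)
                written += 1
    finally:
        for h in handles:
            h.close()
    return [str(p) for p in paths]
```

Each Slurm task could then run the usual processing command with a config whose dataset path points at its own shard, for example by selecting the shard from the `SLURM_ARRAY_TASK_ID` environment variable in a job array.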

Data Analysis

# only for installation from source
python tools/analyze_data.py --config configs/demo/analyzer.yaml

# use command line tool
dj-analyze --config configs/demo/analyzer.yaml

# you can also use auto mode to avoid writing a recipe. It will analyze a small
# part (e.g. 1000 samples, specified by argument `auto_num`) of your dataset 
# with all Filters that produce stats.
dj-analyze --auto --dataset_path xx.jsonl [--auto_num 1000]
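To give a feel for what analysing a small part of a dataset means, here is a toy sketch that reads at most `auto_num` samples from a JSONL file and summarises one statistic, the text length. It is not the analyzer's implementation: the function name `sample_text_lengths`, the `text` field name, and the choice to take the first samples rather than a random subset are assumptions for illustration only.

```python
import itertools
import json
import statistics


def sample_text_lengths(jsonl_path, auto_num=1000, text_key="text"):
    """Read at most `auto_num` lines and summarise the length of `text_key`."""
    lengths = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line in itertools.islice(f, auto_num):
            if line.strip():
                sample = json.loads(line)
                lengths.append(len(sample.get(text_key, "")))
    if not lengths:
        return {"count": 0}
    return {
        "count": len(lengths),
        "mean": statistics.mean(lengths),
        "min": min(lengths),
        "max": max(lengths),
    }
```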

Data Visualization

streamlit run app.py

Build Up Config Files

python xxx.py --config configs/demo/process.yaml --language_id_score_filter.lang=en
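The command above overrides one value from the config file with a dotted command-line key. The toy sketch below illustrates the general idea of applying such an override to a nested dictionary; it is not Data-Juicer's own parser, the helper names `parse_cli_overrides` and `apply_override` are hypothetical, and the sample config is made up for the example.

```python
def parse_cli_overrides(args):
    """Turn ['--a.b=1', '--c=x'] into [('a.b', '1'), ('c', 'x')]."""
    pairs = []
    for arg in args:
        if arg.startswith("--") and "=" in arg:
            key, value = arg[2:].split("=", 1)
            pairs.append((key, value))
    return pairs


def apply_override(config, dotted_key, value):
    """Set the nested entry named by `dotted_key`, creating dicts as needed."""
    keys = dotted_key.split(".")
    node = config
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value
    return config


if __name__ == "__main__":
    # Made-up config content, standing in for values loaded from a YAML file.
    config = {"language_id_score_filter": {"lang": "zh", "min_score": 0.8}}
    for key, value in parse_cli_overrides(["--language_id_score_filter.lang=en"]):
        apply_override(config, key, value)
    print(config)
```

Values given on the command line take precedence over the file, while keys that are not mentioned keep the value from the config file.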

Sandbox

The data sandbox laboratory (DJ-Sandbox) provides users with the best practices for continuously producing data recipes. It features low overhead, portability, and guidance.

By default, the sandbox is started with the following command. For more details, please refer to the sandbox documentation.

python tools/sandbox_starter.py --config configs/demo/sandbox/sandbox.yaml

Preprocess Raw Data (Optional)

For Docker Users

# run the data processing directly
#   --rm                                    remove the container after the processing
#   --name dj                               name of the container
#   -v <host_data_path>:<image_data_path>   mount the data or config directory into the container
#   -v ~/.cache/:/root/.cache/              mount the cache directory to reuse caches and models (recommended)
#   datajuicer/data-juicer:<version_tag>    image to run
#   dj-process ...                          same data processing command as outside Docker
docker run --rm \
  --privileged \
  --shm-size 256g \
  --network host \
  --gpus all \
  --name dj \
  -v <host_data_path>:<image_data_path> \
  -v ~/.cache/:/root/.cache/ \
  datajuicer/data-juicer:<version_tag> \
  dj-process --config /path/to/config.yaml

# start the container and run it in the background (-d)
docker run -dit \
  --privileged \
  --shm-size 256g \
  --network host \
  --gpus all \
  --rm \
  --name dj \
  -v <host_data_path>:<image_data_path> \
  -v ~/.cache/:/root/.cache/ \
  datajuicer/data-juicer:latest /bin/bash

# enter into this container and then you can use data-juicer in editable mode
docker exec -it <container_id> bash
<p align="right"><a href="#table">🔼 back to index</a></p>

Data Recipes

License

Data-Juicer is released under Apache License 2.0.

Contributing

We work in a rapidly developing field and greatly welcome contributions of new features, bug fixes, and better documentation. Please refer to the How-to Guide for Developers.

If you have any questions, please join our discussion groups.

Acknowledgement

Data-Juicer is used across various LLM products and research initiatives, including industrial LLMs from Alibaba Cloud's Tongyi, such as Dianjin for financial analysis and Zhiwen for reading assistance, as well as Alibaba Cloud's Platform for AI (PAI). We look forward to hearing more of your experiences, suggestions, and ideas for collaboration!

Data-Juicer thanks and refers to several community projects, including Huggingface-Datasets, Bloom, RedPajama, Pile, Alpaca-Cot, Megatron-LM, DeepSpeed, Arrow, Ray, Beam, LM-Harness, HELM, and more.

References

If you find our work useful for your research or development, please kindly cite the following paper.

@inproceedings{chen2024datajuicer,
  title={Data-Juicer: A One-Stop Data Processing System for Large Language Models},
  author={Daoyuan Chen and Yilun Huang and Zhijian Ma and Hesen Chen and Xuchen Pan and Ce Ge and Dawei Gao and Yuexiang Xie and Zhaoyang Liu and Jinyang Gao and Yaliang Li and Bolin Ding and Jingren Zhou},
  booktitle={International Conference on Management of Data},
  year={2024}
}
<details> <summary> More related papers from Data-Juicer Team: </summary> </details>

<p align="right"><a href="#table">🔼 back to index</a></p>