Home

Awesome

Cogdata

Install

pip install cogdata
sudo `which install_unrarlib.sh`

Streaming in Cogdata v1.0

Readme

Directory Structure

.
├── cogdata_task_task1
│   ├── cogdata_config.json (indicating a task path)
│   ├── merged.bin
│   ├── dataset1
│   │   ├── dataset1.bin
│   │   └── meta_info.json
│   └── dataset2
│       ├── dataset2.bin
│       └── meta_info.json
├── dataset1
│   ├── cogdata_info.json (indicating a dataset path)
│   ├── dataset1.json
│   └── dataset1.rar
└── dataset2
    ├── cogdata_info.json
    ├── dataset2.json
    └── dataset2.zip

Pipeline

The motivation of this project is to provide lightweight APIs for large-scale NN-based data-processing, e.g. ImageTokenization. The abstraction has 3 parts:

Commands

cogdata create_dataset  [-h] [--description DESCRIPTION] --data_files DATA_FILES [DATA_FILES ...] --data_format DATA_FORMAT [--text_files TEXT_FILES [TEXT_FILES ...]] [--text_format TEXT_FORMAT] name

Alias: cogdata data .... data_format is chosen from class names in cogdata.datasets, e.g. StreamingRarDataset. Texts related options are optional for text-image datasets.

cogdata create_task [-h] [--description DESCRIPTION] --task_type TASK_TYPE --saver_type SAVER_TYPE [--length_per_sample LENGTH_PER_SAMPLE] [--img_sizes IMG_SIZES [IMG_SIZES ...]] [--txt_len TXT_LEN]
                           [--dtype {int32,int64,float32,uint8,bool}] --model_path MODEL_PATH
                           task_id

Alias: cogdata task .... task_type and saver_type is chosen from class names in cogdata, e.g. ImageTextTokenizationTask or BinarySaver.

cogdata process [-h] --task_id TASK_ID [--nproc NPROC] [--dataloader_num_workers DATALOADER_NUM_WORKERS]
                       [--batch_size BATCH_SIZE] [--ratio RATIO]
                       [datasets [datasets ...]]

The i-th proc will be binded to the i-th GPU.

cogdata merge [-h] --task_id TASK_ID

Merge all the processed data.

cogdata list [-h] [--task_id TASK_ID]

List all the current datasets in this folder.

cogdata clean [-h] [--task_id TASK_ID]

Clean the unfinished states of the task.

Customized Tasks

Add --extra_code PATH_TO_CODE after cogdata (e.g., cogdata --extra_code ../examples/convert2tar_task.py [task or process] to execute and register your own task before running the command. See examples/ for details.

TODO List