UniMod1K: Towards a More Universal Large-Scale Dataset and Benchmark for Multi-Modal Learning

The dataset and code for the paper "UniMod1K: Towards a More Universal Large-Scale Dataset and Benchmark for Multi-Modal Learning".

UniMod1K covers three data modalities: vision, depth, and language. For the vision and depth modalities, the dataset contains 1,050 RGB-D sequences. For the language modality, UniMod1K includes 1,050 sentences, one describing the target object in each video. A link to the paper will be released soon. Here are some samples from the dataset:

<center><img width="75%" alt="" src="./data_samples.jpg"/></center>

Download

The RGB-D images of the UniMod1K dataset are available on Baidu Cloud Disk and Google Drive. In addition, the text files of UniMod1K can be downloaded here.

Dataset

RGB-D sequences and the files of bounding box labels:

--UniMod1K
    |--Adapter
        |--adapter1
            |--groundtruth_rect.txt
            |--color
                |--00000001.jpg
                |--00000002.jpg
                ...
            |--depth
                |--00000001.png
                |--00000002.png
                ...
        |--adapter2
        ...
    |--Animal
        |--alpaca1
            |--groundtruth_rect.txt
            |--color
                |--00000001.jpg
                |--00000002.jpg
                ...
            |--depth
                |--00000001.png
                |--00000002.png
                ...
        |--bear1
        ...
    ...
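
As a reference, here is a minimal Python sketch (not the official loader) that walks the layout above and counts the frames in each sequence. The root path is a placeholder; only the standard library is used.

    # Minimal sketch, assuming the dataset has been extracted to the placeholder root below.
    from pathlib import Path

    root = Path("/path/to/UniMod1K")

    for category in sorted(p for p in root.iterdir() if p.is_dir()):      # e.g. Adapter, Animal
        for seq in sorted(p for p in category.iterdir() if p.is_dir()):   # e.g. adapter1, alpaca1
            n_rgb = len(list((seq / "color").glob("*.jpg")))
            n_depth = len(list((seq / "depth").glob("*.png")))
            print(f"{category.name}/{seq.name}: {n_rgb} RGB frames, {n_depth} depth frames")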

The natural language files:

--UniMod1K
    |--Adapter
        |--adapter1
            |--nlp.txt
        |--adapter2
            |--nlp.txt
        ...
    |--Animal
        |--alpaca1
            |--nlp.txt
        |--bear1
            |--nlp.txt
        ...
    ...
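
A small sketch of how the language annotation of one sequence could be read. The path is a placeholder, and the single-sentence-per-file assumption follows the description above (one sentence per video).

    # Minimal sketch: read the sentence describing the target object of one sequence.
    from pathlib import Path

    nlp_file = Path("/path/to/nlps/Adapter/adapter1/nlp.txt")  # placeholder path
    description = nlp_file.read_text(encoding="utf-8").strip()
    print(description)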

Dataset Format

The RGB images are saved as 24-bit JPEG files (8 bits per channel), whereas the depth maps are saved as 16-bit PNG files. For the labels, each bounding box is given as [x1, y1, w, h], where (x1, y1) is the top-left corner of the target object and w and h are the width and height of the bounding box.
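
The following sketch illustrates these formats with OpenCV: 8-bit-per-channel RGB, 16-bit depth, and a [x1, y1, w, h] box converted to corner coordinates. The sequence path is a placeholder, and the delimiter handling for groundtruth_rect.txt (comma or whitespace) is an assumption.

    # Minimal sketch of the formats described above (not the official loader).
    import cv2
    import numpy as np

    seq = "/path/to/UniMod1K/Adapter/adapter1"  # placeholder sequence path

    rgb = cv2.imread(f"{seq}/color/00000001.jpg", cv2.IMREAD_COLOR)        # uint8, H x W x 3
    depth = cv2.imread(f"{seq}/depth/00000001.png", cv2.IMREAD_UNCHANGED)  # uint16, H x W
    assert rgb.dtype == np.uint8 and depth.dtype == np.uint16

    # First annotation line corresponds to the first frame: x1, y1, w, h
    # (comma- or space-separated is assumed here).
    with open(f"{seq}/groundtruth_rect.txt") as f:
        x1, y1, w, h = map(float, f.readline().replace(",", " ").split())
    x2, y2 = x1 + w, y1 + h  # bottom-right corner of the target box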

Data Visualisation

After downloading the data and the corresponding label files, you can visualise the samples by running:

cd /path/to/UniMod1K
export PYTHONPATH=/path/to/SPT:$PYTHONPATH
python ./read_dataset.py --data_dir '/path/to/UniMod1K/' --nlp_dir '/path/to/nlps/' --seq_id 0

Baseline Code and Pre-trained Models

For usage of the baselines with UniMod1K, please refer to README_SPT. The training and testing code for SPT is provided, along with the trained model.

Monocular Depth Estimation

The subset for monocular depth estimation can be downloaded from Baidu Cloud or Google Drive.

Contact

If you have any questions, please feel free to contact us (xuefeng_zhu95@163.com).