ORBIT: A Real-World Few-Shot Dataset for Teachable Object Recognition

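This repository contains code for the following two papers:

  1. ORBIT: A Real-World Few-Shot Dataset for Teachable Object Recognition (ICCV 2021)
  2. Memory Efficient Meta-Learning with Large Images (NeurIPS 2021)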

The code is authored by Daniela Massiceti and built using PyTorch 1.13.1, TorchVision 0.14.1, and Python 3.7.

<table> <tr> <td><img src="docs/facemask.PNG" alt="clean frame of facemask" width = 140px></td> <td><img src="docs/hairbrush.PNG" alt="clean frame of hairbrush" width = 140px></td> <td><img src="docs/keys.PNG" alt="clean frame of keys" width = 140px></td> <td><img src="docs/watering can.PNG" alt="clean frame of a watering can" width = 140px></td> </tr> <tr> <td><img src="docs/facemask_clutter.PNG" alt="clutter frame of facemask" width = 140px></td> <td><img src="docs/hairbrush_clutter.PNG" alt="clutter frame of hairbrush" width = 140px></td> <td><img src="docs/keys_clutter.PNG" alt="clutter frame of keys" width = 140px></td> <td><img src="docs/wateringcan_clutter.PNG" alt="clutter frame of watering can" width = 140px></td> </tr> <caption style="caption-side:bottom"> <i>Frames from clean (top row) and clutter (bottom row) videos from the ORBIT benchmark dataset</i></caption> </table>

Installation

  1. Clone or download this repository
  2. Install dependencies
    cd ORBIT-Dataset
    
    # if using Anaconda
    conda env create -f environment.yml
    conda activate orbit-dataset
    
    # if using pip
    pip install -r requirements.txt
    

Download ORBIT Benchmark Dataset

The following script downloads the benchmark dataset into a folder called orbit_benchmark_<FRAME_SIZE> at the path folder/to/save/dataset. Use FRAME_SIZE=224 to download the dataset already re-sized to 224x224 frames. For other values of FRAME_SIZE, the script will dynamically re-size the frames accordingly:

bash scripts/download_benchmark_dataset.sh folder/to/save/dataset FRAME_SIZE

Alternatively, the 224x224 train/validation/test ZIPs can be manually downloaded here. Each should be unzipped as a separate train/validation/test folder into folder/to/save/dataset/orbit_benchmark_224. The full-size (1080x1080) ZIPs can also be manually downloaded and scripts/resize_videos.py can be used to re-size the frames if needed.

The following script summarizes the dataset statistics:

python3 scripts/summarize_dataset.py --data_path folder/to/save/dataset/orbit_benchmark_<FRAME_SIZE>
# to aggregate stats across train, validation, and test collectors, add --combine_modes

These should match the values in Table 2 (combine_modes=True) and Table A.2 (combine_modes=False) in the dataset paper.
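If you want a quick cross-check of the collector, object and video counts before running the full summary script, the sketch below tallies them from a list of video folders. It assumes (our assumption, not stated above) that each split folder is laid out as <collector>/<object>/<clean|clutter>/<video>/. find_videos and summarize_videos are hypothetical helpers, not part of scripts/summarize_dataset.py.

```python
from collections import Counter
from pathlib import Path, PurePath


def find_videos(split_dir):
    """List video folders in one split (e.g. .../orbit_benchmark_224/train).

    Assumes the hypothetical layout <collector>/<object>/<clean|clutter>/<video>/.
    """
    split_dir = Path(split_dir)
    return [p.relative_to(split_dir) for p in split_dir.glob("*/*/*/*") if p.is_dir()]


def summarize_videos(video_paths):
    """Count collectors, objects and clean/clutter videos from relative video paths."""
    collectors, objects, videos = set(), set(), Counter()
    for path in video_paths:
        collector, obj, video_type = PurePath(path).parts[-4:-1]
        collectors.add(collector)
        objects.add((collector, obj))  # object names can repeat across collectors
        videos[video_type] += 1
    return {
        "collectors": len(collectors),
        "objects": len(objects),
        "clean_videos": videos["clean"],
        "clutter_videos": videos["clutter"],
    }
```

For example, summarize_videos(find_videos("folder/to/save/dataset/orbit_benchmark_224/train")) would summarize the train split under that assumed layout.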

Training & testing models on ORBIT

The following describes the protocols for training and testing models on the ORBIT Benchmark.

Training protocol

The training protocol is flexible and can leverage any training regime (e.g. episodic learning, self-supervised learning). There are no restrictions on the choice of model/feature extractor, or how users/objects/videos/frames are sampled.

What data can be used:

What data cannot be used:

Testing protocol (updated Dec 2022)

We have updated the evaluation protocol for the ORBIT benchmark (compared to the original dataset paper) following the ORBIT Few-Shot Object Recognition Challenge 2022:

Personalize rules

For each test user's task, a model must be personalized to all the user's objects using only the support (clean) videos and associated labels for those objects. Note, any method of personalization can be used (e.g. fine-tuning, parameter generation, metric learning).
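To make the split between personalization data and evaluation data concrete, here is a minimal sketch of how one test user's videos could be arranged into a task: clean videos form the support set used to personalize, clutter videos form the query set, and labels index only that user's objects. UserTask and build_user_task are hypothetical names for illustration, not the repository's data structures.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Tuple


@dataclass
class UserTask:
    """Hypothetical container for one test user's task (not the repo's data structure)."""
    object_names: List[str]
    support: List[Tuple[str, int]] = field(default_factory=list)  # clean videos: used to personalize
    query: List[Tuple[str, int]] = field(default_factory=list)    # clutter videos: used only for evaluation


def build_user_task(videos: Iterable[Tuple[str, str, str]]) -> UserTask:
    """Split one user's (object_name, video_type, video_id) records into support and query sets."""
    videos = list(videos)  # iterated twice below
    object_names = sorted({obj for obj, _, _ in videos})
    label_of = {name: i for i, name in enumerate(object_names)}  # labels are local to this user
    task = UserTask(object_names)
    for obj, video_type, video_id in videos:
        if video_type == "clean":
            task.support.append((video_id, label_of[obj]))
        elif video_type == "clutter":
            task.query.append((video_id, label_of[obj]))
        else:
            raise ValueError(f"unknown video type: {video_type}")
    return task
```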

What data can be used to personalize:

What data cannot be used to personalize:

Recognize rules

Once a model has been personalized to a test user's task, it should be evaluated on the task's query set, which should contain all of that user's clutter videos. Predictions should be made for 200 randomly sampled frames per clutter video, ensuring that no sampled frame has object_not_present_issue=True. For each frame, the personalized model should predict which one of the user's objects is present. The frame accuracy metric should be calculated over the 200 randomly sampled frames of each clutter video in the task's query set.
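The per-video frame accuracy described above can be sketched as follows: for each clutter video, it is the fraction of its evaluated frames whose predicted object matches the video's ground-truth object. The function names and the dictionary input format are hypothetical.

```python
from typing import Dict, Hashable, List, Tuple


def frame_accuracy(predictions: List[int], label: int) -> float:
    """Fraction of a clutter video's sampled frames predicted as the video's ground-truth object."""
    if not predictions:
        raise ValueError("no frames to score")
    return sum(p == label for p in predictions) / len(predictions)


def task_frame_accuracies(query: Dict[Hashable, Tuple[List[int], int]]) -> Dict[Hashable, float]:
    """Map each clutter video in a task's query set to its frame accuracy.

    query: video_id -> (per-frame predicted labels, ground-truth label)
    """
    return {vid: frame_accuracy(preds, label) for vid, (preds, label) in query.items()}
```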

Note, before sampling the 200 frames, each video should be filtered to exclude all frames that do not contain the ground-truth object (i.e. object_not_present_issue=True; see the Filtering by annotations section). If, after filtering, a clutter video has fewer than 50 valid frames, it should be excluded from the evaluation. If it has 50-200 valid frames, all of these frames should be included.
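The filtering and sampling rules above can be sketched per clutter video as below. This is an illustration, not the repository's implementation: sample_eval_frames is a hypothetical helper, and treating a missing or NaN object_not_present_issue value as "object present" is our assumption.

```python
import random
from typing import Dict, List, Optional, Sequence

MIN_VALID_FRAMES = 50      # videos with fewer valid frames are excluded
NUM_SAMPLED_FRAMES = 200   # frames evaluated per clutter video


def _object_absent(flag) -> bool:
    # Treat missing (None) or NaN annotations as "object present"; this is our assumption.
    return flag is not None and flag == flag and bool(flag)


def sample_eval_frames(frame_ids: Sequence, object_not_present: Dict,
                       rng: Optional[random.Random] = None) -> Optional[List]:
    """Return the frames of one clutter video to evaluate, or None if the video is excluded."""
    if rng is None:
        rng = random.Random()
    valid = [f for f in frame_ids if not _object_absent(object_not_present.get(f))]
    if len(valid) < MIN_VALID_FRAMES:
        return None                      # too few frames containing the object
    if len(valid) <= NUM_SAMPLED_FRAMES:
        return valid                     # 50-200 valid frames: use all of them
    keep = sorted(rng.sample(range(len(valid)), NUM_SAMPLED_FRAMES))
    return [valid[i] for i in keep]      # random subset, kept in temporal order
```

Passing a seeded random.Random makes the sampled frames reproducible across runs.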

What data can be used to make a frame prediction:

What data cannot be used to make a frame prediction:

Baselines

The following scripts can be used to train and test several baselines on the ORBIT benchmark. We provide support for 224x224 frames and the following feature extractors: efficientnet_b0 (pre-trained on ImageNet-1K), efficientnet_v2_s, vit_s_32, and vit_b_32 (all pre-trained on ImageNet-21K), and vit_b_32_clip (pre-trained on LAION-2B).

All other arguments are described in utils/args.py. Note that the Clutter Video Evaluation (CLU-VE) setting is run by specifying --context_video_type clean --target_video_type clutter. Experiments will be saved in --checkpoint_dir. All other implementation details are described in Section 5 and Appendix F of the dataset paper.

Note, before training/testing, remember to activate the conda environment (conda activate orbit-dataset) or virtual environment. If you are using Windows (or WSL), you may need to set workers=0 in data/queues.py, as multi-threaded data loading is not supported. You will also need to enable long file paths, as some file names in the dataset exceed the system's default path length limit.

CNAPs+LITE. Our implementation of the model-based few-shot learner CNAPs (Requeima et al., NeurIPS 2019) is trained with LITE on a Tesla V100 32GB GPU (see Table 1):

python3 single-step-learner.py --data_path folder/to/save/dataset/orbit_benchmark_224 \
                               --feature_extractor efficientnet_b0 \
                               --classifier versa --adapt_features \
                               --context_video_type clean --target_video_type clutter \
                               --with_lite --num_lite_samples 16 --batch_size 256

Simple CNAPs+LITE. Our implementation of the model-based few-shot learner Simple CNAPs (Bateni et al., CVPR 2020) is trained with LITE on a Tesla V100 32GB GPU (see Table 1):

python3 single-step-learner.py --data_path folder/to/save/dataset/orbit_benchmark_224 \
                               --feature_extractor efficientnet_b0 \
                               --classifier mahalanobis --adapt_features \
                               --context_video_type clean --target_video_type clutter \
                               --with_lite --num_lite_samples 16 --batch_size 256

ProtoNets+LITE. Our implementation of the metric-based few-shot learner ProtoNets (Snell et al., NeurIPS 2017) is trained with LITE on a Tesla V100 32GB GPU (see Table 1):

python3 single-step-learner.py --data_path folder/to/save/dataset/orbit_benchmark_224 \
                               --feature_extractor efficientnet_b0 \
                               --classifier proto --learn_extractor \
                               --context_video_type clean --target_video_type clutter \
                               --with_lite --num_lite_samples 16 --batch_size 256
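ProtoNets classifies a query frame by comparing its embedding to one prototype per object, the mean of that object's support (clean-video) embeddings. The sketch below shows only this classification step on plain Python lists, using squared Euclidean distance, with a cosine option loosely mirroring the "ProtoNets (cosine)" rows of the checkpoint table. compute_prototypes and classify are hypothetical helpers; feature extraction and LITE are omitted.

```python
import math
from typing import Dict, List, Sequence


def compute_prototypes(features: Sequence[Sequence[float]], labels: Sequence[int]) -> Dict[int, List[float]]:
    """Average the support embeddings of each object to get one prototype per class."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for x, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for d, v in enumerate(x):
            acc[d] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}


def classify(x: Sequence[float], prototypes: Dict[int, List[float]], metric: str = "euclidean") -> int:
    """Return the label of the nearest prototype (squared Euclidean) or the most similar one (cosine)."""
    def score(p: List[float]) -> float:
        if metric == "euclidean":
            return -sum((a - b) ** 2 for a, b in zip(x, p))
        if metric == "cosine":
            dot = sum(a * b for a, b in zip(x, p))
            norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in p))
            return dot / norm if norm else 0.0
        raise ValueError(f"unknown metric: {metric}")
    return max(prototypes, key=lambda y: score(prototypes[y]))
```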

FineTuner. Given the recent strong performance of finetuning-based few-shot learners, we also provide a finetuning baseline. Here, we simply freeze a pre-trained feature extractor and, using a task's support set, we finetune either i) a linear head, or ii) a linear head and FiLM layers (Perez et al., 2017) in the feature extractor (see Table 1). In principle, you could also use a meta-trained checkpoint as an initialization through the --model_path argument.

# note: only --mode test is supported here (train_test is not supported)
python3 multi-step-learner.py --data_path folder/to/save/dataset/orbit_benchmark_224 \
                              --feature_extractor efficientnet_b0 \
                              --mode test \
                              --classifier linear \
                              --context_video_type clean --target_video_type clutter \
                              --personalize_num_grad_steps 50 --personalize_learning_rate 0.001 --personalize_optimizer adam \
                              --batch_size 1024
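The FineTuner personalizes by taking a fixed number of gradient steps on a linear head while the feature extractor stays frozen. As a rough, dependency-free sketch of that idea, the code below fits a softmax linear head with plain full-batch gradient descent on cross-entropy; note that this swaps in vanilla gradient descent for the Adam optimizer used in the command above, omits FiLM layers, and uses hypothetical function names.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear_logits(W, b, x) -> List[float]:
    return [sum(w * v for w, v in zip(row, x)) + bc for row, bc in zip(W, b)]


def finetune_linear_head(features: Sequence[Sequence[float]], labels: Sequence[int], num_classes: int,
                         num_steps: int = 50, lr: float = 0.5) -> Tuple[List[List[float]], List[float]]:
    """Fit W, b of a linear head on frozen support features with full-batch gradient descent."""
    dim, n = len(features[0]), len(features)
    W = [[0.0] * dim for _ in range(num_classes)]
    b = [0.0] * num_classes
    for _ in range(num_steps):
        grad_W = [[0.0] * dim for _ in range(num_classes)]
        grad_b = [0.0] * num_classes
        for x, y in zip(features, labels):
            probs = softmax(linear_logits(W, b, x))
            for c in range(num_classes):
                err = probs[c] - (1.0 if c == y else 0.0)  # d(cross-entropy)/d(logit_c)
                grad_b[c] += err / n
                for d in range(dim):
                    grad_W[c][d] += err * x[d] / n
        for c in range(num_classes):  # gradient step; the feature extractor is never updated
            b[c] -= lr * grad_b[c]
            for d in range(dim):
                W[c][d] -= lr * grad_W[c][d]
    return W, b


def predict(W, b, x) -> int:
    logits = linear_logits(W, b, x)
    return max(range(len(logits)), key=logits.__getitem__)
```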

Note, we have removed support for further training the feature extractor on the ORBIT train users using standard supervised learning with the objects' broader cluster labels. Please roll back to this commit if you would like to do this. The object clusters can be found in data/orbit_{train,validation,test}_object_clusters_labels.json and data/object_clusters_benchmark.txt.

MAML. Our implementation of MAML (Finn et al., ICML 2017) is no longer supported. Please roll back to this commit if you need to reproduce the MAML baselines in Table 5 (dataset paper) or Table 1 (LITE paper).

84x84 images. Training/testing on 84x84 images is no longer supported. Please roll back to this commit if you need to reproduce the original baselines in Table 5 (dataset paper).

GPU and CPU memory requirements

The GPU memory requirements can be reduced by:

The CPU memory requirements can be reduced by:

Pre-trained checkpoints

The following checkpoints have been trained on the ORBIT train users using the arguments specified above. The models can be run in test-only mode using the same arguments as above, adding --mode test and providing the path to the checkpoint as --model_path path/to/checkpoint.pth. In principle, testing should require significantly less memory than training, so it should be possible on a single 12-16GB GPU (or on CPU with --gpu -1). The --batch_size flag can be used to further reduce memory requirements.

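<table>
<tr> <th>Model</th> <th>Frame size</th> <th>Feature extractor</th> <th>Trained with LITE</th> <th>Frame accuracy (95% c.i.)</th> <th>Checkpoint (trained with clean/clutter context/target videos)</th> </tr>
<tr> <td rowspan="2">CNAPs</td> <td>224</td> <td>EfficientNet-B0</td> <td>Y</td> <td>67.68 (0.58)</td> <td>orbit_cluve_cnaps_efficientnet_b0_224_lite.pth</td> </tr>
<tr> <td>224</td> <td>ViT-B-32-CLIP</td> <td>Y</td> <td>72.33 (0.54)</td> <td>orbit_cluve_cnaps_vit_b_32_clip_224_lite.pth</td> </tr>
<tr> <td rowspan="2">Simple CNAPs</td> <td>224</td> <td>EfficientNet-B0</td> <td>Y</td> <td>66.83 (0.60)</td> <td>orbit_cluve_simple_cnaps_efficientnet_b0_224_lite.pth</td> </tr>
<tr> <td>224</td> <td>ViT-B-32-CLIP</td> <td>Y</td> <td>68.86 (0.56)</td> <td>orbit_cluve_simple_cnaps_vit_b_32_clip_224_lite.pth</td> </tr>
<tr> <td rowspan="4">ProtoNets</td> <td>224</td> <td>EfficientNet-B0</td> <td>Y</td> <td>67.91 (0.56)</td> <td>orbit_cluve_protonets_efficientnet_b0_224_lite.pth</td> </tr>
<tr> <td>224</td> <td>EfficientNet-V2-S</td> <td>Y</td> <td>72.76 (0.53)</td> <td>orbit_cluve_protonets_efficientnet_v2_s_224_lite.pth</td> </tr>
<tr> <td>224</td> <td>ViT-B-32</td> <td>Y</td> <td>73.53 (0.51)</td> <td>orbit_cluve_protonets_vit_b_32_224_lite.pth</td> </tr>
<tr> <td>224</td> <td>ViT-B-32-CLIP</td> <td>Y</td> <td>73.95 (0.52)</td> <td>orbit_cluve_protonets_vit_b_32_clip_224_lite.pth</td> </tr>
<tr> <td rowspan="4">ProtoNets (cosine)</td> <td>224</td> <td>EfficientNet-B0</td> <td>Y</td> <td>67.48 (0.57)</td> <td>orbit_cluve_protonets_cosine_efficientnet_b0_224_lite.pth</td> </tr>
<tr> <td>224</td> <td>EfficientNet-V2-S</td> <td>Y</td> <td>73.10 (0.54)</td> <td>orbit_cluve_protonets_cosine_efficientnet_v2_s_224_lite.pth</td> </tr>
<tr> <td>224</td> <td>ViT-B-32</td> <td>Y</td> <td>75.38 (0.51)</td> <td>orbit_cluve_protonets_cosine_vit_b_32_224_lite.pth</td> </tr>
<tr> <td>224</td> <td>ViT-B-32-CLIP</td> <td>Y</td> <td>73.54 (0.52)</td> <td>orbit_cluve_protonets_cosine_vit_b_32_clip_224_lite.pth</td> </tr>
<tr> <td rowspan="2">FineTuner</td> <td>224</td> <td>EfficientNet-B0</td> <td>N</td> <td>64.57 (0.56)</td> <td>Used pre-trained extractor</td> </tr>
<tr> <td>224</td> <td>ViT-B-32-CLIP</td> <td>N</td> <td>71.31 (0.55)</td> <td>Used pre-trained extractor</td> </tr>
<tr> <td rowspan="2">FineTuner + FiLM</td> <td>224</td> <td>EfficientNet-B0</td> <td>N</td> <td>66.63 (0.58)</td> <td>Used pre-trained extractor</td> </tr>
<tr> <td>224</td> <td>ViT-B-32-CLIP</td> <td>N</td> <td>71.86 (0.55)</td> <td>Used pre-trained extractor</td> </tr>
</table>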
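The accuracies in the checkpoint table are reported as a mean with a 95% confidence interval. One common way to compute such a pair is sketched below: average the per-video frame accuracies and take 1.96 times the standard error of the mean. Whether the reported numbers were computed exactly this way (e.g. over videos rather than users) is our assumption, and mean_and_ci95 is a hypothetical helper.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_ci95(per_video_accuracies: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, 95% c.i. half-width), both as percentages, using a normal approximation."""
    n = len(per_video_accuracies)
    if n < 2:
        raise ValueError("need at least two videos to estimate a confidence interval")
    mean = statistics.mean(per_video_accuracies)
    sem = statistics.stdev(per_video_accuracies) / math.sqrt(n)  # standard error of the mean
    return 100 * mean, 100 * 1.96 * sem
```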

ORBIT Few-Shot Object Recognition Challenge

The VizWiz workshop is hosting the ORBIT Few-Shot Object Recognition Challenge at CVPR 2024. The Challenge will run from Friday 12 January 2024 9am CT to Friday 3 May 2024 9am CT.

To participate, visit the Challenge evaluation server which is hosted on EvalAI. Here you will find all details about the Challenge, including the competition rules and how to register your team. The winning team will be invited to give an in-person or virtual talk at the VizWiz workshop at CVPR 2024. Further prizes are still being confirmed.

We have provided orbit_challenge_getting_started.ipynb to help you get started. This starter notebook steps you through how to load the ORBIT validation set, run it through a pre-trained model, and save the results, which you can then upload to the evaluation server.

For any questions, please email orbit-challenge@microsoft.com.

Extra annotations

We provide additional annotations for the ORBIT benchmark dataset in data/orbit_extra_annotations.zip. The annotations include per-frame bounding boxes for all clutter videos, and per-frame quality issues for all clean videos. Please read below for further details.

Bounding boxes

We provide per-frame bounding boxes for all clutter videos. Note, there is one bounding box per frame (i.e. the location of the labelled/target object). Other details:

Quality issues (annotated by Enlabeler (Pty) Ltd)

We provide per-frame quality issues for all clean videos. Note, a frame can contain any/multiple of the following 7 issues: object_not_present_issue, framing_issue, viewpoint_issue, blur_issue, occlusion_issue, overexposed_issue, underexposed_issue. The choice of issues was informed by Chiu et al., 2020. Other details:

Loading the annotations

You can use --annotations_to_load to load the bounding box and quality issue annotations. The argument can take any/multiple of the following: object_bounding_box, object_not_present_issue, framing_issue, viewpoint_issue, blur_issue, occlusion_issue, overexposed_issue, underexposed_issue. The specified annotations will be loaded and returned in a dictionary with the task data (note, if a frame does not have one of the specified annotations then nan will appear in its place). At present, the code does not use these annotations for training/testing. To do so, you will need to return them in the unpack_task function in utils/data.py.

Filtering by annotations

If you would like to filter tasks' context or target sets by specific quality annotations (e.g. remove all frames with no object present), you can use --train_filter_context/--train_filter_target to filter train tasks, or --test_filter_context/--test_filter_target to filter validation/test tasks. These arguments accept the same options as above. The filtering is applied to all context/target videos when the data loader is created (see load_all_users in data/dataset.py).
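Conceptually, filtering by quality annotations keeps only the frames that carry none of the requested issues. The sketch below illustrates this on a simple list of per-frame annotation dictionaries; it is not the logic in load_all_users, the input format and filter_frames are hypothetical, and keeping frames whose annotation is missing (nan) is our assumption.

```python
import math
from typing import Dict, Iterable, List, Sequence, Tuple

QUALITY_ISSUES = ("object_not_present_issue", "framing_issue", "viewpoint_issue", "blur_issue",
                  "occlusion_issue", "overexposed_issue", "underexposed_issue")


def is_flagged(value) -> bool:
    """True only for a definite positive annotation; None or NaN (missing) counts as not flagged."""
    if value is None:
        return False
    if isinstance(value, float) and math.isnan(value):
        return False
    return bool(value)


def filter_frames(frame_annotations: Iterable[Tuple[str, Dict]], issues_to_filter: Sequence[str]) -> List[str]:
    """Keep the ids of frames that have none of the requested quality issues."""
    unknown = set(issues_to_filter) - set(QUALITY_ISSUES)
    if unknown:
        raise ValueError(f"unknown quality issues: {sorted(unknown)}")
    return [frame_id for frame_id, ann in frame_annotations
            if not any(is_flagged(ann.get(issue)) for issue in issues_to_filter)]
```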

Download unfiltered ORBIT dataset

Some collectors/objects/videos did not meet the minimum requirements to be included in the ORBIT benchmark dataset. The full unfiltered ORBIT dataset of 4733 videos (frame size: 1080x1080) of 588 objects can be downloaded and saved to folder/to/save/dataset/orbit_unfiltered by running the following script:

bash scripts/download_unfiltered_dataset.sh folder/to/save/dataset

Alternatively, the train/validation/test/other ZIPs can be manually downloaded here. Use scripts/merge_and_split_benchmark_users.py to merge the other folder (see script for usage details).

To summarize and plot the unfiltered dataset, use scripts/summarize_dataset.py (with --no_modes rather than --combine_modes) similar to above.

Citations

For models trained with LITE:

@article{bronskill2021lite,
  title={{Memory Efficient Meta-Learning with Large Images}},
  author={Bronskill, John and Massiceti, Daniela and Patacchiola, Massimiliano and Hofmann, Katja and Nowozin, Sebastian and Turner, Richard E.},
  journal={Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS)},
  year={2021}}

For ORBIT dataset and baselines:

@inproceedings{massiceti2021orbit,
  title={{ORBIT: A Real-World Few-Shot Dataset for Teachable Object Recognition}},
  author={Massiceti, Daniela and Zintgraf, Luisa and Bronskill, John and Theodorou, Lida and Harris, Matthew Tobias and Cutrell, Edward and Morrison, Cecily and Hofmann, Katja and Stumpf, Simone},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year={2021}}

Contact

To ask questions or report issues, please open an issue on the Issues tab.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.