Meta AI Video Similarity Challenge: Matching Track Code Execution Runtime
Welcome to the runtime repository for the Meta AI Video Similarity Challenge!
If you haven't already done so, start by reading the Code Submission Format page on the competition website.
As mentioned in the Problem Description and Code Submission Format pages, this competition is a hybrid code execution competition. This means that you submit both the full set of query and reference matches that you generate for videos in the test set and the code that generates those matches. This repository defines the environment in which your code submission will run on a subset of videos from the query set to ensure that your submission meets the given resource constraints. It specifies both the operating system and the software packages that will be available to your solution.
This repository has three primary uses for competitors:
- 💡 Quickstart example: A minimal code example that runs successfully in the runtime environment and outputs a properly formatted submission tarfile. This will generate random matches, so unfortunately you won't win the competition with this example, but you can use it as a guide for bringing in your own work and generating a real submission.
- 🧪 Test your submission: Test your submission with a locally running version of the container to discover errors before submitting to the competition site.
- 📦 Request new packages in the official runtime: Since the Docker container will not have network access, all packages must be pre-installed. If you want to use a package that is not in the runtime environment, make a pull request to this repository.
This repository is a companion repository to the vsc2022 codebase, which provides a benchmark solution to both tracks of this challenge. The vsc2022 codebase is included as a submodule of this repository, and you can check it out fully by running make update-submodules, as explained further below.
Quickstart
Developing your own submission
Getting Started: the vsc2022 repo
Additional information
- Scoring your submission
- Submitting without code
- Runtime network access
- CPU and GPU
- Make commands
- Updating runtime packages
Quickstart
The quickstart example generates valid (but random) matches for the full set of query and reference videos, and includes an example main.py script for generating matches for the test subset.
This section guides you through the steps to generate a simple but valid submission for the competition.
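To make the submission format concrete, here is a minimal sketch of the kind of thing a random-match main.py does: enumerate the mounted query videos, pick random reference matches, and write them to a CSV. The paths, metadata filename, video_id column, output filename, and match columns (query_id, ref_id, query_start, query_end, ref_start, ref_end, score) are assumptions for illustration only; defer to the provided submission_quickstart/main.py and the Code Submission Format page for the real schema.

```python
import csv
import random
from pathlib import Path

# All paths, filenames, and column names below are assumptions for illustration;
# check the provided submission_quickstart/main.py for the real layout.
DATA_DIR = Path("/data")
OUTPUT_FILE = Path("subset_matches.csv")  # placeholder; use the exact filename the runtime expects

# Query videos are mounted as files; reference videos are described by metadata only.
query_ids = sorted(p.stem for p in (DATA_DIR / "query").glob("*.mp4"))
with (DATA_DIR / "test_reference_metadata.csv").open() as f:
    ref_ids = [row["video_id"] for row in csv.DictReader(f)]

with OUTPUT_FILE.open("w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(
        ["query_id", "ref_id", "query_start", "query_end", "ref_start", "ref_end", "score"]
    )
    for qid in query_ids:
        rid = random.choice(ref_ids)             # a random "match", for format testing only
        start = round(random.uniform(0, 30), 2)
        writer.writerow([qid, rid, start, start + 5.0, start, start + 5.0, random.random()])
```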
Prerequisites
First, make sure you have the prerequisites installed.
- A clone or fork of this repository
- At least 12 GB of free space for the Docker container images, and an additional 79 GB of free space for storing the videos from the training set you'll use as your local test set (163 GB if you download the entire dataset, including the test set)
- Docker
- GNU make (optional, but useful for running commands in the Makefile)
- A Python environment with requirements.txt installed
Download the data
Download the competition data into the competition_data folder by following the instructions on the data download page. Once everything is downloaded and in the right location, it should look like this:
competition_data/                          # Competition data directory
├── train/                                 # Directory containing the training set
│   ├── train_query_metadata.csv           # Training set query metadata file
│   ├── train_reference_metadata.csv       # Training set reference metadata file
│   ├── train_descriptor_ground_truth.csv  # Training set descriptor ground truth file
│   ├── train_matching_ground_truth.csv    # Training set matching ground truth file
│   ├── query/                             # Directory containing the training set query videos
│   │   ├── Q100001.mp4
│   │   ├── Q100002.mp4
│   │   ├── Q100003.mp4
│   │   └── ...
│   └── reference/                         # Directory containing the training set reference videos
│       ├── R100000.mp4
│       ├── R100001.mp4
│       ├── R100002.mp4
│       └── ...
│
└── test/                                  # Directory containing the test set
    ├── test_query_metadata.csv            # Test set query metadata file
    ├── test_reference_metadata.csv        # Test set reference metadata file
    ├── query/                             # Directory containing the test set query videos
    │   ├── Q200001.mp4
    │   ├── Q200002.mp4
    │   ├── Q200003.mp4
    │   └── ...
    └── reference/                         # Directory containing the test set reference videos
        ├── R200000.mp4
        ├── R200001.mp4
        ├── R200002.mp4
        └── ...
If you are competing in both tracks of the competition, you can symlink competition_data to a single folder where you have all of the competition data stored, to avoid keeping two copies of the full dataset.
Run Make commands
To test out the full execution pipeline, make sure Docker is running and then run the following commands in the terminal:
- make pull pulls the latest official Docker image from the container registry (Azure). You'll need an internet connection for this.
- make data-subset generates and copies a subset of the competition_data/test dataset into the data/test folder in the format it will exist in the code execution environment. By default, this will copy over 1% of the query videos, but you can modify the proportion of videos copied by editing the Makefile variable SUBSET_PROPORTION or by providing a value when you issue the command, e.g., make SUBSET_PROPORTION=0.1 data-subset. When your code runs in the competition's code execution environment, it will generate matches for approximately 10% of the test set query videos.
- make pack-quickstart generates valid, random matches for the full query and reference sets, and then zips the contents of the submission_quickstart directory (including the main.py script, which generates random matches within the code execution environment) and saves it as submission/submission.zip. The submission.zip file will contain both the full_matches.csv and main.py files, and is what you will upload to the DrivenData competition site for code execution. But first we'll test that everything looks good locally (see the next step).
- make test-submission will do a test run of your submission, simulating what happens during actual code execution. This command runs the Docker container with the requisite host directories mounted and executes main.py to produce a tar file with your matches for the full set and the subset.
make pull
make data-subset
make pack-quickstart
make test-submission
🎉 Congratulations! You've just completed your first test run for the Video Similarity Challenge Matching Track. If everything worked as expected, you should see that a new file, submission/submission.tar.gz, has been generated. If you unpack this file, you should see a full_matches.csv (which you submitted) and a subset_rankings.csv file (which was generated by your code), each of which contains scored matches that predict the video segments most likely to have a derived content relationship.
If you were ready to make a real submission to the competition, you would upload the submission.zip file generated by make pack-quickstart above to the competition Submissions page. The submission.tar.gz that is written out during code execution will get scored automatically using the competition scoring metric to determine your rank on the leaderboard.
Developing your own submission
Now that you've gone through the quickstart example, let's talk about how to develop your own solution for the competition.
Steps
This section provides instructions on how to develop and run your code submission locally using the Docker container. To make things simpler, key processes are already defined in the Makefile. Commands from the Makefile are run with make {command_name}. The basic steps are:
make pull
make pack-submission
make data-subset
make test-submission
Let's walk through what you'll need to do, step-by-step. The overall process here is very similar to what we've already covered in the Quickstart, but we'll go into more depth this time around.
- Set up the prerequisites, including downloading the data.
- Download the official competition Docker image, if you haven't already:
  $ make pull
- ✏️ Develop your model. Keep in mind that the runtime already contains a number of packages that might be useful for you (CPU and GPU versions). If there are other packages you'd like added, see the section below on updating runtime packages.
- Save your full_matches.csv file and main.py script in the submission_src folder of the runtime repository.
  - Working off the main.py template we've provided, you'll want to add code as necessary to process the queries, cache intermediate results as necessary, and write out your matches.
  - Make sure any model weights or other files you need are also saved in submission_src (you can include these in that folder or in a subfolder, e.g., submission_src/model_assets).
  Note: You may generate submissions that only contain your matches, and not a main.py script. In this case, the platform will score your matches so you can see how your solution performs on the test set without having to get inference running in the runtime. However, these submissions will not be eligible for Phase 2 or prizes, and they will also count against the weekly submission limit.
- Create a submission/submission.zip file containing your code and model assets:
  $ make pack-submission
  cd submission_src; zip -r ../submission/submission.zip ./*
    adding: main.py (deflated 50%)
- Generate a runtime data subset with make data-subset. Similar to the make data-subset command we used when making the quickstart solution, this command will copy a subset of the test set query videos into the data/test/ folder in the same structure that will exist in the code execution environment. You can change the proportion of subset videos by editing the Makefile variable SUBSET_PROPORTION or by specifying a value when you issue the command, e.g., make SUBSET_PROPORTION=0.1 data-subset.
. -
Test your submission with
make test-submission
This command launches an instance of the competition Docker image, simulating the same process that will take place in the official code execution runtime.** The requisite host directories will be mounted on the Docker container,
submission/submission.zip
will be unzipped into the root directory of the container, andmain.py
will be executed to produce your subset matches.$ make test-submission
⚠️ Remember that in the official code execution environment, /data will contain just the subset of test set query videos along with the full metadata CSV files for the test query and reference sets. When testing locally, the /data directory is a mounted version of whatever you have saved locally in this project's data/test or data/train directories. make data-subset adds the appropriate metadata files and video files to the local data/*/ directory; make sure the data subset you are mounting to the container corresponds to the full set of matches you are packing in your submission.
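As a concrete illustration of the warning above, here is a minimal sketch of how a main.py might discover which query videos are actually mounted before running inference. The mount point, metadata filename, and use of pandas are assumptions; confirm the real layout against the quickstart main.py.

```python
from pathlib import Path

import pandas as pd

DATA_DIR = Path("/data")  # assumed mount point inside the container

# Only a subset of the test query videos is mounted, so enumerate what is
# actually present rather than relying on the full metadata listing.
subset_query_ids = sorted(p.stem for p in (DATA_DIR / "query").glob("*.mp4"))

# The full reference metadata CSV is available even though reference videos are not.
ref_meta = pd.read_csv(DATA_DIR / "test_reference_metadata.csv")

print(f"Found {len(subset_query_ids)} query videos to process", flush=True)
print(f"Reference metadata lists {len(ref_meta)} videos", flush=True)
```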
Logging
When you run make test-submission, the logs will be printed to the terminal and written out to submission/log.txt. If you run into errors, use log.txt to determine what changes you need to make for your code to execute successfully. Note that log messages generated by tqdm in a submission to the platform will not, by default, appear until all of the iterations have completed.
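If you want incremental progress to show up in the platform logs, one option is to emit plain log lines at a fixed interval instead of (or in addition to) tqdm's in-place progress bar. A minimal sketch, where run_inference is a hypothetical per-video function you supply:

```python
def process_queries(query_videos, run_inference):
    """Process query videos, emitting a plain log line every 50 videos."""
    total = len(query_videos)
    for i, video_path in enumerate(query_videos, start=1):
        run_inference(video_path)  # hypothetical per-video inference callable
        if i % 50 == 0 or i == total:
            print(f"Processed {i}/{total} query videos", flush=True)
```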
Getting Started: the vsc2022 repository
As part of this competition, our partners at Meta AI have made a benchmark solution available in the vsc2022 repository. You are encouraged to use this benchmark solution as a starting point for your own solution if you so choose.
A working benchmark solution
In addition to creating a quickstart solution, it may be instructive to use the provided benchmark code to generate an initial local submission. To do so, you would follow the instructions above as well as the instructions in the vsc2022 documentation for installation and running the baseline.
Your workflow might look something like this:
- Cloning and recursively updating submodules for the vsc2022 repo into submission_src
- Downloading and transforming the sscd model into submission_src/model_assets/
- Running inference on the test query and reference datasets to generate descriptors and matches
- Adapting main.py to run inference on the runtime data subset
  - Note: Within the code execution runtime, the conda environment is accessible to commands run via subprocess by including the prefix conda run --no-capture-output -n condaenv [command], so your main.py might include a subprocess call to something that looks like the command below (a sketch follows this list):
    conda run --no-capture-output -n condaenv python -m vsc.baseline.inference --torchscript_path "/code_execution/vsc2022/vsc/baseline/adapted_sscd_disc_mixup.torchscript.pt" --accelerator=cuda --processes="1" --dataset_path "{QUERY_SUBSET_VIDEOS_FOLDER}" --output_file "{OUTPUT_FILE}"
- Adapting or extending the repo to improve the quality of your matches!
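For instance, a minimal sketch of issuing that baseline inference command from main.py via subprocess might look like the following; the dataset and output paths are placeholders you would fill in for your own pipeline:

```python
import subprocess

QUERY_SUBSET_VIDEOS_FOLDER = "/data/query"   # placeholder: folder of mounted subset query videos
OUTPUT_FILE = "query_descriptors.npz"        # placeholder: wherever your pipeline expects descriptors

cmd = [
    "conda", "run", "--no-capture-output", "-n", "condaenv",
    "python", "-m", "vsc.baseline.inference",
    "--torchscript_path", "/code_execution/vsc2022/vsc/baseline/adapted_sscd_disc_mixup.torchscript.pt",
    "--accelerator=cuda",
    "--processes=1",
    "--dataset_path", QUERY_SUBSET_VIDEOS_FOLDER,
    "--output_file", OUTPUT_FILE,
]
# Fail loudly if the baseline inference command exits with a nonzero status
subprocess.run(cmd, check=True)
```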
Additional information
Scoring your submission
For convenience and consistency, the vsc2022 repository, including the scoring scripts for both the descriptor track and the matching track, is included as a submodule of this runtime. The descriptor evaluation similarity search is also conducted by the code in this library. After cloning this repository, run make update-submodules to download the contents of vsc2022 into the specified folder.
You will need to use the training set if you want to score your submission, since ground truth is only provided for the training set. While the make commands in this repository default to using the test set, you can also use the training set by setting the variable DATASET=train. For example, run make data-subset DATASET=train to prepare the necessary data in data/train, and make test-submission DATASET=train to have the container mount the data from data/train.
After running the container, unpack the submission.tar.gz generated by your solution to get the matches files, then provide them to vsc2022/matching_eval.py along with the training set ground truth to obtain your score.
Note: When evaluating your generated subset submission on the training set, you should provide only the subset of the ground truth entries that contains the query videos in the subset.
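A minimal sketch of that filtering step, assuming the matching ground truth CSV has a query_id column (check the actual file header before relying on it):

```python
from pathlib import Path

import pandas as pd

# Assumed local paths; adjust to wherever your data and generated subset actually live.
gt = pd.read_csv("competition_data/train/train_matching_ground_truth.csv")
subset_query_ids = {p.stem for p in Path("data/train/query").glob("*.mp4")}

# Keep only the ground truth rows whose query video is part of the generated subset
subset_gt = gt[gt["query_id"].isin(subset_query_ids)]
subset_gt.to_csv("data/train/subset_matching_ground_truth.csv", index=False)
```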
Submitting without code
As mentioned in Developing your own submission, you may, if you wish, create submissions that only contain your matches and not a main.py script. In this case, the platform will score the generated matches so you can see how your solution performs on the test set without having to get inference running in the runtime. However, these submissions will not be eligible for Phase 2 or prizes, and they will also count against the weekly submission limit.
Runtime network access
In the real competition runtime, all internet access is blocked. The local test runtime does not impose the same network restrictions. It's up to you to make sure that your code does not make requests to any web resources.
You can test your submission without internet access by running BLOCK_INTERNET=true make test-submission.
Downloading pre-trained weights
It is common for models to download pre-trained weights from the internet. Since submissions do not have open access to the internet, you will need to include all weights in your submission.zip and make sure that your code loads them from disk rather than from the internet.
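For example, with a PyTorch model you might bundle the weights under submission_src/model_assets/ and load them from a path relative to your script. The weights filename and MyModel class here are hypothetical placeholders:

```python
from pathlib import Path

import torch

ASSETS_DIR = Path(__file__).parent / "model_assets"

# Weights shipped inside submission.zip; "my_model_weights.pt" is a placeholder name.
state_dict = torch.load(ASSETS_DIR / "my_model_weights.pt", map_location="cpu")

model = MyModel()  # hypothetical model class defined elsewhere in your submission
model.load_state_dict(state_dict)
model.eval()
```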
CPU and GPU
For local testing, the make commands will try to select the CPU or GPU image automatically by setting the CPU_OR_GPU variable based on whether make detects nvidia-smi.
You can explicitly set the CPU_OR_GPU variable by prefixing the command with:
CPU_OR_GPU=cpu <command>
If you have nvidia-smi and a CUDA version other than 11, you will need to explicitly set make test-submission to run on CPU rather than GPU. make will automatically select the GPU image because you have access to a GPU, but it will fail because make test-submission requires CUDA version 11.
CPU_OR_GPU=cpu make pull
CPU_OR_GPU=cpu make test-submission
If you want to try using the GPU image on your machine but you don't have a GPU device that can be recognized, you can use SKIP_GPU=true. This will invoke docker without the --gpus all argument.
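Because your submission may run on either the CPU or GPU image while you test locally, it can help to select the device at runtime rather than hard-coding it. A minimal sketch, assuming a PyTorch-based pipeline:

```python
import torch

def select_device() -> torch.device:
    """Use the GPU when the container exposes one, otherwise fall back to CPU."""
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")

device = select_device()
print(f"Running inference on {device}", flush=True)
```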
Updating runtime packages
If you want to use a package that is not in the environment, you are welcome to make a pull request to this repository. If you're new to the GitHub contribution workflow, check out this guide by GitHub. The runtime manages dependencies using conda environments. Here is a good general guide to conda environments. The official runtime uses Python 3.9.13 environments.
To submit a pull request for a new package:
- Fork this repository, and in your fork ensure you have the submodules checked out and updated by running make update-submodules. Having the vsc2022 repository fully checked out as a submodule is necessary for the build process to complete successfully.
- Edit the conda environment YAML files, runtime/environment-cpu.yml and runtime/environment-gpu.yml. There are two ways to add a requirement:
  - Add an entry to the dependencies section. This installs from a conda channel using conda install. Conda performs robust dependency resolution with other packages in the dependencies section, so we can avoid package version conflicts.
  - Add an entry to the pip section. This installs from PyPI using pip, and is an option for packages that are not available in a conda channel.
  For both methods, be sure to include a version, e.g., numpy==1.20.3. This ensures that all environments will be the same.
- Locally test that the Docker image builds and the tests run successfully for both the CPU and GPU images:
  CPU_OR_GPU=cpu make build
  CPU_OR_GPU=cpu make test-container
  CPU_OR_GPU=gpu make build
  CPU_OR_GPU=gpu make test-container  # Ensure this command is run on a machine with `nvidia-smi`
- Commit the changes to your forked repository.
- Open a pull request from your branch to the main branch of this repository. Navigate to the Pull requests tab in this repository, and click the "New pull request" button. For more detailed instructions, check out GitHub's help page.
- Once you open the pull request, GitHub Actions will automatically try building the Docker images with your changes and running the tests in runtime/tests. These tests can take up to 30 minutes, and may take longer if your build is queued behind others. You will see a section on the pull request page that shows the status of the tests and links to the logs.
- You may be asked to submit revisions to your pull request if the tests fail or if a DrivenData team member has feedback. Pull requests won't be merged until all tests pass and the team has reviewed and approved the changes.
Make commands
Running make
at the terminal will tell you all the commands available in the repository:
❯ make
Settings based on your machine:
SUBMISSION_IMAGE=f5f61cef3987 # ID of the image that will be used when running test-submission
Available competition images:
meta-vsc-matching-runtime:cpu-local (f5f61cef3987); meta-vsc-matching-runtime:gpu-local (f314bbf3beed);
Available commands:
build Builds the container locally
clean Delete temporary Python cache and bytecode files
data-subset Adds video metadata and a subset of query videos to `data`. Defaults to test set. Can use train set with DATASET=train.
interact-container Start your locally built container and open a bash shell within the running container; same as submission setup except has network access
pack-quickstart Creates a submission/submission.zip file from the source code in submission_quickstart
pack-submission Creates a submission/submission.zip file from the source code in submission_src
pull Pulls the official container from Azure Container Registry
test-container Ensures that your locally built container can import all the Python packages successfully when it runs
test-submission Runs container using code from `submission/submission.zip` and data from `data/test`. Can use `data/train` with DATASET=train.
update-submodules Fetch or update all submodules (vsc2022 and VCSL)
Good luck! And have fun!
Thanks for reading! Enjoy the competition, and hit up the forums if you have any questions!