# Synchformer: Efficient Synchronization from Sparse Cues
@InProceedings{synchformer2024iashin,
title={Synchformer: Efficient Synchronization from Sparse Cues},
author={Iashin, V. and Xie, W. and Rahtu, E. and Zisserman, A.},
booktitle={ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
year={2024},
organization={IEEE}
}
• [Project Page] • [arXiv] •
<img src="./_repo_assets/main.png" alt="Synchformer Architecture" width="900">

Given audio and visual streams, a synchronization model predicts the temporal offset between them. Instead of extracting features from the entire video, we extract features from shorter temporal segments (0.64 sec) of the video. The segment-level audio and visual inputs are fed into their respective feature extractors independently to obtain temporal features. Finally, the synchronization module inputs the concatenated sequence of audio and visual features to predict the temporal offset. We call our model Synchformer.
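For orientation, the data flow just described can be sketched as follows. Every module and shape below is a hypothetical stand-in (plain linear layers and a generic transformer encoder), not the actual classes or API of this repository.

```python
import torch
import torch.nn as nn

# Conceptual sketch of the segment-level design described above. All shapes and
# modules are illustrative stand-ins, NOT the classes used in this repository.
B, S, D = 2, 14, 768                      # batch, segments per clip, feature dim
audio_segs = torch.randn(B, S, 128)       # dummy per-segment audio inputs
visual_segs = torch.randn(B, S, 256)      # dummy per-segment visual inputs

audio_extractor = nn.Linear(128, D)       # stand-in for the audio feature extractor (AST-based)
visual_extractor = nn.Linear(256, D)      # stand-in for the visual feature extractor (Motionformer-based)
sync_layer = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
sync_module = nn.TransformerEncoder(sync_layer, num_layers=2)  # stand-in for the synchronization module
offset_head = nn.Linear(D, 21)            # offset classes (the examples below suggest a -2.0..+2.0 sec grid)

a_feats = audio_extractor(audio_segs)     # each 0.64-sec segment is encoded independently
v_feats = visual_extractor(visual_segs)
tokens = torch.cat([a_feats, v_feats], dim=1)          # concatenated audio + visual feature sequence
logits = offset_head(sync_module(tokens).mean(dim=1))  # predict the temporal offset class
print(logits.shape)                       # torch.Size([2, 21])
```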
## Install
During experimentation, we used Linux machines with a (mini)conda
virtual environment.
We tested our model on both Nvidia (CUDA 11.8) and AMD (ROCm 5.4.2) GPUs,
and the inference code should work for both.
To install the CUDA environment, run the following,
conda env create -f conda_env.yml
# conda activate synchformer
If you have a capable AMD GPU, you need to replace `conda_env.yml` with `conda_env_for_AMD_CUDA.yml`.
## Examples
Start by preparing the environment (see above).

This script applies the `--offset_sec` offset to the provided video `--vid_path` and runs the prediction with the provided `--exp_name` model (AudioSet-pretrained). In this example, the audio track will be 1.6 seconds early.
python example.py \
--exp_name "24-01-04T16-39-21" \
--vid_path "./data/vggsound/h264_video_25fps_256side_16000hz_aac/3qesirWAGt4_20000_30000.mp4" \
--offset_sec 1.6
# Prediction Results:
# p=0.8076 (11.5469), "1.60" (18)
# ...
Making the audio track lag is also straightforward and can be achieved with a negative offset (note that we need to start the visual track later to accommodate the earlier start of the audio track):
python example.py \
--exp_name "24-01-04T16-39-21" \
--vid_path "./data/vggsound/h264_video_25fps_256side_16000hz_aac/ZYc410CE4Rg_0_10000.mp4" \
--offset_sec -2.0 \
--v_start_i_sec 4.0
# Prediction Results:
# p=0.8291 (12.7734), "-2.00" (0)
# ...
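For reference, the class index printed in parentheses appears to follow a linear offset grid (class 0 maps to -2.00 sec and class 18 to 1.60 sec in the outputs above). Below is a minimal sketch of that assumed mapping; `example.py` is authoritative for the exact grid.

```python
# Offset grid inferred from the example outputs above: 21 classes covering
# -2.0 ... +2.0 sec in 0.2-sec steps. This is an assumption; check example.py.
NUM_CLS, MIN_OFF, STEP = 21, -2.0, 0.2

def cls_to_offset_sec(cls_idx: int) -> float:
    return MIN_OFF + STEP * cls_idx

def offset_sec_to_cls(offset_sec: float) -> int:
    return round((offset_sec - MIN_OFF) / STEP)

print(f'{cls_to_offset_sec(18):.2f}')  # 1.60  -> matches '"1.60" (18)'
print(f'{cls_to_offset_sec(0):.2f}')   # -2.00 -> matches '"-2.00" (0)'
print(offset_sec_to_cls(1.6))          # 18
```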
## Pre-trained Models

### Audio-visual synchronization models
Below are the pre-trained synchronization models. If you need the feature extractors' weights, see the segment-level feature extractors section.
CKPT ID | S1 train dataset | S2 train dataset | Test dataset | Acc@1 / Acc@1 ±1 cls | |
---|---|---|---|---|---|
23-12-23T18-33-57 | LRS3 ('Full Scene') | LRS3 ('Full Scene') | LRS3 ('Full Scene') | 86.6 / 99.6 | config / ckpt (md5: 4415276... ) |
24-01-02T10-00-53 | VGGSound | VGGSound | VGGSound-Sparse | 43.8 / 60.2 | config / ckpt (md5: 19592ed... ) |
24-01-04T16-39-21 | AudioSet | AudioSet | VGGSound-Sparse | 47.2 / 67.4 | config / ckpt (md5: 54037d2... ) |
The metric is Accuracy@1 / Accuracy@1 ±1 class. Note that these numbers vary slightly and are better than those in the paper: the paper reports the average performance across multiple training runs from scratch (including these). For details, see the note on reproducibility in the supplementary material.
### Segment-level feature extractors
If you want to play with pre-trained feature extractors separately, you may download them using the following links,
CKPT ID | Train dataset | |
---|---|---|
23-12-22T16-04-18 | LRS3 ('Full Scene') | config / ckpt (md5: 20b6e55... ) |
23-12-22T16-10-50 | VGGSound | config / ckpt (md5: a9979df... ) |
23-12-22T16-13-38 | AudioSet | config / ckpt (md5: 4a566f2... ) |
The checkpoint files contain weights from both the audio and visual feature extractors and will match the state of `AVCLIP` in `./model/modules/feat_extractors/train_clip_src/open_clip/model.py`. To get the weights of the audio or visual feature extractor separately, you may need to filter the keys. To save time, see the code of the feature extractors' `__init__` methods, in particular the part after the line `if was_pt_on_avclip: ...`.
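As a starting point, here is a hedged sketch of what that key filtering might look like. The `"v_encoder."` prefix is a placeholder: print the keys first, use the prefixes you actually see, and cross-check with the `__init__` code mentioned above.

```python
import torch

# Path and key layout are illustrative; adjust to the checkpoint you downloaded.
ckpt = torch.load("/path/to/downloaded_ckpt.pt", map_location="cpu")
state = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt

# Inspect how the audio/visual branches are named inside the AVCLIP state dict.
print(list(state.keys())[:10])

prefix = "v_encoder."  # hypothetical prefix for the visual feature extractor
visual_state = {k[len(prefix):]: v for k, v in state.items() if k.startswith(prefix)}
torch.save(visual_state, "visual_feat_extractor.pt")
```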
### Synchronizability prediction
The synchronizability model is fine-tuned from the synchronization model. The checkpoint is available at the following link,
CKPT ID | Train dataset | Test dataset | Acc@1 | AUCROC | |
---|---|---|---|---|---|
24-01-22T20-34-52 | AudioSet | VGGSound-Sparse | 73.5 | 0.83 | config / ckpt (md5: b1cb346... ) |
\* -- we use the `VGGSoundSparsePickedCleanTest` dataset for multi-iteration evaluation (also with `iter_times=25`).
## Training
The synchronization models are trained in two stages:
- Segment-level audio-visual contrastive pre-training of feature extractors
- Audio-visual synchronization module training
### Prepare Data
We follow the data preparation procedure of SparseSync. For the LRS3 and VGGSound datasets, please refer to the SparseSync repo for details on how to prepare the data. AudioSet is processed similarly to VGGSound.
### Segment-level audio-visual contrastive pre-training of feature extractors
# conda activate synchformer
python ./main.py \
config=./configs/segment_avclip.yaml \
logging.logdir=/path/to/logging_dir/ \
data.vids_path=/path/to/lrs3/h264_uncropped_25fps_256side_16000hz_aac/ \
data.dataset.target=dataset.lrs.LRS3 \
training.base_batch_size=2
# add `logging.use_wandb=True` for logging to wandb
It will download pre-trained models (AST and Motionformer) on the first run.
To train on AudioSet or VGGSound, replace the `vids_path` and set the dataset target to `data.dataset.target=dataset.audioset.AudioSet` or `data.dataset.target=dataset.vggsound.VGGSound`.
Note: the LRS3 model was trained with `learning_rate=0.00005` and without audio-visual augmentation; see the config of the `23-12-23T18-33-57` experiment.
This stage requires a GPU with high memory capacity, so if you run into OOM issues, you may try dropping the batch size per GPU to 1 (from 2), lowering `n_segments` (mind `run_shifted_win_val_winsize`), or reusing the pre-trained weights.
We trained this stage on 4 nodes with 4 (8) AMD Instinct MI250 GPUs: 10 hours (30 epochs) on LRS3 (with decent results already after 1 hour), 24 hours (20 epochs) on VGGSound, and 12 days (28 epochs) on AudioSet (the loss did not saturate).
To resume training, run the following,
CKPT_ID="xx-xx-xxTxx-xx-xx" # replace this with the exp folder name
python main.py \
config="/path/to/logging_dir/$CKPT_ID/cfg-$CKPT_ID.yaml" \
training.resume="latest"
(you may need to specify paths if you are resuming our checkpoints).
To track performance during training, we run zero-shot evaluation of offset detection with a 0.64 sec (segment-length) step size. In particular, we take `run_shifted_win_val_winsize` (8) consecutive segments from an audio/visual track (which consists of `n_segments` (14)) and compute the dot product between the corresponding windows (the first window of the audio track and the first window of the visual track, etc.), as well as the dot product between the first window of the audio track and the other windows of the visual track, and vice versa. See the figure below for an example. If the dot product between the corresponding windows is higher than the dot products between the other windows (within a row), we count this as a correct prediction. Then, we average the accuracy over all rows.
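A rough re-implementation of that shifted-window check is sketched below, assuming each segment is already pooled to a single feature vector. Variable names and the window comparison (a plain dot product of concatenated segment features) are illustrative, not the exact logic of the training code.

```python
import torch

def shifted_window_accuracy(a_feats, v_feats, winsize=8):
    """a_feats, v_feats: (n_segments, d) pooled per-segment features of one clip.
    Illustrative version of the zero-shot offset check described above."""
    n_segments = a_feats.shape[0]
    n_windows = n_segments - winsize + 1   # each shift is one segment, i.e. 0.64 sec
    a_wins = torch.stack([a_feats[i:i + winsize].flatten() for i in range(n_windows)])
    v_wins = torch.stack([v_feats[i:i + winsize].flatten() for i in range(n_windows)])
    sim = a_wins @ v_wins.T                # (n_windows, n_windows) window-to-window dot products
    # a row is 'correct' if the aligned (diagonal) window scores highest within it, and vice versa
    a2v_acc = (sim.argmax(dim=1) == torch.arange(n_windows)).float().mean()
    v2a_acc = (sim.argmax(dim=0) == torch.arange(n_windows)).float().mean()
    return (a2v_acc + v2a_acc) / 2

# n_segments=14 and winsize=8 mirror the values mentioned above; the feature dim is made up
print(shifted_window_accuracy(torch.randn(14, 512), torch.randn(14, 512)))
```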
In addition, we keep an eye on the individual segment similarity matrix (computed in `log_sim_matrices()`).
An example from LRS3:

The brighter the value in a matrix, the higher the similarity between the corresponding segments. For the `v2a` matrix, each row corresponds to a visual segment and each column corresponds to an audio segment. This results in a square matrix with side size `n_segments * batch_size`.
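In code terms, the `v2a` matrix is just a dot-product matrix over every (visual, audio) segment pair in the batch; a tiny sketch with made-up feature dimensions:

```python
import torch

batch_size, n_segments, d = 2, 14, 768        # illustrative values
v = torch.randn(batch_size * n_segments, d)   # one row per visual segment in the batch
a = torch.randn(batch_size * n_segments, d)   # one row per audio segment in the batch
v2a = v @ a.T                                 # rows: visual segments, columns: audio segments
print(v2a.shape)                              # torch.Size([28, 28]), side = n_segments * batch_size
```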
### Audio-visual synchronization module training
During this stage, the feature extractors are frozen, and only the synchronization module is trained.
S1_CKPT_ID="xx-xx-xxTxx-xx-xx" # replace this with an exp folder name
EPOCH="best"
python main.py \
config=./configs/sync.yaml \
logging.logdir=/path/to/logging_dir/ \
data.vids_path=/path/to/lrs3/h264_uncropped_25fps_256side_16000hz_aac/ \
data.dataset.target=dataset.lrs.LRS3 \
model.params.vfeat_extractor.params.ckpt_path="/path/to/logging_dir/${S1_CKPT_ID}/checkpoints/epoch_${EPOCH}.pt" \
model.params.afeat_extractor.params.ckpt_path="/path/to/logging_dir/${S1_CKPT_ID}/checkpoints/epoch_${EPOCH}.pt" \
training.base_batch_size=16
To use our pre-trained feature extractors, replace the `ckpt_path` arguments with the paths to the downloaded checkpoint. To train on AudioSet or VGGSound, replace the `vids_path` and set the dataset target to `data.dataset.target=dataset.audioset.AudioSet` or `data.dataset.target=dataset.vggsound.VGGSound`.
We trained this stage on the same infrastructure as the first stage, yet the training is not bounded by GPU memory because the feature extractors are frozen. One could increase the throughput by increasing the batch size per GPU if RAM, CPU count, and disk I/O allow. The training took 2 days (283 epochs) on LRS3 (decent results after 12 hours), 8 days (326 epochs) on VGGSound, and 10 days (61 epochs) on AudioSet. We noticed that the validation loss does not correlate with performance on the validation set and fluctuates a lot, so we recommend relying on the validation set accuracy.
There is also a 'warmup' stage (~4k x `training_elements`) during which the model's performance does not yet start to improve; this does not correlate with the learning rate schedule. See this plot for details.
To resume training, run the following,
CKPT_ID="xx-xx-xxTxx-xx-xx" # replace this with an exp folder name
python main.py \
config="/path/to/logging_dir/$CKPT_ID/cfg-$CKPT_ID.yaml" \
training.resume="True" training.finetune="False"
(you may need to specify paths if you are resuming our checkpoints).
### Fine-tune for synchronizability
We fine-tune the synchronization model for the task of synchronizability. In particular, we fine-tune ('unfreeze') the stage II model and replace the 21-class classification head with a 2-class classification head.
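Conceptually, the head swap boils down to replacing a 21-way classifier with a binary one, along the lines of the hedged sketch below. The `SyncModel` class and its `head` attribute are hypothetical; the real module path is defined by the model code and `./configs/ft_synchability.yaml`.

```python
import torch.nn as nn

class SyncModel(nn.Module):
    """Hypothetical stand-in for the stage-II synchronization model."""
    def __init__(self, d_model=768, num_offset_classes=21):
        super().__init__()
        self.backbone = nn.Identity()          # the transformer layers would live here
        self.head = nn.Linear(d_model, num_offset_classes)

model = SyncModel()

# Synchronizability fine-tuning: unfreeze the whole model and swap the 21-class
# offset head for a 2-class (synchronizable / not synchronizable) head.
for p in model.parameters():
    p.requires_grad = True
model.head = nn.Linear(model.head.in_features, 2)
```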
To fine-tune the synchronization model for the task of synchronizability, run the following,
S2_CKPT_ID="xx-xx-xxTxx-xx-xx" # replace this with an exp folder name
python main.py \
start_time="$NOW" \
config="./configs/ft_synchability.yaml" \
training.finetune="True" \
ckpt_path="/path/to/logging_dir/$S2_CKPT_ID/$S2_CKPT_ID.pt" \
data.dataset.target=dataset.audioset.AudioSet \
data.vids_path=/path/to/audioset/h264_video_25fps_256side_16000hz_aac/ \
logging.logdir=/path/to/logging_dir/ \
# logging.use_wandb=True
## Evaluation

### Synchronization

To evaluate the performance of the synchronization model, run the following,
# experiment id from `./logs/sync_models/xx-xx-xxTxx-xx-xx`
S2_CKPT_ID="xx-xx-xxTxx-xx-xx"
python main.py \
config="/path/to/logging_dir/$S2_CKPT_ID/cfg-$S2_CKPT_ID.yaml" \
training.finetune="False" \
training.run_test_only="True" \
data.iter_times="5" \
data.dataset.params.load_fixed_offsets_on="[]" \
logging.log_code_state=False \
logging.use_wandb=False
If you want to test `S2_CKPT_ID` on a different dataset, add the `data.dataset.target` argument (e.g. for the manually cleaned VGGSound-Sparse: `data.dataset.target=dataset.vggsound.VGGSoundSparsePickedCleanTestFixedOffsets`). By default, it will evaluate on the test set of the training dataset (different video IDs).
Following previous work, we run evaluation with `data.iter_times` > 1 and `data.dataset.params.load_fixed_offsets_on="[]"` on small datasets to allow for a more robust estimate of model performance. For instance, for LRS3 we use `data.iter_times="2"`, and for VGGSound-Sparse we use `data.iter_times="25"`. Please adjust the command above accordingly.
Note that `dataset.vggsound.VGGSoundSparsePickedCleanTestFixedOffsets` has fixed offsets, so we can't run multiple iterations on it.
### Synchronizability
To evaluate the synchronizability, run the following,
S3_CKPT_ID="xx-xx-xxTxx-xx-xx"
python ./scripts/test_syncability.py \
config_sync="/path/to/logging_dir/${S3_CKPT_ID}/cfg-${S3_CKPT_ID}.yaml" \
ckpt_path_sync="/path/to/logging_dir/${S3_CKPT_ID}/${S3_CKPT_ID}_best.pt" \
training.finetune=False \
training.run_test_only=True \
data.dataset.target=dataset.vggsound.VGGSoundSparsePickedCleanTest \
data.vids_path="path/to/vggsound/h264_video_25fps_256side_16000hz_aac/" \
data.n_segments=14 \
data.dataset.params.iter_times=25 \
data.dataset.params.load_fixed_offsets_on="[]" \
logging.log_code_state=False \
logging.use_wandb=False
If you would like to evaluate how well the synchronization model performs across different synchronizability thresholds (Figure 4, right), also specify the `config_off` and `ckpt_path_off` arguments (with paths to the synchronization model).
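As a rough illustration of that analysis (Figure 4, right), one can bin clips by their predicted synchronizability and measure offset accuracy only on the clips above each threshold. The sketch below uses dummy arrays and is not the logic of `./scripts/test_syncability.py`.

```python
import numpy as np

rng = np.random.default_rng(0)
p_sync = rng.random(1000)                 # predicted synchronizability per clip (dummy values)
offset_correct = rng.random(1000) < 0.5   # whether the offset prediction was correct (dummy values)

for thr in (0.0, 0.25, 0.5, 0.75):
    keep = p_sync >= thr                  # evaluate offset accuracy only on 'synchronizable enough' clips
    acc = offset_correct[keep].mean() if keep.any() else float("nan")
    print(f"threshold={thr:.2f}: kept {keep.sum():4d} clips, offset acc={acc:.3f}")
```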
## Acknowledgements
A few shoutouts to the open-source projects that we used in this work:
- SparseSync
- Motionformer
- AST in HuggingFace
- minGPT
- pre-trained S3D network in PyTorch
- and, of course, PyTorch, NumPy, and other open-source projects (see the environment files, `*.yml`).
This research was funded by the Academy of Finland projects 327910 and 324346, EPSRC Programme Grant VisualAI EP/T028572/1, and a Royal Society Research Professorship. We also acknowledge CSC (Finland) for awarding this project access to the LUMI supercomputer, owned by the EuroHPC JU, hosted by CSC and the LUMI consortium through CSC.