<div align="center">
<h2><a href="https://arxiv.org/abs/2311.15964"><image src="./assets/eccv.png" alt="ECCV 2024" width=5%/> Efficient Pre-training for Localized Instruction Generation of Videos</a></h2>
Anil Batra, Davide Moltisanti, Laura Sevilla-Lara, Marcus Rohrbach, Frank Keller
</div>

## Abstract
In this work we propose *Sieve & Swap*, a technique to automatically generate high-quality pre-training data for the recipe domain: (i) *Sieve* filters out irrelevant transcripts, and (ii) *Swap* acquires high-quality text by replacing transcripts with human-written instructions from a text-only recipe dataset. The resulting dataset is three orders of magnitude smaller than current web-scale datasets but enables efficient training of large-scale models. Alongside Sieve & Swap, we propose the Procedure Transformer (ProcX), a model for end-to-end step localization and instruction generation in procedural videos. When pre-trained on our curated dataset, ProcX achieves state-of-the-art performance on YouCook2 and Tasty while using a fraction of the training data.
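The Sieve & Swap idea in the abstract can be illustrated with a minimal sketch. This is *not* the paper's pipeline: the actual method uses learned text representations, whereas the snippet below stands in a crude string-similarity measure (`difflib.SequenceMatcher`), and the `sieve`, `swap`, and `threshold` names are hypothetical.

```python
# Illustrative sketch of Sieve & Swap (simplified; the real pipeline
# uses learned embeddings rather than character-level similarity).
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Crude textual similarity in [0, 1]; a stand-in for embedding similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def sieve(transcripts: list[str], recipe_steps: list[str],
          threshold: float = 0.3) -> list[str]:
    """Sieve: keep only transcript segments that resemble some recipe step."""
    return [t for t in transcripts
            if max(similarity(t, s) for s in recipe_steps) >= threshold]


def swap(transcripts: list[str], recipe_steps: list[str]) -> list[str]:
    """Swap: replace each surviving transcript with the closest
    human-written instruction from the text-only recipe dataset."""
    return [max(recipe_steps, key=lambda s: similarity(t, s))
            for t in transcripts]


# Toy example: noisy ASR transcripts vs. clean recipe instructions.
transcripts = [
    "so now i'm just gonna chop up the onions real quick",
    "don't forget to like and subscribe",
    "we boil the pasta for about ten minutes",
]
steps = ["Chop the onions.", "Boil the pasta for 10 minutes.", "Serve hot."]

kept = sieve(transcripts, steps)
print(swap(kept, steps))
```

Relevant cooking narration survives the sieve and is swapped for the matching human-written instruction, while off-topic chatter is more likely to be filtered out.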
<p align="center"> <image src="./assets/sieve_n_swap.png" alt="sieve and swap approach" width=60%/> </p>

## Dataset
- **Sieve & Swap**: Our curated dataset, along with processed features, can be downloaded from :hugs: Hugging Face. More details are available in data.md.
- **Raw Pre-training**: HowTo100M, RecipeNLG
- **Downstream Tasks**: YouCook2, Tasty
## Code

Coming soon!
## :page_facing_up: Citation
If you find this project useful in your research, please consider citing:
```bibtex
@inproceedings{batra2025efficient,
  title={Efficient Pre-training for Localized Instruction Generation of Procedural Videos},
  author={Batra, Anil and Moltisanti, Davide and Sevilla-Lara, Laura and Rohrbach, Marcus and Keller, Frank},
  booktitle={European Conference on Computer Vision},
  pages={347--363},
  year={2025},
  organization={Springer}
}
```
## Licenses
This code is released under the MIT License. The licenses for datasets used in the paper are available at the following links: HowTo100M, YouCook2, and Tasty.
## :dizzy: Acknowledgement
Thanks to the following open-source projects: