Unsupervised Learning from Narrated Instruction Videos
Created by Jean-Baptiste Alayrac at INRIA, Paris.
Introduction
We address the problem of automatically learning the main steps to complete a certain task, such as changing a car tire, from a set of narrated instruction videos. The contributions of this paper are three-fold. First, we develop a new unsupervised learning approach that takes advantage of the complementary nature of the input video and the associated narration. The method solves two clustering problems, one in text and one in video, applied one after the other and linked by joint constraints to obtain a single coherent sequence of steps in both modalities. Second, we collect and annotate a new challenging dataset of real-world instruction videos from the Internet. The dataset contains about 800,000 frames for five different tasks (how to: change a car tire, perform cardiopulmonary resuscitation (CPR), jump-start a car, repot a plant, and make coffee) that include complex interactions between people and objects, and are captured in a variety of indoor and outdoor settings. Third, we experimentally demonstrate that the proposed method can automatically discover, in an unsupervised manner, the main steps to achieve the task and locate these steps in the input videos.
The webpage for this project is available here. It contains links to the paper and other resources such as the original data, the poster, and the slides of the presentation.
License
Our code is released under the MIT License (refer to the LICENSE file for details).
Cite
If you find this code useful in your research, please consider citing our paper:
@InProceedings{Alayrac16unsupervised,
  author    = "Alayrac, Jean-Baptiste and Bojanowski, Piotr and Agrawal, Nishant and Laptev, Ivan and Sivic, Josef and Lacoste-Julien, Simon",
  title     = "Unsupervised Learning from Narrated Instruction Videos",
  booktitle = "Computer Vision and Pattern Recognition (CVPR)",
  year      = "2016"
}
Contents
Requirements
To run the code, you need MATLAB installed. The code was tested on Ubuntu 12.04 LTS with MATLAB R2014b. Additional dependencies are needed only to recompute the features used here; see the Features section for details.
Method
This repo contains the code for the method described in the CVPR paper. The method aims at discovering the main steps needed to achieve a task and at temporally localizing them in narrated instruction videos. It is a two-stage approach:
- Multiple Sequence Alignment of the text input sequences
- Discriminative clustering of videos under text constraints
Code is provided for both stages, with a separate script for each. You can run both stages with different parameter configurations (see the comments in the code).
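For intuition only, here is a minimal pairwise sequence alignment sketch in Python (dynamic programming, Needleman-Wunsch style) over made-up direct-object tokens. It is not the authors' code: the first stage of the paper aligns many transcripts jointly (multiple sequence alignment) and uses a WordNet-based similarity rather than exact token matching, so treat this as a toy illustration of the underlying alignment primitive.

```python
# Toy pairwise alignment of two direct-object (dobj) sequences.
# Simplified sketch, not the multiple sequence alignment used in the paper.

def align(seq_a, seq_b, match=1.0, mismatch=-1.0, gap=-0.5):
    n, m = len(seq_a), len(seq_b)
    # score[i][j] = best score aligning seq_a[:i] with seq_b[:j]
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] + gap
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Backtrack to recover aligned pairs (None marks a gap).
    aligned, i, j = [], n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and seq_a[i - 1] == seq_b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            aligned.append((seq_a[i - 1], seq_b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned.append((seq_a[i - 1], None))
            i -= 1
        else:
            aligned.append((None, seq_b[j - 1]))
            j -= 1
    return score[n][m], aligned[::-1]

if __name__ == "__main__":
    # Hypothetical dobj sequences from two "change a tire" narrations.
    video_1 = ["loosen nut", "jack car", "remove tire", "tighten nut"]
    video_2 = ["jack car", "remove tire", "put tire", "tighten nut"]
    print(align(video_1, video_2))
```

In the actual pipeline, the ordered sequence of steps recovered in the text stage provides the constraints used by the discriminative clustering of the video features in the second stage.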
Multiple Sequence Alignment:
To run a demo of this code, you need to follow these steps:
- Download the package and go to that folder
git clone https://github.com/jalayrac/instructionVideos.git
cd instructionVideos
- Download and unpack the preprocessed features
wget -P data http://www.di.ens.fr/willow/research/instructionvideos/release/NLP_data.zip
unzip data/NLP_data.zip -d data
- Go to the corresponding folder
cd nlp_utils
- Open MATLAB and run
compile.m
launching_script.m
Discriminative clustering under text constraints:
Note that you don't need to run the first stage to launch this demo, as we provide mat files with the results of the first stage (see the instructions below). To run a demo of this code, you need to follow these steps:
- Download the package and go to that folder
git clone https://github.com/jalayrac/instructionVideos.git
cd instructionVideos
- Download and unpack the preprocessed features (both for NLP and VISION)
wget -P data http://www.di.ens.fr/willow/research/instructionvideos/release/NLP_data.zip
wget -P data http://www.di.ens.fr/willow/research/instructionvideos/release/VISION_data.zip
unzip data/NLP_data.zip -d data
unzip data/VISION_data.zip -d data
- Download and unpack the preprocessed results of the first stage:
wget -P results http://www.di.ens.fr/willow/research/instructionvideos/release/NLP_results.zip
unzip results/NLP_results.zip -d results
- Go to the corresponding folder
cd cv_utils
- Open MATLAB and run
compile.m
launching_script.m
Evaluation
We provide the preprocessed results so that you can reproduce the results of the paper. To reproduce our result plots, please follow these steps:
- Download the package and go to that folder
git clone https://github.com/jalayrac/instructionVideos.git
cd instructionVideos
- Download and unpack the preprocessed results, both for NLP and VISION
wget -P results http://www.di.ens.fr/willow/research/instructionvideos/release/NLP_results.zip
unzip results/NLP_results.zip -d results
wget -P results http://www.di.ens.fr/willow/research/instructionvideos/release/VISION_results.zip
unzip results/VISION_results.zip -d results
- Download and unpack the preprocessed NLP data (needed for the qualitative results)
wget -P data http://www.di.ens.fr/willow/research/instructionvideos/release/NLP_data.zip
unzip data/NLP_data.zip -d data
- Go to the corresponding folder
cd display_res
- Open MATLAB and run (for the NLP qualitative results)
display_res_NLP.m
- Open MATLAB and run (for the temporal localization results)
display_res_VISION.m
Features
If you want to run this code on new data, you will need to process the data as follows. If you need more details, don't hesitate to email the first author of the paper.
NLP
To obtain the direct object relations, we used the Stanford Parser 3.5.1 available here. We first construct a dictionary of direct object relations ranked by their number of occurrences in our corpus. The indexing is based on this ranking (see the count_files folder for a given task).
For each video, we created a *.trlst file. For each dobj spoken during the video, the file has a line containing:
- The index of the corresponding dobj in our dictionary
- The start time in the video (coming from subtitles)
- The end time in the video (coming from subtitles)
We then used the nltk Python package (WordNet interface) to obtain the distance between dobjs. This allows us to build the sim_mat matrix.
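If you want to reproduce this preprocessing on new data, the sketch below (Python) shows one way to read a *.trlst file and query WordNet through nltk. It is not the authors' preprocessing code: the whitespace-separated field layout, the handling of times as raw strings, and the use of path similarity over noun synsets are all assumptions made for illustration; the actual sim_mat may be built with a different WordNet measure.

```python
# Illustrative sketch (not the released preprocessing code).
# Assumptions: *.trlst fields are whitespace-separated; times are kept as raw
# strings since their exact format is not specified above; path similarity over
# noun synsets stands in for whatever WordNet measure was used for sim_mat.
from nltk.corpus import wordnet as wn  # requires nltk and `nltk.download('wordnet')`

def parse_trlst(path):
    """Return a list of (dobj_index, start_time, end_time) tuples."""
    entries = []
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 3:
                entries.append((int(fields[0]), fields[1], fields[2]))
    return entries

def dobj_similarity(word_a, word_b):
    """Best path similarity between the noun synsets of two words (0 if none)."""
    best = 0.0
    for syn_a in wn.synsets(word_a, pos=wn.NOUN):
        for syn_b in wn.synsets(word_b, pos=wn.NOUN):
            sim = syn_a.path_similarity(syn_b)
            if sim is not None and sim > best:
                best = sim
    return best

if __name__ == "__main__":
    print(dobj_similarity("tire", "wheel"))
    # entries = parse_trlst("some_video.trlst")  # hypothetical file name
```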
VISION
The data for VISION contains two folders:
- videos_info: This folder contains per-video information (FPS, number of frames...)
- features: This folder contains a mat file with a struct holding all the features, the ground truth, and the various pieces of information needed to launch the second stage of the method. The features used here are a concatenation of a bag-of-words representation of Improved Dense Trajectories and a CNN representation obtained with MatConvNet. Please see the paper for detailed explanations.
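As a starting point for new data, here is a small Python sketch that inspects a provided mat file and assembles a feature matrix by concatenation, as described above. The file path, variable names, and feature dimensions are hypothetical; check the actual struct shipped in VISION_data for the real layout.

```python
# Sketch for inspecting a provided feature file and assembling features for
# new videos. Paths, field names, and dimensions below are hypothetical.
import numpy as np
from scipy.io import loadmat

# Inspect a provided mat file (illustrative path).
data = loadmat("data/features/change_tire.mat", squeeze_me=True,
               struct_as_record=False)
print([k for k in data if not k.startswith("__")])  # top-level variable names

# For new videos: per-interval descriptor = concatenation of a bag-of-words
# over Improved Dense Trajectories and a CNN representation.
n_intervals = 10                               # toy number of time intervals
bow_idt = np.random.rand(n_intervals, 2000)    # placeholder BoW-of-IDT features
cnn_feat = np.random.rand(n_intervals, 4096)   # placeholder CNN features
features = np.hstack([bow_idt, cnn_feat])      # concatenated representation
print(features.shape)                          # (10, 6096)
```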