LVNet

Official code for the paper "Too Many Frames, not all Useful: Efficient Strategies for Long-Form Video QA".

The code will be released soon.

Paper Link

Abstract

Long-form videos that span wide temporal intervals are highly information-redundant and contain multiple distinct events or entities that are often only loosely related. Therefore, when performing long-form video question answering (LVQA), all the information needed to generate a correct response can often be contained within a small subset of frames. Recent literature explores the use of large language models (LLMs) on LVQA benchmarks, achieving exceptional performance while relying on vision-language models (VLMs) to convert all visual content within videos into natural language. Such VLMs often independently caption a large number of frames uniformly sampled from long videos, which is inefficient and largely redundant. Questioning these design choices, we explore optimal strategies for keyframe selection and sequence-aware captioning that can significantly reduce these redundancies. We propose two novel approaches that improve each of these aspects: the Hierarchical Keyframe Selector and the Sequential Visual LLM. Our resulting framework, termed LVNet, achieves state-of-the-art performance across three benchmark LVQA datasets.
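The keyframe-selection idea from the abstract can be illustrated with a minimal sketch. This is not the paper's actual Hierarchical Keyframe Selector; it assumes precomputed per-frame features and a query feature (e.g. from a vision-language encoder), ranks frames by cosine similarity, and keeps the top-k in temporal order. All names and the toy 2-D features are illustrative.

```python
import math

def cosine(a, b):
    # Cosine similarity between two feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def select_keyframes(frame_feats, query_feat, k):
    """Rank frames by similarity to the query and keep the top-k,
    returned in temporal order so downstream captioning stays
    sequence-aware."""
    ranked = sorted(range(len(frame_feats)),
                    key=lambda i: cosine(frame_feats[i], query_feat),
                    reverse=True)
    return sorted(ranked[:k])

# Toy 2-D features: frames 1 and 3 align best with the query direction.
frames = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [0.7, 0.7]]
query = [0.6, 0.8]
print(select_keyframes(frames, query, k=2))  # → [1, 3]
```

The point is only that a small, question-relevant subset of frames can replace uniform dense sampling; the paper's selector adds a coarse-to-fine hierarchy on top of this basic idea.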

Accuracy vs Captions on the EgoSchema Subset

Hierarchical Keyframe Selector: Structural Overview

Hierarchical Keyframe Selector: Operational Visualization

Experiments: EgoSchema

<img src="./tables/table_egoschema.png" alt="egoschema_table" width="600"/>

Experiments: NExT-QA

<img src="./tables/table_nextQA.png" alt="nextQA_table" width="600"/>

Experiments: IntentQA

<img src="./tables/table_intentQA.png" alt="intentQA_table" width="600"/>

Evaluation

Generate Answers Using LLM

You can easily run the LLM to generate answers for the questions using the pre-generated captions.

  1. Download the pre-generated captions for the dataset.
  2. Run the LLM: `bash scripts/eval_ES.sh`
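The step above feeds pre-generated captions and a question to an LLM. A minimal, hypothetical sketch of how such a prompt might be assembled (the actual format used by `scripts/eval_ES.sh` may differ; `build_prompt` and its layout are assumptions for illustration):

```python
def build_prompt(captions, question, choices):
    """Assemble a multiple-choice LVQA prompt: keyframe captions in
    temporal order, then the question and lettered answer options."""
    lines = ["You are given captions of keyframes from a long video."]
    lines += [f"Frame {i}: {c}" for i, c in enumerate(captions)]
    lines.append(f"Question: {question}")
    lines += [f"({chr(65 + j)}) {ch}" for j, ch in enumerate(choices)]
    lines.append("Answer with the letter of the best choice.")
    return "\n".join(lines)

prompt = build_prompt(
    ["a person opens a fridge", "the person pours milk into a bowl"],
    "What is the person preparing?",
    ["coffee", "cereal", "tea"],
)
print(prompt.splitlines()[1])  # → Frame 0: a person opens a fridge
```

The resulting string would then be sent to the LLM, whose answer letter is compared against the ground truth for accuracy.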

Generate captions using our provided modules

Hierarchical Keyframe Selector (HKS)

  1. EgoSchema keyframe selection from images: `bash config/run.sh`

  2. Generate captions based on the keyframes: `bash scripts/create_caption.sh`

Data

  - Hierarchical Keyframe Selector hyper-parameters & paths
  - `coarseKeyframeDetector.py` CLIP model checkpoint

Citation

@inproceedings{Park2024TooMF,
  title={Too Many Frames, not all Useful: Efficient Strategies for Long-Form Video QA},
  author={Jongwoo Park and Kanchana Ranasinghe and Kumara Kahatapitiya and Wonjeong Ryoo and Donghyun Kim and Michael S. Ryoo},
  year={2024}
}