
GAMa: Cross-view Video Geo-localization

GAMa stands for Ground-video to Aerial-image Matching

Repository by Shruti Vyas

This is a PyTorch repository for our ECCV 2022 paper titled: "GAMa: Cross-view Video Geo-localization".

gif

We address video geo-localization using cross-view geo-localization, which has not previously been applied to videos. The gif shows an example of a large aerial region corresponding to a video, along with the video frames at each second. The trajectory of the video is marked on the aerial image. Successive GPS points are related by time and geographical location, and we use this information for geo-localization at the video level.

Dataset and Hierarchical approach

Dataset

The dataset comprises one large aerial image (1792x1792) for each video of around 40 sec. Figure A shows an example of a large aerial image along with the small aerial images. There is a centered (CN) and an uncentered (UCN) set of small aerial images, each image corresponding to a clip of 1 second. More details are here.

Download GAMa (Ground-video to Aerial-image Matching) dataset

Aerial images of the GAMa dataset can be downloaded from this link.

Ground videos can be downloaded from the BDD-100k dataset: https://bdd-data.berkeley.edu/ . To extract the frames, use extract_frames.py . Lists of selected videos are available with the aerial images at the previous link.

Approach:

The approach has four steps (Figure B). In Step-1, we use GAMa-Net, which takes one clip (0.5 sec) at a time and matches it with an aerial image. Using multiple clips of a video, we get a sequence of aerial images for the whole video, i.e. around 40 small aerial images. In Step-2, we take these predicted aerial images and match them to the corresponding larger aerial region. A screening network matches the features; in this step both sets of features come from the same view, i.e. the aerial view. In Step-3, we use the screening predictions to reduce the gallery, keeping only the top-ranked large aerial regions for a video. These large aerial regions define the new gallery for that video. In Step-4, we use GAMa-Net again, i.e. the same network as in Step-1, but geo-localize against the updated gallery.
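The four steps can be sketched on precomputed feature vectors as below. This is only an illustrative outline, not the repository's code: the function names, the cosine-similarity matching, and the mean pooling that stands in for the screening network are all assumptions, and it assumes clip, small-image and large-region features share one embedding dimension.

```python
import numpy as np

def top1_matches(clip_feats, gallery_feats):
    """Steps 1 and 4: index of the best-matching small aerial image per clip.
    Assumption: matching is done by cosine similarity."""
    q = clip_feats / np.linalg.norm(clip_feats, axis=1, keepdims=True)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    return np.argmax(q @ g.T, axis=1)

def screen_regions(pred_feats, region_feats, keep):
    """Steps 2 and 3: rank large aerial regions against the predicted
    sequence of small aerial images and keep the top `keep` regions.
    Assumption: mean pooling is a placeholder for the screening network."""
    video_feat = pred_feats.mean(axis=0)
    video_feat = video_feat / np.linalg.norm(video_feat)
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    scores = r @ video_feat
    return np.argsort(-scores)[:keep]

def hierarchical_localize(clip_feats, small_feats, small_region_ids,
                          region_feats, keep):
    """Run the four steps and return one small-image index per clip."""
    first = top1_matches(clip_feats, small_feats)                  # Step-1
    kept = screen_regions(small_feats[first], region_feats, keep)  # Step-2, Step-3
    reduced_idx = np.flatnonzero(np.isin(small_region_ids, kept))  # updated gallery
    second = top1_matches(clip_feats, small_feats[reduced_idx])    # Step-4
    return reduced_idx[second]
```

Here `small_region_ids[i]` is the (hypothetical) id of the large region that contains small aerial image `i`, so the updated gallery is simply the subset of small images whose region survived the screening.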

gif

The gif shows an example of a large aerial region for a video, with the ground-truth trajectory marked by the green line. Initially, only a few of the top-1 images matched by GAMa-Net fall in the region (white boxes with numbers). The numbers give the order of the clips in the video. After switching to the updated gallery from the hierarchical approach, many more predictions fall in the correct region (green boxes with numbers). At the end, we also show the outliers that do not fall in the correct larger aerial region.
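The before/after comparison described above can be expressed as the fraction of a video's top-1 predictions that land in the correct large aerial region. The helper below is a hypothetical illustration, not a metric taken from the repository; `small_region_ids` is an assumed array that maps each small aerial image to the id of its large region.

```python
import numpy as np

def fraction_in_region(pred_idx, small_region_ids, true_region):
    """Share of predicted small aerial images that lie inside the
    ground-truth large aerial region of the video."""
    pred_regions = np.asarray(small_region_ids)[np.asarray(pred_idx)]
    return float(np.mean(pred_regions == true_region))
```

Calling it once on the Step-1 predictions and once on the Step-4 predictions gives the two numbers that the white and green boxes visualize.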

GAMa-Net

System Requirements:

Anaconda3
Opencv3.5, Numpy, Matplotlib
PyTorch3, Python 3.6.9

Evaluation of Hierarchical approach

Final Evaluation of GAMa-Net model on updated/new gallery:

Run the following script for evaluation

python ./GAMa_Net/evaluationC_pred_Laerial_test_top1per.py

Model weights for GAMa-Net can be downloaded at this link. Please note that the first evaluation run takes longer because it creates a dictionary; subsequent runs are faster.
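The slow first run followed by faster runs is the usual build-once, reuse-later pattern. A minimal sketch of that pattern is shown below; the function name, the cache path argument and the use of pickle are assumptions for illustration and may differ from how the evaluation script stores its dictionary.

```python
import os
import pickle

def load_or_build(cache_path, build_fn):
    """Return the cached object if cache_path exists; otherwise call
    build_fn(), save the result to cache_path, and return it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    obj = build_fn()
    with open(cache_path, "wb") as f:
        pickle.dump(obj, f)
    return obj
```

With this pattern, deleting the cache file forces the dictionary to be rebuilt, which is useful if the gallery changes.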

More details regarding training and evaluation are in Readme.txt file.

Cite

@inproceedings{vyas2022gama,
  title={GAMa: Cross-view Video Geo-localization},
  author={Vyas, Shruti and Chen, Chen and Shah, Mubarak},
  booktitle={European Conference on Computer Vision},
  year={2022},
  organization={Springer}
}