GSV-Cities

Official repo for the Neurocomputing 2022 paper GSV-Cities: Toward Appropriate Supervised Visual Place Recognition

[ArXiv] [ScienceDirect] [Bibtex] [Dataset]


Summary of the paper

  1. We collected GSV-Cities, a large-scale dataset for the task of Visual Place Recognition, with highly accurate ground truth.
    • It contains ~530k images.
    • There are more than 62k different places, spread across multiple cities around the globe.
    • Each place is depicted by at least 4 images (up to 20 images).
    • All places are physically distant (at least 100 meters between any pair of places).
  2. We proposed a fully convolutional aggregation technique (called Conv-AP) that outperforms NetVLAD and most existing SotA techniques.
  3. We consider representation learning for visual place recognition as a three-component pipeline, as follows:

[Figure: pipeline]

What can we do with the GSV-Cities dataset and the code base in this repo?

Trained models

Please refer to the following Jupyter Notebook for evaluation.

<table> <thead> <tr> <th rowspan="2">Backbone</th> <th rowspan="2">Output<br>dimension</th> <th colspan="2">Pitts250k-test</th> <th colspan="2">Pitts30k-test</th> <th colspan="2">MSLS-val</th> <th colspan="2">Nordland</th> <th rowspan="2"></th> </tr> <tr> <th>R@1</th> <th>R@5</th> <th>R@1</th> <th>R@5</th> <th>R@1</th> <th>R@5</th> <th>R@1</th> <th>R@5</th> </tr> </thead> <tbody> <tr> <td rowspan="4">ResNet50</td> <td>8192<br>[2048x2x2]</td> <td>92.8</td> <td>97.7</td> <td>90.5</td> <td>95.2</td> <td>83.1</td> <td>90.3</td> <td>42.7</td> <td>58.8</td> <td rowspan="4"><a href="https://drive.google.com/drive/folders/1VYPw9uGD11NgiGFgfWueLt3noJYOIuhL">LINK</a></td> </tr> <tr> <td>4096<br>[1024x2x2]</td> <td>92.5</td> <td>97.7</td> <td>90.5</td> <td>95.3</td> <td>83.5</td> <td>89.7</td> <td>42.6</td> <td>59.8</td> </tr> <tr> <td>2048<br>[512x2x2]</td> <td>92.3</td> <td>97.5</td> <td>90.6</td> <td>95.1</td> <td>83.4</td> <td>90.3</td> <td>40.3</td> <td>56.6</td> </tr> <tr> <td>512<br>[128x2x2]</td> <td>90.7</td> <td>96.6</td> <td>89.1</td> <td>94.6</td> <td>82.6</td> <td>90.0</td> <td>36.3</td> <td>53.1</td> </tr> </tbody> </table>
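The R@1 and R@5 columns are Recall@K values in percent: the share of queries for which at least one correct reference image appears among the top K retrieved results. The following is a minimal sketch of that metric on toy data. The function name `recall_at_k` and the input layout (ranked reference indices per query, plus a list of correct indices per query) are assumptions made for illustration and are not taken from this repo. The Jupyter Notebook mentioned above remains the reference for reproducing the reported numbers.

```python
def recall_at_k(predictions, ground_truth, ks=(1, 5)):
    """Compute Recall@K in percent.

    predictions[i]  : reference indices retrieved for query i, best match first
    ground_truth[i] : reference indices that count as correct for query i
    """
    hits = {k: 0 for k in ks}
    for ranked, positives in zip(predictions, ground_truth):
        positives = set(positives)
        for k in ks:
            # A query is a hit at K if any of its top-K results is correct.
            if any(ref in positives for ref in ranked[:k]):
                hits[k] += 1
    num_queries = len(predictions)
    return {k: 100.0 * hits[k] / num_queries for k in ks}


# Toy data: 3 queries, each with 3 retrieved references.
predictions = [[3, 7, 1], [4, 2, 9], [5, 6, 8]]
ground_truth = [[3], [9, 0], [1]]
scores = recall_at_k(predictions, ground_truth, ks=(1, 3))
print(scores)
```

In this toy example only the first query is matched at rank 1, while the second query is matched within the top 3, so Recall@3 is higher than Recall@1.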

Code to load the pretrained weights is as follows:

import torch
from main import VPRModel

# Note that these models have been trained with images resized to 320x320
# Also, either use BILINEAR or BICUBIC interpolation when resizing.
# The model with 4096-dim output has been trained with images resized with bicubic interpolation
# The model with 8192-dim output with bilinear interpolation
# ConvAP works with all image sizes, but best performance can be achieved when resizing to the training resolution

model = VPRModel(backbone_arch='resnet50', 
                 layers_to_crop=[],
                 agg_arch='ConvAP',
                 agg_config={'in_channels': 2048,
                            'out_channels': 1024,
                            's1' : 2,
                            's2' : 2},
                )


state_dict = torch.load('./LOGS/resnet50_ConvAP_1024_2x2.ckpt')
model.load_state_dict(state_dict)
model.eval()
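The bracketed shapes in the table above (for example 4096 = [1024x2x2]) suggest that the descriptor size equals out_channels × s1 × s2. Under that reading, the `agg_config` for another row of the table could be derived as sketched below. The helper `convap_agg_config` is hypothetical and not part of this repo, and it assumes `in_channels` stays at 2048 for an uncropped ResNet50, as in the snippet above. It makes no assumption about checkpoint file names.

```python
def convap_agg_config(out_dim, in_channels=2048, s1=2, s2=2):
    """Build an agg_config dict for a target descriptor size (hypothetical helper).

    Assumes descriptor size = out_channels * s1 * s2, as the table shapes suggest.
    """
    cells = s1 * s2
    if out_dim % cells != 0:
        raise ValueError(f"out_dim={out_dim} is not divisible by s1*s2={cells}")
    return {
        'in_channels': in_channels,
        'out_channels': out_dim // cells,
        's1': s1,
        's2': s2,
    }


# Descriptor sizes listed in the table above.
for dim in (8192, 4096, 2048, 512):
    print(dim, convap_agg_config(dim))
```

The resulting dictionary can be passed as `agg_config` when constructing the model, provided the checkpoint being loaded was trained with the same configuration.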


GSV-Cities dataset overview

[Figure: example]

Database organisation

Unlike existing visual place recognition datasets, where images are organised in a way that is not easily explorable by humans, images in GSV-Cities are named as follows:

city_placeID_year_month_bearing_latitude_longitude_panoid.JPG

This naming scheme makes it possible to explore the dataset with the default image viewer of the OS. It also duplicates the metadata in the file names, which provides redundancy in case the Dataframes get lost or corrupted.
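Because the metadata is embedded in the file name, it can be recovered without the Dataframes. The following is a minimal sketch of such a parser. The function name `parse_gsv_filename` and the returned keys are illustrative choices, not part of this repo. It assumes the city prefix contains no underscore. The panoid, on the other hand, may contain underscores, so the name is split at most 7 times from the left.

```python
from pathlib import Path


def parse_gsv_filename(path):
    """Split a GSV-Cities image name into its metadata fields (illustrative sketch)."""
    stem = Path(path).stem  # drop the directory and the .JPG extension
    # maxsplit=7 keeps any underscores inside the panoid intact.
    parts = stem.split('_', 7)
    if len(parts) != 8:
        raise ValueError(f"unexpected file name: {path}")
    city, place_id, year, month, bearing, lat, lon, panoid = parts
    return {
        'city': city,
        'place_id': int(place_id),
        'year': int(year),
        'month': int(month),
        'bearing': int(bearing),
        'latitude': float(lat),
        'longitude': float(lon),
        'panoid': panoid,
    }


meta = parse_gsv_filename(
    'Boston_0006385_2018_09_117_42.37602467319898_-71.0689666533628_NWx_VsRKGwOQnvV8Gllyog.JPG'
)
print(meta)
```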

The dataset is organised as follows:

├── Images
│   ├── Paris
│   │   ├── ...
│   │   ├── PRS_0000003_2015_05_584_48.79733778544615_2.231461206488333_7P0FnGV3k4Fmtw66b8_-Gg.JPG
│   │   ├── PRS_0000003_2018_05_406_48.79731397404108_2.231417994064803_R2vU9sk2livhkYbhy8SFfA.JPG
│   │   ├── PRS_0000003_2019_07_411_48.79731121699659_2.231424930041198_bu4vOZzw3_iU5QxKiQciJA.JPG
│   │   ├── ...
│   ├── Boston
│   │   ├── ...
│   │   ├── Boston_0006385_2015_06_121_42.37599246498178_-71.06902130162344_2MyXGeslIiua6cMcDQx9Vg.JPG
│   │   ├── Boston_0006385_2018_09_117_42.37602467319898_-71.0689666533628_NWx_VsRKGwOQnvV8Gllyog.JPG
│   │   ├── ...
│   ├── Quebec
│   │   ├── ...
│   ├── ...
└── Dataframes
    ├── Paris.csv
    ├── London.csv
    ├── Quebec.csv
    ├── ...

Each dataframe contains the metadata of its corresponding city. This makes it possible to access the dataset almost instantly using Pandas. For example, here are 5 rows from London.csv:

place_id  year  month  northdeg  city_id  lat      lon         panoid
130       2018  4      15        London   51.4861  -0.0895151  6jFjb3wGyCkcBfq4k559ag
6793      2016  7      2         London   51.5187  -0.160767   Ff3OtsS4ihGSPdPjtlpEUA
9292      2018  12     89        London   51.531   -0.12702    0t-xcCsazIGAjdNC96IF0w
7660      2015  6      97        London   51.5233  -0.158693   zFbmpj8jt8natu7IPYrh_w
8713      2008  9      348       London   51.5281  -0.127114   W3KMPec54NBqLMzmZmGv-Q

If we want only places that are depicted by at least 8 images each, we can filter the dataset using pandas as follows:

import pandas as pd

df = pd.read_csv('London.csv')
df = df[df.groupby('place_id')['place_id'].transform('size') >= 8]

Notice that, given a Dataframe row, we can directly read its corresponding image. The first row of the above example corresponds to the image named ./Images/London/London_0000130_2018_04_015_51.4861_-0.0895151_6jFjb3wGyCkcBfq4k559ag.JPG.
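The mapping from a row to its image path can be sketched as follows, using the zero padding visible in the example name above (7 digits for the place ID, 2 for the month, 3 for the bearing). The helper `image_path` is hypothetical and rests on several assumptions: the folder name and the file prefix both equal `city_id` (the Paris example in the tree above uses the prefix PRS, so this may not hold for every city), and the latitude and longitude are stored in the CSV with the same digits that appear in the file name.

```python
def image_path(row, root='./Images'):
    """Build the image path for one metadata row (hypothetical helper).

    `row` can be a dict or a pandas Series with the Dataframe column names.
    """
    name = (
        f"{row['city_id']}_{int(row['place_id']):07d}_{int(row['year'])}_"
        f"{int(row['month']):02d}_{int(row['northdeg']):03d}_"
        f"{row['lat']}_{row['lon']}_{row['panoid']}.JPG"
    )
    return f"{root}/{row['city_id']}/{name}"


# First row of the London.csv example.
row = {
    'place_id': 130, 'year': 2018, 'month': 4, 'northdeg': 15,
    'city_id': 'London', 'lat': 51.4861, 'lon': -0.0895151,
    'panoid': '6jFjb3wGyCkcBfq4k559ag',
}
print(image_path(row))
```

If the coordinates in the CSV carry more digits than a float prints by default, reading the `lat` and `lon` columns as strings would be a safer way to keep them identical to the file names.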

We can, for example, query the dataset with only places that are in the northern hemisphere, taken between 2012 and 2016 during the month of July, each depicted by at least 16 images.
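Such a query could look like the sketch below, which uses the column names from the Dataframes. The function name `select_places` is illustrative. It assumes that "northern hemisphere" means a positive latitude and that the minimum image count applies to the images that remain after the year and month filters. If the count should instead refer to all images of a place, the group size would need to be computed before filtering. A small toy dataframe stands in for a real city file here. With the dataset on disk, `df` would come from `pd.read_csv` on one of the files in Dataframes.

```python
import pandas as pd


def select_places(df, min_images=16):
    """Keep northern-hemisphere images taken in July between 2012 and 2016,
    restricted to places that still have at least `min_images` such images."""
    mask = (df['lat'] > 0) & df['year'].between(2012, 2016) & (df['month'] == 7)
    subset = df[mask]
    # Number of remaining images per place, aligned with the rows of `subset`.
    counts = subset.groupby('place_id')['place_id'].transform('size')
    return subset[counts >= min_images]


# Toy metadata: place 1 has three matching images, place 2 only one.
toy = pd.DataFrame({
    'place_id': [1, 1, 1, 2, 2],
    'year':     [2012, 2014, 2016, 2013, 2019],
    'month':    [7, 7, 7, 7, 7],
    'lat':      [51.5, 51.5, 51.5, 48.8, 48.8],
})
selected = select_places(toy, min_images=3)
print(selected)
```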

Stay tuned for tutorials in the coming weeks.

Cite

Use the following BibTeX entry to cite our paper:

@article{ali2022gsv,
  title={GSV-Cities: Toward appropriate supervised visual place recognition},
  author={Ali-bey, Amar and Chaib-draa, Brahim and Gigu{\`e}re, Philippe},
  journal={Neurocomputing},
  volume={513},
  pages={194--203},
  year={2022},
  publisher={Elsevier}
}