Visual Place Recognition (VPR) is a content-based image retrieval task in which, given a database of images and a query image, the goal is to return the image in the database that is closest in geographic location to the query image.[1] The task primarily concerns real-world images of outdoor urban locations, but can also be applied to indoor environments. The modern approach to VPR is to train machine learning algorithms that extract features encoding the geographic information of an image.[2] VPR is primarily used in robotics and self-driving applications for localization, mapping, and planning.

Problem definition
The VPR task is most commonly framed as a content-based image retrieval task, in which a query image must be matched to an image in a database.[1] Queries are matched to database images based on whether they depict the same "place." The term "place" has been defined differently across the field: some experts define a "place" by the location of the camera regardless of its orientation, while others argue that images containing overlapping elements should constitute a "place" match.[3] Places can vary in size depending on the use case of the VPR solution. A match is considered successful based on ground truth metrics associated with the images, which can include GPS location, camera pose, or human labelling.[2] For GPS location, a match is successful if the query image was taken within a specified radius of the database image. Camera pose matches are determined using relative pose error. With human labelling, matching is treated as a classification task, and a match is successful if the label of the query image matches the ground truth label.
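The GPS criterion can be illustrated with a minimal sketch: a retrieved image counts as a correct match if it lies within a fixed radius of the query's GPS position. The function names, example coordinates, and the 25 m radius below are illustrative assumptions; the actual threshold varies per benchmark.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two GPS coordinates, in metres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371000 * asin(sqrt(a))  # mean Earth radius of about 6,371 km

def is_match(query_gps, retrieved_gps, radius_m=25.0):
    """Treat the retrieval as correct if the retrieved image lies within
    radius_m metres of the query; the threshold is benchmark-dependent."""
    return haversine_m(*query_gps, *retrieved_gps) <= radius_m

# Example: a retrieved image roughly 18 m from the query counts as a correct match.
print(is_match((35.6595, 139.7005), (35.65966, 139.70055)))
```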
History
The concept of "place recognition" has its roots in psychology and neuroscience. Early 20th-century research into navigation and wayfinding explored how animals recognize their surroundings and orient themselves.[4] Studies in rats found specific place cells that activated when the animals visited a known environment and that updated based on new visual information.[5] This prompted studies of human navigation, which investigated how landmarks, spatial memory, and relative distance affected models of place recognition.[6] These works introduced the concept of "features" in the environment as important characteristics that could be used to define a location, and proposed that these features could be learned in order to recognize the location.[7] Most experiments had human participants navigate an area and then recall the location of a specific place in the environment. While mostly unrelated to the image retrieval task, this research laid the groundwork for place recognition as a concept in navigation.
Place recognition began emerging as a computer vision task in the 1990s, introduced in the context of robot navigation and localization as a means of building maps of an environment.[8][9] Visual place recognition then developed explicitly as an image retrieval task, used to recognize whether a robot had already seen a location while building a map. The problem was initially addressed using image signatures, an early form of image feature based on handcrafted pixel computations, to describe and compare images.[10] In the early 2000s, advances in image feature extraction using algorithms such as PCA, SIFT, and SURF improved visual place recognition results.[11][12] This marked the point at which visual place recognition was investigated as its own task, outside the scope of robotics mapping and localization.[13]
The advent of neural networks as feature extractors changed the common approach to VPR.[2] Research began to focus on training deep learning networks to perform feature extraction rather than relying on earlier handcrafted algorithms. Originally used for image classification, convolutional neural networks (CNNs) offered a more powerful method of feature extraction that generalizes to other tasks, including place recognition. These CNN approaches outperformed older techniques and became the standard for the VPR task.[14][15] Transformer models have recently been applied to VPR and have proved promising both for feature extraction and for re-ranking matching images.[16][17]
Architectures
Modern VPR solutions are deep neural networks that consist of three main components: a feature extractor, a feature aggregator, and a match ranking method.[2] VPR is commonly performed using local image features of different sections of the image, which are extracted using a deep learning architecture such as a CNN or transformer. A feature aggregator condenses these local features into a single vector representation. Handcrafted feature aggregators such as VLAD[18] were previously considered state-of-the-art, but have since been replaced with learned neural network aggregators such as NetVLAD.[19] This vector representation is then used to compare the query image to the images in the database via a similarity search, based on a metric such as Euclidean distance or cosine similarity. The results are ranked by their vector similarity and then re-ranked using methods such as spatial verification. Research into the VPR task usually focuses on upgrading the feature extractor,[20] improving aggregator clustering,[21] or refining the data labelling of images in the database during training.[22] Other advancements focus on the re-ranking module,[17] or attempt to remove the re-ranking process entirely.[23]
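The retrieval step of this pipeline amounts to a nearest-neighbour search over the aggregated descriptors. The sketch below illustrates it with cosine similarity over random stand-in descriptors; it is a minimal example that assumes the descriptors have already been produced by an extractor and aggregator (such as a CNN backbone followed by NetVLAD) and omits the re-ranking stage.

```python
import numpy as np

def cosine_rank(query_desc, db_descs, top_k=5):
    """Rank database images by cosine similarity to the query descriptor.

    query_desc: (D,) global descriptor of the query image.
    db_descs:   (N, D) matrix of global descriptors for the database images.
    Returns the indices and similarities of the top_k candidate matches.
    """
    q = query_desc / np.linalg.norm(query_desc)
    db = db_descs / np.linalg.norm(db_descs, axis=1, keepdims=True)
    sims = db @ q                       # cosine similarity to every database image
    order = np.argsort(-sims)[:top_k]   # best candidates first
    return order, sims[order]

# Illustrative use with random stand-in descriptors; a real system would
# compute them from images and typically re-rank the candidates afterwards,
# e.g. by spatial verification of local feature matches.
rng = np.random.default_rng(0)
db_descs = rng.normal(size=(1000, 256)).astype(np.float32)
query_desc = db_descs[42] + 0.05 * rng.normal(size=256).astype(np.float32)
indices, scores = cosine_rank(query_desc, db_descs)
print(indices[0])  # expected to be 42, the perturbed database entry
```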
Applications
VPR has primarily been used in robotics applications for localization and mapping during navigation.[1] It is used within SLAM algorithms, in conjunction with topological or metric maps, to determine whether a robot has already seen an area during exploration or navigation. This allows the robot to build a map of the environment from visual information alone, without additional sensors such as LiDAR or GPS, although VPR can also be combined with such sensors for a more robust approach to localization. VPR models have been deployed on a variety of autonomous agents, including ground vehicles, aerial vehicles,[24] and underwater robots.[25] Computational limitations when deploying on physical robots have made efficiency a focus of modern VPR research.
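In a SLAM context, this revisit decision is often reduced to comparing the current image's descriptor against descriptors of previously mapped places. The sketch below shows one way such a check could look; the 0.85 similarity threshold and the one-descriptor-per-place memory are simplifying assumptions, and a real system would verify candidates geometrically before adding a loop-closure constraint.

```python
import numpy as np

class PlaceMemory:
    """Minimal loop-closure check: stores one global descriptor per visited
    place and flags a revisit when a new descriptor is similar enough.

    The threshold value is an illustrative assumption; a SLAM system would
    tune it and geometrically verify candidates before closing a loop.
    """

    def __init__(self, threshold=0.85):
        self.threshold = threshold
        self.descriptors = []  # one L2-normalised descriptor per stored place

    def query_or_add(self, desc):
        desc = desc / np.linalg.norm(desc)
        if self.descriptors:
            sims = np.stack(self.descriptors) @ desc
            best = int(np.argmax(sims))
            if sims[best] >= self.threshold:
                return best            # revisited place: loop-closure candidate
        self.descriptors.append(desc)
        return None                    # new place added to the map
```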
Outside the domain of robotics, VPR has been studied by Akihiko Torii et al. using images of city scenes taken with mobile phone cameras.[26] Torii used Google Street View panoramas to train a VPR model, which was then evaluated on a dataset of phone camera images taken across Tokyo with varying lighting and scene changes. Torii notes potential uses of VPR in searching for images of a specific location for architectural or urban planning studies, or in modelling an area's change over time. In the domain of city identity recognition, a classification task similar to VPR, a 2026 study examined potential sources of bias in geotagged images such as those from Google Street View.[27] The study finds that reproducibility is difficult for city recognition because cities within the same country resemble one another, camera quality and image conditions vary by country, and different cameras provide better features for the task. It advocates careful data sampling when using geotagged images so that this inherent bias can be accounted for.
References
1. Lowry, Stephanie; Sünderhauf, Niko; Newman, Paul; Leonard, John J.; Cox, David; Corke, Peter; Milford, Michael J. (2015-11-26). "Visual Place Recognition: A Survey". IEEE Transactions on Robotics. 32 (1): 1–19. doi:10.1109/TRO.2015.2496823. ISSN 1941-0468.
2. Masone, Carlo; Caputo, Barbara (2021). "A Survey on Deep Visual Place Recognition". IEEE Access. 9: 19516–19547. doi:10.1109/ACCESS.2021.3054937. ISSN 2169-3536.
3. Garg, Sourav; Fischer, Tobias; Milford, Michael (2021-08-09). "Where Is Your Place, Visual Place Recognition?". International Joint Conferences on Artificial Intelligence. 5: 4416–4425. doi:10.24963/ijcai.2021/603.
4. Rabaud, Etienne (1928). How Animals Find Their Way about: A Study of Distant Orientation and Place-recognition. K. Paul, Trench, Trubner & Company, Limited.
5. "APA PsycNet". psycnet.apa.org. Archived from the original on 2024-04-20. Retrieved 2025-11-23.
6. McClelland, James L.; Rumelhart, David E. (1987). "Biologically Plausible Models of Place Recognition and Goal Location". Parallel Distributed Processing: Explorations in the Microstructure of Cognition: Psychological and Biological Models. MIT Press. pp. 432–470. ISBN 978-0-262-29126-2. Retrieved 2025-10-26.
7. Golledge, Reginald G. (1992-05-01). "Place recognition and wayfinding: Making sense of space". Geoforum. 23 (2): 199–214. doi:10.1016/0016-7185(92)90017-X. ISSN 0016-7185.
8. Kuipers, Benjamin; Byun, Yung-Tai (1991-11-01). "A robot exploration and mapping strategy based on a semantic hierarchy of spatial representations". Robotics and Autonomous Systems. Special Issue Toward Learning Robots. 8 (1): 47–63. doi:10.1016/0921-8890(91)90014-C. ISSN 0921-8890.
9. Kortenkamp, David Michael (1993). Cognitive maps for mobile robots: A representation for mapping and navigation (PhD thesis). University of Michigan. Retrieved 2025-11-22.
10. Engelson, Sean Philip (1994). Passive map learning and visual place recognition (PhD thesis). Yale University. Retrieved 2025-11-22.
11. Ulrich, I.; Nourbakhsh, I. (2000-04-24). "Appearance-based place recognition for topological localization". Proceedings 2000 ICRA. Millennium Conference. IEEE International Conference on Robotics and Automation. 2: 1023–1029. doi:10.1109/ROBOT.2000.844734.
12. Ullah, M. M.; Pronobis, A.; Caputo, B.; Luo, J.; Jensfelt, P.; Christensen, H. I. (2008-05-19). "Towards robust place recognition for robot localization". 2008 IEEE International Conference on Robotics and Automation: 530–537. doi:10.1109/ROBOT.2008.4543261.
13. Knopp, Jan; Sivic, Josef; Pajdla, Tomas (2010). "Avoiding Confusing Features in Place Recognition". In Daniilidis, Kostas; Maragos, Petros; Paragios, Nikos (eds.). Computer Vision – ECCV 2010. Berlin, Heidelberg: Springer: 748–761. doi:10.1007/978-3-642-15549-9_54. ISBN 978-3-642-15549-9.
14. Chen, Zetao; Jacobson, Adam; Sünderhauf, Niko; Upcroft, Ben; Liu, Lingqiao; Shen, Chunhua; Reid, Ian; Milford, Michael (2017-05-29). "Deep learning features at scale for visual place recognition". 2017 IEEE International Conference on Robotics and Automation (ICRA): 3223–3230. doi:10.1109/ICRA.2017.7989366.
15. Lopez-Antequera, Manuel; Gomez-Ojeda, Ruben; Petkov, Nicolai; Gonzalez-Jimenez, Javier (2017-06-01). "Appearance-invariant place recognition by discriminatively training a convolutional neural network". Pattern Recognition Letters. 92: 89–95. doi:10.1016/j.patrec.2017.04.017. ISSN 0167-8655.
16. Wang, Yuwei; Qiu, Yuanying; Cheng, Peitao; Zhang, Junyu (2022-10-05). "Hybrid CNN-Transformer Features for Visual Place Recognition". IEEE Transactions on Circuits and Systems for Video Technology. 33 (3): 1109–1122. doi:10.1109/TCSVT.2022.3212434. ISSN 1558-2205.
17. Zhu, Sijie; Yang, Linjie; Chen, Chen; Shah, Mubarak; Shen, Xiaohui; Wang, Heng (2023). "R2Former: Unified Retrieval and Reranking Transformer for Place Recognition". Conference on Computer Vision and Pattern Recognition: 19370–19380.
18. Jégou, Hervé; Douze, Matthijs; Schmid, Cordelia; Pérez, Patrick (2010-06-13). "Aggregating local descriptors into a compact image representation". Conference on Computer Vision and Pattern Recognition: 3304–3311. doi:10.1109/CVPR.2010.5540039.
19. Arandjelovic, Relja; Gronat, Petr; Torii, Akihiko; Pajdla, Tomas; Sivic, Josef (2016). "NetVLAD: CNN Architecture for Weakly Supervised Place Recognition". Conference on Computer Vision and Pattern Recognition: 5297–5307.
20. Chen, Zetao; Jacobson, Adam; Sünderhauf, Niko; Upcroft, Ben; Liu, Lingqiao; Shen, Chunhua; Reid, Ian; Milford, Michael (2017-05-29). "Deep learning features at scale for visual place recognition". 2017 IEEE International Conference on Robotics and Automation (ICRA): 3223–3230. doi:10.1109/ICRA.2017.7989366.
21. Izquierdo, Sergio; Civera, Javier (2024). "Optimal Transport Aggregation for Visual Place Recognition". Conference on Computer Vision and Pattern Recognition: 17658–17668.
22. Berton, Gabriele; Trivigno, Gabriele; Caputo, Barbara; Masone, Carlo (2023). "EigenPlaces: Training Viewpoint Robust Models for Visual Place Recognition". International Conference on Computer Vision: 11080–11090.
23. Leyva-Vallina, María; Strisciuglio, Nicola; Petkov, Nicolai (2024-05-13). "Regressing Transformers for Data-efficient Visual Place Recognition". 2024 IEEE International Conference on Robotics and Automation (ICRA): 15898–15904. doi:10.1109/ICRA57147.2024.10611288.
24. Zaffar, Mubariz; Khaliq, Ahmad; Ehsan, Shoaib; Milford, Michael; Alexis, Kostas; McDonald-Maier, Klaus (2019-05-22). Are State-of-the-art Visual Place Recognition Techniques any Good for Aerial Robotics?. arXiv:1904.07967. doi:10.48550/arXiv.1904.07967. Retrieved 2025-11-23.
25. Li, Jie; Eustice, Ryan M.; Johnson-Roberson, Matthew (2015-05-26). "High-level visual features for underwater place recognition". 2015 IEEE International Conference on Robotics and Automation (ICRA): 3652–3659. doi:10.1109/ICRA.2015.7139706.
26. Torii, Akihiko; Arandjelovic, Relja; Sivic, Josef; Okutomi, Masatoshi; Pajdla, Tomas (2015). "24/7 Place Recognition by View Synthesis". Conference on Computer Vision and Pattern Recognition: 1808–1817.
27. Zhang, Xiang; Yang, Fan; He, Zongze; Li, Weijia; Yang, Min (2026-01-01). "City identity recognition: how representation bias influences model predictability and replicability?". Computers, Environment and Urban Systems. 123: 102370. doi:10.1016/j.compenvurbsys.2025.102370. ISSN 0198-9715.