Relative Geometry-Aware Siamese Neural Network for 6DOF Camera Relocalization


6DOF camera relocalization is an important component of autonomous driving and navigation. Deep learning has recently emerged as a promising technique to tackle this problem. In this paper, we present a novel relative geometry-aware Siamese neural network to enhance the performance of deep learning-based methods through explicitly exploiting the relative geometry constraints between images. We perform multi-task learning and predict the absolute and relative poses simultaneously. We regularize the shared-weight twin networks in both the pose and feature domains to ensure that the estimated poses are globally as well as locally correct. We employ metric learning and design a novel adaptive metric distance loss to learn a feature that is capable of distinguishing poses of visually similar images from different locations. We evaluate the proposed method on public indoor and outdoor benchmarks and the experimental results demonstrate that our method can significantly improve localization performance. Furthermore, extensive ablation evaluations are conducted to demonstrate the effectiveness of different terms of the loss function.

The problem

Global Positioning System (GPS) has been widely used for vehicle localization, but its accuracy significantly decreases in urban areas where tall buildings block or weaken its signals. Many image-based methods have been proposed to complement GPS. They provide position and orientation information based either on image retrieval [1], [2], [3], [4], [5] or 3D model reconstruction [6]. However, these methods face many challenges, including high storage overheads, low computational efficiency, and sensitivity to image variations, especially for large scenes.

Recently, rapid progress in machine learning, particularly deep learning, has produced a number of deep learning-based methods [7], [8], [9], [10], [11], [12], [13], [14], [15]. They have attained good performance in addressing the aforementioned challenges, but their accuracies are not as good as those of traditional methods. Another severe problem of deep learning-based methods is that they fail to distinguish two different locations that contain similar objects or scenes.

Methods

In this paper, we present a novel relative geometry-aware Siamese neural network, which explicitly exploits the relative geometry constraints between images to regularize the network. We improve the localization accuracy and enhance the ability of the network to distinguish locations with similar images. This is achieved with three key new ideas:

1) We design a novel Siamese neural network that explicitly learns the global poses of a pair of images. We constrain the estimated global poses with the actual relative pose between the pair of images (a formulation sketch is given after this list).

2) We perform multi-task learning to estimate the absolute and relative poses simultaneously to ensure that the predicted poses are correct both globally and locally.

3) We employ metric learning and design an adaptive metric distance loss to learn feature representations that are capable of distinguishing the poses of visually similar images from different locations, thus improving the overall pose estimation accuracy (a sketch of such a loss follows the list).
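
As a concrete illustration of ideas 1) and 2), the ground-truth relative pose between a pair of images can be derived from their absolute poses, and the absolute and relative errors can then be combined into one multi-task objective. The formulation below is a sketch only; the exact loss terms and the weights \beta and \lambda are illustrative assumptions rather than the paper's definitive choice. Writing the absolute pose of image i \in \{1, 2\} as a translation t_i and a unit quaternion q_i, the ground-truth relative pose is

t_{12} = R(q_1)^{\top} (t_2 - t_1), \qquad q_{12} = q_1^{-1} \otimes q_2,

where R(q_1) is the rotation matrix of q_1 and \otimes denotes quaternion multiplication. A multi-task loss over the predicted absolute poses (\hat{t}_i, \hat{q}_i) and the predicted relative pose (\hat{t}_{12}, \hat{q}_{12}) can then take the PoseNet-style form

\mathcal{L} = \sum_{i=1}^{2} \left( \lVert \hat{t}_i - t_i \rVert_2 + \beta \, \lVert \hat{q}_i - q_i \rVert_2 \right) + \lambda \left( \lVert \hat{t}_{12} - t_{12} \rVert_2 + \beta \, \lVert \hat{q}_{12} - q_{12} \rVert_2 \right),

so the shared-weight twins are supervised both globally (absolute terms) and locally (relative term).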

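For idea 3), the sketch below shows one plausible form of an adaptive metric distance loss, assuming a contrastive-style objective in which the margin separating the feature embeddings of two images adapts to their ground-truth pose distance. The function name, the adaptation rule and the hyper-parameters are illustrative assumptions, not the paper's definitive implementation.

import torch
import torch.nn.functional as F

def adaptive_metric_distance_loss(feat1, feat2, pose_dist, alpha=1.0, tau=0.5):
    """Contrastive-style sketch: embeddings of images taken from nearby poses
    are pulled together, while embeddings of visually similar images taken
    from distant poses are pushed apart by a margin that grows with the
    ground-truth pose distance (the adaptation rule is an assumption)."""
    d = F.pairwise_distance(feat1, feat2)                 # distance in feature space
    margin = tau + alpha * pose_dist                      # margin adapts to pose distance
    near = (pose_dist < tau).float()                      # pairs whose poses are close
    loss_near = near * d.pow(2)                           # pull nearby-pose pairs together
    loss_far = (1.0 - near) * F.relu(margin - d).pow(2)   # push distant-pose pairs apart
    return (loss_near + loss_far).mean()

Under such a scheme, two visually similar images taken at different locations cannot be mapped onto the same embedding, because the required separation grows with the actual distance between their camera poses.
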
Results

We compare the results of the proposed method with those of state-of-the-art deep learning-based methods such as PoseNet, Bayesian PoseNet, PoseNet2, Hourglass-net, LSTM-Net and RelNet on the 7Scene dataset, and with PoseNet, Bayesian PoseNet, PoseNet2 and LSTM-Net on the Cambridge Landmarks dataset. Following prior work, we report the median error for each scene. We also compare the average median accuracy over all scenes in each dataset. The comparative results are shown in Table I and Table II. Table I shows the results for the 7Scene dataset. Compared with 7 state-of-the-art deep learning-based camera relocalization methods, the proposed method achieves the best positional accuracy in all 7 scenes. Our method improves the average median positional accuracy by 16% over the best reported result. It is interesting to note that our method obtains an even better result than PoseNet2, which utilizes a 3D reference as an additional constraint.

For orientational accuracy, we achieve the best result compared to methods based on direct regression. It is not surprising that the results are not as good as those of PoseNet2 and RelNet, since PoseNet2 requires additional 3D models and RelNet triangulates the pose from all reference images by estimating relative poses instead of directly regressing the result.

Table II shows the results for the Cambridge Landmarks dataset. It can be seen that our method obtains the best positional accuracy on the KingsCollege and the ShopFacade scenes, reaching accuracies of 0.865m and 0.834m respectively. We improve the state-of-the-art orientational accuracy of the OldHospital and the StMarysChurch scenes from 3.29° and 3.32° to 2.42° and 2.98°, achieving 26% and 10% improvement respectively. The average positional accuracy over all scenes is improved from 1.30m to 1.24m. The average orientational accuracy over all scenes is only a little worse than that of PoseNet2, which is trained with 3D model constraints.
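For clarity, the quoted improvements are relative reductions of the median error; for example, for the orientation errors on OldHospital and StMarysChurch,

(3.29° - 2.42°) / 3.29° ≈ 26%  and  (3.32° - 2.98°) / 3.32° ≈ 10%.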

It is interesting to note that, of all the methods presented in the two tables, some perform better in positional accuracy and some in orientational accuracy; no single method comprehensively beats the others in both measures. Our method achieves the best average positional accuracy amongst all methods on both datasets. For orientational accuracy, our method achieves competitive results, which are only slightly worse than those of the best method (PoseNet2) but better than or at least as good as those of the other methods.

REFERENCES

  

[1]  A. C. Murillo and J. Kosecka, “Experiments in place recognition using gist panoramas,” in Computer Vision Workshops (ICCV Workshops), 2009 IEEE 12th International Conference on. IEEE, 2009, pp. 2196–2203. 


[2]  T. Sattler, T. Weyand, B. Leibe, and L. Kobbelt, “Image retrieval for image-based localization revisited.” in BMVC, vol. 1, no. 2, 2012, p. 4. 


[3]  I. Ulrich and I. Nourbakhsh, “Appearance-based place recognition for topological localization,” in Robotics and Automation, 2000. Proceedings. ICRA’00. IEEE International Conference on, vol. 2. IEEE, 2000, pp. 1023–1029.


[4]  J. Wolf, W. Burgard, and H. Burkhardt, “Robust vision-based localization by combining an image-retrieval system with Monte Carlo localization,” IEEE Transactions on Robotics, vol. 21, no. 2, pp. 208–216, 2005.


[5]  ——, “Robust vision-based localization for mobile robots using an image retrieval system based on invariant features,” in Robotics and Automation, 2002. Proceedings. ICRA’02. IEEE International Conference on, vol. 1. IEEE, 2002, pp. 359–365.


[6]  Z. Kukelova, M. Bujnak, and T. Pajdla, “Real-time solution to the absolute pose problem with unknown radial distortion and focal length,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 2816–2823. 


[7]  T. Weyand, I. Kostrikov, and J. Philbin, “Planet-photo geolocation with convolutional neural networks,” in European Conference on Computer Vision. Springer, 2016, pp. 37–55.

[8]  A. Kendall, M. Grimes, and R. Cipolla, “Posenet: A convolutional network for real-time 6-dof camera relocalization,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 2938–2946. 


[9]  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9. 


[10] A. Kendall and R. Cipolla, “Modelling uncertainty in deep learning for camera relocalization,” arXiv preprint arXiv:1509.05909, 2015.

[11] I. Melekhov, J. Ylioinas, J. Kannala, and E. Rahtu, “Image-based localization using hourglass networks,” in Computer Vision Workshop (ICCVW), 2017 IEEE International Conference on. IEEE, 2017, pp. 870–877.

[12] F. Walch, C. Hazirbas, L. Leal-Taixe, T. Sattler, S. Hilsenbeck, and D. Cremers, “Image-based localization using lstms for structured feature correlation,” in Int. Conf. Comput. Vis.(ICCV), 2017, pp. 627–637.

[13] R. Clark, S. Wang, A. Markham, N. Trigoni, and H. Wen, “Vidloc: A deep spatio-temporal model for 6-dof video-clip relocalization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 3, 2017.

[14] A. Kendall, R. Cipolla et al., “Geometric loss functions for camera pose regression with deep learning,” in Proc. CVPR, vol. 3, 2017, p. 8.

[15] S. Brahmbhatt, J. Gu, K. Kim, J. Hays, and J. Kautz, “Geometry-aware learning of maps for camera localization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2616–2625.

[16] Y. Kalantidis, G. Tolias, Y. Avrithis, M. Phinikettos, E. Spyrou, P. Mylonas, and S. Kollias, “Viral: Visual image retrieval and localization,” Multimedia Tools and Applications, vol. 51, no. 2, pp. 555–592, 2011.

[17] Y.-H. Lee and Y. Kim, “Efficient image retrieval using advanced surf and dcd on mobile platform,” Multimedia Tools and Applications, vol. 74, no. 7, pp. 2289–2299, 2015.

[18] X. Li, M. Larson, and A. Hanjalic, “Geo-distinctive visual element matching for location estimation of images,” IEEE Transactions on Multimedia, vol. 20, no. 5, pp. 1179–1194, 2018.

[19] B. J. Kröse, N. Vlassis, R. Bunschoten, and Y. Motomura, “A probabilistic model for appearance-based robot localization,” Image and Vision Computing, vol. 19, no. 6, pp. 381–391, 2001.

[20] E. Menegatti, M. Zoccarato, E. Pagello, and H. Ishiguro, “Image-based monte carlo localisation with omnidirectional images,” Robotics and Autonomous Systems, vol. 48, no. 1, pp. 17–30, 2004.

[21] J. Wang, H. Zha, and R. Cipolla, “Coarse-to-fine vision-based localization by indexing scale-invariant features,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 36, no. 2, pp. 413–422, 2006.

[22] J. Wang, R. Cipolla, and H. Zha, “Vision-based global localization using a visual vocabulary,” in Robotics and Automation, 2005. ICRA 2005. Proceedings of the 2005 IEEE International Conference on. IEEE, 2005, pp. 4230–4235.

[23] M. Cummins and P. Newman, “Fab-map: Probabilistic localization and mapping in the space of appearance,” The International Journal of Robotics Research, vol. 27, no. 6, pp. 647–665, 2008.

[24] A. Torii, Y. Dong, M. Okutomi, J. Sivic, and T. Pajdla, “Efficient localization of panoramic images using tiled image descriptors,” Information and Media Technologies, vol. 9, no. 3, pp. 351–355, 2014.

[25] M. Umeda and H. Date, “Spherical panoramic image-based localization by deep learning,” Transactions of the Society of Instrument and Control Engineers, vol. 54, pp. 483–493, 2018.

[26] A. Guzman-Rivera, P. Kohli, B. Glocker, J. Shotton, T. Sharp, A. Fitzgibbon, and S. Izadi, “Multi-output learning for camera relocalization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1114–1121.

[27] J. Kosecka, L. Zhou, P. Barber, and Z. Duric, “Qualitative image based localization in indoors environments,” in Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on, vol. 2. IEEE, 2003, pp. II–II.

[28] A. Oliva and A. Torralba, “Modeling the shape of the scene: A holistic representation of the spatial envelope,” International journal of computer vision, vol. 42, no. 3, pp. 145–175, 2001.

[29] N. Sünderhauf and P. Protzel, “Brief-gist-closing the loop by simple means,” in Intelligent Robots and Systems (IROS), 2011 IEEE/RSJ International Conference on. IEEE, 2011, pp. 1234–1241.

[30] G. Singh and J. Kosecka, “Visual loop closing using gist descriptors in manhattan world,” in ICRA Omnidirectional Vision Workshop, 2010.

[31] R. Arroyo, P. F. Alcantarilla, L. M. Bergasa, J. J. Yebes, and S. Gámez, “Bidirectional loop closure detection on panoramas for visual navigation,” in Intelligent Vehicles Symposium Proceedings, 2014 IEEE. IEEE, 2014, pp. 1378–1383.

[32] M. J. Milford and G. F. Wyeth, “Seqslam: Visual route-based navigation for sunny summer days and stormy winter nights,” in Robotics and Automation (ICRA), 2012 IEEE International Conference on. IEEE, 2012, pp. 1643–1649.

[33] D. G. Lowe, “Object recognition from local scale-invariant features,” in Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, vol. 2. IEEE, 1999, pp. 1150–1157.


[34] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, “Speeded-up robust features (surf),” Computer vision and image understanding, vol. 110, no. 3, pp. 346–359, 2008.


[35] J. Sivic and A. Zisserman, “Video google: A text retrieval approach to object matching in videos,” in Proceedings of the IEEE International Conference on Computer Vision. IEEE, 2003, p. 1470.


[36] H. Jégou, M. Douze, C. Schmid, and P. Pérez, “Aggregating local descriptors into a compact image representation,” in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010, pp. 3304–3311.


[37] M. Donoser and D. Schmalstieg, “Discriminative feature-to-point matching in image-based localization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 516–523.


[38] T. Sattler, M. Havlena, F. Radenovic, K. Schindler, and M. Pollefeys, “Hyperpoints and fine vocabularies for large-scale location recognition,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2102–2110.

[39] T. Sattler, B. Leibe, and L. Kobbelt, “Efficient & effective prioritized matching for large-scale image-based localization,” IEEE Transactions on Pattern Analysis & Machine Intelligence, no. 9, pp. 1744–1756, 2017.

 [40] ——, “Improving image-based localization by active correspondence search,” in European conference on computer vision. Springer, 2012, pp. 752–765.


[41] M. Uyttendaele, M. Cohen, S. Sinha, and H. Lim, “Real-time image-based 6-dof localization in large-scale environments,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2012, pp. 1043–1050.

[42] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “Orb: An efficient alternative to sift or surf,” in Computer Vision (ICCV), 2011 IEEE international conference on. IEEE, 2011, pp. 2564–2571.

[43] J. Shotton, B. Glocker, C. Zach, S. Izadi, A. Criminisi, and A. Fitzgibbon, “Scene coordinate regression forests for camera relocalization in rgb-d images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 2930–2937.

[44] J. Valentin, M. Nießner, J. Shotton, A. Fitzgibbon, S. Izadi, and P. H. Torr, “Exploiting uncertainty in regression forests for accurate camera relocalization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4400–4408.

[45] E. Brachmann, A. Krull, S. Nowozin, J. Shotton, F. Michel, S. Gumhold, and C. Rother, “Dsac-differentiable ransac for camera localization,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 3, 2017.

[46] Z. Laskar, I. Melekhov, S. Kalia, and J. Kannala, “Camera relocalization by computing pairwise relative poses using convolutional neural network,” in Computer Vision Workshop (ICCVW), 2017 IEEE International Conference on. IEEE, 2017, pp. 920–929.

[47] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah, “Signature verification using a ‘siamese’ time delay neural network,” in Advances in Neural Information Processing Systems, 1994, pp. 737–744.

[48] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.

[49] A. Bellet, A. Habrard, and M. Sebban, “A survey on metric learning for feature vectors and structured data,” arXiv preprint arXiv:1306.6709, 2013.

[50] M. Norouzi, D. J. Fleet, and R. R. Salakhutdinov, “Hamming distance metric learning,” in Advances in neural information processing systems, 2012, pp. 1061–1069.

[51] S. Chopra, R. Hadsell, and Y. LeCun, “Learning a similarity metric discriminatively, with application to face verification,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1. IEEE, 2005, pp. 539–546.

[52] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon, “Kinectfusion: Real-time dense surface mapping and tracking,” in Mixed and Augmented Reality (ISMAR), 2011 10th IEEE International Symposium on. IEEE, 2011, pp. 127–136.

[53] C. Wu et al., “Visualsfm: A visual structure from motion system,” 2011.

[54] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010, pp. 249–256.
[55] R. Hadsell, S. Chopra, and Y. LeCun, “Dimensionality reduction by learning an invariant mapping,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2006, pp. 1735–1742.

[56] F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
[57] A. Valada, N. Radwan, and W. Burgard, “Deep auxiliary learning for visual localization and odometry,” arXiv preprint arXiv:1803.03642, 2018.

[58] N. Radwan, A. Valada, and W. Burgard, “Vlocnet++: Deep multitask learning for semantic visual localization and odometry,” arXiv preprint arXiv:1804.08366, 2018.