Spatial AI

Simultaneous Localisation And Mapping (SLAM) has long been recognised as a core problem to be solved within countless emerging mobile applications that require intelligent interaction or navigation in an environment. Classical solutions to the problem primarily aim at localisation and reconstruction of a geometric 3D model of the scene. More recently, the community increasingly investigates the development of Spatial Artificial Intelligence (Spatial AI), an evolutionary paradigm pursuing a simultaneous recovery of object-level composition and semantic annotations of the recovered 3D model. It plays an important role in many emerging applications in which a smart mobile system is required to interact with the environment. Examples are given by intelligent transportation, robotics, or intelligence augmentation devices.

Object-level SLAM

One example of an object-level SLAM framework is given by Deep-SLAM++. It employs object detectors to load and register full 3D models of objects. However, rather than just using fixed CAD models, we extend the idea to environments with unknown objects and only impose object priors by employing modern class-specific neural networks to generate complete model geometry proposals. The difficulty of using such predictions in a real SLAM scenario is that the prediction performance depends on the view-point and measurement quality, with even small changes of the input data sometimes leading to a large variability in the network output. We propose two strategies to deal with this problem.

  • The first one consists of embedding the deep shape representation into a traditional measurement residual minimization framework.

    L Hu, Y Cao, P Wu, and L Kneip. Dense object reconstruction from rgbd images with embedded deep shape representations. In Asian Conference on Computer Vision (ACCV), Workshop on RGB-D - sensing and understanding via combined colour and depth, Perth, Australia, December 2018a [pdf]

  • The second one consists of a discrete selection strategy that finds the best among multiple proposals from different registered views by again checking the agreement with the online depth measurements. The result–Deep-SLAM++–is an effective object-level RGBD SLAM system that produces compact, high-fidelity, and dense 3D maps with semantic annotations.

    L Hu, W Xu, K Huang, and L Kneip. Deep-SLAM++: object-level RGBD SLAM based on class-specific deep shape priors. ArXiv e-prints, 2019 [pdf]

Global localization

Autonomous Valet Parking (AVP) in an under-ground garage is an emerging smart vehicle solution that the community believes to be solvable with close-to-market sensors. Absence of GPS signals and a high degree of self similarity however render global visual localization in such environments a highly challenging problem. We present a novel underground parking localization method that relies on text recognition in the wild as well as optical character recognition (OCR) to automatically detect parking slot numbers. The detected numbers are then correlated with both geometric as well as semantic information extracted from an offline map of the environment. The resulting measurement model is embedded into a probabilistic Monte-Carlo localization framework. The success of our method is demonstrated on multiple real-world sequences in one of the largest underground parking garages in Shanghai.

L. Cui, C. Rong, J. Huang, A. Rosendo, and L. Kneip. Monte-Carlo Localization in Underground Parking Lots Using Parking Slot Numbers. In Proceedings of the IEEE/RSJ Conference on Intelligent Robots and Systems (IROS), 2021. [youtube] [bilibili]

Further industry translation

Our Spatial AI approaches have inspired the development of tailored localization solutions for applications in underground parking environments (AVP) at Motovis Intelligent Technologies Ltd. here in Shanghai, where MPL alumni are now working. The solutions perform localization on top of a high-definition map. They run in real-time on the embedded platform and make use of FPGA units for front-end feature detection. The maps are generated onboard. A mapping module takes the detected features and maps them into a high-definition 3D map of the environment containing semantic features such as parking spaces, pillars, and dashed lines. After mapping is concluded, the tracking module can then reuse these features in order to re-localize in the environment and enable solutions such as home-zone parking (HZP) and autonomous valet parking (AVP). The solutions are included into a prototype demo that show-cases fully autonomous path planning and driving execution in underground parkings here in Shanghai.

Other fundamental algorithms

The combination of geometric and semantic information maybe achieved at multiple levels including the more fundamental pose calculation algorithms embedded deep inside a SLAM system. We propose a novel solution to point set registration, and consider the specific case in which the point sets are segmented into semantically annotated parts. Such information may for example come from object detection or instance-level semantic segmentation in the registered RGB image. Prior methods incorporate the additional information to restrict or re-weight the point-pair associations occurring throughout the registration process. The present method introduces a novel hierarchical association framework for a simultaneous inference of semantic region association likelihoods. The formulation is elegantly solved using cascaded expectation-maximization, and demonstrates a substantial improvement over existing alternatives on open RGBD datasets. For more information including illustrations, please visit our page on pointset registration methods.

L. Hu, J. Wei, Z. Ouyang, and L. Kneip. Point Set Registration With Semantic Region Association Using Cascaded Expectation Maximization. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2021


Several interesting Spatial AI approaches have already been presented, producing object-level maps with both geometric and semantic properties rather than just accurate and robust localisation performance. As such, they require much broader ground truth information for validation purposes. We therefore provide a new synthetic benchmark with accurate ground truth information about the scene composition as well as individual object shapes and poses. We furthermore propose evaluation metrics for all aspects of such joint geometric-semantic representations. It is our hope that the introduction of these datasets and proper evaluation metrics will be instrumental in the evaluation of current and future Spatial AI systems and as such contribute substantially to the overall research progress on this important topic.

Please head to the dataset webpage to gain access to the SSS Benchmark.

Yuchen Cao, Lan Hu, and Laurent Kneip. Representations and benchmarking of modern visual slam systems. MDPI Sensors, 20:2572, 2020 [pdf]