MPL's research agenda covers all questions in and around Simultaneous Localization and Mapping (SLAM). This problem arises whenever a mobile device or machine aims at some form of passive or active interaction with a previously unseen environment. The most straightforward example is a robot that needs to travel from A to B. To accomplish this task and reach its goal, it must be able to use its sensors to localize itself within the environment, and also to perceive the 3D structure or boundaries of the environment itself. These two problems are intertwined: registering a map of the environment requires us to first know where we are, and knowing where we are requires some sort of map of the environment. We call this chicken-and-egg problem "Simultaneous Localization and Mapping", and it plays an important role in many emerging, disruptive technologies such as service robotics, Industry 4.0, self-driving cars, and augmented reality. Our focus lies on solving this problem with cameras as exteroceptive sensors.
Data-driven architectures for SLAM
Whether dense or sparse, most metric SLAM solutions rely on low-level representations of the environment (e.g. points, lines, planes, voxels, or surfels). These are mostly generated by human-engineered feature extractors or by a simple, direct enforcement of photometric consistency. Our primary research focus at MPL investigates the inclusion of modern, data-driven architectures such as deep neural networks into the visual SLAM paradigm. A straightforward example is the use of powerful, learned feature extraction and description functions. We have furthermore devised a method for unsupervised learning of dense image matching by simply watching video. Another example is semantic 3D map augmentation, during which semantic meaning is associated with a traditional low-level representation of the scene. We intend to go one step further and lift the representation of individual objects in the environment to a higher-level shape space that is inherently tied to the object's semantic class. By doing so, we enable vertical scene-understanding capabilities in which learned priors are used to simultaneously infer scene composition, shapes, and semantics.
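As a concrete illustration of the matching building block that such learned descriptors plug into, the sketch below performs nearest-neighbour descriptor matching with a ratio test. The function name and the toy descriptors are hypothetical; in practice the descriptor arrays would come from a trained network rather than a hand-engineered extractor.

```python
import numpy as np

def match_descriptors(desc_a, desc_b, ratio=0.8):
    """Nearest-neighbour matching with a ratio test.

    desc_a, desc_b: (N, D) and (M, D) arrays of feature descriptors
    (learned or hand-engineered). Returns a list of (i, j) index pairs
    where descriptor i in image A matches descriptor j in image B.
    """
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)
        j, k = np.argsort(dists)[:2]
        # Accept only if the best match clearly beats the second best.
        if dists[j] < ratio * dists[k]:
            matches.append((i, int(j)))
    return matches
```

The ratio test discards ambiguous correspondences, which is what makes such matches usable as input to downstream pose estimation.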
Visual SLAM traditionally relies on either a single perspective camera or a stereo camera. Images captured by such setups exhibit rather low distortion and are therefore comparatively easy to handle. On the other hand, many applications such as self-driving cars benefit from an omni-directional, 360-degree field of view, which is why modern self-driving test vehicles are equipped with several cameras that point in all directions and overlap only partially in their fields of view. One of our primary research directions at MPL investigates the feasibility of real-time visual SLAM with such multi-perspective camera arrays. In particular, we focus on automatic sensor self-calibration, a robust handling of known motion degeneracies, and a transparent extension of existing, proven single-camera visual SLAM solutions to the multi-perspective case.
Novel camera architectures
While giving us a lot of information in a compact and power-efficient format, traditional imagery also has a few disadvantages. For instance, with only a single camera it is hard to recover the relative depth of the scene, particularly if the camera is not moving. Inspired by human vision, a stereo camera alleviates this shortcoming and enables direct, virtually instantaneous depth estimation through accurate stereo disparity computation. However, stereo cameras can measure disparities only along the direction of the baseline. Light-field cameras represent an interesting, multi-baseline extension of stereo cameras. A typical arrangement consists of a square grid of multiple forward-facing cameras, thus enabling disparity measurements in all directions. Our goal is to exploit the high redundancy in light-field imagery for accurate, reliable visual SLAM. Another disadvantage of regular cameras is their inability to capture blur-free images under highly dynamic or high-dynamic-range (HDR) conditions. We therefore also work on exploiting the potential of event cameras, which report pixel-level image changes rather than absolute intensities. These changes are reported asynchronously and at very high temporal resolution. Though the potential of event cameras in highly dynamic situations is proven, the complicated nature of the sensor data makes reliable, real-time SLAM a particularly hard problem to solve.
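For a calibrated stereo pair, the link between a measured disparity and metric depth is the classical pinhole relation Z = f · B / d. A minimal sketch, with made-up calibration values used purely for illustration:

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Pinhole stereo relation: depth Z = f * B / d.

    disparity_px: horizontal pixel offset of a point between the two views
    focal_px:     focal length in pixels
    baseline_m:   distance between the two camera centres in metres
    """
    return focal_px * baseline_m / disparity_px

# Hypothetical rig: 700 px focal length, 12 cm baseline.
# A disparity of 8.4 px then corresponds to roughly 10 m depth.
z = depth_from_disparity(8.4, 700.0, 0.12)
```

The inverse relationship also explains why a single fixed baseline limits depth resolution at range, which is part of the appeal of the multi-baseline light-field arrangements described above.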
Algebraic geometry
Algebraic geometry is a research direction with a long-standing history at MPL. This line of research focuses on the more fundamental, geometric building blocks of classical structure from motion. Our work covers both absolute and relative camera pose estimation problems, dealing with aspects such as geometric optimality, degeneracies, and computational complexity in both the minimal and the non-minimal case. In particular, we master the technique of automatic solver generation and host the open-source project polyjam, which automatically generates efficient C++ code for algebraic geometry problems. We also host the open-source project OpenGV, which contains several state-of-the-art solvers for both central and non-central camera systems, primarily developed with polyjam. OpenGV enjoys wide popularity across both industry and academia.
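To give a flavour of the geometric problems such solvers address, the sketch below recovers the essential matrix from noise-free synthetic bearing-vector correspondences with the textbook linear eight-point construction. This is not OpenGV's actual API, and the pose and point values are invented for the example; it only illustrates the kind of non-minimal relative pose problem the generated solvers treat far more efficiently and robustly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed ground-truth relative pose: points map between frames
# via X2 = R @ X1 + t, so the essential matrix is E = [t]_x R and
# correspondences satisfy the epipolar constraint f2^T E f1 = 0.
angle = 0.3
R = np.array([[np.cos(angle), -np.sin(angle), 0.0],
              [np.sin(angle),  np.cos(angle), 0.0],
              [0.0,            0.0,           1.0]])
t = np.array([0.5, 0.1, 0.05])

# Synthetic 3D points in front of both cameras.
X1 = rng.uniform([-1, -1, 4], [1, 1, 8], size=(12, 3))
X2 = X1 @ R.T + t

# Unit bearing vectors of a central camera observing the points.
f1 = X1 / np.linalg.norm(X1, axis=1, keepdims=True)
f2 = X2 / np.linalg.norm(X2, axis=1, keepdims=True)

# Each correspondence gives one linear equation in the 9 entries of E:
# the coefficient of E_ij is f2_i * f1_j, i.e. kron(f2, f1) row-major.
A = np.stack([np.kron(b, a) for a, b in zip(f1, f2)])

# E (up to scale) is the null-space vector of A.
_, _, Vt = np.linalg.svd(A)
E = Vt[-1].reshape(3, 3)

# Epipolar residuals |f2^T E f1| should vanish on noise-free data.
residuals = np.abs(np.einsum('ni,ij,nj->n', f2, E, f1))
```

With noisy data, degenerate motions, or outliers this linear solve quickly deteriorates, which is precisely where minimal solvers, optimality analysis, and automatic solver generation become relevant.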
Semi-dense visual SLAM
With regular photometric sensors, sparse methods easily break down in low-texture environments due to an insufficient number of detectable keypoints. Dense methods are an answer to this problem, but they depend on slow, localized motion, only produce local representations of the recovered depth, and are computationally expensive. This gap has recently been closed by semi-dense approaches, which track and map all the edges in the environment and thus find a good compromise between computational efficiency and resilience to low-texture environments. Our work ranges from novel strategies for real-time monocular camera tracking over semi-dense depth maps to the global representation and optimization of 3D edges. In particular, we extend the more traditional global representation in the form of straight lines to a more general representation of arbitrary, curved edges: linked Bézier splines.
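A Bézier segment can be evaluated with de Casteljau's algorithm, and linked segments obtain positional (C0) continuity simply by sharing the junction control point. The control points below are illustrative 2D values; in our setting the control points would be 3D and estimated from observed edges.

```python
import numpy as np

def bezier_point(control_pts, u):
    """Evaluate a Bezier curve at u in [0, 1] via de Casteljau's algorithm:
    repeatedly interpolate between consecutive control points."""
    pts = np.asarray(control_pts, dtype=float)
    while len(pts) > 1:
        pts = (1.0 - u) * pts[:-1] + u * pts[1:]
    return pts[0]

# Two linked cubic segments: seg_b starts at the control point where
# seg_a ends, which makes the joined curve C0 continuous. Mirroring the
# handles adjacent to the junction would additionally give C1 continuity.
seg_a = [(0, 0), (1, 2), (2, 2), (3, 0)]
seg_b = [(3, 0), (4, -2), (5, -2), (6, 0)]
```

The linked representation keeps each segment low-degree while still describing arbitrarily long, curved edges, which is what makes it attractive as a global map primitive.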