Combining Learning and Structure for Robotic Manipulation



Document type: PhD thesis
Date: 2021-02-03
Language: English
Faculty: 7 Mathematisch-Naturwissenschaftliche Fakultät
Department: Computer Science
Advisor: Bohg, Jeannette (Assist. Prof. Dr.)
Day of Oral Examination: 2020-12-16
DDC Classification: 004 - Data processing and computer science
Keywords: Robotics, Machine learning, Deep learning


Household robots have been a long-standing promise of robotics. But despite decades of robotics research and progress in the field, we still do not encounter many robots in our everyday lives. Creating the robotic helpers envisioned in science-fiction movies would of course require advances towards more general, human-like artificial intelligence. However, today's robots also struggle with much simpler tasks, like accurately and reliably manipulating novel objects based only on sensory data. Such sensory-motor skills often have challenging dynamics and are further complicated by the fact that the state information that can be extracted from raw sensory input is noisy or even incomplete. Over the last decade, a perceived dichotomy between model-based and data-driven approaches has often shaped the discussion about how to implement such robotic skills: In the "traditional", model-based approach, engineers manually define a skill as a combination of models, state representations and algorithms. The data-driven approach, in contrast, relies on powerful function approximators like deep neural networks to learn robot skills from large amounts of training data. Both approaches have advantages and limitations that are in many respects complementary. The model-based approach requires no training data, is transparent and relies on models that are grounded in the laws of physics and are thus universally applicable. However, formulating accurate and efficient models can be difficult for many tasks, especially when it comes to interpreting sensory signals. For example, manually modeling the appearance of all objects that a robot might encounter in a typical household is clearly impossible. The data-driven approach, on the other hand, requires little prior knowledge about the task.
Deep neural networks can directly predict robot actions from raw sensory inputs and have shown impressive results, especially in domains like computer vision, for which specifying analytical models is difficult. The disadvantages of learning are the large amounts of required training data, which can be difficult to obtain when robots are involved, and the black-box nature of the resulting learned solutions. In particular, guaranteeing safety and correct performance for data not seen during training (generalization) is currently an open problem. In this thesis, we propose to combine learning and structure to get the best of both worlds. We hypothesize that incorporating learning into a model-based approach could fill the gaps where no analytical model could be found, speed up approaches when the analytical solution is too slow, or improve existing models with a learned component. Furthermore, introducing more structure into learning methods could help to reduce the amount of necessary training data and improve the generalization performance and interpretability of the learned models. We evaluate this hypothesis on the sensory-motor skill of planar pushing. This task requires the robot to translate and rotate an object on a flat surface using a cylindrical pusher, given only visual observations of the system. While the task is simple to describe, the associated dynamics are already quite challenging, with different contact modes and complex friction forces. Our evaluation starts at the level of the individual models and representations required for perception and prediction. We compare a purely learned approach for predicting the outcome of pushing actions from sensory data to a combined one that trains a neural network for perception end-to-end through an analytical model for prediction.
While the purely learned approach reaches the best prediction accuracy when training and test data are similar, the combined method generalizes better to pushes and objects not seen during training and is more data-efficient. Analytical dynamics models and physically meaningful state representations are of course not the only way to provide structure to learning approaches. While models describe certain aspects of the system, algorithms contain knowledge about how to solve a given task. Using the example of state estimation with Bayesian filters, we demonstrate that embedding model learning into differentiable algorithms facilitates learning in comparison to unstructured models. In addition, it allows the learned models to be optimized specifically for their respective algorithm and can be advantageous when designing the models manually is difficult. Finally, we address the complete task of planning pushing actions based on visual input. Its main challenges are the computational cost of optimizing pushing motions and contact locations simultaneously and the uncertainty induced by imperfect perception. We present an approach that combines learning and structure to address both of these challenges. First, we improve the planning efficiency by decomposing the problem into contact point selection and pushing motion optimization. This allows focusing the expensive motion optimization on a few promising contact locations. For both planning steps, we compare using a learned model against using an analytical one. Second, we combine learning-based perception with a physically meaningful state representation and an explicit state estimation algorithm to increase the accuracy of our approach as compared to a previously published, purely learned method.
In summary, this thesis shows that the model-based and the data-driven approach do not have to be understood as mutually exclusive choices but can rather be viewed as extreme points in a trade-off between providing prior knowledge and leaving flexibility for learning new solutions. Using the example of planar pushing, we demonstrate how combining learning and structure can make sensory-motor skills more robust, general and data-efficient.
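
The decomposition of planning into contact point selection and pushing motion optimization, as described in the abstract, can be illustrated with a small sketch. This is not the thesis's implementation: the function `plan_push`, the scoring rule, the disc-shaped object, the grid search and all parameter values are hypothetical and chosen only to show the two-stage structure, in which a cheap score ranks all contact candidates and the expensive optimization runs only on the best few.

```python
import math

def plan_push(candidates, cheap_score, optimize_motion, top_k=3):
    """Two-stage planning sketch: rank contacts cheaply, optimize only the best few.

    Returns (cost, contact, motion) for the best shortlisted contact,
    or None if there are no candidates.
    """
    # Stage 1: contact point selection (lower score = more promising).
    shortlist = sorted(candidates, key=cheap_score)[:top_k]
    # Stage 2: expensive motion optimization, restricted to the shortlist.
    best = None
    for contact in shortlist:
        cost, motion = optimize_motion(contact)
        if best is None or cost < best[0]:
            best = (cost, contact, motion)
    return best

# Toy setup (all assumptions): a disc of radius 1 that should move 0.5 units along +x.
target = (0.5, 0.0)
candidates = [(math.cos(k * math.pi / 4), math.sin(k * math.pi / 4)) for k in range(8)]
calls = []  # records which contacts reach the expensive stage

def cheap_score(contact):
    # Contacts on the side opposite the target direction get the lowest score.
    return contact[0] * target[0] + contact[1] * target[1]

def optimize_motion(contact):
    # Stand-in for a costly optimizer: grid search over the push length,
    # pushing along the inward normal of the disc at the contact point.
    calls.append(contact)
    normal = (-contact[0], -contact[1])
    best_cost, best_length = float("inf"), 0.0
    for i in range(21):
        length = i / 20
        dx = length * normal[0] - target[0]
        dy = length * normal[1] - target[1]
        cost = dx * dx + dy * dy
        if cost < best_cost:
            best_cost, best_length = cost, length
    return best_cost, best_length

best = plan_push(candidates, cheap_score, optimize_motion, top_k=3)
print(best)
print(len(calls), "of", len(candidates), "contacts were optimized")
```

In this toy setting, either stage could be backed by a learned or an analytical model by swapping the function passed as `cheap_score` or `optimize_motion`; the surrounding structure stays the same.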
