Abstract:
Household robots have been a long-standing promise of robotics. Yet despite
decades of robotic research and progress in the field, we still do not encounter
many robots in our everyday life. Creating the robotic helpers envisioned in
science-fiction movies would of course require advances towards more general,
human-like artificial intelligence. However, today's robots also
struggle with much simpler tasks, like accurately and reliably manipulating
novel objects based only on sensory data. Such sensory-motor skills
often have challenging dynamics and are further complicated by the fact that the
state information that can be extracted from raw sensory input is noisy or even
incomplete.
Over the last decade, a perceived dichotomy between model-based
and data-driven approaches has often shaped the discussion about how to
implement such robotic skills: In the ``traditional'', model-based approach,
engineers manually define a skill as a combination of models,
state representations and algorithms. The data-driven approach, in contrast,
relies on powerful function approximators like deep neural networks to
learn robot skills from large amounts of training data. Both approaches
have advantages and limitations that are in many aspects complementary.
The model-based approach requires no training data, is transparent and relies
on models that are grounded in the laws of physics and are thus universally
applicable. However, formulating accurate and efficient models can be difficult
for many tasks, especially when it comes to interpreting sensory signals. For
example, manually modeling the appearance of all objects that a robot might
encounter in a typical household is clearly impossible.
The data-driven approach, on the other hand, requires little prior
knowledge about the task. Deep neural networks can directly predict robot
actions from raw sensory inputs and have shown impressive results especially
in domains like computer vision, for which specifying analytical models is
difficult. The disadvantages of learning are the large amounts of required
training data, which can be difficult to obtain when robots are involved, and
the black-box nature of the resulting learned solutions. In particular,
guaranteeing safety and correct performance for data not seen during training
(generalization) is currently an open problem.
In this thesis, we propose to combine learning and structure to get the
best of both worlds. We hypothesize that incorporating learning into a model-based
approach could fill the gaps where no analytical model can be found, speed up
computation when the analytical solution is too slow, or improve existing models
with a learned component. Furthermore, introducing more structure into learning
methods could help reduce the amount of training data required and improve
the generalization performance and interpretability of the learned models.
We evaluate this hypothesis on the sensory-motor skill of planar pushing. This
task requires the robot to translate and rotate an object on a flat surface
using a cylindrical pusher, given only visual observations of the system. While
the task is simple to describe, the associated dynamics are already quite
challenging, with different contact modes and complex friction forces.
Our evaluation starts at the level of the individual models and representations
required for perception and prediction.
We compare a purely learned approach for predicting the outcome of pushing
actions from sensory data to a combined one that trains a
neural network for perception end-to-end through an analytical model for
prediction. While the purely learned approach reaches the best prediction accuracy
when training and test data are similar, the combined method generalizes better
to pushes and objects not seen during training and is more data-efficient.
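To make the idea of training perception end-to-end through an analytical model more concrete, the following minimal PyTorch sketch shows the general pattern; the toy network and the push-outcome model are simplified, hypothetical stand-ins rather than the models used in this thesis.
\begin{verbatim}
# Minimal sketch: a perception network is trained through a differentiable
# analytical prediction model (both are simplified stand-ins).
import torch
import torch.nn as nn

class PerceptionNet(nn.Module):
    """Maps a raw image to an object position and a contact point (toy CNN)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Linear(32, 4)  # (object x, y, contact x, y)

    def forward(self, image):
        return self.head(self.features(image))

def analytical_push_model(state, push, mu=0.3):
    """Heavily simplified, differentiable push-outcome model: the object
    moves with the pusher, attenuated by a friction factor (placeholder)."""
    obj_pos = state[:, :2]
    return obj_pos + (1.0 - mu) * push

perception = PerceptionNet()
optimizer = torch.optim.Adam(perception.parameters(), lr=1e-3)

# One (dummy) training step: the loss is defined on the predicted outcome,
# so gradients flow through the analytical model into the perception net.
image = torch.randn(8, 3, 64, 64)   # batch of input images
push = torch.randn(8, 2)            # commanded pusher motions
target_pos = torch.randn(8, 2)      # observed object positions after the push

optimizer.zero_grad()
state = perception(image)
loss = ((analytical_push_model(state, push) - target_pos) ** 2).mean()
loss.backward()
optimizer.step()
\end{verbatim}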
Analytical dynamics models and physically meaningful state representations
are of course not the only way to provide structure to learning approaches.
While models describe certain aspects of the system, algorithms contain knowledge
about how to solve a given task. Using the example of state estimation with Bayesian filters,
we demonstrate that embedding model learning into differentiable algorithms
facilitates learning compared to unstructured models. In addition, it
allows optimizing the learned models specifically for their respective algorithm and
can be advantageous when designing the models manually is difficult.
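The following sketch illustrates what embedding model learning into a differentiable filtering algorithm can look like: a learned process model sits inside a Kalman-style predict/update recursion, and a state-estimation loss is backpropagated through the filter. The dimensions, noise settings and identity observation model are illustrative assumptions, not the filters studied in this thesis.
\begin{verbatim}
# Minimal sketch of a differentiable filter with a learned process model.
import torch
import torch.nn as nn

state_dim = 3  # e.g. planar object pose (x, y, theta)

process_model = nn.Sequential(          # learned dynamics: x_t -> x_{t+1}
    nn.Linear(state_dim, 32), nn.ReLU(), nn.Linear(32, state_dim))
Q = 0.01 * torch.eye(state_dim)         # process noise (could also be learned)
R = 0.10 * torch.eye(state_dim)         # observation noise
optimizer = torch.optim.Adam(process_model.parameters(), lr=1e-3)

def filter_step(mean, cov, obs):
    # Predict with the learned model; its Jacobian propagates the covariance.
    pred_mean = process_model(mean)
    F = torch.autograd.functional.jacobian(process_model, mean,
                                           create_graph=True)
    pred_cov = F @ cov @ F.T + Q
    # Standard Kalman update with an identity observation model.
    K = pred_cov @ torch.linalg.inv(pred_cov + R)
    new_mean = pred_mean + K @ (obs - pred_mean)
    new_cov = (torch.eye(state_dim) - K) @ pred_cov
    return new_mean, new_cov

# One training step on a (dummy) trajectory: the loss on the filtered means
# is backpropagated through the whole predict/update recursion.
observations = torch.randn(10, state_dim)
true_states = torch.randn(10, state_dim)
mean, cov = torch.zeros(state_dim), torch.eye(state_dim)
loss = 0.0
for obs, truth in zip(observations, true_states):
    mean, cov = filter_step(mean, cov, obs)
    loss = loss + ((mean - truth) ** 2).sum()
loss.backward()
optimizer.step()
\end{verbatim}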
Finally, we address the complete task of planning pushing actions based
on visual input. Its main challenges are the computational cost of optimizing
pushing motions and contact locations simultaneously and the uncertainty induced by
imperfect perception. We present an approach that combines learning and
structure to address both of these challenges. First, we improve the planning
efficiency by decomposing the problem into contact point selection and pushing
motion optimization. This allows focusing the expensive motion optimization on
a few promising contact locations. We compare learned and analytical
models for both planning steps.
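As a rough illustration of this decomposition, the sketch below ranks candidate contact points with a cheap scoring function and runs the expensive pushing-motion optimization only on the best candidates; score_contact and optimize_push are hypothetical placeholders (with dummy implementations) for the learned or analytical components compared in this thesis.
\begin{verbatim}
# Minimal sketch of the planning decomposition: cheap contact-point selection
# followed by expensive motion optimization on only a few candidates.
import numpy as np

def score_contact(contact, target_motion):
    """Cheap score of how promising a contact point is for producing the
    desired object motion (dummy heuristic standing in for a learned or
    analytical model)."""
    return -np.linalg.norm(contact["normal"] + target_motion)

def optimize_push(contact, target_motion, iters=100):
    """Expensive pushing-motion optimization at a fixed contact (dummy
    random-search placeholder)."""
    best_push, best_cost = None, np.inf
    for _ in range(iters):
        push = 0.05 * np.random.randn(2)              # candidate pusher motion
        cost = np.linalg.norm(push - target_motion)   # placeholder objective
        if cost < best_cost:
            best_push, best_cost = push, cost
    return best_push, best_cost

def plan_push(contacts, target_motion, k=3):
    # Step 1: rank all candidate contact points with the cheap scoring model.
    ranked = sorted(contacts, key=lambda c: score_contact(c, target_motion),
                    reverse=True)
    # Step 2: run the expensive optimization only on the k best contacts.
    plans = [(c, *optimize_push(c, target_motion)) for c in ranked[:k]]
    return min(plans, key=lambda p: p[2])

contacts = [{"point": np.random.rand(2), "normal": np.random.randn(2)}
            for _ in range(20)]
best_contact, best_push, cost = plan_push(contacts, np.array([0.1, 0.0]))
\end{verbatim}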
Second, we combine learning-based perception with a physically
meaningful state representation and an explicit state estimation algorithm to
increase the accuracy of our approach compared to a previously published,
purely learned method.
In summary, this thesis shows that the model-based and data-driven approaches do not have to
be understood as exclusive choices but can rather be viewed as extreme points
in a trade-off between providing prior knowledge and leaving flexibility for
learning new solutions. Using the example of planar pushing, we demonstrate
how combining learning and structure can make sensory-motor skills more robust,
general and data-efficient.