Abstract:
Modeling digital humans that move and interact realistically with virtual 3D worlds has recently emerged as an essential research area, with significant applications in computer graphics, virtual and augmented reality, telepresence, the Metaverse, and assistive technologies. In particular, human-object interaction, encompassing full-body motion, hand-object grasping, and object manipulation, lies at the core of how humans execute tasks and reflects the complex and diverse nature of human behavior. Accurate modeling of these interactions would therefore enable us to simulate avatars that perform tasks, enhance animation realism, and develop applications that better perceive and respond to human behavior. Despite its importance, this remains a challenging problem, due to factors such as the complexity of human motion, the variation of interactions across tasks, and the lack of rich datasets capturing the complexity of real-world interactions. Prior methods have made progress, but limitations persist, as they often focus on individual aspects of interaction, such as body, hand, or object motion, without considering the holistic interplay among these components. This Ph.D. thesis addresses these challenges and contributes to the advancement of human-object interaction modeling through the development of novel datasets, methods, and algorithms.
The first major contribution of this research is the introduction of the GRAB dataset. Training computer models to understand and synthesize realistic avatar interactions requires a rich dataset containing accurate hand-object grasps, complex 3D object shapes, contact information, and full-body motion over time. Such a comprehensive dataset has been missing, however, as capturing it poses numerous challenges. Existing datasets predominantly focus on hand-only interactions and suffer from occlusions and inaccurate tracking. We overcome these limitations by considering full-body interactions, using a high-quality motion-capture system, and adopting state-of-the-art methods to accurately track detailed body, head, hand, and object shapes and motions. Going beyond existing datasets, we collect GRAB, the first human-object-interaction dataset containing 3D body and object motion, accurate hand-object interaction, head motion, and detailed contact. Using GRAB, we train GrabNet, a model that generates state-of-the-art hand grasps for unseen 3D objects. Overall, GRAB and GrabNet serve as a foundation for further research and have enabled the development of numerous methods for modeling and synthesizing human-object interactions.
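To make the notion of ``detailed contact'' concrete, the following minimal sketch (in PyTorch) labels body vertices as in contact with an object when they lie within a small distance of the object's surface. The 5 mm threshold, the vertex counts, and the use of object vertices as a surface proxy are illustrative assumptions; GRAB's actual contact annotation pipeline may differ.

import torch

def contact_map(body_verts, obj_verts, thresh=0.005):
    # body_verts: (Nb, 3) body/hand vertex positions in meters
    # obj_verts:  (No, 3) object vertex positions in meters
    # thresh:     contact distance threshold in meters (assumed value)
    dists = torch.cdist(body_verts, obj_verts)   # (Nb, No) pairwise distances
    min_dist, _ = dists.min(dim=1)               # distance to the nearest object point
    return (min_dist < thresh).float()           # (Nb,) binary contact labels

# Example with random stand-in geometry (shapes are placeholders, not GRAB data).
body = torch.rand(10475, 3)
obj = torch.rand(2048, 3)
contacts = contact_map(body, obj)
print(int(contacts.sum().item()), "vertices in contact")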
The initial step of ``walking'' toward and ``grasping'' an object is a prerequisite for any object-interaction motion and is essential for realistic avatar motion. Despite its importance, previous work has mainly focused on static grasps or the major limbs of the body, ignoring the hands and head. Addressing this requires generating full-body motion together with realistic hand grasps and head poses. This is challenging due to the large state space of poses, the need for consistency between body and hands while maintaining physical constraints, and the vital role of the head during interactions. To this end, we introduce GOAL, the first model that takes an initial 3D body and object and generates an avatar motion that walks to and grasps the object with realistic body pose, head orientation, hand grasp, and foot-ground contact. To achieve this, we exploit complementary grasp representations, namely the SMPL-X body pose and vertex offsets between the body and object, and a novel interaction-aware feature that transforms body-to-object distance to better localize the object with respect to the body. These contributions facilitate the generation of realistic avatar motions, advancing the development of digital human modeling.
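As an illustration of the interaction-aware feature, one can compute, for each body vertex, an offset to its closest object point together with a transformed distance that saturates near the object. The exponential transform and the scale sigma below are assumptions chosen for illustration, not GOAL's exact formulation.

import torch

def interaction_features(body_verts, obj_verts, sigma=0.1):
    # body_verts: (Nb, 3), obj_verts: (No, 3), sigma: softness scale (assumed)
    dists = torch.cdist(body_verts, obj_verts)   # (Nb, No) pairwise distances
    min_dist, idx = dists.min(dim=1)             # nearest object vertex per body vertex
    offsets = obj_verts[idx] - body_verts        # (Nb, 3) body-to-object offset vectors
    weights = torch.exp(-min_dist / sigma)       # close to 1 near the object, ~0 far away
    return offsets, weights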
To authentically simulate virtual avatars, in addition to walking and grasping objects, avatars need realistic and physically plausible finger motions while manipulating objects. While seemingly trivial for humans, this involves satisfying numerous semantic and physical constraints, making it a complex challenge. Consequently, there has been a lack of methods capable of automatically generating hand motions that are consistent with the full-body pose and object manipulation. GRIP bridges this gap by taking the 3D motion of the body and the object and synthesizing realistic motion for both the left and right hands before, during, and after object manipulation. GRIP leverages spatio-temporal information between the body and the object to extract rich temporal interaction cues, which help generalization to new objects. Moreover, it introduces a temporal consistency constraint applied in the latent space, yielding temporally consistent interaction motions. Overall, GRIP ``upgrades'' sequences of body and object motion to include detailed hand-object interaction and further enhances the realism and applicability of digital human modeling.
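One plausible form of such a latent-space constraint is a loss that penalizes frame-to-frame changes in the per-frame latent interaction code. The sketch below shows this idea in PyTorch; the sequence length, latent dimensionality, and exact loss form are illustrative assumptions rather than GRIP's precise objective.

import torch

def latent_consistency_loss(z):
    # z: (T, D) tensor of per-frame latent codes for a motion sequence.
    # Squared difference between consecutive codes, averaged over the sequence.
    return ((z[1:] - z[:-1]) ** 2).sum(dim=-1).mean()

# Example: a 60-frame sequence of 32-D latent codes from some encoder (hypothetical shapes).
z = torch.randn(60, 32, requires_grad=True)
loss = latent_consistency_loss(z)
loss.backward()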
In conclusion, this Ph.D. thesis advances the field of 3D human-object interaction modeling by offering novel, holistic, and practical approaches. By developing new resources such as the GRAB dataset and models like GrabNet, GOAL, and GRIP, we are not just incrementally improving the field but transforming the way we approach, understand, and solve problems in human-object interaction modeling. As the digital era advances and the Metaverse concept becomes more prevalent, our work marks a crucial step towards creating more realistic and immersive virtual worlds where digital humans interact with their environment in an authentically human-like way. We envision our contributions will inspire and guide future research to continue exploring, innovating, and enhancing the capabilities of digital human modeling.