Abstract:
Emerging research in 3D computer vision and AR/VR telepresence is propelled by the dream of building a digital twin of the real world where humans can gather and interact remotely in shared virtual spaces. Such a world would enable unprecedented social and professional connections -- from virtual offices and industrial training to remote family gatherings and virtual sports. A critical building block for this future is our ability to faithfully reconstruct and animate physically plausible avatars of real humans. However, for truly immersive experiences, these avatars must also be able to interact naturally with virtual objects and environments.
Significant progress has been made in modeling 3D humans -- their pose, motion, and interactions with scenes and objects. However, many existing methods still produce physically implausible results, violating basic principles of physics and biomechanics. Reconstructed humans may lean, float, or penetrate the ground; generated motions often ignore biomechanical and external force constraints; and human-object interactions may miss complex contact dynamics, resulting in unrealistic outcomes. While physics simulators have been used to enforce physical plausibility, they are typically non-differentiable black boxes, rely on simplified ``proxy'' body models, and are too slow for real-time applications. These limitations hinder their use in real-world, ``in-the-wild'' scenarios that require reasoning about complex human interactions in unconstrained environments.
To overcome these challenges, this thesis draws inspiration from how humans operate in the physical world. We possess an innate understanding of the physical principles that govern how we move and interact with our surroundings. This physical reasoning involves two key components: intuitive physics and contact. Intuitive physics (IP), as described in biomechanics and cognitive science, refers to our ability to form rapid, online predictions about how physical events will unfold. There is broad consensus in cognitive science that this predictive ability is not based on explicit reasoning about Newtonian laws, but instead arises from coarse internal simulations that guide our responses in everyday situations. Contact, in turn, forms the foundation of this intuition. From infancy, we learn by engaging physically with the world -- touching, grasping, and manipulating objects. These tactile experiences shape and continually refine our understanding of physical interactions.
In this thesis, we combine intuitive physics and contact reasoning to develop methods that embed physical constraints into 3D human modeling, motion generation, and interaction understanding.
First, we introduce IPMAN, a novel method that incorporates intuitive physics into 3D human pose and shape (HPS) estimation from monocular images. Instead of full physics simulation, IPMAN uses differentiable biomechanical quantities such as Center of Pressure (CoP) and Center of Mass (CoM), along with novel IP terms, to encourage physically plausible poses with proper balance and ground contact. These terms are efficient to compute, fully differentiable, and integrate easily into existing HPS pipelines, improving both physical realism and accuracy over state-of-the-art methods.
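To make the intuitive-physics terms concrete, the following is a minimal illustrative sketch -- not IPMAN's exact formulation -- of a differentiable stability term: it encourages the ground projection of the body's CoM to stay close to a CoP estimated from near-ground vertices. The per-vertex mass weights and the soft contact threshold used here are illustrative assumptions.
\begin{verbatim}
import torch

def stability_loss(vertices, vertex_masses, ground_height=0.0, contact_eps=0.02):
    """vertices: (V, 3) body-mesh vertices (y-up); vertex_masses: (V,) mass weights."""
    # Center of Mass: mass-weighted mean of the mesh vertices.
    w = vertex_masses / vertex_masses.sum()
    com = (w[:, None] * vertices).sum(dim=0)                      # (3,)

    # Soft contact weights: vertices near the ground contribute to the CoP.
    heights = vertices[:, 1] - ground_height
    contact_w = torch.sigmoid(-(heights - contact_eps) / contact_eps)
    cop = (contact_w[:, None] * vertices).sum(dim=0) / (contact_w.sum() + 1e-8)

    # Penalize the horizontal (x, z) offset between the projected CoM and the CoP.
    return ((com[[0, 2]] - cop[[0, 2]]) ** 2).sum()
\end{verbatim}
Because every operation above is differentiable, such a term can be added directly to the objective of an existing HPS pipeline without a physics simulator in the loop.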
Second, while IPMAN ensures physically plausible \emph{static} poses, generating realistic \emph{dynamic} motion presents additional challenges. Existing models often ignore body shape, relying instead on an average body and thus overlooking how physiology affects movement. To address this, we propose HUMOS, a shape-conditioned motion generation method that builds on IPMAN’s physics-based framework. HUMOS generates diverse, physically plausible motions tailored to individual body shapes by leveraging cycle consistency and novel dynamic IP constraints, without requiring paired training data of different body shapes performing the same actions.
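The cycle-consistency idea can be sketched as follows; the \texttt{transfer} network and its interface are assumptions for illustration, not HUMOS's actual architecture. Mapping a motion to a second body shape and back to the original shape should reconstruct the input, which yields a training signal even though no paired motions of different shapes exist.
\begin{verbatim}
import torch

def cycle_consistency_loss(transfer, motion_a, shape_a, shape_b):
    """motion_a: (T, D) motion performed by a body with shape `shape_a`."""
    motion_b = transfer(motion_a, shape_b)        # adapt the motion to body shape B
    motion_a_rec = transfer(motion_b, shape_a)    # map it back to body shape A
    # If the transfer preserves motion content, the round trip should recover
    # the original motion, providing supervision without paired data.
    return torch.nn.functional.mse_loss(motion_a_rec, motion_a)
\end{verbatim}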
Third, beyond basic locomotion, humans frequently interact with objects through full-body contact. Understanding these interactions requires detecting contact across the whole body surface in arbitrary images. Previous methods focus on 2D contact or joint-level contact, or rely on coarse 3D body regions, and thus struggle to capture the richness and diversity of real-world human-object interactions. We introduce DECO, a novel 3D contact detector that estimates dense vertex-level contact on the full body. DECO leverages a new dataset of dense contact annotations paired with RGB images and combines body-part and scene-context attention to handle occlusions, enabling robust contact detection in complex, real-world scenarios.
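Conceptually, dense contact estimation can be posed as per-vertex binary classification over the body mesh. The sketch below shows the corresponding training loss under that framing; it is a hypothetical interface, not DECO's implementation.
\begin{verbatim}
import torch

def contact_loss(vertex_logits, contact_labels):
    """vertex_logits: (B, V) raw score per body vertex; contact_labels: (B, V) in {0, 1}."""
    return torch.nn.functional.binary_cross_entropy_with_logits(
        vertex_logits, contact_labels.float()
    )

# At test time, thresholding the sigmoid yields a dense contact mask:
# contact_mask = torch.sigmoid(vertex_logits) > 0.5
\end{verbatim}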
Fourth, while DECO enables detecting contact on the human body, recovering full 3D human-object interactions requires understanding contact on \emph{both} the body and object surfaces. This is challenging due to the huge variation in object shapes and the difficulty of establishing correspondences between contact points on the body and the object. We address this with PICO, which introduces a novel data collection method to transfer body contact annotations to arbitrary objects, and a reconstruction approach that uses these contact correspondences to recover accurate 3D human-object interactions from single images. This enables the reconstruction of physically plausible interactions across diverse object categories in unconstrained settings.
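To illustrate how such correspondences can drive reconstruction -- an assumed form, not PICO's actual objective -- a simple fitting term pulls each matched body-object contact pair together during joint optimization of the human and object poses.
\begin{verbatim}
import torch

def contact_correspondence_loss(body_verts, obj_verts, body_idx, obj_idx):
    """body_idx, obj_idx: (K,) indices of K corresponding contact vertices."""
    body_pts = body_verts[body_idx]   # (K, 3) contact points on the body surface
    obj_pts = obj_verts[obj_idx]      # (K, 3) matched points on the object surface
    return ((body_pts - obj_pts) ** 2).sum(dim=-1).mean()
\end{verbatim}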
Together, these contributions advance physics-informed human modeling by tackling core challenges in reconstructing and generating physically plausible human motion and interactions. Grounded in physics and biomechanics, the proposed methods are practical for real-world scenarios. By bridging the gap between physical plausibility and computational feasibility, this work takes significant steps toward enabling truly immersive virtual experiences where digital humans can move and interact as naturally as their real-world counterparts.