Abstract:
Human visual perception is one of the most significant, yet limited, sensory inputs. A small patch of nerve tissue -- the retina -- at the back of an optical apparatus -- the eyeball -- provides a continuous flow of visual information about our environment. Yet, the retina's perceptual capabilities differ significantly depending on the retinal region. In a nutshell, a small region at the center of the retina -- the fovea -- provides high acuity and the maximal perceivable color spectrum. With increasing eccentricity, however, retinal visual perception becomes progressively less sensitive to spatial and spectral contrasts. Humans are thus equipped with a highly developed foveal perception, yet one limited to a small area of the field of view. To overcome this limitation, humans move their eyes, rapidly shifting their visual attention towards significant scene areas. The resulting eye movements can be recorded and evaluated by state-of-the-art video-based eye-tracking systems. This thesis presents a full hardware and software stack to record the user's pupil position and to predict the user's line of sight -- the imaginary line between the fixated scene point and the fovea -- in 3D.
However, the tracked line of sight merely captures the alignment of the user's foveal attention, whereas human visual perception goes far beyond a straight line. Although the fovea offers the highest visual capacity and generally coincides with the center of visual attention, foveal perception is neither sufficiently characterized spatially by a line of sight, nor is visual perception limited to the fovea. Rather, the entire visual field is continuously and subliminally scanned and reconciled with a mental image of the scene. Highly developed and sophisticated cognitive processes thus match the received visual stimulus against the anticipated scene image. Hence, retinally extracted visual features are differently emphasized, suppressed, recombined, and abstracted, depending on the task at hand, intention, experience, fatigue, and many other factors. This reciprocal interplay of retinal visual feature extraction (bottom-up) and selection and interpretation by higher cognitive processes (top-down) governs the identification of fixation targets and the consciously perceived scene content. Therefore, to assess the user's visual perception, it is crucial to evaluate the user's entire field of view and to incorporate the visual stimulus, retinal capabilities, and cognitive state.
This thesis provides novel gaze-driven user models to predict the retinally extracted visual stimuli over the entire field of view as well as the user's cognitive state. This comprehensive evaluation of the user's visual perception, beyond the merely tracked line of sight, allows sophisticated human-computer interaction (HCI) and human-robot interaction (HRI) applications to integrate the user's visual perception and situation awareness into their interfaces, enabling natural, user-centric, and intelligent collaboration systems.