Abstract:
Grounding language in vision is an active field of research that aims to construct cognitively plausible language representations by incorporating perceptual knowledge from vision into textual representations. Despite numerous attempts at language grounding, many research questions remain open. First, although visual grounding has proven beneficial for modeling the semantic relationships of concrete words, its impact on abstract words remains uncertain. This thesis argues that visual grounding significantly benefits both concrete and abstract words. To this end, we propose a novel approach that avoids complete modality fusion and focuses on implicit grounding. We achieve this by learning a reversible mapping between the textual and grounded spaces through multi-task learning. This mapping transforms pre-trained textual representations into the grounded space, where they are implicitly aligned with visual information through different language-vision tasks, while the distributional statistics that characterize word usage in text corpora are preserved. Finally, the learned mapping is used to construct grounded embeddings for unseen words, both abstract and concrete.
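As a rough illustration, the reversible mapping and its multi-task objective can be sketched as follows. The single linear map in each direction, the max-margin alignment loss against paired image features, and the 300-dimensional spaces are illustrative assumptions, not the thesis's exact configuration.

```python
# A rough sketch of the reversible mapping trained with a multi-task objective:
# a vision-alignment loss in the grounded space plus a reconstruction loss that
# preserves the original textual embeddings. Architecture, losses, and dimensions
# are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_DIM = 300      # assumed size of the pre-trained textual embeddings
GROUNDED_DIM = 300  # assumed size of the grounded (visual) space

class ReversibleMapping(nn.Module):
    def __init__(self):
        super().__init__()
        self.to_grounded = nn.Linear(TEXT_DIM, GROUNDED_DIM)  # textual -> grounded
        self.to_text = nn.Linear(GROUNDED_DIM, TEXT_DIM)      # grounded -> textual

    def forward(self, text_emb):
        grounded = self.to_grounded(text_emb)
        reconstructed = self.to_text(grounded)
        return grounded, reconstructed

def multitask_loss(model, text_emb, image_feat, margin=0.2):
    grounded, reconstructed = model(text_emb)
    # (1) Implicit alignment with vision: pull matching (word, image) pairs
    #     together and push mismatched pairs apart (max-margin ranking).
    pos = F.cosine_similarity(grounded, image_feat)
    neg = F.cosine_similarity(grounded, image_feat.roll(1, dims=0))
    alignment = F.relu(margin - pos + neg).mean()
    # (2) Reconstruction back to the textual space, preserving the
    #     distributional statistics of the pre-trained embeddings.
    reconstruction = F.mse_loss(reconstructed, text_emb)
    return alignment + reconstruction

# Toy usage with random stand-ins for word embeddings and paired image features.
model = ReversibleMapping()
words = torch.randn(8, TEXT_DIM)
images = torch.randn(8, GROUNDED_DIM)
multitask_loss(model, words, images).backward()

# After training, model.to_grounded embeds unseen words (abstract or concrete)
# into the grounded space directly from their textual embeddings.
```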
Second, we refine our grounding approach to be simpler, more effective, and more interpretable. Leveraging this framework, we shed light on common concerns at the interplay of language and vision, including but not limited to: (1) What is the optimal way of bridging the gap between text and vision? (2) To what extent is perceptual knowledge from images advantageous for contextualized embeddings from modern language models? Through novel experiments, we uncover performance trade-offs between concreteness and abstractness, as well as between similarity and relatedness, arising from the interplay of visual and textual dominance in the grounded embeddings. Moreover, our approach benefits contextualized embeddings, particularly when they are trained on corpora of modest, cognitively plausible sizes.

Third, we extend our grounding framework to other languages, demonstrating successful generalization to German and Arabic. Furthermore, we establish inter-lingual visual grounding by guiding the information flow from textual embeddings through a shared bottleneck, promoting exchange across languages. Our findings indicate that similar languages, such as English and German, benefit from this exchange within the visual grounding context, as evidenced by word-similarity and categorization benchmarks.

Finally, following our extensive studies on multimodal embeddings, our focus shifts to addressing the limitations of modern networks at the intersection of language and vision. Specifically, we target the visualization of metaphorical language, which plays a crucial role in conveying abstract concepts through concrete experiences and emotions. State-of-the-art text-to-image models struggle to synthesize meaningful images for such abstract and figurative expressions. To tackle this challenge, we introduce ViPE: Visualize Pretty-much Everything. ViPE eliminates the need for human annotations or images with metaphorical content and effectively assists text-to-image models in visualizing figurative and abstract phrases, as well as arbitrary textual input. Our approach unfolds the implicit meaning of figurative language into a new, visualizable textual description, thereby facilitating its visualization. ViPE's development involves three main stages: (1) compiling a large-scale lyrics dataset of approximately 10 million lines, serving as a rich source of figurative language; (2) constructing a supervised dataset, LyricCanvas, by generating noisy visual elaborations for all lyrics with a large language model (LLM); and (3) performing knowledge distillation, fine-tuning lightweight language models on LyricCanvas to obtain a robust model. ViPE's powerful zero-shot capability enables downstream applications such as synthetic caption generation from keywords, abstract visualizations, and music video generation.
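To make the distillation stage concrete, a minimal sketch is given below; the choice of GPT-2 as the lightweight model, the prompt format, the example lyric-elaboration pair, and the training settings are hypothetical illustrations rather than the exact ViPE recipe.

```python
# A minimal sketch of the ViPE distillation stage: fine-tuning a lightweight
# language model on (lyric line, LLM-generated visual elaboration) pairs from
# LyricCanvas. GPT-2, the "=>" prompt format, the example pair, and the
# training settings are hypothetical illustrations.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained("gpt2")

def encode_pair(lyric, elaboration):
    # Condition on the figurative line; learn to emit a visualizable description.
    text = f"{lyric} => {elaboration}{tokenizer.eos_token}"
    return tokenizer(text, return_tensors="pt", truncation=True, max_length=128)

# One invented LyricCanvas-style training pair.
batch = encode_pair(
    "my heart is a stone tonight",
    "a grey stone shaped like a heart lying on a cold, empty street at night",
)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
outputs = model(**batch, labels=batch["input_ids"])  # standard LM loss
# (a full implementation would typically mask the prompt tokens in the labels)
outputs.loss.backward()
optimizer.step()

# At inference, the fine-tuned model rewrites arbitrary figurative input into a
# concrete, visualizable prompt for an off-the-shelf text-to-image model.
```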