Multimodal Data Efficient Learning

dc.contributor.advisor Akata, Zeynep (Prof. Dr.)
dc.contributor.author Mercea, Otniel-Bogdan
dc.date.accessioned 2025-08-18T08:16:24Z
dc.date.available 2025-08-18T08:16:24Z
dc.date.issued 2025-08-18
dc.identifier.uri http://hdl.handle.net/10900/169090
dc.identifier.uri http://nbn-resolving.org/urn:nbn:de:bsz:21-dspace-1690907 de_DE
dc.identifier.uri http://dx.doi.org/10.15496/publikation-110417
dc.description.abstract Recently, unimodal models have attained strong performance on many tasks. However, a single modality may not provide sufficient information in complex situations. Humans use multimodal input, such as vision and hearing, to act in the real world. Similarly, this thesis proposes systems that use multimodal input for video classification and visual-language learning. However, multimodal models need large amounts of high-quality paired data, which is costly and time-consuming to gather. At the same time, humans require very few training samples, even for the most complex tasks. Given these aspects, this thesis addresses the problem of multimodal data-efficient learning. First, this thesis studies the audio-visual video classification task in generalized zero- and few-shot learning settings. It introduces new training and evaluation protocols, dataset splits, and baselines. Using transformers to fuse the audio and visual modalities leads to higher performance than prior work. Furthermore, standard full attention does not yield the best results, so new attention patterns are developed. New loss functions are essential for increasing performance in both settings. Moreover, performance in few-shot learning is further improved by using a diffusion model to generate synthetic audio-visual features for the novel classes. The second task is video-adverb retrieval, which is studied both when plenty of training data is available and in the zero-shot learning scenario. The goal is to improve the text embeddings using a residual gating mechanism and a new training objective. New zero-shot splits are also introduced to facilitate a more comprehensive evaluation. Finally, this thesis uses multimodal large language models (MLLMs) to focus on visual-language learning. This task studies the ability of MLLMs to adapt their communication on the fly to a conversation partner using very few interactions. This work provides a general framework for testing this ability across multiple agents, providing insights into their strengths and weaknesses. It turns out that the ability to adapt communication to different partners with different comprehension abilities is already present in current MLLMs. en
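The abstract mentions fusing the audio and visual modalities with a transformer for (generalized) zero- and few-shot video classification. The following is a minimal sketch of what such a fusion module could look like; it is not the thesis implementation, and all module names, feature dimensions, and hyperparameters are assumptions for illustration only.

    # Illustrative sketch (not the thesis code): fusing per-modality features
    # with a transformer encoder for audio-visual classification. All names,
    # dimensions, and hyperparameters below are assumptions.
    import torch
    import torch.nn as nn

    class AudioVisualFusion(nn.Module):
        def __init__(self, audio_dim=128, video_dim=512, d_model=256, num_classes=50):
            super().__init__()
            # Project each modality into a shared embedding space.
            self.audio_proj = nn.Linear(audio_dim, d_model)
            self.video_proj = nn.Linear(video_dim, d_model)
            encoder_layer = nn.TransformerEncoderLayer(
                d_model=d_model, nhead=4, batch_first=True)
            self.fusion = nn.TransformerEncoder(encoder_layer, num_layers=2)
            self.classifier = nn.Linear(d_model, num_classes)

        def forward(self, audio_feats, video_feats):
            # audio_feats: (batch, T_a, audio_dim); video_feats: (batch, T_v, video_dim)
            tokens = torch.cat([self.audio_proj(audio_feats),
                                self.video_proj(video_feats)], dim=1)
            fused = self.fusion(tokens)                # attention across both modalities
            return self.classifier(fused.mean(dim=1)) # pooled representation -> class logits

    # Example usage with random features
    model = AudioVisualFusion()
    logits = model(torch.randn(2, 10, 128), torch.randn(2, 16, 512))

Concatenating the projected audio and video tokens before the encoder lets self-attention mix information across modalities; the specific attention patterns and loss functions described in the thesis are not reproduced here.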
dc.language.iso en de_DE
dc.publisher Universität Tübingen de_DE
dc.rights ubt-podno de_DE
dc.rights.uri http://tobias-lib.uni-tuebingen.de/doku/lic_ohne_pod.php?la=de de_DE
dc.rights.uri http://tobias-lib.uni-tuebingen.de/doku/lic_ohne_pod.php?la=en en
dc.subject.ddc 004 de_DE
dc.subject.other Deep learning en
dc.subject.other Artificial intelligence en
dc.subject.other Computer vision en
dc.subject.other Efficient learning en
dc.subject.other Multimodal learning en
dc.title Multimodal Data Efficient Learning en
dc.type PhDThesis de_DE
dcterms.dateAccepted 2025-04-16
utue.publikation.fachbereich Informatik de_DE
utue.publikation.fakultaet 7 Mathematisch-Naturwissenschaftliche Fakultät de_DE
utue.publikation.noppn yes de_DE
