Multimodal Data Efficient Learning

dc.contributor.advisor Akata, Zeynep (Prof. Dr.)
dc.contributor.author Mercea, Otniel-Bogdan
dc.date.accessioned 2025-08-18T08:16:24Z
dc.date.available 2025-08-18T08:16:24Z
dc.date.issued 2025-08-18
dc.identifier.uri http://hdl.handle.net/10900/169090
dc.identifier.uri http://nbn-resolving.org/urn:nbn:de:bsz:21-dspace-1690907 de_DE
dc.identifier.uri http://dx.doi.org/10.15496/publikation-110417
dc.description.abstract Recently, unimodal models have attained strong performance on many tasks. However, a single modality may not provide sufficient information in complex situations. Humans use multimodal input, such as vision and hearing, to act in the real world. Similarly, this thesis proposes systems that use multimodal input for video classification and visual-language learning. However, multimodal models need large amounts of high-quality paired data, which is costly and time-consuming to gather. At the same time, humans require very few training samples, even for the most complex tasks. Given these aspects, this thesis addresses the problem of multimodal data-efficient learning. First, this thesis studies the audio-visual video classification task in generalized zero- and few-shot learning settings. It introduces new training and evaluation protocols, dataset splits, and baselines. Using transformers to fuse the audio and visual modalities leads to higher performance than prior work. Furthermore, standard full attention does not yield the best results, so new attention patterns are developed. New loss functions are essential for increasing performance in both settings. Moreover, performance in few-shot learning is further improved by using a diffusion model to generate synthetic audio-visual features for the novel classes. The second task is video-adverb retrieval, which is studied both when plenty of training data is available and in the zero-shot learning scenario. The goal is to improve the text embeddings using a residual gating mechanism and a new training objective. New zero-shot splits are also introduced to facilitate a more comprehensive evaluation. Finally, this thesis uses multimodal large language models (MLLMs) to focus on visual-language learning. This task studies the ability of MLLMs to adapt their communication on the fly to a conversation partner using very few interactions. This work provides a general framework for testing this ability across multiple agents, providing insights into their strengths and weaknesses. It turns out that the ability to adapt communication to different partners with different comprehension abilities is already present in current MLLMs. en
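The abstract mentions fusing the audio and visual modalities with a transformer for (generalized) zero- and few-shot video classification. The following is a minimal sketch of what such a fusion module could look like; it is not the thesis implementation, and all module names, feature dimensions, and hyperparameters are assumptions for illustration only.

    # Illustrative sketch (not the thesis code): fusing per-modality features
    # with a transformer encoder for audio-visual classification. All names,
    # dimensions, and hyperparameters below are assumptions.
    import torch
    import torch.nn as nn

    class AudioVisualFusion(nn.Module):
        def __init__(self, audio_dim=128, video_dim=512, d_model=256, num_classes=50):
            super().__init__()
            # Project each modality into a shared embedding space.
            self.audio_proj = nn.Linear(audio_dim, d_model)
            self.video_proj = nn.Linear(video_dim, d_model)
            encoder_layer = nn.TransformerEncoderLayer(
                d_model=d_model, nhead=4, batch_first=True)
            self.fusion = nn.TransformerEncoder(encoder_layer, num_layers=2)
            self.classifier = nn.Linear(d_model, num_classes)

        def forward(self, audio_feats, video_feats):
            # audio_feats: (batch, T_a, audio_dim); video_feats: (batch, T_v, video_dim)
            tokens = torch.cat([self.audio_proj(audio_feats),
                                self.video_proj(video_feats)], dim=1)
            fused = self.fusion(tokens)                # attention across both modalities
            return self.classifier(fused.mean(dim=1)) # pooled representation -> class logits

    # Example usage with random features
    model = AudioVisualFusion()
    logits = model(torch.randn(2, 10, 128), torch.randn(2, 16, 512))

Concatenating the projected audio and video tokens before the encoder lets self-attention mix information across modalities; the specific attention patterns and loss functions described in the thesis are not reproduced here.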
dc.language.iso en de_DE
dc.publisher Universität Tübingen de_DE
dc.rights ubt-podno de_DE
dc.rights.uri http://tobias-lib.uni-tuebingen.de/doku/lic_ohne_pod.php?la=de de_DE
dc.rights.uri http://tobias-lib.uni-tuebingen.de/doku/lic_ohne_pod.php?la=en en
dc.subject.ddc 004 de_DE
dc.subject.other Deep learning en
dc.subject.other Artificial intelligence en
dc.subject.other Computer vision en
dc.subject.other Efficient learning en
dc.subject.other Multimodal learning en
dc.title Multimodal Data Efficient Learning en
dc.type PhDThesis de_DE
dcterms.dateAccepted 2025-04-16
utue.publikation.fachbereich Informatik de_DE
utue.publikation.fakultaet 7 Mathematisch-Naturwissenschaftliche Fakultät de_DE
utue.publikation.noppn yes de_DE
