Abstract:
Learning under limited supervision has been a long-standing theme in machine learning. This thesis explores how limited supervision can be effectively leveraged in the domains of computer vision and vision-language modeling. It covers a range of settings: scenarios with no access to training data, scenarios with only a few labeled examples, and human-in-the-loop settings where AI systems incorporate minimal but meaningful feedback. Across these settings, the central goal is to improve model generalization with limited real data.
We first consider scenarios where no training data is available for a downstream task of vision-language models, and explore how to better utilize pre-trained foundation models without additional supervision. WaffleCLIP investigates the role of class descriptors in zero-shot classification with vision-language models and shows that performance gains stem from ensembling multiple noisy prompts rather than from the semantic richness of the prompts, calling into question current prompting techniques. FLM introduces a feasibility prediction module that leverages large language models to assess the feasibility of novel state-object compositions, improving performance in open-world compositional zero-shot learning by discarding infeasible classes from the candidate set. Together, these contributions improve the downstream performance of vision-language models in zero-shot settings.
Next, we examine the case where users have access to a small number of labeled examples. To maximize the utility of this data, we propose synthetic data generation frameworks guided by few-shot real examples, where the synthetic data serves as training data for downstream classification tasks. DataDream fine-tunes text-to-image generative models with LoRA on a handful of class-specific images, allowing the fine-tuned models to generate more realistic and discriminative data. Building on this, LoFT fine-tunes a LoRA module for each image and fuses two LoRAs from the same class at inference time, improving both the fidelity and the diversity of generated samples. These works demonstrate that synthetic data can provide information complementary to real data.
Finally, we explore the role of human intervention in interpretable models. While Concept Bottleneck Models allow humans to modify intermediate concept predictions, current approaches require extensive manual correction. To address this, we propose a Concept Realignment module that reduces the number of necessary interventions by automatically adjusting related concepts after a small human correction. This enables more efficient human-in-the-loop collaboration, making interpretable models more practical for real-world applications.
Together, these contributions present a comprehensive study of learning under limited supervision in computer vision and multimodal tasks. We demonstrate that by thoughtfully leveraging foundation models, synthetic data augmentation, and efficient human interaction, it is possible to improve performance with minimal labeled data. This thesis lays the groundwork for future research toward data-efficient, adaptive, and human-aligned AI systems.