Abstract:
The rise of massive foundation models pretrained on large-scale datasets has enabled unprecedented generalization, but their immense size poses critical deployment challenges: the memory and compute these models demand make specialized adaptation, as well as deployment in resource-constrained environments, prohibitively expensive. This thesis addresses these challenges through three interconnected contributions that progressively enable efficient model adaptation and deployment.
Our first contribution, ETHER, challenges prevailing assumptions about parameter-efficient fine-tuning (PEFT). While LoRA's simplicity has democratized fine-tuning, parallel research has investigated sophisticated geometric constraints, such as preserving hyperspherical energy (HE) via orthogonal transformations. ETHER, initially motivated by HE preservation through hyperplane reflections, shows empirically that adaptation robustness derives from bounding the Frobenius norm of weight updates rather than from maintaining HE. This insight suggests a paradigm shift: effective fine-tuning depends on constraining weight deviation, not on geometric energy preservation. However, ETHER's fixed rank, predetermined boundaries, and computational overhead from matrix multiplications limit its practical applicability.
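The bound itself is elementary; to make it concrete (notation ours, sketched for illustration rather than drawn verbatim from the thesis body): a hyperplane reflection is a Householder matrix $H = I - 2uu^\top$ with $\|u\|_2 = 1$, so applying it multiplicatively to pretrained weights $W$ gives
\[
\|HW - W\|_F = \|2uu^\top W\|_F = 2\|W^\top u\|_2 \le 2\sigma_{\max}(W),
\]
a deviation from the pretrained weights that stays bounded for any $u$, and hence throughout training.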
Building directly on this insight, our second contribution, DeLoRA, translates the Frobenius-bounded robustness principle into a practical, LoRA-like method.
DeLoRA operates with arbitrary rank, replaces matrix multiplications with additions, and implicitly maintains a Frobenius boundary by decoupling magnitude and angular learning. This decoupling enables stable training at larger learning rates and effectively prevents catastrophic overwriting, while its flexible design supports diverse applications.
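The decoupling can be sketched in a few lines of PyTorch (a minimal illustration under our own naming and parametrization, not DeLoRA's exact formulation): the low-rank update is rescaled to unit Frobenius norm, so gradient steps only rotate its direction, while a separate learnable scalar carries its magnitude and thereby caps $\|\Delta W\|_F$.

```python
import torch
import torch.nn as nn

class DecoupledLowRankLinear(nn.Module):
    """Illustrative adapter decoupling update magnitude from direction.

    The low-rank update B @ A is rescaled to unit Frobenius norm (the
    "angular" part), and a separate learnable scalar sets its magnitude,
    so ||delta_W||_F never exceeds that scalar. A sketch of the principle,
    not the exact DeLoRA parametrization.
    """

    def __init__(self, base: nn.Linear, rank: int = 8, init_scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # keep pretrained weights frozen
            p.requires_grad_(False)
        out_f, in_f = base.weight.shape
        # Small non-zero init keeps the normalization well-defined at step 0.
        self.A = nn.Parameter(torch.randn(rank, in_f) * 1e-3)
        self.B = nn.Parameter(torch.randn(out_f, rank) * 1e-3)
        self.scale = nn.Parameter(torch.tensor(init_scale))  # decoupled magnitude

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        delta = self.B @ self.A
        delta = delta / (delta.norm() + 1e-8)  # direction only: unit Frobenius norm
        # Additive update: no extra full-weight matrix multiplication.
        return self.base(x) + self.scale * (x @ delta.T)
```

Bounding the update through one explicit scalar, rather than through the raw size of B and A, is what permits aggressive learning rates on the direction without drifting far from the pretrained weights.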
Our third contribution addresses the remaining bottleneck: the foundation model itself. While parameter-efficient adapters dramatically reduce specialization costs, the base model's size remains the primary deployment constraint.
Motivated by the observation that massive generalist models contain substantial redundancy when deployed for specific operations, we demonstrate how combining parameter-efficient adaptation with knowledge distillation enables practical deployment of specialist models.
Through MemLoRA, deployed in an LLM-powered memory system, we show how lightweight adapters can function as specialist experts, one for each memory operation, when enhanced through knowledge distillation and paired with a single compact student model. This paradigm matches the performance of the large generalist model with a dramatically reduced memory footprint, faster inference, and minimal deployment cost.
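A minimal sketch of the distillation component (loss weighting, names, and hyperparameters are ours for illustration; the corresponding thesis chapter specifies MemLoRA's exact recipe): the compact student is trained with a blend of the task loss and a KL term that matches its softened output distribution to the large teacher's.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """Standard soft-label distillation: blend the hard-label task loss with
    a KL term pulling the student's softened distribution toward the teacher's."""
    # Hard-label task loss on the ground-truth targets.
    task = F.cross_entropy(student_logits, labels)
    # Soft-label loss: KL(teacher || student) on temperature-scaled logits.
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.log_softmax(teacher_logits / temperature, dim=-1),
        log_target=True,
        reduction="batchmean",
    ) * temperature**2  # rescale so gradient magnitude is temperature-invariant
    return alpha * task + (1 - alpha) * kd
```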
Together, these contributions establish a cohesive framework for making foundation models accessible in resource-constrained environments: from discovering robust adaptation principles, to designing efficient methods that embody these principles, to ultimately demonstrating practical deployment through PEFT-enhanced distillation.