QLoRA: Efficient Finetuning of Quantized LLMs presents a highly efficient approach that drastically reduces the memory required to finetune large language models. By backpropagating gradients through a frozen, 4-bit quantized pretrained model into Low-Rank Adapters (LoRA), QLoRA makes it possible to finetune a 65-billion-parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance.
The paper introduces three key innovations to achieve this memory efficiency without sacrificing performance: 4-bit NormalFloat (NF4), a quantization data type that is information-theoretically optimal for normally distributed weights; Double Quantization, which reduces memory further by quantizing the quantization constants themselves; and Paged Optimizers, which manage memory spikes during training.
Using this method, the researchers developed a model family called Guanaco, which achieved 99.3% of ChatGPT's performance on the Vicuna benchmark after just 24 hours of finetuning on a single GPU. Ultimately, the authors' analysis of over 1,000 finetuned models demonstrates that using QLoRA on small, high-quality datasets can yield state-of-the-art results, even when utilizing smaller base models.
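The core mechanism described above can be illustrated with a small NumPy sketch: the pretrained weight matrix is frozen and stored quantized (here a crude symmetric 4-bit scheme stands in for the paper's NF4 data type), while only a pair of low-rank adapter matrices would receive gradients. All dimensions and names below are illustrative, not taken from the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 64  # hidden dimension of one weight matrix (illustrative)
r = 8   # LoRA rank, with r << d

# Frozen pretrained weight, stored in 4 bits (crude symmetric quantization
# standing in for NF4).
W = rng.standard_normal((d, d)).astype(np.float32)
scale = np.abs(W).max() / 7  # signed 4-bit integer range: -7..7
W_q = np.clip(np.round(W / scale), -7, 7).astype(np.int8)

# Trainable low-rank adapters: only A and B would receive gradients.
A = rng.standard_normal((r, d)).astype(np.float32) * 0.01
B = np.zeros((d, r), dtype=np.float32)  # B starts at zero, so the adapter
                                        # path is a no-op at initialization

def forward(x):
    # Dequantize the frozen base weight on the fly, then add the adapter path.
    W_deq = W_q.astype(np.float32) * scale
    return x @ W_deq.T + x @ (B @ A).T

x = rng.standard_normal((1, d)).astype(np.float32)
y = forward(x)
print(y.shape)  # (1, 64)
```

Note the memory arithmetic that makes this attractive: the adapters hold only 2·d·r = 1,024 trainable parameters per matrix versus d² = 4,096 in the frozen base, and the base itself is stored at 4 bits rather than 16.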
By Yun Wu