Budget Instruction Fine-Tuning of Llama 3 8B Instruct (on Medical Data) with Hugging Face, Google Colab, and Unsloth
April 22, 2024

Many contemporary LLMs show impressive overall performance but often stumble when confronted with specific task-oriented challenges. Fine-tuning addresses this gap and provides significant advantages, such as reduced computational costs and the opportunity to harness cutting-edge models without starting from scratch.
Fine-tuning is a process of taking a pre-trained model and further training it on a domain-specific dataset. This process enhances the model’s performance for specific tasks, rendering it more adept and adaptable in real-world scenarios. It is an indispensable step for customizing existing models to address particular tasks or domains effectively.
Despite the relatively lower computational costs of fine-tuning LLMs compared to full training, it still demands significant GPU power. Access to such resources can be a barrier for many enthusiasts. However, Google Colab offers free-tier GPUs, and with efficient memory management facilitated by the Unsloth library, users can successfully fine-tune LLMs on T4 GPUs at no cost.
In this blog post, we explain some fundamental terms of fine-tuning, explore different approaches, provide a comprehensive guide on preparing a dataset for instruction fine-tuning focused on medical data, and finish by fine-tuning a pre-trained instruct/chat version of an LLM with 7B/8B parameters (gemma-1.1-7b-it, mistral-7b-instruct-v0.2, llama-2-7b-chat, llama-3-8b-Instruct-bnb-4bit). We’ll walk through the fine-tuning process utilizing tools and platforms such as Hugging Face, Unsloth, and Google Colab.
Exploring the concept of fine-tuning
Before delving into fine-tuning methods, it’s essential to grasp their diverse categories. Fine-tuning approaches for Large Language Models (LLMs) can be categorized based on:
- Data Utilization: The nature of data employed during fine-tuning.
- Weight Adjustment: Whether all or only specific model weights are updated.
Fine-tuning by Data Utilization
Fine-tuning strategies diverge based on the data utilized, which can be categorized into four distinct types:
- Supervised Fine-tuning
- Few-shot Learning
- Full Transfer Learning
- Domain-specific Fine-tuning
Supervised Fine-Tuning
This method represents the standard approach to fine-tuning. The model undergoes further training using a labeled dataset tailored to the specific task it aims to perform, such as text classification, question answering or named entity recognition. For example, in sentiment analysis, the model would be trained on a dataset comprising text samples annotated with their corresponding sentiments.
Few-Shot Learning
In scenarios where assembling a sizable labeled dataset proves impractical, few-shot learning steps in to provide a solution. This technique furnishes the model with a handful of examples (or shots) of the desired task at the outset of input prompts. By doing so, the model gains a better contextual understanding of the task without necessitating an exhaustive fine-tuning regimen.
Full Transfer Learning
While all fine-tuning methods involve a form of transfer learning, this category specifically enables a model to undertake tasks distinct from its original training objective. The crux lies in leveraging the knowledge amassed by the model from a broad, general dataset and applying it to a more specialized or related task.
Domain-Specific Fine-Tuning
This fine-tuning variant aims to acclimate the model to comprehend and generate text pertinent to a particular domain or industry. The model undergoes fine-tuning using a dataset comprising text specific to the target domain, thereby enhancing its contextual grasp and proficiency in domain-specific tasks. For example, to develop a chatbot for a medical application, the model would be trained on medical records to refine its language comprehension abilities within the healthcare domain.
Fine-tuning by Weight Adjustment
There are two types of fine-tuning depending on which model weights are updated during the process of fine-tuning:
- Full Fine-Tuning (Real Instruction Fine-Tuning)
- Parameter Efficient Fine-Tuning (PEFT)
Full Fine-Tuning (Real Instruction Fine-Tuning)
Instruction fine-tuning serves as a strategic approach to enhancing a model’s performance across diverse tasks by training it on guiding examples of how to respond to queries. The selection of the dataset is pivotal and tailored to the specific task at hand, be it summarization or translation. This comprehensive fine-tuning method, often termed full fine-tuning, involves updating all model weights, resulting in an optimized version. However, it imposes significant demands on memory and computational resources akin to pre-training, necessitating robust infrastructure to manage storage and processing during training.
Parameter Efficient Fine-Tuning (PEFT)
Parameter Efficient Fine-Tuning or simply PEFT represents a more resource-efficient alternative to full fine-tuning in instruction fine-tuning methodologies. While full LLM fine-tuning entails substantial computational overhead, posing challenges in memory allocation, PEFT offers a solution by updating only a subset of parameters, effectively “freezing” the remainder. This approach reduces the number of trainable parameters, thus alleviating memory requirements and guarding against catastrophic forgetting. In contrast to full fine-tuning, PEFT preserves the original LLM weights, retaining previously acquired knowledge. This feature proves advantageous for mitigating storage constraints when fine-tuning across multiple tasks. Widely adopted techniques such as Low-Rank Adaptation (LoRA) and Quantized Low-Rank Adaptation (QLoRA) exemplify effective methods for achieving parameter-efficient fine-tuning.
What are LoRA & QLoRA?
LoRA is an enhanced fine-tuning approach which diverges from the conventional method by training only two smaller low-rank matrices that approximate the update to the pre-trained model’s weight matrices, thus forming the LoRA adapter. This fine-tuned adapter is then integrated into the pre-trained model for subsequent inference tasks. Upon completion of LoRA fine-tuning for a specific task or use case, the result is an unchanged original LLM alongside a significantly smaller “LoRA adapter,” often constituting a mere fraction of the original LLM’s size (measured in MB rather than GB). During inference, the LoRA adapter must be fused with its original LLM. This approach offers a key advantage: many LoRA adapters can reuse the same original LLM, thus reducing overall memory requirements when handling multiple tasks and use cases.
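Concretely, for a frozen pre-trained weight matrix W of shape d×k, LoRA learns two much smaller matrices, B of shape d×r and A of shape r×k, with the rank r far below d and k; only A and B are trained. As a rough numerical sketch (the sizes and the alpha scaling factor below are illustrative):

```python
import numpy as np

d, k, r, alpha = 1024, 1024, 8, 16       # illustrative sizes and LoRA hyperparameters
W = np.random.randn(d, k)                # frozen pre-trained weight matrix
B = np.zeros((d, r))                     # LoRA matrices: B starts at zero,
A = np.random.randn(r, k) * 0.01         # A is randomly initialized; only A and B are trained
W_adapted = W + (alpha / r) * (B @ A)    # effective weights used at inference
```

Because only A and B are stored, the adapter holds r·(d + k) parameters instead of d·k, which is why it fits in megabytes rather than gigabytes.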
QLoRA represents a further advancement in memory efficiency over LoRA. It refines the LoRA technique by loading the frozen pre-trained model into GPU memory with its weights quantized to lower precision, typically 4-bit instead of the 8-bit precision commonly used with LoRA, while the adapter weights themselves are still trained in higher precision. This additional optimization drastically reduces the memory footprint and storage overhead. Despite the reduction in bit precision, QLoRA maintains a comparable level of effectiveness to its predecessor, demonstrating that memory usage can be optimized without compromising performance.
Full Fine-Tuning vs PEFT-LoRA vs PEFT-QLoRA
Guide for preparing data and fine-tuning of Llama 3 8B Instruct
After learning some of the basics of fine-tuning LLMs, we can now do the actual fine-tuning. In this blog post we are going to fine-tune the Llama 3 8B Instruct LLM on a custom-created medical instruct dataset. If you want to fine-tune another popular LLM such as Mistral v0.2, Llama 2 or Gemma 1.1, you can check the code in the GitHub Repository dedicated to this blog post.
Preparing instruction data for Llama 3 8B Instruct (Optional)
This step of the guide is optional if you already know how to prepare an instruction dataset for fine-tuning an LLM, or if you already have an instruction dataset prepared.
For our guide we are going to work with two publicly available medical datasets whose entries are question-answer pairs. The datasets are:
The idea is to take the question-answer pairs from both datasets, create an instruction prompt from each pair (using the Llama 3 Instruct template), convert the newly created instruction datasets into Hugging Face datasets, combine them into one big medical instruction dataset, and create a smaller version of that dataset.
We must know the prompt instruction template used by Llama 3 so that we can fully utilize Llama 3 8B Instruct. The template is:
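For reference, the Llama 3 Instruct prompt format (as documented by Meta) looks like this, with the placeholders in curly braces standing in for the actual text:

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>

{user_question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{model_answer}<|eot_id|>
```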
Note that “<|start_header_id|>”, “<|end_header_id|>” and “<|eot_id|>” are special tokens.
First we need to define the preprocessing of the datasets: rename some columns so that both datasets are uniform, drop some unused columns, remove duplicate and NaN rows, add an instruction for each entry, and create an instruct prompt for each entry. This code can be found in src/data_processing/instruct_datasets.py.
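The actual implementation lives in the repository file above; as a rough sketch of what that preprocessing might look like (the column names, the system instruction and the helper function are illustrative, not the repo’s exact API):

```python
import pandas as pd

# Hypothetical instruction prepended to every entry.
SYSTEM_PROMPT = "Answer the question truthfully, you are a medical professional."

def to_llama3_prompt(question: str, answer: str) -> str:
    # Wrap a question-answer pair in the Llama 3 Instruct template.
    return (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
        f"{SYSTEM_PROMPT}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
        f"{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
        f"{answer}<|eot_id|>"
    )

def preprocess(df: pd.DataFrame, question_col: str, answer_col: str) -> pd.DataFrame:
    # Unify column names, drop unused columns, duplicates and NaN rows,
    # then build the instruct prompt for every entry.
    df = df.rename(columns={question_col: "question", answer_col: "answer"})
    df = df[["question", "answer"]].dropna().drop_duplicates()
    df["prompt"] = df.apply(lambda r: to_llama3_prompt(r["question"], r["answer"]), axis=1)
    return df
```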
After defining the code needed for preprocessing the datasets, we are going to write a script that triggers the preprocessing, creates instruction datasets from them, creates a Hugging Face dataset for each one, merges them into one bigger dataset (also available as a Hugging Face dataset), and creates a smaller one with 2k entries taken from the bigger dataset. This code can be found in src/data_processing/create_process_datasets.py.
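A sketch of what that orchestration script might do, reusing the preprocess helper sketched above (the file paths, column names and Hub repository ids are placeholders):

```python
import pandas as pd
from datasets import Dataset, concatenate_datasets

# Preprocess both medical QA datasets into a uniform question/answer/prompt layout.
df1 = preprocess(pd.read_csv("medical_qa_dataset_1.csv"), "Question", "Answer")
df2 = preprocess(pd.read_csv("medical_qa_dataset_2.csv"), "question", "answer")

ds1 = Dataset.from_pandas(df1, preserve_index=False)
ds2 = Dataset.from_pandas(df2, preserve_index=False)

# Merge into one big instruct dataset and carve out a smaller 2k-entry version.
full_dataset = concatenate_datasets([ds1, ds2]).shuffle(seed=42)
small_dataset = full_dataset.select(range(2000))

full_dataset.push_to_hub("your-username/medical-instruct-dataset")
small_dataset.push_to_hub("your-username/medical-instruct-dataset-2k")
```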
After successfully creating the instruct datasets, we can continue to fine-tune the Llama 3 8B Instruct. Instruct datasets created with this code can be found on my Hugging Face Hub, in the collection Medical Instruct Datasets.
Fine-tuning Llama 3 8B Instruct
Step 0: Creating a Google Colab notebook and selecting the correct runtime
To start fine-tuning Llama 3 8B Instruct we first need to create a notebook on Google Colab. One option is to create a new, blank Google Colab notebook (1) and use the code from this blog post; another is to open the Google Colab notebook from the GitHub Repository, where all the code is already written (2).
Creating new blank Colab Notebook or opening already written
After successfully creating/opening the Google Colab notebook, the next step is to execute the code blocks that will fine-tune our LLM. But before we start executing them, we must ensure that we have selected the T4 Colab GPU (or another GPU runtime).
Selecting the T4 GPU runtime type
Step 1: Installing necessary packages (libraries) for the fine-tuning
The first step is to install all the necessary packages for the execution: xformers for efficient transformer layers, trl for training transformers (LLMs), peft for Parameter Efficient Fine-Tuning, accelerate to enable the GPU to be used throughout the execution, bitsandbytes for 4-bit quantization, and unsloth, which lets us operate with minimal memory usage during fine-tuning; this is essentially what allows us to fine-tune on a free T4 Google Colab GPU. Analyzing the benchmarks provided by Unsloth in the blog post shared on Hugging Face, using Unsloth can provide memory savings from 13.7% to 73.8%.
Benchmark provided by unsloth
The code for installing the necessary packages for the fine-tuning:
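At the time of writing, a typical Colab install cell looked roughly like this (the exact Unsloth install command changes over time, so check the Unsloth documentation for the current recommendation):

```python
%%capture
# Unsloth's Colab build, plus the rest of the training stack.
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps xformers trl peft accelerate bitsandbytes
```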
Step 2: Importing the installed packages (libraries)
After installing all the necessary packages we need to import them so they can be used for the process of fine-tuning the LLM.
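A minimal set of imports covering the steps below might look like this:

```python
import torch
from unsloth import FastLanguageModel, is_bfloat16_supported
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
from huggingface_hub import notebook_login
```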
Step 3: Log in to the Hugging Face Hub
The next step is to log in to the Hugging Face Hub using a read/write access token (click here to see how to create one) so that we can later upload the fine-tuned model.
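In a notebook this is a single call:

```python
notebook_login()  # paste your Hugging Face read/write access token when prompted
```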
Step 4: Creating the config
This is the most crucial step: creating the config that specifies the base model, the name of the new fine-tuned model, the data type of the matrix values, the LoRA config (rank, target modules, alpha, dropout, etc.), the dataset used for fine-tuning, and the config for the training job (hyperparameters). To learn more about hyperparameter tuning for LoRA/QLoRA you can read more here.
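A simple way to keep everything in one place is a plain Python dict; the keys and values below are illustrative rather than the repository’s exact config, and the dataset id is a placeholder:

```python
config = {
    "base_model": "unsloth/llama-3-8b-Instruct-bnb-4bit",  # 4-bit base model
    "new_model_name": "llama-3-8b-instruct-medical",        # name for the fine-tuned model
    "max_seq_length": 2048,
    "load_in_4bit": True,                                    # QLoRA-style 4-bit loading
    "lora": {
        "r": 16,
        "lora_alpha": 16,
        "lora_dropout": 0,
        "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
                           "gate_proj", "up_proj", "down_proj"],
    },
    "dataset_name": "your-username/medical-instruct-dataset-2k",
    "training": {
        "per_device_train_batch_size": 2,
        "gradient_accumulation_steps": 4,
        "learning_rate": 2e-4,
        "num_train_epochs": 1,
        "output_dir": "outputs",
    },
}
```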
Step 5: Loading model, tokenizer, configuration for LoRA(QLoRA), dataset, and trainer
In the next step we load the model and its tokenizer, set up the LoRA (QLoRA) configuration, load the training (fine-tuning) dataset and configure the trainer, all using the previously set-up config variable.
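With the Unsloth API that roughly translates to the sketch below; the arguments track the Unsloth/trl versions available in early 2024, and the "prompt" text field is an assumption about the dataset layout:

```python
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=config["base_model"],
    max_seq_length=config["max_seq_length"],
    dtype=None,                              # auto-detect: float16 on T4, bfloat16 on newer GPUs
    load_in_4bit=config["load_in_4bit"],
)

model = FastLanguageModel.get_peft_model(
    model,
    r=config["lora"]["r"],
    lora_alpha=config["lora"]["lora_alpha"],
    lora_dropout=config["lora"]["lora_dropout"],
    target_modules=config["lora"]["target_modules"],
    bias="none",
    use_gradient_checkpointing="unsloth",    # Unsloth's memory-efficient checkpointing
)

dataset = load_dataset(config["dataset_name"], split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="prompt",             # column holding the full instruct prompt
    max_seq_length=config["max_seq_length"],
    args=TrainingArguments(
        per_device_train_batch_size=config["training"]["per_device_train_batch_size"],
        gradient_accumulation_steps=config["training"]["gradient_accumulation_steps"],
        learning_rate=config["training"]["learning_rate"],
        num_train_epochs=config["training"]["num_train_epochs"],
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=10,
        optim="adamw_8bit",
        output_dir=config["training"]["output_dir"],
    ),
)
```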
Step 6: Initializing the training (fine-tuning)
First we want to check the memory usage (statistics) before we start the training (fine-tuning) process.
Now is the moment to start training (fine-tuning) the model on the medical instruct dataset.
Now we can check the memory usage after the training process.
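Condensed into one sketch, the before/train/after sequence looks roughly like this (the same pattern Unsloth’s example notebooks use):

```python
# Memory statistics before training.
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024**3, 3)
max_memory = round(gpu_stats.total_memory / 1024**3, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

# Run the fine-tuning on the medical instruct dataset.
trainer_stats = trainer.train()

# Memory statistics after training.
used_memory = round(torch.cuda.max_memory_reserved() / 1024**3, 3)
used_memory_for_training = round(used_memory - start_gpu_memory, 3)
print(f"Peak reserved memory = {used_memory} GB "
      f"({round(used_memory / max_memory * 100, 1)}% of max).")
print(f"Peak reserved memory for training = {used_memory_for_training} GB.")
```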
Step 7: Saving the trainer stats and the model
After fine-tuning the model we need to keep track of the trainer_stats (time required for the training job, training loss, etc.).
We are going to save the fine-tuned model locally (in the Google Colab notebook environment) and on our Hugging Face Hub.
There are multiple ways of saving our model with different quantization methods.
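A sketch of the saving step; the Hub repository id is a placeholder, and the GGUF export with q4_k_m is just one of the quantization options Unsloth supports:

```python
# Save the LoRA adapters locally and push them to the Hugging Face Hub.
model.save_pretrained(config["new_model_name"])
tokenizer.save_pretrained(config["new_model_name"])
model.push_to_hub(f"your-username/{config['new_model_name']}")
tokenizer.push_to_hub(f"your-username/{config['new_model_name']}")

# Optionally export a merged model in GGUF format with 4-bit quantization.
model.save_pretrained_gguf(config["new_model_name"], tokenizer, quantization_method="q4_k_m")
```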
Step 8: Loading the fine-tuned model and running inference
Now we can test and run an inference on the fine-tuned model by executing this code block.
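A minimal inference sketch (the question is just an example; the tokenizer’s chat template takes care of the Llama 3 prompt format):

```python
FastLanguageModel.for_inference(model)   # enable Unsloth's faster inference mode

messages = [{"role": "user", "content": "What are the symptoms of iron deficiency anemia?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")

outputs = model.generate(input_ids=input_ids, max_new_tokens=256, use_cache=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```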
Congratulations, we have now successfully instruction fine-tuned an LLM with 8B parameters on medical data. All the models fine-tuned using these medical datasets and this code can be found on my Hugging Face Hub, in the collection Medical Instruct Models.
Optional Step: Evaluation of the fine-tuned model
There are multiple ways of evaluating LLMs. When an LLM is fine-tuned on domain-specific data, as in our case on medical instruction data (questions and answers), having a good evaluation is crucial. That task is mostly manual: inference is run for multiple questions and the answers are analyzed by a human evaluator (in this case a person with a medical background). However, if we know the expected output for specific questions, we can take another approach, such as analyzing n-grams (sequences of words) by comparing the output with the ground truth. For this kind of evaluation we can use different metrics (scores) such as ROUGE, BLEU and METEOR. Another alternative is to evaluate the LLM by submitting it to the Open LLM Leaderboard. More information about it can be found here.
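For the n-gram comparison, the Hugging Face evaluate library ships a ROUGE implementation; a minimal sketch with made-up prediction and reference strings:

```python
# pip install evaluate rouge_score
import evaluate

rouge = evaluate.load("rouge")

predictions = ["Iron deficiency anemia commonly causes fatigue, pallor and shortness of breath."]  # model outputs
references = ["Common symptoms include fatigue, pale skin and shortness of breath."]               # ground-truth answers

scores = rouge.compute(predictions=predictions, references=references)
print(scores)   # rouge1, rouge2, rougeL, rougeLsum
```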
Conclusion
As seen from the graph showing memory usage over time, Unsloth provides extremely efficient memory usage, which allows us to fine-tune many open-source LLMs. It needs only 5.67 GB of VRAM to load the LLM (Llama 3 8B Instruct) and 4.2 GB for the PEFT using the LoRA (QLoRA) method. So the whole execution uses only about 66% of the available VRAM, and the fine-tuning part on its own uses only about 28% of it.
Memory usage over the whole execution
In conclusion, the process of fine-tuning LLMs stands as a crucial step towards harnessing the full potential of pre-trained models for specific tasks and domains. Despite the challenges posed by computational resources, solutions such as Google Colab’s free-tier GPUs and memory management tools like Unsloth pave the way for enthusiasts to engage in this transformative process without financial barriers.
Resources
- Code available on GitHub Repository
- Medical Instruct Datasets
- Medical Instruct Models
- Hugging Face
- Google Colab
- Unsloth
- More about Unsloth
- More about LoRA
- More about QLoRA
- More about hyperparameter tuning for LoRA/QLoRA
- More about evaluation using ROUGE(and similar metrics)
Originally posted at: https://mlops.community/budget-instruction-fine-tuning-of-llama-3-8b-instructon-medical-data-with-hugging-face-google-colab-and-unsloth/