Fine-tuning Llama3 Models with LoRA on Custom Data
Meta released the Llama 3.1 models just yesterday (23rd of July, 2024), so I thought it would be a great time to discuss how we can fine-tune Llama 3 models. In this blog, we will fine-tune the Llama 3 8B model with Low-Rank Adaptation (LoRA) to enhance its performance on particular tasks/datasets.
Table of Contents
- Low-Rank Adaptation (LoRA)
- Concept
- Example
- Setting up the Environment
- Data Preparation
- Fine-tuning
- Seeding
- Load and Quantize Model
- Add Padding Token
- Format Training Examples
- Prepare Training Datasets
- Use LoRA
- Training Configurations
- Start Training
- Loading and Merging Saved Model
- Pushing Trained Model to HF Hub
- Evaluation
Low-Rank Adaptation (LoRA)
When fine-tuning large language models like LLaMA 3/3.1 8B, one of the biggest challenges is the required computational resources. This is where Low-Rank Adaptation (LoRA) comes in. LoRA is a technique designed to efficiently fine-tune large language models by reducing the number of trainable parameters while maintaining model performance.
Concept
The main idea of LoRA is to approximate the weight update required for fine-tuning with the product of two low-rank matrices. Instead of updating the full weight matrix during fine-tuning, LoRA freezes it and trains only these two much smaller matrices.
Example
Let’s consider a simplified example to understand how LoRA works:
Suppose we have a pre-trained weight matrix (W) of size 1000x1000 (1 million parameters). In traditional fine-tuning, we would update all of these parameters. With LoRA, using a rank r=16:
- Matrix (B) would be (1000x16)
- Matrix (A) would be (16x1000)
Total trainable parameters: (1000x16) + (16x1000) = 32,000 parameters.
This is a 96.8% reduction in trainable parameters!
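As a quick sanity check, here is a minimal sketch (plain Python, nothing from the rest of the post is needed) that reproduces the arithmetic above for any layer size d and rank r:
d, r = 1000, 16
full_params = d * d                  # parameters updated in full fine-tuning
lora_params = (d * r) + (r * d)      # parameters in B (d x r) and A (r x d)
reduction = 1 - lora_params / full_params
print(full_params, lora_params, f"{reduction:.1%}")  # 1000000 32000 96.8%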
Setting up the Environment
Note: In this post, I will be using Llama 3 8B as an example, but you should be able to train Llama 3.1 in exactly the same way. This section is only relevant if you plan to train the 3.1 models.
- Install the latest version of transformers
The new Llama 3.1 models have new attributes in the model config, so we won't be able to load them unless we upgrade the transformers library:
pip install --upgrade transformers
- Request access to the Llama 3.1 8B model: you will have to sign in to the HuggingFace Hub and request access to the Llama 3.1 8B Instruct model.
Data Preparation
As the main goal of this blog post is to train the model on your own custom dataset, we will talk in general terms about how to train the model on any dataset and how the data should be formatted.
First, let’s have two main columns in the dataset:
question: <this is the prompt, and this is what the model will be trained on>
answer: <this is the answer to the prompt/question, this is the label>
It’s not recommended to apply any normalization/cleaning to your text; it’s preferable to leave the text as is when training an LLM.
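For illustration only, here is a minimal sketch of what such a dataset could look like as a pandas dataframe; the question and answer column names come from the format above, while the example row and the custom_data.csv path are just placeholders:
import pandas as pd

# placeholder row; in practice you would load your own data, e.g. pd.read_csv("custom_data.csv")
df = pd.DataFrame({
    "question": ["<the document text the model should classify>"],
    "answer": ["relevant"],
})
print(df.head())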
Fine-tuning
1. Seeding
To ensure reproducibility, we will need to set seeds.
import random
import numpy as np
import torch
def seed_everything(seed):
    # seed Python, NumPy, and PyTorch (CPU and CUDA) for reproducibility
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

seed_everything(0)
2. Load and Quantize Model
The 8B model is still quite big to fit on the average Colab GPU (e.g., a T4), so it's recommended to quantize the model to a lower precision before starting training.
Here's how we can load and quantize the model to 4-bit (NF4) using BitsAndBytes.
Note: This reduces GPU memory usage from around 18GB to approximately 6GB.
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
BitsAndBytesConfig
)
quantization_config = BitsAndBytesConfig(
load_in_4bit=True, bnb_4bit_quant_type='nf4', bnb_4bit_compute_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct", use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Meta-Llama-3-8B-Instruct",
quantization_config=quantization_config,
device_map="auto"
)
3. Add Padding Token
Llama 3 tokenizers do not have a padding
token by default, so, to train the model in batches, we will need to configure this ourselves, and it has also proven to show better results even when training with a batch size of one sample.
PAD_TOKEN = "<|pad|>"
tokenizer.add_special_tokens({"pad_token": PAD_TOKEN})
tokenizer.padding_side = "right"
# since we added a new padding token to the tokenizer, we have to resize the model's embeddings
model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=8)
print(tokenizer.pad_token, tokenizer.pad_token_id)
# output: ('<|pad|>', 128256)
4. Format Training Examples
We need to properly format all of our training examples. I have my custom data in a pandas dataframe with two columns, question and answer, and here is how we can format them:
from textwrap import dedent
def format_example(row: dict):
    prompt = dedent(
        f"""
        {row['question']}
        """
    )
    messages = [
        # the system prompt is very important to adjust/control the behavior of the model,
        # make sure to set it properly according to your task
        {"role": "system", "content": "You're a document classifier, try to classify the given document as relevant or irrelevant"},
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": row['answer']}
    ]
    return tokenizer.apply_chat_template(messages, tokenize=False)
# format the training examples into a new text column
df['text'] = df.apply(format_example, axis=1)
5. Prepare Training Datasets
First, we need to create our training, validation, and test splits to evaluate the model during training and test it afterward
from sklearn.model_selection import train_test_split
train, temp = train_test_split(df, test_size=0.2, random_state=1)
val, test = train_test_split(temp, test_size=0.2, random_state=1)
# save training-ready data to JSON
train.to_json("train.json", orient='records', lines=True)
val.to_json("val.json", orient='records', lines=True)
test.to_json("test.json", orient='records', lines=True)
Second, create HF datasets
from datasets import load_dataset
dataset = load_dataset(
"json",
data_files={'train': 'train.json', 'validation': 'val.json', 'test': 'test.json'}
)
# print a training example
print(dataset['train'][0]['text'])
Third, create the data collator that masks the prompt, so the model is only trained on its responses
from trl import DataCollatorForCompletionOnlyLM
# to compute the loss only on the model's generation, we shouldn't include the prompt tokens that were already given as input;
# using the end-header-id token as the response template masks everything up to the assistant's reply
response_template = "<|end_header_id|>"
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)
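As an optional sanity check (not required for training), you can run the collator on a single tokenized example and confirm that every token up to the assistant's response is labeled -100, so it won't contribute to the loss:
# optional sanity check: prompt tokens should be labeled -100, response tokens keep their ids
example = tokenizer(dataset['train'][0]['text'], add_special_tokens=False)
batch = collator([example])
print(batch['labels'][0])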
6. Use LoRA
Use LoRA to reduce the number of trainable parameters. You can print the model's modules using print(model) to see the names of the modules being targeted here:
from peft import (
LoraConfig,
TaskType,
get_peft_model,
prepare_model_for_kbit_training
)
# target the linear projection layers only (attention and MLP), a common practice for LoRA fine-tuning
lora_config = LoraConfig(
r=32, # rank for matrix decomposition
lora_alpha=16,
target_modules=[
"self_attn.q_proj",
"self_attn.k_proj",
"self_attn.v_proj",
"self_attn.o_proj",
"mlp.gate_proj",
"mlp.up_proj",
"mlp.down_proj"
],
lora_dropout=0.05,
bias='none',
task_type=TaskType.CAUSAL_LM
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# output: trainable params: 83,886,080 || all params: 8,114,212,864 || trainable%: 1.0338
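The 83,886,080 figure can be verified by hand: with r=32, each targeted linear layer of shape (out, in) adds r x (in + out) LoRA parameters. Using the Llama 3 8B shapes (hidden size 4096, grouped-query key/value projections of size 1024, MLP intermediate size 14336, 32 layers), a quick back-of-the-envelope calculation gives the same number:
r, num_layers = 32, 32
hidden, kv, mlp = 4096, 1024, 14336
per_layer = (
    r * (hidden + hidden)    # q_proj (4096 x 4096)
    + r * (hidden + kv)      # k_proj (1024 x 4096)
    + r * (hidden + kv)      # v_proj (1024 x 4096)
    + r * (hidden + hidden)  # o_proj (4096 x 4096)
    + r * (hidden + mlp)     # gate_proj (14336 x 4096)
    + r * (hidden + mlp)     # up_proj (14336 x 4096)
    + r * (mlp + hidden)     # down_proj (4096 x 14336)
)
print(per_layer * num_layers)  # 83886080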
7. Training Configurations
Set the training configurations
from trl import SFTConfig, SFTTrainer
OUTPUT_DIR = "experiments"
sft_config = SFTConfig(
output_dir=OUTPUT_DIR,
dataset_text_field='text', # this is the final text example we formatted
max_seq_length=4096,
num_train_epochs=1,
per_device_train_batch_size=2, # training batch size
per_device_eval_batch_size=2, # eval batch size
gradient_accumulation_steps=4, # with gradient accumulation, weights are updated every batch_size * grad_accum_steps = 2 * 4 = 8 samples (effective batch size of 8)
optim="paged_adamw_8bit", # paged adamw
eval_strategy='steps',
eval_steps=0.2, # evaluate every 20% of the training steps
save_steps=0.2, # save every 20% of the training steps
logging_steps=10,
learning_rate=1e-4,
fp16=True, # also try bf16=True
save_strategy='steps',
warmup_ratio=0.1, # learning rate warmup
save_total_limit=2,
lr_scheduler_type="cosine", # scheduler
save_safetensors=True, # saving to safetensors
dataset_kwargs={
"add_special_tokens": False, # we template with special tokens already
"append_concat_token": False, # no need to add additional sep token
},
seed=1
)
trainer = SFTTrainer(
model=model,
args=sft_config,
train_dataset=dataset['train'],
eval_dataset=dataset['validation'],
tokenizer=tokenizer,
data_collator=collator,
)
8. Start Training
Now, we are finally ready to start training:
trainer.train()
From the logged metrics, we can see that training is going well and the validation loss is decreasing.
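Once training finishes, it can also be convenient to explicitly save the final adapter and the resized tokenizer to a known directory; the experiments/final path below is just an example:
# save the final LoRA adapter and the tokenizer (directory name is an example)
trainer.save_model(f"{OUTPUT_DIR}/final")
tokenizer.save_pretrained(f"{OUTPUT_DIR}/final")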
Loading and Merging Saved Model
Checkpoints are saved during training, but since we train with LoRA, each checkpoint contains only an adapter. So we will load both the base model and the adapter, merge them, and end up with a single model that we can easily push to the HF Hub.
from peft import PeftModel
NEW_MODEL="path_to_saved_model"
# load trained/resized tokenizer
tokenizer = AutoTokenizer.from_pretrained(NEW_MODEL)
# here we load the raw base model; if it doesn't fit on your GPU, just change device_map to 'cpu'
# (we won't need a GPU for merging anyway)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Meta-Llama-3-8B-Instruct",
torch_dtype=torch.float16,
device_map='auto',
)
model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=8)
model = PeftModel.from_pretrained(model, NEW_MODEL)
model = model.merge_and_unload()
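Optionally, you can also save the merged model and tokenizer to a local directory before pushing; the merged_model directory name below is just an example:
# optional: keep a local copy of the merged model
model.save_pretrained("merged_model", safe_serialization=True)
tokenizer.save_pretrained("merged_model")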
Pushing Trained Model to HF Hub
Now that we have merged the model and the adapter, we can push the model to the HF Hub and load it from there.
1. Sign in to the HF Hub using the HF CLI
Sign in, and make sure to create a token with write access; check the HF docs for more info.
huggingface-cli login
2. Push Model and Tokenizer
username = "your_username"
repo_name = "repo_name"
model.push_to_hub(f"{username}/{repo_name}", max_shard_size="5GB", private=True)
tokenizer.push_to_hub(f"{username}/{repo_name}", private=True)
Evaluation
In a separate notebook, we can load our trained model and tokenizer from HF hub, and use them for inference
from textwrap import dedent
import pandas as pd
import torch
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
BitsAndBytesConfig,
pipeline
)
MODEL_NAME = "your_username/repo_name"  # the repo we pushed to the HF Hub
# load the evaluation data (same question/answer columns used for training)
df = pd.read_csv('data.csv')
quantization_config = BitsAndBytesConfig(
load_in_4bit=True, bnb_4bit_quant_type='nf4', bnb_4bit_compute_dtype=torch.bfloat16
)
# load trained model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
MODEL_NAME,
quantization_config=quantization_config,
device_map="auto"
)
pipe = pipeline(
task='text-generation',
model=model,
tokenizer=tokenizer,
max_new_tokens=128,
return_full_text=False
)
def create_test_prompt(question: str):
    prompt = dedent(
        f"""
        {question}
        """
    )
    messages = [
        # use the same system prompt as in training; it is very important to adjust/control
        # the behavior of the model, so make sure to set it properly according to your task
        {"role": "system", "content": "You're a document classifier, try to classify the given document as relevant or irrelevant"},
        {"role": "user", "content": prompt},
    ]
    return tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
questions = df['question'].tolist()
prompt = create_test_prompt(questions[0])
result = pipe(prompt)[0]['generated_text']
print(result)
# output: <model's response>
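Since the task here is a binary classification (relevant vs. irrelevant), a simple way to evaluate the fine-tuned model is to run it over the whole evaluation set and compare the generated labels with the ground truth. This is just a sketch and assumes your data.csv also contains an answer column with the gold labels:
# simple accuracy check (assumes df also has an 'answer' column with the gold labels)
correct = 0
for _, row in df.iterrows():
    prompt = create_test_prompt(row['question'])
    prediction = pipe(prompt)[0]['generated_text'].strip().lower()
    if row['answer'].strip().lower() in prediction:
        correct += 1
print(f"accuracy: {correct / len(df):.2%}")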