<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>QLoRA Archives - OVHcloud Blog</title>
	<atom:link href="https://blog.ovhcloud.com/tag/qlora/feed/" rel="self" type="application/rss+xml" />
	<link>https://blog.ovhcloud.com/tag/qlora/</link>
	<description>Innovation for Freedom</description>
	<lastBuildDate>Mon, 21 Aug 2023 12:04:59 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://blog.ovhcloud.com/wp-content/uploads/2019/07/cropped-cropped-nouveau-logo-ovh-rebranding-32x32.gif</url>
	<title>QLoRA Archives - OVHcloud Blog</title>
	<link>https://blog.ovhcloud.com/tag/qlora/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Fine-Tuning LLaMA 2 Models using a single GPU, QLoRA and AI Notebooks</title>
		<link>https://blog.ovhcloud.com/fine-tuning-llama-2-models-using-a-single-gpu-qlora-and-ai-notebooks/</link>
		
		<dc:creator><![CDATA[Mathieu Busquet]]></dc:creator>
		<pubDate>Fri, 21 Jul 2023 15:04:00 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AI Notebooks]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Deep learning]]></category>
		<category><![CDATA[Fine-tuning]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[LLaMa 2]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[PyTorch]]></category>
		<category><![CDATA[QLoRA]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=25613</guid>

					<description><![CDATA[In this tutorial, we will walk you through the process of fine-tuning LLaMA 2 models, providing step-by-step instructions. All the code related to this article is available in our dedicated GitHub repository. You can reproduce all the experiments with OVHcloud AI Notebooks. Introduction On July 18, 2023, Meta released LLaMA 2, the latest version of [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p><em>In this tutorial, we will walk you through the process of fine-tuning <a href="https://ai.meta.com/llama/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">LLaMA 2</a> models, providing step-by-step instructions.</em> </p>



<figure class="wp-block-image aligncenter size-large is-resized"><img fetchpriority="high" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/07/IMG_1564-1024x538.jpg" alt="Fine-Tuning LLaMA 2 Models with a single GPU and OVHcloud" class="wp-image-25629" width="512" height="269" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/07/IMG_1564-1024x538.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/IMG_1564-300x158.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/IMG_1564-768x404.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/IMG_1564.jpg 1199w" sizes="(max-width: 512px) 100vw, 512px" /></figure>



<p class="has-text-align-center"><em>All the code related to this article is available in our dedicated <a href="https://github.com/ovh/ai-training-examples/blob/main/notebooks/natural-language-processing/llm/miniconda/llama2-fine-tuning/llama_2_finetuning.ipynb" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">GitHub repository</a>. You can reproduce all the experiments with</em> <a href="https://www.ovhcloud.com/en-gb/public-cloud/ai-notebooks/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">OVHcloud AI Notebooks</a>.</p>



<h3 class="wp-block-heading">Introduction</h3>



<p>On July 18, 2023, <a href="https://about.meta.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Meta</a> released <a href="https://ai.meta.com/llama/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">LLaMA 2</a>, the latest version of their <strong>Large Language Model </strong>(LLM).</p>



<p>Trained between January 2023 and July 2023 on 2 trillion tokens, these new models outperform other LLMs on many benchmarks, including reasoning, coding, proficiency, and knowledge tests. The release comes in several sizes, with <strong>7B</strong>, <strong>13B</strong>, and a mind-blowing <strong>70B</strong> parameters. The models are free for both commercial and research use in English.</p>



<p>To suit specific text generation needs, we will fine-tune these models using <a href="https://arxiv.org/abs/2305.14314" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">QLoRA (Efficient Finetuning of Quantized LLMs)</a>, a highly efficient fine-tuning technique that quantizes a pretrained LLM to just 4 bits and adds small &#8220;Low-Rank Adapters&#8221;. This approach allows for fine-tuning LLMs <strong>using just a single GPU</strong>! It is supported by the <a href="https://huggingface.co/docs/peft/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">PEFT library</a>.</p>



<p>To fine-tune our model, we will create an <a href="https://www.ovhcloud.com/en-gb/public-cloud/ai-notebooks/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">OVHcloud AI Notebook</a> with a single GPU.</p>



<h3 class="wp-block-heading">Mandatory requirements</h3>



<p>To successfully fine-tune LLaMA 2 models, you will need the following:</p>



<ul class="wp-block-list">
<li>Fill Meta&#8217;s form to <a href="https://ai.meta.com/resources/models-and-libraries/llama-downloads/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">request access to the next version of Llama</a>. Indeed, the use of Llama 2 is governed by the Meta license, which you must accept in order to download the model weights and tokenizer.<br></li>



<li>Have a <a href="https://huggingface.co/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Hugging Face</a> account (with the same email address you entered in Meta&#8217;s form).<br></li>



<li>Have a <a href="https://huggingface.co/settings/tokens" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Hugging Face token</a>.<br></li>



<li>Visit the page of one of the LLaMA 2 available models (version <a href="https://huggingface.co/meta-llama/Llama-2-7b-hf" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">7B</a>, <a href="https://huggingface.co/meta-llama/Llama-2-13b-hf" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">13B</a> or <a href="https://huggingface.co/meta-llama/Llama-2-70b-hf" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">70B</a>), and accept Hugging Face&#8217;s license terms and acceptable use policy.<br></li>



<li>Log in to the Hugging Face model Hub from your notebook&#8217;s terminal by running the <code>huggingface-cli login</code> command, and enter your token. You will not need to add your token as a git credential.<br></li>



<li>Powerful Computing Resources: Fine-tuning the Llama 2 model requires substantial computational power. Ensure you are running code on GPU(s) when using <a href="https://www.ovhcloud.com/en-gb/public-cloud/ai-notebooks/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">AI Notebooks</a> or <a href="https://www.ovhcloud.com/en/public-cloud/ai-training/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">AI Training</a>.</li>
</ul>



<h3 class="wp-block-heading">Set up your Python environment</h3>



<p>Create the following <code>requirements.txt</code> file:</p>



<pre class="wp-block-code"><code lang="" class="">torch
accelerate @ git+https://github.com/huggingface/accelerate.git
bitsandbytes
datasets==2.13.1
transformers @ git+https://github.com/huggingface/transformers.git
peft @ git+https://github.com/huggingface/peft.git
trl @ git+https://github.com/lvwerra/trl.git
scipy</code></pre>



<p>Then install and import the installed libraries:</p>



<pre class="wp-block-code"><code class="">pip install -r requirements.txt</code></pre>



<pre class="wp-block-code"><code lang="python" class="language-python">import os
from functools import partial

import bitsandbytes as bnb
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, AutoPeftModelForCausalLM
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed, Trainer, TrainingArguments, \
    BitsAndBytesConfig, DataCollatorForLanguageModeling</code></pre>



<h3 class="wp-block-heading">Download LLaMA 2 model</h3>



<p>As mentioned before, LLaMA 2 models come in different sizes: 7B, 13B, and 70B. Your choice can be guided by your computational resources, since larger models require more memory, processing power, and training time.</p>



<p>To download the model you have been granted access to, <strong>make sure you are logged in to the Hugging Face model hub</strong>. As mentioned in the requirements step, you need to use the <code>huggingface-cli login</code> command.</p>



<p>The following function will help us to download the model and its tokenizer. It requires a bitsandbytes configuration that we will define later.</p>



<pre class="wp-block-code"><code lang="python" class="language-python">def load_model(model_name, bnb_config):
    n_gpus = torch.cuda.device_count()
    max_memory = "40960MB"  # per-GPU memory budget

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto",  # efficiently dispatch the model across the available resources
        max_memory={i: max_memory for i in range(n_gpus)},
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_auth_token=True)

    # Needed for LLaMA tokenizer
    tokenizer.pad_token = tokenizer.eos_token

    return model, tokenizer</code></pre>



<h3 class="wp-block-heading">Download a Dataset</h3>



<p>There are many datasets that can help you fine-tune your model. You can even use your own dataset!</p>



<p>In this tutorial, we are going to download and use the <a href="https://huggingface.co/datasets/databricks/databricks-dolly-15k" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Databricks Dolly 15k dataset</a>, which contains <strong>15,000 prompt/response pairs</strong>. It was crafted by over 5,000 <a href="https://www.databricks.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Databricks</a> employees during March and April of 2023.</p>



<p>This dataset is designed specifically for fine-tuning large language models. Released under the <a href="https://creativecommons.org/licenses/by-sa/3.0/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">CC BY-SA 3.0 license</a>, it can be used, modified, and extended by any individual or company, even for commercial applications. So it&#8217;s a perfect fit for our use case!</p>



<p>However, like most datasets, this one has <strong>its limitations</strong>. Indeed, pay attention to the following points:</p>



<ul class="wp-block-list">
<li>It consists of content collected from the public internet, which means it may contain objectionable, incorrect or biased content and typos, which could influence the behavior of models fine-tuned using this dataset.<br></li>



<li>Since the dataset has been created for Databricks by their own employees, it&#8217;s worth noting that the dataset reflects the interests and semantic choices of Databricks employees, which may not be representative of the global population at large.<br></li>



<li>We only have access to the <code>train</code> split of the dataset, which is its largest subset.</li>
</ul>



<pre class="wp-block-code"><code lang="python" class="language-python"># Load the databricks dataset from Hugging Face
from datasets import load_dataset

dataset = load_dataset("databricks/databricks-dolly-15k", split="train")</code></pre>



<h3 class="wp-block-heading">Explore dataset</h3>



<p>Once the dataset is downloaded, we can take a look at it to understand what it contains:</p>



<pre class="wp-block-code"><code lang="python" class="language-python">print(f'Number of prompts: {len(dataset)}')
print(f'Column names are: {dataset.column_names}')

*** OUTPUT ***
Number of prompts: 15011
Column names are: ['instruction', 'context', 'response', 'category']</code></pre>



<p>As we can see, each sample is a dictionary that contains:</p>



<ul class="wp-block-list">
<li><strong>An instruction</strong>: What could be entered by the user, such as a question</li>



<li><strong>A context</strong>: Optional information that helps to interpret the instruction</li>



<li><strong>A response</strong>: The answer to the instruction</li>



<li><strong>A category</strong>: Classifies the sample as Open Q&amp;A, Closed Q&amp;A, Extract information from Wikipedia, Summarize information from Wikipedia, Brainstorming, Classification, or Creative writing</li>
</ul>
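<p>Each sample can thus be pictured as a plain dictionary with these four fields. A minimal illustrative record (the field values here are invented for demonstration, not taken from the dataset):</p>

```python
# Illustrative record shaped like a databricks-dolly-15k sample
# (field values are invented for demonstration purposes)
sample = {
    "instruction": "When was OVHcloud founded?",
    "context": "OVHcloud is a French cloud provider founded in 1999.",
    "response": "OVHcloud was founded in 1999.",
    "category": "closed_qa",
}

print(sorted(sample))
```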



<h3 class="wp-block-heading">Pre-processing dataset</h3>



<p><strong>Instruction fine-tuning</strong> is a common technique used to fine-tune a base LLM for a specific downstream use-case.</p>



<p>It will help us to format our prompts as follows: </p>



<pre class="wp-block-code"><code class="">Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Sea or Mountain

### Response:
I believe Mountain are more attractive but Ocean has it's own beauty and this tropical weather definitely turn you on! SO 50% 50%

### End</code></pre>



<p>To delimit each part of the prompt with these markers, we can use the following function:</p>



<pre class="wp-block-code"><code lang="python" class="language-python">def create_prompt_formats(sample):
    """
    Format various fields of the sample ('instruction', 'context', 'response')
    Then concatenate them using two newline characters 
    :param sample: Sample dictionary
    """

    INTRO_BLURB = "Below is an instruction that describes a task. Write a response that appropriately completes the request."
    INSTRUCTION_KEY = "### Instruction:"
    INPUT_KEY = "Input:"
    RESPONSE_KEY = "### Response:"
    END_KEY = "### End"
    
    blurb = f"{INTRO_BLURB}"
    instruction = f"{INSTRUCTION_KEY}\n{sample['instruction']}"
    input_context = f"{INPUT_KEY}\n{sample['context']}" if sample["context"] else None
    response = f"{RESPONSE_KEY}\n{sample['response']}"
    end = f"{END_KEY}"
    
    parts = [part for part in [blurb, instruction, input_context, response, end] if part]

    formatted_prompt = "\n\n".join(parts)
    
    sample["text"] = formatted_prompt

    return sample</code></pre>
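<p>As a quick sanity check, the same formatting logic can be run on a toy sample (values invented); note that the optional context section is dropped when it is empty:</p>

```python
def create_prompt_formats(sample):
    # Same logic as above: build the prompt sections, drop empty ones,
    # and join the remaining parts with blank lines
    INTRO_BLURB = ("Below is an instruction that describes a task. "
                   "Write a response that appropriately completes the request.")
    parts = [
        INTRO_BLURB,
        f"### Instruction:\n{sample['instruction']}",
        f"Input:\n{sample['context']}" if sample["context"] else None,
        f"### Response:\n{sample['response']}",
        "### End",
    ]
    sample["text"] = "\n\n".join(part for part in parts if part)
    return sample

toy = {"instruction": "Sea or Mountain", "context": "", "response": "Mountain"}
formatted = create_prompt_formats(toy)
print(formatted["text"])
```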



<p>Now, we will use our <strong>model tokenizer to process these prompts into tokenized ones</strong>. </p>



<p>The goal is to create input sequences of uniform length (suitable for fine-tuning because it maximizes efficiency and minimizes computational overhead), which must not exceed the model&#8217;s maximum token limit.</p>



<pre class="wp-block-code"><code lang="python" class="language-python"># SOURCE https://github.com/databrickslabs/dolly/blob/master/training/trainer.py
def get_max_length(model):
    max_length = None
    for length_setting in ["n_positions", "max_position_embeddings", "seq_length"]:
        max_length = getattr(model.config, length_setting, None)
        if max_length:
            print(f"Found max length: {max_length}")
            break
    if not max_length:
        max_length = 1024
        print(f"Using default max length: {max_length}")
    return max_length


def preprocess_batch(batch, tokenizer, max_length):
    """
    Tokenizing a batch
    """
    return tokenizer(
        batch["text"],
        max_length=max_length,
        truncation=True,
    )


# SOURCE https://github.com/databrickslabs/dolly/blob/master/training/trainer.py
def preprocess_dataset(tokenizer: AutoTokenizer, max_length: int, seed, dataset):
    """Format &amp; tokenize the dataset so it is ready for training
    :param tokenizer (AutoTokenizer): Model tokenizer
    :param max_length (int): Maximum number of tokens to emit from tokenizer
    :param seed: Seed used to shuffle the dataset
    :param dataset: Dataset to preprocess
    """
    
    # Add prompt to each sample
    print("Preprocessing dataset...")
    dataset = dataset.map(create_prompt_formats)#, batched=True)
    
    # Apply preprocessing to each batch of the dataset &amp; and remove 'instruction', 'context', 'response', 'category' fields
    _preprocessing_function = partial(preprocess_batch, max_length=max_length, tokenizer=tokenizer)
    dataset = dataset.map(
        _preprocessing_function,
        batched=True,
        remove_columns=["instruction", "context", "response", "text", "category"],
    )

    # Filter out samples that have input_ids exceeding max_length
    dataset = dataset.filter(lambda sample: len(sample["input_ids"]) &lt; max_length)
    
    # Shuffle dataset
    dataset = dataset.shuffle(seed=seed)

    return dataset</code></pre>
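<p>The context-length lookup in <code>get_max_length</code> can be exercised with stub config objects, no model required (a sketch; the 4096 value matches LLaMA 2&#8217;s <code>max_position_embeddings</code>):</p>

```python
from types import SimpleNamespace

def get_max_length(model):
    # Same lookup as above: try the known config attribute names,
    # fall back to 1024 when none is present
    for length_setting in ["n_positions", "max_position_embeddings", "seq_length"]:
        max_length = getattr(model.config, length_setting, None)
        if max_length:
            return max_length
    return 1024

# Stub objects standing in for real transformers models
llama_like = SimpleNamespace(config=SimpleNamespace(max_position_embeddings=4096))
unknown = SimpleNamespace(config=SimpleNamespace())

print(get_max_length(llama_like))  # 4096
print(get_max_length(unknown))     # 1024 (default)
```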



<p>With these functions, our dataset will be ready for fine-tuning!</p>



<h3 class="wp-block-heading">Create a bitsandbytes configuration</h3>



<p>This will allow us to load our LLM in 4 bits, dividing the memory used by roughly 4 so the model can be loaded on smaller devices. We choose the bfloat16 compute data type and nested quantization for memory-saving purposes.</p>



<pre class="wp-block-code"><code lang="python" class="language-python">def create_bnb_config():
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    return bnb_config</code></pre>
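<p>The rough arithmetic behind the memory saving, for a 7B-parameter model (weights only; the small quantization constants added by NF4 and double quantization are ignored here):</p>

```python
params = 7e9  # 7 billion parameters

fp16_gb = params * 16 / 8 / 1e9  # 16-bit weights   -> 14.0 GB
nf4_gb = params * 4 / 8 / 1e9    # 4-bit NF4 weights -> 3.5 GB

print(fp16_gb, nf4_gb, fp16_gb / nf4_gb)  # 14.0 3.5 4.0
```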



<p>To leverage the LoRA method, we need to wrap the model as a PeftModel.</p>



<p>To do this, we need to implement a <a href="https://huggingface.co/docs/peft/conceptual_guides/lora" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">LoRA configuration</a>:</p>



<pre class="wp-block-code"><code lang="python" class="language-python">def create_peft_config(modules):
    """
    Create Parameter-Efficient Fine-Tuning config for your model
    :param modules: Names of the modules to apply LoRA to
    """
    config = LoraConfig(
        r=16,  # dimension of the updated matrices
        lora_alpha=64,  # parameter for scaling
        target_modules=modules,
        lora_dropout=0.1,  # dropout probability for layers
        bias="none",
        task_type="CAUSAL_LM",
    )

    return config</code></pre>



<p>The previous function needs the <strong>target modules</strong> in order to know which matrices to update. The following function retrieves them for our model:</p>



<pre class="wp-block-code"><code lang="python" class="language-python"># SOURCE https://github.com/artidoro/qlora/blob/main/qlora.py

def find_all_linear_names(model):
    cls = bnb.nn.Linear4bit #if args.bits == 4 else (bnb.nn.Linear8bitLt if args.bits == 8 else torch.nn.Linear)
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])

    if 'lm_head' in lora_module_names:  # needed for 16-bit
        lora_module_names.remove('lm_head')
    return list(lora_module_names)</code></pre>
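<p>The name handling above simply keeps the leaf module name, so LoRA can target every matching linear layer at once. Demonstrated on invented module paths:</p>

```python
# Leaf-name extraction used in find_all_linear_names,
# demonstrated on invented module paths
module_paths = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.mlp.gate_proj",
    "lm_head",
]

lora_module_names = set()
for name in module_paths:
    parts = name.split(".")
    lora_module_names.add(parts[0] if len(parts) == 1 else parts[-1])

lora_module_names.discard("lm_head")  # excluded, as in the function above
print(sorted(lora_module_names))  # ['gate_proj', 'q_proj']
```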



<p>Once everything is set up and the base model is prepared, we can use the <em>print_trainable_parameters()</em> helper function to see how many trainable parameters are in the model. </p>



<pre class="wp-block-code"><code lang="python" class="language-python">def print_trainable_parameters(model, use_4bit=False):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        num_params = param.numel()
        # if using DS Zero 3 and the weights are initialized empty
        if num_params == 0 and hasattr(param, "ds_numel"):
            num_params = param.ds_numel

        all_param += num_params
        if param.requires_grad:
            trainable_params += num_params
    if use_4bit:
        trainable_params /= 2
    print(
        f"all params: {all_param:,d} || trainable params: {trainable_params:,d} || trainable%: {100 * trainable_params / all_param}"
    )</code></pre>



<p>We expect the LoRA model to have far fewer trainable parameters than the original model, since only the small low-rank adapter matrices are trained while the quantized base weights stay frozen.</p>
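<p>The counting logic can be checked with stub parameters, no torch needed in this sketch: a large frozen base weight plus two small trainable adapter matrices should give a trainable fraction well under 1%.</p>

```python
class StubParam:
    """Stands in for a torch parameter: a size and a trainable flag."""
    def __init__(self, n, requires_grad):
        self._n = n
        self.requires_grad = requires_grad
    def numel(self):
        return self._n

def count_parameters(named_parameters):
    # Same counting logic as print_trainable_parameters above
    trainable_params, all_param = 0, 0
    for _, param in named_parameters:
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    return trainable_params, all_param

# One frozen base weight plus two small trainable LoRA adapters
params = [
    ("base.weight", StubParam(1_000_000, requires_grad=False)),
    ("lora_A", StubParam(4_096, requires_grad=True)),
    ("lora_B", StubParam(4_096, requires_grad=True)),
]

trainable, total = count_parameters(params)
print(f"trainable%: {100 * trainable / total:.3f}")
```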



<h3 class="wp-block-heading">Train</h3>



<p>Now that everything is ready, we can pre-process our dataset and load our model using the set configurations: </p>



<pre class="wp-block-code"><code lang="python" class="language-python"># Load model from HF with user's token and with bitsandbytes config

model_name = "meta-llama/Llama-2-7b-hf" 

bnb_config = create_bnb_config()

model, tokenizer = load_model(model_name, bnb_config)</code></pre>



<pre class="wp-block-code"><code lang="python" class="language-python">## Preprocess dataset

max_length = get_max_length(model)

seed = 42  # fixed seed for reproducible shuffling

dataset = preprocess_dataset(tokenizer, max_length, seed, dataset)</code></pre>



<p>Then, we can run our fine-tuning process: </p>



<pre class="wp-block-code"><code lang="python" class="language-python">def train(model, tokenizer, dataset, output_dir):
    # Apply preprocessing to the model to prepare it by
    # 1 - Enabling gradient checkpointing to reduce memory usage during fine-tuning
    model.gradient_checkpointing_enable()

    # 2 - Using the prepare_model_for_kbit_training method from PEFT
    model = prepare_model_for_kbit_training(model)

    # Get lora module names
    modules = find_all_linear_names(model)

    # Create PEFT config for these modules and wrap the model to PEFT
    peft_config = create_peft_config(modules)
    model = get_peft_model(model, peft_config)
    
    # Print information about the percentage of trainable parameters
    print_trainable_parameters(model)
    
    # Training parameters
    trainer = Trainer(
        model=model,
        train_dataset=dataset,
        args=TrainingArguments(
            per_device_train_batch_size=1,
            gradient_accumulation_steps=4,
            warmup_steps=2,
            max_steps=20,
            learning_rate=2e-4,
            fp16=True,
            logging_steps=1,
            output_dir="outputs",
            optim="paged_adamw_8bit",
        ),
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
    )
    
    model.config.use_cache = False  # re-enable for inference to speed up predictions for similar inputs
    
    ### SOURCE https://github.com/artidoro/qlora/blob/main/qlora.py
    # Verifying the datatypes before training
    
    dtypes = {}
    for _, p in model.named_parameters():
        dtype = p.dtype
        if dtype not in dtypes: dtypes[dtype] = 0
        dtypes[dtype] += p.numel()
    total = 0
    for k, v in dtypes.items(): total+= v
    for k, v in dtypes.items():
        print(k, v, v/total)
     
    do_train = True
    
    # Launch training
    print("Training...")
    
    if do_train:
        train_result = trainer.train()
        metrics = train_result.metrics
        trainer.log_metrics("train", metrics)
        trainer.save_metrics("train", metrics)
        trainer.save_state()
        print(metrics)    
    
    ###
    
    # Saving model
    print("Saving last checkpoint of the model...")
    os.makedirs(output_dir, exist_ok=True)
    trainer.model.save_pretrained(output_dir)
    
    # Free memory for merging weights
    del model
    del trainer
    torch.cuda.empty_cache()
    
    
output_dir = "results/llama2/final_checkpoint"
train(model, tokenizer, dataset, output_dir)</code></pre>



<p><em>If you prefer to set a number of epochs (passes over the entire training dataset) instead of a number of training steps (forward and backward passes through the model with one batch of data), you can replace the <code>max_steps</code> argument with <code>num_train_epochs</code>.</em></p>
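<p><em>For example, the <code>TrainingArguments</code> above would become (a fragment; 3 epochs is just an illustrative value):</em></p>

```python
args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    warmup_steps=2,
    num_train_epochs=3,  # replaces max_steps=20; 3 is an illustrative value
    learning_rate=2e-4,
    fp16=True,
    logging_steps=1,
    output_dir="outputs",
    optim="paged_adamw_8bit",
)
```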



<p>To later load and use the model for inference, we used <code>trainer.model.save_pretrained(output_dir)</code>, which saves the fine-tuned adapter weights and their configuration; the tokenizer is saved separately in the next step.</p>



<figure class="wp-block-image size-large is-resized"><img decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/07/finetuning-llama2-results-1024x498.png" alt="" class="wp-image-25619" width="870" height="422" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/07/finetuning-llama2-results-1024x498.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/finetuning-llama2-results-300x146.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/finetuning-llama2-results-768x374.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/07/finetuning-llama2-results.png 1320w" sizes="(max-width: 870px) 100vw, 870px" /></figure>



<p class="has-text-align-center">Fine-tuning llama2 results on <a href="https://huggingface.co/datasets/databricks/databricks-dolly-15k" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">databricks-dolly-15k</a> dataset</p>



<p>Unfortunately, it is possible that the latest weights are not the best ones. To solve this problem, you can add an <code>EarlyStoppingCallback</code>, from transformers, during your fine-tuning. This will enable you to regularly evaluate your model on a validation set, if you have one, and keep only the best weights.</p>
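<p>A minimal sketch of such a setup, assuming you have split off a validation set (<code>eval_dataset</code> is hypothetical here; the callback and arguments come from transformers):</p>

```python
from transformers import EarlyStoppingCallback, Trainer, TrainingArguments

# Sketch: model, train_dataset and eval_dataset are assumed to exist;
# eval_dataset would be a held-out validation split you create yourself
trainer = Trainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    args=TrainingArguments(
        output_dir="outputs",
        evaluation_strategy="steps",  # evaluate on the validation set regularly
        eval_steps=50,
        save_strategy="steps",
        save_steps=50,
        load_best_model_at_end=True,  # keep the best checkpoint, not the last
        metric_for_best_model="eval_loss",
        greater_is_better=False,
    ),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
```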



<h3 class="wp-block-heading">Merge weights</h3>



<p>Once we have our fine-tuned weights, we can build our fine-tuned model and save it to a new directory, with its associated tokenizer. By performing these steps, we can have a memory-efficient fine-tuned model and tokenizer ready for inference!</p>



<pre class="wp-block-code"><code lang="python" class="language-python">model = AutoPeftModelForCausalLM.from_pretrained(output_dir, device_map="auto", torch_dtype=torch.bfloat16)
model = model.merge_and_unload()

output_merged_dir = "results/llama2/final_merged_checkpoint"
os.makedirs(output_merged_dir, exist_ok=True)
model.save_pretrained(output_merged_dir, safe_serialization=True)

# save tokenizer for easy inference
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.save_pretrained(output_merged_dir)</code></pre>



<h3 class="wp-block-heading">Conclusion</h3>



<p>We hope you have enjoyed this article!</p>



<p>You are now able to fine-tune LLaMA 2 models on your own datasets!</p>



<p>In our next tutorial, you will discover how to <strong>Deploy your Fine-tuned LLM on <a href="https://www.ovhcloud.com/en/public-cloud/ai-deploy/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">OVHcloud AI Deploy</a> for inference</strong>!</p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
