Fine-tuning with limited hardware resources

In 2023, we are witnessing a boom in language models and their practical applications. ChatGPT has sparked interest in replicating its success, and many teams have published the results of their work. A large portion of the new models have been released under the Apache 2.0 license, which allows for their free modification, use, and even commercialization. This is a fantastic move that will enable further dynamic development of artificial intelligence by paving the way for experimentation with fine-tuning language models to meet specific needs.

Today’s post focuses on fine-tuning large language models. As you likely know, training a large model from scratch is a difficult and costly task accessible to only a few organizations: it often requires tens of millions of dollars and results in a so-called base model. Adjusting such a model to specific tasks through further optimization is another matter. This process is called fine-tuning, and it’s available to a much broader group of specialists and organizations because it’s much simpler and cheaper. Moreover, it’s becoming possible even on commodity hardware, opening up an entirely new game.

An important note here: using OpenAI’s API and/or prompt engineering is definitely the best first step for building your own solution. However, I’ve assumed that going through the entire path from choosing a model, to gathering data, and then fine-tuning the model is a very interesting engineering task and a valuable lesson, so let’s go!

What do I want to achieve?

Apart from gaining technical experience in fine-tuning a relatively large model under limited hardware resources, my goal is to build a Polish-speaking assistant that can be used in a medical company’s chat system. The assistant’s task would be to automatically handle customers interested in a medical appointment: suggest doctors, book visits, and so on. As mentioned, the simplest way to achieve this in practice would be to use OpenAI’s API, so the following considerations should be treated solely as an engineering challenge. For English-speaking readers, working with Polish data may be a bit problematic, but fortunately, the presented code is universal and can be used for training a Polish, German, or English model.

Model Selection

When it comes to choosing a model, my options weren’t very extensive, considering my requirements and available hardware – I plan to use Google Colab with its free GPU. First, it should be a model that was open-sourced for commercial use. Second, it should be small enough that I could at least start training it on Colab’s free GPU. Third, it should have been trained, at least partially, on Polish data to be capable of responding in the expected language from the start. As of May 2023, when I began analyzing the topic, there weren’t many such models available. Only one model met all these criteria at that time: RedPajama-INCITE-Chat-3B. However, I am convinced that as the months, or even weeks, go by, more such models will become available.

PEFT, LoRA, and Quantization – Key Techniques for “Small” Fine-Tuning

To train a model under limited hardware resources, you have to tackle two issues. The primary issue is the amount of RAM available on the GPU. Google Colab and most commodity GPUs offer up to 16GB. Meanwhile, for a 3 billion-parameter model trained in FP32, you need 4 bytes to hold the value of each parameter, another 4 bytes for its gradient, and at least another 4 for the optimizer state. 12 bytes x 3B parameters = 36GB of RAM (roughly 33.5 GiB). The second issue is the cost and time involved in tuning that many parameters. On the aforementioned GPUs, this would likely take weeks. And a model with 3B parameters is not considered large these days.
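The estimate above can be checked with a few lines of Python. This is only a rough sketch that follows the simplified accounting used in this post (weights, gradients, and one optimizer value per parameter); the function name is my own, and activations and framework overhead are ignored.

```python
def full_finetune_memory_bytes(num_params, bytes_per_value=4, copies=3):
    """Rough memory estimate: weights + gradients + optimizer state.

    copies=3 mirrors the simplified accounting in the text. Optimizers such
    as Adam keep two values per parameter, so real usage is usually higher.
    """
    return num_params * bytes_per_value * copies


total = full_finetune_memory_bytes(3_000_000_000)
print(f"{total / 1e9:.1f} GB")     # decimal gigabytes: 36.0 GB
print(f"{total / 2**30:.1f} GiB")  # binary gibibytes: 33.5 GiB
```

Either way you count it, the result is more than twice the 16GB available on a free Colab GPU.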

This section is dedicated to briefly explaining the three basic concepts: PEFT (Parameter-Efficient Fine-tuning), LoRA (Low-Rank Adaptation), and quantization. They are key in the process of “small” optimization/fine-tuning of large language models.

PEFT: Parameter-Efficient Fine-tuning
PEFT is a group of techniques that optimize computational efficiency and minimize memory requirements during the fine-tuning phase. As the scale of models increases, full fine-tuning becomes increasingly computationally intensive and prohibitively expensive on consumer hardware. PEFT counteracts these limitations by optimizing only a small number of additional model parameters, while the majority of parameters from the pre-trained model remain unchanged. This strategy significantly reduces computational load and memory costs. Furthermore, PEFT helps mitigate catastrophic forgetting, a phenomenon often encountered during full fine-tuning of LLMs. For more information, visit:

LoRA: Low-Rank Adaptation
LoRA is a PEFT method that makes fine-tuning large models more efficient and, in some cases, makes it possible at all. It freezes the weights learned during pre-training and injects small trainable low-rank matrices into the layers of the Transformer architecture. As a result, it significantly limits the number of parameters to be fine-tuned for specific tasks. For example, for the GPT-3 175B model, LoRA can reduce the number of trainable parameters by a factor of 10,000 and cut the GPU memory requirements by a factor of three. For more information, visit:
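To make the idea more concrete, here is a minimal sketch of a single LoRA-adapted layer in plain Python. It is not the peft implementation: the toy sizes, the variable names, and the helper functions are my own assumptions, chosen only to show that the frozen matrix W stays untouched while two small matrices A and B carry the update.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b, scale=1.0):
    """Element-wise a + scale * b."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


random.seed(0)
d_out, d_in, r, alpha = 6, 8, 2, 4  # toy sizes, for illustration only

# Frozen pre-trained weight matrix W (d_out x d_in): never updated.
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
# Trainable low-rank factors: A is random and B starts at zero,
# so the adapted layer initially behaves exactly like the frozen one.
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
B = [[0.0] * r for _ in range(d_out)]


def lora_forward(x):
    """Compute h = W x + (alpha / r) * B (A x) for a column vector x."""
    base = matmul(W, x)
    update = matmul(B, matmul(A, x))
    return add(base, update, scale=alpha / r)


x = [[1.0] for _ in range(d_in)]
assert lora_forward(x) == matmul(W, x)  # B is zero, so nothing changes yet

full_params = d_out * d_in        # parameters in W
lora_params = r * (d_in + d_out)  # parameters in A and B together
print(f"full: {full_params}, LoRA: {lora_params}")  # full: 48, LoRA: 28
```

With these toy sizes the saving is modest, but the adapter grows with the sum of the layer dimensions while the full matrix grows with their product, so for real Transformer layers the gap becomes very large.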

Quantization
Quantization refers to the process of reducing the numerical precision of model parameters from 32-bit floating-point numbers (FP32) to smaller sizes, such as 8-bit integers (INT8). By decreasing the number of bits per model parameter, quantization significantly reduces the model’s memory and computational requirements, making it lighter, faster, and more environment-friendly. This is an essential technique for fine-tuning large models, especially when combined with LoRA.

When quantizing from 32-bit floating-point (FP32) to 8-bit integer (INT8), we reduce the size of each parameter by a factor of four, because FP32 uses 32 bits of memory while INT8 uses 8 bits.
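A simple way to see what this means in practice is absmax quantization, sketched below in plain Python. This is an illustrative scheme I picked for its simplicity, not necessarily what bitsandbytes does internally; the function names and the sample weights are made up.

```python
def quantize_absmax(values):
    """Map floats to the INT8 range [-127, 127] using one absmax scale."""
    # Fall back to 1.0 so an all-zero input does not divide by zero.
    scale = (max(abs(v) for v in values) / 127) or 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the INT8 codes."""
    return [v * scale for v in q]


weights = [0.12, -0.53, 0.98, -1.27, 0.0]
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)

print(q)  # [12, -53, 98, -127, 0]
print(max(abs(a - b) for a, b in zip(weights, restored)))  # rounding error
```

Each value now fits in one byte instead of four, at the price of a rounding error of at most half the scale.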

It’s worth noting that in machine learning, we should consider not just the parameter value but also the memory required for gradient accumulation and model optimization (like momentum values in Adam methods). As a result, it’s usually necessary to maintain at least three copies of model parameters: one for the parameter value itself, one for gradients during backpropagation, and one for update steps in the optimizer.

So actual memory savings can vary based on many factors, including how you handle the quantization process, the need for de-quantization, the optimizer used, and the specifics of your machine learning model and training process.

It’s also worth noting that most training schemes require high-precision calculations for accumulating small updates, so the above quantization scheme is generally applied only for inference, not for training. During training, quantization is usually applied in a mixed manner, where some parameters and computations are maintained with higher precision to ensure the stability and accuracy of the training process.

What does our base model offer initially?

To compare how the model responds to questions in Polish that could potentially be asked on a medical company’s helpline, we will prompt the model to generate answers to four questions by running the script below.

The complete script is available in my GitHub repo. The script requires loading the entire model into RAM and won’t be able to run on free Colab due to insufficient memory. If someone has a computer with more than 16GB of RAM, the script can successfully run on a CPU. The virtual environment must have the transformers library installed (pip install -q transformers).

from transformers import AutoModelForCausalLM, AutoTokenizer

# Define the model name and tokenizer
model_name = "togethercomputer/RedPajama-INCITE-Chat-3B-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)


prompts = ["Boli mnie brzuch. Czy może mnie Pani zapisać do lekarza? Jakie są wolne terminy?",
           "Pacjent ma kontuzjowaną rękę. Jaki lekarz powinien się nim zająć w pierwszej kolejności?",
           "W przypadku nawracających migren, czy lepiej zrobić RTG czy MRI?",
           "Ile dni spędza się w szpitalu po operacji łąkotki?"]

# Helper code to process the prompts
responses = []
for prompt in prompts:
  # Add tags to the prompt
  tagged_prompt = "\n<human>: " + prompt + "\n<bot>:"

  # Tokenize the prompt
  inputs = tokenizer(tagged_prompt, return_tensors='pt').to(model.device)
  input_length = inputs.input_ids.shape[1]

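  # Generate the output
  # NOTE: max_new_tokens is an assumed, illustrative setting; the original
  # generation arguments are not shown. return_dict_in_generate=True is
  # required because outputs.sequences is read below.
  outputs = model.generate(
      **inputs, max_new_tokens=128, return_dict_in_generate=True
  )

  # Decode only the newly generated tokens
  token = outputs.sequences[0, input_length:]
  output_str = tokenizer.decode(token)

  # Store the answer so it can be displayed later
  responses.append(output_str)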

# Let's display results
for prompt, response in zip(prompts, responses):
    print(f"Prompt: {prompt}\nResponse: {response}\n\n")

Because the responses are in Polish, I won’t cite them in the English version of this post, but as you can guess, they are not of the highest quality, to put it mildly. First, hallucinations appear. Second, the model starts conversing with itself, generating <human> and <bot> tags. Third, the rest of the conversation proceeds in English.

Why do we have this situation? Primarily, the model is quite small for language models and simply doesn’t have the capacity to handle a fairly challenging task. It was not trained on a large amount of Polish text, although it has certainly seen some. It was not trained on the kinds of texts or conversations that occur on medical helpline services.

How to Improve the Model’s Answer Quality?

We can fine-tune the model on our own data. The general procedure for fine-tuning is as follows: a) prepare our own data, b) load the base model, c) freeze some of the base model’s parameters to train only its end (known as the ‘head’), d) optionally, unfreeze some parameters later and repeat the fine-tuning for the entire model with a very low learning rate to avoid “catastrophic forgetting.” Catastrophic forgetting is a situation where the model completely or partially forgets previously learned patterns after learning new information.

Unfortunately, when dealing with a large model and limited hardware capabilities (as will be the case with a local GPU or in a basic Colab environment), the problem is fitting the training process into the GPU’s RAM. At home and in Colab, we usually have access to 16GB. This is insufficient even for a model with 3 billion parameters, let alone larger models. As an interesting note, at the end of June, MosaicML published their latest family of 30B-sized open-source models. This is another great step in the right direction for the AI community, but unfortunately, in our context, this information is crucial:

The size of MPT-30B was also specifically chosen to make it easy to deploy on a single GPU – either 1xA100-80GB in 16-bit precision or 1xA100-40GB in 8-bit precision. Other comparable LLMs such as Falcon-40B have larger parameter counts and cannot be served on a single datacenter GPU (today); this necessitates 2+ GPUs, which increases the minimum inference system cost.

If you don’t happen to have an NVIDIA A100 with 80GB of RAM on hand, then you’re stuck. Well, not entirely, as you can use the aforementioned quantization to reduce the model size. You can also apply the LoRA technique so that, instead of training the base model’s own weights, you fine-tune only the layers added by LoRA. MPT-30B will still require 40GB of GPU RAM, but for a model with 3 billion parameters, fine-tuning becomes possible in a free Google Colab environment.


For fine-tuning a model, we need data. I prepared it using two sources:

  1. I created the data myself and then augmented it using ChatGPT.
  2. I automatically translated similar data gathered from open-source datasets available on the internet into Polish. For the automatic translation from English to Polish, I used the m2m100_1.2B model released by Meta.

As a result, I obtained my own dataset with over 1300 items, which should be sufficient to see some initial fine-tuning results and achieve a slight improvement in quality.

The data used for training models from the RedPajama-INCITE-Chat-nB family must be provided in a specific format. This format is described in RedPajama’s materials. Below, I’m pasting a single sample item from the dataset so that the reader can understand its structure. The entire dataset is stored in a JSON file, which I read during training (see the script below).

{"text": "<human>: Dzień dobry, potrzebuję porady lekarskiej. Mam problem z ręką\n<bot>: Rozumiem, czy mogłaby Pani powiedzieć bliżej co się dzieję?\n<human>: Spuchła mi ręka w okolicach nadgarstka, jest obolała.\n<bot>: Proponuję wizytę u ortopedy. Najbliższy termin jest za 2 dni\n<human>: Trochę długo. Mam tyle czekać?\n<bot>: Niestety, nie mam terminu wczesniej. Chyba że woli Pani wizytę u lekarza ogólnego?\n<human>: Nie, to już wolę u ortopedy. Proszę mnie zapisać. Na którą godzinę?\n<bot>: Na 15:15 i 17:30. Która Pani bardziej pasuje?\n<human>: Na 17:30\n<bot>: Poprosze o imię i nazwisko"}
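If you want to build such items from your own conversations, a small helper along these lines may be useful. It is a sketch under my own assumptions: the function name and the list-of-tuples input are hypothetical, and the <human> and <bot> tags are the same ones used in the prompts elsewhere in this post.

```python
import json


def to_training_item(turns):
    """Join (speaker, text) turns into a single item with a "text" field."""
    lines = [f"<{speaker}>: {text}" for speaker, text in turns]
    return {"text": "\n".join(lines)}


conversation = [
    ("human", "Dzień dobry, potrzebuję porady lekarskiej."),
    ("bot", "Rozumiem, czy mogłaby Pani powiedzieć bliżej co się dzieje?"),
]

# One JSON object per line, which is how the training script reads the file.
# ensure_ascii=False keeps Polish characters readable in the output.
line = json.dumps(to_training_item(conversation), ensure_ascii=False)
print(line)
```

Writing one such line per conversation gives a file that can be parsed line by line with json.loads.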

Fine-Tuning Script

Below is a sample script that performs simple fine-tuning on free Colab. I’ve added a short comment before each code snippet. Much of the commentary was generated by ChatGPT, so it might occasionally sound artificial. If you read code easily, you can skip the comments. The script is also available on my GitHub.

Package Installation: Required Python libraries are installed using the !pip install -q command. These include: transformers (for language model handling), datasets (for data management), accelerate (for device placement and efficient training), peft (parameter-efficient fine-tuning methods such as LoRA), bitsandbytes (for 8-bit quantization), and python-dotenv (for handling environment files).

Package Import: After installing the packages, they are imported into the script using the import command. Among others, classes like LoraConfig (configuration for the LoRA technique) as well as AutoTokenizer, AutoConfig, and AutoModelForCausalLM from the transformers package are imported for language model handling.

Mounting Google Drive: At the end of the code snippet is the command drive.mount('/gdrive'), which mounts Google Drive in the Google Colab environment. This enables the script to access the data file and save the fine-tuned model.

!pip install -q transformers datasets accelerate peft bitsandbytes python-dotenv

import torch
import torch.nn as nn
import json
import transformers
from datasets import Dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training, TaskType
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM
import os
from dotenv import load_dotenv, find_dotenv
import random

from google.colab import drive

# mount Google Drive so that the dataset and checkpoints are accessible
drive.mount('/gdrive')

The next code segment deals with the preparation of models, data, and the tokenizer.

  1. Model and Data Location Definition: Here, the names of the base model (BASE_MODEL_NAME), the fine-tuned model (MY_MODEL_NAME), the data set path (MY_DATASET), and the path to the environment file containing the secret required for Huggingface integration (ENV_FILE) are defined.
  2. Loading and Shuffling Data: The data file is opened using the open function. Data is then loaded and shuffled – a commonly used technique in preparing data for machine learning.
  3. Loading the Model and Tokenizer: The model and tokenizer are loaded using the AutoModelForCausalLM.from_pretrained and AutoTokenizer.from_pretrained functions. The tokenizer is a tool that converts raw text into a form that the model can process (known as tokens).

# models and data
BASE_MODEL_NAME='togethercomputer/RedPajama-INCITE-Chat-3B-v1'  # name of the model I want to fine-tune
MY_MODEL_NAME='RedPajama-Chat3B-Polish'  # my fine-tuned model name
MY_DATASET='/gdrive/My Drive/Colab Notebooks/Data/Combined dataset 1-2 2023.06.18.json'  # my dataset I use during fine-tuning
ENV_FILE='/gdrive/My Drive/Colab Notebooks/.env'  # file in which I store secrets (currently the Huggingface Hub access token)
TOKEN_NAME='HF_COLAB_RP_CHAT_3B'  # name of the environment variable storing access token for the Hugginghface Hub
OUTPUT_DIR='/gdrive/My Drive/Colab Notebooks/Models/'  # output directory to which checkpoints are saved

# read and shuffle my dataset (one JSON object per line)
with open(MY_DATASET, 'r', encoding='utf-8') as fp:
    data = [json.loads(x) for x in fp.readlines()]
random.shuffle(data)

# load the base model and a tokenizer
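model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL_NAME,
    load_in_8bit=True,   # INT8 weights, matching the Linear8bitLt layers shown later
    device_map='auto',   # place the model on the available GPU automatically
)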
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token

Below, the previously loaded data is processed:

  1. Creating the Data Set: The data loaded from the file is converted into a Dataset object using the from_list method. Dataset is a class from the datasets library, which allows convenient data management while training language models.
  2. Data Preprocessing: The data is then processed (tokenized) using the map method, which applies a function (in this case, the tokenizer) to each element in the data set. The batched=True flag is set when you want to process the data in batches, which is usually more efficient in terms of performance.
  3. Displaying Data Set Size: Finally, the size of the processed data set is displayed using the len function.

# preprocess data and print size of my dataset
data = Dataset.from_list(data)
data = data.map(lambda samples: tokenizer(samples['text']), batched=True)
print(len(data))
>>> 1391

In this code snippet, we prepare the model for the fine-tuning process using the LoRA (Low-Rank Adaptation) technique and INT-8 quantization. At the very beginning, the current model structure is displayed:

  1. LoRA Configuration: Using LoraConfig, we specify the configuration for the LoRA technique. We choose the rank of the low-rank approximation (r), the scaling constant (lora_alpha), the modules in the model that should be adapted (target_modules), the dropout value for LoRA (lora_dropout), the strategy to apply to biases (none), and the task type (CAUSAL_LM).
  2. Preparing the Model for INT-8 Training: Using the function prepare_model_for_int8_training, the model is prepared for the fine-tuning process, using INT-8 quantization. This allows for significant memory savings during training.
  3. Adding LoRA Adapters: Using the get_peft_model function, LoRA adapters are added to the model in accordance with the previously established configuration. A LoRA adapter is a specialized module that allows for more effective fine-tuning of the model while preserving its general structure.
  4. Displaying Trainable Parameters: Finally, using the print_trainable_parameters method, we display the number of model parameters that will be trained during the fine-tuning process. This is useful for verifying whether the model configuration aligns with expectations.

# display the current model structure
print(model)
>>> GPTNeoXForCausalLM(
  (gpt_neox): GPTNeoXModel(
    (embed_in): Embedding(50432, 2560)
    (layers): ModuleList(
      (0-31): 32 x GPTNeoXLayer(
        (input_layernorm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
        (attention): GPTNeoXAttention(
          (rotary_emb): RotaryEmbedding()
          (query_key_value): Linear8bitLt(in_features=2560, out_features=7680, bias=True)
          (dense): Linear8bitLt(in_features=2560, out_features=2560, bias=True)
        )
        (mlp): GPTNeoXMLP(
          (dense_h_to_4h): Linear8bitLt(in_features=2560, out_features=10240, bias=True)
          (dense_4h_to_h): Linear8bitLt(in_features=10240, out_features=2560, bias=True)
          (act): GELUActivation()
        )
      )
    )
    (final_layer_norm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
  )
  (embed_out): Linear(in_features=2560, out_features=50432, bias=False)
)

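lora_config = LoraConfig(
    r=16,  # rank; this value reproduces the trainable parameter count printed below
    lora_alpha=32,  # scaling constant (assumed value; the original setting is not shown)
    # When using LoRA it is important to apply it
    # to ALL `Linear` layers of the model to get results similar to full fine-tuning.
    # should we also wrap embed_out?
    target_modules=["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h", "embed_out"],
    lora_dropout=0.05,  # assumed value; the original setting is not shown
    bias="none",  # bias terms are not trained
    task_type=TaskType.CAUSAL_LM,
)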

# prepare int-8 model for training:
# From method description: this method wraps the entire protocol for preparing a model before running a training.
# This includes: 1- Cast the layernorm in fp32 2- making output embedding layer require grads 3- Add the upcasting of the lm head to fp32
model = prepare_model_for_int8_training(model)

# add LoRA adaptor
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
>>> trainable params: 21819392 || all params: 2797683712 || trainable%: 0.7799091765238114

It’s worth noting the trainable% value in the output above. This is the whole secret of LoRA’s efficiency – we are training only about 0.78% of the overall number of parameters.
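The reported number can be reproduced by hand from the layer shapes in the printed model structure. The sketch below is my own arithmetic, not part of the training script; it assumes a LoRA rank of 16, which is the value that matches the reported count exactly.

```python
# (in_features, out_features) of each adapted Linear layer inside one block
per_layer = {
    "query_key_value": (2560, 7680),
    "dense": (2560, 2560),
    "dense_h_to_4h": (2560, 10240),
    "dense_4h_to_h": (10240, 2560),
}
embed_out = (2560, 50432)  # output projection, adapted once
num_layers = 32
r = 16  # assumed rank; it reproduces the reported count


def lora_param_count(shape, rank):
    """A LoRA adapter adds A (rank x in) and B (out x rank)."""
    n_in, n_out = shape
    return rank * (n_in + n_out)


trainable = num_layers * sum(lora_param_count(s, r) for s in per_layer.values())
trainable += lora_param_count(embed_out, r)

print(trainable)                               # 21819392
print(f"{100 * trainable / 2797683712:.2f}%")  # 0.78%
```

Because the adapter size scales linearly with the rank, halving r would roughly halve the number of trainable parameters.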

The code presented in this section relates to setting training parameters and initializing the Trainer object, which will manage the training process.

  1. Setting Training Arguments: Using TrainingArguments, we specify a range of training parameters. These include the path to the folder where the results will be saved (output_dir), batch size for training (per_device_train_batch_size) and evaluation (per_device_eval_batch_size), the number of warm-up steps for the learning rate schedule (warmup_steps), the maximum number of steps (max_steps), and how often information about the training process should be logged (logging_steps).
  2. Trainer Initialization: The Trainer object is initialized with the previously prepared model, training data, set training arguments, and a DataCollatorForLanguageModeling object, which is responsible for preparing data batches for the training process. The argument “mlm=False” indicates that the model is not being trained in Masked Language Modeling mode.
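# set the training arguments for Trainer
training_args = transformers.TrainingArguments(
    output_dir=OUTPUT_DIR + MY_MODEL_NAME,  # output directory
    per_device_train_batch_size=4,   # batch size per device during training
    per_device_eval_batch_size=4,    # batch size for evaluation
    warmup_steps=100,                # number of warmup steps for learning rate scheduler
    max_steps=650,                   # assumed value; the training log below ends at step 650
    logging_steps=50,                # assumed value; matches the 50-step interval of the log
)

trainer = transformers.Trainer(
    model=model,                     # the 8-bit model with LoRA adapters
    train_dataset=data,              # the tokenized dataset
    args=training_args,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)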

This short code snippet focuses on the actual training process of the model.

  1. Disabling Cache: The line model.config.use_cache = False disables the caching of attention key/value states computed for earlier tokens. This cache speeds up text generation, but it is not needed during training, so it is better to disable it here to reduce memory usage.
  2. Starting Training: The method trainer.train() initiates the actual training process, in accordance with previously defined training arguments and data. All training details, such as the optimization strategy, learning rate schedule, model-saving policies, etc., are already defined in the Trainer object.

# turn off caching to save RAM
model.config.use_cache = False

# run the fine-tuning
trainer.train()

Step	Training Loss
50	1.496200
600	0.882300
650	0.854300

The last code snippet focuses on saving and sharing the trained model.

  1. Saving the Model to Disk: The command model.save_pretrained(f"{OUTPUT_DIR}{MY_MODEL_NAME}") saves the trained model to disk, allowing for its later use or sharing.
  2. Loading Huggingface Hub API Key: The code loads the Huggingface Hub API key from the .env file using the method load_dotenv(find_dotenv(filename=ENV_FILE)). The key is then assigned to the variable api_key.
  3. Uploading the Model to Huggingface Hub: The method model.push_to_hub uploads the trained model to the Huggingface Hub. The model will be publicly available under the name specified by MY_MODEL_NAME, allowing others to use it in their projects. The use_auth_token argument specifies the API key that authorizes the upload operation. commit_message is a message that will be attached to the model’s save logs on Huggingface Hub, similar to version control systems like git.

# save a trained model to a drive
model.save_pretrained(f"{OUTPUT_DIR}{MY_MODEL_NAME}")

# Read the Huggingface Hub api key to be able to save my model to the hub
_ = load_dotenv(find_dotenv(filename=ENV_FILE))
api_key  = os.environ[TOKEN_NAME]

# Saving the model to the Hugging Face Hub
model.push_to_hub(MY_MODEL_NAME, use_auth_token=api_key, commit_message="The first bigger training on 1391 samples.")

Performance Evaluation

Our base model has been fine-tuned. It’s time to run it to see if there’s any positive difference. The model that has been quantized and has used LoRA needs to be loaded in a slightly different way than the base model. The following script shows how this can be done. The script is available on my GitHub and should be run in a Colab environment with a GPU.

!pip install -q transformers accelerate peft bitsandbytes

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftConfig, PeftModel
import accelerate
import bitsandbytes

# Define the model name and tokenizer
BASE_MODEL_NAME = 'togethercomputer/RedPajama-INCITE-Chat-3B-v1'
FINETUNED_MODEL_NAME = 'aigeekprogrammer/RedPajama-Chat3B-Polish-v2'

Here comes a crucial step: before loading the base model, we load the PEFT configuration from Huggingface:

config = PeftConfig.from_pretrained(FINETUNED_MODEL_NAME, inference_mode=True)

Next, we load the base model. We fine-tuned an 8-bit quantized version, and that’s what we need to load now. In the following step, we load the trained layers added by LoRA and the tokenizer:

model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, return_dict=True, load_in_8bit=True, device_map='auto')
model = PeftModel.from_pretrained(model, FINETUNED_MODEL_NAME)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

prompts = ["Boli mnie brzuch. Czy może mnie Pani zapisać do lekarza? Jakie są wolne terminy?",
           "Pacjent ma kontuzjowaną rękę. Jaki lekarz powinien się nim zająć w pierwszej kolejności?",
           "W przypadku nawracających migren, czy lepiej zrobić RTG czy MRI?",
           "Ile dni spędza się w szpitalu po operacji łąkotki?"]

responses = []
for prompt in prompts:
  # Add tags to the prompt
  tagged_prompt = "\n<human>: " + prompt + "\n<bot>:"

  # Tokenize the prompt
  inputs = tokenizer(tagged_prompt, return_tensors='pt').to(model.device)
  input_length = inputs.input_ids.shape[1]

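  # Generate the output
  # NOTE: max_new_tokens is an assumed, illustrative setting; the original
  # generation arguments are not shown. return_dict_in_generate=True is
  # required because outputs.sequences is read below.
  outputs = model.generate(
      **inputs, max_new_tokens=128, return_dict_in_generate=True
  )

  # Decode only the newly generated tokens
  token = outputs.sequences[0, input_length:]
  output_str = tokenizer.decode(token)

  # Store the answer so it can be displayed later
  responses.append(output_str)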

In the final step, we can display the output:

for prompt, response in zip(prompts, responses):
    print(f"Prompt: {prompt}\nResponse: {response}\n\n")

Conclusions and Summary:

  1. On the positive side: the model no longer switches to English on its own; it consistently stays in Polish.
  2. On the positive side: the text seems a bit more logical and internally coherent, although it’s still very far from the expected level.
  3. On the downside: the model still hallucinates and still converses with itself. The latter can be mitigated by cutting off the output at the first <human> tag.
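Cutting off the output at the first <human> tag takes only a few lines. The helper below is a minimal sketch; its name and the sample string are made up for illustration.

```python
def truncate_at_tag(text, stop_tag="<human>"):
    """Return the text before the first stop tag, without surrounding whitespace."""
    head, _, _ = text.partition(stop_tag)
    return head.strip()


raw = " Proponuję wizytę u ortopedy.\n<human>: Dziękuję\n<bot>: Proszę"
print(truncate_at_tag(raw))  # Proponuję wizytę u ortopedy.
```

If the tag never appears, the whole response is returned unchanged apart from the stripped whitespace.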

The results are not spectacular, but this was essentially “nano” fine-tuning: just an hour of training on a free GPU, and on a very small dataset. It’s reasonable to assume that if we increase the dataset size by a factor of ten or more, to around 20,000 – 30,000 elements, the results will be significantly better. Additionally, fine-tuning should last at least several hours, if not more. Lastly, we used a 3B parameter model that was trained on a limited amount of Polish text. If you’re serious about custom fine-tuning, you should aim for 7B, 13B, or perhaps even 30B models. Models of these sizes are currently (as of June 2023) available under open-source licenses that allow commercial use. Of course, this will come at a cost, as it’s not feasible on free or commodity hardware, but the results should be noticeably better.

To conclude, I’ll reiterate something I’ve mentioned before. We’re at a point where the approach should be: prompt-engineering -> (“Not yielding good results?”) -> OpenAI API -> (“Still unsatisfactory?! That’s impossible! You’re probably doing something wrong…”) -> fine-tuning open-source models on your own data -> (“Told you, you’d spend money and it still won’t work!”).

And with that optimistic note, I end this post. I hope it was interesting and useful. If you have any questions, opinions, or suggestions, feel free to contact me or leave comments. Cheers.