About Us
Highlights FPT Cloud Server FPT AI Factory FPT Network FPT Cloud Backup & DR FPT Storage FPT Security FPT Container FPT Database FPT Cloud Monitoring FPT Data Suite FPT.AI

Show all

Object Storage

Secure, unlimited storage to ensures efficiency as well as high and continuous data access demand.

GPU Server

Virtual server integration for 3D Rendering, AI or ML

FPT Load Balancing

Enhance application capacity and availability.

FPT AI Factory

Access to an all-inclusive stack for AI development, driven by NVIDIA’s powerful technology!

Cloud WAF

FPT Web Application Firewall provides powerful protection for web applications

Cloud Server

Advanced virtual server with rapid scalability

Backup Service

Backup and restore data instantly, securely and maintain data integrity.

Cloud Server

Advanced virtual server with rapid scalability

FPT AI Factory

Access to an all-inclusive stack for AI development, driven by NVIDIA’s powerful technology!

FPT Load Balancing

Enhance application capacity and availability.

Backup Service

Backup and restore data instantly, securely and maintain data integrity.

Disaster Recovery Service

Recovery, ensuring quick operation for the business after all incidents and disasters.

Block Storage

Diverse throughput and capacity to meet various business workloads.

Object Storage

Secure, unlimited storage to ensures efficiency as well as high and continuous data access demand.

Cloud WAF

FPT Web Application Firewall provides powerful protection for web applications

FPT Cloud WAPPLES

Intelligent and Comprehensive Virtual Web Application Firewall - Security Collaboration between FPT Cloud and Penta Security.

Next-Gen Firewall

The Next generation firewall security service

Container Registry

Easily store, manage, deploy, and secure Container images

Kubernetes Engine

Safe, secure, stable, high-performance Kubernetes platform

FPT Database for MongoDB

Provided as a service to deploy, monitor, backup, restore, and scale MongoDB databases on cloud.

FPT Database for Redis

Provided as a service to deploy, monitor, backup, restore, and scale Redis databases on cloud.

PostgreSQL Database Engine

Provided as a service to deploy, monitor, backup, restore, and scale PostgreSQL databases on cloud.

FPT Data Suite

Helps reduce operational costs by up to 40% compared to traditional BI solutions, while improving efficiency through optimized resource usage and infrastructure scaling.
Pricing
Partner
- Tech news
- White Paper
Event

ENG

A Step-by-Step Guide to Fine-Tuning Models with FPT AI Factory and NVIDIA NeMo

Author: Lê Bạch Đức Anh

11:50 06/03/2025

Table of Contents

In the fast-paced world of artificial intelligence, large language models (LLMs) are transforming industries from healthcare to creative writing. However, training these models from scratch can be resource-intensive and impractical for most. That’s where parameter-efficient fine-tuning (PEFT) comes in. It allows you to take a pre-trained model, tweak it for your use case, and do so efficiently — saving time, computing power, and even the planet! By reducing the computational footprint, PEFT makes AI accessible on a wider range of hardware (think laptops or edge devices) and aligns with growing calls for sustainable tech practices by cutting down energy use and carbon emissions. Ready to dive in? Let’s get started!

Prerequisites

Before you embark on this fine-tuning journey, let’s make sure you’ve got the right tools and setup. Here’s what you’ll need:

Hardware:

At a minimum, you’ll need 1x NVIDIA A100 80GB GPU to handle PEFT tasks effectively, given the memory and compute demands of models like Llama 3.1–8B. However, in this guide, I’m running the process on Metal Cloud - FPT’s Bare Metal H100 server, which offers even greater power with its NVIDIA H100 GPUs. The H100 is overkill for this tutorial but provides headroom for scaling or handling larger models — expect even faster performance and efficiency compared to the A100.

Software:

NVIDIA NeMo: Ensure you have NeMo version 24.07 or later installed. You’ll run it via Docker, so familiarize yourself with Docker basics.
Python 3.8+: NeMo and its dependencies (like PyTorch and Hugging Face Transformers) require Python 3.8 or higher.
Hugging Face CLI or API: For downloading Llama 3.1–8B, you’ll need access to Hugging Face and a valid token.
Weights & Biases (WandB): Optional but highly recommended for tracking training progress. Sign up for a free account and grab an API key.
NVIDIA Drivers and CUDA Toolkit: Ensure your GPU drivers (minimum version 535 for A100/H100) and CUDA 12.1+ are installed to support NeMo’s GPU-accelerated operations.

Access and Permissions:

Request access to the Llama 3.1–8B model on Hugging Face, as it requires approval from Meta.

These prerequisites set you up for success, whether you’re on a standard A100 or leveraging the H100 server. Let’s dive into the steps!

Understanding PEFT: What It Is and Why It Matters

Before we jump into the how-to, let’s unpack what Parameter-Efficient Fine-Tuning (PEFT) really means. Picture a massive pre-trained model like Llama 3.1–8B, packed with 8 billion parameters — tiny knobs and dials that define its behavior. This model has already soaked up general knowledge from huge datasets, but now you want it to excel at something specific, like generating Japanese creative writing or answering technical questions. Fully fine-tuning all 8 billion parameters would be like rebuilding a car engine to tweak its radio — it works, but it’s overkill and burns through resources.

PEFT flips the script. Instead of adjusting every parameter, it freezes most of the model and tweaks only a small subset — sometimes as little as 0.1% of the total parameters. This keeps the heavy lifting (and GPU memory) to a minimum while still adapting the model to your task. Think of it as adding a custom filter to a pre-built camera lens: you get sharp results without redesigning the whole system. The benefits? Faster training, lower memory use, and the ability to fine-tune on a single GPU — or even a laptop in some cases. Plus, it’s kinder to the environment, slashing the energy cost of AI development.

Key PEFT Methods and Parameters

PEFT isn’t one-size-fits-all — it comes in flavors, each with its own tricks:

LoRA (Low-Rank Adaptation): This method adds small, trainable “adapter” matrices to specific layers (like attention mechanisms). In our example, we target attention_qkv (query, key, value in the transformer). Parameters here include:

Rank (r): Controls the size of the adapter matrices. A lower rank (e.g., 8 or 16) means fewer parameters to train, balancing efficiency and expressiveness.
Target Modules: Which parts of the model get adapters (e.g., attention layers). More targets = more flexibility, but also more compute.

P-Tuning: Instead of tweaking weights, P-Tuning optimizes a set of “prompt” embeddings fed into the model. Key parameter:

Prompt Length: How many tunable tokens to add (e.g., 20). Longer prompts can capture more context but increase complexity.

Adapter Tuning: Adds lightweight neural layers inside the model. Parameters include:

Adapter Size: The number of neurons in these layers (e.g., 64). Smaller sizes keep things light.

In our guide, we’ll use LoRA because it strikes a great balance between performance and efficiency, but NeMo supports other methods too — experiment to find your favorite!

Hardware Requirements: What You’ll Need

To follow along, you’ll need some decent hardware. At a minimum, I recommend 1x NVIDIA A100 80GB GPU for PEFT tasks. Why? The A100’s massive memory and computing power are ideal for handling the tensor operations and parallel processing that NeMo leverages. If you’re on a budget, a smaller GPU like an RTX 3090 (24GB) might work for lighter models, but expect longer training times and potential memory constraints. For optimal performance, especially with larger models like Llama 3.1–8B, stick with the A100 or equivalent.

Step 1: Downloading the Llama 3.1–8B Model

We’ll kick things off by grabbing the Llama 3.1–8B model in Hugging Face format. This 8-billion-parameter beast from Meta AI is a fantastic starting point for fine-tuning, offering a balance of performance and efficiency.

How to Download

First, request download permission from Meta’s Hugging Face page (you’ll need to sign up and agree to their terms). Once approved, create a directory to store the model:


mkdir llama31-8b-hf

You’ve got two options to download:

Option 1: CLI Tool


huggingface-cli login
huggingface-cli download meta-llama/Llama-3.1-8B --local-dir llama31-8b-hf

Option 2: Python API

If you prefer scripting, use this Python snippet (replace <YOUR HF TOKEN> with your Hugging Face token):


from huggingface_hub import snapshot_download


snapshot_download(
 &amp;nbsp;&amp;nbsp; repo_id="meta-llama/Llama-3.1-8B",
 &amp;nbsp;&amp;nbsp; local_dir="llama31-8b-hf",
 &amp;nbsp;&amp;nbsp; local_dir_use_symlinks=False,
 &amp;nbsp;&amp;nbsp; token="&amp;lt;YOUR HF TOKEN&amp;gt;"
)

Once complete, your model files will land in ./llama31-8b-hf. Pro tip: Verify the download by checking for key files like pytorch_model.bin or model.safetensors—this ensures you’ve got everything intact.

Step 2: Converting to NeMo Format

NeMo uses its own .nemo format for models, which supports distributed checkpointing and flexible parallelism. Let’s convert our Hugging Face model to .nemo.

Launch the NeMo Container

Fire up NVIDIA’s NeMo Docker container with GPU support:


docker run --gpus device=1 --shm-size=2g --net=host --ulimit memlock=-1 --rm -it -v ${PWD}:/workspace -v ${PWD}/results:/results nvcr.io/nvidia/nemo:24.07 bash

This command maps your current directory to /workspace in the container and sets up GPU access.

Run the Conversion

Inside the container, execute:


python3 /opt/NeMo/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py --input_name_or_path=./llama31-8b-hf/ --output_path=llama31-8b.nemo

The resulting llama31-8b.nemo file is ready for fine-tuning and supports any tensor parallel (TP) or pipeline parallel (PP) configuration without additional tweaking. This flexibility is a huge win for scaling across multiple GPUs if you expand your setup later!

Preparing Your Data

Data is the lifeblood of fine-tuning. For this guide, we’ll use the Databricks Dolly 15k Japanese dataset (a translated version of Dolly 15k) as an example, but you can swap in any dataset relevant to your task — think medical QA, customer support logs, or creative writing prompts.

Step 1: Download the Dataset

Let’s pull the dataset from Hugging Face:


# load_dataset.py
from datasets import load_dataset
# Load dataset
ds = load_dataset("llm-jp/databricks-dolly-15k-ja")
df = ds["train"].data.to_pandas()
df.to_json("databricks-dolly-15k-ja.jsonl", orient="records", lines=True)

This saves the dataset as a .jsonl file, where each line is a JSON object with fields like instruction, context, and response.

Step 2: Preprocess the Data

We need to format the data into a structure NeMo can digest. Here’s a preprocessing script to combine instruction and context into an input field, paired with an output response:


# preprocess.py
import json
import argparse
import numpy as np
def to_jsonl(path_to_data):
    print("Preprocessing data to jsonl format...")
    output_path = f"{path_to_data.split('.')[0]}-output.jsonl"
    with open(path_to_data, "r") as f, open(output_path, "w") as g:
        for line in f:
            line = json.loads(line)
            context = line["context"].strip()
            instruction = line["instruction"].strip()
            if context:
                # Randomize order of context and instruction for variety
                context_first = np.random.randint(0, 2) == 0
                input_text = f"{context}\\\\n\\\\n{instruction}" if context_first else f"{instruction}\\\\n\\\\n{context}"
            else:
                input_text = instruction
            output = line["response"]
            g.write(
                json.dumps(
                    {"input": input_text, "output": output, "category": line["category"]},
                    ensure_ascii=False
                ) + "\\\\n"
            )
    print(f"Data saved to {output_path}")
def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", type=str, required=True, help="Path to jsonl dataset")
    return parser.parse_args()
if __name__ == "__main__":
    args = get_args()
    to_jsonl(args.input)

Run it like this:


python preprocess.py --input=databricks-dolly-15k-ja.jsonl

Step 3: Split the Dataset

Now, split the preprocessed data into training, validation, and test sets:


# split_train_val.py
import json
import random
input_file = "databricks-dolly-15k-ja-output.jsonl"
train_file = "training.jsonl"
val_file = "validation.jsonl"
test_file = "test.jsonl"
train_prop, val_prop, test_prop = 0.80, 0.15, 0.05
with open(input_file, "r") as f:
lines = f.readlines()
random.shuffle(lines)
total = len(lines)
train_idx = int(total * train_prop)
val_idx = int(total * val_prop)
train_data = lines[:train_idx]
val_data = lines[train_idx:train_idx + val_idx]
test_data = lines[train_idx + val_idx:]
for data, filename in [(train_data, train_file), (val_data, val_file), (test_data, test_file)]:
with open(filename, "w") as f:
for line in data:
f.write(line.strip() + "\\\\n")

This gives you three files: training.jsonl (80%), validation.jsonl (15%), and test.jsonl (5%). Here’s a sample of what the processed data looks like:


{
"input": "若い頃にもっと時間をかけてやっておけばよかったと思うことは？",
"output": "健康とウェルネスへの投資だ。若い頃に運動やバランスの取れた食事、家族との時間をもっと大切にしていれば、今後の人生がもっと豊かで楽になっていただろう。",
"category": "creative_writing"
}

Step 3: Fine-Tuning with PEFT

Time to fine-tune! We’ll use the LoRA method (as set in PEFT_SCHEME="lora"), though you can switch to P-Tuning or others by tweaking that variable. Here’s the full script:


MODEL="llama31-8b.nemo"
TRAIN_DS="[training.jsonl]"
VALID_DS="[validation.jsonl]"
TEST_DS="[test.jsonl]"
TEST_NAMES="[data]"
PEFT_SCHEME="lora"
CONCAT_SAMPLING_PROBS="[1.0]"
TP_SIZE=1
PP_SIZE=1
huggingface-cli login --token <HF_TOKEN>
export WANDB_API_KEY=<WANDB_TOKEN>
wandb login
torchrun --nproc_per_node=1 \\\\
/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py \\\\
    trainer.devices=1 \\\\
    trainer.num_nodes=1 \\\\
    trainer.precision=bf16 \\\\
    trainer.val_check_interval=20 \\\\
    trainer.max_steps=50 \\\\
    model.megatron_amp_O2=True \\\\
    ++model.mcore_gpt=True \\\\
    ++model.flash_attention=True \\\\
    model.tensor_model_parallel_size=${TP_SIZE} \\\\
    model.pipeline_model_parallel_size=${PP_SIZE} \\\\
    model.micro_batch_size=1 \\\\
    model.global_batch_size=32 \\\\
    model.optim.lr=1e-4 \\\\
    model.restore_from_path=${MODEL} \\\\
    model.data.train_ds.file_names=${TRAIN_DS} \\\\
    model.data.train_ds.concat_sampling_probabilities=${CONCAT_SAMPLING_PROBS} \\\\
    model.data.validation_ds.file_names=${VALID_DS} \\\\
    model.peft.peft_scheme=${PEFT_SCHEME} \\\\
    model.peft.lora_tuning.target_modules=[attention_qkv] \\\\
    exp_manager.create_wandb_logger=True \\\\
    exp_manager.explicit_log_dir=/results \\\\
    exp_manager.wandb_logger_kwargs.project=peft_run \\\\
    exp_manager.wandb_logger_kwargs.name=peft_llama31_8b \\\\
    exp_manager.create_checkpoint_callback=True \\\\
    exp_manager.checkpoint_callback_params.monitor=validation_loss

Key Highlights

LoRA in Action: We’re targeting attention_qkv modules, adding small adapters to fine-tune efficiently.
WandB: Tracks training progress — super handy for visualizing loss curves.
Precision: Uses bf16 (bfloat16) for faster training with minimal accuracy loss on modern GPUs.

Adjust max_steps (how many training iterations) or global_batch_size (how many samples per update) based on your dataset size and hardware. For our small example, 50 steps keep things quick.

Diving Deeper: Understanding the Parameters

Want to geek out on what’s driving this PEFT fine-tuning? Here’s a quick rundown of the most important parameters in the script and why they matter for keeping Llama 3.1–8B manageable on a single A100 — or, in my case, FPT’s bare metal H100 server:

trainer.precision=bf16: Uses bfloat16 precision for faster, memory-efficient training on modern GPUs like the A100 or H100. It’s a PEFT superpower, slashing memory use while keeping accuracy sharp.
trainer.max_steps=50: Limits training to 50 steps, keeping things quick for small datasets like ours. Bump this up for larger data or better results, but watch for longer runtimes.
model.micro_batch_size=1 & model.global_batch_size=32: Sets the batch size per GPU (1 sample) and total batch size (32 samples across GPUs). Low micro-batch size saves memory for PEFT, but you might tweak it higher if your GPU (like the H100’s 94GB) can handle more.
model.optim.lr=1e-4: Sets the learning rate to 0.0001, a small value ideal for PEFT’s delicate parameter updates (like LoRA adapters) to avoid overshooting.
model.peft.peft_scheme=lora & model.peft.lora_tuning.target_modules=[attention_qkv]: Uses LoRA for efficiency, targeting only the attention query, key, and value layers. This keeps parameter updates minimal—perfect for resource-light fine-tuning on high-performance hardware like the H100.
exp_manager.create_wandb_logger=True: Enables Weights & Biases logging to track progress live. It’s your window into loss curves and resource use, making it easier to tweak and troubleshoot, especially on a powerful setup like Metal Cloud- FPT’s H100 server.

These parameters work together to make PEFT fast, efficient, and scalable. Tweak them based on your hardware, dataset, or goals — PEFT’s flexibility is one of its biggest perks!

Visualizing PEFT Performance: Resource Usage During Fine-Tuning

Curious about what’s happening under the hood during PEFT fine-tuning? Check out this snapshot of resource metrics from the fine-tuning process (see the graphs below). These charts, captured over 500 seconds, show how our Llama 3.1–8B model behaves on an NVIDIA A100 GPU:

Memory Usage: System memory stays low (peaking at ~1.8%), while GPU memory ramps up to ~26GB (or 34–40% of the A100’s 80GB), reflecting the memory demands of loading and processing the 8-billion-parameter model and its PEFT adapters.
GPU Power and Utilization: The GPU draws up to 500W and operates at 80–85% utilization, showcasing the A100’s efficiency in handling the tensor operations and parallel processing in NeMo. This confirms PEFT’s promise of staying resource-light compared to full fine-tuning.
Memory Access Time: GPU time spent accessing memory hovers around 30–40%, indicating balanced compute and memory operations — ideal for PEFT’s low-parameter adjustments.

These metrics highlight why PEFT is a game-changer: It keeps resource usage manageable, even for a hefty model like Llama 3.1–8B, making fine-tuning feasible on a single high-end GPU. If you’re tweaking hyperparameters or scaling up, expect these patterns to shift — play around and monitor your own runs for insights!

Fine-Tuning Models with FPT AI Factory and NVIDIA NeMo

Step 4: Running Inference

Finally, let’s test our fine-tuned model! This script evaluates performance on the test set:


MODEL="llama31-8b.nemo"
PATH_TO_TRAINED_MODEL="/results/llama31-8b_lora.nemo"  # Adjust based on output from training
TEST_DS="[test.jsonl]"
TEST_NAMES="[data]"
OUTPUT_PREFIX="./results/peft_results"
TP_SIZE=1
PP_SIZE=1
[ ! -d ${OUTPUT_PREFIX} ] && mkdir -p ${OUTPUT_PREFIX}
python3 \\\\
/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_generate.py \\\\
    model.restore_from_path=${MODEL} \\\\
    model.peft.restore_from_path=${PATH_TO_TRAINED_MODEL} \\\\
    trainer.devices=1 \\\\
    model.tensor_model_parallel_size=${TP_SIZE} \\\\
    model.pipeline_model_parallel_size=${PP_SIZE} \\\\
    model.data.test_ds.file_names=${TEST_DS} \\\\
    model.data.test_ds.names=${TEST_NAMES} \\\\
    model.global_batch_size=32 \\\\
    model.micro_batch_size=4 \\\\
    model.data.test_ds.tokens_to_generate=20 \\\\
    inference.greedy=True \\\\
    model.data.test_ds.output_file_path_prefix=${OUTPUT_PREFIX} \\\\
    model.data.test_ds.write_predictions_to_file=True

This generates responses for your test inputs and saves them to


./results/peft_results_data_preds_labels.jsonl.

Dive into the output to see how your model performs—did it nail those Japanese creative writing prompts?

Wrapping Up

And there you have it — a complete guide to fine-tuning Llama 3.1–8B with FPT AI Factory, NVIDIA NeMo, and PEFT! From understanding the magic of parameter-efficient methods to running inference, you’ve now got the tools to adapt LLMs to your own projects. Play around with different datasets, tweak LoRA’s rank, or scale up to multiple GPUs — the possibilities are endless.

For more information and consultancy about FPT AI Factory, please contact:

Hotline: 1900 638 399
Email: [email protected]
Support: m.me/fptsmartcloud

Source: https://blog.usee.ai/a-step-by-step-guide-to-fine-tuning-models-with-nvidias-nemo-framework-49ba3ab27d3d

Maybe you are interested

What Are AI Agents? Examples, How they work, How to use them.

Vision-Language Models (VLM) Use Cases for Insurance Company on NVIDIA H100 GPUs

Use Cases for Training Large Language Models (LLMs) with Slurm on Metal Cloud

TOP 10 TRENDING TECHNOLOGIES IN 2021

Maybe you are interested

What Are AI Agents? Examples, How they work, How to use them.

14:07 22/07/2025

Vision-Language Models (VLM) Use Cases for Insurance Company on NVIDIA H100 GPUs

21:19 11/04/2025

Use Cases for Training Large Language Models (LLMs) with Slurm on Metal Cloud

14:38 21/04/2025

TOP 10 TRENDING TECHNOLOGIES IN 2021

09:47 05/03/2021

Cookie	Duration	Description
cookielawinfo-checbox-analytics	11 months
cookielawinfo-checbox-functional	11 months
cookielawinfo-checbox-others	11 months
cookielawinfo-checkbox-necessary	11 months
cookielawinfo-checkbox-performance	11 months
viewed_cookie_policy	11 months