Practical Guide to Distributed Training Large Language Models (LLMs) with Slurm and LLaMA-Factory on Metal Cloud
This guide provides a comprehensive walkthrough for setting up and running distributed training using LLaMA-Factory on Metal Cloud (Bare Metal Server). We cover environment setup, Slurm-based job scheduling, and performance optimizations.
Additionally, we include a hands-on training task: full fine-tuning of a Llama-3.1-8B model on the Open Instruct Uncensored Alpaca dataset (a collection of instruction-tuning samples) across 4 nodes with 8 x NVIDIA H100 GPUs per node, so you can replicate a real-world training scenario.
The training runs in a distributed environment on Metal Cloud managed by Slurm, an open-source workload manager optimized for high-performance computing. The sections below walk through the setup step by step and call out the key points readers should focus on.
By following this guide, readers can replicate the training pipeline and optimize their own LLM training workflows on Metal Cloud.
Slurm is a widely used open-source workload manager designed for high-performance computing (HPC) environments. Its efficient job scheduling, resource allocation, and scalability make it an excellent choice for AI training on Metal Cloud.
Use case: Training Large Language Models with LLaMA-Factory
One practical application of Slurm on Metal Cloud is training large language models with the LLaMA-Factory framework. By distributing training across multiple GPUs and nodes, Slurm helps reduce training time while ensuring stable and efficient execution.
Key Benefits:
Before proceeding, ensure you have the following:
To enable seamless multi-node training, ensure that NCCL and the PyTorch distributed backend can communicate between nodes over TCP/IP.
You can verify node connectivity using scontrol show nodes or sinfo.
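Before launching a large job, it also helps to confirm that Slurm sees every node with its GPUs and that all nodes respond. A minimal sketch, assuming a default Slurm setup with the 4 GPU nodes used in this guide:

# List each node with its partition, state, and GPU (gres) configuration
sinfo -N -o "%N %P %t %G"

# Show detailed node records, filtered to the most relevant fields
scontrol show nodes | grep -E "NodeName|Gres|State"

# Confirm that all 4 nodes respond
srun --nodes=4 --ntasks=4 hostname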
Assuming your system meets all the requirements for the distributed training task, run the following on each compute node to install LLaMA-Factory and all the packages it needs:
python3 -m venv venv
source venv/bin/activate
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"
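After the installation finishes, a quick sanity check on each node helps catch missing GPU drivers or a broken install early. A minimal sketch, assuming the virtual environment is still active:

# Confirm PyTorch can see the node's GPUs
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"

# Confirm the LLaMA-Factory CLI is on the PATH and report its version
llamafactory-cli version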
To demonstrate a real-world scenario, we will run full fine-tuning of a Llama-3.1-8B model on the Open Instruct Uncensored Alpaca dataset for instruction-following tasks.
The Open Instruct Uncensored Alpaca dataset is a collection of instruction-response pairs and one of the most commonly used datasets for instruction-following fine-tuning. It is publicly available on Hugging Face.
With LLaMA-Factory, you can reference a dataset by its URI on a remote repository such as Hugging Face directly in the YAML file that defines your training configuration, and LLaMA-Factory will download the dataset automatically. To do this, you must register the dataset in LLaMA-Factory/data/dataset_info.json by adding the following line:
"uncensored_alpaca": {"hf_hub_url": "xzuyn/open-instruct-uncensored-alpaca"}
If you have already downloaded the dataset to the machine, add the following line to dataset_info.json instead and you are good to go:
"your_dataset_name": {"file_name": "path/to/your/dataset.json"}
The LLaMA 3.1 8B model is one of the latest releases in Meta’s third-generation LLaMA series. It is a lightweight yet powerful large language model designed for both research and enterprise applications.
LLaMA 3.1 8B can be trained efficiently on multi-GPU multi-node Metal Cloud servers using LLaMA-Factory and DeepSpeed. The next sections of this guide will walk you through setting up distributed training for LLaMA 3.1 8B using LLaMA-Factory.
If you want to download the model directly from Hugging Face, you can use the command below:
huggingface-cli download meta-llama/Llama-3.1-8B --local-dir=Llama-3.1-8B
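Note that the Llama 3.1 weights are gated on Hugging Face: you need to accept Meta's license on the model page and authenticate with an access token before the download will succeed. A minimal sketch:

# Authenticate with a Hugging Face access token that has read permission
huggingface-cli login

# After the download finishes, confirm the weights and tokenizer files are present
ls Llama-3.1-8B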
LLaMA-Factory uses YAML configuration files to manage training parameters efficiently. A YAML-based configuration simplifies hyperparameter tuning and ensures reproducibility.
This section explains how to prepare a YAML configuration file for fine-tuning the LLaMA 3.1 8B model using LLaMA-Factory.
LLaMA-Factory provides various predefined YAML training configuration files, located in LLaMA-Factory/examples. Here is a YAML file for full fine-tuning LLaMA 3.1 8B with the Open Instruct Uncensored Alpaca dataset:
model_name_or_path: meta-llama/Llama-3.1-8B
trust_remote_code: true

stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z2_config.json

dataset: uncensored_alpaca
template: llama3
cutoff_len: 2048
max_samples: 500000
overwrite_cache: true
preprocessing_num_workers: 16

output_dir: saves/llama3.1-8b/full/sft
logging_steps: 10
save_steps: 10000
plot_loss: true
overwrite_output_dir: true

per_device_train_batch_size: 4
gradient_accumulation_steps: 2
learning_rate: 1.0e-5
num_train_epochs: 2.5
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

val_size: 0.001
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 10000
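The deepspeed entry points at a ZeRO stage 2 configuration bundled with the LLaMA-Factory repository; ZeRO-2 shards optimizer states and gradients across GPUs to lower per-GPU memory usage. You can review the bundled file before training, assuming you are inside the LLaMA-Factory directory:

# Inspect the ZeRO stage 2 settings referenced by the training YAML
cat examples/deepspeed/ds_z2_config.json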
You can reference the model by its Hugging Face URI directly with:
model_name_or_path: meta-llama/Llama-3.1-8B
and LLaMA-Factory will automatically download the model before training. If you have already downloaded the model to the machine, you can point to the local path instead:
model_name_or_path: path/to/your/model
To train on the Open Instruct Uncensored Alpaca dataset, reference it by the name defined in dataset_info.json:
dataset: uncensored_alpaca
We cap the number of training samples at 500,000 with:
max_samples: 500000
You can adjust all other options as needed.
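For reference, with 4 nodes x 8 GPUs and the settings above, the effective global batch size is per_device_train_batch_size x gradient_accumulation_steps x 32 GPUs = 4 x 2 x 32 = 256 sequences per optimizer step. Before committing to the full 4-node run, a short single-node dry run can confirm that the model and dataset download correctly and that the configuration parses. A minimal sketch, assuming you save a copy of the YAML with max_samples lowered (for example to 1000) as llama31_dryrun.yaml:

# llama31_dryrun.yaml is a hypothetical reduced copy of llama31_training.yaml
# used only to validate the configuration end to end on one node
llamafactory-cli train llama31_dryrun.yaml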
Assuming you have a YAML training configuration file named llama31_training.yaml, create a Slurm script train_llama.sbatch for training on 4 nodes with 8 GPUs per node:
#!/bin/bash
#SBATCH --job-name=multinode-training
#SBATCH --nodes=4
#SBATCH --time=2-00:00:00
#SBATCH --gres=gpu:8
#SBATCH -o training.out
#SBATCH -e training.err
#SBATCH --ntasks=4

# Resolve the head node and its IP address for the distributed rendezvous
nodes_array=($(scontrol show hostnames $SLURM_JOB_NODELIST))
head_node=${nodes_array[0]}
node_id=${SLURM_NODEID}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address | cut -d" " -f2)
echo "Master Node IP: $head_node_ip"

export LOGLEVEL=INFO
export NNODES=4
export NPROC_PER_NODE=8
export HEAD_NODE_IP=$head_node_ip
export HEAD_NODE_PORT=29401
export NODE_RANK=$node_id

export NCCL_IB_DISABLE=0
export NCCL_SOCKET_IFNAME=^lo,docker0
export NCCL_TIMEOUT=180000000
export NCCL_DEBUG=INFO
export NCCL_BLOCKING_WAIT=1         # Ensure NCCL waits for operations to finish
export NCCL_ASYNC_ERROR_HANDLING=1  # Allow handling of NCCL errors asynchronously

source venv/bin/activate
srun llamafactory-cli train llama31_training.yaml
Use sbatch to submit the training job:
sbatch train_llama.sbatch
View the job queue with squeue.
Inspect the training.out and training.err files to follow the training progress.
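A few commands are handy while the job is running; a minimal sketch assuming standard Slurm tooling (replace <jobid> with the ID printed by sbatch):

# Show your jobs and their current states
squeue -u $USER

# Follow the training log as it is written
tail -f training.out

# Show detailed information about the running job
scontrol show job <jobid>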
While the job is running, you should see all 4 nodes fully utilized.
When training completes, the final checkpoints and the loss curve are written to the output directory (saves/llama3.1-8b/full/sft).
When training large-scale models like LLaMA 3.1 8B on Metal Cloud, it is important to monitor system resources such as CPU, GPU, memory, and disk usage. Proper monitoring helps you spot bottlenecks, catch underutilized hardware, and keep long-running jobs stable.
Bare Metal provides a monitoring page where users can track real-time hardware usage, including CPU, GPU, memory, and disk metrics.
Users can access the monitoring page via their Metal Cloud dashboard to ensure efficient and stable training.
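In addition to the dashboard, you can check utilization directly from the command line on any compute node; a minimal sketch using standard NVIDIA tooling:

# Refresh GPU utilization, memory usage, and running processes every second
watch -n 1 nvidia-smi

# Or stream per-GPU utilization and memory metrics in a compact form
nvidia-smi dmon -s um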
This guide provided a structured approach to setting up distributed training with LLaMA-Factory on Metal Cloud, covering environment setup, dataset and model configuration, Slurm-based job scheduling, and training monitoring.
Following these steps, you can fine-tune LLaMA models efficiently on Metal Cloud multi-node GPU clusters.
Metal Cloud is now available on FPT AI Factory for reservation. Find out more at: https://aifactory.fptcloud.com/
For more information and consultancy, please contact: