Use Cases for Training Large Language Models (LLMs) with Slurm on Metal Cloud
Large Language Models (LLMs) are pushing the boundaries of artificial intelligence, enabling human-like text generation and the understanding of complex concepts. However, training these powerful models requires immense computational resources. This document explores distributed training, showing how to leverage multiple GPUs efficiently to train LLMs using Slurm on Metal Cloud.
This document presents a proof of concept (PoC) for developing and training Large Language Models (LLMs) using open-source tools. The setup is designed to adapt easily to various frameworks that support distributed training and aims to streamline the debugging process.
Large Language Models (LLMs) have significantly advanced artificial intelligence, particularly in the field of natural language processing. Models such as GPT-2, GPT-3, and LLaMA 2 can understand and generate human-like text with impressive accuracy.
Training LLMs is a highly resource-intensive task. Distributed training on GPU clusters, such as those built on NVIDIA H100 GPUs, has become essential for accelerating training and handling large datasets efficiently.
Although training LLMs on a single node is technically feasible, several limitations make this approach impractical: a single node's GPU memory cannot hold the parameters, gradients, and optimizer states of the largest models; training times stretch from weeks to months; and there is no room to scale as models and datasets grow.
These challenges are effectively addressed by a multi-node (cluster) training setup, which distributes computational workloads across multiple GPUs, accelerating training while ensuring scalability. This approach enables larger effective batch sizes, data and model parallelism across nodes, and shorter time to convergence.
By leveraging a multi-node Slurm cluster, organizations and researchers can efficiently train LLMs while overcoming the constraints of single-node training.
As AI projects continue to grow in complexity and scale, the demand for high-performance computing (HPC) environments is increasing rapidly. This expansion requires efficient resource management—a challenge that SLURM (Simple Linux Utility for Resource Management) is designed to address effectively.
SLURM acts as the central nervous system of an HPC environment, enabling AI engineers to maximize computing cluster performance and tackle the most demanding AI workloads. It ensures fair allocation of compute resources, orderly job queuing and prioritization, and high overall cluster utilization.
By leveraging SLURM, AI researchers and engineers can harness the full power of distributed computing, enabling faster and more efficient training of Large Language Models (LLMs) and other complex AI applications.
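In day-to-day use, a handful of commands cover most interaction with SLURM. As a quick orientation (the job script name and job ID below are placeholders):

# Show partitions, node counts, and node availability
sinfo

# List jobs that are queued or running
squeue

# Submit a batch script, then query its accounting record by job ID
sbatch my_job.slurm
sacct -j <jobid>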
SLURM (Simple Linux Utility for Resource Management) is a widely used job scheduler for High-Performance Computing (HPC), while Kubernetes (K8s) is the leading container orchestration platform for managing distributed workloads. Combining SLURM with Kubernetes offers several advantages.
Deploying SLURM on Kubernetes combines the strengths of HPC job scheduling and cloud-native orchestration, providing scalability, flexibility, and cost efficiency. This approach is ideal for AI/ML training, large-scale simulations, and enterprise-level scientific computing.
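As a purely illustrative sketch, a Helm-based deployment could look like the following. The chart repository URL, chart name, and values are hypothetical placeholders, not a specific product's published chart; substitute whichever Slurm-on-Kubernetes chart your environment provides:

# Hypothetical example: repository URL and chart name are placeholders
helm repo add slurm-charts https://example.com/slurm-charts
helm install my-slurm slurm-charts/slurm-cluster \
    --namespace slurm --create-namespace \
    --set controller.replicas=1 \
    --set worker.replicas=4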
Each node in the setup is equipped with the following specifications:
a. Upload Data & Model to High Performance Storage (HPS)
Objective:
The first step is to prepare and upload the training data and model to High Performance Storage (HPS), ensuring easy access from the compute nodes.
Steps:
Additional training guidelines and example configurations can be found in the official documentation:
These resources provide detailed instructions on configuring and fine-tuning LLMs, ensuring an efficient training process.
The model and dataset are stored in the High-Performance Storage (HPS) under the following paths:
For this experiment, we use the Qwen/Qwen2.5-72B-Instruct model from Hugging Face. The model is stored in:
/mnt/data-hps/models/Qwen2.5-72B-Instruct/
For the dataset, we use a supervised fine-tuning (SFT) dataset named stem_sft, which is stored at:
/mnt/data-hps/data/stem_sft.json
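For reference, the model weights can be pulled from Hugging Face straight into HPS. A minimal sketch using huggingface-cli (verify the flags against your installed huggingface_hub version):

# Download the model from Hugging Face into HPS
# (requires: pip install -U "huggingface_hub[cli]")
huggingface-cli download Qwen/Qwen2.5-72B-Instruct \
    --local-dir /mnt/data-hps/models/Qwen2.5-72B-Instruct

# Copy the SFT dataset into HPS
cp stem_sft.json /mnt/data-hps/data/stem_sft.json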
To train the model using LLaMA Factory, we must register this dataset by defining its metadata in the dataset_info.json file (found in LLaMA Factory's data/ directory). The file should be structured as follows:
{ "stem_aug_sft_h_fm_botest": { "file_name": "stem_aug_sft_h_fm_botest.json", "formatting": "sharegpt", "columns": { "messages": "messages" }, "tags": { "role_tag": "role", "content_tag": "content", "user_tag": "user", "assistant_tag": "assistant", "system_tag": "system" } } }
b. Set Up Slurm Cluster (Infrastructure & Configuration)
Objective:
Set up the Slurm cluster to manage resources and schedule tasks for model training.
Steps:
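Once the cluster is configured, its health can be verified with standard Slurm commands, for example (node name is a placeholder):

# List every node with its state and resources
sinfo -N -l

# Inspect one node's configuration, including its GPU gres entries
scontrol show node <node-name>

# Confirm GPUs are actually schedulable by running nvidia-smi in an allocation
srun --gres=gpu:1 --pty nvidia-smi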
c. Create Training Configuration & Training Script (LLaMA Factory)
Objective:
Define the training parameters and write the training script using frameworks like LLaMA Factory.
Steps:
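As a sketch of what this step could produce, the snippet below writes a minimal full-parameter SFT configuration and launches it through LLaMA Factory's CLI. The key names follow LLaMA Factory's example YAML configs, but exact options vary by version, and the hyperparameters here are illustrative rather than tuned:

# Write a minimal SFT config (key names follow LLaMA Factory's examples;
# verify against the version you have installed)
cat > qwen2_sft.yaml <<'EOF'
model_name_or_path: /mnt/data-hps/models/Qwen2.5-72B-Instruct
stage: sft
do_train: true
finetuning_type: full
dataset: stem_sft
template: qwen
cutoff_len: 4096
output_dir: /mnt/data-hps/output/qwen2.5-72b-sft
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-5
num_train_epochs: 3
bf16: true
EOF

# Launch training via the LLaMA Factory CLI
llamafactory-cli train qwen2_sft.yaml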
d. Create Slurm Job File (Resource Allocation & Script)
Objective:
Prepare the Slurm job file to specify resource requirements and job configuration for training.
Steps:
#!/bin/bash
#SBATCH --job-name=train_llm
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --time=48:00:00
#SBATCH --mem=64GB

# Load the CUDA toolkit, then start training
module load cuda/11.2
python train.py --config config.json
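Since the broader goal is multi-node training, a multi-node variant of the job file might look like the sketch below, assuming training is launched with torchrun (node counts, GPU counts, port, and script names are illustrative):

#!/bin/bash
#SBATCH --job-name=train_llm_multinode
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:8
#SBATCH --time=48:00:00

# Use the first allocated node as the rendezvous endpoint
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
MASTER_PORT=29500

# srun starts one torchrun launcher per node; each spawns 8 local workers
srun torchrun \
    --nnodes="$SLURM_NNODES" \
    --nproc_per_node=8 \
    --rdzv_id="$SLURM_JOB_ID" \
    --rdzv_backend=c10d \
    --rdzv_endpoint="${MASTER_ADDR}:${MASTER_PORT}" \
    train.py --config config.json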
e. Submit the Slurm Job (Run the Script to Start the Job)
Objective:
Submit the job to the Slurm job scheduler and begin the training process.
Steps:
sbatch train_llm.slurm 
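After submission, the job can be tracked with standard Slurm commands (job ID is a placeholder):

# Show your queued and running jobs
squeue -u $USER

# Detailed state, allocated nodes, and reason codes for one job
scontrol show job <jobid>

# Cancel the job if something goes wrong
scancel <jobid>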
f. Monitor Metrics (GPU Usage, Logs, etc.)
Objective:
Monitor the job's performance during training, especially resource usage such as GPU utilization, along with the training logs.
Steps:
Check GPU Usage: run the following on the compute node:
nvidia-smi
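Note that nvidia-smi reports only on the machine where it runs. To watch the GPUs inside a running job's allocation from the login node, one option (assuming Slurm 20.11 or newer, which introduced the --overlap flag) is:

# Attach to the running job's allocation and refresh GPU stats every 2 seconds
srun --jobid=<jobid> --overlap watch -n 2 nvidia-smi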
Monitor Job Logs: view logs and metrics by tailing the job's output file.
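Assuming Slurm's default output naming slurm-<jobid>.out (configurable via #SBATCH --output), the log can be followed in real time:

# Stream the job's standard output as it is written
tail -f slurm-<jobid>.out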
g. Retrieve the Trained Model from the Output Path (HPS)
Objective:
After training is complete, retrieve the trained model from High Performance Storage (HPS).
Steps:
scp user@server:/path/to/output_model/model_checkpoint.pth .
3. Execution Results
3.1 Pre-training stage
3.2 Post-training stage (SFT)
Training Large Language Models (LLMs) is a computationally demanding process that requires efficient resource management and scalable computing infrastructure. By leveraging Slurm on Metal Cloud, organizations and researchers can take full advantage of distributed training, enabling faster model convergence, optimized GPU utilization, and seamless workload orchestration.
Throughout this document, we have explored the key steps in training LLMs, from uploading models and datasets to the High-Performance Storage (HPS) to configuring Slurm clusters, submitting jobs, monitoring GPU usage, and retrieving the trained models. The integration of Kubernetes with Slurm further enhances scalability, flexibility, and cost efficiency, making it an ideal solution for handling large-scale AI workloads.
By implementing this approach, AI engineers can overcome the limitations of single-node training, efficiently manage multi-node clusters, and accelerate the development of state-of-the-art language models. As AI continues to evolve, leveraging Slurm on cloud-based HPC platforms will play a crucial role in advancing large-scale deep learning and natural language processing.
For more information and consultancy about FPT AI Factory, please contact: