
FPT Announces Strategic Partnership and Investment with Sumitomo and SBI Holdings

13:54 22/04/2025
Hanoi, April 22, 2025 – FPT Corporation announced a strategic partnership with Sumitomo Corporation and SBI Holdings, two of Japan’s leading conglomerates in the finance and industrial sectors, to accelerate artificial intelligence (AI) adoption through the FPT AI Factory ecosystem and contribute to the advancement of sovereign AI in Japan. Under this partnership, Sumitomo Corporation and SBI Holdings will each take a 20% stake in FPT Smart Cloud Japan, a subsidiary of FPT Corporation.

This partnership lays a critical foundation for delivering cutting-edge AI solutions to organizations and enterprises in Japan, expediting AI integration across all aspects of society and supporting the nation’s ambition to become a global AI leader. By combining FPT’s technological capabilities with the extensive networks and industry expertise of Sumitomo Corporation and SBI Holdings, the three parties are committed to scaling the AI and Cloud business in Japan. Together, they aim to build a diversified product and service ecosystem that meets the unique and increasingly complex demands of the Japanese market.

FPT Chairman Mr. Truong Gia Binh, SBI Holdings Representative Director, Chairman, President & CEO Mr. Yoshitaka Kitao, and Sumitomo Corporation Director, Vice Chairman Mr. Toshikazu NAMBU signed the joint venture investment agreement in Japan.

Mr. Truong Gia Binh, Founder and Chairman of FPT Corporation, emphasized: “Sharing a common vision for the transformative potential of AI, we are working closely with our strategic partners to expand the global application of AI technologies. This partnership also contributes to fostering innovation, strengthening organizational competitiveness, and maintaining technology autonomy, supporting Japan’s goal of becoming an AI nation.”

With the core philosophy of “Build Your Own AI,” FPT AI Factory aims to make AI more accessible and easily deployable for every business, organization, and individual. Leveraging FPT’s robust AI infrastructure - powered by thousands of latest-generation GPUs, pre-packaged models, and deployment frameworks - alongside a comprehensive service ecosystem and the proven experience of FPT and its investors in the Japanese market, FPT AI Factory enables organizations to harness their proprietary data, knowledge, and identity. This empowers them to rapidly develop tailored AI applications, unlock breakthrough performance, and create sustainable competitive advantages.

About SBI Group

Established in 1999, SBI Group is one of Japan’s pioneers in online financial services. The Group operates a wide range of financial businesses, offering user-friendly products and services via the Internet, primarily in the areas of securities, banking, and insurance. In addition to its core operations, SBI is also active in asset management and various global investment ventures.

About Sumitomo Corporation

Sumitomo Corporation (TYO: 8053) is an integrated trading and business investment company with a strong global network comprising 125 offices in 64 countries and regions. The Sumitomo Corporation Group consists of approximately 900 companies and 80,000 employees on a consolidated basis. The Group’s business activities span nine groups: Steel; Automotive; Transportation & Construction Systems; Diverse Urban Development; Media & Digital; Lifestyle Business; Mineral Resources; Chemicals Solutions; and Energy Transformation Business.
Sumitomo Corporation is committed to creating greater value for society under the corporate message of "Enriching lives and the world," based on Sumitomo’s business philosophy passed down for over 400 years.

About FPT Corporation

FPT Corporation (FPT) is a leading global technology and IT services provider headquartered in Vietnam, operating in three core sectors: Technology, Telecommunications, and Education. With AI as a key focus, FPT has been integrating AI across its products and solutions to drive innovation and enhance user experiences within its Made by FPT ecosystem. FPT is expanding its AI capabilities through investments in human resources, R&D, and partnerships with leading organizations such as NVIDIA, Mila, AITOMATIC, and Landing AI. These efforts are aligned with FPT’s goal of reaching 5 billion USD in IT services revenue from global markets by 2030 and solidifying its status among the world’s top billion-dollar IT companies.

After nearly two decades in Japan, FPT has become one of the largest foreign-invested technology firms in the country by headcount. The company delivers services and solutions to over 450 clients globally, with over 4,000 employees across 17 local offices and innovation hubs in Japan and nearly 15,000 professionals supporting this market worldwide.

With Japan as a strategic focus of its global growth, FPT has been actively expanding its business and engaging in M&A deals, such as the joint venture with Konica Minolta, the strategic investment in LTS Inc., and most recently the acquisition of NAC, its first M&A deal in the market. With digital transformation, particularly legacy system modernization, viewed as a key growth driver in the Japanese market, the company is committed to providing end-to-end solutions and seamless services, using advanced AI technologies as a primary accelerator. For more information, please visit https://fpt.com/en.

Use Cases for Training Large Language Models (LLMs) with Slurm on Metal Cloud

14:38 21/04/2025
I. Introduction

Large Language Models (LLMs) are pushing the boundaries of artificial intelligence, enabling human-like text generation and the understanding of complex concepts. However, training these powerful models requires immense computational resources. This document explores distributed training, empowering you to leverage multiple GPUs efficiently to train LLMs using Slurm on Metal Cloud.

1. Purpose

This document presents a proof of concept (PoC) for developing and training Large Language Models (LLMs) using open-source tools. The setup is designed to adapt easily to various frameworks that support distributed training and aims to streamline the debugging process.

2. Context: Why Does Training LLMs Require a Multi-Node (Cluster) Setup?

Large Language Models (LLMs) have significantly advanced artificial intelligence, particularly in the field of natural language processing. Recent models such as GPT-2, GPT-3, and LLaMA2 can understand and generate human-like text with impressive accuracy.

Training LLMs is a highly resource-intensive task that requires substantial hardware. Distributed training on GPU clusters, such as NVIDIA H100, has become essential for accelerating the training process and efficiently handling large datasets.

Although training LLMs on a single node is technically feasible, several limitations make this approach impractical:

Extended Training Time: Training on a single node significantly increases the duration of each training cycle, making it inefficient for large-scale models.

Hardware Limitations: Single-node systems often lack the memory and processing power necessary to handle extremely large models. For instance, models exceeding 70 billion parameters or datasets with over 37,000 samples may exceed the available GPU memory and storage capacity of a single machine.

Scalability Issues: As model size and dataset complexity increase, single-node training struggles to utilize resources efficiently, leading to bottlenecks and suboptimal performance.

These challenges are effectively addressed by a multi-node (cluster) training setup, which distributes computational workloads across multiple GPUs and accelerates training while ensuring scalability. This approach enables:

Parallel Processing: Distributing model training across multiple nodes reduces processing time and optimizes resource utilization.

Handling Large Models & Datasets: Multi-node setups can accommodate LLMs with billions of parameters by splitting the workload across multiple GPUs and nodes.

Improved Fault Tolerance & Flexibility: Cluster computing provides redundancy and enables better handling of system failures, ensuring training stability.

By leveraging a multi-node Slurm cluster, organizations and researchers can efficiently train LLMs while overcoming the constraints of single-node training.

3. SLURM - The Backbone of High-Performance Computing for AI

As AI projects continue to grow in complexity and scale, the demand for high-performance computing (HPC) environments is increasing rapidly. This expansion requires efficient resource management - a challenge that SLURM (Simple Linux Utility for Resource Management) is designed to address.

SLURM acts as the central nervous system of an HPC environment, enabling AI engineers to maximize computing cluster performance and tackle the most demanding AI workloads. It ensures:

Optimized Task Distribution: Workloads are efficiently allocated across computing nodes to maintain performance balance.
Intelligent Resource Management: Critical resources such as CPU cores, memory, and specialized hardware like GPUs are dynamically assigned to maximize efficiency.

Scalability & Adaptability: SLURM reallocates resources as needed, ensuring smooth scalability and efficient workload execution.

By leveraging SLURM, AI researchers and engineers can harness the full power of distributed computing, enabling faster and more efficient training of Large Language Models (LLMs) and other complex AI applications.

4. Why Deploy SLURM on Kubernetes?

SLURM (Simple Linux Utility for Resource Management) is a widely used job scheduler for High-Performance Computing (HPC), while Kubernetes (K8s) is the leading container orchestration platform for managing distributed workloads. Combining SLURM with Kubernetes offers several advantages:

Enhanced Scalability & Dynamic Resource Allocation: Kubernetes enables auto-scaling of compute resources based on workload demand, dynamically provisioning or deallocating nodes as needed. Unlike traditional SLURM clusters, which are often static, running SLURM on K8s allows for on-demand scaling, optimizing resource utilization.

Improved Containerized Workflows & Portability: AI/ML and HPC workloads increasingly rely on containerized environments (e.g., Docker, Singularity). Kubernetes provides native support for containers, making it easier to package and deploy SLURM workloads across multi-cloud and hybrid environments.

Efficient Multi-Tenancy & Isolation: Kubernetes supports namespace-based isolation, enabling multiple teams to run SLURM jobs securely on shared infrastructure. Resource quotas and limits in K8s help ensure fair allocation of CPU, GPU, and memory among different workloads.

Integration with the Cloud-Native Ecosystem: Running SLURM on K8s allows integration with cloud-native tools like Prometheus (monitoring), Grafana (visualization), and Argo Workflows (pipeline automation). This enables a modern, observability-driven approach to HPC workload management.

Cost Optimization for Cloud-Based HPC: Traditional SLURM clusters often require dedicated hardware, leading to underutilization when workloads are low. With Kubernetes, organizations can dynamically spin up and terminate cloud-based nodes, reducing unnecessary costs while ensuring peak performance during intensive computational workloads.

Deploying SLURM on Kubernetes combines the strengths of HPC job scheduling and cloud-native orchestration, providing scalability, flexibility, and cost efficiency. This approach is ideal for AI/ML training, large-scale simulations, and enterprise-level scientific computing.

II. Implementation

1. Specifications

Each node in the setup is equipped with the following specifications:

GPUs: 8 NVIDIA H100 GPUs, each with 80GB HBM3 memory and 700W power consumption

OS: Ubuntu 22.04 LTS

Driver Version: 550.44.15+

CUDA Version: 12.2.1+

Docker Version: 26.1.2+

NVIDIA Container Toolkit: 1.15.0-1+

Built with Docker using the NVIDIA toolkit

2. Steps for Training Large Language Models (LLMs)

a. Upload Data & Model to High Performance Storage (HPS)

Objective:

The first step is to prepare and upload the training data and models to High Performance Storage (HPS), ensuring easy access from the compute nodes.
Steps:

Training Library Reference:

For training the model, we use the LLaMA-Factory library, which is available at the following repository:

GitHub Repository: LLaMA-Factory

Additional training guidelines and example configurations can be found in the official documentation:

Examples & Tutorials: LLaMA-Factory Examples

These resources provide detailed instructions on configuring and fine-tuning LLMs, ensuring an efficient training process.

Storing the Model and Data in HPS for Training

The model and dataset are stored in High-Performance Storage (HPS) under the following paths:

Model Storage Path: /mnt/data-hps/models/

Dataset Storage Path: /mnt/data-hps/data/

For this experiment, we use the Qwen/Qwen2.5-72B-Instruct model from Hugging Face (Qwen2.5-72B-Instruct). The model is stored in:

[code lang="js"] /mnt/data-hps/models/Qwen2.5-72B-Instruct/ [/code]

For the dataset, we use an SFT dataset named stem_sft, which is stored as:

[code lang="js"] /mnt/data-hps/data/stem_sft.json [/code]

Registering the Dataset with LLaMA-Factory

To train the model with LLaMA-Factory, we must register this dataset by defining its metadata in the dataset_info.json file. The file should be structured as follows:

[code lang="js"]
{
  "stem_sft": {
    "file_name": "stem_sft.json",
    "formatting": "sharegpt",
    "columns": {
      "messages": "messages"
    },
    "tags": {
      "role_tag": "role",
      "content_tag": "content",
      "user_tag": "user",
      "assistant_tag": "assistant",
      "system_tag": "system"
    }
  }
}
[/code]

b. Set Up the Slurm Cluster (Infrastructure & Configuration)

Objective:

Set up the Slurm cluster to manage resources and schedule tasks for model training.

Steps:

Install and Configure the Slurm Controller (slurmctld):

Install the Slurm controller on a central server to manage the compute nodes.

Edit slurm.conf to define node resources (e.g., CPU, GPU) and configure job parameters.

Set Up the Slurm Daemon (slurmd) on Compute Nodes:

Install and configure the Slurm daemon on each compute node.

Ensure communication between the compute nodes and the Slurm controller for job distribution.

c. Create the Training Configuration & Training Script (LLaMA-Factory)

Objective:

Define the training parameters and write the training script using a framework such as LLaMA-Factory.

Steps:

Create the Training Configuration File:

Define hyperparameters such as learning rate, batch size, and number of epochs in a configuration file (e.g., config.json or train_config.yaml); a sketch of such a file follows this subsection.

Write the Training Script:

Develop the training script (train.py) using frameworks such as PyTorch or TensorFlow. The script includes the model definition, loss functions, optimizers, and training logic.

Integrate LLaMA-Factory:

Use LLaMA-Factory to streamline model configuration and training, optimizing the process for LLMs.
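For reference, a minimal LLaMA-Factory-style training configuration might look like the sketch below. This is an illustrative example only: the exact keys and values (model path, dataset name, learning rate, DeepSpeed config, and so on) depend on your LLaMA-Factory version and the experiment described above, and should be checked against the examples shipped with the library.

[code lang="js"]
### train_config.yaml - illustrative sketch, not the exact configuration used above
### model
model_name_or_path: /mnt/data-hps/models/Qwen2.5-72B-Instruct/

### method
stage: sft
do_train: true
finetuning_type: full

### dataset
dataset: stem_sft
template: qwen
cutoff_len: 2560

### output
output_dir: /mnt/data-hps/outputs/qwen2.5-72b-stem-sft

### train
per_device_train_batch_size: 4
gradient_accumulation_steps: 2
learning_rate: 1.0e-5
num_train_epochs: 5.0
bf16: true
[/code]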
d. Create the Slurm Job File (Resource Allocation & Script)

Objective:

Prepare the Slurm job file to specify the resource requirements and job configuration for training.

Steps:

Create the Slurm Job Script:

Write a Slurm job file (train_llm.slurm) to define resource requirements (e.g., CPU, GPU, memory) and specify the commands to run the training script.

Example train_llm.slurm file:

[code lang="js"]
#!/bin/bash
#SBATCH --job-name=train_llm
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --time=48:00:00
#SBATCH --mem=64GB

module load cuda/11.2
python train.py --config config.json
[/code]

Define Resource Requirements: Specify the necessary resources (e.g., GPU, CPU, RAM) based on the model's training demands.

e. Submit the Slurm Job (Run the Script to Start the Job)

Objective:

Submit the job to the Slurm job scheduler and begin the training process.

Steps:

Submit the Job: Use the sbatch command to send the job to Slurm for execution:

[code lang="js"]
sbatch train_llm.slurm
[/code]

Check Job Status: Use the squeue command to monitor the status of the job and confirm it is running as expected.

f. Monitor Metrics (GPU Usage, Logs, etc.)

Objective:

Monitor the performance of the job during training, especially resource usage such as GPU utilization and logs.

Steps:

Track GPU Usage: Use the nvidia-smi command to check GPU utilization:

[code lang="js"]
nvidia-smi
[/code]

Monitor Job Logs: View logs and metrics using:

scontrol show job <job_id> for job details.

tail -f slurm-<job_id>.out for real-time log monitoring.

g. Retrieve the Trained Model from the Output Path (HPS)

Objective:

After training is complete, retrieve the trained model from High Performance Storage (HPS).

Steps:

Identify the Output Path: Check the Slurm job script or config file to locate the output path where the trained model is saved.

Download the Trained Model: Use scp, rsync, or an API to fetch the model from HPS:

[code lang="js"]
scp user@server:/path/to/output_model/model_checkpoint.pth .
[/code]

Verify the Model: After downloading, verify the model's integrity and performance.

3. Some Execution Results

3.1 Pre-training stage

Data size: 48.74 GB, context length: 4096, model size: 32B, epochs: 1

1 node:

bs/d = 1: 31 days 7:59:33 ~ 31.3 days

32 nodes:

bs/d = 1: 70h ~ 2.9 days

bs/d = 4: 31h ~ 1.3 days

bs/d = 8: OOM (Out of Memory)

3.2 Post-training stage (SFT)

Data size: 37.66 MB ~ 37,123 samples, context length: 2560, model size: 72B, epochs: 5

1 node:

bs/d = 4: 5h22m

32 nodes:

bs/d = 4: 22m

III. Conclusion

Training Large Language Models (LLMs) is a computationally demanding process that requires efficient resource management and scalable computing infrastructure. By leveraging Slurm on Metal Cloud, organizations and researchers can take full advantage of distributed training, enabling faster model convergence, optimized GPU utilization, and seamless workload orchestration.

Throughout this document, we have explored the key steps in training LLMs: uploading models and datasets to High-Performance Storage (HPS), configuring the Slurm cluster, submitting jobs, monitoring GPU usage, and retrieving the trained models. The integration of Kubernetes with Slurm further enhances scalability, flexibility, and cost efficiency, making it an ideal solution for handling large-scale AI workloads.

By implementing this approach, AI engineers can overcome the limitations of single-node training, efficiently manage multi-node clusters, and accelerate the development of state-of-the-art language models. As AI continues to evolve, leveraging Slurm on cloud-based HPC platforms will play a crucial role in advancing large-scale deep learning and natural language processing.
For more information and consultancy about FPT AI Factory, please contact:   Hotline: 1900 638 399   Email: [email protected]   Support: m.me/fptsmartcloud    

Vision-Language Models (VLM) Use Cases for Insurance Company on NVIDIA H100 GPUs

21:19 11/04/2025
As the demand for more intelligent and context-aware AI grows, Vision-Language Models (VLMs) have emerged as a powerful class of models capable of understanding both images and text. These models power applications such as AI assistants, medical document analysis, and automated insurance claim processing.

This article provides practical, experience-based best practices for training large VLMs using Metal Cloud, a scalable, high-performance AI infrastructure powered by NVIDIA H100 GPUs. Whether you're an AI engineer, data scientist, or IT decision-maker looking to scale multimodal AI systems efficiently, this guide walks you through the architectural choices, training pipelines, and optimization strategies proven to deliver real-world impact.

1. Real-World Applications and Deployment Outcomes

VLMs are transforming multiple industries:

Document Understanding & Intelligent Document Processing (IDP): Extracting insights from unstructured formats and images.

Medical & Insurance Analysis: Automating claims processing, including data entry and the adjustment process, detecting fraudulent claims, and summarizing medical documents.

Figure: Example medical documents.

AI-Powered Assistants: Enabling AI chatbots with multimodal reasoning and contextual awareness.

Business Impact of PDF Data Extraction:

Reduced manual data entry time from 15 minutes to under 2 minutes.

Faster adaptation to new datasets, reducing training duration from months to weeks.

Scaled processing capacity without increasing reliance on human resources.

Enhanced fraud detection capabilities through AI-driven analysis.

Based on the NVIDIA VSS Blueprint architecture, FPT AI Factory has applied VLMs to automated vehicle insurance claim video processing, using the following architecture:

Figure. High-level architecture of the summarization vision AI agent

Business impact of accessing car information and damage assessments:

Automated Damage Evaluation: Use a VLM to analyze claim descriptions and video for automated damage assessment. The model categorizes cases into severe damage, minor damage, or no damage, directing them to the appropriate processing streams and experts. This approach enables automation of up to 80% of minor damage claims, reducing claim processing time from 20 minutes to just 2 minutes.

Enhancing Claims Processing Efficiency: Minimize human intervention and expedite claim settlements through AI-powered assessments.

Detecting and Preventing Fraud: Identify anomalies and inconsistencies in claim reports to mitigate fraud.

Optimizing Operational Costs: Reduce expenses associated with manual inspections and assessment processes.

Figure: Example of car damage assessment with a VLM.

ROI of H100 Over A100:

Higher initial cost, but lower total expenditure due to efficiency.

Shorter training cycles, leading to faster model deployment.

Estimated 43% reduction in overall training cost compared to A100.

2. VLM Architecture, Data Processing Pipeline, and Hardware Requirements

2.1 VLM Architecture

A standard VLM consists of three key components (a minimal sketch of how they fit together follows):

Vision Encoder: A CNN or transformer-based model such as ViT, the CLIP vision encoder, or Swin Transformer that extracts image features.

Language Decoder: An LLM such as GPT, LLaMA, or Qwen that generates textual outputs conditioned on visual prompts.

Multimodal Fusion Module: Integrates image and text embeddings for cohesive output generation.
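To make the division of labor concrete, the toy PyTorch sketch below wires the three components together. It is purely illustrative: the class names, dimensions, and the single linear projection used as the fusion module are our own simplifications, not the architecture of any specific production model.

[code lang="js"]
import torch
import torch.nn as nn

class ToyVLM(nn.Module):
    """Minimal vision-language model: encode the image, project it into the
    text embedding space, prepend it as a prefix, and decode text."""

    def __init__(self, img_dim=768, txt_vocab=32000, d_model=512):
        super().__init__()
        # Vision encoder stand-in: in practice a ViT/CLIP/Swin backbone.
        self.vision_encoder = nn.Sequential(nn.Linear(img_dim, d_model), nn.GELU())
        # Multimodal fusion: project visual features into the language model's space.
        self.fusion = nn.Linear(d_model, d_model)
        # Language-model stand-in: a tiny transformer stack. A real VLM would use
        # a causal, pretrained LLM decoder (GPT, LLaMA, Qwen) here.
        self.token_emb = nn.Embedding(txt_vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, txt_vocab)

    def forward(self, image_feats, input_ids):
        # image_feats: (batch, num_patches, img_dim); input_ids: (batch, seq_len)
        vis = self.fusion(self.vision_encoder(image_feats))   # visual tokens
        txt = self.token_emb(input_ids)                        # text tokens
        hidden = self.transformer(torch.cat([vis, txt], dim=1))
        return self.lm_head(hidden[:, vis.size(1):])           # logits over text positions

model = ToyVLM()
logits = model(torch.randn(2, 16, 768), torch.randint(0, 32000, (2, 32)))
print(logits.shape)  # torch.Size([2, 32, 32000])
[/code]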
2.2 Data Processing Pipeline

The pipeline for processing image and text data in VLM training follows these key steps:

Training Phase:

Image data is passed through the Vision Encoder, extracting relevant visual features.

Text data is processed by a Text Embedder, converting it into vector representations.

Both vision and text embeddings are then fused and passed into the Language Model with self-attention layers, enabling multimodal learning.

Testing Phase:

Zero-shot Visual Question Answering (VQA): The trained model can answer questions about new images it has never encountered before.

Few-shot Image Classification: By leveraging learned embeddings, the model can classify new images with minimal labeled examples.

2.3 NVIDIA Software Stack in Use

Training:

We use NVIDIA NeMo for our fine-tuning task. NVIDIA NeMo is an open-source framework designed for training and fine-tuning large-scale AI models, including vision-language models (VLMs), speech models, and NLP models.

The NVIDIA NeMo framework provides many utilities:

Pretrained Foundation Models: Optimized foundation models that can be fine-tuned for specific applications.

Model Parallelism: Tensor, pipeline, and sequence parallelism, enabling the training of extremely large models across multiple GPUs or nodes.

LoRA and QLoRA Support: Parameter-efficient fine-tuning methods that reduce compute and memory costs while maintaining accuracy.

Integration with NVIDIA HGX Cloud: Seamless cloud-based training on clusters powered by H100 GPUs.

Performance gains with NVIDIA NeMo on H100 GPUs:

✅ 2-3x faster training with FP8 precision and optimized kernels
✅ 50% lower memory usage with mixed precision and memory-efficient optimizers
✅ Seamless multi-GPU scaling with tensor & pipeline parallelism

By leveraging NVIDIA NeMo on H100 GPUs, we can fine-tune VLMs efficiently at scale, reducing both compute cost and time to deployment.

Inferencing:

To maximize VLM performance under low-latency, high-throughput requirements, we use TensorRT-LLM as the inference optimizer. With TensorRT-LLM, we achieve significantly lower overall latency as well as lower time to first token (TTFT). TensorRT-LLM also supports a wide range of quantization schemes, including INT8, SmoothQuant, FP8, INT4, GPTQ, and AWQ.

2.4 Hardware Considerations

For effective training, the key hardware factors include (a quick back-of-the-envelope memory example follows):

Batch Size and Sequence Length: Optimized for maximum GPU utilization without memory bottlenecks.

Memory Management: Leveraging the H100's high-bandwidth memory for efficient data processing.

Parallelization Strategies: Using tensor parallelism, pipeline parallelism, and distributed training techniques to optimize large-scale models.
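As a rough illustration of why memory management and parallelism matter, the snippet below estimates the weight memory of a model at different precisions. It ignores activations, optimizer state, and KV cache, which add substantially more in practice; the figures are generic arithmetic, not measurements from the experiments above.

[code lang="js"]
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the model weights, in GB."""
    return num_params * bytes_per_param / 1e9

for name, params in [("7B", 7e9), ("72B", 72e9)]:
    for precision, nbytes in [("fp32", 4), ("bf16/fp16", 2), ("fp8/int8", 1)]:
        print(f"{name} @ {precision}: ~{weight_memory_gb(params, nbytes):.0f} GB")

# A 72B model in bf16 needs ~144 GB for weights alone, i.e. more than a single
# 80 GB H100 - hence tensor/pipeline parallelism across multiple GPUs.
[/code]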
3. Benchmarking NVIDIA H100 vs. A100 GPUs for VLM Training

While the NVIDIA H100 GPU has a higher hourly operational cost than the A100, it significantly reduces overall training expenses due to shorter training times. Case studies indicate that training on the H100 reduces costs by approximately 43% and accelerates training by a factor of 3.5x compared to the A100.

Performance comparisons highlight the H100's superior efficiency:

| Metric | 2 x H100 (HBM3-80GB) | 2 x A100 (PCIe-80GB) | Higher is Better? |
|---|---|---|---|
| Epoch Time (Qwen2.5VL-7B, batch_size=2, num_sample=200k) | ~24 h | ~84 h | No |
| Inference Throughput (Qwen2.5VL-3B, token/sec, PyTorch) | ~410 | ~150 | Yes |
| Power Consumption (100% GPU utilization, per card) | 480W | 250W | No |
| Hourly Cost | 1.5 x A100 | Lower | No |
| Total Training Cost | 0.57 x A100 | Higher | No |

4. Lessons Learned and Optimization Strategies

Resource Optimization

Maximizing GPU Utilization: Proper tuning of batch size, sequence length, and caching mechanisms.

Parallel Processing Strategies: Implementing FSDP, ZeRO, and NCCL to improve training speed.

Distributed Training Challenges

Data Synchronization: Efficient GPU communication to avoid bottlenecks.

Infrastructure Readiness: Ensuring power and cooling support for high-energy-consuming H100 clusters.

System Integration & Stability

Software Stack Compatibility: Ensuring seamless operation with PyTorch/XLA, Triton, and TensorRT.

Continuous Performance Monitoring: Regular fine-tuning to maintain optimal efficiency.

5. Future Trends in VLM Training Optimization

To address increasing model complexity and computational demands, several trends are shaping the optimization of VLM training:

Scalability and Efficiency: FP8 precision, quantization techniques, and FlashAttention optimize memory utilization, ensuring fast processing.

Advanced Training Pipelines: Techniques like ZeRO (DeepSpeed) and Fully Sharded Data Parallel (FSDP) reduce memory overhead and improve scalability.

High-Performance Multi-GPU Training: The H100's NVLink 4.0 and PCIe 5.0 enable faster inter-GPU communication, minimizing bottlenecks.

Efficient Fine-Tuning Techniques: Methods such as LoRA and QLoRA allow efficient parameter tuning while reducing computational costs.

Domain-Specific Optimization: Future VLMs will be fine-tuned for specialized domains like medical imaging, legal document processing, and technical analysis, requiring tailored datasets and optimized training strategies.

6. Conclusion & Recommendations

When to Choose H100

Training large-scale VLMs (7B+ parameters) requiring high batch sizes and long sequence lengths.

Deploying multi-GPU clusters with NVLink 4.0 for enhanced interconnect speeds.

Use cases demanding real-time inference with minimal latency.

When A100 is Sufficient

Smaller-scale GenAI models (under 4B parameters) with relaxed training time constraints.

Cost-sensitive projects where training duration is less critical.

Single-task models requiring less computational complexity.

Final Thoughts

With increasing demands for more sophisticated VLMs, optimizing hardware, algorithms, and training strategies remains essential. The NVIDIA H100 GPU stands out as the preferred choice for large-scale, high-performance VLM training, driving advancements in multimodal AI and accelerating real-world applications.

Learn more about FPT AI Factory's services HERE. For more information and consultancy about FPT AI Factory, please contact: Hotline: 1900 638 399 Email: [email protected] Support: m.me/fptsmartcloud

LLaMA Factory: A Feature-Rich Toolkit for Accessible LLM Customization

18:21 11/04/2025
As large language models (LLMs) and vision-language models (VLMs) become increasingly essential in modern AI applications, the ability to fine-tune these models on custom datasets has never been more important. However, for many developers, especially those without a deep background in machine learning, existing frameworks can be overwhelming, requiring heavy coding and complex configurations.

LLaMA Factory is an open-source toolkit designed to make LLM fine-tuning accessible to everyone. Whether you're a beginner, a non-technical professional, or an organization seeking an efficient model customization solution, LLaMA Factory simplifies the entire process with an intuitive web interface and support for dozens of fine-tuning strategies.

In this article, we'll explore what makes LLaMA Factory stand out, who can benefit from it, and how it compares with other popular frameworks.

Who should use LLaMA Factory?

LLaMA Factory is ideal for:

🧑‍💻 Beginner developers experimenting with LLMs

📊 Data analysts and researchers without ML expertise

🧠 AI enthusiasts working on personal or community projects

🏢 Small teams or startups without ML engineering bandwidth

If you want to fine-tune powerful open-source models like LLaMA or Mistral on your own dataset without writing a line of code, this tool is built for you.

What Does LLaMA Factory Offer?

LLaMA-Factory is an open-source project that provides a comprehensive set of tools and scripts for fine-tuning, serving, and benchmarking LLMs and VLMs. It offers a solution for flexibly customizing the fine-tuning of 100+ LLMs and VLMs without the need for coding, through the built-in web UI LlamaBoard.

The LLaMA-Factory repository makes it easy to get started with large models by providing:

Scripts for data preprocessing and tokenization tasks

Training pipelines for fine-tuning models

Inference scripts for generating text with trained models

Benchmarking tools to evaluate model performance

A Gradio web UI for interactive testing and training

LLaMA Factory is designed specifically for beginners and non-technical professionals who want to fine-tune open-source LLMs on their custom datasets without learning complex AI concepts. Users simply select a model, upload their dataset, and adjust a few parameters to initiate the training process.

Once training is complete, the same web application can be used to test the model before exporting it to Hugging Face or saving it locally. This provides a fast and efficient way to fine-tune LLMs in a local environment.

Figure: LLaMA Factory Architecture

Comparing Feature Support Across LLM Training Frameworks

Here's how LLaMA Factory stacks up against other popular LLM fine-tuning frameworks such as FastChat, LitGPT, and LMFlow:

| Feature | LLaMA Factory | FastChat | LitGPT | LMFlow | Open-Instruct |
|---|---|---|---|---|---|
| LoRA | ✓ | ✓ | ✓ | ✓ | ✓ |
| QLoRA | ✓ | ✓ | ✓ | ✓ | ✓ |
| DoRA | ✓ | | | | |
| LoRA+ | ✓ | | | | |
| PiSSA | ✓ | | | | |
| GaLore | ✓ | ✓ | | ✓ | ✓ |
| BAdam | ✓ | | | | |
| Flash attention | ✓ | ✓ | ✓ | ✓ | ✓ |
| S2 attention | ✓ | | | | |
| Unsloth | ✓ | | ✓ | | |
| DeepSpeed | ✓ | ✓ | ✓ | ✓ | ✓ |
| SFT | ✓ | ✓ | ✓ | ✓ | ✓ |
| RLHF | ✓ | | | ✓ | |
| DPO | ✓ | | | | ✓ |
| KTO | ✓ | | | | |
| ORPO | ✓ | | | | |

Table 1: Comparison of features in LLaMA Factory and popular LLM fine-tuning frameworks

Note: While most frameworks are built on PyTorch and have similar hardware requirements, LLaMA Factory differentiates itself through its ease of use, wide feature support, and strong community.
LLaMA Factory stands out with extensive support for multiple fine-tuning techniques, including LoRA, QLoRA, DoRA, PiSSA, and more, giving users the flexibility to optimize models based on their specific needs.

Fine-Tuning Techniques Supported

| | Freeze-tuning | GaLore | LoRA | DoRA | LoRA+ | PiSSA |
|---|---|---|---|---|---|---|
| Mixed precision | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Checkpointing | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Flash attention | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| S2 attention | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Quantization | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ |
| Unsloth | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ |

Table 2: Compatibility between the fine-tuning techniques featured in LLaMA Factory

Quick Overview of Techniques

Freeze-tuning: Freezes the majority of parameters while fine-tuning the remaining parameters in a small subset of decoder layers.

GaLore (Gradient Low-Rank Projection): Projects gradients into a lower-dimensional space, enabling full-parameter learning in a memory-efficient manner.

LoRA (Low-Rank Adaptation): Freezes all pre-trained weights and introduces a pair of trainable low-rank matrices to the designated layers.

QLoRA: Combines LoRA with quantization to reduce memory usage.

DoRA (Weight-Decomposed Low-Rank Adaptation): Breaks down pre-trained weights into magnitude and direction components and updates the directional components for enhanced performance.

LoRA+: Proposed to overcome the sub-optimality of LoRA.

PiSSA (Principal Singular Values and Singular Vectors Adaptation): Initializes adapters with the principal components of the pre-trained weights for faster convergence.

Quick Start with LLaMA-Factory

1. Installing Dependencies

The workspace and environment can be set up easily by cloning the LLaMA Factory repository:

[code lang="js"]
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
conda create --name llama-factory python=3.11
conda activate llama-factory
pip install -e ".[torch,liger-kernel,metrics]"
pip install deepspeed==0.14.4
[/code]

2. Preparing the Dataset

LLaMA Factory supports multiple data formats for various training methods (e.g., SFT, DPO, RLHF...), as well as sample datasets that help visualize the expected structure of a dataset.

All data is stored in the /data directory. Users can prepare a customized dataset and register its information in the dataset_info.json file, which is also located in the /data directory.

For example, suppose the image paths, prompt information, and model responses are stored in the file C:/User/annotations.json. You would then add a corresponding entry for this dataset to the dataset_info.json file.

3. Fine-tuning

You can fine-tune via LLaMA Factory's web UI by running the following command in the terminal:

[code lang="js"]
cd LLaMA-Factory
GRADIO_SHARE=1 llamafactory-cli webui
[/code]

The web interface will open in your browser. You can adjust the required training configurations, specify the path to the output directory, and click 'Start' to begin the training process.

Additionally, you can fine-tune via the command line by preparing a config.yaml file that includes the required training configurations. You can find examples of different training configurations in the /examples directory.

Then, run the following command to start the training process:

[code lang="js"]
llamafactory-cli train training_config.yaml
[/code]

4. Merge LoRA

In the case of LoRA training, the adapter weights need to be merged with the original model to obtain the fine-tuned model. This process can be executed via the WebUI by clicking the 'Export' tab.
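For command-line merging, you prepare a configuration file in YAML format. The example below is a hypothetical sketch; the exact keys (model_name_or_path, adapter_name_or_path, export_dir, and so on) and their values depend on your base model, your adapter output path, and your LLaMA-Factory version, so check the files under /examples before using it.

[code lang="js"]
### merge_config.yaml - hypothetical LoRA merge configuration
### model
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
adapter_name_or_path: saves/llama3-8b/lora/sft
template: llama3
finetuning_type: lora

### export
export_dir: output/llama3_lora_merged
export_size: 2
export_legacy_format: false
[/code]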
Then run the following command to merge:

[code lang="js"]
llamafactory-cli export merge_config.yaml
[/code]

5. Resource Monitoring

After setting up the environment successfully, you can check the running process using the 'nvidia-smi' command. Below is an example of training an LLM with LLaMA Factory on an H100 node.

Conclusion

LLaMA Factory is a powerful and user-friendly framework designed to lower the barrier to entry for anyone interested in customizing large language models. It offers a comprehensive suite of state-of-the-art fine-tuning techniques, intuitive UI controls, and compatibility with popular deployment tools, while eliminating the need for coding.

Whether you're an ML novice or just want a faster way to experiment with LLMs, LLaMA Factory is definitely worth checking out.

Learn more about FPT AI Factory's services HERE. For more information and consultancy about FPT AI Factory, please contact: Hotline: 1900 638 399 Email: [email protected] Support: m.me/fptsmartcloud

FPT GPU Cloud Benchmark: Performance Comparison of GPUs for AI & Machine Learning

14:09 25/03/2025
Benchmarking is crucial for evaluating GPU performance in AI and machine learning. This study measures training speed and scalability across various GPU types to help users choose the best fit for their workloads. Beyond internal assessments, we compare FPT GPU Cloud's performance with similar vendors to highlight key advantages in processing power, memory bandwidth, and scalability. These insights help customers select the most efficient GPU cloud services for their AI needs.

Check out the FPT AI Factory's fork of the Optimum Habana trainer code. These H100 benchmarks may be reproduced by following the instructions provided in that repository. The following benchmarks utilize the Optimum Habana v1.7 trainer code to evaluate the performance of the NVIDIA HGX H100 and HGX H200 against similar vendors.

Result H100 (samples per second) – FPT's Metal Cloud, K8S, DGX, VM. Batch size 54

| Model | 1 GPU | 2 GPUs | 3 GPUs | 4 GPUs | 6 GPUs | 8 GPUs |
|---|---|---|---|---|---|---|
| Similar Vendor's H100 80GB SXM | 142.3 | 275 | 400.6 | 521.8 | 740.3 | 962.2 |
| Compared to 1 GPU (times faster) | | 1.93 | 2.82 | 3.67 | 5.20 | 6.76 |
| Metal Cloud – Bare Metal H100 80GB SXM | 144.2 | 283.4 | 418.9 | 550.7 | 799.4 | 1056.3 |
| Compared to Similar Vendor | 101% | 103% | 105% | 106% | 108% | 110% |
| Compared to 1 GPU (times faster) | | 1.97 | 2.91 | 3.82 | 5.54 | 7.33 |
| FPT K8S H100 80GB SXM | 143.8 | 282.4 | 417.0 | 546.7 | 792.8 | 1046.5 |
| Compared to Similar Vendor | 101% | 103% | 104% | 105% | 107% | 109% |
| Compared to 1 GPU (times faster) | | 1.96 | 2.90 | 3.80 | 5.51 | 7.28 |
| DGX H100 80GB SXM | 143.8 | 282.2 | 417.2 | 547.7 | 793.4 | 1047.0 |
| Compared to Similar Vendor | 101% | 103% | 104% | 105% | 107% | 109% |
| Compared to 1 GPU (times faster) | | 1.96 | 2.90 | 3.81 | 5.52 | 7.28 |
| FPT VM H100 80GB SXM (no NVLink) | 143.0 | 261.7 | 376.6 | 459.5 | | |
| Compared to Similar Vendor | 101% | 95% | 94% | 88% | | |
| Compared to 1 GPU (times faster) | | 1.83 | 2.63 | 3.21 | | |
| Compared to Metal Cloud | 99% | 92% | 90% | 83% | | |

Result H200 (samples per second) – FPT's Metal Cloud, multiple batch sizes: 54, 95, 110

| Model | 1 GPU | 2 GPUs | 3 GPUs | 4 GPUs | 6 GPUs | 8 GPUs |
|---|---|---|---|---|---|---|
| Metal Cloud – Bare Metal H200 141GB SXM (bz54) | 158.8 | 312.4 | 460.7 | 600.9 | 881.4 | 1165.1 |
| Compared to Similar Vendor's H100 | 112% | 114% | 115% | 115% | 119% | 121% |
| Compared to Metal Cloud H100 | 110% | 110% | 110% | 109% | 110% | 110% |
| Compared to Similar Vendor's Bare Metal H200 | 101% | 101% | 102% | 101% | 104% | 105% |
| Compared to 1 GPU (times faster) | | 1.84 | 2.71 | 3.53 | 5.18 | 6.85 |
| Metal Cloud – Bare Metal H200 141GB SXM (bz95) | 169.4 | 332.9 | 489.2 | 649.7 | 917.4 | 1238.1 |
| Compared to Similar Vendor's H100 | 119% | 121% | 122% | 125% | 124% | 129% |
| Compared to Metal Cloud H100 | 117% | 117% | 117% | 118% | 115% | 117% |
| Compared to Similar Vendor's Bare Metal H200 | 107% | 108% | 108% | 110% | 108% | 112% |
| Compared to 1 GPU (times faster) | | 1.96 | 2.87 | 3.82 | 5.39 | 7.28 |
| Metal Cloud – Bare Metal H200 141GB SXM (bz110) | 173.9 | 341.4 | 505.8 | 651.0 | 973.7 | 1190.0 |
| Compared to Similar Vendor's H100 | 122% | 124% | 126% | 125% | 132% | 124% |
| Compared to Metal Cloud H100 | 121% | 120% | 121% | 118% | 122% | 113% |
| Compared to Similar Vendor's Bare Metal H200 | 110% | 111% | 112% | 110% | 115% | 107% |
| Compared to 1 GPU (times faster) | | 2.01 | 2.97 | 3.83 | 5.72 | 6.99 |

FPT AI Factory optimizes GPU performance through advanced infrastructure and software enhancements. Metal Cloud delivers the highest performance across all GPU configurations, outperforming similar vendor benchmarks, with the performance gap increasing as more GPUs are added (up to 110% at 8 GPUs). FPT K8S performs slightly lower due to additional overhead but remains competitive. FPT VM (without NVLink) shows lower performance, especially with multiple GPUs, reinforcing NVLink's role in scaling efficiency.
Across all models, performance scaling is sublinear, with diminishing returns as the number of GPUs increases, though Metal Cloud scales the best (7.33× at 8 GPUs). Meanwhile, the HGX H200, with its larger VRAM (141GB vs. 80GB) and higher memory bandwidth (4.8TB/s vs. 3.35TB/s), enables larger batch sizes and achieves up to 18% better performance than the H100 at maximum batch size. Learn more about FPT AI Factory's services HERE. For more information and consultancy about FPT AI Factory, please contact: Hotline: 1900 638 399 Email: [email protected] Support: m.me/fptsmartcloud

A Step-by-Step Guide to Fine-Tuning Models with FPT AI Factory and NVIDIA NeMo

11:50 06/03/2025
In the fast-paced world of artificial intelligence, large language models (LLMs) are transforming industries from healthcare to creative writing. However, training these models from scratch can be resource-intensive and impractical for most. That's where parameter-efficient fine-tuning (PEFT) comes in. It allows you to take a pre-trained model, tweak it for your use case, and do so efficiently — saving time, computing power, and even the planet! By reducing the computational footprint, PEFT makes AI accessible on a wider range of hardware (think laptops or edge devices) and aligns with growing calls for sustainable tech practices by cutting down energy use and carbon emissions. Ready to dive in? Let's get started!

Prerequisites

Before you embark on this fine-tuning journey, let's make sure you've got the right tools and setup. Here's what you'll need:

Hardware: At a minimum, you'll need 1x NVIDIA A100 80GB GPU to handle PEFT tasks effectively, given the memory and compute demands of models like Llama 3.1–8B. However, in this guide, I'm running the process on Metal Cloud - FPT's Bare Metal H100 server, which offers even greater power with its NVIDIA H100 GPUs. The H100 is overkill for this tutorial but provides headroom for scaling or handling larger models — expect even faster performance and efficiency compared to the A100.

Software:

NVIDIA NeMo: Ensure you have NeMo version 24.07 or later installed. You'll run it via Docker, so familiarize yourself with Docker basics.

Python 3.8+: NeMo and its dependencies (like PyTorch and Hugging Face Transformers) require Python 3.8 or higher.

Hugging Face CLI or API: For downloading Llama 3.1–8B, you'll need access to Hugging Face and a valid token.

Weights & Biases (WandB): Optional but highly recommended for tracking training progress. Sign up for a free account and grab an API key.

NVIDIA Drivers and CUDA Toolkit: Ensure your GPU drivers (minimum version 535 for A100/H100) and CUDA 12.1+ are installed to support NeMo's GPU-accelerated operations.

Access and Permissions: Request access to the Llama 3.1–8B model on Hugging Face, as it requires approval from Meta.

These prerequisites set you up for success, whether you're on a standard A100 or leveraging the H100 server. Let's dive into the steps!

Understanding PEFT: What It Is and Why It Matters

Before we jump into the how-to, let's unpack what Parameter-Efficient Fine-Tuning (PEFT) really means. Picture a massive pre-trained model like Llama 3.1–8B, packed with 8 billion parameters — tiny knobs and dials that define its behavior. This model has already soaked up general knowledge from huge datasets, but now you want it to excel at something specific, like generating Japanese creative writing or answering technical questions. Fully fine-tuning all 8 billion parameters would be like rebuilding a car engine to tweak its radio — it works, but it's overkill and burns through resources.

PEFT flips the script. Instead of adjusting every parameter, it freezes most of the model and tweaks only a small subset — sometimes as little as 0.1% of the total parameters. This keeps the heavy lifting (and GPU memory) to a minimum while still adapting the model to your task. Think of it as adding a custom filter to a pre-built camera lens: you get sharp results without redesigning the whole system. The benefits? Faster training, lower memory use, and the ability to fine-tune on a single GPU — or even a laptop in some cases. Plus, it's kinder to the environment, slashing the energy cost of AI development.
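To make the "small subset of parameters" idea concrete, here is a minimal sketch of a LoRA-style layer in plain PyTorch. It is illustrative only (NeMo wires this up for you via the configuration shown later), and the rank, scaling, and layer size are arbitrary example values:

[code lang="js"]
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: W x + B(A x) * scale."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)          # freeze pre-trained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)   # down-projection
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)  # up-projection
        nn.init.zeros_(self.lora_b.weight)              # start as a no-op update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.lora_b(self.lora_a(x)) * self.scale

# Wrap a "pre-trained" projection: only ~0.4% of its parameters stay trainable.
base = nn.Linear(4096, 4096)
layer = LoRALinear(base, rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable} / {total} ({100 * trainable / total:.2f}%)")
[/code]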
Key PEFT Methods and Parameters

PEFT isn't one-size-fits-all — it comes in flavors, each with its own tricks:

LoRA (Low-Rank Adaptation): This method adds small, trainable "adapter" matrices to specific layers (like attention mechanisms). In our example, we target attention_qkv (query, key, value in the transformer). Parameters here include:

Rank (r): Controls the size of the adapter matrices. A lower rank (e.g., 8 or 16) means fewer parameters to train, balancing efficiency and expressiveness.

Target Modules: Which parts of the model get adapters (e.g., attention layers). More targets = more flexibility, but also more compute.

P-Tuning: Instead of tweaking weights, P-Tuning optimizes a set of "prompt" embeddings fed into the model. Key parameter:

Prompt Length: How many tunable tokens to add (e.g., 20). Longer prompts can capture more context but increase complexity.

Adapter Tuning: Adds lightweight neural layers inside the model. Parameters include:

Adapter Size: The number of neurons in these layers (e.g., 64). Smaller sizes keep things light.

In our guide, we'll use LoRA because it strikes a great balance between performance and efficiency, but NeMo supports other methods too — experiment to find your favorite!

Hardware Requirements: What You'll Need

To follow along, you'll need some decent hardware. At a minimum, I recommend 1x NVIDIA A100 80GB GPU for PEFT tasks. Why? The A100's massive memory and computing power are ideal for handling the tensor operations and parallel processing that NeMo leverages. If you're on a budget, a smaller GPU like an RTX 3090 (24GB) might work for lighter models, but expect longer training times and potential memory constraints. For optimal performance, especially with larger models like Llama 3.1–8B, stick with the A100 or equivalent.

Step 1: Downloading the Llama 3.1–8B Model

We'll kick things off by grabbing the Llama 3.1–8B model in Hugging Face format. This 8-billion-parameter beast from Meta AI is a fantastic starting point for fine-tuning, offering a balance of performance and efficiency.

How to Download

First, request download permission from Meta's Hugging Face page (you'll need to sign up and agree to their terms). Once approved, create a directory to store the model:

[code lang="js"] mkdir llama31-8b-hf [/code]

You've got two options to download:

Option 1: CLI Tool

Log in to Hugging Face and use their CLI:

[code lang="js"]
huggingface-cli login
huggingface-cli download meta-llama/Llama-3.1-8B --local-dir llama31-8b-hf
[/code]

Option 2: Python API

If you prefer scripting, use this Python snippet (replace <YOUR HF TOKEN> with your Hugging Face token):

[code lang="js"]
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="meta-llama/Llama-3.1-8B",
    local_dir="llama31-8b-hf",
    local_dir_use_symlinks=False,
    token="<YOUR HF TOKEN>"
)
[/code]

Once complete, your model files will land in ./llama31-8b-hf. Pro tip: Verify the download by checking for key files like pytorch_model.bin or model.safetensors — this ensures you've got everything intact.
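If you want to script that check, a small snippet along these lines works (the file names follow the usual Hugging Face conventions, but the exact shard names vary by model):

[code lang="js"]
from pathlib import Path

model_dir = Path("llama31-8b-hf")
expected = ["config.json", "tokenizer.json"]          # common Hugging Face files
weights = list(model_dir.glob("*.safetensors")) + list(model_dir.glob("pytorch_model*.bin"))

missing = [name for name in expected if not (model_dir / name).exists()]
print("weight shards found:", len(weights))
print("missing files:", missing or "none")
[/code]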
Step 2: Converting to NeMo Format

NeMo uses its own .nemo format for models, which supports distributed checkpointing and flexible parallelism. Let's convert our Hugging Face model to .nemo.

Launch the NeMo Container

Fire up NVIDIA's NeMo Docker container with GPU support:

[code lang="js"]
docker run --gpus device=1 --shm-size=2g --net=host --ulimit memlock=-1 --rm -it -v ${PWD}:/workspace -v ${PWD}/results:/results nvcr.io/nvidia/nemo:24.07 bash
[/code]

This command maps your current directory to /workspace in the container and sets up GPU access.

Run the Conversion

Inside the container, execute:

[code lang="js"]
python3 /opt/NeMo/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py --input_name_or_path=./llama31-8b-hf/ --output_path=llama31-8b.nemo
[/code]

The resulting llama31-8b.nemo file is ready for fine-tuning and supports any tensor parallel (TP) or pipeline parallel (PP) configuration without additional tweaking. This flexibility is a huge win for scaling across multiple GPUs if you expand your setup later!

Preparing Your Data

Data is the lifeblood of fine-tuning. For this guide, we'll use the Databricks Dolly 15k Japanese dataset (a translated version of Dolly 15k) as an example, but you can swap in any dataset relevant to your task — think medical QA, customer support logs, or creative writing prompts.

Step 1: Download the Dataset

Let's pull the dataset from Hugging Face:

[code lang="js"]
# load_dataset.py
from datasets import load_dataset

# Load dataset
ds = load_dataset("llm-jp/databricks-dolly-15k-ja")
df = ds["train"].data.to_pandas()
df.to_json("databricks-dolly-15k-ja.jsonl", orient="records", lines=True)
[/code]

This saves the dataset as a .jsonl file, where each line is a JSON object with fields like instruction, context, and response.

Step 2: Preprocess the Data

We need to format the data into a structure NeMo can digest. Here's a preprocessing script to combine instruction and context into an input field, paired with an output response:

[code lang="js"]
# preprocess.py
import json
import argparse
import numpy as np

def to_jsonl(path_to_data):
    print("Preprocessing data to jsonl format...")
    output_path = f"{path_to_data.split('.')[0]}-output.jsonl"
    with open(path_to_data, "r") as f, open(output_path, "w") as g:
        for line in f:
            line = json.loads(line)
            context = line["context"].strip()
            instruction = line["instruction"].strip()
            if context:
                # Randomize order of context and instruction for variety
                context_first = np.random.randint(0, 2) == 0
                input_text = f"{context}\n\n{instruction}" if context_first else f"{instruction}\n\n{context}"
            else:
                input_text = instruction
            output = line["response"]
            g.write(
                json.dumps(
                    {"input": input_text, "output": output, "category": line["category"]},
                    ensure_ascii=False
                ) + "\n"
            )
    print(f"Data saved to {output_path}")

def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", type=str, required=True, help="Path to jsonl dataset")
    return parser.parse_args()

if __name__ == "__main__":
    args = get_args()
    to_jsonl(args.input)
[/code]

Run it like this:

[code lang="js"]
python preprocess.py --input=databricks-dolly-15k-ja.jsonl
[/code]

Step 3: Split the Dataset

Now, split the preprocessed data into training, validation, and test sets:

[code lang="js"]
# split_train_val.py
import json
import random

input_file = "databricks-dolly-15k-ja-output.jsonl"
train_file = "training.jsonl"
val_file = "validation.jsonl"
test_file = "test.jsonl"
train_prop, val_prop, test_prop = 0.80, 0.15, 0.05
with open(input_file, "r") as f:
    lines = f.readlines()

random.shuffle(lines)
total = len(lines)
train_idx = int(total * train_prop)
val_idx = int(total * val_prop)

train_data = lines[:train_idx]
val_data = lines[train_idx:train_idx + val_idx]
test_data = lines[train_idx + val_idx:]

for data, filename in [(train_data, train_file), (val_data, val_file), (test_data, test_file)]:
    with open(filename, "w") as f:
        for line in data:
            f.write(line.strip() + "\n")
[/code]

This gives you three files: training.jsonl (80%), validation.jsonl (15%), and test.jsonl (5%). Here's a sample of what the processed data looks like:

[code lang="js"]
{
  "input": "若い頃にもっと時間をかけてやっておけばよかったと思うことは?",
  "output": "健康とウェルネスへの投資だ。若い頃に運動やバランスの取れた食事、家族との時間をもっと大切にしていれば、今後の人生がもっと豊かで楽になっていただろう。",
  "category": "creative_writing"
}
[/code]

Step 3: Fine-Tuning with PEFT

Time to fine-tune! We'll use the LoRA method (as set in PEFT_SCHEME="lora"), though you can switch to P-Tuning or others by tweaking that variable. Here's the full script:

[code lang="js"]
MODEL="llama31-8b.nemo"
TRAIN_DS="[training.jsonl]"
VALID_DS="[validation.jsonl]"
TEST_DS="[test.jsonl]"
TEST_NAMES="[data]"
PEFT_SCHEME="lora"
CONCAT_SAMPLING_PROBS="[1.0]"
TP_SIZE=1
PP_SIZE=1

huggingface-cli login --token <HF_TOKEN>
export WANDB_API_KEY=<WANDB_TOKEN>
wandb login

torchrun --nproc_per_node=1 \
/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py \
    trainer.devices=1 \
    trainer.num_nodes=1 \
    trainer.precision=bf16 \
    trainer.val_check_interval=20 \
    trainer.max_steps=50 \
    model.megatron_amp_O2=True \
    ++model.mcore_gpt=True \
    ++model.flash_attention=True \
    model.tensor_model_parallel_size=${TP_SIZE} \
    model.pipeline_model_parallel_size=${PP_SIZE} \
    model.micro_batch_size=1 \
    model.global_batch_size=32 \
    model.optim.lr=1e-4 \
    model.restore_from_path=${MODEL} \
    model.data.train_ds.file_names=${TRAIN_DS} \
    model.data.train_ds.concat_sampling_probabilities=${CONCAT_SAMPLING_PROBS} \
    model.data.validation_ds.file_names=${VALID_DS} \
    model.peft.peft_scheme=${PEFT_SCHEME} \
    model.peft.lora_tuning.target_modules=[attention_qkv] \
    exp_manager.create_wandb_logger=True \
    exp_manager.explicit_log_dir=/results \
    exp_manager.wandb_logger_kwargs.project=peft_run \
    exp_manager.wandb_logger_kwargs.name=peft_llama31_8b \
    exp_manager.create_checkpoint_callback=True \
    exp_manager.checkpoint_callback_params.monitor=validation_loss
[/code]

Key Highlights

LoRA in Action: We're targeting attention_qkv modules, adding small adapters to fine-tune efficiently.

WandB: Tracks training progress — super handy for visualizing loss curves.

Precision: Uses bf16 (bfloat16) for faster training with minimal accuracy loss on modern GPUs.

Adjust max_steps (how many training iterations) or global_batch_size (how many samples per update) based on your dataset size and hardware. For our small example, 50 steps keep things quick.

Diving Deeper: Understanding the Parameters

Want to geek out on what's driving this PEFT fine-tuning? Here's a quick rundown of the most important parameters in the script and why they matter for keeping Llama 3.1–8B manageable on a single A100 — or, in my case, FPT's bare metal H100 server:

trainer.precision=bf16: Uses bfloat16 precision for faster, memory-efficient training on modern GPUs like the A100 or H100. It's a PEFT superpower, slashing memory use while keeping accuracy sharp.
trainer.max_steps=50: Limits training to 50 steps, keeping things quick for small datasets like ours. Bump this up for larger data or better results, but watch for longer runtimes.

model.micro_batch_size=1 & model.global_batch_size=32: Sets the batch size per GPU (1 sample) and the total batch size (32 samples across GPUs). A low micro-batch size saves memory for PEFT, but you might tweak it higher if your GPU (like the H100's 94GB) can handle more.

model.optim.lr=1e-4: Sets the learning rate to 0.0001, a small value ideal for PEFT's delicate parameter updates (like LoRA adapters) to avoid overshooting.

model.peft.peft_scheme=lora & model.peft.lora_tuning.target_modules=[attention_qkv]: Uses LoRA for efficiency, targeting only the attention query, key, and value layers. This keeps parameter updates minimal — perfect for resource-light fine-tuning on high-performance hardware like the H100.

exp_manager.create_wandb_logger=True: Enables Weights & Biases logging to track progress live. It's your window into loss curves and resource use, making it easier to tweak and troubleshoot, especially on a powerful setup like Metal Cloud - FPT's H100 server.

These parameters work together to make PEFT fast, efficient, and scalable. Tweak them based on your hardware, dataset, or goals — PEFT's flexibility is one of its biggest perks!

Visualizing PEFT Performance: Resource Usage During Fine-Tuning

Curious about what's happening under the hood during PEFT fine-tuning? Check out this snapshot of resource metrics from the fine-tuning process (see the graphs below). These charts, captured over 500 seconds, show how our Llama 3.1–8B model behaves on an NVIDIA A100 GPU:

Memory Usage: System memory stays low (peaking at ~1.8%), while GPU memory ramps up to ~26GB (or 34–40% of the A100's 80GB), reflecting the memory demands of loading and processing the 8-billion-parameter model and its PEFT adapters.

GPU Power and Utilization: The GPU draws up to 500W and operates at 80–85% utilization, showcasing the A100's efficiency in handling the tensor operations and parallel processing in NeMo. This confirms PEFT's promise of staying resource-light compared to full fine-tuning.

Memory Access Time: GPU time spent accessing memory hovers around 30–40%, indicating balanced compute and memory operations — ideal for PEFT's low-parameter adjustments.

These metrics highlight why PEFT is a game-changer: it keeps resource usage manageable, even for a hefty model like Llama 3.1–8B, making fine-tuning feasible on a single high-end GPU. If you're tweaking hyperparameters or scaling up, expect these patterns to shift — play around and monitor your own runs for insights!

Step 4: Running Inference

Finally, let's test our fine-tuned model! This script evaluates performance on the test set:

[code lang="js"]
MODEL="llama31-8b.nemo"
PATH_TO_TRAINED_MODEL="/results/llama31-8b_lora.nemo"  # Adjust based on output from training
TEST_DS="[test.jsonl]"
TEST_NAMES="[data]"
OUTPUT_PREFIX="./results/peft_results"
TP_SIZE=1
PP_SIZE=1

[ ! -d ${OUTPUT_PREFIX} ] && mkdir -p ${OUTPUT_PREFIX}
Wrapping Up

And there you have it — a complete guide to fine-tuning Llama 3.1–8B with FPT AI Factory, NVIDIA NeMo, and PEFT! From understanding the magic of parameter-efficient methods to running inference, you’ve now got the tools to adapt LLMs to your own projects. Play around with different datasets, tweak LoRA’s rank, or scale up to multiple GPUs — the possibilities are endless.

For more information and consultancy about FPT AI Factory, please contact:
Hotline: 1900 638 399
Email: [email protected]
Support: m.me/fptsmartcloud

Source: https://blog.usee.ai/a-step-by-step-guide-to-fine-tuning-models-with-nvidias-nemo-framework-49ba3ab27d3d

Practical Guide to Distributed Training of Large Language Models (LLMs) with Slurm and LLaMA-Factory on Metal Cloud

17:24 28/02/2025
1. Overview

This guide provides a comprehensive walkthrough for setting up and running distributed training with LLaMA-Factory on Metal Cloud (Bare Metal Server). We cover environment setup, Slurm-based job scheduling, and performance optimizations. Additionally, we include a training task that fine-tunes a Llama-3.1-8B model on the Open Instruct Uncensored Alpaca dataset (a collection of instruction-tuning samples) with full fine-tuning settings on 4 nodes, 8 x NVIDIA H100 GPUs per node, providing hands-on instructions for replicating a real-world training scenario.

The core of the guide is setting up a distributed training environment on Metal Cloud using Slurm, an open-source workload manager optimized for high-performance computing. The guide walks through:

Preparing the infrastructure with Slurm, CUDA, and NCCL for efficient multi-GPU communication.
Installing LLaMA-Factory and configuring the system to enable seamless model training.
Running a fine-tuning task for the LLaMA-3.1-8B model using the Open Instruct Uncensored Alpaca dataset.
Leveraging Slurm’s job scheduling capabilities to allocate resources and monitor performance efficiently.

Key highlights that readers should focus on:

Scalability & Efficiency: The guide demonstrates how Slurm optimally distributes workloads across multiple GPUs and nodes, reducing training time.
Cost Optimization: Proper job scheduling minimizes idle GPU time, leading to better resource utilization and lower costs.
Reliability: Automated job resumption, error handling, and real-time system monitoring ensure stable training execution.
Hands-on Training Example: A real-world fine-tuning scenario is provided, including dataset preparation, YAML-based configuration, and Slurm batch scripting for execution.

By following this guide, readers can replicate the training pipeline and optimize their own LLM training workflows on Metal Cloud.

2. Why Slurm for Distributed Training?

Slurm is a widely used open-source workload manager designed for high-performance computing (HPC) environments. It provides efficient job scheduling, resource allocation, and scalability, making it an excellent choice for AI training on Metal Cloud. Key advantages include:

Resource Efficiency: Slurm optimally distributes workloads across GPUs and nodes, minimizing idle resources.
Scalability: Seamlessly scales from a few GPUs to thousands, accommodating diverse AI workloads.
Job Scheduling: Prioritizes and queues jobs based on defined policies, ensuring fair usage of resources.

3. Use Case: Training Large Language Models with LLaMA-Factory

One practical application of Slurm on Metal Cloud is training large language models using the LLaMA-Factory framework. By distributing training across multiple GPUs and nodes, Slurm helps reduce training time while ensuring stable and efficient execution. Key Benefits:

Scalability: Supports large-scale models with efficient GPU utilization.
Cost Optimization: Reduces cloud computing costs by minimizing idle time.
Reliability: Automated job resumption and error handling enhance workflow robustness.

4. Prerequisites

Before proceeding, ensure you have the following:

4.1. System Requirements

Metal Cloud access with multiple GPU-equipped nodes
Slurm job scheduler installed and configured
NVIDIA CUDA (11.8+ recommended) installed on all nodes
NCCL (NVIDIA Collective Communication Library) for multi-GPU communication
Python 3.8+ installed on all nodes
PyTorch with distributed training support
High-performance storage

4.2. Network & SSH Configuration

To enable seamless multi-node training, ensure:

Each node can SSH into the other nodes without a password using an SSH key.
Network interfaces allow high-speed inter-node communication (e.g., InfiniBand).
NCCL and the PyTorch distributed backend can communicate over TCP/IP.

You can verify node connectivity with scontrol show nodes or sinfo.
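Before kicking off a long training run, it is also worth confirming that CUDA, NCCL, and the PyTorch distributed backend can actually communicate across nodes. The following sanity-check sketch is not part of the original guide; the script name (nccl_check.py) and the launch values below are placeholders you would adapt to your cluster:

[code lang="js"]
# nccl_check.py - minimal multi-GPU / multi-node sanity check (illustrative)
import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for each process
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank contributes its rank id; after all_reduce every rank sees the same sum
    t = torch.tensor([dist.get_rank()], dtype=torch.float32, device="cuda")
    dist.all_reduce(t, op=dist.ReduceOp.SUM)

    expected = sum(range(dist.get_world_size()))
    print(f"rank {dist.get_rank()}/{dist.get_world_size()} "
          f"all_reduce={t.item():.0f} (expected {expected})")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
[/code]

Launched on each node with something like torchrun --nnodes=4 --nproc_per_node=8 --rdzv_backend=c10d --rdzv_endpoint=<head_node_ip>:29401 nccl_check.py (values are placeholders), every rank should print the same sum. If it hangs or errors out, revisit the NCCL and network settings before submitting the real job.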
5. Environment Setup

Assuming your system meets the requirements above, run the following on each compute node to install LLaMA-Factory and all necessary packages:

[code lang="js"]
python3 -m venv venv
source venv/bin/activate
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"
[/code]

6. Sample Training Task: Fine-Tuning LLaMA on the Open Instruct Uncensored Alpaca Dataset

To demonstrate a real-world scenario, we will perform full fine-tuning of a Llama-3.1-8B model using the Open Instruct Uncensored Alpaca dataset for instruction-following tasks.

6.1. Dataset: Open Instruct Uncensored Alpaca

The Open Instruct Uncensored Alpaca dataset is a collection of instruction-response pairs and is one of the most common datasets for fine-tuning models for instruction following. It is publicly available on Hugging Face. With LLaMA-Factory, you can point to the dataset’s URI on a remote repository such as Hugging Face directly in your YAML training configuration, and LLaMA-Factory will download the dataset automatically. To do this, define the dataset in the file LLaMA-Factory/data/dataset_info.json by adding the following line:

[code lang="js"]
"uncensored_alpaca": {"hf_hub_url": "xzuyn/open-instruct-uncensored-alpaca"}
[/code]

If you have already downloaded the dataset to the machine, add the following line instead and you are good to go:

[code lang="js"]
"your_dataset_name": {"file_name": "path/to/your/dataset.json"}
[/code]
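If you want to eyeball the data before training, a short sketch along these lines should do it, assuming the repository loads directly with the Hugging Face datasets library (the split and column names are assumptions; adjust them after inspecting the repo):

[code lang="js"]
from datasets import load_dataset

# Assumes the dataset can be loaded straight from the Hugging Face Hub
ds = load_dataset("xzuyn/open-instruct-uncensored-alpaca", split="train")

print(f"{len(ds)} samples, columns: {ds.column_names}")
sample = ds[0]
for key, value in sample.items():
    # Truncate long fields so the preview stays readable
    text = str(value)
    print(f"{key}: {text[:200]}{'...' if len(text) > 200 else ''}")
[/code]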
6.2. Model: LLaMA 3.1 8B

The LLaMA 3.1 8B model is one of the latest releases in Meta’s third-generation LLaMA series. It is a lightweight yet powerful large language model designed for both research and enterprise applications. LLaMA 3.1 8B can be trained efficiently on multi-GPU, multi-node Metal Cloud servers using LLaMA-Factory and DeepSpeed. The next sections of this guide walk you through setting up distributed training for LLaMA 3.1 8B using LLaMA-Factory. If you want to download the model directly from Hugging Face, you can use the command below:

[code lang="js"]
huggingface-cli download meta-llama/Llama-3.1-8B --local-dir=Llama-3.1-8B
[/code]

7. Preparing the training configuration

LLaMA-Factory uses YAML configuration files to manage training parameters efficiently. A YAML-based configuration simplifies hyperparameter tuning and ensures reproducibility. This section explains how to prepare a YAML configuration file for fine-tuning the LLaMA 3.1 8B model using LLaMA-Factory.

7.1. Sample YAML Configuration for Fine-Tuning LLaMA 3.1 8B

LLaMA-Factory provides various predefined YAML training configuration files, located at LLaMA-Factory/examples. Here is a YAML file for full fine-tuning of LLaMA 3.1 8B with the Open Instruct Uncensored Alpaca dataset:

[code lang="js"]
model_name_or_path: meta-llama/Llama-3.1-8B
trust_remote_code: true
stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z2_config.json
dataset: uncensored_alpaca
template: llama3
cutoff_len: 2048
max_samples: 500000
overwrite_cache: true
preprocessing_num_workers: 16
output_dir: saves/llama3.1-8b/full/sft
logging_steps: 10
save_steps: 10000
plot_loss: true
overwrite_output_dir: true
per_device_train_batch_size: 4
gradient_accumulation_steps: 2
learning_rate: 1.0e-5
num_train_epochs: 2.5
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
val_size: 0.001
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 10000
[/code]

You can put the model URI from Hugging Face directly with:

[code lang="js"]
model_name_or_path: meta-llama/Llama-3.1-8B
[/code]

and LLaMA-Factory will automatically download the model before training. If you have already downloaded the model to the machine, specify the local path instead:

[code lang="js"]
model_name_or_path: path/to/your/model
[/code]

To train on the Open Instruct Uncensored Alpaca dataset, reference it by the name defined earlier:

[code lang="js"]
dataset: uncensored_alpaca
[/code]

We cap the number of training samples at 500,000 with:

[code lang="js"]
max_samples: 500000
[/code]

You can adjust any of the other options as needed.

8. Configuring Slurm for Multi-Node Training

Assuming you have a YAML training configuration file named llama31_training.yaml, create a Slurm script train_llama.sbatch for training on 4 nodes with 8 GPUs per node:

[code lang="js"]
#!/bin/bash
#SBATCH --job-name=multinode-training
#SBATCH --nodes=4
#SBATCH --time=2-00:00:00
#SBATCH --gres=gpu:8
#SBATCH -o training.out
#SBATCH -e training.err
#SBATCH --ntasks=4

nodes=($(scontrol show hostnames $SLURM_JOB_NODELIST))
nodes_array=($nodes)
head_node=${nodes_array[0]}
node_id=${SLURM_NODEID}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address | cut -d" " -f2)
echo Master Node IP: $head_node_ip

export LOGLEVEL=INFO
export NNODES=4
export NPROC_PER_NODE=8
export HEAD_NODE_IP=$head_node_ip
export HEAD_NODE_PORT=29401
export NODE_RANK=$node_id
export NCCL_IB_DISABLE=0
export NCCL_SOCKET_IFNAME=^lo,docker0
export NCCL_TIMEOUT=180000000
export NCCL_DEBUG=INFO
export NCCL_BLOCKING_WAIT=1          # Ensure NCCL waits for operations to finish
export NCCL_ASYNC_ERROR_HANDLING=1   # Allow handling of NCCL errors asynchronously

source venv/bin/activate
srun llamafactory-cli train llama31_training.yaml
[/code]

Use sbatch to submit the training job: sbatch train_llama.sbatch. View the job queue with squeue, and inspect the training.out and training.err files to follow the training progress. All 4 nodes are utilized perfectly. The final result is shown below:

9. Monitoring CPU and GPU Usage During Training

When training large-scale models like LLaMA 3.1 8B on Metal Cloud, it is important to monitor system resources such as CPU, GPU, memory, and disk usage. Proper monitoring helps in:

Detecting bottlenecks (e.g., underutilized GPUs, CPU overload).
Optimizing resource allocation (e.g., adjusting batch sizes).
Avoiding system crashes due to out-of-memory (OOM) errors.

Bare Metal provides a monitoring page where users can track real-time hardware usage, including:

GPU Utilization – See how much each GPU is being used.
VRAM Usage – Check memory consumption per GPU.
CPU Load – Monitor processor usage across nodes.
Disk & Network Stats – Identify I/O bottlenecks.

Users can access the monitoring page via their Metal Cloud dashboard to ensure efficient and stable training.
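Alongside the dashboard, you can also log raw GPU numbers from inside a Slurm allocation. This small sketch (not part of the original guide) shells out to nvidia-smi and appends one CSV row per GPU; running it on every node, for example via srun --ntasks-per-node=1 python gpu_log.py, would give a simple cluster-wide log. The file name and 30-second polling interval are arbitrary choices:

[code lang="js"]
# gpu_log.py - lightweight GPU usage logger (illustrative)
import csv
import socket
import subprocess
import time

QUERY = "timestamp,index,utilization.gpu,memory.used,memory.total,power.draw"

def snapshot():
    # One CSV line per GPU on the local node
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return [line.split(", ") for line in out.stdout.strip().splitlines()]

if __name__ == "__main__":
    host = socket.gethostname()
    with open(f"gpu_usage_{host}.csv", "a", newline="") as f:
        writer = csv.writer(f)
        while True:
            for row in snapshot():
                writer.writerow([host] + row)
            f.flush()
            time.sleep(30)  # poll every 30 seconds; adjust as needed
[/code]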
Conclusion

This guide provides a structured approach to setting up distributed training for LLaMA-Factory on Metal Cloud. We covered:

Environment setup
Slurm job submission
Distributed training with LLaMA-Factory and DeepSpeed
Optimizations for large-scale models

Following these steps, you can fine-tune LLaMA models efficiently on Metal Cloud multi-node GPU clusters.

Metal Cloud is now available on FPT AI Factory for reservation. Find out more at: https://aifactory.fptcloud.com/

For more information and consultancy, please contact:
Hotline: 1900 638 399
Email: [email protected]
Support: m.me/fptsmartcloud

FPT launches FPT AI Factory to accelerate AI development in Japan, offering local companies pre-orders for its NVIDIA H200 Tensor Core GPU Cloud Service

13:46 13/11/2024
Japan, November 13, 2024 — FPT, a leading global IT firm and Preferred NVIDIA Cloud Partner (NCP), officially announced the launch of the FPT AI Factory in Japan using the full-stack NVIDIA accelerated computing platform. This flagship solution serves as a one-stop shop for AI and Cloud services, offering immense computing power for AI advancement and contributing to developing Sovereign AI in the country. Japanese customers can expedite AI development with priority access to premium solutions and features through an exclusive pre-order.

The launch of FPT AI Factory in Japan by Mr. Le Hong Viet, CEO, FPT Smart Cloud, FPT Corporation

At the NVIDIA AI Summit in Japan, FPT debuted FPT AI Factory, an all-inclusive stack for the end-to-end AI product lifecycle, comprising three main components. FPT AI Infrastructure offers GPU cloud services with unprecedented computing power to accelerate model development and deployment. FPT AI Studio offers intelligent tools for building, pre-training, and fine-tuning AI models in depth using NVIDIA NeMo. FPT AI Inference, supported by NVIDIA NIM and NVIDIA AI Blueprints, enables customers to effectively deploy and scale their models in both size and usage volume. FPT AI Factory is integrated with 20+ ready-to-use AI products, built upon Generative AI, for rapid AI adoption and instant results in elevating customer experience, achieving operational excellence, transforming the human workforce, and optimizing operating expenses.

Powered by thousands of NVIDIA Hopper GPUs and their next-generation successors, and boosted by the latest NVIDIA AI Enterprise software platform, FPT AI Factory grants regional clientele scalable and confidential supercomputing along with essential tools to cultivate sophisticated AI technologies from scratch with faster time-to-market. This also empowers businesses to manage resources and processes expeditiously, optimizing total cost of ownership (TCO).

FPT is now accepting pre-orders for FPT AI Factory, allowing local corporate clients to leverage the diverse ecosystem of AI and Cloud, earn cloud credits, and gain early access to premium features. Combined with tailor-made consultation from seasoned AI & Cloud experts, enterprises in any industry can reinforce successful AI journeys with practical, high-value solutions.

Japan currently faces a shortage of the GPU cloud solutions necessary to drive economic growth, promote innovation, and facilitate digital transformation in a secure manner. FPT’s AI-ready infrastructure in Japan provides local businesses and the government with unparalleled computational performance, high efficiency, and low-latency interactions to enrich research & development capabilities while safeguarding sensitive data and maintaining sovereignty.

By joining the NVIDIA Partner Network as a Service Delivery Partner, FPT will utilize NVIDIA's cutting-edge products and technologies to develop bespoke cloud services, hardware, software, and comprehensive integration services to drive digital transformation in Japan.

Dr. Truong Gia Binh, FPT Corporation Chairman and Founder, shares FPT’s commitment to accompanying Japan in facilitating a successful AI journey

Dr. Truong Gia Binh, FPT Corporation Chairman and Founder, affirmed, "Artificial intelligence continues to be a transformative technology for the entire world. In line with NVIDIA's global initiative, we are working closely with strategic partners to develop cloud infrastructure essential for AI applications worldwide, especially in Japan.
We are committed to dedicating all necessary resources and accompanying the Japanese government, enterprises, and partners in the nation’s AI investment and development efforts. Through this significant project, we are aligning our vision and action to rapidly expand AI applications on a global scale while actualizing the collective vision of Japan and Vietnam in becoming AI nations.”

John Fanelli, NVIDIA Vice President of Enterprise AI Software, said, "In today's rapidly evolving technological landscape, Japan recognizes the importance of sovereign AI solutions for driving innovation, supporting data security, and maintaining technological independence. The FPT AI Factory built on NVIDIA accelerated computing and software represents a significant step towards meeting this need, offering Japanese companies access to cutting-edge AI infrastructure while fostering local AI development and expertise."

In April 2024, FPT announced the development of AI Factories through a comprehensive strategic collaboration with NVIDIA. That marked a significant milestone in FPT's AI journey, aiming to promote AI research and development in the region and expand advanced AI and cloud capabilities on a global scale.

About FPT Corporation

FPT Corporation (FPT) is a leading global technology and IT services provider headquartered in Vietnam. FPT operates in three core sectors: Technology, Telecommunications, and Education. With AI as a key focus, FPT has been integrating AI across its products and solutions to drive innovation and enhance user experiences within its Made by FPT ecosystem. FPT is actively working on expanding its capabilities in AI through investments in human resources, R&D, and partnerships with leading organizations like NVIDIA, Mila, AITOMATIC, and Landing AI. These efforts are aligned with FPT's ambitious goal to reach 5 billion USD in IT services revenue from global markets by 2030 and solidify its status among the world's top billion-dollar IT companies.

After nearly two decades in Japan, FPT has become one of the largest foreign-invested technology firms in the country by workforce size. The company delivers services and solutions to over 450 clients globally, with over 3,500 employees across 17 local offices and innovation hubs in Japan, and nearly 15,000 professionals supporting this market worldwide.

With Japan as a strategic focus for the company’s global growth, FPT has been actively expanding its business and engaging in M&A deals, such as the joint venture with Konica Minolta, the strategic investment in LTS Inc, and most recently, the acquisition of NAC—its first M&A deal in the market. With digital transformation, particularly legacy system modernization, viewed as a key growth driver in the Japanese market, the company is committed to providing end-to-end solutions and seamless services, utilizing advanced AI technologies as a primary accelerator. For more information, please visit https://fpt.com/en.