Training configuration
Updated on 27 May 2025

How do I choose the right training configuration?

1. How many GPUs do I need to fine-tune a model?

It depends on the model size:

  • <1B parameters: 1 GPU (24 GB VRAM) is sufficient
  • 7B models: 2–4 GPUs (40 GB VRAM each)
  • 13B models: 4–8 GPUs recommended
  • 30B+ models: require 8+ GPUs and a multi-node setup
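As a rough sanity check behind these numbers: full fine-tuning with Adam in mixed precision needs on the order of 16 bytes per parameter (weights, gradients, and optimizer states), before activation memory. A minimal sketch, where the 16 bytes/parameter figure and the 15% overhead are assumptions rather than exact values:

```python
import math

def full_finetune_gpus(params_billion: float, gpu_vram_gb: float = 40.0,
                       bytes_per_param: int = 16, overhead: float = 0.15) -> int:
    """Rough GPU count for full fine-tuning (ignores activation memory)."""
    # 1e9 params * bytes per param / 1e9 bytes per GB == params_billion * bytes_per_param
    needed_gb = params_billion * bytes_per_param * (1 + overhead)
    return math.ceil(needed_gb / gpu_vram_gb)

for size in (1, 7, 13, 30):
    print(f"{size}B params -> ~{size * 16 * 1.15:.0f} GB -> "
          f"{full_finetune_gpus(size)} x 40 GB GPUs")
```

Techniques such as ZeRO sharding, offloading, and activation checkpointing can bring the practical GPU count below this estimate, which is why 7B models are often feasible on 2 GPUs.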

2. Do I need multiple nodes or just one node?

  • For small to medium models (up to 13B), a single node with multiple GPUs is enough.
  • For large models (30B+), a multi-node setup is recommended for the additional aggregate memory and better throughput.
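If a multi-node run is needed, the usual pattern is to start one process per GPU on every node and have all processes join a single process group. A minimal sketch assuming the standard torchrun launcher (the node count, GPU count, and addresses are placeholders):

```python
# Launch the same script on every node, e.g. for 2 nodes x 8 GPUs:
#   torchrun --nnodes=2 --nproc_per_node=8 --node_rank=<0 or 1> \
#            --master_addr=<node0-ip> --master_port=29500 train.py
import os
import torch
import torch.distributed as dist

def setup_distributed() -> int:
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for each spawned process
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    return local_rank

if __name__ == "__main__":
    local_rank = setup_distributed()
    print(f"rank {dist.get_rank()}/{dist.get_world_size()} using GPU {local_rank}")
    dist.destroy_process_group()
```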

3. What is the minimum GPU memory required?

  • At least 24 GB per GPU for standard full fine-tuning.
  • With LoRA or QLoRA, you can fine-tune on GPUs with 8–16 GB of VRAM.
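As an illustration of how LoRA/QLoRA keeps memory low, here is a minimal QLoRA-style sketch assuming the Hugging Face transformers, peft, and bitsandbytes libraries; the model id and adapter hyperparameters are illustrative, not prescribed by this guide:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit quantization keeps the frozen base weights small enough for 8-16 GB GPUs
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",   # illustrative model id
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Only the small low-rank adapter matrices are trained; the base model stays frozen
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

The 4-bit base weights plus the small adapters are what make 8–16 GB GPUs workable; the exact fit still depends on sequence length and batch size.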

Example:

Model: Llama-3.1-8B-Instruct

  • Training type: Full

    • Number of GPUs: fits on 2 GPUs (at nearly 99% memory usage); use 4 GPUs for headroom and a more consistent runtime
    • Distributed backend: DeepSpeed
    • ZeRO Stage: 3
    • Batch size per device: 1
    • All other parameters can be left at their defaults (see the DeepSpeed sketch after this example)
  • Training type: LoRA

    • Number of GPUs: fits on 1 GPU
    • LoRA Rank: 16
    • Batch size per device: 1
    • All other parameters can be left at their defaults
  • To estimate the most suitable training configuration, you can use https://rahulschand.github.io/gpu_poor/ (allow 10–20% overhead on top of its estimates)
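For the full fine-tuning row above, a minimal sketch of a ZeRO Stage 3 setup via the Hugging Face Trainer's DeepSpeed integration (the output directory is a placeholder and the "auto" values are filled in by the integration); the LoRA row can reuse the adapter sketch from question 3 with rank 16 and batch size 1 per device:

```python
from transformers import TrainingArguments

# Minimal DeepSpeed ZeRO Stage 3 config; "auto" lets the HF integration
# take the corresponding values from TrainingArguments.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "bf16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

training_args = TrainingArguments(
    output_dir="llama31-8b-full-ft",   # illustrative path
    per_device_train_batch_size=1,     # "Batch size per device: 1" from the example above
    bf16=True,
    deepspeed=ds_config,               # a dict or a path to a JSON file both work
)
# Launch with e.g.: torchrun --nproc_per_node=4 train.py
```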