We support both single-node and multi-node configurations, with a maximum of 16 nodes. Use the guidance below to scale the right infrastructure for your workload:
The number of GPUs depends on the model size (a rough sizing sketch follows after this list):
<1B parameters: 1 GPU (2GB VRAM) is sufficient
7B parameters: 2-4 GPUs (40GB VRAM each)
13B parameters: 4-8 GPUs recommended
30B+ parameters: Requires 8+ GPUs and a multi-node setup
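As a rough sanity check for these ranges, the sketch below estimates full fine-tuning memory from parameter count. It assumes mixed-precision training with Adam at roughly 16 bytes per parameter (fp16 weights and gradients plus fp32 master weights and optimizer moments), 40GB GPUs, and a 15% overhead; activations are not counted, so real usage can be higher. The constants and helper names are illustrative, not part of any platform API.

```python
import math

# Illustrative assumptions: ~16 bytes/parameter for fp16 weights, fp16
# gradients, fp32 master weights, and fp32 Adam moments; activations are
# not included, so treat the output as a lower bound.
BYTES_PER_PARAM = 16
OVERHEAD = 0.15  # 10-20% framework/communication overhead

def estimate_full_finetune_gb(params_billion: float) -> float:
    """Approximate memory (GB) for weights + gradients + optimizer states."""
    return params_billion * 1e9 * BYTES_PER_PARAM / 1024**3

def suggest_gpu_count(params_billion: float, gpu_vram_gb: float = 40.0) -> int:
    """GPUs needed if those states are sharded evenly (ZeRO-style) across devices."""
    total_gb = estimate_full_finetune_gb(params_billion) * (1 + OVERHEAD)
    return max(1, math.ceil(total_gb / gpu_vram_gb))

for size in (7, 13, 30):
    print(f"{size}B params: ~{estimate_full_finetune_gb(size):.0f} GB "
          f"-> about {suggest_gpu_count(size)} x 40GB GPUs")
```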
When to use a single-node or multi-node setup:
For small to medium models (up to 13B), a single node with multiple GPUs is enough
For large models (30B+), multi-node setups are recommended for more aggregate memory and better performance
Minimum GPU memory requirements:
At least 24GB per GPU for standard fine-tuning.
You can fine-tune on GPUs with 8-16GB VRAM using LoRA or QLoRA methods (see the QLoRA loading sketch below).
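To illustrate why QLoRA fits into that memory budget, here is a minimal sketch of loading a base model in 4-bit, assuming the Hugging Face transformers and bitsandbytes libraries (and access to the model weights); it is not a description of the platform's internal implementation.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization keeps the frozen base weights small; only the
# LoRA adapters (see the example below) are trained in higher precision.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",  # example model used in this guide
    quantization_config=bnb_config,
    device_map="auto",
)

# An 8B model quantized to 4-bit occupies roughly 5-6 GB for its weights,
# which is what makes an 8-16GB GPU workable for QLoRA fine-tuning.
print(f"~{model.get_memory_footprint() / 1024**3:.1f} GB for base weights")
```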
Example:
Model: Llama-3.1-8B-Instruct
Training type: Full
Number of GPUs: fits on 2 GPUs at nearly 99% memory usage; use 4 GPUs for more consistent runtimes
Distributed backend: DeepSpeed
ZeRO stage: 3
Batch size per device: 1
All other parameters can be left at their defaults (a DeepSpeed ZeRO-3 config sketch follows this configuration)
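For reference, a DeepSpeed ZeRO stage-3 configuration matching this example might look like the sketch below. The key names follow the standard DeepSpeed JSON schema, but the specific values (gradient accumulation, clipping, bf16) and the train.py entry point are assumptions to adjust for your own run.

```python
import json

ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # "Batch size per device: 1"
    "gradient_accumulation_steps": 8,      # effective batch = 1 x 8 x 4 GPUs = 32
    "gradient_clipping": 1.0,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                        # shard params, grads, and optimizer states
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
}

with open("ds_zero3.json", "w") as f:
    json.dump(ds_config, f, indent=2)

# Launch on a single node with 4 GPUs, e.g.:
#   deepspeed --num_gpus 4 train.py --deepspeed ds_zero3.json ...
# (train.py is a placeholder for your own training entry point.)
```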
Training type: LoRA (alternative configuration for the same model)
Number of GPUs: fits on 1 GPU
LoRA rank: 16
Batch size per device: 1
All other parameters can be left at their defaults (a LoRA adapter sketch follows this configuration)
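The equivalent rank-16 LoRA setup expressed with the Hugging Face peft library might look as follows; the alpha, dropout, and target-module choices are common conventions rather than values taken from this guide.

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the base model in bf16 (requires access to the model weights).
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", torch_dtype=torch.bfloat16
)

lora_config = LoraConfig(
    r=16,                                  # "LoRA rank: 16"
    lora_alpha=32,                         # common convention: 2 x rank
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # typically well under 1% of all weights
```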
To estimate the most suitable training configuration, you can use the calculator at https://rahulschand.github.io/gpu_poor/ and plan for an additional 10-20% memory overhead on top of its estimate (for example, treat an estimated 40GB as roughly 44-48GB).