Monitor a Cluster and a Server
Updated on 16 May 2025

The monitoring feature is bundled with the AI Infrastructure—Metal Cloud service. Collecting and visualizing metrics, logs, and events helps you identify potential issues and optimize future workloads. You may select the observability solution that best fits your needs.

Metrics | A cluster (in the same VPC) | A single server
Total number of nodes and down nodes
GPU model, Driver & CUDA version
Power state
Uptime
Total number of GPUs and down GPUs
GPU Utilization
GPU Memory
CPU Utilization
System Memory
Root Storage Usage
Local Disk Usage
Details of each GPU: power consumption, temperature, GPU utilization, VRAM usage
Network Bandwidth Inbound/Outbound
Network Packets Sent/Received
Network Error rate Receive/Transmit
System Fan Speed
System Voltage
Common Alerts
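
The cluster-level counters above (total and down nodes, total and down GPUs, GPU utilization) can be read as roll-ups of per-server readings. The following is a minimal sketch of that relationship in Python, not the service's implementation. The `Gpu` and `Server` classes, their fields, and `summarize_cluster` are hypothetical names introduced for illustration. The rule that a GPU on a powered-off server counts as down is also an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Gpu:
    """Hypothetical per-GPU reading (not a Metal Cloud API type)."""
    healthy: bool
    utilization_pct: float  # 0-100


@dataclass
class Server:
    """Hypothetical per-server reading (not a Metal Cloud API type)."""
    name: str
    powered_on: bool
    gpus: List[Gpu] = field(default_factory=list)


def summarize_cluster(servers: List[Server]) -> Dict[str, float]:
    """Roll per-server readings up into cluster-level totals."""
    down_nodes = [s for s in servers if not s.powered_on]
    all_gpus = [g for s in servers for g in s.gpus]
    # Assumption: a GPU is "up" only if it is healthy and its host is powered on.
    up_gpus = [g for s in servers if s.powered_on for g in s.gpus if g.healthy]
    # Average utilization over GPUs that are up; 0.0 when none are up.
    avg_util = (
        sum(g.utilization_pct for g in up_gpus) / len(up_gpus) if up_gpus else 0.0
    )
    return {
        "total_nodes": len(servers),
        "down_nodes": len(down_nodes),
        "total_gpus": len(all_gpus),
        "down_gpus": len(all_gpus) - len(up_gpus),
        "avg_gpu_utilization_pct": avg_util,
    }
```

Calling `summarize_cluster` with a list of `Server` objects returns a dictionary whose keys mirror the cluster-level metric names listed above. A single server's view corresponds to the same fields before aggregation.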

1. A cluster within a VPC

2. A Server