The monitoring feature is bundled with the AI Infrastructure—Metal Cloud service. Collecting and visualizing metrics, logs, and events helps you identify potential issues and optimize future workloads. You can select the observability solution that best fits your needs.
Metrics | A Cluster (in the same VPC) | A single Server |
---|---|---|
Total number of nodes and down nodes | ✔ | |
GPU model, Driver & CUDA version | ✔ | |
Power state | ✔ | |
Uptime | ✔ | |
Total number of GPUs and down GPUs | ✔ | ✔ |
GPU Utilization | ✔ | ✔ |
GPU Memory | ✔ | ✔ |
CPU Utilization | ✔ | ✔ |
System Memory | ✔ | ✔ |
Root Storage Usage | ✔ | ✔ |
Local Disk Usage | ✔ | ✔ |
Per-GPU details: power consumption, temperature, GPU utilization, VRAM usage | ✔ | |
Network Bandwidth Inbound/Outbound | ✔ | ✔ |
Network Packets Sent/Received | ✔ | ✔ |
Network Error rate Receive/Transmit | ✔ | |
System Fan Speed | ✔ | |
System Voltage | ✔ | |
Common Alerts | ✔ | |
1. A cluster within a VPC
2. A Server