    Setting up GPU Telemetry
    Updated on 29 Nov 2024

    FPT Cloud uses NVIDIA GPU Telemetry with kube-prometheus-stack as the monitoring toolkit for Kubernetes clusters with GPUs. The toolkit consists of metric collectors, a time-series database that stores the metrics, and a visualizer that displays them. It is built on widely used open-source tools, namely Prometheus and Grafana, and the stack also includes Alertmanager to create and manage alerts. Prometheus is deployed alongside kube-state-metrics and node_exporter, which collect cluster-level metrics for Kubernetes API resources and node-level metrics, respectively. GPU metrics such as GPU utilization are exposed under the DCGM (NVIDIA Data Center GPU Manager) prefix.

    • To view the GPU custom metrics, run this command:
    kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq -r . | grep DCGM
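
When jq is not installed, or only the metric names are of interest, the names can be pulled out of the raw API response with standard text tools. This is a minimal sketch, not part of FPT Cloud's tooling: the function name `list_dcgm_metrics` is hypothetical, and it assumes the response lists each metric in a `"name"` field, as a Kubernetes APIResourceList does.

```shell
#!/bin/sh
# Hypothetical helper: reads the custom metrics API response (JSON) on stdin
# and prints each distinct metric name containing "DCGM", one per line.
# This is plain text matching, not a real JSON parser, so it assumes metric
# names contain no escaped quotes.
list_dcgm_metrics() {
  grep -o '"name": *"[^"]*DCGM[^"]*"' | cut -d'"' -f4 | sort -u
}

# Against a live cluster (needs kubectl access and the custom metrics API):
#   kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | list_dcgm_metrics
```

An empty result suggests that no DCGM metrics are registered with the custom metrics API yet.
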
    • Access Prometheus to view the GPU DCGM metrics:

    To forward the Prometheus service, run this command:

    kubectl port-forward service/kube-prometheus-stack-1679-prometheus 63090:9090

    Here 9090 is the Prometheus service port and 63090 is the local port on your client; kubectl port-forward takes the ports in the order local:remote. Prometheus is now accessible locally at http://localhost:63090/

    In the Prometheus interface, open the Metrics Explorer to view the GPU DCGM metrics.
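
The same metrics can also be read from the command line through the Prometheus HTTP API while the port-forward is running. The sketch below is illustrative only: the helper name `prom_query_url` is hypothetical, the local port is the one used in this guide, and `DCGM_FI_DEV_GPU_UTIL` is assumed to be among the metrics your exporter publishes.

```shell
#!/bin/sh
LOCAL_PORT=63090   # local port chosen for the Prometheus port-forward

# Hypothetical helper: print the instant-query URL for a metric name.
# Plain metric names need no URL encoding; for full PromQL expressions,
# prefer curl's --data-urlencode as in the second usage line below.
prom_query_url() {
  printf 'http://localhost:%s/api/v1/query?query=%s\n' "$LOCAL_PORT" "$1"
}

# With the port-forward running in another terminal:
#   curl -s "$(prom_query_url DCGM_FI_DEV_GPU_UTIL)"
#   curl -sG "http://localhost:${LOCAL_PORT}/api/v1/query" \
#        --data-urlencode 'query=avg(DCGM_FI_DEV_GPU_UTIL)'
```

The response is a JSON document; a `"status":"success"` field with an empty result list usually means the metric name is not being scraped.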

    • Access the Grafana Dashboard:

    To forward the Grafana service, run this command:

    kubectl port-forward service/kube-prometheus-stack-1679050354-grafana 63080:80

    Here 80 is the Grafana service port and 63080 is the local port on your client; kubectl port-forward takes the ports in the order local:remote. Grafana is now accessible locally at http://localhost:63080/

    The default username and password for logging into Grafana are:

    User: admin
    Password: prom-operator
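
If the default password has been changed, the current value can usually be read back from the cluster. The following is a sketch under assumptions rather than a documented FPT Cloud procedure: it assumes the Grafana chart stored the credentials in a Secret named after the Grafana service, under the key `admin-password`, and the helper name `decode_secret_value` is hypothetical.

```shell
#!/bin/sh
# Assumption: the Grafana Secret shares the service's name and stores the
# password base64-encoded under the key "admin-password".
GRAFANA_SECRET="kube-prometheus-stack-1679050354-grafana"

# Hypothetical helper: decode a base64 value read from stdin.
decode_secret_value() {
  base64 -d
}

# Against a live cluster (same namespace as the Grafana service):
#   kubectl get secret "$GRAFANA_SECRET" -o jsonpath='{.data.admin-password}' | decode_secret_value
```

Kubernetes stores Secret values base64-encoded, which is why the decode step is needed before the password can be used to log in.
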
    • Import the Grafana Dashboard for GPU metrics:

    To import the Dashboard, navigate to Grafana > Dashboards > Manage > Import.

    To use FPT Cloud’s GPU Dashboard, paste the FPT Cloud GPU Dashboard JSON as text or upload it as a JSON file, then click Load.

    FPT Cloud’s GPU Dashboard: