NVIDIA GPU metrics exporter for Prometheus leveraging DCGM

This is NVIDIA's DCGM-based Prometheus exporter for GPU metrics, which the Kubernetes metrics exporter leverages. It's a foundational tool for monitoring NVIDIA GPUs, often deployed as part of the NVIDIA GPU Operator.

Visit github.com →

Questions & Answers

What is NVIDIA DCGM-Exporter?
NVIDIA DCGM-Exporter is a project that exposes NVIDIA GPU metrics to Prometheus, leveraging the NVIDIA Data Center GPU Manager (DCGM) library. It provides detailed performance and health data for GPUs.
Who should use the NVIDIA DCGM-Exporter?
It is intended for administrators, developers, and SREs who need to monitor NVIDIA GPU performance and health in environments using Prometheus for observability, especially in Kubernetes clusters.
When should I use NVIDIA DCGM-Exporter for GPU monitoring?
Use DCGM-Exporter when you require detailed, real-time GPU metrics for your NVIDIA hardware within a Prometheus monitoring stack. For Kubernetes, it is often recommended to use the NVIDIA GPU Operator, which includes DCGM-Exporter.
How can I collect GPU metrics using DCGM-Exporter?
You can run DCGM-Exporter as a Docker container, which exposes metrics on port 9400, or deploy it via a Helm chart in Kubernetes. Prometheus then collects the data by scraping the exporter's /metrics endpoint.
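To see what a scrape of that endpoint returns, the sketch below fetches it and parses the Prometheus text format by hand. It assumes the exporter is reachable at localhost on port 9400. The sample payload is illustrative only: DCGM_FI_DEV_GPU_UTIL is a DCGM field name, but the label set and values shown here are made up, and the parser is a simplification that does not handle commas or escaped quotes inside label values.

```python
import urllib.request

# Assumed address: port 9400 and the /metrics path come from the answer above;
# "localhost" is a placeholder for wherever the exporter actually runs.
METRICS_URL = "http://localhost:9400/metrics"


def fetch_metrics(url=METRICS_URL, timeout=5.0):
    """Return the raw Prometheus text exposition served by the exporter."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse_metrics(text):
    """Parse exposition text into (name, labels, value) tuples.

    Simplified parser: assumes label values contain no commas,
    escaped quotes, or closing braces.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        labels = {}
        if "{" in line:
            name, rest = line.split("{", 1)
            label_str, value_str = rest.rsplit("}", 1)
            for pair in label_str.split(","):
                if pair:
                    key, val = pair.split("=", 1)
                    labels[key.strip()] = val.strip().strip('"')
        else:
            name, value_str = line.split(None, 1)
        # The value may be followed by an optional timestamp; keep the value only.
        value = float(value_str.split()[0])
        samples.append((name.strip(), labels, value))
    return samples


# Illustrative payload (hypothetical labels and values), not real exporter output.
SAMPLE = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-aaaa"} 37
DCGM_FI_DEV_GPU_UTIL{gpu="1",UUID="GPU-bbbb"} 82
"""

if __name__ == "__main__":
    # Swap SAMPLE for fetch_metrics() when an exporter is actually running.
    for name, labels, value in parse_metrics(SAMPLE):
        print(name, labels.get("gpu"), value)
```

In a real deployment Prometheus does this scraping itself; a script like this is mainly useful for checking that the exporter is up and returning the fields you expect.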
Can DCGM-Exporter integrate HPC job information into its metrics?
Yes, DCGM-Exporter can include High-Performance Computing (HPC) job IDs in metric labels. This requires configuring the HPC environment to generate GPU-to-job mapping files in a specified directory for the exporter to consume.
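As a rough sketch of the HPC side, a job prolog could write the mapping files and an epilog could remove them. The layout used here is an assumption, not something stated above: one file per GPU, named after the GPU index, containing one job ID per line. The function names, the job IDs, and the directory are hypothetical; check the project's documentation for the exact format the exporter expects.

```python
import os
import tempfile


def write_job_mapping(mapping_dir, gpu_jobs):
    """Write one mapping file per GPU (assumed layout).

    gpu_jobs maps a GPU index to the list of job IDs currently using it.
    Each file is named after the GPU index and holds one job ID per line.
    """
    os.makedirs(mapping_dir, exist_ok=True)
    for gpu_id, job_ids in gpu_jobs.items():
        path = os.path.join(mapping_dir, str(gpu_id))
        with open(path, "w", encoding="utf-8") as handle:
            for job_id in job_ids:
                handle.write(f"{job_id}\n")


def clear_job_mapping(mapping_dir, gpu_ids):
    """Remove mapping files for GPUs whose jobs have finished."""
    for gpu_id in gpu_ids:
        path = os.path.join(mapping_dir, str(gpu_id))
        if os.path.exists(path):
            os.remove(path)


if __name__ == "__main__":
    # A temporary directory stands in for the directory the exporter reads.
    with tempfile.TemporaryDirectory() as mapping_dir:
        write_job_mapping(mapping_dir, {0: ["job-101"], 1: ["job-101", "job-202"]})
        print(sorted(os.listdir(mapping_dir)))
        clear_job_mapping(mapping_dir, [0])
        print(sorted(os.listdir(mapping_dir)))
```

In practice the workload manager's prolog and epilog hooks would call something like this with the GPUs allocated to each job, and the exporter would be pointed at the same directory.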