GPU Monitor & Validator

Latest Version: 3.1.9

The GPU Monitor & Validator monitors, benchmarks, and validates the performance and health of NVIDIA and AMD GPUs in production environments.

Overview

The GPU Monitor & Validator provides:

Multi-vendor support: NVIDIA (CUDA) and AMD (ROCm) GPUs
Enhanced multi-GPU support: Improved authentication and validation for systems with multiple GPUs
Real-time performance monitoring
Hardware validation against manufacturer specs
Comprehensive benchmarking with PyTorch stress tests
Automated anomaly detection
Continuous health monitoring
Performance regression detection
Thermal management validation
Memory integrity verification
Vendor-agnostic unified interface

What's New in Version 3.1.9

Driver-agnostic installation: Package no longer interferes with existing NVIDIA/AMD drivers
Dynamic PyTorch detection: Automatically detects CUDA/ROCm versions and installs compatible PyTorch
CUDA 13.0 support: Full support for latest NVIDIA drivers including CUDA 13.0
Fixed permission issues: Resolved venv permission errors during installation
Better error visibility: Installation errors now properly displayed for easier troubleshooting

System Requirements

Operating Systems

Ubuntu 22.04 LTS (Jammy) or newer
Ubuntu 24.04 LTS (Noble)
Ubuntu 24.10 (Oracular)
Debian 11 (Bullseye) or newer
Debian 12 (Bookworm)

Hardware Requirements

NVIDIA GPUs: Compute Capability 7.0+ (V100, T4, A100, H100, RTX series)
AMD GPUs: ROCm-compatible (MI100, MI200, MI300 series, Radeon Pro VII, W7900)
2 MB free disk space (plus dependencies)

Software Requirements

Python 3.10 or newer
Internet connection for initial setup
For NVIDIA: Any NVIDIA driver version (package is driver-agnostic)
For AMD: Any ROCm version (package is driver-agnostic)
Note: The package will NOT install or modify GPU drivers
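
The requirements above can be checked before installation with a short script. This is a minimal sketch, not part of the package: the function name `check_prerequisites` is hypothetical, and looking for `nvidia-smi` or `rocm-smi` on PATH is only a rough hint that a driver stack is present.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)            # from "Python 3.10 or newer"
MIN_FREE_BYTES = 2 * 1024 ** 2  # from "2 MB free disk space"


def check_prerequisites(path="/"):
    """Return simple pass/fail checks for the documented requirements."""
    free_bytes = shutil.disk_usage(path).free
    return {
        "python_ok": sys.version_info[:2] >= MIN_PYTHON,
        "disk_ok": free_bytes >= MIN_FREE_BYTES,
        # A vendor tool on PATH suggests a driver stack is already installed.
        "nvidia_smi_found": shutil.which("nvidia-smi") is not None,
        "rocm_smi_found": shutil.which("rocm-smi") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'yes' if ok else 'no'}")
```

The disk check covers only the package itself; dependencies such as PyTorch need considerably more space.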

Installation

Option 1: Ubuntu PPA (Recommended for Ubuntu)

# Add the PPA repository (non-interactive)
sudo add-apt-repository ppa:cl-ax/gpu-validator -y
sudo apt update

# Install GPU Agent (non-interactive)
sudo apt install cl-gpu-agent -y

Option 2: APT Repository (Debian/Ubuntu)

# Add repository (note: [trusted=yes] disables APT signature verification for this source)
echo 'deb [trusted=yes] https://storage.googleapis.com/gpu_validator/apt-repo stable main' | \
  sudo tee /etc/apt/sources.list.d/computelabs-gpu-agent.list

# Update package list
sudo apt update

# Install GPU Agent
sudo apt install cl-gpu-agent

Option 3: Manual Installation

For systems that don't meet the requirements or need custom configuration:

# Clone the repository
git clone https://github.com/compute-labs-dev/gpu_validator_monitor.git
cd gpu_validator_monitor

# Run the installation script
sudo ./scripts/install.sh

Operation Modes

The GPU Validator supports four primary operation modes:

1. Validate Mode

cl-gpu-agent --mode validate

Performs:

Hardware specification verification
Memory integrity tests
CUDA functionality checks
Driver compatibility validation

2. Benchmark Mode

cl-gpu-agent --mode benchmark

Tests:

Memory bandwidth
CUDA core performance
Tensor core capabilities
PCIe throughput
Power efficiency

3. Monitor Mode (Default Service)

# Start monitoring service
sudo systemctl start cl-gpu-agent

# Enable automatic startup
sudo systemctl enable cl-gpu-agent

Tracks:

Real-time temperature
Power consumption
Memory usage
GPU utilization
Error rates
Performance metrics

4. Detect Mode

cl-gpu-agent --mode detect

Detects and displays:

Available GPUs
GPU models and specifications
Driver versions
CUDA/ROCm capabilities

Usage

Service Management

# Check service status
sudo systemctl status cl-gpu-agent

# Start the service
sudo systemctl start cl-gpu-agent

# Stop the service
sudo systemctl stop cl-gpu-agent

# Enable automatic startup
sudo systemctl enable cl-gpu-agent

# View real-time logs
journalctl -u cl-gpu-agent -f

Direct Command Usage

# Show help and available options
cl-gpu-agent --help

# Run comprehensive validation
cl-gpu-agent --mode validate

# Run performance benchmark
cl-gpu-agent --mode benchmark

# Run monitoring (foreground)
cl-gpu-agent --mode monitor
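
For automation, the commands above can be wrapped in a small helper. The sketch below uses only the `--mode` values listed in this document; the helper names `build_command` and `run_mode` are hypothetical. This document does not define the agent's exit codes, so the wrapper returns them without interpreting them.

```python
import subprocess

# Mode names as listed in this document.
MODES = ("validate", "benchmark", "monitor", "detect")


def build_command(mode):
    """Build the argv list for a documented mode, rejecting unknown names."""
    if mode not in MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {MODES}")
    return ["cl-gpu-agent", "--mode", mode]


def run_mode(mode, timeout=None):
    """Run the agent and return (exit_code, stdout, stderr).

    The exit code is None if cl-gpu-agent is not on PATH.
    subprocess.TimeoutExpired is raised if the timeout elapses.
    """
    try:
        proc = subprocess.run(
            build_command(mode), capture_output=True, text=True, timeout=timeout
        )
    except FileNotFoundError:
        return None, "", "cl-gpu-agent not found on PATH"
    return proc.returncode, proc.stdout, proc.stderr
```

Monitor mode runs in the foreground until stopped, so pass a `timeout` when calling `run_mode("monitor")` from a script.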

Understanding Results

The validator provides three key metrics:

Validation Status: PASS/FAIL for hardware specs
Performance Scores: Compared to reference benchmarks
Health Indicators: Temperature/power trends

Example success output for an NVIDIA GPU:

GPU 0 (NVIDIA H100 80GB HBM3):
- Validation: PASS
- Memory Bandwidth: 3.1 TB/s
- Temperature: 42°C (Max: 85°C)
- Power Draw: 119W/700W
- CUDA Cores: 100% functional
- Memory Test: PASS
- Driver Status: Optimal
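
Output in this shape can be turned into structured data for alerting. The parser below is a sketch that assumes the layout of the example above (a `GPU N (name):` header followed by `- key: value` lines). The real output format may differ or change between versions, and the function names are hypothetical.

```python
import re

HEADER = re.compile(r"^GPU (\d+) \((.+)\):$")
FIELD = re.compile(r"^- ([^:]+): (.+)$")


def parse_report(text):
    """Parse report text into {gpu_index: {"name": ..., "fields": {...}}}."""
    gpus = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = HEADER.match(line)
        if header:
            current = {"name": header.group(2), "fields": {}}
            gpus[int(header.group(1))] = current
            continue
        field = FIELD.match(line)
        # Ignore field lines that appear before any GPU header.
        if field and current is not None:
            current["fields"][field.group(1)] = field.group(2)
    return gpus


def failed_checks(gpus):
    """List (gpu_index, field_name) pairs whose value is exactly FAIL."""
    return [
        (index, key)
        for index, gpu in gpus.items()
        for key, value in gpu["fields"].items()
        if value == "FAIL"
    ]
```

Values are kept as raw strings; converting units such as `TB/s` or `W` is left to the caller.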

Supported GPU Models

NVIDIA GPUs

Data Center: H100, H200
RTX 40 Series: RTX 4090, 4080
RTX 50 Series: RTX 5090, 5080

AMD GPUs

Tested and Verified: MI300X
Other ROCm-supported GPUs may work but are untested

Troubleshooting

Common Issues

Permission Denied During Installation

If you see [Errno 13] Permission denied during installation:

# Remove old installation
sudo rm -rf /opt/cl-gpu-agent

# Reinstall
sudo apt install --reinstall cl-gpu-agent

NVIDIA Driver Version Mismatch

As of v3.1.3, the package no longer installs or modifies GPU drivers. If you suspect a driver problem, check the driver that is currently installed:

# Check your current driver
nvidia-smi

# The package will detect and use your existing driver
# No driver changes will be made

CUDA 13.0 or Newer Support

If your driver reports CUDA 13.0 or newer (for example, driver 580.x):

The package detects this automatically
PyTorch is installed with CUDA 12.4 support (cu124)
NVIDIA drivers are backward compatible with older CUDA runtimes, so a CUDA 13.0 driver runs the cu124 PyTorch build
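
The selection described above can be illustrated as follows. This is a sketch of the general idea, not the package's actual detection code: the candidate tag list is an assumption (only cu124 is named in this document), and the function names are hypothetical. It relies on the `CUDA Version: X.Y` field that `nvidia-smi` prints in its header.

```python
import re

# Assumed candidate wheel tags, oldest to newest; only cu124 appears in this document.
CANDIDATE_TAGS = (((11, 8), "cu118"), ((12, 1), "cu121"), ((12, 4), "cu124"))


def parse_cuda_version(smi_output):
    """Extract (major, minor) from the 'CUDA Version: X.Y' field of nvidia-smi output."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def pick_wheel_tag(driver_cuda):
    """Pick the newest candidate tag whose CUDA version the driver supports."""
    supported = [tag for version, tag in CANDIDATE_TAGS if version <= driver_cuda]
    return supported[-1] if supported else None
```

With this table, a driver reporting CUDA 13.0 maps to `cu124`, matching the behavior described above, while a driver older than every candidate yields `None`.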

Package Version Not Updating

If apt is not getting the latest version:

# Clear cache and force update
sudo apt clean
sudo apt update

# Check available versions
apt policy cl-gpu-agent

# Force reinstall
sudo apt install --reinstall cl-gpu-agent
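
To check from a script whether apt sees a version other than the installed one, the `Installed:` and `Candidate:` lines of the policy output can be compared. The sketch below calls `apt-cache policy`, which prints the same report as `apt policy` and is meant for scripted use; the function names are hypothetical.

```python
import re
import subprocess


def parse_apt_policy(output):
    """Return (installed, candidate) version strings from apt policy output."""
    installed = re.search(r"^\s*Installed:\s*(\S+)", output, re.MULTILINE)
    candidate = re.search(r"^\s*Candidate:\s*(\S+)", output, re.MULTILINE)
    return (
        installed.group(1) if installed else None,
        candidate.group(1) if candidate else None,
    )


def update_available(package="cl-gpu-agent"):
    """Report whether apt's candidate version differs from the installed one."""
    proc = subprocess.run(
        ["apt-cache", "policy", package], capture_output=True, text=True, check=True
    )
    installed, candidate = parse_apt_policy(proc.stdout)
    return candidate is not None and candidate != installed
```

The comparison only detects that the two versions differ; it does not order them. Run `sudo apt update` first so the candidate reflects the current repository contents.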
