Compute Labs

GPU Monitor & Validator

Professional-grade GPU monitoring and validation tool

Latest Version: 3.1.9

The GPU Monitor & Validator is a professional-grade tool for ensuring optimal performance and health of both NVIDIA and AMD GPUs in production environments.

Overview

The GPU Monitor & Validator provides:

  • Multi-vendor support: NVIDIA (CUDA) and AMD (ROCm) GPUs
  • Enhanced multi-GPU support: Improved authentication and validation for systems with multiple GPUs
  • Real-time performance monitoring
  • Hardware validation against manufacturer specs
  • Comprehensive benchmarking with PyTorch stress tests
  • Automated anomaly detection
  • Continuous health monitoring
  • Performance regression detection
  • Thermal management validation
  • Memory integrity verification
  • Vendor-agnostic unified interface

What's New in Version 3.1.9

  • Driver-agnostic installation: Package no longer interferes with existing NVIDIA/AMD drivers
  • Dynamic PyTorch detection: Automatically detects CUDA/ROCm versions and installs compatible PyTorch
  • CUDA 13.0 support: Full support for latest NVIDIA drivers including CUDA 13.0
  • Fixed permission issues: Resolved venv permission errors during installation
  • Better error visibility: Installation errors now properly displayed for easier troubleshooting

System Requirements

Operating Systems

  • Ubuntu 22.04 LTS (Jammy) or newer
  • Ubuntu 24.04 LTS (Noble)
  • Ubuntu 24.10 (Oracular)
  • Debian 11 (Bullseye) or newer
  • Debian 12 (Bookworm)

Hardware Requirements

  • NVIDIA GPUs: Compute Capability 7.0+ (V100, T4, A100, H100, RTX series)
  • AMD GPUs: ROCm-compatible (MI100, MI200, MI300 series, Radeon Pro VII, W7900)
  • 2 MB free disk space (plus dependencies)

Software Requirements

  • Python 3.10 or newer
  • Internet connection for initial setup
  • For NVIDIA: Any NVIDIA driver version (package is driver-agnostic)
  • For AMD: Any ROCm version (package is driver-agnostic)
  • Note: The package will NOT install or modify GPU drivers
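
The requirements above can be checked before installing. The following is a minimal sketch, not part of the package: the function names `version_ge` and `preflight` are illustrative, and it assumes GNU `sort` (for `-V`) and that `nvidia-smi` or `rocm-smi` is on the PATH when the corresponding driver stack is installed.

```shell
#!/usr/bin/env bash
# Preflight sketch: checks the documented prerequisites before installing.
# Function names are illustrative and not part of cl-gpu-agent.

# Return 0 if dotted version $1 >= $2 (relies on GNU sort -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

preflight() {
  local py_version
  if ! command -v python3 >/dev/null 2>&1; then
    echo "python3 not found"
    return 1
  fi
  py_version="$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')"
  if version_ge "$py_version" "3.10"; then
    echo "Python $py_version: OK"
  else
    echo "Python $py_version: too old (need 3.10+)"
    return 1
  fi
  # The package does not install drivers, so vendor tools must already exist.
  if command -v nvidia-smi >/dev/null 2>&1; then
    echo "NVIDIA driver tools found"
  elif command -v rocm-smi >/dev/null 2>&1; then
    echo "ROCm tools found"
  else
    echo "No nvidia-smi or rocm-smi on PATH"
    return 1
  fi
}
```

Source the file and run `preflight`; a non-zero return means at least one prerequisite is missing.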

Installation

Option 1: PPA (Ubuntu)

# Add the PPA repository (non-interactive)
sudo add-apt-repository ppa:cl-ax/gpu-validator -y
sudo apt update

# Install GPU Agent (non-interactive)
sudo apt install cl-gpu-agent -y

Option 2: APT Repository (Debian/Ubuntu)

# Add repository
echo 'deb [trusted=yes] https://storage.googleapis.com/gpu_validator/apt-repo stable main' | \
  sudo tee /etc/apt/sources.list.d/computelabs-gpu-agent.list

# Update package list
sudo apt update

# Install GPU Agent
sudo apt install cl-gpu-agent

Option 3: Manual Installation

For systems that don't meet the requirements or need custom configuration:

# Clone the repository
git clone https://github.com/compute-labs-dev/gpu_validator_monitor.git
cd gpu_validator_monitor

# Run the installation script
sudo ./scripts/install.sh

Operation Modes

The GPU Monitor & Validator supports four primary operation modes:

1. Validate Mode

cl-gpu-agent --mode validate

Performs:

  • Hardware specification verification
  • Memory integrity tests
  • CUDA functionality checks
  • Driver compatibility validation

2. Benchmark Mode

cl-gpu-agent --mode benchmark

Tests:

  • Memory bandwidth
  • CUDA core performance
  • Tensor core capabilities
  • PCIe throughput
  • Power efficiency

3. Monitor Mode (Default Service)

# Start monitoring service
sudo systemctl start cl-gpu-agent

# Enable automatic startup
sudo systemctl enable cl-gpu-agent

Tracks:

  • Real-time temperature
  • Power consumption
  • Memory usage
  • GPU utilization
  • Error rates
  • Performance metrics

4. Detect Mode

cl-gpu-agent --mode detect

Detects and displays:

  • Available GPUs
  • GPU models and specifications
  • Driver versions
  • CUDA/ROCm capabilities

Usage

Service Management

# Check service status
sudo systemctl status cl-gpu-agent

# Start the service
sudo systemctl start cl-gpu-agent

# Stop the service
sudo systemctl stop cl-gpu-agent

# Enable automatic startup
sudo systemctl enable cl-gpu-agent

# View real-time logs
journalctl -u cl-gpu-agent -f

Direct Command Usage

# Show help and available options
cl-gpu-agent --help

# Run comprehensive validation
cl-gpu-agent --mode validate

# Run performance benchmark
cl-gpu-agent --mode benchmark

# Run monitoring (foreground)
cl-gpu-agent --mode monitor
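
The validate and benchmark commands can be chained so that a benchmark only runs on hardware that passed validation. This is a sketch under one important assumption that this page does not confirm: that `cl-gpu-agent` exits with a non-zero status when a mode fails. The `run_checks` function and the `AGENT` override variable are illustrative names.

```shell
#!/usr/bin/env bash
# Sketch: run validation first and benchmark only if validation succeeds.
# ASSUMPTION: cl-gpu-agent exits non-zero on failure (not confirmed by the docs).
# AGENT can be overridden, for example with a stub command for a dry run.

run_checks() {
  local agent="${AGENT:-cl-gpu-agent}"
  if ! "$agent" --mode validate; then
    echo "validation failed; skipping benchmark" >&2
    return 1
  fi
  "$agent" --mode benchmark
}
```

Call `run_checks` directly, or `AGENT=true run_checks` to exercise the control flow without touching the GPU.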

Understanding Results

The validator provides three key metrics:

  1. Validation Status: PASS/FAIL for hardware specs
  2. Performance Scores: Compared to reference benchmarks
  3. Health Indicators: Temperature/power trends

Example output for a healthy NVIDIA GPU:

GPU 0 (NVIDIA H100 80GB HBM3):
- Validation: PASS
- Memory Bandwidth: 3.1 TB/s
- Temperature: 42°C (Max: 85°C)
- Power Draw: 119W/700W
- CUDA Cores: 100% functional
- Memory Test: PASS
- Driver Status: Optimal
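
For scripted checks, a report in the layout shown above can be summarized with `awk`. This sketch assumes the real output matches the sample line for line, which may not hold across versions; `summarize_report` and `report_passed` are illustrative names, not agent features.

```shell
#!/usr/bin/env bash
# Sketch: summarize a saved report in the format of the sample output.
# ASSUMPTION: real output matches the sample layout; adjust the patterns if not.

# Print "<gpu header> => <validation status>" for each GPU block on stdin.
summarize_report() {
  awk '
    /^GPU [0-9]+ /    { gpu = $0; sub(/:$/, "", gpu) }
    /^- Validation: / { print gpu " => " $3 }
  '
}

# Return 0 only if at least one Validation line exists and none is non-PASS.
report_passed() {
  awk '
    /^- Validation: / { seen = 1; if ($3 != "PASS") bad = 1 }
    END { exit (seen && !bad) ? 0 : 1 }
  '
}
```

With the report saved to a file (the name `report.txt` is hypothetical), `summarize_report < report.txt` lists each GPU with its status, and `report_passed < report.txt` can gate a deployment step.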

Supported GPU Models

NVIDIA GPUs

  • Data Center: H100, H200
  • RTX 40 Series: RTX 4090, 4080
  • RTX 50 Series: RTX 5090, 5080

AMD GPUs

  • Tested and Verified: MI300X
  • Other ROCm-supported GPUs may work but are untested

Troubleshooting

Common Issues

Permission Denied During Installation

If you see [Errno 13] Permission denied during installation:

# Remove old installation
sudo rm -rf /opt/cl-gpu-agent

# Reinstall
sudo apt install --reinstall cl-gpu-agent

NVIDIA Driver Version Mismatch

In v3.1.3 and later, the package does not install or modify GPU drivers. If you have driver issues:

# Check your current driver
nvidia-smi

# The package will detect and use your existing driver
# No driver changes will be made

CUDA 13.0 or Newer Support

If your driver reports CUDA 13.0 or newer (for example, the 580.x driver series):

  • The package detects this automatically
  • PyTorch is installed with CUDA 12.4 support (the cu124 build)
  • This works because NVIDIA drivers are backward compatible: a CUDA 13.0 driver can run applications built against CUDA 12.4
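
To see which CUDA version the driver reports, the header printed by `nvidia-smi` can be parsed. This is a small sketch and not the package's own detection logic, which is not documented here; `driver_cuda_version` is an illustrative name.

```shell
#!/usr/bin/env bash
# Sketch: extract the "CUDA Version" field from nvidia-smi's header on stdin.
# This is NOT the package's detection logic, only a manual check.

driver_cuda_version() {
  sed -n 's/.*CUDA Version: \([0-9][0-9.]*\).*/\1/p' | head -n 1
}
```

Run it as `nvidia-smi | driver_cuda_version`. To compare against the CUDA version PyTorch was built for, `python3 -c 'import torch; print(torch.version.cuda)'` prints that value when executed with the Python environment the agent uses, whose location is not specified on this page.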

Package Version Not Updating

If apt is not getting the latest version:

# Clear cache and force update
sudo apt clean
sudo apt update

# Check available versions
apt policy cl-gpu-agent

# Force reinstall
sudo apt install --reinstall cl-gpu-agent
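
Whether an update is actually pending can be read from the policy output before forcing a reinstall. The sketch below parses the `Installed:` and `Candidate:` lines; `policy_status` is an illustrative name, and it reads from `apt-cache policy` because `apt` itself warns that its command-line output is not stable for scripting.

```shell
#!/usr/bin/env bash
# Sketch: report whether the installed version matches the candidate version.
# Reads `apt-cache policy <package>` output on stdin.

policy_status() {
  awk '
    $1 == "Installed:" { installed = $2 }
    $1 == "Candidate:" { candidate = $2 }
    END {
      if (installed == "" || candidate == "") { print "unknown"; exit 1 }
      # Append "" to force string comparison of version numbers.
      if ((installed "") == (candidate "")) print "up-to-date: " installed
      else print "update available: " installed " -> " candidate
    }
  '
}
```

Run it as `apt-cache policy cl-gpu-agent | policy_status`. If the candidate is still older than expected after `sudo apt update`, the repository entry itself is the more likely cause.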