GPU Monitor & Validator
Professional-grade GPU monitoring and validation tool
Latest Version: 3.1.9
The GPU Monitor & Validator is a professional-grade tool for ensuring optimal performance and health of both NVIDIA and AMD GPUs in production environments.
Overview
The GPU Monitor & Validator provides:
- Multi-vendor support: NVIDIA (CUDA) and AMD (ROCm) GPUs
- Enhanced multi-GPU support: Improved authentication and validation for systems with multiple GPUs
- Real-time performance monitoring
- Hardware validation against manufacturer specs
- Comprehensive benchmarking with PyTorch stress tests
- Automated anomaly detection
- Continuous health monitoring
- Performance regression detection
- Thermal management validation
- Memory integrity verification
- Vendor-agnostic unified interface
What's New in Version 3.1.9
- Driver-agnostic installation: Package no longer interferes with existing NVIDIA/AMD drivers
- Dynamic PyTorch detection: Automatically detects CUDA/ROCm versions and installs compatible PyTorch
- CUDA 13.0 support: Full support for latest NVIDIA drivers including CUDA 13.0
- Fixed permission issues: Resolved venv permission errors during installation
- Better error visibility: Installation errors now properly displayed for easier troubleshooting
System Requirements
Operating Systems
- Ubuntu 22.04 LTS (Jammy) or newer
- Ubuntu 24.04 LTS (Noble)
- Ubuntu 24.10 (Oracular)
- Debian 11 (Bullseye) or newer
- Debian 12 (Bookworm)
Hardware Requirements
- NVIDIA GPUs: Compute Capability 7.0+ (V100, T4, A100, H100, RTX series)
- AMD GPUs: ROCm-compatible (MI100, MI200, MI300 series, Radeon Pro VII, W7900)
- 2 MB of free disk space (plus dependencies)
Software Requirements
- Python 3.10 or newer
- Internet connection for initial setup
- For NVIDIA: Any NVIDIA driver version (package is driver-agnostic)
- For AMD: Any ROCm version (package is driver-agnostic)
- Note: The package will NOT install or modify GPU drivers
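The requirements above can be checked before installing. The following is a minimal sketch, not part of the package: the function names are made up for illustration, and it only looks for Python 3.10+ and for a vendor tool (`nvidia-smi` or `rocm-smi`) on the PATH. It installs and modifies nothing.

```shell
# Pre-install check sketch (hypothetical helper names, not shipped with cl-gpu-agent).

# Succeed if version $1 is at least version $2 (e.g. 3.12 vs 3.10).
version_at_least() {
    # sort -V orders version strings; the smaller one comes first.
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Report which vendor tool is on PATH: nvidia, amd, or none.
detect_gpu_tool() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia"
    elif command -v rocm-smi >/dev/null 2>&1; then
        echo "amd"
    else
        echo "none"
    fi
}

preflight() {
    local py_version
    # Falls back to 0.0 when python3 is missing, so the check below fails cleanly.
    py_version="$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])' 2>/dev/null || echo "0.0")"
    if version_at_least "$py_version" "3.10"; then
        echo "python: OK ($py_version)"
    else
        echo "python: needs 3.10 or newer (found $py_version)"
    fi
    echo "gpu tool: $(detect_gpu_tool)"
}

preflight
```

A result of `gpu tool: none` usually means the vendor driver stack is not installed yet; since the package does not install drivers, that has to be resolved separately.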
Installation
Option 1: Ubuntu PPA (Recommended for Ubuntu)
# Add the PPA repository (non-interactive)
sudo add-apt-repository ppa:cl-ax/gpu-validator -y
sudo apt update
# Install GPU Agent (non-interactive)
sudo apt install cl-gpu-agent -y
Option 2: APT Repository (Debian/Ubuntu)
# Add repository
echo 'deb [trusted=yes] https://storage.googleapis.com/gpu_validator/apt-repo stable main' | \
sudo tee /etc/apt/sources.list.d/computelabs-gpu-agent.list
# Update package list
sudo apt update
# Install GPU Agent
sudo apt install cl-gpu-agent
Option 3: Manual Installation
For systems that don't meet the requirements or need custom configuration:
# Clone the repository
git clone https://github.com/compute-labs-dev/gpu_validator_monitor.git
cd gpu_validator_monitor
# Run the installation script
sudo ./scripts/install.sh
Operation Modes
The GPU Validator supports four primary operation modes:
1. Validate Mode
cl-gpu-agent --mode validate
Performs:
- Hardware specification verification
- Memory integrity tests
- CUDA functionality checks
- Driver compatibility validation
2. Benchmark Mode
cl-gpu-agent --mode benchmark
Tests:
- Memory bandwidth
- CUDA core performance
- Tensor core capabilities
- PCIe throughput
- Power efficiency
3. Monitor Mode (Default Service)
# Start monitoring service
sudo systemctl start cl-gpu-agent
# Enable automatic startup
sudo systemctl enable cl-gpu-agent
Tracks:
- Real-time temperature
- Power consumption
- Memory usage
- GPU utilization
- Error rates
- Performance metrics
4. Detect Mode
cl-gpu-agent --mode detect
Detects and displays:
- Available GPUs
- GPU models and specifications
- Driver versions
- CUDA/ROCm capabilities
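The one-shot modes (detect, validate, benchmark) can be chained into a single check run that keeps a log per mode. This is a rough sketch under two assumptions that the documentation does not confirm: that `cl-gpu-agent` exits with a non-zero status when a mode fails, and that running the modes back to back is acceptable on your system. `GPU_AGENT` is a hypothetical override variable added here so the command can be swapped out; it is not an option of the tool.

```shell
# Run a command, capture its output in a log file, echo it, and return its exit status.
run_logged() {
    local log_file="$1" status=0
    shift
    "$@" >"$log_file" 2>&1 || status=$?
    cat "$log_file"
    return "$status"
}

# Run each given mode in order; stop at the first mode that reports a failure.
# $1 = log directory, remaining arguments = mode names.
run_modes() {
    local log_dir="$1" mode
    shift
    mkdir -p "$log_dir" || return 1
    for mode in "$@"; do
        echo "== $mode =="
        # Assumption: a non-zero exit status means the mode failed.
        if ! run_logged "$log_dir/$mode.log" "${GPU_AGENT:-cl-gpu-agent}" --mode "$mode"; then
            echo "mode '$mode' reported a failure; see $log_dir/$mode.log" >&2
            return 1
        fi
    done
}

# Usage:
#   run_modes "$HOME/gpu-checks/$(date +%Y%m%d-%H%M%S)" detect validate benchmark
```

Monitor mode is left out on purpose: it runs continuously, so it is better handled through the systemd service than in a one-shot script.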
Usage
Service Management
# Check service status
sudo systemctl status cl-gpu-agent
# Start the service
sudo systemctl start cl-gpu-agent
# Stop the service
sudo systemctl stop cl-gpu-agent
# Enable automatic startup
sudo systemctl enable cl-gpu-agent
# View real-time logs
journalctl -u cl-gpu-agent -f
Direct Command Usage
# Show help and available options
cl-gpu-agent --help
# Run comprehensive validation
cl-gpu-agent --mode validate
# Run performance benchmark
cl-gpu-agent --mode benchmark
# Run monitoring (foreground)
cl-gpu-agent --mode monitor
Understanding Results
The validator provides three key metrics:
- Validation Status: PASS/FAIL for hardware specs
- Performance Scores: Compared to reference benchmarks
- Health Indicators: Temperature/power trends
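Because the validation status is a plain PASS/FAIL, a saved report can be scanned for failures automatically. The sketch below assumes the report uses the line format of the example output in this section (a `GPU N (...)` header followed by `- Validation: PASS` and `- Memory Test: PASS` lines); the real output may differ between versions, and `check_report` is a hypothetical helper name.

```shell
# Print every "Validation:" or "Memory Test:" line whose status is not PASS,
# prefixed with the GPU header it belongs to.
# Returns 0 when all such lines are PASS, 1 otherwise.
check_report() {
    awk '
        /^GPU [0-9]+/ { gpu = $0 }                      # remember the current GPU header
        /(Validation|Memory Test):/ && $NF != "PASS" {  # last field is the status
            print gpu " " $0
            failed = 1
        }
        END { exit failed }
    ' "$1"
}

# Usage:
#   cl-gpu-agent --mode validate > report.txt
#   check_report report.txt || echo "at least one GPU failed validation"
```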
Example success output for an NVIDIA GPU:
GPU 0 (NVIDIA H100 80GB HBM3):
- Validation: PASS
- Memory Bandwidth: 3.1 TB/s
- Temperature: 42°C (Max: 85°C)
- Power Draw: 119W/700W
- CUDA Cores: 100% functional
- Memory Test: PASS
- Driver Status: Optimal
Supported GPU Models
NVIDIA GPUs
- Data Center: H100, H200
- RTX 40 Series: RTX 4090, 4080
- RTX 50 Series: RTX 5090, 5080
AMD GPUs
- Tested and Verified: MI300X
- Other ROCm-supported GPUs may work but are untested
Troubleshooting
Common Issues
Permission Denied During Installation
If you see [Errno 13] Permission denied during installation:
# Remove old installation
sudo rm -rf /opt/cl-gpu-agent
# Reinstall
sudo apt install --reinstall cl-gpu-agent
NVIDIA Driver Version Mismatch
As of v3.1.3, the package no longer installs or modifies GPU drivers. If you have driver issues:
# Check your current driver
nvidia-smi
# The package will detect and use your existing driver
# No driver changes will be made
CUDA 13.0 or Newer Support
If your driver reports CUDA 13.0 or newer (for example, driver 580.x):
- The package detects this automatically
- PyTorch is installed with CUDA 12.4 support (cu124)
- This works because NVIDIA drivers are backward compatible: a driver that supports CUDA 13.0 also runs applications built against CUDA 12.4
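To confirm the pairing by hand, compare the CUDA version the driver reports with the CUDA version the PyTorch build targets. This is a conservative sketch: it reads the version from the `nvidia-smi` banner line and from `torch.version.cuda`, and treats the setup as compatible when the driver's version is at least PyTorch's. `PYTHON_BIN` is a hypothetical variable; point it at whichever interpreter has PyTorch installed, since the location of the agent's own environment is not documented here.

```shell
# Succeed when the driver CUDA version ($1) is at least the version
# PyTorch was built against ($2), e.g. cuda_compatible 13.0 12.4.
cuda_compatible() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Read nvidia-smi output on stdin and print the version from the
# banner line, which contains "CUDA Version: <major>.<minor>".
parse_cuda_version() {
    sed -n 's/.*CUDA Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# CUDA version the installed PyTorch build targets (prints None for
# CPU-only or ROCm builds).
torch_cuda_version() {
    "${PYTHON_BIN:-python3}" -c 'import torch; print(torch.version.cuda)'
}

# Usage on a machine with an NVIDIA driver and PyTorch installed:
#   drv="$(nvidia-smi | parse_cuda_version)"
#   pt="$(torch_cuda_version)"
#   cuda_compatible "$drv" "$pt" && echo "driver CUDA $drv can run PyTorch built for CUDA $pt"
```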
Package Version Not Updating
If apt is not getting the latest version:
# Clear cache and force update
sudo apt clean
sudo apt update
# Check available versions
apt policy cl-gpu-agent
# Force reinstall
sudo apt install --reinstall cl-gpu-agent
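The same check can be scripted. The sketch below parses the `Installed:` and `Candidate:` lines that `apt-cache policy` prints (the script-friendly counterpart of `apt policy`) and reports whether they differ. The helper names are illustrative only.

```shell
# Read "apt-cache policy" output on stdin and print the value of one field.
# $1 = Installed or Candidate
policy_field() {
    awk -v key="$1:" '$1 == key { print $2; exit }'
}

# Read "apt-cache policy" output on stdin; succeed when the installed
# version differs from the candidate version apt would install.
needs_upgrade() {
    local policy installed candidate
    policy="$(cat)"
    installed="$(printf '%s\n' "$policy" | policy_field Installed)"
    candidate="$(printf '%s\n' "$policy" | policy_field Candidate)"
    echo "installed=$installed candidate=$candidate"
    [ "$installed" != "$candidate" ]
}

# Usage:
#   sudo apt update
#   apt-cache policy cl-gpu-agent | needs_upgrade && echo "a different version is available"
```

If the candidate is still older than expected after `sudo apt update`, the repository entry itself is the likely cause, so check the source list configured during installation.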