Comprehensive benchmarking framework for GPU tensor operations with roofline model analysis, multi-algorithm comparison, and advanced performance characterization.
Features • Quick Start • Benchmarks • Results • Architecture
TensorBench is a production-grade CUDA benchmarking suite designed for comprehensive performance analysis of tensor operations on NVIDIA GPUs. It provides deep insights into GPU utilization, memory hierarchy behavior, and comparative performance across multiple algorithms.
- Performance Profiling: Detailed analysis of tensor operation performance
- Architecture Evaluation: Compare different implementation strategies
- Optimization Research: Identify bottlenecks and optimization opportunities
- GPU Capability Assessment: Understand your hardware's strengths and limitations
- Comparative Studies: Benchmark naive vs. optimized implementations
- cuBLAS Optimized: Highly-tuned vendor library implementation
- Naive Kernel: Reference implementation for correctness validation
- Fused Operations: Advanced multi-operation kernels
- Roofline Model: Theoretical performance ceiling computation
- Arithmetic Intensity: FLOPS/Byte analysis for memory vs. compute bottleneck classification
- Cache Behavior: L1/L2 cache miss estimation
- Memory Bandwidth: Real-time measurement and analysis
- Statistical Analysis: Mean, variance, standard deviation, 95% confidence intervals
- Thermal Throttling: Risk assessment based on sustained execution
- Efficiency Metrics: Peak efficiency percentage calculation
- Variance Analysis: Execution time stability tracking
- Power Efficiency: GFLOPS/Watt estimation
- GPU Properties: Full device capability reporting
- CSV exports for advanced analysis
- Multi-phase benchmarking (single ops, fused ops, batch processing)
- Detailed performance logs with human-readable formatting
```
TensorBench/
├── CMakeLists.txt                        # Build configuration
├── README.md                             # This file
├── include/
│   ├── MatrixFP16.cuh                    # FP16 matrix class definition
│   ├── MatrixFP32.cuh                    # FP32 matrix class definition
│   ├── naive_tensor_tgemm.cuh            # Naive GEMM kernel header
│   └── utils.cuh                         # Utility functions
├── src/
│   ├── MatrixFP16.cu                     # FP16 matrix implementation
│   ├── MatrixFP32.cu                     # FP32 matrix implementation
│   ├── naive_tensor_tgemm.cu             # Naive GEMM kernel implementation
│   └── utils.cu                          # Utility implementations
├── test/
│   ├── 00_benchmark_cuBLAS.cu            # Test 1: cuBLAS baseline
│   ├── 01_benchmark_naive.cu             # Test 2: naive implementation
│   ├── 02_benchmark_mixed_precision.cu   # Test 3: mixed-precision analysis
│   ├── 03_benchmark_scaling.cu           # Test 4: strong scaling
│   ├── 04_benchmark_stress_test.cu       # Test 5: stress testing
│   └── 05_benchmark_advanced_tensor_ops.cu  # Test 6: advanced analysis
└── build/                                # CMake build output
    └── *.out                             # Compiled executables
```
- NVIDIA GPU with compute capability 6.1+ (Pascal or newer)
- CUDA Toolkit 12.0 or later
- CMake 3.18+
- GCC/G++ 9+ or LLVM/Clang 10+
For the simplest setup, run the appropriate command below. These scripts automatically handle the CMake configuration and compilation.
On Linux/macOS, run:

```bash
./build.sh
```

On Windows, run:

```bat
build.bat
```

Alternatively, build manually:
- Clone and navigate to the project:
cd TensorBench- Configure with CMake:
cmake -S . -B build -DCMAKE_EXPORT_COMPILE_COMMANDS=ON- Build all benchmarks:
cmake --build build -j$(nproc)- Or build specific tests:
# Build only advanced tensor ops benchmark
cmake --build build --target bench_advanced_tensor_ops
# Build only scaling analysis
cmake --build build --target bench_scaling
# List all targets
cmake --build build --target helpAfter building, run individual benchmarks from the build/ directory:
```bash
cd build

# Test 1: cuBLAS baseline performance
./00_benchmark_cuBLAS.out

# Test 2: naive kernel comparison
./01_benchmark_naive.out

# Test 3: mixed-precision analysis
./02_benchmark_mixed_precision.out

# Test 4: strong-scaling analysis
./03_benchmark_scaling.out

# Test 5: stress testing
./04_benchmark_stress_test.out

# Test 6: advanced tensor operations (most comprehensive)
./05_benchmark_advanced_tensor_ops.out
```

Purpose: Establish a performance baseline with the vendor-optimized library
| Aspect | Details |
|---|---|
| Sizes | 128, 256, 512, 1024, 2048, 4096 |
| Runs | 10 per size |
| Algorithm | cuBLAS GemmEx with tensor operations |
| Output | GFLOPS, execution time |
| Use Case | Reference performance ceiling |
Purpose: Reference kernel for correctness validation
| Aspect | Details |
|---|---|
| Sizes | 128, 256, 512, 1024, 2048, 4096 |
| Runs | 10 per size |
| Algorithm | Custom naive GEMM kernel |
| Validation | Assert correctness against cuBLAS |
| Output | GFLOPS comparison, error analysis |
| Use Case | Correctness verification, optimization baseline |
Purpose: Comprehensive mixed-precision performance comparison
| Aspect | Details |
|---|---|
| Sizes | 128, 256, 512, 1024, 2048, 4096, 8192 |
| Precision | FP16 (input) × FP16 (input) → FP32 (output) |
| Batches | 3 independent benchmark runs |
| Runs | 5 per batch |
| Statistics | Mean time, GFLOPS, speedup metrics |
| Output | benchmark_results.csv with detailed metrics |
| Features | GPU device properties, warmup runs |
| Use Case | Mixed-precision optimization analysis |
Key Metrics:
- Time per operation (ms)
- GFLOPS achieved
- Speedup relative to naive implementation
- Numerical accuracy validation
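The GFLOPS and speedup columns follow directly from the measured timings. As a sketch (hypothetical helper names; the suite itself computes these in CUDA/C++), for an N×N×N GEMM:

```python
def gemm_gflops(n: int, time_ms: float) -> float:
    """Achieved GFLOPS for an N x N x N GEMM: 2*N^3 FLOPs
    (N^3 multiplies plus N^3 adds) divided by the elapsed time."""
    flops = 2.0 * n ** 3
    return flops / (time_ms * 1e-3) / 1e9

def speedup(naive_time_ms: float, optimized_time_ms: float) -> float:
    """Speedup of the optimized kernel relative to the naive reference."""
    return naive_time_ms / optimized_time_ms

# A 1024^3 GEMM finishing in 0.5 ms sustains ~4295 GFLOPS
print(round(gemm_gflops(1024, 0.5), 1))  # 4295.0
```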
Purpose: Analyze batch processing and strong scaling behavior
| Aspect | Details |
|---|---|
| Sizes | 256, 512, 1024, 2048 |
| Batch Sizes | 1, 2, 4, 8, 16 matrices |
| Modes | Sequential vs. queue-based execution |
| Metrics | Throughput (matrices/sec), memory bandwidth (GB/s) |
| Output | benchmark_scaling_results.csv |
| Analysis | Speedup, efficiency, bottleneck identification |
Key Insights:
- Batch processing efficiency
- Strong scaling characteristics
- Memory bandwidth utilization
- Queue vs. sequential overhead
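The throughput and bandwidth columns in benchmark_scaling_results.csv can be derived as in this sketch (hypothetical helper names; the bandwidth figure assumes the ideal case where A, B, and C are each moved exactly once):

```python
def throughput_matrices_per_sec(batch_size: int, batch_time_ms: float) -> float:
    """Matrices completed per second for one batched run."""
    return batch_size / (batch_time_ms * 1e-3)

def gemm_bandwidth_gb_s(n: int, time_ms: float, bytes_per_elem: int = 4) -> float:
    """Effective bandwidth assuming A, B, and C (3 * N^2 elements) are each
    touched exactly once; real kernels re-read tiles, so this is a lower
    bound on traffic and an optimistic estimate of utilization."""
    bytes_moved = 3 * n * n * bytes_per_elem
    return bytes_moved / (time_ms * 1e-3) / 1e9

print(throughput_matrices_per_sec(16, 2.0))  # 8000.0 matrices/sec
```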
Purpose: Push GPU to limits; analyze maximum problem sizes and stability
| Aspect | Details |
|---|---|
| Max Size | 12288 × 12288 matrices |
| Runs | 20-50 per size (adaptive) |
| Focus | Execution time variance, stability |
| Metrics | Min, max, avg GFLOPS; variance analysis |
| Output | CSV with detailed statistics |
| Safety | Handles out-of-memory gracefully |
| Use Case | Maximum capacity planning, thermal limits |
Variance Analysis:
- Identifies thermal throttling
- Detects performance degradation
- Measures consistency across runs
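One way to turn a series of run times into these signals (a hypothetical heuristic for illustration, not the exact score the suite reports):

```python
from statistics import mean, pstdev

def throttle_risk(times_ms):
    """Crude 0.0-1.0 risk score: run-to-run variability (coefficient of
    variation) plus the slowdown of the last quarter of runs relative to
    the first quarter (sustained thermal drift)."""
    cv = pstdev(times_ms) / mean(times_ms)
    k = max(1, len(times_ms) // 4)
    drift = mean(times_ms[-k:]) / mean(times_ms[:k]) - 1.0
    return min(1.0, max(0.0, cv + drift))

print(throttle_risk([1.0, 1.0, 1.0, 1.0]))  # 0.0 -> perfectly stable
```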
Purpose: The most comprehensive analysis, combining a roofline model with multi-algorithm comparison (650+ lines of code)
| Aspect | Details |
|---|---|
| Sizes | 256, 512, 1024, 2048, 4096 |
| Runs | 15 per size (high statistical significance) |
| Phases | 2 benchmark phases |
| Algorithms | 3 implementations (cuBLAS, Naive, Fused) |
| Output | Multiple CSV files for advanced analysis |
- Compares cuBLAS vs. naive kernel
- Statistical analysis with 95% confidence intervals
- Roofline model analysis
- Cache miss estimation
- Thermal throttling risk assessment
- Tests combined operations: C = A₁B₁ + A₂B₂
- Kernel fusion efficiency
- Memory access pattern optimization
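A host-side reference for what the fused kernel must produce is useful for validation. A minimal pure-Python sketch (the suite itself validates in CUDA/C++ against cuBLAS):

```python
def matmul(a, b):
    """Plain triple-loop matrix product on nested lists."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def fused_reference(a1, b1, a2, b2):
    """C = A1*B1 + A2*B2 -- the result a single fused launch must match."""
    c1, c2 = matmul(a1, b1), matmul(a2, b2)
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(c1, c2)]

identity = [[1, 0], [0, 1]]
m = [[1, 2], [3, 4]]
print(fused_reference(identity, m, identity, m))  # [[2, 4], [6, 8]]
```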
```
Per Operation:
├── Execution Time (ms) + Variance
├── GFLOPS + Efficiency %
├── Memory Bandwidth (GB/s)
├── Compute Intensity (FLOPS/Byte)
├── Cache Miss Estimation
└── Thermal Throttle Risk (0.0-1.0)

Roofline Model:
├── Compute Intensity
├── Achieved GFLOPS
├── Peak Compute (Theoretical)
├── Peak Memory Bandwidth (Theoretical)
└── Bottleneck Classification (Compute vs. Memory)

Confidence Intervals:
├── 95% CI for mean execution time
├── Statistical significance
└── Accuracy bounds
```
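The confidence interval can be computed as in this sketch (normal approximation with z = 1.96; for 15-run samples a Student-t critical value of about 2.14 would give a slightly wider, more conservative interval):

```python
from statistics import mean, stdev

def ci95(times_ms):
    """95% confidence interval for the mean execution time
    (normal approximation, z = 1.96)."""
    m = mean(times_ms)
    half = 1.96 * stdev(times_ms) / len(times_ms) ** 0.5
    return (m - half, m + half)

lo, hi = ci95([1.00, 1.02, 0.98, 1.01, 0.99])
print(f"mean in [{lo:.4f}, {hi:.4f}] ms with 95% confidence")
```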
- Peak compute (GFLOPS)
- Peak memory bandwidth (GB/s)
- Warp size
- Max threads per block
- Number of SMs
- TDP estimate
- `benchmark_advanced_metrics.csv` - Per-operation detailed metrics
- `benchmark_roofline_model.csv` - Roofline analysis data
- Small GPUs: 2-3 minutes
- Large GPUs (RTX 4090): 3-5 minutes
All benchmarks export detailed metrics to CSV for further analysis:
benchmark_results.csv (Mixed Precision):

```csv
MatrixSize,Batch,cuBLAS_Time_ms,Naive_Time_ms,cuBLAS_GFLOPS,Naive_GFLOPS,Speedup,MaxError,AvgError
128,1,0.123456,0.987654,123.45,15.67,8.03,1.2e-5,3.4e-6
256,1,0.234567,1.234567,234.56,18.90,5.25,1.5e-5,4.2e-6
...
```

benchmark_scaling_results.csv (Scaling Analysis):

```csv
MatrixSize,BatchSize,SequentialTime_ms,BatchTime_ms,SequentialGFLOPS,BatchGFLOPS,Speedup,Throughput_matrices_per_sec,MemoryBandwidth_GB_s
256,1,0.123,0.123,1234.5,1234.5,1.00,8130.08,987.65
256,2,0.246,0.180,617.3,841.5,1.37,10869.57,1289.45
...
```

benchmark_advanced_metrics.csv (Advanced):

```csv
MatrixSize,Algorithm,ExecutionTime_ms,GFLOPS,Efficiency_%,MemoryBandwidth_GB_s,ComputeIntensity,CacheMisses,ThrottleRisk
256,cuBLAS,0.123,1234.56,85.3,123.45,16.78,1024,0.15
256,Naive,0.456,333.33,23.0,45.67,16.78,4096,0.22
...
```
```python
import pandas as pd
import matplotlib.pyplot as plt

# Load results
df = pd.read_csv('benchmark_advanced_metrics.csv')

# Plot GFLOPS vs. matrix size, one line per algorithm
plt.figure(figsize=(12, 6))
for algo in df['Algorithm'].unique():
    subset = df[df['Algorithm'] == algo]
    plt.plot(subset['MatrixSize'], subset['GFLOPS'], marker='o', label=algo)
plt.xlabel('Matrix Size')
plt.ylabel('GFLOPS')
plt.legend()
plt.xscale('log')
plt.yscale('log')
plt.grid(True, alpha=0.3)
plt.title('Tensor Operation Performance Scaling')
plt.savefig('performance_scaling.png', dpi=300)
plt.show()
```

Edit CMakeLists.txt line 15 to match your GPU:
```cmake
# Compute capability reference:
#   61 -> Pascal (GTX 1080, GTX 1070, etc.)
#   75 -> Turing (RTX 2080, RTX 2070, etc.)
#   86 -> Ampere (RTX 3090, RTX 3080, A100, etc.)
#   89 -> Ada (RTX 4090, RTX 4080, RTX 4070 Ti, etc.)
set(CMAKE_CUDA_ARCHITECTURES 89)  # <-- CHANGE THIS
```

Each test allows configuration through its source file:
Matrix sizes (in each test file):

```cpp
int mat_sizes[] = {256, 512, 1024, 2048, 4096}; // Modify as needed
```

Number of runs:

```cpp
int runs_per_batch = 5; // Increase for more statistical precision
```

Batch sizes (scaling test):

```cpp
int batch_sizes[] = {1, 2, 4, 8, 16}; // Modify batch configurations
```

The roofline model provides a theoretical performance ceiling based on:
- Arithmetic Intensity (AI): FLOPS per byte of memory transferred
- Peak Compute: Maximum GFLOPS the GPU can achieve
- Peak Memory BW: Maximum memory bandwidth available
Performance Ceiling = min(Peak Compute, AI × Peak Memory BW)
- Memory-Bound: Performance limited by memory bandwidth
- Compute-Bound: Performance limited by compute capacity
TensorBench automatically classifies each operation and suggests optimization directions.
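A sketch of that classification (hypothetical helper names; the peak numbers below are illustrative, not measured values):

```python
def roofline_ceiling_gflops(ai, peak_gflops, peak_bw_gb_s):
    """Attainable GFLOPS at arithmetic intensity `ai` (FLOPS/byte)."""
    return min(peak_gflops, ai * peak_bw_gb_s)

def classify(ai, peak_gflops, peak_bw_gb_s):
    """Compute-bound above the ridge point, memory-bound below it."""
    ridge = peak_gflops / peak_bw_gb_s  # AI where both ceilings meet
    return "compute-bound" if ai >= ridge else "memory-bound"

def gemm_ai(n, bytes_per_elem=4):
    """Ideal AI of an N^3 GEMM: 2*N^3 FLOPs over 3*N^2 elements moved once."""
    return 2.0 * n ** 3 / (3.0 * n * n * bytes_per_elem)

# Large GEMMs have high AI, so they land on the compute roof
print(classify(gemm_ai(4096), peak_gflops=80_000.0, peak_bw_gb_s=1_000.0))
# compute-bound
```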
Based on TensorBench results, consider:
- Increase tile size
- Improve cache locality
- Use mixed precision (FP16 inputs)
- Fuse multiple operations
- Increase parallelism
- Improve instruction-level parallelism
- Use tensor cores (Turing+)
- Optimize register usage
- Monitor thermal throttling warnings
- Analyze variance for stability
- Compare against the roofline ceiling
- Validate numerical accuracy
- NVIDIA GPU: Compute Capability 6.1+ (GTX 1080 or newer)
- CUDA Toolkit: 11.0+
- RAM: 8GB
- Storage: 1GB
- NVIDIA GPU: Compute Capability 7.0+ (Turing+)
- CUDA Toolkit: 12.0+
- RAM: 16GB
- Storage: 2GB (for large benchmark runs)
- Arch Linux (kernel 6.17.7, x86_64)
- CUDA 13.0
- RTX 4050
"cuda_fp16.h not found"
# Update your CUDA include path in .vscode/c_cpp_properties.json
# Or regenerate CMake configuration:
cmake -S . -B build -DCMAKE_CUDA_ARCHITECTURES=89"Cannot find cuBLAS"
# Ensure CUDA toolkit is properly installed:
nvcc --version
# If not found, install or set CUDA path:
export CUDA_PATH=/usr/local/cuda
cmake -S . -B buildCompilation fails on specific architecture
# Check your GPU's compute capability:
nvidia-smi -q | grep -i "Compute Capability"
# Update CMakeLists.txt with correct value
set(CMAKE_CUDA_ARCHITECTURES 89)Out of Memory
- Reduce matrix size in test files
- Close other GPU applications
- Use `nvidia-smi` to check VRAM usage
Thermal Throttling
- Allow GPU cooling period between runs
- Reduce problem sizes
- Monitor with `nvidia-smi dmon`
Inconsistent Results
- Run benchmarks multiple times
- Check system background processes
- Verify power management settings
- Review confidence intervals in output
- Good: > 80% of peak GPU GFLOPS
- Acceptable: 50-80% of peak
- Poor: < 50% indicates optimization opportunity
- Check against theoretical peak
- High utilization (>90%) indicates the operation is memory-bound; focus optimization on memory access patterns
- Low Variance: Stable, consistent performance
- High Variance: May indicate thermal throttling or system interference
- 90-100%: Excellent
- 70-90%: Good
- 50-70%: Fair, consider optimization
- <50%: Poor, significant optimization opportunity
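Those bands can be encoded directly when post-processing the Efficiency_% column of the CSV output (helper name is hypothetical):

```python
def efficiency_rating(pct):
    """Map achieved-vs-peak efficiency (%) to the bands listed above."""
    if pct >= 90:
        return "Excellent"
    if pct >= 70:
        return "Good"
    if pct >= 50:
        return "Fair, consider optimization"
    return "Poor, significant optimization opportunity"

print(efficiency_rating(85.3))  # Good
```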
This project is licensed under the MIT License - see the LICENSE file for details.
Contributions are welcome! Areas of interest:
- Additional optimization algorithms
- Extended GPU architecture support
- Performance analysis tools
- Documentation improvements
- Bug reports and fixes
If you use TensorBench in your research, please cite:
```bibtex
@software{tensorbench2024,
  title={TensorBench: Advanced CUDA Tensor Operation Benchmarking Suite},
  author={Your Name},
  year={2024},
  url={https://github.com/yourusername/TensorBench}
}
```

Q: What GPU do I need?
A: Any NVIDIA GPU with compute capability 6.1 or higher (GTX 1080 or newer). Newer GPUs (Turing+) provide better mixed-precision support.
Q: How long do benchmarks take? A:
- Quick tests (00-02): 30 seconds - 2 minutes
- Medium tests (03-04): 1-3 minutes
- Full suite (05): 3-5 minutes
Q: Can I modify matrix sizes?
A: Yes! Edit the test files to adjust mat_sizes[] array. Larger sizes require more VRAM.
Q: How do I interpret results? A: Compare GFLOPS against your GPU's theoretical peak. Use roofline model to identify bottlenecks.
Q: Why is my performance lower than expected? A: Check thermal throttling risk, compare against roofline ceiling, verify no system processes are interfering.
Made with ❤️ for GPU performance enthusiasts and researchers