X Tutup
Skip to content

CanReader/TensorBench

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

7 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

πŸš€ TensorBench: Advanced CUDA Tensor Operation Benchmarking Suite

CUDA C++ License Status

Comprehensive benchmarking framework for GPU tensor operations with roofline model analysis, multi-algorithm comparison, and advanced performance characterization.

Features β€’ Quick Start β€’ Benchmarks β€’ Results β€’ Architecture


πŸ“‹ Overview

TensorBench is a production-grade CUDA benchmarking suite designed for comprehensive performance analysis of tensor operations on NVIDIA GPUs. It provides deep insights into GPU utilization, memory hierarchy behavior, and comparative performance across multiple algorithms.

Use Cases

  • πŸ” Performance Profiling: Detailed analysis of tensor operation performance
  • πŸ—οΈ Architecture Evaluation: Compare different implementation strategies
  • πŸ“Š Optimization Research: Identify bottlenecks and optimization opportunities
  • 🎯 GPU Capability Assessment: Understand your hardware's strengths and limitations
  • πŸ“ˆ Comparative Studies: Benchmark naive vs. optimized implementations

✨ Features

🎯 Multi-Algorithm Comparison

  • cuBLAS Optimized: Highly-tuned vendor library implementation
  • Naive Kernel: Reference implementation for correctness validation
  • Fused Operations: Advanced multi-operation kernels

πŸ“Š Advanced Performance Analysis

  • Roofline Model: Theoretical performance ceiling computation
  • Arithmetic Intensity: FLOPS/Byte analysis for memory vs. compute bottleneck classification
  • Cache Behavior: L1/L2 cache miss estimation
  • Memory Bandwidth: Real-time measurement and analysis
  • Statistical Analysis: Mean, variance, standard deviation, 95% confidence intervals

πŸ”¬ Deep Profiling Capabilities

  • Thermal Throttling: Risk assessment based on sustained execution
  • Efficiency Metrics: Peak efficiency percentage calculation
  • Variance Analysis: Execution time stability tracking
  • Power Efficiency: GFLOPS/Watt estimation
  • GPU Properties: Full device capability reporting

πŸ“ Comprehensive Output

  • CSV exports for advanced analysis
  • Multi-phase benchmarking (single ops, fused ops, batch processing)
  • Detailed performance logs with human-readable formatting

πŸ—οΈ Architecture

Project Structure

TensorBench/
β”œβ”€β”€ CMakeLists.txt              # Build configuration
β”œβ”€β”€ README.md                   # This file
β”œβ”€β”€ include/
β”‚   β”œβ”€β”€ MatrixFP16.cuh          # FP16 matrix class definition
β”‚   β”œβ”€β”€ MatrixFP32.cuh          # FP32 matrix class definition
β”‚   β”œβ”€β”€ naive_tensor_tgemm.cuh  # Naive GEMM kernel header
β”‚   └── utils.cuh               # Utility functions
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ MatrixFP16.cu           # FP16 matrix implementation
β”‚   β”œβ”€β”€ MatrixFP32.cu           # FP32 matrix implementation
β”‚   β”œβ”€β”€ naive_tensor_tgemm.cu   # Naive GEMM kernel implementation
β”‚   └── utils.cu                # Utility implementations
β”œβ”€β”€ test/
β”‚   β”œβ”€β”€ 00_benchmark_cuBLAS.cu                 # Test 1: cuBLAS Baseline
β”‚   β”œβ”€β”€ 01_benchmark_naive.cu                  # Test 2: Naive Implementation
β”‚   β”œβ”€β”€ 02_benchmark_mixed_precision.cu        # Test 3: Mixed Precision Analysis
β”‚   β”œβ”€β”€ 03_benchmark_scaling.cu                # Test 4: Strong Scaling
β”‚   β”œβ”€β”€ 04_benchmark_stress_test.cu            # Test 5: Stress Testing
β”‚   └── 05_benchmark_advanced_tensor_ops.cu    # Test 6: Advanced Analysis (500+ lines)
└── build/                      # CMake build output
    └── *.out                   # Compiled executables

πŸš€ Quick Start

Prerequisites

  • NVIDIA GPU with compute capability 6.1+ (Pascal or newer)
  • CUDA Toolkit 12.0 or later
  • CMake 3.18+
  • GCC/G++ 9+ or LLVM/Clang 10+

Installation & Build

For the simplest setup, run the appropriate command below. These scripts automatically handle the CMake configuration and compilation.

🐧 Linux/macOS

Run:

./build.sh

πŸͺŸ Windows

Run:

build.bat

or

  1. Clone and navigate to the project:
cd TensorBench
  1. Configure with CMake:
cmake -S . -B build -DCMAKE_EXPORT_COMPILE_COMMANDS=ON
  1. Build all benchmarks:
cmake --build build -j$(nproc)
  1. Or build specific tests:
# Build only advanced tensor ops benchmark
cmake --build build --target bench_advanced_tensor_ops

# Build only scaling analysis
cmake --build build --target bench_scaling

# List all targets
cmake --build build --target help

Running Benchmarks

After building, run individual benchmarks from the build/ directory:

cd build

# Test 1: cuBLAS Baseline Performance
./00_benchmark_cuBLAS.out

# Test 2: Naive Kernel Comparison
./01_benchmark_naive.out

# Test 3: Mixed Precision Analysis
./02_benchmark_mixed_precision.out

# Test 4: Strong Scaling Analysis
./03_benchmark_scaling.out

# Test 5: Stress Testing
./04_benchmark_stress_test.out

# Test 6: Advanced Tensor Operations (Most Comprehensive)
./05_benchmark_advanced_tensor_ops.out

πŸ“Š Benchmark Suite

Test 1: cuBLAS Baseline (00_benchmark_cuBLAS.cu)

Purpose: Establish performance baseline with vendor-optimized library

Aspect Details
Sizes 128, 256, 512, 1024, 2048, 4096
Runs 10 per size
Algorithm cuBLAS GemmEx with tensor operations
Output GFLOPS, execution time
Use Case Reference performance ceiling

Test 2: Naive Implementation (01_benchmark_naive.cu)

Purpose: Reference kernel for correctness validation

Aspect Details
Sizes 128, 256, 512, 1024, 2048, 4096
Runs 10 per size
Algorithm Custom naive GEMM kernel
Validation Assert correctness against cuBLAS
Output GFLOPS comparison, error analysis
Use Case Correctness verification, optimization baseline

Test 3: Mixed Precision Analysis (02_benchmark_mixed_precision.cu) ⭐

Purpose: Comprehensive mixed-precision performance comparison

Aspect Details
Sizes 128, 256, 512, 1024, 2048, 4096, 8192
Precision FP16 (input) Γ— FP16 (input) β†’ FP32 (output)
Batches 3 independent benchmark runs
Runs 5 per batch
Statistics Mean time, GFLOPS, speedup metrics
Output benchmark_results.csv with detailed metrics
Features GPU device properties, warmup runs
Use Case Mixed-precision optimization analysis

Key Metrics:

  • Time per operation (ms)
  • GFLOPS achieved
  • Speedup relative to naive implementation
  • Numerical accuracy validation

Test 4: Strong Scaling Analysis (03_benchmark_scaling.cu) πŸ“ˆ

Purpose: Analyze batch processing and strong scaling behavior

Aspect Details
Sizes 256, 512, 1024, 2048
Batch Sizes 1, 2, 4, 8, 16 matrices
Modes Sequential vs. queue-based execution
Metrics Throughput (matrices/sec), memory bandwidth (GB/s)
Output benchmark_scaling_results.csv
Analysis Speedup, efficiency, bottleneck identification

Key Insights:

  • Batch processing efficiency
  • Strong scaling characteristics
  • Memory bandwidth utilization
  • Queue vs. sequential overhead

Test 5: Stress Testing (04_benchmark_stress_test.cu) πŸ”₯

Purpose: Push GPU to limits; analyze maximum problem sizes and stability

Aspect Details
Max Size 12288 Γ— 12288 matrices
Runs 20-50 per size (adaptive)
Focus Execution time variance, stability
Metrics Min, max, avg GFLOPS; variance analysis
Output CSV with detailed statistics
Safety Handles out-of-memory gracefully
Use Case Maximum capacity planning, thermal limits

Variance Analysis:

  • Identifies thermal throttling
  • Detects performance degradation
  • Measures consistency across runs

Test 6: Advanced Tensor Operations (05_benchmark_advanced_tensor_ops.cu) 🌟

Purpose: Most comprehensive analysis with roofline model and multi-algorithm comparison Lines of Code: 650+ lines

Aspect Details
Sizes 256, 512, 1024, 2048, 4096
Runs 15 per size (high statistical significance)
Phases 2 benchmark phases
Algorithms 3 implementations (cuBLAS, Naive, Fused)
Output Multiple CSV files for advanced analysis

Phase 1: Single Matrix Multiplication Comparison

  • Compares cuBLAS vs. naive kernel
  • Statistical analysis with 95% confidence intervals
  • Roofline model analysis
  • Cache miss estimation
  • Thermal throttling risk assessment

Phase 2: Fused Operations Analysis

  • Tests combined operations: C = A₁B₁ + Aβ‚‚Bβ‚‚
  • Kernel fusion efficiency
  • Memory access pattern optimization

Metrics Tracked:

Per Operation:
β”œβ”€β”€ Execution Time (ms) + Variance
β”œβ”€β”€ GFLOPS + Efficiency %
β”œβ”€β”€ Memory Bandwidth (GB/s)
β”œβ”€β”€ Compute Intensity (FLOPS/Byte)
β”œβ”€β”€ Cache Miss Estimation
└── Thermal Throttle Risk (0.0-1.0)

Roofline Model:
β”œβ”€β”€ Compute Intensity
β”œβ”€β”€ Achieved GFLOPS
β”œβ”€β”€ Peak Compute (Theoretical)
β”œβ”€β”€ Peak Memory Bandwidth (Theoretical)
└── Bottleneck Classification (Compute vs. Memory)

Confidence Intervals:
β”œβ”€β”€ 95% CI for mean execution time
β”œβ”€β”€ Statistical significance
└── Accuracy bounds

GPU Specifications Reported:

  • Peak compute (GFLOPS)
  • Peak memory bandwidth (GB/s)
  • Warp size
  • Max threads per block
  • Number of SMs
  • TDP estimate

CSV Outputs:

  • benchmark_advanced_metrics.csv - Per-operation detailed metrics
  • benchmark_roofline_model.csv - Roofline analysis data

Expected Runtime:

  • Small GPUs: 2-3 minutes
  • Large GPUs (RTX 4090): 3-5 minutes

πŸ“ˆ Output & Analysis

CSV Export Format

All benchmarks export detailed metrics to CSV for further analysis:

benchmark_results.csv (Mixed Precision):

MatrixSize,Batch,cuBLAS_Time_ms,Naive_Time_ms,cuBLAS_GFLOPS,Naive_GFLOPS,Speedup,MaxError,AvgError
128,1,0.123456,0.987654,123.45,15.67,8.03,1.2e-5,3.4e-6
256,1,0.234567,1.234567,234.56,18.90,5.25,1.5e-5,4.2e-6
...

benchmark_scaling_results.csv (Scaling Analysis):

MatrixSize,BatchSize,SequentialTime_ms,BatchTime_ms,SequentialGFLOPS,BatchGFLOPS,Speedup,Throughput_matrices_per_sec,MemoryBandwidth_GB_s
256,1,0.123,0.123,1234.5,1234.5,1.00,8130.08,987.65
256,2,0.246,0.180,617.3,841.5,1.37,10869.57,1289.45
...

benchmark_advanced_metrics.csv (Advanced):

MatrixSize,Algorithm,ExecutionTime_ms,GFLOPS,Efficiency_%,MemoryBandwidth_GB_s,ComputeIntensity,CacheMisses,ThrottleRisk
256,cuBLAS,0.123,1234.56,85.3,123.45,16.78,1024,0.15
256,Naive,0.456,333.33,23.0,45.67,16.78,4096,0.22
...

Python Analysis Example

import pandas as pd
import matplotlib.pyplot as plt

# Load results
df = pd.read_csv('benchmark_advanced_metrics.csv')

# Plot GFLOPS vs. Matrix Size
plt.figure(figsize=(12, 6))
for algo in df['Algorithm'].unique():
    subset = df[df['Algorithm'] == algo]
    plt.plot(subset['MatrixSize'], subset['GFLOPS'], marker='o', label=algo)
plt.xlabel('Matrix Size')
plt.ylabel('GFLOPS')
plt.legend()
plt.xscale('log')
plt.yscale('log')
plt.grid(True, alpha=0.3)
plt.title('Tensor Operation Performance Scaling')
plt.savefig('performance_scaling.png', dpi=300)
plt.show()

πŸ”§ Configuration

Adjusting GPU Architecture

Edit CMakeLists.txt line 15 to match your GPU:

# Compute Capability Reference:
# 61  -> Pascal (GTX 1080, GTX 1070, etc.)
# 75  -> Turing (RTX 2080, RTX 2070, etc.)
# 86  -> Ampere (RTX 3090, RTX 3080, A100, etc.)
# 89  -> Ada (RTX 4090, RTX 4080, RTX 4070 Ti, etc.)

set(CMAKE_CUDA_ARCHITECTURES 89)  # <-- CHANGE THIS

Adjusting Benchmark Parameters

Each test allows configuration through source code (test files):

Matrix Sizes (in each test file):

int mat_sizes[] = {256, 512, 1024, 2048, 4096};  // Modify as needed

Number of Runs:

int runs_per_batch = 5;   // Increase for more statistical precision

Batch Sizes (scaling test):

int batch_sizes[] = {1, 2, 4, 8, 16};  // Modify batch configurations

πŸ“Š Understanding Roofline Model

The roofline model provides a theoretical performance ceiling based on:

  1. Arithmetic Intensity (AI): FLOPS per byte of memory transferred
  2. Peak Compute: Maximum GFLOPS the GPU can achieve
  3. Peak Memory BW: Maximum memory bandwidth available

Performance Ceiling = min(Peak Compute, AI Γ— Peak Memory BW)

Classification:

  • Memory-Bound: Performance limited by memory bandwidth
  • Compute-Bound: Performance limited by compute capacity

TensorBench automatically classifies each operation and suggests optimization directions.


🎯 Performance Optimization Tips

Based on TensorBench results, consider:

If Memory-Bound:

  • βœ… Increase tile size
  • βœ… Improve cache locality
  • βœ… Use mixed precision (FP16 inputs)
  • βœ… Fuse multiple operations

If Compute-Bound:

  • βœ… Increase parallelism
  • βœ… Improve instruction-level parallelism
  • βœ… Use tensor cores (Turing+)
  • βœ… Optimize register usage

General:

  • βœ… Monitor thermal throttling warnings
  • βœ… Analyze variance for stability
  • βœ… Compare against roofline ceiling
  • βœ… Validate numerical accuracy

πŸ–₯️ System Requirements

Minimum

  • NVIDIA GPU: Compute Capability 6.1+ (GTX 1080 or newer)
  • CUDA Toolkit: 11.0+
  • RAM: 8GB
  • Storage: 1GB

Recommended

  • NVIDIA GPU: Compute Capability 7.0+ (Turing+)
  • CUDA Toolkit: 12.0+
  • RAM: 16GB
  • Storage: 2GB (for large benchmark runs)

Tested Platforms

  • βœ… Arch Linux-6.17.7 (x86_64)
  • βœ… CUDA 13.0
  • βœ… RTX 4050

πŸ› Troubleshooting

Build Errors

"cuda_fp16.h not found"

# Update your CUDA include path in .vscode/c_cpp_properties.json
# Or regenerate CMake configuration:
cmake -S . -B build -DCMAKE_CUDA_ARCHITECTURES=89

"Cannot find cuBLAS"

# Ensure CUDA toolkit is properly installed:
nvcc --version

# If not found, install or set CUDA path:
export CUDA_PATH=/usr/local/cuda
cmake -S . -B build

Compilation fails on specific architecture

# Check your GPU's compute capability:
nvidia-smi -q | grep -i "Compute Capability"

# Update CMakeLists.txt with correct value
set(CMAKE_CUDA_ARCHITECTURES 89)

Runtime Issues

Out of Memory

  • Reduce matrix size in test files
  • Close other GPU applications
  • Use nvidia-smi to check VRAM usage

Thermal Throttling

  • Allow GPU cooling period between runs
  • Reduce problem sizes
  • Monitor with nvidia-smi dmon

Inconsistent Results

  • Run benchmarks multiple times
  • Check system background processes
  • Verify power management settings
  • Review confidence intervals in output

πŸ“š Output Interpretation

GFLOPS

  • Good: > 80% of peak GPU GFLOPS
  • Acceptable: 50-80% of peak
  • Poor: < 50% indicates optimization opportunity

Memory Bandwidth

  • Check against theoretical peak
  • High utilization (>90%) suggests memory optimization needed

Variance

  • Low Variance: Stable, consistent performance
  • High Variance: May indicate thermal throttling or system interference

Efficiency %

  • 90-100%: Excellent
  • 70-90%: Good
  • 50-70%: Fair, consider optimization
  • <50%: Poor, significant optimization opportunity

πŸ“– References & Resources


πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.


🀝 Contributing

Contributions are welcome! Areas of interest:

  • Additional optimization algorithms
  • Extended GPU architecture support
  • Performance analysis tools
  • Documentation improvements
  • Bug reports and fixes

πŸ“ Citation

If you use TensorBench in your research, please cite:

@software{tensorbench2024,
  title={TensorBench: Advanced CUDA Tensor Operation Benchmarking Suite},
  author={Your Name},
  year={2024},
  url={https://github.com/yourusername/TensorBench}
}

❓ FAQ

Q: What GPU do I need? A: Any NVIDIA GPU with compute capability 6.1 or higher (GTX 1080 or newer). Newer GPUs (Turing+) provide better mixed-precision support.

Q: How long do benchmarks take? A:

  • Quick tests (00-02): 30 seconds - 2 minutes
  • Medium tests (03-04): 1-3 minutes
  • Full suite (05): 3-5 minutes

Q: Can I modify matrix sizes? A: Yes! Edit the test files to adjust mat_sizes[] array. Larger sizes require more VRAM.

Q: How do I interpret results? A: Compare GFLOPS against your GPU's theoretical peak. Use roofline model to identify bottlenecks.

Q: Why is my performance lower than expected? A: Check thermal throttling risk, compare against roofline ceiling, verify no system processes are interfering.


⬆ Back to Top

Made with ❀️ for GPU performance enthusiasts and researchers

About

High-performance C++/CUDA framework for stress-testing and benchmarking GPU tensor operations. Includes support for mixed precision (FP16/FP32), strong scaling analysis, and detailed statistical output for optimization research.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

Β 
Β 
Β 

Contributors

Languages

X Tutup