Tag: CUDA

Optimization Strategies

warp divergence

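The warp divergence in the ref implementation arises mainly because different rows are assigned to the same warp. The NNZ (number of non-zeros) can differ greatly between rows, so threads in one warp run loops of different lengths. It is therefore enough to avoid placing different rows in the same warp: set block.x = 1, so that each thread block processes at most one row.

As the comparison between ref and phase_1 in the figure shows, the improvement is limited.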
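The one-row-per-block idea can be sketched as follows. This is a minimal illustration, not the post's actual kernel: it assumes the sparse matrix is stored in CSR form (`ptr`, `idx`, `val`), that the dense matrices `B` and `C` are row-major with `K` columns, and that `threadIdx.y` walks over dense columns. All of these names and the layout are assumptions.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Launch with blockDim = (1, 32): all threads of a block (and hence of each
// warp) work on the same sparse row, so their loops have the same length.
__global__ void spmm_row_per_block(const int *ptr, const int *idx,
                                   const float *val, const float *B,
                                   float *C, int M, int K) {
  int row = blockIdx.x * blockDim.x + threadIdx.x;  // blockDim.x == 1
  int col = blockIdx.y * blockDim.y + threadIdx.y;  // column of dense B / C
  if (row >= M || col >= K) return;
  float acc = 0.0f;
  for (int p = ptr[row]; p < ptr[row + 1]; ++p)
    acc += val[p] * B[idx[p] * K + col];
  C[row * K + col] = acc;
}

// Hypothetical helper: copy a host vector to a fresh device buffer.
template <typename T>
T *to_device(const std::vector<T> &h) {
  T *d = nullptr;
  cudaMalloc(&d, h.size() * sizeof(T));
  cudaMemcpy(d, h.data(), h.size() * sizeof(T), cudaMemcpyHostToDevice);
  return d;
}

// Runs a tiny example and compares the GPU result against a CPU loop.
bool check_spmm_row_per_block() {
  const int M = 3, N = 4, K = 40;
  // A = [[1,0,2,5],[0,0,0,0],[0,3,0,0]] in CSR form; row NNZ = 3, 0, 1.
  std::vector<int> ptr = {0, 3, 3, 4}, idx = {0, 2, 3, 1};
  std::vector<float> val = {1, 2, 5, 3}, B(N * K), C(M * K), ref(M * K, 0.0f);
  for (int i = 0; i < N * K; ++i) B[i] = float(i % 7);
  for (int r = 0; r < M; ++r)
    for (int p = ptr[r]; p < ptr[r + 1]; ++p)
      for (int c = 0; c < K; ++c) ref[r * K + c] += val[p] * B[idx[p] * K + c];

  int *d_ptr = to_device(ptr), *d_idx = to_device(idx);
  float *d_val = to_device(val), *d_B = to_device(B), *d_C = to_device(C);
  dim3 block(1, 32), grid(M, (K + 31) / 32);
  spmm_row_per_block<<<grid, block>>>(d_ptr, d_idx, d_val, d_B, d_C, M, K);
  cudaMemcpy(C.data(), d_C, C.size() * sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(d_ptr); cudaFree(d_idx); cudaFree(d_val); cudaFree(d_B); cudaFree(d_C);
  return C == ref;  // small integer values, so exact comparison is safe
}

int main() {
  printf("%s\n", check_spmm_row_per_block() ? "match" : "MISMATCH");
  return 0;
}
```

Because `blockDim.x` is 1, threads are linearized along `threadIdx.y`, so the 32 threads of a warp differ only in `col` and read consecutive elements of one row of `B`.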

shared memory

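In SpMM, each element of the sparse matrix corresponds to an access to one row of the dense matrix. Part of the sparse matrix can therefore be cached in shared memory, which reduces repeated reads of the sparse matrix from global memory.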
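A hedged sketch of this caching, under the same assumptions as a CSR layout with row-major dense matrices (the names `ptr`, `idx`, `val`, `TILE`, and the one-row-per-block launch are illustrative, not taken from the post): the threads of a block cooperatively stage up to `TILE` non-zeros of their row in shared memory, then every thread reuses the staged entries for its own dense column.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 32;  // must equal blockDim.y

// Launch with blockDim = (1, TILE) and gridDim.x == number of sparse rows.
__global__ void spmm_smem(const int *ptr, const int *idx, const float *val,
                          const float *B, float *C, int K) {
  __shared__ int s_idx[TILE];
  __shared__ float s_val[TILE];
  int row = blockIdx.x;  // one sparse row per block
  int col = blockIdx.y * TILE + threadIdx.y;
  int begin = ptr[row], end = ptr[row + 1];
  float acc = 0.0f;
  // begin/end are identical for the whole block, so every thread reaches
  // each __syncthreads() the same number of times.
  for (int base = begin; base < end; base += TILE) {
    int p = base + threadIdx.y;
    if (p < end) { s_idx[threadIdx.y] = idx[p]; s_val[threadIdx.y] = val[p]; }
    __syncthreads();  // staged entries are now visible to the block
    int n = min(TILE, end - base);
    if (col < K)
      for (int t = 0; t < n; ++t) acc += s_val[t] * B[s_idx[t] * K + col];
    __syncthreads();  // finish reading before the next tile overwrites
  }
  if (col < K) C[row * K + col] = acc;
}

// Hypothetical helper: copy a host vector to a fresh device buffer.
template <typename T>
T *to_device(const std::vector<T> &h) {
  T *d = nullptr;
  cudaMalloc(&d, h.size() * sizeof(T));
  cudaMemcpy(d, h.data(), h.size() * sizeof(T), cudaMemcpyHostToDevice);
  return d;
}

// Row 0 has 50 non-zeros (two tiles), row 1 has 3; compare with a CPU loop.
bool check_spmm_smem() {
  const int M = 2, N = 50, K = 40;
  std::vector<int> ptr = {0, N, N + 3}, idx;
  std::vector<float> val, B(N * K), C(M * K), ref(M * K, 0.0f);
  for (int j = 0; j < N; ++j) { idx.push_back(j); val.push_back(float(j % 5 + 1)); }
  for (int j : {4, 17, 42}) { idx.push_back(j); val.push_back(2.0f); }
  for (int i = 0; i < N * K; ++i) B[i] = float(i % 7);
  for (int r = 0; r < M; ++r)
    for (int p = ptr[r]; p < ptr[r + 1]; ++p)
      for (int c = 0; c < K; ++c) ref[r * K + c] += val[p] * B[idx[p] * K + c];

  int *d_ptr = to_device(ptr), *d_idx = to_device(idx);
  float *d_val = to_device(val), *d_B = to_device(B), *d_C = to_device(C);
  dim3 block(1, TILE), grid(M, (K + TILE - 1) / TILE);
  spmm_smem<<<grid, block>>>(d_ptr, d_idx, d_val, d_B, d_C, K);
  cudaMemcpy(C.data(), d_C, C.size() * sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(d_ptr); cudaFree(d_idx); cudaFree(d_val); cudaFree(d_B); cudaFree(d_C);
  return C == ref;  // small integer values, so exact comparison is safe
}

int main() {
  printf("%s\n", check_spmm_smem() ? "match" : "MISMATCH");
  return 0;
}
```

Each staged `idx`/`val` pair is read from global memory once per block and then reused by all `TILE` threads, instead of being fetched separately by every thread.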


PA3: All-Pairs Shortest Paths

Environment

The P100 GPU supports up to 64 KB of shared memory per SM, but a single thread block can use at most 48 KB.
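These limits can be read from the device at runtime rather than hard-coded. The following sketch queries device 0 through `cudaGetDeviceProperties`; the helper name `query_smem_limits` is an illustrative choice, and the values printed depend on the GPU it runs on.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Reads the per-SM and per-block shared memory limits (in bytes) of a device.
bool query_smem_limits(int device, size_t *per_sm, size_t *per_block) {
  cudaDeviceProp prop;
  if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return false;
  *per_sm = prop.sharedMemPerMultiprocessor;
  *per_block = prop.sharedMemPerBlock;
  return true;
}

int main() {
  size_t per_sm = 0, per_block = 0;
  if (!query_smem_limits(0, &per_sm, &per_block)) {
    printf("no usable CUDA device\n");
    return 1;
  }
  printf("shared memory per SM:    %zu KB\n", per_sm / 1024);
  printf("shared memory per block: %zu KB\n", per_block / 1024);
  return 0;
}
```

When choosing a tile size, the per-block value is the one that bounds how much data a single thread block can stage.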

Method

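This uses the blocking method from 实验三 - 高性能计算导论实验文档 (tsinghua.edu.cn), the Lab 3 page of the Introduction to High-Performance Computing lab documentation. One thread block processes one matrix tile, and all data a thread block needs is copied into shared memory. Adding i_start, j_start, or center_block_start to threadIdx maps coordinates in shared memory to the corresponding matrix tile in global memory.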
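The index mapping can be illustrated with the center-block step alone. This sketch is not the post's kernel: it covers only the diagonal tile of one round, assumes `n` is a multiple of the tile width `B`, assumes non-negative `int` weights with a zero diagonal, and uses hypothetical names (`fw_center_block`, `tile`) apart from `center_block_start`.

```cuda
#include <cstdio>
#include <vector>
#include <algorithm>
#include <cuda_runtime.h>

constexpr int B = 32;  // tile width; launch with blockDim = (B, B)

// Relaxes the p-th diagonal tile of an n x n distance matrix in shared memory.
__global__ void fw_center_block(int *dist, int n, int p) {
  __shared__ int tile[B][B];
  int ty = threadIdx.y, tx = threadIdx.x;
  int center_block_start = p * B;
  // Shared coordinate (ty, tx) maps to global (start + ty, start + tx).
  int gi = center_block_start + ty, gj = center_block_start + tx;
  tile[ty][tx] = dist[gi * n + gj];
  __syncthreads();
  for (int k = 0; k < B; ++k) {
    int via_k = tile[ty][k] + tile[k][tx];
    __syncthreads();  // all reads for step k finish before any write
    tile[ty][tx] = min(tile[ty][tx], via_k);
    __syncthreads();  // writes of step k are visible before step k + 1
  }
  dist[gi * n + gj] = tile[ty][tx];
}

// Compares the kernel with a CPU Floyd-Warshall restricted to the same tile.
bool check_fw_center_block() {
  const int n = 2 * B, p = 1, c = p * B;
  std::vector<int> h(n * n);
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j)
      h[i * n + j] = (i == j) ? 0 : 1 + (i * 37 + j * 11) % 100;
  std::vector<int> ref = h;
  for (int k = c; k < c + B; ++k)
    for (int i = c; i < c + B; ++i)
      for (int j = c; j < c + B; ++j)
        ref[i * n + j] = std::min(ref[i * n + j], ref[i * n + k] + ref[k * n + j]);

  int *d = nullptr;
  cudaMalloc(&d, h.size() * sizeof(int));
  cudaMemcpy(d, h.data(), h.size() * sizeof(int), cudaMemcpyHostToDevice);
  fw_center_block<<<1, dim3(B, B)>>>(d, n, p);
  cudaMemcpy(h.data(), d, h.size() * sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(d);
  return h == ref;  // entries outside the tile must be unchanged as well
}

int main() {
  printf("%s\n", check_fw_center_block() ? "match" : "MISMATCH");
  return 0;
}
```

The remaining steps of a round would load additional tiles in the same way, with i_start and j_start taking the place of `center_block_start` in the offset. One `B x B` tile of `int` takes 4 KB here, so several tiles fit within the per-block shared memory limit.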


exp7: CUDA Optimization (global memory, shared memory)

Analyzing the source of the performance changes in `test_gmem.cu`

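```cuda
// STRIDE is defined outside this excerpt.
// Thread i copies element i * STRIDE, so neighboring threads access
// addresses that are STRIDE floats apart.
__global__ void stride_copy(float *dst, float *src) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  dst[i * STRIDE] = src[i * STRIDE];
}
```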
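To observe how the stride affects the kernel, a small timing harness can help. The sketch below is an assumption-laden variant, not the code in `test_gmem.cu`: it passes the stride as a runtime argument (`stride_copy_rt`, a hypothetical name) so several strides can be timed in one run, and it uses CUDA events for timing. As the stride grows, the 32 threads of a warp touch addresses spread over more memory segments, so each memory transaction carries fewer useful bytes.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Runtime-stride variant of the kernel (assumption: the original uses a
// compile-time STRIDE instead).
__global__ void stride_copy_rt(float *dst, const float *src, int stride) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  dst[i * stride] = src[i * stride];
}

// Copies n elements (n must be a multiple of 256) with the given stride.
// Returns the kernel time in milliseconds, or -1 if the copy is wrong.
float time_stride_copy(int n, int stride) {
  size_t count = size_t(n) * stride, bytes = count * sizeof(float);
  std::vector<float> h(count), out(count);
  for (size_t i = 0; i < count; ++i) h[i] = float(i % 1024);
  float *src = nullptr, *dst = nullptr;
  cudaMalloc(&src, bytes);
  cudaMalloc(&dst, bytes);
  cudaMemcpy(src, h.data(), bytes, cudaMemcpyHostToDevice);
  cudaMemset(dst, 0, bytes);

  stride_copy_rt<<<n / 256, 256>>>(dst, src, stride);  // warm-up launch
  cudaEvent_t t0, t1;
  cudaEventCreate(&t0);
  cudaEventCreate(&t1);
  cudaEventRecord(t0);
  stride_copy_rt<<<n / 256, 256>>>(dst, src, stride);  // timed launch
  cudaEventRecord(t1);
  cudaEventSynchronize(t1);
  float ms = 0.0f;
  cudaEventElapsedTime(&ms, t0, t1);

  cudaMemcpy(out.data(), dst, bytes, cudaMemcpyDeviceToHost);
  bool ok = true;
  for (int i = 0; i < n; ++i)
    ok = ok && out[size_t(i) * stride] == h[size_t(i) * stride];
  cudaEventDestroy(t0);
  cudaEventDestroy(t1);
  cudaFree(src);
  cudaFree(dst);
  return ok ? ms : -1.0f;
}

int main() {
  const int n = 1 << 20;
  for (int stride : {1, 2, 4, 8}) {
    float ms = time_stride_copy(n, stride);
    // Effective bandwidth counts only the bytes actually read and written.
    double gbps = 2.0 * n * sizeof(float) / (ms * 1e-3) / 1e9;
    printf("stride %d: %.3f ms, %.1f GB/s effective\n", stride, ms, gbps);
  }
  return 0;
}
```

The number of useful bytes is the same for every stride, so any drop in effective bandwidth reflects the less coalesced access pattern rather than extra work.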


exp6: CUDA Parallelization Strategies

Performance
