Tag: Introduction to High Performance Computing

优化策略

warp divergence

ref 实现中的 warp divergence 主要是因为将不同行归入一个 warp 计算, 而不同行的 NNZ 可能有很大差异, 产生 warp divergence. 因此, 只要避免将不同行划入同一 warp 即可. 因此, 令 block.x = 1, 使每个 thread block 至多处理一行数据.

如图中 ref 与 phase_1 对比, 提升有限.

shared memory

在 SpMM 中, 稀疏矩阵的一个元素代表了对于稠密矩阵的一行的访问. 因此可以将稀疏矩阵的一部分缓存在 shared memory 中, 以减少重复从 global memory 中读取稀疏矩阵.

About 1 minAbout 427 words

exp8: 单机性能优化

Task 0

Performance

Option	Elapsed Time / seconds	Performance / GFlops
`-O0`	1.0072	0.2665
`-O1`	0.3461	0.7755
`-O2`	0.3332	0.8057
`-O3`	0.0496	5.4081
`-fast`	0.0386	6.9524

Less than 1 minuteAbout 278 words

PA3: 全源最短路

Environment

P100 GPU 最大支持每个 SM 64KB shared memory，但每个 thread block 最多只支持 48KB

Method

使用实验三 - 高性能计算导论实验文档 (tsinghua.edu.cn) 中的分块方法. 一个 thread block 处理一个矩阵分块. 每个 thread block 所需使用的数据全部拷贝到 shared memory 中. 在 threadIdx 的基础上偏移 i_start, j_start 或 center_block_start 即可将 shared memory 中的坐标映射到 global memory 中的不同矩阵分块.

About 3 minAbout 926 words

exp7: CUDA 优化 (global memory, shared memory)

分析 `test_gmem.cu` 的性能变化来源

__global__ void stride_copy(float *dst, float *src) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  dst[i * STRIDE] = src[i * STRIDE];
}

About 3 minAbout 831 words

PA2: 模板计算

Size: 512 x 512 x 512

Naive

No OPT Performance

Threads	Computation Time (s)	Performance (Gflop/s)	Speedup / Single Thread
1	267.075484	0.653310	1.
2	135.902281	1.283886	1.96520182
4	70.080111	2.489766	3.81100243
8	36.813651	4.739629	7.25479328
16	25.793147	6.764706	10.35451164
28	17.400062	10.027726	15.34910839

About 11 minAbout 3246 words

exp6: CUDA 并行策略

Performance

About 1 minAbout 436 words

PA1: 奇偶排序（odd_even_sort）

Performance

Number of Nodes	Number of Tasks	Number Count	v0	v1	v2	v3	v4
1	1	100000000	xxxxxxxxxx6 1void a_plus_b_intrinsic(float* a, float* b, float* c, int n) {2 for (int i = 0; i < n; i += 8) {3 _mm256_store_ps(4 c + i, _mm256_add_ps(_mm256_load_ps(a + i), _mm256_load_ps(b + i)));5 }6}c++	1.	1.	1.	1.
1	2	100000000	1.82738226	1.83557525	1.8772568	1.86270894	1.67925379
1	4	100000000	3.05696721	3.17540441	3.37059297	3.39672385	2.3387731
1	8	100000000	5.15212684	5.59405831	6.03427353	6.15270254	3.29768106
1	16	100000000	7.65590815	8.91877194	9.65970998	10.31895407	4.39598177
2	32	100000000	10.63745468	12.91174115	14.70125847	15.67950831	5.01147717

About 15 minAbout 4460 words

exp4: 自动向量化与基于 intrinsic 的手动向量化

Performance

Method	Time
baseline	4711 us
auto simd	530 us
intrinsic	514 us

Implementation

void a_plus_b_intrinsic(float* a, float* b, float* c, int n) {
  for (int i = 0; i < n; i += 8) {
    _mm256_store_ps(
        c + i, _mm256_add_ps(_mm256_load_ps(a + i), _mm256_load_ps(b + i)));
  }
}

Less than 1 minuteAbout 70 words

exp3: 测量 OpenMP 并行 for 循环不同调度策略的性能

调度策略	static	dynamic	guided
Sort uniform parts	70.4433 ms	94.0649 ms	69.919 ms
Sort random parts	1436.51 ms	1376.06 ms	1443.13 ms

Less than 1 minuteAbout 265 words

exp2: MPI Allreduce

Ring Allreduce 算法

首先将每个结点的数据分为 comm_sz 个数据块, 每个数据块大小为 count = n / comm_sz 个 float.

第一阶段共 comm_sz - 1 步. 在第 k 步, 第 my_rank 个进程会将自己的第 (my_rank - k) % comm_sz 对应数据块发送给第 succ = my_rank + 1 个进程并累加. 注意到对于 my_rank 进程, 第 k 步的 Send 与 Recv 使用的数据块不同, 但第 k + 1 步的 Send 依赖于第 k 步的 Recv 得到的数据块. 因此 Send 可以是非阻塞的, 但 Recv 必须是阻塞的, 以确保在第 k + 1 步 Send 前, 第 k 步 Recv 已完成.

About 1 minAbout 432 words

优化策略

warp divergence

shared memory

Task 0

Performance

Environment

Method

分析 test_gmem.cu 的性能变化来源

Naive

No OPT Performance

Performance

Performance

Performance

Implementation

Ring Allreduce 算法

分析 `test_gmem.cu` 的性能变化来源