
PA3: All-Pairs Shortest Paths


Environment

The P100 GPU provides up to 64 KB of shared memory per SM, but a single thread block can use at most 48 KB.

Method

The tiled method from 实验三 - 高性能计算导论实验文档 (tsinghua.edu.cn) is used: each thread block processes one tile of the distance matrix, and all data a thread block needs is first copied into shared memory. Offsetting threadIdx by i_start, j_start, or center_block_start maps a coordinate in shared memory to the corresponding tile in global memory.
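Below is a minimal sketch of that index mapping, assuming the distance matrix is stored row-major in a device array graph of size n × n; the helper name load_tile and the padding value are hypothetical, not part of the assignment's API.

#include <climits>

// Hypothetical helper illustrating the offset idea: each thread copies one
// element of a b x b tile whose top-left corner in global memory is
// (i_start, j_start). threadIdx gives the offset inside the tile, so the same
// code addresses different tiles simply by changing i_start / j_start
// (e.g. setting both to center_block_start selects the center tile).
__device__ void load_tile(int *tile, const int *graph, int n, int b,
                          int i_start, int j_start) {
    int gi = i_start + threadIdx.y;   // row in global memory
    int gj = j_start + threadIdx.x;   // column in global memory
    // Assumption: out-of-range elements are padded with a large value.
    tile[threadIdx.y * b + threadIdx.x] =
        (gi < n && gj < n) ? graph[gi * n + gj] : INT_MAX / 2;
}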

Phase 1

k = [ p * b, (p + 1) * b )

For each thread block, the accessed range is k × k, i.e. b * b ints, so b * b * sizeof(int) bytes of shared memory are required.

For each p, a single thread block is enough to complete this phase, which is rather wasteful.
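A minimal Phase 1 sketch under the same layout assumption (row-major graph, round index p); the kernel name and structure are illustrative, not the submitted Phase1Kernel.

#define B 32   // tile size b, fixed to 32 (see the discussion at the end of Method)

// One B x B thread block loads the center tile, performs B rounds of
// relaxation entirely in shared memory, and writes the result back.
__global__ void phase1_sketch(int *graph, int n, int p) {
    __shared__ int tile[B][B];
    int base = p * B;                       // center_block_start
    int gi = base + threadIdx.y;
    int gj = base + threadIdx.x;

    tile[threadIdx.y][threadIdx.x] = graph[gi * n + gj];
    __syncthreads();

    for (int k = 0; k < B; ++k) {
        // Relax through the k-th vertex of the tile. Updating in place is
        // safe: entries in row k / column k cannot be improved via k itself.
        int via = tile[threadIdx.y][k] + tile[k][threadIdx.x];
        if (via < tile[threadIdx.y][threadIdx.x])
            tile[threadIdx.y][threadIdx.x] = via;
        __syncthreads();
    }
    graph[gi * n + gj] = tile[threadIdx.y][threadIdx.x];
}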

Phase 2

Horizontal

k = [ p * b          , (p + 1) * b          )
i = [ p * b          , (p + 1) * b          )
j = [ blockIdx.x * b , (blockIdx.x + 1) * b )

In particular, if the range of j falls at or past the center block, i.e. blockIdx.x * b >= center_block_start, an extra offset of b must be added.

For each thread block, the accessed ranges are i × j, i × k, and k × j; since i × j coincides with k × j, only 2 * b * b ints are needed in total.

In total, (ceil(n / b) - 1) * 1 thread blocks are required.
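A sketch of the horizontal pass, continuing the assumptions above (B = 32 from the Phase 1 sketch, row-major graph); this is not the submitted Phase2KernelHorizontal.

// Each thread block owns one tile in the center row. It needs the center
// tile (i x k) plus its own tile (i x j, which equals k x j), i.e.
// 2 * B * B ints of shared memory.
__global__ void phase2_horizontal_sketch(int *graph, int n, int p) {
    __shared__ int center[B][B];   // i x k: the tile produced by Phase 1
    __shared__ int tile[B][B];     // i x j (== k x j)

    int center_start = p * B;
    int j_start = blockIdx.x * B;
    if (j_start >= center_start) j_start += B;   // skip past the center block

    int gi = center_start + threadIdx.y;
    int gj = j_start + threadIdx.x;

    center[threadIdx.y][threadIdx.x] = graph[gi * n + center_start + threadIdx.x];
    tile[threadIdx.y][threadIdx.x] = graph[gi * n + gj];
    __syncthreads();

    for (int k = 0; k < B; ++k) {
        int via = center[threadIdx.y][k] + tile[k][threadIdx.x];
        if (via < tile[threadIdx.y][threadIdx.x])
            tile[threadIdx.y][threadIdx.x] = via;
        __syncthreads();
    }
    graph[gi * n + gj] = tile[threadIdx.y][threadIdx.x];
}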

Vertical

k = [ p * b          , (p + 1) * b          )
i = [ blockIdx.y * b , (blockIdx.y + 1) * b )
j = [ p * b          , (p + 1) * b          )

In particular, if the range of i falls at or past the center block, i.e. blockIdx.y * b >= center_block_start, an extra offset of b must be added.

For each thread block, the accessed ranges are i × j, i × k, and k × j; since i × j coincides with i × k, only 2 * b * b ints are needed in total.

In total, 1 * (ceil(n / b) - 1) thread blocks are required.
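The vertical pass is the mirror image; a sketch under the same assumptions (B = 32, row-major graph), again not the submitted Phase2KernelVertical.

// Each thread block owns one tile in the center column (i x j == i x k) and
// reads the center tile as k x j.
__global__ void phase2_vertical_sketch(int *graph, int n, int p) {
    __shared__ int center[B][B];   // k x j: the tile produced by Phase 1
    __shared__ int tile[B][B];     // i x j (== i x k)

    int center_start = p * B;
    int i_start = blockIdx.y * B;
    if (i_start >= center_start) i_start += B;   // skip past the center block

    int gi = i_start + threadIdx.y;
    int gj = center_start + threadIdx.x;

    center[threadIdx.y][threadIdx.x] = graph[(center_start + threadIdx.y) * n + gj];
    tile[threadIdx.y][threadIdx.x] = graph[gi * n + gj];
    __syncthreads();

    for (int k = 0; k < B; ++k) {
        int via = tile[threadIdx.y][k] + center[k][threadIdx.x];
        if (via < tile[threadIdx.y][threadIdx.x])
            tile[threadIdx.y][threadIdx.x] = via;
        __syncthreads();
    }
    graph[gi * n + gj] = tile[threadIdx.y][threadIdx.x];
}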

Phase 3

k = [ p * b          , (p + 1) * b          )
i = [ blockIdx.y * b , (blockIdx.y + 1) * b )
j = [ blockIdx.x * b , (blockIdx.x + 1) * b )

In particular, if the range of i or j falls at or past the center block, i.e. the corresponding blockIdx * b >= center_block_start, an extra offset of b must be added.

For each thread block, the accessed ranges are i × j, i × k, and k × j, none of which coincide, so 3 * b * b ints are needed in total.

In total, (ceil(n / b) - 1) * (ceil(n / b) - 1) thread blocks are required.
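A Phase 3 sketch under the same assumptions (B = 32, row-major graph; not the submitted Phase3Kernel). Because only the i × j tile is written, no synchronization is needed inside the relaxation loop.

__global__ void phase3_sketch(int *graph, int n, int p) {
    __shared__ int ij[B][B], ik[B][B], kj[B][B];   // three distinct tiles

    int center_start = p * B;
    int i_start = blockIdx.y * B;
    int j_start = blockIdx.x * B;
    if (i_start >= center_start) i_start += B;     // skip the center row
    if (j_start >= center_start) j_start += B;     // skip the center column

    int gi = i_start + threadIdx.y;
    int gj = j_start + threadIdx.x;

    ij[threadIdx.y][threadIdx.x] = graph[gi * n + gj];
    ik[threadIdx.y][threadIdx.x] = graph[gi * n + center_start + threadIdx.x];
    kj[threadIdx.y][threadIdx.x] = graph[(center_start + threadIdx.y) * n + gj];
    __syncthreads();

    int best = ij[threadIdx.y][threadIdx.x];
    for (int k = 0; k < B; ++k) {
        int via = ik[threadIdx.y][k] + kj[k][threadIdx.x];
        if (via < best) best = via;
    }
    graph[gi * n + gj] = best;
}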

Taking all phases into account, b = 32 is chosen, so each thread block has 32 × 32 threads. The largest shared-memory footprint (Phase 3) is then 3 * 32 * 32 * sizeof(int) = 12 KB, well within the 48 KB per-block limit, and the 32-element tile rows avoid bank conflicts.
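For completeness, a host-side launch sketch tying together the grid sizes derived above. It assumes n is a multiple of 32 (the actual test sizes such as 1000 would need boundary handling), and the kernel names refer to the hypothetical sketches above rather than the submitted kernels.

void apsp_sketch(int n, int *graph /* device pointer */) {
    int nb = n / B;                    // number of tiles per dimension
    dim3 thr(B, B);                    // 32 x 32 threads per block
    for (int p = 0; p < nb; ++p) {
        phase1_sketch<<<1, thr>>>(graph, n, p);                          // 1 block
        phase2_horizontal_sketch<<<dim3(nb - 1, 1), thr>>>(graph, n, p); // (nb - 1) * 1
        phase2_vertical_sketch<<<dim3(1, nb - 1), thr>>>(graph, n, p);   // 1 * (nb - 1)
        phase3_sketch<<<dim3(nb - 1, nb - 1), thr>>>(graph, n, p);       // (nb - 1)^2
    }
}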

Performance

    n    apspRef() (ms)    apsp() (ms)       Speedup
 1000         14.814903       2.969371    4.98923947
 2500        377.148402      37.660415   10.01445157
 5000       2972.073596     260.960028   11.38899938
 7500      10016.146987     872.866804   11.47500047
10000      22632.211686    2060.573817   10.98345107

n = 1000 下进行 profiling.

nvprof Events

Invocations                                Event Name         Min         Max         Avg       Total
Device "Tesla P100-PCIE-16GB (0)"
    Kernel: _GLOBAL__N__51_tmpxft_000981f1_00000000_20_apsp_compute_61_cpp1_ii_034c69fe::Phase2KernelHorizontal(int, int*, int, int)
         96                   shared_ld_bank_conflict           0           0           0           0
         96                   shared_st_bank_conflict           0           0           0           0
    Kernel: _GLOBAL__N__51_tmpxft_000981f1_00000000_20_apsp_compute_61_cpp1_ii_034c69fe::Phase2KernelVertical(int, int*, int, int)
         96                   shared_ld_bank_conflict           0           0           0           0
         96                   shared_st_bank_conflict           0           0           0           0
    Kernel: _GLOBAL__N__51_tmpxft_000981f1_00000000_20_apsp_compute_61_cpp1_ii_034c69fe::Phase1Kernel(int, int*, int, int)
         96                   shared_ld_bank_conflict           0           0           0           0
         96                   shared_st_bank_conflict           0           0           0           0
    Kernel: _GLOBAL__N__51_tmpxft_000981f1_00000000_20_apsp_compute_61_cpp1_ii_034c69fe::Phase3Kernel(int, int*, int, int)
         96                   shared_ld_bank_conflict           0           0           0           0
         96                   shared_st_bank_conflict           0           0           0           0

No bank conflicts occur.

nvprof Metrics

Invocations                               Metric Name                         Metric Description         Min         Max         Avg
Device "Tesla P100-PCIE-16GB (0)"
    Kernel: _GLOBAL__N__51_tmpxft_000981f1_00000000_20_apsp_compute_61_cpp1_ii_034c69fe::Phase2KernelHorizontal(int, int*, int, int)
         96                         branch_efficiency                          Branch Efficiency     100.00%     100.00%     100.00%
         96                 warp_execution_efficiency                  Warp Execution Efficiency      97.97%     100.00%      98.03%
         96         warp_nonpred_execution_efficiency   Warp Non-Predicated Execution Efficiency      81.95%      95.89%      95.26%
         96                            gld_efficiency              Global Memory Load Efficiency     100.00%     100.00%     100.00%
         96                            gst_efficiency             Global Memory Store Efficiency     100.00%     100.00%     100.00%
         96                         shared_efficiency                   Shared Memory Efficiency      67.38%      69.64%      67.45%
    Kernel: _GLOBAL__N__51_tmpxft_000981f1_00000000_20_apsp_compute_61_cpp1_ii_034c69fe::Phase2KernelVertical(int, int*, int, int)
         96                         branch_efficiency                          Branch Efficiency     100.00%     100.00%     100.00%
         96                 warp_execution_efficiency                  Warp Execution Efficiency      51.59%     100.00%      98.49%
         96         warp_nonpred_execution_efficiency   Warp Non-Predicated Execution Efficiency      41.59%      97.89%      95.93%
         96                            gld_efficiency              Global Memory Load Efficiency     100.00%     100.00%     100.00%
         96                            gst_efficiency             Global Memory Store Efficiency     100.00%     100.00%     100.00%
         96                         shared_efficiency                   Shared Memory Efficiency      18.58%      69.01%      67.43%
    Kernel: _GLOBAL__N__51_tmpxft_000981f1_00000000_20_apsp_compute_61_cpp1_ii_034c69fe::Phase1Kernel(int, int*, int, int)
         96                         branch_efficiency                          Branch Efficiency     100.00%     100.00%     100.00%
         96                 warp_execution_efficiency                  Warp Execution Efficiency      47.13%     100.00%      98.35%
         96         warp_nonpred_execution_efficiency   Warp Non-Predicated Execution Efficiency      46.31%      97.76%      96.15%
         96                            gld_efficiency              Global Memory Load Efficiency     100.00%     100.00%     100.00%
         96                            gst_efficiency             Global Memory Store Efficiency     100.00%     100.00%     100.00%
         96                         shared_efficiency                   Shared Memory Efficiency      18.52%      68.69%      67.12%
    Kernel: _GLOBAL__N__51_tmpxft_000981f1_00000000_20_apsp_compute_61_cpp1_ii_034c69fe::Phase3Kernel(int, int*, int, int)
         96                         branch_efficiency                          Branch Efficiency     100.00%     100.00%     100.00%
         96                 warp_execution_efficiency                  Warp Execution Efficiency      98.12%     100.00%      98.18%
         96         warp_nonpred_execution_efficiency   Warp Non-Predicated Execution Efficiency      84.89%      95.93%      95.23%
         96                            gld_efficiency              Global Memory Load Efficiency     100.00%     100.00%     100.00%
         96                            gst_efficiency             Global Memory Store Efficiency     100.00%     100.00%     100.00%
         96                         shared_efficiency                   Shared Memory Efficiency      67.69%      69.91%      67.75%

The metrics show that most resources are used quite efficiently, although shared memory efficiency is comparatively low (around 67%-70%).