Cache Lab: Understanding Cache Memories

About 5 minAbout 1385 words

Part A: Writing a Cache Simulator

argparse 部分使用了更强大的 argp, 而没有使用 getopt (因为一开始没看到作业建议用 getopt)

忽略 Instruction Load, Data Load 和 Data Store 都可看作访问一次 cache, 而 Data Modify 由 Data Load 和 Data Store 组成可看作访问两次 cache.

Cache 中包含一个 clock_t now; 作为时钟, 每次访问 cache 都会使它 ++. 对于每一个 CacheLine, 使用 timestamp 记录其最近一次被访问到的时间戳.

对于一个访问, 首先计算 set_index 和 tag, 即数据应当存放在哪个 CacheSet 中, 相应的 tag 是多少.

SetIndex set_index =
    (address >> cache->num_block_bits) & (cache->num_sets - 1);
Tag tag = (address >> (cache->num_block_bits + cache->num_set_index_bits));

然后搜索所需数据是否已在 cache 中

for (int i = 0; i < cache->num_lines_per_set; ++i) {
  if (cache_lines[i].valid && cache_lines[i].tag == tag) {  // hit
    cache_lines[i].timestamp = cache->now;
    ++(cache->hits);
    if (cache->verbose) printf("hit ");
    return;
  }
}

如果找到, 则 hit, 否则 miss

// no hit, miss
++(cache->misses);
if (cache->verbose) printf("miss ");

如果 miss, 首先寻找空 CacheLine, 尝试将数据存入空 CacheLine

// search for empty line
for (int i = 0; i < cache->num_lines_per_set; ++i) {
  if (cache_lines[i].valid == false) {  // find an empty line
    cache_lines[i].valid = true;
    cache_lines[i].tag = tag;
    cache_lines[i].timestamp = cache->now;
    return;
  }
}

如果找到一空的 CacheLine, 则填入该 CacheLine. 否则, 所有 CacheLine 都存有数据, 则需要清出一行 CacheLine

// no empty line, eviction needed
++(cache->evictions);
if (cache->verbose) printf("eviction ");

根据 LRU (least-recently used) replacement policy 寻找最老的数据

// search for least recently used line
SetIndex least_recently_used_id = 0;
for (int i = 1; i < cache->num_lines_per_set; ++i) {
  if (cache_lines[i].timestamp <
      cache_lines[least_recently_used_id].timestamp) {
    least_recently_used_id = i;
  }
}
cache_lines[least_recently_used_id].valid = true;
cache_lines[least_recently_used_id].tag = tag;
cache_lines[least_recently_used_id].timestamp = cache->now;

Evaluation

                        Your simulator     Reference simulator
Points (s,E,b)    Hits  Misses  Evicts    Hits  Misses  Evicts
     3 (1,1,1)       9       8       6       9       8       6  traces/yi2.trace
     3 (4,2,4)       4       5       2       4       5       2  traces/yi.trace
     3 (2,1,4)       2       3       1       2       3       1  traces/dave.trace
     3 (2,1,3)     167      71      67     167      71      67  traces/trans.trace
     3 (2,2,3)     201      37      29     201      37      29  traces/trans.trace
     3 (2,4,3)     212      26      10     212      26      10  traces/trans.trace
     3 (5,1,5)     231       7       0     231       7       0  traces/trans.trace
     6 (5,1,5)  265189   21775   21743  265189   21775   21743  traces/long.trace
    27

TEST_CSIM_RESULTS=27

Part B: Optimizing Matrix Transpose

Cache 参数为

number of set index bits	number of sets	number of lines per set	number of block bits	block size
5	32	1	5	32

这意味着每个 CacheBlock 能够存储 8 个 int.

首先测试默认的 trans().

                        Points   Max pts      Misses
Trans perf 32x32           0.0         8        1183
Trans perf 64x64           0.0         8        4723
Trans perf 61x67           0.0        10        4423

我们只需针对这三种大小的矩阵进行优化即可.

Transpose32x32

考虑 cache blocking, 因为每个 CacheBlock 只能存储 8 个 int, 所以选取 BLOCK_SIZE 为 8, 使用 r0...7 作为临时变量记录一个 CacheBlock.

char transpose_32x32_desc[] = "Transpose submission 32x32";
void Transpose32x32(int M, int N, int A[N][M], int B[M][N]) {
#define BLOCK_SIZE 8
  int current_start_x, current_start_y;
  int x;
  int r0, r1, r2, r3, r4, r5, r6, r7;
  for (current_start_x = 0; current_start_x < N;
       current_start_x += BLOCK_SIZE) {
    for (current_start_y = 0; current_start_y < M;
         current_start_y += BLOCK_SIZE) {
      for (x = current_start_x; x < current_start_x + BLOCK_SIZE; ++x) {
        r0 = A[x][current_start_y];
        r1 = A[x][current_start_y + 1];
        r2 = A[x][current_start_y + 2];
        r3 = A[x][current_start_y + 3];
        r4 = A[x][current_start_y + 4];
        r5 = A[x][current_start_y + 5];
        r6 = A[x][current_start_y + 6];
        r7 = A[x][current_start_y + 7];
        B[current_start_y][x] = r0;
        B[current_start_y + 1][x] = r1;
        B[current_start_y + 2][x] = r2;
        B[current_start_y + 3][x] = r3;
        B[current_start_y + 4][x] = r4;
        B[current_start_y + 5][x] = r5;
        B[current_start_y + 6][x] = r6;
        B[current_start_y + 7][x] = r7;
      }
    }
  }
}

Performance

func 0 (Transpose submission): hits:1766, misses:287, evictions:255

Transpose64x64

如果仍使用简单的分块方法, 得到的结果很不理想

func 0 (Transpose submission): hits:3586, misses:4611, evictions:4579

这是因为 A[x][y] 与 A[x + 4][y], 将会对应到同一 CacheSet, 发生冲突, 导致 miss 增加.

注意到 Cache 其实有 32 行, 能够存储 4 个 8x8 的矩阵, 这些空行应当被充分利用. 所以, 考虑将 A 中的 8x8 矩阵划分为上下两个 4x8 矩阵, 并且利用 B 中的空余位置记录 Cache. 因此, B 中用来存储这两个 4x8 矩阵的空间在 Cache 中只需要不与 A 中的 8x8 矩阵冲突, 不与 B 中的 8x8 矩阵冲突, 当然也不能互相冲突即可.

// unused 4x8 block in B for the upper 4x8 block from a 8x8 block in A
int tmp_up_x = current_x;
int tmp_up_y = current_y;
do {
  tmp_up_y += BLOCK_SIZE;
  if (tmp_up_y >= N) {
    tmp_up_x += BLOCK_SIZE;
    tmp_up_y -= N;
  }
} while (tmp_up_y == current_x);

// unused 4x8 block in B for the lower 4x8 block from a 8x8 block in A
int tmp_down_x = tmp_up_x;
int tmp_down_y = tmp_up_y;
do {
  tmp_down_y += BLOCK_SIZE;
  if (tmp_down_y >= N) {
    tmp_down_x += BLOCK_SIZE;
    tmp_down_y -= N;
  }
} while (tmp_down_y == current_x);

如果没有空余的空间, 就直接进行转置

if (tmp_up_x >= N) {
  for (int i = 0; i < BLOCK_SIZE; i++) {
    for (int j = 0; j < BLOCK_SIZE; j++) {
      B[current_x + j][current_y + i] = A[current_y + i][current_x + j];
    }
  }
}

Performance

func 0 (Transpose submission): hits:14968, misses:1165, evictions:1133

Transpose61x67

选取一个合适的 BLOCK_SIZE 即可, 这里取为 16.

char transpose_61x67_desc[] = "Transpose submission 61x67";
void Transpose61x67(int M, int N, int A[N][M], int B[M][N]) {
#undef BLOCK_SIZE
#define BLOCK_SIZE 16
  int block_start_x, block_start_y;
  int x, y;
  for (block_start_x = 0; block_start_x < M; block_start_x += BLOCK_SIZE) {
    for (block_start_y = 0; block_start_y < N; block_start_y += BLOCK_SIZE) {
      for (x = block_start_y; (x < N) && (x < block_start_y + BLOCK_SIZE);
           ++x) {
        for (y = block_start_x; (y < M) && (y < block_start_x + BLOCK_SIZE);
             ++y) {
          B[y][x] = A[x][y];
        }
      }
    }
  }
}

Performance

func 0 (Transpose submission): hits:6363, misses:1816, evictions:1784

Pulling it all Together

Part A: Testing cache simulator
Running ./test-csim
                        Your simulator     Reference simulator
Points (s,E,b)    Hits  Misses  Evicts    Hits  Misses  Evicts
     3 (1,1,1)       9       8       6       9       8       6  traces/yi2.trace
     3 (4,2,4)       4       5       2       4       5       2  traces/yi.trace
     3 (2,1,4)       2       3       1       2       3       1  traces/dave.trace
     3 (2,1,3)     167      71      67     167      71      67  traces/trans.trace
     3 (2,2,3)     201      37      29     201      37      29  traces/trans.trace
     3 (2,4,3)     212      26      10     212      26      10  traces/trans.trace
     3 (5,1,5)     231       7       0     231       7       0  traces/trans.trace
     6 (5,1,5)  265189   21775   21743  265189   21775   21743  traces/long.trace
    27


Part B: Testing transpose function
Running ./test-trans -M 32 -N 32
Running ./test-trans -M 64 -N 64
Running ./test-trans -M 61 -N 67

Cache Lab summary:
                        Points   Max pts      Misses
Csim correctness          27.0        27
Trans perf 32x32           8.0         8         287
Trans perf 64x64           8.0         8        1165
Trans perf 61x67          10.0        10        1816
          Total points    53.0        53