Cache Lab: Understanding Cache Memories
Part A: Writing a Cache Simulator
argparse
部分使用了更强大的argp
, 而没有使用getopt
(因为一开始没看到作业建议用getopt
)
忽略 Instruction Load, Data Load 和 Data Store 都可看作访问一次 cache, 而 Data Modify 由 Data Load 和 Data Store 组成可看作访问两次 cache.
Cache
中包含一个 clock_t now;
作为时钟, 每次访问 cache 都会使它 ++
. 对于每一个 CacheLine
, 使用 timestamp
记录其最近一次被访问到的时间戳.
对于一个访问, 首先计算 set_index
和 tag
, 即数据应当存放在哪个 CacheSet
中, 相应的 tag
是多少.
SetIndex set_index =
(address >> cache->num_block_bits) & (cache->num_sets - 1);
Tag tag = (address >> (cache->num_block_bits + cache->num_set_index_bits));
然后搜索所需数据是否已在 cache 中
for (int i = 0; i < cache->num_lines_per_set; ++i) {
if (cache_lines[i].valid && cache_lines[i].tag == tag) { // hit
cache_lines[i].timestamp = cache->now;
++(cache->hits);
if (cache->verbose) printf("hit ");
return;
}
}
如果找到, 则 hit, 否则 miss
// no hit, miss
++(cache->misses);
if (cache->verbose) printf("miss ");
如果 miss, 首先寻找空 CacheLine
, 尝试将数据存入空 CacheLine
// search for empty line
for (int i = 0; i < cache->num_lines_per_set; ++i) {
if (cache_lines[i].valid == false) { // find an empty line
cache_lines[i].valid = true;
cache_lines[i].tag = tag;
cache_lines[i].timestamp = cache->now;
return;
}
}
如果找到一空的 CacheLine
, 则填入该 CacheLine
. 否则, 所有 CacheLine
都存有数据, 则需要清出一行 CacheLine
// no empty line, eviction needed
++(cache->evictions);
if (cache->verbose) printf("eviction ");
根据 LRU (least-recently used) replacement policy 寻找最老的数据
// search for least recently used line
SetIndex least_recently_used_id = 0;
for (int i = 1; i < cache->num_lines_per_set; ++i) {
if (cache_lines[i].timestamp <
cache_lines[least_recently_used_id].timestamp) {
least_recently_used_id = i;
}
}
cache_lines[least_recently_used_id].valid = true;
cache_lines[least_recently_used_id].tag = tag;
cache_lines[least_recently_used_id].timestamp = cache->now;
Evaluation
Your simulator Reference simulator
Points (s,E,b) Hits Misses Evicts Hits Misses Evicts
3 (1,1,1) 9 8 6 9 8 6 traces/yi2.trace
3 (4,2,4) 4 5 2 4 5 2 traces/yi.trace
3 (2,1,4) 2 3 1 2 3 1 traces/dave.trace
3 (2,1,3) 167 71 67 167 71 67 traces/trans.trace
3 (2,2,3) 201 37 29 201 37 29 traces/trans.trace
3 (2,4,3) 212 26 10 212 26 10 traces/trans.trace
3 (5,1,5) 231 7 0 231 7 0 traces/trans.trace
6 (5,1,5) 265189 21775 21743 265189 21775 21743 traces/long.trace
27
TEST_CSIM_RESULTS=27
Part B: Optimizing Matrix Transpose
Cache 参数为
number of set index bits | number of sets | number of lines per set | number of block bits | block size |
---|---|---|---|---|
5 | 32 | 1 | 5 | 32 |
这意味着每个 CacheBlock
能够存储 8 个 int
.
首先测试默认的 trans()
.
Points Max pts Misses
Trans perf 32x32 0.0 8 1183
Trans perf 64x64 0.0 8 4723
Trans perf 61x67 0.0 10 4423
我们只需针对这三种大小的矩阵进行优化即可.
Transpose32x32
考虑 cache blocking, 因为每个 CacheBlock
只能存储 8 个 int
, 所以选取 BLOCK_SIZE
为 8, 使用 r0...7
作为临时变量记录一个 CacheBlock
.
char transpose_32x32_desc[] = "Transpose submission 32x32";
void Transpose32x32(int M, int N, int A[N][M], int B[M][N]) {
#define BLOCK_SIZE 8
int current_start_x, current_start_y;
int x;
int r0, r1, r2, r3, r4, r5, r6, r7;
for (current_start_x = 0; current_start_x < N;
current_start_x += BLOCK_SIZE) {
for (current_start_y = 0; current_start_y < M;
current_start_y += BLOCK_SIZE) {
for (x = current_start_x; x < current_start_x + BLOCK_SIZE; ++x) {
r0 = A[x][current_start_y];
r1 = A[x][current_start_y + 1];
r2 = A[x][current_start_y + 2];
r3 = A[x][current_start_y + 3];
r4 = A[x][current_start_y + 4];
r5 = A[x][current_start_y + 5];
r6 = A[x][current_start_y + 6];
r7 = A[x][current_start_y + 7];
B[current_start_y][x] = r0;
B[current_start_y + 1][x] = r1;
B[current_start_y + 2][x] = r2;
B[current_start_y + 3][x] = r3;
B[current_start_y + 4][x] = r4;
B[current_start_y + 5][x] = r5;
B[current_start_y + 6][x] = r6;
B[current_start_y + 7][x] = r7;
}
}
}
}
Performance
func 0 (Transpose submission): hits:1766, misses:287, evictions:255
Transpose64x64
如果仍使用简单的分块方法, 得到的结果很不理想
func 0 (Transpose submission): hits:3586, misses:4611, evictions:4579
这是因为 A[x][y]
与 A[x + 4][y]
, 将会对应到同一 CacheSet
, 发生冲突, 导致 miss 增加.
注意到 Cache 其实有 32 行, 能够存储 4 个 8x8 的矩阵, 这些空行应当被充分利用. 所以, 考虑将 A
中的 8x8 矩阵划分为上下两个 4x8 矩阵, 并且利用 B
中的空余位置记录 Cache. 因此, B
中用来存储这两个 4x8 矩阵的空间在 Cache 中只需要不与 A
中的 8x8 矩阵冲突, 不与 B
中的 8x8 矩阵冲突, 当然也不能互相冲突即可.
// unused 4x8 block in B for the upper 4x8 block from a 8x8 block in A
int tmp_up_x = current_x;
int tmp_up_y = current_y;
do {
tmp_up_y += BLOCK_SIZE;
if (tmp_up_y >= N) {
tmp_up_x += BLOCK_SIZE;
tmp_up_y -= N;
}
} while (tmp_up_y == current_x);
// unused 4x8 block in B for the lower 4x8 block from a 8x8 block in A
int tmp_down_x = tmp_up_x;
int tmp_down_y = tmp_up_y;
do {
tmp_down_y += BLOCK_SIZE;
if (tmp_down_y >= N) {
tmp_down_x += BLOCK_SIZE;
tmp_down_y -= N;
}
} while (tmp_down_y == current_x);
如果没有空余的空间, 就直接进行转置
if (tmp_up_x >= N) {
for (int i = 0; i < BLOCK_SIZE; i++) {
for (int j = 0; j < BLOCK_SIZE; j++) {
B[current_x + j][current_y + i] = A[current_y + i][current_x + j];
}
}
}
Performance
func 0 (Transpose submission): hits:14968, misses:1165, evictions:1133
Transpose61x67
选取一个合适的 BLOCK_SIZE
即可, 这里取为 16.
char transpose_61x67_desc[] = "Transpose submission 61x67";
void Transpose61x67(int M, int N, int A[N][M], int B[M][N]) {
#undef BLOCK_SIZE
#define BLOCK_SIZE 16
int block_start_x, block_start_y;
int x, y;
for (block_start_x = 0; block_start_x < M; block_start_x += BLOCK_SIZE) {
for (block_start_y = 0; block_start_y < N; block_start_y += BLOCK_SIZE) {
for (x = block_start_y; (x < N) && (x < block_start_y + BLOCK_SIZE);
++x) {
for (y = block_start_x; (y < M) && (y < block_start_x + BLOCK_SIZE);
++y) {
B[y][x] = A[x][y];
}
}
}
}
}
Performance
func 0 (Transpose submission): hits:6363, misses:1816, evictions:1784
Pulling it all Together
Part A: Testing cache simulator
Running ./test-csim
Your simulator Reference simulator
Points (s,E,b) Hits Misses Evicts Hits Misses Evicts
3 (1,1,1) 9 8 6 9 8 6 traces/yi2.trace
3 (4,2,4) 4 5 2 4 5 2 traces/yi.trace
3 (2,1,4) 2 3 1 2 3 1 traces/dave.trace
3 (2,1,3) 167 71 67 167 71 67 traces/trans.trace
3 (2,2,3) 201 37 29 201 37 29 traces/trans.trace
3 (2,4,3) 212 26 10 212 26 10 traces/trans.trace
3 (5,1,5) 231 7 0 231 7 0 traces/trans.trace
6 (5,1,5) 265189 21775 21743 265189 21775 21743 traces/long.trace
27
Part B: Testing transpose function
Running ./test-trans -M 32 -N 32
Running ./test-trans -M 64 -N 64
Running ./test-trans -M 61 -N 67
Cache Lab summary:
Points Max pts Misses
Csim correctness 27.0 27
Trans perf 32x32 8.0 8 287
Trans perf 64x64 8.0 8 1165
Trans perf 61x67 10.0 10 1816
Total points 53.0 53