In part A, we will need to write a cache simulator to interpret the trace output file generated by valgrind. At run time, the parameters for the cache will be given through command line options.
Luckily, the professor provides us with a reference binary as a template to “guide us through”, or show the answer in front of us.
The first thing first is to define a useful structure which can be used to simulate the cache. From the course video and the recitation, we are able to know that cache is composed of three parts: cache set, cache line, and cache block. Since it’s just a simulation, so we don’t need the cache block actually to store the data. We just need to determine if the address that the program access is a hit/miss/eviction.
Generally, a cache line needs a valid bit, tag, and cache block. But here we can remove cache block with the reason above and add a lru variable to act as a time stamp to determine which line should be evicted.
Cache line -> cache set -> cache. So here in the code blow, Cache = ***cacheLine.
/* initialize cache, allocate enough space for the cache */ Cache init_cache(int s, int E, int b) { int S = 1 << s; Cache cache = (Cache)malloc(sizeof(cacheSet) * S); if (cache == NULL) returnNULL; for (int i = 0; i < S; i++) { cache[i] = (cacheSet)calloc(E, sizeof(cacheLine)); if (cache[i] == NULL) returnNULL; } return cache; }
/* free cache one by one */ voidfree_cache(Cache cache, int s) { int S = 1 << s; for (int i = 0; i < S; i++) { free(cache[i]); } free(cache); }
Then, another big thing is to understand how to use getopt by reading the man page and searching for demo code online. In optstring, an option followed by a colon : means it requires an argument, followed by two colons : means the argument is optional. If the optstring start with an add sign +, means it will stop when it meets the first unrecognizable option.
/* check arguments */ if (s <= 0 || E <=0 || b <= 0 || target == NULL) { fprintf(stderr, "%s: Missing required command line argument\n", argv[0]); print_help_msg(); exit(0); }
Then, we have to analyze how the trace file accesses the cache and what should we do to determine if it is a hit/miss/eviction.
To index the cache, we should first know the cache set index from the provided s, E, b, addr. Second, we will check the tag in all valid cache lines to see if it matches. If it does match, we will get a hit, otherwise fail to hit (still need to determine whether it is a miss or eviction).
If the cache is not full, then it will be a miss, and the cache will put the data into the empty cache line. If it is full, then we will have to change the content of the last used cache line to the new input one, which is an eviction.
/* access cache with the address to check whether is hit 1/miss 2/eviction 0 */ intaccess_cache(Cache cache, longunsignedint addr, int s, int E, int b) { longunsignedint tag = addr >> (s + b); unsignedint set_idx = (addr >> b) & ((1 << s) - 1); int empty = -1; int evict = 0;
cacheSet cacheset = cache[set_idx];
for (int i = 0; i < E; i++) { if (cacheset[i].valid_bit) { if (cacheset[i].tag == tag) { // hit cacheset[i].lru = 0; return1; }
if (empty == -1) { // eviction is as well as a miss cacheset[evict].tag = tag; cacheset[evict].lru = 0; return0; } else { // miss cacheset[empty].valid_bit = 1; cacheset[empty].tag = tag; cacheset[empty].lru = 0; return2; } }
So now, the whole program basically accomplishes all goals we would like to achieve, combine them up and add some helper function like print_help_msg and scan through the trace file.
voidprint_help_msg() { fprintf(stderr, "Usage: ./csim [-hv] -s <num> -E <num> -b <num> -t <file> \n\ Options:\n\ -h Print this help message.\n\ -v Optional verbose flag. \n\ -s <num> Number of set index bits. \n\ -E <num> Number of lines per set. \n\ -b <num> Number of block offset bits. \n\ -t <file> Trace file. \n\ \n\ Examples: \n\ linux> ./csim -s 4 -E 1 -b 4 -t traces/yi.trace \n\ linux> ./csim -v -s 8 -E 2 -b 4 -t traces/yi.trace \n"); }
/* initialize cache, allocate enough space for the cache */ Cache init_cache(int s, int E, int b) { int S = 1 << s; Cache cache = (Cache)malloc(sizeof(cacheSet) * S); if (cache == NULL) returnNULL; for (int i = 0; i < S; i++) { cache[i] = (cacheSet)calloc(E, sizeof(cacheLine)); if (cache[i] == NULL) returnNULL; } return cache; }
/* free cache one by one */ voidfree_cache(Cache cache, int s) { int S = 1 << s; for (int i = 0; i < S; i++) { free(cache[i]); } free(cache); }
/* access cache with the address to check whether is hit 1/miss 2/eviction 0 */ intaccess_cache(Cache cache, longunsignedint addr, int s, int E, int b) { longunsignedint tag = addr >> (s + b); unsignedint set_idx = (addr >> b) & ((1 << s) - 1); int empty = -1; int evict = 0;
cacheSet cacheset = cache[set_idx];
for (int i = 0; i < E; i++) { if (cacheset[i].valid_bit) { if (cacheset[i].tag == tag) { // hit cacheset[i].lru = 0; return1; }
if (empty == -1) { // eviction is as well as a miss cacheset[evict].tag = tag; cacheset[evict].lru = 0; return0; } else { // miss cacheset[empty].valid_bit = 1; cacheset[empty].tag = tag; cacheset[empty].lru = 0; return2; } }
intmain(int argc, char *argv[]) { int opt; int v_flag = 0; int s = 0; int E = 0; int b = 0; char *target;
switch(condition) { // eviction is also miss case0: eviction_count++; miss_count++; break; // hit case1: hit_count++; break; // miss case2: miss_count++; break; }
This is a trickier version of solve. Since the professor provides the reference file, we can reverse engineer it to get the content and logic in the csim-ref binary file. Luckily, it compiles with all symbols preserved, it looks exactly the same as the source code in some professional reverse engineering tools. Here is just an example:
while ( 1 ) { c = getopt(argc, (char *const *)argv, "s:E:b:t:vh"); if ( c == -1 ) break; switch ( c ) { case'E': E = atoi(optarg); break; case'b': b = atoi(optarg); break; case'h': printUsage((char **)argv); case's': s = atoi(optarg); break; case't': trace_file = optarg; break; case'v': verbosity = 1; break; default: printUsage((char **)argv); } } if ( !s || !E || !b || !trace_file ) { printf("%s: Missing required command line argument\n", *argv); printUsage((char **)argv); } S = (int)pow(2.0, (double)s); B = (int)pow(2.0, (double)b); initCache(); replayTrace(trace_file); freeCache(); printSummary(hit_count, miss_count, eviction_count); return0; }
Through reverse the logic, we can also know what we need to do in our program XD.
Part B
In part B, we are going to write a function to transpose matrix with different sizes 32 x 32, 64 x 64, 61 x 67, under the cache condition of s = 5, E = 1, b = 5, which has 32 set/line, each line have 32 bytes.
32 x 32
A 32 x 32 matrix is easy. Since each line of cache can store at most 8 int. Splitting the block into 8 x 8 submatrix, reading out each line in matrix A and copying it to each column in matrix B will solve the problem.
Summary for official submission (func 0): correctness=1 misses=287
TEST_TRANS_RESULTS=1:287
64 x 64
This is the hardest part of the assignment, I didn’t finish this myself, so I take reference from here.
So basically, the idea is to split the 8 x 8 matrix into a smaller 4 x 4 matrix. However, if we directly apply the idea to the 8 x 8 matrix, the result will be the same as splitting the original matrix into a 4 x 4 matrix, which is about 1644 misses, failing the test.
That’s because these two methods have the same idea, which is to read 4 rows in A and write 4 columns in B. So they have the same result, no matter what size the original matrix is.
The trick is that: When reading the first four lines of the 8 x 8 submatrix, it also reads in the content in block 2 and stores it somewhere in matrix B (the content in B doesn’t matter because not finished yet). We use the same cache to read out more data.
Then, we reuse the cache to recover the lost block 2 in matrix B and copy the rest content to B.
1 2 3
1 2
3 4
So in the code, it reverses the read sequence to “preheat” the cache and prepare for the following read/write in blocks 3 and 4 to decrease the miss further.
Summary for official submission (func 0): correctness=1 misses=1243
TEST_TRANS_RESULTS=1:1243
61 x 67
This is also very simple to deal with. Since it is not a square matrix, the improvement to split the matrix into smaller parts is little. We can just split it into a 16 x 16 matrix and it will be fine.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
if (M == 61) { int i, j, k, l; for (i = 0; i < N; i += 16) { for (j = 0; j < M; j += 16) { for (k = i; k < i + 16 && k < N; k++) { for (l = j; l < j + 16 && l < M; l++) { B[l][k] = A[k][l]; } } } } }