ics-cachelab

Yiyang-Fan

Last modified: 2024-01-03 18:43:11

First published: 2024-01-03 18:36:49

Article link: https://blog.csdn.net/qq_46895011/article/details/135370423

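Lab overview

Preface

This lab accompanies Chapter 6 of CSAPP.
Its goal is to deepen your understanding of caches. The lab has three parts:

part A: write a cache simulator in C. It reads a trace file in a fixed format (the trace records a sequence of memory reads and writes) and reports the number of cache hits, misses, and evictions. Part of the code is provided.
part B: design a matrix transpose algorithm for a given set of cache parameters so that the transpose incurs as few cache misses as possible.
part C (honor part): two tasks
- further optimize the matrix transpose from part B
- the same as part B, but for a matrix multiplication algorithm

This lab is based on the Cache Lab from CMU's CSAPP course.
Since the pj (course project) is coming up, the TAs have made this lab somewhat easier than the original (except for the honor part, which carries few points), and the whole lab is in C (no more wrestling with abstract assembly), so there is no need to worry too much~~~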

Score breakdown

part A: 40%
part B: 36%
part C (honor part): 9%
Lab report + code style: 15%

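Setting up the lab environment

(1) Download

Download link: cachelab-handout

This is a tar file and needs to be unpacked.

(Ubuntu VM or WSL) Open a terminal, change to the directory that contains the file, and run:

tar -xvf cachelab-handout.tar

This unpacks a cachelab-handout folder into the current directory; it contains the files used in this lab.

(2) Preparation

Make sure gcc is installed

Check in the terminal whether gcc is installed:

gcc -v

If it is installed, the terminal prints version information; otherwise it reports command not found.
If it is not installed, try installing it with:

sudo apt-get install gcc

Make sure make is installed

Check whether make is installed by running:

make -v

Likewise, if it is not installed, run the following commands in order:

sudo apt-get update
sudo apt-get install make
sudo apt-get install libc6 libc6-dev libc6-dev-i386

Make sure python is installed

python --version

Most systems ship with python.
If it is not installed, look up an installation guide online.

Install valgrind

sudo apt-get install valgrind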

part A

intro

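Write a cache simulator that reads a trace file in the given format, simulates the behavior of the cache, and prints the number of cache hits, misses, and evictions.
Trace files are generated with valgrind's lackey tool and look like this: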

I 0400d7d4,8
 M 0421c7f0,4
 L 04f6b868,8
 S 7ff0005c8,8

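Each line has the format

[space]operation address,size

I denotes an instruction load, L a data load, S a data store, and M a data modify (a load followed by a store). Every operation except I is preceded by a space. address is the address accessed (in hexadecimal) and size is the number of bytes accessed.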

to-do

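All of your implementation goes in csim.c and csim.h.

Global variables and function declarations go in csim.h; function implementations go in csim.c.

We provide csim-ref, a reference implementation that you can use to check your own. Its usage is:

./csim-ref [-hv] -s <s> -E <E> -b <b> -t <tracefile>

-h: print help
-v: verbose, print detailed information
-s: number of sets in the cache
-E: number of cache lines per set
-b: size of a cache line in bytes
-t: path to the trace file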

csim-ref prints the number of cache hits, misses, and evictions, for example:

```shell
$ ./csim-ref -s 16 -E 1 -b 16 -t traces/yi.trace
hits:4 misses:5 evictions:3
```

Verbose mode:

```shell
$ ./csim-ref -v -s 16 -E 1 -b 16 -t traces/yi.trace
L 10,1 miss
M 20,1 miss hit
L 22,1 hit
S 18,1 hit
L 110,1 miss eviction
L 210,1 miss eviction
M 12,1 miss eviction hit
hits:4 misses:5 evictions:3
```

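Your implementation must behave the same as csim-ref, including the debug output in verbose mode.

csim.c already contains basic code for parsing the command-line arguments; build your implementation on top of it.

The cache replacement policy is LRU.

requirements

Your code must compile without warnings.
You may only use C (the TAs cannot read C++ or Python).
Test data is provided, but hard-coding for the test data is not allowed, and the TAs will inspect the source. Calling csim-ref directly is not allowed either.

evaluation

There are 8 tests: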

./csim -s 1 -E 1 -b 1 -t traces/yi2.trace
./csim -s 4 -E 2 -b 4 -t traces/yi.trace
./csim -s 2 -E 1 -b 4 -t traces/dave.trace
./csim -s 2 -E 1 -b 3 -t traces/trans.trace
./csim -s 2 -E 2 -b 3 -t traces/trans.trace
./csim -s 2 -E 4 -b 3 -t traces/trans.trace
./csim -s 5 -E 1 -b 5 -t traces/trans.trace
./csim -s 5 -E 1 -b 5 -t traces/long.trace

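Raw score: 3 points for each of the first 7 tests and 6 points for the last, 27 in total. In each test, the hit, miss, and eviction counts are each worth 1/3 of the points.

The raw score is multiplied by 40/27 to give the final score.

You can view the final score by running ./driver.py.

hints

Use malloc and free to build the cache.
You can use csim-ref to check your implementation; verbose mode makes debugging easier.
LRU can be implemented simply with a counter.
There are few constraints on how you implement it, so use whatever approach works for you~~~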

part B

intro

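A cache is called a "high-speed cache" because reading from it is far faster than reading from main memory (perhaps around 100 times), so the number of cache misses often determines how fast a program runs. We therefore want to write cache-friendly programs that miss as rarely as possible.

In this part you will optimize a matrix transpose (a program that very easily causes cache misses) so that it incurs as few cache misses as possible. Your score is determined by the miss count.

to-do

All of your implementation goes in trans.c.

You will write a function that takes four arguments: M, N, an N * M matrix A, and an M * N matrix B. It must store the transpose of A in B.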

char trans_desc[] = "some description";
void trans(int M, int N, int A[N][M], int B[M][N])
{
    
}

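Each time you write such a function, you can "register" it in registerFunctions(). Only registered functions are included in the evaluation, and you may register and evaluate several. To register the function above, add this line to registerFunctions():

registerTransFunction(trans, trans_desc);

We provide a function named trans() as an example.

Exactly one registered function must be your final submission. Functions are distinguished by the description given at registration, so make sure the description of your submission is "Transpose submission", for example: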

char transpose_submit_desc[] = "Transpose submission";
void transpose_submit(int M, int N, int A[N][M], int B[M][N])
{
    
}

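We evaluate with matrices of specific shapes and a cache with specific parameters, so you may tailor your code to these cases.

requirements

Your code must compile without warnings.
In each transpose function you may define at most 12 local variables of type int (loop variables do not count, but you may not use loop variables for any other purpose), and you may not use any global variables. You may not define variables of any type other than int. You may not allocate memory with malloc or similar. You may use int arrays, but an array counts as as many int variables as it has elements.
You may not use recursion.
The transpose must be done in a single function; you may not call helper functions from it.
You may not modify the original matrix A, but you may modify matrix B freely.
You may define macros.

evaluation

The cache parameters are s = 48, E = 1, b = 48: each cache line is 48 bytes, there are 48 cache lines in total, and each set contains exactly 1 cache line.
We evaluate with the following 3 matrices:
- 48 * 48, worth 12 points: full marks for fewer than 500 misses, 0 points for more than 800, and a proportional score for 500~800
- 96 * 96, worth 12 points: full marks for fewer than 2200 misses, 0 points for more than 3000, and a proportional score for 2200~3000
- 93 * 99, worth 12 points: full marks for fewer than 3000 misses, 0 points for more than 4000, and a proportional score for 3000~4000
- 4 honor points, described in the honor part

Only these three matrices are tested, so you need only consider these three cases.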

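step 0

make clean && make

step 1

Before measuring, test that the algorithm is correct:

./tracegen -M <row> -N <col>

For example, to test the 48 * 48 transpose:

./tracegen -M 48 -N 48

You can also test a specific function, for example the 0th registered function:

./tracegen -M 48 -N 48 -F 0

step 2

./test-trans -M <row> -N <col>

This program uses valgrind to generate a trace file and then calls csim-ref to obtain the cache hit, miss, and eviction counts.

hints

After running ./test-trans, you can inspect your cache hits and misses with the command below; replace f0 with fi to see the results for the i-th registered function.

./csim-ref -v -s 48 -E 1 -b 48 -t trace.f0 > result.txt

This article may give you some ideas.
The cache has associativity 1, so you may need to think about conflict misses.
Estimating your miss count by hand can be a good idea: work out roughly what fraction of accesses miss and multiply by the total number of reads and writes. You can check your estimate against the result.txt generated above.
You may assume that A and B each start at the beginning of a cache line (that is, the start addresses of the two 2D arrays are divisible by 48).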

part C --honor part

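warning: this part is hard and may take a lot of time, but it is worth few points. Weigh the time against the payoff yourself.

1

(2 points) In part B, bring the cache misses for the 48 * 48 transpose below 450.
(2 points) In part B, bring the cache misses for the 96 * 96 transpose below 1900.

2

intro

The same as part B, but you implement a matrix multiplication algorithm.
Cache parameters: s = 32, E = 1, b = 32
Test matrices: A: 32 * 32; B: 32 * 32

to-do

cd honor-part

All of your implementation goes in mul.c.

Implement the following function, storing the result of A * B in C: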

char mul_desc[] = "some description";
void mul(int M, int N, int A[N][M], int B[M][N], int C[N][N])
{
    
}

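Then register it in registerFunctions(), following the same steps as in part B.

requirements

The same as part B, except that you may define at most 16 local variables of type int (loop variables do not count, but you may not use them for any other purpose). You may not modify the original matrices A and B, but you may modify matrix C freely.

evaluation

You earn 5 bonus points if your cache miss count is < 4000.

step 0

make clean && make

step 1

Before measuring, test that the algorithm is correct:

./tracegen -M 32 -N 32

step 2

./test-mul -M 32 -N 32

Grading

From the project root:

./driver.py

Make sure you have already run make in both the project root and the ./honor-part directory.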

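Submitting the lab

(1) Content requirements

You need to submit:

csim.c
csim.h
trans.c
mul.c (if you completed it)
a lab report

The lab report should contain:

The lab title, your name, and your student ID.
A screenshot of the terminal after running ./driver.py.
A concise, clear description of your approach for each part.
A list of anything you quoted and any material you consulted, if applicable.
Your thoughts on the lab (optional).
Suggestions for the TAs (optional).

(2) Format requirements

Submit a .md or .pdf file. Do not submit .doc or .docx files.

Pack all code files and the lab report into a tar file named <student ID>.tar

References

cache_lab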

Part A

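Structure

The lab asks us to evict the line whose last access is furthest in the past. Following the hint, each cache_line in the cache definition gets a counter that tracks how long ago the line was last accessed.

Approach:

Initialize -> look for a matching line in the set -> hit if found, miss otherwise -> on a miss, check whether the set is full; if it is, count an eviction and pick the Cache_line with the largest counter -> update the counter of every Cache_line

`Creat_Cache`

Initializes the cache. malloc allocates sizeof(Cache) bytes on the heap.

cache->line is allocated with size sizeof(Cache_line *) * S (S sets in total).

cache->line[i] is allocated with size sizeof(Cache_line) * E (E lines per set).

Each Cache_line is initialized with counter 0, valid 0, and tag -1.

`update`

Updates the data of one cache line and adjusts the counters of all valid lines.

The valid bit is set to 1 and the tag to the given argument. A loop then adds 1 to the counter of every valid line in the set, meaning the line went one more access without being used.

Finally, the counter of the line just accessed is reset to 0.

`get_earest_index`

On a miss in a full set, finds the line with the largest counter in that set to replace.

Putting the functions together, following the approach above

Look for a line with the given tag in the target set (done inline in update_info).
If no line in the set has the tag, miss+1; then check is_full, and if the set is full, eviction+1 and get_earest_index picks the victim.
If a line with the given tag is found, hit+1.
update writes the line and adjusts all the counters.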

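Parsing the trace

`Simulate`

Reads the trace file and simulates the behavior of the cache according to its contents.

First, fopen opens the trace file, and the result is checked to make sure the file opened.
fscanf then reads the trace file line by line. Each line is one memory operation in the format " [M|L|S] [address],[size]", for example "L 10,1" or "M 20,1".
M is a load followed by a store, so it calls update_info twice; L and S each call update_info once.
The original textbook version extracts the tag and set index with shifts. When the sizes are not powers of two, division and modulo give the same mapping.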

part B

I started with a quick first attempt:

for (i = 0; i < N; i++) {
    for (j = 0; j < M; j++) {
        tmp = A[i][j];
        B[j][i] = tmp;
    }
}

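Then I looked at the output of ./csim-ref -v -s 48 -E 1 -b 48 -t trace.f0 > result.txt

48*48

Keeping the work within one set

The one S and three L at the very start seem to be a store of N followed by loads of i and j. They are probably related to the loop setup, but I am not entirely sure.

After that the pattern is steady: each iteration performs one L (a load from matrix A into the temporary tmp) and one S (a store from tmp into matrix B).

The first two, L 30a0b0,4 miss and S 34a0d0,4 miss, read and write element [0][0].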

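The difference between the two addresses, 0x040020, is the distance between the two matrices in memory (the addresses are 24 bits wide).

That offset is 5462 blocks of 48 bytes, and 5462 mod 48 = 38, so &A[i][j] and &B[i][j] land 38 sets apart rather than in the same set. Each matrix, however, occupies 192 blocks, four times the 48 blocks of the cache, so lines of A and lines of B still compete for the same sets under different tags.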

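We should therefore avoid interleaving accesses to A and B in a way that evicts data that was just loaded because of a conflict. Each cache line holds 12 elements, so the best case is to consume all 12 consecutive numbers of a line while it is still cached. This suggests traversing the arrays block by block.

The cache has 48 sets with 1 line each, and each block is 48 bytes, enough for 12 ints.

So we handle 12 values at a time, splitting the 48*48 matrix as [12+12+12+12]*[12+12+12+12] and reading and writing it block by block.

In A, the first access to a row segment misses and brings 12 elements into the cache; the other 11 accesses hit.
The first time a row of A is copied into B, the 12 elements land in 12 different rows of B, so every access to B misses.

On each later pass, the only miss is the one caused by a line that was evicted because of the A array.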

96*96

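Estimating with the previous approach gives 96 * (96/12) * 2 = 1536, but the measured count was about 9000: the rows now map to the sets differently, so reading and writing B causes additional misses.

I then thought of reading 6*2 at a time, but a block only holds consecutive addresses, so moving to a different line also means moving to a different set.

Switching to 6*6 reads gives 96 * (96/6) * 2 - 128 * 6 (6 fewer misses per group in A) = 2304 > 1900, so I needed a different idea.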

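Each row of the 96*96 matrix spans 96/12 = 8 cache lines, so $48 / (96/12) = 6$ rows fill all 48 sets. Within a 12*12 block, the first six rows and the last six rows therefore map to the same sets, and the block ends up being loaded into the cache twice. To avoid the conflicts this causes, treat the 12*12 block as four sub-blocks and repeat the 48*48 transpose on each of them. B itself serves as temporary storage along the way, so even when lines of A are evicted by B or by local variables, no extra misses occur.

In practice the array cannot hold everything at once, so steps three and four are actually interleaved.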

93*99

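At first I tried the approach from the 96*96 case and split the matrix into blocks. But a set no longer lines up with the rows, so small blocks still have their first six and last six rows loaded into the cache twice. I therefore resorted to brute force over different block sizes (after a glance at the original lab), transposing within each block to limit this effect as far as possible. After a few attempts, 32*24 blocks gave 2933 misses, which passes.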

honor part

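The part B portion is already done.

mul

At first I changed the way B is accessed so that it is traversed in row-major order. Each iteration then touches adjacent memory locations, and the spatial locality raises the cache hit rate.

Even so, the miss count was still far above 4000.
Following the document the TA handed out, I switched to a blocked matrix approach.

Below is the idea behind my incorrect code.

At first I blocked by rows and columns: load the elements of each block, store them in an array, and then do the multiplications and additions on the array. The loop order was row -> column -> row within the block -> column within the block.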

    for (bi = 0; bi < M; bi += bsize) { //行索引
        for (bj = 0; bj < M; bj += bsize) { //列索引
            for (i = 0; i < N; i++) { //块内的行索引
                for (j = bj; j < bj + bsize && j < N; c++) 
                    for (k = bi; k < bi + bsize && k < M;

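That gave more than ten thousand misses. Looking at the addresses that missed, I found that this blocking and access order made poor use of spatial locality: jumping between rows touches different sets, which adds misses. So I changed the order to block -> row -> column. On each visit, the data of block A goes into an array, the data of block B goes into an array, the computation runs on the arrays, and the result is added into C. This was close, but still not right.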

    for (int bi = 0; bi < M; bi += bsize) {
        for (int bk = 0; bk < M; bk += bsize) {
            for (int bj = 0; bj < M; bj += bsize) {
                for (int k = 0; k < bsize; k++) {
                    for (int i = 0; i < bsize;i++) {
                    	for (int i = 0; i < bsize; i++){

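I then asked the TA, who told me to read the possibly-helpful document carefully from start to finish (I had only copied the code in it, and it turned out the last two pages were the important part). Following the article, the column accesses to B can be turned into row accesses: the result of each row pass is stored at its final position in C, and the next pass adds to it with +=.

Fix one row of block A.
Loop over each row of block B (in groups of 8 columns):
- Multiply a row of B by the matching elements in the tmp array, tmp[j + 8] += tmp[k] * B[bk + k][bj + j];, accumulating the results in the last 8 elements of tmp. This removes the miss on every column load.
- Add the intermediate results tmp[j + 8] to the matching positions C[i][bj + j] in C.

In the end A drives the computation: each element of A is multiplied by a row of B to give partial sums for a row of $C$, and $C$ += $C_i$ yields the desired $C$.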

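I had assumed that initializing the array in the outer loop was enough, but it has to be reset again inside the inner loop. Forgetting to reset the array that holds $C_i$ on every pass cost me a very long time; in the end I ran the code in vscode and set breakpoints to find where it went wrong.

Thoughts on the lab

~~Going crazy during exam week~~ Finishing coroutine_lab had already toughened me up, so I stayed calm throughout cache_lab. My understanding of sets was off at the start, and the first steps were unusually hard; after reading material online, everything suddenly clicked. From there I pushed straight through to the honor part, which took four versions of the code, but it passed in the end. The TAs provided part of the code up front, which was really, really great. After the ordeal of the previous lab, I felt the warmth of the TAs. Love you.

References

《深入理解计算机系统》实验五Cache Lab_cachelab-CSDN博客

CacheLab（附Excellent优化思路）_cache矩阵转置-CSDN博客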

csim.h

#pragma once

#define MININT -2147483648

// cache parameters
int numSet;
int associativity;
int blockSize;
char filePath[100];
int verbose = 0;

typedef struct cache_line
{
    int valid;     // valid bit
    int tag;       // tag
    int counter;   // accesses since this line was last used (LRU age)
} Cache_line;

typedef struct cache
{
    int S;
    int E;
    int B;
    Cache_line **line;
} Cache;



// final results
int hit = 0, miss = 0, eviction = 0;
Cache *cache = NULL;

// will be set in getopt() function
extern char *optarg;

// you can define functions here
void usage();
void parseline(int argc, char **argv);
void Creat_Cache();
void free_Cache();
int get_earest_index(int op_s);
int is_full(int op_s);
void update(int i, int op_s, int op_tag);
void update_info(unsigned address);

csim.c

#include "cachelab.h"
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "csim.h"
#include <getopt.h>
#include <limits.h>
#include <math.h>

// print usage info

void usage()
{
    printf("Usage: ./csim [-hv] -s <num> -E <num> -b <num> -t <file>\n");
    printf("Options:\n");
    printf("  -h         Print this help message.\n");
    printf("  -v         Optional verbose flag.\n");
    printf("  -s <num>   Number of set index bits.\n");
    printf("  -E <num>   Number of lines per set.\n");
    printf("  -b <num>   Number of block offset bits.\n");
    printf("  -t <file>  Trace file.\n");
    printf("\n");
    printf("Examples:\n");
    printf("  linux>  ./csim -s 4 -E 1 -b 4 -t traces/yi.trace\n");
    printf("  linux>  ./csim -v -s 8 -E 2 -b 4 -t traces/yi.trace\n");
    exit(1);
}

// parse command line and get the parameters
void parseline(int argc, char **argv)
{
    int opt;
    int num = 0;
    while ((opt = getopt(argc, argv, "hvs:E:b:t:")) != -1)
    {
        switch (opt)
        {
        case 'h':
            usage();
            break;
        case 'v':
            verbose = 1;
            break;
        case 's':
            num = atoi(optarg);
            if (num == 0 && optarg[0] != '0')
            {
                printf("./csim: Missing required command line argument\n");
                usage();
            }
            numSet = num;
            break;
        case 'E':
            num = atoi(optarg);
            if (num == 0 && optarg[0] != '0')
            {
                printf("./csim: Missing required command line argument\n");
                usage();
            }
            associativity = num;
            break;
        case 'b':
            num = atoi(optarg);
            if (num == 0 && optarg[0] != '0')
            {
                printf("./csim: Missing required command line argument\n");
                usage();
            }
            blockSize = num;
            break;
        case 't':
            strcpy(filePath, optarg);
            break;
        case ':':
            printf("./csim: Missing required command line argument\n");
            usage();
            break;
        case '?':
            usage();
            break;
        default:
            printf("getopt error");
            exit(1);
            break;
        }
    }
}


void Creat_Cache()
{
    if (cache)
        free(cache);
    cache = (Cache *)malloc(sizeof(Cache));
    cache->S = numSet;
    cache->E = associativity;
    cache->B = blockSize;
    cache->line = (Cache_line **)malloc(sizeof(Cache_line *) * numSet);
    for (int i = 0; i < numSet; i++)
    {
        cache->line[i] = (Cache_line *)malloc(sizeof(Cache_line) * associativity);
        for (int j = 0; j < associativity; j++)
        {
            cache->line[i][j].valid = 0; 
            cache->line[i][j].tag = -1;
            cache->line[i][j].counter = 0;
        }
    }
}

void free_Cache()
{
    int S = cache->S;
    for (int i = 0; i < S; i++)
    {
        free(cache->line[i]);
    }
    free(cache->line);
    free(cache);
}



int is_full(int op_s)
{
    for (int i = 0; i < cache->E; i++)
    {
        if (cache->line[op_s][i].valid == 0)
            return i;
    }
    return -1;
}

void update(int i, int op_s, int op_tag){
    
    cache->line[op_s][i].valid=1;
    cache->line[op_s][i].tag = op_tag;
    for(int k = 0; k < cache->E; k++)
         if(cache->line[op_s][k].valid==1)
             cache->line[op_s][k].counter++;
     cache->line[op_s][i].counter = 0;
}


int get_earest_index(int op_s)
{
    int index = 0;
    int max_counter = 0;
    for(int i = 0; i < cache->E; i++){
        if(cache->line[op_s][i].counter > max_counter){
            max_counter = cache->line[op_s][i].counter;
            index = i;
        }
    }
    return index;
}



void update_info(unsigned address)
{
    int op_tag = address/(numSet*blockSize);
    int op_s = (address/blockSize )%numSet;
    int index = -1;
    for (int i = 0; i < cache->E; i++)
    {
        if (cache->line[op_s][i].valid && cache->line[op_s][i].tag == op_tag)
            index = i;
    }

    if (index == -1)
    {
        miss++;
        if (verbose)
            printf("miss ");
        int i = is_full(op_s); // index of a free line, or -1 if the set is full
        if(i==-1){
            eviction++;
            if(verbose) printf("eviction ");
            i = get_earest_index(op_s); // line with the largest counter (LRU)
        }
        update(i,op_s,op_tag);
    }
    else{
        hit++;
        if(verbose)
            printf("hit ");
        update(index,op_s,op_tag);    
    }
}

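void Simulate() {
    Creat_Cache();
    FILE *File;
    char identifier;
    unsigned address;
    int size;    
    File = fopen(filePath, "r");
    if (File == NULL) // stop early if the trace file cannot be opened
    {
        fprintf(stderr, "%s: cannot open trace file\n", filePath);
        exit(1);
    }

    // each record looks like " L 10,1"; the leading space in the format skips whitespace
    while (fscanf(File, " %c %x,%d", &identifier, &address, &size) == 3) 
    {
        if (identifier == 'I') // instruction loads are not simulated
            continue;
        if (verbose)
            printf("%c %x,%d ", identifier, address, size);
        switch (identifier)
        {
        case 'M': // modify = one load followed by one store
            update_info(address);
            update_info(address);
            break;
        case 'L':
        case 'S':
            update_info(address);
            break;
        }
        if (verbose)
            printf("\n"); // one output line per trace record, as in csim-ref
    }
    fclose(File);
}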



int main(int argc, char *argv[])
{
    parseline(argc, argv);
    Simulate();
    free_Cache();
    printSummary(hit, miss, eviction);
    return 0;
}

/*
 * mul.c - Matrix multiply C = A * B
 *
 * Each multiply function must have a prototype of the form:
 * void mul(int M, int N, int A[N][M], int B[M][N], int C[N][N]);
 *
 * A multiply function is evaluated by counting the number of misses
 * on a 1KB direct mapped cache with a block size of 32 bytes.
 */
#include <stdio.h>
#include "cachelab.h"

int is_mul(int M, int N, int A[N][M], int B[M][N], int C[N][N]);

/*
 * multiply_submit - This is the solution multiply function that you
 *     will be graded on for Part B of the assignment. Do not change
 *     the description string "multiply submission", as the driver
 *     searches for that string to identify the multiply function to
 *     be graded.
 */

char mul_submit_desc[] = "multiply submission";

#define block_size 8
void mul_submit(int M, int N, int A[N][M], int B[M][N], int C[N][N]) {
    int tmp[16]; // tmp[0..7]: 8 values from a row of A; tmp[8..15]: partial sums for C
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j)
            C[i][j] = 0;
    }

    // Outer loop: advance 8 columns of A (8 rows of B) at a time; bk is the offset
    for (int bk = 0; bk < M; bk += 8) {
        // Main loop: walk the rows of A
        for (int i = 0; i < N; i++) {
            for (int t = 0; t < 16; t++){
                tmp[t] = 0;
            }
            // Copy 8 values from row i of A into tmp[0..7]
            for (int j = 0; j < 8; j++){
                tmp[j] = A[i][bk + j];
            }

            // Inner loop: advance 8 columns of B and C at a time; bj is the offset
            for (int bj = 0; bj < N; bj += 8) {
                // Core of the multiplication
                for (int j = 0; j < 8; j++) {
                    // Accumulate the partial sum for C[i][bj + j]
                    for (int k = 0; k < 8; k++){
                        tmp[j + 8] += tmp[k] * B[bk + k][bj + j];
                    }
                }
                // Add the partial sums into row i of C
                for (int k = 0; k < 8; k++){
                    C[i][bj + k] += tmp[k + 8];
                }
                // Reset the last 8 elements of tmp to 0 before the next block!
                for (int t = 0; t < 8; t++){
                    tmp[t + 8] = 0;
                }
            }
        }
    }
}



/*
 * mul - A simple multiply function, not optimized for the cache.
 */
char mul_desc[] = "Simple triple-loop multiply"; // must differ from the submission description
void mul(int M, int N, int A[N][M], int B[M][N], int C[N][N])
{
    int i, j, k, tmp;
    for (i = 0; i < N; i++)
    {
        for (j = 0; j < N; j++)
        {
            tmp = 0;
            for (k = 0; k < M; k++)
            {
                tmp += A[i][k] * B[k][j];
            }
            C[i][j] = tmp;
        }
    }
}

/*
 * registerFunctions - This function registers your multiply
 *     functions with the driver.  At runtime, the driver will
 *     evaluate each of the registered functions and summarize their
 *     performance. This is a handy way to experiment with different
 *     multiply strategies.
 */
void registerFunctions()
{
    /* Register your solution function */
    registerMulFunction(mul_submit, mul_submit_desc);

    /* Register any additional multiply functions */
    // registerMulFunction(mul, mul_desc);
}

/*
 * is_multiply - This helper function checks if C is the multiply of
 *     A and B. You can check the correctness of your multiply by calling
 *     it before returning from the multiply function.
 */
int is_mul(int M, int N, int A[N][M], int B[M][N], int C[N][N])
{
    int i, j, k;
    int num = 0;
    for (i = 0; i < N; i++)
    {
        for (j = 0; j < N; j++)
        {
            num = 0;
            for (k = 0; k < M; k++)
            {
                num += A[i][k] * B[k][j];
            }
            if (num != C[i][j])
            {
                return 0;
            }
        }
    }
    return 1;
}

/*
 * trans.c - Matrix transpose B = A^T
 *
 * Each transpose function must have a prototype of the form:
 * void trans(int M, int N, int A[N][M], int B[M][N]);
 *
 * A transpose function is evaluated by counting the number of misses
 * on a direct mapped cache with 48 sets and a block size of 48 bytes.
 */
#include <stdio.h>
#include "cachelab.h"

int is_transpose(int M, int N, int A[N][M], int B[M][N]);

/*
 * transpose_submit - This is the solution transpose function that you
 *     will be graded on for Part B of the assignment. Do not change
 *     the description string "Transpose submission", as the driver
 *     searches for that string to identify the transpose function to
 *     be graded.
 */

char transpose_submit_desc[] = "Transpose submission";
// void transpose_submit(int M, int N, int A[N][M], int B[M][N])
// {
// }
void transpose_submit(int M, int N, int A[N][M], int B[M][N]) {
    if (M == 48) {
        int temp[12];
        for (int bi = 0; bi < M; bi+=12) {
            for (int bj = 0; bj < N; bj+=12) {
                for (int k = 0; k < 12; k++) {
                    for(int i = 0;i <12;i++){
                        temp[i] = A[bi + k][bj +i];
                    }
                    for(int i = 0;i <12;i++){
                        B[bj+i][bi + k] = temp[i];
                    }                    
                }
            }
        }
    }
    else if(M==96){
        int a[12] = {0}; // 12-int buffer, one cache line worth of data
        for(int i=0;i<M;i+=12){ // step through the rows of A, 12 at a time
            for(int j=0;j<N;j+=12){ // step through the columns of A, 12 at a time
                for(int k = i;k < i+6;k++){ // top six rows of the 12*12 block of A
                    for(int m = 0;m < 12;m++){ // load one full row of the block into a
                        a[m] = A[k][j+m];
                    }
                    for(int m = 0;m < 6;m++){ // first 6 values go to their final place in the top-left of B
                        B[j+m][k] = a[m];
                    }
                    for(int m = 0;m < 6;m++){ // last 6 values are parked in the top-right of B
                        B[j+m][k+6] = a[m+6];
                    }                                        
                }

                for(int k =j;k < j +6;k++){ // top six rows of the block of B
                    for(int m = 0;m < 6;m++){ // read back the values parked in the top-right of B
                        a[m] = B[k][i+6+m];
                    } 
                    for(int m = 0;m < 6;m++){ // load one column of the bottom-left of A into a[6..11]
                        a[m+6] = A[i+6+m][k];
                    }

                    for(int m = 0;m < 6;m++){ // a[6..11] goes to its final place in the top-right of B
                        B[k][i+6+m] = a[m+6];
                    } 
                    for(int m = 0;m < 6;m++){ // the parked values move to the bottom-left of B
                        B[k+6][i+m] = a[m];
                    }                                                      
                }

                for(int k = i+6; k < i+12 ; k++){ // bottom six rows of the block of A
                    for(int m = 0;m < 6;m++){ // load the bottom-right of A into a[6..11]
                        a[m+6] = A[k][j+m+6];
                    }                      
                    for(int m = 0;m < 6;m++){ // store it transposed in the bottom-right of B
                        B[j+m+6][k] = a[m+6];
                    }   
                }
            }
        }
    }
    else if(M == 93){
        for (int i = 0; i < N; i += 32)
            for (int j = 0; j < M; j += 24)
                for (int k = i; k < i + 32 && k < N; k++)
                    for (int l = j; l < j + 24 && l < M; l++)
                        B[l][k] = A[k][l];
    }
    else{
        int i, j, tmp;

        for (i = 0; i < N; i++)
        {
            for (j = 0; j < M; j++)
            {
                tmp = A[i][j];
                B[j][i] = tmp;
            }

        }
    }
}


/*
 * You can define additional transpose functions below. We've defined
 * a simple one below to help you get started.
 */

/*
 * trans - A simple baseline transpose function, not optimized for the cache.
 */
char trans_desc[] = "Simple row-wise scan transpose";
void trans(int M, int N, int A[N][M], int B[M][N])
{
    int i, j, tmp;

    for (i = 0; i < N; i++)
    {
        for (j = 0; j < M; j++)
        {
            tmp = A[i][j];
            B[j][i] = tmp;
        }
    }
}

/*
 * registerFunctions - This function registers your transpose
 *     functions with the driver.  At runtime, the driver will
 *     evaluate each of the registered functions and summarize their
 *     performance. This is a handy way to experiment with different
 *     transpose strategies.
 */
void registerFunctions()
{
    /* Register your solution function */
    registerTransFunction(transpose_submit, transpose_submit_desc);

    /* Register any additional transpose functions */
    // registerTransFunction(trans, trans_desc);
}

/*
 * is_transpose - This helper function checks if B is the transpose of
 *     A. You can check the correctness of your transpose by calling
 *     it before returning from the transpose function.
 */
int is_transpose(int M, int N, int A[N][M], int B[M][N])
{
    int i, j;

    for (i = 0; i < N; i++)
    {
        for (j = 0; j < M; ++j)
        {
            if (A[i][j] != B[j][i])
            {
                return 0;
            }
        }
    }
    return 1;
}