"Hand-Writing a Compiler in 700 Lines" Part 4: Notes on Revisiting Code Generation

爱吃小酥肉的小波

Last modified 2023-12-21 16:44:53


Column: 700行实现C minus编译器 (Implementing a C minus compiler in 700 lines); tags: notes

First published 2023-12-20 15:33:00

Article link: https://blog.csdn.net/qq_55051321/article/details/135053603

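[Figure: VM state after code generation, with the code area on the left, the data area in the middle, and the symbol table on the right]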

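Code generation: this is the state after the source code has been parsed and the corresponding VM instructions have been emitted, but before the VM starts executing them.
The code area on the left of the figure above holds the generated instructions and the data area is in the middle. Everything else, such as the registers and the stack, should still be in its initial state.
The symbol table is on the right of the figure. Once parsing is finished it no longer serves any purpose. Its only job is to connect the definition of each lexeme with its uses, so that while a function is being compiled every occurrence of a lexeme resolves to the same entry and the same attributes.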

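A compiler that targets x86 writes its output (the text and code sections) to a file, and that file is run afterwards.
C4 and C minus, however, do not write out a separate text and code file, so there are fewer files to deal with.
If they did emit such a file, the program would have to be split into two parts: the compiler and the virtual machine.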

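[Figure: listing of the VM instructions generated for main and add]
Even if new global variables were added to the program in the figure above, the code generated for main and add would not change.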

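Two kinds of storage need to be distinguished:
1. String literals and global variables are placed in the data area. A global variable is accessed with IMM followed by its data-area address. (An integer literal is emitted directly as the operand of IMM.)
2. Local variables of a function are stored on the stack at an offset relative to bp, not in the data area. A local variable is accessed with LEA followed by its offset from bp, for example LEA -1.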

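On lines 4 and 7 of the figure above, LEA 2 and LEA 3 are the locations of the two arguments passed to add. On line 11, LEA -1 is the location of a local variable, at a negative offset from bp.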

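On line 15 of the figure above, the operand at code address 4300210344 is of type char* and records where the output string is stored. In other words, line 15 supplies the address of the string, lines 16-27 fill in the values for it, and line 28 prints it.

On line 25 of the figure above, at code address 4300210464, the operand of call is the Value field of the symbol with token = Id and name = "add", that is, the location where the add function is stored.

LEA produces an address relative to bp. For a global variable the address can be loaded directly with IMM (load immediate) and then pushed onto the stack.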

Walkthrough of the code that handles variables and function calls:

// handle identifier: variable or function call
else if (token == Id) {
    tokenize();
    tmp_ptr = symbol_ptr; // for recursive parse
    // function call: if this is a call, the token right after the identifier is '('
    if (token == '(') {
        assert('(');
        i = 0; // number of args
        while (token != ')') {
            // an argument may be a literal or a whole expression, so recurse into parse_expr();
            // the result ends up in ax, the only general-purpose register
            parse_expr(Assign);
            *++code = PUSH; i++;  // push ax onto the stack; i counts the arguments
            if (token == ',') assert(',');
        } assert(')');
        // native call: emit the value stored in tmp_ptr[Value] directly (the built-in's own instruction, e.g. PRTF)
        if (tmp_ptr[Class] == Sys) *++code = tmp_ptr[Value];
        // fun call: for a user-defined function, emit CALL followed by the function's code address
        else if (tmp_ptr[Class] == Fun) {*++code = CALL; *++code = tmp_ptr[Value];}
        else {printf("line %lld: invalid function call\n", line); exit(-1);}
        // delete stack frame for args: release one stack slot per argument counted in i
        if (i > 0) {*++code = DARG; *++code = i;}
        type = tmp_ptr[Type];
    }
    // handle enum value: C minus treats an enum constant as a literal and loads it into ax with IMM
    else if (tmp_ptr[Class] == Num) {
        *++code = IMM; *++code = tmp_ptr[Value]; type = INT;
    }
    // handle variables
    else {
        // local var, calculate addr base ibp (offset relative to bp)
        if (tmp_ptr[Class] == Loc) {*++code = LEA; *++code = ibp - tmp_ptr[Value];}
        // global var: IMM followed by the variable's address
        else if (tmp_ptr[Class] == Glo) {*++code = IMM; *++code = tmp_ptr[Value];}
        else {printf("line %lld: invalid variable\n", line); exit(-1);}
        type = tmp_ptr[Type];
        *++code = (type == CHAR) ? LC : LI;  // pick the load by type: LC for char, LI for everything else
        // a pointer variable is also loaded with LI, which leaves the address it holds in ax;
        // dereferencing that address is not done here but in the '*' branch of parse_expr()
    }
}

Walkthrough of parse_stmt():

void parse_stmt() {
    int* a;
    int* b;
    // if...else
    if (token == If) {
        assert(If); assert('('); parse_expr(Assign); assert(')');
        // parse_expr(Assign) leaves the value of the condition in ax
        // JZ to false: jump when ax == 0; the target is not known yet, so keep the empty operand slot in b
        *++code = JZ; b = ++code;
        parse_stmt(); // parse true stmt
        if (token == Else) {
            assert(Else);
            *b = (int)(code + 3); // write back false point: just past the JMP and its operand
            *++code = JMP; b = ++code; // JMP to endif: target unknown, so remember this slot in b
            parse_stmt(); // parse false stmt
        }
        *b = (int)(code + 1); // write back endif point
    }
    // while
    else if (token == While) {
        assert(While);
        a = code + 1; // write loop point: a is the code position where the loop starts
        assert('('); parse_expr(Assign); assert(')'); // evaluate the condition into ax
        *++code = JZ; b = ++code; // JZ to endloop when ax == 0; keep the slot in b and fill it in later
        parse_stmt();  // parse the loop body
        *++code = JMP; *++code = (int)a; // JMP to loop point: go back to a and test the condition again
        *b = (int)(code + 1); // write back endloop point
    }
    else if (token == Return) {
        assert(Return);
        if (token != ';') parse_expr(Assign);  // return may be followed by a value expression
        assert(';');
        *++code = RET; // leave the function
    }
    else if (token == '{') {
        assert('{');
        while (token != '}') parse_stmt(); // parse each statement inside {}
        assert('}');
    }
    else if (token == ';') assert(';');
    else {parse_expr(Assign); assert(';');} // a plain expression statement
}

Walkthrough of parse_expr():

int type; // pass type in recursive parse expr
void parse_expr(int precd) {
    int tmp_type, i;
    int* tmp_ptr;
    // const number
    if (token == Num) {
        tokenize();
        *++code = IMM;
        *++code = token_val;
        type = INT;
    } 
    // const string
    else if (token == '"') {
        *++code = IMM;
        *++code = token_val; // string addr
        assert('"'); while (token == '"') assert('"'); // handle multi-row
        data = (char*)((int)data + 8 & -8); // add \0 for string & align 8
        type = PTR;
    }
    else if (token == Sizeof) {
        tokenize(); assert('(');
        type = parse_base_type();
        while (token == Mul) {assert(Mul); type = type + PTR;}
        assert(')');
        *++code = IMM;
        *++code = (type == CHAR) ? 1 : 8; 
        type = INT;
    }
    // handle identifier: variable or function call
    else if (token == Id) {
        tokenize();
        tmp_ptr = symbol_ptr; // for recursive parse
        // function call
        if (token == '(') {
            assert('(');
            i = 0; // number of args
            while (token != ')') {
                parse_expr(Assign);
                *++code = PUSH; i++;
                if (token == ',') assert(',');
            } assert(')');
            // native call
            if (tmp_ptr[Class] == Sys) *++code = tmp_ptr[Value];
            // fun call
            else if (tmp_ptr[Class] == Fun) {*++code = CALL; *++code = tmp_ptr[Value];}
            else {printf("line %lld: invalid function call\n", line); exit(-1);}
            // delete stack frame for args
            if (i > 0) {*++code = DARG; *++code = i;}
            type = tmp_ptr[Type];
        }
        // handle enum value
        else if (tmp_ptr[Class] == Num) {
            *++code = IMM; *++code = tmp_ptr[Value]; type = INT;
        }
        // handle variables
        else {
            // local var, calculate addr base ibp
            if (tmp_ptr[Class] == Loc) {*++code = LEA; *++code = ibp - tmp_ptr[Value];}
            // global var
            else if (tmp_ptr[Class] == Glo) {*++code = IMM; *++code = tmp_ptr[Value];}
            else {printf("line %lld: invalid variable\n", line); exit(-1);}
            type = tmp_ptr[Type];
            *++code = (type == CHAR) ? LC : LI;
        }
    }
    // cast or parenthesis
    else if (token == '(') {
        assert('(');
        if (token == Char || token == Int) {
            tokenize();
            tmp_type = token - Char + CHAR;
            while (token == Mul) {assert(Mul); tmp_type = tmp_type + PTR;}
            // use precedence Inc represent all unary operators
            assert(')'); parse_expr(Inc); type = tmp_type;
        } else {
            parse_expr(Assign); assert(')');
        }
    }
    // dereference
    else if (token == Mul) {
        tokenize(); parse_expr(Inc);
        if (type >= PTR) type = type - PTR;
        else {printf("line %lld: invalid dereference\n", line); exit(-1);}
        *++code = (type == CHAR) ? LC : LI;
    }
    // reference
    else if (token == And) {
        tokenize(); parse_expr(Inc);
        if (*code == LC || *code == LI) code--; // rollback load by addr
        else {printf("line %lld: invalid reference\n", line); exit(-1);}
        type = type + PTR;
    }
    // Not
    else if (token == '!') {
        tokenize(); parse_expr(Inc);
        *++code = PUSH; *++code = IMM; *++code = 0; *++code = EQ;
        type = INT;
    }
    // bitwise not
    else if (token == '~') {
        tokenize(); parse_expr(Inc);
        *++code = PUSH; *++code = IMM; *++code = -1; *++code = XOR;
        type = INT;
    }
    // positive (unary plus uses the Add token; And is already taken by the reference branch above)
    else if (token == Add) {tokenize(); parse_expr(Inc); type = INT;}
    // negative 
    else if (token == Sub) {
        tokenize(); parse_expr(Inc);
        *++code = PUSH; *++code = IMM; *++code = -1; *++code = MUL;
        type = INT;
    }
    // ++var --var
    else if (token == Inc || token == Dec) {
        i = token; tokenize(); parse_expr(Inc);
        // save var addr, then load var val
        if (*code == LC) {*code = PUSH; *++code = LC;}
        else if (*code == LI) {*code = PUSH; *++code = LI;}
        else {printf("line %lld: invalid Inc or Dec\n", line); exit(-1);}
        *++code = PUSH; // save var val
        *++code = IMM; *++code = (type > PTR) ? 8 : 1;
        *++code = (i == Inc) ? ADD : SUB; // calculate
        *++code = (type == CHAR) ? SC : SI; // write back to var addr
    }
    else {printf("line %lld: invalid expression\n", line); exit(-1);}
    // use [precedence climbing] method to handle binary(or postfix) operators
    while (token >= precd) {
        tmp_type = type;    
        // assignment
        if (token == Assign) {
            tokenize();
            if (*code == LC || *code == LI) *code = PUSH;
            else {printf("line %lld: invalid assignment\n", line); exit(-1);}
            parse_expr(Assign); type = tmp_type; // type can be cast
            *++code = (type == CHAR) ? SC : SI;
        }
        // ? :, same as if stmt
        else if (token == Cond) {
            tokenize(); *++code = JZ; tmp_ptr = ++code;
            parse_expr(Assign); assert(':');
            *tmp_ptr = (int)(code + 3);
            *++code = JMP; tmp_ptr = ++code; // save endif addr
            parse_expr(Cond);
            *tmp_ptr = (int)(code + 1); // write back endif point
        }
        // logic operators, simple and boring, copy from c4
        else if (token == Lor) {
            tokenize(); *++code = JNZ; tmp_ptr = ++code;
            parse_expr(Land); *tmp_ptr = (int)(code + 1); type = INT;}
        else if (token == Land) {
            tokenize(); *++code = JZ; tmp_ptr = ++code;
            parse_expr(Or); *tmp_ptr = (int)(code + 1); type = INT;}
        else if (token == Or)  {tokenize(); *++code = PUSH; parse_expr(Xor); *++code = OR;  type = INT;}
        else if (token == Xor) {tokenize(); *++code = PUSH; parse_expr(And); *++code = XOR; type = INT;}
        else if (token == And) {tokenize(); *++code = PUSH; parse_expr(Eq);  *++code = AND; type = INT;}
        else if (token == Eq)  {tokenize(); *++code = PUSH; parse_expr(Lt);  *++code = EQ;  type = INT;}
        else if (token == Ne)  {tokenize(); *++code = PUSH; parse_expr(Lt);  *++code = NE;  type = INT;}
        else if (token == Lt)  {tokenize(); *++code = PUSH; parse_expr(Shl); *++code = LT;  type = INT;}
        else if (token == Gt)  {tokenize(); *++code = PUSH; parse_expr(Shl); *++code = GT;  type = INT;}
        else if (token == Le)  {tokenize(); *++code = PUSH; parse_expr(Shl); *++code = LE;  type = INT;}
        else if (token == Ge)  {tokenize(); *++code = PUSH; parse_expr(Shl); *++code = GE;  type = INT;}
        else if (token == Shl) {tokenize(); *++code = PUSH; parse_expr(Add); *++code = SHL; type = INT;}
        else if (token == Shr) {tokenize(); *++code = PUSH; parse_expr(Add); *++code = SHR; type = INT;}
        // arithmetic operators
        else if (token == Add) {
            tokenize(); *++code = PUSH; parse_expr(Mul);
            // int pointer * 8
            if (tmp_type > PTR) {*++code = PUSH; *++code = IMM; *++code = 8; *++code = MUL;}
            *++code = ADD; type = tmp_type;
        }
        // subtraction: when '-' is reached, the minuend (left operand) has already been parsed and its value is in ax
        else if (token == Sub) {
            // PUSH saves the minuend on the stack, then the right-hand expression is parsed into ax
            tokenize(); *++code = PUSH; parse_expr(Mul);
            if (tmp_type > PTR && tmp_type == type) {
                // pointer - pointer, ret / 8
                // the VM is designed as 64-bit, so the difference has to be divided by 8
                *++code = SUB; *++code = PUSH; // (stack top - ax) -> ax, then push that difference
                *++code = IMM; *++code = 8;    // ax = 8
                *++code = DIV; type = INT;}    // (stack top / ax) -> ax; pointer - pointer yields an int
            else if (tmp_type > PTR) {
                // pointer - integer: pointer value - integer value * 8
                *++code = PUSH;
                *++code = IMM; *++code = 8;
                *++code = MUL;
                *++code = SUB; type = tmp_type;}
            else *++code = SUB;  // plain integer subtraction: (stack top - ax) -> ax
        }
        else if (token == Mul) {tokenize(); *++code = PUSH; parse_expr(Inc); *++code = MUL; type = INT;}
        else if (token == Div) {tokenize(); *++code = PUSH; parse_expr(Inc); *++code = DIV; type = INT;}
        else if (token == Mod) {tokenize(); *++code = PUSH; parse_expr(Inc); *++code = MOD; type = INT;}
        // var++, var-- (postfix)
        else if (token == Inc || token == Dec) {
            if (*code == LC) {*code = PUSH; *++code = LC;} // save var addr on the stack, then load the char
            else if (*code == LI) {*code = PUSH; *++code = LI;} // save var addr on the stack, then load the int
            else {printf("%lld: invalid operator=%lld\n", line, token); exit(-1);}
            *++code = PUSH; *++code = IMM; *++code = (type > PTR) ? 8 : 1; // ax = 8 for pointer types, 1 for int or char
            *++code = (token == Inc) ? ADD : SUB;
            *++code = (type == CHAR) ? SC : SI; // save value ++ or -- to addr
            *++code = PUSH; *++code = IMM; *++code = (type > PTR) ? 8 : 1;
            *++code = (token == Inc) ? SUB : ADD; // restore value before ++ or --
            tokenize();
        }
        // a[x] = *(a + x): the value left of '[' must be an address; indexing moves the pointer by x elements
        else if (token == Brak) {
            assert(Brak); *++code = PUSH; parse_expr(Assign); assert(']');
            // for an element type wider than char: address + index * 8
            if (tmp_type > PTR) {*++code = PUSH; *++code = IMM; *++code = 8; *++code = MUL;}
            else if (tmp_type < PTR) {printf("line %lld: invalid index op\n", line); exit(-1);}
            *++code = ADD; type = tmp_type - PTR;  // address of a on the stack top + x in ax
            *++code = (type == CHAR) ? LC : LI;
        }
        else {printf("%lld: invalid token=%lld\n", line, token); exit(-1);}
    }
}

Notes on run_vm():

int run_vm(int argc, char** argv) {
    int op;
    int* tmp;
    // exit code for main; how run_vm gets main to run is illustrated in the figure below
    bp = sp = (int*)((int)stack + MAX_SIZE);
    *--sp = EXIT;
    *--sp = PUSH; tmp = sp;
    *--sp = argc; *--sp = (int)argv;
    *--sp = (int)tmp;
    // pc takes the address of main in the code area, so execution starts at main's entry point
    if (!(pc = (int*)main_ptr[Value])) {printf("main function is not defined\n"); exit(-1);}
    cycle = 0;
    while (1) {
        cycle++; op = *pc++; // read instruction
        // load & save
        if (op == IMM)          ax = *pc++;                     // load immediate(or global addr)
        else if (op == LEA)     ax = (int)(bp + *pc++);         // load local addr
        else if (op == LC)      ax = *(char*)ax;                // load char
        else if (op == LI)      ax = *(int*)ax;                 // load int
        else if (op == SC)      *(char*)*sp++ = ax;             // save char to the address popped from the stack
        else if (op == SI)      *(int*)*sp++ = ax;              // save int to the address popped from the stack
        else if (op == PUSH)    *--sp = ax;                     // push ax to stack
        // jump
        else if (op == JMP)     pc = (int*)*pc;                 // jump
        else if (op == JZ)      pc = ax ? pc + 1 : (int*)*pc;   // jump if ax == 0
        else if (op == JNZ)     pc = ax ? (int*)*pc : pc + 1;   // jump if ax != 0
        // arithmetic
        else if (op == OR)      ax = *sp++ |  ax;
        else if (op == XOR)     ax = *sp++ ^  ax;
        else if (op == AND)     ax = *sp++ &  ax;
        else if (op == EQ)      ax = *sp++ == ax;
        else if (op == NE)      ax = *sp++ != ax;
        else if (op == LT)      ax = *sp++ <  ax;
        else if (op == LE)      ax = *sp++ <= ax;
        else if (op == GT)      ax = *sp++ >  ax;
        else if (op == GE)      ax = *sp++ >= ax;
        else if (op == SHL)     ax = *sp++ << ax;
        else if (op == SHR)     ax = *sp++ >> ax;
        else if (op == ADD)     ax = *sp++ +  ax;
        else if (op == SUB)     ax = *sp++ -  ax;
        else if (op == MUL)     ax = *sp++ *  ax;
        else if (op == DIV)     ax = *sp++ /  ax;
        else if (op == MOD)     ax = *sp++ %  ax;
        // some complicated instructions for function call
        // call function: push pc + 1 to stack & pc jump to func addr(pc point to)
        else if (op == CALL)    {*--sp = (int)(pc+1); pc = (int*)*pc;}
        // new stack frame for vars: save bp, bp -> caller stack, stack add frame
        else if (op == NVAR)    {*--sp = (int)bp; bp = sp; sp = sp - *pc++;}
        // delete stack frame for args: same as x86 : add esp, <size>
        else if (op == DARG)    sp = sp + *pc++;
        // return caller: restore stack, restore old bp, pc point to caller code addr(store by CALL)
        else if (op == RET)     {sp = bp; bp = (int*)*sp++; pc = (int*)*sp++;}
        // end for call function.
        // native call
        else if (op == OPEN)    {ax = open((char*)sp[1], sp[0]);}
        else if (op == CLOS)    {ax = close(*sp);}
        else if (op == READ)    {ax = read(sp[2], (char*)sp[1], *sp);}
        else if (op == PRTF)    {tmp = sp + pc[1] - 1; ax = printf((char*)tmp[0], tmp[-1], tmp[-2], tmp[-3], tmp[-4], tmp[-5]);}
        else if (op == MALC)    {ax = (int)malloc(*sp);}
        else if (op == FREE)    {free((void*)*sp);}
        else if (op == MSET)    {ax = (int)memset((char*)sp[2], sp[1], *sp);}
        else if (op == MCMP)    {ax = memcmp((char*)sp[2], (char*)sp[1], *sp);}
        else if (op == EXIT)    {printf("exit(%lld)\n", *sp); return *sp;}
        else {printf("unknown instruction: %lld, cycle: %lld\n", op, cycle); return -1;}
    }
    return 0;
}

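[Figure: how run_vm() sets up the stack and executes main]
Executing a function is simply pc advancing from one instruction to the next. Jump instructions such as JMP, JZ and JNZ redirect pc instead.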

Finally, read the main() function of the C minus compiler to understand the overall flow:

// after bootstrap use [int] instead of [int32_t]
int32_t main(int32_t argc, char** argv) {
    MAX_SIZE = 128 * 1024 * 8; // 1MB = 128k * 64bit
    
    // load source code
    if (load_src(*(argv+1)) != 0) return -1;
    
    // init memory & register
    if (init_vm() != 0) return -1;
    
    // prepare keywords for symbol table (main is added as well)
    keyword();
    
    // parse and generate vm instructions, save to vm
    parse();
    // the whole parse & code gen flow:
    //   parse -> variables -> parameter list (the parse_param() function)
    //   parse -> functions -> function body -> parse_stmt() for statements -> parse_expr() for expressions
    
    // print assembles: vm instructions. for debug
    write_as();
    
    // run vm and execute instructions (run_vm() is the function discussed above)
    return run_vm(--argc, ++argv);
}


❀ The end ❀
