【acwj】05: Implementing Statements

Ported from https://github.com/DoctorWkt/acwj, a write-up on how to build a self-bootstrapping, C-like compiler in C; this is a rough translation.

It’s time to add some “proper” statements to the grammar of our language. I want to be able to write lines of code like this:

   print 2 + 3 * 5;
   print 18 - 6/3 + 4*2;

Of course, as we are ignoring whitespace, there’s no necessity that all the tokens for one statement are on the same line. Each statement starts with the keyword print and is terminated with a semicolon. So these are going to become new tokens in our language.
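
Concretely, that means two new token types, T_SEMI and T_PRINT, joining the ones from the earlier parts. A minimal sketch of the token list in defs.h, assuming the new tokens are simply appended at the end (the exact ordering is my assumption):

// Token types (sketch): T_SEMI and T_PRINT are the new arrivals
enum {
  T_EOF, T_PLUS, T_MINUS, T_STAR, T_SLASH, T_INTLIT,
  T_SEMI, T_PRINT
};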

BNF Description of the Grammar

We’ve already seen the BNF notation for expressions. Now let’s define the BNF syntax for the above types of statements:

statements: statement
     | statement statements
     ;

statement: 'print' expression ';'
     ;

An input file consists of several statements. They are either one statement, or a statement followed by more statements. Each statement starts with the keyword print, then one expression, then a semicolon.
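
As a quick worked example of how the grammar fits together, an input containing two statements can be derived like this:

statements -> statement statements
           -> statement statement
           -> 'print' expression ';' statement
           -> 'print' expression ';' 'print' expression ';'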

Changes to the Lexical Scanner

Before we can get to the code that parses the above syntax, we need to add a few more bits and pieces to the existing code. Let’s start with the lexical scanner.

Adding a token for semicolons will be easy. Now, the print keyword. Later on, we’ll have many keywords in the language, plus identifiers for our variables, so we’ll need to add some code which helps us to deal with them.

In scan.c, I’ve added this code which I’ve borrowed from the SubC compiler. It reads in alphanumeric characters into a buffer until it hits a non-alphanumeric character.

// Scan an identifier from the input file and
// store it in buf[]. Return the identifier's length
static int scanident(int c, char *buf, int lim) {
  int i = 0;

  // Allow digits, alpha and underscores
  while (isalpha(c) || isdigit(c) || '_' == c) {
    // Error if we hit the identifier length limit,
    // else append to buf[] and get next character
    if (lim - 1 == i) {
      printf("identifier too long on line %d\n", Line);
      exit(1);
    } else if (i < lim - 1) {
      buf[i++] = c;
    }
    c = next();
  }
  // We hit a non-valid character, put it back.
  // NUL-terminate the buf[] and return the length
  putback(c);
  buf[i] = '\0';
  return (i);
}

We also need a function to recognise keywords in the language. One way would be to have a list of keywords, and to walk the list and strcmp() each one against the buffer from scanident(). The code from SubC has an optimisation: match against the first letter before doing the strcmp(). This speeds up the comparison against dozens of keywords. Right now we don’t need this optimisation but I’ve put it in for later:

// Given a word from the input, return the matching
// keyword token number or 0 if it's not a keyword.
// Switch on the first letter so that we don't have
// to waste time strcmp()ing against all the keywords.
static int keyword(char *s) {
  switch (*s) {
    case 'p':
      if (!strcmp(s, "print"))
        return (T_PRINT);
      break;
  }
  return (0);
}
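
When more keywords arrive in later parts, each one becomes another case keyed on its first letter. Purely as a hypothetical illustration (T_INT is not a token in this part of the journey):

    case 'i':
      if (!strcmp(s, "int"))
        return (T_INT);    // hypothetical: a keyword from a later part
      break;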

Now, at the bottom of the switch statement in scan(), we add this code to recognise semicolons and keywords:

    case ';':
      t->token = T_SEMI;
      break;
    default:

      // If it's a digit, scan the
      // literal integer value in
      if (isdigit(c)) {
        t->intvalue = scanint(c);
        t->token = T_INTLIT;
        break;
      } else if (isalpha(c) || '_' == c) {
        // Read in a keyword or identifier
        scanident(c, Text, TEXTLEN);

        // If it's a recognised keyword, return that token
        if (tokentype = keyword(Text)) {
          t->token = tokentype;
          break;
        }
        // Not a recognised keyword, so an error for now
        printf("Unrecognised symbol %s on line %d\n", Text, Line);
        exit(1);
      }
      // The character isn't part of any recognised token, error
      printf("Unrecognised character %c on line %d\n", c, Line);
      exit(1);
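
To see what the scanner now produces end to end, here is the token stream for the first example statement (a sketch, with integer literal values shown in parentheses):

Input:  print 2 + 3 * 5;
Tokens: T_PRINT T_INTLIT(2) T_PLUS T_INTLIT(3) T_STAR T_INTLIT(5) T_SEMI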

I’ve also added a global Text buffer to store the keywords and identifiers:

#define TEXTLEN         512 // Length of symbols in input
extern_ char Text[TEXTLEN + 1];         // Last identifier scanned
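
The extern_ above is not a typo: it is the macro trick from data.h used since the earlier parts, where main.c defines extern_ as empty before including the header so that the globals get defined exactly once. Roughly, as a sketch of that existing mechanism:

// data.h (sketch): every file sees these as extern declarations,
// except main.c, which does "#define extern_" before the include
// so that the variables are actually defined there.
#ifndef extern_
 #define extern_ extern
#endif

extern_ int Line;                       // Current line number
extern_ char Text[TEXTLEN + 1];         // Last identifier scanned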

Changes to the Expression Parser

Up to now our input files have contained just a single expression; therefore, in our Pratt parser code in binexpr() (in expr.c), we had this code to exit the parser:

  // If no tokens left, return just the left node
  tokentype = Token.token;
  if (tokentype == T_EOF)
    return (left);

With our new grammar, each expression is terminated by a semicolon. Thus, we need to change the code in the expression parser to spot the T_SEMI tokens and exit the expression parsing:

// Return an AST tree whose root is a binary operator.
// Parameter ptp is the previous token's precedence.
struct ASTnode *binexpr(int ptp) {
  struct ASTnode *left, *right;
  int tokentype;

  // Get the integer literal on the left.
  // Fetch the next token at the same time.
  left = primary();

  // If we hit a semicolon, return just the left node
  tokentype = Token.token;
  if (tokentype == T_SEMI)
    return (left);

  while (op_precedence(tokentype) > ptp) {
    ...

    // Update the details of the current token.
    // If we hit a semicolon, return just the left node
    tokentype = Token.token;
    if (tokentype == T_SEMI)
      return (left);
  }
}

Changes to the Code Generator

I want to keep the generic code generator in gen.c separate from the CPU-specific code in cg.c. That also means that the rest of the compiler should only ever call the functions in gen.c, and only gen.c should call the code in cg.c.

To this end, I’ve defined some new “front-end” functions in gen.c:

void genpreamble()        { cgpreamble(); }
void genpostamble()       { cgpostamble(); }
void genfreeregs()        { freeall_registers(); }
void genprintint(int reg) { cgprintint(reg); }
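
For reference, the CPU-specific half of genprintint() sits in cg.c; in the x86-64 backend from the earlier parts it looks roughly like this (a sketch, register bookkeeping details may differ slightly):

// Load the register into %rdi, call the printint() routine that the
// preamble emits, then free the register (sketch of the cg.c side)
void cgprintint(int r) {
  fprintf(Outfile, "\tmovq\t%s, %%rdi\n", reglist[r]);
  fprintf(Outfile, "\tcall\tprintint\n");
  free_register(r);
}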

Adding the Parser for Statements

We have a new file stmt.c. This will hold the parsing code for all the main statements in our language. Right now, we need to parse the BNF grammar for statements which I gave above. This is done with this single function. I’ve converted the recursive definition into a loop:

// Parse one or more statements
void statements(void) {
  struct ASTnode *tree;
  int reg;

  while (1) {
    // Match a 'print' as the first token
    match(T_PRINT, "print");

    // Parse the following expression and
    // generate the assembly code
    tree = binexpr(0);
    reg = genAST(tree);
    genprintint(reg);
    genfreeregs();

    // Match the following semicolon
    // and stop if we are at EOF
    semi();
    if (Token.token == T_EOF)
      return;
  }
}

In each loop, the code finds a T_PRINT token. It then calls binexpr() to parse the expression. Finally, it finds the T_SEMI token. If a T_EOF token follows, we break out of the loop.

After each expression tree, the code in gen.c is called to convert the tree into assembly code and to call the assembly printint() function to print out the final value.
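
As a recap of how that conversion works, the tree walker in gen.c from the earlier parts looks roughly like this (a sketch; the exact operator names and error handling may differ):

// Given an AST, recursively generate assembly for it and
// return the register holding the tree's final value
int genAST(struct ASTnode *n) {
  int leftreg, rightreg;

  // Generate code for the left and right sub-trees first
  if (n->left)
    leftreg = genAST(n->left);
  if (n->right)
    rightreg = genAST(n->right);

  switch (n->op) {
    case A_ADD:      return (cgadd(leftreg, rightreg));
    case A_SUBTRACT: return (cgsub(leftreg, rightreg));
    case A_MULTIPLY: return (cgmul(leftreg, rightreg));
    case A_DIVIDE:   return (cgdiv(leftreg, rightreg));
    case A_INTLIT:   return (cgloadint(n->intvalue));
    default:
      fprintf(stderr, "Unknown AST operator %d\n", n->op);
      exit(1);
  }
}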

Some Helper Functions

There are a couple of new helper functions in the above code, which I’ve put into a new file, misc.c:

// Ensure that the current token is t,
// and fetch the next token. Otherwise
// throw an error 
void match(int t, char *what) {
  if (Token.token == t) {
    scan(&Token);
  } else {
    printf("%s expected on line %d\n", what, Line);
    exit(1);
  }
}

// Match a semicolon and fetch the next token
void semi(void) {
  match(T_SEMI, ";");
}

These form part of the syntax checking in the parser. Later on, I’ll add more short functions to call match() to make our syntax checking easier.
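
Purely as a hypothetical illustration of the kind of one-liner that will show up later (T_IDENT is not a token yet in this part):

// Match an identifier and fetch the next token
// (hypothetical: uses a token type a later part will introduce)
void ident(void) {
  match(T_IDENT, "identifier");
}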

Changes to main()

main() used to call binexpr() directly to parse the single expression in the old input files. Now it does this:

  scan(&Token);                 // Get the first token from the input
  genpreamble();                // Output the preamble
  statements();                 // Parse the statements in the input
  genpostamble();               // Output the postamble
  fclose(Outfile);              // Close the output file and exit
  exit(0);

Trying It Out

That’s about it for the new and changed code. Let’s give the new code a whirl. Here is the new input file, input01:

print 12 * 3;
print 
   18 - 2
      * 4; print
1 + 2 +
  9 - 5/2 + 3*5;

Yes, I’ve decided to check that we have tokens spread out across multiple lines. To compile and run the input file, do a make test:

$ make test
cc -o comp1 -g cg.c expr.c gen.c main.c misc.c scan.c stmt.c tree.c
./comp1 input01
cc -o out out.s
./out
36
10
25
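
As a quick sanity check on those results: 12 * 3 = 36; 18 - 2 * 4 = 10; and 1 + 2 + 9 - 5/2 + 3*5 = 1 + 2 + 9 - 2 + 15 = 25, since the integer division 5/2 truncates to 2.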

Conclusion and What’s Next

We’ve added our first “real” statement grammar to our language. I’ve defined it in BNF notation, but it was easier to implement it with a loop and not recursively. Don’t worry, we’ll go back to doing recursive parsing soon.

Along the way we had to modify the scanner, add support for keywords and identifiers, and more cleanly separate the generic code generator from the CPU-specific generator.

In the next part of our compiler writing journey, we will add variables to the language. This will require a significant amount of work.
