Syntax Analysis (1)
--------
Top-Down Parsing (1)
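Syntax analysis takes the token stream produced by lexical analysis as input and produces a parse tree or a syntax tree.
Top-down parsing divides into recursive-descent parsing and predictive parsing. Predictive parsing in turn divides into recursive predictive parsing and non-recursive predictive parsing.
Recursive-descent parsing has many drawbacks and is seldom used, so I did not study it closely; these notes concentrate on predictive parsing.
Notation: $ denotes the empty string, and # denotes the end of the token stream.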
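I. Background shared by the two predictive parsing methods
1. Context-free grammars and how to process them:
A context-free grammar is the tool used to describe syntax; for example, "Compiler Construction: Principles and Practice" gives the grammar of TINY.
Nonterminals are written in uppercase; terminals are written in lowercase or as symbols ($ denotes the empty string).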
BNF of the TINY
**************************************************************************
PROGRAM-> STMT-SEQUENCE
STMT-SEQUENCE-> STMT-SEQUENCE ; STATEMENT | STATEMENT
STATEMENT-> IF-STMT | REPEAT-STMT | ASSIGN-STMT
| READ-STMT | WRITE-STMT
IF-STMT-> if EXP then STMT-SEQUENCE end
| if EXP then STMT-SEQUENCE else STMT-SEQUENCE end
REPEAT-STMT-> repeat STMT-SEQUENCE until EXP
ASSIGN-STMT-> identifier := EXP
READ-STMT-> read identifier
WRITE-STMT-> write EXP
EXP-> SIMPLE-EXP COMPARISON-OP SIMPLE-EXP | SIMPLE-EXP
COMPARISON-OP-> < | =
SIMPLE-EXP-> SIMPLE-EXP ADDOP TERM | TERM
ADDOP-> + | -
TERM-> TERM MULOP FACTOR | FACTOR
MULOP-> * | /
FACTOR-> ( EXP ) | number | identifier
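2. Processing the grammar
The grammar contains ambiguity, left recursion, and common left factors.
Ambiguity has to be handled case by case, depending on the grammar.
Eliminating left recursion (example from the Dragon Book):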
A-> Aa | b
After elimination the productions are:
A-> b A'
A'-> a A' | $
In general, given
A-> A a1 | A a2 | ... | A am | b1 | b2 | ... | bn   (where no bi begins with A)
replace it with the productions
A-> b1 A' | b2 A' | ... | bn A'
A'-> a1 A' | a2 A' | ... | am A' | $
Several productions in the BNF above are left recursive:
STMT-SEQUENCE-> STMT-SEQUENCE ; STATEMENT | STATEMENT
SIMPLE-EXP-> SIMPLE-EXP ADDOP TERM | TERM
TERM-> TERM MULOP FACTOR | FACTOR
Applying the method above transforms them into:
STMT-SEQUENCE-> STATEMENT STMT-SEQUENCE’
STMT-SEQUENCE’-> ; STATEMENT STMT-SEQUENCE’ | $
SIMPLE-EXP-> TERM SIMPLE-EXP’
SIMPLE-EXP’-> ADDOP TERM SIMPLE-EXP’ | $
TERM-> FACTOR TERM’
TERM’-> MULOP FACTOR TERM’ | $
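The rewriting rule for immediate left recursion is mechanical, so it can be sketched in C. The helper below is hypothetical and not part of the TINY parser: it assumes that each alternative is a string of space-separated symbols and that the nonterminal has at least one left-recursive and one non-left-recursive alternative.

```c
#include <assert.h>
#include <string.h>

/* Append s to the NUL-terminated buffer buf of capacity cap.
 * Returns 0 on success, -1 if s does not fit. */
static int append(char* buf, size_t cap, const char* s)
{
    size_t used = strlen(buf);
    size_t len = strlen(s);

    if (used + len + 1 > cap)
        return -1;
    memcpy(buf + used, s, len + 1);
    return 0;
}

/* An alternative is left recursive when its first symbol is a itself. */
static int is_left_recursive(const char* a, const char* alt)
{
    size_t n = strlen(a);

    return strncmp(alt, a, n) == 0 && alt[n] == ' ';
}

/*
 * Rewrites  A-> A a1 | ... | A am | b1 | ... | bn  as
 *   A-> b1 A' | ... | bn A'
 *   A'-> a1 A' | ... | am A' | $
 * and writes the two lines to out. Returns 0, or -1 if out is too small.
 */
static int eliminate_left_recursion(const char* a, const char* alts[], int n,
                                    char* out, size_t cap)
{
    size_t alen;
    int pass, i, first, err = 0;

    assert(a != NULL && alts != NULL && out != NULL && cap > 0);
    alen = strlen(a);
    out[0] = '\0';
    /* pass 0 emits the A line (b alternatives), pass 1 the A' line */
    for (pass = 0; pass < 2; pass++) {
        err |= append(out, cap, a);
        err |= append(out, cap, pass == 0 ? "-> " : "'-> ");
        first = 1;
        for (i = 0; i < n; i++) {
            int rec = is_left_recursive(a, alts[i]);

            if (rec != pass)
                continue;
            if (!first)
                err |= append(out, cap, " | ");
            /* for a left-recursive alternative, skip the leading "A " */
            err |= append(out, cap, rec ? alts[i] + alen + 1 : alts[i]);
            err |= append(out, cap, " ");
            err |= append(out, cap, a);
            err |= append(out, cap, "'");
            first = 0;
        }
        err |= append(out, cap, pass == 0 ? "\n" : " | $\n");
    }
    return err != 0 ? -1 : 0;
}
```

Called with the nonterminal SIMPLE-EXP and the alternatives "SIMPLE-EXP ADDOP TERM" and "TERM", it writes the same two productions as the hand transformation above, using the straight quote ' for the new nonterminal.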
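Left factoring (example from the Dragon Book):
A-> a b1 | a b2
After left factoring:
A-> a A'
A'-> b1 | b2
In general, given
A-> a b1 | a b2 | ... | a bn | c   (where c stands for the alternatives that do not begin with a)
left factoring yields
A-> a A' | c
A'-> b1 | b2 | ... | bn
The BNF above contains two productions with common left factors: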
IF-STMT-> if EXP then STMT-SEQUENCE end
| if EXP then STMT-SEQUENCE else STMT-SEQUENCE end
EXP-> SIMPLE-EXP COMPARISON-OP SIMPLE-EXP | SIMPLE-EXP
After left factoring:
IF-STMT-> if EXP then STMT-SEQUENCE ELSE-STMT end
ELSE-STMT-> else STMT-SEQUENCE | $
EXP-> SIMPLE-EXP EXP’
EXP’-> COMPARISON-OP SIMPLE-EXP | $
The processed BNF grammar is therefore:
BNF of the TINY
**************************************************************************
PROGRAM-> STMT-SEQUENCE
STMT-SEQUENCE-> STATEMENT STMT-SEQUENCE'
STMT-SEQUENCE'-> ; STATEMENT STMT-SEQUENCE' | $
STATEMENT-> IF-STMT | REPEAT-STMT | ASSIGN-STMT |
READ-STMT | WRITE-STMT
IF-STMT-> if EXP then STMT-SEQUENCE ELSE-STMT end
ELSE-STMT-> else STMT-SEQUENCE | $
REPEAT-STMT-> repeat STMT-SEQUENCE until EXP
ASSIGN-STMT-> identifier := EXP
READ-STMT-> read identifier
WRITE-STMT-> write EXP
EXP-> SIMPLE-EXP EXP'
EXP'-> COMPARISON-OP SIMPLE-EXP | $
COMPARISON-OP-> < | =
SIMPLE-EXP-> TERM SIMPLE-EXP'
SIMPLE-EXP'-> ADDOP TERM SIMPLE-EXP' | $
ADDOP-> + | -
TERM-> FACTOR TERM'
TERM'-> MULOP FACTOR TERM' | $
MULOP-> * | /
FACTOR-> ( EXP ) | number | identifier
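3. first sets and follow sets
During parsing, a nonterminal may have several alternatives, so when a token arrives the parser must decide which alternative to expand. Knowing which terminals each alternative can begin with settles the choice.
The first set of a grammar symbol is the set of terminals that can appear first in the strings derived from it (see "Modern Compiler Design").
The first sets are computed as follows (see the Dragon Book):
(1). If X is a terminal, first(X) = {X}.
(2). If X-> $ is a production, add $ to first(X).
(3). If X is a nonterminal and X-> Y1 Y2 ... Yk is a production, then:
1). If for some i, a is in first(Yi) and $ is in each of first(Y1), ..., first(Yi-1), then a is in first(X).
2). If $ is in first(Yj) for every j = 1, 2, ..., k, then $ is in first(X).
Example from the Dragon Book: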
E-> TE’
E’-> +TE’ | $
T-> FT’
T’-> *FT’ | $
F-> (E) | id
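For first(E): E is a nonterminal, so apply (3) 1) with i = 1. Nothing precedes T, so first(T) is contained in first(E); likewise first(F) is contained in first(T). By (1) and the two alternatives of F, first(F) = {(, id}, and $ is not in first(F), so first(F) = first(T) = first(E) = {(, id}.
By (1) and (2), first(E') = {+, $} and first(T') = {*, $}.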
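The three rules can be applied repeatedly until no first set changes any more. The following is a minimal sketch of that fixed-point iteration for the Dragon Book grammar; the bit-mask encoding, the symbol names (N_E1 and N_T1 stand for E' and T') and compute_first are my own assumptions for illustration and do not appear in the TINY parser.

```c
#include <assert.h>
#include <stddef.h>

/* Terminals first, then EPS (the document's $), then nonterminals. */
enum { T_PLUS, T_STAR, T_LP, T_RP, T_ID, EPS,
       N_E, N_E1, N_T, N_T1, N_F, NSYM };
#define BIT(s) (1u << (s))

typedef struct {
    int lhs;
    int len;        /* 0 encodes the production X-> $ */
    int rhs[3];
} Production;

static const Production grammar[] = {
    { N_E,  2, { N_T, N_E1 } },
    { N_E1, 3, { T_PLUS, N_T, N_E1 } },
    { N_E1, 0, { 0 } },
    { N_T,  2, { N_F, N_T1 } },
    { N_T1, 3, { T_STAR, N_F, N_T1 } },
    { N_T1, 0, { 0 } },
    { N_F,  3, { T_LP, N_E, T_RP } },
    { N_F,  1, { T_ID } },
};
#define NPROD ((int)(sizeof grammar / sizeof grammar[0]))

/* first[s] becomes a bit set over the terminals and EPS. */
static void compute_first(unsigned first[NSYM])
{
    int s, p, i, changed;

    assert(first != NULL);
    for (s = 0; s < NSYM; s++)
        first[s] = (s <= EPS) ? BIT(s) : 0u;        /* rule (1) */
    do {
        changed = 0;
        for (p = 0; p < NPROD; p++) {
            const Production* pr = &grammar[p];
            unsigned add = 0u;

            /* rule (3) 1): take first(Yi) while Y1..Yi-1 all derive $ */
            for (i = 0; i < pr->len; i++) {
                add |= first[pr->rhs[i]] & ~BIT(EPS);
                if (!(first[pr->rhs[i]] & BIT(EPS)))
                    break;
            }
            /* rules (2) and (3) 2): empty right side, or every Yj derives $ */
            if (i == pr->len)
                add |= BIT(EPS);
            if ((first[pr->lhs] | add) != first[pr->lhs]) {
                first[pr->lhs] |= add;
                changed = 1;
            }
        }
    } while (changed);
}
```

After compute_first returns, first[N_E], first[N_T] and first[N_F] all hold {(, id}, while first[N_E1] and first[N_T1] hold {+, $} and {*, $}, in agreement with the hand calculation.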
Applying these rules to the TINY BNF gives the first sets of TINY:
first(PROGRAM) = first(STMT-SEQUENCE) = first(STATEMENT) =
{if, repeat, identifier, read, write}
first(STMT-SEQUENCE') = {;, $}
first(IF-STMT) = {if}
first(ELSE-STMT) = {else, $}
first(REPEAT-STMT) = {repeat}
first(ASSIGN-STMT) = {identifier}
first(READ-STMT) = {read}
first(WRITE-STMT) = {write}
first(EXP) = first(SIMPLE-EXP) = first(TERM) = first(FACTOR) =
{(, number, identifier}
first(EXP') = {<, =, $}
first(COMPARISON-OP) = {<, =}
first(SIMPLE-EXP') = {+, -, $}
first(ADDOP) = {+, -}
first(TERM') = {*, /, $}
first(MULOP) = {*, /}
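The follow set of a nonterminal A is the set of terminals that can appear immediately after A. It is computed as follows (see the Dragon Book):
(1). If S is the start symbol, then # is in follow(S), where # is the end marker of the token stream.
(2). If there is a production A-> aBb, add every symbol of first(b) except $ to follow(B).
(3). If there is a production A-> aB, or a production A-> aBb where $ is in first(b), add everything in follow(A) to follow(B).
Example from the Dragon Book: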
E-> TE’
E’-> +TE’ | $
T-> FT’
T’-> *FT’ | $
F-> (E) | id
For follow(E): by (1), # is in follow(E). The only production whose right side contains E is F-> (E) | id, where (2) adds ) to follow(E). So follow(E) = {), #}.
For follow(E'): the productions whose right side contains E' are E-> TE' and E'-> +TE' | $. In each of them E' is the rightmost symbol, so by (3) follow(E) is contained in follow(E'), giving follow(E') = follow(E).
For follow(T): the productions whose right side contains T are E-> TE' and E'-> +TE' | $. By (2), every symbol of first(E') except $ is in follow(T), which gives +. Since $ is in first(E'), (3) adds follow(E) and follow(E') to follow(T); the two are equal, so nothing is added twice. So follow(T) = {+, ), #}.
Likewise, follow(T') = follow(T).
Likewise, follow(F) = {*, +, ), #}.
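The follow rules can be iterated in the same way until nothing changes. The sketch below does this for the Dragon Book grammar; it takes the first sets from the worked example as a fixed table instead of recomputing them. The bit-mask encoding and all names (END for #, EPS for $, N_E1 and N_T1 for E' and T', first_of, compute_follow) are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Terminals, then $ (EPS) and # (END), then nonterminals. */
enum { T_PLUS, T_STAR, T_LP, T_RP, T_ID, EPS, END,
       N_E, N_E1, N_T, N_T1, N_F, NSYM };
#define BIT(s) (1u << (s))
#define IS_NONTERMINAL(s) ((s) >= N_E)

typedef struct {
    int lhs;
    int len;        /* 0 encodes the production X-> $ */
    int rhs[3];
} Production;

static const Production grammar[] = {
    { N_E,  2, { N_T, N_E1 } },
    { N_E1, 3, { T_PLUS, N_T, N_E1 } },
    { N_E1, 0, { 0 } },
    { N_T,  2, { N_F, N_T1 } },
    { N_T1, 3, { T_STAR, N_F, N_T1 } },
    { N_T1, 0, { 0 } },
    { N_F,  3, { T_LP, N_E, T_RP } },
    { N_F,  1, { T_ID } },
};
#define NPROD ((int)(sizeof grammar / sizeof grammar[0]))

/* first sets copied from the worked example, not recomputed here */
static const unsigned first_of[NSYM] = {
    [T_PLUS] = BIT(T_PLUS), [T_STAR] = BIT(T_STAR), [T_LP] = BIT(T_LP),
    [T_RP] = BIT(T_RP), [T_ID] = BIT(T_ID),
    [N_E] = BIT(T_LP) | BIT(T_ID), [N_T] = BIT(T_LP) | BIT(T_ID),
    [N_F] = BIT(T_LP) | BIT(T_ID),
    [N_E1] = BIT(T_PLUS) | BIT(EPS), [N_T1] = BIT(T_STAR) | BIT(EPS),
};

static void compute_follow(unsigned follow[NSYM])
{
    int p, i, j, changed;

    assert(follow != NULL);
    for (i = 0; i < NSYM; i++)
        follow[i] = 0u;
    follow[N_E] = BIT(END);                     /* rule (1): E is the start symbol */
    do {
        changed = 0;
        for (p = 0; p < NPROD; p++) {
            const Production* pr = &grammar[p];

            for (i = 0; i < pr->len; i++) {
                int b = pr->rhs[i];
                unsigned add = 0u;

                if (!IS_NONTERMINAL(b))
                    continue;
                /* rule (2): first of what follows B, without $ */
                for (j = i + 1; j < pr->len; j++) {
                    add |= first_of[pr->rhs[j]] & ~BIT(EPS);
                    if (!(first_of[pr->rhs[j]] & BIT(EPS)))
                        break;
                }
                /* rule (3): everything after B can derive $ */
                if (j == pr->len)
                    add |= follow[pr->lhs];
                if ((follow[b] | add) != follow[b]) {
                    follow[b] |= add;
                    changed = 1;
                }
            }
        }
    } while (changed);
}
```

The resulting sets match the ones derived by hand, for instance follow[N_F] holds {*, +, ), #}.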
Applying the follow rules to the TINY BNF gives:
follow(PROGRAM) = {#}
follow(STMT-SEQUENCE) = {#, else, end, until}
follow(STMT-SEQUENCE') = {#, else, end, until}
follow(STATEMENT) = follow(IF-STMT) = follow(REPEAT-STMT) =
follow(ASSIGN-STMT) = follow(READ-STMT) = follow(WRITE-STMT) =
{;, #, else, end, until}
follow(ELSE-STMT) = {end}
follow(EXP) = {then, ), ;, #, else, end, until}
follow(EXP') = follow(EXP)
follow(COMPARISON-OP) = {(, number, identifier}
follow(SIMPLE-EXP) = {<, =, then, ), ;, #, else, end, until}
follow(SIMPLE-EXP') = follow(SIMPLE-EXP)
follow(TERM) = {+, -, <, =, then, ), ;, #, else, end, until}
follow(ADDOP) = {(, number, identifier}
follow(TERM') = follow(TERM)
follow(MULOP) = {(, number, identifier}
follow(FACTOR) = {*, /, +, -, <, =, then, ), ;, #, else, end, until}
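4. select sets
The select set of a production is the set of tokens that, while the parser is expanding the nonterminal on the left side, direct it to choose that alternative. It is the set that drives the parser.
Select sets are easy to compute from the first and follow sets obtained above.
To compute select(A-> a), where a is the right side of the production:
(1) If $ is not in first(a), then select(A-> a) = first(a).
(2) If $ is in first(a), then select(A-> a) = (first(a) - {$}) U follow(A).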
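As a small illustration, the select rule can be written as one C function over bit masks. This is only a sketch: the token names and the bit-mask encoding are assumptions, and the first set passed in is that of the right side of the production.

```c
#include <assert.h>

/* Hypothetical token codes, just enough for STMT-SEQUENCE' */
enum { TOK_SEMI, TOK_ELSE, TOK_END, TOK_UNTIL, TOK_EOF, EPS };
#define BIT(s) (1u << (s))

/*
 * select(A-> alpha) from
 *   first_alpha: first set of the right side alpha (may contain EPS, i.e. $)
 *   follow_a:    follow set of the left side A (never contains EPS)
 */
static unsigned select_set(unsigned first_alpha, unsigned follow_a)
{
    unsigned result = first_alpha & ~BIT(EPS);

    assert((follow_a & BIT(EPS)) == 0);
    if (first_alpha & BIT(EPS))     /* alpha can derive $: follow(A) also selects it */
        result |= follow_a;
    return result;
}
```

For STMT-SEQUENCE'-> ; STATEMENT STMT-SEQUENCE' the first set of the right side is {;}, so the select set is {;}. For STMT-SEQUENCE'-> $ the first set is {$}, so the select set is follow(STMT-SEQUENCE') = {#, else, end, until}.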
Applying the select rule to the TINY BNF gives:
select(PROGRAM-> STMT-SEQUENCE) = {if, repeat, identifier, read, write}
select(STMT-SEQUENCE-> STATEMENT STMT-SEQUENCE') =
{if, repeat, identifier, read, write}
select(STMT-SEQUENCE'-> ; STATEMENT STMT-SEQUENCE') = {;}
select(STMT-SEQUENCE'-> $) = {#, else, end, until}
select(STATEMENT-> IF-STMT) = {if}
select(STATEMENT-> REPEAT-STMT) = {repeat}
select(STATEMENT-> ASSIGN-STMT) = {identifier}
select(STATEMENT-> READ-STMT) = {read}
select(STATEMENT-> WRITE-STMT) = {write}
select(IF-STMT-> if EXP then STMT-SEQUENCE ELSE-STMT end) = {if}
select(ELSE-STMT-> else STMT-SEQUENCE) = {else}
select(ELSE-STMT-> $) = {end}
select(REPEAT-STMT-> repeat STMT-SEQUENCE until EXP) = {repeat}
select(ASSIGN-STMT-> identifier := EXP) = {identifier}
select(READ-STMT-> read identifier) = {read}
select(WRITE-STMT-> write EXP) = {write}
select(EXP-> SIMPLE-EXP EXP') = {(, number, identifier}
select(EXP'-> COMPARISON-OP SIMPLE-EXP) = {<, =}
select(EXP'-> $) = {then, ), ;, #, else, end, until}
select(COMPARISON-OP-> <) = {<}
select(COMPARISON-OP-> =) = {=}
select(SIMPLE-EXP-> TERM SIMPLE-EXP') = {(, number, identifier}
select(SIMPLE-EXP'-> ADDOP TERM SIMPLE-EXP') = {+, -}
select(SIMPLE-EXP'-> $) = {<, =, then, ), ;, #, else, end, until}
select(ADDOP-> +) = {+}
select(ADDOP-> -) = {-}
select(TERM-> FACTOR TERM') = {(, number, identifier}
select(TERM'-> MULOP FACTOR TERM') = {*, /}
select(TERM'-> $) = {+, -, <, =, then, ), ;, #, else, end, until}
select(MULOP-> *) = {*}
select(MULOP-> /) = {/}
select(FACTOR-> (EXP)) = {(}
select(FACTOR-> number) = {number}
select(FACTOR-> identifier) = {identifier}
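II. Recursive predictive parsing
1. The parser
"Predictive" is not quite accurate: the select sets above tell us exactly which alternative to choose for a given nonterminal and input token. The parser is built directly from the select sets. Two examples show how (this parser is entirely different from the TINY parser in "Compiler Construction: Principles and Practice"):
For
select(STMT-SEQUENCE-> STATEMENT STMT-SEQUENCE') =
{if, repeat, identifier, read, write}
we can write the following function (the for(;;) loop and check_input are there for error handling):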
static void parse_stmt_sequence(void)
{
for(;;)
{
switch(token)
{
case KEY_IF:
case KEY_REPEAT:
case KEY_READ:
case KEY_WRITE:
case ID:
parse_statement();
parse_stmt_sequence_();
return;
break;
default:
if(check_input(stmt_sequence_first, STMT_SEQUENCE_FIRST_COUNT,
stmt_sequence_follow, STMT_SEQUENCE_FOLLOW_COUNT)
!= IN_FIRST_SET)
return;
break;
}
}
}
This select set means that, while parsing STMT-SEQUENCE, the tokens if, repeat, identifier, read, and write send the parser on to the functions for STATEMENT and STMT-SEQUENCE'. Any other token means the program being parsed contains an error, and error handling takes over.
Two select sets,
select(STMT-SEQUENCE'-> ; STATEMENT STMT-SEQUENCE') = {;}
select(STMT-SEQUENCE'-> $) = {#, else, end, until}
both belong to STMT-SEQUENCE', so they go into one function:
static void parse_stmt_sequence_(void)
{
switch(token)
{
case SEMI:
match(SEMI);
parse_statement();
parse_stmt_sequence_();
break;
case KEY_ELSE:
case KEY_END:
case KEY_UNTIL:
case END_FILE:
/* -> $ */
break;
default:
/* the SEMI symbol was dropped: report it and carry on */
syntax_error("want symbol", ";");
parse_statement();
parse_stmt_sequence_();
break;
}
}
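While parsing STMT-SEQUENCE', the token ; selects the sequence ; STATEMENT STMT-SEQUENCE'. The call match(SEMI); matches the terminal ; and includes its own error handling. When the token is one of #, else, end, until, the production is STMT-SEQUENCE'-> $, which consumes nothing, so the function simply returns to its caller.
From the select sets above, the complete parser can be built (including error handling and syntax tree construction):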
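/*
* Implementation of the parser for the TINY compiler,
* including error recovery and syntax tree construction
*/
/* standard headers used below (harmless if globals.h already includes them) */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "globals.h"
#include "util.h"
#include "scan.h"
#include "parse.h"
#include "parse_set.h"
/* The current token, obtained from get_token */
static TokenType token;
/* local utility functions */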
static void syntax_error(const char* err, const char* token_str);
static void match(TokenType match_token);
static int check_input(const TokenType* first, int first_count,
const TokenType* follow, int follow_count);
static TreeNode* new_stmt_node(StmtKind kind);
static TreeNode* new_exp_node(ExpKind kind);
static char* copy_string(const char* str);
/* local parse functions */
static TreeNode* parse_stmt_sequence(void);
static TreeNode* parse_statement(void);
static TreeNode* parse_stmt_sequence_(void);
static TreeNode* parse_if_stmt(void);
static TreeNode* parse_else_stmt(void);
static TreeNode* parse_repeat_stmt(void);
static TreeNode* parse_assign_stmt(void);
static TreeNode* parse_read_stmt(void);
static TreeNode* parse_write_stmt(void);
static TreeNode* parse_exp(void);
static TreeNode* parse_exp_(void);
static TreeNode* parse_simple_exp(void);
static TreeNode* parse_simple_exp_(void);
static TreeNode* parse_comparison_op(void);
static TreeNode* parse_term(void);
static TreeNode* parse_term_(void);
static TreeNode* parse_addop(void);
static TreeNode* parse_mulop(void);
static TreeNode* parse_factor(void);
/*
* The entry point of the parser for TINY
*/
TreeNode* parse(void)
{
TreeNode* tree = NULL;
token = get_token();
for(;;)
{
switch(token)
{
case KEY_IF:
case KEY_REPEAT:
case KEY_READ:
case KEY_WRITE:
case ID:
tree = parse_stmt_sequence();
match(END_FILE);
return tree;
break;
default:
if(check_input(program_first, PROGRAM_FIRST_COUNT,
program_follow, PROGRAM_FOLLOW_COUNT)
!= IN_FIRST_SET)
return tree;
break;
}
}
}
/* local parse functions */
static TreeNode* parse_stmt_sequence(void)
{
TreeNode* tree = NULL;
TreeNode* sibling_tree = NULL;
for(;;)
{
switch(token)
{
case KEY_IF:
case KEY_REPEAT:
case KEY_READ:
case KEY_WRITE:
case ID:
tree = parse_statement();
sibling_tree = parse_stmt_sequence_();
if(tree == NULL)
tree = sibling_tree;
else
tree->sibling = sibling_tree;
return tree;
break;
default:
if(check_input(stmt_sequence_first, STMT_SEQUENCE_FIRST_COUNT,
stmt_sequence_follow, STMT_SEQUENCE_FOLLOW_COUNT)
!= IN_FIRST_SET)
return tree;
break;
}
}
}
static TreeNode* parse_statement(void)
{
TreeNode* tree = NULL;
for(;;)
{
switch(token)
{
case KEY_IF:
tree = parse_if_stmt();
return tree;
break;
case KEY_REPEAT:
tree = parse_repeat_stmt();
return tree;
break;
case ID:
tree = parse_assign_stmt();
return tree;
break;
case KEY_READ:
tree = parse_read_stmt();
return tree;
break;
case KEY_WRITE:
tree = parse_write_stmt();
return tree;
break;
default:
if(check_input(statement_first, STATEMENT_FIRST_COUNT,
statement_follow, STATEMENT_FOLLOW_COUNT)
!= IN_FIRST_SET)
return tree;
break;
}
}
}
static TreeNode* parse_stmt_sequence_(void)
{
TreeNode* tree = NULL;
TreeNode* sibling_tree = NULL;
for(;;)
{
switch(token)
{
case SEMI:
match(SEMI);
tree = parse_statement();
sibling_tree = parse_stmt_sequence_();
if(tree == NULL)
tree = sibling_tree;
else
tree->sibling = sibling_tree;
return tree;
break;
case KEY_ELSE:
case KEY_END:
case KEY_UNTIL:
case END_FILE:
/* -> $ */
return tree;
break;
default:
if(check_input(stmt_sequence__first, STMT_SEQUENCE__FIRST_COUNT,
stmt_sequence__follow, STMT_SEQUENCE__FOLLOW_COUNT)
!= IN_FIRST_SET)
return tree;
break;
}
}
}
static TreeNode* parse_if_stmt(void)
{
TreeNode* tree = NULL;
for(;;)
{
switch(token)
{
case KEY_IF:
tree = new_stmt_node(KIND_IF);
match(KEY_IF);
if(tree != NULL)
tree->child[0] = parse_exp();
match(KEY_THEN);
if(tree != NULL)
{
tree->child[1] = parse_stmt_sequence();
tree->child[2] = parse_else_stmt();
}
match(KEY_END);
return tree;
break;
default:
if(check_input(if_stmt_first, IF_STMT_FIRST_COUNT,
if_stmt_follow, IF_STMT_FOLLOW_COUNT)
!= IN_FIRST_SET)
return tree;
break;
}
}
}
static TreeNode* parse_else_stmt(void)
{
TreeNode* tree = NULL;
for(;;)
{
switch(token)
{
case KEY_ELSE:
match(KEY_ELSE);
tree = parse_stmt_sequence();
return tree;
break;
case KEY_END:
/* -> $ */
return tree;
break;
default:
if(check_input(else_stmt_first, ELSE_STMT_FIRST_COUNT,
else_stmt_follow, ELSE_STMT_FOLLOW_COUNT)
!= IN_FIRST_SET)
return tree;
break;
}
}
}
static TreeNode* parse_repeat_stmt(void)
{
TreeNode* tree = NULL;
for(;;)
{
switch(token)
{
case KEY_REPEAT:
tree = new_stmt_node(KIND_REPEAT);
match(KEY_REPEAT);
if(tree != NULL)
tree->child[0] = parse_stmt_sequence();
match(KEY_UNTIL);
if(tree != NULL)
tree->child[1] = parse_exp();
return tree;
break;
default:
if(check_input(repeat_stmt_first, REPEAT_STMT_FIRST_COUNT,
repeat_stmt_follow, REPEAT_STMT_FOLLOW_COUNT)
!= IN_FIRST_SET)
return tree;
break;
}
}
}
static TreeNode* parse_assign_stmt(void)
{
TreeNode* tree = NULL;
for(;;)
{
switch(token)
{
case ID:
tree = new_stmt_node(KIND_ASSIGN);
if(tree != NULL)
tree->attr.name = copy_string(token_string);
match(ID);
match(ASSIGN);
if(tree != NULL)
tree->child[0] = parse_exp();
return tree;
break;
default:
if(check_input(assign_stmt_first, ASSIGN_STMT_FIRST_COUNT,
assign_stmt_follow, ASSIGN_STMT_FOLLOW_COUNT)
!= IN_FIRST_SET)
return tree;
break;
}
}
}
static TreeNode* parse_read_stmt(void)
{
TreeNode* tree = NULL;
for(;;)
{
switch(token)
{
case KEY_READ:
tree = new_stmt_node(KIND_READ);
match(KEY_READ);
if(tree != NULL)
tree->attr.name = copy_string(token_string);
match(ID);
return tree;
break;
default:
if(check_input(read_stmt_first, READ_STMT_FIRST_COUNT,
read_stmt_follow, READ_STMT_FOLLOW_COUNT)
!= IN_FIRST_SET)
return tree;
break;
}
}
}
static TreeNode* parse_write_stmt(void)
{
TreeNode* tree = NULL;
for(;;)
{
switch(token)
{
case KEY_WRITE:
tree = new_stmt_node(KIND_WRITE);
match(KEY_WRITE);
if(tree != NULL)
tree->child[0] = parse_exp();
return tree;
break;
default:
if(check_input(write_stmt_first, WRITE_STMT_FIRST_COUNT,
write_stmt_follow, WRITE_STMT_FOLLOW_COUNT)
!= IN_FIRST_SET)
return tree;
break;
}
}
}
static TreeNode* parse_exp(void)
{
TreeNode* tree = NULL;
TreeNode* tree_op = NULL;
for(;;)
{
switch(token)
{
case LPAREN:
case NUM:
case ID:
tree = parse_simple_exp();
tree_op = parse_exp_();
if(tree_op != NULL)
tree_op->child[0] = tree;
else
tree_op = tree;
return tree_op;
break;
default:
if(check_input(exp_first, EXP_FIRST_COUNT,
exp_follow, EXP_FOLLOW_COUNT)
!= IN_FIRST_SET)
return tree;
break;
}
}
}
static TreeNode* parse_exp_(void)
{
TreeNode* tree = NULL;
for(;;)
{
switch(token)
{
case LT:
case EQ:
tree = parse_comparison_op();
if(tree != NULL)
tree->child[1] = parse_simple_exp();
return tree;
break;
case KEY_THEN:
case RPAREN:
case SEMI:
case KEY_ELSE:
case KEY_END:
case KEY_UNTIL:
case END_FILE:
/* -> $ */
return tree;
break;
default:
if(check_input(exp__first, EXP__FIRST_COUNT,
exp__follow, EXP__FOLLOW_COUNT)
!= IN_FIRST_SET)
return tree;
break;
}
}
}
static TreeNode* parse_comparison_op(void)
{
TreeNode* tree = NULL;
for(;;)
{
switch(token)
{
case LT:
tree = new_exp_node(KIND_OP);
if(tree != NULL)
tree->attr.op = token;
match(LT);
return tree;
break;
case EQ:
tree = new_exp_node(KIND_OP);
if(tree != NULL)
tree->attr.op = token;
match(EQ);
return tree;
break;
default:
if(check_input(comparsion_op_first, COMPARSION_OP_FIRST_COUNT,
comparsion_op_follow, COMPARSION_OP_FOLLOW_COUNT)
!= IN_FIRST_SET)
return tree;
break;
}
}
}
/*
* The expression tree, e.g. for 1+4-5+3-9:
*
*           -
*          / \
*         +   9
*        / \
*       -   3
*      / \
*     +   5
*    / \
*   1   4
*
* parse_term, called from parse_simple_exp, parses only the leading 1;
* parse_simple_exp_ parses the rest and returns a pointer to the MINUS
* node at the top of the tree. So a while loop walks down the left
* children to find the position for the 1 node and attaches it there.
*/
static TreeNode* parse_simple_exp(void)
{
TreeNode* tree = NULL;
TreeNode* tree_term = NULL;
TreeNode* tree_temp = NULL;
for(;;)
{
switch(token)
{
case LPAREN:
case NUM:
case ID:
tree_term = parse_term();
tree = parse_simple_exp_();
if(tree != NULL)
{
tree_temp = tree;
while(tree_temp->child[0] != NULL)
tree_temp = tree_temp->child[0];
tree_temp->child[0] = tree_term;
}
else
tree = tree_term;
return tree;
break;
default:
if(check_input(simple_exp_first, SIMPLE_EXP_FIRST_COUNT,
simple_exp_follow, SIMPLE_EXP_FOLLOW_COUNT)
!= IN_FIRST_SET)
return tree;
break;
}
}
}
static TreeNode* parse_simple_exp_(void)
{
TreeNode* tree = NULL;
TreeNode* tree_addop = NULL;
TreeNode* tree_temp = NULL;
for(;;)
{
switch(token)
{
case PLUS:
case MINUS:
tree_addop = parse_addop();
if(tree_addop != NULL)
tree_addop->child[1] = parse_term();
tree = parse_simple_exp_();
if(tree != NULL)
{
tree_temp = tree;
while(tree_temp->child[0] != NULL)
tree_temp = tree_temp->child[0];
tree_temp->child[0] = tree_addop;
}
else
tree = tree_addop;
return tree;
break;
case LT:
case EQ:
case KEY_THEN:
case RPAREN:
case SEMI:
case END_FILE:
case KEY_ELSE:
case KEY_END:
case KEY_UNTIL:
/* -> $ */
return tree;
break;
default:
if(check_input(simple_exp__first, SIMPLE_EXP__FIRST_COUNT,
simple_exp__follow, SIMPLE_EXP__FOLLOW_COUNT)
!= IN_FIRST_SET)
return tree;
break;
}
}
}
static TreeNode* parse_addop(void)
{
TreeNode* tree = NULL;
for(;;)
{
switch(token)
{
case PLUS:
tree = new_exp_node(KIND_OP);
if(tree != NULL)
tree->attr.op = token;
match(PLUS);
return tree;
break;
case MINUS:
tree = new_exp_node(KIND_OP);
if(tree != NULL)
tree->attr.op = token;
match(MINUS);
return tree;
break;
default:
if(check_input(addop_first, ADDOP_FIRST_COUNT,
addop_follow, ADDOP_FOLLOW_COUNT)
!= IN_FIRST_SET)
return tree;
break;
}
}
}
static TreeNode* parse_term(void)
{
TreeNode* tree = NULL;
TreeNode* tree_factor = NULL;
TreeNode* tree_temp = NULL;
for(;;)
{
switch(token)
{
case LPAREN:
case NUM:
case ID:
tree_factor = parse_factor();
tree = parse_term_();
if(tree != NULL)
{
tree_temp = tree;
while(tree_temp->child[0] != NULL)
tree_temp = tree_temp->child[0];
tree_temp->child[0] = tree_factor;
}
else
tree = tree_factor;
return tree;
break;
default:
if(check_input(term_first, TERM_FIRST_COUNT,
term_follow, TERM_FOLLOW_COUNT)
!= IN_FIRST_SET)
return tree;
break;
}
}
}
static TreeNode* parse_term_(void)
{
TreeNode* tree = NULL;
TreeNode* tree_temp = NULL;
TreeNode* tree_mulop = NULL;
for(;;)
{
switch(token)
{
case MULT:
case DIV:
tree_mulop = parse_mulop();
if(tree_mulop != NULL)
tree_mulop->child[1] = parse_factor();
tree = parse_term_();
if(tree != NULL)
{
tree_temp = tree;
while(tree_temp->child[0] != NULL)
tree_temp = tree_temp->child[0];
tree_temp->child[0] = tree_mulop;
}
else
tree = tree_mulop;
return tree;
break;
case PLUS:
case MINUS:
case LT:
case EQ:
case KEY_THEN:
case RPAREN:
case SEMI:
case END_FILE:
case KEY_ELSE:
case KEY_END:
case KEY_UNTIL:
/* -> $ */
return tree;
break;
default:
if(check_input(term__first, TERM__FIRST_COUNT,
term__follow, TERM__FOLLOW_COUNT)
!= IN_FIRST_SET)
return tree;
break;
}
}
}
static TreeNode* parse_mulop(void)
{
TreeNode* tree = NULL;
for(;;)
{
switch(token)
{
case MULT:
tree = new_exp_node(KIND_OP);
if(tree != NULL)
tree->attr.op = token;
match(MULT);
return tree;
break;
case DIV:
tree = new_exp_node(KIND_OP);
if(tree != NULL)
tree->attr.op = token;
match(DIV);
return tree;
break;
default:
if(check_input(mulop_first, MULOP_FIRST_COUNT,
mulop_follow, MULOP_FOLLOW_COUNT)
!= IN_FIRST_SET)
return tree;
break;
}
}
}
static TreeNode* parse_factor(void)
{
TreeNode* tree = NULL;
for(;;)
{
switch(token)
{
case LPAREN:
match(LPAREN);
tree = parse_exp();
match(RPAREN);
return tree;
break;
case NUM:
tree = new_exp_node(KIND_CONST);
if(tree != NULL)
tree->attr.val = atoi(token_string);
match(NUM);
return tree;
break;
case ID:
tree = new_exp_node(KIND_ID);
if(tree != NULL)
tree->attr.name = copy_string(token_string);
match(ID);
return tree;
break;
default:
if(check_input(factor_first, FACTOR_FIRST_COUNT,
factor_follow, FACTOR_FOLLOW_COUNT)
!= IN_FIRST_SET)
return tree;
break;
}
}
}
/* local utility functions */
static void syntax_error(const char* err, const char* token_str)
{
fprintf(listing, ">>> ");
fprintf(listing, "Syntax error before line: %d, %s %s\n",
line_no, err, token_str);
}
static void match(TokenType match_token)
{
if(token == match_token)
token = get_token();
else
syntax_error("want symbol", token_string);
}
/*
* Drop tokens that are not in the synchronizing set;
* return as soon as the current token is in the first set,
* in the follow set, or is END_FILE
*/
static int check_input(const TokenType* first, int first_count,
const TokenType* follow, int follow_count)
{
int i = 0;
for(;;)
{
for(i = 0; i < first_count; i++)
if(token == first[i])
return IN_FIRST_SET;
for(i = 0; i< follow_count; i++)
if(token == follow[i])
{
syntax_error("dropped some symbol!", "");
return IN_FOLLOW_SET;
}
/* for some follow set not include END_FILE */
if(token == END_FILE)
{
syntax_error("dropped a lot of symbols!", "");
return IN_END_FILE;
}
/* neither in first set nor in follow set */
/* syntax_error("unwanted symbol", token_string); */
token = get_token();
}
}
static TreeNode* new_stmt_node(StmtKind kind)
{
int i = 0;
TreeNode* tree = NULL;
if((tree = (TreeNode*)malloc(sizeof(TreeNode))) == NULL)
{
fprintf(stderr, "Out of memory at line %d!\n", line_no);
exit(-1);
}
for(i = 0; i < MAX_CHILDREN; i++)
tree->child[i] = NULL;
tree->sibling = NULL;
tree->node_kind = KIND_STMT;
tree->line_no = line_no;
tree->kind.stmt = kind;
return tree;
}
static TreeNode* new_exp_node(ExpKind kind)
{
int i = 0;
TreeNode* tree = NULL;
if((tree = (TreeNode*)malloc(sizeof(TreeNode))) == NULL)
{
fprintf(stderr, "Out of memory at line %d!\n", line_no);
exit(-1);
}
for(i = 0; i < MAX_CHILDREN; i++)
tree->child[i] = NULL;
tree->sibling = NULL;
tree->node_kind = KIND_EXP;
tree->kind.exp = kind;
tree->line_no = line_no;
return tree;
}
static char* copy_string(const char* str)
{
char* ret_str = NULL;
if((ret_str = (char*)malloc(strlen(str)+1)) == NULL)
{
fprintf(stderr, "Out of memory at line %d!\n", line_no);
exit(-1);
}
strncpy(ret_str, str, strlen(str));
ret_str[strlen(str)] = '\0';
return ret_str;
}
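2. Error handling and recovery (see the Dragon Book)
The parser uses panic-mode error recovery: on an error it discards input tokens until one matches a token in the synchronizing set. Usually the follow set serves as the synchronizing set, so that the parser skips the production being parsed and continues with what follows, but this sometimes skips a large number of tokens. The first set is therefore added to the synchronizing set as well, although first and follow tokens are handled differently. The synchronizing set must also contain the end-of-input marker.
synchronizing set = first set U follow set U {#}
When a parse error occurs, the parser first synchronizes:
1). If the current token matches nothing in the synchronizing set, discard it and synchronize again.
2). If the current token is in the first set of the production being parsed, restart the current parsing function from the beginning.
3). If the current token is in the follow set of the production being parsed, leave the current parsing function and return to the caller.
The function check_input implements this: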
static int check_input(const TokenType* first, int first_count,
const TokenType* follow, int follow_count)
{
int i = 0;
for(;;)
{
for(i = 0; i < first_count; i++)
if(token == first[i])
return IN_FIRST_SET;
for(i = 0; i< follow_count; i++)
if(token == follow[i])
{
syntax_error("dropped some symbol!", "");
return IN_FOLLOW_SET;
}
/* for some follow set not include END_FILE */
if(token == END_FILE)
{
syntax_error("dropped a lot of symbols!", "");
return IN_END_FILE;
}
/* neither in first set nor in follow set */
/* syntax_error("unwanted symbol", token_string); */
token = get_token();
}
}
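The tables passed to check_input (for example stmt_sequence__first with STMT_SEQUENCE__FIRST_COUNT) come from parse_set.h, which is not shown in these notes. The sketch below is only a guess at how such tables might look for two nonterminals, together with a helper that classifies a single token the way one pass of the loop in check_input does. The reduced TokenType enum, the COUNT_OF macro, the NOT_IN_SYNC_SET code and the classify function are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed subset of the token codes; the real enum lives in globals.h */
typedef enum {
    END_FILE, KEY_ELSE, KEY_END, KEY_UNTIL, KEY_IF, ID, SEMI
} TokenType;

/* Assumed result codes; NOT_IN_SYNC_SET is added for this sketch only */
enum { IN_FIRST_SET, IN_FOLLOW_SET, IN_END_FILE, NOT_IN_SYNC_SET };

#define COUNT_OF(a) ((int)(sizeof (a) / sizeof (a)[0]))

/* first (without $) and follow of STMT-SEQUENCE' */
static const TokenType stmt_sequence__first[] = { SEMI };
static const TokenType stmt_sequence__follow[] = {
    END_FILE, KEY_ELSE, KEY_END, KEY_UNTIL
};

/* first (without $) and follow of ELSE-STMT */
static const TokenType else_stmt_first[] = { KEY_ELSE };
static const TokenType else_stmt_follow[] = { KEY_END };

/*
 * Classify one token without consuming input. check_input would drop a
 * token classified as NOT_IN_SYNC_SET and fetch the next one.
 */
static int classify(TokenType tok,
                    const TokenType* first, int first_count,
                    const TokenType* follow, int follow_count)
{
    int i;

    assert(first != NULL && follow != NULL);
    for (i = 0; i < first_count; i++)
        if (tok == first[i])
            return IN_FIRST_SET;        /* case 2): restart the function */
    for (i = 0; i < follow_count; i++)
        if (tok == follow[i])
            return IN_FOLLOW_SET;       /* case 3): return to the caller */
    /* some follow sets do not contain END_FILE, so test it separately */
    return tok == END_FILE ? IN_END_FILE : NOT_IN_SYNC_SET;
}
```

With these tables, ; is in the first set of STMT-SEQUENCE', end is in its follow set, and if is in neither, so it would be discarded. For ELSE-STMT, whose follow set does not contain the end-of-input marker, END_FILE is reported separately as IN_END_FILE.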
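This synchronization runs as soon as the parsing function for some production is entered. Errors that arise later, when a terminal in the production does not match the current token, cannot be handled this way. The function match is responsible for matching terminals, and it contains its own error handling: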
static void match(TokenType match_token)
{
if(token == match_token)
token = get_token();
else
syntax_error("want symbol", token_string);
}
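If the token matches, match fetches the next token. If it does not, the expected token is taken to be missing from the input: match prints an error message and lets the parser skip this terminal and carry on.
syntax_error only prints the error message: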
static void syntax_error(const char* err, const char* token_str)
{
fprintf(listing, ">>> ");
fprintf(listing, "Syntax error before line: %d, %s %s\n",
line_no, err, token_str);
}
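III. Building the syntax tree
An abstract syntax tree is a condensed form of the parse tree: keywords and operators are no longer leaves but interior nodes, and the variables, constants, and so on that belong to them become their children.
Before building the tree, look at the kinds of constructs in the language (see "Compiler Construction: Principles and Practice"). The TINY grammar has 5 kinds of statements and 3 kinds of expressions. The statements are the if, repeat, assign, read, and write statements. The expressions are operator expressions, constant expressions, and identifier expressions.
These 8 kinds determine the node types of the syntax tree: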
typedef enum
{
KIND_STMT, KIND_EXP
}NodeKind;
typedef enum
{
KIND_IF, KIND_REPEAT, KIND_ASSIGN, KIND_READ, KIND_WRITE
}StmtKind;
typedef enum
{
KIND_OP, KIND_CONST, KIND_ID
}ExpKind;
#define MAX_CHILDREN 3
typedef struct tree_node
{
struct tree_node* child[MAX_CHILDREN];
struct tree_node* sibling;
NodeKind node_kind;
int line_no;
union
{
StmtKind stmt;
ExpKind exp;
}kind;
union
{
TokenType op;
int val;
char* name;
}attr;
}TreeNode;
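Here
struct tree_node* child[MAX_CHILDREN]; holds the children of a node. For example, in write y the expression y is child[0] of the write node (for read x, by contrast, parse_read_stmt stores the name x in attr.name).
struct tree_node* sibling; points to the next statement at the same level. For example, in
read x;
write y;
the write node is the sibling of the read node: the sibling field of read points to the write node.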
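To make the two kinds of links concrete, here is a minimal sketch that walks such a tree: children are visited by recursion and siblings by iteration. The trimmed-down Node type, its label field, and the functions append_sibling and count_nodes are hypothetical and only mirror the child and sibling fields of TreeNode.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHILDREN 3

/* Trimmed-down node: only the links matter for this sketch */
typedef struct node {
    struct node* child[MAX_CHILDREN];
    struct node* sibling;
    const char* label;      /* hypothetical field, for illustration only */
} Node;

/* Append next at the end of the sibling chain that starts at first. */
static void append_sibling(Node* first, Node* next)
{
    assert(first != NULL);
    while (first->sibling != NULL)
        first = first->sibling;
    first->sibling = next;
}

/* Count every node reachable from t through child and sibling links. */
static int count_nodes(const Node* t)
{
    int i, n = 0;

    for (; t != NULL; t = t->sibling) {
        n++;
        for (i = 0; i < MAX_CHILDREN; i++)
            n += count_nodes(t->child[i]);
    }
    return n;
}
```

For the two statements read x; write y; the tree has three nodes under this scheme: the read node, its sibling the write node, and the identifier node y as child[0] of write.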
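Looking at the BNF, some productions describe how statements relate to each other, while others describe a kind of statement and how it is constructed.
1. Linking sibling statements
For example,
STMT-SEQUENCE-> STATEMENT STMT-SEQUENCE'
STMT-SEQUENCE'-> ; STATEMENT STMT-SEQUENCE' | $
mainly say that statements at the same level are separated by ;. So parsing STMT-SEQUENCE includes: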
tree = parse_statement();
sibling_tree = parse_stmt_sequence_();
if(tree == NULL)
tree = sibling_tree;
else
tree->sibling = sibling_tree;
return tree;
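tree is the root of the first statement's syntax tree, and sibling_tree is the root of the tree built from the remaining statements at the same level. They are siblings, so tree->sibling = sibling_tree; links them.
2. Building statement nodes
For example,
IF-STMT-> if EXP then STMT-SEQUENCE ELSE-STMT end
describes how an if statement is built: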
tree = new_stmt_node(KIND_IF);
match(KEY_IF);
if(tree != NULL)
tree->child[0] = parse_exp();
match(KEY_THEN);
if(tree != NULL)
{
tree->child[1] = parse_stmt_sequence();
tree->child[2] = parse_else_stmt();
}
match(KEY_END);
return tree;
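First a statement node is created for the if statement. The EXP, STMT-SEQUENCE, and ELSE-STMT parts are children of the if statement, so they are attached as children of the if node: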
tree->child[0] = parse_exp();
tree->child[1] = parse_stmt_sequence();
tree->child[2] = parse_else_stmt();
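3. Building expression nodes
Among expression nodes, operator expressions are the most complicated to build.
The BNF for relational expressions is:
EXP-> SIMPLE-EXP EXP'
EXP'-> COMPARISON-OP SIMPLE-EXP | $
The caller enters relational-expression parsing through EXP, and the pointer returned to the caller must point to the relational operator node (if there is one). But the operator node is only created one level down, in COMPARISON-OP inside EXP'. So the pointer returned by EXP' is the real root of the expression, and what SIMPLE-EXP returns inside EXP is merely the left child of that root. Hence:
In parse_exp: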
tree = parse_simple_exp();
tree_op = parse_exp_();
if(tree_op != NULL)
tree_op->child[0] = tree;
else
tree_op = tree;
return tree_op;
In parse_exp_:
tree = parse_comparison_op();
if(tree != NULL)
tree->child[1] = parse_simple_exp();
return tree;
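Addition/subtraction nodes and multiplication/division nodes are built the same way; take addition and subtraction as the example.
The BNF for addition and subtraction:
SIMPLE-EXP-> TERM SIMPLE-EXP'
SIMPLE-EXP'-> ADDOP TERM SIMPLE-EXP' | $
As above, the root of the whole expression comes from ADDOP inside SIMPLE-EXP'. Because additions and subtractions can be chained and are left associative, the node that TERM returns inside SIMPLE-EXP is the leftmost, deepest descendant of the whole expression. For example, for the expression
a+b-c+d+e-f
the syntax trees are:
          -                        -
         / \                      / \
        +   f                    +   f
       / \                      / \
      +   e                    +   e
     / \                      / \
    -   d                    -   d
   / \                      / \
  +   c                    +   c
 / \                        \
a   b                        b
(complete syntax tree)     (syntax tree returned through ADDOP)
TERM returns a pointer to a, while the pointer that comes back through ADDOP and SIMPLE-EXP' points to the topmost -. To attach a to the tree, we must find the lowest + node. So
In parse_simple_exp: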
tree_term = parse_term();
tree = parse_simple_exp_();
if(tree != NULL)
{
tree_temp = tree;
while(tree_temp->child[0] != NULL)
tree_temp = tree_temp->child[0];
tree_temp->child[0] = tree_term;
}
else
tree = tree_term;
return tree;
In parse_simple_exp_:
tree_addop = parse_addop();
if(tree_addop != NULL)
tree_addop->child[1] = parse_term();
tree = parse_simple_exp_();
if(tree != NULL)
{
tree_temp = tree;
while(tree_temp->child[0] != NULL)
tree_temp = tree_temp->child[0];
tree_temp->child[0] = tree_addop;
}
else
tree = tree_addop;
return tree;
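For comparison, a left-associative tree can also be built without searching for the leftmost node, by parsing SIMPLE-EXP with a loop in which the tree built so far becomes the left child of each new operator. This is a different technique from the recursive one used in these notes, shown only as a sketch: it assumes single-letter operands and reads from a plain string instead of the TINY scanner, and the Node type, new_node, parse_term and parse_simple_exp here are simplified stand-ins, not the functions of the parser above.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct node {
    struct node* child[2];
    char symbol;            /* an operator or a single-letter operand */
} Node;

static Node* new_node(char symbol, Node* left, Node* right)
{
    Node* n = malloc(sizeof *n);

    if (n == NULL)
        exit(EXIT_FAILURE);
    n->symbol = symbol;
    n->child[0] = left;
    n->child[1] = right;
    return n;
}

/* term -> letter ; advances *p past the operand */
static Node* parse_term(const char** p)
{
    assert(p != NULL && **p != '\0');
    return new_node(*(*p)++, NULL, NULL);
}

/* simple-exp -> term { addop term } */
static Node* parse_simple_exp(const char** p)
{
    Node* tree = parse_term(p);

    while (**p == '+' || **p == '-') {
        char op = *(*p)++;

        /* the tree built so far becomes the left child of the new operator */
        tree = new_node(op, tree, parse_term(p));
    }
    return tree;
}

static void free_tree(Node* t)
{
    if (t == NULL)
        return;
    free_tree(t->child[0]);
    free_tree(t->child[1]);
    free(t);
}
```

For the input a+b-c the root is the - node, its right child is c, and its left child is the + node with children a and b, which is the same shape the recursive version produces.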
Constant and identifier expression nodes are straightforward to build:
Constant expression:
tree = new_exp_node(KIND_CONST);
if(tree != NULL)
tree->attr.val = atoi(token_string);
match(NUM);
return tree;
Identifier expression:
tree = new_exp_node(KIND_ID);
if(tree != NULL)
tree->attr.name = copy_string(token_string);
match(ID);
return tree;