Stanford Compilers PA1

twyc

Category: Computer Science

Article link: https://blog.csdn.net/yyyccww/article/details/106364967

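I am recording how I read the English documentation, mainly so that I can review it later, and in the hope that it helps you understand what the documents are saying and how the assignment is supposed to be done. If anything here is wrong or unclear, please point it out.

Each assignment builds one part of a compiler: lexical analysis, parsing, semantic analysis, code generation.

PA1 is to write a lexical analyzer for Cool, using flex (for C++) or JLex (for Java). The tool takes a rule file describing Cool's tokens and turns it into C++ or Java source code.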

Flex compiles your rule file (e.g., “lexer.l”) to C (or, if you are using JLex, Java) source code implementing a finite automaton recognizing the regular expressions that you specify in your rule file.

The lex documentation

The layout of a flex file

{definitions}
%%
{rules}
%%
{user subroutines}

The definitions section and the user subroutines section are both optional. The second %% may be omitted, but the first one is required.

%%

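is the smallest legal lex program; it copies the input to the output unchanged.

The main task of this assignment is writing the rules. The rules section is a table in which the left column contains regular expressions and the right column contains actions, program fragments to be executed when the expressions are recognized.

The following describes some of the operators used in regular expressions.

Inside a character class [...], only three characters are special: \ - and ^.

- indicates a range. To include a literal - in the class, put it first or last.

//If it is desired to include the character - in a character class, it should be first or last;

^ must come first in the class and denotes the complement of the characters that follow it.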

The operator ? indicates an optional element of an expression. Thus

                                    ab?c

matches either ac or abc.

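$ is special only as the last character of an expression, where it makes the expression match only at the end of a line.

/ indicates trailing context: ab/cd matches ab, but only when it is followed by cd.

<x> at the start of a rule means the rule is active only in start condition x.

{x}: if x is a name, lex replaces it with the definition of x given earlier in the definitions section.

If x is a number (or a pair of numbers), it gives a repetition count.

//a{1,5} looks for 1 to 5 occurrences of a.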
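Several of these operators can be tried outside lex. The sketch below uses C++ std::regex, whose default ECMAScript grammar happens to share ?, {m,n} and character classes with lex; operators such as / (trailing context) and <x> (start conditions) have no std::regex counterpart. The helper name matches_whole is made up for this illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true when the whole string s matches the pattern.
bool matches_whole(const std::string& pattern, const std::string& s) {
    return std::regex_match(s, std::regex(pattern));
}
```

For instance, matches_whole("ab?c", "ac") and matches_whole("ab?c", "abc") both hold, while matches_whole("a{1,5}", "aaaaaa") does not, because six repetitions exceed the upper bound.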

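The following describes some of the facilities available in the actions attached to expressions.

lex stores the matched text in the character array yytext, so these two rules are equivalent:

[a-z]+   printf("%s", yytext);
[a-z]+   ECHO;

The number of characters matched is available in yyleng.

Two routines help when a rule has not matched exactly the right span of characters: yymore() and yyless(n).

yymore() makes the next match get appended to the current contents of yytext instead of replacing them.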

Example: Consider a language which defines a string as a set of characters between quotation (") marks, and provides that to include a " in a string it must be preceded by a \. The regular expression which matches that is somewhat confusing, so that it might be preferable to write

                  \"[^"]*   {
                            if (yytext[yyleng-1] == '\\')
                                 yymore();
                            else
                                 ... normal user processing
                            }

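yyless(n) keeps only the first n characters of the current match in yytext and returns the remaining yyleng - n characters to the input, where they are scanned again as part of the next match.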
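To make the bookkeeping concrete, here is a toy model of how yytext changes under yymore() and yyless(n). It is only a sketch under my own assumptions: ToyScanner and its match() method are invented for illustration and are not how flex is implemented. yymore() makes the next match append to yytext, and yyless(n) keeps the first n characters and pushes the rest back.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Toy model of the yytext bookkeeping; not flex's real implementation.
struct ToyScanner {
    std::string input;      // the whole input
    std::size_t pos = 0;    // index of the next unread character
    std::string yytext;     // text of the current match
    bool more = false;      // set by yymore()

    // Pretend that some rule just matched the next n input characters.
    void match(std::size_t n) {
        if (!more) yytext.clear();   // normally a new match replaces yytext
        more = false;
        yytext += input.substr(pos, n);
        pos += n;
    }

    // The next match is appended to yytext instead of replacing it.
    void yymore() { more = true; }

    // Keep the first n characters; give the rest back to the input.
    void yyless(std::size_t n) {
        assert(n <= yytext.size());
        pos -= yytext.size() - n;
        yytext.resize(n);
    }
};
```

With input "abcdef", match(3) gives yytext "abc"; yyless(1) shrinks it to "a" and the next match starts again at b.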

1) input() which returns the next input character;

2) output(c) which writes the character c on the output; and

3) unput(c) pushes the character c back onto the input stream to be read later by input().

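yywrap() can be redefined. By default it returns 1 at EOF, which ends scanning. When the input continues in another source file, yywrap() can arrange for the new input and return 0 so that scanning continues.

Rule matching principles:

1. The longest match is preferred.

2. Among rules that match the same number of characters, the rule listed first is preferred.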
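The two principles can be sketched in plain C++. This is an illustration only: pick_rule is a made-up helper, and it leans on std::regex, which is greedy but not guaranteed to find the longest match for every pattern the way lex's automaton does. For simple patterns such as the ones below the result is the same.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Returns the index of the rule that wins at the start of `text`,
// or -1 if no rule matches there. The longest match wins; on a tie
// the rule listed first wins. Empty matches are ignored.
int pick_rule(const std::vector<std::string>& rules, const std::string& text) {
    int best = -1;
    std::size_t best_len = 0;
    for (std::size_t i = 0; i < rules.size(); ++i) {
        const std::regex re(rules[i]);
        std::smatch m;
        // match_continuous anchors the match at the start of text.
        if (std::regex_search(text, m, re,
                              std::regex_constants::match_continuous)) {
            const std::size_t len = static_cast<std::size_t>(m.length(0));
            if (len > best_len) {   // strictly longer, so earlier rules win ties
                best = static_cast<int>(i);
                best_len = len;
            }
        }
    }
    return best;
}
```

With the rules "integer" and "[a-z]+", the input "integers" selects the second rule (8 characters beat 7), while the input "integer" selects the first, since both match 7 characters.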

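Each character is normally consumed by exactly one rule. To have the same text matched more than once, use REJECT, which sends the same input on to the next-best matching rule.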

   she   {s++; REJECT;}
   he    {h++; REJECT;}
   \n    |
   .     ;

For example, these rules count every occurrence of "she" and of "he", including the he inside each she.
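The counts that the REJECT rules produce are the same as counting every occurrence of each word, overlaps included. A plain C++ sketch of that result (count_occurrences is a name made up here, and this is not how lex implements REJECT):

```cpp
#include <cstddef>
#include <string>

// Counts every occurrence of `word` in `text`, including overlapping ones.
int count_occurrences(const std::string& text, const std::string& word) {
    int n = 0;
    for (std::size_t pos = text.find(word); pos != std::string::npos;
         pos = text.find(word, pos + 1)) {
        ++n;
    }
    return n;
}
```

For the text "she sells he shells", the counter for she ends at 2 and the counter for he ends at 3, because each she also contains a he.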

With the lex documentation done, next comes Section 10, Lexical Structure, of the Cool manual.

10.1 Integers, Identifiers, and Special Notation

Integers are non-empty strings of digits 0-9.

Identifiers are strings (other than keywords) consisting of letters, digits, and the underscore character.

Type identifiers (class names) begin with a capital letter;

object identifiers (object names) begin with a lower case letter.

There are two other identifiers, self and SELF_TYPE that are treated specially by Cool but are not treated as keywords.

10.2 Strings

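Strings are enclosed in double quotes. Within a string, \c denotes the character c, with these exceptions:

\b backspace, \t tab, \n newline, \f formfeed

A string may continue onto the next line only if the newline is escaped with a \ at the end of the line.

A string may not contain EOF or the null character \0, and it cannot cross file boundaries.

10.3 Comments

A single-line comment starts with -- and runs to the end of the line. A block comment is enclosed in (* ... *). Block comments may be nested. Comments cannot cross file boundaries.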

10.4 Keywords

The keywords of cool are: class, else, false, fi, if, in, inherits, isvoid, let, loop, pool, then, while, case, esac, new, of, not, true. Except for the constants true and false, keywords are case insensitive. To conform to the rules for other objects, the first letter of true and false must be lowercase; the trailing letters may be upper or lower case.

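In other words, the first letter of true and false must be lowercase; all other keywords, and the remaining letters of true and false, are case insensitive.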
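The keyword and identifier rules can be summarized as a small classifier. This is a sketch of the rules in Sections 10.1 and 10.4, not code from the assignment; the names Kind, to_lower_copy and classify are made up, and the function assumes its argument is a non-empty word of letters, digits and underscores that starts with a letter.

```cpp
#include <algorithm>
#include <cctype>
#include <set>
#include <string>

enum class Kind { Keyword, BoolConst, TypeId, ObjectId };

// Returns a lowercase copy of s.
std::string to_lower_copy(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

// Classifies one word according to the Cool lexical rules.
Kind classify(const std::string& word) {
    static const std::set<std::string> keywords = {
        "class", "else", "fi", "if", "in", "inherits", "isvoid", "let", "loop",
        "pool", "then", "while", "case", "esac", "new", "of", "not"};
    const std::string lowered = to_lower_copy(word);
    const unsigned char first = static_cast<unsigned char>(word[0]);
    if (keywords.count(lowered) > 0) return Kind::Keyword;  // case insensitive
    // true and false are constants only when the first letter is lowercase.
    if ((lowered == "true" || lowered == "false") && std::islower(first)) {
        return Kind::BoolConst;
    }
    return std::isupper(first) ? Kind::TypeId : Kind::ObjectId;
}
```

Note that "tRUE" is a boolean constant while "True" is an ordinary type identifier, since it starts with a capital letter.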

10.5 White Space

White space consists of any sequence of the characters: blank (ascii 32), \n (newline, ascii 10), \f (form feed, ascii 12), \r (carriage return, ascii 13), \t (tab, ascii 9), \v (vertical tab, ascii 11).

Using BEGIN

To handle the same problem with start conditions, each start condition must be introduced to Lex in the definitions section with a line reading

                          %Start   name1 name2 ...

where the conditions may be named in any order. The word Start may be abbreviated to s or S. The conditions may be referenced at the head of a rule with the <> brackets:

                              <name1>expression

is a rule which is only recognized when Lex is in the start condition name1. To enter a start condition, execute the action statement

                                BEGIN name1;

which changes the start condition to name1. To resume the normal state,

                                  BEGIN 0;

resets the initial condition of the Lex automaton interpreter. A rule may be active in several start conditions: <name1,name2,name3> is a legal prefix. Any rule not beginning with the <> prefix operator is always active.

There is also a special default state called INITIAL which is active unless you explicitly indicate the beginning of a new state.
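A start condition behaves much like a state variable that decides which rules are active. The sketch below imitates that in plain C++ for one hypothetical condition, LineComment, which is entered on -- and left at the next newline. The names State and strip_line_comments are invented for this illustration, and unlike a real Cool lexer the sketch does not treat -- inside string literals specially.

```cpp
#include <cstddef>
#include <string>

enum class State { Initial, LineComment };  // hypothetical start conditions

// Returns `src` with "--" comments removed, keeping the newlines.
std::string strip_line_comments(const std::string& src) {
    State state = State::Initial;            // like the default INITIAL state
    std::string out;
    for (std::size_t i = 0; i < src.size(); ++i) {
        if (state == State::Initial) {
            if (src.compare(i, 2, "--") == 0) {
                state = State::LineComment;  // corresponds to BEGIN name1;
                ++i;                         // skip the second '-'
            } else {
                out += src[i];
            }
        } else if (src[i] == '\n') {
            state = State::Initial;          // corresponds to BEGIN 0;
            out += '\n';
        }
        // Any other character inside the comment is dropped.
    }
    return out;
}
```

Calling strip_line_comments("a -- c\nb") keeps "a \nb": the comment text disappears but the newline that ends it survives.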

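Below is my walk through an existing solution.

Start with the simplest case, single-line comments.

In the initial state, -- begins a single-line comment.

<INITIAL>"--" {BEGIN INLINE_COMMENTS;}

After the --, consume any characters other than newline.
<INLINE_COMMENTS>[^\n]* {}

When the single-line comment reaches a newline, count the line and return to the initial state.
<INLINE_COMMENTS>\n {
    curr_lineno++;
    BEGIN 0;
}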

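Next come the rules for multi-line comments.

/* Nested comments */

Whenever a comment opener appears, increment the nesting depth and enter the COMMENTS state.
<INITIAL,COMMENTS,INLINE_COMMENTS>"(*" {
    comment_layer++;
    BEGIN COMMENTS;
}

Text that contains none of the three characters newline, ( and * needs no handling.

<COMMENTS>[^\n(*]* { }

A lone ( or * needs no handling either. Longest match guarantees that when "(*" or "*)" appears, the rule that changes comment_layer is chosen instead.

<COMMENTS>[(*] { }

Decrement the nesting depth.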

<COMMENTS>"*)" {
comment_layer--;
if (comment_layer == 0) {
BEGIN 0;
}
}

The comment runs into the end of the file.

<COMMENTS><<EOF>> {
yylval.error_msg = "EOF in comment";
BEGIN 0;
return ERROR;
}

An unmatched closing delimiter.

"*)" {
yylval.error_msg = "Unmatched *)";
return ERROR;
}
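The depth counting done by the comment rules can be checked outside flex. The function below is a sketch under my own naming (skip_nested_comment is not part of the assignment); its local variable depth plays the role of comment_layer, and a return value of std::string::npos corresponds to the "EOF in comment" error.

```cpp
#include <cstddef>
#include <string>

// Assumes src begins a comment with "(*" at index `start`.
// Returns the index just past the matching "*)", or std::string::npos
// if the input ends before the comment is closed.
std::size_t skip_nested_comment(const std::string& src, std::size_t start) {
    int depth = 0;                           // plays the role of comment_layer
    std::size_t i = start;
    while (i < src.size()) {
        if (src.compare(i, 2, "(*") == 0) {
            ++depth;                         // a nested comment opens
            i += 2;
        } else if (src.compare(i, 2, "*)") == 0) {
            --depth;                         // one level closes
            i += 2;
            if (depth == 0) return i;        // outermost comment finished
        } else {
            ++i;                             // ordinary comment text
        }
    }
    return std::string::npos;                // ran out of input
}
```

Note that "(*)" is not a complete comment: the * is consumed as part of the opener, so the following ) does not close anything.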

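Next, the keywords.

(?i:...) makes the enclosed pattern case insensitive. The other keywords follow the same form and are not shown.

/* NOT */
(?i:not) { return NOT; }

Constants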

/* INT_CONST */
{DIGIT}+ {
cool_yylval.symbol = inttable.add_string(yytext);
return INT_CONST;
}

The first letter of true and false must be lowercase; the remaining letters are case insensitive.

/* BOOL_CONST */
t(?i:rue) {
cool_yylval.boolean = 1;
return BOOL_CONST;
}

f(?i:alse) {
cool_yylval.boolean = 0;
return BOOL_CONST;
}

Ignore white space in the code.

/* White Space */
[ \f\r\t\v]+ { }

Class names (type identifiers)

/* TYPEID */
[A-Z][A-Za-z0-9_]* {
cool_yylval.symbol = idtable.add_string(yytext);
return TYPEID;
}

Update the line number.

/* To treat lines. */
"\n" {
curr_lineno++;
}

Object names (object identifiers)

/* OBJECTID */
[a-z][A-Za-z0-9_]* {
cool_yylval.symbol = idtable.add_string(yytext);
return OBJECTID;
}

Operators

/* ASSIGN */
"<-" { return ASSIGN; }

/* LE */
"<=" { return LE; }

/* DARROW */
"=>" { return DARROW; }

"+" { return int('+'); }

Finally, the biggest piece: strings.

A double quote begins the contents of a string.

<INITIAL>(\") {
BEGIN STRING;
yymore();
}

Ordinary string content is anything other than these three characters: backslash, double quote and newline.

/* Cannot read '\\' '\"' '\n' */
<STRING>[^\\\"\n]* { yymore(); }

A backslash followed by any character other than newline matches this rule.

/* normal escape characters, not \n */
<STRING>\\[^\n] { yymore(); }

A backslash followed by a newline continues the string onto the next line.

/* seen a '\\' at the end of a line, the string continues */
<STRING>\\\n {
curr_lineno++;
yymore();
}

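Note the yyrestart(yyin) call here. It reinitializes the scanner's input buffer for yyin so that the scanner can safely be called again after it has hit end of file; it does not rewind the file to the beginning.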

/* meet EOF in the middle of a string, error */
<STRING><<EOF>> {
yylval.error_msg = "EOF in string constant";
BEGIN 0;
yyrestart(yyin);
return ERROR;
}

/* meet a '\n' in the middle of a string without a '\\', error */
<STRING>\n {
yylval.error_msg = "Unterminated string constant";
BEGIN 0;
curr_lineno++;
return ERROR;
}

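/* A backslash followed by the digit 0. This rule is shadowed by the earlier <STRING>\\[^\n] rule (same match length, listed first), so it never fires. In Cool, \0 inside a string simply denotes the character 0; real null characters are checked when the string ends. */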
<STRING>\\0 {
yylval.error_msg = "Unterminated string constant";
BEGIN 0;
//curr_lineno++;
return ERROR;
}

End of the string

/* string ends, we need to deal with some escape characters */
<STRING>\" {
    std::string input(yytext, yyleng);

    // remove the '\"'s on both sides.
    input = input.substr(1, input.length() - 2);

    std::string output = "";
    std::string::size_type pos;

    // std::string::npos is a constant equal to the largest value that
    // size_type can represent; it stands for "no such position".
    if (input.find_first_of('\0') != std::string::npos) {
        yylval.error_msg = "String contains null character";
        BEGIN 0;
        return ERROR;
    }

    // Copy the text before each backslash, then translate the escape after it.
    while ((pos = input.find_first_of("\\")) != std::string::npos) {
        output += input.substr(0, pos);

        switch (input[pos + 1]) {
        case 'b':
            output += "\b";
            break;
        case 't':
            output += "\t";
            break;
        case 'n':
            output += "\n";
            break;
        case 'f':
            output += "\f";
            break;
        default:
            output += input[pos + 1];
            break;
        }

        // Drop the backslash and the escaped character; keep the rest.
        input = input.substr(pos + 2);
    }

    output += input;

    if (output.length() > 1024) {
        yylval.error_msg = "String constant too long";
        BEGIN 0;
        return ERROR;
    }

    cool_yylval.symbol = stringtable.add_string((char*)output.c_str());
    BEGIN 0;
    return STR_CONST;
}
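The escape handling is easier to test when it is pulled out of the flex action. The function below is a sketch of the same translation as a standalone single-pass loop, a different formulation from the find-and-substr loop in the rule; unescape_cool is a name made up here. It assumes the argument is the text between the quotes.

```cpp
#include <cstddef>
#include <string>

// Translates Cool escape sequences in the text between the quotes.
// A trailing lone backslash, which the <STRING> rules never produce,
// is copied through unchanged.
std::string unescape_cool(const std::string& body) {
    std::string out;
    for (std::size_t i = 0; i < body.size(); ++i) {
        if (body[i] != '\\' || i + 1 == body.size()) {
            out += body[i];                  // ordinary character
            continue;
        }
        const char c = body[++i];            // character after the backslash
        switch (c) {
        case 'b': out += '\b'; break;
        case 't': out += '\t'; break;
        case 'n': out += '\n'; break;
        case 'f': out += '\f'; break;
        default:  out += c;    break;        // \c is just c
        }
    }
    return out;
}
```

For example, the four source characters a \ n b become the three characters a, newline, b, while \c becomes a plain c.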
