[Compiler Principles] Lexical Analysis_Compiler_Week 2_Stanford University

光明磊磊

Article link: https://blog.csdn.net/AlvinHuntley/article/details/88658774


Week 2.1: Lexical Analysis

Token Class (or class)¹

In English:
Noun, verb, adjective, …
In a programming language:
Identifier, keywords, ‘(’, ‘)’, numbers, …

Token classes correspond to sets of strings

Identifier: strings of letters or digits, starting with a letter
Integer: a non-empty string of digits
Keyword: “else” or “if” or “begin” or …
Whitespace: a non-empty sequence of blanks, newlines, and tabs
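
The four token classes above can be written down as string sets with Python's `re` module. This is only a sketch: the table name `TOKEN_CLASSES`, the helper `classes_of`, and the three-word keyword list are assumptions made for illustration, not part of the course material.

```python
import re

# Each token class is a set of strings; here each set is described by a regex.
# The class names and the tiny keyword list are illustrative assumptions.
TOKEN_CLASSES = [
    ("Keyword",    r"else|if|begin"),
    ("Identifier", r"[A-Za-z][A-Za-z0-9]*"),
    ("Integer",    r"[0-9]+"),
    ("Whitespace", r"[ \n\t]+"),
]

def classes_of(s):
    """Return the names of all token classes whose string set contains s."""
    return [name for name, pattern in TOKEN_CLASSES if re.fullmatch(pattern, s)]

assert classes_of("x1") == ["Identifier"]
assert classes_of("42") == ["Integer"]
# A keyword also has the shape of an identifier, so it belongs to two sets.
assert classes_of("if") == ["Keyword", "Identifier"]
```

Note that "if" falls into two classes at once, so a real lexer needs a rule for choosing between them.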

Classify program substrings according to role

Communicate tokens to the parser

Lexical Analysis Example

The goal is to partition the string. This is done by reading left to right, recognizing one token at a time.
“Lookahead” may be required to decide where one token ends and the next token begins.
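
A small sketch of why lookahead is needed, using the pair `=` and `==` as an assumed example (it does not appear in these notes). The function name `scan_equals` and the token class names `Assign` and `Relation` are hypothetical.

```python
def scan_equals(text, pos):
    """Scan the token starting at text[pos], which must be '='.

    Returns (token_class, lexeme). One character of lookahead decides
    whether the token is '=' or '=='.
    """
    if text[pos] != "=":
        raise ValueError("expected '=' at position %d" % pos)
    # Peek at the next character without consuming it.
    nxt = text[pos + 1] if pos + 1 < len(text) else ""
    if nxt == "=":
        return ("Relation", "==")
    return ("Assign", "=")

assert scan_equals("i==j", 1) == ("Relation", "==")
assert scan_equals("i=j", 1) == ("Assign", "=")
```

After reading the first `=`, the scanner cannot know where the token ends until it has looked at the following character.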

PL/1² keywords are not reserved

Example I

Example II

Example III

Regular Languages

Lexical structure = token classes

We must say what set of strings is in each token class

Use regular languages

Regular Expressions

Single character: $'c' = \left\{ "c" \right\}$
Epsilon (the empty string, not the empty set): $\epsilon = \left\{ "" \right\}$
Union: $A + B = \left\{ a \mid a \in A \right\} \cup \left\{ b \mid b \in B \right\}$
Concatenation: $AB = \left\{ ab \mid a \in A \land b \in B \right\}$
Iteration: $A^{*} = \bigcup_{i\ge0}A^{i}$, where $A^{i} = \underbrace{A \cdots A}_{i\;times}$ and $A^{0} = \epsilon$

Example: $\Sigma = \left\{ 0, 1\right\}$

$1^{*} = \bigcup_{i\ge0}1^{i} = "" + 1 + 11 + 111 + 1111 + \cdots = all\;strings\;of\;1s$
$(1+0)1 = \left\{ ab \mid a \in 1+0 \land b\in1 \right\} = \left\{ 11, 01 \right\}$
$0^{*} + 1^{*} = \left\{ 0^{i} \mid i\ge0\right\} \cup \left\{ 1^{i} \mid i\ge0 \right\}$
$(0+1)^{*} = \bigcup_{i\ge0}(0 + 1)^{i} = "", 0+1, (0+1)(0+1),\cdots, \underbrace{(0+1)\cdots(0+1)}_{i\;times},\cdots = all\;strings\;of\;0s\;and\;1s = \Sigma^{*}$
There is more than one way to write down the same set, for example $1^{*} = 1^{*}+1$ and $(1 + 0) 1 = 11 + 01$.
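
The set definitions above can be checked mechanically if the languages are cut off at a fixed string length. The sketch below does that with Python sets. The bound `MAX_LEN` and the helper names `single`, `union`, `concat`, and `star` are assumptions introduced here, and truncating an infinite language is only an approximation of the real definition.

```python
from itertools import product

MAX_LEN = 4  # only strings up to this length are enumerated (an assumption)

EPSILON = {""}  # the language containing just the empty string

def single(c):
    return {c}

def union(a, b):
    return a | b

def concat(a, b):
    return {x + y for x, y in product(a, b) if len(x + y) <= MAX_LEN}

def star(a):
    """A* restricted to strings of length <= MAX_LEN."""
    result = set(EPSILON)      # A^0
    frontier = set(EPSILON)
    while frontier:
        # Strings of A^(i+1) that have not been seen yet.
        frontier = concat(frontier, a) - result
        result |= frontier
    return result

one, zero = single("1"), single("0")
assert star(one) == {"", "1", "11", "111", "1111"}
assert concat(union(one, zero), one) == {"11", "01"}
# Two different expressions, one set:
assert star(one) == union(star(one), one)
assert concat(union(one, zero), one) == union(concat(one, one), concat(zero, one))
```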

Regular Languages Summary

Regular expressions specify regular languages.

Five constructs

— Two base cases

empty and 1-character strings

— Three compound expressions

union, concatenation, iteration

Formal Languages

Definition

A formal language is a set of finite-length strings over an alphabet.
Def. Let $\Sigma$ be a set of characters (an alphabet). A language over $\Sigma$ is a set of strings of characters drawn from $\Sigma$.

Comparison

— In English

Alphabet = English characters
Language = English sentences

— In rigorous formal languages

Alphabet = ASCII
Language = C programs

The meaning function L maps syntax to semantics.
Example of meaning function

Why use a meaning function?

It makes clear what is syntax and what is semantics.
It allows us to consider notation as a separate issue. For example, Arabic numerals are computationally more convenient than Roman numerals.
Expressions and meanings are not 1-1.

Meaning is many to one

Meaning is many to one, never one to many.
That is, many expressions may correspond to one meaning. If one expression had several meanings, it would be ambiguous.
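
The numeral comparison gives a compact illustration of a many-to-one meaning function. The sketch below maps two notations onto the same integers. The function names `meaning_roman` and `meaning_arabic` are made up for this example, and the Roman parser assumes well-formed input.

```python
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

def meaning_roman(numeral):
    """Map a Roman numeral (syntax) to the integer it denotes (semantics)."""
    total = 0
    for i, ch in enumerate(numeral):
        value = ROMAN_VALUES[ch]
        # A smaller symbol written before a larger one is subtracted (IV = 4).
        if i + 1 < len(numeral) and value < ROMAN_VALUES[numeral[i + 1]]:
            total -= value
        else:
            total += value
    return total

def meaning_arabic(numeral):
    """Map a decimal digit string (syntax) to the integer it denotes."""
    return int(numeral)

# Several expressions, one meaning:
assert meaning_arabic("4") == meaning_arabic("04") == meaning_roman("IV") == 4
assert meaning_roman("MCMXCIV") == meaning_arabic("1994")
```

Each function returns exactly one value per input string, so no expression has two meanings, while several strings share the meaning 4.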

Lexical Specification

Keyword: “if” or “else” or “then” or …

$'i'\,'f' + 'e'\,'l'\,'s'\,'e' + 't'\,'h'\,'e'\,'n' + \cdots$
abbreviated as $'if' + 'else' + 'then' + \cdots$

Integer: a non-empty string of digits

$digit = '0' + '1' + '2' + '3' + '4' + '5' + '6' + '7' + '8' + '9'$
$digit\;digit^{*}$, abbreviated as $digit^{+}$

Identifier: strings of letters or digits, starting with a letter

$letter = [a-zA-Z]$
$letter(letter + digit)^{*}$

Whitespace: a non-empty sequence of blanks, newlines, and tabs

$('\;'+'\backslash n' + '\backslash t')^{+}$

Examples

Example I: anyone@cs.stanford.edu

$letter^{+}\,'@'\,letter^{+}\,'.'\,letter^{+}\,'.'\,letter^{+}$

Example II: Pascal

$digit = '0' + '1' + '2' + '3' + '4' + '5' + '6' + '7' + '8' + '9'$
$digits = digit^{+}$
$opt\_fraction = ('.'\,digits)+\epsilon = ('.'\,digits)?$
$opt\_exponent = ('E'\,('+' + '-' + \epsilon)\,digits) + \epsilon = ('E'\,('+' + '-' )?\,digits)?$
$num = digits\;opt\_fraction\;opt\_exponent$
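
As a rough check, the definitions of Example II translate almost symbol for symbol into Python's `re` syntax. The constant names mirror the notes; the helper `is_num` is an assumption added for the demonstration.

```python
import re

DIGIT = r"[0-9]"
DIGITS = DIGIT + r"+"                          # digits = digit+
OPT_FRACTION = r"(?:\." + DIGITS + r")?"       # ('.' digits)?
OPT_EXPONENT = r"(?:E[+-]?" + DIGITS + r")?"   # ('E' ('+' + '-')? digits)?
NUM = re.compile(DIGITS + OPT_FRACTION + OPT_EXPONENT)

def is_num(s):
    """True if the whole string s is in the language of num."""
    return NUM.fullmatch(s) is not None

assert is_num("42") and is_num("3.14") and is_num("6.02E+23")
# A fraction needs digits on both sides of the point, and 'E' needs digits after it.
assert not is_num("1.") and not is_num(".5") and not is_num("1E")
```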

Regular expressions describe many useful languages

email
phone numbers
file names

Regular languages are a language specification

We still need an implementation

Lexical Specification Summary

At least one: $A^{+} \equiv AA^{*}$
Union: $A \mid B \equiv A + B$
Option: $A? \equiv A + \epsilon$
Range: $'a'+'b'+\cdots+'z' \equiv [a-z]$
Excluded range: $complement\;of\;[a-z] \equiv [\,\hat{}\, a-z]$

Steps of a Lexical Specification

1. Write a rexp for the lexemes of each token class

Number = $digit^{+}$
Keyword = $'if'+'else'+\cdots$
Identifier = $letter(letter+digit)^{*}$
OpenPar = ‘(’
$\cdots$

2. Construct R, matching all lexemes for all tokens
$R = Keyword + Identifier + Number + \cdots$
$\;\;\;\,=R_1+R_2+\cdots$
3. Let the input be $x_{1}...x_{n}$
For $1\leq i \leq n$ check
$\;\;\;\;\;\;\;\;\;\;x_1...x_i \in L(R)$
4. If success, then we know that
$x_1...x_i \in L(R_j)\;for\;some\;j$
5. Remove $x_1...x_i$ from the input and go to 3
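
The loop described above can be sketched directly in Python. This is a deliberately naive version that tests prefixes one by one. The rule table, the name `tokenize`, and the choices to try the longest prefix first and to take the first listed rule are assumptions of this sketch. A production lexer would use a finite automaton instead.

```python
import re

# One regex per token class (names and rules are illustrative).
RULES = [
    ("Keyword",    r"if|else"),
    ("Identifier", r"[A-Za-z][A-Za-z0-9]*"),
    ("Number",     r"[0-9]+"),
    ("OpenPar",    r"\("),
    ("ClosePar",   r"\)"),
    ("Whitespace", r"[ \n\t]+"),
]

def tokenize(text):
    tokens = []
    while text:
        # Test every prefix x_1...x_i against R = R_1 + R_2 + ..., longest first.
        for i in range(len(text), 0, -1):
            prefix = text[:i]
            # Find some R_j with prefix in L(R_j); the first listed rule wins.
            name = next((n for n, p in RULES if re.fullmatch(p, prefix)), None)
            if name is not None:
                tokens.append((name, prefix))
                text = text[i:]   # remove the lexeme and start over
                break
        else:
            raise ValueError("no rule matches at %r" % text[:10])
    return tokens

assert tokenize("if (x1) 42") == [
    ("Keyword", "if"), ("Whitespace", " "), ("OpenPar", "("),
    ("Identifier", "x1"), ("ClosePar", ")"), ("Whitespace", " "),
    ("Number", "42"),
]
```

The sketch simply raises an exception when no prefix matches, which a real compiler should not do.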

FAQs (frequently asked questions)

I: How much input is used?

$x_1...x_i \in L(R)$
$x_1...x_j \in L(R)$
$i \neq j$
Answer: "Maximal munch". When prefixes of different lengths both match, take the longest one.

II: Which token is used?

$x_1...x_i \in L(R)\;\;\;\;R = R_1+\cdots+R_N$
$x_1...x_i \in L(R_j)$
$x_1...x_i \in L(R_k)$

For example, $'if' \in \begin{cases} L(Keywords) & &Keywords = 'if' + 'else' + \cdots\\L(Identifiers)& &Identifiers = letter(letter + digit)^{*}\end{cases}$
Answer: Choose the one listed first

III: What if no rule matches?

$x_1...x_i \notin L(R)$

Answer: Compilers must handle errors well; they cannot simply crash. For lexical analysis, the best solution is to make sure this situation never arises. Write a rule for a category of error strings and put it last in priority. It then matches only if no earlier regular expression matches, so it catches only the error strings.
$ERROR\;=\;all\;strings\;not\;in\;the\;lexical\;specification\;of\;the\;language$
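
A minimal sketch of the three answers working together: longest match, first-listed priority, and a lowest-priority error rule. The rule table and the function name `tokenize` are assumptions. Letting the ERROR rule match one arbitrary character is a simplification standing in for "all strings not in the lexical specification".

```python
import re

# Rules in priority order; the ERROR rule is listed last.
# Matching one arbitrary character is a simplifying assumption.
RULES = [
    ("Keyword",    re.compile(r"if|else")),
    ("Identifier", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("Number",     re.compile(r"[0-9]+")),
    ("Whitespace", re.compile(r"[ \n\t]+")),
    ("ERROR",      re.compile(r".", re.DOTALL)),
]

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # A strictly longer match wins; on a tie the earlier rule is kept.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        # ERROR always matches one character, so best_len is at least 1.
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

assert tokenize("if") == [("Keyword", "if")]        # tie: first listed wins
assert tokenize("ifx") == [("Identifier", "ifx")]   # maximal munch
assert tokenize("x = 1") == [
    ("Identifier", "x"), ("Whitespace", " "), ("ERROR", "="),
    ("Whitespace", " "), ("Number", "1"),
]
```

Because the error rule always matches something, the loop never gets stuck and the lexer never has to abort. Bad input simply shows up as ERROR tokens.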

Key Points

Regular expressions are a concise notation for string patterns.

Use in lexical analysis requires small extensions

To resolve ambiguities
- match as long as possible
- take the highest-priority match
  This is done simply by listing the rules in order in a file. Rules listed first have priority over rules listed later.
To handle errors
Write a catch-all regular expression that soaks up all possible erroneous strings, and give it the lowest priority so that it triggers only if no valid token class matches some piece of the input.

Note: My English is limited. If you find any mistakes, please point them out. Thank you!

¹ https://baike.baidu.com/item/TOKEN/2615248
² https://baike.baidu.com/item/PL%2FI
