Week 2.1: Lexical Analysis
Token Class (or simply "class")
- In English: noun, verb, adjective, …
- In a programming language: identifier, keyword, ‘(’, ‘)’, number, …
Token classes correspond to sets of strings
- Identifier: strings of letters or digits, starting with a letter
- Integer: a non-empty string of digits
- Keyword: “else” or “if” or “begin” or …
- Whitespace: a non-empty sequence of blanks, newlines, and tabs
Classify program substrings according to their role (their token class)
Communicate the tokens to the parser
Lexical Analysis Example
- The goal is to partition the string. This is implemented by reading left to right, recognizing one token at a time.
- “Lookahead” may be required to decide where one token ends and the next token begins.
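As a rough illustration of lookahead, the sketch below decides between a one-character and a two-character token by peeking at the next character without consuming it. The function name `scan_equals` and the class names `Assign` and `Equals` are hypothetical, chosen only for this example.

```python
def scan_equals(text, pos):
    """Scan the token that starts at text[pos], which must be '='.

    Returns a (token_class, lexeme) pair. The class names are illustrative.
    """
    if text[pos] != "=":
        raise ValueError("scan_equals expects '=' at the given position")
    # Lookahead: inspect the next character without consuming it.
    next_char = text[pos + 1] if pos + 1 < len(text) else ""
    if next_char == "=":
        return ("Equals", "==")   # the token is two characters long
    return ("Assign", "=")        # the token ends after one character


print(scan_equals("x = 1", 2))
print(scan_equals("x == 1", 2))
```

Without the peek at `text[pos + 1]`, the scanner could not tell whether the first `=` is a complete token or only the beginning of `==`.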
PL/1 keywords are not reserved
Example I
Example II
Example III
Regular Languages
Lexical structure = token classes
We must say which set of strings belongs to each token class
- Use regular languages
Regular Expressions
- Single character: $'c' = \{ "c" \}$
- Epsilon (the language containing only the empty string, which is not the empty set): $\varepsilon = \{ "" \}$
- Union: $A + B = \{ a \mid a \in A \} \cup \{ b \mid b \in B \}$
- Concatenation: $AB = \{ ab \mid a \in A \land b \in B \}$
- Iteration: $A^{*} = \bigcup_{i \ge 0} A^{i}$, where $A^{i} = \underbrace{A \cdots A}_{i\ \text{times}}$ and $A^{0} = \varepsilon$
Example: $\Sigma = \{ 0, 1 \}$
- $1^{*} = \bigcup_{i \ge 0} 1^{i} = "" + 1 + 11 + 111 + 1111 + \cdots$ = all strings of 1s
- $(1+0)1 = \{ ab \mid a \in 1+0 \land b \in 1 \} = \{ 11, 01 \}$
- $0^{*} + 1^{*} = \{ 0^{i} \mid i \ge 0 \} \cup \{ 1^{i} \mid i \ge 0 \}$
- $(0+1)^{*} = \bigcup_{i \ge 0} (0+1)^{i} = "" + (0+1) + (0+1)(0+1) + \cdots + \underbrace{(0+1) \cdots (0+1)}_{i\ \text{times}} + \cdots$ = all strings of 0s and 1s
There is more than one way to write down the same set. For example, $1^{*} = 1^{*} + 1$ and $(1+0)1 = 11 + 01$.
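The set definitions above can be tried out directly on small languages. The following is a minimal sketch that represents a language as a Python set of strings; the helper names `union`, `concat`, and `star` are made up for this example, and `star` only approximates iteration by stopping after a fixed number of repetitions, since the real $A^{*}$ is usually infinite.

```python
from itertools import product


def union(a, b):
    """A + B: every string that is in A or in B."""
    return a | b


def concat(a, b):
    """AB: every string ab with a in A and b in B."""
    return {x + y for x, y in product(a, b)}


def star(a, max_reps):
    """Finite approximation of A*: the union of A^0 through A^max_reps."""
    result = {""}   # A^0 contains only the empty string
    power = {""}
    for _ in range(max_reps):
        power = concat(power, a)   # A^(i+1) = A^i A
        result |= power
    return result


ZERO, ONE = {"0"}, {"1"}

print(sorted(concat(union(ONE, ZERO), ONE)))      # ['01', '11']
print(sorted(star(ONE, 3), key=len))              # ['', '1', '11', '111']
print(union(star(ONE, 3), ONE) == star(ONE, 3))   # True
```

The last line mirrors the observation that $1^{*}$ and $1^{*} + 1$ denote the same set: adding a string that is already present changes nothing.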
Summary of Regular Languages
Regular expressions specify regular languages.
Five constructs
- Two base cases: the empty string and single-character strings
- Three compound expressions: union, concatenation, iteration
Formal Languages
Definition
A formal language is a set of finite-length strings over an alphabet.
Def. Let $\Sigma$ be a set of characters (an alphabet). A language over $\Sigma$ is a set of strings of characters drawn from $\Sigma$.
Comparison
— In English
- Alphabet = English characters
- Language = English sentences
— In rigorous formal languages
- Alphabet = ASCII
- Language = C programs
The meaning function L maps syntax to semantics
Why use a meaning function?
- It makes clear what is syntax and what is semantics.
- It allows us to consider notation as a separate issue. For example, Arabic numerals are computationally more convenient than Roman numerals.
- Expressions and meanings are not 1-1.
Meaning is many-to-one
Meaning is many-to-one, never one-to-many.
That is, many expressions may correspond to one meaning, but a single expression never has several meanings; otherwise the language would be ambiguous.
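For regular expressions, the meaning function can be written out explicitly. The block below is a sketch of the usual definition, using the same five constructs listed earlier; it separates the expression on the left (syntax) from the set of strings on the right (semantics).

```latex
\begin{aligned}
L(\varepsilon) &= \{ "" \} \\
L('c')         &= \{ "c" \} \\
L(A + B)       &= L(A) \cup L(B) \\
L(AB)          &= \{ ab \mid a \in L(A) \land b \in L(B) \} \\
L(A^{*})       &= \bigcup_{i \ge 0} L(A^{i})
\end{aligned}
```

Under this definition, $L(1^{*}) = L(1^{*} + 1)$: two different expressions map to the same set, which is exactly the many-to-one behaviour described above.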
Lexical Specification
Keyword: “if” or “else” or “then” or …
$'i''f' + 'e''l''s''e' + 't''h''e''n'$
which is abbreviated as
$'if' + 'else' + 'then' + \cdots$
Integer: a non-empty string of digits
$digit = '0' + '1' + '2' + '3' + '4' + '5' + '6' + '7' + '8' + '9'$
$digit\;digit^{*}$, or equivalently $digit^{+}$
Identifier: strings of letters or digits, starting with a letter
$letter = [a-zA-Z]$
$letter(letter + digit)^{*}$
Whitespace: a non-empty sequence of blanks, newlines, and tabs
$('\ ' + '\backslash n' + '\backslash t')^{+}$
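The four specifications above translate almost directly into Python's `re` syntax, where union is written `|` and a character class such as `[0-9]` abbreviates a long union. This is only a sketch: the keyword list is limited to the three keywords named in the text, and the names `TOKEN_CLASSES` and `classes_of` are invented for the example.

```python
import re

# Python `re` versions of the token-class specifications.
# Only the three keywords named in the notes are included (an assumption).
TOKEN_CLASSES = {
    "Keyword": r"if|else|then",
    "Integer": r"[0-9]+",                    # digit+
    "Identifier": r"[a-zA-Z][a-zA-Z0-9]*",   # letter (letter + digit)*
    "Whitespace": r"[ \n\t]+",
}


def classes_of(lexeme):
    """Return every token class whose language contains the whole lexeme."""
    return [name for name, pattern in TOKEN_CLASSES.items()
            if re.fullmatch(pattern, lexeme)]


print(classes_of("x1"))     # ['Identifier']
print(classes_of("42"))     # ['Integer']
print(classes_of(" \n\t"))  # ['Whitespace']
print(classes_of("if"))     # ['Keyword', 'Identifier']
```

Note that `if` belongs to two classes at once, so a specification needs an extra rule to decide which class wins.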
Examples
Example I: anyone@cs.stanford.edu
$letter^{+}\,'@'\,letter^{+}\,'.'\,letter^{+}\,'.'\,letter^{+}$
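The email pattern above can be checked with Python's `re` module. This sketch follows the simplified expression literally (letters only, exactly two dots after the `@`), so it is far stricter than real email syntax; `is_email` is a name made up for the example.

```python
import re

LETTER = r"[a-zA-Z]"
# letter+ '@' letter+ '.' letter+ '.' letter+
EMAIL = rf"{LETTER}+@{LETTER}+\.{LETTER}+\.{LETTER}+"


def is_email(text):
    """True if the whole string is in the language of the simplified pattern."""
    return re.fullmatch(EMAIL, text) is not None


print(is_email("anyone@cs.stanford.edu"))  # True
print(is_email("anyone@stanford.edu"))     # False: only one dot after '@'
print(is_email("any1@cs.stanford.edu"))    # False: digits are not letters
```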
Example II: numbers in Pascal
$digit = '0' + '1' + '2' + '3' + '4' + '5' + '6' + '7' + '8' + '9'$
$digits = digit^{+}$
$opt\_fraction = ('.'\,digits) + \varepsilon = ('.'\,digits)?$
$opt\_exponent = ('E'\,('+' + '-' + \varepsilon)\,digits) + \varepsilon = ('E'\,('+' + '-')?\,digits)?$
$num = digits\;opt\_fraction\;opt\_exponent$
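The number specification composes naturally from named pieces. Below is a sketch of the same definitions as Python `re` patterns built by string concatenation; the variable names mirror the ones above, and `is_num` is a helper invented for the example.

```python
import re

DIGITS = r"[0-9]+"                       # digits = digit+
OPT_FRACTION = rf"(\.{DIGITS})?"         # ('.' digits)?
OPT_EXPONENT = rf"(E[+-]?{DIGITS})?"     # ('E' ('+' + '-')? digits)?
NUM = DIGITS + OPT_FRACTION + OPT_EXPONENT


def is_num(text):
    """True if the whole string matches digits opt_fraction opt_exponent."""
    return re.fullmatch(NUM, text) is not None


for candidate in ["42", "3.14", "6.02E+23", "1E-9", ".5", "1.", "1e5"]:
    print(candidate, is_num(candidate))
```

The last three candidates are rejected: the specification requires digits before the dot, digits after it, and an uppercase `E`.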
Regular expressions describe many useful languages
- phone numbers
- file names
Regular languages are a language specification
- We still need an implementation
Lexical Specification Summary
- At least one: $A^{+} \equiv AA^{*}$
- Union: $A \mid B \equiv A + B$
- Option: $A? \equiv A + \varepsilon$
- Range: $'a' + 'b' + \cdots + 'z' \equiv [a-z]$
- Excluded range: complement of $[a-z] \equiv [\,\hat{}\,a-z]$
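Python's `re` module supports the same shorthand, so the equivalences can be spot-checked on a handful of strings. This is a sanity check on samples, not a proof of equivalence; `SAMPLES` and `same_language` are names invented for the example.

```python
import re

SAMPLES = ["", "a", "aa", "b", "ab", "z", "A", "7"]


def same_language(p, q, samples=SAMPLES):
    """True if p and q accept exactly the same sample strings (a spot check only)."""
    return all(bool(re.fullmatch(p, s)) == bool(re.fullmatch(q, s))
               for s in samples)


assert same_language(r"a+", r"aa*")          # A+  is  A A*
assert same_language(r"a?", r"(?:a|)")       # A?  is  A + epsilon
assert same_language(r"[a-c]", r"a|b|c")     # a range abbreviates a union
assert not same_language(r"a+", r"a*")       # they differ on the empty string

# Excluded range: any single character that is not a lowercase letter.
assert re.fullmatch(r"[^a-z]", "A")
assert re.fullmatch(r"[^a-z]", "q") is None
```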
Lexical Specification Steps
1. Write a regular expression for the lexemes of each token class
   - Number = $digit^{+}$
   - Keyword = $'if' + 'else' + \cdots$
   - Identifier = $letter(letter+digit)^{*}$
   - OpenPar = ‘(’
   - $\cdots$
2. Construct $R$, matching all lexemes for all tokens:
   $R = Keyword + Identifier + Number + \cdots = R_1 + R_2 + \cdots$
3. Let the input be $x_1 \ldots x_n$. For $1 \le i \le n$, check whether $x_1 \ldots x_i \in L(R)$
4. If the check succeeds, then we know that $x_1 \ldots x_i \in L(R_j)$ for some $j$
5. Remove $x_1 \ldots x_i$ from the input and go to step 3
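The steps above can be sketched as a deliberately naive loop that tests every prefix of the remaining input. The steps leave two choices open (which matching prefix to take, and which rule to report); this sketch assumes the longest prefix and the first listed rule. The rule set, the class names, and the function `tokenize` are illustrative, not a definitive implementation.

```python
import re

# Ordered rules R_1, R_2, ... (an illustrative rule set).
RULES = [
    ("Keyword", r"if|else|then"),
    ("Identifier", r"[a-zA-Z][a-zA-Z0-9]*"),
    ("Number", r"[0-9]+"),
    ("OpenPar", r"\("),
    ("ClosePar", r"\)"),
    ("Whitespace", r"[ \n\t]+"),
]


def tokenize(text):
    """Split text into (token_class, lexeme) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        best = None
        # Step 3: test each prefix x_1 ... x_i of the remaining input.
        for i in range(1, len(text) - pos + 1):
            prefix = text[pos:pos + i]
            for name, pattern in RULES:
                if re.fullmatch(pattern, prefix):
                    best = (name, prefix)  # a longer prefix replaces a shorter one
                    break                  # the first listed rule wins for this prefix
        if best is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append(best)                # Step 4: the prefix is in L(R_j)
        pos += len(best[1])                # Step 5: remove the prefix and repeat
    return tokens


print(tokenize("if (x1) then 42"))
```

Testing every prefix against every rule is quadratic per token; a practical lexer would compile the rules into an automaton instead, but the naive loop keeps the correspondence with the numbered steps visible.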
FAQs (frequently asked questions)
I: How much input is used?
Suppose two different prefixes both match:
$x_1 \ldots x_i \in L(R)$
$x_1 \ldots x_j \in L(R)$
$i \neq j$
Answer: "maximal munch". Take the longest matching prefix.
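To see why the question arises, the sketch below lists every prefix length $i$ for which the prefix is in $L(R)$ and then applies maximal munch. The combined pattern `R` and the function name `matching_prefix_lengths` are assumptions made for the example.

```python
import re

# R = Number + Identifier (a small illustrative specification).
R = r"[0-9]+|[a-zA-Z][a-zA-Z0-9]*"


def matching_prefix_lengths(text, pattern):
    """All i such that the prefix x_1 ... x_i is in the language of the pattern."""
    return [i for i in range(1, len(text) + 1)
            if re.fullmatch(pattern, text[:i])]


source = "count1 = 0"
lengths = matching_prefix_lengths(source, R)
print(lengths)                 # [1, 2, 3, 4, 5, 6]
print(source[:max(lengths)])   # count1
```

Six different prefixes are valid identifiers here; taking the longest one yields the single token `count1` rather than `c` followed by more tokens.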
II: Which token is used?
$x_1 \ldots x_i \in L(R)$, where $R = R_1 + \cdots + R_N$
$x_1 \ldots x_i \in L(R_j)$
$x_1 \ldots x_i \in L(R_k)$
For example,
$if \in \begin{cases} L(Keywords), & Keywords = 'if' + 'else' + \cdots \\ L(Identifiers), & Identifiers = letter(letter + digit)^{*} \end{cases}$
Answer: Choose the one listed first
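The effect of the ordering can be shown with a small helper that returns the first rule whose language contains the lexeme. The names `first_matching_class`, `KEYWORD`, and `IDENTIFIER` are hypothetical, and the keyword list is cut down to two entries for brevity.

```python
import re

KEYWORD = ("Keyword", r"if|else")
IDENTIFIER = ("Identifier", r"[a-zA-Z][a-zA-Z0-9]*")


def first_matching_class(lexeme, rules):
    """Return the class of the first listed rule that matches the whole lexeme."""
    for name, pattern in rules:
        if re.fullmatch(pattern, lexeme):
            return name
    return None


print(first_matching_class("if", [KEYWORD, IDENTIFIER]))    # Keyword
print(first_matching_class("if", [IDENTIFIER, KEYWORD]))    # Identifier
print(first_matching_class("iffy", [KEYWORD, IDENTIFIER]))  # Identifier
```

Listing the identifier rule first would make the keyword rule unreachable, which is why keywords are normally placed before identifiers in a specification.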
III: What if no rule matches?
$x_1 \ldots x_i \notin L(R)$
Answer: Compilers must handle errors well; they cannot simply crash. For lexical analysis, the best solution is to make sure this case never arises. Write a rule for a category of error strings and put it last in priority. It then matches only when no earlier regular expression matches, so it catches only the error strings.
$ERROR = \text{all strings not in the lexical specification of the language}$
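One common way to approximate such an error rule is a catch-all pattern placed last. The sketch below uses Python's `re` alternation with named groups, which tries alternatives in the order listed; the rule set is illustrative, and the `ERROR` rule here matches one unexpected character at a time rather than a whole error string. Note that Python's alternation picks the first alternative that matches, not the longest, which happens to agree with maximal munch for these particular rules.

```python
import re

TOKEN_SPEC = [
    ("Number", r"[0-9]+"),
    ("Identifier", r"[a-zA-Z][a-zA-Z0-9]*"),
    ("Whitespace", r"[ \n\t]+"),
    ("ERROR", r"."),   # catch-all, listed last so it has the lowest priority
]

# One combined pattern: (?P<Number>...)|(?P<Identifier>...)|...
MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC),
    re.DOTALL,  # let '.' match newlines too, so no character is ever skipped
)


def tokenize(text):
    """Return (token_class, lexeme) pairs; unknown characters become ERROR tokens."""
    return [(m.lastgroup, m.group()) for m in MASTER.finditer(text)]


print(tokenize("x1 # 42"))
```

Because the catch-all is tried only after every other rule has failed at the current position, valid input never produces an `ERROR` token, and the lexer reports bad characters instead of crashing.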
Key Points
Regular expressions are a concise notation for string patterns.
Use in lexical analysis requires small extensions
- To resolve ambiguities
  - match as long as possible
  - take the highest-priority match: the rules are listed in order in a file, and those listed first take priority over those listed later
- To handle errors
  - add a catch-all regular expression that soaks up all possible erroneous strings, and give it the lowest priority so that it triggers only when no valid token class matches some piece of the input
Note: My English is limited. If you spot any mistakes, please point them out. Thanks!