[Compiler Principles] Lexical Analysis, Compilers, Week 2, Stanford University

Course source link

Week 2.1: Lexical Analysis

Token Class (or class) [1]

  • In English:
    Noun, verb, adjective, …
  • In a programming language:
    Identifier, keywords, ‘(’, ‘)’, numbers, …
Token classes correspond to sets of strings

  • Identifier:
    strings of letters or digits, starting with a letter
  • Integer:
    a non-empty string of digits
  • Keyword:
    “else” or “if” or “begin” or …
  • Whitespace:
    a non-empty sequence of blanks, newlines, and tabs
Classify program substrings according to role

Communicate tokens to the parser

Lexical Analysis (LA) Example

  • The goal is to partition the string. This is implemented by reading left to right, recognizing one token at a time.
  • “Lookahead” may be required to decide where one token ends and the next token begins.
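
A minimal sketch of why lookahead is needed: the scanner below reads left to right and peeks one character ahead to decide whether `=` is a complete token or the start of `==`. The function name `scan_operators` and the token class names `ASSIGN` and `EQUALS` are illustrative assumptions, not part of the course material.

```python
def scan_operators(text):
    """Split a string of '=' characters and blanks into ASSIGN / EQUALS tokens."""
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == " ":
            i += 1  # skip blanks between tokens
        elif ch == "=":
            # Lookahead: peek at the next character without consuming it.
            nxt = text[i + 1] if i + 1 < len(text) else ""
            if nxt == "=":
                tokens.append(("EQUALS", "=="))
                i += 2
            else:
                tokens.append(("ASSIGN", "="))
                i += 1
        else:
            raise ValueError(f"unexpected character {ch!r} at position {i}")
    return tokens


print(scan_operators("= =="))  # [('ASSIGN', '='), ('EQUALS', '==')]
```

Without the peek, the scanner could not tell where the first token ends when it sees the first `=`.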
PL/1 [2] keywords are not reserved

Example I

Example II

Example III

Regular Languages

Lexical structure = token classes

We must say what set of strings is in a token class

  • Use regular languages
Regular Expressions

  • Single character: $'c' = \left\{ "c" \right\}$
  • Epsilon (the empty string, not the empty set): $\epsilon = \left\{ "" \right\}$
  • Union: $A + B = \left\{ a \mid a \in A \right\} \cup \left\{ b \mid b \in B \right\}$
  • Concatenation: $AB = \left\{ ab \mid a \in A \land b \in B \right\}$
  • Iteration: $A^{*} = \bigcup_{i \ge 0} A^{i}$, where $A^{i} = \underbrace{A \cdots A}_{i\;times}$ and $A^{0} = \epsilon$
Example: $\Sigma = \left\{ 0, 1 \right\}$
  • $1^{*} = \bigcup_{i \ge 0} 1^{i} = "" + 1 + 11 + 111 + 1111 + \cdots$ = all strings of 1's
  • $(1+0)1 = \left\{ ab \mid a \in 1+0 \land b \in 1 \right\} = \left\{ 11, 01 \right\}$
  • $0^{*} + 1^{*} = \left\{ 0^{i} \mid i \ge 0 \right\} \cup \left\{ 1^{i} \mid i \ge 0 \right\}$
  • $(0+1)^{*} = \bigcup_{i \ge 0} (0+1)^{i} = "" + (0+1) + (0+1)(0+1) + \cdots + \underbrace{(0+1) \cdots (0+1)}_{i\;times} + \cdots$ = all strings of 0's and 1's
    There is more than one way to write down the same set, for example, $1^{*} = 1^{*} + 1$ and $(1+0)1 = 11 + 01$
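
The set definitions above can be checked mechanically. The following is a sketch only: since languages such as 1* are infinite, it keeps just the strings up to an assumed bound `MAX_LEN`, and the helper names (`char`, `union`, `concat`, `star`) are hypothetical.

```python
from itertools import product

MAX_LEN = 4  # assumption: truncate every language to strings of at most this length

EPSILON = {""}  # the language containing only the empty string


def char(c):
    return {c}


def union(a, b):
    return a | b


def concat(a, b):
    return {x + y for x, y in product(a, b) if len(x + y) <= MAX_LEN}


def star(a):
    result = {""}    # A^0 = epsilon
    frontier = {""}  # strings added in the previous round
    while frontier:
        frontier = concat(frontier, a) - result
        result |= frontier
    return result


one, zero = char("1"), char("0")
print(sorted(concat(union(one, zero), one)))  # ['01', '11']
print(union(star(one), one) == star(one))     # True
```

The second line printed illustrates that two different expressions can denote the same set, at least up to the chosen length bound.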
Summary of regular languages
Regular expressions specify regular languages.

Five constructs

— Two base cases

  • empty and 1-character strings

— Three compound expressions

  • union, concatenation, iteration

Formal Languages

Definition

A formal language is a set of finite-length strings over some alphabet.
Def. Let $\Sigma$ be a set of characters (an alphabet). A language over $\Sigma$ is a set of strings of characters drawn from $\Sigma$.

Comparison

— In English

  • Alphabet = English characters
  • Language = English sentences

— In rigorous formal languages

  • Alphabet = ASCII
  • Language = C programs
Meaning function L maps syntax to semantics.
Example of meaning function

Why use a meaning function?
  • Makes clear what is syntax and what is semantics.
  • Allows us to consider notation as a separate issue. For example, Arabic numerals are computationally more convenient than Roman numerals.
  • Because expressions and meanings are not 1-1.
Meaning is many-to-one

Meaning is many-to-one, never one-to-many.
That is, many expressions may correspond to one meaning, but one expression must never have several meanings, since that would cause ambiguity.
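
To make the numeral example concrete, here is a small sketch of a meaning function over two notations. The function name `meaning` is hypothetical, and it assumes well-formed input (ASCII digits or valid Roman numerals); it does no validation.

```python
ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def meaning(numeral):
    """Map a numeral (syntax) to the integer it denotes (semantics)."""
    if numeral.isascii() and numeral.isdigit():
        return int(numeral)  # Arabic notation
    total = 0
    for i, ch in enumerate(numeral):
        value = ROMAN[ch]
        # Subtractive notation: a smaller symbol before a larger one is subtracted.
        if i + 1 < len(numeral) and ROMAN[numeral[i + 1]] > value:
            total -= value
        else:
            total += value
    return total


print(meaning("IV"), meaning("4"), meaning("004"))  # 4 4 4
```

Three different expressions map to the same meaning (many-to-one), while each expression has exactly one meaning.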

Lexical Specification

Keyword: “if” or “else” or “then” or …

$'i''f' + 'e''l''s''e' + 't''h''e''n' + \cdots$
abbreviated as $'if' + 'else' + 'then' + \cdots$

Integer: a non-empty string of digits

$digit = '0' + '1' + '2' + '3' + '4' + '5' + '6' + '7' + '8' + '9'$
$digit\;digit^{*}$ or $digit^{+}$

Identifier: strings of letters or digits, starting with a letter

$letter = [a-zA-Z]$
$letter(letter + digit)^{*}$

Whitespace: a non-empty sequence of blanks, newlines, and tabs

$('\;' + '\backslash n' + '\backslash t')^{+}$
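
As a rough illustration, the four specifications above could be written with Python's standard `re` module, where `|` plays the role of `+` (union). The constant names and the helper `in_language` are assumptions made for this sketch, and the keyword list contains only the three keywords named above.

```python
import re

KEYWORD = re.compile(r"if|else|then")
INTEGER = re.compile(r"[0-9]+")                   # digit+
IDENTIFIER = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")  # letter (letter + digit)*
WHITESPACE = re.compile(r"[ \n\t]+")              # (' ' + '\n' + '\t')+


def in_language(pattern, s):
    """True if the whole string s belongs to the language of pattern."""
    return pattern.fullmatch(s) is not None


print(in_language(IDENTIFIER, "x1"))  # True
print(in_language(IDENTIFIER, "1x"))  # False: must start with a letter
print(in_language(INTEGER, ""))       # False: must be non-empty
```

Note that `in_language(IDENTIFIER, "else")` is also true, so a keyword belongs to two token classes at once.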

Examples
Example I: anyone@cs.stanford.edu

$letter^{+}\,'@'\,letter^{+}\,'.'\,letter^{+}\,'.'\,letter^{+}$

Example II: Pascal

$digit = '0' + '1' + '2' + '3' + '4' + '5' + '6' + '7' + '8' + '9'$
$digits = digit^{+}$
$opt\_fraction = ('.'\,digits) + \epsilon = ('.'\,digits)?$
$opt\_exponent = ('E'\,('+' + '-' + \epsilon)\,digits) + \epsilon = ('E'\,('+' + '-')?\,digits)?$
$num = digits\;opt\_fraction\;opt\_exponent$
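
Both examples translate fairly directly into Python `re` syntax. This is a sketch: the constant names are assumptions, and the email pattern is only the simplified one from these notes, not a general email validator.

```python
import re

LETTER = r"[a-zA-Z]"
# letter+ '@' letter+ '.' letter+ '.' letter+
EMAIL = re.compile(rf"{LETTER}+@{LETTER}+\.{LETTER}+\.{LETTER}+")

DIGITS = r"[0-9]+"                   # digits = digit+
OPT_FRACTION = rf"(\.{DIGITS})?"     # ('.' digits)?
OPT_EXPONENT = rf"(E[+-]?{DIGITS})?"  # ('E' ('+' + '-')? digits)?
NUM = re.compile(DIGITS + OPT_FRACTION + OPT_EXPONENT)

print(bool(EMAIL.fullmatch("anyone@cs.stanford.edu")))  # True
print(bool(NUM.fullmatch("3.14E-2")))                   # True
print(bool(NUM.fullmatch(".5")))                        # False: digits must come first
```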

Regular expressions describe many useful languages

  • email
  • phone numbers
  • file names
Regular languages are a language specification

  • We still need an implementation

Summary of lexical specification notation

  • At least one: $A^{+} \equiv AA^{*}$
  • Union: $A \mid B \equiv A + B$
  • Option: $A? \equiv A + \epsilon$
  • Range: $'a' + 'b' + \cdots + 'z' \equiv [a-z]$
  • Excluded range: complement of $[a-z] \equiv [\,\hat{}\,a-z]$
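
These shorthands can be spot-checked with Python's `re` module. The helper `same_language` below is a hypothetical name; it compares two patterns on every string over a small assumed alphabet up to a length bound, so it is a sanity check rather than a proof of equivalence.

```python
import re
from itertools import product


def same_language(p, q, alphabet="ab", max_len=4):
    """Compare two patterns on every string over `alphabet` up to max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if bool(re.fullmatch(p, s)) != bool(re.fullmatch(q, s)):
                return False
    return True


print(same_language(r"a+", r"aa*"))     # True: at least one
print(same_language(r"a?", r"(a|)"))    # True: option
print(same_language(r"[a-b]", r"a|b"))  # True: range
print(same_language(r"[^a]", r"b"))     # True over the alphabet {a, b}: excluded range
print(same_language(r"a+", r"a*"))      # False: they differ on the empty string
```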

Steps of lexical specification

  1. Write a rexp for the lexemes of each token class
  • Number = $digit^{+}$
  • Keyword = $'if' + 'else' + \cdots$
  • Identifier = $letter(letter+digit)^{*}$
  • OpenPar = ‘(’
  • $\cdots$
  2. Construct R, matching all lexemes for all tokens
    $R = Keyword + Identifier + Number + \cdots$
    $\;\;\;\,= R_1 + R_2 + \cdots$

  3. Let input be $x_1...x_n$
    For $1 \leq i \leq n$ check
    $x_1...x_i \in L(R)$

  4. If success, then we know that
    $x_1...x_i \in L(R_j)$ for some $j$

  5. Remove $x_1...x_i$ from input and go to 3
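
The procedure above can be sketched naively in Python by testing every prefix against every rule. Several things here are assumptions for illustration: the `Whitespace` rule (added so that blanks in the sample input are consumed), the choice to keep the longest matching prefix, and the choice to let the first listed rule win. A real lexer generator would compile the rules into an automaton instead of re-matching prefixes.

```python
import re

# One regular expression per token class, in priority order.
RULES = [
    ("Keyword", r"if|else"),
    ("Identifier", r"[a-zA-Z][a-zA-Z0-9]*"),
    ("Number", r"[0-9]+"),
    ("OpenPar", r"\("),
    ("Whitespace", r"[ \n\t]+"),  # assumption: not in the list above
]


def tokenize(text):
    tokens = []
    while text:
        best = None
        # Check every prefix x_1...x_i of the remaining input.
        for i in range(1, len(text) + 1):
            prefix = text[:i]
            # The prefix is in L(R) if some R_j matches it; the first listed wins.
            for name, pattern in RULES:
                if re.fullmatch(pattern, prefix):
                    best = (name, prefix)  # a longer prefix replaces a shorter one
                    break
        if best is None:
            raise ValueError(f"no rule matches at {text[:10]!r}")
        tokens.append(best)
        # Remove the matched prefix from the input and repeat.
        text = text[len(best[1]):]
    return tokens


print(tokenize("if (x1"))
# [('Keyword', 'if'), ('Whitespace', ' '), ('OpenPar', '('), ('Identifier', 'x1')]
```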

FAQs (frequently asked questions)

I: How much input is used?

$x_1...x_i \in L(R)$
$x_1...x_j \in L(R)$
$i \neq j$
Answer: "Maximal munch": when two prefixes of different lengths both match, take the longer one.
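
A small sketch of this rule: the helper below lists every prefix length that matches some rule, and `maximal_munch` keeps the longest. The rule set and both function names are assumptions chosen to show `=` versus `==`.

```python
import re

RULES = [("Assign", r"="), ("Equals", r"=="), ("Identifier", r"[a-zA-Z]+")]


def matching_prefix_lengths(text):
    """All i such that the prefix x_1...x_i is in L(R)."""
    return [i for i in range(1, len(text) + 1)
            if any(re.fullmatch(p, text[:i]) for _, p in RULES)]


def maximal_munch(text):
    """Return the longest matching prefix, or None if no prefix matches."""
    lengths = matching_prefix_lengths(text)
    return text[:max(lengths)] if lengths else None


print(matching_prefix_lengths("==x"))  # [1, 2]
print(maximal_munch("==x"))            # ==
```

Both `=` and `==` are valid prefixes of `==x`; taking the longer one yields a single `Equals` token rather than two `Assign` tokens.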

II: Which token is used?

$x_1...x_i \in L(R)$, where $R = R_1 + \cdots + R_N$
$x_1...x_i \in L(R_j)$
$x_1...x_i \in L(R_k)$

For example, $if \in \begin{cases} L(Keywords) & & Keywords = 'if' + 'else' + \cdots \\ L(Identifiers) & & Identifiers = letter(letter + digit)^{*} \end{cases}$
Answer: Choose the one listed first
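
A sketch of this priority rule, using a hypothetical two-rule list in which `Keyword` comes before `Identifier`:

```python
import re

RULES = [
    ("Keyword", r"if|else"),
    ("Identifier", r"[a-zA-Z][a-zA-Z0-9]*"),
]


def token_classes(lexeme):
    """Every token class whose language contains the lexeme, in listing order."""
    return [name for name, pattern in RULES if re.fullmatch(pattern, lexeme)]


def classify(lexeme):
    """Resolve the ambiguity by choosing the class listed first."""
    matches = token_classes(lexeme)
    return matches[0] if matches else None


print(token_classes("if"))  # ['Keyword', 'Identifier']
print(classify("if"))       # Keyword
print(classify("iffy"))     # Identifier
```

Swapping the order of the two rules would make every keyword lex as an identifier, which is why keywords are usually listed first.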

III: What if no rule matches?

$x_1...x_i \notin L(R)$

Answer: Compilers must handle errors well; they cannot simply crash. The best approach for lexical analysis is to make sure this case never arises: write a rule for a category of error strings and put it last in priority. It then matches only if no earlier regular expression matches, so it catches exactly the error strings.
$ERROR$ = all strings not in the lexical specification of the language
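
One way to sketch such an error rule in Python is shown below. As a simplifying assumption, the `Error` rule matches a single arbitrary character rather than a whole erroneous string: a multi-character catch-all such as `.+` would always be the longest match and would swallow valid tokens. The rule set and function names are hypothetical.

```python
import re

RULES = [
    ("Number", r"[0-9]+"),
    ("Identifier", r"[a-zA-Z][a-zA-Z0-9]*"),
    ("Whitespace", r"[ \n\t]+"),
    ("Error", r"(?s:.)"),  # catch-all, listed last: any single character
]


def next_token(text):
    """Longest match wins; ties go to the rule listed first."""
    best = None
    for name, pattern in RULES:
        m = re.match(pattern, text)
        if m and (best is None or m.end() > len(best[1])):
            best = (name, m.group())
    return best


def tokenize(text):
    tokens = []
    while text:
        name, lexeme = next_token(text)  # never None: Error matches any character
        tokens.append((name, lexeme))
        text = text[len(lexeme):]
    return tokens


print(tokenize("x1 #7"))
# [('Identifier', 'x1'), ('Whitespace', ' '), ('Error', '#'), ('Number', '7')]
```

The lexer never gets stuck: the unexpected `#` becomes an `Error` token that a later phase can report, and scanning continues with the next character.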

Key Points

Regular expressions are a concise notation for string patterns.

Use in lexical analysis requires small extensions

  • To resolve ambiguities
    • matches as long as possible
    • highest priority match
      This is done simply by listing the rules in order in a file; rules listed first have higher priority than those listed later.
  • To handle errors
    Add a catch-all regular expression that soaks up all possible erroneous strings, and give it the lowest priority so that it triggers only if no valid token class matches some piece of the input.

Note: My English is limited; please point out any mistakes. Thanks!


  1. https://baike.baidu.com/item/TOKEN/2615248

  2. https://baike.baidu.com/item/PL%2FI
