CP2: Tokenizer, Parser, Syntax & More

KnightHacker2077

Article link: https://blog.csdn.net/DOITJT/article/details/119919933

Turing-complete language: A language that can implement, even if awkwardly, any arbitrary algorithm or computation, i.e. if a program can be written in Java or Python, it can also be implemented in any other Turing-complete language.

Recursive descent approach (a.k.a. syntax-driven): a single procedure PN corresponds to each non-terminal <N>; it handles every occurrence of that non-terminal in the (abstract) parse tree, and only occurrences of that non-terminal.

ParseStmt(), PrintStmt(), ExecStmt() for every <stmt>
ParseProg(), PrintProg(), ExecProg() for every <prog>
...

Abstract Parse Tree: a parse tree from which the nodes and branches that provide no useful info have been thrown away.

Throw away tokens that are only useful in building a parse tree -- program, begin, etc.
The rest of the program can simply rely on the parse tree to obtain the info.

Context-free: What is allowed and not allowed in a subtree does not involve the context (info) in other subtrees or in the portion of the tree above the current subtree. This requires static checking of the current subtree only (according to the BNF grammar), before execution. Concerns syntax.

Context-sensitive: What is allowed and not allowed in a subtree depends on the context (info) in other subtrees or in the portion of the tree above the current subtree. This requires additional global static checking over the entire PT, done during parsing and before execution. Concerns syntax (such conditions are often called static semantics).

No variable can be declared twice in the declaration sequence
Every variable appearing in the statement sequence has to be declared (in the declaration sequence)
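
A minimal sketch of these two context-sensitive checks, assuming the parser has already collected the identifier names from the declaration sequence and the statement sequence into two lists. The function name check_declarations and the error wording are hypothetical, not part of the Core interpreter.

```python
def check_declarations(declared, used):
    """Return a list of context-sensitive errors.

    declared: identifier names in declaration-sequence order.
    used: identifier names appearing in the statement sequence.
    """
    errors = []
    seen = set()
    for name in declared:
        if name in seen:
            # Rule 1: no variable may be declared twice
            errors.append(f"variable {name} declared twice")
        seen.add(name)
    for name in used:
        if name not in seen:
            # Rule 2: every variable used must have been declared
            errors.append(f"variable {name} used but not declared")
    return errors


print(check_declarations(["X", "Y", "X"], ["X", "Z"]))
```

Neither check can be done by looking at a single subtree: both need info gathered from elsewhere in the tree, which is what makes them context-sensitive.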

Run-time: What is allowed and not allowed in a subtree depends on what happens (semantics) during execution. For example, suppose int X is initialized only inside an 'if' branch and, because of the condition, that branch is not executed; printing X afterwards is then a run-time error. This requires additional dynamic checking during execution.

A variable has to be initialized before its value is read and used in evalExp()
The data stream from which a value is read should not be empty

Tokenizer

Take the program as input and produce a stream of tokens.

** Only the Tokenizer is aware of the precise sequence of characters making up each keyword.

*** The Tokenizer does NOT tokenize the entire program at once, because if the program is not legal there is no need to continue. It tokenizes the program one token at a time and lets the parser check the syntax on the fly.

Available Methods

getToken(): returns info about the current token (i.e. its token index, 1~32). It does NOT move the token cursor past the current token, which gives the parser one token of look-ahead.
- If the current token is an identifier, call parse<stmt> (without consuming that token)
  - If the first token is "if", call parse<if>
    - If two alternatives begin with the same tokens, parse the shared prefix as either one and decide between them at the first token where they differ
  - ...
skipToken(): skips past the current token. It does NOT return anything; the next call to getToken() returns the next token (which has now become the current token).
intVal(): returns the value of the current token, which must be an integer token. If it is not an integer token, it reports an error.
- To check whether it is an integer token, call getToken() first: all integer tokens have index 31
idName(): returns the name (string) of the current token, which must be an identifier token. If it is not an identifier token, it reports an error.
- To check whether it is an id token, call getToken() first: all identifier tokens have index 32
Why use square brackets in <cond>?
- ((((X + Y) + Z ) + U) * X) is a legal <op>, yet it starts with four left parentheses. If <cond> were also written with parentheses, then seeing "((" and treating it as the start of a <cond> would go wrong in this case; with one token of look-ahead the parser could not tell the two apart. Square brackets remove the clash.
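
The four methods can be sketched as follows. This is an illustration under assumptions, not the course's tokenizer: only the indices 31 (integer) and 32 (identifier) come from these notes, while the keyword and symbol numbers, the end-of-file index 33, and the simplified identifier pattern are made up for the example.

```python
import re

# Assumed numbering; only INT = 31 and ID = 32 are stated in the notes.
KEYWORDS = {"if": 5, "then": 6, "else": 7}
SYMBOLS = {";": 12, "=": 14, "(": 20, ")": 21, "+": 22, "-": 23, "*": 24}
INT, ID, EOF = 31, 32, 33

# One token per match: an integer, a word (keyword or identifier), or a symbol.
TOKEN_RE = re.compile(
    r"\s*(?:(?P<int>[0-9]+)|(?P<word>[A-Za-z][A-Za-z0-9]*)|(?P<sym>\S))"
)


class Tokenizer:
    def __init__(self, text):
        self._text = text
        self._pos = 0
        self._scan()              # scan only the first token

    def _scan(self):
        """Read exactly one token starting at the cursor."""
        m = TOKEN_RE.match(self._text, self._pos)
        if m is None:             # nothing but whitespace left
            self._kind, self._lexeme = None, None
        else:
            self._kind = m.lastgroup
            self._lexeme = m.group(m.lastgroup)
            self._pos = m.end()

    def getToken(self):
        """Return the index of the current token without consuming it."""
        if self._kind is None:
            return EOF
        if self._kind == "int":
            return INT
        if self._kind == "word":
            return KEYWORDS.get(self._lexeme, ID)
        if self._lexeme in SYMBOLS:
            return SYMBOLS[self._lexeme]
        raise SyntaxError(f"illegal token {self._lexeme!r}")

    def skipToken(self):
        """Move past the current token; returns nothing."""
        self._scan()

    def intVal(self):
        if self.getToken() != INT:
            raise ValueError("current token is not an integer")
        return int(self._lexeme)

    def idName(self):
        if self.getToken() != ID:
            raise ValueError("current token is not an identifier")
        return self._lexeme


t = Tokenizer("X = 42;")
print(t.getToken(), t.idName())   # the cursor has not moved between the two calls
t.skipToken()
t.skipToken()
print(t.getToken(), t.intVal())
```

The rest of the input is scanned only when skipToken() is called, so an illegal program stops the work at the first bad token.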

Parser

Gets the token stream from the tokenizer, then builds the parse tree (PT).

Reads in token stream
Calls idName, intVal, getToken, skipToken, etc.

Example: ParseIf() --
- Call getToken(), check that it is an 'if'
  - If not, print an error and stop
Example: ParseStmt() --
- Call getToken() and apply one-token look-ahead
- Call the corresponding alternative, e.g. ParseIf()
Example: ParseExp() -- "9-5+4"

ParseExp calls ParseFac(), which creates the 1st child node
ParseFac calls ParseOp(); ParseOp calls getToken() and, based on the returned token index, parses one of the alternatives (or reports an error). Here it gets the integer 9, calls skipToken(), and returns to ParseFac
ParseFac finishes and returns to ParseExp, which calls getToken() and gets -
ParseExp skips the - and calls ParseExp again for the rest of the input
...
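
The walk-through above can be sketched in code. The grammar below is an assumption (it is not spelled out in these notes): <exp> ::= <fac> | <fac> + <exp> | <fac> - <exp>, <fac> ::= <op> | <op> * <fac>, <op> ::= <int> | <id> | ( <exp> ). The Tokens class is a hypothetical stand-in for the tokenizer that works on raw strings instead of token indices, and the tree is built from nested tuples rather than the PT array.

```python
class Tokens:
    """Stand-in tokenizer over a pre-split list of lexemes."""

    def __init__(self, lexemes):
        self._lexemes = lexemes
        self._i = 0

    def getToken(self):
        # One-token look-ahead: does not move the cursor
        return self._lexemes[self._i] if self._i < len(self._lexemes) else None

    def skipToken(self):
        self._i += 1


def parseOp(t):
    tok = t.getToken()
    if tok == "(":
        t.skipToken()
        inner = parseExp(t)
        if t.getToken() != ")":
            raise SyntaxError("expected ')'")
        t.skipToken()
        return ("op", inner)
    if tok is not None and (tok.isdigit() or tok.isalpha()):
        t.skipToken()
        return ("op", tok)
    raise SyntaxError(f"unexpected token {tok!r}")


def parseFac(t):
    left = parseOp(t)
    if t.getToken() == "*":
        t.skipToken()
        return ("fac", left, "*", parseFac(t))
    return ("fac", left)


def parseExp(t):
    left = parseFac(t)
    tok = t.getToken()
    if tok in ("+", "-"):
        t.skipToken()
        return ("exp", left, tok, parseExp(t))
    return ("exp", left)


tree = parseExp(Tokens(["9", "-", "5", "+", "4"]))
print(tree)
```

Each procedure looks only at the current token to pick an alternative, then hands its children to the procedure for the child's non-terminal. With this right-recursive grammar the tree groups the input as 9 - (5 + 4); the shape of the tree follows the grammar, not ordinary arithmetic convention.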

Parse tree (PT)

A 2-dimensional array of integers, i.e. PT[n, m], where n is the row (one row per node) and the column index m ranges over 0-4.

1st column (m=0): identify node type -- index of the non-terminal
2nd column (m=1): identify alternative -- index of the alternative of that non-terminal, starts at 1
3rd ~ 5th column (m=2~4): pointer to each child node -- the Core language only allows maximum 3 child nodes, so only 3 columns are used.

** Use 0 for null child node

** Potential problem: the array has a fixed size, so a large program won't fit. Table doubling performs too poorly.

*** int type: for an integer node, the 2nd column holds the value returned by intVal() and the rest of the columns are -1.
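
A small sketch of this layout, under some assumptions: the capacity MAX_ROWS, the node-type index 12 for <op>, the helper names new_node and new_int, and the choice to leave row 0 unused (so that a child pointer of 0 can safely mean "null") are all made up for illustration.

```python
MAX_ROWS = 100           # fixed capacity of the array (assumed value)
OP, INT = 12, 31         # node-type indices; only 31 comes from the notes

# PT[n][m]: m=0 node type, m=1 alternative, m=2..4 child row numbers
PT = [[0] * 5 for _ in range(MAX_ROWS)]
next_row = 1             # row 0 stays unused so that 0 can mean "no child"


def new_node(node_type, alt, c1=0, c2=0, c3=0):
    """Fill the next free row and return its row number."""
    global next_row
    if next_row >= MAX_ROWS:
        raise MemoryError("parse tree array is full")
    row = next_row
    PT[row] = [node_type, alt, c1, c2, c3]
    next_row += 1
    return row


def new_int(value):
    # Integer rows: value in the 2nd column, remaining columns are -1
    return new_node(INT, value, -1, -1, -1)


int_row = new_int(42)
op_row = new_node(OP, 1, int_row)   # <op> ::= <int> taken to be alternative 1
print(PT[int_row], PT[op_row])
```

The MemoryError branch is the fixed-size problem mentioned above: once the rows run out, the only options are to fail or to copy everything into a larger array.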

Dealing with identifiers

An identifier, say "xyz", can appear more than once in a CORE program. The problem is how to treat every occurrence of xyz as the same identifier, and how to prevent the parser from creating multiple copies of the same id.

When parsing decl seq, the parser won't allow any repeated ("old") identifiers
When parsing stmt seq, the parser won't allow any new identifiers

The parser calls getToken(), which returns the token index corresponding to an id
It then calls idName(), which returns the String "xyz"
Check if String "xyz" is already in the

Problem with array representation

Encapsulation
Non-expandable: array lengths are fixed; once the max length is reached, the array has to be doubled by copying, which is not very efficient
Wasting memory: the number of columns is fixed, but some productions have only 1 child, so the columns reserved for the other children stay empty

Responsibility

Checks whether the program is syntactically legal
Prints an error message and stops when the program's syntax is illegal

Printer

Print the original Core program in a pretty format

printAssign

In a nutshell --

print id on the left
print '='
print the expression on the right
print ';'

Check that the node is an <assign> node
Check that the alt. number does not exceed the max
Call printId() with the row number (pointer) of the <id> node

*** printId does NOT call idName(); ONLY the Parser calls that, while building the PT

printId follows the pointer stored in that <id> row into the IdName table and prints the name found there ~~(the reason why it doesn't put the exact id string in the PT row is that we want to make sure the id is actually initialized)~~

print '='
Call printExp()
print ';'

printIf

Check that the node type is <if>
Check that the alt. number does not exceed the max
Call printCond() to print the condition portion
Check which alternative was parsed, to know whether it is "if ... then ..." OR "if ... then ... else ..."
Print the then branch, and the else branch if there is one
print ';'
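
A sketch of the printAssign steps over a hand-built parse tree for "X = 7;". Several things here are assumptions for illustration: the node-type indices 4 and 10, the MAX_ALT table, the unused row 0, and the shortcut of letting the <exp> row point straight at an int row instead of going through the intermediate productions. The functions return strings rather than printing, so the result is easy to inspect.

```python
ASSIGN, EXP, INT, ID = 4, 10, 31, 32    # 31 and 32 are from the notes; 4 and 10 are assumed
MAX_ALT = {ASSIGN: 1, EXP: 3}           # assumed number of alternatives per production
id_names = [None, "X"]                  # IdName table; row 0 unused so 0 can mean "null"

# Rows are [type, alt, child1, child2, child3]
PT = [
    [0, 0, 0, 0, 0],                    # row 0: unused
    [ASSIGN, 1, 2, 3, 0],               # children: <id> in row 2, <exp> in row 3
    [ID, 1, 1, 0, 0],                   # 3rd column points into id_names
    [EXP, 1, 4, 0, 0],                  # simplified: points directly at an int row
    [INT, 7, -1, -1, -1],               # int row: value in the 2nd column
]


def check(row, expected):
    """Check the node type and that the alternative number is in range."""
    node_type, alt = PT[row][0], PT[row][1]
    if node_type != expected:
        raise ValueError(f"row {row}: expected type {expected}, found {node_type}")
    if not 1 <= alt <= MAX_ALT[expected]:
        raise ValueError(f"row {row}: alternative {alt} out of range")


def printId(row):
    if PT[row][0] != ID:
        raise ValueError(f"row {row} is not an <id> node")
    return id_names[PT[row][2]]         # look the name up; never calls idName()


def printExp(row):
    check(row, EXP)
    if PT[row][1] != 1:
        raise NotImplementedError("sketch covers only the first alternative")
    return str(PT[PT[row][2]][1])


def printAssign(row):
    check(row, ASSIGN)
    return f"{printId(PT[row][2])} = {printExp(PT[row][3])};"


print(printAssign(1))
```

printAssign never looks inside the expression itself; it only passes the child's row number on to printExp.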

Executor

Execute the program

* Only responsible for its direct children

** Always check that the production is the one the function expects; always check that the alternative number is within range (less than or equal to the number of alternatives this type of production has)

*** Call functions for the children nodes to process the children (by passing the row numbers)

execIf()

ExecStatement() calls execIf() and passes it the <if> node
execIf() is only responsible for its direct children, not for its grandchildren
** Add a type check: before calling evalCond(), check that the node is an 'if' node, i.e. check the node-type column PT[n,0]; otherwise print an error message and stop. Every exec function needs this.

execAssign()

Compute the value of the right-hand side of the assign stmt (by calling evalExp())

Call evalExp() on the 2nd child node, which returns an int value
The returned value ~~CANNOT be put to the row of the <id> tag~~ -- because it cannot be accessed in future execution
Put the returned value in a separate array called IdValue, in the entry corresponding to the name of the id.

execRead()

Read data from the data input stream and call assignIdVal()

Read the next value from the input stream and check whether the target id is declared
1. If declared -- assign the value
2. If NOT declared -- report an error; nothing is assigned

Data Structure

IdName Table:

Keep an array of the names of all identifiers in the program.

** When parsing an <id>, in the PT, the row for that <id> should look like this

1st column is type num,
2nd column is alt. num,
3rd column is the row num in the array of ID name (pointer to the IdName table).

In the IdName array:

1st column: ID name -- e.g. "XYZ"
2nd column: current int value of the ID -- the slot is created during parsing and contains junk until it is initialized during execution
3rd column: whether the ID is initialized -- 0 if the ID is not initialized (parseDeclSeq sets it to 0); once an assign statement for the ID is executed, the column is set to 1.

The functions that are going to use the 3rd column:

e.g. execWrite()

Check whether the identifier has been initialized (by checking the 3rd column of the IdName array)

If initialized -- write the value to the output stream
If NOT initialized -- print error "uninitialized variable [id]"
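
A sketch of the three-column table and the execWrite() check. The names id_table and declare, and the choice to store 0 as the "junk" initial value, are assumptions; assignIdVal and execWrite follow the names used in these notes. Here execWrite both prints and returns the value so that it can be inspected.

```python
id_table = []   # one row per identifier: [name, value, initialized]


def declare(name):
    """Called while parsing the decl seq; returns the row number for the id."""
    if any(row[0] == name for row in id_table):
        raise ValueError(f"variable {name} declared twice")
    id_table.append([name, 0, 0])       # value is a placeholder, flag starts at 0
    return len(id_table) - 1


def assignIdVal(index, value):
    """Called by execAssign()/execRead(): store the value and set the flag."""
    id_table[index][1] = value
    id_table[index][2] = 1


def execWrite(index):
    name, value, initialized = id_table[index]
    if not initialized:
        # Run-time error: can only be detected while executing
        raise RuntimeError(f"uninitialized variable {name}")
    print(f"{name} = {value}")
    return value


x = declare("X")
try:
    execWrite(x)
except RuntimeError as err:
    print(err)
assignIdVal(x, 5)
execWrite(x)
```

Because the flag lives in the table rather than in a PT row, every occurrence of X in the program sees the same value and the same initialized state.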

find an identifier
check the isInitialized column in idName table
if initialized

Exercises

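The Java compiler gives you an error message if you attempt to access the value of an identifier that is not initialized! Explain how Java achieves this.

Java requires a variable to be initialized before you use it, for example before entering a branch that reads it; otherwise the compiler reports "variable might not have been initialized". This is a static check (definite assignment analysis): at compile time the compiler verifies that the variable is assigned on every path that reaches the use.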

Identifier X is a boolean according to the <decl seq> but there is an expression <X + 42> in the <stmt seq>. What kind of error is this?

Context-sensitive error. I can check the type of X in the identifier table to learn that X is a boolean, and 42 is an integer (token index 31), so there is a type mismatch and the parsing process aborts.

In defining printIf(), we never actually evaluated the condition that appears in the given statement. Explain why we did not evaluate it. Would it have been wrong to evaluate it or just unnecessary to do so? Explain.

printIf() is only concerned with pretty-printing the CORE program; it is not its responsibility to evaluate anything. In fact, we may not even be able to evaluate the condition, because we may not have the data stream from which the "read" statements will read their data.

Specifying Syntax: Regular Expressions and Context-Free Grammars

All compilers and interpreters are syntax driven.

Formal Syntax rules:

Concatenation
Alternation -- choice from alternatives
“Kleene closure” -- repetition an arbitrary number of times
Recursion -- creation of a construct from simpler instances of the same construct

Regular set (regular language): Language defined in terms of the first three rules

Context-free language (CFL): Language defined with the addition of the 4th rule; generated by context-free grammars (CFGs) and recognized by parsers
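
To illustrate the difference, here is a small sketch. The token pattern uses only concatenation, alternation and Kleene closure, so a regular expression handles it. Nested parentheses need the 4th rule: the hypothetical grammar <parens> ::= (empty) | ( <parens> ) <parens> is recursive, and the recognizer below mirrors that recursion. The function names are made up for the example.

```python
import re

# Regular: an integer OR an identifier, built from concatenation, | and *
TOKEN = re.compile(r"(0|[1-9][0-9]*)|([A-Za-z][A-Za-z0-9]*)")


def is_token(s):
    return TOKEN.fullmatch(s) is not None


def parse_parens(s, i=0):
    """Return the index just past the longest <parens> starting at i."""
    while i < len(s) and s[i] == "(":
        i = parse_parens(s, i + 1)          # recursion: a nested <parens>
        if i >= len(s) or s[i] != ")":
            raise SyntaxError("unbalanced parentheses")
        i += 1
    return i


def is_balanced(s):
    try:
        return parse_parens(s) == len(s)
    except SyntaxError:
        return False


print(is_token("x1"), is_token("1x"))
print(is_balanced("(()())"), is_balanced("(()"))
```

Checking nesting to arbitrary depth is not possible with the first three rules alone, which is why the tokenizer can rely on regular expressions while the parser needs a context-free grammar.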

Tokens and Regular Expressions

Tokens: the shortest strings of characters with individual meaning.
- Example: in C -- keywords (double, if, return, struct, etc.), identifiers (my_variable, your_type, sizeof, printf, etc.), integer (0765, 0x1f5, 501), floating-point (6.022e23), and character ('x', '\'', '\0170') constants; string literals ("snerk", "say \"hi\"\n"); "punctuators" (+, ], ->, *=, :, ||, etc.), and two different forms of comments.
Token types: keywords, identifiers, symbols, and constants.
To specify tokens, we use the notation of regular expressions
*** Tokens are numbered -- all integers have the index 31; all ids have the index 32

Character Sets and Formatting Issues

Some languages distinguish upper- and lower-case letters, some don't
Some support non-Latin characters, some don't
Some have limits on the maximum length of identifiers, some don't
- Free format means that a program is simply a sequence of tokens: what matters is their order, not their physical position; white space is ignored
- Fixed format means that there is a limit on the maximum length of a line, to allow the compiler to store the current line in a fixed-length buffer; line breaks serve to separate statements

Context-Free Grammars

Commonly written in BNF ("Backus-Naur Form", originally "Backus Normal Form"): a notation for describing the syntax of languages precisely.
A context-free grammar shows us how to generate a syntactically valid string of terminals: Begin with the start symbol.
Capable of specifying nested constructs
Symbols on the left-hand sides of the productions, in angle brackets -- <> -- are known as variables, or non-terminals
- Specify the conditions that strings (of terminal symbols) must satisfy in order to be (syntactically) legal.
- Corresponds to a set of legal strings in a particular category.
- e.g. <exp> = all legal expressions; <stmt> = all legal statements
Terminals (tokens) are the symbols of the language the grammar is defining. They appear only on the right-hand sides of the productions, where they can be mixed with non-terminals.
Each of the rules in a context-free grammar is known as a production.
- non-terminal <N> on the left
- ::= symbol, read as “is”
- to the right of the ::= symbol, a set of alternatives, separated by |, read as “or”
  - Each alternative will be a (finite) string of terminal and non-terminal symbols.
- Example:

Derivations and Parse Trees

A series of replacement operations that shows how to derive a string of terminals from the start symbol is called a derivation. Each string of symbols along the way is called a sentential form. The final sentential form, consisting of only terminals, is called the yield of the derivation

Derivations: start with the start symbol; at each step, the right-hand side is obtained by using a production to replace some nonterminal in the left-hand side
The ⇒ metasymbol, which marks one such step, is often pronounced "derives."
The metasymbol ⇒∗ means “derives after zero or more replacements.”

A parse tree represents a derivation graphically. The root of the parse tree is the start symbol of the grammar. The leaves of the tree are its yield. Each internal node, together with its children, represents the use of a production.
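
A derivation can be replayed mechanically. The sketch below uses a hypothetical toy grammar (not the Core grammar) and always replaces the leftmost non-terminal; the list choices says which alternative (0-based) to use at each step. Every string it records is a sentential form, and the last one is the yield.

```python
# Toy grammar, assumed for illustration:
#   <exp> ::= <num> | <num> + <exp> | <num> - <exp>
#   <num> ::= 4 | 5 | 9
GRAMMAR = {
    "<exp>": [["<num>"], ["<num>", "+", "<exp>"], ["<num>", "-", "<exp>"]],
    "<num>": [["4"], ["5"], ["9"]],
}


def derive(start, choices):
    """Leftmost derivation; returns the list of sentential forms."""
    form = [start]
    steps = [" ".join(form)]
    for alt in choices:
        # Find the leftmost non-terminal in the current sentential form
        i = next(k for k, sym in enumerate(form) if sym in GRAMMAR)
        # Replace it with the chosen alternative of its production
        form[i:i + 1] = GRAMMAR[form[i]][alt]
        steps.append(" ".join(form))
    return steps


for form in derive("<exp>", [2, 2, 1, 1, 0, 0]):
    print(form)
```

Each replacement corresponds to one internal node of the parse tree together with its children.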

A grammar that allows more than one parse tree to be constructed for the same string of terminals is said to be ambiguous. Ambiguity turns out to be a problem when trying to build a parser.
Useless symbols: nonterminals that cannot generate any string of terminals, or terminals that cannot appear in the yield of any derivation.
Associativity and Precedence: building these into the productions makes the grammar unambiguous.
Concrete Parse Tree: an accurate representation of the program that keeps all the keywords
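
To make ambiguity concrete, the sketch below counts the parse trees of a token string under a deliberately ambiguous toy grammar, <exp> ::= <exp> + <exp> | <exp> - <exp> | <int>. Both the grammar and the function name count_trees are assumptions made for illustration.

```python
from functools import lru_cache


def count_trees(tokens):
    """Count parse trees for tokens under <exp> ::= <exp>+<exp> | <exp>-<exp> | <int>."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(lo, hi):
        # Number of ways tokens[lo:hi] can be parsed as an <exp>
        if hi - lo == 1:
            return 1 if tokens[lo].isdigit() else 0
        total = 0
        for k in range(lo + 1, hi - 1):
            if tokens[k] in ("+", "-"):
                # tokens[k] is the operator at the root of the tree
                total += count(lo, k) * count(k + 1, hi)
        return total

    return count(0, len(tokens))


print(count_trees(["9", "-", "5", "+", "4"]))
```

For "9-5+4" there are two trees: one groups it as (9 - 5) + 4, which evaluates to 8, and the other as 9 - (5 + 4), which evaluates to 0. Fixing associativity and precedence in the grammar leaves only one of them.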
