Lexical Analysis
by hand
00 | Goal
The goal of this on-computer tutorial session is to write a small lexical analyzer (or lexer) by hand.
The development could be done in Java, C# or Python.
Remember that the lexical analyzer takes a source program, scans it, and returns the recognized
tokens when they are requested by the parser.
Definitions:
• Lexeme: sequence of characters that is recognized by the lexer (usually described by a
regular expression)
• Token: symbolic name of a lexeme
During this on-computer tutorial, we must:
• Define the language of the source program, its syntax and its semantics.
• Write a symbol table.
• Write the lexer.
01 | Language Description
The language to implement in this Labwork (LW) is a part of the Tiny Basic dialect of the BASIC
language.
01.1 Statements
The language is composed of statements, one per line. In this labwork we do not recognize
multiple statements on the same line. A line must follow one of the following formats:
• statement
• number statement
where number is the number of the line, and statement is the instruction of the language.
[Figure: the lexical analyzer scans the source program and returns tokens to the parser on each getNextToken request; the parser builds the syntax tree; both components use the symbol table.]
The supported statements are:
• PRINT expression
▪ Print the given expression on the standard output.
• IF expression relational_operator expression THEN statement
▪ The statement is executed only if the relational operator applied to the two
given expressions evaluates to true.
▪ The relational operators are:
• = equality
• < less than
• > greater than
• <= less or equal
• >= greater or equal
• <>, >< not equal
• GOTO expression
▪ Run the statement at the line defined by the expression.
• INPUT variable [, variable ...]
▪ Read variable values from the standard input.
• LET variable = expression
▪ Assign the value of the expression to the variable.
• GOSUB expression
▪ Run the statement at the line defined by the expression but, unlike GOTO,
remember the line of the GOSUB and resume at the statement just after
that line when a RETURN statement is encountered.
• RETURN
▪ Run the statement at the line following the last GOSUB.
• END
▪ Stop the program.
• REM text
▪ Insert a comment in the source code.
01.2 Expressions
Many of these statement types involve the use of expressions. An expression is the
combination of one or more numbers, strings or variables, joined by operators, and
possibly grouped by parentheses.
There are four operators: +, -, *, /
Note that the operators + and - have two versions: the binary (a+b) and the unary (+a).
A string of characters is enclosed in double quote characters (").
01.3 Variable Definition and Scope
The variables do not need to be declared. The scope of all the variables is global.
Consequently, a variable could be undefined when it is used.
All the variables are of numerical type or of string type, depending on the type of the
expression assigned to the variable.
01.4 Example
Consider the following Tiny Basic program:
5 LET S = 0
10 INPUT NUM
15 INPUT V
20 LET N = NUM
30 IF N <= 0 THEN GOTO 99
40 LET S = S + V
50 LET N = N - 1
70 GOTO 30
99 PRINT S/NUM
100 END
Your lexer must return the following tokens, in that order:
<NUM,5> <LET> <ID,S> <=> <NUM,0>
<NUM,10> <INPUT> <ID,NUM>
<NUM,15> <INPUT> <ID,V>
<NUM,20> <LET> <ID,N> <=> <ID,NUM>
<NUM,30> <IF> <ID,N> <RELOP,<=> <NUM,0> <THEN> <GOTO> <NUM,99>
<NUM,40> <LET> <ID,S> <=> <ID,S> <OP,+> <ID,V>
<NUM,50> <LET> <ID,N> <=> <ID,N> <OP,-> <NUM,1>
<NUM,70> <GOTO> <NUM,30>
<NUM,99> <PRINT> <ID,S> <OP,/> <ID,NUM>
<NUM,100> <END>
02 | Exercises
02.1 Exercise 1
• What is the alphabet of the language?
• What are the lexemes (as regular expressions) of the language?
• Write the table that matches the lexemes, the tokens, and the attributes of the tokens.
02.2 Symbol Table
• Create the SymbolTable class that stores information about the variables:
name, lexeme, line of the first occurrence, etc.
• This symbol table is a kind of map with the name of the variable as key and the description of
the variable as value.
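As an illustration, a minimal sketch of such a symbol table in Java is given below; the class and field names are only suggestions, not imposed by the labwork.

import java.util.Map;
import java.util.TreeMap;

/** Minimal symbol table sketch: maps a variable name to its description. */
public class SymbolTable {

    /** Description of a single variable (suggested fields only). */
    public static class Entry {
        public final String name;       // variable name, also used as the map key
        public final String lexeme;     // lexeme as read in the source program
        public final int firstLine;     // line of the first occurrence
        public Entry(String name, String lexeme, int firstLine) {
            this.name = name;
            this.lexeme = lexeme;
            this.firstLine = firstLine;
        }
    }

    private final Map<String, Entry> entries = new TreeMap<>();

    /** Adds the variable if it is not already known, and replies its entry. */
    public Entry add(String name, int line) {
        return this.entries.computeIfAbsent(name, n -> new Entry(n, n, line));
    }

    /** Replies the entry of a variable, or null if the variable was never seen. */
    public Entry get(String name) {
        return this.entries.get(name);
    }
}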
02.3 Tokens
• Define the class hierarchy that describes all the tokens of the language.
• Note that the token ID should point to an entry of the symbol table.
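One possible (purely illustrative) shape of this hierarchy in Java, reusing the SymbolTable.Entry type sketched above:

/** Base class of all tokens. */
public abstract class Token {
    private final int line;                  // source line where the lexeme starts
    protected Token(int line) { this.line = line; }
    public int getLine() { return this.line; }
}

/** Keyword or operator token such as LET, PRINT, <=, +... */
class KeywordToken extends Token {
    final String keyword;
    KeywordToken(String keyword, int line) { super(line); this.keyword = keyword; }
}

/** Number literal token, carrying its value as attribute. */
class NumberToken extends Token {
    final double value;
    NumberToken(double value, int line) { super(line); this.value = value; }
}

/** Identifier token, pointing to an entry of the symbol table. */
class IdToken extends Token {
    final SymbolTable.Entry entry;           // link to the symbol table
    IdToken(SymbolTable.Entry entry, int line) { super(line); this.entry = entry; }
}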
02.4 Scanner
• Create the Scanner class that takes an input stream (a Reader is preferred in Java) as
parameter of its constructor.
• The scanner is able to read the characters from the stream, compacting white spaces when
possible.
• The scanner is able to report the line in the source program at which the read character is
located.
• We recommend providing two functions: char peek() and char get(). The first returns the next
available character but does not consume it. The second returns the next available character
and consumes it, passing to the next available character.
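A possible sketch in Java, assuming the constructor receives a java.io.Reader as suggested above:

import java.io.IOException;
import java.io.Reader;

/** Character-level scanner: reads the characters and tracks the current source line. */
public class Scanner {
    private final Reader input;
    private int lookahead = -2;    // -2 = nothing buffered yet, -1 = end of stream
    private int line = 1;          // current line in the source program

    public Scanner(Reader input) {
        this.input = input;
    }

    /** Replies the next available character without consuming it ((char) -1 at end of stream). */
    public char peek() throws IOException {
        if (this.lookahead == -2) {
            this.lookahead = this.input.read();
        }
        return (char) this.lookahead;
    }

    /** Replies the next available character and consumes it, passing to the following one. */
    public char get() throws IOException {
        char c = peek();
        this.lookahead = -2;
        if (c == '\n') {
            this.line++;
        }
        return c;
    }

    /** Replies the line in the source program at which the read character is located. */
    public int getLine() {
        return this.line;
    }
}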
02.5 Lexer
• Create the Lexer class and, inside it, a function that is able to read characters from an input
stream.
• Create a Finite Automaton (FA) or ad-hoc (from-scratch) code that recognizes the
language. If you have trouble building the FA, you may use the one given at the end of
this document.
• Implement the FA or the ad-hoc code in the getNextSymbol() function of your Lexer; it should
use the Scanner.
• Remember that the Lexer should not return a token for the comment statements.
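As a hint only, getNextSymbol() may be written as an ad-hoc dispatch on the character returned by peek(). The fragment below handles nothing but the relational operators starting with '<', and reuses the Scanner, SymbolTable and KeywordToken classes sketched above (a dedicated RelopToken class would be cleaner); it is a sketch, not the required implementation.

import java.io.IOException;

/** Lexer sketch: only the relational operators starting with '<' are recognized here. */
public class Lexer {
    private final Scanner scanner;
    private final SymbolTable symbolTable;    // used when an ID token is built

    public Lexer(Scanner scanner, SymbolTable symbolTable) {
        this.scanner = scanner;
        this.symbolTable = symbolTable;
    }

    /** Replies the next token of the source program. */
    public Token getNextSymbol() throws IOException {
        char c = scanner.peek();
        if (c == '<') {
            scanner.get();                                // consume '<'
            char next = scanner.peek();
            if (next == '=' || next == '>') {
                scanner.get();                            // consume '=' or '>'
                return new KeywordToken("<" + next, scanner.getLine());
            }
            return new KeywordToken("<", scanner.getLine());  // star state: 'next' is not consumed
        }
        // ... digits, letters, operators, strings and comments follow the same pattern
        throw new IllegalStateException("unexpected character: " + c);
    }
}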
02.6 Main Program
• Write the main program that creates the Symbol Table and the Lexer on an input file.
• The main program invokes the getNextSymbol() function of the Lexer to retrieve the tokens.
• The tokens are displayed on the standard output.
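A possible main program, assuming the classes sketched in the previous sections and assuming that getNextSymbol() replies null at the end of the input (this convention is only a choice of the sketch):

import java.io.FileReader;
import java.io.Reader;

/** Entry point: tokenizes the file given on the command line and prints the tokens. */
public class Main {
    public static void main(String[] args) throws Exception {
        try (Reader reader = new FileReader(args[0])) {
            SymbolTable symbolTable = new SymbolTable();
            Lexer lexer = new Lexer(new Scanner(reader), symbolTable);
            Token token = lexer.getNextSymbol();
            while (token != null) {              // null is assumed to mark the end of input
                System.out.println(token);       // tokens are displayed on the standard output
                token = lexer.getNextSymbol();
            }
        }
    }
}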
03 | Finite Automata
A Finite Automaton (explained in Chapter 2) is able to recognize one or more tokens. When a state is
drawn with two circles, it means that a token is recognized and the Lexer can return it. If a star is
drawn on the side of a two-circle state, it means that the character from the input stream that
permitted reaching this star state is not consumed.
03.1 Alphabet
The alphabet of the language is :
 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
 +, -, /, *
 a, b, c, d, …., x, y, z
 A, B, C, D, ….., X, Y, Z
 ., <, >, =, (, ), "
03.2 Finite Automata for Token ID
[Figure: FA for token ID — from the start state 1, a letter leads to state 2; state 2 loops on letter or digit; any other character leads to the accepting state 3 (starred: that character is not consumed).]
03.3 Finite Automata for Token RELOP
03.4 Finite Automata for Token NUM

LW2 | LW3
DA53: COMPILERS AND LANGUAGE THEORY
Lexical Analysis and Syntax Analysis
00 | Goal
The goal of this on-computer tutorial session is to write a small interpreter for the Tiny Basic language.
The development must be done in Java.
JavaCC (https://javacc.github.io/javacc/) should be used. This library is also available on Teams.
You must download and complete a skeleton of a Java project, available on Teams.
Remember that the lexical analyzer takes a source program, scans it, and returns the recognized
tokens when they are requested by the parser. The parser generates a syntax tree that can be
executed (interpretation mode).
During this on-computer tutorial, you will :
1. Define the grammar and the Syntax-Directed Definition (SDD) of the Tiny Basic language.
2. Write a symbol table
3. Write the lexer
4. Write the interpretation rules
[Figure: the lexical analyzer scans the source program and returns tokens to the parser on each getNextToken request; the parser builds the syntax tree; both components use the symbol table.]
01 | Language Definition
The language to implement in this labwork is a part of the Tiny Basic dialect of the BASIC language.
01.1 Statements
The language is composed of statements, one per line. In this labwork we do not
recognize multiple statements on the same line. A line must follow one of the following
formats:
• statement
• number statement
where number is the number of the line, and statement is the instruction of the language.
The supported statements are:
• PRINT expression
▪ Print the given expression on the standard output.
• IF expression relational_operator expression THEN statement
▪ The statement is executed only if the relational operator applied to the two
given expressions evaluates to true.
▪ The relational operators are:
• = equality
• < less than
• > greater than
• <= less or equal
• >= greater or equal
• <>, >< not equal
• GOTO expression
▪ Run the statement at the line defined by the expression.
• INPUT variable [, variable ...]
▪ Read variable values from the standard input.
• LET variable = expression
▪ Assign the value of the expression to the variable.
• GOSUB expression
▪ Run the statement at the line defined by the expression but, unlike GOTO,
remember the line of the GOSUB and resume at the statement just after
that line when a RETURN statement is encountered.
• RETURN
▪ Run the statement at the line following the last GOSUB.
• END
▪ Stop the program.
• REM text
▪ Insert a comment in the source code.
01.2 Expressions
Many of these statement types involve the use of expressions. An expression is the
combination of one or more numbers, strings or variables, joined by operators, and
possibly grouped by parentheses.
There are four operators: +, -, *, /
Note that the operators + and - have two versions: the binary (a+b) and the unary (+a).
A string of characters is enclosed in double quote characters (").
01.3 Variable Definition and Scope
The variables do not need to be declared. The scope of all the variables is global.
Consequently, a variable could be undefined when it is used.
All the variables are of numerical type or of string type, depending on the type of the
expression assigned to the variable.
01.4 Example
Consider the following Tiny Basic program:
5 LET S = 0
10 INPUT NUM
15 INPUT V
20 LET N = NUM
30 IF N <= 0 THEN GOTO 99
40 LET S = S + V
50 LET N = N - 1
70 GOTO 30
99 PRINT S/NUM
100 END
02 | JavaCC
JavaCC is a Java library and a collection of CLI tools that generate a lexer and a parser
written in Java.
02.1 Input file
The input file of JavaCC is a file with extension « .jj », named here « tinybasic.jj ».
The input file must start with the Java definition of the Parser (named TinyBasicParser). You may
put in this part all the functions and attributes that are invoked in the grammar rules.
After the Java definition, you must put the lexer rules and the grammar (BNF) rules.
options {
  IGNORE_CASE = true;
  STATIC = false;
}
PARSER_BEGIN(TinyBasicParser)
package fr.utbm.gi.lo46.tp2.parser;
import java.util.SortedMap;
import java.util.Map;
import java.util.List;
import java.util.TreeMap;
import java.util.ArrayList;
import fr.utbm.gi.lo46.tp2.context.*;
import fr.utbm.gi.lo46.tp2.symbol.*;
import fr.utbm.gi.lo46.tp2.syntaxtree.*;
public class TinyBasicParser {

  private int basicLineNumber = 1;

  private final SymbolTable symbolTable = new SymbolTable();

  /** Replies the symbol table used by the parser.
   * @return the symbol table.
   */
  public SymbolTable getSymbolTable() {
    return this.symbolTable;
  }
}
PARSER_END(TinyBasicParser)
02.2 Lexer rule syntax
• Each lexer rule defines the lexeme that is recognized by the lexer. The recognized lexemes
are named tokens.
• Different types of lexer rules are available: SKIP, TOKEN, …
• The best practice in JavaCC is to define tokens only when they are associated with complex
regular expressions, i.e. when a regular expression operator is used. For example, the lexeme
« <= » may not be defined in a token rule. Rather, it should be typed directly in the BNF rules.
• The SKIP rule tells the lexer to ignore several sequences of characters:
SKIP :
{
  " "
| "\t"
| "\r"
}
• The TOKEN rule tells the lexer which token to return to the syntax analyzer:
TOKEN :
{
  < CR: "\n" >
}
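For Tiny Basic, a possible set of token rules is sketched below; the exact regular expressions are left to you, and only the lexemes that need regular-expression operators (numbers, strings, identifiers) are defined as tokens here:

TOKEN :
{
  < NUMBER: (["0"-"9"])+ ("." (["0"-"9"])+)? >
| < STRING: "\"" (~["\"", "\n", "\r"])* "\"" >
| < IDENTIFIER: ["a"-"z","A"-"Z"] (["a"-"z","A"-"Z","0"-"9"])* >
}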
02.3 Parser rule syntax
• The grammar (BNF) rules define the syntactic rules of the language.
• Each rule is defined like a kind of Java function (with a similar but different syntax):
void ruleName() :
{
}
{
}
• The prototype of the rule has the same syntax as the prototype of a Java function. Return type
and parameters are possible. Indeed, a rule may return a value or take values as parameters
when it is invoked.
• The first block after the colon character contains the declarations of the Java variables.
• The second block after the colon character contains the BNF productions (separated by the
« | » character).
• Each BNF production is a sequence of:
1. a token
• the value of the token can be retrieved with an assignment symbol (see
example below).
2. a BNF-rule invocation
• it is a call to one of the BNF-rule functions, using the Java syntax.
• the value returned by the BNF rule can be retrieved with an assignment symbol (see
example below).
• parameters can be passed to the BNF-rule function.
3. a Java block
The following example is a rule that recognizes a parenthesized numerical expression, a
number literal, or an identifier.
private Number factor() :
{
  Number value;
  Token t;
}
{
  "(" value = expression() ")"
  { return value; }
| t = <NUMBER>
  { return NumberUtil.parse(t.image); }
| t = <IDENTIFIER>
  { this.symbolTable.add(t.image, t.beginLine);
    return this.symbolTable.get(t.image);
  }
}
02.4 Run JavaCC
• On the CLI, type:
$> cd src/fr/utbm/gi/lo46/tp2/parser
$> javacc tinybasic.jj
03 | Working Steps
03.1 Skeleton
• Download the skeleton, and install it into your Eclipse.
• The skeleton contains: a symbol table (SymbolTable), a context of interpretation
(ExecutionContext), an interpreter (Interpreter and SyntaxTreeInterpreter), the main function
(TinyBasicInterpreter), and several abstract classes that correspond to the tree nodes of the
syntax tree.
• Take a look at the code.
03.2 JavaCC Scanner and Lexer Definition
• Write the part of the JavaCC input file that corresponds to the Lexer (the tokens).
03.3 Complete the JavaCC Definition
• Write the grammar inside your JavaCC input file (the BNF rules, with the Java blocks left empty).
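As a starting point (the rule and token names are only suggestions, and statement() is a rule you will write), the rule for one program line could look like this, reusing the NUMBER token sketched earlier and the basicLineNumber attribute declared in the PARSER_BEGIN block:

void line() :
{
  Token n;
}
{
  ( n = <NUMBER> { this.basicLineNumber = Integer.parseInt(n.image); } )?
  statement()
  <CR>
}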
03.4 Parse Tree Node
• Create classes for all the nodes that are required in a parse tree dedicated to the Tiny Basic
language.
• Each node of the parse tree must correspond to an element of the language.
• Each statement node should be linked to child nodes that correspond to the
parameters of the statement.
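For example (the class names below are purely illustrative; the skeleton provides its own abstract node classes), the node of the PRINT statement could be linked to its expression child like this:

/** Illustrative base class of the syntax tree nodes (the skeleton defines its own). */
public abstract class AbstractNode {
    // common services (source line, pretty-printing, ...) would be declared here
}

/** Node of the PRINT statement: its single child is the expression to print. */
class PrintStatementNode extends AbstractNode {
    private final AbstractNode expression;   // child node: the printed expression

    PrintStatementNode(AbstractNode expression) {
        this.expression = expression;
    }

    public AbstractNode getExpression() {
        return this.expression;
    }
}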
03.5 Write the SDD in JavaCC
• Update your JavaCC input file with the SDD rules that permit you to:
• check the types;
• generate the parse tree for the Tiny Basic program.
03.6 Interpretation of the Parse Tree
• Add to each node of your parse tree the functions run(...) and evaluate(...), which may be
invoked to run the parse tree as an interpreter and to compute the value of an expression.
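For instance (still with illustrative names, and using a plain map where the skeleton's ExecutionContext would normally hold the variables), a binary addition node could implement evaluate(...) as follows:

import java.util.Map;

/** Illustrative expression node offering the evaluate() service. */
abstract class ExpressionNode {
    /** Computes the value of this expression from the current variable values. */
    abstract Object evaluate(Map<String, Object> variables);
}

/** Binary addition: evaluates both children, then adds numbers or concatenates strings. */
class AdditionNode extends ExpressionNode {
    private final ExpressionNode left;
    private final ExpressionNode right;

    AdditionNode(ExpressionNode left, ExpressionNode right) {
        this.left = left;
        this.right = right;
    }

    @Override
    Object evaluate(Map<String, Object> variables) {
        Object l = this.left.evaluate(variables);
        Object r = this.right.evaluate(variables);
        if (l instanceof Number && r instanceof Number) {
            return ((Number) l).doubleValue() + ((Number) r).doubleValue();
        }
        return String.valueOf(l) + String.valueOf(r);   // string concatenation otherwise
    }
}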
03.7 Extension of the Language
• Update your compiler implementation to support the new statements:
• Statement: for id = num to num step num statements next id
• Arrays as id ( expression )
• The new keyword that returns the size of an array: LEN ( id )
• Statement: while ( expression relop expression ) do statements wend
LW4
DA53: COMPILERS AND LANGUAGE THEORY
Intermediate Code Generation
00 | Goal
The goal of this on-computer tutorial session is to write a small compiler that is able to compile a Tiny
Basic program and generate a three-address code program.
The development must be done in Java.
You must reuse your work from the previous on-computer tutorial sessions.
01 | Working Steps
01.1 Three-Address Code
• Select the three-address instructions to support and put them in the ThreeAddressInstruction
enumeration.
• Write the class ThreeAddressRecord. It is a record that contains the three-address instruction,
the parameters (String), and the result (String). A record may also be associated with a label,
and optionally with a comment. Note that during this tutorial, you should not compute the
addresses of the instructions and of the variables (this computation will be done during
the tutorial TU5).
• Write the class ThreeAddressCode. On the one hand, it is a list of three-address records. On the
other hand, it should permit you to (both classes are sketched after this list):
1. create temporary variables;
2. maintain a mapping between a Tiny Basic line and a three-address code line;
3. access the symbol table;
4. create labels.
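A possible minimal shape for these two classes is sketched below; the enumeration values, field names and method names are only suggestions:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Placeholder for the enumeration you have to define. */
enum ThreeAddressInstruction { ADD, SUB, MUL, DIV, COPY, GOTO, GOSUB, RETURN /* ... */ }

/** One three-address instruction, its parameters, result, optional label and comment. */
class ThreeAddressRecord {
    ThreeAddressInstruction instruction;
    String param1, param2;     // parameters (names of variables or constants)
    String result;             // name of the result variable
    String label;              // optional label attached to this record
    String comment;            // optional comment
}

/** The generated code: the list of records plus the services needed during generation. */
class ThreeAddressCode {
    private final List<ThreeAddressRecord> records = new ArrayList<>();
    private final Map<Integer, Integer> basicLineToRecord = new HashMap<>();
    private final SymbolTable symbolTable;    // access to the symbol table
    private int temporaryCount;
    private int labelCount;

    ThreeAddressCode(SymbolTable symbolTable) {
        this.symbolTable = symbolTable;
    }

    void add(ThreeAddressRecord record) { this.records.add(record); }

    /** Creates a fresh temporary variable name. */
    String createTemporary() { return "t" + (this.temporaryCount++); }

    /** Creates a fresh label name. */
    String createLabel() { return "L" + (this.labelCount++); }

    /** Remembers that the given Tiny Basic line starts at the next generated record. */
    void mapBasicLine(int basicLine) { this.basicLineToRecord.put(basicLine, this.records.size()); }

    SymbolTable getSymbolTable() { return this.symbolTable; }
}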
01.2 Three-Address Code Generation
• Update the classes of the syntax tree with the generate() function. This function puts into the
given ThreeAddressCode all the three-address statements that correspond to the syntax tree
node (see the sketch after this list).
• Note that the gosub and goto statements take a Tiny Basic line number. You must,
dynamically, retrieve the three-address code line that corresponds to a given Tiny Basic
line. This mapping is part of the code generated by the ThreeAddressCode class.
• Note that the symbol table may contain string literals (constants). You may define the three-address
variables that correspond to these constants prior to any execution of the generated program.
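For example, using the sketch above and assuming the operands have already been reduced to names, the node of a statement such as LET X = A + B could generate its records like this (illustrative only, not the required structure):

/** Illustrative generate() for "LET variable = left + right". */
class LetAdditionNode {
    String variable;      // name of the assigned variable
    String left, right;   // operand names (variables, constants or temporaries)

    void generate(ThreeAddressCode code) {
        // temp = left + right
        String temp = code.createTemporary();
        ThreeAddressRecord add = new ThreeAddressRecord();
        add.instruction = ThreeAddressInstruction.ADD;
        add.param1 = this.left;
        add.param2 = this.right;
        add.result = temp;
        add.comment = "LET " + this.variable;
        code.add(add);

        // variable = temp
        ThreeAddressRecord copy = new ThreeAddressRecord();
        copy.instruction = ThreeAddressInstruction.COPY;
        copy.param1 = temp;
        copy.result = this.variable;
        code.add(copy);
    }
}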