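Compiler Principles Study Notes
I. The Lexer
The lexical analysis process
- Lexical analysis turns a character stream into a token stream. Input: source code (a character stream). Output: a token stream.
- The process resembles the part-of-speech tagging exercises from school grammar lessons: each token is a tuple that holds at least a string (the lexeme) and a description of its category.
Tokens (lexical units)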
- The lexer produces a sequence of symbols called tokens, also known as lexical units.
- Mathematically, a token is a tuple; for example, the integer 123 can be written as (123, Integer).
- Token types
  - Keyword
  - Variable
  - Operator
  - Bracket
  - String
  - Float
  - Integer
  - Boolean
Implementing the lexer's basic interfaces
- Implement the token type enum
public enum TokenType {
    KEYWORD,
    VARIABLE,
    OPERATOR,
    BRACKET,
    STRING,
    FLOAT,
    INTEGER,
    BOOLEAN
}
- Implement the keyword dictionary
import java.util.Arrays;
import java.util.HashSet;

/**
 * @author :LY
 * @date :Created in 2021/3/27 15:47
 * @modified By:
 */
public class Keywords {
    static String[] keywords = {
        "var",
        "if",
        "else",
        "for",
        "while",
        "break",
        "func",
        "return"
    };
    // A hash set gives constant-time keyword lookup.
    static HashSet<String> set = new HashSet<>(Arrays.asList(keywords));

    public static boolean isKeyword(String word) {
        return set.contains(word);
    }
}
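- Define the lexer's programming interface
Lexicon and grammar
- The lexicon describes how words are formed (which categories exist, which characters are allowed, which words exist); the grammar describes how sentences are built from those words.
- When building a compiler, we usually describe the lexicon with regular expressions and then implement those regular expressions as state machines.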
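To make the link between a regular expression and a state machine concrete, here is a minimal sketch that checks the identifier pattern [_a-zA-Z][_a-zA-Z0-9]* (the one used later in these notes) in two ways: with java.util.regex and with a hand-written two-state machine. The class and method names (IdentifierMatcher, matchesRegex, matchesDfa) are made up for this illustration and are not part of the lexer.

```java
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class IdentifierMatcher {
    private static final Pattern IDENT = Pattern.compile("[_a-zA-Z][_a-zA-Z0-9]*");

    // Declarative version: let the regex engine decide.
    static boolean matchesRegex(String s) {
        return IDENT.matcher(s).matches();
    }

    // Hand-written state machine for the same pattern.
    // State 0: nothing read yet. State 1: inside an identifier (accepting).
    static boolean matchesDfa(String s) {
        int state = 0;
        for (char c : s.toCharArray()) {
            boolean letter = c == '_' || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
            boolean digit = c >= '0' && c <= '9';
            switch (state) {
                case 0:
                    if (!letter) return false;   // the first character must be a letter or '_'
                    state = 1;
                    break;
                case 1:
                    if (!letter && !digit) return false;
                    break;
            }
        }
        return state == 1;   // the empty string is rejected
    }

    public static void main(String[] args) {
        for (String s : new String[] {"foo_1", "_x", "1abc", ""}) {
            System.out.println("\"" + s + "\" regex=" + matchesRegex(s) + " dfa=" + matchesDfa(s));
        }
    }
}
```

Both methods should agree on every input; the state machine simply spells out what the regex engine does internally.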
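Strings and languages
- Alphabet: the set of all characters that language L allows (for example ASCII, or Unicode encoded as UTF-8).
- A string is a finite sequence of symbols drawn from the alphabet of L; the empty string is usually written with the Greek letter ε (epsilon).
- Not every string over the alphabet belongs to the language, so we describe the valid strings with constraint rules; regular expressions are one such notation.
The goal of the lexer
- Given a programming language L and the full vocabulary it supports, find those words in the source code and label each one with its category.
- If the source code contains a word that L does not support, report an error to the user.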
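The lexer in practice
Before writing the lexer itself, a few helpers are needed.
Character classification with regular expressions:
import java.util.regex.Pattern;

/**
 * @author :LY
 * @date :Created in 2021/3/27 15:30
 * @modified By:
 */
public class AlphabetHelper {
    static Pattern ptnLetter = Pattern.compile("^[a-zA-Z]$");
    static Pattern ptnNumber = Pattern.compile("^[0-9]$");
    // '-' comes first so that it is a literal rather than a range operator.
    // (The earlier form "[+-\\\\*...]" defined a range from '+' to '\', which also
    // matched digits, uppercase letters, '.', ':' and so on.)
    // ',' and ';' are listed explicitly because makeOp treats them as operators.
    static Pattern ptnOperator = Pattern.compile("^[-+*/<>=!&|^%,;]$");
    static Pattern ptnLiteral = Pattern.compile("^[_a-zA-Z0-9]$");

    // Matches a letter
    public static boolean isLetter(char c) {
        return ptnLetter.matcher(c + "").matches();
    }

    // Matches a digit
    public static boolean isNumber(char c) {
        return ptnNumber.matcher(c + "").matches();
    }

    // Matches an operator character
    public static boolean isOperator(char c) {
        return ptnOperator.matcher(c + "").matches();
    }

    // Matches an underscore, a letter, or a digit
    public static boolean isLiteral(char c) {
        return ptnLiteral.matcher(c + "").matches();
    }
}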
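With the character helpers written, we also need an iterator that wraps a Stream and supports lookahead:
import java.util.Iterator;
import java.util.LinkedList;
import java.util.stream.Stream;

/**
 * Shared low-level stream wrapper, used as a basic data structure of the compiler.
 * It keeps two linked lists: a queue that caches up to 10 values already read,
 * and a stack that holds the values returned by putBack(). next() serves the stack first.
 * putBack() moves the most recently cached value onto the stack, which rewinds the
 * iterator by one step: the following next() returns the top of the stack.
 *
 * @author :LY
 * @date :Created in 2021/3/27 10:29
 * @modified By:
 */
public class PeekIterator<T> implements Iterator<T> {
    private Iterator<T> iterator;
    // FIFO queue caching values already read from the stream (at most 10)
    private LinkedList<T> queueCache = new LinkedList<>();
    // LIFO stack holding values that were put back; reads are served from here first
    private LinkedList<T> stackPutBack = new LinkedList<>();
    private final static int CACHE_SIZE = 10;
    // Custom end-of-stream marker
    private T _endToken = null;

    // Builds a PeekIterator from a stream
    public PeekIterator(Stream<T> stream) {
        iterator = stream.iterator();
    }

    // Optionally takes an endToken that is returned once when the stream is exhausted
    public PeekIterator(Stream<T> stream, T endToken) {
        iterator = stream.iterator();
        _endToken = endToken;
    }

    // Returns the next value without consuming it
    public T peek() {
        // If the put-back stack has a value, return its top
        if (this.stackPutBack.size() > 0) {
            return this.stackPutBack.getFirst();
        }
        // Otherwise, if the underlying iterator is exhausted, return the end marker
        if (!iterator.hasNext()) {
            return _endToken;
        }
        // Otherwise read the next value and immediately put it back
        T val = next();
        this.putBack();
        return val;
    }

    // Cache:           A -> B -> C -> D
    // Put-back order:  D -> C -> B -> A
    // Rewinds one step: moves the most recently cached value onto the put-back stack
    public void putBack() {
        if (this.queueCache.size() > 0) {
            this.stackPutBack.push(this.queueCache.pollLast());
        }
    }

    // True while the end marker, a put-back value, or a stream value remains
    @Override
    public boolean hasNext() {
        return _endToken != null || this.stackPutBack.size() > 0 || iterator.hasNext();
    }

    // Returns the next value: from the put-back stack if it is not empty,
    // otherwise from the underlying iterator. The value is then cached,
    // evicting the oldest cached values if the cache is full.
    @Override
    public T next() {
        T val = null;
        if (this.stackPutBack.size() > 0) {
            // Pop the top of the put-back stack
            val = this.stackPutBack.pop();
        } else {
            // Stream exhausted: return the end marker once, then clear it
            if (!this.iterator.hasNext()) {
                T tmp = _endToken;
                _endToken = null;
                return tmp;
            }
            val = iterator.next();
        }
        // Make room in the cache if it is full.
        // poll() behaves like pop(), except that it returns null on an empty list instead of throwing.
        while (queueCache.size() > CACHE_SIZE - 1) {
            queueCache.poll();
        }
        // The cache now has room, so record the value
        queueCache.add(val);
        return val;
    }
}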
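A short trace may help show how peek, next, and putBack interact. The sketch below walks the string "abc" and records every character it observes. So that it compiles on its own, it carries a condensed copy of the PeekIterator above; the PeekIteratorDemo class and its trace method are hypothetical names used only here.

```java
import java.util.Iterator;
import java.util.LinkedList;
import java.util.stream.Stream;

// Hypothetical demo class, for illustration only.
final class PeekIteratorDemo {
    // Returns the characters in the order the calls below observe them.
    static String trace() {
        PeekIterator<Character> it =
                new PeekIterator<Character>("abc".chars().mapToObj(x -> (char) x), (char) 0);
        StringBuilder out = new StringBuilder();
        out.append(it.next());   // consumes 'a'
        out.append(it.peek());   // looks at 'b' without consuming it
        out.append(it.next());   // consumes 'b'
        out.append(it.next());   // consumes 'c'
        it.putBack();            // rewinds over 'c'
        it.putBack();            // rewinds over 'b'
        out.append(it.next());   // 'b' again
        out.append(it.next());   // 'c' again
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(trace());
    }
}

// Condensed copy of the PeekIterator shown above (same logic, comments removed).
class PeekIterator<T> implements Iterator<T> {
    private static final int CACHE_SIZE = 10;
    private final Iterator<T> iterator;
    private final LinkedList<T> queueCache = new LinkedList<>();
    private final LinkedList<T> stackPutBack = new LinkedList<>();
    private T endToken;

    PeekIterator(Stream<T> stream, T endToken) {
        this.iterator = stream.iterator();
        this.endToken = endToken;
    }

    public T peek() {
        if (!stackPutBack.isEmpty()) return stackPutBack.getFirst();
        if (!iterator.hasNext()) return endToken;
        T val = next();
        putBack();
        return val;
    }

    public void putBack() {
        if (!queueCache.isEmpty()) stackPutBack.push(queueCache.pollLast());
    }

    @Override
    public boolean hasNext() {
        return endToken != null || !stackPutBack.isEmpty() || iterator.hasNext();
    }

    @Override
    public T next() {
        T val;
        if (!stackPutBack.isEmpty()) {
            val = stackPutBack.pop();
        } else {
            if (!iterator.hasNext()) {
                T tmp = endToken;
                endToken = null;
                return tmp;
            }
            val = iterator.next();
        }
        while (queueCache.size() > CACHE_SIZE - 1) queueCache.poll();
        queueCache.add(val);
        return val;
    }
}
```

Because putBack takes values from the end of the cache, two consecutive calls rewind two characters, and the following reads replay them in their original order.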
Whatever the lexer extracts also needs to be wrapped in a class:
/**
 * @author :LY
 * @date :Created in 2021/3/26 17:01
 * @modified By:
 */
public class Token {
    TokenType _type;
    String _value;

    public Token(TokenType _type, String _value) {
        this._type = _type;
        this._value = _value;
    }

    public TokenType getType() {
        return _type;
    }

    public String getValue() {
        return _value;
    }

    @Override
    public String toString() {
        return "Token{" +
                "_type=" + _type +
                ", _value='" + _value + '\'' +
                '}';
    }

    // True if the token is a variable
    public boolean isVariable() {
        return _type == TokenType.VARIABLE;
    }

    // True if the token is a literal value (integer, float, string, or boolean)
    public boolean isScalar() {
        return _type == TokenType.INTEGER || _type == TokenType.FLOAT ||
                _type == TokenType.STRING || _type == TokenType.BOOLEAN;
    }

    // True if the token is a number
    public boolean isNumber() {
        return this._type == TokenType.INTEGER || this._type == TokenType.FLOAT;
    }

    // True if the token is an operator
    public boolean isOperator() {
        return this._type == TokenType.OPERATOR;
    }
}
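Extracting words with finite state machines
- Extracting keywords and variable names
  - Keywords and variable names both start with a letter or an underscore, but they are classified differently.
  - Regular expression: [_a-zA-Z][_a-zA-Z0-9]*
  - State machine description: (diagram not reproduced here)
Code:
With the three pieces above in place (character helpers, iterator, and Token), the basic environment is ready and we can write the lexer's modules.
Following the state machine above, here is the module for keywords and variable names (Java). This method, like the other make* methods below, is a static member of Token, which is how the Lexer calls it: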
/**
 * Extracts a variable name or a keyword.
 * @param it character iterator
 * @return the extracted token
 */
public static Token makeVarOrKeyword(PeekIterator<Character> it) {
    // A StringBuilder avoids the cost of repeated string concatenation
    StringBuilder stringBuilder = new StringBuilder();
    while (it.hasNext()) {
        // Look at the next character; peek does not consume it
        Character lookahead = it.peek();
        // Append it if it is an underscore, letter, or digit; otherwise stop
        if (AlphabetHelper.isLiteral(lookahead)) {
            stringBuilder.append(lookahead);
        } else {
            break;
        }
        it.next();
        // Loop invariant: every character consumed so far is in stringBuilder
    }
    String s = stringBuilder.toString();
    // Is it a keyword?
    if (Keywords.isKeyword(s)) {
        return new Token(TokenType.KEYWORD, s);
    }
    // Boolean literals are handled separately
    if (s.equals("true") || s.equals("false")) {
        return new Token(TokenType.BOOLEAN, s);
    }
    return new Token(TokenType.VARIABLE, s);
}
- String extraction state machine
The code below follows the string extraction state machine:
/**
 * Extracts a string literal.
 * @param it character iterator
 * @return the extracted token, including its surrounding quotes
 * @throws LexicalException if the input ends before the closing quote
 */
public static Token makeString(PeekIterator<Character> it) throws LexicalException {
    StringBuilder s = new StringBuilder();
    int state = 0;
    while (it.hasNext()) {
        char c = it.next();
        switch (state) {
            case 0:
                // Initial state: the caller guarantees the first character is a quote.
                // A double quote moves to state 1, a single quote to state 2;
                // the quote itself is added to the result.
                if (c == '\"') {
                    state = 1;
                } else {
                    state = 2;
                }
                s.append(c);
                break;
            case 1:
                // Inside a double-quoted string: a closing " ends the token,
                // anything else is appended because the string is not finished yet
                if (c == '"') {
                    return new Token(TokenType.STRING, s.toString() + c);
                } else {
                    s.append(c);
                }
                break;
            case 2:
                // Inside a single-quoted string: same logic as above
                if (c == '\'') {
                    return new Token(TokenType.STRING, s.toString() + c);
                } else {
                    s.append(c);
                }
                break;
        }
    }
    throw new LexicalException("Unexpected error");
}
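These notes never show LexicalException itself. Judging only from how it is called (sometimes with a String, sometimes with a char), a minimal version might look like the sketch below. The message format for the char constructor is an assumption.

```java
// A minimal sketch of the exception used by the make* methods.
// Only the two constructor shapes are taken from the calls in these notes;
// the wording of the message is assumed.
class LexicalException extends Exception {
    // Used when a specific character cannot be part of any token
    public LexicalException(char c) {
        super("Unexpected character " + c);
    }

    // Used for general lexing errors
    public LexicalException(String message) {
        super(message);
    }
}
```

Making it a checked exception matches the `throws LexicalException` clauses on makeString, makeNumber, makeOp, and Lexer.analyse.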
- Number extraction state machine
Extracting numbers is slightly more involved, because decimals and signed numbers must be handled:
/**
 * Extracts a number.
 * @param it character iterator
 * @return an INTEGER or FLOAT token
 * @throws LexicalException if the characters do not form a valid number
 */
public static Token makeNumber(PeekIterator<Character> it) throws LexicalException {
    StringBuilder s = new StringBuilder();
    int state = 0;
    while (it.hasNext()) {
        char lookahead = it.peek();
        switch (state) {
            case 0: // start
                if (lookahead == '0') {
                    state = 1;
                } else if (AlphabetHelper.isNumber(lookahead)) {
                    state = 2;
                } else if (lookahead == '+' || lookahead == '-') {
                    state = 3;
                } else if (lookahead == '.') {
                    state = 5;
                }
                break;
            case 1: // leading zeros
                if (lookahead == '0') {
                    state = 1;
                } else if (AlphabetHelper.isNumber(lookahead)) {
                    state = 2;
                } else if (lookahead == '.') {
                    state = 4;
                } else {
                    return new Token(TokenType.INTEGER, s.toString());
                }
                break;
            case 2: // integer digits
                if (AlphabetHelper.isNumber(lookahead)) {
                    state = 2;
                } else if (lookahead == '.') {
                    state = 4;
                } else {
                    return new Token(TokenType.INTEGER, s.toString());
                }
                break;
            case 3: // after a sign
                if (AlphabetHelper.isNumber(lookahead)) {
                    state = 2;
                } else if (lookahead == '.') {
                    state = 5;
                } else {
                    throw new LexicalException(lookahead);
                }
                break;
            case 4: // digits followed by '.'
                if (lookahead == '.') {
                    throw new LexicalException(lookahead);
                } else if (AlphabetHelper.isNumber(lookahead)) {
                    state = 20;
                } else {
                    return new Token(TokenType.FLOAT, s.toString());
                }
                break;
            case 5: // '.' with no digits before it
                if (AlphabetHelper.isNumber(lookahead)) {
                    state = 20;
                } else {
                    throw new LexicalException(lookahead);
                }
                break;
            case 20: // fractional digits
                if (AlphabetHelper.isNumber(lookahead)) {
                    state = 20;
                } else if (lookahead == '.') {
                    throw new LexicalException(lookahead);
                } else {
                    return new Token(TokenType.FLOAT, s.toString());
                }
        }
        // The lookahead belongs to the number: consume it and record it
        it.next();
        s.append(lookahead);
    }
    throw new LexicalException("Unexpected error");
}
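As a cross-check, the complete lexemes that the state machine above accepts can also be described by two regular expressions: an optional sign followed by digits is an integer, and an optional sign followed by either digits-dot-optional-digits or dot-digits is a float. The sketch below classifies whole strings that way. NumberShape and classify are hypothetical names, and unlike makeNumber this helper looks at a finished string rather than stopping at the first character that cannot extend the number.

```java
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class NumberShape {
    // [+-]?digits                      e.g. 100, -7, 007
    private static final Pattern INTEGER = Pattern.compile("[+-]?[0-9]+");
    // [+-]?(digits '.' digits? | '.' digits)   e.g. -100.0, 1., +.5
    private static final Pattern FLOAT = Pattern.compile("[+-]?([0-9]+\\.[0-9]*|\\.[0-9]+)");

    static String classify(String s) {
        if (INTEGER.matcher(s).matches()) return "INTEGER";
        if (FLOAT.matcher(s).matches()) return "FLOAT";
        return "INVALID";
    }

    public static void main(String[] args) {
        for (String s : new String[] {"100", "-100.0", "+.5", "1.", ".", "+", "1.2.3"}) {
            System.out.println(s + " -> " + classify(s));
        }
    }
}
```

Inputs such as "." or "+" on their own, and "1.2.3", fall outside both patterns; these are the same cases in which the state machine throws a LexicalException.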
- Operator extraction state machine (the diagram is too large and is omitted)
  - Extracting operators is fairly simple: we only need to get the two-character combinations right:
/**
 * Extracts an operator.
 * @param it character iterator
 * @return the extracted token
 * @throws LexicalException if the input ends unexpectedly
 */
public static Token makeOp(PeekIterator<Character> it) throws LexicalException {
    int state = 0;
    while (it.hasNext()) {
        char lookahead = it.next();
        switch (state) {
            case 0:
                switch (lookahead) {
                    case '+':
                        state = 1;
                        break;
                    case '-':
                        state = 2;
                        break;
                    case '*':
                        state = 3;
                        break;
                    case '/':
                        state = 4;
                        break;
                    case '>':
                        state = 5;
                        break;
                    case '<':
                        state = 6;
                        break;
                    case '=':
                        state = 7;
                        break;
                    case '!':
                        state = 8;
                        break;
                    case '&':
                        state = 9;
                        break;
                    case '|':
                        state = 10;
                        break;
                    case '^':
                        state = 11;
                        break;
                    case '%':
                        state = 12;
                        break;
                    case ',':
                        return new Token(TokenType.OPERATOR, ",");
                    case ';':
                        return new Token(TokenType.OPERATOR, ";");
                }
                break;
            case 1:
                if (lookahead == '+') {
                    return new Token(TokenType.OPERATOR, "++");
                } else if (lookahead == '=') {
                    return new Token(TokenType.OPERATOR, "+=");
                } else {
                    // This character cannot combine with the operator, so put it back
                    // (which turns the read into a peek) and return the single-character operator
                    it.putBack();
                    return new Token(TokenType.OPERATOR, "+");
                }
            case 2:
                if (lookahead == '-') {
                    return new Token(TokenType.OPERATOR, "--");
                } else if (lookahead == '=') {
                    return new Token(TokenType.OPERATOR, "-=");
                } else {
                    it.putBack();
                    return new Token(TokenType.OPERATOR, "-");
                }
            case 3:
                if (lookahead == '=') {
                    return new Token(TokenType.OPERATOR, "*=");
                } else {
                    it.putBack();
                    return new Token(TokenType.OPERATOR, "*");
                }
            case 4:
                if (lookahead == '=') {
                    return new Token(TokenType.OPERATOR, "/=");
                } else {
                    it.putBack();
                    return new Token(TokenType.OPERATOR, "/");
                }
            case 5:
                if (lookahead == '=') {
                    return new Token(TokenType.OPERATOR, ">=");
                } else if (lookahead == '>') {
                    return new Token(TokenType.OPERATOR, ">>");
                } else {
                    it.putBack();
                    return new Token(TokenType.OPERATOR, ">");
                }
            case 6:
                if (lookahead == '=') {
                    return new Token(TokenType.OPERATOR, "<=");
                } else if (lookahead == '<') {
                    // "<<" requires a second '<' (the earlier version tested for '>')
                    return new Token(TokenType.OPERATOR, "<<");
                } else {
                    it.putBack();
                    return new Token(TokenType.OPERATOR, "<");
                }
            case 7:
                if (lookahead == '=') {
                    return new Token(TokenType.OPERATOR, "==");
                } else {
                    it.putBack();
                    return new Token(TokenType.OPERATOR, "=");
                }
            case 8:
                if (lookahead == '=') {
                    return new Token(TokenType.OPERATOR, "!=");
                } else {
                    it.putBack();
                    return new Token(TokenType.OPERATOR, "!");
                }
            case 9:
                if (lookahead == '&') {
                    return new Token(TokenType.OPERATOR, "&&");
                } else if (lookahead == '=') {
                    return new Token(TokenType.OPERATOR, "&=");
                } else {
                    it.putBack();
                    return new Token(TokenType.OPERATOR, "&");
                }
            case 10:
                if (lookahead == '|') {
                    return new Token(TokenType.OPERATOR, "||");
                } else if (lookahead == '=') {
                    return new Token(TokenType.OPERATOR, "|=");
                } else {
                    it.putBack();
                    return new Token(TokenType.OPERATOR, "|");
                }
            case 11:
                if (lookahead == '^') {
                    return new Token(TokenType.OPERATOR, "^^");
                } else if (lookahead == '=') {
                    return new Token(TokenType.OPERATOR, "^=");
                } else {
                    it.putBack();
                    return new Token(TokenType.OPERATOR, "^");
                }
            case 12:
                if (lookahead == '=') {
                    return new Token(TokenType.OPERATOR, "%=");
                } else {
                    it.putBack();
                    return new Token(TokenType.OPERATOR, "%");
                }
        }
    }
    throw new LexicalException("Unexpected error");
}
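The state machine above always prefers the longer operator when two characters combine, a rule often called longest match (or maximal munch). As an alternative view of the same rule, the sketch below looks the two-character candidate up in a set and falls back to the single character otherwise. This is a table-driven illustration, not the method used in these notes; OperatorMunch and longestOperatorAt are hypothetical names, and the caller is assumed to pass a position that holds an operator character.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, for illustration only.
final class OperatorMunch {
    // The two-character operator strings that makeOp emits
    private static final Set<String> TWO_CHAR = new HashSet<>(Arrays.asList(
            "++", "+=", "--", "-=", "*=", "/=", ">=", ">>", "<=", "<<",
            "==", "!=", "&&", "&=", "||", "|=", "^^", "^=", "%="));

    // Returns the longest operator that starts at src[pos].
    // Assumes src.charAt(pos) is an operator character.
    static String longestOperatorAt(String src, int pos) {
        if (pos + 1 < src.length()) {
            String two = src.substring(pos, pos + 2);
            if (TWO_CHAR.contains(two)) {
                return two;               // prefer the longer match
            }
        }
        return src.substring(pos, pos + 1);   // fall back to one character
    }

    public static void main(String[] args) {
        System.out.println(longestOperatorAt("a+=b", 1));
        System.out.println(longestOperatorAt("a+b", 1));
        System.out.println(longestOperatorAt("x<<2", 1));
    }
}
```

The hand-written state machine does the same thing one character at a time, which is why it needs putBack when the second character turns out not to belong to the operator.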
- Assembling the lexer
Combine all the modules, and add comment removal and bracket extraction:
import java.util.ArrayList;
import java.util.stream.Stream;

/**
 * The lexer
 * @author :LY
 * @date :Created in 2021/3/26 17:07
 * @modified By:
 */
public class Lexer {
    public ArrayList<Token> analyse(Stream<Character> source) throws LexicalException {
        ArrayList<Token> tokens = new ArrayList<>();
        PeekIterator<Character> it = new PeekIterator<Character>(source, (char) 0);
        while (it.hasNext()) {
            char c = it.next();
            if (c == 0) {
                break;
            }
            char lookahead = it.peek();
            if (c == ' ' || c == '\n') {
                continue;
            }
            // Strip comments
            if (c == '/') {
                if (lookahead == '/') {
                    // Line comment: skip to the end of the line, then restart the loop
                    while (it.hasNext() && (c = it.next()) != '\n') { }
                    continue;
                } else if (lookahead == '*') {
                    // Consume the '*' of the opening "/*" so that "/*/" is not taken as a complete comment
                    it.next();
                    boolean valid = false;
                    while (it.hasNext()) {
                        char p = it.next();
                        if (p == '*' && it.peek() == '/') {
                            it.next();
                            valid = true;
                            break;
                        }
                    }
                    if (!valid) {
                        throw new LexicalException("comments not match");
                    }
                    continue;
                }
            }
            if (c == '{' || c == '}' || c == '(' || c == ')') {
                tokens.add(new Token(TokenType.BRACKET, c + ""));
                continue;
            }
            if (c == '"' || c == '\'') {
                it.putBack();
                tokens.add(Token.makeString(it));
                continue;
            }
            // Identifiers may start with a letter or an underscore, per [_a-zA-Z][_a-zA-Z0-9]*
            if (AlphabetHelper.isLetter(c) || c == '_') {
                it.putBack();
                tokens.add(Token.makeVarOrKeyword(it));
                continue;
            }
            if (AlphabetHelper.isNumber(c)) {
                it.putBack();
                tokens.add(Token.makeNumber(it));
                continue;
            }
            // '+', '-' and '.' may start a number. Cases to consider: 3+5, +5, 3 * -5, .5
            if ((c == '+' || c == '-' || c == '.') && AlphabetHelper.isNumber(lookahead)) {
                // Look at the last token produced so far
                Token lastToken = tokens.size() == 0 ? null : tokens.get(tokens.size() - 1);
                // The sign belongs to the number only if the previous token is not a value
                // (variable or literal); otherwise "a-1" would lex as "a" followed by "-1"
                if (lastToken == null || !(lastToken.isVariable() || lastToken.isScalar())) {
                    it.putBack();
                    tokens.add(Token.makeNumber(it));
                    continue;
                }
            }
            if (AlphabetHelper.isOperator(c)) {
                it.putBack();
                tokens.add(Token.makeOp(it));
                continue;
            }
            throw new LexicalException(c);
        }
        return tokens;
    }
}
Let's test it:
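The test that follows calls an assertToken helper that these notes do not show. A minimal sketch of what it might look like is given below. To keep the example self-contained it uses small stand-in copies of TokenType and Token and throws a plain AssertionError instead of relying on a test framework; TokenAssertions is a hypothetical name, and in the real project the helper would presumably use the project's own Token class and the assertions of its test library.

```java
// Hypothetical helper, for illustration only.
final class TokenAssertions {
    // Stand-ins for the TokenType and Token classes defined earlier in these notes
    enum TokenType { KEYWORD, VARIABLE, OPERATOR, BRACKET, STRING, FLOAT, INTEGER, BOOLEAN }

    static final class Token {
        private final TokenType type;
        private final String value;

        Token(TokenType type, String value) {
            this.type = type;
            this.value = value;
        }

        TokenType getType() { return type; }
        String getValue() { return value; }
    }

    // Fails with an AssertionError unless the token has the expected value and type
    static void assertToken(Token token, String value, TokenType type) {
        if (!value.equals(token.getValue())) {
            throw new AssertionError("expected value " + value + " but was " + token.getValue());
        }
        if (type != token.getType()) {
            throw new AssertionError("expected type " + type + " but was " + token.getType());
        }
    }

    public static void main(String[] args) {
        assertToken(new Token(TokenType.KEYWORD, "func"), "func", TokenType.KEYWORD);
        System.out.println("ok");
    }
}
```

The argument order (token, expected value, expected type) mirrors the calls in the test.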
@Test
public void test_function() throws LexicalException {
    String source = "func foo(a,b){\n" +
            "print(a+b)\n" +
            "}\n" +
            "foo(-100.0,100)";
    Lexer lexer = new Lexer();
    ArrayList<Token> tokens = lexer.analyse(source.chars().mapToObj(x -> (char) x));
    assertToken(tokens.get(0), "func", TokenType.KEYWORD);
    assertToken(tokens.get(1), "foo", TokenType.VARIABLE);
    assertToken(tokens.get(2), "(", TokenType.BRACKET);
    assertToken(tokens.get(3), "a", TokenType.VARIABLE);
    assertToken(tokens.get(4), ",", TokenType.OPERATOR);
    assertToken(tokens.get(5), "b", TokenType.VARIABLE);
    assertToken(tokens.get(6), ")", TokenType.BRACKET);
    assertToken(tokens.get(7), "{", TokenType.BRACKET);
    assertToken(tokens.get(8), "print", TokenType.VARIABLE);
    assertToken(tokens.get(9), "(", TokenType.BRACKET);
    assertToken(tokens.get(10), "a", TokenType.VARIABLE);
    assertToken(tokens.get(11), "+", TokenType.OPERATOR);
    assertToken(tokens.get(12), "b", TokenType.VARIABLE);
    assertToken(tokens.get(13), ")", TokenType.BRACKET);
    assertToken(tokens.get(14), "}", TokenType.BRACKET);
    assertToken(tokens.get(15), "foo", TokenType.VARIABLE);
    assertToken(tokens.get(16), "(", TokenType.BRACKET);
    assertToken(tokens.get(17), "-100.0", TokenType.FLOAT);
    assertToken(tokens.get(18), ",", TokenType.OPERATOR);
    assertToken(tokens.get(19), "100", TokenType.INTEGER);
    assertToken(tokens.get(20), ")", TokenType.BRACKET);
}
Result: