Lexical Analysis and Regular Expressions (Part 3)

Latest recommended article published on 2022-10-22 10:34:29

汪星人来地球

Article link: https://blog.csdn.net/hedan2013/article/details/54410256

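Anyone with a basic understanding of compiler principles knows that regular expressions and finite-state automata are equivalent; in other words, they recognize the same class of languages. For the lexical analysis of C, however, we do not need the theory of finite-state automata, so that material is omitted. This article describes how to implement lexical analysis for C. Note that we do not intend to implement every feature of C: doing so would be time-consuming and entirely unnecessary. Implementing a few simple features is enough to understand the overall picture of compiler construction.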

Our lexer recognizes the following keywords:

int

bool

void

true

false

if

else

for

while

continue

break

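It recognizes the following operators:

+, - (unary)

++, -- (both prefix and postfix)

+, -, *, / (binary)

>, <, >=, <=, ==, !=

&&, ||, !

=, +=, -=, *=, /=

In addition, we need to define parentheses and braces ('(', ')', '{', '}'), the comma (','), and the semicolon (';').

It recognizes the following constants:

Integers and Booleans (true, false).

To represent these different kinds of tokens, we define a series of constants.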

const int INT = 0;
const int BOOL = 1;
const int VOID = 2;
const int TRUE = 3;
const int FALSE = 4;
const int IF = 5;
const int ELSE = 6;
const int FOR = 7;
const int WHILE = 8;
const int CONTINUE = 9;
const int BREAK = 10;

const int INC = 11;
const int DEC = 12;

const int ADD = 13;
const int SUBTRACT = 14;
const int MULTI = 15;
const int DIVIDE = 16;

const int GT = 17;
const int GE = 18;
const int LT = 19;
const int LE = 20;
const int EQ = 21;
const int NE = 22;

const int AND = 23;
const int OR = 24;
const int NOT = 25;

const int ASSIGN = 26;
const int ASSIGN_ADD = 27;
const int ASSIGN_SUBTRACT = 28;
const int ASSIGN_MULTI = 29;
const int ASSIGN_DIVIDE = 30;

const int PARENTHESIS_START = 31;
const int PARENTHESIS_END = 32;
const int BRACE_START = 33;
const int BRACE_END = 34;
const int COMMA = 35;
const int SEMICOLON = 36;

const int ID = 37;
const int CONST = 38;

const int POSITIVE = 39;
const int NEGATIVE = 40;

const int NONE = 41;

The code is structured much like the comment-removal code. The main logic is defined in the Token class and the BufferedReader class.

1. Token class

class Token{
public:
	int type, value;
	char* p_name;	// variable name; NULL when the token has no name
	Token(int type = NONE, int value = -1, char* p_name = NULL);
	void print();
	~Token();
};

The Token class gains a new member variable, p_name, whose main purpose is to store the variable name.
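
The article does not show how the constructor and destructor are implemented. Below is one possible sketch, assuming that a Token owns the buffer behind p_name and frees it in its destructor. The deleted copy operations, the move constructor, and the helper makeIdToken are our own additions for illustration, not part of the original code, and they require C++11.

```cpp
#include <cstdio>
#include <cstring>

// Token type codes, copied from the constant table above.
const int ID = 37;
const int NONE = 41;

class Token {
public:
    int type, value;
    char* p_name;   // owned by the Token; nullptr when there is no name

    Token(int type = NONE, int value = -1, char* p_name = nullptr)
        : type(type), value(value), p_name(p_name) {}

    // Copying is disabled so that two Tokens never free the same buffer.
    Token(const Token&) = delete;
    Token& operator=(const Token&) = delete;

    // Moving hands the buffer over and leaves the source without a name.
    Token(Token&& other) noexcept
        : type(other.type), value(other.value), p_name(other.p_name) {
        other.p_name = nullptr;
    }

    ~Token() { delete[] p_name; }   // delete[] on nullptr is a no-op

    void print() const {
        if (p_name != nullptr)
            std::printf("type=%d name=%s\n", type, p_name);
        else
            std::printf("type=%d value=%d\n", type, value);
    }
};

// Hypothetical helper: builds an ID token from a C string by copying it.
Token makeIdToken(const char* name) {
    char* copy = new char[std::strlen(name) + 1];
    std::strcpy(copy, name);
    return Token(ID, -1, copy);
}
```

Without some rule like this, returning a Token by value from readToken could end up with two objects pointing at the same buffer, and both destructors would try to free it.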

2. BufferedReader class

The main interface function of the BufferedReader class is readToken.

Token BufferedReader::readToken()
{
	// int rather than char: EOF is an int value and may not fit in a char
	int ch = readChar();

	// Characters that can start an operator or a punctuation mark
	if(ch == '+' || ch == '-' || ch == '*' || ch == '/' ||
	   ch == '>' || ch == '<' || ch == '=' || ch == '!' ||
	   ch == '&' || ch == '|' ||
	   ch == '(' || ch == ')' || ch == '{' || ch == '}' ||
	   ch == ',' || ch == ';')
	{
		pushChar(ch);
		return this->readOperator();
	}
	else if(ch >= '0' && ch <= '9')
	{
		pushChar(ch);
		return this->readInt();
	}
	else if((ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z') || ch == '_')
	{
		pushChar(ch);
		return this->readIDOrKey();
	}
	else if(ch == EOF)
	{
		return Token(EOF);
	}
	else
	{
		// Any other character (whitespace, for example) yields a NONE token
		return Token(NONE);
	}
}

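readToken reads one character from the input stream and, based on that character, decides what type of Token comes next. This is the one-character lookahead property mentioned earlier. The function calls the following functions:

readInt: reads an integer constant.

readIDOrKey: reads a keyword or a user-defined variable name.

readOperator: reads an operator.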

1. readInt

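This code is fairly simple, so it is omitted. Note, however, that an integer such as "-1" is split into two parts: readOperator is called first to read the "-", and then readInt is called to read the integer. Combining the two parts is left to the parser.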
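
Since readInt is omitted, here is a minimal sketch of what its core loop might look like. It is not the original implementation: StringReader is a hypothetical stand-in for BufferedReader that reads from an in-memory string, and readIntValue returns a plain int where the real function would presumably wrap the value in a Token of type CONST. Overflow is not checked.

```cpp
#include <cstdio>
#include <string>

// Hypothetical stand-in for BufferedReader: reads from an in-memory string.
struct StringReader {
    std::string text;
    std::size_t pos;
    explicit StringReader(const std::string& s) : text(s), pos(0) {}

    // Returns the next character, or EOF once the text is exhausted.
    int readChar() {
        return pos < text.size() ? static_cast<unsigned char>(text[pos++]) : EOF;
    }
    // Puts the most recently read character back; EOF is never pushed back.
    void pushChar(int ch) {
        if (ch != EOF && pos > 0) --pos;
    }
};

// Reads consecutive decimal digits and returns their value.
// The first non-digit is pushed back so that the next token can start with it.
int readIntValue(StringReader& in) {
    int value = 0;
    while (true) {
        int ch = in.readChar();
        if (ch >= '0' && ch <= '9') {
            value = value * 10 + (ch - '0');
        } else {
            in.pushChar(ch);
            return value;
        }
    }
}
```

With the input "-1", a loop like this is only reached after readOperator has consumed the "-", which matches the split described above.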

2. readOperator

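Here we need to consider two-character combinations. For example, after reading a '+' we must read the next character. If it is '+', it combines with the first '+' to form the '++' operator; if it is '=', the two form the '+=' operator; otherwise, the second character is pushed back and '+' is treated as an operator on its own.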
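
The '+' case can be sketched as follows. This is only an illustration under stated assumptions: StringReader is a hypothetical stand-in for BufferedReader, and readPlusOperator returns the token type as an int instead of a Token. The other operators would presumably follow the same pattern.

```cpp
#include <cstdio>
#include <string>

// Token type codes, copied from the constant table earlier in this article.
const int INC = 11;
const int ADD = 13;
const int ASSIGN_ADD = 27;

// Hypothetical stand-in for BufferedReader: reads from an in-memory string.
struct StringReader {
    std::string text;
    std::size_t pos;
    explicit StringReader(const std::string& s) : text(s), pos(0) {}

    // Returns the next character, or EOF once the text is exhausted.
    int readChar() {
        return pos < text.size() ? static_cast<unsigned char>(text[pos++]) : EOF;
    }
    // Puts the most recently read character back; EOF is never pushed back.
    void pushChar(int ch) {
        if (ch != EOF && pos > 0) --pos;
    }
};

// Assumes the next character in the stream is '+' (readToken pushed it back).
int readPlusOperator(StringReader& in) {
    in.readChar();              // consume the leading '+'
    int next = in.readChar();   // look at one more character
    if (next == '+') return INC;          // "++"
    if (next == '=') return ASSIGN_ADD;   // "+="
    in.pushChar(next);          // not part of the operator: put it back
    return ADD;                 // a lone '+'
}
```

Whether a lone '+' is a unary or a binary operator cannot be decided here; like the "-1" case, that distinction is left to the parser.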

3. readIDOrKey

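Token BufferedReader::readIDOrKey()
{
	char* p_name = new char[256];	// room for 255 characters plus the terminating '\0'
	int len = 0;
	char ch;

	while(true)
	{
		ch = readChar();
		if(len < 255 && ((ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z') || (ch >= '0' && ch <= '9') || ch == '_'))
		{
			p_name[len++] = ch;
		}
		else
		{
			pushChar(ch);	// not part of the name (or the buffer is full): put it back
			p_name[len++] = 0;
			break;
		}
	}

	for(int i = 0; i <= 10; i++)	// the keyword constants INT..BREAK are 0..10
	{
		if(strcmp(p_name, keywords[i]) == 0)
		{
			delete[] p_name;	// a keyword token does not need the name
			return Token(i);
		}
	}
	return Token(ID, -1, p_name);
}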

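Lines 7-20 of the code (the while loop) read in a string. Note that this string cannot start with a digit, because readToken has already checked the first character.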

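Lines 22-29 (the for loop) check whether the string is in the keyword table. If it is, it is returned as a keyword; otherwise it is treated as a user-defined variable.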
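
The keywords table itself is not shown in the article. Because the loop returns Token(i), the table has to be ordered so that index i holds the spelling of the keyword whose constant equals i. The sketch below assumes exactly that layout; the function name classifyWord is ours, and it returns the token type as an int for brevity.

```cpp
#include <cstring>

// Token type code for identifiers, copied from the constant table above.
const int ID = 37;

// Assumed layout: keywords[i] is the keyword whose constant is i
// (INT = 0, BOOL = 1, ..., IF = 5, ..., BREAK = 10).
static const char* const keywords[] = {
    "int", "bool", "void", "true", "false", "if",
    "else", "for", "while", "continue", "break"
};

// Returns the keyword constant if word is a keyword, otherwise ID.
int classifyWord(const char* word) {
    for (int i = 0; i <= 10; i++) {
        if (std::strcmp(word, keywords[i]) == 0) {
            return i;
        }
    }
    return ID;
}
```

A linear scan is perfectly adequate for eleven keywords. Note that strcmp compares the whole string, so a name such as "iff" is not mistaken for the keyword "if".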

Sample code