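FLEX/LEX Basics
In computer science, lex is a program that generates lexical analyzers (also called scanners or lexers). Lex is commonly used together with the yacc parser generator. Originally written by Eric Schmidt and Mike Lesk, Lex is the standard lexical analyzer generator on many UNIX systems, and its behavior is specified as part of the POSIX standard.
Lex reads an input stream that specifies the lexical analyzer's rules and outputs source code implementing that lexical analyzer in C.
Although Lex was traditionally proprietary software, versions based on the original AT&T code are available as open source and ship as part of systems such as OpenSolaris and Plan 9 from Bell Labs. Another well-known open source version of Lex is flex, the "fast lexical analyzer".
Structure of a Lex script
A typical lex script looks like this: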
%{
#include <stdio.h>
int wordCount = 0;
%}
chars [A-Za-z\_\'\.\"]
numbers ([0-9])+
delim [ \n\t]
whitespace {delim}+
words {chars}+
%%
{words} { wordCount++; /* increase the word count by one */ }
{whitespace} { /* do nothing */ }
{numbers} { /* one may want to add some processing here */ }
%%
int main(void)
{
    yylex(); /* start the analysis */
    printf("No of words: %d\n", wordCount);
    return 0;
}
int yywrap(void)
{
    return 1; /* no further input to scan */
}
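Global declarations
The first section, enclosed in %{ and %}, holds the script's global declarations, such as global variables. Because each match handles only one token, this section mostly declares counters and similar state that must persist from one match to the next.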
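Matching rules
The matching rules sit between the two %% lines. Each regular expression is paired with a code block, like this:
/* regular expression */ {Operation, ... return /* int */;}
...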
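Note: the generated scanner function yylex() returns an int, so it is best to return an integer token from each action and then look up the token's string form in a table.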
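The integer-token-plus-lookup-table idea can be sketched in plain C as follows. This is only an illustration, not part of any Lex API: the enum `Token`, its members, the table `token_names`, and the helper `token_name` are hypothetical names made up for this example. Token codes start at 1 because a return value of 0 from `yylex()` conventionally means end of input.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token codes; 0 is reserved for end of input. */
enum Token { TOK_EOF = 0, TOK_IF, TOK_WHILE, TOK_IDENT, TOK_COUNT };

/* Printable names, indexed by token code. */
static const char *const token_names[TOK_COUNT] = {
    "$EOF", "$IF", "$WHILE", "$IDENT"
};

/* Map a token code to its name; codes outside the table get a fallback. */
const char *token_name(int token)
{
    if (token < 0 || token >= TOK_COUNT) {
        return "$UNKNOWN";
    }
    /* Every code below TOK_COUNT must have an entry in the table. */
    assert(token_names[token] != NULL);
    return token_names[token];
}
```

A driver would call `token_name(tok)` on each value returned by the scanner. Keeping the codes in an enum and the names in one table makes it harder for the two to drift apart than with hand-numbered `return` statements.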
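Main section
This section controls how the main functions are called, along with any I/O that is needed.
To scan a whole file, the usual pattern is a while(token=yylex()) loop, which ends when yylex() returns 0 at end of input.
That is the basic structure of a complete script file (a .l file).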
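The shape of the while(token=yylex()) loop can be tried out without running lex at all. The sketch below assumes a hand-written stand-in for `yylex()`: the function `next_token`, the token codes, and `count_words` are all hypothetical and exist only to show how the driver loop consumes tokens until it sees 0.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token codes; 0 means end of input, as with yylex(). */
enum { TOK_END = 0, TOK_WORD = 1, TOK_NUMBER = 2 };

/* Stand-in for yylex(): skips blanks, then consumes either one run of
 * digits (TOK_NUMBER) or one run of non-blank characters (TOK_WORD).
 * Advances *cursor past the token; returns TOK_END when nothing is left. */
int next_token(const char **cursor)
{
    const char *p;
    assert(cursor != NULL && *cursor != NULL);
    p = *cursor;
    while (isspace((unsigned char)*p)) {
        p++;
    }
    if (*p == '\0') {
        *cursor = p;
        return TOK_END;
    }
    if (isdigit((unsigned char)*p)) {
        while (isdigit((unsigned char)*p)) {
            p++;
        }
        *cursor = p;
        return TOK_NUMBER;
    }
    while (*p != '\0' && !isspace((unsigned char)*p)) {
        p++;
    }
    *cursor = p;
    return TOK_WORD;
}

/* Driver loop with the same shape as while ((token = yylex()) != 0). */
int count_words(const char *input)
{
    int token;
    int words = 0;
    while ((token = next_token(&input)) != 0) {
        if (token == TOK_WORD) {
            words++;
        }
    }
    return words;
}
```

For instance, `count_words("hello world 123")` sees two word tokens and one number token before the loop ends. In a real scanner the same loop would call `yylex()` and read the matched text from `yytext`.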
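Using Lex
Flex, a free implementation of Lex, is packaged for most Linux distributions. If it is not already installed, the following command installs it (together with bison) on Debian or Ubuntu:
sudo apt-get install flex bison
Change to the directory that contains the .l file and run flex to turn it into the C source file lex.yy.c, ready for compilation:
flex filename.l
Then compile the generated lex.yy.c with cc to produce a.out. The -lfl option links the flex support library, which supplies a default yywrap() and main():
cc lex.yy.c -lfl
This produces the scanner executable a.out. Feed it the .c file to be analyzed:
./a.out < filename.c
If the script is written correctly, the terminal should now show the output.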
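A simple lexical analysis script for C
Below is a simple lexical analysis script for C. Its main features are:
- Detecting keywords (reserved words)
- Detecting variables, characters, and so on.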
/*See https://blog.csdn.net/u014594922/article/details/51224231*/
/*See https://pandolia.net/tinyc/ch8_flex.html#flex-tinyc */
/* Issue: the VAR rule shadowed all the keywords */
/* Solved */
%%
auto {printf("$AUTO:%s", yytext); /* print here: returning 0 would tell the caller that input has ended */}
break {return 1;}
case {return 2;}
char {return 3;}
const {return 4;}
continue {return 5;}
default {return 6;}
do {return 7;}
double {return 8;}
else {return 9;}
enum {return 10;}
extern {return 11;}
float {return 12;}
for {return 13;}
goto {return 14;}
if {return 15;}
inline {return 16;}
int {return 17;}
long {return 18;}
register {return 19;}
return {return 20;}
short {return 21;}
signed {return 22;}
sizeof {return 23;}
static {return 24;}
struct {return 25;}
switch {return 26;}
typedef {return 27;}
union {return 28;}
unsigned {return 86;}
void {return 87;}
volatile {return 88;}
while {return 89;}
(\() {return 30;}
(\)) {return 31;}
(\[) {return 32;}
(\]) {return 33;}
(\-\>) {return 34;}
(\.) {return 35;}
(!) {return 36;}
(~) {return 37;}
(\+\+) {return 38;}
(\-\-) {return 39;}
(\-) {return 40;}
(\*) {return 41;}
(&) {return 42;}
(\/) {return 43;}
(%) {return 44;}
(\+) {return 45;}
(\<\<) {return 46;}
(\>\>) {return 47;}
(\<) {return 48;}
(\<=) {return 49;}
(\>) {return 50;}
(\>=) {return 51;}
(==) {return 52;}
(!=) {return 53;}
(\^) {return 54;}
(\|) {return 55;}
(&&) {return 56;}
(\|\|) {return 57;}
(\?:) {return 58;}
(=) {return 59;}
(\+=) {return 60;}
(\-=) {return 61;}
(\*=) {return 62;}
(\/=) {return 63;}
(%=) {return 64;}
(\>\>=) {return 65;}
(\<\<=) {return 66;}
(&=) {return 67;}
(\^=) {return 68;}
(\|=) {return 69;}
(,) {return 70;}
(\\a) {return 71;}
(\\b) {return 72;}
(\\f) {return 73;}
(\\n) {return 74;}
(\\r) {return 75;}
(\\t) {return 76;}
(\\v) {return 77;}
(;) {return 78;}
(:) {return 79;}
(\+|\-)?([1-9][0-9]*|0) {return 80;}
(\+|\-)?([1-9][0-9]*|0)(\.[0-9]+) {return 81;}
(')([a-z]|[A-Z])(') {return 82;}
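\"([^"\\\n]|\\.)*\" {return 83; /* string literal, escape sequences allowed */}
(\/\/).* {return 84; /* line comment up to the end of the line */}
(\/\*)([^*]|\*+[^*\/])*(\*+\/) {return 85; /* block comment, may span lines */}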
(([a-z]|[A-Z]|_)([a-z]|[A-Z]|[0-9]|_)*) {return 29;}
%%
static char* print_token(int token) {
static char* token_strs[] = {"$AUTO","$BREAK","$CASE","$CHAR","$CONST","$CONTINUE","$DEFAULT","$DO","$DOUBLE","$ELSE","$ENUM","$EXTERN","$FLOAT","$FOR","$GOTO","$IF","$INLINE","$INT","$LONG","$REG","$RET","$SHORT","$SIGNED","$SIZEOF","$STATIC","$STRUCT","$SWITCH","$TYPEDEF","$UNION","$VAR","$LROUND","$RROUND","$LSQUARE","$RSQUARE","$ARROW","$DOT","$NOT","$BITREV","$SELFADD","$SELFSUB","$MINUS","$MUL","$BITAND","$DIV","$MOD","$ADD","$SL","$SR","$LE","$LEQ","$MO","$MEQ","$EQU","$NEQU","$BITXOR","$BITOR","$AND","$OR","$TRIOP","$IS","$ISADD","$ISMINU","$ISMUL","$ISDIV","$ISMOD","$ISSR","$ISSL","$ISBITAND","$ISBITXOR","$ISBITOR","$COMMA","$BELL","$BORDER","$PAGE","$NEWLINE","$CR","$TAB","$VT","$SEMI","$COLON","$CONSINT","$CONSFLOAT","$CONSCHAR","$CONSSTR","$SINGLELINECOMMENT","$BLOCKCOMMENT", "$UNSIGNED","$VOID","$VOL", "$WHILE"};
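/* Guard against token values that fall outside the table. */
if (token < 0 || token >= (int)(sizeof token_strs / sizeof token_strs[0])) return "$UNKNOWN";
return token_strs[token];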
}
int main(int argc, char** argv)
{
    int tok;
    /* yylex() returns 0 at end of input, which ends the loop */
    while ((tok = yylex()) != 0)
    {
        printf("%s:%s", print_token(tok), yytext);
    }
    return 0;
}
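Notes
- Lex always prefers the longest match. Rule order decides only when several patterns match the same length: in that case the rule listed first wins.
- Keywords can therefore be matched as plain literal patterns, provided they are listed before the identifier rule. For an input such as int, the keyword rule and the identifier rule match the same length, so the earlier keyword rule wins.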
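The matching priority described in these notes can be modeled in a few lines of C. This is a simplified sketch of the selection policy only, not of how lex is implemented internally (generated scanners use a table-driven automaton). The `Matcher` type, the four matcher functions, and `pick_rule` are hypothetical names invented for this example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* A rule is modeled as a function that reports how many leading characters
 * of the input it accepts (0 means no match). */
typedef size_t (*Matcher)(const char *input);

static size_t match_literal(const char *input, const char *literal)
{
    size_t len = strlen(literal);
    return strncmp(input, literal, len) == 0 ? len : 0;
}

static size_t match_if(const char *input) { return match_literal(input, "if"); }
static size_t match_lt(const char *input) { return match_literal(input, "<"); }
static size_t match_le(const char *input) { return match_literal(input, "<="); }

/* Identifier: a letter or underscore, then letters, digits, or underscores. */
static size_t match_ident(const char *input)
{
    size_t n = 0;
    if (!isalpha((unsigned char)input[0]) && input[0] != '_') {
        return 0;
    }
    while (isalnum((unsigned char)input[n]) || input[n] == '_') {
        n++;
    }
    return n;
}

/* Rules in the order they would appear in the script. */
enum { RULE_IF = 0, RULE_LT, RULE_LE, RULE_IDENT, RULE_COUNT };
static const Matcher rules[RULE_COUNT] = {
    match_if, match_lt, match_le, match_ident
};

/* Return the index of the winning rule, or -1 if no rule matches.
 * A later rule replaces the current best only when it is strictly longer,
 * so the longest match wins and ties go to the earliest rule. */
int pick_rule(const char *input)
{
    size_t best_len = 0;
    int best = -1;
    int i;
    assert(input != NULL);
    for (i = 0; i < RULE_COUNT; i++) {
        size_t len = rules[i](input);
        if (len > best_len) {
            best_len = len;
            best = i;
        }
    }
    return best;
}
```

With these rules, `pick_rule("if")` selects the keyword rule because it ties with the identifier rule and comes first, `pick_rule("iffy")` selects the identifier rule because its match is longer, and `pick_rule("<=")` selects the `<=` rule even though `<` is listed before it.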