Experiment Purposes
Implement a scanner that has the following functions:
- Find the preprocessor directives, including header files and defines:
  - #include <xx.h>
  - #include "xxx.h"
  - #define ABC 123
- Disregard comments and blanks:
  - single-line and multiline comments
  - space, tab, line feed
- Function name
- Variable name
- Constant: integer, floating-point number, etc.
- Keyword: if, else, int, float, return, etc.
- Operator: + - = * / etc.
- Punctuation: : {} () , etc.
- Special
- Identify the string inside printf("")
Experiment Setup
Flex
Install Flex with the command below:
    sudo apt install flex
Create .l File
In lab3.l, I write the regular expressions that identify each token and assign it a category. The details are discussed in the Experiment Results section.
Makefile
First, install the required package:
    sudo apt install make
Then, create a file named Makefile with the following content (in the actual file, each command line under a target must start with a tab character):
    test:
            flex lab3.l
            gcc lex.yy.c -o lab3
            ./lab3

    clean:
            rm -f lex.yy.c
Experiment Results
Here is the test.txt used to test the scanner. test.txt is written in C.
    #include <stdio.h>
    #define abc 123
    // this is a single line comment.
    /*
    this is a multiple line comment.
    */
    int main() {
        int x = 3.5;
        if (x > 0) {
            printf("x is positive\n");
        } else {
            printf("x is non-positive\n");
        }
        while (x > 0) {
            printf("x is positive\n");
            x--;
        }
    }
Preprocess directives
PREPROCESS #.*$
The RE matches a # and everything after it up to the end of the line. The whole line is reported as a preprocessor directive.
Comments and Blank
    COMMENTS1 (\/\*([^*]+)\*\/)
    COMMENTS2 \/\/.*\n
    SPACE (" ")
    TAB ("\t")
    LINE_FEED ("\n")
    COMMENTS_BLANK {COMMENTS1}|{COMMENTS2}|{SPACE}|{TAB}|{LINE_FEED}
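Comments are split into two patterns according to their prefix. COMMENTS1 matches text that begins with /* and ends with */, with one or more characters other than * in between, so it covers multiline comments as long as the body is not empty and contains no asterisk. COMMENTS2 matches from // to the end of the line, including the line feed.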
Function and Variable Names
    FUNCTION ([a-zA-Z0-9_]*\([^"]*\))
    VARNAME ([a-zA-Z0-9_]*)
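Functions and variables are found in a similar way. Both names are built from letters, digits, and the underscore. A function name must additionally be followed by a parenthesized argument list. To distinguish an ordinary function() from printf(), the text inside the parentheses of FUNCTION must not contain quotation marks. Note that [^"]* also matches parentheses and line feeds, so the match extends to the last ) that precedes the next quotation mark.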
Constant
CONSTANT [0-9]+(\.[0-9]+)?
This RE matches both integers and floating-point numbers with a decimal point. The question mark makes the parenthesized fraction part optional.
Keyword
KEYWORD ("if"|"else"|"while"|"for"|"printf"|"int"|"double"|"float"|"return")
The keywords are simply listed one by one as alternatives. This works, but the list is cumbersome to write and maintain.
Operator
OPERATORS ("++"|"--"|"+"|"-"|"*"|"/"|"=")
Operators are listed as alternatives in the same way as the keywords.
Punctuation
PUNCTUATION ("{"|"}"|"("|")"|";"|",")
Punctuation marks are listed as alternatives in the same way.
printf()
PRINTF (\"([^\)]+)\")
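This RE matches a quoted string: an opening quotation mark, one or more characters other than ), and a closing quotation mark. It separates the string argument from the rest of the call. In the line below, for example, it matches "x is non-positive\n":
    printf("x is non-positive\n");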
Appendix
Full code of lab3.l
    %{
    #include <stdio.h>
    %}

    KEYWORD ("if"|"else"|"while"|"for"|"printf"|"int"|"double"|"float"|"return")
    CONSTANT [0-9]+(\.[0-9]+)?
    PUNCTUATION ("{"|"}"|"("|")"|";"|",")
    OPERATORS ("++"|"--"|"+"|"-"|"*"|"/"|"=")
    PREPROCESS #.*$
    FUNCTION ([a-zA-Z0-9_]*\(*\))
    VARNAME ([a-zA-Z0-9_]*)
    PRINTF (\"([^\)]+)\")
    COMMENTS1 (\/\*([^*]+)\*\/)
    COMMENTS2 \/\/.*\n
    SPACE (" ")
    TAB ("\t")
    LINE_FEED ("\n")
    COMMENTS_BLANK {COMMENTS1}|{COMMENTS2}|{SPACE}|{TAB}|{LINE_FEED}

    %%
    {COMMENTS_BLANK} {
        if (*yytext == '\n') {
            printf("<Line feed, Comments or blank>\n");
        } else {
            printf("<%s, Comments or blank>\n", yytext);
        }
    }
    {PREPROCESS}  { printf("<%s, Preprocessor>\n", yytext); }
    {KEYWORD}     { printf("<%s, Keyword>\n", yytext); }
    {CONSTANT}    { printf("<%s, Constant>\n", yytext); }
    {OPERATORS}   { printf("<%s, Operator>\n", yytext); }
    {FUNCTION}    { printf("<%s, Function name>\n", yytext); }
    {VARNAME}     { printf("<%s, Variable name>\n", yytext); }
    {PUNCTUATION} { printf("<%s, Punctuation>\n", yytext); }
    {PRINTF}      { printf("<%s, Printf content>\n", yytext); }
    .             { printf("<%s, Special>\n", yytext); }
    %%

    /* Returning 1 tells the scanner that there is no more input after end of file. */
    int yywrap(void) { return 1; }

    int main(void) {
        FILE *fp;
        char filename[50];

        printf("Enter the filename: \n");
        /* Limit the read to the buffer size and stop if nothing was entered. */
        if (scanf("%49s", filename) != 1) {
            return 1;
        }
        fp = fopen(filename, "r");
        if (fp == NULL) {
            perror(filename);   /* report why the file could not be opened */
            return 1;
        }
        yyin = fp;
        yylex();
        fclose(fp);
        return 0;
    }