Preface
The requirement of this lab is to write a program that reads in C or C++ source files and extracts keyword statistics at several levels, and then to upload the code to GitHub. I chose Python for this lab. See my GitHub for the detailed code.
Description of problem-solving ideas
After reading the problem, I think its purpose is to have our program analyze code, in effect implementing part of a simple interpreter.
Requirement 1:
The first requirement is to output keyword statistics. The lab counts the 32 keywords of C shown below (C++ reserves all of these, plus others):
auto | break | case | char | const | continue | default | do |
---|---|---|---|---|---|---|---|
double | else | enum | extern | float | for | goto | if |
int | long | register | return | short | signed | sizeof | static |
struct | switch | typedef | union | unsigned | void | volatile | while |
Simply counting occurrences of these strings does not meet the requirement. In real code these words may appear inside strings, comments and macro definitions, or as part of a variable name, and in those cases they must not be counted. Since this lab is only concerned with keywords, my idea is to remove these parts with regular expressions and put a suitable character, such as a space, in their place. This avoids surprises like the following:
int/**/a = 1;
If you just delete these parts, the code will become:
inta = 1;
The keyword int is lost, because it has merged with the identifier that follows it.
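A minimal sketch of this idea, assuming only block comments need handling for the moment. The function name blank_block_comments is made up for illustration and is not taken from the lab code:

```python
import re

def blank_block_comments(code: str) -> str:
    """Replace every /* ... */ comment with one space instead of deleting it."""
    # DOTALL lets '.' cross line breaks, so multi-line comments are covered too.
    return re.sub(r"/\*.*?\*/", " ", code, flags=re.DOTALL)

print(blank_block_comments("int/**/a = 1;"))  # int a = 1;
```

Because the comment is replaced by a space, int and a stay separate tokens.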
Requirement 2:
The second requirement is to output the number of switch-case structures and the number of case labels in each of them. For example, for the code below:
switch(i){
case 0:
break;
case 1:
break;
case 2:
break;
default:
break;
}
switch(i){
case 0:
break;
case 1:
break;
default:
break;
}
The output should be:
switch num: 2
case num: 3 2
I took nested switch statements into account, and the simplest way to untangle nesting is a stack. While a switch is on the stack, its cases are counted. When the switch is popped, that count is recorded for it and the number of switches is increased by one. This solves the requirement easily.
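The stack idea can be sketched as follows. This is a simplified illustration, not the lab code: it assumes the input is already a list of tokens, that every switch body is wrapped in braces, and that braces are balanced. The name count_switch_cases is hypothetical.

```python
def count_switch_cases(tokens):
    """Return (number of switches, case count per switch in closing order)."""
    scopes = []          # one entry per open brace: 'switch' or '{'
    open_counts = []     # case counters of switches that are still open
    finished = []        # case counts of switches that have been closed
    pending_switch = False
    for tok in tokens:
        if tok == "switch":
            pending_switch = True            # the next '{' opens this switch
        elif tok == "{":
            if pending_switch:
                scopes.append("switch")
                open_counts.append(0)
                pending_switch = False
            else:
                scopes.append("{")
        elif tok == "case":
            open_counts[-1] += 1             # belongs to the innermost open switch
        elif tok == "}":
            if scopes.pop() == "switch":
                finished.append(open_counts.pop())
    return len(finished), finished

tokens = ["switch", "{", "case", "case", "case", "}",
          "switch", "{", "case", "case", "}"]
print(count_switch_cases(tokens))  # (2, [3, 2])
```

Note that with nesting, an inner switch is closed before the outer one, so its count appears first in the returned list.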
Requirement 3 & 4:
The last two requirements are to output the number of if-else structures and the number of if, else if, else structures. For example, for the code below:
if(i<0){
if(i<-1){}
else{}
}
else if(i>0){
if (i>2){}
else if (i==2) {}
else if (i>1) {}
else {}
}
else{
if(j!=0){}
else{}
}
return 0;
The output should be:
if-else num: 2
if-elseif-else num: 2
Requirements 3 and 4 are similar to requirement 2. Because if and switch can be nested inside each other, they can share the stack used for requirement 2. An if-else structure and an if-else if-else structure begin and end the same way, and either count can only change when an else appears. So an else on the stack is treated as setting a state, and an else if can change that state. When the if is popped, checking the state tells which of the two counters to increase.
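The state idea can be illustrated on a single chain of branches. This is a sketch under the assumption that the branches of one if statement are already known in source order. The helper classify_chain is hypothetical, and the state values 0, 1 and 2 follow the convention described above:

```python
def classify_chain(chain):
    """Classify one if statement given its branches, e.g. ['if', 'elif', 'else']."""
    state = 0
    for branch in reversed(chain):       # branches are popped last to first
        if branch == "else":
            state = 1                    # an else was seen: at least if-else
        elif branch == "elif" and state == 1:
            state = 2                    # an else if before that else
    return {0: None, 1: "if-else", 2: "if-elseif-else"}[state]

print(classify_chain(["if", "else"]))                  # if-else
print(classify_chain(["if", "elif", "elif", "else"]))  # if-elseif-else
print(classify_chain(["if", "elif"]))                  # None
```

A chain without a final else leaves the state at 0, so it is counted in neither category.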
Information query
I looked up several websites about writing and debugging regular expressions. I also read about Git, which helps me manage branches better.
Design and implementation process
Requirement 1:
Through testing, I found that strings take precedence over comments. A string must be closed by a matching quotation mark, whereas a block comment may be left without its closing half, so strings must be removed first to prevent situations like the following.
string s1 = "test string /*";
//If comments are processed first, the following code will be treated as comments
...
...
string s2 = "*/ test string";
- Process string
For strings, matching the text between a pair of double quotes with a regular expression is not enough. The following details need to be considered:
  - A quotation mark that follows an escape symbol does not end the string:
    string s = "test string\"int";
  - A quotation mark that follows an escape symbol which is itself escaped does end the string:
    string s = "test string\\";
  - The same two cases, but with single quotation marks:
    char[] c = 'test char \' int \\';
  - Multiline string:
    string s = "test string\
    line 2\
    line 3";
After analysis, the processing order is as follows. First match \\ and convert it into spaces. Then match a backslash followed by a line break, \" and \' and convert them into spaces. Other escape sequences need no handling. After that, the contents between quotation marks can be deleted.
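The order described above could look like the sketch below. It is an illustration under assumptions, not the lab code: blank_string_literals is a hypothetical name, and each string or character literal is replaced by a single space rather than left as empty quotes.

```python
import re

def blank_string_literals(code: str) -> str:
    # Step 1: escaped backslashes first, so the second backslash of a pair
    # cannot be mistaken for the start of another escape.
    code = code.replace("\\\\", "  ")
    # Step 2: backslash + line break, escaped double quote, escaped single quote.
    for escape in ("\\\n", '\\"', "\\'"):
        code = code.replace(escape, "  ")
    # Step 3: every quote left is a real delimiter. One alternation handles
    # both quote kinds in a single left-to-right pass.
    return re.sub(r'"[^"]*"|\'[^\']*\'', " ", code)

sample = 'string s = "test string\\"int"; int a;'
print(blank_string_literals(sample).split())
# ['string', 's', '=', ';', 'int', 'a;']
```

The int inside the string is gone, while the real declaration int a; survives.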
- Process comments and macro definitions
There are two kinds of comments: line comments and block comments. Macro definitions and line comments can also be continued onto the next line with a trailing backslash. This case was already handled when processing strings, so it needs no further handling here.
However, line comments and block comments can affect each other, like:
/* //This line comment is affected */ string s = "This string will be compiled";
// /** string s = "This string will be compiled"; // **/This block comment is affected
Whether a comment marker takes effect depends on which marker comes first, so both kinds of comments should be removed in the same pass. Macro definitions always start with #, so it is enough to delete every line that starts with #.
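One way to remove both kinds of comments in the same pass is a single regular expression with two alternatives, so that whichever marker appears first wins. The sketch below assumes strings have already been blanked out. The names COMMENT and strip_comments_and_macros are hypothetical.

```python
import re

# Either a line comment up to the end of the line, or a block comment.
COMMENT = re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL)

def strip_comments_and_macros(code: str) -> str:
    code = COMMENT.sub(" ", code)
    # Preprocessor lines: optional indentation, then '#', up to the line end.
    return re.sub(r"^[ \t]*#[^\n]*", " ", code, flags=re.MULTILINE)

sample = (
    "/* //This line comment is affected */ int a;\n"
    "// /** int b; // **/This block comment is affected\n"
    "#define con (int i = 1)\n"
    "int c;"
)
print(strip_comments_and_macros(sample).split())
# ['int', 'a;', 'int', 'c;']
```

The declaration int b; sits inside a line comment and is dropped, as is the int inside the macro.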
- Count keywords
Now the code is much cleaner, but it still contains many things we do not need, such as some symbols and runs of spaces or line breaks. Semicolons and braces cannot simply be removed, because they are the key to the code structure. Underscores cannot be deleted either, because variable names in C/C++ can contain them, like:
int int_1 = 1; int in_t = 2;
Whether you replace underscores with whitespace or delete them, the statements above break:
int int 1 = 1;
int int = 2;
So leaving underscores untouched is the best option.
Because keywords may sit directly next to {, } or ;, we simply add a space before and after each of these symbols.
After that, the text can be split on whitespace. Comparing each resulting token with the keyword list gives the final keyword count.
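The splitting and counting step might be sketched like this, assuming strings, comments and macros have already been removed. The names KEYWORDS and count_keywords are hypothetical.

```python
import re

KEYWORDS = {
    "auto", "break", "case", "char", "const", "continue", "default", "do",
    "double", "else", "enum", "extern", "float", "for", "goto", "if",
    "int", "long", "register", "return", "short", "signed", "sizeof", "static",
    "struct", "switch", "typedef", "union", "unsigned", "void", "volatile", "while",
}

def count_keywords(clean_code: str) -> int:
    # Put spaces around the structural symbols so they become separate tokens.
    spaced = re.sub(r"([{};])", r" \1 ", clean_code)
    # Any other character that cannot be part of an identifier becomes a space.
    # \w keeps letters, digits and underscores, so int_1 stays one token.
    spaced = re.sub(r"[^\w{};]", " ", spaced)
    return sum(1 for token in spaced.split() if token in KEYWORDS)

print(count_keywords("int int_1 = 1; int in_t = 2;"))  # 2
print(count_keywords("switch(i){case 0: break;}"))     # 3
```

Since whole tokens are compared, Int, DOUBLE and int_double_float are not counted.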
Requirement 2 & 3 & 4:
From here on, "keyword" refers only to switch, case, if, else and else if, and "symbol" refers to semicolons and braces.
For these requirements, we extract the switch, case, if, else, braces and semicolons from the whole code and push them onto a stack, then analyze the relationship between the keywords and symbols. When a keyword is not followed by a brace, its scope ends at the next semicolon. Otherwise its scope extends to the matching closing brace. Now let's look at what happens when keywords and symbols are pushed onto the stack.
- For a semicolon:
When a semicolon is pushed onto the stack and it is preceded by an opening brace, the semicolon does not affect any scope. When it is preceded by a keyword, the scope of that keyword has ended and the keyword can be popped.
For the pop functions, see the code section.
- For a closing brace:
When a closing brace is pushed onto the stack, there must be an opening brace somewhere before it, so we pop the stack until we reach that opening brace. Popping an opening brace means that a scope has ended. If there is a keyword before the opening brace, the scope of that keyword has ended as well, and the keyword can be popped.
For the pop functions, see the code section.
- For switch:
We want to count the cases in each switch. Because switches can be nested, every case that is pushed belongs to the topmost switch currently on the stack. So we create a second stack, switch_stack, that stores the case count of each switch. Whenever a switch is pushed onto the main stack, a 0 is pushed onto switch_stack to start counting.
- For case:
It follows that every time a case is pushed onto the stack, the count on top of switch_stack is increased by one.
- For if:
Because else if in C/C++ consists of two keywords, when an if is pushed while the top of the stack is else, the else and the if are merged into a single elif.
In this way, the matched keywords and symbols are pushed onto the stack in order and popped once certain conditions are met, which lets us analyze how these keywords are nested.
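The merging of else and if can be shown in isolation. This sketch assumes the code has already been reduced to a token list. merge_else_if is a hypothetical helper, not a function from the lab code.

```python
def merge_else_if(tokens):
    """Turn an 'if' that directly follows an 'else' into a single 'elif' token."""
    merged = []
    for token in tokens:
        if token == "if" and merged and merged[-1] == "else":
            merged[-1] = "elif"      # 'else' + 'if' becomes one keyword
        else:
            merged.append(token)
    return merged

tokens = ["if", "{", "}", "else", "if", "{", "}", "else", "{", "if", ";", "}"]
print(merge_else_if(tokens))
# ['if', '{', '}', 'elif', '{', '}', 'else', '{', 'if', ';', '}']
```

An if that is separated from the else by an opening brace is a nested if, so it is left alone.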
Code description
1. Pop functions
When a keyword or symbol is popped, the code checks whether the new top of the stack can be popped as well. If it can, it is popped too. If not, popping stops and more elements are pushed onto the stack. This is the core idea of my code. There is one problem: if, elif and else cannot be popped right away, because a matching branch may still follow, like:
if {
;
}else {
;
}
If the if were popped as soon as the opening brace after it is popped, the else that follows would not yet be on the stack, and it would later find no matching element. My solution is to mark such an element as poppable (can_pop_if, can_pop_elif, can_pop_else) the first time it could be popped, and to actually pop it the next time.
- pop_semicolon:
def pop_semicolon(self):  # semicolon
    self.stack.pop()
    if len(self.stack) > 0:
        if self.stack[-1] == 'switch':
            raise SyntaxError("Incorrect code format")
        elif self.stack[-1] == 'case':
            self.pop_case()
        elif self.stack[-1] == 'if':
            self.stack[-1] = 'can_pop_if'
        elif self.stack[-1] == 'can_pop_if':
            self.pop_if()
        elif self.stack[-1] == 'elif':
            self.stack[-1] = 'can_pop_elif'
        elif self.stack[-1] == 'can_pop_elif':
            self.pop_elif()
        elif self.stack[-1] == 'else':
            self.stack[-1] = 'can_pop_else'
        elif self.stack[-1] == 'can_pop_else':
            self.pop_else()
A switch cannot appear directly before a semicolon. If it does, the input file is malformed.
- pop_switch:
def pop_switch(self):
    self.stack.pop()
    if len(self.stack) > 0:
        if self.stack[-1] == 'switch':
            self.pop_switch()
        elif self.stack[-1] == 'case':
            self.pop_case()
        elif self.stack[-1] == 'if':
            self.stack[-1] = 'can_pop_if'
        elif self.stack[-1] == 'can_pop_if':
            self.pop_if()
        elif self.stack[-1] == 'elif':
            self.stack[-1] = 'can_pop_elif'
        elif self.stack[-1] == 'can_pop_elif':
            self.pop_elif()
        elif self.stack[-1] == 'else':
            self.stack[-1] = 'can_pop_else'
        elif self.stack[-1] == 'can_pop_else':
            self.pop_else()
The keyword before switch can be popped.
- pop_case:
def pop_case(self):
    self.stack.pop()
    if len(self.stack) > 0:
        if self.stack[-1] == 'switch':
            self.pop_switch()
        elif self.stack[-1] == 'case':
            self.pop_case()
        elif self.stack[-1] == 'if':
            self.stack[-1] = 'can_pop_if'
        elif self.stack[-1] == 'can_pop_if':
            self.pop_if()
        elif self.stack[-1] == 'elif':
            self.stack[-1] = 'can_pop_elif'
        elif self.stack[-1] == 'can_pop_elif':
            self.pop_elif()
        elif self.stack[-1] == 'else':
            self.stack[-1] = 'can_pop_else'
        elif self.stack[-1] == 'can_pop_else':
            self.pop_else()
    else:
        raise SyntaxError("Incorrect code format")
The keyword before case can be popped.
- pop_if:
def pop_if(self):
    self.stack.pop()
    if len(self.stack) > 0:
        if self.stack[-1] == 'switch':
            self.pop_switch()
        elif self.stack[-1] == 'case':
            self.pop_case()
        elif self.stack[-1] == 'if':
            self.stack[-1] = 'can_pop_if'
        elif self.stack[-1] == 'can_pop_if':
            self.pop_if()
        elif self.stack[-1] == 'elif':
            self.stack[-1] = 'can_pop_elif'
        elif self.stack[-1] == 'can_pop_elif':
            self.pop_elif()
        elif self.stack[-1] == 'else':
            self.stack[-1] = 'can_pop_else'
        elif self.stack[-1] == 'can_pop_else':
            self.pop_else()
The keyword before if can be popped.
- pop_elif:
def pop_elif(self):
    self.stack.pop()
    while self.stack[-1] != 'can_pop_if':
        temp = self.stack[-1]
        if temp == 'switch':
            self.pop_switch()
        elif temp == 'case':
            self.pop_case()
        elif temp == 'if':
            raise SyntaxError("Incorrect code format")
        elif temp == 'elif':
            raise SyntaxError("Incorrect code format")
        elif temp == 'can_pop_elif':
            self.stack.pop()
        elif temp == 'else':
            raise SyntaxError("Incorrect code format")
        elif temp == 'can_pop_else':
            self.pop_else()
    self.pop_if()
If an elif can be popped, there must be a can_pop_if somewhere before it. Finally, pop that if.
- pop_else:
def pop_else(self):
    self.stack.pop()
    while (self.stack[-1] != 'can_pop_if') and (self.stack[-1] != 'can_pop_elif'):
        temp = self.stack[-1]
        if temp == 'switch':
            self.pop_switch()
        elif temp == 'case':
            self.pop_case()
        elif temp == 'if':
            raise SyntaxError("Incorrect code format")
        elif temp == 'elif':
            raise SyntaxError("Incorrect code format")
        elif temp == 'else':
            raise SyntaxError("Incorrect code format")
        elif temp == 'can_pop_else':
            self.pop_else()
    if self.stack[-1] == 'can_pop_if':
        self.pop_if()
    else:
        self.pop_elif()
If an else can be popped, there must be a can_pop_if or a can_pop_elif somewhere before it. Finally, pop that if or elif.
- pop_brace:
def pop_brace(self):
    self.stack.pop()
    temp = ''
    while temp != '{':
        temp = self.stack[-1]
        if temp == 'switch':
            self.pop_switch()
        elif temp == 'case':
            self.pop_case()
        elif temp == 'if':
            raise SyntaxError("Incorrect code format")
        elif temp == 'can_pop_if':
            self.pop_if()
        elif temp == 'elif':
            raise SyntaxError("Incorrect code format")
        elif temp == 'can_pop_elif':
            self.pop_elif()
        elif temp == 'else':
            raise SyntaxError("Incorrect code format")
        elif temp == 'can_pop_else':
            self.pop_else()
    self.stack.pop()
If a closing brace can be popped, there must be an opening brace somewhere before it. Finally, pop that opening brace.
2. switch number and case count
When a switch is popped, the switch count is increased by one, and its case count is popped from switch_stack and appended to the list of case counts per switch.
So the pop_switch() function should be updated:
def pop_switch(self):
    self.stack.pop()
    self.switch_count += 1  # increase the switch count
    self.case_count.append(self.switch_stack.pop())  # store case number in switch_stack
    if len(self.stack) > 0: ...
3. if-else count and if-else if-else count
As analyzed earlier, we set a mark in pop_else(), update the mark in pop_elif(), and do the counting in pop_if(). So we update the functions:
def pop_else(self):
    self.stack.pop()
    while (self.stack[-1] != 'can_pop_if') and (self.stack[-1] != 'can_pop_elif'):...
    self.if_state = 1  # give mark
    if self.stack[-1] == 'can_pop_if':...

def pop_elif(self):
    self.stack.pop()
    while self.stack[-1] != 'can_pop_if':...
    if self.if_state == 1:  # update the mark
        self.if_state = 2
    self.pop_if()

def pop_if(self):
    self.stack.pop()
    if self.if_state == 1:  # statistics
        self.if_else_count += 1
        self.if_state = 0
    elif self.if_state == 2:
        self.if_elif_else_count += 1
        self.if_state = 0
    if len(self.stack) > 0:...
Unit test
1. test case
test.c
#include <stdio.h>
int main(){
int i=1;
double j=0;
long f;
switch(i){
case 0:
break;
case 1:
break;
case 2:
break;
default:
break;
}
switch(i){
case 0:
break;
case 1:
break;
default:
break;
}
if(i<0){
if(i<-1){}
else{}
}
else if(i>0){
if (i>2){}
else if (i==2) {}
else if (i>1) {}
else {}
}
else{
if(j!=0){}
else{}
}
return 0;
}
Should output:
total num: 35
switch num: 2
case num: 3 2
if-else num: 2
if-elseif-else num: 2
Actual results: [screenshot]
2. my test
Test.cpp
/*
****************
*This is a test*
****************
*/
#define A 5
# if (A<3) //macro definition:if
#include <iostream>
# else
#include <iostream>
#define con (int i = 1) //macro definition:int
#endif
#include <string>
using namespace std;
//int if switch case line comment for interference
/**double break; return block comment for interference*/
int/**/a = 1;
// /**
string s = "This string will be compiled";
int num1 =1;
// **/This block comment is affected
/* //This line comment is affected */ string s0 = "This string will be compiled"; int num2 = 1;
//String for interference
string s1 = "float a = 0.1; default; if() {...} else {}";
string s2 = "float int double\
if else int double default\
case switch\
";
//string with escape symbol
string s3 = "int \" int double \"char\"";
string s4 = "float int double\
if else int\" do\\u\"ble\" default\
case switch\
";
//escape symbol escape the escape symbol
string s5 = "int \\ int \\";
string s6 = "这是句中文";
string s7 = "Это предложение на русском языке";
//Variable name for interference
int int_double_float = 1;
int Int =2;
int DOUBLE= 3;
int main() {
//switch
switch(1) //switch without braces
case 0: //only case
case 1:
case 2:
case 3: //one_line case without braces
cout << 3 << endl;
switch(1) { //switch with braces
case 0: //case with braces
{
cout << "0_1" << endl;
cout << "0_2" << endl;
cout << "0_3" << endl;
break;
}
case 1: //multiline case without braces
cout << "1_1" << endl;
cout << "1_2" << endl;
case 2:
cout << "2_1" << endl;
}
//nested switch
switch(0){
case 0:
switch(1)
case 0:
case 1:
switch(0){
case 0:
/*Chaotic writing format*/
cout << "nested switch case 0_1_0" << endl;
break;
case 1:
cout << "nested switch case 0_1_1" << endl;
break;
}case 1:{
cout << "nested switch case 1" << endl;
switch(1)
case 1:
cout << "nested switch case 1_1" << endl;break;}
}
//switch num: 6
//case num: [4, 3, 2, 2, 1, 2]
int i = -1,j = 0;
if(i<0){
if(i<-1){
cout << "if_if" << endl; //if with braces
}
else cout << "if_else" << endl; //if without braces
}
else if(i>0){
if (i>2){
cout << "else if_if" << endl;}
else if (i==2) {cout << "else if_else if1" << endl;
}
else if (i>1)
cout << "else if_else if2" << endl;
else cout << "else if_else" << endl;
}
else{
if(j!=0){cout << "else_if" << endl;}
else{
cout << "else_else" << endl;}
}
return 0;
//nested switch and if
switch(j) {
case 1:
if(i < 3)
i++;
else if (i < 5)
i--;
else
i += 10;
case 2:
if(i < 3){
i--;
break;
}
else
i += 10;
}
/*****************************
* Final Answer:
*
* total num: 57
* switch num: 7
* case num: 4 3 2 2 1 2 2
* if-else num: 3
* if-elseif-else num: 3
******************************/
}
Should output:
total num: 57
switch num: 7
case num: 4 3 2 2 1 2 2
if-else num: 3
if-elseif-else num: 3
Actual results: [screenshot]
3. Abnormal test
Abnormal.cpp
#include <stdio.h>
int main(){
int i=1;
double j=0;
long f;
if{
;
}else if //Illegal statement
else{
;
}
}
Actual results: [screenshot]
The tests revealed no obvious problems in the code.
Performance Testing
In version 1.2 I cleaned up the format of the code and revised and merged many useless or repeated regular expressions. Let me run the V1.1 code first:
It can be seen that there is a lot of nesting, and the run took 0.011 s. After the improvement:
It took only 0.003 s, so the function now runs about 3.7 times as fast as before.
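For reference, a comparison like this can be reproduced with a few lines of Python. This is a generic sketch rather than the measurement setup used above. time_call and speedup are hypothetical helper names, and the two durations passed in at the end are the ones reported in this section.

```python
import time

def time_call(func, *args, repeat=5):
    """Return the best wall-clock time in seconds over several calls of func(*args)."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best

def speedup(old_seconds, new_seconds):
    """How many times faster the new version is than the old one."""
    return old_seconds / new_seconds

print(round(speedup(0.011, 0.003), 2))  # 3.67
```

Taking the best of several runs reduces the influence of other processes on very short measurements such as these.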
Summary
1. PSP form for this work
Personal Software Process Stages | Estimated Time (minutes) | Actual time (minutes) |
---|---|---|
Planning | 30 | 15 |
Estimate | 720 | 1060 |
Development | 60 | 210 |
Analysis | 30 | 60 |
Design Spec | 180 | 180 |
Design Review | 30 | 60 |
Coding Standard | 30 | 40 |
Design | 60 | 60 |
Coding | 60 | 180 |
Code Review | 30 | 90 |
Reporting | 60 | 60 |
Test Report | 30 | 60 |
Size Measurement | 30 | 15 |
Postmortem & Process Improvement Plan | 30 | 30 |
Total | 720 | 1060 |
2. Personal feelings
I spent a lot of time on this lab. Although the teaching assistant said that some nested or otherwise unusual forms need not be considered, this is my first project on GitHub, so I wanted to make it as complete as I could. My code now seems reasonably robust to me, and every situation I could think of has been handled. If you find a case I did not consider or any bug, you are welcome to contact me to discuss it. My email is 2579538675@qq.com.
The Link Your Class | EE308 MIEC |
---|---|
The Link of Requirement of This Assignment | LAB 2 Individual programing work (EE308 IEC MU 2021 Fall) |
The Aim of This Assignment | Extract keywords in C/C++ |
MU STU ID and FZU STU ID | 19103387_831902225 |