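Implementation approach
Environment:
In the article https://blog.csdn.net/lockhou/article/details/113883940 we already converted a set of .c files into AST files on Windows, then turned each AST file into a text vector by matching node types. Each .c file therefore has one .txt file that stores its AST and one .txt file that stores its text vector. All three files share the same name, because whether a file is vulnerable is encoded in its file name.
Approach:
The original pipeline first sorts the files into Train, Test and Validation, then reads the .c files directly, strips blanks and stop words, and saves the result as pickle files. Those pickle files supply the data for the later word-embedding, model training, model testing and model validation stages. Since we want the text vectors to carry the structural information, we stop reading the .c files directly. Instead we read the text-vector .txt file that corresponds to each .c file and apply the same processing, so the data handed to word embedding, training, testing and validation carries structural information.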
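Migrating from Windows to Linux
step1 Install the JDK on Linux
The detailed steps are in this article: https://blog.csdn.net/lockhou/article/details/113904085
step2 Modify movefiles.py
I decided that every directory holding the moved .c files should get two folders of its own: Preprocessed, for the AST files extracted from the .c files, and processed, for the text-vector files derived from those ASTs. So when we create the Train, Test and Validation folders and their subfolders, we also need to create Preprocessed and processed under the Non_vulnerable_functions and Vulnerable_functions folder of every combination. The following code was added at lines 46-55: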
saveDir = tempDir
tempDir = saveDir + '/' + "Preprocessed"
if not os.path.exists(tempDir):
    # Non_vulnerable_functions/Non_vulnerable_functions/Preprocessed
    os.mkdir(tempDir)
tempDir = saveDir + '/' + "processed"
if not os.path.exists(tempDir):
    # Non_vulnerable_functions/Non_vulnerable_functions/processed
    os.mkdir(tempDir)
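The same layout can also be created with a small helper. This is a minimal sketch, not the code in movefiles.py: the name make_output_dirs and the demo directory are assumptions made for illustration, and os.makedirs with exist_ok=True stands in for the explicit os.path.exists checks.

```python
import os
import tempfile


def make_output_dirs(function_dir):
    """Create Preprocessed/ and processed/ inside function_dir and return both paths."""
    created = []
    for name in ("Preprocessed", "processed"):
        path = os.path.join(function_dir, name)
        # exist_ok=True makes the call safe to repeat on an existing folder.
        os.makedirs(path, exist_ok=True)
        created.append(path)
    return created


# Demo on a throwaway directory (hypothetical layout, for illustration only).
demo_root = tempfile.mkdtemp()
demo_dir = os.path.join(demo_root, "Train", "Vulnerable_functions")
ast_dir, vector_dir = make_output_dirs(demo_dir)
print(ast_dir)
print(vector_dir)
```

Because the helper is idempotent, it can be called again for the same directory when the script is rerun.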
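step2 Modify ProcessCFilesWithCodeSensor.py and ProcessRawASTs_DFT.py
We now have places to store the ASTs and the text vectors, so all that remains is to call ProcessCFilesWithCodeSensor.py and ProcessRawASTs_DFT.py repeatedly. To make that convenient, we wrap the contents of each file in a function and pass the values it needs as arguments.
ProcessCFilesWithCodeSensor.py parameters:
1) CodeSensor_OUTPUT_PATH: the directory where the AST extracted from each .c file is saved as a .txt file
"G:\论文\论文\ast\function_representation_learning-master\FFmpeg\Vulnerable_functions\Preprocessed\"
2) CodeSensor_PATH: the location of CodeSensor.jar
"D:\codesensor\CodeSensor.jar" (the location is fixed, so it does not need to be passed in as a parameter)
3) PATH: the directory that holds the .c files
"G:\论文\论文\ast\function_representation_learning-master\FFmpeg\Vulnerable_functions"
ProcessRawASTs_DFT.py parameters:
1) FILE_PATH: the directory that holds the AST .txt files
"G:\论文\论文\ast\function_representation_learning-master\" + Project_Name + "\Vulnerable_functions\Preprocessed\"
2) Processed_FILE: the directory where the text-vector .txt files are saved
"G:\论文\论文\ast\function_representation_learning-master\" + Project_Name + "\Vulnerable_functions\Processed\"
Based on these parameters, we reorganize the contents of the two files into functions as follows: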
# ProcessCFilesWithCodeSensor.py
import os
from subprocess import call, STDOUT

def codesensor(CodeSensor_OUTPUT_PATH, PATH):
    CodeSensor_PATH = "./Code/codesensor-codeSensor-0.2/CodeSensor.jar"
    for fpathe, dirs, fs in os.walk(PATH):
        for f in fs:
            if os.path.splitext(f)[1] == '.c':  # Process .c files only
                file_path = os.path.join(fpathe, f)  # The .c file that CodeSensor will parse
                # The AST is saved under the same base name as the .c file, with a .txt extension.
                Full_path = CodeSensor_OUTPUT_PATH + os.path.splitext(f)[0] + ".txt"
                with open(Full_path, "w+") as output_file:
                    # call() waits for CodeSensor to exit, so the AST file is complete before text_vector reads it.
                    call(['/home/jdk1.8.0_65/bin/java', '-jar', CodeSensor_PATH, file_path], stdout=output_file, stderr=STDOUT)
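The pattern in codesensor (one external command per .c file, output captured into a .txt file with the same base name) can be tried without Java or CodeSensor installed. The sketch below is an illustration under assumptions: run_tool_on_c_files and command_prefix are hypothetical names, a one-line Python command stands in for the java -jar CodeSensor.jar call, and os.listdir is used instead of os.walk, so only the top level of the directory is scanned.

```python
import os
import subprocess
import sys
import tempfile


def run_tool_on_c_files(src_dir, out_dir, command_prefix):
    """Run command_prefix + [c_file] for each .c file in src_dir.

    The tool's output is saved as <out_dir>/<base name>.txt.
    Returns the list of files written.
    """
    written = []
    for name in sorted(os.listdir(src_dir)):
        stem, ext = os.path.splitext(name)
        if ext != ".c":
            continue
        out_path = os.path.join(out_dir, stem + ".txt")
        with open(out_path, "w") as out_file:
            # run() blocks until the child exits, so the file is complete afterwards.
            subprocess.run(command_prefix + [os.path.join(src_dir, name)],
                           stdout=out_file, stderr=subprocess.STDOUT)
        written.append(out_path)
    return written


# Demo: a stand-in command replaces the real Java call (assumption for illustration).
demo_src = tempfile.mkdtemp()
demo_out = tempfile.mkdtemp()
for name in ("a.c", "notes.h"):
    with open(os.path.join(demo_src, name), "w") as handle:
        handle.write("int x;\n")
stand_in = [sys.executable, "-c", "import sys; print('parsed ' + sys.argv[1])"]
outputs = run_tool_on_c_files(demo_src, demo_out, stand_in)
print(outputs)
```

With the real tool, command_prefix would be the Java command used in codesensor, for example ['/home/jdk1.8.0_65/bin/java', '-jar', CodeSensor_PATH].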
# ProcessRawASTs_DFT.py
import os

def DepthFirstExtractASTs(file_to_process, file_name):
    lines = []
    subLines = ''
    with open(file_to_process) as f:
        original_lines = f.readlines()
        print(original_lines)
        #lines.append(file_name) # The first element is the file name.
        for line in original_lines:
            if not line.isspace(): # Remove the empty line.
                line = line.strip('\n')
                str_lines = line.split('\t')
                #print (str_lines)
                if str_lines[0] != "water": # Remove lines starting with water.
                    #print (str_lines)
                    if str_lines[0] == "func":
                        # Add the return type of the function
                        subElement = str_lines[4].split() # Dealing with "static int" or "static void" or ...
                        if len(subElement) == 1:
                            lines.append(str_lines[4])
                        if subElement.count("*") == 0: # The element does not contain pointer type. If it contains pointer like (int *), it will be divided to 'int' and '*'.
                            if len(subElement) == 2:
                                lines.append(subElement[0])
                                lines.append(subElement[1])
                            if len(subElement) == 3:
                                lines.append(subElement[0])
                                lines.append(subElement[1])
                                lines.append(subElement[2])
                        else:
                            lines.append(str_lines[4])
                        #lines.append(str_lines[5]) # Add the name of the function
                        lines.append("func_name") # Add a placeholder instead of the function name
                    if str_lines[0] == "params":
                        lines.append("params")
                    if str_lines[0] == "param":
                        subParamElement = str_lines[4].split() # Add the possible type of the parameter
                        if len(subParamElement) == 1:
                            lines.append("param")
                            lines.append(str_lines[4]) # Add the parameter type
                        if subParamElement.count("*") == 0:
                            if len(subParamElement) == 2:
                                lines.append("param")
                                lines.append(subParamElement[0])
                                lines.append(subParamElement[1])
                            if len(subParamElement) == 3:
                                lines.append("param")
                                lines.append(subParamElement[0])
                                lines.append(subParamElement[1])
                                lines.append(subParamElement[2])
                        else:
                            lines.append("param")
                            lines.append(str_lines[4]) # Add the parameter type
                    if str_lines[0] == "stmnts":
                        lines.append("stmnts")
                    if str_lines[0] == "decl":
                        subDeclElement = str_lines[4].split() # Add the possible type of the declared variable
                        #print (len(subDeclElement))
                        if len(subDeclElement) == 1:
                            lines.append("decl")
                            lines.append(str_lines[4]) # Add the type of the declared variable
                        if subDeclElement.count("*") == 0:
                            if len(subDeclElement) == 2:
                                lines.append("decl")
                                lines.append(subDeclElement[0])
                                lines.append(subDeclElement[1])
                            if len(subDeclElement) == 3:
                                lines.append("decl")
                                lines.append(subDeclElement[0])
                                lines.append(subDeclElement[1])
                                lines.append(subDeclElement[2])
                        else:
                            lines.append("decl")
                            lines.append(str_lines[4]) # Add the type of the declared variable
                    if str_lines[0] == "op":
                        lines.append(str_lines[4])
                    if str_lines[0] == "call":
                        lines.append("call")
                        lines.append(str_lines[4])
                    if str_lines[0] == "arg":
                        lines.append("arg")
                    if str_lines[0] == "if":
                        lines.append("if")
                    if str_lines[0] == "cond":
                        lines.append("cond")
                    if str_lines[0] == "else":
                        lines.append("else")
                    if str_lines[0] == "stmts":
                        lines.append("stmts")
                    if str_lines[0] == "for":
                        lines.append("for")
                    if str_lines[0] == "forinit":
                        lines.append("forinit")
                    if str_lines[0] == "while":
                        lines.append("while")
                    if str_lines[0] == "return":
                        lines.append("return")
                    if str_lines[0] == "continue":
                        lines.append("continue")
                    if str_lines[0] == "break":
                        lines.append("break")
                    if str_lines[0] == "goto":
                        lines.append("goto")
                    if str_lines[0] == "forexpr":
                        lines.append("forexpr")
                    if str_lines[0] == "sizeof":
                        lines.append("sizeof")
                    if str_lines[0] == "do":
                        lines.append("do")
                    if str_lines[0] == "switch":
                        lines.append("switch")
                    if str_lines[0] == "typedef":
                        lines.append("typedef")
                    if str_lines[0] == "default":
                        lines.append("default")
                    if str_lines[0] == "register":
                        lines.append("register")
                    if str_lines[0] == "enum":
                        lines.append("enum")
                    if str_lines[0] == "union":
                        lines.append("union")
        print(lines)
        # Join all tokens into one comma-separated line ending with a trailing comma.
        subLines = ','.join(lines)
        subLines = subLines + "," + "\n"
    return subLines
def text_vector(FILE_PATH, Processed_FILE):
    total_processed = 0
    for fpathe, dirs, fs in os.walk(FILE_PATH):
        for f in fs:
            if os.path.splitext(f)[1] == '.txt': # Process the AST .txt files only
                file_path = os.path.join(fpathe, f) # f is an AST file produced by CodeSensor
                temp = DepthFirstExtractASTs(file_path, f)
                print(temp)
                # Save the text vector under the same base name in the processed folder.
                with open(Processed_FILE + os.path.splitext(f)[0] + ".txt", "w") as f1:
                    f1.write(temp)
                total_processed = total_processed + 1
    print("Totally, there are " + str(total_processed) + " files.")
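To make the conversion concrete, here is a reduced sketch of the same idea applied to a few in-memory AST lines. It is not the full DepthFirstExtractASTs: it handles only a subset of node types, does not limit type names to three words, and the names line_to_tokens, type_tokens and ast_to_text_vector are hypothetical. The sample lines are invented for illustration and assume tab-separated fields with the node type in field 0 and the type or value in field 4, which is what the indexing in the code above implies.

```python
# Node types that are copied into the text vector unchanged (subset only).
STRUCTURAL_NODES = {"params", "stmnts", "if", "cond", "else", "for", "while", "return"}


def type_tokens(type_field):
    """Split a type such as 'static int' into words; keep pointer types whole."""
    parts = type_field.split()
    if "*" in parts:
        return [type_field]
    return parts


def line_to_tokens(line):
    """Map one AST line to zero or more text-vector tokens."""
    fields = line.rstrip("\n").split("\t")
    node = fields[0]
    if node == "func":
        # The real function name is replaced by a fixed placeholder.
        return type_tokens(fields[4]) + ["func_name"]
    if node in ("param", "decl"):
        return [node] + type_tokens(fields[4])
    if node == "op":
        return [fields[4]]
    if node == "call":
        return ["call", fields[4]]
    if node in STRUCTURAL_NODES:
        return [node]
    return []  # "water" lines and unrecognised node types are dropped


def ast_to_text_vector(ast_lines):
    """Join the tokens of all non-blank lines, with a trailing comma and newline."""
    tokens = []
    for line in ast_lines:
        if line.strip():
            tokens.extend(line_to_tokens(line))
    return ",".join(tokens) + ",\n"


# Invented sample input (positions and layout are assumptions).
sample_ast = [
    "func\t1:0\t5:0\t0\tstatic int\tadd",
    "params\t1:14\t1:26\t1\t",
    "param\t1:15\t1:20\t2\tint\ta",
    "param\t1:22\t1:26\t2\tint *\tb",
    "stmnts\t2:0\t4:0\t1\t",
    "water\t2:0\t2:0\t2\t{",
    "decl\t3:4\t3:11\t2\tint\tsum",
    "op\t3:12\t3:12\t3\t=",
    "call\t3:14\t3:20\t3\tabs",
    "return\t4:4\t4:14\t2\t",
]
vector = ast_to_text_vector(sample_ast)
print(vector)
# static,int,func_name,params,param,int,param,int *,stmnts,decl,int,=,call,abs,return,
```

The output keeps node types, declared types, operators and callee names in depth-first order while dropping identifiers such as the function name, which is how the text vector retains structure.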
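step3 Modify movefiles.py
The locations for the AST files and the text-vector files are now in place. For every directory that contains .c files, we call codesensor to generate the AST files and store them in that directory's Preprocessed folder, then call text_vector to convert the AST files in each Preprocessed folder into text-vector files stored in that directory's processed folder. So at the end of movefiles.py, after all the folders have been created, we add the following code: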
from ProcessCFilesWithCodeSensor import codesensor
from ProcessRawASTs_DFT import text_vector

for i in range(len(FirstDir)):
    # Extract the ASTs of every .c file into the matching Preprocessed folder.
    codesensor(Non_vul_func_trainDir[i]+"/Preprocessed/",Non_vul_func_trainDir[i])
    codesensor(Non_vul_func_testDir[i]+"/Preprocessed/",Non_vul_func_testDir[i])
    codesensor(Non_vul_func_validationDir[i]+"/Preprocessed/",Non_vul_func_validationDir[i])
    codesensor(Vul_func_trainDir[i]+"/Preprocessed/",Vul_func_trainDir[i])
    codesensor(Vul_func_testDir[i]+"/Preprocessed/",Vul_func_testDir[i])
    codesensor(Vul_func_validationDir[i]+"/Preprocessed/",Vul_func_validationDir[i])
    # Convert the ASTs into text vectors stored in the matching processed folder.
    text_vector(Non_vul_func_trainDir[i]+"/Preprocessed/",Non_vul_func_trainDir[i]+"/processed/")
    text_vector(Non_vul_func_testDir[i]+"/Preprocessed/",Non_vul_func_testDir[i]+"/processed/")
    text_vector(Non_vul_func_validationDir[i]+"/Preprocessed/",Non_vul_func_validationDir[i]+"/processed/")
    text_vector(Vul_func_trainDir[i]+"/Preprocessed/",Vul_func_trainDir[i]+"/processed/")
    text_vector(Vul_func_testDir[i]+"/Preprocessed/",Vul_func_testDir[i]+"/processed/")
    text_vector(Vul_func_validationDir[i]+"/Preprocessed/",Vul_func_validationDir[i]+"/processed/")
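The twelve calls follow one pattern per directory, so they could be driven by a loop over a flat list of directories. The sketch below is one possible shape, not the code in movefiles.py: run_pipeline is a hypothetical name, the two stages are passed in as callables so that the example runs with recording stubs instead of codesensor and text_vector, and all ASTs are extracted before any vectors are built, which is a slightly different order from the loop above.

```python
import os


def run_pipeline(function_dirs, extract_ast, build_vectors):
    """Extract ASTs for every directory, then build the text vectors."""
    for d in function_dirs:
        # Joining with "" adds the trailing separator the two functions expect.
        extract_ast(os.path.join(d, "Preprocessed", ""), d)
    for d in function_dirs:
        build_vectors(os.path.join(d, "Preprocessed", ""),
                      os.path.join(d, "processed", ""))


# Demo with stubs that only record their arguments (hypothetical directories).
calls = []
demo_dirs = [os.path.join("Train", "Vulnerable_functions"),
             os.path.join("Test", "Vulnerable_functions")]
run_pipeline(demo_dirs,
             lambda out_dir, src_dir: calls.append(("ast", out_dir, src_dir)),
             lambda ast_dir, vec_dir: calls.append(("vec", ast_dir, vec_dir)))
for call in calls:
    print(call)
```

In movefiles.py the list could be built by concatenating the six directory lists, with codesensor and text_vector passed as the two callables.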
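step4 Modify removeComments_Blanks.py and LoadCFilesAsText.py
We want the generated text-vector files to take the place of the .c files. removeComments_Blanks.py and LoadCFilesAsText.py currently read the .c files directly, strip blanks and stop words from their contents, and save the result as pickle files. To use the text vectors instead, the scripts need to read the text-vector files in the processed folders; everything they do after reading the .txt files stays the same.
So in removeComments_Blanks.py and LoadCFilesAsText.py, every directory that is read for .c files must be extended to the processed folder inside that directory, so that the text-vector files are read, and the file-type check must look for .txt files instead of .c files.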
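The change can be pictured with a small loader. This is a sketch under assumptions, not the code of removeComments_Blanks.py or LoadCFilesAsText.py: load_text_vectors is a hypothetical name, the tokens are assumed to be comma-separated as written by text_vector, and the blank and stop-word handling of the real scripts is left out.

```python
import os
import pickle
import tempfile


def load_text_vectors(function_dir):
    """Return (base names, token lists) for every .txt file in function_dir/processed."""
    vector_dir = os.path.join(function_dir, "processed")
    stems, token_lists = [], []
    for name in sorted(os.listdir(vector_dir)):
        stem, ext = os.path.splitext(name)
        if ext != ".txt":  # look for .txt files instead of .c files
            continue
        with open(os.path.join(vector_dir, name)) as handle:
            # Drop the empty token left by the trailing comma.
            tokens = [t for t in handle.read().strip().split(",") if t]
        stems.append(stem)
        token_lists.append(tokens)
    return stems, token_lists


# Demo on a throwaway directory (file names and contents are invented).
demo_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_dir, "processed"))
with open(os.path.join(demo_dir, "processed", "foo.txt"), "w") as handle:
    handle.write("static,int,func_name,\n")
with open(os.path.join(demo_dir, "foo.c"), "w") as handle:
    handle.write("static int foo(void) { return 0; }\n")

stems, token_lists = load_text_vectors(demo_dir)

# Store the result as a pickle file, as the later stages expect pickled data.
pickle_path = os.path.join(demo_dir, "vectors.pkl")
with open(pickle_path, "wb") as handle:
    pickle.dump((stems, token_lists), handle)
with open(pickle_path, "rb") as handle:
    restored = pickle.load(handle)
print(restored)
```

The base names are returned alongside the tokens because the label of each sample is taken from the file name.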