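The hard part is recognizing //, /* and */ when they appear inside string literals. I later realized that it is enough to step over string literals while matching comments and leave them untouched.
The regular expression for a C++ string literal is "([^\\"]|\\.)*?": between the quotes, a character may not be a bare \ or ", but a \ followed by any character (\.) is allowed. This rejects an unterminated string such as "abc\", while still accepting cases like "abc\"" and "abc\n".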
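To make the string-literal rule concrete, here is a minimal sketch that checks the three examples mentioned above against that pattern. The helper name is_complete_literal is hypothetical and only serves this illustration.

```python
import re

# String-literal pattern: inside the quotes, either one character that is
# neither a backslash nor a quote, or a backslash followed by any character.
literal_string_exp = r'"([^\\"]|\\.)*?"'
pattern = re.compile(literal_string_exp, re.DOTALL)

def is_complete_literal(text):
    """Return True when the whole text is exactly one string literal."""
    return pattern.fullmatch(text) is not None

print(is_complete_literal(r'"abc\""'))  # escaped quote inside the literal
print(is_complete_literal(r'"abc\n"'))  # ordinary escape sequence
print(is_complete_literal(r'"abc\"'))   # closing quote is escaped, literal never ends
```

The first two calls report a match and the third does not, because the backslash consumes the only quote that could have closed the literal.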
The full code is as follows:
#-*- coding:gbk -*-
import re

def ReplaceComment(matchobj):
    # Called by re.sub for every match: keep string literals, drop comments.
    matchstr = matchobj.group(0)
    if matchstr.startswith('"') and matchstr.endswith('"'):
        return matchstr
    else:
        return ''

def RemoveComment(inputfileName, outputfileName):
    with open(inputfileName, "rt") as inputfile:
        codeString = inputfile.read()
    singleLineCommentExp = r'//[^\n]*'
    multiLinecommentExp = r'/\*.*?\*/'
    # "." must also match newlines (re.DOTALL) so that block comments and
    # string literals continued across lines are matched as a whole.
    literalStringExp = r'"([^\\"]|\\.)*?"'
    # String literals are matched as well, so any // or /* inside them is skipped.
    patternExp = literalStringExp + '|' + singleLineCommentExp + '|' + multiLinecommentExp
    codeString = re.sub(patternExp, ReplaceComment, codeString, flags=re.MULTILINE|re.DOTALL)
    with open(outputfileName, "wt") as outputfile:
        outputfile.write(codeString)
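To see the string-skipping idea at work without touching any files, here is a minimal sketch that applies the same combined pattern to an in-memory snippet. The function strip_comments and the sample text are hypothetical, introduced only for this illustration.

```python
import re

def strip_comments(code):
    """Hypothetical in-memory variant of RemoveComment: same pattern, no file I/O."""
    literal_string_exp = r'"([^\\"]|\\.)*?"'
    single_line_comment_exp = r'//[^\n]*'
    multi_line_comment_exp = r'/\*.*?\*/'
    pattern_exp = '|'.join(
        [literal_string_exp, single_line_comment_exp, multi_line_comment_exp])

    def replace(matchobj):
        text = matchobj.group(0)
        # String literals are matched only so they can be copied through untouched.
        return text if text.startswith('"') else ''

    return re.sub(pattern_exp, replace, code, flags=re.DOTALL)

sample = (
    'const char* url = "http://example.com/*x*/"; // trailing comment\n'
    'int n = 0; /* block\n'
    'comment */\n'
)
print(strip_comments(sample))
```

The // and /*x*/ inside the URL string survive, while the trailing line comment and the two-line block comment are removed. The newline that ends a // comment is kept, since the single-line pattern stops before it.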
If you find a bug, corrections are welcome.