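I recently used Python for the following task: split a large block of text into sentences at Chinese full stops and question marks, then use a list of search terms to find the sentences that match. The task is simple, but garbled characters (mojibake) kept showing up during processing. After reading many other people's posts I finally worked out what was going on. Below is the short script, with a few points to watch out for.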
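# -*- coding: cp936 -*-
# or: -*- coding: utf-8 -*-
# Python 2 script. The codec passed to decode() for the string literals below
# must match the encoding this source file is actually saved in.
import sys
import re

if __name__ == '__main__':
    # Three arguments are required: text file, word list, output file.
    if len(sys.argv) < 4:
        print "usage: python search.py test.txt word.list outText"
        sys.exit()
    fin1 = open(sys.argv[2], 'r')
    fout = open(sys.argv[3], 'w')
    while 1:
        curLine = fin1.readline()
        if not curLine: break
        # Decode the search term to unicode so it matches the decoded text.
        searchReg = curLine.rstrip().decode('gbk')
        # Encode again before writing to the file.
        fout.write(searchReg.encode('gbk'))
        fout.write('\n')
        countN = 1
        # Re-read the text file from the start for every search term.
        fin = open(sys.argv[1], 'r')
        while 1:
            testLine = fin.readline()
            if not testLine: break
            transList = []
            # Only lines that start with 'GET' are processed.
            if testLine.find('GET') == 0:
                tmpList = []
                testLine = unicode(testLine, 'gbk')
                # Keep only the text after the last 'ON:' marker.
                transLine = testLine.rstrip().split('ON:'.decode('gbk'))[-1]
                # Split on the full-width question mark, then on the Chinese full stop.
                tmpList = transLine.split(r'?'.decode('gbk'))
                for item in tmpList:
                    thisList = item.split(r'。'.decode('gbk'))
                    for thisItem in thisList:
                        transList.append(thisItem)
            for item in transList:
                # Both searchReg and item are unicode here, so the match is reliable.
                if re.search(searchReg, item):
                    outLine = '(' + str(countN) + '):' + item + '\n'
                    fout.write(outLine.encode('gbk'))
                    countN += 1
        fin.close()
        fout.write('\n')
    fin1.close()
    fout.close()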
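Pay attention to where encode and decode are called above. The rule is that the search term and the text being searched must be in the same encoding. If content has been decoded, it must be encoded again before it is written to the file.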
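The same rule can be shown in isolation as a minimal sketch. It assumes Python 3, where string literals are already Unicode, and GBK-encoded input. The names `SENTENCE_END`, `split_sentences` and `find_matches` are hypothetical helpers introduced for illustration, not part of the script above, and `re.split` is used here in place of the two nested `split` calls.

```python
# -*- coding: utf-8 -*-
import re

# Hypothetical helper names, for illustration only.
# Matches the Chinese full stop and the full-width question mark.
SENTENCE_END = re.compile("[。?]")


def split_sentences(raw_bytes, encoding="gbk"):
    """Decode the raw bytes once, then split on sentence-ending punctuation."""
    text = raw_bytes.decode(encoding)
    # Drop the empty string left after a trailing punctuation mark.
    return [s for s in SENTENCE_END.split(text) if s]


def find_matches(raw_bytes, raw_pattern, encoding="gbk"):
    """Decode text and pattern with the same codec, match, then re-encode."""
    pattern = re.compile(raw_pattern.decode(encoding))
    hits = [s for s in split_sentences(raw_bytes, encoding) if pattern.search(s)]
    # Encode again only at the output boundary.
    return [s.encode(encoding) for s in hits]


if __name__ == "__main__":
    data = "今天天气好吗?今天下雨。明天晴。".encode("gbk")
    word = "今天".encode("gbk")
    for line in find_matches(data, word):
        print(line.decode("gbk"))
```

Decoding once on input and encoding once on output keeps every comparison in Unicode, which is likely the simplest way to avoid the mismatches described above.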