Handling Chinese Characters in Python

Latest recommended article published 2021-02-04 10:34:34

chengba

Category: python. Tags: python

Article link: https://blog.csdn.net/bob_hu924/article/details/6265200

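I recently used Python for the following task: split a large block of text into sentences at the Chinese full stop (。) and question mark (？), then use a list of patterns to find the sentences that match. The task is simple, but garbled characters kept turning up during processing. After reading many people's posts I finally got to the bottom of it. The short program follows, with comments on the points to watch.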

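#-*-coding:cp936-*-

# or -*-coding:utf-8-*-
# The declared encoding must match the encoding the file is saved in, and the
# codec passed to decode() on the string literals below must match it as well.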

import sys

import re

if __name__ == '__main__':

    if len(sys.argv) < 4:
        print "usage: python search.py test.txt word.list outText"
        sys.exit(1)
    fin1 = open(sys.argv[2],'r')
    fout = open(sys.argv[3],'w')
    while 1:
        curLine = fin1.readline()
        if not curLine: break
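        # Decode the GBK bytes to unicode before using the line as a regex.
        searchReg = curLine.rstrip().decode('gbk')
        # Encode back to GBK whenever text is written out.
        fout.write(searchReg.encode('gbk'))
        fout.write('\n')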
        countN = 1
        fin = open(sys.argv[1],'r')
        while 1:
            testLine = fin.readline()
            if not testLine:break
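            transList = []
            # Only lines that start with 'GET' are searched.
            if testLine.find('GET') == 0:
                # Decode first, so that splitting works on characters, not bytes.
                testLine = unicode(testLine,'gbk')
                # Keep the text after the last 'ON:' marker.
                transLine = testLine.rstrip().split('ON:'.decode('gbk'))[-1]
                # Split at the Chinese question mark, then at the Chinese full stop.
                tmpList = transLine.split(r'？'.decode('gbk'))
                for item in tmpList:
                    thisList = item.split(r'。'.decode('gbk'))
                    for thisItem in thisList:
                        transList.append(thisItem)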
                for item in transList:
                    if re.search(searchReg,item):
                        outLine = '(' + str(countN) + '):' + item + '\n'
                        fout.write(outLine.encode('gbk'))
                        countN += 1
        fin.close()
        fout.write('\n')
    fin1.close()
    fout.close()
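
For comparison, here is a minimal sketch of the same decode, split, match, encode flow in Python 3, where str is already Unicode. It is not the program above: the helper names split_sentences and find_matches and the sample sentence are made up for illustration, the 'GET' and 'ON:' handling is left out, and empty pieces are dropped after splitting, which the original does not do.

```python
import re

# Sentence terminators named in the post: the Chinese full stop and question mark.
SENTENCE_END = re.compile("[。？]")


def split_sentences(text):
    """Split text at 。 and ？, dropping empty pieces (hypothetical helper)."""
    return [piece for piece in SENTENCE_END.split(text) if piece]


def find_matches(raw_line, pattern, encoding="gbk"):
    """Decode one line of bytes, split it into sentences, keep those matching pattern."""
    text = raw_line.decode(encoding).rstrip()
    return [s for s in split_sentences(text) if re.search(pattern, s)]


# Made-up sample input, encoded as GBK the way the post's input files are.
sample = "今天天气好吗？今天天气很好。我们去公园。".encode("gbk")
matches = find_matches(sample, "天气")

# Encode again only at the point where the result is written out.
encoded = "\n".join(matches).encode("gbk")
```

The rule is the same as in the Python 2 program: decode bytes as soon as they are read, do all splitting and matching on Unicode text, and encode only when writing.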