re.split fails to split text read from a GBK file in IPython Notebook

Yan456jie

Category: python

Article link: https://blog.csdn.net/Yan456jie/article/details/52419828


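import codecs
import re

# codecs.open decodes the GBK bytes while reading, so text is a unicode object (Python 2).
text = codecs.open(u'text/text.txt', 'r', 'GBK', 'ignore').read()
#text = text.encode("utf-8")
if isinstance(text, unicode):
    print 'yes'
# The pattern is a plain byte-string literal, which the notebook encodes as UTF-8.
sentences = re.split('、|，|。|\r\n|\n|！|；|：|”|—|？|《|“', text)
print "#".join(sentences)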

Result:

yes
混沌未分天地乱，茫茫渺渺无人见。

This shows that when the file is read, the GBK data is automatically decoded into Python's internal unicode type.
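The decoding step can be checked in isolation. The following is a minimal sketch, not the original code: it writes a sample line to a temporary file (the temporary path and the `sample` and `is_text` names are assumptions made for illustration) and confirms that `codecs.open` returns decoded text rather than GBK bytes. It is written to run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile

# Sample line taken from the output shown in the article.
sample = u'混沌未分天地乱，茫茫渺渺无人见。'

# Write the sample as raw GBK bytes to a throwaway file (hypothetical path).
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with open(path, 'wb') as f:
    f.write(sample.encode('gbk'))

# codecs.open decodes while reading, so the result is text, not GBK bytes.
with codecs.open(path, 'r', 'gbk', 'ignore') as f:
    text = f.read()
os.remove(path)

# type(u'') is unicode on Python 2 and str on Python 3.
is_text = isinstance(text, type(u''))
print(is_text)
```

Because the text is already decoded, any pattern applied to it should also be a unicode string rather than an encoded byte string.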

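The code in an IPython Notebook cell is presumably UTF-8 encoded, so the punctuation in the pattern is a UTF-8 byte string, which cannot match the characters of the unicode text. Adding

text = text.encode("utf-8")

before the split gives the correct result: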

混沌未分天地乱#茫茫渺渺无人见#
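As an alternative to encoding the text to UTF-8 bytes, the text can stay decoded and the pattern can be made a unicode string instead, so both sides of the match have the same type. This is a sketch of that variant, not the method used above; the `pattern` and `joined` names are chosen for illustration, and the sample text stands in for the file contents. It should behave the same on Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
import re

# Stand-in for the decoded file contents (unicode on Python 2, str on Python 3).
text = u'混沌未分天地乱，茫茫渺渺无人见。'

# Unicode pattern: line breaks first, then a character class of punctuation.
# '\r\n' is listed before '\n' so a Windows line ending is consumed as one separator.
pattern = u'\r\n|\n|[、，。！；：”—？《“]'

sentences = re.split(pattern, text)
joined = u'#'.join(sentences)
print(joined)
```

The trailing empty string in `sentences` comes from the final full stop, which is why the joined output ends with `#`, just as in the result above.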
