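import codecs
import re

# Decode the GBK file to unicode while reading; undecodable bytes are dropped.
text = codecs.open(u'text/text.txt', 'r', 'GBK', 'ignore').read()
#text = text.encode("utf-8")
if isinstance(text, unicode):
    print 'yes'
# '?' is escaped because a bare ASCII '?' is a regex quantifier and would not compile.
sentences = re.split('、|,|。|\n|\r\n|!|;|:|”|—|\?|《|“', text)
print "#".join(sentences)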
Result:
yes
混沌未分天地乱,茫茫渺渺无人见。
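This shows that when the file is read, the GBK bytes are automatically decoded into unicode, Python's internal text type.
The IPython notebook source, however, is presumably UTF-8 encoded, so the punctuation in the pattern is a UTF-8 byte string; it never matches the unicode text, and nothing is split. Adding
text = text.encode("utf-8")
after the read gives the correct result: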
<pre>混沌未分天地乱#茫茫渺渺无人见#
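The underlying rule is that the pattern and the text must be the same kind of string. As a minimal sketch, the snippet below does the split both ways: with everything kept as unicode, and with everything encoded to UTF-8 as in the fix above. The sample string and the names `raw`, `pattern_u`, `pattern_b`, `parts_u` and `parts_b` are illustrative assumptions, not part of the original code, and the pattern is shortened to two separators. It is written to run under both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import re

# Hypothetical stand-in for the file contents: the sample line as GBK bytes.
raw = u'混沌未分天地乱,茫茫渺渺无人见。'.encode('gbk')

# Decoding the GBK bytes yields a unicode string, like codecs.open(...).read().
text = raw.decode('gbk', 'ignore')

pattern_u = u',|。'                     # unicode pattern (shortened for the sketch)
pattern_b = pattern_u.encode('utf-8')   # the same pattern as UTF-8 bytes

# Option 1: unicode pattern against unicode text.
parts_u = re.split(pattern_u, text)

# Option 2: UTF-8 byte pattern against UTF-8 encoded text.
parts_b = re.split(pattern_b, text.encode('utf-8'))

print(u'#'.join(parts_u))
print(b'#'.join(parts_b).decode('utf-8'))
```

Both routes produce the same pieces. Keeping the data as unicode and writing the pattern as a `u'...'` literal avoids the extra encode step and keeps later string operations working on characters rather than bytes.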