Python 2 Text Processing

Latest recommended article published on 2021-12-28 09:07:04

Lord_sh

Article link: https://blog.csdn.net/Lord_sh/article/details/94967721

First, a recommended article: "Python2字符编码问题汇总" (a summary of Python 2 character-encoding issues)

https://www.cnblogs.com/pyxiaomangshe/p/7837380.html

In Python 2 there are three string-related types: unicode (text strings), str (byte strings, i.e. binary data), and basestring, the abstract parent class of the other two.
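To make the distinction concrete, here is a minimal sketch that runs on both Python 2 and Python 3. The names `PY2` and `text_type` are introduced only for this illustration and do not come from the original post. On Python 3 the text type is `str`, the byte type is `bytes`, and `basestring` no longer exists, so that check is guarded.

```python
import sys

PY2 = sys.version_info[0] == 2
# Illustrative alias (an assumption of this sketch): the text type is
# `unicode` on Python 2 and `str` on Python 3.
text_type = unicode if PY2 else str  # noqa: F821

# The three characters U+67E5 U+5C14 U+65AF as a text string.
text = u'\u67e5\u5c14\u65af'
# The same characters as a UTF-8 byte string.
data = text.encode('utf-8')

# A text string is measured in characters, a byte string in bytes.
n_chars = len(text)
n_bytes = len(data)

is_text = isinstance(text, text_type)
is_bytes = isinstance(data, bytes)  # `bytes` is an alias of `str` on Python 2

if PY2:
    # Only on Python 2 do both types share the parent class `basestring`.
    shares_parent = (isinstance(text, basestring) and  # noqa: F821
                     isinstance(data, basestring))     # noqa: F821
```

Each of these characters takes three bytes in UTF-8, which is why the two lengths differ.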

The BERT source file tokenization.py contains an interesting piece of code:

def convert_to_unicode(text):
  """Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
  if six.PY3:
    if isinstance(text, str):
      return text
    elif isinstance(text, bytes):
      return text.decode("utf-8", "ignore")
    else:
      raise ValueError("Unsupported string type: %s" % (type(text)))
  elif six.PY2:
    if isinstance(text, str):
      return text.decode("utf-8", "ignore")
    elif isinstance(text, unicode):
      return text
    else:
      raise ValueError("Unsupported string type: %s" % (type(text)))
  else:
    raise ValueError("Not running on Python2 or Python 3?")
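The function above decodes with the "ignore" error handler. As a small sketch of what that handler does (the sample bytes are chosen only for illustration), the snippet below decodes a byte string that contains one invalid byte, first leniently and then with the default strict handler.

```python
# Bytes for the character U+4F60 followed by 0xFF, which is never valid in UTF-8.
raw = b'\xe4\xbd\xa0\xff'

# With "ignore", the undecodable byte is silently dropped.
lenient = raw.decode('utf-8', 'ignore')

# With the default "strict" handler, the same input raises an error.
try:
    raw.decode('utf-8')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

Dropping bad bytes keeps a pipeline running on dirty input, at the cost of losing data without any warning, so whether "ignore" is appropriate depends on the application.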

Best practices

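So, if migrating to Python 3 is not an option, what can be done?
Here are a few suggestions:

All text strings should be of type unicode, not str. If you are manipulating text and its type is str, you are creating a bug.
Convert explicitly whenever a conversion is needed. To decode bytes into text, use var.decode(encoding); to encode text into bytes, use var.encode(encoding).
When reading data from outside the program, treat it as bytes and decode it into the text you need; likewise, when sending text out, encode it into bytes first.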
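The suggestions above can be sketched as a pair of helpers that keep bytes at the I/O boundary and text everywhere else. This is only one possible arrangement; `read_text`, `write_text`, and the temporary file are hypothetical names created for this demo, and the code is written to run on both Python 2 and Python 3.

```python
import io
import os
import tempfile


def read_text(path, encoding='utf-8'):
    """Read raw bytes from `path` and decode them explicitly into text."""
    with io.open(path, 'rb') as f:
        return f.read().decode(encoding)


def write_text(path, text, encoding='utf-8'):
    """Encode `text` explicitly and write the resulting bytes to `path`."""
    with io.open(path, 'wb') as f:
        f.write(text.encode(encoding))


# Round trip through a temporary file that exists only for this demo.
fd, demo_path = tempfile.mkstemp()
os.close(fd)
try:
    write_text(demo_path, u'\u67e5\u5c14\u65af\n')
    loaded = read_text(demo_path)
    stripped = loaded.strip()  # all processing happens on text, not bytes
finally:
    os.remove(demo_path)
```

Opening files in binary mode and calling decode/encode by hand makes every conversion visible; `codecs.open` or `io.open` with an `encoding` argument performs the same conversion inside the file object.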

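Stop using setdefaultencoding('utf-8') immediately, and why

sys.setdefaultencoding changes the implicit str/unicode conversion for the whole process, so it hides missing decode/encode calls instead of fixing them and can alter the behavior of third-party code.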

# -*- coding: utf-8 -*-

import codecs
import json
import sys


def print_sentence_len(file_name):
    # codecs.open decodes while reading, so each line is already a unicode object
    # and no reload(sys) / sys.setdefaultencoding hack is needed.
    with codecs.open(file_name, 'r', 'utf-8') as fr:
        for line in fr:
            # Each line is "<JSON object>\t<second field>".
            dic, p = line.strip().split('\t')
            dic = json.loads(dic)
            sentence = dic['text']
            # Encode explicitly at the output boundary.
            print sentence.encode('utf-8')
            # len() of a unicode object counts characters, not bytes.
            print len(sentence)
            sentence = u'我是XX,，(（123'
            print sentence.encode('utf-8')
            print len(sentence)
            break  # only the first line is inspected

if __name__ == "__main__":
    file_name = sys.argv[1]
    print_sentence_len(file_name)

"""
__________________Printed output____________________
#查尔斯
#3
#我是XX,，(（123
#11
"""
