Character Sets and Unicode in Python

weixin_33937499

Original link: https://my.oschina.net/vincentwy/blog/158799

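###Unicode and Encoding/Decoding###

Python 2 has two types for handling text: str and unicode. A str is comparable to a char* in C, that is, a raw sequence of bytes; unicode corresponds to the String class found in many other languages. Converting between str and unicode is therefore analogous to converting between a char* and a String object.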

    str1 = 'test'
    ustr2 = u'test'
    print type(str1)     # prints <type 'str'>
    print type(ustr2)    # prints <type 'unicode'>
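
To make the difference between bytes and characters concrete, here is a minimal sketch. It is written so that it should run unchanged on Python 2 and on Python 3 (where Python 2's str corresponds to bytes and unicode corresponds to str). The sample text and the variable names are illustrative only and do not come from the original article.

```python
# -*- coding: utf-8 -*-
# The two-character text "中文", written with escapes so the example does
# not depend on the encoding of the source file.
text = u'\u4e2d\u6587'

# Encoding turns characters into bytes; the byte count depends on the charset.
utf8_bytes = text.encode('utf-8')
gbk_bytes = text.encode('gbk')

print(len(text))        # 2 characters
print(len(utf8_bytes))  # 6 bytes in UTF-8
print(len(gbk_bytes))   # 4 bytes in GBK
```

The same text yields different byte strings under different encodings, which is why a byte string is only meaningful together with the name of its encoding.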

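Text processing often requires converting between str and unicode, or converting a str in one encoding into a str in another. A character set, in the sense used here, is a character encoding (such as ascii, utf-8, or gbk). It is a property of a particular byte string, whereas a unicode object is a higher-level abstraction over such strings.
Encode: convert a unicode object into a byte string (str) in a given character set. The target encoding is usually passed as an argument.
Decode: convert a byte string (str) into a unicode object. The encoding that the byte string actually uses usually has to be passed as an argument.
In Python, the methods used most often are unicode.encode() and str.decode(), which encode a unicode object into a str and decode a str into a unicode object, respectively.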

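    # -*- coding: gbk -*-
    # Assumes this source file is saved as GBK, so the literals below are GBK bytes
    str1 = 'test'
    u1 = str1.decode()             # decode with the default codec, ascii
    print type(u1)                 # prints <type 'unicode'>
    str2 = "中文"
    u2 = str2.decode('gbk')        # contains Chinese, so the default ascii codec cannot decode it
    u3 = u'test'                   # create a unicode object
    str3 = u3.encode()             # encode with the default codec, ascii
    u4 = unicode('中文', 'gbk')    # another way to create a unicode object
    str4 = u4.encode('utf-8')      # encode as utf-8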
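
The following sketch shows what happens when the wrong codec is used, which is the usual cause of UnicodeDecodeError and UnicodeEncodeError. It avoids Python 2-only syntax so that it should run on both Python 2 and Python 3; the variable names are illustrative assumptions.

```python
# -*- coding: utf-8 -*-
text = u'\u4e2d\u6587'          # the text "中文"
gbk_bytes = text.encode('gbk')  # what Python 2 calls a str

# Decoding with the right charset restores the original text.
restored = gbk_bytes.decode('gbk')

# Decoding with ascii fails because the bytes are outside the ASCII range.
try:
    gbk_bytes.decode('ascii')
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Encoding to ascii fails for the same reason in the other direction.
try:
    text.encode('ascii')
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

print(restored == text)
print(decode_failed)
print(encode_failed)
```

In Python 2 the same errors also appear when str and unicode values are mixed, because the interpreter then falls back to the ascii codec implicitly.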

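###Character-Set Issues in IDEs on Windows###

On Simplified Chinese Windows, the command prompt (cmd) uses GBK, so the strings it reads and writes are GBK-encoded. The following uses unzip (a decompression tool) as an example to illustrate some issues related to the platform, the IDE, and character sets.
Example: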

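    import os
    strcmd = 'unzip filename.zip -d target_path'
    # popen3 returns the child's (stdin, stdout, stderr) as file objects
    unzipres = os.popen3(strcmd)
    unzipres[0].close()                  # nothing to send to unzip
    print 'res:' + unzipres[1].read()    # standard output
    unzipres[1].close()
    print 'err:' + unzipres[2].read()    # standard error
    unzipres[2].close()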

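The code above runs an external program with the popen3 function of the os module, which returns three streams: standard input (stdin), standard output (stdout), and standard error (stderr). Now assume the following:
1) filename or target_path in strcmd contains Chinese characters
2) stdout contains Chinese characters
3) stderr contains Chinese characters
Under these conditions the code may fail in different ways depending on the environment it is run from. The command line, ulipad, and Sublime Text are tested in turn below.
Command line:
The Windows command line uses GBK, so both its input and its output are GBK-encoded str values. Run from the command line, the code above produces the correct output.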
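
os.popen3 exists only in Python 2. As a rough sketch of the same pattern with the subprocess module, the example below captures the raw bytes of a child process and decodes them explicitly. Since unzip may not be installed, a small Python child process that writes GBK bytes stands in for it; CHILD_CODE and run_and_decode are hypothetical names introduced for this illustration.

```python
import subprocess
import sys

# Hypothetical stand-in for unzip: a child process that writes GBK bytes.
CHILD_CODE = (
    "import sys; "
    "out = getattr(sys.stdout, 'buffer', sys.stdout); "
    "out.write(u'\\u4e2d\\u6587'.encode('gbk'))"
)

def run_and_decode(args, encoding='gbk'):
    """Run a command and return (stdout, stderr) decoded to unicode text."""
    proc = subprocess.Popen(
        args,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # communicate() closes stdin and reads both output streams to the end.
    out, err = proc.communicate()
    # 'replace' keeps stray undecodable bytes from raising an exception.
    return out.decode(encoding, 'replace'), err.decode(encoding, 'replace')

out_text, err_text = run_and_decode([sys.executable, '-c', CHILD_CODE])
print(out_text == u'\u4e2d\u6587')
```

Decoding at the point where the bytes enter the program means the rest of the code only deals with unicode text, regardless of which environment launched it.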

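ulipad:
Running the code above in ulipad produces an error similar to:
err:unzip: cannot find any matches for wildcard specification ...，No zipfiles found
The code calls the external program unzip, which expects GBK-encoded input on the command line, but the command string in the code is a plain str that is not GBK-encoded. The fix is to define strcmd as a unicode object and encode it as GBK when passing it as an argument: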

    strcmd = u'unzip filename.zip -d target_path'
    unzipres = os.popen3(strcmd.encode('gbk'))
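
The sketch below illustrates why the file cannot be found when the command carries the wrong encoding. The archive name is a hypothetical example, and the UTF-8 case is an assumption about how the source file might have been saved.

```python
# -*- coding: utf-8 -*-
# Hypothetical archive name containing Chinese characters ("中文.zip").
filename = u'\u4e2d\u6587.zip'
command = u'unzip ' + filename + u' -d target_path'

as_gbk = command.encode('gbk')     # bytes a GBK console program expects
as_utf8 = command.encode('utf-8')  # bytes a UTF-8 source file would contain

# The ASCII parts are identical in both encodings; the Chinese part is not.
print(as_gbk == as_utf8)           # False

# What a GBK-based program would "see" if it were handed the UTF-8 bytes.
misread = as_utf8.decode('gbk', 'replace')
print(misread == command)          # False
```

Because the misread name differs from the real one, the external program looks for a file that does not exist.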

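Sublime Text:
Running the corrected code in Sublime Text 2 may print the following error:
[Decode error - output not utf-8]
This happens because the output panel of Sublime Text 2, as configured by default, only displays UTF-8 encoded strings, while the output returned by unzip is a GBK-encoded str. The fix is to decode the output into a unicode object and then encode it as UTF-8 before printing: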

    print 'res:' + unzipres[1].read().decode('gbk').encode('utf-8')  
    print 'err:' + unzipres[2].read().decode('gbk').encode('utf-8')
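
The decode-then-encode step can be wrapped in a small helper. This is only a sketch: transcode is a hypothetical name, and the sample bytes stand in for real unzip output.

```python
def transcode(raw, source='gbk', target='utf-8'):
    """Re-encode a byte string from one charset to another."""
    # 'replace' swaps undecodable bytes for U+FFFD instead of raising.
    return raw.decode(source, 'replace').encode(target)

# Stand-in for GBK bytes read from an external program.
gbk_output = u'\u89e3\u538b\u5b8c\u6210'.encode('gbk')
utf8_output = transcode(gbk_output)

print(utf8_output.decode('utf-8') == gbk_output.decode('gbk'))  # True
```

Pure ASCII output passes through unchanged, so the helper can be applied to every line regardless of whether it contains Chinese characters.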

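###Summary###

Character encoding in Python 2.x has long been a source of complaints. The crux is the conversion between unicode and str. It is best to work with unicode objects wherever possible and to encode them as needed only at the points where input or output takes place.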
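
One way to follow this advice for file I/O is to let io.open perform the encoding and decoding at the boundary. The sketch below assumes a temporary scratch file and a GBK target encoding, both chosen only for illustration.

```python
import io
import os
import tempfile

text = u'\u4e2d\u6587'  # keep text as unicode inside the program

# Hypothetical scratch file, created only for this demonstration.
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
try:
    # Encode once, on the way out.
    with io.open(path, 'w', encoding='gbk') as handle:
        handle.write(text)

    # The file on disk holds GBK bytes.
    with io.open(path, 'rb') as handle:
        raw = handle.read()

    # Decode once, on the way in.
    with io.open(path, 'r', encoding='gbk') as handle:
        round_tripped = handle.read()
finally:
    os.remove(path)

print(raw == text.encode('gbk'))
print(round_tripped == text)
```

With this arrangement the encoding name appears only where data crosses the program's edge, and no explicit encode() or decode() calls are scattered through the rest of the code.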

Reposted from: https://my.oschina.net/vincentwy/blog/158799
