Python character encoding issues: chardet and codecs

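This article describes how to handle character encodings in Python: detecting a file's encoding with the chardet library, understanding the str and unicode objects, and converting between encodings with the codecs module. It also discusses the UTF-8 BOM problem and a workaround, and gives example code that detects and converts UTF-16, UTF-8, GB2312, and ASCII text.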


1. The chardet library makes it easy to detect the character encoding of files, URLs, XML documents, and similar data.


2. How strings are structured in Python:

The Python 2 built-in functions basestring, str, and unicode are documented as follows:
basestring() 
This abstract type is the superclass for str and unicode. It cannot be called or instantiated, but it can be used to test whether an object is an instance of str or unicode. isinstance(obj, basestring) is equivalent to isinstance(obj, (str, unicode)).
str([object]) 
Return a string containing a nicely printable representation of an object. For strings, this returns the string itself. The difference with repr(object) is that str(object) does not always attempt to return a string that is acceptable to eval(); its goal is to return a printable string. If no argument is given, returns the empty string, ''.
unicode([object[, encoding[, errors]]]) 
Return the Unicode string version of object using one of the following modes:
If encoding and/or errors are given, unicode() will decode the object which can either be an 8-bit string or a character buffer using the codec for encoding. The encoding parameter is a string giving the name of an encoding; if the encoding is not known, LookupError is raised. Error handling is done according to errors; this specifies the treatment of characters which are invalid in the input encoding. If errors is 'strict' (the default), a ValueError is raised on errors, while a value of 'ignore' causes errors to be silently ignored, and a value of 'replace' causes the official Unicode replacement character, U+FFFD, to be used to replace input characters which cannot be decoded. See also the codecs module.
If no optional parameters are given, unicode() will mimic the behaviour of str() except that it returns Unicode strings instead of 8-bit strings. More precisely, if object is a Unicode string or subclass it will return that Unicode string without any additional decoding applied.
For objects which provide a __unicode__() method, it will call this method without arguments to create a Unicode string. For all other objects, the 8-bit string version or representation is requested and then converted to a Unicode string using the codec for the default encoding in 'strict' mode.
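
The three error handlers described above can be tried out with a short sketch. It is written for Python 3, which is an assumption on my part: there `bytes.decode()` plays the role of Python 2's `unicode(obj, encoding, errors)`. The sample bytes and the helper name `decode_with` are made up for illustration.

```python
# Python 3 sketch (assumption): bytes.decode() stands in for Python 2's unicode(obj, encoding, errors).
raw = "中文".encode("utf-8") + b"\xff" + b"!"   # valid UTF-8, one invalid byte, then more valid text

def decode_with(errors):
    """Decode the sample bytes as UTF-8 with the given error handler (hypothetical helper)."""
    return raw.decode("utf-8", errors=errors)

try:
    decode_with("strict")               # the default handler raises on the invalid byte
    strict_result = "decoded"
except UnicodeDecodeError:              # UnicodeDecodeError is a subclass of ValueError
    strict_result = "UnicodeDecodeError"

ignored = decode_with("ignore")         # the invalid byte is silently dropped
replaced = decode_with("replace")       # the invalid byte becomes U+FFFD

try:
    b"abc".decode("no-such-encoding")   # an unknown codec name is a LookupError
    lookup_result = "decoded"
except LookupError:
    lookup_result = "LookupError"

print(strict_result, repr(ignored), repr(replaced), lookup_result)
```

Note that 'ignore' loses data without any warning, so it is usually worth preferring 'strict' or 'replace' while debugging.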

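From the description above, encoding conversion in Python roughly works as follows: first decode the data from its source encoding into Python's internal unicode string, then encode that unicode string into the target encoding with the corresponding codec.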
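
The decode-then-encode flow can be sketched in a few lines. This is a minimal Python 3 illustration, not the article's own code: the function name `transcode` and the sample text are assumptions, and in Python 2 the first step would be written `unicode(data, src_encoding)`.

```python
def transcode(data, src_encoding, dst_encoding):
    """Decode bytes from src_encoding into text, then encode that text as dst_encoding."""
    text = data.decode(src_encoding)     # bytes -> Unicode text
    return text.encode(dst_encoding)     # Unicode text -> bytes in the target encoding

utf16_bytes = "编码".encode("utf-16")      # sample input; the "utf-16" codec writes a BOM first
gb_bytes = transcode(utf16_bytes, "utf-16", "gb2312")

print(len(utf16_bytes), len(gb_bytes))
print(gb_bytes.decode("gb2312"))
```

The Unicode string in the middle is the pivot: no codec converts UTF-16 bytes to GB2312 bytes directly.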


3. API of the codecs module: http://docs.python.org/library/codecs.html
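
As a small taste of that API, the sketch below looks up a codec and uses it directly. It assumes Python 3, and the sample text is arbitrary. `codecs.lookup()`, `codecs.encode()`, and `codecs.decode()` are standard-library calls.

```python
import codecs

info = codecs.lookup("GB2312")            # returns a codecs.CodecInfo object
encoded, consumed = info.encode("中文")    # encode() returns (bytes, number of characters consumed)
decoded, used = info.decode(encoded)      # decode() returns (text, number of bytes consumed)

# codecs.encode() and codecs.decode() are shortcuts that return only the converted object.
same_bytes = codecs.encode("中文", "gb2312")
same_text = codecs.decode(same_bytes, "gb2312")

print(info.name, consumed, used)
```

Because the codec's own encode() and decode() return a tuple, code that writes the result to a file has to take element 0 of it.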


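4. The BOM (byte order mark) problem with UTF-8 encoded Chinese text: http://www.kgblog.net/2009/07/22/utf-8-bom-encoding.html . Decoding with the plain "utf-8" codec keeps the BOM as the character U+FEFF at the start of the string, which can then cause an error when the text is re-encoded, for example to GB2312. A workaround is shown in the code below.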
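
The effect of the BOM can be reproduced in memory without any files. The following is a Python 3 sketch under that assumption. It relies on the standard "utf-8-sig" codec, which strips a leading BOM while decoding, and the sample text is arbitrary.

```python
import codecs

data = codecs.BOM_UTF8 + "中文".encode("utf-8")   # UTF-8 content that starts with a BOM

plain = data.decode("utf-8")        # the BOM survives as U+FEFF at index 0
clean = data.decode("utf-8-sig")    # the BOM is removed during decoding

try:
    plain.encode("gb2312")          # GB2312 has no representation for U+FEFF
    outcome = "encoded"
except UnicodeEncodeError:
    outcome = "UnicodeEncodeError"

gb_bytes = clean.encode("gb2312")   # succeeds once the BOM is gone

print(repr(plain[0]), outcome, len(gb_bytes))
```

Stripping the BOM at decode time avoids the need for an 'ignore' error handler, which would also hide any other character the target encoding cannot represent.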


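5. Example code: it needs four files, encoded in UTF-16, UTF-8, GB2312, and ASCII respectively (src_1.txt to src_4.txt in the code), and it detects and converts their encodings.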

#-*- encoding: gb2312 -*-

import chardet

def test():
    
    import urllib
    #rawdata = urllib.urlopen('http://172.16.120.166/media_resource/').read()   
    #print chardet.detect(rawdata)

# The code below uses chardet, Python's unicode() function, and the encode() method of strings to decode and encode text
##############################################################################
    # Inspect the UTF-16 encoded file
    with open("src_1.txt", 'rb') as f:      # binary mode: chardet works on the raw bytes
        src_1 = f.read()
    detected = chardet.detect(src_1)        # detect once and reuse the result
    print "src_1: " + str(detected["encoding"]) + " with confidence: " + str(detected["confidence"])
    
    # Inspect the UTF-8 encoded file
    with open("src_2.txt", 'rb') as f:
        src_2 = f.read()
    detected = chardet.detect(src_2)
    print "src_2: " + str(detected["encoding"]) + " with confidence: " + str(detected["confidence"])
    
    # Inspect the GB2312 encoded file
    with open("src_3.txt", 'rb') as f:
        src_3 = f.read()
    detected = chardet.detect(src_3)
    print "src_3: " + str(detected["encoding"]) + " with confidence: " + str(detected["confidence"])
    
    # Inspect the ASCII encoded file
    with open("src_4.txt", 'rb') as f:
        src_4 = f.read()
    detected = chardet.detect(src_4)
    print "src_4: " + str(detected["encoding"]) + " with confidence: " + str(detected["confidence"])
    
    
    # Convert UTF-16 to GB2312
    result = unicode(src_1, "utf-16").encode("GB2312")      # decode to a Python unicode string first, then encode as GB2312
    with open("rst_1.txt", 'wb') as resultFile:     # write the encoded bytes to rst_1.txt
        resultFile.write(result)
    with open("rst_1.txt", 'rb') as f:
        rst = f.read()
    detected = chardet.detect(rst)
    print "rst_1: " + str(detected["encoding"]) + " with confidence: " + str(detected["confidence"])   # check whether the conversion succeeded
##############################################################################

# The code below uses the codecs module to decode and encode text
##############################################################################
    import codecs
    
    look_gb2312 = codecs.lookup("GB2312")   # look up the GB2312 codec
    
    # Convert UTF-16 to GB2312
    file_utf16 = codecs.open("src_1.txt", 'r', "UTF-16")    # codecs.open() reads the file using the given encoding
    str_unicode = file_utf16.read()    # returns a Python unicode string directly
    file_utf16.close()
    str_gb2312 = look_gb2312.encode(str_unicode)    # encode with the GB2312 codec; returns (encoded string, length consumed)
    resultFile = open("rst_2.txt", 'wb')     # write to rst_2.txt
    resultFile.write(str_gb2312[0])
    resultFile.close()
    with open("rst_2.txt", 'rb') as f:
        rst = f.read()
    detected = chardet.detect(rst)
    print "rst_2: " + str(detected["encoding"]) + " with confidence: " + str(detected["confidence"])   # check whether the conversion succeeded
    
    
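    # Convert UTF-8 to GB2312
    # The "utf-8-sig" codec strips a leading BOM (U+FEFF), which GB2312 cannot encode.
    # Passing 'ignore' to encode() would also skip it, but it silently drops every other unencodable character too.
    file_utf8 = codecs.open("src_2.txt", 'r', "utf-8-sig")
    str_unicode = file_utf8.read()
    file_utf8.close()
    str_gb2312 = look_gb2312.encode(str_unicode)
    resultFile = open("rst_3.txt", 'wb')     # write to rst_3.txt
    resultFile.write(str_gb2312[0])
    resultFile.close()
    with open("rst_3.txt", 'rb') as f:
        rst = f.read()
    detected = chardet.detect(rst)
    print "rst_3: " + str(detected["encoding"]) + " with confidence: " + str(detected["confidence"])   # check whether the conversion succeeded
    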
    
##############################################################################
 
if __name__ == '__main__':
    test()
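
The script above targets Python 2. For readers on Python 3, a file-to-file conversion might look like the sketch below. It is an illustration under stated assumptions rather than a port of the script: the function name `convert_file` is hypothetical, the sample file is generated in a temporary directory, and the built-in `open()` with an `encoding` argument replaces `codecs.open()`.

```python
import os
import tempfile

def convert_file(src_path, dst_path, src_encoding, dst_encoding):
    """Re-encode a text file line by line (hypothetical helper, not from the article)."""
    # newline="" turns off newline translation, so line endings are copied unchanged.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        for line in src:
            dst.write(line)

# Build a small UTF-8 sample file that starts with a BOM.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "src_2.txt")
dst_path = os.path.join(workdir, "rst_3.txt")
with open(src_path, "wb") as f:
    f.write(b"\xef\xbb\xbf" + "第一行\r\n第二行\r\n".encode("utf-8"))

# "utf-8-sig" drops the BOM while reading; GB2312 is the target encoding.
convert_file(src_path, dst_path, "utf-8-sig", "gb2312")

with open(dst_path, "rb") as f:
    converted = f.read()
print(len(converted))
```

Reading line by line keeps memory use flat for large files, and any character that GB2312 cannot represent raises UnicodeEncodeError instead of being dropped.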

