Notes on handling character encodings in Python

Latest recommended article published on 2022-03-09 09:25:51

海楓

Category: Encoding and Databases. Tags: python, object, output, encoding, codec, input

Article link: https://blog.csdn.net/moxien/article/details/2745663

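Here are a few functions related to character encoding.

sys.getdefaultencoding()
Returns Python's default encoding.

There is also locale.getpreferredencoding()
Returns the default character encoding used by the system.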

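I am not yet clear on the exact difference between these two functions. My working understanding is that the first is the encoding Python itself uses by default when converting strings, while the second is the encoding preferred by the operating system's locale settings.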
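To compare the two values on a given machine, a minimal sketch follows. It assumes Python 3 (these notes were written against Python 2, where the first call typically returned 'ascii'); the variable names are only illustrative.

```python
import locale
import sys

# Default encoding used by str.encode() and bytes.decode() when no
# encoding is given. On Python 3 this is always 'utf-8'.
python_default = sys.getdefaultencoding()

# Encoding the operating system locale prefers for text data.
# The value depends on the platform and on the user's settings.
system_preferred = locale.getpreferredencoding()

print("sys.getdefaultencoding():     ", python_default)
print("locale.getpreferredencoding():", system_preferred)
```

The two values can differ, which is why text arriving from outside Python should be decoded with an explicitly chosen encoding.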

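There are also two functions for converting between encodings, decode and encode.
decode converts a string in the specified character set to unicode.
encode does the opposite: it converts a unicode string to a string in the specified encoding.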
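The round trip can be sketched as below. This assumes Python 3, where a unicode string is `str` and an encoded string is `bytes`; the sample text and variable names are only illustrative.

```python
text = "字符編碼"                      # a Unicode string (str in Python 3)

# encode: Unicode -> bytes in the named character set
big5_bytes = text.encode("big5")

# decode: bytes in the named character set -> Unicode
restored = big5_bytes.decode("big5")

print(type(big5_bytes).__name__, len(big5_bytes))
print(restored == text)
```

Each of the four characters takes two bytes in Big5, so the encoded form is eight bytes long.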

The Python documentation describes them as follows.

decode( input[, errors])

Decodes the object input and returns a tuple (output object, length consumed). In a Unicode context, decoding converts a plain string encoded using a particular character set encoding to a Unicode object.

input must be an object which provides the bf_getreadbuf buffer slot. Python strings, buffer objects and memory mapped files are examples of objects providing this slot.

errors defines the error handling to apply. It defaults to 'strict' handling.

The method may not store state in the Codec instance. Use StreamCodec for codecs which have to keep state in order to make encoding/decoding efficient.

The decoder must be able to handle zero length input and return an empty object of the output object type in this situation.

encode( input[, errors])

Encodes the object input and returns a tuple (output object, length consumed). While codecs are not restricted to use with Unicode, in a Unicode context, encoding converts a Unicode object to a plain string using a particular character set encoding (e.g., cp1252 or iso-8859-1).

errors defines the error handling to apply. It defaults to 'strict' handling.

The method may not store state in the Codec instance. Use StreamCodec for codecs which have to keep state in order to make encoding/decoding efficient.

The encoder must be able to handle zero length input and return an empty object of the output object type in this situation.
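The tuple return value described in the documentation above can be observed through the stateless functions returned by `codecs.getencoder` and `codecs.getdecoder`. This is a sketch assuming Python 3; the sample string is only illustrative.

```python
import codecs

# Stateless encode/decode functions for one codec.
encoder = codecs.getencoder("iso-8859-1")
decoder = codecs.getdecoder("iso-8859-1")

# Each call returns (output object, length consumed).
encoded, consumed_chars = encoder("café")
decoded, consumed_bytes = decoder(encoded)

# Zero-length input must yield an empty object of the output type.
empty_out, empty_len = decoder(b"")

print(encoded, consumed_chars)
print(decoded, consumed_bytes)
print(repr(empty_out), empty_len)
```

Note that the string methods `str.encode` and `bytes.decode` return only the converted object, not the tuple.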

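In the codec interface, input is the object to be converted. When decode and encode are called as string methods, the first argument is instead the name of the encoding.
Besides the default 'strict', which raises an error, errors has three commonly used options:
'ignore': skip characters that cannot be converted.
'replace': substitute a question mark (?) for characters that cannot be converted when encoding (U+FFFD when decoding).
'xmlcharrefreplace': substitute an XML character reference (encoding only).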
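The effect of each error handler can be sketched by encoding a string that ASCII cannot represent. This assumes Python 3; the sample text and variable names are only illustrative.

```python
text = "caf\u00e9 \u20ac5"   # 'café €5': two characters are not ASCII

# 'strict' is the default and raises on the first unencodable character.
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

ignored = text.encode("ascii", "ignore")              # b'caf 5'
replaced = text.encode("ascii", "replace")            # b'caf? ?5'
xml_refs = text.encode("ascii", "xmlcharrefreplace")  # b'caf&#233; &#8364;5'

# When decoding, 'replace' inserts U+FFFD instead of a question mark.
decoded_bad = b"\xff".decode("utf-8", "replace")

print(strict_failed, ignored, replaced, xml_refs, repr(decoded_bad))
```

'ignore' loses data silently, so 'replace' or 'xmlcharrefreplace' is usually easier to debug.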

Here is an example.
big5str.decode('big5').encode('gb18030', 'replace')
This converts a Big5 string to GB18030.
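A runnable version of this conversion might look like the sketch below. It assumes Python 3, where the Big5 data is a `bytes` object; `big5_to_gb18030` is a hypothetical helper name, and the sample text stands in for real Big5 input.

```python
def big5_to_gb18030(data: bytes) -> bytes:
    """Re-encode Big5 bytes as GB18030, going through Unicode."""
    unicode_text = data.decode("big5")
    # 'replace' guards against unencodable characters; it matters more
    # for narrower targets such as gb2312 than for gb18030.
    return unicode_text.encode("gb18030", "replace")


big5str = "編碼".encode("big5")      # stand-in for data received as Big5
gbstr = big5_to_gb18030(big5str)

print(gbstr.decode("gb18030"))
```

Going through Unicode in the middle is what makes the conversion independent of any particular pair of encodings.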

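For cross-platform code, it is advisable to convert all strings to unicode in Python before processing them. In that case, use decode with the system encoding obtained from locale.getpreferredencoding() to convert the strings.
If the text comes from a file, you need to specify the encoding that the file uses.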
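Both recommendations can be sketched as follows. This assumes Python 3; `to_unicode` is a hypothetical helper, and the temporary file and its name exist only to make the example self-contained.

```python
import locale
import os
import tempfile
from typing import Optional


def to_unicode(raw: bytes, encoding: Optional[str] = None) -> str:
    """Decode bytes, defaulting to the locale's preferred encoding."""
    if encoding is None:
        encoding = locale.getpreferredencoding()
    return raw.decode(encoding)


# Text from a file: state the file's encoding explicitly
# instead of relying on the platform default.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample_big5.txt")
    with open(path, "w", encoding="big5") as f:
        f.write("編碼")
    with open(path, "r", encoding="big5") as f:
        file_text = f.read()

print(file_text)
print(to_unicode("編碼".encode("big5"), "big5"))
```

Once everything is Unicode inside the program, encoding happens only at the output boundary, with the target encoding chosen explicitly.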