Python: Character Encoding and Transcoding

Latest recommended article published on 2024-04-22 16:59:23

好吃的糯米团团


Further reading:

http://www.cnblogs.com/yuanchenqi/articles/5956943.html

http://www.diveintopython3.net/strings.html

Key points:

1. In Python 2 the default encoding is ASCII; in Python 3 strings are Unicode and the default encoding is UTF-8.

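2. Unicode is the character set; UTF-32, UTF-16 and UTF-8 are encodings of it. UTF-32 uses 4 bytes per character, UTF-16 uses 2 or 4 bytes, and UTF-8 uses 1 to 4 bytes. UTF-16 is used in memory by some platforms (for example Windows APIs and Java), while UTF-8 is the most common choice for files and network transfer because it is ASCII-compatible and compact for ASCII-heavy text.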

3. In Python 3, encode() not only transcodes but also turns a str into bytes; decode() not only decodes but also turns bytes back into a str.
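The points above can be checked directly in Python 3. The following is a minimal sketch; the helper name `encoded_sizes` and the sample characters are chosen here for illustration and are not from the original article. The `-le` codec variants are used so that no byte order mark is counted.

```python
# Compare how many bytes each Unicode encoding needs per character.
# Samples: an ASCII letter, a CJK character, and an emoji outside the BMP.
samples = ["A", "中", "\U0001F600"]

def encoded_sizes(ch):
    """Return the byte length of ch in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return {name: len(ch.encode(name)) for name in ("utf-8", "utf-16-le", "utf-32-le")}

for ch in samples:
    print(repr(ch), encoded_sizes(ch))

# encode() turns str into bytes; decode() turns bytes back into str.
text = "中文"
raw = text.encode("utf-8")
print(type(raw).__name__, raw)
print(type(raw.decode("utf-8")).__name__, raw.decode("utf-8"))
```

UTF-16 needs 2 bytes for most characters but 4 for the emoji, which is why it cannot be described as a fixed two-byte encoding.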

The diagram above applies to Python 2 only.

In Python 3:

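# -*- coding: utf-8 -*-
# The declaration above can be dropped: UTF-8 is the default source encoding in Python 3.
__author__ = 'Alex Li'

import sys
print(sys.getdefaultencoding())  # 'utf-8' in Python 3


msg = "我爱北京天安门"
# Python 2 needed: msg_gb2312 = msg.decode("utf-8").encode("gb2312")
msg_gb2312 = msg.encode("gb2312")  # str is already Unicode in Python 3, so no decode() is needed first
gb2312_to_unicode = msg_gb2312.decode("gb2312")
gb2312_to_utf8 = msg_gb2312.decode("gb2312").encode("utf-8")

print(msg)                # the original str
print(msg_gb2312)         # bytes in GB2312
print(gb2312_to_unicode)  # str again
print(gb2312_to_utf8)     # bytes in UTF-8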

import sys
print(sys.getdefaultencoding())

msg = "我爱北京天安门"
s = "分割线"  # "divider", printed between the two sections

# str -> GB2312 bytes -> str -> UTF-8 bytes
msg_gb2312 = msg.encode("gb2312")
gb2312_to_unicode = msg_gb2312.decode("gb2312")
gb2312_to_utf8 = msg_gb2312.decode("gb2312").encode("utf-8")


print(msg)
print("gb2312:", msg_gb2312)
print("unicode:", gb2312_to_unicode)
print("utf-8:", gb2312_to_utf8)
print("\n%s\n" % (s.center(80, "*")))

# The same round trip through GBK
msg_gbk = msg.encode("gbk")
gbk_to_unicode = msg_gbk.decode("gbk")
gbk_to_utf8 = msg_gbk.decode("gbk").encode("utf-8")


print("gbk:", msg_gbk)
print("unicode:", gbk_to_unicode)
print("utf-8:", gbk_to_utf8)  # was gb2312_to_utf8, which printed the wrong variable
print(gbk_to_utf8.decode("utf-8"))
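For this message the GB2312 and GBK output is byte-for-byte identical, because GBK is a superset of GB2312. The difference only shows up for characters outside GB2312. The sketch below illustrates this with a simplified and a traditional character; the helper `can_encode` and the sample characters are assumptions added here, not part of the original script.

```python
simplified = "门"   # simplified form, part of GB2312
traditional = "門"  # traditional form, present in GBK but not in GB2312

def can_encode(text, encoding):
    """Return True if text can be encoded with the given codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

for ch in (simplified, traditional):
    print(ch, "gb2312:", can_encode(ch, "gb2312"), "gbk:", can_encode(ch, "gbk"))

# For characters both codecs share, the bytes are identical.
print(simplified.encode("gb2312") == simplified.encode("gbk"))
```

When the input may contain traditional characters or rarer symbols, encoding with "gbk" instead of "gb2312" avoids many UnicodeEncodeError cases.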

Checking Python's default encoding

Python 3:

Python 3.5.0 (v3.5.0:374f501f4567, Sep 13 2015, 02:27:37) [MSC v.1900 64 bit (AMD64)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> import sys
>>> sys.getdefaultencoding()
'utf-8'
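sys.getdefaultencoding() only reports the default used when converting between str and bytes. Other parts of Python (file names, open() without an explicit encoding, the console) may use different encodings, especially on Windows. The sketch below gathers them in one place; the function name `encoding_report` and its labels are made up for this illustration.

```python
import locale
import sys

def encoding_report():
    """Collect the encodings Python uses in different places."""
    return {
        "str <-> bytes default": sys.getdefaultencoding(),
        "file names": sys.getfilesystemencoding(),
        # Usually what open() falls back to when no encoding= is given
        "locale preferred": locale.getpreferredencoding(False),
        # May be None when stdout is redirected to an object without an encoding
        "stdout": getattr(sys.stdout, "encoding", None),
    }

for where, name in encoding_report().items():
    print("{}: {}".format(where, name))
```

On a Chinese-language Windows system the locale and console entries are often cp936 (GBK) even though the first entry is utf-8, which is a common source of encoding errors when printing or opening files.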

decode() and encode()

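decode() converts bytes in some other encoding into a Unicode string. For example, name.decode("GB2312") interprets the bytes in name as GB2312 and returns them as Unicode text.
encode() converts a Unicode string into bytes in another encoding. For example, name.encode("GB2312") converts the Unicode string name into GB2312-encoded bytes.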
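Put together, converting data from one encoding to another always goes through Unicode: decode with the source encoding, then encode with the target encoding. A minimal sketch follows; the helper name `transcode` is hypothetical, and the sample text is the one used earlier in this article.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Re-encode bytes from one encoding to another, going through str."""
    return data.decode(source).encode(target)

gb_bytes = "我爱北京天安门".encode("gb2312")
utf8_bytes = transcode(gb_bytes, "gb2312", "utf-8")

# Seven characters: 2 bytes each in GB2312, 3 bytes each in UTF-8.
print(len(gb_bytes), len(utf8_bytes))
print(utf8_bytes.decode("utf-8"))

# Decoding with the wrong codec raises an error here (in other cases it
# can silently produce garbled text instead).
try:
    gb_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    print("wrong codec:", exc.reason)
```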

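For example, take the earlier case of fetching the footer information from the Baidu home page. It can also be handled with encode() and decode(): encoding to GBK with "ignore" drops any character that GBK cannot represent, so the result can be printed on a GBK console.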


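#coding=utf-8
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("http://www.baidu.com")

# Get the registration (ICP) information at the bottom of the Baidu page.
# find_element_by_id() was removed in Selenium 4; find_element(By.ID, ...) works in both 3 and 4.
text = driver.find_element(By.ID, "cp").text
# Drop characters GBK cannot represent, then turn the bytes back into str
text2 = text.encode("gbk", "ignore").decode("gbk")
print(text2)

driver.quit()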
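The effect of the "ignore" argument can be seen without a browser. The sketch below uses a made-up footer string (the real page content may differ) that starts with a copyright sign, a character GBK cannot encode; the helper `to_gbk_safe` is hypothetical.

```python
# Hypothetical footer text; the real page content may differ.
footer = "\u00a92017 Baidu 使用百度前必读"

def to_gbk_safe(text, errors="ignore"):
    """Round-trip text through GBK, handling unencodable characters per `errors`."""
    return text.encode("gbk", errors).decode("gbk")

try:
    footer.encode("gbk")  # strict mode: the copyright sign is not in GBK
except UnicodeEncodeError as exc:
    print("strict failed on:", repr(footer[exc.start:exc.end]))

print(to_gbk_safe(footer))             # drops the character
print(to_gbk_safe(footer, "replace"))  # substitutes "?" instead
```

"ignore" loses data silently, so "replace" is often the safer choice when it matters that something was removed.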


The chardet module

chardet is an excellent module for detecting character encodings.

Install it with pip:

>pip install chardet

Usage:

>>> from chardet import detect
>>> a = "中文"
>>> detect(a)
{'confidence': 0.682639754276994, 'encoding': 'KOI8-R'}

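chardet is roughly 68% confident that the encoding is KOI8-R. The guess is wrong (the text is Chinese, and KOI8-R is a Cyrillic encoding), because a two-character sample gives the detector too little to work with. This session also appears to come from Python 2, where "中文" is a byte string; in Python 3, detect() expects bytes, so a str must be encoded first or the data read in binary mode.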
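Why such a short sample is ambiguous can be shown with the standard library alone. The sketch below assumes the original session ran on a GBK console, so that "中文" reached chardet as four GBK bytes; that assumption is not stated in the article. The same four bytes are also perfectly valid KOI8-R.

```python
text = "中文"
raw = text.encode("gbk")  # assumption: the bytes a GBK console would have produced

# KOI8-R assigns a character to every byte value, so this decode cannot fail.
as_koi8 = raw.decode("koi8_r")

print(raw)
print(repr(as_koi8))              # Cyrillic letters, not the original text
print(raw.decode("gbk") == text)  # only the right codec recovers the text
```

Both decodings succeed without error, so a detector has to rely on statistics, and four bytes provide very little of that. Longer samples give far more reliable results.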
