Python encoding issues: a summary

m718281962

Article link: https://blog.csdn.net/m718281962/article/details/50930628

1. Python 2 has two string types, str and unicode; both are subclasses of basestring

A str literal: "测试"

A unicode literal: u"测试"

2. How to check a string's type

isinstance("测试", basestring)  # True

isinstance("测试", str)  # True

isinstance(u"测试", unicode)  # True
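The checks above only exist in Python 2, where basestring and unicode are built-in names. As a rough sketch that runs on both Python 2.7 and Python 3, the same distinction can be drawn with type(u"") (unicode in Python 2) and bytes (an alias of str in Python 2). The helper name kind is made up for this illustration, and the string is written with escapes so the snippet does not depend on the file's own encoding.

```python
text = u"\u6d4b\u8bd5"            # u"测试": a unicode (text) object
data = text.encode("utf-8")       # the same characters as a byte string


def kind(value):
    """Return 'text' for unicode objects, 'bytes' for byte strings.

    Hypothetical helper for illustration only.
    """
    if isinstance(value, type(u"")):
        return "text"
    if isinstance(value, bytes):
        return "bytes"
    return "other"


print(kind(text))   # text
print(kind(data))   # bytes
```

Note that the text object has 2 characters while the byte string has 6 bytes, which is a quick way to tell the two apart when debugging.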

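3. How unicode and str relate to encode() and decode()

In Python 2, the unicode type can be thought of as a character string (text).

The str type is a byte string: a str is what you get when a unicode object is encoded with encode().

So in practice: call decode() only on str, and encode() only on unicode. (Python 2 does define str.encode() and unicode.decode() as well, but they first perform an implicit ascii conversion, which fails with UnicodeDecodeError or UnicodeEncodeError as soon as the data contains non-ASCII characters.)

Also: the default codec is ascii (this is what decode() uses when called with no argument), while a str literal in a source file holds bytes in the encoding declared in the file header.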

Example:

a = u"汉字"

print type(a.encode("utf-8"))  # <type 'str'>

c = "hello"

print type(c.decode())  # <type 'unicode'> (works with the default ascii codec because "hello" is pure ASCII)
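To make the encode()/decode() relationship concrete, here is a small round-trip sketch that runs on both Python 2.7 and Python 3. The variable names are chosen for this illustration, and the choice of gbk as a second codec is just an example of "some other encoding".

```python
text = u"\u6c49\u5b57"                  # u"汉字"

utf8_bytes = text.encode("utf-8")       # 3 bytes per character for these two
gbk_bytes = text.encode("gbk")          # 2 bytes per character for these two

# Decoding with the same codec that was used for encoding restores the text.
assert utf8_bytes.decode("utf-8") == text
assert gbk_bytes.decode("gbk") == text

# Decoding non-ASCII bytes with the ascii codec fails.
try:
    utf8_bytes.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print("utf-8: %d bytes, gbk: %d bytes, ascii decodable: %s"
      % (len(utf8_bytes), len(gbk_bytes), ascii_ok))
# utf-8: 6 bytes, gbk: 4 bytes, ascii decodable: False
```

The same text produces different byte strings under different codecs, which is why the codec has to be known (or declared) whenever bytes are turned back into text.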

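4. Python source file encoding

Declare the encoding at the top of the file (first or second line) with #-*-coding:utf-8 -*- . Only the coding:utf-8 part has any effect (the word coding followed directly by : or = and then the encoding name); everything else is decoration. A comment without the colon, such as # coding utf-8, is not recognized.

Without a declaration, Python 2 assumes the source file is ascii.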
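As a rough illustration of which comment forms count as a declaration, the sketch below applies the regular expression given in PEP 263 to a single line. The helper name declared_encoding is invented for this example; the real interpreter applies the check only to the first two lines of a file.

```python
import re

# Pattern from PEP 263 for recognising an encoding declaration comment.
CODING_RE = re.compile(r"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declared_encoding(line):
    """Return the encoding named in a comment line, or None.

    Hypothetical helper for illustration only.
    """
    match = CODING_RE.match(line)
    return match.group(1) if match else None


print(declared_encoding("#-*-coding:utf-8 -*-"))   # utf-8
print(declared_encoding("#coding:utf-8"))          # utf-8
print(declared_encoding("# coding utf-8"))         # None
```

The last call returns None because there is no : or = directly after the word coding.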

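Examples:

1) # file: ascii.py

a = "中国"

print type(a.decode())  # never runs: the file fails to compile with SyntaxError: Non-ASCII character '\xe4'

2) # file: utf-8.py

#coding:utf-8

a = "中国"

print type(a.decode("utf-8"))  # <type 'unicode'> (a bare a.decode() would use ascii and raise UnicodeDecodeError)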
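The '\xe4' in the SyntaxError message is simply the first byte of "中国" as stored in a UTF-8 file. The sketch below spells those bytes out explicitly so that it runs the same way on Python 2.7 and Python 3, regardless of how this snippet itself is saved; the variable names are chosen for this illustration.

```python
# "中国" as it is stored on disk in a UTF-8 encoded source file.
source_bytes = b"\xe4\xb8\xad\xe5\x9b\xbd"

first_byte = source_bytes[:1]            # the byte named in the SyntaxError
decoded = source_bytes.decode("utf-8")   # the explicit codec is required

assert first_byte == b"\xe4"
assert decoded == u"\u4e2d\u56fd"        # u"中国"
print(len(source_bytes))  # 6
print(len(decoded))       # 2
```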

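5. Practices that help keep encoding errors to a minimum

1) Use a single encoding throughout a project

2) Where possible, write the strings in your program as unicode literals, e.g. u"hello"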
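One way to apply both practices is to decode bytes as soon as they enter the program, work with unicode internally, and encode only on the way out. The following is a minimal sketch of that idea, assuming UTF-8 is the project-wide encoding; to_text and to_bytes are hypothetical helper names, not part of any library. It runs on Python 2.7 and Python 3.

```python
def to_text(value, encoding="utf-8"):
    """Decode byte strings to text; return anything else unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def to_bytes(value, encoding="utf-8"):
    """Encode text to a byte string; return anything else unchanged."""
    if isinstance(value, type(u"")):
        return value.encode(encoding)
    return value


raw = b"hello \xe4\xb8\xad\xe5\x9b\xbd"   # bytes arriving from a file or socket
text = to_text(raw)                       # decode once, at the boundary
greeting = text + u"!"                    # all internal work uses unicode
out = to_bytes(greeting)                  # encode once, on the way out
```

Because both helpers pass through values that are already in the target form, they can be called defensively without double-encoding or double-decoding.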

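6. A common confusion between Unicode and UTF-8

Unicode is a character set (one big table). Every character has a unique code point in the range U+0000 to U+10FFFF; it is not a fixed 2-byte value. Examples: a -> U+0061, 测 -> U+6D4B

UTF-8 is an encoding that turns each Unicode code point into a sequence of bytes of variable length. Since RFC 3629 the valid range ends at U+10FFFF, so only the 1- to 4-byte forms in the table below are in use today; the 5- and 6-byte forms come from the original definition of UTF-8.

Unicode code point (hex)	UTF-8 byte sequence (binary)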
00000000 - 0000007F	0xxxxxxx
00000080 - 000007FF	110xxxxx 10xxxxxx
00000800 - 0000FFFF	1110xxxx 10xxxxxx 10xxxxxx
00010000 - 001FFFFF	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
00200000 - 03FFFFFF	111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
04000000 - 7FFFFFFF	1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
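The first four rows of the table can be turned into code almost directly. The sketch below is a hand-written encoder for illustration only (the function name utf8_encode is made up, and it does not reject surrogate code points as a strict encoder would); it is checked against the built-in encode(). It runs on Python 2.7 and Python 3.

```python
def utf8_encode(cp):
    """Encode one code point (an int) to UTF-8 bytes, following the table."""
    if cp < 0:
        raise ValueError("code point must be non-negative")
    if cp < 0x80:                       # 0xxxxxxx
        out = [cp]
    elif cp < 0x800:                    # 110xxxxx 10xxxxxx
        out = [0xC0 | (cp >> 6),
               0x80 | (cp & 0x3F)]
    elif cp < 0x10000:                  # 1110xxxx 10xxxxxx 10xxxxxx
        out = [0xE0 | (cp >> 12),
               0x80 | ((cp >> 6) & 0x3F),
               0x80 | (cp & 0x3F)]
    elif cp <= 0x10FFFF:                # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out = [0xF0 | (cp >> 18),
               0x80 | ((cp >> 12) & 0x3F),
               0x80 | ((cp >> 6) & 0x3F),
               0x80 | (cp & 0x3F)]
    else:
        raise ValueError("code point above U+10FFFF")
    return bytes(bytearray(out))


# The two examples from the text: a -> U+0061, 测 -> U+6D4B.
assert utf8_encode(0x61) == u"a".encode("utf-8")
assert utf8_encode(0x6D4B) == u"\u6d4b".encode("utf-8")
print(len(utf8_encode(0x61)))    # 1
print(len(utf8_encode(0x6D4B)))  # 3
```

So the code point U+6D4B is a single number in the Unicode table, while its UTF-8 form is the three bytes E6 B5 8B.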
