Converting a Python string to ASCII codes: §7. Python Data

Latest recommended article published on 2024-07-11 16:04:30

weixin_39639505

Article link: https://blog.csdn.net/weixin_39639505/article/details/111625074

This article covers two kinds of data in Python, text and binary data, and the tools for working with each.

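1. Text

Unicode

The basic unit of computer storage is the byte, which holds 8 bits and can represent 256 distinct values.

ASCII uses only 7 of those bits, giving 128 values, while the world's writing systems contain far more than 128 characters.

Unicode aims to cover the characters of every language, plus symbols from mathematics and other fields.

The official Unicode site lists every character set it currently includes:

Unicode Official Site: www.unicode.org

UTF-8 is the standard text encoding for Python, HTML, and Linux.

It is a variable-length encoding that is simple, fast, broad in coverage, and robust against errors.

Note: when copying and pasting strings from another text source, make sure the encodings match, or you risk subtle bugs later.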

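1. In Python 3, a string is a sequence of Unicode characters, not a byte array

>>> a = '你好'
>>> a
# Output compared (the Python 2 bytes shown assume a GBK-encoded terminal):
Python 2                Python 3
'\xc4\xe3\xba\xc3'      '你好'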

2. The unicodedata module

>>> import unicodedata
>>> char = "A"

>>> unicodedata.name(char)
# takes a Unicode character and returns its standard name
'LATIN CAPITAL LETTER A'

>>> unicodedata.lookup('LATIN CAPITAL LETTER A')
# takes a standard name (case-insensitive) and returns the Unicode character
'A'

>>> unicodedata.name('\u00e9')
'LATIN SMALL LETTER E WITH ACUTE'

>>> unicodedata.lookup('LATIN SMALL LETTER E WITH ACUTE')
# look up a character by its standard name
'é'
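As a small sketch of how name() and lookup() invert each other, the helper below round-trips a few characters. The function name `describe` and the sample characters are illustrative choices, not part of the original article.

```python
import unicodedata

def describe(ch):
    """Return (character, standard name, character found by looking the name up)."""
    name = unicodedata.name(ch)
    return ch, name, unicodedata.lookup(name)

# Illustrative samples: an ASCII letter, an accented letter, a symbol, an emoji.
samples = ['A', '\u00e9', '\u2603', '\U0001f47b']
results = [describe(ch) for ch in samples]

for ch, name, back in results:
    # ascii() keeps the output printable on terminals that cannot show the glyph.
    print(ascii(ch), name, back == ch)
```

Each row should end in True, since looking up a character's standard name returns the same character.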

3. len() counts the Unicode characters in a string, not the bytes

>>> len('\U0001f47b')
1

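4. Variable-length encoding in UTF-8

UTF-8 uses a different number of bytes per character depending on the character:

1 byte for ASCII, 2 bytes for most Latin-derived languages, and 3 bytes for the rest of the Basic Multilingual Plane

4 bytes for everything else, including some Asian languages and symbols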
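The byte counts above can be checked directly by encoding one character from each range. This is a minimal sketch; the sample characters and their labels are chosen here for illustration.

```python
# One illustrative character per UTF-8 length class.
samples = {
    'A': 'ASCII letter',
    '\u00e9': 'Latin small e with acute',
    '\u2603': 'snowman (Basic Multilingual Plane)',
    '\U0001f47b': 'ghost emoji (outside the BMP)',
}

# Number of bytes each character occupies once encoded as UTF-8.
byte_lengths = {ch: len(ch.encode('utf-8')) for ch in samples}

for ch, label in samples.items():
    print(f'U+{ord(ch):04X} {label}: {byte_lengths[ch]} byte(s)')
```

Every string here has len() equal to 1, yet the encoded sizes grow from 1 to 4 bytes.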

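5. str.encode()

>>> s1 = '\u2603'
>>> len(s1)
1

>>> s2 = s1.encode('utf-8')
>>> len(s2) # the UTF-8 encoding is longer: 3 bytes for 1 character
3

※ Every character being encoded (s1) must exist in the target encoding, otherwise UnicodeEncodeError is raised. UTF-8 can represent '\u2603', but s1.encode('ascii') fails.

The optional second argument sets the error handling; besides the default 'strict', four values are commonly used:

s1.encode('ascii', 'ignore') # drop any character that cannot be encoded
s1.encode('ascii', 'replace') # replace each unencodable character with ?
s1.encode('ascii', 'backslashreplace') # emit a backslash escape, similar to unicode-escape
s1.encode('ascii', 'xmlcharrefreplace') # emit an XML character reference such as &#9731;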
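To see what each error handler actually produces, the sketch below encodes the snowman character to ASCII with every option listed above. The variable names are illustrative.

```python
snowman = '\u2603'  # not representable in ASCII

# Default behaviour ('strict'): encoding fails with UnicodeEncodeError.
try:
    snowman.encode('ascii')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Each handler turns the failure into a different fallback.
handlers = ['ignore', 'replace', 'backslashreplace', 'xmlcharrefreplace']
encoded = {h: snowman.encode('ascii', h) for h in handlers}

print('strict raised:', strict_failed)
for handler, result in encoded.items():
    print(handler, result)
```

Note that the result is always a bytes object, even for 'backslashreplace' and 'xmlcharrefreplace'.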

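6. bytes.decode()

In Python 3, decode() is a method of bytes, not str. Decoding requires knowing which encoding produced the bytes; decoding with a different one either raises UnicodeDecodeError or silently produces garbled text.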
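A short sketch of what happens when the decoding charset does and does not match the one used for encoding. The sample word is an arbitrary choice for illustration.

```python
text = 'caf\u00e9'
data = text.encode('utf-8')          # the e-acute becomes two bytes: C3 A9

# Matching encoding: the original text comes back unchanged.
round_trip = data.decode('utf-8')

# ASCII cannot interpret bytes above 0x7F, so decoding raises.
try:
    data.decode('ascii')
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Latin-1 accepts every byte value, so the mismatch goes unnoticed
# and each of the two bytes is shown as a separate character.
mojibake = data.decode('latin-1')

print(round_trip, ascii_failed, ascii(mojibake))
```

The Latin-1 case is the more dangerous one in practice, because no exception signals that the text is wrong.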

Formatting

1. Interpolation

str % value

Each value to be substituted is marked in str with a %s placeholder; other conversion types include `%d, %x, %o, %f, %e` ...

>>> adjective = 'testing'
>>> "This is a %s string" % (adjective)
'This is a testing string'
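The other conversion types work the same way. The sketch below applies each one to a sample value and shows that several values are passed as a tuple; the values and the name in the last line are made up for illustration.

```python
# One sample value per conversion type.
formatted = {
    '%d': '%d' % 42,        # decimal integer
    '%x': '%x' % 255,       # hexadecimal
    '%o': '%o' % 8,         # octal
    '%f': '%f' % 2.5,       # fixed-point float, 6 decimals by default
    '%e': '%e' % 1234.5,    # exponent notation
}

# Several placeholders take their values from a tuple, in order.
sentence = '%s is %d years old' % ('Ann', 30)

for spec, result in formatted.items():
    print(spec, '->', result)
print(sentence)
```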

2. str.format()

>>> a, b, c = 'A', 'B', 'C'

>>> '{} {} {}'.format(a, b, c)  # simplest form
'A B C'
>>> '{0} {1} {2}'.format(a, b, c)  # explicit positions
'A B C'
>>> '{a} {b} {c}'.format(a=1, b=2, c=3)  # named arguments
'1 2 3'

>>> "{0[a]} {0[b]} {1}".format({'a':'a', 'b':'b'}, 'guy')  # index into a dict; keyword arguments, if any, must follow the positional ones
'a b guy'

# format specifiers
>>> "{0:d} {1:f} {2:s}".format(1, 2.50, 'string')
'1 2.500000 string'

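Formatting floating-point numbers requires understanding precision:

For a float, the precision is the number of digits after the decimal point; for a string, it is the maximum number of characters; an integer format type such as d does not accept a precision.

>>> '{0:>10.4d}'.format(100)
ValueError: Precision not allowed in integer format specifier
>>> '{0:>10.4e}'.format(100)
'1.0000e+02'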
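The three cases of precision can be put side by side in a short sketch. The sample values are arbitrary.

```python
# Float: precision is the number of digits after the decimal point.
pi_two = '{:.2f}'.format(3.14159)

# String: precision is the maximum number of characters kept.
clipped = '{:.3s}'.format('Python')

# Integer format type d: a precision is rejected with ValueError.
try:
    '{:.2d}'.format(100)
    int_error = None
except ValueError as exc:
    int_error = str(exc)

print(pi_two, clipped, int_error)
```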

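Text can also be left-aligned, centered, or right-aligned within a field

>>> '{0:!^6s}'.format('Ha')  # centered in a field of 6 characters, padded with ! instead of spaces
'!!Ha!!'
>>> '{0:!<6s}'.format('Ha')  # left-aligned
'Ha!!!!'
>>> '{0:!>6s}'.format('Ha')  # right-aligned
'!!!!Ha'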

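Regular expressions

Provided by the standard library module re. You define a pattern and a source string to match it against.

>>> import re
>>> result = re.match('To', 'Today')  # match() only checks whether the source starts with the pattern
>>> result.pos
0
>>> result.span()
(0, 2)

>>> Topattern = re.compile('To')  # compile() precompiles the pattern for reuse
>>> result2 = Topattern.match('Too busy right now')
>>> result2.group()
'To'

re.search(pat, source)  # find the first match anywhere in the source
re.findall(pat, source)  # return all non-overlapping matches as a list
re.split('n', source)  # split the source at each match of the pattern, returning a list
re.sub('a', 'b', source)  # replace every a with b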
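The four functions are easier to compare when run against the same source. The sketch below uses the sample string 'banana', which is an arbitrary choice, and also shows how match() differs from search().

```python
import re

source = 'banana'

# match() anchors at the start, so 'an' is not found; search() scans ahead.
at_start = re.match('an', source)
anywhere = re.search('an', source)

all_hits = re.findall('an', source)   # every non-overlapping match
pieces = re.split('n', source)        # cut the source at each 'n'
swapped = re.sub('a', 'b', source)    # replace every 'a' with 'b'

print(at_start, anywhere.span(), all_hits, pieces, swapped)
```

Because match() and search() return None when nothing matches, real code should check the result before calling span() or group() on it.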

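2. Binary data

Byte order (endianness): how a processor arranges the bytes of a multi-byte value in memory
Sign bit: the bit that marks whether a number is negative
Bytes and bytearrays

# bytes is immutable, bytearray is mutable
>>> a = [1, 2, 3, 255]
>>> byt1 = bytes(a)   # convert to bytes
>>> byt1
b'\x01\x02\x03\xff'

>>> byt2 = bytearray(a)  # convert to a bytearray
>>> byt2
bytearray(b'\x01\x02\x03\xff')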
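The difference in mutability shows up as soon as an element is assigned. A minimal sketch, with illustrative variable names:

```python
values = [1, 2, 3, 255]
frozen = bytes(values)        # immutable
mutable = bytearray(values)   # mutable

# Assigning into bytes raises TypeError.
try:
    frozen[1] = 127
    bytes_mutated = True
except TypeError:
    bytes_mutated = False

# The same assignment succeeds on a bytearray.
mutable[1] = 127

print(bytes_mutated, frozen, mutable)
```

Each element must be an integer from 0 to 255, which is why 255 is the largest value in the sample list.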

The struct module: this standard library module handles data laid out like structs in C and C++.
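As a hedged sketch of how struct relates to byte order and the sign bit, the example below packs the same integer in both byte orders and reads one byte as signed and as unsigned. The width and height values stand in for a hypothetical binary header and do not come from the article.

```python
import struct

# '>' means big-endian, '<' means little-endian; 'I' is a 4-byte unsigned int.
big = struct.pack('>I', 1)
little = struct.pack('<I', 1)

# unpack() always returns a tuple, even for a single value.
(value,) = struct.unpack('>I', big)

# The same byte read as signed ('b') and as unsigned ('B').
(signed,) = struct.unpack('b', b'\xff')
(unsigned,) = struct.unpack('B', b'\xff')

# Hypothetical header: two big-endian 2-byte unsigned ints.
header = struct.pack('>HH', 640, 480)
width, height = struct.unpack('>HH', header)

print(big, little, value, signed, unsigned, width, height)
```

The format string is what tells struct how many bytes to read and how to interpret them, so writer and reader must agree on it.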
