§7. Python Data

Latest recommended article published on 2023-03-10 14:06:54

weixin_39834084

This article covers two kinds of data in Python, text and binary, and the methods used to work with each

1. Text: Unicode

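The basic unit of computer storage is the byte, which holds 8 bits and can store 256 distinct values

ASCII uses only 7 bits, giving 128 values, while the world's writing systems contain far more than 128 characters

Unicode aims to cover the characters of every language, plus symbols from mathematics and other fields

The Unicode website lists every character currently included: Unicode Official Site, www.unicode.org

UTF-8 is the standard text encoding of Python, HTML, and Linux.

It is simple, fast, broad in coverage, and resistant to errors, and it is a variable-length encoding

Note: when copying and pasting strings from other text sources, make sure the encodings match, otherwise hidden bugs can creep in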

1. In Python 3, strings are Unicode strings rather than byte arrays

>>> a = '你好'

>>> a

# Output compared (Python 2 echoes the raw bytes, Python 3 echoes the text):

Python 2                Python 3

'\xc4\xe3\xba\xc3'      '你好'
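The difference can be reproduced in Python 3 alone. The sketch below assumes the Python 2 session above ran in a GBK terminal, which is what the four bytes shown suggest; with another terminal encoding the bytes would differ.

```python
# Sketch: text (str) versus its encoded form (bytes) in Python 3.
text = '你好'

# A Python 3 str is a sequence of Unicode characters.
kind = type(text).__name__          # 'str'
n_chars = len(text)                 # 2 characters

# Encoding produces bytes; the result depends on the chosen codec.
gbk_bytes = text.encode('gbk')      # assumed codec of the Python 2 session
utf8_bytes = text.encode('utf-8')

print(kind, n_chars)
print(gbk_bytes)
print(utf8_bytes)
```

The GBK bytes match the Python 2 output, while UTF-8 needs three bytes per character for the same text.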

2. The unicodedata module

>>> import unicodedata

>>> char = "A"

>>> unicodedata.name(char)

# takes a Unicode character and returns its standard name

'LATIN CAPITAL LETTER A'

>>> unicodedata.lookup('LATIN CAPITAL LETTER A')

# takes a standard name (case-insensitive) and returns the Unicode character

'A'

>>> unicodedata.name('\u00e9')

'LATIN SMALL LETTER E WITH ACUTE'

>>> unicodedata.lookup('LATIN SMALL LETTER E WITH ACUTE')

# look up a character by its standard name

'é'
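The two functions are inverses of each other, which a short round trip makes visible. This is only a sketch; `describe` is a hypothetical helper introduced for the illustration, not part of unicodedata.

```python
import unicodedata

def describe(char):
    """Return (standard name, character looked up from that name).

    Hypothetical helper, used only for this illustration.
    """
    name = unicodedata.name(char)          # ValueError if the character has no name
    same_char = unicodedata.lookup(name)   # KeyError if the name is unknown
    return name, same_char

for ch in ['A', '\u00e9', '\u2603']:
    name, back = describe(ch)
    print(repr(ch), name, back == ch)
```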

3. len() counts the Unicode characters in a string, not the bytes

>>> len('\U0001f47b')

1

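4. UTF-8 variable-length encoding

UTF-8 assigns a different number of bytes to characters from different ranges:

1 byte for ASCII, 2 bytes for most Latin-derived languages, 3 bytes for the rest of the Basic Multilingual Plane,

and 4 bytes for everything else, including some Asian languages and symbols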
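The byte counts can be checked directly by encoding one character from each range. A minimal sketch; the four sample characters are chosen here for illustration only.

```python
# One sample character per UTF-8 length class (samples chosen for illustration).
samples = {
    'A': 'ASCII letter',
    '\u00e9': 'Latin small letter e with acute',
    '\u2603': 'snowman, inside the Basic Multilingual Plane',
    '\U0001f47b': 'ghost, outside the Basic Multilingual Plane',
}

# len() on the str counts characters; len() on the encoded bytes counts bytes.
byte_lengths = {ch: len(ch.encode('utf-8')) for ch in samples}

for ch, desc in samples.items():
    print(f'U+{ord(ch):04X} {desc}: {len(ch)} character, {byte_lengths[ch]} byte(s)')
```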

5. str.encode()

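>>> s1 = '\u2603'

>>> len(s1)

1

>>> s2 = s1.encode('utf-8')

>>> len(s2) # one character becomes three bytes in UTF-8

3

※ Every character being encoded must be representable in the target encoding, otherwise UnicodeEncodeError is raised. UTF-8 can represent any Unicode character, so this matters for narrower encodings such as 'ascii'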

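encode() also takes a second argument that controls error handling; four useful values:

s1.encode('ascii', 'ignore') # drop any character that cannot be encoded

s1.encode('ascii', 'replace') # replace each unencodable character with ?

s1.encode('ascii', 'backslashreplace') # replace it with a backslash escape, similar to unicode-escape

s1.encode('ascii', 'xmlcharrefreplace') # replace it with an XML character reference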
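To see what each handler actually produces, the sketch below encodes a short string containing the snowman character to ASCII. The variable names are illustrative.

```python
snowman = '\u2603'
text = 'abc' + snowman

# The default handler ('strict') raises on the first unencodable character.
try:
    text.encode('ascii')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Each alternative handler returns bytes instead of raising.
results = {
    handler: text.encode('ascii', handler)
    for handler in ('ignore', 'replace', 'backslashreplace', 'xmlcharrefreplace')
}

for handler, encoded in results.items():
    print(f'{handler:>18}: {encoded!r}')
```

Only the last two handlers keep enough information to recover the original character later.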

6. bytes.decode()
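In Python 3 the decode method lives on bytes objects, since a str is already decoded text. A minimal sketch of a matching and a mismatched decode, with codecs chosen here for illustration:

```python
snowman = '\u2603'
utf8_bytes = snowman.encode('utf-8')

# Same codec in both directions: the round trip is lossless.
round_trip = utf8_bytes.decode('utf-8')

# Mismatched codec that rejects the bytes: ASCII has no bytes above 0x7F.
try:
    utf8_bytes.decode('ascii')
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Mismatched codec that accepts any byte: no error, but the text is wrong.
garbled = '\u00e9'.encode('utf-8').decode('latin-1')

print(round_trip == snowman, ascii_failed, garbled)
```

The last case is the more dangerous one, because a wrong codec does not always announce itself with an exception.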

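Decoding requires knowing which encoding produced the bytes; decoding with a different one raises UnicodeDecodeError or yields the wrong characters

Formatting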

1. Interpolation

str % value

Each value to be substituted is marked in the string with a %s placeholder; other conversion codes include `%d`, `%x`, `%o`, `%f`, `%e`, ...

>>> adjective = 'testing'

>>> "This is a %s string" % (adjective)

'This is a testing string'
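The other conversion codes follow the same pattern. A short sketch with made-up values, showing that several substitutions need a tuple on the right-hand side:

```python
name, count, ratio = 'spam', 255, 2.5

as_text = '%s has %d items' % (name, count)   # several values go in a tuple
as_hex = '%x' % count                         # hexadecimal
as_octal = '%o' % count                       # octal
as_float = '%f' % ratio                       # six decimals by default
as_exp = '%e' % ratio                         # scientific notation

print(as_text, as_hex, as_octal, as_float, as_exp, sep='\n')
```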

2. str.format()

>>> a, b, c = 'A', 'B', 'C'

>>> '{} {} {}'.format(a, b, c) # simplest form

'A B C'

>>> '{0} {1} {2}'.format(a, b, c) # explicit positions

'A B C'

>>> '{a} {b} {c}'.format(a=1, b=2, c=3) # named arguments

'1 2 3'

>>> "{0[a]} {0[b]} {1}".format({'a':'a', 'b':'b'}, 'guy') # dictionary lookup; keys inside the braces are written without quotes

'a b guy'

# specifying the format type

>>> "{0:d} {1:f} {2:s}".format(1, 2.50, 'string')

'1 2.500000 string'

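Formatting floats in a string requires the notion of precision:

For floats, precision is the number of digits after the decimal point; for strings, it is the maximum number of characters; integers do not accept a precision

>>> '{0:>10.4d}'.format(100)

ValueError: Precision not allowed in integer format specifier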

>>> '{0:>10.4e}'.format(100)

'1.0000e+02'
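The three cases of precision can be put side by side. A minimal sketch with illustrative values:

```python
pi = 3.14159

two_decimals = '{:.2f}'.format(pi)       # precision 2 on a float
padded = '{:10.4f}'.format(pi)           # width 10, precision 4, right-aligned
truncated = '{:.3s}'.format('string')    # precision 3 on a string keeps 3 characters

# An integer accepts a precision once a float type such as f or e is requested.
as_float = '{:.1f}'.format(100)

print(repr(two_decimals), repr(padded), repr(truncated), repr(as_float))
```

This also explains the pair of examples above: the d type rejects a precision, while e converts the integer to a float first.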

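Strings can also be left-aligned, centered, or right-aligned within a field

>>> '{0:!^6s}'.format('Ha') # centered in a field of 6 characters, with ! as the fill character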

'!!Ha!!'

>>> '{0:!<6s}'.format('Ha')

'Ha!!!!'

>>> '{0:!>6s}'.format('Ha')

'!!!!Ha'

Regular expressions

Provided by the standard library module re; using it requires a pattern and a source string to match against

>>> import re

>>> result = re.match('To', 'Today') # match() checks whether the source starts with the pattern

>>> result.pos

0

>>> result.span()

(0, 2)

>>> Topattern = re.compile('To') # compile() precompiles the pattern

>>> result2 = Topattern.match('Too busy right now')

>>> result2.group()

'To'

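re.search(pat, source) # find the first match anywhere in the source

re.findall(pat, source) # find all non-overlapping matches

re.split('n', source) # split the source at each match of the pattern, returning a list

re.sub('a', 'b', source) # replace every a with b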
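The four functions can be tried on the source string used earlier. A sketch; the patterns are plain literals chosen for illustration.

```python
import re

source = 'Too busy right now'

at_start = re.match('busy', source)     # None: 'busy' is not at the start
anywhere = re.search('busy', source)    # a match object: found at any position
all_os = re.findall('o', source)        # every non-overlapping match, as a list
pieces = re.split('n', source)          # cut the source at each 'n'
swapped = re.sub('o', '0', source)      # replace every 'o' with '0'

print(at_start, anywhere.span(), all_os, pieces, swapped)
```

Since match() and search() both return None on failure, real code would normally test the result before calling methods such as span() or group().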

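2. Binary data

Endianness (byte order): how the computer's processor arranges data into bytes for storage

Sign bit: the bit that indicates whether an integer is negative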
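Both ideas can be observed with the built-in int methods to_bytes() and from_bytes(). A minimal sketch with illustrative values:

```python
import sys

value = 1
big = value.to_bytes(2, byteorder='big')        # most significant byte first
little = value.to_bytes(2, byteorder='little')  # least significant byte first

# The same byte reads differently depending on whether its top bit is a sign bit.
unsigned = int.from_bytes(b'\xff', byteorder='big', signed=False)
signed = int.from_bytes(b'\xff', byteorder='big', signed=True)

print(sys.byteorder)   # byte order of the machine running this code
print(big, little, unsigned, signed)
```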

Bytes and bytearrays

# bytes are immutable, bytearrays are mutable

>>> a = [1, 2, 3, 255]

>>> byt1 = bytes(a) # convert to bytes

>>> byt1

b'\x01\x02\x03\xff'

>>> byt2 = bytearray(a) # convert to a bytearray

>>> byt2

bytearray(b'\x01\x02\x03\xff')

The struct module: this standard library module is dedicated to handling data laid out like the structs of C and C++
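As a rough illustration of struct, the sketch below packs two numbers into bytes and unpacks them again. The layout string and the values are assumptions made for the example, not taken from a real file format.

```python
import struct

# '<' = little-endian without padding; 'h' = signed 16-bit int; 'I' = unsigned 32-bit int
layout = '<hI'

packed = struct.pack(layout, -2, 1)        # Python values -> bytes
size = struct.calcsize(layout)             # number of bytes the layout occupies
unpacked = struct.unpack(layout, packed)   # bytes -> tuple of Python values

# Switching the prefix to '>' stores the same values big-endian.
packed_big = struct.pack('>hI', -2, 1)

print(packed, size, unpacked, packed_big)
```

The prefix character ties back to endianness: the values are identical, and only the order of the bytes within each field changes.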