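2 NumPy
- NumPy: an efficient numerical computing tool
- Advantages of NumPy
- ndarray attributes
- Basic operations
- ndarray.method()
- numpy.function_name()
- ndarray operations
- Logical operations
- Statistical operations
- Operations between arrays
- Merging, splitting, I/O, and data processing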
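2.1 Advantages of NumPy
Learning objectives
- Objectives
- Understand NumPy's speed advantage
- Know how NumPy lays out array memory
- Know about NumPy's vectorized operations
- Applications
- The foundational library for machine learning and deep learning frameworks
- Content preview
- 2.1.1 Introduction to NumPy
- 2.1.2 Introduction to ndarray
- 2.1.3 Efficiency of ndarray versus native Python lists
- 2.1.4 Advantages of ndarray
2.1.1 Introduction to NumPy: an open-source numerical computing library
- num - numerical
- py - python
- ndarray
- n - any number of
- d - dimension
- array - array
NumPy (Numerical Python) is an open-source scientific computing library for fast processing of arrays of any dimension.
NumPy supports common array and matrix operations. For the same numerical task, NumPy code is much more concise than plain Python.
NumPy uses the ndarray object to handle multidimensional arrays; it is a fast and flexible container for large data sets.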
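2.1.2 Introduction to ndarray
NumPy provides an n-dimensional array type, ndarray, which describes a collection of "items" of the same type.
Storing data in an ndarray:
import numpy as np
# Create an ndarray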
score = np.array([[85, 69, 83, 76, 93],
[76, 84, 61, 69, 81],
[85, 68, 74, 69, 60],
[92, 98, 68, 100, 64],
[60, 67, 73, 92, 82],
[72, 61, 72, 80, 79],
[88, 91, 62, 95, 80],
[89, 71, 63, 94, 66]])
score
array([[ 85, 69, 83, 76, 93],
[ 76, 84, 61, 69, 81],
[ 85, 68, 74, 69, 60],
[ 92, 98, 68, 100, 64],
[ 60, 67, 73, 92, 82],
[ 72, 61, 72, 80, 79],
[ 88, 91, 62, 95, 80],
[ 89, 71, 63, 94, 66]])
2.1.3 Efficiency of ndarray versus native Python lists
import random, time
import numpy as np
# Build a list of 100 million random floats (reduce the count if memory is limited)
a = []
for i in range(100000000):
    a.append(random.random())
t1 = time.time()
sum1 = sum(a)
t2 = time.time()
b = np.array(a)
t4 = time.time()
sum3 = np.sum(b)
t5 = time.time()
print('list sum time: {}, ndarray sum time: {}'.format((t2 - t1), (t5 - t4)))
list sum time: 0.6531758308410645, ndarray sum time: 0.17858529090881348
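2.1.4 Advantages of ndarray
- Storage layout
- ndarray - elements share one type and are stored contiguously in memory, so they can be addressed directly - less general-purpose
- list - elements may have different types; the list stores references to objects scattered in memory - more general-purpose
- Vectorized computation
- ndarray operations are vectorized: a whole-array operation runs in compiled loops, and some routines can additionally use multiple threads depending on how NumPy was built
- Underlying language
- NumPy's core is written in C, and many array operations release the GIL internally, so array operations avoid the Python interpreter's per-element overhead and run far faster than equivalent pure Python code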
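The storage points above can be checked directly on a small array. This is a minimal sketch; the variable names are illustrative, and the dtype is set explicitly so the numbers do not depend on the platform.

```python
import numpy as np

# One dtype for every element, stored in a single contiguous block
arr = np.array([1, 2, 3, 4], dtype=np.int64)
print(arr.dtype)                   # int64
print(arr.itemsize, arr.nbytes)    # 8 32  (4 elements x 8 bytes each)
print(arr.flags["C_CONTIGUOUS"])   # True

# A list may mix types; an ndarray coerces its inputs to one common dtype
mixed = [1, 2.5]
coerced = np.array(mixed)
print(coerced.dtype)               # float64
```

The list keeps an int and a float as separate Python objects, while the array stores both as float64 values side by side.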
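2.2 Understanding N-dimensional arrays: ndarray attributes
Learning objectives
- Objectives
- Describe an array's attributes, shape, and type
- Applications
- Content preview
- 2.2.1 ndarray attributes
- 2.2.2 ndarray shape
- 2.2.3 ndarray types
- 2.2.4 Summary
2.2.1 ndarray attributes
- shape
- ndim
- size
- dtype (type)
- itemsize: size of a single element in bytes
Default data types when creating an ndarray:
- Integers: platform dependent (int32 in the Windows session shown here; int64 on most Linux and macOS systems)
- Floats: float64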
score
score.shape
score.ndim
score.size
score.dtype
score.itemsize
array([[ 85, 69, 83, 76, 93],
[ 76, 84, 61, 69, 81],
[ 85, 68, 74, 69, 60],
[ 92, 98, 68, 100, 64],
[ 60, 67, 73, 92, 82],
[ 72, 61, 72, 80, 79],
[ 88, 91, 62, 95, 80],
[ 89, 71, 63, 94, 66]])
(8, 5)
2
40
dtype('int32')
4
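Because the default integer type varies by platform, it can help to set dtype explicitly when inspecting attributes. A small sketch with a hypothetical 2 x 3 array named grades:

```python
import numpy as np

grades = np.array([[85, 69, 83],
                   [76, 84, 61]], dtype=np.int32)

print(grades.shape)     # (2, 3)  rows, columns
print(grades.ndim)      # 2       number of dimensions
print(grades.size)      # 6       total number of elements
print(grades.dtype)     # int32
print(grades.itemsize)  # 4       bytes per element
print(grades.nbytes)    # 24      size * itemsize
```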
2.2.2 ndarray shape
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([1, 2, 3, 4])
c = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
a
a.shape
array([[1, 2, 3],
[4, 5, 6]])
(2, 3)
b
b.shape
array([1, 2, 3, 4])
(4,)
c
c.shape
array([[[1, 2, 3],
[4, 5, 6]],
[[1, 2, 3],
[4, 5, 6]]])
(2, 2, 3)
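2.2.3 ndarray types
Type | Type code | Description |
---|---|---|
int8, uint8 | i1, u1 | Signed and unsigned 8-bit (1-byte) integers |
int16, uint16 | i2, u2 | Signed and unsigned 16-bit (2-byte) integers |
int32, uint32 | i4, u4 | Signed and unsigned 32-bit (4-byte) integers |
int64, uint64 | i8, u8 | Signed and unsigned 64-bit (8-byte) integers |
float16 | f2 | Half-precision floating point |
float32 | f4 or f | Standard single-precision floating point; compatible with C float |
float64 | f8 or d | Standard double-precision floating point; compatible with C double and Python float |
float128 | f16 or g | Extended-precision floating point (availability is platform dependent) |
complex64, complex128, complex256 | c8, c16, c32 | Complex numbers represented by two 32-bit, 64-bit, or 128-bit floats respectively |
bool | ? | Boolean type storing True and False |
object | O | Python object type |
string_ | S | Fixed-length byte string (1 byte per character). For example, use S10 for a string of length 10. Current NumPy names this type bytes_ |
unicode_ | U | Fixed-length Unicode string (4 bytes per character). Specified the same way as byte strings (e.g. U10). Current NumPy names this type str_ |
Specifying the type when creating an array
a = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float64) # dtype="float64" also works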
a
a.dtype
array([[1., 2., 3.],
[4., 5., 6.]])
dtype('float64')
arr = np.array(['python', 'tensorflow', 'scikit-learn', 'numpy'], dtype=np.bytes_) # np.string_ is the older alias, removed in NumPy 2.0
arr
arr.dtype
array([b'python', b'tensorflow', b'scikit-learn', b'numpy'], dtype='|S12')
dtype('S12')
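2.3 Basic operations
- ndarray.method()
- np.function_name()
2.3.1 Ways to create arrays
- Arrays of zeros and ones
- From existing arrays
- Arrays over a fixed range
- Random arrays
1. Arrays of zeros and ones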
- empty()
- empty(shape[, dtype, order])
- empty_like(a[, dtype, order, subok])
- eye(N[, M, k, dtype, order])
- identity(n[, dtype])
- ones(shape[, dtype, order])
- ones_like(a[, dtype, order, subok])
- zeros()
- zeros(shape[, dtype, order])
- zeros_like(a[, dtype, order, subok])
- full()
- full(shape, fill_value[, dtype, order])
- full_like(a, fill_value[, dtype, order, subok])
np.zeros(shape=(3, 4), dtype=np.float32)
array([[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.]], dtype=float32)
np.ones(shape=(2, 3), dtype=np.int64)
array([[1, 1, 1],
[1, 1, 1]], dtype=int64)
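A few of the other constructors listed above, in a short sketch. The array named template is only an illustration; zeros_like copies its shape and dtype.

```python
import numpy as np

template = np.array([[1, 2, 3], [4, 5, 6]])

z = np.zeros_like(template)     # zeros with template's shape and dtype
f = np.full((2, 2), 7.5)        # every element set to the fill value
e = np.eye(3, dtype=np.int64)   # 3 x 3 identity matrix

print(z.shape, z.dtype == template.dtype)  # (2, 3) True
print(f.tolist())                          # [[7.5, 7.5], [7.5, 7.5]]
print(e.tolist())                          # [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
```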
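2. From existing arrays
- np.array(): makes a copy by default
- np.copy(): makes a copy
- np.asarray(): makes no copy when the input is already an ndarray of the requested dtype, so the result shares data with the original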
score
array([[ 85, 69, 83, 76, 93],
[ 76, 84, 61, 69, 81],
[ 85, 68, 74, 69, 60],
[ 92, 98, 68, 100, 64],
[ 60, 67, 73, 92, 82],
[ 72, 61, 72, 80, 79],
[ 88, 91, 62, 95, 80],
[ 89, 71, 63, 94, 66]])
# np.array()
data1 = np.array(score)
data1
array([[ 85, 69, 83, 76, 93],
[ 76, 84, 61, 69, 81],
[ 85, 68, 74, 69, 60],
[ 92, 98, 68, 100, 64],
[ 60, 67, 73, 92, 82],
[ 72, 61, 72, 80, 79],
[ 88, 91, 62, 95, 80],
[ 89, 71, 63, 94, 66]])
# np.asarray()
data2 = np.asarray(score)
data2
array([[ 85, 69, 83, 76, 93],
[ 76, 84, 61, 69, 81],
[ 85, 68, 74, 69, 60],
[ 92, 98, 68, 100, 64],
[ 60, 67, 73, 92, 82],
[ 72, 61, 72, 80, 79],
[ 88, 91, 62, 95, 80],
[ 89, 71, 63, 94, 66]])
# np.copy()
data3 = np.copy(score)
data3
array([[ 85, 69, 83, 76, 93],
[ 76, 84, 61, 69, 81],
[ 85, 68, 74, 69, 60],
[ 92, 98, 68, 100, 64],
[ 60, 67, 73, 92, 82],
[ 72, 61, 72, 80, 79],
[ 88, 91, 62, 95, 80],
[ 89, 71, 63, 94, 66]])
score[3, 1] = 1000
score
array([[ 85, 69, 83, 76, 93],
[ 76, 84, 61, 69, 81],
[ 85, 68, 74, 69, 60],
[ 92, 1000, 68, 100, 64],
[ 60, 67, 73, 92, 82],
[ 72, 61, 72, 80, 79],
[ 88, 91, 62, 95, 80],
[ 89, 71, 63, 94, 66]])
data1
array([[ 85, 69, 83, 76, 93],
[ 76, 84, 61, 69, 81],
[ 85, 68, 74, 69, 60],
[ 92, 98, 68, 100, 64],
[ 60, 67, 73, 92, 82],
[ 72, 61, 72, 80, 79],
[ 88, 91, 62, 95, 80],
[ 89, 71, 63, 94, 66]])
data2
array([[ 85, 69, 83, 76, 93],
[ 76, 84, 61, 69, 81],
[ 85, 68, 74, 69, 60],
[ 92, 1000, 68, 100, 64],
[ 60, 67, 73, 92, 82],
[ 72, 61, 72, 80, 79],
[ 88, 91, 62, 95, 80],
[ 89, 71, 63, 94, 66]])
data3
array([[ 85, 69, 83, 76, 93],
[ 76, 84, 61, 69, 81],
[ 85, 68, 74, 69, 60],
[ 92, 98, 68, 100, 64],
[ 60, 67, 73, 92, 82],
[ 72, 61, 72, 80, 79],
[ 88, 91, 62, 95, 80],
[ 89, 71, 63, 94, 66]])
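The experiment above shows that only data2, created with np.asarray(), follows the change to score. A smaller sketch of the same check, using np.shares_memory() to test for shared data directly (the names src, by_array, by_asarray and by_copy are illustrative):

```python
import numpy as np

src = np.array([[1, 2], [3, 4]])
by_array = np.array(src)      # independent copy
by_asarray = np.asarray(src)  # no copy: the very same array object
by_copy = np.copy(src)        # independent copy

src[0, 0] = 99

print(by_asarray is src)                  # True
print(np.shares_memory(src, by_array))    # False
print(np.shares_memory(src, by_copy))     # False
print(by_array[0, 0], by_asarray[0, 0], by_copy[0, 0])  # 1 99 1
```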
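3. Arrays over a fixed range
- np.linspace(0, 10, 100): generates 100 evenly spaced numbers from 0 to 10
- Unlike range(), it includes both endpoints by default (closed interval)
- np.arange()
- range(1, 100, 5) produces an iterable from 1 up to (but not including) 100 with step 5; np.arange() is used the same way
- range() and np.arange() cover a half-open interval: closed on the left, open on the right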
np.linspace(0, 10, 20)
array([ 0. , 0.52631579, 1.05263158, 1.57894737, 2.10526316,
2.63157895, 3.15789474, 3.68421053, 4.21052632, 4.73684211,
5.26315789, 5.78947368, 6.31578947, 6.84210526, 7.36842105,
7.89473684, 8.42105263, 8.94736842, 9.47368421, 10. ])
np.arange(0, 10, 1)
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
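The endpoint behaviour is easy to see with a handful of points. A minimal sketch; endpoint=False is the np.linspace option that switches to a half-open interval.

```python
import numpy as np

closed = np.linspace(0, 10, 5)                     # includes 10
half_open = np.linspace(0, 10, 5, endpoint=False)  # excludes 10
stepped = np.arange(1, 100, 5)                     # same values as range(1, 100, 5)

print(closed.tolist())     # [0.0, 2.5, 5.0, 7.5, 10.0]
print(half_open.tolist())  # [0.0, 2.0, 4.0, 6.0, 8.0]
print(len(stepped), stepped[0], stepped[-1])  # 20 1 96
```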
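4. Random arrays
- np.random module
- Uniform distribution
- np.random.rand(n): returns n uniformly distributed numbers in [0.0, 1.0)
- np.random.uniform(low=0.0, high=1.0, size=None)
- Notes:
- The interval is half-open: [low, high)
- size: output shape, an int or a tuple. For example, size=(m, n, k) draws m*n*k samples arranged in that shape; by default a single value is returned
- Return value: an ndarray whose shape matches size
- Normal distribution N(μ, σ²)
- np.random.normal(loc=0.0, scale=1.0, size=None): loc is the mean μ and scale is the standard deviation σ
Uniform distribution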
np.random.rand(3)
array([0.63344521, 0.44044366, 0.51506874])
data1 = np.random.uniform(low=-1, high=1, size=(100000))
data1
array([ 0.52165485, -0.37857184, 0.85434807, ..., 0.68128425,
0.41514954, 0.97107572])
# Plot a histogram to verify
import matplotlib.pyplot as plt
plt.figure(figsize=(20, 8), dpi=80)
plt.hist(data1, 1000)
plt.show()
Normal distribution
data2 = np.random.normal(loc=1.75, scale=0.1, size=1000000)
data2
array([1.84143464, 1.73375347, 1.73293659, ..., 1.74933248, 1.80482693,
1.78291665])
# Plot a histogram to verify
import matplotlib.pyplot as plt
plt.figure(figsize=(20, 8), dpi=80)
plt.hist(data2, 1000)
plt.show()
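The same two distributions can also be checked numerically instead of with a plot. This sketch uses NumPy's Generator interface (np.random.default_rng) rather than the module-level functions used above; the seed value 42 is an arbitrary choice made only so the run is repeatable.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # arbitrary seed for repeatability

u = rng.uniform(low=-1.0, high=1.0, size=100_000)
n = rng.normal(loc=1.75, scale=0.1, size=100_000)

# Uniform samples stay inside the half-open interval [-1, 1)
print(u.min() >= -1.0, u.max() < 1.0)

# Sample mean and standard deviation should be close to loc and scale
print(n.mean(), n.std())
```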
Slicing, indexing, and reshaping
Example: randomly generate two weeks of trading-day price changes for 8 stocks
import numpy as np
stock_change = np.random.normal(loc=0, scale=1, size=(8, 10))
stock_change
array([[ 0.19074222, 0.60043829, 0.80867868, 1.64073086, -0.97300847,
-0.20084744, 0.60241837, 0.44136119, 0.58028258, -0.46935001],
[-0.68734427, 0.85738838, 1.91338844, 0.58939441, 0.20555763,
-1.46895101, -0.00352442, -1.86573645, 0.94978016, -0.07536797],
[ 0.55409794, -0.76569532, -1.07678287, 0.91303802, 0.45830133,
0.41399899, 0.07469296, -0.47342359, 1.35352344, 0.37089442],
[-1.39658106, -0.4144919 , 0.72383645, 0.45637567, -0.65019515,
1.19320966, 1.24901 , -0.15086696, 0.68574793, -0.27589652],
[-0.10789621, -0.60397001, -1.26983449, 0.22412235, 0.29800482,
-1.56288488, 0.73505373, 0.88072784, -0.93668026, -0.24488789],
[-0.83122852, 0.88981107, -0.09342388, 1.45157522, -0.61855113,
-0.24583226, 1.43576482, -1.23514744, 0.48018713, -1.61807954],
[ 0.10005172, -1.27765932, -0.29108339, -0.40146452, -0.9513938 ,
-0.47696161, -0.46654499, 0.2585099 , 1.04241142, -0.75316624],
[ 0.33955043, -0.07898703, -1.32527034, 1.81189898, 1.05193552,
-0.94289232, 0.11584785, -0.58944079, 0.05561722, 0.45423719]])
2.3.2 Array indexing and slicing (ndarray indices start at 0)
# Get the price changes of the first stock over the first three trading days
stock_change[0, :3]
array([0.19074222, 0.60043829, 0.80867868])
How do you index a three-dimensional array?
a1 = np.array([[[1, 2, 3], [4, 5, 6]], [[12, 3, 34], [5, 6, 7]]])
a1
a1.shape
array([[[ 1, 2, 3],
[ 4, 5, 6]],
[[12, 3, 34],
[ 5, 6, 7]]])
(2, 2, 3)
a1[1, 0, 2]
34
a1[1, 0, 2] = 111
a1[1, 0, 2]
111
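Beyond picking a single element, slices can be combined across the three axes. A short sketch on the same values as a1 (renamed cube here so the example is self-contained):

```python
import numpy as np

cube = np.array([[[1, 2, 3], [4, 5, 6]],
                 [[12, 3, 34], [5, 6, 7]]])

print(cube[1].shape)            # (2, 3)  the whole second block
print(cube[:, 0, :].tolist())   # [[1, 2, 3], [12, 3, 34]]  first row of every block
print(cube[..., -1].tolist())   # [[3, 6], [34, 7]]  last element of every row
```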
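2.3.3 Reshaping
- ndarray.reshape(shape): returns a new ndarray object (a view of the same data when possible); the original array's shape is unchanged. It refills the elements in order and does not transpose
- ndarray.resize(new_shape): returns nothing; modifies the array in place
- ndarray.T: transpose; returns a view with the axes swapped; the original array's shape is unchanged
# Goal: swap the layout from stocks as rows and days as columns to days as rows and stocks as columns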
stock_change
array([[ 0.4757998 , 0.98262317, 0.0903228 , -2.18277494, 0.03458714,
0.27945935, -0.2996386 , 0.55731825, 2.55512752, 0.22270481],
[-0.62812039, -1.85224004, -0.03066103, 1.01028685, 0.07356365,
-0.59625804, 1.78864657, -1.02592844, -0.83059086, -0.51519111],
[ 1.59009598, -0.3561372 , -0.13415047, -0.87131372, 2.63536456,
-0.07324175, 0.11148286, -0.78896717, 0.36006041, -0.32652921],
[-3.06937049, -0.28945156, -1.31411983, 0.27797394, 0.02249254,
0.68111066, 0.19071901, 0.41827306, -0.23168617, 0.02644996],
[-0.57244906, -0.14880773, 0.05463552, 0.05554172, 0.49012011,
-0.97979408, -0.43437754, -1.16343025, -0.22472479, 2.58279491],
[ 0.25552108, 0.28857986, 0.60867223, -0.35784135, -1.32022102,
0.56756162, -1.60034249, 0.81897864, 0.90891023, 1.0347659 ],
[ 1.54397322, 2.43073741, -0.03775417, 1.01840352, -0.20048821,
-0.26870417, -0.02624917, 2.31371289, -0.03578409, -1.6612304 ],
[-1.98659864, -0.55809336, 1.69451479, 0.27399223, -0.15878832,
-0.27930802, -1.5313515 , 1.57077599, -0.3062486 , 0.33964472]])
stock_change.shape
(8, 10)
stock_change.reshape((10, 8))
array([[ 0.4757998 , 0.98262317, 0.0903228 , -2.18277494, 0.03458714,
0.27945935, -0.2996386 , 0.55731825],
[ 2.55512752, 0.22270481, -0.62812039, -1.85224004, -0.03066103,
1.01028685, 0.07356365, -0.59625804],
[ 1.78864657, -1.02592844, -0.83059086, -0.51519111, 1.59009598,
-0.3561372 , -0.13415047, -0.87131372],
[ 2.63536456, -0.07324175, 0.11148286, -0.78896717, 0.36006041,
-0.32652921, -3.06937049, -0.28945156],
[-1.31411983, 0.27797394, 0.02249254, 0.68111066, 0.19071901,
0.41827306, -0.23168617, 0.02644996],
[-0.57244906, -0.14880773, 0.05463552, 0.05554172, 0.49012011,
-0.97979408, -0.43437754, -1.16343025],
[-0.22472479, 2.58279491, 0.25552108, 0.28857986, 0.60867223,
-0.35784135, -1.32022102, 0.56756162],
[-1.60034249, 0.81897864, 0.90891023, 1.0347659 , 1.54397322,
2.43073741, -0.03775417, 1.01840352],
[-0.20048821, -0.26870417, -0.02624917, 2.31371289, -0.03578409,
-1.6612304 , -1.98659864, -0.55809336],
[ 1.69451479, 0.27399223, -0.15878832, -0.27930802, -1.5313515 ,
1.57077599, -0.3062486 , 0.33964472]])
stock_change.shape
(8, 10)
stock_change.resize((10, 8))
stock_change.shape
(10, 8)
stock_change
array([[ 0.4757998 , 0.98262317, 0.0903228 , -2.18277494, 0.03458714,
0.27945935, -0.2996386 , 0.55731825],
[ 2.55512752, 0.22270481, -0.62812039, -1.85224004, -0.03066103,
1.01028685, 0.07356365, -0.59625804],
[ 1.78864657, -1.02592844, -0.83059086, -0.51519111, 1.59009598,
-0.3561372 , -0.13415047, -0.87131372],
[ 2.63536456, -0.07324175, 0.11148286, -0.78896717, 0.36006041,
-0.32652921, -3.06937049, -0.28945156],
[-1.31411983, 0.27797394, 0.02249254, 0.68111066, 0.19071901,
0.41827306, -0.23168617, 0.02644996],
[-0.57244906, -0.14880773, 0.05463552, 0.05554172, 0.49012011,
-0.97979408, -0.43437754, -1.16343025],
[-0.22472479, 2.58279491, 0.25552108, 0.28857986, 0.60867223,
-0.35784135, -1.32022102, 0.56756162],
[-1.60034249, 0.81897864, 0.90891023, 1.0347659 , 1.54397322,
2.43073741, -0.03775417, 1.01840352],
[-0.20048821, -0.26870417, -0.02624917, 2.31371289, -0.03578409,
-1.6612304 , -1.98659864, -0.55809336],
[ 1.69451479, 0.27399223, -0.15878832, -0.27930802, -1.5313515 ,
1.57077599, -0.3062486 , 0.33964472]])
stock_change.resize((8, 10))
stock_change.shape
stock_change
(8, 10)
array([[ 0.4757998 , 0.98262317, 0.0903228 , -2.18277494, 0.03458714,
0.27945935, -0.2996386 , 0.55731825, 2.55512752, 0.22270481],
[-0.62812039, -1.85224004, -0.03066103, 1.01028685, 0.07356365,
-0.59625804, 1.78864657, -1.02592844, -0.83059086, -0.51519111],
[ 1.59009598, -0.3561372 , -0.13415047, -0.87131372, 2.63536456,
-0.07324175, 0.11148286, -0.78896717, 0.36006041, -0.32652921],
[-3.06937049, -0.28945156, -1.31411983, 0.27797394, 0.02249254,
0.68111066, 0.19071901, 0.41827306, -0.23168617, 0.02644996],
[-0.57244906, -0.14880773, 0.05463552, 0.05554172, 0.49012011,
-0.97979408, -0.43437754, -1.16343025, -0.22472479, 2.58279491],
[ 0.25552108, 0.28857986, 0.60867223, -0.35784135, -1.32022102,
0.56756162, -1.60034249, 0.81897864, 0.90891023, 1.0347659 ],
[ 1.54397322, 2.43073741, -0.03775417, 1.01840352, -0.20048821,
-0.26870417, -0.02624917, 2.31371289, -0.03578409, -1.6612304 ],
[-1.98659864, -0.55809336, 1.69451479, 0.27399223, -0.15878832,
-0.27930802, -1.5313515 , 1.57077599, -0.3062486 , 0.33964472]])
stock_change.T
array([[ 0.4757998 , -0.62812039, 1.59009598, -3.06937049, -0.57244906,
0.25552108, 1.54397322, -1.98659864],
[ 0.98262317, -1.85224004, -0.3561372 , -0.28945156, -0.14880773,
0.28857986, 2.43073741, -0.55809336],
[ 0.0903228 , -0.03066103, -0.13415047, -1.31411983, 0.05463552,
0.60867223, -0.03775417, 1.69451479],
[-2.18277494, 1.01028685, -0.87131372, 0.27797394, 0.05554172,
-0.35784135, 1.01840352, 0.27399223],
[ 0.03458714, 0.07356365, 2.63536456, 0.02249254, 0.49012011,
-1.32022102, -0.20048821, -0.15878832],
[ 0.27945935, -0.59625804, -0.07324175, 0.68111066, -0.97979408,
0.56756162, -0.26870417, -0.27930802],
[-0.2996386 , 1.78864657, 0.11148286, 0.19071901, -0.43437754,
-1.60034249, -0.02624917, -1.5313515 ],
[ 0.55731825, -1.02592844, -0.78896717, 0.41827306, -1.16343025,
0.81897864, 2.31371289, 1.57077599],
[ 2.55512752, -0.83059086, 0.36006041, -0.23168617, -0.22472479,
0.90891023, -0.03578409, -0.3062486 ],
[ 0.22270481, -0.51519111, -0.32652921, 0.02644996, 2.58279491,
1.0347659 , -1.6612304 , 0.33964472]])
stock_change.T.shape
(10, 8)
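The outputs above show that reshape((10, 8)) and .T both give shape (10, 8) but arrange the values differently; only .T actually swaps rows and columns. A tiny sketch makes the difference visible (m is an illustrative 2 x 3 array):

```python
import numpy as np

m = np.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]

r = m.reshape(3, 2)      # same element order, new row length
t = m.T                  # rows and columns swapped
auto = m.reshape(-1, 2)  # -1 lets NumPy infer that dimension

print(r.tolist())   # [[0, 1], [2, 3], [4, 5]]
print(t.tolist())   # [[0, 3], [1, 4], [2, 5]]
print(auto.shape)   # (3, 2)
print(np.shares_memory(m, t))  # True: the transpose is a view, not a copy
```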
2.3.4 Changing the type
- ndarray.astype("type")
- Serializing an ndarray (converting it to bytes)
- ndarray.tobytes() (ndarray.tostring() is a deprecated alias)
stock_change.astype("int64")
array([[ 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
[ 0, 0, 1, 0, 0, -1, 0, -1, 0, 0],
[ 0, 0, -1, 0, 0, 0, 0, 0, 1, 0],
[-1, 0, 0, 0, 0, 1, 1, 0, 0, 0],
[ 0, 0, -1, 0, 0, -1, 0, 0, 0, 0],
[ 0, 0, 0, 1, 0, 0, 1, -1, 0, -1],
[ 0, -1, 0, 0, 0, 0, 0, 0, 1, 0],
[ 0, 0, -1, 1, 1, 0, 0, 0, 0, 0]], dtype=int64)
type(stock_change)
numpy.ndarray
# Serialize to bytes (tobytes() replaces the deprecated tostring())
stock_change.tobytes()
b'm\xe2\xc3\xb6=j\xc8?\xce\xf3\xf4Z\xca6\xe3?\x96I\x06\x1a\xb2\xe0\xe9?\x95%M\xffn@\xfa?\xfd\x05\xaf\xa8\xe2"\xef\xbf\x925yn^\xb5\xc9\xbf\xdd\xea`\xe1\x02G\xe3?{|\xca\x03C?\xdc?\xa9~l\xc6\xac\x91\xe2?pNW\xa2\xd4\t\xde\xbfd\xd5\x95f\xb9\xfe\xe5\xbf\xbcx\xcd\xbf\xb9o\xeb?\x8d\xa5\xba1=\x9d\xfe?t\xca\x11\xabQ\xdc\xe2?(\x8b"c\xb6O\xca?_I\x81\xc7\xd2\x80\xf7\xbf \xd4\xef\xe3?\xdfl\xbf\xcaL\xe8v\x0e\xda\xfd\xbf\xe11n[\x99d\xee?l\xf2\xd4\xbaPK\xb3\xbf\xa0\xf9\xe7\x98+\xbb\xe1?\xea\xe6={\x93\x80\xe8\xbfG@u\xad\x80:\xf1\xbfZO\xb4\x84\x9b7\xed?\xc7\xc6\xf0\x1d\xcfT\xdd?\xf5/+\x9d\xf5~\xda?\xb4\xd1\xeb\xdd\x13\x1f\xb3?\xcd\x80\xd9x\x92L\xde\xbf/\xd0\xb10\x08\xa8\xf5?GC\xba\xf4\xbb\xbc\xd7?\xf1\xe5OaeX\xf6\xbf\xf2\xe0\x0f\x06\t\x87\xda\xbf\xcah\xaf\x0e\xab)\xe7?J\t\xbcJB5\xdd?\x88\xfb\x92\x0ef\xce\xe4\xbf\xaa}\xf0\x02c\x17\xf3?\x88\xfa\xed\xe8\xf1\xfb\xf3?rgD\xca\x9bO\xc3\xbf\xed#l\xa5\xa5\xf1\xe5?\x88,\xef\xe3I\xa8\xd1\xbfi\xa5\x14\x14\x16\x9f\xbb\xbf\x88\xbd\xb5\xe8\xb8S\xe3\xbf\xe7\xfc9\xf9=Q\xf4\xbf\xc9\x08\xc2\x83\n\xb0\xcc?\xe6\xfd\x85\xd2\x82\x12\xd3?\xcbyU\x93\x93\x01\xf9\xbfSw\x11g\x8f\x85\xe7?k+()\xec.\xec?\xaf5t\xe3H\xf9\xed\xbf\xeb&2\x8c|X\xcf\xbf/)\xf4\x8fl\x99\xea\xbf\xbb@\xba\x10Uy\xec?}\x83n\xb2\xa0\xea\xb7\xbf\x0e\xe8\n\xf0\xa69\xf7?\xca\xa8B\xbc+\xcb\xe3\xbf\x07\xb5w{nw\xcf\xbf\xff\xf6\xca\x87\xe4\xf8\xf6?X\x1d%\xf5)\xc3\xf3\xbf"BP\xc9b\xbb\xde?\xa3\xfb]^\xa7\xe3\xf9\xbf\x9eO\x98B\xfd\x9c\xb9?8\xb8\x92\xe6Jq\xf4\xbf\xef|\x9b<\x1c\xa1\xd2\xbf\xfd\x03\rA\x98\xb1\xd9\xbfgg\xa8i\xd1q\xee\xbf\x90\x8f\xb5\x01\x8a\x86\xde\xbf\xd1\xe9x\x87\xdf\xdb\xdd\xbf\xcb\xa8\xad\x1bm\x8b\xd0?\x03\xff\x0e\x9a\xb7\xad\xf0?\x07\xa6\xfa\x16\xf0\x19\xe8\xbf\xd7\xe1\r\xb71\xbb\xd5?\xb3\xc5]\x80~8\xb4\xbfb\xea\xcb\xacN4\xf5\xbf\x94\xd1M\xc9\x89\xfd\xfc?\xa3\xa3\xdfW\xba\xd4\xf0?D\x92n\x81,,\xee\xbf>\x91\xb3e4\xa8\xbd?\x81bW\xee\xb2\xdc\xe2\xbfC\x10\xf4]\xdcy\xac?\x01d\t\xdf8\x12\xdd?'
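The raw bytes carry no shape or dtype information, so both must be supplied again to rebuild the array. A minimal round-trip sketch, assuming the dtype and shape are known on the reading side:

```python
import numpy as np

original = np.array([[1.5, -2.0], [0.25, 4.0]])  # float64 by default

raw = original.tobytes()
print(len(raw))  # 32  (4 elements x 8 bytes)

# np.frombuffer reinterprets the bytes; dtype and shape are passed back in by hand
restored = np.frombuffer(raw, dtype=original.dtype).reshape(original.shape)
print(np.array_equal(original, restored))  # True
```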
2.3.5 Removing duplicates
- set(): sets contain no duplicates, but set() only works on one-dimensional data
- np.unique()
temp = np.array([[1, 2, 3, 4], [3, 4, 5, 6]])
temp
array([[1, 2, 3, 4],
[3, 4, 5, 6]])
set(temp) # the rows of a 2-D array are ndarrays, which are not hashable!
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-15-b75e6adf7355> in <module>
----> 1 set(temp) # the rows of a 2-D array are ndarrays, which are not hashable!
TypeError: unhashable type: 'numpy.ndarray'
np.unique(temp)
array([1, 2, 3, 4, 5, 6])
# Flatten the 2-D array temp to 1-D
temp.flatten()
array([1, 2, 3, 4, 3, 4, 5, 6])
# Deduplicate temp with set(), producing a set
set(temp.flatten())
{1, 2, 3, 4, 5, 6}
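np.unique() can also report how often each value occurs, which set() cannot. A short sketch on the same values as temp, using the return_counts option:

```python
import numpy as np

temp = np.array([[1, 2, 3, 4], [3, 4, 5, 6]])

values, counts = np.unique(temp, return_counts=True)
print(values.tolist())  # [1, 2, 3, 4, 5, 6]
print(counts.tolist())  # [1, 1, 2, 2, 1, 1]
```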
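2.3.6 Summary
- Creating arrays
- Uniform distribution
- Normal distribution
- Array indexing
- Changing an array's shape
- reshape()
- resize()
- T (transpose)
- Array types
- astype()
- Array conversions
- tobytes() (serialize an array; formerly tostring())
- unique (remove duplicates)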
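2.4 ndarray operations
- Logical operations
- Boolean indexing
- General-purpose test functions
- np.all(): tests whether all elements satisfy the condition in the parentheses
- Returns False if any element is False; returns True only when all are True
- np.any(): tests whether any element satisfies the condition in the parentheses
- Returns True if any element is True; returns False only when all are False
- np.where(): a vectorized ternary operator, used to act on elements that satisfy a condition
- np.where(condition, value where True, value where False)
- Compound conditions can be built with np.logical_and() and np.logical_or()
- Statistical operations
- Operations between arrays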
2.4.1 Logical operations
- Operating on data that meets a condition
import numpy as np
stock_change = np.random.normal(loc=0, scale=1, size=(8, 10))
stock_change
array([[-3.21925655e-01, 2.00278648e+00, 5.71029655e-02,
1.34207945e+00, 4.84536098e-01, -1.43965967e+00,
4.95406564e-02, -9.71429614e-02, -7.59968374e-01,
9.05514273e-02],
[-1.26553830e+00, 2.83480830e-01, -1.27096652e+00,
-7.78617184e-02, 2.13893026e-02, 6.99181366e-01,
1.28778436e+00, -1.21318904e+00, 1.34335913e-01,
1.16881429e+00],
[ 2.75733840e+00, -5.19397391e-01, 5.17573162e-01,
-1.03617610e+00, -9.24933387e-01, -1.25727419e+00,
-1.86879247e+00, 1.25274113e+00, -6.42865073e-01,
-1.18522669e+00],
[ 6.45555564e-01, 4.66116901e-02, 9.14255949e-01,
-1.19547375e+00, -3.57279692e-01, -5.93218153e-01,
1.13542745e-01, -8.24030471e-01, 2.24255839e-01,
-1.51164946e+00],
[ 1.62126604e-01, -1.89741507e+00, 9.96692135e-01,
-6.67635633e-01, -5.55159320e-05, 1.19218271e+00,
-1.03884252e+00, 1.25879191e+00, 7.90378065e-01,
1.02574465e+00],
[-8.79940430e-01, -8.81164598e-01, -1.32232404e+00,
1.00266248e+00, -1.31175999e-01, -9.48896084e-03,
3.66261660e-01, -7.94653378e-01, 4.10770668e-01,
1.92790586e-01],
[-1.39018777e+00, 6.05357379e-01, -6.29581411e-01,
2.13056580e+00, -8.09247972e-01, -1.45850019e+00,
8.34844616e-01, 8.84528946e-01, -1.32502380e+00,
1.13360265e+00],
[ 1.38156266e-01, 4.82065621e-02, 1.34596475e+00,
3.35030264e-01, -9.91285791e-01, 7.76555121e-01,
-3.59506728e-01, 5.55275392e-01, -8.74342910e-01,
7.75673585e-02]])
# Logical test: mark True where the change is greater than 0.5, False otherwise
stock_change > 0.5
array([[ True, False, False, False, False, False, False, True, False,
False],
[False, False, False, False, True, False, True, False, False,
False],
[ True, False, False, True, False, False, False, True, False,
True],
[False, False, False, False, True, True, False, True, False,
False],
[False, True, True, False, False, False, False, False, False,
True],
[False, False, True, False, False, False, True, False, False,
True],
[False, False, False, False, True, False, False, True, False,
False],
[False, False, False, False, False, False, True, True, False,
True]])
stock_change[stock_change > 0.5]
array([1.00960969, 1.57389795, 1.48681292, 0.60698373, 1.32107651,
0.61409015, 0.55504015, 0.75181457, 1.19916094, 1.60054449,
2.15595143, 0.81875222, 1.14852624, 3.00824751, 1.41247279,
0.50834144, 0.54149304, 1.24179072, 0.84150558, 1.3810949 ,
0.63093414, 1.45875392])
stock_change[stock_change > 0.5] = 1.1
stock_change
array([[ 1.1 , -0.84427887, 0.02493903, -1.25251943, -0.13526056,
-1.53708678, -1.75029022, 1.1 , -0.11252094, -1.10968466],
[-1.77263505, -1.10273485, -0.06427946, -0.47530352, 1.1 ,
0.39431243, 1.1 , -0.19107432, -0.30473289, -0.36641659],
[ 1.1 , -0.74977583, -1.86773114, 1.1 , -0.82844641,
-0.13917232, 0.39855819, 1.1 , 0.23328748, 1.1 ],
[ 0.13270216, 0.08033047, -0.09144296, -1.1299997 , 1.1 ,
1.1 , -1.14426555, 1.1 , -0.70135722, -1.56731264],
[-1.48992779, 1.1 , 1.1 , 0.27689299, 0.30363445,
-0.01249626, 0.37981243, -0.3862383 , -0.19437319, 1.1 ],
[-0.17573471, -0.69522921, 1.1 , 0.14881664, 0.24209382,
0.43842094, 1.1 , 0.19444138, -0.85873745, 1.1 ],
[ 0.11826444, -0.79209097, -0.22540633, -0.03265994, 1.1 ,
-0.53830251, -0.21617814, 1.1 , -0.18148514, -0.35653799],
[-0.15744953, -0.07925474, -1.31580327, -0.53460345, -0.76964669,
-1.49762656, 1.1 , 1.1 , 0.41973408, 1.1 ]])
2.4.2 General-purpose test functions
- np.all()
- np.any()
# Check whether every value in stock_change[0:2, 0:5] is a rise
stock_change[0:2, 0:5]
stock_change[0:2, 0:5] > 0
array([[ 1.1 , -0.84427887, 0.02493903, -1.25251943, -0.13526056],
[-1.77263505, -1.10273485, -0.06427946, -0.47530352, 1.1 ]])
array([[ True, False, True, False, False],
[False, False, False, False, True]])
np.all(stock_change[0:2, 0:5] > 0)
False
# Check whether any of the first 5 stocks rose
stock_change[:5, :] > 0
np.any(stock_change[:5, :] > 0)
array([[ True, False, True, False, False, False, False, True, False,
False],
[False, False, False, False, True, True, True, False, False,
False],
[ True, False, False, True, False, False, True, True, True,
True],
[ True, True, False, False, True, True, False, True, False,
False],
[False, True, True, True, True, False, True, False, False,
True]])
True
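Both functions also accept an axis argument, which answers the question per stock or per day instead of for the whole block. A sketch on a small made-up array (rows stand for stocks, columns for days):

```python
import numpy as np

changes = np.array([[0.3, -0.1, 0.8],
                    [0.2, 0.5, 0.1],
                    [-0.4, -0.2, -0.9]])
rose = changes > 0

print(np.all(rose, axis=1).tolist())  # [False, True, False]  rose every day?
print(np.any(rose, axis=1).tolist())  # [True, True, False]   rose on any day?
print(np.any(rose, axis=0).tolist())  # [True, True, True]    did any stock rise that day?
```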
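2.4.3 np.where(): a vectorized ternary operator
- np.where(condition, value where True, value where False)
# For the first four stocks over the first four days: set changes greater than 0 to 1, otherwise 0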
temp = stock_change[:4, :4]
temp
array([[-0.32192566, 2.00278648, 0.05710297, 1.34207945],
[-1.2655383 , 0.28348083, -1.27096652, -0.07786172],
[ 2.7573384 , -0.51939739, 0.51757316, -1.0361761 ],
[ 0.64555556, 0.04661169, 0.91425595, -1.19547375]])
np.where(temp > 0, 1, 0)
array([[0, 1, 1, 1],
[0, 1, 0, 0],
[1, 0, 1, 0],
[1, 1, 1, 0]])
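Compound logical tests
- Use np.logical_and() and np.logical_or() (or the element-wise operators & and | with parenthesized conditions)
# For the first four stocks over the first four days: set changes greater than 0.5 and less than 1 to 1, otherwise 0
# For the first four stocks over the first four days: set changes greater than 0.5 or less than -0.5 to 1, otherwise 0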
np.logical_and(temp > 0.5, temp < 1)
np.logical_or(temp > 0.5, temp < -0.5)
array([[False, False, False, False],
[False, False, False, False],
[False, False, True, False],
[ True, False, True, False]])
array([[False, True, False, True],
[ True, False, True, False],
[ True, True, True, True],
[ True, False, True, True]])
np.where(np.logical_and(temp > 0.5, temp < 1), 1, 0)
array([[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 1, 0],
[1, 0, 1, 0]])
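The element-wise operators & and | give the same result as np.logical_and() and np.logical_or(), provided each comparison is wrapped in parentheses. A sketch on a small made-up array x:

```python
import numpy as np

x = np.array([[-0.8, 0.6],
              [0.2, 1.4]])

# Parentheses are required: & and | bind more tightly than comparisons
band = np.where((x > 0.5) & (x < 1), 1, 0)
tails = np.where((x > 0.5) | (x < -0.5), 1, 0)

print(band.tolist())   # [[0, 1], [0, 0]]
print(tails.tolist())  # [[1, 1], [0, 1]]

# Same masks as the function forms
print(np.array_equal((x > 0.5) & (x < 1), np.logical_and(x > 0.5, x < 1)))  # True
```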
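2.4.4 Statistical operations
- Statistical functions
- min, max, mean, median, var, std
- np.function_name(temp[, axis=])
- ndarray.method_name([axis=]) (median is available only as np.median())
- Locating the maximum and minimum values
- np.argmax(temp, axis=)
- np.argmin(temp, axis=)
Example: statistics on stock price changes
Statistics for the first four stocks over the first four days
# Largest rise among the first four stocks over the first four days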
temp
temp.shape
array([[-0.32192566, 2.00278648, 0.05710297, 1.34207945],
[-1.2655383 , 0.28348083, -1.27096652, -0.07786172],
[ 2.7573384 , -0.51939739, 0.51757316, -1.0361761 ],
[ 0.64555556, 0.04661169, 0.91425595, -1.19547375]])
(4, 4)
temp.max()
2.75733839736208
np.max(temp)
2.75733839736208
# Maximum of each row
temp.max(axis=1) # reduce along the second dimension (axis=1)
# axis=-1 (the last dimension) is equivalent
temp.max(axis=-1)
array([2.00278648, 0.28348083, 2.7573384 , 0.91425595])
array([2.00278648, 0.28348083, 2.7573384 , 0.91425595])
# Maximum of each column
np.max(temp, axis=0) # reduce along the first dimension (axis=0)
# axis=-2 (the second-to-last dimension) is equivalent
np.max(temp, axis=-2)
array([2.7573384 , 2.00278648, 0.91425595, 1.34207945])
array([2.7573384 , 2.00278648, 0.91425595, 1.34207945])
Showing where the maximum and minimum values are located
np.argmax(temp, axis=1)
array([1, 1, 0, 2], dtype=int64)
np.argmin(temp, axis=0)
array([1, 2, 1, 3], dtype=int64)
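With fixed values the effect of axis is easy to verify by hand. A small sketch (g is an illustrative 2 x 3 array): axis=0 collapses the rows and gives one result per column, while axis=1 gives one result per row.

```python
import numpy as np

g = np.array([[1.0, 5.0, 3.0],
              [4.0, 2.0, 6.0]])

print(g.mean())                 # 3.5   over all elements
print(g.mean(axis=0).tolist())  # [2.5, 3.5, 4.5]  per column
print(g.max(axis=1).tolist())   # [5.0, 6.0]       per row
print(np.median(g))             # 3.5   function form only
print(g.argmax(axis=1).tolist())  # [1, 2]  column index of each row's maximum
```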
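2.5 Operations between arrays
- 2.5.1 Scenario
- 2.5.2 Array-scalar operations
- 2.5.3 Array-array operations
- 2.5.4 Broadcasting
- 2.5.5 Matrix operations
- What is a matrix
- Matrix multiplication
- Applications of matrices
2.5.1 Scenario
Data: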
[[80, 86],
[82, 80],
[85, 78],
[90, 90],
[86, 82],
[82, 90],
[78, 80],
[92, 94]]
2.5.2 Array-scalar operations
arr = np.array([[1, 2, 3, 2, 1, 4], [5, 6, 1, 2, 3, 1]])
arr
array([[1, 2, 3, 2, 1, 4],
[5, 6, 1, 2, 3, 1]])
arr + 10
array([[11, 12, 13, 12, 11, 14],
[15, 16, 11, 12, 13, 11]])
2.5.3 Array-array operations
arr1 = np.array([[1, 2, 3, 2, 1, 4], [5, 6, 1, 2, 3, 1]])
arr2 = np.array([[1, 2, 3, 4], [3, 4, 5, 6]])
arr1 # (2, 6)
arr2 # (2, 4)
array([[1, 2, 3, 2, 1, 4],
[5, 6, 1, 2, 3, 1]])
array([[1, 2, 3, 4],
[3, 4, 5, 6]])
# Arrays whose shapes are not broadcast-compatible cannot be combined
arr1 + arr2
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-20-39b93f7d8e72> in <module>
1 # Arrays whose shapes are not broadcast-compatible cannot be combined
----> 2 arr1 + arr2
ValueError: operands could not be broadcast together with shapes (2,6) (2,4)
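2.5.4 Broadcasting
Broadcasting applies when two ndarrays take part in an element-wise operation; its purpose is to make arithmetic between ndarrays of different shapes convenient.
When operating on two arrays, NumPy compares their shapes (tuples) dimension by dimension, starting from the trailing dimension. The arrays can be combined only if every pair of compared dimensions satisfies one of the following:
- The two dimensions are equal
- One of the two dimensions is 1
# Shapes are aligned from the last dimension backward; these two are broadcast-compatible, and each dimension of the result takes the larger of the two sizes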
arr1 = np.array([[1, 2, 3, 2, 1, 4], [5, 6, 1, 2, 3, 1]])
arr2 = np.array([[1], [3]])
arr1 # (2, 6)
arr2 # (2, 1)
arr1 + arr2 # (2, 6)
arr1 / arr2 # (2, 6)
array([[1, 2, 3, 2, 1, 4],
[5, 6, 1, 2, 3, 1]])
array([[1],
[3]])
array([[2, 3, 4, 3, 2, 5],
[8, 9, 4, 5, 6, 4]])
array([[1. , 2. , 3. , 2. , 1. ,
4. ],
[1.66666667, 2. , 0.33333333, 0.66666667, 1. ,
0.33333333]])
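The two rules can be tried out on shapes alone. A minimal sketch with arrays of ones; the names a, col, row and bad are illustrative.

```python
import numpy as np

a = np.ones((2, 6))
col = np.ones((2, 1))   # trailing 1 stretches to 6
row = np.ones((6,))     # missing leading dimension is treated as 1
bad = np.ones((2, 4))   # 6 vs 4: neither equal nor 1

print((a + col).shape)    # (2, 6)
print((a + row).shape)    # (2, 6)
print((col + row).shape)  # (2, 6)  both operands are stretched

try:
    a + bad
except ValueError as err:
    print("not broadcastable:", err)
```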
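2.5.5 Matrix operations
- Two ways to store a matrix:
- A two-dimensional ndarray
- The matrix data structure (np.matrix; NumPy's documentation now recommends ordinary ndarrays instead)
1. What is a matrix
- np.mat()
- Converts an array into a matrix (np.asmatrix() is the equivalent call; the np.mat alias was removed in NumPy 2.0)
# Store the matrix as an ndarray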
data = np.array([[80, 86],
[82, 80],
[85, 78],
[90, 90],
[86, 82],
[82, 90],
[78, 80],
[92, 94]])
data
# Store the matrix as a matrix object (np.asmatrix is equivalent to the older np.mat)
data_mat = np.asmatrix(data)
data_mat
array([[80, 86],
[82, 80],
[85, 78],
[90, 90],
[86, 82],
[82, 90],
[78, 80],
[92, 94]])
matrix([[80, 86],
[82, 80],
[85, 78],
[90, 90],
[86, 82],
[82, 90],
[78, 80],
[92, 94]])
type(data_mat)
numpy.matrix
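2. Matrix multiplication
- Shapes
- (m, n) * (n, l) = (m, l)
- Rules
- Scalar multiplication
- Dot product
- Cross product
- API
- Matrices stored as ndarray:
- np.matmul()
- np.dot()
- the @ operator
- Matrices stored as mat:
- mat1 * mat2
# View the score matrix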
data_mat
data_mat.shape
matrix([[80, 86],
[82, 80],
[85, 78],
[90, 90],
[86, 82],
[82, 90],
[78, 80],
[92, 94]])
(8, 2)
# Create the score weights
weights = np.array([[0.3], [0.7]])
weights_mat = np.asmatrix(weights)
weights_mat
weights_mat.shape
matrix([[0.3],
[0.7]])
(2, 1)
data
weights
array([[80, 86],
[82, 80],
[85, 78],
[90, 90],
[86, 82],
[82, 90],
[78, 80],
[92, 94]])
array([[0.3],
[0.7]])
# Matrix multiplication with arrays: np.matmul()
np.matmul(data, weights)
np.matmul(data, weights).shape
array([[84.2],
[80.6],
[80.1],
[90. ],
[83.2],
[87.6],
[79.4],
[93.4]])
(8, 1)
# Matrix multiplication with arrays: np.dot(), which for 2-D arrays computes the matrix product
np.dot(data, weights)
np.dot(data, weights).shape
array([[84.2],
[80.6],
[80.1],
[90. ],
[83.2],
[87.6],
[79.4],
[93.4]])
(8, 1)
# Matrix multiplication with arrays: the @ operator
data @ weights
(data @ weights).shape
array([[84.2],
[80.6],
[80.1],
[90. ],
[83.2],
[87.6],
[79.4],
[93.4]])
(8, 1)
# Matrix multiplication with matrix objects
data_mat * weights_mat
(data_mat * weights_mat).shape
matrix([[84.2],
[80.6],
[80.1],
[90. ],
[83.2],
[87.6],
[79.4],
[93.4]])
(8, 1)
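On plain ndarrays, * is element-wise (with broadcasting) and only @, np.matmul() or np.dot() perform matrix multiplication, whereas on matrix objects * is the matrix product. A sketch using the first three rows of the score data and a one-dimensional weight vector (the 1-D form is a choice made here for brevity, so the result has shape (3,) rather than (3, 1)):

```python
import numpy as np

scores = np.array([[80, 86],
                   [82, 80],
                   [85, 78]])
weights = np.array([0.3, 0.7])

elementwise = scores * weights  # broadcasting: each column scaled by its weight
final = scores @ weights        # matrix-vector product: weighted total per row

print(elementwise.shape)  # (3, 2)
print(final.shape)        # (3,)
print(np.allclose(final, [84.2, 80.6, 80.1]))            # True
print(np.allclose(elementwise.sum(axis=1), final))       # True
```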
2.6 Merging and splitting
2.6.1 Merging
- numpy.hstack(tup): stack arrays in sequence horizontally (column-wise).
- numpy.vstack(tup): stack arrays in sequence vertically (row-wise).
- numpy.concatenate((a1, a2, ...), axis=0)
a = stock_change[:2, :4]
b = stock_change[4:6, :4]
a
b
array([[-0.32192566, 2.00278648, 0.05710297, 1.34207945],
[-1.2655383 , 0.28348083, -1.27096652, -0.07786172]])
array([[ 0.1621266 , -1.89741507, 0.99669214, -0.66763563],
[-0.87994043, -0.8811646 , -1.32232404, 1.00266248]])
np.hstack((a, b))
array([[-0.32192566, 2.00278648, 0.05710297, 1.34207945, 0.1621266 ,
-1.89741507, 0.99669214, -0.66763563],
[-1.2655383 , 0.28348083, -1.27096652, -0.07786172, -0.87994043,
-0.8811646 , -1.32232404, 1.00266248]])
np.concatenate((a, b), axis=1)
array([[-0.32192566, 2.00278648, 0.05710297, 1.34207945, 0.1621266 ,
-1.89741507, 0.99669214, -0.66763563],
[-1.2655383 , 0.28348083, -1.27096652, -0.07786172, -0.87994043,
-0.8811646 , -1.32232404, 1.00266248]])
np.vstack((a, b))
array([[-0.32192566, 2.00278648, 0.05710297, 1.34207945],
[-1.2655383 , 0.28348083, -1.27096652, -0.07786172],
[ 0.1621266 , -1.89741507, 0.99669214, -0.66763563],
[-0.87994043, -0.8811646 , -1.32232404, 1.00266248]])
np.concatenate((a, b), axis=0)
array([[-0.32192566, 2.00278648, 0.05710297, 1.34207945],
[-1.2655383 , 0.28348083, -1.27096652, -0.07786172],
[ 0.1621266 , -1.89741507, 0.99669214, -0.66763563],
[-0.87994043, -0.8811646 , -1.32232404, 1.00266248]])
2.6.2 Splitting
- numpy.split(array, indices_or_sections, axis=0): split an array into multiple sub-arrays
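The second argument of numpy.split() can be either a number of equal sections or a list of indices to cut at. A short sketch on an illustrative 4 x 3 array:

```python
import numpy as np

x = np.arange(12).reshape(4, 3)   # rows: [0 1 2], [3 4 5], [6 7 8], [9 10 11]

# An integer asks for that many equal parts along the axis
top, bottom = np.split(x, 2, axis=0)

# A list gives the indices at which to cut: columns [0:1], [1:2], [2:]
left, mid, right = np.split(x, [1, 2], axis=1)

print(top.tolist())     # [[0, 1, 2], [3, 4, 5]]
print(bottom.tolist())  # [[6, 7, 8], [9, 10, 11]]
print(left.shape, mid.shape, right.shape)  # (4, 1) (4, 1) (4, 1)
print(right.tolist())   # [[2], [5], [8], [11]]
```

With an integer argument the axis length must divide evenly, otherwise numpy.split() raises a ValueError.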
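2.7 I/O and data processing
- 2.7.1 Reading data with NumPy
- 2.7.2 Handling missing values
- What is a missing value
- How to handle missing values
- Two approaches:
- Delete samples that contain missing values
- Replace or impute them
2.7.1 Reading data with NumPy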
# Read the data
data = np.genfromtxt('test.csv', delimiter=',') # delimiter sets the field separator
data # nan: not a number
array([[ nan, nan, nan, nan],
[ 1. , 123. , 1.4, 23. ],
[ 2. , 110. , nan, 18. ],
[ 3. , nan, 2.1, 19. ]])
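The first row of nan values above is what a text header row turns into when every field is parsed as a float. A sketch of skipping it with skip_header, reading from an in-memory string instead of test.csv; the column names in csv_text are made up for illustration, since the real file's header is not shown.

```python
import io
import numpy as np

# Hypothetical file contents: a header line, then rows with two empty fields
csv_text = "id,height,score,age\n1,123,1.4,23\n2,110,,18\n3,,2.1,19\n"

data = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)

print(data.shape)            # (3, 4)
print(np.isnan(data).sum())  # 2  only the genuinely missing fields remain nan
```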
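2.7.2 Handling missing values
- What is a missing value
- When genfromtxt reads a file as floats, any field that is missing or cannot be parsed as a number (such as the header row above) becomes nan
- How to handle missing values
- Simply replacing nan with 0 distorts statistics: if the mean of the data was greater than 0 before the replacement, it becomes smaller afterward
- More generally: replace missing values with the column mean (or median), or delete rows that contain missing values (data cleaning)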
data[2, 2]
nan
type(data[2, 2])
numpy.float64
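# Fill nan values with the column mean
def fill_nan_by_column_mean(t):
    for i in range(t.shape[1]):  # iterate over columns (the second dimension)
        now_col = t[:, i]  # a view of column i, so assignments write through to t
        # nan is not equal to itself; np.isnan() finds the same entries more readably
        nan_mask = np.isnan(now_col)
        # Count the nan entries in this column
        nan_num = np.count_nonzero(nan_mask)
        if nan_num > 0:  # the column contains nan
            # Sum of the non-nan entries
            now_col_not_nan = now_col[~nan_mask].sum()
            # Mean of the non-nan entries (assumes at least one entry is not nan)
            now_col_mean = now_col_not_nan / (t.shape[0] - nan_num)
            # Replace the nan entries with the column mean
            now_col[nan_mask] = now_col_mean
    return t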
data
array([[ nan, nan, nan, nan],
[ 1. , 123. , 1.4, 23. ],
[ 2. , 110. , nan, 18. ],
[ 3. , nan, 2.1, 19. ]])
fill_nan_by_column_mean(data)
array([[ 2. , 116.5 , 1.75, 20. ],
[ 1. , 123. , 1.4 , 23. ],
[ 2. , 110. , 1.75, 18. ],
[ 3. , 116.5 , 2.1 , 19. ]])
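The same column-mean fill can be written without an explicit loop. This is an alternative sketch, not the function above: it uses np.nanmean(), which ignores nan when averaging, and np.where() with broadcasting to substitute each column's mean. Unlike the in-place version, it returns a new array and leaves t unchanged.

```python
import numpy as np

t = np.array([[np.nan, np.nan, np.nan, np.nan],
              [1.0, 123.0, 1.4, 23.0],
              [2.0, 110.0, np.nan, 18.0],
              [3.0, np.nan, 2.1, 19.0]])

col_means = np.nanmean(t, axis=0)             # per-column mean, skipping nan
filled = np.where(np.isnan(t), col_means, t)  # col_means broadcasts across rows

print(np.allclose(col_means, [2.0, 116.5, 1.75, 20.0]))  # True
print(np.isnan(filled).any())                            # False
```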