"Python for Data Analysis" notes, summary, and examples: NumPy (Part 2)
(3) Array-oriented data processing with ndarray
(a) Example
points = np.arange(-5,5,0.01)
xs, ys = np.meshgrid(points,points)
z = np.sqrt(xs**2+ys**2)
import matplotlib.pyplot as plt
plt.title(r"Image of $\sqrt{x^2+y^2}$")  # raw string so that \s is not treated as an escape sequence
plt.imshow(z,cmap=plt.cm.gray)
<matplotlib.image.AxesImage at 0x266d54c7880>
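To make it easier to see what np.meshgrid returns, here is a minimal sketch on two tiny inputs. The arrays x and y are illustrative values chosen for this note, not part of the example above.

```python
import numpy as np

# Small 1-D inputs so the resulting grids are easy to read.
x = np.array([0, 1, 2])
y = np.array([10, 20])

# With the default indexing="xy", both outputs have shape (len(y), len(x)).
xs, ys = np.meshgrid(x, y)

print(xs)  # every row is a copy of x
print(ys)  # every column is a copy of y
print(xs.shape, ys.shape)
```

Evaluating an expression such as np.sqrt(xs**2 + ys**2) on these grids then computes the function at every (x, y) pair without an explicit loop.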
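(b) Expressing conditional logic as array operations (np.where)
- np.where accepts arrays as well as scalars for its second and third arguments
(i) With arrays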
xarr = np.arange(1.1,1.6,0.1)
yarr = np.arange(2.1,2.6,0.1)
cond = np.array([True, False, True, True, False])
result = np.where(cond,xarr,yarr)
result
array([1.1, 2.2, 1.3, 1.4, 2.5])
(ii) With scalars: replace every positive value with 2 and every other value with -2
arr = np.random.randn(4,4)
arr
array([[-1.09102617, 0.14488428, 0.39996343, -0.58025741],
[ 0.16935005, -0.35147731, 0.12913876, -1.627593 ],
[-0.91612171, -1.43681774, -0.20800336, -0.25200059],
[-0.73166757, 1.37763498, 0.31321662, -0.44070821]])
np.where(arr > 0, 2, -2)
array([[-2, 2, 2, -2],
[ 2, -2, 2, -2],
[-2, -2, -2, -2],
[-2, 2, 2, -2]])
(iii) Array + scalar
np.where(arr > 0, 2, arr) # elements of arr greater than 0 become 2; the rest are left unchanged
array([[-1.09102617, 2. , 2. , -0.58025741],
[ 2. , -0.35147731, 2. , -1.627593 ],
[-0.91612171, -1.43681774, -0.20800336, -0.25200059],
[-0.73166757, 2. , 2. , -0.44070821]])
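As a rough illustration of what np.where replaces, the sketch below compares it with an element-by-element list comprehension. The arrays are written out as literals (an assumption made here to avoid any floating-point step issues with np.arange), and the names slow and fast are only illustrative.

```python
import numpy as np

xarr = np.array([1.1, 1.2, 1.3, 1.4, 1.5])
yarr = np.array([2.1, 2.2, 2.3, 2.4, 2.5])
cond = np.array([True, False, True, True, False])

# Pure-Python version: take x where the condition holds, otherwise y.
slow = [x if c else y for x, y, c in zip(xarr, yarr, cond)]

# Vectorized version: same selection, done in a single array operation.
fast = np.where(cond, xarr, yarr)

print(slow)
print(fast)
```

The list comprehension works only for 1-D inputs and runs in the Python interpreter, whereas np.where also handles multidimensional arrays.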
(c) Mathematical and statistical methods
- A set of mathematical functions computes statistics over the whole array or along a single axis
arr = np.arange(20).reshape(5,4)
arr
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15],
[16, 17, 18, 19]])
(i) Mean
arr.mean()
9.5
np.mean(arr)
9.5
np.average(arr)
9.5
Mean along a given axis
arr.mean(0) # mean along axis 0: the rows are collapsed, giving the mean of each column
array([ 8., 9., 10., 11.])
np.mean(arr,axis=0)
array([ 8., 9., 10., 11.])
(ii) Sum
arr.sum()
190
np.sum(arr)
190
Sum along a given axis
np.sum(arr,axis=1)
array([ 6, 22, 38, 54, 70])
arr.sum(1)
array([ 6, 22, 38, 54, 70])
(iii) Cumulative sum / cumulative product
arr.cumsum()
array([ 0, 1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 66, 78,
91, 105, 120, 136, 153, 171, 190], dtype=int32)
np.cumprod(arr) # the first element is 0, so every cumulative product is 0
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
dtype=int32)
Cumulative sum / cumulative product along an axis
arr.cumsum(1)
array([[ 0, 1, 3, 6],
[ 4, 9, 15, 22],
[ 8, 17, 27, 38],
[12, 25, 39, 54],
[16, 33, 51, 70]], dtype=int32)
np.cumprod(arr,axis=1)
array([[ 0, 0, 0, 0],
[ 4, 20, 120, 840],
[ 8, 72, 720, 7920],
[ 12, 156, 2184, 32760],
[ 16, 272, 4896, 93024]], dtype=int32)
(iv) Median
np.median(arr)
9.5
(v) Maximum / minimum
np.max(arr)
19
np.min(arr)
0
np.max(arr,axis=0)
array([16, 17, 18, 19])
(vi) Index of the maximum / minimum
np.argmin(arr)
0
np.argmax(arr)
19
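Without an axis argument, argmin and argmax return an index into the flattened array. The sketch below shows one way to turn that into a (row, column) position using np.unravel_index; the variable names are illustrative.

```python
import numpy as np

arr = np.arange(20).reshape(5, 4)

flat_idx = np.argmax(arr)                         # index into the flattened array
row, col = np.unravel_index(flat_idx, arr.shape)  # convert to a (row, column) pair

# With an axis, argmax returns one index per column (axis=0) or per row (axis=1).
per_column = arr.argmax(axis=0)

print(flat_idx, (row, col), per_column)
```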
(vii) Standard deviation and variance
np.std(arr)
5.766281297335398
np.var(arr)
33.25
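By default np.std and np.var divide by N (the population formulas). If the sample versions with N - 1 in the denominator are wanted, the ddof parameter can be set to 1, as in this short sketch (variable names are illustrative):

```python
import numpy as np

arr = np.arange(20).reshape(5, 4)

pop_var = np.var(arr)             # divides by N      (default, ddof=0)
sample_var = np.var(arr, ddof=1)  # divides by N - 1
sample_std = np.std(arr, ddof=1)  # square root of the sample variance

print(pop_var, sample_var, sample_std)
```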
(d) Methods for boolean arrays
arr = np.random.randn(100)
(i) Counting True values with sum
(arr > 0).sum() # number of elements of arr greater than 0
52
(ii) Whether any value is True (any) / whether all values are True (all)
bools = arr>0
bools
array([False, False, False, False, True, True, True, True, True,
False, False, False, True, False, True, False, True, False,
False, False, False, False, False, False, True, True, True,
True, False, True, False, True, True, True, True, True,
False, True, False, True, True, False, False, False, True,
False, True, False, True, True, True, True, True, False,
False, False, True, False, True, True, False, True, False,
True, True, False, True, True, True, False, False, True,
True, False, True, True, False, True, False, True, False,
False, False, False, True, False, True, False, True, False,
False, True, True, True, False, True, True, False, True,
False])
bools.any()
True
bools.all()
False
(iii) Sorting
arr = np.array(np.random.randn(6)*10, dtype=np.int32)
arr
array([ 2, 21, -9, -7, 7, -2])
arr.sort()
arr
array([-9, -7, -2, 2, 7, 21])
Sorting along an axis
arr = np.array(np.random.randn(5,3)*10,dtype=np.int32)
arr
array([[ 4, 9, 9],
[-11, 3, -18],
[ 8, 9, 5],
[ -5, 7, -1],
[ 16, 3, -10]])
arr.sort(1)
arr
array([[ 4, 9, 9],
[-18, -11, 3],
[ 5, 8, 9],
[ -5, -1, 7],
[-10, 3, 16]])
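The arr.sort() method used above sorts in place. As a complementary sketch, the top-level np.sort returns a sorted copy and leaves the original untouched, and np.argsort gives the ordering as indices. The input values are copied from the 1-D example above; the other names are illustrative.

```python
import numpy as np

arr = np.array([2, 21, -9, -7, 7, -2])

sorted_copy = np.sort(arr)      # new sorted array; arr itself is unchanged
descending = sorted_copy[::-1]  # reverse the sorted copy for descending order
order = np.argsort(arr)         # indices that would sort arr

print(arr)
print(sorted_copy)
print(descending)
print(order)
```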
(e) unique and other set logic
(i) np.unique
names = np.array(['Amy','Bob','Carol','Dark','Amy','Carol','Sky','Dark'])
np.unique(names) # sorted unique values
array(['Amy', 'Bob', 'Carol', 'Dark', 'Sky'], dtype='<U5')
(ii) np.in1d
values1 = np.array(np.random.randn(5)*10,dtype=np.int32)
values1
array([ 0, 9, -9, 11, -4])
values2 = np.arange(6)
values2
array([0, 1, 2, 3, 4, 5])
np.in1d(values1,values2) # for each element of values1, whether it appears in values2
array([ True, False, False, False, False])
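Recent NumPy documentation recommends np.isin over np.in1d for new code. The sketch below repeats the membership test with np.isin and adds a few of the other set routines; values1 is fixed to the values printed above, and the result names are illustrative.

```python
import numpy as np

values1 = np.array([0, 9, -9, 11, -4])
values2 = np.arange(6)

mask = np.isin(values1, values2)             # is each element of values1 in values2?
common = np.intersect1d(values1, values2)    # sorted values present in both
combined = np.union1d(values1, values2)      # sorted values present in either
only_in_1 = np.setdiff1d(values1, values2)   # sorted values in values1 but not in values2

print(mask)
print(common, combined, only_in_1)
```

Unlike np.in1d, np.isin keeps the shape of its first argument, which matters once the input is multidimensional.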
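(4) File input and output with arrays
- save: save a single array to a .npy file
- savez: save several arrays to one uncompressed .npz archive
- savez_compressed: like savez, but compressed; worth using when the data compresses well
- load: load arrays from a .npy or .npz file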
arr = np.arange(10)
np.save('some_array',arr) # save; the .npy extension is appended automatically if the path does not already end with it
np.load('some_array.npy') # read the array back from disk
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
np.savez('mul_arrays.npz',a=arr,b=arr)
arch = np.load('mul_arrays.npz')
arch['a']
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
arch['b']
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
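savez_compressed is listed above but not demonstrated, so here is a minimal sketch of a round trip through a compressed archive. It writes into a temporary directory so nothing is left on disk; the file name arrays_compressed.npz and the keys a and b are assumptions made for this example.

```python
import os
import tempfile

import numpy as np

arr = np.arange(10)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "arrays_compressed.npz")

    # Each keyword argument becomes a named array inside the archive.
    np.savez_compressed(path, a=arr, b=arr * 2)

    # The archive is loaded lazily; the context manager closes the file afterwards.
    with np.load(path) as arch:
        names = sorted(arch.files)
        a_loaded = arch["a"]
        b_loaded = arch["b"]

print(names)
print(a_loaded, b_loaded)
```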
(5) Linear algebra
Several of these methods have already appeared earlier.
x = np.arange(1,7).reshape(2,3)
y = np.array([[6., 23.],
[-1,7],
[8,9]])
x
array([[1, 2, 3],
[4, 5, 6]])
y
array([[ 6., 23.],
[-1., 7.],
[ 8., 9.]])
(a) Matrix multiplication
x.dot(y)
array([[ 28., 64.],
[ 67., 181.]])
np.dot(x,y)
array([[ 28., 64.],
[ 67., 181.]])
x@y
array([[ 28., 64.],
[ 67., 181.]])
(b) numpy.linalg
from numpy.linalg import inv, qr
x = np.random.randn(5,5)
mat = x.T.dot(x)
(i) inv()
inv(mat) # matrix inverse
array([[ 891.89281934, -170.61175297, 309.30921484, -20.04646306,
331.54866714],
[-170.61175297, 34.31559149, -58.38536593, 4.86214551,
-63.16345462],
[ 309.30921484, -58.38536593, 108.49295298, -6.21220496,
115.01146399],
[ -20.04646306, 4.86214551, -6.21220496, 1.31252171,
-7.36221202],
[ 331.54866714, -63.16345462, 115.01146399, -7.36221202,
123.49445396]])
a = mat.dot(inv(mat))
a # not exactly the identity matrix, because of floating-point rounding error
array([[ 1.00000000e+00, -7.89016539e-15, 2.58787478e-14,
-5.90488249e-15, 3.81666466e-14],
[ 2.88347099e-16, 1.00000000e+00, -1.53280197e-14,
-5.64103998e-15, -2.88340488e-14],
[-3.41925805e-14, 7.92307257e-15, 1.00000000e+00,
3.11646821e-15, -1.01181538e-15],
[-1.21304008e-13, -5.78606747e-16, -1.37942448e-14,
1.00000000e+00, -2.22511130e-14],
[ 3.98501266e-14, 1.39518727e-14, 5.26455377e-14,
3.24456882e-15, 1.00000000e+00]])
a.dtype
dtype('float64')
a.round()
array([[ 1., -0., 0., -0., 0.],
[ 0., 1., -0., -0., -0.],
[-0., 0., 1., 0., -0.],
[-0., -0., -0., 1., -0.],
[ 0., 0., 0., 0., 1.]])
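Rounding is one way to check that the product is the identity; comparing with a tolerance is another. The sketch below uses np.allclose on a small fixed matrix (chosen here so the result is deterministic, not taken from the notes) and also shows np.linalg.solve, which solves a linear system without forming the inverse explicitly.

```python
import numpy as np

# A small, well-conditioned matrix chosen for illustration.
mat = np.array([[4.0, 1.0],
                [2.0, 3.0]])
mat_inv = np.linalg.inv(mat)

# Compare against the identity with a tolerance instead of rounding.
product = mat @ mat_inv
is_identity = np.allclose(product, np.eye(2))

# To solve mat @ v = b, solve() avoids computing the inverse.
b = np.array([1.0, 2.0])
v = np.linalg.solve(mat, b)

print(is_identity)
print(v)
```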
(ii) qr()
- Compute the qr factorization of a matrix.
- Factor the matrix a as qr, where q is orthonormal and r is upper-triangular.
q, r = qr(mat)
r
array([[-2.96659044e+00, -3.21767296e+00, 3.93922588e-01,
2.99766871e+00, 6.13702819e+00],
[ 0.00000000e+00, -3.82625378e+00, -2.29884715e+00,
7.41479928e+00, 6.24315251e-01],
[ 0.00000000e+00, 0.00000000e+00, -1.40167295e+00,
1.16181054e+00, 1.37841479e+00],
[ 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
-6.45930290e-01, -3.89372289e-02],
[ 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 2.64955879e-03]])
q
array([[-0.43313575, 0.09964799, 0.17103318, 0.03903137, 0.87845768],
[-0.34127079, -0.46320476, 0.42064025, -0.68119717, -0.16735529],
[ 0.21554375, -0.27033019, -0.7555402 , -0.46557635, 0.30472964],
[ 0.13012615, 0.81285083, 0.0841993 , -0.5611334 , -0.01950661],
[ 0.79532116, -0.2042223 , 0.46462772, -0.05305607, 0.32720582]])
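The two properties quoted from the docstring can be checked numerically. This is a sketch only: it builds its own random matrix with a seeded generator (an assumption, so the numbers differ from the output above) and verifies that q @ r reproduces the input, that q has orthonormal columns, and that r is upper triangular.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the check is repeatable
x = rng.standard_normal((5, 5))
mat = x.T @ x

q, r = np.linalg.qr(mat)

reconstructs = np.allclose(q @ r, mat)            # the factorization reproduces mat
orthonormal = np.allclose(q.T @ q, np.eye(5))     # columns of q are orthonormal
upper_triangular = np.allclose(r, np.triu(r))     # nothing below the diagonal of r

print(reconstructs, orthonormal, upper_triangular)
```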
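(6) Pseudorandom number generation (numpy.random)
Python's built-in random module generates only one sample value at a time. When a large number of samples is needed, numpy.random is more than an order of magnitude faster, as the following timing shows: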
from random import normalvariate
N = 1000000
%timeit samples = [normalvariate(0,1) for i in range(N)]
1.87 s ± 140 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit np.random.normal(size=N)
55 ms ± 3.45 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
(a) Standard normal distribution
samples = np.random.normal(size=(4,4))
samples
array([[ 1.79400576, -0.34332551, -0.7315326 , 0.37596174],
[ 0.68897482, -1.01607419, -0.18746967, 0.03575278],
[ 0.04196738, 0.96214581, -1.3443093 , 1.14355111],
[ 0.32311173, -1.22932036, 0.14297192, 1.8289397 ]])
(b) Uniform distribution
samples2 = np.random.uniform(size=(4,4))
samples2
array([[0.33197232, 0.50249425, 0.20872139, 0.44100725],
[0.79970626, 0.8493364 , 0.38371009, 0.80270876],
[0.81254287, 0.80318489, 0.18548665, 0.48484211],
[0.60264807, 0.41739885, 0.62637336, 0.27848417]])
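The calls above use NumPy's global random state, so the numbers change on every run. For reproducible results, one option is the Generator interface created by np.random.default_rng, which NumPy's documentation recommends for new code. The seed value 12345 and the variable names below are arbitrary choices for this sketch.

```python
import numpy as np

# A seeded generator produces the same stream every time it is created.
rng = np.random.default_rng(seed=12345)
normal_samples = rng.standard_normal((4, 4))                    # mean 0, std 1
uniform_samples = rng.uniform(low=0.0, high=1.0, size=(4, 4))   # values in [0, 1)

# A second generator with the same seed repeats the first draw exactly.
rng_again = np.random.default_rng(seed=12345)
repeat = rng_again.standard_normal((4, 4))

print(np.array_equal(normal_samples, repeat))
```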
(7) Random walks
A simple random walk: start at 0 and take steps of +1 and -1 with equal probability
import random
position = 0
walk = [position]
steps = 1000
for i in range(steps):
step = 1 if random.randint(0,1) else -1
position += step
walk.append(position)
plt.plot(walk[:100])
[<matplotlib.lines.Line2D at 0x266ec494280>]
[Figure: output_306_1.png, plot of the first 100 steps of the random walk]
As the loop shows, the walk is simply the cumulative sum of the individual steps, so it can be computed with a single array operation.
nsteps = 1000
draws = np.random.randint(0,2,size = nsteps)
steps = np.where(draws>0,1,-1)
walk = steps.cumsum()
plt.plot(walk[:100])
[<matplotlib.lines.Line2D at 0x266ec4e1430>]
[Figure: output_312_1.png, plot of the first 100 steps of the vectorized random walk]
Simulating many random walks at once
nwalks = 50 # simulate 50 random walks
nsteps = 1000
draws = np.random.randint(0,2,size = (nwalks, nsteps))
steps = np.where(draws>0,1,-1)
walks = steps.cumsum(1) # cumulative sum along axis 1, i.e. across the steps within each walk (row)
walks
array([[ -1, -2, -3, ..., -2, -1, -2],
[ -1, 0, 1, ..., 12, 11, 12],
[ -1, -2, -1, ..., -18, -17, -18],
...,
[ 1, 0, 1, ..., 8, 7, 8],
[ 1, 2, 3, ..., -42, -41, -42],
[ -1, 0, -1, ..., 16, 15, 16]], dtype=int32)
walks.max() # maximum position reached over all walks
79
walks.min() # minimum position reached over all walks
-100
hits30 = (np.abs(walks) >= 30).any(1) # for each walk (row), whether it ever reached +30 or -30
hits30
array([False, False, True, True, False, True, True, True, True,
True, True, True, True, True, False, True, True, True,
False, True, False, False, True, True, True, False, True,
True, False, True, True, False, True, False, True, True,
True, False, True, False, False, True, True, True, True,
False, True, False, True, False])
hits30.sum() # number of True values, i.e. walks that reached +30 or -30
33
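A natural follow-up is to ask when each walk first reached ±30. One possible sketch uses argmax on the boolean array, which returns the position of the first True in each row. The seeded generator and the name crossing_times are assumptions made here so the example is self-contained and repeatable.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for repeatability
nwalks, nsteps = 50, 1000

draws = rng.integers(0, 2, size=(nwalks, nsteps))  # 0 or 1 with equal probability
steps = np.where(draws > 0, 1, -1)
walks = steps.cumsum(axis=1)

# Which walks ever reach a distance of 30 from the origin?
hits30 = (np.abs(walks) >= 30).any(axis=1)

# For those walks only, argmax gives the index of the first True,
# i.e. the first step at which |position| >= 30.
crossing_times = (np.abs(walks[hits30]) >= 30).argmax(axis=1)

print(hits30.sum())
print(crossing_times)
```

Restricting to walks[hits30] matters: for a row that never crosses, argmax would return 0, which would be misleading.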
(8) Merging NumPy arrays
A=np.array([1,1,1])
B=np.array([2,2,2])
# vertical stack
print(np.vstack((A,B)))
print(np.vstack((A,B)).shape)
[[1 1 1]
[2 2 2]]
(2, 3)
# horizontal stack
print(np.hstack((A,B)))
print(np.hstack((A,B)).shape)
[1 1 1 2 2 2]
(6,)
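# How to turn a 1-D array into a column vector: transposing does not work, because the transpose of a 1-D array is the array itself
print(A[:,np.newaxis]) # add a new axis so that each element becomes its own row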
A1 = np.array([1,1,1])[:,np.newaxis]
B1 = np.array([2,2,2])[:,np.newaxis]
print(A1)
print(B1)
C1 = np.vstack((A1,B1))
print(C1)
print(C1.shape)
D1 = np.hstack((A1,B1,B1,A1))
print(D1)
print(D1.shape)
[[1]
[1]
[1]]
[[1]
[1]
[1]]
[[2]
[2]
[2]]
[[1]
[1]
[1]
[2]
[2]
[2]]
(6, 1)
[[1 2 2 1]
[1 2 2 1]
[1 2 2 1]]
(3, 4)
# Concatenating several arrays vertically or horizontally
C2 = np.concatenate((A1,B1,B1,A1),axis=0)
print(C2)
[[1]
[1]
[1]
[2]
[2]
[2]
[2]
[2]
[2]
[1]
[1]
[1]]
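The example above only shows axis=0. As a sketch of the horizontal case, the code below concatenates the same column vectors along axis=1 and also uses np.column_stack, which accepts the original 1-D arrays directly. The names side_by_side and columns are illustrative.

```python
import numpy as np

A = np.array([1, 1, 1])
B = np.array([2, 2, 2])
A1 = A[:, np.newaxis]  # shape (3, 1)
B1 = B[:, np.newaxis]  # shape (3, 1)

# axis=1 joins the columns side by side, like np.hstack on 2-D inputs.
side_by_side = np.concatenate((A1, B1, B1, A1), axis=1)  # shape (3, 4)

# column_stack turns each 1-D array into a column, so no newaxis is needed.
columns = np.column_stack((A, B))  # shape (3, 2)

print(side_by_side)
print(columns)
```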
(9) Splitting NumPy arrays
A = np.arange(12).reshape(3,4)
A
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
# Equal split; axis selects the dimension to split along
np.split(A,2,axis=1)
[array([[0, 1],
[4, 5],
[8, 9]]),
array([[ 2, 3],
[ 6, 7],
[10, 11]])]
# Unequal split; axis selects the dimension to split along
np.array_split(A,3,axis=1)
[array([[0, 1],
[4, 5],
[8, 9]]),
array([[ 2],
[ 6],
[10]]),
array([[ 3],
[ 7],
[11]])]
# Equal vertical split (along axis 0, by rows)
np.vsplit(A,3)
[array([[0, 1, 2, 3]]), array([[4, 5, 6, 7]]), array([[ 8, 9, 10, 11]])]
# Equal horizontal split (along axis 1, by columns)
np.hsplit(A,2)
[array([[0, 1],
[4, 5],
[8, 9]]),
array([[ 2, 3],
[ 6, 7],
[10, 11]])]
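Besides a number of sections, np.split also accepts a list of indices at which to cut. The sketch below shows that form, and also that np.split raises ValueError when an equal split is not possible, which is the case np.array_split handles. The names left, middle, right and uneven_allowed are illustrative.

```python
import numpy as np

A = np.arange(12).reshape(3, 4)

# Cut before column 1 and before column 3: columns [0], [1, 2], [3].
left, middle, right = np.split(A, [1, 3], axis=1)

# Asking np.split for 3 equal parts of 4 columns is not possible.
try:
    np.split(A, 3, axis=1)
    uneven_allowed = True
except ValueError:
    uneven_allowed = False

print(left.shape, middle.shape, right.shape)
print(uneven_allowed)
```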
(10) Shallow copy and deep copy
a = np.arange(4)
a
array([0, 1, 2, 3])
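# Assignment copies nothing: b, c and d are just additional names for the same array object as a (the "shallow copy" of this section)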
b = a
c = a
d = b
a[0]=100
print(b is a)
print("b: ",b)
print(c is a)
print("c: ",c)
print(d is a)
print("d: ",d)
True
b: [100 1 2 3]
True
c: [100 1 2 3]
True
d: [100 1 2 3]
# copy() creates a new array with its own data (the "deep copy" of this section)
b=a.copy()
print(b is a)
print("b: ",b)
False
b: [100 1 2 3]
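There is a middle case between plain assignment and copy(): basic slicing returns a view, a distinct array object that still shares data with the original. The following sketch illustrates this with np.shares_memory; the names view and independent are illustrative.

```python
import numpy as np

a = np.arange(4)

view = a[1:3]                # basic slicing: new object, same underlying data
independent = a[1:3].copy()  # explicit copy: separate data

view[0] = 100                # writes through to a, but not to the copy

shares_view = np.shares_memory(a, view)
shares_copy = np.shares_memory(a, independent)

print(view is a)             # a view is a different object from a
print(a)                     # modified through the view
print(independent)           # unchanged
print(shares_view, shares_copy)
```

So `is` alone does not tell whether two arrays share data; np.shares_memory checks that directly.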