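These are purely personal study notes (my own practice notes); all of the material comes from Chapter 4 of the book Python for Data Analysis.
The following example shows how NumPy differs. Suppose we have a NumPy array of one million integers and a Python list with the same contents: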
import numpy as np
my_arr = np.arange(1000000)
my_list = list(range(1000000))
# time the computation
%time for _ in range(10):my_arr2 = my_arr*2
Wall time: 21 ms
# time the computation
%time for _ in range(10):my_list2=[x*2 for x in my_list]
Wall time: 948 ms
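In this run the NumPy version is roughly 45 times faster (21 ms versus 948 ms). In general, NumPy-based code is about 10 to 100 times faster than the pure-Python equivalent and uses less memory.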
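To check the memory half of this claim without IPython magics, here is a minimal sketch that compares the array's data buffer with a rough estimate of the list's footprint. The names array_bytes and list_bytes are my own, and the list estimate (the container plus every int object it references) is an approximation rather than an exact accounting.

```python
import sys
import numpy as np

my_arr = np.arange(1_000_000)
my_list = list(range(1_000_000))

# Bytes held by the array's data buffer (itemsize * number of elements).
array_bytes = my_arr.nbytes

# Rough estimate for the list: the list object plus each int object it points to.
list_bytes = sys.getsizeof(my_list) + sum(sys.getsizeof(x) for x in my_list)

print(f"ndarray: {array_bytes / 1e6:.1f} MB, list: {list_bytes / 1e6:.1f} MB")
```

The exact numbers depend on the platform's default integer size and the Python build, so only the relative difference is meaningful.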
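4.1 NumPy ndarray: A Multidimensional Array Object
One of NumPy's core features is its N-dimensional array object, the ndarray.
# import numpy
import numpy as np
# generate an array of random values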
data = np.random.randn(2,3)
data
array([[ 0.53526407, 1.42752699, -0.68798613],
[-0.45544835, -1.35615318, -1.6924118 ]])
# arithmetic operations
data*10
array([[ 5.3526407 , 14.27526989, -6.87986133],
[ -4.55448354, -13.56153181, -16.92411803]])
data+data
array([[ 1.07052814, 2.85505398, -1.37597227],
[-0.91089671, -2.71230636, -3.38482361]])
# dimensions (shape)
data.shape
(2, 3)
# data type
data.dtype
dtype('float64')
4.1.1 Creating ndarrays
Converting a list
data1 = [6,7.5,8,0,1]
arr1 = np.array(data1)
arr1
array([6. , 7.5, 8. , 0. , 1. ])
Nested sequences, such as a list of equal-length lists, are automatically converted into a multidimensional array
data2 = [[1,2,3,4],[5,6,7,8]]
arr2 = np.array(data2)
arr2
array([[1, 2, 3, 4],
[5, 6, 7, 8]])
arr2.ndim
2
arr2.shape
(2, 4)
arr1.dtype
dtype('float64')
arr2.dtype
dtype('int32')
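Given a length or a shape, zeros creates an array of all zeros and ones creates an array of all ones in a single call. empty creates an array without initializing its values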
np.zeros(10)
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
np.zeros((3,6))
array([[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.]])
np.empty((2,3,2))
array([[[1.05075542e-311, 2.86558075e-322],
[0.00000000e+000, 0.00000000e+000],
[1.05699242e-307, 8.60952352e-072]],
[[4.26976457e-090, 2.00497183e-052],
[1.26141762e-076, 9.91606475e+164],
[6.48224660e+170, 5.82471487e+257]]])
np.ones((2,3))
array([[1., 1., 1.],
[1., 1., 1.]])
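It is not safe to assume that np.empty returns an array of all zeros; it may return uninitialized garbage values, as the np.empty output above shows
arange is an array-valued version of the built-in Python range function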
np.arange(15)
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])
4.1.2 Data Types for ndarrays
The data type is called the dtype
arr1 = np.array([1,2,3],dtype=np.float64)
arr2 = np.array([1,2,3],dtype=np.int32)
arr1.dtype
dtype('float64')
arr2.dtype
dtype('int32')
Use the astype method to explicitly convert an array from one dtype to another
arr = np.array([1,2,3,4,5])
arr.dtype
dtype('int32')
Converting integers to floating point
float_arr = arr.astype(np.float64)
float_arr.dtype
dtype('float64')
arr = np.array([3.7,2.5,4.3,5.0])
arr
array([3.7, 2.5, 4.3, 5. ])
When floating-point numbers are cast to integers, the fractional part is truncated
arr.astype(np.int32)
array([3, 2, 4, 5])
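The truncation above is easy to confuse with rounding. As a small sketch (the negative sample values and the variable names are my own, not from the book), the following compares astype with rounding first via np.rint:

```python
import numpy as np

arr = np.array([3.7, -2.5, 4.3, -0.9])

# astype drops the fractional part, i.e. it truncates toward zero.
truncated = arr.astype(np.int32)

# np.rint rounds to the nearest integer first (ties go to the even integer).
rounded = np.rint(arr).astype(np.int32)

print(truncated)
print(rounded)
```

If rounded values are what you want, round explicitly before casting.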
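Converting an array of strings that represent numbers to numeric form
Be careful when using the numpy.string_ type for string data: NumPy strings are fixed-size, so input may be truncated without a warning. pandas has more intuitive out-of-the-box behavior for non-numeric data. (In NumPy 2.0 and later the name numpy.string_ has been removed; numpy.bytes_ is the equivalent type.)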
numeric_strings = np.array(['1.25','-3.4','4.0'],dtype=np.string_)
numeric_strings
array([b'1.25', b'-3.4', b'4.0'], dtype='|S4')
numeric_strings.astype(float)
array([ 1.25, -3.4 , 4. ])
Using another array's dtype attribute
int_array = np.arange(10)
int_array
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
calibers = np.array([.22,.270,.345,.234],dtype=np.float64)
calibers
array([0.22 , 0.27 , 0.345, 0.234])
int_array.astype(calibers.dtype)
array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])
Passing a shorthand type code to specify the dtype
empty_uint32 = np.empty(8,dtype='u4')
empty_uint32
array([3264175145, 1070344437, 343597384, 1070679982, 3779571220,
1070994554, 1168231105, 1070461878], dtype=uint32)
4.1.3 Arithmetic with NumPy Arrays
arr = np.array([[1.,2.,3.],[4.,5.,6.]])
arr
array([[1., 2., 3.],
[4., 5., 6.]])
arr + arr  # addition
array([[ 2., 4., 6.],
[ 8., 10., 12.]])
arr - arr  # subtraction
array([[0., 0., 0.],
[0., 0., 0.]])
arr * arr  # element-wise multiplication
array([[ 1., 4., 9.],
[16., 25., 36.]])
arr / arr  # division
array([[1., 1., 1.],
[1., 1., 1.]])
1 / arr  # reciprocal
array([[1. , 0.5 , 0.33333333],
[0.25 , 0.2 , 0.16666667]])
arr ** 0.5  # square root
array([[1. , 1.41421356, 1.73205081],
[2. , 2.23606798, 2.44948974]])
Comparing arrays of the same size yields a boolean array
arr2 = np.array([[0.,4.,1.],[7.,4.,23.]])
arr2
array([[ 0., 4., 1.],
[ 7., 4., 23.]])
arr2 > arr
array([[False, True, False],
[ True, False, True]])
4.1.4 Basic Indexing and Slicing
arr = np.arange(10)
arr
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
arr[5]
5
arr[5:8]
array([5, 6, 7])
arr[5:8] = 12
arr
array([ 0, 1, 2, 3, 4, 12, 12, 12, 8, 9])
array_slice = arr[5:8]
array_slice
array([12, 12, 12])
Changing values in array_slice also changes the original array, because an array slice is a view on the original array
array_slice[1] = 123456
array_slice
array([ 12, 123456, 12])
arr
array([ 0, 1, 2, 3, 4, 12, 123456, 12,
8, 9])
If you want a copy of a slice rather than a view, copy it explicitly, for example arr[5:8].copy()
array_copy = arr[2:5].copy()
array_copy
array([2, 3, 4])
array_copy[1] = 12345
array_copy
array([ 2, 12345, 4])
arr
array([ 0, 1, 2, 3, 4, 12, 123456, 12,
8, 9])
The bare slice [:] refers to all values in the array
array_slice[:] = 64
arr
array([ 0, 1, 2, 3, 4, 64, 64, 64, 8, 9])
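To double-check the view and copy behavior seen above, here is a minimal sketch using np.shares_memory. The variable names view and copy are my own choice for illustration.

```python
import numpy as np

arr = np.arange(10)
view = arr[5:8]          # basic slice: a view onto arr's buffer
copy = arr[5:8].copy()   # explicit copy: independent storage

print(np.shares_memory(arr, view))
print(np.shares_memory(arr, copy))

copy[:] = -1   # leaves arr untouched
view[:] = 64   # writes through to arr
print(arr)
```

Only the assignment through the view shows up in arr; the copy can be modified freely.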
Two-dimensional arrays
arr2d = np.array([[1,2,3],[4,5,6],[7,8,9]])
arr2d
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
arr2d[2]
array([7, 8, 9])
Selecting a single element
arr2d[0][2]
3
arr2d[0,2]
3
Three-dimensional arrays
arr3d = np.array([[[1,2,3],[4,5,6]],[[7,8,9],[10,11,12]]])
arr3d
array([[[ 1, 2, 3],
[ 4, 5, 6]],
[[ 7, 8, 9],
[10, 11, 12]]])
arr3d[0]  # a 2 x 3 array
array([[1, 2, 3],
[4, 5, 6]])
Both scalar values and arrays can be assigned to arr3d[0]
old_values = arr3d[0].copy()
old_values
array([[1, 2, 3],
[4, 5, 6]])
arr3d[0] = 42
arr3d
array([[[42, 42, 42],
[42, 42, 42]],
[[ 7, 8, 9],
[10, 11, 12]]])
arr3d[0] = old_values
arr3d
array([[[ 1, 2, 3],
[ 4, 5, 6]],
[[ 7, 8, 9],
[10, 11, 12]]])
Similarly, arr3d[1,0] returns a one-dimensional array:
arr3d[1,0]
array([7, 8, 9])
The same indexing split into two steps
x = arr3d[1]
x
array([[ 7, 8, 9],
[10, 11, 12]])
x[0]
array([7, 8, 9])
Note: in all of the selections above, the returned arrays are views
4.1.4.1 Indexing with Slices
arr
array([ 0, 1, 2, 3, 4, 64, 64, 64, 8, 9])
arr[1:6]
array([ 1, 2, 3, 4, 64])
Two-dimensional arrays
arr2d
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
arr2d[:2]  # the first two rows
array([[1, 2, 3],
[4, 5, 6]])
You can pass multiple slices, just as you can pass multiple indexes
arr2d[:2,1:]
array([[2, 3],
[5, 6]])
Select the first two columns of the second row
arr2d[1,:2]
array([4, 5])
Select the first two rows of the third column
arr2d[:2,2]
array([3, 6])
arr2d[:,:1]
array([[1],
[4],
[7]])
Assigning to a slice expression assigns to the whole selection
arr2d[:2,1:] = 0
arr2d
array([[1, 0, 0],
[4, 0, 0],
[7, 8, 9]])
4.1.5 Boolean Indexing
names = np.array(['Bob','Joe','Will','Bob','Will','Joe','Joe'])
data = np.random.randn(7,4)
names
array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'], dtype='<U4')
data
array([[-0.16858164, -0.33108982, 0.68263748, -0.0983769 ],
[-0.14467573, -1.73207863, -0.20321916, 0.75697117],
[ 1.38042424, -1.31551497, 2.10397966, 1.98598204],
[-0.20164359, 0.81705695, -0.51739626, -1.16344194],
[ 0.07882572, -0.68212957, 0.59073925, 1.49971538],
[ 0.13222977, -1.45147521, 0.54796917, 1.19053359],
[-1.02140787, 0.9426649 , -0.75485246, 0.20162042]])
names == 'Bob'
array([ True, False, False, True, False, False, False])
data[names == 'Bob']
array([[-0.16858164, -0.33108982, 0.68263748, -0.0983769 ],
[-0.20164359, 0.81705695, -0.51739626, -1.16344194]])
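Note: the boolean array must have the same length as the axis it indexes. Older NumPy versions did not raise an error when the lengths differed, but current versions raise an IndexError, so check the mask length carefully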
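As a quick check of how a wrong-length mask behaves, here is a minimal sketch. It assumes a recent NumPy release, where a mismatched boolean index raises IndexError; the names short_mask and raised are my own.

```python
import numpy as np

data = np.arange(28).reshape(7, 4)
short_mask = np.array([True, False, True])  # length 3, but data has 7 rows

try:
    data[short_mask]
    raised = False
except IndexError as exc:
    # Recent NumPy versions reject a mask whose length does not match the axis.
    raised = True
    print("IndexError:", exc)
```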
data[names == 'Bob',2:]
array([[ 0.68263748, -0.0983769 ],
[-0.51739626, -1.16344194]])
data[names == 'Bob',3]
array([-0.0983769 , -1.16344194])
You can negate a condition with != or ~
names != 'Bob'
array([False, True, True, False, True, True, True])
data[~(names == 'Bob')]
array([[-0.14467573, -1.73207863, -0.20321916, 0.75697117],
[ 1.38042424, -1.31551497, 2.10397966, 1.98598204],
[ 0.07882572, -0.68212957, 0.59073925, 1.49971538],
[ 0.13222977, -1.45147521, 0.54796917, 1.19053359],
[-1.02140787, 0.9426649 , -0.75485246, 0.20162042]])
cond = names == 'Bob'
data[~cond]
array([[-0.14467573, -1.73207863, -0.20321916, 0.75697117],
[ 1.38042424, -1.31551497, 2.10397966, 1.98598204],
[ 0.07882572, -0.68212957, 0.59073925, 1.49971538],
[ 0.13222977, -1.45147521, 0.54796917, 1.19053359],
[-1.02140787, 0.9426649 , -0.75485246, 0.20162042]])
mask = (names == 'Bob') | (names == 'Will')
mask
array([ True, False, True, True, True, False, False])
data[mask]
array([[-0.16858164, -0.33108982, 0.68263748, -0.0983769 ],
[ 1.38042424, -1.31551497, 2.10397966, 1.98598204],
[-0.20164359, 0.81705695, -0.51739626, -1.16344194],
[ 0.07882572, -0.68212957, 0.59073925, 1.49971538]])
Note: the Python keywords and and or do not work with boolean arrays; use & and | instead
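To see what goes wrong with the keywords, here is a small sketch that reuses the names array from above. The variable failed is my own, added only to record the outcome.

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])

# Element-wise OR; the parentheses are required because | binds tighter than ==.
mask = (names == 'Bob') | (names == 'Will')
print(mask)

try:
    # "or" asks for the truth value of a whole array, which is ambiguous.
    bad = (names == 'Bob') or (names == 'Will')
    failed = False
except ValueError as exc:
    failed = True
    print("ValueError:", exc)
```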
data[data < 0]=0
data
array([[0. , 0. , 0.68263748, 0. ],
[0. , 0. , 0. , 0.75697117],
[1.38042424, 0. , 2.10397966, 1.98598204],
[0. , 0.81705695, 0. , 0. ],
[0.07882572, 0. , 0.59073925, 1.49971538],
[0.13222977, 0. , 0.54796917, 1.19053359],
[0. , 0.9426649 , 0. , 0.20162042]])
names != 'Joe'
array([ True, False, True, True, True, False, False])
data[names != 'Joe']=7
data
array([[7. , 7. , 7. , 7. ],
[0. , 0. , 0. , 0.75697117],
[7. , 7. , 7. , 7. ],
[7. , 7. , 7. , 7. ],
[7. , 7. , 7. , 7. ],
[0.13222977, 0. , 0.54796917, 1.19053359],
[0. , 0.9426649 , 0. , 0.20162042]])
4.1.6 Fancy Indexing
arr = np.empty((8,4))
for i in range(8):
arr[i]=i
arr
array([[0., 0., 0., 0.],
[1., 1., 1., 1.],
[2., 2., 2., 2.],
[3., 3., 3., 3.],
[4., 4., 4., 4.],
[5., 5., 5., 5.],
[6., 6., 6., 6.],
[7., 7., 7., 7.]])
Select a subset of rows in a particular order
arr[[4,3,0,6]]
array([[4., 4., 4., 4.],
[3., 3., 3., 3.],
[0., 0., 0., 0.],
[6., 6., 6., 6.]])
Negative indices select rows from the end
arr[[-3,-5,-7]]
array([[5., 5., 5., 5.],
[3., 3., 3., 3.],
[1., 1., 1., 1.]])
arr = np.arange(32).reshape((8,4))
arr
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15],
[16, 17, 18, 19],
[20, 21, 22, 23],
[24, 25, 26, 27],
[28, 29, 30, 31]])
arr[[1,5,7,2],[0,3,1,2]]
array([ 4, 23, 29, 10])
arr[[1,5,7,2]][:,[0,3,1,2]]
array([[ 4, 7, 5, 6],
[20, 23, 21, 22],
[28, 31, 29, 30],
[ 8, 11, 9, 10]])
Fancy indexing, unlike slicing, always copies the data into a new array
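The two multi-array selections above behave differently: paired index arrays pick individual elements, while the chained form picks a rectangular block. The sketch below restates both and adds np.ix_, which is not used in these notes but is a NumPy helper I believe gives the same block in one step. It also checks the copy behavior.

```python
import numpy as np

arr = np.arange(32).reshape((8, 4))
rows, cols = [1, 5, 7, 2], [0, 3, 1, 2]

pairs = arr[rows, cols]           # elements (1,0), (5,3), (7,1), (2,2)
block = arr[np.ix_(rows, cols)]   # every selected row crossed with every selected column

selected = arr[[4, 3, 0, 6]]
selected[:] = -1                  # modifies the copy only, arr is unchanged

print(pairs)
print(block)
print(np.shares_memory(arr, selected))
```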
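4.1.7 Transposing Arrays and Swapping Axes
Transposing is a special form of reshaping that returns a view on the underlying data without copying anything. Arrays have the transpose method and also the special T attribute.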
arr = np.arange(15).reshape((3,5))
arr
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14]])
arr.T
array([[ 0, 5, 10],
[ 1, 6, 11],
[ 2, 7, 12],
[ 3, 8, 13],
[ 4, 9, 14]])
Computing the inner matrix product uses np.dot
arr = np.random.randn(6,3)
arr
array([[-0.23144783, -1.53102926, -0.2230637 ],
[ 1.65451328, -0.74725816, -0.64295544],
[ 1.78178001, 0.19446786, -1.34621907],
[ 0.12343761, 1.37570397, -0.92405543],
[ 1.12624911, -1.76795706, -1.18655746],
[ 0.92947622, 2.64016736, -1.06539457]])
np.dot(arr.T,arr)
array([[ 8.11332223, 0.09713011, -5.85149832],
[ 0.09713011, 14.92898035, -1.42608964],
[-5.85149832, -1.42608964, 5.67231753]])
For higher-dimensional arrays, transpose accepts a tuple of axis numbers to permute the axes
arr = np.arange(16).reshape((2,2,4))
arr
array([[[ 0, 1, 2, 3],
[ 4, 5, 6, 7]],
[[ 8, 9, 10, 11],
[12, 13, 14, 15]]])
arr.transpose(1,0,2)
array([[[ 0, 1, 2, 3],
[ 8, 9, 10, 11]],
[[ 4, 5, 6, 7],
[12, 13, 14, 15]]])
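Here the axes have been reordered so that the original second axis comes first and the original first axis comes second, while the last axis is unchanged
ndarray also has a swapaxes method, which takes a pair of axis numbers and switches the indicated axes to rearrange the data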
arr
array([[[ 0, 1, 2, 3],
[ 4, 5, 6, 7]],
[[ 8, 9, 10, 11],
[12, 13, 14, 15]]])
arr.swapaxes(1,2)
array([[[ 0, 4],
[ 1, 5],
[ 2, 6],
[ 3, 7]],
[[ 8, 12],
[ 9, 13],
[10, 14],
[11, 15]]])
swapaxes returns a view on the data without making a copy
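One way to read these axis permutations is as an index mapping. The sketch below checks that mapping on the same 2 x 2 x 4 array and confirms that both results are views; the names t and s are my own.

```python
import numpy as np

arr = np.arange(16).reshape((2, 2, 4))

t = arr.transpose(1, 0, 2)   # t[i, j, k] == arr[j, i, k]
s = arr.swapaxes(1, 2)       # s[i, j, k] == arr[i, k, j]

print(t.shape, s.shape)
print(t[1, 0, 3], s[0, 3, 1], arr[0, 1, 3])
print(np.shares_memory(arr, t), np.shares_memory(arr, s))
```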
4.2 Universal Functions: Fast Element-Wise Array Functions
arr = np.arange(10)
arr
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
# square root
np.sqrt(arr)
array([0. , 1. , 1.41421356, 1.73205081, 2. ,
2.23606798, 2.44948974, 2.64575131, 2.82842712, 3. ])
# square
np.square(arr)
array([ 0, 1, 4, 9, 16, 25, 36, 49, 64, 81], dtype=int32)
# natural exponential
np.exp(arr)
array([1.00000000e+00, 2.71828183e+00, 7.38905610e+00, 2.00855369e+01,
5.45981500e+01, 1.48413159e+02, 4.03428793e+02, 1.09663316e+03,
2.98095799e+03, 8.10308393e+03])
Binary universal functions
x = np.random.randn(8)
y = np.random.randn(8)
x
array([ 0.43774471, 0.30353109, -0.4385476 , -0.07085461, -0.41682892,
1.74171657, 0.22694261, 0.48012626])
y
array([ 0.38091604, 0.7351168 , 0.04363922, 0.39276555, -0.11270609,
-0.68831551, -0.64187507, 0.2514712 ])
# element-wise maximum of x and y
np.maximum(x,y)
array([ 0.43774471, 0.7351168 , 0.04363922, 0.39276555, -0.11270609,
1.74171657, 0.22694261, 0.48012626])
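Some universal functions return multiple arrays. modf, a vectorized version of Python's built-in divmod, is one example: it returns the fractional and integral parts of a floating-point array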
arr = np.random.randn(7)*5
arr
array([ 0.69713224, -0.39436563, -1.4239261 , 10.89444784,
8.31602522, -0.52237816, -10.31292285])
remainder, whole_part = np.modf(arr)
remainder
array([ 0.69713224, -0.39436563, -0.4239261 , 0.89444784, 0.31602522,
-0.52237816, -0.31292285])
whole_part
array([ 0., -0., -1., 10., 8., -0., -10.])
arr
array([ 0.69713224, -0.39436563, -1.4239261 , 10.89444784,
8.31602522, -0.52237816, -10.31292285])
np.sqrt(arr)
<ipython-input-85-b58949107b3d>:1: RuntimeWarning: invalid value encountered in sqrt
np.sqrt(arr)
array([0.83494446, nan, nan, 3.30067385, 2.88375193,
nan, nan])
np.sqrt(arr,arr)
<ipython-input-86-e3ca18b15869>:1: RuntimeWarning: invalid value encountered in sqrt
np.sqrt(arr,arr)
array([0.83494446, nan, nan, 3.30067385, 2.88375193,
nan, nan])
arr
array([0.83494446, nan, nan, 3.30067385, 2.88375193,
nan, nan])
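In the last calls above, the second positional argument of np.sqrt is the ufunc's optional out parameter, so np.sqrt(arr,arr) writes the result back into arr. Here is a minimal sketch of the same thing with the keyword spelled out, using non-negative sample values of my own so that no warning is raised.

```python
import numpy as np

arr = np.array([1.0, 4.0, 9.0, 16.0])

# Write the square roots into arr itself instead of allocating a new array.
result = np.sqrt(arr, out=arr)

print(result is arr)
print(arr)
```

Operating in place avoids an extra allocation, at the cost of overwriting the original values.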
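4.3 Array-Oriented Programming with Arrays
Suppose we want to evaluate the function sqrt(x^2 + y^2) across a regular grid of values. The np.meshgrid function takes two one-dimensional arrays and produces two two-dimensional matrices corresponding to all (x, y) pairs in the two arrays.
# 1000 equally spaced points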
points = np.arange(-5,5,0.01)
# build the two-dimensional coordinate matrices
xs, ys = np.meshgrid(points,points)
ys
array([[-5. , -5. , -5. , ..., -5. , -5. , -5. ],
[-4.99, -4.99, -4.99, ..., -4.99, -4.99, -4.99],
[-4.98, -4.98, -4.98, ..., -4.98, -4.98, -4.98],
...,
[ 4.97, 4.97, 4.97, ..., 4.97, 4.97, 4.97],
[ 4.98, 4.98, 4.98, ..., 4.98, 4.98, 4.98],
[ 4.99, 4.99, 4.99, ..., 4.99, 4.99, 4.99]])
xs
array([[-5. , -4.99, -4.98, ..., 4.97, 4.98, 4.99],
[-5. , -4.99, -4.98, ..., 4.97, 4.98, 4.99],
[-5. , -4.99, -4.98, ..., 4.97, 4.98, 4.99],
...,
[-5. , -4.99, -4.98, ..., 4.97, 4.98, 4.99],
[-5. , -4.99, -4.98, ..., 4.97, 4.98, 4.99],
[-5. , -4.99, -4.98, ..., 4.97, 4.98, 4.99]])
# evaluate z from the formula
z = np.sqrt(xs ** 2 + ys ** 2)
z
array([[7.07106781, 7.06400028, 7.05693985, ..., 7.04988652, 7.05693985,
7.06400028],
[7.06400028, 7.05692568, 7.04985815, ..., 7.04279774, 7.04985815,
7.05692568],
[7.05693985, 7.04985815, 7.04278354, ..., 7.03571603, 7.04278354,
7.04985815],
...,
[7.04988652, 7.04279774, 7.03571603, ..., 7.0286414 , 7.03571603,
7.04279774],
[7.05693985, 7.04985815, 7.04278354, ..., 7.03571603, 7.04278354,
7.04985815],
[7.06400028, 7.05692568, 7.04985815, ..., 7.04279774, 7.04985815,
7.05692568]])
Visualizing the two-dimensional array with matplotlib
import matplotlib.pyplot as plt
plt.imshow(z,cmap=plt.cm.gray)
plt.colorbar()
# set the title
plt.title('sqrt(x^2+y^2)')
Text(0.5, 1.0, 'sqrt(x^2+y^2)')
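With 1000 points the meshgrid output is hard to read, so here is the same computation on a tiny grid. The three sample points are my own choice for illustration.

```python
import numpy as np

points = np.array([-1.0, 0.0, 1.0])
xs, ys = np.meshgrid(points, points)

# xs repeats the points along each row; ys repeats them down each column.
print(xs)
print(ys)

# Distance of every grid point from the origin.
z = np.sqrt(xs ** 2 + ys ** 2)
print(z)
```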
4.3.1 Expressing Conditional Logic as Array Operations
The np.where function is a vectorized version of the ternary expression x if condition else y
xarr = np.array([1.1,1.2,1.3,1.4,1.5])
yarr = np.array([2.1,2.2,2.3,2.4,2.5])
cond = np.array([True,False,True,True,False])
result = [(x if c else y)for x,y,c in zip(xarr,yarr,cond)]
result
[1.1, 2.2, 1.3, 1.4, 2.5]
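This is slow for large arrays, because all the work is done in interpreted Python code, and it does not work with multidimensional arrays. With np.where it can be written very concisely
result = np.where(cond,xarr,yarr)  # the second and third arguments need not be arrays; either can be a scalar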
result
array([1.1, 2.2, 1.3, 1.4, 2.5])
arr = np.random.randn(4,4)
arr
array([[ 1.45673658, 0.97095783, -0.90075114, -0.86810283],
[ 0.7691019 , -1.44098307, 1.23655136, -0.0863179 ],
[-0.26002458, -0.44007831, -0.64002542, 0.58748434],
[ 1.23704204, -1.42979856, 1.10834965, 0.50134018]])
arr>0
array([[ True, True, False, False],
[ True, False, True, False],
[False, False, False, True],
[ True, False, True, True]])
# replace all positive values with 2 and all negative values with -2
np.where(arr>0,2,-2)
array([[ 2, 2, -2, -2],
[ 2, -2, 2, -2],
[-2, -2, -2, 2],
[ 2, -2, 2, 2]])
# replace only the positive values with 2
np.where(arr>0,2,arr)
array([[ 2. , 2. , -0.90075114, -0.86810283],
[ 2. , -1.44098307, 2. , -0.0863179 ],
[-0.26002458, -0.44007831, -0.64002542, 2. ],
[ 2. , -1.42979856, 2. , 2. ]])
4.3.2 Mathematical and Statistical Methods
# generate data
arr = np.random.randn(5,4)
arr
array([[-0.24008142, -0.08617688, 0.42879457, -1.05699554],
[-0.86102647, -0.01481326, -0.49326453, -0.51728933],
[-1.04369519, -0.07668856, 0.12641113, -0.34170659],
[-0.34358427, -1.19146826, 0.79855649, -0.56526347],
[ 0.34119469, 0.60338427, 0.23612535, 1.70667616]])
# mean
arr.mean()
-0.1295455547355409
np.mean(arr)
-0.1295455547355409
# sum
arr.sum()
-2.5909110947108185
# mean of each row (computed across the columns)
arr.mean(axis=1)
array([-0.23861482, -0.47159839, -0.3339198 , -0.32543988, 0.72184512])
# sum of each column (computed down the rows)
arr.sum(axis=0)
array([-2.14719266, -0.76576269, 1.09662301, -0.77457876])
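The axis argument is easy to get backwards, so here is a small sketch on an array where the answers can be checked by hand. The names row_means and col_sums are my own.

```python
import numpy as np

arr = np.array([[0, 1, 2],
                [3, 4, 5]])

row_means = arr.mean(axis=1)   # collapses the columns: one value per row
col_sums = arr.sum(axis=0)     # collapses the rows: one value per column

print(row_means)
print(col_sums)
```

The axis named in the call is the one that disappears from the result's shape.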
arr = np.array([0,1,2,3,4,5,6,7])
# cumulative sum of the elements, starting from 0
arr.cumsum()
array([ 0, 1, 3, 6, 10, 15, 21, 28], dtype=int32)
arr = np.array([[0,1,2],[3,4,5],[6,7,8]])
arr
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
arr.cumsum(axis=0)
array([[ 0, 1, 2],
[ 3, 5, 7],
[ 9, 12, 15]], dtype=int32)
# cumulative product along each row (the running product starts from 1)
arr.cumprod(axis=1)
array([[ 0, 0, 0],
[ 3, 12, 60],
[ 6, 42, 336]], dtype=int32)
4.3.3 Methods for Boolean Arrays
arr = np.random.randn(100)
# number of positive values
(arr>0).sum()
51
bools = np.array([False,False,True,False])
bools.any()  # is at least one value True?
True
bools.all()  # is every value True?
False
4.3.4 Sorting
arr = np.random.randn(6)
arr
array([-0.28600425, 0.20138334, 0.61513703, -1.54104191, 0.71169457,
1.28541225])
arr.sort()  # sort in place
arr
array([-1.54104191, -0.28600425, 0.20138334, 0.61513703, 0.71169457,
1.28541225])
arr = np.random.randn(5,3)
arr
array([[ 0.44551524, 0.22691436, -1.49874737],
[ 0.36256785, 1.19204608, 0.31673416],
[ 0.07827487, 0.64557507, -1.31371171],
[-1.01458161, -0.82770194, -0.06353473],
[-0.40078359, 2.48821946, -0.50991488]])
arr.sort(1)
arr
array([[-1.49874737, 0.22691436, 0.44551524],
[ 0.31673416, 0.36256785, 1.19204608],
[-1.31371171, 0.07827487, 0.64557507],
[-1.01458161, -0.82770194, -0.06353473],
[-0.50991488, -0.40078359, 2.48821946]])
# a quick way to compute a quantile: sort the array and select the value at the corresponding rank (here the 5% quantile)
large_arr = np.random.randn(1000)
large_arr.sort()
large_arr[int(0.05*len(large_arr))]
-1.7200330679547906
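The sort-and-index approach above can be compared with np.quantile, which is not used in these notes and is shown here only as a cross-check. This is a sketch: the seed and the names by_index and by_quantile are my own, and np.quantile interpolates linearly by default, so the two values are close but not necessarily identical.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, only for repeatability
large_arr = np.sort(rng.standard_normal(1000))

by_index = large_arr[int(0.05 * len(large_arr))]   # the approach used above
by_quantile = np.quantile(large_arr, 0.05)         # linear interpolation by default

print(by_index, by_quantile)
```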
4.3.5 Unique and Other Set Logic
np.unique returns the sorted unique values in an array
names = np.array(['Bob','Joe','Will','Bob','Will','Joe','Joe'])
np.unique(names)
array(['Bob', 'Joe', 'Will'], dtype='<U4')
ints = np.array([3,3,3,2,2,1,1,4,4])
np.unique(ints)
array([1, 2, 3, 4])
np.unique compared with the pure-Python alternative
sorted(set(names))
['Bob', 'Joe', 'Will']
np.in1d tests whether the values in one array are present in another and returns a boolean array
values = np.array([6,0,0,3,2,5,6])
np.in1d(values,[2,3,6])
array([ True, False, False, True, True, False, True])
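The NumPy documentation recommends np.isin over np.in1d for new code, so here is the same membership test written with np.isin. This is a sketch of the alternative, not something taken from the book; the name mask is my own.

```python
import numpy as np

values = np.array([6, 0, 0, 3, 2, 5, 6])

# True wherever the element of values appears in the second argument.
mask = np.isin(values, [2, 3, 6])

print(mask)
print(values[mask])
```

Unlike in1d, isin keeps the shape of its first argument, which makes it usable directly as a mask on multidimensional arrays.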
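4.4 File Input and Output with Arrays
np.save and np.load are the two workhorse functions for efficiently saving and loading array data on disk. Arrays are saved by default in an uncompressed raw binary format with the file extension .npy.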
arr = np.arange(10)
arr
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
np.save('some_array',arr)
np.load('some_array.npy')
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
np.savez saves multiple arrays in an uncompressed archive; pass the arrays as keyword arguments.
np.savez('array_archive.npz',a=arr,b=arr)
arch = np.load('array_archive.npz')
arch['a']
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
arch['b']
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
If your data compresses well, you can use np.savez_compressed instead.
np.savez_compressed('arrays_compressed.npz',a=arr,b=arr)
arch1 = np.load('arrays_compressed.npz')
arch1['a']
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
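To get a feel for when compression helps, the following sketch writes the same array with np.savez and np.savez_compressed into a temporary directory and compares the file sizes. The file names and the all-zeros test array are my own choices; real data will compress less dramatically.

```python
import os
import tempfile
import numpy as np

arr = np.zeros(100_000)   # highly repetitive data compresses very well

with tempfile.TemporaryDirectory() as tmp:
    plain = os.path.join(tmp, 'plain.npz')
    packed = os.path.join(tmp, 'packed.npz')
    np.savez(plain, a=arr)
    np.savez_compressed(packed, a=arr)

    plain_size = os.path.getsize(plain)
    packed_size = os.path.getsize(packed)

    # Close the archive before the temporary directory is removed.
    with np.load(packed) as arch:
        restored = arch['a']

print(plain_size, packed_size)
```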
4.5 Linear Algebra
x = np.array([[1.,2.,3.],[4.,5.,6.]])
y = np.array([[6.,23.],[-1,7],[8,9]])
x
array([[1., 2., 3.],
[4., 5., 6.]])
y
array([[ 6., 23.],
[-1., 7.],
[ 8., 9.]])
x.dot(y)
array([[ 28., 64.],
[ 67., 181.]])
x.dot(y) is equivalent to np.dot(x,y)
np.dot(x,y)
array([[ 28., 64.],
[ 67., 181.]])
np.dot(x,np.ones(3))
array([ 6., 15.])
The @ symbol also works as an infix operator that performs matrix multiplication
x @ np.ones(3)
array([ 6., 15.])
numpy.linalg has a standard set of matrix decompositions and other common functions such as inverse and determinant
from numpy.linalg import inv, qr
X = np.random.randn(5,5)
mat = X.T.dot(X)
mat
array([[ 6.88097643, -0.40153042, -0.11773682, 4.82061317, -0.00948514],
[-0.40153042, 2.93777143, 2.28436549, -3.33712964, 0.27895677],
[-0.11773682, 2.28436549, 2.34334495, -1.8758072 , 0.8700664 ],
[ 4.82061317, -3.33712964, -1.8758072 , 8.08801733, -1.40096259],
[-0.00948514, 0.27895677, 0.8700664 , -1.40096259, 5.84629622]])
# inverse
inv(mat)
array([[ 1.6344894 , -5.30599418, 3.59242482, -2.48157756,
-0.87347595],
[ -5.30599418, 20.96406817, -14.76492044, 8.96595671,
3.3369904 ],
[ 3.59242482, -14.76492044, 11.00957592, -6.09349382,
-2.38834458],
[ -2.48157756, 8.96595671, -6.09349382, 4.14309849,
1.46783835],
[ -0.87347595, 3.3369904 , -2.38834458, 1.46783835,
0.71759003]])
mat.dot(inv(mat))
array([[ 1.00000000e+00, -3.62276349e-15, 2.97708758e-15,
1.00365641e-15, -1.28563628e-15],
[ 1.16986541e-15, 1.00000000e+00, -1.04489259e-15,
-1.43234162e-15, 1.85031636e-16],
[-9.17892019e-16, 5.90211635e-15, 1.00000000e+00,
1.54913164e-15, 7.28854924e-16],
[ 1.67430797e-15, -6.24584441e-15, 2.02034005e-15,
1.00000000e+00, -1.38074650e-15],
[ 6.79495103e-16, -4.09884887e-15, 3.43563276e-15,
-1.61484598e-15, 1.00000000e+00]])
# compute the QR decomposition
q,r = qr(mat)
r
array([[-8.41197519, 2.41336148, 1.31408829, -8.76534057, 0.8426876 ],
[ 0. , -4.40454207, -3.50602341, 5.0520601 , -1.60816204],
[ 0. , 0. , -0.98999659, -0.25669779, -3.68304671],
[ 0. , 0. , 0. , -1.68870316, 4.4795456 ],
[ 0. , 0. , 0. , 0. , 0.22210084]])
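The QR output above only prints r. As a sanity check, the sketch below verifies the three properties I would expect from the decomposition: q times r reproduces the matrix, q has orthonormal columns, and r is upper triangular. The seeded generator is my own substitution for np.random.randn so the run is repeatable.

```python
import numpy as np
from numpy.linalg import qr

rng = np.random.default_rng(0)   # arbitrary seed, only for repeatability
X = rng.standard_normal((5, 5))
mat = X.T @ X

q, r = qr(mat)

reconstructs = np.allclose(q @ r, mat)          # the factors reproduce mat
orthonormal = np.allclose(q.T @ q, np.eye(5))   # q's columns are orthonormal
upper = np.allclose(r, np.triu(r))              # r is upper triangular

print(reconstructs, orthonormal, upper)
```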
4.6 Pseudorandom Number Generation
Use normal to get a 4 x 4 array of samples from the standard normal distribution
samples = np.random.normal(size=(4,4))
samples
array([[-1.08982894, -0.38664288, 0.08795078, -0.58766288],
[-0.55362143, 0.53318817, -1.24544404, -0.28009587],
[-0.62227897, -0.96513278, 0.94540138, -0.1743617 ],
[-1.02020369, 0.44070475, 0.16880846, 1.32297271]])
For generating large samples, numpy.random is well over an order of magnitude faster than the pure-Python approach
from random import normalvariate
N = 1000000
%timeit samples = [normalvariate(0,1) for _ in range(N)]
966 ms ± 23 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit np.random.normal(size=N)
29.7 ms ± 517 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
np.random.seed(1234)  # set the global random seed
# To avoid global state, create a generator with numpy.random.RandomState; its random state is isolated from other generators
rng = np.random.RandomState(1234)
rng.randn(10)
array([ 0.47143516, -1.19097569, 1.43270697, -0.3126519 , -0.72058873,
0.88716294, 0.85958841, -0.6365235 , 0.01569637, -2.24268495])
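Newer NumPy versions also provide np.random.default_rng, which returns a Generator object with its own isolated state. It is not covered in these notes, so treat this as a sketch of an alternative; the names rng_a and rng_b are my own. Two generators built from the same seed produce the same stream:

```python
import numpy as np

rng_a = np.random.default_rng(1234)
rng_b = np.random.default_rng(1234)

# Each generator carries its own state, independent of np.random.seed.
a = rng_a.standard_normal(10)
b = rng_b.standard_normal(10)

print(np.array_equal(a, b))
```

The values will not match those from RandomState(1234), because the two APIs use different underlying bit generators.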
4.7 Random Walks
# a random walk of 1000 steps
import random
position = 0
walk = [position]
steps = 1000
for i in range(steps):
step = 1 if random.randint(0,1) else -1
position += step
walk.append(position)
plt.plot(walk[:100])
# 1000 coin flips at once, each mapped to a step of 1 or -1
nsteps = 1000
draws = np.random.randint(0,2,size=nsteps)
steps = np.where(draws>0,1,-1)
walk = steps.cumsum()
walk.min()
-9
walk.max()
60
plt.plot(walk[:100])
[<matplotlib.lines.Line2D at 0x2225fe86e20>]
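# np.abs(walk)>=10 marks the steps where the walk is at least 10 steps away from the origin in either direction; argmax() returns the first index of the maximum value in the boolean array (True is the maximum), i.e. the first crossing time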
(np.abs(walk)>=10).argmax()
297
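The argmax trick is easier to follow on a walk short enough to read by eye. The sketch below uses a hand-written walk of my own and also shows a caveat: if the threshold is never reached, argmax still returns 0, so it is worth checking with any() first.

```python
import numpy as np

walk = np.array([1, 2, 1, 2, 3, 4, 3])

first_hit = (np.abs(walk) >= 3).argmax()     # index of the first True
never_hit = (np.abs(walk) >= 10).argmax()    # all False, so argmax falls back to 0

reached = (np.abs(walk) >= 10).any()         # check this before trusting argmax

print(first_hit, never_hit, reached)
```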
4.7.1 Simulating Many Random Walks at Once
# compute the cumulative sum across the rows to get all 5000 random walks in one shot
nwalks = 5000
nsteps = 1000
draws = np.random.randint(0,2,size=(nwalks,nsteps))  # 0 or 1
steps = np.where(draws>0,1,-1)
walks = steps.cumsum(1)
walks
array([[ 1, 2, 3, ..., 46, 47, 46],
[ 1, 0, 1, ..., 40, 41, 42],
[ 1, 2, 3, ..., -26, -27, -28],
...,
[ 1, 0, 1, ..., 64, 65, 66],
[ 1, 2, 1, ..., 2, 1, 0],
[ -1, -2, -3, ..., 32, 33, 34]], dtype=int32)
plt.plot(walk[:100])
[<matplotlib.lines.Line2D at 0x2225fe82fd0>]
walks.max()
122
walks.min()
-128
# compute the minimum crossing time to 30 or -30
# use the any method to check which walks got there
hits30 = (np.abs(walks)>=30).any(1)
hits30
array([ True, True, True, ..., True, False, True])
hits30.sum()  # number of walks that hit 30 or -30
3210
# select the rows of walks that actually cross the absolute 30 level, then call argmax along axis 1 to get the crossing times
crossing_times = (np.abs(walks[hits30])>=30).argmax(1)
crossing_times.mean()
501.89283489096573