Python for Data Analysis 4

十三吖

Article link: https://blog.csdn.net/qq_40006058/article/details/83683876

Python for Data Analysis

Chapter 4 NumPy Basics: Arrays and Vectorized Computation

import numpy as np

4.1 The NumPy ndarray: A Multidimensional Array Object

# generate some random data
data = np.random.randn(2, 3)
data

array([[-0.88356437, -0.72686335,  0.63221852],
       [-0.16591109,  0.55655533, -0.07961666]])

np.random?

rand                 Uniformly distributed values.
randn                Normally distributed values.
ranf                 Uniformly distributed floating point numbers.
randint              Uniformly distributed integers in a given range.
beta                 Beta distribution over ``[0, 1]``.
binomial             Binomial distribution.
chisquare            :math:`\chi^2` distribution.
exponential          Exponential distribution.
f                    F (Fisher-Snedecor) distribution.
gamma                Gamma distribution.
geometric            Geometric distribution.
gumbel               Gumbel distribution.
hypergeometric       Hypergeometric distribution.
laplace              Laplace distribution.
logistic             Logistic distribution.
lognormal            Log-normal distribution.
logseries            Logarithmic series distribution.
negative_binomial    Negative binomial distribution.
noncentral_chisquare Non-central chi-square distribution.
noncentral_f         Non-central F distribution.
normal               Normal / Gaussian distribution.
pareto               Pareto distribution.
poisson              Poisson distribution.
power                Power distribution.
rayleigh             Rayleigh distribution.
triangular           Triangular distribution.
uniform              Uniform distribution.
vonmises             Von Mises circular distribution.
wald                 Wald (inverse Gaussian) distribution.
weibull              Weibull distribution.
zipf                 Zipf's distribution over ranked data.

Multivariate distributions

dirichlet            Multivariate generalization of Beta distribution.
multinomial          Multivariate generalization of the binomial distribution.
multivariate_normal  Multivariate generalization of the normal distribution.

Standard distributions

standard_cauchy      Standard Cauchy-Lorentz distribution.
standard_exponential Standard exponential distribution.
standard_gamma       Standard Gamma distribution.
standard_normal      Standard normal distribution.
standard_t           Standard Student's t-distribution.

data * 10

array([[-8.83564373, -7.26863349,  6.32218519],
       [-1.65911087,  5.5655533 , -0.79616665]])

data + data

array([[-1.76712875, -1.4537267 ,  1.26443704],
       [-0.33182217,  1.11311066, -0.15923333]])

data.shape

(2, 3)

data.dtype

dtype('float64')

type(data)

numpy.ndarray

Creating ndarray objects

# Convert a list to an array
data1 = [6,7,5,8.,0]

arr1 = np.array(data1)

arr1

array([ 6.,  7.,  5.,  8.,  0.])

data2 = [[1,2,3],[4,5,6]]
arr2 = np.array(data2)
arr2

array([[1, 2, 3],
       [4, 5, 6]])

print(arr2.ndim)
print(arr2.shape)

2
(2, 3)

np.zeros((3,4))           # all zeros; the shape is passed as a tuple, hence the double parentheses

array([[ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.]])

np.empty((3,4))            # uninitialized memory: the values are arbitrary and not guaranteed to be 0

array([[ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.]])
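The zeros printed for np.empty above are incidental, since the function only allocates memory. A minimal sketch of the common constructors side by side; the shape (3, 4) and the fill value 7.5 are arbitrary choices for illustration:

```python
import numpy as np

zeros = np.zeros((3, 4))          # every element is 0.0
ones = np.ones((3, 4))            # every element is 1.0
filled = np.full((3, 4), 7.5)     # every element is the given fill value
scratch = np.empty((3, 4))        # allocated but NOT initialized; contents are arbitrary

# Only the shape and dtype of np.empty's result are predictable.
print(scratch.shape, scratch.dtype)
```

np.empty is mainly useful when every element will be overwritten right away, as in the loop-filled array in the fancy indexing section.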

np.arange(10)                   # array-valued counterpart of Python's built-in range(10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Data types

A dtype (data type) is a special object holding the information the ndarray needs to interpret a chunk of memory as a particular type of data.

arr1 = np.array([1,2,3], dtype = np.int32)
arr2 = np.array([1,2,3], dtype = np.float32)
print(arr1.dtype)
print(arr2.dtype)

int32
float32

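# In the example above, the integers in arr2 were stored as floating point. Going the other way, casting floating-point numbers to an integer dtype truncates the fractional part: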

arr = np.array([3.7, -1.2, -2.6, 0.5, 12.9, 10.1])
arr.astype(np.int32)

array([ 3, -1, -2,  0, 12, 10])

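# If an array of strings contains only numbers, astype can convert it to numeric form

s = np.array(['1','2','5'], dtype = np.bytes_)   # np.string_ is the older alias for np.bytes_ and was removed in NumPy 2.0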
s.astype(float)

array([ 1.,  2.,  5.])
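Truncation is not the same as rounding, which matters for negative values and for values close to the next integer. A small sketch comparing the two on the array used above; np.rint is my choice here for the rounding step and is not part of the original notes:

```python
import numpy as np

arr = np.array([3.7, -1.2, -2.6, 0.5, 12.9, 10.1])

truncated = arr.astype(np.int32)          # drops the fractional part (rounds toward zero)
rounded = np.rint(arr).astype(np.int32)   # rounds to the nearest integer first (halves go to even)

print(truncated)
print(rounded)
```

Note that astype always returns a new array, so arr itself keeps its float64 dtype.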

Arithmetic with NumPy arrays

arr = np.array([[1,2,3],[4,5,6]])

arr * arr                   # element-wise multiplication

array([[ 1,  4,  9],
       [16, 25, 36]])

arr - arr

array([[0, 0, 0],
       [0, 0, 0]])

1. / arr

array([[ 1.        ,  0.5       ,  0.33333333],
       [ 0.25      ,  0.2       ,  0.16666667]])

arr ** 2

array([[ 1,  4,  9],
       [16, 25, 36]], dtype=int32)

arr2 = np.array([[2,3,1],[4,6,5]])
arr2 > arr1                 # arr1 is the 1-D array [1, 2, 3] defined earlier; it is compared against each row of arr2

array([[ True,  True, False],
       [ True,  True,  True]], dtype=bool)

# Indexing

arr = np.arange(10)

arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

arr[1]

arr[2:5]

array([2, 3, 4])

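arr[1:4] = 666           # Assigning a scalar to a slice (as in arr[1:4] = 666) propagates the value to the whole selection
                         # (this is the "broadcasting" discussed later).
                         # The most important difference from lists is that an array slice is a view on the original array:
                         # the data is not copied, and any modification to the view is reflected in the source array.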

arr

array([  0, 666, 666, 666,   4,   5,   6,   7,   8,   9])

a = arr[5:8]
a

array([5, 6, 7])

a[1] = 888             # when I modify a value in a, the change also shows up in the original array arr
arr

array([  0, 666, 666, 666,   4,   5, 888,   7,   8,   9])

arr[:]

array([  0, 666, 666, 666,   4,   5, 888,   7,   8,   9])
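When an independent copy is wanted instead of a view, the slice has to be copied explicitly. A minimal sketch, where the names view and copy are just illustrative:

```python
import numpy as np

arr = np.arange(10)

view = arr[5:8]            # a view: shares memory with arr
copy = arr[5:8].copy()     # an independent copy

copy[:] = -1               # does not touch arr
view[0] = 555              # writes through to arr[5]

print(arr)
print(np.shares_memory(arr, view), np.shares_memory(arr, copy))
```

np.shares_memory is a convenient way to check which of the two you are holding.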

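# With higher-dimensional arrays you have more options. In a two-dimensional array, the elements at each index are no longer scalars but one-dimensional arrays: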

arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(arr2d[2])

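# Individual elements can therefore be accessed recursively, but that is a bit too much work. Instead, you can pass a comma-separated list of indices to select a single element. The following two forms are equivalent: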

print(arr2d[0][2])
print(arr2d[0, 2])

[7 8 9]
3
3

arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
arr3d = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
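The three-dimensional array arr3d is defined above but not indexed in these notes. As a brief sketch of how the same rules extend to it, each additional index removes one dimension from the result:

```python
import numpy as np

arr3d = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])

first_block = arr3d[0]        # one index: a 2 x 3 array
row = arr3d[1, 0]             # two indices: the 1-D array [7, 8, 9]
element = arr3d[1, 0, 2]      # three indices: a single scalar

print(arr3d.shape, first_block.shape, row.shape, element)
```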

Indexing with slices

arr2d[:2]

array([[1, 2, 3],
       [4, 5, 6]])

arr2d[:2, 1:2]

array([[2],
       [5]])

arr2d[2, :2]                    # third row, first two columns

array([7, 8])

arr2d[:2] = 0
arr2d

array([[0, 0, 0],
       [0, 0, 0],
       [7, 8, 9]])

Boolean indexing

data = np.random.randn(7, 4)
names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])

data

array([[-0.07506259,  0.57214315, -0.18041378, -0.12117736],
       [-2.14281031,  0.40535298, -0.9620643 ,  0.22670712],
       [ 0.76416949, -1.65382794,  0.9403536 ,  0.68560868],
       [-0.2522858 , -0.39025653, -1.79771471, -0.61124803],
       [-3.18644813, -0.08027387, -0.19944585, -0.60847046],
       [-1.0220794 ,  0.25628282, -1.06842413, -1.21997291],
       [ 0.98564206, -0.58358104, -1.92757536,  1.38717539]])

names

array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'],
      dtype='<U4')

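# Select all rows corresponding to the name "Bob". Like arithmetic operations, comparisons with arrays (such as ==) are vectorized.
# Comparing names with the string "Bob" therefore yields a boolean array: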

names == 'Bob'

array([ True, False, False,  True, False, False, False], dtype=bool)

data[names == 'Bob']

array([[-0.07506259,  0.57214315, -0.18041378, -0.12117736],
       [-0.2522858 , -0.39025653, -1.79771471, -0.61124803]])

data[names == 'Bob', 2:]

array([[-0.18041378, -0.12117736],
       [-1.79771471, -0.61124803]])

# To select everything except "Bob", you can either use != or negate the condition with ~

data[~(names == 'Bob')]

array([[-2.14281031,  0.40535298, -0.9620643 ,  0.22670712],
       [ 0.76416949, -1.65382794,  0.9403536 ,  0.68560868],
       [-3.18644813, -0.08027387, -0.19944585, -0.60847046],
       [-1.0220794 ,  0.25628282, -1.06842413, -1.21997291],
       [ 0.98564206, -0.58358104, -1.92757536,  1.38717539]])
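Several conditions can be combined with the element-wise operators & (and) and | (or); the Python keywords and / or do not work on boolean arrays. A small sketch, using an np.arange stand-in for data (an assumption, chosen so the selected rows are easy to recognize):

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
data = np.arange(28).reshape((7, 4))   # stand-in data: row i starts with the value 4 * i

mask = (names == 'Bob') | (names == 'Will')   # parentheses are required around each comparison
selected = data[mask]

not_bob = data[names != 'Bob']                 # same rows as data[~(names == 'Bob')]

print(mask)
print(selected)
```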

data[data < 0]

array([-0.07506259, -0.18041378, -0.12117736, -2.14281031, -0.9620643 ,
       -1.65382794, -0.2522858 , -0.39025653, -1.79771471, -0.61124803,
       -3.18644813, -0.08027387, -0.19944585, -0.60847046, -1.0220794 ,
       -1.06842413, -1.21997291, -0.58358104, -1.92757536])

# Set all values in data that are less than 0 to 0

data[data < 0] = 0

data

array([[ 0.        ,  0.57214315,  0.        ,  0.        ],
       [ 0.        ,  0.40535298,  0.        ,  0.22670712],
       [ 0.76416949,  0.        ,  0.9403536 ,  0.68560868],
       [ 0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.25628282,  0.        ,  0.        ],
       [ 0.98564206,  0.        ,  0.        ,  1.38717539]])

Fancy indexing

Indexing with integer arrays

arr = np.empty((8,4))

for i in range(8):
    arr[i] = i
arr

array([[ 0.,  0.,  0.,  0.],
       [ 1.,  1.,  1.,  1.],
       [ 2.,  2.,  2.,  2.],
       [ 3.,  3.,  3.,  3.],
       [ 4.,  4.,  4.,  4.],
       [ 5.,  5.,  5.,  5.],
       [ 6.,  6.,  6.,  6.],
       [ 7.,  7.,  7.,  7.]])

# Select a subset of rows in a particular order

arr[[4,6,1,5]]

array([[ 4.,  4.,  4.,  4.],
       [ 6.,  6.,  6.,  6.],
       [ 1.,  1.,  1.,  1.],
       [ 5.,  5.,  5.,  5.]])

arr = np.arange(32).reshape((8,4))
arr

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23],
       [24, 25, 26, 27],
       [28, 29, 30, 31]])

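arr[[1, 5, 7, 2], [0, 3, 1, 2]]     # Selects the elements (1,0), (5,3), (7,1) and (2,2). When one integer array is passed per axis like this, the result is one-dimensional regardless of how many dimensions the array has.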

array([ 4, 23, 29, 10])
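If the goal is instead the rectangular block formed by a subset of rows and columns, two common approaches are to index in two steps or to use np.ix_. The sketch below also shows that, unlike slicing, fancy indexing copies the data:

```python
import numpy as np

arr = np.arange(32).reshape((8, 4))

# Rows 1, 5, 7, 2 with their columns reordered as 0, 3, 1, 2.
block = arr[[1, 5, 7, 2]][:, [0, 3, 1, 2]]
same_block = arr[np.ix_([1, 5, 7, 2], [0, 3, 1, 2])]

# Fancy indexing returns a copy, so this assignment does not change arr.
block[0, 0] = -1

print(same_block)
print(arr[1, 0])
```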

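Transposing arrays and swapping axes

Transposing is a special form of reshaping that returns a view on the underlying data without copying anything. Arrays have a transpose method and also a special T attribute: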

arr = np.arange(15).reshape((3,5))
arr

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

arr.T

array([[ 0,  5, 10],
       [ 1,  6, 11],
       [ 2,  7, 12],
       [ 3,  8, 13],
       [ 4,  9, 14]])

arr = np.random.randn(6,3)

arr

array([[ 0.57222764,  1.89441072, -0.47364627],
       [-1.1347914 , -0.55292593, -0.95928693],
       [-1.43456476,  0.71880133, -1.09404351],
       [-0.86380144, -1.51847717,  0.46772892],
       [-2.15055074, -0.23467395,  1.29220332],
       [-1.60513595, -0.61993174,  0.38326724]])

np.dot(arr.T, arr)

array([[ 11.62065489,   3.49173843,  -1.41113643],
       [  3.49173843,   7.15635456,  -2.40434756],
       [ -1.41113643,  -2.40434756,   4.37695693]])
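The heading mentions swapping axes, which only becomes interesting above two dimensions. A minimal sketch on a small three-dimensional array (the shape (2, 2, 4) is an arbitrary choice): transpose accepts a tuple of axis numbers, and swapaxes exchanges one pair of axes. Both return views.

```python
import numpy as np

arr = np.arange(16).reshape((2, 2, 4))

permuted = arr.transpose((1, 0, 2))   # reorder the axes: the first two are exchanged
swapped = arr.swapaxes(1, 2)          # exchange only axes 1 and 2

print(permuted.shape, swapped.shape)
print(np.shares_memory(arr, swapped))
```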

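4.2 Universal Functions: Fast Element-Wise Array Functions

A universal function (ufunc) performs element-wise operations on the data in an ndarray. You can think of it as a fast vectorized wrapper
around a simple function that takes one or more scalar values and produces one or more scalar results.

Many ufuncs are simple element-wise transformations, such as sqrt and exp: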

arr = np.arange(10)
np.sqrt(arr)

array([ 0.        ,  1.        ,  1.41421356,  1.73205081,  2.        ,
        2.23606798,  2.44948974,  2.64575131,  2.82842712,  3.        ])

np.exp(arr)

array([  1.00000000e+00,   2.71828183e+00,   7.38905610e+00,
         2.00855369e+01,   5.45981500e+01,   1.48413159e+02,
         4.03428793e+02,   1.09663316e+03,   2.98095799e+03,
         8.10308393e+03])

# These are unary ufuncs. Others, such as add or maximum, take two arrays (hence binary ufuncs) and return a single array as the result:

x = np.random.randn(8)
y = np.random.randn(8)
np.maximum(x, y)

array([ 1.6915874 ,  1.18310349,  1.26619674,  0.1312126 ,  3.86851852,
        0.5851202 , -0.50204715,  1.65377949])

np.modf?
Returns two arrays: the fractional parts and the integral parts of the input, in that order

arr = np.random.randn(3) * 5
remainder, whole_part = np.modf(arr)
print(arr, '\n',remainder,'\n', whole_part)

[-6.90481066 -5.02940138  3.78555874] 
 [-0.90481066 -0.02940138  0.78555874] 
 [-6. -5.  3.]
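Because the outputs above depend on random input, here is the same idea with fixed values that can be checked by hand. The arrays x and y are made up for illustration:

```python
import numpy as np

x = np.array([1.0, -2.0, 3.0])
y = np.array([0.5, 4.0, -1.0])

total = np.add(x, y)            # binary ufunc, same as x + y
larger = np.maximum(x, y)       # element-wise maximum of the two arrays

# modf returns the fractional parts first, then the integral parts;
# both keep the sign of the input.
frac, whole = np.modf(np.array([-6.5, 3.25]))

print(total, larger)
print(frac, whole)
```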

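4.3 Data Processing Using Arrays

NumPy arrays let you express many kinds of data-processing tasks as concise array expressions that would otherwise require writing loops. Replacing explicit loops with array expressions is usually called vectorization.
In general, vectorized array operations are often one or two orders of magnitude (or more) faster than their pure-Python equivalents, especially for numerical computations.

As a simple example, suppose we want to evaluate the function sqrt(x^2 + y^2) across a regular grid of values. The np.meshgrid function takes two one-dimensional arrays and produces two two-dimensional matrices
corresponding to all (x, y) pairs in the two arrays: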

points = np.arange(-5,5,0.1)

xs,ys = np.meshgrid(points, points)
print(xs, '\n', '----------------ooooooooooooo------------------','\n', ys)

[[-5.  -4.9 -4.8 ...,  4.7  4.8  4.9]
 [-5.  -4.9 -4.8 ...,  4.7  4.8  4.9]
 [-5.  -4.9 -4.8 ...,  4.7  4.8  4.9]
 ..., 
 [-5.  -4.9 -4.8 ...,  4.7  4.8  4.9]
 [-5.  -4.9 -4.8 ...,  4.7  4.8  4.9]
 [-5.  -4.9 -4.8 ...,  4.7  4.8  4.9]] 
 ----------------ooooooooooooo------------------ 
 [[-5.  -5.  -5.  ..., -5.  -5.  -5. ]
 [-4.9 -4.9 -4.9 ..., -4.9 -4.9 -4.9]
 [-4.8 -4.8 -4.8 ..., -4.8 -4.8 -4.8]
 ..., 
 [ 4.7  4.7  4.7 ...,  4.7  4.7  4.7]
 [ 4.8  4.8  4.8 ...,  4.8  4.8  4.8]
 [ 4.9  4.9  4.9 ...,  4.9  4.9  4.9]]

z = np.sqrt(xs ** 2 +  ys** 2)
z

array([[ 7.07106781,  7.00071425,  6.93108938, ...,  6.86221539,
         6.93108938,  7.00071425],
       [ 7.00071425,  6.92964646,  6.85930026, ...,  6.78969808,
         6.85930026,  6.92964646],
       [ 6.93108938,  6.85930026,  6.7882251 , ...,  6.71788657,
         6.7882251 ,  6.85930026],
       ..., 
       [ 6.86221539,  6.78969808,  6.71788657, ...,  6.64680374,
         6.71788657,  6.78969808],
       [ 6.93108938,  6.85930026,  6.7882251 , ...,  6.71788657,
         6.7882251 ,  6.85930026],
       [ 7.00071425,  6.92964646,  6.85930026, ...,  6.78969808,
         6.85930026,  6.92964646]])
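The 100 x 100 grid above is too large to read, so a three-point grid may make the layout of the meshgrid outputs clearer. The points 0, 3, 4 are chosen here only so that a 3-4-5 triangle appears in the result:

```python
import numpy as np

points = np.array([0.0, 3.0, 4.0])
xs, ys = np.meshgrid(points, points)

# xs repeats points along each row; ys repeats points down each column,
# so (xs[i, j], ys[i, j]) is the pair (points[j], points[i]).
z = np.sqrt(xs ** 2 + ys ** 2)

print(xs)
print(ys)
print(z)
```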

"""
import matplotlib.pyplot as plt
plt.imshow(z, cmap = plt.cm.hot)
plt.colorbar()
plt.show()
"""

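Expressing conditional logic as array operations

The numpy.where function is a vectorized version of the ternary expression x if condition else y. Suppose we have a boolean array and two arrays of values: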

xarr = np.array([1.1, 1.2, 1.3, 1.4, 1.5])
yarr = np.array([2.1, 2.2, 2.3, 2.4, 2.5])
cond = np.array([True, False, True, True, False])

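# Suppose we want to take a value from xarr whenever the corresponding value in cond is True, and otherwise take the value from yarr. A list comprehension doing this might look like: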

result = [(x if c else y) for x,y,c in zip(xarr,yarr,cond)]

result

[1.1000000000000001, 2.2000000000000002, 1.3, 1.3999999999999999, 2.5]

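# This has several problems. First, it is not very fast for large arrays, because all the work is done in pure Python. Second, it does not work with multidimensional arrays.
# With np.where you can write this very concisely: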

result = np.where(cond, xarr, yarr)
result

array([ 1.1,  2.2,  1.3,  1.4,  2.5])

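# The second and third arguments to np.where do not need to be arrays; either or both can be scalars. In data analysis, where is typically used to produce a new array of values based on another array.
# Suppose you have a matrix of randomly generated data and want to replace all positive values with 2 and all negative values with -2. This is very easy with np.where: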

arr = np.random.randn(4, 4)
arr

array([[ 0.18939455,  1.1022609 ,  1.2985593 , -0.6510627 ],
       [ 0.15873684,  0.76710093,  0.2974    , -0.46029059],
       [ 0.30607043, -0.96320707, -0.2942148 ,  0.14623174],
       [-0.68012247,  1.37051585, -0.80434269, -0.19214095]])

arr > 0

array([[ True,  True,  True, False],
       [ True,  True,  True, False],
       [ True, False, False,  True],
       [False,  True, False, False]], dtype=bool)

np.where(arr>0, 2, -2)

array([[ 2,  2,  2, -2],
       [ 2,  2,  2, -2],
       [ 2, -2, -2,  2],
       [-2,  2, -2, -2]])

# With np.where you can combine scalars and arrays. For example, I can replace all positive values in arr with the constant 2:

np.where(arr>0,2,arr)

array([[ 2.        ,  2.        ,  2.        , -0.6510627 ],
       [ 2.        ,  2.        ,  2.        , -0.46029059],
       [ 2.        , -0.96320707, -0.2942148 ,  2.        ],
       [-0.68012247,  2.        , -0.80434269, -0.19214095]])
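np.where calls can also be nested when there are more than two cases. As a rough sketch on made-up values, the following maps positive entries to 1, negative entries to -1, and leaves exact zeros at 0, which a single where(arr > 0, ...) call cannot distinguish:

```python
import numpy as np

arr = np.array([[1.5, -0.3, 0.0],
                [-2.0, 0.7, 4.1]])

# The inner where handles the entries that are not positive.
signs = np.where(arr > 0, 1, np.where(arr < 0, -1, 0))

print(signs)
```

For this particular three-way split np.sign gives the same values; the nested form is shown because it generalizes to arbitrary replacement values.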

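Mathematical and statistical methods

A set of mathematical functions computes statistics over an entire array or over the data along one axis. Aggregations (often called reductions)
such as sum, mean, and the standard deviation std can be used either by calling the array instance method or by calling the top-level NumPy function.

Here I generate some normally distributed random data and compute some aggregate statistics: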

arr = np.random.randn(5, 4)

arr.mean()

-0.21715403200178157

arr.sum()

-4.3430806400356312

arr.mean(axis = 1)

array([-0.41294226, -0.15434665, -0.42011139,  0.28631343, -0.38468329])

arr.sum(axis = 1)           # sum across the columns: one value per row, returned as a 1-D array

array([-1.65176904, -0.6173866 , -1.68044556,  1.14525371, -1.53873315])

arr = np.array([0, 1, 2, 3, 4, 5, 6, 7])

arr.cumsum()                       # cumulative sum

array([ 0,  1,  3,  6, 10, 15, 21, 28], dtype=int32)

arr = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])

arr.cumsum(axis = 0)

array([[ 0,  1,  2],
       [ 3,  5,  7],
       [ 9, 12, 15]], dtype=int32)

arr.cumprod(axis = 1)                 # cumulative product along axis 1, i.e. within each row

array([[  0,   0,   0],
       [  3,  12,  60],
       [  6,  42, 336]], dtype=int32)
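The axis argument is easy to get backwards, so a small deterministic sketch may help: the axis that is named is the one that gets collapsed. The keepdims argument is an addition not used in the notes above; it keeps the collapsed axis with length 1.

```python
import numpy as np

arr = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])

col_sums = arr.sum(axis=0)                     # collapse the rows: one value per column
row_sums = arr.sum(axis=1)                     # collapse the columns: one value per row
row_sums_2d = arr.sum(axis=1, keepdims=True)   # same sums, but with shape (3, 1)
overall_mean = np.mean(arr)                    # top-level function form of arr.mean()

print(col_sums, row_sums, row_sums_2d.shape, overall_mean)
```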

Methods for boolean arrays

In the methods above, boolean values are coerced to 1 (True) and 0 (False). Thus sum is often used to count the True values in a boolean array:

arr = np.random.randn(100)

(arr > 0).sum()                     # number of positive values

bools = np.array([False, False, True, False])
bools.any()                  # checks whether at least one value is True

True

bools.all()                     # checks whether every value is True

False

Sorting

sort()

arr = np.random.randn(6)           # sorting a one-dimensional array

arr.sort()

arr

array([-1.22937654, -0.57560317,  0.27646244,  0.81203191,  0.83331953,
        0.97375779])

arr = np.random.randn(5, 3)                  # sorting a multidimensional array
arr

array([[ 1.56743746,  0.24278787, -0.82335562],
       [-1.61820194,  1.82090974, -0.37097399],
       [ 1.06528464,  0.27556863,  0.73302075],
       [-0.58584045, -0.54060996, -1.66575822],
       [-1.04347057,  1.19813072,  1.0395426 ]])

arr.sort(1)                       # sort in place along axis 1, i.e. within each row
arr

array([[-0.82335562,  0.24278787,  1.56743746],
       [-1.61820194, -0.37097399,  1.82090974],
       [ 0.27556863,  0.73302075,  1.06528464],
       [-1.66575822, -0.58584045, -0.54060996],
       [-1.04347057,  1.0395426 ,  1.19813072]])
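The sort method used above modifies the array in place. For comparison, a brief sketch of the top-level np.sort, which returns a sorted copy, and argsort, which returns the ordering rather than the values (the input values are made up):

```python
import numpy as np

arr = np.array([3.0, 1.0, 2.0])

sorted_copy = np.sort(arr)     # new sorted array; arr is unchanged at this point
order = arr.argsort()          # indices that would sort arr

arr.sort()                     # sorts arr itself and returns None

print(sorted_copy, order, arr)
```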

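Unique and other set logic

NumPy has some basic set operations for one-dimensional ndarrays. Probably the most commonly used is np.unique, which returns the sorted unique values in an array: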

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
np.unique(names)

array(['Bob', 'Joe', 'Will'],
      dtype='<U4')

ints = np.array([3,3,3,4,4,3,4,2,1,1,2,3,3])
np.unique(ints)

array([1, 2, 3, 4])

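# np.in1d tests the membership of the values of one array in another and returns a boolean array (newer NumPy versions offer np.isin for the same purpose):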

values = np.array([6, 0, 0, 3, 2, 5, 6])
np.in1d(values, [2, 3, 6])

array([ True, False, False,  True,  True, False,  True], dtype=bool)
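A few of the other set functions, sketched on the same values. The array name targets is introduced here for illustration; np.isin is used for the membership test, and each of the other functions returns sorted unique values:

```python
import numpy as np

values = np.array([6, 0, 0, 3, 2, 5, 6])
targets = np.array([2, 3, 6])

member = np.isin(values, targets)               # element-wise membership test
common = np.intersect1d(values, targets)        # values present in both arrays
only_in_values = np.setdiff1d(values, targets)  # values in the first array but not the second
either = np.union1d(values, targets)            # values present in at least one array

print(member)
print(common, only_in_values, either)
```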

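4.4 File Input and Output with Arrays

NumPy can save and load data to and from disk in either text or binary format. This section covers only NumPy's built-in binary format, since most users will prefer pandas or other tools for loading text or tabular data (see Chapter 6).

np.save and np.load are the two workhorse functions for saving and loading array data on disk. By default, arrays are saved in an uncompressed raw binary format with the file extension .npy: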

arr = np.arange(10)

np.save('some_array', arr)

If the file path does not already end in .npy, the extension is appended. The array on disk can then be loaded with np.load:

np.load('some_array.npy')

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

You can save multiple arrays in an uncompressed archive with np.savez, passing the arrays as keyword arguments:

np.savez('array_archive.npz', a=arr, b=arr)

When you load an .npz file, you get back a dict-like object that loads the individual arrays lazily:

arch = np.load('array_archive.npz')

arch['b']

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

If your data compresses well, you may wish to use numpy.savez_compressed instead:

np.savez_compressed('arrays_compressed.npz', a=arr, b=arr)
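The calls above write into the current working directory. As a self-contained sketch of the same round trip, the version below uses a temporary directory (my addition, so that no files are left behind) and opens the .npz archive in a with block so the file handle is closed afterwards:

```python
import os
import tempfile

import numpy as np

arr = np.arange(10)

with tempfile.TemporaryDirectory() as tmpdir:
    # Single array: .npy file
    npy_path = os.path.join(tmpdir, 'some_array.npy')
    np.save(npy_path, arr)
    loaded = np.load(npy_path)

    # Several arrays: .npz archive, one entry per keyword argument
    npz_path = os.path.join(tmpdir, 'array_archive.npz')
    np.savez(npz_path, a=arr, b=arr * 2)
    with np.load(npz_path) as arch:      # closes the underlying file on exit
        keys = sorted(arch.files)
        b = arch['b']                    # the array is read from disk here

print(loaded)
print(keys, b)
```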

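4.5 Linear Algebra

Linear algebra operations, such as matrix multiplication, decompositions, determinants, and other square-matrix math, are an important part of any array library.
Unlike some languages such as MATLAB, multiplying two two-dimensional arrays with * gives an element-wise product rather than a matrix dot product.
NumPy therefore provides the dot function for matrix multiplication, both as an array method and as a function in the numpy namespace: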

x = np.array([[1., 2., 3.], [4., 5., 6.]])
x

array([[ 1.,  2.,  3.],
       [ 4.,  5.,  6.]])

y = np.array([[6., 23.], [-1, 7], [8, 9]])
y

array([[  6.,  23.],
       [ -1.,   7.],
       [  8.,   9.]])

# Matrix multiplication

x.dot(y)

array([[  28.,   64.],
       [  67.,  181.]])

np.dot(x, y)

array([[  28.,   64.],
       [  67.,  181.]])

x @ y

array([[  28.,   64.],
       [  67.,  181.]])
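The difference between * and @ is easiest to see on a square matrix, where both operations are defined but give different answers. A minimal sketch with made-up values:

```python
import numpy as np

m = np.array([[1, 2], [3, 4]])

elementwise = m * m     # multiplies matching entries
matrix = m @ m          # matrix product, same as m.dot(m) or np.dot(m, m)

v = np.array([1, 1])
mv = m @ v              # a 2-D array times a 1-D array gives a 1-D result

print(elementwise)
print(matrix)
print(mv)
```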

# numpy.linalg has a standard set of matrix decompositions and operations such as inverse and determinant.

from numpy.linalg import inv, qr, pinv

X = np.random.randn(5, 5)
X

array([[ 2.40495877, -2.26563267, -0.07099409,  1.19512501,  0.43861647],
       [-1.05748984, -0.46596716,  0.09420242, -1.52079483,  0.14840596],
       [-1.57249384,  0.07670084, -1.19786539,  0.38048725,  1.30067983],
       [ 0.05769554,  2.85045679, -2.10660309, -1.23085749,  1.44081619],
       [-0.41548049,  0.23928522, -0.88725759,  1.20801395,  1.66920717]])

mat = X.T.dot(X)
mat

array([[  9.55080109,  -5.0115689 ,   1.8603766 ,   3.31121638,
         -1.75778862],
       [ -5.0115689 ,  13.5384611 ,  -6.19201468,  -5.1893363 ,
          3.54326802],
       [  1.8603766 ,  -6.19201468,   6.67379838,   0.83722674,
         -6.09144291],
       [  3.31121638,  -5.1893363 ,   0.83722674,   6.86021912,
          1.03638477],
       [ -1.75778862,   3.54326802,  -6.09144291,   1.03638477,
          6.76838064]])

inv(mat)                      # matrix inverse

array([[ 0.19936998,  0.12875977,  0.44851935, -0.11484585,  0.40561761],
       [ 0.12875977,  0.35211053,  0.84281052,  0.00977551,  0.60612895],
       [ 0.44851935,  0.84281052,  3.52762827, -0.45045327,  2.91905684],
       [-0.11484585,  0.00977551, -0.45045327,  0.33790903, -0.49208612],
       [ 0.40561761,  0.60612895,  2.91905684, -0.49208612,  2.63823391]])

mat.dot(inv(mat))

array([[  1.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         -1.11022302e-16,   0.00000000e+00],
       [  0.00000000e+00,   1.00000000e+00,   3.55271368e-15,
          0.00000000e+00,   1.77635684e-15],
       [  4.44089210e-16,   0.00000000e+00,   1.00000000e+00,
         -4.44089210e-16,   3.55271368e-15],
       [ -1.66533454e-16,  -3.33066907e-16,  -8.88178420e-16,
          1.00000000e+00,  -8.88178420e-16],
       [  4.44089210e-16,   8.88178420e-16,   3.55271368e-15,
         -8.88178420e-16,   1.00000000e+00]])

pinv(mat)

array([[ 0.19936998,  0.12875977,  0.44851935, -0.11484585,  0.40561761],
       [ 0.12875977,  0.35211053,  0.84281052,  0.00977551,  0.60612895],
       [ 0.44851935,  0.84281052,  3.52762827, -0.45045327,  2.91905684],
       [-0.11484585,  0.00977551, -0.45045327,  0.33790903, -0.49208612],
       [ 0.40561761,  0.60612895,  2.91905684, -0.49208612,  2.63823391]])

mat.dot(pinv(mat))             # product with the Moore-Penrose pseudo-inverse; approximately the identity

array([[  1.00000000e+00,  -2.22044605e-15,  -1.06581410e-14,
          2.44249065e-15,  -1.06581410e-14],
       [ -6.66133815e-16,   1.00000000e+00,   0.00000000e+00,
         -8.88178420e-16,   5.32907052e-15],
       [  8.88178420e-16,   1.33226763e-15,   1.00000000e+00,
          0.00000000e+00,   7.10542736e-15],
       [  9.43689571e-16,   1.66533454e-15,   5.32907052e-15,
          1.00000000e+00,   3.99680289e-15],
       [ -8.88178420e-16,   0.00000000e+00,  -3.55271368e-15,
         -8.88178420e-16,   1.00000000e+00]])

np.linalg?

Core Linear Algebra Tools

Linear algebra basics:

norm            Vector or matrix norm
inv             Inverse of a square matrix
solve           Solve a linear system of equations
det             Determinant of a square matrix
lstsq           Solve linear least-squares problem
pinv            Pseudo-inverse (Moore-Penrose) calculated using a singular
                value decomposition
matrix_power    Integer power of a square matrix

Eigenvalues and decompositions:

eig             Eigenvalues and vectors of a square matrix
eigh            Eigenvalues and eigenvectors of a Hermitian matrix
eigvals         Eigenvalues of a square matrix
eigvalsh        Eigenvalues of a Hermitian matrix
qr              QR decomposition of a matrix
svd             Singular value decomposition of a matrix
cholesky        Cholesky decomposition of a matrix
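The notes import qr but never call it, and solve and det from the listing are not demonstrated either. A short sketch of all three, using a small fixed symmetric matrix (an assumption, picked so the determinant can be checked by hand) instead of random data:

```python
import numpy as np
from numpy.linalg import det, qr, solve

mat = np.array([[4.0, 1.0, 0.0],
                [1.0, 3.0, 1.0],
                [0.0, 1.0, 2.0]])

q, r = qr(mat)                 # mat equals q @ r; q is orthogonal, r is upper triangular

b = np.ones(3)
x = solve(mat, b)              # solves mat @ x = b without forming inv(mat) explicitly

d = det(mat)                   # 4*(3*2 - 1*1) - 1*(1*2 - 1*0) = 18

print(np.allclose(q @ r, mat), np.allclose(mat @ x, b), d)
```

For solving a linear system, calling solve directly is usually preferred over computing inv(mat).dot(b).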

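4.6 Pseudorandom Number Generation

The numpy.random module supplements the built-in Python random module with functions for efficiently generating whole arrays of sample values from many kinds of probability distributions.
For example, you can get a 4 × 4 array of samples from the standard normal distribution using normal: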

samples = np.random.normal(size=(4, 4))
samples

array([[-0.31411226,  0.78370571, -1.06857403, -0.14836215],
       [ 1.14352539,  1.42418282, -0.28299944, -0.39765053],
       [ 1.00697231,  0.06954574, -1.62074125,  2.63925111],
       [ 1.06188346,  0.35586702,  1.37162022,  1.88444903]])

# You can change NumPy's random number generation seed with np.random.seed:

np.random.seed(1234)

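# The data generation functions in numpy.random use a global random seed. To avoid global state, you can use numpy.random.RandomState to create a random number generator isolated from the others: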

rng = np.random.RandomState(1234)
rng.randn(10)

array([ 0.47143516, -1.19097569,  1.43270697, -0.3126519 , -0.72058873,
        0.88716294,  0.85958841, -0.6365235 ,  0.01569637, -2.24268495])

np.random.randn(3,3)

array([[ 2.00074455, -0.6721939 ,  0.91388772],
       [-0.40461981,  2.30728121, -0.47127242],
       [-1.21870147, -0.07087522, -0.65394543]])

np.random.normal(3,0.01)

3.0034769363248364

np.random.rand(3,3)

array([[ 0.58730363,  0.47163253,  0.10712682],
       [ 0.22921857,  0.89996519,  0.41675354],
       [ 0.53585166,  0.00620852,  0.30064171]])
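Assuming NumPy 1.17 or later, np.random.default_rng offers another way to get an isolated generator; it is not used in the original notes. The sketch below shows that two generators built from the same seed produce the same stream, independently of the global state:

```python
import numpy as np

rng_a = np.random.default_rng(1234)
rng_b = np.random.default_rng(1234)

a = rng_a.standard_normal(5)
b = rng_b.standard_normal(5)              # same seed, so the same five values as a

ints = rng_a.integers(0, 2, size=1000)    # upper bound is exclusive, as with np.random.randint

print(np.array_equal(a, b))
print(np.unique(ints))
```

The values differ from those produced by RandomState(1234), because the two classes use different underlying algorithms.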

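4.7 Example: Random Walks

Simulating random walks is a good illustration of array operations. Consider first a simple random walk that starts at 0, with steps of 1 and -1 occurring with equal probability.

Here is a pure-Python way to implement a single 1,000-step random walk using the built-in random module: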

import random
position = 0
walk = [position]
steps = 1000
for i in range(steps):
    step = 1 if random.randint(0, 1) else -1
    position += step
    walk.append(position)

# plt.plot(walk[:100])

a = np.random.randint(0,2, size = 1000)

array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0,
       0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0,
       1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1,
       0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0,
       1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0,
       1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1,
       1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1,
       0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0,
       1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0,
       0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,
       0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
       0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0,
       1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0,
       0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0,
       0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1,
       1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0,
       1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0,
       1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0,
       1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1,
       0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1,
       0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1,
       0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,
       0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0,
       1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0,
       0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1,
       0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0,
       1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0,
       0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1,
       0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0,
       0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0,
       0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0,
       0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0,
       1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0,
       0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1,
       1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1])

steps = np.where(a>0, 1, -1)
walk = steps.cumsum()

walk.min()

-14

walk.max()

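# A more complicated statistic is the first crossing time, the step at which the random walk first reaches a particular value.
# Suppose we want to know how long it took this random walk to get at least 10 steps away from the origin 0 in either direction.
# np.abs(walk) >= 10 gives a boolean array indicating where the walk has reached or exceeded 10, but we want the index of the first 10 or -10.
# We can compute this with argmax, which returns the first index of the maximum value in the boolean array (True is the maximum value):
(np.abs(walk) >= 10).argmax()
# Note that using argmax here is not always efficient, because it always makes a full scan of the array. In this special case, once a True is observed we know it is the maximum value.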
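The same array operations extend to many walks at once by drawing a two-dimensional array of steps, one walk per row. A sketch under assumed parameters (500 walks, 1000 steps, an arbitrary seed). One caveat it addresses: argmax returns 0 for a walk that never reaches 10, so such walks are filtered out with any first.

```python
import numpy as np

rng = np.random.RandomState(12345)                 # isolated generator; the seed is arbitrary
nwalks, nsteps = 500, 1000

draws = rng.randint(0, 2, size=(nwalks, nsteps))   # each draw is 0 or 1
steps = np.where(draws > 0, 1, -1)
walks = steps.cumsum(axis=1)                       # cumulative position, one walk per row

reached = np.abs(walks) >= 10
hits10 = reached.any(axis=1)                       # which walks ever get 10 away from 0
crossing_times = reached[hits10].argmax(axis=1)    # first index at which each of those does

print(hits10.sum(), crossing_times.mean())
```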