Study notes: Common NumPy usage (3/3)

Latest recommended article published on 2020-09-10 16:21:37

蜡青

Category: Common Python libraries. Tags: python, numpy, data analysis

Article link: https://blog.csdn.net/qq_34844698/article/details/105246106

Contents

7 Fancy indexing
8 Sorting arrays
9 Structured data: NumPy's structured arrays

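7 Fancy indexing

Simple index values (such as arr[0]), slices (such as arr[:5]), and Boolean masks (such as arr[arr > 0]) let us access and modify portions of an array.
Fancy indexing is much like those simple forms, but we pass arrays of indices instead of single scalars. It lets us quickly access and modify complicated subsets of an array's values.

7.1 Exploring fancy indexing

Pass an array of indices to access multiple array elements at once.
The shape of the result follows the shape of the index array, not the shape of the array being indexed.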

import numpy as np

x =np.random.randint(100, size=10)
print(x)

[ 0 38 19 46 42 56 60 77 30 24]

np.array([x[3],x[7],x[9]])

array([46, 77, 24])

ind=[3,7,9]
x[ind]

array([46, 77, 24])

ind=np.array([[3,7],[4,5]])
x[ind]

array([[46, 77],
       [42, 56]])

X = np.arange(12).reshape((3, 4))
X

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

# X[0,2],X[1,1],X[2,3]
row=np.array([0,1,2])
col=np.array([2,1,3])
X[row,col]

array([ 2,  5, 11])

# the pairing of indices follows the broadcasting rules introduced earlier
# row[:, np.newaxis] is array([[0],[1],[2]])
X[row[:,np.newaxis],col]

array([[ 2,  1,  3],
       [ 6,  5,  7],
       [10,  9, 11]])

(row[:,np.newaxis]*col).shape

(3, 3)

The value returned by fancy indexing reflects the broadcast shape of the index arrays, not the shape of the array being indexed.
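
To make the pairing explicit, the sketch below broadcasts the two index arrays by hand with np.broadcast_arrays and compares the result with the fancy-indexing expression. The names rows_b, cols_b, and explicit are introduced here for illustration only.

```python
import numpy as np

X = np.arange(12).reshape((3, 4))
row = np.array([0, 1, 2])
col = np.array([2, 1, 3])

# Broadcast the (3, 1) row indices against the (3,) column indices.
rows_b, cols_b = np.broadcast_arrays(row[:, np.newaxis], col)

# Fancy indexing pairs rows_b[i, j] with cols_b[i, j] for every (i, j).
result = X[row[:, np.newaxis], col]
explicit = np.array([[X[r, c] for r, c in zip(rr, cc)]
                     for rr, cc in zip(rows_b, cols_b)])

print(rows_b.shape, cols_b.shape)        # both index arrays are now (3, 3)
print(np.array_equal(result, explicit))  # the two constructions agree
```

The nested loop is only there to spell out the pairing; in practice the single indexing expression is all that is needed.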

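7.2 Combined indexing

Fancy indexing can be combined with the other indexing schemes (simple indices, slices, and masks).

# combine fancy and simple indices
# third row: its third, first, and second elements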
X[2, [2, 0, 1]]

array([10,  8,  9])

# combine with slicing
# in every row except the first: the third, first, and second elements
X[1:,[2,0,1]]

array([[ 6,  4,  5],
       [10,  8,  9]])

# combine with a mask
mask = np.array([1, 0, 1, 0], dtype=bool)
X[row[:, np.newaxis], mask]

array([[ 0,  2],
       [ 4,  6],
       [ 8, 10]])
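
One way to read the mask example: a Boolean mask used next to another index array behaves like the integer positions of its True entries. The sketch below checks that reading with np.flatnonzero; the names cols, via_mask, and via_ints are illustrative.

```python
import numpy as np

X = np.arange(12).reshape((3, 4))
row = np.array([0, 1, 2])
mask = np.array([1, 0, 1, 0], dtype=bool)

# Integer positions of the True entries in the mask.
cols = np.flatnonzero(mask)

via_mask = X[row[:, np.newaxis], mask]   # mask combined with fancy indexing
via_ints = X[row[:, np.newaxis], cols]   # the same selection with integers

print(cols)
print(np.array_equal(via_mask, via_ints))
```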

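7.3 Example: selecting random points

A common use of fancy indexing is selecting a subset of rows from a matrix.

For example, suppose we have an N×D matrix representing N points in D dimensions. (Here D is the number of features describing each point; it is not the same as the array's .ndim.)

# an array of points drawn from a two-dimensional normal distribution:
mean = [0, 0]
cov = [[1, 2],
       [2, 5]]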
X = np.random.multivariate_normal(mean, cov, 100)
print(X.shape)
X

(100, 2)

array([[ 0.73172672,  2.74159213],
       [ 0.15767901, -1.28214738],
       [-2.10049112, -3.01893849],
       [-1.02019607, -2.65772042],
       [ 0.58572523,  2.55539611],
       [-0.09532255,  1.09303603],
       [-0.56616457, -0.3097357 ],
	  ..........
       [-0.51784986, -1.15449278],
       [ 0.24994794,  0.70885289],
       [ 0.69235456, -1.329541  ],
       [-1.9563096 , -4.3007152 ],
       [ 0.75235832,  1.39318877],
       [-0.41172687, -1.11966659],
       [ 1.0860606 ,  4.85657321]])

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn; seaborn.set() # set the plot style
plt.scatter(X[:, 0], X[:, 1]);

[Figure: scatter plot of the 100 sampled points (output_215_0.png)]

Select 20 random indices with no repeats (replace=False), and use them to pick the corresponding rows of the original array:

# p The probabilities associated with each entry in a.
# If not given the sample assumes a uniform distribution over all entries in a.
indices = np.random.choice(X.shape[0], 20,replace=False)
indices

array([14, 63, 27, 16, 38,  5, 95, 85, 58, 15,  1, 55, 88, 56, 92, 17, 11,
       21, 67, 90])

selection=X[indices]

selection.shape

(20, 2)

# circle the randomly selected points
plt.scatter(X[:, 0], X[:, 1], alpha=0.3)
plt.scatter(selection[:, 0], selection[:, 1],
            facecolor='none', edgecolor='b', s=200);

[Figure: the same scatter plot with the 20 selected points circled (output_220_0.png)]

This approach is often used to quickly split data, for example into training and test sets when validating a statistical model.
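
A minimal sketch of such a split, assuming a 20-row test set and a fixed seed (both chosen here purely for illustration): shuffle the row indices once, cut the shuffled list in two, and use each part as a fancy index.

```python
import numpy as np

rng = np.random.RandomState(0)   # seed chosen only for reproducibility
X = rng.multivariate_normal([0, 0], [[1, 2], [2, 5]], 100)

# Shuffle all row indices once, then cut the shuffled list in two.
perm = rng.permutation(X.shape[0])
n_test = 20                      # hypothetical test-set size
test_idx, train_idx = perm[:n_test], perm[n_test:]

X_test = X[test_idx]             # fancy indexing selects whole rows
X_train = X[train_idx]
print(X_train.shape, X_test.shape)
```

Because both index sets come from a single permutation, no row can end up in both parts.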

7.4 Modifying values with fancy indexing

x = np.arange(10)
i = np.array([2, 1, 8, 4])
x[i] = 99
print(x)

[ 0 99 99  3 99  5  6  7 99  9]

x[i] -= 10
print(x)

[ 0 89 89  3 89  5  6  7 89  9]

Repeated indices in these operations

x = np.zeros(10)
x[[0, 0]] = [4, 6]
print(x)

[6. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

x = np.ones(10)
i = [2, 3, 3, 4, 4, 4]
x[i] += 1
x

array([1., 1., 2., 2., 2., 1., 1., 1., 1., 1.])

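No accumulation happened here, because x[i] += 1 is shorthand for x[i] = x[i] + 1: x[i] + 1 is evaluated once and the result is then assigned, so repeated indices simply overwrite each other. To apply the operation once per occurrence of an index, use the ufunc's at() method, which works in place: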

x = np.ones(10)
i = [2, 3, 3, 4, 4, 4]
np.add.at(x, i, 1)
print(x)

[1. 1. 2. 3. 4. 1. 1. 1. 1. 1.]
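
The two behaviours can be put side by side. The sketch below also uses np.bincount as a cross-check; that function is not part of the original notes and is only one possible way to count index occurrences.

```python
import numpy as np

i = [2, 3, 3, 4, 4, 4]

buffered = np.ones(10)
buffered[i] += 1                 # same as buffered[i] = buffered[i] + 1

unbuffered = np.ones(10)
np.add.at(unbuffered, i, 1)      # each occurrence of an index adds 1

# np.bincount counts how often each index occurs; adding those counts
# to the starting values gives the same result as np.add.at.
via_bincount = np.ones(10) + np.bincount(i, minlength=10)

print(buffered)
print(unbuffered)
print(np.array_equal(unbuffered, via_bincount))
```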

np.add.at?

# ``add.at(a, [0,0], 1)`` will increment the first element twice.
# Array like index object or slice object for indexing into first
#     operand. If first operand has multiple dimensions, indices can be a
#     tuple of array like index objects or slice objects.
# Increment items 0 and 1, and increment item 2 twice:
# >>> a = np.array([1, 2, 3, 4])
# >>> np.add.at(a, [0, 1, 2, 2], 1)
# >>> a
# array([2, 3, 5, 4])
# Add items 0 and 1 in first array to second array,
# and store results in first array:
# >>> a = np.array([1, 2, 3, 4])
# >>> b = np.array([1, 2])
# >>> np.add.at(a, [0, 1], b)
# >>> a
# array([2, 4, 3, 4])
a = np.arange(12).reshape((2,6))
print(a)
b = np.array([1, 2])[:,np.newaxis]
np.add.at(a,[0,1], b)

[[ 0  1  2  3  4  5]
 [ 6  7  8  9 10 11]]

a   # modified in place

array([[ 1,  2,  3,  4,  5,  6],
       [ 8,  9, 10, 11, 12, 13]])

7.5 Example: binning data

Suppose we have 1,000 values and want to quickly count how many fall into each bin. We can compute this with ufunc.at:

np.random.seed(42)
# draw 1000 samples from the standard normal distribution
x = np.random.randn(1000)
print(x[:10])
# compute a histogram by hand
bins = np.linspace(-5, 5, 20)
print(bins)
counts = np.zeros_like(bins)
# find the appropriate bin for each x
# Examples (from the np.searchsorted docstring)
# --------
# >>> np.searchsorted([1,2,3,4,5], 3)
# 2
# >>> np.searchsorted([1,2,3,4,5], 3, side='right')
# 3
# >>> np.searchsorted([1,2,3,4,5], [-10, 10, 2, 3])
# array([0, 5, 1, 2])
i = np.searchsorted(bins, x)
print(i[:10])
# add 1 to each of these bins
np.add.at(counts, i, 1)

[ 0.49671415 -0.1382643   0.64768854  1.52302986 -0.23415337 -0.23413696
  1.57921282  0.76743473 -0.46947439  0.54256004]
[-5.         -4.47368421 -3.94736842 -3.42105263 -2.89473684 -2.36842105
 -1.84210526 -1.31578947 -0.78947368 -0.26315789  0.26315789  0.78947368
  1.31578947  1.84210526  2.36842105  2.89473684  3.42105263  3.94736842
  4.47368421  5.        ]
[11 10 11 13 10 10 13 11  9 11]

# counts now holds the number of points in each bin
counts

array([  0.,   0.,   0.,   0.,   1.,   5.,  19.,  63., 118., 184., 218.,
       187., 106.,  62.,  27.,   8.,   1.,   1.,   0.,   0.])
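
As a sanity check, the hand-rolled counts can be compared with np.histogram. The sketch below assumes the same seed and bins as above, but draws from a local RandomState so that it does not disturb the global random state. With the default side='left', searchsorted puts a value into slot k when it lies in (bins[k-1], bins[k]], so slot 0 stays empty for data inside the bin range and counts[1:] lines up with the 19 intervals np.histogram reports.

```python
import numpy as np

rng = np.random.RandomState(42)  # same seed as above, via a local generator
x = rng.randn(1000)
bins = np.linspace(-5, 5, 20)

# Hand-rolled histogram: one slot per bin edge.
counts = np.zeros_like(bins)
np.add.at(counts, np.searchsorted(bins, x), 1)

# NumPy's histogram: one count per interval between consecutive edges.
hist, edges = np.histogram(x, bins)

print(counts.shape, hist.shape)
print(np.array_equal(counts[1:], hist))
```

The two results could differ only for values that fall exactly on a bin edge, which is not expected for continuous random samples.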

np.searchsorted?

np.searchsorted([1,2,3,4,5], [0.9,1.5,2.2,3.8,4.7,5.1]) # bucketing

array([0, 1, 2, 3, 4, 5], dtype=int64)

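# plot the results
# (drawstyle='steps' replaces linestyle='steps', which recent Matplotlib versions reject)
plt.plot(bins, counts, drawstyle='steps');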

[Figure: step plot of counts against bins (output_238_0.png)]

plt.hist(x,bins,histtype='step',color='red');

[Figure: the same histogram drawn with plt.hist (output_239_0.png)]

# Large dataset: np.histogram is faster
x = np.random.randn(1000000)
print("NumPy routine:")
%timeit counts, edges = np.histogram(x, bins)
print("Custom routine:")
%timeit np.add.at(counts, np.searchsorted(bins, x), 1)

NumPy routine:
64.6 ms ± 1.78 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Custom routine:
106 ms ± 1.83 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Small dataset: the custom routine is faster
x = np.random.randn(100)
print("NumPy routine:")
%timeit counts, edges = np.histogram(x, bins)
print("Custom routine:")
%timeit np.add.at(counts, np.searchsorted(bins, x), 1)

NumPy routine:
30.7 µs ± 3.58 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Custom routine:
14.6 µs ± 998 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

8 Sorting arrays

import numpy as np
def selection_sort(x):
    for i in range(len(x)):
        # swap=current_index + index of x[i:]'s min
        swap = i+ np.argmin(x[i:])
        x[i],x[swap]=x[swap],x[i]
    return x

x=np.array([2,1,3,5,3])
selection_sort(x)

array([1, 2, 3, 3, 5])

def bogosort(x):
    while np.any(x[:-1] > x[1:]):
        np.random.shuffle(x)
    return x

x=np.array([2,1,3,5,3])
x[:-1] > x[1:]

array([ True, False, False,  True])

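8.1 Fast sorting in NumPy: np.sort and np.argsort

To get a sorted copy without modifying the input array, use np.sort().
To sort the array in place, use the array method ndarray.sort().
A related function is argsort, which returns the indices that would sort the array.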

x = np.array([2, 1, 4, 3, 5])
np.sort(x)

array([1, 2, 3, 4, 5])

x.sort()
print(x)

[1 2 3 4 5]

import random

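# running time of ndarray.sort versus Python's built-in list.sort
# roughly a 5x difference in this run
# (both sort in place, so repeated %timeit loops see already-sorted data)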
x=np.random.random(1000000)
b=[random.random() for i in range(1000000)]
%timeit x.sort()
%timeit b.sort()

9.41 ms ± 78.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
52.1 ms ± 1.12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

x = np.array([2, 1, 4, 3, 5])
i = np.argsort(x)
print(i)

[1 0 3 2 4]

These indices can then be used (via fancy indexing) to construct the sorted array.
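
A short sketch of that idea, with one extra step that is not in the original notes: applying argsort to the sorting indices gives the inverse permutation, which puts the sorted values back in their original order. The names sorted_x, inverse, and restored are illustrative.

```python
import numpy as np

x = np.array([2, 1, 4, 3, 5])
i = np.argsort(x)

sorted_x = x[i]                  # fancy indexing with the argsort result

# The inverse permutation maps sorted positions back to original positions.
inverse = np.argsort(i)
restored = sorted_x[inverse]

print(sorted_x)
print(restored)
```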

rand = np.random.RandomState(42)
X = rand.randint(0, 10, (4, 6))
print(X)

[[6 3 7 4 6 9]
 [2 6 7 4 3 7]
 [7 2 5 4 1 7]
 [5 1 4 0 9 5]]

# sort each column of X
# (axis=0: sort down the rows)
np.sort(X,axis=0)

array([[2, 1, 4, 0, 1, 5],
       [5, 2, 5, 4, 3, 7],
       [6, 3, 7, 4, 6, 7],
       [7, 6, 7, 4, 9, 9]])

# sort each row of X
# (axis=1: sort across the columns)
np.sort(X,axis=1)

array([[3, 4, 6, 6, 7, 9],
       [2, 3, 4, 6, 7, 7],
       [1, 2, 4, 5, 7, 7],
       [0, 1, 4, 5, 5, 9]])
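
The argsort-plus-fancy-indexing idea carries over to two dimensions. The sketch below sorts each row through its argsort indices in two ways: with np.take_along_axis, and with an explicit row index broadcast against the column order. Both are shown only as illustrations; np.sort(X, axis=1) remains the direct route.

```python
import numpy as np

rand = np.random.RandomState(42)
X = rand.randint(0, 10, (4, 6))

order = np.argsort(X, axis=1)                    # per-row sorting indices

# Option 1: let NumPy pair each row with its own index row.
rows_sorted = np.take_along_axis(X, order, axis=1)

# Option 2: the same pairing written as fancy indexing with broadcasting.
row_idx = np.arange(X.shape[0])[:, np.newaxis]   # shape (4, 1)
rows_sorted_fancy = X[row_idx, order]

print(np.array_equal(rows_sorted, np.sort(X, axis=1)))
print(np.array_equal(rows_sorted, rows_sorted_fancy))
```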

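8.2 Partial sorts: partitioning

Sometimes we do not want to sort the whole array and only want the K smallest values. NumPy provides this through np.partition, which takes an array and a number K and returns a new array. In the result, index K holds the value that a full sort would put there (the (K+1)-th smallest, counting from one); the values to its left are no larger, the values to its right are no smaller, and the order within each side is arbitrary.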

x = np.array([7, 2, 3, 1, 6, 5, 4])
np.partition(x, 3)

array([2, 1, 3, 4, 6, 5, 7])

index=np.argpartition(x,3)

# fancy index
x[index]

array([2, 1, 3, 4, 6, 5, 7])
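
The guarantees of a partition can be checked directly. This sketch reuses the array above; the names p, smallest_idx, and smallest are illustrative.

```python
import numpy as np

x = np.array([7, 2, 3, 1, 6, 5, 4])
k = 3
p = np.partition(x, k)

# The element at index k is the one a full sort would put there ...
print(p[k] == np.sort(x)[k])
# ... nothing before it is larger, and nothing after it is smaller.
print(p[:k].max() <= p[k] <= p[k + 1:].min())

# The k smallest values, in arbitrary order, via argpartition + fancy indexing.
smallest_idx = np.argpartition(x, k)[:k]
smallest = x[smallest_idx]
print(np.sort(smallest))
```

The order of the values inside each side is an implementation detail, which is why the last line sorts them before printing.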

np.partition?

np.partition(X, 2, axis=1)

array([[3, 4, 6, 7, 6, 9],
       [2, 3, 4, 7, 6, 7],
       [1, 2, 4, 5, 7, 7],
       [0, 1, 4, 5, 9, 5]])

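The result is an array in which the first two slots of each row hold that row's two smallest values, with the remaining values filling the other slots.

Just as np.argsort computes the indices of the sort, np.argpartition computes the indices of the partition.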

8.3 Example: k-nearest neighbors

How to use argsort along multiple axes to quickly find each point's nearest neighbors in a set.

import numpy as np

X=np.random.random((10,2))

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn
seaborn.set()
plt.scatter(X[:,0],X[:,1],s=100)
plt.ylim(0,1)
plt.xlim(0,1)

(0, 1)

[Figure: scatter plot of the 10 random points (output_268_1.png)]

dist_sq = np.sum((X[:,np.newaxis,:] - X[np.newaxis,:,:]) ** 2, axis=-1)

# for each pair of points, compute the differences in their coordinates
# (10,1,2) - (1,10,2) -> (10,10,2)
differences = X[:, np.newaxis, :] - X[np.newaxis, :, :]
differences.shape

(10, 10, 2)

# square the coordinate differences
sq_differences = differences ** 2
sq_differences.shape

(10, 10, 2)

# sum the squared differences to get the squared distance
dist_sq = sq_differences.sum(-1)
dist_sq.shape

(10, 10)

# check the diagonal of the matrix (each point's distance to itself)
dist_sq.diagonal()

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
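
To check the broadcasting, the one-line expression can be compared with a plain double loop. The seed and the names loop, a, and b below are assumptions made for this sketch.

```python
import numpy as np

rng = np.random.RandomState(1)   # arbitrary seed, for reproducibility only
X = rng.rand(10, 2)

# Vectorized: (10, 1, 2) - (1, 10, 2) -> (10, 10, 2), then sum over the last axis.
dist_sq = np.sum((X[:, np.newaxis, :] - X[np.newaxis, :, :]) ** 2, axis=-1)

# Reference implementation with explicit loops.
n = X.shape[0]
loop = np.empty((n, n))
for a in range(n):
    for b in range(n):
        loop[a, b] = np.sum((X[a] - X[b]) ** 2)

print(np.allclose(dist_sq, loop))
print(np.allclose(dist_sq, dist_sq.T))   # squared distance is symmetric
```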

With the pairwise squared distances in hand, we can use np.argsort to sort along each row. The leftmost columns then give the indices of the nearest neighbors:

nearest = np.argsort(dist_sq, axis=1)
print(nearest)

[[0 3 1 6 5 4 2 7 8 9]
 [1 0 3 5 6 4 7 2 8 9]
 [2 8 6 4 3 5 7 0 9 1]
 [3 6 0 7 9 2 1 8 4 5]
 [4 5 2 6 0 8 3 1 7 9]
 [5 4 0 2 6 3 8 1 7 9]
 [6 2 8 3 7 0 4 9 5 1]
 [7 9 3 6 8 2 0 4 1 5]
 [8 2 6 7 3 4 9 5 0 1]
 [9 7 3 6 8 2 0 4 1 5]]

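Note that the first column lists 0 through 9 in order, because each point's nearest neighbor is itself.

If we only care about the k nearest neighbors, all we need to do is partition each row so that the smallest k + 1 squared distances come first, with the larger distances filling the remaining positions of the row. (The +1 accounts for the point itself.)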

K = 2
nearest_partition = np.argpartition(dist_sq, K + 1, axis=1)

nearest_partition

array([[3, 1, 0, 6, 5, 4, 2, 7, 8, 9],
       [3, 1, 0, 5, 6, 4, 2, 7, 8, 9],
       [2, 8, 6, 4, 3, 5, 7, 0, 1, 9],
       [3, 6, 0, 7, 9, 5, 1, 2, 8, 4],
       [4, 5, 2, 6, 0, 1, 3, 7, 8, 9],
       [5, 4, 0, 2, 6, 3, 1, 7, 8, 9],
       [6, 2, 8, 3, 7, 0, 4, 5, 1, 9],
       [3, 7, 9, 6, 8, 2, 0, 5, 1, 4],
       [8, 2, 6, 7, 3, 4, 9, 5, 1, 0],
       [3, 7, 9, 6, 8, 2, 0, 5, 1, 4]], dtype=int64)
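
The partition leaves the first K + 1 entries of each row in arbitrary order. If the neighbors are also needed in order of distance, one option (a sketch, not part of the original notes) is to sort only those K + 1 candidates instead of the whole row. The seed and the names part, order, knn, and full are illustrative.

```python
import numpy as np

rng = np.random.RandomState(0)   # arbitrary seed, for reproducibility only
X = rng.rand(10, 2)
dist_sq = np.sum((X[:, np.newaxis, :] - X[np.newaxis, :, :]) ** 2, axis=-1)

K = 2
# Indices of the K + 1 closest points per row (self included), unordered.
part = np.argpartition(dist_sq, K + 1, axis=1)[:, :K + 1]

# Order just those K + 1 candidates by their squared distance.
cand_dist = np.take_along_axis(dist_sq, part, axis=1)
order = np.argsort(cand_dist, axis=1)
knn = np.take_along_axis(part, order, axis=1)

# Cross-check against a full sort of every row.
full = np.argsort(dist_sq, axis=1)[:, :K + 1]
print(np.array_equal(knn, full))
```

Column 0 of knn is each point itself; columns 1 to K are its nearest neighbors in increasing distance.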

plt.scatter(X[:, 0], X[:, 1], s=100)
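# draw lines from each point to its two nearest neighbors
K = 2
# loop over the rows (one per point)
for i in range(X.shape[0]):
    # loop over the first K+1 entries of the row
    for j in nearest_partition[i, :K+1]:
        # plot a line from X[i] to X[j]
        # zip regroups the two points into (x-values, y-values):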
        plt.plot(*zip(X[j], X[i]), color='black')

[Figure: each point joined to its two nearest neighbors (output_280_0.png)]

list(zip(X[1],X[2]))

[(0.1498005921808534, 0.5263301949184661),
 (0.15307640958808832, 0.8533986100963741)]

X[1]

array([0.14980059, 0.15307641])

X[2]

array([0.52633019, 0.85339861])

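The beauty of vectorized operations is that they are written in a way that is agnostic to the size of the input data. That is, we can just as easily compute neighbors among 100 or 1,000,000 points in any number of dimensions, and the code looks the same.

9 Structured data: NumPy's structured arrays

Similar to a pandas DataFrame.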

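Suppose we have several categories of data on a number of people (say name, age, and weight) that we want to store for use in a Python program. One option is to keep them in three separate arrays: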

name = ['Alice', 'Bob', 'Cathy', 'Doug']
age = [25, 45, 37, 19]
weight = [55.0, 85.5, 68.0, 61.5]

But nothing here tells us that the three arrays are related.

x = np.zeros(4, dtype=int)

# a structured array using a compound data type
data = np.zeros(4, dtype={'names':('name', 'age', 'weight'),
                          'formats':('U10', 'i4', 'f8')})
print(data.dtype)

[('name', '<U10'), ('age', '<i4'), ('weight', '<f8')]

data['name'] = name
data['age'] = age
data['weight'] = weight
print(data)

[('Alice', 25, 55. ) ('Bob', 45, 85.5) ('Cathy', 37, 68. )
 ('Doug', 19, 61.5)]

The convenience of structured arrays is that you can look up values either by index or by name.

# get all names
data['name']

array(['Alice', 'Bob', 'Cathy', 'Doug'], dtype='<U10')

# get the first row of the data
data[0]

('Alice', 25, 55.)

# get the name from the last row
data[-1]['name']

'Doug'

# get the names of people under 30
data[data['age'] < 30]['name']

array(['Alice', 'Doug'], dtype='<U10')
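
Two further things structured arrays allow, sketched here under the same field names: the dtype can be spelled as a list of (name, type) tuples, and np.sort accepts an order argument naming the field to sort by. The variable names person, by_age, and mean_weight are illustrative.

```python
import numpy as np

# Alternative dtype spelling: a list of (field name, type) tuples.
person = np.dtype([('name', 'U10'), ('age', 'i4'), ('weight', 'f8')])
data = np.array([('Alice', 25, 55.0), ('Bob', 45, 85.5),
                 ('Cathy', 37, 68.0), ('Doug', 19, 61.5)], dtype=person)

by_age = np.sort(data, order='age')      # sort whole records by one field
print(by_age['name'])

mean_weight = data['weight'].mean()      # a field behaves like an ordinary array
print(mean_weight)
```

Sorting by a field keeps each record intact, which is the link between the columns that three separate lists lack.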

Reference: Python Data Science Handbook
