Notes on some small tricks picked up while learning Python

Bayern-Xie

Last modified: 2022-08-23 14:38:48

Tags: python

First published: 2021-07-30 21:07:06

Article link: https://blog.csdn.net/bayern_xie/article/details/119256007

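1. The np.argwhere() function

2. The np.count_nonzero() function

3. The np.random.multivariate_normal() function

4. The plt.GridSpec() function

5. The DataFrame.rename() method

6. Differences between np.dot() and np.matmul(), and how each handles broadcasting

7. When the axes returned by plt.subplots() has more than one dimension

8. The np.meshgrid() function

9. Combining np.meshgrid() and np.vstack() to build coordinate pairs

10. Notes on pd.merge and DataFrame.join

11. Handy uses of zip

12. Using ndarray.transpose to tile a grid of small images into one large image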

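1. The np.argwhere() function

This function returns the indices of the array elements that satisfy a condition, one row of indices per matching element:

import numpy as np

x = np.arange(12).reshape(3, 4)
print(np.argwhere(x > 6))

Output: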

x
Out[4]: 
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

np.argwhere(x > 6)
Out[5]: 
array([[1, 3],
       [2, 0],
       [2, 1],
       [2, 2],
       [2, 3]])
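
As a small aside, the index pairs returned by np.argwhere() are easy to read but are not suited to indexing the array directly; np.nonzero() returns the same positions split into one index array per axis. The following is a minimal sketch that reuses the x from above; the names idx, rows, cols and values are chosen here for illustration only.

```python
import numpy as np

x = np.arange(12).reshape(3, 4)

# argwhere: one (row, col) pair per matching element, shape (n_matches, 2)
idx = np.argwhere(x > 6)

# nonzero: the same positions, but as one index array per axis,
# which can be used directly for fancy indexing
rows, cols = np.nonzero(x > 6)
values = x[rows, cols]

print(idx.tolist())     # [[1, 3], [2, 0], [2, 1], [2, 2], [2, 3]]
print(values.tolist())  # [7, 8, 9, 10, 11]
```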

2. The np.count_nonzero() function

Given a condition as its argument, this function counts how many elements satisfy that condition:

np.count_nonzero(x < 6)
Out[6]: 6

np.count_nonzero(x == 6)
Out[7]: 1

The condition inside the parentheses actually evaluates to a mask array made up entirely of True and False values; np.count_nonzero() treats True as 1 and False as 0.
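
To make the mask explicit, here is a minimal sketch on the same x. It also uses the axis argument of np.count_nonzero() to count per row; the variable names are illustrative.

```python
import numpy as np

x = np.arange(12).reshape(3, 4)
mask = x < 6                               # boolean array of True/False

total = np.count_nonzero(mask)             # number of True entries overall
via_sum = int(mask.sum())                  # True counts as 1, False as 0
per_row = np.count_nonzero(mask, axis=1)   # count within each row

print(total, via_sum, per_row.tolist())    # 6 6 [4, 2, 0]
```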

3. The np.random.multivariate_normal() function

mean = [0, 0]
cov = [[1, 1], [1, 2]]
x, y, z = np.random.multivariate_normal(mean, cov, (3, 2))
print(x)
print(x.shape)
print(y)
print(y.shape)
print(z)
print(z.shape)

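This function draws samples from a multivariate normal distribution. multivariate_normal appends N after the last dimension of the size argument, where N is the dimensionality of each sample (the length of mean). For example, if size is (x, y, z), the returned array has shape (x, y, z, N); unpacking it along the first axis gives x arrays of identical shape (y, z, N), so every position holds an N-dimensional random vector. The code above produces output such as: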

[[-0.01015427  0.81979567]
 [-1.02379448  0.23363184]]
(2, 2)
[[-0.71351785 -0.27438302]
 [ 1.50336145  1.94894636]]
(2, 2)
[[2.39060454 4.10847821]
 [0.25278962 0.28928115]]
(2, 2)

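When unpacking the result, the number of variables on the left must equal the first element of size (3 here, hence x, y, z). In this example that happens to equal len(mean) + 1, but it is not determined by the length of mean. Alternatively, assign the result to a single variable to receive the whole array.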
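
The shape rule can be checked directly. This is a minimal sketch using the same mean and cov as above; the seed is only there so that repeated runs draw the same numbers, and the names samples, xs and ys are illustrative.

```python
import numpy as np

np.random.seed(0)  # only so that repeated runs draw the same numbers
mean = [0, 0]
cov = [[1, 1], [1, 2]]

samples = np.random.multivariate_normal(mean, cov, (3, 2))
print(samples.shape)        # (3, 2, 2): size followed by len(mean)

# Unpacking iterates over the first axis, so 3 names are needed here
x, y, z = samples
print(x.shape)              # (2, 2)

# With a scalar size, transposing gives one array per coordinate
xs, ys = np.random.multivariate_normal(mean, cov, 3000).T
print(xs.shape, ys.shape)   # (3000,) (3000,)
```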

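4. The plt.GridSpec() function

This function creates subplot grids with irregular multi-row, multi-column layouts. The workflow is easy to follow: create a figure, create a grid object, then create one ax object per panel with add_subplot(). The base grid cells that each panel occupies are selected by slicing grid:

import matplotlib.pyplot as plt

fig = plt.figure()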
grid = plt.GridSpec(2, 3, wspace=0.4, hspace=0.3)
ax1 = fig.add_subplot(grid[0, 0])
ax2 = fig.add_subplot(grid[0, 1:])
ax3 = fig.add_subplot(grid[1, :2])
ax4 = fig.add_subplot(grid[1, 2])
plt.show()

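The result looks like this: [figure: 2 × 3 grid holding four subplots, two of which span two columns]

Note that the code above is equivalent to the following: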

grid = plt.GridSpec(2, 3, wspace=0.4, hspace=0.3)
plt.subplot(grid[0, 0])
plt.subplot(grid[0, 1:])
plt.subplot(grid[1, :2])
plt.subplot(grid[1, 2])
plt.show()

The following uses this function to build a multi-axes histogram:

mean = [0, 0]
cov = [[1, 1], [1, 2]]
x, y = np.random.multivariate_normal(mean, cov, 3000).T

fig = plt.figure(figsize=(6, 6))
grid= plt.GridSpec(4, 4, hspace=0.2, wspace=0.2)
main_ax = fig.add_subplot(grid[:-1, 1:])
y_hist = fig.add_subplot(grid[:-1, 0], xticklabels=[], sharey=main_ax)
x_hist = fig.add_subplot(grid[-1, 1:], yticklabels=[], sharex=main_ax)

main_ax.plot(x, y, 'ok', markersize=3, alpha=0.2)

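# orientation='vertical' means the bars run vertically, which is the usual look and the default, so it can be omitted;
# orientation='horizontal' means the bars run horizontally, and it must be given explicitly here, otherwise the side histogram looks wrong.
# invert_yaxis() reverses the direction of the y-axis, flipping the histogram upside down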
x_hist.hist(x, bins=40, histtype='stepfilled', orientation='vertical', color='gray')
x_hist.invert_yaxis()
y_hist.hist(y, bins=40, histtype='stepfilled', orientation='horizontal', color='gray')
y_hist.invert_xaxis()
plt.show()

The resulting figure: [figure: scatter plot of x and y with marginal histograms on the left and bottom]

5. The DataFrame.rename() method

This method renames any row or column labels of a pd.DataFrame object.

For example:

dt = pd.merge(areas, merged, on='state', how='right')
dt.dropna(inplace=True)
dt.rename(columns={'area (sq. mi)': 'area', 'state/region': 'abbrevs'}, inplace=True)

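Just pass the columns and/or index arguments; the renaming is given as a dict, i.e.:
columns={'old column name 1': 'new column name 1', 'old column name 2': 'new column name 2', ...}

index={'old row label 1': 'new row label 1', 'old row label 2': 'new row label 2', ...}

inplace=True means the DataFrame is modified in place rather than a renamed copy being returned.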
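
A minimal, self-contained sketch of the same call follows. The DataFrame contents and the row labels r0 and r1 are made up for illustration; only the column names are taken from the example above.

```python
import pandas as pd

# Hypothetical data, made up for illustration only
df = pd.DataFrame(
    {'area (sq. mi)': [1.0, 2.0], 'state/region': ['AA', 'BB']},
    index=['r0', 'r1'],
)

# Without inplace=True, rename returns a new DataFrame and leaves df untouched
renamed = df.rename(
    columns={'area (sq. mi)': 'area', 'state/region': 'abbrevs'},
    index={'r0': 'first'},
)

print(list(renamed.columns))  # ['area', 'abbrevs']
print(list(renamed.index))    # ['first', 'r1']
print(list(df.columns))       # ['area (sq. mi)', 'state/region']
```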

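6. Differences between np.dot() and np.matmul(), and how each handles broadcasting

For 2-D matrices np.dot and np.matmul give the same result; for arrays with 3 or more dimensions they differ.

np.matmul follows the usual batched rule. For 3-D inputs, for example, the first dimensions must be equal or one of them must be 1, and the last two dimensions must satisfy the ordinary matrix multiplication rule.

For example, with a=np.arange(12).reshape(2,2,3) and b=np.arange(24).reshape(2,3,4), a and b can be multiplied, and c=np.matmul(a,b) has shape (2,2,4):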

a = np.arange(12).reshape(2, 2, 3)
b = np.arange(24).reshape(2, 3, 4)
c = np.matmul(a, b)

a, b, c and c.shape are:

a
Out[7]:
array([[[ 0,  1,  2],
        [ 3,  4,  5]],
       [[ 6,  7,  8],
        [ 9, 10, 11]]])

b
Out[8]: 
array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],
       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])

c
Out[9]: 
array([[[ 20,  23,  26,  29],
        [ 56,  68,  80,  92]],
       [[344, 365, 386, 407],
        [488, 518, 548, 578]]])

c.shape
Out[10]: (2, 2, 4)
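
For comparison, a minimal sketch of what np.dot() does with the same two 3-D arrays: it contracts the last axis of a with the second-to-last axis of b for every pair of leading indices, so the result has one more dimension than the np.matmul() result. The names m, d and same are illustrative.

```python
import numpy as np

a = np.arange(12).reshape(2, 2, 3)
b = np.arange(24).reshape(2, 3, 4)

m = np.matmul(a, b)   # batched: m[i] = a[i] @ b[i]
d = np.dot(a, b)      # d[i, j, k, l] = sum(a[i, j, :] * b[k, :, l])

print(m.shape)        # (2, 2, 4)
print(d.shape)        # (2, 2, 2, 4)

# The matmul result is the "diagonal" of the dot result over the batch axes
same = all(np.array_equal(m[i], d[i, :, i, :]) for i in range(2))
print(same)           # True
```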

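Now for the broadcasting rules:

For ordinary ndarray operations, the rule is that if two arrays have different numbers of dimensions, the shape of the one with fewer dimensions is padded with 1s on the left.

np.matmul() treats a 1-D argument differently, and np.dot() gives the same result in this case. In a matrix-vector product, if one of the two arguments is 1-D, it is promoted to a matrix: an axis of length 1 is added on whichever side makes the matrix product valid (prepended for the first argument, appended for the second), and after the multiplication that length-1 axis is removed again. Roughly: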

import numpy as np
a = np.arange(12).reshape(3, 4)
b = np.array([0, 1, 2, 3])

a.shape
Out[28]: (3, 4)
b.shape
Out[29]: (4,)

np.matmul(a, b)
Out[30]: array([14, 38, 62])
np.dot(a, b)
Out[31]: array([14, 38, 62])
np.matmul(a, b).shape
Out[32]: (3,)
np.dot(a, b).shape
Out[33]: (3,)

As the code shows, in both calls b is effectively treated as having shape (4, 1), and the final result has shape (3,). If we swap the order of the operands:

np.matmul(b, a)
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3417, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-34-af3b88aa2232>", line 1, in <module>
    np.matmul(b, a)
ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 3 is different from 4)

This fails because b is now treated as having shape (1, 4), and a (1, 4) by (3, 4) product is not defined: the inner dimensions 4 and 3 do not match.
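
The promotion can be spelled out by hand. This is a minimal sketch with the same a and b; explicit, left and raised are illustrative names.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)
b = np.array([0, 1, 2, 3])

# Spelling out the promotion: make b a (4, 1) column, multiply, drop the axis
explicit = np.matmul(a, b[:, np.newaxis])[:, 0]
print(explicit.tolist())     # [14, 38, 62]

# With b on the left it is treated as (1, 4), so the other operand
# needs 4 rows; a.T has shape (4, 3)
left = np.matmul(b, a.T)
print(left.tolist())         # [14, 38, 62]

try:
    np.matmul(b, a)          # (1, 4) by (3, 4): inner sizes differ
    raised = False
except ValueError:
    raised = True
print(raised)                # True
```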

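7. When the axes returned by plt.subplots() has more than one dimension

When both x and y in fig, axes = plt.subplots(x, y) are greater than 1, axes is a 2-D ndarray of Axes objects. To use it with a single index or a simple loop, iterate over axes.flat, or flatten it to a 1-D array with axes = axes.ravel() (indexing as axes[i, j] also works):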

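import matplotlib.pyplot as plt
import seaborn as sns

# `data` is assumed to be an existing DataFrame with numeric columns 'x' and 'y'.
# Note: sns.distplot is deprecated since seaborn 0.11; sns.histplot is its replacement.
fig, axes = plt.subplots(2, 2, sharex=True, sharey=True)
axes = axes.ravel()  # flatten the 2 x 2 array of Axes into a 1-D array
sns.distplot(data['x'], rug=True, ax=axes[0])
sns.distplot(data['y'], ax=axes[1])
sns.distplot(data, kde=True, ax=axes[2])
sns.distplot(data, ax=axes[3])
plt.show()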
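
Since the snippet above depends on an external data object, here is a self-contained sketch of the three ways to reach the individual Axes. It assumes matplotlib is installed and selects the non-interactive Agg backend so that no window is needed; the plotted curves are arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np

fig, axes = plt.subplots(2, 2, sharex=True, sharey=True)
print(axes.shape)                  # (2, 2)

# Option 1: index the 2-D array directly
axes[0, 0].set_title("top left")

# Option 2: iterate over axes.flat (row-major order)
t = np.linspace(0, 1, 50)
for k, ax in enumerate(axes.flat):
    ax.plot(t, t ** (k + 1))

# Option 3: flatten once, then use a single integer index
flat_axes = axes.ravel()
print(flat_axes.shape)             # (4,)
print(flat_axes[2] is axes[1, 0])  # True

plt.close(fig)
```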

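8. The np.meshgrid() function

This function conveniently combines two 1-D arrays x and y into a grid of coordinate pairs:

For example, with x=[1, 2, 3] and y=[1, 2], np.meshgrid(x, y) describes 2 × 3 = 6 points with coordinates (1, 1), (2, 1), (3, 1), (1, 2), (2, 2), (3, 2). Here is what the returned X and Y look like: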

x = np.array([1, 2, 3])
y = np.array([1, 2])
X, Y = np.meshgrid(x, y)

X
Out[10]: 
array([[1, 2, 3],
       [1, 2, 3]])

Y
Out[11]: 
array([[1, 1, 1],
       [2, 2, 2]])

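X and Y can then be passed to any elementwise function. For example, the square root of the sum of the squared coordinates of each point is: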

Z = np.sqrt(X ** 2 + Y ** 2)
Z
Out[15]: 
array([[1.41421356, 2.23606798, 3.16227766],
       [2.23606798, 2.82842712, 3.60555128]])

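The six values are exactly the square roots of the sums of squared coordinates of the six points listed above. In short, np.meshgrid() is well suited to generating values along a third dimension or to processing every coordinate pair, and is commonly used in 3-D plotting.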

9. Combining np.meshgrid() and np.vstack() to build coordinate pairs

xlim = (-8, 8)
ylim = (-15, 5)
xg = np.linspace(xlim[0], xlim[-1], 60)
yg = np.linspace(ylim[0], ylim[-1], 40)
xx, yy = np.meshgrid(xg, yg)
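# np.vstack first converts each (N,) array to shape (1, N) and then concatenates them vertically
Xgrid = np.vstack([xx.ravel(), yy.ravel()]).T
# (the line above is equivalent to np.hstack([xx.ravel().reshape(-1, 1), yy.ravel().reshape(-1, 1)]))

In the code above, Xgrid ends up with shape (2400, 2). Each row holds the coordinates of one point, so Xgrid as a whole contains the x and y coordinates of all 60 × 40 = 2400 grid points. Combining np.meshgrid() and np.vstack() in this way yields the coordinate matrix of a regular grid of points over a rectangular region.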
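
The same construction on the small grid from section 8 makes the row layout easy to inspect. This is a minimal sketch; np.column_stack is shown as one more equivalent spelling, and the names points and points_cs are illustrative.

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([1, 2])
xx, yy = np.meshgrid(x, y)

# One row per grid point: (x-coordinate, y-coordinate)
points = np.vstack([xx.ravel(), yy.ravel()]).T
print(points.tolist())
# [[1, 1], [2, 1], [3, 1], [1, 2], [2, 2], [3, 2]]

# np.column_stack builds the same (n_points, 2) array in one call
points_cs = np.column_stack([xx.ravel(), yy.ravel()])
print(np.array_equal(points, points_cs))   # True
```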

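10. Notes on pd.merge and DataFrame.join

pd.merge usage falls into three cases:

1. The key columns have the same name and the same values: merge directly, passing only the two DataFrames: pd.merge(df1, df2);

2. The key columns have different names but hold the same values (or at least the same kind of values): pass left_on and right_on (on applies only when the column name is shared by both frames);

3. The key columns have the same name but their values only partly overlap: pass the how argument to choose the kind of join. The four values used here are inner (intersection of the keys), outer (union of the keys), left (only keys present in the left frame) and right (only keys present in the right frame).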
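
Cases 2 and 3 can be sketched together as follows. The two frames and their column names (employee, name, group, salary) are hypothetical and made up for illustration only.

```python
import pandas as pd

# Hypothetical frames, made up for illustration only
left = pd.DataFrame({'employee': ['Ann', 'Bob', 'Cid'],
                     'group': ['A', 'B', 'A']})
right = pd.DataFrame({'name': ['Ann', 'Bob', 'Dan'],
                      'salary': [10, 20, 30]})

# Case 2: the key columns have different names
inner = pd.merge(left, right, left_on='employee', right_on='name')
print(list(inner.columns))   # ['employee', 'group', 'name', 'salary']

# Case 3: the key values only partly overlap, so `how` changes the row count
sizes = {
    how: len(pd.merge(left, right, left_on='employee', right_on='name', how=how))
    for how in ['inner', 'outer', 'left', 'right']
}
print(sizes)   # {'inner': 2, 'outer': 4, 'left': 3, 'right': 3}
```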

Here is an interesting example:

df1 = pd.DataFrame([[1,2,3],[1,10,20],[5,6,7],[3,9,0],[8,0,3]],columns=['x1','x2','x3'])
df2 = pd.DataFrame([[1,2],[1,10],[1,3],[4,6],[3,9]],columns=['x1','x4'])
df1
Out[32]: 
   x1  x2  x3
0   1   2   3
1   1  10  20
2   5   6   7
3   3   9   0
4   8   0   3
df2
Out[33]: 
   x1  x4
0   1   2
1   1  10
2   1   3
3   4   6
4   3   9

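This is the third case: the column name is shared but the values only partly overlap. Merging directly uses the default how='inner', and the rows with x1 = 5, 8 and 4 all disappear: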

pd.merge(df1, df2)
Out[44]: 
   x1  x2  x3  x4
0   1   2   3   2
1   1   2   3  10
2   1   2   3   3
3   1  10  20   2
4   1  10  20  10
5   1  10  20   3
6   3   9   0   9

If we pass how='left', the result looks a bit like pd.MultiIndex.from_product(): only keys from the left frame are kept, but matching rows are cross-combined:

pd.merge(df1, df2, how='left')
Out[39]: 
   x1  x2  x3    x4
0   1   2   3   2.0
1   1   2   3  10.0
2   1   2   3   3.0
3   1  10  20   2.0
4   1  10  20  10.0
5   1  10  20   3.0
6   5   6   7   NaN
7   3   9   0   9.0
8   8   0   3   NaN

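Of the x1 values in df2, only 1 and 3 also appear in the x1 column of df1, so the row of df2 with x1 = 4 is dropped from the result. Next, df1 has two rows with x1 = 1, and each of them is combined with every row of df2 that has x1 = 1 (three rows), so the result contains 2 × 3 = 6 rows with x1 = 1, as shown above.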
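
One way to see where each row of a merge comes from is the indicator argument of pd.merge, which adds a _merge column. The sketch below applies it to the same df1 and df2 with how='outer'; the names outer and counts are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2, 3], [1, 10, 20], [5, 6, 7], [3, 9, 0], [8, 0, 3]],
                   columns=['x1', 'x2', 'x3'])
df2 = pd.DataFrame([[1, 2], [1, 10], [1, 3], [4, 6], [3, 9]],
                   columns=['x1', 'x4'])

# indicator=True adds a _merge column: 'both', 'left_only' or 'right_only'
outer = pd.merge(df1, df2, how='outer', indicator=True)
counts = outer['_merge'].value_counts()

print(len(outer))                  # 10
print(int(counts['both']))         # 7  (2 * 3 rows for x1 = 1, plus 1 row for x1 = 3)
print(int(counts['left_only']))    # 2  (x1 = 5 and x1 = 8)
print(int(counts['right_only']))   # 1  (x1 = 4)
```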

Now let's look at the DataFrame.join() method:

First, both df1.join(df2) and df1.join(df2, how='left') raise an error here:

df1.join(df2)
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3417, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-41-f8069890b6d0>", line 1, in <module>
    df1.join(df2)
  File "/opt/conda/lib/python3.8/site-packages/pandas/core/frame.py", line 8110, in join
    return self._join_compat(
  File "/opt/conda/lib/python3.8/site-packages/pandas/core/frame.py", line 8135, in _join_compat
    return merge(
  File "/opt/conda/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 89, in merge
    return op.get_result()
  File "/opt/conda/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 686, in get_result
    llabels, rlabels = _items_overlap_with_suffix(
  File "/opt/conda/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 2178, in _items_overlap_with_suffix
    raise ValueError(f"columns overlap but no suffix specified: {to_rename}")
ValueError: columns overlap but no suffix specified: Index(['x1'], dtype='object')

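Why the error? join() connects the frames on their row index, so the columns of both frames are carried into the result. df1 and df2 each have a column named x1, and for such a duplicate the suffixes must be specified, otherwise an error is raised. So it should be written like this: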

df1.join(df2, lsuffix='_c', rsuffix='_o')
Out[43]: 
   x1_c  x2  x3  x1_o  x4
0     1   2   3     1   2
1     1  10  20     1  10
2     5   6   7     1   3
3     3   9   0     4   6
4     8   0   3     3   9

df1.join(df2, lsuffix='_l', rsuffix='_r', how='right')
Out[47]: 
   x1_l  x2  x3  x1_r  x4
0     1   2   3     1   2
1     1  10  20     1  10
2     5   6   7     1   3
3     3   9   0     4   6
4     8   0   3     3   9

how defaults to 'left' in join. Why do left and right give the same result in this example? Because df1 and df2 have identical row indexes.

Next, the on parameter of join:

df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
                   'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
df
Out[49]: 
  key   A
0  K0  A0
1  K1  A1
2  K2  A2
3  K3  A3
4  K4  A4
5  K5  A5

other = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                      'B': ['B0', 'B1', 'B2']})
other
Out[52]: 
  key   B
0  K0  B0
1  K1  B1
2  K2  B2

Joining with only the suffix arguments:

df.join(other, lsuffix='_caller', rsuffix='_other')
Out[53]: 
  key_caller   A key_other    B
0         K0  A0        K0   B0
1         K1  A1        K1   B1
2         K2  A2        K2   B2
3         K3  A3       NaN  NaN
4         K4  A4       NaN  NaN
5         K5  A5       NaN  NaN

If we want to join on the key column, one way is to set it as the index in both df and other:

df.set_index('key').join(other.set_index('key'))
Out[54]: 
      A    B
key         
K0   A0   B0
K1   A1   B1
K2   A2   B2
K3   A3  NaN
K4   A4  NaN
K5   A5  NaN

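If we want to leave the left frame unchanged, only the key column of the right frame needs to be set as its index, and this is where the on parameter comes in: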

df.join(other.set_index('key'), on='key')
Out[55]: 
  key   A    B
0  K0  A0   B0
1  K1  A1   B1
2  K2  A2   B2
3  K3  A3  NaN
4  K4  A4  NaN
5  K5  A5  NaN

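In other words, when the left side joins on a column and the right side joins on its index, on specifies which column of the left frame is used.

The official documentation puts it this way: DataFrame.join() always uses other’s index but we can use any column in df.

That is, the calling DataFrame may be joined on one of its columns, but the right-hand DataFrame (other here) is always joined on its index.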
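
For reference, the join with on='key' above gives the same table as a left merge on the key column. The following minimal sketch checks this on the same df and other; joined and merged are illustrative names, and NaN is replaced by a placeholder string only so that the two results can be compared with ==.

```python
import pandas as pd

df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
                   'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
other = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                      'B': ['B0', 'B1', 'B2']})

# Left side joins on its 'key' column, right side on its index
joined = df.join(other.set_index('key'), on='key')

# The same result expressed as a column-on-column left merge
merged = pd.merge(df, other, on='key', how='left')

print(list(joined.columns))   # ['key', 'A', 'B']
print(joined.fillna('NA').values.tolist() == merged.fillna('NA').values.tolist())  # True
```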

11. Handy uses of zip

1. Extracting the columns of a 2-D list (matrix):

matrix = [[1,2,3],[4,5,6],[7,8,9]]
list(zip(*matrix))
Out[76]: [(1, 4, 7), (2, 5, 8), (3, 6, 9)]

2. Grouping the letters of several strings position by position:

nums = ['flower','flow','flight']
for i in zip(*nums):
    print(i)

Result:

('f', 'f', 'f')
('l', 'l', 'l')
('o', 'o', 'i')
('w', 'w', 'g')
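
One possible use of this position-by-position grouping is finding the longest common prefix of the strings. The helper below is a hypothetical illustration and not part of the original notes; it relies on zip stopping at the end of the shortest string.

```python
def common_prefix(words):
    """Return the longest prefix shared by every string in words."""
    prefix = []
    # zip(*words) yields one tuple of characters per position and stops
    # at the end of the shortest string
    for chars in zip(*words):
        if len(set(chars)) != 1:   # characters differ at this position
            break
        prefix.append(chars[0])
    return ''.join(prefix)

print(common_prefix(['flower', 'flow', 'flight']))   # fl
```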

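3. Pairing each list element with its successor:

l = ['a', 'b', 'c', 'd', 'e', 'f']
# print the list
print(l)
# in Python 3, zip returns an iterator, so wrap it in list() to see the pairs
print(list(zip(l[:-1], l[1:])))

Result: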

['a', 'b', 'c', 'd', 'e', 'f']
[('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', 'e'), ('e', 'f')]

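4. When training machine learning models, the dataset often needs to be shuffled; with zip() this can be done as follows:

import random
X = [1, 2, 3, 4, 5, 6]
y = [0, 1, 0, 0, 1, 1]
zipped_data = list(zip(X, y))
# pair each sample with its label, and convert to a list so that it can be shuffled

random.shuffle(zipped_data)
# random.shuffle shuffles the list in place and returns None

new_zipped_data = list(map(list, zip(*zipped_data)))
# zip(*...) unzips the pairs, map() converts each tuple to a list, list() materializes the result

new_X, new_y = new_zipped_data[0], new_zipped_data[1]
# the shuffled data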

print('X:',X,'\n','y:',y)
print('new_X:',new_X, '\n', 'new_y:',new_y)

Result:

# arrays before shuffling
X: [1, 2, 3, 4, 5, 6] 
y: [0, 1, 0, 0, 1, 1]
# arrays after shuffling
new_X: [5, 1, 3, 6, 2, 4] 
new_y: [1, 0, 0, 1, 1, 0]
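
The point of zipping before shuffling is that every sample keeps its label. The minimal sketch below checks this; it uses a seeded random.Random instance purely for repeatability, and label_of is a helper name introduced here for illustration.

```python
import random

X = [1, 2, 3, 4, 5, 6]
y = [0, 1, 0, 0, 1, 1]

pairs = list(zip(X, y))
random.Random(0).shuffle(pairs)     # seeded generator, for repeatability only

# zip(*pairs) "unzips" into a tuple of samples and a tuple of labels
new_X, new_y = map(list, zip(*pairs))

# Every sample still carries its original label
label_of = dict(zip(X, y))
print(all(label_of[a] == b for a, b in zip(new_X, new_y)))   # True
print(sorted(new_X) == X)                                    # True
```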

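12. Using ndarray.transpose to tile a grid of small images into one large image

ndarray.transpose can transpose an array or reorder its axes, so it can be used to combine a grid of small images, laid out in rows and columns, into a single large image.

For example, take the following array: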

from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X, y = mnist['data'], mnist['target']

X is an array of 70,000 images, each with 28 × 28 = 784 pixels:

X.shape
(70000, 784)

To draw a figure containing 10 rows and 10 columns of images, we can use transpose to assemble them into one large image:

instances = X[:100]
image_grid = instances.reshape(10, 10, 28, 28)
big_image = image_grid.transpose(0, 2, 1, 3).reshape(10 * 28, 10 * 28)

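Here image_grid arranges the 100 images into 10 rows and 10 columns. image_grid.transpose moves the grid-row axis (axis 0) next to the row axis of each small image (axis 2), and the grid-column axis (axis 1) next to the column axis of each small image (axis 3), so that the final reshape turns all the small images into one 280 × 280 image: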

plt.figure(figsize=(10, 10))
plt.imshow(big_image, cmap='binary')
plt.axis('off')
plt.show()

The result looks like this: [figure: 10 × 10 grid of MNIST digit images]
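
The axis shuffle is easier to follow on a tiny case. The sketch below uses four made-up 2 × 2 tiles instead of MNIST, with tile k filled with the value k, and compares the transposed result with a plain reshape; all names are illustrative.

```python
import numpy as np

# Four 2 x 2 "images"; tile k is filled with the value k
tiles = np.repeat(np.arange(4), 4).reshape(4, 2, 2)

# Arrange them in a 2 x 2 grid: (grid_row, grid_col, pixel_row, pixel_col)
grid = tiles.reshape(2, 2, 2, 2)

# Bring grid_row next to pixel_row and grid_col next to pixel_col, then merge
big = grid.transpose(0, 2, 1, 3).reshape(2 * 2, 2 * 2)
print(big.tolist())
# [[0, 0, 1, 1], [0, 0, 1, 1], [2, 2, 3, 3], [2, 2, 3, 3]]

# Without the transpose, the pixels of each tile end up on a single row
naive = grid.reshape(4, 4)
print(naive.tolist())
# [[0, 0, 0, 0], [1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3]]
```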

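13. np.average() returns nan when the data contains nan, whereas the pandas mean() methods (Series.mean(), DataFrame.mean()) skip nan by default and average only the actual values.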
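
A minimal sketch of the difference, on made-up values. np.nanmean and the skipna argument are included as the usual ways to get the other behaviour on each side.

```python
import math
import numpy as np
import pandas as pd

values = [1.0, np.nan, 3.0]   # made-up data containing one nan

np_avg = np.average(values)                        # nan propagates
np_nan = np.nanmean(values)                        # NumPy variant that ignores nan
pd_mean = pd.Series(values).mean()                 # skipna=True is the default
pd_strict = pd.Series(values).mean(skipna=False)   # nan propagates again

print(np_avg, np_nan, pd_mean, pd_strict)          # nan 2.0 2.0 nan
```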
