本篇主要是pandas50练习题的进阶部分,也就是23~31题,难度已经逐渐上来了,尤其是从26题开始已经有较多逻辑的融合需要理解,后续更新++++++++
import pandas as pd
import numpy as np
23.有一列整数列A的DatraFrame,删除数值重复的行
df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})
df
A | |
---|---|
0 | 1 |
1 | 2 |
2 | 2 |
3 | 3 |
4 | 4 |
5 | 5 |
6 | 5 |
7 | 5 |
8 | 6 |
9 | 7 |
10 | 7 |
df1 = df.loc[df['A'].shift() != df['A']]
df1
A | |
---|---|
0 | 1 |
1 | 2 |
3 | 3 |
4 | 4 |
5 | 5 |
8 | 6 |
9 | 7 |
df2 = df.drop_duplicates(subset='A')
df2
A | |
---|---|
0 | 1 |
1 | 2 |
3 | 3 |
4 | 4 |
5 | 5 |
8 | 6 |
9 | 7 |
24.一个全数值DatraFrame,每个数字减去该行的平均数
df = pd.DataFrame(np.random.random(size=(5,3)))
df
0 | 1 | 2 | |
---|---|---|---|
0 | 0.860984 | 0.762510 | 0.880017 |
1 | 0.364329 | 0.022123 | 0.755357 |
2 | 0.324832 | 0.944399 | 0.585825 |
3 | 0.937763 | 0.241015 | 0.798840 |
4 | 0.498158 | 0.584322 | 0.634986 |
df3 = df.sub(df.mean(axis=1),axis=0)
df3
0 | 1 | 2 | |
---|---|---|---|
0 | 0.026480 | -0.071993 | 0.045513 |
1 | -0.016274 | -0.358480 | 0.374754 |
2 | -0.293520 | 0.326047 | -0.032527 |
3 | 0.278557 | -0.418191 | 0.139634 |
4 | -0.074330 | 0.011834 | 0.062497 |
25.一个有5列的DataFrame,求哪一列的和最小
df = pd.DataFrame(np.random.random(size=(5,5)),columns=list('abcde'))
print(df)
a b c d e
0 0.364215 0.569260 0.625360 0.616201 0.366357
1 0.310033 0.669937 0.403421 0.632327 0.252816
2 0.698762 0.510754 0.946358 0.802647 0.855209
3 0.346315 0.361260 0.675928 0.980343 0.906240
4 0.986167 0.037916 0.235484 0.712136 0.207447
df.sum().idxmin()
#可以自己试一下顺序:df.sum()>>>df.sum().min()>>>df.sum().idxmin()
'b'
26.给定DataFrame,求A列每个值的前3大的B的和
df = pd.DataFrame({'A': list('aaabbcaabcccbbc'),
'B': [12,345,3,1,45,14,4,52,54,23,235,21,57,3,87]})
print(df)
A B
0 a 12
1 a 345
2 a 3
3 b 1
4 b 45
5 c 14
6 a 4
7 a 52
8 b 54
9 c 23
10 c 235
11 c 21
12 b 57
13 b 3
14 c 87
df1 = df.groupby('A')['B'].nlargest(3).sum(level=0)
df1
A
a 409
b 156
c 345
Name: B, dtype: int64
27.给定DataFrame,有列A, B,A的值在1-100(含),对A列每10步长,求对应的B的和
df = pd.DataFrame({'A': [1,2,11,11,33,34,35,40,79,99],
'B': [1,2,11,11,33,34,35,40,79,99]})
print(df)
df1 = df.groupby(pd.cut(df['A'], np.arange(0, 101, 10)))['B'].sum()
print(df1)
A B
0 1 1
1 2 2
2 11 11
3 11 11
4 33 33
5 34 34
6 35 35
7 40 40
8 79 79
9 99 99
A
(0, 10] 3
(10, 20] 22
(20, 30] 0
(30, 40] 142
(40, 50] 0
(50, 60] 0
(60, 70] 0
(70, 80] 79
(80, 90] 0
(90, 100] 99
Name: B, dtype: int64
28.给定DataFrame,计算每个元素至左边最近的0(或者至开头)的距离,生成新列y
df = pd.DataFrame({'X': [7, 2, 0, 3, 4, 2, 5, 0, 3, 4]})
izero = np.r_[-1, (df['X'] == 0).to_numpy().nonzero()[0]] # 标记0的位置
idx = np.arange(len(df))
df['Y'] = idx - izero[np.searchsorted(izero - 1, idx) - 1]
print(df)
# 方法二
# x = (df['X'] != 0).cumsum()
# y = x != x.shift()
# df['Y'] = y.groupby((y != y.shift()).cumsum()).cumsum()
# 方法三
# df['Y'] = df.groupby((df['X'] == 0).cumsum()).cumcount()
#first_zero_idx = (df['X'] == 0).idxmax()
# df['Y'].iloc[0:first_zero_idx] += 1
X Y
0 7 1
1 2 2
2 0 0
3 3 1
4 4 2
5 2 3
6 5 4
7 0 0
8 3 1
9 4 2
29.一个全数值的DataFrame,返回最大3值的坐标
df = pd.DataFrame(np.random.random(size=(5, 3)))
print(df)
0 1 2
0 0.655695 0.208415 0.770685
1 0.538737 0.653572 0.423824
2 0.324956 0.747991 0.128974
3 0.894755 0.053863 0.039284
4 0.866307 0.405564 0.889443
df.unstack().sort_values()[-3:].index.tolist()
[(0, 4), (2, 4), (0, 3)]
30.给定DataFrame,将负值代替为同组的平均值
df = pd.DataFrame({'grps': list('aaabbcaabcccbbc'),
'vals': [-12,345,3,1,45,14,4,-52,54,23,-235,21,57,3,87]})
print(df)
grps vals
0 a -12
1 a 345
2 a 3
3 b 1
4 b 45
5 c 14
6 a 4
7 a -52
8 b 54
9 c 23
10 c -235
11 c 21
12 b 57
13 b 3
14 c 87
def replace(group):
mask = group<0
group[mask] = group[~mask].mean()
return group
df['vals'] = df.groupby(['grps'])['vals'].transform(replace)
print(df)
grps vals
0 a 117.333333
1 a 345.000000
2 a 3.000000
3 b 1.000000
4 b 45.000000
5 c 14.000000
6 a 4.000000
7 a 117.333333
8 b 54.000000
9 c 23.000000
10 c 36.250000
11 c 21.000000
12 b 57.000000
13 b 3.000000
14 c 87.000000
31.计算3位滑动窗口的平均值,忽略NAN
df = pd.DataFrame({'group': list('aabbabbbabab'),
'value': [1, 2, 3, np.nan, 2, 3, np.nan, 1, 7, 3, np.nan, 8]})
print(df)
g1 = df.groupby(['group'])['value']
g2 = df.fillna(0).groupby(['group'])['value']
s = g2.rolling(3, min_periods=1).sum() / g1.rolling(3, min_periods=1).count()
s.reset_index(level=0, drop=True).sort_index()
group value
0 a 1.0
1 a 2.0
2 b 3.0
3 b NaN
4 a 2.0
5 b 3.0
6 b NaN
7 b 1.0
8 a 7.0
9 b 3.0
10 a NaN
11 b 8.0
0 1.000000
1 1.500000
2 3.000000
3 3.000000
4 1.666667
5 3.000000
6 3.000000
7 2.000000
8 3.666667
9 2.000000
10 4.500000
11 4.000000
Name: value, dtype: float64