pandas基础学习
数据的过滤(filter)
选取数据框df中对应行的数值大于0.5的全部列
df = pd.DataFrame(np.random.rand(10,5),columns=list('ABCDE'))
df[df['A'] > 0.5]
'''
A B C D E
0 0.534886 0.863546 0.236718 0.326766 0.415460
2 0.953931 0.070198 0.483749 0.922528 0.295505
8 0.880175 0.056811 0.520499 0.533152 0.548145
'''
选取数据框df中对应行的数值大于0.5,并且小于0.7的全部列
df = pd.DataFrame(np.random.rand(10,5),columns=list('ABCDE'))
df[(df['C'] > 0.5) & (df['D'] < 0.7)]
'''
A B C D E
2 0.953112 0.174517 0.645300 0.308216 0.171177
6 0.853087 0.863079 0.701823 0.354019 0.311754
'''
排序(sort)
按照数据框的列col1升序(ascending)的方式对数据框df做排序
df = pd.DataFrame(np.random.rand(10,5),columns=list('ABCDE'))
df.sort_values('E')
'''
A B C D E
3 0.024096 0.623842 0.775949 0.828343 0.317729
6 0.220055 0.381614 0.463676 0.762644 0.391758
4 0.589411 0.727439 0.064528 0.319521 0.413518
1 0.878490 0.229301 0.699506 0.726879 0.464106
8 0.438101 0.970649 0.050256 0.697440 0.499057
9 0.566100 0.558798 0.723253 0.254244 0.524486
7 0.613603 0.933109 0.677036 0.808160 0.544953
5 0.079326 0.711673 0.266434 0.910628 0.816783
2 0.132114 0.145395 0.908436 0.521271 0.889645
0 0.432677 0.216837 0.203532 0.093214 0.977671
'''
按照数据框的列col2降序(descending)的方式对数据框df做排序
df = pd.DataFrame(np.random.rand(10,5),columns=list('ABCDE'))
df.sort_values('A',ascending=False)
'''
A B C D E
9 0.977172 0.930607 0.889285 0.475032 0.031715
0 0.864511 0.229990 0.678612 0.042491 0.148123
2 0.694747 0.580891 0.817524 0.392417 0.055003
6 0.684327 0.802028 0.862043 0.241838 0.800401
7 0.612324 0.099445 0.714120 0.215054 0.280343
8 0.441434 0.315553 0.564762 0.800143 0.330030
1 0.438734 0.161109 0.610750 0.647330 0.792404
4 0.365880 0.710768 0.344320 0.998757 0.979497
3 0.202511 0.769728 0.575057 0.511384 0.696753
5 0.029527 0.560114 0.224787 0.086291 0.318322
'''
按照数据框的列col1升序,col2降序的方式对数据框df做排序
df = pd.DataFrame(np.random.rand(10,5),columns=list('ABCDE'))
df.sort_values(['A','E'],ascending=[True,False])
'''
A B C D E
6 0.075863 0.696980 0.648945 0.336977 0.113122
2 0.199316 0.632063 0.787358 0.133175 0.060568
5 0.242081 0.818550 0.618439 0.215761 0.924459
7 0.261237 0.400725 0.659224 0.555746 0.132572
0 0.390540 0.358432 0.754028 0.194403 0.889624
8 0.410481 0.463811 0.343021 0.736340 0.291121
4 0.578705 0.544711 0.881707 0.396593 0.414465
3 0.600541 0.459247 0.591303 0.027464 0.496864
9 0.720029 0.419921 0.740225 0.904391 0.226958
1 0.777955 0.992290 0.144495 0.600207 0.647018
'''
分组(groupby)
按照某列对数据框df做分组
df = pd.DataFrame({'A':np.array(['foo','foo','foo','foo','bar','bar']),
'B':np.array(['one','one','two','two','three','three']),
'C':np.array(['small','medium','large','large','small','small']),
'D':np.array([1,2,2,3,3,5])})
df.groupby('A').count()
'''
B C D
A
bar 2 2 2
foo 4 4 4
'''
按照列col1和col2对数据框df做分组
df = pd.DataFrame({'A':np.array(['foo','foo','foo','foo','bar','bar']),
'B':np.array(['one','one','two','two','three','three']),
'C':np.array(['small','medium','large','large','small','small']),
'D':np.array([1,2,2,3,3,5])})
df.groupby(['B','C']).sum()
'''
D
B C
one medium 2
small 1
three small 8
two large 5
'''
按照列col1对数据框df做分组处理后,返回对应的col2的平均值
df = pd.DataFrame({'A':np.array(['foo','foo','foo','foo','bar','bar']),
'B':np.array(['one','one','two','two','three','three']),
'C':np.array(['small','medium','large','large','small','small']),
'D':np.array([1,2,2,3,3,5])})
df.groupby('B')['D'].mean()
'''
B
one 1.5
three 4.0
two 2.5
Name: D, dtype: float64
'''
做透视表,索引为col1,针对的数值列为col2和col3,分组函数为平均值
df = pd.DataFrame({'A':np.array(['foo','foo','foo','foo','bar','bar']),
'B':np.array(['one','one','two','two','three','three']),
'C':np.array(['small','medium','large','large','small','small']),
'D':np.array([1,2,2,3,3,5])})
df.pivot_table(df,index=['A','B'],columns=['C'],aggfunc=np.sum)
'''
D
C large medium small
A B
bar three NaN NaN 8.0
foo one NaN 2.0 1.0
two 5.0 NaN
'''
df = pd.DataFrame({'A':np.array(['foo','foo','foo','foo','bar','bar']),
'B':np.array(['one','one','two','two','three','three']),
'C':np.array(['small','medium','large','large','small','small']),
'D':np.array([1,2,2,3,3,5])})
df.groupby('A').agg(np.mean)
'''
D
A
bar 4
foo 2
'''
对数据框df的每一列求平均值
df = pd.DataFrame(np.random.rand(10,5),columns=list('ABCDE'))
df.apply(np.mean)
'''
A 0.388075
B 0.539564
C 0.607983
D 0.518634
E 0.482960
dtype: float64
''
对数据框df的每一行求最大值
df = pd.DataFrame(np.random.rand(10,5),columns=list('ABCDE'))
df.apply(np.max,axis=1)
'''
0 0.904163
1 0.804519
2 0.924102
3 0.761781
4 0.952084
5 0.923679
6 0.796320
7 0.582907
8 0.761310
9 0.893564
dtype: float64
'''
数据的连接(join)
在数据框df2的末尾添加数据框df1,其中df1和df2的列数应该相等
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']},
index=[0, 1, 2, 3])
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
'B': ['B4', 'B5', 'B6', 'B7'],
'C': ['C4', 'C5', 'C6', 'C7'],
'D': ['D4', 'D5', 'D6', 'D7']},
index=[4, 5, 6, 7])
df1.append(df2)
'''
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
4 A4 B4 C4 D4
5 A5 B5 C5 D5
6 A6 B6 C6 D6
7 A7 B7 C7 D7
'''
在数据框df1的列最后添加数据框df2,其中df1和df2的行数应该相等
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']},
index=[0, 1, 2, 3])
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
'B': ['B4', 'B5', 'B6', 'B7'],
'C': ['C4', 'C5', 'C6', 'C7'],
'D': ['D4', 'D5', 'D6', 'D7']},
index=[4, 5, 6, 7])
pd.concat([df1,df2],axis=1)
'''
A B C D A B C D
0 A0 B0 C0 D0 NaN NaN NaN NaN
1 A1 B1 C1 D1 NaN NaN NaN NaN
2 A2 B2 C2 D2 NaN NaN NaN NaN
3 A3 B3 C3 D3 NaN NaN NaN NaN
4 NaN NaN NaN NaN A4 B4 C4 D4
5 NaN NaN NaN NaN A5 B5 C5 D5
6 NaN NaN NaN NaN A6 B6 C6 D6
7 NaN NaN NaN NaN A7 B7 C7 D7
'''
组合(combine)
对数据框df1和df2做内连接,其中连接的列为col1
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'key': ['K0', 'K1', 'K0', 'K1']})
df2 = pd.DataFrame({'C': ['C0', 'C1'],
'D': ['D0', 'D1']},
index=['K0', 'K1'])
df1.join(df2, on='key')
'''
A B key C D
0 A0 B0 K0 C0 D0
1 A1 B1 K1 C1 D1
2 A2 B2 K0 C0 D0
3 A3 B3 K1 C1 D1
'''
数据的统计
得到数据框df每一列的描述性统计
df = pd.DataFrame(np.random.rand(10,5),columns=list('ABCDE'))
df.describe()
'''
A B C D E
count 10.000000 10.000000 10.000000 10.000000 10.000000
mean 0.398648 0.451699 0.443472 0.739478 0.412954
std 0.330605 0.221586 0.303084 0.308798 0.262148
min 0.004457 0.188689 0.079697 0.113562 0.052935
25% 0.088177 0.270355 0.205663 0.715005 0.205685
50% 0.315533 0.457229 0.332148 0.885872 0.400232
75% 0.749716 0.497208 0.737900 0.948651 0.634670
max 0.782956 0.825671 0.851065 0.962922 0.815447
'''
得到数据框df中每一列的平均值
df = pd.DataFrame(np.random.rand(10,5),columns=list('ABCDE'))
df.mean()
'''
A 0.395643
B 0.528812
C 0.692011
D 0.446750
E 0.544759
dtype: float64
'''
得到数据框df中每一列与其他列的相关系数
df = pd.DataFrame(np.random.rand(10,5),columns=list('ABCDE'))
df.corr()
'''
A B C D E
A 1.000000 -0.634931 -0.354824 -0.354131 0.170957
B -0.634931 1.000000 0.225222 -0.338124 -0.043300
C -0.354824 0.225222 1.000000 0.098285 0.297133
D -0.354131 -0.338124 0.098285 1.000000 -0.324209
E 0.170957 -0.043300 0.297133 -0.324209 1.000000
'''
得到数据框df中每一列的非空值个数
df = pd.DataFrame(np.random.rand(10,5),columns=list('ABCDE'))
df.count()
'''
A 10
B 10
C 10
D 10
E 10
dtype: int64
'''
得到数据框df中每一列的最大值
df = pd.DataFrame(np.random.rand(10,5),columns=list('ABCDE'))
df.max()
'''
A 0.933848
B 0.730197
C 0.921751
D 0.715280
E 0.940010
dtype: float64
'''
得到数据框df中每一列的最小值
df = pd.DataFrame(np.random.rand(10,5),columns=list('ABCDE'))
df.min()
'''
A 0.107516
B 0.001635
C 0.024502
D 0.092810
E 0.019898
dtype: float64
'''
得到数据框df中每一列的中位数
df = pd.DataFrame(np.random.rand(10,5),columns=list('ABCDE'))
df.median()
'''
A 0.497591
B 0.359854
C 0.661607
D 0.342418
E 0.588468
dtype: float64
'''
得到数据框df中每一列的标准差
df = pd.DataFrame(np.random.rand(10,5),columns=list('ABCDE'))
df.std()
'''
A 0.231075
B 0.286691
C 0.276511
D 0.304167
E 0.272570
dtype: float64
'''