本文主要介绍pandas数据转换操作
1.分组
In [42]: df = DataFrame({'key1':['a','a','b','b','a'],'key2':['one','two','one','two','one'],'data1':np.random.randn(5)
...: ,'data2':np.random.randn(5)})
In [43]: df
Out[43]:
data1 data2 key1 key2
0 0.762448 0.816634 a one
1 1.412613 0.867923 a two
2 0.899297 -1.049657 b one
3 0.912080 0.628012 b two
4 -0.549258 -1.327614 a one
In [44]: grouped = df['data1'].groupby(df['key1']) #按key1列分组,计算data1列的平均值
In [45]: grouped
Out[45]: <pandas.core.groupby.SeriesGroupBy object at 0x00000000073C97F0>
In [46]: grouped.mean()
Out[46]:
key1
a 0.541935
b 0.905688
Name: data1, dtype: float64
In [48]: df['data1'].groupby([df['key1'],df['key2']]).mean()
Out[48]:
key1 key2
a one 0.106595
two 1.412613
b one 0.899297
two 0.912080
Name: data1, dtype: float64
In [49]: df.groupby('key1').mean() #根据列名分组
Out[49]:
data1 data2
key1
a 0.541935 0.118981
b 0.905688 -0.210822
In [50]: df.groupby(['key1','key2']).mean()
Out[50]:
data1 data2
key1 key2
a one 0.106595 -0.255490
two 1.412613 0.867923
b one 0.899297 -1.049657
two 0.912080 0.628012
In [51]: df.groupby('key1')['data1'].mean() #选取部分列进行聚合
Out[51]:
key1
a 0.541935
b 0.905688
Name: data1, dtype: float64
In [52]: df.groupby(['key1','key2'])['data1'].mean()
Out[52]:
key1 key2
a one 0.106595
two 1.412613
b one 0.899297
two 0.912080
Name: data1, dtype: float64
In [53]: people = DataFrame(np.random.randn(5,5),columns = ['a','b','c','d','e'],index = ['Joe','Steve','Wes','Jim','Tr
...: avis'])
In [54]: people
Out[54]:
a b c d e
Joe 0.223628 -0.282831 0.368583 0.246665 -0.815742
Steve 0.662181 0.187961 0.515883 -2.021429 -0.624596
Wes -1.009086 0.450082 -0.819855 -1.626971 0.632064
Jim 1.593881 0.803760 -0.209345 -1.295325 -0.553693
Travis -0.041911 1.115285 -1.648207 0.521751 -0.414183
In [55]: mapping = {'a':'red','b':'red','c':'blue','d':'blue','e':'red','f':'orange'}
In [56]: map_series = Series(mapping)
In [57]: map_series
Out[57]:
a red
b red
c blue
d blue
e red
f orange
dtype: object
In [58]: people.groupby(map_series,axis = 1).count() #根据series分组
Out[58]:
blue red
Joe 2 3
Steve 2 3
Wes 2 3
Jim 2 3
Travis 2 3
In [59]: by_columns = people.groupby(mapping,axis =1) #根据字典分组
In [60]: by_columns.sum()
Out[60]:
blue red
Joe 0.615248 -0.874945
Steve -1.505546 0.225546
Wes -2.446826 0.073060
Jim -1.504670 1.843948
Travis -1.126456 0.659191
In [61]: people.groupby(len).sum() #根据函数分组
Out[61]:
a b c d e
3 0.808423 0.971012 -0.660617 -2.675632 -0.737371
5 0.662181 0.187961 0.515883 -2.021429 -0.624596
6 -0.041911 1.115285 -1.648207 0.521751 -0.414183
In [63]: columns = pd.MultiIndex.from_arrays([['US','US','US','JP','JP'],[1,3,5,1,3]],names = ['city','tennor'])
In [65]: df1 = DataFrame(np.random.randn(4,5),columns = columns)
In [66]: df1
Out[66]:
city US JP
tennor 1 3 5 1 3
0 1.103548 1.087425 0.717741 -0.354419 1.294512
1 -0.247544 -1.247665 1.340309 1.337957 0.528693
2 2.168903 -0.124958 0.367158 0.478355 -0.828126
3 -0.078540 -3.062132 -2.095675 -0.879590 -0.020314
In [67]: df1.groupby(level = 'city',axis = 1).count() #根据索引级别分组
Out[67]:
city JP US
0 2 3
1 2 3
2 2 3
2.透视表
pandas为我们提供了实现数据透视表功能的函数pivot_table(),该函数参数说明如下:
参数 | 说明 |
---|---|
data | 需要进行透视的数据 |
value | 指定需要聚合的字段 |
index | 指定值为行索引 |
columns | 指定值为列索引 |
aggfunc | 聚合函数 |
fill_value | 常量替换缺失值,默认不替换 |
margins | 是否合并,默认否 |
dropna | 是否观测缺失值,默认是 |
举例:
In [68]: dic = {'Name':['LiuShunxiang','Zhangshan','ryan'],
...: ...: 'Sex':['M','F','F'],
...: ...: 'Age':[27,23,24],
...: ...: 'Height':[165.7,167.2,154],
...: ...: 'Weight':[61,63,41]}
...:
In [69]: student = pd.DataFrame(dic)
In [70]: student
Out[70]:
Age Height Name Sex Weight
0 27 165.7 LiuShunxiang M 61
1 23 167.2 Zhangshan F 63
2 24 154.0 ryan F 41
In [71]: pd.pivot_table(student,values = ['Height'],columns = ['Sex']) #'Height'作为数值变量,'Sex'作为分组变量
Out[71]:
Sex F M
Height 160.6 165.7
In [72]: pd.pivot_table(student,values = ['Height','Weight'],columns = ['Sex','Age'])
Out[72]:
Sex Age
Height F 23 167.2
24 154.0
M 27 165.7
Weight F 23 63.0
24 41.0
M 27 61.0
dtype: float64
In [73]: pd.pivot_table(student,values = ['Height','Weight'],columns = ['Sex','Age']).unstack()
Out[73]:
Age 23 24 27
Sex
Height F 167.2 154.0 NaN
M NaN NaN 165.7
Weight F 63.0 41.0 NaN
M NaN NaN 61.0
In [74]: pd.pivot_table(student,values = ['Height','Weight'],columns = ['Sex'],aggfunc = [np.mean,np.median,np.std])
Out[74]:
mean median std
Sex F M F M F M
Height 160.6 165.7 160.6 165.7 9.333810 NaN
Weight 52.0 61.0 52.0 61.0 15.556349 NaN