Python Data Analysis, Part 3: An Overview of Pandas
1. Pandas Data Structures
Pandas is a powerful toolkit for analyzing structured data.
It is built on top of NumPy, which provides high-performance array operations.
It also offers data-cleaning facilities.
1.1 Series
A Series is a one-dimensional, array-like object.
import pandas as pd

ser_obj = pd.Series(range(10, 20))
print(type(ser_obj))
print(ser_obj.values)
print(ser_obj.index)
<class 'pandas.core.series.Series'>
[10 11 12 13 14 15 16 17 18 19]
RangeIndex(start=0, stop=10, step=1)
print(ser_obj.head(3))
0 10
1 11
2 12
dtype: int64
print(ser_obj)
0 10
1 11
2 12
3 13
4 14
5 15
6 16
7 17
8 18
9 19
dtype: int64
print(ser_obj[0])
print(ser_obj[8])
10
18
print(ser_obj * 2)
print(ser_obj > 15)
0 20
1 22
2 24
3 26
4 28
5 30
6 32
7 34
8 36
9 38
dtype: int64
0 False
1 False
2 False
3 False
4 False
5 False
6 True
7 True
8 True
9 True
dtype: bool
year_data = {2001: 17.8, 2002: 20.1, 2003: 16.5}
ser_obj2 = pd.Series(year_data)
print(ser_obj2.head())
print(ser_obj2.index)
2001 17.8
2002 20.1
2003 16.5
dtype: float64
Int64Index([2001, 2002, 2003], dtype='int64')
ser_obj2.name = 'temp'
ser_obj2.index.name = 'year'
print(ser_obj2.head())
year
2001 17.8
2002 20.1
2003 16.5
Name: temp, dtype: float64
1.2 DataFrame
A DataFrame is a two-dimensional, table-like structure with both a row index and a column index.
import numpy as np

array = np.random.randn(5, 4)
print(array)
df_obj = pd.DataFrame(array)
print(df_obj.head())
[[-0.81322788 -1.51137049 -1.06359499  1.28648638]
 [ 1.31333252  0.25461725 -0.7915332   0.75370275]
 [-0.09241866 -1.29103143 -0.32951959 -0.76066101]
 [-0.40326652  1.04621326  0.99044616 -0.35491204]
 [ 2.8170386  -0.58422902  2.84249502 -0.62538581]]
          0         1         2         3
0 -0.813228 -1.511370 -1.063595  1.286486
1  1.313333  0.254617 -0.791533  0.753703
2 -0.092419 -1.291031 -0.329520 -0.760661
3 -0.403267  1.046213  0.990446 -0.354912
4  2.817039 -0.584229  2.842495 -0.625386
dict_data = {'A': 1.,
             'B': pd.Timestamp('20161217'),
             'C': pd.Series(1, index=list(range(4)), dtype='float32'),
             'D': np.array([3] * 4, dtype='int32'),
             'E': pd.Categorical(["Python", "Java", "C++", "C#"]),
             'F': 'ChinaHadoop'}
df_obj2 = pd.DataFrame(dict_data)
print(df_obj2.head())
A B C D E F
0 1.0 2016-12-17 1.0 3 Python ChinaHadoop
1 1.0 2016-12-17 1.0 3 Java ChinaHadoop
2 1.0 2016-12-17 1.0 3 C++ ChinaHadoop
3 1.0 2016-12-17 1.0 3 C# ChinaHadoop
print(df_obj2['A'])
print(type(df_obj2['A']))
print(df_obj2.A)
0 1.0
1 1.0
2 1.0
3 1.0
Name: A, dtype: float64
<class 'pandas.core.series.Series'>
0 1.0
1 1.0
2 1.0
3 1.0
Name: A, dtype: float64
df_obj2['G'] = df_obj2['D'] + 4
print(df_obj2.head())
     A          B    C  D       E            F  G
0  1.0 2016-12-17  1.0  3  Python  ChinaHadoop  7
1  1.0 2016-12-17  1.0  3    Java  ChinaHadoop  7
2  1.0 2016-12-17  1.0  3     C++  ChinaHadoop  7
3  1.0 2016-12-17  1.0  3      C#  ChinaHadoop  7
del df_obj2['G']
print(df_obj2.head())
     A          B    C  D       E            F
0  1.0 2016-12-17  1.0  3  Python  ChinaHadoop
1  1.0 2016-12-17  1.0  3    Java  ChinaHadoop
2  1.0 2016-12-17  1.0  3     C++  ChinaHadoop
3  1.0 2016-12-17  1.0  3      C#  ChinaHadoop
1.3 Index
The indexes of both Series and DataFrame are Index objects. An Index is immutable, which keeps the underlying data safe to share.
Common Index types:
Index
Int64Index
MultiIndex, a hierarchical index
DatetimeIndex, for timestamps
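Because an Index is immutable, assigning into it raises an error; to relabel, you assign a whole new index instead. A quick sketch:

```python
import pandas as pd

ser = pd.Series(range(3), index=['a', 'b', 'c'])
try:
    ser.index[0] = 'z'          # Index objects do not support item assignment
except TypeError as err:
    print('TypeError:', err)

# to change labels, replace the index as a whole
ser.index = ['x', 'y', 'z']
print(ser.index.tolist())
```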
print(type(ser_obj.index))
print(type(df_obj2.index))
print(df_obj2.index)
<class 'pandas.core.indexes.range.RangeIndex'>
<class 'pandas.core.indexes.numeric.Int64Index'>
Int64Index([0, 1, 2, 3], dtype='int64')
2. Pandas Data Operations
2.1 Indexing
Note on slicing: slicing by position excludes the end point, while slicing by label includes it.
2.1.1 Series Indexing
import pandas as pd

ser_obj = pd.Series(range(5), index=['a', 'b', 'c', 'd', 'e'])
print(ser_obj.head())
print ( ser_obj. head( ) )
a 0
b 1
c 2
d 3
e 4
dtype: int32
print(ser_obj['a'])
print(ser_obj[0])
0
0
print(ser_obj[1:3])
print(ser_obj['b':'d'])
b 1
c 2
dtype: int32
b 1
c 2
d 3
dtype: int32
print(ser_obj[[0, 2, 4]])
print(ser_obj[['a', 'e']])
a 0
c 2
e 4
dtype: int32
a 0
e 4
dtype: int32
ser_bool = ser_obj > 2
print(ser_bool)
print(ser_obj[ser_bool])
print(ser_obj[ser_obj > 2])
a False
b False
c False
d True
e True
dtype: bool
d 3
e 4
dtype: int32
d 3
e 4
dtype: int32
2.1.2 DataFrame Indexing
import numpy as np

df_obj = pd.DataFrame(np.random.randn(5, 4), columns=['a', 'b', 'c', 'd'])
print(df_obj.head())
a b c d
0 -0.595692 0.813699 -0.551327 -0.059703
1 0.339194 -2.335579 0.230472 -0.680213
2 -0.252306 0.212406 -0.979523 0.408522
3 0.216677 0.574524 -0.819607 2.170009
4 -1.099175 -0.665488 0.391421 -0.400642
print('column indexing')
print(df_obj['a'])
print(type(df_obj[[0]]))   # positional column selection worked in older pandas; use df_obj.iloc[:, [0]] today
print('discontinuous indexing')
print(df_obj[['a', 'c']])
print(df_obj[[1, 3]])      # likewise: df_obj.iloc[:, [1, 3]] in current pandas
column indexing
0   -0.595692
1    0.339194
2   -0.252306
3    0.216677
4   -1.099175
Name: a, dtype: float64
<class 'pandas.core.frame.DataFrame'>
discontinuous indexing
          a         c
0 -0.595692 -0.551327
1  0.339194  0.230472
2 -0.252306 -0.979523
3  0.216677 -0.819607
4 -1.099175  0.391421
          b         d
0  0.813699 -0.059703
1 -2.335579 -0.680213
2  0.212406  0.408522
3  0.574524  2.170009
4 -0.665488 -0.400642
2.1.3 Three Ways of Indexing (label-based, integer-position, and mixed)
print(ser_obj['b':'d'])
print(ser_obj.loc['b':'d'])
print(df_obj['a'])
print(df_obj.loc[0:2, 'a'])
b 1
c 2
d 3
dtype: int32
b 1
c 2
d 3
dtype: int32
0 -0.595692
1 0.339194
2 -0.252306
3 0.216677
4 -1.099175
Name: a, dtype: float64
0 -0.595692
1 0.339194
2 -0.252306
Name: a, dtype: float64
print(ser_obj[1:3])
print(ser_obj.iloc[1:3])
print(df_obj.iloc[0:2, 0])
b 1
c 2
dtype: int32
b 1
c 2
dtype: int32
0 -0.595692
1 0.339194
Name: a, dtype: float64
# Note: .ix (mixed indexing) was deprecated in pandas 0.20 and removed in 1.0;
# prefer .loc (labels) and .iloc (positions) in current code.
print(ser_obj.ix[1:3])
print(ser_obj.ix['b':'c'])
print(df_obj.ix[0:2, 0])
b 1
c 2
dtype: int32
b 1
c 2
dtype: int32
0 -0.595692
1 0.339194
2 -0.252306
Name: a, dtype: float64
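Since `.ix` no longer exists in current pandas, the same selections are written with `.loc` and `.iloc`; a sketch:

```python
import pandas as pd
import numpy as np

ser_obj = pd.Series(range(5), index=['a', 'b', 'c', 'd', 'e'])
df_obj = pd.DataFrame(np.random.randn(5, 4), columns=['a', 'b', 'c', 'd'])

print(ser_obj.iloc[1:3])     # positions 1-2, end excluded (was ser_obj.ix[1:3])
print(ser_obj.loc['b':'c'])  # labels 'b' through 'c', end included (was ser_obj.ix['b':'c'])
print(df_obj.iloc[0:3, 0])   # first three rows of the first column (was df_obj.ix[0:2, 0])
```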
2.2 Arithmetic and Alignment
Arithmetic between pandas objects aligns on index labels; labels that exist on only one side produce NaN.
s1 = pd.Series(range(10, 20), index = range(10))
s2 = pd.Series(range(20, 25), index = range(5))
print('s1: ' )
print(s1)
print('')
print('s2: ')
print(s2)
s1:
0 10
1 11
2 12
3 13
4 14
5 15
6 16
7 17
8 18
9 19
dtype: int32
s2:
0 20
1 21
2 22
3 23
4 24
dtype: int32
s1 + s2
0 30.0
1 32.0
2 34.0
3 36.0
4 38.0
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
dtype: float64
import numpy as np

df1 = pd.DataFrame(np.ones((2, 2)), columns=['a', 'b'])
df2 = pd.DataFrame(np.ones((3, 3)), columns=['a', 'b', 'c'])
print('df1: ')
print(df1)
print('')
print('df2: ')
print(df2)
df1:
a b
0 1.0 1.0
1 1.0 1.0
df2:
a b c
0 1.0 1.0 1.0
1 1.0 1.0 1.0
2 1.0 1.0 1.0
df1 + df2
     a    b   c
0  2.0  2.0 NaN
1  2.0  2.0 NaN
2  NaN  NaN NaN
print(s1)
print(s2)
s1.add(s2, fill_value=-1)
0 10
1 11
2 12
3 13
4 14
5 15
6 16
7 17
8 18
9 19
dtype: int64
0 20
1 21
2 22
3 23
4 24
dtype: int64
0 30.0
1 32.0
2 34.0
3 36.0
4 38.0
5 14.0
6 15.0
7 16.0
8 17.0
9 18.0
dtype: float64
df1.sub(df2, fill_value=2.)
     a    b    c
0  0.0  0.0  1.0
1  0.0  0.0  1.0
2  1.0  1.0  1.0
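fill_value substitutes for the side that is missing a label before the operation runs (labels missing on both sides stay NaN). A small sketch with made-up data:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20], index=['a', 'b'])

# label 'c' is absent from s2, so fill_value=0 stands in for it
result = s1.add(s2, fill_value=0)
print(result.tolist())    # [11.0, 22.0, 3.0]
```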
s3 = s1 + s2
s3.fillna(-1, inplace=True)
print(s3)
0 30.0
1 32.0
2 34.0
3 36.0
4 38.0
5 -1.0
6 -1.0
7 -1.0
8 -1.0
9 -1.0
dtype: float64
df3 = df1 + df2
print(df3)
a b c
0 2.0 2.0 NaN
1 2.0 2.0 NaN
2 NaN NaN NaN
df3.fillna(100, inplace=True)
print(df3)
a b c
0 2.0 2.0 100.0
1 2.0 2.0 100.0
2 100.0 100.0 100.0
2.3 Applying Functions
df = pd.DataFrame(np.random.randn(5, 4) - 1)
print(df)
print(np.abs(df))
0 1 2 3
0 -2.193022 -2.090432 -2.288651 -0.026022
1 -0.720957 -1.501025 -1.734828 -1.858286
2 0.300216 -3.391127 -0.872570 -0.686669
3 -2.552131 -1.452268 -1.188845 -0.597845
4 2.111044 -1.203676 -1.143487 -0.542755
0 1 2 3
0 2.193022 2.090432 2.288651 0.026022
1 0.720957 1.501025 1.734828 1.858286
2 0.300216 3.391127 0.872570 0.686669
3 2.552131 1.452268 1.188845 0.597845
4 2.111044 1.203676 1.143487 0.542755
print(df.apply(lambda x: x.max()))
0 2.111044
1 -1.203676
2 -0.872570
3 -0.026022
dtype: float64
print(df.apply(lambda x: x.max(), axis=1))
f2 = lambda x: '%.2f' % x
print(df.applymap(f2))
0 1 2 3
0 -2.19 -2.09 -2.29 -0.03
1 -0.72 -1.50 -1.73 -1.86
2 0.30 -3.39 -0.87 -0.69
3 -2.55 -1.45 -1.19 -0.60
4 2.11 -1.20 -1.14 -0.54
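Note that `applymap` was deprecated in pandas 2.1 in favor of the element-wise `DataFrame.map`; a version-tolerant sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(3, 2))

# DataFrame.map exists from pandas 2.1; fall back to applymap on older versions
elementwise = df.map if hasattr(df, 'map') else df.applymap
formatted = elementwise(lambda x: '%.2f' % x)
print(formatted)
```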
Sorting
s4 = pd.Series(range(10, 15), index=np.random.randint(5, size=5))
print(s4)
4 10
3 11
1 12
4 13
4 14
dtype: int32
s4.sort_index()
1 12
3 11
4 10
4 13
4 14
dtype: int32
df4 = pd.DataFrame(np.random.randn(3, 4),
                   index=np.random.randint(3, size=3),
                   columns=np.random.randint(4, size=4))
print(df4)
          3         2         2         1
2  0.244068 -1.977220  0.045238 -2.064546
2  0.218196 -0.419284 -0.698839  0.241649
2  0.296747 -0.021311  0.225724 -0.325439
df4.sort_index(axis=1)
          1         2         2         3
2 -2.064546 -1.977220  0.045238  0.244068
2  0.241649 -0.419284 -0.698839  0.218196
2 -0.325439 -0.021311  0.225724  0.296747
df4.sort_values(by=1)
          3         2         2         1
2  0.244068 -1.977220  0.045238 -2.064546
2  0.296747 -0.021311  0.225724 -0.325439
2  0.218196 -0.419284 -0.698839  0.241649
2.4 Handling Missing Data
df_data = pd.DataFrame([np.random.randn(3), [1., np.nan, np.nan],
                        [4., np.nan, np.nan], [1., np.nan, 2.]])
df_data.head()
          0         1         2
0  0.106763  1.521225 -0.654447
1  1.000000       NaN       NaN
2  4.000000       NaN       NaN
3  1.000000       NaN  2.000000
df_data.isnull()
       0      1      2
0  False  False  False
1  False   True   True
2  False   True   True
3  False   True  False
df_data.dropna()
          0         1         2
0  0.106763  1.521225 -0.654447
df_data.dropna(axis=1)
          0
0  0.106763
1  1.000000
2  4.000000
3  1.000000
df_data.fillna(-100.)
          0           1           2
0  1.619463    0.548047   -1.027003
1  1.000000 -100.000000 -100.000000
2  4.000000 -100.000000 -100.000000
3  1.000000 -100.000000    2.000000
3. Statistical Computation and Description in Pandas
import numpy as np
import pandas as pd
3.1 Common Statistical Computations
df_obj = pd.DataFrame(np.random.randn(5, 4), columns=['a', 'b', 'c', 'd'])
df_obj
          a         b         c         d
0 -0.767557  0.637179  1.650174  1.046862
1  1.675605  0.438124 -0.119837 -0.637159
2 -0.367849  1.044564  0.977799 -0.185916
3 -0.453807 -0.786276 -1.917200  0.502135
4  0.230846 -0.056462  1.612659  0.426120
df_obj.sum()
a 0.317239
b 1.277129
c 2.203596
d 1.152041
dtype: float64
df_obj.max()
a 1.675605
b 1.044564
c 1.650174
d 1.046862
dtype: float64
df_obj.min(axis=1)
0 -0.767557
1 -0.637159
2 -0.367849
3 -1.917200
4 -0.056462
dtype: float64
3.2 Summary Statistics
df_obj.describe()
              a         b         c         d
count  5.000000  5.000000  5.000000  5.000000
mean   0.882532 -0.574613  0.004122 -0.455327
std    1.052045  0.887115  0.456436  0.646042
min   -0.713423 -1.600219 -0.628493 -1.403887
25%    0.715594 -1.483364 -0.211933 -0.664303
50%    0.783228 -0.053138  0.021936 -0.424985
75%    1.580704  0.123322  0.276417 -0.103682
max    2.046556  0.140333  0.562683  0.320219
4. Grouping and Aggregation
4.1 Grouping (groupby)
groupby splits a dataset into groups and computes statistics for each one; pandas supports fairly sophisticated grouped computations through it.
The grouping workflow:
split -> apply -> combine
Split: choose the key(s) to group by
Apply: run a computation on each group
Combine: merge the per-group results back together
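The split -> apply -> combine steps above can be sketched minimally:

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'a', 'b'],
                   'val': [1, 2, 3, 4]})

# split on 'key', apply sum within each group, combine into one Series
result = df.groupby('key')['val'].sum()
print(result)
```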
4.1.1 The GroupBy Object
import pandas as pd
import numpy as np

dict_obj = {'key1': ['a', 'b', 'a', 'b',
                     'a', 'b', 'a', 'a'],
            'key2': ['one', 'one', 'two', 'three',
                     'two', 'two', 'one', 'three'],
            'data1': np.random.randn(8),
            'data2': np.random.randn(8)}
df_obj = pd.DataFrame(dict_obj)
print(df_obj)
data1 data2 key1 key2
0 -0.943078 0.820645 a one
1 -1.429043 0.142617 b one
2 0.832261 0.843898 a two
3 0.906262 0.688165 b three
4 0.541173 0.117232 a two
5 -0.213385 -0.098734 b two
6 -1.291468 -1.186638 a one
7 1.186941 0.809122 a three
print(type(df_obj.groupby('key1')))
<class 'pandas.core.groupby.DataFrameGroupBy'>
print(type(df_obj['data1'].groupby(df_obj['key1'])))
<class 'pandas.core.groupby.SeriesGroupBy'>
grouped1 = df_obj.groupby('key1')
print(grouped1.mean())
grouped2 = df_obj['data1'].groupby(df_obj['key1'])
print(grouped2.mean())
data1 data2
key1
a 0.065166 0.280852
b -0.245389 0.244016
key1
a 0.065166
b -0.245389
Name: data1, dtype: float64
print(grouped1.size())
print(grouped2.size())
key1
a 5
b 3
dtype: int64
key1
a 5
b 3
dtype: int64
df_obj.groupby('key1')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002091C820550>
self_def_key = [1, 1, 2, 2, 2, 1, 1, 1]
df_obj.groupby(self_def_key).size()
1 5
2 3
dtype: int64
df_obj.groupby([df_obj['key1'], df_obj['key2']]).size()
key1 key2
a one 2
three 1
two 2
b one 1
three 1
two 1
dtype: int64
grouped2 = df_obj.groupby(['key1', 'key2'])
print(grouped2.size())
key1 key2
a one 2
three 1
two 2
b one 1
three 1
two 1
dtype: int64
grouped3 = df_obj.groupby(['key2', 'key1'])
print(grouped3.mean())
print()
print(grouped3.mean().unstack())
data1 data2
key2 key1
one a -1.117273 -0.182997
b -1.429043 0.142617
three a 1.186941 0.809122
b 0.906262 0.688165
two a 0.686717 0.480565
b -0.213385 -0.098734
data1 data2
key1 a b a b
key2
one -1.117273 -1.429043 -0.182997 0.142617
three 1.186941 0.906262 0.809122 0.688165
two 0.686717 -0.213385 0.480565 -0.098734
4.1.2 Iterating over a GroupBy Object
for group_name, group_data in grouped1:
    print(group_name)
    print(group_data)
a
data1 data2 key1 key2
0 -0.943078 0.820645 a one
2 0.832261 0.843898 a two
4 0.541173 0.117232 a two
6 -1.291468 -1.186638 a one
7 1.186941 0.809122 a three
b
data1 data2 key1 key2
1 -1.429043 0.142617 b one
3 0.906262 0.688165 b three
5 -0.213385 -0.098734 b two
for group_name, group_data in grouped2:
    print(group_name)
    print(group_data)
('a', 'one')
data1 data2 key1 key2
0 -0.943078 0.820645 a one
6 -1.291468 -1.186638 a one
('a', 'three')
data1 data2 key1 key2
7 1.186941 0.809122 a three
('a', 'two')
data1 data2 key1 key2
2 0.832261 0.843898 a two
4 0.541173 0.117232 a two
('b', 'one')
data1 data2 key1 key2
1 -1.429043 0.142617 b one
('b', 'three')
data1 data2 key1 key2
3 0.906262 0.688165 b three
('b', 'two')
data1 data2 key1 key2
5 -0.213385 -0.098734 b two
list(grouped1)
[('a', data1 data2 key1 key2
0 -0.943078 0.820645 a one
2 0.832261 0.843898 a two
4 0.541173 0.117232 a two
6 -1.291468 -1.186638 a one
7 1.186941 0.809122 a three), ('b', data1 data2 key1 key2
1 -1.429043 0.142617 b one
3 0.906262 0.688165 b three
5 -0.213385 -0.098734 b two)]
dict(list(grouped1))
{'a': data1 data2 key1 key2
0 -0.943078 0.820645 a one
2 0.832261 0.843898 a two
4 0.541173 0.117232 a two
6 -1.291468 -1.186638 a one
7 1.186941 0.809122 a three, 'b': data1 data2 key1 key2
1 -1.429043 0.142617 b one
3 0.906262 0.688165 b three
5 -0.213385 -0.098734 b two}
print(df_obj.dtypes)
df_obj.groupby(df_obj.dtypes, axis=1).size()
df_obj.groupby(df_obj.dtypes, axis=1).sum()
data1 float64
data2 float64
key1 object
key2 object
dtype: object
    float64  object
0 -0.122433    aone
1 -1.286426    bone
2  1.676158    atwo
3  1.594427  bthree
4  0.658404    atwo
5 -0.312119    btwo
6 -2.478106    aone
7  1.996064  athree
4.1.3 Other Grouping Methods
df_obj2 = pd.DataFrame(np.random.randint(1, 10, (5, 5)),
                       columns=['a', 'b', 'c', 'd', 'e'],
                       index=['A', 'B', 'C', 'D', 'E'])
df_obj2.iloc[1, 1:4] = np.nan
df_obj2
   a    b    c    d  e
A  1  1.0  1.0  6.0  5
B  2  NaN  NaN  NaN  6
C  5  5.0  7.0  5.0  7
D  2  8.0  5.0  6.0  2
E  5  1.0  4.0  4.0  4
mapping_dict = {'a': 'python', 'b': 'python', 'c': 'java', 'd': 'C', 'e': 'java'}
df_obj2.groupby(mapping_dict, axis=1).size()
df_obj2.groupby(mapping_dict, axis=1).count()
df_obj2.groupby(mapping_dict, axis=1).sum()
     C  java  python
A  6.0   6.0     2.0
B  NaN   6.0     2.0
C  5.0  14.0    10.0
D  6.0   7.0    10.0
E  4.0   8.0     6.0
df_obj3 = pd.DataFrame(np.random.randint(1, 10, (5, 5)),
                       columns=['a', 'b', 'c', 'd', 'e'],
                       index=['AA', 'BBB', 'CC', 'D', 'EE'])

def group_key(idx):
    """idx is a row (or column) label; group by its length"""
    return len(idx)

df_obj3.groupby(group_key).size()
1 1
2 3
3 1
dtype: int64
columns = pd.MultiIndex.from_arrays([['Python', 'Java', 'Python', 'Java', 'Python'],
                                     ['A', 'A', 'B', 'C', 'B']], names=['language', 'index'])
df_obj4 = pd.DataFrame(np.random.randint(1, 10, (5, 5)), columns=columns)
df_obj4
language Python Java Python Java Python
index         A    A      B    C      B
0             1    6      4    7      2
1             9    7      2    2      4
2             3    9      9    7      5
3             1    6      1    6      6
4             5    1      7    3      6
df_obj4.groupby(level='language', axis=1).sum()
df_obj4.groupby(level='index', axis=1).sum()
index   A   B  C
0       7   6  7
1      16   6  2
2      12  14  7
3       7   7  6
4       6  13  3
4.2 Aggregation
Aggregation reduces each group's array of values to a scalar, e.g. mean(), count(), and so on.
dict_obj = {'key1': ['a', 'b', 'a', 'b',
                     'a', 'b', 'a', 'a'],
            'key2': ['one', 'one', 'two', 'three',
                     'two', 'two', 'one', 'three'],
            'data1': np.random.randint(1, 10, 8),
            'data2': np.random.randint(1, 10, 8)}
df_obj5 = pd.DataFrame(dict_obj)
print(df_obj5)
data1 data2 key1 key2
0 4 2 a one
1 7 1 b one
2 2 8 a two
3 9 4 b three
4 3 2 a two
5 8 5 b two
6 6 8 a one
7 9 3 a three
# built-in aggregation functions
print(df_obj5.groupby('key1').sum())
print(df_obj5.groupby('key1').max())
print(df_obj5.groupby('key1').min())
print(df_obj5.groupby('key1').mean())
print(df_obj5.groupby('key1').size())
print(df_obj5.groupby('key1').count())
print(df_obj5.groupby('key1').describe())
data1 data2
key1
a 24 23
b 24 10
data1 data2 key2
key1
a 9 8 two
b 9 5 two
data1 data2 key2
key1
a 2 2 one
b 7 1 one
data1 data2
key1
a 4.8 4.600000
b 8.0 3.333333
key1
a 5
b 3
dtype: int64
data1 data2 key2
key1
a 5 5 5
b 3 3 3
data1 data2
key1
a count 5.000000 5.000000
mean 4.800000 4.600000
std 2.774887 3.130495
min 2.000000 2.000000
25% 3.000000 2.000000
50% 4.000000 3.000000
75% 6.000000 8.000000
max 9.000000 8.000000
b count 3.000000 3.000000
mean 8.000000 3.333333
std 1.000000 2.081666
min 7.000000 1.000000
25% 7.500000 2.500000
50% 8.000000 4.000000
75% 8.500000 4.500000
max 9.000000 5.000000
def peak_range(df):
    """Return the range (max - min) of the values"""
    return df.max() - df.min()

print(df_obj5.groupby('key1').agg(peak_range))
print(df_obj.groupby('key1').agg(lambda df: df.max() - df.min()))
data1 data2
key1
a 7 6
b 2 4
data1 data2
key1
a 2.478410 2.030536
b 2.335305 0.786899
print(df_obj.groupby('key1').agg(['mean', 'std', 'count', peak_range]))
data1 data2
mean std count peak_range mean std count peak_range
key1
a 0.065166 1.110226 5 2.478410 0.280852 0.875752 5 2.030536
b -0.245389 1.167982 3 2.335305 0.244016 0.403130 3 0.786899
print(df_obj.groupby('key1').agg(['mean', 'std', 'count', ('range', peak_range)]))
data1 data2
mean std count range mean std count range
key1
a 0.065166 1.110226 5 2.478410 0.280852 0.875752 5 2.030536
b -0.245389 1.167982 3 2.335305 0.244016 0.403130 3 0.786899
dict_mapping = {'data1': 'mean',
                'data2': 'sum'}
print(df_obj.groupby('key1').agg(dict_mapping))
data2 data1
key1
a 1.404259 0.065166
b 0.732047 -0.245389
dict_mapping = {'data1': ['mean', 'max'],
                'data2': 'sum'}
print(df_obj.groupby('key1').agg(dict_mapping))
data2 data1
sum mean max
key1
a 1.404259 0.065166 1.186941
b 0.732047 -0.245389 0.906262
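In current pandas, the (name, function) tuple style shown above has largely been superseded by named aggregation, where keyword names become output columns. A sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b'],
                   'data1': [1, 3, 5],
                   'data2': [2, 4, 6]})

# each output column = (input column, aggregation function)
result = df.groupby('key1').agg(
    data1_mean=('data1', 'mean'),
    data2_range=('data2', lambda s: s.max() - s.min()),
)
print(result)
```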
4.3 Grouped Computations
import pandas as pd
import numpy as np

dict_obj = {'key1': ['a', 'b', 'a', 'b',
                     'a', 'b', 'a', 'a'],
            'key2': ['one', 'one', 'two', 'three',
                     'two', 'two', 'one', 'three'],
            'data1': np.random.randint(1, 10, 8),
            'data2': np.random.randint(1, 10, 8)}
df_obj = pd.DataFrame(dict_obj)
df_obj
  key1   key2  data1  data2
0    a    one      3      9
1    b    one      4      5
2    a    two      8      9
3    b  three      4      8
4    a    two      5      2
5    b    two      3      2
6    a    one      4      9
7    a  three      9      4
k1_sum = df_obj.groupby('key1').sum().add_prefix('sum_')
k1_sum
      sum_data1  sum_data2
key1
a            29         33
b            11         15
pd.merge(df_obj, k1_sum, left_on='key1', right_index=True)
  key1   key2  data1  data2  sum_data1  sum_data2
0    a    one      3      9         29         33
2    a    two      8      9         29         33
4    a    two      5      2         29         33
6    a    one      4      9         29         33
7    a  three      9      4         29         33
1    b    one      4      5         11         15
3    b  three      4      8         11         15
5    b    two      3      2         11         15
4.3.1 The transform Method
k1_sum_tf = df_obj.groupby('key1').transform(np.sum).add_prefix('sum_')
df_obj[k1_sum_tf.columns] = k1_sum_tf
df_obj
  key1   key2  data1  data2           sum_key2  sum_data1  sum_data2
0    a    one      3      9  onetwotwoonethree         29         33
1    b    one      4      5        onethreetwo         11         15
2    a    two      8      9  onetwotwoonethree         29         33
3    b  three      4      8        onethreetwo         11         15
4    a    two      5      2  onetwotwoonethree         29         33
5    b    two      3      2        onethreetwo         11         15
6    a    one      4      9  onetwotwoonethree         29         33
7    a  three      9      4  onetwotwoonethree         29         33
def diff_mean(s):
    """Return each value's difference from its group mean"""
    return s - s.mean()

df_obj.groupby('key1').transform(diff_mean)
      data1  data2  sum_data1  sum_data2
0 -2.800000    2.4          0          0
1  0.333333    0.0          0          0
2  2.200000    2.4          0          0
3  0.333333    3.0          0          0
4 -0.800000   -4.6          0          0
5 -0.666667   -3.0          0          0
6 -1.800000    2.4          0          0
7  3.200000   -2.6          0          0
dataset_path = './starcraft.csv'
df_data = pd.read_csv(dataset_path, usecols=['LeagueIndex', 'Age', 'HoursPerWeek',
                                             'TotalHours', 'APM'])
4.3.2 apply
def top_n(df, n=3, column='APM'):
    """Return each group's top n rows, sorted by column"""
    return df.sort_values(by=column, ascending=False)[:n]

df_data.groupby('LeagueIndex').apply(top_n)
                  LeagueIndex   Age  HoursPerWeek  TotalHours       APM
LeagueIndex
1           2214            1  20.0          12.0       730.0  172.9530
            2246            1  27.0           8.0       250.0  141.6282
            1753            1  20.0          28.0       100.0  139.6362
2           3062            2  20.0           6.0       100.0  179.6250
            3229            2  16.0          24.0       110.0  156.7380
            1520            2  29.0           6.0       250.0  151.6470
3           1557            3  22.0           6.0       200.0  226.6554
            484             3  19.0          42.0       450.0  220.0692
            2883            3  16.0           8.0       800.0  208.9500
4           2688            4  26.0          24.0       990.0  249.0210
            1759            4  16.0           6.0        75.0  229.9122
            2637            4  23.0          24.0       650.0  227.2272
5           3277            5  18.0          16.0       950.0  372.6426
            93              5  17.0          36.0       720.0  335.4990
            202             5  37.0          14.0       800.0  327.7218
6           734             6  16.0          28.0       730.0  389.8314
            2746            6  16.0          28.0      4000.0  350.4114
            1810            6  21.0          14.0       730.0  323.2506
7           3127            7  23.0          42.0      2000.0  298.7952
            104             7  21.0          24.0      1000.0  286.4538
            1654            7  18.0          98.0       700.0  236.0316
8           3393            8   NaN           NaN         NaN  375.8664
            3373            8   NaN           NaN         NaN  364.8504
            3372            8   NaN           NaN         NaN  355.3518
df_data.groupby('LeagueIndex').apply(top_n, n=2, column='Age')
                  LeagueIndex   Age  HoursPerWeek  TotalHours       APM
LeagueIndex
1           3146            1  40.0          12.0       150.0   38.5590
            3040            1  39.0          10.0       500.0   29.8764
2           920             2  43.0          10.0       730.0   86.0586
            2437            2  41.0           4.0       200.0   54.2166
3           1258            3  41.0          14.0       800.0   77.6472
            2972            3  40.0          10.0       500.0   60.5970
4           1696            4  44.0           6.0       500.0   89.5266
            1729            4  39.0           8.0       500.0   86.7246
5           202             5  37.0          14.0       800.0  327.7218
            2745            5  37.0          18.0      1000.0  123.4098
6           3069            6  31.0           8.0       800.0  133.1790
            2706            6  31.0           8.0       700.0   66.9918
7           2813            7  26.0          36.0      1300.0  188.5512
            1992            7  26.0          24.0      1000.0  219.6690
8           3340            8   NaN           NaN         NaN  189.7404
4.3.3 Suppressing Group Keys: group_keys=False
df_data.groupby('LeagueIndex', group_keys=False).apply(top_n)
      LeagueIndex   Age  HoursPerWeek  TotalHours       APM
2214            1  20.0          12.0       730.0  172.9530
2246            1  27.0           8.0       250.0  141.6282
1753            1  20.0          28.0       100.0  139.6362
3062            2  20.0           6.0       100.0  179.6250
3229            2  16.0          24.0       110.0  156.7380
1520            2  29.0           6.0       250.0  151.6470
1557            3  22.0           6.0       200.0  226.6554
484             3  19.0          42.0       450.0  220.0692
2883            3  16.0           8.0       800.0  208.9500
2688            4  26.0          24.0       990.0  249.0210
1759            4  16.0           6.0        75.0  229.9122
2637            4  23.0          24.0       650.0  227.2272
3277            5  18.0          16.0       950.0  372.6426
93              5  17.0          36.0       720.0  335.4990
202             5  37.0          14.0       800.0  327.7218
734             6  16.0          28.0       730.0  389.8314
2746            6  16.0          28.0      4000.0  350.4114
1810            6  21.0          14.0       730.0  323.2506
3127            7  23.0          42.0      2000.0  298.7952
104             7  21.0          24.0      1000.0  286.4538
1654            7  18.0          98.0       700.0  236.0316
3393            8   NaN           NaN         NaN  375.8664
3373            8   NaN           NaN         NaN  364.8504
3372            8   NaN           NaN         NaN  355.3518
5. Data Cleaning, Merging, Transformation, and Reshaping
5.1 Data Cleaning
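Typical cleaning chains the tools shown elsewhere in this article (drop_duplicates, dropna, fillna); a minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

# a small frame with a duplicate row and missing values (illustrative data)
df = pd.DataFrame({'name': ['Ann', 'Bob', 'Bob', None],
                   'score': [90.0, np.nan, np.nan, 75.0]})

cleaned = (df.drop_duplicates()          # drop the repeated ('Bob', NaN) row
             .dropna(subset=['name'])    # drop rows with no name
             .fillna({'score': 0.0}))    # fill remaining missing scores
print(cleaned)
```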
5.2 Joining Data (merge)
import pandas as pd
import numpy as np

df_obj1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                        'data1': np.random.randint(0, 10, 7)})
df_obj2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                        'data2': np.random.randint(0, 10, 3)})
print(df_obj1)
print(df_obj2)
data1 key
0 5 b
1 9 b
2 1 a
3 0 c
4 3 a
5 9 a
6 0 b
data2 key
0 9 a
1 3 b
2 8 d
pd.merge(df_obj1, df_obj2)
   data1 key  data2
0      5   b      3
1      9   b      3
2      0   b      3
3      1   a      9
4      3   a      9
5      9   a      9
pd.merge(df_obj1, df_obj2, on='key')
   data1 key  data2
0      5   b      3
1      9   b      3
2      0   b      3
3      1   a      9
4      3   a      9
5      9   a      9
df_obj1 = df_obj1.rename(columns={'key': 'key1'})
df_obj2 = df_obj2.rename(columns={'key': 'key2'})
pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2')
   data1 key1  data2 key2
0      5    b      3    b
1      9    b      3    b
2      0    b      3    b
3      1    a      9    a
4      3    a      9    a
5      9    a      9    a
pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2', how='outer')
   data1 key1  data2 key2
0    5.0    b    3.0    b
1    9.0    b    3.0    b
2    0.0    b    3.0    b
3    1.0    a    9.0    a
4    3.0    a    9.0    a
5    9.0    a    9.0    a
6    0.0    c    NaN  NaN
7    NaN  NaN    8.0    d
pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2', how='left')
   data1 key1  data2 key2
0      5    b    3.0    b
1      9    b    3.0    b
2      1    a    9.0    a
3      0    c    NaN  NaN
4      3    a    9.0    a
5      9    a    9.0    a
6      0    b    3.0    b
pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2', how='right')
   data1 key1  data2 key2
0    5.0    b      3    b
1    9.0    b      3    b
2    0.0    b      3    b
3    1.0    a      9    a
4    3.0    a      9    a
5    9.0    a      9    a
6    NaN  NaN      8    d
df_obj1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                        'data': np.random.randint(0, 10, 7)})
df_obj2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                        'data': np.random.randint(0, 10, 3)})
pd.merge(df_obj1, df_obj2, on='key', suffixes=('_left', '_right'))
   data_left key  data_right
0          4   b           5
1          2   b           5
2          5   b           5
3          9   a           7
4          6   a           7
5          6   a           7
df_obj1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                        'data1': np.random.randint(0, 10, 7)})
df_obj2 = pd.DataFrame({'data2': np.random.randint(0, 10, 3)}, index=['a', 'b', 'd'])
pd.merge(df_obj1, df_obj2, left_on='key', right_index=True)
   data1 key  data2
0      6   b      5
1      7   b      5
6      9   b      5
2      2   a      3
4      4   a      3
5      4   a      3
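For index joins like the one above, DataFrame.join is a convenience wrapper around merge; a sketch with made-up data:

```python
import pandas as pd

left = pd.DataFrame({'key': ['b', 'a', 'c'], 'data1': [1, 2, 3]})
right = pd.DataFrame({'data2': [10, 20]}, index=['a', 'b'])

# align left's 'key' column against right's index (left join by default)
joined = left.join(right, on='key')
print(joined)
```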
5.3 Concatenation
import numpy as np
import pandas as pd
5.3.1 NumPy concatenate
arr1 = np.random.randint(0, 10, (3, 4))
arr2 = np.random.randint(0, 10, (3, 4))
print(arr1)
print(arr2)
[[8 5 2 4]
[2 9 9 0]
[6 2 1 3]]
[[0 4 5 3]
[6 4 1 3]
[1 2 9 9]]
np.concatenate([arr1, arr2])
array([[8, 5, 2, 4],
[2, 9, 9, 0],
[6, 2, 1, 3],
[0, 4, 5, 3],
[6, 4, 1, 3],
[1, 2, 9, 9]])
np.concatenate([arr1, arr2], axis=1)
array([[8, 5, 2, 4, 0, 4, 5, 3],
[2, 9, 9, 0, 6, 4, 1, 3],
[6, 2, 1, 3, 1, 2, 9, 9]])
5.3.2 concat on Series
ser_obj1 = pd.Series(np.random.randint(0, 10, 5), index=range(0, 5))
ser_obj2 = pd.Series(np.random.randint(0, 10, 4), index=range(5, 9))
ser_obj3 = pd.Series(np.random.randint(0, 10, 3), index=range(9, 12))
print(ser_obj1)
print(ser_obj2)
print(ser_obj3)
0 9
1 9
2 5
3 6
4 4
dtype: int32
5 1
6 9
7 2
8 5
dtype: int32
9 4
10 7
11 4
dtype: int32
pd.concat([ser_obj1, ser_obj2, ser_obj3])
0 9
1 9
2 5
3 6
4 4
5 1
6 9
7 2
8 5
9 4
10 7
11 4
dtype: int32
pd.concat([ser_obj1, ser_obj2, ser_obj3], axis=1)
      0    1    2
0   9.0  NaN  NaN
1   9.0  NaN  NaN
2   5.0  NaN  NaN
3   6.0  NaN  NaN
4   4.0  NaN  NaN
5   NaN  1.0  NaN
6   NaN  9.0  NaN
7   NaN  2.0  NaN
8   NaN  5.0  NaN
9   NaN  NaN  4.0
10  NaN  NaN  7.0
11  NaN  NaN  4.0
ser_obj1 = pd.Series(np.random.randint(0, 10, 5), index=range(5))
ser_obj2 = pd.Series(np.random.randint(0, 10, 4), index=range(4))
ser_obj3 = pd.Series(np.random.randint(0, 10, 3), index=range(3))
print(ser_obj1)
print(ser_obj2)
print(ser_obj3)
0 7
1 3
2 9
3 1
4 7
dtype: int32
0 6
1 1
2 4
3 7
dtype: int32
0 7
1 2
2 3
dtype: int32
pd.concat([ser_obj1, ser_obj2, ser_obj3])
0 7
1 3
2 9
3 1
4 7
0 6
1 1
2 4
3 7
0 7
1 2
2 3
dtype: int32
pd.concat([ser_obj1, ser_obj2, ser_obj3], axis=1, join='inner')
With join='inner', only the index labels shared by all three Series (0, 1, and 2) are kept.
5.3.3 concat on DataFrames
df_obj1 = pd.DataFrame(np.random.randint(0, 10, (3, 2)), index=['a', 'b', 'c'],
                       columns=['A', 'B'])
df_obj2 = pd.DataFrame(np.random.randint(0, 10, (2, 2)), index=['a', 'b'],
                       columns=['C', 'D'])
print(df_obj1)
print(df_obj2)
A B
a 1 6
b 1 0
c 1 6
C D
a 2 1
b 7 4
pd.concat([df_obj1, df_obj2])
     A    B    C    D
a  1.0  6.0  NaN  NaN
b  1.0  0.0  NaN  NaN
c  1.0  6.0  NaN  NaN
a  NaN  NaN  2.0  1.0
b  NaN  NaN  7.0  4.0
pd.concat([df_obj1, df_obj2], axis=1)
   A  B    C    D
a  1  6  2.0  1.0
b  1  0  7.0  4.0
c  1  6  NaN  NaN
5.4 Reshaping
5.4.1 stack
import numpy as np
import pandas as pd
df_obj = pd.DataFrame(np.random.randint(0, 10, (5, 2)), columns=['data1', 'data2'])
df_obj
   data1  data2
0      0      8
1      9      5
2      1      1
3      8      5
4      9      4
stacked = df_obj.stack()
print(stacked)
0 data1 0
data2 8
1 data1 9
data2 5
2 data1 1
data2 1
3 data1 8
data2 5
4 data1 9
data2 4
dtype: int32
stacked.unstack()
   data1  data2
0      0      8
1      9      5
2      1      1
3      8      5
4      9      4
stacked.unstack(level=0)
       0  1  2  3  4
data1  0  9  1  8  9
data2  8  5  1  5  4
5.5 Data Transformation
import numpy as np
import pandas as pd
5.5.1 Duplicate Data
df_obj = pd.DataFrame({'data1': ['a'] * 4 + ['b'] * 4,
                       'data2': np.random.randint(0, 4, 8)})
df_obj
  data1  data2
0     a      1
1     a      3
2     a      3
3     a      3
4     b      3
5     b      3
6     b      1
7     b      1
df_obj. duplicated( )
0 False
1 False
2 True
3 True
4 False
5 True
6 False
7 True
dtype: bool
df_obj.drop_duplicates()
df_obj.drop_duplicates('data2')
5.5.2 The map Function
ser_obj = pd.Series(np.random.randint(0, 10, 10))
ser_obj
0 6
1 5
2 4
3 4
4 8
5 1
6 6
7 5
8 4
9 1
dtype: int32
ser_obj.map(lambda x: x ** 2)
0 36
1 25
2 16
3 16
4 64
5 1
6 36
7 25
8 16
9 1
dtype: int64
5.5.3 Replacing Values with replace
Note: this series happens to contain neither 0 nor 2, so the replace calls below return it unchanged; they simply illustrate the API.
ser_obj.replace(0, -100)
0 6
1 5
2 4
3 4
4 8
5 1
6 6
7 5
8 4
9 1
dtype: int32
ser_obj.replace([0, 2], -100)
0 6
1 5
2 4
3 4
4 8
5 1
6 6
7 5
8 4
9 1
dtype: int32
ser_obj.replace([0, 2], [-100, -200])
0 6
1 5
2 4
3 4
4 8
5 1
6 6
7 5
8 4
9 1
dtype: int32
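replace also accepts a dict mapping old values to new ones; a sketch where the replaced values actually occur:

```python
import pandas as pd

ser = pd.Series([0, 2, 4, 0])

# dict form: each key is replaced by its value
replaced = ser.replace({0: -100, 2: -200})
print(replaced.tolist())    # [-100, -200, 4, -100]
```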