Implementing, in a Python environment, the functionality of chapters 5, 12, and 13 of R for Data Science (https://r4ds.had.co.nz/).
1. Data transformation
Reference: the dfply documentation at https://github.com/kieferk/dfply/blob/master/README.md
!pip install dfply
from dfply import *
import numpy as np
import pandas as pd
Load the built-in diamonds dataset. It has 53,940 rows and 10 columns (carat, cut, color, clarity, depth, table, price, x, y, z), each describing a property of a diamond.
diamonds.head()
   carat      cut color clarity  depth  table  price     x     y     z
0   0.23    Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
1   0.21  Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31
2   0.23     Good     E     VS1   56.9   65.0    327  4.05  4.07  2.31
3   0.29  Premium     I     VS2   62.4   58.0    334  4.20  4.23  2.63
4   0.31     Good     J     SI2   63.3   58.0    335  4.34  4.35  2.75
1.1 Filter rows
diamonds >> filter_by(X.cut == 'Ideal') >> head()
    carat    cut color clarity  depth  table  price     x     y     z
0    0.23  Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
11   0.23  Ideal     J     VS1   62.8   56.0    340  3.93  3.90  2.46
13   0.31  Ideal     J     SI2   62.2   54.0    344  4.35  4.37  2.71
16   0.30  Ideal     I     SI2   62.0   54.0    348  4.31  4.34  2.68
39   0.33  Ideal     I     SI2   61.8   55.0    403  4.49  4.51  2.78
A pipe passes the result of one step directly as input to the next function, skipping the intermediate assignment steps; this keeps fewer temporary objects alive and so saves memory.
diamonds >> filter_by(X.cut == 'Ideal', X.price < 337)
   carat    cut color clarity  depth  table  price     x     y     z
0   0.23  Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
diamonds >> mask(X.cut == 'Ideal', X.color == 'E', X.table < 55, X.price < 500)
       carat    cut color clarity  depth  table  price     x     y     z
26683   0.33  Ideal     E     SI2   62.2   54.0    427  4.44  4.46  2.77
32297   0.34  Ideal     E     SI2   62.4   54.0    454  4.49  4.52  2.81
40928   0.30  Ideal     E     SI1   61.6   54.0    499  4.32  4.35  2.67
50623   0.30  Ideal     E     SI2   62.1   54.0    401  4.32  4.35  2.69
50625   0.30  Ideal     E     SI2   62.0   54.0    401  4.33  4.35  2.69
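The same filters can be written in plain pandas with boolean indexing (in dfply, mask is an alias of filter_by). A minimal sketch on a tiny stand-in frame — the values below are illustrative, not the real diamonds rows:

```python
import pandas as pd

# Small stand-in for the diamonds data (made-up values)
df = pd.DataFrame({
    "cut":   ["Ideal", "Premium", "Ideal", "Good"],
    "price": [326, 326, 340, 335],
})

# dfply: df >> filter_by(X.cut == 'Ideal', X.price < 337)
# pandas equivalent: combine conditions with & and index with the mask
subset = df[(df["cut"] == "Ideal") & (df["price"] < 337)]
print(subset)
```

Note that each condition must be parenthesized, because `&` binds more tightly than `==` and `<`.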
1.2 Arrange rows
diamonds >> arrange(X.table, ascending=False) >> head(5)
       carat   cut color clarity  depth  table  price     x     y     z
24932   2.01  Fair     F     SI1   58.6   95.0  13387  8.32  8.31  4.87
50773   0.81  Fair     F     SI2   68.8   79.0   2301  5.26  5.20  3.58
51342   0.79  Fair     G     SI1   65.3   76.0   2362  5.52  5.13  3.35
52860   0.50  Fair     E     VS2   79.0   73.0   2579  5.21  5.18  4.09
49375   0.70  Fair     H     VS1   62.0   73.0   2100  5.65  5.54  3.47
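arrange corresponds to pandas' sort_values; a minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"table": [55.0, 95.0, 61.0]})

# dfply: df >> arrange(X.table, ascending=False)
# pandas equivalent:
ordered = df.sort_values("table", ascending=False)
print(ordered)
```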
1.3 Select columns
diamonds >> select(X.cut, 'price', X.x) >> head(3)
       cut  price     x
0    Ideal    326  3.95
1  Premium    326  3.89
2     Good    327  4.05
diamonds >> select(1, X.price, ['x', 'y']) >> head(2)
       cut  price     x     y
0    Ideal    326  3.95  3.98
1  Premium    326  3.89  3.84
starts_with('c'): select columns whose names start with 'c'; ends_with('c'): columns whose names end with 'c'; contains('c'): columns whose names contain 'c'; everything(): all columns.
diamonds >> select(starts_with('c')) >> head(3)
   carat      cut color clarity
0   0.23    Ideal     E     SI2
1   0.21  Premium     E     SI1
2   0.23     Good     E     VS1
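These selection helpers map onto ordinary pandas column filtering; one way to sketch starts_with on a tiny stand-in frame (the column values are made up):

```python
import pandas as pd

df = pd.DataFrame({"carat": [0.23], "cut": ["Ideal"], "price": [326]})

# starts_with('c'): keep columns whose name begins with 'c'
c_cols = df.loc[:, [c for c in df.columns if c.startswith("c")]]
print(list(c_cols.columns))

# pandas also offers DataFrame.filter for pattern-based selection
same = df.filter(regex="^c")
```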
1.4 Add new variables
New variables can be created with the mutate() function; several variables can be created in a single call.
diamonds >> mutate(x_plus_y=X.x + X.y) >> select(columns_from('x')) >> head(3)
      x     y     z  x_plus_y
0  3.95  3.98  2.43      7.93
1  3.89  3.84  2.31      7.73
2  4.05  4.07  2.31      8.12
diamonds >> transmute(x_plus_y=X.x + X.y, y_div_z=(X.y / X.z)) >> head(3)
   x_plus_y   y_div_z
0      7.93  1.637860
1      7.73  1.662338
2      8.12  1.761905
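In plain pandas, mutate corresponds to assign (keep existing columns, add new ones), while transmute keeps only the newly created columns; a minimal sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({"x": [3.95, 3.89], "y": [3.98, 3.84]})

# mutate: existing columns plus the new one
out = df.assign(x_plus_y=df["x"] + df["y"])

# transmute: only the newly created columns survive
only_new = df.assign(x_plus_y=df["x"] + df["y"])[["x_plus_y"]]
print(out)
print(only_new)
```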
1.5 Grouped summaries
diamonds >> summarize(price_mean=X.price.mean(), price_std=X.price.std())
    price_mean    price_std
0  3932.799722  3989.439738
diamonds >> group_by('cut') >> summarize(price_mean=X.price.mean(), price_std=X.price.std())
         cut   price_mean    price_std
0       Fair  4358.757764  3560.386612
1       Good  3928.864452  3681.589584
2      Ideal  3457.541970  3808.401172
3    Premium  4584.257704  4349.204961
4  Very Good  3981.759891  3935.862161
diamonds >> summarize_each([np.mean, np.var], X.price, 'depth')
    price_mean     price_var  depth_mean  depth_var
0  3932.799722  1.591533e+07   61.749405   2.052366
diamonds >> group_by(X.cut) >> summarize_each([np.mean, np.var], X.price, 4)
         cut   price_mean     price_var  depth_mean  depth_var
0       Fair  4358.757764  1.266848e+07   64.041677  13.266319
1       Good  3928.864452  1.355134e+07   62.365879   4.705224
2      Ideal  3457.541970  1.450325e+07   61.709401   0.516274
3    Premium  4584.257704  1.891421e+07   61.264673   1.342755
4  Very Good  3981.759891  1.548973e+07   61.818275   1.900466
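The grouped summaries above are groupby-aggregate operations in plain pandas; a minimal sketch on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({
    "cut":   ["Fair", "Fair", "Good", "Good"],
    "price": [100, 300, 200, 400],
})

# dfply: df >> group_by(X.cut) >> summarize(price_mean=X.price.mean(), ...)
# pandas equivalent: group, then aggregate the price column
stats = df.groupby("cut")["price"].agg(["mean", "std"]).reset_index()
print(stats)
```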
2. Tidy data
2.1 Pivoting
import pandas as pd
import numpy as np
mydata = pd.DataFrame({
    "Name": ["苹果", "谷歌", "脸书", "亚马逊", "腾讯"],
    "Conpany": ["Apple", "Google", "Facebook", "Amozon", "Tencent"],
    "Sale2013": [5000, 3500, 2300, 2100, 3100],
    "Sale2014": [5050, 3800, 2900, 2500, 3300],
    "Sale2015": [5050, 3800, 2900, 2500, 3300],
    "Sale2016": [5050, 3800, 2900, 2500, 3300]
})
mydata
  Name   Conpany  Sale2013  Sale2014  Sale2015  Sale2016
0   苹果     Apple      5000      5050      5050      5050
1   谷歌    Google      3500      3800      3800      3800
2   脸书  Facebook      2300      2900      2900      2900
3  亚马逊    Amozon      2100      2500      2500      2500
4   腾讯   Tencent      3100      3300      3300      3300
The pandas melt function (wide-to-long) is the analogue of R's pivot_longer: it lengthens a data frame by collapsing several columns into values.
mydata1 = mydata.melt(id_vars=["Name", "Conpany"],
                      var_name="Year",
                      value_name="Sale")
mydata1
  Name   Conpany      Year  Sale
0   苹果     Apple  Sale2013  5000
1   谷歌    Google  Sale2013  3500
2   脸书  Facebook  Sale2013  2300
3  亚马逊    Amozon  Sale2013  2100
4   腾讯   Tencent  Sale2013  3100
5   苹果     Apple  Sale2014  5050
R's pivot_wider can be replaced by pandas' pivot_table function (long-to-wide), which spreads key-value pairs across multiple columns.
mydata1.pivot_table(index=["Name", "Conpany"],
                    columns=["Year"],
                    values=["Sale"])
                  Sale
Year          Sale2013 Sale2014 Sale2015 Sale2016
Name Conpany
亚马逊  Amozon      2100     2500     2500     2500
脸书   Facebook     2300     2900     2900     2900
腾讯   Tencent      3100     3300     3300     3300
苹果   Apple        5000     5050     5050     5050
谷歌   Google       3500     3800     3800     3800
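melt and pivot_table are roughly inverse operations; a round-trip check on a tiny frame (the names and numbers are illustrative):

```python
import pandas as pd

wide = pd.DataFrame({
    "Name": ["A", "B"],
    "Sale2013": [1, 2],
    "Sale2014": [3, 4],
})

# wide -> long
long_df = wide.melt(id_vars=["Name"], var_name="Year", value_name="Sale")

# long -> wide again
back = long_df.pivot_table(index="Name", columns="Year", values="Sale").reset_index()
back.columns.name = None
print(back)
```

Note that pivot_table aggregates duplicates (mean by default), so the round trip is exact only when each (index, column) pair is unique.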
Data tidying can also be done with the plydata library. plydata is modeled on the R packages dplyr, tidyr, and forcats, and many of its function names are borrowed directly from them:
https://plydata.readthedocs.io/en/stable/api.html
2.2 Separating and uniting
df = pd.DataFrame({'alpha': 1, 'x': ['a,1', 'b,2', 'c,3'], 'zeta': 6})
print(df)
df >> separate('x', into=['A', 'B'], remove=False)
   alpha    x  zeta  A  B
0      1  a,1     6  a  1
1      1  b,2     6  b  2
2      1  c,3     6  c  3
df = pd.DataFrame({'c1': [1, 2, 3, 4, None], 'c2': list('abcde'), 'c3': list('vwxyz')})
df >> unite('c1c2', 'c1', 'c2', na_rm=True)
  c3   c1c2
0  v  1.0_a
1  w  2.0_b
2  x  3.0_c
3  y  4.0_d
4  z    NaN
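In plain pandas, separate can be sketched with str.split(expand=True) and unite with str.cat; a minimal example on made-up data:

```python
import pandas as pd

df = pd.DataFrame({"x": ["a,1", "b,2"]})

# separate: split one column into several on a delimiter
parts = df["x"].str.split(",", expand=True)
parts.columns = ["A", "B"]

# unite: join several columns back into one with a separator
united = parts["A"].str.cat(parts["B"], sep="_")
print(parts)
print(united.tolist())
```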
2.3 Missing values
df
    c1 c2 c3
0  1.0  a  v
1  2.0  b  w
2  3.0  c  x
3  4.0  d  y
4  NaN  e  z
print(df.isnull())
print(df.isnull().sum())
Missing values can be dropped with df.dropna(); passing how="all" drops a row only when every value in it is missing.
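A quick sketch of the dropna behaviour just described, on a made-up frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"c1": [1.0, np.nan, np.nan], "c2": ["a", "b", np.nan]})

dropped_any = df.dropna()            # drops rows containing any NaN
dropped_all = df.dropna(how="all")   # drops a row only if every value is NaN
print(len(dropped_any), len(dropped_all))
```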
df.fillna('*')
    c1 c2 c3
0  1.0  a  v
1  2.0  b  w
2  3.0  c  x
3  4.0  d  y
4    *  e  z
3. Relational data
3.1 pd.merge(), pd.concat()
Official pd.merge() documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html?highlight=merge#pandas.merge
df1 = pd.DataFrame({'a': ['foo', 'bar'], 'b': [1, 2]})
df2 = pd.DataFrame({'a': ['foo', 'baz'], 'c': [3, 4]})
print(df1)
print(df2)
df1.merge(df2, how='inner', on='a')
df1.merge(df2, how='cross')
   a_x  b  a_y  c
0  foo  1  foo  3
1  foo  1  baz  4
2  bar  2  foo  3
3  bar  2  baz  4
df1 = pd.DataFrame([['a', 1], ['b', 2]],
                   columns=['letter', 'number'])
df3 = pd.DataFrame([['c', 3, 'cat'], ['d', 4, 'dog']],
                   columns=['letter', 'number', 'animal'])
pd.concat([df1, df3], join="inner")
3.2 Using the join functions from the dfply library
a = pd.DataFrame({'x1': ['A', 'B', 'C'], 'x2': [1, 2, 3]})
b = pd.DataFrame({'x1': ['A', 'B', 'D'], 'x3': [True, False, True]})
a >> inner_join(b, by='x1')
a >> outer_join(b, by='x1')
a >> left_join(b, by='x1')
a >> right_join(b, by='x1')
a >> semi_join(b, by='x1')
a >> anti_join(b, by='x1')