Python for Data Science

Implementing, in a Python environment, the functionality of chapters 5, 12 and 13 of R for Data Science
(https://r4ds.had.co.nz/)

1. Data transformation

Reference: the Python dfply documentation at https://github.com/kieferk/dfply/blob/master/README.md

!pip install dfply
from dfply import *
import numpy as np
import pandas as pd

Load the built-in diamonds dataset: 53,940 rows and 10 columns (carat, cut, color, clarity, depth, table, price, x, y, z), each column giving one attribute of a diamond.

diamonds.head()
   carat      cut color clarity  depth  table  price     x     y     z
0   0.23    Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
1   0.21  Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31
2   0.23     Good     E     VS1   56.9   65.0    327  4.05  4.07  2.31
3   0.29  Premium     I     VS2   62.4   58.0    334  4.20  4.23  2.63
4   0.31     Good     J     SI2   63.3   58.0    335  4.34  4.35  2.75

1.1 Filter rows

# First 5 rows where cut is 'Ideal'
diamonds >> filter_by(X.cut == 'Ideal') >> head()
    carat    cut color clarity  depth  table  price     x     y     z
0    0.23  Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
11   0.23  Ideal     J     VS1   62.8   56.0    340  3.93  3.90  2.46
13   0.31  Ideal     J     SI2   62.2   54.0    344  4.35  4.37  2.71
16   0.30  Ideal     I     SI2   62.0   54.0    348  4.31  4.34  2.68
39   0.33  Ideal     I     SI2   61.8   55.0    403  4.49  4.51  2.78

A pipe passes the result of one step directly as the argument to the next function, so intermediate assignments can be omitted. This greatly reduces the number of objects held in memory.

The pipe operator is >> in Python (dfply) and %>% in R.
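Plain pandas offers a comparable pipeline style through method chaining and DataFrame.pipe; a minimal sketch on toy data (not the real diamonds set):

```python
import pandas as pd

df = pd.DataFrame({"cut": ["Ideal", "Good", "Ideal"], "price": [326, 327, 340]})

# Each .pipe() call receives the previous result, mirroring dfply's >> operator
result = (
    df
    .pipe(lambda d: d[d["cut"] == "Ideal"])  # keep Ideal rows, like filter_by()
    .pipe(lambda d: d.head(1))               # take the first row, like head()
)
```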
# Rows where cut is 'Ideal' and price is below 337
diamonds >> filter_by(X.cut == 'Ideal', X.price < 337)
   carat    cut color clarity  depth  table  price     x     y     z
0   0.23  Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
diamonds >> mask(X.cut == 'Ideal', X.color == 'E', X.table < 55, X.price < 500)   # filter_by() is an alias of mask()
       carat    cut color clarity  depth  table  price     x     y     z
26683   0.33  Ideal     E     SI2   62.2   54.0    427  4.44  4.46  2.77
32297   0.34  Ideal     E     SI2   62.4   54.0    454  4.49  4.52  2.81
40928   0.30  Ideal     E     SI1   61.6   54.0    499  4.32  4.35  2.67
50623   0.30  Ideal     E     SI2   62.1   54.0    401  4.32  4.35  2.69
50625   0.30  Ideal     E     SI2   62.0   54.0    401  4.33  4.35  2.69
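The same filtering can be done in plain pandas with boolean indexing or DataFrame.query; a sketch on a small stand-in frame (the real diamonds dataset ships with dfply):

```python
import pandas as pd

# Toy stand-in for a few diamonds rows
toy = pd.DataFrame({
    "cut":   ["Ideal", "Premium", "Ideal", "Good"],
    "price": [326, 326, 427, 327],
})

# filter_by()/mask() correspond to boolean indexing or query() in pandas
ideal_cheap = toy[(toy["cut"] == "Ideal") & (toy["price"] < 400)]
same_result = toy.query("cut == 'Ideal' and price < 400")
```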

1.2 Arrange rows

diamonds >> arrange(X.table, ascending=False) >> head(5)     # ascending defaults to True
       carat   cut color clarity  depth  table  price     x     y     z
24932   2.01  Fair     F     SI1   58.6   95.0  13387  8.32  8.31  4.87
50773   0.81  Fair     F     SI2   68.8   79.0   2301  5.26  5.20  3.58
51342   0.79  Fair     G     SI1   65.3   76.0   2362  5.52  5.13  3.35
52860   0.50  Fair     E     VS2   79.0   73.0   2579  5.21  5.18  4.09
49375   0.70  Fair     H     VS1   62.0   73.0   2100  5.65  5.54  3.47
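In plain pandas, arrange() corresponds to sort_values; a minimal sketch on toy data:

```python
import pandas as pd

toy = pd.DataFrame({"table": [55.0, 61.0, 65.0], "price": [326, 326, 327]})

# arrange(X.table, ascending=False) corresponds to sort_values in pandas
by_table_desc = toy.sort_values("table", ascending=False)
```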

1.3 Select columns

# Select the cut, price and x columns; first 3 rows
diamonds >> select(X.cut, 'price', X.x) >> head(3)
       cut  price     x
0    Ideal    326  3.95
1  Premium    326  3.89
2     Good    327  4.05
diamonds >> select(1, X.price, ['x', 'y']) >> head(2)   # select 'cut' (position 1), 'price', 'x' and 'y'
       cut  price     x     y
0    Ideal    326  3.95  3.98
1  Premium    326  3.89  3.84

starts_with('c'): columns whose names start with 'c';
ends_with('c'): columns whose names end with 'c';
contains('c'): columns whose names contain 'c';
everything(): all columns.

# Quickly select matching columns with these helpers
diamonds >> select(starts_with('c')) >> head(3)
   carat      cut color clarity
0   0.23    Ideal     E     SI2
1   0.21  Premium     E     SI1
2   0.23     Good     E     VS1
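Plain pandas covers these helpers with DataFrame.filter; a sketch on a toy frame (column values are made up):

```python
import pandas as pd

toy = pd.DataFrame({
    "carat": [0.23], "cut": ["Ideal"], "color": ["E"],
    "clarity": ["SI2"], "price": [326],
})

# starts_with('c') maps onto filter(regex=...), contains('r') onto filter(like=...)
starts_c   = toy.filter(regex="^c")   # columns beginning with 'c'
contains_r = toy.filter(like="r")     # columns whose name contains 'r'
```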

1.4 Add new variables

New variables are created with the mutate() function; several can be created in a single call.

diamonds >> mutate(x_plus_y=X.x + X.y) >> select(columns_from('x')) >> head(3)
      x     y     z  x_plus_y
0  3.95  3.98  2.43      7.93
1  3.89  3.84  2.31      7.73
2  4.05  4.07  2.31      8.12
# transmute() is mutate() followed by a select of only the newly created variables
diamonds >> transmute(x_plus_y=X.x + X.y, y_div_z=(X.y / X.z)) >> head(3)
   x_plus_y   y_div_z
0      7.93  1.637860
1      7.73  1.662338
2      8.12  1.761905
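In plain pandas, mutate() corresponds to assign(); a sketch on toy x/y/z values taken from the first rows above:

```python
import pandas as pd

toy = pd.DataFrame({"x": [3.95, 3.89], "y": [3.98, 3.84], "z": [2.43, 2.31]})

# assign() creates new columns; several can be created in one call, like mutate()
with_sum = toy.assign(x_plus_y=toy["x"] + toy["y"],
                      y_div_z=toy["y"] / toy["z"])

# transmute() keeps only the new columns; selecting them afterwards is equivalent
only_new = with_sum[["x_plus_y", "y_div_z"]]
```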

1.5 Grouped summaries

diamonds >> summarize(price_mean=X.price.mean(), price_std=X.price.std())
    price_mean    price_std
0  3932.799722  3989.439738
diamonds >> group_by('cut') >> summarize(price_mean=X.price.mean(), price_std=X.price.std())
         cut   price_mean    price_std
0       Fair  4358.757764  3560.386612
1       Good  3928.864452  3681.589584
2      Ideal  3457.541970  3808.401172
3    Premium  4584.257704  4349.204961
4  Very Good  3981.759891  3935.862161
diamonds >> summarize_each([np.mean, np.var], X.price, 'depth')
    price_mean     price_var  depth_mean  depth_var
0  3932.799722  1.591533e+07   61.749405   2.052366
diamonds >> group_by(X.cut) >> summarize_each([np.mean, np.var], X.price, 4)   # columns may also be given by position (4 = depth)
         cut   price_mean     price_var  depth_mean  depth_var
0       Fair  4358.757764  1.266848e+07   64.041677  13.266319
1       Good  3928.864452  1.355134e+07   62.365879   4.705224
2      Ideal  3457.541970  1.450325e+07   61.709401   0.516274
3    Premium  4584.257704  1.891421e+07   61.264673   1.342755
4  Very Good  3981.759891  1.548973e+07   61.818275   1.900466
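The group_by() >> summarize() pattern corresponds to groupby().agg() in plain pandas; a sketch on toy data (values are made up):

```python
import pandas as pd

toy = pd.DataFrame({
    "cut":   ["Fair", "Fair", "Ideal", "Ideal"],
    "price": [400, 600, 300, 500],
})

# Named aggregation gives the same shape of result as summarize()
stats = (
    toy.groupby("cut")["price"]
       .agg(price_mean="mean", price_std="std")
       .reset_index()
)
```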

2. Tidy data

2.1 Pivoting

import pandas as pd
import numpy as np
mydata=pd.DataFrame({
"Name":["苹果","谷歌","脸书","亚马逊","腾讯"],
"Conpany":["Apple","Google","Facebook","Amozon","Tencent"],
"Sale2013":[5000,3500,2300,2100,3100],
"Sale2014":[5050,3800,2900,2500,3300],
"Sale2015":[5050,3800,2900,2500,3300],
"Sale2016":[5050,3800,2900,2500,3300]
       })
mydata
  Name   Conpany  Sale2013  Sale2014  Sale2015  Sale2016
0  苹果     Apple      5000      5050      5050      5050
1  谷歌    Google      3500      3800      3800      3800
2  脸书  Facebook      2300      2900      2900      2900
3 亚马逊    Amozon      2100      2500      2500      2500
4  腾讯   Tencent      3100      3300      3300      3300

pandas' melt function (wide to long) is analogous to R's pivot_longer: it lengthens a data frame by reducing the number of columns and turning them into values.

mydata1=mydata.melt(id_vars=["Name","Conpany"], # key columns to keep
                    var_name="Year",            # name for the lengthened category variable
                    value_name="Sale")          # name for the lengthened measure
mydata1
  Name   Conpany      Year  Sale
0  苹果     Apple  Sale2013  5000
1  谷歌    Google  Sale2013  3500
2  脸书  Facebook  Sale2013  2300
3 亚马逊    Amozon  Sale2013  2100
4  腾讯   Tencent  Sale2013  3100
5  苹果     Apple  Sale2014  5050
...

R's pivot_wider can be replaced by pandas' pivot_table function (long to wide), which spreads key-value pairs across multiple columns.

mydata1.pivot_table(index=["Name","Conpany"],   # row index (one or more categorical variables)
                    columns=["Year"],           # column index (one or more categorical variables)
                    values=["Sale"])            # values (usually a measure)
                  Sale
Year          Sale2013 Sale2014 Sale2015 Sale2016
Name Conpany
亚马逊  Amozon     2100     2500     2500     2500
脸书  Facebook     2300     2900     2900     2900
腾讯  Tencent      3100     3300     3300     3300
苹果  Apple        5000     5050     5050     5050
谷歌  Google       3500     3800     3800     3800
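Because values= is given as a list, the pivot_table result has MultiIndex columns (('Sale', 'Sale2013'), ...). A sketch of flattening it back to an ordinary wide table, on a simplified long frame:

```python
import pandas as pd

long_df = pd.DataFrame({
    "Name": ["A", "A", "B", "B"],
    "Year": ["2013", "2014", "2013", "2014"],
    "Sale": [1, 2, 3, 4],
})

# droplevel(0) removes the outer 'Sale' column level; reset_index() flattens the rows
wide = (
    long_df.pivot_table(index="Name", columns="Year", values=["Sale"])
           .droplevel(0, axis=1)
           .reset_index()
)
```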

Data tidying can also be done with the plydata library.

Python's plydata library is modeled on R's dplyr, tidyr and forcats packages; many function names are borrowed directly.

https://plydata.readthedocs.io/en/stable/api.html

2.2 Separating and uniting

# separate() splits one column into several at each occurrence of the separator
# (separate() and unite() here are plydata verbs, not dfply ones)
df = pd.DataFrame({'alpha': 1,'x': ['a,1', 'b,2', 'c,3'],'zeta': 6})
print(df)
df >> separate('x', into=['A', 'B'], remove=False)
   alpha    x  zeta  A  B
0      1  a,1     6  a  1
1      1  b,2     6  b  2
2      1  c,3     6  c  3
# unite() combines several columns into one
df = pd.DataFrame({'c1': [1, 2, 3, 4, None],'c2': list('abcde'),'c3': list('vwxyz')})
df >> unite('c1c2', 'c1', 'c2', na_rm=True)  


  c3   c1c2
0  v  1.0_a
1  w  2.0_b
2  x  3.0_c
3  y  4.0_d
4  z    NaN
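In plain pandas, separate() corresponds to str.split with expand=True, and unite() to str.cat; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({"x": ["a,1", "b,2", "c,3"]})

# Split the 'x' column into two new columns at the comma, like separate()
df[["A", "B"]] = df["x"].str.split(",", expand=True)

# Join the two columns back with a separator, like unite()
df["AB"] = df["A"].str.cat(df["B"], sep="_")
```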

2.3 Missing values

df
    c1 c2 c3
0  1.0  a  v
1  2.0  b  w
2  3.0  c  x
3  4.0  d  y
4  NaN  e  z
# Inspect missing values
print(df.isnull())
print(df.isnull().sum())

Missing values can be dropped with df.dropna(); passing how="all" drops a row only when every value in it is missing.
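A quick sketch of the difference between the two dropna calls, on toy data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "c1": [1.0, np.nan, np.nan],
    "c2": ["a", "b", np.nan],
})

drop_any = df.dropna()            # drop rows containing any NaN
drop_all = df.dropna(how="all")   # drop only rows that are entirely NaN
```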

df.fillna('*')   # fill missing values
    c1 c2 c3
0  1.0  a  v
1  2.0  b  w
2  3.0  c  x
3  4.0  d  y
4    *  e  z

3. Relational data

3.1 pd.merge() and pd.concat()

Official pd.merge() documentation:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html

df1 = pd.DataFrame({'a': ['foo', 'bar'], 'b': [1, 2]})
df2 = pd.DataFrame({'a': ['foo', 'baz'], 'c': [3, 4]})
print(df1)
print(df2)
     a  b
0  foo  1
1  bar  2
     a  c
0  foo  3
1  baz  4
df1.merge(df2, how='inner', on='a')
     a  b  c
0  foo  1  3
df1.merge(df2, how='cross')
   a_x  b  a_y  c
0  foo  1  foo  3
1  foo  1  baz  4
2  bar  2  foo  3
3  bar  2  baz  4
df1 = pd.DataFrame([['a', 1], ['b', 2]],
                   columns=['letter', 'number'])
df3 = pd.DataFrame([['c', 3, 'cat'], ['d', 4, 'dog']],
                   columns=['letter', 'number', 'animal'])
pd.concat([df1, df3], join="inner")
  letter  number
0      a       1
1      b       2
0      c       3
1      d       4

3.2 Using the join family of functions from dfply

a = pd.DataFrame({'x1':['A','B','C'],'x2':[1,2,3]})
b = pd.DataFrame({'x1':['A','B','D'],'x3':[True,False,True]})


a >> inner_join(b, by='x1')  # inner join
a >> outer_join(b, by='x1')  # outer (full) join
a >> left_join(b, by='x1')   # left join
a >> right_join(b, by='x1')  # right join
a >> semi_join(b, by='x1')   # semi join: rows of a with a match in b
a >> anti_join(b, by='x1')   # anti join: rows of a without a match in b
  x1  x2
2  C   3
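Plain pandas has no semi_join/anti_join, but isin() on the key column gives the same result; a sketch using the a and b frames defined above:

```python
import pandas as pd

a = pd.DataFrame({"x1": ["A", "B", "C"], "x2": [1, 2, 3]})
b = pd.DataFrame({"x1": ["A", "B", "D"], "x3": [True, False, True]})

semi = a[a["x1"].isin(b["x1"])]    # semi join: rows of a with a match in b
anti = a[~a["x1"].isin(b["x1"])]   # anti join: rows of a without a match in b
```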