Pandas数据分析:使用transform,apply函数处理分组数据
数据清洗和可视化的关键之一在于如何处理缺失值。使用pandas的fillna函数可以对缺失值进行基本的处理。然而,fillna只适用于简单的情况,一旦数据中的组或数据顺序变得相关,它就不能满足要求。
本文将讨论如何在更复杂的情况下进行缺失值处理。
这些情况通常是由不同的时间序列,组甚至子组组成的数据集。时间序列以有月,季度为例,组以性别组数据为例,子组的例子有年龄组和种族。
本文主要分为三个部分:
- 复习fillna的用法
- 数据独立时,如何处理缺失值
- 数据存在一定相关性时,如何处理缺失值
复习fillna的用法
调用pandas的fillna()函数可以有三种方法处理缺失值,它们分别为method参数的两个值或具体的数值:
- method=‘ffill’ 在遇到下一个非缺失值前向下填充;
- method='bfill’在遇到下一个非缺失值前向上填充;
- 具体的数值,例如fillna(0)就是将所有的缺失值都填为0。
本次代码使用jupyter notebook。
代码示例:
输入:
demo = pd.Series(range(6))
demo.loc[2:4] = np.nan
demo
输出:
0 0.0
1 1.0
2 NaN
3 NaN
4 NaN
5 5.0
dtype: float64
向下填充:
输入:
demo.fillna(method='ffill')
输出:
0 0.0
1 1.0
2 1.0
3 1.0
4 1.0
5 5.0
dtype: float64
向上填充:
输入:
demo.fillna(method='bfill')
输出:
0 0.0
1 1.0
2 5.0
3 5.0
4 5.0
5 5.0
dtype: float64
填充具体数值
输入:
demo.fillna(0)
输出
0 0.0
1 1.0
2 0.0
3 0.0
4 0.0
5 5.0
dtype: float64
数据独立时,处理缺失值的方法
通常,在处理缺失数据时,数据间的相关性无关紧要,因此,用于替换缺失值的值可以基于可用数据的整体。在这种情况下,通常会用最佳‘猜测值’(如可用数据的平均值或中值)替换缺失值。
让我们快速回顾一下为什么要小心使用这个方法。让我们假设你调查了1000个男孩和1000个女孩的体重,但是在调查过程中出现了一些缺失值。
#模拟男生与女生的体重
boys = np.random.normal(70,5,1000)
girls = np.random.normal(50,3,1000)
#女生的体重数据出现了一些缺失值
for i in range(100):`在这里插入代码片`
girls[np.random.randint(0,1000)] = np.nan
#构造一个DataFrame
boys = pd.DataFrame(boys, columns=['weight'])
boys['gender'] = 'boy'
girls = pd.DataFrame(girls, columns=['weight'])
girls['gender'] = 'girl'
df = pd.concat([girls,boys],axis=0)
df['weight'] = df['weight'].astype(float)
输入
df
输出
weight gender
0 50.445444 girl
1 49.461076 girl
2 51.764700 girl
3 52.189355 girl
4 54.475917 girl
... ... ...
995 70.953356 boy
996 70.519500 boy
997 65.386008 boy
998 68.856916 boy
999 59.730169 boy
2000 rows × 2 columns
整体均值填充
如果不经分析,直接用整体的均值填补缺失值的话,就会存在一个问题,就是导致缺失的女生体重值会普遍偏高,因为用的是整体的缺失值,包含了男生的体重在里面,在此基础上进行分析,就会得出错误的结论。
sns.set_style('white')
fig, ax = plt.subplots(figsize=(16, 7))
mean = df['weight'].mean()
sns.distplot(
df[df['gender'] == 'girl']['weight'].fillna(mean),
kde=True,
hist=False,
ax=ax,
label='girls'
)
sns.distplot(
df[df['gender'] == 'boy']['weight'],
kde=True,
hist=False,
ax=ax,
label='boys'
)
plt.title('Kernel density estimation of weight for boys and girls')
sns.despine()
从上图可以看出,部分女生的体重是异常的,原因就是填充值选取不合理。
分组均值填充
可以使用pandas的transform函数配合进行分组填充,如下所示:
df['filled_weight'] = df.groupby('gender')['weight'].transform(
lambda grp: grp.fillna(np.mean(grp))
)
sns.set_style('white')
fig, ax = plt.subplots(figsize=(16, 7))
sns.distplot(
df[df['gender'] == 'girl']['filled_weight'],
kde=True,
hist=False,
ax=ax,
label='girls'
)
sns.distplot(
df[df['gender'] == 'boy']['filled_weight'],
kde=True,
hist=False,
ax=ax,
label='boys'
)
plt.title('Kernel density estimation of weight for boys and girls')
sns.despine()
最终的结果也会更加合理。
细分子组确实值填充
数据依然使用上一个例子,但是更进一步将数据分为年龄组。
首先创建虚拟数据:
# paramter for the weight distribution (mean, std)
param_map = {
'boy':{
'<10':(40,4),
'<20':(60,4),
'20+':(70,5),
},
'girl':{
'<10':(30,2),
'<20':(40,3),
'20+':(50,3),
}
}
# generate 10k records
df = pd.DataFrame({
'gender':np.random.choice(['girl','boy'],10000),
'age_cohort':np.random.choice(['<10','<20','20+'],10000)
})
# set random weight based on parameters
df['weight'] = df.apply(
lambda x: np.random.normal(
loc=param_map[x['gender']][x['age_cohort']][0],
scale=param_map[x['gender']][x['age_cohort']][1]
),axis=1
)
# set 500 values missing
for i in range(500):
df.loc[np.random.randint(0,len(df)),'weight'] = np.nan
如果用整体的均值填充缺失值,就会导致数据存在异常,让我们画出不同性别不同年龄段的体重密度图进行观察
df['filled_weight'] = df['weight'].fillna(
df['weight'].mean()
)
g = sns.FacetGrid(
df,
col='age_cohort',
row='gender',
col_order=['<10','<20','20+']
)
g.map(sns.kdeplot,'filled_weight')
从图中可以看到,有些奇怪的双峰分布,会导致产生错误的结论。
现在,如果我们仅仅用各自性别的平均值来替换缺失的值,这还不够,因为不仅男孩和女孩的体重不同,而且不同年龄组的体重也有很大的差异。
transform函数依旧可以用来辅助填充缺失值:
df['filled_weight'] = df.groupby(['gender','age_cohort'])['weight'].transform(
lambda grp: grp.fillna(np.mean(grp))
)
如上图所示,进行分组填充后得到的数据更加合理。
数据存在一定相关性时,如何处理缺失值
在处理时间序列数据时,经常会出现两种情况。
- 对齐日期范围:假设您查看各个国家的GDP,教育水平和人口年增长情况。有些国家的早些年的数据缺失,有些缺失了近期的数据,有些缺失了中间几年的数据。当然,你可以忽略他们。不过,出于可视化的目的,可能希望保留它们,并将第一个观察到的值投影到开始,将最后一个观察到的值投影到调查期间的结束,然后提出一些对中间的值有意义的方法。
- 插值:研究时间序列数据插值和排序变得非常重要。用一个基于2021年数据计算的平均值替换2012年缺失的股票数据,势必会产生一些古怪的结果。
以《2019年世界幸福报告》(World Happiness Report 2019)数据为例,我们将在其中讨论这两种情况。《世界幸福报告》试图回答影响全世界幸福的因素。该报告调查了2005年至2018年的数据。
df = pd.read_csv('https://raw.githubusercontent.com/FBosler/you-datascientist/master/happiness_with_continent.csv')
Country name Year Life Ladder Log GDP per capita Social support Healthy life expectancy at birth Freedom to make life choices Generosity Perceptions of corruption Positive affect ... GINI index (World Bank estimate), average 2000-16 gini of household income reported in Gallup, by wp5-year Most people can be trusted, Gallup Most people can be trusted, WVS round 1981-1984 Most people can be trusted, WVS round 1989-1993 Most people can be trusted, WVS round 1994-1998 Most people can be trusted, WVS round 1999-2004 Most people can be trusted, WVS round 2005-2009 Most people can be trusted, WVS round 2010-2014 Continent
0 Afghanistan 2008 3.723590 7.168690 0.450662 50.799999 0.718114 0.177889 0.881686 0.517637 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN Asia
1 Afghanistan 2009 4.401778 7.333790 0.552308 51.200001 0.678896 0.200178 0.850035 0.583926 ... NaN 0.441906 0.286315 NaN NaN NaN NaN NaN NaN Asia
2 Afghanistan 2010 4.758381 7.386629 0.539075 51.599998 0.600127 0.134353 0.706766 0.618265 ... NaN 0.327318 0.275833 NaN NaN NaN NaN NaN NaN Asia
3 Afghanistan 2011 3.831719 7.415019 0.521104 51.919998 0.495901 0.172137 0.731109 0.611387 ... NaN 0.336764 NaN NaN NaN NaN NaN NaN NaN Asia
4 Afghanistan 2012 3.782938 7.517126 0.520637 52.240002 0.530935 0.244273 0.775620 0.710385 ... NaN 0.344540 NaN NaN NaN NaN NaN NaN NaN Asia
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1699 Zimbabwe 2014 4.184451 7.562753 0.765839 52.380001 0.642034 -0.048634 0.820217 0.725214 ... 0.432 0.601080 NaN NaN NaN NaN 0.116683 NaN 0.082942 Africa
1700 Zimbabwe 2015 3.703191 7.556052 0.735800 53.799999 0.667193 -0.097354 0.810457 0.715079 ... 0.432 0.655137 NaN NaN NaN NaN 0.116683 NaN 0.082942 Africa
1701 Zimbabwe 2016 3.735400 7.538829 0.768425 54.400002 0.732971 -0.068105 0.723612 0.737636 ... 0.432 0.596690 NaN NaN NaN NaN 0.116683 NaN 0.082942 Africa
1702 Zimbabwe 2017 3.638300 7.549491 0.754147 55.000000 0.752826 -0.069670 0.751208 0.806428 ... 0.432 0.581484 NaN NaN NaN NaN 0.116683 NaN 0.082942 Africa
1703 Zimbabwe 2018 3.616480 7.553395 0.775388 55.599998 0.762675 -0.038384 0.844209 0.710119 ... 0.432 0.541772 NaN NaN NaN NaN 0.116683 NaN 0.082942 Africa
1704 rows × 27 columns
首先统计以下每年的有多少个国家的数据。
# PLOT CODE:
df.groupby(['Year']).size().plot(
kind='bar',
title='Number of countries with data',
figsize=(10,5)
)
我们可以看到,尤其是前几年,我们没有多少国家有数据,而且整个样本期都有一些波动。为了减轻丢失数据的影响,我们将执行以下操作:
- 按国家分组并重新编制整个日期范围的索引
- 在每个国家分组的基础上,在我们观测到的范围之外进行插值和外推。
按国家分组并重新编制整个日期范围的索引
# Define helper function
def add_missing_years(grp):
_ = grp.set_index('Year')
_ = _.reindex(list(range(2005,2019)))
del _['Country name']
return _
# Group by country name and extend
df = df.groupby('Country name').apply(add_missing_years)
df = df.reset_index()
Country name Year Life Ladder Log GDP per capita Social support Healthy life expectancy at birth Freedom to make life choices Generosity Perceptions of corruption Positive affect ... GINI index (World Bank estimate), average 2000-16 gini of household income reported in Gallup, by wp5-year Most people can be trusted, Gallup Most people can be trusted, WVS round 1981-1984 Most people can be trusted, WVS round 1989-1993 Most people can be trusted, WVS round 1994-1998 Most people can be trusted, WVS round 1999-2004 Most people can be trusted, WVS round 2005-2009 Most people can be trusted, WVS round 2010-2014 Continent
0 Afghanistan 2005 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Afghanistan 2006 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 Afghanistan 2007 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 Afghanistan 2008 3.723590 7.168690 0.450662 50.799999 0.718114 0.177889 0.881686 0.517637 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN Asia
4 Afghanistan 2009 4.401778 7.333790 0.552308 51.200001 0.678896 0.200178 0.850035 0.583926 ... NaN 0.441906 0.286315 NaN NaN NaN NaN NaN NaN Asia
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2305 Zimbabwe 2014 4.184451 7.562753 0.765839 52.380001 0.642034 -0.048634 0.820217 0.725214 ... 0.432 0.601080 NaN NaN NaN NaN 0.116683 NaN 0.082942 Africa
2306 Zimbabwe 2015 3.703191 7.556052 0.735800 53.799999 0.667193 -0.097354 0.810457 0.715079 ... 0.432 0.655137 NaN NaN NaN NaN 0.116683 NaN 0.082942 Africa
2307 Zimbabwe 2016 3.735400 7.538829 0.768425 54.400002 0.732971 -0.068105 0.723612 0.737636 ... 0.432 0.596690 NaN NaN NaN NaN 0.116683 NaN 0.082942 Africa
2308 Zimbabwe 2017 3.638300 7.549491 0.754147 55.000000 0.752826 -0.069670 0.751208 0.806428 ... 0.432 0.581484 NaN NaN NaN NaN 0.116683 NaN 0.082942 Africa
2309 Zimbabwe 2018 3.616480 7.553395 0.775388 55.599998 0.762675 -0.038384 0.844209 0.710119 ... 0.432 0.541772 NaN NaN NaN NaN 0.116683 NaN 0.082942 Africa
2310 rows × 27 columns
可以看到经过重新分组设定日期索引后由原来的1704行变成了2310行,因为每个国家日期的范围都变成了从2005年到2018年,并且在补充的年份其他列为缺失值。
在每个国家分组的基础上,在我们观测到的范围之外进行插值和外推
按照线性插值的方法进行插值:代码如下
# Define helper function
def fill_missing(grp):
res = grp.set_index('Year')\
.interpolate(method='linear',limit=5)\
.fillna(method='ffill')\
.fillna(method='bfill')
del res['Country name']
return res
# Group by country name and fill missing
df = df.groupby(['Country name']).apply(
lambda grp: fill_missing(grp)
)
df = df.reset_index()
Country name Year Life Ladder Log GDP per capita Social support Healthy life expectancy at birth Freedom to make life choices Generosity Perceptions of corruption Positive affect ... GINI index (World Bank estimate), average 2000-16 gini of household income reported in Gallup, by wp5-year Most people can be trusted, Gallup Most people can be trusted, WVS round 1981-1984 Most people can be trusted, WVS round 1989-1993 Most people can be trusted, WVS round 1994-1998 Most people can be trusted, WVS round 1999-2004 Most people can be trusted, WVS round 2005-2009 Most people can be trusted, WVS round 2010-2014 Continent
0 Afghanistan 2005 3.723590 7.168690 0.450662 50.799999 0.718114 0.177889 0.881686 0.517637 ... NaN 0.441906 0.286315 NaN NaN NaN NaN NaN NaN Asia
1 Afghanistan 2006 3.723590 7.168690 0.450662 50.799999 0.718114 0.177889 0.881686 0.517637 ... NaN 0.441906 0.286315 NaN NaN NaN NaN NaN NaN Asia
2 Afghanistan 2007 3.723590 7.168690 0.450662 50.799999 0.718114 0.177889 0.881686 0.517637 ... NaN 0.441906 0.286315 NaN NaN NaN NaN NaN NaN Asia
3 Afghanistan 2008 3.723590 7.168690 0.450662 50.799999 0.718114 0.177889 0.881686 0.517637 ... NaN 0.441906 0.286315 NaN NaN NaN NaN NaN NaN Asia
4 Afghanistan 2009 4.401778 7.333790 0.552308 51.200001 0.678896 0.200178 0.850035 0.583926 ... NaN 0.441906 0.286315 NaN NaN NaN NaN NaN NaN Asia
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2305 Zimbabwe 2014 4.184451 7.562753 0.765839 52.380001 0.642034 -0.048634 0.820217 0.725214 ... 0.432 0.601080 0.148151 NaN NaN NaN 0.116683 NaN 0.082942 Africa
2306 Zimbabwe 2015 3.703191 7.556052 0.735800 53.799999 0.667193 -0.097354 0.810457 0.715079 ... 0.432 0.655137 0.148151 NaN NaN NaN 0.116683 NaN 0.082942 Africa
2307 Zimbabwe 2016 3.735400 7.538829 0.768425 54.400002 0.732971 -0.068105 0.723612 0.737636 ... 0.432 0.596690 0.148151 NaN NaN NaN 0.116683 NaN 0.082942 Africa
2308 Zimbabwe 2017 3.638300 7.549491 0.754147 55.000000 0.752826 -0.069670 0.751208 0.806428 ... 0.432 0.581484 0.148151 NaN NaN NaN 0.116683 NaN 0.082942 Africa
2309 Zimbabwe 2018 3.616480 7.553395 0.775388 55.599998 0.762675 -0.038384 0.844209 0.710119 ... 0.432 0.541772 0.148151 NaN NaN NaN 0.116683 NaN 0.082942 Africa
2310 rows × 27 columns
现在我们有了样本中所有国家2005年至2018年的数据。