Pandas数据分析:使用transform,apply函数处理分组数据

数据清洗和可视化的关键之一在于如何处理缺失值。使用pandas的fillna函数可以对缺失值进行基本的处理。然而,fillna只适用于简单的情况,一旦数据中的组或数据顺序变得相关,它就不能满足要求。
本文将讨论如何在更复杂的情况下进行缺失值处理。
这些情况通常是由不同的时间序列,组甚至子组组成的数据集。时间序列以有月,季度为例,组以性别组数据为例,子组的例子有年龄组和种族。
本文主要分为三个部分:

  1. 复习fillna的用法
  2. 数据独立时,如何处理缺失值
  3. 数据存在一定相关性时,如何处理缺失值

复习fillna的用法

调用pandas的fillna()函数可以有三种方法处理缺失值,它们分别为method参数的两个值或具体的数值:

  1. method=‘ffill’ 在遇到下一个非缺失值前向下填充;
  2. method='bfill’在遇到下一个非缺失值前向上填充;
  3. 具体的数值,例如fillna(0)就是将所有的缺失值都填为0。

本次代码使用jupyter notebook。
代码示例:
输入:

demo = pd.Series(range(6))
demo.loc[2:4] = np.nan
demo

输出:

0    0.0
1    1.0
2    NaN
3    NaN
4    NaN
5    5.0
dtype: float64

向下填充:

输入:

demo.fillna(method='ffill')

输出:

0    0.0
1    1.0
2    1.0
3    1.0
4    1.0
5    5.0
dtype: float64

向上填充:

输入:

demo.fillna(method='bfill')

输出:

0    0.0
1    1.0
2    5.0
3    5.0
4    5.0
5    5.0
dtype: float64

填充具体数值

输入:

demo.fillna(0)

输出

0    0.0
1    1.0
2    0.0
3    0.0
4    0.0
5    5.0
dtype: float64

数据独立时,处理缺失值的方法

在这里插入图片描述
通常,在处理缺失数据时,数据间的相关性无关紧要,因此,用于替换缺失值的值可以基于可用数据的整体。在这种情况下,通常会用最佳‘猜测值’(如可用数据的平均值或中值)替换缺失值。
让我们快速回顾一下为什么要小心使用这个方法。让我们假设你调查了1000个男孩和1000个女孩的体重,但是在调查过程中出现了一些缺失值。

#模拟男生与女生的体重
boys = np.random.normal(70,5,1000)
girls = np.random.normal(50,3,1000)
#女生的体重数据出现了一些缺失值
for i in range(100):`在这里插入代码片`
    girls[np.random.randint(0,1000)] = np.nan
#构造一个DataFrame
boys = pd.DataFrame(boys, columns=['weight'])
boys['gender'] = 'boy'
girls = pd.DataFrame(girls, columns=['weight'])
girls['gender'] = 'girl'
df = pd.concat([girls,boys],axis=0)
df['weight'] = df['weight'].astype(float)

输入

df

输出

weight	gender
0	50.445444	girl
1	49.461076	girl
2	51.764700	girl
3	52.189355	girl
4	54.475917	girl
...	...	...
995	70.953356	boy
996	70.519500	boy
997	65.386008	boy
998	68.856916	boy
999	59.730169	boy
2000 rows × 2 columns

整体均值填充

如果不经分析,直接用整体的均值填补缺失值的话,就会存在一个问题,就是导致缺失的女生体重值会普遍偏高,因为用的是整体的缺失值,包含了男生的体重在里面,在此基础上进行分析,就会得出错误的结论。

sns.set_style('white')
fig, ax = plt.subplots(figsize=(16, 7))
mean = df['weight'].mean()
sns.distplot(
    df[df['gender'] == 'girl']['weight'].fillna(mean),
    kde=True,
    hist=False,
    ax=ax,
    label='girls'
)
sns.distplot(
    df[df['gender'] == 'boy']['weight'],
    kde=True,
    hist=False,
    ax=ax,
    label='boys'
)
plt.title('Kernel density estimation of weight for boys and girls')
sns.despine()

在这里插入图片描述
从上图可以看出,部分女生的体重是异常的,原因就是填充值选取不合理。

分组均值填充

可以使用pandas的transform函数配合进行分组填充,如下所示:

df['filled_weight'] = df.groupby('gender')['weight'].transform(
    lambda grp: grp.fillna(np.mean(grp))
)
sns.set_style('white')
fig, ax = plt.subplots(figsize=(16, 7))
sns.distplot(
    df[df['gender'] == 'girl']['filled_weight'],
    kde=True,
    hist=False,
    ax=ax,
    label='girls'
)
sns.distplot(
    df[df['gender'] == 'boy']['filled_weight'],
    kde=True,
    hist=False,
    ax=ax,
    label='boys'
)
plt.title('Kernel density estimation of weight for boys and girls')
sns.despine()

在这里插入图片描述
最终的结果也会更加合理。

细分子组确实值填充

数据依然使用上一个例子,但是更进一步将数据分为年龄组。
首先创建虚拟数据:

# paramter for the weight distribution (mean, std)
param_map = {
    'boy':{
        '<10':(40,4),
        '<20':(60,4),
        '20+':(70,5),
    },
    'girl':{
        '<10':(30,2),
        '<20':(40,3),
        '20+':(50,3),
    }
}
# generate 10k records
df = pd.DataFrame({
    'gender':np.random.choice(['girl','boy'],10000),
    'age_cohort':np.random.choice(['<10','<20','20+'],10000)
})
# set random weight based on parameters
df['weight'] = df.apply(
    lambda x: np.random.normal(
        loc=param_map[x['gender']][x['age_cohort']][0],
        scale=param_map[x['gender']][x['age_cohort']][1]
    ),axis=1
)
# set 500 values missing
for i in range(500):
    df.loc[np.random.randint(0,len(df)),'weight'] = np.nan

如果用整体的均值填充缺失值,就会导致数据存在异常,让我们画出不同性别不同年龄段的体重密度图进行观察

df['filled_weight'] = df['weight'].fillna(
    df['weight'].mean()
)
g = sns.FacetGrid(
    df,
    col='age_cohort',
    row='gender',
    col_order=['<10','<20','20+']
)
g.map(sns.kdeplot,'filled_weight')

在这里插入图片描述
从图中可以看到,有些奇怪的双峰分布,会导致产生错误的结论。
现在,如果我们仅仅用各自性别的平均值来替换缺失的值,这还不够,因为不仅男孩和女孩的体重不同,而且不同年龄组的体重也有很大的差异。
transform函数依旧可以用来辅助填充缺失值:

df['filled_weight'] = df.groupby(['gender','age_cohort'])['weight'].transform(
    lambda grp: grp.fillna(np.mean(grp))
)

在这里插入图片描述
如上图所示,进行分组填充后得到的数据更加合理。

数据存在一定相关性时,如何处理缺失值

在处理时间序列数据时,经常会出现两种情况。

  1. 对齐日期范围:假设您查看各个国家的GDP,教育水平和人口年增长情况。有些国家的早些年的数据缺失,有些缺失了近期的数据,有些缺失了中间几年的数据。当然,你可以忽略他们。不过,出于可视化的目的,可能希望保留它们,并将第一个观察到的值投影到开始,将最后一个观察到的值投影到调查期间的结束,然后提出一些对中间的值有意义的方法。
  2. 插值:研究时间序列数据插值和排序变得非常重要。用一个基于2021年数据计算的平均值替换2012年缺失的股票数据,势必会产生一些古怪的结果。

以《2019年世界幸福报告》(World Happiness Report 2019)数据为例,我们将在其中讨论这两种情况。《世界幸福报告》试图回答影响全世界幸福的因素。该报告调查了2005年至2018年的数据。

df = pd.read_csv('https://raw.githubusercontent.com/FBosler/you-datascientist/master/happiness_with_continent.csv')
Country name	Year	Life Ladder	Log GDP per capita	Social support	Healthy life expectancy at birth	Freedom to make life choices	Generosity	Perceptions of corruption	Positive affect	...	GINI index (World Bank estimate), average 2000-16	gini of household income reported in Gallup, by wp5-year	Most people can be trusted, Gallup	Most people can be trusted, WVS round 1981-1984	Most people can be trusted, WVS round 1989-1993	Most people can be trusted, WVS round 1994-1998	Most people can be trusted, WVS round 1999-2004	Most people can be trusted, WVS round 2005-2009	Most people can be trusted, WVS round 2010-2014	Continent
0	Afghanistan	2008	3.723590	7.168690	0.450662	50.799999	0.718114	0.177889	0.881686	0.517637	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Asia
1	Afghanistan	2009	4.401778	7.333790	0.552308	51.200001	0.678896	0.200178	0.850035	0.583926	...	NaN	0.441906	0.286315	NaN	NaN	NaN	NaN	NaN	NaN	Asia
2	Afghanistan	2010	4.758381	7.386629	0.539075	51.599998	0.600127	0.134353	0.706766	0.618265	...	NaN	0.327318	0.275833	NaN	NaN	NaN	NaN	NaN	NaN	Asia
3	Afghanistan	2011	3.831719	7.415019	0.521104	51.919998	0.495901	0.172137	0.731109	0.611387	...	NaN	0.336764	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Asia
4	Afghanistan	2012	3.782938	7.517126	0.520637	52.240002	0.530935	0.244273	0.775620	0.710385	...	NaN	0.344540	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Asia
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1699	Zimbabwe	2014	4.184451	7.562753	0.765839	52.380001	0.642034	-0.048634	0.820217	0.725214	...	0.432	0.601080	NaN	NaN	NaN	NaN	0.116683	NaN	0.082942	Africa
1700	Zimbabwe	2015	3.703191	7.556052	0.735800	53.799999	0.667193	-0.097354	0.810457	0.715079	...	0.432	0.655137	NaN	NaN	NaN	NaN	0.116683	NaN	0.082942	Africa
1701	Zimbabwe	2016	3.735400	7.538829	0.768425	54.400002	0.732971	-0.068105	0.723612	0.737636	...	0.432	0.596690	NaN	NaN	NaN	NaN	0.116683	NaN	0.082942	Africa
1702	Zimbabwe	2017	3.638300	7.549491	0.754147	55.000000	0.752826	-0.069670	0.751208	0.806428	...	0.432	0.581484	NaN	NaN	NaN	NaN	0.116683	NaN	0.082942	Africa
1703	Zimbabwe	2018	3.616480	7.553395	0.775388	55.599998	0.762675	-0.038384	0.844209	0.710119	...	0.432	0.541772	NaN	NaN	NaN	NaN	0.116683	NaN	0.082942	Africa
1704 rows × 27 columns

首先统计以下每年的有多少个国家的数据。

# PLOT CODE:
df.groupby(['Year']).size().plot(
    kind='bar',
    title='Number of countries with data',
    figsize=(10,5)
)

在这里插入图片描述
我们可以看到,尤其是前几年,我们没有多少国家有数据,而且整个样本期都有一些波动。为了减轻丢失数据的影响,我们将执行以下操作:

  1. 按国家分组并重新编制整个日期范围的索引
  2. 在每个国家分组的基础上,在我们观测到的范围之外进行插值和外推。

按国家分组并重新编制整个日期范围的索引

# Define helper function
def add_missing_years(grp):
    _ = grp.set_index('Year')
    _ = _.reindex(list(range(2005,2019)))
    del _['Country name']
    return _
# Group by country name and extend
df = df.groupby('Country name').apply(add_missing_years)
df = df.reset_index()
Country name	Year	Life Ladder	Log GDP per capita	Social support	Healthy life expectancy at birth	Freedom to make life choices	Generosity	Perceptions of corruption	Positive affect	...	GINI index (World Bank estimate), average 2000-16	gini of household income reported in Gallup, by wp5-year	Most people can be trusted, Gallup	Most people can be trusted, WVS round 1981-1984	Most people can be trusted, WVS round 1989-1993	Most people can be trusted, WVS round 1994-1998	Most people can be trusted, WVS round 1999-2004	Most people can be trusted, WVS round 2005-2009	Most people can be trusted, WVS round 2010-2014	Continent
0	Afghanistan	2005	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	Afghanistan	2006	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	Afghanistan	2007	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	Afghanistan	2008	3.723590	7.168690	0.450662	50.799999	0.718114	0.177889	0.881686	0.517637	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Asia
4	Afghanistan	2009	4.401778	7.333790	0.552308	51.200001	0.678896	0.200178	0.850035	0.583926	...	NaN	0.441906	0.286315	NaN	NaN	NaN	NaN	NaN	NaN	Asia
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
2305	Zimbabwe	2014	4.184451	7.562753	0.765839	52.380001	0.642034	-0.048634	0.820217	0.725214	...	0.432	0.601080	NaN	NaN	NaN	NaN	0.116683	NaN	0.082942	Africa
2306	Zimbabwe	2015	3.703191	7.556052	0.735800	53.799999	0.667193	-0.097354	0.810457	0.715079	...	0.432	0.655137	NaN	NaN	NaN	NaN	0.116683	NaN	0.082942	Africa
2307	Zimbabwe	2016	3.735400	7.538829	0.768425	54.400002	0.732971	-0.068105	0.723612	0.737636	...	0.432	0.596690	NaN	NaN	NaN	NaN	0.116683	NaN	0.082942	Africa
2308	Zimbabwe	2017	3.638300	7.549491	0.754147	55.000000	0.752826	-0.069670	0.751208	0.806428	...	0.432	0.581484	NaN	NaN	NaN	NaN	0.116683	NaN	0.082942	Africa
2309	Zimbabwe	2018	3.616480	7.553395	0.775388	55.599998	0.762675	-0.038384	0.844209	0.710119	...	0.432	0.541772	NaN	NaN	NaN	NaN	0.116683	NaN	0.082942	Africa
2310 rows × 27 columns

可以看到经过重新分组设定日期索引后由原来的1704行变成了2310行,因为每个国家日期的范围都变成了从2005年到2018年,并且在补充的年份其他列为缺失值。

在每个国家分组的基础上,在我们观测到的范围之外进行插值和外推

按照线性插值的方法进行插值:代码如下

# Define helper function
def fill_missing(grp):
    res = grp.set_index('Year')\
    .interpolate(method='linear',limit=5)\
    .fillna(method='ffill')\
    .fillna(method='bfill')
    del res['Country name']
    return res
# Group by country name and fill missing
df = df.groupby(['Country name']).apply(
    lambda grp: fill_missing(grp)
)
df = df.reset_index()
Country name	Year	Life Ladder	Log GDP per capita	Social support	Healthy life expectancy at birth	Freedom to make life choices	Generosity	Perceptions of corruption	Positive affect	...	GINI index (World Bank estimate), average 2000-16	gini of household income reported in Gallup, by wp5-year	Most people can be trusted, Gallup	Most people can be trusted, WVS round 1981-1984	Most people can be trusted, WVS round 1989-1993	Most people can be trusted, WVS round 1994-1998	Most people can be trusted, WVS round 1999-2004	Most people can be trusted, WVS round 2005-2009	Most people can be trusted, WVS round 2010-2014	Continent
0	Afghanistan	2005	3.723590	7.168690	0.450662	50.799999	0.718114	0.177889	0.881686	0.517637	...	NaN	0.441906	0.286315	NaN	NaN	NaN	NaN	NaN	NaN	Asia
1	Afghanistan	2006	3.723590	7.168690	0.450662	50.799999	0.718114	0.177889	0.881686	0.517637	...	NaN	0.441906	0.286315	NaN	NaN	NaN	NaN	NaN	NaN	Asia
2	Afghanistan	2007	3.723590	7.168690	0.450662	50.799999	0.718114	0.177889	0.881686	0.517637	...	NaN	0.441906	0.286315	NaN	NaN	NaN	NaN	NaN	NaN	Asia
3	Afghanistan	2008	3.723590	7.168690	0.450662	50.799999	0.718114	0.177889	0.881686	0.517637	...	NaN	0.441906	0.286315	NaN	NaN	NaN	NaN	NaN	NaN	Asia
4	Afghanistan	2009	4.401778	7.333790	0.552308	51.200001	0.678896	0.200178	0.850035	0.583926	...	NaN	0.441906	0.286315	NaN	NaN	NaN	NaN	NaN	NaN	Asia
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
2305	Zimbabwe	2014	4.184451	7.562753	0.765839	52.380001	0.642034	-0.048634	0.820217	0.725214	...	0.432	0.601080	0.148151	NaN	NaN	NaN	0.116683	NaN	0.082942	Africa
2306	Zimbabwe	2015	3.703191	7.556052	0.735800	53.799999	0.667193	-0.097354	0.810457	0.715079	...	0.432	0.655137	0.148151	NaN	NaN	NaN	0.116683	NaN	0.082942	Africa
2307	Zimbabwe	2016	3.735400	7.538829	0.768425	54.400002	0.732971	-0.068105	0.723612	0.737636	...	0.432	0.596690	0.148151	NaN	NaN	NaN	0.116683	NaN	0.082942	Africa
2308	Zimbabwe	2017	3.638300	7.549491	0.754147	55.000000	0.752826	-0.069670	0.751208	0.806428	...	0.432	0.581484	0.148151	NaN	NaN	NaN	0.116683	NaN	0.082942	Africa
2309	Zimbabwe	2018	3.616480	7.553395	0.775388	55.599998	0.762675	-0.038384	0.844209	0.710119	...	0.432	0.541772	0.148151	NaN	NaN	NaN	0.116683	NaN	0.082942	Africa
2310 rows × 27 columns

现在我们有了样本中所有国家2005年至2018年的数据。

  • 1
    点赞
  • 3
    收藏
    觉得还不错? 一键收藏
  • 5
    评论
评论 5
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值