Machine Learning: A Feature Engineering Tutorial

The basic workflow of feature engineering

https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python
1)Understand the problem. We’ll look at each variable and do a philosophical analysis about their meaning and importance for this problem.
Understand the problem.
2)Univariable study. We’ll just focus on the dependent variable (‘SalePrice’) and try to know a little bit more about it.
Univariate study of the target variable.
3)Multivariate study. We’ll try to understand how the dependent variable and independent variables relate.
Study how the variables relate to one another and to the target.
4)Basic cleaning. We’ll clean the dataset and handle the missing data, outliers and categorical variables.
Data cleaning: handle missing data, outliers, and categorical variables.
5)Test assumptions. We’ll check if our data meets the assumptions required by most multivariate techniques.
Verify the statistical assumptions.

Data import

Import the libraries, load the data, and check the column names.
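A minimal sketch of this step, assuming the Kaggle House Prices training set has been saved locally as train.csv; the imports also cover the plotting and statistics libraries used in the rest of the walkthrough:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Load the training data (file name assumed)
df_train = pd.read_csv('train.csv')

# Check the column names
print(df_train.columns)
```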

So… What can we expect?

Go through each feature and think about how it relates to the problem we are analysing. It helps to build an Excel spreadsheet with the following columns to record the reasoning and the process.
1)Variable - Variable name.
Variable / feature name.
2)Type - Identification of the variables’ type. There are two possible values for this field: ‘numerical’ or ‘categorical’. By ‘numerical’ we mean variables for which the values are numbers, and by ‘categorical’ we mean variables for which the values are categories.
Variable type, usually either categorical or numerical.
Since most models can only work with numerical inputs, categorical features have to be converted to numbers first.
3)Segment - Identification of the variables’ segment. We can define three possible segments: building, space or location. When we say ‘building’, we mean a variable that relates to the physical characteristics of the building (e.g. ‘OverallQual’). When we say ‘space’, we mean a variable that reports space properties of the house (e.g. ‘TotalBsmtSF’). Finally, when we say a ‘location’, we mean a variable that gives information about the place where the house is located (e.g. ‘Neighborhood’).
While working through the problem, first group all the features into broad segments. In the house-price problem, for example, every feature falls into one of three segments: building characteristics (quality and so on), space, and location.
4)Expectation - Our expectation about the variable influence in ‘SalePrice’. We can use a categorical scale with ‘High’, ‘Medium’ and ‘Low’ as possible values.
Expected importance: based on our own understanding, estimate how strongly each feature influences the target, on a High / Medium / Low scale.
5)Conclusion - Our conclusions about the importance of the variable, after we give a quick look at the data. We can keep with the same categorical scale as in ‘Expectation’.
Conclusion on importance: after looking at the data, record how strongly each feature actually influences the target, on the same High / Medium / Low scale.
we can rush into some scatter plots between those variables and ‘SalePrice’, filling in the ‘Conclusion’ column which is just the correction of our expectations.
Scatter plots are a quick way to carry out this variable analysis.
Maybe this is related to the use of scatter plots instead of boxplots, which are more suitable for categorical variables visualization. The way we visualize data often influences our conclusions.
Scatter plots are better suited to numerical variables; box plots are better suited to categorical variables.
6)Comments - Any general comments that occurred to us.
Comments: note anything that comes up during the analysis.

First things first: analysing ‘SalePrice’

'SalePrice' is the target we want to predict, so the first step is to understand its characteristics.
Calling describe() on the feature gives its summary statistics (reproduced in the sketch below).
The summary shows that the target is a floating-point value whose minimum is well above zero, so there are no obviously invalid observations.
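A sketch of these two steps, continuing from the df_train loaded above (histplot is used here in place of the older seaborn distplot):

```python
# Summary statistics of the target
print(df_train['SalePrice'].describe())

# Histogram with a density curve to see the shape of the distribution
sns.histplot(df_train['SalePrice'], kde=True)
plt.show()
```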
The histogram shows that the data roughly follows a bell shape, with the following characteristics:
1)Deviate from the normal distribution.
It deviates from the normal distribution.
2)Have appreciable positive skewness.
It has an appreciable positive skew.
3)Show peakedness.
It is sharply peaked (leptokurtic).

Two concepts come up here: skewness and kurtosis.

Skewness describes the asymmetry of a distribution. A right (positive) skew shows up as a long tail trailing off to the right of the plot: most values sit on the left and a small number of values extend far to the right.

Kurtosis describes how sharply peaked the distribution is: the larger the kurtosis, the sharper the peak. With the variance held fixed, if most of the values are tightly concentrated around the centre, then some values must lie far away from it for the variance to match that of a normal distribution. This is the so-called "heavy tail", and it means more extreme values.
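Both statistics can be read directly from pandas; a minimal sketch:

```python
# Skewness and kurtosis of the target
print("Skewness: %f" % df_train['SalePrice'].skew())
print("Kurtosis: %f" % df_train['SalePrice'].kurt())
```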
Open question: what exactly do we gain from computing the skewness and kurtosis of the target?

Relationship with numerical variables

Look at the relationship between the prediction target and the numerical variables.
As noted earlier, scatter plots are the tool of choice for examining numerical variables.
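A sketch of the two scatter plots discussed below, continuing from df_train:

```python
# Scatter plots of SalePrice against two numerical features
for var in ['GrLivArea', 'TotalBsmtSF']:
    data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
    data.plot.scatter(x=var, y='SalePrice', ylim=(0, 800000))
    plt.show()
```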
GrLivArea clearly has a strong linear relationship with SalePrice.

‘TotalBsmtSF’ is also a great friend of ‘SalePrice’ but this seems a much more emotional relationship! Everything is ok and suddenly, in a strong linear (exponential?) reaction, everything changes. Moreover, it’s clear that sometimes ‘TotalBsmtSF’ closes in itself and gives zero credit to ‘SalePrice’.
TotalBsmtSF also shows a strong, possibly exponential, relationship with SalePrice, and in some observations TotalBsmtSF is zero while SalePrice still varies, which is a special case worth noting.

Relationship with categorical features

Look at the relationship between the prediction target and the categorical variables.
As noted earlier, box plots are the tool of choice for examining categorical variables.
A box plot summarises the distribution of SalePrice within each category: the box marks the quartiles and the median, while the whiskers and isolated points show the rest of the range and potential outliers.
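A sketch of the box plots discussed below, assuming the two categorical-style features examined in this section are OverallQual and YearBuilt:

```python
# SalePrice grouped by overall quality
plt.figure(figsize=(8, 6))
sns.boxplot(x='OverallQual', y='SalePrice', data=df_train)
plt.show()

# SalePrice grouped by construction year (many categories, so rotate the labels)
plt.figure(figsize=(16, 6))
sns.boxplot(x='YearBuilt', y='SalePrice', data=df_train)
plt.xticks(rotation=90)
plt.show()
```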
OverallQual is strongly related to SalePrice: as OverallQual increases, the median and the quartiles of SalePrice rise with it.


Keep calm and work smart

So far we have only analysed a handful of features picked on intuition. For a more objective view we can look at:

1)Correlation matrix (heatmap style).
Correlation matrix of all variables (heatmap).
2)‘SalePrice’ correlation matrix (zoomed heatmap style).
Zoomed heatmap of the variables most correlated with 'SalePrice'.
3)Scatter plots between the most correlated variables (move like Jagger style).
Scatter plots between the most correlated variables.

A heatmap is the quickest way to get an overview of the correlations between all of the variables.
Next, zoom in on the 10 features most strongly correlated with SalePrice (both heatmaps are sketched below).
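A sketch of both heatmaps, keeping only the numeric columns for the correlation matrix:

```python
# Correlation matrix of the numerical variables, drawn as a heatmap
corrmat = df_train.select_dtypes(include=[np.number]).corr()
plt.figure(figsize=(12, 9))
sns.heatmap(corrmat, vmax=0.8, square=True)
plt.show()

# Zoomed heatmap: the 10 variables most correlated with SalePrice
k = 10
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(df_train[cols].values.T)
sns.heatmap(cm, annot=True, fmt='.2f', square=True,
            xticklabels=cols.values, yticklabels=cols.values)
plt.show()
```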
1)‘OverallQual’, ‘GrLivArea’ and ‘TotalBsmtSF’ are strongly correlated with ‘SalePrice’. Check!
2)‘GarageCars’ and ‘GarageArea’ are also some of the most strongly correlated variables. However, as we discussed in the last sub-point, the number of cars that fit into the garage is a consequence of the garage area. ‘GarageCars’ and ‘GarageArea’ are like twin brothers. You’ll never be able to distinguish them. Therefore, we just need one of these variables in our analysis (we can keep ‘GarageCars’ since its correlation with ‘SalePrice’ is higher).
3)‘TotalBsmtSF’ and ‘1stFloor’ also seem to be twin brothers. We can keep ‘TotalBsmtSF’ just to say that our first guess was right (re-read ‘So… What can we expect?’).
4)‘FullBath’?? Really?
5)‘TotRmsAbvGrd’ and ‘GrLivArea’, twin brothers again. Is this dataset from Chernobyl?
6)Ah… ‘YearBuilt’… It seems that ‘YearBuilt’ is slightly correlated with ‘SalePrice’. Honestly, it scares me to think about ‘YearBuilt’ because I start feeling that we should do a little bit of time-series analysis to get this right. I’ll leave this as a homework for you.
Open question: is a time-series analysis of 'YearBuilt' needed here?

Finally, a scatter-plot matrix can be used to examine how the selected features relate to SalePrice and to each other (sketched below).
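A sketch of the pair plot, assuming we keep the variables retained in the discussion above:

```python
# Scatter-plot matrix of the selected variables
cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars',
        'TotalBsmtSF', 'FullBath', 'YearBuilt']
sns.pairplot(df_train[cols], height=2.5)
plt.show()
```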

Missing data


Before handling missing data, we first need to look at two of its characteristics:
1)How prevalent is the missing data?
How much of the data is missing?
2)Is missing data random or does it have a pattern?
Is the missing data random, or does it follow a pattern?

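A sketch of the missing-data summary that answers the two questions above:

```python
# Count and percentage of missing values per column, sorted in descending order
total = df_train.isnull().sum().sort_values(ascending=False)
percent = (df_train.isnull().sum() / df_train.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
print(missing_data.head(20))
```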
We’ll consider that when more than 15% of the data is missing, we should delete the corresponding variable and pretend it never existed. This means that we will not try any trick to fill the missing data in these cases.
As a rule of thumb, when more than 15% of a variable is missing we consider dropping that variable outright. We should still check whether the dropped variables are actually strongly related to the target.

The 'GarageX' variables are all missing in exactly the same observations. Going back to the earlier correlation analysis, these features are strongly related to 'GarageCars'; since that information is already captured by 'GarageCars', they can be deleted. The same applies to the 'BsmtX' variables.

Regarding ‘MasVnrArea’ and ‘MasVnrType’, we can consider that these variables are not essential. Furthermore, they have a strong correlation with ‘YearBuilt’ and ‘OverallQual’ which are already considered. Thus, we will not lose information if we delete ‘MasVnrArea’ and ‘MasVnrType’.

Finally, we have one missing observation in ‘Electrical’. Since it is just one observation, we’ll delete this observation and keep the variable.

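One simple way to implement these decisions, continuing from the missing_data table above: drop every column that still has more than one missing value, keep 'Electrical', and drop its single missing row.

```python
# Drop the variables with substantial missing data, then the one row missing 'Electrical'
df_train = df_train.drop(missing_data[missing_data['Total'] > 1].index, axis=1)
df_train = df_train.drop(df_train.loc[df_train['Electrical'].isnull()].index)
print(df_train.isnull().sum().max())  # 0 means no missing values remain
```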

Out liars

Outlier detection.

Univariate analysis


The primary concern here is to establish a threshold that defines an observation as an outlier. To do so, we’ll standardize the data. In this context, data standardization means converting data values to have mean of 0 and a standard deviation of 1.

To look for outliers, we first standardize the data (convert it to mean 0 and standard deviation 1) and examine the extreme standardized values.
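A sketch of the standardization step; subtracting the mean and dividing by the standard deviation is equivalent to using a scaler from scikit-learn:

```python
# Standardize SalePrice (mean 0, standard deviation 1) and inspect the extremes
saleprice_scaled = (df_train['SalePrice'] - df_train['SalePrice'].mean()) / df_train['SalePrice'].std()
print('outer range (low) of the distribution:')
print(saleprice_scaled.sort_values().head(10))
print('outer range (high) of the distribution:')
print(saleprice_scaled.sort_values().tail(10))
```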
From the low and high ends of the standardized values we can draw the following conclusions:
1)Low range values are similar and not too far from 0.
2)High range values are far from 0 and the 7.something values are really out of range.
For now, we’ll not consider any of these values as an outlier but we should be careful with those two 7.something values.

Bivariate analysis

Bivariate analysis: examine each feature together with the prediction target, one pair at a time.
The scatter plot of GrLivArea against SalePrice shows a clear trend, and roughly four unusual points stand out:
1)The two observations with the largest GrLivArea have abnormally low prices and clearly break the overall trend, so we choose to remove them.
2)The two highest-priced observations (the ones above 700,000 mentioned earlier) are unusual, but they still follow the overall trend, so we keep them.

Clean the data according to these decisions.
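A sketch of the clean-up, assuming (as the scatter plot suggests) that the two points to delete are the two largest GrLivArea values:

```python
# Remove the two observations with very large GrLivArea but abnormally low prices
outlier_idx = df_train.sort_values(by='GrLivArea', ascending=False).index[:2]
df_train = df_train.drop(outlier_idx)
```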
Then move on to the next feature, TotalBsmtSF.
We could delete the observations with TotalBsmtSF above 3000, but they do not stand out strongly from the trend, so keeping them is also reasonable.

Getting hard core

The answer to this question lies in testing for the assumptions underlying the statistical bases for multivariate analysis. We already did some data cleaning and discovered a lot about ‘SalePrice’. Now it’s time to go deep and understand how ‘SalePrice’ complies with the statistical assumptions that enables us to apply multivariate techniques.
At this point we have done some data cleaning and gained a reasonable understanding of the prediction target. Now we need to dig deeper and check how well 'SalePrice' satisfies the statistical assumptions behind multivariate techniques.

Four aspects are worth examining:

1)Normality - When we talk about normality what we mean is that the data should look like a normal distribution. This is important because several statistic tests rely on this (e.g. t-statistics). In this exercise we’ll just check univariate normality for ‘SalePrice’ (which is a limited approach). Remember that univariate normality doesn’t ensure multivariate normality (which is what we would like to have), but it helps. Another detail to take into account is that in big samples (>200 observations) normality is not such an issue. However, if we solve normality, we avoid a lot of other problems (e.g. heteroscedacity) so that’s the main reason why we are doing this analysis.
Check whether the data follows a normal distribution. Several statistical tests assume normality, and fixing normality up front also helps us avoid many other problems (such as heteroscedasticity).

2)Homoscedasticity - I just hope I wrote it right. Homoscedasticity refers to the ‘assumption that dependent variable(s) exhibit equal levels of variance across the range of predictor variable(s)’ (Hair et al., 2013). Homoscedasticity is desirable because we want the error term to be the same across all values of the independent variables.
Homoscedasticity refers to the "assumption that dependent variable(s) exhibit equal levels of variance across the range of predictor variable(s)" (Hair et al., 2013). It is desirable because we want the error term to be the same across all values of the independent variables.

3)Linearity- The most common way to assess linearity is to examine scatter plots and search for linear patterns. If patterns are not linear, it would be worthwhile to explore data transformations. However, we’ll not get into this because most of the scatter plots we’ve seen appear to have linear relationships.
Linearity: the most common way to assess it is with scatter plots. If the pattern is not linear, it is worth exploring data transformations to obtain a linear relationship.

4)Absence of correlated errors - Correlated errors, like the definition suggests, happen when one error is correlated to another. For instance, if one positive error makes a negative error systematically, it means that there’s a relationship between these variables. This occurs often in time series, where some patterns are time related. We’ll also not get into this. However, if you detect something, try to add a variable that can explain the effect you’re getting. That’s the most common solution for correlated errors.
Absence of correlated errors: correlated errors occur when one error is related to another, for example when a positive error systematically goes with a negative one. This often happens in time series, where some patterns are time related. We will not go into it here; if you do detect it, the most common fix is to add a variable that explains the effect you are seeing.

In the search for normality

Quick ways to check the normality of a single variable:
1)Histogram - Kurtosis and skewness.
Histogram: judge the kurtosis and skewness.
2)Normal probability plot - Data distribution should closely follow the diagonal that represents the normal distribution.
Normal probability plot: the data should follow the diagonal that represents the normal distribution closely.
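A sketch of the two checks applied to SalePrice:

```python
# Histogram with a density curve, then a normal probability plot
sns.histplot(df_train['SalePrice'], kde=True)
plt.show()
stats.probplot(df_train['SalePrice'], plot=plt)
plt.show()
```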
The plots clearly show that the distribution of SalePrice is not a standard normal: it has an obvious peak and positive skew, and it does not follow the diagonal.
A log transformation usually fixes this.
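A sketch of the transformation and the re-check:

```python
# Log-transform SalePrice and repeat the normality checks
df_train['SalePrice'] = np.log(df_train['SalePrice'])
sns.histplot(df_train['SalePrice'], kde=True)
plt.show()
stats.probplot(df_train['SalePrice'], plot=plt)
plt.show()
```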
Comparing the plots before and after the transformation, the distribution is now clearly much closer to normal.

Next, run the same normality analysis on the features, one at a time.
The skew of the earlier features can also be fixed with a log transformation. One feature, however, is special: looking closely, TotalBsmtSF takes the value zero in some observations, and the logarithm is only defined for values greater than zero. One option is to add an extra (binary) variable recording whether the house has a basement, log-transform the observations that are greater than zero, and leave those equal to zero untransformed. This is not guaranteed to be correct, which is why it is called "high risk engineering".
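A sketch of this "high risk" step; the flag column name HasBsmt is an arbitrary choice:

```python
# Binary flag for having a basement, then log-transform only the non-zero areas
df_train['HasBsmt'] = (df_train['TotalBsmtSF'] > 0).astype(int)
mask = df_train['HasBsmt'] == 1
df_train.loc[mask, 'TotalBsmtSF'] = np.log(df_train.loc[mask, 'TotalBsmtSF'])

# Re-check normality on the houses that do have a basement
sns.histplot(df_train.loc[mask, 'TotalBsmtSF'], kde=True)
plt.show()
stats.probplot(df_train.loc[mask, 'TotalBsmtSF'], plot=plt)
plt.show()
```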

Homoscedasticity

The best way to check homoscedasticity is graphically. If two variables are not homoscedastic, the plot shows a clear cone or diamond shape.

Below is the scatter plot of the two transformed variables, 'SalePrice' against 'GrLivArea'.
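A sketch of the check, assuming GrLivArea is log-transformed in the same way as SalePrice was above:

```python
# Transform GrLivArea, then plot the transformed pair; homoscedastic data shows no cone shape
df_train['GrLivArea'] = np.log(df_train['GrLivArea'])
plt.scatter(df_train['GrLivArea'], df_train['SalePrice'])
plt.xlabel('GrLivArea')
plt.ylabel('SalePrice')
plt.show()
```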
Compared with the plot before the transformation, which showed a clear cone shape, the log transformation has removed most of the heteroscedasticity.
Now let’s check ‘SalePrice’ with ‘TotalBsmtSF’.
That pair also looks roughly homoscedastic.

Last but not the least, dummy variables

Dummy variable conversion: one-hot encode the categorical features.
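A one-line sketch using pandas:

```python
# One-hot encode all remaining categorical columns
df_train = pd.get_dummies(df_train)
print(df_train.shape)
```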

Conclusion

Throughout this kernel we put in practice many of the strategies proposed by Hair et al. (2013). We philosophied about the variables, we analysed ‘SalePrice’ alone and with the most correlated variables, we dealt with missing data and outliers, we tested some of the fundamental statistical assumptions and we even transformed categorial variables into dummy variables. That’s a lot of work that Python helped us make easier.

But the quest is not over. Remember that our story stopped in the Facebook research. Now it’s time to give a call to ‘SalePrice’ and invite her to dinner. Try to predict her behaviour. Do you think she’s a girl that enjoys regularized linear regression approaches? Or do you think she prefers ensemble methods? Or maybe something else?

It’s up to you to find out.

