# The Basic Workflow of Feature Engineering

Reference: https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python
1）Understand the problem. We’ll look at each variable and do a philosophical analysis about their meaning and importance for this problem.

2）Univariate study. We’ll just focus on the dependent variable (‘SalePrice’) and try to know a little bit more about it.

3）Multivariate study. We’ll try to understand how the dependent variable and independent variables relate.

4）Basic cleaning. We’ll clean the dataset and handle the missing data, outliers and categorical variables.

5）Test assumptions. We’ll check if our data meets the assumptions required by most multivariate techniques.

# So… What can we expect?

1）Variable - Variable name.

2）Type - Identification of the variables’ type. There are two possible values for this field: ‘numerical’ or ‘categorical’. By ‘numerical’ we mean variables for which the values are numbers, and by ‘categorical’ we mean variables for which the values are categories.

3）Segment - Identification of the variables’ segment. We can define three possible segments: building, space or location. When we say ‘building’, we mean a variable that relates to the physical characteristics of the building (e.g. ‘OverallQual’). When we say ‘space’, we mean a variable that reports space properties of the house (e.g. ‘TotalBsmtSF’). Finally, when we say a ‘location’, we mean a variable that gives information about the place where the house is located (e.g. ‘Neighborhood’).

4）Expectation - Our expectation about the variable influence in ‘SalePrice’. We can use a categorical scale with ‘High’, ‘Medium’ and ‘Low’ as possible values.

5）Conclusion - Our conclusions about the importance of the variable, after we give a quick look at the data. We can keep with the same categorical scale as in ‘Expectation’.

We can rush into some scatter plots between those variables and ‘SalePrice’, filling in the ‘Conclusion’ column, which is simply a correction of our expectations.

Maybe this is related to the use of scatter plots instead of boxplots, which are more suitable for categorical variables visualization. The way we visualize data often influences our conclusions.
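To make that point concrete, here is a minimal sketch of the boxplot approach for a categorical variable, using synthetic stand-in data rather than the real training set (the column names follow the dataset, the numbers are invented):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the training data: price rises with quality.
rng = np.random.default_rng(5)
df = pd.DataFrame({
    "OverallQual": np.repeat([3, 6, 9], 100),
    "SalePrice": np.concatenate(
        [rng.normal(m, 15000, 100) for m in (120000, 180000, 300000)]
    ),
})

# One box per category: medians and spread are visible at a glance,
# which a scatter plot over a categorical x-axis would obscure.
groups = [g["SalePrice"].to_numpy() for _, g in df.groupby("OverallQual")]
fig, ax = plt.subplots()
ax.boxplot(groups)
ax.set_xlabel("OverallQual")
ax.set_ylabel("SalePrice")
fig.savefig("qual_boxplot.png")
```

The per-category boxes make the median trend across quality levels immediately visible.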

# First things first: analysing ‘SalePrice’

‘SalePrice’ is our prediction target, so the first step is to understand its characteristics. Its distribution turns out to:

1）Deviate from the normal distribution.

2）Have appreciable positive skewness.

3）Show peakedness.
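A minimal check of those three properties with scipy, shown here on synthetic log-normal prices as a stand-in for the real ‘SalePrice’ column:

```python
import numpy as np
from scipy.stats import kurtosis, skew

# Log-normal draws mimic a right-skewed, peaked price distribution;
# in the kernel this would be train['SalePrice'] instead.
rng = np.random.default_rng(0)
sale_price = np.exp(rng.normal(loc=12.0, scale=0.4, size=1460))

print(f"Skewness: {skew(sale_price):.2f}")      # > 0: appreciable positive skew
print(f"Kurtosis: {kurtosis(sale_price):.2f}")  # > 0: peakedness (heavier tails than normal)
```

Positive skewness and positive excess kurtosis together are exactly the deviation from normality described above.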

## Relationship with numerical variables

‘TotalBsmtSF’ is also a great friend of ‘SalePrice’ but this seems a much more emotional relationship! Everything is ok and suddenly, in a strong linear (exponential?) reaction, everything changes. Moreover, it’s clear that sometimes ‘TotalBsmtSF’ closes in itself and gives zero credit to ‘SalePrice’.
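The pattern described above can be sketched with synthetic data: houses with no basement (‘TotalBsmtSF’ == 0) give “zero credit” to ‘SalePrice’, while the rest track it roughly linearly (hypothetical numbers, not the real dataset):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(6)
# About 10% of houses have no basement at all ('TotalBsmtSF' == 0).
bsmt = np.where(rng.random(500) < 0.1, 0.0, rng.normal(1100, 400, 500).clip(min=100))
price = 150000 + 90 * bsmt + rng.normal(0, 25000, 500)

fig, ax = plt.subplots()
ax.scatter(bsmt, price, s=8)
ax.set_xlabel("TotalBsmtSF")
ax.set_ylabel("SalePrice")
fig.savefig("bsmt_scatter.png")
```

The vertical cluster at zero and the linear band above it reproduce the two regimes the paragraph describes.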

# Keep calm and work smart

1）Correlation matrix (heatmap style).

2）‘SalePrice’ correlation matrix (zoomed heatmap style), i.e. the correlation of each variable with ‘SalePrice’.
3）Scatter plots between the most correlated variables (move like Jagger style).
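The first two steps can be sketched as follows, with a hypothetical three-column frame standing in for the real training set (in the kernel, `df = pd.read_csv('train.csv')`):

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'SalePrice' built from quality and area plus noise.
rng = np.random.default_rng(1)
n = 200
quality = rng.integers(1, 11, n)
area = rng.normal(1500, 300, n)
df = pd.DataFrame({
    "OverallQual": quality,
    "GrLivArea": area,
    "SalePrice": 20000 * quality + 80 * area + rng.normal(0, 10000, n),
})

corr = df.corr()                           # input for the full heatmap
top = corr["SalePrice"].abs().nlargest(3)  # the zoomed view: strongest correlations
print(top)
```

In the kernel the matrix is drawn with `sns.heatmap(corr)`; `nlargest` picks out the variables worth zooming in on for the scatter plots.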

1）‘OverallQual’, ‘GrLivArea’ and ‘TotalBsmtSF’ are strongly correlated with ‘SalePrice’. Check!
2）‘GarageCars’ and ‘GarageArea’ are also some of the most strongly correlated variables. However, as we discussed in the last sub-point, the number of cars that fit into the garage is a consequence of the garage area. ‘GarageCars’ and ‘GarageArea’ are like twin brothers. You’ll never be able to distinguish them. Therefore, we just need one of these variables in our analysis (we can keep ‘GarageCars’ since its correlation with ‘SalePrice’ is higher).
3）‘TotalBsmtSF’ and ‘1stFlrSF’ also seem to be twin brothers. We can keep ‘TotalBsmtSF’ just to say that our first guess was right (re-read ‘So… What can we expect?’).
4）‘FullBath’?? Really?
5）‘TotRmsAbvGrd’ and ‘GrLivArea’, twin brothers again. Is this dataset from Chernobyl?
6）Ah… ‘YearBuilt’… It seems that ‘YearBuilt’ is slightly correlated with ‘SalePrice’. Honestly, it scares me to think about ‘YearBuilt’ because I start feeling that we should do a little bit of time-series analysis to get this right. I’ll leave this as a homework for you.

# Missing data

Important questions when thinking about missing data:

1）How prevalent is the missing data?

2）Is missing data random or does it have a pattern?

We’ll consider that when more than 15% of the data is missing, we should delete the corresponding variable and pretend it never existed. This means that we will not try any trick to fill the missing data in these cases.

Regarding ‘MasVnrArea’ and ‘MasVnrType’, we can consider that these variables are not essential. Furthermore, they have a strong correlation with ‘YearBuilt’ and ‘OverallQual’ which are already considered. Thus, we will not lose information if we delete ‘MasVnrArea’ and ‘MasVnrType’.

Finally, we have one missing observation in ‘Electrical’. Since it is just one observation, we’ll delete this observation and keep the variable.
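The whole policy above fits in a few lines of pandas. A sketch on a toy frame standing in for the training set (column names from the dataset, values invented):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the training set (hypothetical values).
df = pd.DataFrame({
    "PoolQC": [np.nan] * 9 + ["Gd"],         # 90% missing: delete the variable
    "Electrical": ["SBrkr"] * 9 + [np.nan],  # one missing: delete the observation
    "SalePrice": list(range(10)),
})

missing_pct = df.isnull().sum() / len(df)
df = df.drop(columns=missing_pct[missing_pct > 0.15].index)  # drop heavy-missing variables
df = df.dropna(subset=["Electrical"])                        # keep the variable, drop the row
print(df.shape)
```

Variables above the 15% threshold disappear entirely, while ‘Electrical’ survives with one fewer observation.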

## Univariate analysis

The primary concern here is to establish a threshold that defines an observation as an outlier. To do so, we’ll standardize the data. In this context, data standardization means converting data values to have a mean of 0 and a standard deviation of 1.

1）Low range values are similar and not too far from 0.
2）High range values are far from 0 and the 7.something values are really out of range.
For now, we’ll not consider any of these values as an outlier but we should be careful with those two 7.something values.
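Standardization itself is one line of numpy. A sketch with hypothetical prices, 1458 ordinary sales plus two extreme ones:

```python
import numpy as np

# 1458 ordinary sales plus two extreme ones (hypothetical values).
rng = np.random.default_rng(2)
prices = np.append(rng.normal(180000, 75000, 1458), [700000, 755000])

# Standardize: subtract the mean, divide by the standard deviation.
scaled = (prices - prices.mean()) / prices.std()
print(np.sort(scaled)[:5])   # low range: similar, not far from 0
print(np.sort(scaled)[-5:])  # high range: the extreme values stand out
```

The two extreme sales land far out in the standardized scale, which is exactly the pattern flagged above.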

## Bivariate analysis

1）The two points with the largest ‘GrLivArea’ have abnormally low prices. They clearly do not follow the overall trend, so we can remove these two points.

2）The two highest-priced points in ‘GrLivArea’ (the ones above 700,000 mentioned earlier) are somewhat special, but they do follow the overall trend, so we keep them.
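Removing those two trend-breaking points is a simple boolean filter. A sketch on hypothetical rows mimicking the pattern (huge living area, abnormally low price):

```python
import pandas as pd

# Hypothetical rows: the last two mimic the trend-breaking points
# (very large 'GrLivArea' but abnormally low 'SalePrice').
df = pd.DataFrame({
    "GrLivArea": [1500, 1700, 2000, 4700, 5600],
    "SalePrice": [150000, 180000, 220000, 160000, 185000],
})

# Select and drop the points that break the trend; the thresholds
# here are illustrative, not the kernel's exact values.
outliers = df[(df["GrLivArea"] > 4000) & (df["SalePrice"] < 300000)]
df = df.drop(outliers.index)
print(len(df))
```

High-priced points that still follow the trend would not match both conditions and therefore survive the filter.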

# Getting hard core

The answer to this question lies in testing for the assumptions underlying the statistical bases for multivariate analysis. We already did some data cleaning and discovered a lot about ‘SalePrice’. Now it’s time to go deep and understand how ‘SalePrice’ complies with the statistical assumptions that enable us to apply multivariate techniques.

1）Normality - When we talk about normality what we mean is that the data should look like a normal distribution. This is important because several statistical tests rely on this (e.g. t-statistics). In this exercise we’ll just check univariate normality for ‘SalePrice’ (which is a limited approach). Remember that univariate normality doesn’t ensure multivariate normality (which is what we would like to have), but it helps. Another detail to take into account is that in big samples (>200 observations) normality is not such an issue. However, if we solve normality, we avoid a lot of other problems (e.g. heteroscedasticity) so that’s the main reason why we are doing this analysis.

2）Homoscedasticity - I just hope I wrote it right. Homoscedasticity refers to the ‘assumption that dependent variable(s) exhibit equal levels of variance across the range of predictor variable(s)’ (Hair et al., 2013). Homoscedasticity is desirable because we want the error term to be the same across all values of the independent variables.

3）Linearity - The most common way to assess linearity is to examine scatter plots and search for linear patterns. If patterns are not linear, it would be worthwhile to explore data transformations. However, we’ll not get into this because most of the scatter plots we’ve seen appear to have linear relationships.

4）Absence of correlated errors - Correlated errors, as the definition suggests, happen when one error is correlated to another. For instance, if a positive error is systematically followed by a negative error, it means that there’s a relationship between these errors. This occurs often in time series, where some patterns are time related. We’ll also not get into this. However, if you detect something, try to add a variable that can explain the effect you’re getting. That’s the most common solution for correlated errors.

## In the search for normality

1）Histogram - Kurtosis and skewness.

2）Normal probability plot - Data distribution should closely follow the diagonal that represents the normal distribution.
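When the histogram and probability plot reveal positive skew, the usual fix is a log transform. A sketch on synthetic right-skewed prices (a stand-in for ‘SalePrice’):

```python
import numpy as np
from scipy.stats import probplot, skew

rng = np.random.default_rng(3)
sale_price = np.exp(rng.normal(12.0, 0.4, 1460))  # right-skewed stand-in

log_price = np.log(sale_price)  # log transform pulls the tail in
print(f"skew before: {skew(sale_price):.2f}, after: {skew(log_price):.2f}")

# probplot returns the points of the normal probability plot plus a
# least-squares fit; after the transform the points should hug the diagonal.
(osm, osr), (slope, intercept, r) = probplot(log_price, dist="norm")
print(f"probability-plot r: {r:.4f}")
```

An `r` very close to 1 means the transformed data follows the diagonal of the normal probability plot, which is the criterion stated above.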

## Homoscedasticity

Now let’s check ‘SalePrice’ with ‘TotalBsmtSF’.
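A toy illustration of the conic (heteroscedastic) pattern and how the log transform evens out the spread, on synthetic multiplicative-noise data (not the real columns):

```python
import numpy as np

rng = np.random.default_rng(4)
area = rng.uniform(500, 4000, 1000)
# Multiplicative noise: the spread of price grows with area (a 'cone').
price = 100.0 * area * np.exp(rng.normal(0.0, 0.3, 1000))

lo, hi = price[area < 1500], price[area > 3000]
print(np.std(lo), np.std(hi))                  # unequal spread: heteroscedastic
print(np.std(np.log(lo)), np.std(np.log(hi)))  # log brings the spreads together
```

Comparing the spread at the low and high ends of the predictor is a crude but direct check of the equal-variance assumption; after the log transform the two spreads are of the same order.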

# Conclusion

Throughout this kernel we put into practice many of the strategies proposed by Hair et al. (2013). We philosophized about the variables, we analysed ‘SalePrice’ alone and with the most correlated variables, we dealt with missing data and outliers, we tested some of the fundamental statistical assumptions and we even transformed categorical variables into dummy variables. That’s a lot of work that Python helped us make easier.

But the quest is not over. Remember that our story stopped in the Facebook research. Now it’s time to give a call to ‘SalePrice’ and invite her to dinner. Try to predict her behaviour. Do you think she’s a girl that enjoys regularized linear regression approaches? Or do you think she prefers ensemble methods? Or maybe something else?

It’s up to you to find out.