Kaggle Competition Series: (4) House Prices: Advanced Regression Techniques

Note: starting from this fourth post in the Kaggle getting-started series, I will no longer walk through data processing, feature engineering, and modeling in full detail. Each post instead highlights what leaf took away from the expert notebooks; the complete code is still packaged and uploaded.

1. Dropping a column with drop(inplace)

train.drop("Id", axis = 1, inplace = True)
test.drop("Id", axis = 1, inplace = True)

With DataFrame.drop(), the inplace parameter defaults to False, meaning the original data is not modified; the call returns a new DataFrame, so you can write train = train.drop("Id", axis=1) (inplace=False is implied). When inplace=True, however, the method returns None, which must not be reassigned to train; it can only be written as train.drop("Id", axis=1, inplace=True).
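
A minimal sketch of the two calling conventions (the toy DataFrame here is made up for illustration):

import pandas as pd

df = pd.DataFrame({"Id": [1, 2], "SalePrice": [100, 200]})

out = df.drop("Id", axis=1)                 # inplace=False (default): returns a new DataFrame
print(type(out))                            # <class 'pandas.core.frame.DataFrame'>

ret = df.drop("Id", axis=1, inplace=True)   # modifies df itself
print(ret)                                  # None: do not reassign this to df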

2. Dropping rows with train.drop(train[condition].index)

In the scatter plot of GrLivArea against SalePrice, the two points at the lower right are outliers:

#Deleting outliers
train = train.drop(train[(train['GrLivArea']>4000) & (train['SalePrice']<300000)].index)

Note what this does: it removes from the training set every row where GrLivArea > 4000 and SalePrice < 300000. It is easy to get this wrong as
train = train.drop(((train['GrLivArea']>4000) & (train['SalePrice']<300000)).index)
which takes .index of the boolean mask itself; a mask's index is the full DataFrame index, so that call would drop every row. When filtering a DataFrame on a condition, put the condition inside [ ] to select the matching rows first, as the toy sketch at the end of this section shows.
A caveat when deleting outlier records: "There are probably other outliers in the training data. However, removing them all may badly affect our models if there are also outliers in the test data. That's why, instead of removing them all, we will just manage to make some of our models robust to them. You can refer to the modelling part of this notebook for that."
In other words: remove only the few severe outliers, then build models that stay robust to the milder anomalies that remain; this improves the final models' robustness.
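
A toy demonstration of the correct pattern (column names made up):

import pandas as pd

df = pd.DataFrame({"area": [1000, 5000, 1200],
                   "price": [150000, 100000, 180000]})

outliers = df[(df["area"] > 4000) & (df["price"] < 300000)]  # filter the rows first ...
df = df.drop(outliers.index)                                 # ... then drop them by index
print(len(df))  # 2: only the outlier row was removed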

3. Plotting a distribution histogram (fitted curve and normal curve)

import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
from scipy.stats import norm

sns.distplot(train['SalePrice'], fit=norm)  # plot the empirical distribution plus a fitted normal curve

# Get the fitted parameters used by the function
(mu, sigma) = norm.fit(train['SalePrice'])  # mean and standard deviation of the fitted normal
print( '\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))

#Now plot the distribution
plt.legend(['Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
            loc='best')  # $\mu=$ is matplotlib math-text (LaTeX-style escaping); loc='best' lets matplotlib pick the legend position
plt.ylabel('Frequency')
plt.title('SalePrice distribution')

#Get also the QQ-plot
fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
plt.show()

[Figures: SalePrice distribution with the fitted normal curve; QQ-plot of SalePrice]
As the plots show, SalePrice is right-skewed (its skew value is large and positive), so it should be transformed toward a more normal distribution.

# We use the numpy function log1p, which applies log(1+x) to all elements of the column
train["SalePrice"] = np.log1p(train["SalePrice"])  # transform the skewed distribution toward normal

4. Concatenating datasets (pd.concat)

all_data = pd.concat((train, test)).reset_index(drop=True)  # concatenate the two datasets
all_data.drop(['SalePrice'], axis=1, inplace=True)  # drop the target column, which the test set lacks

The merging idiom is pd.concat((train, test)).reset_index(drop=True); reset_index(drop=True) rebuilds a clean 0..n-1 index after the two frames are stacked.
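
A common companion step (a sketch; ntrain is a name introduced here, not from the original post) is to record the split point so the combined frame can be separated again after feature engineering:

ntrain = train.shape[0]  # number of training rows before concatenation

all_data = pd.concat((train, test)).reset_index(drop=True)
all_data.drop(['SalePrice'], axis=1, inplace=True)

# ... feature engineering on all_data ...

train = all_data[:ntrain]  # the first ntrain rows form the training set
test = all_data[ntrain:]   # the remaining rows form the test set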

5. Inspecting the proportion of missing values

all_data_na = (all_data.isnull().sum() / len(all_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)[:30]
missing_data = pd.DataFrame({'Missing Ratio' :all_data_na})
missing_data.head(20)

(1) all_data.isnull().sum() produces a per-column count of missing values in the DataFrame;
(2) pd.DataFrame({'Missing Ratio': all_data_na}) gives the Series a column name in the new DataFrame; note that this is written as a dict.
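
The same recipe on a toy frame (made-up data):

import numpy as np
import pandas as pd

toy = pd.DataFrame({"a": [1, np.nan, np.nan, 4], "b": [1, 2, 3, 4]})
ratio = (toy.isnull().sum() / len(toy)) * 100
print(ratio)  # a: 50.0, b: 0.0; one percentage per column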

6. Filling missing values

(1)

all_data['PoolQC'] = all_data['PoolQC'].fillna("None")  # fill with the string "None"

(2)

all_data['LotFrontage'] = all_data.groupby("Neighborhood")["LotFrontage"].transform(lambda x: x.fillna(x.median()))  # group by neighborhood, fill NaN with each group's median via transform(lambda)
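
How transform() aligns the group statistic back onto the original rows, shown on a made-up frame:

import numpy as np
import pandas as pd

toy = pd.DataFrame({"hood": ["A", "A", "B", "B"],
                    "front": [60.0, np.nan, 80.0, np.nan]})
filled = toy.groupby("hood")["front"].transform(lambda x: x.fillna(x.median()))
print(filled.tolist())  # [60.0, 60.0, 80.0, 80.0]: each NaN gets its own group's median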

(3)

for col in ('GarageType', 'GarageFinish', 'GarageQual', 'GarageCond'):
    all_data[col] = all_data[col].fillna('None')

(4)

all_data['MSZoning'] = all_data['MSZoning'].fillna(all_data['MSZoning'].mode()[0])  # fill missing values with the mode (most frequent value)

7. Label encoding

(Background reading: what one-hot encoding and label encoding are.)

from sklearn.preprocessing import LabelEncoder
cols = ('FireplaceQu', 'BsmtQual', 'BsmtCond', 'GarageQual', 'GarageCond', 
        'ExterQual', 'ExterCond','HeatingQC', 'PoolQC', 'KitchenQual', 'BsmtFinType1', 
        'BsmtFinType2', 'Functional', 'Fence', 'BsmtExposure', 'GarageFinish', 'LandSlope',
        'LotShape', 'PavedDrive', 'Street', 'Alley', 'CentralAir', 'MSSubClass', 'OverallCond',
        'YrSold', 'MoSold')
# process columns, apply LabelEncoder to categorical features
for c in cols:
    lbl = LabelEncoder() 
    lbl.fit(list(all_data[c].values)) 
    all_data[c] = lbl.transform(list(all_data[c].values))

fit() learns the encoding (the category-to-integer mapping); transform() applies it to the data.
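
A minimal illustration of the mapping that LabelEncoder learns:

from sklearn.preprocessing import LabelEncoder

lbl = LabelEncoder()
lbl.fit(["Gd", "TA", "Ex", "Gd"])   # learn the category -> integer mapping
print(list(lbl.classes_))           # ['Ex', 'Gd', 'TA']: classes are sorted alphabetically
print(lbl.transform(["TA", "Ex"]))  # [2 0]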

8. Computing the skewness of the numeric features

from scipy.stats import skew

numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index

# Check the skew of all numerical features
skewed_feats = all_data[numeric_feats].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)  # x.dropna() is needed first: NaNs would otherwise make skew() return NaN
print("\nSkew in numerical features: \n")
skewness = pd.DataFrame({'Skew' :skewed_feats})
skewness.head(10)

Correcting the skew with a Box-Cox transform:
(Note that setting λ = 0 is equivalent to the log1p used above for the target variable.)

skewness = skewness[abs(skewness['Skew']) > 0.75]  # filter on the column; indexing with abs(skewness) > 0.75 would only mask values to NaN and keep every row
print("There are {} skewed numerical features to Box Cox transform".format(skewness.shape[0]))

from scipy.special import boxcox1p
skewed_features = skewness.index
lam = 0.15
for feat in skewed_features:
    #all_data[feat] += 1
    all_data[feat] = boxcox1p(all_data[feat], lam)
    
#all_data[skewed_features] = np.log1p(all_data[skewed_features])
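
A quick numerical check of the note above: boxcox1p with λ = 0 reproduces log1p exactly.

import numpy as np
from scipy.special import boxcox1p

x = np.array([0.5, 10.0, 4000.0])
print(np.allclose(boxcox1p(x, 0.0), np.log1p(x)))  # True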

9. Expanding each categorical feature into indicator columns

all_data = pd.get_dummies(all_data)
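
What get_dummies() does on a toy frame:

import pandas as pd

toy = pd.DataFrame({"street": ["Pave", "Grvl", "Pave"], "area": [100, 80, 120]})
print(pd.get_dummies(toy).columns.tolist())
# ['area', 'street_Grvl', 'street_Pave']: each category becomes its own indicator column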

10. Setting up the cross-validation scheme

from sklearn.model_selection import KFold, cross_val_score
import numpy as np

n_folds = 5

def rmsle_cv(model):
    # pass the KFold object itself as cv; calling .get_n_splits() here would
    # collapse it to a plain integer and lose the shuffling
    kf = KFold(n_folds, shuffle=True, random_state=42)
    rmse = np.sqrt(-cross_val_score(model, train.values, y_train,
                                    scoring="neg_mean_squared_error", cv=kf))
    return rmse

cross_val_score() trains and scores a model; its arguments are the estimator, the training data, the labels, and the scoring metric. sklearn scorers follow a greater-is-better convention, so the MSE comes back negated ('neg_mean_squared_error'), hence the minus sign before the square root. cv=kf requests the shuffled 5-fold cross-validation defined above.

11. Model design and testing

Example: LASSO regression.
This model is sensitive to outliers, so sklearn's RobustScaler() is placed in front of Lasso in a pipeline to make it more robust.

from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

lasso = make_pipeline(RobustScaler(), Lasso(alpha=0.0005, random_state=1))

Test: score = rmsle_cv(lasso)
Result: Lasso score: 0.1115 (0.0074)
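
The tests below also use ENet, GBoost and KRR, which this post never defines. For reference, here is one plausible set of definitions in the spirit of the source notebook (the hyperparameters are that notebook's choices, not requirements):

from sklearn.linear_model import ElasticNet
from sklearn.kernel_ridge import KernelRidge
from sklearn.ensemble import GradientBoostingRegressor

ENet = make_pipeline(RobustScaler(),
                     ElasticNet(alpha=0.0005, l1_ratio=0.9, random_state=3))
KRR = KernelRidge(alpha=0.6, kernel='polynomial', degree=2, coef0=2.5)
GBoost = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05,
                                   max_depth=4, max_features='sqrt',
                                   min_samples_leaf=15, min_samples_split=10,
                                   loss='huber', random_state=5)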

12. Model ensembling

(1) A simple ensemble: averaging the base models

from sklearn.base import BaseEstimator, RegressorMixin, TransformerMixin, clone
import numpy as np

class AveragingModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, models):
        self.models = models
        
    # we define clones of the original models to fit the data in
    def fit(self, X, y):
        self.models_ = [clone(x) for x in self.models]
        
        # Train cloned base models
        for model in self.models_:
            model.fit(X, y)

        return self
    
    #Now we do the predictions for cloned models and average them
    def predict(self, X):
        predictions = np.column_stack([
            model.predict(X) for model in self.models_
        ])
        return np.mean(predictions, axis=1)   

Test:

averaged_models = AveragingModels(models = (ENet, GBoost, KRR, lasso))
score = rmsle_cv(averaged_models)

(2) Ensembling with a meta-model (stacking)

The training procedure is:
a. Split the training set into two disjoint parts (a train part and a holdout part);
b. Train the base models on the train part;
c. Run the trained base models on the holdout part;
d. Use the predictions from step (c) as inputs, and the target variable as output, to train a higher-level "meta-model".
For an explanation of stacking, see: (link)

class StackingAveragedModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, base_models, meta_model, n_folds=5):
        self.base_models = base_models
        self.meta_model = meta_model
        self.n_folds = n_folds
   
    # We again fit the data on clones of the original models
    def fit(self, X, y):
        self.base_models_ = [list() for x in self.base_models]
        self.meta_model_ = clone(self.meta_model)
        kfold = KFold(n_splits=self.n_folds, shuffle=True, random_state=156)
        
        # Train cloned base models then create out-of-fold predictions
        # that are needed to train the cloned meta-model
        out_of_fold_predictions = np.zeros((X.shape[0], len(self.base_models)))
        for i, model in enumerate(self.base_models):
            for train_index, holdout_index in kfold.split(X, y):
                instance = clone(model)
                self.base_models_[i].append(instance)
                instance.fit(X[train_index], y[train_index])
                y_pred = instance.predict(X[holdout_index])
                out_of_fold_predictions[holdout_index, i] = y_pred
                
        # Now train the cloned  meta-model using the out-of-fold predictions as new feature
        self.meta_model_.fit(out_of_fold_predictions, y)
        return self
   
    #Do the predictions of all base models on the test data and use the averaged predictions as 
    #meta-features for the final prediction which is done by the meta-model
    def predict(self, X):
        meta_features = np.column_stack([
            np.column_stack([model.predict(X) for model in base_models]).mean(axis=1)
            for base_models in self.base_models_ ])
        return self.meta_model_.predict(meta_features)

Test:

stacked_averaged_models = StackingAveragedModels(base_models = (ENet, GBoost, KRR),meta_model = lasso)
score = rmsle_cv(stacked_averaged_models)

(3) Ensembling: the stacked model + XGBoost + LightGBM

(1) Define the evaluation metric (the targets are already on the log1p scale, so this RMSE is effectively the RMSLE of the original prices)

def rmsle(y, y_pred):
    return np.sqrt(mean_squared_error(y, y_pred))

(2) Each ensemble member predicts separately (LightGBM shown; XGBoost and the stacked model follow the same pattern)

model_lgb.fit(train, y_train)
lgb_train_pred = model_lgb.predict(train)
lgb_pred = np.expm1(model_lgb.predict(test.values))  # expm1 inverts the earlier log1p, back to the price scale
print(rmsle(y_train, lgb_train_pred))

(3) Score the weighted ensemble on the training set

rmsle(y_train, stacked_train_pred*0.70 + xgb_train_pred*0.15 + lgb_train_pred*0.15)

(4) Ensemble prediction

ensemble = stacked_pred*0.70 + xgb_pred*0.15 + lgb_pred*0.15
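
One last step, not shown above, is writing the submission file. A sketch, assuming the test set's Id column was saved as test_ID before being dropped in section 1:

import pandas as pd

sub = pd.DataFrame({'Id': test_ID, 'SalePrice': ensemble})  # test_ID: the Id column saved earlier (assumption)
sub.to_csv('submission.csv', index=False)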

Dataset download: https://pan.baidu.com/s/1dGDNVIx1RiSyHVWfNayrMg (extraction code: axx6)
Source code: https://pan.baidu.com/s/1l_acggNRYSFxAZrqnqoBdQ (extraction code: nsm4)
