House Prices: Advanced Regression Techniques
Competition link:
https://www.kaggle.com/c/house-prices-advanced-regression-techniques
Related reference tutorials
House Prices: sharing competition experience
https://www.kaggle.com/xirudieyi/house-prices-advanced-regression-techniques/house-prices
Regression using Keras
https://www.kaggle.com/vishnus/house-prices-advanced-regression-techniques/regression-using-keras/code
Advanced Regression Modeling on House Prices
http://blog.nycdatascience.com/student-works/advanced-regression-modeling-house-prices/
Python code
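Before the full script, here is a minimal, self-contained sketch of the main imputation idea it uses: a missing LotFrontage is replaced by the square root of LotArea, i.e. the side length of a square lot with the same area. The toy DataFrame and the helper name `impute_lot_frontage` are assumptions made for illustration only; they are not part of the competition data or of the original script.

```python
import numpy as np
import pandas as pd


def impute_lot_frontage(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which missing LotFrontage is replaced by sqrt(LotArea).

    Hypothetical helper for illustration; the column names follow the
    House Prices data set.
    """
    out = df.copy()
    missing = out["LotFrontage"].isnull()
    # Use .loc with a boolean mask so the assignment writes to the DataFrame itself
    out.loc[missing, "LotFrontage"] = np.sqrt(out.loc[missing, "LotArea"])
    return out


# Toy data (made up): two rows have no LotFrontage
toy = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0, np.nan],
    "LotArea": [8100.0, 10000.0, 9600.0, 6400.0],
})

filled = impute_lot_frontage(toy)
print(filled)
```

Working on a copy keeps the input untouched, which makes it easy to compare the frame before and after imputation. The script that follows applies the same idea in place to train_data and test_data.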
# -*- coding: utf-8 -*-
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory
from subprocess import check_output
# print(check_output(["dir", "./"]).decode("utf8"))

# Load the data
train_data = pd.read_csv("./data/train.csv")
test_data = pd.read_csv("./data/test.csv")

# Helper functions for cleaning the data; they all operate on pandas DataFrames.

# List the columns that contain missing values
def show_missing(houseprice):
    missing = houseprice.columns[houseprice.isnull().any()].tolist()
    return missing

# Show the value counts of a categorical feature
def cat_exploration(houseprice, column):
    print(houseprice[column].value_counts())

# Fill the missing values of one column with the given value
def cat_imputation(houseprice, column, value):
    houseprice.loc[houseprice[column].isnull(), column] = value

# LotFrontage
# check correlation with LotArea
print(test_data['LotFrontage'].corr(test_data['LotArea']))  # 0.64
print(train_data['LotFrontage'].corr(train_data['LotArea']))  # 0.42
test_data['SqrtLotArea'] = np.sqrt(test_data['LotArea'])
train_data['SqrtLotArea'] = np.sqrt(train_data['LotArea'])
# print(test_data['LotFrontage'].corr(test_data['SqrtLotArea']))
# print(train_data['LotFrontage'].corr(train_data['SqrtLotArea']))

# Fill missing LotFrontage with sqrt(LotArea), i.e. the side length of a square lot.
# .loc replaces the original chained indexing (test_data.LotFrontage[cond] = ...),
# which may assign to a temporary copy instead of the DataFrame.
cond = test_data['LotFrontage'].isnull()
test_data.loc[cond, 'LotFrontage'] = test_data.loc[cond, 'SqrtLotArea']
cond = train_data['LotFrontage'].isnull()
train_data.loc[cond, 'LotFrontage'] = train_data.loc[cond, 'SqrtLotArea']
del test_data['SqrtLotArea']
del train_data['SqrtLotArea']

# MSZoning
# Has missing values in the test set only; the training set has none
cat_exploration(test_data, 'MSZoning')
print(test_data[test_data['MSZoning'].isnull()])
# MSSubClass vs. MSZoning
print(pd.crosstab(test_data.MSSubClass, test_data.MSZoning))
# Fill missing MSZoning in test_data by MSSubClass: 30 -> RM, 20 -> RL, 70 -> RM.
# The mask restricts the fill to rows where MSZoning is missing, so existing values are not overwritten.
zoning_missing = test_data['MSZoning'].isnull()
test_data.loc[zoning_missing & (test_data['MSSubClass'] == 20), 'MSZoning'] = 'RL'
test_data.loc[zoning_missing & (test_data['MSSubClass'] == 30), 'MSZoning'] = 'RM'
test_data.loc[zoning_missing & (test_data['MSSubClass'] == 70), 'MSZoning'] = 'RM'

# Alley
# cat_exploration already prints the counts, so it is not wrapped in print()
cat_exploration(test_data, 'Alley')
cat_exploration(train_data, 'Alley')
# Alley has too many NaNs. Here they are filled with 'None'; the column could also be dropped outright,
# or discarded later when features are selected by importance.
cat_imputation(test_data, 'Alley', 'None')
cat_imputation(train_data,