How Machine Learning Models Work
Introduction
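We'll start with an overview of how machine learning models work and how they are used. This may feel basic if you have done statistical modeling or machine learning before. Don't worry, we will progress to building powerful models soon.
This course will have you build models as you work through the following scenario:
Your cousin has put a few hundred dollars into real estate speculation. Because of your interest in data science, he has offered to become business partners with you. He will supply the money, and you will supply models that predict how much various houses are worth.
You ask your cousin how he has predicted real estate values in the past, and he says it is just intuition. But further questioning reveals that he has identified price patterns in the houses he has seen, and he uses those patterns to make predictions for the new houses he is considering.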
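Machine learning works the same way. We'll start with a model called the decision tree. There are fancier models that give more accurate predictions, but decision trees are easy to understand, and they are the basic building block for some of the best models in data science.
For simplicity, we'll start with the simplest possible decision tree.
It divides houses into only two categories. The predicted price for any house under consideration is the historical average price of houses in the same category.
We use data to decide how to break the houses into two groups, and then again to determine the predicted price in each group. This step of capturing patterns from data is called fitting or training the model. The data used to fit the model is called the training data.
The details of how the model is fit (for example, how to split up the data) are complex enough that we will save them for later. After the model has been fit, you can apply it to new data to predict the prices of additional homes.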
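The two-group idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the training rows, the split on bedroom count, and the threshold of 2 are all made up here, and the names `fit_two_group_tree` and `predict` are hypothetical helpers, not part of any library. How a real model chooses the split is not shown.

```python
from statistics import mean

# Hypothetical training data: (number of bedrooms, sale price in dollars).
training_data = [
    (1, 150_000),
    (2, 170_000),
    (2, 190_000),
    (3, 260_000),
    (4, 300_000),
    (4, 280_000),
]


def fit_two_group_tree(rows, threshold):
    """Split houses into two groups at `threshold` bedrooms and
    store the average historical price of each group."""
    fewer = [price for bedrooms, price in rows if bedrooms <= threshold]
    more = [price for bedrooms, price in rows if bedrooms > threshold]
    return {"threshold": threshold, "small": mean(fewer), "large": mean(more)}


def predict(model, bedrooms):
    """Return the average price of the group the house falls into."""
    group = "small" if bedrooms <= model["threshold"] else "large"
    return model[group]


# The threshold is fixed by hand here; a fitted model would choose it from data.
model = fit_two_group_tree(training_data, threshold=2)
print(predict(model, 2))
print(predict(model, 5))
```

Fitting here means nothing more than computing one average per group; prediction is a lookup of the matching group's average.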
Improving the Decision Tree
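Which of the following two decision trees is more likely to result from fitting the real estate training data?
The decision tree on the left (Decision Tree 1) probably makes more sense, because it captures the fact that houses with more bedrooms tend to sell at higher prices than houses with fewer bedrooms. The biggest shortcoming of this model is that it misses most other factors that affect home prices, such as the number of bathrooms, room size, location, and so on.
You can capture more factors by using a tree that has more splits. These are called deeper trees. A decision tree that also considers the size of each house might look like this:
You predict the price of any house by tracing through the decision tree, always picking the path that corresponds to that house's characteristics. The predicted price for the house is at the bottom of the tree. The point at the bottom where we make a prediction is called a leaf.
The splits and the values at the leaves are determined by the data, so it's time to check out the data you will be working with.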
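Tracing a house through a deeper tree can be sketched as a walk over nested dictionaries. Everything in this example is an assumption for illustration: the feature names `bedrooms` and `size`, the thresholds, and the leaf prices are invented, and `predict_price` is a hypothetical helper rather than a library function.

```python
# Hypothetical two-level tree. Internal nodes hold a feature and a threshold;
# leaves hold the predicted price in dollars.
tree = {
    "feature": "bedrooms",
    "threshold": 2,
    "low": {
        "feature": "size",
        "threshold": 8500,
        "low": {"leaf": 146_000},
        "high": {"leaf": 188_000},
    },
    "high": {
        "feature": "size",
        "threshold": 11500,
        "low": {"leaf": 170_000},
        "high": {"leaf": 233_000},
    },
}


def predict_price(node, house):
    """Follow the branch matching the house's features until a leaf is reached."""
    while "leaf" not in node:
        value = house[node["feature"]]
        node = node["high"] if value > node["threshold"] else node["low"]
    return node["leaf"]


print(predict_price(tree, {"bedrooms": 3, "size": 12000}))
print(predict_price(tree, {"bedrooms": 2, "size": 8000}))
```

Each extra level of splits lets the tree distinguish more kinds of houses, at the cost of having fewer training houses behind each leaf's value.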
Data Description
- train.csv - the training set
- test.csv - the test set
- sample_submission.csv - a benchmark submission from a linear regression on year and month of sale, lot square footage, and number of bedrooms
- SalePrice: The property’s sale price in dollars. This is the target variable that you’re trying to predict.
- MSSubClass: The building class
- MSZoning: The general zoning classification
- LotFrontage: Linear feet of street connected to property
- LotArea: Lot size in square feet
- Street: Type of road access
- Alley: Type of alley access
- LotShape: General shape of property
- LandContour: Flatness of the property
- Utilities: Type of utilities available
- LotConfig: Lot configuration
- LandSlope: Slope of property
- Neighborhood: Physical locations within Ames city limits
- Condition1: Proximity to main road or railroad
- Condition2: Proximity to main road or railroad (if a second is present)
- BldgType: Type of dwelling
- HouseStyle: Style of dwelling
- OverallQual: Overall material and finish quality
- OverallCond: Overall condition rating
- YearBuilt: Original construction date
- YearRemodAdd: Remodel date
- RoofStyle: Type of roof
- RoofMatl: Roof material
- Exterior1st: Exterior covering on house
- Exterior2nd: Exterior covering on house (if more than one material)
- MasVnrType: Masonry veneer type
- MasVnrArea: Masonry veneer area in square feet
- ExterQual: Exterior material quality
- ExterCond: Present condition of the material on the exterior
- Foundation: Type of foundation
- BsmtQual: Height of the basement
- BsmtCond: General condition of the basement
- BsmtExposure: Walkout or garden level basement walls
- BsmtFinType1: Quality of basement finished area
- BsmtFinSF1: Type 1 finished square feet
- BsmtFinType2: Quality of second finished area (if present)
- BsmtFinSF2: Type 2 finished square feet
- BsmtUnfSF: Unfinished square feet of basement area
- TotalBsmtSF: Total square feet of basement area
- Heating: Type of heating
- HeatingQC: Heating quality and condition
- CentralAir: Central air conditioning
- Electrical: Electrical system
- 1stFlrSF: First Floor square feet
- 2ndFlrSF: Second floor square feet
- LowQualFinSF: Low quality finished square feet (all floors)
- GrLivArea: Above grade (ground) living area square feet
- BsmtFullBath: Basement full bathrooms
- BsmtHalfBath: Basement half bathrooms
- FullBath: Full bathrooms above grade
- HalfBath: Half baths above grade
- Bedroom: Number of bedrooms above basement level
- Kitchen: Number of kitchens
- KitchenQual: Kitchen quality
- TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
- Functional: Home functionality rating
- Fireplaces: Number of fireplaces
- FireplaceQu: Fireplace quality
- GarageType: Garage location
- GarageYrBlt: Year garage was built
- GarageFinish: Interior finish of the garage
- GarageCars: Size of garage in car capacity
- GarageArea: Size of garage in square feet
- GarageQual: Garage quality
- GarageCond: Garage condition
- PavedDrive: Paved driveway
- WoodDeckSF: Wood deck area in square feet
- OpenPorchSF: Open porch area in square feet
- EnclosedPorch: Enclosed porch area in square feet
- 3SsnPorch: Three season porch area in square feet
- ScreenPorch: Screen porch area in square feet
- PoolArea: Pool area in square feet
- PoolQC: Pool quality
- Fence: Fence quality
- MiscFeature: Miscellaneous feature not covered in other categories
- MiscVal: Dollar value of miscellaneous feature
- MoSold: Month Sold
- YrSold: Year Sold
- SaleType: Type of sale
- SaleCondition: Condition of sale
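As a first look at how these fields might feed the two-group model described earlier, the sketch below reads `LotArea` and `SalePrice` from CSV text and averages the price on each side of a lot-area threshold. It is a minimal sketch under assumptions: the sample rows and the threshold of 10000 are made up, `average_price_by_lot_area` is a hypothetical helper, and the in-memory sample stands in for the real file so the example runs without a download.

```python
import csv
import io
from statistics import mean


def average_price_by_lot_area(csv_file, area_threshold):
    """Average SalePrice for houses at or below, and above, a LotArea threshold.

    Rows with a missing LotArea or SalePrice are skipped.
    """
    below, above = [], []
    for row in csv.DictReader(csv_file):
        if not row.get("LotArea") or not row.get("SalePrice"):
            continue
        price = float(row["SalePrice"])
        if float(row["LotArea"]) <= area_threshold:
            below.append(price)
        else:
            above.append(price)
    return (mean(below) if below else None, mean(above) if above else None)


# Made-up rows using column names from the field list above.
sample = io.StringIO(
    "Id,LotArea,SalePrice\n"
    "1,8000,150000\n"
    "2,9000,170000\n"
    "3,12000,250000\n"
)
small_lot_avg, large_lot_avg = average_price_by_lot_area(sample, area_threshold=10000)
print(small_lot_avg, large_lot_avg)
```

With the real data, the same function could be called on an opened file, for example `open("train.csv", newline="")`, assuming the file sits in the working directory.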