Analysis of the NYC Building Energy Score Prediction Code

Project: NYC Building Energy Score Prediction

Contents

0 Introduction

1 Data Cleaning and Formatting

1.1 Data overview

1.2 Importing the basic toolkits

1.3 Data analysis

1.4 Data types and missing values

1.5 A reusable missing-value template

2 Exploratory Data Analysis

2.1 Univariate plots

2.2 Removing outliers

2.3 Which variables affect the target

3 Feature Engineering

3.1 Feature transformations

3.2 Pairwise plots

3.3 Removing collinear features

4 Splitting the Dataset

4.1 Train/test split

4.2 Establishing a baseline

4.3 Saving the splits for modeling

5 Building Baseline Models, Trying Several Algorithms

5.1 Missing-value imputation

5.2 Feature normalization

6 Building Baseline Models, Trying Several Algorithms (Regression)

6.1 Choosing a loss metric

6.2 Selecting machine learning algorithms

7 Hyperparameter Tuning

7.1 Tuning

7.2 Comparing loss metrics

8 Evaluation and Testing: Plotting Predicted vs. Actual Values

9 Interpreting the Model: Importance-Based Feature Selection

Main text:

0 Introduction

This walkthrough presents a complete solution to a machine learning project built on a real dataset, so that you can see how all the pieces fit together.

Before writing any code, we need to understand the problem we are trying to solve and the data available. In this project we will use publicly available building energy data from New York City. The goal is to build a model from the energy data that predicts a building's ENERGY STAR Score, and then interpret the results to find the factors that influence the score.

The data includes the ENERGY STAR Score, which makes this a supervised regression machine learning task. Supervised: we have both the features and the target, and our goal is to train a model that learns the mapping between them. Regression: the ENERGY STAR Score is a continuous variable. We want to develop an accurate model whose predicted ENERGY STAR Scores are close to the true values.

1 Data Cleaning and Formatting

1.1 Data overview

1.2 Importing the basic toolkits

import pandas as pd
import numpy as np

# Silence pandas' chained-assignment (SettingWithCopy) warnings
pd.options.mode.chained_assignment = None

# head() is used often; show up to 60 columns instead of truncating the display
pd.set_option('display.max_columns', 60) 
import matplotlib.pyplot as plt

# %matplotlib inline works in IPython front ends such as Jupyter Notebook or
# jupyter qtconsole: plots are rendered inline and plt.show() can be omitted.
%matplotlib inline

# pyplot takes its default figure properties from the rc configuration ("rc
# parameters"): figure size, dots per inch, line width, colors, styles, axes,
# grids, text, fonts, and so on. The rc parameters live in a dict-like object
# and are accessed by key.
# Set a global default font size for all plots
plt.rcParams['font.size'] = 24
from IPython.core.pylabtools import figsize

# seaborn: statistical plotting built on top of matplotlib
import seaborn as sns
sns.set(font_scale = 2)
from sklearn.model_selection import train_test_split

# Ignore warning messages in the output
import warnings
warnings.filterwarnings("ignore")

1.3 Data analysis

# Load the data
data = pd.read_csv('data/Energy.csv')

# Show the first 3 rows
data.head(3)
(output: the first 3 rows of the raw DataFrame — 55 columns covering identifiers, addresses, floor areas, energy and water use, GHG emissions, the ENERGY STAR Score, and an empty 'Unnamed: 54' column; many cells contain the string 'Not Available')
print(data.shape)  # .shape gives (rows, columns) directly, no array conversion needed
(11746, 55)
# head(n) shows the first n rows
data.head(2)
(output: the same 55-column-wide preview as above, now for the first 2 rows)

1.4 Data types and missing values

data.info() # a quick view of each column's dtype and non-null count
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11746 entries, 0 to 11745
Data columns (total 55 columns):
 #   Column                                                      Non-Null Count  Dtype  
---  ------                                                      --------------  -----  
 0   Order                                                       11746 non-null  int64  
 1   Property Id                                                 11746 non-null  int64  
 2   Property Name                                               11746 non-null  object 
 3   Parent Property Id                                          11746 non-null  object 
 4   Parent Property Name                                        11746 non-null  object 
 5   BBL - 10 digits                                             11746 non-null  object 
 6   NYC Borough, Block and Lot (BBL) self-reported              11746 non-null  object 
 7   NYC Building Identification Number (BIN)                    11746 non-null  object 
 8   Address 1 (self-reported)                                   11746 non-null  object 
 9   Address 2                                                   11746 non-null  object 
 10  Postal Code                                                 11746 non-null  object 
 11  Street Number                                               11622 non-null  object 
 12  Street Name                                                 11624 non-null  object 
 13  Borough                                                     11628 non-null  object 
 14  DOF Gross Floor Area                                        11628 non-null  float64
 15  Primary Property Type - Self Selected                       11746 non-null  object 
 16  List of All Property Use Types at Property                  11746 non-null  object 
 17  Largest Property Use Type                                   11746 non-null  object 
 18  Largest Property Use Type - Gross Floor Area (ft²)          11746 non-null  object 
 19  2nd Largest Property Use Type                               11746 non-null  object 
 20  2nd Largest Property Use - Gross Floor Area (ft²)           11746 non-null  object 
 21  3rd Largest Property Use Type                               11746 non-null  object 
 22  3rd Largest Property Use Type - Gross Floor Area (ft²)      11746 non-null  object 
 23  Year Built                                                  11746 non-null  int64  
 24  Number of Buildings - Self-reported                         11746 non-null  int64  
 25  Occupancy                                                   11746 non-null  int64  
 26  Metered Areas (Energy)                                      11746 non-null  object 
 27  Metered Areas  (Water)                                      11746 non-null  object 
 28  ENERGY STAR Score                                           11746 non-null  object 
 29  Site EUI (kBtu/ft²)                                         11746 non-null  object 
 30  Weather Normalized Site EUI (kBtu/ft²)                      11746 non-null  object 
 31  Weather Normalized Site Electricity Intensity (kWh/ft²)     11746 non-null  object 
 32  Weather Normalized Site Natural Gas Intensity (therms/ft²)  11746 non-null  object 
 33  Weather Normalized Source EUI (kBtu/ft²)                    11746 non-null  object 
 34  Fuel Oil #1 Use (kBtu)                                      11746 non-null  object 
 35  Fuel Oil #2 Use (kBtu)                                      11746 non-null  object 
 36  Fuel Oil #4 Use (kBtu)                                      11746 non-null  object 
 37  Fuel Oil #5 & 6 Use (kBtu)                                  11746 non-null  object 
 38  Diesel #2 Use (kBtu)                                        11746 non-null  object 
 39  District Steam Use (kBtu)                                   11746 non-null  object 
 40  Natural Gas Use (kBtu)                                      11746 non-null  object 
 41  Weather Normalized Site Natural Gas Use (therms)            11746 non-null  object 
 42  Electricity Use - Grid Purchase (kBtu)                      11746 non-null  object 
 43  Weather Normalized Site Electricity (kWh)                   11746 non-null  object 
 44  Total GHG Emissions (Metric Tons CO2e)                      11746 non-null  object 
 45  Direct GHG Emissions (Metric Tons CO2e)                     11746 non-null  object 
 46  Indirect GHG Emissions (Metric Tons CO2e)                   11746 non-null  object 
 47  Property GFA - Self-Reported (ft²)                          11746 non-null  int64  
 48  Water Use (All Water Sources) (kgal)                        11746 non-null  object 
 49  Water Intensity (All Water Sources) (gal/ft²)               11746 non-null  object 
 50  Source EUI (kBtu/ft²)                                       11746 non-null  object 
 51  Release Date                                                11746 non-null  object 
 52  Water Required?                                             11628 non-null  object 
 53  DOF Benchmarking Submission Status                          11716 non-null  object 
 54  Unnamed: 54                                                 0 non-null      float64
dtypes: float64(2), int64(6), object(47)
memory usage: 4.9+ MB

1.5 A reusable missing-value template

# Convert the 'Not Available' sentinel strings to np.nan
# replace() swaps the old value for the new one throughout the frame
data = data.replace({'Not Available': np.nan})
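An equivalent approach (a minimal sketch on toy data, not the project's actual file) is to declare the sentinel at load time via `read_csv`'s `na_values` parameter, so no separate `replace()` pass is needed:

```python
import io
import pandas as pd

# Toy CSV standing in for data/Energy.csv (hypothetical columns and values).
csv_text = "Site EUI,score\n305.6,Not Available\nNot Available,55\n"

# na_values turns the sentinel string into NaN while parsing, which is
# equivalent to the .replace({'Not Available': np.nan}) step above.
df = pd.read_csv(io.StringIO(csv_text), na_values=['Not Available'])
print(df.isnull().sum().sum())  # 2 — both sentinels became NaN
```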

# In the raw data, columns whose names end in units such as 'ft²' hold numeric
# (float) values, yet info() shows them stored as Object because of the
# sentinel strings.
# Columns whose names contain ft², kBtu, Metric Tons CO2e, etc. therefore
# need an explicit cast to float.

# Cast all the unit-bearing columns below to float
for col in list(data.columns):
    # If the name carries a unit (or is the Score), cast the Object column to float
    if ('ft²' in col or 'kBtu' in col or 'Metric Tons CO2e' in col or 'kWh' in 
        col or 'therms' in col or 'gal' in col or 'Score' in col):
        
        data[col] = data[col].astype(float)
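A slightly more defensive variant of the same cast (a sketch on a toy frame; the column names are hypothetical) uses `pd.to_numeric` with `errors='coerce'`, which turns any stray non-numeric string into NaN instead of raising:

```python
import pandas as pd

# Toy frame: one unit-bearing column with a malformed entry, one text column.
df = pd.DataFrame({'Site EUI (kBtu/ft2)': ['305.6', '229.8', 'oops'],
                   'Borough': ['Manhattan', 'Manhattan', 'Bronx']})

units = ('ft2', 'kBtu', 'kWh', 'therms', 'gal', 'Metric Tons CO2e', 'Score')
for col in df.columns:
    # Only touch columns whose name mentions a unit (or the Score)
    if any(u in col for u in units):
        df[col] = pd.to_numeric(df[col], errors='coerce')

print(df['Site EUI (kBtu/ft2)'].dtype)  # float64; 'oops' became NaN
```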


print(list(data.columns))
['Order', 'Property Id', 'Property Name', 'Parent Property Id', 'Parent Property Name', 'BBL - 10 digits', 'NYC Borough, Block and Lot (BBL) self-reported', 'NYC Building Identification Number (BIN)', 'Address 1 (self-reported)', 'Address 2', 'Postal Code', 'Street Number', 'Street Name', 'Borough', 'DOF Gross Floor Area', 'Primary Property Type - Self Selected', 'List of All Property Use Types at Property', 'Largest Property Use Type', 'Largest Property Use Type - Gross Floor Area (ft²)', '2nd Largest Property Use Type', '2nd Largest Property Use - Gross Floor Area (ft²)', '3rd Largest Property Use Type', '3rd Largest Property Use Type - Gross Floor Area (ft²)', 'Year Built', 'Number of Buildings - Self-reported', 'Occupancy', 'Metered Areas (Energy)', 'Metered Areas  (Water)', 'ENERGY STAR Score', 'Site EUI (kBtu/ft²)', 'Weather Normalized Site EUI (kBtu/ft²)', 'Weather Normalized Site Electricity Intensity (kWh/ft²)', 'Weather Normalized Site Natural Gas Intensity (therms/ft²)', 'Weather Normalized Source EUI (kBtu/ft²)', 'Fuel Oil #1 Use (kBtu)', 'Fuel Oil #2 Use (kBtu)', 'Fuel Oil #4 Use (kBtu)', 'Fuel Oil #5 & 6 Use (kBtu)', 'Diesel #2 Use (kBtu)', 'District Steam Use (kBtu)', 'Natural Gas Use (kBtu)', 'Weather Normalized Site Natural Gas Use (therms)', 'Electricity Use - Grid Purchase (kBtu)', 'Weather Normalized Site Electricity (kWh)', 'Total GHG Emissions (Metric Tons CO2e)', 'Direct GHG Emissions (Metric Tons CO2e)', 'Indirect GHG Emissions (Metric Tons CO2e)', 'Property GFA - Self-Reported (ft²)', 'Water Use (All Water Sources) (kgal)', 'Water Intensity (All Water Sources) (gal/ft²)', 'Source EUI (kBtu/ft²)', 'Release Date', 'Water Required?', 'DOF Benchmarking Submission Status', 'Unnamed: 54']
print(data.columns)
Index(['Order', 'Property Id', 'Property Name', 'Parent Property Id',
       'Parent Property Name', 'BBL - 10 digits',
       'NYC Borough, Block and Lot (BBL) self-reported',
       'NYC Building Identification Number (BIN)', 'Address 1 (self-reported)',
       'Address 2', 'Postal Code', 'Street Number', 'Street Name', 'Borough',
       'DOF Gross Floor Area', 'Primary Property Type - Self Selected',
       'List of All Property Use Types at Property',
       'Largest Property Use Type',
       'Largest Property Use Type - Gross Floor Area (ft²)',
       '2nd Largest Property Use Type',
       '2nd Largest Property Use - Gross Floor Area (ft²)',
       '3rd Largest Property Use Type',
       '3rd Largest Property Use Type - Gross Floor Area (ft²)', 'Year Built',
       'Number of Buildings - Self-reported', 'Occupancy',
       'Metered Areas (Energy)', 'Metered Areas  (Water)', 'ENERGY STAR Score',
       'Site EUI (kBtu/ft²)', 'Weather Normalized Site EUI (kBtu/ft²)',
       'Weather Normalized Site Electricity Intensity (kWh/ft²)',
       'Weather Normalized Site Natural Gas Intensity (therms/ft²)',
       'Weather Normalized Source EUI (kBtu/ft²)', 'Fuel Oil #1 Use (kBtu)',
       'Fuel Oil #2 Use (kBtu)', 'Fuel Oil #4 Use (kBtu)',
       'Fuel Oil #5 & 6 Use (kBtu)', 'Diesel #2 Use (kBtu)',
       'District Steam Use (kBtu)', 'Natural Gas Use (kBtu)',
       'Weather Normalized Site Natural Gas Use (therms)',
       'Electricity Use - Grid Purchase (kBtu)',
       'Weather Normalized Site Electricity (kWh)',
       'Total GHG Emissions (Metric Tons CO2e)',
       'Direct GHG Emissions (Metric Tons CO2e)',
       'Indirect GHG Emissions (Metric Tons CO2e)',
       'Property GFA - Self-Reported (ft²)',
       'Water Use (All Water Sources) (kgal)',
       'Water Intensity (All Water Sources) (gal/ft²)',
       'Source EUI (kBtu/ft²)', 'Release Date', 'Water Required?',
       'DOF Benchmarking Submission Status', 'Unnamed: 54'],
      dtype='object')
# describe() summarizes only the numeric columns (count, mean, std, ...);
# Object columns are skipped
data.describe()

# Note on scientific notation: 3.20e+05 = 3.20 x 10^5 = 320000. pandas prints
# large values in this "E" format: the mantissa and the exponent after "E+"
# are shown to a fixed precision, e.g. 7.8 x 10^7 appears as 7.8e+07.
(output: the describe() table over the numeric columns — count, mean, std, min, quartiles, and max. Of note, Site EUI (kBtu/ft²) has mean ≈ 280 and std ≈ 8607, with min 0 and max 869265 — a strong hint that the column contains outliers.)
# A generic, reusable missing-value template
# Define a function that takes a DataFrame
def missing_values_table(df): 
        # pandas' handy isnull() flags missing cells; summing per column
        # gives each column's count of missing values
        mis_val = df.isnull().sum() 
        
        # Percentage of missing values per column (the 100 turns it into %)
        mis_val_percent = 100 * df.isnull().sum() / len(df) 
        
        # Combine the per-column counts and percentages into one table
        mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
        
        # Rename the columns
        mis_val_table_ren_columns = mis_val_table.rename(
        columns = {0 : 'Missing Values', 1 : '% of Total Values'})
        
        # Keep only the columns that actually have missing values
        # (iloc[:,1] != 0 selects rows with a nonzero percentage) and sort
        # by the percentage, descending (ascending=False)
        mis_val_table_ren_columns = mis_val_table_ren_columns[
            mis_val_table_ren_columns.iloc[:,1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
        
        # Report the total number of columns and how many have missing values
        print ("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"      
            "There are " + str(mis_val_table_ren_columns.shape[0]) +
              " columns that have missing values.")
        
        
        return mis_val_table_ren_columns
missing_values_table(data) # each row: a column name, its missing count, and its
                           # missing percentage; 40 of the 55 columns have missing values
Your selected dataframe has 55 columns.
There are 40 columns that have missing values.
                                                            Missing Values  % of Total Values
Unnamed: 54                                                          11746              100.0
Fuel Oil #1 Use (kBtu)                                               11737               99.9
Diesel #2 Use (kBtu)                                                 11730               99.9
Address 2                                                            11539               98.2
Fuel Oil #5 & 6 Use (kBtu)                                           11152               94.9
District Steam Use (kBtu)                                            10810               92.0
Fuel Oil #4 Use (kBtu)                                               10425               88.8
3rd Largest Property Use Type                                        10262               87.4
3rd Largest Property Use Type - Gross Floor Area (ft²)               10262               87.4
Fuel Oil #2 Use (kBtu)                                                9165               78.0
2nd Largest Property Use - Gross Floor Area (ft²)                     8005               68.2
2nd Largest Property Use Type                                         8005               68.2
Metered Areas (Water)                                                 4609               39.2
Water Intensity (All Water Sources) (gal/ft²)                         3984               33.9
Water Use (All Water Sources) (kgal)                                  3984               33.9
ENERGY STAR Score                                                     2104               17.9
Weather Normalized Site Natural Gas Intensity (therms/ft²)            1963               16.7
Weather Normalized Site Natural Gas Use (therms)                      1962               16.7
Weather Normalized Source EUI (kBtu/ft²)                              1465               12.5
Weather Normalized Site EUI (kBtu/ft²)                                1465               12.5
Natural Gas Use (kBtu)                                                1442               12.3
Weather Normalized Site Electricity Intensity (kWh/ft²)                787                6.7
Weather Normalized Site Electricity (kWh)                              786                6.7
Electricity Use - Grid Purchase (kBtu)                                 244                2.1
Site EUI (kBtu/ft²)                                                    163                1.4
Source EUI (kBtu/ft²)                                                  163                1.4
NYC Building Identification Number (BIN)                               162                1.4
Street Number                                                          124                1.1
Street Name                                                            122                1.0
DOF Gross Floor Area                                                   118                1.0
Borough                                                                118                1.0
Water Required?                                                        118                1.0
Direct GHG Emissions (Metric Tons CO2e)                                 83                0.7
Total GHG Emissions (Metric Tons CO2e)                                  74                0.6
Indirect GHG Emissions (Metric Tons CO2e)                               65                0.6
Metered Areas (Energy)                                                  57                0.5
DOF Benchmarking Submission Status                                      30                0.3
NYC Borough, Block and Lot (BBL) self-reported                          11                0.1
Largest Property Use Type - Gross Floor Area (ft²)                       2                0.0
Largest Property Use Type                                                2                0.0
# Use 50% as the threshold: find the columns missing more than half their values
missing_df = missing_values_table(data);
# Collect those columns so drop() can delete them below
missing_columns = list(missing_df[missing_df['% of Total Values'] > 50].index)
print('We will remove %d columns.' % len(missing_columns))

# Of the 55 original columns, 40 have missing values; the 12 columns with more
# than 50% missing will be removed
Your selected dataframe has 55 columns.
There are 40 columns that have missing values.
We will remove 12 columns.
# Drop every column with more than 50% missing
data = data.drop(columns = list(missing_columns))
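The same 50% rule can also be expressed in one line with `dropna(axis=1, thresh=...)` — a sketch on a toy frame, not the real dataset:

```python
import numpy as np
import pandas as pd

# Toy frame: one column 75% missing, one column 25% missing.
df = pd.DataFrame({'mostly_missing': [np.nan, np.nan, np.nan, 1.0],
                   'mostly_present': [1.0, 2.0, np.nan, 4.0]})

# thresh: keep only columns with at least 50% non-null values.
kept = df.dropna(axis=1, thresh=int(0.5 * len(df)))
print(list(kept.columns))  # ['mostly_present']
```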

2 Exploratory Data Analysis

2.1 Univariate plots

# Set the figure width and height
figsize(8, 8)

# Rename the target — the 1–100 energy score — to 'score'
data = data.rename(columns = {'ENERGY STAR Score': 'score'})

# matplotlib ships several named styles; each sets a different background/look
plt.style.use('fivethirtyeight')

# dropna() filters out missing values before plotting
plt.hist(data['score'].dropna(), bins = 100, edgecolor = 'k'); 

plt.xlabel('Score'); plt.ylabel('Number of Buildings'); 

plt.title('Energy Star Score Distribution');

# The histogram shows suspicious spikes at 1 and 100. The scores are
# self-reported by property managers rating their own buildings' energy
# efficiency, so the extremes are likely inflated. Our goal, however, is to
# predict the score, not to design a better building-rating scheme: we note
# in the report that the score has a suspicious distribution and move on.


plt.style.ava
---------------------------------------------------------------------------

AttributeError                            Traceback (most recent call last)

<ipython-input-186-aa5a23d3013a> in <module>
----> 1 plt.style.ava


AttributeError: module 'matplotlib.style' has no attribute 'ava'

# The typo above fails: the attribute that lists the available style names is
# plt.style.available, used below.


help(plt.hist)
# Set the figure width and height
figsize(8, 8)

# (The target column was already renamed to 'score' above; this rename is a no-op.)
data = data.rename(columns = {'ENERGY STAR Score': 'score'})

# Try a different matplotlib style
plt.style.use('dark_background')

# hist draws a histogram; dropna() removes the rows whose score is missing
# before plotting
plt.hist(data['score'].dropna(), bins = 100, edgecolor = 'k'); 

plt.xlabel('Score'); plt.ylabel('Number of Buildings'); 

plt.title('Energy Star Score Distribution');

# (Same observation as above: the self-reported spikes at 1 and 100 look
# inflated, but our task is prediction, not redesigning the scoring method.)
plt.style.available  # list the names of all available matplotlib styles
print(data.columns)
# Set the figure width and height
figsize(10, 10)

# (Again a no-op rename, kept from the notebook.)
data = data.rename(columns = {'ENERGY STAR Score': 'score'})

plt.style.use('fivethirtyeight')

# dropna() filters out missing scores
plt.hist(data['score'].dropna(), bins = 100, edgecolor = 'k'); 

plt.xlabel('Score'); plt.ylabel('Number of Buildings'); 

plt.title('Energy Star Score Distribution');


# Site EUI (kBtu/ft²): energy use intensity
figsize(8, 8)
plt.hist(data['Site EUI (kBtu/ft²)'].dropna(), bins = 20, edgecolor = 'black'); # black bar edges
plt.xlabel('Site EUI'); 
plt.ylabel('Count'); plt.title('Site EUI Distribution');

# This reveals another problem: a handful of buildings with very large values
# skew the plot dramatically, so outliers must be handled. The largest value
# is clearly anomalous. Outliers arise for many reasons — typos, faulty
# measurement equipment, wrong units — or they may be legitimate extreme values.
# In short, many points lie far from the mean: the column has outliers.

# Same plot again; edgecolor controls the color of the bar edges
figsize(8, 8)
plt.hist(data['Site EUI (kBtu/ft²)'].dropna(), bins = 20, edgecolor = 'red'); # red bar edges this time
plt.xlabel('Site EUI'); 
plt.ylabel('Count'); plt.title('Site EUI Distribution');
data['Site EUI (kBtu/ft²)'].describe() 
# The mean is small while the standard deviation is huge (min 0, max 869265),
# which means many points lie far from the mean — i.e. there are outliers.

# dropna() filters out missing values; sort_values() sorts ascending by value
# (row index on the left, value on the right); tail(10) shows the 10 largest
# energy use intensity (EUI) values
data['Site EUI (kBtu/ft²)'].dropna().sort_values().tail(10)
# To inspect the outlier: select the row(s) whose Site EUI equals the maximum
# value, 869265 (a boolean mask — not a row number)
data.loc[data['Site EUI (kBtu/ft²)'] == 869265, :]


# Version note: .ix was removed in recent pandas releases, so the line below
# now fails; use .loc (as above) for boolean-mask selection
data.ix[data['Site EUI (kBtu/ft²)'] == 869265, :]
# .iloc expects integer positions, so passing a boolean Series here is also
# incorrect — .loc is the right accessor for this filter
data.iloc[data['Site EUI (kBtu/ft²)'] == 869265, :]

2.2 Removing outliers

# Take the 25% and 75% quantiles from describe()
first_quartile = data['Site EUI (kBtu/ft²)'].describe()['25%'] 
third_quartile = data['Site EUI (kBtu/ft²)'].describe()['75%']

# Their difference is the interquartile range (IQR)
iqr = third_quartile - first_quartile


# Keep only the "normal" data: Q1 - 3*IQR < EUI < Q3 + 3*IQR.
# Everything inside that band is a non-outlier — the data we want to keep;
# everything outside is filtered away as an extreme value.
data = data[(data['Site EUI (kBtu/ft²)'] > (first_quartile - 3 * iqr)) &
            (data['Site EUI (kBtu/ft²)'] < (third_quartile + 3 * iqr))]
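The filter above can be wrapped into a small reusable helper — a sketch on toy data; `filter_outliers` and the `eui` column are my names, not from the original:

```python
import pandas as pd

def filter_outliers(df, col, k=3):
    """Keep rows where df[col] lies strictly between Q1 - k*IQR and Q3 + k*IQR."""
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    return df[(df[col] > q1 - k * iqr) & (df[col] < q3 + k * iqr)]

toy = pd.DataFrame({'eui': [60, 70, 80, 90, 100, 869265]})
print(len(filter_outliers(toy, 'eui')))  # 5 — the extreme value is dropped
```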


# Energy use intensity (EUI) again: after removing the outliers, the histogram
# is much closer to the roughly normal shape we would expect
figsize(8, 8)
plt.hist(data['Site EUI (kBtu/ft²)'].dropna(), bins = 20, edgecolor = 'black');
plt.xlabel('Site EUI'); 
plt.ylabel('Count'); plt.title('Site EUI Distribution');

2.3 Which variables affect the target


types = data.dropna(subset=['score'])

# Largest Property Use Type: the building's dominant use. The column has many
# distinct values; the ones with more than 100 buildings are Multifamily
# Housing, Office, Hotel, and "Data Center, Non-Refrigerated Warehouse, Office".

types = types['Largest Property Use Type'].value_counts()
# Bug: without .index this returns the counts themselves, not the type names
types = list(types[types.values > 100])
print(types)
types = data.dropna(subset=['score'])

# Corrected version: .index yields the category names whose count exceeds 100

types = types['Largest Property Use Type'].value_counts()
types = list(types[types.values > 100].index)
print(types)
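The `value_counts()` + `.index` pattern used above, isolated on toy data to make the bug and the fix concrete:

```python
import pandas as pd

s = pd.Series(['Office'] * 3 + ['Hotel'] * 2 + ['Lab'])
counts = s.value_counts()

# Without .index you get the counts; with .index you get the category names.
print(list(counts[counts > 1]))        # [3, 2]
print(list(counts[counts > 1].index))  # ['Office', 'Hotel']
```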


# Look for categories whose score distributions differ strongly — those are
# informative features
# Largest Property Use Type: the dominant building use
figsize(12, 10)

# b_type iterates over the frequent building types found above
for b_type in types:
    # Subset the rows whose Largest Property Use Type equals the current type
    subset = data[data['Largest Property Use Type'] == b_type] 
    
    # Density of this subset's scores; alpha controls transparency
    sns.kdeplot(subset['score'].dropna(),
               label = b_type, shade = False, alpha = 0.5);
    
# x-axis: energy score; y-axis: density
plt.xlabel('Energy Star Score', size = 20); plt.ylabel('Density', size = 20); 
plt.title('Density Plot of Energy Star Scores by Building Type', size = 28);

# The curves differ markedly between types (e.g. the red and yellow ones),
# so building type carries real signal about the score.
# Sanity check: print each subset being plotted. (The types list actually
# holds 7 categories here, not 4; the plot above shows the frequent ones.)
figsize(12, 10)

for b_type in types:
    subset = data[data['Largest Property Use Type'] == b_type] 
    print(subset)
# Same loop, printing only the non-missing scores per type
figsize(12, 10)

for b_type in types:
    subset = data[data['Largest Property Use Type'] == b_type] 
    print(subset['score'].dropna())


# Does the score vary by borough?
boroughs = data.dropna(subset=['score'])
# Count buildings per borough and keep the boroughs with more than 100
boroughs = boroughs['Borough'].value_counts()
boroughs = list(boroughs[boroughs.values > 100].index)
print(boroughs)


# The borough distributions look similar, so this feature is less informative
# Borough: the five values are Manhattan, Brooklyn, Queens, Bronx,
# and Staten Island


figsize(12, 10)
 
# Plot one KDE per borough; x-axis is the energy score, y-axis the density
for borough in boroughs:
    
    subset = data[data['Borough'] == borough]
    
    
    sns.kdeplot(subset['score'].dropna(),
               label = borough);
    

plt.xlabel('Energy Star Score', size = 20); plt.ylabel('Density', size = 20); 
plt.title('Density Plot of Energy Star Scores by Borough', size = 28);


# corr() gives the pairwise correlation between columns; taking ['score']
# yields each feature's correlation with the target. Values near 0
# (e.g. -0.046605) carry little signal and can be dropped; there are more
# negative correlations than positive ones
correlations_data = data.corr()['score'].sort_values()  # ascending sort

# The 10 most negative correlations
print(correlations_data.head(10), '\n')
print("---------------------------")
# The 10 most positive correlations
print(correlations_data.tail(10))

With no arguments, corr() defaults to method='pearson'.
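Because Pearson measures only linear association, a monotonic but non-linear relationship can be under-reported; `method='spearman'` works on ranks instead. A toy illustration (made-up numbers, not the NYC data):

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4, 5]})
df['y'] = df['x'] ** 3        # monotonic in x, but not linear

pearson = df.corr()['y']['x']                    # default method: Pearson
spearman = df.corr(method='spearman')['y']['x']  # rank correlation

print(pearson)    # below 1.0: the relationship is not linear
print(spearman)   # 1.0: the ranks agree perfectly
```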

3 Feature Engineering

3.1 Feature Transformations

import warnings
warnings.filterwarnings("ignore")

# Keep only the numeric columns; string and other dtypes are dropped
numeric_subset = data.select_dtypes('number')


# For every numeric feature except the target, add sqrt- and log-transformed
# copies. 'score' is the label (the y of the model), so it is left untouched
for col in numeric_subset.columns:
    if col == 'score':
        continue
    else: 
        # Transform the whole column at once (np.log yields -inf for zeros;
        # infinities are replaced with NaN before modeling)
        numeric_subset['sqrt_' + col] = np.sqrt(numeric_subset[col])
        numeric_subset['log_' + col] = np.log(numeric_subset[col])

# The two categorical columns to keep: Borough and Largest Property Use Type
categorical_subset = data[['Borough', 'Largest Property Use Type']]
print(categorical_subset)

# One-hot encode them with pd.get_dummies
categorical_subset = pd.get_dummies(categorical_subset)
print(categorical_subset)


# Concatenate the numeric features with the one-hot encoded columns
print(numeric_subset)
features = pd.concat([numeric_subset, categorical_subset], axis = 1)

# Keep only the buildings that actually have a score
features = features.dropna(subset = ['score'])

# Correlation of every feature with the score, sorted ascending
correlations = features.corr()['score'].dropna().sort_values()
print(correlations)
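`pd.get_dummies`, used above, expands each categorical column into one 0/1 indicator column per category; a minimal sketch with a toy `Borough` column:

```python
import pandas as pd

toy = pd.DataFrame({'Borough': ['Manhattan', 'Queens', 'Manhattan']})
encoded = pd.get_dummies(toy)

# One indicator column per category, named '<column>_<category>'
print(list(encoded.columns))                           # ['Borough_Manhattan', 'Borough_Queens']
print([int(v) for v in encoded['Borough_Manhattan']])  # [1, 0, 1]
```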


Correlation here means both feature-to-feature and feature-to-score
correlation. Note that it captures linear association; non-linear
relationships can slip through.

# Features named sqrt_... / log_... are the transformed copies.
# The head of the sorted series holds the most negative correlations
correlations.head(15)

# Weather Normalized Site EUI (kBtu/ft²) and its sqrt_ transform correlate
# with the score almost identically, so the transform adds little value here.
# The tail holds the most positive correlations
correlations.tail(15)

Anything head() can do, tail() can do from the other end of the series.

3.2 Bivariate Plots


import warnings
warnings.filterwarnings("ignore")
figsize(12, 10)

# Relationship between the score and Site EUI, colored by building type
features['Largest Property Use Type'] = data.dropna(subset = ['score'])['Largest Property Use Type']


# isin() keeps only the rows whose building type is one of the frequent types
features = features[features['Largest Property Use Type'].isin(types)]


# hue = 'Largest Property Use Type' draws one color per building type
sns.lmplot('Site EUI (kBtu/ft²)', 'score', 
          hue = 'Largest Property Use Type', data = features,
          scatter_kws = {'alpha': 0.8, 's': 60}, fit_reg = False,
          size = 12, aspect = 1.2);

# Plot labeling
plt.xlabel("Site EUI", size = 28)
plt.ylabel('Energy Star Score', size = 28)
plt.title('Energy Star Score vs Site EUI', size = 36);

3.3 Removing Collinear Features


# Work on a copy() so the original data stays untouched
features = data.copy()

# select_dtypes('number') keeps only the numeric columns
numeric_subset = data.select_dtypes('number')

# Add a log-transformed copy of every numeric feature except the target
for col in numeric_subset.columns:
    # Skip the energy score: it is the target y
    if col == 'score':
        continue
    else:
        numeric_subset['log_' + col] = np.log(numeric_subset[col])
        
# Borough and Largest Property Use Type (multifamily housing, office,
# hotel, non-refrigerated warehouse, ...)
categorical_subset = data[['Borough', 'Largest Property Use Type']]


# get_dummies is pandas' way of one-hot encoding
categorical_subset = pd.get_dummies(categorical_subset)

# Combine the numeric features with the one-hot encoded categorical ones
features = pd.concat([numeric_subset, categorical_subset], axis = 1)

features.shape  # 110 columns, more than the original data


# Weather Normalized Site EUI (kBtu/ft²): weather-normalized energy use intensity
# Site EUI: energy use intensity


plot_data = data[['Weather Normalized Site EUI (kBtu/ft²)', 'Site EUI (kBtu/ft²)']].dropna()
# 'bo': blue circle markers
plt.plot(plot_data['Site EUI (kBtu/ft²)'], plot_data['Weather Normalized Site EUI (kBtu/ft²)'], 'bo')
# x-axis is Site EUI, y-axis the weather-normalized EUI
plt.xlabel('Site EUI'); plt.ylabel('Weather Norm EUI')
plt.title('Weather Norm EUI vs Site EUI, R = %0.4f' % np.corrcoef(data[['Weather Normalized Site EUI (kBtu/ft²)', 'Site EUI (kBtu/ft²)']].dropna(), rowvar=False)[0][1]);
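`np.corrcoef(..., rowvar=False)` treats each column as one variable and returns the full correlation matrix, so `[0][1]` is the cross-correlation between the two columns. A toy check with a perfectly linear pair:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
b = 2 * a + 1                                   # perfectly linear in a
m = np.corrcoef(np.column_stack([a, b]), rowvar=False)

print(m.shape)    # (2, 2); the diagonal entries are 1.0
print(m[0][1])    # ~1.0 for a perfect linear relationship
```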
# collinear: for every pair of features whose correlation exceeds the
# threshold, this function drops one of the two.
# threshold: chosen empirically by trying several values.
def remove_collinear_features(x, threshold):
    y = x['score']                   # keep the target aside
    x = x.drop(columns = ['score'])  # everything else is a feature
    # Repeat until no remaining pair exceeds the threshold
    while True:
        # Pairwise correlation matrix of the remaining features
        corr_matrix = x.corr()

        # Zero the diagonal: a feature's correlation with itself is
        # always 1 and would otherwise always exceed the threshold
        for i in range(len(corr_matrix)):
            corr_matrix.iloc[i, i] = 0

        # Columns scheduled for removal in this pass
        drop_cols = []

        for col in corr_matrix:
            # corr(A, B) == corr(B, A); skip columns already scheduled
            # so both members of a pair are never dropped
            if col not in drop_cols:
                # Absolute correlations of this column with all others
                v = np.abs(corr_matrix[col])
                if np.max(v) > threshold:
                    # idxmax() returns the *label* of the most correlated
                    # partner. The original code used np.argmax, which
                    # returns a positional index, not a column name -
                    # that is why the call used to fail
                    name = v.idxmax()
                    if name not in drop_cols:
                        drop_cols.append(name)

        # Drop the scheduled columns; when none remain, we are done.
        # Removing one member of each over-correlated pair reduces model
        # complexity without losing much information
        if drop_cols:
            x = x.drop(columns = drop_cols)
        else:
            break

    # Re-attach the target column
    x['score'] = y
               
    return x

# Threshold 0.6: pairs with |correlation| > 0.6 lose one member
features = remove_collinear_features(features, 0.6)
# Drop any columns that are entirely NaN
features = features.dropna(axis=1, how = 'all')
features.shape  # down from the original 110 columns
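The core idea of the function can be checked on a self-contained toy frame (column names and data are made up): zero the diagonal of the correlation matrix, then find each column's most correlated partner with `idxmax()`:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
a = rng.rand(100)
df = pd.DataFrame({
    'a': a,
    'b': a + 0.01 * rng.rand(100),  # nearly identical to 'a' -> collinear
    'c': rng.rand(100),             # unrelated
})

corr = df.corr()
np.fill_diagonal(corr.values, 0)    # self-correlation is always 1; ignore it

partner = corr['a'].abs().idxmax()  # most correlated partner of 'a'
print(partner)                      # 'b'
print(corr['a']['b'] > 0.6)         # True: this pair would trip a 0.6 threshold
```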

4 Splitting the Dataset

4.1 Train/Test Split

# isna(): True where the value is NaN
no_score = features[features['score'].isna()]
# notnull(): True where the value is not NaN
score = features[features['score'].notnull()]

print(no_score.shape)
print(score.shape)
# features holds the inputs;
# targets holds the label (the building's score)
features = score.drop(columns='score')
targets = pd.DataFrame(score['score'])

# Replace +/- infinity (e.g. from log of zero) with NaN
# so the imputer can fill them later
features = features.replace({np.inf: np.nan, -np.inf: np.nan})

# random_state = 42 fixes the pseudo-random shuffle, so every run produces
# exactly the same train/test split. The value itself is arbitrary, but
# reusing the same value on the same data reproduces the same split, which
# is what makes hyperparameter tuning comparable across runs: the split is
# driven by a seeded random number generator, and the same seed yields the
# same random numbers and hence the same partition
X, X_test, y, y_test = train_test_split(features, targets, test_size = 0.3, random_state = 42)

print(X.shape)
print(X_test.shape)
print(y.shape)
print(y_test.shape)
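The reproducibility claim about `random_state` is easy to verify on toy data: two splits with the same seed are element-for-element identical:

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(20).reshape(10, 2)   # 10 toy samples, 2 features

train_a, test_a = train_test_split(data, test_size=0.3, random_state=42)
train_b, test_b = train_test_split(data, test_size=0.3, random_state=42)

# Same seed -> exactly the same shuffle and split
print(np.array_equal(train_a, train_b))  # True
print(np.array_equal(test_a, test_b))    # True
```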


4.2 Establishing a Baseline


# MAE: the mean of the absolute errors |y_true - y_pred| over all samples
# abs(): absolute value
def mae(y_true, y_pred):
    return np.mean(abs(y_true - y_pred))
# Baseline: always predict the median score of the training set
baseline_guess = np.median(y)

print('The baseline guess is a score of %0.2f' % baseline_guess) # the median is 66
print("Baseline Performance on the test set: MAE = %0.4f" % mae(y_test, baseline_guess)) # MAE = 24.5164
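The baseline logic is worth seeing on tiny made-up numbers: guess the training median for every test building, then take the mean absolute error:

```python
import numpy as np

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

y_train = np.array([10, 60, 70, 90])
y_test_toy = np.array([50, 80])

guess = np.median(y_train)            # (60 + 70) / 2 = 65.0
baseline = mae(y_test_toy, guess)     # (|50-65| + |80-65|) / 2 = 15.0
print(guess, baseline)
```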


4.3 Saving the Splits for Modeling

# Save the no-score subset and the train/test splits
# to_csv writes each DataFrame to a CSV file under data/
no_score.to_csv('data/no_score.csv', index = False)
X.to_csv('data/training_features.csv', index = False)
X_test.to_csv('data/testing_features.csv', index = False)
y.to_csv('data/training_labels.csv', index = False)
y_test.to_csv('data/testing_labels.csv', index = False)

5 Building Baseline Models: Trying Several Algorithms

# The focus so far has been on data preparation; from here on it shifts
# to modeling. Import the required packages
# Data analysis libraries
import pandas as pd
import numpy as np

# Silence chained-assignment warnings
pd.options.mode.chained_assignment = None
pd.set_option('display.max_columns', 60)

# Visualization
import matplotlib.pyplot as plt
%matplotlib inline

# Global plot font size
plt.rcParams['font.size'] = 24

from IPython.core.pylabtools import figsize

# Seaborn for higher-level statistical plots
import seaborn as sns
sns.set(font_scale = 2)

# Preprocessing: missing-value imputation and min-max scaling.
# Note: sklearn.preprocessing.Imputer was removed in newer scikit-learn
# versions; sklearn.impute.SimpleImputer is its replacement
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer

# Machine learning algorithms
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor

# Hyperparameter tuning tools
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV


import warnings
warnings.filterwarnings("ignore")
# Read in data into dataframes 
train_features = pd.read_csv('data/training_features.csv')
test_features = pd.read_csv('data/testing_features.csv')
train_labels = pd.read_csv('data/training_labels.csv')
test_labels = pd.read_csv('data/testing_labels.csv')

# Display sizes of data
print('Training Feature Size: ', train_features.shape)
print('Testing Feature Size:  ', test_features.shape)
print('Training Labels Size:  ', train_labels.shape)
print('Testing Labels Size:   ', test_labels.shape)

5.1 Imputing Missing Values


# The data has outliers, so the median is a safer fill value than the mean.
# (SimpleImputer replaces the removed sklearn.preprocessing.Imputer class)
imputer = SimpleImputer(strategy='median')

# Learn the column medians from the training features only
imputer.fit(train_features)

# Apply the learned medians to both splits
X = imputer.transform(train_features)      # training set with medians filled in
X_test = imputer.transform(test_features)  # test set filled with *training* medians
# Verify that no missing values remain in either split
# np.isnan performs an element-wise NaN check
print('Missing values in training features: ', np.sum(np.isnan(X)))  # 0 means imputation is complete
print('Missing values in testing features:  ', np.sum(np.isnan(X_test)))
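The fit-on-train / transform-both pattern is the point here: the medians are learned from the training split only, so nothing leaks from the test set. A toy sketch:

```python
import numpy as np
from sklearn.impute import SimpleImputer

train = np.array([[1.0, 10.0],
                  [3.0, np.nan],
                  [5.0, 30.0]])
test = np.array([[np.nan, 40.0]])

imp = SimpleImputer(strategy='median')
imp.fit(train)                       # column medians: 3.0 and 20.0

print(imp.transform(train))          # NaN in column 1 becomes 20.0
print(imp.transform(test))           # NaN in column 0 becomes the *training* median 3.0
```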

5.2 Feature Scaling


# feature_range=(0, 1): scale every feature into [0, 1]
scaler = MinMaxScaler(feature_range=(0, 1))

# Learn each column's min and max from the training data
scaler.fit(X)

# Scale both splits with the training min/max
X = scaler.transform(X)
X_test = scaler.transform(X_test)  # test data
# The labels arrive as single-column DataFrames;
# reshape((-1,)) flattens them into 1-D arrays
y = np.array(train_labels).reshape((-1,))
y_test = np.array(test_labels).reshape((-1, ))
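MinMaxScaler applies x' = (x - min) / (max - min) per column, with min and max taken from the data it was fit on; test values outside the training range therefore land outside [0, 1]:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[0.0], [5.0], [10.0]])
test = np.array([[2.5], [12.0]])

scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(train)                         # learns min=0, max=10

print(scaler.transform(train).ravel())    # [0.  0.5 1. ]
print(scaler.transform(test).ravel())     # [0.25 1.2] -- 12.0 maps above 1
```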

6 Baseline Models: Trying Several Algorithms (Regression)

6.1 Defining the Loss Function


# The evaluation metric is MAE; abs() takes the absolute error
def mae(y_true, y_pred):
    return np.mean(abs(y_true - y_pred))


# Fit a model on the training set and evaluate it on the test set
def fit_and_evaluate(model):
    
    # Train the model
    model.fit(X, y)
    
    # Predict on the test set and measure the error
    model_pred = model.predict(X_test)
    model_mae = mae(y_test, model_pred)
    
    
    return model_mae

6.2 Comparing Machine Learning Algorithms

lr = LinearRegression()  # linear regression
lr_mae = fit_and_evaluate(lr)

print('Linear Regression Performance on the test set: MAE = %0.4f' % lr_mae)
svm = SVR(C = 1000, gamma = 0.1)  # support vector regression
svm_mae = fit_and_evaluate(svm)

print('Support Vector Machine Regression Performance on the test set: MAE = %0.4f' % svm_mae)
random_forest = RandomForestRegressor(random_state=60)  # random forest (ensemble)
random_forest_mae = fit_and_evaluate(random_forest)

print('Random Forest Regression Performance on the test set: MAE = %0.4f' % random_forest_mae)
gradient_boosted = GradientBoostingRegressor(random_state=60)  # gradient boosted trees
gradient_boosted_mae = fit_and_evaluate(gradient_boosted)

print('Gradient Boosted Regression Performance on the test set: MAE = %0.4f' % gradient_boosted_mae)
knn = KNeighborsRegressor(n_neighbors=10)  # k-nearest neighbors
knn_mae = fit_and_evaluate(knn)

print('K-Nearest Neighbors Regression Performance on the test set: MAE = %0.4f' % knn_mae)
plt.style.use('fivethirtyeight') 
figsize(8, 6)


model_comparison = pd.DataFrame({'model': ['Linear Regression', 'Support Vector Machine',
                                           'Random Forest', 'Gradient Boosted',
                                            'K-Nearest Neighbors'],
                                 'mae': [lr_mae, svm_mae, random_forest_mae, 
                                         gradient_boosted_mae, knn_mae]})

# Sort descending so the best (lowest-MAE) model ends up at the top of the
# horizontal bar chart ('barh' draws the bars horizontally)
model_comparison.sort_values('mae', ascending = False).plot(x = 'model', y = 'mae', kind = 'barh',
                                                           color = 'red', edgecolor = 'black')

# y-axis lists the models, x-axis the MAE
plt.ylabel(''); plt.yticks(size = 14); plt.xlabel('Mean Absolute Error'); plt.xticks(size = 14)
plt.title('Model Comparison on Test MAE', size = 20);


7 Hyperparameter Tuning

7.1 Tuning


# Candidate loss functions ('ls' = least squares,
# 'lad' = least absolute deviation, 'huber' = a mix of both)
loss = ['ls', 'lad', 'huber']

# Number of weak learners (decision trees) in the ensemble
n_estimators = [100, 500, 900, 1100, 1500]

# Maximum depth of each tree
max_depth = [2, 3, 5, 10, 15]

# Minimum number of samples required at a leaf node
min_samples_leaf = [1, 2, 4, 6, 8]

# Minimum number of samples required to split a node
min_samples_split = [2, 4, 6, 10]




hyperparameter_grid = {'loss': loss,
                       'n_estimators': n_estimators,
                       'max_depth': max_depth,
                       'min_samples_leaf': min_samples_leaf,
                       'min_samples_split': min_samples_split}
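Counting the full grid explains why a randomized search is used below: trying every combination with 4-fold CV would be far more expensive than sampling 25 of them. (The grids are repeated here so the snippet is self-contained.)

```python
# The candidate grids defined above
loss = ['ls', 'lad', 'huber']
n_estimators = [100, 500, 900, 1100, 1500]
max_depth = [2, 3, 5, 10, 15]
min_samples_leaf = [1, 2, 4, 6, 8]
min_samples_split = [2, 4, 6, 10]

grid_size = (len(loss) * len(n_estimators) * len(max_depth)
             * len(min_samples_leaf) * len(min_samples_split))
print(grid_size)       # 3 * 5 * 5 * 5 * 4 = 1500 combinations
print(grid_size * 4)   # 6000 fits for an exhaustive 4-fold GridSearchCV
print(25 * 4)          # 100 fits for RandomizedSearchCV with n_iter=25
```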


model = GradientBoostingRegressor(random_state = 42)

# Randomly sample 25 hyperparameter combinations,
# each evaluated with 4-fold cross-validation
random_cv = RandomizedSearchCV(estimator=model, 
                               
                               param_distributions=hyperparameter_grid,
                               cv=4, n_iter=25, 
                               scoring = 'neg_mean_absolute_error',  # metric used to rank the results
                               
                               n_jobs = -1, verbose = 1, 
                               
                               return_train_score = True,
                               
                               random_state=42)
# Note: this is slow to run (about 14 minutes)
random_cv.fit(X, y)
random_cv.best_estimator_  # the best estimator found by the search
# Grid of tree counts to refine around the random-search result
trees_grid = {'n_estimators': [100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 800]}

# Build the model with the other hyperparameters fixed at the values
# found above ('lad' minimizes the absolute deviation)
model = GradientBoostingRegressor(loss = 'lad', max_depth = 5,
                                  min_samples_leaf = 6,
                                  min_samples_split = 6,
                                  max_features = None,
                                  random_state = 42)

# Exhaustive search over the tree-count grid
grid_search = GridSearchCV(estimator = model, param_grid=trees_grid, cv = 4, 
                           scoring = 'neg_mean_absolute_error', verbose = 1,
                           n_jobs = -1, return_train_score = True)
# Takes about 3 minutes
grid_search.fit(X, y)

7.2 Comparing the Loss Curves

# Collect the cross-validation results into a DataFrame
results = pd.DataFrame(grid_search.cv_results_)

# Plot training vs. testing error against the number of trees
figsize(8, 8)
plt.style.use('fivethirtyeight')

plt.plot(results['param_n_estimators'], -1 * results['mean_test_score'], label = 'Testing Error')
plt.plot(results['param_n_estimators'], -1 * results['mean_train_score'], label = 'Training Error')
# x-axis is the number of trees, y-axis the MAE
plt.xlabel('Number of Trees'); plt.ylabel('Mean Absolute Error'); plt.legend();
plt.title('Performance vs Number of Trees');
# The training error keeps falling while the testing error flattens out:
# adding more trees overfits

8 Evaluation and Testing: the Gap Between Predictions and True Values

# Default (untuned) model for comparison
default_model = GradientBoostingRegressor(random_state = 42)
default_model.fit(X,y)
# The best estimator found by the grid search
final_model = grid_search.best_estimator_

final_model
default_pred = default_model.predict(X_test)
final_pred = final_model.predict(X_test)
print('Default model performance on the test set: MAE = %0.4f.' % mae(y_test, default_pred))
print('Final model performance on the test set:   MAE = %0.4f.' % mae(y_test, final_pred))
figsize(6, 6)  # note: 'figsize = (6, 6)' would shadow the figsize() helper

# Residuals = predictions - true values; most fall within about +/-25 points
residuals = final_pred - y_test

plt.hist(residuals, color = 'red', bins = 20,
         edgecolor = 'black')
plt.xlabel('Error'); plt.ylabel('Count')
plt.title('Distribution of Residuals');


9 Interpreting the Model: Feature Selection Based on Importance

import pandas as pd
import numpy as np


pd.options.mode.chained_assignment = None
pd.set_option('display.max_columns', 60)


import matplotlib.pyplot as plt
%matplotlib inline


plt.rcParams['font.size'] = 24

from IPython.core.pylabtools import figsize

import seaborn as sns

sns.set(font_scale = 2)



# Note: sklearn.preprocessing.Imputer was removed in newer scikit-learn
# versions; sklearn.impute.SimpleImputer is its replacement
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor

from sklearn import tree



import warnings
warnings.filterwarnings("ignore")


# Fill missing values with the training-set medians
# (SimpleImputer replaces the removed Imputer class)
imputer = SimpleImputer(strategy='median')


# Learn the column medians from the training features
imputer.fit(train_features)


X = imputer.transform(train_features)
# The test set is filled with the *training* medians as well
X_test = imputer.transform(test_features)


# Flatten the single-column label frames into 1-D arrays
y = np.array(train_labels).reshape((-1,))
y_test = np.array(test_labels).reshape((-1,))
def mae(y_true, y_pred):
    return np.mean(abs(y_true - y_pred))
model = GradientBoostingRegressor(loss='lad', max_depth=5, max_features=None,
                                  min_samples_leaf=6, min_samples_split=6, 
                                  n_estimators=800, random_state=42)

model.fit(X, y)
# The tuned GBDT serves as the final model
model_pred = model.predict(X_test)

print('Final Model Performance on the test set: MAE = %0.4f' % mae(y_test, model_pred))
# Feature importances
feature_results = pd.DataFrame({'feature': list(train_features.columns),  # all training features
                                'importance': model.feature_importances_})

# Sort descending and show the 10 most important features
feature_results = feature_results.sort_values('importance', ascending = False).reset_index(drop=True)

feature_results.head(10)
figsize(12, 10)
plt.style.use('fivethirtyeight')

# Horizontal bar chart of the 10 most important features
feature_results.loc[:9, :].plot(x = 'feature', y = 'importance', 
                                 edgecolor = 'k',
                                 kind='barh', color = 'blue');  # barh: horizontal bars
plt.xlabel('Relative Importance', size = 20); plt.ylabel('')
plt.title('Feature Importances from Gradient Boosting', size = 30);
most_important_features = feature_results['feature'][:10]  # top-10 feature names
# Positional indices of those features among the training columns
indices = [list(train_features.columns).index(x) for x in most_important_features]

# Keep only the most important columns
X_reduced = X[:, indices]
X_test_reduced = X_test[:, indices]

print('Most important training features shape: ', X_reduced.shape)
print('Most important testing  features shape: ', X_test_reduced.shape)
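`X[:, indices]` relies on NumPy fancy indexing: passing a list of column positions returns just those columns, in the requested order. A toy sketch of the same subsetting:

```python
import numpy as np

X_toy = np.arange(12).reshape(3, 4)      # 3 samples, 4 features
cols = ['a', 'b', 'c', 'd']
keep = ['d', 'b']                        # e.g. the most important features

idx = [cols.index(name) for name in keep]
X_sub = X_toy[:, idx]                    # select those columns only

print(idx)            # [3, 1]
print(X_sub.shape)    # (3, 2)
print(X_sub[0])       # [3 1]
```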
lr = LinearRegression()


lr.fit(X, y)
lr_full_pred = lr.predict(X_test)


lr.fit(X_reduced, y)
lr_reduced_pred = lr.predict(X_test_reduced)


print('Linear Regression Full Results: MAE =    %0.4f.' % mae(y_test, lr_full_pred))
print('Linear Regression Reduced Results: MAE = %0.4f.' % mae(y_test, lr_reduced_pred))

