Machine-Learning-Based Urban Rainfall Prediction

Background

  • Radar maps observed by the Shenzhen Meteorological Bureau: each radar map covers a target site and its surrounding area, laid out as an m*m grid, and each grid point records the radar reflectivity factor Z. Because Z spans values from very small to very large, it is measured in dBZ for convenience: dBZ = 10 * log10(Z / Z0), where Z0 = 1 mm^6/m^3.

  • Short-term precipitation forecasting involves analyzing the following:
    1. the relationship between the current rainfall and the radar reflectivity;
    2. the radar map, which contains the reflectivity of the target site and its surroundings, so the rainfall relationship between the target site and the surrounding area also has to be taken into account.

  • The goal is to predict the total hourly rainfall from these radar values (see the sketch below for a classic reflectivity-to-rain-rate relation).
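As a rough illustration of the reflectivity-to-rainfall relation mentioned above, the sketch below converts a dBZ value into an instantaneous rain rate using the classic Marshall-Palmer Z-R relation Z = 200 * R^1.6. This relation is an illustrative assumption (one of several common Z-R relations), not the model developed later in this post.

import numpy as np

def marshall_palmer_rain_rate(dbz):
    # Convert reflectivity in dBZ back to the linear reflectivity factor Z (mm^6/m^3),
    # then invert Z = 200 * R**1.6 to obtain the rain rate R in mm/h.
    z = 10.0 ** (dbz / 10.0)
    return (z / 200.0) ** (1.0 / 1.6)

print(marshall_palmer_rain_rate(30.0))  # a 30 dBZ echo corresponds to roughly 2.7 mm/h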

Data description

The training set comes from the Kaggle competition "How Much Did It Rain? II" and pairs weather-radar observations with hourly rain-gauge measurements. Time and location information has been anonymized.

Files

  • train.zip: the training set. It contains radar observations over rain gauges in the U.S. Midwest, covering 20-plus days in each month of the corn growing season.
# List every file available under the Kaggle input directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
/kaggle/input/how-much-did-it-rain-ii/train.zip
/kaggle/input/how-much-did-it-rain-ii/test.zip
/kaggle/input/how-much-did-it-rain-ii/sample_dask.py
/kaggle/input/how-much-did-it-rain-ii/sample_solution.csv.zip
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

# Models
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

# Preprocessing, splitting and regression metrics
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error

Data preprocessing

There is so much data because multiple radar observations are made within each hour, while there is only one rain-gauge reading per hour (the Expected value). That is why many rows share the same Id (this is checked in the sketch after the data preview below).

The features have the following meanings:

  • Id: a unique identifier for all observations of one rain gauge within one hour; every gauge-hour in the dataset has its own Id.

  • minutes_past: for each radar observation, the number of minutes past the top of the hour at which it was taken. Each radar observation is a snapshot of the conditions aloft at that moment.

  • radardist_km: the horizontal distance between the rain gauge and the radar that produced the observation.

  • Ref: the radar reflectivity factor derived from the returned echo power relative to the transmitted power, expressed in dBZ.

  • Ref_5x5_10th: the 10th percentile of reflectivity in a 5x5 neighborhood centered on the gauge.

  • Ref_5x5_50th: the 50th percentile of reflectivity in a 5x5 neighborhood centered on the gauge.

  • Ref_5x5_90th: the 90th percentile of reflectivity in a 5x5 neighborhood centered on the gauge.

  • RefComposite: the maximum reflectivity in the vertical column above the gauge (dBZ).

  • RefComposite_5x5_10th: the 10th percentile of composite reflectivity in the 5x5 neighborhood.

  • RefComposite_5x5_50th: the 50th percentile of composite reflectivity in the 5x5 neighborhood.

  • RefComposite_5x5_90th: the 90th percentile of composite reflectivity in the 5x5 neighborhood.

  • RhoHV: the correlation coefficient between the radar's horizontally and vertically polarized signals (unitless).

  • RhoHV_5x5_10th: the 10th percentile of RhoHV in the 5x5 neighborhood.

  • RhoHV_5x5_50th: the 50th percentile of RhoHV in the 5x5 neighborhood.

  • RhoHV_5x5_90th: the 90th percentile of RhoHV in the 5x5 neighborhood.

  • Zdr: differential reflectivity, the difference between the horizontally and vertically polarized reflectivity factors (dB).

  • Zdr_5x5_10th: the 10th percentile of Zdr in the 5x5 neighborhood.

  • Zdr_5x5_50th: the 50th percentile of Zdr in the 5x5 neighborhood.

  • Zdr_5x5_90th: the 90th percentile of Zdr in the 5x5 neighborhood.

  • Kdp: specific differential phase, the phase shift accumulated by precipitation particles per unit distance (degrees/km).

  • Kdp_5x5_10th: the 10th percentile of Kdp in the 5x5 neighborhood.

  • Kdp_5x5_50th: the 50th percentile of Kdp in the 5x5 neighborhood.

  • Kdp_5x5_90th: the 90th percentile of Kdp in the 5x5 neighborhood.

  • Expected: the actual gauge reading at the end of the hour (mm).

Import the training set

train_data = pd.read_csv("../input/how-much-did-it-rain-ii/train.zip")
train_data
(Preview of train_data: 13765201 rows × 24 columns. Columns: Id, minutes_past, radardist_km, Ref, RefComposite, RhoHV, Zdr and Kdp together with their 5x5 10th/50th/90th-percentile variants, and Expected. The first few rows shown contain almost nothing but NaN in the radar columns.)
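To confirm the earlier point that each Id groups several radar scans within one gauge-hour, a quick check (a sketch; the exact counts depend on the data):

# Number of radar scans recorded per gauge-hour (per Id)
scans_per_id = train_data.groupby("Id").size()
print(scans_per_id.describe())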

Handling invalid data

For each gauge-hour (Id), if every Ref value is NaN (i.e., no radar data was returned at all), that Id is dropped from the dataset; Ids with at least one valid Ref reading are kept.

train_ids = train_data[~np.isnan(train_data.Ref)]             # rows whose Ref is not NaN
train_data = train_data[np.in1d(train_data.Id, train_ids.Id)] # keep only Ids that have at least one valid Ref

train_data
(train_data after filtering: 9125329 rows × 24 columns, same column layout as above.)

Checking for NaN values and filling them with 0

  1. Removing anomalies: NaN usually marks missing or erroneous values, which can distort analysis and modeling. Handling them reduces their impact and improves accuracy and reliability.
  2. Reducing noise: NaN values can introduce noise into the data; handling them improves data quality.
  3. Improving efficiency: large numbers of NaN values increase the complexity and cost of processing; handling them keeps analysis and modeling simpler and faster.
  4. Easier visualization: NaN values are treated as missing or anomalous when plotting and degrade the charts; handling them gives cleaner views of the data's features and trends.
train_data.isna().sum()
Id                             0
minutes_past                   0
radardist_km                   0
Ref                      2775954
Ref_5x5_10th             3841387
Ref_5x5_50th             2772390
Ref_5x5_90th             1706556
RefComposite             2429475
RefComposite_5x5_10th    3377843
RefComposite_5x5_50th    2434061
RefComposite_5x5_90th    1471489
RhoHV                    4314082
RhoHV_5x5_10th           5052173
RhoHV_5x5_50th           4300572
RhoHV_5x5_90th           3458417
Zdr                      4314082
Zdr_5x5_10th             5052173
Zdr_5x5_50th             4300572
Zdr_5x5_90th             3458417
Kdp                      5036548
Kdp_5x5_10th             5745822
Kdp_5x5_50th             5026320
Kdp_5x5_90th             4231865
Expected                       0
dtype: int64
def process_data(train_data):
    # Collapse each gauge-hour (Id) to the mean of its radar observations,
    # then fill any remaining NaN with 0.
    train_data_mean = train_data.groupby('Id').mean()
    train_data_mean.fillna(0, inplace=True)
    train_data_mean.reset_index(inplace=True)
    return train_data_mean
train_data_mean = process_data(train_data)
train_data_mean.isnull().sum()
Id                       0
minutes_past             0
radardist_km             0
Ref                      0
Ref_5x5_10th             0
Ref_5x5_50th             0
Ref_5x5_90th             0
RefComposite             0
RefComposite_5x5_10th    0
RefComposite_5x5_50th    0
RefComposite_5x5_90th    0
RhoHV                    0
RhoHV_5x5_10th           0
RhoHV_5x5_50th           0
RhoHV_5x5_90th           0
Zdr                      0
Zdr_5x5_10th             0
Zdr_5x5_50th             0
Zdr_5x5_90th             0
Kdp                      0
Kdp_5x5_10th             0
Kdp_5x5_50th             0
Kdp_5x5_90th             0
Expected                 0
dtype: int64
pd.set_option('display.float_format', lambda x: '%.3f' % x)
train_data_mean[["minutes_past", "radardist_km", "Expected"]].describe()
       minutes_past  radardist_km    Expected
count    731556.000    731556.000  731556.000
mean         29.543         9.747      23.995
std           1.459         4.059     241.450
min           2.000         0.000       0.010
25%          28.727         7.000       0.254
50%          29.538        10.000       1.270
75%          30.357        13.000       3.556
max          58.000        21.000   33017.730

EDA: exploratory data analysis

Computing the correlation matrix

Each element of the correlation matrix is the correlation coefficient between two variables, a statistic that measures the strength and direction of their linear relationship. A strong positive relationship gives a coefficient close to 1, a strong negative relationship gives a coefficient close to -1, and no linear relationship gives a coefficient close to 0.

corr_mat = train_data_mean.corr()
corr_mat.style.background_gradient(cmap='coolwarm')
(Correlation heatmap of train_data_mean. The Ref and RefComposite columns and their 5x5 percentile variants are highly correlated with one another (r roughly 0.82 to 0.97), the RhoHV columns correlate at roughly 0.67 to 0.93, and Expected shows only weak linear correlation with every feature (|r| < 0.06).)
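A convenient follow-up (a sketch using the corr_mat computed above) is to rank the features by the absolute value of their correlation with the target:

# Features most linearly correlated with the hourly rainfall target
print(corr_mat["Expected"].drop("Expected").abs().sort_values(ascending=False).head(10))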

Handling outliers (setting a threshold)

# Take a quick look at the unique Expected values
plt.figure(figsize=(8,6))
plt.hist(train_data_mean["Expected"].unique())
plt.show()

[Figure: histogram of unique Expected values]

plt.figure(figsize=(8, 6))
plt.scatter(np.arange(len(train_data["Expected"].unique())),
                            train_data["Expected"].unique())
<matplotlib.collections.PathCollection at 0x7fae9dd6e210>

[Figure: scatter plot of unique Expected values]

Most Expected values are below 5000, but a few jump far above that and need to be handled.

train_data_mean.Expected.mean()
23.994922498376706
print(stats.percentileofscore(train_data_mean["Expected"], 108))
98.2024616023927
# 108 mm is roughly the 98th percentile of Expected (see above); drop gauge-hours above it as outliers
train_data_mean.drop(train_data_mean[train_data_mean["Expected"] >= 108].index, inplace=True)
(train_data_mean after dropping the outliers: 717105 rows × 24 columns.)

Visualizing individual columns

plt.figure(figsize=(8, 6))
plt.scatter(np.arange(len(train_data_mean["Expected"].unique())), train_data_mean["Expected"].unique())
<matplotlib.collections.PathCollection at 0x7faedce2ef10>

[Figure: scatter plot of unique Expected values after removing outliers]

plt.figure(figsize=(8, 6))
train_data_mean["radardist_km"].hist()
plt.title("radardist_km")
Text(0.5, 1.0, 'radardist_km')

[Figure: histogram of radardist_km]

plt.figure(figsize=(8, 6))
train_data_mean["Ref"].hist()
plt.title("Ref")
Text(0.5, 1.0, 'Ref')

[Figure: histogram of Ref]

plt.figure(figsize=(8, 6))
train_data_mean["RefComposite"].hist()
plt.title("RefComposite")
Text(0.5, 1.0, 'RefComposite')

[Figure: histogram of RefComposite]

plt.figure(figsize=(8, 6))
train_data_mean["RhoHV"].hist()
plt.title("RhoHV")
Text(0.5, 1.0, 'RhoHV')

[Figure: histogram of RhoHV]

plt.figure(figsize=(8, 6))
train_data_mean["Zdr"].hist()
plt.title("Zdr")
Text(0.5, 1.0, 'Zdr')

[Figure: histogram of Zdr]

plt.figure(figsize=(8, 6))
train_data_mean["Kdp"].hist()
plt.title("Kdp")
Text(0.5, 1.0, 'Kdp')

[Figure: histogram of Kdp]

plt.figure(figsize=(8, 6))
plt.hist(train_data_mean["Expected"].unique())
plt.title("Unique Expected")
Text(0.5, 1.0, 'Unique Expected')

[Figure: histogram of unique Expected values ("Unique Expected")]

Training on 1,000,000 samples

train_data_sample = train_data_mean.sample(n=1000000, replace=True)  # sample 1M rows with replacement
features = [f for f in train_data_sample.columns]
features.remove("Expected")  # drop the label
features.remove("Id")        # drop the identifier
X = train_data_sample[features]    # features
Y = train_data_sample["Expected"]  # label
# Feature scaling (note: the scaler is fitted on the full sample here;
# fitting it on the training split only would avoid leakage into the validation split)
scaler = StandardScaler()
# Split into training and validation sets
x_train, x_test, y_train, y_test = train_test_split(scaler.fit_transform(X), Y, test_size=0.25, random_state=0)

Modeling, training, and evaluation

Evaluation metrics:
MSE (Mean Squared Error) and MAE (Mean Absolute Error) are both used to evaluate regression models.
MSE is more sensitive to samples with large prediction errors, because squaring amplifies them, whereas MAE weights all errors equally regardless of their size.
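Formally, for n samples with true values $y_i$ and predictions $\hat{y}_i$:

$$\mathrm{MSE}=\frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2,\qquad \mathrm{MAE}=\frac{1}{n}\sum_{i=1}^{n}\left|y_i-\hat{y}_i\right|$$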

LR = LinearRegression()
LR.fit(x_train,y_train)
LinearRegression()
print('Linear Regression Score:', LR.score(x_test, y_test))
print('Linear Regression MSE: ', mean_squared_error(y_test, LR.predict(x_test)))
print('Linear Regression MAE: ', mean_absolute_error(y_test, LR.predict(x_test)))
Linear Regression Score: 0.05009157464667113
Linear Regression MSE:  62.94448284458787
Linear Regression MAE:  3.567090649395791
DTR = DecisionTreeRegressor()
DTR.fit(x_train,y_train)
DecisionTreeRegressor()
print('Decision Tree Score:', DTR.score(x_test, y_test))
print('Decision Tree MSE: ', mean_squared_error(y_test, DTR.predict(x_test)))
print('Decision Tree MAE: ', mean_absolute_error(y_test, DTR.predict(x_test)))
Decision Tree Score: 0.568334912813709
Decision Tree MSE:  28.60374216061774
Decision Tree MAE:  1.1915324397365497
XGB = XGBRegressor()
XGB.fit(x_train,y_train)
XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,
             colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
             early_stopping_rounds=None, enable_categorical=False,
             eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
             importance_type=None, interaction_constraints='',
             learning_rate=0.300000012, max_bin=256, max_cat_to_onehot=4,
             max_delta_step=0, max_depth=6, max_leaves=0, min_child_weight=1,
             missing=nan, monotone_constraints='()', n_estimators=100, n_jobs=0,
             num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,
             reg_lambda=1, ...)
print('XGBoost Regressor Score:', XGB.score(x_test, y_test))
print('XGBoost Regressor MSE: ', mean_squared_error(y_test, XGB.predict(x_test)))
print('XGBoost Regressor MAE: ', mean_absolute_error(y_test, XGB.predict(x_test)))
XGBoost Regressor Score: 0.18670791787153074
XGBoost Regressor MSE:  53.89177329607648
XGBoost Regressor MAE:  3.173719464009896
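RandomForestRegressor is imported at the top but never used; as a sketch it can be evaluated the same way as the other models (the hyperparameters below are illustrative assumptions, not tuned values):

# Sketch: train and evaluate a random forest with illustrative hyperparameters
RFR = RandomForestRegressor(n_estimators=50, max_depth=12, n_jobs=-1, random_state=0)
RFR.fit(x_train, y_train)
print('Random Forest Score:', RFR.score(x_test, y_test))
print('Random Forest MSE: ', mean_squared_error(y_test, RFR.predict(x_test)))
print('Random Forest MAE: ', mean_absolute_error(y_test, RFR.predict(x_test)))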

Import the test set

test = pd.read_csv("../input/how-much-did-it-rain-ii/test.zip")
test_data = process_data(test)
test_data
(test_data after process_data: 717625 rows × 23 columns, the same columns as the training set minus Expected.)

Prediction results

# Note: the scaler is refitted on the test features here (and in the two blocks below);
# reusing the scaler fitted on the training data via scaler.transform(...) would keep the
# scaling consistent between training and prediction.
Linear_Regression_Prediction = LR.predict(scaler.fit_transform(test_data[features]))
Linear_Regression_Result = pd.DataFrame(columns=["Id","Expected"])
Linear_Regression_Result["Id"] = test_data["Id"]
Linear_Regression_Result["Expected"] = Linear_Regression_Prediction

Linear_Regression_Result.head()
   Id  Expected
0   1     3.447
1   2     3.566
2   3     5.709
3   4     5.716
4   5     2.295
Decision_Tree_Prediction = DTR.predict(scaler.fit_transform(test_data[features]))
Decision_Tree_Result = pd.DataFrame(columns=["Id","Expected"])
Decision_Tree_Result["Id"] = test_data["Id"]
Decision_Tree_Result["Expected"] = Decision_Tree_Prediction

Decision_Tree_Result.head()
   Id  Expected
0   1     2.794
1   2     1.524
2   3     0.508
3   4     3.950
4   5     0.254
XGBoost_Regressor_Prediction = XGB.predict(scaler.fit_transform(test_data[features]))
XGBoost_Regressor_Result = pd.DataFrame(columns=["Id","Expected"])
XGBoost_Regressor_Result["Id"] = test_data["Id"]
XGBoost_Regressor_Result["Expected"] = XGBoost_Regressor_Prediction

XGBoost_Regressor_Result.head()
   Id  Expected
0   1     0.555
1   2     2.201
2   3     5.386
3   4    11.370
4   5     2.868
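Any of the three result DataFrames can be written out in the same Id/Expected format used above to produce a submission file; a minimal sketch (the filename is arbitrary):

# Save the XGBoost predictions as a submission file
XGBoost_Regressor_Result.to_csv("submission.csv", index=False)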