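Background
- Each radar map observed by the Shenzhen Meteorological Bureau covers a target site and its surrounding area as an m×m grid, where every grid point records the radar reflectivity factor Z. Because Z ranges from very small to very large values, it is conventionally expressed on a logarithmic scale in dBZ.
- Short-term precipitation forecasting involves analysing the following information:
  1. The relationship between the current precipitation and the radar reflectivity;
  2. The radar map contains the radar reflectivity of the target site and of its surroundings, so the precipitation relationship between the target site and the surrounding area has to be taken into account.
- We need to predict the total hourly rainfall from the radar values.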
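As a rough illustration of the two ideas above, the sketch below converts between the linear reflectivity factor Z and dBZ using the standard definition dBZ = 10·log10(Z), and then estimates a rain rate from a Z-R power law. The Marshall-Palmer coefficients (a = 200, b = 1.6) and the helper names are assumptions chosen for illustration; they are not taken from the data set or from the models trained later.

```python
import numpy as np

def dbz_to_z(dbz):
    """Convert reflectivity in dBZ to the linear factor Z (mm^6 / m^3)."""
    return 10.0 ** (np.asarray(dbz, dtype=float) / 10.0)

def z_to_dbz(z):
    """Convert the linear reflectivity factor Z to dBZ."""
    return 10.0 * np.log10(np.asarray(z, dtype=float))

def rain_rate_from_dbz(dbz, a=200.0, b=1.6):
    """Invert the power law Z = a * R**b to get a rain rate R in mm/h.

    a=200 and b=1.6 are the classic Marshall-Palmer values, used here
    only as an assumed default.
    """
    return (dbz_to_z(dbz) / a) ** (1.0 / b)

if __name__ == "__main__":
    for dbz in (10.0, 30.0, 50.0):
        print(dbz, "dBZ ->", float(rain_rate_from_dbz(dbz)), "mm/h")
```

A fixed power law like this is only a baseline; the models below try to learn the relationship from the radar features instead.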
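Data description
The training set contains Shenzhen weather-radar network and rainfall data from April to August 2017. Time and location information has been anonymized.
File description
- train.zip: the training set. It consists of radar observations at gauges in the US Midwest over 20 days each month during the corn-growing season.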
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
/kaggle/input/how-much-did-it-rain-ii/train.zip
/kaggle/input/how-much-did-it-rain-ii/test.zip
/kaggle/input/how-much-did-it-rain-ii/sample_dask.py
/kaggle/input/how-much-did-it-rain-ii/sample_solution.csv.zip
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error
from scipy.stats import percentileofscore
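Data preprocessing
There are so many rows because each hour contains several radar observations but only one rain-gauge observation (Expected). That is why multiple rows share the same Id.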
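The features have the following meanings:
- Id: a unique number for the set of observations made over one hour at one rain gauge. All rows from the same gauge-hour share the same Id.
- minutes_past: for each radar observation, the number of minutes past the top of the hour at which it was taken. A radar observation is a snapshot of the conditions overhead at that moment.
- radardist_km: horizontal distance between the rain gauge and the radar reporting the observation (km).
- Ref: radar reflectivity, derived from the power of the echo returned to the radar, usually expressed in dBZ.
- Ref_5x5_10th: 10th percentile of the reflectivity values in the 5x5 neighbourhood centred on the gauge.
- Ref_5x5_50th: 50th percentile of the reflectivity values in the 5x5 neighbourhood centred on the gauge.
- Ref_5x5_90th: 90th percentile of the reflectivity values in the 5x5 neighbourhood centred on the gauge.
- RefComposite: maximum reflectivity in the vertical column above the gauge (dBZ).
- RefComposite_5x5_10th
- RefComposite_5x5_50th
- RefComposite_5x5_90th
- RhoHV: correlation coefficient between the horizontally and vertically polarized radar returns (unitless).
- RhoHV_5x5_10th
- RhoHV_5x5_50th
- RhoHV_5x5_90th
- Zdr: differential reflectivity, the difference between the horizontally and vertically polarized reflectivity factors (dB).
- Zdr_5x5_10th
- Zdr_5x5_50th
- Zdr_5x5_90th
- Kdp: specific differential phase, the phase difference accumulated per unit distance through the precipitation (deg/km).
- Kdp_5x5_10th
- Kdp_5x5_50th
- Kdp_5x5_90th
- Expected: the actual gauge observation at the end of the hour (mm).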
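To make the one-gauge-reading-per-hour structure concrete, the following sketch builds a tiny hypothetical frame with Id, minutes_past, Ref and Expected columns (all values are made up) and checks that each Id has several radar rows but a single Expected value.

```python
import numpy as np
import pandas as pd

# Made-up miniature of the training table: two gauge-hours (Id 1 and 2).
toy = pd.DataFrame({
    "Id":           [1, 1, 1, 2, 2],
    "minutes_past": [3, 25, 51, 10, 40],
    "Ref":          [np.nan, 12.5, 15.0, 30.0, np.nan],
    "Expected":     [0.254, 0.254, 0.254, 1.016, 1.016],
})

per_id = toy.groupby("Id").agg(
    n_observations=("minutes_past", "count"),    # radar snapshots within the hour
    n_expected_values=("Expected", "nunique"),   # should be 1 for every Id
    expected=("Expected", "first"),              # the single gauge reading
)
print(per_id)
```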
Loading the training set
train_data = pd.read_csv("../input/how-much-did-it-rain-ii/train.zip")
train_data
| | Id | minutes_past | radardist_km | Ref | Ref_5x5_10th | Ref_5x5_50th | Ref_5x5_90th | RefComposite | RefComposite_5x5_10th | RefComposite_5x5_50th | ... | RhoHV_5x5_90th | Zdr | Zdr_5x5_10th | Zdr_5x5_50th | Zdr_5x5_90th | Kdp | Kdp_5x5_10th | Kdp_5x5_50th | Kdp_5x5_90th | Expected |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 3 | 10.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.254000 |
1 | 1 | 16 | 10.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.254000 |
2 | 1 | 25 | 10.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.254000 |
3 | 1 | 35 | 10.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.254000 |
4 | 1 | 45 | 10.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.254000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
13765196 | 1180945 | 38 | 9.0 | 33.0 | 19.5 | 25.5 | 36.5 | 33.0 | 20.5 | 28.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 8.636004 |
13765197 | 1180945 | 42 | 9.0 | 33.0 | 21.0 | 30.5 | 37.0 | 36.5 | 22.0 | 33.5 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 8.636004 |
13765198 | 1180945 | 47 | 9.0 | 29.5 | 10.0 | 26.0 | 30.5 | 31.0 | 16.5 | 26.0 | ... | 1.051667 | 1.75 | NaN | 0.750 | 3.0000 | 13.379990 | NaN | NaN | 13.379990 | 8.636004 |
13765199 | 1180945 | 52 | 9.0 | 19.0 | NaN | 15.5 | 26.5 | 19.0 | NaN | 16.5 | ... | 1.051667 | NaN | NaN | NaN | 2.8125 | NaN | NaN | NaN | NaN | 8.636004 |
13765200 | 1180945 | 57 | 9.0 | 7.5 | NaN | 10.0 | 13.0 | 14.5 | 10.0 | 12.5 | ... | 1.051667 | 0.00 | -1.125 | 0.375 | 3.2500 | 6.069992 | NaN | -8.029999 | 6.069992 | 8.636004 |
13765201 rows × 24 columns
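Handling invalid data
For each Id, if every Ref value is NaN (that is, the radar returned no reflectivity for the whole hour), all observations with that Id are removed from the data set. Ids with at least one valid Ref keep all of their rows, including the rows where Ref is NaN.
train_ids = train_data[~np.isnan(train_data.Ref)]  # rows whose Ref is not NaN
train_data = train_data[train_data.Id.isin(train_ids.Id)]  # keep every row whose Id appears in train_ids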
train_data
| | Id | minutes_past | radardist_km | Ref | Ref_5x5_10th | Ref_5x5_50th | Ref_5x5_90th | RefComposite | RefComposite_5x5_10th | RefComposite_5x5_50th | ... | RhoHV_5x5_90th | Zdr | Zdr_5x5_10th | Zdr_5x5_50th | Zdr_5x5_90th | Kdp | Kdp_5x5_10th | Kdp_5x5_50th | Kdp_5x5_90th | Expected |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6 | 2 | 1 | 2.0 | 9.0 | 5.0 | 7.5 | 10.5 | 15.0 | 10.5 | 16.5 | ... | 0.998333 | 0.3750 | -0.1250 | 0.3125 | 0.8750 | 1.059998 | -1.410004 | -0.350006 | 1.059998 | 1.016001 |
7 | 2 | 6 | 2.0 | 26.5 | 22.5 | 25.5 | 31.5 | 26.5 | 26.5 | 28.5 | ... | 1.005000 | 0.0625 | -0.1875 | 0.2500 | 0.6875 | NaN | NaN | NaN | 1.409988 | 1.016001 |
8 | 2 | 11 | 2.0 | 21.5 | 15.5 | 20.5 | 25.0 | 26.5 | 23.5 | 25.0 | ... | 1.001667 | 0.3125 | -0.0625 | 0.3125 | 0.6250 | 0.349991 | NaN | -0.350006 | 1.759994 | 1.016001 |
9 | 2 | 16 | 2.0 | 18.0 | 14.0 | 17.5 | 21.0 | 20.5 | 18.0 | 20.5 | ... | 1.001667 | 0.2500 | 0.1250 | 0.3750 | 0.6875 | 0.349991 | -1.059998 | 0.000000 | 1.059998 | 1.016001 |
10 | 2 | 21 | 2.0 | 24.5 | 16.5 | 21.0 | 24.5 | 24.5 | 21.0 | 24.0 | ... | 0.998333 | 0.2500 | 0.0625 | 0.1875 | 0.5625 | -0.350006 | -1.059998 | -0.350006 | 1.759994 | 1.016001 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
13765196 | 1180945 | 38 | 9.0 | 33.0 | 19.5 | 25.5 | 36.5 | 33.0 | 20.5 | 28.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 8.636004 |
13765197 | 1180945 | 42 | 9.0 | 33.0 | 21.0 | 30.5 | 37.0 | 36.5 | 22.0 | 33.5 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 8.636004 |
13765198 | 1180945 | 47 | 9.0 | 29.5 | 10.0 | 26.0 | 30.5 | 31.0 | 16.5 | 26.0 | ... | 1.051667 | 1.7500 | NaN | 0.7500 | 3.0000 | 13.379990 | NaN | NaN | 13.379990 | 8.636004 |
13765199 | 1180945 | 52 | 9.0 | 19.0 | NaN | 15.5 | 26.5 | 19.0 | NaN | 16.5 | ... | 1.051667 | NaN | NaN | NaN | 2.8125 | NaN | NaN | NaN | NaN | 8.636004 |
13765200 | 1180945 | 57 | 9.0 | 7.5 | NaN | 10.0 | 13.0 | 14.5 | 10.0 | 12.5 | ... | 1.051667 | 0.0000 | -1.1250 | 0.3750 | 3.2500 | 6.069992 | NaN | -8.029999 | 6.069992 | 8.636004 |
9125329 rows × 24 columns
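The Id-level filter can be checked on a toy frame. The sketch below (made-up values, hypothetical variable names) drops only the Id whose Ref readings are all missing, and shows an equivalent groupby/transform formulation alongside the two-step version; neither the toy data nor the names come from the original notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Id":  [1, 1, 2, 2, 3],
    "Ref": [np.nan, np.nan, np.nan, 21.5, 18.0],
})

# Two-step version: collect Ids with at least one valid Ref, then keep their rows.
valid_ids = toy.loc[toy["Ref"].notna(), "Id"]
kept = toy[toy["Id"].isin(valid_ids)]

# Equivalent one-step version: per-Id flag "has any non-NaN Ref".
has_ref = toy["Ref"].notna().groupby(toy["Id"]).transform("any")
kept_alt = toy[has_ref]

print(kept)
```

Id 1 disappears entirely, while Id 2 keeps both rows even though one of them still has a missing Ref. This matches the NaN counts that remain in the Ref column after filtering.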
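Checking for NaN values and filling with 0
- Limit the effect of missing values: NaN usually marks a missing or invalid measurement, which can distort analysis and modelling results. Handling NaN values (here, by averaging per Id and filling whatever remains with 0) reduces that effect and makes the data more reliable.
- Reduce noise: NaN values can introduce noise into the analysis and the models. Handling them improves data quality.
- Improve efficiency: with data of this size, a large number of NaN values adds complexity and processing time. Handling them simplifies the data and speeds up analysis and modelling.
- Simplify visualization: NaN values are treated as missing when plotting and can degrade a chart. Handling them first gives plots that show the features and trends of the data more clearly.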
train_data.isna().sum()
Id 0
minutes_past 0
radardist_km 0
Ref 2775954
Ref_5x5_10th 3841387
Ref_5x5_50th 2772390
Ref_5x5_90th 1706556
RefComposite 2429475
RefComposite_5x5_10th 3377843
RefComposite_5x5_50th 2434061
RefComposite_5x5_90th 1471489
RhoHV 4314082
RhoHV_5x5_10th 5052173
RhoHV_5x5_50th 4300572
RhoHV_5x5_90th 3458417
Zdr 4314082
Zdr_5x5_10th 5052173
Zdr_5x5_50th 4300572
Zdr_5x5_90th 3458417
Kdp 5036548
Kdp_5x5_10th 5745822
Kdp_5x5_50th 5026320
Kdp_5x5_90th 4231865
Expected 0
dtype: int64
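def process_data(train_data):
    # Average all radar observations of each Id; mean() skips NaN values
    train_data_mean = train_data.groupby('Id').mean()
    # Columns that were NaN for every observation of an Id are set to 0
    train_data_mean.fillna(0,inplace=True)
    # Turn Id back into an ordinary column
    train_data_mean.reset_index(inplace=True)
    return train_data_mean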
train_data_mean = process_data(train_data)
train_data_mean.isnull().sum()
Id 0
minutes_past 0
radardist_km 0
Ref 0
Ref_5x5_10th 0
Ref_5x5_50th 0
Ref_5x5_90th 0
RefComposite 0
RefComposite_5x5_10th 0
RefComposite_5x5_50th 0
RefComposite_5x5_90th 0
RhoHV 0
RhoHV_5x5_10th 0
RhoHV_5x5_50th 0
RhoHV_5x5_90th 0
Zdr 0
Zdr_5x5_10th 0
Zdr_5x5_50th 0
Zdr_5x5_90th 0
Kdp 0
Kdp_5x5_10th 0
Kdp_5x5_50th 0
Kdp_5x5_90th 0
Expected 0
dtype: int64
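A small sketch on made-up values may help show what process_data does to missing readings: the per-Id mean ignores NaN entries, so only an Id with no valid value at all in a column ends up as 0. The `Kdp_missing` indicator column is a hypothetical addition, not part of the original pipeline; it is one way to keep a filled-in 0 distinguishable from a measured 0.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Id":  [1, 1, 2, 2],
    "Ref": [10.0, np.nan, 20.0, 30.0],
    "Kdp": [np.nan, np.nan, 0.5, 1.5],
})

means = toy.groupby("Id").mean()                         # NaN entries are skipped per column
means["Kdp_missing"] = means["Kdp"].isna().astype(int)   # hypothetical flag, set before filling
filled = means.fillna(0).reset_index()
print(filled)
```

For Id 1 the Ref mean comes from the single valid reading, while Kdp has no valid reading and is filled with 0.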
pd.set_option('display.float_format', lambda x: '%.3f' % x)
train_data_mean[["minutes_past", "radardist_km", "Expected"]].describe()
| | minutes_past | radardist_km | Expected |
---|---|---|---|
count | 731556.000 | 731556.000 | 731556.000 |
mean | 29.543 | 9.747 | 23.995 |
std | 1.459 | 4.059 | 241.450 |
min | 2.000 | 0.000 | 0.010 |
25% | 28.727 | 7.000 | 0.254 |
50% | 29.538 | 10.000 | 1.270 |
75% | 30.357 | 13.000 | 3.556 |
max | 58.000 | 21.000 | 33017.730 |
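EDA: exploratory data analysis
Computing the correlation matrix
Each element of the correlation matrix is the correlation coefficient between two variables, a statistic that measures the strength and direction of their linear relationship. A strong positive correlation gives a coefficient close to 1, a strong negative correlation gives a coefficient close to -1, and the absence of a linear relationship gives a coefficient close to 0.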
corr_mat = train_data_mean.corr()
corr_mat.style.background_gradient(cmap='coolwarm')
| | Id | minutes_past | radardist_km | Ref | Ref_5x5_10th | Ref_5x5_50th | Ref_5x5_90th | RefComposite | RefComposite_5x5_10th | RefComposite_5x5_50th | RefComposite_5x5_90th | RhoHV | RhoHV_5x5_10th | RhoHV_5x5_50th | RhoHV_5x5_90th | Zdr | Zdr_5x5_10th | Zdr_5x5_50th | Zdr_5x5_90th | Kdp | Kdp_5x5_10th | Kdp_5x5_50th | Kdp_5x5_90th | Expected |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Id | 1.000000 | 0.005332 | 0.003219 | -0.015406 | -0.014896 | -0.016647 | -0.016421 | -0.016830 | -0.017785 | -0.018159 | -0.017701 | 0.002469 | -0.001641 | 0.000941 | 0.004109 | 0.001950 | -0.012205 | -0.009357 | 0.008656 | 0.002813 | -0.006703 | -0.007646 | 0.010952 | -0.000252 |
minutes_past | 0.005332 | 1.000000 | -0.013712 | 0.008189 | 0.010836 | 0.010063 | 0.008560 | 0.009016 | 0.012073 | 0.010327 | 0.008399 | 0.016578 | 0.019398 | 0.017417 | 0.012625 | 0.004729 | -0.002905 | 0.008392 | 0.009305 | -0.001297 | -0.010068 | -0.002192 | 0.006347 | -0.001091 |
radardist_km | 0.003219 | -0.013712 | 1.000000 | 0.069752 | -0.067366 | 0.006099 | 0.019076 | -0.046264 | -0.170242 | -0.096009 | -0.067483 | -0.387421 | -0.359585 | -0.387563 | -0.421277 | -0.272579 | 0.077392 | -0.342390 | -0.429352 | -0.006655 | 0.052841 | -0.051527 | -0.025117 | 0.030389 |
Ref | -0.015406 | 0.008189 | 0.069752 | 1.000000 | 0.854176 | 0.948452 | 0.941383 | 0.958385 | 0.837232 | 0.917736 | 0.907470 | 0.099031 | 0.202365 | 0.129823 | 0.043799 | -0.084384 | 0.109331 | 0.055019 | -0.163710 | -0.004208 | -0.171240 | 0.063122 | 0.146192 | -0.042359 |
Ref_5x5_10th | -0.014896 | 0.010836 | -0.067366 | 0.854176 | 1.000000 | 0.894078 | 0.832242 | 0.843077 | 0.956815 | 0.873006 | 0.816277 | 0.239737 | 0.339702 | 0.270827 | 0.182392 | -0.032164 | 0.035559 | 0.099661 | -0.049350 | -0.002603 | -0.235687 | 0.061367 | 0.195009 | -0.050606 |
Ref_5x5_50th | -0.016647 | 0.010063 | 0.006099 | 0.948452 | 0.894078 | 1.000000 | 0.929433 | 0.929143 | 0.885665 | 0.966313 | 0.902944 | 0.162660 | 0.264620 | 0.195689 | 0.104907 | -0.063740 | 0.074288 | 0.078924 | -0.115541 | -0.004518 | -0.202216 | 0.058820 | 0.169582 | -0.052462 |
Ref_5x5_90th | -0.016421 | 0.008560 | 0.019076 | 0.941383 | 0.832242 | 0.929433 | 1.000000 | 0.940888 | 0.834495 | 0.927114 | 0.970734 | 0.115837 | 0.218137 | 0.149878 | 0.063288 | -0.067665 | 0.074507 | 0.084126 | -0.121733 | -0.005766 | -0.181418 | 0.061057 | 0.148218 | -0.047604 |
RefComposite | -0.016830 | 0.009016 | -0.046264 | 0.958385 | 0.843077 | 0.929143 | 0.940888 | 1.000000 | 0.872141 | 0.955646 | 0.948112 | 0.141729 | 0.243947 | 0.174216 | 0.088659 | -0.052407 | 0.092913 | 0.102329 | -0.109243 | -0.008531 | -0.179003 | 0.063780 | 0.141982 | -0.048284 |
RefComposite_5x5_10th | -0.017785 | 0.012073 | -0.170242 | 0.837232 | 0.956815 | 0.885665 | 0.834495 | 0.872141 | 1.000000 | 0.907150 | 0.846059 | 0.266892 | 0.368391 | 0.300522 | 0.210234 | -0.007169 | 0.025539 | 0.143557 | -0.007942 | -0.008840 | -0.239254 | 0.060160 | 0.185216 | -0.054932 |
RefComposite_5x5_50th | -0.018159 | 0.010327 | -0.096009 | 0.917736 | 0.873006 | 0.966313 | 0.927114 | 0.955646 | 0.907150 | 1.000000 | 0.935504 | 0.193055 | 0.294020 | 0.227759 | 0.137763 | -0.035913 | 0.059937 | 0.119484 | -0.067748 | -0.008984 | -0.204528 | 0.058587 | 0.159862 | -0.056364 |
RefComposite_5x5_90th | -0.017701 | 0.008399 | -0.067483 | 0.907470 | 0.816277 | 0.902944 | 0.970734 | 0.948112 | 0.846059 | 0.935504 | 1.000000 | 0.149546 | 0.249137 | 0.183957 | 0.099236 | -0.042242 | 0.060747 | 0.115023 | -0.078103 | -0.009330 | -0.184827 | 0.059369 | 0.142435 | -0.051278 |
RhoHV | 0.002469 | 0.016578 | -0.387421 | 0.099031 | 0.239737 | 0.162660 | 0.115837 | 0.141729 | 0.266892 | 0.193055 | 0.149546 | 1.000000 | 0.789249 | 0.933730 | 0.854633 | 0.231941 | -0.367277 | 0.119781 | 0.473685 | 0.011824 | -0.493655 | -0.184586 | 0.481759 | -0.042432 |
RhoHV_5x5_10th | -0.001641 | 0.019398 | -0.359585 | 0.202365 | 0.339702 | 0.264620 | 0.218137 | 0.243947 | 0.368391 | 0.294020 | 0.249137 | 0.789249 | 1.000000 | 0.841223 | 0.671128 | 0.063104 | -0.414427 | 0.227828 | 0.283982 | 0.005363 | -0.608683 | 0.002615 | 0.434057 | -0.050330 |
RhoHV_5x5_50th | 0.000941 | 0.017417 | -0.387563 | 0.129823 | 0.270827 | 0.195689 | 0.149878 | 0.174216 | 0.300522 | 0.227759 | 0.183957 | 0.933730 | 0.841223 | 1.000000 | 0.816876 | 0.156947 | -0.392284 | 0.138227 | 0.439328 | 0.009700 | -0.529381 | -0.191437 | 0.504209 | -0.048483 |
RhoHV_5x5_90th | 0.004109 | 0.012625 | -0.421277 | 0.043799 | 0.182392 | 0.104907 | 0.063288 | 0.088659 | 0.210234 | 0.137763 | 0.099236 | 0.854633 | 0.671128 | 0.816876 | 1.000000 | 0.218192 | -0.340884 | 0.111837 | 0.554252 | 0.007968 | -0.427723 | -0.171319 | 0.440342 | -0.033870 |
Zdr | 0.001950 | 0.004729 | -0.272579 | -0.084384 | -0.032164 | -0.063740 | -0.067665 | -0.052407 | -0.007169 | -0.035913 | -0.042242 | 0.231941 | 0.063104 | 0.156947 | 0.218192 | 1.000000 | 0.137707 | 0.498039 | 0.561398 | -0.008279 | -0.025564 | -0.064590 | 0.028677 | 0.009787 |
Zdr_5x5_10th | -0.012205 | -0.002905 | 0.077392 | 0.109331 | 0.035559 | 0.074288 | 0.074507 | 0.092913 | 0.025539 | 0.059937 | 0.060747 | -0.367277 | -0.414427 | -0.392284 | -0.340884 | 0.137707 | 1.000000 | 0.221498 | -0.261702 | 0.009152 | 0.473039 | 0.150087 | -0.349060 | 0.026218 |
Zdr_5x5_50th | -0.009357 | 0.008392 | -0.342390 | 0.055019 | 0.099661 | 0.078924 | 0.084126 | 0.102329 | 0.143557 | 0.119484 | 0.115023 | 0.119781 | 0.227828 | 0.138227 | 0.111837 | 0.498039 | 0.221498 | 1.000000 | 0.437039 | -0.014415 | -0.098027 | 0.115654 | 0.002163 | -0.000544 |
Zdr_5x5_90th | 0.008656 | 0.009305 | -0.429352 | -0.163710 | -0.049350 | -0.115541 | -0.121733 | -0.109243 | -0.007942 | -0.067748 | -0.078103 | 0.473685 | 0.283982 | 0.439328 | 0.554252 | 0.561398 | -0.261702 | 0.437039 | 1.000000 | -0.010402 | -0.276761 | -0.182030 | 0.257394 | -0.006270 |
Kdp | 0.002813 | -0.001297 | -0.006655 | -0.004208 | -0.002603 | -0.004518 | -0.005766 | -0.008531 | -0.008840 | -0.008984 | -0.009330 | 0.011824 | 0.005363 | 0.009700 | 0.007968 | -0.008279 | 0.009152 | -0.014415 | -0.010402 | 1.000000 | 0.045511 | 0.171328 | 0.172808 | -0.003782 |
Kdp_5x5_10th | -0.006703 | -0.010068 | 0.052841 | -0.171240 | -0.235687 | -0.202216 | -0.181418 | -0.179003 | -0.239254 | -0.204528 | -0.184827 | -0.493655 | -0.608683 | -0.529381 | -0.427723 | -0.025564 | 0.473039 | -0.098027 | -0.276761 | 0.045511 | 1.000000 | 0.119005 | -0.497302 | 0.029754 |
Kdp_5x5_50th | -0.007646 | -0.002192 | -0.051527 | 0.063122 | 0.061367 | 0.058820 | 0.061057 | 0.063780 | 0.060160 | 0.058587 | 0.059369 | -0.184586 | 0.002615 | -0.191437 | -0.171319 | -0.064590 | 0.150087 | 0.115654 | -0.182030 | 0.171328 | 0.119005 | 1.000000 | -0.114469 | 0.002442 |
Kdp_5x5_90th | 0.010952 | 0.006347 | -0.025117 | 0.146192 | 0.195009 | 0.169582 | 0.148218 | 0.141982 | 0.185216 | 0.159862 | 0.142435 | 0.481759 | 0.434057 | 0.504209 | 0.440342 | 0.028677 | -0.349060 | 0.002163 | 0.257394 | 0.172808 | -0.497302 | -0.114469 | 1.000000 | -0.028471 |
Expected | -0.000252 | -0.001091 | 0.030389 | -0.042359 | -0.050606 | -0.052462 | -0.047604 | -0.048284 | -0.054932 | -0.056364 | -0.051278 | -0.042432 | -0.050330 | -0.048483 | -0.033870 | 0.009787 | 0.026218 | -0.000544 | -0.006270 | -0.003782 | 0.029754 | 0.002442 | -0.028471 | 1.000000 |
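For reference, the Pearson coefficient that DataFrame.corr computes by default can be reproduced by hand. The sketch below uses made-up columns x, y and z (assumptions for illustration only) and compares a manual calculation with the pandas result.

```python
import numpy as np
import pandas as pd

def pearson(a, b):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    da, db = a - a.mean(), b - b.mean()
    return float((da * db).sum() / np.sqrt((da ** 2).sum() * (db ** 2).sum()))

toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.1, 3.9, 6.2, 8.0, 9.8],   # rises with x
    "z": [5.0, 4.0, 3.0, 2.0, 1.0],   # falls exactly linearly with x
})

corr_mat = toy.corr()  # Pearson by default
print(corr_mat)
print("manual r(x, y):", pearson(toy["x"], toy["y"]))
```

In the matrix above, no feature has an absolute correlation with Expected larger than about 0.06, so a purely linear model can be expected to explain little of its variance.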
Handling outliers (setting a threshold)
# Take a look at the distribution of the unique Expected values
plt.figure(figsize=(8,6))
plt.hist(train_data_mean["Expected"].unique())
plt.show()
plt.figure(figsize=(8, 6))
plt.scatter(np.arange(len(train_data["Expected"].unique())),
train_data["Expected"].unique())
<matplotlib.collections.PathCollection at 0x7fae9dd6e210>
Most Expected values are below 5000, but a few are far larger and need to be handled.
train_data_mean.Expected.mean()
23.994922498376706
print(stats.percentileofscore(train_data_mean["Expected"], 108))
98.2024616023927
train_data_mean.drop(train_data_mean[train_data_mean["Expected"] >= 108].index, inplace=True)
train_data_mean
| | Id | minutes_past | radardist_km | Ref | Ref_5x5_10th | Ref_5x5_50th | Ref_5x5_90th | RefComposite | RefComposite_5x5_10th | RefComposite_5x5_50th | ... | RhoHV_5x5_90th | Zdr | Zdr_5x5_10th | Zdr_5x5_50th | Zdr_5x5_90th | Kdp | Kdp_5x5_10th | Kdp_5x5_50th | Kdp_5x5_90th | Expected |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2 | 29.083 | 2.000 | 16.625 | 13.667 | 17.375 | 21.333 | 22.667 | 20.375 | 22.917 | ... | 1.000 | 0.380 | 0.120 | 0.417 | 0.781 | -0.288 | -1.449 | -0.319 | 1.117 | 1.016 |
1 | 4 | 28.154 | 9.000 | 26.600 | 20.071 | 25.800 | 30.269 | 26.667 | 21.091 | 25.115 | ... | 1.016 | -1.125 | 0.000 | 0.500 | 1.516 | 7.030 | 0.000 | 0.000 | 6.330 | 4.064 |
2 | 7 | 30.933 | 13.000 | 14.750 | 10.500 | 12.500 | 14.429 | 14.750 | 11.500 | 14.875 | ... | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.508 |
3 | 8 | 31.083 | 8.000 | 19.600 | 18.083 | 19.200 | 24.864 | 24.591 | 22.250 | 23.727 | ... | 1.015 | 1.156 | -0.594 | 0.531 | 1.027 | 0.002 | -3.963 | -0.218 | 3.735 | 3.225 |
4 | 10 | 27.333 | 10.000 | 33.958 | 30.292 | 33.625 | 37.125 | 34.792 | 32.125 | 34.792 | ... | 0.988 | 0.255 | -1.031 | 0.526 | 2.839 | -1.404 | -5.938 | 0.177 | 7.000 | 0.010 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
731551 | 1180935 | 30.571 | 14.000 | 17.000 | 13.500 | 16.250 | 19.556 | 18.250 | 16.000 | 17.375 | ... | 1.017 | -0.545 | -0.875 | -1.134 | 1.174 | -0.660 | -4.560 | -1.911 | 5.551 | 0.508 |
731552 | 1180938 | 28.364 | 10.000 | 19.000 | 0.000 | 19.000 | 19.167 | 23.000 | 14.000 | 23.000 | ... | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.508 |
731553 | 1180942 | 28.462 | 9.000 | 15.667 | 13.000 | 15.889 | 19.950 | 17.333 | 14.000 | 15.900 | ... | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.524 |
731554 | 1180944 | 30.765 | 11.000 | 26.667 | 24.375 | 29.083 | 33.906 | 29.167 | 26.000 | 28.071 | ... | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 3.556 |
731555 | 1180945 | 28.385 | 9.000 | 28.167 | 18.143 | 27.000 | 36.500 | 29.667 | 17.944 | 28.958 | ... | 1.052 | 0.875 | -1.125 | 0.562 | 3.021 | 9.725 | 0.000 | -8.030 | 9.725 | 8.636 |
717105 rows × 24 columns
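The thresholding step can be sketched on a small made-up sample: percentileofscore reports what percentage of values lie at or below a candidate cut-off, and a boolean mask then keeps only the rows under it. The data and the cut-off of 100 below are hypothetical; kind="weak" is chosen so that the result is simply the share of values less than or equal to the score.

```python
import pandas as pd
from scipy import stats

toy = pd.DataFrame({
    "Id": range(1, 11),
    "Expected": [0.254, 0.508, 1.016, 1.27, 2.54, 3.556, 8.636, 25.4, 450.0, 33017.73],
})

threshold = 100.0  # hypothetical cut-off in mm
pct = stats.percentileofscore(toy["Expected"], threshold, kind="weak")
print(f"{pct:.1f}% of gauge-hours are at or below {threshold} mm")

# Non-mutating alternative to drop(..., inplace=True): keep rows below the cut-off.
trimmed = toy[toy["Expected"] < threshold].reset_index(drop=True)
print(len(toy), "->", len(trimmed), "rows")
```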
Visualizing the unique values
plt.figure(figsize=(8, 6))
plt.scatter(np.arange(len(train_data_mean["Expected"].unique())), train_data_mean["Expected"].unique())
<matplotlib.collections.PathCollection at 0x7faedce2ef10>
plt.figure(figsize=(8, 6))
train_data_mean["radardist_km"].hist()
plt.title("radardist_km")
Text(0.5, 1.0, 'radardist_km')
plt.figure(figsize=(8, 6))
train_data_mean["Ref"].hist()
plt.title("Ref")
Text(0.5, 1.0, 'Ref')
plt.figure(figsize=(8, 6))
train_data_mean["RefComposite"].hist()
plt.title("RefComposite")
Text(0.5, 1.0, 'RefComposite')
plt.figure(figsize=(8, 6))
train_data_mean["RhoHV"].hist()
plt.title("RhoHV")
Text(0.5, 1.0, 'RhoHV')
plt.figure(figsize=(8, 6))
train_data_mean["Zdr"].hist()
plt.title("Zdr")
Text(0.5, 1.0, 'Zdr')
plt.figure(figsize=(8, 6))
train_data_mean["Kdp"].hist()
plt.title("Kdp")
Text(0.5, 1.0, 'Kdp')
plt.figure(figsize=(8, 6))
plt.hist(train_data_mean["Expected"].unique())
plt.title("Unique Expected")
Text(0.5, 1.0, 'Unique Expected')
Training on 1 million samples
train_data_sample = train_data_mean.sample(n=1000000,replace=True) # draw 1,000,000 rows with replacement
features = [f for f in train_data_sample.columns]
features.remove("Expected") # remove the label
features.remove("Id")
X = train_data_sample[features] # features
Y = train_data_sample["Expected"] # label
# Feature scaling
scaler = StandardScaler()
# Split into training and validation sets
x_train,x_test,y_train,y_test = train_test_split(scaler.fit_transform(X),Y,test_size=0.25,random_state=0)
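One caveat with the cell above: train_data_mean has 717,105 rows after trimming, so drawing 1,000,000 rows with replace=True necessarily duplicates rows, and a random split afterwards can place copies of the same row in both the training and the validation set. The scaler is also fitted before the split. A possible leak-free ordering, sketched here on synthetic data with hypothetical names (`split_then_resample`, `n_boot`, `demo`), splits first, resamples only the training part, and fits the scaler on the training part alone. This is a sketch, not the procedure that produced the scores reported in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def split_then_resample(df, n_boot, seed=0):
    """Hold out validation rows first, then bootstrap only the training rows."""
    train_df, valid_df = train_test_split(df, test_size=0.25, random_state=seed)
    train_df = train_df.sample(n=n_boot, replace=True, random_state=seed)
    return train_df, valid_df

# Synthetic stand-in for train_data_mean.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Id": np.arange(200),
    "Ref": rng.normal(20, 8, 200),
    "Kdp": rng.normal(0, 2, 200),
    "Expected": rng.gamma(1.0, 2.0, 200),
})
features = ["Ref", "Kdp"]

train_df, valid_df = split_then_resample(demo, n_boot=500)
scaler = StandardScaler().fit(train_df[features])   # statistics from training rows only
x_train = scaler.transform(train_df[features])
x_valid = scaler.transform(valid_df[features])
y_train, y_valid = train_df["Expected"], valid_df["Expected"]
print(x_train.shape, x_valid.shape)
```

With this ordering no validation Id can also appear among the training rows, so a model that memorizes its training data gains no artificial advantage on the validation score.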
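Modelling, training and evaluation
Evaluation metrics:
MSE and MAE are both metrics for evaluating regression models: the mean squared error and the mean absolute error, respectively.
By comparison, MSE is more sensitive to samples with large prediction errors, because squaring magnifies them, whereas MAE weights every error in proportion to its absolute size, so a few large errors dominate it less.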
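The difference is easy to see on a made-up example. The sketch below computes both metrics by hand and with scikit-learn for two hypothetical sets of predictions that have the same MAE but different MSE, because one of them concentrates its error in a single sample.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([1.0, 2.0, 3.0, 4.0])
pred_even = np.array([2.0, 3.0, 4.0, 5.0])    # every prediction off by 1
pred_spike = np.array([1.0, 2.0, 3.0, 8.0])   # one prediction off by 4

def mse(y, p):
    """Mean of the squared errors."""
    return float(np.mean((y - p) ** 2))

def mae(y, p):
    """Mean of the absolute errors."""
    return float(np.mean(np.abs(y - p)))

for name, pred in [("even", pred_even), ("spike", pred_spike)]:
    print(name, "MAE:", mae(y_true, pred), "MSE:", mse(y_true, pred))
```

Note that the Score printed for each model below comes from the estimator's score() method, which for scikit-learn regressors is the coefficient of determination R², not an accuracy.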
LR = LinearRegression()
LR.fit(x_train,y_train)
LinearRegression()
print('Linear Regression Score:', LR.score(x_test,y_test))  # score() returns R^2
print('Linear Regression MSE: ',mean_squared_error(y_test,LR.predict(x_test)))
print('Linear Regression MAE: ',mean_absolute_error(y_test,LR.predict(x_test)))
Linear Regression Score: 0.05009157464667113
Linear Regression MSE: 62.94448284458787
Linear Regression MAE: 3.567090649395791
DTR = DecisionTreeRegressor()
DTR.fit(x_train,y_train)
DecisionTreeRegressor()
print('Decision Tree Score:', DTR.score(x_test,y_test))  # score() returns R^2
print('Decision Tree MSE: ',mean_squared_error(y_test,DTR.predict(x_test)))
print('Decision Tree MAE: ',mean_absolute_error(y_test,DTR.predict(x_test)))
Decision Tree Score: 0.568334912813709
Decision Tree MSE: 28.60374216061774
Decision Tree MAE: 1.1915324397365497
XGB = XGBRegressor()
XGB.fit(x_train,y_train)
XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,
colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
early_stopping_rounds=None, enable_categorical=False,
eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
importance_type=None, interaction_constraints='',
learning_rate=0.300000012, max_bin=256, max_cat_to_onehot=4,
max_delta_step=0, max_depth=6, max_leaves=0, min_child_weight=1,
missing=nan, monotone_constraints='()', n_estimators=100, n_jobs=0,
num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,
reg_lambda=1, ...)
print('XGBoost Regressor Score:', XGB.score(x_test,y_test))  # score() returns R^2
print('XGBoost Regressor MSE: ',mean_squared_error(y_test,XGB.predict(x_test)))
print('XGBoost Regressor MAE: ',mean_absolute_error(y_test,XGB.predict(x_test)))
XGBoost Regressor Score: 0.18670791787153074
XGBoost Regressor MSE: 53.89177329607648
XGBoost Regressor MAE: 3.173719464009896
Loading the test set
test = pd.read_csv("../input/how-much-did-it-rain-ii/test.zip")
test_data = process_data(test)
test_data
| | Id | minutes_past | radardist_km | Ref | Ref_5x5_10th | Ref_5x5_50th | Ref_5x5_90th | RefComposite | RefComposite_5x5_10th | RefComposite_5x5_50th | ... | RhoHV_5x5_50th | RhoHV_5x5_90th | Zdr | Zdr_5x5_10th | Zdr_5x5_50th | Zdr_5x5_90th | Kdp | Kdp_5x5_10th | Kdp_5x5_50th | Kdp_5x5_90th |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 29.824 | 8.000 | 10.500 | 0.000 | 8.278 | 13.462 | 11.375 | 0.000 | 8.667 | ... | 0.990 | 1.044 | -0.547 | -1.750 | 0.062 | 2.598 | -1.523 | 0.000 | -1.290 | 2.602 |
1 | 2 | 28.938 | 15.000 | 0.000 | 0.000 | 0.000 | 13.000 | 0.000 | 0.000 | 0.000 | ... | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
2 | 3 | 28.700 | 11.000 | 23.722 | 17.125 | 24.111 | 34.700 | 25.056 | 16.400 | 25.222 | ... | 0.955 | 0.981 | 0.419 | -0.562 | 0.456 | 1.825 | 0.220 | -4.359 | 0.170 | 5.383 |
3 | 4 | 28.727 | 9.000 | 30.812 | 28.643 | 29.812 | 35.625 | 32.000 | 28.944 | 32.667 | ... | 0.971 | 0.982 | 0.100 | -0.537 | 0.177 | 1.241 | 0.912 | -2.890 | -0.140 | 5.563 |
4 | 5 | 28.333 | 17.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | ... | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
717620 | 717621 | 28.400 | 6.000 | 14.500 | 9.000 | 12.750 | 16.500 | 12.800 | 7.333 | 8.714 | ... | 0.000 | 1.023 | 0.000 | 0.000 | 0.000 | 0.750 | 0.000 | 0.000 | 0.000 | -0.935 |
717621 | 717622 | 29.286 | 15.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | ... | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
717622 | 717623 | 30.308 | 9.000 | 21.833 | 19.056 | 22.917 | 31.192 | 26.000 | 22.318 | 24.577 | ... | 0.968 | 0.994 | 0.168 | -1.188 | 0.168 | 1.750 | -0.083 | -4.717 | -0.188 | 4.842 |
717623 | 717624 | 32.000 | 11.000 | 38.000 | 0.000 | 37.000 | 42.833 | 38.000 | 0.000 | 37.500 | ... | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
717624 | 717625 | 28.000 | 15.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | ... | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
717625 rows × 23 columns
Prediction results
Linear_Regression_Prediction = LR.predict(scaler.transform(test_data[features]))  # reuse the scaler fitted on the training features instead of refitting on the test set
Linear_Regression_Result = pd.DataFrame(columns=["Id","Expected"])
Linear_Regression_Result["Id"] = test_data["Id"]
Linear_Regression_Result["Expected"] = Linear_Regression_Prediction
Linear_Regression_Result.head()
| | Id | Expected |
---|---|---|
0 | 1 | 3.447 |
1 | 2 | 3.566 |
2 | 3 | 5.709 |
3 | 4 | 5.716 |
4 | 5 | 2.295 |
Decision_Tree_Prediction = DTR.predict(scaler.transform(test_data[features]))  # reuse the scaler fitted on the training features instead of refitting on the test set
Decision_Tree_Result = pd.DataFrame(columns=["Id","Expected"])
Decision_Tree_Result["Id"] = test_data["Id"]
Decision_Tree_Result["Expected"] = Decision_Tree_Prediction
Decision_Tree_Result.head()
| | Id | Expected |
---|---|---|
0 | 1 | 2.794 |
1 | 2 | 1.524 |
2 | 3 | 0.508 |
3 | 4 | 3.950 |
4 | 5 | 0.254 |
XGBoost_Regressor_Prediction = XGB.predict(scaler.transform(test_data[features]))  # reuse the scaler fitted on the training features instead of refitting on the test set
XGBoost_Regressor_Result = pd.DataFrame(columns=["Id","Expected"])
XGBoost_Regressor_Result["Id"] = test_data["Id"]
XGBoost_Regressor_Result["Expected"] = XGBoost_Regressor_Prediction
XGBoost_Regressor_Result.head()
| | Id | Expected |
---|---|---|
0 | 1 | 0.555 |
1 | 2 | 2.201 |
2 | 3 | 5.386 |
3 | 4 | 11.370 |
4 | 5 | 2.868 |
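The three prediction cells above follow the same pattern, so they could be folded into one helper. The function below is a sketch with hypothetical names (`predict_frame`, `clip_negative` and the toy data are not from the original notebook); it reuses an already fitted scaler via transform and optionally clips negative predictions to 0, since a rainfall total cannot be negative.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

def predict_frame(model, scaler, data, features, clip_negative=True):
    """Return an Id/Expected frame of predictions for `data`."""
    pred = model.predict(scaler.transform(data[features]))
    if clip_negative:
        pred = np.clip(pred, 0.0, None)   # rainfall totals cannot be below 0
    return pd.DataFrame({"Id": data["Id"].to_numpy(), "Expected": pred})

# Toy stand-ins for the aggregated training and test tables.
rng = np.random.default_rng(0)
features = ["Ref", "Kdp"]
train = pd.DataFrame({
    "Id": np.arange(100),
    "Ref": rng.normal(20, 8, 100),
    "Kdp": rng.normal(0, 2, 100),
})
train["Expected"] = np.clip(0.2 * train["Ref"] + rng.normal(0, 1, 100), 0, None)
test = pd.DataFrame({
    "Id": np.arange(1, 6),
    "Ref": rng.normal(20, 8, 5),
    "Kdp": rng.normal(0, 2, 5),
})

scaler = StandardScaler().fit(train[features])
model = LinearRegression().fit(scaler.transform(train[features]), train["Expected"])
result = predict_frame(model, scaler, test, features)
print(result.head())
```

The same call would work for any fitted regressor with a predict method, for example predict_frame(DTR, scaler, test_data, features).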