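Background
- The data are radar maps observed by the Shenzhen Meteorological Bureau. Each radar map covers a target site and its surrounding area as an m×m grid, and each grid point records the radar reflectivity factor Z. Z can range from very small to very large values, so for convenience it is measured on the logarithmic dBZ scale.
- Short-term precipitation forecasting involves analyzing the following information:
  1. The relationship between the current precipitation and the radar reflectivity.
  2. The radar map contains the reflectivity of the target site and its surroundings, so the precipitation relationship between the target site and the surrounding area has to be taken into account.
- We need to predict the hourly rainfall total from the radar values.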
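For reference, dBZ is conventionally defined as dBZ = 10 · log10(Z / Z0) with Z0 = 1 mm^6/m^3. The link between reflectivity and rain rate is often approximated by an empirical power law Z = a · R^b. The sketch below uses the Marshall-Palmer coefficients (a = 200, b = 1.6) purely as an illustrative assumption; this notebook does not fit or use that relation, and the function names are hypothetical.

```python
import numpy as np

def z_to_dbz(z):
    """Convert the linear reflectivity factor Z (mm^6/m^3) to dBZ."""
    return 10.0 * np.log10(np.asarray(z, dtype=float))

def dbz_to_rain_rate(dbz, a=200.0, b=1.6):
    """Estimate rain rate R (mm/h) from dBZ by inverting Z = a * R**b.

    a=200 and b=1.6 are the Marshall-Palmer coefficients, assumed here
    for illustration only.
    """
    z = 10.0 ** (np.asarray(dbz, dtype=float) / 10.0)  # back to linear Z
    return (z / a) ** (1.0 / b)

dbz_values = z_to_dbz([1.0, 1000.0])
rain_rates = dbz_to_rain_rate([20.0, 40.0])
print(dbz_values)
print(rain_rates)
```

Such a fixed Z-R relation gives only a rough physical baseline; the coefficients vary with the type of precipitation.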
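Data description
The training set contains data from the Shenzhen weather radar network and rainfall data for April to August 2017. Time and location information has been anonymized.
File descriptions
- train.zip: the training set. It contains radar observations at gauges in the US Midwest for more than 20 days of each month during the corn growing season.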
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
/kaggle/input/how-much-did-it-rain-ii/train.zip
/kaggle/input/how-much-did-it-rain-ii/test.zip
/kaggle/input/how-much-did-it-rain-ii/sample_dask.py
/kaggle/input/how-much-did-it-rain-ii/sample_solution.csv.zip
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error
from scipy.stats import percentileofscore
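Data preprocessing
There are so many rows because each hour contains several radar observations but only one rain gauge observation (`Expected`). This is why multiple rows share the same `Id`.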
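The one-gauge-reading, many-radar-scans structure can be checked with a small made-up example. The values below are invented for illustration; only the column names come from the dataset.

```python
import pandas as pd

# Toy data (made up): two gauge-hours, each with several radar scans.
toy = pd.DataFrame({
    "Id": [1, 1, 1, 2, 2],
    "minutes_past": [3, 16, 25, 1, 6],
    "Expected": [0.254, 0.254, 0.254, 1.016, 1.016],
})

# Number of radar observations recorded for each Id.
scans_per_id = toy.groupby("Id").size()

# Expected is repeated on every row of an Id, so each group
# should contain exactly one distinct value.
distinct_expected = toy.groupby("Id")["Expected"].nunique()

print(scans_per_id)
print(distinct_expected)
```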
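The features have the following meanings:
- `Id`: a unique number for all observations of one rain gauge within one hour. Each hour of observations in the dataset has its own `Id`.
- `minutes_past`: for each set of radar observations, the number of minutes past the top of the hour at which the observation was made. A radar observation is a snapshot of the conditions overhead at that moment.
- `radardist_km`: the horizontal distance (km) between the rain gauge and the radar whose observations are reported.
- `Ref`: radar reflectivity, a measure of the echo power returned to the radar relative to the transmitted power, usually given in dBZ.
- `Ref_5x5_10th`: the 10th percentile of the reflectivity values in a 5x5 neighborhood centered on the gauge.
- `Ref_5x5_50th`: the 50th percentile of the reflectivity values in a 5x5 neighborhood centered on the gauge.
- `Ref_5x5_90th`: the 90th percentile of the reflectivity values in a 5x5 neighborhood centered on the gauge.
- `RefComposite`: the maximum reflectivity in the vertical column above the gauge (dBZ).
- `RefComposite_5x5_10th`
- `RefComposite_5x5_50th`
- `RefComposite_5x5_90th`
- `RhoHV`: the correlation coefficient between the radar's measured signals (unitless).
- `RhoHV_5x5_10th`
- `RhoHV_5x5_50th`
- `RhoHV_5x5_90th`
- `Zdr`: the difference between the reflectivity factors of the horizontally and vertically polarized waves (dB).
- `Zdr_5x5_10th`
- `Zdr_5x5_50th`
- `Zdr_5x5_90th`
- `Kdp`: the phase difference of the precipitation particles per unit distance (deg/km).
- `Kdp_5x5_10th`
- `Kdp_5x5_50th`
- `Kdp_5x5_90th`
- `Expected`: the actual gauge observation at the end of the hour (mm).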
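To make the `_5x5_10th`, `_5x5_50th` and `_5x5_90th` columns concrete, here is a minimal sketch of neighborhood percentiles on a made-up 5x5 grid. How the dataset's authors actually computed them (interpolation method, treatment of missing cells) is not documented here, so the use of `np.nanpercentile` with its default linear interpolation is an assumption.

```python
import numpy as np

# Hypothetical 5x5 grid of reflectivity values (dBZ) centered on a gauge.
grid = np.arange(25, dtype=float).reshape(5, 5)

# NaN-aware percentiles over all 25 cells (default linear interpolation).
p10, p50, p90 = np.nanpercentile(grid, [10, 50, 90])

print(p10, p50, p90)
```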
Load the training set
train_data = pd.read_csv("../input/how-much-did-it-rain-ii/train.zip")
train_data
| | Id | minutes_past | radardist_km | Ref | Ref_5x5_10th | Ref_5x5_50th | Ref_5x5_90th | RefComposite | RefComposite_5x5_10th | RefComposite_5x5_50th | ... | RhoHV_5x5_90th | Zdr | Zdr_5x5_10th | Zdr_5x5_50th | Zdr_5x5_90th | Kdp | Kdp_5x5_10th | Kdp_5x5_50th | Kdp_5x5_90th | Expected |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 3 | 10.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.254000 |
1 | 1 | 16 | 10.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.254000 |
2 | 1 | 25 | 10.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.254000 |
3 | 1 | 35 | 10.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.254000 |
4 | 1 | 45 | 10.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.254000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
13765196 | 1180945 | 38 | 9.0 | 33.0 | 19.5 | 25.5 | 36.5 | 33.0 | 20.5 | 28.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 8.636004 |
13765197 | 1180945 | 42 | 9.0 | 33.0 | 21.0 | 30.5 | 37.0 | 36.5 | 22.0 | 33.5 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 8.636004 |
13765198 | 1180945 | 47 | 9.0 | 29.5 | 10.0 | 26.0 | 30.5 | 31.0 | 16.5 | 26.0 | ... | 1.051667 | 1.75 | NaN | 0.750 | 3.0000 | 13.379990 | NaN | NaN | 13.379990 | 8.636004 |
13765199 | 1180945 | 52 | 9.0 | 19.0 | NaN | 15.5 | 26.5 | 19.0 | NaN | 16.5 | ... | 1.051667 | NaN | NaN | NaN | 2.8125 | NaN | NaN | NaN | NaN | 8.636004 |
13765200 | 1180945 | 57 | 9.0 | 7.5 | NaN | 10.0 | 13.0 | 14.5 | 10.0 | 12.5 | ... | 1.051667 | 0.00 | -1.125 | 0.375 | 3.2500 | 6.069992 | NaN | -8.029999 | 6.069992 | 8.636004 |
13765201 rows × 24 columns
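Handling invalid data
If every `Ref` value of an `Id` is NaN (the radar returned no reflectivity for that whole hour), all observations of that `Id` are removed from the dataset. Rows with a NaN `Ref` are kept when the same `Id` has at least one valid `Ref`.
train_ids = train_data[~np.isnan(train_data.Ref)]  # rows whose Ref is not NaN
train_data = train_data[np.isin(train_data.Id, train_ids.Id)]  # keep every row whose Id appears in train_ids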
train_data
| | Id | minutes_past | radardist_km | Ref | Ref_5x5_10th | Ref_5x5_50th | Ref_5x5_90th | RefComposite | RefComposite_5x5_10th | RefComposite_5x5_50th | ... | RhoHV_5x5_90th | Zdr | Zdr_5x5_10th | Zdr_5x5_50th | Zdr_5x5_90th | Kdp | Kdp_5x5_10th | Kdp_5x5_50th | Kdp_5x5_90th | Expected |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6 | 2 | 1 | 2.0 | 9.0 | 5.0 | 7.5 | 10.5 | 15.0 | 10.5 | 16.5 | ... | 0.998333 | 0.3750 | -0.1250 | 0.3125 | 0.8750 | 1.059998 | -1.410004 | -0.350006 | 1.059998 | 1.016001 |
7 | 2 | 6 | 2.0 | 26.5 | 22.5 | 25.5 | 31.5 | 26.5 | 26.5 | 28.5 | ... | 1.005000 | 0.0625 | -0.1875 | 0.2500 | 0.6875 | NaN | NaN | NaN | 1.409988 | 1.016001 |
8 | 2 | 11 | 2.0 | 21.5 | 15.5 | 20.5 | 25.0 | 26.5 | 23.5 | 25.0 | ... | 1.001667 | 0.3125 | -0.0625 | 0.3125 | 0.6250 | 0.349991 | NaN | -0.350006 | 1.759994 | 1.016001 |
9 | 2 | 16 | 2.0 | 18.0 | 14.0 | 17.5 | 21.0 | 20.5 | 18.0 | 20.5 | ... | 1.001667 | 0.2500 | 0.1250 | 0.3750 | 0.6875 | 0.349991 | -1.059998 | 0.000000 | 1.059998 | 1.016001 |
10 | 2 | 21 | 2.0 | 24.5 | 16.5 | 21.0 | 24.5 | 24.5 | 21.0 | 24.0 | ... | 0.998333 | 0.2500 | 0.0625 | 0.1875 | 0.5625 | -0.350006 | -1.059998 | -0.350006 | 1.759994 | 1.016001 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
13765196 | 1180945 | 38 | 9.0 | 33.0 | 19.5 | 25.5 | 36.5 | 33.0 | 20.5 | 28.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 8.636004 |
13765197 | 1180945 | 42 | 9.0 | 33.0 | 21.0 | 30.5 | 37.0 | 36.5 | 22.0 | 33.5 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 8.636004 |
13765198 | 1180945 | 47 | 9.0 | 29.5 | 10.0 | 26.0 | 30.5 | 31.0 | 16.5 | 26.0 | ... | 1.051667 | 1.7500 | NaN | 0.7500 | 3.0000 | 13.379990 | NaN | NaN | 13.379990 | 8.636004 |
13765199 | 1180945 | 52 | 9.0 | 19.0 | NaN | 15.5 | 26.5 | 19.0 | NaN | 16.5 | ... | 1.051667 | NaN | NaN | NaN | 2.8125 | NaN | NaN | NaN | NaN | 8.636004 |
13765200 | 1180945 | 57 | 9.0 | 7.5 | NaN | 10.0 | 13.0 | 14.5 | 10.0 | 12.5 | ... | 1.051667 | 0.0000 | -1.1250 | 0.3750 | 3.2500 | 6.069992 | NaN | -8.029999 | 6.069992 | 8.636004 |
9125329 rows × 24 columns
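The filter above works per `Id`, not per row: an `Id` survives if at least one of its rows has a valid `Ref`, and all of its rows are then kept. That is why `Ref` can still contain NaN after this step. A minimal sketch on made-up data (the toy values and the name `valid_ids` are illustrative, not from the notebook):

```python
import numpy as np
import pandas as pd

# Toy data (made up): Id 1 has no valid Ref, Id 2 has one, Id 3 has one.
toy = pd.DataFrame({
    "Id": [1, 1, 2, 2, 3],
    "Ref": [np.nan, np.nan, 9.0, np.nan, 21.5],
})

# Ids that have at least one non-NaN Ref value.
valid_ids = toy.loc[toy["Ref"].notna(), "Id"].unique()

# Keep every row of those Ids, including rows where Ref itself is NaN.
filtered = toy[toy["Id"].isin(valid_ids)]

print(filtered)
```

Here Id 1 is dropped entirely, while the NaN row of Id 2 remains.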
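Check NaN values and fill with 0
The remaining NaN values are handled by averaging the observations of each `Id` and replacing any NaN that is left with 0. Reasons for dealing with NaN values:
- Limit the impact of missing values: NaN usually marks a missing or invalid reading, which can distort analysis and modeling results. Handling these values makes the data more accurate and reliable.
- Reduce noise: NaN values can introduce noise into analysis and modeling, so handling them improves data quality.
- Improve efficiency: with a large dataset, a high share of NaN values adds complexity and time to processing and analysis.
- Simplify visualization: NaN values are usually skipped or shown as gaps in plots, which hides the features and trends of the data.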
train_data.isna().sum()
Id 0
minutes_past 0
radardist_km 0
Ref 2775954
Ref_5x5_10th 3841387
Ref_5x5_50th 2772390
Ref_5x5_90th 1706556
RefComposite 2429475
RefComposite_5x5_10th 3377843
RefComposite_5x5_50th 2434061
RefComposite_5x5_90th 1471489
RhoHV 4314082
RhoHV_5x5_10th 5052173
RhoHV_5x5_50th 4300572
RhoHV_5x5_90th 3458417
Zdr 4314082
Zdr_5x5_10th 5052173
Zdr_5x5_50th 4300572
Zdr_5x5_90th 3458417
Kdp 5036548
Kdp_5x5_10th 5745822
Kdp_5x5_50th 5026320
Kdp_5x5_90th 4231865
Expected 0
dtype: int64
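def process_data(train_data):
    # Average all radar observations of each Id (mean() skips NaN values).
    train_data_mean = train_data.groupby('Id').mean()
    # Columns with no valid value for an Id are still NaN; fill them with 0.
    train_data_mean.fillna(0, inplace=True)
    train_data_mean.reset_index(inplace=True)
    return train_data_mean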
train_data_mean = process_data(train_data)
train_data_mean.isnull().sum()
Id 0
minutes_past 0
radardist_km 0
Ref 0
Ref_5x5_10th 0
Ref_5x5_50th 0
Ref_5x5_90th 0
RefComposite 0
RefComposite_5x5_10th 0
RefComposite_5x5_50th 0
RefComposite_5x5_90th 0
RhoHV 0
RhoHV_5x5_10th 0
RhoHV_5x5_50th 0
RhoHV_5x5_90th 0
Zdr 0
Zdr_5x5_10th 0
Zdr_5x5_50th 0
Zdr_5x5_90th 0
Kdp 0
Kdp_5x5_10th 0
Kdp_5x5_50th 0
Kdp_5x5_90th 0
Expected 0
dtype: int64
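The effect of the aggregation can be seen on a small made-up frame. This sketch repeats the same two steps (per-`Id` mean, then fill with 0) on toy values; the data are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (made up): Id 1 has one valid Ref and no valid Kdp.
toy = pd.DataFrame({
    "Id": [1, 1, 2, 2],
    "Ref": [10.0, np.nan, 20.0, 30.0],
    "Kdp": [np.nan, np.nan, 1.0, 3.0],
})

# Per-Id mean; NaN values are skipped, so an all-NaN group stays NaN.
per_id = toy.groupby("Id").mean()

# Any NaN left after averaging is replaced with 0.
per_id = per_id.fillna(0).reset_index()

print(per_id)
```

One consequence worth keeping in mind: after this step a 0 in `Kdp` or `Zdr` can mean either a measured value of 0 or no valid measurement at all.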
pd.set_option('display.float_format', lambda x: '%.3f' % x)
train_data_mean[["minutes_past", "radardist_km", "Expected"]].describe()
| | minutes_past | radardist_km | Expected |
---|---|---|---|
count | 731556.000 | 731556.000 | 731556.000 |
mean | 29.543 | 9.747 | 23.995 |
std | 1.459 | 4.059 |