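Machine-Learning-Based Prediction of Urban Heavy Rainfall

Background

  • The data are radar maps observed by the Shenzhen Meteorological Bureau. Each radar map covers a target site and its surrounding area, laid out as an m×m grid in which every grid point records the radar reflectivity factor Z. Because Z ranges from very small to very large values, we measure it on the logarithmic dBZ scale for convenience.

  • Short-term precipitation forecasting involves analyzing the following:
    1. The relationship between current precipitation and radar reflectivity;
    2. The radar map contains reflectivity for both the target site and its surroundings, so the precipitation relationship between the target site and the surrounding area must be taken into account.

  • The task is to predict the total hourly rainfall from the radar values.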
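The dBZ scale and the reflectivity-rainfall relationship mentioned above can be sketched in a few lines. The dBZ conversion is the standard definition, 10·log10(Z) with Z in mm^6/m^3. The Z-R step uses the classic Marshall-Palmer form Z = 200·R^1.6, which is an assumption brought in for illustration and is not the method used in this article; the function names are hypothetical.

```python
import math


def z_to_dbz(z_linear: float) -> float:
    """Convert a linear reflectivity factor Z (mm^6 / m^3) to dBZ."""
    return 10.0 * math.log10(z_linear)


def dbz_to_rain_rate(dbz: float, a: float = 200.0, b: float = 1.6) -> float:
    """Invert Z = a * R**b to estimate a rain rate R in mm/h.

    a=200 and b=1.6 are the Marshall-Palmer coefficients; they are an
    assumption here, not values taken from the article.
    """
    z_linear = 10.0 ** (dbz / 10.0)
    return (z_linear / a) ** (1.0 / b)


# A linear Z of 1000 corresponds to 30 dBZ.
print(round(z_to_dbz(1000.0), 1))
# Rough rain-rate estimate for a 30 dBZ echo under the assumed Z-R relation.
print(round(dbz_to_rain_rate(30.0), 2))
```

A fixed Z-R relation like this is only a physical baseline; the models imported later in the article learn the mapping from the radar features instead.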

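Data description

The training set contains Shenzhen weather radar network and rainfall data from April to August 2017. Time and location information has been anonymized.

File description

  • train.zip: the training set. It consists of radar observations at gauges in the US Midwest, covering more than 20 days of each month during the corn-growing season.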
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
/kaggle/input/how-much-did-it-rain-ii/train.zip
/kaggle/input/how-much-did-it-rain-ii/test.zip
/kaggle/input/how-much-did-it-rain-ii/sample_dask.py
/kaggle/input/how-much-did-it-rain-ii/sample_solution.csv.zip
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error
from scipy.stats import percentileofscore

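Data preprocessing

There are so many rows because each hour contains multiple radar observations but only one rain gauge reading (Expected). That is why many rows share the same Id.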
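The one-label-many-scans layout is easy to verify with a group count. The sketch below uses a small hypothetical frame that mimics the dataset's columns; the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two gauge-hours, several radar scans each.
toy = pd.DataFrame({
    "Id": [1, 1, 1, 2, 2],
    "minutes_past": [3, 16, 25, 1, 6],
    "Ref": [np.nan, np.nan, np.nan, 9.0, 26.5],
    "Expected": [0.254, 0.254, 0.254, 1.016, 1.016],
})

# Number of radar scans recorded for each Id.
scans_per_id = toy.groupby("Id").size()
# Number of distinct gauge readings for each Id (should always be 1).
labels_per_id = toy.groupby("Id")["Expected"].nunique()

print(scans_per_id)
print(labels_per_id)
```

Running the same two group operations on the real training frame would show how many scans each hour has and confirm that Expected is constant within an Id.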

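The features have the following meanings:

  • Id: a unique identifier for the set of observations from one gauge over one hour. Every hour of observations in the dataset has its own Id.

  • minutes_past: for each set of radar observations, the number of minutes past the top of the hour at which the scan was taken. Each radar observation is a snapshot of conditions overhead at that moment.

  • radardist_km: the horizontal distance (km) between the rain gauge and the radar that reported the observation.

  • Ref: radar reflectivity, derived from the ratio of the echo power returned to the radar to the transmitted power, usually expressed in dBZ.

  • Ref_5x5_10th: the 10th percentile of reflectivity values in a 5x5 neighborhood centered on the gauge.

  • Ref_5x5_50th: the 50th percentile of reflectivity values in a 5x5 neighborhood centered on the gauge.

  • Ref_5x5_90th: the 90th percentile of reflectivity values in a 5x5 neighborhood centered on the gauge.

  • RefComposite: the maximum reflectivity in the vertical column above the gauge (dBZ).

  • RefComposite_5x5_10th

  • RefComposite_5x5_50th

  • RefComposite_5x5_90th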

  • RhoHV: the correlation coefficient between the horizontally and vertically polarized radar returns (unitless)

  • RhoHV_5x5_10th

  • RhoHV_5x5_50th

  • RhoHV_5x5_90th

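  • Zdr: differential reflectivity, the difference between the reflectivity factors of the horizontally and vertically polarized waves (dB)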

  • Zdr_5x5_10th

  • Zdr_5x5_50th

  • Zdr_5x5_90th

  • Kdp: specific differential phase, the phase difference accumulated per unit distance through precipitation (deg/km)

  • Kdp_5x5_10th

  • Kdp_5x5_50th

  • Kdp_5x5_90th

  • Expected: the actual gauge reading at the end of the hour (mm)
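To make the _5x5_10th, _5x5_50th and _5x5_90th columns concrete, here is a minimal sketch of how such neighborhood percentiles can be computed. The 5x5 patch and its values are hypothetical, and the article does not say how the dataset's producers treated missing cells, so this only illustrates the idea.

```python
import numpy as np

# Hypothetical 5x5 patch of reflectivity values (dBZ) centred on a gauge.
patch = np.arange(25, dtype=float).reshape(5, 5)

# Value at the gauge itself (the centre cell), analogous to Ref.
centre = patch[2, 2]

# Percentiles over all 25 cells, analogous to the *_5x5_* columns.
p10, p50, p90 = np.percentile(patch, [10, 50, 90])

print(centre, p10, p50, p90)
```

If a patch contains missing cells, `np.nanpercentile` accepts the same arguments and ignores the NaN entries.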

Load the training set

train_data = pd.read_csv("../input/how-much-did-it-rain-ii/train.zip")
train_data
Id minutes_past radardist_km Ref Ref_5x5_10th Ref_5x5_50th Ref_5x5_90th RefComposite RefComposite_5x5_10th RefComposite_5x5_50th ... RhoHV_5x5_90th Zdr Zdr_5x5_10th Zdr_5x5_50th Zdr_5x5_90th Kdp Kdp_5x5_10th Kdp_5x5_50th Kdp_5x5_90th Expected
0 1 3 10.0 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.254000
1 1 16 10.0 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.254000
2 1 25 10.0 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.254000
3 1 35 10.0 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.254000
4 1 45 10.0 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.254000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
13765196 1180945 38 9.0 33.0 19.5 25.5 36.5 33.0 20.5 28.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 8.636004
13765197 1180945 42 9.0 33.0 21.0 30.5 37.0 36.5 22.0 33.5 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 8.636004
13765198 1180945 47 9.0 29.5 10.0 26.0 30.5 31.0 16.5 26.0 ... 1.051667 1.75 NaN 0.750 3.0000 13.379990 NaN NaN 13.379990 8.636004
13765199 1180945 52 9.0 19.0 NaN 15.5 26.5 19.0 NaN 16.5 ... 1.051667 NaN NaN NaN 2.8125 NaN NaN NaN NaN 8.636004
13765200 1180945 57 9.0 7.5 NaN 10.0 13.0 14.5 10.0 12.5 ... 1.051667 0.00 -1.125 0.375 3.2500 6.069992 NaN -8.029999 6.069992 8.636004

13765201 rows × 24 columns

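Handling invalid data

For each Id, if every radar observation in that hour has NaN in the Ref column (that is, the radar data are missing for the whole hour), all rows for that Id are removed from the dataset. Ids with at least one valid Ref value are kept in full, so individual NaN values can still remain in Ref afterwards.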

train_ids = train_data[train_data.Ref.notna()]  # rows whose Ref is not NaN
train_data = train_data[train_data.Id.isin(train_ids.Id)]  # keep every row whose Id has at least one valid Ref

train_data
Id minutes_past radardist_km Ref Ref_5x5_10th Ref_5x5_50th Ref_5x5_90th RefComposite RefComposite_5x5_10th RefComposite_5x5_50th ... RhoHV_5x5_90th Zdr Zdr_5x5_10th Zdr_5x5_50th Zdr_5x5_90th Kdp Kdp_5x5_10th Kdp_5x5_50th Kdp_5x5_90th Expected
6 2 1 2.0 9.0 5.0 7.5 10.5 15.0 10.5 16.5 ... 0.998333 0.3750 -0.1250 0.3125 0.8750 1.059998 -1.410004 -0.350006 1.059998 1.016001
7 2 6 2.0 26.5 22.5 25.5 31.5 26.5 26.5 28.5 ... 1.005000 0.0625 -0.1875 0.2500 0.6875 NaN NaN NaN 1.409988 1.016001
8 2 11 2.0 21.5 15.5 20.5 25.0 26.5 23.5 25.0 ... 1.001667 0.3125 -0.0625 0.3125 0.6250 0.349991 NaN -0.350006 1.759994 1.016001
9 2 16 2.0 18.0 14.0 17.5 21.0 20.5 18.0 20.5 ... 1.001667 0.2500 0.1250 0.3750 0.6875 0.349991 -1.059998 0.000000 1.059998 1.016001
10 2 21 2.0 24.5 16.5 21.0 24.5 24.5 21.0 24.0 ... 0.998333 0.2500 0.0625 0.1875 0.5625 -0.350006 -1.059998 -0.350006 1.759994 1.016001
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
13765196 1180945 38 9.0 33.0 19.5 25.5 36.5 33.0 20.5 28.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 8.636004
13765197 1180945 42 9.0 33.0 21.0 30.5 37.0 36.5 22.0 33.5 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 8.636004
13765198 1180945 47 9.0 29.5 10.0 26.0 30.5 31.0 16.5 26.0 ... 1.051667 1.7500 NaN 0.7500 3.0000 13.379990 NaN NaN 13.379990 8.636004
13765199 1180945 52 9.0 19.0 NaN 15.5 26.5 19.0 NaN 16.5 ... 1.051667 NaN NaN NaN 2.8125 NaN NaN NaN NaN 8.636004
13765200 1180945 57 9.0 7.5 NaN 10.0 13.0 14.5 10.0 12.5 ... 1.051667 0.0000 -1.1250 0.3750 3.2500 6.069992 NaN -8.029999 6.069992 8.636004

9125329 rows × 24 columns
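The Id-level filter above can be checked on a toy frame. This is a minimal sketch with made-up values; it shows that an Id is dropped only when all of its Ref values are missing, and that NaN values inside a surviving Id are kept.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: Id 1 has no valid Ref at all, Id 2 has two.
toy = pd.DataFrame({
    "Id": [1, 1, 2, 2, 2],
    "Ref": [np.nan, np.nan, 9.0, np.nan, 21.5],
})

# Ids that have at least one non-missing Ref value.
valid_ids = toy.loc[toy["Ref"].notna(), "Id"].unique()

# Keep every row belonging to those Ids, including their NaN rows.
filtered = toy[toy["Id"].isin(valid_ids)]

print(filtered)
```

This matches the outputs in the article: the row count falls from 13765201 to 9125329, yet the Ref column still reports missing values in the NaN check that follows.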

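Check NaN values and fill with 0

  1. Eliminate anomalies in the data: NaN values usually mark missing or erroneous readings, which can distort analysis and modeling results. Handling them reduces that distortion and makes the data more accurate and reliable.
  2. Reduce noise: NaN values can introduce noise into analysis and modeling. Handling them lowers the noise and improves data quality.
  3. Improve efficiency: with a large dataset, a high share of NaN values adds complexity and processing time. Handling them simplifies the data and speeds up analysis and modeling.
  4. Ease visualization: plotting tools typically treat NaN values as gaps or anomalies, which degrades charts. Handling them gives clearer plots of the data's features and trends.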
train_data.isna().sum()
Id                             0
minutes_past                   0
radardist_km                   0
Ref                      2775954
Ref_5x5_10th             3841387
Ref_5x5_50th             2772390
Ref_5x5_90th             1706556
RefComposite             2429475
RefComposite_5x5_10th    3377843
RefComposite_5x5_50th    2434061
RefComposite_5x5_90th    1471489
RhoHV                    4314082
RhoHV_5x5_10th           5052173
RhoHV_5x5_50th           4300572
RhoHV_5x5_90th           3458417
Zdr                      4314082
Zdr_5x5_10th             5052173
Zdr_5x5_50th             4300572
Zdr_5x5_90th             3458417
Kdp                      5036548
Kdp_5x5_10th             5745822
Kdp_5x5_50th             5026320
Kdp_5x5_90th             4231865
Expected                       0
dtype: int64
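Raw counts are hard to compare at this scale, so it can help to express them as a share of the 9125329 remaining rows. The sketch below reuses two of the counts printed above; picking these two columns is just for illustration.

```python
import pandas as pd

# NaN counts copied from the output above (two columns as examples).
nan_counts = pd.Series({"Ref": 2775954, "Kdp_5x5_10th": 5745822})

# Number of rows left after the Id-level filter.
total_rows = 9125329

# Fraction of rows with a missing value in each column.
share = (nan_counts / total_rows).round(3)

print(share)
```

On the full frame, `train_data.isna().mean()` would give the same fractions for every column in one call. Roughly 30% of Ref and 63% of Kdp_5x5_10th are missing, which is why the next step aggregates per Id before filling.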
def process_data(train_data):
    train_data_mean = train_data.groupby('Id').mean()
    train_data_mean.fillna(0,inplace=True)
    train_data_mean.reset_index(inplace=True)
    return train_data_mean
train_data_mean = process_data(train_data)
train_data_mean.isnull().sum()
Id                       0
minutes_past             0
radardist_km             0
Ref                      0
Ref_5x5_10th             0
Ref_5x5_50th             0
Ref_5x5_90th             0
RefComposite             0
RefComposite_5x5_10th    0
RefComposite_5x5_50th    0
RefComposite_5x5_90th    0
RhoHV                    0
RhoHV_5x5_10th           0
RhoHV_5x5_50th           0
RhoHV_5x5_90th           0
Zdr                      0
Zdr_5x5_10th             0
Zdr_5x5_50th             0
Zdr_5x5_90th             0
Kdp                      0
Kdp_5x5_10th             0
Kdp_5x5_50th             0
Kdp_5x5_90th             0
Expected                 0
dtype: int64
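It is worth spelling out what the group-then-fill step does to missing values. The toy example below, with made-up numbers, mirrors process_data: `groupby(...).mean()` skips NaN inside a group, so a 0 is written only where an Id has no valid value at all for a column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with partially and fully missing columns per Id.
toy = pd.DataFrame({
    "Id": [1, 1, 2, 2],
    "Ref": [10.0, np.nan, 20.0, 30.0],
    "Kdp": [np.nan, np.nan, 0.5, 1.5],
    "Expected": [0.254, 0.254, 1.016, 1.016],
})

# Same steps as process_data: average per Id, fill what is still NaN with 0.
per_hour = toy.groupby("Id").mean().fillna(0).reset_index()

print(per_hour)
```

One caveat with this choice: 0 is a physically valid reading for columns such as Zdr and Kdp, so after filling, the model cannot tell "missing" from "measured zero". Adding an indicator column per feature would be one way to keep that information, though the article does not do so.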
pd.set_option('display.float_format', lambda x: '%.3f' % x)
train_data_mean[["minutes_past", "radardist_km", "Expected"]].describe()
minutes_past radardist_km Expected
count 731556.000 731556.000 731556.000
mean 29.543 9.747 23.995
std 1.459 4.059