Data Mining Principles and Algorithms: An Analysis of Factors Influencing Forest Fires



I. Introduction

Forest Fire Area

Prediction of the burnt area by forest fires

Overview

The dataset contains 517 fires from the Montesinho natural park in Portugal. For each incident, the weekday, month, coordinates, and burnt area are recorded, along with several meteorological variables such as rain, temperature, humidity, and wind. The workflow reads the data and trains a regression model based on the spatial, temporal, and weather variables.




II. Resources

Forest Fires Data Set

Forest Fires Data Set: predict the burned area of forest fires using meteorological and other data

Principles and Applications of the Canadian Forest Fire Weather Index (FWI) System



III. Code

1. Reading the data

import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings(action='ignore')
fires = pd.read_csv('forestfires.csv')

2. Data cleaning

2.1 Preprocess the data: encode month and day of week as numbers
fires = fires.reset_index()

mapping_month = {'jan':1,'feb':2,'mar':3,'apr':4,'may':5,'jun':6,'jul':7,'aug':8,'sep':9,'oct':10,'nov':11,'dec':12,}
fires['month'] = fires['month'].map(mapping_month)

mapping_day = {'mon':1,'tue':2,'wed':3,'thu':4,'fri':5,'sat':6,'sun':0}
fires['day'] = fires['day'].map(mapping_day)
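Series.map returns NaN for any value that is missing from the mapping, so a misspelled month or day would silently become a missing value. A minimal sketch of a guard, using a small made-up frame (the `demo` data and the `encode_column` helper are illustrative and not part of the original workflow):

```python
import pandas as pd

def encode_column(series, mapping):
    """Map labels to integers and fail loudly on labels missing from the mapping."""
    encoded = series.map(mapping)
    unmapped = series[encoded.isna()].unique()
    if len(unmapped) > 0:
        raise ValueError(f"unmapped labels: {list(unmapped)}")
    return encoded.astype(int)

mapping_day = {'mon': 1, 'tue': 2, 'wed': 3, 'thu': 4, 'fri': 5, 'sat': 6, 'sun': 0}

# Made-up sample rows for illustration
demo = pd.DataFrame({'day': ['fri', 'sun', 'tue']})
demo['day'] = encode_column(demo['day'], mapping_day)
print(demo['day'].tolist())  # [5, 0, 2]
```

Raising an error is one possible policy; logging the unmapped labels and dropping those rows would be another.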
2.2 Inspect the summary statistics
fires.describe().T
       count        mean         std   min    25%     50%     75%      max
index  517.0  258.000000  149.389312   0.0  129.0  258.00  387.00   516.00
X      517.0    4.669246    2.313778   1.0    3.0    4.00    7.00     9.00
Y      517.0    4.299807    1.229900   2.0    4.0    4.00    5.00     9.00
month  517.0    7.475822    2.275990   1.0    7.0    8.00    9.00    12.00
day    517.0    2.972921    2.143867   0.0    1.0    3.00    5.00     6.00
FFMC   517.0   90.644681    5.520111  18.7   90.2   91.60   92.90    96.20
DMC    517.0  110.872340   64.046482   1.1   68.6  108.30  142.40   291.30
DC     517.0  547.940039  248.066192   7.9  437.7  664.20  713.90   860.60
ISI    517.0    9.021663    4.559477   0.0    6.5    8.40   10.80    56.10
temp   517.0   18.889168    5.806625   2.2   15.5   19.30   22.80    33.30
RH     517.0   44.288201   16.317469  15.0   33.0   42.00   53.00   100.00
wind   517.0    4.017602    1.791653   0.4    2.7    4.00    4.90     9.40
rain   517.0    0.021663    0.295959   0.0    0.0    0.00    0.00     6.40
area   517.0   12.847292   63.655818   0.0    0.0    0.52    6.57  1090.84
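2.3 Bin the target variable
The summary statistics show that the mean burned area is 12.847292, that 99.613% of the values are below 279, that 75% are at most 6.57, that 50% are at most 0.52, and that 47% are at most 0.09.
It is therefore reasonable to treat a burned area greater than 0.09 and at most 6.57 as a small fire, greater than 6.57 and at most 279 as a medium fire, and greater than 279 as a large fire. Values at or below 0.09 form class 0.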
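# Bin the burned area into classes 0 to 3; .loc avoids chained assignment,
# and the masks are built from a copy of the original values
area = fires['area'].copy()
fires.loc[area <= 0.09, 'area'] = 0
fires.loc[(area > 0.09) & (area <= 6.57), 'area'] = 1
fires.loc[(area > 6.57) & (area <= 279), 'area'] = 2
fires.loc[area > 279, 'area'] = 3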
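The four thresholded assignments can also be expressed as a single binning step. A minimal sketch using pd.cut on made-up area values (the sample values are illustrative; the bin edges are the ones chosen above):

```python
import numpy as np
import pandas as pd

# Made-up burned areas for illustration
area = pd.Series([0.0, 0.09, 0.52, 6.57, 10.0, 279.0, 1090.84])

# Right-closed bins: (-inf, 0.09], (0.09, 6.57], (6.57, 279], (279, inf)
edges = [-np.inf, 0.09, 6.57, 279, np.inf]
fire_class = pd.cut(area, bins=edges, labels=[0, 1, 2, 3]).astype(int)

print(fire_class.tolist())  # [0, 0, 1, 1, 2, 2, 3]
```

Because the bins are closed on the right, a value exactly equal to a threshold falls into the lower class, which matches the `<=` comparisons used above.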
2.4 Inspect the correlations between features
attributes = ['month','day','FFMC', 'DMC', 'DC', 'ISI', 'temp', 'RH','wind','rain']
corr = fires[attributes].corr()
corr

          month       day      FFMC       DMC        DC       ISI      temp        RH      wind      rain
month  1.000000 -0.037469  0.291477  0.466645  0.868698  0.186597  0.368842 -0.095280 -0.086368  0.013438
day   -0.037469  1.000000  0.073597  0.028697  0.001913  0.035926  0.032233 -0.083318 -0.004013 -0.024119
FFMC   0.291477  0.073597  1.000000  0.382619  0.330512  0.531805  0.431532 -0.300995 -0.028485  0.056702
DMC    0.466645  0.028697  0.382619  1.000000  0.682192  0.305128  0.469594  0.073795 -0.105342  0.074790
DC     0.868698  0.001913  0.330512  0.682192  1.000000  0.229154  0.496208 -0.039192 -0.203466  0.035861
ISI    0.186597  0.035926  0.531805  0.305128  0.229154  1.000000  0.394287 -0.132517  0.106826  0.067668
temp   0.368842  0.032233  0.431532  0.469594  0.496208  0.394287  1.000000 -0.527390 -0.227116  0.069491
RH    -0.095280 -0.083318 -0.300995  0.073795 -0.039192 -0.132517 -0.527390  1.000000  0.069410  0.099751
wind  -0.086368 -0.004013 -0.028485 -0.105342 -0.203466  0.106826 -0.227116  0.069410  1.000000  0.061119
rain   0.013438 -0.024119  0.056702  0.074790  0.035861  0.067668  0.069491  0.099751  0.061119  1.000000
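A matrix like this is easier to scan when only the strongly correlated pairs are listed. The sketch below keeps the entries above the diagonal and reports pairs whose absolute correlation exceeds a threshold. The 0.6 cutoff and the `strong_pairs` helper are assumptions for illustration, and the small matrix reuses three values from the output above.

```python
import numpy as np
import pandas as pd

def strong_pairs(corr, threshold=0.6):
    """Return (feature_a, feature_b, r) for pairs with |r| above the threshold."""
    # Keep only entries strictly above the diagonal so each pair appears once;
    # masked entries become NaN and fail the threshold test below
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    pairs = pairs[pairs.abs() > threshold]
    return [(a, b, float(r)) for (a, b), r in pairs.items()]

names = ['month', 'DMC', 'DC']
corr = pd.DataFrame(
    [[1.000000, 0.466645, 0.868698],
     [0.466645, 1.000000, 0.682192],
     [0.868698, 0.682192, 1.000000]],
    index=names, columns=names)

for a, b, r in strong_pairs(corr):
    print(f"{a} ~ {b}: {r:.3f}")
```

With these values the sketch reports the month/DC and DMC/DC pairs, the two strongest relationships in the full matrix.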
2.5 Inspect the relationships among the components of the Canadian Forest Fire Weather Index (FWI) system
from pandas.plotting import scatter_matrix

attributes = ['FFMC', 'DMC', 'DC', 'ISI']
scatter_matrix(fires[attributes],figsize=(15, 15))

[Figure: scatter matrix of FFMC, DMC, DC, and ISI]

2.6 Draw a scatter plot of DMC (Duff Moisture Code) against DC (Drought Code)
fires.plot(kind="scatter", x="DMC", y="DC", alpha=0.4, figsize=(10,8))

[Figure: scatter plot of DMC against DC]

2.7 Fit an extremely randomized trees (ExtraTrees) regression model
from sklearn.ensemble import ExtraTreesRegressor

columns = ['X', 'Y','month','day','FFMC', 'DMC', 'DC', 'ISI', 'temp','RH', 'wind', 'rain']
X = fires[columns]
Y = fires[['area']].values.ravel()

model = ExtraTreesRegressor(n_estimators=100)
model.fit(X, Y)

ExtraTreesRegressor()
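2.8 Use the model's feature importances to find features with very low importance
# Collect the features whose importance is below 0.01
cols_to_drop = []
for name, importance in zip(columns, model.feature_importances_.round(4)):
    if importance < 0.01:
        cols_to_drop.append(name)
print('Columns to be dropped: ', cols_to_drop)

Columns to be dropped:  ['rain']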
2.9 From the correlation matrix, list the attributes in decreasing order of correlation with area
corr_matrix = fires.corr()
corr_matrix["area"].sort_values(ascending=False)

area     1.000000
index    0.302303
month    0.123613
wind     0.070217
X        0.068824
DC       0.063159
FFMC     0.059142
Y        0.047538
DMC      0.046503
rain     0.043600
temp     0.042614
ISI      0.022006
day      0.004167
RH      -0.054193
Name: area, dtype: float64
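Two details of this ranking are easy to miss. The index column is only a row counter added by reset_index, and sorting by signed value places RH last even though its magnitude exceeds that of several positively correlated attributes. A hedged sketch that drops the helper column and ranks by absolute correlation, on made-up data (the `rank_by_abs_corr` helper and the sample values are illustrative):

```python
import pandas as pd

def rank_by_abs_corr(df, target, exclude=()):
    """Rank features by |Pearson r| with the target, skipping excluded columns."""
    corr = df.drop(columns=list(exclude)).corr()[target].drop(target)
    return corr.sort_values(key=lambda s: s.abs(), ascending=False)

# Made-up values for illustration
demo = pd.DataFrame({
    'index': [0, 1, 2, 3, 4],
    'temp':  [10.0, 12.0, 15.0, 19.0, 25.0],
    'RH':    [80.0, 70.0, 65.0, 40.0, 30.0],
    'wind':  [3.0, 1.0, 4.0, 1.0, 5.0],
    'area':  [0.0, 1.0, 1.0, 2.0, 3.0],
})

ranking = rank_by_abs_corr(demo, target='area', exclude=['index'])
print(ranking.round(3))
```

The signed values are preserved in the result, so the direction of each relationship is still visible after sorting by magnitude.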
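2.10 The rain attribute has very low importance, and DC is highly correlated with DMC (0.68) and with month (0.87), which makes it largely redundant, so drop the rain and DC columns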
fires = fires.drop(cols_to_drop,axis=1)
fires.drop(labels=['DC'],axis=1,inplace=True)
2.11 Plot FFMC (Fine Fuel Moisture Code) and DMC (Duff Moisture Code) as line charts to compare the two
import plotly.express as px 
df_long=pd.melt(fires,id_vars=['index'], value_vars=['FFMC', 'DMC']) 
fig = px.line(df_long, x='index', y='value', color='variable')
fig.show()

[Figure: line chart of FFMC and DMC by row index]

2.12 Plot line charts for three other attributes: ISI (Initial Spread Index), temperature (temp), and wind speed (wind)
df_long = pd.melt(fires, id_vars=['index'], value_vars=['ISI', 'temp', 'wind'])
fig = px.line(df_long, x='index', y='value', color='variable')
fig.show()

[Figure: line chart of ISI, temp, and wind by row index]

2.13 Standardize the numeric features to zero mean and unit standard deviation
fires_cat = fires[['month', 'day']]
fires_num = fires[['X', 'Y', 'FFMC', 'DMC', 'ISI', 'temp', 'RH','wind']]
target = fires[['area']] 
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
num_pipeline = Pipeline([('std_scaler', StandardScaler()),])
fires_num_tr = num_pipeline.fit_transform(fires_num)
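Standardization replaces each value x with z = (x - mean) / std. One caveat: the pipeline above is fitted on all rows before the train/test split in 2.14, so test-set statistics influence the scaling. A minimal NumPy sketch of the alternative, in which the mean and standard deviation come from the training rows only (the array values and the 3/2 split are made up for illustration):

```python
import numpy as np

# Made-up numeric features: 5 rows, 2 columns
features = np.array([[1.0, 10.0],
                     [2.0, 20.0],
                     [3.0, 30.0],
                     [4.0, 40.0],
                     [5.0, 50.0]])
train, test = features[:3], features[3:]

# Estimate the statistics on the training rows only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), the convention StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training statistics

print(train_scaled.mean(axis=0))
print(train_scaled.std(axis=0))
```

With scikit-learn the same effect is obtained by calling fit_transform on the training split and transform on the test split.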
2.14 Split the dataset and flatten the target arrays to one dimension
from sklearn.model_selection import train_test_split
data = np.concatenate((fires_cat,fires_num_tr),axis=1)
X_train, X_test, y_train, y_test = train_test_split(data, target.values, test_size=0.3)
y_train = y_train.ravel()
y_test = y_test.ravel()

3. SVR modeling

3.1 Tune the SVR hyperparameters with grid search
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR
param_grid = [{'kernel': ['rbf', 'sigmoid'], 'C': [1,50, 100 ,300], 'epsilon': [0.2, 0.2,0.1]},]
svr_cv =SVR()
svr_grid_search = GridSearchCV(svr_cv, param_grid, cv=5,scoring='neg_mean_squared_error',return_train_score=True)
svr_grid_search.fit(X_train,y_train)

GridSearchCV(cv=5, estimator=SVR(),
             param_grid=[{'C': [1, 50, 100, 300], 'epsilon': [0.2, 0.2, 0.1],
                          'kernel': ['rbf', 'sigmoid']}],
             return_train_score=True, scoring='neg_mean_squared_error')
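The cost of a grid search is the product of the list lengths times the number of folds, and a value listed twice, such as a repeated 0.2 in an epsilon list, is fitted twice. A small sketch of that arithmetic with an illustrative grid containing one repeated value (the grid literal is an example, not a recommendation):

```python
from itertools import product

# Illustrative grid; note that 0.2 is listed twice for epsilon
grid = {'kernel': ['rbf', 'sigmoid'],
        'C': [1, 50, 100, 300],
        'epsilon': [0.2, 0.2, 0.1]}
cv_folds = 5

candidates = list(product(*grid.values()))
unique_candidates = set(candidates)

print(len(candidates), 'candidates,', len(candidates) * cv_folds, 'fits')
print(len(unique_candidates), 'distinct candidates,',
      len(unique_candidates) * cv_folds, 'fits')
```

Here 2 x 4 x 3 = 24 candidates lead to 120 fits under 5-fold cross-validation, of which only 16 candidates (80 fits) are distinct.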
3.2 Show the best estimator found by the search
svr_grid_search.best_estimator_

SVR(C=1, epsilon=0.2)
3.3 Predict on the test set
final_model = svr_grid_search.best_estimator_
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
final_predictions = final_model.predict(X_test)
final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
print('RMSE: ', final_rmse)
print('MAE: {}'.format(mean_absolute_error(y_test, final_predictions)))

RMSE:  0.9193740894801092
MAE: 0.7773110570888362
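For reference, the root mean squared error (RMSE) is the square root of the mean squared error and the mean absolute error (MAE) is the mean of the absolute errors. Both are in the units of the target, here the class indices 0 to 3. A worked example on made-up values:

```python
import numpy as np

# Made-up targets and predictions for illustration
y_true = np.array([0.0, 1.0, 2.0, 3.0])
y_pred = np.array([1.0, 1.0, 1.0, 1.0])

errors = y_pred - y_true            # [1, 0, -1, -2]
mse = np.mean(errors ** 2)          # (1 + 0 + 1 + 4) / 4 = 1.5
rmse = np.sqrt(mse)                 # about 1.2247
mae = np.mean(np.abs(errors))       # (1 + 0 + 1 + 2) / 4 = 1.0

print(rmse, mae)
```

RMSE is never smaller than MAE, and the gap widens when a few errors are much larger than the rest.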
# Convert the continuous predictions to class labels by rounding up:
# (0, 1] -> 1, (1, 2] -> 2, above 2 -> 3. Values at or below 0 are left unchanged.
final_predictions[(final_predictions>0) & (final_predictions<=1)] = 1
final_predictions[(final_predictions>1) & (final_predictions<=2)] = 2
final_predictions[final_predictions>2] = 3

# Share of test samples whose predicted class equals the true class, in percent
right = np.mean(y_test == final_predictions) * 100
print('Accuracy (%):', right)

Accuracy (%): 25.64102564102564
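An accuracy figure is easier to judge next to a trivial baseline that always predicts the most frequent class. The sketch below also shows an alternative way to turn regression outputs into class labels, rounding to the nearest class and clipping to the valid range 0 to 3, which differs from the round-up rule used above. All values and helper names are made up for illustration.

```python
import numpy as np

def to_class(pred, lowest=0, highest=3):
    """Round continuous predictions to the nearest class and clip to the valid range."""
    return np.clip(np.rint(pred), lowest, highest).astype(int)

def majority_baseline_accuracy(y_train, y_test):
    """Accuracy (%) of always predicting the most frequent training class."""
    classes, counts = np.unique(y_train, return_counts=True)
    majority = classes[np.argmax(counts)]
    return np.mean(y_test == majority) * 100

# Made-up labels and raw regression outputs
y_train = np.array([0, 0, 0, 1, 1, 2])
y_test = np.array([0, 1, 0, 2])
raw_pred = np.array([-0.3, 0.8, 0.4, 3.6])

pred_class = to_class(raw_pred)
print(pred_class.tolist())                          # [0, 1, 0, 3]
print(np.mean(pred_class == y_test) * 100)          # 75.0
print(majority_baseline_accuracy(y_train, y_test))  # 50.0
```

A model whose accuracy falls below the majority baseline is not yet extracting useful signal from the features.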

3.4 Random forest modeling

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False,True], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
  ]

forest_reg = RandomForestRegressor()
rfr_grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error',
                           return_train_score=True)
rfr_grid_search.fit(X_train,y_train)

GridSearchCV(cv=5, estimator=RandomForestRegressor(),
             param_grid=[{'max_features': [2, 4, 6, 8],
                          'n_estimators': [3, 10, 30]},
                         {'bootstrap': [False, True], 'max_features': [2, 3, 4],
                          'n_estimators': [3, 10]}],
             return_train_score=True, scoring='neg_mean_squared_error')
3.5 Show the best estimator found by the search
rfr_grid_search.best_estimator_

RandomForestRegressor(max_features=4, n_estimators=10)
3.6 Predict on the test set
# Select the tuned random forest. Without this reassignment, final_model still
# refers to the SVR from 3.3 and the scores simply repeat the SVR results
# (RMSE 0.9194, MAE 0.7773, accuracy 25.64%), as happened in the original run.
final_model = rfr_grid_search.best_estimator_
final_predictions = final_model.predict(X_test)

final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)

print('RMSE: ', final_rmse)
print('MAE: {}'.format(mean_absolute_error(y_test, final_predictions)))

# Convert the continuous predictions to class labels by rounding up:
# (0, 1] -> 1, (1, 2] -> 2, above 2 -> 3. Values at or below 0 are left unchanged.
final_predictions[(final_predictions>0) & (final_predictions<=1)] = 1
final_predictions[(final_predictions>1) & (final_predictions<=2)] = 2
final_predictions[final_predictions>2] = 3

# Share of test samples whose predicted class equals the true class, in percent
right = np.mean(y_test == final_predictions) * 100
print('Accuracy (%):', right)

3.7 Random forest modeling with H2O

import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator 
from h2o.grid.grid_search import H2OGridSearch
h2o.init()

3.8 Load and preprocess the data for H2O
h2oFires = pd.read_csv('forestfires.csv')

# Bin the burned area into four labelled classes (same thresholds as in 2.3).
# pd.cut builds the labels in one step and avoids chained assignment.
bins = [-np.inf, 0.09, 6.57, 279, np.inf]
labels = ['fire0', 'fire1', 'fire2', 'fire3']
h2oFires['area'] = pd.cut(h2oFires['area'], bins=bins, labels=labels).astype(str)

# Split 70/30 into disjoint sets. The original run drew the test set with an
# independent h2oFires.sample(frac=0.3), which shares rows with the training
# set, so the accuracies reported in 3.11 and 3.12 are optimistic.
trainCsv = h2oFires.sample(frac=0.7, axis=0)
testCsv = h2oFires.drop(trainCsv.index)

cols = ['X','Y','month','day','FFMC','DMC','ISI','temp','RH','wind','area']
trainCsv = trainCsv[cols]
testCsv = testCsv[cols]

# index=False keeps the pandas row index out of the files
trainCsv.to_csv('h2oTrain.csv', index=False)
testCsv.to_csv('h2oTest.csv', index=False)

train = h2o.import_file("h2oTrain.csv")
test = h2o.import_file("h2oTest.csv")
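Drawing the training and test samples with two independent sample() calls lets the same row land in both sets, which inflates the test accuracy. A minimal sketch of a disjoint split on a made-up frame, where the test set is whatever the training sample did not take (the frame and the random_state values are illustrative):

```python
import pandas as pd

# Made-up frame with 10 rows
frame = pd.DataFrame({'x': range(10)})

train_part = frame.sample(frac=0.7, random_state=0)
test_part = frame.drop(train_part.index)  # rows not chosen for training

# Independent sampling, shown for comparison: rows may be shared with train_part
independent_test = frame.sample(frac=0.3, random_state=1)

print(len(train_part), len(test_part))
print(len(train_part.index.intersection(test_part.index)))         # 0 by construction
print(len(train_part.index.intersection(independent_test.index)))  # may be non-zero
```

Dropping the training index guarantees that the two parts cover every row exactly once.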
3.9 Train the model
model1 = H2ORandomForestEstimator()
model1.train(x = train.names[0:-1],y = 'area',training_frame = train)
3.10 Predict with the trained model
predict = model1.predict(test[test.names[0:-1]])
predict
out = test.concat(predict)
h2o.download_csv(out,"predict.csv")
3.11 Compute the accuracy
# Count the rows where the predicted label equals the true label
test_right = predict[predict['predict'] == test['area']].nrow
accuracy = test_right/test.nrow
print('Accuracy (%):', accuracy*100)

Accuracy (%): 82.58064516129032
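A single accuracy number hides which classes are confused with each other, which matters here because the classes are imbalanced. A sketch of a confusion table built with pandas on made-up labels (the label names mirror the fire0 to fire3 naming; the data is illustrative):

```python
import pandas as pd

# Made-up true and predicted labels
actual = pd.Series(['fire0', 'fire0', 'fire1', 'fire2', 'fire1', 'fire0'], name='actual')
predicted = pd.Series(['fire0', 'fire1', 'fire1', 'fire2', 'fire0', 'fire0'], name='predicted')

# Rows are the true classes, columns the predicted classes
confusion = pd.crosstab(actual, predicted)
accuracy = (actual == predicted).mean() * 100

print(confusion)
print(accuracy)
```

The diagonal of the table holds the correct predictions, so per-class recall can be read off by dividing each diagonal entry by its row total.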
3.12 Tune hyperparameters with grid search
# Search ntrees from 100 to 199 with max_depth fixed at 50
rf_params = {'ntrees': list(range(100, 200)), 'max_depth': [50]}
rf_grid = H2OGridSearch(model=H2ORandomForestEstimator, hyper_params=rf_params)
rf_grid.train(x=train.names[0:-1], y='area', training_frame=train)

# Train a model with ntrees=100 and max_depth=50
model4 = H2ORandomForestEstimator(ntrees=100, max_depth=50)
model4.train(x=train.names[0:-1], y='area', training_frame=train)

predict = model4.predict(test[test.names[0:-1]])
test_right = predict[predict['predict'] == test['area']].nrow
accuracy = test_right/test.nrow
print('Accuracy (%):', accuracy*100)

Accuracy (%): 85.16129032258064