L8：机器学习｜随机森林学习记录

最新推荐文章于 2024-09-26 11:05:19 发布

m0_74907726

最新推荐文章于 2024-09-26 11:05:19 发布

阅读量932

点赞数 10

文章标签：机器学习随机森林学习

本文链接：https://blog.csdn.net/m0_74907726/article/details/142406524

版权

第L8周：机器学习｜随机森林学习记录

🍨 本文为🔗365天深度学习训练营中的学习记录博客
🍖 原作者：K同学啊

随机森林(Random Forest)是一种集成学习方法，用于解决分类和回归问题。随机森林是一个包含多个决策树的分类器，并且其输出的类别是由个别树输出的类别的众数而定。

实例：利用一个人工合成的天气数据集，模拟了雨天、晴天、多云和雪天四种类型

异常值处理————描述性统计————构建随机森林模型进行预测————生成了模型的重要特征图

导入库

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

数据读取

data = pd.read_csv(r"E:\文件\python学习\K同学啊 365\数据\weather_classification_data.csv")
data

	Temperature	Humidity	Wind Speed	Precipitation (%)	Cloud Cover	Atmospheric Pressure	UV Index	Season	Visibility (km)	Location	Weather Type
0	14.0	73	9.5	82.0	partly cloudy	1010.82	2	Winter	3.5	inland	Rainy
1	39.0	96	8.5	71.0	partly cloudy	1011.43	7	Spring	10.0	inland	Cloudy
2	30.0	64	7.0	16.0	clear	1018.72	5	Spring	5.5	mountain	Sunny
3	38.0	83	1.5	82.0	clear	1026.25	7	Spring	1.0	coastal	Sunny
4	27.0	74	17.0	66.0	overcast	990.67	1	Winter	2.5	mountain	Rainy
...	...	...	...	...	...	...	...	...	...	...	...
13195	10.0	74	14.5	71.0	overcast	1003.15	1	Summer	1.0	mountain	Rainy
13196	-1.0	76	3.5	23.0	cloudy	1067.23	1	Winter	6.0	coastal	Snowy
13197	30.0	77	5.5	28.0	overcast	1012.69	3	Autumn	9.0	coastal	Cloudy
13198	3.0	76	10.0	94.0	overcast	984.27	0	Winter	2.0	inland	Snowy
13199	-5.0	38	0.0	92.0	overcast	1015.37	5	Autumn	10.0	mountain	Rainy

13200 rows × 11 columns

数据检查与预处理

# 查看数据信息
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13200 entries, 0 to 13199
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Temperature           13200 non-null  float64
 1   Humidity              13200 non-null  int64  
 2   Wind Speed            13200 non-null  float64
 3   Precipitation (%)     13200 non-null  float64
 4   Cloud Cover           13200 non-null  object 
 5   Atmospheric Pressure  13200 non-null  float64
 6   UV Index              13200 non-null  int64  
 7   Season                13200 non-null  object 
 8   Visibility (km)       13200 non-null  float64
 9   Location              13200 non-null  object 
 10  Weather Type          13200 non-null  object 
dtypes: float64(5), int64(2), object(4)
memory usage: 1.1+ MB

# 查看分类特征的唯一值
characteristic = ['Cloud Cover','Season','Location','Weather Type']
for i in characteristic:
    print(f'{i}:')
    print(data[i].unique())
    print('-'*50)

Cloud Cover:
['partly cloudy' 'clear' 'overcast' 'cloudy']
--------------------------------------------------
Season:
['Winter' 'Spring' 'Summer' 'Autumn']
--------------------------------------------------
Location:
['inland' 'mountain' 'coastal']
--------------------------------------------------
Weather Type:
['Rainy' 'Cloudy' 'Sunny' 'Snowy']
--------------------------------------------------

feature_map = {
    'Temperature': '温度',
    'Humidity': '湿度百分比',
    'Wind Speed': '风速',
    'Precipitation (%)': '降水量百分比',
    'Atmospheric Pressure': '大气压力',
    'UV Index': '紫外线指数',
    'Visibility (km)': '能见度'
}
plt.figure(figsize=(15, 10))

for i, (col, col_name) in enumerate(feature_map.items(), 1):
    plt.subplot(2, 4, i)
    sns.boxplot(y=data[col])
    plt.title(f'{col_name}的箱线图', fontsize=14)
    plt.ylabel('数值', fontsize=12)
    plt.grid(axis='y', linestyle='--', alpha=0.7)

plt.tight_layout()
plt.show()

在这里插入图片描述

print(f"温度超过60°C的数据量：{data[data['Temperature'] > 60].shape[0]}，占比{round(data[data['Temperature'] > 60].shape[0] / data.shape[0] * 100,2)}%。")
print(f"湿度百分比超过100%的数据量：{data[data['Humidity'] > 100].shape[0]}，占比{round(data[data['Humidity'] > 100].shape[0] / data.shape[0] * 100,2)}%。")
print(f"降雨量百分比超过100%的数据量：{data[data['Precipitation (%)'] > 100].shape[0]}，占比{round(data[data['Precipitation (%)'] > 100].shape[0] / data.shape[0] * 100,2)}%。")

温度超过60°C的数据量：207，占比1.57%。
湿度百分比超过100%的数据量：416，占比3.15%。
降雨量百分比超过100%的数据量：392，占比2.97%。

print("删前的数据shape：", data.shape)
data = data[(data['Temperature'] <= 60) & (data['Humidity'] <= 100) & (data['Precipitation (%)'] <= 100)]
print("删后的数据shape：", data.shape)

删前的数据shape： (13200, 11)
删后的数据shape： (12360, 11)

数据分析

data.describe(include='all')

	Temperature	Humidity	Wind Speed	Precipitation (%)	Cloud Cover	Atmospheric Pressure	UV Index	Season	Visibility (km)	Location	Weather Type
count	12360.000000	12360.000000	12360.000000	12360.000000	12360	12360.000000	12360.000000	12360	12360.000000	12360	12360
unique	NaN	NaN	NaN	NaN	4	NaN	NaN	4	NaN	3	4
top	NaN	NaN	NaN	NaN	overcast	NaN	NaN	Winter	NaN	mountain	Snowy
freq	NaN	NaN	NaN	NaN	5726	NaN	NaN	5288	NaN	4535	3130
mean	18.071359	66.937460	9.356837	50.864968	NaN	1005.713743	3.791262	NaN	5.535801	NaN	NaN
std	15.804363	19.390333	6.318334	30.967846	NaN	38.300471	3.720638	NaN	3.377554	NaN	NaN
min	-24.000000	20.000000	0.000000	0.000000	NaN	800.120000	0.000000	NaN	0.000000	NaN	NaN
25%	4.000000	56.000000	5.000000	19.000000	NaN	994.587500	1.000000	NaN	3.000000	NaN	NaN
50%	21.000000	69.000000	8.500000	54.000000	NaN	1007.495000	2.000000	NaN	5.000000	NaN	NaN
75%	30.000000	81.000000	13.000000	79.000000	NaN	1016.750000	6.000000	NaN	7.500000	NaN	NaN
max	60.000000	100.000000	48.500000	100.000000	NaN	1199.210000	14.000000	NaN	20.000000	NaN	NaN

plt.figure(figsize=(20, 15))
plt.subplot(3, 4, 1)
sns.histplot(data['Temperature'], kde=True,bins=20)
plt.title('温度分布')
plt.xlabel('温度')
plt.ylabel('频数')

plt.subplot(3, 4, 2)
sns.boxplot(y=data['Humidity'])
plt.title('湿度百分比箱线图')
plt.ylabel('湿度百分比')

plt.subplot(3, 4, 3)
sns.histplot(data['Wind Speed'], kde=True,bins=20)
plt.title('风速分布')
plt.xlabel('风速（km/h）')
plt.ylabel('频数')

plt.subplot(3, 4, 4)
sns.boxplot(y=data['Precipitation (%)'])
plt.title('降雨量百分比箱线图')
plt.ylabel('降雨量百分比')

plt.subplot(3, 4, 5)
sns.countplot(x='Cloud Cover', data=data)
plt.title('云量 (描述)分布')
plt.xlabel('云量 (描述)')
plt.ylabel('频数')

plt.subplot(3, 4, 6)
sns.histplot(data['Atmospheric Pressure'], kde=True,bins=10)
plt.title('大气压分布')
plt.xlabel('气压 (hPa)')
plt.ylabel('频数')

plt.subplot(3, 4, 7)
sns.histplot(data['UV Index'], kde=True,bins=14)
plt.title('紫外线等级分布')
plt.xlabel('紫外线指数')
plt.ylabel('频数')

plt.subplot(3, 4, 8)
Season_counts = data['Season'].value_counts()
plt.pie(Season_counts, labels=Season_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('季节分布')

plt.subplot(3, 4, 9)
sns.histplot(data['Visibility (km)'], kde=True,bins=10)
plt.title('能见度分布')
plt.xlabel('能见度（Km）')
plt.ylabel('频数')

plt.subplot(3, 4, 10)
sns.countplot(x='Location', data=data)
plt.title('地点分布')
plt.xlabel('地点')
plt.ylabel('频数')

plt.subplot(3, 4, (11,12))
sns.countplot(x='Weather Type', data=data)
plt.title('天气类型分布')
plt.xlabel('天气类型')
plt.ylabel('频数')

plt.tight_layout()
plt.show()

在这里插入图片描述

随机森林

new_data = data.copy()
label_encoders = {}
categorical_features = ['Cloud Cover', 'Season', 'Location', 'Weather Type']
for feature in categorical_features:
    le = LabelEncoder()
    new_data[feature] = le.fit_transform(data[feature])
    label_encoders[feature] = le

for feature in categorical_features:
    print(f"'{feature}'特征的对应关系：")
    for index, class_ in enumerate(label_encoders[feature].classes_):
        print(f"  {index}: {class_}")

'Cloud Cover'特征的对应关系：
  0: clear
  1: cloudy
  2: overcast
  3: partly cloudy
'Season'特征的对应关系：
  0: Autumn
  1: Spring
  2: Summer
  3: Winter
'Location'特征的对应关系：
  0: coastal
  1: inland
  2: mountain
'Weather Type'特征的对应关系：
  0: Cloudy
  1: Rainy
  2: Snowy
  3: Sunny

# 构建x，y
x = new_data.drop(['Weather Type'],axis=1)
y = new_data['Weather Type']

# 划分数据集
x_train,x_test,y_train,y_test = train_test_split(x,y,
                                                 test_size=0.3,
                                                 random_state=15) 

# 构建随机森林模型
rf_clf = RandomForestClassifier(random_state=15)
rf_clf.fit(x_train, y_train)

# 使用随机森林进行预测
y_pred_rf = rf_clf.predict(x_test)
class_report_rf = classification_report(y_test, y_pred_rf)
print(class_report_rf)

              precision    recall  f1-score   support

           0       0.88      0.91      0.89       871
           1       0.93      0.91      0.92       983
           2       0.92      0.93      0.92       929
           3       0.92      0.90      0.91       925

    accuracy                           0.91      3708
   macro avg       0.91      0.91      0.91      3708
weighted avg       0.91      0.91      0.91      3708

结果分析

feature_importances = rf_clf.feature_importances_
features_rf = pd.DataFrame({'特征': x.columns, '重要度': feature_importances})
features_rf.sort_values(by='重要度', ascending=False, inplace=True)
plt.figure(figsize=(10, 8))
sns.barplot(x='重要度', y='特征', data=features_rf)
plt.xlabel('重要度')
plt.ylabel('特征')
plt.title('随机森林特征图')
plt.show()