9 - L7/L8: Machine Learning | Random Forest | Diabetes Exploration and Prediction Case Study

Article link: https://blog.csdn.net/hazel_cjx/article/details/141224051

🍨 This post is a learning log from the 🔗365天深度学习训练营 (365-Day Deep Learning Training Camp)
🍖 Original author: K同学啊

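1. What is a random forest?

A random forest (Random Forest, RF) is an ensemble algorithm built from decision trees. It uses the Bagging method and is mainly applied to classification and regression problems: it trains many decision trees and combines their outputs into a single prediction. Random forests typically offer good accuracy and robustness, handle high-dimensional data well, and are less prone to overfitting than a single decision tree.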

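1.1 Basic concepts:

Decision tree (Decision Tree): the basic building block of a random forest. Each tree repeatedly splits the data on feature values, building up a model that predicts the value of the target variable.
Ensemble learning (Ensemble Learning): a technique that improves predictive performance by combining the outputs of several models. A random forest makes its decision by combining many decision trees.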

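1.2 How it works:

Sample selection (Bootstrap Sampling): when the forest is built, each decision tree is trained on a random sample drawn from the original training data. The sampling is done with replacement, so the same sample may be selected more than once.
Feature selection (Feature Selection): while each tree is grown, the algorithm considers only a randomly chosen subset of the features when searching for the best split at a node. This randomness is why different trees trained on the same dataset can end up different.
Prediction and voting (Prediction and Voting): in classification, every tree predicts independently and the trees then vote; the class with the most votes becomes the final prediction. In regression, every tree outputs a value and the final result is the average of those values.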
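
The three steps above can be sketched by hand with plain decision trees. This is only an illustration under assumptions: the data is synthetic (`make_classification`), the labels are binary, and names such as `n_trees` and `forest_pred` are made up for this sketch. scikit-learn's `RandomForestClassifier` does the same kind of work internally, although it averages the trees' predicted probabilities rather than counting hard votes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data stands in for a real training set (assumption).
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)
rng = np.random.default_rng(0)
n_trees = 25  # odd, so a binary vote can never tie

trees = []
for _ in range(n_trees):
    # Bootstrap sampling: draw as many row indices as there are rows, with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    # max_features="sqrt": each split only looks at a random subset of the features.
    tree = DecisionTreeClassifier(max_features="sqrt",
                                  random_state=int(rng.integers(0, 2**31 - 1)))
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Voting: every tree predicts, then the majority class wins for each sample.
all_preds = np.stack([t.predict(X) for t in trees])   # shape (n_trees, n_samples)
votes_for_1 = all_preds.sum(axis=0)                   # labels are 0/1 here
forest_pred = (votes_for_1 > n_trees / 2).astype(int)
print("agreement with labels on the training data:", (forest_pred == y).mean())
```

For regression the last step would be a mean over the trees' outputs instead of a vote.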

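1.3 Advantages:

Lower risk of overfitting: because the prediction comes from many decision trees combined, a random forest reduces the overfitting risk of any single tree.
Handles high-dimensional data: random forests cope well with datasets that have many features, and usually still perform well when features are correlated with each other.
Robustness: random forests are fairly robust to noise and outliers in the data.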
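
One way to see the overfitting point for yourself is to compare a single tree with a forest on the same noisy data and look at the gap between training and test accuracy. The sketch below uses synthetic data with some label noise (`flip_y=0.1`), which is an assumption made purely for illustration; no particular scores are claimed, and results on a real dataset will differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, deliberately noisy data (assumption for illustration).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Accuracy on the data each model has seen versus data it has not seen.
scores = {
    "tree_train": tree.score(X_tr, y_tr),
    "tree_test": tree.score(X_te, y_te),
    "forest_train": forest.score(X_tr, y_tr),
    "forest_test": forest.score(X_te, y_te),
}
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

A large drop from the training score to the test score is the usual sign of overfitting.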

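2. Reading the data

This project uses a synthetic weather dataset that simulates four weather types: rainy, sunny, cloudy, and snowy. The analysis handles outliers, explores the data with descriptive statistics, builds a random forest model for prediction, and plots the model's feature importances. The project is suitable for beginners who want to learn how to carry out a complete data analysis and build a machine learning model. The columns used later in this post are Temperature, Humidity, Wind Speed, Precipitation (%), Cloud Cover, Atmospheric Pressure, UV Index, Season, Visibility (km), Location, and Weather Type.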

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

#data = pd.read_csv('./data/weather_classification_data.csv')
data = pd.read_csv('weather_classification_data.csv')
data

3. Data checking and preprocessing

data.info()

# Show the unique values of each categorical feature
characteristic = ['Cloud Cover','Season','Location','Weather Type']
for i in characteristic:
    print(f'{i}:')  
    print(data[i].unique())
    print('-'*50)

feature_map = {
    'Temperature': 'Temp',
    'Humidity': 'Humidity (%)',
    'Wind Speed': 'Wind (km/h)',
    'Precipitation (%)': 'Precipitation',
    'Atmospheric Pressure': 'Pressure (hPa)',
    'UV Index': 'UV Index',
    'Visibility (km)': 'Visibility (km)'
}
plt.figure(figsize=(15,10))

for i, (col, col_name) in enumerate(feature_map.items(), 1):
    plt.subplot(2,4,i)
    sns.boxplot(y=data[col])
    plt.title(f'{col_name} Boxplot', fontsize=14)
    plt.ylabel('Value', fontsize=12)
    plt.grid(axis='y', linestyle='--', alpha=0.7)

plt.tight_layout()
plt.show()

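Temperature: many values fall well outside any plausible range. Values above 60°C are treated as outliers here and need to be handled.
Humidity and precipitation percentages: some values exceed 100%. Anything above 100% is treated as an outlier and needs to be handled.
Wind speed: the high values may come from extreme weather such as typhoons or tornadoes, so they are left as is.
Atmospheric pressure: the outliers may be caused by high-altitude locations or by weather phenomena such as low-pressure systems, so they are also left as is.
Visibility: low visibility may be caused by haze, rain, or snow. Such values are normal under those conditions, so they are left as is.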

print(f"Rows with temperature above 60°C: {data[data['Temperature'] > 60].shape[0]}, {round(data[data['Temperature'] > 60].shape[0] / data.shape[0] * 100,2)}% of the data.")
print(f"Rows with humidity above 100%: {data[data['Humidity'] > 100].shape[0]}, {round(data[data['Humidity'] > 100].shape[0] / data.shape[0] * 100,2)}% of the data.")
print(f"Rows with precipitation above 100%: {data[data['Precipitation (%)'] > 100].shape[0]}, {round(data[data['Precipitation (%)'] > 100].shape[0] / data.shape[0] * 100,2)}% of the data.")

print("Shape before removal:", data.shape)
data = data[(data['Temperature'] <= 60) & (data['Humidity'] <= 100) & (data['Precipitation (%)'] <= 100)]
print("Shape after removal:", data.shape)
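
The same filtering can be wrapped in a small helper so the thresholds live in one place. This is a sketch, not the original author's code: the function name `drop_above` and the toy DataFrame are hypothetical, and only the three thresholds (60, 100, 100) come from the text above.

```python
import pandas as pd

def drop_above(df, upper_limits):
    """Report and drop rows that exceed any of the per-column upper limits."""
    mask = pd.Series(True, index=df.index)
    for col, limit in upper_limits.items():
        n_bad = int((df[col] > limit).sum())
        print(f"{col}: {n_bad} rows above {limit} ({n_bad / len(df) * 100:.2f}%)")
        mask &= df[col] <= limit   # keep only rows within the limit
    return df[mask]

# Toy data (hypothetical values) with one violation per column.
toy = pd.DataFrame({
    "Temperature": [12, 35, 80, 20],
    "Humidity": [55, 101, 60, 70],
    "Precipitation (%)": [10, 20, 30, 120],
})
limits = {"Temperature": 60, "Humidity": 100, "Precipitation (%)": 100}
cleaned = drop_above(toy, limits)
print("Shape before:", toy.shape, "after:", cleaned.shape)
```

As with the original filter, a row with a missing value in one of these columns is also dropped, because a comparison against NaN evaluates to False.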

4. Data analysis

data.describe(include='all')  # summary statistics for every column, regardless of dtype

plt.figure(figsize=(20, 15))

plt.subplot(3, 4, 1)
sns.histplot(data['Temperature'], kde=True, bins=20)
plt.title('Temperature Distribution')
plt.xlabel('Temperature')
plt.ylabel('Frequency')

plt.subplot(3, 4, 2)
sns.boxplot(y=data['Humidity'])
plt.title('Humidity Percentage Boxplot')
plt.ylabel('Humidity Percentage')

plt.subplot(3, 4, 3)
sns.histplot(data['Wind Speed'], kde=True, bins=20)
plt.title('Wind Speed Distribution')
plt.xlabel('Wind Speed (km/h)')
plt.ylabel('Frequency')

plt.subplot(3, 4, 4)
sns.boxplot(y=data['Precipitation (%)'])
plt.title('Precipitation Percentage Boxplot')
plt.ylabel('Precipitation Percentage')

plt.subplot(3, 4, 5)
sns.countplot(x='Cloud Cover', data=data)
plt.title('Cloud Cover (Description) Distribution')
plt.xlabel('Cloud Cover (Description)')
plt.ylabel('Frequency')

plt.subplot(3, 4, 6)
sns.histplot(data['Atmospheric Pressure'], kde=True, bins=10)
plt.title('Atmospheric Pressure Distribution')
plt.xlabel('Pressure (hPa)')
plt.ylabel('Frequency')

plt.subplot(3, 4, 7)
sns.histplot(data['UV Index'], kde=True, bins=14)
plt.title('UV Index Distribution')
plt.xlabel('UV Index')
plt.ylabel('Frequency')

plt.subplot(3, 4, 8)
Season_counts = data['Season'].value_counts()
plt.pie(Season_counts, labels=Season_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('Season Distribution')

plt.subplot(3, 4, 9)
sns.histplot(data['Visibility (km)'], kde=True, bins=10)
plt.title('Visibility Distribution')
plt.xlabel('Visibility (km)')
plt.ylabel('Frequency')

plt.subplot(3, 4, 10)
sns.countplot(x='Location', data=data)
plt.title('Location Distribution')
plt.xlabel('Location')
plt.ylabel('Frequency')

plt.subplot(3, 4, (11, 12))
sns.countplot(x='Weather Type', data=data)
plt.title('Weather Type Distribution')
plt.xlabel('Weather Type')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

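● Temperature: the data are concentrated in a plausible range (mainly 0°C to 40°C), and the extreme highs (>60°C) have been removed. The distribution is slightly right-skewed: lower temperatures are more common.
● Humidity: values lie in a plausible range (20% to 100%). The median and mean are close, so the distribution is roughly symmetric.
● Wind speed: concentrated in the low range (0-20 km/h). Extreme high-wind events are rare, so the distribution is right-skewed, with low wind speeds far more common.
● Precipitation: fairly evenly distributed, with a median of 54%, reflecting the chance of precipitation across weather conditions.
● Atmospheric pressure: mainly within the standard range (990-1020 hPa). The distribution looks regular, with no obvious anomalies.
● UV index: mostly low. Extremely high values are rare, which indicates a low UV risk most of the time.
● Visibility: mostly concentrated around 5 km, reflecting moderate visibility in most cases.
● Cloud cover: overcast occurs relatively often in the dataset.
● Season: winter has the most records, which may reflect when the data were collected or the regional climate.
● Location: records come mainly from mountain and inland areas, which may affect the distribution of weather types and other meteorological features.
● Weather type: fairly evenly distributed, with no single class dominating.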
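
Claims about skew and symmetry are easy to check numerically instead of by eye. The sketch below uses a handful of made-up wind-speed readings (not values from the dataset): a positive `skew()` means the long tail is on the right, and in that case the mean is pulled above the median.

```python
import pandas as pd

# Toy readings (hypothetical): mostly low values plus one large one.
wind = pd.Series([2, 3, 3, 4, 4, 5, 5, 6, 8, 30])

skewness = wind.skew()                       # > 0: tail on the right (right-skewed)
mean, median = wind.mean(), wind.median()
print(f"skew={skewness:.2f}, mean={mean:.2f}, median={median:.2f}")
```

On the real data the same two lines can be applied per column, for example `data['Wind Speed'].skew()`.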

5. Random forest

new_data = data.copy()
label_encoders = {}
categorical_features = ['Cloud Cover', 'Season', 'Location', 'Weather Type']

for feature in categorical_features:
    le = LabelEncoder()  # create one LabelEncoder per categorical feature
    new_data[feature] = le.fit_transform(data[feature])  # label-encode the feature and store the codes in the matching column of new_data
    label_encoders[feature] = le  # keep the fitted encoder in label_encoders for later use

for feature in categorical_features:
    print(f"Mapping for '{feature}' feature:")
    for index, class_ in enumerate(label_encoders[feature].classes_):
        print(f"  {index}: {class_}")
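
To make the mapping concrete: `LabelEncoder` sorts the distinct labels and numbers them from 0, and the fitted encoder can turn codes back into labels. The label strings below are assumed for illustration; the spelling in the real dataset may differ.

```python
from sklearn.preprocessing import LabelEncoder

# Assumed label strings (illustration only).
weather = ["Rainy", "Sunny", "Cloudy", "Snowy", "Sunny"]

le = LabelEncoder()
codes = le.fit_transform(weather)

# classes_ is sorted, so the position of each label is its code.
mapping = {str(label): code for code, label in enumerate(le.classes_)}
decoded = le.inverse_transform([0, 3]).tolist()

print(codes.tolist())   # [1, 3, 0, 2, 3]
print(mapping)          # {'Cloudy': 0, 'Rainy': 1, 'Snowy': 2, 'Sunny': 3}
print(decoded)          # ['Cloudy', 'Sunny']
```

Keeping the fitted encoders, as `label_encoders` does above, is what makes it possible to report predictions as readable weather types later.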

# Build x and y
x = new_data.drop(['Weather Type'], axis=1)
y = new_data['Weather Type']

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y,
                                                    test_size=0.3,
                                                    random_state=15)

# Build the random forest model
rf_clf = RandomForestClassifier(random_state=15)
rf_clf.fit(x_train, y_train)

# Predict with the random forest
y_pred_rf = rf_clf.predict(x_test)
class_report_rf = classification_report(y_test, y_pred_rf)
print(class_report_rf)
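
Alongside the text report, a confusion matrix shows which classes get mixed up with which. The example below runs on a few hypothetical labels and predictions so the numbers can be checked by hand; on the real model the inputs would be `y_test` and `y_pred_rf`.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical true labels and predictions for a 3-class problem.
y_true = [0, 0, 1, 1, 2, 2, 2, 1]
y_pred = [0, 1, 1, 1, 2, 2, 0, 1]

acc = accuracy_score(y_true, y_pred)            # 6 of 8 correct
cm = confusion_matrix(y_true, y_pred)           # rows: true class, columns: predicted class
report = classification_report(y_true, y_pred, output_dict=True)

print(acc)                      # 0.75
print(cm.tolist())              # [[1, 1, 0], [0, 3, 0], [1, 0, 2]]
print(report["1"]["recall"])    # 1.0
```

Passing `output_dict=True` returns the report as a nested dictionary, which is convenient when the scores need to be logged or compared between models.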

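6. Results analysis

# Importance scores are usually based on how much a feature reduces impurity
# (e.g. Gini impurity or information gain) at the split nodes of each tree.
# The more effectively a feature's splits separate the data, the higher its score.
feature_importances = rf_clf.feature_importances_
features_rf = pd.DataFrame({'Feature': x.columns, 'Importance': feature_importances})
features_rf.sort_values(by='Importance', ascending=False, inplace=True)

plt.figure(figsize=(10, 8))
sns.barplot(x='Importance', y='Feature', data=features_rf)
# The scores sum to 1. A higher score means a larger influence on the model's predictions;
# for example, a score of 0.30 for temperature would mean it accounts for 30% of the total.
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.title('Random Forest Feature Importance')
plt.show()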

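The random forest model predicts with high accuracy (see the classification report printed above), and the feature importance analysis shows that the main factors driving the model are temperature, humidity, UV index, visibility, and atmospheric pressure.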
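
Impurity-based importances are computed from the training data, so it can be worth cross-checking them with permutation importance on held-out data. The sketch below does this on synthetic data, which is an assumption for illustration; `permutation_importance` shuffles one feature at a time and measures how much the test score drops.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data stands in for the weather features (assumption).
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

impurity_imp = model.feature_importances_        # sums to 1 across features
perm = permutation_importance(model, X_te, y_te, n_repeats=5, random_state=0)
perm_imp = perm.importances_mean                 # mean score drop per shuffled feature

for i, (a, b) in enumerate(zip(impurity_imp, perm_imp)):
    print(f"feature {i}: impurity={a:.3f}, permutation={b:.3f}")
```

If the two rankings broadly agree, that is some reassurance that the importance plot reflects what the model actually relies on.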

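7. Summary

For the analysis and prediction on the weather_classification_data dataset, we carried out the following key steps and reached some useful conclusions:

● Data visualization:

We used several chart types (histograms, box plots, a pie chart, and so on) to visualize temperature, humidity, wind speed, precipitation, cloud cover, atmospheric pressure, UV index, season, visibility, location, and weather type. These charts gave us an intuitive view of each feature's distribution and of possible outliers.

● Categorical feature encoding:

We encoded the categorical features (cloud cover, season, location, weather type) with LabelEncoder. This step converts categories into numbers so the model can be trained on them. We also printed the encoding map for each categorical feature to make the encoded data easier to interpret.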

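● Feature importance analysis:

Using the random forest model, we computed an importance score for every feature. This analysis identifies the features with the most influence on the model's predictions; here, temperature, humidity, and UV index are among the key factors for predicting the weather type.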

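● Model prediction:

We trained a random forest classifier to predict the weather type. Based on the various weather-related features, the model gives a reasonably accurate classification.

Conclusion

By analyzing weather_classification_data, we gained a solid understanding of each feature's distribution and identified the features that matter most for the prediction. These results are a useful reference for further tuning the model and improving its accuracy. The random forest model not only predicts well but also, through feature importance analysis, helps explain how the model reaches its decisions. Future work could analyze specific weather patterns in more detail to further improve the model's performance.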

Diabetes Exploration and Prediction Case Study

Goal: predict whether a person has diabetes.

1. Data preprocessing

1.1 Importing the data

import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt
plt.rcParams['savefig.dpi'] = 500  # DPI of saved figures
plt.rcParams['figure.dpi'] = 500   # DPI of displayed figures

import warnings
warnings.filterwarnings("ignore")

dia = pd.read_excel('dia.xls')
dia.head()

1.2 Data checks

# Check for missing values
print('Missing values---------------------------------')
print(dia.isnull().sum())

# Check for duplicate rows
print('Duplicate rows---------------------------------')
print(f'Number of duplicate rows: {dia.duplicated().sum()}')
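
To show what these two checks report, and what a follow-up cleaning step could look like, here is a tiny example on a hypothetical DataFrame. The column names and values are made up; whether to drop or impute missing values in the real data is a separate decision that this sketch does not settle.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical): one missing value and one exact duplicate row.
toy = pd.DataFrame({
    "age": [50, 61, 61, np.nan],
    "label": [0, 1, 1, 0],
})

missing_per_column = toy.isnull().sum()        # count of NaN per column
n_duplicates = int(toy.duplicated().sum())     # rows identical to an earlier row

# One possible cleanup: remove duplicates, then rows with missing values.
cleaned = toy.drop_duplicates().dropna()

print(missing_per_column.to_dict())   # {'age': 1, 'label': 0}
print(n_duplicates)                   # 1
print(cleaned.shape)                  # (2, 2)
```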

2. Data analysis

2.1 Distribution analysis

feature_map = {
    '年龄': 'Age',
    '低密度脂蛋白胆固醇': 'Low-Density Lipoprotein Cholesterol',
    '极低密度脂蛋白胆固醇': 'Very Low-Density Lipoprotein Cholesterol',
    '甘油三酯': 'Triglycerides',
    '总胆固醇': 'Total Cholesterol',
    '脉搏': 'Pulse',
    '舒张压': 'Diastolic Blood Pressure',
    '高血压史': 'History of Hypertension',
    '尿素氮': 'Blood Urea Nitrogen',
    '尿酸': 'Uric Acid',
    '肌酐': 'Creatinine',
    '体重检查结果': 'Weight Check Results'
}

plt.figure(figsize=(15, 10))

for i, (col, col_name) in enumerate(feature_map.items(), 1):
    plt.subplot(3, 4, i)
    sns.boxplot(x='是否糖尿病', y=col, data=dia)
    plt.title(f'{col_name} Boxplot', fontsize=14)
    plt.xlabel('Diabetes Status', fontsize=12)  # optional: English x-axis label, because Chinese text could not be rendered
    plt.ylabel('Value', fontsize=12)
    plt.grid(axis='y', linestyle='--', alpha=0.7)

plt.tight_layout()
plt.show()

2.2 Correlation analysis

import plotly.express as px

# Drop the '卡号' (card number) column
dia.drop(columns=['卡号'], inplace=True)
# Compute the pairwise correlation coefficients between columns
df_corr = dia.corr()

# Helper that renders a correlation matrix as a heatmap
def corr_generate(df):
    fig = px.imshow(df,text_auto=True,aspect="auto",color_continuous_scale='RdBu_r')
    fig.show()

# Render the correlation matrix
corr_generate(df_corr)
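
When the heatmap has many columns, it can be easier to read just the column of correlations with the target, ordered by strength. The sketch below does this on a made-up DataFrame; the column names are hypothetical, and on the real data the target column would be '是否糖尿病'.

```python
import pandas as pd

# Toy data (hypothetical): one feature rises with the target, one falls.
toy = pd.DataFrame({
    "feature_up": [1, 2, 3, 4, 5, 6],
    "feature_down": [6, 5, 4, 3, 2, 1],
    "target": [0, 0, 0, 1, 1, 1],
})

# Restrict to numeric columns, then take each feature's correlation with the target.
corr = toy.select_dtypes(include="number").corr()["target"].drop("target")
# Order by absolute strength while keeping the sign visible.
corr_with_target = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(corr_with_target)
```

The sign shows the direction of the relationship and the absolute value its strength, so a strongly negative feature is as informative as a strongly positive one.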

3. Random forest model

3.1 Building the dataset

# The '高密度脂蛋白胆固醇' (HDL cholesterol) column is negatively correlated with diabetes, so it is dropped from X
X = dia.drop(['是否糖尿病', '高密度脂蛋白胆固醇'], axis=1)
y = dia['是否糖尿病']

train_X, test_X, train_y, test_y = train_test_split(X, y, 
                                                    test_size=0.2,
                                                    random_state=1)
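
If the two classes are imbalanced, a plain random split can leave the test set with a different class ratio than the full data. A possible refinement, not used in the original post, is the `stratify` argument of `train_test_split`. The toy arrays below are hypothetical and only show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 80 negatives and 20 positives.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

# stratify=y keeps the 80/20 class ratio in both the training and the test part.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)
print("positives in train:", int(y_tr.sum()), "of", len(y_tr))
print("positives in test:", int(y_te.sum()), "of", len(y_te))
```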

3.2 Defining the model

from sklearn.ensemble import RandomForestClassifier

# Build the random forest model
rf_clf = RandomForestClassifier(random_state=15)
rf_clf.fit(train_X, train_y)

4. Model evaluation

4.1 Performance evaluation

from sklearn.metrics import classification_report

# Predict with the random forest
pred_y_rf = rf_clf.predict(test_X)
class_report_rf = classification_report(test_y, pred_y_rf)
print(class_report_rf)
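
A single train/test split gives one estimate that depends on how the rows happened to be divided. As a complement, k-fold cross-validation averages over several splits. The sketch below runs on synthetic data (an assumption for illustration) with F1 as the metric; on the diabetes data the inputs would be `X` and `y` from section 3.1.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary data stands in for the diabetes features (assumption).
X, y = make_classification(n_samples=400, n_features=10, n_informative=4,
                           random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=15)
# Five folds: train on four parts, score on the fifth, and rotate.
scores = cross_val_score(model, X, y, cv=5, scoring="f1")

print("F1 per fold:", scores.round(3))
print(f"mean={scores.mean():.3f}, std={scores.std():.3f}")
```

The spread across folds gives a rough sense of how stable the reported score is.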