Exploring the Main Factors Affecting Student Stress with GBM and Random Forest Models

Article link: https://blog.csdn.net/m0_67431719/article/details/135594476

1. Project Background

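The dataset covers psychological, physiological, social, environmental, and academic aspects of student life, providing reference data for a closer look at the pressures students face. From sleep quality and study load to environmental and interpersonal influences, it includes about 20 of the most salient features.
This project explores the data mainly through visualization and compares the factors across students' stress levels. Every factor turns out to be significantly correlated with stress level. A Kruskal-Wallis (KW) test then indicates that these factors have a significant effect on stress level, or, put differently, that each factor differs across stress levels. Finally, a gradient boosting machine (GBM) and a random forest model are built to examine the importance of each factor.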
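To make the KW step concrete, here is a minimal sketch of a Kruskal-Wallis test of one feature across the three stress levels. It runs on synthetic stand-in data, not the real dataset, and the helper name `kruskal_by_group` is hypothetical; only `scipy.stats.kruskal` is a real library call.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in data (NOT the real dataset): three stress levels,
# with a feature whose typical value rises with stress.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "stress_level": np.repeat([0, 1, 2], 100),
    "anxiety_level": np.concatenate([
        rng.integers(0, 10, 100),   # low stress
        rng.integers(6, 16, 100),   # medium stress
        rng.integers(12, 22, 100),  # high stress
    ]),
})

def kruskal_by_group(df, feature, group_col="stress_level"):
    """Run a Kruskal-Wallis H-test of `feature` across the groups in `group_col`."""
    samples = [g[feature].to_numpy() for _, g in df.groupby(group_col)]
    return stats.kruskal(*samples)

result = kruskal_by_group(demo, "anxiety_level")
print(f"H = {result.statistic:.2f}, p = {result.pvalue:.3g}")
```

A small p-value suggests that at least one stress group has a different distribution of the feature. The test does not say which groups differ, so a post-hoc comparison would be needed for that.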

2. Data Description

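Field name	Description
anxiety_level	Anxiety level; [0, 21]; a larger number means a higher level
self_esteem	Self-esteem level; [0, 30]; a larger number means a higher level
mental_health_history	History of mental health problems; 1: yes, 0: no
depression	Depression; [0, 27]; a larger number means a higher level
headache	Headache problems; [0, 5]; a larger number means more frequent headaches
blood_pressure	Blood pressure problems; [1, 3]; a larger number means a more serious condition
sleep_quality	Sleep quality; [0, 5]; a larger number means better quality
breathing_problem	Breathing problems; [0, 5]; a larger number means a more serious condition
noise_level	Environmental noise level; [0, 5]; a larger number means a higher level
living_conditions	Living conditions; [0, 5]; a larger number means better conditions
safety	Safety; [0, 5]; a larger number means a higher level
basic_needs	Degree to which basic needs are met; [0, 5]; a larger number means a higher degree
academic_performance	Academic performance; [0, 5]; a larger number means a higher level
study_load	Study load; [0, 5]; a larger number means a higher level
teacher_student_relationship	Teacher-student relationship; [0, 5]; a larger number means a higher level
future_career_concerns	Concerns about future career; [0, 5]; a larger number means a higher level
social_support	Social support; [0, 3]; a larger number means stronger support
peer_pressure	Peer pressure; [0, 5]; a larger number means a higher level
extracurricular_activities	Extracurricular activities; [0, 5]; a larger number means a higher frequency
bullying	Bullying; [0, 5]; a larger number means a higher level
stress_level	Stress level; [0, 2]; a larger number means a higher level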
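The ranges in the field list can double as a quick sanity check on the loaded data. The sketch below is one possible way to do that; `EXPECTED_RANGES` and `out_of_range_columns` are hypothetical names introduced here for illustration, and the small `toy` frame is made up rather than taken from the dataset.

```python
import pandas as pd

# Expected (min, max) per column, transcribed from the field list above.
EXPECTED_RANGES = {
    "anxiety_level": (0, 21),
    "self_esteem": (0, 30),
    "mental_health_history": (0, 1),
    "depression": (0, 27),
    "headache": (0, 5),
    "blood_pressure": (1, 3),
    "sleep_quality": (0, 5),
    "breathing_problem": (0, 5),
    "noise_level": (0, 5),
    "living_conditions": (0, 5),
    "safety": (0, 5),
    "basic_needs": (0, 5),
    "academic_performance": (0, 5),
    "study_load": (0, 5),
    "teacher_student_relationship": (0, 5),
    "future_career_concerns": (0, 5),
    "social_support": (0, 3),
    "peer_pressure": (0, 5),
    "extracurricular_activities": (0, 5),
    "bullying": (0, 5),
    "stress_level": (0, 2),
}

def out_of_range_columns(df, ranges=EXPECTED_RANGES):
    """Return {column: (observed_min, observed_max)} for columns outside their range."""
    problems = {}
    for col, (low, high) in ranges.items():
        if col not in df.columns:
            continue  # skip columns that are absent from this frame
        observed = (int(df[col].min()), int(df[col].max()))
        if observed[0] < low or observed[1] > high:
            problems[col] = observed
    return problems

# Tiny hand-made frame for illustration only (not rows from the dataset).
toy = pd.DataFrame({
    "anxiety_level": [3, 25],   # 25 exceeds the documented maximum of 21
    "blood_pressure": [1, 3],
    "stress_level": [0, 2],
})
print(out_of_range_columns(toy))  # {'anxiety_level': (3, 25)}
```

An empty dictionary would mean every checked column stays within its documented range.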

3. Importing Python Libraries and Reading the Data

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
from sklearn.model_selection import train_test_split 
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.ensemble import RandomForestClassifier
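The imports above hint at the modeling workflow: split the data, fit a gradient boosting classifier and a random forest, then compare feature importances. The following is only a rough sketch of that workflow on synthetic stand-in data. The split ratio, `random_state`, `n_estimators`, and the three borrowed column names are assumptions for illustration, not the author's settings.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the real dataset); column names are borrowed
# from the field list purely for readability.
rng = np.random.default_rng(42)
n = 600
y = rng.integers(0, 3, n)  # stress_level in {0, 1, 2}
X = pd.DataFrame({
    "anxiety_level": np.clip(7 * y + rng.integers(0, 8, n), 0, 21),   # informative
    "sleep_quality": np.clip(4 - y + rng.integers(-1, 2, n), 0, 5),   # informative
    "noise_level": rng.integers(0, 6, n),                             # pure noise
})

# Hold out 20% for evaluation, keeping class proportions similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

models = {
    "GBM": GradientBoostingClassifier(random_state=42),
    "RandomForest": RandomForestClassifier(n_estimators=200, random_state=42),
}
importances = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name)
    print(classification_report(y_test, model.predict(X_test)))
    # Impurity-based importances, normalized by scikit-learn to sum to 1.
    importances[name] = pd.Series(model.feature_importances_, index=X.columns)

print(pd.DataFrame(importances).sort_values("GBM", ascending=False))
```

Impurity-based importances can favor features with many distinct values, so the two rankings are best read as a rough guide, not a definitive ordering.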
In [2]:
# Read the data
data = pd.read_csv("/home/mw/input/stress4628/StressLevelDataset.csv")
4. Data Preview and Data Processing

4.1 Data Preview

In [3]:
# Check the dimensions of the data
data.shape
Out[3]:
(1100, 21)
In [4]:
# Show column types and non-null counts
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1100 entries, 0 to 1099
Data columns (total 21 columns):
anxiety_level                   1100 non-null int64
self_esteem                     1100 non-null int64
mental_health_history           1100 non-null int64
depression                      1100 non-null int64
headache                        1100 non-null int64
blood_pressure                  1100 non-null int64
sleep_quality                   1100 non-null int64
breathing_problem               1100 non-null int64
noise_level                     1100 non-null int64
living_conditions               1100 non-null int64
safety                          1100 non-null int64
basic_needs                     1100 non-null int64
academic_performance            1100 non-null int64
study_load                      1100 non-null int64
teacher_student_relationship    1100 non-null int64
future_career_concerns          1100 non-null int64
social_support                  1100 non-null int64
peer_pressure                   1100 non-null int64
extracurricular_activities      1100 non-null int64
bullying                        1100 non-null int64
stress_level                    1100 non-null int64
dtypes: int64(21)
memory usage: 180.5 KB
In [5]:
# Count missing values in each column
data.isna().sum()
Out[5]:
anxiety_level                   0
self_esteem                     0
mental_health_history           0
depression                      0
headache                        0
blood_pressure                  0
sleep_quality                   0
breathing_problem               0
noise_level                     0
living_conditions               0
safety                          0
basic_needs                     0
academic_performance            0
study_load                      0
teacher_student_relationship    0
future_career_concerns          0
social_support                  0
peer_pressure                   0
extracurricular_activities      0
bullying                        0
stress_level                    0
dtype: int64
In [6]:
# Count duplicated rows
data.duplicated().sum()
Out[6]:
0
In [7]:
# Print the unique values of each column
for i in data.columns.tolist():
    print(f'{i}:')
    print(data[i].unique())
    print('-'*50)
anxiety_level:
[14 15 12 16 20  4 17 13  6  5  9  2 11  7 21  3 18  0  8  1 19 10]
--------------------------------------------------
self_esteem:
[20  8 18 12 28 13 26  3 22 15 23 21 25  1 27  5  6  9 29 30  4 19 16  2
  0 14  7 17 24 11 10]
--------------------------------------------------
mental_health_history:
[0 1]
--------------------------------------------------
depression:
[11 15 14  7 21  6 22 12 27 25  8 24  3  1  0  5 26 20 10  9  2 16  4 13
 18 23 17 19]
--------------------------------------------------
headache:
[2 5 4 3 1 0]
--------------------------------------------------
blood_pressure:
[1 3 2]
--------------------------------------------------
sleep_quality:
[2 1 5 4 3 0]
--------------------------------------------------
breathing_problem:
[4 2 3 1 5 0]
--------------------------------------------------
noise_level:
[2 3 4 1 0 5]
--------------------------------------------------
living_conditions:
[3 1 2 4 5 0]
--------------------------------------------------
safety:
[3 2 4 1 5 0]
--------------------------------------------------
basic_needs:
[2 3 1 4 5 0]
--------------------------------------------------
academic_performance:
[3 1 2 4 5 0]
--------------------------------------------------
study_load:
[2 4 3 5 1 0]
--------------------------------------------------
teacher_student_relationship:
[3 1 2 4 5 0]
--------------------------------------------------
future_career_concerns:
[3 5 2 4 1 0]
--------------------------------------------------
social_support:
[2 1 3 0]
--------------------------------------------------
peer_pressure:
[3 4 5 2 1 0]
--------------------------------------------------
extracurricular_activities:
[3 5 2 4 0 1]
--------------------------------------------------
bullying:
[2 5 1 4 3 0]
--------------------------------------------------
stress_level:
[1 2 0]
--------------------------------------------------
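The per-column loop above prints every distinct value. When only the range and the number of distinct values matter, a more compact summary is possible. This sketch uses a small made-up frame so that it runs on its own; the same `agg` call should work on `data` as well.

```python
import pandas as pd

# Small illustrative frame (made-up values, not rows from the dataset).
toy = pd.DataFrame({
    "headache": [2, 5, 4, 3, 1, 0],
    "blood_pressure": [1, 3, 2, 1, 3, 2],
    "stress_level": [1, 2, 0, 1, 2, 0],
})

# One row per column: smallest value, largest value, number of distinct values.
summary = toy.agg(["min", "max", "nunique"]).T
print(summary)
```

Comparing the `min` and `max` columns against the field descriptions is a quick way to confirm that each variable stays inside its documented range.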
In [8]:
data.head()
Out[8]:

anxiety_level self_esteem mental_health_history depression headache blood_pressure sleep_quality breathing_problem noise_level living_conditions ... basic_needs academic_performance study_load teacher_student_relationship future_career_concerns social_support peer_pressure extracurricular_activities bullying stress_level

0 14 20 0 11 2 1 2
