Python Data Analysis: Factors in High School Students' Academic Success (Student Performance Dataset)

statistican_ABin

Article link: https://blog.csdn.net/m0_62638421/article/details/140079180

I. Research Background

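As education develops and educational resources grow, assessing student performance effectively has become an important topic in education research. Student performance analysis collects and analyzes data generated during learning (such as grades, attendance, and participation) to identify the factors that influence performance, giving administrators and teachers an evidence base for their decisions. With the rapid progress of data science and machine learning in recent years, data-driven education research has become an important means of improving educational quality.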

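In modern education, a student's performance depends not only on intellectual ability but also on family environment, school resources, teacher quality, and the student's own study attitudes and habits. Traditional assessment is usually limited to exam scores and cannot fully reflect a student's learning situation and potential. Analyzing performance along multiple dimensions with data analysis techniques therefore gives a fuller picture of how students are learning, and provides a basis for personalized teaching and for improving educational quality.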

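This study analyzes student performance data to explore the key factors that affect learning outcomes, assess how strongly each factor influences grades, and propose corresponding educational interventions. The aim is to give administrators and teachers data to support their work, and to promote better educational quality and students' all-round development.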

II. Significance of the Study

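Comprehensive assessment of student performance: Analyzing multi-dimensional student data gives a full view of learning performance and reveals problems and strengths that a single exam score cannot, so teachers and parents get a more rigorous and complete evaluation of each student.
Support for personalized teaching: The analysis results make it possible to design teaching plans around each student's learning needs and characteristics, making instruction more targeted and effective and helping every student reach their full potential.
Improving educational quality: Studying the factors that affect student performance helps identify and fix weak points in education, optimize the allocation of educational resources, and raise overall educational quality and teaching effectiveness.
Evidence for education policy: The results of this study can give education administrators data to support policy-making and adjustment, drive educational reform and innovation, and promote both equity and quality in education.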

III. Empirical Analysis

Data and Code

Import the required packages

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import missingno as msno
from matplotlib.ticker import MaxNLocator
from colorama import Fore, init

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, BaggingClassifier, AdaBoostClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

Read the dataset and define a helper that makes tables look nicer

df = pd.read_csv("Student_performance_data _.csv")

def get_pretty_frame(df):
    return df.style.set_table_styles(
        [{'selector': 'thead th', 'props': [('background-color', '#3b528b'),
                                            ('color', 'black'),
                                            ('border', '1px solid #dddddd')]},
         {'selector': 'tbody tr:nth-child(even)', 'props': [('background-color', '#f9f9f9')]},
         {'selector': 'tbody tr:nth-child(odd)', 'props': [('background-color', 'white')]},
         {'selector': 'tbody td', 'props': [('border', '1px solid #dddddd')]}]
    ).set_properties(**{'text-align': 'center'})

Compute descriptive statistics for the dataset

df_summary = df.describe()
get_pretty_frame(df_summary)

Visualize missing values
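
The post shows no code for this step. As a minimal sketch of one way to inspect missing values, the snippet below uses plain pandas on a small made-up frame. The helper name `missing_summary` and the toy data are assumptions for illustration, not the author's code.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    counts = frame.isna().sum()
    percent = counts / len(frame) * 100
    return pd.DataFrame({"missing": counts, "percent": percent})

# Toy data standing in for the real dataset (hypothetical values).
toy = pd.DataFrame({"Age": [15, 16, np.nan, 18], "GPA": [3.1, 2.4, 1.9, 3.8]})
summary = missing_summary(toy)
print(summary)
```

On the real data, `missing_summary(df)` would give the same kind of table, and `msno.matrix(df)` from the `missingno` package imported earlier draws a visual equivalent.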

Check for duplicate rows

df.duplicated().sum()

Next, visualize the data for exploratory analysis

View the distribution of GradeClass:

grade_counts = df['GradeClass'].value_counts()
grades = grade_counts.index.tolist()
counts = grade_counts.values.tolist()
total_counts = sum(counts)
percentages = [(count / total_counts) * 100 for count in counts]

colors = ['orange', 'lightblue', 'green', 'purple', 'coral']
explode = [0.05] * len(grades)  # Slightly "explode" all slices
plt.figure(figsize=(12, 6))
plt.pie(percentages, labels=grades, colors=colors, autopct='%1.1f%%', startangle=140, 
        shadow=True, explode=explode, wedgeprops={'edgecolor': 'black'})

plt.legend(grades, title="GradeClass", loc="center left", bbox_to_anchor=(1, 0, 0.5, 1))

plt.title('GradeClass Distribution', fontsize=15)
plt.axis('equal') 

plt.show()

Next, view the distributions of the categorical variables
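
The plotting code in this post uses two lists, `cat_col` and `num_col`, whose definitions are not shown. One possible way to derive them is sketched below, under the assumption that a column with only a handful of distinct values is categorical. The helper `split_columns`, the `max_levels` threshold, and the toy frame are all hypothetical, and the author's actual lists may differ.

```python
import pandas as pd

def split_columns(frame, target, id_cols=(), max_levels=5):
    """Split feature columns into categorical and numerical lists.

    A column counts as categorical when it has at most `max_levels`
    distinct values; the threshold is an arbitrary assumption.
    """
    features = [c for c in frame.columns if c != target and c not in id_cols]
    cat = [c for c in features if frame[c].nunique() <= max_levels]
    num = [c for c in features if frame[c].nunique() > max_levels]
    return cat, num

# Toy frame with made-up values; column names follow those used in the post.
toy = pd.DataFrame({
    "StudentID": range(1, 9),
    "Tutoring": [0, 1, 0, 1, 0, 1, 0, 1],
    "Absences": [0, 3, 7, 12, 15, 20, 25, 29],
    "GradeClass": [0, 1, 2, 3, 4, 0, 1, 2],
})
cat_col, num_col = split_columns(toy, target="GradeClass", id_cols=("StudentID",))
print(cat_col, num_col)
```

A threshold-based split is only a heuristic: an integer column with few levels, such as an age range of a few years, would be treated as categorical, so the resulting lists are worth checking by hand.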

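# cat_col (the list of categorical column names) is not defined in the post
# and must be set before this point.
# Assumption: the subplot grid mirrors the one used for the numerical variables.
fig, axs = plt.subplots(len(cat_col) // 2 + len(cat_col) % 2, 2, figsize=(12, 6))
axs = axs.flatten()

for i, col in enumerate(cat_col):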
    vc = df[col].value_counts(normalize=True)
    axs[i].bar(vc.index.astype(str), vc, color='#3b528b')
    axs[i].set_title(col, fontsize=10)
    axs[i].yaxis.set_major_formatter(plt.FuncFormatter(lambda x, _: f'{x:.0%}'))
    axs[i].set_xlabel('Category')
    axs[i].set_ylabel('Percentage')

# Remove empty subplots
for j in range(i+1, len(axs)):
    fig.delaxes(axs[j])

plt.suptitle('Categorical Variables Distribution', fontsize=13, y=1.02)
plt.tight_layout()
plt.show()

View the distributions of the numerical variables

fig, axs = plt.subplots(len(num_col) // 2 + len(num_col) % 2, 2, figsize=(12, 6))
axs = axs.flatten()

for i, col in enumerate(num_col):
    axs[i].hist(df[col], bins=30, color='#9ee742', edgecolor='black', density=True)
    axs[i].set_title(col, fontsize=10)
    axs[i].set_xlabel('Value')
    axs[i].set_ylabel('Density')

for j in range(i+1, len(axs)):
    fig.delaxes(axs[j])

plt.suptitle('Numerical Variables Distribution', fontsize=10, y=1.02)
plt.tight_layout()
plt.show()

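Next, run a bivariate analysis
Bivariate analysis is a statistical method for studying the relationship between two variables. It aims to determine whether the two variables are associated and, if so, how strong the association is and in which direction it runs.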
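
To put a number on the strength of association between two categorical variables, one common choice is Cramér's V, computed from the Pearson chi-square statistic of the contingency table. The sketch below is not part of the original analysis. The function `cramers_v` and the toy series are assumptions, and no continuity correction is applied.

```python
import numpy as np
import pandas as pd

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V from the Pearson chi-square statistic (no continuity correction).

    Both variables need at least two observed levels.
    """
    observed = pd.crosstab(x, y).to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts under independence: row total * column total / n.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    k = min(observed.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))

# Made-up series: one pair perfectly associated, one pair independent.
tutoring = pd.Series([0, 0, 1, 1] * 5, name="Tutoring")
passed = pd.Series([0, 0, 1, 1] * 5, name="Passed")
unrelated = pd.Series([0, 1, 0, 1] * 5, name="Unrelated")
v_strong = cramers_v(tutoring, passed)
v_none = cramers_v(tutoring, unrelated)
print(v_strong, v_none)
```

Cramér's V ranges from 0 (no association) to 1 (perfect association) and carries no sign, so it measures strength only. For ordered variables, a correlation coefficient also gives direction.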

First, look at the relationship between Tutoring and GradeClass

crosstab_gender = pd.crosstab(df['Tutoring'], df['GradeClass'])
crosstab_gender = crosstab_gender.div(crosstab_gender.sum(axis=1), axis=0) * 100

ax_gender = crosstab_gender.plot(kind='bar', stacked=True)
plt.xlabel('Tutoring (0 = No Tutoring, 1 = Tutoring)')
plt.ylabel('Percentage')
plt.title('Distribution of GradeClass Based on Tutoring')
plt.legend(title='Target')
plt.xticks(rotation=0)

for p in ax_gender.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy() 
    ax_gender.annotate(f'{height:.1f}%', (x + width / 2, y + height / 2), 
                ha='center', va='center', fontsize=10, color='white')

plt.show()

Relationship between Gender and GradeClass
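
The post gives no code for this comparison. A minimal sketch, assuming it follows the same row-percentage cross-tabulation used for Tutoring, is shown below. The helper `row_percentages` and the toy values are hypothetical.

```python
import pandas as pd

def row_percentages(frame, row, col):
    """Cross-tabulate two columns and convert each row to percentages."""
    table = pd.crosstab(frame[row], frame[col])
    return table.div(table.sum(axis=1), axis=0) * 100

# Made-up values; column names follow those used in the post.
toy = pd.DataFrame({
    "Gender": [0, 0, 0, 0, 1, 1, 1, 1],
    "GradeClass": [0, 1, 1, 2, 0, 0, 2, 2],
})
shares = row_percentages(toy, "Gender", "GradeClass")
print(shares)
```

On the real data, `row_percentages(df, 'Gender', 'GradeClass').plot(kind='bar', stacked=True)` would produce a stacked bar chart comparable to the Tutoring one.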

Next, look at the correlation analysis

corr = df.corr()

target_corr = corr['GradeClass'].drop('GradeClass')

sns.set(font_scale=1.2)
sns.set_style("white")
sns.set_palette("PuBuGn_d")
sns.heatmap(target_corr.to_frame(), cmap="BrBG", annot=True, fmt='.2f')
plt.title('Correlation with GradeClass Column')
plt.show()

Next, inspect the data for outliers

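# Assumptions: the post does not show how df1, fig, or axs are created for this step.
# df1 is taken to be a working copy of df, and the grid mirrors the histogram layout above.
df1 = df.copy()
fig, axs = plt.subplots(len(num_col) // 2 + len(num_col) % 2, 2, figsize=(12, 6))
axs = axs.flatten()

for i, col in enumerate(num_col):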
    axs[i].boxplot(df1[col])
    axs[i].set_title(col, fontsize=10)
    axs[i].set_ylabel('Value')

for j in range(i+1, len(axs)):
    fig.delaxes(axs[j])

plt.suptitle('Boxplot of Numerical Variables', fontsize=13, y=1.02)
plt.tight_layout()
plt.show()
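
Box plots flag outliers visually. To count them, one option is the same 1.5 × IQR rule that box plot whiskers use by default. The sketch below is an illustration on made-up data, and the helper `iqr_outlier_counts` is an assumption rather than part of the original post.

```python
import pandas as pd

def iqr_outlier_counts(frame, cols, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each column."""
    counts = {}
    for col in cols:
        q1, q3 = frame[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask = (frame[col] < q1 - k * iqr) | (frame[col] > q3 + k * iqr)
        counts[col] = int(mask.sum())
    return pd.Series(counts, name="outliers")

# Made-up values with one obvious outlier in Absences.
toy = pd.DataFrame({
    "Absences": [1, 2, 3, 4, 5, 6, 7, 8, 100],
    "StudyTimeWeekly": [4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0],
})
result = iqr_outlier_counts(toy, ["Absences", "StudyTimeWeekly"])
print(result)
```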

Next, standardize the data and build the models
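
The modeling code relies on `X`, `y`, and `preprocessor`, none of which are defined in the post. A minimal sketch of how they might be built with the imported `ColumnTransformer`, `StandardScaler`, and `OneHotEncoder` is given below. The helper `build_preprocessor`, the choice of columns, and the toy data are assumptions, so the author's actual setup may differ.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def build_preprocessor(num_col, cat_col):
    """Standardize numerical columns and one-hot encode categorical ones."""
    return ColumnTransformer(transformers=[
        ("num", StandardScaler(), num_col),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_col),
    ])

# Made-up values; column names follow those used in the post.
toy = pd.DataFrame({
    "StudyTimeWeekly": [2.0, 5.5, 9.0, 12.5, 16.0, 19.5],
    "Absences": [0, 5, 10, 15, 20, 25],
    "Tutoring": [0, 1, 0, 1, 0, 1],
    "GradeClass": [4, 3, 2, 2, 1, 0],
})
X = toy.drop(columns=["GradeClass"])   # features (assumed: everything but the target)
y = toy["GradeClass"]                  # target
preprocessor = build_preprocessor(["StudyTimeWeekly", "Absences"], ["Tutoring"])
transformed = preprocessor.fit_transform(X)
print(transformed.shape)
```

Placing the preprocessor inside a `Pipeline`, as the loop in this section does, means the scaler and encoder are fitted on the training split only, which avoids leaking test-set statistics into training.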

models = {
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'Gradient Boosting': GradientBoostingClassifier(),
    'SVC': SVC(),
    'LGBM': LGBMClassifier(),
    'Bagging': BaggingClassifier(),
    'XGB': XGBClassifier(),
    'AdaBoost': AdaBoostClassifier()
}

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for model_name, model in models.items():
    print(f"Evaluating {model_name}...")
    clf = Pipeline(steps=[('preprocessor', preprocessor),
                          ('classifier', model)])
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"{model_name}: Accuracy = {accuracy:.4f}")
    print("Classification Report:")
    print(classification_report(y_test, y_pred))
    
   
    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=model.classes_, yticklabels=model.classes_)
    plt.xlabel('Predicted')
    plt.ylabel('True')
    plt.title(f'Confusion Matrix for {model_name}')
    plt.show()

Decision tree results