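Exam Score Analysis
📖 Background
Your best friend is an administrator at a large school. The school requires every student to take year-end math, reading, and writing exams.
Since you recently learned data manipulation and visualization, you offer to help your friend analyze the exam results. The school's principal wants to know whether the test preparation course helps. She also wants to explore how parental education level relates to exam scores.
💾 Data
The file has the following fields: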
- “gender” - male / female
- “race/ethnicity” - one of 5 combinations of race/ethnicity
- “parent_education_level” - highest education level of either parent
- “lunch” - whether the student receives free/reduced or standard lunch
- “test_prep_course” - whether the student took the test preparation course
- “math” - exam score in math
- “reading” - exam score in reading
- “writing” - exam score in writing
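💪 Challenge
Create a report that answers the principal's questions. Include:
- What are the average reading scores for students with/without the test preparation course?
- What are the average scores for each parental education level?
- Compare the average scores of students with/without the test preparation course across parental education levels.
- The principal wants to know whether students who do well in one subject also score well in the others. Look at the correlations between the scores.
Exploratory Data Analysis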
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
df = pd.read_csv('exams.csv')
df.head()
| | gender | race/ethnicity | parent_education_level | lunch | test_prep_course | math | reading | writing |
|---|---|---|---|---|---|---|---|---|
| 0 | female | group B | bachelor's degree | standard | none | 72 | 72 | 74 |
| 1 | female | group C | some college | standard | completed | 69 | 90 | 88 |
| 2 | female | group B | master's degree | standard | none | 90 | 95 | 93 |
| 3 | male | group A | associate's degree | free/reduced | none | 47 | 57 | 44 |
| 4 | male | group C | some college | standard | none | 76 | 78 | 75 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 gender 1000 non-null object
1 race/ethnicity 1000 non-null object
2 parent_education_level 1000 non-null object
3 lunch 1000 non-null object
4 test_prep_course 1000 non-null object
5 math 1000 non-null int64
6 reading 1000 non-null int64
7 writing 1000 non-null int64
dtypes: int64(3), object(5)
memory usage: 62.6+ KB
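Before comparing group averages, it can help to check how many students fall into each group, since small groups give noisier means. The following is a minimal sketch, assuming the column names listed in the data section. The helper name `group_sizes` and the tiny demo frame are hypothetical and not part of the original notebook.

```python
import pandas as pd


def group_sizes(frame: pd.DataFrame, columns: list[str]) -> dict[str, pd.Series]:
    """Return the number of rows per category for each requested column."""
    return {col: frame[col].value_counts() for col in columns}


# Hypothetical demo data; with the real file, pass the loaded `df` instead.
demo = pd.DataFrame({
    "test_prep_course": ["none", "completed", "none", "none"],
    "parent_education_level": [
        "high school", "some college", "high school", "master's degree",
    ],
})

sizes = group_sizes(demo, ["test_prep_course", "parent_education_level"])
for counts in sizes.values():
    print(counts, end="\n\n")
```

On the real data this would be called as `group_sizes(df, ['test_prep_course', 'parent_education_level'])`.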
Q1: What are the average reading scores for students with/without the test preparation course?
df.groupby('test_prep_course')['reading'].mean()
test_prep_course
completed 73.893855
none 66.534268
Name: reading, dtype: float64
df.groupby('test_prep_course')['reading'].mean().plot(kind='bar')
plt.ylabel('score')
plt.title('the average reading scores for students with/without the test preparation course');
with_test_prep = df[df['test_prep_course'] == 'completed']
without_test_prep = df[df['test_prep_course'] == 'none']
fig, ax = plt.subplots(1, 3, figsize=(15,6), sharey=True)
cols = ['math', 'reading', 'writing']
for i, col in enumerate(cols):
    # Solid line: completed the prep course; dashed line: did not.
    sns.kdeplot(with_test_prep[col], ax=ax[i], label=f'{col} with test prep')
    sns.kdeplot(without_test_prep[col], linestyle='--', ax=ax[i], label=f'{col} without test prep')
    ax[i].legend()
plt.suptitle('KDE Plots of Exam Scores With and Without Preparation', fontsize = 20, fontweight = 'bold');
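- Students who completed the test preparation course scored higher in all three subjects.
- Students who completed the course averaged about 74 (73.89) in reading, versus about 66.5 (66.53) for students who did not.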
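The comparison above can be put into one table covering all three subjects. This is a minimal sketch, assuming the column names from the data section. The function name `prep_gap` and the demo values are hypothetical illustrations, not results from `exams.csv`.

```python
import pandas as pd

SUBJECTS = ["math", "reading", "writing"]


def prep_gap(frame: pd.DataFrame) -> pd.DataFrame:
    """Mean score per subject by test-prep status, plus the difference."""
    # Transpose so that subjects are rows and prep status is the columns.
    means = frame.groupby("test_prep_course")[SUBJECTS].mean().T
    means["gap"] = means["completed"] - means["none"]
    return means


# Hypothetical demo data; with the real file, call prep_gap(df).
demo = pd.DataFrame({
    "test_prep_course": ["completed", "completed", "none", "none"],
    "math": [70, 80, 60, 70],
    "reading": [75, 85, 65, 75],
    "writing": [80, 90, 60, 70],
})

gap_table = prep_gap(demo)
print(gap_table)
```

For the real data, the reading means reported above (73.89 and 66.53) correspond to a gap of roughly 7.4 points. A gap like this describes an association between taking the course and scoring higher; on its own it does not show that the course caused the difference.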
Q2: What are the average scores for each parental education level?
df.groupby('parent_education_level')[['math', 'reading', 'writing']].mean().style.background_gradient(cmap='RdYlGn_r')
| parent_education_level | math | reading | writing |
|---|---|---|---|
| associate's degree | 67.882883 | 70.927928 | 69.896396 |
| bachelor's degree | 69.389831 | 73.000000 | 73.381356 |
| high school | 62.137755 | 64.704082 | 62.448980 |
| master's degree | 69.745763 | 75.372881 | 75.677966 |
| some college | 67.128319 | 69.460177 | 68.840708 |
| some high school | 63.497207 | 66.938547 | 64.888268 |
# x = df.groupby('parent_education_level')[['math', 'reading', 'writing']].mean()
# x.style.apply(lambda m: ["background: red" if i == m.argmax() else '' for i,_ in enumerate(m)])
df.groupby('parent_education_level')[['math', 'reading', 'writing']].mean().plot(kind='bar')
plt.legend(loc='upper left', bbox_to_anchor=(1, 1));
avg_scores = df.groupby('parent_education_level')[['math', 'reading', 'writing']].mean()
fig, ax = plt.subplots()
sns.pointplot(data = avg_scores, x = avg_scores.index, y = 'math',label='Math')
sns.pointplot(data = avg_scores, x = avg_scores.index, y = 'reading', label='Reading',color='r')
sns.pointplot(data = avg_scores, x = avg_scores.index, y = 'writing', label='Writing',color='g')
ax.legend(handles=ax.lines[::len(avg_scores)+1], labels=["Math","Reading","Writing"])
# ax.set_xticklabels([t.get_text().split("T")[0] for t in ax.get_xticklabels()])
# plt.gcf().autofmt_xdate()
plt.xticks(rotation=45);
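- Overall, the higher the parents' education level, the higher the students' average scores. The exception is the "high school" group, which scores lower than the "some high school" group in every subject.
- Students whose parents hold a master's degree have the highest average in every subject.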
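The trend is easier to see when the education levels are sorted by attainment rather than alphabetically. The sketch below does this by reindexing the averages from the table above (rounded to two decimals). The ordering in `EDUCATION_ORDER` and the names `table_means`, `ordered`, and `steps` are assumptions made for illustration.

```python
import pandas as pd

# Assumed ordering from lowest to highest attainment.
EDUCATION_ORDER = [
    "some high school",
    "high school",
    "some college",
    "associate's degree",
    "bachelor's degree",
    "master's degree",
]

# Averages copied from the groupby table above, rounded to two decimals.
table_means = pd.DataFrame(
    {
        "math": [67.88, 69.39, 62.14, 69.75, 67.13, 63.50],
        "reading": [70.93, 73.00, 64.70, 75.37, 69.46, 66.94],
        "writing": [69.90, 73.38, 62.45, 75.68, 68.84, 64.89],
    },
    index=pd.Index(
        [
            "associate's degree",
            "bachelor's degree",
            "high school",
            "master's degree",
            "some college",
            "some high school",
        ],
        name="parent_education_level",
    ),
)

# Sort rows by attainment, then look at the change from one level to the next.
ordered = table_means.reindex(EDUCATION_ORDER)
steps = ordered.diff().dropna()

print(ordered)
print(steps)
```

In `steps`, only the "high school" row is negative; every later step up in education level comes with higher averages in all three subjects. In the notebook itself, `avg_scores.reindex(EDUCATION_ORDER)` would apply the same ordering before plotting.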
Q3: Compare the average scores of students with/without the test preparation course across parental education levels.
cols = ['math', 'reading', 'writing']
fig, axes = plt.subplots(3, 1, figsize=(10, 6), sharex=True, gridspec_kw={'hspace': 0.5})
for i, col in enumerate(cols):
    sns.boxplot(x='parent_education_level', y=col, hue='test_prep_course', data=df, ax=axes[i])
    axes[i].set_title(col.capitalize() + ' Scores')
    #axes[i].set_xlabel('Parent Education Level')
    axes[i].set_ylabel(col.capitalize() + ' Score')
    axes[i].set_xticklabels(axes[i].get_xticklabels(), rotation