Predicting Student Grades with Python


I. Posing the Question

1. Which factors affect students' exam grades?

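II. Understanding the Data

0. Collecting the data

The data file used is xAPI-Edu-Data.csv.

  • Data description

    gender: student's gender
    NationalITy: nationality
    PlaceofBirth: place of birth
    StageID: educational stage
    GradeID: grade level
    SectionID: classroom section
    Topic: course topic
    Semester: school semester
    Relation: parent responsible for the student
    raisedhands: number of times the student raised a hand
    VisITedResources: number of times the student visited course resources
    AnnouncementsView: number of times the student viewed announcements
    Discussion: number of times the student took part in discussion groups
    ParentAnsweringSurvey: whether the parent answered the surveys provided by the school
    ParentschoolSatisfaction: parent's satisfaction with the school
    StudentAbsenceDays: number of days the student was absent (Above-7 or Under-7)
    Class: grade band (L, M or H)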
    

1. Importing the data

  • Code

    # 1. Import the data
    import pandas as pd
    sp = pd.read_csv("xAPI-Edu-Data.csv")  # load the dataset into a DataFrame
    

2. Inspecting the dataset

2.1 Dataset size
  • Code

    print("Dataset size:", sp.shape)
    
    
  • Result

    Dataset size: (480, 17)
    
2.2 Column data types and missing values
  • Code

    # 2.2 Column data types and missing values
    print(sp.info())
    
  • Result

    RangeIndex: 480 entries, 0 to 479
    Data columns (total 17 columns):
    gender                      480 non-null object
    NationalITy                 480 non-null object
    PlaceofBirth                480 non-null object
    StageID                     480 non-null object
    GradeID                     480 non-null object
    SectionID                   480 non-null object
    Topic                       480 non-null object
    Semester                    480 non-null object
    Relation                    480 non-null object
    raisedhands                 480 non-null int64
    VisITedResources            480 non-null int64
    AnnouncementsView           480 non-null int64
    Discussion                  480 non-null int64
    ParentAnsweringSurvey       480 non-null object
    ParentschoolSatisfaction    480 non-null object
    StudentAbsenceDays          480 non-null object
    Class                       480 non-null object
    dtypes: int64(4), object(13)
    memory usage: 63.8+ KB
    None
    
2.3 Summary statistics
  • Code

    # 2.3 Summary statistics
    sp.describe()  # covers the numeric columns only
    
  • Result

           raisedhands  VisITedResources  AnnouncementsView  Discussion
    count   480.000000        480.000000         480.000000  480.000000
    mean     46.775000         54.797917          37.918750   43.283333
    std      30.779223         33.080007          26.611244   27.637735
    min       0.000000          0.000000           0.000000    1.000000
    25%      15.750000         20.000000          14.000000   20.000000
    50%      50.000000         65.000000          33.000000   39.000000
    75%      75.000000         84.000000          58.000000   70.000000
    max     100.000000         99.000000          98.000000   99.000000

  • Analysis
    <1> The dataset has no missing values.
    <2> Most columns are categorical (13 of 17).
    <3> The numeric columns show no obvious outliers.

3. Data preprocessing (checking for missing values and outliers)
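
A minimal sketch of the two checks named in this heading, assuming a hypothetical helper `quality_report` and a small made-up frame standing in for `sp`. The 1.5 × IQR rule used for outliers is one common convention, not something the article specifies.

```python
import pandas as pd

def quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Count missing values per column and IQR outliers per numeric column."""
    report = pd.DataFrame({"missing": df.isnull().sum()})
    numeric = df.select_dtypes(include="number")
    q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
    iqr = q3 - q1
    # A value is flagged when it lies more than 1.5 * IQR outside [Q1, Q3].
    outside = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
    # Non-numeric columns get NaN here, since the rule does not apply to them.
    report["iqr_outliers"] = outside.sum()
    return report

# Made-up stand-in for `sp`; only the column names follow the dataset.
demo = pd.DataFrame({
    "raisedhands": [10, 12, 15, 11, 13, 95],
    "Discussion": [20, 25, 30, 35, 40, 45],
    "gender": ["M", "F", "M", "F", None, "M"],
})
report = quality_report(demo)
print(report)
```

On the real data the same call would be `quality_report(sp)`; the analysis in 2.3 suggests it would report no missing values.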

4. Correlation analysis

4.1 Univariate analysis
4.1.1 Categorical variables
  • Code

    # 4.1.1 Categorical variables
    import matplotlib.pyplot as plt
    %matplotlib inline

    # (column, chart title) pairs, drawn in this order on a 6x3 grid
    pie_columns = [
        ('gender', 'Gender'),
        ('NationalITy', 'Nationality'),
        ('PlaceofBirth', 'Place of birth'),
        ('StageID', 'Educational stage'),
        ('GradeID', 'Grade'),
        ('Relation', 'Responsible parent'),
        ('raisedhands', 'Raised hands'),
        ('SectionID', 'Section'),
        ('Topic', 'Course topic'),
        ('Semester', 'Semester'),
        ('ParentAnsweringSurvey', 'Parent answered school survey'),
        ('ParentschoolSatisfaction', 'Parent satisfaction with school'),
    ]
    fig1 = plt.figure(facecolor='white', figsize=(20, 40))
    for i, (col, title) in enumerate(pie_columns, start=1):
        counts = sp[col].value_counts()
        plt.subplot(6, 3, i)
        # Label each slice with its own category: counts.index is in the
        # same order as the values in counts.
        plt.pie(counts, labels=counts.index, autopct='%1.1f%%', startangle=0)
        plt.axis('equal')
        plt.title(title)
    
  • Result
    [Figure: pie charts of the categorical variables]

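  • Analysis
    <1> Boys outnumber girls. Kuwait and Jordan together account for about 73% of nationalities, and birthplaces follow a similar pattern. Middle school is the largest stage, followed by lower level, with high school the smallest. Most students are in grades 2, 7 and 8, and sections A and B are the largest.
    <2> Topics: IT has the most students, followed by French, and History has the fewest; IT has roughly five times as many students as History.
    <3> The father is the responsible parent for most students, and more than 60% of parents are satisfied with the school.
    <4> About 60% of students were absent for fewer than seven days. The middle grade band (M) is the most common, followed by high (H) and then low (L).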
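
Shares like the ones quoted above can also be tabulated directly instead of being read off a chart. A small sketch, assuming a hypothetical helper `category_shares` and made-up rows rather than the real data:

```python
import pandas as pd

def category_shares(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return the count and percentage share of each category, largest first."""
    counts = df[column].value_counts()
    out = counts.rename("count").to_frame()
    out["percent"] = (counts / counts.sum() * 100).round(1)
    return out

# Made-up rows; only the column name follows the dataset.
demo = pd.DataFrame({"Relation": ["Father", "Mum", "Father", "Father", "Mum"]})
shares = category_shares(demo, "Relation")
print(shares)
```

With the dataset loaded, `category_shares(sp, 'Relation')` would give the same kind of table for any categorical column.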

4.1.2 Numeric variables
  • Code

    # 4.1.2 Numeric variables
    import matplotlib.pyplot as plt
    import seaborn as sns
    %matplotlib inline
    numeric_columns = ['raisedhands', 'VisITedResources', 'AnnouncementsView', 'Discussion']
    fig, axes = plt.subplots(2, 2)
    fig.set_size_inches(20, 10)
    # histplot with kde=True replaces the deprecated distplot
    for ax, col in zip(axes.flat, numeric_columns):
        sns.histplot(sp[col], kde=True, stat='density', ax=ax)
        ax.set(xlabel=col)
    
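  • Result
    [Figure: distribution plots of raisedhands, VisITedResources, AnnouncementsView and Discussion]

  • Analysis
    <1> Hand raises and resource visits are spread across the whole range: some students reach the maximum (100 hand raises, 99 resource visits), while others never raise a hand or visit a resource.
    <2> Announcement views and discussion participation lean toward lower counts (medians of 33 and 39).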

4.2 Multivariate analysis
4.2.1 Parent survey participation versus grade
  • Code

    # 4.2.1 Parent survey participation versus grade
    sns.set(rc={'figure.figsize':(20,7)})
    sns.countplot(x='ParentAnsweringSurvey',hue='Class',hue_order=['L','M','H'],data=sp)
    
  • Result
    [Figure: count plot of Class by ParentAnsweringSurvey]

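4.2.2 Course topic versus grade
  • Code

    # 4.2.2 Course topic versus grade
    sns.set(rc={'figure.figsize':(20,7)})
    sns.countplot(x='Topic',hue='Class',hue_order=['L','M','H'],data=sp)
    
  • Result
    [Figure: count plot of Class by Topic]

  • Analysis
    <1> Of the 12 topics, 11 have some students in the failing band (L); only in Geology do all students reach the middle band or above.
    <2> IT has the most students, but most of them score in the middle band or below and only a few reach the high band, which suggests that mastery of this topic has room to improve.
    <3> History has the fewest students; most of them are in the middle band, with relatively few in the high or low bands.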

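4.2.3 Section versus grade
  • Code

    # 4.2.3 Section versus grade (SectionID: classroom section)
    sns.set(rc={'figure.figsize':(20,7)})
    sns.countplot(x='SectionID',hue='Class',hue_order=['L','M','H'],data=sp)
    
  • Result
    [Figure: count plot of Class by SectionID]

  • Analysis
    <1> The grade distribution within sections looks as expected: in every section the middle band has the most students.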

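4.2.4 Responsible parent versus grade
  • Code

    # 4.2.4 Responsible parent versus grade
    sns.set(rc={'figure.figsize':(20,7)})
    sns.countplot(x='Relation',hue='Class',hue_order=['L','M','H'],data=sp)
    
  • Result
    [Figure: count plot of Class by Relation]

  • Analysis
    <1> Among students whose responsible parent is the mother, the high band is the most common, followed by the middle band, with few failing; for students whose responsible parent is the father the pattern is reversed.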

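4.2.5 Gender versus grade
  • Code

    # 4.2.5 Gender versus grade
    sns.set(rc={'figure.figsize':(20,7)})
    sns.countplot(x='gender',hue='Class',order=['M','F'],hue_order=['L','M','H'],data=sp)
    
  • Result
    [Figure: count plot of Class by gender]

  • Analysis
    <1> Among boys, the middle band is the most common, followed by the failing band, and only a small share reach the high band. Most girls reach the middle band or above and only a few fail, so girls perform better overall.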
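
Because the groups compared in these count plots differ in size, a row-normalised cross-tabulation can make the comparison easier to read. A sketch on made-up rows (the values below are not taken from the dataset):

```python
import pandas as pd

# Made-up rows; the column names follow the dataset.
demo = pd.DataFrame({
    "gender": ["M", "M", "M", "M", "F", "F", "F", "F"],
    "Class":  ["L", "L", "M", "H", "M", "M", "H", "H"],
})
# normalize="index" makes every row sum to 1, so each cell is the share
# of that gender falling into the given grade band.
table = pd.crosstab(demo["gender"], demo["Class"], normalize="index")
table = table[["L", "M", "H"]]  # same order as hue_order in the plots
print(table.round(2))
```

On the real data the call would be `pd.crosstab(sp['gender'], sp['Class'], normalize='index')`.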

4.2.6 Raised hands versus grade
  • Code

    # 4.2.6 Raised hands versus grade
    sns.set(rc={'figure.figsize':(20,7)})
    sns.countplot(x='raisedhands',hue='Class',hue_order=['L','M','H'],data=sp)
    
  • Result
    [Figure: count plot of Class by raisedhands]

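III. Building a Decision Tree Model

0. Idea behind the algorithm:

A decision tree is a tree structure in which each internal (non-leaf) node represents a test on a feature, each branch represents an outcome of that test over some range of values, and each leaf node stores a class. The core parts of a decision tree model are:

  • It consists of nodes and directed edges.
  • Nodes are of two types: internal nodes and leaf nodes.
  • An internal node represents a feature; a leaf node represents a class.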
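
The structure described above can be sketched in plain Python. This is an illustration only: the tree is built by hand, the feature names come from the dataset, and the threshold and leaf labels are invented rather than learned from the data.

```python
# Hypothetical hand-built tree; thresholds and leaf labels are invented.
tree = {
    "feature": "StudentAbsenceDays",
    "branches": {
        "Above-7": {"label": "L"},
        "Under-7": {
            "feature": "raisedhands",
            "threshold": 50,
            "branches": {"<=": {"label": "M"}, ">": {"label": "H"}},
        },
    },
}

def predict(node, sample):
    """Walk from the root to a leaf and return the class stored there."""
    while "label" not in node:
        value = sample[node["feature"]]
        if "threshold" in node:
            # Numeric test: two branches, at or below / above the threshold.
            key = "<=" if value <= node["threshold"] else ">"
        else:
            # Categorical test: one branch per category value.
            key = value
        node = node["branches"][key]
    return node["label"]

print(predict(tree, {"StudentAbsenceDays": "Under-7", "raisedhands": 80}))
```

Each dictionary with a `feature` key plays the role of an internal node, each entry in `branches` is a directed edge, and each dictionary with a `label` key is a leaf.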

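1. Feature selection and correlation coefficients

1.1 Computing the correlation coefficients between features
  • Code

    # 1.1 Compute the correlation coefficients between features
    s=pd.get_dummies(sp)  # one-hot encode all non-numeric columns
    corrDf = s.corr()     # pairwise correlation coefficients
    corrDf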
    
  • Result

    raisedhands  VisITedResources  AnnouncementsView  Discussion  gender_F  gender_M  NationalITy_Egypt  NationalITy_Iran  NationalITy_Iraq  NationalITy_Jordan  ...  Relation_Mum  ParentAnsweringSurvey_No  ParentAnsweringSurvey_Yes  ParentschoolSatisfaction_Bad  ParentschoolSatisfaction_Good  StudentAbsenceDays_Above-7  StudentAbsenceDays_Under-7  Class_H  Class_L  Class_M
    raisedhands  1.000000  0.691572  0.643918  0.339386  0.149978  -0.149978  -0.024964  -0.094925  0.192443  0.050686  ...  0.364237  -3.165704e-01  3.165704e-01  -0.297015  0.297015  -0.463882  0.463882  0.495681  -0.582997  0.062315
    VisITedResources  0.691572  1.000000  0.594500  0.243292  0.210932  -0.210932  0.005028  -0.081024  0.192773  0.191266  ...  0.360240  -3.824715e-01  3.824715e-01  -0.363835  0.363835  -0.499030  0.499030  0.469735  -0.662061  0.156442
    AnnouncementsView  0.643918  0.594500  1.000000  0.417290  0.052139  -0.052139  0.031622  -0.051854  0.154318  0.126169  ...  0.339505  -3.963565e-01  3.963565e-01  -0.298744  0.298744  -0.312134  0.312134  0.376987  -0.504153  0.101392
    Discussion  0.339386  0.243292  0.417290  1.000000  0.124703  -0.124703  0.010264  -0.003192  0.055484  -0.006725  ...  0.026720  -2.321965e-01  2.321965e-01  -0.061104  0.061104  -0.218778  0.218778  0.243656  -0.270451  0.016300
    gender_F  0.149978  0.210932  0.052139  0.124703  1.000000  -1.000000  -0.040886  -0.046264  -0.041827  0.147061  ...  0.195142  -2.235860e-02  2.235860e-02  -0.093478  0.093478  -0.209011  0.209011  0.220294  -0.218841  -0.008085
    gender_M  -0.149978  -0.210932  -0.052139  -0.124703  -1.000000  1.000000  0.040886  0.046264  0.041827  -0.147061  ...  -0.195142  2.235860e-02  -2.235860e-02  0.093478  -0.093478  0.209011  -0.209011  -0.220294  0.218841  0.008085
    NationalITy_Egypt  -0.024964  0.005028  0.031622  0.010264  -0.040886  0.040886  1.000000  -0.015552  -0.030296  -0.103300  ...  0.072009  3.289629e-02  -3.289629e-02  -0.047985  0.047985  0.044519  -0.044519  -0.022294  0.021544  0.001354
    NationalITy_Iran  -0.094925  -0.081024  -0.051854  -0.003192  -0.046264  0.046264  -0.015552  1.000000  -0.024658  -0.084077  ...  0.058609  1.417478e-02  -1.417478e-02  0.024970  -0.024970  0.061775  -0.061775  -0.072924  0.017535  0.051475
    NationalITy_Iraq  0.192443  0.192773  0.154318  0.055484  -0.041827  0.041827  -0.030296  -0.024658  1.000000  -0.163782  ...  -0.020843  -1.129609e-01  1.129609e-01  -0.175860  0.175860  -0.056056  0.056056  0.163521  -0.131460  -0.033536
    NationalITy_Jordan  0.050686  0.191266  0.126169  -0.006725  0.147061  -0.147061  -0.103300  -0.084077  -0.163782  1.000000  ...  0.197918  4.160100e-02  -4.160100e-02  -0.012164  0.012164  0.040462  -0.040462  0.020149  -0.083802  0.055950
    NationalITy_KW  -0.258474  -0.332290  -0.321603  0.019081  -0.100790  0.100790  -0.106599  -0.086762  -0.169014  -0.576278  ...  -0.293083  4.939254e-02  -4.939254e-02  0.175563  -0.175563  0.077216  -0.077216  -0.160031  0.201578  -0.031989
    NationalITy_Lybia  -0.122978  -0.165005  -0.123098  -0.160083  -0.007305  0.007305  -0.015552  -0.012658  -0.024658  -0.084077  ...  0.134849  1.275730e-01  -1.275730e-01  -0.090276  0.090276  0.138394  -0.138394  -0.072924  0.187574  -0.099644
    NationalITy_Morocco  0.034212  0.007496  0.083903  0.018151  -0.021823  0.021823  -0.012672  -0.010314  -0.020091  -0.068504  ...  0.063283  5.774658e-02  -5.774658e-02  0.067295  -0.067295  -0.027703  0.027703  -0.009205  -0.003031  0.011159
    NationalITy_Palestine  0.281380  0.160496  0.203729  0.087273  -0.040784  0.040784  -0.034405  -0.028002  -0.054549  -0.185994  ...  0.045326  -7.615359e-02  7.615359e-02  -0.199709  0.199709  -0.166017  0.166017  0.072384  -0.149288  0.066115
    NationalITy_SaudiArabia  0.044137  0.010627  0.071171  0.062471  -0.000301  0.000301  -0.021170  -0.017230  -0.033565  -0.114446  ...  -0.099473  -5.086803e-02  5.086803e-02  0.076773  -0.076773  -0.010726  0.010726  0.083759  -0.060297  -0.023434
    NationalITy_Syria  -0.029632  -0.006093  0.001026  0.005046  -0.019934  0.019934  -0.016816  -0.013687  -0.026662  -0.090909  ...  0.004490  -3.722194e-02  3.722194e-02  -0.026406  0.026406  0.007619  -0.007619  -0.002697  0.005828  -0.002699
    NationalITy_Tunis  -0.012716  0.038533  0.020066  -0.118607  -0.093569  0.093569  -0.022135  -0.018016  -0.035095  -0.119662  ...  -0.025092  -6.724750e-03  6.724750e-03  0.062876  -0.062876  -0.021128  0.021128  -0.016080  0.024957  -0.007393
    NationalITy_USA  -0.018082  -0.009526  -0.014469  -0.039189  0.070613  -0.070613  -0.015552  -0.012658  -0.024658  -0.084077  ...  0.020489  -6.142403e-02  6.142403e-02  -0.051861  0.051861  -0.053155  0.053155  0.050328  -0.024974  -0.024085
    NationalITy_lebanon  0.095616  0.048584  0.041714  -0.069329  0.112457  -0.112457  -0.026488  -0.021559  -0.041996  -0.143193  ...  0.092181  -3.266199e-02  3.266199e-02  0.007890  -0.007890  -0.109720  0.109720  0.098063  -0.063829  -0.033449
    NationalITy_venzuela  0.049373  0.048673  0.055141  0.060764  -0.034610  0.034610  -0.006316  -0.005141  -0.010014  -0.034145  ...  0.054764  -4.029582e-02  4.029582e-02  -0.036662  0.036662  -0.037145  0.037145  0.070493  -0.027406  -0.040467
    PlaceofBirth_Egypt  0.018995  0.032916  0.064555  0.034742  -0.008975  0.008975  0.886766  -0.015552  -0.030296  -0.071269  ...  0.103233  3.289629e-02  -3.289629e-02  -0.047985  0.047985  0.013140  -0.013140  0.011357  0.021544  -0.029591
    PlaceofBirth_Iran  -0.094925  -0.081024  -0.051854  -0.003192  -0.046264  0.046264  -0.015552  1.000000  -0.024658  -0.084077  ...  0.058609  1.417478e-02  -1.417478e-02  0.024970  -0.024970  0.061775  -0.061775  -0.072924  0.017535  0.051475
    PlaceofBirth_Iraq  0.192443  0.192773  0.154318  0.055484  -0.041827  0.041827  -0.030296  -0.024658  1.000000  -0.163782  ...  -0.020843  -1.129609e-01  1.129609e-01  -0.175860  0.175860  -0.056056  0.056056  0.163521  -0.131460  -0.033536
    PlaceofBirth_Jordan  0.105820  0.203116  0.140071  -0.046486  0.115271  -0.115271  -0.105179  -0.085606  -0.166762  0.801813  ...  0.182511  -1.509355e-16  1.509355e-16  -0.114548  0.114548  0.008538  -0.008538  0.008841  -0.123163  0.101329
    PlaceofBirth_KuwaIT  -0.265991  -0.339184  -0.330456  0.005455  -0.112877  0.112877  -0.075349  -0.087149  -0.169767  -0.569874  ...  -0.313841  5.421667e-02  -5.421667e-02  0.180729  -0.180729  0.091215  -0.091215  -0.172070  0.208527  -0.027094
    PlaceofBirth_Lybia  -0.122978  -0.165005  -0.123098  -0.160083  -0.007305  0.007305  -0.015552  -0.012658  -0.024658  -0.084077  ...  0.134849  1.275730e-01  -1.275730e-01  -0.090276  0.090276  0.138394  -0.138394  -0.072924  0.187574  -0.099644
    PlaceofBirth_Morocco  0.034212  0.007496  0.083903  0.018151  -0.021823  0.021823  -0.012672  -0.010314  -0.020091  -0.068504  ...  0.063283  5.774658e-02  -5.774658e-02  0.067295  -0.067295  -0.027703  0.027703  -0.009205  -0.003031  0.011159
    PlaceofBirth_Palestine  0.156197  0.136845  0.188103  0.159642  0.010733  -0.010733  -0.020163  -0.016411  -0.031969  0.012675  ...  0.056217  -1.286408e-01  1.286408e-01  -0.117041  0.117041  -0.118582  0.118582  0.097208  -0.087491  -0.011633
    PlaceofBirth_SaudiArabia  0.024384  -0.014318  -0.012530  0.018692  -0.068320  0.068320  -0.025669  -0.020892  -0.040699  0.079066  ...  -0.060557  4.679083e-02  -4.679083e-02  0.088766  -0.088766  -0.008694  0.008694  0.057638  -0.006139  -0.047546
    PlaceofBirth_Syria  -0.008934  0.020549  0.013041  0.021258  -0.007305  0.007305  -0.015552  -0.012658  -0.024658  -0.084077  ...  0.020489  -2.362463e-02  2.362463e-02  -0.013445  0.013445  -0.014845  0.014845  0.009244  -0.024974  0.013695
    ...
    GradeID_G-11  0.062581  0.037916  0.090792  0.074989  0.006944  -0.006944  -0.023063  -0.018772  -0.036567  -0.124682  ...  -0.034842  -1.778657e-02  1.778657e-02  0.023883  -0.023883  -0.030755  0.030755  0.060574  -0.041885  -0.018478
    GradeID_G-12  -0.046877  -0.063946  -0.073901  0.064488  0.115403  -0.115403  -0.021170  -0.017230  -0.033565  -0.114446  ...  0.013739  3.332733e-02  -3.332733e-02  0.019728  -0.019728  0.017718  -0.017718  0.022751  0.034390  -0.051484
    SectionID_A  0.138111  0.057826  0.129205  0.078512  0.042438  -0.042438  -0.040786  -0.058609  0.020843  0.146543  ...  -0.018493  -6.936602e-03  6.936602e-03  -0.068038  0.068038  0.020676  -0.020676  0.002590  -0.037225  0.030701
    SectionID_B  -0.100492  -0.015519  -0.080693  -0.030468  -0.017134  0.017134  0.060260  -0.003445  0.007233  -0.080650  ...  0.030769  -9.368029e-03  9.368029e-03  0.050105  -0.050105  0.004896  -0.004896  0.034459  0.037824  -0.065303
    SectionID_C  -0.082925  -0.086973  -0.103785  -0.099599  -0.052527  0.052527  -0.035692  0.125882  -0.056589  -0.139103  ...  -0.022964  3.253000e-02  -3.253000e-02  0.039672  -0.039672  -0.051652  0.051652  -0.073071  0.001219  0.066110
    Topic_Arabic  -0.039771  -0.019984  0.019045  0.019829  -0.072642  0.072642  -0.051748  -0.042118  0.099999  -0.001875  ...  0.010131  -1.039216e-02  1.039216e-02  0.076585  -0.076585  -0.006184  0.006184  0.021490  0.019988  -0.037526
    Topic_Biology  0.173198  0.169828  0.131911  0.035070  -0.016764  0.016764  -0.035692  -0.029050  0.108034  0.094231  ...  0.029525  -1.951800e-02  1.951800e-02  -0.101385  0.101385  -0.069236  0.069236  0.134356  -0.076826  -0.055272
    Topic_Chemistry  0.080645  0.085290  0.051402  -0.035592  0.064546  -0.064546  -0.031713  -0.025811  0.041139  0.267127  ...  0.041780  -8.671100e-02  8.671100e-02  -0.066583  0.066583  0.047847  -0.047847  0.060736  0.035756  -0.087629
    Topic_English  0.040244  -0.053836  0.005016  0.009384  0.038517  -0.038517  0.008233  -0.036187  -0.002136  0.013042  ...  0.094899  9.094948e-02  -9.094948e-02  0.020134  -0.020134  0.015971  -0.015971  0.057746  -0.030887  -0.025651
    Topic_French  -0.079481  0.081832  -0.061776  -0.240691  0.079722  -0.079722  0.035069  0.065077  -0.028508  0.123272  ...  0.226793  1.050920e-01  -1.050920e-01  -0.068085  0.068085  -0.060512  0.060512  0.010283  -0.016535  0.005239
    Topic_Geology  0.205623  0.186245  0.197393  0.079355  0.024825  -0.024825  -0.031713  -0.025811  0.041139  0.227258  ...  0.041780  -8.671100e-02  8.671100e-02  -0.066583  0.066583  -0.069329  0.069329  -0.023038  -0.137606  0.143480
    Topic_History  0.069250  0.086280  0.164612  0.160461  0.023818  -0.023818  -0.028063  -0.022841  0.057694  0.026554  ...  -0.039053  -7.134743e-02  7.134743e-02  -0.075338  0.075338  0.031426  -0.031426  -0.037945  -0.049102  0.078531
    Topic_IT  -0.254687  -0.290641  -0.361976  -0.015514  -0.028631  0.028631  -0.030117  0.038239  -0.108870  -0.327592  ...  -0.276263  4.611430e-03  -4.611430e-03  0.072754  -0.072754  0.076894  -0.076894  -0.150126  0.152495  0.002524
    Topic_Math  -0.063363  -0.128456  -0.077126  -0.009205  -0.056212  0.056212  0.045525  -0.024065  -0.046879  -0.159842  ...  -0.095636  -2.438189e-02  2.438189e-02  0.037038  -0.037038  0.034205  -0.034205  -0.004742  0.033337  -0.025267
    Topic_Quran  0.055713  0.070980  0.044516  0.007854  0.020267  -0.020267  0.043149  -0.024658  -0.048035  -0.059904  ...  -0.020843  7.530727e-03  -7.530727e-03  0.069051  -0.069051  -0.035703  0.035703  0.032559  0.004046  -0.033536
    Topic_Science  0.064968  0.025636  0.163816  0.168116  0.061891  -0.061891  0.101865  -0.038792  -0.010911  0.080717  ...  0.028432  -1.788691e-02  1.788691e-02  -0.027354  0.027354  0.037379  -0.037379  0.013516  -0.053546  0.035159
    Topic_Spanish  -0.029389  -0.015874  -0.020799  -0.097155  -0.158075  0.158075  -0.032402  0.142411  -0.051374  -0.136060  ...  -0.004964  2.008181e-02  -2.008181e-02  0.080836  -0.080836  -0.037314  0.037314  -0.049218  0.029448  0.019087
    Semester_F  -0.178358  -0.173219  -0.287066  -0.019083  0.049156  -0.049156  -0.079693  -0.039856  -0.004567  -0.041646  ...  -0.148705  2.362791e-02  -2.362791e-02  -0.025258  0.025258  0.072462  -0.072462  -0.095686  0.115048  -0.014257
    Semester_S  0.178358  0.173219  0.287066  0.019083  -0.049156  0.049156  0.079693  0.039856  0.004567  0.041646  ...  0.148705  -2.362791e-02  2.362791e-02  0.025258  -0.025258  -0.072462  0.072462  0.095686  -0.115048  0.014257
    Relation_Father  -0.364237  -0.360240  -0.339505  -0.026720  -0.195142  0.195142  -0.072009  -0.058609  0.020843  -0.197918  ...  -1.000000  1.638105e-01  -1.638105e-01  0.287698  -0.287698  0.219687  -0.219687  -0.387138  0.279615  0.107497
    Relation_Mum  0.364237  0.360240  0.339505  0.026720  0.195142  -0.195142  0.072009  0.058609  -0.020843  0.197918  ...  1.000000  -1.638105e-01  1.638105e-01  -0.287698  0.287698  -0.219687  0.219687  0.387138  -0.279615  -0.107497
    ParentAnsweringSurvey_No  -0.316570  -0.382472  -0.396357  -0.232197  -0.022359  0.022359  0.032896  0.014175  -0.112961  0.041601  ...  -0.163811  1.000000e+00  -1.000000e+00  0.539875  -0.539875  0.261152  -0.261152  -0.313993  0.413547  -0.078795
    ParentAnsweringSurvey_Yes  0.316570  0.382472  0.396357  0.232197  0.022359  -0.022359  -0.032896  -0.014175  0.112961  -0.041601  ...  0.163811  -1.000000e+00  1.000000e+00  -0.539875  0.539875  -0.261152  0.261152  0.313993  -0.413547  0.078795
    ParentschoolSatisfaction_Bad  -0.297015  -0.363835  -0.298744  -0.061104  -0.093478  0.093478  -0.047985  0.024970  -0.175860  -0.012164  ...  -0.287698  5.398748e-01  -5.398748e-01  1.000000  -1.000000  0.228385  -0.228385  -0.295654  0.331473  -0.022716
    ParentschoolSatisfaction_Good  0.297015  0.363835  0.298744  0.061104  0.093478  -0.093478  0.047985  -0.024970  0.175860  0.012164  ...  0.287698  -5.398748e-01  5.398748e-01  -1.000000  1.000000  -0.228385  0.228385  0.295654  -0.331473  0.022716
    StudentAbsenceDays_Above-7  -0.463882  -0.499030  -0.312134  -0.218778  -0.209011  0.209011  0.044519  0.061775  -0.056056  0.040462  ...  -0.219687  2.611518e-01  -2.611518e-01  0.228385  -0.228385  1.000000  -1.000000  -0.489629  0.631674  -0.111142
    StudentAbsenceDays_Under-7  0.463882  0.499030  0.312134  0.218778  0.209011  -0.209011  -0.044519  -0.061775  0.056056  -0.040462  ...  0.219687  -2.611518e-01  2.611518e-01  -0.228385  0.228385  -1.000000  1.000000  0.489629  -0.631674  0.111142
    Class_H  0.495681  0.469735  0.376987  0.243656  0.220294  -0.220294  -0.022294  -0.072924  0.163521  0.020149  ...  0.387138  -3.139929e-01  3.139929e-01  -0.295654  0.295654  -0.489629  0.489629  1.000000  -0.388777  -0.574052
    Class_L  -0.582997  -0.662061  -0.504153  -0.270451  -0.218841  0.218841  0.021544  0.017535  -0.131460  -0.083802  ...  -0.279615  4.135474e-01  -4.135474e-01  0.331473  -0.331473  0.631674  -0.631674  -0.388777  1.000000  -0.531226
    Class_M  0.062315  0.156442  0.101392  0.016300  -0.008085  0.008085  0.001354  0.051475  -0.033536  0.055950  ...  -0.107497  -7.879499e-02  7.879499e-02  -0.022716  0.022716  -0.111142  0.111142  -0.574052  -0.531226  1.000000

    75 rows × 75 columns
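
Each entry in this matrix is an ordinary Pearson coefficient computed on 0/1 indicator columns. A pure-Python sketch of that calculation follows. The class counts used (H = 142, M = 211, L = 127) are an assumption, not stated in the article; they were chosen because they come out close to the Class_H / Class_L entry of -0.388777 in the table.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Assumed class counts (not given in the article): 142 H, 211 M, 127 L.
labels = ["H"] * 142 + ["M"] * 211 + ["L"] * 127
class_h = [1 if c == "H" else 0 for c in labels]  # like the Class_H dummy
class_l = [1 if c == "L" else 0 for c in labels]  # like the Class_L dummy
r = pearson(class_h, class_l)
print(round(r, 6))
```

Two dummies derived from the same column are always negatively correlated, because a row can belong to only one category. That is why pairs such as gender_F / gender_M show exactly -1 in the table.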

1.2 Correlation of each feature with the top grade band ('Class_H')
  • Code

    # 1.2 Correlation of each feature with the top grade band ('Class_H');
    # ascending=False sorts in descending order
    corrDf['Class_H'].sort_values(ascending=False)
    
  • Result

    Class_H                          1.000000
    raisedhands                      0.495681
    StudentAbsenceDays_Under-7       0.489629
    VisITedResources                 0.469735
    Relation_Mum                     0.387138
    AnnouncementsView                0.376987
    ParentAnsweringSurvey_Yes        0.313993
    ParentschoolSatisfaction_Good    0.295654
    Discussion                       0.243656
    gender_F                         0.220294
    NationalITy_Iraq                 0.163521
    PlaceofBirth_Iraq                0.163521
    Topic_Biology                    0.134356
    PlaceofBirth_lebanon             0.125929
    NationalITy_lebanon              0.098063
    PlaceofBirth_Palestine           0.097208
    Semester_S                       0.095686
    NationalITy_SaudiArabia          0.083759
    GradeID_G-06                     0.082955
    NationalITy_Palestine            0.072384
    NationalITy_venzuela             0.070493
    PlaceofBirth_venzuela            0.070493
    Topic_Chemistry                  0.060736
    GradeID_G-11                     0.060574
    Topic_English                    0.057746
    PlaceofBirth_SaudiArabia         0.057638
    NationalITy_USA                  0.050328
    SectionID_B                      0.034459
    Topic_Quran                      0.032559
    PlaceofBirth_USA                 0.032209
                                       ...   
    PlaceofBirth_Morocco            -0.009205
    NationalITy_Morocco             -0.009205
    GradeID_G-10                    -0.009205
    GradeID_G-07                    -0.009845
    GradeID_G-08                    -0.014039
    NationalITy_Tunis               -0.016080
    NationalITy_Egypt               -0.022294
    Topic_Geology                   -0.023038
    GradeID_G-02                    -0.024633
    StageID_lowerlevel              -0.035864
    Topic_History                   -0.037945
    Topic_Spanish                   -0.049218
    GradeID_G-05                    -0.051403
    GradeID_G-09                    -0.066500
    PlaceofBirth_Lybia              -0.072924
    NationalITy_Lybia               -0.072924
    PlaceofBirth_Iran               -0.072924
    NationalITy_Iran                -0.072924
    SectionID_C                     -0.073071
    Semester_F                      -0.095686
    Topic_IT                        -0.150126
    NationalITy_KW                  -0.160031
    PlaceofBirth_KuwaIT             -0.172070
    gender_M                        -0.220294
    ParentschoolSatisfaction_Bad    -0.295654
    ParentAnsweringSurvey_No        -0.313993
    Relation_Father                 -0.387138
    Class_L                         -0.388777
    StudentAbsenceDays_Above-7      -0.489629
    Class_M                         -0.574052
    Name: Class_H, Length: 75, dtype: float64
    
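  • Analysis

    Judging by the correlation coefficients with the top grade band (Class_H), the dummies derived from GradeID, StageID, Semester, SectionID and PlaceofBirth are only weakly correlated with it, while most of the remaining variables show a clearer, though at most moderate, correlation (|r| up to about 0.5). We therefore drop GradeID, StageID, Semester, SectionID and PlaceofBirth and use the remaining variables as the model's input features.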

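1.3 Dropping the weakly related variables and building the model inputs
  • Code

    # 1.3 Drop the weakly related variables and build the model inputs
    x=sp.drop(['Class','StageID','GradeID','Semester','SectionID','PlaceofBirth'],axis=1)
    y=sp['Class']
    # One-hot encode the non-numeric columns; dtype=int keeps 0/1 columns
    # on pandas 2.x, where the default dummy dtype is bool
    x=pd.get_dummies(x,dtype=int)
    x.head()
    
  • Result

    raisedhands  VisITedResources  AnnouncementsView  Discussion  gender_F  gender_M  NationalITy_Egypt  NationalITy_Iran  NationalITy_Iraq  NationalITy_Jordan  ...  Topic_Science  Topic_Spanish  Relation_Father  Relation_Mum  ParentAnsweringSurvey_No  ParentAnsweringSurvey_Yes  ParentschoolSatisfaction_Bad  ParentschoolSatisfaction_Good  StudentAbsenceDays_Above-7  StudentAbsenceDays_Under-7
    0  15  16  2  20  0  1  0  0  0  0  ...  0  0  1  0  0  1  0  1  0  1
    1  20  20  3  25  0  1  0  0  0  0  ...  0  0  1  0  0  1  0  1  0  1
    2  10  7  0  30  0  1  0  0  0  0  ...  0  0  1  0  1  0  1  0  1  0
    3  30  25  5  35  0  1  0  0  0  0  ...  0  0  1  0  1  0  1  0  1  0
    4  40  50  12  50  0  1  0  0  0  0  ...  0  0  1  0  1  0  1  0  1  0

    5 rows × 40 columns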
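
The 40 columns arise because numeric columns pass through unchanged while each categorical column is expanded into one indicator column per category. A small sketch on made-up rows (three columns only, not the real data):

```python
import pandas as pd

# Made-up rows; the column names follow the dataset.
demo = pd.DataFrame({
    "raisedhands": [15, 20, 10],
    "gender": ["M", "M", "F"],
    "Relation": ["Father", "Mum", "Father"],
})
# Numeric columns are kept as they are; every categorical column becomes
# one 0/1 column per category, named <column>_<category>.
encoded = pd.get_dummies(demo, dtype=int)
print(encoded)
```

Here one numeric column plus two categories each for `gender` and `Relation` gives five columns, in the same way that the four numeric columns plus all category dummies give 40 above.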

IV. Predicting Grades

1. Splitting into training and test sets

  • Code

    # 1. Split into training and test sets (80% / 20%)
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import LabelEncoder
    from sklearn.metrics import accuracy_score
    x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=10)
    

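2. Training the model and checking accuracy

  • Code

    # 2. Fit a logistic regression classifier and check its accuracy on the test set
    lr=LogisticRegression(max_iter=10000)
    lr.fit(x_train,y_train)
    predict_y=lr.predict(x_test)
    print('predict',predict_y)
    score=accuracy_score(y_test,predict_y)
    score
    
  • Result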

    predict ['H' 'M' 'M' 'L' 'H' 'H' 'H' 'H' 'M' 'L' 'M' 'M' 'H' 'M' 'M' 'H' 'L' 'H'
     'M' 'M' 'L' 'M' 'M' 'M' 'M' 'L' 'H' 'M' 'M' 'H' 'M' 'H' 'L' 'M' 'M' 'L'
     'M' 'M' 'L' 'H' 'M' 'L' 'H' 'L' 'M' 'M' 'H' 'M' 'M' 'H' 'M' 'L' 'H' 'M'
     'M' 'M' 'H' 'L' 'M' 'H' 'H' 'L' 'H' 'M' 'M' 'M' 'M' 'H' 'L' 'H' 'M' 'L'
     'M' 'H' 'M' 'H' 'M' 'L' 'M' 'H' 'L' 'H' 'L' 'L' 'M' 'H' 'L' 'H' 'H' 'M'
     'M' 'L' 'M' 'M' 'H' 'M']
    
    0.7395833333333334
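
Accuracy is simply the fraction of test samples whose predicted label equals the true label. With 96 test samples (20% of 480), the score above corresponds to 71 correct predictions. A minimal pure-Python sketch with a hypothetical helper `accuracy` and made-up labels:

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Made-up labels for illustration; not the article's test set.
y_true = ["H", "M", "L", "M", "H", "L", "M", "M"]
y_pred = ["H", "M", "M", "M", "H", "L", "H", "M"]
print(accuracy(y_true, y_pred))

# 71 correct out of 96 gives the score reported above.
print(71 / 96)
```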
    